<div dir="ltr">Thank you Eric.  Have read, am pondering, and welcome other people to weigh in.<div><br></div><div>..m</div></div><br><div class="gmail_quote"><div dir="ltr">On Tue, Jun 28, 2016 at 8:30 PM Eric S. Raymond <<a href="mailto:esr@thyrsus.com">esr@thyrsus.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">In recent discussion of the removal of memlock, Hal Murray said<br>

"Consider ntpd running on an old system that is mostly lightly loaded<br>

and doesn't have a lot of memory."<br>

<br>

By doing this, he caused me to realize that I have not been explicit<br>

about some of the assumptions behind my technical strategy.  I'm now<br>

going to try to remedy that.  This should have one of three results:<br>

<br>

(a) We all develop a meeting of the minds.<br>

<br>

(b) Somebody gives me technical reasons to change those assumptions.<br>

<br>

(c) Mark tells me there are political/marketing reasons to change them.<br>

<br>

So here goes...<br>

<br>

One of the very first decisions we made early last year was to code to a<br>

modern API - full POSIX and C99. This was only partly a move for ensuring<br>

portability; mainly I wanted a principled reason (one we could give potential<br>

users and allies) for ditching all the cruft in the codebase from the big-iron<br>

era.<br>

<br>

Even then I had clearly in mind the idea that the most effective<br>

attack we could make on the security and assurance problem was to<br>

ditch as much weight as possible.  Hence the project motto: "Perfection<br>

is achieved, not when there is nothing more to add, but when there is<br>

nothing left to take away."<br>

<br>

There is certainly a sense in which my ignorance of the codebase and<br>

application domain forced this approach on me.  What else *could* I<br>

have done but prune and refactor, using software-engineering skills<br>

relatively independent of the problem domain, until I understood enough<br>

to do something else?<br>

<br>

And note that we really only reached "understood enough" last week<br>

when I did magic-number elimination and the new refclock directive.<br>

It took a year because *it took a year!*.  (My failure to deliver<br>

TESTFRAME so far has to be understood as trying for too much in the<br>

absence of sufficient acquired knowledge.)<br>

<br>

But I also had from the beginning reasons for believing, or at least<br>

betting, that the most drastic possible reduction in attack surface<br>

would have been the right path to better security even if the state of<br>

my knowledge had allowed alternatives. C. A. R. Hoare: "There are two<br>

ways of constructing a software design: One way is to make it so<br>

simple that there are obviously no deficiencies, and the other way is<br>

to make it so complicated that there are no obvious deficiencies."<br>

<br>

So, simplify simplify simplify and cut cut cut...<br>

<br>

I went all-in on this strategy.  Thus the constant code excisions over<br>

the last year and the relative lack of attention to NTP Classic bug<br>

reports. I did so knowing that there were these associated risks: (1)<br>

I'd cut something I shouldn't, actual function that a lot of potential<br>

customers really needed, or (2) the code had intrinsic flaws that would<br>

make it impossible to secure even with as much reduction in attack surface<br>

and internal complexity as I could engineer, or (3) my skills and intuition<br>

simply weren't up to the job of cutting everything that needed to be cut<br>

without causing horrible, subtle breakage in the process.<br>

<br>

(OK, I didn't actually worry that much about 3 compared to 1 and 2 - I<br>

know how good I am. But any prudent person would have to give it a<br>

nonzero probability. I figured Case 1 was probably manageable with good<br>

version-control practice.  Case 2 was the one that made me lose some<br>

sleep.)<br>

<br>

This bet could have failed.  It could have been the a priori *right*<br>

bet on the odds and still failed because the Dread God Finagle<br>

pissed in our soup. The success of the project at its declared<br>

objectives was riding on it. And for most of the last year that was a<br>

constant worry in the back of my mind.  *What if I was wrong?* What if I<br>

was like the drunk in that old joke, looking for his keys under the<br>

streetlamp when he's dropped them two darkened streets over because<br>

"Offisher, this is where I can see".<br>

<br>

It didn't really help with that worry that I didn't know *anyone* I<br>

was sure I'd give better odds at succeeding at this strategy than<br>

me. Keith Packard, maybe.  Poul-Henning Kamp, maybe, if he'd give up<br>

timed for the effort, which he wouldn't. Recently I learned that Steve<br>

Summit might have been a good bet. But some problems are just too<br>

hard, and this codebase was *gnarly*.  Might be any of us would have<br>

failed.<br>

<br>

And then...and then, earlier this year, CVEs started issuing that we<br>

dodged because I had cut out their freaking attack surface before we<br>

knew there was a bug!  This actually became a regular thing, with the<br>

percentage of dodged bullets increasing over time.<br>

<br>

Personally, this came as a vast and unutterable relief. But,<br>

entertaining narrative hooks aside, this was reality rewarding my<br>

primary strategy for the project.<br>

<br>

So, when I make technical decisions about how to fix problems, one of<br>

the main biases I bring in is favoring whatever path will allow me<br>

to cut the most code.<br>

<br>

On small excisions (like removing memory locking, or yet another<br>

ancient refclock driver) I'm willing to trade a nonzero risk that<br>

removing code will break some marginal use cases, in part because I am<br>

reasonably confident of my ability to revert said small excisions. We<br>

remove it, someone yells, I revert it, no problem.<br>

<br>

So don't think I'm being casual when I do this. What I'm really doing<br>

is exploiting how good modern version control is.  The kind of tools<br>

we now have for spelunking code histories give us options we didn't<br>

have in elder days. Though of course there's a limit to this sort of<br>

thing.  It would be impractical to restore mode 7 at this point.<br>

<br>

Now let's talk about hardware spread and why, pace Hal, I don't really<br>

care about old, low-memory systems and am willing to accept a fairly high<br>

risk of breaking on them in order to cut out complexity.<br>

<br>

The key word here is "old".  I do care a lot about *new* low-memory<br>

systems, like the RasPis in the test farm. GPSD taught me to always<br>

keep an eye towards the embedded space, and I have found that the<br>

resulting pressure to do things in lean and simple ways is valuable<br>

even when designing and implementing for larger systems.<br>

<br>

So what's the difference?  There are a couple of relevant ones.  One<br>

is that new "low-memory" systems are actually pretty unconstrained<br>

compared to the old ones, memory-wise.  The difference between (say) a<br>

386 and the ARM 7 in my Pis or the Snapdragon in my smartphone is<br>

vast, and the worst-case working set of ntpd is pretty piddling stuff<br>

by modern standards.  Looking at the output of size(1) and thinking<br>

about the size of struct peer my guess was that it would be running<br>

with about 0.8MB of RAM, and top(1) on one of my Pis seems to confirm<br>

this.<br>
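
<br>

The back-of-envelope estimate described above can be sketched roughly as<br>
follows: take the static image size that size(1) reports and add the<br>
per-peer state.  This is a minimal illustration only.  The function name<br>
and every number in it are hypothetical placeholders, not measurements<br>
of ntpd or of struct peer.<br>

```python
def estimate_working_set(static_bytes: int, peer_bytes: int, n_peers: int) -> int:
    """Rough working-set estimate in bytes.

    static_bytes: text + data + bss, as a tool like size(1) would report
    peer_bytes:   size of one per-peer record
    n_peers:      number of peers the daemon is tracking
    """
    return static_bytes + peer_bytes * n_peers


# Hypothetical inputs, chosen only to make the arithmetic concrete.
STATIC_BYTES = 600 * 1024   # assumed static image size
PEER_BYTES = 2 * 1024       # assumed size of one peer record
N_PEERS = 10                # assumed number of configured peers

estimate = estimate_working_set(STATIC_BYTES, PEER_BYTES, N_PEERS)
print(f"estimated working set: {estimate / (1024 * 1024):.2f} MB")
```

The point of the sketch is the shape of the calculation: the per-peer<br>
term is linear in the number of peers, so a realistic peer count does<br>
not change the order of magnitude.<br>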

<br>

Another is that disk access is orders of magnitude faster than it<br>

used to be, and ubiquitous SSDs are making it faster yet.  Many<br>

of the new embedded systems (see: smartphones) don't have spinning<br>

rust at all.<br>

<br>

What this means in design terms is that with one single exception,<br>

old-school hacks to constrain memory usage, stack size, volume<br>

of filesystem usage, and so forth - all those made sense on<br>

those old systems but are almost dead weight even on something<br>

as low-end as a Pi.  The one exception is that if you have an<br>

algorithmic flaw that causes your data set to grow without bound<br>

you're screwed either way.<br>
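
<br>

To make that one exception concrete, here is a minimal sketch of a table<br>
whose size stays capped no matter how many distinct keys arrive.  It is a<br>
hypothetical illustration of bounding a data set by evicting the least<br>
recently used entry.  The class and method names are invented for this<br>
example and are not taken from the ntpd source.<br>

```python
from collections import OrderedDict


class BoundedTable:
    """Keyed store that evicts its least recently used entry once full."""

    def __init__(self, max_entries: int) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._entries: OrderedDict = OrderedDict()

    def record(self, key, value) -> None:
        # Re-recording an existing key marks it as most recently used.
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = value
        # Drop the oldest entry so the table never exceeds its cap.
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)

    def __len__(self) -> int:
        return len(self._entries)

    def __contains__(self, key) -> bool:
        return key in self._entries


table = BoundedTable(max_entries=2)
for name in ("a", "b", "c"):
    table.record(name, 1)
print(len(table), "a" in table, "c" in table)
```

With a cap like this the memory cost is fixed by max_entries rather<br>
than by the input, which is the property that matters on small and<br>
large machines alike.<br>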

<br>

But aside from that, the second that resource management becomes a<br>

complexity and defect source, it should be dumped.  This extends from<br>

dropping mlockall() all the way up to using a GC-enabled language like<br>

Python rather than C whenever possible.  Not for nothing am I planning<br>

to at some point scrap ntpq in C to redo it in Python.<br>

<br>

Now, as to *why* I don't care about old low-power systems - it's<br>

because the only people who are going to run time service on them are<br>

a minority of hobbyists.  A minority, I say, because going forward<br>

most of the hobbyists interested in that end of things are going to be<br>

on Pis or Beaglebones or ODroids so they can have modern toolchains,<br>

thank you.<br>

<br>

Let's get real, here.  The users we're really chasing are large data<br>

centers and cloud services, because that's where the money (and<br>

potential funding) is.  As long as we don't make algorithmic mistakes<br>

that blow up our big-O, memory and I/O are not going to be performance<br>

problems for their class of hardware in about any conceivable<br>

scenario.<br>

<br>

Here's what this means to me: if I can buy a complexity reduction (and<br>

thus a security gain) by worrying less about how the resulting code<br>

will perform on machines from before the 64-bit transition of 2007-2008,<br>

you damn betcha I will do it and sleep the sleep of the just that<br>

night.<br>

<br>

When all is said and done, we could outright *break* on hardware that<br>

old and I wouldn't care much. Unless somebody is paying us to care and<br>

I get a cut, in which case I will cheerfully haul out my shovels and<br>

rakes and implements of destruction and fix it, and odds are high<br>

we'll end up with better code than we inherited.<br>

<br>

Yeah, it's nice to squeeze performance out of old hardware, and it's<br>

functional to be sparing of resources.  But when everything in both<br>

our security objectives and our experience says "cut more code" I'm<br>

going to put that first.<br>

<br>

This is how I will proceed until someone persuades me otherwise or<br>

our PM directs me otherwise.<br>

--<br>

                <a href="http://www.catb.org/~esr/" rel="noreferrer" target="_blank">Eric S. Raymond</a><br>

<br>

_______________________________________________<br>

devel mailing list<br>

<a href="mailto:devel@ntpsec.org" target="_blank">devel@ntpsec.org</a><br>

<a href="http://lists.ntpsec.org/mailman/listinfo/devel" rel="noreferrer" target="_blank">http://lists.ntpsec.org/mailman/listinfo/devel</a><br>

</blockquote></div>