Technical strategy and performance

Eric S. Raymond esr at thyrsus.com
Wed Jun 29 03:30:13 UTC 2016


In recent discussion of the removal of memlock, Hal Murray said
"Consider ntpd running on an old system that is mostly lightly loaded
and doesn't have a lot of memory."

By saying this, he made me realize that I have not been explicit
about some of the assumptions behind my technical strategy.  I'm now
going to try to remedy that.  This should have one of three results:

(a) We all develop a meeting of the minds.

(b) Somebody gives me technical reasons to change those assumptions.

(c) Mark tells me there are political/marketing reasons to change them.

So here goes...

One of the very first decisions we made early last year was to code to a
modern API - full POSIX and C99. This was only partly a move to ensure
portability; mainly I wanted a principled reason (one we could give potential
users and allies) for ditching all the cruft in the codebase from the big-iron
era.

Even then I had clearly in mind the idea that the most effective
attack we could make on the security and assurance problem was to
ditch as much weight as possible.  Hence the project motto: "Perfection
is achieved, not when there is nothing more to add, but when there is
nothing left to take away."

There is certainly a sense in which my ignorance of the codebase and
application domain forced this approach on me.  What else *could* I
have done but prune and refactor, using software-engineering skills
relatively independent of the problem domain, until I understood enough
to do something else?

And note that we really only reached "understood enough" last week
when I did magic-number elimination and the new refclock directive.
It took a year because *it took a year!*  (My failure to deliver
TESTFRAME so far has to be understood as trying for too much in the
absence of sufficient acquired knowledge.)

But I also had from the beginning reasons for believing, or at least
betting, that the most drastic possible reduction in attack surface
would have been the right path to better security even if the state of
my knowledge had allowed alternatives. C. A. R. Hoare: "There are two
ways of constructing a software design: One way is to make it so
simple that there are obviously no deficiencies, and the other way is
to make it so complicated that there are no obvious deficiencies."

So, simplify simplify simplify and cut cut cut...

I went all-in on this strategy.  Thus the constant code excisions over
the last year and the relative lack of attention to NTP Classic bug
reports. I did so knowing that there were these associated risks: (1)
I'd cut something I shouldn't, actual function that a lot of potential
customers really needed, or (2) the code had intrinsic flaws that would
make it impossible to secure even with as much reduction in attack surface
and internal complexity as I could engineer, or (3) my skills and intuition
simply weren't up to the job of cutting everything that needed to be cut 
without causing horrible, subtle breakage in the process.

(OK, I didn't actually worry that much about 3 compared to 1 and 2 - I
know how good I am. But any prudent person would have to give it a
nonzero probability. I figured Case 1 was probably manageable with good
version-control practice.  Case 2 was the one that made me lose some
sleep.)

This bet could have failed.  It could have been the a priori *right*
bet on the odds and still failed because the Dread God Finagle
pissed in our soup. The success of the project at its declared
objectives was riding on it. And for most of the last year that was a
constant worry in the back of my mind.  *What if I was wrong?* What if
I was like the drunk in that old joke, looking for his keys under the
streetlamp when he'd dropped them two darkened streets over because
"Offisher, this is where I can see".

It didn't really help with that worry that I didn't know *anyone* to
whom I'd confidently have given better odds than me of succeeding at
this strategy. Keith Packard, maybe.  Poul-Henning Kamp, maybe, if he'd give up
timed for the effort, which he wouldn't. Recently I learned that Steve
Summit might have been a good bet. But some problems are just too
hard, and this codebase was *gnarly*.  Might be any of us would have
failed.

And then...and then, earlier this year, CVEs started issuing that we
dodged because I had cut out their freaking attack surface before we
knew there was a bug!  This actually became a regular thing, with the
percentage of dodged bullets increasing over time.

Personally, this came as a vast and unutterable relief. But,
entertaining narrative hooks aside, this was reality rewarding my
primary strategy for the project.

So, when I make technical decisions about how to fix problems, one of 
the main biases I bring in is favoring whatever path will allow me 
to cut the most code.  

On small excisions (like removing memory locking, or yet another
ancient refclock driver) I'm willing to accept a nonzero risk that
removing code will break some marginal use cases, in part because I am
reasonably confident of my ability to revert said small excisions. We
remove it, someone yells, I revert it, no problem.

So don't think I'm being casual when I do this. What I'm really doing
is exploiting how good modern version control is.  The kind of tools
we now have for spelunking code histories give us options we didn't
have in elder days. Though of course there's a limit to this sort of
thing.  It would be impractical to restore mode 7 at this point.

Now let's talk about hardware spread and why, pace Hal, I don't really
care about old, low-memory systems and am willing to accept a fairly high 
risk of breaking on them in order to cut out complexity.

The key word here is "old".  I do care a lot about *new* low-memory
systems, like the RasPis in the test farm. GPSD taught me to always
keep an eye on the embedded space, and I have found that the
resulting pressure to do things in lean and simple ways is valuable
even when designing and implementing for larger systems.

So what's the difference?  There are a couple of relevant ones.  One
is that new "low-memory" systems are actually pretty unconstrained
compared to the old ones, memory-wise.  The difference between (say) a
386 and the ARM 7 in my Pis or the Snapdragon in my smartphone is
vast, and the worst-case working set of ntpd is pretty piddling stuff
by modern standards.  Looking at the output of size(1) and thinking
about the size of struct peer, my guess was that it would be running
in about 0.8MB of RAM, and top(1) on one of my Pis seems to confirm
this (see the back-of-envelope sketch below).

Another is that disk access is orders of magnitude faster than it
used to be, and ubiquitous SSDs are making it faster yet.  Many
of the new embedded systems (see: smartphones) don't have spinning
rust at all.
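
For concreteness, the memory claim comes down to back-of-envelope
arithmetic like this - a minimal sketch in which the struct-peer size,
the association count, and the static-segment total are all assumptions
for illustration, not measurements from the tree:

    # Illustrative only: all three numbers below are assumptions,
    # not measurements from the ntpsec tree.
    PEER_STRUCT_BYTES = 600         # rough guess at sizeof(struct peer)
    ASSOCIATIONS = 20               # generous count of servers/peers/pool entries
    STATIC_SEGMENTS = 500 * 1024    # rough text+data+bss total from size(1)

    working_set = STATIC_SEGMENTS + ASSOCIATIONS * PEER_STRUCT_BYTES
    print("~%.2f MB" % (working_set / 1048576.0))   # prints ~0.50 MB

Even if every one of those guesses is off by a factor of two or three,
the result is noise next to the gigabyte of RAM on a current Pi.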

What this means in design terms is that, with one single exception,
the old-school hacks to constrain memory usage, stack size, volume
of filesystem usage, and so forth - all of which made sense on
those old systems - are almost dead weight even on something
as low-end as a Pi.  The one exception is that if you have an
algorithmic flaw that causes your data set to grow without bound,
you're screwed either way.

But aside from that, the second that resource management becomes a
source of complexity and defects, it should be dumped.  This extends
from dropping mlockall() all the way up to using a GC-enabled language
like Python rather than C whenever possible.  Not for nothing am I
planning, at some point, to scrap the C ntpq and redo it in Python.
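
To make the ntpq point concrete, here is a rough sketch of what the
core of such a tool looks like in Python - a bare mode-6 READVAR query
with the packet layout simplified, fragmented responses ignored, and
no error handling.  Treat it as an illustration of how naturally the
job maps onto a GC'd language, not as the planned implementation:

    import socket
    import struct

    def readvar(host="localhost", assoc=0, timeout=2.0):
        # NTP mode-6 (control) request: a 12-byte header, no payload.
        # 0x26 = leap 0, version 4, mode 6; opcode 2 = READVAR; then
        # sequence, status, association ID, offset, and count fields.
        pkt = struct.pack(">BBHHHHH", 0x26, 0x02, 1, 0, assoc, 0, 0)
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.settimeout(timeout)
            s.sendto(pkt, (host, 123))
            data, _ = s.recvfrom(1024)
        # Variable data follows the 12-byte response header as ASCII
        # "name=value" pairs - the same text ntpq prints for "rv".
        return data[12:].decode("ascii", errors="replace")

    if __name__ == "__main__":
        print(readvar())

No malloc, no fixed-size buffers to overflow; the whole wire-format
dance collapses into one struct.pack() call.  That is exactly the
complexity trade I'm talking about.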

Now, as to *why* I don't care about old low-power systems - it's
because the only people who are going to run time service on them are
a minority of hobbyists.  A minority, I say, because going forward
most of the hobbyists interested in that end of things are going to be
on Pis or Beaglebones or ODroids so they can have modern toolchains,
thank you.

Let's get real, here.  The users we're really chasing are large data
centers and cloud services, because that's where the money (and
potential funding) is.  As long as we don't make algorithmic mistakes
that blow up our big-O, memory and I/O are not going to be performance
problems for their class of hardware in just about any conceivable
scenario.

Here's what this means to me: if I can buy a complexity reduction (and
thus a security gain) by worrying less about how the resulting code
will perform on machines from before the 64-bit transition of 2007-2008,
you damn betcha I will do it and sleep the sleep of the just that
night.

When all is said and done, we could outright *break* on hardware that
old and I wouldn't care much. Unless somebody is paying us to care and
I get a cut, in which case I will cheerfully haul out my shovels and
rakes and implements of destruction and fix it, and odds are high
we'll end up with better code than we inherited.

Yeah, it's nice to squeeze performance out of old hardware, and it's
functional to be sparing of resources.  But when everything in both
our security objectives and our experience says "cut more code" I'm
going to put that first.

This is how I will proceed until someone persuades me otherwise or
our PM directs me otherwise.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>
 

