Linux Journal article on NTPsec

Sun Aug 21 20:15:46 UTC 2016

This will be published in Linux Journal, probably in October, possibly
as the cover story. They asked me for cover concepts: I suggested either
Salvador Dali's "The Persistence of Memory" (the one with the melting clocks)
or the famous silent-film image of Harold Lloyd hanging from the hands
of a tower clock.

Mark has already reviewed it and suggested one very minor change. There's
still time for corrections.  Daniel, if you have harder numbers for
how we've dodged CVEs, that 'graph could be updated.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>
-------------- next part --------------
= NTPsec - a secure, hardened NTP implementation =

// This file is marked up in asciidoc

== Introduction ==

Network time synchronization - aligning your computer's clock to the
same Coordinated Universal Time (UTC) that everyone else is using - is
both necessary and a hard problem.  Many Internet protocols rely on
being able to exchange UTC timestamps accurate to small tolerances,
but the clock crystal in your computer drifts (its frequency varies by
temperature) so it needs occasional adjustments.

That's where life gets complicated.  Sure, you can get another
computer to tell you what time it thinks it is, but if you don't know
how long that packet took to get to you the report isn't very useful.
On top of that, its clock might be broken. Or lying.

To get anywhere, you need to exchange packets with several computers
that allow you to compare your notion of UTC with theirs, estimate
network delays, apply statistical cluster analysis to the resulting
inputs to get a plausible approximation of real UTC, and then adjust
your local clock to it. Generally speaking you can get sustained
accuracy on the close order of 10 milliseconds this way, though
asymmetrical routing delays can make it much worse if you're in a bad
neighborhood of the Internet.
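
The delay-estimation step can be made concrete with a minimal sketch. It uses the standard four-timestamp calculation from the published NTP specification, not code taken from NTPsec itself. The struct and function names are hypothetical, and plain doubles stand in for the fixed-point timestamps a real implementation uses.

```c
/* Four timestamps from one client/server exchange, in seconds.
 * t1: client transmit    t2: server receive
 * t3: server transmit    t4: client receive
 * (Hypothetical illustration type, not an NTPsec structure.) */
struct ntp_exchange {
    double t1, t2, t3, t4;
};

/* Round-trip network delay: total time elapsed at the client minus
 * the time the server spent holding the packet. */
static double ntp_delay(const struct ntp_exchange *x)
{
    return (x->t4 - x->t1) - (x->t3 - x->t2);
}

/* Estimated offset of the server clock relative to the client clock.
 * This assumes the outbound and return paths take equal time. */
static double ntp_offset(const struct ntp_exchange *x)
{
    return ((x->t2 - x->t1) + (x->t3 - x->t4)) / 2.0;
}
```

The offset formula is exact only when both directions take the same time. If they do not, the estimate is off by half the difference between the two one-way delays, which is why asymmetrical routing hurts accuracy.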

The protocol for doing this is called "NTP", Network Time Protocol,
and the original implementation was written near the dawn of Internet
time by an eccentric genius named Dave Mills.  Legend has it that
Dr. Mills was the person who got a kid named Vint Cerf interested in
this ARPANET thing. Whether that's true or not, for decades Mills was
*the* go-to guy for computers and high-precision time measurement.

Eventually, though, Dave Mills semi-retired, then retired completely.
His implementation (which we now call "NTP Classic") was left in the
hands of the Network Time Foundation and Harlan Stenn, the man
Information Week feted as "Father Time" in 2015 <<FT>>. Unfortunately,
on NTF's watch some serious problems accumulated. By that year the
codebase was already more than a quarter-century old, and techniques
that had been state-of-the-art when it was first built were showing
their age.  The code had become rigid and difficult to modify, a
problem exacerbated by the fact that very few people actually
understood the Byzantine time-synchronization algorithms at its core.

Among the real-world symptoms of these problems were serious security
issues.  That same year of 2015, infosec researchers began to realize
that NTP Classic installations were being routinely used as DDoS
amplifiers - ways for crackers to packet-lash target sites by remote
control.  NTF, which had complained for years of being underbudgeted
and understaffed, seemed unable to fix these bugs.

This is intended to be a technical article, so I'm going to pass
lightly over the political and fundraising complications that
ensued. There was, alas, a certain amount of drama.  When the dust
finally settled, a very reluctant fork of the Mills implementation had
been performed in early June 2015 and named 'NTPsec', I had been funded
on an effectively full-time basis by the Linux Foundation to be
NTPsec's architect/tech-lead, and we had both the nucleus of a capable
development team and some serious challenges.

This much about the drama I will say because it is technically
relevant: One of NTF's major problems was that though NTP Classic was
nominally under an open-source license, NTF retained pre-open-source
habits of mind.  Development was closed and secretive, technically and
socially isolated by NTF's determination to keep using the BitKeeper
version-control system.  One of our mandates from the Linux Foundation
was to fix this, and one of our first serious challenges was simply
moving the code history to git.

This is never trivial for a codebase as large and old as NTP Classic,
and it's especially problematic when the old version-control system is
proprietary with code you can't touch.  I ended up having to heavily
revise Andrew Tridgell's sourcepuller utility - yes, the same code
that triggered Linus Torvalds's famous public break with BitKeeper back
in '05 - to do part of the work.  The rest was tedious and difficult
hand-patching with reposurgeon <<RS>>. A year later in May 2016 - far
too late to be helpful - BitKeeper went open-source.

== Strategy and challenges ==

Getting a clean history conversion to git took ten weeks and,
gruelling as that was, it was only the beginning. I had a problem: I was
expected to harden and secure the NTP code, but came in knowing very
little about time service and even less about security engineering.
I'd picked up a few clues about the former from my work leading GPSD
<<GPSD>>, which is widely used for time service. About the latter, I
had some basics about how to harden code - because when you get right
down to it, *that* kind of security engineering is a special case of
reliability engineering, which I *do* understand.  But I had no
experience at "adversarial mindset", the kind of active defense that
good infosec people do, nor any instinct for it.

A way forward came to me when I remembered a famous quote by
C. A. R. Hoare: "There are two ways of constructing a software design:
One way is to make it so simple that there are obviously no
deficiencies, and the other way is to make it so complicated that
there are no obvious deficiencies."  A slightly different angle on
this was the perhaps better-known aphorism by St.-Exupéry that
I was to adopt as NTPsec's motto: "Perfection is achieved, not when
there is nothing more to add, but when there is nothing left to take
away."

In the language of modern infosec, Hoare was talking about reducing
attack surface, global complexity, and the scope for unintended
interactions leading to exploitable holes. This was bracing, because
it suggested that maybe I didn't actually need to learn to think like
an infosec specialist or a time service expert. If I could refactor,
cut, and simplify the NTP Classic codebase enough, maybe all those
domain-specific problems would come out in the wash. And if not, then
at least taking the pure software-engineering approach I was
comfortable with might buy me enough time to learn the domain-specific
things I needed to know.

I went all-in on this strategy. It drove my argument for one of the
very first decisions we made, which was to code to a fully modern API
- pure POSIX and C99. This was only partly a move for ensuring portability;
mainly I wanted a principled reason (one we could give potential users
and allies) for ditching all the cruft in the codebase from the
big-iron Unix era.

And there was a *lot* of that.  The code was snarled with portability
#ifdefs and shims for a dozen ancient Unix systems: SunOS, AT&T System
V, HP-UX, UNICOS, DEC OSF/1, Dynix, AIX, and others more obscure. All
relics from the days before API standardization really took hold. The
NTP Classic people were too terrified of offending their legacy
customers to remove any of this stuff, but I knew something they
apparently didn't.  Back around 2006 I had done a cruft-removal pass
over GPSD, pulling it up to pretty strict POSIX conformance - and
nobody from GPSD's highly varied userbase ever said boo about it or
told me they missed the ancient portability shims at all. Thus, what I
had in my pocket was nine years of subsequent GPSD field experience
telling me that the standards people had won their game without most
Unix systems programmers actually capturing all the implications of
that victory.

So I decrufted the NTP code *ruthlessly*. Sometimes I had to fight my
own reflexes in order to do it.  I too have long been part of the
culture that says "Oh, leave in that old portability shim, you never
know, there just *might* still be a VAX running ISC/5 out there, and
it's not doing any harm."

But when your principal concern is reducing complexity and attack
surface, that thinking is wrong. No individual piece of obsolete code
costs very much, but in a codebase as aged as NTP Classic the
cumulative burden on readability and maintainability becomes massive
and paralyzing.  You have to be hard about this; it all has to go, or
exceptions will pile up on you and you'll never achieve the mission
objective.

I'm emphasizing this point because I think much of what landed NTP
Classic in trouble was not want of skill but a continuing failure of
what one might call surgical courage - the kind of confidence and
determination it takes to make that first incision, knowing that
you're likely to have to make a bloody mess on the way to fixing
what's actually wrong.  Software systems architects working on legacy
infrastructure code need this quality almost as much as surgeons do.

The same applies to superannuated features.  The NTP Classic codebase
was full of dead ends, false starts, failed experiments, drivers for
obsolete clock hardware, and other code that might have been a good
idea once but had long outlived the assumptions behind it.  Mode 7
control messages.  Interleave mode.  Autokey.  An SNMP daemon that was
never conformant to the published standard and never finished. Half a
dozen other smaller warts. Some of these (Mode 7 handling and Autokey
especially) were major attractors for security defects.

As with the port shims, these lingered in the NTP Classic codebase
not because they couldn't have been removed, but because NTF cherished
compatibility back to the year zero and had an allergic reaction to
the thought of removing any features at all. 

Then there were the incidental problems, the largest of which was
Classic's build system.  It was a huge, crumbling, buggy,
poorly-documented pile of autoconf macrology. One of the things that
jumped out at me when I studied NTF's part of the code history was
that in recent years they seemed to spend as much or more effort
fighting defects in their build system as they did modifying code.

But there was one amazingly good thing about the NTP Classic code:
that despite all these problems it *still worked*.  It wheezed and
clanked and was rife with incidental security holes, but it did the
job it was supposed to do. When all was said and done and all the
problems admitted, Dave Mills had been a brilliant systems architect
and, even groaning under the weight of decades of unfortunate
accretions, NTP Classic still functioned.

Thus, the big bet on Hoare's advice at the heart of our technical
strategy unpacked to two assumptions: (a) that beneath the cruft and
barnacles the NTP Classic codebase was fundamentally sound, and (b)
that it would be practically possible to clean it up without breaking
that soundness.

Neither assumption was trivial.  This could have been the a priori
*right* bet on the odds and still failed because the Dread God Finagle
and his mad prophet Murphy micturated in our soup. Or, the code left
after we scraped off the barnacles could actually turn out to be
unsound, fundamentally flawed.

Nevertheless, the success of the team and the project at its declared
objectives was riding on these premises. Through 2015 and early 2016
that was a constant worry in the back of my mind.  *What if I was
wrong?* What if I was like the drunk in that old joke, looking for his
keys under the streetlamp when he's dropped them two darkened streets
over because "Offisher, this is where I can see".

The final verdict is not quite in on that question; as I write, NTPsec
is still in beta.  But, as we shall see, there are now (in August
2016) solid indications that the project is on the right track.

== Stripping down, cleaning up ==

One of our team's earliest victories after getting the code history
moved to git was throwing out the autoconf build recipe and replacing
it with one written in a new-school build engine called waf (also used
by Samba and RTEMS). Builds became *much* faster and more reliable.
Just as importantly, this made the build recipe an order of magnitude
smaller so it could be comprehended as a whole and maintained.

Another early focus was cleaning up and updating the NTP documentation.
We did this before most of the code modifications because the research
required to get it done was an excellent way to build knowledge about
what was actually going on in the codebase.

These moves began a virtuous cycle. With the build recipe no longer a
buggy and opaque mess, the code could be modified more rapidly and
with more confidence. Each bit of cruft removal lowered the total
complexity of the codebase, making the next one slightly easier.

Testing was pretty ad-hoc at first. Around May 2016, for reasons not
originally related to NTPsec, I became interested in Raspberry Pis.
Then it occurred to me that they would make an excellent way to run
long-term stability tests on NTPsec builds.  Thus it came to be that
the windowsill above my home-office desk is now home to six headless
Raspberry Pis, all equipped with on-board GPSes, all running stability
and correctness tests on NTPsec 24/7. Just as good as a conventional
rack full of servers, but far less bulky and expensive!

We got a lot done over our first eighteen months.  The headline number
that shows just how much was the change in the codebase's total size.
We went from 227KLOC to 88KLOC, cutting the total line count by almost
a factor of three.

Dramatic as that sounds, it actually understates the attack-surface
reduction we achieved, because complexity was not evenly distributed
in the codebase.  The worst technical debt, and the security holes,
tended to lurk in the obsolete and semi-obsolete code that hadn't
gotten any developer attention in a long time.  NTP Classic was not
exceptional in this; I've seen the same pattern in other large, old
codebases I've worked on.

Another important measure was systematically hunting down and replacing
all unsafe C function calls with equivalents that can provably not cause
buffer overruns.  I'll quote from NTPsec's hacking guide:

------------------------------------------------------------------------
* strcpy, strncpy, strcat:  Use strlcpy and strlcat instead.
* sprintf, vsprintf: use snprintf and vsnprintf instead.
* In scanf and friends, the %s format without length limit is banned.
* strtok: use strtok_r() or unroll this into the obvious loop.
* gets: Use fgets instead. 
* gmtime(), localtime(), asctime(), ctime(): use the reentrant *_r variants.
* tmpnam() - use mkstemp() or tmpfile() instead.
* dirname() - the Linux version is re-entrant but this property is not portable.
------------------------------------------------------------------------

This formalized an approach I'd used successfully on GPSD - instead of
fixing defects and security holes after the fact, constrain your code
so that it *cannot have* entire classes of defects.
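
A minimal sketch of what a few of the replacements listed above look like in practice. The helper names are hypothetical and are not taken from the NTPsec source. Only standard POSIX calls are used, and strlcpy is left out of the sketch because not every C library ships it.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Instead of sprintf: format "name=value" into buf. snprintf never
 * writes past buflen; we report truncation rather than ignore it. */
static bool format_pair(char *buf, size_t buflen,
                        const char *name, const char *value)
{
    int n = snprintf(buf, buflen, "%s=%s", name, value);
    return n >= 0 && (size_t)n < buflen;
}

/* Instead of strtok: strtok_r keeps its scan state in a pointer the
 * caller owns, not in hidden static storage. The line is modified
 * in place. */
static int count_tokens(char *line)
{
    char *save = NULL;
    int count = 0;
    for (char *tok = strtok_r(line, " \t\n", &save); tok != NULL;
         tok = strtok_r(NULL, " \t\n", &save))
        count++;
    return count;
}

/* Instead of gmtime: gmtime_r fills a struct tm supplied by the
 * caller, so no shared static buffer is involved. */
static bool format_utc(time_t when, char *buf, size_t buflen)
{
    struct tm tm;
    if (gmtime_r(&when, &tm) == NULL)
        return false;
    return strftime(buf, buflen, "%Y-%m-%dT%H:%M:%SZ", &tm) != 0;
}
```

In each case the caller states the size of the buffer or owns the state. An overrun then requires passing a wrong length, rather than merely receiving longer input than expected.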

The experienced C programmers out there are thinking "What about
wild-pointer and wild-index problems?" And it's true that the achtung
verboten above will not prevent those kinds of overruns. That's why another
prong of the strategy was systematic use of static code analyzers like
Coverity, which actually is pretty good at picking up the defects that
cause that sort of thing. That's not 100% perfect, since C will always allow you to
shoot yourself in the foot, but I knew from prior success with GPSD
that the combination of careful coding with automatic defect scanning
can reduce your bug load a very great deal.

To help defect scanners do a better job, we enriched the type
information in the code.  The largest single change of this kind was
changing int variables to C99 bools everywhere they were being used as
booleans.
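
A small sketch of the int-to-bool change, using a made-up status structure and flag helpers (none of these names come from the NTPsec source). The stratum limit of 16 is the NTP specification's value for "unsynchronized".

```c
#include <stdbool.h>

/* Hypothetical peer record. Fields that are really yes/no answers
 * are declared bool; a genuine number stays an int. */
struct peer_status {
    bool reachable;      /* was: int reachable */
    bool authenticated;  /* was: int authenticated */
    int  stratum;
};

static bool peer_usable(const struct peer_status *p)
{
    return p->reachable && p->authenticated && p->stratum < 16;
}

/* Old style: returns the masked bits themselves (0x100, say), so a
 * caller comparing the result against 1 gets the wrong answer. */
static int flag_is_set_int(unsigned flags, unsigned mask)
{
    return (int)(flags & mask);
}

/* C99 style: the result is exactly true or false. */
static bool flag_is_set(unsigned flags, unsigned mask)
{
    return (flags & mask) != 0;
}
```

With the type spelled out, both a human reader and a static analyzer can tell when a count or an error code is being passed where a truth value was intended.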

Little things also mattered, like fixing all compiler warnings. I
thought it was shockingly sloppy that the NTP Classic maintainers
hadn't done this. The pattern detectors behind those warnings are
there because they often point at real defects. Also, voluminous
warnings make it too easy to miss actual errors that break your
build. And you never want to break your build, because later on that
will make bisection testing more difficult.

An early sign that this systematic defect-prevention approach was
working was the extremely low rate of bugs we detected by testing as
having been introduced during our cleanup. In the first fourteen
months we averaged less than one iatrogenic C bug every ninety days.

I would have had a lot of trouble believing that if GPSD hadn't posted
a defect frequency nearly as low over the previous five years.  A
major lesson from both projects is that applying best practices in
coding and testing really works.  I pushed this point back in 2012 in
my essay on GPSD for 'The Architecture of Open Source Applications, Volume 2' <<AOS2>>;
what NTPsec shows is that GPSD is not a fluke.

I think this is one of the most important takeaways from both
projects.  We really don't have to settle for what have historically
been considered "normal" defect rates in C code.  Modern tools and
practices can go a very long way towards driving those defect rates
towards zero.  It's no longer even very difficult to do the right
thing; what's too often missing is a grasp of the possibility and the
determination to pursue it.

And here's the real payoff. Early in 2016, CVEs (security alerts)
started issuing against NTP Classic that NTPsec dodged because we had
already cut out their attack surface before we knew there was a bug!  This
actually became a regular thing, with the percentage of dodged bullets
increasing over time.  Somewhere, Hoare and St.-Exupéry might
be smiling.

The cleanup isn't done yet.  We're testing a major refactoring and
simplification of the central protocol machine for processing NTP
packets. We believe this has already revealed a significant number of
potential security defects nobody ever had a clue about before. Every
one of these will be another dodged bullet attributable to getting our 
practice and strategic direction right.

== Features? What features? ==

I have yet to mention new features because NTPsec doesn't have many;
that's not where our energy has been going.  But here's one that
came directly out of the cleanup work...

When NTP was originally written, computer clocks only delivered
microsecond precision. Now they deliver nanosecond precision (though
not all of that precision is accurate). By changing some internal
representations we have made NTPsec able to use the full precision of
modern clocks when stepping them, which can improve accuracy by a
factor of 10 or more with real hardware such as GPSDOs and dedicated
time radios.

Fixing this was about a four-line patch. It might have been noticed
sooner if the code hadn't been using an uneasy mixture of microsecond and
nanosecond precision for historical reasons. As it is, anything short 
of the kind of systematic API-usage update we were doing would have
been quite unlikely to spot the problem.
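
To illustrate the class of problem, and not the actual patch, here is a sketch of what happens when a nanosecond timestamp is passed through the older microsecond representation. The function names are hypothetical. POSIX struct timespec carries nanoseconds, while struct timeval carries only microseconds.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/time.h>
#include <time.h>

/* Lossy path: squeezing a nanosecond timestamp into a struct timeval
 * discards everything below one microsecond. */
static struct timeval to_timeval(struct timespec ts)
{
    struct timeval tv;
    tv.tv_sec = ts.tv_sec;
    tv.tv_usec = (suseconds_t)(ts.tv_nsec / 1000);
    return tv;
}

/* Nanoseconds thrown away by the conversion above; always 0..999. */
static long precision_lost_ns(struct timespec ts)
{
    struct timeval tv = to_timeval(ts);
    return ts.tv_nsec - (long)tv.tv_usec * 1000L;
}
```

A single conversion like this anywhere along the path from measurement to clock adjustment caps the result at microsecond resolution, however precise the rest of the code is.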

A longstanding pain point we've begun to address is the
nigh-impenetrable syntax of the ntp.conf file. We've already
implemented a new syntax for declaring reference clocks that is far
easier to understand than the old.  We have more work planned towards
making composing NTP configurations less of a black art.

The diagnostic tools shipped with NTP Classic were messy,
undocumented, and archaic.  We have a new tool, ntpviz, which gives
time-server operators a graphical and much more informative view of
what's been going on in the server logfiles.  This will assist in
understanding and mitigating various sources of inaccuracy.

== Where we go from here ==

We don't think our 1.0 release is far in the future - in fact, given
normal publication delays, it might well have shipped by the time you
read this. An early-adopter contingent - including at least one
high-frequency-trading company for which accurate time is
business-critical - is already happily using NTPsec for production.

There remains much work to be done after 1.0.  We're cooperating
closely with IETF to develop a replacement for Autokey public-key
authentication that actually works. We want to move as much of the C
code as possible outside ntpd itself to Python in order to reduce
long-term maintenance load.  There's a possibility that the core
daemon itself might be split in two to separate the TCP/IP parts from
the handling of local reference clocks, drastically reducing global
complexity.

Beyond that, we're gaining insight into the core time-synchronization
algorithms and suspect there are real possibilities for improvement in
those. Better statistical filtering that's sensitive to measurements
of network weather and topology looks possible.

It's an adventure, and we welcome anyone who'd like to join in.
NTP is vital infrastructure, and keeping it healthy over a time-frame
of decades will need a large, flourishing community.  You can learn
more about how to take part at our project website <<NTPSEC>>.

== References ==

[bibliography]

[[[FT]]] http://www.informationweek.com/it-life/ntps-fate-hinges-on-father-time/d/d-id/1319432[NTP's Fate Hinges On "Father Time"]

[[[RS]]] http://www.catb.org/esr/reposurgeon/[reposurgeon]

[[[GPSD]]] http://catb.org/gpsd/[GPSD]

[[[AOS2]]] http://www.aosabook.org/en/gpsd.html[GPSD in AOS2]

[[[NTPSEC]]] https://www.ntpsec.org/[Welcome to NTPsec]