Resuming the great cleanup

Sun May 27 16:31:08 UTC 2018

This note attempts to clarify the next set of architectural challenges
in the NTPsec code.  It is particularly intended for Mark Atwood and
Daniel Franke because some of our main goals are tied into supporting NTS.

= OBJECTIVES =

Here are the objectives for the next round of work:

* NTS: Support Daniel's NTS symbiont.

* SINGLESOCK: Move from one socket per interface to a single socket
listening on all interfaces.

* EVENTS: Move ntpd to an event- and alarm-driven architecture, eliminating
the tick-per-second processing.

* GOPREP: Clear the path to moving the codebase to Go. We haven't committed
to doing this yet, but the odds on that happening someday look high enough
that I think it is good to already be factoring it into our planning.

Now I'll talk about motivation.

NTS is (obviously) a big deal because it would put us well ahead of
other time-service implementations in an area that a lot of data
centers have serious concerns about.

There are four different reasons for SINGLESOCK:

1. I want to simplify the ntp_io.c code.  I consider the present
implementation dangerously ugly and complicated - it's the last really
gnarly place in the code after the big cleanup.  Our worst remaining
platform dependencies are largely concentrated in the fancy dances
required to iterate across interfaces and handle routing sockets.
Getting rid of all that would be good.

2. Doing SINGLESOCK would make EVENTS easier.  That's because after
singlesock all the inbound traffic will arrive at one endpoint (one
epoll call) rather than at a variable number of endpoints - one per
interface - as it does now. That will make it much easier to write and
verify a main event loop that handles alarms as well.

3. Going to NTS will move us from paying 1 UDP socket per interface to
paying that plus 1 TCP socket per client association - which is potentially
a very large number. Daniel argues that this means we want a single epoll
on all sockets to avoid scaling badly.

In general I think it's unwise to make architectural changes for
performance/scaling reasons without measurements proving that the
bottleneck is (a) a problem, and (b) where we think it is. But the
code-cleanliness arguments for SINGLESOCK and EVENTS are pretty strong,
so we might as well collect whatever scaling benefits we get from them
and call that good.

4. Returning to code simplification, every simplification helps with
GOPREP.  This one more so than most because it will dramatically
reduce the spread of platform-dependent code paths we have to map to
Go if and when it comes time to do that.

Reasons for EVENTS:

1. Mark wants to reduce power consumption for deployment in mobile and
embedded systems. I understand how this is compromised by tick-per-second
and agree with the goal.

2. Daniel says he wants this to make NTS easier.  I don't know exactly what it
will do for him that SINGLESOCK doesn't.  Perhaps he'll explain in a followup.

Reasons for GOPREP:

Go means (1) no more buffer-overrun attacks, ever, (2) no more memory-leak
issues, ever, (3) *vast* code simplification (LOC might easily drop 50%), (4)
greatly improved maintainability of what remains.

= OBSTACLES =

Next, I'm going to describe obstacles. Some of these are quite formidable.

NTS: I'm not going to try to scope this yet. I don't understand NTS
well enough.  We'll need a complexity estimate from Daniel. He does say most
of the code will run in a symbiont.

SINGLESOCK:  While messy and somewhat difficult, this is mostly a SMOP (Simple
Matter of Programming). There is one potential technical risk, relatively
minor I think.

The reason for iterating over interfaces is that ntpd has the capability to block
incoming packets by interface of origin. In order to go to a single epoll we
either need to (a) abandon this feature, or (b) find a way to ask, for each
received packet, which interface it arrived on.

There was some previous discussion of (a).  We dropped the idea because I
argued that lopping off working features on non-obsolete hardware without
a strong security reason was too likely to anger an important contingent
of grognard time admins.

What I didn't know at the time - didn't find out until months later -
is that there is, in fact, a way to do (b). It's a socket option called
IP_PKTINFO, the IPv4 counterpart of the IPV6_PKTINFO option in the
RFC 3542 Advanced Sockets API, that didn't exist when the main body of
the NTP code was written.  It is supposed to be portable to FreeBSD at
least, and there is example code here:

https://stackoverflow.com/questions/603577/how-to-tell-which-interface-the-socket-received-the-message-from

The risk is related to the fact that this is a *really* obscure and
seldom-used feature.  It's difficult to tell from the sockets API
documentation that it even exists - I made several serious tries at
finding something equivalent and failed before this.

Because this feature is so recherche it has probably not been tested a
lot, and I do not have maximum confidence in its robustness or actual
portability.  However, "interface filtering fails because your OS
upstream fooed up" is survivable in a way that "we dropped it because
it was inconvenient" might not be.

There is, however, an easy way to check this.  We put in the IP_PKTINFO
query *before* changing the interface-iteration code, check each
incoming packet, and log a problem if the result from the query
doesn't match the interface we know it came in on.  We burn this in
on all our Tier 1 platforms and see.
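
A minimal sketch of the query half of that burn-in check, assuming Linux
and an IPv4 UDP socket (IP_PKTINFO as documented in ip(7)).  The helper
names enable_pktinfo and recv_with_ifindex are invented for illustration
and do not exist in ntp_io.c; other platforms may spell the option
differently, which is part of what the burn-in would have to establish.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes struct in_pktinfo on glibc */
#endif
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>

/* Hypothetical helper: ask the kernel to attach IP_PKTINFO ancillary
 * data to every datagram received on fd.  Returns 0, or -1 on error. */
static int enable_pktinfo(int fd)
{
    int on = 1;
    return setsockopt(fd, IPPROTO_IP, IP_PKTINFO, &on, sizeof(on));
}

/* Hypothetical helper: receive one datagram and store the index of the
 * interface it arrived on in *ifindex (0 if the kernel supplied none).
 * Returns the payload length, or -1 on error. */
static ssize_t recv_with_ifindex(int fd, void *buf, size_t len,
                                 unsigned int *ifindex)
{
    union {                     /* union forces cmsghdr alignment */
        char bytes[CMSG_SPACE(sizeof(struct in_pktinfo))];
        struct cmsghdr align;
    } control;
    struct sockaddr_in from;
    struct iovec iov;
    struct msghdr msg;
    struct cmsghdr *cm;
    ssize_t n;

    assert(buf != NULL && ifindex != NULL);
    iov.iov_base = buf;
    iov.iov_len = len;
    memset(&msg, 0, sizeof(msg));
    msg.msg_name = &from;
    msg.msg_namelen = sizeof(from);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.bytes;
    msg.msg_controllen = sizeof(control.bytes);

    *ifindex = 0;
    n = recvmsg(fd, &msg, 0);
    if (n < 0)
        return -1;

    /* Walk the ancillary data looking for the packet-info record. */
    for (cm = CMSG_FIRSTHDR(&msg); cm != NULL; cm = CMSG_NXTHDR(&msg, cm)) {
        if (cm->cmsg_level == IPPROTO_IP && cm->cmsg_type == IP_PKTINFO) {
            struct in_pktinfo pi;
            memcpy(&pi, CMSG_DATA(cm), sizeof(pi));
            *ifindex = (unsigned int)pi.ipi_ifindex;
        }
    }
    return n;
}
```

During burn-in, the returned index would be compared against
if_nametoindex() of the interface the per-interface socket is bound to,
logging any mismatch.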

EVENTS: The code currently has a once-per-second tick that we want to
eliminate in favor of alarms that only fire as needed.  Unfortunately,
this is going to be quite difficult.  And we won't collect the major
benefit (lower power consumption) until every piece of it is done.

In theory, what we ought to be able to do is spin on a single event
dispatcher using epoll, waking up only when there's an incoming
packet, or when we're due to transmit to a peer, or once an hour to
check leap second status.  Presently we unconditionally wake up once
per second and do nothing if none of these things is required.
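
As a rough illustration of that dispatcher, here is a sketch assuming
Linux epoll and a single receive descriptor.  All names (ms_until_next,
make_dispatcher, wait_for_work, the wake_reason enum) are hypothetical
and not taken from the ntpd source; the point is only that the sleep
time is derived from the earliest pending deadline rather than being a
fixed one-second tick.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

enum wake_reason { WAKE_ERROR = -1, WAKE_DEADLINE = 0, WAKE_PACKET = 1 };

/* Milliseconds from now until the earlier of the two deadlines,
 * clamped to the non-negative int range that epoll_wait() accepts. */
static int ms_until_next(int64_t now_ms, int64_t next_xmit_ms,
                         int64_t next_leapcheck_ms)
{
    int64_t due = next_xmit_ms < next_leapcheck_ms ? next_xmit_ms
                                                   : next_leapcheck_ms;
    int64_t gap = due - now_ms;

    if (gap < 0)
        gap = 0;                /* already overdue: do not block */
    if (gap > INT_MAX)
        gap = INT_MAX;
    return (int)gap;
}

/* Create an epoll instance watching sockfd for readability.
 * Returns the epoll descriptor, or -1 on error. */
static int make_dispatcher(int sockfd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = sockfd } };
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &ev) < 0) {
        close(epfd);
        return -1;
    }
    return epfd;
}

/* Block until a packet is readable or the timeout expires.  An
 * interrupted wait is reported as WAKE_ERROR; the caller recomputes
 * the timeout and calls again. */
static enum wake_reason wait_for_work(int epfd, int timeout_ms)
{
    struct epoll_event ev;
    int n;

    assert(timeout_ms >= 0);
    n = epoll_wait(epfd, &ev, 1, timeout_ms);
    if (n < 0)
        return WAKE_ERROR;
    return n > 0 ? WAKE_PACKET : WAKE_DEADLINE;
}
```

With no refclocks and nothing due, a loop built this way would sleep
until the next packet or the next deadline, whichever comes first.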

There are a couple of problems with this.

1. If we have local refclocks we're going to wake up once per second
anyway because that is almost always how often they ship a time
report.  We've heard that Dr. Mills once turned down a patch set to
make the NTP Classic code entirely event-driven, and this is probably
most of the reason he didn't think it was valuable enough to risk the
code churn - only no-refclock deployments could see a benefit, and Dr.
Mills loved his refclocks.

In our deployment scenarios, how often do we think a low-power device
is *not* going to be watching a GPS/1PPS refclock?  Smartphones and
tablets are right out - anything mobile with a browser wants to know
location, therefore will have a GPS.

Given that you get 1PPS for free from an embedded GPS and those are
both cheap and ubiquitous, the real question is actually a bit more
fundamental.  How often does any low-power device care about time
without caring about location?  The power benefit from trying to go
tickless is upper-bounded by the size of this set.

2. There's a subtle issue here with frequency of clock adjustment.
Currently if we're slewing the clock it gets adjusted once per
second. If we go to a fully event-driven architecture (and there are
no refclocks) the frequency of adjustments will drop to the frequency
of network traffic. This may not be a practical problem - I'm inclined
to think it won't be - but we won't know until we measure.

3. Implementation complexity. The code says

	 * The basic timerevent is one second.  This is used to adjust the
	 * system clock in time and frequency, implement the kiss-o'-death
	 * function and the association polling function.

but this is a pretty serious oversimplification.  If you look at the code
in ntp_timer.c there's all kinds of random other stuff being done.  Each
conditional has its own timer and would have to be unpacked into a different
event type on the queue - I see 7 types. This implies a big messy change
that will take significant time to do and to verify.

4. On the other hand, after staring at the code I can now report that the
1PPS issue I thought might be going to block EVENTS is...not exactly gone, but
subsumed by the "refclocks tick once per second" problem.  Because NTP uses
the RFC 2783 interface to the PPSAPI kernel facility, no explicit 1PPS tick
needs to be dealt with.  I got this wrong because GPSD does *not* assume
RFC 2783 will be available and in many cases camps on one of the GPS's
handshake lines to pick it up.
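
To make the unpacking described in item 3 concrete, here is a sketch of
one possible shape for it: each conditional in the tick handler becomes
an event type with its own next-due time, and the dispatcher asks for
the earliest one.  The event names are hypothetical; three are taken
from the code comment quoted above and one from the hourly leap-second
check, and this is not the actual list of 7 types in ntp_timer.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event types; EV_NTYPES doubles as "no event". */
enum ev_type {
    EV_CLOCK_ADJUST,
    EV_KOD,
    EV_POLL,
    EV_LEAP_CHECK,
    EV_NTYPES
};

/* One due time (in seconds) per event type; INT64_MAX means disarmed. */
struct ev_table {
    int64_t due[EV_NTYPES];
};

static void ev_init(struct ev_table *t)
{
    size_t i;

    assert(t != NULL);
    for (i = 0; i < EV_NTYPES; i++)
        t->due[i] = INT64_MAX;
}

/* Schedule event e to fire at absolute time when. */
static void ev_arm(struct ev_table *t, enum ev_type e, int64_t when)
{
    assert(t != NULL && e < EV_NTYPES);
    t->due[e] = when;
}

/* Return the armed event with the earliest due time,
 * or EV_NTYPES if nothing is armed. */
static enum ev_type ev_next(const struct ev_table *t)
{
    enum ev_type best = EV_NTYPES;
    int64_t best_due = INT64_MAX;
    int i;

    assert(t != NULL);
    for (i = 0; i < EV_NTYPES; i++) {
        if (t->due[i] < best_due) {
            best_due = t->due[i];
            best = (enum ev_type)i;
        }
    }
    return best;
}

/* If the earliest event is due at or before now, disarm and return it;
 * otherwise return EV_NTYPES.  The handler re-arms periodic events. */
static enum ev_type ev_pop_due(struct ev_table *t, int64_t now)
{
    enum ev_type e = ev_next(t);

    if (e == EV_NTYPES || t->due[e] > now)
        return EV_NTYPES;
    t->due[e] = INT64_MAX;
    return e;
}
```

A fixed table is enough for a handful of event types; per-association
poll timers would want a real priority queue instead.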

GOPREP:  Aside from the considerable labor of code translation, there is
only one problem blocking a move to Go.  That is how GC stop-the-world
pauses might stall refclock reports.

What you pay for Go's GC is that the runtime occasionally has to stop all
threads to do it.  These pauses are not frequent.  See for example this 2016
graph of Go 1.9 performance on an 18GB heap:

https://twitter.com/brianhatfield/status/804355831080751104

They get serious spikes once or twice an hour, bounded by 1ms and
averaging 0.5ms.

For network-peer connections this is not a problem - it isn't large
compared to random network-weather effects that are exactly what NTP
is designed to deal with.  However, it is a substantial amount of
unpredictable latency to impose on local refclock reports.

I'm not going to try to plan around this yet except to note that
it could well turn into a reason to revive refclockd partitioning
(with the clock code staying in C.)

= DEPENDENCIES =

The logical order to do these things in is: SINGLESOCK first.  Then
EVENTS, if we choose to do it.  Then NTS.  GOPREP isn't a task to be
scheduled (at least not yet) but a set of issues to keep an eye on.

= UNRESOLVED QUESTIONS =

I'm now confident that SINGLESOCK is doable fairly promptly.

The big question is whether EVENTS is worth the effort.  I'm now
leaning towards "no", but my mind could be changed if either (a) Mark
tells me there's a really important deployment class that is both low
power and without refclocks, or (b) Daniel tells me there's some reason
NTS really needs things to be rearchitected this way.

We need to keep an eye on Go's worst-case stop-the-world pause times,
because they keep falling in successive compiler revisions. There's
some magic threshold of microseconds-per-hour below which we wouldn't
actually care about the induced clock-report jitter; I don't know what
that is, but Hal or Gary might be able to tell us.  We probably don't
want to plan a Go move until we have confidence that the
95th-percentile of stop-the-world pauses is going to be below that
threshold.
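
The check itself is simple arithmetic once pause samples exist.  A
sketch, assuming pause durations have already been collected in
microseconds by some means not shown here; the function names are
invented, the nearest-rank percentile definition is my choice, and the
threshold is a parameter because its value is still unknown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for unsigned 64-bit values, ascending. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;

    return (x > y) - (x < y);
}

/* Nearest-rank 95th percentile of n pause samples, in microseconds.
 * Sorts the caller's array in place; n must be at least 1. */
static uint64_t pause_p95_us(uint64_t *samples, size_t n)
{
    size_t rank;

    assert(samples != NULL && n > 0);
    qsort(samples, n, sizeof(samples[0]), cmp_u64);
    rank = (95 * n + 99) / 100;     /* ceil(0.95 * n) in integer math */
    return samples[rank - 1];
}

/* Nonzero if the 95th-percentile pause is strictly below the
 * threshold (a placeholder value to be supplied by measurement). */
static int pauses_acceptable(uint64_t *samples, size_t n,
                             uint64_t threshold_us)
{
    return pause_p95_us(samples, n) < threshold_us;
}
```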
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>
