Driver strategy - we need to decide among incompatible goals

Sun Aug 11 19:27:21 UTC 2019

Achim Gratz via devel <devel at ntpsec.org>:
> In other words, while there may have been blobs there, none of them were
> actually in NTPd. 

The fact that they had to be linked to the kernel rather than being in
userland made them *worse* security risks.  Those were the first
drivers I dropped.

> > Whatever your threat model is, reducing attack surface is effective
> > security hardening.  Reducing total LOC and complexity in the codebase
> > reduces attack surface.  Thus, reducing LOC anywhere you can do it 
> > is a hardening strategy.
> 
> As a strategy it's fine, as a criterion for deciding what to let go it's
> useless.  That is very much the point I was trying to make: your
> proposed criterion doesn't actually tell us anything about which threat
> you are going to mitigate.

Who said I was trying to use it to decide what specific things to remove?

The strategy implies that everything that can be removed should be.  It's
a reversal of Classic's policy of not throwing away code ever. But just
having that as a goal doesn't say what to get rid of.

To decide what should be removed in what order one has to apply other
criteria.  Like "Will this driver ever be used again?".  Removing all
the ones dependent on the old WWVB modulation, for example, was a
particularly easy decision.

> > If you're only just now noticing that this is NTPsec development's
> > central thrust, and has been since 2015, and that judging by CVEs in
> > Classic that we've evaded it has been rather spectacularly successful,
> > maybe you ought to be paying closer attention to what we're actually
> > doing and achieving before you criticize.
> 
> How many of these were related to device support, obsolete or not?

CVEs?  Not many: two, maybe three IIRC.  Autokey has a much worse defect history.
But when you're running a strategy centered on attack-surface reduction you
squeeze out code *everywhere you can* and that is exactly what we have done, for
a reduction in codebase size of 4:1.

> >> > NTPsec aims to be highly secure and reliable.  If we're serious about
> >> > that, we need to reduce our vulnerability to defects from these
> >> > wraparound/rollover problems. 
> >> 
> >> You won't make even a tiny step in that direction based on your current
> >> understanding of the issues.
> >
> > Please read https://docs.ntpsec.org/latest/rollover.html so you won't 
> > be under any misapprehensions about what we understand.  You might
> > also want to read the big comment at  
> >
> > https://gitlab.com/gpsd/gpsd/blob/master/timebase.c
> >
> > You can see from that how firm a grasp Gary Miller and I had on these
> > problems before NTPsec.
> 
> Appeal to authority won't get you anywhere while you continue to skirt
> the actual discussion.  But the first of the two citations is in fact a
> lot more careful and nuanced in its claims than your broad-brushed
> missive regarding device support.

Surprise! I wrote that entire "careful and nuanced" discussion myself.
You didn't know how much I know before you read it. You should at
least consider the possibility that four years of success hacking at
this giant hairball has taught me things *you* don't know.

> > Yes, in the presence of era wraparounds perfect resolution of absolute time
> > is not possible. We're not under any illusion that it is. What  *is* 
> > achievable is to reduce the complexity of the failure cases and make the
> > code better at self-auditing and notifying a human when it enters a bad 
> > state.
> 
> Yet you haven't addressed the actual failure cases and how you plan to
> mitigate them.

That's because I don't have a detailed plan.  I'm learning my
way into the problem, simplifying as I go.

I've already collected one major gain from this process.  Now you can
recover from a trashed system clock if you trust your clock sources
not to be lying to you about the year.  Of course they will sometimes
suffer era rollovers and tell a lie, but it's a hell of a lot easier
to detect that failure when you're looking at 4-digit years than at
two-digit year parts.
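
To make that concrete, here is a minimal sketch. It is not NTPsec code:
the helper names, the pivot value, and the idea of using a trusted floor
date (say, a build date) are all assumptions for illustration. The only
hard fact in it is that the legacy GPS week counter wraps every 1024 weeks.

```python
from datetime import date, timedelta

# The legacy GPS week counter is 10 bits wide, so it wraps every 1024 weeks.
GPS_ERA = timedelta(weeks=1024)


def unroll_four_digit(reported: date, floor: date) -> date:
    """Hypothetical helper: advance a full date by whole GPS eras until it
    is no earlier than a trusted floor date (assumed here: a build date)."""
    while reported < floor:
        reported += GPS_ERA
    return reported


def expand_two_digit(yy: int, pivot: int = 80) -> int:
    """Hypothetical helper: guess the century of a 2-digit year.
    The pivot is an arbitrary assumption, which is exactly the problem."""
    return 1900 + yy if yy >= pivot else 2000 + yy


floor = date(2019, 1, 1)       # assumed trusted lower bound
rolled = date(1999, 12, 26)    # what a rolled-over receiver might report
print(rolled < floor)          # with a full year the lie is one comparison away
print(unroll_four_digit(rolled, floor))
print(expand_two_digit(99), expand_two_digit(19))
```

With a full year, detection is a single comparison against the floor.
With only "99" to go on, the code first has to guess a century, and a
wrong pivot and a rollover can mask each other.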

That last is an example of what I mean by simplifying the range of
failure modes.  My plan is to just keep chiseling away the rock until
I can wrap my head around all the failure interactions and produce
something like a proof of behavior.

Even if I never get to that point, every one of the simplification
steps required to go in that direction pushes the code toward
better maintainability and auditability.

> > Generally speaking, you can tell improvement of this kind is happening
> > any time you rip out old shims.  The code that prevented autonomous
> > operation from working at all before I fixed it in 2017 was, I believe,
> > an old shim from the early days of the Y2K panic.
> 
> More anecdotal arguing.

What you call "anecdotal arguing" is what you get when you're working 
a heuristic with a telos rather than a plan where you can spec the form
of the final solution.

That's all I have.  Because some problems are so messy that you have
to approach them with heuristics, humility, and patience.

> >> > My thinking was that we would eventually drop all of the 2-digit-only
> >> > modes and drivers, and say "if your refclock doesn't ship 4-digit
> >> > years, it's disqualified".  Besides the autonomy issue, devices with
> >> > this quirk are often very old hardware with wraparound problems.
> >> 
> >> So, all GPS receivers, to start with? 
> >
> > No, but it is conceivable that we might someday disqualify NMEA receivers
> > that don't ship a ZDA sentence.
> 
> Based on what argument?

That we want to remove the layer of kluges intended to back-fill incomplete
dates.

Why do that?  Because reasoning about rollover detection is already quite
hard enough without that additional layer of complications, thank you.
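
A rough sketch of the difference, for anyone who hasn't stared at NMEA
lately. This is illustrative only, not the NTPsec driver: the function
names are hypothetical, and the field positions follow the commonly
documented ZDA (hhmmss,dd,mm,yyyy,...) and RMC (date as ddmmyy in the
ninth data field) layouts. ZDA hands you a complete year; RMC leaves the
century to be back-filled.

```python
def nmea_wrap(body: str) -> str:
    """Hypothetical helper: frame a sentence body with '$' and its checksum,
    which is the XOR of every character between '$' and '*'."""
    calc = 0
    for ch in body:
        calc ^= ord(ch)
    return "${}*{:02X}".format(body, calc)


def date_fields(sentence: str):
    """Return (year, month, day, year_is_complete), or None if the sentence
    is malformed, fails its checksum, or carries no usable date."""
    body, sep, given = sentence.strip().lstrip("$").partition("*")
    if not sep or nmea_wrap(body)[-2:] != given.upper():
        return None
    fields = body.split(",")
    kind = fields[0][2:]                      # drop the 2-letter talker ID
    try:
        if kind == "ZDA" and len(fields) >= 5:
            # 4-digit year: nothing to back-fill
            return int(fields[4]), int(fields[3]), int(fields[2]), True
        if kind == "RMC" and len(fields) >= 10 and len(fields[9]) == 6:
            d = fields[9]                     # ddmmyy: century is missing
            return int(d[4:6]), int(d[2:4]), int(d[0:2]), False
    except ValueError:                        # empty or non-numeric fields
        return None
    return None


zda = nmea_wrap("GPZDA,192721.00,11,08,2019,00,00")
rmc = nmea_wrap("GPRMC,192721.00,A,4807.038,N,01131.000,E,0.0,0.0,110819,,")
print(date_fields(zda))
print(date_fields(rmc))
```

Everything downstream of the second case has to carry a "year is a
guess" flag around, and that is the layer I would like to be rid of.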

> I'm not going to continue that line of discussion.  Again, what you
> continue to ignore is that I haven't said _anything_ about any
> particular driver getting the axe or not, but that the criteria you
> propose to make that decision are bunk.

I think we've established that you didn't understand my criteria
correctly. Maybe we can have a more constructive discussion now.

> > On the other hand, if you grasp that modern primary clocks are *not
> > expensive*, then maybe you start seeing support for that old hardware
> > as a source of technical debt that should be cleared.
> 
> A GPS receiver is not a primary clock, when it's used like that it's a
> clock distribution system based on common view principles.  At least
> some of the old stuff you want to throw out actually have clocks in them
> that allow limited hold-over.  Getting that when buying new is _not_
> cheap by any means, even if you're willing to spend considerable effort
> in building the system yourself.

Which is why our driver retention policy specifies that holdover
capability is a good reason not to ditch a clock even if its jitter is
not up to snuff with respect to a modern GPS.

You're actually late to this party.  I thought through most of these
issues when I was putting together NTPsec's technical plan back in
2015.  And that was informed by 11 years of prior experience on GPSD.

> >> How about an option 4. where you admit that full autonomy not only is
> >> provably impossible to result in an absolute time with bounded error
> >> margin, but also not even interesting to NTP.  That might get you to
> >> recognize that the only way to synchronize to some notationally correct
> >> time is to use as many _independent_ sources of time as you can get hold
> >> of.
> >
> > Fine, we agree on that.  But there's no rule that says any of those 
> > multiple sources must be a network peer, and important deployment cases 
> > in which you'd like them to be (say) GPSes watching three different
> > constellations with two rubidium clocks for holdover/backup.
> 
> I've said independent.  Having two GPS using the same antenna are not
> independent even when they watch two different constellations for
> instance.

Not sure how "same antenna" makes a difference.  Seems to me like having multiple
constellations driven by different fountain clocks is the important thing for
risk-spreading.  Explain please?

> > Classic couldn't handle that case.  NTPsec *can*.  And there's room for
> > more improvement in that area.
> 
> Classic in fact could, just not in the way you envision it.  It would
> have spread the radios / clocks over a number of stratum-1 servers and a
> secondary layer of stratum-2 with symmetric peering would have presented
> the resulting amalgam clock to your network.  In a way that is much more
> resilient than the setup you allude to above, which would need to pull
> the function of the secondary layer up into the clock source handling of
> ntpd.  Now that we have NTS we could get symmetric peering back without
> the non-security implications it had.

Not good enough.

I'm responding to real-world deployment cases that *absolutely* will not
punch a firewall hole for NTP. Not just spook shops, either - would you
believe chemical plants and oil refineries?

> Regarding your blurb on driver retention on the website:
> 
> "It may actually be the case that all the Stratum 1 sites running
> non-custom ntpd instances are using GPSes now."
> 
> Bzzzt. Wrong.  Right now I have:
> 
> +isis.uni-paderborn.de       .DCF.            1 u    8   64  377 22.657ms 126.19us 184.79us
> -h-213.61.224.35.host.de.col .DCFa.           1 u   49  128  377 29.603ms -9.225ms 203.35us

OK, I'll correct that.  U.S. vs. non-U.S. conditions make a difference here.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>