Stratum one autonomy and assumptions about GPS

Fri Aug 26 16:39:26 UTC 2016

Hal Murray <hmurray at megapathdsl.net>:
> 
> esr at thyrsus.com said:
> > One of my agenda items for a while has been to develop a protocol for places
> > that want to firewall out NTP for security reasons - that is, a set of
> > practices and error estimates based on documented assumptions. 
> 
> Background data is good.  I think you will get into trouble if you try to 
> turn that into useful recommendations.  There are too many assumptions and 
> there will be huge differences between places.

The way you get around that problem is to document your assumptions, then
tell the users *how to test whether their environment matches those
assumptions.*

This is one reason I've been willing to put a lot of effort into
ntpviz.  It doesn't yet give us statistics about GPS outage-time
distribution, but we're not very far from where it will be able to do
so.  Once we have that kind of visibility, we can reasonably make
claims about what kind of preconditions are needed to run autonomously
with GPS and then, with ntpviz, give users a way to *test* for those
preconditions.

Certainty isn't achievable, but we can say things like: Keep watching
your GPS until outage times have a distribution (or a worst case) you
think you can model. Once you have that, you can make inferences about
how often you expect an outage to drive time accuracy outside a
specified range, and use that to evaluate your relative risk
between time inaccuracy due to natural disruptions vs. cyberattack.
That will tell you if you should try to run autonomously.
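
As a rough illustration of "keep watching your GPS", here is a minimal
sketch that turns a log of good-fix timestamps into outage lengths and
an empirical busts-per-year rate. Everything in it is an assumption
made for illustration: the input format (a list of Unix timestamps, one
per good fix), the function names, and the 2-second gap threshold.
ntpviz does not produce this yet.

```python
def outage_lengths(fix_times, max_gap=2.0):
    """Return the lengths (seconds) of gaps between consecutive good fixes.

    fix_times: timestamps (seconds) at which the GPS had a good fix.
    max_gap:   hypothetical threshold; any gap longer than this counts
               as an outage.
    """
    times = sorted(fix_times)
    return [b - a for a, b in zip(times, times[1:]) if b - a > max_gap]


def busts_per_year(outages, observed_seconds, limit_seconds):
    """Empirical rate of outages longer than limit_seconds, scaled to a year.

    observed_seconds: how long the GPS was watched in total.
    limit_seconds:    longest outage the site can tolerate.
    """
    year = 365.25 * 86400
    busts = sum(1 for length in outages if length > limit_seconds)
    return busts * year / observed_seconds


if __name__ == "__main__":
    # Made-up sample: fixes once a second with two gaps.
    fixes = [0, 1, 2, 10, 11, 100]
    gaps = outage_lengths(fixes)
    print(gaps)
    print(busts_per_year(gaps, observed_seconds=100, limit_seconds=60))
```

The rate is only as good as the observation window: a site that has
watched for a week cannot say much about outages that happen once a
year.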

Part of my strategic push behind ntpviz (besides making our funding
authority happy with pretty pictures) was to get us from a place where
analytics like that seem too hard to tackle to a place where
they're natural, relatively easy extensions of a code framework
already in place.

Now that we have that framework in place, I'm certain you and Gary
have the domain knowledge required to make it tell us what we need to
know to make confidence estimates for running autonomously. But first
you have to get used to the idea that this is possible, that there
really is a rational statistical-estimation model you can apply!

Concrete direction: we should get to where we can plot a histogram of
outage lengths over a timespan of months and curve-fit it to a Poisson
distribution (if you need an explanation of why Poisson is appropriate
I can supply one). Then, by setting a maximum allowable skew and modeling
the drift envelope, we should be able to put a line on that chart that
expresses how often outages will bust that maximum skew.  This is a
fairly simple problem in applied statistics; people analyzing noisy
experimental data have to do similar things all the time.
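
A minimal sketch of that calculation, under loudly stated assumptions:
outage lengths are binned into whole minutes, the drift envelope is a
simple linear one (a constant worst-case drift in ppm), and the Poisson
fit uses the sample mean as its rate parameter. The function names and
the sample numbers are hypothetical, and whether Poisson really fits is
something the real histogram would have to show.

```python
import math


def poisson_tail(lam, k):
    """P(X > k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)          # P(X = 0); underflows for very large lam
    cdf = term
    for i in range(1, k + 1):
        term *= lam / i            # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)     # clamp tiny negative rounding error


def max_outage_minutes(max_skew_ms, drift_ppm):
    """Longest outage (whole minutes) before linear drift exceeds max_skew_ms."""
    seconds = (max_skew_ms / 1000.0) / (drift_ppm * 1e-6)
    return int(seconds // 60)


def bust_estimates(outage_minutes, max_skew_ms, drift_ppm):
    """Return (limit, lam, empirical_fraction, model_fraction).

    limit is where the line goes on the histogram; the two fractions
    are the share of outages expected to bust the skew limit, measured
    directly and according to the fitted Poisson model.
    """
    limit = max_outage_minutes(max_skew_ms, drift_ppm)
    lam = sum(outage_minutes) / len(outage_minutes)   # Poisson MLE
    empirical = sum(1 for m in outage_minutes if m > limit) / len(outage_minutes)
    return limit, lam, empirical, poisson_tail(lam, limit)


if __name__ == "__main__":
    # Made-up data: outage lengths in minutes, 10 ms skew limit, 1 ppm drift.
    print(bust_estimates([1, 2, 3, 200], max_skew_ms=10, drift_ppm=1.0))
```

A large gap between the empirical fraction and the model fraction would
be a sign that the fitted curve is not describing the tail, which is
the part that matters for this decision.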

> > Right.  This sort of thing you hedge against with tell-me-three-times.
> > (Remember that the user story assumes a big hardware budget.) 
> 
> If you have a big budget, you buy a does-it-all box with GPSDO and NTP 
> server, and pay a defense contractor to integrate things.
>   http://spectracom.com/products-services/precision-timing

True.  Either way, this is not a risk category that *we* need to design
around.

> > I don't think I need to care about specific causality much.  The overriding
> > question is what the statistical distribution of outage lengths is.
> 
> The time pattern depends upon the mechanism.  If your GPS unit fades out 
> several times a day you can collect useful statistics.  If your GPS screws up 
> due to a 1024 week rollover it's hard to get enough data to work with.

I think this was a bad example to use. We know exactly when the
rollovers are coming, they're rare, and we can have a human watching
to mitigate problems when they happen.  The real problem is large
*random* outages: we don't necessarily know what the tail of the
distribution looks like.

> > (Rain/snow on your trees.  Why is this special? Does it increase multipath
> > reflections or something?) 
> 
> Water attenuates GPS signals.  So wet trees/roofs attenuate more than dry.

Noted.  Is this effect significant enough that SNR drops below
critical in severe weather?  If so, that would change the risk
profile a lot.

> I think you are mixing two uses for good background data.  Earlier you said 
> your target was a firewall blocking all NTP traffic to the WAN.  Now you are 
> back to running a WAN server.

You're right.  I talked about two different cases and didn't draw the link
between them.  My tacit assumptions are as follows:

(a) Users have come to expect WAN time-sync accuracy on the close order of
    10ms from existing NTP and network infrastructure.

(b) That expectation conditions the way people design for soft realtime even
    in off-net applications.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>