Graphs from NTP servers

Mon Feb 8 22:57:15 UTC 2016

I'll turn this into a web page, but this is what I have now.  
Corrections/feedback encouraged.  Off-list is fine.

The place to start is a system's loopstats file.  This is from a low cost 
DigitalOcean cloud server in San Francisco.
  http://users.megapathdsl.net/~hmurray/ntpsec/SFO-self.png

That is the system's opinion of how good its clock is.  There are two types 
of errors to consider.  The first is the wiggles in that graph.  That tells 
you how stable the local clock is.  In this case, except for a few spikes 
early on, the system mostly thinks it is within 1/2 ms of the correct time.  
So as long as we are interested in millisecond accuracy rather than 
microseconds, this system is probably a good place to stand while looking at 
other servers and/or the internet connections from here to there.
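
A rough way to put a number on "mostly within 1/2 ms" is to count loopstats samples.  This is only a sketch in Python.  It assumes the classic ntpd loopstats layout (MJD, seconds past midnight, offset in seconds, frequency in ppm, jitter, wander, time constant).  The function name and the sample lines are made up for illustration; they are not data from the server above.

```python
def fraction_within(lines, limit_s=0.0005):
    """Return the fraction of loopstats samples with |offset| <= limit_s."""
    offsets = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        offsets.append(float(fields[2]))  # third field: clock offset, seconds
    if not offsets:
        return 0.0
    inside = sum(1 for off in offsets if abs(off) <= limit_s)
    return inside / len(offsets)


# Made-up sample lines in the assumed layout.
sample = [
    "57426 100.000 0.000120 13.778 0.000351 0.0133 6",
    "57426 164.000 -0.000300 13.779 0.000340 0.0131 6",
    "57426 228.000 0.002100 13.781 0.000900 0.0140 6",
    "57426 292.000 0.000050 13.780 0.000360 0.0132 6",
]
print(fraction_within(sample))
```

Remember that this only measures the wiggles.  A systematic error would not change the count.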

The other type is systematic error, for example, using the wrong edge of a 
PPS pulse or asymmetric network delays.  Those don't show up in loopstats.  
You can't detect them without digging deeper.

You need to keep both types of error in mind when looking at graphs.

After the typical request-response packet exchange, an NTP client has 4 time 
stamps:
  The time the request left the client
  The time the request arrived at the server
  The time the response left the server
  The time the response arrived at the client
Note that there are two different clocks used to make those time stamps, 
either of which may be inaccurate.

NTP servers also act as clients to get their time from lower stratum servers.  
ntpd logs those time stamps in the rawstats file.  If you use the "noselect" 
option on a "server" line in your config file, you can collect info without 
letting dirty data corrupt your local clock.
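
Here is a sketch, in Python, of pulling the 4 time stamps out of one rawstats line.  It assumes the classic layout: MJD, seconds past midnight, two addresses, then the 4 time stamps in the order listed above.  I'm taking the first address to be the remote server; check that against your own files.  Newer versions append more fields, which this ignores.  The names RawSample and parse_rawstats_line and the sample line are made up.

```python
from decimal import Decimal
from typing import NamedTuple


class RawSample(NamedTuple):
    peer: str    # assumed: first address field is the remote server
    t1: Decimal  # request left the client (client clock)
    t2: Decimal  # request arrived at the server (server clock)
    t3: Decimal  # response left the server (server clock)
    t4: Decimal  # response arrived at the client (client clock)


def parse_rawstats_line(line):
    """Parse one rawstats line; return None if it is too short."""
    fields = line.split()
    if len(fields) < 8:
        return None
    stamps = [Decimal(f) for f in fields[4:8]]
    return RawSample(fields[2], *stamps)


# Made-up line using documentation-only addresses.
line = ("57426 82635.123 192.0.2.10 198.51.100.5 "
        "3663961035.100000000 3663961035.140000000 "
        "3663961035.140100000 3663961035.175000000")
sample = parse_rawstats_line(line)
print(sample.peer, sample.t4 - sample.t1)
```

Decimal is used because the time stamps are seconds since 1900 with 9 digits of fraction.  A 64 bit float rounds numbers that size to roughly half a microsecond, which is fine for millisecond work but easy to avoid.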

Here is a graph of the round trip times from San Francisco to several servers 
on the east coast:
  http://users.megapathdsl.net/~hmurray/ntpsec/SFO-east-rtt.png
The steps in the green and red dots are due to routing changes.  The fuzz on 
the blue dots is queuing delays on some overloaded link.  The cap on the fuzz 
indicates that the overloaded link has 10 ms of buffering.  There are a few 
scattered red dots.  The ones that indicate extra delays are typical network 
glitches.  I don't have a good story for the ones at 14 and 15 hours that 
indicate reduced time.  My guess would be a transient network path that was a 
few ms shorter but didn't happen often enough to show up clearly.

Normally, ntpd assumes that the network delays are symmetrical.  That lets it 
compute the offset between the local clock and the remote clock.  Here is a 
graph of results of that calculation:
  http://users.megapathdsl.net/~hmurray/ntpsec/SFO-east-off.png
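
For reference, the arithmetic behind the round trip time and offset graphs can be sketched in a few lines of Python.  This is the usual NTP on-wire calculation using the 4 time stamps in the order listed earlier.  The function name and the numbers (in ms) are made up for illustration.

```python
def rtt_and_offset(t1, t2, t3, t4):
    """Round trip time and clock offset, assuming symmetric delays.

    t1: request left the client      (client clock)
    t2: request arrived at server    (server clock)
    t3: response left the server     (server clock)
    t4: response arrived at client   (client clock)
    """
    # Total elapsed time on the client minus the time spent in the server.
    rtt = (t4 - t1) - (t3 - t2)
    # Average of the two apparent one-way clock differences.
    offset = ((t2 - t1) + (t3 - t4)) / 2
    return rtt, offset


# Made-up example in ms: 40 out, 1 in the server, 35 back.
print(rtt_and_offset(0, 40, 41, 76))
```

In that example both clocks are taken to be perfect, yet the computed offset is 2.5 ms.  Any asymmetry in the path shows up as an offset of half the difference.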

If, instead, you assume that both clocks are accurate, you can compute the 
network transit delays in each direction.  I picked well run servers for this 
experiment, so that assumption is probably valid.  The limiting factor is 
then the ms or so of error on the local clock.
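
A sketch of that calculation in Python, using the same 4 time stamps.  The function name and the numbers (in ms) are made up.

```python
def one_way_delays(t1, t2, t3, t4):
    """Out and back transit delays, assuming both clocks are accurate.

    t1, t4 are stamped by the client clock; t2, t3 by the server clock.
    """
    out = t2 - t1    # client to server
    back = t4 - t3   # server to client
    return out, back


# Made-up example in ms.
t1, t2, t3, t4 = 0, 40, 41, 76
out, back = one_way_delays(t1, t2, t3, t4)
print(out, back)

# Same data, different view: the two delays add up to the round trip
# time, and their difference is twice the offset that the symmetric
# assumption would have reported.
rtt = (t4 - t1) - (t3 - t2)
offset = ((t2 - t1) + (t3 - t4)) / 2
print(out + back == rtt, out - back == 2 * offset)
```

So the out/back graphs contain no new measurements.  They are the round trip time and offset data replotted under a different assumption.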

Here is a graph of the delays to/from rackety:
  http://users.megapathdsl.net/~hmurray/ntpsec/SFO-rackety-out-back.png
That shows that the congestion is on the return path.  It also shows that the 
return path takes about 5 ms longer than the forward path.

Here is the out/back graph for the NIST systems:
  http://users.megapathdsl.net/~hmurray/ntpsec/SFO-nist-out-back.png
The first thing to notice is that the outgoing path takes over twice as long 
as the return path.   Going back to the round trip time graph, it's 
suspicious that systems located relatively near each other have such large 
differences in round trip times. The return times are close to the times 
to/from rackety.

Note that there are only a few steps in the bottom/return path.  The steps in 
the top/forward path match the steps in the round trip time, so most of the 
routing changes are on the long forward path.

There is an interesting event associated with time-d from 17.5 to 18.5 hours.  
Note that the out/back steps are mirror images of each other and that there 
is no change in the round trip time during that time slot.  That would happen 
if the time on the remote system was offset.  It could also happen with some 
unlikely changes in routing.
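
To see why a clock error gives a mirror image, here is a toy model in Python.  Everything in it is made up for illustration: the function name, the true delays, the 1 ms of server processing, and the size of the clock error (all in ms).

```python
def apparent_delays(true_out, true_back, remote_error):
    """Return (out, back, rtt) as the client would compute them.

    remote_error is how far the server clock is ahead of true time.
    The client clock is taken to be correct.
    """
    t1 = 0                                  # client clock
    t2 = t1 + true_out + remote_error       # server clock
    t3 = t2 + 1                             # server clock, 1 ms processing
    t4 = (t3 - remote_error) + true_back    # back on the client clock
    out = t2 - t1
    back = t4 - t3
    rtt = (t4 - t1) - (t3 - t2)
    return out, back, rtt


print(apparent_delays(40, 35, 0))   # server clock correct
print(apparent_delays(40, 35, 3))   # server clock 3 ms fast
```

With the server clock 3 ms fast, the out delay looks 3 ms longer, the back delay looks 3 ms shorter, and the round trip time does not move.  An error in the local clock does the same thing with the sign flipped.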

Here is the round trip time graph for the nearby clocks used as references by 
this system:
  http://users.megapathdsl.net/~hmurray/ntpsec/SFO-local-rtt.png
And the corresponding offset graph:
  http://users.megapathdsl.net/~hmurray/ntpsec/SFO-local-off.png
The routing to all 3 clocks is stable, but something is off by 1/2 ms.

Here is the out/back graph:
  http://users.megapathdsl.net/~hmurray/ntpsec/SFO-local-out-back.png
(I dropped one of the HP clocks to reduce clutter.)
The mirror image pattern is due to offsets/errors in the local clock.  (It 
could be due to errors in the remote clocks, but all 3 have GPS/PPS inputs 
and the return paths all agree.)

Note the 1/2 ms offset between the two out times.  In order to figure out 
which clock/path is correct, I'll have to find at least one more good clock.  
(The 2 clocks at HP are on the same subnet so they only get one vote.)

-- 
These are my opinions.  I hate spam.