Apparent protocol-machine bug, new top priority
Fred Wright
fw at fwright.net
Mon Sep 25 23:55:30 UTC 2017
On Mon, 25 Sep 2017, Achim Gratz via devel wrote:
> Fred Wright via devel writes:
> > I get a kick out of you guys fussing over "thermal stability" when the
> > largest source of time error is the interrupt latency in timing the PPS
> > signal.
>
> The median interrupt latency shows up as an additional offset on top of
> other such offsets. The variability on that latency gets filtered
> pretty nicely by ntpd, especially the long tail at large latencies.
> Now, interrupts never have been a particularly strong point for ARM, I
> give you that.
But it can't "filter" the overall offset - it has no way to know what it
is.
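
To illustrate the distinction, here is a minimal simulation sketch. It is not ntpd's actual filter: the median filter, the latency distribution, and all of the parameter values are assumptions chosen only for illustration. It shows that filtering shrinks the spread of the latency samples, while the fixed part of the latency survives as an offset.

```python
import random
import statistics


def simulate_latencies(n, base_us=15.0, jitter_us=2.0,
                       tail_prob=0.05, tail_us=40.0, seed=1):
    """Hypothetical interrupt latencies: a fixed base plus jitter and a long tail."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        latency = base_us + abs(rng.gauss(0.0, jitter_us))
        if rng.random() < tail_prob:
            # Occasional large delay, modelling the long tail.
            latency += rng.expovariate(1.0 / tail_us)
        samples.append(latency)
    return samples


def median_filter(samples, window=8):
    """Median of each non-overlapping window (a stand-in for a real clock filter)."""
    return [statistics.median(samples[i:i + window])
            for i in range(0, len(samples) - window + 1, window)]


latencies = simulate_latencies(4096)
filtered = median_filter(latencies)

raw_spread = max(latencies) - min(latencies)
filtered_spread = max(filtered) - min(filtered)
residual_offset = statistics.median(filtered)

print(f"raw spread:      {raw_spread:.1f} us")
print(f"filtered spread: {filtered_spread:.1f} us")
print(f"residual offset: {residual_offset:.1f} us (never below the base latency)")
```

The filter only ever sees the latency-delayed timestamps, so nothing in its input tells it how large the base latency is.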
> > Just because you can't see the error in the graphs doesn't mean
> > it isn't there. :-)
>
> Again, that number isn't materially affecting the frequency stability,
> only the time offset. If you look at that, you will quickly find that
> your assertion of thermal effects getting dominated by the interrupt
> latency is wrong.
Of course it has nothing to do with the frequency stability, but it
directly affects the time offset. And my assertion is based on the actual
data. See below.
> > On the Beaglebone, it's typically around 15us with the
> > CPU running at 1GHz, going up to around 42us at 300MHz. It's directly
> > measurable because the "real" PPS timing is via counter capture, with a
> > total capture uncertainty (the equivalent of NTP RTT) of typically 583ns
> > at 1GHz and 1083ns at 300MHz.
>
> If you have histograms, I'd like to see them. But that seems to be in
OK, A Tale of Two Servers - It was the best of times, it was the worst of
times...
This is a day's data from my experimental time server:
http://sonic.net/~fw/private/NTP/BB2-2017-09-24/
The main timing reference is a rubidium oscillator, so frequency-related
effects are essentially nonexistent. In fact, the error in the Linux
kernel's limited-precision *representation* of the frequency is about an
order of magnitude larger than the typical actual frequency error.
PPS(2) is the counter-capture PPS source, and is the primary timing
reference. SHM(1) is the combined NMEA/PPS source from GPSD, which is
configured to use the interrupt-based PPS driver, and hence illustrates
the offset in the interrupt-based capture. Between 2100Z and 2200Z I
switched the governor to powersave (300MHz CPU clock instead of 1GHz), and
you can see the effect on the latency (but negligible effect on the actual
timing accuracy).
SHM(0) is a noselect peer that's included just to track the TOFF of the
GPS receiver. PPS(4) is the PPS from the undisciplined rubidium
oscillator, whose drift represents the rubidium frequency error; the step
change is where I manually reset it to the correct phase.
And here's a day's data from my primary time server:
http://sonic.net/~fw/private/NTP/Time-2017-09-22/
This one is running classic ntpd, but that shouldn't really matter for
this purpose. The timing reference is just the normal crystal, with no
special thermal treatment.
Again, PPS(2) is the main timing reference, though it's listed as
127.127.22.2 due to the lame partial translation table in ntpviz. Again,
SHM(1) is the NMEA/PPS source with interrupt-based PPS capture, and the
offset is a combination of the interrupt latency and the frequency-related
time offsets. The actual time offsets are visible in the loopstats graph
and in the PPS(2) peer offset graph, and are substantially smaller than
the offsets in SHM(1). QED.
SHM(0) is a noselect peer here as well.
> the right ballpark. Note that you could do something similar by running
> the PPS capture on the VC4 instead of ARM subsystem, but that part of
> the rasPi is woefully under-documented. In principle the hardware
> should allow capturing PPS at up to 250MHz and sending the timestamp via
> mailbox to the ARM is not timing-critical at all.
Lots of stuff on the Pi is woefully under-documented. :-) After all, this
is from Broadcom, an industry leader in closed documentation. It must
have *really* pained them even to provide the documentation that does
exist.
> If you wanted to eliminate that, you'd better use an FPGA or some other
> microcontroller that has capture units and hardware timestamps for the
> NIC.
The Beaglebone chipset already has the hardware, it's reasonably well
documented, and there's driver support for it in Linux. The biggest
source of inaccuracy is that the generic timekeeping code in Linux
provides no way to convert a *supplied* counter value to a timestamp.
The best one can do purely within the driver involves reading the current
counter value and the system timestamp as close together as possible, and
using that correspondence to map the captured counter value to the
corresponding timestamp. The delay in that sequence accounts for the
majority of the capture uncertainty. My experimental version of the
driver has some improvements in that area, but I've squeezed it about as
much as is possible without touching timekeeping.c.
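
As a rough sketch of that mapping (not the actual driver code, which is in the kernel and written in C), the idea can be modelled as below. The function names, the bracketing of the counter read between two clock reads, the 32-bit counter width, and the 24 MHz counter rate used in the example are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Correspondence:
    counter: int          # free-running counter value at the sample point
    timestamp_ns: int     # system time assigned to that counter value
    uncertainty_ns: int   # width of the window in which the counter was read


def sample_correspondence(read_counter, read_clock_ns):
    """Read the counter bracketed by two clock reads, as close together as possible."""
    t0 = read_clock_ns()
    counter = read_counter()
    t1 = read_clock_ns()
    # Assume the counter was read midway; the true instant lies within [t0, t1].
    return Correspondence(counter, (t0 + t1) // 2, t1 - t0)


def captured_to_timestamp_ns(captured, corr, counter_hz, counter_bits=32):
    """Map a hardware-captured counter value to a system timestamp."""
    mask = (1 << counter_bits) - 1
    # Ticks elapsed between the capture and the later correspondence sample,
    # computed modulo the counter width so a wrap in between is handled.
    elapsed_ticks = (corr.counter - captured) & mask
    return corr.timestamp_ns - elapsed_ticks * 1_000_000_000 // counter_hz


# Example with made-up numbers: the capture happened 24000 ticks (1 ms at an
# assumed 24 MHz) before the correspondence sample.
corr = Correspondence(counter=25_000, timestamp_ns=5_000_000_000, uncertainty_ns=600)
pps_ns = captured_to_timestamp_ns(1_000, corr, counter_hz=24_000_000)
```

In this model the width t1 - t0 of the bracketing window is what limits the accuracy of the result, which is why shortening that read sequence is where the remaining gains are.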
BTW, the Beaglebone chipset also has hardware timestamping in the NIC, and
I believe there's kernel support for it, but one can't take full advantage
of that without solving the "send timestamp problem".
On Mon, 25 Sep 2017, Gary E. Miller via devel wrote:
> On Mon, 25 Sep 2017 12:58:42 -0700 (PDT)
> Fred Wright via devel <devel at ntpsec.org> wrote:
>
> > I get a kick out of you guys fussing over "thermal stability" when the
> > largest source of time error is the interrupt latency in timing the
> > PPS signal.
>
> Uh, that is not my experience. And I have more control over my
> temperature than I have over my interrupt latency.
See above. And note that you can at least make the latency (as well as
the variation in latency) as small as possible by running the CPU as fast
as possible, rather than slowing it down for "thermal stability".
Fred Wright