TESTFRAME is dead. This accelerates us some.

Wed Oct 5 20:26:26 UTC 2016

Mark: Heads up!  Is it worth giving up Mac OS X for another shot at TESTFRAME?
Sadly, even that only gives us poor odds...

Hal Murray <hmurray at megapathdsl.net>:
> I hope you are putting it on the back burner rather than totally giving up.
> 
> As I understand things, the recent problem for TESTFRAME was the overlap 
> between adjtimex and ntp_adjtime.

Not quite.  It's the fact that with KERNEL_PLL enabled, the capture
files will have an event in the stream that won't replay properly on a
non-KERNEL_PLL system, and vice-versa - the code path through replay will
be incompatibly different.

The specific location of the problem is near line 827 in ntpd/ntp_loopfilter.c:

		/*
		 * Pass the stuff to the kernel. If it squeals, turn off
		 * the pps. In any case, fetch the kernel offset,
		 * frequency and jitter.
		 */
		ntp_adj_ret = intercept_ntp_adjtime(&ntv);

If you look at the calling context, you'll see that whether this call is
reached depends on whether KERNEL_PLL is enabled.  And there's a second
call down around line 862.

#if defined(STA_NANO) && NTP_API == 4
		/*
		 * If the TAI changes, update the kernel TAI.
		 */
		if (loop_tai != sys_tai) {
			loop_tai = sys_tai;
			ntv.modes = MOD_TAI;
			ntv.constant = sys_tai;
			if ((ntp_adj_ret = intercept_ntp_adjtime(&ntv)) != 0) {
			    ntp_adjtime_error_handler(__func__, &ntv, ntp_adj_ret, errno, false, true, __LINE__ - 1);
			}
		}
#endif /* STA_NANO */

Now think about what will happen if you make a capture log on a
KERNEL_PLL system and try to replay it on a non-KERNEL_PLL system.
The replay will never execute its way into the calling context that
ought to pick up the two recorded ntp_adjtime events per local_clock()
call, because that context was conditioned out at build time!  Not
safe to ignore them either, because they set system variables later
used for time computations - that's the whole point.

You have the reverse problem on a KERNEL_PLL system trying to replay a
non-KERNEL_PLL capture.  It *will* execute its way into that context, then the
first intercept_ntp_adjtime() call will barf and die because it didn't find
a matching event in the capture log.
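
A toy model of the two failure modes may make this concrete. It assumes a capture log is just an ordered list of event names and that replay must consume them strictly in order. The names `local_clock`, `capture`, `replay`, `ReplayMismatch` and the event strings are all hypothetical stand-ins, not the real TESTFRAME interface, and a runtime flag stands in for the build-time conditional.

```python
class ReplayMismatch(Exception):
    """Raised when replay wants an event the log does not supply next."""


def local_clock(kernel_pll):
    """Yield the events one local_clock() call needs, in order.

    Hypothetical model: the KERNEL_PLL build makes two ntp_adjtime
    calls that the other build was compiled without.
    """
    yield "read_clock"
    if kernel_pll:  # stands in for the build-time #ifdef
        yield "ntp_adjtime"
        yield "ntp_adjtime"
    yield "adjust_done"


def capture(kernel_pll, calls=1):
    """Record the event stream a given build would write to its log."""
    return [ev for _ in range(calls) for ev in local_clock(kernel_pll)]


def replay(log, kernel_pll):
    """Replay a log against a build; fail on the first divergence."""
    pending = list(log)
    calls = log.count("read_clock")
    for _ in range(calls):
        for wanted in local_clock(kernel_pll):
            if not pending or pending[0] != wanted:
                got = pending[0] if pending else "end of log"
                raise ReplayMismatch(f"wanted {wanted}, log has {got}")
            pending.pop(0)
    if pending:
        raise ReplayMismatch(f"{len(pending)} recorded events never consumed")
```

In this sketch `replay(capture(True), True)` and `replay(capture(False), False)` both complete quietly, while either cross-combination raises `ReplayMismatch` on the first `local_clock()` call.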

It doesn't matter here whether KERNEL_PLL is an alias for
HAVE_SYS_TIMEX_H or something else. The problem is the existence of
two different divergent code paths on different platforms, making them
two distinct finite-state machines.  A necessary condition for
TESTFRAME to work is that *one of those paths has to be
abolished*. The code has to define the *same* honkin' big FSM
everywhere.

There are a couple of scenarios in which this happens.  In one,
we simply give up on non-KERNEL_PLL platforms right now.  The problem
with this plan is, as you might half-expect, OS X:

https://github.com/bsdphk/Ntimed/issues/2

We'd also lose the capability to build on older BSDs that are POSIX
compliant but only have adjtime(2), not ntp_adjtime(2), which is
where OS X comes from.  I'm pretty sure it would be a problem if
we ever decide to go back to Windows, too.

There is a possible future that's nicer. If we can solve the
slow-convergence problem thoroughly enough that KERNEL_PLL is no
longer needed for decent performance, then we can tell the world's OS
maintainers "Hey! NTP just lost a *really annoying* dependency and you
can drop a bunch of code out of your kernels!" at which point amidst
general rejoicing *we* drop the KERNEL_PLL code.

However, see the end of this note for why this wouldn't entirely get
us out of the woods.

> glibc-2.23 says:
>   @code{adjtimex} is functionally identical to @code{ntp_adjtime}.
> So that shouldn't be a problem.
> 
> Somehow, we got started thinking that --disable-kernel-pll would fix the 
> adjtimex/ntp_adjtime problem.
> 
> That uncovered a mess in the code.  It's the sort of thing you are usually 
> very good at cleaning up.
> 
> Grep for HAVE_KERNEL_PLL and HAVE_ADJTIMEX and HAVE_SYS_TIMEX_H
> 
> As far as I can tell, HAVE_KERNEL_PLL is misnamed.  Without 
> disable-kernel-pll, it's an alias for HAVE_SYS_TIMEX_H.  If we assume POSIX 
> support for ntp_adjtime then sys/timex.h exists.

I think this is almost right. Unfortunately, it's only "almost".

First, ntp_adjtime() and sys/timex.h are *not* POSIX.  Mark and I have
discussed trying to fix this working through The Open Group, but that's
a multi-year process.

Second, I am not *entirely* sure HAVE_KERNEL_PLL and HAVE_SYS_TIMEX_H
do mean the same thing.  It is true they both *seem* to depend only on
the existence of sys/timex.h. It is true that our waf recipe assumes
this:

        # Does the kernel implement a phase-locked loop for timing?
        # All modern Unixes (in particular Linux and *BSD) have this.
        #
        # The README for the (now deleted) kernel directory says this:
        # "If the precision-time kernel (KERNEL_PLL define) is
        # configured, the installation process requires the header
        # file /usr/include/sys/timex.h for the particular
        # architecture to be in place."
        #
        if ctx.get_define("HAVE_SYS_TIMEX_H") and not ctx.options.disable_kernel_pll:
                ctx.define("HAVE_KERNEL_PLL", 1, comment="Whether phase-locked loop for timing exists and is enabled")

It is also true that we inherited this assumption from Classic's autoconf
system.  Unfortunately, both may be wrong.  There are some murky waters here.

The reason I am not sure about this is that ntp_adjtime(2)/adjtimex(2) is
a complex, poorly documented, very stateful interface.  As one example of the
complexity, some platforms interpret one of its fields as a microsecond count,
while others interpret the *same* field as a nanosecond count (look at what
STA_NANO does).
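
A small sketch of that ambiguity, assuming the Linux convention in which the STA_NANO bit of the status word selects nanosecond rather than microsecond units for the offset field. The constant value is the one in Linux's sys/timex.h and should be treated as an assumption on other platforms; the helper name `offset_seconds` is hypothetical.

```python
STA_NANO = 0x2000  # value from Linux <sys/timex.h>; assumed, may differ elsewhere


def offset_seconds(offset, status):
    """Convert a raw timex offset field to seconds.

    The same integer means nanoseconds when STA_NANO is set in the
    status word and microseconds when it is clear.
    """
    scale = 1e-9 if status & STA_NANO else 1e-6
    return offset * scale
```

The same raw value is read a factor of a thousand apart depending on one status bit, which is the kind of runtime-selected behavior that makes the interface hard to audit.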

It is entirely possible that the calls within scope of KERNEL_PLL are
exercising what is in effect a different API than the ones outside,
because they're talking to different sets of kernel facilities in ways
that are selected at runtime by different mode bits.  If this is the
case, HAVE_SYS_TIMEX_H === HAVE_KERNEL_PLL is accidental, not
entailed.  This concern is why I haven't abolished the distinction.

It is also possible that the above paragraph is worrying too much.  It
might be that autoconf made this assumption because it has been
correct for years - the incompatible variants were ancient
big-iron systems now dead.  It's difficult to know.  To be sure I'd
have to do a really close audit of all the calls in NTP, then read a
bunch of the library and kernel code it's cooperating with across our
target platforms.

I'll probably want to do that someday, for documentation purposes, but
I don't think there will be time for that before 1.0 even under Case Blue.

And remember, even if it turns out they're equivalent, we have
non-HAVE_SYS_TIMEX_H (and thus non-HAVE_KERNEL_PLL) platforms to worry
about.  Yes, INSTALL and hacking.txt are behind on this; that's
because until I started digging deeper recently I did not actually realize
that OS X was an exception that we're supporting anyway.

(I've been toying with the idea of having business cards printed up
with an image of a fedora and bullwhip on them and the title "Forensic
Systems Architect".)

Next, an explanation of why I am now very doubtful TESTFRAME will really
work as intended even if we get past all these obstacles.  It has to
do with free parameters in the state machine and their effect on
reproducibility, and it's something I should have thought through
sooner but didn't until I started looking at a lot of capture logs
recently.

I imagined TESTFRAME based on successful experience with gpsfake.  There,
the plan to treat gpsd as a big honkin' FSM with reproducible behavior
worked because while the transformation from GPS packets to JSON is
complicated, it has very few free parameters.  One is the current
century.  Another is the time of last GPS week rollover.  A third
is the current leap-second offset.  A fourth is UERE - the basic
parameter expressing magnitude of error that you multiply by
terms from the satellite-view covariance matrix to get directional
error estimates.

If you change any of these free parameters, you have a different FSM
and odds are good that the correct output transformation will change
visibly and break your tests.  (For example, changing UERE changes
most of the reported error estimates.)
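
As a rough illustration, assume each reported error estimate is simply UERE times one geometry term, and that a regression test compares current output against stored check data. The function names, the axis labels and every number here are made up for the sketch; this is not gpsd's actual computation.

```python
def error_estimates(uere, dops):
    """Scale each geometry term by UERE (toy model, not gpsd's code)."""
    return {axis: round(uere * dop, 3) for axis, dop in dops.items()}


def regression_passes(uere, dops, expected):
    """True when current output still matches the stored check data."""
    return error_estimates(uere, dops) == expected


dops = {"x": 1.2, "y": 0.9, "v": 2.1}   # made-up covariance terms
stored = error_estimates(8.0, dops)     # check data recorded at UERE = 8.0
```

With the free parameter unchanged, `regression_passes(8.0, dops, stored)` holds; move UERE to 7.0 and every estimate shifts, so the same stored check data no longer matches.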

Fortunately, this is a small set of free parameters and, more
importantly, they change only rarely and at predictable times.  They
do *not* change during normal evolution of GPSD.  Thus, when a
regression test goes sproing you know pretty reliably whether it's
because of a scheduled, rare change in a free parameter or because of
a bug.

This regularity - the ease of distinguishing signal from noise, and
the low odds that free parameters will have to change during normal
development - is what makes gpsfake effective. You can re-use tests
for many years at a time (and we do).

What snuck up on me as I was looking through lots of capture logs,
trying to make replay work, just before we hit the two-code-paths
problem hard, is that ntpd is very different. It has lots of free
parameters. And not only will they change during normal development,
they're going to have to in order for us to make any progress on
things like convergence speed.

The recent changes in default poll intervals, and the allowable minima
and maxima the protocol machine hunts in, are examples of this.  So is
the default of minsane.  Anything the tinker command sets is also a
free parameter.  In effect, the entire logic of the sync algorithms is
a gigantic free parameter with no real equivalent in gpsd's simple,
straight-line data transformations.

Think of the code you're trying to test as an FSM with an internal
mutation rate: the rate at which its free parameters change, for
whatever reason.  As the mutation rate rises, the expected value of
the useful lifetime of a capture log (that is, the span over which it
will constitute a valid error check) falls. The signal-to-noise ratio
in test breakage drops.
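
One way to put rough numbers on this, under a deliberately crude assumption: each change to the code independently has probability p of altering a free parameter and so invalidating a capture log. The number of changes until the log breaks is then geometrically distributed with mean 1/p. The function names are hypothetical and the model is a sketch, not a measurement of ntpd.

```python
import random


def expected_lifetime(p):
    """Mean number of changes until a capture log is invalidated."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return 1.0 / p


def simulated_lifetime(p, trials=20000, seed=1):
    """Estimate the same mean by simulation, as a sanity check."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        changes = 1
        while rng.random() >= p:  # this change left the free parameters alone
            changes += 1
        total += changes
    return total / trials
```

Under this model a log facing a 5% per-change mutation rate lasts 20 changes on average, while one facing a 50% rate lasts 2; the lifetime shrinks in direct proportion as the rate climbs.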

Furthermore, capture-log brittleness is most likely to mess us over
exactly when we most need it not to - during periods of heavy development.
Again, this is a problem gpsd didn't have because normal development
didn't touch its free parameters.

My judgment is that even if we sacrifice OS X and Windows to banish
the two-paths problem, the useful-lifetime span of capture logs will
probably have exactly the wrong statistics - short, and shorter the
harder we are working on improvements.

The conclusion that follows is that TESTFRAME is not viable.  It's
my biggest mistake on this project; I should have grasped the
free-parameter problem sooner.  I *did* see the two-paths problem
coming, but expected I could punch through it somehow when it
became a blocker.  That was wrong, too.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>