Testing

Eric S. Raymond esr at thyrsus.com
Mon Jul 15 13:04:52 UTC 2019


Hal Murray <hmurray at megapathdsl.net>:
> 
> > It's...hm...maybe a good way to put it is that the structure of the NTPsec
> > state space and sync algorithms is extremely hostile to testing.
> 
> I still don't have a good understanding of why TESTFRAME didn't work.  I can't 
> explain it to somebody.
> 
> We've got
>   code mutations
>   hidden variables in the FSM
>   hostile
> 
> So what makes it hostile?  Is it more than just complexity?

Once you have TESTFRAME in place, and are logging all the input state
including things like the system clock and PLL state, and all the output
state too, it's just complexity.  But that "just" is a doozy.

Let me refocus on the basic question here.  Let's say you've put
TESTFRAME back in place and finished it.  You can now start ntpd up
and make captures that include all of the input state of the FSM and
all of its outputs.

How do you tell that any given capture represents correct operation?
What check do you apply to the captured I/O history to verify that the
sync algorithms were functioning as intended when the capture was
taken?  *That's* the hard part - not the mechanics of TESTFRAME
itself, which is just tooling.

If you have such a check, then TESTFRAME can be used to verify
correctness of operation.  You do it the way I built a test suite for
GPSD.  You take a whole bunch of captures, run your magic check on the
relationship between input and output to verify that operation is
correct on each, and stash the captures in the tests directory.  Then,
when you change the code, you rerun each of the captures.  If actual
and expected outputs don't diverge, you're good.
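
For concreteness, the driving loop is the trivial part.  Here's a rough
sketch in Go; the --replay option, the ./ntpd path, and the .chk suffix
are placeholders I just made up for illustration, not anything ntpd or
TESTFRAME actually has:

package main

import (
    "bytes"
    "fmt"
    "os"
    "os/exec"
    "path/filepath"
)

func main() {
    captures, _ := filepath.Glob("tests/*.capture")
    failures := 0
    for _, capture := range captures {
        // The stashed, known-good output recorded when the capture was blessed.
        expected, err := os.ReadFile(capture + ".chk")
        if err != nil {
            fmt.Fprintf(os.Stderr, "no check file for %s: %v\n", capture, err)
            failures++
            continue
        }
        // Replay the captured input state through the current build.
        actual, err := exec.Command("./ntpd", "--replay", capture).Output()
        if err != nil {
            fmt.Fprintf(os.Stderr, "replay of %s failed: %v\n", capture, err)
            failures++
            continue
        }
        if !bytes.Equal(actual, expected) {
            fmt.Fprintf(os.Stderr, "FAIL: %s: actual and expected outputs diverge\n", capture)
            failures++
        }
    }
    fmt.Printf("%d captures, %d failures\n", len(captures), failures)
    if failures > 0 {
        os.Exit(1)
    }
}

All the difficulty hides in how the known-good outputs got blessed in
the first place - which is the check-procedure problem again.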

(You still don't know how to compose captures to trigger specified
corner cases, but there's no point in worrying about that problem until
you have your check procedure.)

In GPSD, the magic check is just looking at a capture, because the
correctness of the relationship between input packets and output JSON
is pretty easy for an unaided Mark I Brain to verify.  In reposurgeon
it's a little trickier to verify that a load/checkfile pair represents
correct operation, but not all that difficult for small, carefully
crafted cases.  You end up crafting a lot of small cases - I have 145
of them.

I don't know how to write that check for NTPsec captures - it sure as
hell can't be done by eyeballing the packet traffic.  That's the first
part of what I mean by "hostile to testing"; there are other issues,
but until we know how to address this one there's little point in even
enumerating them.

In the absence of such a check procedure for captures,
TESTFRAME is nearly useless. You can use it to test
same-input-same-output stability over time but that's about it.

> Why isn't this sort of testing even more valuable when things get complex?

Of course it would be more valuable because things are complex, if it
were practical at all.  I don't think it is.  I would be very happy if
you were to prove me wrong.

> How do we tell that it is working without TESTFRAME?  I eyeball ntpq -p and/or 
> graphs of loopstats and friends.  That's using the stats files as a summary of 
> the internal state.

OK, you have something resembling a check procedure.  I can't do that.  I
don't know enough about the visible signs of correct vs. incorrect 
operation to trust my ability to tell.
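
The mechanical half of that could be scripted; the judgment is what I
can't supply.  Here's a rough Go sketch that walks a loopstats file
(the documented field order is MJD, seconds, offset, frequency,
jitter, ...) and flags any offset over a bound.  The 10ms bound is an
arbitrary number I pulled out of the air, which is exactly the point:

package main

import (
    "bufio"
    "fmt"
    "math"
    "os"
    "strconv"
    "strings"
)

func main() {
    f, err := os.Open("loopstats")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer f.Close()

    const bound = 0.010 // seconds; an arbitrary example threshold, not a correctness criterion
    worst := 0.0
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        fields := strings.Fields(scanner.Text())
        if len(fields) < 5 {
            continue
        }
        // The third field of a loopstats line is the clock offset in seconds.
        offset, err := strconv.ParseFloat(fields[2], 64)
        if err != nil {
            continue
        }
        if math.Abs(offset) > worst {
            worst = math.Abs(offset)
        }
    }
    fmt.Printf("worst clock offset: %.6f s\n", worst)
    if worst > bound {
        fmt.Println("offset exceeded bound; a human should look at this")
        os.Exit(1)
    }
}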

Now we get to the next kind of hostility to testing.  How do you compose
captures so they explore some desired set of transitions in the daemon?
I know how to do that in GPSD and reposurgeon; I haven't the faintest
clue how to do it in NTPsec.

And the next. Test-pair brittleness. Reposurgeon tests never break
once composed unless I decide to deliberately change the behavior of a
feature. In GPSD test pairs will break only when the leap offset goes
up, or on era rollover.  Those are rare events because GPSD relies on
very little hidden or retained state between packet bursts.

ntpd retains a huge amount of state (packet queues for median
filtering, etc).  That's why reasoning forward from inputs to outputs, 
or backwards from outputs to inputs, would be brutally hard
even for someone with perfect knowledge of the sync-algorithm
theory of operation.  There's sensitive dependence on that state.
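
Here's a toy illustration in Go - nothing like ntpd's real clock
filter, just the shape of the problem.  Two filter instances are fed
the identical new sample and report different answers purely because
of what they already hold:

package main

import (
    "fmt"
    "sort"
)

// medianFilter is a toy: it keeps the last eight samples and answers
// with their median.  The retained history is invisible on the wire.
type medianFilter struct {
    window []float64
}

func (m *medianFilter) update(sample float64) float64 {
    m.window = append(m.window, sample)
    if len(m.window) > 8 {
        m.window = m.window[1:]
    }
    sorted := append([]float64(nil), m.window...)
    sort.Float64s(sorted)
    return sorted[len(sorted)/2]
}

func main() {
    quiet := &medianFilter{window: []float64{0.001, 0.002, 0.001}}
    noisy := &medianFilter{window: []float64{0.040, 0.055, 0.035}}
    // Identical input sample, different retained history, different output.
    fmt.Println(quiet.update(0.010)) // 0.002
    fmt.Println(noisy.update(0.010)) // 0.040
}

Now multiply that by every piece of retained state in the daemon and
every tuning parameter feeding it.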

> Did TESTFRAME capture the stats files?

I think they became part of the TESTFRAME capture.  If they didn't, that would
be easy to fix.  That's just mechanics, it's not the hard part.

> With a bit more logging, we could probably log enough data so that it would be possible to do the manual verification of what is going on.  We would have to write a memo explaining how it works, maybe that would include chunks of pseudo code.

Good luck. I did not have the knowledge base to do that. If you do, more power to you.

> How much of the problem is that Eric didn't/doesn't understand the way the inner parts of ntpd work?  I've read the descriptions many times but I still don't understand it well enough to explain it to somebody.

Hal, I don't think *anybody* does. Daniel comes close - he can explain
the theory better than I can. But there's a gap between cup and lip;
knowing how an idealized NTP-like sync works is not the same
thing as being able to do detailed enough predictive modeling of the
implementation to *compose* test cases, the way I can in GPSD or
reposurgeon.

> I'm pretty sure we gave up on systems that don't support adjtimex.  OpenBSD doesn't have it, but does have enough to slew the clock.  We dropped support for OpenBSD when that shim was removed.

That's not how I remember it.  Both code paths are still present in the codebase.

> How far did you get with TESTFRAME?  Do you remember why you decided to give up?  Was there something in particular, or did you just get tired of banging your head against the wall?
> 
> How many lines of code went away when you removed it?

commit e3fa301b1ae9d5502f955b47b60fe067e15d0755
Author: Matt Selsky <matthew.selsky at twosigma.com>
Date:   Wed Feb 1 02:05:14 2017 -0500

    Remove ntpd flags related to TESTFRAME

commit df63da97a1563572b2f4252d67998e6342f4f207
Author: Eric S. Raymond <esr at thyrsus.com>
Date:   Fri Oct 7 00:44:20 2016 -0400

    TESTFRAME: Withdraw the TESTFRAME code.
    
    There's an incompatible split between KERNEL_PLL and non-KERNEL_PLL
    capture logs - neither can be interpreted by the replay logic that
    would work for the other.
    
    Because we can't get rid of KERNEL_PLL without seriously hurting
    convergence time, this means the original dream of a single set of
    regression tests that can be run everywhere by waf check is dead.
    Possibly to be revived if we solve the slow-convergence problem
    and drop KERNEL_PLL, but that's far in the future.
    
    Various nasty kludges could be attempted to partly save the concept
    by, for example, having two different sets of capture logs.  But, as
    the architect of TESTFRAME, I have concluded that this would be
    borrowing trouble we don't need - there are strong reasons to suspect
    the additional complexity would be a defect attractor.
    
    One problem independent of the KERNEL_PLL/non-KERNEL_PLL split is that
    once capture mode was (mostly) working, it became apparent that the
    log format is very brittle in the sense that captures would easily be
    rendered invalid for replay by minor logic changes or even changes
    in tuning parameters for the sync algorithms.

At that time I did not yet fully understand the brittleness problem;
I now think the last paragraph greatly understates it.

If you revert those commits, in that order, you'll get most of TESTFRAME
back.  Some hand-patching will be required because this was before
Daniel's big refactor landed, but that's a small effort compared to
recreating TESTFRAME from scratch (which took me months!).  I did
document the assumptions and architecture pretty carefully.

I think the only piece of mechanics missing is mocking UDP input.
The code for landing input packets was much hairier then than it
is now.

> Would it be interesting for me to take a try?

Well, sure.  You're one of the very few who knows the problem space
(not the code, but the problem space) better than I do.  I don't think
much of your odds of success, but they're not zero.

> But back to the big picture.  How can we test corner cases?
> 
> Is it reasonable to look for patterns in the log file?
> 
> Is it reasonable to look for patterns in the output of ntpq -p?  Graphs?

I think I've already implied answers to these questions above.

> When you do a Go port, what can you do to make testing easier?

Not much.  I think mocking UDP input will be easier in Go-land, but
the serious problems are language-independent and not mechanical.
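
To show what I mean about the mechanics: if packet receipt goes through
a small interface rather than straight off the socket, a replay harness
can drop in captured packets with their captured timestamps.
Everything named here is invented for illustration, not code from
either codebase:

package main

import (
    "fmt"
    "net"
    "time"
)

// packetSource is the seam: production code would satisfy it with a
// real UDP socket, a replay harness with a capture.
type packetSource interface {
    ReadPacket() (payload []byte, from net.Addr, when time.Time, err error)
}

type capturedPacket struct {
    payload []byte
    from    net.Addr
    when    time.Time
}

// replaySource hands back previously captured packets in order.
type replaySource struct {
    packets []capturedPacket
    next    int
}

func (r *replaySource) ReadPacket() ([]byte, net.Addr, time.Time, error) {
    if r.next >= len(r.packets) {
        return nil, nil, time.Time{}, fmt.Errorf("capture exhausted")
    }
    p := r.packets[r.next]
    r.next++
    return p.payload, p.from, p.when, nil
}

func main() {
    var src packetSource = &replaySource{packets: []capturedPacket{
        {
            payload: make([]byte, 48), // a blank NTP-sized packet, just for show
            from:    &net.UDPAddr{IP: net.IPv4(192, 0, 2, 1), Port: 123},
            when:    time.Date(2019, 7, 15, 0, 0, 0, 0, time.UTC),
        },
    }}
    payload, from, when, err := src.ReadPacket()
    fmt.Println(len(payload), from, when, err)
}
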
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>