Testing

Mon Jul 15 02:55:58 UTC 2019

> It's...hm...maybe a good way to put it is that the structure of the NTPsec
> state space and sync algorithms is extremely hostile to testing.

I still don't have a good understanding of why TESTFRAME didn't work.  I
couldn't explain it to somebody else.

The reasons I've heard so far:
  code mutations
  hidden variables in the FSM
  a "hostile" state space

So what makes it hostile?  Is it more than just complexity?

Why isn't this sort of testing even more valuable when things get complex?

---------

> If you try to do this kind of eyeballing in NTPsec it will make your brain
> hurt.  It's not just that the input and output packets are binary, that's
> superficial and fixable with textualization tools I can write in my sleep.
> Fine, let's say you've done that. You've got an interleaved stream of input
> and output timestamps.  How do you reason through the sync algorithms to know
> whether the relationships are correct? 

How do we tell that it is working without TESTFRAME?  I eyeball ntpq -p and/or 
graphs of loopstats and friends.  That's using the stats files as a summary of 
the internal state.
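
As a rough sketch of turning that eyeballing into an automatic check: the code below assumes the usual loopstats column layout (MJD, seconds past midnight, offset in seconds, frequency in PPM, jitter, wander, poll exponent).  The function name, the 1 ms threshold, and the sample values are made up for illustration.

```python
def check_loopstats(lines, max_offset=0.001):
    """Return (mjd, secs, offset) for every sample whose |offset| exceeds max_offset.

    Assumes each loopstats line has 7 whitespace-separated fields:
    MJD, seconds past midnight, offset (s), freq (PPM), jitter, wander, poll.
    Lines that do not fit that shape are skipped.
    """
    bad = []
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # not a complete loopstats record
        try:
            mjd = int(fields[0])
            secs = float(fields[1])
            offset = float(fields[2])
        except ValueError:
            continue  # non-numeric junk in the first three columns
        if abs(offset) > max_offset:
            bad.append((mjd, secs, offset))
    return bad


# Invented sample data, not taken from a real run.
sample = [
    "58679 100.0 0.000010 -3.4 0.000004 0.001 6",
    "58679 164.0 0.002500 -3.4 0.000004 0.001 6",
]
outliers = check_loopstats(sample)
```

A check like this only says the offset stayed inside a band.  It doesn't say whether the sync algorithms got there for the right reasons.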

Did TESTFRAME capture the stats files?

With a bit more logging, we could probably capture enough data to verify by hand what is going on.  We would have to write a memo explaining how it works, maybe including chunks of pseudocode.

How much of the problem is that Eric didn't/doesn't understand the way the inner parts of ntpd work?  I've read the descriptions many times but I still don't understand it well enough to explain it to somebody.  I know the general idea and recognize all the pieces, but I don't have a good feel for how the pieces fit together to make up the big picture.  Maybe I could work up a presentation with the code in one hand and the descriptions in the other.  It would take a while.

> Not only are there time-dependent hidden inputs to the computation from the
> kernel clock and PLL, but they're going to be qualitatively different
> depending on whether you have an adjtimex or not.

There wasn't supposed to be anything hidden.  TESTFRAME was supposed to intercept all the relevant calls like getting the time from the kernel.
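
A minimal sketch of the interception idea, not how TESTFRAME was actually implemented: the code under test gets the time only through a clock object, so one run can record every reading and a later run can replay them.  All class and function names here are hypothetical.

```python
import time


class RecordingClock:
    """Passes readings through from a real source and logs each one."""

    def __init__(self, source=time.time):
        self.source = source
        self.log = []

    def now(self):
        t = self.source()
        self.log.append(t)
        return t


class ReplayClock:
    """Feeds back previously recorded readings, in the same order."""

    def __init__(self, log):
        self._readings = iter(log)

    def now(self):
        try:
            return next(self._readings)
        except StopIteration:
            raise RuntimeError("replay ran out of clock readings")


def measure_interval(clock):
    """Stand-in for code under test: it sees time only through clock.now()."""
    start = clock.now()
    end = clock.now()
    return end - start


# Capture run, using a fake source so the example is deterministic.
recorder = RecordingClock(source=iter([10.0, 12.5]).__next__)
captured = measure_interval(recorder)

# Replay run: same readings in, so the same result should come out.
replayed = measure_interval(ReplayClock(recorder.log))
```

The replay only reproduces the capture if every time-dependent input goes through an intercepted call.  Anything that bypasses the clock object is a hidden input again.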

I'm pretty sure we gave up on systems that don't support adjtimex.  OpenBSD doesn't have it, but does have enough to slew the clock.  We dropped support for OpenBSD when that shim was removed.

------------

How far did you get with TESTFRAME?  Do you remember why you decided to give up?  Was there something in particular, or did you just get tired of banging your head against the wall?

How many lines of code went away when you removed it?

Would it be worthwhile for me to give it a try?  Now isn't a good time and there may be more important things to work on, but I think we should explore and understand this option.

------------

But back to the big picture.  How can we test corner cases?

Is it reasonable to look for patterns in the log file?
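
One possible shape for a log-pattern check, as a sketch only: confirm that a list of regular expressions each match some log line, in the expected order.  The function name and the sample log lines are invented and are not real ntpd messages.

```python
import re


def patterns_in_order(lines, patterns):
    """Return True if each regex in patterns matches some line, in order.

    Each line can satisfy at most one pattern, and lines that match
    nothing are ignored.
    """
    compiled = [re.compile(p) for p in patterns]
    idx = 0
    for line in lines:
        if idx < len(compiled) and compiled[idx].search(line):
            idx += 1
    return idx == len(compiled)


# Invented log lines for illustration only.
log = [
    "step requested",
    "clock stepped",
    "sync regained",
]
ok = patterns_in_order(log, [r"stepped", r"regained"])
wrong_order = patterns_in_order(log, [r"regained", r"stepped"])
```

A corner-case test would then be: set up the conditions, run, and assert that the expected sequence shows up in the log.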

Is it reasonable to look for patterns in the output of ntpq -p?  Graphs?

When you do a Go port, what can you do to make testing easier?

-- 
These are my opinions.  I hate spam.