Wonky NTP startup and the incremental-configuration problem

Fri Jun 10 08:34:17 UTC 2016

(Jay: You can skip down to where it says "DEC Alpha".)

Hal Murray <hmurray at megapathdsl.net>:
> 
> esr at thyrsus.com said:
> > ntpq has dangerous operations that tweak parameters of the time-sync
> > algorithms on the fly - operations that can be triggered remotely. Or so I
> > gather from things Hal Murray has said; my outside view is weak here, I've
> > never explored those operations. 
> 
> ntpq can be used to tweak things, but it takes a password.
> I've never used it that way.

And if *you* haven't...I begin to wonder if 99% of the userbase even
knows this feature exists.

I'm sorely tempted to just rip everything password-protected out of
ntpq and server side both, muttering "security" if we get any
pushback.

Hal: Do you think we'd get any pushback?

Mark: What does your political spider-sense say about this?

> esr at thyrsus.com said:
> > How it should work is that there is just one way to hack your configuration,
> > modifying ntp.conf, and restarting the daemon to reread it is a low-cost
> > operation that produces only transient synchronization glitches.  Of course
> > this would also imply faster crash recovery.
> 
> It won't help with system crashes.
> ntpd doesn't crash often enough for that to be a problem.

I'm not arguing either point, but it would help me to know *why* you're
sure it won't help with system crashes.  If my mental model isn't wildly off,
it ought to help with any outage short enough that propagation delays
from the up-stratum servers remain similar.

> > 1. Why is convergence from a standing start so slow?
> 
> We should collect a few examples.  In particular, compare various OSes.
> 
> Mills was very good about that sort of stuff.  But lots of people have 
> "fixed" things in the kernel.  A while ago, Linux rewrote the time keeping 
> code in the kernel.  They may have broken one of his assumptions.
> 
> It might be interesting to try really old kernels.
> 
> Does anybody have access to an old DEC Alpha running Tru64?  1/2 :) I'm 
> pretty sure Mills was happy with them and I doubt if anybody has made changes 
> in that area.  (I remember some comment about their crystal being stable 
> under temperature.  I think it was SAW.)

I've copied Jay Maynard, who I believe has an Alpha and might be willing
to let us use it.

> esr at thyrsus.com said:
> > 2. If there is a fundamental reason for the slowness, shouldn't it
> >    be possible to dump some kind of state that would allow ntpd
> >    to reread it and resume from a running start? The key question
> >    is whether we can identify that state. 
> 
> That feels like more complexity than it is worth.

Believe me, I don't *want* to do it.  The last thing this code needs is
more chunks of opaque state being slung around.

> esr at thyrsus.com said:
> > (This is also why I haven't yet removed the SAVECONFIG code, much as I'd
> > like to. Shortly after the Penguicon meeting I found that the beliefs we
> > based the decision to remove it on were inaccurate - it is not in itself a
> > potential security hole, and it has a real use when runtime config
> > operations are allowed, which is to dump your actual configuration so you
> > can check the cumulative results of your tweaks.) 
> 
> It's disabled by default.  It's not really useful in my opinion since it 
> writes stuff out in the order it chooses thus sorting the file.  I think it 
> drops comments.  So it's not a useful way to maintain a config file.

OK, that's good enough for me.  I'll take it out.  IIRC doing this is going
to simplify some ugly crap in the ntpd initialization logic. 

> There may be a start-too-slow bug.  I think I may have seen some of them, but 
> there was enough going on that I haven't looked into it carefully.  The
> with-PPS case may be different.
> 
> What are your goals?  What is good enough?  What is not good enough?

I have no feel for how fast convergence to good time "should" be, so
I'm relying on people like Gary and you with operational experience to
tell me.  To be honest, my goal is to hear somebody who *does* have
that ops experience say "Wow.  That's much faster convergence.  I'm not
afraid of restarts any more."
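
To put numbers on "faster convergence", one rough approach is to read
ntpd's loopstats file after a restart and find the point where the offset
settles.  The sketch below assumes the standard loopstats layout (MJD day,
seconds past midnight, offset in seconds, then further fields).  The
function names and the 1 ms threshold are hypothetical choices for
illustration, not anything ntpd defines.

```python
def parse_loopstats(lines):
    """Yield (seconds_since_first_sample, offset_seconds) per loopstats line.

    Assumes field 0 is the MJD day, field 1 is seconds past midnight and
    field 2 is the clock offset in seconds.  Short lines are skipped.
    """
    t0 = None
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        t = int(fields[0]) * 86400 + float(fields[1])
        if t0 is None:
            t0 = t
        yield t - t0, float(fields[2])


def convergence_time(samples, threshold_s=0.001):
    """Return the elapsed time at which |offset| goes below threshold_s
    and stays below it for every later sample, or None if it never does.

    threshold_s is an arbitrary illustrative value (1 ms).
    """
    settled_at = None
    for elapsed, offset in samples:
        if abs(offset) < threshold_s:
            if settled_at is None:
                settled_at = elapsed
        else:
            # An excursion above the threshold resets the clock.
            settled_at = None
    return settled_at
```

Running this over loopstats captured with and without '-g', or on
different OSes, would give one comparable number per run.  It only sees
what ntpd chose to log, so it says nothing about the time before the first
loopstats entry.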

> gem at rellim.com said:
> >         a. the '-g' startup algorithm is acting perversely.  Ntpd just 
> 
> That's an interesting possibility.  Is that based on solid observations or 
> just a wild guess?

I'd like to know that too.  If the non-g mode isn't wandering all over the
park that is valuable information for characterizing the bug.

> gem at rellim.com said:
> > Then put your NMEA refclock at the top of the ntp.conf and watch the 'fun'. 
> 
> That sounds like a handy way to trigger issue #68

Agreed. I'll probably use it.

> > 2. Second, figure out what state ntpd needs to get a running start and toss
> > it to disk for reread on next startup. 
> 
> I can't think of a lot that isn't in the config file or drift file.  Each 
> peer has a buffer of the last 8 requests.  That can be reloaded in a dozen 
> seconds using iburst.

Damn, Hal.  This is why we'd be lost without you.  The existence of
that request buffering is *really important*, and I had no clue it was
there.  Nor about the interaction with iburst.
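
A minimal sketch of why that buffer matters for startup time, under stated
assumptions: the per-peer filter holds the last 8 samples (per Hal's
description), iburst spaces its initial packets about 2 seconds apart, and
the default minimum poll interval is 64 seconds.  The class and function
names are made up for illustration, and the real ntpd filter does
considerably more than pick the lowest-delay sample.

```python
from collections import deque

NSTAGE = 8  # buffer depth, taken from the "last 8" figure above


def fill_time(poll_interval_s, nstage=NSTAGE):
    """Seconds between the first and last sample when filling an empty
    buffer at a fixed polling interval."""
    return (nstage - 1) * poll_interval_s


class PeerFilter:
    """Toy per-peer sample buffer: keeps the newest NSTAGE samples."""

    def __init__(self, nstage=NSTAGE):
        self.samples = deque(maxlen=nstage)

    def add(self, offset, delay):
        # The oldest sample falls off automatically once the deque is full.
        self.samples.append((offset, delay))

    def full(self):
        return len(self.samples) == self.samples.maxlen

    def best_offset(self):
        """Offset of the lowest-delay sample (a simplification)."""
        if not self.samples:
            return None
        return min(self.samples, key=lambda s: s[1])[0]


if __name__ == "__main__":
    print("iburst-style fill (2 s spacing):", fill_time(2), "s")
    print("fill at a 64 s poll interval:", fill_time(64), "s")
```

Under those assumptions the buffer refills in about 14 seconds with
iburst, against roughly seven and a half minutes at a 64-second poll.  That
fits Hal's "dozen seconds" and suggests the buffer contents are cheap to
regenerate rather than worth saving to disk.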

It seems to me that both burst and iburst are in serious need of being better
documented.  Would you do something about that, please?
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>