ntpq mru hang
Eric S. Raymond
esr at thyrsus.com
Mon Dec 19 18:53:59 UTC 2016
Hal Murray <hmurray at megapathdsl.net>:
> esr at thyrsus.com said:
> > Wait, then I have failed to understand your bug report. This can happen
> > in a different, less odd way than the nonce update getting lost by packet
> > drop?
> Yes. The no-packet-lost case is broken if it needs a second batch.
> Each batch gets a new nonce. The code doesn't do anything with it. So
> asking for the second batch is using a stale nonce.
*Each* batch? I'm looking at the ntp_control.c code. I don't see how this
is possible. I looked for CTL_OP_REQ_NONCE, and it looks like the only time
a response of that type is shipped is when the client requests one.
The Python client-side code thinks it should request a nonce at the
beginning of the fetch and every four span requests thereafter. I
went back and re-checked the C code to make sure I hadn't mistranslated
this. I hadn't.
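As a minimal sketch of the client-side behavior described above (request a
nonce at the start of the fetch, then re-request one every four span
requests), counting requests rather than tracking elapsed time -- the class
and names here are illustrative, not the actual ntpq/pylib identifiers:

```python
NONCE_REFRESH_INTERVAL = 4  # span requests between nonce re-fetches

class MRUFetcher:
    """Illustrative model of the current count-based refresh logic."""
    def __init__(self, request_nonce):
        self.request_nonce = request_nonce  # callable returning a fresh nonce
        self.nonce = request_nonce()        # fetched at the start of the run
        self.spans_since_nonce = 0

    def next_request_nonce(self):
        # Re-request a nonce every four span requests.  Note that nothing
        # here looks at the clock, so a slow batch can outlive the nonce.
        if self.spans_since_nonce >= NONCE_REFRESH_INTERVAL:
            self.nonce = self.request_nonce()
            self.spans_since_nonce = 0
        self.spans_since_nonce += 1
        return self.nonce
```

The point of the sketch is that the refresh trigger is a request counter,
which only coincides with the server's expectations when requests are fast.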
What I do now see is that nonces are supposed to age out after 16 seconds.
(ntpd/ntp_control.c, line 3054 at the end of the validate_nonce() function)
The ntpd side is not counting requests at all; nonces expire purely by age.
Right there I see a problem...
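To make the server side of the mismatch concrete, here is a toy model of
time-based nonce expiry, assuming only the 16-second age-out mentioned above.
This is a stateful stand-in for illustration; the real validate_nonce() in
ntp_control.c works differently and does not keep a table of issued nonces:

```python
import time

NONCE_LIFETIME = 16  # seconds, per the age-out at the end of validate_nonce()

class NonceIssuer:
    """Toy model: a nonce is valid until it is 16 seconds old.  The server
    keeps no count of how many requests were served against it."""
    def __init__(self):
        self.issued = {}  # nonce -> issue timestamp

    def issue(self, clock=time.monotonic):
        nonce = "nonce-%f" % clock()  # stand-in; real nonces are not plain strings
        self.issued[nonce] = clock()
        return nonce

    def validate(self, nonce, clock=time.monotonic):
        stamp = self.issued.get(nonce)
        return stamp is not None and clock() - stamp < NONCE_LIFETIME
```

Under this model a client that refreshes by request count, not by age, goes
stale the moment four span requests take longer than 16 seconds.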
> > Or we could just write a script using the Python Mode 6 library to flood a
> > running ntp with bogus Mode 6 packets. That way we wouldn't have to add
> > cruft in C.
> Just generating crap won't help. You need to forge the source IP Address.
> (I think you could do it semi-cleanly by setting up a bazillion extra IP
> Addresses on your driver. I forget what they are called.)
I think we have a protocol issue to solve first. It looks like the client
code was *never* properly matched to the server-side. It only happened to
work if 4 requests could always be processed within 16 seconds.
Maybe this accounts for Sanjeev's bug, #206 on the tracker.
> > What is failing to work exactly?
> Currently, ntpq dies as soon as it asks for the second batch.
> I've seen it ask for a new nonce, but that didn't recover. I didn't
> investigate since that was the same time I saw that it wasn't picking up the
> new nonce.
> The old ntpq doesn't work either. It used to work before the traffic jump.
> I assume something got pushed over the edge. The obvious thing is slots
> getting updated faster than they can be retrieved. I think we need to add a
> bunch of counters but I don't know where to put them.
I wish I'd known that sooner. I've been beating myself up trying to figure
out what I could have gotten wrong in the Python translation.
I think I need to fix the aging code in the client first. Then let's see
what the transaction looks like.
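A minimal sketch of what "fix the aging code in the client" might look like,
under the assumption that the right trigger is nonce age rather than request
count; the class name, the safety margin, and the injectable clock are all
hypothetical choices for illustration:

```python
import time

NONCE_LIFETIME = 16      # server-side age-out, seconds
NONCE_SAFETY_MARGIN = 8  # refresh well before expiry (illustrative value)

class AgedNonce:
    """Sketch of an age-based fix: track when the nonce was fetched and
    re-request it before the server's 16-second expiry can bite."""
    def __init__(self, request_nonce, clock=time.monotonic):
        self.request_nonce = request_nonce
        self.clock = clock
        self.nonce = request_nonce()
        self.fetched_at = clock()

    def current(self):
        # Refresh once the nonce is within the safety margin of expiring,
        # regardless of how many span requests have gone out.
        if self.clock() - self.fetched_at >= NONCE_LIFETIME - NONCE_SAFETY_MARGIN:
            self.nonce = self.request_nonce()
            self.fetched_at = self.clock()
        return self.nonce
```

This removes the hidden dependence on request rate: slow batches refresh
just as reliably as fast ones.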
Eric S. Raymond <http://www.catb.org/~esr/>