ntpq mru hang
Eric S. Raymond
esr at thyrsus.com
Mon Dec 19 18:53:59 UTC 2016
Hal Murray <hmurray at megapathdsl.net>:
> esr at thyrsus.com said:
> > Wait, then I have failed to understand your bug report. This can happen
> > in a different, less odd way than the nonce update getting lost by packet
> > drop?
> Yes. The no-packet-lost case is broken if it needs a second batch.
> Each batch gets a new nonce. The code doesn't do anything with it. So
> asking for the second batch is using a stale nonce.
*Each* batch? I'm looking at the ntp_control.c code. I don't see how this
is possible. I looked for CTL_OP_REQ_NONCE, and it looks like the only time
a response of that type is shipped is when the client requests one.
The Python client-side code thinks it should request a nonce at the
beginning of the fetch and every four span requests thereafter. I
went back and re-checked the C code to make sure I hadn't mistranslated
this. I hadn't.
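As a minimal sketch of the client-side behavior described above (request a
nonce at the start of the fetch, then re-request one every four span
requests), counting requests rather than tracking elapsed time -- the class
and names here are illustrative, not the actual ntpq/pylib identifiers:

```python
NONCE_REFRESH_INTERVAL = 4  # span requests between nonce re-fetches

class MRUFetcher:
    """Illustrative model of the current count-based refresh logic."""
    def __init__(self, request_nonce):
        self.request_nonce = request_nonce  # callable returning a fresh nonce
        self.nonce = request_nonce()        # fetched at the start of the run
        self.spans_since_nonce = 0

    def next_request_nonce(self):
        # Re-request a nonce every four span requests.  Note that nothing
        # here looks at the clock, so a slow batch can outlive the nonce.
        if self.spans_since_nonce >= NONCE_REFRESH_INTERVAL:
            self.nonce = self.request_nonce()
            self.spans_since_nonce = 0
        self.spans_since_nonce += 1
        return self.nonce
```

The point of the sketch is that the refresh trigger is a request counter,
which only coincides with the server's expectations when requests are fast.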
What I do now see is that nonces are supposed to age out after 16 seconds.
(ntpd/ntp_control.c, line 3054 at the end of the validate_nonce() function)
The ntpd side is not counting requests at all; nonces expire purely by age.
Right there I see a problem...
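To make the server side of the mismatch concrete, here is a toy model of
time-based nonce expiry, assuming only the 16-second age-out mentioned above.
This is a stateful stand-in for illustration; the real validate_nonce() in
ntp_control.c works differently and does not keep a table of issued nonces:

```python
import time

NONCE_LIFETIME = 16  # seconds, per the age-out at the end of validate_nonce()

class NonceIssuer:
    """Toy model: a nonce is valid until it is 16 seconds old.  The server
    keeps no count of how many requests were served against it."""
    def __init__(self):
        self.issued = {}  # nonce -> issue timestamp

    def issue(self, clock=time.monotonic):
        nonce = "nonce-%f" % clock()  # stand-in; real nonces are not plain strings
        self.issued[nonce] = clock()
        return nonce

    def validate(self, nonce, clock=time.monotonic):
        stamp = self.issued.get(nonce)
        return stamp is not None and clock() - stamp < NONCE_LIFETIME
```

Under this model a client that refreshes by request count, not by age, goes
stale the moment four span requests take longer than 16 seconds.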
> > Or we could just write a script using the Python Mode 6 library to flood a
> > running ntp with bogus Mode 6 packets. That way we wouldn't have to add
> > cruft in C.
> Just generating crap won't help. You need to forge the source IP Address.
> (I think you could do it semi-cleanly by setting up a bazillion extra IP
> Addresses on your driver. I forget what they are called.)
I think we have a protocol issue to solve first. It looks like the client
code was *never* properly matched to the server-side. It only happened to
work if 4 requests could always be processed within 16 seconds.
Maybe this accounts for Sanjeev's bug, #206 on the tracker.
> > What is failing to work exactly?
> Currently, ntpq dies as soon as it asks for the second batch.
> I've seen it ask for a new nonce, but that didn't recover. I didn't
> investigate since that was the same time I saw that it wasn't picking up the
> new nonce.
> The old ntpq doesn't work either. It used to work before the traffic jump.
> I assume something got pushed over the edge. The obvious thing is slots
> getting updated faster than they can be retrieved. I think we need to add a
> bunch of counters but I don't know where to put them.
I wish I'd known that sooner. I've been beating myself up trying to figure
out what I could have gotten wrong in the Python translation.
I think I need to fix the aging code in the client first. Then let's see
what the transaction looks like.
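A minimal sketch of what "fix the aging code in the client" might look like,
under the assumption that the right trigger is nonce age rather than request
count; the class name, the safety margin, and the injectable clock are all
hypothetical choices for illustration:

```python
import time

NONCE_LIFETIME = 16      # server-side age-out, seconds
NONCE_SAFETY_MARGIN = 8  # refresh well before expiry (illustrative value)

class AgedNonce:
    """Sketch of an age-based fix: track when the nonce was fetched and
    re-request it before the server's 16-second expiry can bite."""
    def __init__(self, request_nonce, clock=time.monotonic):
        self.request_nonce = request_nonce
        self.clock = clock
        self.nonce = request_nonce()
        self.fetched_at = clock()

    def current(self):
        # Refresh once the nonce is within the safety margin of expiring,
        # regardless of how many span requests have gone out.
        if self.clock() - self.fetched_at >= NONCE_LIFETIME - NONCE_SAFETY_MARGIN:
            self.nonce = self.request_nonce()
            self.fetched_at = self.clock()
        return self.nonce
```

This removes the hidden dependence on request rate: slow batches refresh
just as reliably as fast ones.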
Eric S. Raymond <http://www.catb.org/~esr/>