ntpq mru hang

Mon Dec 19 06:50:39 UTC 2016

I added + and - as keyboard commands to ntpmon to bump the debug level.
The first + opens ntpmon.log in append mode.

ntpq's debug commands now open ntpq.log.

I haven't sorted out what gets printed at each level.
I bumped the hex packet printout to level 5.
Level 4 prints just the text, which works fine for mru debugging, is much
easier to read, and saves screen space.

We are getting close to being able to debug this stuff.

----

>> A successful batch returns a new nonce.  ntpq asks for more using the old 
>> one.  Nothing comes back.

> Don't do that, then! 

I assume that's sarcastic.  All I'm doing is running ntpq/mru pointing at a 
server that has enough mru slots to need more than one batch.  Testing on a 
local server never triggers that case.
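
To make the failure in the quoted report concrete, here is a toy model of it. Everything in it is hypothetical (FakeServer, read_all, the nonce strings), and it assumes, as the report suggests, that a request carrying a stale nonce simply gets no answer. It is not the real ntpd or ntpq code.

```python
class FakeServer:
    """Hypothetical stand-in for ntpd: honors only the newest nonce."""

    def __init__(self, slots, batch_size):
        self.slots = list(slots)
        self.batch_size = batch_size
        self.pos = 0
        self.serial = 0
        self.nonce = "nonce-0"

    def request(self, nonce):
        # Assumption: a stale nonce gets no reply, so the client times out.
        if nonce != self.nonce:
            return None
        batch = self.slots[self.pos:self.pos + self.batch_size]
        self.pos += len(batch)
        # Every successful batch comes back with a fresh nonce.
        self.serial += 1
        self.nonce = "nonce-%d" % self.serial
        return batch, self.nonce


def read_all(server, first_nonce, reuse_old_nonce=False):
    """Read batches until an empty one arrives or a request goes unanswered."""
    nonce = first_nonce
    got = []
    while True:
        reply = server.request(nonce)
        if reply is None:
            return got, "timeout"
        batch, new_nonce = reply
        got.extend(batch)
        if not batch:
            return got, "done"
        if not reuse_old_nonce:
            # The fix: carry the newest nonce into the next request.
            nonce = new_nonce
```

With five slots and a batch size of two, the default path reads all five slots, while reuse_old_nonce=True stalls after the first batch, which matches the symptom.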

Perhaps we should add a debugging hack to ntpd that would preload the mru 
list with a batch of crap so we can test this case with a local server that 
doesn't get enough traffic to do it naturally.

> If it's relatively easy to do, I'd support returning a BADVALUE server error
> in this case.

Maybe.  That would avoid a timeout, but we already had a previous timeout so we should know what is going on.

How about waiting until we learn more?

> That is relatively easily arranged.  I could add a command to flip it to an
> all-MRU display. 

I think that would be great.

>> I think we need an option to ntpq/mru to tell it how many slots to ask for,
> We have that. I added it a few days ago.  ntpmon uses it to avoid hanging on
> servers with long MRU lists. See "recent" on the Mode 6 page. 

I think that ntpd is ready.  We don't have a way to use it from ntpq.

> Trouble is, "all that fits in a batch" can differ depending on the length of
> the data literals in records... 

I'm willing to round down.  (I assume we can come up with a worst-case length.)  I'm willing to specify how many manually.  I'd like to be able to specify the number of packets in a batch to experiment.  (I'm happy to use the editor for that until we learn more.)
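
The rounding down could look something like this. It is only a sketch: the function name is made up, the byte counts in the example are assumed figures rather than measured ones, and it takes the conservative view that an entry is never split across packets.

```python
def entries_per_batch(packets, payload_bytes, worst_entry_bytes):
    """Conservative count of MRU entries that fit in one batch.

    Assumes every entry is as long as the worst case and that no entry
    is split across packets, so the result errs on the low side.
    """
    if packets < 1:
        raise ValueError("need at least one packet per batch")
    if worst_entry_bytes < 1 or worst_entry_bytes > payload_bytes:
        raise ValueError("worst-case entry must fit in one packet")
    # Floor division does the rounding down.
    return packets * (payload_bytes // worst_entry_bytes)
```

For example, 4 packets of 468 payload bytes with a 200-byte worst case gives 8 entries.  Real entries are usually shorter, so some space in each batch goes unused.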

The big picture is that I'm trying to understand more about what's going on with a busy pool server.  Tangled up with that is that the new/python ntpq isn't working yet and/or I don't know how to use what does work.

But ntpq-classic doesn't work either.  I think the problem is that slots are getting recycled faster than it can read them so it never converges.

I think being able to read N newest slots would be a very important tool.  A script around ntpq could collect samples of data to be manually inspected.  It might be better if the lstint column were turned into seconds-this-day.
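
For the seconds-this-day idea, a wrapper script could do the conversion itself.  A minimal sketch, assuming lstint is the number of seconds between the sample time and the last packet from that address, and that the script records the (UTC) time it took the sample; seconds_this_day is a made-up name:

```python
from datetime import datetime, timedelta, timezone


def seconds_this_day(sample_time, lstint):
    """Convert an lstint value to seconds since UTC midnight.

    sample_time is the timezone-aware UTC time the sample was taken;
    lstint is assumed to be seconds elapsed since the last packet.
    """
    last_seen = sample_time - timedelta(seconds=lstint)
    midnight = last_seen.replace(hour=0, minute=0, second=0, microsecond=0)
    return int((last_seen - midnight).total_seconds())


# A sample taken at 06:50:39 UTC with lstint=39 means the last packet
# arrived at 06:50:00 UTC.
sample = datetime(2016, 12, 19, 6, 50, 39, tzinfo=timezone.utc)
print(seconds_this_day(sample, 39))
```

An lstint that reaches back past midnight lands on the previous day's count, so samples taken around midnight need the date carried along as well.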

How does the MRU list stuff know when it is finished?  We may need to teach that area to give up sooner and print what it has.  Mumble.
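
One possible shape for "give up sooner and print what it has", as a sketch only.  fetch_batch is a hypothetical callable that returns one batch plus a flag saying whether the server has more; the batch and time limits are arbitrary.  This is not how ntpq currently decides it is done.

```python
import time


def collect_mru(fetch_batch, max_batches=10, deadline_s=5.0,
                clock=time.monotonic):
    """Collect MRU entries, giving up after a batch or time limit.

    Returns (entries, complete).  complete is False when we gave up
    early, so the caller can print a partial list and say so.
    """
    entries = []
    start = clock()
    for _ in range(max_batches):
        if clock() - start > deadline_s:
            break
        batch, more = fetch_batch()
        entries.extend(batch)
        if not more:
            return entries, True
    return entries, False
```

On a server that recycles slots faster than they can be read, this returns a partial list after max_batches batches instead of looping forever.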

-- 
These are my opinions.  I hate spam.