ntpq quirks

Eric S. Raymond esr at thyrsus.com
Thu Dec 1 00:27:44 UTC 2016


Hal Murray <hmurray at megapathdsl.net>:
> The mru list is getting printed out oldest first.  I thought you fixed that.

I thought I did too.  Testing...

esr at snark:~/software/ntp-rescue/ntpsec$ ntpq -c mru
Ctrl-C will stop MRU retrieval and display partial results.
Retrieved 12 unique MRU entries and 0 updates.
lstint avgint rstr r m v  count rport remote address
==============================================================================
   229    896  1d0 . 4 4    383   123 mail.coldnorthadmin.com
   242    887  1d0 . 4 4    387   123 level1e.cs.unc.edu
   257   1004  1d0 . 4 4    342   123 66.241.101.63
   646    963  1d0 . 4 4    356   123 208.53.158.34 (mirror)
   679   1068  1d0 . 4 4    321   123 enigma.wiredgoats.com
   707    895  1d0 . 4 4    383   123 104.131.53.252
   776    991  1d0 . 4 4    346   123 104.156.99.226
   929    892  1d0 . 4 4    384   123 helium.constant.com
  1031    960  1d0 . 4 4    357   123 mis.wci.com
 87595  10973    0 . 3 4     20 37200 localhost
342915     55  1d0 . 4 4     11   123 cheezum.mattnordhoff.net
342932     59  1d0 . 4 4     10   123 209-133-217-165.static.hvvc.us

That looks newest first.  Either I'm deeply confused or you're running
a stale version somehow.

> The print-what-you-have after ^C doesn't work if the output is a pipe.  
> (Classic doesn't work either.)
> [murray at second ~]$ ntpq -nc mru | tee foo.log
> Ctrl-C will stop MRU retrieval and display partial resultcds.
> ^Cclose failed in file object destructor:
> sys.excepthook is missing
> lost sys.stderr

This, alas, is unfixable - Python loses its cookies in a way I can do
nothing about.  I'm not sure whether the Python interpeter's behavior
qualifies as a bug or if there's nothing better it can do.  C ntpq's
similar failure to cope hints at the latter.

I'll ask around on #python about this. It's possible there's some workaround;
I doubt it, but it's possible.

> The date in the python version string is close to useless.
> 
> [murray at hgm ~]$ ntpd --version
> ntpd 0.9.6-b9cadf5-hgm Nov 30 2016 03:44:12
> [murray at hgm ~]$ ntpq --version
> ntpq 0.9.6-119-play 2016-11-06T00:05:45Z
> [murray at hgm ~]$ 
> 
> I don't know what the right answer is.
> 
> The date is ages ago.  The "play" is the directory I built it in.  The "hgm" 
> in the ntpd string is what I put there with waf --build-version-tag=

Here's what generates that:

def stdversion():
    return "%s-%s-%s %s" % (ntp.version.VERSION, ntp.version.VCS_TICK,
                            ntp.version.VCS_BASENAME, ntp.version.VCS_DATE)

You're looking at the value of VCS_DATE, which is "The date of the
most recent commit in true ISO-8601/RFC3339 format, including
seconds."

What do you think should be there instead?

> I'm getting hangs with mrulist.  I haven't figured out what's going on.  Yes, 
> it works OK in small local cases.  I'm getting hangs in big local cases (pool 
> server).  If I wait a while then ^C, it prints out a bunch of stuff (in 
> backwards order), then stops.  It doesn't print out any of the
>   346 (0 updates)
> messages, but I don't know if the new version does that normally.
> ...
>   1231      0   90 . 3 4      1   123 92.24.199.193
>   1230      0   90 . 3 4      1 30704 223.24.43.95
>   1230      0   90 . 3 3      1 60456 78.145.21.250
>   1230   0.00   90 . 3 3      3  3688 86.135.151.203
>   1230      0   90 . 3 3      1 44231 175.151.247.31
> [murray at second ~]$
> 
> It's a pool server getting lots of traffic.  The 1200 is ballpark of how long 
> since ntpd was restarted.  So either local packets are getting dropped and 
> the retransmission logic is broken or something else is broken.
> 
> Any suggestions for how to debug this?

My normal procedure is just to add transaction debugging messages until
something jumps out at me.  It usually doesn't take long.

> How about adding a few counters to ntpq so we can at least get some hints?

You can try setting -D 4 to see the packet traffic.

The good news is this is probably not a frag-reassembly problem.  If that
logic were busted we'd almost see much more obvious problems in peer requests,
which usually get two-frag responses.

It is much more likely that the error is in the mrulist() method of the
ControlSession class in pylib/util.py.  If I weren't about to go out
to bimontly Steak Night, I would scrutinize lines 1325-1364 carefully; 
that's where it sets up to ship the second through nth requests for
mrulist spans.  Comparing it to the mrulist code in Classic's C ntpq
would probably be fruitful.

I wouldn't be at all surprised if I fluffed something in translating the C
code and didn't catch it because tickling this case is difficult here. It's
not like the protocol was ever really documented.

Should I be able to reproduce this on tom?
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


More information about the devel mailing list