ntpq, pool

Fri Dec 16 15:04:13 UTC 2016

Hal Murray <hmurray at megapathdsl.net>:
> The traffic on the pool took a big jump recently.  There were a couple of 
> comments on the pool list, but nothing past "lots of traffic".
> 
> Understanding what's going on is the top of my list.

If it matters, I agree with that allocation of your time.

NTPsec won't be hurt if it takes a bit longer for unusual use cases of
ntpq to be gotten right. The traffic spike is a more prompt problem.

> A couple of interesting counters were lost in the recent (month or 2 ago) 
> work on the packet processing.
> 
> I think the mru list doesn't count mode 6 packets.  Fixing that is probably 
> important in being able to monitor DDoS activity or probes.

Hm.  Yes, I think that might be true. One problem I had to correct was 
that during his refactor Daniel moved the ntp_monitor call that logs packets
to the wrong place - down a code path most packets never take.  Moving it
earlier in the processing seemed to solve that problem, but...

/me looks.

...yep, you are right.  There's now a "Move ntp_monitor() call to
where it catches Mode 6 packets." commit.

I don't *think* it will have any bad side effects, but the call is now
just before the authentication logic rather than just after and it's
theoretically possible that ntp_monitor() could do something
relevantly bad to the restrict mask on the way through. Eyes other
than mine should audit.  I've asked Daniel, but he's on vacation.

> How much of the old structure of ntpq did you preserve?

In the front-end, a lot of it - many of the helper functions even
kept their names. I was nervous about changing anything I didn't
understand and that code was pretty messy, so I mostly went for almost
*transliterating* C to Python rather than a free translation.

I'd say there were just two big structural changes (this is
discounting a lot of code that used to be free-standing functions
becoming class methods - that shuffled a lot of logic around, but
didn't *change* a lot of logic or the overall dataflow).

One is the protocol handling getting packed into a class, separated
from the front-end logic, and extensively reworked.  It got changed
more than the front end, because one of my goals was to refactor so that
as much code as possible could be shared with other clients.  That, at
least, succeeded big-time; I was able to write the ntpmon proof of
concept in about 45 minutes.

*However*, that having been said, the lowest level - the logic of
request sending and response-fragment reassembly - moved over to the
new back end almost unchanged.  It resembles the C code it was derived
from very closely except for using Python exceptions to bail out of
panic cases.

The other is that I exploited a happy coincidence.  Cosmetically and
logically the C-ntpq command interpreter looked a whole *lot* like an
instance of a Python library class called cmd.Cmd - actually the
resemblance was so strong that I wouldn't be surprised if the ntpq UI
were modeled after some ancestral program that the designer of cmd.Cmd
was quasi-emulating. (If I had to guess, it was some old-school pre-gdb
symbolic debugger, or something of that sort.)

One of the major simplifications in the rewrite was throwing out all the
logic that cmd.Cmd replaced.
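
To show what cmd.Cmd buys you, here is a minimal sketch (not code from
ntpq; the class name, the commands and the canned output are all made up
for illustration).  Any method named do_<word> becomes a command, and
unknown input falls through to default():

```python
import cmd
import io


class TinyInterpreter(cmd.Cmd):
    """Hypothetical stand-in for an ntpq-style command interpreter."""
    prompt = "tiny> "

    def do_peers(self, line):
        "peers - print a (canned) peer listing"
        self.stdout.write("peer list would be printed here\n")

    def do_quit(self, line):
        "quit - leave the interpreter"
        return True  # a true return value tells cmdloop() to stop

    def default(self, line):
        # Called by cmd.Cmd for any command with no matching do_* method.
        self.stdout.write("***Command `%s' unknown\n" % line.split()[0])


# Drive it without a terminal: onecmd() dispatches a single input line.
out = io.StringIO()
interp = TinyInterpreter(stdout=out)
interp.onecmd("peers")
stop = interp.onecmd("quit")
```

Prompting, line reading, dispatch and help all come from the base class;
calling interp.cmdloop() instead of onecmd() gives the interactive loop.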

> Can you say anything about how the python version of ntpq works that isn't 
> obvious from looking at the code?  I'm looking for the big picture?  The 
> stuff that's obvious after you know it but hard to put together if you don't 
> know what you are looking for because it is spread over many screens.

Thinking...

Well, the most important structural thing about it is the layering.
The front end, ntpq proper, is mostly one big instance of a class
derived from cmd.Cmd. That command interpreter, the Ntpq class,
manages an instance of a back-end class called ControlSession that
lives in ntp.packet. The cmd.Cmd methods are mostly pretty thin
wrappers around calls to (/me counts) eight methods of ControlSession
corresponding to each of the implemented Mode 6 request types.

Within ControlSession, those methods turn into wrappers around
doquery() calls.  doquery() encapsulates "send a request, get a
response" and includes all the response fragment reassembly, retry,
and time-out/panic logic.  As I alluded to earlier, that code resembles the
old C more than the dispatch layer above it does.
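
Roughly, the back-end shape looks like the sketch below.  This is an
illustration under assumptions, not the real ControlSession: the class
name, the method signatures, the transport callable and the response
parsing are all invented here, and the real doquery() also does retry,
timeout and fragment reassembly.

```python
class SketchSession:
    """Hypothetical back end: request methods are thin doquery() wrappers."""
    OP_READSTAT = 1  # opcode values used here for illustration
    OP_READVAR = 2

    def __init__(self, transport):
        # transport is an assumed callable:
        # (opcode, associd, payload bytes) -> response bytes
        self.transport = transport

    def doquery(self, opcode, associd=0, qdata=b""):
        # "Send a request, get a response" lives in exactly one place.
        return self.transport(opcode, associd, qdata)

    def readstat(self, associd=0):
        return self.doquery(self.OP_READSTAT, associd)

    def readvar(self, associd=0, varlist=()):
        qdata = ",".join(varlist).encode("ascii")
        response = self.doquery(self.OP_READVAR, associd, qdata)
        # Parse "name=value, name=value" text into a dictionary.
        variables = {}
        for item in response.decode("ascii").split(","):
            if "=" in item:
                name, value = item.split("=", 1)
                variables[name.strip()] = value.strip()
        return variables


def canned_transport(opcode, associd, qdata):
    # Stand-in for the network so the sketch runs anywhere.
    return b"stratum=2, offset=0.123"


session = SketchSession(canned_transport)
variables = session.readvar(varlist=["stratum", "offset"])
```

Because everything funnels through doquery(), a second client can reuse
the session class without touching any command-interpreter code.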

Even the code for making the actual displays mostly doesn't live in
the front end.  It's in ntp.util, well separated from both the command
interpreter and the protocol back end so it can be re-used.  And is,
in fact, re-used by ntpmon.

> How does the MRU stuff work?  I think I saw some debugging printout 
> indicating that it got back a clump of packets for each request.  If it 
> misses one, will it use the data up to the gap?

The mrulist() method in ControlSession is more complex than the rest of the
back end code put together except doquery() itself.  It is the one part
that was genuinely difficult to write, as opposed to merely having high
friction because the C I was translating was so grotty.

Yes, the way that part of the protocol works is a loop that does two
layers of segment reassembly.  The lower layer is the vanilla UDP
fragment reassembly encapsulated in doquery() and shared with the
other request types.  That part I'm pretty confident in; if it didn't work
100%, things like peer listings would break.
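
The lower layer can be pictured with a sketch like this one.  It is not
the code in the back end; it assumes each fragment has already been
decoded into a hypothetical (offset, data, more) tuple, where "more" is
false only on the final fragment.

```python
def reassemble(fragments):
    """Return the full payload, or None if the response is incomplete.

    fragments: iterable of (offset, data, more) tuples in arrival order.
    """
    by_offset = {}
    for offset, data, more in fragments:
        by_offset[offset] = (data, more)  # duplicate fragments collapse
    if not by_offset:
        return None
    # The highest-offset fragment must be the one marked as final.
    if by_offset[max(by_offset)][1]:
        return None
    expected = 0
    parts = []
    for offset in sorted(by_offset):
        data, _more = by_offset[offset]
        if offset != expected:
            return None  # a gap: some fragment has not arrived yet
        parts.append(data)
        expected += len(data)
    return b"".join(parts)


# Fragments may arrive out of order; offsets put them back in sequence.
complete = reassemble([(4, b"efgh", True), (0, b"abcd", True),
                       (8, b"ij", False)])
missing_middle = reassemble([(0, b"abcd", True), (8, b"ij", False)])
missing_final = reassemble([(0, b"abcd", True), (4, b"efgh", True)])
```

A caller would keep reading packets and retrying reassemble() until it
returns something other than None or the timeout logic gives up.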

In order to avoid blocking for long periods of time, and in order to
be cleanly interruptible by control-C, the upper layer does a sequence
of requests for MRU spans, which are multi-frag sequences of
ASCIIizations of MRU records, oldest to newest.  The spans include
sequence metadata intended to allow you to stitch them together on the
fly in O(n) time.

A further interesting complication is use of a nonce to foil DDoSes by
source-address spoofing.  The mrulist() code begins by requesting a
nonce from ntpd, which it then replays between span requests to
convince ntpd that the address it's firehosing all that MRU data at is
the same one that asked for the nonce. To foil replay attacks, the
nonce is timed out; you have to request a fresh one every 4 span
fetches. This is a clever trick and I will certainly use it the next
time I need to design a connectionless protocol.
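
The nonce cadence can be sketched as follows.  This is only an
illustration: fetch_nonce and fetch_span are hypothetical callables
standing in for the real requests, and the only number taken from the
description above is the refresh every 4 span fetches.

```python
SPANS_PER_NONCE = 4  # refresh cadence described above


def fetch_all_spans(fetch_nonce, fetch_span):
    """Collect MRU entries, renewing the nonce every SPANS_PER_NONCE spans.

    fetch_nonce() -> nonce string            (assumed interface)
    fetch_span(nonce) -> (entries, done)     (assumed interface)
    """
    entries = []
    nonce = fetch_nonce()
    spans_since_nonce = 0
    while True:
        if spans_since_nonce == SPANS_PER_NONCE:
            nonce = fetch_nonce()  # the old nonce is about to expire
            spans_since_nonce = 0
        span, done = fetch_span(nonce)
        spans_since_nonce += 1
        entries.extend(span)
        if done:
            return entries


# Fake server: hands out numbered nonces and ten one-entry spans.
nonce_log = []
used = []


def fake_fetch_nonce():
    nonce_log.append("nonce%d" % (len(nonce_log) + 1))
    return nonce_log[-1]


def fake_fetch_span(nonce):
    used.append(nonce)
    return ["entry%d" % len(used)], len(used) == 10


result = fetch_all_spans(fake_fetch_nonce, fake_fetch_span)
```

With ten spans the fake server is asked for three nonces: one up front,
one after the fourth span, and one after the eighth.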

But...I never completely understood the old logic for stitching
together the MRU spans; it was *nasty* and looked pretty fragile in
the presence of span dropouts (I don't know that those can ever
happen, but I don't know that they can't, either).  Fortunately I
didn't have to. It worked just to brute-force the problem - accumulate
all the MRU spans until either the protocol marker for the end of the
last one or ^C interrupting the span-read loop, and then quicksort the
list before handing it up to the front end for display.
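
In outline, the brute-force approach amounts to something like the sketch
below.  The entry layout (a (last_seen, address) tuple) and the
span_source iterable are assumptions made for the example, not the real
data structures.

```python
def collect_mru(span_source):
    """Accumulate every span, then sort once at the end.

    span_source: iterable yielding lists of (last_seen, address) tuples.
    """
    entries = []
    try:
        for span in span_source:
            entries.extend(span)
    except KeyboardInterrupt:
        pass  # ^C: stop reading, but keep the partial results
    entries.sort(key=lambda entry: entry[0])  # oldest to newest
    return entries


def interrupted_spans():
    # Two spans arrive, then the user hits ^C (simulated here).
    yield [(30.0, "192.0.2.3"), (10.0, "192.0.2.1")]
    yield [(20.0, "192.0.2.2")]
    raise KeyboardInterrupt


result = collect_mru(interrupted_spans())
```

Python's built-in sort is Timsort rather than a textbook quicksort, but
the cost is the same O(n log n) in the worst case, paid once instead of
stitching spans together as they arrive.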

The old way made some sense, I guess, back when processor clocks were
so expensive that we worked a lot harder to avoid O(n log n) operations.
But I can't say I liked that part of the protocol design even a little bit.

The answer to your original question is this: I don't really know how
well the old code dealt with gaps in the sequence of spans.  I think
it was intended to cope, but I wouldn't bet *anything* I valued
against there being bugs in the coping strategy or implementation.  My
brute-force method will work better.

One consequence of the brute-forcing change is that I never figured
out where to put the update-message generation that the C version did.
I'm not even completely sure what granularity of update it was
counting, nor how that count interacted with the old way of doing
stale-record elimination on the fly.  (Did I mention that code was
nasty?)

I would have pushed harder to replicate the old behavior exactly (including
the update messages) except that (a) I thought it was ugly, and (b) I already
had ntpmon in mind.  I knew I needed the mrulist() method to run *quietly*
rather than assuming as the old code did that it could just belt update
messages to the terminal.

I'm still not happy about the fact that there's a keyboard-interrupt catcher
*inside* the mrulist() method.  That feels like a layering violation, but
I haven't come up with a better way to partition things.  Under the given
constraints there may not be one.

> Is there any documentation on the packet format?

There is.  It's described in detail on the docs/mode6.txt page.  I was careful
about that because I really needed to understand it before I tried translating
the ugly C code.

>                                               I saw some ".." in an ASCII 
> packet dump.  That was a CR/LF in the hex part.  It looked like each slot was 
> multiple lines.  What marks the end of a slot and/or start of a new one?

All explained in docs/mode6.txt.  If that's at all unclear ask me questions
and we'll improve that page.

> How does the retransmission logic work?

You mean for requests to ntpd?

There are two timeouts, five seconds and three seconds.  The request is shipped
once. If the primary timeout expires without a response coming back,
the request is repeated.  If the secondary timeout then also expires without
a response, the whole request is aborted.

This behavior is direct from the C version and I'm pretty sure the
Python implementation is right - the breakage if it weren't would not be
subtle.
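
As a sketch of that behavior (not the actual implementation): send and
receive below are hypothetical callables, and treating five seconds as
the primary timeout and three seconds as the secondary one is my reading
of the description above.

```python
PRIMARY_TIMEOUT = 5.0    # seconds to wait after the first send (assumed)
SECONDARY_TIMEOUT = 3.0  # seconds to wait after the retransmission (assumed)


class QueryTimeout(Exception):
    """Hypothetical exception: no response even after one retransmission."""


def query_with_retry(send, receive):
    """send() ships the request; receive(timeout) returns bytes or None."""
    send()
    response = receive(PRIMARY_TIMEOUT)
    if response is not None:
        return response
    send()  # exactly one retransmission
    response = receive(SECONDARY_TIMEOUT)
    if response is None:
        raise QueryTimeout("no response after retransmission")
    return response


# Fake transport that stays silent once, then answers the retry.
sent = []
timeouts = []


def fake_send():
    sent.append("request")


def fake_receive(timeout):
    timeouts.append(timeout)
    return b"reply" if len(timeouts) == 2 else None


reply = query_with_retry(fake_send, fake_receive)
```

In a real client receive() would wrap something like select() on the UDP
socket; the fake here just records which timeout it was handed.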

> I want to add a bunch of counters.  Where should they go?

I'm not sure.  It depends on what you want to count. If it's packets
with particular mode bits set, or something like that, probably in
the ControlSession class.

I advise checking your premises before you code anything.  Now that
the monitor code counts Mode 6 packets again you may be able to get
away with a lot less work.

> What should I have asked?

Dunno. I've tried to give you as complete a brain dump as I can, because
you seem to be heading towards extending and maintaining the Mode 6 stuff
and I think that's a good idea on several levels.

> I got this from an old ntpq looking at a busy server.
> 
>   Ctrl-C will stop MRU retrieval and display partial results.
>   116 (0 updates) Giving up after 8 restarts from the beginning.
>   With high-traffic NTP servers, this can occur if the
>   MRU list is limited to less than about 16 seconds' of
>   entries.  See the 'mru' ntp.conf directive to adjust.
> 
> I think that's trying to tell me that things are getting updated faster than 
> they are getting retrieved.  What will your new code do in that case?

The same as the old code, I think.  I monkey-copied the logic from C
because I was not certain I understood it enough to modify it.  Here's
what I *think* was and is happening:

Each span request except the first is supposed to include identifications
of late MRU entries from the previous span.  If the daemon can't match those
from the MRU records it's holding in core, that means some of the records
that existed at the time of the last request have been thrown out of core
to make room for newer ones without exceeding the configured limit on
MRU memory usage.

When this condition occurs, the design assumes you'd rather have a continuous
traffic record from a later start point than one that has gaps of unknown
size in it.  So ntpd throws up its hands and starts resending the whole
current MRU list.

ntpd tracks the number of times it has to do this restart.  If that number
exceeds 8, it figures that it's never going to get everything to you
before stuff ages out, and returns an error code indicating a stall.  I
see two ways for this to happen: a really low mru limit, or a really
slow network. The second condition might have been much more common when
this code was written, but I'd be surprised to encounter it now.
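
Here is how I would sketch the restart-and-give-up logic in isolation.
Everything in it is an assumption for illustration: the fetch_span
callable, the "more"/"restart"/"done" status strings, and the exact
boundary (I have taken "exceeds 8" literally).  It also doesn't care
which side of the connection does the counting.

```python
MAX_RESTARTS = 8  # boundary taken literally from "exceeds 8"


class MRUStall(Exception):
    """Hypothetical exception: records age out faster than we can fetch."""


def retrieve(fetch_span):
    """fetch_span(last_entry) -> (entries, status), status being one of
    "more", "restart" or "done" (an invented interface)."""
    entries = []
    restarts = 0
    while True:
        last = entries[-1] if entries else None
        span, status = fetch_span(last)
        if status == "restart":
            restarts += 1
            if restarts > MAX_RESTARTS:
                raise MRUStall("gave up after %d restarts" % MAX_RESTARTS)
            entries = []  # prefer a gapless record from a later start
        entries.extend(span)
        if status == "done":
            return entries


# One restart in the middle: the pre-restart entry "a" is discarded.
script = iter([(["a"], "more"), (["b"], "restart"), (["c"], "done")])
result = retrieve(lambda last: next(script))
```

A source that answers "restart" every time makes retrieve() raise
MRUStall on the ninth attempt.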

The stall error code, coming back to C ntpq, is what throws up the error
message you saw.  In the Python version, you should see this:

***No response, probably high-traffic server with low MRU limit.

The difference is that where the C message display for this case was
wired right into the MRU-response handler logic, the above is packed inside a
class-valued back-end exception that the front end can handle as it
likes.
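
A small sketch of why that matters, with invented names throughout
(SessionError, backend_mrulist and the two front-end functions are not
the real API): the back end only raises, and each client decides what to
do with the message.

```python
import io


class SessionError(Exception):
    """Hypothetical back-end exception carrying a user-facing message."""

    def __init__(self, message):
        super().__init__(message)
        self.message = message


def backend_mrulist(stalled):
    # Stand-in for the back end: it never prints, it only raises.
    if stalled:
        raise SessionError(
            "***No response, probably high-traffic server with low MRU limit.")
    return []


def cli_front_end(stalled, out):
    # An interactive client can simply print the message.
    try:
        return backend_mrulist(stalled)
    except SessionError as e:
        out.write(e.message + "\n")
        return None


def monitor_front_end(stalled, status):
    # A full-screen monitor might stash it for a status line instead.
    try:
        return backend_mrulist(stalled)
    except SessionError as e:
        status["last_error"] = e.message
        return None


out = io.StringIO()
cli_front_end(True, out)
status = {}
monitor_front_end(True, status)
```

The same exception ends up on a terminal in one client and in a status
dictionary in the other, without the back end knowing about either.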
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>