Monitoring busy servers

Sun Dec 25 02:04:46 UTC 2016

I added a few lines of code to the main receive routine.  They were copied 
from much older code and restore the bumping of sys_newversion and 
sys_oldversion.  There is also a bail case that I don't fully understand.  
Would you please check it?  It seems to be working, but there might be 
something interesting in the bail path.

I'm working on adding counters to the mrulist code.  So I'm thinking about 
counters.

We should review the existing counters for the main packet flow.  They get 
logged hourly to the sysstats file, and ntpq can also show them with its 
sysstats command.

The sysstats counters get reset every hour, so there is no direct way to see 
the running totals.  We can only reconstruct them by adding up the hourly 
entries in the log files.
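
Reconstructing totals from the logs is just a column-wise sum.  Here is a 
minimal sketch in Python, assuming each sysstats line is whitespace-separated 
with a day and a time-of-day in the first two fields, followed by integer 
counters.  The column order is not asserted here, sum_sysstats is a made-up 
helper name, and the sample lines are invented rather than real log data.

```python
def sum_sysstats(lines):
    """Column-wise sum of the counters on each hourly sysstats line.

    Assumption: fields 0 and 1 are the timestamp (day, seconds) and every
    later field is an integer counter.  Check this against the real format.
    """
    totals = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or truncated line, nothing to add
        counters = [int(f) for f in fields[2:]]
        # Grow the totals list if this line has more columns than seen so far.
        if len(counters) > len(totals):
            totals.extend([0] * (len(counters) - len(totals)))
        for i, value in enumerate(counters):
            totals[i] += value
    return totals


# Two invented hourly lines, only to show the shape of the input.
sample = [
    "57747 3600.000 3600 1000 990 5",
    "57747 7200.000 3600 2000 1980 7",
]
print(sum_sysstats(sample))  # [7200, 3000, 2970, 12]
```

If one of the columns is the time since the last reset, its sum is simply the 
total number of seconds covered by the log.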

When things get complicated, I think the key step for understanding what 
counters mean is an overview description of how the code works - a high level 
flow chart where we can associate a counter with a box.

Do we have technology for making flow charts?  Are they better than text?  
For the MRU stuff, I think I can do a decent job with text.  (It's not that 
complicated.)  Should the text go in the code or in a separate doc where the 
ntpq documentation can refer to it?  ...

--------

There is some "interesting" code at the bottom of ntpd/ntp_monitor.c.  
Go to the bottom of the file, then back up a screen or so to the "Got one, 
initialize it" comment.  Ten lines above that is a call to ntp_random.  That 
code either gives up and doesn't return a slot, or it preempts the oldest 
slot when it isn't old enough for the normal recycle-oldest-slot path to work.

The only description of that idea I can find is in 
docs/includes/access-commands.txt
under discard monitor.  The description says "probability", so I expect the 
argument should be in the range of 0 to 1.  But the code seems to be working 
with the slot age in seconds.  So either the code or documentation needs 
fixing.  It's probably minor in either case, but I haven't figured out a 
simple description for what the code is actually doing.  It may be just a 
scale factor.

It seems unlikely that code path is ever used, at least on a well-tuned 
system.  It may have been useful back before the MRU list had a hash table, 
when large tables weren't practical.  It might be interesting today for a 
memory constrained system supporting a lot of traffic.

I think the idea is to make sure that abusive clients don't get lost in the 
noise on a busy server.

Each slot is roughly 100 bytes.  So 100 megabytes is a million slots.  (I 
think that fits even on a Raspberry Pi.)  At 1000 packets per second, the 
oldest slot would be 1000 seconds old even if the noise was only 1 hit per 
slot.  An abusive client has to be sending faster than that in order to be 
abusive.
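
The arithmetic behind that estimate, written out as a small Python sketch.  
The constants are the round numbers from the paragraph above, not measured 
values.

```python
SLOT_BYTES = 100             # rough size of one MRU slot (estimate from the text)
MEMORY_BYTES = 100 * 10**6   # 100 megabytes set aside for the table
PACKETS_PER_SECOND = 1000    # assumed traffic level

slots = MEMORY_BYTES // SLOT_BYTES

# Fastest possible turnover: every packet comes from a new address, so each
# packet uses up one slot (the "1 hit per slot" case).
oldest_age_seconds = slots / PACKETS_PER_SECOND

print(slots)               # 1000000
print(oldest_age_seconds)  # 1000.0
```

Any address that shows up more than once in that window keeps its existing 
slot, so real traffic can only make the oldest slot older than this.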

Ahh.  Maybe I just figured it out.  There is a design oversight in the MRU 
recycle/preempt logic.

There are two simple cases where you want to reuse the oldest slot.  The 
first is when the table isn't full, but the oldest slot is older than 
mru_maxage.  This code exists.  It lets you avoid cluttering up the list with 
stuff that is too old to be interesting.  (We could add a background call to 
move old slots to the free list, but currently they just sit there, so you 
can get really old slots.)

The other case doesn't exist yet and/or is tangled up with that ntp_random 
call.  If the table is full, you want to use the oldest slot.  But maybe not 
if it isn't old enough.  I think we need another parameter and a few more 
lines of code to implement this.
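
A rough sketch of how the two cases might fit together, in Python rather than 
C to keep it short.  This is not the current ntp_monitor.c logic.  
choose_slot, Action, table_full and full_minage are all made-up names, and 
full_minage stands in for the extra parameter suggested above.  Only 
mru_maxage is an existing name.

```python
from enum import Enum


class Action(Enum):
    RECYCLE_OLDEST = "recycle the oldest slot"
    ALLOCATE_NEW = "take a slot from the free list or allocate one"
    GIVE_UP = "do not track this packet"


def choose_slot(oldest_age, table_full, mru_maxage, full_minage):
    """Decide where the slot for a new address comes from.

    oldest_age  -- age in seconds of the oldest slot, None if the list is empty
    table_full  -- True when there is no free slot and no room to allocate
    mru_maxage  -- existing parameter: older slots are too stale to keep
    full_minage -- HYPOTHETICAL parameter: with a full table, only take the
                   oldest slot if it is at least this old
    """
    # Case 1 (exists today): the oldest slot is stale, so reuse it even if
    # the table has room.
    if oldest_age is not None and oldest_age > mru_maxage:
        return Action.RECYCLE_OLDEST
    if not table_full:
        return Action.ALLOCATE_NEW
    # Case 2 (proposed): the table is full.  Take the oldest slot unless it
    # is too young to throw away.
    if oldest_age is not None and oldest_age >= full_minage:
        return Action.RECYCLE_OLDEST
    return Action.GIVE_UP


print(choose_slot(5000, False, 3600, 60))  # Action.RECYCLE_OLDEST
print(choose_slot(100, False, 3600, 60))   # Action.ALLOCATE_NEW
print(choose_slot(100, True, 3600, 60))    # Action.RECYCLE_OLDEST
print(choose_slot(10, True, 3600, 60))     # Action.GIVE_UP
```

Whether giving up is the right answer when the oldest slot is too young is 
exactly the open question.  The sketch only shows where the new test would go.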

My "seems unlikely" comment above was assuming the second case was handled 
sanely.

More cruft.  All the tests for oldest != NULL can be removed.  If the list is 
totally empty, the slot would have come from the free list.
