Monitoring busy servers
Hal Murray
hmurray at megapathdsl.net
Sun Dec 25 02:04:46 UTC 2016
I added a few lines of code to the main receive routine. They were copied
from much older code and restore the bumping of sys_newversion and
sys_oldversion. There is also a bail case that I don't fully understand.
Would you please check it? It seems to be working, but there might be
something interesting in the bail path.
I'm working on adding counters to the mrulist code. So I'm thinking about
counters.
We should review the existing counters for the main packet flow. They get
logged hourly to the sysstats file, and ntpq can also show them via its
sysstats command.
The sysstats counters get reset every hour. There is no way to see the
running totals directly, although we can reconstruct them from the log
files.
When things get complicated, I think the key step for understanding what
counters mean is an overview description of how the code works - a
high-level flow chart where we can associate each counter with a box.
Do we have technology for making flow charts? Are they better than text?
For the MRU stuff, I think I can do a decent job with text. (It's not that
complicated.) Should the text go in the code or in a separate doc where the
ntpq documentation can refer to it? ...
--------
There is some "interesting" code at the bottom of ntpd/ntp_monitor.c.
Go to the bottom of the file, then back up a screen or so to the "Got one,
initialize it" comment. 10 lines above that is a call to ntp_random. That
code either gives up without returning a slot, or it preempts the oldest
slot even though that slot isn't old enough for the normal
recycle-oldest-slot path to apply.
The only description of that idea I can find is in
docs/includes/access-commands.txt
under discard monitor. The description says "probability", so I expect the
argument should be in the range of 0 to 1. But the code seems to be working
with the slot age in seconds. So either the code or documentation needs
fixing. It's probably minor in either case, but I haven't figured out a
simple description for what the code is actually doing. It may be just a
scale factor.
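
To make the "scale factor" guess concrete, here is a minimal sketch of how
an age in seconds could be turned into a probability. This is only one
possible reading, not a description of what ntp_monitor.c actually does.
The names preempt_probability, should_preempt and scale_seconds are
hypothetical, and the random draw is passed in by the caller rather than
taken from ntp_random.

```c
#include <assert.h>

/*
 * HYPOTHETICAL: map the age of the oldest slot to a preemption
 * probability by dividing by a scale factor, clamped to [0, 1].
 * An age of 0 never preempts; an age >= scale_seconds always does.
 */
static double preempt_probability(double oldest_age_seconds,
                                  double scale_seconds)
{
    assert(scale_seconds > 0.0);
    double p = oldest_age_seconds / scale_seconds;
    if (p < 0.0)
        p = 0.0;
    if (p > 1.0)
        p = 1.0;
    return p;
}

/*
 * Decide whether to preempt the oldest slot.  "draw" is a uniform
 * random number in [0, 1) supplied by the caller.  Returns 1 to
 * preempt, 0 to give up without returning a slot.
 */
static int should_preempt(double draw, double oldest_age_seconds,
                          double scale_seconds)
{
    return draw < preempt_probability(oldest_age_seconds, scale_seconds);
}
```

Under that reading the configured number would be a time in seconds, not a
probability, and the documentation would be the part that needs fixing.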
It seems unlikely that this code path is ever used, at least on a
well-tuned system. It may have been useful back before the MRU list had a
hash table, when large tables weren't possible. It might be interesting
today for a memory-constrained system supporting a lot of traffic.
I think the idea is to make sure that abusive clients don't get lost in the
noise on a busy server.
Each slot is roughly 100 bytes. So 100 megabytes is a million slots. (I
think that fits even on a Raspberry Pi.) At 1000 packets per second, the
oldest slot would be 1000 seconds old even if the noise was only 1 hit per
slot. An abusive client has to be sending faster than that in order to be
abusive.
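
The arithmetic above can be written out as a small sketch. The function
names are made up for illustration, and the 100 bytes per slot is the rough
figure from above, not a measured sizeof.

```c
#include <assert.h>
#include <stdint.h>

/* Number of MRU slots that fit in a given memory budget. */
static uint64_t mru_slots_for_budget(uint64_t budget_bytes,
                                     uint64_t bytes_per_slot)
{
    assert(bytes_per_slot > 0);
    return budget_bytes / bytes_per_slot;
}

/*
 * Lower bound on the age of the oldest slot in a full table.
 * Worst case: every packet comes from a new address (1 hit per
 * slot), so each packet consumes a slot and the table turns over
 * once every slots / packets_per_second seconds.  Repeat hits
 * from the same address only make the oldest slot older.
 */
static uint64_t min_oldest_age_seconds(uint64_t slots,
                                       uint64_t packets_per_second)
{
    assert(packets_per_second > 0);
    return slots / packets_per_second;
}
```

With a budget of 100,000,000 bytes and 100 bytes per slot this gives
1,000,000 slots, and at 1000 packets per second a minimum oldest age of
1000 seconds.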
Ahh. Maybe I just figured it out. There is a design oversight in the MRU
recycle/preempt logic.
There are 2 simple cases where you want to reuse the last slot. The first is
when the table isn't full, but the oldest slot is older than mru_maxage.
This code exists. It lets you avoid cluttering up the list with stuff that
is too old to be interesting. (We could add a background call to move old
slots to the free list, but currently they just sit there so you can get
really old slots.)
The other case doesn't exist yet and/or is tangled with that "random". If
the table is full you want to use the oldest slot. But maybe not if it isn't
old enough. I think we need another parameter and a few more lines of code
to implement this.
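
Here is a rough sketch of the two cases as I picture them. It is not the
existing code in ntp_monitor.c. The minage field is the hypothetical "other
parameter", and choose_slot, struct mru_limits and enum slot_source are
names invented for this sketch. Only maxage corresponds to something that
exists today (mru_maxage).

```c
#include <assert.h>
#include <stddef.h>

enum slot_source {
    SLOT_RECYCLE_OLDEST,  /* reuse the slot at the tail of the MRU list */
    SLOT_ALLOCATE,        /* take a slot from the free list / grow table */
    SLOT_NONE             /* give up; this packet is not tracked */
};

struct mru_limits {
    long maxage;  /* cf. mru_maxage: older than this is always reusable */
    long minage;  /* HYPOTHETICAL: a full table won't preempt a younger slot */
};

/*
 * entries    - slots currently on the MRU list
 * capacity   - maximum number of slots allowed
 * oldest_age - age in seconds of the oldest slot (ignored if entries == 0)
 */
static enum slot_source choose_slot(size_t entries, size_t capacity,
                                    long oldest_age,
                                    const struct mru_limits *lim)
{
    assert(lim != NULL);

    /* Empty list: there is no oldest slot, so use the free list. */
    if (entries == 0)
        return SLOT_ALLOCATE;

    /* Case 1: the oldest slot is too old to be interesting.
     * Reuse it even if the table has room. */
    if (oldest_age > lim->maxage)
        return SLOT_RECYCLE_OLDEST;

    if (entries < capacity)
        return SLOT_ALLOCATE;

    /* Case 2: the table is full.  Reuse the oldest slot, but
     * only if it has been around long enough. */
    if (oldest_age >= lim->minage)
        return SLOT_RECYCLE_OLDEST;

    return SLOT_NONE;
}
```

What to do in the SLOT_NONE case is the open question. The "random" code
looks like one answer to it.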
My "seems unlikely" comment above was assuming the second case was handled
sanely.
More cruft. All the tests for oldest != NULL can be removed. If the list is
totally empty, the slot would have come from the free list.
--
These are my opinions. I hate spam.