mrulist direct mode, monitoring pool servers
Eric S. Raymond
esr at thyrsus.com
Thu Dec 22 01:07:57 UTC 2016
Apologies for the delay in responding. There have been lots of fires
to fight recently, none of them major, but they add up.
Hal Murray <hmurray at megapathdsl.net>:
> I implemented a direct mode. It writes out each batch of slots as soon as it
> gets them. Any sort options are ignored. There will be duplicates of any
> slots that get updated after they are retrieved. I think the filtering stuff
> should still work but I didn't try it.
I can't find any commit that looks like it involves a 'direct' flag.
Did you push this, or is it a private set of changes? If the latter,
I'd like to see the patch and play with it.
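Sight unseen, here's roughly the shape I'd expect a direct mode to take. This is only a sketch; `fetch_mru_spans` and the slot layout are stand-ins for whatever the real Mode 6 plumbing returns, not actual ntpq/pylib calls:

```python
# Hypothetical direct-mode loop: emit each span of MRU slots as soon
# as it arrives, rather than sorting and merging everything at the end.

def fetch_mru_spans():
    """Yield batches (spans) of MRU slots; placeholder for the Mode 6 exchange."""
    yield [("203.0.113.5", 42), ("198.51.100.7", 17)]
    yield [("203.0.113.5", 43)]   # same address again: a later update, so a duplicate

def direct_mode(write):
    """Write each slot immediately; no sorting, duplicates are possible."""
    seen = 0
    for span in fetch_mru_spans():
        for slot in span:
            write("%s ct=%d\n" % slot)
            seen += 1
    return seen
```

The point of the structure is that memory usage stays bounded by one span, at the cost of unsorted output with possible duplicate entries for slots updated mid-collection.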
> The code and UI need more work, but as a proof of concept it managed to
> capture everything from a busy server.
> I think collecting data from a busy server will always be "interesting". I
> know about 2 issues.
> The first is the race between collecting data and having slots get moved or
> recycled while you are collecting. This is obviously easier if you can run
> on the same system as the server so there are no network delays.
Right, and it's a *fundamental* problem, because the overhead of assembling
the batches means pumping them out intrinsically happens slower than logging.
I don't think there's anything to be done about this other than
document it as a known problem with monitoring heavily loaded servers.
Pushing their traffic high enough will reliably push them into this
lagging state. The only question is whether this happens in your
normal traffic regime.
Nothing we do on the client side can prevent this, though a slow
client could make it worse if local computation or I/O stalls its
network reads. I think stalling due to processor lag is unlikely to
happen. Even a low-power ARM has lots of headroom with respect to
network speeds these days.
On the other hand, if ntpmon's screen I/O happened between spans
rather than after the sort and reassembly, that could be pretty bad.
> If we can't go fast enough, we should be able to get some of the data and/or
> some estimates of how much we are missing.
Some of the data, yes. As the Mode 6 protocol is designed I don't see how
to get good estimates.
On the other hand, I can imagine an inexpensive protocol extension
that would help a lot - adding a tag to the front of each span that
reports the MRU-list size at the time of transmission. If your client sees
this number rising rather than falling during a span sequence then you
can at least be warned that you're probably in a losing race.
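The client-side check that extension would enable is trivial. Assuming each span carried a hypothetical tag (call it `mru_size`) reporting the server's MRU-list length at transmission time, the client just watches the trend:

```python
# Sketch of the losing-race warning the proposed tag would make possible.
# span_sizes is the sequence of reported MRU-list sizes, one per span,
# in arrival order.

def losing_race(span_sizes):
    """Return True if the reported MRU-list size trends upward,
    i.e. entries are being added faster than we can retrieve them."""
    return len(span_sizes) >= 2 and span_sizes[-1] > span_sizes[0]
```

A rising size across the sequence doesn't tell you exactly how much you're missing, but it's a cheap and reliable warning flag.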
> We can probably test that by
> running over a network. (That will also test the lost packet code.) We need
> to be sure to debug this case/mode so we will have useful tools when the next
> big burst of traffic hits the pool.
I'm in favor of *that*...
> The other issue is memory and CPU on the system collecting the data. I don't
> know which limit will kick in first. It takes a lot of CPU, but that's not a
> problem as long as you can keep up with the server. I think that translates
> into a threshold for how busy a server you can grab complete data from.
Yes, that matches my own analysis.
> I think memory will be a serious issue. I saw troubles before switching to
> direct mode but it should work on a system with more memory or less traffic.
> Direct mode doesn't use much memory so this probably won't be a problem.
Can't easily see it being a big problem in the normal mode either,
frankly. By definition the client's memory usage has to be linearly
related to the memory usage on the server, and even in Python I don't
think the constant of proportionality can be very large.
If I turn out to be wrong about that, there is recourse. Changing the
representation of spans from a list of objects to a list of tuples,
for example, would drop about 40 bytes per item.
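The savings are easy to demonstrate. The exact byte counts vary by Python version and build (the ~40-bytes-per-item figure is an estimate, and the class below is an illustrative stand-in, not the real span representation), but an instance carrying a `__dict__` always costs substantially more than a bare tuple with the same fields:

```python
import sys

# Compare the per-item footprint of an object-based slot record
# against a plain tuple holding the same three fields.

class SlotObj:
    def __init__(self, addr, last, count):
        self.addr, self.last, self.count = addr, last, count

obj = SlotObj("203.0.113.5", 1482368877.0, 42)
tup = ("203.0.113.5", 1482368877.0, 42)

# getsizeof is shallow, so the instance cost must include its __dict__.
obj_cost = sys.getsizeof(obj) + sys.getsizeof(obj.__dict__)
tup_cost = sys.getsizeof(tup)
```

A middle path, if attribute access is worth keeping, is a class with `__slots__`, which drops the per-instance `__dict__` entirely.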
> Any suggestions for a UI/CLI?
Not before seeing the patch, no.
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>