How much do we care about high-load scenarios?

Wed Sep 14 21:15:09 UTC 2016

Achim Gratz <Stromeko at nexgo.de>:
> …or move the refclocks out of ntpd altogether and use some shared memory
> or mailbox system to have ntpd have a look at the timestamp stream from
> each refclock.

Yeah, this is one of my longer-term plans.  It was in the original technical
proposal I wrote 18 months ago, labeled REFCLOCKD.

> I think the appeal would be to use multiple cores on a multiprocessor,
> ostensibly not to cut down on the response time, but rather to shorten
> the long tail of the distribution.  Whether or not multiple threads or
> processes achieve that objective needs investigation of course.

As I just wrote, I want to see measurements before I invest in complexity.

Especially since we know two things: (a) the servers deployed out there
are *not* using concurrency, and (b) nobody is identifying poor
performance under load as a pain point.

Therefore I'm going to need persuading that high load is even a real
problem, let alone that concurrency is the right solution.

> Your experience with the rasPi is maybe just the hammer that makes all
> problems look like nails when in reality you'll also need to deal with
> welds, screws, press-fit, snap-in and folded cardboard.

Fair point.  You may be right that I'm being too optimistic here.

> They are not yet cheap enough to have a single server per function, at
> least not on iron.  Virtualization and containerization hasn't yet
> diffused into all places so you can't even assume that NTP is isolated
> to one VM or container.

Not that that would matter.  VMing doesn't magically make more cycles
available; in fact, quite the reverse.

> You seem to be arguing from the standpoint of whether or not ntpd keeps
> the local clock in sync (or at least that's my impression).  Let's
> assume you already ascertain that, then the question becomes how many
> clients you can serve that time with some bounded degradation under
> various load conditions.  For NTP the response time is somewhat
> critical, but even more important is that the variability of the
> response time should be minimized.  This is of course where it gets
> tricky to even measure…

Yes. On the other hand, I repeat: we have the real-world information
that "Help!  My time-service performance is degrading under load!" is
*not* a theme being constantly sounded on bug-trackers or in time-nuts
or elsewhere.  In fact I've never seen this complaint even once.

The simplest explanation for this dog not barking is that there's no
burglar - that is, you have to load your timeserver to a *ridiculous*
extent before performance will degrade enough to be visible at the
scale of WAN or even LAN time service.

I think this explanation is very likely to be the correct one. If
you want to persuade me otherwise, show me data.
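
For concreteness, here is a minimal sketch of the kind of summary that
would count as data: given response-time samples collected from a
loaded server, compare the median against the 99th percentile and the
worst case, so the long tail is visible.  The function names and the
nearest-rank percentile method are illustrative assumptions, not
anything that exists in the ntpd tree, and collecting the samples is
left out entirely.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100 (hypothetical helper)."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be in 1..100")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


def tail_summary(samples):
    """Summarize response times: typical case versus the long tail."""
    return {
        "p50": percentile(samples, 50),
        "p99": percentile(samples, 99),
        "max": max(samples),
    }
```

If p99 and max stay close to p50 as offered load goes up, the tail is
not growing and there is no problem to solve; if they diverge, that is
the measurement worth posting.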
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>