How much do we care about high-load scenarios?

Tue Sep 13 06:25:41 UTC 2016

Mark: Heads up! Your input requested.

Hal and I were having an off-list argument.  It started here:

Hal:

But maybe we should have a separate thread per refclock to get the
time stamp.  ...

Me:

Not a crazy idea, but I lean against it. My problem is that threading is a
serious defect attractor (as we have in fact seen in the async-DNS code).
I'd take a good deal of persuading about the expected gain in performance
before I'd accept the implied increase in expected defect rates.

Recording and graphing some timing figure that we think is an upper
bound on the gains would be helpful here.  If the median, or even the
two-sigma worst case, is high compared to our accuracy target, then
threading might be indicated. Otherwise I think it would be best to
leave well enough alone.

The crude metric that occurs to me first is just interval between
select calls.  If you think a different timing figure would be a
better predictor, I'm very open to argument.
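
A minimal sketch of that measurement, in Python for brevity.  The
function names, the socketpair standing in for real refclock and
network descriptors, and any accuracy target you pass in are
assumptions for illustration, not NTPsec code.  It records the gap
between successive select returns, then reports the median and a
mean-plus-two-sigma figure to compare against the target.

```python
import select
import socket
import statistics
import time


def sample_select_intervals(count, timeout=0.01):
    """Record the gap, in seconds, between successive select() returns.

    A quiet socketpair stands in for the daemon's real descriptors,
    so each select() simply waits out the timeout.
    """
    reader, writer = socket.socketpair()
    try:
        stamps = []
        for _ in range(count + 1):
            select.select([reader], [], [], timeout)
            stamps.append(time.monotonic())
    finally:
        reader.close()
        writer.close()
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


def summarize_intervals(intervals):
    """Return (median, mean + 2 * population standard deviation)."""
    median = statistics.median(intervals)
    two_sigma = statistics.mean(intervals) + 2 * statistics.pstdev(intervals)
    return median, two_sigma


def exceeds_target(intervals, accuracy_target):
    """True if either figure is larger than the accuracy target (seconds)."""
    median, two_sigma = summarize_intervals(intervals)
    return median > accuracy_target or two_sigma > accuracy_target
```

In a real run the intervals would come from the daemon's own loop
under a representative load, logged over time and graphed rather than
reduced to a single yes/no.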

Hal:

I'm slightly scared about your extrapolation from measurements.  (Both here 
and in other cases.)

I think it's pretty easy to see that the problem will depend upon load.  So 
the only question is how much load is your system going to have?  That's one 
of the things we can't easily predict.  It will get (much) worse if security 
gets deployed and uses serious cycles.

Me:

Ever had one of those moments when you realize your priors have
shifted, in a way that surprises you, while you weren't looking?  I'm
having one now.

What I've just realized, somewhat to my own startlement, is that I no
longer care enough about high-load scenarios to spend a lot of effort
hedging the design against them.  It's because of the Pis on my
windowsill - I've grown used to thinking of NTP as something you throw
on a cheap single-use server so it's not contending with a
conventional job load.

Go ahead and argue that I'm wrong, if you like.  But make that
argument acknowledging that anybody who cares about accurate time can
spend just $80 on a Pi and a HAT, and what's the point of heavily
complicating the NTPsec code to handle high load when *that's* true?

*blink* I did not actually realize I believed this until just a few
minutes ago.  It's ever so slightly disorienting.

Hal:

I think you should run that past Mark.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A Pi with GPS HAT is a nice toy.  I see two problems.

One is that the NIST servers are already overloaded.  Things will get worse 
if security uses more cycles.  "Sorry" might be the right answer, but I think 
it requires a good story.

The other is that it doesn't fit into a typical server room or office 
environment.  There is too much EMI in a server room.  You need either to 
locate the Pi and antenna someplace where the antenna will see the satellites 
and there isn't too much EMI, or you need a long coax connection so the Pi 
can live in the machine room and get connected to an antenna on the roof.  A 
Pi isn't packaged for typical server room mechanicals.  (Just the power via 
USB would probably get it laughed out of most places.)

(History ends here.)

Me:

I think over-fixating on the Pi's limitations is a mistake.  And the EMI
issue is orthogonal to whether you expect your time service to be running
on a lightly or a heavily loaded machine - either way you're going to need
to pipe your GPS signal down from a roof mount.

The underlying point is that blade and rack servers are cheap.  Cycles
are cheap.  This gives the option of implicitly saying to operators
"high-load conditions are *your* problem - fix it by rehosting your
NTP" rather than doing what I think would be premature optimization
for the high-load case.  If we jump right in and implement threading
we *are* going to pay for it in increased future defect rates.

My preference is to engineer on the assumption that local cycles are cheap
and we can therefore stick with the simple dataflow that we
have now - synchronous I/O with one worker thread to avoid stalls
on DNS lookups.
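
For concreteness, here is a rough Python sketch of that shape: one
synchronous select loop, plus a single worker thread that does the
blocking name lookups and hands results back through a queue.  All
names here (dns_worker, drain, poll_once) and the port-123 default are
hypothetical, chosen for illustration; this is not how the NTPsec
source is actually organized.

```python
import queue
import select
import socket
import threading


def dns_worker(requests, results, resolve=socket.getaddrinfo):
    """Resolve hostnames off the main thread; None is the shutdown sentinel."""
    while True:
        host = requests.get()
        if host is None:
            break
        try:
            results.put((host, resolve(host, 123), None))
        except OSError as err:  # socket.gaierror is a subclass of OSError
            results.put((host, None, err))


def drain(results):
    """Collect every finished lookup without blocking the main loop."""
    done = []
    while True:
        try:
            done.append(results.get_nowait())
        except queue.Empty:
            return done


def poll_once(socks, results, timeout=1.0):
    """One pass of the synchronous loop: wait for I/O, then pick up lookups."""
    readable, _, _ = select.select(socks, [], [], timeout)
    return readable, drain(results)
```

The main loop never calls the resolver itself, so a slow DNS server
stalls only the worker.  The only state shared between the two threads
is the pair of queues, which keeps the locking surface small.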

I'd prefer to bias towards architectural simplicity unless and until
field reports force us to optimize for the high-load case.

This turns into a strategic question that I can't answer.  Given Mark's
plans and LF's objectives, what *should* we assume?  Are NIST's overloaded
servers (and installations like them) our target?
-- 
		Eric S. Raymond <http://www.catb.org/~esr/>