Wonky NTP startup and the incremental-configuration problem

Thu Jun 9 20:02:28 UTC 2016

Heads up, Mark.  Strategic issue being raised here.

Gary Miller says ntpd has a tendency to go nuts when restarted and
converges on good time only slowly, sometimes taking 24-48 hours.

This is a performance problem in itself, and it needs to be high
up on the list of things to attack.  But it's also a significant issue
for our security-hardening efforts.

ntpq has dangerous operations that tweak parameters of the time-sync
algorithms on the fly - operations that can be triggered remotely. Or so I
gather from things Hal Murray has said; my outside view is weak here,
as I've never explored those operations.

It would be better for code verifiability and security if the
only source of configuration information for the daemon were the
ntp.conf file.  (We can't quite get there due to the requirement
to store drift state, but closer would be better.)

Thus, ideally, for security's sake, we would plain remove the
runtime-configuration stuff.  It's an obvious attack surface.  But I
don't think we can do that while the accuracy hit from an ntpd restart
remains high.  If we did so, we'd in effect be screwing experienced
NTP users who are used to runtime tweaking to avoid that reconvergence
time.  They'd howl, and they'd be right to.

(This is also why I haven't yet removed the SAVECONFIG code, much as
I'd like to. Shortly after the Penguicon meeting I found that the
beliefs we based the decision to remove it on were inaccurate - it
is not in itself a potential security hole, and it has a real use when
runtime config operations are allowed, which is to dump your
actual configuration so you can check the cumulative results of your
tweaks.)

There is no doubt in my mind that this is a place where ntpd is
overengineered, wrong for current conditions, and arguably just as
wrong when it was built.  This sort of feature (runtime config of a
long-running service) is a notorious defect and vulnerability
attractor everywhere it's implemented, not just in ntpd.  Mills and
his cohorts got too cute for anybody's good.

How it should work is that there is just one way to hack your
configuration: modifying ntp.conf.  Restarting the daemon to
reread it should then be a low-cost operation that produces only
transient synchronization glitches.  Of course this would also imply
faster crash recovery.

There are two major questions here:

1. Why is convergence from a standing start so slow?

2. If there is a fundamental reason for the slowness, shouldn't it
   be possible to dump some kind of state that would allow ntpd
   to reread it and resume from a running start? The key question
   is whether we can identify that state.
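
To make question 2 concrete, here is a rough sketch of the mechanics
of a "running start": snapshot some state on the way down, reload it
on the way up, and refuse to trust it if it is stale.  This is not
ntpd code.  The function names, the JSON format, and every field in
the snapshot are hypothetical placeholders; which fields actually
matter is exactly the open question.

```python
import json
import os
import tempfile
import time


def save_state(path, state):
    """Atomically write a snapshot so a crash cannot leave a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        # Stamp the snapshot so the loader can judge its age.
        json.dump(dict(state, saved_at=time.time()), f)
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old file; readers see either old or new, never half.
    os.replace(tmp, path)


def load_state(path, max_age_s=3600.0):
    """Return the snapshot, or None if missing, corrupt, or too old."""
    try:
        with open(path) as f:
            state = json.load(f)
    except (OSError, ValueError):
        return None
    if not isinstance(state, dict):
        return None
    saved_at = state.get("saved_at")
    if not isinstance(saved_at, (int, float)):
        return None
    # Caveat: this age check uses the wall clock, which is the very
    # thing a time daemon cannot fully trust right after a restart.
    if time.time() - saved_at > max_age_s:
        return None
    return state
```

A caller would fall back to a standing start whenever load_state()
returns None.  The one-hour staleness limit is an arbitrary
illustrative default, not a measured value.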

I don't yet know enough to have intuitions about these questions. If
some of you do, please speak up.
--
					>>esr>>