Incompatible changes, security, and good practice

Thu Oct 6 21:06:29 UTC 2016

Mark: Management and policy decisions implicated here.

When I landed Daniel's refactoring of the protcol machine, a couple
of things temporarily broke - MS-SNTP and the broadcast-mode support.

One thing got deliberately and silently changed; unauthenticated
symmetric-peer mode was disabled. As Daniel later said on IRC,

14:57:39       dfranke | But we need to make some change that's   │
                       | going to break something for somebody,   │
                       | because the current situation with       │
                       | symmetric mode, especially symmetric     │
                       | passive mode, is crazy dangerous and     │
                       | doesn't provide any benefit that         │
                       | remotely justifies that danger.          │

This change was well-meant, but it caused a lot of havoc.  In the
rest of this note I'm going to explain where I think we failed to
apply good practice and how we can avoid similar havoc in the future.

There are three issues here: (1) Was the change narrowly justified on
technical grounds? (2) Was it communicated properly to those affected
by it? (3) Was the change correct as a matter of our policy about how
we weight goals such as backward compatibility and increased security?

1. Technical justification

Daniel has explained that an unauthenticated symmetric-peer connection
can readily be exploited to mess with your time corrections by anyone
who can inject packets with spoofed source addresses. This certainly
makes it inappropriate for WAN use; even on a LAN it is not good
practice to rely on your firewall to prevent this.

I am not an infosec specialist, but I can follow enough of this
argument to know it is sound.  It is not difficult to imagine a
datacenter being messed with by a RasPi-class computer snuck into
a wiring closet. It is NTPsec's job to harden NTP against this
sort of thing.

The narrow technical case for disabling unauthenticated symmetric-peer
mode is clear.

2. Proper communication

All kinds of havoc and tsuris ensued because dfranke did not realize
he needed to broadcast a much louder alert about this change.  As he
later said:

14:52:14       dfranke | gemiller: sorry, about the breakage /    │
                       | communication screwup. I hadn't intended │
                       | the code to get merged in the state it's │
                       | in without further policy discussion. We │
                       | started that discussion with an          │
                       | inconclusive result a few months ago,    │
                       | then when esr came back around to        │
                       | merging the refactor I forgot that it    │
                       | was an open issue.                  

All told, Gary and I lost a substantial portion of a working week each.
because of this. The ripple effects reached as far as Matt Selsky and
back to Daniel.

To be fair, part of this was my fault. I made three big changes - the
proto-refactor merge, removing TESTFRAME, and cleaning out struct
timeval operations - too close to each other. And bisection around
a big merge can get tricky.  The result was that Gary, seeing breakage
in symmetric-peer mode, couldn't pin down the bad change properly and
came to think it was due to the TESTFRAME withdrawal.  I thought this
couldn't be right, but had no easy way to prove that.

I looked at the change history, concluded that we were falling into
what scuba divers call an "incident pit" (that's where frantic
attempts at corrective action put you in ever-deeper trouble) and took
the extremely drastic step of force-pushing the public repo back to a
known good state (e.g. before the protocol-refactor merge) so most of
the later work could be restored by small steps amenable to bisection.

This invalidated some private branches held by Matt Selsky and Daniel,
causing them enough trouble that they complained to me on IRC and I
felt required to apologize.  And some of my work from before the head
reset still hasn't made it back in, nearly a week later.

Also, when Gary learned that his mystery bug was probably due to
Daniel making a backwards-incompatible change, he blew up like a
Roman candle on IRC.

All this -- *all* of this -- could have been avoided if Daniel had
been properly careful about tracking the behavior changes due to the
proto-refactor merge and notifying everyone *in advance* what to
expect on this list. At the very least, it was his responsibility
when he realized the merge had landed to remember the kind of
problems it might cause and issue an informative warning as soon
as possible after the fact.

This was a serious breakdown in good practice, measured by its
consequences.  I'm dissecting it not because I want to beat up on
Daniel, but because we all need to learn not to screw up like this.
When you work on a team, you have a positive responsibility to
communicate well about making potentially disruptive changes.

3. Compatibility vs. security tradeoffs.

This is one of those cases - like the proposed change in access
defaults I posted about recently - where there are serious questions
about how we trade off backward compatibility versus improved
security.

Gary is upset about this change because he thinks NTPsec won't get
enough early-adopter uptake for its security advantages to matter if
it's not a drop-in replacement for Classic, including compatibility
with old bugs like unathenticated symmetric peering being allowed.

My position is different.  I am willing to trade away some backward
compatibility for better security, both because I think that's the
right thing to do technically and because I think NTPsec's early
adopters will almost certainly be technically capable people who will
respond well to security-centered explanations of why feature X has
been removed.

Having conflicting theories about this is a recipe for more dissension
and unproductive yelling.  NTP Classic was designed in a more innocent
time; we've had arguments about this tradeoff before and will need
to have them again.  Such arguments are only resolvable if everybody
has roughly compatible assumptions.

Daniel triggered this week's problem, and shouldn't have, but it's not
Daniel's fault that the larger issue is contentious and unresolved.
That fault lies with myself as tech lead and Mark as PM; it was our
responsibility to get the whole team operating from at least a rough
consensus on this, and we haven't met it.

Here's my marker on this issue: I always have my eye on estimates of
how widely any given feature is used.  If I think the answer is
"rarely", but the gain in security and simplicity is significant, then
I'm willing to take the risk of annoying small segments of our
potential userbase in order to better serve much larger segments.

I believe we can mitigate this annoyance to acceptable levels with two
tactics.  (1) Document *very clearly* what we are changing and why,
highlighting the security rationale.  (2) Have --enable-classic-mode
to enable a build in which bug-for-bug compatibility is restored.

I think symmetric peering is a rarely-used feature, and am therefore
willing to drop unauthenticated mode on convincing representation
that it is a security hazard.

If you are of a more conservative turn of mind, there are two ways
to argue against this position. One is to oppose the policy principle,
claiming that we need to be absolute about drop-in compatibility.  The
other is to argue a fact case that unauthenticated symmetric peering
is so widely used and valuable that the security win does not
justify dropping it.

Is anybody willing to argue that we should be absolutists about
bug-for-bug compatibility?  Gary? Anybody else?

If not, if we've accepted *as a team* that features-vs.security
tradeoffs have to be made, then what we need is for our tradeoff
criteria to be clustered close enough together that we don't
have screaming matches about them.

Here's where I am:

I think *everything* other than pure client mode with an ntp.conf full
of dumb defaults is a sufficiently rarely used feature to be sacrified
for any significant security gain.  I understand that some of us think
differently because we're in the tiny minority of those who operate
servers and tinker with ntp.conf. It is perfectly natural that we have
a perspective exaggerating our own concerns, but the truth is that
J. Random Linux or BSD user will never even notice any of the things
we argue about.

If any of you have a principled position different from that, I want
to hear about it.  If you *agree* with that position, let me hear
that, too.  What I want is for us to arrive at a rough consensus
so the arguments stay civilized.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

The kind of charity you can force out of people nourishes about as much as
the kind of love you can buy --- and spreads even nastier diseases.