Sometimes Ignoring Time on Certificates (Was: Re: Docs we will need)

Wed Feb 6 08:49:14 UTC 2019

On 2/5/19 7:49 PM, Richard Laager wrote:
> I have a specific proposal that I'll hopefully write up tonight, which
> may address the needs in this space.
I did some brainstorming on this with a colleague. I initially started
with an approach that would consider the system clock (if after
BUILD_EPOCH), then the drift file (if after BUILD_EPOCH), then accept
anything. But in the course of discussing it, I came up with something
that is a lot simpler and easier to reason about.

To start, let's assume that ntpd has an NTS client implemented in it. It
is doing certificate verification as per usual TLS.

To recap, the problem we are trying to solve is that the system clock
may be sufficiently wrong (e.g. because we have no RTC) that certificate
validation fails. We want to address that while minimizing the impact on
the security provided by NTS.

I'm going to use the word "suspect" here. I'm not married to that word,
but I need something.

A "peer" is an object for a particular association.

Here are the modifications:

1) Where the certificate verification is happening, assume there is some
code with the following effect:

if (!certificate_valid(...))
{
    return fail;
}

that becomes:

if (!certificate_valid(...))
{
    if (!system_clock_set_once_by_ntpd)
    {
        if (certificate_valid_ignoring_times(...))
        {
            peer->suspect = TRUE;
            return success;
        }
    }
    return fail;
}

Note that this is not necessarily what the implementation will actually
look like. This is just describing the effects we need to achieve: if
the system clock has not been set, and the ONLY reason a certificate
failed is due to time issues, mark the peer as suspect and continue as
if it passed validation normally.

We have to be careful about the implementation here. You can't just ask
the SSL library to verify the certificate and then look at its reason
for failure. If it checks times before checking the chain, you would
then conclude that the certificate failed for time, and allow it, when
the certificate chain is potentially broken. That's bad!
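
One way to stay clear of that trap is to run two complete, separate
validations and combine only their pass/fail results, never the
library's error code. Here is a minimal sketch of that decision. The
names (nts_cert_decide, nts_cert_verdict) are made up for illustration.
I'm assuming the caller has some way to run a second full verification
with only the validity-period checks turned off; I believe OpenSSL's
X509_V_FLAG_NO_CHECK_TIME verify flag does this, but treat that as an
assumption to confirm.

```c
#include <stdbool.h>

/* Hypothetical verdicts for one NTS-KE certificate check. */
typedef enum {
    NTS_CERT_REJECT,   /* fail the handshake */
    NTS_CERT_ACCEPT,   /* passed strict validation */
    NTS_CERT_SUSPECT   /* passed only with time checks disabled */
} nts_cert_verdict;

/*
 * strict_ok:      result of a full validation, time checks included.
 * timeless_ok:    result of a second, separate full validation with
 *                 ONLY the notBefore/notAfter checks disabled.
 * clock_set_once: true once ntpd has set the system clock.
 */
nts_cert_verdict nts_cert_decide(bool strict_ok, bool timeless_ok,
                                 bool clock_set_once)
{
    if (strict_ok)
        return NTS_CERT_ACCEPT;
    if (!clock_set_once && timeless_ok)
        return NTS_CERT_SUSPECT;   /* caller sets peer->suspect */
    return NTS_CERT_REJECT;
}
```

Because timeless_ok comes from a full chain check, a broken chain can
never end up in the "suspect" bucket, no matter what order the library
performs its checks in.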

2) In the clock selection algorithm, very early on, add this:

if (peer->suspect) {
    // Suspect peers are ignored ("leave the island"), unless
    // <some condition>.

    // <some condition> is something that indicates we would have
    // "normally" synced the clock by now.  My example was that
    // reach (as output by ntpq -p) was 377 (i.e. we had 8 successful
    // polls on that peer), or maybe allow for one missed? It probably
    // cannot be time passed because the network could be down for an
    // indeterminate length of time when ntpd comes up.

    if (!<some condition>)
        continue;
}

The idea here is that we spin up the suspect associations normally, but
we ignore them for "a while" which would normally be sufficient to set
the clock. If there are enough other associations working, great, we
didn't use the suspect association(s) at all, so there was no loss in
security. Only if we still couldn't set the clock by the time <some
condition> is met would we then fall back to considering the suspect
associations. But, because they have been running the whole time rather
than just starting now, we minimize the time to clock update when we do
need to use the suspect associations.
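
To make <some condition> concrete, here is a rough sketch of the reach
test. The reach value is an 8-bit shift register with one bit per poll,
which ntpq prints in octal, so 377 means the last eight polls all
succeeded. The function names and the min_successes parameter are
hypothetical; passing 7 instead of 8 is the "allow for one missed"
variant.

```c
#include <stdbool.h>
#include <stdint.h>

/* Count how many of the last eight polls got a reply. */
static int reach_successes(uint8_t reach)
{
    int n = 0;
    for (; reach != 0; reach >>= 1)
        n += reach & 1;
    return n;
}

/*
 * Hypothetical <some condition>: a suspect peer becomes eligible for
 * clock selection once enough of its last eight polls succeeded.
 */
bool suspect_peer_usable(uint8_t reach, int min_successes)
{
    return reach_successes(reach) >= min_successes;
}
```

With min_successes of 8, only a reach of 0377 qualifies. With 7, a reach
such as 0376 or 0337 also qualifies.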

3) Once we set the clock the first time, kill all suspect associations
(even those used to set the time), forcing those peers to re-run NTS-KE
and start over. They will either pass normally or fail normally.
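
As a sketch of that step, assuming a flat array of peers and a
hypothetical needs_nts_ke flag that the association code would act on
(neither is how ntpd actually stores its peers):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, pared-down peer object. */
struct peer_sketch {
    bool suspect;       /* set when the cert passed only without time checks */
    bool needs_nts_ke;  /* ask the association to redo NTS-KE from scratch */
};

/*
 * Call once, right after the first successful clock set. Every suspect
 * peer, including any that contributed to that clock set, is flagged to
 * restart. Returns the number of peers restarted.
 */
size_t restart_suspect_peers(struct peer_sketch *peers, size_t count)
{
    size_t restarted = 0;
    for (size_t i = 0; i < count; i++) {
        if (peers[i].suspect) {
            peers[i].suspect = false;
            peers[i].needs_nts_ke = true;
            restarted++;
        }
    }
    return restarted;
}
```

After the restart, the clock is set, so the relaxed path in #1 is no
longer available and each certificate gets strict validation.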

To get attacked, someone would have to present an otherwise valid
certificate. For example, if they have a previously-valid certificate
and key of mine because they previously compromised my time server, they
could use that to give time to RTC-less clients (if the attacker can
also divert the client's network traffic to themselves).

We could stop the implementation at this point, but there are more
enhancements we can do. These things aren't my ideas, except where I'm
adding something. They are in the NTS draft:

4) Because this does weaken security, there should probably be a
configuration option (global or per-peer if it's easier). The default
can either be OFF (which prioritizes security over setting the clock) or
ON (which prioritizes setting the clock over security). The check in #1
would change like this:
-    if (!system_clock_set_once_by_ntpd)
+    if (suspect_peers_allowed && !system_clock_set_once_by_ntpd)

This is suggested in the draft:
      Allow the system administrator to specify that certificates should
      *always* be strictly validated.  Such a configuration is
      appropriate on systems which have a battery-backed clock and which
      can reasonably prompt the user to manually set an approximately-
      correct time if it appears to be needed.

5) To improve the security, implement the check as described in the draft:

      Do not process time packets from servers if the time computed from
      them falls outside the validity period of the server's
      certificate.  However, clients should not perform a new NTS-KE
      handshake solely based on the fact that the certificate used by
      the NTS-KE server in a previous handshake has expired, if the
      client has previously received valid NTS protected NTP replies
      that lay within the certificate's validity time.

That is, when we perform the NTS validation, we should keep the
certificate's notBefore and notAfter times in the peer object. This
should probably be done for all NTS peers, not just "suspect" ones.

My enhancement to this would be... It might be wise to record the latest
notBefore and earliest notAfter of all the certificates in the chain as
well as the stapled OCSP response* if you have one. A normal CA will not
issue certificates that are valid longer than their root or
intermediate, but an attacker who has compromised an intermediate (e.g.
from a company's private CA) could do so in an effort to defeat this check.
That said, given the long lifetimes of roots and intermediates, this is
not likely to add much, though it won't ever hurt.
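
Computing that narrowest window is just an intersection. A minimal
sketch, with times as seconds since the epoch and hypothetical names
throughout; each array entry would be one certificate in the chain, or
the validity interval of the stapled OCSP response if there is one:

```c
#include <stddef.h>
#include <stdint.h>

/* One validity interval, in seconds since the epoch. */
struct validity {
    int64_t not_before;
    int64_t not_after;
};

/* Latest notBefore and earliest notAfter over all the given intervals. */
struct validity narrowest_window(const struct validity *items, size_t count)
{
    struct validity w = { INT64_MIN, INT64_MAX };
    for (size_t i = 0; i < count; i++) {
        if (items[i].not_before > w.not_before)
            w.not_before = items[i].not_before;
        if (items[i].not_after < w.not_after)
            w.not_after = items[i].not_after;
    }
    return w;
}
```

The result is what would be stored in the peer object for the later
time checks.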

If the NTP server gives (authenticated) time earlier than the notBefore
time, discard that time (or mark that peer a falseticker).

As described in the draft, discard time after notAfter, but only if we
haven't yet received valid time from the server. As the draft mentions,
you do NOT want to do a new NTS-KE handshake just because you passed the
initial notAfter time. If you did this, the clients would create a
thundering herd hitting the server right after the certificate expires.
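
Put together, the two rules might look like the following sketch. The
struct and function names are invented for illustration, and times are
seconds since the epoch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-peer state kept from the NTS-KE handshake. */
struct nts_window {
    int64_t not_before;
    int64_t not_after;
    bool seen_valid_time;  /* got a sample inside the window at least once */
};

/* Returns true if a sample carrying server_time should be processed. */
bool nts_time_acceptable(struct nts_window *w, int64_t server_time)
{
    if (server_time < w->not_before)
        return false;               /* always discard */
    if (server_time > w->not_after)
        return w->seen_valid_time;  /* tolerate expiry on a known-good peer */
    w->seen_valid_time = true;
    return true;
}
```

Note that nothing here triggers a new NTS-KE handshake. A peer that
already delivered in-window time simply keeps running past notAfter.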

These checks should never trigger on legitimate traffic, as that would
mean the NTP server disagrees with its NTS-KE server's CA about time.

In the worst case where the client has only one time server, this check
provides some value. It means the attacker can only set a time that
corresponds to the validity of the certificate they compromised, not any
arbitrary time.

Imagine you have three servers: a time server, a web server, and a
database server. The attacker has, in the past, stolen a key (and cert)
from the time server and database server. This attack was noticed and
cleaned up, with the holes patched, everything reinstalled,
certificates revoked, passwords changed, etc.

The web server is getting time from only the single time server. The web
server speaks TLS to the database server, with full certificate
verification. The attacker hasn't re-compromised the servers, but has
somehow been able to MITM the traffic. Their goal is to get the web
server to connect to them as the database server and give away the (new)
database password.

The web server starts up, talks to the attacker (thinking they are the
time server). They present a previously valid, but expired, certificate.
The client, not able to trust its own system clock, ultimately takes
time from the attacker. Now, the web server starts talking to the
attacker (thinking they are the database server).

This rule about the NTP time having to fall within the NTS certificate
time means that the compromised database server certificate must have
overlapping validity with the compromised time server certificate. If it
does, the attacker can serve time in that overlap and succeed. But if
those were compromised at separate, non-overlapping times, the attack
will fail. So this check adds some security.

If the client has multiple time servers, the attacker would have to
compromise enough (minsane) of them to get the client to take time from
them. If they manage to do that, this rule additionally limits the times
they can serve to only times overlapping enough (minsane) of the time
servers.

* OCSP stapling can dramatically reduce the allowed-times-to-serve
window. If the certificate is marked "must staple", then if there is no
OCSP stapled response, the certificate validation fails. This is a
validation failure other than time, so it would not trigger my "suspect"
behavior; it would be a hard fail. Now, the attacker can certainly
replay an old stapled OCSP response when they replay the certificate
corresponding to the key they compromised, but note that OCSP staples
are typically valid for around a week. That is much shorter than the 3
months for a Let's Encrypt certificate or up to 2 years for a purchased
certificate.

Some interesting notes on OCSP stapling:
https://blog.cloudflare.com/high-reliability-ocsp-stapling/

Also, here are some notes about how to implement good OCSP stapling on
the server side:
https://gist.github.com/sleevi/5efe9ef98961ecfb4da8
linked from:
https://community.letsencrypt.org/t/ocsp-stapling-advantages-and-disadvantages/34465/11

OCSP stapling support in the NTS-KE server is certainly NOT a "first
ship" requirement.

6) The draft also says:
      Once the clock has been synchronized, periodically write the
      current system time to persistent storage.  Do not accept any
      certificate whose notAfter field is earlier than the last recorded
      time.

This prevents replay of expired certificates to a system that was synced
at some point after their expiration.

This does have the same problem as I mentioned with the ntp drift file
timestamp. If ntpd sets your system clock way in the future (e.g.
because your GPS rolled over to a year of 99 -> 2099), ntpd will also
happily write that to the file, and now you're going to ignore every
valid certificate.
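
The check itself is a single comparison. A sketch, with a hypothetical
function name and times as seconds since the epoch:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * last_recorded is the time most recently written to persistent
 * storage. A certificate is refused if it expired before that time.
 */
bool cert_passes_time_floor(int64_t not_after, int64_t last_recorded)
{
    return not_after >= last_recorded;
}
```

The failure mode above falls straight out of this: once a bogus
far-future time is recorded, every currently valid certificate has a
notAfter earlier than last_recorded and is refused.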

-- 
Richard