Time to slow down and be more careful

Eric S. Raymond esr at thyrsus.com
Mon Apr 17 22:40:03 UTC 2017

Hal Murray <hmurray at megapathdsl.net>:
> esr at thyrsus.com said:
> > This morning, while investigating a recent code change that smelled bad to
> > me, I discovered that an error cascade of small, wrong changes starting some
> > weeks ago had destroyed the mechanism that would allow instances of ntpd to
> > interoperate across the epoch 1 boundary in 2036. 
> Could you please say more.  If I screwed up, I'd like to learn something from 
> it.

You did, but only in a minor way.  You removed an entry point from
libntp/ntp_calendar.c that the pivot code had needed for cross-era
resolution before that code was wrongly deleted.  This complicated the fix -
I had to figure out which parts of your cleanup to revert - but it
was no part of the original error.

I wouldn't have been surprised if you had noticed that the code that
exercised that entry point shouldn't have been removed (you're good at
being that kind of careful), but it wasn't really your responsibility
to notice; it was the tech lead's, i.e. mine.

For *you*, I think the only lesson out of this one is to be more careful
about dead-code removal.  There was a lot of really useless stuff in the
codebase at fork time, but I ripped most of that crap out last year. Now,
if you see what looks like dead code, you need to double- and triple-check
whether it should have a call site that was wrongly dropped.  This
will probably involve sniffing around the Classic tree a bit.

> Looking back, I should have written something about how that stuff works.  
> It's in several messages but never made it to a file that got committed.
> I think the old code converted l_fp to full time.  That needs a pivot.
> I changed things so that there is never a conversion from l_fp to full time.  
> There is a subtract done on the l_fp side.  The clock offset in l_fp is 
> converted to an offset in seconds.  I think it's a double.  That eventually 
> turns into a clock adjustment.  There is no explicit pivot.  There is an 
> implicit pivot of the current time.

I'm actually not sure which code you're talking about here, and I think it's
important that I should know.
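
To make sure we mean the same arithmetic, here is a minimal sketch of
the scheme as I read your description: subtract on the fixed-point side,
then convert only the difference to a double.  The type and function
names are hypothetical, not taken from the tree, and I am assuming the
usual 32.32 layout (seconds in the high word, fraction in the low word).

```c
#include <stdint.h>

/* Hypothetical stand-in for l_fp: 32 bits of seconds, 32 bits of fraction. */
typedef uint64_t lfp_sketch;

/* Difference a - b in seconds.  The subtraction wraps modulo 2^64, so the
 * result is right whenever the two stamps are less than half a cycle
 * (about 68 years) apart, even across an era rollover. */
static double lfp_diff_seconds(lfp_sketch a, lfp_sketch b)
{
    uint64_t d = a - b;                      /* wraps modulo 2^64 */
    int64_t sd;

    /* Reinterpret the wrapped difference as signed without relying on
     * implementation-defined conversions. */
    if (d < 0x8000000000000000ull)
        sd = (int64_t)d;
    else
        sd = -(int64_t)(~d) - 1;

    return (double)sd / 4294967296.0;        /* scale by 2^-32 */
}
```

Nothing here ever builds a full-width time; the only pivot is the
implicit one supplied by whichever stamp is treated as "now".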

> That turns into a requirement for the time to be reasonably close before ntpd 
> is started.  Reasonably close is within 68 years.

A half cycle, yes.  That's the same constraint the original Mills code has. The
underlying modular arithmetic is clear to me, even if its relationship
to the implementation remains a bit murky.
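
For the record, the modular arithmetic I have in mind can be sketched
as below.  This is an illustration under my own assumptions, not the
code in libntp: ntp_expand() is a hypothetical name, and the pivot is
expressed as a 64-bit count of seconds since the NTP epoch of 1900.

```c
#include <stdint.h>

/* Expand a 32-bit NTP seconds field to a 64-bit count of seconds since
 * 1900 by choosing the candidate within half a cycle (2^31 seconds,
 * about 68 years) of the pivot.  Hypothetical helper for illustration. */
static int64_t ntp_expand(uint32_t ntp_secs, int64_t pivot)
{
    /* Distance from the pivot, modulo 2^32. */
    uint32_t diff = ntp_secs - (uint32_t)pivot;
    int64_t delta;

    /* Map the wrapped distance into the range [-2^31, 2^31). */
    if (diff < 0x80000000u)
        delta = (int64_t)diff;
    else
        delta = (int64_t)diff - 4294967296LL;

    return pivot + delta;
}
```

With a pivot just before the 2036 rollover, a wire value of 100 comes
out as 2^32 + 100, that is, 100 seconds into era 1.  If the pivot is
off by more than half a cycle, the result lands in the wrong era, which
is exactly the 68-year constraint.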

> That will screw up in 2038 on systems like the Raspberry Pi that don't have a 
> battery backed RTC.  I see two ways to fix that.  One would be to put a pivot 
> like time stamp into ntpd and early in the startup sequence, bump the clock 
> if the current time is earlier than the pivot.  The other would be to run 
> some other program before starting ntpd.  That program could use a compiled 
> in time stamp or look in the file system or ...

This is going to take some very careful work, with planning and discussion
beforehand.  *After* 1.0.
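
Purely to fix ideas for that discussion, the first option (a compiled-in
stamp checked early in startup) might reduce to something like the
following.  Everything here is hypothetical: the BUILD_PIVOT name, its
value, and the helper are mine, and the sketch only computes the time
to step to; actually setting the clock is left out.

```c
#include <time.h>

/* Hypothetical compiled-in stamp: 2017-04-17 00:00:00 UTC as Unix time. */
#define BUILD_PIVOT ((time_t)1492387200)

/* If the system clock reads earlier than the pivot, it cannot be right,
 * so return the pivot as the time to step to; otherwise leave the
 * current reading alone. */
static time_t sanity_floor(time_t now, time_t pivot)
{
    return (now < pivot) ? pivot : now;
}
```

A box that boots with its clock at 1970 would be bumped forward to the
build stamp, which puts it back within half a cycle of true time for
the following 68 years or so.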
		Eric S. Raymond  http://www.catb.org/~esr/

Please consider contributing to my Patreon page at https://www.patreon.com/esr
so I can keep the invisible wheels of the Internet turning. Give generously -
the civilization you save might be your own.
