Tracker bugs and our release process

Eric S. Raymond esr at thyrsus.com
Mon Aug 14 15:47:54 UTC 2017


I've spent the last week triaging and resolving items from the NTPsec
issue tracker.  We're making excellent progress; the count of
unresolved issues has gone from 41 to 15.

I shall round up the remaining issues and discuss where I think our
priorities need to be.

Summary:

* I need to work on #356: reverse function for restrict

* unpeer should be made to fully work from ntpq :config.  This one is mine too.

* In my opinion, our only real blocker is #347: ntpd doesn't synchronize
  quickly.  This is one for Gary or Hal (our guys with operations and
  measurement experience), and I'd appreciate it if one of you stepped up.

* There are two waf recipe bugs that I'm completely blocked on, despite
  having stared at them a lot. We need a waf expert, but I don't know
  where to find one.

* waf configure needs a --unitdir option.  Matt Selsky was going to do
  it but it hasn't landed yet. Matt, can you schedule time to complete
  this?

* We need RPM packaging.  No volunteer has followed through on this yet.

* We have a NetBSD port bug that should be easy to fix, but I can't do
  it; no test access.  Matt Selsky is the logical person to tackle
  this.

* I have written and documented an implementation of config directories
  that some of our other devs don't like.  I don't think we'll have
  time to resolve that argument before 1.0, so I'm going to mark this
  feature unstable/experimental in the documentation and hope we
  don't get flamed if we change it.

* We have a couple of serious issues with the GPSD_JSON driver, a
  half-baked experimental feature of Classic. 

Following the details section I have a summary of requests to our devs.

Details:

---------------------------------------------------------------------------
#356: RFE: reverse function for restrict
https://gitlab.com/NTPsec/ntpsec/issues/356

Hans Meyer: "The current implementation of NTPsec allows to configure
detailed restrictions. Command line tool "ntpq" can be used to define
restrictions during runtime. But the current implementation doesn't
allow to remove already defined restrictions. "restrict" can only add
definitions even if the attributes define less permissions. Therefore
I ask for a reverse function like 'release' or 'unblock'."

I was going to let this RFE slide until after 1.0, but there are two
reasons not to.  One is that we're light on user-visible features for
a 1.0.  The other is that Meyer has been our most persistent outside
beta tester, and making him happy to keep him engaged seems like a
good idea.

I have to do this one, nobody else knows the configuration machinery well
enough.  Difficulty seems moderate.  Probably a couple days of work.
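
To make the request concrete, here is a minimal Python sketch of the
semantics being asked for.  It is a toy model, not the ntpd
configuration machinery: the RestrictTable class and its method names
(including "unrestrict") are hypothetical, and it assumes a
restriction is identified by the network it applies to.

```python
import ipaddress


class RestrictTable:
    """Toy model of a restriction list keyed by network (hypothetical)."""

    def __init__(self):
        self._entries = {}  # ip_network -> set of flag names

    def restrict(self, network, flags):
        # Adding can only accumulate flags on an entry, which is the
        # limitation the RFE complains about.
        net = ipaddress.ip_network(network)
        self._entries.setdefault(net, set()).update(flags)

    def unrestrict(self, network):
        # The requested reverse operation: drop the entry entirely.
        # Returns True if something was removed, False otherwise.
        net = ipaddress.ip_network(network)
        return self._entries.pop(net, None) is not None

    def flags_for(self, address):
        # The most specific matching network wins; no match means no flags.
        addr = ipaddress.ip_address(address)
        matches = [n for n in self._entries if addr in n]
        if not matches:
            return set()
        best = max(matches, key=lambda n: n.prefixlen)
        return set(self._entries[best])
```

With this model, table.restrict("192.0.2.0/24", {"noquery"}) followed
by table.unrestrict("192.0.2.0/24") leaves 192.0.2.7 with no flags,
which is the state that currently can't be reached at runtime.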

---------------------------------------------------------------------------
#348: server statement not checking for valid IP to be resolvable
https://gitlab.com/NTPsec/ntpsec/issues/348

Configuring a server with a typo in its name produces a bogus peer
entry that (naturally) hangs in INIT state forever. It can't be
removed with unpeer.

There are two issues here.  One is that unpeer is not doing what it
should.  That is a bug and needs to be fixed.  The other is whether
ntpd should retry failed peer-name lookups.

There's an argument in the bug thread over whether it should, and if
so how often. Currently it does not.

Arguments for: (1) Allows recovery from temporary DNS failures, (2)
deals with any possible boot-time race between DNS coming up and NTP
coming up.  (I note, however, that the latter seems to be only a
theoretical problem; I've never seen a bug report that clearly matches
this scenario.)

Arguments against: (1) Additional code complexity, (2) DDoS risk.

In my mind, "against" wins. Here's why:

The users of ntpsec will be divided into two cohorts.  99% will never
use anything but a canned configuration that talks to pool servers.
For these people, a new set of retry-policy knobs will be useless;
they never even look at their configs! The other 1% is experienced
time sysadmins who use ntpq and are quite capable of noticing an entry
stuck in INIT or STEP state and dealing with it manually.

At best, adding another policy knob could only help part of that 1% -
and people in that group don't qualify new hosts very often, anyway.

Conclusion: adding a retry facility Classic never had isn't a good
idea. Making unpeer work, on the other hand, seems worth doing.

(Anybody who wants to argue with this decision should do so in the issue
thread, not here.)

---------------------------------------------------------------------------
#347: ntpd doesn't synchronize quickly
https://gitlab.com/NTPsec/ntpsec/issues/347

Expected time to first sync has increased since 0.9.7.  I consider
this an important place to not let the competition win.

This is the only tracker bug I consider a release blocker.  We need to
bisect and figure out what change slowed us down, and fix it.

Hal suspects his DNS changes of a few months ago might be implicated.
He's the logical person to work this.
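
The bisection itself is mechanical once there is a pass/fail test.
The Python sketch below shows the kind of search that git bisect
performs.  The commit names, the measured sync times, and the
60-second threshold are made-up illustrations, and it assumes that
once the slowdown appears it persists in every later commit.

```python
def first_bad(commits, is_bad):
    """Return the index of the first commit for which is_bad() is true.

    Assumes commits[0] is good, commits[-1] is bad, and that every
    commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # slowdown is at mid or earlier
        else:
            lo = mid  # slowdown was introduced after mid
    return hi


# Hypothetical seconds-to-first-sync per commit, oldest first.
sync_seconds = {"c0": 31, "c1": 30, "c2": 33, "c3": 95, "c4": 97, "c5": 96}
commits = sorted(sync_seconds)
culprit = commits[first_bad(commits, lambda c: sync_seconds[c] > 60)]
print(culprit)  # prints c3
```

The hard part in practice is not the search but the test: time to
first sync is noisy, so each probe probably needs several runs before
a commit can be called good or bad.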

---------------------------------------------------------------------------
#312: pyc generated files do not have matching timestamps
https://gitlab.com/NTPsec/ntpsec/issues/312

Something is not quite right in our waf recipe. The three files in
question are generated with some rather odd productions in
pylib/wscript that tla helped me develop.

The fix for this would almost certainly be trivial if we knew what it
was. The real problem here is that waf is so badly documented that
troubleshooting problems like this is extremely difficult.

We need a waf expert. I don't know where to find one. I've stared at
this problem a lot but gotten nowhere.

---------------------------------------------------------------------------
#273: No repo or cache detected
https://gitlab.com/NTPsec/ntpsec/issues/273

Another waf recipe problem I have not been able to gain a clue about.
As before, we need a waf expert.

---------------------------------------------------------------------------
#270: Loss of precision in step_systime()
https://gitlab.com/NTPsec/ntpsec/issues/270

This isn't going to get done in 1.0.  Gary and I need to have a design
argument (with Hal pitching in) about how pivoting works, and should
work.  This is a particularly murky area of Mills's code - I'm not
sure *any* of us understands it right.

---------------------------------------------------------------------------
#269: Update and install systemd services if user requires them
https://gitlab.com/NTPsec/ntpsec/issues/269

This one seems mostly resolved.  Matt Selsky promised to add a
--unitdir option that would do the rest. Matt, can you finish that?

---------------------------------------------------------------------------
#252: Need an RPM package
https://gitlab.com/NTPsec/ntpsec/issues/252

Yes, we do.  Occasionally we get a volunteer surfacing on #ntpsec to
do this, but nobody has followed up yet.

I've put my apprentice Keane (Dr. Daemoneye) on this problem.  He
thinks he can have results this week.

---------------------------------------------------------------------------
#251: Add fudge option to server config
https://gitlab.com/NTPsec/ntpsec/issues/251

Gary and Daniel are having an argument over whether this is a good idea.

Me, I'd rather not do it.  Just to keep life simple.  But they
understand the terrain in ways I don't.

---------------------------------------------------------------------------
#220: ntpc.so is unable to resolve libpython2.7.1.0 on NetBSD
https://gitlab.com/NTPsec/ntpsec/issues/220

This appears to be a waf recipe problem, not passing -R/usr/pkg/lib
to the linker as it should. Matt, you can test on NetBSD.  Can you
follow up on this?

---------------------------------------------------------------------------
#204: Support /etc/ntp.d
https://gitlab.com/NTPsec/ntpsec/issues/204

There is disagreement about how this should work.  Probably not to be
resolved before 1.0.

---------------------------------------------------------------------------
#62: Refclock #20 behaves perversely on GPS signal loss.
https://gitlab.com/NTPsec/ntpsec/issues/62

I see the problem Gary is describing, but I don't know if a fix is
possible even in principle. Gary, if you have a problem analysis that
suggests a fix, please describe in the issue thread.  If you don't,
tell me so we can document this as a known (unsolvable) problem.

---------------------------------------------------------------------------
#57: Refclock #46, GPSD_JSON, bad NMEA time
https://gitlab.com/NTPsec/ntpsec/issues/57
#55: ntpd refclock #46 just stops working.
https://gitlab.com/NTPsec/ntpsec/issues/55

I've grouped these together because they are aspects of the same
problem: the GPSD_JSON driver was a bad idea to begin with and
is in pretty crappy shape internally.

As the designer of GPSD_JSON, I am in a unique position to be able
to say to the world "this was a bad idea and I'm killing it".  I
intend to do exactly that before 1.0 if it doesn't get fixed.

---------------------------------------------------------------------------
#44: Confusion with drift at the rail
https://gitlab.com/NTPsec/ntpsec/issues/44

I don't fully understand this issue. I need Hal, who raised it, to
suggest at least a theoretical fix.

---------------------------------------------------------------------------

Work requests:

I don't normally like to try to hand out assignments or get people to
commit to doing them, but coming up on a release I need to have some
idea what we can realistically get done and where we need to somehow
recruit extra help.

Gary:
  Our top priority needs to be #347, slow startup.  I need to know
  that either you or Hal is on this and will nail it down.

  Also it's up to you to save the GPSD_JSON driver.  I don't think
  anyone else is invested in it, and I'd frankly prefer dropping it
  to trying to fix it. #57, #55.

  Also I need a better characterization of #62.

  If you can, please tackle these in roughly the order listed.

Matt:
   You took on being our build-system expert a while back, which puts
   #312 #273 #269 #220 on your list.  I hate to stick you with trying
   to decrypt the waf docs, but there isn't anyone obviously better
   equipped.

Hal:
   Either Gary needs to be on #347 or you do.  There's also #44, our
   oldest open bug.

Keane:
   You've taken on #252.   

Myself:
   #356 and #348 are obviously mine.  And I'm the backstop for everybody
   else, which is why I'm not assigning myself more up front.

I've put corresponding assignments on the tracker issues.

RSVP, everybody.  I need to know what you can do and are
willing to do.  Remember, September 28th. 

If we get through these there are maybe some more fun things we can do
before release.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

Gun Control: The theory that a woman found dead in an alley, raped and
strangled with her panty hose, is somehow morally superior to a
woman explaining to police how her attacker got that fatal bullet wound.
	-- L. Neil Smith

