From esr at thyrsus.com Mon Aug 7 16:58:13 2017
From: esr at thyrsus.com (Eric S. Raymond)
Date: Mon, 7 Aug 2017 12:58:13 -0400 (EDT)
Subject: Time to plan for 1.0
Message-ID: <20170807165813.BF82413A0206@snark.thyrsus.com>

Summary:

* We need to start working towards a 1.0 release no later than 28
  September.

* I need our senior devs to identify any release-blocker issues
  and tell me what they think our pre-release priorities should be.

Details:

On Saturday, I had a phone conversation with Mark Atwood during which
he apologized to me and the team for being pretty absent recently. I
assured him that we all get it about adjusting to a senior position
at Amazon being enough to eat anyone's bandwidth.

Then yesterday (Sunday), at an ICEI planning meeting, Susan Sons
revealed a hard deadline for an NTPsec 1.0 release. For fundraising
purposes she needs it to be out by the O'Reilly infosec conference on
28 October.

If I had believed that Mark was going to be back on stream in the near
future I would have left it to him to respond. As it is, and
considering my evaluation of the state of the project, I assured Susan
that Oct 28 was doable and committed us to it.

Since Mark and I were previously discussing an end-of-summer release
date, I doubt he will object. If and when Mark becomes available I
will cheerfully defer to his judgment about state of readiness and
release timing, if we have not already shipped. In the meantime,
I'll step up.

This might have been a tougher call, but since early summer we've
basically been polishing (Ian's AgentX work will be a nice-to-have but
I do not regard it as essential for a 1.0 release). I experimentally
faded out of view for a couple of weeks to find out if the project
would stall or hit serious difficulty without my hand on things. It
didn't. I found that reassuring.

Accordingly, I told Susan that if she needed us to ship a week from
*now*, it would be a bit hair-raising but doable.
I've seen nothing on the issue list that I think is a blocker. But I
need our devs to tell me if I'm missing anything, and what set of
priorities we should put on pending work.

I'd like to aim for no later than 28 September. That way we'll be
able to report not just first ship but a month of field experience.

If anyone thinks my assumptions are incorrect, speak up quickly,
please. Otherwise let's ID what we need to get done and do it. I
actually think we ought to be fully able to ship in three weeks
(that is, around 28 August); let's try for that.

Gary, Hal, Matt, Daniel: Would all of you check in on this, please?
-- 
Eric S. Raymond

Non-cooperation with evil is as much a duty as cooperation with good.
-- Mohandas Gandhi

From gem at rellim.com Mon Aug 7 17:07:26 2017
From: gem at rellim.com (Gary E. Miller)
Date: Mon, 7 Aug 2017 10:07:26 -0700
Subject: Time to plan for 1.0
In-Reply-To: <20170807165813.BF82413A0206@snark.thyrsus.com>
References: <20170807165813.BF82413A0206@snark.thyrsus.com>
Message-ID: <20170807100726.09441b65@spidey.rellim.com>

Yo Eric!

On Mon, 7 Aug 2017 12:58:13 -0400 (EDT)
"Eric S. Raymond via devel" wrote:

> * We need to start working towards a 1.0 release no later than 28
> September.

Very doable, good plan to do so.

> If anyone thinks my assumptions are incorrect, speak up quickly,
> please.

I'm 100% with your assumptions.

> Otherwise let's ID what we need to get done and do it. I
> actually think we ought to be fully able to ship in three weeks
> (that is, around 28 August); let's try for that.

OK.

> Gary, Hal, Matt, Daniel: Would all of you check in on this, please?

Done.

RGDS
GARY
---------------------------------------------------------------------------
Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97703
gem at rellim.com Tel:+1 541 382 8588

Veritas liberabit vos. -- Quid est veritas?
"If you can't measure it, you can't improve it."
- Lord Kelvin

From dfoxfranke at gmail.com Mon Aug 7 17:18:14 2017
From: dfoxfranke at gmail.com (Daniel Franke)
Date: Mon, 7 Aug 2017 13:18:14 -0400
Subject: Time to plan for 1.0
In-Reply-To: <20170807165813.BF82413A0206@snark.thyrsus.com>
References: <20170807165813.BF82413A0206@snark.thyrsus.com>
Message-ID:

If we're aiming for a September 28 release then I propose we should
have a dev freeze by September 1. Bug fixes only during that month;
anything that's mere polishing goes on a branch.

I don't want to release 1.0 without having
https://tools.ietf.org/html/draft-ietf-ntp-data-minimization-01
implemented. As of the last IETF meeting I'm confident that there
aren't going to be any significant normative changes before it's
finalized. I'll make the time for this before my proposed September 1
freeze.

From gem at rellim.com Mon Aug 7 17:44:21 2017
From: gem at rellim.com (Gary E.
Miller)
Date: Mon, 7 Aug 2017 10:44:21 -0700
Subject: Time to plan for 1.0
In-Reply-To:
References: <20170807165813.BF82413A0206@snark.thyrsus.com>
Message-ID: <20170807104421.16338573@spidey.rellim.com>

Yo Daniel!

On Mon, 7 Aug 2017 13:18:14 -0400
Daniel Franke via devel wrote:

> I don't want to release 1.0 without having
> https://tools.ietf.org/html/draft-ietf-ntp-data-minimization-01
> implemented. As of the last IETF meeting I'm confident that there
> aren't going to be any significant normative changes before it's
> finalized. I'll make the time for this before my proposed September 1
> freeze.

+1

Any other upcoming RFC's we may also want to work on?

RGDS
GARY
---------------------------------------------------------------------------
Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97703
gem at rellim.com Tel:+1 541 382 8588

Veritas liberabit vos. -- Quid est veritas?
"If you can't measure it, you can't improve it." - Lord Kelvin

From jdb at systemsartisans.com Mon Aug 7 17:52:12 2017
From: jdb at systemsartisans.com (John D. Bell)
Date: Mon, 7 Aug 2017 13:52:12 -0400
Subject: Time to plan for 1.0
In-Reply-To: <20170807165813.BF82413A0206@snark.thyrsus.com>
References: <20170807165813.BF82413A0206@snark.thyrsus.com>
Message-ID:

I thought that one necessity before 1.0 were at least preliminary
"packaged" version for the major distros - i.e., .deb and .rpm files,
conformant to the conventions (file locations, etc.) of the systems that
used them.

Am I wrong? If not, do you know what the status of these are?

On 08/07/2017 12:58 PM, Eric S. Raymond via devel wrote:
> Summary:
>
> * We need to start working towards a 1.0 release no later than 28 September.
>
> * I need our senior devs to identify any release-blocker issues
> and tell me what they think our pre-release priorities should be.
>
> ....

- *John D. Bell*

From esr at thyrsus.com Mon Aug 7 19:51:13 2017
From: esr at thyrsus.com (Eric S. Raymond)
Date: Mon, 7 Aug 2017 15:51:13 -0400
Subject: Time to plan for 1.0
In-Reply-To:
References: <20170807165813.BF82413A0206@snark.thyrsus.com>
Message-ID: <20170807195113.GA7950@thyrsus.com>

John D. Bell :
>
> I thought that one necessity before 1.0 were at least preliminary
> "packaged" version for the major distros - i.e., .deb and .rpm files,
> conformant to the conventions (file locations, etc.) of the systems that
> used them.
>
> Am I wrong? If not, do you know what the status of these are?

This came up at the staff meeting; unexpectedly, Mark was there. The
only packaging he seems to consider must-have is Debian.
-- 
Eric S. Raymond

Please consider contributing to my Patreon page at https://www.patreon.com/esr
so I can keep the invisible wheels of the Internet turning. Give generously -
the civilization you save might be your own.

From gem at rellim.com Mon Aug 7 19:59:00 2017
From: gem at rellim.com (Gary E. Miller)
Date: Mon, 7 Aug 2017 12:59:00 -0700
Subject: Time to plan for 1.0
In-Reply-To: <20170807195113.GA7950@thyrsus.com>
References: <20170807165813.BF82413A0206@snark.thyrsus.com>
 <20170807195113.GA7950@thyrsus.com>
Message-ID: <20170807125900.491007ba@spidey.rellim.com>

Yo Eric!

On Mon, 7 Aug 2017 15:51:13 -0400
"Eric S. Raymond via devel" wrote:

> John D. Bell :
> >
> > I thought that one necessity before 1.0 were at least preliminary
> > "packaged" version for the major distros - i.e., .deb and .rpm
> > files, conformant to the conventions (file locations, etc.) of the
> > systems that used them.
> >
> > Am I wrong? If not, do you know what the status of these are?
>
> This came up at the staff meeting; unexpectedly, Mark was there. The
> only packaging he seems to consider must-have is Debian.

And we have a chicken and egg thing with the distro maintainers, they do
not want to work with a pre-1.0 version.

RGDS
GARY
---------------------------------------------------------------------------
Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97703
gem at rellim.com Tel:+1 541 382 8588

Veritas liberabit vos. -- Quid est veritas?
"If you can't measure it, you can't improve it." - Lord Kelvin

From ianbruene at gmail.com Mon Aug 7 20:04:34 2017
From: ianbruene at gmail.com (Ian Bruene)
Date: Mon, 7 Aug 2017 15:04:34 -0500
Subject: Fwd: Re: Time to plan for 1.0
In-Reply-To:
References:
Message-ID: <03550c2f-c751-0092-3e2c-d8829e8ebc49@gmail.com>

Note to self: check the reply address in the future.

-------- Forwarded Message --------
Subject: Re: Time to plan for 1.0
Date: Mon, 7 Aug 2017 14:59:14 -0500
From: Ian Bruene
To: Eric S. Raymond

On 08/07/2017 11:58 AM, Eric S. Raymond via devel wrote:
> If anyone thinks my assumptions are incorrect, speak up quickly,
> please. Otherwise let's ID what we need to get done and do it. I
> actually think we ought to be fully able to ship in three weeks
> (that is, around 28 August); let's try for that.
>
> Gary, Hal, Matt, Daniel: Would all of you check in on this, please?

Only thing I know of is bug #341, which I have repeatedly forgot or
deliberately ignored because other things were higher priority. *If*
the bug still even exists, as I have heard nothing about it since
back when it was posted.

I just need to rendezvous with Gary to either mark it fixed or track it
and stomp it into the ground.

-- 
In the end; what separates a Man, from a Slave? Money? Power?
No. A Man Chooses, a Slave Obeys.
-- Andrew Ryan

From gem at rellim.com Mon Aug 7 20:47:04 2017
From: gem at rellim.com (Gary E. Miller)
Date: Mon, 7 Aug 2017 13:47:04 -0700
Subject: Time to plan for 1.0
In-Reply-To: <03550c2f-c751-0092-3e2c-d8829e8ebc49@gmail.com>
References: <03550c2f-c751-0092-3e2c-d8829e8ebc49@gmail.com>
Message-ID: <20170807134704.7b7d6caf@spidey.rellim.com>

Yo Ian!

On Mon, 7 Aug 2017 15:04:34 -0500
Ian Bruene via devel wrote:

> Only thing I know of is bug #341, which I have repeatedly forgot or
> deliberately ignored because other things were higher priority. *If*
> the bug still even exists, as I have heard nothing about it since
> back when it was posted.

A hard one as it is very intermittent. I do see your new code to try to
mitigate. I'm running 6 copies of ntpmon on 6 different hosts. If it
runs overnight we can call it closed.

RGDS
GARY
---------------------------------------------------------------------------
Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97703
gem at rellim.com Tel:+1 541 382 8588

Veritas liberabit vos. -- Quid est veritas?
"If you can't measure it, you can't improve it." - Lord Kelvin

From jdb at systemsartisans.com Mon Aug 7 20:52:40 2017
From: jdb at systemsartisans.com (John D. Bell)
Date: Mon, 7 Aug 2017 16:52:40 -0400
Subject: Time to plan for 1.0
In-Reply-To: <20170807125900.491007ba@spidey.rellim.com>
References: <20170807165813.BF82413A0206@snark.thyrsus.com>
 <20170807195113.GA7950@thyrsus.com>
 <20170807125900.491007ba@spidey.rellim.com>
Message-ID: <326686dc-ba1c-5f78-1b6d-39ab74e45c66@systemsartisans.com>

Sorry I missed the meeting. I understand why distro maintainers wouldn't
want pre-1.0 code.
I thought that you would release 1.0 both as the usual source repo
_and_ with "example" packages. I would expect that the maintainers might
well repackage to suit their tastes.

-- 
- *John D. Bell*

From esr at thyrsus.com Mon Aug 7 21:03:24 2017
From: esr at thyrsus.com (Eric S. Raymond)
Date: Mon, 7 Aug 2017 17:03:24 -0400
Subject: Time to plan for 1.0
In-Reply-To:
References: <20170807165813.BF82413A0206@snark.thyrsus.com>
Message-ID: <20170807210324.GB7950@thyrsus.com>

Daniel Franke :
> If we're aiming for a September 28 release then I propose we should
> have a dev freeze by September 1. Bug fixes only during that month;
> anything that's mere polishing goes on a branch.
>
> I don't want to release 1.0 without having
> https://tools.ietf.org/html/draft-ietf-ntp-data-minimization-01
> implemented. As of the last IETF meeting I'm confident that there
> aren't going to be any significant normative changes before it's
> finalized. I'll make the time for this before my proposed September 1
> freeze.

I'm not sure that declaring a dev freeze would be meaningful. The
only activity that would be affected (that is, the only stuff not
directly addressing a pending tracker issue) would be Ian's work on
the snmpd server. I'm willing to consider that a special case, since
it's completely isolated from the C core and we might not ship it in
1.0 anyway.

About all a dev freeze would mean is that we start deferring feature
MRs that touch the C.

I'd also like to avoid holding the release to an artificially late
date. You want data minimization in; Mark wants Debian packaging.
That's fine but I think we should plan in terms of shipping very soon
after those land, like with a three-day cool-down.

When is the soonest you think you can work in data minimization?
-- 
Eric S. Raymond

Please consider contributing to my Patreon page at https://www.patreon.com/esr
so I can keep the invisible wheels of the Internet turning. Give generously -
the civilization you save might be your own.

From esr at thyrsus.com Mon Aug 7 21:08:09 2017
From: esr at thyrsus.com (Eric S. Raymond)
Date: Mon, 7 Aug 2017 17:08:09 -0400
Subject: Time to plan for 1.0
In-Reply-To: <20170807134704.7b7d6caf@spidey.rellim.com>
References: <03550c2f-c751-0092-3e2c-d8829e8ebc49@gmail.com>
 <20170807134704.7b7d6caf@spidey.rellim.com>
Message-ID: <20170807210809.GD7950@thyrsus.com>

Gary E. Miller via devel :
> Yo Ian!
>
> On Mon, 7 Aug 2017 15:04:34 -0500
> Ian Bruene via devel wrote:
>
> > Only thing I know of is bug #341, which I have repeatedly forgot or
> > deliberately ignored because other things were higher priority.
> > *If* the bug still even exists, as I have heard nothing about it
> > since back when it was posted.
>
> A hard one as it is very intermittent. I do see your new code to
> try to mitigate. I'm running 6 copies of ntpmon on 6 different hosts.
> If it runs overnight we can call it closed.

OK. You two are on top of this, I'll take that as dispositive.
-- 
Eric S. Raymond

Please consider contributing to my Patreon page at https://www.patreon.com/esr
so I can keep the invisible wheels of the Internet turning. Give generously -
the civilization you save might be your own.

From esr at thyrsus.com Mon Aug 7 21:10:50 2017
From: esr at thyrsus.com (Eric S. Raymond)
Date: Mon, 7 Aug 2017 17:10:50 -0400
Subject: Time to plan for 1.0
In-Reply-To: <20170807100726.09441b65@spidey.rellim.com>
References: <20170807165813.BF82413A0206@snark.thyrsus.com>
 <20170807100726.09441b65@spidey.rellim.com>
Message-ID: <20170807211050.GE7950@thyrsus.com>

Gary E. Miller :
> > Gary, Hal, Matt, Daniel: Would all of you check in on this, please?
>
> Done.

Good. Please confirm that you're not identifying any potential blockers
other than #341.
-- 
Eric S. Raymond

Please consider contributing to my Patreon page at https://www.patreon.com/esr
so I can keep the invisible wheels of the Internet turning. Give generously -
the civilization you save might be your own.

From hmurray at megapathdsl.net Tue Aug 8 05:48:07 2017
From: hmurray at megapathdsl.net (Hal Murray)
Date: Mon, 07 Aug 2017 22:48:07 -0700
Subject: Time to plan for 1.0
In-Reply-To: Message from "Eric S. Raymond via devel" of "Mon, 07 Aug 2017 12:58:13 EDT."
<20170807165813.BF82413A0206@snark.thyrsus.com>
Message-ID: <20170808054807.A3BCD40605C@ip-64-139-1-69.sjc.megapath.net>

There is one bug that I think should get fixed. I don't know the number.

ntpq fails too often on flaky links. I seem to be the only one who notices
it.

I assume it's something in the retransmission logic.

-- 
These are my opinions. I hate spam.

From esr at thyrsus.com Tue Aug 8 10:27:25 2017
From: esr at thyrsus.com (Eric S. Raymond)
Date: Tue, 8 Aug 2017 06:27:25 -0400
Subject: Time to plan for 1.0
In-Reply-To: <20170808054807.A3BCD40605C@ip-64-139-1-69.sjc.megapath.net>
References: <20170807165813.BF82413A0206@snark.thyrsus.com>
 <20170808054807.A3BCD40605C@ip-64-139-1-69.sjc.megapath.net>
Message-ID: <20170808102725.GB19389@thyrsus.com>

Hal Murray :
>
> There is one bug that I think should get fixed. I don't know the number.
>
> ntpq fails too often on flaky links. I seem to be the only one who notices
> it.
>
> I assume it's something in the retransmission logic.

Are all these sentences describing one problem, or is there a bug
*other* than failed retransmission on flaky links?
-- 
Eric S. Raymond

Please consider contributing to my Patreon page at https://www.patreon.com/esr
so I can keep the invisible wheels of the Internet turning. Give generously -
the civilization you save might be your own.

From ianbruene at gmail.com Tue Aug 8 13:12:56 2017
From: ianbruene at gmail.com (Ian Bruene)
Date: Tue, 8 Aug 2017 08:12:56 -0500
Subject: Time to plan for 1.0
In-Reply-To: <20170808102725.GB19389@thyrsus.com>
References: <20170807165813.BF82413A0206@snark.thyrsus.com>
 <20170808054807.A3BCD40605C@ip-64-139-1-69.sjc.megapath.net>
 <20170808102725.GB19389@thyrsus.com>
Message-ID:

This morning on #ntpsec:

07:45 <@esr> ianbruene: I knew the snmpd project would lead to much yak
shaving. This is actually part of the reason I viewed it as a good
training opportunity.
When you're not under time pressure and can thus afford to get the
bovid tonsuring *right* - well, that is the Tibetan spelling of
"learning opportunity". Or should be.

I was already on the verge of calling it, given the above I am now
Officially Calling SNMP support as slipping past 1.0.

The pattern of recent development has been: work on ntpsnmpd for a few
minutes, go shave a dozen yaks in agentx.py. I see no reason why this
would change by very much before ntpsnmpd is near complete. And that is
before counting the Yaks I already know should be shaved without any
specific trigger from ntpsnmpd.

-- 
In the end; what separates a Man, from a Slave? Money? Power?
No. A Man Chooses, a Slave Obeys.
-- Andrew Ryan

From ianbruene at gmail.com Tue Aug 8 13:36:55 2017
From: ianbruene at gmail.com (Ian Bruene)
Date: Tue, 8 Aug 2017 08:36:55 -0500
Subject: Time to plan for 1.0
In-Reply-To:
References: <20170807165813.BF82413A0206@snark.thyrsus.com>
 <20170808054807.A3BCD40605C@ip-64-139-1-69.sjc.megapath.net>
 <20170808102725.GB19389@thyrsus.com>
Message-ID:

Oh, and I can confirm that the code in agentx.py (also ntpsnmpd but I
haven't merged that) is completely separate from everything else.

agentx.py can be ripped out for 1.0 without problems, it could be shipped in
a perfectly broken state without problems outside of being /incredibly/ bad
form Old Chap.

-- 
In the end; what separates a Man, from a Slave? Money? Power?
No. A Man Chooses, a Slave Obeys.
-- Andrew Ryan

From esr at thyrsus.com Tue Aug 8 14:01:30 2017
From: esr at thyrsus.com (Eric S.
Raymond)
Date: Tue, 8 Aug 2017 10:01:30 -0400
Subject: Time to plan for 1.0
In-Reply-To:
References: <20170807165813.BF82413A0206@snark.thyrsus.com>
 <20170808054807.A3BCD40605C@ip-64-139-1-69.sjc.megapath.net>
 <20170808102725.GB19389@thyrsus.com>
Message-ID: <20170808140130.GA26265@thyrsus.com>

Ian Bruene via devel :
> I was already on the verge of calling it, given the above I am now
> Officially Calling SNMP support as slipping past 1.0.

Strictly speaking you can't do that. It's the PM's decision whether
we want to drop the feature or change the schedule. We're pretty informal
here, but you need to know where those lines of authority are, because
many other projects (especially in corporate-land) aren't.

But, speaking in Mark's absence, I'm OK with slipping it. It's not
like Classic had this right; their snmpd was broken by design (not
RFC-conformant) and in fact its author urged me to remove it.

I think our support can wait for 1.1. Possibly Mark might do an
override on this, but I doubt it.
-- 
Eric S. Raymond

Please consider contributing to my Patreon page at https://www.patreon.com/esr
so I can keep the invisible wheels of the Internet turning. Give generously -
the civilization you save might be your own.

From esr at thyrsus.com Tue Aug 8 14:02:43 2017
From: esr at thyrsus.com (Eric S. Raymond)
Date: Tue, 8 Aug 2017 10:02:43 -0400
Subject: Time to plan for 1.0
In-Reply-To:
References: <20170807165813.BF82413A0206@snark.thyrsus.com>
 <20170808054807.A3BCD40605C@ip-64-139-1-69.sjc.megapath.net>
 <20170808102725.GB19389@thyrsus.com>
Message-ID: <20170808140243.GB26265@thyrsus.com>

Ian Bruene via devel :
>
> Oh, and I can confirm that the code in agentx.py (also ntpsnmpd but I
> haven't merged that) is completely separate from everything else.
>
> agentx.py can be ripped out for 1.0 without problems, it could be shipped in
> a perfectly broken state without problems outside of being /incredibly/ bad
> form Old Chap.

No big deal, we can leave it in the repo but out of the tarball.
-- 
Eric S. Raymond

Please consider contributing to my Patreon page at https://www.patreon.com/esr
so I can keep the invisible wheels of the Internet turning. Give generously -
the civilization you save might be your own.

From ianbruene at gmail.com Tue Aug 8 14:31:05 2017
From: ianbruene at gmail.com (Ian Bruene)
Date: Tue, 8 Aug 2017 09:31:05 -0500
Subject: Time to plan for 1.0
In-Reply-To: <20170808140130.GA26265@thyrsus.com>
References: <20170807165813.BF82413A0206@snark.thyrsus.com>
 <20170808054807.A3BCD40605C@ip-64-139-1-69.sjc.megapath.net>
 <20170808102725.GB19389@thyrsus.com>
 <20170808140130.GA26265@thyrsus.com>
Message-ID: <4d53f917-db65-af6b-4282-20ce3e79e250@gmail.com>

On 08/08/2017 09:01 AM, Eric S. Raymond wrote:
> Strictly speaking you can't do that. It's the PM's decision whether
> we want to drop the feature or change the schedule. We're pretty informal
> here, but you need to know where those lines of authority are, because
> many other projects (especially in corporate-land) aren't.

Ah, I misunderstood the structure. Also forgot that there is something
called a PM and it exists for a reason....

I am Officially Suggesting that SNMP support be allowed to slip 1.0 unless I
can make major progress in the next couple weeks.

> But, speaking in Mark's absence, I'm OK with slipping it. It's not
> like Classic had this right; their snmpd was broken by design (not
> RFC-conformant) and in fact its author urged me to remove it.

NTPc having broken SNMP support was a major factor in my willingness to slip
for rightness, I wasn't aware that it was so bad the maintainer urged you to
remove it.

> I think our support can wait for 1.1. Possibly Mark might do an
> override on this, but I doubt it.

Is there any idea on how long that would be after 1.0? I'm not saying
anything about "plenty of time", already eating my "plenty of time till 1.0"
statement.

-- 
In the end; what separates a Man, from a Slave?
Money? Power?
No. A Man Chooses, a Slave Obeys.
-- Andrew Ryan

From esr at thyrsus.com Tue Aug 8 16:16:30 2017
From: esr at thyrsus.com (Eric S. Raymond)
Date: Tue, 8 Aug 2017 12:16:30 -0400
Subject: Time to plan for 1.0
In-Reply-To: <4d53f917-db65-af6b-4282-20ce3e79e250@gmail.com>
References: <20170807165813.BF82413A0206@snark.thyrsus.com>
 <20170808054807.A3BCD40605C@ip-64-139-1-69.sjc.megapath.net>
 <20170808102725.GB19389@thyrsus.com>
 <20170808140130.GA26265@thyrsus.com>
 <4d53f917-db65-af6b-4282-20ce3e79e250@gmail.com>
Message-ID: <20170808161630.GA27904@thyrsus.com>

Ian Bruene :
> On 08/08/2017 09:01 AM, Eric S. Raymond wrote:
> >Strictly speaking you can't do that. It's the PM's decision whether
> >we want to drop the feature or change the schedule. We're pretty informal
> >here, but you need to know where those lines of authority are, because
> >many other projects (especially in corporate-land) aren't.
>
> Ah, I misunderstood the structure. Also forgot that there is something
> called a PM and it exists for a reason....

We're a bit unusual that way, a hybrid of normal open-source practice
with the way things are done in corporate shops.

Most open-source projects don't have a PM, because they don't have
funding that needs PM-style oversight. The role may be partly filled
by a senior dev with the defined role of "release manager" (Battle for
Wesnoth, for example, does it that way), but under those circumstances
it is in fact more likely that a subsystem or feature owner would get
to make the call about what release first code ships in.

(In this context you are the subsystem owner of the Python tools. I
dropped that on you because you seemed ready for it.)

> I am Officially Suggesting that SNMP support be allowed to slip 1.0 unless I
> can make major progress in the next couple weeks.

Thank you, that is correct procedure. Acting in Mark's absence I
concur; we'll slip it if you don't have it done.
Mark has the privilege to override this decision but I will be quite
surprised if he does.

> NTPc having broken SNMP support was a major factor in my willingness to slip
> for rightness, I wasn't aware that it was so bad the maintainer urged you to
> remove it.

Yes. The Classic code was a prototype written before RFC 5907
("Definitions of Managed Objects for Network Time Protocol Version 4")
and nobody ever got around to syncing the behavior to the RFC after it
issued. Frankly, it should have been deleted from the Classic tree
long before I nuked it from ours, but they for whatever reasons never
removed old cruft.

> >I think our support can wait for 1.1. Possibly Mark might do an
> >override on this, but I doubt it.
>
> Is there any idea on how long that would be after 1.0? I'm not saying
> anything about "plenty of time", already eating my "plenty of time till 1.0"
> statement.

Mmmm...I think 90 to 120 days would be a reasonable guess.

In any case, my direction to you is "all deliberate speed". I get
that you like to work hard and respect that, but I don't want you
knocking yourself out for a feature that's not urgent, especially when
ICEI might want to second you for work on Hathi or Fabricode that
really do have deadlines.

Unless something in the external conditions of the project changes
(like, say, a wealthy donor with an urgent desire for SNMP support)
the main near-term value of this particular feature is staff
development - that is, improving your skills. That's why I don't mind
the yak shaving. The yak shaving is almost the point here.

It's not quite the same case as the Python client tools, where having
them by damn *work* is more important than using them to advance your
training. Thus, I would step in to fix things if you weren't up to
it. You haven't made that necessary.

Your priorities should be, in this order, (1) Any NTPsec tracker
issues you can close, (2) Any deadline-sensitive work you are seconded
to on other projects, (3) AgentX.
Just tell us when it's done. -- Eric S. Raymond Please consider contributing to my Patreon page at https://www.patreon.com/esr so I can keep the invisible wheels of the Internet turning. Give generously - the civilization you save might be your own. From ianbruene at gmail.com Tue Aug 8 17:24:12 2017 From: ianbruene at gmail.com (Ian Bruene) Date: Tue, 8 Aug 2017 12:24:12 -0500 Subject: Time to plan for 1.0 In-Reply-To: <20170808161630.GA27904@thyrsus.com> References: <20170807165813.BF82413A0206@snark.thyrsus.com> <20170808054807.A3BCD40605C@ip-64-139-1-69.sjc.megapath.net> <20170808102725.GB19389@thyrsus.com> <20170808140130.GA26265@thyrsus.com> <4d53f917-db65-af6b-4282-20ce3e79e250@gmail.com> <20170808161630.GA27904@thyrsus.com> Message-ID: <218e7197-3a0d-13fb-ee28-77f7b7c67000@gmail.com> On 08/08/2017 11:16 AM, Eric S. Raymond wrote: > Thank you, that is correct procedure. Acting in Mark's absence I > concur; we'll slip it if you don't have it done. Mark has the > privilege to override this decision but I will be quite surprised if > he does. Well now I'm increasing this to Official /Strong/ Suggestion: the packet.py testing project just got moved to the top of the list because of bug #341. And IIRC the code in packet.py is entangled with the comm channels in ways that may require a significant test jig, or refactor. Either way AgentX got plonked onto the backburner. > Your priorities should be, in this order, (1) Any NTPsec tracker > issues you can close, (2) Any deadline-sensitive work you are seconded > to on other projects, (3) AgentX. Just tell us when it's done. Got it (see above). -- In the end; what separates a Man, from a Slave? Money? Power? No. A Man Chooses, a Slave Obeys. -- Andrew Ryan From esr at thyrsus.com Tue Aug 8 17:36:01 2017 From: esr at thyrsus.com (Eric S. 
Raymond) Date: Tue, 8 Aug 2017 13:36:01 -0400 Subject: Time to plan for 1.0 In-Reply-To: <218e7197-3a0d-13fb-ee28-77f7b7c67000@gmail.com> References: <20170807165813.BF82413A0206@snark.thyrsus.com> <20170808054807.A3BCD40605C@ip-64-139-1-69.sjc.megapath.net> <20170808102725.GB19389@thyrsus.com> <20170808140130.GA26265@thyrsus.com> <4d53f917-db65-af6b-4282-20ce3e79e250@gmail.com> <20170808161630.GA27904@thyrsus.com> <218e7197-3a0d-13fb-ee28-77f7b7c67000@gmail.com> Message-ID: <20170808173601.GA29838@thyrsus.com> Ian Bruene : > Well now I'm increasing this to Official /Strong/ Suggestion: the packet.py > testing project just got moved to the top of the list because of bug #341. > And IIRC the code in packet.py is entangled with the comm channels in ways > that may require a significant test jig, or refactor. Either way AgentX got > plonked onto the backburner. Correctly called. Do that. -- Eric S. Raymond Please consider contributing to my Patreon page at https://www.patreon.com/esr so I can keep the invisible wheels of the Internet turning. Give generously - the civilization you save might be your own. From gem at rellim.com Tue Aug 8 18:41:35 2017 From: gem at rellim.com (Gary E. Miller) Date: Tue, 8 Aug 2017 11:41:35 -0700 Subject: Time to plan for 1.0 In-Reply-To: References: <20170807165813.BF82413A0206@snark.thyrsus.com> <20170808054807.A3BCD40605C@ip-64-139-1-69.sjc.megapath.net> <20170808102725.GB19389@thyrsus.com> Message-ID: <20170808114135.20fbbf99@spidey.rellim.com> Yo Ian! On Tue, 8 Aug 2017 08:36:55 -0500 Ian Bruene via devel wrote: > Oh, and I can confirm that the code in agentx.py (also ntpsnmpd but I > haven't merged that) is completely separate from everything else. > > agentx.py can be ripped out for 1.0 without problems, it could be > shipped in a perfectly broken state without problems outside of being > /incredibly/ bad form Old Chap. If it does anything useful at all then I'd prefer that the snmp code be shipped with 1.0. 
Just make sure it is clearly marked as WIP. You never know, someone else may get the itch to finish it for you. RGDS GARY --------------------------------------------------------------------------- Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97703 gem at rellim.com Tel:+1 541 382 8588 Veritas liberabit vos. -- Quid est veritas? "If you can't measure it, you can't improve it." - Lord Kelvin From hmurray at megapathdsl.net Wed Aug 9 04:54:36 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Tue, 08 Aug 2017 21:54:36 -0700 Subject: Time to plan for 1.0 In-Reply-To: Message from "Eric S. Raymond via devel" of "Tue, 08 Aug 2017 06:27:25 EDT." <20170808102725.GB19389@thyrsus.com> Message-ID: <20170809045436.D24AA40605C@ip-64-139-1-69.sjc.megapath.net> >> ntpq fails too often on flaky links. I seem to be the only one >> who notices it. >> I assume it's something in the retransmission logic. > Are all these sentences describing one problem, or is there a > bug *other* than failed retransmission on flaky links? I'm not sure what you are asking. I think failed-retransmissions would explain everything but I can't rule out something else instead and/or in addition. -- These are my opinions. I hate spam. From esr at thyrsus.com Wed Aug 9 05:03:27 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Wed, 9 Aug 2017 01:03:27 -0400 Subject: Time to plan for 1.0 In-Reply-To: <20170809045436.D24AA40605C@ip-64-139-1-69.sjc.megapath.net> References: <20170808102725.GB19389@thyrsus.com> <20170809045436.D24AA40605C@ip-64-139-1-69.sjc.megapath.net> Message-ID: <20170809050327.GB7243@thyrsus.com> Hal Murray : > >> ntpq fails too often on flaky links. I seem to be the only one > >> who notices it. > >> I assume it's something in the retransmission logic.
> > > Are all these sentences describing one problem, or is there a > > bug *other* than failed retransmission on flaky links? > > I'm not sure what you are asking. > > I think failed-retransmissions would explain everything but I can't rule out > something else instead and/or in addition. I was trying to verify that the bug number referred to the same problem as the rest of your text description. -- Eric S. Raymond Please consider contributing to my Patreon page at https://www.patreon.com/esr so I can keep the invisible wheels of the Internet turning. Give generously - the civilization you save might be your own. From hmurray at megapathdsl.net Wed Aug 9 05:27:43 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Tue, 08 Aug 2017 22:27:43 -0700 Subject: Time to plan for 1.0 In-Reply-To: Message from "Eric S. Raymond via devel" of "Wed, 09 Aug 2017 01:03:27 EDT." <20170809050327.GB7243@thyrsus.com> Message-ID: <20170809052743.55E4B40605C@ip-64-139-1-69.sjc.megapath.net> > I was trying to verify that the bug number referred to > the same problem as the rest of your text description. I expect there is a bug matching what I described, but I'm not sure and I'm not in a position to search for it. -- These are my opinions. I hate spam. From hmurray at megapathdsl.net Wed Aug 9 05:43:22 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Tue, 08 Aug 2017 22:43:22 -0700 Subject: Testing, 1.0, startup time Message-ID: <20170809054322.2133740605C@ip-64-139-1-69.sjc.megapath.net> One of the things we need to verify is that our code starts as fast as ntp classic. I think they get going in 11 seconds. It's on their web someplace. I don't know how to easily automate that. A hack to scan log files might help. This may be the tip of an iceberg, but I can't think of any other examples. -- These are my opinions. I hate spam. From gem at rellim.com Wed Aug 9 06:13:09 2017 From: gem at rellim.com (Gary E.
Miller) Date: Tue, 8 Aug 2017 23:13:09 -0700 Subject: Time to plan for 1.0 In-Reply-To: <20170809045436.D24AA40605C@ip-64-139-1-69.sjc.megapath.net> References: <20170808102725.GB19389@thyrsus.com> <20170809045436.D24AA40605C@ip-64-139-1-69.sjc.megapath.net> Message-ID: <20170808231309.2e71ad31@spidey.rellim.com> Yo Hal! On Tue, 08 Aug 2017 21:54:36 -0700 Hal Murray via devel wrote: > >> ntpq fails too often on flaky links. I seem to be the only one > >> who notices it. > >> I assume it's something in the retransmission logic. > > > Are all these sentences describing one problem, or is there a > > bug *other* than failed retransmission on flaky links? > > I'm not sure what you are asking. > > I think failed-retransmissions would explain everything but I can't > rule out something else instead and/or in addition. I see the same thing. ntpq uses much of the same code as ntpmon, so I find leaving ntpmon running is smoking out a bunch of hard failure. Since I am querying localhost I am not sure why I get any failures at all, cutting them down, but there are still a lot of interesting failure cases. RGDS GARY --------------------------------------------------------------------------- Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97703 gem at rellim.com Tel:+1 541 382 8588 Veritas liberabit vos. -- Quid est veritas? "If you can't measure it, you can't improve it." - Lord Kelvin From esr at thyrsus.com Wed Aug 9 08:37:11 2017 From: esr at thyrsus.com (Eric S.
Raymond) Date: Wed, 9 Aug 2017 04:37:11 -0400 Subject: Testing, 1.0, startup time In-Reply-To: <20170809054322.2133740605C@ip-64-139-1-69.sjc.megapath.net> References: <20170809054322.2133740605C@ip-64-139-1-69.sjc.megapath.net> Message-ID: <20170809083711.GA9798@thyrsus.com> Hal Murray : > One of the things we need to verify is that our code starts as fast as ntp > classic. I think they get going in 11 seconds. It's on their web someplace. > > I don't know how to easily automate that. A hack to scan log files might > help. Watching ntpmon, timing it to see when the first peer goes to '*' status, might do it. > This may be the tip of an iceberg, but I can't think of any other examples. We have a pending bug about that: #347. The state of play two weeks ago, before I went on vacation, seemed to be that startup time has slowed recently. Mat Nordoff's comment was: "Sounds like the [slowdown] bug was introduced after 0.9.7." I seem to recall that you were worried about your DNS-lookup changes slowing startup. Were you able to put that concern to rest? -- Eric S. Raymond Please consider contributing to my Patreon page at https://www.patreon.com/esr so I can keep the invisible wheels of the Internet turning. Give generously - the civilization you save might be your own. From ianbruene at gmail.com Wed Aug 9 11:34:17 2017 From: ianbruene at gmail.com (Ian Bruene) Date: Wed, 9 Aug 2017 06:34:17 -0500 Subject: Time to plan for 1.0 In-Reply-To: <20170808231309.2e71ad31@spidey.rellim.com> References: <20170808102725.GB19389@thyrsus.com> <20170809045436.D24AA40605C@ip-64-139-1-69.sjc.megapath.net> <20170808231309.2e71ad31@spidey.rellim.com> Message-ID: <3783864b-ff13-9496-2399-192b9256fe91@gmail.com> On 08/09/2017 01:13 AM, Gary E. Miller via devel wrote: > I find leaving ntpmon running is smoking out a bunch of hard failure. In hindsight (yeah) we should have thought of this - *I* should have thought of this - a long time ago. -- In the end; what separates a Man, from a Slave?
Money? Power? No. A Man Chooses, a Slave Obeys. -- Andrew Ryan From hmurray at megapathdsl.net Wed Aug 9 16:21:15 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Wed, 09 Aug 2017 09:21:15 -0700 Subject: Testing, 1.0, startup time In-Reply-To: Message from "Eric S. Raymond" of "Wed, 09 Aug 2017 04:37:11 EDT." <20170809083711.GA9798@thyrsus.com> Message-ID: <20170809162115.EA7B040605C@ip-64-139-1-69.sjc.megapath.net> > I seem to recall that you were worried about your > DNS-lookup changes slowing startup. Were you able to > put that concern to rest? I've assumed that any concerns about fast startup would avoid DNS. We should compare new DNS, old DNS, and ntp classic with DNS. -- These are my opinions. I hate spam. From gem at rellim.com Wed Aug 9 16:28:04 2017 From: gem at rellim.com (Gary E. Miller) Date: Wed, 9 Aug 2017 09:28:04 -0700 Subject: Time to plan for 1.0 In-Reply-To: <3783864b-ff13-9496-2399-192b9256fe91@gmail.com> References: <20170808102725.GB19389@thyrsus.com> <20170809045436.D24AA40605C@ip-64-139-1-69.sjc.megapath.net> <20170808231309.2e71ad31@spidey.rellim.com> <3783864b-ff13-9496-2399-192b9256fe91@gmail.com> Message-ID: <20170809092804.6ac2cc04@spidey.rellim.com> Yo Ian! On Wed, 9 Aug 2017 06:34:17 -0500 Ian Bruene via devel wrote: > On 08/09/2017 01:13 AM, Gary E. Miller via devel wrote: > > I find leaving ntpmon running is smoking out a bunch of hard > > failure. > > In hindsight (yeah) we should have thought of this We did, issue #341 is two months old. And was known before then. RGDS GARY --------------------------------------------------------------------------- Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97703 gem at rellim.com Tel:+1 541 382 8588 Veritas liberabit vos. -- Quid est veritas? "If you can't measure it, you can't improve it." - Lord Kelvin
From esr at thyrsus.com Mon Aug 14 15:47:54 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Mon, 14 Aug 2017 11:47:54 -0400 (EDT) Subject: Tracker bugs and our release process Message-ID: <20170814154754.D167913A0206@snark.thyrsus.com> I've spent the last week triaging and resolving items from the NTPsec issue tracker. We're making excellent progress; the count of unresolved issues has gone from 41 to 15. I shall round up the remaining issues and discuss where I think our priorities need to be. Summary: * I need to work on #356: reverse function for restrict * unpeer should be made to fully work from ntpq :config. This one is mine too. * In my opinion, our only real blocker is #347: ntpd doesn't synchronize quickly. This is one for Gary or Hal (our guys with operations and measurement experience), and I'd appreciate it if one of you stepped up. * There are two waf recipe bugs that I'm completely blocked on, despite having stared at them a lot. We need a waf expert, but I don't know where to find one. * waf configure needs a --unitdir option. Matt Selsky was going to do it but it hasn't landed yet. Matt, can you schedule time to complete this? * We need RPM packaging. No volunteer has followed through on this yet. * We have a NetBSD port bug that should be easy to fix, but I can't do it; no test access. Matt Selsky is the logical person to tackle this. * I have written and documented an implementation of config directories that some of our other devs don't like. I don't think we'll have time to resolve that argument before 1.0, so I'm going to mark this feature unstable/experimental in the documentation and hope we don't get flamed if we change it. * We have a couple of serious issues with the GPSD_JSON driver, a half-baked experimental feature of Classic. Following the details section I have a summary of requests to our devs.
Details: --------------------------------------------------------------------------- #356: RFE: reverse function for restrict https://gitlab.com/NTPsec/ntpsec/issues/356 Hans Meyer: "The current implementation of NTPsec allows to configure detailed restrictions. Command line tool "ntpq" can be used to define restrictions during runtime. But the current implementation doesn't allow to remove already defined restrictions. "restrict" can only add definitions even if the attributes define less permissions. Therefore I ask for a reverse function like 'release' or 'unblock'." I was going to let this RFE slide until after 1.0, but there are two reasons not to. One is that we're light on user-visible features for a 1.0. The other is that Meyer has been our most persistent outside beta tester, and making him happy to keep him engaged seems like a good idea. I have to do this one; nobody else knows the configuration machinery well enough. Difficulty seems moderate. Probably a couple days of work. --------------------------------------------------------------------------- #348: server statement not checking for valid IP to be resolvable https://gitlab.com/NTPsec/ntpsec/issues/348 Configuring a server with a typo in its name produces a bogus peer entry that (naturally) hangs in INIT state forever. It can't be removed with unpeer. There are two issues here. One is that unpeer is not doing what it should. That is a bug and needs to be fixed. The other is whether ntpd should retry failed peer-name lookups. There's an argument in the bug thread over whether it should, and if so how often. Currently it does not. Arguments for: (1) Allows recovery from temporary DNS failures, (2) deals with any possible boot-time race between DNS coming up and NTP coming up. (I note, however, that the latter seems to be only a theoretical problem; I've never seen a bug report that clearly matches this scenario.) Arguments against: (1) Additional code complexity, (2) DDoS risk.
In my mind, "against" wins. Here's why: The users of ntpsec will be divided into two cohorts. 99% will never use anything but a canned configuration that talks to pool servers. For these people, a new set of retry-policy knobs will be useless; they never even look at their configs! The other 1% is experienced time sysadmins who use ntpq and are quite capable of noticing an entry stuck in INIT or STEP state and dealing with it manually. At best, adding another policy knob could only help part of that 1% - and people in that group don't qualify new hosts very often, anyway. Conclusion: adding a retry facility Classic never had isn't a good idea. Making unpeer work, on the other hand, seems worth doing. (Anybody who wants to argue with this decision should do so in the issue thread, not here.) --------------------------------------------------------------------------- #347: ntpd doesn't synchronize quickly https://gitlab.com/NTPsec/ntpsec/issues/347 Expected time to first sync has increased since 0.9.7. I consider this an important place to not let the competition win. This is the only tracker bug I consider a release blocker. We need to bisect and figure out what change slowed us down, and fix it. Hal suspects his DNS changes of a few months ago might be implicated. He's the logical person to work this. --------------------------------------------------------------------------- #312: pyc generated files do not have matching timestamps https://gitlab.com/NTPsec/ntpsec/issues/312 Something is not quite right in our waf recipe. The three files in question are generated with some rather odd productions in pylib/wscript that tla helped me develop. The fix for this would almost certainly be trivial if we knew what it was. The real problem here is that waf is so badly documented that troubleshooting problems like this is extremely difficult. We need a waf expert. I don't know where to find one. I've stared at this problem a lot but gotten nowhere. 
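For anyone picking up #312, here is a minimal diagnostic sketch in Python. It assumes the complaint is that the source mtime recorded in a .pyc header disagrees with the mtime of the .py file next to it; the issue text is not quoted here, so that reading is an assumption. The function names are hypothetical. The header offsets are CPython's: the 4-byte mtime sits at offset 8 on 3.7 and later (after the magic number and a flags word) and at offset 4 on 2.7 and 3.3 to 3.6.

```python
import os
import struct


def pyc_embedded_mtime(pyc_path, mtime_offset=8):
    """Return the source mtime stored in a .pyc header.

    mtime_offset=8 is the CPython 3.7+ layout (magic, flags, mtime, size);
    pass 4 for CPython 2.7 and 3.3-3.6, where mtime follows the magic number.
    """
    with open(pyc_path, "rb") as f:
        header = f.read(mtime_offset + 4)
    if len(header) < mtime_offset + 4:
        raise ValueError("truncated .pyc header: %s" % pyc_path)
    if mtime_offset == 8:
        # On 3.7+ a nonzero flags word marks a hash-based .pyc, which
        # stores a source hash in this slot instead of an mtime.
        (flags,) = struct.unpack("<I", header[4:8])
        if flags != 0:
            raise ValueError("hash-based .pyc has no mtime: %s" % pyc_path)
    (mtime,) = struct.unpack("<I", header[mtime_offset:mtime_offset + 4])
    return mtime


def pyc_matches_source(source_path, pyc_path, mtime_offset=8):
    """True if the .pyc header mtime equals the source file's mtime."""
    # The header keeps only the low 32 bits of the truncated mtime.
    source_mtime = int(os.stat(source_path).st_mtime) & 0xFFFFFFFF
    return pyc_embedded_mtime(pyc_path, mtime_offset) == source_mtime
```

Running pyc_matches_source() over the three generated files after an install step would at least show whether the mismatch is introduced at build time or at install time, which narrows down where in the waf recipe to look.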
--------------------------------------------------------------------------- #273: No repo or cache detected https://gitlab.com/NTPsec/ntpsec/issues/273 Another waf recipe problem I have not been able to gain a clue about. As before, we need a waf expert. --------------------------------------------------------------------------- #270: Loss of precision in step_systime() https://gitlab.com/NTPsec/ntpsec/issues/270 This isn't going to get done in 1.0. Gary and I need to have a design argument (with Hal pitching in) about how pivoting works, and should work. This is a particularly murky area of Mills's code - I'm not sure *any* of us understands it right. --------------------------------------------------------------------------- #269: Update and install systemd services if user requires them https://gitlab.com/NTPsec/ntpsec/issues/269 This one seems mostly resolved. Matt Selsky promised to add a --unitdir option that would do the rest. Matt, can you finish that? --------------------------------------------------------------------------- #252: Need an RPM package https://gitlab.com/NTPsec/ntpsec/issues/252 Yes, we do. Occasionally we get a volunteer surfacing on #ntpsec to do this, but nobody has followed up yet. I've put my apprentice Keane (Dr. Daemoneye) on this problem. He thinks he can have results this week. --------------------------------------------------------------------------- #251: Add fudge option to server config https://gitlab.com/NTPsec/ntpsec/issues/251 Gary and Daniel are having an argument over whether this is a good idea. Me, I'd rather not do it. Just to keep life simple. But they understand the terrain in ways I don't. --------------------------------------------------------------------------- #220: ntpc.so is unable to resolve libpython2.7.1.0 on NetBSD https://gitlab.com/NTPsec/ntpsec/issues/220 This appears to be a waf recipe problem, not passing -R/usr/pkg/lib to the linker as it should. Matt, you can test on NetBSD. 
Can you follow up on this? --------------------------------------------------------------------------- #204: Support /etc/ntp.d https://gitlab.com/NTPsec/ntpsec/issues/204 There is disagreement about how this should work. Probably not to be resolved before 1.0. --------------------------------------------------------------------------- #62: Refclock #20 behaves perversely on GPS signal loss. https://gitlab.com/NTPsec/ntpsec/issues/62 I see the problem Gary is describing, but I don't know if a fix is possible even in principle. Gary, if you have a problem analysis that suggests a fix, please describe in the issue thread. If you don't, tell me so we can document this as a known (unsolvable) problem. --------------------------------------------------------------------------- #57: Refclock #46, GPSD_JSON, bad NMEA time https://gitlab.com/NTPsec/ntpsec/issues/57 #55: ntpd refclock #46 just stops working. https://gitlab.com/NTPsec/ntpsec/issues/55 I've grouped these together because they are aspects of the same problem: the GPSD_JSON driver was a bad idea to begin with and is in pretty crappy shape internally. As the designer of GPSD_JSON, I am in a unique position to be able to say to the world "this was a bad idea and I'm killing it". I intend to do exactly that before 1.0 if it doesn't get fixed. --------------------------------------------------------------------------- #44: Confusion with drift at the rail https://gitlab.com/NTPsec/ntpsec/issues/44 I don't fully understand this issue. I need Hal, who raised it, to suggest at least a theoretical fix. --------------------------------------------------------------------------- Work requests: I don't normally like to try to hand out assignments or get people to commit to doing them, but coming up on a release I need to have some idea what we can realistically get done and where we need to somehow recruit extra help. Gary: Our top priority needs to be #347, slow startup.
I need to know that either you or Hal is on this and will nail it down. Also it's up to you to save the GPSD_JSON driver. I don't think anyone else is invested in it, and I'd frankly prefer dropping it to trying to fix it. #57, #55. Also I need a better characterization of #62. If you can, please tackle these in roughly the order listed. Matt: You took on being our build-system expert a while back, which puts #312, #273, #269 and #220 on your list. I hate to stick you with trying to decrypt the waf docs, but there isn't anyone obviously better equipped. Hal: Either Gary needs to be on #347 or you do. There's also #44, our oldest open bug. Keane: You've taken on #252. Myself: #356 and #348 are obviously mine. And I'm the backstop for everybody else, which is why I'm not assigning myself more up front. I've put corresponding assignments on the tracker issues. RSVP, everybody. I need to know what you can do and are willing to do. Remember, September 28th. If we get through these there are maybe some more fun things we can do before release. -- Eric S. Raymond Gun Control: The theory that a woman found dead in an alley, raped and strangled with her panty hose, is somehow morally superior to a woman explaining to police how her attacker got that fatal bullet wound. -- L. Neil Smith From Stromeko at nexgo.de Mon Aug 14 16:24:16 2017 From: Stromeko at nexgo.de (Achim Gratz) Date: Mon, 14 Aug 2017 18:24:16 +0200 Subject: #251: Add fudge option to server config (was: Tracker bugs and our release process) References: <20170814154754.D167913A0206@snark.thyrsus.com> Message-ID: <87tw1arubz.fsf@Rainer.invalid> Eric S. Raymond via devel writes: > #251: Add fudge option to server config > https://gitlab.com/NTPsec/ntpsec/issues/251 > > Gary and Daniel are having an argument over whether this is a good > idea. The general idea of allowing NTP to compensate for asymmetric network delays is sound, I think.
Attaching that information to the server config is one way to do that and the only sane way of compensating for asymmetries that are introduced by both the local and remote end. > Me, I'd rather not do it. Just to keep life simple. But they > understand the terrain in ways I don't. I'd leave that out for 1.0, especially as it's almost impossible to figure out what those asymmetries really are without having a few known-good stratum-1 sources at various points in the network topology. I've run at around 1.5–2.0ms excess upstream delay for the last few years (VDSL2) and had to shift the GPS chimers accordingly in order to allow reasonably smooth client fallback from internal to external sync. But that asymmetry is actually variable and goes to over 30ms when you load the downstream to capacity. I got switched to VDSL2 w/ full vectoring last week and the asymmetry went to below 300µs best I can tell, so I've removed the fudge completely for now. I can't fully load the downstream anymore as I'm currently synced at ~105MBit but get capped at 50MBit via QOS by the provider, but I've seen delay spikes to around 10ms so far. Regards, Achim. -- +<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+ Wavetables for the Terratec KOMPLEXER: http://Synth.Stromeko.net/Downloads.html#KomplexerWaves From gem at rellim.com Mon Aug 14 16:34:36 2017 From: gem at rellim.com (Gary E. Miller) Date: Mon, 14 Aug 2017 09:34:36 -0700 Subject: #251: Add fudge option to server config (was: Tracker bugs and our release process) In-Reply-To: <87tw1arubz.fsf@Rainer.invalid> References: <20170814154754.D167913A0206@snark.thyrsus.com> <87tw1arubz.fsf@Rainer.invalid> Message-ID: <20170814093436.1cb5d9ce@spidey.rellim.com> Yo Achim!
On Mon, 14 Aug 2017 18:24:16 +0200 Achim Gratz via devel wrote: > I've run at around 1.5–2.0ms excess upstream delay for the last few > years (VDSL2) and had to shift the GPS chimers accordingly in order to > allow reasonably smooth client fallback from internal to external > sync. Thank you for the confirmation. I've added your comment to bug #251 RGDS GARY --------------------------------------------------------------------------- Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97703 gem at rellim.com Tel:+1 541 382 8588 Veritas liberabit vos. -- Quid est veritas? "If you can't measure it, you can't improve it." - Lord Kelvin From hmurray at megapathdsl.net Tue Aug 15 12:27:11 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Tue, 15 Aug 2017 05:27:11 -0700 Subject: Tracker bugs and our release process In-Reply-To: Message from "Eric S. Raymond via devel" of "Mon, 14 Aug 2017 11:47:54 EDT." <20170814154754.D167913A0206@snark.thyrsus.com> Message-ID: <20170815122711.770AD406063@ip-64-139-1-69.sjc.megapath.net> > * I need to work on #356: reverse function for restrict > * unpeer should be made to fully work from ntpq :config. This one is mine too. There is a quirk tangled in this area. I don't know if there is a bug for it. When the pool mode adds a server, if needed, it pokes a hole in the restrictions. We need to remove that hole when that server is removed. I think we need to add a new flag to indicate that the restrict slot was automatically added. Maybe we should add another flag to disable poking holes. Maybe it's an enhancement rather than bug fix, but this would be the time to do it. ------- I didn't see fixing ntpq retransmissions on your list. (I'm still catching up so I might have missed it.) I think the way to fix this is to clean up the logging.
I've never been particularly happy with the standard log level approach. Maybe it would make more sense if there was a description of what the levels were intended to cover. Some of the cruft may be my fault. I hacked a lot of the logging to do what I needed when chasing some bug(s?). I may have broken any plan that you had. Maybe we need log-to-file. In the context of fixing this bug, I think I would like a logging mode that showed the command line and the packets. Reply packets can be verbose so maybe we need a switch/level for that. Maybe we need another switch/level to show steps within long running commands. Mumble. -- These are my opinions. I hate spam. From fallenpegasus at gmail.com Tue Aug 15 14:44:03 2017 From: fallenpegasus at gmail.com (Mark Atwood) Date: Tue, 15 Aug 2017 14:44:03 +0000 Subject: Tracker bugs and our release process In-Reply-To: <20170815122711.770AD406063@ip-64-139-1-69.sjc.megapath.net> References: <20170814154754.D167913A0206@snark.thyrsus.com> <20170815122711.770AD406063@ip-64-139-1-69.sjc.megapath.net> Message-ID: I have read ESR's writeup on our buglist, and agree with his assessments. ..m
-- Mark Atwood http://about.me/markatwood +1-206-604-2198 Mobile & Signal From esr at thyrsus.com Tue Aug 15 17:23:02 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Tue, 15 Aug 2017 13:23:02 -0400 (EDT) Subject: pool and restrictions Message-ID: <20170815172302.965E113A0206@snark.thyrsus.com> Hal Murray wrote: > When the pool mode adds a server, if needed, it pokes a hole in the > restrictions. > > We need to remove that hole when that server is removed. Fair enough. I have an idea for a simple way to implement this. But I can't find where the hole-poking is actually being done - it's apparently not via a hack_restrict() call, which is what I'd have expected. Can you give me a file and line number? > I think we need to add a new flag to indicate that the restrict slot was > automatically added. > > Maybe we should add another flag to disable poking holes. Maybe it's an > enhancement rather than bug fix, but this would be the time to do it.
I'm generally opposed to adding more interface knobs. The configuration language is tricky enough as it is. Why do you think we might need one? -- Eric S. Raymond "I hold it, that a little rebellion, now and then, is a good thing, and as necessary in the political world as storms in the physical." -- Thomas Jefferson, Letter to James Madison, January 30, 1787 From hmurray at megapathdsl.net Tue Aug 15 18:44:17 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Tue, 15 Aug 2017 11:44:17 -0700 Subject: Is restrict broken? Message-ID: <20170815184417.DE451406063@ip-64-139-1-69.sjc.megapath.net> Should this work? restrict default limited nomodify nopeer noquery restrict 192.168.0.0 mask 255.255.0.0 restrict 127.0.0.1 ntpwait times out. It works when they are commented out. I haven't investigated. I'm trying to catch up after a two week break. I'm pretty sure it used to work but it could easily be something stupid on my end. I haven't tried bisecting. -- These are my opinions. I hate spam. From esr at thyrsus.com Tue Aug 15 18:59:28 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Tue, 15 Aug 2017 14:59:28 -0400 Subject: Is restrict broken? In-Reply-To: <20170815184417.DE451406063@ip-64-139-1-69.sjc.megapath.net> References: <20170815184417.DE451406063@ip-64-139-1-69.sjc.megapath.net> Message-ID: <20170815185928.GA16146@thyrsus.com> Hal Murray via devel : > > Should this work? > restrict default limited nomodify nopeer noquery > restrict 192.168.0.0 mask 255.255.0.0 > restrict 127.0.0.1 > > ntpwait times out. It works when they are commented out. > > I haven't investigated. I'm trying to catch up after a two week break. I'm > pretty sure it used to work but it could easily be something stupid on my > end. I haven't tried bisecting. It is just possible that I broke restrict this morning at e7a4b0d3cf8932feeb898ed1343f25e8e65688d9 Address GitLab issue #356: reverse function for restrict I tested it, but it could be I screwed up the test. 
On the other hand, my instance is running just fine. -- Eric S. Raymond Please consider contributing to my Patreon page at https://www.patreon.com/esr so I can keep the invisible wheels of the Internet turning. Give generously - the civilization you save might be your own. From hmurray at megapathdsl.net Tue Aug 15 19:02:20 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Tue, 15 Aug 2017 12:02:20 -0700 Subject: pool and restrictions In-Reply-To: Message from "Eric S. Raymond via devel" of "Tue, 15 Aug 2017 13:23:02 EDT." <20170815172302.965E113A0206@snark.thyrsus.com> Message-ID: <20170815190220.86EAF406063@ip-64-139-1-69.sjc.megapath.net> > Fair enough. I have an idea for a simple way to implement this. But I > can't find where the hole-poking is actually being done - it's apparently > not via a hack_restrict() call, which is what I'd have expected. Can you > give me a file and line number? ntp_proto.c, line 2480, in dns_take_pool restrict_mask = restrictions(&peer->srcadr); /* FIXME-DNS: RES_FLAGS includes RES_DONTSERVE?? */ if (RES_FLAGS & restrict_mask) { msyslog(LOG_INFO, "Pool poking hole in restrictions for: %s", socktoa(&peer->srcadr)); restrict_source(&peer->srcadr, false, current_time + POOL_SOLICIT_WINDOW + 1); } ----------- >> Maybe we should add another flag to disable poking holes. >> Maybe it's an enhancement rather than bug fix, but this would >> be the time to do it. > I'm generally opposed to adding more interface knobs. The configuration > language is tricky enough as it is. Why do you think we might need one? With the current setup, restrict is essentially ignored for pool hosts. There is no way to say "I want to use the pool, but skip hosts at a.b.c.d/16". It will poke a hole in that restriction. You might want to do that because your routing to there is crappy so you get crappy time. Or maybe they are known bad guys and the pool operators are slow to respond or don't think they are bad enough to kick out or ... 
I generally agree with your more knobs comment. This might be filling in a hole and thus make things cleaner overall rather than more complicated. If we don't fix this, we should at least make sure the documentation is clear. -- These are my opinions. I hate spam. From hmurray at megapathdsl.net Tue Aug 15 19:10:23 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Tue, 15 Aug 2017 12:10:23 -0700 Subject: Is restrict broken? In-Reply-To: Message from "Eric S. Raymond via devel" of "Tue, 15 Aug 2017 14:59:28 EDT." <20170815185928.GA16146@thyrsus.com> Message-ID: <20170815191023.4D50A406063@ip-64-139-1-69.sjc.megapath.net> devel at ntpsec.org said: > It is just possible that I broke restrict this morning at > e7a4b0d3cf8932feeb898ed1343f25e8e65688d9 Address GitLab issue #356: reverse > function for restrict I reverted that fix and it's working again. -- These are my opinions. I hate spam. From esr at thyrsus.com Tue Aug 15 19:27:26 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Tue, 15 Aug 2017 15:27:26 -0400 Subject: Is restrict broken? In-Reply-To: <20170815191023.4D50A406063@ip-64-139-1-69.sjc.megapath.net> References: <20170815185928.GA16146@thyrsus.com> <20170815191023.4D50A406063@ip-64-139-1-69.sjc.megapath.net> Message-ID: <20170815192726.GA16617@thyrsus.com> Hal Murray : > > devel at ntpsec.org said: > > It is just possible that I broke restrict this morning at > > e7a4b0d3cf8932feeb898ed1343f25e8e65688d9 Address GitLab issue #356: reverse > > function for restrict > > I reverted that fix and it's working again. Well, shit. That's bad. Two reasons: (1) The new code is there for a reason, and (2) Your report makes it seem as though my instance ought to be failing, and it's not. The second problem actually worries me more than the first. In config.c there's a stretch of code that looks like this: int op = (my_node->mode == T_Restrict) ?
RESTRICT_FLAGS : RESTRICT_UNFLAG; hack_restrict(op, &addr, &mask, mflags, flags, 0); If you replace the second 'op' with RESTRICT_FLAGS, does the behavior look normal again? -- Eric S. Raymond Please consider contributing to my Patreon page at https://www.patreon.com/esr so I can keep the invisible wheels of the Internet turning. Give generously - the civilization you save might be your own. From esr at thyrsus.com Tue Aug 15 19:32:57 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Tue, 15 Aug 2017 15:32:57 -0400 Subject: pool and restrictions In-Reply-To: <20170815190220.86EAF406063@ip-64-139-1-69.sjc.megapath.net> References: <20170815172302.965E113A0206@snark.thyrsus.com> <20170815190220.86EAF406063@ip-64-139-1-69.sjc.megapath.net> Message-ID: <20170815193257.GB16617@thyrsus.com> Hal Murray : > > > Fair enough. I have an idea for a simple way to implement this. But I > > can't find where the hole-poking is actually being done - it's apparently > > not via a hack_restrict() call, which is what I'd have expected. Can you > > give me a file and line number? > > ntp_proto.c, line 2480, in dns_take_pool > restrict_mask = restrictions(&peer->srcadr); > /* FIXME-DNS: RES_FLAGS includes RES_DONTSERVE?? */ > if (RES_FLAGS & restrict_mask) { > msyslog(LOG_INFO, "Pool poking hole in restrictions for: %s", > socktoa(&peer->srcadr)); > restrict_source(&peer->srcadr, false, > current_time + POOL_SOLICIT_WINDOW + 1); > } OK. 
I think your hole-filling might already be done in the restrict_source()
call here:

/*
 * unpeer - remove peer structure from hash table and free structure
 */
void
unpeer(
        struct peer *peer
        )
{
        mprintf_event(PEVNT_DEMOBIL, peer, "assoc %u", peer->associd);
        restrict_source(&peer->srcadr, true, 0);
        set_peerdstadr(peer, NULL);
        peer_demobilizations++;
        peer_associations--;
        if (FLAG_PREEMPT & peer->flags)
                peer_preempt--;
#ifdef REFCLOCK
        /*
         * If this peer is actually a clock, shut it down first
         */
        if (FLAG_REFCLOCK & peer->flags)
                refclock_unpeer(peer);
#endif

        free_peer(peer);
}

> >> Maybe we should add another flag to disable poking holes.
> >> Maybe it's an enhancement rather than bug fix, but this would
> >> be the time to do it.
> > I'm generally opposed to adding more interface knobs. The configuration
> > language is tricky enough as it is. Why do you think we might need one?
>
> With the current setup, restrict is essentially ignored for pool hosts.
> There is no way to say "I want to use the pool, but skip hosts at
> a.b.c.d/16". It will poke a hole in that restriction. You might want to do
> that because your routing to there is crappy so you get crappy time. Or
> maybe they are known bad guys and the pool operators are slow to respond or
> don't think they are bad enough to kick out or ...
>
> I generally agree with your more knobs comment. This might be filling in a
> hole and thus make things cleaner overall rather than more complicated.
>
> If we don't fix this, we should at least make sure the documentation is clear.

I'll take a doc patch to that effect; also please add this RFE to
devel/TODO.

You make a good case, but I want to get our decks cleared of tracker
issues before we start in on stuff like this.
--
Eric S. Raymond

Please consider contributing to my Patreon page at https://www.patreon.com/esr
so I can keep the invisible wheels of the Internet turning. Give generously -
the civilization you save might be your own.
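
[Editor's sketch] The exchange above turns on two ideas: marking a restriction entry as "automatically added" when pool code pokes a hole, and removing only such entries when the peer goes away. The following is a minimal, self-contained C model of that idea, not ntpd code. Every name in it (toy_restrict, toy_config_restrict, toy_poke_hole, toy_fill_hole) is hypothetical, addresses are plain 32-bit IPv4 values, and the real restrict_source() also carries an expiry time that this model ignores.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_RESTRICT 16

/* Hypothetical model of one restriction slot; not ntpd's real structure. */
struct toy_restrict {
    uint32_t addr;        /* IPv4 host address, host byte order */
    unsigned flags;       /* restriction flags; 0 models "a hole" */
    bool     auto_added;  /* true if poked by pool code, not by ntp.conf */
    bool     in_use;
};

static struct toy_restrict toy_table[TOY_MAX_RESTRICT];

/* Look up the slot for an exact host address, or NULL if there is none. */
static struct toy_restrict *toy_find(uint32_t addr)
{
    for (size_t i = 0; i < TOY_MAX_RESTRICT; i++)
        if (toy_table[i].in_use && toy_table[i].addr == addr)
            return &toy_table[i];
    return NULL;
}

/* Claim a free slot; returns NULL when the table is full. */
static struct toy_restrict *toy_alloc(uint32_t addr, unsigned flags, bool autoflag)
{
    for (size_t i = 0; i < TOY_MAX_RESTRICT; i++) {
        if (!toy_table[i].in_use) {
            toy_table[i].addr = addr;
            toy_table[i].flags = flags;
            toy_table[i].auto_added = autoflag;
            toy_table[i].in_use = true;
            return &toy_table[i];
        }
    }
    return NULL;
}

/* Operator-configured restriction (models a "restrict" line). */
static bool toy_config_restrict(uint32_t addr, unsigned flags)
{
    return toy_find(addr) == NULL && toy_alloc(addr, flags, false) != NULL;
}

/* Pool code pokes a hole, but never overrides an existing host entry. */
static bool toy_poke_hole(uint32_t addr)
{
    return toy_find(addr) == NULL && toy_alloc(addr, 0, true) != NULL;
}

/* On unpeer: remove the hole only if it was automatically added. */
static bool toy_fill_hole(uint32_t addr)
{
    struct toy_restrict *e = toy_find(addr);

    if (e == NULL || !e->auto_added)
        return false;
    e->in_use = false;
    return true;
}
```

In this model an operator's own entry for a host survives both toy_poke_hole() and toy_fill_hole(), which is the behaviour the proposed flag is meant to make possible. Whether ntpd should match on exact hosts or on address/mask pairs is left open here.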
From esr at thyrsus.com Tue Aug 15 21:05:46 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Tue, 15 Aug 2017 17:05:46 -0400 (EDT) Subject: Fix for truly nasty bug introduced this morning Message-ID: <20170815210547.009F313A0206@snark.thyrsus.com> Hal, I think I found the bug that was messing you up. The commit "Address GitLab issue #356: reverse function for restrict" introduced a 'mode' field to restriction nodes in the config parser. The mode could be T_Restrict or T_Unrestrict to specify whether this node is meant to turn restriction flags on or off. Before this change all nodes were restriction-on. The mistake I made was putting the new field first in the structure. This caused it to be randomly trashed by type-punning dynamic-allocation code that was expecting a link field there. C. Gotta love it. Otherwise you'd have to seriously hate it. -- Eric S. Raymond "The state calls its own violence `law', but that of the individual `crime'" -- Max Stirner From hmurray at megapathdsl.net Tue Aug 15 21:51:36 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Tue, 15 Aug 2017 14:51:36 -0700 Subject: Fix for truly nasty bug introduced this morning In-Reply-To: Message from "Eric S. Raymond via devel" of "Tue, 15 Aug 2017 17:05:46 EDT." <20170815210547.009F313A0206@snark.thyrsus.com> Message-ID: <20170815215136.0A2B0406060@ip-64-139-1-69.sjc.megapath.net> > The mistake I made was putting the new field first in the structure. This > caused it to be randomly trashed by type-punning dynamic-allocation code > that was expecting a link field there. I don't understand yet. Why are we type punning there? If it's a hack to avoid malloc, why is the caller assuming anything about the state of the new storage? Should we make a cleanup pass at all avoid-malloc hacks? -- These are my opinions. I hate spam. From esr at thyrsus.com Tue Aug 15 22:08:12 2017 From: esr at thyrsus.com (Eric S. 
Raymond) Date: Tue, 15 Aug 2017 18:08:12 -0400 Subject: Fix for truly nasty bug introduced this morning In-Reply-To: <20170815215136.0A2B0406060@ip-64-139-1-69.sjc.megapath.net> References: <20170815210547.009F313A0206@snark.thyrsus.com> <20170815215136.0A2B0406060@ip-64-139-1-69.sjc.megapath.net> Message-ID: <20170815220812.GA22812@thyrsus.com> Hal Murray : > > > The mistake I made was putting the new field first in the structure. This > > caused it to be randomly trashed by type-punning dynamic-allocation code > > that was expecting a link field there. > > I don't understand yet. Why are we type punning there? If it's a hack to > avoid malloc, why is the caller assuming anything about the state of the new > storage? > > Should we make a cleanup pass at all avoid-malloc hacks? Oh dear Goddess, not *now*. That would be extremely risky. I think it's not a hack to avoid malloc, but rather a hack to allow the config tree to consist of variable-sized structure nodes that are all chain-linked in such a way that they can be freed after config parsing with a simple traverse of that list. -- Eric S. Raymond Please consider contributing to my Patreon page at https://www.patreon.com/esr so I can keep the invisible wheels of the Internet turning. Give generously - the civilization you save might be your own. From esr at thyrsus.com Wed Aug 16 19:30:21 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Wed, 16 Aug 2017 15:30:21 -0400 (EDT) Subject: State of the blocker bug Message-ID: <20170816193021.EC81D13A0206@snark.thyrsus.com> Our one blocker bug hasn't yet been stomped, but it's been cornered. We now understand much more about the problem. It was introduced by Hal's attempt to clean up DNS handling for pool queries on May 14th. It seems to affect only pool queries - I don't think it can bite if you give an explicit server name. The symptom is that time to first sync blows up from tens of seconds to hundreds.
I tried to come out with a patch to back out Hal's change, but mine only reduces time to first sync by x2, not x10. I don't think my patch is quite right. While working on this, we've found that the iburst flag seems to be a no-op. Entire bug thread is here: https://gitlab.com/NTPsec/ntpsec/issues/347 -- Eric S. Raymond The politician attempts to remedy the evil by increasing the very thing that caused the evil in the first place: legal plunder. -- Frederick Bastiat From hmurray at megapathdsl.net Wed Aug 16 20:58:06 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Wed, 16 Aug 2017 13:58:06 -0700 Subject: State of the blocker bug In-Reply-To: Message from "Eric S. Raymond via devel" of "Wed, 16 Aug 2017 15:30:21 EDT." <20170816193021.EC81D13A0206@snark.thyrsus.com> Message-ID: <20170816205806.E856E406063@ip-64-139-1-69.sjc.megapath.net> Background: This isn't related to the bug, but often adds a layer of confusion when trying to understand what is going on with DNS. The old ntpq -p had a few lines of code that skipped some entries. I forget the details. I think it skipped slots that hadn't received any responses yet. I may have removed that code. If not, we should remove it. At least for me, it often sent me on a wild goose chase. If we don't remove it, we should at least document why that code is a feature. The new DNS has a lot of logging. (Maybe too much.) That should help when debugging, at least if you remember to look at the log file. ---------- ntpd tries to send only one request per second. In the simple non-DNS case, this happens by stepping through the peer list and bailing after it finds a ready slot and sends a packet. At startup, the peers display will show a "when" column going up by one second per line. I'm not sure how DNS is/was tangled up with this. When it bumps the polling interval, it also randomizes the next-poll time. I think it adds a randomized half-poll interval. ---------- I'm not 100% confident I understand this area... 
The old pool code didn't create a peer slot when it got an IP Address from DNS. It sent a request, then set up the peer slot when the response returned. Thus there was no place to remember the iburst flag. I think this was tangled up with peer mode which we have deprecated. -- These are my opinions. I hate spam. From hmurray at megapathdsl.net Wed Aug 16 22:10:49 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Wed, 16 Aug 2017 15:10:49 -0700 Subject: State of the blocker bug In-Reply-To: Message from "Eric S. Raymond via devel" of "Wed, 16 Aug 2017 15:30:21 EDT." <20170816193021.EC81D13A0206@snark.thyrsus.com> Message-ID: <20170816221049.0C0AB406063@ip-64-139-1-69.sjc.megapath.net> I think it's important to split work on this into several areas. The first is the simple no-DNS (and hence no-pool) case. This is the important case. How long does it take to get started when ntp.conf has 3 or 4 server slots specified by IP Address? The interesting case is using iburst, but we should probably collect data for the non-iburst case too, just for reference. Case two would be the same setup using DNS and a local /etc/hosts file. Case three would be using the pool. -- These are my opinions. I hate spam. From esr at thyrsus.com Wed Aug 16 22:57:16 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Wed, 16 Aug 2017 18:57:16 -0400 Subject: State of the blocker bug In-Reply-To: <20170816221049.0C0AB406063@ip-64-139-1-69.sjc.megapath.net> References: <20170816193021.EC81D13A0206@snark.thyrsus.com> <20170816221049.0C0AB406063@ip-64-139-1-69.sjc.megapath.net> Message-ID: <20170816225716.GA18656@thyrsus.com> Hal Murray : > > I think it's important to split work on this into several areas. > > The first is the simple no-DNS (and hence no-pool) case. This is the > important case. > > How long does it take to get started when ntp.conf has 3 or 4 server slots > specified by IP Address?
The interesting case is using iburst, but we should > probably collect data for the non-iburst case too, just for reference. > > Case two would be the same setup using DNS and a local /etc/hosts file. > > Case three would be using the pool. We should meter all of these, yes. However, I disagree that the no-DNS case is the important one. The vast majority of ordinary users go to the pool. It's the hand-tuned setups that are the exception. Thus, the most urgent bug right now is clawing back the performance lost in "DNS bug fixing/cleanups". Please give that priority. Tuning the other cases (and fixing iburst so that it's no longer a no-op) can wait. -- Eric S. Raymond Please consider contributing to my Patreon page at https://www.patreon.com/esr so I can keep the invisible wheels of the Internet turning. Give generously - the civilization you save might be your own. From Stromeko at nexgo.de Sat Aug 19 13:54:00 2017 From: Stromeko at nexgo.de (Achim Gratz) Date: Sat, 19 Aug 2017 15:54:00 +0200 Subject: possible bug: peerstats Message-ID: <87efs7acjr.fsf@Rainer.invalid> I've updated to ntpsec-0.9.7+1104 ten days ago and just realized that the peerstats logging has changed format: if I use the new refclock syntax, then instead of the 127.127.. in the address field, I now get the driver name like NMEA(0). I had written my scripts defensively enough to ignore these lines, so it only now dawned on me why the associated data went missing. In principle I'd like a logging format that uses symbolic names for all peers (that'd solve the problem of peers getting new addresses via DHCP or IPv6 prefix changes), but please make that configurable and independent of the way the server / refclock gets specified in the config. Regards, Achim.
-- +<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+ Waldorf MIDI Implementation & additional documentation: http://Synth.Stromeko.net/Downloads.html#WaldorfDocs From ianbruene at gmail.com Sat Aug 19 14:02:11 2017 From: ianbruene at gmail.com (Ian Bruene) Date: Sat, 19 Aug 2017 09:02:11 -0500 Subject: possible bug: peerstats In-Reply-To: <87efs7acjr.fsf@Rainer.invalid> References: <87efs7acjr.fsf@Rainer.invalid> Message-ID: <1829c587-7bdb-df7d-b358-51234747b3d6@gmail.com> On 08/19/2017 08:54 AM, Achim Gratz via devel wrote: > I've updated to ntpsec-0.9.7+1104 ten days ago and just realized that > the peerstats logging has changed format: if I use the new refclock > syntax, then instead of the 127.127.. in the address > field, I now get the driver name like NMEA(0). I had written my scripts > defensively enough to ignore these lines, so it only now dawned on me > why the associated data went missing. > > In principle I'd like a logging format that uses symbolic names for all > peers (that'd solve the problem of peers getting new addresses via DHCP > or IPv6 prefix changes), but please make that configurable and > independent of the way the server / refclock gets specified in the > config. This is a deliberate incompatibility with NTPclassic. The relevant sections from docs/ntpsec.txt: * Clock identifiers in log files are normally the driver shortname followed by the unit number in parentheses, rather than the magic IP addresses formerly used. This change affects the peerstats, rawstats, and clockstats files. Reverted in the --enable-classic-mode build. * An instance of +ntpq+ built from the NTPsec code querying a legacy NTP daemon will not automatically display peers with 127.127.127.t.u addresses as refclocks; that assumption has been removed from the NTPsec code as part of getting it fully IPv6-ready. -- In the end; what separates a Man, from a Slave? Money? Power? No. A Man Chooses, a Slave Obeys. 
-- Andrew Ryan From Stromeko at nexgo.de Sat Aug 19 14:17:15 2017 From: Stromeko at nexgo.de (Achim Gratz) Date: Sat, 19 Aug 2017 16:17:15 +0200 Subject: possible bug: peerstats References: <87efs7acjr.fsf@Rainer.invalid> <1829c587-7bdb-df7d-b358-51234747b3d6@gmail.com> Message-ID: <87a82vabh0.fsf@Rainer.invalid> Ian Bruene via devel writes: > This is a deliberate incompatibility with NTPclassic. Deliberate or not, I still consider it a bug that the same driver logs differently depending on how exactly it gets configured (refclock vs. server keyword), especially since ntpq would always show them identically. Also, that same argument would extend so that servers log with their name rather than the IP address, but that's not happening regardless of configuration. Regards, Achim. -- +<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+ Wavetables for the Waldorf Blofeld: http://Synth.Stromeko.net/Downloads.html#BlofeldUserWavetables From esr at thyrsus.com Sat Aug 19 15:16:34 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Sat, 19 Aug 2017 11:16:34 -0400 Subject: possible bug: peerstats In-Reply-To: <87efs7acjr.fsf@Rainer.invalid> References: <87efs7acjr.fsf@Rainer.invalid> Message-ID: <20170819151634.GA6354@thyrsus.com> Achim Gratz via devel : > > I've updated to ntpsec-0.9.7+1104 ten days ago and just realized that > the peerstats logging has changed format: if I use the new refclock > syntax, then instead of the 127.127.. in the address > field, I now get the driver name like NMEA(0). I had written my scripts > defensively enough to ignore these lines, so it only now dawned on me > why the associated data went missing. > > In principle I'd like a logging format that uses symbolic names for all > peers (that'd solve the problem of peers getting new addresses via DHCP > or IPv6 prefix changes), but please make that configurable and > independent of the way the server / refclock gets specified in the > config. OK, that's weird. 
The logging code shouldn't *know* what syntax was used to configure
the entry. Internally, the new syntax is converted to the equivalent
magic IP address by the configuration parser with this code:

refclock_command
        :       T_Refclock T_String optional_unit option_list
                {
#ifdef REFCLOCK
                        peer_node *my_node;
                        address_node *fakeaddr;
                        char addrbuf[1025];     /* NI_MAXHOSTS on Linux */
                        int dtype;

                        for (dtype = 1; dtype < (int)num_refclock_conf; dtype++)
                            if (refclock_conf[dtype]->basename != NULL && strcasecmp(refclock_conf[dtype]->basename, $2) == 0)
                                goto foundit;
                         msyslog(LOG_ERR, "CONFIG: Unknown driver name %s", $2);
                         exit(1);
                foundit:
                        snprintf(addrbuf, sizeof(addrbuf), "127.127.%d.%d", dtype, $3);
                        fakeaddr = create_address_node(estrdup(addrbuf),AF_INET);
                        my_node = create_peer_node(T_Server, fakeaddr, $4);
                        APPEND_G_FIFO(cfgt.peers, my_node);
#endif /* REFCLOCK */
                }
        ;

Compare this:

server_command
        :       client_type address option_list
                {
                        peer_node *my_node;

                        my_node = create_peer_node($1, $2, $3);
                        APPEND_G_FIFO(cfgt.peers, my_node);
                }

Note that in both the server and refclock cases, first argument will be
T_Server. So I don't see how the difference can leak through even to the
config back end, let alone the logging code.

Are you sure there isn't some other variable here? Would you mind using
gdb to see what create_peer_node() gets passed in those two cases?

I'd jump on this, but I'm dealing with an emergency. One of our guys
broke the pool startup code and has since gone radio silent - I need to
fix that before I can focus on anything else.
--
Eric S. Raymond

Please consider contributing to my Patreon page at https://www.patreon.com/esr
so I can keep the invisible wheels of the Internet turning. Give generously -
the civilization you save might be your own.
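
[Editor's sketch] The parser fragment above builds the legacy pseudo-address by formatting the driver type and unit number into "127.127.%d.%d". The small C sketch below isolates that mapping and adds an inverse, the sort of thing a log-munging tool would need in order to recognize old-style refclock lines. Both helper names (toy_magic_addr, toy_parse_magic_addr) are hypothetical and do not exist in the NTPsec tree. The 0..255 range check is an assumption based on each field occupying one IPv4 octet.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format driver type t and unit u as the legacy
 * refclock pseudo-address 127.127.t.u.  Returns 0 on success, -1 if a
 * value is out of octet range or the buffer is too small.
 */
static int toy_magic_addr(char *buf, size_t len, int dtype, int unit)
{
    int n;

    if (dtype < 0 || dtype > 255 || unit < 0 || unit > 255)
        return -1;
    n = snprintf(buf, len, "127.127.%d.%d", dtype, unit);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical inverse: recover (type, unit) from a 127.127.t.u string.
 * Returns 0 on success, -1 if the string is not a magic address.
 */
static int toy_parse_magic_addr(const char *addr, int *dtype, int *unit)
{
    int t, u;
    char trailing;

    /* Exactly two conversions must succeed; a third means trailing junk. */
    if (sscanf(addr, "127.127.%d.%d%c", &t, &u, &trailing) != 2)
        return -1;
    if (t < 0 || t > 255 || u < 0 || u > 255)
        return -1;
    *dtype = t;
    *unit = u;
    return 0;
}
```

For example, driver type 20 with unit 0 formats as "127.127.20.0", and parsing that string gives back (20, 0). An ordinary server address such as "192.168.1.1" is rejected by the parser.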
From Stromeko at nexgo.de Sat Aug 19 15:38:57 2017 From: Stromeko at nexgo.de (Achim Gratz) Date: Sat, 19 Aug 2017 17:38:57 +0200 Subject: possible bug: peerstats References: <87efs7acjr.fsf@Rainer.invalid> <20170819151634.GA6354@thyrsus.com> Message-ID: <87tw138t4e.fsf@Rainer.invalid> Eric S. Raymond via devel writes: > The logging code shouldn't *know* what syntax was used to configure > the entry. Internally, the new syntax is converted to the equivalent > magic IP address by the configuration parser with this code: In fact it doesn't, I had already wondered why that would filter into the state of the refclock somehow. > Note that in both the server and refclock cases, first argument will be > T_Server. So I don't see how the difference can leak through even to the > config back end, let alone the logging code. > > Are you sure there isn't some other variable here? Would you mind using > gdb to see what create_peer_node() gets passed in those two cases? As I said, I checked it the wrong way --sorry for the noise. I see now that it consistently switched the logging format on all boxes with the restart of the recompiled ntpd. I'll have to change my data munging scripts to recognize these correctly so that the data goes where it's supposed to be. Regards, Achim. -- +<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+ Samples for the Waldorf Blofeld: http://Synth.Stromeko.net/Downloads.html#BlofeldSamplesExtra From esr at thyrsus.com Sat Aug 19 15:50:58 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Sat, 19 Aug 2017 11:50:58 -0400 Subject: possible bug: peerstats In-Reply-To: <87tw138t4e.fsf@Rainer.invalid> References: <87efs7acjr.fsf@Rainer.invalid> <20170819151634.GA6354@thyrsus.com> <87tw138t4e.fsf@Rainer.invalid> Message-ID: <20170819155058.GD6354@thyrsus.com> Achim Gratz via devel : > I'll have to change my data munging scripts to recognize these correctly > so that the data goes where it's supposed to be. Yeah, sorry about that.
We thought long and hard about this change, and concluded it was necessary in order to sever ntpd from its IPv4 assumptions. For people who can't live with that, there's --enable-classic-mode. -- Eric S. Raymond Please consider contributing to my Patreon page at https://www.patreon.com/esr so I can keep the invisible wheels of the Internet turning. Give generously - the civilization you save might be your own. From Stromeko at nexgo.de Sat Aug 19 16:00:46 2017 From: Stromeko at nexgo.de (Achim Gratz) Date: Sat, 19 Aug 2017 18:00:46 +0200 Subject: possible bug: peerstats References: <87efs7acjr.fsf@Rainer.invalid> <20170819151634.GA6354@thyrsus.com> <87tw138t4e.fsf@Rainer.invalid> <20170819155058.GD6354@thyrsus.com> Message-ID: <87lgmf8s41.fsf@Rainer.invalid> Eric S. Raymond via devel writes: > Achim Gratz via devel : >> I'll have to change my data munging scripts to recognize these correctly >> so that the data goes where it's supposed to be. > > Yeah, sorry about that. We thought long and hard about this change, > and concluded it was necessary in order to sever ntpd from its IPv4 > assumptions. Well, I've luckily learned a long time ago to never work directly off original log data and always put in a translation layer even when it just hands through the original data at the beginning. So it won't be that much of a problem. > For people who can't live with that, there's --enable-classic-mode. So what's your transition strategy with that? The distros will likely decide to go for that option (less chance of breakage on their side if they make it an alternative to classic) and then never look back. Regards, Achim. -- +<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+ DIY Stuff: http://Synth.Stromeko.net/DIY.html From esr at thyrsus.com Sat Aug 19 16:14:45 2017 From: esr at thyrsus.com (Eric S.
Raymond) Date: Sat, 19 Aug 2017 12:14:45 -0400 Subject: possible bug: peerstats In-Reply-To: <87lgmf8s41.fsf@Rainer.invalid> References: <87efs7acjr.fsf@Rainer.invalid> <20170819151634.GA6354@thyrsus.com> <87tw138t4e.fsf@Rainer.invalid> <20170819155058.GD6354@thyrsus.com> <87lgmf8s41.fsf@Rainer.invalid> Message-ID: <20170819161445.GA7209@thyrsus.com> Achim Gratz via devel : > > For people who can't live with that, there's --enable-classic-mode. > > So what's your transition strategy with that? The distros will likely > decide to go for that option (less chance of breakage on their side if > they make it an alternative to classic) and then never look back. Maybe. On the other hand, Classic mode wouldn't actually buy them much. The only scripts a typical leaf-node installation runs come right out of the NTP source tree themselves, and ours do the right thing. -- Eric S. Raymond Please consider contributing to my Patreon page at https://www.patreon.com/esr so I can keep the invisible wheels of the Internet turning. Give generously - the civilization you save might be your own. From hmurray at megapathdsl.net Sat Aug 19 21:12:43 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Sat, 19 Aug 2017 14:12:43 -0700 Subject: possible bug: peerstats In-Reply-To: Message from "Eric S. Raymond via devel" of "Sat, 19 Aug 2017 11:50:58 EDT." <20170819155058.GD6354@thyrsus.com> Message-ID: <20170819211243.71849406060@ip-64-139-1-69.sjc.megapath.net> > Maybe. On the other hand, Classic mode wouldn't actually buy them much. The > only scripts a typical leaf-node installation runs come right out of the NTP > source tree themselves, and ours do the right thing. Would it be better if we changed that to a run time decision rather than build time? I'm assuming that all the input cases will take either format. With the decision at build time, a distro has to make a choice. If they use the default as you expect, then they screw all their users who do have their own scripts. 
Does anybody even know how big that set is? It's probably small, but they are also the ones we want to switch to using our code. > Yeah, sorry about that. We thought long and hard about this change, and > concluded it was necessary in order to sever ntpd from its IPv4 > assumptions. For people who can't live with that, there's > --enable-classic-mode. I've never understood the IPv4 logic in that area. It might make a good story for a blog. -- These are my opinions. I hate spam. From esr at thyrsus.com Sat Aug 19 23:17:55 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Sat, 19 Aug 2017 19:17:55 -0400 Subject: possible bug: peerstats In-Reply-To: <20170819211243.71849406060@ip-64-139-1-69.sjc.megapath.net> References: <20170819155058.GD6354@thyrsus.com> <20170819211243.71849406060@ip-64-139-1-69.sjc.megapath.net> Message-ID: <20170819231755.GA26192@thyrsus.com> Hal Murray : > > > Maybe. On the other hand, Classic mode wouldn't actually buy them much. The > > only scripts a typical leaf-node installation runs come right out of the NTP > > source tree themselves, and ours do the right thing. > > Would it be better if we changed that to a run time decision rather than > build time? Grep for ENABLE_CLASSIC_MODE and see. I think any benefit would be almost purely psychological. > I'm assuming that all the input cases will take either format. Yes. -- Eric S. Raymond Please consider contributing to my Patreon page at https://www.patreon.com/esr so I can keep the invisible wheels of the Internet turning. Give generously - the civilization you save might be your own. From hmurray at megapathdsl.net Sun Aug 20 07:58:28 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Sun, 20 Aug 2017 00:58:28 -0700 Subject: catching up Message-ID: <20170820075828.3EE1640605C@ip-64-139-1-69.sjc.megapath.net> I think I fixed the iburst startup problems. I also added tests/time-startup.sh which is a script I used for timing.
It includes export PYTHONPATH=/usr/local/lib/python2.7/site-packages because that was the quickest way to get it working on my system. Eric: I don't understand the python finding libraries area. It seems reasonable to ask hackers and developers to add a PYTHONPATH to their environment. It seems non-good to me to ask every sysadmin to hack their root environment. Is there any reasonable way to "fix" that? If nothing else, we can hack the install recipe to edit the scripts on the fly so they look where the libraries get installed. -- These are my opinions. I hate spam. From esr at thyrsus.com Sun Aug 20 16:42:24 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Sun, 20 Aug 2017 12:42:24 -0400 Subject: catching up In-Reply-To: <20170820075828.3EE1640605C@ip-64-139-1-69.sjc.megapath.net> References: <20170820075828.3EE1640605C@ip-64-139-1-69.sjc.megapath.net> Message-ID: <20170820164224.GB4825@thyrsus.com> Hal Murray via devel : > Eric: > I don't understand the python finding libraries area. It seems reasonable > to ask hackers and developers to add a PYTHONPATH to their environment. It > seems non-good to me to ask every sysadmin to hack their root environment. I ruefully agree. Unfortunately, this is an area where distribution packagers have a rather inexplicable tendency to fall down. I don't really know why such flakiness is widespread, but from the pattern of errors I think a failure by the Python maintainers to provide strong best-practice guidelines might be contributing to the problem. I never see this sort of problem on Ubuntu, so there is at least one demonstration that this tangle can be avoided. > Is there any reasonable way to "fix" that? If nothing else, we can hack > the install recipe to edit the scripts on the fly so they look where the > libraries get installed. Alas, that's not as easy as you probably think it is.
I have brushed up against this problem in the past; I predict that another difficult dive into waf's rather opaque documentation is in my future. -- Eric S. Raymond Please consider contributing to my Patreon page at https://www.patreon.com/esr so I can keep the invisible wheels of the Internet turning. Give generously - the civilization you save might be your own. From hmurray at megapathdsl.net Tue Aug 22 21:41:52 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Tue, 22 Aug 2017 14:41:52 -0700 Subject: GPS rollover in HP Driver Message-ID: <20170822214152.E9D7E40605C@ip-64-139-1-69.sjc.megapath.net> I've got a box sending bogus time, but the system seems to be running happily. I assume there is some GPS rollover fixup code someplace, but I can't find it. Anybody know where it is? There are two likely places. One is in the HP driver itself, ntpd/refclock_hpgps.c The other is in ntpd/ntp_refclock.c where it might fixup many drivers I can't find "week", "roll", or 1024 in either file. -- These are my opinions. I hate spam. From esr at thyrsus.com Tue Aug 22 22:01:46 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Tue, 22 Aug 2017 18:01:46 -0400 Subject: GPS rollover in HP Driver In-Reply-To: <20170822214152.E9D7E40605C@ip-64-139-1-69.sjc.megapath.net> References: <20170822214152.E9D7E40605C@ip-64-139-1-69.sjc.megapath.net> Message-ID: <20170822220146.GA16255@thyrsus.com> Hal Murray via devel : > > I've got a box sending bogus time, but the system seems to be running happily. > > I assume there is some GPS rollover fixup code someplace, but I can't find > it. Anybody know where it is? > > > There are two likely places. One is in the HP driver itself, > ntpd/refclock_hpgps.c > The other is in ntpd/ntp_refclock.c where it might fixup many drivers > > I can't find "week", "roll", or 1024 in either file. I am certain the generic refclock code is not doing anything explicit to compensate for GPS rollover. For one, I just re-skimmed it to check.
For another, if that were possible there wouldn't be ad-hoc rollover code scattered through multiple drivers. Looking at the hpgps driver...I don't see any rollover handling there, either. The only other possibility I can see is that the core sync algorithms simply ignore any date that isn't plausibly within a certain delta of the system clock date, using only its low-order bits and assuming the the ae is correct. This should actually be a decent heuristic except near midnight. But I'm not sure how to check whether this is happening. -- Eric S. Raymond Please consider contributing to my Patreon page at https://www.patreon.com/esr so I can keep the invisible wheels of the Internet turning. Give generously - the civilization you save might be your own. From esr at thyrsus.com Tue Aug 22 22:07:58 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Tue, 22 Aug 2017 18:07:58 -0400 Subject: GPS rollover in HP Driver In-Reply-To: <20170822220146.GA16255@thyrsus.com> References: <20170822214152.E9D7E40605C@ip-64-139-1-69.sjc.megapath.net> <20170822220146.GA16255@thyrsus.com> Message-ID: <20170822220758.GA16498@thyrsus.com> Eric S. Raymond via devel : > The only other possibility I can see is that the core sync algorithms simply > ignore any date that isn't plausibly within a certain delta of the system > clock date, using only its low-order bits and assuming the the ae is correct. > This should actually be a decent heuristic except near midnight. I meant "assuming the date is correct". -- Eric S. Raymond Please consider contributing to my Patreon page at https://www.patreon.com/esr so I can keep the invisible wheels of the Internet turning. Give generously - the civilization you save might be your own. From esr at thyrsus.com Wed Aug 23 14:07:05 2017 From: esr at thyrsus.com (Eric S. 
Raymond) Date: Wed, 23 Aug 2017 10:07:05 -0400 (EDT) Subject: Upcoming feature freeze Message-ID: <20170823140705.D23BE13A0206@snark.thyrsus.com> Our planned ship date for 1.0 is 28 September. We'll feature-freeze sooner than that - not sure when yet but somewhere in the ballpark of 7-14 September seems likely. We're down to 7 issues on the tracker. Feature freeze has implications for the two that are RFEs, and for one other. Here they are: #251: Add fudge option to server config If this is going to happen in 1.0, somebody needs to land a patch before feature freeze. That someone should be equipped to test the patch - e.g. not me, as I don't have significant asymmetric delay to contend with. If someone steps up, though, I will write the scanner/parser end to get that offset number into the peer structure. It's not reasonable to expect anyone else but me to grapple with *that* part of the code. Remember, documentation patches *are* required when you add a feature. As this would be a pure feature addition, there's no issue with allowing it to wait until 1.1. #204: Support /etc/ntp.d This feature is working, and documented - has been for more than 6 months. For pretty obvious reasons, we should not go breaking backward compatibility after 1.0. That means the window during which we can change the behavior is getting pretty short. Anybody who wants this has two to three weeks, at the outside, to make the argument and ship the code. When I say "make the argument" I mean that I want to see a concrete design and an explanation of why it solves all the problems this one does, and one or more additional ones. Merely not liking it the way it is is insufficient. #55: ntpd refclock GPSD_JSON just stops working. I am unhappy with this driver. I believe - as this bug demonstrates - that it's too crappy to ship if we want to establish and maintain a reputation for trouble-free operation.
It's an unusual case - the feature that brings it closest to working right is marked experimental, and it's redundant with the SHM driver because GPSD feeds the SHM driver quite happily. In fact, the JSON parsing overhead means the latency and jitter of this driver is necessarily inferior to delivery via SHM. Thus, I think the best thing to do about it would be to simply delete it and shed the defect exposure, redirecting users to GPSD+SHM. And if that is going to happen, it needs to happen *now* - that is, before 1.0 implies a promise that it will be stable and maintained. If any of you have an interest in saving this driver, step up now and fix it. -- Eric S. Raymond "As to the species of exercise, I advise the gun. While this gives [only] moderate exercise to the body, it gives boldness, enterprise, and independence to the mind. Games played with the ball and others of that nature, are too violent for the body and stamp no character on the mind. Let your gun, therefore, be the constant companion to your walks." -- Thomas Jefferson, writing to his teenaged nephew. From ghane0 at gmail.com Wed Aug 23 14:33:40 2017 From: ghane0 at gmail.com (Sanjeev Gupta) Date: Wed, 23 Aug 2017 22:33:40 +0800 Subject: Upcoming feature freeze In-Reply-To: <20170823140705.D23BE13A0206@snark.thyrsus.com> References: <20170823140705.D23BE13A0206@snark.thyrsus.com> Message-ID: Hi, I would dearly love to see #204 (/etc/ntp.d) be included in 1.0. As a SysAdm, I typically read the new features list rarely. If it does not land in 1.0 (and package managers and I do not start using it then), it may never get used. -- Sanjeev Gupta +65 98551208 http://www.linkedin.com/in/ghane From esr at thyrsus.com Wed Aug 23 14:41:29 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Wed, 23 Aug 2017 10:41:29 -0400 Subject: Upcoming feature freeze In-Reply-To: References: <20170823140705.D23BE13A0206@snark.thyrsus.com> Message-ID: <20170823144129.GB30067@thyrsus.com> Sanjeev Gupta : > I would dearly love to see #204 (/etc/ntp.d) be included in 1.0. Some version of /etc/ntp.d support will definitely be there. The question is whether it will be the version that's there, or whether someone will persuade me that they have a better idea. -- Eric S. Raymond Please consider contributing to my Patreon page at https://www.patreon.com/esr so I can keep the invisible wheels of the Internet turning. Give generously - the civilization you save might be your own.
From frank at nicholasfamilycentral.com Wed Aug 23 17:07:12 2017 From: frank at nicholasfamilycentral.com (Frank Nicholas) Date: Wed, 23 Aug 2017 13:07:12 -0400 Subject: Upcoming feature freeze In-Reply-To: <20170823140705.D23BE13A0206@snark.thyrsus.com> References: <20170823140705.D23BE13A0206@snark.thyrsus.com> Message-ID: <43B990A8-A392-4EB1-883D-C928E28CE2F7@nicholasfamilycentral.com> On Aug 23, 2017, at 10:07 AM, Eric S. Raymond via devel wrote: > > #55: ntpd refclock GPSD_JSON just stops working. > > I am unhappy with this driver. I believe - as this bug demonstrates - > that it's too crappy to ship if we want to establish and maintain a > reputation for trouble-free operation. > > It's an unusual case - the feature that brings it closest to working > right is marked experimental, and it's redundant with the SHM driver > because GPSD feeds the SHM driver quite happily. In fact, the JSON > parsing overhead means the latency and jitter of this driver is > necessarily inferior to delivery via SHM. > > Thus, I think the best thing to do about it would be to simply delete it > and shed the defect exposure, redirecting users to GPSD+SHM. And > if that is going to happen, it needs to happen *now* - that is, before > 1.0 implies a promise that it will be stable and maintained. What about GPSd/NTPSec systems that do not support SHM as required by GPSd/NTPSec? (Are there any? macOS? Others?) Thanks, Frank From esr at thyrsus.com Wed Aug 23 17:37:15 2017 From: esr at thyrsus.com (Eric S.
Raymond) Date: Wed, 23 Aug 2017 13:37:15 -0400 Subject: Upcoming feature freeze In-Reply-To: <43B990A8-A392-4EB1-883D-C928E28CE2F7@nicholasfamilycentral.com> References: <20170823140705.D23BE13A0206@snark.thyrsus.com> <43B990A8-A392-4EB1-883D-C928E28CE2F7@nicholasfamilycentral.com> Message-ID: <20170823173715.GA2721@thyrsus.com> Frank Nicholas : > What about GPSd/NTPSec systems that do not support SHM as required > by GPSd/NTPSec? (Are there any? macOS? Others?) They will usually be able to use NMEA + PPS. -- Eric S. Raymond Please consider contributing to my Patreon page at https://www.patreon.com/esr so I can keep the invisible wheels of the Internet turning. Give generously - the civilization you save might be your own. From hmurray at megapathdsl.net Fri Aug 25 05:42:46 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Thu, 24 Aug 2017 22:42:46 -0700 Subject: Upcoming feature freeze In-Reply-To: Message from "Eric S. Raymond via devel" of "Wed, 23 Aug 2017 10:07:05 EDT." <20170823140705.D23BE13A0206@snark.thyrsus.com> Message-ID: <20170825054246.8F145406063@ip-64-139-1-69.sjc.megapath.net> > #55: ntpd refclock GPSD_JSON just stops working. I haven't noticed any troubles. > If any of you have an interest in saving this driver, step up now and fix > it. How important is supporting gpsd on systems without SHM? I think we need the concept of stability to be associated with various features. There should be a way to ship something without a promise of long term support or that it will run on all systems or that all combinations of options have been tested or ... ------------- This whole area needs a lot of thought and work. I assume that is on the post 1.0 list. Is anything written down anyplace? Where? How does SHM/JSON interact with the great refclockd proposal? We could fix the ntpd side of the SHM interface to be read only. That would let multiple readers listen to the same source so you could fire up shmmon while ntpd was running. 
I think this would solve a protection problem. Some systems don't have SHM. Do we have a list of those systems? Do we need to support external refclocks on those systems? Are we using POSIX SHM? (How many SHM variants are there?) Would you be happy if we threw away the current code and you started from scratch? Would it help if the JSON interface was NTP centric rather than GPSD centric? Maybe we should make a JSON-JSON translator to go between gpsd and ntpd. It would be nice if there was a clean interface for external drivers. I assume that each current driver would turn into a stand alone program that talked to ntpd via some new/wonderful interface. -- These are my opinions. I hate spam. From hmurray at megapathdsl.net Fri Aug 25 06:00:05 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Thu, 24 Aug 2017 23:00:05 -0700 Subject: Upcoming feature freeze In-Reply-To: Message from "Eric S. Raymond via devel" of "Wed, 23 Aug 2017 10:07:05 EDT." <20170823140705.D23BE13A0206@snark.thyrsus.com> Message-ID: <20170825060005.B8D0B406063@ip-64-139-1-69.sjc.megapath.net> I know of 3 things that seem more important (to me) than anything on your list. The simple shared key crypto should support more than MD5 and SHA1. ntpq still fails when talking over a lossy link. (There is a flake option in restrict. We should see if that tickles the problem.) ntpwait can't find its libraries in some environments. Other python programs have the same problem. I'm OK with requiring developers to hack their environment but it seems not-good to require sysadmins who use ntpwait to hack the boot environment or the casual user to hack their environment. Can't we hack the install process to "fix" things by editing the code? It knows where it is installing things. ----------- Understanding the mysterious GPS rollover fixup is the top of my list. If I find that, I'll probably work on ntpq retransmissions. -- These are my opinions. I hate spam.
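[Editorial sketch, not part of the original thread.] Nobody in this thread has located the rollover fixup Hal is asking about, so the following is only an illustration of the usual technique, not a description of what ntpd or the HP driver does. A GPS receiver broadcasts a 10-bit week number that wraps every 1024 weeks; a common fixup picks the 1024-week era that puts the decoded week closest to a trusted reference such as the system clock. The function names and the choice of "nearest era" as the pivot rule are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)  # start of GPS week 0
ROLLOVER_WEEKS = 1024   # the broadcast week number is 10 bits wide
WEEK = timedelta(weeks=1)


def unroll_gps_week(week10, reference):
    """Return the full GPS week number closest to `reference`.

    week10    -- week number as broadcast, in the range 0..1023
    reference -- timezone-aware datetime assumed to be within about
                 512 weeks of the true time (hypothetical input,
                 for example the system clock)
    """
    if not 0 <= week10 < ROLLOVER_WEEKS:
        raise ValueError("broadcast week must be in 0..1023")
    ref_weeks = (reference - GPS_EPOCH) // WEEK   # whole weeks since GPS epoch
    era = ref_weeks // ROLLOVER_WEEKS
    # Try the reference's own era and both neighbours, keep the nearest.
    candidates = [week10 + (era + k) * ROLLOVER_WEEKS for k in (-1, 0, 1)]
    candidates = [w for w in candidates if w >= 0]
    return min(candidates, key=lambda w: abs(w - ref_weeks))


def gps_week_start(full_week):
    """Return the UTC-labelled start of a full GPS week (leap seconds ignored)."""
    return GPS_EPOCH + full_week * WEEK
```

With a reference clock reading of 22 August 2017, a broadcast week of 939 unrolls to full week 1963. The heuristic fails only when the reference itself is off by more than half an era (roughly 9.8 years).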
From daniele at grinta.net Sat Aug 26 00:52:00 2017 From: daniele at grinta.net (Daniele Nicolodi) Date: Fri, 25 Aug 2017 18:52:00 -0600 Subject: MacOS X support? Message-ID: <4acc308a-8f37-002d-c1b0-dd7f70731289@grinta.net> Hello, this https://www.ntpsec.org/supported-platforms.html says that MacOS X is an actively maintained platform, however, NTPsec current git master requires clock_settime() and (as far as I know) this POSIX function is not implemented on MacOS X (at least it is not on MacOS 10.10). Should that page be revisited? Cheers, Daniele From fw at fwright.net Sat Aug 26 01:45:18 2017 From: fw at fwright.net (Fred Wright) Date: Fri, 25 Aug 2017 18:45:18 -0700 (PDT) Subject: MacOS X support? In-Reply-To: <4acc308a-8f37-002d-c1b0-dd7f70731289@grinta.net> References: <4acc308a-8f37-002d-c1b0-dd7f70731289@grinta.net> Message-ID: On Fri, 25 Aug 2017, Daniele Nicolodi via devel wrote: > this https://www.ntpsec.org/supported-platforms.html says that MacOS X > is an actively maintained platform, however, NTPsec current git master > requires clock_settime() and (as far as I know) this POSIX function is > not implemented on MacOS X (at least it is not on MacOS 10.10). If you don't mind something that's still a bit of a work in progress, you can try: https://gitlab.com/fhgwright/ntpsec/tree/mac-fixes It's rebased to master at the time of this writing, and may be force-pushed to do so in the future. It's a WIP because: 1) It gets a few warnings on 10.6, and lots on 10.5. 2) For unrelated (not OSX-specific) reasons, the Python libraries get installed in the wrong place, complicating testing. 3) It hasn't been tested as well as I'd like. Fred Wright From esr at thyrsus.com Sat Aug 26 13:39:25 2017 From: esr at thyrsus.com (Eric S.
Raymond) Date: Sat, 26 Aug 2017 09:39:25 -0400 (EDT) Subject: Catching up omn unfixed bugs Message-ID: <20170826133925.10A8413A0209@snark.thyrsus.com> I've been distracted the last couple of days by trying to spin up another ICEI project that's on a tight deadline. It seems I missed replying on a couple of threads. This is my attempt to catch up. Hal Murray: >How important is supporting gpsd on systems without SHM? Realistically, not very. We've given up on Windows until their ability to run Linux binaries is fully deployed, which will pretty certainly make SHM a non-problem. (I anticipate trouble around the clock-manipulation calls, however.) I think Mac is the only platform without SHM we're actually supporting, and I view it as a minor one. Hobbyists may run NTP service on it, but our real target audience is data centers and you won't find Macs in those. (On the other hand, one might find *BSD.) In any case, Gary has undertaken to rescue GPSD_JSON, so absence of SHM may well become a non-problem. >I think we need the concept of stability to be associated with various >features. There should be a way to ship something without a promise of long >term support or that it will run on all systems or that all combinations of >options have been tested or ... Some drivers are marked deprecated and likely to be removed in a future release. That is presently the only stability tag we have. We could add an unstable/experimental tag. GPSD_JSON is the only place I can think of that might merit it, depending on Gary's degree of success. >How does SHM/JSON interact with the great refclockd proposal? Right, you weren't at the Penguicon FTF meeting. The refclockd plan is pretty dead at this point. The tradeoffs driving it have shifted as our driver inventory has shrunk. The payoff from refclockd refactoring can be thought of as T - S, where T is proportional to the complexity cost (LOC) of the driver code removed and S is the fixed LOC cost of wrapping it in a separate daemon.
As the driver inventory shrank, the result of this calculation has been falling. We've turfed out more drivers than I was expecting; we're at 19 now, which is less than half of the original inventory of 43. There are principled reasons we might drop two more - Oncore and GPSD_JSON. Thus, I concluded about 6 months ago that refclockd was no longer looking like much of a win. Instead, further paring back the driver inventory and possibly migrating some support into the generic driver now seems like a better plan. On the gripping hand, we escape one of the constraints if instead of spinning up a refclockd, we were to move the refclock drivers into GPSD. *That* framework is already paid for and the interface to ntpd is well debugged. Also, I think GPSD's PPS support is better than ntpd's. But nothing like that, or refclockd for that matter, can possibly happen until we get enough test hardware to verify the drivers in their new environment. In effect all this is blocked until we can spin up a hardware test lab. >We could fix the ntpd side of the SHM interface to be read only. That would >let multiple readers listen to the same source so you could fire up shmmon >while ntpd was running. I think this would solve a protection problem. I'd take that patch before 1.0 feature freeze, because if it breaks it will break everywhere and obviously. >Are we using POSIX SHM? (How many SHM variants are there?) We are not using POSIX SHM yet. I know only two variants, the (technically nonstandard) SHM we're using derived from old System V and the POSIX version. There's been no demand for the latter yet and there is something I've forgotten about its API that made it look like a pain. >Would you be happy if we threw away the current code and you started >from scratch? Let's revisit that question if Gary's rescue fails. >Would it help if the JSON interface was NTP centric rather than GPSD >centric? Maybe we should make a JSON-JSON translator to go between >gpsd and ntpd.
These seem to me like ways to pile more complexity on top of a problem that has already accreted too much. >It would be nice if there was a clean interface for external drivers. >I assume that each current driver would turn into a stand alone >program that talked to ntpd via some new/wonderful interface. We already have a framework for that kind of external driver. It's called GPSD. :-) That wasn't just a snarky answer. >The simple shared key crypto should support more than MD5 and SHA1. Daniel has undertaken to do AES-CMAC. I think that covers what will be standardized in the foreseeable future. >ntpq still fails when talking over a lossy link. (There is a flake option in >restrict. We should see if that tickles the problem.) I fear I may offend you by saying that I don't see this as a major or release-blocking problem - but in truth I have a hard time seeing ntpq used for anything but local monitoring over a LAN as an important case. Sure, I'd like to see this problem identified and fixed, because we're perfectionists and I like it that way. But, more important than the tracker issues? Not from where I'm sitting. Using ntpq over a WiFi link seems pretty odd to me, because doing time sync over a link that vulnerable to RFI and changes in the weather seems like a stunt that nobody in serious production would want to try. >ntpwait can't find its libraries in some environments. I've replied to this one before. It's a distro-packager screwup. There isn't any elegant fix other than not using distros that screw up. >Understanding the mysterious GPS rollover fixup is the top of my list. > >If I find that, I'll probably work on ntpq retransmissions. Those are what *I'd* prefer you to be working on, anyway. The only other bug assigned to you is the very old one about drift at the rail. If you can fix that one it's gravy. You muttered something about possible overflow in the calculations. 
I'd have investigated this, but I have zero experience at chasing overflow bugs - I wouldn't have a clue what to look for. Um, maybe the drift calculations need to use the new doubletime_t (long double) type? -- Eric S. Raymond The right to buy weapons is the right to be free. -- A.E. Van Vogt, "The Weapon Shops Of Isher", ASF December 1942 From esr at thyrsus.com Sat Aug 26 13:58:23 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Sat, 26 Aug 2017 09:58:23 -0400 (EDT) Subject: State of Mac OS support Message-ID: <20170826135823.D408713A0209@snark.thyrsus.com> There have been inquiries from Daniele Nicolodi and Fred Wright about Mac OS X support. Mark Atwood, who's our strategy/product-management/external-relations specialist, may override me on this. Unless and until he does, here's the skinny: We're supporting 10.12, which has the POSIX clock calls. Earlier versions can go piss up a rope. The reason I am vehement about this is that I recently learned a thing: some pre-10.12 versions ship with headers that don't match what's documented for their releases on the Apple website. If Apple can't be bothered to keep its act together enough to present a stable and documented API, we can't be bothered to support their crap. Yes, we used to ship a special accommodation for Apple's non-POSIX calls in 10.11. At some point it broke. Given what I learned later - including one rumor that their non-POSIX clock-setting call is a no-op - I cannot be fscking bothered to figure out *why* it broke. I consider this entire sorry history a lesson in the wisdom of *not making exceptions* to our POSIX-baseline policy. Doing that for Apple was a mistake I don't intend to double down on. If the 10.12 POSIX clock calls don't work, that's *Apple's* problem. -- Eric S. Raymond Never could an increase of comfort or security be a sufficient good to be bought at the price of liberty. 
-- Hillaire Belloc From daniele at grinta.net Sat Aug 26 15:36:03 2017 From: daniele at grinta.net (Daniele Nicolodi) Date: Sat, 26 Aug 2017 09:36:03 -0600 Subject: State of Mac OS support In-Reply-To: <20170826135823.D408713A0209@snark.thyrsus.com> References: <20170826135823.D408713A0209@snark.thyrsus.com> Message-ID: On 26/08/17 07:58, Eric S. Raymond via devel wrote: > There have been inquiries from Daniele Nicolodi and Fred Wright about > Mac OS X support. [...] > We're supporting 10.12, which has the POSIX clock calls. Earlier > versions can go piss up a rope. I was only observing that the website needs to be updated to reflect reality. At the moment it says that MacOS is supported, without any version specifier. Cheers, Daniele From esr at thyrsus.com Sat Aug 26 16:19:28 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Sat, 26 Aug 2017 12:19:28 -0400 Subject: State of Mac OS support In-Reply-To: References: <20170826135823.D408713A0209@snark.thyrsus.com> Message-ID: <20170826161928.GA676@thyrsus.com> Daniele Nicolodi via devel : > I was only observing that the website needs to be updated to reflect > reality. At the moment it says that MacOS is supported, without any > version specifier. I've pushed an update. -- Eric S. Raymond Please consider contributing to my Patreon page at https://www.patreon.com/esr so I can keep the invisible wheels of the Internet turning. Give generously - the civilization you save might be your own. From esr at thyrsus.com Sat Aug 26 17:35:24 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Sat, 26 Aug 2017 13:35:24 -0400 (EDT) Subject: Blue-sky thread - ideas for well after 1.0 Message-ID: <20170826173524.8989613A0209@snark.thyrsus.com> Because we've done our work well, we're on what looks like an easy glide path to a 1.0 release in September. There are no more issues that look like blockers rather than irritations; if forced to it, we could ship tomorrow. I believe we've earned the luxury of some blue-sky thinking. 
I'm not talking about relatively short-term good ideas like NTS or AES-CMAC; those are normal forward engineering. I mean really ambitious plans. I'm going to share the two out-there ideas I have for the long term, and invite any of the rest of you to pitch in your own or react to mine. 1. Field, and then push through IETF, an NTPv5 that solves the functional problems with NTPv4, like failing to embed its epoch in sync packets. I have some notes towards this in devel/ntpv5.txt. I've stated before at our FTF meetings that this is my personal endgame. Having got this done, I think I'd be ready to pass the maintainer's baton onwards. 2. Translate the whole mess to Go. Of course, the motivation for this would be to forever banish all buffer-overrun and memory-allocation bugs, and their related security issues. (For those of you unfamiliar with Go, it is an extremely C-like language with garbage collection, an object system, and some very elegant concurrency primitives. Google 'golang' for more.) Back when we had 231KLOC of messy code with a lot of non-standardized calls in it, moving it to a different language would have been impractical. Now that we're down to 56KLOC of code that is mostly POSIX-clean, it's beginning to look pretty attractive. Because the boilerplate required for memory management in C is so bulky, I think we might drop as much as 10KLOC in the move. And if I'm wrong, I'm probably underestimating the gains. A 46KLOC ntpd (about the size GPSD is now) that could never have an overrun vulnerability again would be an achievement. -- Eric S. Raymond Never could an increase of comfort or security be a sufficient good to be bought at the price of liberty.
-- Hillaire Belloc

From dfoxfranke at gmail.com Sat Aug 26 19:06:37 2017
From: dfoxfranke at gmail.com (Daniel Franke)
Date: Sat, 26 Aug 2017 15:06:37 -0400
Subject: Blue-sky thread - ideas for well after 1.0
In-Reply-To: <20170826173524.8989613A0209@snark.thyrsus.com>
References: <20170826173524.8989613A0209@snark.thyrsus.com>
Message-ID:

There aren't many deficiencies in NTPv4 which can't be fixed by adding extension fields. A change big enough to make a version bump worthwhile would incorporate at least most of the following:

1. Drop everything other than client/server mode. Replace mode 6 with something that runs over HTTPS on the NTS-KE port.

2. Let client and server packets be formatted differently. Achieve data minimization by just taking the unnecessary fields out of client packets altogether.

3. Forbid use of the legacy MAC field, thus fixing the hairiness around extension parsing.

4. Make NTS mandatory. In the NTPv5 packet format, the version, mode, NTS unique identifier, and (in client packets) NTS cookie come first in plaintext, then the whole rest of the packet is encrypted.

5. Ditch the useless poll, stratum, refid, and reference timestamp fields. Given that all of the above are implemented, origin timestamp also becomes redundant (NTS takes the place of its anti-spoofing role).

6. Represent timestamps as days, seconds, and fractions so that the time can be represented unambiguously during leap seconds. Make the day field 64 bits wide so that its range comfortably exceeds the lifespan of the solar system.

7. Don't implement leap smearing in the wire protocol (servers should always report accurate, unsmeared time), but standardize a formula for translating NTP time into smeared UNIX time seen by other applications.
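
[Editor's sketch] Item 6 above can be illustrated with a small C structure. This is only a sketch under stated assumptions: the type name v5_timestamp, the field widths other than the 64-bit day count, and the convention that second 86400 denotes an inserted leap second are all hypothetical and do not come from any NTP draft. It shows why a days/seconds/fraction split can name 23:59:60 without ambiguity while still sorting correctly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout for illustration; not taken from any NTP draft. */
struct v5_timestamp {
    int64_t  day;       /* days since an agreed epoch (64 bits, per item 6) */
    uint32_t second;    /* second within the day: 0..86399, or 86400
                           during an inserted leap second */
    uint32_t fraction;  /* binary fraction of a second, units of 2^-32 s */
};

/* True when the timestamp falls inside an inserted leap second (23:59:60). */
static bool v5_in_leap_second(const struct v5_timestamp *ts)
{
    return ts->second == 86400;
}

/* Three-way compare: negative, zero, or positive, like memcmp(). */
static int v5_compare(const struct v5_timestamp *a,
                      const struct v5_timestamp *b)
{
    /* Assumed invariant of this sketch: at most one leap second per day. */
    assert(a->second <= 86400 && b->second <= 86400);
    if (a->day != b->day)
        return a->day < b->day ? -1 : 1;
    if (a->second != b->second)
        return a->second < b->second ? -1 : 1;
    if (a->fraction != b->fraction)
        return a->fraction < b->fraction ? -1 : 1;
    return 0;
}
```

With this layout the leap second gets its own value of the second field, so it sorts strictly between 23:59:59 and 00:00:00 of the next day instead of colliding with either.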
From fw at fwright.net Sat Aug 26 20:49:07 2017 From: fw at fwright.net (Fred Wright) Date: Sat, 26 Aug 2017 13:49:07 -0700 (PDT) Subject: State of Mac OS support In-Reply-To: <20170826135823.D408713A0209@snark.thyrsus.com> References: <20170826135823.D408713A0209@snark.thyrsus.com> Message-ID: On Sat, 26 Aug 2017, Eric S. Raymond via devel wrote: > There have been inquiries from Daniele Nicolodi and Fred Wright about > Mac OS X support. Actually, mine wasn't an inquiry, it was an answer. :-) > Mark Atwood, who's our strategy/product-management/external-relations > specialist, may override me on this. Unless and until he does, here's > the skinny: > > > > We're supporting 10.12, which has the POSIX clock calls. Earlier > versions can go piss up a rope. Well, "POSIX" isn't uniquely valued. For example, some widely available versions of Linux still require linking with librt for clock_gettime(), since that wasn't a POSIX call at the time of that glibc version. Even OSX 10.5 passed all the POSIX compliance tests for whatever version of POSIX was in use at the time (and at least at that time, actually passing POSIX compliance tests was fairly rare). In general, failing to support OS versions that are supported by classic NTP isn't a good way to encourage adoption. > The reason I am vehement about this is that I recently learned a > thing: some pre-10.12 versions ship with headers that don't match > what's documented for their releases on the Apple website. I have no idea WTF you're referring to here. I know my version, which mainly just falls back to the *POSIX* gettimeofday()/settimeofday() calls, at least builds fine for 10.5-10.12, or at least did so until e92a112b8 broke the build for OSX *including 10.12*. > If Apple can't be bothered to keep its act together enough to present > a stable and documented API, we can't be bothered to support their > crap. Actually they have a stable and documented API which is even "POSIX", just not the one you were using. 
It's called gettimeofday()/settimeofday(). :-)

Even in 10.12, clock_gettime()/clock_settime() offer no operational advantage over gettimeofday()/settimeofday(). The apparent nanosecond resolution is an illusion. Preferring the former makes sense, but falling back to the latter is perfectly OK (and not even OSX-specific). The advice against using gettimeofday()/settimeofday() in the hacking guide is inapplicable to systems that lack clock_gettime()/clock_settime().

> Yes, we used to ship a special accommodation for Apple's non-POSIX
> calls in 10.11. At some point it broke. Given what I learned later -

Aside from whatever you're referring to, the OSX-specific fallback was *always* functionally broken. The commonly circulated example for how to use OSX clock_get_time() has a "port leak" bug (as well as being slower than it needs to be). I kept a corrected version of it for CLOCK_MONOTONIC, though the only place that's used is in ntpfrob (albeit in a context where it shouldn't be using it). Of course there are lots of places that *should* be using CLOCK_MONOTONIC (or better still, CLOCK_MONOTONIC_RAW), but that's another story.

> including one rumor that their non-POSIX clock-setting call is a no-op
> - I cannot be fscking bothered to figure out *why* it broke.

I have no idea whether clock_set_time() is broken or not, but clearly settimeofday() works, since classic NTP (both Apple's version and the MacPorts version) works on OSX <10.12.
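
[Editor's sketch] The prefer-then-fall-back approach described above might look roughly like the following. The helper name wallclock_now() is hypothetical and this is not the NTPsec code; it also assumes that the absence of the CLOCK_REALTIME macro at compile time is an adequate test for a missing clock_gettime(), which may not hold when building against a newer SDK than the target system.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/*
 * Hypothetical helper, not NTPsec code: read wall-clock time into a
 * struct timespec, preferring clock_gettime() and falling back to
 * gettimeofday() where CLOCK_REALTIME is not defined.
 * Returns 0 on success, -1 on failure.
 */
static int wallclock_now(struct timespec *ts)
{
    assert(ts != NULL);
#if defined(CLOCK_REALTIME)
    return clock_gettime(CLOCK_REALTIME, ts);
#else
    struct timeval tv;

    if (gettimeofday(&tv, NULL) != 0)
        return -1;
    ts->tv_sec = tv.tv_sec;
    ts->tv_nsec = (long)tv.tv_usec * 1000;  /* microseconds to nanoseconds */
    return 0;
#endif
}
```

Callers see a struct timespec either way, so the rest of the code does not need to know which call supplied the time.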
Fred Wright From hmurray at megapathdsl.net Sat Aug 26 21:19:54 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Sat, 26 Aug 2017 14:19:54 -0700 Subject: Recently introduced build error Message-ID: <20170826211954.1F283406063@ip-64-139-1-69.sjc.megapath.net> [ 63/113] Compiling libntp/msyslog.c In file included from /usr/include/python2.7/pyconfig.h:6:0, from /usr/include/python2.7/Python.h:8, from ../../libntp/pymodule.c:7: /usr/include/python2.7/pyconfig-64.h:1199:0: warning: "_POSIX_C_SOURCE" redefined #define _POSIX_C_SOURCE 200112L :0:0: note: this is the location of the previous definition That is on Fedora 26 -- These are my opinions. I hate spam. From fw at fwright.net Sat Aug 26 21:52:12 2017 From: fw at fwright.net (Fred Wright) Date: Sat, 26 Aug 2017 14:52:12 -0700 (PDT) Subject: Recently introduced build error In-Reply-To: <20170826211954.1F283406063@ip-64-139-1-69.sjc.megapath.net> References: <20170826211954.1F283406063@ip-64-139-1-69.sjc.megapath.net> Message-ID: On Sat, 26 Aug 2017, Hal Murray via devel wrote: > [ 63/113] Compiling libntp/msyslog.c > In file included from /usr/include/python2.7/pyconfig.h:6:0, > from /usr/include/python2.7/Python.h:8, > from ../../libntp/pymodule.c:7: > /usr/include/python2.7/pyconfig-64.h:1199:0: warning: "_POSIX_C_SOURCE" > redefined > #define _POSIX_C_SOURCE 200112L > > :0:0: note: this is the location of the previous definition > > That is on Fedora 26 See if it's e92a112b8. Fred Wright From hmurray at megapathdsl.net Sat Aug 26 22:17:12 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Sat, 26 Aug 2017 15:17:12 -0700 Subject: Recently introduced build error In-Reply-To: Message from Fred Wright via devel of "Sat, 26 Aug 2017 14:52:12 PDT." Message-ID: <20170826221712.6A2BF40605C@ip-64-139-1-69.sjc.megapath.net> > See if it's e92a112b8. It works after: git checkout 98ed7cc3dbd9650168b10cf5bb3bc156c92a5476 -- These are my opinions. I hate spam. 
From daniele at grinta.net Sat Aug 26 23:05:57 2017 From: daniele at grinta.net (Daniele Nicolodi) Date: Sat, 26 Aug 2017 17:05:57 -0600 Subject: State of Mac OS support In-Reply-To: <20170826161928.GA676@thyrsus.com> References: <20170826135823.D408713A0209@snark.thyrsus.com> <20170826161928.GA676@thyrsus.com> Message-ID: On 26/08/17 10:19, Eric S. Raymond wrote: > Daniele Nicolodi via devel : >> I was only observing that the website needs to be updated to reflect >> reality. At the moment it says that MacOS is supported, without any >> version specifier. > > I've pushed an update. There must be a markup error, the page does not render correctly. Cheers, Daniele From hmurray at megapathdsl.net Sun Aug 27 01:48:55 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Sat, 26 Aug 2017 18:48:55 -0700 Subject: Blue-sky thread - ideas for well after 1.0 In-Reply-To: Message from "Eric S. Raymond via devel" of "Sat, 26 Aug 2017 13:35:24 EDT." <20170826173524.8989613A0209@snark.thyrsus.com> Message-ID: <20170827014855.BE39540605C@ip-64-139-1-69.sjc.megapath.net> You didn't say anything about anti-forgery. In the long range, I think we will need a trusted organization to run and manage enough servers to support the load, something along the lines of the current DNS setup. -- These are my opinions. I hate spam. From hmurray at megapathdsl.net Sun Aug 27 02:01:56 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Sat, 26 Aug 2017 19:01:56 -0700 Subject: Blue-sky thread - ideas for well after 1.0 In-Reply-To: Message from Daniel Franke via devel of "Sat, 26 Aug 2017 15:06:37 EDT." Message-ID: <20170827020156.3721940605C@ip-64-139-1-69.sjc.megapath.net> devel at ntpsec.org said: > 4. Make NTS mandatory. In the NTPv5 packet format, the version, mode, NTS > unique identifier, and (in client packets) NTS cookie come first in > plaintext, then the whole rest of the packet is encrypted. Is there a good high-level writeup of NTS? Why encrypt stuff? 
(as compared to verify) Are there any useful techniques for monitoring or debugging encrypted traffic? > 6. Represent timestamps as days, seconds, and fractions so that the time can > be represented unambiguously during leap seconds. Make the day field 64 bits > wide so that its range comfortable exceeds the lifespan of the solar system. 64 bits of days seems like way overkill. 32 bits of days is over 23 bits of years. Are you really worried about more than a million years? Should the wire protocol use a non-leap time scale? (and include the offset to UTC) > 7. Don't implement leap smearing in the wire protocol (servers should always > report accurate, unsmeared time), but standardize a formula for translating > NTP time into smeared UNIX time seen by other applications. That's the tip of an iceberg for getting POSIX to get their leap out of the sand. -- These are my opinions. I hate spam. From dfoxfranke at gmail.com Sun Aug 27 02:46:28 2017 From: dfoxfranke at gmail.com (Daniel Franke) Date: Sat, 26 Aug 2017 22:46:28 -0400 Subject: Blue-sky thread - ideas for well after 1.0 In-Reply-To: <20170827020156.3721940605C@ip-64-139-1-69.sjc.megapath.net> References: <20170827020156.3721940605C@ip-64-139-1-69.sjc.megapath.net> Message-ID: On 8/26/17, Hal Murray wrote: > Is there a good high-level writeup of NTS? https://tools.ietf.org/html/draft-ietf-ntp-using-nts-for-ntp-09#section-1.2 > Why encrypt stuff? (as compared to verify) NTS authenticates everything and encrypts as much as possible without breaking backward compatibility and middleboxes. Encryption is mostly for privacy -- prevent leaking anything that could permit tracking of mobile systems. Data minimization already solves 99% of this, but since adding encryption is basically free, it should be the default anytime there's not a particular reason you *want* middleboxes to be able to snoop traffic. > Are there any useful techniques for monitoring or debugging encrypted > traffic? 
Log encryption keys, or the plaintext itself, at endpoints. If you don't have endpoint cooperation, then inability to extract debug info is a feature, not a bug. > 64 bits of days seems like way overkill. 32 bits of days is over 23 bits of > > years. Are you really worried about more than a million years? It certainly won't be *my* problem. But either way, packets, not bits, are the bottleneck. NTP messages fit in one packet either way. The extra 32 bits are free. > Should the wire protocol use a non-leap time scale? (and include the offset > > to UTC) Either way, I favor including UTC-TAI offset as a field. But even given that, providing timestamps as UTC rather than TAI gives more information, since it enables conversion to calendar date & time without needing a full leap table. > That's the tip of an iceberg for getting POSIX to get their leap out of the > sand. Yeah, POSIX time sucks, but that's a separate problem. My proposal allows NTP to do things the right way, while at same time translating into POSIX with as much fidelity as it's capable of representing. From Stromeko at nexgo.de Sun Aug 27 08:26:54 2017 From: Stromeko at nexgo.de (Achim Gratz) Date: Sun, 27 Aug 2017 10:26:54 +0200 Subject: Something is buggy with iburst... References: <20170509003140.BCC50406061@ip-64-139-1-69.sjc.megapath.net> <87a86mt2sl.fsf@Rainer.invalid> <87shk9rw9b.fsf@Rainer.invalid> Message-ID: <87378dct69.fsf@Rainer.invalid> Achim Gratz via devel writes: > I still think there must be some bug somewhere that either makes the > client send too many packets or the server sending that KOD too early. This is still happening with ntpsec-0.9.7+1104, albeit much less often now (but I've removed iburst from the configuration files). I just had it happen while I was updating the rasPis. 
Again, the symptom is that there is a "rate_exceeded" event and the hpoll gets set to some high value and never recovers from there even though the poll interval should be fixed:

assoc=53276: conf, reach, sel_reject, 1 event, rate_exceeded
unreach=0 hmode=3 pmode=4 hpoll=10 ppoll=4 headway=7908 flash=4096 keyid=0

This was and is not happening with NTP classic.

Regards, Achim.
--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

Wavetables for the Terratec KOMPLEXER:
http://Synth.Stromeko.net/Downloads.html#KomplexerWaves

From Stromeko at nexgo.de Sun Aug 27 08:48:00 2017
From: Stromeko at nexgo.de (Achim Gratz)
Date: Sun, 27 Aug 2017 10:48:00 +0200
Subject: Something is buggy with iburst...
References: <20170509003140.BCC50406061@ip-64-139-1-69.sjc.megapath.net> <87a86mt2sl.fsf@Rainer.invalid> <87shk9rw9b.fsf@Rainer.invalid> <87378dct69.fsf@Rainer.invalid>
Message-ID: <87y3q5bdmn.fsf@Rainer.invalid>

Achim Gratz via devel writes:
> This was and is not happening with NTP classic.

I've just updated to ntpsec-0.9.7+1232 and had it happen again while restarting all ntpd after the build. It happens more often with the rasPi 1B+ which is probably somewhat slower than the others to respond when it's under load.

Regards, Achim.
--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

Wavetables for the Waldorf Blofeld:
http://Synth.Stromeko.net/Downloads.html#BlofeldUserWavetables

From Stromeko at nexgo.de Sun Aug 27 09:22:14 2017
From: Stromeko at nexgo.de (Achim Gratz)
Date: Sun, 27 Aug 2017 11:22:14 +0200
Subject: #251: Add fudge option to server config
References: <20170814154754.D167913A0206@snark.thyrsus.com> <87tw1arubz.fsf@Rainer.invalid>
Message-ID: <87tw0tbc1l.fsf@Rainer.invalid>

Achim Gratz via devel writes:
> I've got switched to VDSL2 w/ full vectoring last week and the
> asymmetry went to below 300µs best I can tell, so I've removed the
> fudge completely for now. I can't fully load the downstream anymore
> as I'm currently synced at ~105MBit but get capped at 50MBit via QOS
> by the provider, but I've seen delay spikes to around 10ms so far.

My provider seems to have finished the line training. I got switched to 55.5/10MBit for down-/up-stream a week or so ago. The delay seems to have stabilized in the process (going from formerly 26ms to 15ms too) and I am seeing slightly less variation at the moment and only around 200-300µs excess asymmetry when loading the downstream to capacity.

Regards, Achim.
--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

SD adaptation for Waldorf Blofeld V1.15B11:
http://Synth.Stromeko.net/Downloads.html#WaldorfSDada

From esr at thyrsus.com Sun Aug 27 11:54:16 2017
From: esr at thyrsus.com (Eric S. Raymond)
Date: Sun, 27 Aug 2017 07:54:16 -0400
Subject: Recently introduced build error
In-Reply-To: <20170826211954.1F283406063@ip-64-139-1-69.sjc.megapath.net>
References: <20170826211954.1F283406063@ip-64-139-1-69.sjc.megapath.net>
Message-ID: <20170827115416.GA30270@thyrsus.com>

Hal Murray :
> [ 63/113] Compiling libntp/msyslog.c
> In file included from /usr/include/python2.7/pyconfig.h:6:0,
> from /usr/include/python2.7/Python.h:8,
> from ../../libntp/pymodule.c:7:
> /usr/include/python2.7/pyconfig-64.h:1199:0: warning: "_POSIX_C_SOURCE"
> redefined
> #define _POSIX_C_SOURCE 200112L
>
> :0:0: note: this is the location of the previous definition
>
> That is on Fedora 26

You are in a maze of twisty little compiler quirks, all different.

What I was trying to do is reduce the scope of our dependence on GNU C features. So I changed the global options, removing -D_GNU_SOURCE and replacing it with -D_POSIX_C_SOURCE=200809L -D_XOPEN_SOURCE=600.

This worked just fine on my Ubuntu Linux, gcc 5.4.0. It wreaked different kinds of havoc on other systems.

Sigh...commit reverted.
-- Eric S.
Raymond Please consider contributing to my Patreon page at https://www.patreon.com/esr so I can keep the invisible wheels of the Internet turning. Give generously - the civilization you save might be your own. From esr at thyrsus.com Sun Aug 27 12:23:37 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Sun, 27 Aug 2017 08:23:37 -0400 Subject: Blue-sky thread - ideas for well after 1.0 In-Reply-To: References: <20170826173524.8989613A0209@snark.thyrsus.com> Message-ID: <20170827122337.GB30801@thyrsus.com> Daniel Franke : > There aren't many deficiencies in NTPv4 which can't be fixed by adding > extension fields. True, and the basis of one of my proposal variants. > A change big enough to make a version bump > worthwhile would incorporate at least most of the following: > > 1. Drop everything other than client/server mode. Replace mode 6 with > something that runs over HTTPS on the NTS-KE port. > > 2. Let client and server packets be formatted differently. Achieve > data minimization by just taking the unnecessary fields out of client > packets altogether. > > 3. Forbid use of the legacy MAC field, thus fixing the hairiness > around extension parsing. > > 4. Make NTS mandatory. In the NTPv5 packet format, the version, mode, > NTS unique identifier, and (in client packets) NTS cookie come first > in plaintext, then the whole rest of the packet is encrypted. > > 7. Don't implement leap smearing in the wire protocol (servers should > always report accurate, unsmeared time), but standardize a formula for > translating NTP time into smeared UNIX time seen by other > applications. I concur with all of these. > 5. Ditch the useless poll, stratum, refid, and reference timestamp > fields. Given that all of the above are implemented, origin timestamp > also becomes redundant (NTS takes the place of its anti-spoofing > role). Aren't we going to need some equivalent of refid for loop detection? Otherwise I agree these seem dispensable. > 6. 
Represent timestamps as days, seconds, and fractions so that the > time can be represented unambiguously during leap seconds. Make the > day field 64 bits wide so that its range comfortable exceeds the > lifespan of the solar system. There be dragons here. This would disrupt the implementation a *whole lot*, enough to make verification rather difficult. I think a more practical plan would combine the following: (1) Include the server's epoch date in the packet (27 bytes in ISO8601, at least until after 9999CE), (2) include the server's leap offset, and (3) *remove* leap-second correction from timestamps so they're just seconds from epoch. -- Eric S. Raymond Please consider contributing to my Patreon page at https://www.patreon.com/esr so I can keep the invisible wheels of the Internet turning. Give generously - the civilization you save might be your own. From esr at thyrsus.com Sun Aug 27 13:02:06 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Sun, 27 Aug 2017 09:02:06 -0400 (EDT) Subject: Apparent protocol-machine bug, new top priority Message-ID: <20170827130206.88A9413A0209@snark.thyrsus.com> Heads up, Daniel! Achim Gratz via devel writes: >> I still think there must be some bug somewhere that either makes the >> client send too many packets or the server sending that KOD too early. > >This is still happening with ntpsec-0.9.7+1104, albeit much less often >now (but I've removed iburst from the configuration files). I just had >it happen while I was updating the rasPis. Again, the symptom is that >there is a "rate_exceeded" event and the hpoll gets set to some high >value and never recovers from there even though the poll interval should >be fixed: > >assoc=53276: conf, reach, sel_reject, 1 event, rate_exceeded >unreach=0 hmode=3 pmode=4 hpoll=10 ppoll=4 headway=7908 flash=4096 keyid=0 > >This was and is not happening with NTP classic. 
Now that iburst has been fixed - and Achim reports seeing this problem with iburst off - this pretty much has to be an issue deeper in the protocol machine. (I guess we should count our blessings and congratulate Daniel that there haven't more of these since the big refactor.) If this is happening with iburst *off*, it becomes more difficult to understand how the rate limit is being triggered. I think maybe we should start by focusing on something else: why is hpoll not recovering after a KOD? I'm thinking this sounds like some KOD-recovery logic got lost during the refactor. I also judge this is our new most serious bug. Daniel, would you give it a hard look, please? You too, Hal - I'm thinking you have better odds of diagnosing this one than I do. -- Eric S. Raymond Rifles, muskets, long-bows and hand-grenades are inherently democratic weapons. A complex weapon makes the strong stronger, while a simple weapon -- so long as there is no answer to it -- gives claws to the weak. -- George Orwell, "You and the Atom Bomb", 1945 From esr at thyrsus.com Sun Aug 27 14:28:18 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Sun, 27 Aug 2017 10:28:18 -0400 (EDT) Subject: Apparent protocol-machine bug, new top priority Message-ID: <20170827142818.5F89C13A0209@snark.thyrsus.com> I wrote: >If this is happening with iburst *off*, it becomes more difficult to >understand how the rate limit is being triggered. I think maybe we >should start by focusing on something else: why is hpoll not >recovering after a KOD? > >I'm thinking this sounds like some KOD-recovery logic got lost during >the refactor. Trying to trace how things go bad. 
Looks to me like this piece of logic down around line 592, processing a KOD, sets minpoll high:

    if(is_kod(pkt)) {
        if(!memcmp(pkt->refid, "RATE", REFIDLEN)) {
            peer->selbroken++;
            report_event(PEVNT_RATE, peer, NULL);
            if (peer->minpoll < 10) { peer->minpoll = 10; }
            peer->burst = peer->retry = 0;
            peer->throttle = (NTP_SHIFT + 1) * (1 << peer->minpoll);
            poll_update(peer, 10);
        }
        return;
    }

Then poll_update sets hpoll to 10. Achim seems to be reporting that it stays stuck there. Now I look at this:

    void
    poll_update(
        struct peer *peer,  /* peer structure pointer */
        uint8_t mpoll
        )
    {
        unsigned long next, utemp;
        uint8_t hpoll;

        /*
         * This routine figures out when the next poll should be sent.
         * That turns out to be wickedly complicated. One problem is
         * that sometimes the time for the next poll is in the past when
         * the poll interval is reduced. We watch out for races here
         * between the receive process and the poll process.
         *
         * Clamp the poll interval between minpoll and maxpoll.
         */
        hpoll = max(min(peer->maxpoll, mpoll), peer->minpoll);
        peer->hpoll = hpoll;

This means that hpoll can never be set lower than minpoll. Which means there will never be any recovery from the KOD rate limit, no matter what values poll_update() is called with, unless minpoll is lowered.

But this never happens.

ntp_peer.c:721: peer->minpoll = min(minpoll, NTP_MAXPOLL);
ntp_peer.c:724: peer->minpoll = peer->maxpoll;
ntp_proto.c:596: if (peer->minpoll < 10) { peer->minpoll = 10; }
refclock_jjy.c:2788: peer->minpoll = 8 ;
refclock_oncore.c:621: peer->minpoll = 4;
refclock_trimble.c:469: peer->minpoll = TRMB_MINPOLL;

The ntp_peer.c hits are during new-peer initialization. The refclock hits are irrelevant, we're troubleshooting the code path for NTP peers.

My deduction is that ntp_proto.c:596 is probably wrong, it's disabling the normal poll interval hysteresis (which I admit I only vaguely understand). But the problem may be deeper than that.
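
[Editor's sketch] The one-way ratchet deduced above can be reproduced in isolation. The following is a cut-down model, not the ntpd source: struct mini_peer, mini_poll_update() and mini_rate_kod() are hypothetical names, and the peer pinned at minpoll = maxpoll = 4 is an assumed configuration chosen to resemble the reported hpoll=10 ppoll=4 symptom.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cut-down stand-in for ntpd's struct peer; only poll fields are modeled. */
struct mini_peer {
    uint8_t minpoll;
    uint8_t maxpoll;
    uint8_t hpoll;
};

/* Same clamp as the quoted poll_update(): max(min(maxpoll, mpoll), minpoll). */
static void mini_poll_update(struct mini_peer *peer, uint8_t mpoll)
{
    assert(peer != NULL);
    uint8_t capped = mpoll < peer->maxpoll ? mpoll : peer->maxpoll;
    peer->hpoll = capped > peer->minpoll ? capped : peer->minpoll;
}

/* The RATE kiss-o'-death branch, reduced to its effect on the poll fields. */
static void mini_rate_kod(struct mini_peer *peer)
{
    assert(peer != NULL);
    if (peer->minpoll < 10)
        peer->minpoll = 10;     /* raised here, never lowered anywhere */
    mini_poll_update(peer, 10);
}
```

In this model, once mini_rate_kod() has run, every later call to mini_poll_update() yields hpoll of at least 10 regardless of the requested value, because minpoll is the floor of the clamp and nothing resets it.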
The corresponding code in Classic is this:

    /*
     * Check to see if this is a RATE Kiss Code
     * Currently this kiss code will accept whatever poll
     * rate that the server sends
     */
    peer->ppoll = max(peer->minpoll, pkt->ppoll);
    if (kissCode == RATEKISS) {
        peer->selbroken++;  /* Increment the KoD count */
        report_event(PEVNT_RATE, peer, NULL);
        if (pkt->ppoll > peer->minpoll)
            peer->minpoll = peer->ppoll;
        peer->burst = peer->retry = 0;
        peer->throttle = (NTP_SHIFT + 1) * (1 << peer->minpoll);
        poll_update(peer, pkt->ppoll);
        return;  /* kiss-o'-death */
    }

I see that our line 596 is a replacement for allowing the KOD packet to set the poll rate. That makes all kinds of sense, as a spoofed KOD packet with a maliciously high poll interval is an obvious DoS vector. (See, Daniel? I are learning to think like an InfoSec paranoid.)

Unfortunately for this neat theory, the corresponding grep hits in Classic are:

ntp_peer.c:857: peer->minpoll = NTP_MINDPOLL;
ntp_peer.c:859: peer->minpoll = min(minpoll, NTP_MAXPOLL);
ntp_peer.c:865: peer->minpoll = peer->maxpoll;
ntp_proto.c:1589: peer->minpoll = peer->ppoll;

Again, the ntp_peer.c hits are during newpeer initialization. That is, I can't find any way that minpoll recovers after a KOD in Classic, either.

What am I missing here?
--
Eric S. Raymond

Rifles, muskets, long-bows and hand-grenades are inherently democratic weapons. A complex weapon makes the strong stronger, while a simple weapon -- so long as there is no answer to it -- gives claws to the weak. -- George Orwell, "You and the Atom Bomb", 1945

From Stromeko at nexgo.de Sun Aug 27 14:48:42 2017
From: Stromeko at nexgo.de (Achim Gratz)
Date: Sun, 27 Aug 2017 16:48:42 +0200
Subject: Apparent protocol-machine bug, new top priority
References: <20170827142818.5F89C13A0209@snark.thyrsus.com>
Message-ID: <87pobhawxh.fsf@Rainer.invalid>

Eric S. Raymond via devel writes:
> Again, the ntp_peer.c hits are during newpeer initialization. That
> is, I can't find any way that minpoll recovers after a KOD in
> Classic, either.

As far as I can tell, in classic I never seem to get a KOD or at least it doesn't move the poll interval up that high. Somehow, I have an inkling that the real problem is that the server might count some requests against the wrong peer and then try to shoot it down with a KOD. Otherwise I have no explanation for why it would be possible for a peer that already runs for quite some time to receive a KOD when another server (also a peer to the other servers) gets restarted.

Regards, Achim.
--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

Samples for the Waldorf Blofeld:
http://Synth.Stromeko.net/Downloads.html#BlofeldSamplesExtra

From esr at thyrsus.com Tue Aug 29 13:32:40 2017
From: esr at thyrsus.com (Eric S. Raymond)
Date: Tue, 29 Aug 2017 09:32:40 -0400 (EDT)
Subject: Century correction absent in some refclocks
Message-ID: <20170829133240.744E013A0206@snark.thyrsus.com>

Question directed particularly to Hal and Daniel:

I have been working on a document that Hal Murray requested, a comprehensive discussion of rollover effects in Unix, NTP, and GPSD calendars. Writing this has required me to read the refclock code looking for how it copes with these.

There is a mystery around processing of year input from refclocks.

Some report 4-digit years with a century part - gpsd_json, some modes of jjy and generic, magnavox, hpgps, the European modes of the modem driver, zyfer.

Some only report 2-digit years - arbiter, some modes of jjy and generic, the ACTS mode of the modem driver, neoclock, oncore, spectracom, trimble, and truetime.

In nmea, I see explicit code to derive a 4-digit year from a 2-digit year using some calendrical trickery I don't understand - but for this purpose it doesn't matter that I don't understand it, only that someone thought it was necessary.

What I don't understand is why the refclocks returning only 2-digit years ever worked at all.
Does the sample-processing code simply ignore the century part of the year? If so, why is nmea supplying that? Puzzled in Malvern... -- Eric S. Raymond "Are we to understand," asked the judge, "that you hold your own interests above the interests of the public?" "I hold that such a question can never arise except in a society of cannibals." -- Ayn Rand From hmurray at megapathdsl.net Tue Aug 29 19:26:58 2017 From: hmurray at megapathdsl.net (Hal Murray) Date: Tue, 29 Aug 2017 12:26:58 -0700 Subject: Century correction absent in some refclocks In-Reply-To: Message from "Eric S. Raymond via devel" of "Tue, 29 Aug 2017 09:32:40 EDT." <20170829133240.744E013A0206@snark.thyrsus.com> Message-ID: <20170829192658.55CAB40605C@ip-64-139-1-69.sjc.megapath.net> > What I don't understand is why the refclocks returning only 2-digit years > ever worked at all. Does the sample-processing code simply ignore the > century part of the year? If so, why is nmea supplying that? Here is a comment in refclock_process_f /* * Compute the timecode timestamp from the days, hours, minutes, * seconds and milliseconds/microseconds of the timecode. Use * clocktime() for the aggregate seconds and the msec/usec for * the fraction, when present. Note that this code relies on the * filesystem time for the years and does not use the years of * the timecode. */ That doesn't sound right, but I haven't started pulling that string. Converting 2 digit year to 4 digit should be simple: just add 2000. We don't need to pivot since we are only interested in the current time which can't be in the late 1900s, Things would be more interesting if we had old log files running through test decks. YEAR_BREAK and YEAR_PIVOT are defined in ntp.h The comments indicate that things are mixed up with tm_year I haven't pulled that string either. ------- This may be mixed up with the mysterious GPS rollover fixup I haven't found yet. -- These are my opinions. I hate spam. 
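
[Editor's sketch] The "just add 2000" conversion suggested above can be written in a few lines of C. The function name expand_refclock_year() is hypothetical and this is not the ntpd implementation; treating any value above 99 as already carrying its century is an added assumption of this sketch, as is the absence of a pivot.

```c
#include <assert.h>

/*
 * Hypothetical helper, not the ntpd implementation: widen a refclock
 * year to four digits. Values 0..99 are taken as 2000..2099 with no
 * pivot, on the reasoning that only current time matters here.
 * Larger values are assumed to be four-digit years already.
 */
static int expand_refclock_year(int year)
{
    assert(year >= 0);
    return year <= 99 ? year + 2000 : year;
}
```

A driver that parses "17" from a timecode would get 2017 back, while one that already parsed 2017 would get it back unchanged.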
From esr at thyrsus.com Wed Aug 30 03:40:43 2017 From: esr at thyrsus.com (Eric S. Raymond) Date: Tue, 29 Aug 2017 23:40:43 -0400 (EDT) Subject: Verified - ntpd ignores the year part of refclock timestamps Message-ID: <20170830034043.1919813A0206@snark.thyrsus.com> Hal wrote: >Here is a comment in refclock_process_f > /* > * Compute the timecode timestamp from the days, hours, minutes, > * seconds and milliseconds/microseconds of the timecode. Use > * clocktime() for the aggregate seconds and the msec/usec for > * the fraction, when present. Note that this code relies on the > * filesystem time for the years and does not use the years of > * the timecode. > */ >That doesn't sound right, but I haven't started pulling that string. No? Well, I just did. Fsck...me...sideways! It's true. The reason all those old, busted Y2K-afflicted refclocks worked is that ntpd really does ignore the year part of clock timestamps. A look at clocktime() is revealing, though you have to dig a little deeper than that to realize what's actually going on. For clock samples, the code (in effect) casts about for a year that puts the sample within 4 hours of the packet receipt time...which means your system clock has to be that approximately synced in order for a refclock timestamp to be processed correctly at all. There is one case where this really sucks. If you boot on a machine with no access to remote clock peers, and the system time has been thrown way off, it's going to try to conform its refclock samples to whatever garbage value the system clock is holding. If it's zeroed, you might get the "right" time of day within four hours of the Unix epoch. I think. There's an opportunity here. If the year were passed into clocktime, it could check for a year > 99 and use it, freeing us of dependency on the system clock in that case. Rather a boon for standalone operation fed by a couple of refclocks. -- Eric S. Raymond Good intentions will always be pleaded for every assumption of authority. 
It is hardly too strong to say that the Constitution was made to guard the
people against the dangers of good intentions. There are men in all ages
who mean to govern well, but they mean to govern. They promise to be good
masters, but they mean to be masters.
        -- Daniel Webster

From hmurray at megapathdsl.net Wed Aug 30 05:45:08 2017
From: hmurray at megapathdsl.net (Hal Murray)
Date: Tue, 29 Aug 2017 22:45:08 -0700
Subject: Verified - ntpd ignores the year part of refclock timestamps
In-Reply-To: Message from "Eric S. Raymond via devel"
 of "Tue, 29 Aug 2017 23:40:43 EDT." <20170830034043.1919813A0206@snark.thyrsus.com>
Message-ID: <20170830054508.E5D3040605C@ip-64-139-1-69.sjc.megapath.net>

devel at ntpsec.org said:
> No? Well, I just did. Fsck...me...sideways! It's true. The reason all
> those old, busted Y2K-afflicted refclocks worked is that ntpd really does
> ignore the year part of clock timestamps.

The problem I'm interested in is not Y2K, it's GPS rollover. 1024 weeks
is not an integral number of years. The day and month are garbage too.

I'm seeing:
  T2199801140539523001053
which is 1998, 01=Jan, 14

---------

I agree this is a mess. I think we need a flag to go with the year.
Then we can update the drivers to provide the year (and set the flag) as
we get to them.

How many of the GPSD test sets for NMEA have 2-digit years?

--
These are my opinions. I hate spam.

From esr at thyrsus.com Wed Aug 30 11:56:21 2017
From: esr at thyrsus.com (Eric S. Raymond)
Date: Wed, 30 Aug 2017 07:56:21 -0400
Subject: Verified - ntpd ignores the year part of refclock timestamps
In-Reply-To: <20170830054508.E5D3040605C@ip-64-139-1-69.sjc.megapath.net>
References: <20170830034043.1919813A0206@snark.thyrsus.com>
 <20170830054508.E5D3040605C@ip-64-139-1-69.sjc.megapath.net>
Message-ID: <20170830115621.GA23992@thyrsus.com>

Hal Murray :
> The problem I'm interested in is not Y2K, it's GPS rollover. 1024 weeks is
> not an integral number of years. The day and month are garbage too.

Right, not the same problem.

> I agree this is a mess. I think we need a flag to go with the year.
> Then we can update the drivers to provide the year (and set the flag) as we
> get to them.

It's easier than that. I already have a patch that passes the year into
clocktime, and you can always tell when you had a 4-digit year because
the value passed in will be > 99. All that's needed is one line to set
the yearstart variable from a 4-digit year number.

Dead easy to test, too. The NMEA driver uses a weird calendrical trick I
don't quite understand (that it says will only work until 2399) to deduce
the current century and *always* passes 4 digits; all that needs to be
checked is whether the new logic computes a sane yearstart value - the
existing code will do the rest.

While it's kind of weird that nobody fixed this before, it's a nice
improvement to add to our new-feature list. Makes it actually possible to
run autonomously starting from a zeroed system clock with one or more
local refclocks and no network peers.

> How many of the GPSD test sets for NMEA have 2-digit years?

Most of them. You don't get a 4-digit year unless the device emits GPZDA,
which is unusual - most consumer-grade hardware does not. The mt3339 used
in the Adafruit HAT does.
--
Eric S. Raymond

Please consider contributing to my Patreon page at https://www.patreon.com/esr
so I can keep the invisible wheels of the Internet turning. Give generously -
the civilization you save might be your own.

From hmurray at megapathdsl.net Wed Aug 30 22:00:31 2017
From: hmurray at megapathdsl.net (Hal Murray)
Date: Wed, 30 Aug 2017 15:00:31 -0700
Subject: Verified - ntpd ignores the year part of refclock timestamps
In-Reply-To: Message from "Eric S. Raymond"
 of "Wed, 30 Aug 2017 07:56:21 EDT." <20170830115621.GA23992@thyrsus.com>
Message-ID: <20170830220031.BE69C40605C@ip-64-139-1-69.sjc.megapath.net>

How many of the NMEA devices have GPS rollover problems?
(either now or soon)

--
These are my opinions.
I hate spam.

From esr at thyrsus.com Thu Aug 31 00:09:38 2017
From: esr at thyrsus.com (Eric S. Raymond)
Date: Wed, 30 Aug 2017 20:09:38 -0400
Subject: Verified - ntpd ignores the year part of refclock timestamps
In-Reply-To: <20170830220031.BE69C40605C@ip-64-139-1-69.sjc.megapath.net>
References: <20170830115621.GA23992@thyrsus.com>
 <20170830220031.BE69C40605C@ip-64-139-1-69.sjc.megapath.net>
Message-ID: <20170831000938.GA5851@thyrsus.com>

Hal Murray :
> How many of the NMEA devices have GPS rollover problems? (either now or soon)

It's impossible to tell. When a device will roll over is, because of the
pivot-date trick, not a function of its hardware type but of its firmware
release.
--
Eric S. Raymond

Please consider contributing to my Patreon page at https://www.patreon.com/esr
so I can keep the invisible wheels of the Internet turning. Give generously -
the civilization you save might be your own.

From hmurray at megapathdsl.net Thu Aug 31 00:35:38 2017
From: hmurray at megapathdsl.net (Hal Murray)
Date: Wed, 30 Aug 2017 17:35:38 -0700
Subject: Verified - ntpd ignores the year part of refclock timestamps
In-Reply-To: Message from "Eric S. Raymond"
 of "Wed, 30 Aug 2017 20:09:38 EDT." <20170831000938.GA5851@thyrsus.com>
Message-ID: <20170831003538.3B9FE40605C@ip-64-139-1-69.sjc.megapath.net>

>> How many of the NMEA devices have GPS rollover problems?
>> (either now or soon)

> It's impossible to tell. When a device will roll over is, because of the
> pivot-date trick, not a function of its hardware type but of its firmware
> release. --

Are there any known examples?

I have a collection of NMEA toys. I don't remember seeing GPS rollover on
any of them. Some of them are quite old, but I don't think any have
reached 20 years yet.

We could test fixup software by setting the system clock ahead far enough
to look like GPS had rolled over.

--
These are my opinions. I hate spam.

From esr at thyrsus.com Thu Aug 31 03:41:05 2017
From: esr at thyrsus.com (Eric S.
Raymond)
Date: Wed, 30 Aug 2017 23:41:05 -0400
Subject: Verified - ntpd ignores the year part of refclock timestamps
In-Reply-To: <20170831003538.3B9FE40605C@ip-64-139-1-69.sjc.megapath.net>
References: <20170831000938.GA5851@thyrsus.com>
 <20170831003538.3B9FE40605C@ip-64-139-1-69.sjc.megapath.net>
Message-ID: <20170831034105.GA7359@thyrsus.com>

Hal Murray :
>
> >> How many of the NMEA devices have GPS rollover problems?
> >> (either now or soon)
>
> > It's impossible to tell. When a device will roll over is, because of the
> > pivot-date trick, not a function of its hardware type but of its firmware
> > release. --
>
> Are there any known examples?

We had a tracker issue relating to this on an OnCore GT. I think the
submitter promised a patch, but it hasn't landed.

I'm pretty sure I've seen one or two descriptions of people coping with
rollovers on time-nuts while chasing possible sources for old refclock
types.

It's not exactly a *common* problem - most people who buy consumer-grade
GPSes don't seem to keep them in service that long. I've never seen it
myself. I have one device that might be old enough - one of the original
DeLorme Earthmates from the early nineties - but I haven't powered it up
in a *long* time; not sure it still works.

> I have a collection of NMEA toys. I don't remember seeing GPS rollover on
> any of them. Some of them are quite old, but I don't think any have reached
> 20 years yet.

Right. You'd have to watch for 19.6 years (1024 weeks) after you acquired
the device to be *sure* of seeing it roll over.

> We could test fixup software by setting the system clock ahead far enough to
> look like GPS had rolled over.

What kind of fixup? I looked long and hard at this problem in the context
of GPSD. I never found one that wasn't as bad - or worse - than relying on
the system clock date.
--
Eric S. Raymond

Please consider contributing to my Patreon page at https://www.patreon.com/esr
so I can keep the invisible wheels of the Internet turning.
Give generously - the civilization you save might be your own.

From hmurray at megapathdsl.net Thu Aug 31 04:24:18 2017
From: hmurray at megapathdsl.net (Hal Murray)
Date: Wed, 30 Aug 2017 21:24:18 -0700
Subject: Verified - ntpd ignores the year part of refclock timestamps
In-Reply-To: Message from "Eric S. Raymond"
 of "Wed, 30 Aug 2017 23:41:05 EDT." <20170831034105.GA7359@thyrsus.com>
Message-ID: <20170831042418.A8F6A40605C@ip-64-139-1-69.sjc.megapath.net>

>> We could test fixup software by setting the system clock
>> ahead far enough to look like GPS had rolled over.

> What kind of fixup? I looked long and hard at this problem in the context
> of GPSD. I never found one that wasn't as bad - or worse - than relying on
> the system clock date.

I was thinking of testing the code to fix up a device that had rolled
over and was now off by 1024 weeks.

Pretend the date is 1024 weeks in the future. Now a good GPS device looks
like it has rolled over and is giving bogus time. Build ntpd with the
pivot date set to 10 years in the future and run it with the system clock
set to 15 years in the future. ntpd should fix up the GPS date and jump
to 1024 weeks in the future.

--
These are my opinions. I hate spam.

From esr at thyrsus.com Thu Aug 31 13:12:02 2017
From: esr at thyrsus.com (Eric S. Raymond)
Date: Thu, 31 Aug 2017 09:12:02 -0400
Subject: Verified - ntpd ignores the year part of refclock timestamps
In-Reply-To: <20170831042418.A8F6A40605C@ip-64-139-1-69.sjc.megapath.net>
References: <20170831034105.GA7359@thyrsus.com>
 <20170831042418.A8F6A40605C@ip-64-139-1-69.sjc.megapath.net>
Message-ID: <20170831131202.GA15351@thyrsus.com>

Hal Murray :
>
> >> We could test fixup software by setting the system clock
> >> ahead far enough to look like GPS had rolled over.
>
> > What kind of fixup? I looked long and hard at this problem in the context
> > of GPSD. I never found one that wasn't as bad - or worse - than relying on
> > the system clock date.
>
> I was thinking of testing the code to fixup a device that had rolled over and
> was now off by 1024 weeks.
>
> Pretend the date is 1024 weeks in the future. Now a good GPS device looks
> like it has rolled over and is giving bogus time. Build ntpd with the pivot
> date set to 10 years in the future and run it with the system clock set to 15
> years in the future. ntpd should fixup the GPS date and jump to 1024 weeks
> in the future.

Nothing we do with the system clock in a test setup tells us how we can
compensate in production, because one of our premises is that at startup
we can't trust the system clock.

I was willing to do that in GPSD only because at the time I thought we
were mainly in the location business rather than the time business; if we
do it in ntpd, we sabotage autonomous operation. See
http://blog.ntpsec.org/2017/08/30/achieving-autonomy.html if you have not
already.

We *can*, on the other hand, trust the observation that the pivot date is
greater than the time/date the GPS is returning. If we see that, we know
the GPS has rolled over, but we can't tell how many times it has rolled
over. (This will become a question after the second rollover in 2019).

We could try assuming it has only rolled over once, but one obvious way
for that to go wrong is if the ntpd pivot date is more than 1024 weeks in
the past. *That* is going to become an issue for our earliest
installations right around the time of the 32-bit POSIX time wraparound.

It's a hall of mirrors. Every time you think you've found a way out,
you're staring at another one of your own assumptions.
--
Eric S. Raymond

Please consider contributing to my Patreon page at https://www.patreon.com/esr
so I can keep the invisible wheels of the Internet turning. Give generously -
the civilization you save might be your own.

From gem at rellim.com Thu Aug 31 23:54:35 2017
From: gem at rellim.com (Gary E.
Miller)
Date: Thu, 31 Aug 2017 16:54:35 -0700
Subject: ✘Build failure
Message-ID: <20170831165435.2c697d5c@spidey.rellim.com>

Yo All!

I just tried to build ntpsec for the first time in weeks. Not good.

See attached.

RGDS
GARY
---------------------------------------------------------------------------
Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97703
	gem at rellim.com  Tel:+1 541 382 8588

	    Veritas liberabit vos. -- Quid est veritas?
    "If you can't measure it, you can't improve it." - Lord Kelvin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nohup.out
Type: application/octet-stream
Size: 38543 bytes
Desc: not available
URL:

From gem at rellim.com Thu Aug 31 23:59:12 2017
From: gem at rellim.com (Gary E. Miller)
Date: Thu, 31 Aug 2017 16:59:12 -0700
Subject: ✘Build failure
In-Reply-To: <20170831165435.2c697d5c@spidey.rellim.com>
References: <20170831165435.2c697d5c@spidey.rellim.com>
Message-ID: <20170831165912.452c7eee@spidey.rellim.com>

Yo Gary!

> I just tried to build ntpsec for the first time in weeks. Not good.

Here is how I build:

    ./waf configure --enable-debug --enable-debug-gdb --enable-warnings \
        --refclock=all --enable-doc --enable-seccomp && \
    ./waf build && \
    ./waf install

RGDS
GARY
---------------------------------------------------------------------------
Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97703
	gem at rellim.com  Tel:+1 541 382 8588

	    Veritas liberabit vos. -- Quid est veritas?
    "If you can't measure it, you can't improve it." - Lord Kelvin