Request for code & logic review

Fri Nov 25 00:21:00 UTC 2016

On 11/24/16, Eric S. Raymond <esr at thyrsus.com> wrote:
> The Python port of ntpdig is almost ready to land.  But there is one last
> little bit of it I'm not sure I understand correctly.  I'm requesting
> review
> of my code and assumptions.
>
> Presently the adjustment and synch distance are calculated this way:
>
>     def delta(self):
>         return self.root_delay
>     def epsilon(self):
>         return self.root_dispersion
>     def synchd(self):
>         "Synchronization distance, estimates worst-case error in seconds"
>         # This is "lambda" in NTP-speak, but that's a Python keyword
>         return abs(self.delta() - self.epsilon())
>     def adjust(self):
>         "Adjustment implied by this packet."
>         return self.received - self.transmit_timestamp
>
> where self.received is the receipt time of the response packet.  I do
> it this way for no better reason than that some examples of Python SNTP
> clients I found on the web use this formula.  But the C ntpdig code
> suggests this is wrong.
>
> Here's the C for the offset/synch-distance calculation:
>
> --------------------------------------------------------------------------------
> 	/* Convert timestamps from network to host byte order */
> 	p_rdly = NTOHS_FP(rpkt->rootdelay);
> 	p_rdsp = NTOHS_FP(rpkt->rootdisp);
> 	NTOHL_FP(&rpkt->reftime, &p_ref);
> 	NTOHL_FP(&rpkt->org, &p_org);
> 	NTOHL_FP(&rpkt->rec, &p_rec);
> 	NTOHL_FP(&rpkt->xmt, &p_xmt);
>
> 	*precision = LOGTOD(rpkt->precision);
>
> 	TRACE(3, ("offset_calculation: LOGTOD(rpkt->precision): %f\n",
> *precision));
>
> 	/* Compute offset etc. */
> 	tmp = p_rec;
> 	L_SUB(&tmp, &p_org);
> 	LFPTOD(&tmp, t21);
> 	TVTOTS(tv_dst, &dst);
> 	dst.l_ui += JAN_1970;
> 	tmp = p_xmt;
> 	L_SUB(&tmp, &dst);
> 	LFPTOD(&tmp, t34);
> 	*offset = (t21 + t34) / 2.;
> 	delta = t21 - t34;
>
> 	// synch_distance is:
> 	// (peer->delay + peer->rootdelay) / 2 + peer->disp
> 	// + peer->rootdisp + clock_phi * (current_time - peer->update)
> 	// + peer->jitter;
> 	//
> 	// and peer->delay = fabs(peer->offset - p_offset) * 2;
> 	// and peer->offset needs history, so we're left with
> 	// p_offset = (t21 + t34) / 2.;
> 	// peer->disp = 0; (we have no history to augment this)
> 	// clock_phi = 15e-6;
> 	// peer->jitter = LOGTOD(sys_precision); (we have no history to augment
> this)
> 	// and ntp_proto.c:set_sys_tick_precision() should get us sys_precision.
> 	//
> 	// so our answer seems to be:
> 	//
> 	// (fabs(t21 + t34) + peer->rootdelay) / 3.
> 	// + 0 (peer->disp)
> 	// + peer->rootdisp
> 	// + 15e-6 (clock_phi)
> 	// + LOGTOD(sys_precision)
>
> 	INSIST( FPTOD(p_rdly) >= 0. );
> #if 1
> 	*synch_distance = (fabs(t21 + t34) + FPTOD(p_rdly)) / 3.
> 		+ 0.
> 		+ FPTOD(p_rdsp)
> 		+ 15e-6
> 		+ 0.	/* LOGTOD(sys_precision) when we can get it */
> 		;
> 	INSIST( *synch_distance >= 0. );
> #else
> 	*synch_distance = (FPTOD(p_rdly) + FPTOD(p_rdsp))/2.0;
> #endif
> --------------------------------------------------------------------------------
>
> That is rather nasty and in my opinion sufficient reason to shoot the C
> version through the head.  Now I'm going to simplify it into pseudocode,
> ignoring issues about endianness and timestamp epochs.
>
> --------------------------------------------------------------------------------
> 	precision = log2(rpkt->precision);
>
> 	/* Compute offset etc. */
> 	t21 = rpkt->rec - rpkt->org
> 	t34 = now - rpkt->xmt
> 	offset = (t21 + t34) / 2.;
>
> #if 1
> 	synch_distance = (fabs(t21 + t34) + rpkt->rdly) / 3.
> 		+ 0.
> 		+ rpkt->rdsp
> 		+ 15e-6
> 		+ 0.	/* LOGTOD(sys_precision) when we can get it */
> 		;
> #else
> 	synch_distance = (pkt->rdly + pkt->p_rdsp)/2.0;
> #endif
> --------------------------------------------------------------------------------
>
> I've omitted the delta calculation because it's only used in a diagnostic;
> offset is what is applied to the local clock.
>
> Note that the synch distance formula that is conditioned out is the one
> I'm using.  Somebody must have thought it was correct at one time.  But if
> I'm to believe the C above, the Python should read thus:
>
>     def synchd(self):
>         "Synchronization distance, estimates worst-case error in seconds"
>         (self.receive_timestamp - self.origin_timestamp) \
> 		+ (now - self.transmit_timestamp) + self.rootdelay) / 3 \
> 		+ self.root_dispersion + 15e-6
>     def adjust(self):
>         "Adjustment implied by this packet."
>         return ((self.receive_timestamp - self.origin_timestamp)
> 		+ (now - self.transmit_timestamp)) / 2
>
> My requests to Daniel and anyone else who might understand these formulas:
>
> (1) Check my logic and C translation.  A bug here would be bad.
>
> (2) What is the authority for these formulas?  What RFC chapter and
>     verse can we cite?

Neither the old nor new version looks correct to me. If you want to
see a correct and readable C implementation of this logic then see my
refactored implementation in ntp_proto.c:592-605. The authority here
is RFC5905, Section 10, but let me spare you from hacking through that
and give you an explanation you'll follow and remember.

For starters, we have our four timestamps:

t_1, the origin timestamp, is the time according to the client at
which the request was sent.
t_2, the transmit timestamp, is the time according to the server at
which the request was received.
t_3, the receive timestamp, is the time according to the server at
which the reply was sent.
t_4, the destination timestamp, is the time according to the client at
which the reply was received.

Theta is the thing we want to estimate: the offset between the server
clock and the client clock. The sign convention is that theta is
positive iff the server is ahead of the client.

Theta is estimated by [(t_2-t_1)+(t_3-t_4)]/2. The accuracy of this
estimate is predicated upon network latency being symmetrical. I've
yet to come up with a snappy description of why this formula is
correct, but just try plugging in some numbers and you'll get the
idea.

Delta is the network round trip time, i.e. (t_4-t_1)-(t_3-t_2). This
one is easier to explain: (t_4-t_1) is the total time that the request
was in flight, and (t_3-t_2) is time that the server spent processing
it; when you subtract that out you're left with just network delays.

Lambda nominally represents the maximum amount by which theta could be
off. It's computed as delta/2 + epsilon. The delta/2 term usually
dominates and represents the maximum amount by which network asymmetry
could be throwing off the calculation. Epsilon is the sum of three
other sources of error:

rho_r: the (im)precision field from response packet, representing the
server's inherent error in clock measurement.
rho_s: the client's own (im)precision.
PHI*(t_4-t_1): The amount by which the client's clock may plausibly
have drifted while the packet was in flight. PHI is taken to be a
constant of 15ppm.

rho_r and rho_s are estimated by making back-to-back calls to
clock_gettime() (or similar) and taking their difference. They're
encoded on the wire as an eight-bit two's complement integer
representing, to the nearest integer, log_2 of the value in seconds.

And that's pretty much all there is to know. Go try again and I'll
review your next pass.