Apparent protocol-machine bug, new top priority

Sun Aug 27 14:28:18 UTC 2017

I wrote:
>If this is happening with iburst *off*, it becomes more difficult to
>understand how the rate limit is being triggered.  I think maybe we
>should start by focusing on something else: why is hpoll not
>recovering after a KOD?
>
>I'm thinking this sounds like some KOD-recovery logic got lost during
>the refactor.

Trying to trace how things go bad.  Looks to me like this piece of
logic down around line 592, processing a KOD, sets minpoll high:

	if(is_kod(pkt)) {
		if(!memcmp(pkt->refid, "RATE", REFIDLEN)) {
			peer->selbroken++;
			report_event(PEVNT_RATE, peer, NULL);
			if (peer->minpoll < 10) { peer->minpoll = 10; }
			peer->burst = peer->retry = 0;
			peer->throttle = (NTP_SHIFT + 1) * (1 << peer->minpoll);
			poll_update(peer, 10);
		}
		return;
	}

Then poll_update sets hpoll to 10.  Achim seems to be reporting that
it stays stuck there.  Now I look at this:

void
poll_update(
	struct peer *peer,	/* peer structure pointer */
	uint8_t	mpoll
	)
{
	unsigned long	next, utemp;
	uint8_t	hpoll;

	/*
	 * This routine figures out when the next poll should be sent.
	 * That turns out to be wickedly complicated. One problem is
	 * that sometimes the time for the next poll is in the past when
	 * the poll interval is reduced. We watch out for races here
	 * between the receive process and the poll process.
	 *
	 * Clamp the poll interval between minpoll and maxpoll.
	 */
	hpoll = max(min(peer->maxpoll, mpoll), peer->minpoll);

	peer->hpoll = hpoll;

This means that hpoll can never be set lower than minpoll. Which means
there will never be any recovery from the KOD rate limit, no matter
what values poll_update() is called with, unless minpoll is lowered.

But this never happens.

ntp_peer.c:721:	peer->minpoll = min(minpoll, NTP_MAXPOLL);
ntp_peer.c:724:		peer->minpoll = peer->maxpoll;
ntp_proto.c:596:			if (peer->minpoll < 10) { peer->minpoll = 10; }
refclock_jjy.c:2788:		peer->minpoll = 8 ;
refclock_oncore.c:621:	peer->minpoll = 4;
refclock_trimble.c:469:	peer->minpoll = TRMB_MINPOLL;

The ntp_peer.c hits are during new-peer initialization. The refclock hits
are irrelevant; we're troubleshooting the code path for NTP peers.  My
deduction is that ntp_proto.c:596 is probably wrong: it's disabling
the normal poll interval hysteresis (which I admit I only vaguely
understand).
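
To make the dead end concrete, here is a minimal sketch of the two pieces
quoted above, cut down to the poll fields.  Everything with a _sketch
suffix, plus u8min/u8max, KOD_MINPOLL and lowest_hpoll_after_kod, is a
made-up name for illustration, not code from the tree; the mpoll range of
0..17 is likewise just an arbitrary sweep.

```c
#include <assert.h>
#include <stdint.h>

#define KOD_MINPOLL 10	/* the floor hard-coded at ntp_proto.c:596 */

/* Hypothetical stand-in for struct peer: only the fields discussed. */
struct peer_sketch {
	uint8_t minpoll;
	uint8_t maxpoll;
	uint8_t hpoll;
};

static uint8_t u8min(uint8_t a, uint8_t b) { return a < b ? a : b; }
static uint8_t u8max(uint8_t a, uint8_t b) { return a > b ? a : b; }

/* Same clamp as the first statement of poll_update(). */
static void
poll_update_sketch(struct peer_sketch *peer, uint8_t mpoll)
{
	peer->hpoll = u8max(u8min(peer->maxpoll, mpoll), peer->minpoll);
}

/* The minpoll bump and poll_update() call from the RATE branch. */
static void
rate_kod_sketch(struct peer_sketch *peer)
{
	if (peer->minpoll < KOD_MINPOLL) { peer->minpoll = KOD_MINPOLL; }
	poll_update_sketch(peer, KOD_MINPOLL);
}

/* Lowest hpoll reachable after one RATE KOD, trying every mpoll 0..17. */
static uint8_t
lowest_hpoll_after_kod(uint8_t minpoll, uint8_t maxpoll)
{
	struct peer_sketch peer = { minpoll, maxpoll, minpoll };
	uint8_t lowest = UINT8_MAX;
	unsigned mpoll;

	assert(minpoll <= maxpoll);
	rate_kod_sketch(&peer);
	for (mpoll = 0; mpoll <= 17; mpoll++) {
		poll_update_sketch(&peer, (uint8_t)mpoll);
		if (peer.hpoll < lowest)
			lowest = peer.hpoll;
	}
	return lowest;
}
```

With minpoll 6 and maxpoll 10, lowest_hpoll_after_kod(6, 10) comes back
as 10: once the KOD has raised minpoll, no argument to the clamp can pull
hpoll back under it.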

But the problem may be deeper than that.  The corresponding code in
Classic is this:

	/*
	 * Check to see if this is a RATE Kiss Code
	 * Currently this kiss code will accept whatever poll
	 * rate that the server sends
	 */
	peer->ppoll = max(peer->minpoll, pkt->ppoll);
	if (kissCode == RATEKISS) {
		peer->selbroken++;	/* Increment the KoD count */
		report_event(PEVNT_RATE, peer, NULL);
		if (pkt->ppoll > peer->minpoll)
			peer->minpoll = peer->ppoll;
		peer->burst = peer->retry = 0;
		peer->throttle = (NTP_SHIFT + 1) * (1 << peer->minpoll);
		poll_update(peer, pkt->ppoll);
		return;				/* kiss-o'-death */
	}

I see that our line 596 is a replacement for allowing the KOD packet
to set the poll rate.  That makes all kinds of sense, as a spoofed KOD
packet with a maliciously high poll interval is an obvious DoS
vector. (See, Daniel? I are learning to think like an InfoSec
paranoid.)
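
A rough sketch of that DoS vector, using only the Classic logic quoted
above.  The names (classic_peer_sketch, classic_rate_kod_sketch,
hpoll_after_classic_kod, poll_min, poll_max) are invented for the
illustration, and I am assuming Classic's poll_update() applies the same
minpoll/maxpoll clamp as the version quoted earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct peer: only the poll fields. */
struct classic_peer_sketch {
	uint8_t minpoll;
	uint8_t maxpoll;
	uint8_t ppoll;
	uint8_t hpoll;
};

static uint8_t poll_min(uint8_t a, uint8_t b) { return a < b ? a : b; }
static uint8_t poll_max(uint8_t a, uint8_t b) { return a > b ? a : b; }

/* Classic RATE handling reduced to its effect on the poll fields. */
static void
classic_rate_kod_sketch(struct classic_peer_sketch *peer, uint8_t pkt_ppoll)
{
	peer->ppoll = poll_max(peer->minpoll, pkt_ppoll);
	if (pkt_ppoll > peer->minpoll)
		peer->minpoll = peer->ppoll;	/* packet-controlled! */
	/* Assumed: same clamp as the poll_update() quoted earlier. */
	peer->hpoll = poll_max(poll_min(peer->maxpoll, pkt_ppoll),
			       peer->minpoll);
}

/* hpoll after a single RATE KOD advertising pkt_ppoll. */
static uint8_t
hpoll_after_classic_kod(uint8_t minpoll, uint8_t maxpoll, uint8_t pkt_ppoll)
{
	struct classic_peer_sketch peer = { minpoll, maxpoll, minpoll, minpoll };

	assert(minpoll <= maxpoll);
	classic_rate_kod_sketch(&peer, pkt_ppoll);
	return peer.hpoll;
}
```

Under those assumptions hpoll_after_classic_kod(6, 10, 17) yields 17: the
packet's ppoll lands in minpoll, and because the clamp applies minpoll
last, it overrides a maxpoll of 10.  That is one poll every 2^17 seconds,
a bit over 36 hours, from a single forged packet.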

Unfortunately for this neat theory, the corresponding grep hits in
Classic are:

ntp_peer.c:857:		peer->minpoll = NTP_MINDPOLL;
ntp_peer.c:859:		peer->minpoll = min(minpoll, NTP_MAXPOLL);
ntp_peer.c:865:		peer->minpoll = peer->maxpoll;
ntp_proto.c:1589:			peer->minpoll = peer->ppoll;

Again, the ntp_peer.c hits are during newpeer initialization.  That
is, I can't find any way that minpoll recovers after a KOD in
Classic, either.

What am I missing here?
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

Rifles, muskets, long-bows and hand-grenades are inherently democratic
weapons.  A complex weapon makes the strong stronger, while a simple
weapon -- so long as there is no answer to it -- gives claws to the
weak.
        -- George Orwell, "You and the Atom Bomb", 1945