Replay progress report
esr at thyrsus.com
Mon Jan 4 17:19:27 UTC 2016
Hal Murray <hmurray at megapathdsl.net>:
> esr at thyrsus.com said:
> > I considered doing it that way. The problem is that if you multithread the
> > DNS lookups, the order in which the capture events land is no longer
> > deterministic. At that point the minimum complexity required of the replay
> > interpreter *explodes*. I chose to dodge that lest I find myself down a
> > rathole.
> Let me try again.
> If you want replay to handle the pool case, then you have to handle the
> asynchronous DNS case. I think the pool case is important, but it can wait
> until you get a simple case working.
Agreed on both counts.
> I think the asynchronous DNS is just like processing packets. The idea is to
> work at the edges of the main thread and ignore the DNS thread. (Intercept
> might have to ignore logging from the DNS thread. Daniel added a lock in the
> logging a while ago. It was discovered due to mangled logging so there is
> some logging from the DNS thread.)
> Recording the DNS request is similar to recording transmit packets. Replay
> compares the current request with the next line in the log file.
> Processing the DNS result is like processing a receive packet. Record has to
> intercept where the main thread pulls the result off the queue. When replay
> finds a DNS result in the log file, it has to do whatever the main thread
> does with the result.
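The record/replay symmetry Hal describes above can be sketched roughly like this (all names here are hypothetical, not actual ntpd code): a DNS request is checked against the next log line the way a transmit packet is, and a logged DNS result is applied the way the main thread would apply one pulled off the queue.

```python
# Hypothetical sketch of the record/replay symmetry described above.
# Event names and field layout are illustrative, not the real capture format.

def replay_dns_request(logline, hostname):
    """Replay side of a recorded DNS request: the lookup we are about
    to issue must match the next line in the capture log."""
    fields = logline.split()
    return fields[0] == "dns_request" and fields[1] == hostname

def apply_dns_result(logline, peers):
    """Replay side of a recorded DNS result: do what the main thread
    does when it pulls a result off the queue -- here, add a peer."""
    _event, hostname, address = logline.split()[:3]
    peers[hostname] = address
    return peers
```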
All this is correct. Furthermore I know what the event recorded when
the lookup returns would look like. This is a sample:
getaddrinfo 0.ubuntu.pool.ntp.org 184.108.40.206 1
But because the asynchronous processing has an unpredictable delay, you
cannot predict where that event will land in the capture file relative
to other events.
Apparently I failed to make clear why this is a problem.
The way that the replay mode presently works is dead simple. Before
the main receive loop (captured by the intercept_replay() function),
each intercept function pops a line off the logfile and checks to see
if its event type matches what is expected at this point - for
example, a call to intercept_drift_read() wants to see a drift_read
event. During the main loop, either of two events (sendpkt or
receive) is accepted and interpreted.
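The lockstep check described above is roughly this shape (a sketch with invented names, not the actual intercept code): each intercept function pops the next capture event and bails out unless its type matches.

```python
# Sketch of the lockstep replay check: pop the next event from the
# capture log; if its type is not what this intercept point expects,
# the replay has diverged and we fail. Names are hypothetical.

class ReplayMismatch(Exception):
    pass

class ReplayLog:
    def __init__(self, lines):
        self.lines = list(lines)

    def expect(self, event_type):
        """Pop the next capture event; fail unless it matches event_type.
        Returns the event's remaining fields."""
        if not self.lines:
            raise ReplayMismatch("capture exhausted, wanted " + event_type)
        fields = self.lines.pop(0).split()
        if fields[0] != event_type:
            raise ReplayMismatch("expected %s, saw %s" % (event_type, fields[0]))
        return fields[1:]
```

So a drift-read intercept would amount to `log.expect("drift_read")`, and the whole scheme works only if the calls arrive in capture order.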
This scheme depends on the assumption that, before the main loop, intercept
functions will be called in the exact same order during replay that
they were during capture. There's no separate replay interpreter
dispatching events itself, so there is no way to handle an event that
lands at an unpredictable point in the sequence.
I part-fixed this problem by arranging for getaddrinfo calls triggered
by the command line and the config file to be done synchronously in capture
and replay mode. This means they land at a fixed place in the capture
file (just after the config file read) and will be interpreted by the
replay code recapitulating that control flow. You are correct that this
does not address pool lookups.
During the main loop we have a little more slack; I could add a
getaddrinfo event type and have it update the peer list. The problem
is that we have no guarantee the getaddrinfo callback won't fire (recording
the event) too soon.
There are some ways to get around this. The simplest would be for the
intercept layer to store replies from the getaddrinfo callback but
defer recording them until intercept_replay() begins. But "simplest"
would still drive the complexity of the replay interpreter way up,
which is not a good thing when I'm still trying to debug operation
of the simplest possible replay case.
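The deferral idea amounts to something like this (hypothetical names; a sketch of the approach, not proposed code): the intercept layer stashes getaddrinfo callback results as they arrive from the DNS thread, and only emits the corresponding capture events at the top of the main receive loop, so they land at a deterministic place.

```python
# Sketch of deferring getaddrinfo results: buffer them on arrival,
# flush the buffer when intercept_replay() begins. Names hypothetical.

class DeferredDNS:
    def __init__(self):
        self.pending = []

    def on_getaddrinfo(self, hostname, address, flags):
        # Called from the DNS thread at an unpredictable time;
        # stash the result instead of writing a capture event now.
        self.pending.append("getaddrinfo %s %s %d" % (hostname, address, flags))

    def flush(self, capture):
        # Called at the top of the main receive loop: emit buffered
        # events in arrival order, then clear the buffer.
        capture.extend(self.pending)
        self.pending = []
```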
Right now I'm trying to pursue the simplest path to getting one replay
case working, and it's still damn difficult. Once I have a proof of
concept I will be able to afford doing things in trickier but more
general ways.
--
Eric S. Raymond <http://www.catb.org/~esr/>