Is asynchronous DNS lookup worth keeping at all?

Wed Dec 2 14:03:52 UTC 2015

Hal Murray <hmurray at megapathdsl.net>:
> >> There is a lot of interest in getting servers restarted quickly.  Telling 
> >> all those users they can't use any non-local server names seems unwise.
> 
> > That's not the implication.  If we removed asynchronous lookup, they'd
> > only incur an additional initial start time cost for using *more than
> > one* named server.  The cost for using just one wouldn't change.
> 
> I can't figure out how you are thinking ntpd works.  I'm not sure I know how
> it does work.

We need to have this conversation so I can get the tour document right. :-)

Here's what I think is going on in the 1-hostname case with asynch DNS:

1. ntpd starts up.
2. getaddrinfo_sometime() is called during argument or config parsing.
   It spawns one worker thread.
3. Main receive loop, where it's iterating over all UDP sockets looking for
   incoming packets, begins in the main thread. No sockets are open.
4. Sometime later, the callback passed to getaddrinfo_sometime is called and the
   numeric IP for the hostname becomes available. A UDP socket is activated.
   Protocol engine ships the first query to the server (or pool whatsis)
   at the other end of the socket.
5. Sometime after this, enough responses come in for time sync.

If DNS lookup is synchronous, the sequence is different.

1. ntpd starts up.
2. getaddrinfo() is called during argument or config parsing.  It blocks
   until it returns a numeric IP. The UDP socket is created immediately.
   The socket list is nonempty. 
3. Main receive loop, where it's iterating over all UDP sockets looking for
   incoming packets, begins. Because the socket list is nonempty
   the protocol engine ships the first query immediately.
4. Sometime after this, enough responses come in for time sync.

The point is, in both cases no query can be shipped to the server
until getaddrinfo() returns a result.  That delay is independent
of whether the lookup runs in its own thread or the main thread.
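
To make that concrete, here is a minimal sketch in Python rather than
ntpd's C. The helper names resolve_blocking and resolve_sometime are
hypothetical, invented for illustration; they are not ntpd functions.
Both paths end up in the same blocking getaddrinfo() call, and the only
thing that changes is which thread sits waiting for it.

```python
import socket
import threading

def resolve_blocking(host, port=123):
    """Synchronous path: the caller waits until the address is known."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_DGRAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the numeric IP as a string.
    return infos[0][4][0]

def resolve_sometime(host, callback, port=123):
    """Asynchronous path: a worker thread makes the same blocking call,
    then hands the result to the callback (hypothetical helper)."""
    def worker():
        callback(host, resolve_blocking(host, port))
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

Either way, nothing can be sent to the server before resolve_blocking()
returns; the worker thread only frees the main thread to do other work
in the meantime.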

Now suppose there are two hostnames. Asynchronous:

1. ntpd starts up.
2. getaddrinfo_sometime() is called twice during argument or config parsing.
   It launches two worker threads.
3. Main receive loop, where it's iterating over all UDP sockets looking for
   incoming packets, begins in the main thread. No sockets are open.
4. Sometime later, the callback passed to getaddrinfo_sometime is called and the
   numeric IP for one hostname becomes available. A UDP socket is opened
   and the protocol engine ships a query to the server (or pool whatsis)
   at the other end of the association.
5. Sometime after this, enough responses come in for time sync. The minimum
   delay was bounded below by the time for the first DNS-lookup callback to
   fire.
6. Sometime after this, the callback for the *second* lookup fires.  Another
   UDP socket is added to the list and another query shipped. At this point
   we might already have time sync from the first server.

Synchronous:

1. ntpd starts up.
2. getaddrinfo() is called twice during argument or config parsing.  The calls
   block sequentially until they return two numeric IPs. The UDP sockets are
   created immediately; we have paid the latency cost for both calls
   rather than just one. The socket list is nonempty. 
3. Main receive loop, where it's iterating over all UDP sockets looking for
   incoming packets, begins. Because the socket list is nonempty
   the protocol engine ships the first two queries immediately.
4. Sometime after this, enough responses come in for time sync.

If I understand all this correctly (and maybe I don't), the latency
cost of both cases is bounded below by the cost of the first DNS lookup.
The difference is that in the synchronous case you also have to wait for
the *second* lookup.
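
As a back-of-the-envelope model of the above, assuming my picture of the
startup sequence is right, the time until the first query can go out is
the fastest lookup in the asynchronous case and the sum of all lookups
in the synchronous case. The function name and the latencies below are
made up for illustration.

```python
def time_to_first_query(lookup_latencies_ms, asynchronous):
    """Milliseconds from startup until the first query can be shipped,
    under the simplified model sketched in this message."""
    if not lookup_latencies_ms:
        raise ValueError("need at least one hostname")
    if asynchronous:
        # Lookups run in parallel worker threads; the first callback
        # to fire opens the first socket.
        return min(lookup_latencies_ms)
    # Blocking lookups run back to back during config parsing, and the
    # receive loop does not start until every one has returned.
    return sum(lookup_latencies_ms)

one_host = [300]         # hypothetical lookup latency, in ms
two_hosts = [300, 500]   # hypothetical lookup latencies, in ms

print(time_to_first_query(one_host, True))    # 300
print(time_to_first_query(one_host, False))   # 300
print(time_to_first_query(two_hosts, True))   # 300
print(time_to_first_query(two_hosts, False))  # 800
```

With one hostname the two modes come out the same; the gap only opens
up once there is more than one name to resolve.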
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>