Use of pool servers reveals unacceptable crash rate in async DNS
hmurray at megapathdsl.net
Mon Jun 27 01:11:03 UTC 2016
esr at thyrsus.com said:
>> We could try simplifying things to only supporting lock-everything-I-need
>> rather than specifying how much. There might be a slippery slope if
>> something like a thread stack needs a sane size specified.
> I'm not intimate with mlockall, but it looks like it works that way now.
There is a back door way to specify a limit. Part of it is the total amount
to lock; part of it is the stack size for new threads.
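A sketch of that back door, assuming Linux/POSIX semantics (the function name
is mine, not ntpd's): mlockall(2) with MCL_FUTURE wires future allocations
too, so capping the stack given to new threads bounds how much ends up locked.

```c
#include <sys/mman.h>
#include <pthread.h>
#include <stddef.h>
#include <errno.h>

/* Lock current and future pages; cap new-thread stacks so MCL_FUTURE
   doesn't wire megabytes of default stack per thread.
   Illustrative sketch only, not ntpd code. */
int lock_all_with_small_stacks(size_t stack_bytes, size_t *applied)
{
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
		/* EPERM/ENOMEM just mean we lack privilege (CAP_IPC_LOCK
		   or a raised RLIMIT_MEMLOCK); carry on unlocked. */
		if (errno != EPERM && errno != ENOMEM)
			return -1;
	}

	pthread_attr_t attr;
	if (pthread_attr_init(&attr) != 0)
		return -1;
	if (pthread_attr_setstacksize(&attr, stack_bytes) != 0) {
		pthread_attr_destroy(&attr);
		return -1;
	}
	/* Report back the size the attr actually carries; pass this attr
	   to pthread_create() for every new thread. */
	pthread_attr_getstacksize(&attr, applied);
	pthread_attr_destroy(&attr);
	return 0;
}
```

The stack size must be at least PTHREAD_STACK_MIN; 128 KiB is comfortably
above that on common platforms.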
[way to count page faults]
> I don't know. I can do some research, but I'm not sure "enough page faults
> to merit memory locking" would be a well-defined threshold even if I knew
> how to count them.
If the answer was 0 then we wouldn't have to discuss the threshold.
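For the counting part, getrusage(2) already separates major faults (the ones
that had to touch disk, i.e. the ones locking would prevent) from minor ones;
a minimal sketch:

```c
#include <sys/resource.h>

/* Major page faults (required I/O) accumulated by this process so far.
   If this stays at 0, there's nothing for mlockall to buy us. */
long major_faults_so_far(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0)
		return -1;
	return ru.ru_majflt;
}
```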
> I believe you're right that these platforms don't have it. The question is,
> how important is that fact? Is the performance hit from synchronous DNS
> really a showstopper? I don't know the answer.
There are two cases I know of where ntpd does a DNS lookup after it gets
started.
One is retrying when the DNS lookup for a normal server fails during
initialization. It will retry occasionally until it gets an answer
(which might be negative).
The main one is the pool code trying for a new server. I think we should be
extending this rather than dropping it. There are several possibilities in
this area. The main one would be verifying that a server you are using is
still in the pool. (There isn't a way to do that yet - the pool doesn't have
any DNS support for it.) The other would be replacing the poorest server
rather than only replacing dead servers.
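A rough sketch of what such a check might look like (still_in_pool is a
hypothetical name, nothing like it exists in ntpd; and since the pool rotates
its DNS answers, an address missing from one lookup is inconclusive - which is
exactly the missing pool-side support):

```c
#include <netdb.h>
#include <string.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Re-resolve the pool name and see whether addr is among the answers.
   Hypothetical sketch; a real check would need pool-side support
   because one rotated answer set proves little. */
bool still_in_pool(const char *pool_name, const struct in_addr *addr)
{
	struct addrinfo hints, *res, *ai;
	bool found = false;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_INET;
	hints.ai_socktype = SOCK_DGRAM;
	if (getaddrinfo(pool_name, "123", &hints, &res) != 0)
		return true;  /* lookup failed: don't drop a server on a DNS glitch */

	for (ai = res; ai != NULL; ai = ai->ai_next) {
		struct sockaddr_in *sin = (struct sockaddr_in *)ai->ai_addr;
		if (sin->sin_addr.s_addr == addr->s_addr) {
			found = true;
			break;
		}
	}
	freeaddrinfo(res);
	return found;
}
```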
DNS lookups can take a LONG time. I think I've seen 40 seconds on a failing
lookup.
If we get the recv time stamp from the OS, I think the DNS delays won't
introduce any lies on the normal path. We could test that by putting a sleep
in the main loop. (There is a filter to reject packets that take too long,
but I think that's time-in-flight and excludes time sitting on the server.)
There are two cases I can think of where a pause in ntpd would cause
troubles. One is that it would mess up refclocks. The other is that packets
will get dropped if too many of them arrive.
I think that means we could use the pool command on a system without
refclocks. That covers end nodes and maybe lightly loaded servers.
It's worth checking out the input buffering side of things. There may be
some code there that we don't need. I think there is a pool of buffers.
Where can a buffer sit other than on the free queue? Why do we need a pool?
These are my opinions. I hate spam.