Use of pool servers reveals unacceptable crash rate in async DNS

Sat Jun 25 22:13:56 UTC 2016

Hal Murray <hmurray at megapathdsl.net>:
> 
> esr at thyrsus.com said:
> > 1. Apply Classic's workaround for the problem, which I don't remember the
> > details of but involved some dodgy nonstandard linker hacks done through the
> > build system.  *However, I did not trust this method when I understood it.*
> > It seemed sure to cause porting difficulties and is inherently fragile. 
> 
> kurt at roeckx.be said:
> > If it's the one I'm thinking about, I think the solution is to remove the
> > locking of memory. 
> 
> We may be confusing several bugs.
> 
> There was a problem with locking stuff into memory.  Some library needed by 
> end of thread processing wasn't loaded yet and things worked out such that 
> with the default memory 32 bit systems worked but 64 bit systems didn't have 
> enough room.
> 
> I think one solution was to create a dummy thread early on to get that module 
> loaded.  Or disable memory locking, or tell it to use more memory, or ...

This matches what I remember, except for "use more memory". A third
workaround involved weird linker options to force early loading of the library.
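
For illustration, the dummy-thread workaround Hal describes could look roughly
like the sketch below. This is not the Classic code. The function names
(warmup_worker, preload_thread_runtime, lock_memory_after_warmup) are
hypothetical, and it assumes the library in question is one the C library
loads lazily the first time a thread terminates (often reported to be
libgcc_s on glibc, though nobody in this thread has named it). Only standard
POSIX calls are used.

```c
/* Hypothetical sketch: warm up the threading runtime before locking memory.
 * Names are illustrative, not taken from any NTP source tree. */
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

/* Thread body that terminates at once via pthread_exit(), so anything the
 * C library loads lazily for thread termination is loaded now (assumption:
 * the lazy load is triggered by thread exit). */
static void *warmup_worker(void *arg)
{
    (void)arg;
    pthread_exit(NULL);
    return NULL; /* not reached */
}

/* Start and join one throwaway thread.
 * Returns 0 on success, otherwise a pthread error number. */
int preload_thread_runtime(void)
{
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, warmup_worker, NULL);
    if (rc != 0)
        return rc;
    return pthread_join(tid, NULL);
}

/* Warm up first, then lock current and future pages.
 * Returns 0 on success, a pthread error number if the warm-up failed,
 * or -1 with errno set if mlockall() failed (for example, for lack of
 * privilege or a low RLIMIT_MEMLOCK). */
int lock_memory_after_warmup(void)
{
    int rc = preload_thread_runtime();
    if (rc != 0)
        return rc;
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}
```

The ordering is the whole point: the throwaway thread runs before mlockall(),
so the late library load no longer has to fit under the locked-memory limit.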

> > 2. Fix the actual problem. Well, that'd be nice, but Hal looked into it
> > months ago and said he understood it but couldn't generate a fix. IIRC, he
> > said it needed a full rewrite.  That tells me the code is probably not
> > salvageable. 
> 
> I don't remember that part.  I use the pool command on several systems.  I 
> haven't seen a crash in ages.

It's very sporadic.  I went for months without seeing it at all.  Then last
night, when I was running smoke tests for my changes to remove magic address
prefixes, I was seeing it every three or four minutes.

I see I somehow didn't complete my explanation to Kurt.  I wrote "However I didn"
and got distracted by a ringing phone.  How that should have finished:

However I didn't commit that change and some later reset backed it out.
I haven't seen the problem today.

> There was another interesting problem in this area.  It was a bug in 
> FreeBSD's trap handler.  ntpd managed to trigger it consistently.

I'm running under Linux, though.

> I favor understanding things more.
> 
> Can you get a stack trace?

Saw one last night, as described to Kurt.  Now it's not reproducing.

I guess the default for memory locking has to change to "off".  We
can't ship something with that kind of random instability.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>