Use of pool servers reveals unacceptable crash rate in async DNS
Eric S. Raymond
esr at thyrsus.com
Sat Jun 25 22:13:56 UTC 2016
Hal Murray <hmurray at megapathdsl.net>:
> esr at thyrsus.com said:
> > 1. Apply Classic's workaround for the problem, which I don't remember the
> > details of but involved some dodgy nonstandard linker hacks done through the
> > build system. *However, I did not trust this method when I understood it.*
> > It seemed sure to cause porting difficulties and is inherently fragile.
> kurt at roeckx.be said:
> > If it's the one I'm thinking about, I think the solution is to remove the
> > locking of memory.
> We may be confusing several bugs.
> There was a problem with locking stuff into memory. Some library needed by
> end of thread processing wasn't loaded yet and things worked out such that
> with the default memory 32 bit systems worked but 64 bit systems didn't have
> enough room.
> I think one solution was to create a dummy thread early on to get that module
> loaded. Or disable memory locking, or tell it to use more memory, or ...
This matches what I remember, except for "use more memory". A third
workaround involved weird linker options to force early loading of the library.
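
For concreteness, the dummy-thread workaround Hal mentions could look roughly
like the sketch below. This is not the code from Classic or from our tree. The
helper names preload_thread_support() and lock_memory_after_preload() are made
up for illustration, and it assumes (as I understand glibc) that pthread_exit()
is what triggers the lazy load of the library needed at thread exit.

```c
#include <pthread.h>
#include <sys/mman.h>

/*
 * Thread body that does nothing useful. It exits through pthread_exit()
 * on purpose: the assumption here is that this call is what makes the C
 * library load its thread-exit support code on first use.
 */
static void *noop_thread(void *arg)
{
    pthread_exit(arg);
    return NULL; /* not reached */
}

/*
 * Hypothetical helper: run one short-lived thread to completion so that
 * anything loaded lazily for thread startup and exit is already mapped.
 * Returns 0 on success or a pthread error number.
 */
int preload_thread_support(void)
{
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, noop_thread, NULL);
    if (rc != 0)
        return rc;
    return pthread_join(tid, NULL);
}

/*
 * Hypothetical helper: warm up threading first, then lock memory.
 * Returns 0 on success, -1 if mlockall() failed (errno is set), or a
 * positive pthread error number if the warm-up thread failed.
 */
int lock_memory_after_preload(void)
{
    int rc = preload_thread_support();
    if (rc != 0)
        return rc;
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}
```

The ordering is the whole point: the throwaway thread runs before mlockall(),
so the later DNS worker thread should not have to map anything new at exit. It
does nothing about the locked-memory limit itself, so mlockall() can still fail
and the caller has to decide whether to carry on unlocked.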
> > 2. Fix the actual problem. Well, that'd be nice, but Hal looked into it
> > months ago and said he understood it but couldn't generate a fix. IIRC, he
> > said it needed a full rewrite. That tells me the code is probably not
> > salvageable.
> I don't remember that part. I use the pool command on several systems. I
> haven't seen a crash in ages.
It's very sporadic. I went for months without seeing it at all. Then last night
when I was running smoke tests for my changes to remove magic address prefixes
I was seeing it every three or four minutes.
I see I somehow didn't complete my explanation to Kurt. I wrote "However I didn"
and got distracted by a ringing phone. How that should have finished:
However I didn't commit that change and some later reset backed it out.
I haven't seen the problem today.
> There was another interesting problem in this area. It was a bug in
> FreeBSD's trap handler. ntpd managed to trigger it consistently.
I'm running under Linux, though.
> I favor understanding things more.
> Can you get a stack trace?
Saw one last night, as described to Kurt. Now it's not reproducing.
I guess the default for memory locking has to change to "off". We
can't ship something with that kind of random instability.
Eric S. Raymond <http://www.catb.org/~esr/>