My task list

Eric S. Raymond esr at thyrsus.com
Thu Jun 30 13:02:52 UTC 2016


Hal Murray <hmurray at megapathdsl.net>:
> > 1. Try replacing our buggy async-DNS code with the c-ares library.
> 
> You keep calling the existing code "buggy".  Is that correct, or are you just 
> being sloppy since you don't like it (perhaps justifiably) and it has 
> triggered bugs/quirks in other parts of the system?

You told me it was buggy yourself, some months ago.  Something about the
handling of the ring buffers not being right in a way you couldn't see
a fix for.  If that got repaired it's news to me.

I'm aware that this is a separate issue from the mlockall-threads mess.
But it's enough reason for me to distrust the code and want to get rid of
it.

Besides, I think farming the async-DNS support out to people who specialize
in maintaining that and have a track record at it is a good idea.  I might
want to do it even if I weren't suspicious of ours, just to reduce the
KLOC we have to maintain.

> > 2. If that succeeds, reinstate memlocking long enough to check if the
> >    crash bug recurs.  If it doesn't, leave memlocking in.
> 
> The old memlock code, or a simplified lock-everything (no parameters) version?

I'll test the old code first, and if that fails certainly try chrony-like
behavior.

> If any new code uses threads, it's going to have the same problem.  I'd vote 
> against restoring the old code until you have figured out how to test it.

Well, first, I don't consider that a given.  Maybe they have a workaround
that we wouldn't have to maintain. If this is really a general problem,
their odds of having had to deal with it seem pretty high.

Second, I didn't have to figure out how to test it.  The bad
combination crashes *very* frequently on the Great Beast.  Like, every
three or four minutes if I run it that often.

> > 3. Collect the results from my first profiling runs, now about 14 days of
> >    data.  Learn how to graph and interpret them.
> 
> You might do that first since you will probably want to tweak something and 
> collect more data.

Yeah, on the other hand c-ares is likely to be a fast fix for a real
problem, while the data reduction is going to be several days of
exploration.  (I have new tools to learn before I can even start.)

> Data for a day will tell you most of what you will ever get.  If you have 
> lots of data, then you have to scan it looking for glitches.
> 
> Consider bumping the clock and watching it recover.  (util/bumpclock)  There 
> are two interesting cases.  One is a big bump so it will "step" the clock to 
> recover.  The other is a small bump so it will slew (slowly) to recover.  The 
> split is 128 ms.  So I'd try 200 ms and 100 ms.

This seems like good advice, exactly the kind I expect from you.  Will do.
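
For reference, the two cases differ only in which side of the 128 ms
split the bump lands on.  A minimal sketch in C; the names
STEP_THRESHOLD_MS and recovery_mode are hypothetical, and treating a
bump of exactly 128 ms as a slew is an assumption, not something
established above.

```c
#include <stdlib.h>

/* Split between slewing and stepping, per the discussion above (128 ms). */
#define STEP_THRESHOLD_MS 128

enum recovery { RECOVERY_SLEW, RECOVERY_STEP };

/* Hypothetical helper: classify how the clock would recover from a bump. */
static enum recovery recovery_mode(long bump_ms)
{
    /* Assumption: an offset of exactly 128 ms still slews. */
    return labs(bump_ms) > STEP_THRESHOLD_MS ? RECOVERY_STEP : RECOVERY_SLEW;
}
```

With the suggested values, a 200 ms bump classifies as a step and a
100 ms bump as a slew.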

> > 5. Do the cleanup required to get the code compiling under -std=c99. 
> 
> What does that involve?

Getting rid of some GNUisms, notably the u_long/u_int/u_short typedefs
that NTP uses a lot. That's a simple change that touches a zillion files
in a zillion places - huge but dumb.

It's either that or finding a better way to conditionalize the definitions.
Right now it's:

#if _XOPEN_SOURCE >= 600
/*
 * Supply GCCisms that stop being visible if we tell it we need the
 * prototype for strptime(3).
 */
typedef unsigned long	u_long;
typedef unsigned short	u_short;
typedef unsigned int	u_int;
#endif

Ideally I'd just write something like 

#if (_XOPEN_SOURCE >= 600) || defined (STD_C_99)

but the right predefine doesn't seem to exist.  I'm still looking.
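
One candidate, offered as a sketch rather than a tested fix: GCC and
Clang predefine __STRICT_ANSI__ when invoked with -std=c99 (but not
with -std=gnu99), which should be the case where the typedefs drop out
of the system headers.  __STDC_VERSION__ would not serve, since it is
199901L in both modes.  Whether __STRICT_ANSI__ covers every toolchain
we care about is an assumption, and the widen() helper is hypothetical,
there only to show the types in use.

```c
#include <sys/types.h>

/*
 * Sketch only: __STRICT_ANSI__ is predefined by GCC and Clang under
 * -std=c99.  Coverage of other compilers is an assumption.
 */
#if (defined(_XOPEN_SOURCE) && _XOPEN_SOURCE >= 600) || defined(__STRICT_ANSI__)
typedef unsigned long	u_long;
typedef unsigned short	u_short;
typedef unsigned int	u_int;
#endif

/* Hypothetical usage; works whether the typedefs come from here or libc. */
static u_long widen(u_short s, u_int i)
{
    return (u_long)s + (u_long)i;
}
```

The defined() guard on _XOPEN_SOURCE just keeps the test explicit when
that macro is not set at all.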

> TESTFRAME is missing.  How about we both clear our schedules and desks and 
> give it another try?  How about next Wed?

That might be good timing.  I have been contemplating another whack at it;
the magic-address elimination patches were partly a way of getting close
to that code again.

Maybe you don't know about those - I'm not sure I discussed them here and
I sometimes forget you don't watch our other channels.  I have *entirely*
confined the 127.127.t.u assumption about clock addresses to the config
parser.  Doing this required some changes to ntp_io.c and ntp_proto.c that
are near the hairball.
-- 
		Eric S. Raymond  http://www.catb.org/~esr/

