Using Go for NTPsec

Eric S. Raymond esr at thyrsus.com
Mon Jul 5 03:48:18 UTC 2021


Hal Murray <halmurray at sonic.net>:
> 
> Eric said:
> > Talk to me about what you think the effect of very occasional stop-the-world
> > pauses of 600 microseconds or less would be on sync accuracy. By "very
> > occasionally" let's say once every ten minutes or so, that being what I think
> > is a *very* pessimistic estimate of GC frequency for a program with NTP's
> > memory-usage pattern.
> 
> Could you please say more?  How did you get 600 microseconds?  What 
> assumptions were you making?

You appear to have missed one of my earlier posts on this topic.

There is a suite of tools called dragonboat ("A feature complete and
high performance multi-group Raft library in Go.") whose authors have
done careful measurement of STW pauses in Go 1.11 and Go 1.12 under its
workload. Given the description of what it does, that workload has to
be a much tougher one than NTPsec's - much more packet volume, hence
more GC churn.  Take a look at

https://github.com/lni/dragonboat

If you go far enough down the page you'll find a graph

https://github.com/lni/dragonboat/blob/master/docs/stw.png

which shows their measured STW pauses: about 95% of them are under
600us, and they are typically less than 400us. This is consistent with
other reports I've seen, and that's why I took 600us as the worst-case
STW pause we're likely to see.
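
The dragonboat numbers come from someone else's workload, so it may be
worth collecting the same figures from our own process. The following is
only a sketch: the helper names (churn, pauseSummary) and the allocation
sizes are made up for illustration, and the forced collections stand in
for whatever GC activity a real daemon would generate. It asks the Go
runtime for its own summary of pause times via debug.ReadGCStats.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"time"
)

// churn allocates and drops short-lived buffers so the collector has
// something to do. The sizes are arbitrary, chosen only for illustration.
func churn(rounds int) {
	var keep [][]byte
	for i := 0; i < rounds; i++ {
		keep = append(keep, make([]byte, 64<<10))
		if len(keep) > 32 {
			keep = keep[:0]
		}
	}
	runtime.KeepAlive(keep)
}

// pauseSummary forces `cycles` collections and returns the number of
// completed collections plus the runtime's pause-time quantiles:
// minimum, 25%, 50%, 75% and maximum.
func pauseSummary(cycles int) (int64, []time.Duration) {
	for i := 0; i < cycles; i++ {
		churn(1000)
		runtime.GC()
	}
	// A five-element PauseQuantiles slice asks for min/25/50/75/max.
	stats := debug.GCStats{PauseQuantiles: make([]time.Duration, 5)}
	debug.ReadGCStats(&stats)
	return stats.NumGC, stats.PauseQuantiles
}

func main() {
	n, q := pauseSummary(20)
	fmt.Printf("collections: %d\n", n)
	fmt.Printf("pause min=%v median=%v max=%v\n", q[0], q[2], q[4])
}
```

Forced collections are not the same thing as the ones the runtime
schedules for itself, so numbers from a toy like this are at best a
sanity check against the published graph.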

> The real cost of using a GC is that we have to keep thinking about what it 
> might do, or if the code we want to write might change the assumptions used to 
> conclude that an occasional GC was OK.

Agreed.

> If we get a sample that is off by enough to be interesting, is that because 
> the network was busy or the GC did its thing?

There is a way we can spot possible latency spikes.  We can query the
runtime to get a timestamp of the last GC.  One brutally simple way to
prevent GC-induced latency spikes from distorting time sync is to
take timestamps before and after each critical region, then check for
an intervening GC and throw out the associated sample if we find one.

> Suppose Eric got up on the other side of the bed one morning, and I proposed 
> using some new system that had a GC and waved my hands and claimed that it 
> wouldn't be a problem.  I wouldn't be at all surprised if Eric claimed it was 
> ugly, not appropriate, and we should find something better.

But I'm not just claiming it won't be a problem - I've already thought
through multiple mitigation strategies in case it really is a problem.
This plan didn't come out of nowhere; I started thinking and doing the
research years ago.

1. Guard small critical regions by turning off GC.

2. Schedule GCs during quiet periods.

3. Detect when GC spikes might have collided with sample reads
and throw out those samples.

> I'm assuming the goal is more than just to convert our current code base to a 
> safe language.  We also need to get the structure and environment right.  An 
> environment with a GC seems like a bad start.

It poses some challenges, but I think they are surmountable.  And those
challenges have to be put in context of what's available in the way of safe
languages.  Good options for us are rather limited.

> You haven't commented about my Rust vs Go question.

I have, previously. I guess you missed the post where I explained that.
 
> A friend commented that he might use Go over Rust because it would be easier 
> for others to pick up.

Yes, that is an important factor.

There is one other that is more important: Rust does not have a stable
API, certainly not one that we can count on to be solid on decadal
scales. Nor does it have the kind of development culture that is
conducive to API stability over decades.  It's a very young language,
still in "move fast and break things" mode.

Go, on the other hand, has an ironclad forward-portability guarantee
that its development culture takes very seriously.  I think we need that.

> I'm learning Rust, or trying to.  I find it not-easy, but picking up new 
> languages is not one of my strong points.  The type-checking is really picky.

Picking up new languages *is* one of my strong points, yet I found Rust
rebarbative in the extreme. This did nothing to make me optimistic about
finding developers to work in it.

On the other hand, we already have two developers expert in Go.

> It doesn't do recv time stamps.  As far as I can tell, there isn't a clean way 
> to do that in Rust.

I think it can be done in Go but will take some fancy dancing.  I have this
under active investigation now.

> Crazy thought dept...
> 
> Assuming we want to use Go, can we split things up such that the timing 
> critical code runs in separate processes without GC?

Maybe. But I know from previous experience that trying to make major
changes to a program's architecture *while you're porting it to a new
language* is an invitation to disaster.

The only strategy that works is to do a stupid, literal,
unidiomatic port first, verify it, then clean it up and make
it idiomatic.

This means that changes of the kind you're proposing need to be on
hold until we have a more or less literal translation of the present
C code working.

> I know how to split out the server side of ntpd.
> 
> Suppose we come up with an API for refclocks.  Would that, or something 
> similar also work for network servers?
> 
> I think the timing critical code would be small enough that we could write it 
> in C and inspect it carefully.  That may not be valid if we include the crypto 
> stuff.
> 
> Converting that sort of code to Rust seems reasonable.

I want to stay away from mixing languages if at all possible.  The
joints between them are always *serious* defect attractors and major
sources of maintenance complexity.
-- 
		Eric S. Raymond <http://www.catb.org/~esr/>



