WIBDR: Traffic analysis

Fri May 29 07:11:08 UTC 2020

WIBDR == What I've Been Doing Recently

Maybe if we use a tag like that occasionally, it will encourage others to 
report on their adventures, or some interesting details of plain old boring 
work.

----------

I'm not sure how/why I got started on this, but I've been trying to learn more 
about the traffic that NTP servers see.  I have 2 servers in the pool to use 
as sources of data.

One thing is pretty obvious.  There are a lot of bad guys out there, trying 
all sorts of things.  The KE server gets much more abusive traffic than 
legitimate traffic.

NTS KE serves good:          12613
NTS KE serves_bad:           28461

If you look at your log file, you will see things like:
28 May 18:36:16 ntpd[52023]: NTSs: SSL accept from 172.58.239.64:37024 failed: wrong version number, took 0.006 sec
28 May 18:36:36 ntpd[52023]: NTSs: SSL accept from 172.58.239.64:29722 failed: wrong version number, took 0.008 sec
28 May 18:37:18 ntpd[52023]: NTSs: SSL accept from 76.216.52.221:49541 failed: wrong version number, took 0.002 sec

There are several cases of KE server TLS errors where I special cased the 
logging down to one line in order to reduce clutter.  I think "wrong version" 
is from ssh attempts.  That code is in nts_ke_accept_fail() in nts_server.c if 
you ever want to add another.

-------------

The second part is individual NTP requests.

I recently (a month or two ago) cleaned up the rate limiting.  ntpq/mrulist now 
has a "score" column and a "dropped" column.  I was happy when I figured out 
how to do the math.  score is in packets per second.  decay_time is in units 
of seconds.
    mon->score *= expf(-since_last/mon_data.decay_time);
    mon->score += 1.0/mon_data.decay_time;
The first line applies the exponential decay accumulated since the score was 
last updated, which was when the previous packet arrived.  The mru slot holds 
the time of the last packet, so since_last is easy to calculate.  The second 
line adds in the score for this packet.

Graphically, the area under each packet's decay curve is 1 packet (score in 
packets per second, integrated over seconds).  Dividing the starting score by 
the decay time compensates for a slower decay keeping the packet's 
contribution around longer.

If the limit is 1 and the decay time is 20 (the defaults), each packet adds 
1/20 to the score, so you can get a burst of about 20 packets before any start 
getting dropped.

-------------

With the default rate limiting, lots of packets get dropped.

From a pool server that has been up 7 1/2 days:

Ctrl-C will stop MRU retrieval and display partial results.
 lstint avgint rstr r m v  count    score   drop rport remote address
=====================================================================
  17358  0.494   e0 L 3 3 1170863 2240.073 1170526   123 108.161.83.242
   4209  0.496   e0 L 3 3 1269062 1619.395 1268017   634 67.216.65.10
   4855  0.421   c0 . 3 3 1269863    0.237 1269581   123 50.233.222.130
  32946  0.140   e0 L 3 3 1337989 6322.806 1337808   123 190.106.77.82
   2558  0.355   c0 . 3 3 1359788    0.206 1359539   123 207.189.100.126
  24754  0.296   e0 L 3 3 1467428 6280.005 1467119   123 12.183.201.66
  43956  0.081   e0 L 3 4 1866084 6748.549 1865962   123 65.158.5.78
   1450  0.201   c0 . 3 4 1998916    0.050 1998725   123 216.103.178.80
   7084  0.251   c0 . 3 3 2461501    0.206 2460590   123 67.204.10.106
   1603  0.191   c0 . 3 3 3243349    0.051 3242562   123 206.40.97.188
  12933  0.166   e0 L 3 3 3729992 5879.169 3728988   634 63.98.240.2
  13431  0.079   e0 L 3 4 7717268 6887.582 7716776   123 8.36.94.10
# Collected 12 slots in 1.087 seconds

That system has a big mru list so I can see things like this.  If the bad guys 
are sending packets often enough, they are likely to stay in the mru list 
forever.  The bottom slot is 12 packets per second, averaged over 7 days!
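
The 12 packets per second figure can be checked from the count column.  A 
small sketch, assuming the slot has been in the mru list for the whole 7 1/2 
days of uptime; the function names are made up for illustration.

```c
#include <assert.h>

/* Average packets per second for a slot, assuming it has been in the
 * MRU list for the entire uptime. */
double avg_rate(double count, double uptime_days)
{
    assert(uptime_days > 0.0);
    return count / (uptime_days * 86400.0);  /* 86400 seconds per day */
}

/* Fraction of a slot's packets that were dropped by rate limiting. */
double drop_fraction(double count, double dropped)
{
    assert(count > 0.0);
    return dropped / count;
}
```

For the bottom row, avg_rate(7717268, 7.5) comes out a bit under 12, and 
drop_fraction(7717268, 7716776) is above 0.9999.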

We took a closer look with tcpdump.  There are conspicuous 10 second bursts.  
They range from 2,000 packets per second to over 20,000.  Occasionally, there 
are back-to-back 10 second bursts.

Steven tracked those back to a bug in FortiGate.
  https://community.ntppool.org/t/ntp-bursts-from-fortigate-firewalls/1661

There is another layer of bursts.  These last 2 or 4 minutes with 10-100 
packets per second.  I assume they come from a group of systems behind a NAT 
box that use my server while it is in the DNS rotation.  I got one place to 
verify that it was a "large lab" using NAT, but I don't know yet how many 
systems there are.

I'm still scratching my head about what the right limit should be.  Ideally, 
any NAT site with enough traffic for me to notice would set up its own NTP 
servers and tell its clients via DHCP.

-- 
These are my opinions.  I hate spam.