High-resolution network monitoring with ping
The Pinger network monitoring tool uses ping to look for switches and estimate cable lengths.
The ping command is used to determine whether a particular host on the network is accessible and to reveal the packet turnaround time, usually known as the round trip time (RTT). The RTT of a ping request is longer when packets need to pass through network devices or long stretches of wire. In this article, I develop a utility that uses the ping RTT to track down switches and transparent bridges and determine cable lengths.
Common ping programs under Linux, like the one from the iputils package, create RTT statistics with a mean value. However, even the average of thousands of pings can vary so greatly that a resolution of a few microseconds, let alone nanoseconds, is impossible to achieve.
These subtleties, however, are interesting when exploring the network and the equipment in it. Expert evaluation, that is, filtering out RTT outliers before computing an average, can return a resolution of less than one microsecond (1µs). Several available ping programs offer additional features that remain mostly unused, such as the ability to send a bit pattern in the ping packet to determine data rot (i.e., damage to data on the network).
In this article, I use the classic ICMP ping on IPv4, but most of the principles also apply to alternative ping tools, such as arping, httping, and ipmiping – in fact, to anything that gives you an RTT (e.g., a wget download of a small file or reading the USB register of a USB adapter).
Ping on a gigabit LAN mostly delivers RTT measurements with a Gaussian distribution of typically around 200µs; however, outliers with much higher values shift the mean. Listing 1 shows an example with only one outlier in 100 measured values. It results from pinging between two four-year-old PCs connected by a gigabit switch: one running Debian with kernel 3.1.0-1-amd64 and the other running Ubuntu with kernel version 3.5.0-18-generic.
Listing 1: Ping via Gigabit Switch
[...]
64 bytes from 192.168.1.1: icmp_req=63 ttl=64 time=0.232 ms
64 bytes from 192.168.1.1: icmp_req=64 ttl=64 time=0.158 ms
64 bytes from 192.168.1.1: icmp_req=65 ttl=64 time=508 ms
64 bytes from 192.168.1.1: icmp_req=66 ttl=64 time=0.204 ms
64 bytes from 192.168.1.1: icmp_req=67 ttl=64 time=0.166 ms
--- 192.168.1.1 ping statistics ---
100 packets transmitted, 100 received, 0% packet loss, time 98998ms
rtt min/avg/max/mdev = 0.097/5.297/508.262/50.549 ms
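Before any filtering can happen, the individual RTT values have to be pulled out of output like that in Listing 1. The following is only a minimal sketch in Python, assuming the `time=… ms` reply format shown in the listing; the names `RTT_FIELD` and `parse_rtts_us` are my own placeholders, not part of any existing tool.

```python
import re

# Matches the "time=0.232 ms" field printed for each echo reply
# (format assumed from Listing 1).
RTT_FIELD = re.compile(r"time=([0-9.]+) ms")


def parse_rtts_us(ping_output):
    """Return every RTT found in the ping output, converted from ms to µs."""
    return [float(value) * 1000.0 for value in RTT_FIELD.findall(ping_output)]


# Three reply lines copied from Listing 1.
sample = """\
64 bytes from 192.168.1.1: icmp_req=63 ttl=64 time=0.232 ms
64 bytes from 192.168.1.1: icmp_req=64 ttl=64 time=0.158 ms
64 bytes from 192.168.1.1: icmp_req=65 ttl=64 time=508 ms
"""

rtts = parse_rtts_us(sample)
print(rtts)
```

Working in microseconds from this point on keeps the numbers in the same unit as the sigma and cut-off values discussed in the text.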
A single outlier here shifts the mean value avg by more than 1,000 percent, from approximately 0.2 to 5ms. As several measurements show, the mean RTT still varies greatly if you increase the number of pings (n) to several thousand. This happens because the outliers are not only relatively large, but also greatly scattered, causing the mean value to vary by approximately 50 percent, even for n = 86,400.
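The size of this shift can be checked from the summary line of Listing 1 alone. The sketch below assumes that the maximum value is the only outlier in the run, as stated above; the function name `mean_without_max` is hypothetical.

```python
def mean_without_max(avg_ms, n, max_ms):
    """Mean of the remaining n - 1 samples once the largest one is removed."""
    return (avg_ms * n - max_ms) / (n - 1)


# avg, max, and packet count taken from the summary of Listing 1 (in ms).
avg_ms, max_ms, n = 5.297, 508.262, 100

clean_avg_ms = mean_without_max(avg_ms, n, max_ms)
print(f"mean with outlier:    {avg_ms:.3f} ms")
print(f"mean without outlier: {clean_avg_ms:.3f} ms")
print(f"inflation factor:     {avg_ms / clean_avg_ms:.1f}")
```

Without the one 508ms reply, the mean of the other 99 values comes out at roughly 0.22ms, which matches the "approximately 0.2" figure.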
These extremely large outliers are equivalent to a pendulum with a period of approximately 1 second – and that really does take about one second per beat 99 percent of the time – occasionally slowing down so drastically that one beat takes a whole hour or longer.
Through the High Pass
The disproportional effect of outlying values throws off any calculation that makes assumptions about the state of the network on the basis of ping response times. The first step is to remove these extreme values by developing some sort of interference filter.
The usual approach, as many of you will recall from physics problems, is to assume a Gaussian distribution of the RTT measurements as a first approximation and discard those values that deviate from the mean value without outliers by more than 3-sigma (i.e., three standard deviations from the mean). Because no outliers fall in the downward direction, you only need to remove outliers in the upward direction, which calls for high-pass filtering.
The ping command from the iputils package returns sigma in the mdev field, so just a few measurements of a few dozen pings, each without outliers, are sufficient to determine sigma at the command line.
The sigma for the gigabit LAN with one switch connected to several computers turned out to be 45µs. Thus, filtering values above 200+3*45 (i.e., 335µs) is the way to go. In practice, you might want to round up to be safe; you will still filter out most of the outliers that typically lie in the range of 100ms to several seconds. In this example, 400µs would be a good choice – that is, twice the mean value without outliers.
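The cut-off arithmetic and the filter itself fit into a few lines. This is a sketch under the numbers given above (mean 200µs, sigma 45µs, rounded-up limit 400µs); `upper_cutoff_us` and `drop_outliers` are illustrative names, and the sample values are the five replies from Listing 1 converted to microseconds.

```python
def upper_cutoff_us(mean_us, sigma_us, k=3.0):
    """Upper limit at k standard deviations above the outlier-free mean."""
    return mean_us + k * sigma_us


def drop_outliers(rtts_us, limit_us):
    """Keep only RTTs at or below the limit; no lower bound is needed."""
    return [rtt for rtt in rtts_us if rtt <= limit_us]


limit_us = upper_cutoff_us(200.0, 45.0)
print(limit_us)  # 3-sigma limit in µs

# Replies from Listing 1 in µs; the third one is the outlier.
samples_us = [232.0, 158.0, 508000.0, 204.0, 166.0]
kept = drop_outliers(samples_us, 400.0)  # rounded-up limit from the text
print(kept)
```

Only the upper side is filtered because, as noted earlier, outliers do not occur in the downward direction.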
You can also determine this cut-off value automatically, by using outlier tests or by determining the value with the highest RTT density, and simply multiplying it by 2. This is a task that a program can do for you automatically if you allow a brief warm-up period before the actual measurement.
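One possible way to automate the second approach is to build a coarse histogram of the warm-up samples, pick the most populated bin, and double its center value. The sketch below assumes a bin width of 10µs, which is my own choice and not a value from the measurements; the warm-up list mixes the replies from Listing 1 with a few made-up values purely for illustration.

```python
from collections import Counter


def auto_cutoff_us(warmup_rtts_us, bin_width_us=10.0):
    """Estimate the cut-off as twice the most densely populated RTT value."""
    if not warmup_rtts_us:
        raise ValueError("need at least one warm-up sample")
    # Count how many samples fall into each bin of width bin_width_us.
    bins = Counter(int(rtt // bin_width_us) for rtt in warmup_rtts_us)
    densest_bin, _count = bins.most_common(1)[0]
    # Use the center of the densest bin as the RTT with the highest density.
    mode_us = (densest_bin + 0.5) * bin_width_us
    return 2.0 * mode_us


# Hypothetical warm-up data in µs (first five values from Listing 1).
warmup_us = [232.0, 158.0, 508000.0, 204.0, 166.0, 201.0, 197.0, 209.0, 188.0]
print(auto_cutoff_us(warmup_us))
```

Because the histogram peak ignores how far away the outliers lie, the 508ms reply has no influence on the resulting limit.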
After this noise filtering, you can see that the average error of the mean value is approximately sigma divided by the square root of n, as you would expect for a Gaussian distribution. This means that the mean error of the mean value decreases as n increases; for n = 100, by a factor of 10; for n = 10^6, by a factor of 1,000; and so on.
In terms of the RTT in the gigabit LAN under investigation, with a sigma of 45µs, this means the mean error of the mean value is 4.5µs for n = 100, or 45ns for n = 10^6. The measurement thus has sufficiently high resolution to detect an intermediate switch or a longer or shorter network cable.
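The sigma divided by the square root of n relationship is easy to reproduce for the sigma of 45µs measured above. This is a minimal sketch that assumes the filtered RTTs really are Gaussian; `standard_error_us` is an illustrative name.

```python
import math


def standard_error_us(sigma_us, n):
    """Mean error of the mean for n filtered samples with spread sigma_us."""
    return sigma_us / math.sqrt(n)


for n in (100, 10**6):
    error_us = standard_error_us(45.0, n)
    print(f"n = {n}: {error_us:.3f} µs ({error_us * 1000.0:.0f} ns)")
```

At one ping per second, n = 10^6 corresponds to more than eleven days of measuring, so shorter ping intervals help when nanosecond resolution is the goal.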