Analyzing network flow records
Go with the Flow
Detect operating systems, installed software, and more from easily collected metadata.
What operating systems are installed on your network, and what software is running on them? Questions like these are often posed in IT departments – especially if users are operating their own shadow IT [1] or when documentation, automation, and software distribution need some care and attention. But there is another good reason to ask: Attackers are also interested in your systems.
Many methods of discovering the current status quo have been developed in recent years; they rely on either actively probing the network [2] (e.g., with Nmap) or passively sniffing network traffic [3]. The passive method analyzes all or part of your network traffic and draws conclusions from it. For example, a device that regularly visits the IP address for the domain name update.microsoft.com leads to the conclusion that its operating system comes from Microsoft.
In this article, we present a new approach based on network traffic analysis that exclusively considers the widespread and often easily available network communication metadata in the form of flow records. Metadata analysis of network connections can offer many benefits: It requires far less memory and computational power than the analysis of complete packets, it is compliant with data protection, and it does not need port mirroring on the router; moreover, it is comparatively fast.
The example discussed here relies on records from network flows – that is, short snippets of information that state the source and target of an IP packet, among other things, including the protocol used (e.g., TCP, UDP, ICMP) and the transmission volume in bytes. Typical examples of flow records are NetFlow, CFlow, or IPFIX, which each originate with different products and contain different details.
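To make the discussion concrete, a flow record of this kind can be sketched as a small data structure. The field names here are illustrative and do not follow any particular export format:

```python
from dataclasses import dataclass

@dataclass
class FlowRecord:
    """One unidirectional flow, as exported by NetFlow-style collectors.

    Field names are illustrative, not a real export schema.
    """
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str   # e.g., "TCP", "UDP", "ICMP"
    packets: int    # number of packets seen in the flow
    octets: int     # total bytes transferred

# A download of roughly 5MB over HTTPS might leave a footprint like this:
flow = FlowRecord("192.168.1.10", 51514, "203.0.113.5", 443,
                  "TCP", 3650, 5_242_880)
```

Note that no payload appears anywhere in the record – everything that follows is inferred from addresses, counters, and sizes alone.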
The basic idea is to use the existing information, to the extent possible, to draw conclusions about the data transmitted. Calls to simple websites or any email messages sent will only differ marginally in terms of their footprints in the metadata. For this reason, you cannot say anything about the content of the transmitted messages or, for example, which subpage of a website has been viewed.
However, this is not true of downloads, which allow you to detect clearly distinguishing features based on size alone. Once you have established what the hosts under investigation are downloading, you can easily draw conclusions about the operating system software used – especially when it comes to updates of previously installed programs.
Recording Flow Records
As you can see from Figure 1, flow records can basically be recorded wherever you have access to network traffic. This can mean recording direct communication on your own subnet (1), but also communication within the enterprise on one of the intermediate switches (2). Where network packets are sent to external communication partners, you can also collect flow records on your own enterprise switches (3), the edge router (4), or on the transport route on the Internet (5).
To record metadata actively in the form of flow records, Linux offers you a number of useful tools. The test setup described in the following section uses a combination of Softflowd [4] and Flow-tools [5] to grab network flows from the network traffic and record them as flow records in files.
Recording the Flow
Softflowd can generate Cisco NetFlow data from the network traffic it sniffs on a selected network interface, or parse it from a previously recorded packet capture file; it then sends these data to a flow collector. It always sends complete flows, and only the metadata of each connection reaches the collector (e.g., the source and target addresses and ports, as well as the minimum, maximum, average, and total bytes and the number of packets registered).

Because many devices on the network, such as switches and routers, generate and transmit NetFlow data in the same way, Softflowd is particularly well suited to simulating these devices for test purposes in a virtual environment, without deploying any physical hardware.
As the flow collector that receives and processes the flows sent by Softflowd, we used flow-capture from the Flow-tools collection of programs. Flow-capture saves the received flows in files that can then be analyzed downstream. The files rotate automatically, so one file always stores the flows from a specific time window. Files can be deleted either by date or by the volume of hard disk space used.
Both Softflowd and the Flow-tools are available in the package sources of Debian and other Linux distributions and can be installed from there. To record NetFlow data, you only need to run softflowd, specifying the interface to use (-i) and the IP address and port of the target system (-n).

By default, Softflowd immediately runs as a daemon in the background. If you do want to run Softflowd in the foreground, you additionally need to set the -d option.
The example here generates NetFlow data from the network traffic monitored on the eth0 interface and sends it to localhost:
softflowd -i eth0 -n 127.0.0.1:4432
To make sure Softflowd really is recording flows, the softflowctl program, which is part of the distribution, is a useful option. The softflowctl statistics command (Listing 1) delivers up-to-date statistics on the analyzed packets. It tells you how many packets Softflowd has processed and how many flows it has detected as expired and exported. You will also want to run softflowctl with the shutdown option to close down Softflowd gracefully. Before terminating, Softflowd sends to the collector any flows that have not yet been sent.
Listing 1
Softflow Statistics
To process the NetFlow data collected by Softflowd, you need to launch flow-capture, which has many settings that determine how it creates the flow record files and the criteria used to rotate or delete them. The following example shows a simple configuration:
flow-capture -w /tmp/flows -n 287 0/127.0.0.1/4432
The -w option specifies the directory in which flow-capture stores the flow records. The -n option specifies the number of file rotations per day: 287 rotations give you a new file every five minutes. A five-minute interval is a useful choice for test purposes, because you will see a number of flows during that window without having to wait too long for a file to become available.

The final option specifies, in localip/remoteip/port notation, the local IP address and port on which to listen and the host from which to accept NetFlow data. A zero instead of the local IP address means listen on all addresses; however, it does make sense to state explicitly the host sending the NetFlow data to avoid polluting the results with flows from other systems.
The flow-print tool from the Flow-tools toolkit shows what the flow records contain. You can see from the connection metadata in Listing 2 that the first flow belongs to a mail client and the others to an HTTPS connection.
Listing 2
flow-print Example
Analyzing Flow Record Metadata
The metadata collected from connections monitored in this way can be put to various uses. For example, you could use it to compute the bandwidth usage per IP address or per subnet for billing purposes, or to detect deviations from normal communication patterns, such as a massive increase in outgoing connections. The metadata is also suitable as raw material for inventorying the devices on your own network. However, if such data does get into the wrong hands, it can give attackers valuable hints. Also note that the collection of metadata falls under the data retention laws of some countries [6]. In other words, flow records are definitely useful for drawing conclusions about content.
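The per-IP bandwidth use case mentioned above boils down to a simple aggregation once the flow records have been parsed. This sketch assumes the records are already available as (source IP, bytes) tuples:

```python
from collections import defaultdict

def bytes_per_ip(flows):
    """Sum transferred bytes per source IP from (src_ip, octets) tuples."""
    totals = defaultdict(int)
    for src_ip, octets in flows:
        totals[src_ip] += octets
    return dict(totals)

# Three hypothetical flows from two hosts:
flows = [("192.168.1.10", 4096), ("192.168.1.11", 1500),
         ("192.168.1.10", 8192)]
print(bytes_per_ip(flows))  # {'192.168.1.10': 12288, '192.168.1.11': 1500}
```

The same grouping, keyed on subnet instead of address, would give you per-subnet billing figures.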
All of these reasons make it interesting to determine how much metadata can tell you about the content transferred and with what degree of precision conclusions can be drawn on the transferred data. To investigate this, we set up a web server in a lab environment that hosted 50 files of random size between 1 and 50MB.
The test team ran softflowd and flow-capture on the server, as described above, to collect flow records while other hosts were downloading the files. In addition to systems running the Debian and SLES Linux distributions, one system ran Windows Server 2008, so we could investigate differences between operating systems in the lab environment.
The flow records were then exported with flow-print to a tab-separated file for analysis in the Pandas [7] Python data exploration framework. The data was filtered on the basis of source IP address and source port, so that only the test file downloads were left. Because the files were always downloaded in the same order, we were able to correlate downloads with files. To classify the records, we used the sklearn [8] Python framework, which implements various classification methods.
To train the classifier, the test team generated 1,250 flow records, from 25 repeats of 50 file downloads, on a Linux host located one hop from the web server. The most effective classification method turned out to be the decision tree classifier, which only considers the number of bytes transmitted. Additionally taking the number of packets into account did not lead to any improvement; in fact, the results were between 1 and 10 percent worse. With this classifier, we achieved an accuracy of 98% in matching flow records to downloaded files.
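The core idea of classifying on byte count alone can be illustrated with a crude nearest-byte-count stand-in for the decision tree (which is what a single-feature tree effectively learns). The training data below is hypothetical:

```python
def train(samples):
    """samples: list of (total_bytes, file_label) pairs from known downloads."""
    return sorted(samples)

def classify(model, total_bytes):
    """Assign the label whose training byte count is closest -- a simplified
    stand-in for the single-feature decision tree described in the text."""
    return min(model, key=lambda s: abs(s[0] - total_bytes))[1]

# Hypothetical training data: three files of clearly different sizes.
model = train([(1_050_000, "file-01"), (2_110_000, "file-02"),
               (3_160_000, "file-03")])
print(classify(model, 2_100_000))  # file-02
```

The closer two files are in size, the narrower the margin for such a classifier – which is exactly why accuracy drops when transmission overhead starts to vary, as the following measurements show.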
For a Linux system six hops away, 88% of the downloads were correctly assigned to the corresponding files. However, if packets from the same system detoured via a VPN server, the detection rate dropped to 61%, illustrating that as distance increases, the total number of bytes transmitted deviates by too great an extent to achieve a high degree of accuracy given files of a similar size – probably because of packet loss and retransmission.
The tests on Linux used the wget download tool; on Windows, we downloaded the files with Internet Explorer. What we discovered was that Windows does not open a new port for each download but immediately reuses the port after a download has completed; therefore, the downloads cannot be clearly distinguished in the flow records and the analysis fails. More tests would be needed to determine whether this behavior is browser dependent or also occurs on other systems.
Overall, the test told us that, given sufficient proximity to the network, very good accuracy is possible in terms of the ability to map monitored flow records to files previously analyzed on a test system. This will give the network operator an easy approach to detecting what updates have been installed, but without complex – and typically expensive – deep packet inspection.