Big Data, Python, and the future of security

Good vs. Bad

Article from Issue 155/2013

When you start processing security-related data to find patterns, you quickly end up in Big Data territory, and you'll need some powerful tools to help you separate the good from the bad.

Intrusion detection and prevention is a difficult problem, much like email spam. Basically, you want to block all the "bad" traffic without blocking any "good" data. Because you can't accomplish this perfectly, you have to make a choice of how much bad traffic you're willing to allow, and how much good traffic you're willing to block.

Generally, people take one of three positions here. The first is the infamous "we can't block any good traffic, we'll lose sales, etc." The second approach is "I don't care about inconveniencing anybody, block by default and make sure anything coming through is good." The third option is a little more subtle and difficult to implement; basically, you turn to economics and try to figure out the cost of blocking good traffic (annoying users, support costs) and the cost of not blocking bad traffic (cleaning up after the occasional intrusion), and you make a decision. The third option, however, is rarely based on actual data and is mostly done along the lines of "how much can we annoy users before they yell at us." But, it's better than nothing.
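To make the third option concrete, the following minimal sketch compares the expected cost of three filter settings. Every number in it (error rates, traffic volumes, per-incident costs) and every name is a hypothetical placeholder, not measured data; the point is only that once you can estimate these values, the decision becomes arithmetic.

```python
def expected_cost(false_positive_rate, false_negative_rate,
                  good_requests, bad_requests,
                  cost_per_blocked_good, cost_per_missed_bad):
    """Expected cost of one filter setting over some period."""
    blocked_good = false_positive_rate * good_requests  # good traffic wrongly blocked
    missed_bad = false_negative_rate * bad_requests     # bad traffic wrongly allowed
    return (blocked_good * cost_per_blocked_good
            + missed_bad * cost_per_missed_bad)

# Hypothetical settings: (false positive rate, false negative rate)
settings = {
    "permissive": (0.001, 0.20),
    "balanced": (0.01, 0.05),
    "strict": (0.05, 0.01),
}

# Hypothetical volumes and costs, for illustration only
costs = {
    name: expected_cost(fp, fn,
                        good_requests=1_000_000, bad_requests=1_000,
                        cost_per_blocked_good=2.0, cost_per_missed_bad=500.0)
    for name, (fp, fn) in settings.items()
}

cheapest = min(costs, key=costs.get)
print(cheapest, costs)
```

With these made-up inputs, neither extreme wins: Blocking almost nothing is dominated by cleanup costs, and blocking aggressively is dominated by annoyed users.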

Big Data Tools

Processing all this information, of course, leads to Big Data. Personally, I'm not a fan of buzzwords, but enough incremental change usually leads to entirely new things. Today, I was backing up an email account that contains individual messages roughly as large as my first hard drive, and the entire mailbox was larger than the storage of my first seven or eight computers put together. The reality is, if you want to start processing security-related data to find patterns, you're going to end up in Big Data territory quite quickly.

In typical open source fashion, you will be spoiled for choice when it comes to tools for the job. For the purposes of this article, however, I'll mention Hadoop [1], MongoDB [2], and Python. Why Python, you ask? Why not Scala or something else? Python is firmly established in scientific computing and, as such, has a number of extremely powerful data processing and machine learning libraries that are well suited to the problem here. As for Hadoop and MongoDB, it's simple: They can store a ton of data, they allow you to scale performance cheaply, and talking to them to manipulate your data is easy.

Bayesian Filtering

One of the most powerful and simple tools for taking a lot of data and figuring out which of it is "good" and which of it is "bad" is Bayesian probability. This concept alone took spam filtering from manually created lists to something that actually worked in an automated fashion. To make a long story short, you examine your data set, looking for features (e.g., the phrase "refinance your mortgage") that occur in spam email. If you're a mortgage broker, however, this phrase also appears in your ham (good) email. The trick is knowing what percentage of spam email and what percentage of ham email contains it. For example, if 1 percent of your spam contains the term but 10 percent of your legitimate email has it, then it's probably a legitimate term for you despite being abused by spammers.
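The mortgage example can be worked through with Bayes' rule. The sketch below uses the 1 percent and 10 percent figures from the text; the prior of 0.5 (half of all incoming mail is spam) is my assumption, as is the function name.

```python
def posterior_spam(p_term_given_spam, p_term_given_ham, prior_spam=0.5):
    """P(spam | term) for a single term, via Bayes' rule."""
    prior_ham = 1.0 - prior_spam
    spam_evidence = p_term_given_spam * prior_spam
    ham_evidence = p_term_given_ham * prior_ham
    return spam_evidence / (spam_evidence + ham_evidence)

# 1 percent of spam and 10 percent of ham contain the term
p = posterior_spam(0.01, 0.10)
print(f"P(spam | term) = {p:.3f}")  # prints "P(spam | term) = 0.091"
```

Under these assumptions, a message containing the term has only about a 9 percent chance of being spam, so for this mailbox the term counts as evidence of legitimate mail. A different prior shifts the number, but the direction stays the same unless spam vastly outnumbers ham.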

With Bayesian filtering, if you can codify the data, you can process it. For example, if you record all your network traffic and server logs and then a server suffers a break-in, you can mark all the data from the time of the break-in – assuming you can determine that – as suspicious and then compare it to all the other known good traffic. With luck, Bayesian filtering will be able to find the malicious data, because it will not have occurred in the known good set of data (Figure 1). If you combine this approach with additional data like IP address, country of origin, and time of day, you should be able to eliminate large amounts of "good" traffic from the suspect data set quickly.

Figure 1: Data points with Bayesian probability applied. Source: scikit-learn [3], BSD license; Copyright (c) 2010-2013, scikit-learn developers; All rights reserved.
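The comparison between the suspect time window and the known good data can be sketched with nothing but the standard library. This is a simplified, hand-rolled illustration of the idea and not how any particular product does it: The log lines are invented, tokens are split on whitespace, and add-one smoothing keeps unseen tokens from producing a division by zero.

```python
import math
from collections import Counter

def token_scores(good_lines, suspect_lines):
    """Log-odds that a token belongs to the suspect set rather than the good set."""
    good = Counter(tok for line in good_lines for tok in line.split())
    suspect = Counter(tok for line in suspect_lines for tok in line.split())
    vocab = set(good) | set(suspect)
    # Add-one smoothing: every token gets one extra count in each set
    good_total = sum(good.values()) + len(vocab)
    suspect_total = sum(suspect.values()) + len(vocab)
    return {
        tok: math.log((suspect[tok] + 1) / suspect_total)
        - math.log((good[tok] + 1) / good_total)
        for tok in vocab
    }

# Invented log lines for illustration only
good_lines = [
    "GET /index.html 200",
    "GET /about.html 200",
    "POST /login 302",
]
suspect_lines = [
    "GET /index.html 200",
    "POST /cgi-bin/upload.cgi 200",
    "GET /uploads/shell.php 200",
]

scores = token_scores(good_lines, suspect_lines)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:2])
```

The two paths that never occur in the known good set rise to the top, whereas tokens common to both sets score near zero. On real data, you would add the extra fields mentioned above (IP address, country of origin, time of day) as further tokens.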

Machine Learning with Python

What you're mainly doing with these data sets of network traffic, server logs, and so on is classifying and clustering: Classification answers "is this data good or bad," and clustering answers "what things are related to this data." For example, many modern viruses phone home to command and control servers or to servers that host the payload. This means attackers can customize the payload for a virus based on the location of the machine requesting it and keep offering new versions to make detection more difficult.

The trick here is to know what these outgoing requests look like. For example, if you have systems running Linux and Firefox and they start sending out web requests with a user agent of Internet Explorer, then that's probably not legitimate traffic. Another clue might be if they started sending out web requests at three in the morning when no one was in the office. These behaviors seem obvious in hindsight, but there are millions of possibilities and not only do they vary from site to site but they also keep changing.
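To show what such clues look like once they are codified, here is a minimal sketch of the two checks just described. The host inventory, the office hours, and the function name are all hypothetical. Hand-written rules like these are exactly what does not scale to millions of changing possibilities, but they illustrate the kind of features a learning algorithm would work with.

```python
from datetime import datetime

# Hypothetical inventory: substrings each host's user agent should contain
EXPECTED_AGENT_TOKENS = {"10.0.0.12": ("Linux", "Firefox")}
# Hypothetical office hours: 08:00 through 18:59
OFFICE_HOURS = range(8, 19)

def flag_request(source_ip, user_agent, timestamp):
    """Return a list of reasons why an outgoing request looks suspicious."""
    reasons = []
    expected = EXPECTED_AGENT_TOKENS.get(source_ip)
    if expected and not all(tok in user_agent for tok in expected):
        reasons.append("unexpected user agent")
    if timestamp.hour not in OFFICE_HOURS:
        reasons.append("outside office hours")
    return reasons

odd = flag_request(
    "10.0.0.12",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)",
    datetime(2013, 7, 1, 3, 0),
)
normal = flag_request(
    "10.0.0.12",
    "Mozilla/5.0 (X11; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0",
    datetime(2013, 7, 1, 10, 30),
)
print(odd, normal)
```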

For Python, the two main packages to help deal with the problem are scikit-learn [3] and mlpy [4]. These tools are built on top of NumPy and SciPy, with the performance-critical parts written in C and Fortran, so they're fast and easy to work with.
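As a small taste of scikit-learn, the following sketch trains a naive Bayes classifier on a handful of labeled request summaries and classifies a new one. The training data is invented and far too small to be meaningful, and the whitespace-style request strings are my own simplification; the classes used (CountVectorizer, MultinomialNB, make_pipeline) are part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented request summaries: method, path, browser, operating system
requests = [
    "GET /index.html Firefox Linux",
    "GET /about.html Firefox Linux",
    "POST /login Firefox Linux",
    "GET /payload.bin MSIE Windows",
    "POST /gate.php MSIE Windows",
]
labels = ["good", "good", "good", "bad", "bad"]

# Turn each summary into word counts, then fit a multinomial naive Bayes model
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(requests, labels)

prediction = model.predict(["GET /update.bin MSIE Windows"])[0]
print(prediction)
```

The unseen request is labeled "bad" because its browser and operating system tokens occur only in the bad training examples. With real traffic, you would train on far more data and hold some of it back to measure how often the classifier is wrong.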
