Big Data, Python, and the future of security
More Good vs. Bad
An approach that is now possible because of cheap data storage and processing is to look retroactively for bad traffic. For example, if you archive all your traffic, in theory, you could replay it a week or so later to your antivirus solution. The idea here is that a new virus may not be detected right away, but after a week, your AV solution should have a signature for it. Thus, you could detect the payload and identify the traffic that resulted in the compromise. Something like Bayesian filtering would be valuable for this approach; by eliminating all the known good traffic and logging only the unknown/known bad, you can limit the amount of data you need to store and process.
SELinux and Local Attacks
Another issue involves minimizing the time for correlation of events. That means, if you can determine that bad traffic is bad within a few minutes instead of a few days, you can minimize the impact of the attack and possibly prevent more systems from being compromised. One way to do this is via SELinux violations and other host-based intrusion detection system (HIDS) violations. In general, if you have properly configured SELinux for your applications (which in most cases means using the default profiles), you should get zero violations.
So, if software causes an SELinux violation, in theory, this could mean it has been compromised. In practice, however, it's much more likely to be a false positive, which is a good reason to learn SELinux and update your system policies/file labels as needed. The same goes for syslog entries and other places where applications typically complain about problems. The more sources of information that you can process the better, especially ones with as much metadata as audit logs, syslog, and so on.
Filters and rule-based security haven't worked very well for some time now – email was the first to fall, and network traffic is rapidly becoming more vulnerable. The sheer volume of traffic and new attacks, as well as the encodings and encryption available to attackers, means that machine-based learning is the only practical, long-term option.
- "Big Data Excavation with Apache Hadoop" by Kenneth Geisshirt: http://www.linux-magazine.com/Issues/2012/144/Hadoop
- MongoDB: http://www.mongodb.org/
- scikit-learn – Machine Learning in Python: http://scikit-learn.org/
- mlpy – Machine Learning Python: http://mlpy.sourceforge.net/
Buy this article as PDF
Xen project announces a privilege escalation problem for Qemu host systems
Attackers can compromise an Android phone just by sending a text message
PC vendor will pre-install Ubuntu on portables in India.
More embarrassment for Adobe's embattled multimedia tool
Mozilla’s script blocker add-on could be putting malware sites on the whitelist.
The Internet community officially banishes the notoriously unsafe Secure Sockets Layer protocol.
Popular desktop environment continues the Gnome 2 legacy – with new support for the Gnome 3 toolkit.
The Obama White House has issued a memorandum telling all US government agencies they must use HTTPS for all websites and web communication.
New program will dial up security for the Firefox browser.
Red Hat's community distro embraces the cloud.