Big Data, Python, and the future of security
More Good vs. Bad
An approach that is now possible because of cheap data storage and processing is to look retroactively for bad traffic. For example, if you archive all your traffic, in theory, you could replay it a week or so later to your antivirus solution. The idea here is that a new virus may not be detected right away, but after a week, your AV solution should have a signature for it. Thus, you could detect the payload and identify the traffic that resulted in the compromise. Something like Bayesian filtering would be valuable for this approach; by eliminating all the known good traffic and logging only the unknown/known bad, you can limit the amount of data you need to store and process.
SELinux and Local Attacks
Another issue involves minimizing the time for correlation of events. That means, if you can determine that bad traffic is bad within a few minutes instead of a few days, you can minimize the impact of the attack and possibly prevent more systems from being compromised. One way to do this is via SELinux violations and other host-based intrusion detection system (HIDS) violations. In general, if you have properly configured SELinux for your applications (which in most cases means using the default profiles), you should get zero violations.
So, if software causes an SELinux violation, in theory, this could mean it has been compromised. In practice, however, it's much more likely to be a false positive, which is a good reason to learn SELinux and update your system policies/file labels as needed. The same goes for syslog entries and other places where applications typically complain about problems. The more sources of information that you can process the better, especially ones with as much metadata as audit logs, syslog, and so on.
Filters and rule-based security haven't worked very well for some time now – email was the first to fall, and network traffic is rapidly becoming more vulnerable. The sheer volume of traffic and new attacks, as well as the encodings and encryption available to attackers, means that machine-based learning is the only practical, long-term option.
- "Big Data Excavation with Apache Hadoop" by Kenneth Geisshirt: http://www.linux-magazine.com/Issues/2012/144/Hadoop
- MongoDB: http://www.mongodb.org/
- scikit-learn – Machine Learning in Python: http://scikit-learn.org/
- mlpy – Machine Learning Python: http://mlpy.sourceforge.net/
Buy this article as PDF
New partnership will bring more and better CS training to US schools
Criminals offer online help over Tor network
Sophisticated malware is still present on Joomla and WordPress sites around the world.
Future versions of Ubuntu's code service will support the popular Git version control system used with Linux and other open source projects.
New release marks the arrival of AMD’s unified driver strategy.
A new study by IDC charts big changes in the big hardware market.
Azure CTO says Redmond has already considered the unthinkable.
Lead developer quells rumors that the Debian version is slated for center stage.
MSBuild is now just another GitHub project as Redmond continues its path to the light.
Malware could pass data and commands between disconnected computers without leaving a trace on the network.