Choosing a Spam Filter
Spam Filter Mechanics
Spam filters have different modes of operation. Understanding how they work can help you choose which one to use.
These days, the choice of spam filters comes down to Bogofilter and SpamAssassin. Other choices, like DSPAM, are no longer in development. Although a few other choices (e.g., SpamBayes) are available, when an email reader offers a plugin, it is almost always for either Bogofilter or SpamAssassin. However, what is less often discussed is which filter is the best to use in which circumstances.
Instead, most users simply nod solemnly when they read that both involve “Bayesian filtering.” Most of us – including many who use the phrase – have no idea what Bayesian filtering is, but it sounds scientific and reassures us that either choice is acceptable.
In fact, learning that Bogofilter and SpamAssassin are “Bayesian” is useless for choosing between them. To call them Bayesian means nothing more than that their structure is based on the 18th-century work of Thomas Bayes in statistics and probability. More specifically, both apply Bayes’ work by collecting words and assigning a probability that each word indicates spam. The more suspect words contained in an email, the greater the chances it is spam. However, to make an informed choice between spam filters requires considerably more detail.
Bogofilter has its roots in “A Plan for Spam,” a 2002 essay by English developer Paul Graham. After trying to develop filters based on the identifying characteristics of spam, Graham concluded that beyond a certain point, the more rules he added, the more false positives he obtained – that is, the more email messages that were incorrectly identified as spam.
Graham’s solution was to parse his samples of spam and non-spam into tokens, or individual words, and use Bayesian tools to assign each token a probability that it indicates spam, biasing the results slightly in favor of not being spam to minimize false positives. By examining the top 15 tokens in the header and body of each new email message (those whose probabilities lie furthest from a neutral 0.5), he calculated the probability that the message was spam. If the probability was greater than 0.9, the message was considered spam.
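The scoring described above can be sketched in a few lines of Python. This is a simplified illustration, not Graham's or Bogofilter's actual code. The function names, the token regular expression, and the tiny training sets in the usage note are hypothetical. The doubling of ham counts, the clamping of probabilities to the range 0.01 to 0.99, and the default of 0.4 for unseen tokens follow the essay as best recalled and should be treated as assumptions. The essay's rule of ignoring rarely seen tokens is left out for brevity.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of letters, digits, and a few symbols.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text):
    """Split a message into lowercase tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def train(spam_msgs, ham_msgs):
    """Return a dict mapping each token to its estimated spam probability."""
    spam_counts = Counter(t for m in spam_msgs for t in tokenize(m))
    ham_counts = Counter(t for m in ham_msgs for t in tokenize(m))
    n_spam = max(len(spam_msgs), 1)
    n_ham = max(len(ham_msgs), 1)
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        bad = spam_counts[tok]
        good = 2 * ham_counts[tok]  # bias toward ham to limit false positives
        spam_freq = min(1.0, bad / n_spam)
        ham_freq = min(1.0, good / n_ham)
        p = spam_freq / (spam_freq + ham_freq)
        probs[tok] = min(0.99, max(0.01, p))  # never fully certain
    return probs


def spam_probability(message, probs, top_n=15, unknown=0.4):
    """Combine the top_n tokens whose probabilities lie furthest from 0.5."""
    token_probs = [probs.get(t, unknown) for t in set(tokenize(message))]
    interesting = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)
    prod_spam = prod_ham = 1.0
    for p in interesting[:top_n]:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)
```

To try it, call `train()` with two small lists of message strings, then compare `spam_probability(new_message, probs)` against the 0.9 threshold. A message with no known tokens comes out near neutral, which is why training matters so much for this kind of filter.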
According to Graham, the advantage of this statistical approach is that it refers to something real – the probability of being spam – and works with both neutral and spam-indicating words.
However, he also recognized that the more personalized the filter was, the more accurate it would be. For this reason, he also included the possibility of using white lists to indicate non-spam, or “ham,” and black lists to indicate spam.
After reading Graham’s essay, Eric S. Raymond founded the Bogofilter project. Today, Bogofilter is maintained by other developers, and it has refined Graham’s calculations based on Gary Robinson’s suggestions. The modern refinements include recognizing MIME types, treating each hostname and IP address as a separate token (rather than dividing them up into separate words), and ignoring dates and Message-IDs as irrelevant. However, the basic approach remains that advocated by Graham.
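To see why keeping hostnames and IP addresses whole matters, consider the following sketch of a tokenizer. It is not Bogofilter's lexer. The regular expression and the `tokenize` name are hypothetical, and the pattern is deliberately loose (for instance, it would also treat a decimal number as a "hostname").

```python
import re

# Alternatives are tried in order, so addresses and hostnames win over words.
TOKEN_RE = re.compile(
    r"\d{1,3}(?:\.\d{1,3}){3}"              # IPv4 address, e.g. 192.168.0.1
    r"|[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"   # dotted hostname, e.g. mail.example.com
    r"|[A-Za-z0-9$'-]+"                     # ordinary word
)


def tokenize(text):
    """Return lowercase tokens, keeping hostnames and IP addresses intact."""
    return [t.lower() for t in TOKEN_RE.findall(text)]
```

With a plain word splitter, `mail.example.com` would dissolve into three common words that say little about the sender. Kept whole, it becomes a single token whose spam probability can be learned on its own.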
The mathematically inclined can learn more about how Bogofilter assigns the probability of an email being spam by following the links and reading the man page for the filter. However, the most important point for the average user is that Bogofilter relies on statistical probability, supplemented by each user’s list of spam and ham. Advocates of this approach emphasize its simplicity, as well as its lower number of false positives once it is trained – that is, once the white and black lists are produced. These lists are contained in the .bogofilter folder in your home directory.
SpamAssassin takes a different approach from Bogofilter. SpamAssassin’s main approach is to identify the characteristics of spam and then run tests to locate them. Many tests, although not all, rely heavily on regular expressions to catch variations of words and phrases.
You can view the rule files used by SpamAssassin in /usr/share/spamassassin. More than 50 are listed in my current installation of Debian Stable. From their number alone, you can tell they are a varied lot, but they include tests for the common indicators of spam in headers, in the bodies of email, and in HTML code, as well as tests for recognizing offers for anti-viruses, drugs, and pornography. In the English version, some basic tests for French, German, and Italian are also included. They also include a Bayesian probability test similar to Bogofilter’s, as well as white and black lists for individual customization.
Additionally, /etc/spamassassin includes a test developed for Debian that looks for spam involving anacron, cron, and debconf, as well as plugins installed with each recent version.
Each test assigns an email message a positive or negative value, which is added to the results of the other tests to determine whether the message is ham or spam. Unlike Bogofilter’s probabilities, exactly what these values represent is unclear, although, because many users probably have no understanding of Bayesian analysis, much the same could be said of Bogofilter.
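The additive scoring can be illustrated with a minimal Python sketch. It is not SpamAssassin's code or rule syntax. The rule names, patterns, and scores are invented for illustration, and the threshold of 5.0 is an assumption modeled on SpamAssassin's usual default.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern
    score: float  # positive leans toward spam, negative toward ham


# Hypothetical rules; real rule sets are far larger and more carefully tuned.
RULES = [
    Rule("HYPOTHETICAL_PILLS", re.compile(r"\bch[e3]ap\s+p[i1]lls\b", re.I), 2.5),
    Rule("HYPOTHETICAL_URGENT", re.compile(r"\bact\s+now\b", re.I), 1.5),
    Rule("HYPOTHETICAL_SHOUTING", re.compile(r"^Subject:\s*[A-Z !]{10,}$", re.M), 1.2),
    Rule("HYPOTHETICAL_KNOWN_LIST", re.compile(r"^List-Id:", re.M | re.I), -1.0),
]


def score_message(raw, rules=RULES):
    """Return the summed score and the names of the rules that matched."""
    hits = [r for r in rules if r.pattern.search(raw)]
    return sum(r.score for r in hits), [r.name for r in hits]


def is_spam(raw, threshold=5.0):
    """Classify a raw message by comparing its total score to the threshold."""
    total, _ = score_message(raw)
    return total >= threshold
```

Note how the regular expressions catch simple disguises such as "ch3ap p1lls" and how a negative score for mailing-list headers can offset a suspicious phrase. No single rule decides the outcome; only the total does.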
With all these tests, SpamAssassin exemplifies the basic security principle of “defense in depth.” Unlike Bogofilter, it does not rely on one or two approaches, but on a wide variety of defenses. A piece of spam might slip by a single SpamAssassin test, but the odds of it slipping by all of them are low.
Context is Everything
Both Bogofilter and SpamAssassin are available as plugins for major email readers and generally require little customization. Both also have high success rates. However, because black and white lists greatly improve each filter’s accuracy, be wary of the various comparisons online. Your own results are likely to be very different from those posted, especially before you have trained the filter to suit your personal email.
In fact, the filters are so different in their approaches and so dependent on how they are trained that deciding in any objective sense which one is most effective is almost impossible. To some extent, your decision as to which filter to use may depend on whether you prefer Bogofilter’s single, all-encompassing approach or SpamAssassin’s defense in depth.
Even more importantly, your choice will depend on context. To start, if the speed of filtering matters, Bogofilter is much faster than SpamAssassin for the simple reason that it runs fewer tests. If you ordinarily receive several hundred email messages in the first download of the day, SpamAssassin runs so many tests that you might be unable to access your email for five minutes – a delay that you might consider worse than manually deleting spam.
By contrast, in my experience, Bogofilter requires several days of training before it reaches full effectiveness. On the one hand, stopping to train Bogofilter in the middle of other tasks can be a nuisance, especially because it seems to require several examples before it recognizes posts on a mailing list as ham. On the other hand, SpamAssassin is so comprehensive that it generally identifies spam more accurately without training. If you prefer to minimize training, SpamAssassin is probably the filter you want.
Another consideration is how many false positives you have once your filter of choice has been trained. My experience is that, once trained, Bogofilter has fewer false positives. Just as Graham observed, adding more rules, the way SpamAssassin does, beyond a certain point seems to increase false positives.
Still another consideration is that SpamAssassin is reactive. It adds tests in response to the latest tactics used by spammers but appears to be slower to discard tests that are no longer needed – if it does so at all. Similarly, if new spamming tactics appear, you might temporarily have less effective filtering until a new software release is made. However, because Bogofilter relies on probability rather than on spam characteristics, it might not have the same problems – at least not to the same extent.
As you can see, the decision of which filter to use has no absolute answer. However, once you understand how both filters work, you can at least make a more informed choice to accommodate your preferences and your needs. If nothing else, you can choose the lesser of two evils.