Bayesian Anti-Spam Filters - Glue & Duct Tape for an End-Of-Life Technology
Learn why protecting Microsoft Exchange Server, Lotus Domino or Novell GroupWise with a Bayesian filter is a faulty approach destined for disappointment.
What the theory says Bayesian should be
Bayesian filtering is based on a mathematical theory which holds that, by assigning spam probabilities to individual trigger words within a message, the message as a whole can be classified as spam or not spam. According to the theory, the filter learns from experience, assigning progressively more accurate spam probabilities to new and existing words (tokens) based on historic message content. The filter then aggregates these individual values. If the aggregate reaches an assigned threshold, the message is classified as spam. However, Bayesian theory leaves out a fairly significant component: it ignores implementation.
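To make the theory concrete, here is a minimal sketch of the idea in Python. It is an illustration only, not the code of any particular product: the class name, the whitespace tokenizer, the add-one smoothing and the 0.9 threshold are all assumptions chosen for brevity. Per-token probabilities are combined with the common "product of p over product of p plus product of (1 - p)" rule.

```python
from collections import Counter


class TinyBayesFilter:
    """Illustrative token-probability filter (hypothetical, not a product API)."""

    def __init__(self, threshold=0.9):
        self.spam_counts = Counter()  # token -> number of spam messages containing it
        self.ham_counts = Counter()   # token -> number of non-spam messages containing it
        self.spam_msgs = 0
        self.ham_msgs = 0
        self.threshold = threshold    # assumed cut-off; real filters tune this

    @staticmethod
    def tokenize(text):
        # Naive tokenizer: lower-case, split on whitespace, one vote per token.
        return set(text.lower().split())

    def train(self, text, is_spam):
        # "Learning": update the per-token counts from a labeled message.
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.spam_msgs += 1
        else:
            self.ham_counts.update(tokens)
            self.ham_msgs += 1

    def token_probability(self, token):
        # Add-one smoothing keeps every probability strictly between 0 and 1.
        spam_rate = (self.spam_counts[token] + 1) / (self.spam_msgs + 2)
        ham_rate = (self.ham_counts[token] + 1) / (self.ham_msgs + 2)
        return spam_rate / (spam_rate + ham_rate)

    def spam_score(self, text):
        # Aggregate the individual token probabilities into one message score.
        spam_product = 1.0
        ham_product = 1.0
        for token in self.tokenize(text):
            p = self.token_probability(token)
            spam_product *= p
            ham_product *= 1.0 - p
        return spam_product / (spam_product + ham_product)

    def is_spam(self, text):
        return self.spam_score(text) >= self.threshold
```

A filter built this way is only as good as its training data and its tokenizer. Long messages would also need the products computed in log space to avoid numeric underflow, which this sketch omits.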
What Bayesian actually is, as implemented today
As implemented in most of today's spam filtering products, Bayesian filtering is virtually nothing more than a halfhearted implementation of a theory with a fancy name. It simply aggregates the trigger words found within an email message. Its "learning" from email traveling through the filter is rudimentary and error prone (if implemented at all), and produces "sometimes it works - sometimes it doesn't" behavior which the administrator has no way of qualifying or troubleshooting. Add to this the fact that it does not aggregate its spam qualification values in a particularly intelligent, consistent or meaningful way, and you have an anti-spam solution which can produce an ever-increasing spiral of frustration.
Why many spam filter vendors have implemented Bayesian filtering
Most spam filtering products currently on the market are keyword/keyphrase based filters. These filters were fairly effective in stopping spam two years ago, although they have always exhibited an unacceptably high false-positive rate. However, spammers have been busy developing custom software to generate their spam, which hides these keywords and phrases in increasingly sophisticated ways. To make matters worse, the spamming community actually publishes these keywords on the Internet, so that spammers can avoid their use. This has resulted in keyword/keyphrase filters becoming virtually ineffective in stopping spam.
Confronted with the harsh reality that their products' entire infrastructure is built around an outdated, ineffective paradigm, the keyword spam filter vendors decided to hitch their wagon to a small portion of the Bayesian theory. They theorized that by assigning a score to each of their existing keywords, and then aggregating those scores across every keyword hit in a message, they could prolong the life of their failing products.
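The keyword-scoring approach described above can be sketched in a few lines of Python. The keyword list, the weights and the threshold are invented for illustration and are not taken from any vendor's product. Note that nothing here is learned from message history, which is the part of the Bayesian theory these filters leave out.

```python
# Hypothetical keyword weights and threshold, chosen only for illustration.
KEYWORD_SCORES = {
    "viagra": 5,
    "winner": 3,
    "free": 2,
    "click here": 2,
}
SPAM_THRESHOLD = 6


def keyword_score(message):
    """Sum the fixed weight of every keyword hit in the message."""
    text = message.lower()
    # str.count does plain substring matching, so "free" also hits "freedom".
    return sum(weight * text.count(keyword)
               for keyword, weight in KEYWORD_SCORES.items())


def is_spam(message):
    return keyword_score(message) >= SPAM_THRESHOLD
```

With these assumed weights, a single hit on "free" stays below the threshold, which is how aggregation lowers false positives, while a message combining several keywords crosses it. A cloaked spelling such as "v*i(a)g-r-a" scores zero because no keyword is visible.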
Advantages of their pseudo-Bayesian approach
1) Lower false positive rates than keyword filters alone (it takes more keyword hits to classify a message as spam).
2) Slightly higher spam identification rate than keywords alone.
Problems with their pseudo-Bayesian approach
1) Significantly increased system resource usage: what used to take one pass now takes as many as 10-15 passes to aggregate the total point value needed to identify a message as spam, or to clear it as ok.
2) Can't identify cloaked spam (which is generally the most vile spam), such as messages using "v*i(a)g-r-a", bogus HTML tags or more sophisticated cloaking.
3) Still dependent upon clearly visible and obvious keywords and keyphrases.
4) No method of determining why a particular message was caught by the filter, making it impossible to tune the filter intelligently afterwards for optimal spam recognition.
5) Blind "training" and retraining of the Bayesian filter usually produces unpredictable results and often degrades the filter's ability to correctly identify future spam.
A better approach - pattern-based filtering
As spammers turn significant energy to building software which cloaks their spam to hide the trigger words and phrases from Bayesian filters, the resulting messages become progressively simpler for pattern-based scanners to correctly identify as spam. This is because pattern-based scanners can identify spam by targeting the cloaking techniques themselves, rather than trying to identify spam based on words and phrases.
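As a rough illustration of targeting the cloaking technique rather than the word, the Python sketch below flags two tricks: letters separated by stray symbols (as in "v*i(a)g-r-a") and an HTML comment wedged into the middle of a word. Both regular expressions and all names are assumptions made for this example. They do not represent how any specific pattern-based product works, and a production scanner would need far more patterns and safeguards against false positives.

```python
import re

# Hypothetical cloaking patterns, for illustration only.
CLOAKING_PATTERNS = {
    # Three or more single letters, each followed by one or two symbols,
    # then a final letter: matches "v*i(a)g-r-a" whatever word it hides.
    "letters split by symbols": re.compile(
        r"(?:[A-Za-z][^A-Za-z0-9\s]{1,2}){3,}[A-Za-z]"
    ),
    # An HTML comment placed between two letters of a word.
    "HTML comment inside a word": re.compile(
        r"[A-Za-z]<!--.*?-->[A-Za-z]", re.DOTALL
    ),
}


def find_cloaking(message):
    """Return the names of the cloaking patterns found in the message."""
    return [name for name, pattern in CLOAKING_PATTERNS.items()
            if pattern.search(message)]
```

Because the result names the pattern that matched, an administrator can see why a message was flagged. Ordinary hyphenated phrases such as "state-of-the-art" do not match the first pattern, although strings like "a-b-c-d" would.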
Bottom Line
Bayesian filtering as implemented today is little more than a "duct tape and glue" solution that gives an outdated, ineffective paradigm a temporary life extension. A better and more successful approach is an anti-spam solution whose architecture is built on pattern recognition. We recommend EMP 5 (Extensible Messaging Platform), which is a leading pattern-based anti-spam solution and is recommended by all the major mail server vendors.