So What Makes a Good Spam Filter Anyway?

By Alan Hearnshaw

Spam Filters. Most of us know we need one. Some of us know we need a better one, but how many stop to think what actually makes a good spam filter in the first place?

This is not just a rhetorical question. It is a question that many users - and many developers - do not ask, and consequently it goes unanswered.

Maybe this could be better answered by defining here the qualities of the perfect spam filter. We'll call our perfect spam filter the "SpamSplatter 3000". Here are some of the defining qualities of "SpamSplatter 3000":

1. It requires zero interaction from the user.
2. It produces zero false positives (good messages identified as bad) and zero false negatives (bad messages identified as good).
3. It is transparent - that is, you only ever see good messages and never need even be aware that spam exists.

That's it. Not much of a shopping list, is it? Of course, "SpamSplatter 3000" hasn't been invented yet (and if it ever is, I want a piece of the action), but it does give us a frame of reference when looking for the best filter we can find.

Let's take each point in turn:

It requires zero interaction from the user

There are two kinds of filters that come near to this ideal currently: Bayesian Filters and Community Filters. Bayesian filters strip messages down to small "word bites", or tokens, and maintain a database containing lists of good and bad tokens. When a new message is encountered, the filter strips this message down to tokens, compares them to the database, and applies a calculation based on the British mathematician Thomas Bayes' probability formula. Over time, the Bayesian filter "learns" the characteristics of spam messages.

Community Filters simply work on a voting system whereby every user who receives a spam message "votes" it as spam. This information is stored on a central server, and when enough votes are received the message is blocked for all users in the community.

As can be seen, the user interaction with these types of filters is mainly limited to a two-button operation - correcting wrongly identified messages - and the more accurate the filter, the less those buttons are used.

OK, so that's pretty good. Not exactly zero interaction, but if the filter is accurate enough, then it should be pretty near. That brings us to point two:

It produces zero false positives or negatives

This is the area in which most spam filter development is concentrating, and things are getting pretty good nowadays. It is not at all unusual to see an efficient modern filter achieve accuracy of 96% or better. It is, of course, far better to have a false negative than a false positive if you are ever going to tear yourself away from the killed mail folder!

Of course, by definition, community filters cannot reach 100% accuracy, since someone has to receive the spam in order to vote it as such! Theoretically, a Bayesian filter may be able to eventually get quite close to 100% accuracy, so at least there is hope there. Content-based filters (those that look for certain words, phrases or other indicators in a message to identify it as spam) will almost certainly not get much higher accuracy figures than the best of them can achieve today. Adapting to changing spam requires new filters to be created on an ongoing basis.

And finally, we come to the holy grail of spam filtering:

It is transparent

Strangely enough, not enough work seems to be done in trying to achieve this goal. Some of the best filters on the market today identify spam with impressive accuracy and then simply place it in a "killed mail" folder for your later perusal. Now, forgive me if I'm missing something here, but isn't the point to save you having to wade through the junk mail? Isn't that what you bought the filter for? With the "SpamSplatter 3000", you don't need to do that.

As we haven't achieved 100% accuracy yet (and probably never will), the only way to free us from checking the killed mail folder is a challenge/response system. This is where a message is automatically sent back to the sender requiring them to take some action for their message to actually be delivered.

Some systems tend to go overboard with the challenge/response system. These systems - often called "Whitelist" systems - block messages from anyone who isn't in the user's friends list. Guaranteed 100% effective, but too drastic a measure for most users.

Now, it seems that the most intelligent use of this system would be to send challenges only for messages that were flagged as "questionable". Good messages can be delivered, definite spam can be deleted, and questionable ones would earn themselves a challenge message.

So, to sum up, let's rewrite the qualities of our perfect filter and get a shopping list of what to look for while we wait for the "SpamSplatter 3000" to arrive:

1. Simple, minimal setup and maintenance.
2. Extremely low rate of false positives and as few false negatives as possible.
3. A transparent "fail-safe" mechanism whereby the victims of those false positives can force the message through to you.

It's simple really. Now, who's going to build me this "SpamSplatter 3000"...?
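
For readers who like to see the moving parts, the Bayesian token approach and the deliver / challenge / delete split described in this article can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions, not how any particular product works: the names TokenFilter, tokenize and triage are made up for this example, the add-one smoothing is one common choice among several, and the 0.2 and 0.9 thresholds are arbitrary values picked to show the idea.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Strip a message down to lowercase word tokens (hypothetical helper)."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TokenFilter:
    """Toy Bayesian filter: counts how many good/spam messages contain each token."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "good": Counter()}
        self.messages = {"spam": 0, "good": 0}

    def train(self, text, label):
        # label is "spam" or "good"; each token counts once per message
        self.tokens[label].update(set(tokenize(text)))
        self.messages[label] += 1

    def spam_probability(self, text):
        # Start from the prior odds, then multiply in one likelihood ratio
        # per token seen in the message (done as a sum of logs).
        # Add-one smoothing is an assumption that keeps unseen tokens neutral.
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["good"] + 1))
        for tok in set(tokenize(text)):
            p_spam = (self.tokens["spam"][tok] + 1) / (self.messages["spam"] + 2)
            p_good = (self.tokens["good"][tok] + 1) / (self.messages["good"] + 2)
            log_odds += math.log(p_spam / p_good)
        log_odds = max(-50.0, min(50.0, log_odds))  # avoid overflow in exp()
        return 1.0 / (1.0 + math.exp(-log_odds))


def triage(probability, deliver_below=0.2, delete_above=0.9):
    """Three-way split; the thresholds are illustrative assumptions."""
    if probability < deliver_below:
        return "deliver"
    if probability > delete_above:
        return "delete"
    return "challenge"  # "questionable": ask the sender to confirm


spam_filter = TokenFilter()
spam_filter.train("cheap pills buy now", "spam")
spam_filter.train("buy cheap watches now", "spam")
spam_filter.train("lunch meeting at noon", "good")
spam_filter.train("project notes for the meeting", "good")

for message in ("buy cheap pills", "meeting notes for lunch", "buy lunch now"):
    print(message, "->", triage(spam_filter.spam_probability(message)))
```

With this tiny training set, the three sample messages come out as delete, deliver and challenge respectively. A message made up entirely of unseen words scores 0.5 and so also earns a challenge, which is the "fail-safe" behaviour the shopping list asks for.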