So What Makes a Good Spam Filter Anyway?
By Alan Hearnshaw

Spam Filters. Most of us know we need one. Some of us know we need a better one, but how many stop to think what actually makes a good spam filter in the first place?

This is not just a rhetorical question. It is a question that many users - and many developers - do not ask, and consequently it goes unanswered.

Maybe this could be better answered by defining here the qualities of the perfect spam filter. We'll call our perfect spam filter the "SpamSplatter 3000". Here are some of the defining qualities of the "SpamSplatter 3000":

1. It requires zero interaction from the user.
2. It produces zero false positives (good messages identified as bad) and zero false negatives (bad messages identified as good).
3. It is transparent - that is, you only ever see good messages and never need even be aware that spam exists.

That's it. Not much of a shopping list, is it? Of course, the "SpamSplatter 3000" hasn't been invented yet (and if it ever is, I want a piece of the action), but it does give us a frame of reference when looking for the best filter we can find.

Let's take each point in turn:

It requires zero interaction from the user

There are two kinds of filters that currently come near to this ideal: Bayesian filters and community filters. Bayesian filters strip messages down to small "word bites", or tokens, and maintain a database containing lists of good and bad tokens. When a new message is encountered, the filter strips this message down to tokens, compares them to the database, and applies a formula based on Bayes' theorem, the probability rule named after the 18th-century English mathematician Thomas Bayes. Over time, the Bayesian filter "learns" the characteristics of spam messages.

Community filters simply work on a voting system whereby every user that receives a spam message "votes" it as spam. This information is stored on a central server, and when enough votes are received the message is blocked for all users in the community.

As can be seen, the user interaction with these types of filters is mainly limited to a two-button operation - correcting wrongly identified messages - and the more accurate the filter, the less those buttons are used.

OK, so that's pretty good. Not exactly zero interaction, but if the filter is accurate enough, then it should be pretty near. That brings us to point two:

It produces zero false positives or negatives

This is the area in which most spam filter development is concentrating, and things are getting pretty good nowadays. It is not at all unusual to see an efficient modern filter achieve accuracy of 96% or better. It is, of course, far better to have a false negative than a false positive if you are ever going to tear yourself away from the killed mail folder!

Of course, by definition, community filters cannot reach 100% accuracy, as someone has to be getting the spam in order to vote it as such! Theoretically, a Bayesian filter may eventually get quite close to 100% accuracy, so at least there is hope there. Content-based filters (those that look for certain words, phrases or other indicators in a message to identify it as spam) will almost certainly not get much higher accuracy figures than the best of them can achieve today. Adapting to changing spam requires new filters to be created on an ongoing basis.

And finally, we come to the holy grail of spam filtering:

It is transparent

Strangely enough, too little work seems to be done in trying to achieve this goal. Some of the best filters on the market today identify spam with impressive accuracy and then simply place it in a "killed mail" folder for your later perusal. Now, forgive me if I'm missing something here, but isn't the point to save you having to wade through the junk mail? Isn't that what you bought the filter for? With the "SpamSplatter 3000", you don't need to do that.

As we haven't achieved 100% accuracy yet (and probably never will), the only way to free us from checking the killed mail folder is a challenge/response system. This is where a message is automatically sent back to the sender requiring them to take some action for their message to actually be delivered.

Some systems tend to go overboard with the challenge/response system. These systems - often called "Whitelist" systems - block messages from anyone that isn't in the user's friends list. Guaranteed 100% effective, but too drastic a measure for most users.

Now, it seems that the most intelligent use of this system would be to send challenges only to the senders of messages that were flagged as "questionable". Good messages can be delivered, definite spam can be deleted, and questionable ones would earn themselves a challenge message.

So, to sum up, let's rewrite the qualities of our perfect filter and get a shopping list of what to look for while we wait for the "SpamSplatter 3000" to arrive:

1. Simple, minimal setup and maintenance.
2. Extremely low rate of false positives and as few false negatives as possible.
3. A transparent "fail-safe" mechanism whereby the victims of those false positives can force the message through to you.

It's simple really. Now, who's going to build me this "SpamSplatter 3000"...?
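
For the curious, here is a rough sketch of how the two main ideas in this article might fit together: a Bayesian filter that scores a message from its tokens, and a "deliver, challenge or delete" decision based on that score. It is only an illustration, not how any particular product works. The class name `TinyBayesFilter`, the tokenizer, the add-one smoothing, the treatment of tokens as independent of each other, and the 0.1 and 0.9 thresholds are all assumptions made up for this example.

```python
import math
import re
from collections import Counter


class TinyBayesFilter:
    """Toy token-based spam scorer (hypothetical, for illustration only)."""

    def __init__(self):
        # token -> number of spam / good training messages containing it
        self.spam_tokens = Counter()
        self.good_tokens = Counter()
        self.spam_count = 0
        self.good_count = 0

    @staticmethod
    def tokenize(text):
        # Assumed tokenizer: lowercase runs of letters, digits, "$" and "'".
        # A set is used so each token counts once per message.
        return set(re.findall(r"[a-z0-9$']+", text.lower()))

    def train(self, text, is_spam):
        """Record one message the user has marked as spam or as good."""
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_count += 1
        else:
            self.good_tokens.update(tokens)
            self.good_count += 1

    def spam_probability(self, text):
        """Estimate P(spam | tokens) with Bayes' theorem.

        Assumes tokens are independent given the class (the "naive"
        simplification) and only looks at tokens present in the message.
        """
        # Prior odds of spam, with add-one smoothing.
        log_odds = math.log((self.spam_count + 1) / (self.good_count + 1))
        for token in self.tokenize(text):
            # Smoothed chance of seeing this token in a spam / good message.
            p_spam = (self.spam_tokens[token] + 1) / (self.spam_count + 2)
            p_good = (self.good_tokens[token] + 1) / (self.good_count + 2)
            log_odds += math.log(p_spam / p_good)
        # Clamp to keep exp() within range, then convert odds to probability.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


def triage(probability, good_below=0.1, spam_above=0.9):
    """Three-way decision; the thresholds are arbitrary example values."""
    if probability < good_below:
        return "deliver"      # confident it is good mail
    if probability > spam_above:
        return "delete"       # confident it is spam
    return "challenge"        # questionable: ask the sender to confirm


if __name__ == "__main__":
    spam_filter = TinyBayesFilter()
    spam_filter.train("cheap pills buy now", is_spam=True)
    spam_filter.train("buy cheap watches now", is_spam=True)
    spam_filter.train("meeting agenda for monday", is_spam=False)
    spam_filter.train("lunch on monday with the team", is_spam=False)
    for message in ("buy cheap pills", "monday meeting agenda", "hello there"):
        score = spam_filter.spam_probability(message)
        print(message, round(score, 3), triage(score))
```

The two calls to `train` per class stand in for the "two-button operation" described earlier: each correction by the user adjusts the token counts. Widening the gap between the two thresholds sends more challenges and risks fewer false positives, while narrowing it does the reverse.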