Joel - recommended by 9 out of 10 nutritionists

Fighting spam with open source software
Posted by Joel (joelh-website at planetjoel dot com) on Wednesday, 28th of February, 2007 - 11:16:23 PM

This article is based on my own experiences and should be directly applicable to anyone fighting spam for a small to medium-sized organization using open source software. It should also be useful for people running their own mail setups at home.

We found that out of the box SpamAssassin was horribly ineffective and was only stopping around 50% of spam. With these simple techniques we managed to get that closer to the 95-99% range. I personally reduced my own spam from 40 messages a day down to around 1-5, which has been very useful for me.

SpamAssassin

- Rules Du Jour

One of the biggest problems SpamAssassin suffers from is that its creators write awesome rulesets which eliminate a huge percentage of known spam in pre-release versions, yet as soon as those rules are released spammers adapt their spam so that the rules are almost useless. Many spammers have automated systems that generate spam, run it through SpamAssassin installations to test whether it will get through, and alter it if it is detected. For this reason, no matter what product you run, you should normally run a modified or tweaked version rather than the stock configuration.

Rules Du Jour is a script that downloads the latest SA rules from rulesemporium.com. The installation notes recommend that you run it via crontab, but I feel uncomfortable about this and tend to run it manually and watch for any bad rules. Remember that, as with virus or IDS signatures, it only takes a few badly written spam signatures to generate lots of false positives. The temptation is to ramp up your efforts as the spammers ramp up theirs, but in truth your systems need to become more accurate as volume grows rather than more aggressive.

Link to Rules du jour bash script.

- Pyzor, Razor and DCC

These programs are all separate collaborative filtering networks: rather than listing IPs, they match checksums or signatures of incoming messages against spam reported by other sites. They all work slightly differently but achieve roughly the same effect, and tend to produce few false positives. SA does not give them overly high scores, so I would recommend running all three. The reason SA scores them so low is a few past incidents with massive false positives (I believe spamcop started saying every IP was spam). They are a good addition to an effective rig.

- Bayesian Filter Training

The SA bayes_auto_learn feature is very cool, and indeed the Bayesian filter is one of SA's most effective mechanisms for catching spam, but it will not automatically get you out of trouble. Begin by training on a good batch from your own mail server, at least 5000 spams and 5000 hams if you have them. The spam you receive will be different from everyone else's, and it's really important that the training mail is as authentic and recent as possible.

You must also work out a way of training the filter on misclassifications, otherwise the auto-learn feature will get away from you. If you don't, it will slowly become less and less effective as it learns spams as hams and isn't corrected. In the past I have set up mechanisms where people can put misclassified mail in a folder, from which it is automatically slurped up by a script and fed back into the learner as ham or spam. Don't worry that SA has seen the message before (or that it may be subtly altered by new Received: or other headers) - SA should store the message-id and know that it is relearning a previously misclassified message.
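
A minimal sketch of such a feedback script, in Python. The folder paths and function names are hypothetical, and it assumes each folder is a plain directory holding one message per file and that sa-learn is on the PATH. The only sa-learn options it relies on are --spam and --ham.

```python
import subprocess
from pathlib import Path

# Hypothetical layout: users drop misclassified mail into these folders.
FEEDBACK_FOLDERS = {
    "spam": Path("/var/mail/feedback/missed-spam"),  # spam that was delivered as ham
    "ham": Path("/var/mail/feedback/not-spam"),      # ham that was flagged as spam
}


def build_learn_command(kind, folder):
    """Return the sa-learn command line that trains `folder` as `kind`."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--" + kind, str(folder)]


def retrain(folders=FEEDBACK_FOLDERS, run=subprocess.run):
    """Feed every existing, non-empty feedback folder back into the learner."""
    commands = []
    for kind, folder in folders.items():
        if folder.is_dir() and any(folder.iterdir()):
            cmd = build_learn_command(kind, folder)
            run(cmd, check=True)  # raises if sa-learn exits with an error
            commands.append(cmd)
    return commands


if __name__ == "__main__":
    for cmd in retrain():
        print("ran:", " ".join(cmd))
```

Run by hand or from cron, this retrains on whatever people have filed since the last run. Emptying the folders afterwards is left out of the sketch.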

I would also highly recommend setting bayes_auto_learn_threshold_spam to 4.5 or even 4; 5 is generally too high. I would also raise the BAYES_99 score to 5, because a lot of spam will only hit this rule and in my experience it has very few false positives. You can generally trust the Bayesian classifier when it says it's 99% sure.
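
The effect of the BAYES_99 change is easy to see with a little arithmetic. SA adds up the scores of every rule a message hits and tags it as spam once the total reaches the required score. The Python sketch below assumes a required score of 5.0; the "stock" rule scores are made up for illustration and are not SA's shipped values.

```python
REQUIRED_SCORE = 5.0  # assumed threshold for tagging a message as spam


def is_spam(hits, scores, required=REQUIRED_SCORE):
    """Sum the scores of the rules hit and compare against the threshold."""
    total = sum(scores[name] for name in hits)
    return total >= required, total


# Illustrative scores only; not SpamAssassin's shipped values.
stock_scores = {"BAYES_99": 3.5, "HTML_MESSAGE": 0.1}
tuned_scores = dict(stock_scores, BAYES_99=5.0)

hits = ["BAYES_99"]  # a message that only trips the Bayesian rule
print(is_spam(hits, stock_scores))  # (False, 3.5)
print(is_spam(hits, tuned_scores))  # (True, 5.0)
```

With the lower score, a message that hits nothing but BAYES_99 slips through; with the score at 5 that single rule is enough on its own.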

- Trust Networks

Set trusted_networks in your local.cf to something sensible (see the documentation for more about this). This will make a huge difference because, I believe, the automatic mechanisms rely on timing and other factors and are horribly inaccurate. I found that roughly 40% of our spam was previously being treated as "trusted". SA should do a better job of making sure you always set this value.

RBLs

Postfix plays really nicely with RBLs by default, and although the percentage gains might seem small on paper, it's about adding up many pieces of the pie rather than finding one magic bullet. The following line makes a nice addition to any main.cf file:

smtpd_recipient_restrictions=XXXXmorestuffhere, reject_rbl_client list.dsbl.org, reject_rbl_client sbl.spamhaus.org
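
For background, an RBL check is just a DNS lookup: the client's IP octets are reversed and prepended to the list's domain, and if that name resolves the client is listed. A rough Python sketch of the idea follows. The function names are my own, it handles IPv4 only, and is_listed performs a live DNS query.

```python
import ipaddress
import socket


def dnsbl_query_name(client_ip, zone):
    """Build the DNS name to look up: reversed IPv4 octets + blocklist zone."""
    octets = str(ipaddress.IPv4Address(client_ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(client_ip, zone):
    """True if the list returns an address for the client (live DNS query).

    A real check would also inspect the returned 127.0.0.x code.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(client_ip, zone))
        return True
    except OSError:  # lookup failed, so the client is not listed
        return False


print(dnsbl_query_name("192.0.2.1", "sbl.spamhaus.org"))
# 1.2.0.192.sbl.spamhaus.org
```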

FuzzyOCR

FuzzyOCR, oh how I love thee. Before installing FuzzyOCR I would say almost 80% of our spam was image-based; now it's rare that I see such images at all. FuzzyOCR works by applying OCR software to the images and looking for words in its blocklist - however, its real power appears to be in checking whether the JPGs or GIFs are valid or match certain characteristics. For example, many spammers use animated GIFs to hide the true text, something that businesses are very unlikely to use in a footer.

I recommend installing this with both gocr and ocrad because they have their own strengths and weaknesses and work quite differently. It is quite a pain to install on Debian because much of the software needs to be installed from source (and boy is it a pain to get it working in a chroot'd environment - but that's life). It's well worth the effort.

The only other thing I would say about this tool is to make sure you edit the blocklist, because the default blocklist includes some really stupid words, such as "company" and "international". The default scores are also rather high; in the end I decided to leave most of them as defaults, but watch this one like a hawk depending on your business and clients.
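
The word-matching step has to tolerate OCR mistakes and deliberate misspellings, which is where the "fuzzy" comes in. The Python sketch below illustrates the idea using difflib from the standard library as a stand-in; it is not FuzzyOCR's actual matching code, and the word list and threshold are invented for the example.

```python
from difflib import SequenceMatcher

WORD_LIST = ["viagra", "mortgage"]  # hypothetical blocklist entries
THRESHOLD = 0.8                     # hypothetical similarity cut-off


def fuzzy_hits(ocr_text, words=WORD_LIST, threshold=THRESHOLD):
    """Return the list words that approximately match a token of the OCR output."""
    tokens = ocr_text.lower().split()
    hits = set()
    for word in words:
        for token in tokens:
            if SequenceMatcher(None, word, token).ratio() >= threshold:
                hits.add(word)
                break  # one matching token is enough for this word
    return sorted(hits)


print(fuzzy_hits("Cheap V1agra shipped today"))  # ['viagra']
```

A looser threshold catches more obfuscation but also more innocent words, which is the same trade-off you face when editing the real word list.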

Client Filtering

Yes, client filtering is kind of lame, and it's probably a sign that your organization is not doing a good enough job. That said, it makes some sense. Mail servers only have X amount of time to process each message and must be very sure before classifying a message as spam, in case there is a noob on the other end who isn't adept enough to look inside two folders for their mail. With personal client filtering you can afford more processing time, and the filter can be adjusted to the person's preferences and the amount of spam they receive.

For Linux I find Thunderbird's junk mail filter is absolutely incredible at identifying spam. It is way overzealous compared to SA, but hey, that's its job and I turned it on. For Windows I would recommend the SpamBayes plugin for Outlook.

SPF

SPF is a technology that works through TXT DNS records (AFAIK previously basically unused). It's a standard that allows you to specify which of your IPs may deliver mail. Many people are under the (very) mistaken assumption that company X has one or two "mail servers" which all do "email", be it sending, receiving, filtering, etc. In reality companies of any decent size tend to have a farm of filters, a farm of receivers, a farm of sorters and a farm of senders. For this reason it's commonplace for IPs other than those listed in A or MX records to be sending out mail, and of course it's then very difficult for the receiver to work out whether an address is being spoofed. SPF allows an organization to say "only these IPs will send email from my domain, otherwise it's probably spoofed".
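
To make this concrete, an SPF policy is a TXT record such as "v=spf1 ip4:192.0.2.0/24 -all". The Python sketch below evaluates a heavily simplified version of such a record: it understands only the ip4, ip6 and all mechanisms, whereas real SPF also has a, mx, include and more. The function name is my own, and the record and addresses are documentation examples.

```python
import ipaddress


def check_spf(record, sender_ip):
    """Evaluate a *simplified* SPF record: only ip4:, ip6: and all are handled."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"  # not an SPF record at all
    ip = ipaddress.ip_address(sender_ip)
    results = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}
    for term in terms[1:]:
        qualifier = "+"  # a mechanism without a qualifier means pass
        if term[0] in results:
            qualifier, term = term[0], term[1:]
        name, _, value = term.partition(":")
        name = name.lower()
        if name in ("ip4", "ip6"):
            network = ipaddress.ip_network(value, strict=False)
            if ip.version == network.version and ip in network:
                return results[qualifier]
        elif name == "all":
            return results[qualifier]
        # Other mechanisms (a, mx, include, ...) are skipped in this sketch.
    return "neutral"  # nothing matched


record = "v=spf1 ip4:192.0.2.0/24 -all"
print(check_spf(record, "192.0.2.25"))    # pass
print(check_spf(record, "198.51.100.7"))  # fail
```

The first address falls inside the published range, so the mail passes; the second falls through to "-all" and is treated as a likely spoof.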

SPF won't decrease the amount of spam you receive, but it's good internet practice. You should also set your mailname to an externally resolvable DNS entry that ideally maps back to your IP, and you should set the reverse DNS on that IP to something sensible. In many cases, however, this is not possible.

Summary

Open source software is very powerful, but it is widely deployed because it's free, and it almost suffers from its own size. Its freedom also means spammers can use it for free and see all the source code, so they can quickly find ways around it. It is highly effective but requires serious tweaking.

If you only make one tweak, install FuzzyOCR. It's great.

Aside: DSPAM

I don't discuss DSPAM in this article because I haven't really used it. But you should certainly investigate it and see whether it suits your needs.
