If you are looking for the vm-bogofilter.el file, it is there!

The history of this file is the following: On April 25 (2003), I adapted the original code of vm-spamassassin.el by Markus Mohnen to use bogofilter rather than spamassassin. On September 12 (2003), another emacs/VM enthusiast (Alain Cochard) informed me that someone else (Bjorn Knutsson) had done a similar job under the same name (vm-bogofilter.el). Comparing the two versions, we quickly found out that Bjorn's version was far better, and the link above leads to the current version of the vm-bogofilter.el file as found on Bjorn's website.

This file is intended for people who read their mail via POP and cannot (or don't want to) use fetchmail. In other cases, it is recommended to call bogofilter via procmail, which is much faster than calling it from within emacs/VM.
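For reference, a minimal procmail recipe of the kind alluded to above might look like the following sketch. The exact X-Bogosity header value and the bogofilter flags depend on the bogofilter version, and the "spam" folder name is just an example:

```
# Pass every message through bogofilter: -p rewrites the message with an
# X-Bogosity header, -u updates the wordlists with the classification,
# -e makes bogofilter exit with status 0 for non-spam as well as spam.
:0fw
| bogofilter -u -e -p

# If bogofilter itself failed, ask the MTA to retry delivery later.
:0e
{ EXITCODE=75 HOST }

# File messages tagged as spam into a separate folder.
:0:
* ^X-Bogosity: Yes, tests=bogofilter
spam
```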

The spam curve below shows how many spams I receive per day (on two email addresses, but one of them is already strongly filtered). The red curve is the number of spams which pass through bogofilter 0.92.6. I am updating this curve from time to time... (but not too often!)

Note that since the beginning of the period covered by this figure I have updated bogofilter, more or less following the stable releases, and retrained the database about six times (in recent versions I have been using the script bogominitrain.pl provided with bogofilter). Today my wordlist database is about 4.8 MB. I don't use systematic retraining on errors. I don't plot false positives since I have experienced only 6 of them since April 2003, when I started using bogofilter (note that this could be an underestimate, since I don't have time to check all my spam very carefully).




Author: Olivier Cappé <comcap at free dot fr>
Date: March 13, 2003
Revision: September 12, 2003

I wrote this to summarize my experience with spam filters and to provide some elements to the folks reading the bogofilter "Open discussion" forum (on sourceforge) who may have read some of my earlier postings. Note that starting with version 0.15 of bogofilter, the problem described below should not happen anymore since bogofilter is now capable of parsing whole mbox files.

Background

I am very much upset by spam (for obvious reasons) and also interested (for professional reasons) in statistically based spam filters.

spamassassin

I started with spamassassin which, despite a large improvement between versions 2.20 and 2.44 (the last one I used), still misses a large fraction of spams. My own mileage was 18 missed spams out of 109 (for 2.20) and 59 out of 586 (for 2.44), which is still around 10%.

spamprobe

I then used spamprobe, which relies on statistical methods based on word counts (see the project page for links). This one is very good, with 5 missed spams out of 295 and no false positives during the time I used it. This program also has a lot of very nice features (such as MD5 hashing, which ensures that the same email is not fed to the training procedure several times). The only problems with spamprobe are size and efficiency: the executable (GNU Linux) is 2.5 MB, and after two weeks of use, the word count database had grown to 42.6 MB. Depending on the way you call the program (I call it whenever I download a message from a POP spool file), this can be quite a problem.

bogofilter

My last venture to date has been bogofilter. I used the same database of spam and valid email as I had used before with spamprobe, and my first impression was not very positive. As I wrote there (Invalid spam_cutoff value for fisher):

A strange problem

The problem is that the missed detection rate is very high with the (default) fisher option.

The missed detection figures on a spam testing sample (295 spams) are:

fisher (-f)   81 / 295
graham (-g)    5 / 295
robinson (-r) 10 / 295

On a (small) testing sample of valid email, the false positive rates with the same options are:

fisher (-f)   0 / 47
graham (-g)   1 / 47
robinson (-r) 1 / 47

To see what was happening, I plotted the histogram of the scores for all three methods and each mail/spam combination. My conclusion from this was that only the graham option was comparable to what I had used before.
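The kind of score histogram mentioned above is easy to reproduce. Here is a minimal sketch; the score values are invented for illustration and are not the actual data:

```python
from collections import Counter

def score_histogram(scores, n_bins=10):
    """Bucket classifier scores (between 0 and 1) into n_bins equal-width bins."""
    hist = Counter()
    for s in scores:
        # min() keeps a score of exactly 1.0 inside the last bin
        hist[min(int(s * n_bins), n_bins - 1)] += 1
    return [hist[b] for b in range(n_bins)]

# Invented scores, only to show the shape one looks for: ham scores
# piled up near 0, spam scores piled up near 1.
ham_scores = [0.01, 0.02, 0.05, 0.11, 0.31]
spam_scores = [0.72, 0.88, 0.91, 0.97, 0.99]

print(score_histogram(ham_scores))
print(score_histogram(spam_scores))
```

A well-separated pair of histograms is what makes it possible to place a cutoff with few errors.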

I was wrong! (but hold on...)

But I then realized that I had a flaw in my database: because of an incorrect use of formail, some (not all, but a large portion) of the mails in my database were analyzed as single emails when they should have been split into several separate ones (this wasn't a problem with spamprobe, whose -m switch makes it possible to process mboxes instead of single emails). It meant, in particular, that the mail headers were incorrectly analyzed. I re-ran the training and test after correcting the error, and obtained these numbers:
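The mistake described above, feeding a whole mbox file to the classifier as if it were a single message, is easy to avoid by splitting the mbox first. A minimal sketch using Python's standard mailbox module (the idea of piping each message to the classifier is an assumption for illustration, not part of my actual setup):

```python
import mailbox

def split_mbox(path):
    """Yield each message of an mbox file as a separate string, so that
    headers are analyzed per message rather than once for the whole file."""
    for msg in mailbox.mbox(path):
        yield msg.as_string()

# Each item could then be fed individually to the trainer, e.g.
#   subprocess.run(["bogofilter", "-s"], input=text.encode())
```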
Missed detections (295 spams)

fisher (-f)   4 / 295
graham (-g)   5 / 295
robinson (-r) 1 / 295
and all three methods give no false positives on 56 valid emails. In addition, the histograms, for fisher in particular, now look much better. Furthermore, the margins (the difference between the lowest score for spam and the highest score for valid email) are:
fisher (-f)   4E-5
graham (-g)   -6E-6
robinson (-r) 0.032
which means that both fisher and robinson are clearly preferable to graham, as their thresholds could be tuned to give exactly 0 errors, whereas there is no way of getting rid of the 5 errors of the graham method without triggering several false positives. Although not statistically compelling, these results (both the number of errors and the margins) suggest that the robinson method, which was the default in bogofilter up to version 0.11, is actually better than fisher, which is now the default.
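The margin figures above follow directly from the score lists. A minimal sketch of the computation (the scores below are invented for illustration; only the definition matters):

```python
def margin(spam_scores, ham_scores):
    """Difference between the lowest spam score and the highest ham score.
    A positive margin means some cutoff separates the classes with zero
    errors; a negative one means no such cutoff exists."""
    return min(spam_scores) - max(ham_scores)

# Invented scores with a positive margin: any threshold between 0.68 and
# 0.72 classifies everything correctly, as for fisher and robinson above.
print(margin([0.72, 0.90, 0.99], [0.10, 0.30, 0.68]))
```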

Conclusion

My first impression of bogofilter was obviously wrong; however, I was not the only one to feel that switching to the fisher method involved a large loss in performance (which "calculation" is best for me?). From the above, a possible explanation is that fisher seems to be very sensitive to the type of problem I had in my training database. For some reason, graham is definitely very robust in this respect (although it performed worse than the other two once I had cleaned the data). Finally, robinson stands in the middle as a very good choice in terms of both robustness and performance (this is the one I am now using).