MBOX Corpus Training

Some people have maildir, some people have mbox. If you have an mbox corpus, here's how to train dspam with it (in this example, ham is an mbox file of legitimate mail):

cat ham|formail -s dspam --source=corpus --class=innocent 

I would like to see dspam handle an mbox stream natively. Mbox is not a complicated format, and native handling would run a lot faster. This is one area where SpamAssassin beats dspam.

formail -s splits the input stream into individual messages and runs dspam once for each message. Learning is much faster if you use the daemon/client configuration.
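To illustrate what the splitting step involves, here is a minimal sketch that cuts an mbox into one file per message. It is not formail's implementation: the function name split_mbox and the msg.NNNNN file names are made up for this example, and it assumes every message starts with a "From " separator line and that body lines beginning with "From " have already been escaped.

```shell
# Sketch only: split an mbox into one file per message.
# Assumption: each message begins with a "From " separator line at the
# start of a line; body lines starting with "From " are assumed escaped.
split_mbox() {
    mbox=$1      # path to the mbox file
    outdir=$2    # directory that receives msg.00001, msg.00002, ...
    mkdir -p "$outdir" || return 1
    awk -v dir="$outdir" '
        /^From / {
            # A new separator line: finish the previous file, start the next.
            if (out != "") close(out)
            n++
            out = sprintf("%s/msg.%05d", dir, n)
        }
        # Write every line (separator included) to the current message file.
        out != "" { print > out }
    ' "$mbox"
}
```

For example, split_mbox ./ham ./ham_split leaves one file per message in ./ham_split, which is a quick way to check how many messages a corpus really contains before training.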

Converting MBOX to maildir


Note: An easier way to convert mbox->maildir is to use the mb2md package (downloadable at http://batleth.sapienti-sat.org/projects/mb2md/ and also included with many distributions).
--FrankLuithle


I used mutt to convert the mboxes to maildir format. First, add

set mbox_type=maildir 

to your .muttrc or .mutt/muttrc. Then open your mbox file:

mutt -f ./my_mbox_ham 

tag all messages (press T, enter . (dot) as the pattern, then <Enter>), and copy the tagged messages to a new folder (;C ./ham_maildir).

Then ./ham_maildir/new/ contains the mails as individual files.

After populating both the ham and spam maildirs in this way, you can call

dspam_train user spam_maildir/new ham_maildir/new 
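Before starting a long training run it can be worth checking that both directories actually contain messages. The following is a small sketch of such a check; the helper names count_messages and check_corpus are invented for this example and are not part of dspam.

```shell
# Sketch only: sanity-check corpus directories before training.
# count_messages and check_corpus are hypothetical helpers, not dspam tools.

# Print the number of regular files under a directory (0 if it is missing).
count_messages() {
    dir=$1
    [ -d "$dir" ] || { echo 0; return 0; }
    find "$dir" -type f | wc -l | tr -d ' '
}

# Return 0 only if every directory given contains at least one message.
check_corpus() {
    status=0
    for dir in "$@"; do
        n=$(count_messages "$dir")
        if [ "$n" -eq 0 ]; then
            echo "no messages found in $dir" >&2
            status=1
        else
            echo "$dir: $n messages"
        fi
    done
    return $status
}
```

It could be combined with the training call like this: check_corpus spam_maildir/new ham_maildir/new && dspam_train user spam_maildir/new ham_maildir/new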

Enjoy!

last edited 2006-03-21 23:33:13 by FrankLuithle