Performance Tuning
DSPAM and the components around it can, depending on your setup, put a heavy load on your servers if not properly tuned. This page aims to shed some light on what you should watch for and to point out solutions already used in production systems.
This is the "kick-off" page, which has info I've gathered so far. Place your comments at Performance Tuning Comments. This is a work in progress...
Hardware
physical memory
fast disks
Operating System
don't use ext3 (???)
DSPAM tuning
client-server model
unix sockets if possible
low # of concurrent dspam instances
which training mode (???)
MySQL tuning
version 4.1 or above
unix sockets if possible
my.cnf parameters
InnoDB vs MyISAM (???)
size vs speed [>size ==> >I/O ==> <speed] (???)
modified purge scripts (???)
From [dspam-users] mailing list
Here are relevant parts of some threads. Some posts seem contradictory, but in the near future all this will be clarified (I hope).
[dspam-users] Performance Issue
http://dspam.nuclearelephant.com/dspam-users/7550.html
http://dspam.nuclearelephant.com/dspam-users/7569.html
... Try using InnoDB instead of MyISAM. Given your situation, row-level locking (as opposed to table-level in MyISAM) should net you a huge performance benefit. [MYSQL2]
http://dspam.nuclearelephant.com/dspam-users/7571.html
Ok. I tried InnoDB. After 2 hours, the time increased from 0.5-10 to 130-1850. And I have something about 250 dspam processes running simultaneously. Lots of queries in MySQL's process list as well. The ibdata1 filled 700MB in that two hours.
http://dspam.nuclearelephant.com/dspam-users/7575.html
It seems to me that if you're going to take any real advantage of InnoDB, DSPAM has got to be fitted for proper usage. Transactions, read/read-write and next key locking need to be part of the DSPAM. Without it, you'll have one off transactions and deadlocks popping up all over.
http://dspam.nuclearelephant.com/dspam-users/7572.html
... Here's what I have on my box w/ 1G of ram in /etc/mysql/my.cnf
key_buffer = 512M
table_cache = 1024
myisam_sort_buffer_size = 265M
max_allowed_packet = 4M
thread_stack = 128K
Notice how aggressive I'm using memory for sorting the keys and sorts. You should look in to the sample my.cnf that comes w/ mysql (my-huge.cnf, my-large.cnf, etc.) take a look at [MYSQL3] ...
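To put the numbers from the post above in perspective, here is a minimal Python sketch that expresses each size-valued setting as a share of the 1G of RAM mentioned. The helper names (`parse_size`, `share_of_ram`) are made up for this page, and the sketch is plain arithmetic on the quoted values, not a model of how MySQL actually allocates memory.

```python
import re

# Multipliers for the size suffixes used in my.cnf-style values.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text):
    """Convert a size such as '512M' or '128K' to a number of bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def share_of_ram(settings, total_ram):
    """Return each setting's size as a fraction of total_ram (in bytes)."""
    return {name: parse_size(value) / total_ram for name, value in settings.items()}


# Values as quoted in the post; table_cache is a count, not a size, so it is omitted.
settings = {
    "key_buffer": "512M",
    "myisam_sort_buffer_size": "265M",
    "max_allowed_packet": "4M",
    "thread_stack": "128K",
}

for name, share in share_of_ram(settings, parse_size("1G")).items():
    print(f"{name:26s} {share:6.1%} of RAM")
```

The key buffer alone accounts for half of the machine's memory, which is what the poster means by aggressive.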
http://dspam.nuclearelephant.com/dspam-users/7584.html
The insert queries that dspam_merge attempted on our dspam 3.2.6 install were in excess of 60MB; I had to increase the max_allowed_packet parameter accordingly in order for the process to complete.
http://dspam.nuclearelephant.com/dspam-users/7574.html
... it's better to have a small # of concurrent dspam procs... ... tweaking key_buffer size and table_cache parameter. These are the 2 most important parameters when tuning mysql. [MYSQL1]
http://dspam.nuclearelephant.com/dspam-users/7613.html
...After many tests, the most suitable situation I found was:
MySQL 4.1 remote tcp/ip
Method notrain
MySQL vars:
set-variable = max_allowed_packet=8192000
set-variable = max_connections=1524
set-variable = key_buffer_size=128M
set-variable = myisam_sort_buffer_size=88M
set-variable = table_cache=1024
dspam's database is running on a machine with 512MB ram.
My MX statistics:
5 dspam processes running simultaneously
5 messages per second at Exim's incoming queuing daemon
12 Exim's processes running simultaneously delivering about 20+ messages
load average: 2.35, 2.37, 2.51
Those statistics are updated every 5 minutes, but that's the average.
I really can't run it in TOE or TEFT, whether the database is on the MX or remote.
I have one more question: what hardware (Intel based, please) do
you recommend for my setup (remember: I run a MX with 20000+ mail
boxes, 150K+ messages a day, 90% spam)?...
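As a rough sanity check on the figures in the post above, the following Python sketch turns the daily volume into an average rate and applies Little's law (concurrency = arrival rate x time in system) to the reported process count. It assumes every incoming message passes through dspam and that the quoted numbers describe a steady state; both are assumptions of this sketch, not claims from the poster, and the function names are invented here.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(messages_per_day):
    """Average arrival rate in messages per second over a full day."""
    return messages_per_day / SECONDS_PER_DAY


def implied_service_time(concurrent_processes, arrival_rate):
    """Little's law rearranged: average seconds each message spends in dspam."""
    if arrival_rate <= 0:
        raise ValueError("arrival_rate must be positive")
    return concurrent_processes / arrival_rate


# Figures quoted in the post: 150K+ messages a day, 5 dspam processes,
# 5 messages per second at the incoming queue.
print(f"daily average: {average_rate(150_000):.2f} msg/s")
print(f"implied time per message: {implied_service_time(5, 5.0):.2f} s")
```

The daily average is well below the 5 messages per second quoted for the incoming queue, so that figure may describe busier periods rather than the whole day.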
http://dspam.nuclearelephant.com/dspam-users/7614.html
... Add more ram. For that many users, mysql's gonna need all the ram you can throw at it.
http://dspam.nuclearelephant.com/dspam-users/7616.html
Adding memory is good. As long as you configure it properly, MySQL will use as much memory as you throw at it. Scripts that periodically optimize the dspam DB tables will help. Also, faster disk will help immensely. Placing your SQL files on a RAID stripe will increase performance. Another thing you can do is to look into compiler optimizations and also compiling static binaries. Every little bit helps. Beyond that, you could look into MySQL clustering. I don't have much experience with that one though.
http://dspam.nuclearelephant.com/dspam-users/7617.html
The filesystem you use also has a big impact on performance. Stay away from ext3. I have done some tests and they indicate that with xfs, jfs and ReiserFS I get about 300% increase in performance. This is on a system with weak CPU and limited memory. I will publish full results once my tests are concluded.
[dspam-users] DSPAM performance data
http://dspam.nuclearelephant.com/dspam-users/7821.html
http://dspam.nuclearelephant.com/dspam-users/7822.html
> In such a setup, the average email processing time is about 1.5 second, with
> many messages that get processed faster, but some that take much longer to
> process (up to 20+ seconds !) with no visible reason -- system is not under
> high load conditions.
Did you try and correlate the size of the message to the time it takes to process? Messages with more tokens (i.e. bigger messages) are going to take more time. Until Jonathan added the MaxMessageSize option, I had some extremely large (5MB+) messages take several minutes to process.
http://dspam.nuclearelephant.com/dspam-users/7823.html
There seems to be a relation between messages size and processing time, but I have a MaxMessageSize set to 1MB (who's ever seen a spam that big ?), so really big messages shouldn't be a concern. It also seems to me that the messages that are the longest to process are messages for one of my users who's in TEFT mode with a small database (training isn't finished for him).
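The size-versus-time question raised in this thread can be checked with a simple correlation. Below is a minimal Python sketch that computes the Pearson coefficient over (message size, processing time) pairs. The sample pairs are invented purely for illustration; in practice you would collect them from your own mail and dspam logs, whose format is not assumed here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical measurements: (message size in bytes, processing time in seconds).
samples = [(2_000, 0.4), (15_000, 0.9), (40_000, 1.6), (300_000, 6.5), (900_000, 19.0)]
sizes = [size for size, _ in samples]
times = [seconds for _, seconds in samples]

print(f"size/time correlation: {pearson(sizes, times):.2f}")
```

A coefficient near 1 would point at message size as the main driver. A low value would suggest looking elsewhere, for example at users still in TEFT mode with a small database, as the post above does.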
http://dspam.nuclearelephant.com/dspam-users/7824.html
Inserts are going to take longer than updates, due to the indexes that need to be touched. There is also an extra index or two in the postgres setup as pgsql's query builder screws up without them.
[dspam-users] Large Scale
http://dspam.nuclearelephant.com/dspam-users/7833.html
http://dspam.nuclearelephant.com/dspam-users/7831.html
> Which training mode do you recommend for using the DSPAM in a large
> scale environment(40K users)? TOE?
TOE is definitely much gentler on the database, but be advised that until you reach the training threshold (2500 innocent messages), the system will use the more taxing TEFT (train everything) mode. You can get past this initial hump by pretraining a shared group.
> And hardware? How can I distribute the load among 3 or 4 machines?
> Is it necessary a exclusive server for the mysql db?
If all 40k users are in a single domain, it will be harder, but if they are in multiple domains, you can use Storage Profiles to target multiple databases. You will have to realize that dspam sends a lot of data to the database (definitely tune MySQL as LARGE), so if you are not targeting a local database, you will want to make sure you have a very fast internal network. You may be able to get away with running multiple instances of MySQL local to the machine that is running dspam (since then you won't be blocking a single instance, especially during maintenance). It would be best to have a very robust disk subsystem (perhaps a SAN for storage); RAID 1+0 would be better than RAID 5 in this respect.
http://dspam.nuclearelephant.com/dspam-users/7832.html
I'd suggest merged groups + TOE. That will let you have some out of the box accuracy and individual user training. There are a couple people that use NoTrain for the majority of their user base and then allow only a couple people to help out with training (TOE). One way to have multiple mysql's would be to have a mysql server local to each machine and then set them all up in a ring replication topology.
[dspam-users] Training Mode
http://dspam.nuclearelephant.com/dspam-users/7834.html
Some hopefully useful information. Just got done running a test corpus
of around 9,000 messages through DSPAM to identify the differences in
training mode. The results are below. The tests were limited to a
balanced corpus (50% ham and 50% spam) so these are likely to change for
an unbalanced one (which I hope to test at some point).
Set 1: Starting with empty corpus
teft:ch,wh,tb=0
TS: 4286 TI: 4524 SM: 262 IM: 26 SC: 114 IC: 2
toe:ch,wh,tb=0
TS: 4305 TI: 4570 SM: 262 IM: 32 SC: 100 IC: 2
tum:ch,wh,tb=0
TS: 4354 TI: 4523 SM: 194 IM: 27 SC: 55 IC: 3
Set 2: Starting with pretrained corpus (2500 spam, 2500 ham)
teft:ch,wh,tb=0
TS: 1994 TI: 2045 SM: 54 IM: 5 SC: 2530 IC: 2501
toe:ch,wh,tb=0
TS: 1992 TI: 2106 SM: 62 IM: 7 SC: 2520 IC: 2501
tum:ch,wh,tb=0
TS: 1962 TI: 2044 SM: 87 IM: 5 SC: 2535 IC: 2501
When starting with zero corpus, the winner was TUM by a landslide, which
resulted in a significantly lower spam miss count and only one
additional false positive. TOE and TEFT were close in performance, but
TOE's gentleness on the database cost 6 false positives. Still worth it
IMO if you have a large user base.
When using a pretrained corpus, however, TUM didn't fare as well which
suggests that if you're going to use TUM, you should use it from the
start. TEFT was the winner in the pretrained corpus test set.
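The comparisons in the post above can be reproduced from the SM and IM columns. The Python sketch below takes SM to be the spam miss count and IM the innocent miss (false positive) count, which is how the post itself talks about them, and reports each mode's difference from TEFT. The function name is invented for this page.

```python
# (SM, IM) pairs copied from the two result sets in the post.
results = {
    "empty": {"teft": (262, 26), "toe": (262, 32), "tum": (194, 27)},
    "pretrained": {"teft": (54, 5), "toe": (62, 7), "tum": (87, 5)},
}


def misses_vs_baseline(results, baseline="teft"):
    """For each set, return (SM delta, IM delta) of every mode against the baseline."""
    deltas = {}
    for set_name, modes in results.items():
        base_sm, base_im = modes[baseline]
        deltas[set_name] = {
            mode: (sm - base_sm, im - base_im)
            for mode, (sm, im) in modes.items()
            if mode != baseline
        }
    return deltas


for set_name, modes in misses_vs_baseline(results).items():
    for mode, (sm_delta, im_delta) in modes.items():
        print(f"{set_name:10s} {mode:4s} SM {sm_delta:+d}  IM {im_delta:+d}")
```

Starting from an empty corpus, TUM has 68 fewer spam misses than TEFT at the cost of one extra false positive, and TOE has 6 more false positives, matching the post. With the pretrained corpus, both TOE and TUM miss more spam than TEFT.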
References
[MYSQL1] Tuning Server Parameters http://dev.mysql.com/doc/mysql/en/server-parameters.html
[MYSQL2] Converting MyISAM Tables to InnoDB http://dev.mysql.com/doc/mysql/en/converting-tables-to-innodb.html
[MYSQL3] How MySQL Uses Memory http://dev.mysql.com/doc/mysql/en/memory-use.html
Comments
To avoid a mess, put them in Performance Tuning Comments, or drop me a note. PauloMatos.
