Performance Tuning

DSPAM and the components it depends on can, depending on your setup, put a heavy load on your servers if not properly tuned. This page aims to shed some light on what you should worry about and to point out some solutions already used in production systems.


This is the "kick-off" page, which has info I've gathered so far. Place your comments at Performance Tuning Comments. This is a work in progress...


Hardware

Operating System

DSPAM tuning

MySQL tuning


From [dspam-users] mailing list

Here are relevant parts of some threads. Some posts seem contradictory, but in the near future all this will be clarified (I hope).

[dspam-users] Performance Issue

http://dspam.nuclearelephant.com/dspam-users/7550.html

http://dspam.nuclearelephant.com/dspam-users/7569.html

... Try using InnoDB instead of MyISAM. Given your situation, row-level
locking (as opposed to table-level in MyISAM) should net you a huge
performance benefit. [MYSQL2]

http://dspam.nuclearelephant.com/dspam-users/7571.html

Ok. I tried InnoDB. After 2 hours, the time increased from 0.5-10 to
130-1850. And I have something about 250 dspam processes running
simultaneously. Lots of queries in MySQL's process list as well. The
ibdata1 filled 700MB in that two hours.

http://dspam.nuclearelephant.com/dspam-users/7575.html

It seems to me that if you're going to take any real advantage of
InnoDB, DSPAM has got to be fitted for proper usage. Transactions,
read/read-write and next key locking need to be part of the DSPAM.
Without it, you'll have one off transactions and deadlocks popping up
all over.

http://dspam.nuclearelephant.com/dspam-users/7572.html

... Here's what I have on my box w/ 1G of ram

in /etc/mysql/my.cnf

key_buffer = 512M
table_cache = 1024
myisam_sort_buffer_size = 265M
max_allowed_packet = 4M
thread_stack = 128K

Notice how aggressively I'm using memory for the key buffer and sorts.
You should look into the sample my.cnf files that come w/ mysql
(my-huge.cnf, my-large.cnf, etc.)

take a look at [MYSQL3] ...
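As a rough sanity check on settings like the ones quoted above, here is a minimal sketch that converts my.cnf size values to bytes and shows each one as a share of physical RAM. The helper name parse_size is made up for this page, and the script does not model how MySQL actually allocates these buffers (see [MYSQL3] for that); it only makes the proportions visible.

```python
import re

# Multipliers for the size suffixes used in my.cnf (K, M, G).
UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(value: str) -> int:
    """Convert a my.cnf size such as '512M' or '128K' to bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?)\s*", value.upper())
    if match is None:
        raise ValueError(f"not a size: {value!r}")
    number, unit = match.groups()
    return int(number) * UNITS[unit]


# Values taken from the post quoted above; RAM is the poster's stated 1G.
settings = {"key_buffer": "512M", "myisam_sort_buffer_size": "265M"}
ram = parse_size("1G")

for name, value in settings.items():
    share = parse_size(value) / ram
    print(f"{name:26s} {value:>5s}  {share:6.1%} of RAM")
```

Swap in your own values and RAM size to see how much headroom is left for the operating system and for DSPAM itself.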

http://dspam.nuclearelephant.com/dspam-users/7584.html

The insert queries that dspam_merge attempted on our dspam 3.2.6 install
were in excess of 60MB; I had to increase the max_allowed_packet parameter
accordingly in order for the process to complete.

http://dspam.nuclearelephant.com/dspam-users/7574.html

... it's better to have a small # of concurrent dspam procs...
... tweaking key_buffer size and table_cache parameter. These are the 2
most important parameters when tuning mysql. [MYSQL1]

http://dspam.nuclearelephant.com/dspam-users/7613.html

...After many tests, the most suitable situation I found was:

MySQL 4.1 remote tcp/ip
Method notrain
MySQL vars:

set-variable = max_allowed_packet=8192000
set-variable = max_connections=1524
set-variable = key_buffer_size=128M
set-variable = myisam_sort_buffer_size=88M
set-variable = table_cache=1024

dspam's database is running on a machine with 512MB ram.
My MX statistics:

5 dspam processes running simultaneously
5 messages per second at Exim's incoming queuing daemon
12 Exim's processes running simultaneously delivering about 20+ messages
load average: 2.35, 2.37, 2.51

Those statistics are updated every 5 minutes, but that's the average.
I really can't run it in TOE or TEFT, whether the database is on the MX or remote.
I have one more question: what hardware (Intel based, please) do
you recommend for my setup (remember: I run a MX with 20000+ mail
boxes, 150K+ messages a day, 90% spam)?...

http://dspam.nuclearelephant.com/dspam-users/7614.html

...  Add more ram. For that many users, mysql's gonna need all the ram you
can throw at it.

http://dspam.nuclearelephant.com/dspam-users/7616.html

Adding memory is good. As long as you configure it properly, MySQL will
use as much memory as you throw at it.
Scripts that periodically optimize the dspam DB tables will help.
Also, faster disk will help immensely. Placing your SQL files on a
RAID stripe will increase performance.
Another thing you can do is to look into compiler optimizations and also
compiling static binaries. Every little bit helps.
Beyond that, you could look into MySQL clustering. I don't have much
experience with that one though.

http://dspam.nuclearelephant.com/dspam-users/7617.html

The filesystem you use also has a big impact on performance. Stay away
from ext3. I have done some tests and they indicate that with xfs, jfs
and ReiserFS I get about 300% increase in performance. This is on a
system with weak CPU and limited memory.

I will publish full results once my tests are concluded.

[dspam-users] DSPAM performance data

http://dspam.nuclearelephant.com/dspam-users/7821.html

http://dspam.nuclearelephant.com/dspam-users/7822.html

> In such a setup, the average email processing time is about 1.5 second, with
> many messages that get processed faster, but some that take much longer to
> process (up to 20+ seconds !) with no visible reason -- system is not under
> high load conditions.

Did you try and correlate the size of the message to the time it takes to
process? Messages with more tokens (i.e. bigger messages) are going to take
more time. Until Jonathan added the MaxMessageSize option, I had some extremely
large (5MB+) messages take several minutes to process. 
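One way to follow the suggestion above is to log message size and processing time for a batch of messages and compute a correlation coefficient. The sketch below is a plain Pearson correlation; the sample numbers are invented for illustration, and how you collect the real pairs (MTA logs, a wrapper script around dspam, etc.) is left open.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# Hypothetical samples: (message size in bytes, processing time in seconds).
samples = [(4_000, 0.4), (12_000, 0.9), (60_000, 1.6),
           (250_000, 4.8), (900_000, 19.5)]
sizes = [size for size, _ in samples]
times = [seconds for _, seconds in samples]

print(f"r = {pearson(sizes, times):.2f}")
```

A value close to 1 would support the "bigger messages take longer" explanation; a value near 0 would point at something else, such as database contention or the user's training mode.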

http://dspam.nuclearelephant.com/dspam-users/7823.html

There seems to be a relation between messages size and processing time, but I
have a MaxMessageSize set to 1MB (who's ever seen a spam that big ?), so
really big messages shouldn't be a concern.

It also seems to me that the messages that are the longest to process are
messages for one of my users who's in TEFT mode with a small database
(training isn't finished for him). 

http://dspam.nuclearelephant.com/dspam-users/7824.html

Inserts are going to take longer than updates, due to the indexes that
need to be touched. There is also an extra index or two in the postgres
setup as pgsql's query builder screws up without them. 

[dspam-users] Large Scale

http://dspam.nuclearelephant.com/dspam-users/7833.html

http://dspam.nuclearelephant.com/dspam-users/7831.html

> Which training mode do you recommend for using DSPAM in a large
> scale environment (40K users)? TOE?

TOE is definitely much gentler on the database, but be advised that
until you reach the training threshold (2500 innocent messages), the
system will use the more taxing TEFT (train everything) mode. You can
get past this initial hump by pretraining a shared group.

> And hardware? How can I distribute the load among 3 or 4 machines?
> Is an exclusive server necessary for the mysql db?

If all 40k users are in a single domain, it will be harder, but if they
are in multiple domains, you can use Storage Profiles to target multiple
databases. You will have to realize that dspam sends a lot of data to
the database (definitely tune MySQL as LARGE), so if you are not
targeting a local database, you will want to make sure you have a very
fast internal network.

You may be able to get away with running multiple instances of MySQL
local to the machine that is running dspam (since then you won't be
blocking a single instance, especially during maintenance). It would
be best to have a very robust disk subsystem (perhaps a SAN for storage);
RAID 1+0 would be better than RAID 5 in this respect. 
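The training-threshold behaviour described in the reply above (a TOE user is effectively handled as TEFT until 2500 innocent messages have been learned) can be pictured with a small sketch. This is only an illustration of the rule as stated in the post, not DSPAM's actual code; the function name and the exact boundary (switching once the count reaches 2500) are assumptions.

```python
def effective_mode(configured: str, innocent_trained: int,
                   threshold: int = 2500) -> str:
    """Return the mode that is effectively in use for a user.

    Assumption taken from the post: below the training threshold a TOE
    user is processed as TEFT (train everything). Other modes are
    returned unchanged.
    """
    if configured == "toe" and innocent_trained < threshold:
        return "teft"
    return configured


# Show when a TOE user stops generating TEFT-style database writes.
for learned in (0, 1200, 2499, 2500, 10000):
    print(f"{learned:6d} innocent messages -> {effective_mode('toe', learned)}")
```

This is why pretraining a shared group helps on a large install: new users start out past the threshold instead of spending their first 2500 innocent messages in the heavier mode.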

http://dspam.nuclearelephant.com/dspam-users/7832.html

I'd suggest merged groups + TOE. That will let you have some out of
the box accuracy and individual user training. There are a couple
people that use NoTrain for the majority of their user base and then
allow only a couple people to help out with training (TOE).

One way to have multiple mysql's would be to have a mysql server local
to each machine and then set them all up in a ring replication
topology.

[dspam-users] Training Mode

http://dspam.nuclearelephant.com/dspam-users/7834.html

Some hopefully useful information. Just got done running a test corpus 
of around 9,000 messages through DSPAM to identify the differences in 
training mode. The results are below. The tests were limited to a 
balanced corpus (50% ham and 50% spam) so these are likely to change for 
an unbalanced one (which I hope to test at some point).

Set 1: Starting with empty corpus

teft:ch,wh,tb=0
     TS:  4286 TI:  4524 SM:   262 IM:    26 SC:   114 IC:     2
toe:ch,wh,tb=0
     TS:  4305 TI:  4570 SM:   262 IM:    32 SC:   100 IC:     2
tum:ch,wh,tb=0
     TS:  4354 TI:  4523 SM:   194 IM:    27 SC:    55 IC:     3

Set 2: Starting with pretrained corpus (2500 spam, 2500 ham)

teft:ch,wh,tb=0
     TS:  1994 TI:  2045 SM:    54 IM:     5 SC:  2530 IC:  2501
toe:ch,wh,tb=0
     TS:  1992 TI:  2106 SM:    62 IM:     7 SC:  2520 IC:  2501
tum:ch,wh,tb=0
     TS:  1962 TI:  2044 SM:    87 IM:     5 SC:  2535 IC:  2501

When starting with zero corpus, the winner was TUM by a landslide, which
resulted in a significantly lower spam miss count and only one
additional false positive. TOE and TEFT were close in performance, but
TOE's gentleness on the database cost 6 false positives. Still worth it
IMO if you have a large user base.

When using a pretrained corpus, however, TUM didn't fare as well, which
suggests that if you're going to use TUM, you should use it from the
start. TEFT was the winner in the pretrained corpus test set.
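To make the comparison in the post easier to read, here is a small sketch that parses the result lines above and prints the SM and IM counters for each mode as a difference from TEFT. It assumes SM means spam misses and IM means innocent misses (false positives), which is how the post's own commentary reads; the other counters are parsed but not interpreted.

```python
import re

# Result lines copied from the post above.
RESULTS = {
    "empty": {
        "teft": "TS:  4286 TI:  4524 SM:   262 IM:    26 SC:   114 IC:     2",
        "toe":  "TS:  4305 TI:  4570 SM:   262 IM:    32 SC:   100 IC:     2",
        "tum":  "TS:  4354 TI:  4523 SM:   194 IM:    27 SC:    55 IC:     3",
    },
    "pretrained": {
        "teft": "TS:  1994 TI:  2045 SM:    54 IM:     5 SC:  2530 IC:  2501",
        "toe":  "TS:  1992 TI:  2106 SM:    62 IM:     7 SC:  2520 IC:  2501",
        "tum":  "TS:  1962 TI:  2044 SM:    87 IM:     5 SC:  2535 IC:  2501",
    },
}


def parse_counters(line: str) -> dict:
    """Turn 'TS: 4286 TI: 4524 ...' into {'TS': 4286, 'TI': 4524, ...}."""
    return {key: int(value)
            for key, value in re.findall(r"([A-Z]{2}):\s*(\d+)", line)}


for corpus, modes in RESULTS.items():
    base = parse_counters(modes["teft"])  # TEFT is the baseline
    for mode, line in modes.items():
        c = parse_counters(line)
        print(f"{corpus:10s} {mode:4s} "
              f"SM={c['SM']:4d} ({c['SM'] - base['SM']:+d})  "
              f"IM={c['IM']:3d} ({c['IM'] - base['IM']:+d})")
```

The differences line up with the commentary: from an empty corpus TOE shows 6 more innocent misses than TEFT, while TUM shows far fewer spam misses at the cost of one extra innocent miss.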

References

[MYSQL1] Tuning Server Parameters http://dev.mysql.com/doc/mysql/en/server-parameters.html

[MYSQL2] Converting MyISAM Tables to InnoDB http://dev.mysql.com/doc/mysql/en/converting-tables-to-innodb.html

[MYSQL3] How MySQL Uses Memory http://dev.mysql.com/doc/mysql/en/memory-use.html


Comments

To avoid a mess, put them into Performance Tuning Comments, or you can always drop me a note. PauloMatos.

last edited 2007-03-08 02:08:28 by JoshuaKugler