Thanks everybody for the feedback about Metamail. Some people have asked me about the performance of the MapReduce jobs, so I’ve added some code to the git repo to benchmark them. I executed each job 20 times on my laptop, and these are the results:

| Job | Average Time (seconds) | Standard Deviation (seconds) |
|---|---|---|
| Messages By Size | 39.94 | 3.91 |
| Messages By Thread Length | 39.12 | 1.60 |
| Messages By Time Period | 44.58 | 3.18 |
| Messages Received | 43.12 | 1.41 |
| Messages Sent | 37.82 | 1.41 |
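
The timing approach is straightforward: run each job driver 20 times, record the wall-clock time of each run, and compute the mean and standard deviation. The sketch below shows the idea; it is a simplified stand-in for the benchmark code in the repo, and the job driver is just a placeholder.

```java
public class JobBenchmark {

    // Runs a job driver 'runs' times and prints the average time and standard deviation.
    static void benchmark(String name, Runnable jobDriver, int runs) {
        double[] seconds = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            jobDriver.run();
            seconds[i] = (System.nanoTime() - start) / 1e9;
        }
        double mean = 0.0;
        for (double s : seconds) {
            mean += s;
        }
        mean /= runs;
        double variance = 0.0;
        for (double s : seconds) {
            variance += (s - mean) * (s - mean);
        }
        System.out.printf("%s: avg=%.2fs stddev=%.2fs%n", name, mean, Math.sqrt(variance / runs));
    }

    public static void main(String[] args) {
        // Replace the lambda body with the real job driver,
        // e.g. ToolRunner.run(new MessagesBySizeJob(), args).
        benchmark("Messages By Size", () -> { /* run the MapReduce job here */ }, 20);
    }
}
```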

More or less all the jobs take the same amount of time, except for ‘Messages By Time Period’ and ‘Messages Received’; both do more work processing each email than the other jobs. ‘Messages By Time Period’ performs several calculations in the same job to count messages by year, month, day of the week and hour of the day. On the other hand, although ‘Messages Received’ and ‘Messages Sent’ are similar tasks and should therefore take about the same time, parsing the ‘To:’ header is more expensive than parsing the ‘From:’ header, presumably because ‘To:’ can list several recipients while ‘From:’ holds a single address.
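
To illustrate the first case, a single mapper can produce all four time breakdowns in one pass by emitting several counters per message, which the reducer then sums. The sketch below reads from HBase; the column family and qualifier names are made up for the example and are not necessarily the ones Metamail uses.

```java
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// One pass over the emails table produces the per-year, per-month,
// per-day-of-week and per-hour counts by emitting several keys per message.
public class TimePeriodMapper extends TableMapper<Text, IntWritable> {

    // Column family and qualifier are assumptions made for this sketch.
    private static final byte[] FAMILY = Bytes.toBytes("headers");
    private static final byte[] DATE = Bytes.toBytes("Date");

    private static final IntWritable ONE = new IntWritable(1);
    private final SimpleDateFormat rfc2822 =
            new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss Z", Locale.ENGLISH);
    private final Text outKey = new Text();

    @Override
    protected void map(ImmutableBytesWritable row, Result columns, Context context)
            throws IOException, InterruptedException {
        byte[] rawDate = columns.getValue(FAMILY, DATE);
        if (rawDate == null) {
            return; // message without a Date header
        }
        Calendar cal = Calendar.getInstance();
        try {
            cal.setTime(rfc2822.parse(Bytes.toString(rawDate)));
        } catch (ParseException e) {
            return; // skip malformed dates
        }
        // Several counters per message, all summed by the reducer.
        emit(context, "year:" + cal.get(Calendar.YEAR));
        emit(context, "month:" + (cal.get(Calendar.MONTH) + 1));
        emit(context, "dayOfWeek:" + cal.get(Calendar.DAY_OF_WEEK));
        emit(context, "hour:" + cal.get(Calendar.HOUR_OF_DAY));
    }

    private void emit(Context context, String key) throws IOException, InterruptedException {
        outKey.set(key);
        context.write(outKey, ONE);
    }
}
```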

What does take a lot of time is importing the more than 500,000 emails (2.6 GB) that make up the Enron Email Dataset into HBase: approximately 16 minutes. When I started coding Metamail I used some sample data from my own mail archive, compiled into a single mbox file. The code had to read that file and split it into records, each record being one email; there was no database. If I remember correctly, a job crunching a 15 MB file on the same hardware used to take longer than a minute. Using HBase to store the thousands of small files and feed Hadoop with data records definitely looks like a good idea.
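
Conceptually the import is just a walk over the extracted dataset that stores each message as one HBase row, keyed by its path. A rough sketch, using the older HBase client API and with the table and column names invented for the example (the real importer is in the repo):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Walks the extracted Enron maildir tree and stores each message as one row
// in an 'emails' table, keyed by its relative path. Names are illustrative only.
public class EnronImporter {

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args[0]);              // e.g. the maildir/ directory
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "emails");
        table.setAutoFlush(false);                   // batch the puts for throughput

        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile).forEach(file -> {
                try {
                    byte[] message = Files.readAllBytes(file);
                    Put put = new Put(Bytes.toBytes(root.relativize(file).toString()));
                    put.add(Bytes.toBytes("raw"), Bytes.toBytes("message"), message);
                    table.put(put);
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
        }
        table.flushCommits();
        table.close();
    }
}
```

Once the messages live in HBase, the jobs scan large contiguous rows instead of opening half a million tiny files, which is where the single-mbox approach fell short.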

Hardware settings:

Laptop: Lenovo X220
OS: Fedora 16
CPU: Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz (4 CPUs)
RAM: 8GB DDR3
HD: INTEL SSDSA2M160G2LE