Saturday, 18 October 2008

chasing time

As discussed before, I am working on a little hobby project called mu, for indexing/searching e-mail messages in maildirs. As a true hobby project, it's about finding things out. I'll take notes as I go along.


One important part of indexing and searching is.... indexing. Indexing (in this context) is the operation of recursively going through a maildir, analyzing each message file, and storing the results in a database. In mu's case, there are actually two databases, one SQLite-database and one Xapian-database (a really interesting tool - to be discussed later).

Indexing may take a considerable amount of time; mu version 0.1 took 192 seconds (on average) to index 10000 messages in my testing corpus. And this version did not even support the Xapian database. Indexing involves reading from disk, querying the database to see if the message is already there, and if not, storing the message metadata. Because of this scheme, re-indexing of the same 10000 messages only takes about 5 seconds (with re-indexing, only modified/new messages need to be indexed).

The full indexing operation probably does not happen very often, for most people. Still, I think it's very worthwhile to try and make it faster. Nobody likes to wait for 192 seconds, even once - and during development, I need to do a full index rather often. Another important reason is that optimizing software is simply interesting - which is a main motivator for a hobby project.

So, let's see how we can make this a bit faster; here I'll only discuss some of the database-related optimizations.


As mentioned, mu stores the indexing data in two databases; one SQLite-database and one Xapian-database. Both of these databases know the concept of a transaction. By default, SQLite puts every query in a separate transaction. This is very safe, but also quite expensive. When indexing messages, there is no risk of data loss, so it's quite reasonable to increase the transaction size. And this makes things a lot faster. Between mu version 0.1 to 0.2, I increased the default from one transaction per message (3 queries) to one transaction per 100 messages. This made indexing more than 2.5 times faster -- see the table below. This improvement is even more impressive when considering that I also added full-text search, indexing message bodies as well (this is what Xapian is for).

For Xapian transactions, the default value I chose is 1000 transactions -- but the performance effects are much smaller. So, my 'optimal' values, are 100 and 1000, respectively. I found that transactions bigger than that don't improve the performance very much, but of course still affect memory usage. You can tune these with --tune-sqlite-transaction-size and --tune-xapian-transaction-size. The defaults should be just fine for the normal desktop use case - still, if you need a less memory-hungry but slower version, that is possible too. See the mu-index(1) man page for details.


Another area for performance are SQLite's PRAGMA-statements. Some useful ones are PRAGMA synchronous= (which you can influence with --tune-synchronous and PRAGMA temp_store=, which you can tune with --tune-temp-store. Again, see the mu-index(1) man page for details.

It turns out that PRAGMA synchronous allows for some improvement. This setting determines whether SQLite does it writes in a synchronous way. It's faster (and slightly less safe, but the notes at the end of this blog entry). From the table below, it seems that PRAGMA temp_store does not make much difference in this case. This PRAGMA determines where we store temporary (non-committed) results. Some testing suggests this is because, when we do not enable synchronous writing (above), even the 'file' temp_store never physically hits the disk, due to caching by the kernel.


Having optimization options tunable through command line options is really useful. Software optimization, especially from what your read online, seems to be a field full of myths, outdated 'facts' and placebo-effects. And even if the information is correct, it may not apply to your use case. The only thing you can do is measure it. And with command line-options I can easily do that, as well as see how various combinations of optimizations perform.

Here's a table with the results for indexing 10000 messages with version 0.3. Between all the runs, I used

# sync && echo 3 > /proc/sys/vm/drop_caches
to flush the caches. That's a critical step - the kernel caches a lot of data, which makes subsequent runs much faster if you don't flush the caches. And that is not what I wanted to measure.

msg/sqlite tx
msg/xapian tx
synchronous sqlite
temp store sqlite
time (s)
similar to defaults for mu 0.1, but faster
default for mu 0.3

As an example, the default for mu version 0.3 is equivalent to:
./mu-index --tune-sqlite-transaction-size=100 --tune-xapian-transaction-size=1000  --tune-synchronous=0 --tune-temp-store=2 ~/data/testmaildir
Again, see the mu-index(1) manpage for details.

Note, these optimizations are a good strategy for indexing data, that is, generating data from data that is already safely stored somewhere else. If anything goes wrong, we can always restart the indexing later. However, if your database stores data that cannot easily be retrieved again afterwards (say, that one occurrence of the Higg's Boson in your particle accelerator), you would want to be a bit more careful.

There are some more optimizations possible; some I have even implemented, such as inode-sorting, which is documented in the mu-index(1) man page. To be discussed some other time.


IvanFrade said...

Hi Dirk-Jan, nice project!

As you can imagine, i am very interested on this topic (in tracker we also index emails).

Did you take into account for the performance the DB Schema? Is your database a simple table with a row per email?

What about attachments? do you plan to support the "meta information" of the mails, like conversation threads?

Thanks for your post and keep "hobby programming"!

djcb said...

@ivanfrade: hi ivan :-)

zeeshan also asked about tracker; i think i'll blog a bit about that the next time. maybe you can blog a bit about your experiences? I am sure we can all learn a lot from those!

anyway, to answer your question: when indexing, there will be one row per message in the sqlite database (table 'message'). this includes sender/recipients fields; note that recipients is a string that can contain multiple recipients (comma separated)

however, there is also the concept of a contact (a single sender or recipient), which has a n:m relation with table message.

this redundancy allows for searching messages without joins (only 'message' is needed), while it's also quick/easy to get, say, all mails sent to some particular contact. mu does not use that last option yet though.

when indexing, we scan the full message body as well and store it in the xapian database, tagging it with the sqlite message-id, so when searching through xapian, we can find the corresponding row in sqlite. it's a bit hackish, two databases, but afaik, there's no good alternative, and in practice, it works pretty well.

btw, you can find the sqlite database schema in index/mu-storage.sql in the source code.

regarding attachments: i detect that a message has attachments, but I am not doing anything with them. for me personally, that's not really something high on my list.

regarding the conversation threads: they are there already if you use mu together with mutt (courtesy of mutt, of course). if i ever develop some GUI, i'd probably try something like
I have the data, it's just a small matter of coding :)