As discussed
before, I am working on a little hobby project called
mu, for indexing/searching e-mail messages in maildirs. As a true hobby project, it's about finding things out. I'll take notes as I go along.
indexing
One important part of indexing and searching is....
indexing. Indexing (in this context) is the operation of recursively going through a
maildir, analyzing each message file, and storing the results in a database. In mu's case, there are actually two databases, one
SQLite-database and one
Xapian-database (a really interesting tool - to be discussed later).
Indexing may take a considerable amount of time; mu version 0.1 took 192 seconds (on average) to index 10000 messages in my testing corpus. And this version did not even support the Xapian database. Indexing involves reading from disk, querying the database to see if the message is already there, and if not, storing the message metadata. Because of this scheme, re-indexing of the same 10000 messages only takes about 5 seconds (with re-indexing, only modified/new messages need to be indexed).
The full indexing operation probably does not happen very often, for most people. Still, I think it's very worthwhile to try and make it faster. Nobody likes to wait for 192 seconds, even once - and during development, I need to do a full index rather often. Another important reason is that optimizing software is simply interesting - which is a main motivator for a hobby project.
So, let's see how we can make this a bit faster; here I'll only discuss some of the database-related optimizations.
transactions
As mentioned,
mu stores the indexing data in two databases; one
SQLite-database and one
Xapian-database. Both of these databases know the concept of a
transaction. By default, SQLite puts every query in a separate transaction. This is very safe, but also quite expensive. When indexing messages, there is no risk of data loss, so it's quite reasonable to increase the transaction size. And this makes things a
lot faster. Between
mu version 0.1 to 0.2, I increased the default from one transaction per message (3 queries) to one transaction per 100 messages. This made indexing more than
2.5 times faster -- see the table below. This improvement is even more impressive when considering that I also added full-text search, indexing message bodies as well (this is what Xapian is for).
For Xapian transactions, the default value I chose is 1000 transactions -- but the performance effects are much smaller. So, my 'optimal' values, are 100 and 1000, respectively. I found that transactions bigger than that don't improve the performance very much, but of course still affect memory usage. You can tune these with --tune-sqlite-transaction-size and --tune-xapian-transaction-size. The defaults should be just fine for the normal desktop use case - still, if you need a less memory-hungry but slower version, that is possible too. See the mu-index(1) man page for details.
pragmatic
Another area for performance are SQLite's
PRAGMA-statements. Some useful ones are
PRAGMA synchronous= (which you can influence with
--tune-synchronous and
PRAGMA temp_store=, which you can tune with
--tune-temp-store. Again, see the
mu-index(1) man page for details.
It turns out that PRAGMA synchronous allows for some improvement. This setting determines whether SQLite does it writes in a synchronous way. It's faster (and slightly less safe, but the notes at the end of this blog entry). From the table below, it seems that PRAGMA temp_store does not make much difference in this case. This PRAGMA determines where we store temporary (non-committed) results. Some testing suggests this is because, when we do not enable synchronous writing (above), even the 'file' temp_store never physically hits the disk, due to caching by the kernel.
results
Having optimization options tunable through command line options is really useful. Software optimization, especially from what your read online, seems to be a field full of myths, outdated 'facts' and placebo-effects. And even if the information is correct, it may not apply to your use case. The only thing you can do is measure it. And with command line-options I can easily do that, as well as see how various combinations of optimizations perform.
Here's a table with the results for indexing 10000 messages with version 0.3. Between all the runs, I used
# sync && echo 3 > /proc/sys/vm/drop_caches
to flush the caches. That's a critical step - the kernel caches a lot of data, which makes subsequent runs much faster if you don't flush the caches. And that is not what I wanted to measure.
msg/sqlite tx
| msg/xapian tx
| synchronous sqlite
| temp store sqlite
| time (s)
| notes
|
1
| 1
| full
| file
| 1536
|
1
| 1
| normal
| default
| 182
| similar to defaults for mu 0.1, but faster
|
100
| 1000
| full
| file
| 73
|
100
| 1000
| no
| file
| 68
|
100
| 1000
| no
| memory
| 68
| default for mu 0.3
|
10000
| 10000
| no
| memory
| 67
|
As an example, the default for mu version 0.3 is equivalent to:
./mu-index --tune-sqlite-transaction-size=100 --tune-xapian-transaction-size=1000 --tune-synchronous=0 --tune-temp-store=2 ~/data/testmaildir
Again, see the
mu-index(1) manpage for details.
Note, these optimizations are a good strategy for indexing data, that is, generating data from data that is already safely stored somewhere else. If anything goes wrong, we can always restart the indexing later. However, if your database stores data that cannot easily be retrieved again afterwards (say, that one occurrence of the Higg's Boson in your particle accelerator), you would want to be a bit more careful.
There are some more optimizations possible; some I have even implemented, such as inode-sorting, which is documented in the mu-index(1) man page. To be discussed some other time.