Monday, July 19, 2010

(Single instance) attachment storage

(Mailing list thread for this post.)

Now that v2.0.0 is only waiting for people to report bugs (and me to figure out how to fix them), I've finally had time to start doing what I actually came here (Portugal Telecom/SAPO) to do. :)

The idea is to have dbox and mdbox support saving attachments (or MIME parts in general) to separate files, which with some magic gives a possibility to do single instance attachment storage. Comments welcome.

Reading attachments

dbox metadata would contain entries like (this is a wrapped single line entry):

X1442 2742784 94/b2/01f34a9def84372a440d7a103a159ac6c9fd752b
2744378 27423 27/c8/a1dccc34d0aaa40e413b449a18810f600b4ae77b

So the format is:

"X" 1*(<offset> <byte count> <link path>)

So when reading a dbox message body, it's read as:

offset=0: <first 1442 bytes from dbox body>
offset=1442: <next 2742784 bytes from external file>
offset=2744226: <next 152 bytes from dbox body>
offset=2744378: <next 27423 bytes from external file>
offset=2744378 27423: <the rest from dbox body>

This is all done internally by creating a single istream that lazily opens the external files only when data is actually tried to be read from that part of the message.

The link paths don't have to be in any specific format. In future perhaps it can recognize different formats (even http:// urls and such).

Saving attachments separately

Message MIME structure is being parsed while message is saved. After each MIME part's headers are parsed, it's determined if this part should be stored into attachment storage. By default it only checks that the MIME part isn't multipart/* (because then its child parts would contain attachments). Plugins can also override this. For example they could try to determine if the commonly used clients/webmail always downloads and shows the MIME part when opening the mail (text/*, inline images, etc).

dbox_attachment_min_size specifies the minimum MIME part size that can be saved as an attachment. Anything smaller than that will be stored normally. While reading a potential attachment MIME part body, it's first buffered into memory until the min. size is reached. After that the attachment file is actually created and buffer flushed to it.

Each attachment filename contains a global UID part, so that no two (even identical) attachments will ever contain the same filename. But there can be multiple attachment storages in different mount points, and each one could be configured to do deduplication internally. So identical attachments should somehow be stored to same storage. This is done by taking a hash of the body and using a part of it as the path to the file. For example:

mail_location = dbox:~/dbox:ATTACHMENTS=/attachments/$/$

Each $ would be expanded to 8 bits of the hash in hex (00..ff). So the full path to an attachment could look like:


Sysadmin can then create /attachment/00..ff as symlinks to different storages.

Hashing problems

Some problematic design decisions:
  1. Hash is taken from hardcoded first n kB vs. first dbox_attachment_min_size bytes?
    • + With first n kB, dbox_attachment_min_size can be changed without causing duplication of attachments, otherwise after the change the same attachment could get a hash to a different storage than before the change.

    • - If n kB is larger than dbox_attachment_min_size, it uses more memory.

    • - If n kB is determined to be too small to get uniform attachment distribution to different storages, it can't be changed without recompiling.

  2. Hash is taken from first n bytes vs. everything?

    • + First n bytes are already read to memory anyway and can be hashed efficiently. The attachment file can be created without wasting extra memory or disk I/O. If everything is hashed, the whole attachment has to be first stored to memory or to a temporary file and from there written to final storage.

    • - With first n bytes it's possible for an attacker to generate lots of different large attachments that begin with the same bytes and then overflow a single storage. If everything is hashed with a secure hash function and a system-specific secret random value is added to the hash, this attack isn't possible.

I'm thinking that even though taking a hash of everything is the least efficient option, it's the safest option. It's pretty much guaranteed to give a uniform distribution across all storages, even against intentional attacks. Also the worse performance isn't probably that noticeable, especially assuming a system where local disk isn't used for storing mails, and the temporary files would be created there.

Single instance storage

All of the above assumes that if you want a single instance storage, you'll need to enable it in your storage. Now, what if you can't do that?

I've been planning on making all index/dbox code to use an abstracted out simple filesystem API rather than using POSIX directly. This work can be started by making the attachment reading/writing code use the FS API and then create a single instance storage FS plugin. The plugin would work like:
  • open(ha/sh/hash-guid): The destination storage is in ha/sh/ directory, so a new temp file can be created under it. The hash is part of the filename to make unlink() easier to handle.

    Since the hash is already known at open() time, look up if hashes/<hash> file exists. If it does, open it.

  • write(): Write to the temp file. If hashes/ file is open, do a byte-by-byte comparison of the inputs. If there's a mismatch, close the hashes/ file and mark it as unusable.

  • finish():

    1. If hashes/ file is still open and it's at EOF, link() it to our final destination filename and delete the temp file. If link() fails with ENOENT (it was just expunged), goto b. If link() fails with EMLINK (too many links), goto c.

    2. If hashes/ file didn't exist, link() the temp file to the hash and rename() it to the destination file.

    3. If the hashed file existed but wasn't the same, or if link() failed with EMLINK, link() our temp file to a second temp file and rename() it over the hashes/ file and goto a.

  • unlink(): If hashes/<hash> has the same inode as our file and the link count is 2, unlink() the hash file. After that unlink() our file.

One alternative to avoid using <hash> as part of the filename would be for unlink() to read the file and recalculate its hash, but that would waste disk I/O.

Another possibility would to be to not unlink() the hashes/ files immediately, but rather let some nightly cronjob to stat() through all of the files and unlink() the ones that have link count=1. This could be wastefully inefficient though.

Yet another possibility would be for the plugin to internally calculate the hash and write it somewhere. If it's at the beginning of the file, it could be read from there with some extra disk I/O. But is it worth it?..

Extra features

The attachment files begin with an extensible header. This allows a couple of extra features to reduce disk space:

  1. The attachment could be compressed (header contains compressed-flag)

  2. If base64 attachment is in a standardized form that can be 100% reliably converted back to its original form, it could be stored decoded and then encoded back to original on the fly.

It would be nice if it was also possible to compress (and decompress) attachments after they were already stored. This would be possible, but it would require finding all the links to the message and recreating them to point to the new message. (Simply overwriting the file in place would require there are no readers at the same time, and that's not easy to guarantee, except if Dovecot was entirely stopped. I also considered some symlinking schemes but they seemed too complex and they'd also waste inodes and performance.)

Code status

Initial version of the attachment reading/writing code is already done and works (lacks some error handling and probably performance optimizations). The SIS plugin code is also started and should be working soon.

This code is very isolated and can't cause any destabilization unless it's enabled, so I'm thinking about just adding it to v2.0 as soon as it works, although the config file comments should indicate that it's still considered unstable.

Wednesday, May 19, 2010

A new director service in v2.0 for NFS installations

(Mailing list thread for this post.)

As NFS wiki page describes, the main problem with NFS has always been caching problems. One NFS client changes two files, but another NFS client sees only one of the changes, which Dovecot then assumes is caused by corruption.

The recommended solution has always been to redirect the same user to only a single server at the same time. User doesn't have to be permanently assigned there, but as long as a server has some of user's files cached, it should be the only server accessing the user's mailbox. Recently I was thinking about a way to make this possible with an SQL database.

The company here in Italy didn't really like such idea, so I thought about making it more transparent and simpler to manage. The result is a new "director" service, which does basically the same thing, except without SQL database. The idea is that your load balancer can redirect connections to one or more Dovecot proxies, which internally then figure out where the user should go. So the proxies act kind of like a secondary load balancer layer.

When a connection from a newly seen user arrives, it gets assigned to a mail server according to a function:

host = vhosts[ md5(username) mod vhosts_count ]

This way all of the proxies assign the same user to the same host without having to talk to each others. The vhosts[] is basically an array of hosts, except each host is initially listed there 100 times (vhost count=100). This vhost count can then be increased or decreased as necessary to change the host's load, probably automatically in future.

The problem is then of course that if (v)hosts are added or removed, the above function will return a different host than was previously used for the same user. That's why there is also an in-memory database that keeps track of username → (hostname, timestamp) mappings. Every new connection from user refreshes the timestamp. Also existing connections refresh the timestamp every n minutes. Once all connections are gone, the timestamp expires and the user is removed from database.

The final problem then is how multiple proxies synchronize their state. The proxies connect to each others forming a connection ring. For example with 4 proxies the connections would go like A → B → C → A. Each time a user is added/refreshed, a notification is sent to both directions in the ring (e.g. B sends to A and C), which in turn forward it until it reaches a server that has already seen it. This way if a proxy dies (or just hangs for a few seconds), the other proxies still get the changes without waiting for it to timeout. Host changes are replicated in the same way.

It's possible that two connections from a user arrive to different proxies while (v)hosts are being added/removed. It's also possible that only one of the proxies has seen the host change. So the proxies could redirect users to different servers during that time. This can be prevented by doing a ring-wide sync, during which all proxies delay assigning hostnames to new users. This delay shouldn't be too bad because a) they should happen rarely, b) it should be over quickly, c) users already in database can still be redirected during the sync.

The main complexity here comes from how to handle proxy server failures in different situations. Those are less interesting to describe and I haven't yet implemented all of it, so let's just assume that in future it all works perfectly. :) I was also thinking about writing a test program to simulate the director failures to make sure it all works.

Finally, there are the doveadm commands that can be used to:

  • List the director status:
    # doveadm director status
    mail server ip vhosts users 100 1312 50 1424

  • Add a new mail server (defaults are in dovecot.conf):
    # doveadm director add

  • Change a mail server's vhost count to alter its connection count (also works during adding):
    # doveadm director add 50

  • Remove a mail server completely (because it's down):
    # doveadm director remove

If you want to slowly get users away from a specific server, you can assign its vhost count to 0 and wait for its user count to drop to zero. If the server is still working while "doveadm director remove" is called, new connections from the users in that server are going to other servers while the old ones are still being handled.

Friday, March 19, 2010

Time to switch to clang

I actually wanted to start using clang a long time ago, but it didn't give enough warnings. Many important warnings, like not verifying printf() parameters, were completely missing. So I kept using gcc..

But in last few days this one guy started adding support for the missing gcc warnings. I also found out that printf() warnings were added within last few months also. So it looks like clang is finally potentially usable! I still have to actually start developing with it, but it looks promising.

This picture shows how much better clang's error and warning handling is compared to gcc.

I also did a few benchmarks with Dovecot:

  • Dovecot compiled about 10% faster with clang. Based on clang's web page I expected much more, but I guess it's better than nothing.. (I used configure --enable-optimizations, didn't change anything else)

  • Dovecot ran about 7% faster when I/O wasn't the limit (SSD disk, fsync_disable=yes).

Here's how I tested the 7% speed improvement (Dovecot v2.0 hg, Maildir):

imaptest seed=123 secs=300 msgs=100 delete=10 expunge=10 logout=1

gcc version 4.4.3 20100108 (prerelease) (Debian 4.4.2-9)

Logi List Stat Sele Fetc Fet2 Stor Dele Expu Appe Logo
100% 50% 50% 100% 100% 100% 50% 10% 10% 100% 1%
30% 5%
674 31545 31442 674 63169 90559 30029 5419 6332 19773 1348
646 31725 31640 646 63270 90160 29987 5403 6163 20224 1292

clang version 1.5 (trunk 98979)
Target: x86_64-unknown-linux-gnu

Logi List Stat Sele Fetc Fet2 Stor Dele Expu Appe Logo
100% 50% 50% 100% 100% 100% 50% 10% 10% 100% 1%
30% 5%
693 33927 33765 693 68032 96951 32356 5691 6786 21034 1386
674 33990 34027 674 68018 97428 32101 5823 6863 21260 1348

Saturday, March 13, 2010

Design: Asynchronous I/O for single/multi-dbox

(Mailing list thread for this post.)

The long term plan is to get all of Dovecot disk I/O asynchronous. The first step to that direction would be to make dbox/mdbox I/O asynchronous. This might also allow mbox/maildir to become partially asynchronous.

I already started describing how the lib-storage API could be changed to support high-latency storages. Adding support for all of the paralellism would be nice, but probably not necessary as the first step. Then again, the API changes are necessary, so the parallelism optimizations probably aren't a huge task on top of that.

Besides the API changes described in the above link, another change that's necessary is to add mail_get_nonblocking_stream(). The returned stream can return EAGAIN whenever it would need to block. Whenever sending message data to client, this stream would have to be used. It would also be used internally by all of message parsing/searching code. Luckily all of that already supports nonblocking input or can be easily modified to support it.

Below are some thoughts how to go forward with this. I originally thought about writing a more specific plan, but I think this is good enough now to start coding. The target Dovecot version is v2.1, not v2.0.

Filesystem API

The idea is to first abstract out all POSIX filesystem accessing in dbox/mdbox code. Non-blocking I/O works pretty nicely for socket I/O, so I thought I'd use a similar API for disk I/O as well:

handle = open(path, mode)

  • this function can't fail. it's executed asynchronously.

  • mode=read-only: no writing to file

  • mode=append: appending to file is allowed

handle = create(path, mode, permissions)

  • this function can't fail. it's executed asynchronously.

  • mode=fail-if-exists: commit() fails if file already exists

  • mode=replace-if-exists: commit() replaces file if it already exists

  • permissions: i'm not entirely sure how this works yet. but it should contain mode and gid in some format

set_input_callback(handle, callback, context)
set_output_callback(handle, callback, context)

  • call the callback when more data can be read/written

ret = pread(handle, buf, size, offset)

  • just like pread(), but can fail with EAGAIN if there are no bytes already buffered. so the idea is that the backend implementation would buffer/readahead data, which would be returned by this call. this would require memcpy()ing all data, but it might get too complex/fragile if it was asynchronously written to given buffer.

ret = write(handle, buf, size)

  • append data to given file and return how many bytes were actually added to write buffer. works in a similar way than writing to socket. data can only be appended to files, there is no support for overwriting data.

  • no writes will be visible to reads until commit() is called

ret = commit(handle, [filename])

  • commit all previous writes to disk. either returns success/EAGAIN.

  • if filename is given and a new file is being created, the filename is changed to the given one instead of using the original path's filename. this is needed because e.g. mdbox saving can write many temp files in a single transaction and only at commit stage it locks the index files and knows what the filenames will be.


  • rollback all previous writes.


  • if file was created and not committed, the temp file will be deleted

  • does implicit rollback

ret = try_lock(handle)

  • this isn't an asynchronous operation! it assumes that locking state is kept in memory, so that the operation will be fast. if backend doesn't support locking or it's slow, single-dbox should be used (instead of multi-dbox), because it doesn't need locking.

  • returns success or "already locked"

  • only exclusive locking is possible

Async IO streams

Async input streams are created with FS API handle, so it's possible to start reading from them before the file has even been open()ed. The callers must be aware of this and realize that read() can fail with ENOENT, etc.

Async input streams' read() would work exactly as file_istream works for non-blocking sockets: It would return data that is already buffered in memory. If there's nothing, it returns EAGAIN. The FS API's set_input_callback() can be used to set a callback function that is called whenever there's more data available in the buffer.

Async output streams also work the same as non-blocking file_ostreams: write() returns the number of bytes added to buffer. When buffer becomes full, it starts returning EAGAIN. The ostream handles flushing internally the same way as file_ostreams does, although instead of using io_add(IO_WRITE) it uses FS API's set_output_callback(). If callers need to know when more data can be written or when all of the data has been written, it can override the ostream's flush_callback, just like with file_ostreams.

Async IO for FS API backend

So now that all of the APIs have been designed, all that's needed to do is to write a simple FS API implementation using kernel's async IO API, right? Wrong. There is no usable async IO API in Linux, and probably nowhere else either:

  • POSIX AIO isn't supported by Linux kernel. And even if it was, it only supports async reads/writes, not async open().

  • Linux kernel has its own native AIO implementation! ..But it only works with direct IO, which makes it pretty much useless for almost everyone. There have been many different attempts to get buffered AIO support to Linux, but all of them have failed.

So for now the only practical way is to implement it with threads. There are several libraries that could make this easier.. But all of them enable (and require) full thread safety for libc calls. I don't really like that. Dovecot isn't using threads, I shouldn't pay the penalty of using extra locking when it's really not necessary. So I was thinking about doing the async IO in two ways:

  1. For Linux/x86/x86-64 (and maybe others) implement a version that creates threads with clone() and uses lockless queues for communicating between the async io worker threads. The threads won't use malloc() or any other unsafe calls, so this should be pretty nice.

  2. Fallback version that uses pthreads with mutexes.

Is 2) going to be noticeably slower than 1)? Probably not.. In both cases there is also the problem of how many worker threads to create. I've really no idea, kernel would be so much better in deciding that.. I guess that also depends on how many processes Dovecot is configured to run on (and how busy they are at the moment). Maybe it could be just a globally configurable number of threads/process, defaulting to something like 10.

dbox changes

Most of the dbox code shouldn't be too difficult to change to using the FS API. The one exception is writing messages. Currently the format looks like:

[dbox header magic (2 bytes)]
[dbox message header, which contains message size][lf]
[dbox metadata magic (2 bytes)][lf]
[dbox metadata]

The problem is that the message size isn't always known before the message is written. So currently the code just pwrite()s the size afterwards, but this is no longer possible with the new FS API. One possibility would be to buffer the message in memory, but that could waste a lot of memory since messages can be large.

So the only good solution is to change the dbox file format to contain the message size after the message. At the same time the dbox format could be cleaned up from old broken ideas as well. The reading code could support the old version format as well, because reading isn't a problem.

The new dbox format could look like:

[dbox header magic (2 bytes)]
[dbox unique separator (random 16 bytes or so)][lf]
[dbox metadata magic (2 bytes)]
[dbox unique separator, same as above][message size][lf]
[dbox metadata, last item being metadata size]

The unique separator exists there primarily for fixing corrupted dbox storage, so it can (quite) reliably recognize where a message ends.

Multi-dbox's indexes contain already the message offset + size of (message+metadata), i.e. offset+size = offset of next message. The size is primarily used to figure out if more data can be appended to the file (so it won't grow too large). This size could just be changed to become message's size and the check just changed to assume metadata size would be e.g. 512 bytes. It won't really matter in practise. So then the reading code could get the message's size from the index.

Single-dbox doesn't have such index. There are two ways to solve the problem there (both actually would work for multi-dbox too):
  1. Metadata block can be found by reading backwards from end of file. For example read the last 512 bytes of the file and find the metadata magic block. Get the message's size from there.

  2. Create an istream that simply reads until it reaches dbox metadata magic and unique separator. And to be absolutely sure it reached the correct end of message, it can also check that the rest of the data in the stream contains only a valid metadata block.

Friday, February 19, 2010

I'm web 2.0?

A friend of mine kept bugging me about creating a blog, and after a few days I finally started thinking that perhaps it might be a good idea after all. I post some things to Dovecot mailing list every once in a while that might be interesting to casual followers, but it's difficult to find them among all the other traffic.

Lets start with posting links to some old design ideas (hopefully to be implemented in not too distant future):

I'm also considering on changing how quota full situations are handled in v2.0, unless someone comes up with a good reason not to do it.