Saturday, March 13, 2010

Design: Asynchronous I/O for single/multi-dbox

(Mailing list thread for this post.)

The long term plan is to get all of Dovecot disk I/O asynchronous. The first step to that direction would be to make dbox/mdbox I/O asynchronous. This might also allow mbox/maildir to become partially asynchronous.

I already started describing how the lib-storage API could be changed to support high-latency storages. Adding support for all of the paralellism would be nice, but probably not necessary as the first step. Then again, the API changes are necessary, so the parallelism optimizations probably aren't a huge task on top of that.

Besides the API changes described in the above link, another change that's necessary is to add mail_get_nonblocking_stream(). The returned stream can return EAGAIN whenever it would need to block. Whenever sending message data to client, this stream would have to be used. It would also be used internally by all of message parsing/searching code. Luckily all of that already supports nonblocking input or can be easily modified to support it.

Below are some thoughts how to go forward with this. I originally thought about writing a more specific plan, but I think this is good enough now to start coding. The target Dovecot version is v2.1, not v2.0.

Filesystem API


The idea is to first abstract out all POSIX filesystem accessing in dbox/mdbox code. Non-blocking I/O works pretty nicely for socket I/O, so I thought I'd use a similar API for disk I/O as well:

handle = open(path, mode)

  • this function can't fail. it's executed asynchronously.

  • mode=read-only: no writing to file

  • mode=append: appending to file is allowed

handle = create(path, mode, permissions)

  • this function can't fail. it's executed asynchronously.

  • mode=fail-if-exists: commit() fails if file already exists

  • mode=replace-if-exists: commit() replaces file if it already exists

  • permissions: i'm not entirely sure how this works yet. but it should contain mode and gid in some format

set_input_callback(handle, callback, context)
set_output_callback(handle, callback, context)

  • call the callback when more data can be read/written

ret = pread(handle, buf, size, offset)

  • just like pread(), but can fail with EAGAIN if there are no bytes already buffered. so the idea is that the backend implementation would buffer/readahead data, which would be returned by this call. this would require memcpy()ing all data, but it might get too complex/fragile if it was asynchronously written to given buffer.

ret = write(handle, buf, size)

  • append data to given file and return how many bytes were actually added to write buffer. works in a similar way than writing to socket. data can only be appended to files, there is no support for overwriting data.

  • no writes will be visible to reads until commit() is called

ret = commit(handle, [filename])

  • commit all previous writes to disk. either returns success/EAGAIN.

  • if filename is given and a new file is being created, the filename is changed to the given one instead of using the original path's filename. this is needed because e.g. mdbox saving can write many temp files in a single transaction and only at commit stage it locks the index files and knows what the filenames will be.

rollback(handle)

  • rollback all previous writes.

close(handle)

  • if file was created and not committed, the temp file will be deleted

  • does implicit rollback

ret = try_lock(handle)

  • this isn't an asynchronous operation! it assumes that locking state is kept in memory, so that the operation will be fast. if backend doesn't support locking or it's slow, single-dbox should be used (instead of multi-dbox), because it doesn't need locking.

  • returns success or "already locked"

  • only exclusive locking is possible


Async IO streams


Async input streams are created with FS API handle, so it's possible to start reading from them before the file has even been open()ed. The callers must be aware of this and realize that read() can fail with ENOENT, etc.

Async input streams' read() would work exactly as file_istream works for non-blocking sockets: It would return data that is already buffered in memory. If there's nothing, it returns EAGAIN. The FS API's set_input_callback() can be used to set a callback function that is called whenever there's more data available in the buffer.

Async output streams also work the same as non-blocking file_ostreams: write() returns the number of bytes added to buffer. When buffer becomes full, it starts returning EAGAIN. The ostream handles flushing internally the same way as file_ostreams does, although instead of using io_add(IO_WRITE) it uses FS API's set_output_callback(). If callers need to know when more data can be written or when all of the data has been written, it can override the ostream's flush_callback, just like with file_ostreams.

Async IO for FS API backend


So now that all of the APIs have been designed, all that's needed to do is to write a simple FS API implementation using kernel's async IO API, right? Wrong. There is no usable async IO API in Linux, and probably nowhere else either:


  • POSIX AIO isn't supported by Linux kernel. And even if it was, it only supports async reads/writes, not async open().

  • Linux kernel has its own native AIO implementation! ..But it only works with direct IO, which makes it pretty much useless for almost everyone. There have been many different attempts to get buffered AIO support to Linux, but all of them have failed.



So for now the only practical way is to implement it with threads. There are several libraries that could make this easier.. But all of them enable (and require) full thread safety for libc calls. I don't really like that. Dovecot isn't using threads, I shouldn't pay the penalty of using extra locking when it's really not necessary. So I was thinking about doing the async IO in two ways:


  1. For Linux/x86/x86-64 (and maybe others) implement a version that creates threads with clone() and uses lockless queues for communicating between the async io worker threads. The threads won't use malloc() or any other unsafe calls, so this should be pretty nice.

  2. Fallback version that uses pthreads with mutexes.



Is 2) going to be noticeably slower than 1)? Probably not.. In both cases there is also the problem of how many worker threads to create. I've really no idea, kernel would be so much better in deciding that.. I guess that also depends on how many processes Dovecot is configured to run on (and how busy they are at the moment). Maybe it could be just a globally configurable number of threads/process, defaulting to something like 10.

dbox changes


Most of the dbox code shouldn't be too difficult to change to using the FS API. The one exception is writing messages. Currently the format looks like:

[dbox header magic (2 bytes)]
[dbox message header, which contains message size][lf]
[message]
[dbox metadata magic (2 bytes)][lf]
[dbox metadata]
[lf]

The problem is that the message size isn't always known before the message is written. So currently the code just pwrite()s the size afterwards, but this is no longer possible with the new FS API. One possibility would be to buffer the message in memory, but that could waste a lot of memory since messages can be large.

So the only good solution is to change the dbox file format to contain the message size after the message. At the same time the dbox format could be cleaned up from old broken ideas as well. The reading code could support the old version format as well, because reading isn't a problem.

The new dbox format could look like:

[dbox header magic (2 bytes)]
[dbox unique separator (random 16 bytes or so)][lf]
[message]
[dbox metadata magic (2 bytes)]
[dbox unique separator, same as above][message size][lf]
[dbox metadata, last item being metadata size]
[lf]

The unique separator exists there primarily for fixing corrupted dbox storage, so it can (quite) reliably recognize where a message ends.

Multi-dbox's indexes contain already the message offset + size of (message+metadata), i.e. offset+size = offset of next message. The size is primarily used to figure out if more data can be appended to the file (so it won't grow too large). This size could just be changed to become message's size and the check just changed to assume metadata size would be e.g. 512 bytes. It won't really matter in practise. So then the reading code could get the message's size from the index.

Single-dbox doesn't have such index. There are two ways to solve the problem there (both actually would work for multi-dbox too):
  1. Metadata block can be found by reading backwards from end of file. For example read the last 512 bytes of the file and find the metadata magic block. Get the message's size from there.

  2. Create an istream that simply reads until it reaches dbox metadata magic and unique separator. And to be absolutely sure it reached the correct end of message, it can also check that the rest of the data in the stream contains only a valid metadata block.

blog comments powered by Disqus