Friday, November 8, 2013

Dovecot MTA

I've never really wanted to create my own MTA, because I like Postfix quite a lot. And I always thought it would require a horrible amount of time to create something that was anywhere even close to having Postfix's features. (I would shudder to even think about recreating Dovecot from scratch nowadays.) But slowly over time I've also been thinking of ways things could be done a bit better, and I think I have enough ideas to start thinking about Dovecot MTA more seriously in a few more months (after my current busy schedule calms down a bit). And (unlike Dovecot!) I'm not planning on taking over the world with the MTA (or at least not very quickly), but it would definitely be useful for many installations I know of.

My main design goals for the MTA are:
  • Under normal load, don't queue mails; just keep delivering the mail through the different processes/services until it succeeds or fails, and only after that return ok/failure to the SMTP client. So there's no (forced) post-queue filtering, everything would normally happen pre-queue. This is required because in Germany (and the EU in general?) you aren't allowed to just drop spam after the SMTP server has responded OK to the client, even if you're 100% sure it's spam. This also means that the SMTP DATA replies will come more slowly, which means that the SMTP server must be able to handle a lot more concurrent SMTP connections, which means that in large installations the smtpd process must be able to asynchronously handle multiple SMTP client connections.
  • In some cases you can't really avoid placing mails into a queue. This could be because of temporary failures or maybe because of an abnormal load spike. A mail queue in local disk isn't very nice though, because if the local disk dies, the queued mails are lost. Dovecot MTA will allow the queue to be in object storage and it will also likely support replication (similar to current dsync replication). In both of these cases if a server dies, another server can quickly take over its queue and continue handling it.
  • Dovecot MTA is a new product, which means we can add some requirements to how it's being used, especially related to securely sending emails between servers. It could do a bunch of checks at startup and fail to even start if everything isn't correct. Here are some things I had in mind - not sure if all of these are good ideas or not:
    • Require DKIM configuration. All outgoing mails will be DKIM signed.
    • Require the domain's DNS to contain a _submission._tcp SRV record (and actually might as well require _imap._tcp too) - see the example records after this list
    • Require SSL certificates to be configured and always allow remote to use STARTTLS
    • Require DANE TLSA record to exist and match the server's configured SSL cert
    • Have very good (and strict?) DNSSEC support. If we know a remote server is supposed to have valid DNSSEC entries, but doesn't, fail to deliver mail entirely?
    • Add a new DNS record that advertises this is a Dovecot MTA (or compatible). If such entry is found (especially when correctness is guaranteed by DNSSEC), the email sender can assume that certain features exist and work correctly. If they don't, it could indicate an attack and the mail sending should be retried later. This DNS record would of course be good to try to standardize.
  • Configuration: It would take years to implement all of the settings that Postfix has, but I don't think that's going to be necessary. In fact I think the number of new settings that Dovecot MTA requires in dovecot.conf would be very minimal. Instead nearly all of the configuration could be done using Sieve scripts. We'd need to implement some new MTA-specific Sieve extensions and a few core features/configurations/databases that the scripts can use, but after that there wouldn't really be any limits to what could be done with them.
  • Try to implement as many existing interfaces as possible (e.g. Milter and various Postfix APIs like policy servers) so that it wouldn’t be necessary to reimplement all the tools and filters.
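To illustrate the DNS-related requirements above, the zone of a hosted mail domain might contain records roughly like these (hostnames are made up; the SRV records follow RFC 6186 and the TLSA record RFC 6698):

_submission._tcp.example.com. IN SRV 0 1 587 mail.example.com.
_imap._tcp.example.com.       IN SRV 0 1 143 mail.example.com.
_25._tcp.mail.example.com.    IN TLSA 3 1 1 <sha-256 of the server's certificate public key>

The "this is a Dovecot MTA (or compatible)" record mentioned above is only a proposal at this point, so it isn't shown here.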
So perhaps something like this could be done in time for Dovecot v2.4. Any thoughts/ideas/suggestions?

Tuesday, February 28, 2012

Dovecot clustering with dsync-based replication

This document describes a design for a dsync-replicated Dovecot cluster. This design can be used to build at least two different types of dsync clusters, which are both described here. Ville has also drawn overview pictures of these two setups, see director/NFS-based cluster and SSH-based cluster.

First of all, why dsync replication instead of block level filesystem replication?

  • dsync won't replicate filesystem corruption.
  • A cold restart of replication won't go through all of the data in the disks, but instead quickly finds out what has changed.
  • Split brain won't result in downtime or losing any data. If both sides did changes, the changes are merged without data loss.
  • If using more than 2 storages, the users' replicas can be divided among the other storages. So if one storage goes down, the extra load is shared by all the other storages, not just one.

Replication mail plugin

This is a simple plugin based on notify plugin. It listens for all changes that happen to mailboxes (new mails, flag changes, etc.) Once it sees a change, it sends an asynchronous (username, priority) notification to replication-notify-fifo. The priority can be either high (new mails) or low (everything else).

Optionally the replication plugin can also support synchronous replication of new mail deliveries. In that case it connects to the replication-notify UNIX socket, tells it to replicate the user with sync (=highest) priority, and waits until that is done or replication_sync_timeout occurs. The IMAP/LMTP client won't see an "OK" reply until the mail is replicated (or the replication has failed). The synchronous replication probably adds a noticeable delay, so it might not be acceptable for IMAP, but might be for LMTP.

So, what is listening on those replication-notify* sockets? It depends on whether Dovecot is running in a director-based setup or not.

Aggregator

When running in a Dovecot director-based setup, all of the Dovecot backends (where the replication plugin runs) also run an "aggregator" process. Its job is very simple: it proxies the notifications from the mail plugin and sends them via a single TCP connection to the replicator process running on the Dovecot proxies. This is simply an optimization to avoid tons of short-lived TCP connections directly from the replication plugin to the director server.

When not running in Dovecot director setup (i.e. there is only a single Dovecot instance that handles all of the users), there is no point in having an aggregator proxy, because the replicator process is running on the same server. In this kind of setup the replicator process directly listens on the replication-notify* sockets.

Replicator

The initial design for replicator isn't very complex either: It keeps a priority queue of all users, and replicates the users at the top of the queue. Notifications about changes to a user's mailboxes (may) move the user up in the priority queue. If the user at the top of the queue has already been replicated "recently enough", the replicator stops its work until new changes arrive or until "recently enough" no longer holds.

dsync can do two types of syncs: quick syncs and full syncs. A quick sync trusts indexes and does the replication with the least amount of work and network traffic. A quick sync is normally enough to replicate all changes, but just in case something has gone wrong there's also the full sync option, which guarantees that the mailboxes end up being fully synced. A full sync is slower though, and uses more network traffic.

The priority queue is sorted by:

  • 1. Priority (updated by a notification from replication plugin)
  • 2. If priority!=none: Last fast sync (those users are replicated first whose last replication time is oldest)
  • 2. If priority=none: Last full sync (these users should already be fully synced, but do a full sync for them once in a while anyway)
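Put together, the sort order above could be expressed roughly as the following comparison function (a hypothetical C sketch; the struct and field names are made up):

#include <time.h>

enum replication_priority { PRIORITY_NONE = 0, PRIORITY_LOW, PRIORITY_HIGH, PRIORITY_SYNC };

struct replication_user {
	enum replication_priority priority;
	time_t last_fast_sync, last_full_sync;
};

/* returns <0 if u1 should be replicated before u2 */
static int replication_user_cmp(const struct replication_user *u1,
				const struct replication_user *u2)
{
	/* 1. higher priority first */
	if (u1->priority != u2->priority)
		return u1->priority > u2->priority ? -1 : 1;
	if (u1->priority != PRIORITY_NONE) {
		/* 2. priority != none: oldest fast sync first */
		if (u1->last_fast_sync != u2->last_fast_sync)
			return u1->last_fast_sync < u2->last_fast_sync ? -1 : 1;
	} else {
		/* 2. priority = none: oldest full sync first */
		if (u1->last_full_sync != u2->last_full_sync)
			return u1->last_full_sync < u2->last_full_sync ? -1 : 1;
	}
	return 0;
}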
All users get added to the replication queue at replicator startup with "none" priority. The list of users is looked up via userdb iteration. If the previous replication state is found from a disk dump, it's used to update the priorities, last_*_sync timestamps and other replication state. Replicator process creates such dumps periodically [todo: every few mins? maybe a setting?].

Replicator starts replicating users at the top of the queue, setting their priorities to "none" before starting. This means that if another change notification arrives during replication, the priority is bumped up and no changes get lost. replication_max_conns setting specifies how many users are replicated simultaneously. If the user's last_full_sync is older than replication_full_sync_interval setting, a full sync is done instead of a fast sync. If the user at the top of the queue has "none" priority and the last_full_sync is newer than replication_full_sync_interval, the replication stops. [todo: it would be nice to prefer doing all the full syncs at night when there's hopefully less disk I/O]

(A global replication_max_conns setting isn't optimal in proxy-based setup, where different backend servers are doing the replication. There it should maybe be a per-backend setting. Then again, it doesn't account for the replica servers that also need to do replication work. Also to properly handle this each backend should have its own replication queue, but this requires doing a userdb lookup for each user to find out their replication server, and this would need to be done periodically in case the backend changes, which can easily happen often with director-based setup. So all in all, none of this is being done in the initial implementation. Ideally the users are distributed in a way that a global replication queue would work well enough.)

In director-based setup each director runs a replicator server, but only one of them (master) actually asks the backends to do the replication. The rest of them just keep track of what's happening, and if the master dies or hangs, one of the others becomes the new master. The server with lowest IP address is always the master. The replicators are connected to a ring like the directors, using the same director_servers setting. The communication between them is simply about notifications of what's happening to users' priorities. Preferably the aggregators would always connect to the master server, but this isn't required. In general there's not much that can go wrong, since it's not a problem if two replicators request a backend to start replication for the same user or if the replication queue states aren't identical.

If the replication is running too slowly [todo: means what exactly?], log a warning and send an email to admin.

So, how does the actual replication happen? Replicator connects to doveadm server and sends a "sync -u user@domain" command. In director-based setup the doveadm server redirects this command to the proper backend.

doveadm sync

This is an independent feature from all of the above. Even with none of it implemented, you could run this to replicate a user. Most of this is already implemented. The only problem is that currently you need to explicitly tell it where to sync. So, when the destination isn't specified, it could do a userdb lookup and use the returned "mail_replica" field as the destination. Multiple (sequentially replicated) destinations could be supported by returning "mail_replica2", "mail_replica3" etc. field.

In NFS-based (or shared filesystem-based in general) setup the mail_replica setting is identical to mail_location setting. So your primary mail_location would be in /storage1/user/Maildir, while the secondary mail_replica would be in /storage2/user/Maildir. Simple.

In a non-NFS-based setup two Dovecot servers talk the dsync protocol to each other. Currently dsync already supports SSH-based connections. It would also be easy to implement direct TCP-based connections between two doveadm servers. In the future these connections could be SSL-encrypted. Initially I'm only supporting SSH-based connections, as they're already implemented. So what does the mail_replica setting look like in this kind of a setup? I'm not entirely sure. I'm thinking that it could be either "ssh:host" or "ssh:user@host", where user is the SSH login user (this is the opposite of the current doveadm sync command line usage). In the future it could also support tcp:host[:port]. Both the ssh: and tcp: prefixes would also be supported by the doveadm sync command line (and perhaps the prefixless user@domain would be deprecated).
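As a rough sketch of what this could look like (the exact userdb fields aren't final, and the hostnames here are made up), the userdb could simply return the replica location as an extra field:

userdb {
  driver = static
  args = uid=vmail gid=vmail home=/var/vmail/%u mail_replica=ssh:vmail@mail2.example.com
}

With that in place, "doveadm sync -u user@example.com" could look up the destination itself instead of requiring it on the command line.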

dsync can run without any long-lived locking and it typically works fine. In case a mailbox was modified during dsync, the replicas may not end up being identical, but nothing breaks. dsync currently usually notices this and logs a warning. When the conflicting changes were caused by imap/pop3/lda/etc. this isn't a problem, because they've already notified the replicator to perform another sync that will fix it.

Running two dsyncs at the same time is more problematic though, mainly related to new emails. Both dsyncs notice that mail X needs to be replicated, so both save it and it results in having a duplicate. To avoid this, there should be a dsync-lock. If this lock exists, dsync should wait until the previous dsync is done and then do it again, just in case there were more changes since the previous sync started.

This should conclude everything needed for replication itself.

High-availability NFS setup

Once you have replication, it's of course nice if the system automatically recovers from a broken storage. In NFS-based setups the idea is to do soft mounts, so if the NFS server goes away things start failing with EIO errors, which Dovecot notices and switches to using the secondary storage(s).

In v2.1.0 Dovecot already keeps track of mounted filesystems. Initially they're all marked as "online". When multiple I/O errors occur in a filesystem [todo: how many exactly? where are these errors checked, all around in the code or checking the log?] the mountpoint is marked as "offline" and the connections accessing that storage are killed [todo: again how exactly?].

Another job for replication plugin is to hook into namespace creation. If mail_location points to a mountpoint marked as "offline", it's replaced with mail_replica. This way the user can access mails from the secondary storage without downtime. If the replica isn't fully up to date, this means that some of the mails (or other changes) may temporarily be lost. These will come back again after the original storage has come back up and replication has finished its job. So as long as mails aren't lost in the original storage, there won't be any permanent mail loss.

When an offline storage comes back online, its mountpoint's status is initially changed to "failover" (as opposed to "online"). During this state the replication plugin works a bit differently when the user's primary mail_location is in this storage: It first checks if the user is fully replicated, and if so uses the primary storage, otherwise it uses the replica storage. Long-running IMAP processes check the replication state periodically and kill themselves once the user is replicated, to move back to the primary storage.

Once replicator notices that all users have been replicated, it tells the backends to change the "failover" state to "online" (via doveadm server).

High-availability non-NFS setup

One possibility is to use Dovecot proxies, which know which servers are down. Instead of directing users to those servers, it would direct them to replica servers. The server states could be handled similar to NFS setup's online vs. failover vs. offline states.

Another possibility would be to do the same as above, except without separate proxy servers. Just make "mail.example.com" DNS point to two IP addresses, and if one Dovecot notices that it's not the user's primary server, it proxies to the secondary server, unless it's down. If one IP is down, clients hopefully connect to the other.

Monday, February 13, 2012

Dovecot v2.2 plans

(Mailing list thread for this post.)

Here's a list of things I've been thinking about implementing for Dovecot v2.2. Probably not all of them will make it, but I'm at least interested in working on these if I have time.

Previously I've mostly been working on things that different companies were paying me to work on. This is the first time I have my own company, but the prioritization still works pretty much the same way:

  • 1. priority: If your company is highly interested in getting something implemented, we can do it as a project via my company. This guarantees that you'll get the feature implemented in a way that integrates well into your system.
  • 2. priority: Companies who have bought Dovecot support contract can let me know what they're interested in getting implemented. It's not a guarantee that it gets implemented, but it does affect my priorities. :)
  • 3. priority: Things other people want to get implemented.
There are also a lot of other things I have to spend my time on, which come before priority 2 above. I guess we'll see how things work out.

dsync-based replication


I'll write a separate post about this later. Besides, it's coming for Dovecot v2.1 so it's a bit off topic, but I thought I'd mention it anyway.

Shared mailbox improvements


Support for private flags for all mailbox formats:
namespace {
  type = public
  prefix = Public/
  mail_location = mdbox:/var/vmail/public:PVTINDEX=~/mdbox/indexes-public
}
  • dsync needs to be able to replicate the private flags as well as shared flags.
  • might as well add a common way for all mailbox formats to specify which flags are shared and which aren't. $controldir/dovecot-flags would say which is the default (private or shared) and which flags/keywords are the opposite - see the hypothetical example after this list.
  • easy way to configure shared mailboxes to be accessed via imapc backend, which would allow easy shared mailbox access across servers or simply between two system users on the same server. (this may be tricky for dsync.)
  • global ACLs read from a single file supporting wildcards, instead of multiple different files
  • default ACLs for each namespace/storage root (maybe implemented using the above..)
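Regarding the dovecot-flags file mentioned above, a purely hypothetical example of its contents could be:

default: shared
private: \Seen $Forwarded

i.e. flags and keywords are shared by default, except \Seen and the $Forwarded keyword stay private to each user.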

Metadata / annotations


Add support for server, mailbox and mail annotations. These need to be dsyncable, so their changes need to be stored in various .log files:
  1. Per-server metadata. This is similar to subscriptions: Add changes to the dovecot.mailbox.log file, with each entry name being a hash of the metadata key that was changed.
  2. Per-mailbox metadata. Changes to this belong inside mailbox_transaction_context, which writes the changes to the mailbox's dovecot.index.log files. Each log record contains a list of changed annotation keys. This gives each change a modseq, and also allows easily finding out what changes other clients have done, so if a client has done ENABLE METADATA Dovecot can easily push metadata changes to the client by only reading the dovecot.index.log file.
  3. Per-mail metadata. This is pretty much equivalent to per-mailbox metadata, except changes are associated to specific message UIDs.
The permanent storage is in dict. The dict keys have components:

  • priv/ vs. shared/ for specifying private vs. shared metadata
  • server/ vs mailbox/[mailbox guid]/ vs. mail/[mailbox guid]/[uid]
  • the metadata key name
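For example, the dict keys could end up looking something like this (the GUID, UID and key names here are made up):

priv/server/comment
shared/mailbox/3df1c4b2a6e94c5d/comment
priv/mail/3df1c4b2a6e94c5d/1523/label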
This would be a good time to improve the dict configuration to allow things like:
  • mixed backends for different hierarchies (e.g. priv/mailbox/* goes to a file, while the rest goes to sql)
  • allow sql dict to be used in more relational way, so that mail annotations could be stored with tables: mailbox (id, guid) and mail_annotation (mailbox_id, key, value), i.e. avoid duplicating the guid everywhere.
Things to think through:
  • How to handle quota? Probably needs to be different from regular mail quota. Probably some per-user "metadata quota bytes" counter/limit.
  • Dict lookups should be done asynchronously and prefetched as much as possible. For per-mail annotation lookups mail_alloc() needs to include a list of annotations that are wanted.

Configuration


Copy all mail settings to namespaces, so it'll be possible to use per-namespace mailbox settings. Especially important for imapc_* settings, but can be useful for others as well. Those settings that aren't explicitly defined in the namespace will use the global defaults. (Should doveconf -a show all of these values, or simply the explicitly set values?)

Get rid of *.conf.ext files. Make everything part of dovecot.conf, so doveconf -n outputs ALL of the configuration. There are mainly 3 config files I'm thinking about: dict-sql, passdb/userdb sql, passdb/userdb ldap. The dict-sql is something I think needs a bigger redesign (mentioned above in "Metadata" section), but the sql/ldap auth configs could be merged. One way could be:

sql_db sqlmails {
  # most settings from dovecot-sql.conf.ext, except for queries
  driver = mysql
  connect = ...
}

ldap_db ldapmails {
  # most settings from dovecot-ldap.conf.ext, except attributes/filters
}

passdb {
  driver = sql
  db = sqlmails
  sql_query = select password from users where username = '%u'
}
passdb {
  driver = ldap
  db = ldapmails
  ldap_attributes {
    password = %{ldap:userPassword}
  }
  ldap_filter = ...
}
The sql_db {} and ldap_db {} would be generic enough to be used everywhere (e.g. dict-sql), not just for passdb/userdb.

Some problems:

  • Similar to the per-namespace mail settings, doveconf -a would output all sql_query, ldap_attributes, ldap_filter, etc. settings for all passdbs/userdbs. Perhaps a similar solution?
  • The database configs contain passwords, so they should be readable only by root. This makes running dovecot-lda and maybe doveadm difficult, since they fail with "permission denied" when trying to open the config. There are probably only two solutions: a) the db configs need to be !include_try'd, or b) the configs can be world-readable, but only the passwords are placed in root-readable-only files by using "password = <db.password"

IMAP state saving/restoring


IMAP connections are often long running. Problems with this:
  1. Currently each connection requires a separate process (at least to work reliably), which means each connection also uses quite a lot of memory even when it isn't doing anything for a long time.
  2. Some clients don't handle lost connections very nicely. So Dovecot can't be upgraded without causing some user annoyance. Also in a cluster if you want to bring down one server, the connections have to be disconnected before they can be moved to another server.
If IMAP session state could be reliably saved and later restored to another process, both of the above problems could be avoided entirely. Typically when a connection is IDLEing there are really just 4 things that need to be remembered: username, selected mailbox name, its UIDVALIDITY and HIGHESTMODSEQ. With this information the IMAP session can be fully restored in another process without losing any state. So, what we could do is:
  1. When an IMAP connection has been IDLEing for a while (configurable initial time, could be dynamically adjusted):
    • move the IMAP state and the connection fd to imap-idle process
    • the old imap process is destroyed
    • imap-idle process can handle lots of IMAP connections
    • imap-idle process also uses inotify/etc. to watch for changes in the specified mailbox
    • if any mailbox changes happen or IMAP client sends a command, start up a new imap process, restore the state and continue from where we left off
    • This could save quite a lot of memory at the expense of some CPU usage
  2. Dovecot proxy <-> backend protocol could be improved to support moving connection to another backend. Possibly using a separate control connection to avoid making the proxying less efficient in normal operation.
  3. When restarting Dovecot, move all the connections to a process that keeps the connections open for a while. When Dovecot starts up again, hand the connections back to newly created imap processes. This allows changing configuration for existing client connections (which sometimes may be bad! need to add checks against client-visible config conflicts), upgrading Dovecot, etc. without being visible to clients. The only problem is SSL connections: OpenSSL doesn't provide a way to save/restore state, so either you need to set shutdown_clients=no (and possibly keep some imap-login processes doing SSL proxying for a long time), or SSL connections need to be killed. Of course the SSL handling could be outsourced to some other software/hardware outside Dovecot.
The IMAP state saving isn't always easy. Initially it could be implemented only for the simple cases (which are a majority) and later extended to cover more.
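As a sketch, the minimal state described above for an IDLEing connection could be represented with something like this (a hypothetical struct, not an existing Dovecot API):

#include <stdint.h>

/* everything needed to recreate an IDLEing IMAP session in another process */
struct imap_idle_state {
	const char *username;
	const char *mailbox;      /* selected mailbox name */
	uint32_t uidvalidity;     /* UIDVALIDITY of the selected mailbox */
	uint64_t highest_modseq;  /* HIGHESTMODSEQ the client has seen */
	int fd;                   /* client socket fd passed to the new process */
};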

IMAP extensions


  • CATENATE is already implemented by Stephan
  • URLAUTH is also planned to be implemented, somewhat differently than in Apple's patch. The idea is to create a separate imap-urlauth service that provides extra security.
  • NOTIFY extension could be implemented efficiently using mailbox list indexes, which already exist in v2.1.
  • FILTERS extension can be easily implemented once METADATA is implemented
  • There are also other missing extensions, but they're probably less important: BINARY & URLAUTH=BINARY, CONVERT, CONTEXT=SORT, CREATE-SPECIAL-USE, MULTISEARCH, UTF8=* and some i18n stuff.

Backups


Filesystem based backups have worked well enough with Dovecot in the past. But with new features like single instance storage it's becoming more difficult. There's no 100% consistent way to even get filesystem level backups with SIS enabled, because deleting both the message file and its attachment files can't be done atomically (although usually this isn't a real problem). Restoring SIS mails is more difficult though: first you need to restore the dbox mail files, then you need to figure out which attachment files from SIS need to be restored, and finally you'll need to do a doveadm import to put them into their final destination.

I don't have much experience with backup software, but other people in my company do. The initial idea is to implement a Dovecot backup agent for one (commercial) backup software, which allows doing online backups and restoring mails one user/mailbox/mail at a time. I don't know the details yet of how exactly this is going to be implemented, but the basic plan is probably to implement a "backup" mail storage backend, which is a PostgreSQL pg_dump-like flat file containing mails from all mailboxes. doveadm backup/import can then export/import this format via stdout/stdin. Incremental backups could possibly be done by giving a timestamp of the previous backup run (I'm not sure about this yet).

Once I've managed to implement the first fully functional backup agent, it should become clearer how to implement it to other backup solutions.

Random things


  • dovecot.index.cache file writing is too complex, should be simplified
  • Enable auth_debug[_passwords]=yes on-the-fly for some specific users/IPs via doveadm
  • Optimize virtual mailboxes using mailbox list indexes. It would no longer need to keep all the backend mailboxes' index files open.
  • Would be nice to go forward with supporting key-value databases as mail storage backends.

Monday, July 19, 2010

(Single instance) attachment storage

(Mailing list thread for this post.)

Now that v2.0.0 is only waiting for people to report bugs (and me to figure out how to fix them), I've finally had time to start doing what I actually came here (Portugal Telecom/SAPO) to do. :)

The idea is to have dbox and mdbox support saving attachments (or MIME parts in general) to separate files, which with some magic gives a possibility to do single instance attachment storage. Comments welcome.

Reading attachments


dbox metadata would contain entries like (this is a wrapped single line entry):

X1442 2742784 94/b2/01f34a9def84372a440d7a103a159ac6c9fd752b
2744378 27423 27/c8/a1dccc34d0aaa40e413b449a18810f600b4ae77b

So the format is:

"X" 1*(<offset> <byte count> <link path>)

So when reading a dbox message body, it's read as:

offset=0: <first 1442 bytes from dbox body>
offset=1442: <next 2742784 bytes from external file>
offset=2744226: <next 152 bytes from dbox body>
offset=2744378: <next 27423 bytes from external file>
offset=2771801: <the rest from dbox body>

This is all done internally by creating a single istream that lazily opens the external files only when data is actually tried to be read from that part of the message.

The link paths don't have to be in any specific format. In future perhaps it can recognize different formats (even http:// urls and such).

Saving attachments separately


The message's MIME structure is parsed while the message is saved. After each MIME part's headers are parsed, it's determined whether that part should be stored into attachment storage. By default it only checks that the MIME part isn't multipart/* (because then its child parts would contain the attachments). Plugins can also override this. For example they could try to determine if the commonly used clients/webmails always download and show the MIME part when opening the mail (text/*, inline images, etc).

dbox_attachment_min_size specifies the minimum MIME part size that can be saved as an attachment. Anything smaller than that will be stored normally. While reading a potential attachment MIME part's body, it's first buffered into memory until the min. size is reached. After that the attachment file is actually created and the buffer is flushed to it.

Each attachment filename contains a global UID part, so that no two (even identical) attachments will ever have the same filename. But there can be multiple attachment storages on different mount points, and each one could be configured to do deduplication internally. So identical attachments should somehow be stored to the same storage. This is done by taking a hash of the body and using a part of it as the path to the file. For example:

mail_location = dbox:~/dbox:ATTACHMENTS=/attachments/$/$

Each $ would be expanded to 8 bits of the hash in hex (00..ff). So the full path to an attachment could look like:

/attachments/04/f1/5ddf4d05177b3b4c7a7600008c4a11c1

The sysadmin can then create /attachments/00..ff as symlinks to different storages.
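A minimal sketch of how the $/$ expansion could map a hash to a path (assuming the first two bytes of the hash are used; the hash bytes and GUID below are made up):

#include <stdio.h>

int main(void)
{
	/* first bytes of the attachment body's hash */
	const unsigned char hash[] = { 0x04, 0xf1 };
	/* made-up global UID part of the filename */
	const char *guid = "5ddf4d05177b3b4c7a7600008c4a11c1";
	char path[256];

	snprintf(path, sizeof(path), "/attachments/%02x/%02x/%s",
		 hash[0], hash[1], guid);
	printf("%s\n", path); /* -> /attachments/04/f1/5ddf4d05177b3b4c7a7600008c4a11c1 */
	return 0;
}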

Hashing problems


Some problematic design decisions:
  1. Hash is taken from hardcoded first n kB vs. first dbox_attachment_min_size bytes?
    • + With first n kB, dbox_attachment_min_size can be changed without causing duplication of attachments, otherwise after the change the same attachment could get a hash to a different storage than before the change.

    • - If n kB is larger than dbox_attachment_min_size, it uses more memory.

    • - If n kB is determined to be too small to get uniform attachment distribution to different storages, it can't be changed without recompiling.


  2. Hash is taken from first n bytes vs. everything?

    • + First n bytes are already read to memory anyway and can be hashed efficiently. The attachment file can be created without wasting extra memory or disk I/O. If everything is hashed, the whole attachment has to be first stored to memory or to a temporary file and from there written to final storage.

    • - With first n bytes it's possible for an attacker to generate lots of different large attachments that begin with the same bytes and then overflow a single storage. If everything is hashed with a secure hash function and a system-specific secret random value is added to the hash, this attack isn't possible.

I'm thinking that even though taking a hash of everything is the least efficient option, it's the safest option. It's pretty much guaranteed to give a uniform distribution across all storages, even against intentional attacks. Also the worse performance probably isn't that noticeable, especially assuming a system where the local disk isn't used for storing mails, and the temporary files would be created there.

Single instance storage


All of the above assumes that if you want a single instance storage, you'll need to enable it in your storage. Now, what if you can't do that?

I've been planning on making all index/dbox code use an abstracted-out simple filesystem API rather than using POSIX directly. This work can be started by making the attachment reading/writing code use the FS API and then creating a single instance storage FS plugin. The plugin would work like:
  • open(ha/sh/hash-guid): The destination storage is in ha/sh/ directory, so a new temp file can be created under it. The hash is part of the filename to make unlink() easier to handle.

    Since the hash is already known at open() time, look up if hashes/<hash> file exists. If it does, open it.

  • write(): Write to the temp file. If hashes/ file is open, do a byte-by-byte comparison of the inputs. If there's a mismatch, close the hashes/ file and mark it as unusable.

  • finish():

    a. If hashes/ file is still open and it's at EOF, link() it to our final destination filename and delete the temp file. If link() fails with ENOENT (it was just expunged), goto b. If link() fails with EMLINK (too many links), goto c.

    b. If hashes/ file didn't exist, link() the temp file to the hash and rename() it to the destination file.

    c. If the hashed file existed but wasn't the same, or if link() failed with EMLINK, link() our temp file to a second temp file and rename() it over the hashes/ file and goto a.


  • unlink(): If hashes/<hash> has the same inode as our file and the link count is 2, unlink() the hash file. After that unlink() our file.
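For instance, the unlink() step could look roughly like this (a hedged sketch with error handling omitted; the function name is made up):

#include <sys/stat.h>
#include <unistd.h>

/* drop the hashes/ file when our file is its last remaining user,
   then remove our own link */
static int sis_unlink(const char *path, const char *hash_path)
{
	struct stat st, hash_st;

	if (stat(path, &st) == 0 && stat(hash_path, &hash_st) == 0 &&
	    st.st_dev == hash_st.st_dev && st.st_ino == hash_st.st_ino &&
	    st.st_nlink == 2) {
		/* only this file and hashes/<hash> point to the same data */
		(void)unlink(hash_path);
	}
	return unlink(path);
}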

One alternative to avoid using <hash> as part of the filename would be for unlink() to read the file and recalculate its hash, but that would waste disk I/O.

Another possibility would be to not unlink() the hashes/ files immediately, but rather let some nightly cronjob stat() through all of the files and unlink() the ones that have link count=1. This could be wastefully inefficient though.

Yet another possibility would be for the plugin to internally calculate the hash and write it somewhere. If it's at the beginning of the file, it could be read from there with some extra disk I/O. But is it worth it?..

Extra features


The attachment files begin with an extensible header. This allows a couple of extra features to reduce disk space:

  1. The attachment could be compressed (header contains compressed-flag)

  2. If base64 attachment is in a standardized form that can be 100% reliably converted back to its original form, it could be stored decoded and then encoded back to original on the fly.


It would be nice if it was also possible to compress (and decompress) attachments after they were already stored. This would be possible, but it would require finding all the links to the message and recreating them to point to the new message. (Simply overwriting the file in place would require that there are no readers at the same time, and that's not easy to guarantee, except if Dovecot was entirely stopped. I also considered some symlinking schemes, but they seemed too complex and they'd also waste inodes and performance.)

Code status


Initial version of the attachment reading/writing code is already done and works (lacks some error handling and probably performance optimizations). The SIS plugin code is also started and should be working soon.

This code is very isolated and can't cause any destabilization unless it's enabled, so I'm thinking about just adding it to v2.0 as soon as it works, although the config file comments should indicate that it's still considered unstable.

Wednesday, May 19, 2010

A new director service in v2.0 for NFS installations

(Mailing list thread for this post.)

As NFS wiki page describes, the main problem with NFS has always been caching problems. One NFS client changes two files, but another NFS client sees only one of the changes, which Dovecot then assumes is caused by corruption.

The recommended solution has always been to redirect the same user to only a single server at the same time. User doesn't have to be permanently assigned there, but as long as a server has some of user's files cached, it should be the only server accessing the user's mailbox. Recently I was thinking about a way to make this possible with an SQL database.

The company here in Italy didn't really like that idea, so I thought about making it more transparent and simpler to manage. The result is a new "director" service, which does basically the same thing, except without an SQL database. The idea is that your load balancer can redirect connections to one or more Dovecot proxies, which internally then figure out where the user should go. So the proxies act kind of like a secondary load balancer layer.

When a connection from a newly seen user arrives, it gets assigned to a mail server according to a function:

host = vhosts[ md5(username) mod vhosts_count ]

This way all of the proxies assign the same user to the same host without having to talk to each other. The vhosts[] is basically an array of hosts, except each host is initially listed there 100 times (vhost count=100). This vhost count can then be increased or decreased as necessary to change the host's load, probably automatically in the future.
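In C the assignment could look roughly like this (a sketch only; it assumes a helper such as Dovecot's md5_get_digest() and uses just the first 32 bits of the hash):

#include <string.h>
#include "md5.h" /* Dovecot's lib MD5 implementation */

/* pick a host for a newly seen user. vhosts[] contains each host
   repeated once per its vhost count. */
static const char *director_pick_host(const char *username,
				      const char *const *vhosts,
				      unsigned int vhosts_count)
{
	unsigned char md5[MD5_RESULTLEN];
	unsigned int i, hash = 0;

	md5_get_digest(username, strlen(username), md5);
	for (i = 0; i < 4; i++)
		hash = (hash << 8) | md5[i];
	return vhosts[hash % vhosts_count];
}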

The problem is then of course that if (v)hosts are added or removed, the above function will return a different host than was previously used for the same user. That's why there is also an in-memory database that keeps track of username → (hostname, timestamp) mappings. Every new connection from user refreshes the timestamp. Also existing connections refresh the timestamp every n minutes. Once all connections are gone, the timestamp expires and the user is removed from database.

The final problem then is how multiple proxies synchronize their state. The proxies connect to each others forming a connection ring. For example with 4 proxies the connections would go like A → B → C → A. Each time a user is added/refreshed, a notification is sent to both directions in the ring (e.g. B sends to A and C), which in turn forward it until it reaches a server that has already seen it. This way if a proxy dies (or just hangs for a few seconds), the other proxies still get the changes without waiting for it to timeout. Host changes are replicated in the same way.

It's possible that two connections from a user arrive to different proxies while (v)hosts are being added/removed. It's also possible that only one of the proxies has seen the host change. So the proxies could redirect users to different servers during that time. This can be prevented by doing a ring-wide sync, during which all proxies delay assigning hostnames to new users. This delay shouldn't be too bad because a) they should happen rarely, b) it should be over quickly, c) users already in database can still be redirected during the sync.

The main complexity here comes from how to handle proxy server failures in different situations. Those are less interesting to describe and I haven't yet implemented all of it, so let's just assume that in future it all works perfectly. :) I was also thinking about writing a test program to simulate the director failures to make sure it all works.

Finally, there are the doveadm commands that can be used to:


  • List the director status:
    # doveadm director status
    mail server ip vhosts users
    11.22.3.44 100 1312
    12.33.4.55 50 1424

  • Add a new mail server (defaults are in dovecot.conf):
    # doveadm director add 1.2.3.4

  • Change a mail server's vhost count to alter its connection count (also works during adding):
    # doveadm director add 1.2.3.4 50

  • Remove a mail server completely (because it's down):
    # doveadm director remove 1.2.3.4


If you want to slowly get users away from a specific server, you can assign its vhost count to 0 and wait for its user count to drop to zero. If the server is still working while "doveadm director remove" is called, new connections from the users in that server are going to other servers while the old ones are still being handled.

Friday, March 19, 2010

Time to switch to clang

I actually wanted to start using clang a long time ago, but it didn't give enough warnings. Many important warnings, like not verifying printf() parameters, were completely missing. So I kept using gcc..

But in the last few days this one guy has started adding support for the missing gcc warnings. I also found out that the printf() warnings were added within the last few months. So it looks like clang is finally potentially usable! I still have to actually start developing with it, but it looks promising.

This picture shows how much better clang's error and warning handling is compared to gcc.

I also did a few benchmarks with Dovecot:


  • Dovecot compiled about 10% faster with clang. Based on clang's web page I expected much more, but I guess it's better than nothing.. (I used configure --enable-optimizations, didn't change anything else)

  • Dovecot ran about 7% faster when I/O wasn't the limit (SSD disk, fsync_disable=yes).



Here's how I tested the 7% speed improvement (Dovecot v2.0 hg, Maildir):


imaptest seed=123 secs=300 msgs=100 delete=10 expunge=10 logout=1

1)
gcc version 4.4.3 20100108 (prerelease) (Debian 4.4.2-9)

Logi List Stat Sele Fetc Fet2 Stor Dele Expu Appe Logo
100% 50% 50% 100% 100% 100% 50% 10% 10% 100% 1%
30% 5%
674 31545 31442 674 63169 90559 30029 5419 6332 19773 1348
646 31725 31640 646 63270 90160 29987 5403 6163 20224 1292

2)
clang version 1.5 (trunk 98979)
Target: x86_64-unknown-linux-gnu

Logi List Stat Sele Fetc Fet2 Stor Dele Expu Appe Logo
100% 50% 50% 100% 100% 100% 50% 10% 10% 100% 1%
30% 5%
693 33927 33765 693 68032 96951 32356 5691 6786 21034 1386
674 33990 34027 674 68018 97428 32101 5823 6863 21260 1348

Saturday, March 13, 2010

Design: Asynchronous I/O for single/multi-dbox

(Mailing list thread for this post.)

The long term plan is to get all of Dovecot disk I/O asynchronous. The first step to that direction would be to make dbox/mdbox I/O asynchronous. This might also allow mbox/maildir to become partially asynchronous.

I already started describing how the lib-storage API could be changed to support high-latency storages. Adding support for all of the parallelism would be nice, but probably not necessary as the first step. Then again, the API changes are necessary, so the parallelism optimizations probably aren't a huge task on top of that.

Besides the API changes described in the above link, another change that's necessary is to add mail_get_nonblocking_stream(). The returned stream can return EAGAIN whenever it would need to block. Whenever sending message data to client, this stream would have to be used. It would also be used internally by all of message parsing/searching code. Luckily all of that already supports nonblocking input or can be easily modified to support it.

Below are some thoughts how to go forward with this. I originally thought about writing a more specific plan, but I think this is good enough now to start coding. The target Dovecot version is v2.1, not v2.0.

Filesystem API


The idea is to first abstract out all POSIX filesystem accessing in dbox/mdbox code. Non-blocking I/O works pretty nicely for socket I/O, so I thought I'd use a similar API for disk I/O as well:

handle = open(path, mode)

  • this function can't fail. it's executed asynchronously.

  • mode=read-only: no writing to file

  • mode=append: appending to file is allowed

handle = create(path, mode, permissions)

  • this function can't fail. it's executed asynchronously.

  • mode=fail-if-exists: commit() fails if file already exists

  • mode=replace-if-exists: commit() replaces file if it already exists

  • permissions: i'm not entirely sure how this works yet. but it should contain mode and gid in some format

set_input_callback(handle, callback, context)
set_output_callback(handle, callback, context)

  • call the callback when more data can be read/written

ret = pread(handle, buf, size, offset)

  • just like pread(), but can fail with EAGAIN if there are no bytes already buffered. so the idea is that the backend implementation would buffer/readahead data, which would be returned by this call. this would require memcpy()ing all data, but it might get too complex/fragile if it was asynchronously written to given buffer.

ret = write(handle, buf, size)

  • append data to the given file and return how many bytes were actually added to the write buffer. works in a similar way to writing to a socket. data can only be appended to files, there is no support for overwriting data.

  • no writes will be visible to reads until commit() is called

ret = commit(handle, [filename])

  • commit all previous writes to disk. either returns success/EAGAIN.

  • if filename is given and a new file is being created, the filename is changed to the given one instead of using the original path's filename. this is needed because e.g. mdbox saving can write many temp files in a single transaction and only at commit stage it locks the index files and knows what the filenames will be.

rollback(handle)

  • rollback all previous writes.

close(handle)

  • if file was created and not committed, the temp file will be deleted

  • does implicit rollback

ret = try_lock(handle)

  • this isn't an asynchronous operation! it assumes that locking state is kept in memory, so that the operation will be fast. if backend doesn't support locking or it's slow, single-dbox should be used (instead of multi-dbox), because it doesn't need locking.

  • returns success or "already locked"

  • only exclusive locking is possible
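Collected into C prototypes, the API sketched above could look roughly like this (the names and types are illustrative only, not an existing API):

#include <sys/types.h>

struct fs_handle;

enum fs_open_mode { FS_OPEN_MODE_READONLY, FS_OPEN_MODE_APPEND };
enum fs_create_mode { FS_CREATE_MODE_FAIL_IF_EXISTS, FS_CREATE_MODE_REPLACE_IF_EXISTS };

typedef void fs_callback_t(void *context);

/* open/create never fail directly; the actual I/O happens asynchronously */
struct fs_handle *fs_open(const char *path, enum fs_open_mode mode);
struct fs_handle *fs_create(const char *path, enum fs_create_mode mode,
                            mode_t file_mode, gid_t gid);

void fs_set_input_callback(struct fs_handle *h, fs_callback_t *cb, void *context);
void fs_set_output_callback(struct fs_handle *h, fs_callback_t *cb, void *context);

/* like pread()/write(), but may fail with EAGAIN when nothing is buffered yet */
ssize_t fs_pread(struct fs_handle *h, void *buf, size_t size, off_t offset);
ssize_t fs_write(struct fs_handle *h, const void *buf, size_t size);

/* writes become visible to readers only after a successful commit */
int fs_commit(struct fs_handle *h, const char *filename); /* filename may be NULL */
void fs_rollback(struct fs_handle *h);
void fs_close(struct fs_handle *h);

int fs_try_lock(struct fs_handle *h); /* in-memory lock, exclusive only */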


Async IO streams


Async input streams are created with FS API handle, so it's possible to start reading from them before the file has even been open()ed. The callers must be aware of this and realize that read() can fail with ENOENT, etc.

Async input streams' read() would work exactly as file_istream works for non-blocking sockets: It would return data that is already buffered in memory. If there's nothing, it returns EAGAIN. The FS API's set_input_callback() can be used to set a callback function that is called whenever there's more data available in the buffer.

Async output streams also work the same as non-blocking file_ostreams: write() returns the number of bytes added to buffer. When buffer becomes full, it starts returning EAGAIN. The ostream handles flushing internally the same way as file_ostreams does, although instead of using io_add(IO_WRITE) it uses FS API's set_output_callback(). If callers need to know when more data can be written or when all of the data has been written, it can override the ostream's flush_callback, just like with file_ostreams.

Async IO for FS API backend


So now that all of the APIs have been designed, all that's left to do is to write a simple FS API implementation using the kernel's async IO API, right? Wrong. There is no usable async IO API in Linux, and probably nowhere else either:


  • POSIX AIO isn't supported by Linux kernel. And even if it was, it only supports async reads/writes, not async open().

  • Linux kernel has its own native AIO implementation! ..But it only works with direct IO, which makes it pretty much useless for almost everyone. There have been many different attempts to get buffered AIO support to Linux, but all of them have failed.



So for now the only practical way is to implement it with threads. There are several libraries that could make this easier.. But all of them enable (and require) full thread safety for libc calls. I don't really like that. Dovecot isn't using threads, so I shouldn't have to pay the penalty of extra locking when it's really not necessary. So I was thinking about doing the async IO in two ways:


  1. For Linux/x86/x86-64 (and maybe others) implement a version that creates threads with clone() and uses lockless queues for communicating between the async io worker threads. The threads won't use malloc() or any other unsafe calls, so this should be pretty nice.

  2. Fallback version that uses pthreads with mutexes.



Is 2) going to be noticeably slower than 1)? Probably not.. In both cases there is also the problem of how many worker threads to create. I've really no idea; the kernel would be so much better at deciding that.. I guess it also depends on how many processes Dovecot is configured to run (and how busy they are at the moment). Maybe it could be just a globally configurable number of threads per process, defaulting to something like 10.

dbox changes


Most of the dbox code shouldn't be too difficult to change to using the FS API. The one exception is writing messages. Currently the format looks like:

[dbox header magic (2 bytes)]
[dbox message header, which contains message size][lf]
[message]
[dbox metadata magic (2 bytes)][lf]
[dbox metadata]
[lf]

The problem is that the message size isn't always known before the message is written. So currently the code just pwrite()s the size afterwards, but this is no longer possible with the new FS API. One possibility would be to buffer the message in memory, but that could waste a lot of memory since messages can be large.

So the only good solution is to change the dbox file format to contain the message size after the message. At the same time the dbox format could be cleaned up from old broken ideas as well. The reading code could support the old version format as well, because reading isn't a problem.

The new dbox format could look like:

[dbox header magic (2 bytes)]
[dbox unique separator (random 16 bytes or so)][lf]
[message]
[dbox metadata magic (2 bytes)]
[dbox unique separator, same as above][message size][lf]
[dbox metadata, last item being metadata size]
[lf]

The unique separator exists there primarily for fixing corrupted dbox storage, so it can (quite) reliably recognize where a message ends.

Multi-dbox's indexes already contain the message offset + size of (message+metadata), i.e. offset+size = offset of the next message. The size is primarily used to figure out if more data can be appended to the file (so it won't grow too large). This size could just be changed to be the message's size, and the check changed to assume the metadata size is e.g. 512 bytes. It won't really matter in practice. So then the reading code could get the message's size from the index.

Single-dbox doesn't have such index. There are two ways to solve the problem there (both actually would work for multi-dbox too):
  1. Metadata block can be found by reading backwards from end of file. For example read the last 512 bytes of the file and find the metadata magic block. Get the message's size from there.

  2. Create an istream that simply reads until it reaches dbox metadata magic and unique separator. And to be absolutely sure it reached the correct end of message, it can also check that the rest of the data in the stream contains only a valid metadata block.