Tuesday, February 28, 2012

Dovecot clustering with dsync-based replication

This document describes a design for a dsync-replicated Dovecot cluster. This design can be used to build at least two different types of dsync clusters, which are both described here. Ville has also drawn overview pictures of these two setups, see director/NFS-based cluster and SSH-based cluster.

First of all, why dsync replication instead of block level filesystem replication?

  • dsync won't replicate filesystem corruption.
  • A cold restart of replication won't go through all of the data in the disks, but instead quickly finds out what has changed.
  • Split brain won't result in downtime or losing any data. If both sides did changes, the changes are merged without data loss.
  • If using more than 2 storages, the users' replicas can be divided among the other storages. So if one storage goes down, the extra load is shared by all the other storages, not just one.

Replication mail plugin

This is a simple plugin based on the notify plugin. It listens for all changes that happen to mailboxes (new mails, flag changes, etc.). Once it sees a change, it sends an asynchronous (username, priority) notification to the replication-notify-fifo. The priority can be either high (new mails) or low (everything else).

Optionally the replication plugin can also support synchronous replication of new mail deliveries. In this mode it connects to the replication-notify UNIX socket, tells it to replicate the user with sync (=highest) priority and waits until the replication is done or replication_sync_timeout occurs. The IMAP/LMTP client won't see an "OK" reply until the mail has been replicated (or the replication has failed). Synchronous replication probably adds a noticeable delay, so it might not be acceptable for IMAP, but it might be for LMTP.
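As a rough illustration, enabling this could look something like the following in dovecot.conf (a sketch only: replication_sync_timeout is the setting named above, but the plugin name "replication", the $mail_plugins usage and the example value are assumptions):

mail_plugins = $mail_plugins notify replication

plugin {
  # how long a delivery waits for synchronous replication before giving up
  replication_sync_timeout = 2 secs
}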

So, what is listening on those replication-notify* sockets? It depends on whether Dovecot is running in a director-based setup or not.

Aggregator

When running in a Dovecot director-based setup, all of the Dovecot backends (where the replication plugin runs) also run an "aggregator" process. Its job is very simple: it proxies the notifications from the mail plugin and sends them over a single TCP connection to the replicator process running on the Dovecot proxies. This is simply an optimization to avoid tons of short-lived TCP connections going directly from the replication plugin to the director server.

When not running in a Dovecot director setup (i.e. there is only a single Dovecot instance that handles all of the users), there is no point in having an aggregator proxy, because the replicator process is running on the same server. In this kind of setup the replicator process listens directly on the replication-notify* sockets.
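A possible service definition for this single-instance case could look like this (a sketch: the listener names come from the design above, but whether the replicator service is configured exactly like this is an assumption):

service replicator {
  # asynchronous notifications from the replication mail plugin
  fifo_listener replication-notify-fifo {
    mode = 0600
  }
  # synchronous replication requests (sync priority)
  unix_listener replication-notify {
    mode = 0600
  }
}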

Replicator

The initial design for the replicator isn't very complex either: it keeps a priority queue of all users and replicates the users at the top of the queue. Notifications about changes to a user's mailboxes (may) move the user up in the priority queue. If the user at the top of the queue has already been replicated recently enough, the replicator stops its work until new changes arrive or until the topmost user's last replication is no longer recent enough.

dsync can do two types of syncs: quick syncs and full syncs. A quick sync trusts indexes and does the replication with the least amount of work and network traffic. A quick sync is normally enough to replicate all changes, but just in case something has gone wrong there's also the full sync option, which guarantees that the mailboxes end up being fully synced. A full sync is slower though, and uses more network traffic.

The priority queue is sorted by:

  • 1. Priority (updated by a notification from replication plugin)
  • 2. If priority!=none: Last fast sync (those users are replicated first whose last replication time is oldest)
  • 2. If priority=none: Last full sync (these users should already be fully synced, but do a full sync for them once in a while anyway)
All users get added to the replication queue at replicator startup with "none" priority. The list of users is looked up via userdb iteration. If previous replication state is found in a disk dump, it's used to update the priorities, last_*_sync timestamps and other replication state. The replicator process creates such dumps periodically [todo: every few mins? maybe a setting?].

The replicator starts replicating the users at the top of the queue, setting their priorities to "none" before starting. This means that if another change notification arrives during replication, the priority is bumped up again and no changes get lost. The replication_max_conns setting specifies how many users are replicated simultaneously. If the user's last_full_sync is older than the replication_full_sync_interval setting, a full sync is done instead of a fast sync. If the user at the top of the queue has "none" priority and their last_full_sync is newer than replication_full_sync_interval, the replication stops. [todo: it would be nice to prefer doing all the full syncs at night when there's hopefully less disk I/O]
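For illustration, the two settings mentioned above might look like this in the configuration (the names are from the design above; the values, the time syntax and where exactly the settings live are assumptions):

# how many users are replicated simultaneously
replication_max_conns = 10
# how often a full sync is forced for each user
replication_full_sync_interval = 24 hours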

(A global replication_max_conns setting isn't optimal in a proxy-based setup, where different backend servers are doing the replication. There it should maybe be a per-backend setting. Then again, that doesn't account for the replica servers that also need to do replication work. Also, to handle this properly each backend should have its own replication queue, but that requires doing a userdb lookup for each user to find out their replication server, and it would need to be done periodically in case the backend changes, which can easily happen often with a director-based setup. So all in all, none of this is being done in the initial implementation. Ideally the users are distributed in a way that a global replication queue works well enough.)

In a director-based setup each director runs a replicator server, but only one of them (the master) actually asks the backends to do the replication. The rest of them just keep track of what's happening, and if the master dies or hangs, one of the others becomes the new master. The server with the lowest IP address is always the master. The replicators are connected to a ring like the directors, using the same director_servers setting. The communication between them simply consists of notifications about what's happening to users' priorities. Preferably the aggregators would always connect to the master server, but this isn't required. In general there's not much that can go wrong, since it's not a problem if two replicators ask a backend to start replication for the same user, or if the replication queue states aren't identical.
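For example, with three directors/replicators the shared setting might be (addresses purely illustrative):

director_servers = 10.2.0.1 10.2.0.2 10.2.0.3

With the design above, the replicator on 10.2.0.1 would act as the master, since it has the lowest IP address.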

If the replication is running too slowly [todo: means what exactly?], log a warning and send an email to admin.

So, how does the actual replication happen? The replicator connects to the doveadm server and sends a "sync -u user@domain" command. In a director-based setup the doveadm server redirects this command to the proper backend.

doveadm sync

This is an independent feature from all of the above. Even with none of it implemented, you could run this to replicate a user. Most of this is already implemented. The only problem is that currently you need to explicitly tell it where to sync. So, when the destination isn't specified, it could do a userdb lookup and use the returned "mail_replica" field as the destination. Multiple (sequentially replicated) destinations could be supported by returning "mail_replica2", "mail_replica3" etc. field.

In an NFS-based setup (or a shared-filesystem-based setup in general) the mail_replica setting uses the same format as the mail_location setting. So your primary mail_location would be in /storage1/user/Maildir, while the secondary mail_replica would be in /storage2/user/Maildir. Simple.
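A sketch of what this could look like when combined with the userdb-returned mail_replica idea above (the static userdb, the field syntax and the maildir: prefix are assumptions; the storage paths mirror the example above):

mail_location = maildir:/storage1/%u/Maildir

userdb {
  driver = static
  args = uid=vmail gid=vmail home=/storage1/%u mail_replica=maildir:/storage2/%u/Maildir
}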

In a non-NFS-based setup two Dovecot servers talk the dsync protocol to each other. Currently dsync already supports SSH-based connections. It would also be easy to implement direct TCP-based connections between two doveadm servers, and in the future these connections could be SSL-encrypted. Initially I'm only supporting SSH-based connections, as they're already implemented. So what does the mail_replica setting look like in this kind of a setup? I'm not entirely sure. I'm thinking that it could be either "ssh:host" or "ssh:user@host", where user is the SSH login user (this is the opposite of the current doveadm sync command line usage). In the future it could also support tcp:host[:port]. Both the ssh: and tcp: prefixes would also be supported by the doveadm sync command line (and perhaps the prefixless user@domain syntax would be deprecated).
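Continuing the same sketch for the SSH case (hypothetical, since the exact format is explicitly left open above; the host, login user and port are made up):

# replica reached over SSH as the "vmail" login user
mail_replica = ssh:vmail@mail2.example.com
# possibly later also:
#mail_replica = tcp:mail2.example.com:12345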

dsync can run without any long-lived locking and it typically works fine. If a mailbox was modified during dsync, the replicas may not end up being identical, but nothing breaks. Currently dsync usually notices this and logs a warning. When the conflicting changes were caused by imap/pop3/lda/etc., this isn't a problem, because they have already notified the replicator to perform another sync that will fix it.

Running two dsyncs for the same user at the same time is more problematic though, mainly with new emails: both dsyncs notice that mail X needs to be replicated, so both save it, and the result is a duplicate. To avoid this, there should be a dsync lock. If the lock exists, dsync should wait until the previous dsync is done and then sync again, just in case there were more changes since the previous sync started.

This should conclude everything needed for replication itself.

High-availability NFS setup

Once you have replication, it's of course nice if the system automatically recovers from a broken storage. In NFS-based setups the idea is to use soft mounts, so that if the NFS server goes away, operations start failing with EIO errors, which Dovecot notices, and it switches to using the secondary storage(s).

In v2.1.0 Dovecot already keeps track of mounted filesystems. Initially they're all marked as "online". When multiple I/O errors occur in a filesystem [todo: how many exactly? where are these errors checked, all around in the code or checking the log?] the mountpoint is marked as "offline" and the connections accessing that storage are killed [todo: again how exactly?].

Another job for the replication plugin is to hook into namespace creation. If mail_location points to a mountpoint marked as "offline", it's replaced with mail_replica. This way the user can access mails from the secondary storage without downtime. If the replica isn't fully up to date, some of the mails (or other changes) may temporarily be missing. These will come back again after the original storage has come back up and replication has finished its job. So as long as mails aren't lost in the original storage, there won't be any permanent mail loss.

When an offline storage comes back online, its mountpoint's status is initially changed to "failover" (as opposed to "online"). During this state the replication plugin works a bit differently for users whose primary mail_location is in this storage: it first checks whether the user is fully replicated, and if so uses the primary storage, otherwise it uses the replica storage. Long-running IMAP processes check the replication state periodically and kill themselves once the user is replicated, to move back to the primary storage.

Once the replicator notices that all users have been replicated, it tells the backends (via the doveadm server) to change the "failover" state to "online".

High-availability non-NFS setup

One possibility is to use Dovecot proxies, which know which servers are down. Instead of directing users to those servers, the proxy would direct them to replica servers. The server states could be handled similarly to the NFS setup's online vs. failover vs. offline states.

Another possibility would be to do the same as above, except without separate proxy servers: just make the "mail.example.com" DNS name point to two IP addresses, and if the Dovecot server that receives a connection notices that it isn't the user's primary server, it proxies the connection to the primary server, unless that one is down. If one IP is down, clients will hopefully connect to the other.

Monday, February 13, 2012

Dovecot v2.2 plans

(Mailing list thread for this post.)

Here's a list of things I've been thinking about implementing for Dovecot v2.2. Probably not all of them will make it, but I'm at least interested in working on these if I have time.

Previously I've mostly been working on things that different companies were paying me to work on. This is the first time I have my own company, but the prioritization still works pretty much the same way:

  • 1. priority: If your company is highly interested in getting something implemented, we can do it as a project via my company. This guarantees that you'll get the feature implemented in a way that integrates well into your system.
  • 2. priority: Companies who have bought Dovecot support contract can let me know what they're interested in getting implemented. It's not a guarantee that it gets implemented, but it does affect my priorities. :)
  • 3. priority: Things other people want to get implemented.
There are also a lot of other things I have to spend my time on, which come before priority 2 above. I guess we'll see how things work out.

dsync-based replication


I'll write a separate post about this later. Besides, it's coming for Dovecot v2.1 so it's a bit off topic, but I thought I'd mention it anyway.

Shared mailbox improvements


Support for private flags for all mailbox formats:
namespace {
  type = public
  prefix = Public/
  mail_location = mdbox:/var/vmail/public:PVTINDEX=~/mdbox/indexes-public
}
  • dsync needs to be able to replicate the private flags as well as shared flags.
  • might as well add a common way for all mailbox formats to specify which flags are shared and which aren't. $controldir/dovecot-flags would say which is the default (private or shared) and what flags/keywords are the opposite.
  • easy way to configure shared mailboxes to be accessed via imapc backend, which would allow easy shared mailbox access across servers or simply between two system users on the same server (a rough config sketch follows this list). (this may be tricky to dsync.)
  • global ACLs read from a single file supporting wildcards, instead of multiple different files
  • default ACLs for each namespace/storage root (maybe implemented using the above..)
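A very rough sketch of the imapc-backed shared namespace idea from the list above (nothing here is settled syntax: the imapc_* setting names exist in v2.1, but wiring them into a shared namespace like this is only the idea being proposed, and the host is made up):

namespace {
  type = shared
  prefix = Other/%%u/
  location = imapc:~/imapc-shared/%%u
}

# plus authentication settings (imapc_user, imapc_password or a master user)
# pointing at the other server
imapc_host = mail.example.com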

Metadata / annotations


Add support for server, mailbox and mail annotations. These need to be dsyncable, so their changes need to be stored in various .log files:
  1. Per-server metadata. This is similar to subscriptions: add changes to the dovecot.mailbox.log file, with each entry name being a hash of the metadata key that was changed.
  2. Per-mailbox metadata. Changes to this belong inside mailbox_transaction_context, which writes the changes to the mailbox's dovecot.index.log file. Each log record contains a list of changed annotation keys. This gives each change a modseq, and also allows easily finding out what changes other clients have done, so if a client has done ENABLE METADATA Dovecot can easily push metadata changes to the client by only reading the dovecot.index.log file.
  3. Per-mail metadata. This is pretty much equivalent to per-mailbox metadata, except changes are associated to specific message UIDs.
The permanent storage is in a dict. The dict keys consist of these components:

  • priv/ vs. shared/ for specifying private vs. shared metadata
  • server/ vs mailbox/[mailbox guid]/ vs. mail/[mailbox guid]/[uid]
  • the metadata key name
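For example, combining those components could give keys like these (the GUID placeholders and key names are purely illustrative):

  • priv/mailbox/[mailbox guid]/comment — a private per-mailbox annotation
  • shared/server/admin-contact — a shared per-server annotation
  • priv/mail/[mailbox guid]/17/note — a private annotation on message UID 17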
This would be a good time to improve the dict configuration to allow things like:
  • mixed backends for different hierarchies (e.g. priv/mailbox/* goes to a file, while the rest goes to sql)
  • allow the sql dict to be used in a more relational way, so that mail annotations could be stored with tables mailbox (id, guid) and mail_annotation (mailbox_id, key, value), i.e. avoiding duplicating the guid everywhere.
Things to think through:
  • How to handle quota? It probably needs to be separate from the regular mail quota, perhaps some per-user "metadata quota bytes" counter/limit.
  • Dict lookups should be done asynchronously and prefetched as much as possible. For per-mail annotation lookups mail_alloc() needs to include a list of annotations that are wanted.

Configuration


Copy all mail settings to namespaces, so it'll be possible to use per-namespace mailbox settings. Especially important for imapc_* settings, but can be useful for others as well. Those settings that aren't explicitly defined in the namespace will use the global defaults. (Should doveconf -a show all of these values, or simply the explicitly set values?)

Get rid of *.conf.ext files. Make everything part of dovecot.conf, so doveconf -n outputs ALL of the configuration. There are mainly 3 config files I'm thinking about: dict-sql, passdb/userdb sql, passdb/userdb ldap. The dict-sql is something I think needs a bigger redesign (mentioned above in "Metadata" section), but the sql/ldap auth configs could be merged. One way could be:

sql_db sqlmails {
  # most settings from dovecot-sql.conf.ext, except for queries
  driver = mysql
  connect = ...
}

ldap_db ldapmails {
  # most settings from dovecot-ldap.conf.ext, except attributes/filters
}

passdb {
  driver = sql
  db = sqlmails
  sql_query = select password from users where username = '%u'
}

passdb {
  driver = ldap
  db = ldapmails
  ldap_attributes {
    password = %{ldap:userPassword}
  }
  ldap_filter = ...
}
The sql_db {} and ldap_db {} would be generic enough to be used everywhere (e.g. dict-sql), not just for passdb/userdb.

Some problems:

  • Similar to the per-namespace mail settings, doveconf -a would output all sql_query, ldap_attributes, ldap_filter, etc. settings for all passdbs/userdbs. Perhaps a similar solution?
  • The database configs contain passwords, so they should be readable only by root. This makes running dovecot-lda and maybe doveadm difficult, since they fail with "permission denied" when trying to open the config. There are probably only two solutions: a) the db configs need to be !include_try'd, or b) the configs can be world-readable, but the passwords are placed in root-only-readable files by using "password = <db.password"
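A sketch of option b) using the sql_db syntax proposed above (the file path and connect values are just examples):

sql_db sqlmails {
  driver = mysql
  connect = host=db.example.com dbname=mails user=dovecot
  # world-readable config, but the actual password lives in a root-only file
  password = </etc/dovecot/sql.password
}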

IMAP state saving/restoring


IMAP connections are often long running. Problems with this:
  1. Currently each connection requires a separate process (at least to work reliably), which means each connection also uses quite a lot of memory even when it isn't doing anything for a long time.
  2. Some clients don't handle lost connections very nicely, so Dovecot can't be upgraded without causing some user annoyance. Also, in a cluster, if you want to bring down one server, the connections have to be disconnected before they can be moved to another server.
If IMAP session state could be reliably saved and later restored to another process, both of the above problems could be avoided entirely. Typically when a connection is IDLEing there are really just 4 things that need to be remembered: username, selected mailbox name, its UIDVALIDITY and HIGHESTMODSEQ. With this information the IMAP session can be fully restored in another process without losing any state. So, what we could do is:
  1. When an IMAP connection has been IDLEing for a while (configurable initial time, could be dynamically adjusted):
    • move the IMAP state and the connection fd to imap-idle process
    • the old imap process is destroyed
    • imap-idle process can handle lots of IMAP connections
    • imap-idle process also uses inotify/etc. to watch for changes in the specified mailbox
    • if any mailbox changes happen or the IMAP client sends a command, start up a new imap process, restore the state and continue from where we left off
    • This could save quite a lot of memory at the expense of some CPU usage
  2. Dovecot proxy <-> backend protocol could be improved to support moving connection to another backend. Possibly using a separate control connection to avoid making the proxying less efficient in normal operation.
  3. When restarting Dovecot, move all the connections to a process that keeps the connections open for a while. When Dovecot starts up again, create imap processes back for the connections. This allows changing configuration for existing client connections (which sometimes may be bad! need to add checks against client-visible config conflicts), upgrading Dovecot, etc. without being visible to clients. The only problem is SSL connections: OpenSSL doesn't provide a way to save/restore state, so either you need to set shutdown_clients=no (and possibly keep some imap-login processes doing SSL proxying for a long time), or SSL connections need to be killed. Of course the SSL handling could be outsourced to some other software/hardware outside Dovecot.
The IMAP state saving isn't always easy. Initially it could be implemented only for the simple cases (which are a majority) and later extended to cover more.

IMAP extensions


  • CATENATE is already implemented by Stephan
  • URLAUTH is also planned to be implemented, somewhat differently than in Apple's patch. The idea is to create a separate imap-urlauth service that provides extra security.
  • NOTIFY extension could be implemented efficiently using mailbox list indexes, which already exist in v2.1.
  • FILTERS extension can be easily implemented once METADATA is implemented
  • There are also other missing extensions, but they're probably less important: BINARY & URLAUTH=BINARY, CONVERT, CONTEXT=SORT, CREATE-SPECIAL-USE, MULTISEARCH, UTF8=* and some i18n stuff.

Backups


Filesystem-based backups have worked well enough with Dovecot in the past, but with new features like single instance storage (SIS) it's becoming more difficult. There's no 100% consistent way to even get filesystem-level backups with SIS enabled, because deleting both the message file and its attachment files can't be done atomically (although usually this isn't a real problem). Restoring SIS mails is more difficult though: first you need to restore the dbox mail files, then you need to figure out which attachment files from SIS need to be restored, and finally you need to run doveadm import to put them into their final destination.

I don't have much experience with backup software, but other people in my company do. The initial idea is to implement a Dovecot backup agent for one (commercial) backup software package, which allows doing online backups and restoring mails one user/mailbox/mail at a time. I don't know the details yet of how exactly this is going to be implemented, but the basic plan is probably to implement a "backup" mail storage backend, which is a PostgreSQL pg_dump-like flat file containing mails from all mailboxes. doveadm backup/import can then export/import this format via stdout/stdin. Incremental backups could possibly be done by giving a timestamp of the previous backup run (I'm not sure about this yet).

Once I've managed to implement the first fully functional backup agent, it should become clearer how to implement it to other backup solutions.

Random things


  • dovecot.index.cache file writing is too complex, should be simplified
  • Enable auth_debug[_passwords]=yes on-the-fly for some specific users/IPs via doveadm
  • Optimize virtual mailboxes using mailbox list indexes. The virtual storage would then no longer need to keep all the backend mailboxes' index files open.
  • Would be nice to go forward with supporting key-value databases as mail storage backends.