Tuesday, February 28, 2012

Dovecot clustering with dsync-based replication

This document describes a design for a dsync-replicated Dovecot cluster. This design can be used to build at least two different types of dsync clusters, which are both described here. Ville has also drawn overview pictures of these two setups, see director/NFS-based cluster and SSH-based cluster.

First of all, why dsync replication instead of block level filesystem replication?

  • dsync won't replicate filesystem corruption.
  • A cold restart of replication won't go through all of the data on the disks, but instead quickly finds out what has changed.
  • Split brain won't result in downtime or lost data. If both sides made changes, the changes are merged without data loss.
  • If using more than 2 storages, the users' replicas can be divided among the other storages. So if one storage goes down, the extra load is shared by all the other storages, not just one.

Replication mail plugin

This is a simple plugin based on the notify plugin. It listens for all changes that happen to mailboxes (new mails, flag changes, etc.). Once it sees a change, it sends an asynchronous (username, priority) notification to replication-notify-fifo. The priority can be either high (new mails) or low (everything else).

Optionally the replication plugin can also support synchronous replication of new mail deliveries. In this mode it connects to the replication-notify UNIX socket, tells it to replicate the user with sync (=highest) priority and waits until it is done or replication_sync_timeout occurs. The IMAP/LMTP client won't see an "OK" reply until the mail is replicated (or the replication has failed). Synchronous replication probably adds a noticeable delay, so it might not be acceptable for IMAP, but might be for LMTP.
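To make the notification concrete, here is a minimal sketch of the (username, priority) message. The priority names come from the text above, but the numeric values and the tab-separated line format are assumptions made up for this illustration; the design doesn't specify a wire format.

```python
from enum import IntEnum
from typing import Tuple


class Priority(IntEnum):
    """Priorities named in the text; the numeric values are assumptions."""
    NONE = 0
    LOW = 1    # everything other than new mails (flag changes, etc.)
    HIGH = 2   # new mails
    SYNC = 3   # synchronous replication of a delivery (highest)


def encode_notification(username: str, priority: Priority) -> bytes:
    """Encode one notification as a tab-separated line (hypothetical format)."""
    if "\t" in username or "\n" in username:
        raise ValueError("username must not contain tabs or newlines")
    return f"{username}\t{priority.name.lower()}\n".encode("utf-8")


def decode_notification(line: bytes) -> Tuple[str, Priority]:
    """Inverse of encode_notification()."""
    username, priority_name = line.decode("utf-8").rstrip("\n").split("\t")
    return username, Priority[priority_name.upper()]
```

Whatever listens on the other end only needs the username and the priority, so a one-line message per change is enough for this sketch.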

So, what is listening on those replication-notify* sockets? It depends on whether Dovecot is running in a director-based setup or not.

Aggregator

When running in a Dovecot director-based setup, all of the Dovecot backends (where the replication plugin runs) also run an "aggregator" process. Its job is very simple: it proxies the notifications from the mail plugin and sends them via a single TCP connection to the replicator process running on the Dovecot proxies. This is simply an optimization to avoid tons of short-lived TCP connections directly from the replication plugin to the director server.

When not running in Dovecot director setup (i.e. there is only a single Dovecot instance that handles all of the users), there is no point in having an aggregator proxy, because the replicator process is running on the same server. In this kind of setup the replicator process directly listens on the replication-notify* sockets.

Replicator

The initial design for the replicator isn't very complex either: it keeps a priority queue of all users, and replicates those users at the top of the queue. Notifications about changes to a user's mailboxes may move the user up in the priority queue. If the user at the top of the queue has already been replicated "recently enough", the replicator stops its work until new changes arrive or enough time passes that the last replication no longer counts as "recently enough".

dsync can do two types of syncs: quick syncs (called fast syncs below) and full syncs. A quick sync trusts indexes and does the replication with the least amount of work and network traffic. A quick sync is normally enough to replicate all changes, but just in case something has gone wrong there's also the full sync option, which guarantees that the mailboxes end up being fully synced. A full sync is slower though, and uses more network traffic.

The priority queue is sorted by:

  • 1. Priority (updated by a notification from the replication plugin)
  • 2. If priority!=none: Last fast sync (those users are replicated first whose last replication time is oldest)
  • 2. If priority=none: Last full sync (these users should already be fully synced, but do a full sync for them once in a while anyway)

All users get added to the replication queue at replicator startup with "none" priority. The list of users is looked up via userdb iteration. If the previous replication state is found in a disk dump, it's used to update the priorities, last_*_sync timestamps and other replication state. The replicator process creates such dumps periodically [todo: every few mins? maybe a setting?].
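The queue ordering above can be sketched as a sort key. This is only an illustration: the QueuedUser class, the numeric priority values and the use of UNIX timestamps are assumptions, not part of the design.

```python
from dataclasses import dataclass

# Priorities from the text; the numeric ordering is an assumption of this sketch.
NONE, LOW, HIGH, SYNC = 0, 1, 2, 3


@dataclass
class QueuedUser:
    username: str
    priority: int = NONE
    last_fast_sync: float = 0.0   # UNIX timestamp, 0 = never
    last_full_sync: float = 0.0   # UNIX timestamp, 0 = never


def queue_sort_key(user: QueuedUser):
    """Highest priority first; ties are broken by the oldest relevant sync time."""
    if user.priority != NONE:
        oldest_first = user.last_fast_sync
    else:
        oldest_first = user.last_full_sync
    return (-user.priority, oldest_first)


def ordered(users):
    """Return the users in the order the replicator would work through them."""
    return sorted(users, key=queue_sort_key)
```

A real queue would more likely be a heap that gets updated in place as notifications arrive, but sorting a list shows the ordering rule just as well.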

The replicator starts replicating users at the top of the queue, setting their priorities to "none" before starting. This means that if another change notification arrives during replication, the priority is bumped up and no changes get lost. The replication_max_conns setting specifies how many users are replicated simultaneously. If the user's last_full_sync is older than the replication_full_sync_interval setting, a full sync is done instead of a fast sync. If the user at the top of the queue has "none" priority and the last_full_sync is newer than replication_full_sync_interval, the replication stops. [todo: it would be nice to prefer doing all the full syncs at night when there's hopefully less disk I/O]
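The decision made for the user at the top of the queue can be sketched as a small function. The function name, the return values and the numeric "none" priority are assumptions for illustration; only the rules themselves come from the paragraph above.

```python
from typing import Optional

NONE = 0  # "none" priority; the numeric value is an assumption


def next_action(priority: int, last_full_sync: float, now: float,
                full_sync_interval: float) -> Optional[str]:
    """Decide what to do with the user at the top of the queue.

    Returns "full" or "fast", or None when there is nothing to do and the
    replicator can stop until something changes.
    """
    full_sync_due = now - last_full_sync > full_sync_interval
    if full_sync_due:
        return "full"      # overdue users get a full sync instead of a fast one
    if priority == NONE:
        return None        # no pending changes and recently fully synced
    return "fast"
```

The caller would reset the user's priority to "none" before starting the sync, so that a notification arriving mid-sync bumps the user up again.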

(A global replication_max_conns setting isn't optimal in a proxy-based setup, where different backend servers are doing the replication. There it should maybe be a per-backend setting. Then again, it doesn't account for the replica servers that also need to do replication work. Also, to handle this properly each backend should have its own replication queue, but this requires doing a userdb lookup for each user to find out their replication server, and this would need to be done periodically in case the backend changes, which can easily happen often with a director-based setup. So all in all, none of this is being done in the initial implementation. Ideally the users are distributed in a way that a global replication queue would work well enough.)

In a director-based setup each director runs a replicator server, but only one of them (the master) actually asks the backends to do the replication. The rest of them just keep track of what's happening, and if the master dies or hangs, one of the others becomes the new master. The server with the lowest IP address is always the master. The replicators are connected in a ring like the directors, using the same director_servers setting. The communication between them is simply about notifications of what's happening to users' priorities. Preferably the aggregators would always connect to the master server, but this isn't required. In general there's not much that can go wrong, since it's not a problem if two replicators request a backend to start replication for the same user or if the replication queue states aren't identical.
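The "lowest IP address is the master" rule is easy to sketch. The function name and the idea of passing in a list of currently alive servers are assumptions for illustration. Addresses are compared numerically, since a plain string comparison would sort "10.0.0.12" before "10.0.0.3".

```python
import ipaddress


def _ip_sort_key(address: str):
    """Sort key that compares addresses numerically (IPv4 before IPv6)."""
    ip = ipaddress.ip_address(address)
    return (ip.version, int(ip))


def elect_master(alive_servers):
    """Return the alive replicator with the numerically lowest IP address."""
    if not alive_servers:
        raise ValueError("no replicator servers are alive")
    return min(alive_servers, key=_ip_sort_key)
```

When the master dies, the others simply drop it from their list of alive servers and re-run the same rule, so no separate election protocol is needed in this sketch.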

If the replication is running too slowly [todo: means what exactly?], log a warning and send an email to admin.

So, how does the actual replication happen? The replicator connects to the doveadm server and sends a "sync -u user@domain" command. In a director-based setup the doveadm server redirects this command to the proper backend.

doveadm sync

This is an independent feature from all of the above. Even with none of it implemented, you could run this to replicate a user. Most of this is already implemented. The only problem is that currently you need to explicitly tell it where to sync. So, when the destination isn't specified, it could do a userdb lookup and use the returned "mail_replica" field as the destination. Multiple (sequentially replicated) destinations could be supported by returning "mail_replica2", "mail_replica3" etc. fields.

In an NFS-based (or shared filesystem-based in general) setup the mail_replica setting uses the same syntax as the mail_location setting. So your primary mail_location would be in /storage1/user/Maildir, while the secondary mail_replica would be in /storage2/user/Maildir. Simple.

In a non-NFS-based setup two Dovecot servers talk the dsync protocol to each other. Currently dsync already supports SSH-based connections. It would also be easy to implement direct TCP-based connections between two doveadm servers. In the future these connections could be SSL-encrypted. Initially I'm only supporting SSH-based connections, as they're already implemented. So what does the mail_replica setting look like in this kind of a setup? I'm not entirely sure. I'm thinking that it could be either "ssh:host" or "ssh:user@host", where user is the SSH login user (this is the opposite of the current doveadm sync command line usage). In the future it could also support tcp:host[:port]. Both of these ssh: and tcp: prefixes would also be supported by the doveadm sync command line usage (and perhaps the prefixless user@domain would be deprecated).
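As a sketch of how the proposed remote mail_replica values could be parsed: the Replica type and parse_mail_replica() are hypothetical names, and the syntax itself is only the tentative one described above. The NFS case (a plain mail location) and IPv6 literals are deliberately not handled here.

```python
from typing import NamedTuple, Optional


class Replica(NamedTuple):
    transport: str             # "ssh" or "tcp"
    host: str
    ssh_user: Optional[str]    # SSH login user, only for ssh:
    port: Optional[int]        # only for tcp:


def parse_mail_replica(value: str) -> Replica:
    """Parse "ssh:host", "ssh:user@host" or "tcp:host[:port]"."""
    transport, sep, rest = value.partition(":")
    if not sep or not rest:
        raise ValueError(f"missing transport prefix or host: {value!r}")
    if transport == "ssh":
        user, at, host = rest.rpartition("@")
        if not host or (at and not user):
            raise ValueError(f"invalid ssh replica: {value!r}")
        return Replica("ssh", host, user or None, None)
    if transport == "tcp":
        host, colon, port = rest.partition(":")
        if not host:
            raise ValueError(f"invalid tcp replica: {value!r}")
        return Replica("tcp", host, None, int(port) if colon else None)
    raise ValueError(f"unknown transport: {transport!r}")
```

Keeping the transport as an explicit prefix means new connection types can be added later without making the existing values ambiguous.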

dsync can run without any long-lived locking and it typically works fine. If a mailbox was modified during dsync, the replicas may not end up being identical, but nothing breaks. dsync currently usually notices this and logs a warning. When these conflicting changes were caused by imap/pop3/lda/etc., this isn't a problem: they've already notified the replicator to perform another sync that will fix it.

Running two dsyncs at the same time is more problematic though, mainly related to new emails. Both dsyncs notice that mail X needs to be replicated, so both save it and it results in having a duplicate. To avoid this, there should be a dsync-lock. If this lock exists, dsync should wait until the previous dsync is done and then do it again, just in case there were more changes since the previous sync started.
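The dsync-lock idea can be sketched with an advisory file lock. This is an assumption-heavy illustration: it uses flock() on a local lock file, which only protects dsyncs running on the same host (or on a filesystem where flock() works across hosts), and sync_with_lock() is a made-up name rather than anything in dsync.

```python
import fcntl
import os


def sync_with_lock(lock_path, do_sync):
    """Run do_sync() while holding an exclusive lock on lock_path.

    If another dsync holds the lock, wait for it to finish and then sync
    anyway, in case more changes arrived after that sync had started.
    Returns True if we had to wait for another lock holder.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        waited = False
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            waited = True
            fcntl.flock(fd, fcntl.LOCK_EX)  # block until the other dsync is done
        do_sync()
        return waited
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

Because the second dsync always runs its own sync after waiting, mail X gets saved by whichever dsync runs first, and the later one sees it as already replicated instead of saving a duplicate.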

This should conclude everything needed for replication itself.

High-availability NFS setup

Once you have replication, it's of course nice if the system automatically recovers from a broken storage. In NFS-based setups the idea is to do soft mounts, so if the NFS server goes away things start failing with EIO errors, which Dovecot notices and switches to using the secondary storage(s).

In v2.1.0 Dovecot already keeps track of mounted filesystems. Initially they're all marked as "online". When multiple I/O errors occur in a filesystem [todo: how many exactly? where are these errors checked, all around in the code or checking the log?] the mountpoint is marked as "offline" and the connections accessing that storage are killed [todo: again how exactly?].

Another job for the replication plugin is to hook into namespace creation. If mail_location points to a mountpoint marked as "offline", it's replaced with mail_replica. This way the user can access mails from the secondary storage without downtime. If the replica isn't fully up to date, this means that some of the mails (or other changes) may temporarily be lost. These will come back again after the original storage has come back up and replication has finished its job. So as long as mails aren't lost in the original storage, there won't be any permanent mail loss.

When an offline storage comes back online, its mountpoint's status is initially changed to "failover" (as opposed to "online"). During this state the replication plugin works a bit differently when the user's primary mail_location is in this storage: it first checks if the user is fully replicated, and if so uses the primary storage, otherwise it uses the replica storage. Long-running IMAP processes check the replication state periodically and kill themselves once the user is replicated, to move back to primary storage.

Once the replicator notices that all users have been replicated, it tells the backends to change the "failover" state to "online" (via the doveadm server).
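The three mountpoint states and their effect on which storage a user gets can be sketched as follows. The state names come from the text; the function names and the event names are hypothetical, and how the I/O errors are actually detected is left open, as in the todo above.

```python
ONLINE, OFFLINE, FAILOVER = "online", "offline", "failover"


def effective_mail_location(mount_state, mail_location, mail_replica,
                            user_fully_replicated):
    """Pick the storage to use for a user whose primary is on this mountpoint."""
    if mount_state == ONLINE:
        return mail_location
    if mount_state == OFFLINE:
        return mail_replica
    if mount_state == FAILOVER:
        # Primary is back, but only usable once this user has caught up.
        return mail_location if user_fully_replicated else mail_replica
    raise ValueError(f"unknown mountpoint state: {mount_state!r}")


def next_mount_state(mount_state, event):
    """Apply an event (hypothetical names) to a mountpoint state."""
    transitions = {
        (ONLINE, "io_errors"): OFFLINE,
        (OFFLINE, "storage_back"): FAILOVER,
        (FAILOVER, "all_users_replicated"): ONLINE,
    }
    # Events that don't apply to the current state leave it unchanged.
    return transitions.get((mount_state, event), mount_state)
```

The "failover" state exists so that a storage that has just come back isn't trusted for a user until that user's changes made on the replica have been synced back.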

High-availability non-NFS setup

One possibility is to use Dovecot proxies, which know which servers are down. Instead of directing users to those servers, they would direct them to replica servers. The server states could be handled similarly to the NFS setup's online vs. failover vs. offline states.
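The proxy's choice can be sketched as a simple preference list. pick_backend() and its arguments are hypothetical, and this sketch ignores the "failover" state, where a server is up again but not yet fully replicated.

```python
def pick_backend(primary, replicas, down):
    """Return the first server not known to be down, preferring the primary.

    Returns None if the primary and all replicas are down.
    """
    for server in [primary, *replicas]:
        if server not in down:
            return server
    return None
```

For example, pick_backend("backend1", ["backend2"], {"backend1"}) returns "backend2", so the user lands on the replica while the primary is down.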

Another possibility would be to do the same as above, except without separate proxy servers. Just make the "mail.example.com" DNS name point to two IP addresses, and if one Dovecot notices that it's not the user's primary server, it proxies to the user's primary server, unless that server is down. If one IP is down, clients hopefully connect to the other.
