Quorum Enforcement

As of yesterday, my most significant patch yet became a real part of GlusterFS. It’s not a big patch, but it’s significant because it adds enforcement of quorum for writes. In operational terms, if you turn quorum enforcement on, the probability of “split brain” problems is greatly reduced. It’s not eliminated entirely, because clients don’t all see a failure at the same time, and in the interval immediately after a failure they might take actions that lead to split brain. There are also some failure conditions that can leave clients with persistently inconsistent models of who has quorum and who doesn’t. Still, for 99% of failures this will significantly reduce the number of files affected by split brain, often down to zero. What happens instead is that clients attempting writes (actually any modifying operation) without quorum get EROFS. That might cause the application to blow up; if that’s worse for you than split brain would be, then just don’t enable quorum enforcement. Otherwise, you have the option to avoid or reduce one of the more pernicious problems that affect GlusterFS deployments with replication.
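
To make the EROFS behavior concrete, here is a minimal sketch of how an application might tolerate it rather than blow up. It is only an illustration: the function name `try_write` and the choice to report a refused write as `False` are assumptions of this example, not part of GlusterFS.

```python
import errno


def try_write(path, data):
    """Attempt to write `data` (bytes) to `path`.

    Returns True if the write succeeded, and False if the filesystem
    refused it with EROFS, which is what a client without quorum is
    described as seeing. Any other error is re-raised unchanged.

    `try_write` is a hypothetical helper for illustration only.
    """
    try:
        # The error may surface at open, write, or close time, so the
        # whole sequence sits inside the try block.
        with open(path, "wb") as f:
            f.write(data)
        return True
    except OSError as exc:
        if exc.errno == errno.EROFS:
            # Write refused: the caller can queue the data, retry later,
            # or tell the user, instead of crashing.
            return False
        raise
```

A caller would check the return value and decide whether to retry once quorum is likely to have been regained; what counts as a sensible retry policy depends entirely on the application.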

There’s another significant implication that might be of interest to those who follow my other blog. As such readers would know, I’m an active participant in the endless debates about Brewer’s CAP Conjecture (I’ve decided that Gilbert and Lynch’s later Theorem is actively harmful to understanding of the issues involved). In the past, GlusterFS has been a bit of a mess in CAP terms. It’s basically AP, in that it preserves availability and partition tolerance as I apply those terms, but with very weak conflict resolution. If only one side wrote to a file, there’s not really a conflict. When there is a conflict within a file, GlusterFS doesn’t really have the information it needs to reconstruct a consistent sequence of events, so it has to fall back on things like sizes and modification times (it does a lot better for directory changes). In a word, ick. What quorum enforcement does is turn GlusterFS into a CP system. That’s not to say I like CP better than AP – on the contrary, my long-term plan is to implement the infrastructure needed for AP replication with proper conflict resolution – but I think many will prefer the predictable and well understood CP behavior with quorum enforcement to the AP behavior that’s there now. Since it was easy enough to implement, why not give people the choice?


5 Responses


  1. NottsSys says:

    Hi Jeff,
    Very interesting article; in fact, I think this is exactly what we need. To give you a bit of background, we are implementing GlusterFS as a replicated filesystem between 3 campuses. We want to avoid split brain in case the link between the sites goes down, so we would very much like to use this feature. However, I am not able to find any documentation on how to actually set the quorum on a cluster, so I would appreciate your help. Also, can you please confirm whether this feature is available in release 3.2? Thanks once again; your response will be much appreciated.

  2. Jeff Darcy says:

    Unfortunately, the official documentation for the quorum options is still in progress. They are supported by the CLI, though. The main one is cluster.quorum-type, which can have three values: none, auto, or fixed. “None” is the current behavior. “Auto” defines quorum as more than half, or exactly half including the first brick in the volume definition (to handle the common two-brick case). “Fixed” lets you specify a value with the separate cluster.quorum-count option (which is otherwise ignored). So normally you’d just set quorum-type to “auto” and that would avoid 99% of split-brain issues. If a client lacked quorum it would get EROFS instead on writes, which is still something you’d have to deal with, but most people find that case easier to recover from than letting the non-quorum write proceed.

    That said, I’m a little leery of replicating between sites. As I’ve pointed out in other articles here, the AFR translator is very latency sensitive and likely to perform quite poorly in high-latency environments. It’s something I want to fix – with a separate translator optimized for that case – some day, but that day never seems to come. :( I’m not saying don’t do it, but I highly recommend thorough testing to make sure that the performance meets your needs.

    • NottsSys says:

      Hi Jeff,
      Thanks for your quick response. I have installed GlusterFS 3.3beta3 and done some quick tests with quorum-type, and it works perfectly.
      I take your point regarding the AFR translator and high latency, and absolutely agree that we have to test things thoroughly.
      On a second note, we are finding it very difficult to get help or documentation on GlusterFS, and are looking for someone to provide some paid consultancy (2 to 3 hours over phone/Skype) to help us understand GlusterFS better. We know some of it, as we have been playing with it for quite some time; it is the details we are worried about, and we have some questions that are not answered elsewhere. Essentially, we want help deciding whether GlusterFS is right for what we are trying to do (we will explain what that is so that you can comment on whether it is feasible). Would you be interested in this opportunity? If so, please email me.

      • Jeff Darcy says:

        It’s entirely reasonable for you to ask, but it turns out that accepting such an engagement would be a conflict according to both Red Hat’s and my own definitions of such things. However, there’s no reason I can’t do the same thing for free and call it community support or pre-sales. I’ll contact you off-line.

  3. NottsSys says:

    Thanks Jeff. Much appreciated. I await your email/contact.
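
The quorum rule described in Jeff Darcy’s response above (cluster.quorum-type set to none, auto, or fixed, with cluster.quorum-count used only for fixed) can be sketched roughly as follows. This is an illustration of the rule as stated in that comment, not the actual GlusterFS code: the function name `has_quorum`, the list-of-booleans representation of brick state, and the reading of “fixed” as “at least quorum-count bricks reachable” are all assumptions of this example.

```python
def has_quorum(quorum_type, bricks_up, quorum_count=None):
    """Decide whether a client may perform modifying operations.

    quorum_type  -- "none", "auto", or "fixed"
    bricks_up    -- list of booleans, one per brick in volume-definition
                    order; True means the client can reach that brick
    quorum_count -- only consulted when quorum_type is "fixed"

    Hypothetical helper; names and data layout are illustrative only.
    """
    total = len(bricks_up)
    alive = sum(1 for up in bricks_up if up)

    if quorum_type == "none":
        # No enforcement: writes are always allowed.
        return True

    if quorum_type == "auto":
        # More than half of the bricks is always enough.
        if 2 * alive > total:
            return True
        # Exactly half counts only if the first brick is among them,
        # which breaks the tie in the common two-brick case.
        return total > 0 and 2 * alive == total and bool(bricks_up[0])

    if quorum_type == "fixed":
        if quorum_count is None:
            raise ValueError("fixed quorum requires quorum_count")
        # Assumption: quorum is met when at least quorum_count bricks
        # are reachable.
        return alive >= quorum_count

    raise ValueError("unknown quorum type: %r" % (quorum_type,))
```

With two bricks and quorum-type auto, a client that can reach only the first brick keeps quorum, while a client that can reach only the second does not, so at most one side of a partition can keep writing.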