GlusterFS Algorithms: Distribution

A lot of people seem to be curious about how GlusterFS works, not just in the sense of effects but in terms of internal algorithms etc. as well. Here’s an example from this morning. The documentation at this level really is kind of sparse, so I might as well start filling some of the gaps. Today I’ll talk about DHT, which is the real core of how GlusterFS aggregates capacity and performance across multiple servers. Its responsibility is to place each file on exactly one of its subvolumes – unlike either replication (which places copies on all of its subvolumes) or striping (which places pieces onto all of its subvolumes). It’s a routing function, not splitting or copying.

The basic method used in DHT is consistent hashing. Each subvolume (brick) is assigned a range within a 32-bit hash space, covering the entire range with no holes or overlaps. Then each file is also assigned a value in that same space, by hashing its name. Exactly one brick will have an assigned range including the file’s hash value, and so the file “should” be on that brick. However, there are many cases where that won’t be the case, such as when the set of bricks (and therefore the range assignment of ranges) has changed since the file was created, or when a brick is nearly full. Much of the complexity in DHT involves these special cases, which we’ll discuss in a moment. First, though, it’s worth making a couple more observations about the basic algorithm.

  • The assignment of hash ranges to bricks is determined by extended attributes stored on directories (here’s a description of those data structures). This means the distribution is directory-specific. You could well distribute files differently – e.g. across different sets of bricks – in different directories if you know what you’re doing, but it’s quite unsafe. Firstly it’s unsafe because you’d really better know what you’re doing. Secondly it’s unsafe because there’s no management support for this, so the next time you do a rebalance (more about that later) it will happily stomp on your carefully hand-crafted xattrs. In the fairly near future, I hope to add a feature to recognize hand-set xattrs as such and leave them alone. In the more distant future, there might be management support for assigning bricks to various pools or classes of storage, and defining per-directory placement policies in terms of those.
  • Consistent hashing is usually thought of as hashing around a circle, but in GlusterFS it’s more linear. There’s no need to “wrap around” at zero, because there’s always a break (between one brick’s range and another’s) at zero.
  • If a brick is missing, there will be a hole in the hash space. Even worse, if hash ranges are reassigned while a brick is offline, some of the new ranges might overlap with the (now out of date) range stored on that brick, creating a bit of confusion about where files should be. GlusterFS tries really hard to avoid these problems, but it also checks aggressively to make sure nothing slips through. If you ever see messages in your logs about holes or overlaps, that means the checking code is doing its job.

So, those are the basics. How about those special cases? It’s probably easiest to look at the “read” path first, where we’re trying to find a file that we expect to be there. Here’s the sequence of operations.

  1. Make sure we have the hash-range assignments (the “layout”) for each brick’s copy of the parent directory. This information is cached, so we’ll usually have it already.
  2. Hash the file name and look up the corresponding brick in the layout.
  3. Send a LOOKUP request to that brick, specifying the file path.
  4. If the LOOKUP comes back positive, we found the file and we’re done.
  5. Otherwise, re-send the LOOKUP to all bricks to see who really has the file.
  6. If nobody gives a positive reply, the file really isn’t there and we’re done again.
  7. Go back to the brick where the file “should” have been, and create a link file (described below) pointing to the real location.
  8. Return the LOOKUP result to the caller.

What’s a link file, then? Have you ever looked on one of your bricks and seen zero-length files with weird permissions (sticky bit set)? Those are link files. If you look closer, you’ll also see that they have trusted.dht.linkfile xattrs with brick names in them. That’s how we avoid the “broadcast” mentioned above. On subsequent lookups, if we find a link file we just follow it to the real brick. Considering that we only go through this lookup procedure once per file per client anyway (location information is cached), the cost of “guessing wrong” is therefore pretty minimal. I once implemented a scheme where we do an exponentially expanding search instead of an immediate broadcast, hoping to achieve a better balance of lookup latency vs. network traffic, but in the end it just didn’t seem to make a difference so the extra complexity wasn’t worth it. Now, let’s look at the file-creation path.

  1. Assume we’ve already done a lookup, so we already have the layout information cached and we know the file doesn’t already exist anywhere.
  2. Hash the file name and look up the corresponding brick in the layout.
  3. If that brick is full, choose another brick (doesn’t really matter how) that isn’t instead.
  4. Send a CREATE request to the chosen brick for that file.
  5. If we “diverted” because of a full brick, go back and add a link file to the brick chosen by pure hashing. The next client will almost certainly need it.

This brings us to rebalancing, which is one of the key challenges – and therefore one of the most interesting research areas IMO – in this kind of system. The first thing to know about GlusterFS rebalancing is that it’s not automatic. If you add a new brick, even new files won’t be put on it until you do the “fix-layout” part of rebalance, and old files won’t be put on it until you do the “migrate-data” part. What do these do?

  • Fix-layout just walks the directory tree recalculating and modifying the trusted.glusterfs.dht xattrs to reflect the new list of bricks. It does this in a pretty simple way, assigning exactly one range of length MAXINT/nbricks to each brick in turn starting at zero.
  • Migrate-data is much more costly. For each file, it calculates where the file “should” be according to the new layout. Then, if the file is not already there, it moves the file by copying and renaming over the original. There’s some tricky code to make sure this is transparent and atomic and so forth, but that’s the algorithm.

In my personal opinion, there are problemsenhancement opportunities in both of these areas. Let’s take these in reverse order. Migrate-data is slow. What it should do is run in parallel on all of the bricks, with each brick either “pushing” data that is currently local but needs to be elsewhere or “pulling” data that’s the other way around. What it does instead is run on one node, potentially moving files for which it is neither source nor destination. This is a big deal, because it causes rebalance to take days when it should take hours – or weeks when it should take days, on larger installations. The amount of I/O involved is also why you don’t necessarily want this to be an automatic process.

While the migrate-data issue is at the level of mechanics and implementation, the fix-layout issue is at more of a conceptual level. To put it simply, when we add a new brick we should reallocate approximately 1/new_brick_count hash values. Because our layout calculations are naive, we will usually reallocate much more than that – exacerbating the migrate-data problem because reallocated hash values correspond to moved files. Time for a picture.

The outer ring represents the state with just three bricks – hash value zero at the top, split into three equal ranges. The inner ring represents the state after adding a fourth brick. Any place where the inner and outer rings are different colors represents a range that has been reassigned from one brick to another – implying a migration of data likewise. If you look carefully, you’ll see that we’re moving half of the data when it should be only a quarter – 8% blue to orange, 17% orange to yellow, and 25% yellow to green. What could we do that’s better? Not much, if we stay within the limitation of a single brick claiming a single range, but there really doesn’t need to be such a limitation. Instead, we could borrow from Dynamo and assign multiple “virtual node IDs” for brick four, giving it a total of 25% drawn equally from bricks one through three. (If you look at this as a clock, that’s one hour each at three, seven, and eleven o’clock). Taking this approach too far can cause layout tables to get unreasonably large, so sometimes it makes more sense to forego this optimization or even reverse it by combining/swapping ranges even though that will cause “unnecessary” data migration. Done sensibly, this can keep the table size under control without resorting to the scale-limiting “hash bucket” approach of some Dynamo derivatives. There are even more sophisticated ways to address this problem, but that’s a whole different post and this should be enough to convey the general problem space.

That’s how DHT works today, and some thoughts about how it might evolve in the future. The next article will focus on replication.


5 Responses

You can follow any responses to this entry through the RSS 2.0 feed.

Both comments and pings are currently closed.

  1. Dan Gardner says:

    Hi Jeff,

    Thank you for the post. It’s great to see you filling the gaps! I posted a related question to the Gluster community site a couple of weeks ago. I think that the problem of heterogeneous storage node sizes would be solved by virtual nodes. I’m pasting the question below. It would be great if you could address in the context of your post:

    Balance across non uniform nodes sizes

    We expect that as we scale out our nodes will take advantage of larger and larger disk capacities. When nodes are added (in groups of two or three AFR sets), we would like a way to control placement of data so that proportionally more load is sent to the higher capacity nodes. This could be achieved by allowing a brick to be added more than once (thus achieving weighting ala Swift’s rings (see Part5)). The gluster command line refuses to accept repeated bricks. I could also create numerous bricks on a single node (again achieving weighting) but in this case I would be wasting machine resources since a brick must (it seems) have a dedicated process. Is there a way to achieve load balancing across non uniform node sizes. I expect that this use case is not exceptional.


    • Jeff Darcy says:


      Yes, using controlled placement to deal with heterogeneous node sizes is a good idea. It shouldn’t even be all that hard to implement – just have fix-layout issue a statfs to each subvolume before it starts, and use the returned info to guide weighting. You could even have it weight by remaining space instead of total space if you want . . . which brings me to my first caveat. Sometimes you actually don’t want to balance space. You might want to balance speed instead, so you might assign a larger range to a faster brick rather than to a larger one. Thus, it would probably make sense to have an “algorithm” argument to fix-layout: total size, remaining size, speed, probably others. When we add storage classes the weights would probably within a class, though it becomes unclear what happens when bricks can belong to multiple overlapping classes (I think finding an optimal solution is isomorphic to some NP-complete problem or other).

      As for virtual nodes, they don’t quite solve the problem of assigning weights differently. What they do is provide a way to optimize the transition from one set of assignments to another, by minimizing data motion. In other words, you can do weighting without virtual nodes to get a better final result or you can do virtual nodes without weighting to have a shorter and less disruptive transition. Do both, and you’re in really good shape.

      I think I might hack on the weighting thing a bit later today, just for fun. Thanks for the idea.

      • r00z says:


        I really need this load-balancing algorithm too !
        Actually the distributed mode seems to spread the writings with round-robin algorithm.
        Same problems on my 3 nodes that’s exporting 1 brick each.
        My tests are stuck, now my node1 is full filled up so my writing on clustered storage fail 1/3 (write fail when trying to write on node1).
        This is my test nodes :
        node1 : exporting 10To (~100%)
        node2 : exporting 12To (~90%)
        node3 : exporting 20To (~50%)

        Please leave a feedback when you’ll have done your hacks :P

        Thanks a lot!

  2. John Read says:

    Hello Jeff,

    Great article. Much appreciated. I am doing serious testing before deploying GlusterFS to production at a large company.

    In my testing on a distributed volume the following statements you made above are just not happening. When I have a full brick and a file hashes there, the write fails.

    • If that brick is full, choose another brick (doesn’t really matter how) that isn’t instead.
    • Send a CREATE request to the chosen brick for that file.
    • If we “diverted” because of a full brick, go back and add a link file to the brick chosen by pure hashing.

    Do you have any advice on how to get that going? I am using the native client. Perhaps I am missing a volume option or something?

  3. John Read says:

    Hello, In reference to my previous comment which is still under moderation:

    Didn’t occur to me that this could be a bug. In my testing I was using Red Hat Storage’s version of gluster. I’ll report back if I can reproduce this error on any of the open source versions.

    gluster –version
    glusterfs built on Nov 8 2012 22:30:35

    Meanwhile, it was a bug and there is a work around.

    Here’s the bug. Bug 874554 – cluster.min-free-disk not having an effect on new files @
    From the documentation:
    cluster.min-free-disk Specifies the percentage of disk space that must be kept free. Might be useful for non-uniform bricks. Default is 10%

    Note how that is not very clear and doesn’t really tell you if that is per brick or per volume. It is per brick. What happens is linkfiles should be created on the target brick and the actual file redirected to less full bricks.

    The fix is to use a size number, not a % for this volume option. This is undocumented at the moment and is not listed as valid input but it is and works.

    gluster volume set rhs_test1 cluster.min-free-disk 1GB