A lot of people seem to be curious about how GlusterFS works, not just in the sense of effects but in terms of internal algorithms etc. as well. Here’s an example from this morning. The documentation at this level really is kind of sparse, so I might as well start filling some of the gaps. Today I’ll talk about DHT, which is the real core of how GlusterFS aggregates capacity and performance across multiple servers. Its responsibility is to place each file on exactly one of its subvolumes – unlike either replication (which places copies on all of its subvolumes) or striping (which places pieces onto all of its subvolumes). It’s a routing function, not splitting or copying.
The basic method used in DHT is consistent hashing. Each subvolume (brick) is assigned a range within a 32-bit hash space, covering the entire range with no holes or overlaps. Then each file is also assigned a value in that same space, by hashing its name. Exactly one brick will have an assigned range including the file’s hash value, and so the file “should” be on that brick. However, there are many cases where that won’t be the case, such as when the set of bricks (and therefore the range assignment of ranges) has changed since the file was created, or when a brick is nearly full. Much of the complexity in DHT involves these special cases, which we’ll discuss in a moment. First, though, it’s worth making a couple more observations about the basic algorithm.
- The assignment of hash ranges to bricks is determined by extended attributes stored on directories (here’s a description of those data structures). This means the distribution is directory-specific. You could well distribute files differently – e.g. across different sets of bricks – in different directories if you know what you’re doing, but it’s quite unsafe. Firstly it’s unsafe because you’d really better know what you’re doing. Secondly it’s unsafe because there’s no management support for this, so the next time you do a rebalance (more about that later) it will happily stomp on your carefully hand-crafted xattrs. In the fairly near future, I hope to add a feature to recognize hand-set xattrs as such and leave them alone. In the more distant future, there might be management support for assigning bricks to various pools or classes of storage, and defining per-directory placement policies in terms of those.
- Consistent hashing is usually thought of as hashing around a circle, but in GlusterFS it’s more linear. There’s no need to “wrap around” at zero, because there’s always a break (between one brick’s range and another’s) at zero.
- If a brick is missing, there will be a hole in the hash space. Even worse, if hash ranges are reassigned while a brick is offline, some of the new ranges might overlap with the (now out of date) range stored on that brick, creating a bit of confusion about where files should be. GlusterFS tries really hard to avoid these problems, but it also checks aggressively to make sure nothing slips through. If you ever see messages in your logs about holes or overlaps, that means the checking code is doing its job.
So, those are the basics. How about those special cases? It’s probably easiest to look at the “read” path first, where we’re trying to find a file that we expect to be there. Here’s the sequence of operations.
- Make sure we have the hash-range assignments (the “layout”) for each brick’s copy of the parent directory. This information is cached, so we’ll usually have it already.
- Hash the file name and look up the corresponding brick in the layout.
- Send a LOOKUP request to that brick, specifying the file path.
- If the LOOKUP comes back positive, we found the file and we’re done.
- Otherwise, re-send the LOOKUP to all bricks to see who really has the file.
- If nobody gives a positive reply, the file really isn’t there and we’re done again.
- Go back to the brick where the file “should” have been, and create a link file (described below) pointing to the real location.
- Return the LOOKUP result to the caller.
What’s a link file, then? Have you ever looked on one of your bricks and seen zero-length files with weird permissions (sticky bit set)? Those are link files. If you look closer, you’ll also see that they have trusted.dht.linkfile xattrs with brick names in them. That’s how we avoid the “broadcast” mentioned above. On subsequent lookups, if we find a link file we just follow it to the real brick. Considering that we only go through this lookup procedure once per file per client anyway (location information is cached), the cost of “guessing wrong” is therefore pretty minimal. I once implemented a scheme where we do an exponentially expanding search instead of an immediate broadcast, hoping to achieve a better balance of lookup latency vs. network traffic, but in the end it just didn’t seem to make a difference so the extra complexity wasn’t worth it. Now, let’s look at the file-creation path.
- Assume we’ve already done a lookup, so we already have the layout information cached and we know the file doesn’t already exist anywhere.
- Hash the file name and look up the corresponding brick in the layout.
- If that brick is full, choose another brick (doesn’t really matter how) that isn’t instead.
- Send a CREATE request to the chosen brick for that file.
- If we “diverted” because of a full brick, go back and add a link file to the brick chosen by pure hashing. The next client will almost certainly need it.
This brings us to rebalancing, which is one of the key challenges – and therefore one of the most interesting research areas IMO – in this kind of system. The first thing to know about GlusterFS rebalancing is that it’s not automatic. If you add a new brick, even new files won’t be put on it until you do the “fix-layout” part of rebalance, and old files won’t be put on it until you do the “migrate-data” part. What do these do?
- Fix-layout just walks the directory tree recalculating and modifying the trusted.glusterfs.dht xattrs to reflect the new list of bricks. It does this in a pretty simple way, assigning exactly one range of length MAXINT/nbricks to each brick in turn starting at zero.
- Migrate-data is much more costly. For each file, it calculates where the file “should” be according to the new layout. Then, if the file is not already there, it moves the file by copying and renaming over the original. There’s some tricky code to make sure this is transparent and atomic and so forth, but that’s the algorithm.
In my personal opinion, there are
problemsenhancement opportunities in both of these areas. Let’s take these in reverse order. Migrate-data is slow. What it should do is run in parallel on all of the bricks, with each brick either “pushing” data that is currently local but needs to be elsewhere or “pulling” data that’s the other way around. What it does instead is run on one node, potentially moving files for which it is neither source nor destination. This is a big deal, because it causes rebalance to take days when it should take hours – or weeks when it should take days, on larger installations. The amount of I/O involved is also why you don’t necessarily want this to be an automatic process.
While the migrate-data issue is at the level of mechanics and implementation, the fix-layout issue is at more of a conceptual level. To put it simply, when we add a new brick we should reallocate approximately 1/new_brick_count hash values. Because our layout calculations are naive, we will usually reallocate much more than that – exacerbating the migrate-data problem because reallocated hash values correspond to moved files. Time for a picture.
The outer ring represents the state with just three bricks – hash value zero at the top, split into three equal ranges. The inner ring represents the state after adding a fourth brick. Any place where the inner and outer rings are different colors represents a range that has been reassigned from one brick to another – implying a migration of data likewise. If you look carefully, you’ll see that we’re moving half of the data when it should be only a quarter – 8% blue to orange, 17% orange to yellow, and 25% yellow to green. What could we do that’s better? Not much, if we stay within the limitation of a single brick claiming a single range, but there really doesn’t need to be such a limitation. Instead, we could borrow from Dynamo and assign multiple “virtual node IDs” for brick four, giving it a total of 25% drawn equally from bricks one through three. (If you look at this as a clock, that’s one hour each at three, seven, and eleven o’clock). Taking this approach too far can cause layout tables to get unreasonably large, so sometimes it makes more sense to forego this optimization or even reverse it by combining/swapping ranges even though that will cause “unnecessary” data migration. Done sensibly, this can keep the table size under control without resorting to the scale-limiting “hash bucket” approach of some Dynamo derivatives. There are even more sophisticated ways to address this problem, but that’s a whole different post and this should be enough to convey the general problem space.
That’s how DHT works today, and some thoughts about how it might evolve in the future. The next article will focus on replication.