My recent work on High Speed Replication isn’t the only thing I’ve been doing to improve GlusterFS performance. In addition to that 2x improvement in synchronous/replicated write performance, here are some of the other changes in the pipeline.
Patch 3005 is a more reliable way to make sure we use a local copy of a file when there is one. The current “first to respond” method doesn’t always work, because the first to respond is usually the first to be asked, and we always query the servers in the same order regardless of where we are. Thus, if the bricks are on nodes A and B, everyone will query A first and most of the time everyone – including B – will decide to read from A. The new method should work a lot better in “peer to peer” deployments where every machine is both client and server.
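The idea can be sketched roughly like this. This is a toy illustration, not the patch itself (the real logic lives in AFR’s C code, and the function and parameter names here are hypothetical):

```python
import socket

def pick_read_child(bricks, local_host=None):
    """Prefer a replica stored on the local machine, if one exists.

    bricks: list of (host, path) tuples describing one replica set.
    Returns the index of the brick to read from.
    """
    if local_host is None:
        local_host = socket.gethostname()
    for i, (host, _path) in enumerate(bricks):
        if host == local_host:
            return i   # a local copy exists; read it without crossing the network
    return 0           # no local copy; fall back to the first replica
```

In a peer-to-peer deployment where node B is both client and server, this picks B’s own brick instead of converging on A just because A was queried first.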
Patch 2926 handles the opposite case, where servers and clients are separate. The same convergence behavior mentioned above was causing strange inconsistencies in performance runs. Sometimes (if clients happened to divide themselves evenly across replicas) performance was really good; other times (if clients happened to converge on the same – usually first – replica) it was really bad. The new method uses hashing to ensure that both one client accessing multiple files and multiple clients accessing the same file will distribute that access across replicas. If hash-based random distribution is good enough for DHT, it’s good enough for AFR as well.
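To make the hashing idea concrete, here is a minimal sketch (hypothetical names, and the real patch operates on GlusterFS internals rather than strings): by hashing both the client’s identity and the file’s identity, one client opening many files and many clients opening one file both spread across the replica set.

```python
import zlib

def hashed_read_child(client_id, file_id, num_replicas):
    """Pick a replica by hashing client and file identity together.

    Deterministic per (client, file) pair, but distributed across
    replicas when either the client or the file varies.
    """
    key = f"{client_id}:{file_id}".encode()
    return zlib.crc32(key) % num_replicas
```

Because the choice is deterministic, a given client keeps reading a given file from the same replica (preserving cache locality) while the aggregate load still balances.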
Patch 3004 deals with distribution rather than replication, and it might be the most interesting of the three, but by itself it doesn’t tell the whole story of what I’m up to. All the patch does is support a “layout” type (see my article about distribution) that the built-in rebalance infrastructure will leave alone. Thus, if you want to create a carefully hand-crafted layout for a directory so that all files go to bricks in a certain rack, you’ll be able to do that safely. More importantly, it provides the essential “hook” for experimenting with different rebalancing algorithms. I already have a Python script that can calculate a new layout based on an old one, with two twists.
- It checks the brick sizes, and generates a new layout with the hash ranges weighted according to those sizes. This improves behavior when bricks are different sizes – as is often the case when the new batch of disks is 50% larger than the batch purchased a year ago.
- It also tries to build a new layout that minimizes the data motion coming from the old one. The built-in algorithm is a bit stupid this way, tending to move (N-1)/N of the data when you add brick N. For large N, this means a pretty complete reshuffling of your data. The new algorithm seems to converge on moving only about 40% of the data if you want a “perfect” final result, or all the way below 10% if you’re willing to sacrifice a bit of “accuracy” to minimize the disruption associated with data migration.
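The size-weighting part of that script looks something like this toy version (the function name and the single-contiguous-range-per-brick simplification are mine, not the script’s):

```python
def weighted_layout(brick_sizes, hash_max=2**32 - 1):
    """Assign each brick a hash range proportional to its size.

    brick_sizes: capacities in any consistent unit (e.g. GB).
    Returns a list of (start, end) ranges tiling [0, hash_max].
    """
    total = sum(brick_sizes)
    ranges, start = [], 0
    for i, size in enumerate(brick_sizes):
        if i == len(brick_sizes) - 1:
            end = hash_max  # last brick absorbs integer-division slack
        else:
            end = start + (hash_max * size) // total
        ranges.append((start, end))
        start = end + 1
    return ranges
```

A brick that is 50% larger gets a hash range (and therefore an expected share of files) about 50% larger, instead of the equal split the built-in rebalance would produce.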
The one thing still missing is a way to apply these changes and make them stick. Once we have that, what I envision is a three-step approach to adding bricks.

- Sometimes you might not migrate any data at all; you just adjust the layouts to prefer placement of new data on the new bricks (a very small tweak to the total-size-based weighting I’ve already implemented).
- If you want to be a bit more aggressive about rebalancing, you can do the low-data-movement kind of rebalancing, which will get you pretty close to an optimal layout.
- Then, perhaps during planned downtime or predictable low-usage periods, you do the full rebalance to ensure fully optimal placement of data across your cluster.
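One way to see the difference between these options is to measure how much of the hash space changes hands between an old layout and a new one. This is a toy metric of my own, under simplifying assumptions (ranges start at 0, one contiguous range per brick, same-index entries belong to the same brick):

```python
def movement_fraction(old, new):
    """Fraction of the hash space whose owning brick changes.

    old, new: lists of (start, end) ranges, one per brick, where
    matching indices refer to the same brick. This approximates the
    fraction of data a rebalance would have to move.
    """
    span = old[-1][1] + 1          # assumes ranges start at 0
    stay = 0
    for (old_start, old_end), (new_start, new_end) in zip(old, new):
        lo, hi = max(old_start, new_start), min(old_end, new_end)
        if hi >= lo:
            stay += hi - lo + 1    # hash values that keep their brick
    return 1 - stay / span
```

For example, growing a two-brick layout to three bricks with a naive even split moves about half the data, while placing the new brick’s range to overlap the old ranges as little as possible moves noticeably less; that is the kind of gap the movement-minimizing algorithm exploits.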
Those aren’t the only patches I have in the works, but I think they constitute a coherent group with the common goal of distributing load across the cluster more evenly than before. That’s what “scale out” is supposed to mean, so I’m glad that we’ll finally be capturing more of the advantages associated with that approach.