My standing search for “glusterfs” on Twitter got me into an interesting discussion with Dr. Shawn Tan about a GlusterFS configuration for several workstations. At first my reaction was panic, because I could see potential for data loss in that configuration, but “you’re using it wrong” is rarely a productive response from a developer. Dr. Tan’s use case is actually quite valid. It just happens to be one that GlusterFS doesn’t deal with very well right now, so instead of spending my time feeling superior I spent it thinking about what we could do differently. The basic parameters here are:
- Provide shared access to data across several workstations in a cluster.
- Use replication to ensure that data can survive a workstation failure.
- Maximize the percentage of reads that are local.
- Maximize the percentage of writes that are local.
The first two points are pretty clearly good reasons to use GlusterFS or something like it. It’s the third and particularly fourth points where this configuration runs into trouble. Caching data for reads is fine, as long as you understand your consistency needs and don’t give up more consistency than your applications or users can really tolerate. Caching (actually buffering) writes is much more dangerous. If your goal is to ensure that data can survive a workstation failure, then writing it only on the workstation and leaving it there forever is simply a non-starter. However, writing it initially on the workstation and writing it asynchronously to another machine might shrink the window of vulnerability to an acceptable level. GlusterFS’s “geosync” won’t do many-to-one replication, which would be very space-inefficient anyway, so how else might we get something like the same effect?
One part of the solution would be something like the “NUFA” (Non Uniform File Access) feature that GlusterFS used to have, but which has been deprecated since “DHT” replaced “unify” before I was even seriously involved with the project. The idea behind NUFA was to create files locally whenever possible, and write to those local files instead of via the “normal” kind of distribution. That sounds ideal, except that as soon as you add replication you’re writing remotely again, so NUFA’s not doing anything for you... or is it? It’s making sure one copy of the file is local. What if we could make that the only copy initially, so that we only do local writes in the main I/O path, but with some control over how long it takes before that data does get replicated somewhere else?
That’s where my bypass translator comes in. It’s not production quality yet, but it basically fits the bill of setting up two replicas and then populating only one. Since it works with the regular AFR translator, you can then use regular self-heal to propagate the changes to another replica. With self-heal becoming more precise, thanks to lots of hard work by other members of the GlusterFS team, this will soon be a pretty efficient process. The only thing you lose is strict ordering. Thus, you might not want to do this for parallel applications that depend on such ordering, but it’s probably fine for sharing on the time scale of a person working on one workstation before lunch and a different one after. Your I/O flow would be something like this:
- User at workstation A creates a new file and writes data into it. The file is actually created on A and B, but the data is only written into A (with all of the pending flags set so AFR knows that B needs an update).
- Every ten minutes, a job runs that “self-heals” partially written files from A to B, A to C, or in any other case where A shares a replica set with one of the other workstations.
- The same user subsequently sits down at workstation C and reads the same file. It’s read from A and/or B, which is unfortunate, but that’s a case where we can deploy caching to remove some of the sting.
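To make that flow concrete, here is a toy in-memory model of it. This is only a sketch: `Replica`, `bypass_write` and `heal` are hypothetical names invented for illustration, not GlusterFS APIs, and the real bypass translator and AFR keep their state in ways this model ignores.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One workstation's copy of the files in a two-way replica set."""
    name: str
    data: dict = field(default_factory=dict)     # path -> file contents
    pending: dict = field(default_factory=dict)  # path -> replica owed an update


def bypass_write(local: Replica, peer: Replica, path: str, payload: bytes) -> None:
    """Create the file on both replicas, but write the data only locally."""
    peer.data.setdefault(path, b"")   # the file exists on the peer, still empty
    local.data[path] = payload        # the only write in the main I/O path
    local.pending[path] = peer.name   # record that the peer needs an update


def heal(source: Replica, peers: dict) -> list:
    """Periodic job: copy every pending file to the replica that is owed it."""
    healed = []
    for path, peer_name in list(source.pending.items()):
        peers[peer_name].data[path] = source.data[path]
        del source.pending[path]
        healed.append(path)
    return healed


a, b = Replica("A"), Replica("B")
bypass_write(a, b, "/notes.txt", b"draft 1")
stale = b.data["/notes.txt"]    # what B holds before the heal job runs
healed = heal(a, {"B": b})
fresh = b.data["/notes.txt"]    # what B holds after the heal job runs
```

Between the write and the heal, B holds an empty file while A holds the only real copy. That gap is the window of vulnerability, and its length is set by how often the heal job runs.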
If a user gets up and moves immediately to another workstation, they might see stale data. If they then start modifying the same files from the new workstation, split-brain problems could result. Similarly, if a failure happens on A before a new file has been propagated to B, then that file’s actual contents will be unavailable until A (or at least the relevant disk from A) comes back. These might be absolute show-stoppers for most people. On the other hand, I could see using something like this myself, e.g. between my work machine, my home machine, one or more laptops, and my general-purpose server at Rackspace. On the third hand, maybe I’ll just wait until I have a chance to work on my personal Patch Based Filesystem project, which would handle this exact use case even better.
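The difference between “merely stale” and “split-brain” can be sketched as a decision on two counters, one per replica, each recording how many updates that replica believes the other is missing. This is a deliberately simplified model with a hypothetical function name and hypothetical counters. It is not AFR’s actual on-disk bookkeeping.

```python
def heal_direction(a_blames_b: int, b_blames_a: int) -> str:
    """Decide what a heal can do, given each replica's count of updates
    it believes the other replica is missing."""
    if a_blames_b and b_blames_a:
        return "split-brain"  # both sides changed independently
    if a_blames_b:
        return "A->B"         # A is ahead, B is merely stale
    if b_blames_a:
        return "B->A"         # B is ahead, A is merely stale
    return "in-sync"


# A was written locally and not yet healed: B is stale but recoverable.
stale_case = heal_direction(a_blames_b=3, b_blames_a=0)

# The user moved on and modified the stale copy before the heal ran.
conflict_case = heal_direction(a_blames_b=3, b_blames_a=1)
```

In the first case a heal simply copies from A. In the second, neither copy is a superset of the other, so no automatic choice of source is safe.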
As a further optimization, the system could detect reads and/or existing-file writes at C for files living on A and B, then start proactively relocating related files so that one copy is at C. Ideally, this would mean adding a replica at C (turning two-way replication into three-way) and then deleting the replica at A or B (going back to two-way) so that data protection is maintained throughout. Unfortunately, current GlusterFS replication doesn’t offer this kind of flexibility to change replication levels on a per-file or temporary basis. We don’t have the detection of changes to trigger this migration, or the policy to identify related files, so this is all clearly in the future anyway. However, it’s a direction we could go, step by incremental step, providing at least a little bit of additional value to at least a few users each time.
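The ordering matters in that add-then-delete migration, and a small sketch shows why. The function below is hypothetical, since, as noted, GlusterFS has no per-file mechanism like this today. It only tracks which workstations hold a copy at each step.

```python
def migrate_replica(replicas: set, source: str, target: str) -> list:
    """Move one copy from `source` to `target`, adding before deleting.
    Returns the set of replica holders after each step."""
    if source not in replicas:
        raise ValueError(f"{source} holds no replica")
    if target in replicas:
        return [set(replicas)]    # nothing to do, target already has a copy
    current = set(replicas)
    steps = []
    current.add(target)           # step 1: grow from two-way to three-way
    steps.append(set(current))
    current.remove(source)        # step 2: shrink back to two-way
    steps.append(set(current))
    return steps


steps = migrate_replica({"A", "B"}, source="A", target="C")
fewest_copies = min(len(s) for s in steps)
```

Because the copy at C is added before the copy at A is removed, the file never has fewer than two replicas. Doing the two steps in the opposite order would leave a single copy in between.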