How To Build a Distributed Filesystem

This was originally a response on Quora, but it ended up attached to a comment that got downvoted (not by me) and that makes it hard to find so I figured I’d give it some more exposure here. I’ll add some more Gluster-specific commentary at the end, but first I just have to re-use the graphic from my original answer.

BoromirFS

Here’s how I outlined some of the steps. They might look familiar.

  1. Implement a simple object server. You just have to be able to create/delete objects and read/write aligned blocks within them at this point.
  2. Decide how objects are going to be placed and found among multiple object servers. Single leader with election and failover? Consensus on each decision? Deterministic algorithm such as consistent hashing or CRUSH? Implement a simple lookup/read/write protocol that can deal with multiple object servers – statically configured for now.
  3. Make the object-server configuration dynamic, so new servers can join the group and new objects can be placed on them. It’s OK for now if a server leaving means its data is lost. At this point you’re mostly just implementing a membership protocol.
  4. Decide how replication is going to work. How are replicas chosen? Is replication sync or async? Do you make replications crash-proof by wrapping them in transactions? By writing to one server and then having it write to others? By using Dynamo-style R/W/N semantics? Implement a basic form of replication so that one server can fail and operations can occur on others. Don’t worry about repair for now.
  5. Worry about repair. If a server comes down and then back up, how does it reconcile state for its objects with state for other copies of those objects on other servers? How are conflicts detected and resolved? This could be a whole semester-long course all by itself, but just implement whatever works for you and make it modular so you can replace it later.
  6. Add rebalancing. As servers are added or removed, data needs to be shifted around to match the new topology. Implement that process.
  7. Add basic filesystem semantics like nested directories and byte-granularity reads/writes. If you chose the Ceph/RADOS (Frangipani/Petal) approach, you’ll need to add a whole metadata layer converting file operations into block operations. If you chose the GlusterFS (PVFS) approach, you’ll need to add that functionality within the object – now file – servers themselves. Either way, you’ll need to deal with things like concurrent reads/writes within a file and listing directories while entries within them are being created/deleted/renamed.
  8. Add basic security – UID/GID and user/group/other permission bits. Make the code check permissions. Might as well add other standard filesystem metadata such as mtime here as well.
  9. Add other operation types – hard and soft links, xattrs, rename, etc.
  10. Add caching and other performance enhancers. Do you want stateful caching with invalidation (or updates)? If so, how will you handle client failures? Or maybe you prefer leases, or time-to-live. Consider your needs for data and metadata (including directory entries) separately. Ditto for prefetching, buffering/batching, etc. Each will be different, so plan and implement accordingly.

At this point you’ll be at approximate parity with the other distributed filesystems out there. Most of these steps should take about half to two months for a suitably skilled individual working full time or nearly so, plus another couple of months for generic stuff like FUSE interfaces and connection management, so you’re looking at a bit over a year. Triple everything if you decide to do it all in the kernel instead, and triple again if you want to make it production quality.

OK, so why did GlusterFS in user space take even longer than that? The first reason is the time in which it was written. Many of the techniques that now seem obvious were very poorly understood (by anyone) at the time, which meant a lot of experimenting and often backtracking. Some components had to be replaced multiple times. Many of the things we depend on, such as FUSE, didn’t work all that well at the time either. Secondly, that same industry-wide unfamiliarity meant that recruiting and training developers and testers was much harder. That meant reduced hands-on productivity both from the new folks (at each iteration) and from the old folks who became mentors. Thirdly, and I think most significantly, the constraints of having been in production and having to deal with actual customer bugs played a part. It’s not easy to drain the swamp when you’re up to your a~~ in alligators, and it’s not easy to write new code when you’re at a customer site fixing the old stuff. If you look at other serious distributed filesystems of approximately the same vintage, you’ll see the same thing. (Don’t even get me started on the planetary waste of brainpower that NFS has been over the same period.) It would be feasible for a single person to develop a semi-decent distributed filesystem in a year or so now (I think about that all the time) but that most certainly wasn’t true in 2006. We’ll all be developing more rapidly now, adding features left and right. Keep that in mind as you evaluate each new announcement and try to guess who’s going to be ahead in another year or two.