GlusterFS vs. Ceph

Everywhere I go, people ask me about Ceph. That’s hardly surprising, since we’re clearly rivals – which by definition means we’re not enemies. In fact I love Ceph and the people who work on it. The enemy is expensive proprietary Big Storage. The other enemy is things like HDFS that were built for one thing and are only good for one thing but get hyped relentlessly as alternatives to real storage. Ceph and GlusterFS, by contrast, have a lot in common. Both are open source, run on commodity hardware, do internal replication, scale via algorithmic file placement, and so on. Sure, GlusterFS uses ring-based consistent hashing while Ceph uses CRUSH, GlusterFS has one kind of server in the file I/O path while Ceph has two, but they’re different twists on the same idea rather than two different ideas – and I’ll gladly give Sage Weil credit for having done much to popularize that idea.

It should be no surprise, then, that I’m interested in how the two compare in the real world. I ran Ceph on my test machines a while ago, and the results were very disappointing, but I wasn’t interested in bashing Ceph for not being ready so I didn’t write anything then. Lately I’ve been hearing a lot more about how it’s “nearly awesome” so I decided to give it another try. At first I tried to get it running on the same machines as before, but the build process seems very RHEL-unfriendly. Actually I don’t see how duplicate include-file names and such are distro-specific, but the makefile/specfile mismatches and hard dependency on Oracle Java seem to be. I finally managed to get enough running to try the FUSE client, at least, only to find that it inappropriately ignores O_SYNC so those results were meaningless. Since the FUSE client was only slightly interesting and building the kernel client seemed like a lost cause, I abandoned that effort and turned to the cloud.
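To catch the kind of O_SYNC problem mentioned above, a quick sanity check is to time synchronous random writes and compare them against what the underlying storage is known to deliver. Here’s a rough sketch of such a check (the path, region size, and durations are illustrative, not the actual harness I used):

```python
import os
import random
import time

def sync_write_iops(path, block=4096, span=64 * 1024 * 1024, seconds=5):
    """Random O_SYNC 4KB writes within a fixed-size file. Each write
    should not return until the data is durable, so a filesystem that
    silently ignores O_SYNC will report suspiciously high numbers."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_SYNC, 0o600)
    os.ftruncate(fd, span)  # preallocate the test region
    buf = b"\0" * block
    blocks = span // block
    ops = 0
    deadline = time.time() + seconds
    try:
        while time.time() < deadline:
            # seek to a random block-aligned offset, then write one block
            os.lseek(fd, random.randrange(blocks) * block, os.SEEK_SET)
            os.write(fd, buf)
            ops += 1
    finally:
        os.close(fd)
        os.unlink(path)
    return ops / seconds
```

If a FUSE mount reports sync IOPS far above what the raw device can do, the O_SYNC flag is almost certainly being dropped somewhere, and the numbers are meaningless.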

For these tests I used a pair of 8GB cloud servers that I’ve clocked at around 5000 synchronous 4KB IOPS (2400 buffered 64KB IOPS) before, plus a similar client. The very first thing I did was verify that local performance was still what I’d measured before. Oddly, one of the servers was right in that ballpark, but the other was consistently about 30% slower; that’s something to bear in mind for the numbers that follow. In any case, I installed Ceph “Argonaut” and GlusterFS 3.2 because those were the versions already packaged. Both projects have improved since then – another thing to consider. Let’s look at the boring number first: buffered sequential 64KB IOPS.

[Graph: buffered sequential 64KB IOPS]

No clear winner here. The averages are quite similar, but of course you can see that the GlusterFS numbers are much more consistent. Let’s look at the graph that will surprise people – synchronous random 4KB IOPS.

[Graph: synchronous random 4KB IOPS]

Oh, my. This is a test that one would expect Ceph to dominate, what with that kernel client to reduce latency and all. I swear, I double- and triple-checked to make sure I hadn’t reversed the numbers. My best guess at this point is that the FUSE overhead unique to GlusterFS is overwhelmed by some other kind of overhead unique to Ceph. Maybe it’s the fact that Ceph has to contact two servers at the filesystem and block (RADOS) layers for some operations, while GlusterFS only has a single round trip. That’s just a guess, though. The important thing here is that a lot of people assume Ceph will outperform GlusterFS because of what’s written in a paper, but what’s written in the code tells a different story.

Just for fun, I ran one more set of tests to see if the assumptions about FUSE overhead at least held true for metadata operations – specifically directory listings. I created 10K files, did both warm and cold listings, and removed them. Here are the results in seconds.

                  Ceph   GlusterFS
create         109.320     184.241
cold listing     0.889       9.844
warm listing     0.682       8.523
delete          93.748      77.334

Not too surprisingly, Ceph beat GlusterFS in most of these tests – more than 10x for directory listings. We really do need to get those readdirp patches in so that directory listings through FUSE aren’t quite so awful. Maybe we’ll need something else too; I have a couple of ideas in that area, but nothing I should be talking about yet. The real surprise was the last test, where GlusterFS beat Ceph on deletions. I noticed during the test that Ceph was totally hammering the servers – over 200% CPU utilization for the Ceph server processes, vs. less than a tenth of that for GlusterFS. Also, the numbers at 1K files weren’t nearly as bad. I’m guessing again, but it makes me wonder whether something in Ceph’s delete path has O(n²) behavior.
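For anyone who wants to reproduce the metadata test, or probe the super-linear-delete hypothesis, the shape of it is simple: create N small files, list the directory, delete them, and compare timings at two or more values of N. A minimal sketch (local directory, no cold/warm cache-drop step, file counts purely illustrative):

```python
import os
import tempfile
import time

def metadata_times(n):
    """Time create / list / delete for n empty files in a fresh
    directory. Run against a mounted filesystem by pointing
    tempfile at it; this sketch just uses the default temp dir."""
    d = tempfile.mkdtemp()
    t = {}

    start = time.time()
    for i in range(n):
        open(os.path.join(d, "f%06d" % i), "w").close()
    t["create"] = time.time() - start

    start = time.time()
    names = os.listdir(d)
    t["list"] = time.time() - start
    assert len(names) == n

    start = time.time()
    for name in names:
        os.unlink(os.path.join(d, name))
    t["delete"] = time.time() - start

    os.rmdir(d)
    return t

# If metadata_times(10000)["delete"] is much more than ten times
# metadata_times(1000)["delete"], the delete path is super-linear.
```

A proper cold-listing measurement would also need to drop caches (or remount) between the create and list phases; that step is omitted here.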

So, what can we conclude from all of this? Not much, really. These were quick-and-dirty tests, so they don’t prove much. What’s more interesting is what they fail to prove: that Ceph’s current code is capable of realizing any supposed advantage of its architecture. Either those advantages aren’t real, or the current implementation isn’t mature enough to demonstrate them. It’s also worth noting that these results are pretty consistent with both Ceph’s own Argonaut vs. Bobtail performance preview and my own previous measurements of a block-storage system I’ve been told is based on Ceph. I’ve seen lots of claims and theories about how GlusterFS is going to be left in the dust, but so far the evidence seems to point (weakly) the other way. Maybe we should wait until the race has begun before we start predicting the result.


11 Responses


  1. Clint Byrum says:

    It’s probably worth noting that most of the excitement around Ceph is about its block-device scalability, not its filesystem performance.

  2. Jeff Darcy says:

    That’s true, Clint. On the other hand, as I was just saying to someone on Google+, there’s not likely to be much difference at that level. The preferred method in both cases is to use the qemu driver, with librados for Ceph and libgfapi for GlusterFS (not FUSE). The filesystem layer is the hard part to implement without sacrificing performance, which is probably why it’s taking so long to mature. I’d love to do block-storage comparisons, and object-storage comparisons too, but I had to start somewhere.

  3. Is Gluster going to support block storage? In particular, will it be useable as a volume storage backend in OpenStack?

  4. Jeff Darcy says:

    GlusterFS already supports block storage in a general sense, either via loopback or, more recently, via a qemu driver (equivalent to Ceph’s RBD, written by IBM). For OpenStack specifically, I know there are integration efforts going on, but I don’t know enough about the details to comment intelligently.

  5. Jamie Begin says:

    Can you clarify what you mean by “For these tests I used a pair of 8GB cloud servers”?

    Wouldn’t using hardware that you have physical access to be more useful when benchmarking a filesystem? Cloud-based block storage performance is often wildly erratic and there are many underlying variables that can’t be controlled for.

  6. Jeff Darcy says:

    Did you read the first two paragraphs, Jamie? I did try to run on physical servers first, but then I ran into the extreme RHEL-unfriendliness of the Ceph build and it was easier to spin up a few cloud servers than to reinstall the OS on the physical ones for their sake. That approach also has the advantage that others can reproduce my results.

    I’m well aware of the variability in cloud block storage performance, but it is possible to measure and correct for that. When the block-storage performance is stable over a period of time[1] and the filesystem consistently captures less than a quarter of that performance, that’s not the block storage’s fault. Either the filesystem is overly sensitive to general *network* slowness (e.g. by incurring too many round trips per operation and/or by failing to parallelize/pipeline operations) or it has purely internal problems.

    [1] Not hard to find on Rackspace or Storm on Demand’s SSD servers. Cleverkite’s servers are consistent, but consistently *bad*. Digital Ocean’s are good for a few brief moments, then get throttled into oblivion. Amazon’s are just all over the map all of the time, which is why I don’t do tests there unless I’m specifically testing a new Amazon instance type. Remember, cloud != Amazon.

  7. Jeff Darcy says:

    I was able to run some tests on a couple of machines at work: 8x 2.4GHz Westmere CPUs, 24GB memory, 7x SAS disks each, running RHS 2.0 (approximately RHEL 6.2) with GlusterFS 3.3.1 and Ceph 0.56 (Bobtail). The numbers for disk were uninteresting in both cases, so I switched to using ramdisks. Then I had to install Fedora 18 in a VM on the client (servers are still bare metal) because the Ceph kernel client won’t even build on a kernel that old and I didn’t really feel like blowing away the OS on one of my machines for a competitor’s sake. It’s not ideal by any stretch, but at least it doesn’t have any random variation from other users.

    In that configuration, Ceph actually managed to win significantly for medium thread counts (10-20 threads) but had much greater variability and *huge* CPU usage (over 400% sometimes) in the OSD daemons. This makes for a much more interesting performance/scalability picture with a less clear winner, but I’m too tired after fighting with SELinux all evening to turn the results into pretty graphs right now.

  8. Sunghost says:

    Thanks for that. I’m actually looking for a good system for video streaming and tried MooseFS, which performed under 50 Mbit/s; now I’m looking for something better. Would you use GlusterFS for that?

  9. John Wallace says:

    Thanks for doing this work! We are looking for a new distributed filesystem to possibly replace our 20-year-old OpenAFS cell, and Ceph and GlusterFS are both on our list.

  10. Mattias Eliasson says:

    It should be noted that besides scalability, Ceph also has self-healing and self-management as design goals. If you want performance, you could add OpenAFS on top of it for aggressive local caching.

    Those design goals make Ceph a lot easier to deploy than GlusterFS, and for that reason alone I prefer Ceph.

    Of course it’s not entirely plug and play, but it’s far closer to that than any other cloud storage I have tried. That includes OpenAFS.

    Personally, I plan to use Ceph even as a local filesystem, to spread data among disks, where I see it as far superior to ZFS and other local enterprise filesystems.

    On another issue: there is more than one distributed filesystem out there, and when I look at GlusterFS and Ceph I see a lot of redundant code in both projects. I don’t know how many CRC32 implementations I have seen in my life, for example (I don’t remember whether GlusterFS has one, but it’s still a good example). Part of the problem is that a lot of processors these days have CRC32 in hardware. Solaris has system APIs for such algorithms, with a system-wide implementation, and on UltraSPARC that implementation is in hardware. That’s one issue I have with code redundancy. The other is that it also means bug redundancy: if there is an error in many CRC32 implementations, how do I know which applications are affected? If all of them used a common shared library, I would know for certain.

    This is something that both GlusterFS and Ceph developers should consider: share as much code as possible, and do it in some easy-to-use packaging, like libcrc32.so. Of course, for system-wide libraries to replace all per-application implementations, this may require a license other than the GPL.

    There is also a lot of public-domain code in the SQLite project that overlaps with cluster-filesystem projects; CRC32, for example.