Everywhere I go, people ask me about Ceph. That’s hardly surprising, since we’re clearly rivals – which by definition means we’re not enemies. In fact I love Ceph and the people who work on it. The enemy is expensive proprietary Big Storage. The other enemy is things like HDFS that were built for one thing and are only good for one thing but get hyped relentlessly as alternatives to real storage. Ceph and GlusterFS, by contrast, have a lot in common. Both are open source, run on commodity hardware, do internal replication, scale via algorithmic file placement, and so on. Sure, GlusterFS uses ring-based consistent hashing while Ceph uses CRUSH, GlusterFS has one kind of server in the file I/O path while Ceph has two, but they’re different twists on the same idea rather than two different ideas – and I’ll gladly give Sage Weil credit for having done much to popularize that idea.
It should be no surprise, then, that I’m interested in how the two compare in the real world. I ran Ceph on my test machines a while ago, and the results were very disappointing, but I wasn’t interested in bashing Ceph for not being ready so I didn’t write anything then. Lately I’ve been hearing a lot more about how it’s “nearly awesome” so I decided to give it another try. At first I tried to get it running on the same machines as before, but the build process seems very RHEL-unfriendly. Actually I don’t see how duplicate include-file names and such are distro-specific, but the makefile/specfile mismatches and hard dependency on Oracle Java seem to be. I finally managed to get enough running to try the FUSE client, at least, only to find that it inappropriately ignores O_SYNC so those results were meaningless. Since the FUSE client was only slightly interesting and building the kernel client seemed like a lost cause, I abandoned that effort and turned to the cloud.
For these tests I used a pair of 8GB cloud servers that I’ve clocked at around 5000 synchronous 4KB IOPS (2400 buffered 64KB IOPS) before, plus a similar client. The very first thing I did was test local performance to verify that local performance was as I’d measured before. Oddly, one of the servers was right in that ballpark, but the other was consistently about 30% slower. That’s something to consider in the numbers that follow. In any case, I installed Ceph “Argonaut” and GlusterFS 3.2 because those were the ones that were already packaged. Both projects have improved since then; another thing to consider. Let’s look at the boring number first – buffered sequential 64KB IOPS.
No clear winner here. The averages are quite similar, but of course you can see that the GlusterFS numbers are much more consistent. Let’s look at the graph that will surprise people – synchronous random 4KB IOPS.
Oh, my. This is a test that one would expect Ceph to dominate, what with that kernel client to reduce latency and all. I swear, I double- and triple-checked to make sure I hadn’t reversed the numbers. My best guess at this point is that the FUSE overhead unique to GlusterFS is overwhelmed by some other kind of overhead unique to Ceph. Maybe it’s the fact that Ceph has to contact two servers at the filesystem and block (RADOS) layers for some operations, while GlusterFS only has a single round trip. That’s just a guess, though. The important thing here is that a lot of people assume Ceph will outperform GlusterFS because of what’s written in a paper, but what’s written in the code tells a different story.
Just for fun, I ran one more set of tests to see if the assumptions about FUSE overhead at least held true for metadata operations – specifically directory listings. I created 10K files, did both warm and cold listings, and removed them. Here are the results in seconds.
Not too surprisingly, Ceph beat GlusterFS in most of these tests – more than 10x for directory listings. We really do need to get those readdirp patches in so that directory listings through FUSE aren’t quite so awful. Maybe we’ll need something else too; I have a couple of ideas in that area, but nothing I should be talking about yet. The real surprise was the last test, where GlusterFS beat Ceph on deletions. I noticed during the test that Ceph was totally hammering the servers – over 200% CPU utilization for the Ceph server processes, vs. less than a tenth of that for GlusterFS. Also, the numbers at 1K files weren’t nearly as bad. I’m guessing again, but it makes me wonder whether something in Ceph’s delete path has O(n²) behavior.
So, what can we conclude from all of this? Not much, really. These were really quick and dirty tests, so they don’t prove much. It’s more interesting what they fail to prove, i.e. that Ceph’s current code is capable of realizing any supposed advantage due to its architecture. Either those advantages aren’t real, or the current implementation isn’t mature enough to demonstrate them. It’s also worth noting that these results are pretty consistent with both Ceph’s own Argonaut vs. Bobtail performance preview and my own previous measurements of a block-storage system I’ve been told is based on Ceph. I’ve seen lots of claims and theories about how GlusterFS is going to be left in the dust, but as yet the evidence seems to point (weakly) the other way. Maybe we should wait until the race has begun before we start predicting the result.