The new hi1.4xlarge instances in EC2 are pretty exciting, not only because they’re equipped with SSDs but because they’re also equipped with 10GbE and placement groups allow you to create server clusters that are closely colocated with full bandwidth among them. I was about ready to do another round of GlusterFS testing to see the effects of some recent changes (specifically the multi-threaded SSL transport and Avati’s delayed post-op in AFR) so it seemed like a good time to try out the new instances as well.
After firing up my two server instances, the first thing I did was check my local I/O performance. Each volume seemed to top out at approximately 30K IOPS, same as I’d seen at Storm on Demand when I was testing my replication code there, but the Amazon instances have two of those so they should be able to do 60K IOPS per instance (the 100K everyone else keeps quoting is just a marketing number). I couldn’t immediately fire up a third instance in the same placement group because of resource limits so I fired up a plain old m1.xlarge for the client. I’ve applied for a resource-limit increase so I can do the test I wanted to do, but for now these results should at least be directly comparable to Storm on Demand. All of these tests were run on a four-brick two-way-replicated GlusterFS volume to take full advantage of the hardware in the servers. Please bear in mind that these are random synchronous writes over a (slow) network, so the numbers will seem very low compared to those you’d get if you were testing async I/O locally. This is all about a worst case; the best case just wasn’t interesting enough to report on.

If you compare to the Storm on Demand graph (link above), a few things immediately become apparent. One is that the highest valid number (the unsafe “fast-path” number doesn’t count) has gone up from about 3000 to about 4000 IOPS. That’s nice, but also bear in mind that the Amazon instances cost $3.10 per hour and the Storm on Demand instances are only $0.41 per hour. Even if the IOPS numbers had doubled, that still wouldn’t seem like such a great deal.
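To put rough numbers on that, here is a small sketch of the price/performance arithmetic. It uses only the approximate IOPS figures and hourly prices quoted above; the `iops_per_dollar` helper and the dictionary layout are my own illustration, and it treats the price as per instance-hour without accounting for how many instances each test setup needed.

```python
# Approximate figures quoted in the post; prices are per instance-hour.
configs = {
    "Storm on Demand": {"iops": 3000, "usd_per_hour": 0.41},
    "EC2 hi1.4xlarge": {"iops": 4000, "usd_per_hour": 3.10},
}


def iops_per_dollar(iops, usd_per_hour):
    """Sustained IOPS obtained per dollar of hourly instance cost."""
    return iops / usd_per_hour


for name, cfg in configs.items():
    print(f"{name}: {iops_per_dollar(**cfg):.0f} IOPS per $/hour")

# How much more the EC2 instance costs versus how much faster it is.
price_ratio = 3.10 / 0.41
iops_ratio = 4000 / 3000
print(f"price ratio {price_ratio:.1f}x, IOPS ratio {iops_ratio:.2f}x")

# Hypothetical case from the text: even doubled IOPS (6000) on EC2
# would still deliver fewer IOPS per dollar than Storm on Demand.
print(iops_per_dollar(6000, 3.10) < iops_per_dollar(3000, 0.41))
```

On these figures the EC2 instance costs roughly seven and a half times as much for about a third more IOPS.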
The second obvious result is that the same number for “plain old AFR” has gone up from ~1500 IOPS to well over 4000, quite handily overtaking my own hsrepl. I’m not entirely sure why hsrepl actually managed to get worse, but my working theory is that the new handling of “xdata” (where we put the version numbers necessary for correct operation) is considerably less efficient than the handling I’d implemented on my own before. I don’t have hard evidence of that, but the new code will definitely go around in a much longer code path issuing more reads for the same data, and the sudden drop-off for hsrepl in my own local testing corresponds exactly with that change. In the end we seem to be even further from that theoretical maximum, even though the absolute IOPS number has increased.
The other mystery for me is why the multi-threading also seems to make things worse. This isn’t actually doing SSL, even though the two features were inextricably tied together in the same patch, so there’s not a lot more total computational load. These machines have plenty of cores to spare, so it shouldn’t be a thread-thrashing issue either. I expected the multi-threaded numbers to get a bit better, and in all of my other tests that has been the case. Maybe when I get my resource limit increased I’ll see something different in the all-10GbE environment.
That’s pretty much all I have to say about the new instances or GlusterFS running on them. They’re certainly a welcome improvement for this worst-case kind of workload, but I’ve seen their ilk before so the only thing that’s really new to me is the high price tag.
Grrr. The Amazon sales/support guy who handled my request doesn’t seem to realize that there’s a separate resource limit for this instance type. Yes, I can create 20 instances in us-east-1, but not 20 instances *of this type* and I can’t put instances of another type (e.g. the Amazon-AMI-only cluster compute type) into the same placement group. In other words, Amazon doesn’t even acknowledge the existence of the limit that’s blocking further progress.
Here are some more fun graphs. The first is from AWS, measuring the effect of disabling the POD (post op delay) optimization that’s now in AFR.


From this we can see that hsrepl’s optimizations actually were effective, but that POD is even more so. A little stumped by the poor hsrepl results, I went back to Storm on Demand and re-ran the same tests there.
From this we can see that hsrepl got a bit worse there too, and AFR (which still has other optimizations) got considerably better. We can also see that the results overall are better than on EC2: modestly so for the two slower configurations, more so for the fastest. Not bad for 1/7 the price. Lastly, it seems that I was luckier with this run than I had been on my previous run at the same provider, because some of the POD results exceed even the “speed of light” from last time. Maybe they’ve upgraded hardware a bit, or maybe I just landed on a less loaded machine or network segment. In any case, that’s an excellent result. I should mention that the POD numbers were rather inconsistent, so I actually did three runs and took the middle result for each thread count, and there’s still that odd dip from 10-16 threads. Even the bottom of that trough is still better than anything seen on EC2, or anything on SoD before, so the main conclusion is still that all of the work put into optimizing AFR has paid off well for this type of configuration.
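For anyone wanting to repeat the "three runs, take the middle one" step, a minimal sketch follows. The IOPS values are made-up placeholders, not my measurements, and the `middle_of_three` helper name is just for illustration.

```python
from statistics import median

# Placeholder IOPS values keyed by thread count, three runs each.
# These are NOT real results; substitute your own measurements.
runs_by_threads = {
    1: [100, 120, 90],
    2: [210, 190, 205],
    4: [400, 380, 420],
}


def middle_of_three(runs):
    """Return the median IOPS for each thread count, in thread-count order."""
    return {threads: median(values) for threads, values in sorted(runs.items())}


for threads, iops in middle_of_three(runs_by_threads).items():
    print(f"{threads:>2} threads: {iops} IOPS")
```

Taking the median rather than the mean keeps a single unusually good or bad run from skewing the plotted point.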
Great testing! Love your comment “thems marketing numbers…”
Are these non-virtualized machines or shared tenancy machines? I can’t help but think of contention for resources at or near 8 CPUs, but the curves bend over before that.
If these are non-virtualized (bare metal) machines, you might be exposing some architectural stuff, but more than likely you’d have to start pinning IRQs to individual processors, and turning off IRQ balancing, and doing some baseline OS tweaking via sysctl.
The 30k -> 3k drop off (local vs remote) looks pretty severe, more than we usually see apart from (badly) untuned physical systems. But that might be what you need to check for a kind of worst-of-the-worst-case scenario. Not quite sure how, if at all, you could tune the virtual stack for this … high performance and virtualization are rarely compatible, and are used to solve very different domain problems (as you know).
Of course they don’t say whether these machines are multi-tenant, virtual but single-tenant, or physical. Most common opinion seems to be that the new EC2 type is virtual but single-tenant. It’s less clear for the Storm on Demand machines. The high variability for some of my tests suggests to me that they are in fact shared, but that’s not entirely conclusive.
Yes, the 30K to 3K drop off is severe. OTOH, remember that these are synchronous operations over plain old GigE. Therefore a single stream of requests can get at most ~1000 IOPS no matter how fast the storage is. The test results show that we’re actually managing to overlap at least 4-5 requests at a time, which isn’t that bad for 20 threads on a single client. The servers didn’t seem to be working that hard, so if I feel like spending some more money (this all comes out of my own pocket BTW) I might try some runs with multiple clients to see what happens. I’d also *really* like to test with a third machine in the same placement group on the same 10GbE network, but Amazon seems clueless about how to enable that for me and Storm doesn’t even offer such a thing.
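The overlap estimate is just Little's law: requests in flight equals throughput times per-request latency. Here is a back-of-the-envelope sketch, assuming roughly 1 ms per synchronous round trip over GigE (which is what a ~1000 IOPS single-stream ceiling implies) and using the roughly 4000 IOPS peak mentioned in the post. The function names are only for illustration.

```python
def per_stream_iops(round_trip_seconds):
    """Ceiling for one stream that waits for each reply before sending the next."""
    return 1.0 / round_trip_seconds


def requests_in_flight(observed_iops, round_trip_seconds):
    """Little's law: average concurrency = throughput * latency."""
    return observed_iops * round_trip_seconds


# Assumption: about 1 ms per synchronous write over plain GigE.
rtt = 0.001

print(f"single-stream ceiling: {per_stream_iops(rtt):.0f} IOPS")
print(f"overlap at 4000 IOPS:  {requests_in_flight(4000, rtt):.1f} requests")
```

If the real round trip is somewhat longer than 1 ms, the implied overlap goes up proportionally, which is consistent with the "at least 4-5" figure.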