For a long time now, I’ve been running a server in the Rackspace cloud to do various things for me, including a sort of Dropbox equivalent to which I sync various files I want to be accessible from everywhere. Historically this has involved a combination of sshfs and encfs, but it’s about time I started eating my own dog food and using (some parts of) CloudFS for this. Yes, I know some people don’t like the “dog food” terminology, but the oft-cited “drinking our own champagne” alternative is even less applicable in this case. I wrote it. It’s still dog food until I say otherwise. Even as I write this, I’m copying files from the old setup to the new one based on pretty vanilla GlusterFS plus the at-rest encryption translator from CloudFS, all mounted on my desktop at work. I’ll have to run that through an ssh tunnel for now – until I finish writing the SSL translator – to deal with authentication and in-flight encryption issues. Similarly, I need to finish the ID-mapping translator before I could recommend this for use by more than one person per machine. With those caveats, though, I’d still say that the result is usable and secure enough for my own purposes (including compliance with Red Hat’s infosec policies). If anybody else is interested in getting on the “personal CloudFS” bandwagon, I’ll post some detailed instructions on how you can do this yourself.
Extended attributes are one of the best kept secrets in modern filesystems. Here you have a fully general feature to attach additional information to files, supported by most modern filesystems, and yet hardly anybody seems to use it. As it turns out, though, GlusterFS uses extended attributes – xattrs – quite extensively to do almost all of the things that it does from replication to distribution to striping. Because they’re such a key part of how GlusterFS works, and yet so little understood outside of that context, xattrs have become the subject of quite a few questions which I’ll try to answer. First, though, let’s just review what you can do with xattrs. At the most basic level, an xattr consists of a string-valued key and a string or binary value – usually on the order of a few bytes up to a few dozen. There are operations to get/set xattrs, and to list them. This alone is sufficient to support all sorts of functionality. For example, SELinux security contexts and POSIX ACLs are both stored as xattrs, with the underlying filesystems not needing to know anything about their interpretation. In fact, I was just dealing with some issues around these kinds of xattrs today . . . but that’s a story for another time.
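To make the get/set/list operations concrete, here's a minimal Python sketch. It assumes a filesystem that supports unprivileged `user.*` xattrs (some filesystems don't, which is why it's wrapped in a try/except); the `user.comment` key and value are just examples.

```python
import os
import tempfile

def xattr_round_trip():
    """Set, list, and get back a user xattr on a temp file.

    Returns the value read back, or None if the underlying
    filesystem doesn't support user xattrs.
    """
    with tempfile.NamedTemporaryFile() as f:
        try:
            # Keys are namespaced strings; values are small byte strings.
            os.setxattr(f.name, "user.comment", b"hello xattrs")
            names = os.listxattr(f.name)
            assert "user.comment" in names
            return os.getxattr(f.name, "user.comment")
        except OSError:
            # e.g. ENOTSUP: this filesystem has no xattr support
            return None

print(xattr_round_trip())
```

That's the entire programming model: a tiny key/value store hanging off every inode, which is exactly what makes xattrs so handy for the tricks described below.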
The sneaky bit here is that the act of getting or setting xattrs on a file can trigger any kind of action at the server where it lives, with the potential to pass information both in (via setxattr) and out (via getxattr). That amounts to a form of RPC which components at any level in the system can use without requiring special support from any of the other components in between, and this trick is used extensively throughout GlusterFS. For example, the rebalancing code uses a magic xattr call to trigger recalculation of the “layouts” that determine which files get placed on which servers (more about this in a minute). The “quick-read” translator uses a magic xattr call to simulate an open/read/close sequence – saving two out of three round trips for small files. There are several others, but I’m going to concentrate on just two: the trusted.glusterfs.dht xattr used by the DHT (distribution) translator, and the trusted.afr.* xattrs used by the AFR (replication) translator.
The way that DHT works is via consistent hashing, in which file names are hashed and the hashes looked up in a table where each range of hashes is assigned exactly one “brick” (volume on some server). This assignment is done on each directory when it’s created. Directories must exist on all bricks, with each copy having a distinct trusted.glusterfs.dht xattr describing what range of hash values it’s responsible for. This xattr contains the following (all 32-bit values):
- The count of ranges assigned to the brick (for this directory). This is always one currently, and other values simply won’t work.
- A format designator for the following ranges. Currently zero, but this time it’s not even checked so it doesn’t matter.
- For each range, a starting and ending hash value. Note that there’s no way in this scheme to specify a zero-length range, nor can ranges “wrap around” from 0xffffffff to 0.
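Given that layout, decoding such an xattr value is straightforward. Here's a sketch in Python; the field order follows the list above, and I'm assuming network byte order for the 32-bit values (an implementation detail I haven't spelled out here, so treat it as an assumption).

```python
import struct

def parse_dht_xattr(raw: bytes):
    """Decode a trusted.glusterfs.dht value per the layout above:
    a 32-bit range count, a 32-bit format designator, then a
    (start, end) pair of 32-bit hash values per range."""
    count, fmt = struct.unpack_from(">II", raw, 0)
    ranges = []
    for i in range(count):
        start, end = struct.unpack_from(">II", raw, 8 + 8 * i)
        ranges.append((start, end))
    return count, fmt, ranges

# A brick responsible for the first half of the hash space:
raw = struct.pack(">IIII", 1, 0, 0x00000000, 0x7FFFFFFF)
print(parse_dht_xattr(raw))
```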
When a directory is looked up, therefore, all the code needs to do is collect these xattrs and combine the ranges they contain into a table. There’s also code to look for gaps and overlaps, which seem to have been quite a problem lately. It doesn’t take long to see that there are some serious scalability issues with this approach, such as the requirement for directories to exist on every brick or the need to recalculate xattrs on every brick whenever new bricks are added or removed. I have to address these issues, but for now the scheme works pretty well.
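The range-combining step, including the gap/overlap check, can be sketched like this. `build_layout` is a hypothetical name and this is a simplified model of what the real code does, not GlusterFS source.

```python
def build_layout(brick_ranges):
    """Combine each brick's (start, end) hash range into one table,
    reporting gaps and overlaps across the 32-bit hash space.
    brick_ranges: dict of brick name -> (start, end), inclusive."""
    table = sorted((s, e, b) for b, (s, e) in brick_ranges.items())
    problems = []
    expected = 0  # next hash value that should be covered
    for start, end, brick in table:
        if start > expected:
            problems.append(("gap", expected, start - 1))
        elif start < expected:
            problems.append(("overlap", start, expected - 1))
        expected = end + 1
    if expected <= 0xFFFFFFFF:
        problems.append(("gap", expected, 0xFFFFFFFF))
    return table, problems

# Two bricks covering the hash space cleanly:
ok = {"brick0": (0x00000000, 0x7FFFFFFF),
      "brick1": (0x80000000, 0xFFFFFFFF)}
print(build_layout(ok)[1])   # []
```

Note how the "no wrap-around" restriction from the xattr format shows up here: coverage has to run monotonically from 0 to 0xffffffff.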
The most complicated usage of xattrs is not in DHT but in AFR. Here, the key is the trusted.afr.* xattrs, where the * can be the name of any brick in the replica set other than the one where the xattr appears. Huh? Well, let’s say you have an AFR volume consisting of subvolumes test1-client-0 (on server0) and test1-client-1 (on server1). A file on test1-client-0 might therefore have an xattr called trusted.afr.test1-client-1. The reason for this is that the purpose of AFR is to recover from failures. Therefore, the state of an operation can’t just be recorded in the same place where a failure can wipe out both the operation and the record of it. Instead, operations are done one place and recorded everywhere else. (Yes, this is wasteful when there are more than two replicas; that’s another thing I plan to address some day). This information is stored as “pending operation” counts, with each xattr holding a 32-bit counter for each of three different kinds of operations:
- Data operations – mostly writes but also e.g. truncates
- Metadata operations – e.g. chmod/chown/chgrp, and xattrs (yes, this gets recursive)
- Namespace operations – create, delete, rename, etc.
Whenever a modification is made to the filesystem, the counters are updated everywhere first. In fact, GlusterFS defines a few extra xattr operations (e.g. atomic increments) just to support AFR. Once all of the counters have been incremented, the actual operation is sent to all replicas. As each node completes the operation, the counters everywhere else are decremented once more. Ultimately, all of the counters should go back to zero. If a node X crashes in the middle, or is unavailable to begin with, then every other replica’s counter for X will remain non-zero. This state can easily be recognized the next time the counters are fetched and compared – experienced GlusterFS users probably know that a stat() call will do this. The exact relationships between all of the counters will usually indicate which brick is out of date, so that “self-heal” can go in the right direction. The most fearsome case, it should be apparent, is when the xattrs for the same file/directory on two bricks have non-zero counters for each other. This is the infamous “split brain” case, which can be difficult or even impossible for the system to resolve automatically.
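The increment-everywhere/decrement-as-completed protocol can be modeled in a few lines of Python. This is an illustrative simulation of one data operation, not GlusterFS code; the brick names match the example earlier.

```python
# Simulate AFR's pending-counter protocol for one operation
# across a replica set.

def afr_write(replicas, pending, completes):
    """replicas: list of brick names.
    pending[b][other]: count held on brick b for operations that
    brick 'other' may not have seen (i.e. trusted.afr.<other> on b).
    completes: the set of bricks that actually finish the write."""
    # Phase 1: increment the counters everywhere first.
    for b in replicas:
        for other in replicas:
            if other != b:
                pending[b][other] += 1
    # Phase 2: as each brick completes, decrement its counter
    # everywhere else.
    for done in completes:
        for b in replicas:
            if b != done:
                pending[b][done] -= 1

bricks = ["test1-client-0", "test1-client-1"]
pending = {b: {o: 0 for o in bricks if o != b} for b in bricks}

# Suppose test1-client-1 crashes mid-write: test1-client-0 is left
# holding a non-zero counter *for* test1-client-1, flagging it as
# out of date and in need of self-heal.
afr_write(bricks, pending, completes={"test1-client-0"})
print(pending)
```

Running the simulation with both bricks completing would leave every counter back at zero; running it twice with crossed failures is the "split brain" state described above.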
Those are the two most visible uses of xattrs in GlusterFS but, as I said before, there are others. For example, trusted.gfid is a sort of generation number used to detect duplication of inode numbers (because DHT’s local-to-global mapping function is prone to such duplication whenever the server set changes). My personal favorite is trusted.glusterfs.test which appears (with the value “working”) in the root directory of every brick. This is used as a “probe” to determine whether xattrs are supported, but then never cleaned up even after the probe has yielded its result. The result of all this xattr use and abuse is, of course, a confusing plethora of xattrs attached to everything in a GlusterFS volume. That’s why it’s so important when saving/restoring to use a method that handles xattrs properly. Hopefully I’ve managed to show how crucial these little “extra” bits of information are, and perhaps given people some ideas for how to spot or fix problems related to their use.
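As an aside, a tidier version of that probe trick is easy to write yourself. This sketch uses the unprivileged `user.*` namespace (setting `trusted.*` xattrs requires CAP_SYS_ADMIN) and, unlike the probe described above, cleans up after itself; the key name is made up.

```python
import os
import tempfile

def xattrs_supported(path: str) -> bool:
    """Probe whether the filesystem holding 'path' supports xattrs,
    by setting a throwaway key and checking that it sticks."""
    key = "user.xattr-probe"
    try:
        os.setxattr(path, key, b"working")
        supported = os.getxattr(path, key) == b"working"
        os.removexattr(path, key)   # tidy up, unlike trusted.glusterfs.test
        return supported
    except OSError:
        return False

with tempfile.NamedTemporaryFile() as f:
    print(xattrs_supported(f.name))
```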
Ahhh, you have to love the way user IDs are handled in Linux. Back in the old UNIX days a process just had one UID. Then it had separate effective and real UIDs. Then sometime while I wasn’t looking it grew a saved-set-uid, probably as an ugly solution to a simple problem with real and effective UIDs that I remember finding at Encore back in 1990. Then Linux came along and added fsuid so that file servers could take on a user’s identity for filesystem operations. Not a bad idea, really, but it sure would have been nice if the people who added setfsuid and setfsgid had thought through the problem enough to add setfsgroups as well . . . but that’s a different discussion. The confusing thing about setfsuid is that it’s not clear whether it has its effect at process or thread level. I’ve seen plenty of claims both ways, but a slight majority seem to think it’s process level. In other words, if you have two threads that are handling requests for two different users, you’d have to serialize their use of setfsuid to make sure each one operated under the correct (assumed) identity. That seemed really broken to me, but it was the most common interpretation, so instead of trusting random sources on the web I wrote a test program. This creates a file with restricted permissions, then has one child thread do setfsuid to a bogus identity, then has a second child thread try to open the file. Here’s the result:
parent waking first child
kid1 (27054) setting fsuid to 9876
kid1 (27054) creating .childfile
file created OK
parent waking second child
kid2 (27055) trying to open .parentfile
file opened, setfsuid must be per-thread
test complete
That seems pretty conclusive. If setfsuid had its effect at process level, the open would have failed. The test program also has the first thread create a new file, which does show up with the bogus owner, so clearly the setfsuid call had the right effect on that thread even though it didn’t affect the sibling thread. This is good. This is what I want, and probably what every other developer would want, so why would anyone have thought setfsuid worked at process level? I’d guess it’s partly because many people in the Linux community have worked very hard to obfuscate the relationships between forked processes, cloned processes, and threads as much as possible. I did actually check the kernel code, and the fsuid is stored as part of the “task_struct” which most people would intuitively associate with a process . . . but with cloning it’s actually closer to a thread. It’s also possible that there were bugs at one time that caused setfsuid to affect the whole process even though it was never supposed to. In any case, the actual answer seems to be the right one. Hopefully, anybody else who wonders can come here and use the test program to check for themselves.
I apologize to my readers at Red Hat for the interruption in access to the site. This site is co-hosted with my personal site, and it turns out that I did something while uploading files for an article there that triggered an automatic IP-address block from my host. Since all of Red Hat has one IP address as far as they’re concerned, that affected all machines at the company. I’d like to stress that this is not GlowHost’s fault. Their concern for security is in fact commendable, and they have been as helpful as anyone could wish for in resolving this. If anyone’s at fault it’s me, though I will say that I’m not entirely on board with the idea of running an entire company this size behind a single-address NAT.
Mostly, as I’ve said in subsequent conversation with folks at GlowHost, the fault lies with the abysmal state of technology in this area. Really, I should be able to manipulate files within my hosting account as easily as I do local files, with decent authentication and encryption and . . . hey. This is starting to sound a lot like the primary use case for CloudFS. I don’t really think CloudFS would be a fit here, especially since GlowHost is probably not using GlusterFS, but the features that are being developed for the first release are exactly those that would allow me to reach out from a CloudFS client on my desktop to one or more CloudFS servers at one or more companies with whom I’ve contracted to provide a service, while retaining full security for both of us. I usually think of those companies as being cloud providers, but they could just as easily be hosting providers instead. Aren’t they all trying to re-brand themselves as cloud providers anyway?
So apparently Dropbox might have a security problem. Someone who has access to your machine can apparently grab a config file that confers access to your Dropbox data, and that access can continue from other machines despite password changes. That last bit is important, and seems to be what many commentators on “Spews for Herds” web sites are missing. If a machine M is already connected to a resource R, then R will be available to anyone with access to M as long as that access lasts, unless periodic re-authentication (or re-connection) is required. Such re-authentication can be burdensome enough that many people are willing to let it be infrequent or forego it altogether, so I’m not going to fault Dropbox for that. On the other hand, if an attacker might retain access to your Dropbox account even after you get them off your machine and change your password, that’s a whole different kettle of fish. The reports I’ve seen are a bit unclear, but here are several possible interpretations.
- Your Dropbox password is not used in any way that actually confers protection, and is essentially useless. This would be fraud on Dropbox’s part.
- Your Dropbox password is only used when fetching a hostid and there’s no check for duplicate connections from the same hostid. Therefore, once you have the hostid – however you acquired it – your access can no longer be interrupted. This would be incredibly sloppy on Dropbox’s part.
- Your Dropbox password is only used when you first connect, so existing connections are unaffected. In other words, you’d better change the password before your unwelcome visitor establishes a connection from somewhere else. This is almost as sloppy as the previous possibility, but not quite.
- Your Dropbox password is in fact used for periodic re-authentication, and/or periodic checks are made for duplicate hostids, but Derek Newton didn’t wait long enough to see that before rushing out to sound the alarm. This would be kind of sad, but I think it fits into the realm of understandable error.
- Dropbox is in fact reasonably secure and there’s some other reason why Derek Newton’s analysis is flawed.
- Derek Newton knew there was no problem, but made a fake report anyway. This would be fraud on Derek’s part, but I’m including this only for completeness. I don’t know Derek from a random coffee bean on the floor at Starbucks, and have no reason to believe he’d be even remotely capable of such a thing.
We don’t know yet which of these scenarios will actually turn out to represent the truth. The real point, though, is that the differences mostly have to do with how Dropbox is using those passwords, keys, hostids, or whatever other security-related artifacts are involved. None of us know that, and that’s the real problem. You see, I’ve had to think through a lot of the exact same issues for CloudFS. One conclusion I reached very early is that you shouldn’t need to know or care whether your cloud-storage provider is competent, diligent, or trustworthy enough to be keeping your data safe. All of the keys and similar resources needed to access your data should be under your control, so that you can prevent access to your data by forbidding access to the keys at your end. Even if Dropbox is actually secure, users around the world might well wonder whether that has always been true or will continue to be true. They shouldn’t have to wonder. An architecture that would allow them to have peace of mind even when reports like this come out is harder to implement – and a lot harder to implement without significant sacrifices in performance – but it’s necessary.
UPDATE 2011-Apr-19: Miguel de Icaza points out another problem with Dropbox security. As I said on Twitter back when this was originally posted, if a storage provider has your keys then they don’t have deniability, and that’s bad for both of you.
There might not be anyone in the industry who’s more tired of that question than I am, so please stop rolling your eyes already. When I started interviewing for a “cloud filesystem developer” job at Red Hat, I naturally asked people what they thought the phrase might mean. Nobody would even give me a hint, and I don’t think it’s because they didn’t have any ideas; they just wanted me to figure it out. I think I’ve figured out one answer, which is why there is this site where you’re reading this post, but the problem is that there are multiple valid answers. Hardly a week goes by when I don’t have to make the following distinction.
- A filesystem supporting the cloud. In this scenario, the “cloud filesystem” is providing resources to the cloud infrastructure itself in the form of virtual block devices (including boot devices) stored as files using loopback/virtio/iSCSI. A lot of people are very interested in this, but I have to admit it’s not really what gets me going personally.
- A filesystem in the cloud. This is the CloudFS I’m working on – a filesystem that is available as a service within the cloud, visible as a filesystem to users but deployed externally rather than within the user’s own compute instances. You can read everything else on the site for more about that.
That brings me to why I’m writing this today. Over at Nasuni, Andres Rodriguez posted about Designing a File System for the Cloud and it was mentioned on Twitter. Mere minutes later, somebody else I follow there mentioned Oracle’s Cloud Filesystem (which was actually announced a little over a month ago). This led to the following exchange:
Jesse’s right, of course, and I’ll get back to that in a moment, but first I have to deal with this other thing. Oracle’s pseudo-offering is just a blatant case of “cloudwashing” – wrapping something old and fugly in “cloud” verbiage – and I’m not the first to notice. I mean, come on, it’s just an existing bit of technology (ACFS) with a new name, totally a play for data centers where machines are all part of the same shared-storage and identity domains. How do multi-tenancy and “pay for what you use” apply to something that you could only conceivably run on hardware you own and control? Is the NIST definition of “elasticity” (which Oracle cites) really satisfied by the “ability to add disks as data volumes increase”? How very uncloudy.
So, back to Jesse and Nasuni. I think what they’re doing is very worthwhile. I have often cited them in my talks as one of the few cloud-storage companies who are getting security anywhere near right. Recruiters call me about Nasuni multiple times a week, and they’re absolutely right to do so (even though I’m not interested right now, thanks). It’s a fine company and a fine product. I just don’t think of it as a cloud filesystem. I don’t mean that as a negative, though. I don’t think adding “cloud” makes everything better, so I don’t think a cloud filesystem is the epitome of what every parallel/cluster/distributed filesystem should aspire to be. It’s just different. If you look at the two potential meanings of “cloud filesystem” above, Nasuni doesn’t match either one. Maybe “a filesystem that’s connected to a cloud” or perhaps even “a filesystem that’s integrated with a cloud” is a perfectly valid interpretation of “cloud filesystem” (I’m sure Jesse and others at Nasuni would say so) but it’s just not an interpretation I personally have in my head when I’m thinking or talking about such things.
Maybe we need different terms for all of these subtly different things. Cloud-infrastructure filesystems and cloud-service filesystems and cloud-gateway filesystems are all wonderful things, each in their own way, but they have drastically different requirements regarding performance and security and a whole bunch of other criteria. Hanging the same term on all of them will only lead to time wasted comparing things that really aren’t comparable, disparaging apples for not being oranges. That’s why seeing “cloud filesystem” applied to other things which until now have gone by other names kind of bothers me.