My previous post on this subject seems to have attracted a lot of interest. I got links from Phoronix, Heise (German), and most of all from my friends at Gluster. I was particularly amused by the Gluster response, because usually AB is the good cop and I’m the bad cop, but this time he seems to have taken a more aggressive position than me. I had been thinking of writing a follow-up, and had put it off, but then as I composed a reply to a comment on the previous article I realized that I had already covered most of the points I’d intended to make. Being the lazy sort that I am, I’m just going to re-post the comment and my reply here. First, here’s the challenge from P.B. Shelley.
How about admitting that with FUSE, data has to be copied to kernel, then your user space component, and then back to kernel to write to disk? If you claim that this overhead can result in better performance, please PROVE it, instead of citing how many successful user space filesystems you have out there. All of them can perform better if they live in the kernel!
Here’s my response to that challenge.
You’re missing the point on several levels, P.B. I already mentioned the issue of extra data copies, but I also made the point that this overhead doesn’t relegate all user-space file systems to toys. Let’s look at some of the reasons why.
(1) The copies you mention are artifacts of the Linux FUSE implementation, and are not inherent to user-space file systems in general. Other systems do this more intelligently. PVFS2 does it more intelligently *on Linux*. With RDMA, communication could go directly from application to application, without even the overhead of going through the kernel. FUSE itself could be more efficient if resistance from the kernel grognards and their sycophants could be overcome. Even if one could make the case that file systems based on FUSE as it exists today are all toys, Linus’s statement *as he made it* would still be untrue. (There’s a minimal FUSE sketch after this list that shows exactly where those extra copies happen.)
(2) The copies don’t matter in many environments, especially in distributed systems. If your system is network-, memory-, or I/O-bound anyway, whether that’s because of provisioning or algorithms, then the copies are only consuming otherwise-idle CPU cycles. This is especially true since most systems sold today are way out of balance in favor of CPU/memory over network or disk I/O. (Some back-of-envelope numbers follow the list.)
(3) There’s an important distinction between latency and throughput. The FUSE overheads mostly affect latency. If latency is your chief concern, then you probably shouldn’t be using any kind of distributed file system, regardless of whether it’s in the kernel or in user space. If throughput is your chief concern, which is the more common case, you need a system that allows you to aggregate the power of many servers without hitting algorithmic limits. Such systems are hard enough to scale and debug already, without the added difficulty of putting them into the kernel prematurely. I’m not against putting code in the kernel *when all of the algorithms are settled*, but projects can go well beyond “toy” status long before that.
(4) There are concerns besides performance. There are bazillions of libraries that one can use easily from user space. Many of them cannot and should not ever be reimplemented in the kernel, simply because that would bloat the kernel beyond belief. In some cases there would be other serious implications, such as a kernel-ported security library losing its certification in the process.
(5) Results from actual field usage trump synthetic micro-benchmarks any day, and either trumps empty theorizing like yours. If Argonne and Pandora and dozens of others can use PVFS2 and GlusterFS and HDFS for serious work, then they’re not toys. The point is already proven. End of story.
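To make point (1) concrete, here’s a minimal sketch of a trivial FUSE file system using the high-level libfuse 2.x API, loosely patterned on the “hello world” example that ships with FUSE. The file name and contents are made up for illustration; nothing here comes from PVFS2 or GlusterFS or any other real project. The part worth staring at is the read handler: the application’s read() goes into the kernel, the FUSE module copies the request up to this process, the handler fills in the buffer, and the kernel copies the result back down to the original caller. Those are the extra hops P.B. is complaining about, and they belong to this particular interface, not to user-space file systems as a category.

```c
/* hellofs.c: minimal FUSE example (libfuse 2.x high-level API).
 * Illustrative only; the file name and contents are invented. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>

static const char *hello_path = "/hello";
static const char *hello_str  = "Hello from user space!\n";

/* stat(2) on any path in the mount ends up here, via the FUSE kernel module. */
static int hello_getattr(const char *path, struct stat *st)
{
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode  = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(path, hello_path) == 0) {
        st->st_mode  = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size  = strlen(hello_str);
    } else {
        return -ENOENT;
    }
    return 0;
}

static int hello_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                         off_t offset, struct fuse_file_info *fi)
{
    (void) offset;
    (void) fi;
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    filler(buf, ".", NULL, 0);
    filler(buf, "..", NULL, 0);
    filler(buf, hello_path + 1, NULL, 0);
    return 0;
}

static int hello_open(const char *path, struct fuse_file_info *fi)
{
    if (strcmp(path, hello_path) != 0)
        return -ENOENT;
    if ((fi->flags & O_ACCMODE) != O_RDONLY)
        return -EACCES;
    return 0;
}

/* Every read(2) on the mounted file lands here.  The kernel has already
 * copied the request up to this process; we fill `buf`, and the kernel
 * copies the data back down to the caller.  These are the extra copies
 * under discussion; they're a property of the FUSE interface, not of
 * user-space file systems in general. */
static int hello_read(const char *path, char *buf, size_t size, off_t offset,
                      struct fuse_file_info *fi)
{
    size_t len;
    (void) fi;
    if (strcmp(path, hello_path) != 0)
        return -ENOENT;
    len = strlen(hello_str);
    if (offset >= (off_t) len)
        return 0;
    if (offset + size > len)
        size = len - offset;
    memcpy(buf, hello_str + offset, size);
    return (int) size;
}

static struct fuse_operations hello_ops = {
    .getattr = hello_getattr,
    .readdir = hello_readdir,
    .open    = hello_open,
    .read    = hello_read,
};

int main(int argc, char *argv[])
{
    return fuse_main(argc, argv, &hello_ops, NULL);
}
```

On a typical Linux box, something like `gcc hellofs.c $(pkg-config fuse --cflags --libs) -o hellofs` should build it, `./hellofs /some/mountpoint` mounts it, and `cat /some/mountpoint/hello` exercises exactly the copy path described above.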
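And to put some rough, purely illustrative numbers on point (2): gigabit Ethernet tops out around 125 MB/s on the wire, and rather less after protocol overhead, while a single core on commodity hardware can copy memory at several GB/s; call it 6 GB/s for the sake of argument. Under those assumptions, an extra copy of data arriving at full wire speed costs about 125/6000, or roughly 2%, of one core. That’s the “otherwise-idle CPU” I’m talking about. The exact figures shift with 10GbE or InfiniBand, but so does the rest of the provisioning.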
The real point here is that user-space file systems might not be better than kernel file systems in the terms that people like Linus and P.B. care about. I don’t think anybody has claimed that they were. However, they can be better in other ways. The relative ease of developing and integrating user-space file systems in already-challenging environments, and of hiring developers to work on them, cannot be lightly dismissed. The user-space file system that’s finished beats the kernel file system that remains mired in pre-release debugging. Most of the algorithms that underlie modern distributed file systems, including kernel-based ones such as pNFS or Ceph, were developed in user space first. Often, the user-space prototype turned out to be complete enough and fast enough for some real-life purpose that putting it in the kernel was no longer worth the effort.
For your root file system, you should probably go with a traditional kernel-based write-in-place file system (not a copy-on-write file system, because those suffer in much the same latency and CPU-usage terms that user-space file systems do). For data, if latency is your primary concern, you’re not hitting other limits before you hit CPU limits, and you can’t be bothered fixing the interfaces, then by all means develop your fancy new file system in the kernel. If you’re more concerned about throughput, or your primary constraints are network-related, or you’re willing to use or implement some interface besides FUSE, then maybe there are wiser choices you could make.