Daniel Abadi described his blog entry about Hadoop connectors as a “Stonebraker-style rant” and then delivered on the threat. Like everything Stonebraker has written in the last five years, it’s based on a fundamentally flawed premise, which is that HDFS stores unstructured data. This assumption is not clearly stated, but it’s pretty clear from context, e.g.
primary storage in Hadoop (HDFS) is a file system that is optimized for unstructured data
Among storage folks, “unstructured” means stuff that is written using any of the millions of applications of the last thirty years that write their data using the storage API that’s built into every operating system – i.e. files. HDFS does not store unstructured data. If it did, there would be no need to import true unstructured data into HDFS. HDFS is not quite a real file system, usable by a significant percentage of applications. It’s designed to support Hadoop itself and that’s pretty much all it does with any degree of competence. Can you extract and build a Linux kernel source tree in HDFS? Even if you could, how good do you think a system designed around management of 128MB chunks work for that? (BTW yes, I can think of ways to analyze this kind of data using Hadoop, so it’s not an entirely silly example. Use your imagination.)
Going back to Daniel’s argument, RDBMS-to-Hadoop connectors are indeed silly because they incur a migration cost without adding semantic value. Moving from one structured silo to another structured silo really is a waste of time. That is also exactly why filesystem-to-Hadoop connectors do make sense, because they flip that equation on its head – they do add semantic value, and they avoid a migration cost that would otherwise exist when importing data into HDFS. Things like GlusterFS’s UFO or MapR’s “direct access NFS” decrease total time to solution vs. the HDFS baseline. I’d put DataStax’s Brisk into the same category, even though the storage it provides is structured underneath, because it also works “inline” like the native file system alternatives instead of requiring an import phase like HDFS. The fact that you can also use CassandraFS to work with file-based applications is just icing on the cake.
So, are Hadoop adapters a good idea or a bad one? It depends on what kind you’re talking about. Daniel and I can agree that ETL-style RDBMS connector doesn’t make any sense at all, but I believe that an inline unstructured-data connector is a different story altogether.