Comparing the Hadoop Distributed File System (HDFS) with the Cassandra File System (CFS)

The Hadoop Distributed File System (HDFS) is one of many different components and projects contained within the community Hadoop™ ecosystem. The Apache Hadoop project defines HDFS as: “the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.”

Hadoop utilizes a scale-out architecture that makes use of commodity servers configured as a cluster, where each server possesses inexpensive internal disk drives. As the Apache Project’s site states, data in Hadoop is broken down into blocks and spread throughout a cluster. Once that happens, MapReduce tasks can be carried out on the smaller subsets of data that may make up a very large dataset overall, thus accomplishing the type of scalability needed for big data processing.
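
The block-and-replica idea described above can be sketched in a few lines of Python. This is only an illustration, not Hadoop code: the function names, the record-based block size, and the round-robin replica placement are all assumptions made for the example (real HDFS splits files by bytes and uses its own placement policy). It shows a dataset being cut into blocks, each block being assigned to several nodes, and a word count being computed per block and then merged, in the spirit of MapReduce.

```python
from collections import Counter
from functools import reduce


def split_into_blocks(records, block_size):
    """Group records into fixed-size blocks (assumption: sized by record count, not bytes)."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (assumption: simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def map_block(block):
    """Map step: count words within a single block only."""
    return Counter(word for line in block for word in line.split())


def reduce_counts(partial_counts):
    """Reduce step: merge the per-block counts into one total."""
    return reduce(lambda left, right: left + right, partial_counts, Counter())


# Hypothetical sample data and node names, for illustration only.
records = ["big data", "big cluster", "data block", "block replica", "big block"]
nodes = ["node1", "node2", "node3", "node4"]

blocks = split_into_blocks(records, block_size=2)
placement = place_replicas(len(blocks), nodes, replication=3)
totals = reduce_counts(map_block(block) for block in blocks)

for block_id, replica_nodes in placement.items():
    print(f"block {block_id} -> {replica_nodes}")
print(totals.most_common())
```

Because every block lives on more than one node in this sketch, the map step for a block could run on any node that holds a copy, which is the property that lets a cluster tolerate the loss of a single server.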

In general, this divide-and-conquer strategy for processing data is not new. Two things make HDFS popular with modern businesses looking for batch analytics solutions: it is open source software, which removes the need for high-priced specialized storage solutions, and it provides some degree of automatic redundancy and failover. This is just one reason why the Hadoop market is expected to grow at an eye-popping compound annual growth rate (CAGR) of 58 percent until 2018.

However, what these businesses are most interested in is not Hadoop’s underlying storage structure, but rather what it helps deliver: a cost-effective means of analyzing and processing vast amounts of data. Being able to make decisions from the output of MapReduce, Hive, Pig, Mahout, and other operations is what matters most to these organizations.

Not surprisingly, a variety of vendors offer alternatives to HDFS, with a recent article by GigaOM supplying a brief survey of the most popular options. This paper provides a high-level overview of how Apache Cassandra™ can be used to replace HDFS, with no programming changes required from a developer perspective, and how a number of compelling benefits can be realized in the process.