Title: Comparing the Hadoop Distributed File System (HDFS) with the Cassandra File System (CFS)
Description: The Hadoop Distributed File System (HDFS) is one of many components and projects within the community Hadoop™ ecosystem. The Apache Hadoop project defines HDFS as “the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.”
Hadoop uses a scale-out architecture built from commodity servers configured as a cluster, where each server has inexpensive internal disk drives. As the Apache Project’s site states, data in Hadoop is broken down into blocks and spread throughout a cluster. Once that happens, MapReduce tasks can be carried out in parallel on these smaller subsets of what may be a very large dataset overall, which provides the type of scalability needed for big data processing.
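The block-and-replica idea can be illustrated with a toy model. The following Python sketch is not Hadoop code and uses no Hadoop API; the function names, the round-robin placement rule, the node names, and the tiny block size are all assumptions made for illustration. Real HDFS splits files by bytes into much larger blocks and applies its own placement policy.

```python
from collections import Counter


def split_into_blocks(records, block_size):
    """Cut a dataset into fixed-size blocks (here: a number of records per block)."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes.

    Simple round-robin placement, assumed for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def map_block(block):
    """Map step: count words within a single block, independently of other blocks."""
    counts = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-block counts into one overall result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


records = [
    "big data needs big clusters",
    "clusters store data in blocks",
    "blocks are replicated",
    "replicated blocks survive failures",
]

blocks = split_into_blocks(records, block_size=2)
placement = place_replicas(len(blocks), ["node-a", "node-b", "node-c"], replication=2)
totals = reduce_counts(map_block(block) for block in blocks)
```

Because each block is processed on its own, the map step could run on whichever node holds a replica of that block, and losing one node still leaves another copy of every block it stored. Merging the per-block counts gives the same result as counting over the whole dataset in one pass.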
In general, this divide-and-conquer strategy of processing data is not new, but the combination of HDFS being open source software (which overcomes the need for high-priced specialized storage solutions) and its ability to carry out some degree of automatic redundancy and failover makes it popular with modern businesses looking for batch analytics solutions. This is just one reason why the Hadoop market is expected to grow at an eye-popping compound annual growth rate (CAGR) of 58 percent through 2018.
However, what these businesses are most interested in is not Hadoop’s underlying storage structure, but rather what that structure makes possible: a cost-effective means of analyzing and processing vast amounts of data. Being able to make decisions from the output of MapReduce, Hive, Pig, Mahout, and other operations is what matters most to these organizations.
This paper explains how MapReduce, Hive, Pig, and Mahout tasks may be run directly on DataStax Enterprise Cassandra databases that service online applications and how such functionality differs from traditional Hadoop installations that target data warehouse/data lake use cases.