
Hadoop Vs. Cassandra

Contrasting Hadoop & Apache Cassandra
Apache Cassandra is a NoSQL database ideal for high-speed, online transactional data, while Hadoop is a big data analytics system that focuses on data warehousing and data lake use cases.

What is Hadoop?

Apache Hadoop, an Apache Software Foundation project, is a big data analytics framework that focuses on near-real-time and batch-oriented analytics of historical data. Hadoop runs analytics on high volumes of historical and line-of-business data using commodity hardware.

There are four fundamental components that make up Hadoop:

  • Hadoop Distributed File System (HDFS) is a distributed file system that looks like any other file system, except that when you move a file onto HDFS, the file is split into blocks, and each block is replicated and stored on several servers (3 by default, configurable) for fault tolerance.
  • MapReduce is a programming paradigm for processing large data sets. It splits a job into smaller tasks, which are sent to many servers and processed in parallel. As a result, very large data sets can be processed quickly.
  • Common/Core is a package containing the libraries and utilities that support the other Hadoop modules.
  • YARN is a resource management platform that manages computing resources and schedules Hadoop tasks.
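The two ideas behind HDFS and MapReduce can be sketched in a few lines of plain Python. This is a toy, single-process illustration and not Hadoop's API: the function names (`split_into_blocks`, `place_replicas`, `word_count`) and the round-robin replica placement are assumptions made for the example, whereas real HDFS uses much larger blocks and rack-aware placement.

```python
from collections import defaultdict


def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, servers: list[str], replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct servers.

    Round-robin placement is an assumption for illustration only.
    """
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        block: [servers[(block + r) % len(servers)] for r in range(replication)]
        for block in range(num_blocks)
    }


def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle step: group all emitted values by key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(values: list[int]) -> int:
    """Reduce step: combine all values for one key into a single result."""
    return sum(values)


def word_count(chunks: list[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce over the chunks.

    In Hadoop each chunk would be mapped on a different server in parallel;
    here the chunks are processed one after another.
    """
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return {key: reduce_phase(values) for key, values in shuffle(pairs).items()}


if __name__ == "__main__":
    blocks = split_into_blocks(b"abcdefgh", block_size=3)
    print(blocks)
    print(place_replicas(len(blocks), ["s1", "s2", "s3", "s4"]))
    print(word_count(["the cat", "the dog the"]))
```

Because each map call only sees its own chunk and each reduce call only sees one key, the work can be spread across many machines without coordination beyond the shuffle step.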

How does Cassandra complement Hadoop?

As with legacy relational database applications, modern Web, mobile and IoT applications typically need both a database devoted to online operations (including analytics on hot data) and a batch-oriented data warehouse environment that processes colder data for analytic purposes.

Apache Cassandra is a perfect database choice for online Web and mobile applications, whereas Hadoop targets the processing of colder data in data lakes, warehouses, etc. Using the two together allows an IT organization to effectively support the different analytic “tempos” needed to satisfy customer requirements and run the business.

What about HBase?

HBase is an open source, NoSQL, distributed database written in Java and modeled after Google's BigTable. It is part of the Apache Hadoop ecosystem and runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop.

HBase is designed to support data warehouse and data lake use cases and is not typically used for distributed Web and mobile applications that need a high-performance online database.

How does Cassandra compare to HBase?

HBase is sometimes used for an online application simply because a Hadoop implementation already exists at a site, not because it is the right fit for the application. HBase is typically not a good choice for developing always-on online applications and is roughly 2-3 years behind Cassandra in many technical respects.

In comparison to HBase, Cassandra supplies:

  • Higher performance
  • True continuous, “always on” availability with no single point of failure
  • Powerful and easy multi-data center / cloud availability zone support
  • A simpler architecture (masterless) with easier setup and fewer requirements
  • Easier development (CQL, a SQL-like query language, and more)
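To make the "masterless" point in the list above concrete, here is a minimal sketch of a consistent-hashing token ring, the general technique Cassandra uses to decide which nodes own a piece of data. It is a simplification under stated assumptions: the `TokenRing` class and its methods are hypothetical names for this example, MD5 stands in for Cassandra's default Murmur3 partitioner, and virtual nodes and data-center-aware replication strategies are left out.

```python
import bisect
import hashlib


class TokenRing:
    """Toy token ring: any node can compute a key's replicas without a master."""

    def __init__(self, nodes: list[str], replication_factor: int = 3) -> None:
        self.replication_factor = replication_factor
        # Each node gets one position (token) on the ring, derived from its name.
        self._ring = sorted((self._token(node), node) for node in nodes)
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _token(value: str) -> int:
        # MD5 is an assumption for illustration; it is not Cassandra's default hash.
        return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")

    def replicas(self, partition_key: str) -> list[str]:
        """Return the nodes that store this key, walking clockwise round the ring."""
        # First node whose token is greater than the key's token, wrapping at the end.
        start = bisect.bisect_right(self._tokens, self._token(partition_key)) % len(self._ring)
        count = min(self.replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


if __name__ == "__main__":
    ring = TokenRing(["node-a", "node-b", "node-c", "node-d"])
    print(ring.replicas("user:42"))
```

Since the replica set is a pure function of the key and the node list, every node reaches the same answer on its own, which is why no single coordinating server is needed for lookups.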