
Apache Cassandra is a NoSQL database ideal for high-speed, online transactional data, while Apache Hadoop is a big data analytics system that focuses on data warehousing and data lake use cases. Let’s look at their similarities, differences, best applications, and how they can be used together. But, first, let’s spend a little more time on Hadoop.

Hadoop and its key components  

Apache Hadoop, an Apache Software Foundation project, is a framework for storing and processing big data in a distributed environment. It excels at near-real-time and batch-oriented analytics, and it lets you run those analytics over high volumes of historical and line-of-business data on commodity hardware.

Unlike Cassandra, Hadoop is an ecosystem consisting of several components. Here are the four essential ones:

  • Hadoop Distributed File System (HDFS) is Hadoop’s primary storage system. It’s a distributed file system that looks like any other file system, except that when you write a file to HDFS, it is split into blocks (128 MB each by default in current releases), and each block is replicated and stored on three servers (the default, which can be customized) for fault tolerance. With HDFS, enormous amounts of data, both structured and unstructured, can be stored in a distributed fashion. Another option with Hadoop is to store the data in HBase. Originally a Hadoop subproject and now a top-level Apache project, HBase is an open-source, NoSQL, distributed, scalable database that runs on top of HDFS. Modeled after Google’s Bigtable, it brings similar capabilities to Hadoop, allowing rapid record-level data access. HBase is designed to support data warehouse and data lake-style use cases, and is not typically used for distributed web and mobile applications that need a high-performance online database.
  • MapReduce is a programming paradigm for processing large datasets. It splits a job into smaller tasks that are sent to many servers and processed in parallel: a map step turns each piece of input into intermediate key-value pairs, and a reduce step combines the values for each key. As a result, you can quickly process very large datasets.
  • Common/Core is a package containing libraries and utilities to support Hadoop modules. 
  • YARN, or Yet Another Resource Negotiator, is a resource management platform included for managing computing resources and scheduling Hadoop tasks.
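
To make the MapReduce idea above more concrete, here is a minimal, single-process sketch of the classic word count in Python. It does not use Hadoop's API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. On a real cluster, each chunk would be mapped on a different server, and the framework would handle the shuffle for you.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # Here the chunks are mapped one after another in a single process;
    # a cluster would map them in parallel on separate servers.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data lake", "big lake"]))
```

The point of the split is that the map and reduce functions never need to see the whole dataset at once, which is what lets the work spread across many machines.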

A Cassandra vs. Hadoop comparison: Architecture and more

 Cassandra and Hadoop’s ecosystem of components have many similarities. Here are some notable ones:

  • Projects of the Apache Software Foundation
  • Built to store and process massive amounts of data
  • Distributed and scalable architecture
  • Designed to take advantage of low-cost commodity servers
  • Apply replication
  • Support indexing
  • Deliver analytics

However, they use different approaches to tackle various tasks, and, because of that, each is more effective than the other in certain areas. Let’s review their architecture and several other factors to see how they compare, where they excel, and where they’re lacking.

 

Architecture

  • Cassandra: Peer-to-peer architecture where data is distributed across nodes in a cluster. There is no central hub or master node. Instead, every node has the same role, and each can accept read and write requests.
  • Hadoop: Master/slave architecture, with one master node and multiple slave nodes. In HDFS, the master node (the NameNode) manages the file system metadata and coordinates key operations, while the slave nodes (DataNodes) store the data blocks.

Data center distribution

  • Cassandra: Highly distributed. Can be deployed across regions, and even globally.
  • Hadoop: Typically deployed in a single data center, or geographically co-located with other servers.

Replication and fault tolerance

  • Cassandra: Data is replicated to multiple nodes according to a replication factor that you set per keyspace. Because replicas are spread across peer nodes with no master, it is generally considered more reliable and fault tolerant. The gossip protocol is used to share ongoing status among the peer nodes in the cluster, and, if one goes offline, the others take on its responsibility and carry on without interruption.
  • Hadoop: Creates three copies by default. As with Cassandra, you can alter this number. If the master node and its standby both fail, the file system becomes unavailable, and data can be lost if the metadata cannot be recovered.

Latency

  • Cassandra: Low. It is designed for fast reads and writes of individual records.
  • Hadoop: Higher, since it is designed for batch throughput rather than low-latency access. Write latency is lower than read latency.

Data formats

  • Cassandra: Handles most structured, semi-structured, and unstructured data, but is not well suited to large binary objects such as images.
  • Hadoop: Handles everything Cassandra does, plus large files such as images.

Indexing

  • Cassandra: Easy. Key-value pair data storage makes it straightforward. Allows the creation of multiple indexes, and includes many ways to quickly retrieve data.
  • Hadoop: Difficult. HDFS has no record-level indexing.

CAP theorem (consistency, availability, partition tolerance)

  • Cassandra: Supports availability and partition tolerance by default. That said, a trade-off can be made between consistency and availability: stronger consistency can be gained by tuning the replication factor and the consistency level of each request, at the cost of some availability.
  • Hadoop: Achieves consistency and partition tolerance with HDFS.

Access / query method

  • Cassandra: Uses Cassandra Query Language (CQL). Since it’s similar to SQL, most developers will pick it up quickly.
  • Hadoop: Uses MapReduce jobs to read and write data.

Data compression

  • Cassandra: Compression can reduce storage by up to 80%, with minimal overhead.
  • Hadoop: Reduces storage by between 10% and 15%.
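
The consistency and availability trade-off in the CAP theorem comparison above can be made concrete with a little arithmetic. The sketch below is plain Python, not a Cassandra API, and its function names are hypothetical. It relies on two well-known rules: a quorum is a majority of the replicas, and a read is guaranteed to see the latest write when the number of replicas read plus the number of replicas written is greater than the replication factor.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


def tolerated_failures(replication_factor, required_replicas):
    """Replicas that can be offline while a request still succeeds."""
    return replication_factor - required_replicas


if __name__ == "__main__":
    rf = 3
    for name, needed in (("ONE", 1), ("QUORUM", quorum(rf)), ("ALL", rf)):
        print(
            name,
            "consistent:", reads_see_latest_write(rf, needed, needed),
            "replicas that may be down:", tolerated_failures(rf, needed),
        )
```

With a replication factor of 3, reading and writing at one replica tolerates two offline replicas but does not guarantee that a read sees the latest write. Requiring all three replicas guarantees it, but a single offline replica then blocks the request. Quorum sits in between, which is why it is a common choice.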

 

And, when specifically compared to HBase, Cassandra supplies: 

  • Higher performance
  • True continuous, “always-on” availability, with no single point of failure
  • Powerful and easy multi-data center / cloud availability zone support
  • A simpler architecture (masterless), with easier setup and fewer requirements
  • Easier development (SQL-like language with CQL, and more)

Use cases for Cassandra vs. Hadoop

As we saw in the last section, Cassandra and Hadoop have their own strengths and weaknesses. But how do those attributes apply to real-world usage? In what kinds of situations should you use one over the other?

Cassandra is likely a better fit for:

  • Real-time processing of data
  • Always-on, high-speed, online transactional data and applications
  • Handling a high level of interaction and concurrent traffic
  • Situations where each interaction is processing a small amount of data
  • Applications requiring high availability of large-scale reads
  • Data warehousing structured data, since HDFS doesn’t have record-level indexing

Note: HBase is sometimes used for an online application just because a Hadoop implementation already exists at a site, not because it’s the right fit for the application. HBase is typically not a good choice for developing always-on online applications and is roughly two to three years behind Cassandra in many technical respects.

Here are just a few of the many ways Cassandra is used:

  • E-commerce and inventory management
  • Personalization, recommendations, and customer experience
  • Internet of things and edge computing
  • Fraud detection and authentication

Hadoop delivers considerable value to organizations by providing cost-effective processing and analysis of vast amounts of data. That analysis is often used to make mission-critical decisions.

With that in mind, you should consider Hadoop for:

  • Running near-real-time, big data analytics on historical data
  • Batch processing and storing extremely large volumes of data
  • Data lakes and data warehousing (with the exception noted above for structured data)

Here are examples of how Hadoop and its components are used:

  • Retail analytics
  • Financial risk analysis, trading, and forecasting
  • Healthcare applications, such as predicting patients at risk of serious illness
  • Social networking sites that process incredibly high volumes, such as Twitter and Facebook

How Cassandra and Hadoop complement each other

While Cassandra and Hadoop are each the right fit for many situations, there are other times when it would be nice to take advantage of what both do best. Luckily, it’s possible to run them side by side or to tightly integrate them so you can do just that.

Here are a couple of scenarios highlighting when it might make sense to use Cassandra and Hadoop together.

Run Cassandra and Hadoop side-by-side to take on hot and cold data

As with legacy relational database applications, there is typically a need in modern web, mobile and IoT applications to have a database devoted to online operations (that includes analytics on hot data), and a batch-oriented data warehouse environment that supports the processing of colder data for analytic purposes.

As we’ve discussed, Cassandra is a perfect database choice for online web and mobile applications, while Hadoop targets the processing of colder, historical data in data lakes, warehouses, etc. By leaning on both Cassandra and Hadoop, an IT organization can effectively support the different analytic “tempos” needed to satisfy customer requirements and run the business.
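
As a rough illustration of the hot and cold split described above, the following Python sketch routes records to an online store or a batch store based on their age. Everything in it is an assumption made for the example: the 30-day window, the store labels `online_store` and `batch_store`, and the record layout. A real deployment would pick its own cutoff and would usually move data with a scheduled pipeline rather than a function call.

```python
from datetime import datetime, timedelta

HOT_WINDOW = timedelta(days=30)  # assumed cutoff, purely illustrative


def choose_store(event_time, now, hot_window=HOT_WINDOW):
    """Return which system should serve a record, based on its age."""
    age = now - event_time
    return "online_store" if age <= hot_window else "batch_store"


def partition_by_temperature(records, now, hot_window=HOT_WINDOW):
    """Group record ids by the store that should hold them."""
    routed = {"online_store": [], "batch_store": []}
    for record in records:
        store = choose_store(record["event_time"], now, hot_window)
        routed[store].append(record["id"])
    return routed


if __name__ == "__main__":
    now = datetime(2024, 1, 31)
    records = [
        {"id": "order-1", "event_time": datetime(2024, 1, 30)},
        {"id": "order-2", "event_time": datetime(2023, 6, 1)},
    ]
    print(partition_by_temperature(records, now))
```

Here the online store plays the role of Cassandra and the batch store plays the role of Hadoop.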

Deploy Hadoop on top of Cassandra for convenient data analytics and reporting

Cassandra provides highly fault-tolerant storage for online systems, and Hadoop excels at data analytics. It turns out you can have the best of both worlds by deploying Hadoop on top of Cassandra. This lets companies conveniently use the immense amount of data they already have in Cassandra, in real time, for the level of operational analytics and reporting Hadoop provides. It also avoids the resource-intensive and complicated step of first moving that data from Cassandra into HDFS.

Cassandra or Hadoop? There’s no wrong answer

So, what’s the right choice: Cassandra or Hadoop? If you need high availability and performance, with low latency, and need real-time processing for an online application, then Cassandra is likely the right choice. On the other hand, if you need batch processing or big data analytics on extremely large volumes of historical data, Hadoop is probably the way to go. There is no right or wrong answer, and many organizations run Cassandra and Hadoop side by side, or integrate them, to access the benefits of both.

As a next step, explore Apache Cassandra in the cloud.

The 5 Features to Look for in a NoSQL Database

For an introduction to NoSQL databases, check out What is NoSQL? NoSQL databases have been around a long time, since the 1960s, but it wasn’t until the early 21st century that companies really started to use them, primarily to handle their big data and real-time web and cloud applications. Since then, the NoSQL database has surged in use and popularity, although relational databases still have their place.

But when beginning to search for a NoSQL solution, what should you look for? Here are the 5 key features to look for in a NoSQL database:

1. Support for Multiple Data Models

Where relational databases require data to be put into tables and columns to be accessed and analyzed, the various data model capabilities of NoSQL databases make them extremely flexible when it comes to handling data. They can ingest structured, semi-structured, and unstructured data with equal ease, whereas relational databases are extremely rigid, handling primarily structured data.

Different data models handle specific application requirements. Developers and architects choose a NoSQL database to more easily handle different agile application development requirements. Popular data models include graph, document, wide-column, and key-value. The ideal is to support multiple data models, which allows you to use the same data in different data model types without having to manage a completely different database.

2. Easily Scalable via Peer-to-Peer Architecture

It’s not that relational databases can’t scale, it’s that they can’t scale EASILY or CHEAPLY. That’s because they’re built with a traditional master-slave architecture, which means scaling UP via bigger and bigger hardware servers rather than OUT, or, worse, resorting to sharding. Sharding means dividing a database into smaller chunks across multiple hardware servers instead of a single large server, and this leads to operational administration headaches.

Instead, look for a NoSQL database with a masterless, peer-to-peer architecture with all nodes being the same. This allows easy scaling to adapt to the data volume and complexity of cloud applications. This scalability also improves performance, allowing for continuous availability and very high read/write speeds.

3. Flexibility: Versatile Data Handling

Where relational databases require data to be put into tables and columns to be accessed and analyzed, the multi-model capabilities of NoSQL databases make them extremely flexible when it comes to handling data. They can easily process structured, semi-structured, and unstructured data, while relational databases, as stated previously, are designed to handle primarily structured data.

4. Distribution Capabilities

Look for a NoSQL database that is designed to distribute data at global scale, meaning it can use multiple locations involving multiple data centers and/or cloud regions for write and read operations. Relational databases, in contrast, use a centralized application that is location-dependent (e.g., a single location), especially for write operations. A key advantage of using a distributed database with a masterless architecture is that you can maintain continuous availability because data is distributed with multiple copies where it needs to be.

5. Zero Downtime

The final but certainly no less important key feature to seek in a NoSQL database is zero downtime. This is made possible by a masterless architecture, which allows for multiple copies of data to be maintained across different nodes. If a node goes down, no problem: another node has a copy of the data for easy, fast access. When one considers the cost of downtime, this is a big deal.

Summary: NoSQL vs. SQL Decision Making

Choosing between a NoSQL and a relational database is always going to come down to your company’s particular needs. And there are, of course, situations for which you might want to use both types, as they can often complement each other. If you deal with a lot of data types, and/or you want or need to build powerful web and cloud applications for a distributed and quickly growing user base, then you will need your database to be multi-model, flexible, easily scalable, distributed, and always on, which means you will need a NoSQL database that can handle these requirements.

If you want to learn more about the differences between NoSQL and relational databases, check out our in-depth comparison page. Interested in learning about more than just features? Check out our complete guide to NoSQL. We also have an informative white paper that discusses Active Everywhere Databases.
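
To see how a masterless design can spread data across identical nodes and keep it readable when one goes down, consider this simplified token ring in Python. It is only a sketch under stated assumptions: the `Ring` class and its methods are hypothetical, MD5 stands in for a real partitioner (Cassandra's default is Murmur3), and each node gets a single token, whereas production systems typically use many virtual nodes per server.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring (MD5 used only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, replicas=3):
        self.replicas = min(replicas, len(nodes))
        self.entries = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.entries]

    def owners(self, key):
        """Walk clockwise from the key's token and collect the replica nodes."""
        start = bisect.bisect_right(self.tokens, token(key)) % len(self.entries)
        return [
            self.entries[(start + i) % len(self.entries)][1]
            for i in range(self.replicas)
        ]

    def readable_from(self, key, down=()):
        """Replicas of `key` that are still reachable when `down` nodes are offline."""
        return [node for node in self.owners(key) if node not in down]


if __name__ == "__main__":
    ring = Ring(["node-a", "node-b", "node-c", "node-d"], replicas=3)
    owners = ring.owners("user:42")
    print("replicas:", owners)
    print("after one failure:", ring.readable_from("user:42", down={owners[0]}))
```

Because every node uses the same rule to find a key's replicas, any node can take a request, and losing one replica still leaves two copies available.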

The Evolution of NoSQL

For years, organizations have relied on relational database management systems (RDBMSs) to store, process, and analyze critical business information. The idea originated in a paper written in 1970 by a computer scientist named Edgar Codd, who thought to archive information in tables containing rows and columns. The concept was a major leap forward from the slow and inefficient flat file systems that businesses were using at the time, although these systems did work in conjunction with pre-relational model databases.

The Rise of SQL

Shortly after, IBM developed the SQL language to scan and manipulate the transactional data sets stored within RDBMSs. With SQL, it became possible to quickly access and modify large pools of records without having to create complex commands. SQL essentially enabled one-click access to sets of data. The idea took off, and the RDBMS eventually emerged as the most widely used data management system. Today, most organizations are still using RDBMSs one way or another. RDBMSs, however, have one major limitation: they are only capable of efficiently processing relatively small amounts of structured data, like names and ZIP codes.

The NoSQL Imperative

When the era of big data hit, a new kind of database was required. The real driver for NoSQL was the sheer shift in data volumes that the Internet brought. Prior to the Internet, and in its early days, relational databases only had to deal with the data of a single company or organization. But when faced with the millions of Internet users that could discover a company’s service in waves, the RDBMS model either broke or became very challenging to shard correctly.

Relational databases also required a tremendous amount of maintenance. A database of a few thousand objects may handle things decently, but as you scale up, performance declines. This is a big problem, especially considering the massive volume of unstructured data that is being generated on a daily basis. According to 451 Research, 63% of enterprises and service providers today are managing storage capacities of at least 50 petabytes, and more than half of that data is unstructured.

The concept of NoSQL has been around for decades. Believe it or not, businesses have been using non-relational databases to store and retrieve unstructured data since the 1960s. The technology, however, wasn’t referred to as NoSQL until developer Carlo Strozzi created the Strozzi NoSQL Open Source Relational Database in 1998. Strozzi’s database, though, was really just a relational database that didn’t have an SQL interface. It wasn’t until 2009 that we saw a true departure from the relational database model and the first working NoSQL application.

NoSQL databases offer several advantages over relational databases. Most importantly, they can handle large volumes of big data. Other advantages include:

  • Elastic scalability. Unlike relational databases, NoSQL databases can scale outward into new nodes instead of upward. This strategy is much more flexible, efficient, and affordable than scaling with traditional legacy storage systems.
  • Lower operating costs. One of the biggest downsides to using an RDBMS is the fact that you will have to deal with expensive servers. Since NoSQL databases leverage commodity server clusters, you can process and store larger data volumes at a lower cost.
  • Reduced management. NoSQL databases are much easier to install and maintain as they are simpler and come with advanced auto-repair capabilities. While it’s not completely hands-off, NoSQL is much easier for network teams to manage on a daily basis.

Bridging RDBMS With NoSQL

Right now, NoSQL databases only account for about 3% of the $46 billion database market, but they are quickly gaining traction and are on pace to become a legitimate long-term market disruptor. But while NoSQL is heating up and the RDBMS market is experiencing a significant slowdown, this doesn’t mean that businesses are running out and abandoning their RDBMS systems altogether. RDBMSs, after all, are still great at managing transactional workloads, which are heavily used today.

The best solution often involves finding a way to use your legacy technology to support your new applications, and this means getting an enterprise data layer. What’s an enterprise data layer? It’s a way to connect your systems of record with your systems of engagement. Essentially, it’s a data management layer that spares you from having to go through a painfully expensive and time-consuming “rip and replace” process, and it allows you to salvage your legacy tech and put it to good use. You may still be stuck in the relational age, but that doesn’t mean you can’t take full advantage of the NoSQL revolution.
