Let’s see how Cassandra’s architecture and implementation stand up against top distributed NoSQL competitors: Couchbase, HBase, and MongoDB. After reviewing the comparison, I think you’ll see why Cassandra is so popular with leading companies around the world.
NoSQL Architecture: Apache Cassandra vs. Competitors
Cassandra incorporates a number of architectural best practices that affect performance. None are unique to Cassandra, but Cassandra is the only NoSQL system that incorporates all of them.
Every Cassandra machine handles a proportionate share of every activity in the system. There are no special cases like the HDFS namenode, the MongoDB mongos, or the MySQL Fabric process, all of which require special treatment. And with every node the same, Cassandra is far simpler to install and operate, which has long-term implications for troubleshooting. Even when everything works perfectly, master/slave designs have a bottleneck at the master. Cassandra leverages its masterless design to deliver lower latency as well as uninterrupted uptime.
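To make the "proportionate share" idea concrete, here is a minimal sketch of a hash ring in which every node owns many small slices of the key space and no node plays a coordinating role. It is a toy illustration, not Cassandra's implementation: the `Ring` class, the `token` helper, the SHA-256 hash, and the choice of 64 positions per node are all assumptions made for this example.

```python
import bisect
import hashlib
from collections import Counter


def token(value: str) -> int:
    """Hash a string to a position on the ring (SHA-256 is an assumption of this sketch)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy masterless ring: every node owns many small token ranges."""

    def __init__(self, nodes, positions_per_node=64):
        # Place each node on the ring at several pseudo-random positions.
        self._points = sorted(
            (token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(positions_per_node)
        )
        self._tokens = [t for t, _ in self._points]

    def owner(self, key: str) -> str:
        """Return the node at the first ring position at or after the key's token."""
        idx = bisect.bisect_left(self._tokens, token(key)) % len(self._points)
        return self._points[idx][1]


nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = Ring(nodes)

# Count how many of 10,000 hypothetical keys land on each node.
load = Counter(ring.owner(f"user:{i}") for i in range(10_000))
print(load)
```

Any client can compute the owner of a key with the same function, so no single machine has to sit in the request path for the whole cluster.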
Log-structured storage engine
A log-structured engine that avoids overwrites to turn updates into sequential I/O is essential both on hard disks (HDD) and solid-state disks (SSD). On HDD, it’s because the seek penalty is so high, whereas on SSD, it’s to avoid write amplification and disk failure. This is why you see MongoDB performance go through the floor as the dataset size exceeds RAM.
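The idea of turning updates into sequential writes can be sketched with a toy store that buffers writes in memory and flushes them as immutable, sorted segments. This is a simplified illustration under stated assumptions, not Cassandra's storage engine: the `LogStructuredStore` class, its method names, and the `memtable_limit` parameter are hypothetical, and the "segments" here are in-memory lists standing in for files on disk.

```python
class LogStructuredStore:
    """Toy log-structured store: writes go to memory, then to immutable sorted segments."""

    def __init__(self, memtable_limit=3):
        self.memtable = {}      # in-memory buffer of the most recent writes
        self.segments = []      # immutable sorted segments, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # An update never modifies an existing segment; it only adds a newer entry.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Emit the buffer in key order as one new segment
        # (on disk this would be a single sequential write).
        if self.memtable:
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then segments from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            for k, v in segment:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all segments into one, keeping only the newest value per key.
        merged = {}
        for segment in self.segments:   # oldest first, so later segments overwrite
            merged.update(segment)
        self.segments = [sorted(merged.items())] if merged else []


store = LogStructuredStore(memtable_limit=2)
store.put("a", 1)
store.put("b", 2)    # buffer is full: flushed as the first segment
store.put("a", 10)   # the update is a new entry, not an overwrite
store.put("c", 3)    # flushed as the second segment
print(store.get("a"), len(store.segments))
store.compact()
print(store.segments)
```

The old value of `a` stays untouched in the first segment until compaction discards it, which is the property that keeps writes sequential.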
Couchbase's append-only B-trees avoid overwrites, but require several seeks when updating or inserting new documents and do not support durable writes without a large performance penalty.
HBase has an integrated, log-structured storage engine, but relies on the Hadoop Distributed File System (HDFS) for replication instead of managing storage locally. This means HBase is architecturally incapable of supporting Cassandra-style optimizations like putting the commitlog on a separate disk, mixing SSD and HDD in a single cluster with appropriate data pinned to each, or incrementally pulling compacted SSTables into the page cache.
NoSQL systems were formerly characterized by only allowing primary key lookups, and there was no query planning to speak of. Today, Cassandra and most other systems support indexes and increasingly complex queries.
The Cassandra Query Language allows Cassandra to pre-parse and re-use query plans, reducing overhead. Others remain stuck with primitive JSON APIs or even raw Java Scanner objects. CQL also allows Cassandra to express more sophisticated operations like lightweight transactions with a minimal impact on clients, resulting in wide support across many languages. The closest alternative is Apache Phoenix, a Java-only SQL layer for HBase.
NoSQL Implementation: Apache Cassandra vs. Competitors
An architecture is only as good as its implementation. For the first few years after Cassandra's open-sourcing as an Apache project, every release was a learning experience. Releases 0.3, 0.4, 0.5, and 0.6 each attracted a new wave of users that exposed some previously unimportant weakness. Today, there are thousands of production deployments of Cassandra, the most of any scalable database.
Common methods of implementing a NoSQL database
When comparing the NoSQL database options, we considered the three most common use cases for implementation:
- New Applications: Begin with NoSQL by choosing a new application and start from the ground up. Such an approach mitigates the issues of application rewrites, data migrations, etc.
- Augmentation: Augment an existing system by adding a NoSQL component to it. This often happens when applications outgrow an RDBMS (e.g., because of scale problems, a need for better availability, or a move to hybrid/cloud environments).
- Full database replacement: For RDBMS systems that are simply proving too costly to keep, or are failing due to increases of user concurrency, data velocity, or data volume, a full replacement is done with a NoSQL database.
Apache Cassandra vs. MongoDB
MongoDB can be a great alternative to MySQL, but it's not really appropriate for the scale-out applications targeted by Cassandra. Still, as early members of the NoSQL category, the two do draw comparisons.
One important limitation in MongoDB is database-level locking. That is, only one writer may modify a given database at a time. Support for collection-level locking (a collection is a set of documents, analogous to a relational table) was added more recently, but even with collection-level locking, a small number of writes can stall reads on "hot" collections.
In contrast, Cassandra uses advanced concurrent structures to provide high performance updates without locking. Cassandra even eliminates the need for locking during index updates.
A more subtle MongoDB limitation: When adding or updating a field in a document, the entire document must be rewritten. If you preallocate space for each document, you can avoid the associated fragmentation. But, even with preallocation, updating your document gets slower as it grows. Cassandra's storage engine only appends updated data; it never has to rewrite or reread existing data. That means updates to a Cassandra row or partition stay fast as your dataset grows.
Apache Cassandra vs. HBase
HBase's storage engine is the most similar to Cassandra's; both drew on Bigtable's design early on.
But today, Cassandra's storage engine is far ahead of HBase's. This is primarily because building on HDFS instead of locally managed storage makes everything more complex and less performant. Cassandra leads in SSD support, efficient use of the page cache, support for large data sets, and more.
Cassandra's replication design is inherently more suited for delivering low latency response times, while also tolerating failures better.
Cassandra is also a leader in developer productivity, introducing CQL while HBase remains stuck on the difficult-to-use column family model. It's also worth noting that while Cassandra supports hundreds of tables, HBase "does not do well with anything above two or three column families."
Apache Cassandra vs. Couchbase
Couchbase presents a document-based data model to the end user, but under the hood it maps everything to a key/value storage API. Like MongoDB, updating any field in a document requires rewriting the whole thing.
Like MongoDB, Couchbase performs asynchronous writes by default. That is, after a Couchbase put operation, the data is buffered in memory but not yet written to disk. This is why naive Couchbase benchmarks post such startling performance numbers. Couchbase can be forced to persist writes to disk, but doing so kills performance: since there is no commitlog or journaling, each write must update Couchbase's B-tree and fsync.
Couchbase's storage engine has trouble dealing with more than five buckets (analogous to relational tables). The suggested workaround is to create a type attribute that will help you differentiate the various objects stored in a single bucket.
Couchbase's replication is simpler than MongoDB's, but no more rigorous in its design. Couchbase manages to be neither fully consistent nor fully available. It cannot serve reads during failover or network partitions, but it can still serve stale data to reads. Couchbase nominally supports active/active cross-datacenter replication. However, if the same document is updated concurrently in both datacenters, one of the updates will be lost. Cassandra solves this problem by merging updates at the column level and, optionally, by using lightweight transactions to opt in to a linearizable operation order. Finally, failures in Couchbase’s cross-datacenter replication often require manual intervention to recover.
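Merging at the column level can be pictured with a small last-write-wins sketch: each column carries its own timestamp, so concurrent updates to different columns of the same row both survive. This is an illustrative toy rather than Cassandra's reconciliation code. The `merge_rows` function, the row layout, and the sample data are hypothetical, and breaking timestamp ties by comparing values is an assumption made here so that the merge gives the same result in either order.

```python
def merge_rows(row_a, row_b):
    """Merge two replica versions of a row, column by column.

    Each row maps a column name to a (timestamp, value) tuple. The cell with
    the larger timestamp wins; equal timestamps fall back to comparing values
    (an assumption of this sketch) so the result is order-independent.
    """
    merged = dict(row_a)
    for column, cell in row_b.items():
        if column not in merged or cell > merged[column]:
            merged[column] = cell
    return merged


# Two datacenters concurrently update different columns of the same row.
dc1 = {"name": (100, "Ada"), "email": (105, "ada@new.example")}
dc2 = {"name": (100, "Ada"), "city": (103, "London")}

merged = merge_rows(dc1, dc2)
print(merged)

# When both sides change the same column, only that column is resolved by timestamp.
dc1_conflict = {"email": (105, "ada@new.example")}
dc2_conflict = {"email": (110, "ada@newer.example"), "city": (103, "London")}
print(merge_rows(dc1_conflict, dc2_conflict))
```

With whole-document replication, the second writer's copy would replace the first and one of the `email` or `city` changes in the first example would be dropped. With per-column cells, a conflict exists only when both sides touch the same column.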
Conclusion: Apache Cassandra is the clear winner
When you take a close look at Cassandra's architecture and implementation, its advantages over other top NoSQL databases are clear. Whether we’re talking about distribution, storage, queries, scaling, updates, or replication, Cassandra is the clear winner.