DataStax Developer Blog

Cassandra architecture and performance, mid 2014

By Jonathan Ellis -  June 12, 2014 | 1 Comment

The impending release of Cassandra 2.1 is a good time to look at how Cassandra is doing against the distributed NoSQL competition. This is an update of my summary of the top distributed NoSQL databases from January 2013.

Architecture

Cassandra incorporates a number of architectural best practices that affect performance. None are unique to Cassandra, but Cassandra is the only NoSQL system that incorporates all of them.

Fully distributed: Every Cassandra machine handles a proportionate share of every activity in the system. There are no special cases like the HDFS namenode, MongoDB mongos, or the MySQL Fabric process that require special treatment. And with every node the same, Cassandra is far simpler to install and operate, which has long-term implications for troubleshooting. Even when everything works perfectly, master/slave designs have a bottleneck at the master. Cassandra leverages its masterless design to deliver lower latency as well as uninterrupted uptime.

Log-structured storage engine: A log-structured engine that avoids overwrites to turn updates into sequential i/o is essential both on hard disks (HDD) and solid-state disks (SSD). On HDD, because the seek penalty is so high; on SSD, to avoid write amplification and disk failure. This is why you see mongodb performance go through the floor as the dataset size exceeds RAM. Couchbase’s append-only b-trees avoids overwrites, but requires several seeks when updating or inserting new documents and does not support durable writes without a large performance penalty.

Locally-managed storage: HBase has an integrated, log-structured storage engine, but relies on HDFS for replication instead of managing storage locally. This means HBase is architecturally incapable of supporting Cassandra-style optimizations like putting the commitlog on a separate disk, mixing SSD and HDD in a single cluster with appropriate data pinned to each, or incrementally pulling compacted sstables into the page cache.

Prepared statements: Five years ago, NoSQL systems were characterized by only allowing primary key lookups, and there was no query planning to speak of. Today, Cassandra and most other systems2 support indexes and increasingly complex queries. The Cassandra Query Language allows Cassandra to pre-parse and re-use query plans, reducing overhead. Others remain stuck with primitive JSON APIs or even raw Java Scanner objects. CQL also allows Cassandra to express more sophisticated operations like lightweight transactions with a minimal impact on clients, resulting in wide support across many languages. The closest alternative is Apache Phoenix, a Java-only SQL layer for HBase.

Implementation

An architecture is only as good as its implementation. For the first years after Cassandra’s open-sourcing as an Apache project, every release was a learning experience. 0.3, 0.4, 0.5, 0.6, each attracted a new wave of users that exposed some previously unimportant weakness. Today, there are thousands of production deployments of Cassandra, the most for any scalable database. Some are listed here. To paraphrase ESR, “With enough eyes, all performance problems are obvious.”

What are some implementation details relevant to performance?

MongoDB

MongoDB can be a great alternative to MySQL, but it’s not really appropriate for the scale-out applications targeted by Cassandra. Still, as early members of the NoSQL category, the two do draw comparisons.

One important limitation in MongoDB is database-level locking. That is, only one writer may modify a given database at a time. Support for collection-level (a set of documents, analogous to a relational table) locking has been planned since 2010, but even with collection-level locking a small number of writes would produce stalls in read performance to “hot” tables.

In contrast, Cassandra uses advanced concurrent structures to provide high performance updates without locking. Cassandra even eliminates the need for locking during index updates.

A more subtle MongoDB limitation is that when adding or updating a field in a document, the entire document must be re-written. If you pre-allocate space for each document, you can avoid the associated fragmentation, but even with pre-allocation updating your document gets slower as it grows. Cassandra’s storage engine only appends updated data; it never has to re-write or re-read existing data. Thus, updates to a Cassandra row or partition stay fast as your dataset grows.

MongoDB’s replication is complex and fragile. It’s relatively easy to achieve “impossible” scenarios like multiple masters for the same shard.

HBase

HBase’s storage engine is the most similar to Cassandra’s; both drew on Bigtable’s design early on.

But today, Cassandra’s storage engine is far ahead of HBase’s.  This is primarily because building on HDFS instead of locally-managed storage makes everything more complex and less performant. Cassandra leads in SSD support, efficient use of the page cache, support for large data sets, and more.

Cassandra’s replication design is inherently more suited for delivering low latency reponse times, while also tolerating failures better.

Cassandra has also led in developer productivity, introducing CQL while HBase remains stuck on the difficult to use Column Family model. It’s also worth noting that while Cassandra supports hundreds of tables, HBase “does not do well with anything above two or three column families“.

Couchbase

Couchbase presents a document-based data model to the end user, but under the hood it maps everything to a key/value storage API. Thus, like MongoDB, updating any field in a document requires rewriting the whole thing.

Like MongoDB circa 2013, Couchbase performs asynchronous writes by default. That is: after performing a couchbase put operation, it is buffered in memory but not on disk. This is why naive Couchbase benchmarks post such startling performance numbers. Couchbase can be forced to persist writes to disk, but doing so kills performance; since there is no commitlog or journaling, each write must update Couchbase’s btree and fsync.

Couchbase’s storage engine has trouble dealing with more than five buckets (analogous to relational tables). The suggested workaround is to create a type attribute that will help you differentiate the various objects stored in a single bucket.

Couchbase’s replication is simpler than MongoDB’s, but no more rigorous in its design. Couchbase manages to be neither fully consistent, nor fully available: it cannot serve reads during failover or network partitions, but it can still serve stale data to reads. Couchbase nominally supports active/active cross-datacenter replication, but if the same document is updated concurrently in both, one of the updates will be lost. (Cassandra solves this problem by merging updates at the column level and optionally by using lightweight transactions to opt in to a linearizable operation order.)

Couchbase cross-datacenter replication failure often requires manual intervention to recover.


1 DataStax is working with End Point Corporation on a proper, clustered benchmark. I will update this post when those results are available.  In the meantime, last year’s full benchmark is here.
2 HBase is the notable exception that does not offer indexes out of the box.



Comments

  1. dorian says:

    That sorted-primary-key in hbase/hypertable/bigtable is a pretty big thing . I repeat, very big. Again, BIG.
    Also data is better-compressable this way.

    What about indexing hotspot, where you have many data about a client and are indexing by client_id ?

    Phoenix is jdbc (python has jdbc, dont know about other languages).

    You can’t say you have a better query stuff than hbase (coprocessors).

    Isn’t the column-family stuff of hbase more dynamic ?
    Too much overhead for each cell (column-name!)?

    Can’t the ttl be fixed for the whole column-family?
    Use an id for column.names ?

    Can cassandra store data in reed-solomon encoding (qfs) ?
    Cassandra doesn’t even have access-groups ?

    The documentation site sucks(thanks for the fixed header). No page to explain data-model stuff, but here is a music-service.
    Can you support x maximum_versions_kept per column ?

    Also, can’t you shard data by ordered ranges of rows like hbase/hypertable/bigtable (not like the orderedbytes old stuff) ?

    Have you tried comparing against tokumx (a better mongodb) ?
    Also table != column_family.

    Collections set and map can be created easly with column-families ?(list a little more difficult)

    Your biggest win, randompartitioner, is your biggest loss.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>