DataStax Developer Blog

Cassandra architecture and performance, mid 2014

By Jonathan Ellis - June 12, 2014 | 4 Comments

The impending release of Cassandra 2.1 is a good time to look at how Cassandra is doing against the distributed NoSQL competition. This is an update of my summary of the top distributed NoSQL databases from January 2013.

Architecture

Cassandra incorporates a number of architectural best practices that affect performance. None are unique to Cassandra, but Cassandra is the only NoSQL system that incorporates all of them.

Fully distributed: Every Cassandra machine handles a proportionate share of every activity in the system. There are no special roles like the HDFS namenode, the MongoDB mongos, or the MySQL Fabric process that require special treatment. And with every node the same, Cassandra is far simpler to install and operate, which pays off over the long term when troubleshooting. Even when everything works perfectly, master/slave designs have a bottleneck at the master; Cassandra leverages its masterless design to deliver lower latency as well as uninterrupted uptime.

Log-structured storage engine: A log-structured engine that avoids overwrites to turn updates into sequential i/o is essential both on hard disks (HDD) and solid-state disks (SSD): on HDD, because the seek penalty is so high; on SSD, to avoid write amplification and premature disk failure. This is why you see MongoDB performance go through the floor as the dataset size exceeds RAM. Couchbase’s append-only B-trees avoid overwrites, but require several seeks when updating or inserting new documents and do not support durable writes without a large performance penalty.

Locally-managed storage: HBase has an integrated, log-structured storage engine, but relies on HDFS for replication instead of managing storage locally. This means HBase is architecturally incapable of supporting Cassandra-style optimizations like putting the commitlog on a separate disk, mixing SSD and HDD in a single cluster with appropriate data pinned to each, or incrementally pulling compacted sstables into the page cache.

Prepared statements: Five years ago, NoSQL systems were characterized by only allowing primary key lookups, and there was no query planning to speak of. Today, Cassandra and most other systems [2] support indexes and increasingly complex queries. The Cassandra Query Language allows Cassandra to pre-parse and re-use query plans, reducing overhead. Others remain stuck with primitive JSON APIs or even raw Java Scanner objects. CQL also allows Cassandra to express more sophisticated operations like lightweight transactions with a minimal impact on clients, resulting in wide support across many languages. The closest alternative is Apache Phoenix, a Java-only SQL layer for HBase.
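
For the curious, here is a minimal sketch of what that looks like from a client, using the DataStax Python driver; the contact point, keyspace, and users table are invented for illustration, not taken from any benchmark.

    import uuid
    from cassandra.cluster import Cluster

    # Hypothetical contact point and keyspace, for illustration only.
    # Assumes: CREATE TABLE users (user_id uuid PRIMARY KEY, name text)
    session = Cluster(['127.0.0.1']).connect('demo')

    # The INSERT is parsed and its plan cached once, server-side...
    insert_user = session.prepare("INSERT INTO users (user_id, name) VALUES (?, ?)")

    # ...and each execution ships only the bound values.
    session.execute(insert_user, (uuid.uuid4(), 'alice'))

    # A lightweight transaction is just more CQL, so the driver needs no
    # special client-side machinery for it.
    session.execute(
        "INSERT INTO users (user_id, name) VALUES (%s, %s) IF NOT EXISTS",
        (uuid.uuid4(), 'bob'))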

Implementation

An architecture is only as good as its implementation. For the first years after Cassandra’s open-sourcing as an Apache project, every release was a learning experience: 0.3, 0.4, 0.5, 0.6 each attracted a new wave of users that exposed previously unimportant weaknesses. Today, there are thousands of production deployments of Cassandra, the most for any scalable database. Some are listed here. To paraphrase ESR, “With enough eyes, all performance problems are obvious.”

What are some implementation details relevant to performance?

MongoDB

MongoDB can be a great alternative to MySQL, but it’s not really appropriate for the scale-out applications targeted by Cassandra. Still, as early members of the NoSQL category, the two do draw comparisons.

One important limitation in MongoDB is database-level locking: only one writer may modify a given database at a time. Collection-level locking (a collection is a set of documents, analogous to a relational table) has been planned since 2010, but even with collection-level locking, a small number of writes would stall reads to “hot” collections.

In contrast, Cassandra uses advanced concurrent structures to provide high performance updates without locking. Cassandra even eliminates the need for locking during index updates.

A more subtle MongoDB limitation is that when adding or updating a field in a document, the entire document must be re-written. If you pre-allocate space for each document, you can avoid the associated fragmentation, but even with pre-allocation updating your document gets slower as it grows. Cassandra’s storage engine only appends updated data; it never has to re-write or re-read existing data. Thus, updates to a Cassandra row or partition stay fast as your dataset grows.
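
As a concrete illustration, with an invented user_profiles table and the DataStax Python driver, updating one column of a Cassandra row touches only that column:

    from datetime import datetime
    from cassandra.cluster import Cluster

    # Hypothetical keyspace and table, for illustration only.
    session = Cluster(['127.0.0.1']).connect('demo')

    session.execute("""
        CREATE TABLE IF NOT EXISTS user_profiles (
            user_id   text PRIMARY KEY,
            bio       text,
            avatar    blob,
            last_seen timestamp
        )
    """)

    # Only the new last_seen cell is appended to the commitlog and memtable;
    # bio and avatar are never re-read or re-written, however large they grow.
    session.execute(
        "UPDATE user_profiles SET last_seen = %s WHERE user_id = %s",
        (datetime.utcnow(), 'alice'))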

MongoDB’s replication is complex and fragile. It’s relatively easy to achieve “impossible” scenarios like multiple masters for the same shard.

HBase

HBase’s storage engine is the most similar to Cassandra’s; both drew on Bigtable’s design early on.

But today, Cassandra’s storage engine is far ahead of HBase’s.  This is primarily because building on HDFS instead of locally-managed storage makes everything more complex and less performant. Cassandra leads in SSD support, efficient use of the page cache, support for large data sets, and more.

Cassandra’s replication design is inherently better suited to delivering low-latency response times, and it also tolerates failures more gracefully.

Cassandra has also led in developer productivity, introducing CQL while HBase remains stuck on the difficult-to-use column family model. It’s also worth noting that while Cassandra supports hundreds of tables, HBase “does not do well with anything above two or three column families”.
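
To make the productivity point concrete, here is a sketch of a simple time-series model in CQL via the Python driver; the sensor_readings table and its schema are invented for illustration.

    from cassandra.cluster import Cluster

    # Hypothetical keyspace and table, for illustration only.
    session = Cluster(['127.0.0.1']).connect('demo')

    # A compound primary key: sensor_id is the partition key, and
    # reading_time clusters rows within the partition, newest first.
    session.execute("""
        CREATE TABLE IF NOT EXISTS sensor_readings (
            sensor_id    text,
            reading_time timestamp,
            value        double,
            PRIMARY KEY (sensor_id, reading_time)
        ) WITH CLUSTERING ORDER BY (reading_time DESC)
    """)

    # Slice the ten most recent readings for one sensor; no client-side
    # byte encoding or Scanner plumbing required.
    rows = session.execute(
        "SELECT reading_time, value FROM sensor_readings WHERE sensor_id = %s LIMIT 10",
        ('sensor-42',))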

Couchbase

Couchbase presents a document-based data model to the end user, but under the hood it maps everything to a key/value storage API. Thus, like MongoDB, updating any field in a document requires rewriting the whole thing.

Like MongoDB circa 2013, Couchbase performs asynchronous writes by default. That is, a Couchbase put operation is buffered in memory but not persisted to disk before being acknowledged. This is why naive Couchbase benchmarks post such startling performance numbers. Couchbase can be forced to persist writes to disk, but doing so kills performance; since there is no commitlog or journaling, each write must update Couchbase’s B-tree and fsync.

Couchbase’s storage engine has trouble dealing with more than five buckets (analogous to relational tables). The suggested workaround is to create a type attribute that will help you differentiate the various objects stored in a single bucket.

Couchbase’s replication is simpler than MongoDB’s, but no more rigorous in its design. Couchbase manages to be neither fully consistent nor fully available: it cannot serve reads during failover or network partitions, yet the reads it does serve can still return stale data. Couchbase nominally supports active/active cross-datacenter replication, but if the same document is updated concurrently in both datacenters, one of the updates will be lost. (Cassandra solves this problem by merging updates at the column level, and optionally by using lightweight transactions to opt in to a linearizable operation order.)
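
A rough sketch of that opt-in path, again with the Python driver and a hypothetical accounts table: the IF clause turns the update into a compare-and-set that Cassandra serializes through Paxos rather than resolving as last-write-wins.

    from cassandra.cluster import Cluster

    # Hypothetical keyspace and table, for illustration only.
    # Assumes: CREATE TABLE accounts (account_id text PRIMARY KEY, balance int)
    session = Cluster(['127.0.0.1']).connect('demo')

    # Conditional update: applied only if the balance is still the value we read earlier.
    rows = list(session.execute(
        "UPDATE accounts SET balance = %s WHERE account_id = %s IF balance = %s",
        (90, 'acct-1', 100)))

    # The result row's first column ([applied]) reports whether the condition held;
    # if not, another writer got there first and we can re-read and retry.
    applied = rows[0][0]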

Recovering from a Couchbase cross-datacenter replication failure often requires manual intervention.


[1] DataStax is working with End Point Corporation on a proper, clustered benchmark. I will update this post when those results are available. In the meantime, last year’s full benchmark is here.
[2] HBase is the notable exception that does not offer indexes out of the box.



Comments

  1. dorian says:

    That sorted-primary-key in hbase/hypertable/bigtable is a pretty big thing. I repeat, very big. Again, BIG.
    Also, data is better-compressible this way.

    What about indexing hotspots, where you have a lot of data about a client and are indexing by client_id?

    Phoenix is JDBC (Python has JDBC, don’t know about other languages).

    You can’t say you have better query stuff than HBase (coprocessors).

    Isn’t the column-family stuff of HBase more dynamic?
    Too much overhead for each cell (column-name!)?

    Can’t the TTL be fixed for the whole column-family?
    Use an id for column names?

    Can Cassandra store data in Reed-Solomon encoding (QFS)?
    Cassandra doesn’t even have access-groups?

    The documentation site sucks (thanks for the fixed header). No page to explain data-model stuff, but here is a music-service.
    Can you support x maximum_versions_kept per column?

    Also, can’t you shard data by ordered ranges of rows like hbase/hypertable/bigtable (not like the orderedbytes old stuff)?

    Have you tried comparing against TokuMX (a better MongoDB)?
    Also, table != column_family.

    Collections set and map can be created easily with column-families? (list is a little more difficult)

    Your biggest win, randompartitioner, is your biggest loss.

    1. Christopher Smith says:

      I think you’re doing a bit too much feature chart comparison here. I’ve found when using the tools most of the points you raise aren’t terribly important in practice.

      For example, I don’t partition much data by client_id. I usually have it as part of the cluster key in tables, but it doesn’t make much sense as a partition key for anything beyond basic login data.

      JDBC support tends only to be relevant in Java land (not even sure what you are getting at with Python there…). There are JDBC-compatible Java drivers out there, but I’ve found them mostly useful for migrating legacy code over. They are adding a JDBC module on top of the driver, but I’m not exactly waiting with bated breath.

      On the “dynamic” column-family stuff… Cassandra has very dynamic models underneath CQL, but ever since CQL introduced compound primary keys, it hasn’t been needed. I’m not even sure what the issue is with overhead for each cell, but sstable compression support makes the overhead quite minimal, and if you really want to you can always use the legacy compact storage model.

      Similarly, I can’t see why it’d be a disadvantage to have more flexibility with the TTL. You can always have your code enforce a consistent TTL for all writes to a table. With 2.0 you could use Triggers to enforce a TTL if you really wanted to.

      It’s funny you mention the security groups thing. I’ve increasingly found I end up not wanting that. Instead I have a table that manages roles and their access rules, and then I generate permissions from that table, effectively creating a table-based DSL for security policy. Having to conform to someone else’s group/ACL model just adds more pain. The harder stuff to manage independently is object-level permissions, auditing and encryption, which Cassandra is great for, so I tend not to lose any sleep over the groups support.

      QFS ECC is nice, but I’d actually prefer to have that handled in the host OS filesystem/block device layer, and then just manipulate replication levels at a higher layer. Cassandra actually does do ECC on its underlying data chunks. Using node-level ECC to recover from node failures is quite nasty with a high availability solution, and it makes more sense for longer term storage of data, which in Cassandra you usually do by having a separate analysis data center, where I might indeed have much lower redundancy levels anyway. For the real-time querying, having higher replication levels is actually more desirable.

      You sound like you are aware of ByteOrderedPartitioner, which IIRC is still available. I’m not sure what more you’d want there. Ordered partitioning actually manifests a lot of the disadvantages you mention (like hot spots), so I’m kind of confused as to why you’d want that.

      In general, I’ve found every time I’ve *thought* I wanted ordered partitioning, I’ve actually been quite wrong (you do still have ordering *within* a partition, which is critical). Particularly with Cassandra 2.0 and the Spark integration, ordered partitioning seems all disadvantage and no advantage. What’s the upside from your perspective?

      1. dorian says:

        -re column overhead
        Each column has too much overhead. This can maybe be fixed by having a ‘document’ cell-type where you pay the overhead only once. The list/set/map collections are stored internally as separate columns, so you also have the normal column overhead there.

        -re ttl
        As I’ve read, if you want to TTL a column, you must store extra data (the TTL timestamp, 8 bytes). In Hypertable (maybe even HBase), you store the TTL at table creation time, and each cell doesn’t need to hold its own TTL (since the normal per-column overhead already includes a timestamp that can be used for the TTL).

        -re Access-groups
        Nope, I didn’t mean security groups. See the Bigtable paper (or the Hypertable docs) for what I mean by access-groups. Access groups are columns in the same row that are stored in different files (like different tables, but you can write them together and read them together).

        -re Group-by
        Why are there no group-bys, even just within a partition?

        -re Indexes
        I still don’t understand how indexes are saved/distributed. Does each node hold indexes for its own partitions?

        -re Overhead2
        The name of the column is another overhead. You have a create-table command; can’t you save a column-id there? Of course this is compressed on disk, but not in memory (or am I wrong?)

  2. Jason Berko says:

    Jonathan, a couple of notes: In your previous blog entry, you indicate that by default MongoDB fsyncs every 100 ms. Since you did change some defaults for Cassandra, and to be fair across databases, shouldn’t you set Mongo’s fsync interval equal to Cassandra’s, every 10 sec? (Or change C* to 100 ms?)

    And might you know when the results for that clustered benchmark you mentioned will be released?

    Thanks
