date: January 18, 2012
Today Amazon announced DynamoDB, a hosted database for Amazon Web Services.
With DynamoDB, Amazon and Cassandra come full circle: just as Cassandra was inspired by Amazon's 2007 Dynamo paper, DynamoDB has adopted many of the improvements Cassandra has made since then.
Here's a summary of how DynamoDB compares to Cassandra:
|Feature||Cassandra||DynamoDB|
|Key + columns data model||Yes||Yes|
|Composite key support||Yes||Yes|
|Tunable consistency||Yes||Most operations|
|Largest value supported||2GB||64KB|
|Idempotent write batches||Yes||No|
|Indexes on column values||Yes||No|
|Hadoop integration||M/R, Hive, Pig||M/R, Hive|
|Multi-datacenter support||Full cross-region||Multiple availability zones only|
|High performance on commodity disks||Yes||N/A|
|Backups||Low-impact snapshot + incremental||Manually with EMR|
|Deployable||Anywhere||Only on AWS|
Both Cassandra and DynamoDB achieve high scalability, using many of the same techniques. Cassandra has been shown to scale to millions of ops/s, and Amazon announced on this morning's webcast that they have a customer doing over 250 thousand ops/s on DynamoDB. This kind of scale needs to be baked in from the start, not tacked on as an afterthought.
The multi-datacenter availability story is a bit more complex. DynamoDB replicates data "across multiple Availability Zones in a Region," but cross-region replication is not supported. Since Availability Zones are not geographically dispersed, you need cross-region replication if you're worried about outages affecting entire regions or if you want to provide local data latencies to all your clients in any region. This requires the kind of fine-grained control over data consistency seen in Cassandra.
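To make "fine-grained control over data consistency" a little more concrete: in Dynamo-style systems, a read is guaranteed to see the latest acknowledged write when the set of replicas read and the set of replicas written must overlap, that is, when R + W > N. Here is a minimal sketch of that arithmetic. The function names and the `levels` table are illustrative assumptions, not part of any Cassandra or DynamoDB API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def overlaps(n: int, r: int, w: int) -> bool:
    """True when every read set must intersect every write set."""
    return r + w > n


N = 3          # replicas per row (assumed for this example)
Q = quorum(N)  # 2 of 3

# "read level / write level" -> (replicas read, replicas written)
levels = {
    "ONE/ONE": (1, 1),
    "QUORUM/QUORUM": (Q, Q),
    "ONE/ALL": (1, N),
}

results = {name: overlaps(N, r, w) for name, (r, w) in levels.items()}
for name, ok in results.items():
    print(f"{name}: read is guaranteed to see latest write = {ok}")
```

Being able to pick R and W per operation is what lets you trade latency against consistency request by request, instead of fixing one setting for the whole database.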
The data model is one area where DynamoDB is much closer to Cassandra than to the original Dynamo. The original Dynamo design had a primitive key/value data model, which has serious implications: to update a single field, the entire value must be read, updated, and re-written. (Other NoSQL databases have grafted a document API onto a key/value engine; this is one reason their performance suffers compared to Cassandra.) Another direct consequence is requiring complex vector clocks to handle conflicts from concurrent updates to separate fields. Cassandra was one of the first NoSQL databases to offer a more powerful ColumnFamily data model, which DynamoDB's strongly resembles.
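The difference is easier to see in a toy example. The sketch below is not Cassandra or Dynamo code; it is a simplified in-memory model, with hypothetical function names, that contrasts rewriting an opaque value with writing individual timestamped columns and reconciling them with last-write-wins per column.

```python
import json


def kv_update(store, key, field, value):
    """Key/value model: read the whole blob, change one field, write it all back."""
    doc = json.loads(store[key])
    doc[field] = value
    store[key] = json.dumps(doc)


def column_write(row, column, value, timestamp):
    """Column model: write one column; the newest timestamp wins for that column."""
    current = row.get(column)
    if current is None or timestamp > current[1]:
        row[column] = (value, timestamp)


def merge_rows(a, b):
    """Reconcile two replicas of a row, column by column."""
    merged = dict(a)
    for column, (value, timestamp) in b.items():
        column_write(merged, column, value, timestamp)
    return merged


# Key/value: two replicas accept concurrent updates to *different* fields.
original = json.dumps({"name": "alice", "email": "a@example.com"})
replica_a = {"user1": original}
replica_b = {"user1": original}
kv_update(replica_a, "user1", "name", "Alice")
kv_update(replica_b, "user1", "email", "alice@example.com")
# The blobs now disagree as whole values, and neither holds both changes.
blobs_conflict = json.loads(replica_a["user1"]) != json.loads(replica_b["user1"])

# Columns: the same two updates touch separate columns and merge cleanly.
row_a = {"name": ("alice", 1), "email": ("a@example.com", 1)}
row_b = dict(row_a)
column_write(row_a, "name", "Alice", 2)
column_write(row_b, "email", "alice@example.com", 3)
merged = merge_rows(row_a, row_b)
print(blobs_conflict, merged)
```

With whole-value blobs, something has to detect and resolve the conflict between the two versions. With columns, updates to separate fields never conflict in the first place.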
However, DynamoDB does not offer secondary indexes, which were one of the best-received features when Cassandra introduced them a year ago. Going back to only being able to query by primary key would feel like a big step backwards to me now.
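For readers who have not used them, here is a conceptual sketch of what an index on a column value buys you. It is a toy in-memory model with made-up data and hypothetical helper names, and it says nothing about how Cassandra implements indexes internally: without an index, a query on a non-key column has to examine every row, while an index maps each column value straight to the matching row keys.

```python
from collections import defaultdict

# Rows keyed by primary key; the data is invented for illustration.
users = {
    "u1": {"state": "TX", "name": "Ann"},
    "u2": {"state": "CA", "name": "Bob"},
    "u3": {"state": "TX", "name": "Cy"},
}


def scan(rows, column, value):
    """No index: examine every row to find matches."""
    return sorted(key for key, cols in rows.items() if cols.get(column) == value)


def build_index(rows, column):
    """Index: map each value of `column` to the set of row keys holding it."""
    index = defaultdict(set)
    for key, cols in rows.items():
        if column in cols:
            index[cols[column]].add(key)
    return index


def lookup(index, value):
    """Indexed query: one dictionary lookup instead of a full scan."""
    return sorted(index.get(value, ()))


state_index = build_index(users, "state")
print(scan(users, "state", "TX"))
print(lookup(state_index, "TX"))
```

Both calls return the same keys; the difference is how much data has to be touched to answer the question.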
Like Cassandra, DynamoDB offers Hadoop integration for analytical queries. It's unclear if DynamoDB is also able to partition analytical workloads to a separate group of machines the way you can with Cassandra and DataStax Enterprise. Realtime and analytical workloads have very different performance characteristics, so partitioning them to avoid resource conflicts is necessary at high request volumes.
Hadoop is rapidly becoming the industry standard for big data analytics, so it makes sense to offer Hadoop support instead of an incompatible custom API. It's worth noting that Cassandra supports Pig as well as Hive on top of Hadoop. Many people think Pig is a better fit for the underlying map/reduce engine.
Cassandra has been in production in many demanding environments for years now, so it's no surprise that Cassandra has a substantial lead in delivering real-world features like backup. Cassandra's log-structured storage engine (which is also the reason Cassandra performs so well without SSDs) allows both full and incremental backups with no impact on performance. (You can then upload your backup files with a tool like tablesnap.)
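The reason a log-structured engine makes snapshots cheap is that data files are never modified after they are written, so a snapshot can simply be a set of hard links to the existing files. Below is a rough sketch of that idea using plain files in a temporary directory. The `snapshot` function, directory layout, and file names are hypothetical stand-ins, not Cassandra's actual code or on-disk format, and hard links assume a filesystem that supports them.

```python
import os
import tempfile


def snapshot(data_dir, snapshot_dir):
    """Hard-link every data file into snapshot_dir; no bytes are copied."""
    os.makedirs(snapshot_dir, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(data_dir)):
        src = os.path.join(data_dir, name)
        if os.path.isfile(src):
            os.link(src, os.path.join(snapshot_dir, name))
            linked.append(name)
    return linked


# Set up a fake data directory holding two immutable data files.
root = tempfile.mkdtemp()
data_dir = os.path.join(root, "data")
os.makedirs(data_dir)
for name in ("segment-1.db", "segment-2.db"):
    with open(os.path.join(data_dir, name), "wb") as f:
        f.write(b"immutable contents")

snap_dir = os.path.join(root, "snapshots", "2012-01-18")
linked = snapshot(data_dir, snap_dir)

# Later the engine may delete an old file (for example after compaction);
# the snapshot's link keeps the underlying data alive.
os.remove(os.path.join(data_dir, "segment-1.db"))
with open(os.path.join(snap_dir, "segment-1.db"), "rb") as f:
    survived = f.read()

same_file = os.path.samefile(
    os.path.join(data_dir, "segment-2.db"),
    os.path.join(snap_dir, "segment-2.db"),
)
print(linked, survived, same_file)
```

Because nothing is copied, taking the snapshot costs almost no I/O, and the files in the snapshot directory can be uploaded at leisure.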
As an engineer, it's nice to see so many of Cassandra's design decisions imitated by Amazon's next-gen NoSQL product. I feel like a proud uncle! But in many important ways, Cassandra retains a firm lead in power and flexibility.