DataStax Developer Blog

Amazon DynamoDB

By Jonathan Ellis -  January 18, 2012 | 28 Comments

Today Amazon announced DynamoDB, a hosted database for Amazon Web Services.

With DynamoDB, Amazon and Cassandra come full circle: just as Cassandra was inspired by Amazon’s 2007 Dynamo paper, DynamoDB has adopted many of the improvements Cassandra has made since then.

Here’s a summary of how DynamoDB compares to Cassandra:

| Feature | Cassandra | DynamoDB |
|---|---|---|
| Key + columns data model | Yes | Yes |
| Composite key support | Yes | Yes |
| Tuneable consistency | Yes | Most operations |
| Distributed counters | Yes | Yes |
| Largest value supported | 2GB | 64KB |
| Idempotent write batches | Yes | No |
| Time-to-live support | Yes | No |
| Conditional updates | No | Yes |
| Indexes on column values | Yes | No |
| Hadoop integration | M/R, Hive, Pig | M/R, Hive |
| Multi-datacenter support | Full cross-region | Multiple availability zones only |
| Integrated caching | Yes | Unclear |
| High performance on commodity disks | Yes | N/A |
| Monitorable | Yes | Yes |
| Backups | Low-impact snapshot + incremental | Manually with EMR |
| Deployable | Anywhere | Only on AWS |

Both Cassandra and DynamoDB achieve high scalability, using many of the same techniques. Cassandra has been shown to scale to millions of ops/s, and Amazon announced on this morning’s webcast that they have a customer doing over 250 thousand ops/s on DynamoDB. This kind of scale needs to be baked in from the start, not tacked on as an afterthought.

The multi-datacenter availability story is a bit more complex. DynamoDB replicates data “across multiple Availability Zones in a Region,” but cross-region replication is not supported. Since Availability Zones are not geographically dispersed, you need cross-region replication if you’re worried about outages affecting entire regions or if you want to provide local data latencies to all your clients in any region. This requires the kind of fine-grained control over data consistency seen in Cassandra.
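As a rough illustration of what per-operation consistency control means with replicas in more than one datacenter, here is a minimal Python sketch. The level names (ONE, LOCAL_QUORUM, QUORUM, ALL) are Cassandra consistency levels; the `required_acks` function, the datacenter names, and the replica counts are hypothetical and only model how many replicas must acknowledge an operation, not how Cassandra implements it.

```python
def quorum(replicas):
    """Smallest majority of a replica set."""
    return replicas // 2 + 1


def required_acks(level, replicas_per_dc, local_dc):
    """Replica acknowledgements needed for one operation (simplified model)."""
    total = sum(replicas_per_dc.values())
    if level == "ONE":
        return 1
    if level == "LOCAL_QUORUM":
        # Majority of the replicas in the coordinator's own datacenter only,
        # so the operation never waits on a cross-region round trip.
        return quorum(replicas_per_dc[local_dc])
    if level == "QUORUM":
        # Majority of all replicas across every datacenter.
        return quorum(total)
    if level == "ALL":
        return total
    raise ValueError(f"unknown consistency level: {level}")


# Hypothetical layout: three replicas in each of two regions.
layout = {"us-east": 3, "eu-west": 3}

for level in ("ONE", "LOCAL_QUORUM", "QUORUM", "ALL"):
    print(level, required_acks(level, layout, local_dc="us-east"))
```

In this model a client in one region can choose LOCAL_QUORUM for low latency, or QUORUM when it wants a majority of all replicas to agree, and it can make that choice on each read or write.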

The data model is one area where DynamoDB is much closer to Cassandra than to the original Dynamo. The original Dynamo design had a primitive key/value data model, which has serious implications: to update a single field, the entire value must be read, updated, and re-written. (Other NoSQL databases have grafted a document API onto a key/value engine; this is one reason their performance suffers compared to Cassandra.) Another direct consequence is requiring complex vector clocks to handle conflicts from concurrent updates to separate fields. Cassandra was one of the first NoSQL databases to offer a more powerful ColumnFamily data model, which DynamoDB’s strongly resembles.
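To make the read-modify-write cost concrete, here is a minimal Python sketch that uses in-memory dicts as stand-ins for the two models. The `KeyValueStore` and `ColumnStore` classes, their methods, and the sample record are hypothetical illustrations, not the API of Dynamo, DynamoDB, or Cassandra.

```python
import json


class KeyValueStore:
    """Opaque key/value model: the store only sees whole serialized values."""

    def __init__(self):
        self._data = {}
        self.bytes_read = 0
        self.bytes_written = 0

    def get(self, key):
        blob = self._data[key]
        self.bytes_read += len(blob)
        return blob

    def put(self, key, blob):
        self._data[key] = blob
        self.bytes_written += len(blob)


class ColumnStore:
    """Key + columns model: each column can be written on its own."""

    def __init__(self):
        self._rows = {}
        self.bytes_read = 0
        self.bytes_written = 0

    def put_column(self, key, column, value):
        self._rows.setdefault(key, {})[column] = value
        self.bytes_written += len(column) + len(value)


def update_field_kv(store, key, field, value):
    # Read the whole value, change one field, write the whole value back.
    record = json.loads(store.get(key))
    record[field] = value
    store.put(key, json.dumps(record))


profile = {"name": "alice", "email": "alice@example.com", "bio": "x" * 1000}

kv = KeyValueStore()
kv.put("user:1", json.dumps(profile))
kv.bytes_written = 0  # count only the update below
update_field_kv(kv, "user:1", "email", "alice@example.org")

cols = ColumnStore()
for column, value in profile.items():
    cols.put_column("user:1", column, value)
cols.bytes_written = 0  # count only the update below
cols.put_column("user:1", "email", "alice@example.org")

print("key/value update:", kv.bytes_read, "read,", kv.bytes_written, "written")
print("column update:", cols.bytes_read, "read,", cols.bytes_written, "written")
```

In the key/value version the whole record, including the untouched 1000-character `bio`, is read and rewritten to change one field. The column version writes only the `email` column and reads nothing.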

However, DynamoDB does not offer secondary indexes, which were one of the best-received features when Cassandra introduced them a year ago. Going back to only being able to query by primary key would feel like a big step backwards to me now.

Like Cassandra, DynamoDB offers Hadoop integration for analytical queries. It’s unclear if DynamoDB is also able to partition analytical workloads to a separate group of machines the way you can with Cassandra and DataStax Enterprise. Realtime and analytical workloads have very different performance characteristics, so partitioning them to avoid resource conflicts is necessary at high request volumes.

Hadoop is rapidly becoming the industry standard for big data analytics, so it makes sense to offer Hadoop support instead of an incompatible custom API. It’s worth noting that Cassandra supports Pig as well as Hive on top of Hadoop. Many people think Pig is a better fit for the underlying map/reduce engine.

Cassandra has been in production in many demanding environments for years now, so it’s no surprise that Cassandra has a substantial lead in delivering real-world features like backup. Cassandra’s log-structured storage engine, which is also the reason Cassandra performs so well without SSDs, allows both full and incremental backups with no impact on performance. (You can then upload your backup files with a tool like tablesnap.)

As an engineer, it’s nice to see so many of Cassandra’s design decisions imitated by Amazon’s next-gen NoSQL product. I feel like a proud uncle! But in many important ways, Cassandra retains a firm lead in power and flexibility.



Comments

  1. Cassandra’s tech is superior, as far as I can tell. But we’ll probably be using DynamoDB until there is an equivalent managed host service for Cassandra. Moving to Cassandra is simply too expensive right now.

    I wish DataStax would sell managed Cassandra on AWS at reasonable pricing. The price is a big part of scaling, and right now you have to be pretty big for Cassandra to begin to make economic sense over all the other alternatives. This cuts out the overwhelming majority of the market who are nearly-big or merely aspire to be big and want to start on a platform that can support it.

    All those are clearly better served by a service like DynamoDB than trying to run their own Cassandra clusters unless they happen to be very proficient in Cassandra administration and want to dedicate precious human resources to administration. That takes a lot of the benefits of “cloud” away from small and mid-sized companies where cost and management are the limiting factors.

  2. Paul Webber says:

    How much does your hosted service cost?

  3. Hi Robert,

    Cassandra-as-a-service exists in private beta as a Heroku add-on! We can invite you to try it out, if you like! frank at m2m.io is my e-mail, if you’d like to get in contact (or anyone interested, for that matter).

  4. Serdar says:

    To compare with the DynamoDB price, can you give an example of how much it costs to deploy data on Cassandra with EC2 instances? E.g. for 3 servers, quorum-consistent reads/writes, 100GB of data, 10000 writes and 50000 reads per second?

  5. Serdar says:

    based on this benchmark http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

    can we say 7 reqs/sec = $1 per month ?

  6. Jack says:

    I bet Dynamo does data distribution a lot better than Cassandra and does not falter under load when you finally need to partition your data further and add a node.

    I’d never use Cassandra again.

  7. Sunny says:

    The pricing model used in DynamoDB can be tricky when comparing with a Cassandra cluster in EC2. In DynamoDB, you need to assign read and write throughput per table, so you need a detailed understanding of how much traffic individual tables experience. Whether this will increase or decrease the overall cost really depends on the load the application imposes on the database.

    I am still trying to work out the cost-effectiveness of DynamoDB w.r.t. Cassandra running in EC2.

    Any thought…

  8. Edlich says:

    Your table:
    Stating that monitoring of DynamoDB is not possible is simply wrong. Some other entries are also at least doubtful…

  9. Özgür says:

    How did you conclude that 7reqs/sec = 1$ ?

  10. Hi Jack,

    What are the cons (and pros) and weaknesses of Cassandra that you have found so far, based on your experience, that make you dislike it? I just want to learn from your experience.

    Thanks,
    Charlie

  11. Serdar Irmak says:

    Hi Özgür, based on this table: http://4.bp.blogspot.com/-omcjPorKws8/TrHPIl82k8I/AAAAAAAAAXU/ic_MTht06zA/s1600/cost.png

    node cost / hour (for 288 nodes) is $195.84

    195.84 / 288 * 24 * 30.5 = 497.76 per month per node

    1,099,837 writes per second / 288 nodes = 3818.9 writes/sec per node,

    so, 3818.9 / 497.76 = 7.67 writes per second per 1$
    (I didn’t read if it’s 1 digit ms latency as dynamodb)

  12. Serdar Irmak says:

    7.67 writes per second per 1$ is for 1 month

  13. Serdar Irmak says:

    and dynamodb seems 0.01 * 24 * 30.5 = $7.32 per 10 writes per second

  14. Serdar Irmak says:

    again for per month

  15. Sunny says:

    Keep in mind the read and write throughput is shared among the different partitions of the table, so for certain ‘hot partitions’ you will get much less than the total throughput you assigned to the table. For example, if your read throughput is set at 10 and your table has 5 partitions, you would expect to get 2 iops per partition instead of 10 iops. This is what I meant by it being difficult to compare iops cost-effectiveness between DynamoDB and Cassandra.

  16. Eugene says:

    So converting both Cassandra and DynamoDB costs to the same units…

    Cassandra: 7.67 writes per second per dollar per month

    DynamoDB: 1.37 writes per second per dollar per month

    (That’s taking the $7.32 per 10 writes per sec per month, dividing by 10 to get dollars per 1 write, and taking the inverse to get writes per dollar)

    Seems like a no-brainer to me…or is my math way off?

  17. Serdar Irmak says:

    Yes, true. For this sample benchmark, it seems Cassandra is 5 times cheaper for writes, but the same price for reads (DynamoDB is 5 times cheaper for reads than writes, and Cassandra’s read and write speeds are nearly the same). Also, they didn’t use spot instances for this benchmark, so the result could be even cheaper (for Cassandra) with spot instances. But as Sunny said, there are lots of conditions that can affect the whole result. Each one can be cost-effective for different use cases, I think.

    I wish a pro Cassandra guy would help us with a more accurate cost comparison.

  18. @Eugene you are comparing the infrastructure costs of Cassandra (on AWS) to the infrastructure costs *and* administrative costs of DynamoDB.

    It’s not apples to apples: the total cost of ownership for running Cassandra must include managing and tuning your clusters. With DynamoDB the support and administrative costs scale linearly. With Cassandra, right now, they don’t.

    So as an example. I have some websites considering Cassandra and we’ve been playing with it for a while now. We aren’t willing to port our sites to it till we know we can get it out of a jam so for us we’d have to get some kind of support contract (and since Datastax’s inception we’ve been evaluating them for it).

    But the pricing for this doesn’t seem very transparent yet, as it doesn’t seem to have hit the economies of scale that makes it more cost-effective at smaller scales. With DynamoDB we are on a proprietary platform (and would love to one day see a hosted service on a non-proprietary platform such as Cassandra instead) but it solves the much more difficult problem of supporting and managing the cluster. The price of the iron itself is not the major factor till you get huge (in terms of number of nodes).

  19. Serdar Irmak says:

    @Robert this is an important point. Here is a part of an article I found :

    Sid Anand, who helped transition Netflix from Oracle to AWS’s SimpleDB to Cassandra and who now is on the LinkedIn infrastructure team, wrote on his blog earlier this week that “[i]f [your NoSQL database] is not hosted (e.g. by AWS), be prepared to hire a fleet of ops folks to support it yourself. If you don’t have the manpower, I recommend AWS’[s] DynamoDB.”

  20. John says:

    That table says that DynamoDB doesn’t have monitoring. That’s just not true. http://aws.amazon.com/dynamodb/faqs/#How_do_I_monitor_table_size_and_performance

  21. Jonathan Ellis says:

    Thanks John, I’ve updated the table with your link.

  22. Edward Capriolo says:

    @Serdar believe it or not, Ops people are useful in today’s world. I know there is a new generation of cloud lovers that insist we (ops people) are a waste of time (not saying that about Sid). Whatever the case, we have not had to hire a “fleet of ops people” to manage our NoSQL. Our ops department has grown slower than our number of developers and slower than our number of systems.

    Also, since DynamoDB came out only a few days ago, any endorsement must be taken with a grain of salt.

  23. Serdar Irmak says:

    @Edward indeed, for a multi-million $ company. But for startups with a 1-2 person team and a small budget, or for part-time home-grown/garage projects where NoSQL is a must, what’s the best way to maintain a 3-node Cassandra cluster?

  24. Edward Capriolo says:

    @Serdar

    AMIs and RPMs exist, as do recipes for tools like Puppet and Chef. To answer your question, the best way to maintain something is to understand it by reading about it and following best practices. Not to make this a cloud rant, but like most things that someone makes easy, they make them cost more. Only you can crunch the numbers and decide what you are comfortable with. I personally am not sure I am comfortable with the pay-per-op model of DynamoDB; the ops/sec/$ seems very low.

  25. Serdar Irmak says:

    @Edward thank you, great answer.

  26. Amazon DynamoDB does not support Map Reduce. This post mistakenly says that it does. I have checked with Amazon people. It supports Hive through EMR. All of their documentation always talks about EMR and not just MR. They support exporting files to S3 and then importing them into Hive. There is no way to run a mass update using Map Reduce. The only way you can map reduce is to export files to S3. That’s not what I would call supporting Map Reduce.

  27. edlich says:

    “High performance on commodity disks” is also a cool question, giving Cassandra a yes and DynamoDB an N/A. If the question were “High performance because it runs on SSDs” the answer would be vice versa ;-)
    (but ok comparisons are always difficult and I am thankful that you tried).
    Best

  28. joe says:

    How do they compare (throughput, latency, scalability) to Hyperdex, FoundationDB, LightCloud, MonetDB, VoltDB or Aerospike?
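The cost arithmetic worked out in comments 11 to 16 can be checked with a short Python sketch. It uses only the figures quoted in those comments (the Netflix benchmark cost table and the $0.01 per hour per 10 writes/sec DynamoDB price) and the 30.5-day month the commenters assumed. It covers infrastructure cost only and ignores the administrative costs raised in comment 18.

```python
HOURS_PER_MONTH = 24 * 30.5  # month length used in the comment thread

# Cassandra on EC2, figures quoted from the Netflix benchmark.
cluster_cost_per_hour = 195.84      # dollars for the whole cluster
nodes = 288
cluster_writes_per_sec = 1_099_837

node_cost_per_month = cluster_cost_per_hour / nodes * HOURS_PER_MONTH
node_writes_per_sec = cluster_writes_per_sec / nodes
cassandra_writes_per_dollar = node_writes_per_sec / node_cost_per_month

# DynamoDB, price quoted in the thread: $0.01 per hour per 10 writes/sec.
dynamodb_cost_per_month = 0.01 * HOURS_PER_MONTH  # dollars per 10 writes/sec
dynamodb_writes_per_dollar = 10 / dynamodb_cost_per_month

ratio = cassandra_writes_per_dollar / dynamodb_writes_per_dollar

print(f"Cassandra node cost per month: ${node_cost_per_month:.2f}")
print(f"Cassandra writes/sec per dollar per month: {cassandra_writes_per_dollar:.2f}")
print(f"DynamoDB writes/sec per dollar per month: {dynamodb_writes_per_dollar:.2f}")
print(f"Cassandra / DynamoDB ratio: {ratio:.1f}")
```

The results agree with the thread: about $497.76 per node per month, 7.67 versus 1.37 writes per second per dollar per month, and a ratio a little over 5.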
