Why NoSQL Can Be Safer than an RDBMS
Sean Doherty’s recent article in Wired made a case for why relational databases may still have long life left in them, while at the same time asserting that NoSQL databases might not. Sean is a friend of mine for whom I have great respect, and he touted a database (Postgres), which I think is a great RDBMS.
That said, I disagree with many of the article’s statements, the most important being that companies should not consider NoSQL databases as a first choice for critical data. In this article, I’ll show first how a NoSQL database like Cassandra is indeed being used today as a primary datastore for key data and, second, that Cassandra can actually end up being safer than an RDBMS for important information.
But before I do that, let me first quickly address a few other assertions in Sean’s article.
You’re Not Google
Sean repeats something that many others have said when trying to limit the types of companies that can profit from NoSQL: “Remember that most companies are not Google, Yahoo or Facebook.”
No, most companies aren’t. But here’s the thing: you don’t have to be Google to have a big data problem or benefit from a NoSQL database.
Take two of our customers: Boxever and MarkedUp. Each provides a data-related service (Boxever to the airline industry and MarkedUp to software shops) and both are pretty small companies that share a similar story we constantly hear at DataStax.
They started out with an RDBMS for their application and quickly hit scaling and performance walls that an RDBMS couldn’t overcome. Enter NoSQL and Cassandra. Today, each company is handling big data use cases with ease.
The bottom line? You don’t need to be a monolithic cutting-edge tech company to need help with big data and reap the benefits of NoSQL databases.
We’ve Heard This Before
Sean says that other challengers to the relational model have come and gone, implying that NoSQL will also. He intimates that RDBMSs may absorb NoSQL engines, much like some did with other database technology (e.g. OLAP).
I don’t see this happening with a NoSQL technology like Cassandra because the fundamental architecture of relational engines cannot support the same use cases as Cassandra. RDBMSs were built to scale up, not out, for both reads and writes. They are best at handling structured data, offer high but not continuous availability, and struggle to ingest, distribute and synchronize data that is widely dispersed geographically.
So, while Oracle bought Express many years ago and rolled its OLAP technology into its core database, it couldn’t do the same thing with NoSQL technology, which is why Oracle ended up introducing its own, separate NoSQL database a few years back. I expect other build/buy use cases to happen with other major RDBMS players as well.
Sean ends his article by saying, “Relational databases may not be hot or sexy but for your important data there is no substitute.”
Many companies are indeed substituting NoSQL for RDBMSs where their important data is concerned. Netflix stores 95% of all its data, including the entire viewing history of all 36 million of its members, in Cassandra. It doesn’t get more important than this, and the company migrated from Oracle in the process.
But it’s hard to miss the key implication made by RDBMS stalwarts, which was stated more openly in the early days of NoSQL but is not heard so much anymore:
NoSQL isn’t safe for your critical data.
I won’t attempt to defend each and every NoSQL database against this claim, but I’ll tell you why a NoSQL database like Cassandra can actually be safer than a relational database for your important data.
WHY DATA IN CASSANDRA IS SAFE
First, let’s understand a few basic things. Data written to Cassandra is first written to a commit log in much the same way as happens in an RDBMS. Data is not only durable in Cassandra, but writes are also atomic and isolated at the row level (the A and I in ACID). These properties mean you don’t lose data in Cassandra.
Data consistency in Cassandra does differ from relational engines, as there are no foreign keys or referential constraints to contend with in its data model and architecture. Instead, consistency is tunable: each operation (e.g. each insert or delete) can be specified to be as strongly or as eventually consistent across a cluster as needed, with the understanding that stronger settings may carry performance implications depending on the deployment (e.g. clusters spanning multiple geographies).
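To make the tunable trade-off concrete, here is a minimal Python sketch of the usual overlap rule: a read is guaranteed to see the latest acknowledged write when the replicas that must answer the read plus the replicas that must acknowledge the write add up to more than the replication factor. The level names ONE, QUORUM and ALL mirror Cassandra’s consistency levels; the function names `replicas_required` and `is_strongly_consistent` are illustrative assumptions, not part of any driver API.

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Return how many replicas must acknowledge an operation at this level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # A quorum is a strict majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported level: {level}")


def is_strongly_consistent(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


if __name__ == "__main__":
    rf = 3  # assumed replication factor for the illustration
    for write_level, read_level in [("ONE", "ONE"),
                                    ("QUORUM", "QUORUM"),
                                    ("ALL", "ONE")]:
        strong = is_strongly_consistent(write_level, read_level, rf)
        verdict = "strong" if strong else "eventual"
        print(f"write={write_level:<6} read={read_level:<6} -> {verdict}")
```

With three copies, writing and reading at QUORUM touches two replicas each time, so the two sets always share at least one node; writing and reading at ONE is faster but only eventually consistent.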
Yes, Cassandra handles data consistency differently than RDBMSs, but the data it manages is just as safe.
SAFER THAN RDBMSs – BUILT FOR DISASTER AVOIDANCE
Cassandra was inherently designed with the understanding that hardware failures can and do occur. Its masterless architecture means there are no special node types (e.g. master/slave) and that redundancy in data and function is built in.
This allows customers to perform any action on any node and easily build database clusters that keep multiple copies of data on various nodes (three copies is typical, but it can be more or fewer). Lose a node, and everything continues as normal.
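The arithmetic behind “lose a node, and everything continues” can be sketched in a few lines of Python. The sketch assumes the worst case, in which every failed node holds a copy of the row being requested, and simply checks whether enough copies remain to satisfy a request. The names `quorum_size` and `live_replicas_suffice` are hypothetical helpers for illustration only.

```python
def quorum_size(replication_factor: int) -> int:
    """A quorum is a strict majority of the copies."""
    return replication_factor // 2 + 1


def live_replicas_suffice(replication_factor: int, replicas_down: int,
                          acks_needed: int) -> bool:
    """True if the surviving copies can still satisfy the request.

    Worst case assumed: every failed node held a copy of the requested row.
    """
    return replication_factor - replicas_down >= acks_needed


if __name__ == "__main__":
    rf = 3  # three copies, as in the text
    levels = {"ONE": 1, "QUORUM": quorum_size(rf), "ALL": rf}
    for down in range(rf):
        served = [name for name, acks in levels.items()
                  if live_replicas_suffice(rf, down, acks)]
        print(f"{down} replica(s) down -> still served: {', '.join(served)}")
```

Under these assumptions, a cluster with three copies keeps answering quorum requests with one replica down, which is why a single node failure can go unnoticed by the application.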
For example, Aaron Turner tweeted that it took their tech team 10 hours to notice they’d lost a Cassandra node because everything just kept working.
Unlike RDBMS replication, Cassandra’s replication affords true read/write-anywhere (i.e. any node) capabilities, and that support covers not only clusters in a single location but also clusters spanning multiple data centers and cloud availability zones.
This is why data in Cassandra can actually be safer than in an RDBMS, which relies on an older master/slave design that can’t support a true active-everywhere environment. A single Cassandra cluster can, for example, span five data centers with users in different locales all reading and writing to their own local nodes (with local data redundancy being present), and all data being automatically synched up across the cluster.
Should one or more data centers go down, the data in the cluster is completely safe and business continues on as normal with downed data center traffic simply being redirected to the remaining data centers. Once any outage is rectified, Cassandra will automatically synch the downed nodes back up with the rest of the cluster.
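One way to picture how a returning node catches up is last-write-wins reconciliation: Cassandra resolves conflicting versions of a value by write timestamp, so a stale copy can be merged with a current one by keeping the newer cell for each key. The Python below is a toy model of that idea only, not Cassandra’s actual repair machinery; the `reconcile` function, the type aliases and the sample keys are all assumptions made for illustration.

```python
from typing import Dict, Tuple

Cell = Tuple[int, str]      # (write timestamp, value)
Replica = Dict[str, Cell]   # key -> most recent cell known to this replica


def reconcile(stale: Replica, current: Replica) -> Replica:
    """Merge two replicas, keeping the cell with the newer timestamp per key."""
    merged = dict(stale)
    for key, cell in current.items():
        if key not in merged or cell[0] > merged[key][0]:
            merged[key] = cell
    return merged


if __name__ == "__main__":
    # Hypothetical state of a node that was offline during an outage.
    returning_node = {"member:1": (100, "paused at 10:00")}
    # Hypothetical state of a node that stayed online and took new writes.
    healthy_node = {
        "member:1": (250, "paused at 42:00"),
        "member:2": (180, "started episode 3"),
    }
    print(reconcile(returning_node, healthy_node))
```

Because the merge only compares timestamps, it gives the same result regardless of which replica is treated as the starting point, which is part of what lets any node accept writes while another is down.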
This isn’t rhetoric. These deployments are in place and working today. For example, content discovery and recommendation engine titan Outbrain uses Cassandra in just this fashion. When super storm Sandy knocked out one of their data centers, their Cassandra data never went offline and their application continued to serve their customers.
When Outbrain’s downed center came back online, they only needed to enter one command to bring its part of the Cassandra cluster back up to date. The RDBMSs in that same data center, however, required much more effort to restore and bring back to life. Why? Because RDBMSs aren’t architected to do such things as easily as a NoSQL engine like Cassandra can.
Other customers have had experiences similar to Outbrain’s. When AWS experienced a large outage last year, Netflix remained online along with all its data and continued to serve customers. When Amazon repaired its outage, the Cassandra nodes in that availability zone automatically updated themselves to be current.
Either/Or or Both/And?
I do think RDBMSs will stick around for quite some time, but I also see NoSQL technologies thriving and having long life because they address application use cases that companies simply can’t tackle via the relational model.
The reason companies like eBay, Adobe, Constant Contact, and hundreds of other Cassandra customers use NoSQL as a primary datastore is not that they’re chasing the latest-and-greatest technology, but that they need it. Their applications, use cases, and data needs have often outgrown the legacy RDBMS model and require a different type of engine.
The good news is, NoSQL databases like Cassandra meet these needs well and do so in a way that ensures their data can actually be safer in many cases than it would be in an RDBMS.