
Disaster Avoidance, Not Disaster Recovery

By Shiyi Gu | August 13, 2014

According to a recent survey from the 2014 Uptime Symposium, 46% of companies running their own data centers had at least one “business-impacting” data center outage over the previous 12 months, and 7% had five or more. It’s understandable, then, that availability has become the number-one criterion companies use when choosing a data center provider.

Great Disaster Recovery Isn’t Cool; Disaster Avoidance is Cool

For mission-critical applications, any downtime can kill your business, even with an elaborate data backup and recovery plan. Direct revenue impact, productivity loss, dissatisfied customers, and damage to the brand are just a few of the ways it hurts. According to a Compuware survey, the average time to get back to “normal operations” after a major failure was 21 days; almost 50% of the affected companies lost revenue, and 46% said the failures caused them to violate SLAs. An average recovery time of 21 days. Are your customers that patient?

Veterans of relational database technology know that good disaster recovery is difficult to implement, because relational databases rely on a master-slave architecture: every time data is written to the system, that write has to be routed through the master server. So if you try to store data in a second data center (DC2), all writes have to go through DC1, where the master server is, and then be replicated to DC2. That adds latency to every write. Even in a plain disaster recovery scenario, the application has to replicate data in real time to a second data center (not that hard) and then provide a way to fail over to that data center in the event of a disaster (really, really hard). Serving data from multiple data centers in a master-slave architecture is challenging at best.
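To make the routing cost concrete, here is a minimal sketch comparing the two architectures. The latency figures are hypothetical round-trip times, not measurements from the article; the point is only that a master-slave write from the "wrong" data center always pays the cross-data-center trip, while a peer-to-peer write does not.

```python
# Illustrative sketch: cost of routing every write through a remote
# master versus accepting it locally. Latencies are assumed round-trip
# times in milliseconds, chosen purely for illustration.

LOCAL_RTT_MS = 1      # client to a server in the same data center
CROSS_DC_RTT_MS = 80  # DC2 to DC1 over a WAN link (assumed)

def master_slave_write_latency(client_dc: str, master_dc: str) -> int:
    """Every write must reach the single master, wherever it lives."""
    if client_dc == master_dc:
        return LOCAL_RTT_MS
    # A write originating in DC2 is first forwarded to the master in DC1.
    return CROSS_DC_RTT_MS + LOCAL_RTT_MS

def peer_to_peer_write_latency(client_dc: str) -> int:
    """Any node can coordinate the write, so the local DC suffices."""
    return LOCAL_RTT_MS

if __name__ == "__main__":
    print(master_slave_write_latency("DC2", "DC1"))  # cross-DC penalty
    print(peer_to_peer_write_latency("DC2"))         # local only
```

Under these assumptions, a client in DC2 pays 81 ms per write in the master-slave case and 1 ms in the peer-to-peer case; real numbers will differ, but the asymmetry is the point.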

The ideal is to implement a solid disaster avoidance solution. Contrary to popular belief, disaster avoidance does not live only in a perfect world filled with unicorns and pixie dust. Apache Cassandra™ has a peer-to-peer architecture, which means that no server is any more important than any other server. Data can be spread across multiple servers for redundancy: when a server fails, all data remains available. Since reads and writes can happen on any server instead of flowing through a master server, Cassandra can run in a multi-data center deployment with all data centers actively serving data.
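As a sketch of what this looks like in practice, a Cassandra keyspace can be told to keep replicas in each data center via `NetworkTopologyStrategy`. The keyspace name and the data center names below (`DC1`, `DC2`) are placeholders; the data center names must match what your cluster's snitch reports.

```sql
-- Keep three replicas of every row in each of two data centers.
-- 'DC1' and 'DC2' must match the data center names your snitch uses.
CREATE KEYSPACE store
  WITH REPLICATION = {
    'class': 'NetworkTopologyStrategy',
    'DC1': 3,
    'DC2': 3
  };
```

Clients can then write at the `LOCAL_QUORUM` consistency level, so each write is acknowledged by a quorum of replicas in the local data center while replication to the other data center proceeds in the background.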

Why Multi-Data Center Replication Matters

For those of you who think such disasters will never happen to you, think again. They happen far more often than people realize. Our co-founder tells a story from his time at Rackspace, when a truck drove off the highway and landed (yes, landed, as in flew and then stopped flying) IN a data center. Hurricane Sandy took down scores of East Coast data centers. The tsunami that hit Fukushima took essentially anything hosted in Japan offline. A lightning storm caused power outages that took down all of Amazon's US East region. Being able not simply to recover when disaster strikes but to thrive while in the eye of the storm is critical in today's business environment.

Now, you may be wondering how hard it is to achieve an active-active multi-data center deployment in Cassandra, and whether anyone is actually doing it. To list just a few large companies using it: Netflix, Bazaarvoice, Barracuda Networks, Comcast, Constant Contact, eBay, GoDaddy, Hulu, Odnoklassniki, and Spotify. And to quell any questions about needing a massive team to do multi-data center replication, here are some up-and-coming companies using it in Cassandra: AdGear, Cloze, DataDog, Embedly, Full Contact, Gnip, Hailo, Healthline Networks, Iovation, Keen IO, Mass Relevance, Metabroadcast, Mollom, NewsWhip, OnSip, PROS, ReachLocal, Retailigence, RightScale, Scandit, SessionM, Skillpages, Software Projects, Stormpath, Taboola, Tendril, Tutao, VigLink, Wize Commerce, and Zonar Systems.

As a case in point, Outbrain serves online content for some of the world’s largest media brands, including Reuters, the Wall Street Journal, and USA Today, and pushes out 90 billion recommendations on more than 10 billion page views per month. Outbrain relies on Cassandra as its massively scalable data store, handling high data velocities with 58,000 links to content per second. During Hurricane Sandy, Outbrain completely lost an entire data center, but their data in Cassandra never went offline. There was a blip that lasted less than a second. When that data center came back online, they easily brought the nodes back into the cluster and caught up on the missed updates. For their relational databases, they had to ship drives around the country and the world to have the data recovered.

New to multi-data center? There are plenty of resources on implementing it in Cassandra to help you through. Download our white paper: “An Introduction to Multi-Data Center Replication with Apache Cassandra, Hadoop and Solr.”


