Choosing the right architecture for big data scale
Following up from my last post, let’s now take a look at the part of this paragraph from our recent press release that deals with the foundation of Cassandra: its architecture. The paragraph reads:
Customers this year chose Cassandra time and time again over competing solutions. The peer-to-peer design allows for high performance with linear scalability and no single points of failure, even across multiple data centers. Combine this with native optimization for the cloud and an extremely robust data model and Cassandra clearly stands apart from the competition for enterprise, mission-critical systems. [emphasis added]
When dealing with new technologies, one of the easiest things to overlook is the architecture. Talking about architecture is not sexy or glamorous, but it is the absolute foundation of everything you will do for years to come. Make a mistake up front with most systems, and unwinding it later can be difficult. Make a mistake with your big data architecture, and unwinding it later can be downright ugly, if not impossible.
Today’s big data architectures come in primarily two flavors: one where a single machine coordinates all activities for the other machines in the cluster (aka master/slave); and one where all machines in the cluster are equal in type and function (aka peer-to-peer, or what others may call “fully distributed”). Cassandra is built on the latter: a fully distributed peer-to-peer architecture based on something called Amazon Dynamo. (For those who want to geek out on the details of Amazon Dynamo, you can read this paper.)
The decision between the two is vitally important. In master/slave architectures you have, by definition, a single point of failure in your master coordination node, and you have introduced some complexity into scaling. There are techniques and tricks to try to mitigate this issue, but at the end of the day there is simply no free lunch, and dealing with it at some level is inescapable. We will touch more on this in my next post.
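To make the peer-to-peer idea a bit more concrete, here is a minimal sketch of Dynamo-style consistent hashing in Python. It is an illustration only, not Cassandra's actual code: the `Ring` class, the `owners` and `without` methods, the node names, and the use of MD5 as the hash are all assumptions made for this example, and real clusters add details such as multiple tokens per node and data-center-aware replica placement. The point it shows is that every node can compute where a key lives from the same ring, so no coordinating master is needed, and losing one node still leaves other replicas holding the key.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is only a stable stand-in hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical, simplified hash ring: one token per node, no virtual nodes."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        # Sort nodes by their ring position so lookups can walk clockwise.
        self._ring = sorted((token(n), n) for n in nodes)
        self._positions = [t for t, _ in self._ring]

    def owners(self, key):
        """Return the nodes holding `key`: the first node at or after the
        key's position, followed by the next nodes clockwise."""
        count = min(self.replicas, len(self._ring))
        start = bisect.bisect_left(self._positions, token(key))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]

    def without(self, node):
        """Return a new ring as it would look if `node` went down."""
        return Ring([n for _, n in self._ring if n != node], self.replicas)


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replicas=3)
before = ring.owners("user:42")

# Take the first owner down; the remaining replicas still own the key.
after = ring.without(before[0]).owners("user:42")
print("owners before failure:", before)
print("owners after failure: ", after)
```

Any node (or client) that knows the cluster membership can run the same `owners` calculation, which is why there is no single machine whose loss stops the cluster from locating data.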
Another benefit of the architecture comes in terms of performance. The Cassandra developers are absolutely fanatical about performance, and it shows. But what still really stuns me about Cassandra is not just its amazing performance, but that the performance scales linearly. Think of the advantages this provides to the operations and capacity planning teams. You don’t have to worry about what node types to add at what point along the way; you just keep adding nodes to the cluster, and scale keeps going up exactly, mathematically, how you would expect it to.
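As a rough illustration of what linear scaling means for capacity planning, the arithmetic reduces to a single division. The sketch below assumes idealized linear scaling, and the function name `nodes_needed` and the throughput figures are hypothetical numbers chosen for the example, not measurements.

```python
def nodes_needed(target_ops_per_sec: int, ops_per_node_per_sec: int) -> int:
    """Nodes required for a target throughput, assuming each added node
    contributes the same throughput (idealized linear scaling)."""
    if ops_per_node_per_sec <= 0:
        raise ValueError("per-node throughput must be positive")
    # Integer ceiling division: round up to a whole number of nodes.
    return (target_ops_per_sec + ops_per_node_per_sec - 1) // ops_per_node_per_sec


# Hypothetical figures: each node sustains 10,000 operations per second.
print(nodes_needed(250_000, 10_000))  # → 25
print(nodes_needed(255_000, 10_000))  # → 26
```

In practice you would measure per-node throughput on your own workload and hardware, then leave headroom on top of the result.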
An even more incredible aspect of this linear scale is that it is not limited to on-premises solutions. This graph shows how one of our customers achieved perfectly linear scalability entirely in the cloud.
When it comes to big data, choosing the right backend system is absolutely critical to the long-term success of your application and it’s worth a little up-front time investigating it.