Facebook’s Instagram: Making the Switch to Cassandra from Redis, a 75% ‘Insta’ Savings
Rick Branson, Infrastructure Software Engineer at Instagram (Follow @rbranson)
"Implementing Cassandra cut our costs to the point where we were paying around a quarter of what we were paying before. Not only that, but it also freed us to just throw data at the cluster because it was much more scalable and we could add nodes whenever needed."
Instagram is a free photo sharing app to take photos, apply filters, and share them on social networks such as Facebook, Twitter, and the like. It allows its over 200 million users to capture and customize their photos and videos to share with the world.
CUTTING COSTS WITH CASSANDRA
Initially our deployment was for storing auditing information related to security and site integrity purposes. To break down that concept, it means fighting spam, finding abusive users, and other things like that. It was really a sweet spot for the Cassandra offering.
Originally, these features were conducted in Redis; the data size was growing too rapidly, and keeping it in memory was not a productive way to go. It was a really high write rate and really low read rate, a spot where Cassandra really pops and shines so the switch ended up being a no-brainer for us to adopt Cassandra in that area. We started out with a 3 node cluster and that use case has grown to a 12 node cluster. That was our path for our main application backend stuff.
FRAUD DETECTION, NEWSFEED, & INBOX
For the first use case mentioned above for our backend, we moved off of a Redis master/slave replication setup; it was just too costly to have that. We moved from having everything in memory, with very large instances, to just putting everything on disks; when you really don’t need to read that often, it works fine having it on disks. Implementing Cassandra cut our costs to the point where we were paying around a quarter of what we were paying before. Not only that, but it also freed us to just throw data at the cluster because it was much more scalable and we could add nodes whenever needed. When you’re going from an un-sharded setup to a sharded setup, it can be a pain; you basically get that for free with Cassandra, where you don’t have to go through the painful process of sharding your data.
Recently, we decided to port another use case that is much more critical. We spent time getting everyone on the team up-to-date with Cassandra, reading documentation, learning how to operate it effectively. We chose to use Cassandra for what we call the “inboxes” or the newsfeed part of our app. Basically, it’s a feed of all the activity that would be associated with a given user’s account; you can see if people like your photos, follow you, if your friends have joined Instagram, received comments, etc. The reason we decided to move that to Cassandra was that it was previously in Redis and we were experiencing the same memory limitations.
For this “inbox” use case, the feed was already sharded; it was a 32 node cluster with 16 masters and 16 replicas that were fail-over replicas and, of course, we had to go through all the sharing of things. We noticed that we were running out of space on these machines and they weren’t really consuming a lot of CPU (Redis can be incredibly efficient with CPU) but obviously when you run out of memory… you run out of memory.
It just ended up being more cost effective and easy to operate a Cassandra cluster for this use case, where you don’t need the kind of in-memory level performance. Durability was a big factor as well that Redis didn’t provide effectively; I touched on that in my Cassandra Summit 2013 presentation.
DEPLOYMENT AT INSTAGRAM
We’ve had a really good experience with the reliability and availability of Cassandra. It’s a much different work load: we’re running on SSDs with Cassandra version 1.2 and we’re able to get that latest version there with all of the nice bells and whistles including Vnodes, Leveled Compaction, etc. It was a very successful project and it only took us a few days to convert everything over.
Some details on our cluster: It’s a 12 node cluster of EC2 hi1.4xlarge instances; we store around 1.2TB of data across this cluster. At peak, we’re doing around 20,000 writes per second to that specific cluster and around 15,000 reads per second. We’ve been really impressed with how well Cassandra has been able to drop into that role. We also ended up reducing our footprint, so that’s been a really good experience for us. We learned a lot from that first implementation and we were able to apply that knowledge to our most recent implementation. Every time someone pulls up their Instagram now, they’re hitting that 12 node Cassandra cluster to pull their data from; it’s really exciting.
DIG INTO THE DOCS
I would recommend digging into the system and reading all of the Cassandra documentation, especially the stuff on the DataStax website. The best part is documentation that I’ve noticed is it has a lot of extra information about the internals; really understanding that is important. Any database or datastore you use, you’re really going to need to dig into the documentation in order to properly use it the way it’s intended. People often run into situations where they get themselves cornered by adopting a solution too quickly or incorrectly and not doing their homework. Specifically, with a datastore, it’s really important that it’s the most stable and reliable part of your stack.
And people can always find me on the IRC channel trolling away.