The Five Minute Interview – Shift
Shift Migrates from MongoDB to DataStax Enterprise to Power High Velocity Marketing Platform
This article is one in a series of quick-hit interviews with companies using Apache Cassandra and DataStax Enterprise for key parts of their business. For this interview, we spoke with Jon Haddad, Senior Architect, and Blake Egglestone, Senior Developer, at Shift.
“MongoDB versus Cassandra; for us Cassandra had a number of advantages operationally. It’s so much more sane to deal with.”
DataStax: Guys, what does Shift do for its customers?
Blake: Shift is a platform that is tailored for marketers. It enables them to communicate across organizations and departments in a single place. In addition to messaging, SHIFT is also an open application platform. It has a set of applications built on top of it that (will eventually be able to) communicate with each other. We have our own apps on the platform, but we’re also encouraging third-party developers to develop for it andwe’re building a set of APIs that allow people to work more effectively.
Jon: In addition to Shift we also have our Media Manager app that runs on Shift, and we do a lot of collection of statistics based on Facebook and Twitter ads for people managing their ad campaigns. We’re probably fetching a billion ad stats a day at this point and we work with a lot of Fortune 500 companies and manage their ads for them through this tool. We fluctuate between 50,000 to 100,000 queries a second against Cassandra in order to keep these stats as accurate as possible. Essentially we’re shooting for an up-to-the-minute overview of how your Facebook or Twitter ad campaigns are doing.
DataStax: That is definitely a high-velocity data scenario. Was that the primary reason for migrating to Cassandra from DataStax Enterprise or did other factors play into that decision?
Jon: We actually evaluated a lot of different databases. Initially we were using MongoDB and Neo4J and we hit Neo4J’s limit almost immediately. It was completely impractical for anything – we were trying to do everything in real-time. We introduced Cassandra initially to work with the Titan graph database, but since using it we actually found that when we controlled our table structure with Cassandra we would get absolutely amazing performance. We have now phased out all of the graph database aspects and all of MongoDB and so Cassandra is our only data store at this point.
DataStax: So you needed a graph database though to store all these different points that you’re collecting?
Blake: It wasn’t so useful for the ad stats, it was more for the Shift side of things where there’s a lot of messages and users and teams and there’s a lot of interaction between people. So we were using the graph to construct each user’s view of the data, but ran into some performance problems. Ultimately, it came down to the fact that if we managed the data in Cassandra ourselves we’d avoid a lot of the overhead that the graph database introduced. We went from certain API calls taking four to five seconds down to 400 or 300 milliseconds or less in some cases. We started with a tenfold improvement by just going with straight Cassandra.
DataStax: Now you have migrated from MongoDB to Cassandra – what advantages does Cassandra have over MongoDB?
Jon: MongoDB versus Cassandra; for us Cassandra had a number of advantages operationally. It’s so much more sane to deal with. You don’t to worry about replica sets, config servers and Mongo’s processes and its weird sharding infrastructure. Cassandra’s hash ring made it really easy for me to deploy our clusters and manage our data on the Shift side of the business by myself. Then on the media manager side there’s only one guy. We don’t worry about it and it’s so easy to manage that it’s kind of a no-brainer for us. Our concerns were what happens if machines go down or if we need to add capacity to a cluster and it’s extremely easy with Cassandra.
DataStax: You mentioned you looked at a bunch of different data stores and you originally settled on MongoDB and 4J. What else did you look at? Why did they not make the cut?
Jon: On the relational side we considered using something like Postgres or MySQL at one point but the problem that we have with that is they’re all based on this master slave relationship. The operational management is more work. You have to worry about manually handling things like failover and it becomes a bit of a problem. We’d rather just structure our data in a way that lets the queries run efficiently and let Cassandra do the work of adding nodes and lets us just add capacity to the cluster. We looked at HBase but it has the same problem as Mongo to be honest with you. There was just a lot of moving parts and the community seems a little fragmented and their master branch isn’t even stable right now. I just didn’t have enough confidence in the project. In the end Cassandra just worked for us.
DataStax: How did you come up with the data model for Cassandra? What does an actual MongoDB migration to Cassandra look like?
Blake: Well, we tackled each component on a case-by-case basis depending on how it was being used, and we migrated one component at a time. Some of the more complex components took a lot of thought about the best way to structure our data.
As far as how the migration looked, we used a two-branch migration strategy that allowed us to move all of our data from Mongo to Cassandra without any downtime. In our first branch, we created a Cassandra table and mirrored our writes. Every time a MongoDB document was saved it was also saved in the Cassandra database. Once this branch was deployed to production, we’d run a migration script to catch up all the old data. Then we deployed the second half of the migration which switched all the reads from Mongo to Cassandra, that saved us a lot of time.
Our goal was to have zero downtime and that was really the only way that we could think of where we’d be able to reliably deploy new code and work with two different databases simultaneously. By deploying the writes first, it allowed us to spot check our data. We actually developed both branches simultaneously so we would deploy the writes, run the migration, catch up with the data, and an hour later or two hours or the next day deploy the second half.
Jon: One thing that made it significantly easier for us was CQL3 had just come out. We weren’t really interested in having to write something ourselves on top of Pycassa, the Python driver. CQL was nice because we could create a schema that made it a lot of easier for us to look like our MongoDB documents. It was extremely convenient.
DataStax: Is it fair to say that without CQL3 this isn’t something you would have taken on?
Jon: It would have been a lot harder. It would have been much trickier. We actually did a lot of work on this. We built CQL engine which is now as far as I know it’s the only CQL object mapper for Python. That’s opensource in Github https://github.com/cqlengine/cqlengine.
Blake: We actually talked with Tyler Hobbs about it. We’re working with him on cqlengine, which is pretty exciting. We’re shooting to get 100% compatibility with all the features of CQL3.
DataStax: Can you talk a little bit about your environment?
Jon: We’re running in AWS. We’re running on the same instances as we were prior with Mongo, but we’ll move to SSDs. Performance-wise individual reads are going to be a little bit slower and that partially has to do with the fact that we’re doing quorum reads whereas Mongo reads the slave server and you get whatever it has on it which in the past is actually been the cause of several weird bugs that are impossible to track down because your data is totally wrong. So with Cassandra we’re willing to make that slight concession as you know the data is right and where it is coming from.
Blake: The slower reads for Cassandra is a bit of a misperception though, because although Cassandra might be slower to read one document, the way that we have our data structured it’s significantly faster to read the entire lists that we need.
DataStax: You have the data you stored as a document in MongoDB in a single-column family in Cassandra and it just reads straight down the column?
Jon: Exactly. We make heavy use of clustering keys in order to have wide rows. For instance every person has their own view of the messages that they can look at and we use TimeUUID1’s to manage their message stream. Getting that data is insanely fast. Whereas if you were to try and do that with Mongo you would have to do this weird index look-up every time you try to do it and overtime it would just grind to a halt. It’s just that MongoDB could not work for this type of data. Fortunately Cassandra gives us that flexibility where we know how our data is going to be laid out on disk and in memory. We have control over the compaction type that lets us. We know that with LeveledCompaction our data that are going to be split across only two SSTables. It give us really fast performance and we know that if we have 10 times the traffic we could just add another server to it and everything will be fine. It’s great in that regard for us.
It took us about two months to get everything into Cassandra and it was absolutely worth it. Our unit tests went from over 20 minutes down to one. We actually execute our whole test more often now. It’s just easier to develop on.
DataStax: Now you get to get into the tuning and optimization part of it. That will be cool actually to follow your progress in six months time, revisit and see what tips you’ve learned along the way to actually optimize it.
Blake: Yeah, definitely. We talked with Al Tobey at the Bay Area Cassandra summit. That guy is amazing. In 30 minutes he taught us so much, and we are excited to put some of those things into practice.
For more information on Shift, see: www.shift.com.