The Five Minute Interview – Next Big Sound
This article is one in a series of quick-hit interviews with companies using Apache Cassandra and/or DataStax Enterprise for key parts of their business. For this interview, we talked with Eric Czech, chief architect at Next Big Sound.
DataStax: Eric, thanks for taking the time to chat today. To start, can you give us a quick idea of what Next Big Sound is all about?
Eric: Sure. The company was started in 2009. What we do is aggregate, manage, analyze, and serve up large amounts of data pertaining to the music industry. Our initial focus was on social media activity centered around music and while that’s still a core component of our business, a lot of our bigger infrastructure problems actually stem from maintaining digital, physical, and streaming activity.
More concretely, we started by tracking how users engage with an artist’s content through YouTube video plays, Facebook fan page likes, Twitter followers, etc., but have expanded that to include actions like iTunes downloads, Spotify streams, Google Analytics web traffic statistics, and the like.
Our primary customers include several major record labels as well as thousands of individual artists, managers, and smaller record labels.
DataStax: So I’m guessing that you guys are dealing with large amounts of inbound data?
Eric: Absolutely. Although our normal data traffic is 30-40 million new transactions per day, we’ve seen spikes where we get 350-400 million new transactions a day.
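As a rough back-of-the-envelope illustration of these figures, the daily volumes can be converted into average per-second rates. This sketch assumes traffic is spread evenly over a 24-hour day, which real workloads rarely are, so true peak rates would be higher. The function name is illustrative only and does not come from the interview.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_per_second(transactions_per_day):
    """Average rate, ASSUMING traffic is spread evenly across the day."""
    return transactions_per_day / SECONDS_PER_DAY


# Daily volumes quoted in the interview (low and high end of each range).
volumes = {
    "normal": (30_000_000, 40_000_000),
    "spike": (350_000_000, 400_000_000),
}

for label, (low, high) in volumes.items():
    low_rate = average_per_second(low)
    high_rate = average_per_second(high)
    print(f"{label}: {low_rate:,.0f} to {high_rate:,.0f} transactions/sec on average")
```

Under that even-spread assumption, normal traffic works out to roughly 350 to 460 new transactions per second, and the spikes to roughly 4,050 to 4,630 per second.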
DataStax: What’s the format of that data and where is it stored?
Eric: The majority of the data exists as time series with a bunch of different qualifiers and dimensions, and all of that is stored inside DataStax Enterprise. We use Cassandra for all the new incoming data and for real-time access and then have a number of nodes in our database cluster devoted to Hadoop and analytics. Beyond the time series data though, we also store smaller data sets in MySQL or MongoDB.
DataStax: Did you start out using Cassandra or something else?
Eric: At first, we tried to use MySQL for everything, but like everyone else, we quickly outgrew it. We still use MySQL for metadata management involving entities like users, accounts, and preferences, but early on we recognized that we needed something that could scale and handle our larger data sets. So we began investigating NoSQL solutions, and Cassandra ended up as the winner.
DataStax: What caused you to choose Cassandra and DataStax Enterprise?
Eric: We looked at Redis, Tokyo Cabinet, MongoDB, Riak, and Cassandra. We knew that our write-heavy workload would make sequential writes on disk a top priority and Cassandra quickly became the frontrunner as its SSTable architecture allows for exactly that.
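To make the sequential-write point concrete, here is a toy sketch of the general log-structured idea: writes land in an in-memory table, which is periodically flushed in one sequential pass to a sorted, immutable file. This is a simplified illustration, not Cassandra's implementation; it omits the commit log, compaction, indexes, and Bloom filters, and the class name, file naming, and tiny flush threshold are all hypothetical.

```python
import os
import tempfile


class ToyLogStructuredStore:
    """Toy memtable + immutable sorted segment files (NOT Cassandra's code)."""

    def __init__(self, directory, flush_threshold=3):
        self.directory = directory
        self.flush_threshold = flush_threshold  # hypothetical, tiny for the demo
        self.memtable = {}   # recent writes, held in memory
        self.segments = []   # paths of flushed files, oldest first

    def put(self, key, value):
        # A write only touches memory; disk I/O happens in bulk at flush time.
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.memtable:
            return
        name = f"segment-{len(self.segments):04d}.txt"
        path = os.path.join(self.directory, name)
        # One sequential pass writes all keys in sorted order; the file is
        # never modified afterwards.
        with open(path, "w", encoding="utf-8") as f:
            for key in sorted(self.memtable):
                f.write(f"{key}\t{self.memtable[key]}\n")
        self.segments.append(path)
        self.memtable = {}

    def get(self, key):
        # Newest data wins: check memory first, then segments newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for path in reversed(self.segments):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    k, v = line.rstrip("\n").split("\t", 1)
                    if k == key:
                        return v
        return None


with tempfile.TemporaryDirectory() as tmp:
    store = ToyLogStructuredStore(tmp)
    for i in range(7):
        store.put(f"artist:{i}", f"plays={i * 10}")
    store.put("artist:0", "plays=999")  # newer value shadows the flushed one
    print(len(store.segments), store.get("artist:0"), store.get("artist:5"))
```

The trade-off in this kind of design is that reads may need to consult several files, which is why real systems add indexes and merge old files in the background.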
We also felt that Cassandra was designed to avoid a lot of the operational headaches we knew would result from scaling horizontally in a hurry. Beyond that, we knew the thriving community would support us anywhere the design was lacking.
In practice, all of these things have been true: switching to Cassandra led to an instant 10x increase in write throughput, quadrupling our infrastructure size has been easy, and the community has been invaluable in tracking down important bottlenecks and keeping our data model as efficient as possible.
DataStax: What does your current configuration look like?
Eric: We have a 30-node DataStax Enterprise cluster that has 22 Cassandra nodes and 8 Hadoop nodes. From a data volume standpoint, we have about 6 TB spread across the nodes, but that’s using Cassandra’s compression. Our experience with that has been about a 2-to-1 compression ratio, so the raw data size is likely double that.
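The arithmetic behind that estimate can be sketched as follows. The per-node figure assumes the data is spread evenly across all 30 nodes, which the interview does not state, and it uses decimal units (1 TB = 1,000 GB); the function names are illustrative only.

```python
# Figures quoted in the interview.
COMPRESSED_TB = 6.0        # approximate on-disk size, after compression
COMPRESSION_RATIO = 2.0    # roughly 2 to 1
TOTAL_NODES = 22 + 8       # Cassandra nodes + Hadoop nodes


def estimate_raw_tb(compressed_tb, ratio):
    """Uncompressed size implied by an on-disk size and a compression ratio."""
    return compressed_tb * ratio


def per_node_gb(total_tb, nodes):
    """Average per-node share in GB, ASSUMING an even spread across nodes."""
    return total_tb * 1000 / nodes


raw_tb = estimate_raw_tb(COMPRESSED_TB, COMPRESSION_RATIO)
print(f"Estimated raw data size: {raw_tb:.0f} TB")
print(f"Average on disk per node: {per_node_gb(COMPRESSED_TB, TOTAL_NODES):.0f} GB")
```

This gives an estimated raw size of about 12 TB, or about 200 GB of compressed data per node under the even-spread assumption.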
DataStax: How do you use Hadoop inside of DataStax Enterprise?
Eric: We’ve been using the Hadoop nodes fairly heavily. We run a lot of our charting, searching, indexing, and maintenance processes through Hadoop and it has been an incredible help in gaining greater insight into our data as well as keeping it properly maintained.
DataStax: How do you manage your cluster?
Eric: We started with Cassandra pretty early before OpsCenter was available, and managing a cluster was a little more difficult. But now we use OpsCenter a lot and we’ve found it to be a great tool, especially for statistical/performance analysis. It helps us understand everything from read/write latency to overall network throughput and more.
DataStax: What advice would you give people who are looking to use DataStax Enterprise and Cassandra?
Eric: A development environment that closely mirrors your production environment is key. With it, you can practice various operations, upgrades, data movements, balancing, etc. It’s also important that the dev cluster match your production cluster in node count and data volume; otherwise, what you practice won’t reflect what happens in production.
DataStax: Eric, thanks for the time.
For more information on Next Big Sound, visit: http://nextbigsound.com/