The Five Minute Interview – MetaBroadcast
This article is one in a series of quick-hit interviews with companies using Apache Cassandra and/or DataStax Enterprise for key parts of their business. For this interview, we spoke with Chris Jackson who is CEO of MetaBroadcast.
DataStax: Chris, thanks for the time today. What can you tell us about MetaBroadcast?
Chris: MetaBroadcast processes metadata that comes out of the broadcast industry. We do this in three ways for the B2B customers we serve.
First, we take in all sorts of content metadata, including topic and description information, and act as a leading hub here in the UK for anyone wanting to know what’s happening on radio or TV.
Second, on top of that we run a personalization system, which offers a range of recommendations for our clients to use. And third, we have a series of analytic products that help the industry understand better what’s going on with their audiences.
We’ve been doing this for five years now and have tens of millions of end users making use of what we do.
DataStax: I imagine you’re dealing with quite a bit of data and need to ensure that data is served up pretty fast?
Chris: For us, a key need is handling content changes very quickly. We routinely see about 500 content changes per second, along with a steady stream of user interactions. In terms of data volume, we maintain several TB of online data across the company.
DataStax: What caused you to start using Cassandra?
Chris: For our platform, we started off using MySQL, and then made a move to MongoDB, which we still use for some parts of our systems. But MongoDB works best when the whole dataset is in memory, and our data is a little too large for that. We looked at using a sharded MongoDB arrangement, but that proved to be too expensive for the data that we don’t need to access all the time. And the churn costs of getting data into and out of memory with MongoDB were just too high.
For the next iteration of our platform, we’ve moved to Cassandra in conjunction with Elasticsearch running on top of it to provide a whole range of indexes on the data. This combination allows us to look up a small piece of content very, very quickly and handle big data volumes well too.
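The division of labour Chris describes, a search index resolving queries to IDs and a key-value store serving fast point lookups, can be sketched in miniature. This is an illustrative stand-in only: the dicts below play the roles of a Cassandra table and an Elasticsearch index, and all names and IDs are hypothetical.

```python
# Illustrative sketch: in-memory stand-ins for the Cassandra + Elasticsearch
# pairing described above. content_store plays the role of a Cassandra table
# keyed by content ID; title_index plays the role of an Elasticsearch
# inverted index that resolves a query term to matching IDs.

content_store = {}   # hypothetical stand-in for a Cassandra table
title_index = {}     # hypothetical stand-in for an Elasticsearch index

def ingest(content_id, doc):
    """Write the full document to the store and index its searchable fields."""
    content_store[content_id] = doc
    for word in doc["title"].lower().split():
        title_index.setdefault(word, set()).add(content_id)

def search(word):
    """Resolve the query via the index, then hydrate each hit with a point lookup."""
    ids = title_index.get(word.lower(), set())
    return [content_store[i] for i in sorted(ids)]

ingest("b006q2x0", {"title": "Doctor Who", "channel": "BBC One"})
ingest("b006m86d", {"title": "EastEnders", "channel": "BBC One"})
results = search("doctor")   # index lookup, then one fast read per hit
```

The point of the pattern is that the expensive part (the query) touches only the index, while the store answers nothing but cheap lookups by primary key, which is exactly the access pattern Cassandra is fastest at.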
DataStax: How else does Cassandra compare over MongoDB or other options you looked at?
Chris: We continue to be impressed with all the options Cassandra gives us for moving data between memory and disk, which is a big change from MongoDB, where we had no control whatsoever over what was held in memory versus read from disk.
DataStax: What kind of data are you primarily dealing with?
Chris: It’s very structured at the moment. We’re pretty adept at de-normalizing the data into Cassandra and getting very fast performance as the end result. We’re seeing very good throughput and are very encouraged by the overall performance of the database.
Cassandra gets the very heavy end of what we’re doing with content description data, not just with writes but with reads as well. A lot of our queries end up aggregating a fairly random set of rows into one result. Because of that, we need Cassandra to respond very quickly so that we can rapidly assemble a single response from many reads to the database.
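The read pattern Chris describes, one client request fanning out into many row reads that are merged into a single response, can be sketched as follows. The `rows` dict stands in for a Cassandra table, and all keys and field names are invented for illustration.

```python
# Hypothetical sketch of the aggregation-on-read pattern described above:
# a single request triggers many fast point reads, and the results are
# folded into one response. The rows dict stands in for Cassandra.

rows = {
    "episode:1": {"title": "Pilot", "duration": 45},
    "episode:2": {"title": "Part Two", "duration": 50},
    "episode:3": {"title": "Finale", "duration": 60},
}

def assemble(keys):
    """Fetch each row (in practice these reads run in parallel) and merge."""
    fetched = [rows[k] for k in keys]
    return {
        "episodes": [r["title"] for r in fetched],
        "total_minutes": sum(r["duration"] for r in fetched),
    }

result = assemble(["episode:1", "episode:3"])
```

Because every individual read must be fast for the merged response to come back quickly, low per-read latency matters more here than raw aggregate throughput.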
DataStax: It sounds like read performance is a big deal to you.
Chris: It is. We very much want to move away from dumb caching at the front end. Doing that invariably leads to complex cache invalidation and other complications we want to avoid. Instead, we’re looking to push that load back onto the database as much as possible.
As an example, we have TV schedules that can go all the way back to 1920, along with a lot of other, older archived data that we want to make available in the same way that we do with a TV program that’s changing in the next 5 minutes, or a topic that’s currently being discussed on a program live. We want to provide the same type of latency for requests on all of that data.
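One common way to get that uniform latency in Cassandra is to choose a partition key so that any single schedule page, whether from 1920 or five minutes from now, is one fast partition read. The schema below is a hypothetical sketch, not MetaBroadcast's actual data model; the table and column names are invented.

```sql
-- Hypothetical CQL sketch: one read path for a 1920s schedule entry and for
-- a programme airing in five minutes. Partitioning by (channel, day) makes
-- any single day's schedule a single-partition read, however old the data.
CREATE TABLE schedule_by_day (
    channel    text,
    day        date,
    start_time timestamp,
    content_id text,
    title      text,
    PRIMARY KEY ((channel, day), start_time)
);

-- Same-latency lookup whether the day is today or decades back:
SELECT * FROM schedule_by_day
 WHERE channel = 'bbc-one' AND day = '1920-11-14';
```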
DataStax: How are you running Cassandra today – in your own data centers or in the cloud?
Chris: Today, we run Cassandra exclusively on AWS, although the costs are beginning to cause us to look to more of a private cloud arrangement, which we’re investigating right now.
One of our requirements is that our applications must span multiple data centers. We use a couple of different AWS availability zones right now, but ultimately we need to spread it further than that.
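Spanning data centers is declared per keyspace in Cassandra's replication settings. The fragment below is an illustrative sketch; the keyspace name, data center names, and replication counts are all invented for the example.

```sql
-- Hypothetical sketch: a keyspace replicated across two data centers with
-- NetworkTopologyStrategy. DC names and replica counts are illustrative;
-- on AWS, the EC2 snitches can map availability zones to racks within a DC.
CREATE KEYSPACE metadata
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'eu_west': 3,
    'us_east': 3
  };
```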
DataStax: How do you manage Cassandra today?
Chris: We use Puppet to handle general administration tasks like adding new nodes and such. We really like the fact that it’s so easy to scale out with Cassandra when we need to and that there’s so little effort involved. We also use Nimrod to gather metrics and stats on our whole stack, which helps our tuning work. OpsCenter is a further source of Cassandra-specific metrics like read/write performance.
DataStax: Chris, what advice would you give someone new to Cassandra?
Chris: Make sure you’re using Cassandra for the right things. You don’t come to Cassandra for a small use case; you come to Cassandra when you’re dealing with a more complex scenario, where you need fast performance against a sizable data load.
For us, implementing some of our more challenging application features first in Cassandra and then tuning those has taught us a lot up front. That plan of attack has allowed us to support the large data loads that we’re seeing now.
DataStax: Chris, good info to have; thanks!
Chris: Happy to help.
For more information on MetaBroadcast, visit: http://metabroadcast.com/.