The Five Minute Interview – Nexgate
November 25, 2013
Nexgate Helps Large Businesses Secure Social Data Using DataStax Enterprise
This article is one in a series of quick-hit interviews with companies using Apache Cassandra and DataStax Enterprise for key parts of their business. For this interview, we spoke with Rich Sutton, Chief Technology Officer at Nexgate.
“We evaluated DataStax vs. Riak, but it wasn’t even an interesting comparison to talk about. DataStax was faster, easier to set up and much more reliable.”
DataStax: Rich, what does Nexgate do for its customers?
Rich: Nexgate is a security and compliance suite for social. We help enterprises solve their issues around their social infrastructure. In 2013, enterprises have adopted social in a big way. They’ve got YouTube channels, Twitter accounts, and Facebook pages but they’ve created these in a very ad hoc fashion; they don’t have controls around what kind of content can get posted, what apps are being used, or who the administrators are. They can’t, in most cases, even find all the pages and accounts that represent them on the social web. We’ve built a cloud-based application that solves many of these problems for them.
DataStax: What is your use case for Cassandra and DataStax Enterprise?
Rich: We use Cassandra for harvesting huge swarms of data out of the social web. We then classify and action it based on policies that our customers configure.
As a concrete example, when we work with a bank, the bank wants to make sure that employees who are working on the social web aren’t posting things that violate FINRA regulations. We, in as real time as possible, read their posts and comments and classify them with natural language processing and machine learning. We then action those items based on policy.
We need to store a large amount of content and metadata that comes out of the social web. That fits the Cassandra use case really well because it’s column data. The metadata that comes out of Facebook is slightly different from the metadata that comes out of Twitter or LinkedIn or YouTube, so we need this really huge table of social content that is 50% alike across all these different platforms. We want the ability to quickly add new columns and to write code that operates on those new columns in a performant manner. Cassandra lets us basically build an endlessly scalable store for all this social data.
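To illustrate the wide-table idea described above, here is a minimal Python sketch of mapping platform-specific payloads onto a shared set of columns while keeping everything else as extra, platform-prefixed columns. The column names, field maps, and sample payloads are hypothetical; they are not taken from Nexgate's schema or from any platform's API, and plain dicts stand in for Cassandra rows.

```python
# Hypothetical shared columns; the interview does not list the real ones.
SHARED_COLUMNS = ("item_id", "platform", "author_id", "posted_at", "text")

def to_row(platform, raw, field_map):
    """Map a platform payload onto shared columns; keep the rest as extra columns."""
    row = {"platform": platform}
    mapped_sources = set()
    for column, source_key in field_map.items():
        if source_key in raw:
            row[column] = raw[source_key]
            mapped_sources.add(source_key)
    # Anything not mapped becomes a platform-prefixed column, so a new field
    # from a platform needs no schema change in this sketch.
    for key, value in raw.items():
        if key not in mapped_sources:
            row[f"{platform}_{key}"] = value
    return row

# Hypothetical field maps (shared column -> key in the platform payload).
TWITTER_MAP = {"item_id": "id", "author_id": "user_id",
               "posted_at": "created_at", "text": "text"}
FACEBOOK_MAP = {"item_id": "post_id", "author_id": "from_id",
                "posted_at": "created_time", "text": "message"}

tweet = to_row("twitter",
               {"id": "t1", "user_id": "u9", "created_at": "2013-11-25T10:00:00Z",
                "text": "hello", "retweet_count": 3},
               TWITTER_MAP)
post = to_row("facebook",
              {"post_id": "f1", "from_id": "u7", "created_time": "2013-11-25T11:00:00Z",
               "message": "hi", "likes": 12, "privacy": "public"},
              FACEBOOK_MAP)

# Columns present in both rows are the shared core; the rest are sparse.
shared = set(tweet) & set(post)
```

Code that only cares about the shared core can read `row["text"]` regardless of platform, while platform-specific logic can look for columns such as `twitter_retweet_count` and simply skip rows that lack them.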
DataStax: What does your data model look like?
Rich: The data model is basically two huge tables. We’ve got a table called “items”, which has a row for every single post, tweet, comment, or share that we’ve ever scanned. It includes all the content and all the metadata around that, including the identity of the author and the application and all kinds of other information.
Then we’ve got a second table that holds all of our classifications and categorizations. Again, it’s very simple because we’re not using many of the sophisticated features around column families. The advantage for us is that this thing can get as big as we ever need it to, and we can continue to scale Cassandra horizontally by adding nodes.
One thing I should also say is that we have requirements around summarization of this data. We use MySQL, actually, as a front end with a small number of fixed fields that represent things like an ID, the date and time that the content was posted, and IDs for the author. Information that we need to summarize and report across is kept in MySQL, which acts almost like an index into the huge set of data stored in Cassandra.
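The "relational index in front of a wide store" pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for MySQL, a plain dict stands in for the Cassandra "items" table, and the table name `item_index`, the column names, and the helper functions are all hypothetical.

```python
import sqlite3

# Stand-ins: SQLite plays the role of MySQL, a dict plays the role of Cassandra.
index_db = sqlite3.connect(":memory:")
index_db.execute(
    "CREATE TABLE item_index (item_id TEXT PRIMARY KEY, author_id TEXT, posted_at TEXT)"
)
wide_store = {}

def save_item(item_id, author_id, posted_at, content, metadata):
    """Write the full item to the wide store and a few fixed fields to the index."""
    wide_store[item_id] = {"content": content, **metadata}
    index_db.execute(
        "INSERT INTO item_index VALUES (?, ?, ?)", (item_id, author_id, posted_at)
    )

def posts_per_author():
    """Summary report answered entirely from the small relational index."""
    rows = index_db.execute(
        "SELECT author_id, COUNT(*) FROM item_index GROUP BY author_id ORDER BY author_id"
    )
    return dict(rows.fetchall())

def items_by_author(author_id):
    """Use the index to find IDs, then fetch the full rows from the wide store by key."""
    ids = index_db.execute(
        "SELECT item_id FROM item_index WHERE author_id = ? ORDER BY posted_at",
        (author_id,),
    )
    return [wide_store[item_id] for (item_id,) in ids]

save_item("a1", "alice", "2013-11-01", "first post", {"platform": "twitter"})
save_item("a2", "alice", "2013-11-02", "second post", {"platform": "facebook", "likes": 4})
save_item("b1", "bob", "2013-11-03", "hello", {"platform": "twitter"})
```

Aggregations such as `posts_per_author()` never touch the large store, while `items_by_author("alice")` uses the index only to find keys and then reads the full content by ID, which is the access pattern a key-based store handles well.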
DataStax: What was your original motivation for looking at Cassandra and what other technologies was it evaluated against?
Rich: Like many other startups do in the very beginning, we had a MySQL database and we just started to throw data into that thing. We quickly recognized that the whole model of a single instance wasn’t going to solve the problem for us in the long term because we wanted to keep all this data; we didn’t want to make hard decisions around throwing it out.
The accuracy of any data classification is predicated on the scale of the corpus that you can test it against. We knew we wanted to save this data forever and we quickly saw the limitations of a traditional relational database model. We started to look at NoSQL solutions because they clearly fit our use case.
We looked a little bit at MongoDB but it was very important to me, operationally, to have a solution that was multi-master, where every node in the cluster was itself a master. I shouldn’t have to think about [or manage] the details of which node is a master and which is a slave and all the details underneath that in terms of replication. I wanted seamless replication based on how I define the data model, where every node in the system was a master.
It simplified the solution for me operationally and from a software development standpoint as well. I ended up looking at Cassandra and Riak; those were the two that we put head to head, because our final requirement was integration with Solr, which both technologies possessed. When we ran our benchmarks, Cassandra won hands down in terms of reliability, ease of use, and the speed with which you could scale horizontally. It just won technologically.
DataStax: Why did you decide to use DataStax Enterprise instead of just open source Cassandra?
Rich: That’s an easy question to answer. We are a security company that secures enterprise use of the social web. The success and quality of any security company’s classification technology is predicated on having a big corpus of data that you can mine for interesting attributes and test your application against. It’s all about a corpus, and the storage management of a corpus, and having the tools you need to explore that corpus. Social data has a very unique use case around NoSQL databases, so we started to look for a NoSQL platform that was integrated with tools that are easy for us to set up and then quickly leverage – in particular, Solr and then, as the second requirement, Hadoop, which we use to run complex data analysis inside of Cassandra. So we really fit directly into the DataStax use case.
We evaluated DataStax vs. Riak, but it wasn’t even an interesting comparison to talk about. DataStax was faster, easier to set up, much more reliable, and so we use it in our products as the back end for our social corpus and use tools on top of it for security research.
DataStax: Technically speaking, what does your infrastructure look like?
Rich: We’re in AWS and running on Ubuntu. We try to leverage the strengths of Cassandra’s distributed architecture in the sense that any time you’re running in AWS, you’re always thinking about the cost of the instances. We’re a small company operating on VC money that we have to make last for a while. We’re not a big company that can throw millions of dollars at this problem.
The number of instances, the number of cores and the amount of money that costs us is very important inside of AWS. We run Cassandra on a set of 4 nodes across 2 AWS regions. We use the EC2 multi-region snitch to replicate data across regions, but those instances are actually quite small; they’re AWS large instances. Again, another thing that I love about Cassandra is the ability to run lots of small nodes across multiple regions and my costs are not wrapped up in the number of boxes running Cassandra.
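For readers unfamiliar with multi-region replication in Cassandra, the following Python sketch builds the kind of CQL keyspace definition such a cluster might use. Cassandra's EC2 multi-region snitch treats each AWS region as a datacenter, and `NetworkTopologyStrategy` takes a replica count per datacenter. The interview does not say which regions or replication factors Nexgate uses, so the keyspace name `social`, the datacenter names `us-east` and `us-west`, and the replica counts here are assumptions for illustration only.

```python
def keyspace_cql(name, replicas_per_dc):
    """Build a CREATE KEYSPACE statement that uses NetworkTopologyStrategy."""
    parts = ["'class': 'NetworkTopologyStrategy'"]
    # One entry per datacenter: datacenter name -> number of replicas held there.
    parts += [f"'{dc}': {rf}" for dc, rf in sorted(replicas_per_dc.items())]
    joined = ", ".join(parts)
    return "CREATE KEYSPACE " + name + " WITH replication = {" + joined + "};"

# Assumed layout: 2 of the 4 nodes in each of 2 regions, every node a replica.
statement = keyspace_cql("social", {"us-east": 2, "us-west": 2})
print(statement)
```

With a layout like this, each region holds a full copy of the data, so reads can be served locally and either region can keep operating if the other becomes unreachable.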
DataStax: Thanks for taking the time to speak with us today. Is there anything else you’d like to add?
Rich: The last thing I want to share is that I’m a big fan of Cassandra. I’ve been developing systems that are built around relational databases for 20 years. I am very excited to see solutions coming up to solve these issues around big data and Cassandra has been rock solid and a super solution; I’m a huge fan of Cassandra.
For more information on Nexgate, see: www.nexgate.com.