DataStax Enterprise: Cassandra with Solr Integration Details
date: April 12, 2012
DataStax Enterprise 2.0 (DSE) was released recently in NYC and it's generated quite a bit of interest due, in part, to the new Apache Solr search integration. This post will describe how this Cassandra-Solr integration works and how it can help you if you are an existing Cassandra user or if you are primarily a Solr user looking for a way to more easily manage your Solr cluster.
If you are more of a visual person (like me) then check out the screencast below showing DSE Search in action. It's free to kick the tires on DSE, just download. We also have a lot more documentation to peruse.
DSE Search vs Solandra
One of the most frequent questions we've gotten since the launch is how is DSE Search different than the open source project Solandra. If you don't know what Solandra is, then it's ok to skip this section.
First, these two systems are very different implementations. Solandra replaces the Lucene index format with a set of Cassandra ColumnFamilies, this means you can partition documents across nodes but Solr has no knowledge of Cassandra other than how it accesses its own data. Datastax Enterprise however, is built on top of the Cassandra secondary index api which means you can index data directly from column families. For a system like DataStax Enterprise, this is a great feature since you can now put data into Cassandra, run map-reduce jobs on that data and search it. More on this later.
Another big difference is how the data is stored. As mentioned above, Solandra stores the "index data" in a set of column families. DSE Search uses the native Lucene index format to store the index locally per node. The raw field data is not stored in the Lucene index however, that is still fetched from the column family. By doing this, we can keep the data in sync per node and provide consistency and durability guarantees across Solr and Cassandra while inheriting all the performance effort that goes into Lucene. This makes DSE Search 2-4x faster than Solandra.
DSE Search Design
In DSE Search nodes, we have the following components running in the JVM:
Data can be written to the search nodes from either the Solr APIs or the Cassandra APIs, in the former case the SolrInputDocument is converted to a Cassandra RowMutation and processed as if it came in from the Cassandra API. By doing this, Solr inherits the multimaster writes and durability guarantees Cassandra provides.
How DSE Maps Cassandra to Solr
The picture below illustrates how DSE Search maps Cassandra rows into Solr documents. For each Cassandra Column Family, we can create a Solr index and link Solr field names to Cassandra column names. Each row_key identifier maps to the uniqueKey in the Solr schema. When a row comes into Cassandra, the secondary index api converts the row to a Solr document and updates a local index Solr index. One great feature is we can maintain durability of the data in sole by syncing the Cassandra memtable flushes with Lucene flushes. If there is a crash before a flush happens then the Cassandra commit log will replay the un-flushed mutations on startup. Also, by utilizing Solr's NearRealTime commits in 4.0 we can search the unflushed documents.
DSE does not duplicate field data across Cassandra and Solr indexes. When a Solr requires access to a stored field it gets the data from the local Cassandra data. This means you can "drop" and recreate a Solr index without writing any re-indexing code. In fact, DSE Search lets you change your Solr schema on the fly and rebuild from the source data in Cassandra.
How does Solr scale on Cassandra?
We have touched on how data is written to Cassandra and Solr but only in the context of a single node. How do writes and reads scale to multiple nodes? To understand this you need a little background on how Cassandra partitions data. At a high level, each node in Cassandra is responsible for a range of rows. The default RandomPartitioner in Cassandra randomly partitions rows by hashing the row key with its MD5 value. Since Cassandra rows map to Solr documents, this means your Solr docs are also randomly partitioned. Any write that enters the system will be sent to the correct node(s) responsible for that row/document. When you change the replication settings for a Cassandra Keyspace, then all Solr documents will also be replicated across nodes. When you add a new Cassandra node into the ring, it expands the capacity of the cluster, redistributes some of the data to the new node, and builds the Solr indexes. Now you can scale up and down your cluster without thinking about Solr shard management.
You can send a Solr query to any search node and it will use Cassandra's ring information to construct the right Solr distributed search query, based on the replication factor of the data and the physical location of the data. You can use any of the client libraries Solr supports as well as execute Solr searches from the Cassandra secondary index APIs.
Finally, since we built on Cassandra's distribution and recovery architecture to create a scalable Solr, we can support multi-datacenter Solr clusters out of the box.
Integration with Hadoop
An interesting thing about the DSE Architecture is you have a single source of truth for data, namely Cassandra, then you can access it from Solr and Hadoop. This means you can more easily build data intensive apps. Consider this application:
You are asked to build a system to ingest live application log data from hundreds of servers and make them searchable in near realtime through the web. The system must also be able to generate pre-canned weekly, monthly, yearly key performance indicator reports for the applications running across the machines.
The components on the left and right look similar but the one on the left requires managing three separate distributed systems as well as managing the ETL between them. DSE, on the right, simplifies this setup by having one system that provides the same technology stack as the left but with much simpler operations and no custom ETL.