DataStax Enterprise Search Vs. SolrCloud
The Apache Solr team recently announced a beta release of the long-awaited SolrCloud implementation, which makes building distributed Solr clusters easier and more fault tolerant. I thought it would be helpful to explain where SolrCloud and DSE Search overlap and where they differ.
Both DSE and SolrCloud are built on Apache Solr. Apache Solr is, at its core, a service layer for Lucene. It exposes indexes over HTTP and adds schema to validate and analyze complex types of fields.
DSE and SolrCloud build on this by taking care of things like durability, availability and scaling indexes to massive sizes across many different machines. Both solutions use the Lucene index format to store the index data on disk and rely on Lucene for most of the index and query logic.
Since both are built on Solr, they offer the same feature set: faceting, geospatial search, near-real-time search, and so on.
SolrCloud relies on a Zookeeper ensemble to track which nodes are available and which nodes host each shard of a given index. The configuration files for a given index are also stored in Zookeeper, providing a global view of settings.
When you start SolrCloud you tell it how many shards you want your index to have and how many replicas per shard. Once set, the number of shards can’t be changed.
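The fixed shard count is a consequence of hash-based document routing. Below is a minimal Python sketch of the idea (not SolrCloud's actual code; the hash choice and names are illustrative): because the target shard is a function of the shard count, changing the count after documents are indexed would send lookups to the wrong shard.

```python
import hashlib

NUM_SHARDS = 3  # fixed when the collection is created

def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    # Route a document to a shard by hashing its unique id.
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

doc = "product-42"
original = shard_for(doc, 3)
after_resize = shard_for(doc, 4)  # a resized cluster may map the same doc elsewhere
```

Since every reader and writer must agree on this function, the shard count is effectively baked into the data layout.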
DSE Search relies on the Cassandra ring information to see what nodes are members of the cluster. DSE maps a Solr index to a corresponding ColumnFamily and uses the ColumnFamily metadata to store Solr specific resources like the Solr schema. This metadata is stored on each node in the cluster.
By building on the core Cassandra architecture, each node in the ring is by definition a shard. You can add and remove shards by changing the layout of the ring. Replication settings are also controlled in Cassandra so each index can have any number of replicas.
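The "every node is a shard" idea can be sketched with a toy token ring (hypothetical code, not Cassandra's real partitioner): each node owns the range of tokens ending at its position on the ring, so adding a node to the ring adds a shard.

```python
from bisect import bisect_right
import hashlib

def token(key: str) -> int:
    # Toy partitioner: hash a key into a 0..2**32 token space.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 2**32

class Ring:
    def __init__(self, nodes):
        # Sort nodes by their token position on the ring.
        self.nodes = sorted((token(n), n) for n in nodes)

    def owner(self, key):
        # The owner is the next node clockwise from the key's token.
        tokens = [t for t, _ in self.nodes]
        i = bisect_right(tokens, token(key)) % len(self.nodes)
        return self.nodes[i][1]

ring = Ring(["node1", "node2", "node3"])
owner_before = ring.owner("user:alice")
# Adding a node adds a shard: part of one node's token range moves to it.
ring = Ring(["node1", "node2", "node3", "node4"])
```

Replication is layered on top of this: with a replication factor of N, a key is also stored on the next N-1 nodes clockwise from its owner.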
Both systems provide an easy way to get a distributed Solr cluster up and running. From the user's perspective, you can read or write to any node and the system will take care of the rest.
With SolrCloud, maintaining a separate Zookeeper cluster of at least 3 nodes adds operational overhead and cost. Not to mention that multi-datacenter Zookeeper deployments are not entirely practical.
CAP Theorem Tradeoffs
SolrCloud elects a master node for each shard and all writes go through that node, then to the replicas. A write does not return until all replicas have responded.
As with most systems built on Zookeeper, SolrCloud is designed as a CP system (choosing Consistency over Availability). For example, if a shard master node dies, a recovery process must begin, elect a new master, then all the replicas must verify they are in sync before new writes to that shard can continue.
By building on Cassandra, DSE Search is an AP system (choosing Availability over Consistency). Since Cassandra has no master nodes, if one node dies another replica can accept writes on its behalf, and when the dead node recovers the missing data is brought back up to date. Cassandra also allows a consistency level to be set per write: if you have log data that doesn't need to be instantly consistent, you can write it at a low consistency level and increase ingestion rates. On the other hand, you can write at a high consistency level and trade write latency for stronger consistency (since the write must wait for each replica to respond).
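The tradeoff can be made concrete by counting how many replica acknowledgements each level requires. This is a simplified sketch covering three of Cassandra's consistency levels; the real system supports more.

```python
def required_acks(consistency: str, replication_factor: int) -> int:
    """How many replicas must acknowledge a write at each consistency level."""
    levels = {
        "ONE": 1,                              # fastest, weakest guarantee
        "QUORUM": replication_factor // 2 + 1, # majority of replicas
        "ALL": replication_factor,             # strongest, highest latency
    }
    return levels[consistency]

# With a replication factor of 3:
required_acks("ONE", 3)     # 1 -- fast ingest for log-style data
required_acks("QUORUM", 3)  # 2
required_acks("ALL", 3)     # 3 -- every replica must respond
```

Writing at ONE and reading at QUORUM (or vice versa) is a common way to balance ingestion rate against read consistency.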
Cassandra’s tunable consistency is a real win for clients looking to build dynamic, easy-to-operate clusters. It also offers the ability to scale your Solr cluster across datacenters.
SolrCloud uses Lucene for most of its index storage, including the raw string data for each field (assuming the Solr schema marks a given field stored=true). It also keeps a transaction log for durability purposes.
DSE Search also uses Lucene for its index storage, but differs in that the raw field data is kept in Cassandra. As mentioned earlier, DSE Search maps a Solr index to a Cassandra ColumnFamily: each Solr document becomes a Cassandra row, where document field names become column names and field values become column values. DSE Search saves all field data to the column family, even fields marked stored=false.
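A rough sketch of that mapping, with hypothetical field names (this is illustrative, not DSE's internal representation):

```python
# A Solr document becomes one Cassandra row: the document's unique key
# becomes the row key, and every remaining field becomes a column.
solr_doc = {"id": "sku-1001", "name": "Widget", "price": "9.99"}

def doc_to_row(doc: dict, unique_key: str = "id"):
    row_key = doc[unique_key]
    columns = {field: value for field, value in doc.items() if field != unique_key}
    return row_key, columns

row_key, columns = doc_to_row(solr_doc)
# row_key == "sku-1001"; columns hold every field, even those marked stored=false
```

Because the row holds every field regardless of the stored flag, the full source of each document survives independently of the Lucene index.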
By splitting the raw data from the indexed data, DSE Search can do a few things SolrCloud can’t:
- Solr indexes can be rebuilt from source internally. This helps if you need to rebuild the index after a lost disk, or if you decide to change the analyzer types associated with your data. It is much easier than writing your own mechanism to rebuild from source.
- Data is available to other systems for processing. Need to run an analysis across all your data? All the data can be processed by Hive or Pig.
- Recovery and operations are simplified. For example, the Cassandra commitlog provides durability for Solr indexes. If you were to run a separate Solr cluster, Hadoop cluster, and Cassandra cluster, you would inherit 3x the operational overhead of DSE.
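The first point, rebuilding an index from source, can be sketched as a loop over the stored rows. This is purely illustrative (DSE's actual rebuild is internal), but it shows why keeping the raw data separate makes re-analysis cheap:

```python
# Rebuild an inverted index by walking every row in the column family
# and re-analyzing each field with the (possibly changed) analyzer.
def rebuild_index(rows, analyze):
    index = {}
    for row_key, columns in rows:
        for field, value in columns.items():
            for term in analyze(field, value):
                index.setdefault(term, set()).add(row_key)
    return index

rows = [("doc1", {"title": "Distributed Search"}),
        ("doc2", {"title": "Search at Scale"})]
# Swap in a different analyzer and rerun: no external source dump needed.
index = rebuild_index(rows, lambda field, value: value.lower().split())
```

Changing the analyzer is just a matter of passing a different function and rerunning the loop, because the source rows are always there to read from.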
Hopefully this gives you some insight into the similarities and differences between SolrCloud and DSE Search. Search is a crucial part of big data infrastructure, and both SolrCloud and DSE Search make scaling search far easier than it used to be. We feel, however, that by building on the solid foundation of Cassandra we can offer a platform that is more tightly integrated with the rest of the big data ecosystem, simplifying your stack.