DataStax Enterprise 3.0 Documentation

DSE Search/Solr versus Open Source Solr

This documentation corresponds to an earlier product version. Make sure this document corresponds to your version.


Because DSE Search/Solr is integrated into DataStax Enterprise, it differs from Open Source Solr (OSS) in several ways.

Major differences

The major differences in capabilities are:

| Capability | DSE | OS Solr | Description |
|---|---|---|---|
| Includes a database | yes | no | A user has to create an interface to add a database to OSS. |
| Indexes real-time data | yes | no | Cassandra ingests real-time data and Solr indexes the data. |
| Provides an intuitive way to update data | yes | no | DataStax provides a SQL-like language and command-line shell, CQL, for loading and updating data. Data added to Cassandra shows up in Solr. |
| Indexes Hadoop output without ETL | yes | no | Cassandra ingests the data, Solr indexes the data, and you run MapReduce against that data in one cluster. |
| Supports data distribution | yes | yes [1] | DataStax Enterprise distributes Cassandra real-time, Hadoop, and Solr data to multiple nodes in a cluster transparently. |
| Balances loads on nodes/shards | yes | no | Unlike in OSS and SolrCloud, loads can be rebalanced efficiently. |
| Spans indexes over multiple data centers | yes | no | A cluster can have more than one data center for different types of nodes. |
| Automatically re-indexes Solr data | yes | no | The only way to re-index data in OSS is to have the client re-ingest everything. |
| Stores data added through Solr in Cassandra | yes | no | Data updated using the Solr API shows up in Cassandra. |
| Makes durable updates to data | yes | no | Updates are durable and written to the Cassandra commit log regardless of how the update is made. |
| Upgrades of Lucene preserve data | yes | no | DataStax integrates Lucene upgrades periodically, and when you upgrade DSE, data is preserved. OSS users must re-ingest all their data when upgrading Lucene. |
| Security | yes | no | DataStax has extended SolrJ to protect internal communication and HTTP access. Solr data can be encrypted and audited. |

[1] Requires using ZooKeeper.

Minor differences

Minor differences between DSE Search and OSS include:

  • You launch DSE Search by starting a DataStax Enterprise node in DSE Search mode. You start OSS using java -jar start.jar.

  • DSE Search terminology used to describe objects differs from OSS terminology. The DataStax Enterprise vs Solr concepts section lists the differences.

  • Delete by query in DSE Search differs from OSS. In DSE Search, deletions begin immediately, so you do not need to post a commit after posting the delete command.

  • The process for creating an index and reloading a schema differs.

  • DSE Search has removed the Optimize button from the Core Admin UI.

  • In the DSE Search schema, if you do not configure the uniqueKey field as stored (stored="true"), DataStax Enterprise forces that flag to be true.

    This change is necessary to make distributed search work.

  • Behavior differs between DSE Search and OSS when you configure a non-unique field as not stored.

    In OSS, the data is lost, whereas in DSE Search, the data is stored in Cassandra. The field does not show up in the search results of DSE Search or OSS.

  • DataStax provides a real-time caching directory factory flag, DSENRTCachingDirectoryFactory, that you can use to specify where files are cached.

  • In DSE Search/Solr, the autoCommit element is removed from solrconfig.xml and the autoSoftCommit element is uncommented.

    In OSS, the autoCommit element is present and uncommented, and the autoSoftCommit element is commented out.

  • OSS supports relative paths set by the <lib> element in solrconfig.xml, but DSE Search/Solr does not. Configuring Solr library paths describes a workaround for this issue, which DataStax Enterprise will address in a future release.
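
The delete-by-query difference in the list above can be illustrated with a minimal Python sketch. It builds the standard Solr XML delete message and shows where an OSS client would add a follow-up commit message. The helper names and the example query are hypothetical, and actually posting the messages to an update handler is left out.

```python
import xml.etree.ElementTree as ET
from typing import List


def build_delete_by_query(query: str) -> str:
    """Return a Solr XML update message that deletes all documents matching query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


def update_messages(query: str, needs_commit: bool) -> List[str]:
    """Return the messages a client would post for a delete by query.

    needs_commit=True models OSS, where a commit is posted after the delete.
    needs_commit=False models DSE Search, where the deletion begins immediately.
    """
    messages = [build_delete_by_query(query)]
    if needs_commit:
        messages.append("<commit/>")
    return messages


# Hypothetical query: remove every document whose "type" field is "obsolete".
dse_messages = update_messages("type:obsolete", needs_commit=False)
oss_messages = update_messages("type:obsolete", needs_commit=True)
print(dse_messages)
print(oss_messages)
```

With DSE Search only the delete message is produced; the OSS variant carries the extra commit message.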

Pseudo join and pivot faceting are not fully supported by DataStax Enterprise. They do not belong in the differences list, however, because OSS does not support these features in distributed mode either. OSS does not distribute data in a scalable, peer-to-peer system as DataStax Enterprise does.

DataStax Enterprise vs Solr concepts

In a distributed environment, such as DataStax Enterprise and Cassandra, the column family data is spread over multiple nodes. In Solr, there are several names for an index of documents and configuration on a single node:

  • A core
  • A collection
  • One shard of a collection

Each document in a core/collection is considered unique and contains a set of fields that adhere to a user-defined schema. The schema lists the field types and how they should be indexed. DSE Search maps Solr cores/collections to Cassandra column families. Each column family has a separate Solr core/collection on a particular node. Solr documents are mapped to Cassandra rows, and document fields to columns. The shard is analogous to a partition of the column family. The Cassandra keyspace is a prefix for the name of the Solr core/collection and has no counterpart in Solr.

This table shows the relationship between Cassandra and Solr concepts:

| Cassandra | Solr, single-node environment | Solr, distributed environment |
|---|---|---|
| Column family | Core or collection | Collection |
| Row | Document | Document |
| Row key | Unique key | Unique key |
| Column | Field | Field |
| Node | N/A | Node |
| Partition | N/A | Shard |
| Keyspace | N/A | N/A |
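
As a rough illustration of this mapping, the following Python sketch translates Cassandra terms into their distributed-Solr counterparts and derives a Solr core name from a keyspace and a column family. The text above only says that the keyspace is a prefix of the core name; joining the two with a dot is an assumption made here, and the function names, keyspace name, and column family name are hypothetical.

```python
from typing import Optional

# Cassandra term -> Solr term in a distributed environment.
# None means the concept has no Solr counterpart.
CASSANDRA_TO_SOLR = {
    "column family": "collection",
    "row": "document",
    "row key": "unique key",
    "column": "field",
    "node": "node",
    "partition": "shard",
    "keyspace": None,
}


def solr_term(cassandra_term: str) -> Optional[str]:
    """Look up the distributed-Solr name for a Cassandra concept (case-insensitive)."""
    return CASSANDRA_TO_SOLR[cassandra_term.lower()]


def core_name(keyspace: str, column_family: str, separator: str = ".") -> str:
    """Build a core name with the keyspace as prefix.

    The "." separator is an assumption, not something stated in the text.
    """
    return f"{keyspace}{separator}{column_family}"


print(solr_term("Partition"))
print(core_name("mykeyspace", "mycolumnfamily"))
```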

With Cassandra replication, a Cassandra node or Solr core contains more than one partition (shard) of column family (collection) data. Unless the replication factor equals the number of cluster nodes, the Cassandra node or Solr core contains only a portion of the data of the column family or collection.
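
The last point can be made concrete with a little arithmetic. Assuming rows are spread evenly across nodes (an idealization), each node holds roughly the replication factor divided by the number of nodes of the column family data. The function name below is hypothetical.

```python
from fractions import Fraction


def fraction_of_data_per_node(replication_factor: int, node_count: int) -> Fraction:
    """Approximate share of a column family stored on one node.

    Assumes an even distribution of rows. The result is capped at 1 because
    a node holds at most one copy of each row.
    """
    if replication_factor < 1 or node_count < 1:
        raise ValueError("replication_factor and node_count must be positive")
    return Fraction(min(replication_factor, node_count), node_count)


# Six nodes, replication factor 2: each node holds about a third of the data.
print(fraction_of_data_per_node(2, 6))
# Replication factor equal to the node count: every node holds all the data.
print(fraction_of_data_per_node(3, 3))
```

Only when the replication factor reaches the node count does the fraction become 1, which matches the exception stated above.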