The major new enhancement made to DataStax Enterprise is enterprise search support using Lucene and Apache Solr. Coming from the Apache Lucene project, Solr is the most popular open source enterprise search platform in use today.
Solr’s primary features include robust free-text search, hit highlighting, and rich document (PDF, Microsoft Word, and so on) handling. Solr also provides more advanced features like aggregation, grouping, and geospatial search. Today, Solr powers the search and navigation features of many of the world's largest Internet sites. Because DSE includes Solr 4.0, near real-time indexing is available.
The unique combination of Cassandra, Solr, and Hadoop in DSE bridges the gap between online transaction processing (OLTP) and online analytical processing (OLAP). DSE Search in Cassandra offers a way to aggregate and look at data in many different ways in real time. Cassandra's speed compensates for typical MapReduce performance problems. By integrating Solr into the DataStax Enterprise big data platform, DataStax extends Solr’s capabilities and overcomes the shortcomings of native Solr mentioned in the next section.
DSE Search is easily scalable. You add search capacity to your cluster in the same way as you add Hadoop or Cassandra capacity to your cluster. You can have a hybrid cluster of nodes, some running Cassandra, some running search, and some running Hadoop. If you don't need Cassandra or Hadoop, migrate to DSE strictly for Solr and create an exclusively Solr cluster. The DSE cluster configuration improves upon the master-slave configuration supported by native Solr.
DSE supports native Solr tools and APIs, simplifying migration from Solr to DSE Search for Solr users.
DataStax Enterprise Search is built on top of Solr 4.0, which offers real-time querying of files. Search indexes remain tightly in line with live data. There are significant benefits of running your enterprise search functions through DataStax Enterprise instead of native Solr, including:
DSE Search takes secondary indexes to a new level: data added to Cassandra is locally indexed in Solr. Data added to Solr is locally indexed in Cassandra.
DataStax Enterprise supports cluster partitioning by workload as described in About Replication in Cassandra.
Using this approach, you can make some of your DSE nodes handle search while others handle MapReduce, or just act as ordinary Cassandra nodes. In production environments, do not run Solr and Hadoop on the same node. In development environments, running both is feasible.
Cassandra ingests the data, Solr indexes the data, and you run MapReduce against that data, all in one cluster without having to do any manual extract, transform, and load (ETL) operations.
Cassandra handles the replication and isolation of resources.
Solr calls an index of documents a core. Each document in a core is considered unique and contains a set of fields that adhere to a user-defined schema. The schema lists the field types and how they should be indexed.
DSE Search links Solr cores to Cassandra column families, Solr documents to Cassandra rows, and document fields to columns. This table shows the relationship between Cassandra and Solr concepts:
Solr has a number of required and optional configuration files. A minimal Solr installation requires these files:
For more information about creating the schema, see Creating a Schema.
DSE Search includes a REST API for adding and retrieving resources associated with an index. You can look at the contents of an existing Solr resource by loading its URL in a web browser or by sending an HTTP GET request.
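As a rough sketch of retrieving a resource from the command line, the helper below builds a resource URL and fetches it with curl. The function names resource_url and get_resource are hypothetical, and the URL layout is assumed to match the curl POST examples later in this section; adjust the host, port, keyspace, and column family for your cluster.

```shell
#!/bin/sh
# Hypothetical helper: build the URL of a resource file in a Solr core.
# Assumption: the layout matches the POST examples in this section:
#   http://<host:port>/solr/resource/<keyspace>.<columnfamily>/<file>
resource_url() {
    # $1 = host:port, $2 = keyspace, $3 = column family, $4 = resource file name
    printf 'http://%s/solr/resource/%s.%s/%s\n' "$1" "$2" "$3" "$4"
}

# Fetch a resource with HTTP GET (curl's default method) and print it to stdout.
# --fail makes curl return a non-zero status on an HTTP error response.
get_resource() {
    curl --silent --show-error --fail "$(resource_url "$@")"
}

# Usage (needs a running DSE Search node):
#   get_resource localhost:8983 keyspace columnfamily schema.xml
```

Keeping the URL construction in its own function makes it easy to check the address before any request is sent.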
After generating valid schema.xml and solrconfig.xml files, you can create a new Solr index by posting the files through a specific HTTP endpoint. Use this format:
Generally, you can post any resource required by Solr to this URL. For example, stopwords.txt and elevate.xml are optional, frequently used Solr configuration files that you can post the same way.
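One possible way to post such optional files is sketched below. The helper names, the BASE_URL and CORE placeholders, and in particular the text/plain content type chosen for non-XML files are assumptions made for illustration; only the text/xml header comes from the examples in this section.

```shell
#!/bin/sh
# Placeholders (assumptions): point these at your own node and core.
BASE_URL=${BASE_URL:-http://localhost:8983/solr/resource}
CORE=${CORE:-keyspace.columnfamily}

# Pick a Content-type header value from the file extension.
# text/xml is what this section's examples use for XML files;
# text/plain for everything else is an assumption, not documented behavior.
content_type_for() {
    case "$1" in
        *.xml) echo 'text/xml; charset=utf-8' ;;
        *)     echo 'text/plain; charset=utf-8' ;;
    esac
}

# POST one resource file to the core's resource URL.
post_resource() {
    file=$1
    curl --fail --silent --show-error \
        "$BASE_URL/$CORE/$(basename "$file")" \
        --data-binary "@$file" \
        -H "Content-type:$(content_type_for "$file")"
}

# Usage (needs a running DSE Search node and the files on disk):
#   post_resource stopwords.txt
#   post_resource elevate.xml
```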
Example of Creating an Index
For example, to create a Solr index on a column family, make two HTTP POST requests using the cURL utility as follows:
Configuration file POST request:
curl http://localhost:8983/solr/resource/keyspace.columnfamily/solrconfig.xml --data-binary @solrconfig.xml -H 'Content-type:text/xml; charset=utf-8'
Schema file POST request:
curl http://localhost:8983/solr/resource/keyspace.columnfamily/schema.xml --data-binary @schema.xml -H 'Content-type:text/xml; charset=utf-8'
DSE Search stores the files on all the Cassandra nodes and creates a new Solr core. If you HTTP POST the files to an existing column family, DSE Search starts indexing the data. If you HTTP POST the files to a nonexistent keyspace or column family, DSE Search creates the keyspace and column family, and then starts indexing the data. For example, you can change the stopwords.txt file, repost the schema, and the index updates.
Changing the Solr schema requires reindexing, which can be disruptive: users may see degraded performance while reindexing runs. Change the schema only when absolutely necessary, and preferably during scheduled downtime.
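The two POST requests above can be wrapped in a small script, sketched here under stated assumptions. The create_index function and the DRY_RUN switch are hypothetical conveniences, not part of DSE; the script posts solrconfig.xml first and schema.xml second, following the order of the example, and stops at the first failed request.

```shell
#!/bin/sh
# Hypothetical wrapper around the two POST requests shown in the example.
# Set DRY_RUN=1 to print the curl commands instead of sending them.
create_index() {
    host=$1   # for example localhost:8983
    core=$2   # for example keyspace.columnfamily
    for f in solrconfig.xml schema.xml; do
        url="http://$host/solr/resource/$core/$f"
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "curl $url --data-binary @$f"
        else
            # --fail turns an HTTP error response into a non-zero exit status,
            # so the loop stops before posting the schema if the config fails.
            curl --fail --silent --show-error "$url" \
                --data-binary "@$f" \
                -H 'Content-type:text/xml; charset=utf-8' || return 1
        fi
    done
}

# Usage (run from the directory that holds both files):
#   create_index localhost:8983 keyspace.columnfamily
```

Running it once with DRY_RUN=1 is a cheap way to confirm the target URLs before touching a live cluster, which matters because reposting a schema triggers reindexing.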
DSE Search does not support:
Timeseries type rows
Solr fields must be strings.