DataStax Enterprise 3.1 Documentation

Getting started with Solr in DataStax Enterprise

This documentation corresponds to an earlier product version. Make sure this document corresponds to your version.


DataStax Enterprise supports Open Source Solr (OSS) tools and APIs, simplifying migration from Solr to DataStax Enterprise. DataStax Enterprise Search 3.1 and later is built on top of Solr 4.3. Before starting a DSE Search/Solr node on a production cluster or data center, it is important to disable virtual nodes. You can skip this step if you just want to run the Solr getting started tutorial.

Disabling virtual nodes

DataStax recommends using virtual nodes only on data centers running Cassandra real-time workloads. You should disable virtual nodes on data centers running either Hadoop or Solr workloads.

To disable virtual nodes:

  1. In the cassandra.yaml file, set num_tokens to 1.

    num_tokens: 1
    
  2. Uncomment the initial_token property and set it to 1 or to the value of a generated token for a multi-node cluster.
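With both properties set, the relevant lines of cassandra.yaml look like this (the token value of 1 follows the step above; on a multi-node cluster, each node gets its own generated token):

    # cassandra.yaml (one entry per node)
    num_tokens: 1
    initial_token: 1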

Introduction to Solr

Part of the Apache Lucene project, Solr features robust free-text search, hit highlighting, and rich document (PDF, Microsoft Word, and so on) handling. Solr also provides more advanced features such as aggregation, grouping, and geospatial search. Today, Solr powers the search and navigation features of many of the world's largest Internet sites. With the inclusion of Solr 4.0 and later, near real-time indexing is supported.

The unique combination of Cassandra, Solr, and Hadoop in DSE bridges the gap between online transaction processing (OLTP) and online analytical processing (OLAP). DSE Search in Cassandra offers a way to aggregate and look at data in many different ways in real time. DataStax extends Solr's capabilities; the benefits over open source Solr are described in a later section.


[Figure: integration of Cassandra, Hadoop, and Solr in DataStax Enterprise]

DSE Search is easily scalable. You add search capacity to your cluster in the same way as you add Hadoop or Cassandra capacity. You can have a hybrid cluster of nodes, some running Cassandra, some running search, and some running Hadoop, provided the Solr nodes are in a separate data center. If you don't need Cassandra or Hadoop, migrate to DSE strictly for Solr and create an exclusively Solr cluster.

Sources of information about OSS

Covering all the features of OSS is beyond the scope of DataStax Enterprise documentation. Because DSE Search/Solr supports all Solr tools and APIs, refer to the Solr documentation for information about topics such as how to construct Solr query strings to retrieve indexed data.

For more information, see Solr 4.x Deep Dive by Jack Krupansky.

Benefits of using Solr in DataStax Enterprise

Solr offers real-time querying of data. Search indexes remain tightly in line with live data. There are significant benefits to running your enterprise search functions through DataStax Enterprise instead of OSS, including:

  • A fully fault-tolerant, no-single-point-of-failure search architecture
  • Linear performance scalability--add new search nodes online
  • Automatic indexing of data ingested into Cassandra
  • Automatic and transparent data replication
  • Isolation of all real-time, Hadoop, and search/Solr workloads to prevent competition between workloads for either compute resources or data
  • The capability to read/write to any Solr node, which overcomes the Solr write bottleneck
  • Selective updates of one or more individual fields (a full re-index operation is still required)
  • Search indexes that can span multiple data centers (OSS cannot)
  • Limited CQL support for Solr/search queries (Solr HTTP API recommended)
  • Creation of Solr indexes from existing tables created with CQL/CLI/Thrift

Data added to Cassandra is locally indexed in Solr, and data added through the Solr API is locally stored in Cassandra.

Unsupported features

DSE Search does not support:

  • Cassandra super columns
  • Cassandra counter columns
  • Cassandra timeseries type rows
  • Cassandra composite columns; Solr fields must be strings.

Defining key Solr terms

In a distributed environment, such as DataStax Enterprise and Cassandra, the data is spread over multiple nodes. In Solr, there are several names for an index of documents and configuration on a single node:

  • A core
  • A collection
  • One shard of a collection

Each document in a core/collection is considered unique and contains a set of fields that adhere to a user-defined schema. The schema lists the field types and how they should be indexed. DSE Search maps Solr cores/collections to Cassandra tables. Each table has a separate Solr core/collection on a particular node. Solr documents are mapped to Cassandra rows, and document fields to columns. The shard is analogous to a partition of the table. The Cassandra keyspace is a prefix for the name of the Solr core/collection and has no counterpart in Solr.

This table shows the relationship between Cassandra and Solr concepts:

Cassandra         Solr (single-node environment)   Solr (distributed environment)
Table             Core or collection               Collection
Row               Document                         Document
Partition key     Unique key                       Unique key
Column            Field                            Field
Node              N/A                              Node
Partition         N/A                              Shard
Keyspace          N/A                              N/A

With Cassandra replication, a Cassandra node or Solr core contains more than one partition (shard) of table (collection) data. Unless the replication factor equals the number of cluster nodes, the Cassandra node or Solr core contains only a portion of the data of the table or collection.
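To make the mapping concrete, consider a hypothetical CQL table (the keyspace, table, and column names here are invented for illustration and are not part of this tutorial):

    CREATE TABLE mykeyspace.articles (
      id text PRIMARY KEY,   -- becomes the Solr unique key
      title text,            -- becomes a Solr field
      body text              -- becomes a Solr field
    );

DSE Search exposes this table as a Solr core/collection named mykeyspace.articles, and each row is indexed as a document whose fields are id, title, and body.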

Installing Solr nodes

To install a Solr node, use the same installation procedure as you use to install any other type of node. To use real-time (Cassandra), analytics (Hadoop), or search (Solr) nodes in the same cluster, segregate the different node types into separate data centers. Using the default DseSimpleSnitch automatically puts all the Solr nodes in the same data center. Use OpsCenter Enterprise to rebalance the cluster after adding a node.
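If you replace the default DseSimpleSnitch with the PropertyFileSnitch, for example, a cassandra-topology.properties file along these lines keeps the search nodes in their own data center (the IP addresses, data center names, and rack names below are placeholders):

    # cassandra-topology.properties (hypothetical layout)
    # Cassandra real-time nodes
    110.82.155.1=Cassandra:RAC1
    110.82.155.2=Cassandra:RAC1
    # Solr search nodes in a separate data center
    110.82.156.1=Solr:RAC1
    110.82.156.2=Solr:RAC1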

Starting and stopping a Solr node

The way you start up a Solr node depends on the type of installation, tarball or packaged.

Tarball installation

From the install directory, use this command to start the Solr node:

bin/dse cassandra -s

The Solr node starts up.

From the install directory, use this command to stop the Solr node:

bin/dse cassandra-stop

Packaged installation

  1. Enable Solr mode by setting this option in /etc/default/dse: SOLR_ENABLED=1

  2. Start the DSE service using this command:

    sudo service dse start
    

    The Solr node starts up.

You stop a Solr node using this command:

sudo service dse stop
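To confirm that a node actually came up with the search workload, you can check the ring information that dsetool reports (run it from the bin directory of a tarball installation). The output below is only a sketch; the exact columns vary by version:

    dsetool ring

    Address        DC     Rack   Workload   Status  State   ...
    127.0.0.1      Solr   rack1  Search     Up      Normal  ...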

Solr getting started tutorial

Setting up Cassandra and Solr for this tutorial involves the same basic steps as setting up a typical DSE Search/Solr application:

  • Create a Cassandra table.
  • Import data.
  • Create a search index.

These steps for setting up Cassandra and Solr are explained in detail in this tutorial. After completing the setup, you use DSE Search/Solr to perform simple queries, sort the query results, and construct facet queries.

In this tutorial, you use some sample data from a health-related census. Download the sample data now.

Setup

This setup assumes you started DataStax Enterprise 3.1 in DSE Search/Solr mode and downloaded the sample data.

  1. Unzip the files you downloaded into the DataStax Enterprise installation home directory. This creates the solr_tutorial directory in the installation directory, which contains the following files.

    • The CSV (comma separated value) data, nhanes52.csv
    • Cassandra table definition, create_nhanes.cql
    • The copy command, copy_nhanes.cql
    • The Solr schema, schema.xml

    You can take a look at these files by using your favorite editor.

  2. Copy the solrconfig.xml from the DataStax Enterprise 3.1 <install-location>/demos/wikipedia directory (tarball installations) or /usr/share/dse-demos (packaged installations) to the solr_tutorial directory.
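Put together, the setup looks something like this on the command line for a tarball installation (the name of the downloaded archive is an assumption; substitute the actual file name and your install location):

    cd <install-location>
    unzip solr_tutorial.zip                            # creates the solr_tutorial directory
    cp demos/wikipedia/solrconfig.xml solr_tutorial/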

Create a Cassandra table

  1. Start cqlsh, and create a keyspace. Use the keyspace.

    cqlsh> CREATE KEYSPACE nhanes_ks WITH REPLICATION =
           { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
    
    cqlsh> USE nhanes_ks;
    
  2. Copy the CQL table definition from the downloaded create_nhanes.cql file, and paste it on the cqlsh command line.

    This action creates the nhanes table in the nhanes_ks keyspace. The table uses the WITH COMPACT STORAGE directive.
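The downloaded file contains the full definition. As a rough sketch only, assuming column names that appear later in this tutorial (this is not the actual contents of create_nhanes.cql), the definition has this shape:

    CREATE TABLE nhanes (
      id int PRIMARY KEY,    -- the unique key that Solr uses
      age int,
      family_size int,
      num_smokers int
      -- ... the real file defines many more columns
    ) WITH COMPACT STORAGE;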

Import data

  1. Copy the cqlsh COPY command from the downloaded copy_nhanes.cql file.

  2. Paste the COPY command on the cqlsh command line, change the FROM clause to match the path to /solr_tutorial/nhanes52.csv in your environment, and then run the command.

    This action imports the data from the CSV file into the nhanes table in Cassandra.
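The command in copy_nhanes.cql takes roughly this form; the column list is abbreviated here to the columns used later in the tutorial, and the path is a placeholder (the real file lists every column in the table):

    COPY nhanes (id, age, family_size, num_smokers) FROM '/path/to/solr_tutorial/nhanes52.csv';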

In a production environment, you would likely use a more robust tool, such as the Cassandra bulk loader or Sqoop, for importing data. An alternative to importing the data into the pre-existing Cassandra table is to import the data into Solr and let DataStax Enterprise create the table after search indexing.

Create a search index

On the command line in the solr_tutorial directory, upload the solrconfig.xml and schema.xml to Solr, and create the Solr core named after the Cassandra keyspace and table, nhanes_ks.nhanes.

curl http://localhost:8983/solr/resource/nhanes_ks.nhanes/solrconfig.xml --data-binary @solrconfig.xml -H 'Content-type:text/xml; charset=utf-8'

curl http://localhost:8983/solr/resource/nhanes_ks.nhanes/schema.xml --data-binary @schema.xml -H 'Content-type:text/xml; charset=utf-8'

curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=nhanes_ks.nhanes"

Now, the searching can begin.

Exploring the Solr Admin

After creating the Solr core, you can check that the Solr index is working by using the browser-based Solr Admin:

http://localhost:8983/solr/

To explore the Solr Admin:

  1. Click Core Admin. Unless you loaded other cores, the path to the default core, nhanes_ks.nhanes, appears.

    At the top of the Solr Admin console, the Reload, Reindex, and Full Reindex buttons perform functions that correspond to RELOAD command options. If you modify the schema.xml or solrconfig.xml, you use these controls to re-index the data (an equivalent HTTP command appears at the end of this section).

  2. Check that the numDocs value is 20,050. The number of Solr documents corresponds to the number of rows in the CSV data and nhanes table you created in Cassandra.

  3. In Core Selector, select the name of the core, nhanes_ks.nhanes.

    Selecting the name of the core brings up additional items, such as Query, in the vertical navigation bar.


    [Figure: Solr Admin console with the nhanes_ks.nhanes core selected]

    You can learn more about the Solr Admin from the Overview of the Solr Admin UI.
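The Reload, Reindex, and Full Reindex buttons mentioned in step 1 have an HTTP equivalent: the Solr core admin RELOAD command. The reindex and deleteAll parameters shown below are the DSE re-indexing options; verify the exact parameters against the RELOAD command reference for your version:

    curl "http://localhost:8983/solr/admin/cores?action=RELOAD&name=nhanes_ks.nhanes&reindex=true&deleteAll=false"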

Using the Solr HTTP API

For serious searching, use the Solr HTTP API. The Solr Admin query form is limited, but useful for learning about Solr, and can even help you get started using the Solr HTTP API. The form shows the queries in Solr HTTP format at the top of the form. After looking at a few URLs, you can try constructing queries in Solr HTTP format.

To get started using the Solr HTTP API:

  1. Scroll to the top of the form, and click the greyed out URL.


    [Figure: Solr Admin query form showing the generated query URL]

    A page of output independent of the query form appears that you can use to examine and change the URL. The URL looks like this:

    http://localhost:8983/solr/nhanes_ks.nhanes/select?
      q=family_size%3A9&sort=age+asc&fl=age+family_size
      &wt=xml&indent=true&facet=true&facet.field=age
    
  2. In the URL in the address bar, make these changes:

    FROM:

    q=family_size%3A9
    &fl=age+family_size
    

    TO:

    q=age:[20+TO+40]
    &fl=age+family_size+num_smokers
    

    The modified URL looks like this:

    http://localhost:8983/solr/nhanes_ks.nhanes/select?
      q=age:[20+TO+40]&sort=age+asc&fl=age+family_size+num_smokers
      &wt=xml&indent=true&facet=true&facet.field=age
    

    In the Solr Admin query form, you can use spaces in the range [20 TO 40], but in the URL, you need to use URL encoding for spaces and special characters. For example, use + or %20 instead of a space, [20+TO+40].

  3. Use the modified URL to execute the query. Move to the end of the URL, and press ENTER.

    The number of hits increases from 186 to 7759. Results show the number of smokers and family size of families whose members are 20-40 years old. Facets show how many people fell into the various age groups.

    . . .
      <doc>
       <int name="age">20</int>
       <int name="family_size">4</int>
       <int name="num_smokers">1</int>
      </doc>
    </result>
    <lst name="facet_counts">
    <lst name="facet_queries"/>
     <lst name="facet_fields">
      <lst name="age">
      <int name="23">423</int>
      <int name="24">407</int>
      <int name="31">403</int>
      <int name="30">388</int>
      <int name="40">382</int>
      <int name="28">381</int>
      <int name="27">378</int>
      <int name="21">377</int>
      <int name="33">377</int>
      <int name="22">369</int>
      <int name="29">367</int>
      <int name="20">365</int>
      <int name="32">363</int>
      <int name="34">361</int>
      <int name="36">361</int>
      <int name="25">358</int>
      <int name="26">358</int>
      <int name="35">358</int>
      <int name="38">353</int>
      <int name="37">339</int>
      <int name="39">291</int>
      <int name="17">0</int>
    . . .
    
  4. Experiment with different Solr HTTP API URLs by consulting the Solr documentation and trying different queries against this sample database (a curl version of the modified query appears below).
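For example, the query from step 3 looks like this with curl (the square brackets are URL-encoded as %5B and %5D so that curl does not interpret them):

    curl "http://localhost:8983/solr/nhanes_ks.nhanes/select?q=age:%5B20+TO+40%5D&sort=age+asc&fl=age+family_size+num_smokers&wt=xml&indent=true&facet=true&facet.field=age"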

This tutorial introduced you to DSE Search/Solr basic setup and searching. Next, delve into DataStax Enterprise documentation and the recommended Solr documentation.