DataStax Enterprise 2.1 Documentation

DSE Search Management Operations

This documentation corresponds to an earlier product version. Make sure this document corresponds to your version.

A DSE data center (DC) can be physical or virtual. In this diagram, nodes in data centers 1 and 2 (DC 1 and DC 2) run a mix of:

  • Real-time queries (Cassandra and no other services)
  • Analytics (Cassandra and Hadoop)

Data centers 3 and 4 (DC 3 and DC 4) are dedicated to search.

[Figure: DSE Search data centers (dse_search_datacenter.png)]

Running Solr on some nodes and real-time queries or analytics on other nodes within the same data center does not work.

The Solr nodes run HTTP and hold the indexes for the column family data. If a Solr node goes down, the commit log replays the Cassandra inserts, which correspond to Solr inserts, and the node is restored automatically.

Performing Management Operations

The following sections describe the tasks related to managing search.

Adding a New Solr Node

To increase the number of nodes in a Solr cluster, you can add or bootstrap a DSE node to the cluster. To increase search capacity, bootstrap the node and then, optionally, rebalance the cluster. To bootstrap a Solr node, use the same method you use to bootstrap a Cassandra node. The default DSESimpleSnitch automatically puts all the Solr nodes in the same data center. Use OpsCenter Enterprise to rebalance the cluster.

Inserting into, Modifying, and Deleting Data from a Solr Node

When you insert data into Cassandra, it shows up in Solr. When you add data to Solr, it shows up in Cassandra. You can use any Solr API to write data to Solr; however, the native Solr HTTP REST API is recommended. Writes are durable: a Solr API client writes data to Cassandra first, and then Cassandra updates the secondary indexes.

To modify or remove data from a Solr node, use the Cassandra Query Language (CQL), the Command Line Interface (CLI), or the Solr APIs. Updating a field in Cassandra updates the corresponding data in Solr: when you update the column family, the Solr document is updated.

Updating Individual Fields in a Solr Document

You can use the Solr API to insert into, modify, or delete data from a Solr node. When using the Solr API to change a document, the entire document is updated. Using DSE Search, you can update an individual field instead. After writing the modifications to the Solr document, use a URL in the following format to update the document:

http://<host>:<port>/solr/<keyspace>.<column family>/update?replacefields=false

When you use CQL or CLI to update a field, DSE Search implicitly sets replacefields to false and updates individual fields in the Solr document.
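
As an illustration of the URL format above, the following Python sketch assembles an update URL with replacefields set to false. The function name and its arguments are hypothetical, and the host, keyspace, and column family values in the usage line are placeholders; only the path layout, the replacefields parameter, and the default port 8983 come from this document.

```python
from urllib.parse import urlencode


def build_update_url(host, port, keyspace, column_family, replacefields=False):
    """Return a DSE Search update URL for one column family.

    Illustrative helper: the path layout and the replacefields parameter
    follow the format shown in the text; everything else is an assumption.
    """
    # The document writes the flag in lowercase ("false"), so mirror that.
    query = urlencode({"replacefields": str(replacefields).lower()})
    return f"http://{host}:{port}/solr/{keyspace}.{column_family}/update?{query}"


if __name__ == "__main__":
    # Placeholder host, keyspace, and column family names.
    print(build_update_url("localhost", 8983, "mykeyspace", "mycf"))
```

The sketch only builds the string; sending the modified document to that URL is left to whichever HTTP client you already use.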

Warning about using optimize

Do not use the optimize command. Using the optimize command in a URL can cause nodes to fail.

Increasing Read Performance by Adding Replicas

You can increase DSE Search read performance by configuring replicas just as you do in Cassandra. You define a replica placement strategy and the number of replicas you want. For example, you can add replicas using the NetworkTopologyStrategy replica placement strategy. If you are using a PropertyFileSnitch, you can configure this strategy using CQL:

  1. Check the data center names of your nodes using the nodetool command.

    ./nodetool -h localhost ring
    

    Note

    The data center names, DC1 and DC2 in this example, must match the data center name configured for your snitch.

  2. Start CQL on the DSE command line and create a keyspace that specifies the number of replicas you want.

    CREATE KEYSPACE test
    WITH strategy_class = 'NetworkTopologyStrategy'
    AND strategy_options:DC1 = 1
    AND strategy_options:DC2 = 3;
    
The strategy options set the number of replicas in each data center: one replica in data center 1 and three in data center 2. For more information about adding replicas, see Choosing Keyspace Replication Options.
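
If you manage several keyspaces, it can help to generate the statement from a mapping of data center names to replica counts. The following Python sketch produces text in the same form as the example above. The function name is hypothetical, and the sketch only builds the statement; it does not connect to a cluster or check that the data center names match your snitch.

```python
def create_keyspace_cql(keyspace, replicas_by_dc):
    """Build a CREATE KEYSPACE statement in the form used in this section.

    replicas_by_dc maps a data center name (as configured in the snitch)
    to the number of replicas wanted in that data center.
    """
    lines = [
        f"CREATE KEYSPACE {keyspace}",
        "WITH strategy_class = 'NetworkTopologyStrategy'",
    ]
    for dc, count in replicas_by_dc.items():
        # Guard against typos such as negative or fractional counts.
        if not isinstance(count, int) or count < 0:
            raise ValueError(f"replica count for {dc} must be a non-negative integer")
        lines.append(f"AND strategy_options:{dc} = {count}")
    return "\n".join(lines) + ";"


if __name__ == "__main__":
    # Same values as the example above: one replica in DC1, three in DC2.
    print(create_keyspace_cql("test", {"DC1": 1, "DC2": 3}))
```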

Decommissioning and Repairing a Node

You can decommission and repair a Solr node in the same manner as you would a Cassandra node.

Rebuilding an Index

The dsetool utility can rebuild a Solr index from existing Cassandra data. To rebuild a corrupted index:

  1. Run nodetool drain.

  2. Shut down the node.

  3. Delete the Solr index directory for the bad column family. The Solr index directory path is <Cassandra data directory>/solr.data/<keyspace_name>.<column-family-name>.

  4. Restart the node.

  5. Use this command to rebuild the index:

    ./dsetool rebuild_indexes <keyspace> <columnfamily>
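
The index directory named in step 3 and the command in step 5 can be assembled from the keyspace and column family names. The following Python sketch only builds the path and the argument list; it does not delete anything or run dsetool. The function names are hypothetical, and the data directory /var/lib/cassandra/data and the names ks and cf are placeholder assumptions; substitute the values for your installation.

```python
from pathlib import PurePosixPath


def solr_index_dir(cassandra_data_dir, keyspace, column_family):
    """Return <Cassandra data directory>/solr.data/<keyspace>.<column family>."""
    return PurePosixPath(cassandra_data_dir) / "solr.data" / f"{keyspace}.{column_family}"


def rebuild_command(keyspace, column_family):
    """Return the dsetool invocation from step 5 as an argument list."""
    return ["./dsetool", "rebuild_indexes", keyspace, column_family]


if __name__ == "__main__":
    # Placeholder data directory, keyspace, and column family.
    print("Index directory to delete:", solr_index_dir("/var/lib/cassandra/data", "ks", "cf"))
    print("Rebuild command:", " ".join(rebuild_command("ks", "cf")))
```

Printing the path before deleting it gives you a chance to confirm that it points at the index of the damaged column family and not at the whole solr.data directory.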
    

Managing the Location of Solr Data

Solr has its own set of data files. As with Cassandra data files, you can control where the Solr data files are saved on the server. By default, the data is saved in <Cassandra data directory>/solr.data. You can change the location from <Cassandra data directory> to another directory from the command line. For example:

cassandra -s -Ddse.solr.data.dir=/opt

In this example, the data in solr.data is saved in the /opt directory.

About the Validation Log

DSE Search stores validation errors that arise from non-indexable data sent from non-Solr nodes in this log:

/var/log/cassandra/solrvalidation.log

For example, if a Cassandra node that is not running Solr puts a string in a date field, an exception is logged for that column when the data is replicated to the Solr node.
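
One way to avoid this kind of validation error is to check values on the client before writing them from a non-Solr node. The following Python sketch assumes the target field uses the standard Solr date type, which expects UTC timestamps such as 2012-01-31T23:59:59Z; the function name is hypothetical, and the check is deliberately stricter than full ISO 8601.

```python
from datetime import datetime

# Timestamp layouts accepted by this sketch: whole seconds, or fractional seconds.
_SOLR_DATE_FORMATS = ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%dT%H:%M:%S.%fZ")


def is_solr_date(value):
    """Return True if value parses as a UTC timestamp like 2012-01-31T23:59:59Z."""
    for fmt in _SOLR_DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except (ValueError, TypeError):
            # ValueError: wrong layout; TypeError: value is not a string.
            continue
    return False


if __name__ == "__main__":
    for candidate in ("2012-01-31T23:59:59Z", "next Tuesday"):
        print(candidate, "->", is_solr_date(candidate))
```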

Changing the Solr Connector Port

To change the Solr port from the default, 8983, change the http.port setting in the catalina.properties file installed with DSE in <dse-version>/resources/tomcat/conf.

Tuning Performance

DataStax Enterprise server is able to support real-time, analytic, and search workloads in the same cluster of machines with smart workload isolation. This ensures that workloads do not compete with each other for data or computing resources, which helps deliver consistently high performance. If you encounter performance degradation, high memory consumption, or other problems with DataStax Enterprise Search nodes, try:

  • Using Column Family Compression
  • Configuring the solrconfig.xml update handler flag
  • Tuning the solrconfig.xml to specify cache locations
  • Managing the data consistency level on replicas

Using Column Family Compression

Search nodes typically engage in read-dominated tasks, so maximizing storage capacity of nodes, reducing the volume of data on disk, and limiting disk I/O can improve performance. In Cassandra 1.0 and later, you can configure data compression on a per-column family basis to optimize performance of read-dominated tasks.

You can configure the compression algorithm for compressing SSTable files. For read-heavy workloads, such as those carried by Enterprise Search, Snappy compression is recommended. Developers can also implement custom compression classes using the org.apache.cassandra.io.compress.ICompressor interface. You can configure the compression chunk size for read/write access patterns and the average size of rows in the column family.

Setting the High-Performance Update Handler

You need to configure the solrconfig.xml to use near real-time capabilities in Solr by setting the default high-performance update handler flag. For example, the Solr configuration file for the Wikipedia demo sets this flag as follows:

<!-- The default high-performance update handler -->
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoSoftCommit>
      <maxTime>1000</maxTime>
    </autoSoftCommit>
  </updateHandler>

This example uses the maxTime update handler option. The update handler options enable near real-time performance and trigger a soft commit of data automatically, so checking synchronization of data to disk is not necessary. Data durability is maintained by having Cassandra perform hard commits along with its memtable flushes. This table describes both update handler options.

Option Name   Default      Description
maxDocs       No default   Maximum number of documents to add since the last soft commit before automatically triggering a new soft commit.
maxTime       1000         Maximum elapsed time, in milliseconds, between the addition of a document and a new, automatically triggered soft commit.

For more information about the update handler and modifying solrconfig.xml, see the Solr documentation.
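
To confirm which soft commit options a given solrconfig.xml actually sets, you can parse the file rather than read it by eye. The following Python sketch uses the standard library XML parser on an embedded sample that wraps the update handler example above in a config root element. The function name and the SAMPLE constant are hypothetical; in practice you would pass the contents of your own solrconfig.xml.

```python
import xml.etree.ElementTree as ET

# Sample input: the updateHandler example from this section inside a <config> root.
SAMPLE = """
<config>
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoSoftCommit>
      <maxTime>1000</maxTime>
    </autoSoftCommit>
  </updateHandler>
</config>
"""


def soft_commit_settings(solrconfig_xml):
    """Return the autoSoftCommit options (maxDocs, maxTime) that are set."""
    root = ET.fromstring(solrconfig_xml)
    soft = root.find(".//updateHandler/autoSoftCommit")
    if soft is None:
        return {}
    settings = {}
    for name in ("maxDocs", "maxTime"):
        node = soft.find(name)
        if node is not None and node.text:
            settings[name] = int(node.text.strip())
    return settings


if __name__ == "__main__":
    print(soft_commit_settings(SAMPLE))
```

An empty result means the file has no autoSoftCommit element under updateHandler, so neither option is configured.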

Changing the Stack Size and Memtable Space

Some Solr users have reported that increasing the stack size improves performance under Tomcat. To increase the stack size, uncomment and modify the default -Xss128k setting in the cassandra-env.sh file. Also, decreasing the memtable space to make room for Solr caches might improve performance. Modify the memtable space using the memtable_total_space_in_mb property in the cassandra.yaml file.

Managing Caching

You can configure the solrconfig.xml to specify where files are cached, in RAM or on the file system, by setting the DSE near real-time caching directory factory flag. By changing directory factory attributes, you can manage where files are cached.

To manage caching operations:

  1. Open solrconfig.xml for editing.
  2. Add a directoryFactory element to solrconfig.xml of type DSENRTCachingDirectoryFactory. For example:

    <directoryFactory name="DirectoryFactory"
      class="com.datastax.bdp.cassandra.index.solr.DSENRTCachingDirectoryFactory">
      <double name="maxmergesizemb">5.0</double>
      <double name="maxcachedmb">32.0</double>
    </directoryFactory>

  3. Set the DirectoryFactory attributes:

    • maxmergesizemb

      The threshold (MB) for writing a merge segment to a RAMDirectory or to the file system. If the estimated size of merging a segment is less than maxmergesizemb, the merge segment is written to the RAMDirectory; otherwise, it is written to the file system.

    • maxcachedmb

      The maximum value (MB) of the RAMDirectory.
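
The threshold rule described for maxmergesizemb can be expressed in a few lines. The following Python sketch is a simplified model of that rule only, not DSE or Lucene code: the function name is hypothetical, the default of 5.0 is taken from the example element above, and the sketch ignores the overall RAMDirectory limit.

```python
def merge_destination(estimated_merge_mb, maxmergesizemb=5.0):
    """Model where a merge segment is written, per the rule stated above.

    A merge whose estimated size is below the threshold goes to the
    RAMDirectory; anything at or above it goes to the file system.
    """
    if estimated_merge_mb < maxmergesizemb:
        return "RAMDirectory"
    return "file system"


if __name__ == "__main__":
    for size_mb in (1.5, 5.0, 40.0):
        print(f"{size_mb} MB merge -> {merge_destination(size_mb)}")
```

Raising the threshold keeps more merges in memory, which is why it needs to be considered together with the maximum RAMDirectory size.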

Managing the Consistency Level

Consistency refers to how up-to-date and synchronized a row of data is on all of its replicas. As in Cassandra, you can tune consistency in DSE Search, which extends Solr by adding an HTTP parameter, cl, that you can send with Solr data. The format of the URL is:

http://<host>:<port>/solr/<keyspace>.<column family>/update?cl=ONE

The cl parameter specifies the consistency level of the write in Cassandra. The default consistency level is QUORUM, but you can change the default using the “search.consistencylevel.write” system property.
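
To see what the default QUORUM level means in practice, the following Python sketch computes how many replicas must acknowledge a write. It assumes the standard Cassandra definition of a quorum, a majority of all replicas of the keyspace counted across data centers; the function names are hypothetical.

```python
def quorum(total_replicas):
    """Return the number of replicas that form a quorum (a strict majority)."""
    if total_replicas < 1:
        raise ValueError("total_replicas must be at least 1")
    return total_replicas // 2 + 1


def quorum_for_keyspace(replicas_by_dc):
    """Quorum size for a keyspace, summing replicas over all data centers."""
    return quorum(sum(replicas_by_dc.values()))


if __name__ == "__main__":
    # One replica in DC1 and three in DC2 gives four replicas in total.
    print(quorum_for_keyspace({"DC1": 1, "DC2": 3}))
```

With four replicas in total, a QUORUM write waits for three acknowledgements, whereas cl=ONE waits for a single replica. That difference is the trade-off between write latency and consistency that the cl parameter controls.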