DataStax Enterprise 3.0 Documentation

Tuning DSE Search performance

This documentation corresponds to an earlier product version. Make sure this document corresponds to your version.


In the event of performance degradation, high memory consumption, or other problems with DataStax Enterprise Search nodes, try the following tuning techniques:

Using column family compression

Search nodes typically engage in read-dominated tasks, so maximizing storage capacity of nodes, reducing the volume of data on disk, and limiting disk I/O can improve performance. In Cassandra 1.0 and later, you can configure data compression on a per-column family basis to optimize performance of read-dominated tasks.

Configuration affects the compression algorithm for compressing SSTable files. For read-heavy workloads, such as those carried by Enterprise Search, Snappy compression is recommended. Compression using the Snappy compressor is enabled by default when you create a column family in Cassandra 1.1 and later. You can change compression options using CQL. Developers can also implement custom compression classes using the org.apache.cassandra.io.compress.ICompressor interface. You can configure the compression chunk size for read/write access patterns and the average size of rows in the column family.
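For example, using the wiki keyspace from the demo (any column family works), you could tune the compression options with the CQL syntax of this era; the exact syntax varies by CQL version, and the chunk length value here is illustrative, not a recommendation:

```sql
ALTER TABLE wiki.solr
  WITH compression_parameters:sstable_compression = 'SnappyCompressor'
  AND compression_parameters:chunk_length_kb = 64;
```

A smaller chunk length can reduce read amplification for small rows, at the cost of a slightly lower compression ratio.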

Configuring the Search Handler

The wikipedia demo solrconfig.xml configures the SearchHandler as follows:

<requestHandler name="search" class="solr.SearchHandler" default="true">

DataStax recommends using this configuration for the SearchHandler.

Configuring the update handler and autoSoftCommit

To use the near real-time capabilities in Solr, configure solrconfig.xml to use the default high-performance update handler.

For example, the Solr configuration file for the Wikipedia demo sets this flag as follows and uncomments the autoSoftCommit element:

<!-- The default high-performance update handler -->
  <updateHandler class="solr.DirectUpdateHandler2">

  . . .

    <autoSoftCommit>
      <maxTime>1000</maxTime>
    </autoSoftCommit>
  </updateHandler>

The autoCommit element has been removed to prevent hard commits, which hit the disk and flush the cache. A soft commit forces uncommitted documents into internal memory. When data is committed, it is immediately available after the commit.

The autoSoftCommit element uses the maxTime update handler attribute. The update handler attributes enable near real-time performance and trigger a soft commit of data automatically, so checking synchronization of data to disk is not necessary. This table describes both update handler options.

Attribute  Default     Description
maxDocs    No default  Maximum number of documents added since the last soft commit before automatically triggering a new soft commit.
maxTime    1000        Maximum elapsed time in milliseconds between the addition of a document and the next automatically triggered soft commit.
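For example, a solrconfig.xml that triggers a soft commit after 500 new documents or after one second, whichever comes first, could use both attributes together (the values shown are illustrative):

```xml
<autoSoftCommit>
  <maxDocs>500</maxDocs>
  <maxTime>1000</maxTime>
</autoSoftCommit>
```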

For more information about the update handler and modifying SolrConfig.xml, see the Solr documentation.

Changing the stack size and memtable space

Some Solr users have reported that increasing the stack size improves performance under Tomcat. To increase the stack size, uncomment and modify the default -Xss128k setting in the cassandra-env.sh file. Also, decreasing the memtable space to make room for Solr caches might improve performance. Modify the memtable space using the memtable_total_space_in_mb property in the cassandra.yaml file.
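For example, the relevant settings might look like this; the values shown are illustrative starting points, not recommendations:

```
# cassandra-env.sh: uncomment and raise the default -Xss128k stack size
JVM_OPTS="$JVM_OPTS -Xss256k"

# cassandra.yaml: reduce memtable space to leave headroom for Solr caches
memtable_total_space_in_mb: 1024
```

Restart the node after changing either file for the new settings to take effect.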

Managing caching

DataStax Enterprise 3.0 defaults to using NRTCachingDirectoryFactory, which is recommended for real-time performance. These non-settable defaults determine which index files are cached and how much memory the cache uses:

  • maxMergeSizeMB = 4.0 MB
  • maxCachedMB = 48.0 MB

You can specify where files are cached, in RAM or on the file system, by setting the DSE near real-time caching directory factory in solrconfig.xml and adjusting its attributes.

To manage caching operations:

  1. Open solrconfig.xml for editing.
  2. Add a directoryFactory element to solrconfig.xml of type DSENRTCachingDirectoryFactory. For example:
<directoryFactory name="DirectoryFactory"
  class="com.datastax.bdp.cassandra.index.solr.DSENRTCachingDirectoryFactory">
  <double name="maxmergesizemb">5.0</double>
  <double name="maxcachedmb">32.0</double>
</directoryFactory>
  3. Set the DirectoryFactory attributes:

    • maxmergesizemb

      The threshold (MB) for writing a merge segment to a RAMDirectory or to the file system. If the estimated size of merging a segment is less than maxmergesizemb, the merge segment is written to the RAMDirectory; otherwise, it is written to the file system.

    • maxcachedmb

      The maximum value (MB) of the RAMDirectory.

Increasing read performance by adding replicas

You can increase DSE Search read performance by configuring replicas just as you do in Cassandra. You define a replica placement strategy and the number of replicas you want. For example, you can add replicas using the NetworkTopologyStrategy replica placement strategy. To configure this strategy, you can use CQL. For example, if you are using a PropertyFileSnitch, perform these steps:

  1. Check the data center names of your nodes using the nodetool command.

    ./nodetool -h localhost ring
    

    Note

    The data center names, DC1 and DC2 in this example, must match the data center name configured for your snitch.

  2. Start CQL on the DSE command line and create a keyspace that specifies the number of replicas you want.

    CREATE KEYSPACE test
    WITH strategy_class = 'NetworkTopologyStrategy'
    AND strategy_options:DC1 = 1
    AND strategy_options:DC2 = 3;
    
The strategy options set the number of replicas in data centers, one replica in data center 1 and three in data center 2. For more information about adding replicas, see Choosing Keyspace Replication Options.

Changing the replication factor for a Solr keyspace

When you post the solrconfig.xml and schema.xml and create or reload a Solr core, DSE Search creates a keyspace and column family in Cassandra. The default replication factor for this keyspace is 1. If you need more than one replica of the keyspace in your cluster, you need to update the replication factor of the keyspace.

The following procedure builds on the wikipedia demo example. Assume the solrconfig.xml and schema.xml files have already been posted using wiki.solr in the URL, which creates a keyspace named wiki that has a default replication factor of 1. You want three replicas of the keyspace in the cluster, so you need to update the Solr keyspace replication factor.

To change the Solr keyspace replication factor

  1. Check the name of the data center of the Solr/Search nodes.

    ./nodetool -h localhost ring
    

    The output tells you that the name of the data center for your Solr nodes is, for example, Solr. This name must match the strategy option you set in the next step.

  2. Use the pre-release version of CQL 3 (included with DataStax Enterprise 3.0) or Cassandra CLI to change the replication factor of the keyspace. Set a replication factor of 3 using CQL 3, for example:

    ALTER KEYSPACE wiki
      WITH strategy_class = 'NetworkTopologyStrategy'
      AND strategy_options:Solr = 3;
    

If you have data in a keyspace and then change the replication factor, run nodetool repair to avoid having missing data problems or data unavailable exceptions.
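For example, to repair the wiki keyspace from the local node:

```
./nodetool -h localhost repair wiki
```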

Managing the consistency level

Consistency refers to how up-to-date and synchronized a row of data is on all of its replicas. To let you tune consistency as you can in Cassandra, DSE Search extends Solr with an HTTP parameter, cl, that you can send with Solr update requests. The format of the URL is:

http://<host>:<port>/solr/<keyspace>.<column family>/update?cl=ONE

The cl parameter specifies the consistency level of the write in Cassandra on the client side. The default consistency level is QUORUM, but you can change the default globally on the server side using Cassandra's drivers and client libraries.
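For example, assuming the wiki.solr core from the demo running on localhost, and a hypothetical update file named doc.xml, you could post an update at consistency level ALL with curl:

```
curl "http://localhost:8983/solr/wiki.solr/update?cl=ALL" \
  -H 'Content-type: text/xml' --data-binary @doc.xml
```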

Setting the consistency level using SolrJ

SolrJ does not allow setting the consistency level parameter on an individual Solr update request. Instead, set it as an invariant parameter on the server instance, which then applies to every request made through that instance:

import org.apache.solr.client.solrj.impl.HttpSolrServer;

HttpSolrServer httpSolrServer = new HttpSolrServer(url);
httpSolrServer.getInvariantParams().add("cl", "ALL");

For more information, see the Data Consistency in DSE Search blog.

Configuring the available indexing threads

DSE Search provides a new multi-threaded indexing implementation to improve performance on multi-core machines. All index updates are internally dispatched to a per-core indexing thread pool and executed asynchronously. This allows for greater concurrency and parallelism, but as a consequence, index requests return a response before the indexing operation is actually executed. By default, the number of available indexing threads per Solr core is equal to the number of available CPU cores times 2. You can configure the number of threads by editing the max_solr_concurrency_per_core parameter in the dse.yaml configuration file. If you set this parameter to 1, DSE Search reverts to the synchronous indexing behavior of the previous release.
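For example, on an 8-core machine the default pool size would be 16 threads per Solr core; a dse.yaml entry capping it might look like this (the value shown is illustrative):

```
# dse.yaml: number of indexing threads per Solr core
# (set to 1 to restore synchronous indexing)
max_solr_concurrency_per_core: 8
```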

DSE Search also provides advanced JMX-based configurability and visibility through the IndexPool-ks.cf MBean, where ks.cf is the name of a DSE Search Solr core, under the com.datastax.bdp namespace.