DataStax Enterprise 3.1 Documentation

Tuning DSE Search performance

This documentation corresponds to an earlier product version. Make sure this document corresponds to your version.


In the event of performance degradation, high memory consumption, or other problems with DataStax Enterprise Search nodes, try the approaches described in the following sections.

Using table compression

Search nodes typically engage in read-dominated tasks, so maximizing storage capacity of nodes, reducing the volume of data on disk, and limiting disk I/O can improve performance. In Cassandra 1.0 and later, you can configure data compression on a per-table basis to optimize performance of read-dominated tasks.

Compression options determine the algorithm used to compress SSTable files. For read-heavy workloads, such as those carried by Enterprise Search, Snappy compression is recommended. Snappy compression is enabled by default when you create a table in Cassandra 1.1 and later. You can change compression options using CQL. Developers can also implement custom compression classes using the org.apache.cassandra.io.compress.ICompressor interface. You can configure the compression chunk size to suit the read/write access patterns and the average row size of the table.
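For example, compression options for an existing table can be adjusted with CQL. This is an illustrative sketch that assumes a table named users and the map-style compression syntax of CQL 3:

```sql
-- Illustrative: switch the (hypothetical) users table to Snappy
-- compression with a 64 KB chunk size.
ALTER TABLE users
  WITH compression = { 'sstable_compression' : 'SnappyCompressor',
                       'chunk_length_kb' : 64 };
```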

Configuring re-indexing and repair

When running the RELOAD command using the reindex or deleteAll options, a long delay might indicate that tuning is needed. Tune the performance of re-indexing and index rebuilding by making a few changes in the solrconfig.xml file.

  1. Increase the size of the RAM buffer, which is set to 100 MB by default, to 125 MB, for example.

    <indexConfig>
      <useCompoundFile>false</useCompoundFile>
      <ramBufferSizeMB>125</ramBufferSizeMB>
      <mergeFactor>10</mergeFactor>
    . . .
    
  2. Increase the soft commit time, which is set to 1000 ms by default, to a larger value. For example, increase the time to 1,000,000 ms (nearly 17 minutes):

    <autoSoftCommit>
      <maxTime>1000000</maxTime>
    </autoSoftCommit>
    

The downside of increasing the autoSoftCommit maxTime is that newly updated rows take longer than the default 1000 ms to appear in search results.

Configuring update performance

If updates take too long and you raised autoSoftCommit above its default of 1000 ms, reset it in solrconfig.xml to the default value.

Configuring the Search Handler

The wikipedia demo solrconfig.xml configures the SearchHandler as follows:

<requestHandler name="search" class="solr.SearchHandler" default="true">

DataStax recommends using this configuration for the SearchHandler.

Configuring the update handler and autoSoftCommit

To use Solr's near real-time capabilities, configure solrconfig.xml to set the default high-performance update handler flag.

For example, the Solr configuration file for the Wikipedia demo sets this flag as follows and uncomments the autoSoftCommit element:

<!-- The default high-performance update handler -->
 <updateHandler class="solr.DirectUpdateHandler2">

 . . .

  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>

The autoCommit element has been removed to prevent hard commits, which hit the disk and flush the cache. A soft commit forces uncommitted documents into internal memory. When the data is committed, it is immediately available after the commit.

The autoSoftCommit element uses the maxTime update handler attribute. The update handler attributes enable near real-time performance and trigger a soft commit of data automatically, so checking synchronization of data to disk is not necessary. This table describes both update handler options.

Attribute | Default    | Description
maxDocs   | No default | Maximum number of documents to add since the last soft commit before automatically triggering a new soft commit.
maxTime   | 1000       | Maximum elapsed time in milliseconds between the addition of a document and the next automatically triggered soft commit.
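If both attributes are set, a soft commit is triggered when either threshold is reached first. As an illustrative sketch combining the two (the values shown are examples, not recommendations):

```xml
<autoSoftCommit>
  <!-- soft commit after 10000 added documents or 1000 ms,
       whichever limit is reached first -->
  <maxDocs>10000</maxDocs>
  <maxTime>1000</maxTime>
</autoSoftCommit>
```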

For more information about the update handler and modifying solrconfig.xml, see the Solr documentation.

Changing the stack size and memtable space

Some Solr users have reported that increasing the stack size improves performance under Tomcat. To increase the stack size, uncomment and modify the default -Xss128k setting in the cassandra-env.sh file. Also, decreasing the memtable space to make room for Solr caches might improve performance. Modify the memtable space using the memtable_total_space_in_mb property in the cassandra.yaml file.
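As a sketch, the two changes look like this; the values shown are illustrative starting points, not recommendations, and should be tuned for your workload:

```
# cassandra-env.sh: uncomment the stack size option and raise it
JVM_OPTS="$JVM_OPTS -Xss256k"

# cassandra.yaml: reduce memtable space to leave headroom for Solr caches
memtable_total_space_in_mb: 1024
```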

Managing caching

DataStax Enterprise 3.0 and later defaults to using NRTCachingDirectoryFactory, which is recommended for real-time performance. These defaults, which cannot be changed, specify where files are cached and how cached files are managed:

  • maxMergeSizeMB = 4.0 MB
  • maxCachedMB = 48.0 MB

Alternatively, you can control where files are cached, in RAM or on the file system, by configuring the DSE near real-time caching directory factory in solrconfig.xml and setting its attributes.

To manage caching operations:

  1. Open solrconfig.xml for editing.

  2. Add a directoryFactory element to solrconfig.xml of type DSENRTCachingDirectoryFactory. For example:

    <directoryFactory name="DirectoryFactory"
      class="com.datastax.bdp.cassandra.index.solr.DSENRTCachingDirectoryFactory">
      <double name="maxmergesizemb">5.0</double>
      <double name="maxcachedmb">32.0</double>
    </directoryFactory>
    
  3. Set the DirectoryFactory attributes:

    • maxmergesizemb

      The threshold (MB) for writing a merge segment to a RAMDirectory or to the file system. If the estimated size of merging a segment is less than maxmergesizemb, the merge segment is written to the RAMDirectory; otherwise, it is written to the file system.

    • maxcachedmb

      The maximum value (MB) of the RAMDirectory.

Increasing read performance by adding replicas

You can increase DSE Search read performance by configuring replicas just as you do in Cassandra: define a replica placement strategy and the number of replicas you want. For example, you can add replicas using the NetworkTopologyStrategy replica placement strategy, which you configure using CQL. If you are using a PropertyFileSnitch, perform these steps:

  1. Check the data center names of your nodes using the nodetool command.

    ./nodetool -h localhost ring
    

    The data center names, DC1 and DC2 in this example, must match the data center name configured for your snitch.

  2. Start CQL on the DSE command line and create a keyspace that specifies the data center names and the number of replicas you want in each data center. For example, specify one replica in data center DC1 and three in data center DC2. For more information about adding replicas, see Choosing Keyspace Replication Options.
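As an illustrative sketch, a keyspace with one replica in data center DC1 and three in DC2 (the keyspace name is hypothetical) could be created as follows:

```sql
CREATE KEYSPACE mykeyspace
  WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
                       'DC1' : 1, 'DC2' : 3 };
```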

Changing the replication factor for a Solr keyspace

The following procedure builds on the example in Using DSE Search/Solr. Assume the solrconfig.xml and schema.xml files have already been posted using mykeyspace.mysolr in the URL, which creates a keyspace named mykeyspace that has a default replication factor of 1. You want three replicas of the keyspace in the cluster, so you need to update the Solr keyspace replication factor.

To change the Solr keyspace replication factor

  1. Check the name of the data center of the Solr/Search nodes.

    ./nodetool -h localhost ring
    

    The output tells you that the name of the data center for your node is, for example, datacenter1.

  2. Use CQL 3 to change the replication factor of the keyspace. Set a replication factor of 3, for example:

    ALTER KEYSPACE mykeyspace WITH REPLICATION =  { 'class' :
      'NetworkTopologyStrategy', 'datacenter1' : 3 };
    

If a keyspace already contains data when you change its replication factor, run nodetool repair afterward to avoid missing data problems or data unavailable exceptions.
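For example, assuming the keyspace from the steps above, the repair could be run on each node as follows:

```
./nodetool -h localhost repair mykeyspace
```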

Managing the consistency level

Consistency refers to how up-to-date and synchronized a row of data is on all of its replicas. As in Cassandra, you can tune the consistency of writes: DSE Search extends Solr by adding an HTTP parameter, cl, that you can send with Solr data. The format of the URL is:

curl "http://<host>:<port>/solr/<keyspace>.<table>/update?cl=ONE"

The cl parameter specifies the consistency level of the Cassandra write on the client side. The default consistency level is QUORUM, but you can change the default globally using Cassandra's drivers and client libraries.

Setting the consistency level using SolrJ

SolrJ does not allow setting the consistency level parameter on a Solr update request. Instead, set it as an invariant parameter on the server instance:

HttpSolrServer httpSolrServer = new HttpSolrServer(url);
httpSolrServer.getInvariantParams().add("cl", "ALL");

For more information, see the Data Consistency in DSE Search blog.

Configuring the available indexing threads

DSE Search provides a multi-threaded indexing implementation to improve performance on multi-core machines. All index updates are internally dispatched to a per-core indexing thread pool and executed asynchronously. This implementation allows for greater concurrency and parallelism, but as a consequence, index requests return a response before the indexing operation actually executes. By default, the number of indexing threads per Solr core is twice the number of available CPU cores. Configure the number of available threads by editing the max_solr_concurrency_per_core parameter in the dse.yaml configuration file; if set to 1, DSE Search uses the legacy synchronous indexing implementation.
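For example, the dse.yaml setting might look like the following sketch; the value is illustrative and should be chosen to match your hardware:

```yaml
# dse.yaml: cap indexing at 4 threads per Solr core
# (setting this to 1 restores the legacy synchronous behavior)
max_solr_concurrency_per_core: 4
```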

Also, DSE Search provides advanced, JMX-based, configurability and visibility through the IndexPool-ks.cf (where ks.cf is the name of a DSE Search Solr core) MBean under the com.datastax.bdp namespace.