DataStax Enterprise 3.1 Documentation

Common operations

This documentation corresponds to an earlier product version. Make sure you are using the documentation that matches your product version.


You can run Solr on one or more nodes. DataStax does not support running Solr and Hadoop on the same node, although it is possible to do so in a development environment. In production environments, run real-time (Cassandra), analytics (Hadoop), and DSE Search (Solr) workloads on separate nodes, in separate data centers.

The following sections describe common DSE Search/Solr operations.

Fast repair

Repairing subranges of data in a cluster is faster than running a nodetool repair operation on entire ranges because all the data replicated during the repair operation must be re-indexed. When you repair only a subrange of the data, less data is replicated and therefore less data must be re-indexed.

To repair a subrange

Perform these steps as a rolling repair of the cluster, one node at a time.

  1. Run the dsetool list_subranges command, specifying the keyspace, the table, the approximate number of rows per subrange, and the start and end tokens of the node's partition range.

    dsetool list_subranges my_keyspace my_table 10000 113427455640312821154458202477256070485 0
    

    The output lists the subranges.

    Start Token                             End Token                               Estimated Size
    ------------------------------------------------------------------------------------------------
    113427455640312821154458202477256070485 132425442795624521227151664615147681247 11264
    132425442795624521227151664615147681247 151409576048389227347257997936583470460 11136
    151409576048389227347257997936583470460 0                                       11264
    
  2. Use the output of the previous step as input to the nodetool repair command.

    nodetool repair my_keyspace my_table -st 113427455640312821154458202477256070485 \
      -et 132425442795624521227151664615147681247
    nodetool repair my_keyspace my_table -st 132425442795624521227151664615147681247 \
      -et 151409576048389227347257997936583470460
    nodetool repair my_keyspace my_table -st 151409576048389227347257997936583470460 \
      -et 0
    

The anti-entropy node repair runs from the start to the end of the partition range.
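
Because each repair command is built directly from a row of the dsetool output, the loop over subranges can be scripted. The following is a minimal sketch, assuming the keyspace, table, and token range from the example above and the two-line header shown in the example output; review the generated commands before using this on a production node.

# List the subranges for the node's partition range, skip the two header
# lines, and run one repair per start/end token pair.
dsetool list_subranges my_keyspace my_table 10000 \
  113427455640312821154458202477256070485 0 |
tail -n +3 |
while read start_token end_token estimated_size; do
  nodetool repair my_keyspace my_table -st "$start_token" -et "$end_token"
done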

Handling inconsistencies in query results

Due to the nature of a distributed system, the DSE Search/Solr consistency level of ONE, and other factors, Solr queries can return inconsistent results. For example, Solr might return different numFound counts from consecutive queries.

An efficient way of achieving consistent results is to repair nodes using the subrange repair method.
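
You can observe the inconsistency by repeating the same query and comparing the numFound values. The following is a minimal sketch against the Solr HTTP API; the core name my_keyspace.my_table is a placeholder.

# Run the identical query twice with rows=0 to return only the counts;
# on a cluster in need of repair, the two numFound values can differ.
curl "http://localhost:8983/solr/my_keyspace.my_table/select?q=*:*&wt=json&rows=0"
curl "http://localhost:8983/solr/my_keyspace.my_table/select?q=*:*&wt=json&rows=0"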

Adding a new Solr node

To increase the number of nodes in a Solr cluster, add a DSE node to the cluster. To increase search capacity, add the node and then, optionally, rebalance the cluster. To add a Solr node, use the same method that you use to add a Cassandra node. Using the default DSESimpleSnitch automatically puts all Solr nodes in the same data center. Use OpsCenter Enterprise to rebalance the cluster.

Decommissioning and repairing a node

You can decommission and repair a Solr node in the same manner as you would a Cassandra node. The efficient and recommended way to repair a node, or cluster, is to use the subrange repair method.
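
For example, decommissioning uses the standard Cassandra tooling; run the command on the node that is leaving the cluster:

# Streams the node's data to the remaining replicas, then removes the
# node from the ring.
nodetool decommission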

Managing the location of Solr data

Solr has its own set of data files. As with Cassandra data files, you can control where the Solr data files are saved on the server. By default, the data is saved in <Cassandra data directory>/solr.data. You can change the location from the <Cassandra data directory> to another directory from the command line. For example, on Linux:

cd <install_directory>

bin/dse cassandra -s -Ddse.solr.data.dir=/opt

In this example, the Solr data is saved in the /opt directory.

Viewing the Solr log

DSE Search logs Solr log messages in the Cassandra system log:

/var/log/cassandra/system.log
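
Solr messages are interleaved with the rest of the Cassandra output, so filtering the log can help; for example:

# Follow the system log, showing only lines that mention Solr.
tail -f /var/log/cassandra/system.log | grep -i solr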

Changing the Solr logging level

Assuming you configured and are using the Apache log4j utility, you can control the granularity of Solr log messages, as well as other log messages, in the Cassandra system.log file by configuring the log4j-server.properties file. The log4j-server.properties file is located in:

Packaged installations: /etc/dse/cassandra

Binary installations: <dse install>/resources/cassandra/conf

To set log levels, configure the log4j.rootLogger value, specifying one of these values:

  • ALL - turn on all logging
  • TRACE - more detailed than DEBUG
  • DEBUG - detailed information on the flow through the system
  • INFO - highlight the progress of the application at a coarse-grained level
  • WARN - use of deprecated APIs, poor use of API, near errors, and other undesirable or unexpected runtime situations
  • ERROR - other runtime errors or unexpected conditions
  • FATAL - severe errors causing premature termination
  • OFF - no logging

For example, open the log4j-server.properties file and change the log level by configuring the log4j.rootLogger value:

# output messages into a rolling log file as well as stdout
log4j.rootLogger=INFO,stdout
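
Standard log4j package-level loggers also work here, so you can adjust the verbosity of Solr classes independently of the root logger. The sketch below assumes the stock org.apache.solr package namespace:

# Keep the root logger at INFO, but log only warnings and errors
# from Solr classes.
log4j.rootLogger=INFO,stdout
log4j.logger.org.apache.solr=WARN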

Accessing the validation log

DSE Search stores validation errors that arise from non-indexable data sent from non-Solr nodes in this log:

/var/log/cassandra/solrvalidation.log

For example, if a Cassandra node that is not running Solr puts a string in a date field, an exception is logged for that column when the data is replicated to the Solr node.

Changing the Solr connector port

To change the Solr port from the default, 8983, change the http.port setting in the catalina.properties file installed with DSE in <dse-version>/resources/tomcat/conf.
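
For example, to move the connector to port 8984, edit the http.port line in catalina.properties:

# <dse-version>/resources/tomcat/conf/catalina.properties
http.port=8984

Restart the node for the new port to take effect.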

Securing a Solr cluster

DataStax Enterprise supports secure enterprise search using Apache Solr 4.3 and Lucene. The security table summarizes the security features of DSE Search/Solr and other integrated components. DSE Search data is completely or partially secured by using DataStax Enterprise security features:

  • Object permission management

    Access to Solr documents, excluding cached data, can be limited to users who have been granted access permissions. Permission management also secures tables used to store Solr data.

  • Transparent data encryption

    Data at rest in Cassandra tables, excluding cached and Solr-indexed data, can be encrypted. Encryption occurs on the Cassandra side and impacts performance slightly.

  • Client-to-node encryption

    You can encrypt HTTP access to Solr data and internal, node-to-node Solr communication using SSL. Enable SSL node-to-node encryption on the Solr node by setting encryption options in the dse.yaml file as described in Client-to-node encryption.

  • Kerberos authentication

    You can authenticate DSE Search users through Kerberos authentication using Simple and Protected GSSAPI Negotiation Mechanism (SPNEGO). To use the SolrJ API against DSE Search clusters with Kerberos authentication, client applications should use the SolrJ-Auth library and the DataStax Enterprise SolrJ component as described in the solrj-auth-README.md file.

You can also use HTTP Basic Authentication, but this is not recommended.

HTTP Basic Authentication

When you enable Cassandra's internal authentication by specifying authenticator: org.apache.cassandra.auth.PasswordAuthenticator in cassandra.yaml, clients must use HTTP Basic Authentication to provide credentials to Solr services. Due to the stateless nature of HTTP Basic Authentication, this can have a significant performance impact as the authentication process must be executed on each HTTP request. For this reason, DataStax does not recommend using internal authentication on DSE Search clusters in production. To secure DSE Search in production, enable DataStax Enterprise Kerberos authentication.

To configure DSE Search to use Cassandra's internal authentication, follow this configuration procedure:

  1. Comment out AllowAllAuthenticator and uncomment PasswordAuthenticator in cassandra.yaml to enable HTTP Basic authentication for Solr.

    #authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
    authenticator: org.apache.cassandra.auth.PasswordAuthenticator
    #authenticator: com.datastax.bdp.cassandra.auth.PasswordAuthenticator
    #authenticator: com.datastax.bdp.cassandra.auth.KerberosAuthenticator
    
  2. Configure the replication strategy for the system_auth keyspace.

  3. Start the server.

  4. Open a browser, and go to the service web page. For example, assuming you ran the wikipedia demo, go to http://localhost:8983/demos/wikipedia/.

    The browser asks you for a Cassandra username and password.
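
Non-browser clients supply the same credentials through standard HTTP Basic Authentication; for example, with curl (the user name, password, and core name below are placeholders):

# Credentials are sent with every request because Basic Authentication
# is stateless.
curl -u my_user:my_password \
  "http://localhost:8983/solr/my_keyspace.my_table/select?q=*:*&wt=json"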

Excluding hosts from Solr-distributed queries

You can exclude hosts from Solr-distributed queries in DataStax Enterprise 3.1.2 and later. To exclude hosts from queries, perform these steps on each node that you want to send queries to.

  1. Navigate to the Solr conf directory:
    • Packaged installations: /usr/share/dse/solr/conf
    • Tarball installations: <dse install>/resources/solr/conf
  2. Open the exclude.hosts file, and add the list of nodes to be excluded, one host name per line.
  3. Update the list of routing endpoints on each node by calling the JMX operation refreshEndpoints() on the com.datastax.bdp:type=ShardRouter MBean, as shown in the sketch below.
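
Any JMX client can invoke the operation. The following sketch uses the third-party jmxterm utility and assumes the default Cassandra JMX port of 7199:

# Open an interactive JMX session against the node.
java -jar jmxterm.jar -l localhost:7199

# At the jmxterm prompt, refresh the node's routing endpoints.
run -b com.datastax.bdp:type=ShardRouter refreshEndpoints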

Using the ShardRouter MBean

DataStax Enterprise 3.1.2 and later expose the com.datastax.bdp:type=ShardRouter MBean, which provides the following operations:

  • getShardSelectionStrategy(String core) Retrieves the name of the shard selection strategy used for the given core.
  • getEndpoints(String core) Retrieves the list of endpoints that can be queried for the given core.
  • getEndpointLoad(String core) Retrieves the list of endpoints with their related query load for the given core; the load is computed as 1-minute, 5-minute, and 15-minute exponentially weighted moving averages, based on the number of queries received by the given node.
  • refreshEndpoints() Manually refreshes the list of endpoints to be used for querying cores.
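
The read-only operations take a core name. The following sketch runs at the same jmxterm prompt as above, where my_keyspace.my_table is a placeholder core name:

# Show which endpoints can serve queries for the core, and their
# 1-, 5-, and 15-minute query load averages.
run -b com.datastax.bdp:type=ShardRouter getEndpoints my_keyspace.my_table
run -b com.datastax.bdp:type=ShardRouter getEndpointLoad my_keyspace.my_table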