DataStax Developer Blog

DSE Solr Index Backup, Restore and Re-index

By Alex Liu - July 24, 2012

Backup, restore, and re-index are common and important tasks in data center operations. This post gives a high-level overview of the Solr integration in DSE and explains how to back up, restore, and re-index Solr indexes in DSE.

Solr Integration in DSE

Solr integration features include:

  • Embedded Solr: The Solr core shares the same JVM as the DSE server.
  • Distributed search: DSE takes advantage of Cassandra’s linearly scalable peer-to-peer architecture and replication mechanism to build distributed search.
  • No master-slave complexity: Thanks to Cassandra’s P2P architecture, DSE removes the need for a master-slave replication architecture.
  • Data replication: DSE stores the indexed data in Cassandra column families. The data is replicated around the cluster ring.
  • Lucene native index file format: DSE stores index files in the native Lucene index file format, which takes advantage of Lucene’s index improvements.
  • Workload separation: All Solr nodes can be configured in the same data center so that search doesn’t slow down nodes in other data centers, e.g. a Hadoop data center for analysis and a real-time data center.

The post on Cassandra with Solr integration details has a more in-depth discussion.

Re-index

Because the Solr index implements the Cassandra secondary index API, a Solr index can be rebuilt the same way Cassandra rebuilds its secondary indexes. The nodetool rebuild_index command rebuilds the native secondary indexes for a given column family. Run the nodetool rebuild_index command, specifying the host name, keyspace, column family and indexes.

   $ nodetool -h localhost rebuild_index [keyspace] [cf-name] [idx1,idx2...]
   Solr index name: [keyspace].[cf-name]
   Example: nodetool -h localhost rebuild_index ks cf ks.cf
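
When several column families are indexed, the index name can be derived from the keyspace and column family instead of typed by hand. The following is only a sketch: the function name rebuild_index_cmd and the column families cf1 and cf2 are made up for illustration, and the script prints the commands rather than running them.

```shell
#!/bin/sh
# Print (do not run) the rebuild_index command for one Solr-indexed column family.
# The Solr index name is derived as [keyspace].[cf-name], as described above.
# rebuild_index_cmd is a hypothetical helper, not part of nodetool or DSE.
rebuild_index_cmd() {
    host="$1"; keyspace="$2"; cf="$3"
    printf 'nodetool -h %s rebuild_index %s %s %s.%s\n' \
        "$host" "$keyspace" "$cf" "$keyspace" "$cf"
}

# Hypothetical keyspace "ks" with column families "cf1" and "cf2".
for cf in cf1 cf2; do
    rebuild_index_cmd localhost ks "$cf"
done
```

After reviewing the printed commands, they can be piped to sh to execute them.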

Follow the steps below to rebuild your Solr indexes.

  1. Stop the DSE node.
  2. Delete all sub-directories under the /var/lib/cassandra/data/solr.data/ directory.
  3. Start DSE in search mode.
  4. Run nodetool rebuild_index for each column family.

Backup and Restore

We can use Cassandra’s nodetool snapshot command to back up the indexed data.

Lucene index files are stored in local file directories, while the indexed data is stored in Cassandra column families. Because the Lucene index files and the indexed data are stored separately, they would have to be backed up by different processes, and it is hard to synchronize two different backup processes. Cassandra’s index rebuild feature simplifies backup and restore: it is no longer necessary to back up the Lucene index files. We can back up only the indexed data and rebuild all Solr index files from it, which also keeps the indexed data and the Lucene index files in sync.

The Cassandra backup and restore guide describes in detail how to back up and restore Cassandra data. The backup steps are listed below.

Create a snapshot of a node

Run the nodetool snapshot command, specifying the hostname, JMX port and snapshot name.

   $ nodetool -h localhost -p [jmxPort] snapshot [keyspaces...] -cf [columnfamilyName] -t [snapshotName]
   Example: nodetool -h localhost -p 7199 snapshot -t 07232012

By default the snapshot files are stored in the /var/lib/cassandra/data/<keyspace_name>/<column_family_name>/snapshots directory. You can also enable incremental backups, which combine with snapshots to provide a dependable, up-to-date backup mechanism.
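
The example above names the snapshot after the date it was taken (07232012). A small sketch of generating such a name automatically is shown below. The helper name snapshot_cmd is hypothetical, and the MMDDYYYY format is an assumption inferred from the example; the script only prints the command.

```shell
#!/bin/sh
# Print (do not run) a nodetool snapshot command for the given host, JMX port and name.
# snapshot_cmd is a hypothetical helper used only for this illustration.
snapshot_cmd() {
    host="$1"; port="$2"; name="$3"
    printf 'nodetool -h %s -p %s snapshot -t %s\n' "$host" "$port" "$name"
}

# Assumed naming scheme: today's date as MMDDYYYY, matching the 07232012 example.
snapshot_name="$(date +%m%d%Y)"
snapshot_cmd localhost 7199 "$snapshot_name"
```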

Restore from a snapshot

  1. Shut down the node to be restored (disablegossip, disablethrift, drain, shutdown).
  2. Delete all files in /var/lib/cassandra/commitlog (the default commitlog directory).
  3. Delete all *.db files in the <data_directory_location>/<keyspace_name>/<column_family_name> directory, but DO NOT delete the /snapshots and /backups subdirectories.
  4. Locate the most recent snapshot folder in <data_directory_location>/<keyspace_name>/<column_family_name>/snapshots/<snapshot_name>, and copy its contents into the <data_directory_location>/<keyspace_name>/<column_family_name> directory.
  5. If using incremental backups as well, copy all contents of <data_directory_location>/<keyspace_name>/<column_family_name>/backups into <data_directory_location>/<keyspace_name>/<column_family_name>.
  6. Delete all sub-directories under the /var/lib/cassandra/data/solr.data/ directory.
  7. Restart the node, keeping in mind that a temporary burst of I/O activity will consume a large amount of CPU resources. You may see some IndexNotFoundExceptions in the system log; you can ignore them. They are thrown because the old index files were just removed.
  8. Rebuild all Solr indexes by using nodetool rebuild_index.
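
Because steps 2 through 6 delete and copy files, it can help to print the planned operations and review them before touching a node. The following dry-run sketch does only that. The function name restore_plan and its arguments are invented for illustration, and the commit log and solr.data paths are the defaults quoted in the steps above; adjust them if your installation differs.

```shell
#!/bin/sh
# Dry run: print the file operations for restore steps 2-6 without executing them.
# Arguments: data directory, keyspace, column family, snapshot name, and
# "yes" if incremental backups should be copied as well (step 5).
# restore_plan is a hypothetical helper, not a DSE or Cassandra tool.
restore_plan() {
    data_dir="$1"; keyspace="$2"; cf="$3"; snapshot="$4"; incremental="$5"
    cf_dir="$data_dir/$keyspace/$cf"
    echo "rm -f /var/lib/cassandra/commitlog/*"            # step 2: default commitlog directory
    echo "rm -f $cf_dir/*.db"                              # step 3: files only, subdirectories untouched
    echo "cp -p $cf_dir/snapshots/$snapshot/* $cf_dir/"    # step 4: restore snapshot contents
    if [ "$incremental" = "yes" ]; then
        echo "cp -p $cf_dir/backups/* $cf_dir/"            # step 5: incremental backups
    fi
    echo "rm -rf /var/lib/cassandra/data/solr.data/*/"     # step 6: Solr index sub-directories only
}

# Hypothetical keyspace "ks", column family "cf" and snapshot "07232012".
restore_plan /var/lib/cassandra/data ks cf 07232012 no
```

Shutting down and restarting the node and running nodetool rebuild_index (steps 1, 7 and 8) are left out of the sketch and still have to be done by hand.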

Cluster Backup

Use a parallel ssh tool (such as pssh) to run nodetool snapshot on every node in the cluster in parallel. This provides an eventually consistent backup. The syntax is as follows.

   Usage: pssh [OPTIONS] -h hosts.txt prog [arg0] ..
   -h --hosts hosts file (each line "host[:port] [user]")
   -l --user username (OPTIONAL)
   -p --par max number of parallel threads (OPTIONAL)
   -o --outdir output directory for stdout files (OPTIONAL)
   -t --timeout timeout in seconds to do ssh to a host (OPTIONAL)
   -v --verbose turn on warning and diagnostic messages (OPTIONAL)
   -O --options SSH options (OPTIONAL)
   Example: pssh -h ips.txt -l aliu -o /tmp/foo nodetool -h localhost -p 7199 snapshot -t 12022011
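
The hosts file passed with -h holds one node per line. A minimal sketch of preparing it and assembling the cluster-wide snapshot command follows. The helper write_hosts_file and the 10.0.0.x addresses are placeholders chosen for illustration, the user and output directory are taken from the example above, and the pssh command is printed rather than executed.

```shell
#!/bin/sh
# Write one node address per line to a hosts file for pssh.
# write_hosts_file is a hypothetical helper; the addresses below are placeholders.
write_hosts_file() {
    out="$1"; shift
    : > "$out"                          # create or truncate the file
    for node in "$@"; do
        printf '%s\n' "$node" >> "$out"
    done
}

hosts_file="$(mktemp)"
write_hosts_file "$hosts_file" 10.0.0.1 10.0.0.2 10.0.0.3

# Print the cluster-wide snapshot command for review instead of running it.
echo "pssh -h $hosts_file -l aliu -o /tmp/foo nodetool -h localhost -p 7199 snapshot -t $(date +%m%d%Y)"
```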

Backup Solr Indexes

If you do have to back up the Solr indexes themselves, you can use the Solr backup and backupcleaner scripts. Each backup is local to a node, so for a cluster-wide backup you need a parallel ssh tool (such as pssh). By default, the Solr index files are in the /var/lib/cassandra/data/solr.data directory.

Use the backup script to back up the Solr index files.

   usage: backup [-d dir] [-u username] [-v]
   -d          specify directory holding index data
   -u          specify user to sudo to before running script
   -v          increase verbosity
   -V          output debugging info
   Example: backup -d /var/lib/cassandra/data/solr.data -u root -v

Use the backupcleaner script to clean up old backup index files.

   usage: backupcleaner -D <days> | -N <num> [-d dir] [-u username] [-v]
   -D <days>   clean up backups more than <days> days old
   -N <num>    keep the most recent <num> backups and
               clean up the remaining ones that are not being pulled
   -d          specify directory holding index data
   -u          specify user to sudo to before running script
   -v          increase verbosity
   -V          output debugging info
   Example: backupcleaner -D 2 -d /var/lib/cassandra/data/solr.data -u root



Comments

  1. For anyone unable to find the backup and backupcleaner scripts, they are available here: http://svn.apache.org/viewvc/lucene/dev/trunk/solr/scripts/
