DataStax Enterprise 3.1 Documentation

Using DSE Search/Solr


When you update a table using CQL or CLI, the Solr document is updated. When you update a Solr document using the Solr API, the table is updated. Re-indexing occurs automatically after an update.


[Figure: data flow between Cassandra tables and Solr documents in DSE Search]
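
For example, a minimal sketch of this round trip, assuming the mykeyspace.mysolr table created in the tutorial below and a Solr node listening on the default port 8983:

# Update a column through CQL; DSE Search re-indexes the Solr document
# automatically, so no explicit commit is needed.
echo "UPDATE mykeyspace.mysolr SET title = 'Wisdom' WHERE id = '123';" | ./cqlsh

# The change is then visible through the Solr HTTP API.
curl "http://localhost:8983/solr/mykeyspace.mysolr/select?q=title%3AWisdom&wt=json&indent=on"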

Writes are durable. A Solr API client writes data to Cassandra first, and then Cassandra updates indexes. All writes to a replica node are recorded both in memory and in a commit log before they are acknowledged as a success. If a crash or server failure occurs before the memory tables are flushed to disk, the commit log is replayed on restart to recover any lost writes.

The commit log replaces the Solr updateLog, which is not supported in DSE Search/Solr. Consequently, features that require the updateLog, such as atomic updates and real-time get, are not supported. In Cassandra, a write is atomic at the row level: inserting or updating columns in a row is treated as one write operation.

The Solr index update operation is similar to a Cassandra index update. If the old column value was still in the Cassandra memtable, Cassandra removes the index entry; otherwise, the old entry remains to be purged by compaction. If a read sees a stale index entry before compaction purges it, the reader thread invalidates it. You can also trigger the expiration of search data.

Inserting, deleting, and searching data

The following examples show you how to perform basic operations.

You can insert data into Solr in several ways. These examples cover two of them: using CQL and using the Solr HTTP API.

Example Using CQL

  1. If you have not already done so, create a directory named solr_tutorial, and copy the schema.xml and solrconfig.xml files from the wikipedia demos directory to the solr_tutorial directory.

  2. After starting DSE as a Solr node, open a shell window or tab, navigate to the bin directory (on Linux, for example), and start cqlsh:

    ./cqlsh
    
  3. Create a keyspace and a table, and then insert some data for DSE Search to index. You need to use the WITH COMPACT STORAGE directive when defining the table.

    CREATE KEYSPACE mykeyspace
      WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
    
    USE mykeyspace;
    
    CREATE TABLE mysolr (
      id text PRIMARY KEY,
      name text,
      title text,
      body text
    ) WITH COMPACT STORAGE;
    
    INSERT INTO mysolr (id, name, title, body) VALUES ('123', 'Christopher Morley', 'Life',
      'Life is a foreign language; all men mispronounce it.');
    
    INSERT INTO mysolr (id, name, title, body) VALUES ('124', 'Daniel Akst', 'Life',
      'In matters of self-control as we shall see again and again, speed kills.
      But a little friction really can save lives.');
    
    INSERT INTO mysolr (id, name, title, body) VALUES ('125', 'Abraham Lincoln', 'Success',
      'Always bear in mind that your own resolution to succeed is more important
      than any one thing.');
    
    INSERT INTO mysolr (id, name, title, body) VALUES ('126', 'Albert Einstein', 'Success',
      'If A is success in life, then A equals x plus y plus z. Work is x; y is
      play; and z is keeping your mouth shut.');
    
  4. Change the schema.xml file to contain the schema shown in the Sample schema section.

  5. On the command line in the solr_tutorial directory, post the configuration file using the cURL utility.

    curl http://localhost:8983/solr/resource/mykeyspace.mysolr/solrconfig.xml \
      --data-binary @solrconfig.xml -H 'Content-type:text/xml; charset=utf-8'
    
  6. Post the schema file:

    curl http://localhost:8983/solr/resource/mykeyspace.mysolr/schema.xml \
      --data-binary @schema.xml -H 'Content-type:text/xml; charset=utf-8'
    
  7. Create a Solr core.

    curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=mykeyspace.mysolr"
    

    If you are re-creating the mykeyspace.mysolr core, use the RELOAD command instead of the CREATE command.
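
    For example (a sketch, assuming the core was created as shown above):

    curl "http://localhost:8983/solr/admin/cores?action=RELOAD&name=mykeyspace.mysolr"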

  8. Search Cassandra using the Solr HTTP API to find titles like Succ*.

    http://localhost:8983/solr/mykeyspace.mysolr/select?q=title%3ASucc*&wt=json&indent=on&omitHeader=on
    

    The response is:

    {
       "response":{"numFound":2,"start":0,"docs":[
           {
             "id":"125",
             "body":"Always bear in mind that your own resolution to succeed
             is more important\n than any one thing.",
             "name":"Abraham Lincoln",
             "title":"Success"},
           {
             "id":"126",
             "body":"If A is success in life, then A equals x plus y plus z.
             Work is x; y is\n play; and z is keeping your mouth shut.",
             "name":"Albert Einstein",
             "title":"Success"}]
       }}
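
    The same search can be run from the command line with cURL instead of a browser; quote the URL so the shell does not interpret the & characters:

    curl "http://localhost:8983/solr/mykeyspace.mysolr/select?q=title%3ASucc*&wt=json&indent=on&omitHeader=on"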
    

Example using the Solr HTTP API

You can use the Solr HTTP REST API to insert data into, modify, or delete data from a Solr node. Even when you update only a single field, the document is re-indexed in full. After writing the field modifications as a JSON string, use a URL in the following format to update the document:

curl "http://<host>:<port>/solr/<keyspace>.<table>/update?replacefields=false" \
  -H 'Content-type: application/json' -d '<json string>'

Using this format to insert data into the Cassandra table and Solr index created in the previous example, the curl command is:

curl "http://localhost:8983/solr/mykeyspace.mysolr/update?replacefields=false" \
  -H 'Content-type: application/json' \
  -d '[{"id":"130", "body":"Life is a beach.", "name":"unknown", "title":"Life"}]'

The Solr convention is to issue update commands with curl rather than from a browser. Unlike open source Solr (OSS), you do not need to post a commit command with the update, and doing so has no effect.

When you use CQL or CLI to update a field, DSE Search implicitly sets replacefields to false and updates individual fields in the Solr document. The re-indexing of data occurs automatically.
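
For example, a minimal sketch in CQL, assuming the document with id 130 inserted above; updating one column leaves the other fields of the Solr document intact:

-- Only the body field changes; name and title are preserved, and
-- DSE Search re-indexes the document automatically.
UPDATE mykeyspace.mysolr SET body = 'Life is a day at the beach.' WHERE id = '130';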

Warning about using the optimize command

Do not include the optimize command in URLs that update Solr data. This warning appears in the system log when you use the optimize command:

WARN [http-8983-2] 2013-03-26 14:33:04,450 CassandraDirectUpdateHandler2.java (line 697)
Calling commit with optimize is not recommended.

The Lucene merge policy is very efficient. The optimize command is no longer necessary, and using it in a URL can cause nodes to fail.

Deleting Solr data

To delete a Cassandra table and its data, including the data indexed in Solr, drop the table from a Solr node using the Cassandra Query Language (CQL) or the Command Line Interface (CLI). The following example, which assumes you ran the wikipedia demo, lists the Solr index files on the file system, drops the table named solr that the demo created, and then verifies that the files have been deleted from the file system:

  1. List the Solr data files on the file system.

    • Packaged install:

      ls /usr/local/var/lib/dse5/data/solr.data/wiki.solr/index/
      
    • Tarball install:

      ls /var/lib/cassandra/data/solr.data/wiki.solr/index
      

    The output looks something like this:

    _33.fdt      _35_nrm.cfe   _38_Lucene40_0.tim
    _33.fdx      _35_nrm.cfs   _38_Lucene40_0.tip
    _33.fnm      _36.fdt     _38_nrm.cfe
    . . .
    
  2. Launch cqlsh and run the CQL commands to drop the table named solr.

    USE wiki;
    DROP TABLE solr;
    
  3. Exit cqlsh and check that the files have been deleted on the file system. For example:

    ls /var/lib/cassandra/data/solr.data/wiki.solr/index
    

    The output is:

    ls: /var/lib/cassandra/data/solr.data/wiki.solr/index: No such file or directory
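
To delete individual documents instead of dropping the whole table, you can post the standard Solr delete syntax to the update handler. A minimal sketch, assuming the mykeyspace.mysolr core from the earlier example and standard Solr delete-by-id behavior:

curl http://localhost:8983/solr/mykeyspace.mysolr/update \
  -H 'Content-type: text/xml; charset=utf-8' \
  --data-binary '<delete><id>130</id></delete>'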
    

Using copy fields

The way DSE Search/Solr handles copy fields depends on the value of the stored attribute.

If stored=false in the copyField directive:

  • Ingested data is copied by the copyField mechanism to the destination field for search, but data is not stored in Cassandra.
  • When you add a new copyField directive to the schema.xml, pre-existing and newly ingested data is re-indexed when copied as a result of the new directive.

If stored=true in the copyField directive (backward compatibility mode):

  • Ingested data is copied by the copyField mechanism and data is stored in Cassandra.
  • When you add a new copyField directive to the schema.xml, pre-existing data is re-indexed as the result of old copyField directives, but not as the result of the new directive. To be indexed under the new directive, data must be re-ingested after you add the directive to the schema.

Using a copy field and multivalue field

When you use copy fields to copy multiple values into a field, CQL comes in handy because you do not need to format the data, in JSON for example, when you insert it. With the Solr HTTP API update command, the data must be formatted.

Use the CQL BATCH command to insert column values in a single CQL statement to prevent overwriting. This process is consistent with Solr HTTP APIs, where all copied fields need to be present in the inserted document. You need to use BATCH to insert the column values whether or not the values are stored in Cassandra.

Using docValues and copy fields for faceting

Using docValues can improve performance of faceting, grouping, filtering, sorting, and other operations described on the Solr Wiki.

For faceting to use docValues, the schema needs to specify multiValued="true" even if the field is a single-value facet field. The field also needs to include docValues="true". You also need to use a field type that supports being counted by Solr. The text type, which tokenizes values, cannot be used, but the string type works fine.
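
For example, a facet field definition that meets these requirements might look like the following sketch (the field name is hypothetical; compare the all field in the example below):

<field name="category" type="string" docValues="true" indexed="true"
  stored="true" multiValued="true"/>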

Example of using copy fields and docValues

This example uses copy fields to copy various aliases, such as a twitter name and email alias, to a multivalue field. You can then query the multivalue field using any alias as the term to retrieve the other aliases in the same row or rows. This example also uses docValues.

  1. If you have not already done so, create a directory named solr_tutorial, and copy the schema.xml and solrconfig.xml files from the wikipedia demos directory to the solr_tutorial directory.

  2. Using CQL, create a keyspace and a table to store user names, email addresses, and their skype, twitter, and irc names. The all field will exist in the Solr index only, so you do not need an all column in the table.

    CREATE KEYSPACE user_info
      WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
    
    CREATE TABLE user_info.users (
      id text PRIMARY KEY,
      name text,
      email text,
      skype text,
      irc text,
      twitter text
    ) WITH COMPACT STORAGE;
    
  3. Because the schema includes a multivalue field, insert the data with a CQL BATCH command, as explained earlier.

    BEGIN BATCH
      INSERT INTO user_info.users (id, name, email, skype, irc, twitter) VALUES
        ('user1', 'john smith', 'jsmith@abc.com', 'johnsmith', 'smitty', '@johnsmith');

      INSERT INTO user_info.users (id, name, email, skype, irc, twitter) VALUES
        ('user2', 'elizabeth doe', 'lizzy@swbell.net', 'roadwarriorliz', 'elizdoe', '@edoe576');

      INSERT INTO user_info.users (id, name, email, skype, irc, twitter) VALUES
        ('user3', 'dan graham', 'etnaboy1@aol.com', 'danielgra', 'dgraham', '@dannyboy');

      INSERT INTO user_info.users (id, name, email, skype, irc, twitter) VALUES
        ('user4', 'john smith', 'jonsmit@fyc.com', 'johnsmith', 'jsmith345', '@johnrsmith');

      INSERT INTO user_info.users (id, name, email, skype, irc, twitter) VALUES
        ('user5', 'john smith', 'jds@adeck.net', 'jdsmith', 'jdansmith', '@smithjd999');

      INSERT INTO user_info.users (id, name, email, skype, irc, twitter) VALUES
        ('user6', 'dan graham', 'hacker@legalb.com', 'dangrah', 'dgraham', '@graham222');
    APPLY BATCH;
    
  4. Use a schema that contains the multivalued field named all, copy fields for each alias plus the user id, and a docValues option.

    <schema name="my_search_demo" version="1.1">
      <types>
        <fieldType name="string" class="solr.StrField"/>
        <fieldType name="text" class="solr.TextField">
          <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
          </analyzer>
        </fieldType>
      </types>
      <fields>
        <field name="id"  type="string" indexed="true"  stored="true"/>
        <field name="name"  type="string" indexed="true"  stored="true"/>
        <field name="email" type="string" indexed="true" stored="true"/>
        <field name="skype" type="string" indexed="true"  stored="true"/>
        <field name="irc"  type="string" indexed="true"  stored="true"/>
        <field name="twitter" type="string" indexed="true" stored="true"/>
        <field name="all" type="string" docValues="true" indexed="true" stored="false" multiValued="true"/>
      </fields>
      <defaultSearchField>name</defaultSearchField>
      <uniqueKey>id</uniqueKey>
      <copyField source="id" dest="all"/>
      <copyField source="email" dest="all"/>
      <copyField source="skype" dest="all"/>
      <copyField source="irc" dest="all"/>
      <copyField source="twitter" dest="all"/>
    </schema>
    
  5. On the command line in the solr_tutorial directory, upload the schema and solrconfig.xml to Solr. Create the Solr core for user_info.users.

    curl http://localhost:8983/solr/resource/user_info.users/solrconfig.xml \
      --data-binary @solrconfig.xml -H 'Content-type:text/xml; charset=utf-8'
    
    curl http://localhost:8983/solr/resource/user_info.users/schema.xml \
      --data-binary @schema.xml -H 'Content-type:text/xml; charset=utf-8'
    
    curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=user_info.users"
    
  6. Search Solr to identify the user, aliases, and id of users having an alias smitty.

    http://localhost:8983/solr/user_info.users/select?q=all%3Asmitty&wt=xml&indent=true
    

    Output is:

    <result name="response" numFound="1" start="0">
     <doc>
       <str name="id">user1</str>
       <str name="email">jsmith@abc.com</str>
       <str name="irc">smitty</str>
       <str name="name">john smith</str>
       <str name="skype">johnsmith</str>
       <str name="twitter">@johnsmith</str>
     </doc>
    </result>
    
  7. Run this query:

    http://localhost:8983/solr/user_info.users/select/?q=*:*&facet=true&facet.field=name&facet.mincount=1&indent=yes
    

    At the bottom of the output, the facet results appear: three instances of john smith, two instances of dan graham, and one instance of elizabeth doe:

    . . .
    </result>
    <lst name="facet_counts">
      <lst name="facet_queries"/>
      <lst name="facet_fields">
        <lst name="name">
          <int name="john smith">3</int>
          <int name="dan graham">2</int>
          <int name="elizabeth doe">1</int>
        </lst>
      </lst>
      . . .
    
  8. Now you can view the status of the field cache memory to see the RAM usage of docValues per Solr field. Results look something like those shown in Example 2.
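
A sketch of such a check, assuming the standard Solr MBeans handler is available for the core; the exact location of the docValues statistics in the output can vary:

curl "http://localhost:8983/solr/user_info.users/admin/mbeans?stats=true&wt=json&indent=on"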

Changing the value of a stored copyField attribute

To change the stored attribute value of a copyField directive from true to false:

  1. Change the values of stored in copyField directives to false.
  2. Post the solrconfig.xml and the modified schema.xml.
  3. Reload the core, specifying an in-place re-index.

Previously stored copies of data are not automatically removed from Cassandra.
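
A sketch of the reload in step 3, assuming the DSE Search RELOAD options reindex and deleteAll; setting reindex=true with deleteAll=false re-indexes in place while keeping the existing index searchable:

curl "http://localhost:8983/solr/admin/cores?action=RELOAD&name=<keyspace>.<table>&reindex=true&deleteAll=false"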

Changing the stored attribute value from false to true is not directly supported. The workaround is:

  1. Remove the copyField directives that have stored=false.
  2. Post the solrconfig.xml and schema.xml, and then reload the core with the reindex=true option.
  3. Add back the copyField directives you removed in step 1 to the schema.xml and set stored=true.
  4. Post the solrconfig.xml and the modified schema.xml.
  5. Reload the core, specifying an in-place re-index.
  6. Re-ingest the data.

Stored values are not automatically removed from Cassandra.