DataStax Developer Blog

Effects of Schema Changes in DSE Search

By Hamilton Tran - December 17, 2012

In the beginning…

[Image: Library of Congress card catalog, 2011]

Remember these? Well, maybe not the beginning, but this was the search technology that we all grew up learning to use at an early age. While perhaps forgotten in our current digital age, the library and its card catalog are a good model of the processes involved in maintaining a repository of data. Books would come in and get cataloged (indexed), and then an entry would be placed in the card catalog (the index) for people to search through.

Library of Congress

The Library of Congress was established on April 24, 1800 and today has 22,765,967 cataloged books housed in 3 different buildings connected to one another through a series of passages. In the beginning its card catalog was probably very simple, and over time it has morphed into something completely different. Each time the Library wanted to change the format of its card catalog, it would have to re-catalog all of its books and update the cards. This was not trivial: while the individual cards were being updated, a search might not have found all the books you were looking for. And assuming that each of the Library’s buildings had its own card catalog, it would have been some time before all three catalogs held the same indexing data.

A simple card catalog

So how does the Library of Congress changing the format of its card catalog relate to DSE Search and Solr? Let’s start with a simple example of a schema that DSE Search and Solr use to store information about books:

<?xml version="1.0" encoding="UTF-8"?>
<schema name="Library" version="1.1">
   <types>
      <fieldType name="string" class="solr.StrField" />
      <fieldType name="text" class="solr.TextField">
         <analyzer>
            <tokenizer class="foo.LibraryTokenizerFactory" />
         </analyzer>
      </fieldType>
      <fieldType name="date" class="solr.DateField" />
   </types>
   <fields>
      <field name="isbn" type="string" indexed="true" stored="true" />
      <field name="title" type="string" indexed="true" stored="true" />
      <field name="author" type="string" indexed="true" stored="true" />
      <field name="publisher" type="string" indexed="true" stored="true" />
      <field name="excerpt" type="text" indexed="false" stored="false" />
      ...
      <!-- Some more attributes of books -->
      <field name="pub_date" type="date" indexed="false" stored="false" />
   </fields>
   <defaultSearchField>title</defaultSearchField>
   <uniqueKey>isbn</uniqueKey>
</schema>

This example Solr schema contains a few fields that will be indexed for search: a book’s ISBN, title, author, and publisher. The schema also includes other pieces of data about a book, such as the excerpt and publication date (some are not shown), that may be useful later but will not be searchable or returned in results, as indicated by the indexed="false" and stored="false" options.
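To make the indexed/stored distinction concrete, here is a minimal Python sketch that reads a trimmed copy of the schema above and reports which fields are searchable and which are returned in results. The `field_flags` helper is hypothetical and not part of DSE Search or Solr, and it assumes both attributes are written out explicitly on every field, as they are in the example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example schema, inlined so the sketch is self-contained.
SCHEMA_XML = """<schema name="Library" version="1.1">
   <fields>
      <field name="isbn" type="string" indexed="true" stored="true" />
      <field name="title" type="string" indexed="true" stored="true" />
      <field name="excerpt" type="text" indexed="false" stored="false" />
   </fields>
</schema>
"""


def field_flags(schema_xml):
    """Return {field name: (indexed, stored)} for every <field> element.

    Assumption: a missing attribute is treated as false here for simplicity;
    this is not meant to mirror Solr's own defaulting rules.
    """
    root = ET.fromstring(schema_xml)
    flags = {}
    for field in root.iter("field"):
        indexed = field.get("indexed") == "true"
        stored = field.get("stored") == "true"
        flags[field.get("name")] = (indexed, stored)
    return flags


for name, (indexed, stored) in field_flags(SCHEMA_XML).items():
    print(f"{name}: searchable={indexed}, returned={stored}")
```

Running a check like this against a development copy of schema.xml is a quick way to confirm what a field is expected to do before posting a change.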

A Small Change

Searches are nice, but people need a little more context with each result brought back by their search. Adding a small excerpt for each book in the search results will help users decide whether the book they found is the one they actually want. To do this, we will change the excerpt field in our schema.xml so that excerpt text is returned when we search for books. We change the field from

<field name="excerpt" type="text" indexed="false" stored="false"/>

to

<field name="excerpt" type="text" indexed="false" stored="true"/>

We now have a change, but we need to notify our other “buildings”, or nodes, about it. In DSE Search you would post the file to the Solr core with a command like this:

curl -v http://localhost:8983/solr/resource/library.books/schema.xml --data-binary @schema.xml -H 'Content-type:text/xml; charset=utf-8'

Once the file has been posted, the change propagates to the other DSE Search nodes, which reload their Solr cores and begin returning excerpt text in search results.
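One way to confirm that excerpts are now coming back is to run a query that asks for the field explicitly. The following is a minimal Python sketch, not a definitive client: the `/solr/library.books/select` path is an assumption that the core exposes Solr's standard select handler under the same name used in the curl command, the `title:cassandra` query is made up for illustration, and both helper functions are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(base_url, core, query, fields):
    """Build a Solr select URL that returns only the listed fields as JSON."""
    params = urlencode({"q": query, "fl": ",".join(fields), "wt": "json"})
    return f"{base_url}/solr/{core}/select?{params}"


def fetch_docs(url):
    """Run the query against a live node and return the matching documents."""
    with urlopen(url) as response:
        return json.load(response)["response"]["docs"]


# Assumed host, core name, and query; adjust to your own cluster.
url = build_select_url(
    "http://localhost:8983",
    "library.books",
    "title:cassandra",
    ["isbn", "title", "excerpt"],
)
print(url)
```

Calling `fetch_docs(url)` against a running node should then return documents that include an `excerpt` key once the nodes have picked up the new schema.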

In a pure sharded Solr cluster, one would have to manually propagate the change to each node’s schema.xml and then perform a rolling restart before the change would take effect. For the current version of DSE Search, you only need to do a rolling restart.

Another “Small” Change

Our users want to search on the excerpt text as well, something we did not anticipate, so we are going to make another small change to our schema. The last change didn’t require much work, so this one shouldn’t either. Let’s change the excerpt field in our schema.xml once again. From

<field name="excerpt" type="text" indexed="false" stored="true"/>

to

<field name="excerpt" type="text" indexed="true" stored="true"/>

However, this subtle change has a cost. If this were the Library of Congress, it would need to rebuild its card catalog, because none of the cards currently in the catalog (the index) carry any search information for the excerpt field. As the Library pulls books and replaces cards in the catalog, people searching for books might not find everything they would have found if the card catalog were 100% updated. So what does this mean for DSE Search and this schema change? In DSE Search, the nodes will begin reading the data back from Cassandra, where it was stored, and use Solr’s plethora of tokenizers and filters to analyze and re-index it. For pure Solr deployments, the source data must be re-ingested in order to re-index, on top of performing a rolling restart. In both cases, while the data is being re-indexed, our searches will not be 100% accurate, since the index may not yet have been fully updated.
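The difference between the two changes can be sketched as a small schema comparison in Python. It follows the reasoning above, under which flipping a field's indexed flag is what forces a re-index, while a stored-only change does not. The helper names are hypothetical, and a real check would likely also need to consider field type and analyzer changes.

```python
import xml.etree.ElementTree as ET


def fields_by_name(schema_xml):
    """Map each field name to its attribute dictionary."""
    root = ET.fromstring(schema_xml)
    return {f.get("name"): dict(f.attrib) for f in root.iter("field")}


def fields_needing_reindex(old_xml, new_xml):
    """List fields present in both schemas whose indexed flag changed.

    Assumption: only the indexed attribute is compared, mirroring the
    example in this post.
    """
    old, new = fields_by_name(old_xml), fields_by_name(new_xml)
    flagged = []
    for name in old.keys() & new.keys():
        if old[name].get("indexed") != new[name].get("indexed"):
            flagged.append(name)
    return sorted(flagged)


def schema_with_excerpt(indexed, stored):
    """Build a tiny schema whose excerpt field has the given flags."""
    return (
        '<schema name="Library" version="1.1"><fields>'
        '<field name="isbn" type="string" indexed="true" stored="true" />'
        f'<field name="excerpt" type="text" indexed="{indexed}" stored="{stored}" />'
        "</fields></schema>"
    )


original = schema_with_excerpt("false", "false")
first_change = schema_with_excerpt("false", "true")   # return excerpts
second_change = schema_with_excerpt("true", "true")   # search on excerpts

print(fields_needing_reindex(original, first_change))       # []
print(fields_needing_reindex(first_change, second_change))  # ['excerpt']
```

A comparison like this could run against a proposed schema.xml before it is posted, so that a change which triggers re-indexing is never a surprise.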

Summation

  • Schema changes are not free; the nature of search is such that small changes can have big impacts on your system.
  • Schema changes in DSE Search do not require re-ingesting the source material to re-index the data.
  • In both DSE Search and pure Solr deployments, while you are re-indexing, searches may not give you all the expected matches until re-indexing has completed.

Some best practices

  • Deploy schema changes to a development cluster first to understand the effect of the change.
  • Test to see if the change is actually beneficial; for example, changing Solr analyzers and tokenizers has a major effect on what a search returns.
  • Plan for re-indexing of data, both in code and time.

Further Reading:

Getting Started with DSE Search

Cassandra with Solr Integration Details

Datastax Enterprise Search vs SolrCloud


