
What’s new for Search in DSE 5.1

By Nick Panahi, Sr. Product Manager, Server - May 15, 2017

I am pleased to announce the general availability of DataStax Enterprise (DSE) 5.1 as of April 18th, 2017. We are especially excited about this release for DSE Search, which is built on top of the best distribution of Apache Cassandra™. Let me provide a quick tour of some of the enhancements and features in DSE Search 5.1. The 5.0 release focused heavily on improving performance and stability and on eliminating complexity for users. We are continuing with those themes.

First and foremost for DSE Search, 5.1 delivers an upgraded, production certified version of Apache Solr 6™. We skipped the Solr 5.x line completely and instead integrated Apache Solr 6.0.1™.

Component upgrades are important for various reasons, including new functionality. For DSE, while features were one important reason, the bigger driver for this upgrade was to incorporate a number of improvements, bug fixes and optimizations. The Solr upgrade certainly delivers that for DSE Search. Combined with our own improvements and bug fixes, DSE 5.1 gains considerable performance improvements across the board, from querying to indexing.

One Solr feature that does warrant highlighting is the new JSON Facet API, which covers what was formerly handled by facets and the StatsComponent. Introduced in Apache Solr 5 and re-architected for performance, the API lets users express aggregation queries for statistical-analysis-style searches in an intuitive JSON format. This functionality is available through Solr’s native HTTP API in DSE Search. Traditional facet searches are still supported through CQL as well, but they keep the previous, simpler API, which is better suited to situations like product catalogue groupings.
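To make the JSON Facet API a little more concrete, here is a minimal Python sketch that only builds the request URL for an aggregation query; it does not contact a server. The host and port are assumptions, the facet names (by_category, avg_price, max_price) are made up for illustration, and the core name assumes the keyspace.table convention applied to the amazon.metadata table used in the walkthrough later in this post.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint: a local DSE Search node, with the core named "<keyspace>.<table>".
BASE_URL = "http://localhost:8983/solr/amazon.metadata/select"


def build_facet_params():
    """Return query parameters for a JSON Facet request (facet names are illustrative)."""
    facet = {
        # Bucket documents by category and compute an average price per bucket.
        "by_category": {
            "type": "terms",
            "field": "categories",
            "limit": 5,
            "facet": {"avg_price": "avg(price)"},
        },
        # A top-level statistic across all matching documents.
        "max_price": "max(price)",
    }
    return {"q": "*:*", "rows": 0, "wt": "json", "json.facet": json.dumps(facet)}


def build_facet_url():
    """Encode the parameters into a URL that can be sent with any HTTP client."""
    return BASE_URL + "?" + urlencode(build_facet_params())


if __name__ == "__main__":
    print(build_facet_url())
```

Setting rows to 0 keeps the response to the aggregations only. Note that a terms facet on an analyzed text field buckets by token, so a field indexed as an exact string is usually the better target for this kind of grouping.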

Arguably the most exciting feature in DSE Search 5.1 is CQL-based search index management. Both DSE Search and Apache Solr users know that configuration files are involved in building a Solr core or DSE Search index. These configurations define the behavior, functionality and even performance of your search capabilities.

As we work to make search functionality more native to the DSE platform, managing a CQL table’s search index is a great place to start. With DSE 5.1, not only can configuration files be inferred and automatically generated for you, but modifying your index configuration and schema is much easier through the new CQL integration.

Rather than discuss every option available, let’s walk through an example. We’ll create a search index on an existing CQL table to support flexible, simple boolean queries, but not full-text search.

Starting with this simple CQL schema:

CREATE TABLE amazon.metadata (
   asin text PRIMARY KEY,
   also_bought set<text>,
   buy_after_viewing set<text>,
   categories set<text>,
   imurl text,
   price double,
   title text
);

We’ll begin by creating a default search index on this table.

cqlsh> CREATE SEARCH INDEX IF NOT EXISTS ON amazon.metadata;

We can validate what this command does by executing a CQL DESCRIBE on the table.

cqlsh:amazon> DESC TABLE amazon.metadata;

CREATE TABLE amazon.metadata (
   asin text PRIMARY KEY,
   also_bought set<text>,
   buy_after_viewing set<text>,
   categories set<text>,
   imurl text,
   price double,
   solr_query text,
   title text
) …;

CREATE CUSTOM INDEX amazon_metadata_solr_query_index ON amazon.metadata (solr_query) USING 'com.datastax.bdp.search.solr.Cql3SolrSecondaryIndex';

And we can validate that our Cassandra data has also been indexed for search with this query.

cqlsh> SELECT count(*) from amazon.metadata where solr_query = '*:*';

 count
-------
 11047

cqlsh> DESCRIBE ACTIVE SEARCH INDEX SCHEMA ON amazon.metadata;

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
 <types>
   <fieldType class="org.apache.solr.schema.TextField" name="TextField">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>
   <fieldType class="org.apache.solr.schema.TrieDoubleField" name="TrieDoubleField"/>
   <fieldType class="org.apache.solr.schema.StrField" name="StrField"/>
 </types>
 <fields>
   <field indexed="true" multiValued="false" name="title" stored="true" type="TextField"/>
   <field indexed="true" multiValued="false" name="imurl" stored="true" type="TextField"/>
   <field docValues="true" indexed="true" multiValued="false" name="price" stored="true" type="TrieDoubleField"/>
   <field indexed="true" multiValued="true" name="also_bought" stored="true" type="TextField"/>
   <field indexed="true" multiValued="false" name="asin" stored="true" type="StrField"/>
   <field indexed="true" multiValued="true" name="categories" stored="true" type="TextField"/>
   <field indexed="true" multiValued="true" name="buy_after_viewing" stored="true" type="TextField"/>
 </fields>
 <uniqueKey>asin</uniqueKey>
</schema>

We can see that a few things have happened with our simple CREATE command. We’ve generated a Solr configuration file and a Solr schema file inferred from our CQL DDL. We’ve posted the files to Solr and created the core, and we’ve also issued an indexing command to index our current data. As you can see, the process would have been much more complex without the CQL command.

At this point, the table is configured for full-text search, and any data inserted into DSE will be indexed as well. This is a very nice way to get up and running, but the index is configured to do more than the use case requires. Index functionality directly affects storage requirements: the more an index has to do, the more data it has to store. By reconfiguring our search index to provide only basic indexing functionality, we can reduce the storage requirements as well as increase the indexing performance of our system.

Consider a scenario where you want to leverage DSE Search for basic indexing for boolean queries instead of full-text search. Let’s walk through an example of setting up a more advanced search configuration using the new CQL syntax.

cqlsh> DROP SEARCH INDEX ON amazon.metadata; //drops the Solr core and removes search index data.

cqlsh> CREATE SEARCH INDEX IF NOT EXISTS ON amazon.metadata WITH PROFILES spaceSavingNoTextfield AND COLUMNS * {docValues:true};

cqlsh> DESCRIBE ACTIVE SEARCH INDEX SCHEMA ON amazon.metadata;

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
 <types>
   <fieldType class="org.apache.solr.schema.StrField" name="StrField"/>
   <fieldType class="org.apache.solr.schema.TrieDoubleField" name="TrieDoubleField" precisionStep="0"/>
 </types>
 <fields>
   <field docValues="true" indexed="true" multiValued="false" name="title" stored="true" type="StrField"/>
   <field docValues="true" indexed="true" multiValued="false" name="imurl" stored="true" type="StrField"/>
   <field docValues="true" indexed="true" multiValued="false" name="price" stored="true" type="TrieDoubleField"/>
   <field docValues="true" indexed="true" multiValued="true" name="also_bought" stored="true" type="StrField"/>
   <field docValues="true" indexed="true" multiValued="false" name="asin" stored="true" type="StrField"/>
   <field docValues="true" indexed="true" multiValued="true" name="categories" stored="true" type="StrField"/>
   <field docValues="true" indexed="true" multiValued="true" name="buy_after_viewing" stored="true" type="StrField"/>
   <field docValues="false" indexed="false" multiValued="false" name="_partitionKey" omitNorms="true" stored="false" type="StrField"/>
 </fields>
 <uniqueKey>asin</uniqueKey>
</schema>

Here, we’ve created a new search index using one of the available index profile options to reduce the index data size as much as possible, since our use case does not require text analysis, phrase searches or even joins. We’ve also used the column options to enable docValues for all of our indexed fields, which greatly improves sorting and even faceting performance.

So far so good, but we can do more! Now let’s configure this index for real-time (RT) indexing. RT indexing, also called live indexing, is a feature introduced in the DSE 4.x line for high-throughput, low-latency searches. To enable it, we first set the rt config option to true using a shortcut directive.

cqlsh> ALTER SEARCH INDEX CONFIG ON amazon.metadata SET realtime = true;

We’ll do the same for increasing the searchable memory buffer.

cqlsh> ALTER SEARCH INDEX CONFIG ON amazon.metadata SET ramBufferSize = 2048;

We’ll further configure real-time indexing to use off-heap memory allocations for the postings by adding the rtOffheapPostings element to the indexConfig section of the configuration and setting its value to true.

cqlsh> ALTER SEARCH INDEX CONFIG ON amazon.metadata SET indexConfig.rtOffheapPostings = true;

Finally, we’ll set our index refresh time to 500ms for real-time visibility for newly indexed documents.

cqlsh> ALTER SEARCH INDEX CONFIG ON amazon.metadata SET autoCommitTime = 500;

Validating our changes shows that we have configured this CQL table with an optimized search index that supports boolean CQL queries on any field defined in our search schema. The entire configuration can be done in a matter of minutes. To see our pending changes, we run a command that returns the PENDING configuration, as opposed to the current, ACTIVE configuration.

cqlsh> DESCRIBE PENDING SEARCH INDEX CONFIG ON amazon.metadata;

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<config>
 <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>
 <luceneMatchVersion>LUCENE_6_0_0</luceneMatchVersion>
 <dseTypeMappingVersion>2</dseTypeMappingVersion>
 <directoryFactory class="solr.StandardDirectoryFactory" name="DirectoryFactory"/>
 <indexConfig>
   <rt>true</rt>
   <useCompoundFile>false</useCompoundFile>
   <ramBufferSizeMB>2048</ramBufferSizeMB>
   <mergeFactor>10</mergeFactor>
   <reopenReaders>true</reopenReaders>
   <deletionPolicy class="solr.SolrDeletionPolicy">
     <str name="maxCommitsToKeep">1</str>
     <str name="maxOptimizedCommitsToKeep">0</str>
   </deletionPolicy>
   <infoStream file="INFOSTREAM.txt">false</infoStream>
   <rtOffheapPostings>true</rtOffheapPostings>
 </indexConfig>
 <jmx/>
 <updateHandler class="solr.DirectUpdateHandler2">
   <autoSoftCommit>
     <maxTime>500</maxTime>
   </autoSoftCommit>
 </updateHandler>
 <query>
   <maxBooleanClauses>1024</maxBooleanClauses>
   <filterCache class="solr.SolrFilterCache" highWaterMarkMB="256" lowWaterMarkMB="128"/>
   <enableLazyFieldLoading>true</enableLazyFieldLoading>
   <useColdSearcher>true</useColdSearcher>
   <maxWarmingSearchers>16</maxWarmingSearchers>
 </query>
 <requestDispatcher handleSelect="true">
   <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048000"/>
   <httpCaching never304="true"/>
 </requestDispatcher>
 <requestHandler class="solr.SearchHandler" default="true" name="search">
   <lst name="defaults">
     <int name="rows">10</int>
   </lst>
 </requestHandler>
 <requestHandler class="com.datastax.bdp.search.solr.handler.component.CqlSearchHandler" name="solr_query">
   <lst name="defaults">
     <int name="rows">10</int>
   </lst>
 </requestHandler>
 <requestHandler class="solr.UpdateRequestHandler" name="/update"/>
 <requestHandler class="solr.UpdateRequestHandler" name="/update/csv" startup="lazy"/>
 <requestHandler class="solr.UpdateRequestHandler" name="/update/json" startup="lazy"/>
 <requestHandler class="solr.FieldAnalysisRequestHandler" name="/analysis/field" startup="lazy"/>
 <requestHandler class="solr.DocumentAnalysisRequestHandler" name="/analysis/document" startup="lazy"/>
 <requestHandler class="solr.admin.AdminHandlers" name="/admin/"/>
 <requestHandler class="solr.PingRequestHandler" name="/admin/ping">
   <lst name="invariants">
     <str name="qt">search</str>
     <str name="q">solrpingquery</str>
   </lst>
   <lst name="defaults">
     <str name="echoParams">all</str>
   </lst>
 </requestHandler>
 <requestHandler class="solr.DumpRequestHandler" name="/debug/dump">
   <lst name="defaults">
     <str name="echoParams">explicit</str>
     <str name="echoHandler">true</str>
   </lst>
 </requestHandler>
 <admin>
   <defaultQuery>*:*</defaultQuery>
 </admin>
</config>
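With a config this long, it is easy to miss whether each ALTER actually landed. As a rough sketch, the pending XML can be checked programmatically. The Python below uses a trimmed copy of the output above, pasted in by hand, and the path-to-value mapping simply mirrors the four settings changed in this walkthrough; the helper name check_config is made up for illustration and is not part of DSE.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the pending config shown above: only the elements we changed.
PENDING_CONFIG = """
<config>
  <indexConfig>
    <rt>true</rt>
    <ramBufferSizeMB>2048</ramBufferSizeMB>
    <rtOffheapPostings>true</rtOffheapPostings>
  </indexConfig>
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoSoftCommit>
      <maxTime>500</maxTime>
    </autoSoftCommit>
  </updateHandler>
</config>
"""

# Element paths (relative to <config>) and the values we expect after the ALTERs.
EXPECTED = {
    "indexConfig/rt": "true",
    "indexConfig/ramBufferSizeMB": "2048",
    "indexConfig/rtOffheapPostings": "true",
    "updateHandler/autoSoftCommit/maxTime": "500",
}


def check_config(xml_text, expected=EXPECTED):
    """Return {path: actual_value} for every setting that differs from `expected`.

    A missing element is reported with the value None.
    """
    root = ET.fromstring(xml_text.strip())
    mismatches = {}
    for path, want in expected.items():
        got = root.findtext(path)
        if got != want:
            mismatches[path] = got
    return mismatches


if __name__ == "__main__":
    problems = check_config(PENDING_CONFIG)
    print("all settings applied" if not problems else problems)
```

An empty result means every expected element is present with the expected value. The same check can be run against the ACTIVE config after a reload.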

When we’re satisfied with our changes and ready to apply the new configuration, we will need to issue a RELOAD command to the index to apply the configuration and/or schema as the new ACTIVE configuration.

cqlsh> RELOAD SEARCH INDEX ON amazon.metadata;

Similarly, if there are schema changes, we will need to issue a REBUILD command to rebuild the index to the new configuration.

cqlsh> REBUILD SEARCH INDEX ON amazon.metadata;

This step was not required for our changes, since we dropped the index earlier and re-created it with the profile options. We can now verify that the hand-built configuration is applied to the active index.

cqlsh> DESCRIBE ACTIVE SEARCH INDEX CONFIG ON amazon.metadata;

Executing a few queries shows that we are able to execute a query on any column but with a strict lookup versus full-text search.

cqlsh> SELECT count(*) from amazon.metadata where solr_query = '{"q":"categories:Books"}';

 count
-------
 10530

cqlsh> SELECT count(*) from amazon.metadata where solr_query = '{"q":"categories:books"}';

 count
-------
     0

cqlsh> SELECT * from amazon.metadata where solr_query = '{"q":"price:45.31"}';

 asin       | also_bought | buy_after_viewing                                        | categories | imurl                                                 | price | solr_query | title
------------+-------------+----------------------------------------------------------+------------+-------------------------------------------------------+-------+------------+------------------------------------------------------
 0007321198 |        null | {'0007126409', '0007437862', '0195392884', '055010237X'} |  {'Books'} | http://ecx.images-amazon.com/images/I/51dGxNC4u0L.jpg | 45.31 |       null | Collins English Dictionary: 30th Anniversary Edition

(1 rows)
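The two category counts above differ because a StrField indexes each value as a single, untouched term, while the TextField in the default schema runs values through a tokenizer and a lowercase filter. The Python below is only a toy model of that difference. The regular expression is a stand-in for Solr’s StandardTokenizer rather than a faithful copy of it, and the function names are made up for illustration.

```python
import re


def str_field_terms(value):
    """StrField-style indexing: the whole value is one exact, case-sensitive term."""
    return {value}


def text_field_terms(value):
    """TextField-style approximation: split on non-word characters, then lowercase."""
    return {token.lower() for token in re.findall(r"\w+", value)}


def matches_str(query, value):
    """Exact lookup: the query must equal the indexed value."""
    return query in str_field_terms(value)


def matches_text(query, value):
    """Analyzed lookup: the query goes through the same analysis as the value."""
    query_terms = text_field_terms(query)
    return bool(query_terms) and query_terms <= text_field_terms(value)


if __name__ == "__main__":
    for query in ("Books", "books"):
        print(query, matches_str(query, "Books"), matches_text(query, "Books"))
    # prints:
    # Books True True
    # books False True
```

In this model, the lowercase query only matches under text-style analysis, which mirrors the 10530 versus 0 result above. That strictness is the trade-off accepted in exchange for the smaller, faster index.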

DSE Graph 5.0 also leverages DSE Search-backed indexes to provide distributed search capabilities that power fast, global graph traversals. DSE Graph 5.1 extends that foundation with additional capabilities. For more information, please read the What’s New in DSE Graph 5.1 post.

Similarly, DSE Analytics can automatically leverage search indexes for optimizations as well. For more information on Search Analytics, please read the DSE 5.1: Automatic Optimization of Spark SQL Queries Using DSE Search blog post.

