Sebastian Estevez

<h4 id="intro">Intro</h4>

Using map collections in DSE Search takes advantage of dynamic fields in Apache Solr™ for indexing. For this to work, every key in your map has to be prefixed with the name of the collection. Using an example, this article aims to demonstrate:

<ol>
	<li>How to create and populate map collections that are compatible with DSE Search</li>
	<li>How to use generateResources to generate the schema and index maps as dynamic fields, and</li>
	<li>How to perform a data migration using&nbsp;<a href="https://github.com/brianmhess/cassandra-loader#options">Brian's cassandra-loader/unloader</a>&nbsp;for existing data that lacks the prefix required by DSE Search</li>
</ol>

Note: This same methodology (cassandra-unloader|awk|cassandra-loader) can be used in many different ETL workloads. The migration shown here is just one common example of the larger group of situations where it may be handy.

Note:&nbsp;This blog post was written targeting DSE 4.8. Please refer to the&nbsp;<a href="http://docs.datastax.com/en/" title="DataStax Documentation">DataStax documentation</a>&nbsp;for your specific version of DSE if different.

Something to watch out for: Dynamic fields, like Cassandra collections, are not meant to store large amounts of data. The odds are, if you are misusing Apache Cassandra™ collections, you will also have problems on the search side with dynamic fields because they tend to create significant heap pressure due to their memory&nbsp;footprint.

<h4 id="creatingandpopulatingthemaps">Creating and Populating the maps</h4>

If you are using a map named&nbsp;<code>contact_info_</code>&nbsp;to store contact information, you may have the following table definition:

<pre>
<code>CREATE TABLE autogeneratedtest.customers_by_channel ( 
 customer_id uuid,
 customer_type text,
 channel_id text,
 contact_info_ map&lt;text, text&gt;,
 country_code text,
 PRIMARY KEY ((customer_id), channel_id)
);
</code></pre>

and you may have some rows as follows:

<pre>
<code>insert into autogeneratedtest.customers_by_channel ( 
 customer_id,
 customer_type,
 channel_id,
 contact_info_, 
 country_code
)
VALUES ( 
 uuid(), 
 'subscription', 
 'web-direct',
 {
 'email': 'betrio@gmail.com',
 'first_name': 'Bill',
 'last_name': 'Evans'
 },
 'USA'
);

insert into autogeneratedtest.customers_by_channel ( 
 customer_id,
 customer_type,
 channel_id,
 contact_info_,
 country_code
) 
VALUES ( 
 uuid(),
 'subscription',
 'web-direct',
 {
 'email': 'messengers@gmail.com',
 'first_name': 'Art',
 'last_name': 'Blakey'
 },
 'USA'
);
</code></pre>

In order to index the map with DSE Search, the keys in the map would have to include the prefix&nbsp;<code>contact_info_</code>&nbsp;as follows:

<pre>
<code>{
 'contact_info_email': 'messengers@gmail.com', 
 'contact_info_first_name': 'Art',
 'contact_info_last_name': 'Blakey'
}
</code></pre>

Note: for existing systems, adding a prefix to the map's key will require changes in your application code.
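As a rough illustration of that application-side change, here is a minimal shell sketch that builds a CQL map literal with prefixed keys. The helper name <code>prefixed_map_literal</code> is hypothetical and not part of any DataStax tool, and the sketch assumes keys and values contain no single quotes.

```shell
#!/bin/sh
# Hypothetical helper (not part of DSE or cassandra-loader):
# prints a CQL map literal whose keys all start with the given prefix.
# Usage: prefixed_map_literal <prefix> key=value [key=value ...]
# Assumption: keys and values contain no single quotes (no escaping is done).
prefixed_map_literal() {
  prefix="$1"; shift
  out=""; sep=""
  for pair in "$@"; do
    key="${pair%%=*}"     # text before the first '='
    value="${pair#*=}"    # text after the first '='
    out="${out}${sep}'${prefix}${key}': '${value}'"
    sep=", "
  done
  printf '{%s}\n' "$out"
}

prefixed_map_literal contact_info_ \
  email=messengers@gmail.com first_name=Art last_name=Blakey
```

The printed literal can be pasted into the&nbsp;<code>VALUES</code>&nbsp;clause of an insert statement in place of the unprefixed map.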

<h4 id="indexingthefieldwithgenerateresources">Indexing the field with generateResources</h4>

In previous versions of DSE Search, users had to manually create and upload their own&nbsp;<code>schema.xml</code>&nbsp;and&nbsp;<code>solrconfig.xml</code>&nbsp;files with which to index their tables. This process was rather painful because hand-crafting XML files is quite error-prone. DSP-5373 (released with DSE&nbsp;<a href="http://docs.datastax.com/en/datastax_enterprise/4.6/datastax_enterprise/RNdse46.html">4.6.8</a>&nbsp;and&nbsp;<a href="http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/RNdse.html?scroll=RNdse__471Chgs">4.7.1</a>) made it so that you can index a table with a single API call, and DSE will take care of generating both your&nbsp;<code>schema.xml</code>&nbsp;and your&nbsp;<code>solrconfig.xml</code>&nbsp;automagically.

Use&nbsp;<code>dsetool</code>&nbsp;or&nbsp;<code>curl</code>&nbsp;to create a core for the table in one fell swoop:

<pre>
<code>dsetool create_core autogeneratedtest.customers_by_channel generateResources=true
</code></pre>

or

<pre>
<code>curl "http://&lt;host&gt;:8983/solr/admin/cores?action=CREATE&amp;name=autogeneratedtest.customers_by_channel&amp;generateResources=true"
</code></pre>

Protip: If you're using Cassandra authentication, dsetool does not yet work and you'll have to use the curl command.
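Once the core exists and the table holds rows whose map keys carry the prefix, each key should be searchable as its own field. The following is a minimal sketch of such a query over the Solr HTTP API. The function names are hypothetical, and it assumes the default port 8983 from the command above and the standard Solr&nbsp;<code>select</code>&nbsp;handler.

```shell
#!/bin/sh
# Hypothetical helpers for querying a DSE Search core over HTTP.
# Assumption: Solr listens on port 8983, as in the core-creation URL.

# Print the select-handler URL for a given host and core (keyspace.table).
solr_select_url() {
  host="$1"; core="$2"
  printf 'http://%s:8983/solr/%s/select\n' "$host" "$core"
}

# Run a query against the core and return the response as JSON.
search_core() {
  host="$1"; core="$2"; query="$3"
  # -G turns the url-encoded parameters into a GET query string
  curl -s -G "$(solr_select_url "$host" "$core")" \
    --data-urlencode "q=$query" \
    --data-urlencode "wt=json"
}

# Example (needs a running DSE Search node):
# search_core localhost autogeneratedtest.customers_by_channel 'contact_info_first_name:Art'
```

Rows whose keys lack the&nbsp;<code>contact_info_</code>&nbsp;prefix would not be expected to show up in such a query, which is what the migration in the next section addresses.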

<h4 id="datamigrationwithcassandraloaderunloader">Data Migration with cassandra-loader/unloader</h4>

If your data set is very large, a Spark job is a good way of migrating your data (<a href="https://github.com/rssvihla/datastax_work/blob/master/spark_commons/examples/spark_bulk_operations/src/main/java/pro/foundev/java/SchemaMigration.java">here's an example by Ryan Svihla</a>). That is a topic for another post.

This post will focus on small to medium datasets and simple transformations that can be implemented in&nbsp;<code>awk</code>. Because the unloader can write to stdout and the loader can read from stdin, combining them with some sed/awk magic makes for a quick and dirty ETL tool.
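To make the pattern concrete, here is a minimal sketch of a streaming transformation on pipe-delimited rows. The function name&nbsp;<code>upcase_country</code>&nbsp;and the sample row are made up for illustration. <code>printf</code>&nbsp;stands in for the unloader, and&nbsp;<code>MAP</code>&nbsp;is only a placeholder for the map column, not its real serialized form.

```shell
#!/bin/sh
# Hypothetical transformation: upper-case the fifth pipe-delimited column.
# awk handles one line at a time, so the dataset never has to fit in memory.
upcase_country() {
  awk -F '|' -v OFS='|' '{ $5 = toupper($5); print }'
}

# Stand-in for the unloader's output: one made-up row with five columns.
printf '%s\n' 'id-1|subscription|web-direct|MAP|usa' | upcase_country
```

In a real run, the&nbsp;<code>printf</code>&nbsp;would be replaced by the unloader writing to stdout, and the function's output would be piped into the loader reading from stdin.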

<a href="https://github.com/brianmhess/cassandra-loader#options">Brian's cassandra-loader and cassandra-unloader</a>&nbsp;are a pair of Java applications (built using the DataStax Java driver). They are easy-to-use, full-featured bulk loading / unloading tools for delimited data, built following all the Cassandra / Java driver best practices.

Note: Use this source code as a reference architecture when building Java (and other) applications that interact with Cassandra.

First download the binaries and set permissions:

<pre>
<code>wget "https://github.com/brianmhess/cassandra-loader/releases/download/v0.0.17/cassandra-loader"

wget "https://github.com/brianmhess/cassandra-loader/releases/download/v0.0.17/cassandra-unloader"

sudo chmod +x cassandra* 
</code></pre>

Thanks to Brian for helping optimize the&nbsp;<code>awk</code>&nbsp;script so that we can pipe directly from the unloader to awk to the loader. This means we don't have to fit the entire dataset in RAM.

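Here's how you would run it, assuming the target table&nbsp;<code>customers_by_channel2</code>&nbsp;already exists with the same columns as the original: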

<pre>
<code>./cassandra-unloader -f stdout \
 -delim "|" \
 -host localhost \
 -schema "autogeneratedtest.customers_by_channel \
 ( \
 customer_id, \
 customer_type, \
 channel_id, \
 contact_info_, \
 country_code \
 )" | \
awk -F "|" '{ \
 a=substr($4, 3, length($4)-4); \
 nb=split(a, b, ","); \
 d=""; sep=""; \
 for (i=1; i&lt;=nb; i+=2) { \
 c=substr(b[i], 2); \
 b[i]="\"contact_info_" c; \
 d=d sep b[i] " : " b[i+1]; \
 sep=", "; \
 } \
 for (i=1;i&lt;=3;i++) { \
 printf("%s|",$i); \
 } \
 printf("%s",d); \
 for (i=5;i&lt;=NF;i++) { \
 printf("|%s", $i); \
 } \
 printf("\n"); \
}' | \
./cassandra-loader \
 -f stdin \
 -delim "|" \
 -host localhost \
 -schema "autogeneratedtest.customers_by_channel2( \
 customer_id, \
 customer_type, \
 channel_id, \
 contact_info_, \
 country_code \
)"
</code></pre>

The result is a new table with the map keys prefixed by the name of the map column,&nbsp;<code>contact_info_</code>.

By default, the loader and unloader use as many threads as there are CPU cores on your box and handle up to 1000 in-flight futures. These and other&nbsp;<a href="https://github.com/brianmhess/cassandra-loader#options">advanced options</a>&nbsp;are configurable, but the defaults should work fine (especially if you run this from a separate box).

Enjoy!

Using Brian’s cassandra-loader/unloader to migrate C* Maps for DSE Search compatibility

Sebastian Estevez, DataStax
