Jonathan Ellis

<h1>Deprecation warning</h1>

This post covers the obsolete Cassandra 0.7. Modern Cassandra manipulates indexes using CQL.

<h3>Overview</h3>

In Cassandra, indexes on column values are called "secondary indexes," to distinguish them from the index on the row key that all ColumnFamilies have. Secondary indexes allow querying by value and can be built in the background automatically without blocking reads or writes.

The best way to explain secondary indexes is by example. Let's start the Cassandra CLI and create a&nbsp;users&nbsp;ColumnFamily:

 
<code>$ bin/cassandra-cli --host localhost 
Connected to: "Test Cluster" on localhost/9160 
Welcome to cassandra CLI. 
Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit. 
[default@unknown]&nbsp;create keyspace demo; 
[default@unknown]&nbsp;use demo; 
[default@demo]&nbsp;create column family users with comparator=UTF8Type 
...&nbsp;and column_metadata=[{column_name: full_name, validation_class: UTF8Type}, 
...&nbsp;{column_name: birth_date, validation_class: LongType, index_type: KEYS}];</code>

Here we've defined two columns: full_name, which isn't indexed but is required to be a UTF8 String, and birth_date, which we&nbsp;are&nbsp;indexing.

Cassandra 0.7.0 supports only the KEYS index type, which behaves much like a hash index. Support for bitmap indexes is&nbsp;<a href="https://issues.apache.org/jira/browse/CASSANDRA-1472">being worked on</a>&nbsp;for a future release.

Next we add some users:

<code>[default@demo]&nbsp;set users[bsanderson][full_name] = 'Brandon Sanderson'; 
[default@demo]&nbsp;set users[bsanderson][birth_date] = 1975; 
[default@demo]&nbsp;set users[prothfuss][full_name] = 'Patrick Rothfuss'; 
[default@demo]&nbsp;set users[prothfuss][birth_date] = 1973; 
[default@demo]&nbsp;set users[htayler][full_name] = 'Howard Tayler'; 
[default@demo]&nbsp;set users[htayler][birth_date] = 1968;</code>

Now we can ask Cassandra for users born in a given year:

<code>[default@demo]&nbsp;get users where birth_date = 1973; 
------------------- 
RowKey: prothfuss 
=&gt; (column=birth_date, value=1973, timestamp=1291333944389000) 
=&gt; (column=full_name, value=Patrick Rothfuss, timestamp=1291333940538000)</code>
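As a rough mental model of why that lookup is cheap, here is a minimal Python sketch of a hash-style index over the same three users. It is only an illustration: the `users` dict and the `build_keys_index` helper are hypothetical in-memory stand-ins, not how Cassandra stores or maintains its KEYS indexes.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the users ColumnFamily.
users = {
    "bsanderson": {"full_name": "Brandon Sanderson", "birth_date": 1975},
    "prothfuss": {"full_name": "Patrick Rothfuss", "birth_date": 1973},
    "htayler": {"full_name": "Howard Tayler", "birth_date": 1968},
}


def build_keys_index(rows, column):
    """Map each value of `column` to the set of row keys that hold it."""
    index = defaultdict(set)
    for row_key, columns in rows.items():
        if column in columns:
            index[columns[column]].add(row_key)
    return index


birth_date_index = build_keys_index(users, "birth_date")

# An equality query is a single dictionary probe on the indexed value.
matches = sorted(birth_date_index.get(1973, set()))
print(matches)  # ['prothfuss']
```

The index maps a column value back to row keys, which is the reverse of the usual row-key-to-columns lookup.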

<h3>Adding an index</h3>

Now suppose we want to find all the users in a given state. In older versions of Cassandra, we'd need to create a ColumnFamily named (say) users_by_state, whose row keys were the state names and whose columns were the users in that state -- sort of a materialized view in each row.

<a href="http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/">This works fine</a>, but it has some drawbacks: it's a fair amount of boilerplate to maintain these in your application, and when you add query types, back-populating the materialized views for extant data is a chore (although&nbsp;<a href="http://wiki.apache.org/cassandra/HadoopSupport">Hadoop support</a>&nbsp;helps).
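To make the boilerplate concrete, here is a small Python sketch of what maintaining users_by_state by hand might look like. It assumes plain dicts as stand-ins for the two ColumnFamilies, and the `set_state` helper is hypothetical; a real application would issue the equivalent writes and deletes through its Cassandra client.

```python
# Hypothetical in-memory stand-ins for two ColumnFamilies.
users = {}           # user row key -> {column name: value}
users_by_state = {}  # state name   -> {user row key: ''}


def set_state(user_key, new_state):
    """Write a user's state and keep the hand-rolled view in sync."""
    row = users.setdefault(user_key, {})
    old_state = row.get("state")
    if old_state is not None and old_state != new_state:
        # The application must remember to delete the stale view entry.
        users_by_state[old_state].pop(user_key, None)
    row["state"] = new_state
    users_by_state.setdefault(new_state, {})[user_key] = ""


set_state("bsanderson", "UT")
set_state("prothfuss", "WI")
set_state("htayler", "UT")
# A made-up move, only to show that one logical update touches two view rows.
set_state("prothfuss", "UT")

print(sorted(users_by_state["UT"]))  # ['bsanderson', 'htayler', 'prothfuss']
print(users_by_state["WI"])          # {}
```

Every write path that touches state has to go through a helper like this, and rows written before the view existed still need a separate back-population pass.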

Secondary indexes automate this. Let's add some state data:

<code>[default@demo]&nbsp;set users[bsanderson][state] = 'UT'; 
[default@demo]&nbsp;set users[prothfuss][state] = 'WI'; 
[default@demo]&nbsp;set users[htayler][state] = 'UT';</code>

Note that even though state is not indexed yet, we can include the new state data in a query as long as the query also contains an equality predicate on an indexed column:

<code>[default@demo]&nbsp;get users where state = 'UT'; 
No indexed columns present in index clause with operator EQ 
[default@demo]&nbsp;get users where state = 'UT' and birth_date &gt; 1970; 
No indexed columns present in index clause with operator EQ 
[default@demo]&nbsp;get users where birth_date = 1968 and state = 'UT'; 
------------------- 
RowKey: htayler 
=&gt; (column=birth_date, value=1968, timestamp=1291334765649000) 
=&gt; (column=full_name, value=Howard Tayler, timestamp=1291334749160000) 
=&gt; (column=state, value=5554, timestamp=1291334890708000)</code>

One consequence of the KEYS index type being more like a hash index than a B-tree shows up here: even though birth_date is indexed, Cassandra couldn't use that index on its own to answer the range predicate "&gt; 1970".

We also see above that the CLI doesn't know how to interpret the value of the state column since we haven't told it what kind of data is in it yet. We'll add that at the same time as the new index; then we can query the state column alone:

<code>[default@demo]&nbsp;update column family users with comparator=UTF8Type 
...&nbsp;and column_metadata=[{column_name: full_name, validation_class: UTF8Type}, 
...&nbsp;{column_name: birth_date, validation_class: LongType, index_type: KEYS}, 
...&nbsp;{column_name: state, validation_class: UTF8Type, index_type: KEYS}];</code>
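The value=5554 that the CLI printed for state in the earlier output is just the hexadecimal form of the raw bytes, which is what you would expect for a column with no validation class. A quick Python check (standard library only) confirms that it decodes to the string we stored:

```python
raw = bytes.fromhex("5554")          # the value the CLI printed for state
decoded = raw.decode("utf-8")
print(decoded)                       # UT

encoded = "UT".encode("utf-8").hex()
print(encoded)                       # 5554
```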

Now we can query against the state column alone or with other columns:

<code>[default@demo]&nbsp;get users where state = 'UT'; 
------------------- 
RowKey: bsanderson 
=&gt; (column=birth_date, value=1975, timestamp=1291333936242000) 
=&gt; (column=full_name, value=Brandon Sanderson, timestamp=1291333931790000) 
=&gt; (column=state, value=UT, timestamp=1291334909266000) 
------------------- 
RowKey: htayler 
=&gt; (column=birth_date, value=1968, timestamp=1291334765649000) 
=&gt; (column=full_name, value=Howard Tayler, timestamp=1291334749160000) 
=&gt; (column=state, value=UT, timestamp=1291334890708000) 
[default@demo]&nbsp;get users where state = 'UT' and birth_date &gt; 1970; 
------------------- 
RowKey: bsanderson 
=&gt; (column=birth_date, value=1975, timestamp=1291333936242000) 
=&gt; (column=full_name, value=Brandon Sanderson, timestamp=1291333931790000) 
=&gt; (column=state, value=UT, timestamp=1291334909266000)</code>

The range query works now that the state column is also indexed: Cassandra uses the equality predicate on state as the primary lookup and filters the matching rows on birth_date with a nested loop.
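The behavior we saw in the CLI can be approximated with a short Python sketch. This is a simplified model under stated assumptions, not Cassandra's actual query path: `build_index` and `get_where` are hypothetical helpers, the data lives in a plain dict, and the error text is copied from the CLI output above only for familiarity.

```python
import operator

OPS = {"EQ": operator.eq, "GT": operator.gt, "GTE": operator.ge,
       "LT": operator.lt, "LTE": operator.le}

# Hypothetical in-memory stand-in for the users ColumnFamily.
users = {
    "bsanderson": {"full_name": "Brandon Sanderson", "birth_date": 1975, "state": "UT"},
    "prothfuss": {"full_name": "Patrick Rothfuss", "birth_date": 1973, "state": "WI"},
    "htayler": {"full_name": "Howard Tayler", "birth_date": 1968, "state": "UT"},
}


def build_index(rows, column):
    """Hash-style index: column value -> set of row keys."""
    index = {}
    for key, cols in rows.items():
        if column in cols:
            index.setdefault(cols[column], set()).add(key)
    return index


def get_where(rows, indexes, predicates):
    """Evaluate ANDed (column, op, value) predicates, index-first."""
    # The primary predicate must be an equality test on an indexed column.
    primary = next((p for p in predicates
                    if p[1] == "EQ" and p[0] in indexes), None)
    if primary is None:
        raise ValueError(
            "No indexed columns present in index clause with operator EQ")
    column, _, value = primary
    candidates = indexes[column].get(value, set())
    # Every other predicate is checked row by row against the candidates.
    rest = [p for p in predicates if p is not primary]
    return sorted(
        key for key in candidates
        if all(col in rows[key] and OPS[op](rows[key][col], val)
               for col, op, val in rest)
    )


query = [("state", "EQ", "UT"), ("birth_date", "GT", 1970)]

# With only birth_date indexed, no equality predicate can use an index.
indexes = {"birth_date": build_index(users, "birth_date")}
try:
    get_where(users, indexes, query)
except ValueError as exc:
    print(exc)

# Once state is indexed too, it serves as the primary lookup.
indexes["state"] = build_index(users, "state")
print(get_where(users, indexes, query))  # ['bsanderson']
```

A hash-style index can only answer "which rows have exactly this value", so the range predicate is applied as a filter after the equality lookup has narrowed the candidates.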

<h3>Programmatically</h3>

Different&nbsp;<a href="http://www.riptano.com/software">Cassandra clients</a>&nbsp;may use different method names, but the idea is the same. In the&nbsp;<a href="http://github.com/pycassa/pycassa">pycassa</a>&nbsp;Python client, the last query above looks like this:

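<code>import pycassa 
from pycassa.cassandra.ttypes import IndexOperator 
# users is assumed to be an already-opened pycassa ColumnFamily for "users" 
state_expr = pycassa.create_index_expression('state', 'UT') 
birth_expr = pycassa.create_index_expression('birth_date', 1970, op=IndexOperator.GT) 
clause = pycassa.create_index_clause([state_expr, birth_expr]) 
result = users.get_indexed_slices(clause)</code>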

In the&nbsp;<a>Hector</a>&nbsp;Java client:

<code>StringSerializer ss = StringSerializer.get(); 
IndexedSlicesQuery&lt;String, String, String&gt; indexedSlicesQuery = HFactory.createIndexedSlicesQuery(keyspace, ss, ss, ss); 
indexedSlicesQuery.setColumnNames("full_name", "birth_date", "state"); 
indexedSlicesQuery.addGtExpression("birth_date", 1970L); 
indexedSlicesQuery.addEqualsExpression("state", "UT"); 
indexedSlicesQuery.setColumnFamily("users"); 
indexedSlicesQuery.setStartKey(""); 
QueryResult&lt;OrderedRows&lt;String, String, String&gt;&gt; result = indexedSlicesQuery.execute();</code>

See the&nbsp;<a href="http://pycassa.github.com/pycassa/">pycassa documentation</a>&nbsp;and&nbsp;<a href="http://www.riptano.com/sites/default/files/hector-v2-client-doc.pdf">Hector documentation</a>&nbsp;for more details.

<h3>Previously</h3>

<ul>
	<li><a href="http://www.riptano.com/blog/whats-new-cassandra-07-live-schema-updates">What's new in Cassandra 0.7: Live schema updates</a></li>
	<li><a href="http://www.riptano.com/docs/0.6/appendix/appendix_a_whats_new">What's new in Cassandra 0.6</a></li>
</ul>

What’s new in Cassandra 0.7: Secondary indexes
