Apache Cassandra 0.7 Documentation

Secondary Indexes

In Cassandra, indexes on column values are called “secondary indexes” to distinguish them from indexes on row keys in column families. Cassandra supports secondary indexes of the type “KEYS,” which is similar to a hash index.

Secondary indexes allow for efficient querying by specific values using equality predicates (where column x = value y). Also, queries on indexed values can apply additional filters to perform operations such as range queries.

Building Secondary Indexes

You can specify the KEYS index type when creating a column definition, or you can add it later to index an existing column. Secondary indexes are built in the background automatically without blocking reads or writes.

Indexing a New Column

In this example, the Cassandra CLI is used to create a column family users with the columns full_name and state, and the indexed column birth_date (note the the index_type specification at the end of the last line):

[default@demo] create column family users with comparator=UTF8Type
... and column_metadata=[{column_name: full_name, validation_class: UTF8Type},
... {column_name: state, validation_class: UTF8Type},
... {column_name: birth_date, validation_class: LongType, index_type: KEYS}];

Because of the secondary index created for birth_date, its values can be queried directly for users born in a given year as follows:

[default@demo] get users where birth_date = 1973;
RowKey: prothfuss
=> (column=birth_date, value=1973, timestamp=1291333944389000)
=> (column=full_name, value=Patrick Rothfuss, timestamp=1291333940538000)

Indexing an Existing Column

You can create a secondary index on an existing column, indexing all existing data in the column. For example, to index the state column in the example column family, you could use the Cassandra CLI to update the column with index_type: KEYS as follows:

[default@demo] update column family users with comparator=UTF8Type
... and column_metadata=[{column_name: full_name, validation_class: UTF8Type},
... {column_name: birth_date, validation_class: LongType, index_type: KEYS},
... {column_name: state, validation_class: UTF8Type, index_type: KEYS}];

Because of the secondary index created for state, its values can be queried directly for users in a given state. Moreover, Cassandra can perform a range query on birth_date now that the state column is indexed, using the state predicate as the primary and filtering on the other with a nested loop:

[default@demo] get users where state = 'UT' and birth_date > 1970;
RowKey: bsanderson
=> (column=birth_date, value=1975, timestamp=1291333936242000)
=> (column=full_name, value=Brandon Sanderson, timestamp=1291333931790000)
=> (column=state, value=UT, timestamp=1291334909266000)

Cassandra allows adding additional predicates in this way with no restrictions. However, such queries should always be planned to begin with the index that is expected to return the fewest rows, and then, as Cassandra loops over that, apply the additional predicates.

When to Use Secondary Indexes

KEYS indexes are best for cases when there are many rows with each value indexed. The more unique values exist in a particular column, the more overhead you will have, on average, to perform queries. Whenever a secondary index results in performance penalties due to high numbers of unique values and high query volume, then Using Column Families as Indexes is likely to be the better approach.

Another consideration is the operational advantage of secondary indexes in some cases. When you create a secondary index on an existing column, it indexes existing data in the background. In the example above, if the state column had been indexed by creating a column family such as users_by_state, it would have been necessary to back-populate the existing data.

Powered by Rackspace
Apache, Apache Cassandra, Cassandra, Apache Hadoop, Hadoop and the eye logo are trademarks of the Apache Software Foundation.