An index is a data structure that allows for fast, efficient lookup of data matching a given condition. In relational database design, a primary key is the unique key used to identify each row in a table.
In Cassandra, the partition key index, like any index, speeds up random access to data in the table. The partition key ensures record uniqueness, and may also control the order in which records are physically clustered, or stored by the database.
In Cassandra, each node maintains a partition index on tables of the data it manages.
Rows are assigned to nodes by the cluster-configured partitioner and the keyspace-configured replica placement strategy. Each node knows what ranges of keys each node manages. Requested rows can be efficiently located by scanning the indexes only on the relevant replicas.
An index in Cassandra refers to an index on column values. Cassandra implements an index as a hidden table, separate from the table that contains the values being indexed. Using CQL, you can create an index on a column after defining a table. The music service example shows how to create an index on the artists column of playlist, and then querying Cassandra for songs by a particular artist::
CREATE INDEX artist_names ON playlists( artist );
An index name is optional. If you provide an index name, such as artist_idx, the name must be unique within the keyspace. After creating an index for the artist column and inserting values into the playlists table, greater efficiency is achieved when you query Cassandra directly for artist by name, such as Fu Manchu:
SELECT * FROM playlists WHERE artist = 'Fu Manchu';
Using multiple indexes
For example purposes, let's say you can create multiple indexes, for example on album and title columns of the playlists table, and use multiple conditions in the WHERE clause to filter the results. In a real-world situation, these columns might not be good choices, depending on their cardinality, as described later:
CREATE INDEX album_name ON playlists ( album );
CREATE INDEX title_name ON playlists ( title );
SELECT * FROM playlists WHERE album = 'Roll Away' AND title = 'Outside Woman Blues' ALLOW FILTERING ;
When multiple occurrances of data match a condition in a WHERE clause, Cassandra selects the least-frequent occurrence of a condition for processing first for efficiency. For example, suppose data for Blind Joe Reynolds and Cream's versions of "Outside Woman Blues" were inserted into the playlists table. Cassandra queries on the album name first if there are fewer albums named Roll Away than there are songs called "Outside Woman Blues" in the database. When you attempt a potentially expensive query, such as searching a range of rows, Cassandra requires the ALLOW FILTERING directive.
When to use an index
Cassandra's built-in indexes are best on a table having many rows that contain the indexed value. The more unique values that exist in a particular column, the more overhead you will have, on average, to query and maintain the index. For example, suppose you had a playlists table with a billion songs and wanted to look up songs by the artist. Suppose in the real-word playlists table, many songs share the same column value for artist. The artist column is a good candidate for an index.
When not to use an index
Do not use an index to query a huge volume of records for a small number of results. For example, if you create an index on a high-cardinality column, which has many distinct values, a query between the fields will incur many seeks for very few results. In the table with a billion users, looking up users by their email address (a value that is typically unique for each user) instead of by their state, is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using the Cassandra built-in index. For columns containing unique data, it is sometimes fine performance-wise to use an index for convenience, as long as the query volume to the table having an indexed column is moderate and not under constant load.
Naturally, there is no support for counter columns, in which every value is distinct.
Conversely, creating an index on an extremely low-cardinality column, such as a boolean column, does not make sense. Each value in the index becomes a single row in the index, resulting in a huge row for all the false values, for example. Indexing a multitude of indexed columns having foo = true and foo = false is not useful.
Building and maintaining indexes
An advantage of indexes is the operational ease of populating and maintaining the index. Indexes are built in the background automatically, without blocking reads or writes. Client-maintained tables as indexes must be created manually; for example, if the state column had been indexed by creating a table such as users_by_state, your client application would have to populate the table with data from the users table.
To perform a hot rebuild of an index, use the nodetool rebuild_index command.