Maintaining secondary indexes¶
To perform a hot rebuild of a secondary index, use the nodetool utility rebuild_index command.
An index is a data structure that allows for fast, efficient lookup of data matching a given condition.
In relational database design, a primary key is the unique key used to identify each row in a table. A primary key index, like any index, speeds up random access to data in the table. The primary key also ensures record uniqueness, and may also control the order in which records are physically clustered, or stored by the database.
In Cassandra, the primary index for a table is the index of its row keys. Each node maintains this index for the data it manages.
Rows are assigned to nodes by the cluster-configured partitioner and the keyspace-configured replica placement strategy. The primary index in Cassandra allows looking up of rows by their row key. Since each node knows what ranges of keys each node manages, requested rows can be efficiently located by scanning the row indexes only on the relevant replicas.
With randomly partitioned row keys (the default in Cassandra), row keys are partitioned by their MD5 hash and cannot be scanned in order like traditional b-tree indexes. The CQL 3 token function introduced in Cassandra 1.1.1 now makes it possible to page through non-ordered partitioner results.
Secondary indexes in Cassandra refer to indexes on column values (to distinguish them from the primary row key index for a table). Cassandra implements secondary indexes as a hidden table, separate from the table that contains the values being indexed. Using CQL, you can create a secondary index on a column after defining a table. Using CQL 3, for example, define a table, and then create a secondary index on one of its named columns:
USE myschema; CREATE TABLE users ( userID uuid, fname text, lname text, state text, zip int, PRIMARY KEY (userID, zip) ); CREATE INDEX users_state ON users (state);
Secondary index names, in this case users_state, must be unique within the keyspace. After creating a secondary index for the state column and inserting values into the table, greater efficiency is achieved when you query Cassandra directly for users who live in a given state:
SELECT * FROM users WHERE state = 'TX';
You can create multiple secondary indexes, for example on the firstname and lastname columns of the users table, and use multiple conditions in the WHERE clause to filter the results:
CREATE INDEX users_fname ON users (fname); CREATE INDEX users_lname ON users (lname); SELECT * FROM users WHERE fname = 'bob' AND lname = 'smith' ALLOW FILTERING;
When there are multiple conditions in a WHERE clause, Cassandra selects the least-frequent occurrence of a condition for processing first for efficiency. In this example, Cassandra queries on the last name first if there are fewer Smiths than Bobs in the database or on the first name first if there are fewer Bobs than Smiths. When you attempt a potentially expensive query, such as searching a range of rows, Cassandra requires the ALLOW FILTERING directive.
Cassandra's built-in secondary indexes are best on a table having many rows that contain the indexed value. The more unique values that exist in a particular column, the more overhead you will have, on average, to query and maintain the index. For example, suppose you had a user table with a billion users and wanted to look up users by the state they lived in. Many users will share the same column value for state (such as CA, NY, TX, etc.). This would be a good candidate for a secondary index.
Do not use secondary indexes to query a huge volume of records for a small number of results. For example, if you create indexes on columns that have many distinct values, a query between the fields will incur many seeks for very few results. In the table with a billion users, looking up users by their email address (a value that is typically unique for each user) instead of by their state, is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using a secondary index. For columns containing unique data, it is sometimes fine performance-wise to use secondary indexes for convenience, as long as the query volume to the indexed table is moderate and not under constant load.
An advantage of secondary indexes is the operational ease of populating and maintaining the index. Secondary indexes are built in the background automatically, without blocking reads or writes. Client-maintained tables as indexes must be created manually; for example, if the state column had been indexed by creating a table such as users_by_state, your client application would have to populate the table with data from the users table.