|CQL data modeling|
An index is a data structure that allows for fast, efficient lookup of data matching a given condition. In relational database design, a primary key is the unique key used to identify each row in a table.
In Cassandra, the index is a partition key index, and like any index, it speeds up random access to data in the table. The partition key ensures record uniqueness, and may also control the order in which records are physically clustered, or stored by the database. The index indexes column values in a separate, hidden table from the one that contains the values being indexed.
In Cassandra, each node maintains a partition index on tables of the data it manages.
Rows are assigned to nodes by the cluster-configured partitioner and the keyspace-configured replica placement strategy. Each node knows which ranges of partition keys each node manages and each contains only the indexed data for the partitions it has stored locally. Cassandra locates requested rows efficiently by scanning the indexes only on the relevant replicas using a scatter/gather approach. The coordinator node, which is the node being queried, contacts at least one replica from each replica set. The replica responds to the coordinator with the its indexed data. The coordinator aggregates the responses and sends them to the client.
Using CQL, you can create an index on a column after defining a table. The music service example shows how to create an index on the artists column of playlist, and then query Cassandra for songs by a particular artist:
CREATE INDEX artist_names ON playlists( artist );
An index name is optional. If you do not provide a name, Cassandra assigns a name such as artist_idx. If you provide a name, such as artist_names, the name must be unique within the keyspace. After creating an index for the artist column and inserting values into the playlists table, greater efficiency is achieved when you query Cassandra directly for artist by name, such as Fu Manchu:
SELECT * FROM playlists WHERE artist = 'Fu Manchu';
For example purposes, let's say you can create multiple indexes, for example on album and title columns of the playlists table, and use multiple conditions in the WHERE clause to filter the results. In a real-world situation, these columns might not be good choices, depending on their cardinality as described later:
CREATE INDEX album_name ON playlists ( album );
CREATE INDEX title_name ON playlists ( title );
SELECT * FROM playlists WHERE album = 'Roll Away' AND title = 'Outside Woman Blues' ALLOW FILTERING ;
When multiple occurrances of data match a condition in a WHERE clause, Cassandra selects the least-frequent occurrence of a condition for processing first for efficiency. For example, suppose data for Blind Joe Reynolds and Cream's versions of "Outside Woman Blues" were inserted into the playlists table. Cassandra queries on the album name first if there are fewer albums named Roll Away than there are songs called "Outside Woman Blues" in the database. When you attempt a potentially expensive query, such as searching a range of rows, Cassandra requires the ALLOW FILTERING directive.
Cassandra's built-in indexes are best on a table having many rows that contain the indexed value. The more unique values that exist in a particular column, the more overhead you will have, on average, to query and maintain the index. For example, suppose you had a playlists table with a billion songs and wanted to look up songs by the artist. Many songs will share the same column value for artist. Suppose in a real-word playlists table, many songs share the same column value for artist. The artist column is a good candidate for an index.
Do not use an index to query a huge volume of records for a small number of results. For example, if you create an index on a high-cardinality column, which has many distinct values, a query between the fields will incur many seeks for very few results. In the table with a billion songs, looking up songs by writer (a value that is typically unique for each song) instead of by their artist, is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using the Cassandra built-in index. For columns containing unique data, it is sometimes fine performance-wise to use an index for convenience, as long as the query volume to the table having an indexed column is moderate and not under constant load.
Naturally, there is no support for counter columns, in which every value is distinct.
Conversely, creating an index on an extremely low-cardinality column, such as a boolean column, does not make sense. Each value in the index becomes a single row in the index, resulting in a huge row for all the false values, for example. Indexing a multitude of indexed columns having foo = true and foo = false is not useful.