Apache Cassandra™ 2.0

How Cassandra stores indexes

Internally, a Cassandra index is a data partition. In storage engine terms, the index is a wide row. In the example of a music service, the playlists table includes an artist column and uses a compound primary key: id is the partition key and song_order is the clustering column.

CREATE TABLE playlists (
  id uuid,
  song_order int,
  . . .
  artist text,
  PRIMARY KEY (id, song_order)
);

As shown in the music service example, to filter the data by artist, first create an index on the artist column. Cassandra then uses the index to pull out the matching records. An attempt to filter on artist before the index exists fails: Cassandra rejects the query because answering it would require a sequential scan across the entire playlists dataset, which would be very inefficient.
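A minimal sketch of that step, assuming the playlists table defined above already exists in the current keyspace. The index is left unnamed here, in which case Cassandra assigns a default name:

```cql
-- Create a secondary index on the artist column of playlists.
CREATE INDEX ON playlists (artist);
```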

For example purposes, here is a CQL representation of the index on the artist column:

//pseudo-code CQL representation of an index
CREATE TABLE playlists_artists (
  artist text,
  id uuid,
  song_order int,
  PRIMARY KEY (artist, id, song_order)
) WITH COMPACT STORAGE;

After the artist index is created, Cassandra can filter the data in the playlists table by artist, for example to find the songs by Fu Manchu.
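As an illustration, and assuming an index on the artist column of playlists is already in place, a query along these lines filters by artist:

```cql
-- Return the playlist rows whose indexed artist column matches.
SELECT * FROM playlists WHERE artist = 'Fu Manchu';
```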

Index distribution

Generally, for a table that has a simple or compound primary key, Cassandra stores an entire row of data on a single node, chosen by the partition key. The partitions are distributed by hashing the playlist id and using the ring to locate the nodes that store the data. Different playlists are therefore generally stored on different nodes, and to find all the songs by Fu Manchu, Cassandra has to visit more than one node.

If Cassandra stored the index in the same way as it stores a table, all index entries for Fu Manchu would be contained on a single node plus its replicas. Eventually, performance problems could occur because every index lookup for that artist would query that particular node.

To avoid these problems, Cassandra stores indexes locally: each node stores index entries only for the data distributed to it. This technique does not guarantee trouble-free indexing, however, so know when and when not to use an index. For example, the node containing partition 62c36092... also contains the artist index entries for that playlist: Fu Manchu, ZZ Top, and Back Door Slam. Another playlist, with a different partition id such as 8b172618..., does not share these artist index entries. It holds different data, so the index entries stored with it are for different artists: Beyonce, J Z, and Pitbull, for example.