DataStax Developer Blog

What’s new in Cassandra 0.7: Secondary indexes

By Jonathan Ellis -  December 3, 2010 | 40 Comments

Overview

In Cassandra, indexes on column values are called “secondary indexes,” to distinguish them from the index on the row key that all ColumnFamilies have. Secondary indexes allow querying by value and can be built in the background automatically without blocking reads or writes.

The best way to explain secondary indexes is by example. Let’s start the Cassandra CLI and create a users ColumnFamily:


$ bin/cassandra-cli --host localhost
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.
Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit.
[default@unknown] create keyspace demo;
[default@unknown] use demo;
[default@demo] create column family users with comparator=UTF8Type
... and column_metadata=[{column_name: full_name, validation_class: UTF8Type},
... {column_name: birth_date, validation_class: LongType, index_type: KEYS}];

Here we’ve defined two columns: full_name, which isn’t indexed but is required to be a UTF8 String, and birth_date, which we are indexing.

For Cassandra 0.7.0 only the KEYS index type is supported; this is similar to a hash index. Support for bitmap indexes is being worked on for a future release.

Next we add some users:

[default@demo] set users[bsanderson][full_name] = 'Brandon Sanderson';
[default@demo] set users[bsanderson][birth_date] = 1975;
[default@demo] set users[prothfuss][full_name] = 'Patrick Rothfuss';
[default@demo] set users[prothfuss][birth_date] = 1973;
[default@demo] set users[htayler][full_name] = 'Howard Tayler';
[default@demo] set users[htayler][birth_date] = 1968;

Now we can ask Cassandra for users born in a given year:
[default@demo] get users where birth_date = 1973;

-------------------
RowKey: prothfuss
=> (column=birth_date, value=1973, timestamp=1291333944389000)
=> (column=full_name, value=Patrick Rothfuss, timestamp=1291333940538000)

Adding an index

Now, suppose we now want to find all the users in a given state. In older versions of Cassandra, we’d need to create a ColumnFamily named (say) users_by_state, whose row keys were the state names and whose columns were the users in that state — sort of a materialized view in each row.

This works fine, but it has some drawbacks: it’s a fair amount of boilerplate to maintain these in your application, and when you add query types, back-populating the materialized views for extant data is a chore (although Hadoop support helps).

Secondary indexes automate this. Let’s add some state data:

[default@demo] set users[bsanderson][state] = 'UT';
[default@demo] set users[prothfuss][state] = 'WI';
[default@demo] set users[htayler][state] = 'UT';

Note that even though state is not indexed yet, we can include the new state data in a query as long as another column in the query is indexed:

[default@demo] get users where state = 'UT';
No indexed columns present in index clause with operator EQ
[default@demo] get users where state = 'UT' and birth_date > 1970;
No indexed columns present in index clause with operator EQ
[default@demo]get users where birth_date = 1968 and state = 'UT';
-------------------
RowKey: htayler
=> (column=birth_date, value=1968, timestamp=1291334765649000)
=> (column=full_name, value=Howard Tayler, timestamp=1291334749160000)
=> (column=state, value=5554, timestamp=1291334890708000)

One consequence of the KEYS index type being more like a hash index than a btree is shown here: even though birth_date is indexed, Cassandra couldn’t perform the range query “> 1970″ against it.

We also see above that the CLI doesn’t know how to interpret the value of the state column since we haven’t told it what kind of data is in it yet. We’ll add that at the same time as the new index; then we can query the state column alone:

[default@demo] update column family users with comparator=UTF8Type
... and column_metadata=[{column_name: full_name, validation_class: UTF8Type},
... {column_name: birth_date, validation_class: LongType, index_type: KEYS},
... {column_name: state, validation_class: UTF8Type, index_type: KEYS}];

Now we can query against the state column alone or with other columns:

[default@demo] get users where state = 'UT';
-------------------
RowKey: bsanderson
=> (column=birth_date, value=1975, timestamp=1291333936242000)
=> (column=full_name, value=Brandon Sanderson, timestamp=1291333931790000)
=> (column=state, value=UT, timestamp=1291334909266000)
-------------------
RowKey: htayler
=> (column=birth_date, value=1968, timestamp=1291334765649000)
=> (column=full_name, value=Howard Tayler, timestamp=1291334749160000)
=> (column=state, value=UT, timestamp=1291334890708000)
[default@demo] get users where state = 'UT' and birth_date > 1970;
-------------------
RowKey: bsanderson
=> (column=birth_date, value=1975, timestamp=1291333936242000)
=> (column=full_name, value=Brandon Sanderson, timestamp=1291333931790000)
=> (column=state, value=UT, timestamp=1291334909266000)

We can perform the range query now that the state column is also indexed, so Cassandra can use the state predicate as the primary and filter on the other with a nested loop.

Programatically

Different Cassandra clients may use different method names but the idea is the same. This last query in the pycassa Python client looks like this:

state_expr = pycassa.create_index_expression('state', 'UT')
birth_expr = pycassa.create_index_expression('birth_date', 1970, op=IndexOperator.GT)
clause = pycassa.create_index_clause([state_expr, bday_expr])
result = users.get_indexed_slices(clause):

In the Hector Java client:

StringSerializer ss = StringSerializer.get();
IndexedSlicesQuery<String, String, String> indexedSlicesQuery = HFactory.createIndexedSlicesQuery(keyspace, ss, ss, ss);
indexedSlicesQuery.setColumnNames("full_name", "birth_date", "state");
indexedSlicesQuery.addGtExpression("birth_date", 1970L);
indexedSlicesQuery.addEqualsExpression("state", "UT");
indexedSlicesQuery.setColumnFamily("users");
indexedSlicesQuery.setStartKey("");
QueryResult<OrderedRows<String, String, String>> result = indexedSlicesQuery.execute();

See the pycassa documentation and hector documentation for more details.

Previously



Comments

  1. k says:

    how does the introduction of secondary indexes in cassandra 0.7 impact lucandra, which uses lucene under the covers to create/manage reverse indexes?

    thanks.

  2. Jake Luciani says:

    k: Lucandra may adopt secondary indexes internally. In terms of overlap, wrapping lucene around cassandra will always leverage the best of two well maintained projects and make it easier for folks to build apps on cassandra.

  3. David says:

    Hi Jonathan,

    Thanks for the feedbacks. I have another question, does 0.7 version support non-blocking I/O. I’m testing with 0.7RC1 and it looks like Thrift interface uses blocking IO and it is pretty slow under stress test.

    Thanks.

  4. Jonathan Ellis says:

    Blocking socket i/o is actually faster in Java on modern OSes than nonblocking. Whatever you are seeing, switching to nio would not make it faster.

    The cassandra user mailing list would be a better place to discuss what you are doing further.

  5. Yeh Zheng Tan says:

    I tried to run your demo programmatically.
    After I added the index, the query “get users where state = ‘UT’” returns no result. Can you please advise on this? Below is how I added the index:-

    KsDef ks = client.describe_keyspace(KEYSPACE_NAME);
    cfDef = new CfDef(ks.cf_defs.get(0));//tested & confirmed this is the cf I wanted
    ColumnDef columnDef2 = new ColumnDef(ByteBuffer.wrap(STATE.getBytes()), “UTF8Type”);
    columnDef2.index_type = IndexType.KEYS;
    cfDef.setColumn_metadata(Arrays.asList(columnDef, columnDef1, columnDef2));//columnDef & columnDef1 are for full_name and birth_date
    client.system_update_column_family(cfDef);

    Thank you very much.

  6. Jonathan Ellis says:

    If I were to guess, I’d say that you’re encoding the state column name for the index with getBytes() differently than you’re querying it later. (By default getBytes is going to give you UTF-16, I think.)

    Please follow up on the user@cassandra.apache.org mailing list, if necessary.

  7. Hanif says:

    Hi, I am trying to query using secondary index. I use phpcassa client for php.

    Here is my code:
    $state_expr = CassandraUtil::create_index_expression(‘state’, “UT”);

    $clause = CassandraUtil::create_index_clause(array($state_expr));

    $result = $users->get_indexed_slices(array($clause));

    print_r($result);

    I have created column family and set data according to your sample code. It works fine in DOS command. But does not work in php.

    How can I solve it ?

  8. Jonathan Ellis says:

    You’ll want to ask on http://groups.google.com/group/phpcassa, including the error you are getting with your code.

  9. Peter Chang says:

    So it seems that there is no way to do the following query (yet):
    get users where state = ‘UT’ OR state = ‘CA’ OR state = ‘VA’

    Right? If I’m wrong, I’d be to happy to be shown the way.

    Maybe this is the roadmap for future features of the secondary index (the same would be accomplished by having an “in” or “set” condition similar to sql “where foo IN (x,y,z)”?

  10. Jonathan Ellis says:

    Right. (Because you can’t have arbitrarily nested structs in Thrift.)

    The good news is it’s much easier to do OR client-side reasonably efficiently, than AND.

  11. khoa says:

    Is there a limit on the number of columns in a single column family that serve as secondary indexes in Cassandra? Also, does performance decrease significantly if the uniqueness of the column’s values is high? Thanks!

  12. Jason Kolb says:

    I would also be interested in the answer to Khoa’s question. And, to put a different spin on it, is there a limit to the number of secondary indexed columns in a column family? I’d like to keep them very sparsely populated, but there are going to be LOTS of them.

    Thanks!!

  13. Vijayakumar says:

    Hi,

    We have a typical requirement which we come across very often ie. lexical searching of the data. I am looking for a functionality in cassandra which works as ‘LIKE’ expression in SQL world.

    Is there a plan to incorporate this IndexOperator in Cassandra?

    Your help is greatly appreciated.

    Thanks!!!

  14. Jonathan Ellis says:

    khoa, jason: you can have as many indexes as you like, subject to memory constraints.

    vijay: as with SQL, LIKE ‘X%’ (“starts with” query) is relatively straightforward to index, but full-text search is better handled by something like https://github.com/tjake/Solandra.

  15. Tsiki says:

    I tried to cut an paste the example above in a Cassandra 1.0.6 installation.

    After fixing the syntax errors (missing quotes and some ascii() calls) I got the data in.

    But when I do this:
    get users where state = ‘UT’;
    I get only one result – Howard Taylor.
    And when I do:
    get users where state = ‘UT’ and birth_date > 1968;
    I get a timed out exception.

    What am I doing wrong?

    Thanks!!

  16. Tsiki says:

    Looks like my problem was due to a bug in 1.0.4 which was cleared once I rebuilt the keyspace in 1.0.6.

    Thanks

  17. Amani says:

    Hi,
    Is there a way to filter dateTime? for example i want to get data between 2010-01-01 09:00 AM to 2011-01-01 09:00AM ?
    Is this possible using indexedSlicesQuery o rangeSlicesQuery?

  18. Jonathan Ellis says:

    Yes, if you model your data as a “wide row” with the date as your column comparator. See http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/ for an example.

  19. ddzuba says:

    Hello!

    Where to get more information about how secondary indexes works in Cassandra?
    As I understand, in Cassandra secondary index is a column family with indexing value as a key and indexing row key as a column names.

    I mean, for example for this data:
    [format is row_key(column1_name=value1, column2_name=value2, ...)]

    people:
    bob(age=20, state=NY)
    alice(age=30, state=LA)
    john(age=30, state=NY)
    peter(age=40, state=LA)
    howard(age=30)

    the “age” and “state” indexes will be looked like this:

    people_age_index:
    20(bob)
    30(alice, john, howard)
    40(peter)

    state_index:
    NY(bob, john)
    LA(alice, peter)

    rignt?

    So,
    1.
    Can I run queries with several condition for same index, for example:
    get people where age=30 OR age=40?

    2.
    What about performance of ranged queries in indexes?
    get people where age>10 and age=20.
    Will it get all people with state=NY and all people with age>=20 and then merge results in memory?
    What about performance of such queries?

    4.
    What if some indexing values are mutable? What problems can I expect?

    5.
    Why indexes with hight cardinallity are bad?

    Thanks.

  20. The sample given won’t work with version 1.0.7. The reason is that the key_validation_class is missing (UTF8Type) and also the default_validation_class (UTF8Type) for adding the state-column. For your convenience, here comes the corrected version of the CLI-sample:

    create column family users with comparator=UTF8Type and key_validation_class=UTF8Type and default_validation_class=UTF8Type and column_metadata=[{column_name: full_name, validation_class: UTF8Type}, {column_name: birth_date, validation_class: LongType, index_type: KEYS}];

    set users[bsanderson][full_name] = ‘Brandon Sanderson’;

    set users[bsanderson][birth_date] = 1975;

    set users[prothfuss][full_name] = ‘Patrick Rothfuss’;

    set users[prothfuss][birth_date] = 1973;

    set users[htayler][full_name] = ‘Howard Tayler’;

    set users[htayler][birth_date] = 1968;

    get users where birth_date = 1973;

    set users[bsanderson][state] = ‘UT’;

    set users[prothfuss][state] = ‘WI’;

    set users[htayler][state] = ‘UT’;

    get users where state = ‘UT’;

    get users where state = ‘UT’ and birth_date > 1970;

    get users where birth_date = 1968 and state = ‘UT’;

    update column family users with comparator=UTF8Type and key_validation_class=UTF8Type and default_validation_class=UTF8Type and column_metadata=[{column_name: full_name, validation_class: UTF8Type}, {column_name: birth_date, validation_class: LongType, index_type: KEYS}, {column_name: state, validation_class: UTF8Type, index_type: KEYS}];

    get users where state = ‘UT’;

    get users where state = ‘UT’ and birth_date > 1970;

  21. Thomas Ryabin says:

    How would you programatically add an index with Hector? I’m looking for the equivalent of:

    [default@demo] update column family users with comparator=UTF8Type
    … and column_metadata=[{column_name: full_name, validation_class: UTF8Type},
    ... {column_name: birth_date, validation_class: LongType, index_type: KEYS},
    ... {column_name: state, validation_class: UTF8Type, index_type: KEYS}];

  22. Anthony says:

    why I have to put an equality comparison (which is secondary index column name) as the first condition when I use secondary index?
    How it works it is underlying?
    Does it create some temporary cf or something?

    Can you give explaination? thks

  23. xinyuan says:

    wow, great! this is really helpful for me to learn secondary indexes.

    Seems a mistake here: bday_expr should be birth_date ^^

    i wonder to know why
    >>get users where birth_date > 1970′<>get users where birth_date > 1970 and state = ‘UT’<<
    can work.

    what does this mean " so Cassandra can use the state predicate as the primary and filter on the other with a nested loop"

    great thanks
    xinyuan

  24. Ashish says:

    Thanks,
    I was struggling for last three days to run my phpcassa code. Now it’s working

  25. artemv says:

    Are “Secondary indexes” distributed? I.e. when I’m calling indexedSlicesQuery – it will be executing on single node, or the call will hit all concerned nodes?

  26. Amani says:

    Is it possible to get the result set ordered using indexedSlicesQuery???

  27. sriram says:

    How do i design a server migrating from traditional SQL server?. It is a huge database.

  28. Manu Zhang says:

    we need to add
    assume Users KEYS as UTF8;
    assume Users VALIDATOR as UTF8;

  29. pravat says:

    good afternoon friends.
    now my requirement is to create more no of secondary index
    in a single cql query.
    like:String query=”create index on”+column family name+”(index name)”;
    like this i want to create another index in this single query.
    so can anyone help me to solve this issue.please reply to my email
    thanks
    pravat.

  30. I am facing an issue with cassandra secondary index. Each time I query on secondary index using cli, I get different count of results?
    I didn’t seem to get the issue.

  31. Kirill says:

    What about performance of using secondary index?

    context:
    for example i have very big CF (with much rows) and need to make some as:

    SELECT * FROM test WHERE partition_key = ‘some’ AND cluster_key <= 'timestamp' AND secondary_index = true

    The problem is if i could use in only primary_index like:
    SELECT * FROM test WHERE partition_key = 'some' AND cluster_key <= 'timestamp' AND non_secondary_index_cluster_part = true

    I can't put non_secondary_index_cluster_part in first place in cluster keys because i need to query when this key could be: 1) true 2) false 3) true OR false. And non-equal can be only last key of cluster keys.

    So i desided to make index on that column. But as say in documentation here http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_when_use_index_c.html
    i must avoid using secondary index "To look for a row in a large partition.."

    well, some way i think – that partition key is looking in only one hash (i.e. one node) so is that mean i use smoll partition?

    I have some trouble with understanding this moment – please tell me about it more..

  32. This is really awesome. This makes querying data in Cassandra more fun. Thanks for the hardwork making this feature available guys. Really appreciate it.

  33. Timo Nentwig says:

    … index_type: KEYS
    Could not convert KEYS into Integer.

    (0.7rc1. index_type: 0 works, though)

  34. Jonathan Ellis says:

    Yes, it’s the same as for normal columns. (If you mean, ‘can you have different metadata for columns inside supercolumn B than in supercolumn A,’ the answer is no.)

  35. Hi, do you have any ruby example for this also?

    Thanks

  36. David says:

    [default@demo] get users where birth_date > 1970;
    No indexed columns present in index clause with operator EQ

    Why is this query failed? Other queries in this post work

  37. Shotaro says:

    Thanks for useful information.

    I had to use “0″ instead of KEYS for index_type on cassandra 0.7-rc1:
    {column_name: birth_date, validation_class: LongType, index_type: 0}

    Also, I got an exception when I tried to update the column family. But the bug seems to be fixed in trunk version.
    https://issues.apache.org/jira/browse/CASSANDRA-1764

    Error on CLI:
    ——-
    Internal error processing system_update_column_family
    ——-

  38. David says:

    Ahh, got it, didn’t read the post carefully. So one must add a “dumb” column to use as first level index, so that he can do range query on the whole CF. It’s a bit ugly but still a very good feature. Thanks!

  39. Jonathan Ellis says:

    Josh: the ruby client does not yet support secondary indexes. Expect that to be added soon.

    Shotaro: yes, both of those are fixed in 0.7 rc2.

  40. David says:

    Hi Jonathan, is there a way to define the metadata for columns inside a super column? What is the syntax in CLI?

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>