A deep look at the CQL WHERE clause

By Benjamin Lerer -  June 8, 2015 | 15 Comments

While they share similar syntaxes, there are lots of differences between CQL and SQL. The reasons for these differences come mainly from the fact that Cassandra is dealing with distributed data and aims to prevent inefficient queries.

One of the places where CQL differs a lot from SQL is the WHERE clause. The goal of this post is  to describe what is supported by the CQL WHERE clause and the reasons why it differs from normal SQL.

Primary key columns

In Cassandra, two types of columns have a special role: the partition key columns and the clustering columns. Together, they will define your row primary key.

The partition key columns are the first part of primary key and their role is to spread data evenly around the cluster. Rows will be spread around the cluster based on the hash of the partition keys.

The clustering key columns are used to cluster the data of a partition, allowing a very efficient retrival of rows.

Due to the differences in the role that they are playing, partition key, clustering and normal columns support different sets of restrictions within the WHERE clause. Futhermore, those sets of restrictions differ depending of the type of query: SELECT, UPDATE or DELETE.

WHERE clause restrictions for SELECT statements

Partition keys restrictions

The partition key columns support only two operators: = and IN

IN restriction

Prior to 2.2, the IN restrictions could only be applied to the last column of the partition key. So, for example, if your table was:

CREATE TABLE numberOfRequests (
    cluster text,
    date text,
    time text,
    numberOfRequests int,
    PRIMARY KEY ((cluster, date), time)
)

In 2.1, you could only use an IN operator on the date column. In 2.2 you can use the IN operator on any partition key column. By consequence, a query like:

SELECT * FROM numberOfRequests
    WHERE cluster IN ('cluster1', 'cluster2')
    AND date = '2015-05-06'
    AND time >= '12:00'
    AND time <= '14:00';

is valid since 2.2 but is invalid in the prior releases.

This change makes CQL more consistent but you should nevertheless be careful with IN restrictions on partition key columns. The nice post of Ryan Svihla will give you a clear explanation on why you should try to avoid them.

Another change, introduced with 2.2, is that the results are not returned anymore in the order in which the partition key where specified in the IN clause. From 2.2 onward, the results are returned in the natural order of the column type and duplicates are ignored.

Unrestricted partition key columns

Cassandra will require that you either restrict all the partition key columns, or none of them unless the query can use a secondary index.

This means that a query like:

SELECT * FROM numberOfRequests WHERE cluster='cluster1' AND time ='12:00';

will be rejected as the date column is not restricted.

The reason why is that Cassandra needs all the partition key columns to be able to compute the hash that will allow it to locate the nodes containing the partition.

If no restrictions are specified on the partition keys but some are specified on the clustering keys, Cassandra will require ALLOW FILTERING to be added to the query. For more information on ALLOW FILTERING you should look at ALLOW FILTERING explained.

>, >=, <= and < operators

Cassandra distributes the partition accross the nodes using the selected partitioner. As only the ByteOrderedPartitioner keeps an ordered distribution of data Cassandra does not support >, >=, <= and < operator directly on the partition key.

Instead, it allows you to use the >, >=, <= and < operator on the partition key through the use of the token function.

SELECT * FROM numberOfRequests
    WHERE token(cluster, date) > token('cluster1', '2015-06-03')
    AND token(cluster, date) <= token('cluster1', '2015-06-05')
    AND time = '12:00';

If you use a ByteOrderedPartitioner, you will then be able to perform some range queries over multiple partitions. You should nevertheless be careful. Using a ByteOrderedPartitioner is not recommended as it can result in unbalanced clusters.

Clustering column restrictions

Clustering columns support the =, IN, >, >=, <=, <, CONTAINS and CONTAINS KEY operators in single-column restrictions and the =, IN, >, >=, <= and < operators in multi-column restrictions.

Unrestricted clustering columns

The role of clustering columns is to cluster data within a partition. If you have the following table:

CREATE TABLE numberOfRequests (
    cluster text,
    date text,
    datacenter text,
    hour int,
    minute int,
    numberOfRequests int,
    PRIMARY KEY ((cluster, date), datacenter, hour, minute))

The data will be stored per partition in the following way:

{datacenter: US_WEST_COAST {hour: 0 {minute: 0 {numberOfRequests: 130}} {minute: 1 {numberOfRequests: 125}} … {minute: 59 {numberOfRequests: 97}}} {hour: 1 {minute: 0 …

You can see that in order to retrieve data in an efficient way without a secondary index, you need to know all the clustering key columns for you selection.

So, if you execute:

SELECT * FROM numberOfRequests
    WHERE cluster = ‘cluster1’
    AND date = ‘2015-06-05’
    AND datacenter = 'US_WEST_COAST'
    AND hour = 14
    AND minute = 00;

Cassandra will find the data efficiently but if you execute:

SELECT * FROM numberOfRequests
    WHERE cluster = ‘cluster1’
    AND date = ‘2015-06-05’
    AND hour = 14
    AND minute = 0;

Cassandra will reject the query as it has to scan the entire partition to find the requested data, which is inefficient.

IN restrictions

Prior to 2.2 the IN restrictions on clustering columns are only allowed on the last clustering column. In 2.2, the IN restriction can be used on any column and the following query will work:

SELECT * FROM numberOfRequests
    WHERE cluster = ‘cluster1’
    AND date = ‘2015-06-05’
    AND datacenter = 'US_WEST_COAST'
    AND hour IN (14, 15)
    AND minute = 0;

The same set of data can be retrieved prior to 2.2 by using a multi-column IN restriction

SELECT * FROM numberOfRequests
    WHERE cluster = ‘cluster1’
    AND date = ‘2015-06-05’
    AND datacenter = 'US_WEST_COAST'
    AND (hour, minute) IN ((14, 0), (15, 0));

In 2.2, multi-column IN restrictions can be applied to any set of clustering columns.

SELECT * FROM numberOfRequests
    WHERE cluster = ‘cluster1’
    AND date = ‘2015-06-05’
    AND (datacentre, hour) IN (('US_WEST_COAST', 14), (‘US_EAST_COAST’, 17))
    AND minute = 0;

Prior to 2.2, multi-column IN restrictions can only be applied to the last set of clustering columns being restricted. By consequence, the previous query is invalid in 2.1. But following query is a perfectly valid one.

SELECT * FROM numberOfRequests
    WHERE cluster = ‘cluster1’
    AND date = ‘2015-06-05’
    AND datacenter = 'US_WEST_COAST'
    AND (hour) IN ((14), (15));

>, >=, <= and < restrictions

Single column slice restrictions are allowed only on the last clustering column being restricted.

Therefore, the following queries are valid:

SELECT * FROM numberOfRequests
    WHERE cluster = ‘cluster1’
    AND date = ‘2015-06-05’
    AND datacenter = 'US_WEST_COAST'
    AND hour= 12
    AND minute >= 0 AND minute <= 30;
SELECT * FROM numberOfRequests
    WHERE cluster = ‘cluster1’
    AND date = ‘2015-06-05’
    AND datacenter = 'US_WEST_COAST'
    AND hour >= 12;
SELECT * FROM numberOfRequests
    WHERE cluster = ‘cluster1’
    AND date = ‘2015-06-05’
    AND datacenter > 'US';

But the following one is invalid:

SELECT * FROM numberOfRequests
    WHERE cluster = ‘cluster1’
    AND date = ‘2015-06-05’
    AND datacenter = 'US_WEST_COAST'
    AND hour >= 12 AND minute = 0;

Multi-column slice restrictions are allowed on the last set of clustering columns being restricted.

SELECT * FROM numberOfRequests
    WHERE cluster = ‘cluster1’
    AND date = ‘2015-06-05’
    AND datacenter = 'US_WESTCOAST'
    AND (hour, minute) >= (12, 0) AND (hour, minute) <= (14, 0)

If both sides are specified for the slice, the restrictions must start with the same column. Therefore the query:

SELECT * FROM numberOfRequests
    WHERE cluster = ‘cluster1’
    AND date = ‘2015-06-05’
    AND datacenter = 'US_WEST_COAST'
    AND (hour, minute) >= (12, 30) AND (hour) < (14)

is valid but the following one is not:

SELECT * FROM numberOfRequests
    WHERE cluster = ‘cluster1’
    AND date = ‘2015-06-05’
    AND datacentre = 'US_WEST_COAST'
    AND (hour, minute) >= (12, 0)
    AND (minute) <= (45)

CONTAINS and CONTAINS KEY restrictions

CONTAINS and CONTAINS KEY restrictions can only be used on collections when the query is using a secondary index.

Regular column restrictions

Regular columns can be restricted by =, >, >=, <= and <, CONTAINS or CONTAINS KEY restrictions if the query is a secondary index query.
IN restrictions are not supported.

Secondary indices queries

Direct queries on secondary indices support only =, CONTAINS or CONTAINS KEY restrictions.

The CONTAINS restriction can only be used on collection types. The CONTAINS KEY restriction can only be used on map for which the keys have been indexed.

For example, if you have the following table:

CREATE TABLE contacts (
    id int PRIMARY KEY,
    firstName text,
    lastName text,
    phones map<text, text>,
    emails set<text>
);
CREATE INDEX ON contacts (firstName);
CREATE INDEX ON contacts (keys(phones)); // Using the keys function to index the map keys
CREATE INDEX ON contacts (emails);  

the following queries will be valid:

SELECT * FROM contacts WHERE firstname = 'Benjamin';
SELECT * FROM contacts WHERE phones CONTAINS KEY 'office';
SELECT * FROM contacts WHERE emails CONTAINS 'Benjamin@oops.com';

Secondary indices filtering

Secondary index queries allow you to restrict the returned results using the =, >, >=, <= and <, CONTAINS and CONTAINS KEY restrictions on non-indexed columns using filtering.

Therefore, the following queries are valid, as long as ALLOW FILTERING is specified:

SELECT * FROM contacts
    WHERE firstname = 'Benjamin'
    AND lastname = 'Lerer'
    ALLOW FILTERING;
SELECT * FROM contacts
   WHERE phones CONTAINS KEY 'office'
   AND phones CONTAINS '0000.0000.0000'
   ALLOW FILTERING;

You should nevertheless be carefull with filtering. It can be an expensive operation.

Partition key restrictions and Secondary indices

When Cassandra must perform a secondary index query, it will contact all the nodes to check the part of the secondary index located on each node. If all the partition key components are restricted, Cassandra will use that information to
query only the nodes that contains the specified partition keys, which will make the query more efficient.

For secondary index queries, only = restrictions are supported on partition key columns.

Clustering column restrictions and Secondary indices

For each indexed value, Cassandra stores the full primary key (partition key columns + clustering columns) of each row containing the value. When an index query is performed, Casssandra will retrieve the primary keys of the rows containing the value from the index. It will then retrieve the rows from the table and perform any filtering needed on it.

If the first clustering columns have been restricted, Cassandra will perform an early filtering on the primary keys returned by the index, making the filtering more efficient.

For that type of filtering, Cassandra will accept the following clustering column restrictions: =, IN, >, >=, <= and <.

So, if we add the following secondary index to the numberOfRequests table:

CREATE INDEX ON numberOfRequests (minute);

the following query is perfectly valid:

SELECT * FROM numberOfRequests
    WHERE cluster = 'cluster1'
    AND date = '2015-06-05' 
    AND datacenter IN ('US_WEST_COAST', 'US_EAST_COAST')
    AND minute = 0
    ALLOW FILTERING;

WHERE clause restrictions for the UPDATE and DELETE statements

In UPDATE and DELETE statements all the primary key columns must be restricted and the only allowed restrictions are:

  • the single-column = on any partition key or clustering columns
  • the single-column IN restriction on the last partition key column

CASSANDRA-6237 will address part of those limitations in 3.0 by adding support to UPDATE AND DELETE statements for:

  • IN restrictions on any partition key column
  • IN restictions on any clustering column
  • EQ and IN multi-column restrictions on the clustering keys (mixed or not with single-column restrictions).

and by adding to DELETE statements support for range deletion of entire rows.

IN restrictions will still not supported for conditional updates or deletes.

Secondary index lookup for UPDATE and DELETE statements is not and will not be supported due to the risk involved with read before write.

Conclusion

To get the most out of Cassandra you will need to design your tables for the queries you want to perfom. Hopefully, the explanations provided in this post will help you in this task.

A last word about single and multi-column restrictions mix

In some versions of Cassandra mixing single and multi-column restrictions has been broken. It has been properly fixed in 2.0.15 and 2.1.5.



Comments

  1. Benjamin Lerer says:

    If you are looking for the slides of the presentation that I gave at the Cassandra Summit, you can find them here: http://www.slideshare.net/BenjaminLerer1/a-deep-look-at-the-cql-where-clause-53138465

  2. Mck says:

    These two examples are misleading

    Multi-column slice restrictions are allowed on the last set of clustering columns being restricted.

    SELECT * FROM numberOfRequests
    WHERE cluster = ‘cluster1’
    AND date = ‘2015-06-05’
    AND datacenter = 'US_WESTCOAST'
    AND (server, time) >= (‘196.8.0.0’, 12:00) AND (server, time) <= (‘196.8.255.255’, 14:00)

    If both sides are specified for the slice, the restrictions must start with the same column. Therefore the query:

    SELECT * FROM numberOfRequests
    WHERE cluster = ‘cluster1’
    AND date = ‘2015-06-05’
    AND datacenter = 'US_WEST_COAST'
    AND (server, time) >= (‘196.8.0.0’, 12:00) AND (server) <= (‘196.8.255.255’)

    is valid

    Reading them most people will expect results where time is after “12:00” (and before “14:00” in the former).

    While the actual results hereare all values between (‘196.8.0.0’, 12:00) and (‘196.8.255.255’) inclusive.

    That is all rows from ‘196.8.0.1’ to ‘196.8.255.254’ are returned regardless of their time value. Only the first ‘196.8.0.0’ row is restricted to time values from and after ’12:00′, and only the last ‘196.8.255.255’ row is restricted to time values before and to ’14:00′).

    These two statements would have been more intuitive against a schema like
    cluster text,
    date text,
    datacenter text,
    day int,
    hour int,
    numberOfRequests int,
    PRIMARY KEY ((cluster, date), datacenter, day, hour))
    );

    SELECT * FROM numberOfRequests
    WHERE cluster = ‘cluster1’
    AND date = ‘2015-06-05’
    AND datacenter = 'US_WESTCOAST'
    AND (day, hour) >= (2015016, 12) AND (day, hour) <= (2015200, 14)

    1. Benjamin Lerer says:

      Thanks for the feedback.

      I agree, the example was really misleading.

      I changed the examples of the ‘Clustering column restrictions’ section.

      I hope that it will clearer for everybody.

  3. anu says:

    Hi
    Please find below the issue i encountered :

    Table structure :
    CREATE TABLE ratevalue (
    strid uuid,
    starttime bigint,
    rate int
    PRIMARY KEY (strid , starttime,heart_rate )
    ) ;

    Now please look at query below and results

    select starttime, rate from ratevalue where strid = f4aa8a29-819e-4ae6-901c-afa5642977c1 and (starttime, rate) >= (-2717403298298551337,65) and (starttime,rate ) <= (-2717403289751566418 ,100 ) ;

    starttime | rate
    ———————-+————
    -2717403298298551337 | 100
    -2717403289751566418 | 65

    why row where rate is 65 is coming as output of the above query ?

    Please help.

    1. Benjamin Lerer says:

      What you query means is:

      I want all the rows of partition f4aa8a29-819e-4ae6-901c-afa5642977c1 which have a clustering key greater or equals to (-2717403298298551337,65) and smaller or equals to (-2717403289751566418 ,100 )

      If SELECT * FROM myTable WHERE pk = 1 returns:

      pk | c1 | c2
      —+—-+—-
      1| 1 | 1
      —+—-+—-
      1| 1 | 2
      —+—-+—-
      1| 1 | 3
      —+—-+—-
      1| 1 | 4
      —+—-+—-
      1| 2 | 1
      —+—-+—-
      1| 2 | 2
      —+—-+—-
      1| 2 | 3
      —+—-+—-
      1| 2 | 4

      The query SELECT * FROM myTable WHERE pk = 1 AND (c1, c2) >= (1,3) and (c1, c2) <= (2, 2) will return:

      pk | c1 | c2
      —+—-+—-
      1| 1 | 3
      —+—-+—-
      1| 1 | 4
      —+—-+—-
      1| 2 | 1
      —+—-+—-
      1| 2 | 2
      —+—-+—-

      Hope it helps

  4. Asb says:

    CREATE TABLE users3 (
    userid int,
    first_name text,
    last_name text,
    city text,
    count int,
    PRIMARY KEY ((userid), count, last_name)
    );

    cqlsh:apprepo> select * from users3 where userid=112 AND count >100 AND count < 200 AND last_name='ZZTop2';
    Bad Request: PRIMARY KEY column "last_name" cannot be restricted (preceding column "count" is either not restricted or by a non-EQ relation)

    any thoughts on the errors above ?

    1. Benjamin Lerer says:

      The error message looks clear to me.
      The last_name column cannot be restricted because you have restricted the count column by a non equal relation (> and <).

      Single column slice restrictions are allowed only on the last clustering column being restricted.

  5. Chuan says:

    For the benefit of beginners coming from SQL background, I think it is a good idea to include a line saying: “in CQL, only primary key columns and secondary index columns are supported.” before diving into the WHERE clause restrictions.

    1. Benjamin Lerer says:

      This statement is not fully true.
      As I explained in the ‘Secondary indices filtering’ section, you can restrict any non indexed column as long as the query use a secondary index.

      In 3.5, you will be able to filter on any non-primary key columns even if the query does not use an index and CASSANDRA-11310 will add support for filtering on clustering columns.

  6. Mahdi says:

    In your last example with secondary index on `minute` if you change the = to >= then the query will be rejected, whereas the text suggests it would still be valid.

    cqlsh:mykeyspace> SELECT * FROM numberOfRequests
    … WHERE cluster = ‘cluster1’
    … AND date = ‘2015-06-05’
    … AND datacenter IN (‘US_WEST_COAST’, ‘US_EAST_COAST’)
    … AND minute >= 0
    … ALLOW FILTERING;
    InvalidRequest: code=2200 [Invalid query] message=”PRIMARY KEY column “minute” cannot be restricted as preceding column “hour” is not restricted”

    1. Benjamin Lerer says:

      As mention at the beginning ‘secondary indices queries’:
      “Direct queries on secondary indices support only =, CONTAINS or CONTAINS KEY restrictions.”

      As only the ‘minute’ column is indexed, if you do not use an ‘=’ restriction on it, C* will not use the secondary index and your query will be rejected (as the ‘hour’ column is not restricted).

      The last section is discussing filtering on the remaining clustering columns (e.g. the ones for which the index is not used). In the case of my example ‘datacenter’.

      In theory, you could replace datacenter IN (‘US_WEST_COAST’, ‘US_EAST_COAST’) by datacenter >= ‘US’.

      In practice we had some regression since 2.2 that affected that type of queries (see: CASSANDRA-11510). By consequence, it might not be working for all versions yet.

  7. Mahdi says:

    Another question. As far as I understand, the token() function returns the hash value of the keys, so comparing `token(cluster, date) > token(‘cluster1’, ‘2015-06-03’)` does not necessarily compare the values of `cluster` and/or `date`. In other words the token() function does not preserve the ordering.

    1. Benjamin Lerer says:

      The token() function will preserve the ordering only if you use a ByteOrderedPartitioner.

  8. msingh says:


    SELECT * FROM numberOfRequests
    WHERE cluster = ‘cluster1’
    AND date = ‘2015-06-05’
    AND hour = 14
    AND minute = 0;
    Cassandra will reject the query as it has to scan the entire partition to find the requested data, which is inefficient.”
    This doesn’t tell the complete picture. A query with only few clustering columns will work but they must be specified in order. So following query will work –

    SELECT * FROM numberOfRequests
    WHERE cluster = ‘cluster1’
    AND date = ‘2015-06-05’ AND data_center=’US-WEST-COAST’;

    1. Benjamin Lerer says:

      You are right. Prior to 3.6, you can have unrestricted clustering columns but a clustering column cannot be restricted if one of the previous one was not.
      Since 3.6, it is allowed if filtering is used.

Comments

Your email address will not be published. Required fields are marked *




Subscribe for newsletter:

Tel. +1 (408) 933-3120 sales@datastax.com Offices France Germany

DataStax Enterprise is powered by the best distribution of Apache Cassandra™.

© 2017 DataStax, All Rights Reserved. DataStax, Titan, and TitanDB are registered trademark of DataStax, Inc. and its subsidiaries in the United States and/or other countries.
Apache Cassandra, Apache, Tomcat, Lucene, Solr, Hadoop, Spark, TinkerPop, and Cassandra are trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries.