ALLOW FILTERING explained

By Benjamin Lerer -  December 15, 2014 | 20 Comments

While discussing with people at the London C* Summit, I realized that it was not always clear for them why Cassandra requires ALLOW FILTERING for some CQL queries and not for some others.

Why ALLOW FILTERING?

Let’s take for example the following table:

CREATE TABLE blogs (blogId int, 
                    time1 int, 
                    time2 int, 
                    author text, 
                    content text, 
                    PRIMARY KEY(blogId, time1, time2));

If you execute the following query:

SELECT * FROM blogs;

Cassandra will return you all the data that the table blogs contains.

If you now want only the data at a specified time1, you will naturally add an equal condition on the column time1:

SELECT * FROM blogs WHERE time1 = 1418306451235;

In response, you will receive the following error message:

Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING.

Cassandra knows that it might not be able to execute the query in an efficient way. It is therefore warning you: “Be careful. Executing this query as such might not be a good idea as it can use a lot of your computing resources”.

The only way Cassandra can execute this query is by retrieving all the rows from the table blogs and then by filtering out the ones which do not have the requested value for the time1 column.

If your table contains for example a 1 million rows and 95% of them have the requested value for the time1 column, the query will still be relatively efficient and you should use ALLOW FILTERING.

On the other hand, if your table contains 1 million rows and only 2 rows contain the requested value for the time1 column, your query is extremely inefficient. Cassandra will load 999, 998 rows for nothing. If the query is often used, it is probably better to add an index on the time1 column.

Unfortunately, Cassandra has no way to differentiate between the 2 cases above as they are depending on the data distribution of the table.  Cassandra is therefore warning you and relying on you to make the good choice.

Secondary indices and ALLOW FILTERING

If we add an index on the author column and execute the following query:

SELECT * FROM blogs WHERE author = ‘Jonathan Ellis’;

Cassandra will return all the blogs that have been written by Jonathan and will not request ALLOW FILTERING. This is due to the fact that Cassandra can use the secondary index on the author column to find the matching rows and does not need to perform any filtering.

But if we execute the following one:

SELECT * FROM blogs WHERE author=’Jonathan Ellis’ and time2 = 1418306451235;

Cassandra will request ALLOW FILTERING as it will have to first find and load the rows containing Jonathan as author, and then to filter out the ones which do not have a time2 column equal to the specified value.

Adding an index on time2 might improve the query performance. Cassandra will then use the index with the highest selectivity to find the rows that need to be loaded. It will however not change anything regarding the need for ALLOW FILTERING, as it will still have to filter the loaded rows using the remaining predicate.

Making the right choice

When your query is rejected by Cassandra because it needs filtering, you should resist the urge to just add ALLOW FILTERING to it. You should think about your data, your model and what you are trying to do.

You always have multiple options. You can change your data model, add an index, use another table or use ALLOW FILTERING.

You have to make the right choice for your specific use case.









DataStax has many ways for you to advance in your career and knowledge.

You can take free classes, get certified, or read one of our many white papers.



register for classes

get certified

DBA's Guide to NoSQL







Comments

  1. Thanks for the knowledge sharing…

  2. Kaushik Periwal says:

    Wow, bang on the point. I read at so many places, but this one is by far precise and the best explanation.

    Cheers
    KP

  3. Anand Sampat says:

    wow. best explanation i’ve seen anywhere. thanks!

  4. Nagendra says:

    Thanks for the example, It is very clearly explained.

  5. varma dantuluri says:

    Thanks for the explanation

  6. shahnavaz saiyed says:

    Thanks for the good explanation.

  7. Jomon says:

    Thanks for the clear explanation. I still have a small confusion though. Since the time2 is part of the clustering key and is already primary indexed, why would a secondary index(or filtering) needed on the same column to perform the query?

    1. Benjamin Lerer says:

      I guess that you are talking about the
      SELECT * FROM blogs WHERE author=’Jonathan Ellis’ and time2 = 1418306451235;
      query when the author and time2 columns are both indexed.

      As time1 is not restricted, Cassandra cannot use the clustering columns to find efficiently the requested rows. By consequence, it will use the existing secondary indices. But Cassandra can only use one of the two indices. If Cassandra choose to use the time2 index (because it returns less results) it will then need to filter the results using the author restriction.

  8. Marcel Villet says:

    A query such as the following also requires ALLOW FILTERING i.e. where the partition key is contained in the WHERE as well as an additional field in the WHERE not contained in the Primary Key:

    SELECT * FROM blogs WHERE blogId = 1 AND time1 = 1418306451235;

    Why is this?

  9. Benjamin Lerer says:

    Cassandra will be able to locate the partition data using the partition key but it will then have to check all the rows of the partition to see for which rows the value of the time1 column is equals to 1418306451235 (=> filtering).

    Relational databases are doing exactly the same. They just do not warn you that they are doing it.

  10. Nikhil says:

    i got it but how to make ALLOW FILTERING enable in cassandra?

    1. Benjamin Lerer says:

      You simply add ALLOW FILTERING at the end of your query.

  11. Bhaskar Pandey says:

    good explanation

  12. Huan says:

    Thanks! A lot people just say DONT USE ALLOW FILTERING, which doesn’t make sense to me as C* support this statement, it comes with purpose. Now I have a much better about ALLOW FILTERING and won’t be afraid of using it if necessary!

  13. Yuval Greenfield says:

    I don’t understand why `SELECT * FROM blogs WHERE time1 = 1418306451235;` would cause “Cassandra will load 999,998 rows for nothing” when `time1` is a clustering key.

    Wouldn’t Cassandra just need to access every partition to O(log-n) search for time1? Effectively that would make the scan O(log(1,000,000)) time if all the nodes participate. `ALLOW FILTERING` would still be appropriate because it’s a cluster-wide query, but it would still be as fast as a partition-key constrained similar query.

  14. Benjamin says:

    This comment is effectively not accurate.

    The reality is in fact slightly more complex.
    If you have N partitions, Cassandra will need to go through each of your partitions so O(N). It will then need for each partition to search for the rows matching the specified clustering slice so O(log n) for the data within the memtable and O(n) for the data within the SSTables.

    So a single partition query will have an O(n) cost whereas that specific query will have an O(Nn) cost.

  15. Francis says:

    If you are querying with partition key AND non-primary columns and adding ALLOW FILTERING, will Cassandra perform the search on partition key (fast) and load all records only on that partition and then search for the remaining filter?

    1. Benjamin Lerer says:

      Yes, if you have a restriction on the partition key only the data from that partition will be loaded. Nevertheless, that does not mean that it will be fast. It will depends of how much data your partition contains.

  16. kavya says:

    can anyone explain about allow filtering in architectural manner?

    1. Benjamin Lerer says:

      I am not sure of what explanation you are looking for. ALLOW FILTERING is just a query flag that tells Cassandra if it is allowed to use filtering or not.

Comments

Your email address will not be published. Required fields are marked *




Subscribe for newsletter:

Tel. +1 (408) 933-3120 sales@datastax.com Offices France GermanyJapan

DataStax Enterprise is powered by the best distribution of Apache Cassandra™.

© 2017 DataStax, All Rights Reserved. DataStax, Titan, and TitanDB are registered trademark of DataStax, Inc. and its subsidiaries in the United States and/or other countries.
Apache Cassandra, Apache, Tomcat, Lucene, Solr, Hadoop, Spark, TinkerPop, and Cassandra are trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries.