We Shall Have Order!

By Aaron Ploetz -  March 2, 2015 | 6 Comments

Aaron Ploetz, Lead Database Engineer at AccuLynx.com
Aaron is the Lead Database Engineer for AccuLynx, where he implemented Cassandra to scale part of their cloud-based business management system. He is active in the Cassandra tags on StackOverflow, and recently became the first recipient of a tag-specific badge in CQL. Aaron holds a B.S. in Management/Computer Systems from the University of Wisconsin-Whitewater, a M.S. in Software Engineering and Database Technologies from Regis University, and was selected as a DataStax MVP for Apache Cassandra in 2014.


Introduction

One of the more misunderstood aspects of Cassandra data modeling is that of how to properly order your data.  In this article we will explore a typical modeling journey from “first attempt” to “end solution.”  We will discuss the limitations of the ORDER BY clause, as well as explain how Cassandra clustering order works – and how to take advantage of that order when building a data model.

A common problem that new Cassandra users will attempt to solve is that of modeling time series data.  One of the more typical early mistakes made in time series data modeling is designing a table that is dependent on a time as its primary key, like this:

This is commonly accompanied by a question such as:

“How can I sort my result set by posttime?”

Of course – for this problem – designating posttime as a single primary key is not going to have the desired effect.  Cassandra will place (DataStax 2015a) “data on each node according to the value of the partition key and the range that the node is responsible for.”  It is important to understand that the order in which partitioned rows are returned, depends on the order of the hashed token values and not on the key values themselves.  So if the goal is to order the queried results by posttime then the primary key definition is going to have to be different.

Some new users make the mistake of using the Byte Ordered Partitioner (BOP) to sort data by partition key.  This is discouraged, given its propensity for hot spots and load balancing difficulties.  Some have even gone so far as to identify its use as an anti-pattern.  It is this author’s opinion that no new clusters should be built with the BOP.  Often, a use case which may appear to be a “good fit” for the BOP can be solved with proper data modeling.

The correct way to model this is to first find a different column by which to partition the data.  Proper partitioning and clustering keys will allow the data to be sorted and queried in the desired order.  The column to designate as the partition key largely depends on the requirements of the application and the particular query you are trying to solve.  Be mindful of the cardinality of your potential partition key.  If it is too low you will get “hot spots” (poor data distribution), and if it is too high you will negate the benefits of the “wide row” data model (too little data to make ordering worthwhile).

Another incorrect path which new users can find themselves going down is to designating a “dummy” partition key.  In this solution, the “dummy” key always has the same value (lowest possible cardinality), and its use can fool the user into thinking that the problem is solved.  The two main issues with this are that of hot spots and (Strickland 2014) unbounded row growth.  As the (constant) partition key value will be hashed the same on each insert, all of the data will be written to the same partition, thus creating hot spots of data.  Additionally, Cassandra can only support (McFadin 2014) two billion columns per partition.  With all data being written to that one partition, the chances of reaching that limit will increase over time.  As these two points are significant potential issues, I cannot stress enough that using a “dummy” partition key is a terrible idea.

For our example, we will decide to partition our data by userid with a new table specifically designed to serve the query, like this:

This particular table uses a compound primary key.  The first column listed in the primary key definition is the partition key (userid) and the second column is a clustering column (posttime).  After the partition key(s), data for each partition (DataStax 2015b) is “clustered by the remaining column or columns of the primary key definition.”  Essentially, the clustering columns determine the on-disk sort order within each partition.  Note that this table definition also uses the WITH clause, which indicates that the data should be clustered in descending order by posttime.

However, this can lead to another common incorrect assumption.  Consider the following query and question:

“I am trying to query the 100 most-recent posts by user.  Why is my data in random order even though CLUSTERING ORDER BY is specified in postsbyuser?”

In the example table above, queries will return data that is sorted according to the clustering key(s) only when a partition key is also specified.  Without a partition key specified in the WHERE clause, the actual order of the result set then becomes dependent on the hashed values of userid.  This is apparent whenpostsbyuser is queried without specifying a WHERE clause and using the token() function on userid:

In any case, to serve the above query we need qualify it with the partition key (userid).  This query will return the 100 most-recent posts where userid is equal to 2:

While we are on the right track, one last issue that may still be encountered with this solution is (again) unbounded row growth.  Depending on the frequency of user postings, this solution may be at risk for hitting the limit of two billion columns per partition (probably not, but we will assume so for the purposes of this example).  A possible solution for this is to add an additional partition key.  Let’s say that we know that the postsbyuser data will not exceed the limit of two billion columns per partition in a single calendar year. Then, we could solve this issue by using year as an additional partition key:

Now we can query the top most-recent posts by userid for a specific year, in descending order byposttime:

Note that the ORDER BY clause does not need to be specified in the SELECT, as the data will already be sorted based on the CLUSTERING ORDER specified in the table definition.  The ORDER BY clause could be used if we wanted to alter the sort direction (“ASCending” vs. “DESCending”) of the result set.

In summary, when it comes to ordering your result sets in Cassandra, remember these points:

  • The clustering column(s) determine the data’s on-disk sort order only within a partition key.

  • Do not model your table with a “dummy” partition key (key that always has the same value).

  • It is not necessary to specify the ORDER BY clause on a query if your desired sort direction (“ASCending/DESCending”) already matches the CLUSTERING ORDER in the table definition.

  • On a CREATE TABLE command, Cassandra will specify CLUSTERING ORDER for you in the table definition if you did not specify it yourself.  The default CLUSTERING ORDER on a table is comprised of your clustering keys in the “ASCending” sort direction.

  • Be mindful of the limit of two billion columns per wide row (partition), as well as the problems that accompany unbounded row growth.

  • For tables supporting queries for recent, time-based data, you may want to specify a “DESCending” sort direction in your table definition.

  • Do not create a new cluster using the Byte Ordered Partitioner.  If still you think that you really need to do this, contact someone at DataStax support first.

Happy modeling!

References


This blog post originally appeared in Planet Cassandra.



Comments

  1. Tamar says:

    It’s worth mentioning that having the year in the partition key means you can no longer select all the posts from a user without restricting the year as well. i.e. the query `SELECT * FROM postsbyuser WHERE userid=2` will fail.

  2. Yuval Greenfield says:

    I couldn’t figure out from this post what to do if I actually want to query for the most recent posts – not by a specific user. Would using just the year as a partition key be a good idea if I have 100K posts a year? For live time series data would that make the cluster bad at writes as only one partition would be writing during any given year? The problem does not go away if we use smaller time windows. Is Cassandra just not designed for a distributed sorted column? Is there another solution? Here’s a stack overflow question struggling with this as well: http://stackoverflow.com/questions/32014367/cassandra-list-10-most-recently-modified-records

    1. Aaron Ploetz says:

      Yuval,

      You are correct that for your particular issue, the table structure as-given will not work. If you really do have 100K posts in a year or you need to be able to query the most-recent posts, you will want to revisit the PRIMARY KEY definition. There really isn’t a good way to support a blanket query of “give me the most recent posts for all users,” while defining “most recent” at query time. Cassandra cannot support sorted partition keys (or at least not sorted in a way that helps most business requirements). The relational mindset of “SELECT * FROM posts LIMIT 100;” becomes a multi-key (and thus, multi-node) query in Cassandra, which is something you should model to avoid at all costs.

      But you can make some intelligent decisions about what, exactly, “most recent” means in terms of your requirements. And using a smaller time window will help you from both the query and distribution perspectives. Maybe “most recent means” for the current day? Or maybe it means for the current hour? Depending on the sheer volume of posts and how recent you need to see them, your time bucket may need to be more granular than year. And if you need to query for multiple users, then you’ll also need to move userid to be a clustering key (instead of a partition key).

      Bottom line, postsbyuseryear is an example designed to illustrate the concept of query-based modeling. It’s not a one-size-fits-all solution by any means.

  3. Suraj says:

    I am not able to sort my quires. The table structure is like :
    CREATE TABLE “Testing”.valuesout (
    bucket timestamp,
    split_factor varint,
    uniqueid varint,
    timestampA timestamp,
    timestampB timestamp,
    value double,
    PRIMARY KEY ((bucket, split_factor), uniqueid, timestampA, timestampB)
    ) WITH COMPACT STORAGE
    AND CLUSTERING ORDER BY (uniqueid ASC, timestampA ASC, timestampB)

    My query is – select * from valuesout where uniqueid = 2 order by uniqueid desc allow filtering;

    Where am i going wrong?

    1. Aaron Ploetz says:

      Remember, you can only enforce a sort order *within* a partition. Your partition keys are bucket and split_factor, so only specifying uniqueid in your WHERE (with ALLOW FILTERING) pulls your rows back by order of their hashed token values. To see what I mean, change your query to:

      select token(bucket,split_factor),bucket,split_factor,uniqueid WHERE uniqueid = 2 ALLOW FILTERING;

      Pro tip: For best results, rework your table so that you don’t need to specify ALLOW FILTERING. Nothing good ever comes from ALLOW FILTERING.

  4. Vihag Gupta says:

    This was really helpful!

Comments

Your email address will not be published. Required fields are marked *




Subscribe for newsletter:

© 2017 DataStax, All rights reserved. Tel. +1 (408) 933-3120 sales@datastax.com Offices

DataStax is a registered trademark of DataStax, Inc. and its subsidiaries in the United States and/or other countries.
Apache Cassandra, Apache, Tomcat, Lucene, Solr, Hadoop, Spark, TinkerPop, and Cassandra are trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries.