Patrick McFadin

<p><strong><em>For more recent content on Data Modeling, check out&nbsp;</em><i data-stringify-type="italic">the&nbsp;sample data models in the&nbsp;<a href="https://www.datastax.com/learn/data-modeling-by-example">Data Modeling By Example</a>&nbsp;learning series, as well as&nbsp;</i><em><a href="https://www.datastax.com/blog/2019/10/data-modeling-apache-cassandra">Why Data Modeling Is Critical</a>.</em></strong></p>

<p>&nbsp;</p>

<hr />
<p>&nbsp;</p>

<p>The title for this article could really stand alone, but I’m not going to just leave it at that! Yes, this is a fundamental rule for Apache Cassandra, but I’m going to take some time to explain why that statement is correct. With relational data modeling, you can start with the primary key, but effective data models in an RDBMS are much more about the foreign key relationships and relational constraints between tables. Since using a JOIN isn’t possible with Cassandra, we have much less complexity in creating data models. The complexity trade-off for Apache Cassandra is in knowing about your queries and data access patterns ahead of time. I won’t be going into how to accomplish that here. There is an excellent course available on <a href="http://academy.datastax.com/">DataStax Academy</a> for that topic. This article will focus on how and why to best pick a primary key.</p>

<p><b>The basic primary key</b></p>

<p>Let’s start with the most basic kind of table. It can be called a “static table” or “entity table,” and each row stores a single record of data. Here is an example from the <a href="http://www.killrvideo.com/">KillrVideo example application</a>:</p>

<div class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-mac print-yes notranslate" data-settings=" minimize scroll-mouseover" id="crayon-580e720254886742361883">
<div class="crayon-plain-wrap"><code>CREATE TABLE videos ( videoid uuid, userid uuid, name varchar, description varchar, location text, location_type int, preview_thumbnails map&lt;text,text&gt;, tags set&lt;varchar&gt;, added_date timestamp, PRIMARY KEY (videoid) ); </code></div>
</div>

<p>&nbsp;</p>

<p>The <strong>PRIMARY KEY</strong> designation here is the simplest form: a single column that identifies a single video uploaded to our system. The first element in our <strong>PRIMARY KEY</strong> is what we call a partition key. The partition key has a special use in Apache Cassandra beyond showing the uniqueness of the record in the database. Its other purpose, and one that is very critical in distributed systems, is determining data locality.</p>

<p>When data is inserted into the cluster, the first step is to apply a hash function to the partition key. The output is used to determine which node (and replicas) will get the data. By default, Apache Cassandra uses the Murmur3 hash, which takes an arbitrary input and always produces the same token value for it. That token value will fall inside the range of tokens owned by a single node. In simpler terms, a partition key will always belong to one node, and that partition’s data will always be found on that node.</p>

<p>Why is that important? If there wasn’t an absolute location for a partition’s data, finding it would require searching every node in the cluster. In a small cluster, this may complete quickly, but in a much larger cluster it would be painfully slow. We want what is shown below.</p>

<p><img alt="image03" class="alignnone wp-image-12942" height="279" src="https://www.datastax.com/sites/default/files/content/blog/20160222_devblog_cbm_tpk_bp_03.png" width="629" /></p>

<p><b>Complex primary key</b></p>

<p>The other type of table in Apache Cassandra is what we will call a “dynamic table.” Let’s first look at another example from KillrVideo:</p>

<div class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-mac print-yes notranslate" data-settings=" minimize scroll-mouseover" id="crayon-580e72025489d267860915">
<div class="crayon-plain-wrap"><code>CREATE TABLE user_videos ( userid uuid, added_date timestamp, videoid uuid, name text, preview_image_location text, PRIMARY KEY (userid, added_date, videoid) ); </code></div>
</div>

<p>&nbsp;</p>

<div class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-mac print-yes notranslate" data-settings=" minimize scroll-mouseover" id="crayon-580e720254895850088214">
<div class="crayon-plain-wrap">In this table, we are satisfying the query “show me all videos associated with a specific user.” As you can see, the <strong>PRIMARY KEY</strong> now has more than just the partition key: we have added more columns. All columns listed after the partition key are called clustering columns. This is where we take a huge break from relational databases. Where the partition key is important for data locality, the clustering columns specify the order in which the data is arranged inside the partition. The way we read the <strong>PRIMARY KEY</strong> is left to right:</div>
</div>

<ul>
	<li>Item one is the partition key</li>
	<li>Item two is the first clustering column. <em>added_date</em> is a timestamp, so the sort order is chronological, ascending.</li>
	<li>Item three is the second clustering column. Since <em>videoid</em> is a UUID, we include it simply to make each record unique.</li>
</ul>
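<p>As a rough mental model of the list above, and not a description of how Cassandra is actually implemented, the Python sketch below keeps one sorted list of rows per partition key. The names <em>partitions</em>, <em>insert_row</em> and <em>select_partition</em> are made up for this illustration.</p>

```python
from bisect import insort
from collections import defaultdict
from datetime import datetime
from uuid import UUID

# Toy model only: partition key (userid) -> rows kept sorted by the
# clustering columns (added_date, then videoid).
partitions = defaultdict(list)

def insert_row(userid, added_date, videoid, name):
    """Store a row in its partition, ordered by (added_date, videoid)."""
    insort(partitions[userid], (added_date, videoid, name))

def select_partition(userid):
    """Return the rows of one partition in their stored order."""
    return list(partitions[userid])

user = UUID("522b1fe2-2e36-4cef-a667-cd4237d08b89")
when = datetime(2016, 2, 22, 12, 0)

# Rows arrive out of order; two of them share the same timestamp.
insert_row(user, datetime(2016, 3, 1), UUID(int=3), "newest")
insert_row(user, when, UUID(int=2), "same time, second id")
insert_row(user, when, UUID(int=1), "same time, first id")

for added_date, videoid, name in select_partition(user):
    print(added_date.isoformat(), videoid, name)
```

<p>The two rows with the same <em>added_date</em> both survive because <em>videoid</em> is part of the key, which is the reason it appears as the second clustering column.</p>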

<p>After inserting data, you should expect your <strong>SELECT</strong> to return the rows of a single partition in ascending order of&nbsp;<em>added_date</em>.</p>

<p><img alt="image00" class="alignnone wp-image-12931" height="330" src="https://www.datastax.com/sites/default/files/content/blog/20160222_devblog_cbm_tpk_bp_00.png" width="550" /></p>

<p><b>Controlling order of the clustering columns</b></p>

<p>Since the clustering columns specify the order within a single partition, it would be helpful to control the direction of the sort. We could accomplish this at run time by adding an <strong>ORDER BY</strong> clause to our&nbsp;<strong>SELECT</strong> like this:</p>

<div class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-mac print-yes notranslate" data-settings=" minimize scroll-mouseover" id="crayon-580e72025489d267860915">
<div class="crayon-plain-wrap"><code>SELECT * FROM user_videos WHERE userid = 522b1fe2-2e36-4cef-a667-cd4237d08b89 ORDER BY added_date DESC; </code></div>
</div>

<p>&nbsp;</p>

<p>What if we want to control the sort order as a default of the data model? We can specify that at table creation time using the <strong>CLUSTERING ORDER BY</strong> clause:</p>

<div class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-mac print-yes notranslate" data-settings=" minimize scroll-mouseover" id="crayon-580e7202548a3625378932">
<div class="crayon-plain-wrap"><code>CREATE TABLE user_videos ( userid uuid, added_date timestamp, videoid uuid, name text, preview_image_location text, PRIMARY KEY (userid, added_date, videoid) ) WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC); </code></div>
</div>

<p>&nbsp;</p>
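<p>A rough way to picture the table definition above, again as a toy Python model rather than Cassandra internals: if a partition is stored newest-first, reading only the front of it yields the most recent rows. The helper names <em>stored_order</em> and <em>first_rows</em> and the sample rows are invented for this sketch.</p>

```python
from datetime import datetime
from uuid import UUID

def stored_order(rows):
    """Order rows like CLUSTERING ORDER BY (added_date DESC, videoid ASC).

    Each row is a hypothetical (added_date, videoid, name) tuple.
    """
    # Python's sort is stable, so sort by the secondary column first,
    # then by the primary column in reverse.
    by_videoid = sorted(rows, key=lambda r: r[1])
    return sorted(by_videoid, key=lambda r: r[0], reverse=True)

def first_rows(partition, limit):
    """Mimic LIMIT: take rows from the front of the stored order."""
    return partition[:limit]

rows = [
    (datetime(2016, 1, 5), UUID(int=1), "first upload"),
    (datetime(2016, 2, 1), UUID(int=3), "third upload"),
    (datetime(2016, 1, 20), UUID(int=2), "second upload"),
]

partition = stored_order(rows)
latest_two = [name for _, _, name in first_rows(partition, 2)]
print(latest_two)  # ['third upload', 'second upload']
```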

<p>Now when we insert data into <em>user_videos</em>, the data will be pre-sorted by <em>added_date</em> in descending order.</p>

<p><img alt="image02" class="alignnone wp-image-12938" height="332" src="https://www.datastax.com/sites/default/files/content/blog/20160222_devblog_cbm_tpk_bp_02.png" width="564" /></p>

<p>This may seem like a premature optimization, but the use cases this addition enables are very compelling. When <b>CLUSTERING ORDER BY</b> is used in time series data models, we can quickly access the last N items inserted. As an example:</p>

<div class="crayon-syntax crayon-theme-github crayon-font-monaco crayon-os-mac print-yes notranslate" data-settings=" minimize scroll-mouseover" id="crayon-580e7202548aa359098033">
<div class="crayon-plain-wrap"><code>SELECT * FROM user_videos WHERE userid = 522b1fe2-2e36-4cef-a667-cd4237d08b89 LIMIT 10; </code></div>
</div>

<p>&nbsp;</p>

<p>What this query is asking for is “the last 10 videos the user uploaded.” It is a fast, useful and efficient query enabled by the simple addition of the <b>CLUSTERING ORDER BY</b> clause. You can see how this might be very useful in user interaction or fraud detection use cases.</p>

<p><b>Conclusion</b></p>

<p>In this quick overview of the <b>PRIMARY KEY</b> relationships, I hope you see how important this is not only to your queries but also to how you store your data. A basic understanding of the components can help you make informed choices in your next data model. For example, now that you understand that the partition key controls data locality, it’s probably best not to use just one partition key value for all of your data! That is an extreme example, but a mistake you could easily make without knowing the reasoning behind it. Now that you know the most important thing to know in Cassandra data modeling, what are you going to do with it? Build something awesome!</p>
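<p>To make the data locality point concrete, here is a small Python sketch of a token ring. It rests on several simplifying assumptions: the four node names are hypothetical, each node owns one contiguous token range, replicas are ignored, tokens are unsigned 64-bit values, and SHA-256 stands in for Murmur3 because Murmur3 is not in the Python standard library.</p>

```python
import hashlib
from bisect import bisect_left
from collections import Counter

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
RING_SIZE = 2 ** 64

# Each node owns the token range that ends at its own token.
RING = sorted(
    ((i + 1) * RING_SIZE // len(NODES) - 1, node)
    for i, node in enumerate(NODES)
)
TOKENS = [token for token, _ in RING]

def token_for(partition_key: str) -> int:
    """Hash a partition key to a token (SHA-256 as a stand-in for Murmur3)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def node_for(partition_key: str) -> str:
    """Find the node whose token range contains the key's token."""
    index = bisect_left(TOKENS, token_for(partition_key))
    return RING[index][1]

# Many distinct partition keys spread across the cluster...
many_keys = Counter(node_for(f"user-{i}") for i in range(1000))
# ...while a single partition key always lands on the same node.
one_key = Counter(node_for("everything") for _ in range(1000))

print("1000 distinct keys:", dict(many_keys))
print("1 key, 1000 writes:", dict(one_key))
```

<p>With a single partition key value, every write in this toy model goes to one node, which is the extreme case the conclusion warns about.</p>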

<hr />
<p>This blog post originally appeared on Planet Cassandra.</p>


The most important thing to know in Cassandra data modeling: The primary key

Patrick McFadin, Developer Relations
