TechnologyJune 5, 2013

Does CQL support dynamic columns / wide rows?

Jonathan Ellis
Jonathan EllisTechnology
Does CQL support dynamic columns / wide rows?

The transition to CQL has been tough for some people who were used to the existing Thrift-based data model. A common misunderstanding is that CQL does not support dynamic columns or wide rows. On the contrary, CQL was designed to support everything you can do with the Thrift model, but make it easier and more accessible.

A note on terminology

Part of the confusion comes from the Thrift api using the same terms as CQL/SQL to mean different things. To avoid ambiguity, I will only use these terms in the CQL sense here:

Thrift term CQL term
row partition
column cell
[cell name component or value] column
[group of cells with shared component prefixes] row

So when someone asks, does "does CQL support dynamic columns?" he is really asking, "does it support dynamic cell names." Let's look at how that works.

Wide rows

Suppose we're gathering data from sensors. We have sensors identified by a unique integer, and their observations are (timestamp, float) pairs. Thus, our cell names will be timestamps, and our cell values will be floats:

[cassandra-cli -- the old Thrift way]

create column family data 
with key_validation_class=Int32Type 
 and comparator=DateType 
 and default_validation_class=FloatType;

We'll insert some sample data:


set data[1]['2013-06-05 15:11:00'] = '3.1';
set data[1]['2013-06-05 15:11:10'] = '4.1';
set data[1]['2013-06-05 15:11:20'] = '5.1';

set data[2]['2013-06-05 15:11:00'] = '3.2';

set data[3]['2013-06-05 15:11:00'] = '3.3';
set data[3]['2013-06-05 15:11:10'] = '4.3';

And list it:


list data;
RowKey: 1
=> (cell=2013-06-05 15:11:00-0500, value=3.1, timestamp=1370463146717000)
=> (cell=2013-06-05 15:11:10-0500, value=4.3, timestamp=1370463282090000)
=> (cell=2013-06-05 15:11:20-0500, value=5.7, timestamp=1370463282093000)
RowKey: 2
=> (cell=2013-06-05 15:11:00-0500, value=3.2, timestamp=1370463332361000)
RowKey: 3
=> (cell=2013-06-05 15:11:00-0500, value=3.3, timestamp=1370463332365000)
=> (cell=2013-06-05 15:11:10-0500, value=4.3, timestamp=1370463332368000)

You can see how people would call this "wide rows [partitions]" -- in practice, a partition like this could easily grow to millions of cells, or why in 2011 we announced that Cassandra 0.7 could handle two billion columns [cells] per row [partition]. This actually translates straightforwardly to CQL. Here's what we get when we SELECT * FROM data in cqlsh:


 key | column1                  | value
   1 | 2013-06-05 15:11:00-0500 |   3.1
   1 | 2013-06-05 15:11:10-0500 |   4.3
   1 | 2013-06-05 15:11:20-0500 |   5.7
   2 | 2013-06-05 15:11:00-0500 |   3.2
   3 | 2013-06-05 15:11:00-0500 |   3.3
   3 | 2013-06-05 15:11:10-0500 |   4.3

You can instantly see how this matches up to the Thrift data, only now it's organized neatly into rows and columns. The only thing missing is the column names, so Cassandra used placeholders. We can add that information easily:


ALTER TABLE data RENAME key TO sensor_id;
ALTER TABLE data RENAME column1 TO collected_at;
ALTER TABLE data RENAME value TO volts;

 sensor_id | collected_at             | volts
         1 | 2013-06-05 15:11:00-0500 |   3.1
         1 | 2013-06-05 15:11:10-0500 |   4.3
         1 | 2013-06-05 15:11:20-0500 |   5.7
         2 | 2013-06-05 15:11:00-0500 |   3.2
         3 | 2013-06-05 15:11:00-0500 |   3.3
         3 | 2013-06-05 15:11:10-0500 |   4.3

We can even get the table definition as if it had been created initially in CQL:



  sensor_id int,
  collected_at timestamp,
  volts float,
  PRIMARY KEY (sensor_id, collected_at)

Thus, the way to model dynamic cells in CQL is with a compound primary key. For the gory details on things like CompositeType, see my previous post.

What about the special case where one cell = one column?

Sometimes it could be useful to use dynamic cell names even when they don't represent a series of similar values. For instance, consider users with multiple phone numbers. Cassandra doesn't support joins by design, so one way to represent this might be with cells named "home", "work", and so forth. In Thrift I could do that by defining the static cells up front, and then letting the user inject dynamic cells as needed:


create column family users
with key_validation_class = 'UTF8Type'
 and comparator = 'UTF8Type'
 and column_metadata = [
       {column_name : 'name', validation_class : UTF8Type},
       {column_name : 'birth_year', validation_class : Int32Type}];

set users['jbellis']['name'] = 'Jonathan Ellis';
set users['jbellis']['birth_year'] = 1976;
set users['jbellis']['home'] = long(1112223333);
set users['jbellis']['work'] = long(2223334444);

To a degree, this works. But look at what happens when I fetch back this user:


get users['jbellis'];
=> (cell=birth_year, value=1976, timestamp=1370470203739000)
=> (cell=home, value=00000000424b2e65, timestamp=1370470203743000)
=> (cell=name, value=Jonathan Ellis, timestamp=1370470203736000)
=> (cell=work, value=000000008485642c, timestamp=1370470203746000)

Cassandra doesn't know what data type we're using since it's not part of the schema. We're off in no-man's land. CQL's typed collections offer a more robust approach:


  user_id text PRIMARY KEY,
  name text,
  birth_year int,
  phone_numbers map<text, bigint>

INSERT INTO users (user_id, name, birth_year, phone_numbers)
VALUES ('jbellis', 'Jonathan Ellis', 1976, {'home': 1112223333, 'work': 2223334444});

Now when we examine our results we get


 user_id | birth_year | name           | phone_numbers
 jbellis |       1976 | Jonathan Ellis | {home: 1112223333, work: 2223334444}


A method to the madness

CQL solves specific problems with the Thrift API, while still exposing the power of Cassandra's underlying engine. If you hear someone explaining that CQL doesn't let you do X that you could do in Thrift, he's probably misinformed.


One-stop Data API for Production GenAI

Astra DB gives JavaScript developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.