DataStax Developer Blog

Does CQL support dynamic columns / wide rows?

By Jonathan Ellis -  June 5, 2013 | 23 Comments

The transition to CQL has been tough for some people who were used to the existing Thrift-based data model. A common misunderstanding is that CQL does not support dynamic columns or wide rows. On the contrary, CQL was designed to support everything you can do with the Thrift model, but make it easier and more accessible.

A note on terminology

Part of the confusion comes from the Thrift api using the same terms as CQL/SQL to mean different things. To avoid ambiguity, I will only use these terms in the CQL sense here:

Thrift term CQL term
row partition
column cell
[cell name component or value] column
[group of cells with shared component prefixes] row

So when someone asks, does “does CQL support dynamic columns?” he is really asking, “does it support dynamic cell names.” Let’s look at how that works.

Wide rows

Suppose we’re gathering data from sensors. We have sensors identified by a unique integer, and their observations are (timestamp, float) pairs. Thus, our cell names will be timestamps, and our cell values will be floats:


[cassandra-cli -- the old Thrift way]

create column family data 
with key_validation_class=Int32Type 
 and comparator=DateType 
 and default_validation_class=FloatType;

We’ll insert some sample data:


[cassandra-cli]

set data[1]['2013-06-05 15:11:00'] = '3.1';
set data[1]['2013-06-05 15:11:10'] = '4.1';
set data[1]['2013-06-05 15:11:20'] = '5.1';

set data[2]['2013-06-05 15:11:00'] = '3.2';

set data[3]['2013-06-05 15:11:00'] = '3.3';
set data[3]['2013-06-05 15:11:10'] = '4.3';

And list it:


[cassandra-cli]

list data;
RowKey: 1
=> (cell=2013-06-05 15:11:00-0500, value=3.1, timestamp=1370463146717000)
=> (cell=2013-06-05 15:11:10-0500, value=4.3, timestamp=1370463282090000)
=> (cell=2013-06-05 15:11:20-0500, value=5.7, timestamp=1370463282093000)
-------------------
RowKey: 2
=> (cell=2013-06-05 15:11:00-0500, value=3.2, timestamp=1370463332361000)
-------------------
RowKey: 3
=> (cell=2013-06-05 15:11:00-0500, value=3.3, timestamp=1370463332365000)
=> (cell=2013-06-05 15:11:10-0500, value=4.3, timestamp=1370463332368000)

You can see how people would call this “wide rows [partitions]” — in practice, a partition like this could easily grow to millions of cells, or why in 2011 we announced that Cassandra 0.7 could handle two billion columns [cells] per row [partition].

This actually translates straightforwardly to CQL. Here’s what we get when we SELECT * FROM data in cqlsh:


[cqlsh]

 key | column1                  | value
-----+--------------------------+-------
   1 | 2013-06-05 15:11:00-0500 |   3.1
   1 | 2013-06-05 15:11:10-0500 |   4.3
   1 | 2013-06-05 15:11:20-0500 |   5.7
   2 | 2013-06-05 15:11:00-0500 |   3.2
   3 | 2013-06-05 15:11:00-0500 |   3.3
   3 | 2013-06-05 15:11:10-0500 |   4.3

You can instantly see how this matches up to the Thrift data, only now it’s organized neatly into rows and columns. The only thing missing is the column names, so Cassandra used placeholders. We can add that information easily:


[cqlsh]

ALTER TABLE data RENAME key TO sensor_id;
ALTER TABLE data RENAME column1 TO collected_at;
ALTER TABLE data RENAME value TO volts;
SELECT * FROM data;

 sensor_id | collected_at             | volts
-----------+--------------------------+-------
         1 | 2013-06-05 15:11:00-0500 |   3.1
         1 | 2013-06-05 15:11:10-0500 |   4.3
         1 | 2013-06-05 15:11:20-0500 |   5.7
         2 | 2013-06-05 15:11:00-0500 |   3.2
         3 | 2013-06-05 15:11:00-0500 |   3.3
         3 | 2013-06-05 15:11:10-0500 |   4.3

We can even get the table definition as if it had been created initially in CQL:


[cqlsh]

DESCRIBE TABLE data;

CREATE TABLE data (
  sensor_id int,
  collected_at timestamp,
  volts float,
  PRIMARY KEY (sensor_id, collected_at)
) WITH COMPACT STORAGE;

Thus, the way to model dynamic cells in CQL is with a compound primary key. For the gory details on things like CompositeType, see my previous post.

What about the special case where one cell = one column?

Sometimes it could be useful to use dynamic cell names even when they don’t represent a series of similar values.

For instance, consider users with multiple phone numbers. Cassandra doesn’t support joins by design, so one way to represent this might be with cells named “home”, “work”, and so forth.

In Thrift I could do that by defining the static cells up front, and then letting the user inject dynamic cells as needed:


[cassandra-cli]

create column family users
with key_validation_class = 'UTF8Type'
 and comparator = 'UTF8Type'
 and column_metadata = [
       {column_name : 'name', validation_class : UTF8Type},
       {column_name : 'birth_year', validation_class : Int32Type}];

set users['jbellis']['name'] = 'Jonathan Ellis';
set users['jbellis']['birth_year'] = 1976;
set users['jbellis']['home'] = long(1112223333);
set users['jbellis']['work'] = long(2223334444);

To a degree, this works. But look at what happens when I fetch back this user:


[cassandra-cli]

get users['jbellis'];
=> (cell=birth_year, value=1976, timestamp=1370470203739000)
=> (cell=home, value=00000000424b2e65, timestamp=1370470203743000)
=> (cell=name, value=Jonathan Ellis, timestamp=1370470203736000)
=> (cell=work, value=000000008485642c, timestamp=1370470203746000)

Cassandra doesn’t know what data type we’re using since it’s not part of the schema. We’re off in no-man’s land.

CQL’s typed collections offer a more robust approach:


[cqlsh]

CREATE TABLE users (
  user_id text PRIMARY KEY,
  name text,
  birth_year int,
  phone_numbers map
);

INSERT INTO users (user_id, name, birth_year, phone_numbers)
VALUES ('jbellis', 'Jonathan Ellis', 1976, {'home': 1112223333, 'work': 2223334444});

Now when we examine our results we get


[cqlsh]

 user_id | birth_year | name           | phone_numbers
---------+------------+----------------+--------------------------------------
 jbellis |       1976 | Jonathan Ellis | {home: 1112223333, work: 2223334444}

A method to the madness

CQL solves specific problems with the Thrift API, while still exposing the power of Cassandra's underlying engine. If you hear someone explaining that CQL doesn't let you do X that you could do in Thrift, he's probably misinformed.



Comments

  1. Ansar says:

    CQL does support dynamic columns and wide rows but some how it makes data model fixed.

    Column names must be predefined which was not the case before, though CQL handle this in other way but have to Alter table, if I need to create new column.

  2. Jonathan Ellis says:

    If you are adding a new “static” column, then yes, CQL does force you to add it to the schema. (Not adding it to the schema was a kind of “worst practice” before, but allowed.)

    Requiring ALTER now allows us to make performance enhancements that were impossible before, such as https://issues.apache.org/jira/browse/CASSANDRA-4175.

  3. Shahab Yunus says:

    Thanks for the explanation as it was need give the bit confusing change from different CQL versions.

  4. Ansar says:

    Using thrift the statement was still true

    “In Cassandra, each row has it’s own schema”

    but not in CQL anymore as you must have to predefined your schema. To add new column, one must have to Alter table.

  5. Patrick McFadin says:

    Ansar,

    I understand what you are thinking about the how CQL enforces schema. My first impression was somewhat the same because I had gotten used to just adding a column name and column value of any arbitrary value. It took looking a little deeper into how Cassandra was storing the data to clarify how close it was to what I have always done.

    First, when modeling using Thrift, I would simply state “We’ll store the timestamp in the column name and the voltage in the column value” I never created any DDL to define that other than a comparator. It was up to the programmer to get it right when inserting new data. After the data was inserted you could list the data in cassandra-cli and see the column name was a timestamp and the column value was the voltage. Perfect. What I wanted.

    Let’s alter Jonathan’s example a bit. Use this CQL table create statement.

    CREATE TABLE data (
    sensor_id int,
    collected_at timestamp,
    volts float,
    PRIMARY KEY (sensor_id, collected_at)
    );

    I didn’t use the “COMPACT STORAGE” option so this is a regular CQL3 table. Now insert some data:
    INSERT INTO data(sensor_id,collected_at,volts)
    VALUES (1,’2013-06-05 15:11:00′,3.1);
    INSERT INTO data(sensor_id,collected_at,volts)
    VALUES (1,’2013-06-05 15:11:10′,4.1);
    INSERT INTO data(sensor_id,collected_at,volts)
    VALUES (1,’2013-06-05 15:11:20′,5.1);
    INSERT INTO data(sensor_id,collected_at,volts)
    VALUES (2,’2013-06-05 15:11:00′,3.2);
    INSERT INTO data(sensor_id,collected_at,volts)
    VALUES (3,’2013-06-05 15:11:00′,3.3);
    INSERT INTO data(sensor_id,collected_at,volts)
    VALUES (3,’2013-06-05 15:11:10′,4.3;

    When you look at it using a “select * from data;” you get this:
    sensor_id | collected_at | volts
    -----------+--------------------------+-------
    1 | 2013-06-05 15:11:00-0700 | 3.1
    1 | 2013-06-05 15:11:10-0700 | 4.1
    1 | 2013-06-05 15:11:20-0700 | 5.1
    2 | 2013-06-05 15:11:00-0700 | 3.2
    3 | 2013-06-05 15:11:00-0700 | 3.3
    3 | 2013-06-05 15:11:10-0700 | 4.3

    But take a look from the cli. This is better for seeing what’s really going on behind the scenes.


    RowKey: 1
    => (column=2013-06-05 15\:11\:00-0700:, value=, timestamp=1370716683851000)
    => (column=2013-06-05 15\:11\:00-0700:volts, value=40466666, timestamp=1370716683851000)
    => (column=2013-06-05 15\:11\:10-0700:, value=, timestamp=1370716683855000)
    => (column=2013-06-05 15\:11\:10-0700:volts, value=40833333, timestamp=1370716683855000)
    => (column=2013-06-05 15\:11\:20-0700:, value=, timestamp=1370716683862000)
    => (column=2013-06-05 15\:11\:20-0700:volts, value=40a33333, timestamp=1370716683862000)
    -------------------
    RowKey: 2
    => (column=2013-06-05 15\:11\:00-0700:, value=, timestamp=1370716683869000)
    => (column=2013-06-05 15\:11\:00-0700:volts, value=404ccccd, timestamp=1370716683869000)
    -------------------
    RowKey: 3
    => (column=2013-06-05 15\:11\:00-0700:, value=, timestamp=1370716683877000)
    => (column=2013-06-05 15\:11\:00-0700:volts, value=40533333, timestamp=1370716683877000)
    => (column=2013-06-05 15\:11\:10-0700:, value=, timestamp=1370716690556000)
    => (column=2013-06-05 15\:11\:10-0700:volts, value=4089999a, timestamp=1370716690556000)

    Notice that the “collected_at” is missing? Every column name IS the timestamp + the name of the value. The only difference between that and my Thrift model is the additional row marker (the timestamp and :) and the word volts after the timestamp. Other than that I have a wide row where timestamp is the column name and the column value is the voltage. Just like I originally wanted. No need to alter table here.

    1. Patrick McFadin says:

      That wasn’t supposed to be a smiley face. It should read timestamp and :

    2. Keith Freeman says:

      Thanks, your example finally made this clear to me in a way that the Cassandra docs never did.

    3. Gabriel says:

      I’m so used at using Hector that I admit, I’m still digging this up and some things still aren’t clear in my head.
      But, after a first look, I can’t find a way to select a column (or cell) slice. To use your example, what cql command should i use to get, let’s say, every column between ’2013-06-05 15:11:02-0700 ‘ and ’2013-06-05 15:11:22-0700 ‘ for sensor_id =1? In this example, you’re using a composite primary key which i think limits this kind of selecting the data.
      I understand you suggest using collections for dynamic columns, but I don’t think you can select a column slice from a collection either.
      In my application, I don’t know the names of the columns I want to select, nor how many will i select. Can i still use CQL?

      1. Jonathan Ellis says:

        SELECT *
        FROM data
        WHERE sensor_id = 1
        AND collected_at >= ’2013-06-05 15:11:02-0700′ AND collected_at < ’2013-06-05 15:11:22-0700′

        (collected_at is a clustering column and not a partition key component. Components of the partition key can’t be used in inequalities in either Thrift or CQL)

        1. Gabriel says:

          OK, this is the SQL approach… At first, it was hard for me to think non-SQL, now it’s hard to think SQL :) It’s just that i didn’t want to use a composite PRIMARY KEY… My case is a little bit more complex: i have an unknown number of ‘collected_at’ like columns (so, my primary key would be composed of an unknown number of columns). I guess i’ll stick to Hector at the moment… We could probably rethink our model, but that’s what we are trying to avoid.

  6. Ian Rogers says:

    One confusing thing with cql is that it only supports operations on cells that have been named in a schema definition. E.g.

    select * from foo;

    will not show any cells/columns that are not in the schema.

    This is not so bad for select but is a major pain for:

    copy foo to ‘foo.csv’;

    Is there any chance of providing an option to support “sparse” csv ‘copy to’? It would have to be 2-pass: the first to get a list of all the cell names from all rows – for the optional header and to know how many columns should be in the csv in total – and then the 2nd pass to actually export the data.

  7. Alex P says:

    Here’s one more guide that goes into details about how to use Dynamic Rows in CQL: http://clojurecassandra.info/articles/data_modelling.html#toc_3

    We’ve got some more info, for people who need to understand how data is stored internally here: http://clojurecassandra.info/articles/troubleshooting.html#toc_4 which goes into details and explains how to query rows via cassandra-cli.

    Hope it helps someone!

  8. Jonathan Ellis says:

    Ian, you’re proceeding from a broken premise. As shown in my example above AND Patrick’s, “select * from foo;” shows *all* cells, with cell names and values both mapped to separate columns.

  9. Ansar says:

    Jonathan and Patrick, I totally agree with you. It does everything that we can do with Thrift model and even make the things more simpler but please remember there was lot more flexibility in Thrift model and you have to admit that. I have no doubt that storing data with Thrift model requires more space as each column value also need to store column name. E.g what if I want to create a new column name as collected_date then I have to alter table so this is the flexibility I was talking about.

    Please tell me

  10. Vladimir says:

    What about wide rows for the tables with counter columns. Table that has counter columns can’t have columns other than counter. Is there any workaround?

  11. Jonathan Ellis says:

    Datatype “counter” is no different from datatype “int,” “text,” etc. as far as handling wide rows is concerned.

  12. Deepak Kumar says:

    **** ” If you hear someone explaining that CQL doesn’t let you do X that you could do in Thrift, he’s probably misinformed.” ****

    Hi jonathan thanks for all the useful information.

    But i would like to ask you a question with reference to the above comment.

    How you are going to do a range query on column name.(or cell name).
    Well i was able to do the same with Thrift or Hector API. But i didn’t find a way to do the same using Cql. Please enlighten me.

  13. Tim says:

    Thanks for this, was an excellent read. I have a question which I think follows some of the ideas here:

    Let’s say, I have CF ‘products’
    id timeuuid
    location varchar
    shopname varchar
    expiry timestamp
    count int
    PRIMARY KEY (id)

    I want to be able to select products at specific location ordered by expiry. Therfore create register like:
    CF ‘id_register_by_loc_expy’
    location varchar
    expiry timestamp
    id timeuuid
    PRIMARY KEY (location,expiry,id)

    and want to select products at specific shopname ordered by expiry. Then create:
    CF ‘id_register_by_shopname_expy’
    shopname vachar
    expiry timestamp
    id timeuuid
    PRIMARY KEY (shopname,expiry,id)

    This is so I can do efficient queries/slicing as follows:
    1.select id from id_reg_by_loc_expy where location = ‘x’; // [naturally ordered by expiry]
    2.select id from id_reg_by_loc_expy where location = ‘x’ and expiry > ‘t1′ and expiry ‘t1′ and expiry ‘t1′ and expiry < 't2';
    ** note that this requires me to force 'allow filtering' and seems poor design to include another secondary index simply to allow this query.. i.e. a query of which I'm less interested in than the 'order by' query anyway.

    2. Using timeuuid in lieu of timestamp for the expiry. Even if this comes to work which I can't find a way, it doesn't help my 'ordering by count' intentions.

    Is there something fundamental that I am missing which should allow this? Is the answer that I need to go ahead with all the tombstone mitigation techniques? or do some of the ordering in my application?
    Cheers,
    Tim

  14. Jonathan Ellis says:

    If your question is, “How do I order by count,” that’s a Tough Problem. Most people use spark or Hive or Pig. Gory details on why it’s hard: http://www.youtube.com/watch?v=SRejy08zM7Y

  15. Martin Svajdlenka says:

    When a new CQL version with support for filtering collections in SELECT WHERE clause will be available?

    I need to model a tree where each node has a variable number of child nodes, and variable number of properties in map style property:value. I wanted to use list for child nodes and map for properties, but when it is not possible to select nodes based on properties and children, it is not usable for us.

  16. Hello, Jonathan, thank you for write up.

    I have use case, where meta data (key-value map) should be stored along with other properties for an entity, where meta data key is translated into column name (prefixed with a term), value is stored correspondingly. Additionally:
    #1 User meta data is optional, meaning that key-values can absent
    #2 Key names are not fixed, columns can not be precreated
    #3 Typed map is not an option due to size/count restrictions

    Having that:
    I doesn’t see any appropriate syntax to insert dynamic columns via CQL, where column names are not known before is table created.

    Although it’s possible to create table with composite columns:
    create table t1 (f1 ascii, f2 ascii, f3 ascii, primary key ((f1), f2, f3));
    The below works fine
    insert into t1 (f1, f2, f3) values (’1′, ‘key1′, ‘value1′);
    insert into t1 (f1, f2, f3) values (’1′, ‘key2′, ‘value2′);
    CQL prohibits to insert a row without f2 & f3, requirement #1 from the list

    Also there is a gap between thrift wide rows and CQL and I have just submitted an issue:
    https://issues.apache.org/jira/browse/CASSANDRA-7049

  17. Sougata says:

    Have a question here, I am using Netflix/astyanax api, I need to do some batch operation, while doing the batch operation using cql to build the CQL batch String takes sometime and more CPU, so we planned to use Thrift model, Is there any performance issue with Thrift model?
    I am using MutationBatch.
    I am asking the question since in lot of post I see CQL is the way for cassandra.

    Do you feel datastax driver’s performance is better than netflix one.

    One more quick question, for approximately 20K writes per second with replication factor 2 how many node would be good.
    -Sougata

  18. Nikolay says:

    I wonder, is it possible to do “dynamic” columns, without map ?

    something like:

    insert into ppl(key, fname)values(1, ‘Nick’);
    insert into ppl(key, something)values(2, ’123′);

    update ppl set yet_another = ‘qwerty’ where key=1;

    This definitely can be done with thrift, but with cql I do not see how it can be done?

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>