The next release (as of this writing) on the 2.0 branch of Cassandra, Cassandra 2.0.6, adds a few noteworthy features to CQL. This post details those.
Batching of conditional updates
The first addition to CQL in Cassandra 2.0.6 is the ability to batch conditional updates. Cassandra 2.0 introduced lightweight transactions in the form of conditions on updates, but it wasn't possible to batch such updates so far. Cassandra 2.0.6 changes that, allowing to group multiple conditional updates in a batch as long as all those updates apply to the same partition (the underlying Paxos implementation works at the granularity of the partition, so it is not possible to batch conditions on different partitions).
Note that it is possible to group updates that have conditions with others that don't, but as soon as one statement in a batch uses a condition, the whole batch will be committed using a single Paxos proposal, and this if all of the conditions contained in the bach apply.
Batching conditional updates offers a few interesting possibilities when combined with another new feature: static columns.
A static column is a special column that is shared by all the rows of the same partition. Let's take an example: suppose that we want to store per-user bills that needs to be paid and keep a per-user balance amount of what remains to be paid. The invariant we want to maintain is that the balance is always the sum of all unpaid bills:
CREATE TABLE bills ( user text, balance int static, expense_id int, amount int, description text, paid boolean, PRIMARY KEY (user, expense_id) );
Because the balance is static, we will only have one balance for each user in practice, no matter how many bills a user have. We can then insert a few expenses:
BEGIN BATCH INSERT INTO bills (user, balance) VALUES ('user1', -8) IF NOT EXISTS; INSERT INTO bills (user, expense_id, amount, description, paid) VALUES ('user1', 1, 8, 'burrito', false); APPLY BATCH; BEGIN BATCH UPDATE bills SET balance = -208 WHERE user='user1' IF balance = -8; INSERT INTO bills (user, expense_id, amount, description, paid) VALUES ('user1', 2, 200, 'hotel room', false); APPLY BATCH;
Note that we are allowed to update a static column in a statement that only provide the partition key, which wouldn't be correct for a normal column (since the primary key contains a clustering column). Also note that due to the use of (batched) conditional updates, we guarantee that the balance will always correspond to the amount due for the unpaid bills. This wouldn't be possible if the balance was stored in a separate table since a batch with conditional updates cannot span multiple partitions. At that point we will have:
SELECT * FROM bills; user | expense_id | balance | amount | description | paid -------+------------+---------+--------+-------------+------- user1 | 1 | -208 | 8 | burrito | False user1 | 2 | -208 | 200 | hotel room | False
Note that both rows share the same balance. I would then be possible to pay a bill with:
BEGIN BATCH UPDATE bills SET balance=-200 WHERE user='user1' IF balance=-208; UPDATE bills SET paid=true WHERE user='user1' AND expense_id=1 IF paid=false; APPLY BATCH; SELECT * FROM bills; user | expense_id | balance | amount | description | paid -------+------------+---------+--------+-------------+------- user1 | 1 | -200 | 8 | burrito | True user1 | 2 | -200 | 200 | hotel room | False
This example is just one use of static columns, but they can be more generally used for optimizations, when you want to store some per-partition "static" information alongside clustered rows and still want to be able to query both of those with a single SELECT.
Please note that:
- only columns not part of the PRIMARY key can be static.
- only tables with at least one clustering column can have static columns (a table with no clustering column has only one row per partition and as such all columns are already static by construction).
- tables with the COMPACT STORAGE option cannot have static columns.
Increased control for "slicing" over multiple clustering columns
The last improvement to CQL in Cassandra 2.0.6 is a new syntax that allows slightly more general "slicing" over the rows of a given partition when there is more than one clustering column. Consider the following (somewhat artificial) example:
CREATE TABLE timeline ( day text, hour int, min int, sec int, value text, PRIMARY KEY (day, hour, min, sec) ); INSERT INTO timeline (day, hour, min, sec, value) VALUES ('12 Jan 2014', 3, 43, 12, 'event1'); INSERT INTO timeline (day, hour, min, sec, value) VALUES ('12 Jan 2014', 3, 52, 58, 'event2'); INSERT INTO timeline (day, hour, min, sec, value) VALUES ('12 Jan 2014', 4, 37, 01, 'event3'); INSERT INTO timeline (day, hour, min, sec, value) VALUES ('12 Jan 2014', 4, 37, 41, 'event3'); INSERT INTO timeline (day, hour, min, sec, value) VALUES ('12 Jan 2014', 6, 00, 34, 'event4'); SELECT * FROM timeline; day | hour | min | sec | value -------------+------+-----+-----+-------- 12 Jan 2014 | 3 | 43 | 12 | event1 12 Jan 2014 | 3 | 52 | 58 | event2 12 Jan 2014 | 4 | 37 | 1 | event3 12 Jan 2014 | 4 | 37 | 41 | event3 12 Jan 2014 | 6 | 0 | 34 | event4
and suppose that you want all events for the 12th of January 2014 between 3:50:00 and 4:37:30 (I said it was somewhat artificial). Doing:
SELECT * FROM timeline WHERE day='12 Jan 2014' AND hour >= 3 AND min >= 50 AND hour <= 4 AND min <= 37 AND sec <=30;
doesn't work. No only is that not allowed but no row can satisfy such query since no time can have both min >= 50 AND min <= 37. This is not the query we are looking for. So to allow this type of query, Cassandra 2.0.6 will introduce the following syntax:
SELECT * FROM timeline WHERE day='12 Jan 2014' AND (hour, min) >= (3, 50) AND (hour, min, sec) <= (4, 37, 30); day | hour | min | sec | value -------------+------+-----+-----+-------- 12 Jan 2014 | 3 | 52 | 58 | event2 12 Jan 2014 | 4 | 37 | 1 | event3