Apache Cassandra 1.1 Documentation

What's New in Apache Cassandra 1.1

This document corresponds to an earlier product version. Make sure you are using the documentation that corresponds to your version.

In Cassandra 1.1, key improvements have been made in the areas of CQL, performance, and management ease of use:

Cassandra Query Language (CQL) Enhancements

One of the main objectives of Cassandra 1.1 was to bring CQL up to parity with the legacy API and command line interface (CLI) that have shipped with Cassandra for several years. This release achieves that goal, and CQL is now the primary interface into the DBMS. The CQL specification has been promoted to CQL 3, although CQL 2 remains the default in 1.1 because CQL 3 is not backwards compatible. A number of the new CQL enhancements were already rolled out in Cassandra 1.0.x point releases. These are covered in the CQL Reference.

Composite Primary Key Columns

The most significant enhancement of CQL is support for composite primary key columns and wide rows. Composite keys distribute column family data among the nodes. New querying capabilities are a beneficial side effect of wide-row support: for example, you can use an ORDER BY clause to sort the result set. A new compact storage directive provides backward compatibility for applications created with CQL 2. If this directive is used, an entire row is stored in a single column on disk, instead of each non-primary key column being stored as its own column on disk. The drawback is that updates to that column’s data are not allowed. The default is non-compact storage.
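
To make the wide-row idea concrete, here is a minimal Python sketch. It is not Cassandra code; it only models the usual reading of a composite primary key in CQL 3, where the first key column selects the partition and the remaining key columns order the entries inside it. The table layout, column names (user_id, posted_at, body), and helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical table with PRIMARY KEY (user_id, posted_at) and one data column.
rows = [
    {"user_id": "alice", "posted_at": 3, "body": "third"},
    {"user_id": "bob", "posted_at": 1, "body": "hello"},
    {"user_id": "alice", "posted_at": 1, "body": "first"},
    {"user_id": "alice", "posted_at": 2, "body": "second"},
]


def build_partitions(rows, partition_key, clustering_key):
    """Group rows by the first key column and sort each group by the second."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    for entries in partitions.values():
        entries.sort(key=lambda r: r[clustering_key])
    return dict(partitions)


def select(partitions, key, descending=False):
    """Roughly mimic: SELECT ... WHERE user_id = key ORDER BY posted_at [DESC]."""
    entries = partitions.get(key, [])
    return list(reversed(entries)) if descending else list(entries)


partitions = build_partitions(rows, "user_id", "posted_at")
newest_first = [r["body"] for r in select(partitions, "alice", descending=True)]
print(newest_first)  # ['third', 'second', 'first']
```

Because entries are kept sorted inside each partition, returning them in forward or reverse order is cheap, which is one way to think about why ORDER BY comes along with wide-row support.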

CQL Shell Utility

The CQL shell utility (cqlsh) contains a number of new features. First is the SOURCE command, which reads CQL commands from an external file and runs them. Next, the CAPTURE command writes the output of a session to a specified file. Finally, the DESCRIBE COLUMNFAMILIES command shows all the column families that exist in a certain keyspace.

Global Row and Key Caches

Memory caches for column families are now managed globally instead of at the individual column family level, simplifying configuration and tuning. Cassandra automatically distributes memory among column families based on the overall workload and specific column family usage. Two new configuration parameters, key_cache_size_in_mb and row_cache_size_in_mb, replace the per-column-family cache sizing options. Administrators can include or exclude column families from caching via the caching parameter that is used when creating or modifying column families.

Off-Heap Cache for Windows

The serializing cache provider (the off-heap cache) has been rewritten to no longer require the external JNA library. This is particularly good news for Microsoft Windows users, because Cassandra never supported JNA on that platform. With the JNA requirement eliminated, the off-heap cache is available on Windows, which provides the potential for additional performance gains.

Row-Level Isolation

Full row-level isolation is now in place so that writes to a row are isolated to the client performing the write and are not visible to any other user until they are complete. From a transactional ACID (atomic, consistent, isolated, durable) standpoint, this enhancement now gives Cassandra transactional AID support. Consistency in the ACID sense typically involves referential integrity with foreign keys among related tables, which Cassandra does not have. Cassandra offers tunable consistency not in the ACID sense, but in the CAP theorem sense where data is made consistent across all the nodes in a distributed database cluster. A user can choose on a per-operation basis how many nodes must receive a DML command or respond to a SELECT query.
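
As an illustration of per-operation tunable consistency, the following Python sketch computes how many replica nodes must acknowledge an operation for three common consistency levels. The function name is hypothetical, and only ONE, QUORUM, and ALL are modeled; a quorum is taken to be a strict majority of the replicas, that is, floor(RF / 2) + 1.

```python
def required_replicas(consistency_level, replication_factor):
    """Return how many replicas must respond for the given consistency level.

    Hypothetical helper for illustration; models only ONE, QUORUM, and ALL.
    """
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    level = consistency_level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # A strict majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {consistency_level}")


for level in ("ONE", "QUORUM", "ALL"):
    print(level, required_replicas(level, 3))
```

With a replication factor of 3, a QUORUM write and a QUORUM read each touch 2 replicas, so the two sets always overlap in at least one node.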

Concurrent Schema Change Support

Cassandra has supported online schema changes since 0.7; however, the potential existed for nodes in a cluster to disagree over the sequence of changes made to a particular column family. The end result was that the nodes in question had to rebuild their schema.

In version 1.1, large numbers of schema changes can take place simultaneously in a cluster without the risk of a schema disagreement.

A side benefit of the support for concurrent schema changes is that new nodes are added much faster. The new node is sent the full schema instead of all the changes that have occurred over the life of the cluster. Subsequent changes correctly modify that schema.

Fine-grained Data Storage Control

Cassandra 1.1 provides fine-grained control of column family storage on disk. Until now, you could only use a separate disk per keyspace, not per column family. Cassandra 1.1 stores data files by using separate column family directories within each keyspace directory. In 1.1, data files are stored in this format:

/var/lib/cassandra/data/ks1/cf1/ks1-cf1-hc-1-Data.db

Now, you can mount an SSD on a particular directory (in this example cf1) to boost the performance for a particular column family. The new file name format includes the keyspace name to distinguish which keyspace and column family the file contains when streaming or bulk loading.
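
The file name layout can be sketched in Python as follows. This is a minimal parser under stated assumptions: it reads "hc" as the on-disk format version and "1" as the file's generation number, it handles only simple five-part names like the example above (not temporary files or other variants), and the SSTableFile type and parse_data_file function are hypothetical names.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class SSTableFile(NamedTuple):
    keyspace: str
    column_family: str
    version: str      # assumed meaning of "hc": on-disk format version
    generation: int   # assumed meaning of "1": generation number
    component: str    # e.g. "Data.db"


def parse_data_file(path):
    """Split a 1.1-style data file path into its parts (simple names only)."""
    p = PurePosixPath(path)
    parts = p.name.split("-")
    if len(parts) != 5:
        raise ValueError(f"unexpected file name format: {p.name}")
    keyspace, column_family, version, generation, component = parts
    # Directory layout: .../<keyspace>/<column family>/<file>
    if p.parent.name != column_family or p.parent.parent.name != keyspace:
        raise ValueError(f"directories do not match file name: {path}")
    return SSTableFile(keyspace, column_family, version, int(generation), component)


info = parse_data_file("/var/lib/cassandra/data/ks1/cf1/ks1-cf1-hc-1-Data.db")
print(info.keyspace, info.column_family, info.component)
```

Because the keyspace and column family appear in the file name itself, a file can still be identified after it has been copied out of its directory, for example during streaming or bulk loading.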

Write Survey Mode

Using the write survey mode, you can add a node to a database cluster so that it accepts all the write traffic as if it were part of the normal database cluster, without the node itself actually being part of the cluster where supporting user activity is concerned. It never officially joins the ring. In write survey mode, you can test out new compaction and compression strategies on that node and benchmark the write performance differences, without affecting the production cluster.

To see how read performance is affected by the various modifications, you apply changes to the dummy node, stop the node, bring it up as a standalone machine, and then benchmark read operations on the node.

Abortable Compactions

In Cassandra 1.1, you can stop a compaction, validation, and several other operations from continuing to run. For example, if a compaction has a negative impact on the performance of a node during a critical time of the day, you can terminate the operation using the nodetool stop [operation type] command.
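
For scripted use, the command line can be assembled in Python before it is run. This is a sketch under assumptions: the set of operation type names below is assumed rather than taken from this document, so check the nodetool help output for your version, and build_stop_command is a hypothetical helper that only builds the argument list and does not execute anything.

```python
# Assumed operation type names; verify against `nodetool` help for your version.
ASSUMED_OPERATION_TYPES = {
    "COMPACTION",
    "VALIDATION",
    "CLEANUP",
    "SCRUB",
    "INDEX_BUILD",
}


def build_stop_command(operation_type, host="localhost"):
    """Build the argument list for `nodetool stop <operation type>`."""
    op = operation_type.upper()
    if op not in ASSUMED_OPERATION_TYPES:
        raise ValueError(f"unknown operation type: {operation_type}")
    return ["nodetool", "-h", host, "stop", op]


cmd = build_stop_command("compaction")
print(" ".join(cmd))  # nodetool -h localhost stop COMPACTION
```

To actually run the command, the list could be passed to subprocess.run(cmd, check=True) on a machine where nodetool is on the PATH.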

Hadoop Integration

The following low-level features have been added to Cassandra’s support for Hadoop:

  • Secondary index support for the column family input format. Hadoop jobs can now make use of Cassandra secondary indexes.
  • Wide row support. Previously, wide rows that had, for example, millions of columns could not be accessed, but now they can be read and paged through in Hadoop.
  • The bulk output format provides a more efficient way to load data into Cassandra from a Hadoop job.