Apache Cassandra 1.2 Documentation

About data modeling in Cassandra

Data modeling in Cassandra differs from data modeling in a relational database. In SQL you create tables that define the column names and their data types, and the client application then supplies rows conforming to that schema. In Cassandra, you also define tables and metadata about the columns, but the actual columns that make up a row are determined by the client application.

The best way to approach data modeling in Cassandra is to start with your queries and work back from there. Think about the actions your application needs to perform and how you want to access the data, and then design tables to support those access patterns. A good rule of thumb is one table per query, since you optimize tables for read performance.

Start by listing the use cases your application needs to support. Think about the data you want to capture and the lookups your application needs to do. Also note any ordering, filtering, or grouping requirements. For example, needing events in chronological order or needing only the last 6 months' worth of data would be factors in your data model design.
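The query-first approach can be illustrated with a small in-memory sketch. It is plain Python, not the Cassandra API, and the names events_by_user, record_event, and recent_events are hypothetical. The table exists to answer one query ("show a user's events in chronological order, last 6 months only"), so rows are keyed by user and each row is kept sorted by time when it is written.

```python
from bisect import bisect_left, insort
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical table designed around a single query.
# Row key: user name. Columns in the row: (timestamp, description) pairs,
# kept in chronological order.
events_by_user = defaultdict(list)


def record_event(user, when, description):
    # insort keeps the row ordered by timestamp at write time,
    # so reads never need to sort.
    insort(events_by_user[user], (when, description))


def recent_events(user, now, days=182):
    # One row lookup, then a slice of an already-ordered row.
    row = events_by_user.get(user, [])
    cutoff = now - timedelta(days=days)
    start = bisect_left(row, (cutoff,))
    return row[start:]
```

The ordering and the time window, both requirements noted up front, decide the shape of the row. A different query, such as all events of one type across users, would get its own table rather than a scan of this one.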

Denormalize

In the relational world, the data model is usually designed up front with the goal of normalizing the data to minimize redundancy. Normalization typically involves creating smaller, well-structured tables and then defining relationships between them. During queries, related tables are joined to satisfy the request.

Cassandra does not have foreign key relationships like a relational database does, which means you cannot join multiple tables to satisfy a given query request. Cassandra performs best when the data needed to satisfy a given query is located in the same table. Try to plan your data model so that one or more rows in a single table are used to answer each query. This sacrifices disk space (one of the cheapest resources for a server) in order to reduce the number of disk seeks and the amount of network traffic.
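As a rough illustration of denormalizing, the following sketch stores the same facts in two tables, each shaped for one query. It uses plain Python dictionaries as stand-ins for tables. The table names, the add_video function, and the video example itself are assumptions made for illustration only.

```python
from collections import defaultdict

# Two hypothetical tables holding the same data, one per query.
videos_by_user = defaultdict(dict)  # query: which videos did this user upload?
videos_by_tag = defaultdict(dict)   # query: which videos carry this tag?


def add_video(video_id, user, title, tags):
    # The write is duplicated so that each read touches a single table
    # and no join is needed.
    videos_by_user[user][video_id] = title
    for tag in tags:
        videos_by_tag[tag][video_id] = title


def titles_for_user(user):
    return sorted(videos_by_user.get(user, {}).values())


def titles_for_tag(tag):
    return sorted(videos_by_tag.get(tag, {}).values())
```

Each title is stored once per user and once per tag. That redundancy is the disk space being traded for reads that are answered from one table.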

Planning for concurrent writes

Within a table, every row is identified by its row key, a string of virtually unbounded length. The key has no required form, but it must be unique within a table. Unlike a relational database, which rejects a duplicate primary key, Cassandra does not enforce uniqueness. Inserting a duplicate row key upserts the columns contained in the insert statement rather than returning a unique constraint violation.
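The upsert behavior can be modeled with a short in-memory sketch. This is plain Python, not Cassandra itself. The users table and the insert function are hypothetical, and the sketch ignores write timestamps and replication.

```python
# Hypothetical in-memory table: row key -> {column name: value}
users = {}


def insert(table, row_key, columns):
    # No uniqueness check: if the row key already exists, the supplied
    # columns are merged into the existing row instead of being rejected.
    table.setdefault(row_key, {}).update(columns)


insert(users, "jsmith", {"email": "j@example.com", "state": "TX"})
insert(users, "jsmith", {"state": "CA"})  # second writer, same row key

# users["jsmith"] is now {"email": "j@example.com", "state": "CA"}
```

The second insert raises no error. It replaces the state column and leaves the email column untouched. This is why two writers that unknowingly share a row key can modify each other's data.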

Using natural or surrogate row keys

One consideration is whether to use surrogate or natural keys for a table. A surrogate key is a generated key (such as a UUID) that uniquely identifies a row, but has no relation to the actual data in the row.

For some tables, the data may contain values that are guaranteed to be unique and are not typically updated after a row is created, such as the user name in a users table. Such a value is called a natural key. Natural keys make the data more readable and remove the need for additional indexes or denormalization. However, unless your client application ensures uniqueness, a write with an existing key could overwrite that row's column data.
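The two kinds of key can be compared in a minimal sketch, again using plain Python dictionaries rather than the Cassandra API. All names here are hypothetical. The uniqueness check in add_user_natural is safe only because this is a single process. Against a real cluster, a read-then-write check like this can race with other clients.

```python
import uuid

users_by_id = {}    # surrogate key: a generated UUID
users_by_name = {}  # natural key: the user name itself


def add_user_surrogate(name, email):
    # A generated key has no relation to the data, so two users with
    # the same name never collide.
    user_id = uuid.uuid4()
    users_by_id[user_id] = {"name": name, "email": email}
    return user_id


def add_user_natural(name, email):
    # The client application, not the table, is responsible for
    # uniqueness. Without this check the second write would overwrite
    # the first user's columns.
    if name in users_by_name:
        raise ValueError("user name already taken: " + name)
    users_by_name[name] = {"email": email}
```

With the natural key, a lookup by user name is a direct row read. With the surrogate key, the same lookup needs either a scan or a second table keyed by name.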

For more information about data modeling, see Anatomy of a table.