Apache Cassandra 0.7 Documentation

Natural Keys vs. Surrogate Keys

This document corresponds to an earlier product version. Make sure you are using the version that corresponds to your version.

Latest Cassandra documentation | Earlier Cassandra documentation

Within a column family, every row is known by its key, a string of virtually unbounded length. The key has no required form, so typically this is just a UUID to uniquely identify the row. This type of key is known as a “Surrogate Key”, as the key has no relation to the data in the row. Below is a sample user table with an index to match; the surrogate keys in this case are increasing integers.

USER: {
  '1': {'username':'joe', 'birthday':'11/11/69', ...},
  '2': {'username':'mary', 'dogs_owned':'2', ...},
  '3': ...
}
USERID_BY_NAME: {
  'joe': {'1': },
  'mary': {'2': },
  ...
}

However, there are often columns in a row that are guaranteed to be unique and can be used for the same purpose (e.g. the Username column of a User row). Since a username is typically unique across the system, it’s often a good design choice to use the username as the row-key itself. This type of key is called a “Natural Key”. For example:

USER: {
  'joe': {'birthday':'11/11/69', ...},
  'mary': {'dogs_owned':'2', ...},
  'tom': ...
}

There are pros and cons to this approach.

The gains from Natural Keys are increased readability, the removal of a source of denormalization and extra reads, and the removal of a required username-to-userid CF. In terms of clarity, get(USER, 'bob') is generally more clear than get(USER, get(USERID_BY_NAME, 'bob')).

The biggest con is when the uniqueness of the column in the row is not enforced. Continuing the previous example, if another user named ‘bob’ shows up, he will smash the existing data for the original ‘bob’. Additionally, if requirements change midway through the project and users need to be able to change names, the Surrogate Keys approach can change a single column in the USERID_BY_NAME index, whereas the Natural Keys approach requires a copy of the entire row, which may be quite large.