Apache Cassandra 0.7 Documentation

Example: Twissandra

Twissandra is a project that provides similar functionality to Twitter.

Because it is beneficial to consider all the use cases for your data before you structure how it will be stored, the actions Twissandra needs to perform are listed here first.

The common actions to be supported are:

  • Get a user record by username
  • Get a list of user records by their usernames
  • Get the friends of a username
  • Get the followers of a username
  • Get the usernames of the friends of a username
  • Get the usernames of the followers of a username
  • Get a timeline of all tweets
  • Get a timeline of a specific user’s tweets
  • Get a tweet from a tweet ID
  • Create a tweet
  • Create a user
  • Add friends to a user
  • Remove friends from a user

Here, a “friend” is somebody that a user follows.

Twissandra uses several column families, organized around the actions above, to minimize the number of reads or inserts needed for each type of action. The column families described here are:

  • USER: row key is username; columns hold user details
  • FRIENDS: row key is username; columns are friends’ usernames
  • FOLLOWERS: row key is username; columns are followers’ usernames
  • TWEET: row key is tweet ID; columns are the tweet body and the username
  • TIMELINE
      • Holds the tweets of the friends a user is following
      • Row key is username; column names are timestamps and values are tweet IDs
  • USERLINE
      • Holds all the tweets by a given user
      • Row key is username; column names are timestamps and values are tweet IDs
      • One row for the PUBLIC_USERLINE_KEY, discussed below
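To make the layout concrete, here is a minimal sketch that models each column family as a plain Python dictionary mapping a row key to its columns. It is an in-memory illustration only, not a Cassandra client. The `password` column, the empty-string column values, the integer timestamps, and the value of `PUBLIC_USERLINE_KEY` are all assumptions made for the example.

```python
# In-memory stand-in for the column families (not a Cassandra client).
# Each column family maps: row key -> {column name: column value}.
PUBLIC_USERLINE_KEY = "!PUBLIC!"  # assumed value, for illustration only

USER = {"alice": {"password": "secret"}, "bob": {"password": "hunter2"}}
FRIENDS = {"alice": {"bob": ""}}      # alice follows bob
FOLLOWERS = {"bob": {"alice": ""}}    # bob is followed by alice
TWEET = {"t1": {"username": "bob", "body": "hello"}}
TIMELINE = {"alice": {1000: "t1"}}    # tweets by the people alice follows
USERLINE = {
    "bob": {1000: "t1"},              # bob's own tweets
    PUBLIC_USERLINE_KEY: {1000: "t1"},  # every tweet, from all users
}

# Following a tweet ID from a timeline column to the tweet itself:
tweet_id = TIMELINE["alice"][1000]
print(TWEET[tweet_id]["body"])
```

Note that the same tweet ID appears in three rows. The data is denormalized so that each read can be served from a single row.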

Adding Friends

Adding friends requires two sets of writes: one to update the user’s row in FRIENDS, and one to update the row in FOLLOWERS of each friend added.
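The two sets of writes can be sketched as follows. This is an in-memory illustration using dictionaries in place of column families; the function name `add_friends` and the empty-string column values are assumptions, not Twissandra’s actual code.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins: row key -> {column name: column value}.
FRIENDS = defaultdict(dict)
FOLLOWERS = defaultdict(dict)


def add_friends(username, friend_usernames):
    """Record that `username` now follows every name in `friend_usernames`."""
    # First write: add one column per new friend to the user's FRIENDS row.
    FRIENDS[username].update({friend: "" for friend in friend_usernames})
    # Second set of writes: add the user to each friend's FOLLOWERS row.
    for friend in friend_usernames:
        FOLLOWERS[friend][username] = ""


add_friends("alice", ["bob", "carol"])
print(sorted(FRIENDS["alice"]))
print(sorted(FOLLOWERS["bob"]))
```

Writing both directions up front means that “who does this user follow” and “who follows this user” are each answered by reading a single row.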

Getting a User’s Tweets

Getting a slice of a user’s tweets (such as the last 20 tweets) requires at least two queries, three if we want to attach a user record to each tweet:

  1. Get a slice of the columns in the user’s row in USERLINE. This gives us the tweet IDs in chronological order (because the column names are timestamps).
  2. Get the actual tweets by calling multiget() on TWEET with the tweet IDs as the keys.
  3. Optionally, get the user record for each tweet by:
       1. Collecting the usernames from all of the tweets
       2. Calling multiget() on USER with the usernames as the keys

Getting a Slice of All Tweets, PUBLIC_USERLINE_KEY

In USERLINE, the special PUBLIC_USERLINE_KEY row holds a timeline of all tweets. On a Twitter-like scale, holding all tweets in a single row will eventually cause problems, because the row grows without bound. Splitting the public userline row by day or hour, for example, would fix this, but because this is an educational example, it has not been done.

Other than that, getting a slice of all tweets works exactly as it does for a single user, except that PUBLIC_USERLINE_KEY is used as the row key instead of a username.
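As a rough illustration of the row-splitting idea mentioned above (which Twissandra itself does not implement), the row key could be derived from the tweet’s timestamp so that each day or hour gets its own row. The key format, the function name `public_row_key`, and the value of `PUBLIC_USERLINE_KEY` are hypothetical.

```python
from datetime import datetime, timezone

PUBLIC_USERLINE_KEY = "!PUBLIC!"  # assumed value, for illustration only


def public_row_key(timestamp, by="day"):
    """Return the bucketed public row key for a tweet.

    `timestamp` is in seconds since the epoch (UTC); `by` is "day" or "hour".
    """
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    fmt = "%Y-%m-%d" if by == "day" else "%Y-%m-%dT%H"
    return f"{PUBLIC_USERLINE_KEY}:{moment.strftime(fmt)}"


print(public_row_key(0))
print(public_row_key(3600, by="hour"))
```

With such a scheme, a reader of the public timeline would start at the current bucket and step back to earlier buckets until enough tweets had been collected.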

Adding a Tweet

To add a tweet, we have to do the following:

  1. Create the tweet in the TWEET column family
  2. Add it to the user’s row in USERLINE
  3. Add it to the PUBLIC_USERLINE_KEY row in USERLINE
  4. Add it to the row in TIMELINE of each of the user’s followers
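The four writes can be sketched as follows, again with dictionaries standing in for the column families. The function name `save_tweet`, the UUID tweet IDs, the microsecond timestamps, and the value of `PUBLIC_USERLINE_KEY` are assumptions made for illustration.

```python
import time
import uuid
from collections import defaultdict

# Hypothetical in-memory stand-ins: row key -> {column name: column value}.
PUBLIC_USERLINE_KEY = "!PUBLIC!"  # assumed value, for illustration only
TWEET = {}
USERLINE = defaultdict(dict)
TIMELINE = defaultdict(dict)
FOLLOWERS = defaultdict(dict)


def save_tweet(username, body, timestamp=None):
    """Store a tweet and fan it out to every row that should list it."""
    tweet_id = str(uuid.uuid4())
    if timestamp is None:
        timestamp = int(time.time() * 1e6)  # microseconds since the epoch
    # 1. Create the tweet itself.
    TWEET[tweet_id] = {"username": username, "body": body}
    # 2. List it in the author's own row in USERLINE.
    USERLINE[username][timestamp] = tweet_id
    # 3. List it in the public row in USERLINE.
    USERLINE[PUBLIC_USERLINE_KEY][timestamp] = tweet_id
    # 4. List it in the TIMELINE row of each follower of the author.
    for follower in FOLLOWERS.get(username, {}):
        TIMELINE[follower][timestamp] = tweet_id
    return tweet_id


FOLLOWERS["bob"]["alice"] = ""  # alice follows bob
new_id = save_tweet("bob", "hello world")
print(new_id in TIMELINE["alice"].values())
```

The cost of a write grows with the number of followers, which is the trade-off for being able to read a timeline from a single row.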