DataStax Developer Blog

Basic Rules of Cassandra Data Modeling

By Tyler Hobbs -  February 2, 2015 | 29 Comments

Picking the right data model is the hardest part of using Cassandra. If you have a relational background, CQL will look familiar, but the way you use it can be very different. The goal of this post is to explain the basic rules you should keep in mind when designing your schema for Cassandra. If you follow these rules, you’ll get pretty good performance out of the box. Better yet, your performance should scale linearly as you add nodes to the cluster.

Non-Goals

Developers coming from a relational background usually carry over rules about relational modeling and try to apply them to Cassandra. To avoid wasting time on rules that don’t really matter with Cassandra, I want to point out some non-goals:

Minimize the Number of Writes

Writes in Cassandra aren’t free, but they’re awfully cheap. Cassandra is optimized for high write throughput, and almost all writes are equally efficient [1]. If you can perform extra writes to improve the efficiency of your read queries, it’s almost always a good tradeoff. Reads tend to be more expensive and are much more difficult to tune.

Minimize Data Duplication

Denormalization and duplication of data is a fact of life with Cassandra. Don’t be afraid of it. Disk space is generally the cheapest resource (compared to CPU, memory, disk IOPs, or network), and Cassandra is architected around that fact. In order to get the most efficient reads, you often need to duplicate data.

Besides, Cassandra doesn’t have JOINs, and you don’t really want to use those in a distributed fashion.

Basic Goals

These are the two high-level goals for your data model:

  1. Spread data evenly around the cluster
  2. Minimize the number of partitions read

There are other, lesser goals to keep in mind, but these are the most important. For the most part, I will focus on the basics of achieving these two goals. There are other fancy tricks you can use, but you should know how to evaluate them, first.

Rule 1: Spread Data Evenly Around the Cluster

You want every node in the cluster to have roughly the same amount of data. Cassandra makes this easy, but it’s not a given. Rows are spread around the cluster based on a hash of the partition key, which is the first element of the PRIMARY KEY. So, the key to spreading data evenly is this: pick a good primary key. I’ll explain how to do this in a bit.
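
Just to make the terminology concrete (the two tables below are throwaway illustrations of the syntax, not the examples used later), the partition key is whatever comes first in the PRIMARY KEY, and it can be a single column or several columns wrapped in parentheses:

CREATE TABLE example_single (
    user_id text,
    item text,
    PRIMARY KEY (user_id, item)          -- partition key: user_id
)

CREATE TABLE example_compound (
    user_id text,
    day text,
    item text,
    PRIMARY KEY ((user_id, day), item)   -- compound partition key: (user_id, day)
)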

Rule 2: Minimize the Number of Partitions Read

Partitions are groups of rows that share the same partition key. When you issue a read query, you want to read rows from as few partitions as possible.

Why is this important? Each partition may reside on a different node. The coordinator will generally need to issue separate commands to separate nodes for each partition you request. This adds a lot of overhead and increases the variation in latency. Furthermore, even on a single node, it’s more expensive to read from multiple partitions than from a single one due to the way rows are stored.

Conflicting Rules?

If it’s good to minimize the number of partitions that you read from, why not put everything in a single big partition? You would end up violating Rule #1, which is to spread data evenly around the cluster.

The point is, these two goals often conflict, so you’ll need to try to balance them.

Model Around Your Queries

The way to minimize partition reads is to model your data to fit your queries. Don’t model around relations. Don’t model around objects. Model around your queries. Here’s how you do that:

Step 1: Determine What Queries to Support

Try to determine exactly what queries you need to support. This can include a lot of considerations that you may not think of at first. For example, you may need to think about:

  • Grouping by an attribute
  • Ordering by an attribute
  • Filtering based on some set of conditions
  • Enforcing uniqueness in the result set
  • etc …

Changes to just one of these query requirements will frequently warrant a data model change for maximum efficiency.

Step 2: Try to create a table where you can satisfy your query by reading (roughly) one partition

In practice, this generally means you will use roughly one table per query pattern. If you need to support multiple query patterns, you usually need more than one table.

To put this another way, each table should pre-build the “answer” to a high-level query that you need to support. If you need different types of answers, you usually need different tables. This is how you optimize for reads.

Remember, data duplication is okay. Many of your tables may repeat the same data.

Applying the Rules: Examples

To show some examples of a good thought process, I will walk you through the design of a data model for some simple problems.

Example 1: User Lookup

The high-level requirement is “we have users and want to look them up”. Let’s go through the steps:

Step 1: Determine what specific queries to support
Let’s say we want to be able to look up a user either by their username or by their email. With either lookup method, we should get the full set of user details.

Step 2: Try to create a table where you can satisfy your query by reading (roughly) one partition
Since we want to get the full details for the user with either lookup method, it’s best to use two tables:

CREATE TABLE users_by_username (
    username text PRIMARY KEY,
    email text,
    age int
)

CREATE TABLE users_by_email (
    email text PRIMARY KEY,
    username text,
    age int
)
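
With this model, each lookup is a single-partition read. The corresponding queries would look something like this (the two queries are just an illustration of the read path; only the tables above are part of the example):

SELECT * FROM users_by_username WHERE username = ?

SELECT * FROM users_by_email WHERE email = ?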

Now, let’s check the two rules for this model:

Spreads data evenly? Each user gets their own partition, so yes.
Minimal partitions read? We only have to read one partition, so yes.

Now, let’s suppose we tried to optimize for the non-goals, and came up with this data model instead:

CREATE TABLE users (
    id uuid PRIMARY KEY,
    username text,
    email text,
    age int
)

CREATE TABLE users_by_username (
    username text PRIMARY KEY,
    id uuid
)

CREATE TABLE users_by_email (
    email text PRIMARY KEY,
    id uuid
)

This data model also spreads data evenly, but there’s a downside: we now have to read two partitions, one from users_by_username (or users_by_email) and then one from users. So reads are roughly twice as expensive.
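
To make that concrete, a username lookup now takes two round trips instead of one (again, the queries are just an illustration of the indirection):

SELECT id FROM users_by_username WHERE username = ?

SELECT * FROM users WHERE id = ?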

Example 2: User Groups

Now the high-level requirement has changed. Users are in groups, and we want to get all users in a group.

Step 1: Determine what specific queries to support
We want to get the full user info for every user in a particular group. Order of users does not matter.

Step 2: Try to create a table where you can satisfy your query by reading (roughly) one partition
How do we fit a group into a partition? We can use a compound PRIMARY KEY for this:

CREATE TABLE groups (
    groupname text,
    username text,
    email text,
    age int,
    PRIMARY KEY (groupname, username)
)

Note that the PRIMARY KEY has two components: groupname, which is the partitioning key, and username, which is called the clustering key. This will give us one partition per groupname. Within a particular partition (group), rows will be ordered by username. Fetching a group is as simple as doing the following:

SELECT * FROM groups WHERE groupname = ?

This satisfies the goal of minimizing the number of partitions that are read, because we only need to read one partition. However, it doesn’t do so well with the first goal of evenly spreading data around the cluster. If we have thousands or millions of small groups with hundreds of users each, we’ll get a pretty even spread. But if there’s one group with millions of users in it, the entire burden will be shouldered by one node (or one set of replicas).

If we want to spread the load more evenly, there are a few strategies we can use. The basic technique is to add another column to the PRIMARY KEY to form a compound partition key. Here’s one example:

CREATE TABLE groups (
    groupname text,
    username text,
    email text,
    age int,
    hash_prefix int,
    PRIMARY KEY ((groupname, hash_prefix), username)
)

The new column, hash_prefix, holds a prefix of a hash of the username. For example, it could be the first byte of the hash modulo four. Together with groupname, these two columns form the compound partition key. Instead of a group residing on one partition, it’s now spread across four partitions. Our data is more evenly spread out, but we now have to read four times as many partitions. This is an example of the two goals conflicting. You need to find a good balance for your particular use case. If you do a lot of reads and groups don’t get too large, maybe changing the modulo value from four to two would be a good choice. On the other hand, if you do very few reads, but any given group can grow very large, changing from four to ten would be a better choice.
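
As a rough sketch of the application side (the hash function and the modulo of four are arbitrary illustrative choices, not something the schema dictates), you might compute the prefix like this and then read all four sub-partitions in one query with an IN clause on hash_prefix:

import hashlib

NUM_BUCKETS = 4  # must match whatever modulo you settle on

def hash_prefix(username):
    # First byte of an MD5 hash of the username, modulo the bucket count.
    # MD5 is just an example; any stable hash will do.
    return hashlib.md5(username.encode('utf-8')).digest()[0] % NUM_BUCKETS

# On writes, store hash_prefix(username) alongside the row.
# On reads, query every bucket, for example:
#   SELECT * FROM groups WHERE groupname = ? AND hash_prefix IN (0, 1, 2, 3)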

There are other ways to split up a partition, which I will cover in the next example.

Before we move on, let me point out something else about this data model: we’re duplicating user info potentially many times, once for each group. You might be tempted to try a data model like this to reduce duplication:

CREATE TABLE users (
    id uuid PRIMARY KEY,
    username text,
    email text,
    age int
)

CREATE TABLE groups (
    groupname text,
    user_id uuid,
    PRIMARY KEY (groupname, user_id)
)

Obviously, this minimizes duplication. But how many partitions do we need to read? If a group has 1000 users, we need to read 1001 partitions. This is probably 100x more expensive to read than our first data model. If reads need to be efficient at all, this isn’t a good model. On the other hand, if reads are extremely infrequent, but updates to user info (say, the username) are extremely common, this data model might actually make sense. Make sure to take your read/update ratio into account when designing your schema.
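
The read path behind that cost looks like this (sketched queries; the second one has to run once per user in the group):

SELECT user_id FROM groups WHERE groupname = ?

SELECT * FROM users WHERE id = ?   -- repeated for each user_id returned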

Example 3: User Groups by Join Date

Suppose we continue with the previous example of groups, but need to add support for getting the X newest users in a group.

We can use a similar table to the last one:

CREATE TABLE group_join_dates (
    groupname text,
    joined timeuuid,
    username text,
    email text,
    age int,
    PRIMARY KEY (groupname, joined)
)

Here we’re using a timeuuid (which is like a timestamp, but avoids collisions) as the clustering column. Within a group (partition), rows will be ordered by the time the user joined the group. This allows us to get the newest users in a group like so:

SELECT * FROM group_join_dates
    WHERE groupname = ?
    ORDER BY joined DESC
    LIMIT ?

This is reasonably efficient, as we’re reading a slice of rows from a single partition. However, since we will always want the newest users, every query has to include ORDER BY joined DESC, which makes reads slightly less efficient. Instead, we can simply reverse the clustering order:

CREATE TABLE group_join_dates (
    groupname text,
    joined timeuuid,
    username text,
    email text,
    age int,
    PRIMARY KEY (groupname, joined)
) WITH CLUSTERING ORDER BY (joined DESC)

Now we can use the slightly more efficient query:

SELECT * FROM group_join_dates
    WHERE groupname = ?
    LIMIT ?

As with the previous example, we could have problems with data being spread evenly around the cluster if any groups get too large. In that example, we split partitions somewhat randomly, but in this case, we can utilize our knowledge about the query patterns to split partitions a different way: by a time range.

For example, we might split partitions by date:

CREATE TABLE group_join_dates (
    groupname text,
    joined timeuuid,
    join_date text,
    username text,
    email text,
    age int,
    PRIMARY KEY ((groupname, join_date), joined)
) WITH CLUSTERING ORDER BY (joined DESC)

We’re using a compound partition key again, but this time we’re using the join date. Each day, a new partition will start. When querying the X newest users, we will first query today’s partition, then yesterday’s, and so on, until we have X users. We may have to read multiple partitions before the limit is met.

To minimize the number of partitions you need to query, try to select a time range for splitting partitions that will typically let you query only one or two partitions. For example, if we usually need the ten newest users, and groups usually acquire three users per day, we should split by four-day ranges instead of a single day [2].
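
To make the “query today’s partition, then yesterday’s” loop concrete, here is a rough application-side sketch. It assumes daily buckets, the group_join_dates table above, and a DataStax-Python-driver-style session object; adjust the bucket step to whatever range you choose:

from datetime import date, timedelta

def newest_users(session, groupname, limit, max_buckets=30):
    # Walk backwards through the daily partitions until we have enough rows,
    # giving up after max_buckets partitions.
    results = []
    day = date.today()
    for _ in range(max_buckets):
        rows = session.execute(
            "SELECT * FROM group_join_dates "
            "WHERE groupname = %s AND join_date = %s LIMIT %s",
            (groupname, day.isoformat(), limit - len(results)))
        results.extend(rows)
        if len(results) >= limit:
            break
        day -= timedelta(days=1)
    return results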

Summary

The basic rules of data modeling covered here apply to all (currently) existing versions of Cassandra, and are very likely to apply to all future versions. Other lesser data modeling problems, such as dealing with tombstones, may also need to be considered, but these problems are more likely to change (or be mitigated) by future versions of Cassandra.

Besides the basic strategies covered here, some of Cassandra’s fancier features, like collections, user-defined types, and static columns, can also be used to reduce the number of partitions that you need to read to satisfy a query. Don’t forget to consider these options when designing your schema.

Hopefully I’ve given you some useful fundamental tools for evaluating different schema designs. If you want to go further, I suggest taking DataStax’s free, self-paced online data modeling course (DS220). Good luck!

Footnotes

[1]: Notable exceptions: counters, lightweight transactions, and inserting into the middle of a list collection.

[2]: I suggest using a timestamp truncated by some number of seconds. For example, to handle four-day ranges, you might use something like this:

from time import time

now = int(time())                   # current Unix timestamp, in seconds
four_days = 4 * 24 * 60 * 60        # bucket width: four days, in seconds
shard_id = now - (now % four_days)  # truncate the timestamp to the bucket boundary


Comments

  1. Keith Webber says:

    Tyler,
    Thanks for this great data modeling guide!

    Currently we are working on building a new model where this could be very helpful.

    The use case is that some users have 10k objects in a row and others have 10M+. I want to partition, but I don’t want to over-partition for the smaller users (1K partitions for 10K items) and I don’t want to under-design for the bigger users (2 partitions for 1M). Do you have any advice on using Cassandra with such a huge variety of data set sizes?

    From the application side we have considered some logic to handle the partitions as we grow, but if there were a way to say, at 200K objects, create a new partition and migrate some data, that would be ideal in Cassandra, and then our API/data model wouldn’t have to change regardless of data set size.

    Thoughts?

    1. Tyler Hobbs says:

      Keith,

      I’m glad you found the post useful!

      If it’s possible to control partitioning through application logic (and perhaps some metadata in Cassandra, such as counters), I recommend doing that. I talked a bit about this in an older post about time series data (see the section on Variable Time Bucket Sizes), but the strategies apply to non-time series data as well. Unfortunately, there’s nothing that does this automatically in Cassandra. I hope that helps, though!

  2. vinit says:

    Excellent article on Cassandra data modeling. It provided solution to many questions I had during data modeling. Thanks a lot for sharing your knowledge.

  3. Ivan says:

    Can you please link articles on the ‘notable exceptions’? I remember the reasoning behind the cost of counters and lightweight transactions, but not for inserting into the middle of a list collection.

    1. Tyler Hobbs says:

      Sure, I’ve added a link for each of them.

  4. Colin McQueen says:

    Tyler,

    Great article much appreciated! 🙂

    How would you handle a use case where you have minute resolution time series data that gets queried by different resolutions?

    e.g. Have temperature readings per device that are being collected every minute. Two clients need the data in different resolutions:

    – Web UI requesting readings for the month for each device
    – Reporting system requesting readings for the day or down to the minute for each device

    Would you create separate tables with the partition key being the requested resolution and the device id?

    1. Tyler Hobbs says:

      Colin,

      Thanks! For your example, I would use a single table with one partition per month. I’ll explain my reasoning. At one event per minute, you’ll have about 44k events per month, which is a reasonable number of rows per partition. Obviously, having one partition per month makes the Web UI queries simple and efficient. For the Reporting System queries, it’s simple to restrict the range of rows within the partition to a single day or minute, and the query will be very efficient. Reading a (contiguous) portion of a single partition is not a bad thing, and Cassandra is well optimized for it.

      I hope that helps!
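
      A sketch of the kind of table described here (the column names and the text month bucket are illustrative, not something specified above):

      CREATE TABLE temperature_readings (
          device_id uuid,
          month text,               -- e.g. '2015-02': one partition per device per month
          reading_time timestamp,
          temperature double,
          PRIMARY KEY ((device_id, month), reading_time)
      ) WITH CLUSTERING ORDER BY (reading_time DESC)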

      1. Colin McQueen says:

        Tyler,

        Thank you for providing your perspective! I’ll definitely try it and see if I get better performance. 🙂

        I have a question about a design that has been given to me.

        e.g. events being generated by internal system applications. There are different types of events (customer, user, system). Customer and user ids are UUIDs, while system events use an integer as the id.

        Would you create two tables, one that has the customer and user events and another for system events? Or would you separate them into three tables (customer, user, and system)?

        Also, the customer/user table could be used as a dumping ground for another origin that we don’t know of yet, which is why we are making the partition key generic.

        Is there a consequence of storing mixed events if the layout of the data is the same?

          1. Tyler Hobbs says:

          Colin,

          This is where the two steps become important. You need to figure out what queries you need to support on this data first. It’s basically impossible to make a good data model in Cassandra without knowing that. Do the different types of data need to be queried separately, or together? What kind of queries?

          It can be okay to mix the data sources in one table if you need to query them both at the same time anyway. If you don’t, there’s no good reason to mix them in one table.

          1. Colin McQueen says:

            Tyler,

            In that example I would be querying the events by day, hourly, and/or by the minute.

            Here are the queries that have been defined:
            get all events for a customer/user from now() and up until a certain number (1 – 100).
            get all events for a customer/user for a day (or a time range)

            “It can be okay to mix the data sources in one table if you need to query them both at the same time anyway. If you don’t, there’s no good reason to mix them in one table.”

            What do you mean by at the same time? Do you mean 1 read to a partition in the table or multiple reads?

            Here is a draft of the design for the customer/user table:


            CREATE TABLE event (
                originid uuid,     -- the origin where the event occurred, i.e. customer / user
                edate text,        -- the day the event happened: 2014-11-20
                otype int,         -- the origin type; determines the origin id
                etime timestamp,   -- exact time the event happened
                source int,        -- identifies the process that created this event
                detail text,
                PRIMARY KEY ((originid, edate, otype), etime)
            ) WITH CLUSTERING ORDER BY (etime DESC);

            Based on the above design, I need to query separately even with the data mix. The only difference is storing the data in one table vs. separate tables, as the one table has a field in the partition key that identifies another field in the partition key.

          2. Tyler Hobbs says:

            What do you mean by at the same time? Do you mean 1 read to a partition in the table or multiple reads?

            I mean that if your application needs both sets of data every time, storing them in one table (or partition) may be a good idea.

            Based on the above design, I need to query separately even with the data mix.

            Why? If it’s because you need all events for the users (within a time range) regardless of the origin type, then you should remove the origin type from the partition key. On the other hand, if you only need events with a specific origin type for a user, keeping the current schema is fine. It just depends on what you want to be able to read efficiently.

  5. Dennis says:

    Thanks for sharing this post! Super helpful! In your example 1, how do you handle data replication? For example, if the username changes, you can’t do an UPDATE on the primary key

  6. Anuja says:

    Thanks for sharing a helpful topic like this.
    I have two questions,
    1. Does creating a secondary index on low-cardinality columns, like a boolean type, help read performance in any way? Because there will be only two values (true and false) for that column in the index table.

    2. Should secondary indexes be created on clustering columns even if these columns are not frequently used in where clause of a query?

    1. Tyler Hobbs says:

      1. Does creating a secondary index on low-cardinality columns, like a boolean type, help read performance in any way? Because there will be only two values (true and false) for that column in the index table.

      For small amounts of data (say, 100k rows per node), this could work okay. However, with larger sets of data, the main problem is that the index partitions will become very large. There would only be two index partitions: one for “true” and one for “false”.

      2. Should secondary indexes be created on clustering columns even if these columns are not frequently used in where clause of a query?

      Secondary indexes are more of a convenience feature than a performance feature, so this might be okay if it’s not run frequently. Otherwise, you should look into using materialized views or creating a table by hand.
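
      As a sketch of the materialized view alternative, using the users_by_username table from the post (illustrative only; materialized views require Cassandra 3.0+):

      CREATE MATERIALIZED VIEW users_by_age AS
          SELECT * FROM users_by_username
          WHERE age IS NOT NULL AND username IS NOT NULL
          PRIMARY KEY (age, username)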

  7. Noam Shaish says:

    Hi Tyler,
    Thank you for this enlightening article.
    I as well have a question regarding modeling.

    The domain is as such:
    Events of two types, mandatory and complementary, are generated on websites. For each mandatory event, I want to find the first and last complementary event that matches by site and visitor, where the distance between them is up to 30 days; needless to say, the arrival order is unknown.

    I was thinking of two tables: mandatory and complementary event table.

    complementary table:
    partition by account and date (yyyy/mm/dd)
    clustered by visitor and event time

    mandatory table:
    partitioned by date (yyyy/mm/dd) and 15 min buckets (1/2/3/4)

    then in my application I’ll read mandatory events by bucket, and based on the results I’ll query for all the fitting dates.

    Two questions:
    1. In this model I will read 30 partitions per mandatory event (sounds like too much to me); changing the partitioning to a 30-day bucket means I’ll insert each event into 30 partitions (sounds like too much as well).
    2. Is this two-query approach a good pattern to follow?

    Thank you

    1. Tyler Hobbs says:

      1. In this model I will read 30 partitions per mandatory event (sounds like too much to me); changing the partitioning to a 30-day bucket means I’ll insert each event into 30 partitions (sounds like too much as well).

      I don’t follow why using a 30 day bucket would require 30 insertions. Can you clarify?

      In any case, a 30-day bucket may be too large, depending on your data insertion rate. You may want to consider compromising with a 2-, 5-, or 10-day bucket. Remember that partitions should ideally be between 1 and 10 MB, although 100 MB can work if you’re careful.

      2. Is this two-query approach a good pattern to follow?

      You’re essentially performing a join here, and since Cassandra doesn’t support joins, I can’t think of a better alternative off the top of my head.

  8. matskr says:

    Hi, thanks for very interesting article. You wrote in Example 2: “Users are in groups, and we want to get all users in a group.”

    And solution:


    CREATE TABLE groups (
        groupname text,
        username text,
        email text,
        age int,
        PRIMARY KEY (groupname, username)
    );

    SELECT * FROM groups WHERE groupname = ?;

    But to find all users in a group, couldn’t we just set “PRIMARY KEY (groupname)”? Why is the clustering key (username) needed in this case?

    1. Tyler Hobbs says:

      The username needs to be included in the primary key in order to uniquely identify the row. I made it a clustering key instead of a partition key so that all users in a group would end up in the same partition.

  9. anu says:

    Hi Tyler, this is really very useful. I have a few questions about the following scenario:

    1. 95% of my queries will be fetching / looking for data from the last 24 hours only.
    2. The remaining 5% of my queries can look for data older than 24 hours, but within 14 days only.

    My questions are:

    1. How do you suggest partitioning this table?
    I have a primary key of machine_id, and the field on which most of the queries will apply a filter
    is a timestamp field; most of the queries would use a WHERE clause on the timestamp.

    2. In 24 hours the expected data volume is 20 TB.

    Please give an example showing how to really design this query and model the table.

    The typical data this table will keep is machine_id, timestamp, parameter_name1, parameter_value1, parameter_name2, parameter_value2.

    My third question is that some of the queries may use parameter_value2 as a filter criterion. How do you suggest designing this table?

    I assume I will have to design 2 tables to meet the above 2 query requirements,

    so I will have to write the same data to 2 tables in different ways.

    Please advise.

    1. Tyler Hobbs says:

      If you haven’t read my post on time series data modeling, I suggest reading that. With that volume of data, you’re going to need to use the “high throughput” modeling techniques along with small time buckets.

      If you need to be able to filter by parameter_value2, you probably need a second table that includes parameter_value2 in the partition key along with the time bucket. This would allow you to query for parameter_value2 matches within a given period of time.
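
      A sketch of what that second table could look like (the column names follow the ones mentioned in the question; the types and the bucket granularity are illustrative guesses):

      CREATE TABLE readings_by_parameter_value2 (
          parameter_value2 text,
          time_bucket text,          -- a small time bucket, e.g. '2015-02-02 14:05'
          event_time timestamp,
          machine_id text,
          parameter_name1 text,
          parameter_value1 text,
          parameter_name2 text,
          PRIMARY KEY ((parameter_value2, time_bucket), event_time, machine_id)
      )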

      1. anu says:

        Thank you.

        I read through your post on time series, and it would be really helpful if you could give an example either here or in the other post, where you show a CREATE TABLE structure and suggest a partition key, etc.

        For the query on value columns, I am thinking of using Apache Spark with Cassandra. Do you see any concerns I should be aware of?

        Another question: what if I want to move my relational model as-is into Cassandra, with changes such as partitioning of the data, and then do the rest of the work using Spark?

  10. Kedar Parikh says:

    Hi Tyler,

    wrt marketing domain
    We have to maintain Lists of users and each user will have multiple attributes. And there could be hundreds and thousands of such lists.

    e.g. List_id,user_id,age,sex,city,pincode,mobile,email ……

    The challenge is to find a segment of users within a list, i.e. all males, or users within a date range, or a combination of multiple such attributes.

    How can we model such a requirement for better search performance?

  11. Venkat says:

    Hi Tyler,

    Here is a table that I need to design in cassandra.

    The use case is that
    – alerts get created for organization
    – alerts can be of different types
    – alerts can have status, read/unread
    – each alert will have a timestamp
    – alert will have a message
    – alert to have an identifier so that it can be updated or deleted

    the columns will be orgId, id, message, type, status, timestamp

    My queries will be

    alerts for an orgn with a status read/unread.
    alerts for an orgn with a type with a status read/unread.
    alerts for an orgn within a period with a status read/unread.
    alerts for an orgn with a type within a period with a status read/unread.
    delete an alert based on its id

    There will be many orgs, around 4000.

    Can you please suggest a better modelling for these scenarios?

  12. Dmitri says:

    Hello Tyler,

    My company is currently evaluating the introduction of “DataStax Cassandra” into our working process, and it depends on a successful solution to the following problem:

    We have 3 tables :
    1) TABLE1 (is a big table),
    2) TABLE2 (is a very big table),
    3) TABLE2_HISTORY (is a very-very big table).

    and the following two very long-running queries on DB2:

    ********** QUERY 1 ‘START’ : **************
    SELECT
    (SELECT COUNT(*) FROM TABLE2 WHERE A2 IN (SELECT A1 FROM TABLE1))
    +
    (SELECT COUNT(*) FROM TABLE2_HISTORY WHERE A2_HISTORY IN (SELECT A1 FROM TABLE1))
    FROM DUAL;
    ********** QUERY 1 ‘STOP’. **************

    ********** QUERY 2 : **************
    SELECT SUM(S) FROM
    (
    SELECT B2 AS S FROM TABLE2 WHERE A2 IN (SELECT A1 FROM TABLE1))
    UNION ALL
    SELECT B2_HISTORY AS S FROM TABLE2_HISTORY WHERE A2_HISTORY IN (SELECT A1 FROM TABLE1))
    )
    AS A;
    ********** QUERY 2 ‘STOP’. **************

    How must these 3 tables be declared in Cassandra to permit the fastest possible execution of the queries?

    If I perform the following declarations in Cassandra :

    CREATE TABLE TABLE1 (
    PK1 text,
    A1 text,
    PRIMARY KEY (PK1));

    CREATE TABLE TABLE2 (
    PK2 text,
    A2 text,
    B2 DECIMAL,
    PRIMARY KEY (PK2));

    CREATE TABLE TABLE2_HISTORY (
    PK2_HISTORY text,
    A2_HISTORY text,
    B2_HISTORY DECIMAL,
    PRIMARY KEY (PK2_HISTORY));

    and then, in Cassandra, execute the following queries (are identical to queries listed before) :

    ********** QUERY 1 ‘START’ : **************
    SELECT
    (SELECT COUNT(*) FROM TABLE2 WHERE A2 IN (SELECT A1 FROM TABLE1))
    +
    (SELECT COUNT(*) FROM TABLE2_HISTORY WHERE A2_HISTORY IN (SELECT A1 FROM TABLE1))
    FROM DUAL;
    ********** QUERY 1 ‘STOP’. **************

    ********** QUERY 2 : **************
    SELECT SUM(S) FROM
    (
    SELECT B2 AS S FROM TABLE2 WHERE A2 IN (SELECT A1 FROM TABLE1))
    UNION ALL
    SELECT B2_HISTORY AS S FROM TABLE2_HISTORY WHERE A2_HISTORY IN (SELECT A1 FROM TABLE1))
    )
    AS A;
    ********** QUERY 2 ‘STOP’. **************

    Did I solve the problem ?

    If my solution is not appropriate, how must I model the tables and write the queries to achieve maximum performance?

    Thank you in advance !
    Have a nice day,
    Dmitri.

  13. Josh says:

    As others have said, this is a great post! I feel I have a much better understanding of the reasoning behind storing the same data in multiple tables.

    The question I have is in regard to getting the data into the tables in the first place. Above, you have several tables into which we are storing user data, depending on which queries are going to be run. However, are there any best practices for handling a potential failure of one of the writes? I don’t believe that Cassandra provides a native method for handling this, but I wanted to know if you’re aware of any client-side best practices.

    Thanks!

    1. Tyler Hobbs says:

      Hi Josh,

      That’s a good question! If you go with the multiple-table approach, I recommend writing to all of the tables at once with a logged batch. This will at least provide some guarantees about eventual consistency across the tables in the event of coordinator failure.

      If you’re using 3.0 or higher, materialized views are a good alternative to manually maintaining multiple tables. In addition to handling the batchlog for you, they provide other consistency features, like locking the base table to correctly handle concurrent updates, and the ability for a repair of the base table to also repair the views. If materialized views work for your use cases, I recommend using them.
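
      For example, with the two user tables from the post, a logged batch would look something like this (sketch only):

      BEGIN BATCH
          INSERT INTO users_by_username (username, email, age) VALUES (?, ?, ?);
          INSERT INTO users_by_email (email, username, age) VALUES (?, ?, ?);
      APPLY BATCH;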

  14. ajeet says:

    in your first data model example – Example 1: User Lookup

    we could have modeled the table like below

    CREATE TABLE Username_Lookup (
        id int,
        lookup_name text,
        lookup_value text,
        PRIMARY KEY ((id, lookup_name), lookup_value)
    );

    What are your thoughts on this?

  15. Nick says:

    I am trying to build a registration and login system using node.js and Cassandra. I want to add a password reset and a lockout feature after, let’s say, 5 wrong password tries. The problem is: what if the user tries to log in before the third server updates, and the password is compared to the outdated data on the third server? Wouldn’t the user get locked out? Is there any way to get around that? I like the benefits Cassandra offers with eventual consistency. Or would I have to use another database to store the login data and store everything else in Cassandra? I really appreciate all feedback about this.

    1. Ryan Moore says:

      I imagine that the proper solution would be to use higher-consistency reads and writes when it comes to password-related activities. For example, EACH_QUORUM-level consistency.
