DataStax Developer Blog

Modeling A/B Tests With Apache Cassandra

By Ben Coverston -  January 25, 2012 | 0 Comments

The first obstacle that most people come up against with Cassandra is in data
modeling. Cassandra has a very flexible data store, and data can be stored in
a multitude of ways. The important question to ask is this: given my scenario,
How should I put my data into Cassandra in a way that will make it useful to me
later and answer the questions that the stake-holders for this project care about.

With most things in life data modeling in Cassandra can be compared to learning to
ride a bike. It can be scary, you might fall off, but in the end once you learn a
few fundamental concepts everything will be easier to do. The goal of this article
is to get you comfortable with a basic data modeling scenario that you will likely
see in the real world.

A/B Testing

A/B testing is a method in which we can market test control samples against test
samples. When used, it can effectively tell you which font, color, layout, phrasing,
or other device works best for your target audience. Many organizations including
Netflix, Amazon, Google, Disney, and eBay are all involved in A/B testing.

The Scenario

As the lead data scientist at your company you have been tasked to use your customer
database to track responses for an email campaign. The marketing department wants to
test the responses of two different styles of wording in their Call to Action. One
faction believes that linking to ‘Look Cats!’ will get the best response, while the
other believes that linking ‘Ooh! Shiny!’ will garner a better response. Your job
is to design a schema and use data to help to resolve this dispute:

The Data Model

To model this scenario we need two things, first a place to put our data, and second,
a place for the results.

Users

To do this we’ll create a column familiy called Users. One way a column family can be used is
as tabular data store, similar to the way tables are used in a relational data store. A
distinct feature of column families is that each element of data is a key/value pair so when we
visualize the model it is appropriate to think of the data as a nested hash table. It helps to
think of the data as a series of associative arrays, one to find the row, and another (nested) to find the
column.


    {Users : [
        james : {
            'firstName' : 'james',
            'lastName' : 'Silva',
            'email' : 'james@example.com',
            'state' : 'CA',
            'rank' : 'black'
            },
        johnny : {
            'firstName' : 'james',
            'lastName' : 'DeSosa',
            'email' : 'johnny@example.com',
            'state' : 'CA'
            },
        armando : {
            'firstName' : 'armando',
            'lastName' : 'Dias',
            'state' : 'UT'
            }
    ]};

It is interesting to note that unlike a relational data store, cassandra makes no constraints on the
actual columns inside of each row in the column family. Armando, for example, does not have an entry
for his email address. James, has an column that the others do not (rank).

Test Data

For our test data we need to create another column family. This column family will consist of
an email address as the key, and a single column value that will indicate the result of
the particular test we are running in this example.


    {ABTest : [
        james@example.com : {
            'shiny_or_cats' : cats
            },
        johnny@example.com : {
            'shiny_or_cats' : shiny
            }
    ]};

In this column family our rows consist of a key (the email address), followed by a column that
tells us what the result was for this particular run. If later we decided to store the results of
a different test we could append a column to each of the rows:


    {ABTest : [
        james@example.com : {
            'shiny_or_cats' : 'cats',
            'lulz_or_hawt' : 'lulz'
            }
        johnny@example.com : {
            'shiny_or_cats' : 'shiny', 
            'lulz_or_hawt' : 'lulz', 
            'daft_or_dire' : 'daft'
            }
    ]};

As you can see, in the above column family we have run at least three A/B campaigns, james has
responded to two of them, and johnny has responded to three.

The Procedure

  1. Create the keyspace for the customer database by running the following script in the CQL shell.

    
    CREATE KEYSPACE ABTest WITH strategy_class = 'SimpleStrategy'
    ... and strategy_options:replication_factor=1
    
  2. Create a column family that has the following definition:

    
    cqlsh> USE ABTest;
    cqlsh:ABTest> CREATE COLUMNFAMILY Users (key text PRIMARY KEY,
    ... firstName text,
    ... lastName text,
    ... email text,
    ... age bigint,
    ... sex text,
    ... state text);
    
  3. Now I insert a few thousand users by creating a user object, then inserting the
    data into our column family. For this I used lists of first and list names
    and combined them to generate my name data.

    
    def saveUser(u):
    cursor.execute('''UPDATE Users SET 
                      sex = :sex, 
                      state = :state, 
                      lastName = :lastName, 
                      firstName = :firstName, 
                      email = :email,
                      age = :age
                      WHERE key = :uuid''',\
                   {'sex' : u.sex, 
                    'state' : u.state, 
                    'lastName' : u.lastName, 
                    'firstName' : u.firstName,\
                    'email' : u.email, 
                    'age' : u.age, 
                    'uuid' : uuid.uuid1() })
    
  4. Create a schema to hold the responses to our AB Test, create a secondary index on the column where we will hold the samples:

    
    cqlsh> USE ABTest;
    cqlsh:ABTest> CREATE COLUMNFAMILY lookcats
    ... (key text PRIMARY KEY,
    ... email text,
    ... cats_or_shiny text);
    cqlsh:ABTest> CREATE INDEX shinyorcats on lookcats (cats_or_shiny);
    
  5. Now we can simulate an A/B test over our data with the following Python script:

    
    options = ['cats', 'shiny', 'cats', 'cats', 'shiny']
    
    
    def saveUser():
        cursor.execute("SELECT email FROM Users LIMIT 1000")
        for row in cursor.fetchall():
            cursor.execute('''UPDATE lookcats SET email = :email,
                              cats_or_shiny = :cats_or_shiny 
                              WHERE key = :uuid''', 
                              {'email': row[0], 
                              'cats_or_shiny': choice(options),
                              'uuid': uuid.uuid1()})
    
  6. The last step is to query our data for results. This is easy for us. We created
    a secondary index on the column catsorshiny which represents the response for each
    test. To see the results we merely need to create a simple CQL statement.

    
    cqlsh:ABTest> SELECT COUNT(*)
    ... FROM lookcats where cats_or_shiny = 'cats'... ;
    
    
    count |
    615 |
    
    
    cqlsh:ABTest> SELECT COUNT(*) 
    ... FROM lookcats where cats_or_shiny = 'shiny';
    
    
    count |
    385 |
    

As you can see, in this set of customers I was able to see that more users preferred the ‘cats’ link to the ‘shiny’ link. We can use this data not only to get the marketing department back to work, but to also improve their effectiveness when they do.



Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>