Improved Cassandra 2.1 Stress Tool: Benchmark Any Schema – Part 1

By Jake Luciani -  July 31, 2014 | 34 Comments

This is the first part of a two-part series on the new stress tool.

Background

The choices you make when data modeling your application can make a big difference in how it performs. How many times have you seen database benchmarks that look impressive in an article, only to leave you disappointed and confused when you try them with your own data and schema?  Getting a proper understanding of how a database scales, and capacity planning your application, requires significant effort in load testing.  Most importantly, understanding the tradeoffs you are making in your data model and settings requires multiple iterations before you get it right.

To make testing data models simpler, we have extended the cassandra-stress tool in Cassandra 2.1 to support stress testing arbitrary CQL tables and arbitrary queries on those tables.  We think it will be a very useful tool for users who want to quickly see how a schema will perform.  And it will help us on the Cassandra team diagnose and fix performance problems and other issues from a single tool.  Although this tool ships with the Cassandra 2.1 release, it also works against Cassandra 2.0 clusters.

In this post I’ll explain how to create a CQL-based stress profile and how to execute it. Finally, I’ll cover some of the current limitations.

The new stress YAML profile

cassandra-stress now supports a YAML-based profile that lets you define your specific schema, with whatever compaction strategy, cache settings, and types you wish, without having to write a custom tool.

The YAML file is split into a few sections:

  1. DDL – for defining your schema
  2. Column Distributions – for defining the shape and size of each column globally and within each partition
  3. Insert Distributions – for defining how the data is written during the stress test
  4. DML – for defining how the data is queried during the stress test
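In terms of actual YAML keys, these four sections map onto a handful of top-level entries. The outline below is a sketch based on the key names used by the cqlstress-example.yaml that ships with Cassandra 2.1; the values are placeholders and get filled in section by section below.

keyspace: ...            # 1. DDL: keyspace and table definitions
keyspace_definition: ...
table: ...
table_definition: ...

columnspec:              # 2. Column distributions
  - ...

insert:                  # 3. Insert distributions
  ...

queries:                 # 4. DML: named queries to stress
  ...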

To help explain the file, let’s define one that models a simple app holding blog posts for multiple websites, with the posts ordered in reverse chronological order.

DDL

The DDL section is straightforward: just define the keyspace and table information.  If the schema is not yet defined, the stress tool will create it the first time you run stress with this profile.  If you have already created the schema separately, then you only need to define the keyspace and table names.

[Screenshot: DDL section of the blogpost.yaml profile]
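For our blog-post example, the DDL section looks roughly like this. This is a sketch modeled on the cqlstress-example.yaml that ships with Cassandra 2.1, so the exact keyspace name, replication settings, and table options here are assumptions rather than the definitive profile:

keyspace: stresscql

keyspace_definition: |
  CREATE KEYSPACE stresscql WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

table: blogposts

table_definition: |
  CREATE TABLE blogposts (
        domain text,
        published_date timeuuid,
        url text,
        author text,
        title text,
        body text,
        PRIMARY KEY(domain, published_date)
  ) WITH CLUSTERING ORDER BY (published_date DESC)
    AND compaction = { 'class':'LeveledCompactionStrategy' }
    AND comment = 'A table to hold blog posts'

Note that CLUSTERING ORDER BY (published_date DESC) is what gives us the reverse chronological ordering of posts within a domain.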

Column Distributions

Next, the ‘columnspec’ section describes the distributions to use for each column.  These distributions model the size of the data in the column, the number of unique values, and the clustering of those values within a given partition.  They are used to auto-generate data that “looks” like what you would see in reality.  The actual data is garbage, but it is reproducible and procedurally generated.

The possible distributions are:

  • EXP(min..max) – an exponential distribution over the range [min..max]
  • EXTREME(min..max, shape) – an extreme value (Weibull) distribution over the range [min..max]
  • GAUSSIAN(min..max, stdvrng) – a Gaussian/normal distribution, where mean = (min+max)/2 and stdev = (mean-min)/stdvrng
  • GAUSSIAN(min..max, mean, stdev) – a Gaussian/normal distribution with an explicitly defined mean and stdev
  • UNIFORM(min..max) – a uniform distribution over the range [min..max]
  • FIXED(val) – a fixed distribution that always returns the same value

NOTE: If you prefix a distribution with ~, it is inverted.

For each column you can specify (note the defaults):

  • Size distribution – Defines the distribution of sizes for text, blob, set, and list types (default UNIFORM(4..8))
  • Population distribution – Defines the distribution of unique values for the column (default UNIFORM(1..100B))
  • Cluster distribution – Defines the distribution of the number of clustering prefixes within a given partition (default FIXED(1))

In our example it makes sense to size the fields according to their real-world limits: blog posts tend to have large bodies, and most blogs have at most a thousand posts.

[Screenshot: columnspec section of the blogpost.yaml profile]
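A sketch of the columnspec for this table, in the same style as cqlstress-example.yaml. The exact distributions are assumptions chosen to match the sizing rationale above; any column not listed keeps the defaults:

columnspec:
  - name: domain
    size: gaussian(5..100)       # domain names are fairly short
    population: uniform(1..10M)  # up to 10M distinct domains

  - name: published_date
    cluster: fixed(1000)         # at most a thousand posts per domain

  - name: url
    size: uniform(30..300)

  - name: title
    size: gaussian(10..200)      # titles stay under ~200 chars

  - name: author
    size: uniform(5..20)         # author names are short

  - name: body
    size: gaussian(100..5000)    # post bodies are large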

 

Insert Distributions

The insert section lets you specify how data is inserted during stress.  This gets a little tricky to think about, but it’s pretty straightforward once you grasp it.

For each insert operation you can specify the following distributions/ratios:

  • Partition distribution
    • The number of partitions to update per batch (default FIXED(1))
  • Select distribution ratio
    • The ratio of rows each partition should insert as a proportion of the total possible rows for the partition (as defined by the clustering distribution columns). default FIXED(1)/1
  • Batch type
    • The type of CQL batch to use. Either LOGGED/UNLOGGED (default LOGGED)

In our example it makes sense to insert only a single blog post at a time, to a single domain.

[Screenshot: insert section of the blogpost.yaml profile]
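A sketch of the insert section for this model (the ratios and batch type are assumptions consistent with the description above):

insert:
  partitions: fixed(1)      # update a single domain (partition) per batch
  select: fixed(1)/1000     # insert 1 of the up to 1000 possible rows per operation
  batchtype: UNLOGGED       # single-partition batches don't need the batch log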

DML

You can specify any CQL queries on the table by naming them under the ‘queries’ section.

The ‘fields’ field specifies whether the bind variables should be picked from the same row or across all rows in the partition.

In our example we may want to see how it performs to fetch the most recent post for a domain, as well as the meta-information of the previous 10 posts to show in a timeline view.

[Screenshot: queries section of the blogpost.yaml profile]
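The two queries exercised below, singlepost and timeline, would be defined along these lines (the exact CQL is a sketch; ‘samerow’ binds all the variables from the same generated row):

queries:
  singlepost:
    cql: select * from blogposts where domain = ? LIMIT 1
    fields: samerow
  timeline:
    cql: select url, title, published_date from blogposts where domain = ? LIMIT 10
    fields: samerow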

Putting it all together

Now that we have our profile, we can run it with the following commands.  The complete YAML and results are located here.

Inserts:

./bin/cassandra-stress user profile=./blogpost.yaml ops\(insert=1\)

Without any other options, stress will run our inserts starting with 4 threads and increasing the thread count until it reaches a limit. All inserts are done over the native transport with prepared statements. The full list of cassandra-stress features is listed under the help command.
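If you would rather pin the run down than let stress ramp threads automatically, the standard option groups apply to the user command as well. Something along these lines fixes the operation count, thread count, and contact points (the node addresses here are placeholders):

./bin/cassandra-stress user profile=./blogpost.yaml ops\(insert=1\) n=1000000 -rate threads=50 -node 192.168.1.101,192.168.1.102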

On my laptop this was ~8,500 inserts/s with 401 threads. This is significantly slower than the default stress, but we don’t expect it to be as fast since this is >1 KB per insert.

Queries:

./bin/cassandra-stress user profile=blogpost.yaml ops\(singlepost=1\)

Reading a single post yields ~7,000 queries/sec.

./bin/cassandra-stress user profile=./blogpost.yaml ops\(timeline=1\)

Reading a timeline yields ~7,000 queries/sec, but ~25,000 CQL rows/sec, since there are multiple rows per domain.

Mixed:

./bin/cassandra-stress user profile=./blogpost.yaml ops\(singlepost=2,timeline=1,insert=1\)

We can also run many types of queries and inserts at once.  This syntax sends three queries for every one insert.

Other YAML examples

Cassandra 2.1 comes with three sample YAML files in the tools directory with more advanced examples:

https://github.com/apache/cassandra/tree/cassandra-2.1/tools

Limitations/improvements

The new stress tool covers a lot of use cases, but there are some things it can’t do.  We plan to address these in future releases:

  • Doesn’t support map types or user-defined types
  • Indexes must be manually added to your tables

Some of the features we wish to add are:

  • Random sentence generation, instead of random strings, to more accurately test the effect of compression.
  • More control over read and write patterns, like querying only the most recently added partitions.

To be continued…

This post covered some of the basics of the new stress tool; in the next post we will cover a more advanced example.



Comments

  1. giacomo says:

    Well… what can I say? This is just great! Thanks!

  2. Andrew Tolbert says:

    This is *amazing*. Quite an advancement over the existing cassandra stress tool (which is very useful as is). Really nice way to test your data models without having to write a script to write/read data. I’m super excited to play with this. Thanks!

  3. Ma says:

    Do you have a sample yaml file that I can use to only run select statements?

    Based on the github post it says “Remember that you must perform inserts before performing reads or range slices.” What if my keyspace already has data in it?

    Thanks

    1. Jake Luciani says:

      In order for the system to find data it must know what the possible data values are. So you can’t read existing data; the data must be generated by the stress tool.

      1. Ma says:

        Thanks, that makes sense. I was getting output, but for the Partitions column I was getting 0…probably indicating something is wrong?

        Is the partitions column essentially # of records being read/written to?

        Also, I have multiple tables per keyspace, but if I am reading only from one table do I need to include the entire keyspace definition including all tables, or only the one I’m stress testing against?

        Thanks

        1. Jake Luciani says:

          Partitions are the primary key of the table. The keyspace definition is optional if it already exists; just specify the name.

          1. Ma says:

            I notice the inserts are all garbage text…is there a way I can add my own content during insert?

            I see this:

            name
            ———-
            yx^4
            T}
            }$<1oP

            Can I add something like this:

            name
            ———-
            abcdefg
            abcdefg
            abcdefg

            Thanks!

  4. Ra says:

    In GitHub, I see an example file query1.txt which was generated by:
    ./bin/cassandra-stress user profile=blogpost.yaml ops\(singlepost=1\)

    So in this case inserts are not done before reads. Can you please explain how the tool worked in this scenario.

    You can take a look at the file I am talking about here:
    https://gist.github.com/tjake/fb166a659e8fe4c8d4a3

    1. Jake Luciani says:

      The insert command was run first… followed by the query

      1. Sebastian says:

        How do you make sure that generated read queries only query data that is actually in the database? Referring to the example posted here: can you be sure your generated queries only look for domain values that are actually in the database? If you cannot be sure, are queries for non-existing data included in the statistics?

        1. Sebastián Estévez says:

          @Sebastián, if you use the same seed, you can generate the same “random” values in sequence.

  5. Roger says:

    Is it possible to parameterize INSERT statements?

    E.g. if I have a columnspec with name ‘username’ and I want to use that variable in my insert statement

    Like:
    INSERT INTO account(username) values() IF NOT EXISTS;

  6. sam says:

    How do i integrate the new stress tool into my production cluster which runs dse 4.5??

  7. sai says:

    How do i integrate the new stress-test tool with dse-4.5 in our production cluster?

  8. Vega says:

    You used to be able to do -b for batch and I just don’t understand how that translates into 2.1. Can someone give an example of how to insert in batches of 1000?

  9. Suraj says:

    We use a tool called RowGen (from IRI) to generate our ‘big’ test data from DDL or metadata we define. We connect to DBs in Eclipse to connect to source/target tables and build flat-files for loads, et al. Has anyone built stress/benchmark data in this environment with RowGen, and do you have any pointers?

  10. Pawan says:

    Hi,

    I ended up getting an error:

    “Application does not allow arbitrary arguments:write,yaml=cqlstress-example.yaml”

    command used:
    cassandra-stress write -schema yaml=cqlstress-example.yaml

    Can you please help

  11. infomaven says:

    I’m getting the following error when I try to run the commands from this article. How to fix this?

    infomav:tools nwhit8$ ./bin/cassandra-stress user profile=blogpost.yaml ops\(singlepost=1\)

    Error: Could not find or load main class org.apache.cassandra.stress.Stress
    infomav:tools:tools infomav:tools$

    1. infomaven says:

      Project was built from source using ant v1.9.4

  12. infomaven says:

    I found the problem: I did not use the correct command in Ant. I was using *build* instead of *release*.

    You can delete my comment.

  13. vaibhav misra says:

    Are you planning to come up with the 2nd part of this series soon or will we have to suffice with just this one?

  14. Abhishek Patel says:

    I am trying to run the following command:

    $ sh cassandra-stress counter_write profile=../epcountertest.yaml

    but this gives me the following error

    Invalid parameter profile=../epcountertest.yaml

    I am using cassandra-2.1.8
    Please help.

    1. ml says:

      cassandra-stress counter_write user profile=../epcountertest.yaml

      try putting in user in front of profile

  15. Barsha says:

    I am getting error InvalidRequestException(why:Invalid version for TimeUUID type.) on executing this.
    It works if I drop the timeuuid column. What should I do to include timeuuid columns?

  16. csea says:

    Is there any official document that describes the method in detail? There are still some things I don’t understand.

  17. kant says:

    How to define multiple table definitions with multiple column specs within the same keyspace in the yaml file?

    1. Shuabham Baldava says:

      Hi Kant,

      I have the same problem. Did you get any solution for this ?

      Thanks in advance.

  18. Shuabham Baldava says:

    How to define multiple table definitions with multiple column specs within the same keyspace in the yaml file?

  19. Ankit Saran says:

    I get a No Host Available exception when I run the command: ./bin/cassandra-stress user profile=./blogpost.yaml ops\(insert=1\).

    Can you please help with the same?

    1. mehnaaz says:

      this is probably happening because the version of Cassandra is different from the one you are running stress against.

  20. Oded says:

    How do I define multiple table definitions and multiple column specs for cassandra-stress profile file?

    You can’t.

    http://stackoverflow.com/questions/34889485/how-do-i-define-multiple-table-definitions-and-multiple-column-specs-for-cassand

  21. Joseph Wang says:

    where do we download the tool?

  22. Miraj says:

    I am trying to evaluate a data model with wide rows. What’s happening is cassandra-stress is not generating enough unique values for clustering columns and hence, instead of inserting new cells, it starts rewriting to the old cells.

    name: column1
    size: fixed(32)
    population: gaussian(1..10000000)
    cluster: fixed(40)

    ^ Above is one of the four columns in composite clustering key. From the specs, it should generate 40 unique values for column1 for one partition. It is hardly generating 5-6. Same issue with other three columns.

    Is this a limitation of cassandra-stress or is something wrong with the yaml?

  23. Mick says:

    Jake,
    The following text

    “The ratio of rows each partition should insert as a proportion of the total possible rows for the partition (as defined by the clustering distribution columns). default FIXED(1)/1”

    I believe is inaccurate (or ambiguous).
    cassandra-stress has no way of keeping track of the total rows within a partition over a cassandra-stress invocation.

    It would be more accurate to say:

    The ratio of rows each partition should insert as a proportion of the total possible rows for the partition within a request (as defined by the clustering distribution columns). default FIXED(1)/1
