4 simple rules when using the DataStax drivers for Cassandra

By Alex Popescu - June 5, 2014

When using one of the DataStax drivers for Cassandra, whether it’s C#, Python, or Java, there are 4 simple rules that should clear up the majority of questions and that will also make your code efficient:

  1. Use one Cluster instance per (physical) cluster (per application lifetime)
  2. Use at most one Session per keyspace, or use a single Session and explicitly specify the keyspace in your queries
  3. If you execute a statement more than once, consider using a PreparedStatement
  4. You can reduce the number of network roundtrips and also have atomic operations by using Batches

In case you are wondering what’s behind these rules, keep reading. Otherwise, happy coding!

Cluster

A Cluster instance lets you configure important aspects of how connections and queries are handled. At this level you can configure everything from contact points (the addresses of the nodes to contact initially, before the driver performs node discovery) to the request routing policy and the retry and reconnection policies. Generally, such settings are set once at the application level.

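from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

cluster = Cluster(['10.1.1.3', '10.1.1.4', '10.1.1.5'],
    compression=True,
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc='US_EAST')))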

Please note that, apart from language-specific conventions (Python's in this example), you’ll find the same API across all the DataStax drivers.

Session

While the API of Session is centered around query execution, the Session does some heavy lifting behind the scenes as it manages the per-node connection pools. The Session instance is a long-lived object and it should not be used in a request/response short-lived fashion. Basically you will want to share the same cluster and session instances across your application.
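One way to share a single Session across an application is a small, lazily initialized holder. The sketch below is only an illustration: `get_session` and `connect_to_cassandra` are hypothetical helper names, not part of the driver, and the contact points are reused from the Cluster example above.

```python
import threading

_lock = threading.Lock()
_session = None  # the one long-lived Session shared by the whole application


def connect_to_cassandra():
    """Build the Cluster once and open a Session (needs the DataStax Python driver)."""
    # Imported lazily so that get_session() itself does not depend on the driver.
    from cassandra.cluster import Cluster

    cluster = Cluster(['10.1.1.3', '10.1.1.4', '10.1.1.5'])
    return cluster.connect()


def get_session(connect=connect_to_cassandra):
    """Return the shared Session, creating it on first use only."""
    global _session
    with _lock:
        if _session is None:
            _session = connect()
        return _session
```

Request handlers would then call `get_session()` rather than creating their own Cluster or Session for each request.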

Prepared statements

Using prepared statements provides multiple benefits. A prepared statement is parsed and prepared on the Cassandra nodes and thus ready for future execution. When binding parameters, only these (and the query id) are sent over the wire. These performance gains will add up when using the same queries (with different parameters) repeatedly.

Remember the rule for using prepared statements is simple: prepare once, bind and execute multiple times.
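As a minimal sketch of "prepare once, bind and execute multiple times" with the Python driver: the `users` table and the `insert_users` helper are assumptions made up for this illustration, and `session` is expected to be an already connected Session.

```python
def insert_users(session, users):
    """Insert (key, email) pairs, preparing the statement only once."""
    # Prepare once: Cassandra parses the query and hands back an identifier.
    insert_stmt = session.prepare(
        "INSERT INTO users (key, email) VALUES (?, ?)")

    # Bind and execute many times: only the id and the values are sent.
    for key, email in users:
        session.execute(insert_stmt, [key, email])
```

Calling `session.prepare` inside the loop would defeat the purpose, since every iteration would pay the preparation cost again.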

Batch operations

As per the documentation, the BATCH statement combines multiple data modification statements (INSERT, UPDATE, DELETE) into a single logical operation that is sent to the server in a single request. Batching multiple operations together also ensures they are executed atomically: either all succeed or none do.
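For illustration, a batch can be written directly in CQL and sent as one request. This is a sketch under assumptions: the `change_email` helper is hypothetical, the tables are the ones used in the profile-update example later in this post, and `%s` is the placeholder style the Python driver uses for non-prepared statements.

```python
# Both modifications travel to the server in a single request.
BATCH_CQL = """
BEGIN BATCH
  UPDATE user_profiles SET email = %s WHERE key = %s;
  INSERT INTO user_track (key, text, date) VALUES (%s, %s, %s);
APPLY BATCH
"""


def change_email(session, key, email, when):
    """Update the profile and record the change as one batch."""
    session.execute(BATCH_CQL, [email, key, key, "email changed", when])
```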

To make the best use of batches, I strongly encourage reading Atomic batches in Cassandra 1.2 and Static columns and batching of conditional updates.

Common questions

There are some specific scenarios in which the above rules might need slight tweaking. Let’s take a look at some of these:

Should I still use one Session per keyspace if I have too many keyspaces?

As mentioned above, a Session instance is responsible for managing the per-node connection pools and as a consequence using too many Session instances might have a major impact on your server resources. For the case where your application interacts with a large number of keyspaces, using a predefined number of Sessions and fully qualified table identifiers in the queries will lead to better resource utilization.
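A sketch of the single-Session approach, assuming two hypothetical keyspaces named `accounts` and `audit`: every table is written as `keyspace.table`, so one Session (and one set of connection pools) serves both.

```python
def load_dashboard(session, user_key):
    """Read from two keyspaces through one Session by qualifying each table."""
    profile = session.execute(
        "SELECT email FROM accounts.user_profiles WHERE key = %s",
        [user_key])
    activity = session.execute(
        "SELECT text, date FROM audit.user_track WHERE key = %s",
        [user_key])
    return profile, activity
```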

Can I combine Batches and PreparedStatements?

Starting with Cassandra 2.0 and the corresponding versions of the C#, Java, and Python drivers, PreparedStatements can be used in batch operations. (Note that before that you could still prepare a complete batch operation, but you needed to know in advance the number of statements it would include.)

from datetime import datetime

from cassandra.query import BatchStatement

# Prepare the statements involved in a profile update
profile_stmt = session.prepare(
    "UPDATE user_profiles SET email=? WHERE key=?")
user_track_stmt = session.prepare(
    "INSERT INTO user_track (key, text, date) VALUES (?, ?, ?)")

# add the prepared statements to a batch
batch = BatchStatement()
batch.add(profile_stmt, [emailAddress, "hendrix"])
batch.add(user_track_stmt,
  ["hendrix", "email changed", datetime.utcnow()])

# execute the batch
session.execute(batch)

Basically you get the benefits of both PreparedStatements and Batch operations.

My table has many columns and I insert data in different combinations

Cassandra’s storage engine is optimized to avoid storing unnecessary empty columns, but when using prepared statements those parameters that are not provided result in null values being passed to Cassandra (and thus tombstones being stored). Currently the only workaround for this scenario is to have a predefined set of prepared statements for the most common insert combinations and to use normal statements for the rarer cases.
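The workaround could be sketched roughly as follows. The `Inserter` class and `insert_cql` helper are hypothetical names invented for this example; the idea is simply to prepare one statement per common column combination and fall back to a normal statement otherwise, so that absent columns are never sent as nulls.

```python
def insert_cql(table, columns, placeholder):
    """Build an INSERT that names only the given columns."""
    marks = ", ".join([placeholder] * len(columns))
    return "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(columns), marks)


class Inserter:
    def __init__(self, session, table, common_combinations):
        self._session = session
        self._table = table
        # Prepare one statement per common column combination, up front.
        self._prepared = {}
        for columns in common_combinations:
            key = tuple(sorted(columns))
            self._prepared[key] = session.prepare(insert_cql(table, key, "?"))

    def insert(self, row):
        """Insert a dict of column name to value, sending only those columns."""
        columns = tuple(sorted(row))
        values = [row[c] for c in columns]
        stmt = self._prepared.get(columns)
        if stmt is None:
            # Rare combination: fall back to a normal (unprepared) statement.
            stmt = insert_cql(self._table, columns, "%s")
        self._session.execute(stmt, values)
```

Sorting the column names gives each combination a single canonical key, so `{"key": ..., "email": ...}` and `{"email": ..., "key": ...}` reuse the same prepared statement.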

These 4 simple rules should cover a lot of common ground when using the DataStax drivers for Cassandra. We have dedicated mailing lists for all our drivers and the team at DataStax is always happy to answer your questions.




Comments

  1. All the examples I have found and looked at for how to add PreparedStatements to BatchStatements are incorrect. They show adding an instance of PreparedStatement and array of variables directly to a BatchStatement. That is actually not supported in driver core 2.0 or 2.1. You must call PreparedStatement.bind(Object[] values) which returns a BoundStatement that you can add to the BatchStatement. BatchStatement takes a Statement which PreparedStatement does not implement. Please update the very confusing and misdirecting examples.

    1. Kevin Gallardo says:

      Hello,

      It seems like you are referring here to the Java driver (judging by the “PreparedStatement.bind(Object[] values)”). And indeed it is not doable with the Java driver. However, the examples provided in the article here refer to the usage of the Python Driver, and the syntax shown here is valid for the Python Driver. Please see http://datastax.github.io/python-driver/api/cassandra/query.html#cassandra.query.BatchStatement

      Thanks.

  2. XiaoboGu says:

    Does rule 2 apply in the mod_python situation? Can multiple threads of Apache/mod_python access the same session object to execute different queries concurrently?

  3. J McC says:

    The article starts with: “When using one of the DataStax drivers for Cassandra, either if it’s C#, Python, or Java, there are 4 simple rules …”

    Do these apply to the C++ driver too?

    1. Kevin Gallardo says:

      Yes, all the DataStax drivers follow the same concepts and those 4 good practices apply to all of the drivers.

  4. Dana Blair says:

    “When binding parameters, only these (and the query id) are sent over the wire.”

    Where can I find more detail about what is actually sent over the wire in this case ?

    1. Kevin Gallardo says:

      Hello,

      Details about the protocol can be found in the Cassandra documentation. In the case of a PreparedStatement, the request will be an EXECUTE message, for which the spec is here: https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v3.spec#L354

  5. Dan BC says:

    I’m trying to run two docker containers each running a python script, and have them both connect and add entries to a single-node cassandra running in another docker.

    This works fine just using the docker '--link' notation between a python and a Cassandra container, using 'cluster = Cluster([socket.gethostbyname('cassandra')])'.

    But if I try to start multiple python containers, which attempt to connect in the same way as above, it results in all or all but one of the python containers failing to connect (inconsistent on which will and which won’t connect).

    Does this arrangement (multiple 'Cluster([socket.gethostbyname('cassandra')])' calls from different scripts in different containers) break rule 1 above? And what exactly is the issue with this? And how best to work around this with dockerized processes?
