4 simple rules when using the DataStax drivers for Cassandra
date: June 5, 2014
When using one of the DataStax drivers for Cassandra, whether it’s C#, Python, or Java, there are 4 simple rules that should clear up the majority of questions and also make your code efficient:
- Use one Cluster instance per (physical) cluster (per application lifetime)
- Use at most one Session per keyspace, or use a single Session and explicitly specify the keyspace in your queries
- If you execute a statement more than once, consider using a PreparedStatement
- You can reduce the number of network roundtrips and also have atomic operations by using Batches
In case you are wondering what’s behind these rules, keep reading. Otherwise, happy coding!
A Cluster instance lets you configure important aspects of how connections and queries are handled. At this level you can configure everything from contact points (the addresses of the nodes contacted initially, before the driver performs node discovery) to the request routing policy and the retry and reconnection policies. Generally, such settings are set once at the application level.
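    from cassandra.cluster import Cluster
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

    cluster = Cluster(
        ['10.1.1.3', '10.1.1.4', '10.1.1.5'],
        compression=True,
        load_balancing_policy=TokenAwarePolicy(
            DCAwareRoundRobinPolicy(local_dc='US_EAST')))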
Please note that, apart from language-specific conventions (here, Python’s), you’ll find the same API across all the DataStax drivers.
While the API of Session is centered around query execution, the Session does some heavy lifting behind the scenes as it manages the per-node connection pools. The Session instance is a long-lived object and it should not be used in a request/response short-lived fashion. Basically you will want to share the same cluster and session instances across your application.
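One way to share a single long-lived session across an application is a small, lazily initialized holder. The following is only a minimal sketch and not part of the driver API: `SessionHolder` and the `connect` factory are hypothetical names, and in a real application `connect` would typically be something like `lambda: Cluster(['10.1.1.3']).connect()`.

```python
import threading


class SessionHolder:
    """Creates one long-lived session on first use and reuses it afterwards.

    `connect` is a hypothetical zero-argument factory supplied by the caller,
    for example `lambda: Cluster(['10.1.1.3']).connect()`.
    """

    def __init__(self, connect):
        self._connect = connect
        self._session = None
        self._lock = threading.Lock()

    def get(self):
        # Double-checked locking: only the first caller pays for connecting,
        # and concurrent first callers do not create two sessions.
        if self._session is None:
            with self._lock:
                if self._session is None:
                    self._session = self._connect()
        return self._session
```

Request handlers would then call `holder.get()` instead of connecting on every request, so the per-node connection pools are created only once.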
Using prepared statements provides multiple benefits. A prepared statement is parsed and prepared on the Cassandra nodes and thus ready for future execution. When binding parameters, only these (and the query id) are sent over the wire. These performance gains will add up when using the same queries (with different parameters) repeatedly.
Remember the rule for using prepared statements is simple: prepare once, bind and execute multiple times.
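The "prepare once" half of that rule can be sketched as a small cache keyed by the query string. This is an illustrative assumption rather than something the drivers provide under this name: `PreparedStatementCache` is hypothetical, and the only thing it relies on is that the session object offers `prepare(query)`, as the driver's Session does.

```python
class PreparedStatementCache:
    """Prepares each distinct query string once and reuses the result."""

    def __init__(self, session):
        # `session` only needs a prepare(query) method, as driver sessions have.
        self._session = session
        self._statements = {}

    def get(self, query):
        statement = self._statements.get(query)
        if statement is None:
            # First time this query is seen: prepare it and remember it.
            statement = self._session.prepare(query)
            self._statements[query] = statement
        return statement
```

A caller would write something like `session.execute(cache.get("UPDATE user_profiles SET email=? WHERE key=?"), [email, "hendrix"])` on every request, and the query would be prepared only on the first call. The sketch is not synchronized, so two threads hitting a new query at the same moment may both prepare it, which is redundant but harmless.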
As per the documentation, the BATCH statement combines multiple data modification statements (INSERT, UPDATE, DELETE) into a single logical operation which is sent to the server in a single request. Also, batching multiple operations together ensures they are executed atomically: either all succeed or none do.
To make the best use of batches, I strongly encourage reading "Atomic batches in Cassandra 1.2" and "Static columns and batching of conditional updates".
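To make the shape of a batch concrete, here is a minimal sketch that wraps individual data modification statements in the CQL `BEGIN BATCH ... APPLY BATCH` form. The helper name `build_batch_cql` is hypothetical; the drivers also offer their own batch objects, so this is only meant to show what ends up in the single request.

```python
def build_batch_cql(statements):
    """Wrap data modification statements in a single CQL BATCH statement."""
    if not statements:
        raise ValueError("a batch needs at least one statement")
    # Normalize each statement so it ends with exactly one semicolon.
    body = "\n".join("  " + s.strip().rstrip(";") + ";" for s in statements)
    return "BEGIN BATCH\n" + body + "\nAPPLY BATCH"
```

The statements passed in should contain only fixed text and bind markers; values are best supplied as parameters at execution time rather than concatenated into the CQL string.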
There are some specific scenarios in which the above rules might need slight tweaking. Let’s take a look at some of these:
Should I still use one Session per keyspace if I have too many keyspaces?
As mentioned above, a Session instance is responsible for managing the per-node connection pools and, as a consequence, using too many Session instances might have a major impact on your server resources. If your application interacts with a large number of keyspaces, using a small, predefined number of Sessions and fully qualified table identifiers in the queries will lead to better resource utilization.
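A fully qualified table identifier is simply `keyspace.table`. As a minimal sketch, assuming plain unquoted identifiers, a helper can build such names so that one Session serves many keyspaces. The names `qualified` and `select_profile_query` and the keyspace `tenant_a` are hypothetical.

```python
import re

# Plain unquoted identifiers: a letter followed by letters, digits, underscores.
_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def qualified(keyspace, table):
    """Return a fully qualified table identifier such as tenant_a.user_profiles."""
    for name in (keyspace, table):
        # Reject anything else so arbitrary text never ends up in a query.
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError("not a simple identifier: %r" % (name,))
    return "%s.%s" % (keyspace, table)


def select_profile_query(keyspace):
    """Build a query that works on any Session, whatever its current keyspace."""
    return "SELECT * FROM %s WHERE key=?" % qualified(keyspace, "user_profiles")
```

Each distinct query string built this way would still be prepared once per keyspace, so the statements can be cached and reused like any other prepared statement.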
Can I combine Batches and PreparedStatements?
Starting with Cassandra 2.0 and the corresponding versions of the C#, Java, and Python drivers, PreparedStatements can be used in batch operations (N.B. before that you could still prepare a complete batch operation, but you’d need to know a priori the number of statements that would be included).
    from datetime import datetime
    from cassandra.query import BatchStatement

    # prepare the statements involved in a profile update
    # (session and emailAddress are assumed to be defined elsewhere)
    profile_stmt = session.prepare(
        "UPDATE user_profiles SET email=? WHERE key=?")
    user_track_stmt = session.prepare(
        "INSERT INTO user_track (key, text, date) VALUES (?, ?, ?)")

    # add the prepared statements to a batch
    batch = BatchStatement()
    batch.add(profile_stmt, [emailAddress, "hendrix"])
    batch.add(user_track_stmt, ["hendrix", "email changed", datetime.utcnow()])

    # execute the batch
    session.execute(batch)
Basically you get the benefits of both PreparedStatements and Batch operations.
My table has many columns and I insert data in different combinations
Cassandra’s storage engine is optimized to avoid storing unnecessary empty columns, but when using prepared statements, parameters that are not provided result in null values being passed to Cassandra (and thus tombstones being stored). Currently the only workaround for this scenario is to have a predefined set of prepared statements for the most common insert combinations and to use normal statements for the rarer cases.
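The workaround can be sketched roughly as follows. This is an assumption-laden illustration, not driver functionality: `InsertStatements` is a hypothetical helper that prepares one INSERT per common column combination up front and falls back to a non-prepared statement otherwise. It relies only on the session offering `prepare(query)` and `execute(statement, parameters)`. Note that in the Python driver non-prepared statements use `%s` placeholders, while prepared statements use `?`.

```python
def _insert_cql(table, columns, marker):
    """Build an INSERT for exactly the given columns, using the given marker."""
    names = ", ".join(columns)
    markers = ", ".join(marker for _ in columns)
    return f"INSERT INTO {table} ({names}) VALUES ({markers})"


class InsertStatements:
    """Prepared INSERTs for common column sets, plain statements for the rest."""

    def __init__(self, session, table, common_combinations):
        self._session = session
        self._table = table
        # Prepare once, up front, for each common combination of columns.
        self._prepared = {
            frozenset(cols): session.prepare(_insert_cql(table, sorted(cols), "?"))
            for cols in common_combinations
        }

    def insert(self, row):
        # Drop columns without a value so no null (and no tombstone) is written.
        present = {col: val for col, val in row.items() if val is not None}
        if not present:
            raise ValueError("row has no values to insert")
        columns = sorted(present)
        values = [present[col] for col in columns]
        statement = self._prepared.get(frozenset(columns))
        if statement is None:
            # Rare combination: fall back to a non-prepared statement.
            statement = _insert_cql(self._table, columns, "%s")
        return self._session.execute(statement, values)
```

With `common_combinations` such as `[("key", "text", "date"), ("key", "text")]`, a row whose `date` is `None` would be routed to the two-column prepared statement instead of binding a null.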
These 4 simple rules should cover a lot of common ground when using the DataStax drivers for Cassandra. We have dedicated mailing lists for all our drivers and the team at DataStax is always happy to answer your questions.