DataStax Developer Blog

DataStax Python Driver is Now Final

By Michael Figuiere - January 29, 2014

We’re pleased to announce that the DataStax Python Driver 1.0.0 for Cassandra is now final. Since we open sourced it in the summer of 2013, we’ve been working closely with the community to improve its API, stability and performance. It is now considered production ready and is covered by enterprise support for DataStax customers.

This new release is part of our plan to provide modern, enterprise-class open source Cassandra drivers for major programming languages. We hope that the DataStax drivers simplify the work of developers and DBAs when designing applications by bringing a common architecture and a similar interface across languages.

At a glance, the Python driver may look similar to drivers for relational databases, but it comes with some unique features:

  • Configurable, per-node connection pools that can grow and shrink automatically to accommodate changing loads
  • Automatic node discovery and transparent handling of node additions and removals
  • Synchronous and asynchronous query execution
  • Pluggable request routing (including token-aware routing)
  • Query tracing and metrics providing more insight into query execution, latency, and errors
  • SSL and authentication support
  • Thorough logging based on Python’s standard logging module
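Because the driver logs through Python’s standard logging module, its output can be tuned like that of any other library. The following is a minimal sketch, assuming the driver’s loggers sit under the `cassandra` package name (the usual `logging.getLogger(__name__)` convention). The helper name `configure_driver_logging`, the handler and the format string are illustrative choices, not part of the driver.

```python
import logging


def configure_driver_logging(level=logging.INFO):
    """Send driver log records to stderr at the given level.

    Assumption: the driver's loggers are children of the "cassandra" logger.
    """
    logger = logging.getLogger("cassandra")
    logger.setLevel(level)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


# Verbose output while debugging; switch to logging.INFO or higher in production.
driver_log = configure_driver_logging(logging.DEBUG)
```

Child loggers such as `cassandra.cluster` inherit this level unless they are configured separately.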

This version of the Python driver runs on Python 2.6 and 2.7. As the driver can be run without any C extensions, PyPy is well supported (and has great performance). Support for Python 3 is planned.

You can install or upgrade to the 1.0 version by using pip, download it from the DataStax downloads page, or check out the code on GitHub. There is also a quick getting-started guide.

Basic Architecture

At its core, the new Python driver uses an event loop to handle communication with Cassandra. This event loop can use either the asyncore module from the standard library or libev for improved performance.

At a higher level, the driver maintains a small connection pool for each Cassandra node (with special treatment for multi-datacenter environments). When a query is executed, a list of nodes to attempt the query against is generated. If the query fails against the first node in the list, the second node may be used, and so on. When sending a query to a node, the driver selects the least-utilized connection from that node’s connection pool and issues the query.
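The failover and connection-selection behaviour described above can be modelled in a few lines. This is a toy sketch using only the standard library, not the driver’s actual implementation: the `Connection`, `Node` and `execute` names, the `in_flight` counter and the addresses are all hypothetical stand-ins chosen for illustration.

```python
class Connection:
    """Toy stand-in for a driver connection; tracks in-flight requests."""

    def __init__(self, name):
        self.name = name
        self.in_flight = 0


class Node:
    """Toy stand-in for a Cassandra node with its own small connection pool."""

    def __init__(self, address, pool_size=2, up=True):
        self.address = address
        self.up = up
        self.pool = [Connection("%s#%d" % (address, i)) for i in range(pool_size)]

    def least_utilized(self):
        # Pick the connection with the fewest outstanding requests.
        return min(self.pool, key=lambda c: c.in_flight)

    def send(self, query):
        if not self.up:
            raise ConnectionError("node %s is unreachable" % self.address)
        conn = self.least_utilized()
        conn.in_flight += 1
        try:
            return "%s answered %r via %s" % (self.address, query, conn.name)
        finally:
            conn.in_flight -= 1  # the request is no longer outstanding


def execute(query_plan, query):
    """Try each node in the plan in order; return the first successful result."""
    errors = {}
    for node in query_plan:
        try:
            return node.send(query)
        except ConnectionError as exc:
            errors[node.address] = exc  # remember the failure, try the next node
    raise RuntimeError("all nodes in the query plan failed: %r" % errors)


plan = [Node("10.0.0.1", up=False), Node("10.0.0.2"), Node("10.0.0.3")]
plan[1].pool[0].in_flight = 3  # pretend the first connection is busy
result = execute(plan, "SELECT * FROM users")
```

Here the first node is down, so the query falls through to the second node, which uses its idle second connection rather than the busy first one.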

From a user-API perspective, you can block synchronously until the query completes, or you can execute the query asynchronously and then either attach callbacks or block for the final result at any later point.
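The three styles can be sketched as small helpers around a session object. This assumes the session exposes `execute()` and `execute_async()`, and that the returned future offers `add_callbacks()` and `result()`, as in the driver’s public API at the time of writing. The helper names, the query text and the callbacks are hypothetical.

```python
def fetch_blocking(session, query):
    """Synchronous style: execute() returns only once the rows are available."""
    return list(session.execute(query))


def fetch_with_callbacks(session, query, on_rows, on_error):
    """Asynchronous style: execute_async() returns a future immediately."""
    future = session.execute_async(query)
    # on_rows receives the result rows; on_error receives the exception.
    future.add_callbacks(on_rows, on_error)
    return future


def fetch_then_block(session, query):
    """Start the query, do other work, then block on the future's result()."""
    future = session.execute_async(query)
    # ... other work could happen here while the query is in flight ...
    return list(future.result())
```

With a running cluster, the session would typically come from something like `Cluster(['127.0.0.1']).connect()` after `from cassandra.cluster import Cluster`; the contact point shown is a placeholder.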

Differences from pycassa

pycassa is the most popular Thrift-based driver for Cassandra. If you’re using pycassa today, here are some of the notable differences between it and the new Python driver:

  • CQL3 is supported (exclusively).
  • Multiple queries can be executed concurrently on the same connection, allowing the driver to achieve higher throughput with fewer open connections.
  • Nodes in the cluster are automatically discovered and the driver transparently handles nodes being added or removed. pycassa simply uses a single non-changing list of nodes.
  • Connection pooling is more configurable and connection pool sizes are adjusted automatically. This leads to better out-of-the-box performance and better support for multi-datacenter environments.
  • Through token-aware routing, the driver can select a more efficient coordinator node for queries, increasing throughput and decreasing latency for many workloads.
  • Query tracing is supported.
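To illustrate the idea behind token-aware routing from the list above, here is a toy sketch of a token ring, not the driver’s routing code. It hashes a partition key to a token, finds the node that owns that token, and places it first in the query plan. The `TokenRing` class, the node addresses and the token values are hypothetical, and MD5 stands in for the cluster’s partitioner hash purely to keep the example within the standard library. A real policy would also account for replicas and datacenters.

```python
import bisect
import hashlib


def token_for(partition_key):
    """Map a partition key to a 64-bit integer token (MD5, illustrative only)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TokenRing:
    """Toy token ring: each node owns the range that ends at its token."""

    def __init__(self, node_tokens):
        # node_tokens maps a node address to the token assigned to that node.
        self._ring = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]

    def owner(self, partition_key):
        token = token_for(partition_key)
        # First node whose token is >= the key's token; wrap around at the end.
        index = bisect.bisect_left(self._tokens, token) % len(self._ring)
        return self._ring[index][1]

    def query_plan(self, partition_key, all_nodes):
        """Put the owning node first, then the remaining nodes as fallbacks."""
        first = self.owner(partition_key)
        return [first] + [node for node in all_nodes if node != first]


nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
ring = TokenRing({"10.0.0.1": 2**62, "10.0.0.2": 2**63, "10.0.0.3": 3 * 2**62})
plan = ring.query_plan("user:42", nodes)
```

Sending the query to the owning node first saves a network hop, because that node can answer from its own data instead of forwarding the request to another node.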

What Comes Next?

Here’s what we will be working on next for the Python driver:

Cassandra 2.0 Support

Cassandra 2.0 added several interesting features for developers, including automatic query paging for large result sets, lightweight transaction support, and the ability to execute prepared statements in batches. The 1.0 release of the Python driver does not support these yet, but the 2.0 release will add support for them.

gevent

gevent, in its own words, is “a coroutine-based Python networking library that uses greenlet to provide a high-level synchronous API on top of the libev event loop.” The gevent library is very popular within the Python community, and its architecture matches that of the Python driver well. Although the driver does not currently work with gevent, first-class support for it should come soon.

Object Mapper (cqlengine)

cqlengine is an excellent object mapping library for CQL3. The cqlengine library currently depends on another driver to connect to Cassandra, but we will be working closely with the cqlengine developers to allow it to use the new Python driver.

Python 3 Support

Many pycassa users have requested Python 3 support. We plan to add Python 3 support within the same code base by using the six library.

Performance Improvements

For the 1.0 version of the Python driver, we have focused primarily on stability, so there is still plenty of room to improve performance. This may include reducing lock contention as well as using a C library or extension for message encoding and decoding.


