DataStax Developer Blog

Python Driver 2.1.4 Released

By Adam Holmberg - January 26, 2015

Version 2.1.4 of the DataStax Python driver for Apache Cassandra is released! A full list of changes can be found in the CHANGELOG. In addition to several bug fixes improving the robustness and stability of the driver, there are a number of interesting new features and subtle API changes.

The headlines:

  • Support for nested collections in Cassandra 2.1.3+; a different Python type is returned for map data
  • New optional, configurable heartbeat mechanism for probing and keeping idle connections open
  • New generic SASL authenticator for using Kerberos via pure-sasl package
  • Clients can now configure how schema agreement takes place under all circumstances, and call for synchronous schema updates on demand
  • Removed implicit scaling for numeric timestamps

I will expound on these below.

Support for Nested, Frozen Collections in Cassandra

Beginning in version 2.1.3, Cassandra will extend frozen to collections, meaning that frozen collections can be nested arbitrarily in parameterized types, including set types, and even keys of maps. This means that the driver needs to be able to construct maps and sets of types that are not necessarily hashable.

The challenge in Python is that hashability is a requirement for built-in map keys and set contents, hashing a collection requires immutability, and there is no natural concept of frozen for these built-in collections. This was addressed some time ago for sets, using ‘sortedset’ in order to accommodate sets of User Defined Types. However, it was not required for mappings until nested collections appeared on the horizon.
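
The hashability constraint is easy to reproduce in plain Python. The snippet below only illustrates the language behavior described above; it does not use the driver, and `try_as_key` is a helper name made up for this illustration.

```python
def try_as_key(candidate):
    """Return True if `candidate` can be used as a built-in dict key."""
    try:
        {candidate: "value"}  # dict construction hashes the key
        return True
    except TypeError:
        return False


list_key_ok = try_as_key([1, 2])              # lists are mutable and unhashable
tuple_key_ok = try_as_key((1, 2))             # tuples of hashable items are fine
frozenset_key_ok = try_as_key(frozenset([1, 2]))
nested_key_ok = try_as_key(([1, 2], 3))       # a tuple holding a list is still unhashable

print("list: %s, tuple: %s, frozenset: %s, tuple containing list: %s"
      % (list_key_ok, tuple_key_ok, frozenset_key_ok, nested_key_ok))
```

Python has `frozenset` for sets, but there is no built-in frozen counterpart for `list` or `dict` that would cover every collection type Cassandra can nest.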

Prior to this release, the driver would return maps in OrderedDict collections. Even though OrderedDict uses a linked list to maintain insertion order, it still relies on dict internally for lookups (meaning no mutable types could be used as keys). Now, the driver returns map results in an OrderedMap, a pure Python implementation of the collections.Mapping read-only map abstraction.

Similar to OrderedDict, the implementation of OrderedMap uses a list to maintain the order of elements and a dict for constant-time lookup. However, in this case the dict is keyed by a serialized representation of each key’s value rather than by the key itself. As a consequence, read access on an OrderedMap is identical to other map types. The main difference is that an OrderedMap is constructed in whole and does not support item-level updates (i.e. __setitem__ or __delitem__). Also note that initializing one with mutable items as keys requires passing a list of key-value pairs.
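
To make the idea concrete, here is a minimal sketch of a read-only map built the same way: a list for order plus a dict keyed by a serialized form of each key. This is not the driver's OrderedMap code. The class name `SerializedKeyMap` is hypothetical, and `pickle` is used purely as a stand-in for whatever serialization the driver applies internally.

```python
import pickle

try:
    from collections.abc import Mapping  # Python 3
except ImportError:
    from collections import Mapping      # Python 2


class SerializedKeyMap(Mapping):
    """Illustrative read-only map whose keys need not be hashable."""

    def __init__(self, pairs):
        self._items = []  # (key, value) pairs in insertion order
        self._index = {}  # serialized key -> position in self._items
        for key, value in pairs:
            flat = self._serialize(key)
            if flat in self._index:
                # A repeated key replaces the earlier value in place.
                self._items[self._index[flat]] = (key, value)
            else:
                self._index[flat] = len(self._items)
                self._items.append((key, value))

    @staticmethod
    def _serialize(key):
        # Assumption: pickle stands in for the driver's own encoding.
        return pickle.dumps(key, protocol=2)

    def __getitem__(self, key):
        try:
            return self._items[self._index[self._serialize(key)]][1]
        except KeyError:
            raise KeyError(key)

    def __iter__(self):
        return (key for key, _ in self._items)

    def __len__(self):
        return len(self._items)


sketch_map = SerializedKeyMap([([1, 2], "one two"), ([3, 4], "three four")])
looked_up = sketch_map[[1, 2]]   # lookup by an unhashable list key
keys_in_order = list(sketch_map)
```

Because only the read-only Mapping methods are defined, an assignment such as `sketch_map[[5, 6]] = "five six"` raises TypeError, mirroring the constructed-in-whole behavior described above.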

For example, using this contrived table:

    CREATE TABLE nested_frozen (
        key int PRIMARY KEY,
        value map<frozen<list<int>>, text>
    );

A sample script:

from cassandra.cluster import Cluster
from cassandra.util import OrderedMap
import six

cluster = Cluster(protocol_version=3)
# connect() also accepts the name of the keyspace that holds nested_frozen
session = cluster.connect()

key = 0
# Mutable (list) keys require the list-of-pairs form of initialization
map_list_text = OrderedMap([([1, 2], "one two"), ([3, 4], "three four")])
session.execute("INSERT INTO nested_frozen (key, value) VALUES (%s, %s)", (key, map_list_text))

od_value = session.execute("SELECT value FROM nested_frozen WHERE key=%s", (key,))[0].value

print(od_value)
# {[1, 2]: one two, [3, 4]: three four}

for k, v in six.iteritems(od_value):
    print("%s => %s" % (k, v))
# [1, 2] => one two
# [3, 4] => three four

Note that protocol_version=3 is required when using nested types, because inner types are always encoded in the v3 format.

Idle Connection Heartbeat

The new version of the DataStax Python Driver adds an idle connection heartbeat mechanism. This was introduced to help long-lived clients in environments in which network devices (e.g. firewalls, load balancers) close idle connections, regardless of TCP keepalive settings. It also has the added benefit of proactively discovering broken connections, in the absence of request traffic (or protocol events, in the case of the control connection).

At every idle_heartbeat_interval, which is configured when the client initializes its Cluster, the driver sends an OPTIONS message on each idle connection. If no idle connections are present, no messages are sent.

The idle heartbeat is enabled by default, with an interval of 30 seconds. The mechanism can be disabled by setting the interval to zero.
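
As a rough mental model of this behavior (not the driver's actual implementation), each heartbeat tick can be thought of as probing only the connections that saw no traffic since the previous tick. `FakeConnection` and `heartbeat_tick` are hypothetical names invented for this sketch, and the idle-flag bookkeeping is an assumption. In the real driver, the only thing a client sets is idle_heartbeat_interval on the Cluster.

```python
class FakeConnection(object):
    """Hypothetical stand-in for a driver connection."""

    def __init__(self, name):
        self.name = name
        self.is_idle = True   # no traffic observed yet

    def send_request(self):
        self.is_idle = False  # any request marks the connection as active


def heartbeat_tick(connections):
    """Return the names of connections that would receive a probe."""
    probed = []
    for conn in connections:
        if conn.is_idle:
            probed.append(conn.name)  # idle since the last tick: probe it
        else:
            conn.is_idle = True       # busy this interval: re-arm for next tick
    return probed


busy, quiet = FakeConnection("busy"), FakeConnection("quiet")
busy.send_request()
first_tick = heartbeat_tick([busy, quiet])   # only the quiet connection is probed
second_tick = heartbeat_tick([busy, quiet])  # no traffic since: both are probed
```

In this model a connection carrying regular request traffic never receives a probe, which matches the statement that no messages are sent when no connections are idle.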

SASL Authenticator Supporting Kerberos Authentication

The new driver includes a new, more generic SaslAuthProvider (utilizing the third-party package pure-sasl), making it possible to create auth providers for GSSAPI/Kerberos, as well as other SASL mechanisms. This makes it easier than ever to integrate Python client applications with custom authenticators on the Cassandra server.

The API doc for the SaslAuthProvider shows a simple example using GSSAPI, and the test shows an example using a plaintext mechanism in place of the specialized built-in provider.

Controlling Wait for Schema Agreement

Since version 1.0, the driver has provided the ability to set max_schema_agreement_wait, a timeout for waiting for schema agreement before refreshing schema metadata on startup, or after change events. However, clients in some situations are not concerned with concurrent schema modifications: they may not care that schema changes are occurring, or may know that the schema of interest will not be changing.

In the latest driver, setting max_schema_agreement_wait=0 bypasses the agreement check completely, and builds the schema model on the current view of schema. This saves at least a round-trip to all nodes during startup, and following schema change events.

Also new in this version is a function to synchronously refresh schema, with the same configuration for schema agreement wait. The method cassandra.cluster.Cluster.refresh_schema allows a client to refresh all or part of the schema metadata, using an optional wait time that overrides the setting the cluster was initialized with.

from cassandra.cluster import Cluster

# Never wait for agreement on general events - client knows when schema of interest will change
cluster = Cluster(max_schema_agreement_wait=0)
session = cluster.connect()

# Now wait for agreement and synchronize model at some discrete point in the app
try:
    cluster.refresh_schema(keyspace='my_keyspace', max_schema_agreement_wait=5)
except Exception:
    pass  # Problem refreshing

Removed Implicit Timestamp Scaling

Cassandra stores timestamps as milliseconds from unix epoch. Previous versions of the driver would implicitly scale any numeric type, assuming it was converting seconds to milliseconds. However, this is incongruent with other native type mappings, and could lead to confusion. Since version 1.0 the driver has emitted a warning when serializing with this implicit scaling.

As of version 2.1.4, this scaling is removed. Users must now either use datetime types or explicitly scale numeric timestamps to milliseconds before passing them to Cassandra. Failing to do so produces small timestamps in 1970, because seconds going in are treated as milliseconds coming out:

import datetime
import time

now_time_seconds = time.time()
now_datetime = datetime.datetime.utcfromtimestamp(now_time_seconds)
# key 0: seconds passed through unscaled, so they are read back as milliseconds
session.execute("INSERT INTO times (key, time) VALUES (%s, %s)", (0, int(now_time_seconds)))
# key 1: seconds explicitly scaled to milliseconds
session.execute("INSERT INTO times (key, time) VALUES (%s, %s)", (1, int(now_time_seconds * 1e3)))
# key 2: a datetime, which the driver converts itself
session.execute("INSERT INTO times (key, time) VALUES (%s, %s)", (2, now_datetime))
rows = session.execute("SELECT key, time FROM times WHERE key IN (0, 1, 2)")
print('\n'.join("%d: %s" % (r.key, r.time) for r in rows))

# 0: 1970-01-17 11:04:56.764000
# 1: 2015-01-26 18:26:04.859000
# 2: 2015-01-26 18:26:04.859000
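
The arithmetic behind that output can be checked without a cluster. The helpers below are hypothetical names written for this illustration and are not part of the driver. They only show what explicit scaling looks like and why an unscaled value lands in January 1970. The sample value 1422296764 is the whole-second timestamp that corresponds to the output above.

```python
import calendar
import datetime

EPOCH = datetime.datetime(1970, 1, 1)


def seconds_to_millis(seconds):
    """Scale a numeric unix timestamp in seconds to integer milliseconds."""
    return int(round(seconds * 1000))


def datetime_to_millis(dt):
    """Convert a naive UTC datetime to integer milliseconds since the epoch."""
    return calendar.timegm(dt.utctimetuple()) * 1000 + dt.microsecond // 1000


def millis_to_datetime(millis):
    """Interpret an integer as milliseconds since the epoch (naive UTC)."""
    return EPOCH + datetime.timedelta(milliseconds=millis)


seconds = 1422296764
unscaled = millis_to_datetime(seconds)                   # seconds misread as milliseconds
scaled = millis_to_datetime(seconds_to_millis(seconds))  # explicit scaling

print(unscaled)  # 1970-01-17 11:04:56.764000
print(scaled)    # 2015-01-26 18:26:04
```

Passing a datetime remains the least error-prone option, since no unit has to be remembered at the call site.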

Wrapping Up

The new version of this driver contains several interesting features making it possible to integrate the latest capabilities of Cassandra. Thanks to all who provided contributions and bug reports. The continued involvement of the community is appreciated.
