DataStax Developer Blog

What’s new in Cassandra 0.7: Live schema updates

August 9, 2010 | 3 Comments

(This is a guest post by Gary Dusbabek, who works on Cassandra full-time for Rackspace. You can contact him at gdusbabek@gmail.com or @gdusbabek on Twitter.)

Cassandra has always been schemaless within a ColumnFamily, in the sense that columns may be created at will simply by using them in a row. ColumnFamilies themselves, however, and Keyspaces, have to be explicitly defined before use (so Cassandra knows how to index the columns within their rows). “Live schema updates” refers to the ability to create, rename, and remove both Keyspaces and ColumnFamilies in a live cluster; the feature was added early in the 0.7 development cycle in CASSANDRA-44. Prior to 0.7, changing the schema meant editing the configuration file on every node and then manually executing a rolling restart of the cluster, which left plenty of room for human error.

This post covers some changes this feature required under the hood and explains what you can do to be ready if you are upgrading from 0.6. It is an expansion of the wiki article I wrote when live schema updates first appeared in the trunk.

Starting up, or “Dude, where’s my schema?”

We now store schema in the system keyspace using two column families. The first (Schema) stores the keyspace and column family definitions, while the second (Migrations) stores individual keyspace changes over time. All migrations and schema definitions are keyed by a time-based UUID. Your configuration file (storage-conf.xml if you are upgrading, or cassandra.yaml* on a new 0.7 node) may still contain keyspace definitions, but Cassandra ignores them during startup. Instead, Cassandra looks up the latest schema version UUID it has stored. If it finds nothing, it loads nothing and logs a warning:


Couldn't detect any schema definitions in local storage.

If a schema does exist, Cassandra loads the correct keyspace definitions from local storage and applies them using the same approach that previous versions used to load keyspaces from storage-conf.xml.
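The startup lookup described above can be sketched roughly as follows. This is only an illustration in Python, not Cassandra's actual code: the function names and the dictionary standing in for the Schema column family are assumptions made for the example. It relies on the fact that a time-based (version 1) UUID carries a timestamp, so the newest stored version is the one with the largest timestamp.

```python
import logging
import uuid

log = logging.getLogger("schema")


def latest_schema_version(stored_versions):
    """Return the newest time-based UUID, or None if nothing is stored."""
    if not stored_versions:
        log.warning("Couldn't detect any schema definitions in local storage.")
        return None
    # Version 1 UUIDs embed a 60-bit timestamp, which Python exposes as .time
    return max(stored_versions, key=lambda v: v.time)


def load_schema(schema_by_version):
    """Pick the definitions stored under the latest version.

    schema_by_version is a hypothetical stand-in for the Schema column
    family: a dict mapping uuid.UUID -> keyspace definitions.
    """
    version = latest_schema_version(list(schema_by_version))
    if version is None:
        return None, {}  # nothing stored: load nothing
    return version, schema_by_version[version]
```

Note that whatever is in the configuration file never enters the picture here; only what is already in local storage counts.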

At the same time, the node incorporates the version UUID from its schema into the gossip digests it sends to other nodes. If this node does not have the latest schema definitions when it starts up (as a result of a network partition, restart or bootstrapping a new node), a version mismatch is detected by the gossiper and the definition promulgation mechanism described next is invoked.

Promulgation

Definition promulgation consists of two asynchronous phases: announce and push. Announce is a way for node A to declare to node B “this is the schema version I have.” If the versions are equal, the message is ignored. If A’s version is older than B’s (Case 1), B responds with a push containing all the migrations from B that A doesn’t have. If A’s version is newer than B’s (Case 2), B responds with an announce to A (this functions as a request for updates), after which A responds with a push to B.
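To make the two cases concrete, here is a toy model of the exchange in Python. It is a sketch under simplifying assumptions, not the real implementation: the `Node` class is hypothetical, plain integers stand in for time-based UUIDs, and messages are delivered as direct, synchronous method calls, whereas the real phases are asynchronous.

```python
class Node:
    def __init__(self, name, migrations=()):
        self.name = name
        # Ordered (version, change) pairs; the integer version is a
        # sortable stand-in for the time-based UUID.
        self.migrations = list(migrations)

    @property
    def version(self):
        return self.migrations[-1][0] if self.migrations else 0

    def announce(self, other):
        """Declare to `other`: this is the schema version I have."""
        other.on_announce(self)

    def on_announce(self, sender):
        if sender.version == self.version:
            return  # versions are equal: ignore the message
        if sender.version < self.version:
            # Case 1: the sender is behind, so push what it is missing
            self.push(sender)
        else:
            # Case 2: we are behind, so announce back to request updates
            self.announce(sender)

    def push(self, target):
        missing = [m for m in self.migrations if m[0] > target.version]
        target.on_push(missing)

    def on_push(self, migrations):
        # Apply migrations in order, skipping any we already have
        for m in migrations:
            if m[0] > self.version:
                self.migrations.append(m)
```

For example, if A holds only migration 1 and B holds migrations 1 through 3, `a.announce(b)` ends with B pushing migrations 2 and 3 to A (Case 1), while `b.announce(a)` produces the announce, announce, push sequence of Case 2.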

[Figure: Cassandra Migrations]

Schema updates can also be pushed from a client over Thrift. When this happens, the node that received the update holds the newest schema, so gossip promulgation follows the announce-announce-push sequence (Case 2 above).

These schema changes typically take seconds to finish. Time to complete will scale linearly with the size of your cluster.

IMPORTANT: since schema changes need to be applied and promulgated serially, operators shouldn’t issue schema changes from multiple nodes simultaneously. If two changes make their way across the cluster at the same time they will collide and leave the cluster in an inconsistent state. Cassandra does a few things to guard against this, but an ounce of prevention goes a long way. Cluster operators should adopt the practice of issuing schema changes from a single node and always use the same node, preferably a seed.

Initial schema loading

We have made it convenient for you to import the schema formerly defined in storage-conf.xml (0.6) or cassandra.yaml (0.7). Use JMX to call StorageService.loadSchemaFromYaml(), or perform the same operation from the command line using bin/schematool. This manual operation can be performed only once; it will fail if you try to load the schema again. If you are upgrading from 0.6, make sure you have already run the storage-conf.xml to cassandra.yaml converter. One caveat of this process is that the number of live nodes in your cluster must be greater than or equal to the highest replication factor among all your keyspaces.
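The live-node caveat is easy to check by hand before you trigger the load. As a minimal sketch (the function name and the dictionary of replication factors are made up for illustration; nothing like this ships with Cassandra):

```python
def can_load_schema(live_nodes, replication_factors):
    """Check the live-node caveat before a one-time schema load.

    replication_factors is a hypothetical dict mapping
    keyspace name -> replication factor.
    """
    if not replication_factors:
        return True  # no keyspaces defined, nothing to satisfy
    return live_nodes >= max(replication_factors.values())
```

With keyspaces at replication factors 1 and 3, for instance, the check passes with three live nodes and fails with two.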

Loading schema via JMX must be done on exactly one node in your cluster (preferably a seed node). Changes will be promulgated from that node to the rest of the cluster. This capability will be deprecated in the next version of Cassandra (0.7+1) and will be completely removed in the version after that (0.7+2).

Further schema modifications

Once your schema is saved in the system keyspace, any further schema modifications have to be made via the Thrift API. There are six methods that accomplish this:


system_add_column_family()
system_drop_column_family()
system_rename_column_family()
system_add_keyspace()
system_drop_keyspace()
system_rename_keyspace()

These methods do exactly what their names imply. Some things to note:

  • The drop and rename methods create a snapshot of your existing data before doing their work.
  • The rename methods block while filenames are changed.
  • All methods go through a bit of validation to check for sanity.
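The notes above can be illustrated with a small in-memory model. To be clear, this is not the Thrift client and does not reproduce the real method signatures: `ToySchema` and its methods are hypothetical, and its "snapshot" is just a copy of a dictionary rather than a snapshot of data files. It only shows the order of operations: validate first, snapshot before a drop or rename, then apply the change.

```python
import copy


class ToySchema:
    """In-memory stand-in for a cluster schema; NOT the Thrift API."""

    def __init__(self):
        self.keyspaces = {}  # keyspace name -> {column family name: rows}
        self.snapshots = []  # copies taken before each drop or rename

    def _snapshot(self):
        self.snapshots.append(copy.deepcopy(self.keyspaces))

    def _require_keyspace(self, ks):
        if ks not in self.keyspaces:
            raise ValueError(f"unknown keyspace {ks!r}")

    def add_keyspace(self, ks):
        if ks in self.keyspaces:
            raise ValueError(f"keyspace {ks!r} already exists")
        self.keyspaces[ks] = {}

    def add_column_family(self, ks, cf):
        self._require_keyspace(ks)
        if cf in self.keyspaces[ks]:
            raise ValueError(f"column family {cf!r} already exists")
        self.keyspaces[ks][cf] = {}

    def drop_column_family(self, ks, cf):
        # Sanity checks come first, so a bad request changes nothing
        self._require_keyspace(ks)
        if cf not in self.keyspaces[ks]:
            raise ValueError(f"unknown column family {cf!r}")
        self._snapshot()  # keep a copy of the data before dropping
        del self.keyspaces[ks][cf]

    def rename_keyspace(self, old, new):
        self._require_keyspace(old)
        if new in self.keyspaces:
            raise ValueError(f"keyspace {new!r} already exists")
        self._snapshot()  # keep a copy of the data before renaming
        self.keyspaces[new] = self.keyspaces.pop(old)
```

Because validation happens before the snapshot, a rejected request leaves both the schema and the list of snapshots untouched.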

Conclusion

Live schema changes give you the ability to make low-level changes to your cluster without any kind of restart. We plan on taking this feature further to allow more fine-grained schema changes in the future.

Are you interested in learning more about Cassandra? We invite users and developers to participate in the Cassandra Summit in San Francisco on August 10th co-sponsored by Rackspace and Riptano.

* What is this YAML of which you speak? The Cassandra configuration file was changed from XML to YAML between 0.6 and 0.7. Don’t worry: we provide a converter that will turn your old storage-conf.xml into a new cassandra.yaml. You will need to run it before attempting to import your old schema.



Comments

  1. Salman says:

    Happy to see this functionality, it will go a long way for those wanting to create separate keyspaces for clients (in SaaS environments).

  2. Ed says:

    I have a simple question that I can’t seem to locate any answer to. In 0.7- what’s the best way (any way really) to export the schema from one instance of cassandra in order to import into another instance? For example, I have a development environment and need an easy way for developers to update their local cassandra installs to be consistent with the schema in the dev environment. Up to this point we’ve just been tracking our changes incrementally, but I’d like a simpler way to do it.

  3. Ed – if you happen to be in a Ruby environment you can take a look at a new gem that we just released called active_column. (http://github.com/carbonfive/active_column)

    It offers, among other things, ActiveRecord-like database migrations. Included in this are two rake tasks:
    rake ks:schema:dump
    rake ks:schema:load

    You can use them to export and import Cassandra schemas. Please feel free to ping me if you want more info. mike at carbonfive dot com.

    NOTE: the current release of active_column on rubygems is 0.1.1. The rake tasks I mentioned above are available in the next version, so at the moment you will have to use the github version. I will get 0.2 released on rubygems soon.
