What’s new in Cassandra 1.1: Flexible data file placement
Apache Cassandra is designed from the ground up to work well on spinning disks, but it can also leverage the high IOPS of SSDs. (Don’t miss the video and slides about using Cassandra with SSDs from our solutions architect.)
Suppose you have two column families under the same keyspace, “App”: one named “Logs”, whose data is written once and read infrequently, and one named “UserData”, whose data is accessed frequently. You may want to put the frequently accessed column family on an SSD to boost IO performance. At first, it looks like you can achieve this by mounting the SSD on the appropriate data directory, but then you realize that Cassandra stores all column family data files in a single directory for their keyspace, as shown below:
/var/lib/cassandra/data/App/Logs-hc-1-Data.db
/var/lib/cassandra/data/App/Logs-hc-1-Index.db
...
/var/lib/cassandra/data/App/UserData-hc-1-Data.db
/var/lib/cassandra/data/App/UserData-hc-1-Index.db
...
Until now, you could only use a separate disk per keyspace, not per column family.
More control over data files
In version 1.1, CASSANDRA-2749 changes the way Cassandra stores data files by using separate column family directories within each keyspace directory. In 1.1, the above data files will instead be stored like this:
/var/lib/cassandra/data/App/Logs/App-Logs-hc-1-Data.db
/var/lib/cassandra/data/App/Logs/App-Logs-hc-1-Index.db
...
/var/lib/cassandra/data/App/UserData/App-UserData-hc-1-Data.db
/var/lib/cassandra/data/App/UserData/App-UserData-hc-1-Index.db
...
This allows you to mount an SSD on a particular directory (in this case UserData) to boost performance for that column family. You may notice that the file name format has also changed to include the keyspace name at the beginning. This makes it easy to tell which keyspace and column family a file belongs to when streaming or bulk loading.
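To make the new naming scheme concrete, here is a minimal Python sketch that splits a 1.1-style file name into its parts. The `DataFile` type and `parse_data_file` function are hypothetical helpers written for this post, not part of Cassandra. Reading “hc” as the on-disk format version and “1” as the generation number is our interpretation of the examples above, and the sketch only handles names with exactly the five dash-separated fields shown there.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class DataFile(NamedTuple):
    """Parts of a 1.1-style data file name (hypothetical helper type)."""
    keyspace: str
    column_family: str
    version: str      # assumed: on-disk format version, e.g. "hc"
    generation: int   # assumed: generation number, e.g. 1
    component: str    # e.g. "Data.db" or "Index.db"


def parse_data_file(path: str) -> DataFile:
    """Split <keyspace>-<column family>-<version>-<generation>-<component>."""
    name = PurePosixPath(path).name
    parts = name.split("-")
    if len(parts) != 5:
        # Anything that does not match the five-field pattern is rejected
        # rather than guessed at.
        raise ValueError(f"unexpected data file name: {name!r}")
    keyspace, column_family, version, generation, component = parts
    return DataFile(keyspace, column_family, version, int(generation), component)


if __name__ == "__main__":
    print(parse_data_file(
        "/var/lib/cassandra/data/App/UserData/App-UserData-hc-1-Data.db"
    ))
```

Because the keyspace and column family are both in the file name, a tool that receives such a file on its own can tell where it belongs without looking at the directory it came from.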
What about upgrading?
Do you need to manually move all pre-1.1 data files to the new directory structure before upgrading to 1.1? No. Immediately after Cassandra 1.1 starts, it checks whether the old directory structure is present and, if so, migrates all data files (including backups and snapshots) to the new directory structure. So, just upgrade as you always do (don’t forget to read NEWS.txt first), and you will get more control over data files for free.
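If you are curious what that migration amounts to for a single file, the path mapping can be sketched in a few lines of Python. This is only an illustration of the old-to-new layout described in this post, not Cassandra's actual migration code. `new_path_for` is a hypothetical name, and the sketch only computes the target path for plain data files like the ones listed earlier; it does not move anything on disk or deal with backups and snapshots.

```python
from pathlib import PurePosixPath


def new_path_for(old_path: str) -> str:
    """Map a pre-1.1 data file path to its 1.1-style location (illustrative only)."""
    old = PurePosixPath(old_path)
    keyspace_dir = old.parent          # e.g. /var/lib/cassandra/data/App
    keyspace = keyspace_dir.name       # e.g. "App"
    # Old names start with the column family: Logs-hc-1-Data.db
    column_family, sep, _rest = old.name.partition("-")
    if not sep or not column_family:
        raise ValueError(f"unexpected data file name: {old.name!r}")
    # New layout: <keyspace dir>/<column family>/<keyspace>-<old file name>
    return str(keyspace_dir / column_family / f"{keyspace}-{old.name}")


if __name__ == "__main__":
    print(new_path_for("/var/lib/cassandra/data/App/Logs-hc-1-Data.db"))
```

Again, you do not need to run anything like this yourself; Cassandra performs the migration on startup.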
Starting with Cassandra 1.1, data files are stored inside their own column family directory, which enables you to control what column family goes to which disk. Upgrading to the new directory structure is done automatically, so no extra upgrade steps are required. The beta2 version of 1.1 is available for download, so feel free to try it out! Feedback is always appreciated.