DataStax Developer Blog

New in DataStax Enterprise 2.0: Support for appending files in Cassandra File System

By Piotr Kołaczkowski - March 23, 2012

The Cassandra File System (CFS) is a distributed filesystem that allows for easy integration of the DataStax Enterprise platform with Hadoop. It implements the HDFS API on top of Cassandra, so you can run your Hadoop MapReduce jobs unchanged on DataStax Enterprise analytical nodes. Additionally, thanks to the distributed architecture of Cassandra, CFS has no single point of failure.

Until now, files in CFS were immutable. You could load a file from the local file system by using

dse hadoop fs -copyFromLocal

or you could call the create method of the Hadoop FileSystem API and write to the returned output stream. But once you closed the stream, you could not append any new data to the file. The only option was to delete the file and write its entire contents again, which was neither convenient nor efficient.

DSE 2.0 implements the append method of the HDFS API. When passed a path to a file, this method returns an output stream positioned at the end of the file. Any data you write to that stream is appended.

Example
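
A minimal sketch of the create-then-append contract described above. With the Hadoop API the two steps would be calls to FileSystem.create and FileSystem.append, each returning an output stream. To stay runnable without a cluster, the code below does not use the Hadoop classes. Instead it models the same contract with a hypothetical in-memory class, AppendContractSketch, and a made-up path, /logs/events.txt. Publishing the data only when the stream is closed is a simplification of this sketch, not a statement about CFS.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in; it is NOT the Hadoop or CFS API.
class AppendContractSketch {
    private final Map<String, byte[]> files = new HashMap<>();

    // Plays the role of FileSystem.create(path): start a new, empty file.
    OutputStream create(String path) {
        return openStream(path, new byte[0]);
    }

    // Plays the role of FileSystem.append(path): the returned stream is
    // positioned at the end of the existing contents.
    OutputStream append(String path) throws IOException {
        byte[] existing = files.get(path);
        if (existing == null) {
            throw new IOException("No such file: " + path);
        }
        return openStream(path, existing);
    }

    // Simplification: contents are published to the map when the stream is closed.
    private OutputStream openStream(final String path, final byte[] initial) {
        return new ByteArrayOutputStream() {
            { write(initial, 0, initial.length); }

            @Override
            public void close() throws IOException {
                files.put(path, toByteArray());
                super.close();
            }
        };
    }

    String read(String path) {
        return new String(files.get(path), StandardCharsets.UTF_8);
    }

    // Writes a file, closes it, then appends to it; returns the final contents.
    static String demo() {
        AppendContractSketch fs = new AppendContractSketch();
        try {
            try (OutputStream out = fs.create("/logs/events.txt")) {
                out.write("first line\n".getBytes(StandardCharsets.UTF_8));
            }
            // Before DSE 2.0 this step required deleting and rewriting the file.
            try (OutputStream out = fs.append("/logs/events.txt")) {
                out.write("second line\n".getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fs.read("/logs/events.txt");
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Running main prints both lines, the second one having been added without rewriting the first.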

Implementation notes

Files in HDFS, as well as in CFS, are organized in blocks, usually tens of megabytes in size. The compressed data of each block is stored in a separate row of the sblocks column family. Blocks are immutable, so it is not possible to grow a block that has already been saved. However, it is possible to add new blocks, and this is exactly what the append method does.
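
The idea of appending by adding blocks can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class BlockAppendSketch and its methods are hypothetical, compression and the sblocks storage layout are left out, and a short final block is assumed to stay as it is, since saved blocks cannot grow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a file as a list of immutable blocks.
class BlockAppendSketch {

    // Splits data into blocks of at most blockSize bytes.
    static List<byte[]> toBlocks(byte[] data, int blockSize) {
        List<byte[]> blocks = new ArrayList<>();
        for (int off = 0; off < data.length; off += blockSize) {
            int end = Math.min(off + blockSize, data.length);
            blocks.add(Arrays.copyOfRange(data, off, end));
        }
        return blocks;
    }

    // Append never touches the existing blocks: the same block objects are
    // reused, and the new data is stored in additional blocks after them.
    static List<byte[]> append(List<byte[]> existing, byte[] newData, int blockSize) {
        List<byte[]> result = new ArrayList<>(existing);
        result.addAll(toBlocks(newData, blockSize));
        return result;
    }

    // Helper for inspecting the layout: the size of each block in order.
    static List<Integer> sizes(List<byte[]> blocks) {
        List<Integer> sizes = new ArrayList<>();
        for (byte[] block : blocks) {
            sizes.add(block.length);
        }
        return sizes;
    }

    public static void main(String[] args) {
        List<byte[]> file = toBlocks(new byte[10], 4);      // blocks of 4, 4 and 2 bytes
        List<byte[]> grown = append(file, new byte[5], 4);  // adds blocks of 4 and 1 bytes
        System.out.println(sizes(grown));
    }
}
```

In this model, a 10-byte file with a block size of 4 keeps its three original blocks after a 5-byte append, and two new blocks are added behind them.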

One particular problem we had to solve when implementing append was determining the correct block size. Unlike the create method, the append method of the HDFS API has no parameter for setting the block size. Versions prior to 2.0 did not save the block size in the inode header of the file, so for files created with those versions append uses the default block size, which you can set through the dfs.block.size property. For files created with DSE 2.0, append uses the block size that was used when the file was created.

You can find more on the design and implementation of the Cassandra File System here: http://www.datastax.com/dev/blog/cassandra-file-system-design
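
The block-size rule for append described in these notes can be sketched as a small lookup. The class AppendBlockSize, the method resolve, and the use of OptionalLong to represent "block size present in the inode header or not" are assumptions made for illustration, not the actual CFS code. The 64 MB fallback used when dfs.block.size is unset is also an assumption.

```java
import java.util.OptionalLong;
import java.util.Properties;

// Hypothetical helper: which block size should append use for a given file?
class AppendBlockSize {

    // Assumed fallback for this sketch when dfs.block.size is not configured.
    static final long FALLBACK_BLOCK_SIZE = 64L * 1024 * 1024;

    // storedInInode is empty for files written before DSE 2.0, whose inode
    // header does not record the block size.
    static long resolve(OptionalLong storedInInode, Properties conf) {
        if (storedInInode.isPresent()) {
            // File created with DSE 2.0: reuse the block size chosen at creation.
            return storedInInode.getAsLong();
        }
        // Older file: fall back to the configured default block size.
        String configured = conf.getProperty("dfs.block.size");
        return configured != null ? Long.parseLong(configured) : FALLBACK_BLOCK_SIZE;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("dfs.block.size", "134217728"); // 128 MB default

        // Pre-2.0 file: the configured default applies.
        System.out.println(resolve(OptionalLong.empty(), conf));
        // DSE 2.0 file created with 32 MB blocks: the stored value applies.
        System.out.println(resolve(OptionalLong.of(33554432L), conf));
    }
}
```

One consequence of this rule is that changing dfs.block.size affects appends to pre-2.0 files only. Files created with DSE 2.0 keep the block size they were created with.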


