New in DataStax Enterprise 2.0: Support for appending to files in the Cassandra File System
The Cassandra File System (CFS) is a distributed filesystem that makes it easy to integrate the DataStax Enterprise platform with Hadoop. It implements the HDFS API on top of Cassandra, so you can run your Hadoop MapReduce jobs unchanged on DataStax Enterprise analytical nodes. Additionally, thanks to the distributed architecture of Cassandra, CFS has no single point of failure.
Until now, files in CFS were immutable. You could load a file from the local file system by using
dse hadoop fs -copyFromLocal
or you could call the
create method of the Hadoop FileSystem API and write to the returned output stream. But once you closed the stream and the file was written, you could not append any new data to it. The only option was to delete the file and write its entire contents again, which was neither convenient nor efficient. DSE 2.0 implements the
append method of the HDFS API. This method, when passed a path to a file, returns an output stream positioned at the end of the file. Any data you write to that stream gets appended.
Files in HDFS, as well as in CFS, are organized in blocks, typically tens of megabytes in size. The compressed data of each block is stored in a separate row in the sblocks column family. Blocks are immutable, so it is not possible to grow a block that has already been saved. However, it is possible to add new blocks, and this is exactly what the append method does.
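To make the idea of appending by adding blocks concrete, here is a minimal in-memory sketch in Java. It is a toy model, not CFS code: the class name BlockFileSketch and its methods are hypothetical, compression and the sblocks column family are left out, and it assumes that a partially filled last block simply stays as it is when new data arrives.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy in-memory model of a file stored as immutable blocks (hypothetical, not CFS code). */
class BlockFileSketch {
    private final int blockSize;
    // Each block is stored once and never modified afterwards.
    private final List<byte[]> blocks = new ArrayList<>();

    BlockFileSketch(int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.blockSize = blockSize;
    }

    /** Appends data by adding new blocks; blocks that already exist are left untouched. */
    void append(byte[] data) {
        for (int off = 0; off < data.length; off += blockSize) {
            int end = Math.min(off + blockSize, data.length);
            blocks.add(Arrays.copyOfRange(data, off, end));
        }
    }

    int blockCount() {
        return blocks.size();
    }

    /** Returns a copy of block i so that callers cannot modify the stored block. */
    byte[] block(int i) {
        return blocks.get(i).clone();
    }

    /** Concatenates all blocks in order to reconstruct the file contents. */
    byte[] contents() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] b : blocks) {
            out.write(b, 0, b.length);
        }
        return out.toByteArray();
    }
}
```

With a block size of 4 bytes, writing six bytes and then appending three more leaves the first two blocks exactly as they were and adds a third one. Reading the blocks back in order yields the concatenation of both writes.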
One particular problem we had to solve when implementing append was determining the correct block size. Unlike the create method, the HDFS API append method does not provide a parameter for setting the block size. In versions prior to 2.0, the block size was not saved in the inode header of the file, so for files created with these versions the default block size is used. You can change the default block size by setting the
dfs.block.size property. For files created with DSE 2.0, append uses the block size that was in effect when the file was created. You can find more on the design and implementation of the Cassandra File System here: http://www.datastax.com/dev/blog/cassandra-file-system-design
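The block size rule for append described above can be summarized in a few lines of Java. This is only a sketch of the decision, not the actual CFS implementation: the class and method names are hypothetical, and the inode header is reduced to an optional value that is empty for files created before DSE 2.0.

```java
import java.util.OptionalLong;

/** Sketch of choosing the block size for an append (hypothetical names, not CFS code). */
class AppendBlockSizeSketch {
    /**
     * @param storedBlockSize  block size recorded in the file's inode header;
     *                         empty for files created before DSE 2.0
     * @param defaultBlockSize configured default, i.e. the value of dfs.block.size
     * @return the block size to use for blocks added by append
     */
    static long blockSizeForAppend(OptionalLong storedBlockSize, long defaultBlockSize) {
        // Prefer the size the file was created with; otherwise fall back to the default.
        return storedBlockSize.orElse(defaultBlockSize);
    }
}
```

For a file whose header records a block size, that value is used regardless of the current default. For an older file with no recorded value, the result is whatever dfs.block.size is set to at the time of the append.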