A Hive or Pig analytics job requires a Hadoop file system to function. DataStax Enterprise provides a replacement for the Hadoop Distributed File System (HDFS) called the Cassandra File System (CassandraFS), which serves this purpose. When an analytics node starts up, DataStax Enterprise creates a default CassandraFS rooted at cfs:/ and an archive file system named cfs-archive.
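For example, once an analytics node is running you can confirm that both file systems exist by listing their root directories with the Hadoop shell commands that DataStax Enterprise wraps:

dse hadoop fs -ls cfs:///
dse hadoop fs -ls cfs-archive:///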
A CFS superuser is the DSE daemon user, that is, the user who starts DataStax Enterprise. A Cassandra superuser, set up using the CQL CREATE USER ... SUPERUSER command, is also a CFS superuser.
A CFS superuser can modify files in the CassandraFS without any restrictions. Files that a superuser adds to the CassandraFS are password-protected.
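For example, the following cqlsh statement is a minimal sketch of creating a Cassandra superuser, which is therefore also a CFS superuser; the user name and password are hypothetical:

CREATE USER analytics_admin WITH PASSWORD 'ChangeMe123' SUPERUSER;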
Cassandra does not immediately remove deleted data from disk when you use the dse hadoop fs -rm <file> command. Instead, Cassandra handles the deleted file like any other deleted data: a tombstone is written to mark the new column status. Columns marked with a tombstone persist for a configured time period (the gc_grace_seconds value set on the table). When the grace period expires, the compaction process permanently deletes the columns, so you do not have to remove expired data manually. However, a deleted column can reappear if you do not run node repair routinely.
To force deletion of data after using the dse hadoop fs -rm <file> command, use the nodetool flush command. This command flushes the tombstones in the memtable to disk so that they can be compacted together with the data, removing the deleted files.
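As an illustration, the following sequence removes a file from the CassandraFS and then flushes the resulting tombstones to disk so that compaction can remove them along with the data; the file path is hypothetical:

dse hadoop fs -rm /data/old_log.gz
nodetool flush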
DataStax Enterprise 2.1 and later support multiple CassandraFS instances. An additional CassandraFS is typically used to isolate Hadoop-related jobs, to configure keyspace replication on a per-job basis, or to keep file systems in different data centers separate.
To create an additional CassandraFS:
1. Open the core-site.xml file for editing. The location of this file depends on the type of installation.
2. Add one or more property elements to core-site.xml using the following format; a filled-in example appears after these steps:
<property>
  <name>fs.cfs-<filesystem name>.impl</name>
  <value>com.datastax.bdp.hadoop.cfs.CassandraFileSystem</value>
</property>
3. Save the file and restart Cassandra.
DSE creates the new CassandraFS.
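For example, assuming you name the new file system NewCassandraFS (the name used in the commands below), the property element would look like this:

<property>
  <name>fs.cfs-NewCassandraFS.impl</name>
  <value>com.datastax.bdp.hadoop.cfs.CassandraFileSystem</value>
</property>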
To access the new CassandraFS, construct a URL using the following format:

cfs-<filesystem name>:<path>
For example, assuming the new file system name is NewCassandraFS, use dse commands to copy data to the new CassandraFS:
dse hadoop fs -copyFromLocal /tmp/giant_log.gz cfs-NewCassandraFS://cassandrahost/tmp
dse hadoop distcp hdfs:/// cfs-NewCassandraFS:///
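To confirm that the data arrived, you can then list the target directory in the new file system:

dse hadoop fs -ls cfs-NewCassandraFS:///tmp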