Brisk is an open-source Hadoop and Hive distribution developed by DataStax that utilizes Apache Cassandra for its core services and storage. Brisk provides Hadoop MapReduce capabilities using CassandraFS, an HDFS-compatible storage layer inside Cassandra. By replacing HDFS with CassandraFS, users are able to leverage their current MapReduce jobs on Cassandra’s peer-to-peer, fault-tolerant, and scalable architecture. Brisk is also able to support dual workloads, allowing you to use the same cluster of machines for both real-time applications and data analytics without having to move the data around between systems.
This release introduces the following new features, plus numerous bug fixes and performance improvements:
Pig Integration - A Brisk-compatible Pig client lets developers write MapReduce programs in Pig Latin and access data stored in CassandraFS as well as in regular Cassandra keyspaces.
JobTracker Failover - A utility has been added to allow administrators to change the JobTracker node for a Brisk cluster in the event of a primary JobTracker node failure.
Some of the key features of Brisk include:
No Single Point of Failure - The Hadoop Distributed File System (HDFS) utilizes a master/slave architecture. The NameNode is the entry point into the cluster and it stores all of the metadata about how the cluster is configured. If the NameNode fails, the Hadoop system is down. With CassandraFS, all nodes are peers. Data files can be loaded through any node in the cluster, and any node can serve as the JobTracker for MapReduce jobs.
Streamlined Setup and Operations - Hadoop requires a different configuration for each mode you run in: stand-alone or pseudo-distributed mode for a single-node setup, or cluster mode for a multi-node configuration. Moving from one mode to another requires multiple configuration steps. In Brisk, there is only one mode (cluster mode). Whether the cluster has one node or one hundred, the configuration is the same. Since there is no NameNode, all nodes in a CassandraFS cluster are the same. The same simplification applies to running Hive against Brisk. Hive stores its schema in a metastore. In regular Hive, the metastore is a stand-alone database that requires multiple configuration steps before it can be shared by multiple Hive clients. In Brisk, the Hive metastore is automatically a shared metastore in Cassandra, available through any node in the cluster without any additional configuration.
Analytics Without ETL - When using Brisk, it is possible to run MapReduce jobs directly against your data in Cassandra. You can even perform real-time and analytic workloads at the same time without one workload affecting the performance of the other. Using Cassandra’s multi-datacenter support, you can start some nodes as Brisk nodes and some nodes as pure Cassandra nodes. With this split-workload configuration, data is automatically replicated between the Cassandra nodes and the Brisk Hadoop nodes.
Full Integration with DataStax OpsCenter 1.1 - DataStax OpsCenter release 1.1 and later allows you to monitor and administer your Brisk cluster in one easy-to-use graphical interface. Using OpsCenter, you can see detailed health and status information about your Brisk cluster, including the status of MapReduce jobs running on the cluster. For more information on how you can get DataStax OpsCenter, see the DataStax OpsCenter Product Page.
Brisk combines Apache Cassandra with Hadoop. A Brisk cluster can be run as a pure Hadoop MapReduce cluster (using Hadoop with Cassandra as its underlying storage) or as a combination of Hadoop nodes and Cassandra nodes. The Brisk distribution includes the Hive and Pig MapReduce clients.
Like Apache Cassandra, Apache Hadoop is an open-source project administered by the Apache Software Foundation. Hadoop consists of two key services, the Hadoop Distributed File System (HDFS) and a parallel data processing framework using a technique called MapReduce.
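The MapReduce technique mentioned above can be illustrated with a small, single-process sketch. This is not Hadoop or Brisk code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration, and a real cluster would distribute each phase across many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Brisk runs Hadoop", "Hadoop runs MapReduce"]))
```

Because each call to `map_phase` depends only on its own input line, and each call to `reduce_phase` depends only on the values for one key, both phases can in principle run in parallel on separate machines.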
In Brisk, the Hadoop Distributed File System (HDFS) is replaced by CassandraFS. CassandraFS is compatible with Hadoop MapReduce clients, but uses a cfs keyspace in Cassandra for the underlying storage layer. CassandraFS provides all of the benefits of HDFS such as replication and data location awareness, with the added benefits of the Cassandra peer-to-peer architecture.
On top of the distributed file system is the MapReduce engine, which is coordinated by a centralized JobTracker service. Client applications submit their MapReduce jobs to the JobTracker. For each job submitted, the JobTracker schedules a series of tasks on the compute nodes. There is one TaskTracker service per node to handle the map and reduce tasks scheduled for that node. The JobTracker monitors the execution and status of all of the distributed tasks that comprise a MapReduce job. In Brisk, you must choose one node to be the JobTracker for your MapReduce jobs. Typically, this will be the same node as your Cassandra seed node.
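The division of labor described above can be sketched as a toy model. Everything here is an assumption made for illustration: the class names mirror the service names but are not Hadoop classes, and the round-robin assignment stands in for Hadoop's actual scheduling, which also considers data locality and available task slots. The sketch also ignores the fact that reduce tasks normally wait for map output.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    kind: str                # "map" or "reduce"
    status: str = "pending"


@dataclass
class TaskTracker:
    """One per node; holds and runs the tasks scheduled for that node."""
    node: str
    tasks: list = field(default_factory=list)

    def run_all(self):
        for task in self.tasks:
            task.status = "done"


class JobTracker:
    """Central service: accepts jobs, schedules tasks, reports status."""

    def __init__(self, nodes):
        self.trackers = [TaskTracker(node) for node in nodes]
        self.jobs = {}

    def submit(self, job_id, num_maps, num_reduces):
        tasks = [Task(f"{job_id}-m{i}", "map") for i in range(num_maps)]
        tasks += [Task(f"{job_id}-r{i}", "reduce") for i in range(num_reduces)]
        # Assumption: simple round-robin placement across the nodes.
        for i, task in enumerate(tasks):
            self.trackers[i % len(self.trackers)].tasks.append(task)
        self.jobs[job_id] = tasks

    def status(self, job_id):
        tasks = self.jobs[job_id]
        done = sum(task.status == "done" for task in tasks)
        return f"{done}/{len(tasks)} tasks done"


if __name__ == "__main__":
    tracker = JobTracker(["node1", "node2", "node3"])
    tracker.submit("job1", num_maps=4, num_reduces=2)
    print(tracker.status("job1"))
    for task_tracker in tracker.trackers:
        task_tracker.run_all()
    print(tracker.status("job1"))
```

The model makes the single coordinating role visible: every submission and status query goes through the one `JobTracker` instance, while the per-node `TaskTracker` objects only execute what they are handed.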
Brisk includes a Cassandra-enabled Hive MapReduce client. Hive is a data warehouse system for Hadoop that allows you to project a relational structure onto data stored in Hadoop-compatible file systems, and to query the data using a SQL-like language called HiveQL. The HiveQL language also allows traditional MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. In Brisk, you can start the Hive client on any Brisk node, define Hive data structures, and issue MapReduce queries. Brisk Hive includes a custom storage handler for Cassandra that allows you to run Hive queries directly on data stored in Cassandra.
Brisk includes a Cassandra-enabled Pig MapReduce client. Pig is a platform for analyzing large data sets that uses a high-level language (called Pig Latin) for expressing data analysis programs. Pig Latin lets developers specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. Pig comes with many built-in functions, but developers can also create their own user-defined functions for special-purpose processing.
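The kind of data flow that Pig Latin expresses (merge, filter, group, then apply a function to each group) can be sketched in plain Python. This is an analogy and not Pig Latin syntax. The sample records and the helper names (`merge`, `keep`, `group_by`, `total_seconds`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (user, page, seconds_on_page)
visits_a = [("ann", "home", 12), ("bob", "docs", 40)]
visits_b = [("ann", "docs", 95), ("cat", "home", 3)]


def merge(*datasets):
    """Merge several data sets into one list of records."""
    return [record for dataset in datasets for record in dataset]


def keep(records, predicate):
    """Filter: keep only the records for which predicate is true."""
    return [record for record in records if predicate(record)]


def group_by(records, key):
    """Group records by the value that key() returns for each one."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return dict(groups)


def total_seconds(records):
    """User-defined function applied to each group of records."""
    return sum(record[2] for record in records)


# The program is a sequence of transformations, each feeding the next.
merged = merge(visits_a, visits_b)
long_visits = keep(merged, lambda record: record[2] >= 10)
by_user = group_by(long_visits, key=lambda record: record[0])
totals = {user: total_seconds(records) for user, records in by_user.items()}

if __name__ == "__main__":
    print(totals)
```

Each step names an intermediate result and passes it to the next step, which is the style of program that Pig compiles into MapReduce jobs.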
Pig Latin programs run in a distributed fashion on a Brisk cluster (programs are compiled into MapReduce jobs and executed using Hadoop). When using Pig with Brisk, all jobs can be run in MapReduce mode (even on a single-node cluster). Since all Hadoop nodes are peers in Brisk (no NameNode), there is no concept of local mode for Pig. Brisk Pig includes a custom storage handler for Cassandra that allows you to run Pig programs directly on data stored in Cassandra. The native Pig storage handler stores data in CassandraFS.
To get started with Brisk, you can install Brisk on a cluster or on a single node and look at the Brisk demo. DataStax provides packaged distributions as well as an Amazon EC2 image. When you start a node in a Brisk cluster, you have the option of starting the node as a Cassandra-only node or as a Brisk node (CassandraFS plus the MapReduce Job and Task Tracker services). See the following topics for more information:
For more information about Hadoop MapReduce, Hive, and Pig, see the MapReduce Getting Started Guide, the Hive Getting Started Guide, and the Pig Latin Reference Manuals on the Apache Hadoop project web site.