DataStax Enterprise (DSE) is a commercial distribution of Apache Cassandra and Apache Hadoop developed by DataStax. DSE provides Hadoop MapReduce capabilities using CassandraFS, an HDFS-compatible storage layer inside Cassandra. Because CassandraFS replaces HDFS, users can run their existing MapReduce jobs on Cassandra’s peer-to-peer, fault-tolerant, and scalable architecture. DataStax Enterprise also supports dual workloads, allowing you to use the same cluster of machines for both real-time applications and data analytics without moving data between systems.
Some of the key features of DataStax Enterprise include:
No Single Point of Failure - The Hadoop Distributed File System (HDFS) utilizes a master/slave architecture. The NameNode is the entry point into the cluster and it stores all of the metadata about how the cluster is configured. If the NameNode fails, the Hadoop system is down. With CassandraFS, all nodes are peers. Data files can be loaded through any node in the cluster, and any node can serve as the JobTracker for MapReduce jobs.
Streamlined Setup and Operations - Hadoop requires different configurations depending on the mode you want to run in: stand-alone or pseudo-distributed mode for a single-node setup, or cluster mode for a multi-node configuration. Moving from one mode to another requires multiple configuration steps. DataStax Enterprise has only one mode (cluster mode); whether the cluster has one node or one hundred, the configuration is the same. Since there is no NameNode, all nodes in a CassandraFS cluster are the same. The same applies to running Hive against DataStax Enterprise. Hive stores its schema in a metastore. In regular Hive, this is a stand-alone database, and multiple configuration steps are needed to make it a database instance that can be shared by multiple Hive clients. In DataStax Enterprise, the Hive metastore is automatically a shared metastore (a Cassandra keyspace) available through any node in the cluster without any additional configuration.
Analytics Without ETL - With DataStax Enterprise, you can run MapReduce jobs directly against your data in Cassandra. You can even run real-time and analytic workloads at the same time without one workload affecting the performance of the other. Using Cassandra’s multi-datacenter support, you can start some nodes as Hadoop analytics nodes and others as pure Cassandra real-time nodes. With this split-workload configuration, data is automatically replicated between the Cassandra real-time nodes and the Hadoop analytics nodes.
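As a rough sketch, starting a split-workload cluster might look like the following. The `-t` flag (start a node with the Hadoop trackers enabled) follows the convention in DSE packaged installs, but you should verify the exact switch for your release:

```shell
# On nodes designated for analytics: start DSE with Hadoop
# (JobTracker/TaskTracker) services enabled.
dse cassandra -t

# On nodes designated for real-time workloads: start DSE as a
# plain Cassandra node.
dse cassandra
```

Keyspaces configured with a replication strategy spanning both groups of nodes are then kept in sync automatically.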
Full Integration with DataStax OpsCenter 1.4 - DataStax OpsCenter allows you to monitor and administer your DataStax Enterprise cluster in one easy-to-use graphical interface. Using OpsCenter, you can see detailed health and status information about your DataStax Enterprise cluster, including the status of Hadoop MapReduce jobs running on the cluster. To install DataStax OpsCenter, see the DataStax OpsCenter Install Guide.
DataStax Enterprise combines Apache Cassandra with Hadoop. A DataStax Enterprise cluster can be run as a pure Hadoop MapReduce cluster (using Hadoop with Cassandra as its underlying storage) or as a combination of Hadoop analytics nodes and Cassandra real-time nodes. The DataStax Enterprise distribution also includes the Hive and Pig MapReduce clients.
Like Apache Cassandra, Apache Hadoop is an open-source project administered by the Apache Software Foundation. Hadoop consists of two key services, the Hadoop Distributed File System (HDFS) and a parallel data processing framework using a technique called MapReduce.
In DataStax Enterprise, the Hadoop Distributed File System (HDFS) is replaced by CassandraFS. CassandraFS is compatible with Hadoop MapReduce clients, but uses a cfs keyspace in Cassandra for the underlying storage layer. CassandraFS provides all of the benefits of HDFS such as replication and data location awareness, with the added benefits of the Cassandra peer-to-peer architecture.
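Because CassandraFS is Hadoop-compatible, it can be exercised with the familiar `hadoop fs` shell, wrapped by the `dse` command. The paths below are illustrative:

```shell
# Copy a local file into CassandraFS; the data is stored in the
# cfs keyspace in Cassandra rather than in HDFS.
dse hadoop fs -mkdir /user/demo
dse hadoop fs -put sales.csv /user/demo/sales.csv

# List the directory; the cfs:/// scheme addresses the CassandraFS
# root explicitly.
dse hadoop fs -ls cfs:///user/demo
```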
On top of the distributed file system is the MapReduce engine, which consists of a centralized JobTracker service. Client applications submit their MapReduce jobs to the JobTracker. For each submitted job, the JobTracker schedules a series of tasks on the compute nodes. One TaskTracker service per node handles the map and reduce tasks scheduled for that node. The JobTracker monitors the execution and status of all of the distributed tasks that comprise a MapReduce job. In DataStax Enterprise, you must choose one node to be the JobTracker for your MapReduce jobs (set by configuring the Cassandra seed node for your DSE cluster).
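A MapReduce job is submitted the same way as on stock Hadoop, via `dse hadoop jar`. The jar name and input/output paths below are hypothetical placeholders:

```shell
# Submit the classic word-count example job to the JobTracker.
# Input and output paths refer to CassandraFS, not HDFS.
dse hadoop jar hadoop-examples.jar wordcount /user/demo/input /user/demo/output
```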
DataStax Enterprise includes a Cassandra-enabled Hive MapReduce client. Hive is a data warehouse system for Hadoop that allows you to project a relational structure onto data stored in Hadoop-compatible file systems, and to query the data using a SQL-like language called HiveQL. The HiveQL language also allows traditional MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. In DataStax Enterprise, you can start the Hive client on any DSE Analytics node, define Hive data structures, and issue MapReduce queries. DSE Hive includes a custom storage handler for Cassandra that allows you to run Hive queries directly on data stored in Cassandra.
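As an illustration, a Hive external table can be mapped onto an existing Cassandra column family with the DSE storage handler. The `demo` keyspace, `users` column family, and column names are hypothetical, and the handler class name should be checked against your DSE release:

```shell
# Define an external Hive table over a Cassandra column family,
# then run a MapReduce-backed query against it.
dse hive -e "
CREATE EXTERNAL TABLE users (user_id string, name string, email string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ('cassandra.ks.name' = 'demo',
                      'cassandra.cf.name' = 'users');

SELECT count(*) FROM users;
"
```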
DataStax Enterprise includes a Cassandra-enabled Pig MapReduce client. Pig is a platform for analyzing large data sets that uses a high-level language (called Pig Latin) for expressing data analysis programs. Pig Latin lets developers specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. Pig comes with many built-in functions, but developers can also create their own user-defined functions for special-purpose processing.
Pig Latin programs run in a distributed fashion on a DSE cluster (programs are compiled into MapReduce jobs and executed using Hadoop). When using Pig with DSE, all jobs run in MapReduce mode (even on a single-node cluster). Since all Hadoop nodes are peers in DataStax Enterprise (there is no NameNode), there is no concept of local mode for Pig. Pig in DSE includes a custom storage handler for Cassandra that allows you to run Pig programs directly on data stored in Cassandra. The native Pig storage handler stores data in CassandraFS.
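A minimal sketch of a Pig job over Cassandra data, assuming a hypothetical `demo` keyspace with a `users` column family. The `cassandra://` LOAD URI and `CassandraStorage` function follow the DSE Pig conventions, but verify the details for your release:

```shell
# Write a small Pig Latin script and run it with the DSE Pig client.
cat > count_users.pig <<'EOF'
-- Load rows from demo.users via the Cassandra storage handler
rows   = LOAD 'cassandra://demo/users' USING CassandraStorage();
-- Keep only the row keys and count them
keys   = FOREACH rows GENERATE key;
grp    = GROUP keys ALL;
counts = FOREACH grp GENERATE COUNT(keys);
DUMP counts;
EOF

dse pig count_users.pig
```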
The fastest way to get started with DataStax Enterprise is to install it on a single node and run the Portfolio Manager demo application. For quick instructions on getting up and running on a single node, see:
For cluster installations of DSE, see:
To get started with the Hive and Pig MapReduce clients bundled with DSE, see:
For more information about Apache Cassandra 1.0, see the Cassandra 1.0 Documentation.
For more information about Hadoop MapReduce, Hive, and Pig, see the MapReduce Getting Started Guide, the Hive Getting Started Guide, and the Pig Latin Reference Manuals on the Apache Hadoop project web site.