DataStax Enterprise 2.0 Documentation

About DataStax Enterprise

DataStax Enterprise is a big data platform that integrates real-time, analytic, and enterprise search features to help you respond quickly to user and business demands. It leverages Apache Cassandra, Apache Hadoop, and Apache Solr to shift your focus from grappling with your own infrastructure to using your data strategically.


Key Features of DataStax Enterprise

The key features of DataStax Enterprise include:

No Single Point of Failure - The Hadoop Distributed File System (HDFS) utilizes a master/slave architecture. The NameNode is the entry point into the cluster and stores all of the configuration metadata about the cluster. If the NameNode fails, the Hadoop system is down. With DataStax Enterprise, all nodes are peers: data files can be loaded through any node in the cluster, and any Analytic node can perform the role of JobTracker for MapReduce jobs. Additionally, in DataStax Enterprise, Solr is fully fault-tolerant and has no single point of failure.

Streamlined Setup and Operations - In Hadoop, you have to set up different configurations depending on the mode you want to run in: stand-alone mode or pseudo-distributed mode for a single node setup, or cluster mode for a multi-node configuration. Moving from one mode to another requires multiple configuration steps. In DataStax Enterprise, there is only one mode (cluster mode). Whether a cluster has one node or one hundred, the configuration is the same.

Hadoop MapReduce capabilities using CassandraFS - CassandraFS is an HDFS-compatible storage layer inside of Cassandra. By replacing HDFS with CassandraFS, you can leverage your current MapReduce jobs on Cassandra's peer-to-peer, fault-tolerant, and scalable architecture.

Analytics Without ETL - When using DataStax Enterprise, it is possible to run MapReduce jobs directly against your data in Cassandra. You can even perform real-time and analytic workloads at the same time without one workload affecting the performance of the other. In a DataStax Enterprise cluster, you can start some nodes as Hadoop analytics nodes and some nodes as pure Cassandra real-time nodes. With this split-workload configuration, data is automatically replicated between the Cassandra real-time nodes and the Hadoop analytics nodes.

Enterprise Search Capabilities - DataStax Enterprise is fully integrated with Apache Solr to provide ad-hoc querying of the data; full-text search; hit highlighting; multiple search attributes; search of rich documents, such as PDF and Microsoft Word; geo-spatial search; and more. Additionally, near real-time indexing can be performed to manage real-time, analytic, and enterprise search features within a single integrated platform.

Hive Support - Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. DataStax Enterprise allows you to use any JDBC compliant user interface to connect to and work with Hive from within the server. In regular Hive, its metastore is a stand-alone database that requires multiple configuration steps to make it a database instance that can be shared by multiple Hive clients. In DataStax Enterprise, the Hive metastore is automatically a shared metastore that is available through any Cassandra node in the cluster without any additional configuration.

Pig Support - Pig is a high-level platform for creating MapReduce programs used with Hadoop. You use Pig for analyzing large data sets. When using Pig with DataStax Enterprise, all jobs can be run in MapReduce mode and you can run Pig programs directly on data stored in Cassandra.

Elastic Workload Re-provisioning - DataStax Enterprise lets you re-provision existing nodes to assume a different workload, such as changing a real-time node to an analytic node, thereby changing the overall usage and capacity of a cluster. For example, you could change two real-time/transactional nodes to Analytic nodes during off-peak hours and then return them to the original configuration once the analytic tasks have completed.

Migration of RDBMS data - DataStax Enterprise provides easy migration of RDBMS data into the DataStax Enterprise server using Apache Sqoop. For example, you can import data from relational databases such as Oracle, Microsoft SQL Server, MySQL, Sybase, and DB2, and from non-relational data sources, such as NoSQL databases.

Runtime Logging - DataStax Enterprise integrates Apache log4j. This Java-based logging framework provides runtime application feedback and the ability to control the granularity of log statements using an external configuration file. With the Cassandra Appender, you can store the log4j messages in a column family where they're available for in-depth analysis using the Hadoop and Solr capabilities provided by DataStax Enterprise. Also included in DataStax Enterprise is a log4j search demo that shows an example of searching and filtering log4j messages.

Full Integration with DataStax OpsCenter - DataStax OpsCenter allows you to monitor, administer, and configure one or more DataStax Enterprise clusters in one easy-to-use graphical interface. Additionally, you can perform and schedule automatic backups. Using OpsCenter, you can see detailed health and status information about multiple clusters, such as the status of Hadoop MapReduce jobs running on a cluster.

About the DataStax Enterprise Architecture

DataStax Enterprise combines Apache Cassandra with Hadoop and Solr. A DataStax Enterprise cluster can be run as a pure Hadoop MapReduce cluster (using Hadoop with Cassandra as its underlying storage) or as a combination of Hadoop analytics nodes, Cassandra real-time/transactional nodes, and Solr enterprise search nodes. The DataStax Enterprise distribution also includes the Hive and Pig MapReduce clients.

About Hadoop and MapReduce in DataStax Enterprise

Like Apache Cassandra, Apache Hadoop is an open-source project administered by the Apache Software Foundation. Hadoop consists of two key services, the Hadoop Distributed File System (HDFS) and a parallel data processing framework using a technique called MapReduce.

In DataStax Enterprise, CassandraFS replaces the Hadoop Distributed File System (HDFS). CassandraFS is compatible with Hadoop MapReduce clients, but uses a cfs keyspace in Cassandra for the underlying storage layer. CassandraFS provides all of the benefits of HDFS such as replication and data location awareness, with the added benefits of the Cassandra peer-to-peer architecture.

On top of the distributed file system is the MapReduce engine, which consists of a centralized Job Tracker service. Client applications submit their MapReduce jobs to the Job Tracker. For each job submitted to the Job Tracker, a series of tasks is scheduled on the compute nodes. There is one Task Tracker service per node to handle the map and reduce tasks scheduled for that node. The Job Tracker monitors the execution and status of all of the distributed tasks that make up a MapReduce job. In DataStax Enterprise, you must choose one node to be the Job Tracker for your MapReduce jobs, which is set by configuring the Cassandra seed node for your DataStax Enterprise cluster.
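To make the job and task split concrete, here is a minimal, single-process Python sketch of the MapReduce model described above. It is not Hadoop or DataStax Enterprise code: the function names (`map_task`, `shuffle`, `reduce_task`, `run_job`) and the word-count job are assumptions chosen for illustration, and `run_job` only stands in for the scheduling role of the Job Tracker.

```python
from collections import defaultdict

def map_task(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group intermediate values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_task(key, values):
    """Reduce step: combine all values emitted for one key."""
    return key, sum(values)

def run_job(input_splits):
    """Stand-in for the Job Tracker: run map tasks per split, then reduce per key."""
    intermediate = []
    for split in input_splits:      # in a cluster, each split goes to a Task Tracker
        for line in split:
            intermediate.extend(map_task(line))
    grouped = shuffle(intermediate)
    return dict(reduce_task(key, values) for key, values in grouped.items())

# Hypothetical input: two splits of text lines.
splits = [["cassandra hadoop solr"], ["hadoop hive pig", "cassandra hadoop"]]
counts = run_job(splits)
```

In a real cluster the map and reduce tasks run in parallel on different nodes; the sketch runs them sequentially only to show the data flow.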

About Hive in DataStax Enterprise

DataStax Enterprise includes a Cassandra-enabled Hive MapReduce client. Hive is a data warehouse system for Hadoop that allows you to project a relational structure onto data stored in Hadoop-compatible file systems, and to query the data using a SQL-like language called HiveQL. The HiveQL language also allows traditional MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. In DataStax Enterprise, you can start the Hive client on any DataStax Enterprise Analytics node, define Hive data structures, and issue MapReduce queries. DataStax Enterprise Hive includes a custom storage handler for Cassandra that allows you to run Hive queries directly on data stored in Cassandra. Hive includes support for binary data and support for wide rows (up to 2 billion columns).

About Pig in DataStax Enterprise

DataStax Enterprise includes a Cassandra-enabled Pig MapReduce client. Pig is a platform for analyzing large data sets. It uses a high-level language called Pig Latin for expressing data analysis programs. Pig Latin lets developers specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. Pig comes with many built-in functions, but developers can also create their own user-defined functions for special-purpose processing.

Pig Latin programs run in a distributed fashion on a DataStax Enterprise cluster (programs are compiled into MapReduce jobs and executed using Hadoop). When using Pig with DataStax Enterprise, all jobs can be run in MapReduce mode (even on a single-node cluster). Since all Hadoop nodes are peers in DataStax Enterprise (no Name Node), there is no concept of local mode for Pig. Pig in DataStax Enterprise includes a custom storage handler for Cassandra that allows you to run Pig programs directly on data stored in Cassandra. The native Pig storage handler stores data in CassandraFS.

About Solr in DataStax Enterprise

DataStax Enterprise Search provides powerful free-text search capabilities based on the Apache Solr project. Solr is an open source, widely-used search engine technology. In addition to free-text search, Solr provides more advanced features like aggregation, grouping, and geo-spatial search.

The unique combination of Cassandra, Hadoop, and Solr in DataStax Enterprise overcomes MapReduce performance problems when querying real-time data. DataStax Enterprise Search adds capabilities to Cassandra for performing complex queries and searches. It offers unique search capacity scaling that improves native Solr search capabilities. You can add search capacity in the same way as you add Hadoop or Cassandra capacity to your cluster. Additionally, the Cassandra Query Language (CQL) has been extended to support Solr/enterprise search queries.

If you don't need Hadoop/Cassandra, you can use DataStax Enterprise strictly for Solr and create an exclusively Solr cluster. This cluster configuration improves on the master-slave configuration supported by native Solr and, because DataStax Enterprise supports the native Solr tools and APIs, migration from Solr to DataStax Enterprise is painless.

DataStax Enterprise includes a demo that downloads and indexes all or part of Wikipedia. This demo shows how simple it is to input data and perform enterprise search operations.

About Sqoop

Apache Sqoop is a tool for transferring data between an external data source and Hadoop. You can migrate data from any JDBC-compliant data source, including non-relational data sources, such as NoSQL databases, and relational data sources, such as Oracle, MySQL, and SQL Server.

About log4j

DataStax Enterprise provides the ability to transfer log-based data directly into the server using log4j. Apache log4j is a Java-based logging framework that provides runtime application feedback. It allows you to control the granularity of log statements using an external configuration file (log4j.properties). With the Cassandra Appender, you can store the log4j messages in a column family where they're available for in-depth analysis using the Hadoop and Solr capabilities provided by DataStax Enterprise.
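The two ideas in this paragraph, level-based granularity and an appender that stores messages as rows for later analysis, can be sketched with Python's standard `logging` module as a stand-in for log4j. This is an analogy only, not the Cassandra Appender or its configuration: `ColumnFamilyHandler`, the logger name, and the in-memory dictionary that plays the role of a column family are all hypothetical.

```python
import logging

class ColumnFamilyHandler(logging.Handler):
    """Hypothetical appender: stores each log record as a row in an in-memory dict."""

    def __init__(self):
        super().__init__()
        self.rows = {}  # row key -> column name/value pairs

    def emit(self, record):
        row_key = len(self.rows)
        self.rows[row_key] = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }

handler = ColumnFamilyHandler()
logger = logging.getLogger("demo.app")  # hypothetical logger name
logger.addHandler(handler)
logger.propagate = False                # keep records out of the root logger
logger.setLevel(logging.WARNING)        # granularity: drop anything below WARNING

logger.info("cache warmed")             # filtered out by the level
logger.warning("disk usage high")
logger.error("write timed out")

# Stored rows can be filtered afterwards, much like searching logged messages.
errors = [row for row in handler.rows.values() if row["level"] == "ERROR"]
```

In log4j the level and appender would normally be set in the external log4j.properties file rather than in code.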

About DataStax OpsCenter

DataStax OpsCenter is a browser-based user interface for monitoring and administering Cassandra and/or DataStax Enterprise clusters in a single centralized management console. The key features of OpsCenter include:

  • A Dashboard that displays an overview of commonly watched performance metrics.
  • An Overview page that shows a condensed view of each cluster’s Dashboard (only displayed when multiple clusters are monitored).
  • Basic cluster configuration.
  • Built-in external notification capabilities.
  • Administration tasks using simple point-and-click actions.
  • Re-balancing data across a cluster when new nodes are added.
  • Alert warnings of impending issues.
  • Automatic backup operations, including scheduling and removing of old backups.
  • Multiple cluster management from a single OpsCenter instance.

Getting Started with DataStax Enterprise

The fastest way to get started with DataStax Enterprise is to install it on a single node and run the Portfolio Manager demo application. For quick instructions on getting up and running on a single node, see:

For cluster installations of DataStax Enterprise, see:

To get started with the Hive and Pig MapReduce clients bundled with DataStax Enterprise, see:

Using DataStax Enterprise

These documents contain more in-depth information about using DataStax Enterprise:

Other Documentation References

Useful resources for using DataStax Enterprise:

DataStax Enterprise Demos

  • Portfolio Manager Demo - demonstrates a hybrid workflow using DataStax Enterprise.
  • Search Demo - demonstrates the Solr search capabilities using Wikipedia.
  • Sqoop Demo - migrates data from a MySQL database containing information from the North American Numbering Plan.
  • Log4j Search Demo - shows an example of searching and filtering log4j messages generated by a standard Java application.
  • Hive Demo - shows how to use Hive to access data in Cassandra.
  • Pig Demo - includes a sample data file containing tuples of two fields each (name and score). Using this file you create a Pig relation, perform a simple MapReduce job to calculate the total score for each user, and then put the result back into CassandraFS or into a Cassandra column family.
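The computation in the Pig Demo (group the tuples by name, then sum the scores) can be sketched in plain Python. This is only an analog of what the Pig job computes, not Pig Latin and not the demo's actual script; the sample tuples and the `total_scores` helper are invented for illustration.

```python
from collections import defaultdict

def total_scores(tuples):
    """Group (name, score) tuples by name and sum the scores for each name."""
    totals = defaultdict(int)
    for name, score in tuples:
        totals[name] += score
    return dict(totals)

# Hypothetical sample data: two-field tuples of (name, score).
sample = [("alice", 10), ("bob", 7), ("alice", 5), ("bob", 1), ("carol", 3)]
totals = total_scores(sample)
```

In the demo itself, the grouping and summing run as a MapReduce job and the result is written back to CassandraFS or a Cassandra column family rather than held in memory.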