DataStax Enterprise is a big data platform built on Apache Cassandra that manages real-time, analytics, and enterprise search data. DataStax Enterprise leverages Cassandra, Apache Hadoop, and Apache Solr to shift your focus from the data infrastructure to using your data strategically.
DataStax Enterprise 2.1 introduces enhanced multiple data-center Hadoop support, which includes the capability of running multiple job trackers across data centers and creating multiple Cassandra File System keyspaces per data center. Using this capability, you can keep metadata local to each data center for faster performance and use different keyspace replication configurations depending on the job. This release supports Mahout machine learning and data mining capabilities. Oracle Enterprise Linux has been added to the list of certified platforms.
The key features of DataStax Enterprise include:
Full Integration with DataStax OpsCenter - Using DataStax OpsCenter, you can monitor, administer, and configure one or more DataStax Enterprise clusters in an easy-to-use graphical interface. Schedule automatic backups, explore Cassandra data, and see detailed health and status information about clusters, such as the up or down status of nodes, graphs of performance metrics, storage limitations, and progress of Hadoop MapReduce jobs.
No Single Point of Failure - In the Hadoop Distributed File System (HDFS) master/slave architecture, the NameNode is the entry point into the cluster and stores configuration metadata about the cluster. If the NameNode fails, the Hadoop system goes down. DataStax Enterprise improves upon this architecture by making nodes peers. Because the nodes are peers, any node in the cluster can load data files, and any analytics node can assume the responsibilities of job tracker for MapReduce jobs.
Reserve Job Tracker - DataStax Enterprise keeps a job tracker in reserve to take over in the event of a problem that would affect availability.
Multiple Job Trackers - Using the Cassandra File System (CassandraFS), you can run one or more job tracker services across multiple data centers and create multiple CassandraFS keyspaces per data center. With this capability, you can keep metadata local to each data center for faster performance and use different keyspace replication configurations depending on the job.
Hadoop MapReduce Using Multiple Cassandra File Systems - CassandraFS is an HDFS-compatible storage layer. DataStax Enterprise replaces HDFS with CassandraFS to run MapReduce jobs on Cassandra's peer-to-peer, fault-tolerant, and scalable architecture. In DataStax Enterprise 2.1 and later, you can create additional Cassandra file systems to organize and optimize Hadoop data.
Analytics Without ETL - Using DataStax Enterprise, you run MapReduce jobs directly against your data in Cassandra. You can even run real-time and analytics workloads at the same time without one workload affecting the performance of the other. When you start some cluster nodes as Hadoop analytics nodes and others as pure Cassandra real-time nodes, data is replicated between the nodes automatically.
Elastic Workload Re-provisioning - Existing nodes can be re-provisioned to assume a different workload. For example, you can change two real-time/transactional nodes to analytics nodes during off-peak hours and then return them to the original configuration after the analytics tasks have completed.
Streamlined Setup and Operations - In Hadoop, you have to set up different mode configurations: stand-alone mode or pseudo-distributed mode for a single node setup, or cluster mode for a multi-node configuration. In DataStax Enterprise, you configure only one mode (cluster mode) regardless of the number of nodes.
Hive Support - Hive, a data warehouse system, facilitates data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Any JDBC-compliant user interface connects to Hive from the server. Using the Cassandra-enabled Hive MapReduce client in DataStax Enterprise, you project a relational structure onto Hadoop data in the Cassandra file systems, and query the data using a SQL-like language. Cassandra nodes share the Hive metastore automatically, eliminating repetitive Hive configuration steps.
Pig Support - The Cassandra-enabled Pig MapReduce client included with DataStax Enterprise is a high-level platform for creating MapReduce programs used with Hadoop. You can analyze large data sets by running jobs in MapReduce mode and by running Pig programs directly on data stored in Cassandra.
Enterprise Search Capabilities - DataStax Enterprise Search fully integrates Apache Solr for ad-hoc querying of data, full-text search, hit highlighting, multiple search attributes, geo-spatial search, searching rich documents such as PDF and Microsoft Word files, and more.
Migration of RDBMS Data - Apache Sqoop in DataStax Enterprise provides easy migration of data from RDBMS sources, such as Oracle, Microsoft SQL Server, MySQL, Sybase, and DB2, and from non-relational data sources, such as NoSQL databases, into the DataStax Enterprise server.
Runtime Logging - DataStax Enterprise transfers log-based data directly into the server using log4j. Apache log4j is a Java-based logging framework that provides runtime application feedback and control over the size of log statements. The Cassandra Appender can store log4j messages in a Cassandra table-like structure for in-depth analysis using the Hadoop and Solr capabilities.
Support for Mahout - Apache Mahout, a Hadoop component incorporated into DataStax Enterprise 2.1 and later, offers machine learning libraries. Machine learning improves a system based on past experience or examples; one example is a system that recreates the Google priority inbox.
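The peer architecture, reserve job tracker, and elastic workload re-provisioning features described above can be pictured with a small toy model. The following Python sketch is illustrative only: the `Node` and `Cluster` classes, their method names, and the "first eligible node" selection rule are all assumptions made for this example, not part of any DataStax Enterprise API, and this overview does not describe how DataStax Enterprise actually selects a job tracker.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare nodes by identity, not by field values
class Node:
    """Hypothetical cluster node; workload is "realtime" or "analytics"."""
    name: str
    workload: str
    up: bool = True


@dataclass
class Cluster:
    """Hypothetical peer cluster with one active and one reserve job tracker."""
    nodes: List[Node] = field(default_factory=list)
    job_tracker: Optional[Node] = None
    reserve: Optional[Node] = None

    def _eligible(self) -> List[Node]:
        # Assumption: only live analytics nodes may act as job tracker.
        return [n for n in self.nodes if n.up and n.workload == "analytics"]

    def _find(self, name: str) -> Node:
        return next(n for n in self.nodes if n.name == name)

    def elect(self) -> None:
        """Ensure there is an active job tracker and, if possible, a reserve."""
        eligible = self._eligible()
        if self.job_tracker not in eligible:
            # Promote the reserve if it is usable, else take any eligible peer.
            if self.reserve in eligible:
                self.job_tracker = self.reserve
            else:
                self.job_tracker = eligible[0] if eligible else None
        spare = [n for n in eligible if n is not self.job_tracker]
        if self.reserve not in spare:
            self.reserve = spare[0] if spare else None

    def fail(self, name: str) -> None:
        """Mark a node as down and re-run the election."""
        self._find(name).up = False
        self.elect()

    def reprovision(self, name: str, workload: str) -> None:
        """Give an existing node a different workload and re-run the election."""
        self._find(name).workload = workload
        self.elect()


cluster = Cluster([
    Node("n1", "analytics"),
    Node("n2", "analytics"),
    Node("n3", "realtime"),
])
cluster.elect()                          # n1 is active, n2 is held in reserve
cluster.fail("n1")                       # n2 takes over; no reserve remains
cluster.reprovision("n3", "analytics")   # n3 becomes the new reserve
print(cluster.job_tracker.name, cluster.reserve.name)  # prints "n2 n3"
```

In this sketch no node is special: losing the active job tracker only triggers a new election among the remaining analytics peers, and re-provisioning a real-time node simply enlarges the pool of candidates.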