Today, DataStax is announcing the introduction of DataStax Enterprise (DSE) Analytics Solo, a new offering to enable more flexible and cost-effective analytics processing of data stored in DataStax Enterprise.
DSE Analytics Solo delivers all of the powerful features of DSE Analytics that have allowed numerous companies to blend the functionality of a continuously available, scalable, distributed data layer with a powerful analytic processing engine. It is also designed to cover some of the new deployment modes our customers have recently adopted.
DSE Analytics Solo allows customers to deploy DSE Analytics processing on hardware segregated from the DataStax database, so that the two engines do not compete for compute resources and both behave consistently. This separation of compute and storage suits processing-intensive analytic workloads. DSE’s traditional collocated configuration, which adds analytic processing to the DataStax database without the need for additional hardware, suits cases where the analysis is less intensive or the database is less heavily used.
DSE Analytics Solo enables customers to:
- Leverage the same highly available, scalable, secure Apache Spark™ deployment that is included with DSE Analytics, including faster overall performance than open source Spark/Cassandra, a fault-tolerant resource manager with secured communications, the ability to create pools of resources grantable to particular users, and a continuously available, scalable, HDFS-compatible distributed file system.
- Deploy on dedicated hardware, ensuring segregated resources for both the database and the processing engine, and predictable performance of both.
- Have the flexibility to quickly and cost-effectively add more, or fewer, analytic processing nodes than database nodes, as the use case requires.
- Deploy analytic processing nodes via the same OpsCenter management suite that manages the database nodes.
DataStax started a new era of big data processing in DSE with the introduction of Apache Spark to the DataStax platform, replacing Apache Hadoop™, in DSE 4.5. Since then, DataStax has invested in improving the integration of Spark in every DSE release to include:
- A highly-available Spark Resource Manager allowing applications to be submitted at all times to any node, with minimal impact during failures;
- Secured communications between all Spark processes (Master, Worker, Driver, and Executor) leveraging the security of the DataStax database drivers for all communications;
- Continuously available, scalable, HDFS-compatible, distributed file system (DSEFS) with no single point of failure and no ZooKeeper dependencies;
- Ability to define pools of resources (workpools) and grant specific users permissions on specific workpools, restricting which users can run applications on which resources;
- Optimizations to more efficiently read from the DataStax database, up to 3x faster;
- Capability to leverage DSE Search indices for increased performance; and
- Spark Jobserver for REST submission and management of Spark applications, including the ability to share cached data between applications.
In DSE 5.1, the DSE Spark Resource Manager is able to support a variety of deployment configurations, including:
- Collocated: The traditional deployment mode, where all nodes in the Analytics data center run both Spark and the DataStax database, and a copy of the data has been replicated to this data center;
- DSE Analytics-Only Data Center: In this mode, a data center is DSE Analytics enabled but no data is replicated to it. The data center shares the same set of users and permissions as the rest of the cluster, but the data is not stored locally and is read from another data center in the same cluster. One advantage of this configuration is that scaling the DSE Analytics-Only Data Center up or down does not require moving the database data, making those operations much quicker;
- DSE Analytics-Only Cluster: This takes the decoupling of the DSE Analytics-Only Data Center scenario one step further. In this mode, the Spark cluster is completely separate from the DSE cluster. Data is read remotely, as in the DSE Analytics-Only Data Center scenario, but, since it is a separate cluster, the users are not necessarily the same.
The benefits of these non-collocated deployment configurations include:
- Segregated hardware to remove resource contention between the analytics engine and the database;
- Allowing configurations with more, or fewer, analytic processing nodes than database nodes;
- Easier addition, or removal, of analytic processing nodes to the cluster, requiring no database data movement when changing cluster size; and
- The ability to add multiple DSE Analytics-Only Data Centers to allocate separate analytic processing resources to different sets of users.
These new deployment configuration options allow administrators to choose a scenario that best suits their needs. If the analytic needs are light or the database is not overly busy, then the collocated configuration is probably suitable and is the simplest option.
If analytic workloads are heavier or more consistent, such as with stream processing scenarios, then a DSE Analytics-Only Data Center configuration is a good choice, since it will protect the database and analytic engine from competing with each other for resources, but will still retain the same user management. If the analytic need is very sporadic and as such the administrator would like it to be very lightly coupled, then a DSE Analytics-Only Cluster configuration could be a good choice.
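The guidance above can be condensed into a small decision helper. The following Python sketch is illustrative only: the `Deployment` enum, the `choose_deployment` function, and its two boolean inputs are hypothetical names introduced here, not part of any DataStax tool, and the order of the checks is one reasonable reading of the recommendations rather than a rule.

```python
from enum import Enum


class Deployment(Enum):
    """The three DSE 5.1 deployment configurations described in this post."""
    COLLOCATED = "Collocated"
    ANALYTICS_ONLY_DATA_CENTER = "DSE Analytics-Only Data Center"
    ANALYTICS_ONLY_CLUSTER = "DSE Analytics-Only Cluster"


def choose_deployment(heavy_or_continuous: bool,
                      sporadic_and_loosely_coupled: bool = False) -> Deployment:
    """Map a rough workload description to a suggested configuration.

    Hypothetical helper: both flags are simplifications of the prose guidance.
    """
    if heavy_or_continuous:
        # Heavy or steady work (for example stream processing) gets its own
        # data center while keeping user management shared with the database.
        return Deployment.ANALYTICS_ONLY_DATA_CENTER
    if sporadic_and_loosely_coupled:
        # Very occasional analysis that should stay lightly coupled.
        return Deployment.ANALYTICS_ONLY_CLUSTER
    # Light analytics, or a database that is not overly busy.
    return Deployment.COLLOCATED


print(choose_deployment(heavy_or_continuous=True).value)
```

In practice the choice also depends on factors this sketch ignores, such as hardware budget and how users are managed, so treat it as a summary of the text rather than a sizing tool.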
As an example, consider a user with a streaming application that processes incoming records and filters out 99.9% of them, persisting only 0.1% of the incoming data. This user may choose a DSE Analytics-Only Data Center configuration with more analytic processing nodes than database nodes, which is reasonable because so little of the data is persisted. The configuration also segregates the processing nodes from the database nodes to remove resource contention, and uses the same set of users for the database and the Spark cluster.
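To make the imbalance in this streaming example concrete, here is a minimal Python sketch of the arithmetic. The incoming rate of one million records per second is an assumed figure chosen purely for illustration, and `persisted_rate` is a hypothetical helper; only the 0.1% persisted fraction comes from the example itself.

```python
def persisted_rate(incoming_per_second: float, persisted_fraction: float) -> float:
    """Return how many records per second reach the database after filtering."""
    if not 0.0 <= persisted_fraction <= 1.0:
        raise ValueError("persisted_fraction must be between 0 and 1")
    return incoming_per_second * persisted_fraction


incoming = 1_000_000.0   # assumed stream rate, records per second (hypothetical)
fraction_kept = 0.001    # 0.1% of records are persisted, as in the example

written = persisted_rate(incoming, fraction_kept)
print(f"processed: {incoming:,.0f} records/s, written: {written:,.0f} records/s")
```

Under these assumed numbers the analytic nodes handle a thousand times more records than the database nodes have to store, which is why the two tiers can reasonably be sized differently.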
As another example, a user requiring a weekly report on the data stored in their DataStax database in a cloud deployment could deploy a DSE Analytics-Only Data Center configuration. This would allow the user to use the analytic engine to produce the reports, and then reduce the size of the analytic cluster (potentially removing it) during the week to reduce cloud hardware instance costs, spinning them back up to run the next report.
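The potential saving in this weekly-report scenario can be estimated with simple node-hour arithmetic. The sketch below is a rough illustration under stated assumptions: the node count and report duration are hypothetical values, `weekly_node_hours` is a helper introduced here, and node-hours are used as a stand-in for cloud instance cost, ignoring startup time, storage, and pricing differences.

```python
HOURS_PER_WEEK = 7 * 24


def weekly_node_hours(nodes: int, hours_running: float) -> float:
    """Node-hours consumed in one week by `nodes` instances running `hours_running` hours each."""
    return nodes * hours_running


analytic_nodes = 10   # hypothetical size of the analytics data center
report_hours = 6      # hypothetical time needed to produce the weekly report

always_on = weekly_node_hours(analytic_nodes, HOURS_PER_WEEK)
on_demand = weekly_node_hours(analytic_nodes, report_hours)
saved = 1 - on_demand / always_on

print(f"always on: {always_on} node-hours, on demand: {on_demand} node-hours")
print(f"fraction of analytic node-hours avoided: {saved:.1%}")
```

With these assumed values, running the analytic nodes only for the report uses 60 node-hours instead of 1,680 per week. The database nodes are unaffected, because no database data lives on the analytics data center.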
Note that the DSE Analytics-Only Data Center configuration is usually suitable for most scenarios that the DSE Analytics-Only Cluster configuration covers, and it is simpler to manage, since it uses the same users and permissions as the database nodes. For more technical information, see the DataStax documentation.
Analytics on Dedicated Nodes
To enable DSE Analytics-Only Data Center and DSE Analytics-Only Cluster deployments, the new DSE Analytics Solo option allows customers to add analytics processing nodes to their existing database nodes, but not store user data on those nodes. DSE Analytics Solo nodes use the same installation methods and configuration mechanisms as DSE Analytics, including OpsCenter, as well as advanced security features such as LDAP, but are licensed to only use DSE’s production-certified Spark processing engine and DSEFS and not to store DataStax database data or to use DSE Search or DSE Graph.