DataStax Enterprise 3.1 Documentation

Mixing workloads in a cluster

This documentation corresponds to an earlier product version. Make sure this document corresponds to your version.


A common question is how to use real-time (Cassandra), analytics (Hadoop), or search (Solr) nodes in the same cluster. Within the same data center, attempting to run Solr on some nodes and real-time queries or analytics on other nodes does not work. The answer is to organize the nodes running different workloads into virtual data centers.

Creating a virtual data center

Virtual data centers are a convenient way to organize workloads within a cluster. When you create a keyspace using CQL, you can set up virtual data centers regardless of which physical data center the individual nodes are in. You assign analytics nodes to one data center, search nodes to another, and Cassandra real-time nodes to yet another data center. Separate virtual data centers for different types of nodes segregate workloads running Solr from those running Cassandra real-time or Hadoop analytics applications. Segregating workloads ensures that only one type of workload is active per data center.
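As a minimal sketch of the segregation rule described above, the following Python snippet checks a node inventory for data centers in which search nodes share space with other workloads. The inventory format, node addresses, data center names, and function names are all hypothetical and chosen only for illustration; they are not part of DSE.

```python
from collections import defaultdict

# Hypothetical inventory: node address -> (virtual data center, workload).
# Addresses, data center names, and workload labels are illustrative only.
NODES = {
    "10.0.1.1": ("DC1", "cassandra"),
    "10.0.1.2": ("DC1", "analytics"),
    "10.0.3.1": ("DC3", "search"),
    "10.0.3.2": ("DC3", "search"),
}


def workloads_by_datacenter(nodes):
    """Group the workload types found in each virtual data center."""
    grouped = defaultdict(set)
    for datacenter, workload in nodes.values():
        grouped[datacenter].add(workload)
    return dict(grouped)


def unsegregated_search_datacenters(nodes):
    """Return data centers where search nodes share space with other workloads."""
    return sorted(
        datacenter
        for datacenter, workloads in workloads_by_datacenter(nodes).items()
        if "search" in workloads and len(workloads) > 1
    )


print(unsegregated_search_datacenters(NODES))  # → []
```

An empty list means every data center that contains search nodes contains only search nodes. Adding an analytics node to DC3 in this inventory would cause the function to report DC3.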

With separate data centers, some nodes can handle search while others handle MapReduce or act as ordinary Cassandra nodes. In this diagram, nodes in data centers 1 and 2 (DC 1 and DC 2) run a mix of:

  • Real-time queries (Cassandra and no other services)
  • Analytics (Cassandra and Hadoop)

Data centers 3 and 4 (DC 3 and DC 4) are dedicated to search.

[Image: dse_search_datacenter.png]

Cassandra ingests the data, Solr indexes the data, and you run MapReduce against that data, all in one cluster without having to do any manual extract, transform, and load (ETL) operations. Cassandra handles the replication and isolation of resources.

The Solr nodes run HTTP and hold the indexes for the column family data. If a Solr node goes down, the commit log replays the Cassandra inserts, which correspond to Solr inserts, and the node is restored automatically.

Workload segregation

The batch needs of Hadoop and the interactive needs of Solr are incompatible from a performance perspective, so these workloads need to be segregated. Cassandra real-time applications are also incompatible with DSE Search/Solr or Hadoop applications, but for a different reason. Their access patterns are dramatically different:

  • A Cassandra real-time application needs very rapid access to Cassandra data.

    The real-time application accesses data directly by key, in large sequential blocks, or in sequential slices.

  • A DSE Search/Solr application needs a broadcast or scatter model to perform full-index searching.

    Virtually every Solr search needs to hit a large percentage of the nodes in the virtual data center (depending on the replication factor, RF) to access data in the entire Cassandra table. Data from only a small number of rows is returned at a time.

To deploy a mixed workload cluster, see Multiple data center deployment.

Restrictions

Do not run Solr and Hadoop on the same node in either production or development environments.

In a cluster with multiple data centers, some of which are not running Solr, do not attempt to insert data to be indexed by Solr using CQL or Thrift from the Hadoop or Cassandra real-time nodes. Run the CQL or Thrift inserts on a Solr node.
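One way to follow the second restriction is to send indexed inserts only to search nodes. The sketch below picks those nodes out of a node inventory; the inventory, the addresses, and the function name are assumptions made for illustration and do not correspond to any DSE API.

```python
# Hypothetical inventory: node address -> workload type (illustrative only).
NODE_WORKLOADS = {
    "10.0.1.1": "cassandra",
    "10.0.2.1": "analytics",
    "10.0.3.1": "search",
    "10.0.3.2": "search",
}


def insert_contact_points(node_workloads):
    """Return the addresses of search nodes, to use as targets for indexed inserts."""
    targets = sorted(
        address
        for address, workload in node_workloads.items()
        if workload == "search"
    )
    if not targets:
        # Without a search node there is no valid target for indexed inserts.
        raise LookupError("no search nodes available for indexed inserts")
    return targets


print(insert_contact_points(NODE_WORKLOADS))  # → ['10.0.3.1', '10.0.3.2']
```

A client would then connect to one of the returned addresses when running CQL or Thrift inserts for data that Solr should index.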

Replicating data across data centers

You set up replication for Solr nodes exactly as you do for other nodes in a Cassandra cluster, by creating or altering a keyspace to define the replication strategy. Use CREATE KEYSPACE for a new keyspace or ALTER KEYSPACE for an existing one.
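As an illustration, the following Python sketch builds a CQL 3 CREATE KEYSPACE statement that uses NetworkTopologyStrategy with one replication factor per data center. The helper name, the keyspace name, and the data center names are hypothetical; the data center names in a real statement must match the names your snitch reports, and the replication factors shown are arbitrary.

```python
def create_keyspace_cql(keyspace, replicas_per_datacenter):
    """Build a CQL 3 CREATE KEYSPACE statement using NetworkTopologyStrategy.

    replicas_per_datacenter maps a data center name to its replication factor.
    """
    if not replicas_per_datacenter:
        raise ValueError("at least one data center is required")

    options = ["'class': 'NetworkTopologyStrategy'"]
    # Sort by name so the generated statement is deterministic.
    for datacenter, factor in sorted(replicas_per_datacenter.items()):
        if factor < 1:
            raise ValueError("replication factor must be at least 1")
        options.append("'%s': %d" % (datacenter, factor))

    return "CREATE KEYSPACE %s WITH replication = {%s};" % (
        keyspace,
        ", ".join(options),
    )


# Hypothetical keyspace and data center names.
print(create_keyspace_cql("demo", {"Solr": 2, "Cassandra": 3}))
# → CREATE KEYSPACE demo WITH replication = {'class': 'NetworkTopologyStrategy', 'Cassandra': 3, 'Solr': 2};
```

The generated statement can be run in cqlsh. Each data center listed in the replication map keeps its own number of replicas, so data written in one virtual data center is also replicated to the others named in the map.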