DataStax Enterprise 3.0 Documentation

Initializing a DataStax Enterprise Cluster on Amazon EC2

This is a step-by-step guide to using the Amazon Web Services EC2 Management Console to set up a DataStax Enterprise (DSE) cluster using the DataStax AMI (Amazon Machine Image). Installing via the AMI allows you to quickly deploy a cluster with a pre-configured mixed workload. When you launch the AMI, you can specify the total number of nodes in your cluster and how many nodes should be Real-Time/Transactional (Cassandra), Analytics (Hadoop), or Search (Solr).

For information about upgrading or expanding an existing installation, see Upgrading the DataStax AMI or Expanding a DataStax AMI cluster.

The DataStax AMI does the following:

  • Installs the latest DataStax Enterprise with an Ubuntu 12.04 LTS (Precise Pangolin) image (Ubuntu Cloud 20121218 release).
  • Uses RAID0 ephemeral disks for data storage and commit logs.
  • Uses the private interface for intra-cluster communication.
  • Starts the nodes in the specified mode (Real-time, Analytics, or Search).
  • Configures a Cassandra cluster using the RandomPartitioner.
  • Configures the replication strategy for a mixed-workload cluster. The DseDelegateSnitch sets DseSimpleSnitch as the default.
  • Sets the seed nodes cluster-wide.
  • Installs the DataStax OpsCenter on the first node in the cluster (by default).

Note

If you want an EC2 cluster that spans multiple regions and availability zones, do not use the DataStax AMI. Instead, install DataStax Enterprise on your EC2 instances as described in Installing the DataStax Enterprise package on Debian and Ubuntu, and then configure the cluster as a multiple data center cluster.

Production considerations

For production Cassandra clusters on EC2, use Large or Extra Large instances with local storage. RAID0 the ephemeral disks, and put both the data directory and the commit log on that volume. This has proved to be better in practice than putting the commit log on the root volume (which is also a shared resource). For more data redundancy, consider deploying your Cassandra cluster across multiple availability zones or using EBS volumes to store your Cassandra backup files.
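
The layout described above can be sketched as a dry run that only prints the commands involved. This is an illustration under stated assumptions, not what the DataStax AMI itself executes: the device names (/dev/xvdb, /dev/xvdc), the mount point (/raid0), the ext4 filesystem, and the directory names are all hypothetical. The cassandra.yaml settings named in the comments are data_file_directories and commitlog_directory.

```shell
#!/usr/bin/env bash
# Dry run: prints the commands instead of executing them.
# DEVICES and MOUNT_POINT are hypothetical; check your instance's real devices.
DEVICES="${DEVICES:-/dev/xvdb /dev/xvdc}"
MOUNT_POINT="${MOUNT_POINT:-/raid0}"

print_raid0_plan() {
  # Count the devices by splitting DEVICES on whitespace.
  set -- $DEVICES
  local count=$#

  echo "mdadm --create /dev/md0 --level=0 --raid-devices=$count $DEVICES"
  echo "mkfs.ext4 /dev/md0"   # filesystem choice is an assumption
  echo "mkdir -p $MOUNT_POINT"
  echo "mount /dev/md0 $MOUNT_POINT"
  # Data directory and commit log both live on the RAID0 volume.
  echo "# cassandra.yaml: data_file_directories -> $MOUNT_POINT/cassandra/data"
  echo "# cassandra.yaml: commitlog_directory   -> $MOUNT_POINT/cassandra/commitlog"
}

print_raid0_plan
```

Printing rather than running keeps the sketch safe to try on any machine; the printed commands would need root privileges and real device names before use.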

Creating an EC2 security group for DataStax Enterprise

An EC2 Security Group acts as a firewall that allows you to choose which protocols and ports are open in your cluster. You can specify the protocols and ports either by a range of IP addresses or by security group. The default EC2 security group opens all ports and protocols only to computers that are members of the default group. This means you must define a security group for your Cassandra cluster. Be aware that specifying a Source IP of 0.0.0.0/0 allows every IP address access by the specified protocol and port range.

  1. In your Amazon EC2 Console Dashboard, select Security Groups in the Network & Security section.

  2. Click Create Security Group. Fill out the name and description and then click Yes, Create.


    ../../_images/ami1_securitygroup_dse2.png
  3. Click the Inbound tab and add rules for the ports listed in the table below:

    • Create a new rule: Custom TCP rule.
    • Port range: See table.
    • Source: See table. To create rules that open a port to other nodes in the same security group, use the Group ID listed in the Group Details tab.

Port       Description

Public Facing Ports

  22       SSH (default)

  DataStax Enterprise Specific

  8012     Hadoop Job Tracker client port. The Job Tracker listens on this port for job submissions and communications from task trackers; allows traffic from each Analytics node in a cluster.
  8983     Solr port and Demo applications website port (Portfolio, Search, Search log)
  50030    Hadoop Job Tracker website port. The Job Tracker listens on this port for HTTP requests. If initiated from the OpsCenter UI, these requests are proxied through the opscenterd daemon; otherwise, they come directly from the browser.
  50060    Hadoop Task Tracker website port. Each Task Tracker listens on this port for HTTP requests coming directly from the browser and not proxied by the opscenterd daemon.

  OpsCenter Specific

  8888     OpsCenter website. The opscenterd daemon listens on this port for HTTP requests coming directly from the browser.

Inter-node Ports

  Cassandra Specific

  1024+    JMX reconnection/loopback ports. See description for port 7199.
  7000     Cassandra inter-node cluster communication.
  7199     Cassandra JMX monitoring port. After the initial handshake, the JMX protocol requires that the client reconnects on a randomly chosen port (1024+).
  9160     Cassandra client port (Thrift). OpsCenter agents make Thrift requests to their local node on this port. Additionally, the port can be used by the opscenterd daemon to make Thrift requests to each node in the cluster.

  DataStax Enterprise Specific

  9290     Hadoop Job Tracker Thrift port. The Job Tracker listens on this port for Thrift requests coming from the opscenterd daemon.

  OpsCenter Specific

  50031    OpsCenter HTTP proxy for Job Tracker. The opscenterd daemon listens on this port for incoming HTTP requests from the browser when viewing the Hadoop Job Tracker page directly.
  61620    OpsCenter monitoring port. The opscenterd daemon listens on this port for TCP traffic coming from the agent.
  61621    OpsCenter agent port. The agents listen on this port for SSL traffic initiated by OpsCenter.

Note

Generally, when you have firewalls between machines, it is difficult to run JMX across a network and maintain security. This is because JMX connects on port 7199, handshakes, and then uses any port within the 1024+ range. Instead, use SSH to run commands remotely so that they connect to JMX locally, or use DataStax OpsCenter.

  4. After you are done adding the above port rules, click Apply Rule Changes. Your completed port rules should look similar to this:


    ../../_images/ami2_securityports_dse2.png

    Warning

    The security configuration shown in the above example opens up all externally accessible ports to incoming traffic from any IP address (0.0.0.0/0). The risk of data loss is high. If you desire a more secure configuration, see the Amazon EC2 help on Security Groups.
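
For readers who prefer to script the rules, the port table above can be sketched with the AWS command line interface. This is a dry run that only prints `aws ec2 authorize-security-group-ingress` commands; it assumes the AWS CLI is available, and the group ID (sg-EXAMPLE) and the administrative address range (203.0.113.0/24, a documentation-only range) are hypothetical placeholders. Public-facing ports are opened to the address range, and inter-node ports are opened to members of the same security group, as described in step 3.

```shell
#!/usr/bin/env bash
# Dry run: prints AWS CLI commands instead of running them.
# GROUP_ID and ADMIN_CIDR are hypothetical placeholders.
GROUP_ID="${GROUP_ID:-sg-EXAMPLE}"
ADMIN_CIDR="${ADMIN_CIDR:-203.0.113.0/24}"

# Ports taken from the table above. The 1024+ JMX range is written as
# 1024-65535 and already covers the other inter-node ports; they are
# listed anyway to mirror the table.
PUBLIC_PORTS="22 8012 8983 50030 50060 8888"
INTERNODE_PORTS="1024-65535 7000 7199 9160 9290 50031 61620 61621"

print_rules() {
  local port
  for port in $PUBLIC_PORTS; do
    echo "aws ec2 authorize-security-group-ingress --group-id $GROUP_ID --protocol tcp --port $port --cidr $ADMIN_CIDR"
  done
  for port in $INTERNODE_PORTS; do
    echo "aws ec2 authorize-security-group-ingress --group-id $GROUP_ID --protocol tcp --port $port --source-group $GROUP_ID"
  done
}

print_rules
```

Restricting the public-facing rules to a known address range, rather than 0.0.0.0/0, is one way to reduce the exposure that the warning above describes.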

Launching the DataStax AMI

After you have created your security group, you are ready to launch an instance of Cassandra using the DataStax AMI.

  1. Right-click the following link to open the DataStax Amazon Machine Image page in a new window:

    https://aws.amazon.com/amis/datastax-auto-clustering-ami-2-2

  2. Click Launch AMI, then select the region where you want to launch the AMI.


    ../../_images/ami_launch.png
  3. On the Request Instances Wizard page, verify the settings and then click Continue.

  4. On the Instance Details page, enter the total number of nodes that you want in your cluster, select the Instance Type, and then click Continue.

    Use the following guidelines when selecting the type of instance:

    • Extra large for production.
    • Large for development and light production.
    • Small and Medium not supported.

../../_images/ami3_num_instances_dse2.png

Note

EBS volumes are not recommended for Cassandra data volumes. EBS throughput may fail on a saturated network link, I/O may be exceptionally slow, and adding capacity by increasing the number of EBS volumes per host does not scale. For more information and graphs related to ephemeral versus EBS performance, see the blog article at http://blog.scalyr.com/2012/10/16/a-systematic-look-at-ec2-io/.

  5. On the next page, under Advanced Instance Options, add the following options to the User Data section according to the type of cluster you want, and then click Continue.

    For new DataStax Enterprise clusters the available options are:

    Option

    Description

    Basic AMI Switches

    --clustername <name>

    Required. The name of the cluster.

    --totalnodes <#_nodes>

    Required. The total number of nodes in the cluster.

    --version [enterprise | community]

    Required. The version of the cluster. Use enterprise to install the latest version of DataStax Enterprise (DSE).

    DataStax Enterprise Switches

    --username <username>

    Required for DSE. DataStax registration username. Register at DataStax registration.

    --password <password>

    Required for DSE. DataStax registration password. Register at DataStax registration.

    --analyticsnodes <#_node>

    Optional for DSE. For mixed-workload clusters, the number of Analytics (Hadoop) nodes. Default: 0

    --searchnodes <#_num>

    Optional for DSE. For mixed-workload clusters, the number of Search (Solr) nodes. Default: 0

    Advanced Switches

    --release <release_version>

    Optional for DSE. Allows for the installation of a previous DSE version. Example: 1.0.2-1 Default: Ignored

    --cfsreplicationfactor <#_num>

    Optional for DSE. Sets the replication factor for the CFS keyspace. This number must be less than or equal to the number of Analytics nodes. Default: 1

    --opscenter [no]

    Optional. By default, DataStax OpsCenter is installed on the first instance. Specify no to disable.

    --reflector <url>

    Optional. Allows you to use your own reflector. Default: http://reflector2.datastax.com/reflector2.php

    For example:

    --clustername myDSEcluster --totalnodes 6 --version enterprise --username my_name
    --password my_password --analyticsnodes 2 --searchnodes 2
    

../../_images/ami4_options_dse2.png
  6. On the Storage Device Configuration page, you can add ephemeral drives if needed.

    Amazon Web Services recently reduced the number of default ephemeral disks attached to the image from four to two. Performance will be slower for new nodes unless you manually attach the additional two disks; see Amazon EC2 Instance Store.

  7. On the Tags page, give a name to your DSE instance, such as mixed-workload-dse, and then click Continue.

  8. On the Create Key Pair page, create a new key pair or select an existing key pair, and then click Continue. Save this key (.pem file) to your local machine; you will need it to log in to your DataStax Enterprise instance.

  9. On the Configure Firewall page, select the DSE security group that you created earlier and click Continue.

  10. On the Review page, review your cluster configuration and then click Launch.

  11. Close the Launch Install Wizard and go to the My Instances page to see the status of your DSE instance. Once a node has a status of running, you can connect to it.
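
The User Data string from the options step can also be assembled with a small helper, which makes it easier to catch an inconsistent node count before launching. The sketch below is a minimal illustration: build_user_data is a hypothetical function, not part of the DataStax AMI, and the check that Analytics plus Search nodes do not exceed the total is an assumption inferred from the option descriptions. Only switches listed in the options table are used.

```shell
#!/usr/bin/env bash
# build_user_data is a hypothetical helper, not part of the DataStax AMI.
# Arguments: cluster name, total nodes, Analytics nodes, Search nodes,
#            DataStax registration username, DataStax registration password.
build_user_data() {
  local name="$1" total="$2" analytics="$3" search="$4" user="$5" pass="$6"

  # Assumed sanity check: dedicated workload nodes cannot outnumber the cluster.
  if [ $((analytics + search)) -gt "$total" ]; then
    echo "analytics + search nodes exceed total nodes" >&2
    return 1
  fi

  # Join the three groups of switches into a single line.
  printf '%s %s %s\n' \
    "--clustername $name --totalnodes $total --version enterprise" \
    "--username $user --password $pass" \
    "--analyticsnodes $analytics --searchnodes $search"
}

# Reproduces the example from the options step: 6 nodes, 2 Analytics, 2 Search.
build_user_data myDSEcluster 6 2 2 my_name my_password
```

The remaining nodes (two in this example) start as Real-Time (Cassandra) nodes. The printed line can be pasted into the User Data field as-is.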

Connecting to your DataStax Enterprise EC2 instance

You can connect to your new DSE EC2 instance using any SSH client, such as PuTTY, or from a terminal. To connect, you will need the private key (the .pem file you created earlier) and the public DNS name of a node.

Connect as user ubuntu rather than as root.

If this is the first time you are connecting, copy your private key file (<keyname>.pem) you downloaded earlier to your home directory, and change the permissions so it is not publicly viewable. For example:

chmod 400 dsekey.pem
  1. From the My Instances page in your AWS EC2 Dashboard, select the node that you want to connect to.

    Because all nodes are peers in DSE, you can connect using any node in the cluster. However, the first node generally runs OpsCenter and is the Cassandra seed node.


    ../../_images/ami_connect1_dse2.png
  2. To get the public DNS name of a node, select Instance Actions > Connect.

  3. In the Connect Help - Secure Shell (SSH) page, copy the command line and change the connection user from root to ubuntu, then paste it into your SSH client.


    ../../_images/ami_connect2_dse2.png
  4. The AMI configures your cluster and starts the Cassandra, Hadoop, Solr, and OpsCenter services. After you have logged into a node, run the nodetool ring -h localhost command (nodetool) to make sure your cluster is running.


    ../../_images/ami_node_ring2.png
  5. If you installed the OpsCenter with your DSE cluster, allow about 60 to 90 seconds after the cluster has finished initializing for OpsCenter to start. You can launch OpsCenter using the URL: http://<public-dns-of-first-instance>:8888.


    ../../_images/ami_instance2.png
  6. After the OpsCenter loads, you must install the OpsCenter agents to see the cluster performance data:

    1. Click the Fix link located near the top of the Dashboard in the left navigation pane to install the agents.


      ../../_images/agent_initial_fix_dse.png
    2. When prompted for credentials for the agent nodes, use the username ubuntu and copy and paste the entire contents of your private key (.pem) file that you downloaded earlier.


      ../../_images/ami_opscenter2_dse.png
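
The connection steps above can be condensed into a few helper functions that print the commands and URL to use. This is a sketch under assumptions: the function names are hypothetical, the key file name follows the dsekey.pem example, and the public DNS name shown is a made-up placeholder. The remote form of the ring check also matches the earlier advice to run JMX-based tools over SSH rather than across the network.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that print the commands described in the steps above.
KEY_FILE="${KEY_FILE:-$HOME/dsekey.pem}"
# Placeholder public DNS name; replace with the value from the EC2 console.
NODE_DNS="${NODE_DNS:-ec2-203-0-113-10.compute-1.amazonaws.com}"

# Interactive login as user ubuntu (not root).
ssh_command() {
  printf 'ssh -i %s ubuntu@%s\n' "$KEY_FILE" "$1"
}

# Runs the ring check on the node itself, so JMX stays local to that node.
ring_check_command() {
  printf "ssh -i %s ubuntu@%s 'nodetool ring -h localhost'\n" "$KEY_FILE" "$1"
}

# OpsCenter is served on port 8888 of the first instance.
opscenter_url() {
  printf 'http://%s:8888\n' "$1"
}

ssh_command "$NODE_DNS"
ring_check_command "$NODE_DNS"
opscenter_url "$NODE_DNS"
```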