Apache Cassandra 0.8 Documentation

Introduction to Apache Cassandra

Apache Cassandra is a free, open-source, distributed storage system for managing large amounts of structured data. It differs from traditional relational database management systems in some significant ways. Cassandra is designed to scale to a very large size across many commodity servers, with no single point of failure, and provides a simple schema-optional data model designed to allow maximum power and performance at scale.

The History of Cassandra

Cassandra was created to solve the problem of inbox search at Facebook. It combines Amazon Dynamo’s fully distributed design with Google Bigtable’s column-oriented data model. Facebook open-sourced Cassandra in 2008, and it subsequently became an Apache Incubator project. In early 2010, Cassandra became a top-level Apache project. Today there are hundreds of Cassandra deployments in production, including at companies such as Netflix, Twitter, Rackspace, and Cisco.

Core Strengths of Cassandra

Elastic and Scalable

Read and write throughput both increase linearly as new machines are added, with no downtime, interruption, or reconfiguration of applications.

Reliable

Cassandra was designed with the expectation that hardware is unreliable and nodes will fail. It can tolerate multiple node failures without downtime, and failed nodes can be replaced while the cluster is online. Cassandra supports replication across data centers, which keeps data redundant across geographic regions. It uses a built-in accrual failure detector to monitor the health of nodes within the cluster. Because all nodes are symmetric and there are no master nodes, there is no single point of failure, and no special failover actions are required to handle node failures.
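
To illustrate the idea behind an accrual failure detector, here is a minimal Python sketch. Instead of returning a yes/no answer, it emits a suspicion level (phi) that grows the longer a node stays silent. The class name, the exponential model of heartbeat intervals, and the threshold of 8 are assumptions made for illustration; Cassandra's own detector may differ in detail.

```python
import math
from collections import deque


class AccrualFailureDetector:
    """Hypothetical sketch: suspicion grows with time since the last heartbeat."""

    def __init__(self, window=1000):
        # Keep only the most recent inter-arrival times.
        self._intervals = deque(maxlen=window)
        self._last = None

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        if self._last is not None:
            self._intervals.append(now - self._last)
        self._last = now

    def phi(self, now):
        """Return the suspicion level at time `now`."""
        if not self._intervals:
            return 0.0
        mean = sum(self._intervals) / len(self._intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self._last
        # Assumed model: intervals are exponentially distributed, so
        # P(silence >= elapsed) = exp(-elapsed / mean) and
        # phi = -log10(P) = elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))


PHI_THRESHOLD = 8  # assumed conviction threshold for this example

detector = AccrualFailureDetector()
for t in (0.0, 1.0, 2.0, 3.0):      # heartbeats arriving once per second
    detector.heartbeat(t)

phi_soon = detector.phi(4.0)         # one second of silence: low suspicion
phi_late = detector.phi(23.0)        # twenty seconds of silence: high suspicion
node_suspected = phi_late > PHI_THRESHOLD
```

Because phi is continuous, the threshold can be raised on networks with irregular latency and lowered where fast detection matters more than occasional false positives.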

Durable

Durability is the property that writes, once completed, will survive permanently even in the face of hardware failure. Cassandra provides configurable durability: it first appends each write to a commit log (a sequential operation that avoids disk seeks), then uses the fsync system call to flush the data to disk.
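
The append-then-fsync pattern described above can be sketched in a few lines of Python. This is a toy log, not Cassandra's commit log format: the CommitLog class, the length-prefixed record layout, and the sync_on_write flag are hypothetical names introduced for illustration.

```python
import os
import tempfile


class CommitLog:
    """Toy append-only log (hypothetical; not Cassandra's on-disk format)."""

    def __init__(self, path, sync_on_write=True):
        # Append mode: every write lands at the end of the file, so the
        # disk head never has to seek to an earlier position.
        self._file = open(path, "ab")
        self._sync_on_write = sync_on_write

    def append(self, record: bytes) -> None:
        # Length-prefix each record so the log can be replayed later.
        self._file.write(len(record).to_bytes(4, "big") + record)
        if self._sync_on_write:
            self.sync()

    def sync(self) -> None:
        self._file.flush()              # push Python's buffer to the OS
        os.fsync(self._file.fileno())   # ask the OS to flush to disk

    def close(self) -> None:
        self.sync()
        self._file.close()


def replay(path):
    """Read back every record, as a node might after a restart."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            size = int.from_bytes(header, "big")
            records.append(f.read(size))
    return records


log_path = os.path.join(tempfile.mkdtemp(), "commit.log")
log = CommitLog(log_path)
log.append(b"row1:name=alice")
log.append(b"row2:name=bob")
log.close()
recovered = replay(log_path)
```

Setting sync_on_write to False and calling sync() on a timer trades a small window of possible data loss for higher write throughput, which is the kind of trade-off that "configurable durability" refers to.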

Analytics Without ETL

Hadoop jobs can be executed directly against data stored in Cassandra without impacting performance of your real-time applications.

Performant

Consistency is tunable per operation, allowing consistency to be traded for higher availability as desired. There are no reads or seeks in the write path. Cassandra’s built-in caching allows for precise tuning of read performance for specific workloads and data models.
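
As a rough illustration of tunable consistency, the following Python sketch computes how many replicas must respond at a few consistency levels and checks whether a read is guaranteed to overlap with the most recent write. The function names are hypothetical, and only three levels (ONE, QUORUM, ALL) are modeled; the rule used is the standard one that reads and writes overlap when R + W > N, where N is the replication factor.

```python
# Replicas that must acknowledge an operation at each level,
# as a function of the replication factor (rf).
CONSISTENCY_LEVELS = {
    "ONE": lambda rf: 1,
    "QUORUM": lambda rf: rf // 2 + 1,
    "ALL": lambda rf: rf,
}


def required_replicas(level, replication_factor):
    """Number of replicas that must respond for the operation to succeed."""
    return CONSISTENCY_LEVELS[level](replication_factor)


def reads_see_latest_write(read_level, write_level, replication_factor):
    """True when every read set must intersect every write set (R + W > N)."""
    r = required_replicas(read_level, replication_factor)
    w = required_replicas(write_level, replication_factor)
    return r + w > replication_factor


def tolerated_failures(level, replication_factor):
    """Replicas that can be down while the operation still succeeds."""
    return replication_factor - required_replicas(level, replication_factor)


rf = 3
quorum_size = required_replicas("QUORUM", rf)
strong = reads_see_latest_write("QUORUM", "QUORUM", rf)
weak = reads_see_latest_write("ONE", "ONE", rf)
```

With a replication factor of 3, QUORUM reads and writes overlap on at least one replica while still tolerating one failed node; ONE tolerates two failed nodes but gives no such overlap guarantee.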

Getting Started with Cassandra

To learn more about Cassandra and how it works, see the following topics:

The easiest way to get started with Cassandra is to install it on a single node (see Installing Cassandra Using the Packaged Releases or Installing the Cassandra Binary Distribution) and then start a single-node instance (see Initializing a Single-Node Cluster (for evaluation purposes)).

DataStax also provides an Amazon Machine Image (AMI) so that you can quickly get a Cassandra cluster up and running on Amazon EC2. See Initializing a Cassandra Cluster on Amazon EC2 Using the DataStax AMI.

After you have a Cassandra instance up and running, you can practice some data definition (DDL) and data manipulation (DML) commands using one of the client interfaces packaged with Cassandra. See: