This FAQ answers questions about Apache Cassandra, its architecture, its data model, and developing with Cassandra.
Apache Cassandra is a high performance, extremely scalable, fault tolerant (i.e. no single point of failure), distributed post-relational database solution. Cassandra combines all the benefits of Google Bigtable and Amazon Dynamo to handle the types of database management needs that traditional RDBMS vendors cannot support. From a commercial software standpoint, DataStax is the leading worldwide commercial provider of Cassandra products, services, support, and training.
There are many technical benefits that come from using Cassandra. See our white papers for more detail.
Downloading and installing Cassandra is very easy. Downloads of Cassandra are available via the DataStax web site at: http://www.datastax.com/download.
For installation guidance, please see our online documentation at: http://www.datastax.com/docs/1.0/getting_started/index and http://www.datastax.com/docs/1.0/install/index. You can also view a guided video tutorial for installing a simple Cassandra and DataStax OpsCenter setup at: http://www.datastax.com/resources/tutorials.
Starting Cassandra involves connecting to the machine where it is installed with the proper security credentials, and invoking the cassandra executable from the installation’s binary directory. An example of starting Cassandra on Mac could be:
sudo /Applications/Cassandra/apache-cassandra-0.8.1/bin/cassandra
The basic command line interface (CLI) for logging into and executing commands against Cassandra is the cassandra-cli utility, which is found in the software installation’s bin directory.
An example of logging into a local machine’s Cassandra installation using the CLI and the default Cassandra port might be:
Welcome to the Cassandra CLI.
Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.
[default@unknown] connect localhost/9160;
Connected to: "Test Cluster" on localhost/9160
[default@unknown]
Cassandra can be used in many different data management situations. Some of the most common use cases for Cassandra include:
Cassandra is typically not the choice for transactional data that needs per-transaction commit/rollback capabilities. Note that Cassandra does have atomic transactional abilities on a per row/insert basis (but with no rollback capabilities).
The primary difference between Cassandra and Hadoop is that Cassandra targets real-time/operational data, while Hadoop has been designed for batch-based analytic work.
There are many different technical differences between Cassandra and Hadoop, including Cassandra’s underlying data structure (based on Google’s Bigtable), its fault-tolerant, peer-to-peer architecture, multi-data center capabilities, tunable data consistency, all nodes being the same (no concept of a namenode, etc.) and much more.
HBase is an open-source, column-oriented data store modeled after Google Bigtable, and is designed to offer Bigtable-like capabilities on top of data stored in Hadoop. However, while HBase shared the Bigtable design with Cassandra, its foundational architecture is much different.
A Cassandra cluster is much easier to setup and configure than a comparable HBase cluster. HBase’s reliance on the Hadoop namenode equates to there being a single point of failure in HBase, whereas with Cassandra, because all nodes are the same, there is no such issue.
In internal performance tests conducted at DataStax (using the Yahoo Cloud Serving Benchmark – YCSB), Cassandra offered literally 5X better performance in writes and 4X better performance on reads than HBase.
MongoDB is a document-oriented database that is built upon a master-slave/sharding architecture. MongoDB is designed to store/manage collections of JSON-styled documents.
By contrast, Cassandra uses a peer-to-peer, write/read-anywhere styled architecture that is based on a combination of Google BigTable and Amazon Dynamo. This allows Cassandra to avoid the various complications and pitfalls of master/slave and sharding architectures. Moreover, Cassandra offers linear performance increases as new nodes are added to a cluster, scales to terabyte-petabyte data volumes, and has no single point of failure.
DataStax Community Edition is a free software bundle from DataStax that combines Apache Cassandra with a number of developer and management tools provided by DataStax, which are designed to get someone up and productive with Cassandra in very little time. DataStax Community Edition is provided for open source enthusiasts and is not recommended for production use as it is not formally supported by DataStax’ production support staff.
DataStax Enterprise is the commercial product offering from DataStax that is designed for enterprise-class, production usage that combines Apache Cassandra with real-time analytics capabilities and smart mixed workload management. DataStax Enterprise also provides full 24×7 production support, consultative assistance, certified maintenance updates, and much more.
Cassandra has been built from the ground up to be a fault tolerant, peer-to-peer database that offers no single point of failure. Cassandra can automatically replicate data between nodes to offer data redundancy. It also offers built-in intelligence to replicate data between different physical server racks (so that if one rack goes down the data on other racks is safe) as well as between geographically dispersed data centers, and/or public Cloud providers and on-premises machines, which offers the strongest possible uptime and disaster recovery capabilities:
Cassandra does not use a master/slave architecture, but instead uses a peer-to-peer implementation, which avoids the pitfalls, latency problems, single point of failure issues, and performance headaches associated with master/slave setups.
Replication is the process of storing copies of data on multiple nodes to ensure reliability and fault tolerance. When you create a keyspace in Cassandra, you must decide the replica placement strategy: the number of replicas and how those replicas are distributed across nodes in the cluster. The replication strategy relies on the cluster-configured snitch to help it determine the physical location of nodes and their proximity to each other.
The total number of replicas across the cluster is often referred to as the replication factor. A replication factor of 1 means that there is only one copy of each row. A replication factor of 2 means two copies of each row. All replicas are equally important; there is no primary or master replica in terms of how read and write requests are handled.
Replication options are defined when you create a keyspace in Cassandra. The snitch is configured per node.
Cassandra provides a number of options to partition your data across nodes in a cluster.
The RandomPartitioner is the default partitioning strategy for a Cassandra cluster. It uses a consistent hashing algorithm to determine which node will store a particular row. The end result is an even distribution of data across a cluster.
The ByteOrderedPartitioner ensures that row keys are stored in sorted order. It is not recommended for most use cases and can result in uneven distribution of data across a cluster.
A seed node in Cassandra is a node that is contacted by other nodes when they first start up and join the cluster. A cluster can have multiple seed nodes. Cassandra uses a protocol called gossip to discover location and state information about the other nodes participating in a Cassandra cluster. When a node first starts, it contacts a seed node to bootstrap the gossip communication process. The seed node designation has no purpose other than bootstrapping new nodes joining the cluster. Seed nodes are not a single point of failure.
The snitch is a configurable component of a Cassandra cluster used to define how the nodes are grouped together within the overall network topology (such as rack and data center groupings). Cassandra uses this information to route inter-node requests as efficiently as possible within the confines of the replica placement strategy. The snitch does not affect requests between the client application and Cassandra (it does not control which node a client connects to).
Cassandra is capable of offering linear performance benefits when new nodes are added to a cluster.
A new machine can be added to an existing cluster by installing the Cassandra software on the server and configuring the new node so that it knows (1) the name of the Cassandra cluster it is joining; (2) the seed node(s) it should obtain its data from; (3) the range of data that it is responsible for, which is done by assigning a token to the node.
Please see the online documentation about how to assign a token to a new node and the various use cases that dictate the complexity of token assignment.
Note that OpsCenter is capable of automatically rebalancing the data across all nodes in a cluster when new nodes are added.
Nodes can be removed from a Cassandra cluster by using the nodetool utility and issuing a decommission command. This can be done without affecting the overall operations or uptime of the cluster.
Cassandra uses gossip state information to locally determine if another node in the system is up or down. This failure detection information is used by Cassandra to avoid routing client requests to unreachable nodes whenever possible.
The gossip inter-node communication process tracks heartbeats from other nodes both directly (nodes gossiping directly to it) and indirectly (nodes heard about secondhand, thirdhand, and so on). Rather than have a fixed threshold for marking nodes without a heartbeat as down, Cassandra uses an accrual detection mechanism to calculate a per-node threshold that takes into account network conditions, workload, or other conditions that might affect the perceived heartbeat rate.
Node failures can result from various causes such as hardware failures, network outages, and so on. Node outages are often transient but can last for extended intervals. A node outage rarely signifies a permanent departure from the cluster, and therefore does not automatically result in permanent removal of the failed node from the cluster. Other nodes will still try to periodically initiate gossip contact with failed nodes to see if they are back up.
When a node comes back online after an outage, it may have missed writes for the replica data it maintains. Writes missed due to short, transient outages are saved for a period of time on other replicas and replayed on the failed host once it recovers using Cassandra’s built-in hinted handoff feature. If a node is down for an extended period, an administrator can run the nodetool repair utility after the node is back online to ‘catch it up’ with its corresponding replicas.
To permanently change a node’s membership in a cluster, administrators must explicitly remove a node from a Cassandra cluster using the nodetool management utility.
Cassandra can easily replicate data between different physical datacenters by creating a keyspace that uses the replication strategy currently termed NetworkTopologyStrategy. This strategy allows you to configure Cassandra to automatically replicate data to different data centers and even different racks within datacenters to protect against specific rack/physical hardware failures causing a cluster to go down. It can also replicate data between public Clouds and on-premises machines.
The main Cassandra configuration file is the cassandra.yaml file, which houses all the main options that control how Cassandra operates.
Cassandra’s architecture make it perfect for full Cloud deployments as well as hybrid implementations that store some data in the Cloud and other data on premises.
DataStax provides an Amazon AMI that allows you to quickly deploy a Cassandra cluster on EC2. See the online documentation for a step-by-step guide to installing a Cassandra cluster on Amazon.
Cassandra negates the need for extra software caching layers like memcached through its distributed architecture, fast write throughput capabilities, and internal memory caching structures.
Cassandra has been architecture for consuming large amounts of data as fast as possible. To accomplish this, Cassandra first writes new data to a commit log to ensure it is safe. After that, the data is then written to an in-memory structure called a memtable. Cassandra deems the write successful once it is stored on both the commit log and a memtable, which provides the durability required for mission-critical systems.
Once a memtable‘s memory limit is reached, all writes are then written to disk in the form of an SSTable (sorted strings table). An SSTable is immutable, meaning it is not written to ever again. If the data contained in the SSTable is modified, the data is written to Cassandra in an upsert fashion and the previous data automatically removed.
Because SSTables are immutable and only written once the corresponding memtable is full, Cassandra avoids random seeks and instead only performs sequential IO in large batches, resulting in high write throughput.
A related factor is that Cassandra doesn’t have to do a read as part of a write (i.e. check index to see where current data is). This means that insert performance remains high as data size grows, while with b-tree based engines (e.g. MongoDB) it deteriorates.
Cassandra is architected in a peer-to-peer fashion and uses a protocol called gossip to communicate with other nodes in a cluster. The gossip process runs every second to exchange information across the cluster.
Gossip only includes information about the cluster itself (up/down, joining, leaving, version, schema, etc.) and does not manage the data. Data is transferred node-to-node using a message passing like protocol on a distinct port from what client applications connect to. The Cassandra partitioner turns a column family key into a token, the replication strategy picks the set of nodes responsible for that token (using information from the snitch) and Cassandra sends messages to those replicas with the request (read or write).
The gossip protocol is used to determine the state of all nodes in a cluster and if a particular node has gone down.
The gossip process tracks heartbeats from other nodes and uses an accrual detection mechanism to calculate a per-node threshold that takes into account network conditions, workload, or other conditions that might affect perceived heartbeat rate before a node is actually marked as down.
The configuration parameter phi_convict_threshold in the cassandra.yaml file is used to control Cassandra’s sensitivity of node failure detection. The default value is appropriate for most situations. However in Cloud environments, such as Amazon EC2, the value should be increased to 12 in order to account for network issues that sometimes occur on such platforms.
Yes, data compression is available with Cassandra 1.0 and above. The snappy compression algorithm from Google is used and is able to deliver fairly impressive storage savings, in some cases compressing raw data up to 80+% with no performance penalties for read/write operations. In fact, because of the reduction in physical I/O, compression actually increases performance in some use cases. Compression is enabled/disabled on a per-column family basis and is not enabled by default.
Currently, the most common method for backing up data in Cassandra is using the snapshot function in the nodetool utility. This is an online operation and does not require any downtime or block any operations on the server.
Snapshots are sent by default to a snapshots directory that is located in the Cassandra data directory (controlled via the data_file_directories in the cassandra.yaml file). Once taken, snapshots can be moved off-site to be protected.
Incremental backups (i.e. data backed up since the last full snapshot) can be performed by setting the incremental_backups parameter in the cassandra.yaml file to ‘true’. When incremental backup is enabled, Cassandra copies every flushed SSTable for each keyspace to a backup directory located under the Cassandra data directory. Restoring from an incremental backup involves first restoring from the last full snapshot and then copying each incremental file back into the Cassandra data directory.
In general, restoring a Cassandra node is done by first following these procedures:
rm /var/lib/cassandra/commitlog/*)rm /var/lib/cassandra/data/keyspace1/*.db). Take care so as not to remove the snapshot directory for the keyspacecp -p /var/lib/cassandra/data/keyspace1/snapshots/56046198758643-snapshotkeyspace1/* /var/lib/cassandra/data/keyspace1)Currently, no uninstaller exists for Cassandra. Therefore, removing Cassandra from a machine consists of the manual deletion of the Cassandra software, data, and log files.
Yes. First, data durability is fully supported in Cassandra so that any data written to a database cluster is first written to a commit log in the same fashion as nearly every popular RDBMS does.
Second, Cassandra offers tunable data consistency so that a developer or administrator can choose how strong they wish consistency across nodes to be. The strongest form of consistency is to mandate that any data modifications be made to all nodes, with any unsuccessful attempt on a node resulting in a failed data operation. Cassandra provides consistency in the CAP sense in that all readers will see the same values.
Other forms of tunable consistency involve having a quorum of nodes written to or just one node for the loosest form of consistency. Cassandra is very flexible and allows data consistency to be chosen on a per operation basis if needed so that very strong consistency can be used when desired, or very loose consistency can be utilized when the use case permits.
In Cassandra, consistency refers to how up-to-date and synchronized a row of data is on all of its replicas. Cassandra offers a number of built-in features to ensure data consistency:
Cassandra extends the concept of ‘eventual consistency’ by offering ‘tunable consistency’. For any given read or write operation, the client application decides how consistent the requested data should be.
Consistency levels in Cassandra can be set on any read or write query. This allows application developers to tune consistency on a per-query basis depending on their requirements for response time versus data accuracy. Cassandra offers a number of consistency levels for both reads and writes.
Choosing a consistency level for reads and writes involves determining your requirements for consistent results (always reading the most recently written data) versus read or write latency (the time it takes for the requested data to be returned or for the write to succeed).
If latency is a top priority, consider a consistency level of ONE (only one replica node must successfully respond to the read or write request). There is a higher probability of stale data being read with this consistency level (as the replicas contacted for reads may not always have the most recent write). For some applications, this may be an acceptable trade-off.
If consistency is top priority, you can ensure that a read will always reflect the most recent write by using the following formula:
(nodes_written + nodes_read) > replication_factor
For example, if your application is using the QUORUM consistency level for both write and read operations and you are using a replication factor of 3, then this ensures that 2 nodes are always written and 2 nodes are always read. The combination of nodes written and read (4) being greater than the replication factor (3) ensures strong read consistency.
With respect to loading external data, Cassandra supplies a load utility called the sstableloader. The sstableloader is able to load flat files into Cassandra, however the files must first be converted into SSTable format. An example of how to do this can be found at: http://www.datastax.com/dev/blog/bulk-loading.
Most RDBMS’s have an unload utility that allows data to be unloaded to flat files. Once in flat file format, the sstableloader utility can be used to load the data into Cassandra column families.
Some developers write programs to connect to both an RDBMS and Cassandra and move data in that way.
Read operations trigger consistency checks across all replicas for a requested row using a process called read repair. For reads, there are two types of read requests that a coordinator node can send to a replica; a direct read request and a background read repair request. The number of replicas contacted by a direct read request is determined by the read consistency level specified by the client. Background read repair requests are sent to any additional replicas that did not receive a direct request. Read repair requests ensure that the requested row is made consistent on all replicas. Read repair is an optional feature and can be configured per column family.
There are a number of CQL (Cassandra Query Language) drivers and native client libraries available for most all popular development languages (e.g. Java, Ruby, etc.) All drivers and client libraries can be downloaded from: http://www.datastax.com/download/clientdrivers.
The Cassandra data model is a schema-optional, column-oriented data model. This means that, unlike a relational database, you do not need to model all of the columns required by your application up front, as each row is not required to have the same set of columns. Columns and their metadata can be added by your application as they are needed without incurring downtime to your application.
Although it is natural to want to compare the Cassandra data model to a relational database, they are really quite different. In a relational database, data is stored in tables and the tables comprising an application are typically related to each other. Data is usually normalized to reduce redundant entries, and tables are joined on common keys to satisfy a given query.
In Cassandra, the keyspace is the container for your application data, similar to a database or schema in a relational database. Inside the keyspace are one or more column family objects, which are analogous to tables. Column families contain columns, and a set of related columns is identified by an application-supplied row key. Each row in a column family is not required to have the same set of columns.
Cassandra does not enforce relationships between column families the way that relational databases do between tables: there are no formal foreign keys in Cassandra, and joining column families at query time is not supported. Each column family has a self-contained set of columns that are intended to be accessed together to satisfy specific queries from your application.
In a relational database, you must specify a data type for each column when you define a table. The data type constrains the values that can be inserted into that column. For example, if you have a column defined as an integer datatype, you would not be allowed to insert character data into that column.
In Cassandra, you can specify a data type for both the column name (called a comparator) as well as for row key and column values (called a validator).
Column and row key data in Cassandra is always stored internally as hex byte arrays, but the compartor/validators are used to verify data on insert and translate data on retrieval. In the case of comparators (column names), the comparator also determines the sort order in which columns are stored.
Cassandra comes with the following comparators and validators:
| BytesType | Bytes (no validation) |
| AsciiType | US-ASCII bytes |
| UTF8Type | UTF-8 encoded strings |
| LongType | 64-bit longs |
| LexicalUUIDType | 128-bit UUID by byte value |
| TimeUUIDType | Version 1 128-bit UUID by timestamp |
| CounterColumnType* | 64-bit signed integer |
* can only be used as a column validator, not valid as a row key validator or column name comparator
A simple example might be:
CREATE COLUMNFAMILY Standard1 WITH comparator_type = "UTF8Type";
In Cassandra, the keyspace is the container for your application data, similar to a schema in a relational database. Keyspaces are used to group column families together. Typically, a cluster has one keyspace per application.
Replication is controlled on a per-keyspace basis, so data that has different replication requirements should reside in different keyspaces. Keyspaces are not designed to be used as a significant map layer within the data model, only as a way to control data replication for a set of column families.
When comparing Cassandra to a relational database, the column family is similar to a table in that it is a container for columns and rows. However, a column family requires a major shift in thinking for those coming from the relational world.
In a relational database, you define tables, which have defined columns. The table defines the column names and their data types, and the client application then supplies rows conforming to that schema: each row contains the same fixed set of columns.
In Cassandra, you define column families. Column families can (and should) define metadata about the columns, but the actual columns that make up a row are determined by the client application. Each row can have a different set of columns.
A Cassandra column family can contain regular columns (key/value pairs) or super columns. Super columns add another level of nesting to the regular column family column structure. Super columns are comprised of a (super) column name and an ordered map of sub-columns. A super column is a way to group multiple columns based on a common lookup value.
The primary use case for super columns is to denormalize multiple rows from other column families into a single row, allowing for materialized view data retrieval.
Super columns should not be used when the number of sub-columns is expected to be a large number. During reads, all sub-columns of a super column must be deserialized to read a single sub-column, so performance of super columns is not optimal if there are a large number of sub-columns. Also, you cannot create a secondary index on a sub-column of a super column.
Yes and No, depending on what you mean by “transactions”. Unlike relational databases, Cassandra does not offer fully ACID-compliant transactions. There is no locking or transactional dependencies when concurrently updating multiple rows or column families. But if by “transactions” you mean real-time data entry and retrieval, with durability and tunable consistency, then yes.
Cassandra does not support transactions in the sense of bundling multiple row updates into one all-or-nothing operation. Nor does it roll back when a write succeeds on one replica, but fails on other replicas. It is possible in Cassandra to have a write operation report a failure to the client, but still actually persist the write to a replica.
However, this does not mean that Cassandra cannot be used as an operational or real time data store. Data is very safe in Cassandra because writes in Cassandra are durable. All writes to a replica node are recorded both in memory and in a commit log before they are acknowledged as a success. If a crash or server failure occurs before the memory tables are flushed to disk, the commit log is replayed on restart to recover any lost writes.
Cassandra 0.8 is the first release to introduce Cassandra Query Language (CQL), the first standardized query language for Apache Cassandra. CQL pushes all of the implementation details to the server in the form of a CQL parser. Clients built on CQL only need to know how to interpret query result objects. CQL is the start of the first officially supported client API for Apache Cassandra. CQL drivers for the various languages are hosted within the Apache Cassandra project.
CQL syntax in based on SQL (Structured Query Language), the standard for relational database manipulation. Although CQL has many similarities to SQL, it does not change the underlying Cassandra data model. There is no support for JOINs, for example.
Cassandra is optimized for write throughput. Cassandra writes are first written to a commit log (for durability), and then to an in-memory table structure called a memtable. Writes are batched in memory and periodically written to disk to a persistent table structure called an SSTable (Sorted String table). The “Sorted” part means SSTables are sorted by row token (as determined by the partitioner), which is what makes merges for compaction efficient (don’t have to read entire SSTables into memory). Row contents are also sorted by column comparator, so Cassandra can support larger-than-memory rows too.
SSTables are immutable (they are not written to again after they have been flushed). This means that a row is typically stored across multiple SSTable files.
In the background, Cassandra periodically merges SSTables together into larger SSTables using a process called compaction. Compaction merges row fragments together, removes expired tombstones (deleted columns), and rebuilds primary and secondary indexes. Since the SSTable files are sorted by row key, this merge is efficient (no random disk I/O). Once a newly merged SSTable is complete, the smaller input SSTables are marked as obsolete and eventually deleted by the Java Virtual Machine (JVM) garbage collection (GC) process. However, during compaction, there is a temporary spike in disk space usage and disk I/O on the node.
Cassandra is a Java application, meaning that a compiled binary distribution of Cassandra can run on any platform that has a Java Runtime Environment (JRE), also referred to as a Java Virtual Machine (JVM). DataStax strongly recommends using the Oracle Sun Java Runtime Environment (JRE), version 1.6.0_19 or later, for optimal performance.
Packaged releases are provided for RedHat, CentOS, Debian and Ubuntu Linux platforms.
DataStax supplies both a free and commercial version of OpsCenter, which is a visual, browser-based management tool for Cassandra. With OpsCenter, a user can visually carry out many administrative tasks, monitor a cluster for performance, and do much more. Downloads of OpsCenter are available on the DataStax Web site.
A number of command line tools also ship with Cassandra for querying/writing to the database, performing administration functions, etc.
Cassandra also exposes a number of statistics and management operations via Java Management Extensions (JMX). Java Management Extensions (JMX) is a Java technology that supplies tools for managing and monitoring Java applications and services. Any statistic or operation that a Java application has exposed as an MBean can then be monitored or manipulated using JMX.
During normal operation, Cassandra outputs information and statistics that you can monitor using JMX-compliant tools such as JConsole, the Cassandra nodetool utility, or the DataStax OpsCenter centralized management console. With the same tools, you can perform certain administrative commands and operations such as flushing caches or doing a repair.