Apache Cassandra™ is a massively scalable open source NoSQL database. Cassandra is perfect for managing large amounts of structured, semi-structured, and unstructured data across multiple data centers and the cloud. Cassandra delivers continuous availability, linear scalability, and operational simplicity across many commodity servers with no single point of failure, along with a powerful dynamic data model designed for maximum flexibility and fast response times.
Cassandra sports a “masterless” architecture, meaning all nodes are the same. Cassandra provides automatic data distribution across all nodes that participate in a “ring” or database cluster. Developers and administrators do not need to write any code to distribute data across a cluster, because data is transparently partitioned across all nodes in the cluster.
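As a rough illustration of transparent partitioning, the following minimal sketch maps partition keys onto a toy ring. The `TokenRing` class, the evenly spaced node tokens, and the SHA-256 hash are assumptions made for this example only; they are stand-ins, not Cassandra's actual partitioner or token assignment.

```python
import bisect
import hashlib


class TokenRing:
    """Toy token ring: each node owns the range of tokens ending at its own token."""

    def __init__(self, nodes, ring_size=2**32):
        self.nodes = list(nodes)
        self.ring_size = ring_size
        # Assumption for this sketch: node tokens are spaced evenly around the ring.
        step = ring_size // len(self.nodes)
        self.tokens = [i * step for i in range(len(self.nodes))]

    def token_for(self, partition_key):
        # Stand-in hash; the real partitioner uses a different function.
        digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.ring_size

    def owner(self, partition_key):
        token = self.token_for(partition_key)
        # First node token >= key token, wrapping around to the start of the ring.
        index = bisect.bisect_left(self.tokens, token) % len(self.tokens)
        return self.nodes[index]


ring = TokenRing(["node1", "node2", "node3", "node4"])
for key in ("alice", "bob", "carol"):
    print(key, "->", ring.owner(key))
```

The caller only supplies a key; which node stores it follows from the hash, which is the sense in which distribution needs no application code.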
Cassandra also provides built-in and customizable replication, which stores redundant copies of data across nodes that participate in a Cassandra ring. This means that if any node in a cluster goes down, one or more copies of that node’s data are available on other machines in the cluster. Replication can be configured to work across one data center, many data centers, and multiple cloud availability zones.
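The effect of replication on availability can be sketched in a few lines. This is a simplified model, assuming replicas are placed on successive nodes around the ring; the function names and the hash are hypothetical and chosen for this example, and real placement depends on the configured replication strategy.

```python
import hashlib


def replicas_for(partition_key, nodes, replication_factor):
    """Return the nodes holding a key: a primary plus the next nodes around the ring."""
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication_factor must be between 1 and the node count")
    # Stand-in hash used only to pick a primary node for this sketch.
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    primary = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(primary + i) % len(nodes)] for i in range(replication_factor)]


def survives_failure(partition_key, nodes, replication_factor, failed_node):
    """True if at least one replica of the key lives on a node other than failed_node."""
    replicas = replicas_for(partition_key, nodes, replication_factor)
    return any(node != failed_node for node in replicas)


cluster = ["node1", "node2", "node3", "node4"]
print(replicas_for("user:42", cluster, 3))
print(survives_failure("user:42", cluster, 3, "node1"))
```

With a replication factor of at least 2, losing any single node still leaves a copy of every key in this model.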
Cassandra supplies linear scalability, meaning that capacity can be added simply by bringing new nodes online. For example, if 2 nodes can handle 100,000 transactions per second, 4 nodes will support 200,000 transactions per second and 8 nodes will handle 400,000 transactions per second.
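The arithmetic behind that example is a constant per-node throughput multiplied by the node count. The helper below is a hypothetical illustration of that idealized model, not a capacity-planning tool; real clusters depend on workload, hardware, and configuration.

```python
def projected_throughput(base_nodes, base_tps, target_nodes):
    """Project throughput under ideal linear scaling from a measured baseline."""
    per_node_tps = base_tps / base_nodes
    return per_node_tps * target_nodes


for node_count in (2, 4, 8):
    print(node_count, "nodes:", int(projected_throughput(2, 100_000, node_count)), "transactions/sec")
```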
To gain an understanding of Cassandra's origins and where it has evolved to today, please read "Facebook’s Cassandra paper, annotated and compared to Apache Cassandra 2.0", authored by Jonathan Ellis.
Cassandra 2.0 includes enhancements to CQL, security, and performance. The significant number of enhancements resulted in an update of the CQL specification to version 3.1.0. Key features of Cassandra 2.0 are:
First-phase support for triggers, which fire an event that executes a set of programmatic logic running either inside or outside a database cluster
Paging of result sets of SELECT statements executed over a CQL native protocol version 2 connection, which eliminates the need to use the token function to page through results. For example, a simple SELECT statement in Cassandra 2.0 replaces the more complex statement that used the token function to page through a table before Cassandra 2.0.
Atomic BATCH guarantees for large sets of prepared statements
One-shot binding of optional variables or prepared statements and variables for server-side request parsing and execution using a BATCH message containing a list of query strings, with no reparsing
SASL support for easier and better authentication over prior versions of the CQL native protocol
Re-introduction of the ALTER TABLE DROP command
Column aliases in SELECT statements, for example: SELECT hdate AS hired_date FROM emp WHERE empid = 500
Indexing of any part of a compound key, whether a partition key or a clustering column
Use of a prepared statement, even for a single execution of a query, to pass binary values over a native protocol version 2 connection, for example to avoid converting a blob to a string
Sending the user request to other replicas before the query times out when a replica is unusually slow in delivering needed data
Hybrid (leveled and size-tiered) compaction improvements to the leveled compaction strategy to reduce the performance overhead on read operations when compaction cannot keep pace with write-heavy workloads
Auto_bootstrapping of a single-token node with no initial_token
Continued support for apps that query super columns, with super columns translated on the fly into CQL constructs and results
Use of the blobAsType and typeAsBlob conversion functions instead of ASSUME
Cqlsh COPY command support for collections
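To make the result-set paging item in the list above more concrete, here is a self-contained simulation that contrasts client-managed paging by token with paging hidden behind an iterator. Everything in it is an assumption for illustration: the `users` table and `name` column in the comments are hypothetical, `token` is a stand-in hash rather than the real CQL function, and no Cassandra connection is involved.

```python
import hashlib


def token(key):
    """Stand-in for the CQL token() function (not the real partitioner hash)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


def select_page(rows, after_key=None, limit=3):
    """Emulates: SELECT * FROM users WHERE token(name) > token(?) LIMIT ?"""
    ordered = sorted(rows, key=token)
    if after_key is not None:
        ordered = [row for row in ordered if token(row) > token(after_key)]
    return ordered[:limit]


def manual_paging(rows, limit=3):
    """Pre-2.0 style: the client remembers the last key and issues a new query per page."""
    results, last_key = [], None
    while True:
        page = select_page(rows, last_key, limit)
        if not page:
            return results
        results.extend(page)
        last_key = page[-1]


def automatic_paging(rows, fetch_size=3):
    """2.0 style: one plain SELECT; pages are fetched behind an iterator."""
    ordered = sorted(rows, key=token)
    for start in range(0, len(ordered), fetch_size):
        yield from ordered[start:start + fetch_size]


names = ["alice", "bob", "carol", "dave", "erin", "frank", "grace"]
print(manual_paging(names) == list(automatic_paging(names)))
```

Both approaches return the same rows in the same order; the difference is that the manual version pushes the cursor bookkeeping onto the application.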