Getting started with DataStax Enterprise and DataStax Community
This guide covers installing DataStax Enterprise and DataStax Community (Cassandra) for evaluation only; it is not intended for development or production deployments. For full installation instructions, see the DataStax Enterprise or DataStax Community documentation.
Two-minute overview
The data model distilled
Cassandra is a partitioned row store.
The key concept for designing your Cassandra data model is that you base the design on the queries you want to perform, rather than modeling entities and relationships as you would for a relational database.
The essential elements of the data model include:
Column: The smallest increment of data. It's a tuple that contains a name, a value, and a timestamp.
Row: Each row in a column family is identified by its row key, similar to the primary key in a relational table. The row key determines which node the data is stored on.
Column Family: A Cassandra database consists of column families. A column family is a set of key-value pairs, where the key is a row key and the value is the set of columns in that row. You can think of a column family as a table and of each key-value pair as a record in that table.
Note: In CQL 3 (the latest implementation of the Cassandra Query Language), column families are called tables. The Cassandra CLI client utility, API classes, and OpsCenter continue to use column family.
Table: In CQL 3, a table is a collection of columns ordered by name. In previous versions of CQL, the column family was synonymous, in many respects, with a table. A CQL 3 table is sparse: each row stores only the columns that have been assigned a value.
Keyspace: The outermost grouping of data, similar to a schema in a relational database. All column families go inside a keyspace. Typically, a cluster has one keyspace per application.
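The nesting of these elements can be pictured with a small in-memory model. The following Python sketch is only an illustration of the terms above, not how Cassandra stores data; the class names, the method names, and the sample keys and values are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Column:
    """Smallest increment of data: a name, a value, and a timestamp."""
    name: str
    value: Any
    timestamp: int  # microseconds since the epoch (an assumption of this sketch)


@dataclass
class ColumnFamily:
    """Maps each row key to the columns that have been set for that row."""
    name: str
    rows: Dict[str, Dict[str, Column]] = field(default_factory=dict)

    def insert(self, row_key: str, column_name: str, value: Any) -> None:
        # Writing a column creates the row if needed and stamps the write time.
        column = Column(column_name, value, time.time_ns() // 1000)
        self.rows.setdefault(row_key, {})[column_name] = column

    def get(self, row_key: str) -> Dict[str, Any]:
        # Return the row's values with columns ordered by name.
        row = self.rows.get(row_key, {})
        return {name: row[name].value for name in sorted(row)}


@dataclass
class Keyspace:
    """Outermost grouping: holds column families by name."""
    name: str
    column_families: Dict[str, ColumnFamily] = field(default_factory=dict)

    def create_column_family(self, name: str) -> ColumnFamily:
        return self.column_families.setdefault(name, ColumnFamily(name))


# Hypothetical sample data: one keyspace, one column family, two rows.
demo = Keyspace("demo")
users = demo.create_column_family("users")
users.insert("jsmith", "email", "jsmith@example.com")
users.insert("jsmith", "city", "Austin")
users.insert("adoe", "email", "adoe@example.com")
```

In this model the row for "adoe" has no "city" column at all, which is what sparse means here: a row holds only the columns that were assigned a value.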
Key concepts
The following concepts are important for understanding Cassandra:
- Cluster: A group of nodes where you store your data. You can create a single-node cluster.
- Replication: The process of storing copies of data on multiple nodes to ensure reliability and fault tolerance. The number of copies is set by the replication factor.
- Partitioner: Determines which node stores each row, based on its row key, so that data is distributed evenly across the nodes in the cluster for load balancing.
- Data Center: A group of related nodes configured together within a cluster for replication purposes. It is not necessarily a physical data center. Related nodes are nodes of the same type: transactional, analytics, or search. Each type of node must be contained in its own data center.
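To make the relationship between the partitioner and the replication factor more concrete, here is a minimal Python sketch of a token ring. It is a simplification under stated assumptions: the key is hashed with MD5 purely for illustration, each node owns a single evenly spaced token, and replicas are placed on the next nodes around the ring. The names ToyRing, token_for, and replicas_for are hypothetical, and actual placement in Cassandra depends on the configured partitioner and replication strategy.

```python
import bisect
import hashlib
from typing import List


def token_for(row_key: str) -> int:
    """Hash a row key to a 128-bit integer token (MD5 chosen for illustration)."""
    return int.from_bytes(hashlib.md5(row_key.encode("utf-8")).digest(), "big")


class ToyRing:
    """A toy cluster: nodes arranged on a ring, each owning one token."""

    def __init__(self, node_names: List[str], replication_factor: int) -> None:
        if not 1 <= replication_factor <= len(node_names):
            raise ValueError("replication factor must be between 1 and the node count")
        self.replication_factor = replication_factor
        self.nodes = list(node_names)
        # Assumption: tokens are spaced evenly across the 128-bit hash range.
        span = 2 ** 128
        self.tokens = [i * span // len(self.nodes) for i in range(len(self.nodes))]

    def replicas_for(self, row_key: str) -> List[str]:
        """Return the nodes that would hold copies of the row in this sketch."""
        token = token_for(row_key)
        # First node whose token is >= the key's token, wrapping past the end.
        start = bisect.bisect_left(self.tokens, token) % len(self.nodes)
        return [
            self.nodes[(start + i) % len(self.nodes)]
            for i in range(self.replication_factor)
        ]


# Hypothetical four-node cluster storing three copies of each row.
ring = ToyRing(["node1", "node2", "node3", "node4"], replication_factor=3)
replicas = ring.replicas_for("jsmith")
```

The same row key always hashes to the same token, so every lookup for that key lands on the same set of nodes. A single-node cluster is just the case of one node and a replication factor of 1.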