Apache Cassandra 1.2 Documentation
Glossary of Cassandra Terms
- anti-entropy
- The synchronization of replica data on nodes to ensure that the data is fresh.
- Bloom filter
- An off-heap structure associated with each SSTable that checks if any data for the requested row exists in the SSTable before doing any disk I/O.
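The membership check described above can be illustrated with a toy Bloom filter. This is only a sketch: the class name, bit-array size, hash count, and the use of SHA-256 are assumptions chosen for readability, not Cassandra's implementation.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means the key is definitely absent, so disk I/O can be skipped.
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
for row_key in ("user:1", "user:2", "user:3"):
    bf.add(row_key)
```

A `False` answer from `might_contain` is definitive, which is what lets a read skip an SSTable entirely; a `True` answer still requires checking the SSTable itself.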
- cluster
- Two or more Cassandra instances that exchange messages using the gossip protocol.
- clustering
- The storage engine process that creates an index and keeps data in order based on the index.
- clustering column
- Columns other than the partition key in a compound primary key definition.
- column
- The smallest increment of data, which contains a name, a value and a timestamp.
- column family
- A container for rows, similar to the table in a relational system. Replaced by table in CQL 3.
- commit log
- A file to which Cassandra appends changed data for recovery in the event of a hardware failure.
- compaction
- A process that consists primarily of consolidating SSTables, but also discards tombstones and regenerates the index in the SSTable. A major compaction merges all SSTables into one. A minor compaction merges from 4 to 32 SSTables for a table.
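The merge-and-discard behavior described above might be sketched as follows. The data model (a dict per SSTable keyed by row key and column name) and the `TOMBSTONE` marker are hypothetical, and the sketch ignores details such as any grace period before tombstones become eligible for removal.

```python
TOMBSTONE = object()  # hypothetical marker for a deleted column


def compact(sstables):
    """Merge SSTables, keep the newest version of each column, drop tombstones.

    Each SSTable is modeled as {(row_key, column_name): (timestamp, value)}.
    """
    merged = {}
    for sstable in sstables:
        for cell, (timestamp, value) in sstable.items():
            if cell not in merged or timestamp > merged[cell][0]:
                merged[cell] = (timestamp, value)
    # Emit cells in sorted order, skipping those whose newest version is a tombstone.
    return {
        cell: merged[cell]
        for cell in sorted(merged)
        if merged[cell][1] is not TOMBSTONE
    }


old = {("k1", "name"): (1, "alice"), ("k1", "email"): (1, "a@example.com")}
new = {("k1", "name"): (2, "alicia"), ("k1", "email"): (3, TOMBSTONE)}
result = compact([old, new])
```

Here `result` keeps only the newer `name` value; the `email` column disappears because its most recent version is a tombstone.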
- compound primary key
- A primary key consisting of a partition key, which determines the node on which data is stored, and one or more clustering columns, which determine the order of data within the partition.
- consistency
- The synchronization of data on replicas in a cluster. Consistency is categorized as weak or strong.
- consistency level
- A setting that defines a successful write or read by the number of cluster replicas that acknowledge the write or respond to the read request, respectively.
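As a worked illustration of counting replica acknowledgements, the sketch below computes a quorum for a given replication factor and checks whether the replicas contacted on read and on write are guaranteed to overlap. The function names are hypothetical.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_and_write_sets_overlap(read_replicas, write_replicas, replication_factor):
    """True when every read must contact at least one replica that saw the write."""
    return read_replicas + write_replicas > replication_factor
```

For example, with a replication factor of 3, `quorum(3)` is 2, and reading and writing at quorum (2 + 2 > 3) guarantees overlap, whereas reading and writing a single replica each (1 + 1) does not.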
- coordinator node
- The node that determines which nodes in the ring should get the request, based on the snitch configured for the cluster.
- cross-data center forwarding
- A technique for optimizing replication across data centers. To replicate data in a node in one data center to nodes in another data center, the data is sent to one node in the other data center, and that node forwards the data to other nodes in its data center.
- data center
- Synonymous with replication group. A group of related nodes configured together within a cluster for replication purposes. It is not necessarily a physical data center.
- gossip
- A peer-to-peer communication protocol for exchanging location and state information between nodes.
- HDFS
- The Hadoop Distributed File System, which stores data on nodes to improve performance. A necessary component in addition to MapReduce in a Hadoop distribution.
- idempotent
- An operation that can occur multiple times without changing the result, such as Cassandra performing the same update multiple times without affecting the outcome.
- index summary
- A subset of the primary index. By default, 1 row key out of every 128 is sampled.
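The sampling described above can be sketched as follows, assuming the primary index is modeled as a sorted list of row keys. The function names and list-based model are illustrative assumptions, not Cassandra's on-disk format.

```python
import bisect

SAMPLE_INTERVAL = 128  # the default sampling rate given in the definition


def build_summary(sorted_row_keys, interval=SAMPLE_INTERVAL):
    """Return (key, position_in_primary_index) for every interval-th key."""
    return [
        (key, pos)
        for pos, key in enumerate(sorted_row_keys)
        if pos % interval == 0
    ]


def scan_start(summary, row_key):
    """Position in the primary index from which to start scanning for row_key."""
    sampled_keys = [key for key, _ in summary]
    i = bisect.bisect_right(sampled_keys, row_key) - 1
    return summary[max(i, 0)][1]


primary_index = [f"row{n:05d}" for n in range(1000)]
summary = build_summary(primary_index)
```

A lookup binary-searches the small summary and then scans at most one sampling interval of the primary index, rather than the whole index.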
- keyspace
- A namespace container that defines how data is replicated on nodes.
- MapReduce
- Hadoop's parallel processing engine that can process large data sets relatively quickly. A necessary component in addition to HDFS in a Hadoop distribution.
- memtable
- A Cassandra table-specific, in-memory data structure that resembles a write-back cache. See also About writes.
- mutation
- 1) An upsert. 2) A Thrift base class that has abstract methods for reading and writing data input and output.
- node repair
- A process that makes all data on a replica consistent.
- partitioner
- Distributes the data across the cluster. The types of partitioners are Murmur3Partitioner (default), RandomPartitioner, and OrderPreservingPartitioner.
- partition key
- The first column declared in the PRIMARY KEY definition or, in the case of a composite partition key, the multiple columns that together form the partition key.
- primary index
- A list of row keys and the start position of rows in the data file.
- primary key
- One or more columns that uniquely identify a row in a table.
- read repair
- A process that updates Cassandra replicas with the most recent version of frequently-read data.
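The idea of bringing replicas up to the most recent version can be sketched with timestamps deciding which version wins. Replicas are modeled here as plain dicts, and the function name is hypothetical; this is a simplification, not Cassandra's read path.

```python
def read_repair(replicas, row_key):
    """Return the newest value for row_key and push it to stale replicas.

    replicas: list of dicts mapping row_key -> (timestamp, value).
    """
    versions = [replica[row_key] for replica in replicas if row_key in replica]
    if not versions:
        return None
    newest = max(versions, key=lambda version: version[0])
    for replica in replicas:
        if replica.get(row_key) != newest:
            replica[row_key] = newest  # repair a stale or missing copy
    return newest[1]


replicas = [{"k": (1, "old")}, {"k": (2, "new")}, {}]
value = read_repair(replicas, "k")
```

After the call, all three modeled replicas hold the version written at timestamp 2.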
- replication group
- See data center.
- replica placement strategy
- A specification that determines the replicas for each row of data.
- rolling restart
- A procedure performed when upgrading the nodes in a cluster to achieve zero downtime. Nodes are upgraded and restarted one at a time, while the other nodes continue to operate online.
- row
- 1) Columns that have the same primary key. 2) A collection of cells per combination of columns in the storage engine.
- secondary index
- A native Cassandra capability for finding a row in the database that does not involve using the row key.
- slice
- A Thrift API term for a set of columns from a single row, described either by name or as a contiguous run of columns from a starting point.
- snitch
- The mapping from the IP addresses of nodes to physical and virtual locations, such as racks and data centers. There are several types of snitches. The type of snitch affects the request routing mechanism.
- SSTable
- A sorted string table (SSTable) is an immutable data file to which Cassandra writes memtables periodically. SSTables are stored on disk sequentially and maintained for each Cassandra table. See also About writes.
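The relationship between the commit log, memtable, and SSTables might be sketched as below. The class name, the flush threshold, and the in-memory lists are assumptions for illustration; real flush triggers and file formats are not modeled.

```python
class ToyTable:
    """Simplified write path: commit log append, memtable update, flush to SSTable."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only record of changes, kept for recovery
        self.memtable = {}     # in-memory, per-table buffer of recent writes
        self.sstables = []     # immutable, sorted snapshots of flushed memtables
        self.flush_threshold = flush_threshold

    def write(self, row_key, value):
        self.commit_log.append((row_key, value))
        self.memtable[row_key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # An SSTable is written once, in sorted key order, and never modified.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # simplification: flushed data needs no replay


table = ToyTable(flush_threshold=3)
for key, value in [("c", 1), ("a", 2), ("b", 3)]:
    table.write(key, value)
```

The third write reaches the threshold, so the memtable is flushed as a single sorted, immutable tuple and a fresh memtable takes over.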
- strong consistency
- When reading data, Cassandra performs read repair before returning results.
- superuser
- By default, each installation of Cassandra includes a superuser account named cassandra, whose password is also cassandra. A superuser grants initial permissions to access Cassandra data; subsequently, a user may or may not be given permission to grant and revoke permissions.
- table
- A collection of ordered (by name) columns fetched by row. A row consists of columns and has a primary key. The first part of the key is a column name. Subsequent parts of a compound key are other column names that define the order of columns in the table.
- token
- An element on the ring that depends on the partitioner. A token determines the node's position on the ring and the portion of data it is responsible for. The range for the Murmur3Partitioner (default) is -2^63 to +2^63 - 1. The range for the RandomPartitioner is 0 to 2^127 - 1.
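How a token maps data to a node can be sketched with a sorted ring and a binary search. The hash below is a stand-in (SHA-256 reduced to 127 bits), not the computation used by any Cassandra partitioner, and the node names and token values are hypothetical.

```python
import bisect
import hashlib


def toy_token(partition_key):
    """Stand-in hash into 0 .. 2**127 - 1; not Cassandra's partitioner code."""
    digest = hashlib.sha256(partition_key.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** 127)


def owner(ring, token):
    """ring: list of (token, node) sorted by token.

    A node owns the tokens after its predecessor's token, up to and
    including its own token.
    """
    tokens = [t for t, _ in ring]
    i = bisect.bisect_left(tokens, token)
    return ring[i % len(ring)][1]  # wrap around past the highest token


ring = [(0, "node-a"), (2 ** 125, "node-b"), (2 ** 126, "node-c")]
```

Calling `owner(ring, toy_token("user:1"))` returns whichever of the three hypothetical nodes is responsible for that key's token; tokens above the highest node token wrap around to the first node.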
- tombstone
- A marker in a row that indicates a column was deleted. During compaction, marked columns are deleted.
- TTL
- Time-to-live. An optional expiration date for values inserted into a column. See also Expiring columns.
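The expiration behavior can be sketched with a value that records when it stops being live. The class and parameter names are hypothetical, and the `now` arguments exist only to make the example deterministic.

```python
import time


class ExpiringValue:
    """A value with an optional time-to-live, in seconds."""

    def __init__(self, value, ttl_seconds=None, now=None):
        self.value = value
        written_at = time.time() if now is None else now
        # No TTL means the value never expires.
        self.expires_at = None if ttl_seconds is None else written_at + ttl_seconds

    def is_live(self, now=None):
        now = time.time() if now is None else now
        return self.expires_at is None or now < self.expires_at


session = ExpiringValue("token-abc", ttl_seconds=10, now=100.0)
```

In this model, `session.is_live(now=105.0)` is true and `session.is_live(now=110.0)` is false, since the value expires exactly 10 seconds after it was written.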
- weak consistency
- When reading data, Cassandra performs read repair after returning results.
- upsert
- A change in the database that updates a specified column in a row if the column exists or inserts the column if it does not exist.
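The update-or-insert behavior, and why repeating it is idempotent, can be sketched with nested dicts standing in for rows and columns. The function and variable names are hypothetical.

```python
def upsert(table, row_key, column, value):
    """Update the column if it exists; insert the row or column if it does not."""
    table.setdefault(row_key, {})[column] = value


users = {}
upsert(users, "k1", "name", "alice")   # inserts the row and the column
upsert(users, "k1", "name", "alice")   # same update again: no further change
upsert(users, "k1", "email", "a@example.com")  # inserts a new column
```

Applying the same upsert twice leaves `users` in the same state as applying it once, which matches the glossary's definition of an idempotent operation.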