FAQ

Here are answers to the most common questions we get about our massively scalable, continuously available big data platform, DataStax Enterprise, its features and benefits, and its integrated components: Apache Cassandra, Apache Hadoop, Apache Solr, and DataStax OpsCenter.

More technical information can be found in our Developer Center.

DataStax Enterprise

Apache Cassandra FAQ

DataStax Enterprise

What is DataStax Enterprise Edition?

DataStax is the company behind Apache Cassandra, which powers real-time online applications. DataStax delivers a production version of Apache Cassandra, DataStax Enterprise (DSE), a powerful database platform with an in-memory option, search, analytics and security all in the same database cluster.

What benefits do I get by running in-memory in DataStax Enterprise?

DataStax’s new in-memory option allows users to create memory-only database objects that deliver faster performance than standard disk- or SSD-based objects. This is useful in online applications such as trading systems, call routers and switches, and online advertising, where this level of performance is required. Data protection and durability ensure that no data is lost due to power failure or other problems. From a developer perspective, there is no difference between working with an in-memory database object and a traditional disk-based table.

Can in-memory and disk tables exist together in DataStax Enterprise?

Yes. With DataStax Enterprise, in-memory and standard disk tables can be created easily to co-exist inside the same keyspace. Any object can be switched from in-memory to disk and vice-versa.
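As a rough sketch (assuming the MemoryOnlyStrategy compaction class that DataStax documents for the in-memory option; the table definition is illustrative and exact option names vary by DSE release), creating an in-memory table and later converting it to a standard disk-based table might look like:

    CREATE TABLE trades (
        trade_id uuid PRIMARY KEY,
        symbol text,
        price decimal
    ) WITH compaction = { 'class': 'MemoryOnlyStrategy' };

    ALTER TABLE trades
        WITH compaction = { 'class': 'SizeTieredCompactionStrategy' };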

Are there any limitations and recommendations on in-memory tables?

Currently, in-memory tables are limited to the memory in the Java heap. We recommend 1GB per column family per node.

What benefits do I get by running Hadoop within DataStax Enterprise?

First, you automatically get a continuously available (i.e., no single point of failure) Hadoop system. Unlike traditional Hadoop, which depends on specialized components such as namenodes, DataStax Enterprise is a peer-to-peer system and provides automatic and transparent redundancy for all Hadoop operations.

You also get a much easier deployment experience with Hadoop in DataStax Enterprise than if community Hadoop is used.

Another great benefit of DataStax Enterprise is that it completely eliminates the need for complex extract-transform-load (ETL) operations that are normally needed to move data from real-time systems to analytic databases or data warehouses. Instead, data is transparently and automatically replicated among real-time and analytic nodes; no work on the part of a developer or administrator is necessary.

Lastly, having one integrated database for real-time transactional work, analytics, and enterprise search makes for a much more productive environment for operations personnel and easier development experience for developers.

Isn’t it a bad idea to have both real-time and search tasks running in the same database?

Not with DataStax Enterprise. DataStax Enterprise uses smart workload isolation so that real-time and search nodes do not compete for either the underlying data or compute resources. All search tasks execute on nodes designated for enterprise search, and all real-time, online operations take place on nodes designated for real-time data tasks.

How does DataStax Enterprise provide support for enterprise search operations?

DataStax Enterprise uses Apache Solr, the most popular open source search software, to support enterprise search tasks.

What benefits do I get by running Solr within DataStax Enterprise?

First, you automatically get a continuously available (i.e., no single point of failure) Solr/enterprise search system. Unlike community Solr, which requires manual work to create a true high-availability environment, DataStax Enterprise uses its peer-to-peer architecture to provide automatic and transparent redundancy for all Solr components and operations.

Next, you get full data durability for incoming search data. Unlike community Solr, which can lose data if a node goes down before new data is flushed to disk, DataStax Enterprise guarantees that no data is ever lost through the use of Cassandra’s write ahead log.

DataStax Enterprise also provides a scalable design for write operations. Unlike community Solr’s master-slave architecture that experiences write bottlenecks with its single master, DataStax Enterprise allows writing to all Solr nodes – even across multiple data centers – and ensures everything stays in sync.

Other benefits of using Solr in DataStax Enterprise include automatic sharding (vs. manual with community), search indexes being able to span multiple data centers, on-demand search index rebuilds, and more.

One last benefit worth noting is that DataStax Enterprise completely eliminates the need for complex ETL operations that are normally needed to move data from real-time systems to search databases. Instead, data is transparently and automatically replicated among real-time and search nodes; no work on the part of a developer or administrator is necessary.

How does DataStax Enterprise handle both real-time and search data in the same database?

DataStax Enterprise uses Cassandra’s replication to replicate data between nodes designated for real-time data and nodes specified for search operations. Any node may be written to, with changes being propagated across all nodes. All nodes may also be read. Such a configuration eliminates write bottlenecks and read/write hotspots.

Can I access data in Solr/search nodes with CQL?

Yes. DataStax Enterprise extends Cassandra’s CQL to include Solr queries. See the online documentation for more on how to construct Solr CQL queries.
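For example, assuming a hypothetical products table whose name column has been indexed for search, a Solr query expressed through CQL might look like:

    SELECT * FROM keyspace1.products
    WHERE solr_query = 'name:chocolate*';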

Does DataStax Enterprise offer any type of workload management reprovisioning?

Yes. Real-time (Cassandra) and analytic (Hadoop) nodes can be easily reprovisioned by stopping/starting nodes in a different mode. This allows you to easily adjust the performance and capacity for various workloads. As an example, you may need more real-time processing power during the day and more batch analytic capability at night. You can easily schedule a database cluster to stop some or all real-time nodes and restart them as Hadoop nodes to increase analytic capacity during the evening and then switch the nodes back to real-time for daytime processing.
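As an illustrative sketch for a tarball installation (commands vary by DSE version, and packaged installations toggle modes via service configuration instead), reprovisioning a node might look like:

    $ bin/dse cassandra-stop     # stop the node running in real-time mode
    $ bin/dse cassandra -t       # restart it as an analytic (Hadoop) node
    $ bin/dse cassandra -s       # or restart it as a search (Solr) node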

What type of security does DataStax Enterprise offer?

DataStax Enterprise 3.0 and higher provides the following built-in security features:

  • Internally managed authentication (login IDs and passwords are managed within Cassandra)
  • An external authentication option that supports Kerberos and LDAP
  • Internal authorization / object permission management via GRANT/REVOKE
  • Client-to-node encryption via SSL
  • Transparent data encryption at the table / column family level
  • Data auditing

None of these security features are enabled by default.

How can I move data from RDBMSs to DataStax Enterprise?

DataStax Enterprise uses Sqoop to move data from any RDBMS with a JDBC driver (e.g., Oracle, MySQL) over to the DataStax Enterprise server. One or more tables are simply mapped to new Cassandra column families and the Sqoop interface takes care of the rest.

You can also use third-party tools such as Pentaho’s Kettle, which has full ETL capabilities and is free to download and use.
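As a heavily hedged sketch (the JDBC URL, table, and Cassandra-specific flag names below are illustrative of the DSE Sqoop demo; consult the documentation for your release), importing a MySQL table might look like:

    $ dse sqoop import --connect jdbc:mysql://127.0.0.1/northwind \
          --username sqoop --password sqoop \
          --table employees \
          --cassandra-keyspace northwind \
          --cassandra-column-family employees \
          --cassandra-row-key id \
          --cassandra-thrift-host 127.0.0.1 \
          --cassandra-create-schema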

Can I move application log data to DataStax Enterprise?

Yes. Using log4j, application log data can be moved easily into the DataStax Enterprise server and then indexed and searched via the Solr support that is in the server.

Apache Cassandra FAQ

What is Cassandra?

Apache Cassandra™, an Apache Software Foundation project, is a massively scalable NoSQL database. Cassandra is designed to handle big data workloads across multiple data centers with no single point of failure, providing enterprises with extremely high database performance and availability.

What are the benefits of using Cassandra?

There are many technical benefits that come from using Cassandra. See our white papers for more detail.

How do I install Cassandra?

Downloading and installing Cassandra is very easy. Downloads of Cassandra are available via the DataStax web site at: http://www.datastax.com/download.

For installation guidance, please see our online documentation at: http://www.datastax.com/docs/1.0/getting_started/index and http://www.datastax.com/docs/1.0/install/index.

You can also view a guided video tutorial for installing a simple Cassandra and DataStax OpsCenter setup at: http://www.datastax.com/resources/tutorials.

How do I start/stop Cassandra on a machine?

Starting Cassandra involves connecting to the machine where it is installed with the proper security credentials, and invoking the cassandra executable from the installation’s binary directory. An example of starting Cassandra on Mac could be:

sudo /Applications/Cassandra/apache-cassandra-0.8.1/bin/cassandra
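Because tarball installations do not ship with a stop command, stopping Cassandra is a matter of killing the Java process; for example:

    $ ps auwx | grep cassandra
    $ sudo kill <pid>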

How do I log into Cassandra?

The basic interfaces for logging in to Cassandra are the CQL (Cassandra Query Language) utility and the command line interface (CLI).

The CQL utility (found in the installation directory’s bin subdirectory) connects by default to a local Cassandra instance running on the default port of 9160:

    $ ./cqlsh
    Connected to Test Cluster at localhost:9160.
    [cqlsh 2.0.0 | Cassandra 1.0.7 | CQL spec 2.0.0 | Thrift protocol 19.20.0]
    Use HELP for help.
    cqlsh>

An example of logging into a local machine’s Cassandra installation using the CLI and the default Cassandra port might be:

    
    Welcome to the Cassandra CLI.
    Type 'help;' or '?' for help.
    Type 'quit;' or 'exit;' to quit.

    [default@unknown] connect localhost/9160;
    Connected to: "Test Cluster" on localhost/9160
    [default@unknown]

When is Cassandra required for an application?

Cassandra is perfect for big data applications, and can be used in many different data management situations. Some of the most common use cases for Cassandra include:

  • Time series data management
  • High-velocity device data ingestion and analysis
  • Media streaming (e.g., music, movies)
  • Social media input and analysis
  • Online web retail (e.g., shopping carts, user transactions)
  • Web log management / analysis
  • Web click-stream analysis
  • Real-time data analytics
  • Online gaming (e.g., real-time messaging)
  • Write-intensive transaction systems
  • Buyer event analytics
  • Risk analysis and management

When should I not use Cassandra?

Cassandra is typically not the choice for transactional data that needs per-transaction commit/rollback capabilities. Note that Cassandra does have atomic transactional abilities on a per row/insert basis (but with no rollback capabilities).

How does Cassandra differ from Hadoop?

The primary difference between Cassandra and Hadoop is that Cassandra targets real-time/operational data, while Hadoop has been designed for batch-based analytic work.

There are many different technical differences between Cassandra and Hadoop, including Cassandra’s underlying data structure (based on Google’s BigTable); its fault-tolerant, peer-to-peer architecture; multi-data center capabilities; tunable data consistency; and much more – including the fact that all nodes in Cassandra are the same (e.g., no concept of a namenode).

How does Cassandra differ from HBase?

HBase is an open source, column-oriented datastore modeled after Google BigTable, and is designed to offer BigTable-like capabilities on top of data stored in Hadoop. However, while HBase shares its BigTable lineage with Cassandra, its foundational architecture is quite different.

A Cassandra cluster is much easier to set up and configure than a comparable HBase cluster. HBase’s reliance on the Hadoop namenode means there is a single point of failure in HBase, whereas with Cassandra, because all nodes are the same, there is no such issue.

In internal performance tests conducted at DataStax (using the Yahoo Cloud Serving Benchmark, YCSB), Cassandra offered 5x better write performance and 4x better read performance than HBase.

How does Cassandra differ from MongoDB?

MongoDB is a document-oriented database that is built upon a master-slave/sharding architecture. MongoDB is designed to store/manage collections of JSON-styled documents.

By contrast, Cassandra uses a peer-to-peer, write/read-anywhere styled architecture that is based on a combination of Google BigTable and Amazon Dynamo. This allows Cassandra to avoid the various complications and pitfalls of master/slave and sharding architectures. Moreover, Cassandra offers linear performance increases as new nodes are added to a cluster, scales to terabyte-petabyte data volumes, and has no single point of failure.

What is DataStax Community Edition?

DataStax Community Edition is a free software bundle from DataStax that combines Apache Cassandra with a number of developer and management tools provided by DataStax, which are designed to get someone up and productive with Cassandra in very little time. DataStax Community Edition is provided for open source enthusiasts and is not recommended for production use, as it is not formally supported by DataStax’s production support staff.

What is DataStax Enterprise Edition?

DataStax Enterprise is the commercial product offering from DataStax that is designed for enterprise-class, production usage. DataStax Enterprise is a complete big data platform, built on Cassandra, architected to manage real-time, analytic, and enterprise search data all in the same database cluster.

How does Cassandra protect me against downtime?

Cassandra has been built from the ground up to be a fault-tolerant, peer-to-peer database that offers no single point of failure. To deliver the strongest possible uptime and disaster recovery capabilities, Cassandra:

  • Automatically replicates data between nodes to offer data redundancy
  • Offers built-in intelligence to replicate data between different physical server racks (so that if one rack goes down the data on other racks is safe)
  • Easily replicates between geographically dispersed data centers
  • Leverages any combination of cloud and on-premise resources

Does Cassandra use a master/slave architecture or something else?

Cassandra does not use a master/slave architecture, but instead uses a peer-to-peer implementation, which avoids the pitfalls, latency problems, single point of failure issues, and performance headaches associated with master/slave setups.

How do I replicate data across Cassandra nodes?

Replication is the process of storing copies of data on multiple nodes to ensure reliability and fault tolerance. When you create a keyspace in Cassandra, you must decide the replica placement strategy: the number of replicas and how those replicas are distributed across nodes in the cluster. The replication strategy relies on the cluster-configured snitch (see FAQ “What is a snitch?”) to help it determine the physical location of nodes and their proximity to each other.

The total number of replicas across the cluster is often referred to as the replication factor. A replication factor of 1 means that there is only one copy of each row. A replication factor of 2 means two copies of each row. All replicas are equally important; there is no primary or master replica in terms of how read and write requests are handled.

Replication options are defined when you create a keyspace in Cassandra. The snitch is configured per node.
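For example, using the CQL 3 syntax available in Cassandra 1.2 and higher (the keyspace name is illustrative), a keyspace with three replicas of each row might be created as:

    CREATE KEYSPACE demo
    WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': 3 };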

How is my data partitioned in Cassandra across nodes in a cluster?

Cassandra provides a number of options to partition your data across nodes in a cluster.

The RandomPartitioner is the default partitioning strategy for a Cassandra cluster. It uses a consistent hashing algorithm to determine which node will store a particular row. The end result is an even distribution of data across a cluster.

The ByteOrderedPartitioner ensures that row keys are stored in sorted order. It is not recommended for most use cases and can result in uneven distribution of data across a cluster.
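The partitioner is set cluster-wide in each node’s cassandra.yaml file; for example:

    partitioner: org.apache.cassandra.dht.RandomPartitioner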

What are ‘seed nodes’ in Cassandra?

A seed node in Cassandra is a node that is contacted by other nodes when they first start up and join the cluster. A cluster can have multiple seed nodes. Cassandra uses a protocol called gossip to discover location and state information about the other nodes participating in a Cassandra cluster. When a node first starts, it contacts a seed node to bootstrap the gossip communication process. The seed node designation has no purpose other than bootstrapping new nodes joining the cluster. Seed nodes are not a single point of failure.
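Seed nodes are listed in each node’s cassandra.yaml file. In recent releases the setting looks like the following (the IP addresses are illustrative):

    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              - seeds: "10.0.0.1,10.0.0.2"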

What is a “snitch”?

The snitch is a configurable component of a Cassandra cluster used to define how the nodes are grouped together within the overall network topology (such as rack and data center groupings). Cassandra uses this information to route inter-node requests as efficiently as possible within the confines of the replica placement strategy. The snitch does not affect requests between the client application and Cassandra (it does not control which node a client connects to).
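The snitch is set in each node’s cassandra.yaml file. For example, to use the snitch that reads rack and data center assignments from the cassandra-topology.properties file:

    endpoint_snitch: PropertyFileSnitch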

How do I add new nodes to a cluster?

Cassandra is capable of offering linear performance benefits when new nodes are added to a cluster.

A new machine can be added to an existing cluster by installing the Cassandra software on the server and configuring the new node so that it knows (1) the name of the Cassandra cluster it is joining; (2) the seed node(s) it should contact to learn about the cluster; and (3) the range of data that it is responsible for, which is done by assigning a token to the node.

Please see the online documentation about how to assign a token to a new node and the various use cases that dictate the complexity of token assignment.

Note that OpsCenter is capable of automatically rebalancing the data across all nodes in a cluster when new nodes are added.
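As a sketch, the three items above map to cassandra.yaml settings such as the following (the values are illustrative; the token shown splits a RandomPartitioner ring in half):

    cluster_name: 'Test Cluster'
    initial_token: 85070591730234615865843651857942052864
    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              - seeds: "10.0.0.1"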

How do I remove nodes from an existing cluster?

Nodes can be removed from a Cassandra cluster by using the nodetool utility and issuing a decommission command. This can be done without affecting the overall operations or uptime of the cluster.
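For example, to decommission the node at 10.0.0.5 (the address is illustrative):

    $ nodetool -h 10.0.0.5 decommission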

What happens when a node fails in Cassandra?

Cassandra uses gossip state information to locally determine if another node in the system is up or down. This failure detection information is used by Cassandra to avoid routing client requests to unreachable nodes whenever possible.

The gossip inter-node communication process tracks “heartbeats” from other nodes both directly (nodes gossiping directly to it) and indirectly (nodes heard about secondhand, thirdhand, and so on). Rather than have a fixed threshold for marking nodes without a heartbeat as down, Cassandra uses an accrual detection mechanism to calculate a per-node threshold that takes into account network conditions, workload, or other conditions that might affect the perceived heartbeat rate.

Node failures can result from various causes such as hardware failures, network outages, and so on. Node outages are often transient but can last for extended intervals. A node outage rarely signifies a permanent departure from the cluster, and therefore does not automatically result in permanent removal of the failed node from the cluster. Other nodes will still try to periodically initiate gossip contact with failed nodes to see if they are back up.

When a node comes back online after an outage, it may have missed writes for the replica data it maintains. Writes missed due to short, transient outages are saved for a period of time on other replicas and replayed on the failed host once it recovers using Cassandra’s built-in hinted handoff feature. If a node is down for an extended period, an administrator can run the nodetool repair utility after the node is back online to ‘catch it up’ with its corresponding replicas.

To permanently change a node’s membership in a cluster, administrators must explicitly remove a node from a Cassandra cluster using the nodetool management utility.
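For example (host addresses, keyspace, and token are illustrative; in releases of this era the removal command is removetoken):

    $ nodetool -h 10.0.0.5 repair keyspace1     # catch a recovered node up
    $ nodetool -h 10.0.0.1 removetoken 85070591730234615865843651857942052864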

How can I use the same Cassandra cluster across multiple datacenters?

Cassandra can easily replicate data between different physical datacenters by creating a keyspace that uses the replication strategy currently termed NetworkTopologyStrategy. This strategy allows you to configure Cassandra to automatically replicate data to different data centers and even different racks within datacenters to protect against specific rack/physical hardware failures causing a cluster to go down. It can also replicate data between public clouds and on-premise machines.
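For example, using CQL 3 syntax (the data center names must match those reported by the cluster’s snitch; the names and replica counts here are illustrative):

    CREATE KEYSPACE demo
    WITH replication = { 'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2 };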

What configuration files does Cassandra use?

The main Cassandra configuration file is the cassandra.yaml file, which houses all the main options that control how Cassandra operates.

How can I use Cassandra in the cloud?

Cassandra’s architecture makes it perfect for full cloud deployments as well as hybrid implementations that store some data in the cloud and other data on-premises.

DataStax provides an Amazon AMI that allows you to quickly deploy a Cassandra cluster on EC2. See the online documentation for a step-by-step guide to installing a Cassandra cluster on Amazon.

Do I need to use a caching layer (like memcached) with Cassandra?

Cassandra negates the need for extra software caching layers like memcached through its distributed architecture, fast write throughput capabilities, and internal memory caching structures.

Why is Cassandra so fast for write activity/data loads?

Cassandra has been architected to consume large amounts of data as fast as possible. To accomplish this, Cassandra first writes new data to a commit log to ensure it is safe. After that, the data is written to an in-memory structure called a memtable. Cassandra deems the write successful once it is stored in both the commit log and a memtable, which provides the durability required for mission-critical systems.

Once a memtable’s memory limit is reached, all writes are then written to disk in the form of an SSTable (sorted strings table). An SSTable is immutable, meaning it is never written to again. If the data contained in the SSTable is modified, the new data is written to Cassandra in an upsert fashion and the outdated data is automatically removed.

Because SSTables are immutable and only written once the corresponding memtable is full, Cassandra avoids random seeks and instead only performs sequential IO in large batches, resulting in high write throughput.

A related factor is that Cassandra doesn’t have to do a read as part of a write (i.e., check an index to see where the current data is). This means that insert performance remains high as data size grows, while with b-tree based engines (e.g., MongoDB) it deteriorates.

How does Cassandra communicate across nodes in a cluster?

Cassandra is architected in a peer-to-peer fashion and uses a protocol called “gossip” to communicate with other nodes in a cluster. The gossip process runs every second to exchange information across the cluster.

Gossip only includes information about the cluster itself (e.g., up/down, joining, leaving, version, schema) and does not manage the data. Data is transferred node-to-node using a message-passing-style protocol on a distinct port from the one client applications connect to. The Cassandra partitioner turns a column family key into a token, the replication strategy picks the set of nodes responsible for that token (using information from the snitch), and Cassandra sends messages to those replicas with the request (read or write).

How does Cassandra detect that a node is down?

The gossip protocol is used to determine the state of all nodes in a cluster and if a particular node has gone down. The gossip process tracks heartbeats from other nodes and uses an accrual detection mechanism to calculate a per-node threshold that takes into account network conditions, workload, or other conditions that might affect perceived heartbeat rate before a node is actually marked as down.

The configuration parameter phi_convict_threshold in the cassandra.yaml file controls Cassandra’s sensitivity to node failure detection. The default value is appropriate for most situations. However, in cloud environments such as Amazon EC2, the value should be increased to 12 to account for network issues that sometimes occur on such platforms.
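For example, on an Amazon EC2 node:

    phi_convict_threshold: 12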

Does Cassandra compress data on disk?

Yes, data compression is available in Cassandra 1.0 and above. Cassandra uses the Snappy compression algorithm from Google, which can deliver fairly impressive storage savings, in some cases compressing raw data by 80 percent or more with no performance penalty for read/write operations. In fact, because of the reduction in physical I/O, compression actually increases performance in some use cases. Compression is enabled/disabled on a per-column family basis and is not enabled by default.
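For example, using the CQL 3 syntax available in Cassandra 1.1 and higher (the table name is illustrative), compression can be enabled on an existing column family with:

    ALTER TABLE users
    WITH compression = { 'sstable_compression': 'SnappyCompressor' };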

How do I backup data in Cassandra?

Currently, the most common method for backing up data in Cassandra is using the snapshot function in the nodetool utility. This is an online operation and does not require any downtime or block any operations on the server.

Snapshots are sent by default to a snapshots directory located in the Cassandra data directory (controlled via the data_file_directories in the cassandra.yaml file). Once taken, snapshots can be moved off-site to be protected.
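For example, to snapshot the keyspace keyspace1 on the local node:

    $ nodetool -h localhost snapshot keyspace1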

Incremental backups (i.e., data backed up since the last full snapshot) can be performed by setting the incremental_backups parameter in the cassandra.yaml file to “true.” When incremental backup is enabled, Cassandra copies every flushed SSTable for each keyspace to a backup directory located under the Cassandra data directory. Restoring from an incremental backup involves first restoring from the last full snapshot and then copying each incremental file back into the Cassandra data directory.

How do I restore data in Cassandra?

In general, restoring a Cassandra node is done by first following these procedures:

  1. Shut down the node that is to be restored
  2. Clear the commit log by removing all the files in the commit log directory (e.g., rm /var/lib/cassandra/commitlog/*)
  3. Remove the database files for all keyspaces (e.g., rm /var/lib/cassandra/data/keyspace1/*.db), taking care not to remove the snapshot directory for the keyspace
  4. Copy the latest snapshot directory contents for each keyspace to the keyspace’s data directory (e.g., cp -p /var/lib/cassandra/data/keyspace1/snapshots/56046198758643-snapshotkeyspace1/* /var/lib/cassandra/data/keyspace1)
  5. Copy any incremental backups taken for each keyspace into the keyspace’s data directory
  6. Repeat steps 3-5 for each keyspace
  7. Restart the node

How do I uninstall Cassandra?

Currently, no uninstaller exists for Cassandra. Therefore, removing Cassandra from a machine consists of the manual deletion of the Cassandra software, data, and log files.

Is my data safe in Cassandra?

Yes. First, data durability is fully supported in Cassandra so that any data written to a database cluster is first written to a commit log in the same fashion as nearly every popular RDBMS does.

Second, Cassandra offers tunable data consistency so that a developer or administrator can choose how strong they wish consistency across nodes to be. The strongest form of consistency is to mandate that any data modifications be made to all nodes, with any unsuccessful attempt on a node resulting in a failed data operation. Cassandra provides consistency in the CAP sense in that all readers will see the same values.

Other forms of tunable consistency involve having a quorum of nodes written to or just one node for the loosest form of consistency. Cassandra is very flexible and allows data consistency to be chosen on a per operation basis if needed so that very strong consistency can be used when desired, or very loose consistency can be utilized when the use case permits.

What type of security does Cassandra offer?

Cassandra 1.2 and higher provides the following built-in security features:

  • Internally managed authentication (login IDs and passwords are managed within Cassandra)
  • Internal authorization / object permission management via GRANT/REVOKE
  • Client-to-node encryption via SSL

None of these security features are enabled by default; they must be configured in the cassandra.yaml file.
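For example, once authentication and authorization are enabled, Cassandra 1.2’s CQL 3 supports statements such as the following (the user, keyspace, and table names are illustrative):

    CREATE USER jsmith WITH PASSWORD 'secret' NOSUPERUSER;
    GRANT SELECT ON TABLE keyspace1.users TO jsmith;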

What options do I have to make sure my data is consistent across nodes?

In Cassandra, consistency refers to how up to date and synchronized a row of data is on all of its replicas. Cassandra offers a number of built-in features to ensure data consistency:

  • Hinted Handoff Writes – Writes are always sent to all replicas for the specified row regardless of the consistency level specified by the client. If a node happens to be down at the time of a write, its corresponding replicas will save hints about the missed writes, and then hand off the affected rows once the node comes back online again. Hinted handoff maintains data consistency in the face of short, transient node outages.
  • Read Repair – Read operations trigger consistency across all replicas for a requested row using a process called read repair. For reads, there are two types of read requests that a coordinator node can send to a replica: a direct read request and a background read repair request. The number of replicas contacted by a direct read request is determined by the read consistency level specified by the client. Background read repair requests are sent to any additional replicas that did not receive a direct request. Read repair requests ensure that the requested row is made consistent on all replicas.
  • Anti-Entropy Node Repair – For data that is not read frequently, or to update data on a node that has been down for an extended period, the node repair process (also referred to as anti-entropy repair) ensures that all data on a replica is made consistent. Node repair (using the nodetool utility) should be run routinely as part of regular cluster maintenance operations.

What is ‘tunable consistency’ in Cassandra?

Cassandra extends the concept of ‘eventual consistency’ by offering ‘tunable consistency’. For any given read or write operation, the client application decides how consistent the requested data should be.

Consistency levels in Cassandra can be set on any read or write query. This allows application developers to tune consistency on a per-query basis depending on their requirements for response time versus data accuracy. Cassandra offers a number of consistency levels for both reads and writes.

Choosing a consistency level for reads and writes involves determining your requirements for consistent results (always reading the most recently written data) versus read or write latency (the time it takes for the requested data to be returned or for the write to succeed).

If latency is a top priority, consider a consistency level of ONE (only one replica node must successfully respond to the read or write request). There is a higher probability of stale data being read with this consistency level (as the replicas contacted for reads may not always have the most recent write). For some applications, this may be an acceptable trade-off.

If consistency is top priority, you can ensure that a read will always reflect the most recent write by using the following formula:

(nodes_written + nodes_read) > replication_factor

For example, if your application is using the QUORUM consistency level for both write and read operations and you are using a replication factor of 3, then this ensures that 2 nodes are always written and 2 nodes are always read. The combination of nodes written and read (4) being greater than the replication factor (3) ensures strong read consistency.
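As an illustration using the CQL 2 syntax of this era (newer drivers instead set the consistency level through the client API; the column family and key are illustrative), a read at QUORUM might be written:

    SELECT * FROM users USING CONSISTENCY QUORUM WHERE KEY = 'jsmith';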

How do I load data into Cassandra?

With respect to loading external data, Cassandra supplies a load utility called the sstableloader. The sstableloader is able to load flat files into Cassandra, however the files must first be converted into SSTable format. An example of how to do this can be found at: http://www.datastax.com/dev/blog/bulk-loading.
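The basic invocation (options and directory layout vary by version) points the utility at a directory of SSTable files named for the target keyspace:

    $ bin/sstableloader /path/to/keyspace1/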

How can I move data from another database to Cassandra?

Most RDBMSs have an unload utility that allows data to be unloaded to flat files. Once in flat file format, the sstableloader utility can be used to load the data into Cassandra column families.

In addition, DataStax has partnered with various data integration vendors such as Pentaho to provide a free and powerful extract-transform-load (ETL) framework that allows easy migration of various source systems (e.g., Oracle, MySQL) into Cassandra.

What is read repair in Cassandra?

Read operations trigger consistency checks across all replicas for a requested row using a process called read repair. For reads, there are two types of read requests that a coordinator node can send to a replica: a direct read request and a background read repair request. The number of replicas contacted by a direct read request is determined by the read consistency level specified by the client. Background read repair requests are sent to any additional replicas that did not receive a direct request. Read repair requests ensure that the requested row is made consistent on all replicas. Read repair is an optional feature and can be configured per column family.

How can I move data from other databases/sources to Cassandra?

There are a number of internal utilities and external tools that allow data to be easily moved into and out of Cassandra. See this blog post that describes the most commonly used methods.

What client libraries/drivers can I use with Cassandra?

There are a number of CQL (Cassandra Query Language) drivers and native client libraries available for most popular development languages (e.g., Java, Ruby). All drivers and client libraries can be downloaded from: http://www.datastax.com/download/clientdrivers.

What type of data model does Cassandra use?

The Cassandra data model is a dynamic schema, column-oriented data model. This means that, unlike a relational database, you do not need to model all of the columns required by your application up front, as each row is not required to have the same set of columns. Columns and their metadata can be added by your application as needed without incurring downtime to your application.

Although it is natural to want to compare the Cassandra data model to a relational database, they are really quite different. In a relational database, data is stored in tables and the tables comprising an application are typically related to each other. Data is usually normalized to reduce redundant entries, and tables are joined on common keys to satisfy a given query.

In Cassandra, the keyspace is the container for your application data, similar to a database or schema in a relational database. Inside the keyspace are one or more column family objects, which are analogous to tables. Column families contain columns, and a set of related columns is identified by an application-supplied row key. Each row in a column family is not required to have the same set of columns.

Cassandra does not enforce relationships between column families the way that relational databases do between tables: there are no formal foreign keys in Cassandra, and joining column families at query time is not supported. Each column family has a self-contained set of columns that are intended to be accessed together to satisfy specific queries from your application.
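For example, using the CLI against a hypothetical users column family (assuming it was created with a UTF8Type comparator and validator), two rows can carry entirely different columns:

    [default@demo] set users['jsmith']['first'] = 'John';
    [default@demo] set users['jsmith']['email'] = 'jsmith@example.com';
    [default@demo] set users['ajones']['phone'] = '555-1212';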

What datatypes does Cassandra support?

In a relational database, you must specify a data type for each column when you define a table. The data type constrains the values that can be inserted into that column. For example, if you have a column defined as an integer datatype, you would not be allowed to insert character data into that column.

In Cassandra, you can specify a data type for both the column name (called a comparator) as well as for row key and column values (called a validator).

Column and row key data in Cassandra is always stored internally as hex byte arrays, but the comparator/validators are used to verify data on insert and translate data on retrieval. In the case of comparators (column names), the comparator also determines the sort order in which columns are stored.

Cassandra comes with the following comparators and validators:

BytesType: bytes (no validation)
AsciiType: US-ASCII bytes
UTF8Type: UTF-8 encoded strings
LongType: 64-bit longs
LexicalUUIDType: 128-bit UUID, sorted by byte value
TimeUUIDType: version 1 128-bit UUID, sorted by timestamp
CounterColumnType*: 64-bit signed integer

*Can only be used as a column validator, not valid as a row key validator or column name comparator

A simple example might be:

CREATE COLUMNFAMILY Standard1 WITH comparator_type = "UTF8Type";

What is a keyspace in Cassandra?

In Cassandra, the keyspace is the container for your application data, similar to a schema in a relational database. Keyspaces are used to group column families together. Typically, a cluster has one keyspace per application.

Replication is controlled on a per-keyspace basis, so data that has different replication requirements should reside in different keyspaces. Keyspaces are not designed to be used as a significant map layer within the data model, only as a way to control data replication for a set of column families.

What is a column family in Cassandra?

When comparing Cassandra to a relational database, the column family is similar to a table in that it is a container for columns and rows. However, a column family requires a major shift in thinking for those coming from the relational world.

In a relational database, you define tables, which have defined columns. The table defines the column names and their data types, and the client application then supplies rows conforming to that schema: each row contains the same fixed set of columns.

In Cassandra, you define column families. Column families can (and should) define metadata about the columns, but the actual columns that make up a row are determined by the client application. Each row can have a different set of columns.

What is a supercolumn in Cassandra?

A Cassandra column family can contain regular columns (key/value pairs) or super columns. Super columns add another level of nesting to the regular column family column structure. Super columns are composed of a (super) column name and an ordered map of sub-columns. A super column is a way to group multiple columns based on a common lookup value.

When should I use a supercolumn in Cassandra?

The primary use case for super columns is to denormalize multiple rows from other column families into a single row, allowing for materialized view data retrieval.

Super columns should not be used when the number of sub-columns is expected to be a large number. During reads, all sub-columns of a super column must be deserialized to read a single sub-column, so performance of super columns is not optimal if there are a large number of sub-columns. Also, you cannot create a secondary index on a sub-column of a super column.

Does Cassandra support transactions?

Yes and no, depending on what is meant by “transactions.” Unlike relational databases, Cassandra does not offer fully ACID-compliant transactions. There is no locking or transactional dependencies when concurrently updating multiple rows or column families. But if by “transactions” you mean real-time data entry and retrieval, with durability and tunable consistency, then yes.

Cassandra does not support transactions in the sense of bundling multiple row updates into one all-or-nothing operation. Nor does it roll back when a write succeeds on one replica, but fails on other replicas. It is possible in Cassandra to have a write operation report a failure to the client, but still actually persist the write to a replica.

However, this does not mean that Cassandra cannot be used as an operational or real time datastore. Data is very safe in Cassandra because writes in Cassandra are durable. All writes to a replica node are recorded both in memory and in a commit log before they are acknowledged as a success. If a crash or server failure occurs before the memory tables are flushed to disk, the commit log is replayed on restart to recover any lost writes.

What is the CQL language?

Cassandra 0.8 is the first release to introduce Cassandra Query Language (CQL), the first standardized query language for Apache Cassandra. CQL pushes all implementation details to the server in the form of a CQL parser. Clients built on CQL only need to know how to interpret query result objects. CQL is the start of the first officially supported client API for Apache Cassandra. CQL drivers for the various languages are hosted within the Apache Cassandra project.

CQL syntax is based on SQL (Structured Query Language), the standard for relational database manipulation. Although CQL has many similarities to SQL, it does not change the underlying Cassandra data model. There is no support for JOINs, for example.
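For example, a CQL query reads much like SQL but always targets a single column family (CQL 2 syntax of this era; the users column family is illustrative):

    SELECT first, email FROM users WHERE KEY = 'jsmith';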

What is a compaction in Cassandra?

Cassandra is optimized for write throughput. Cassandra writes are first written to a commit log (for durability), and then to an in-memory table structure called a memtable. Writes are batched in memory and periodically written to disk to a persistent table structure called an SSTable (Sorted String table).

The “Sorted” part means SSTables are sorted by row token (as determined by the partitioner), which is what makes merges for compaction efficient (i.e., don’t have to read entire SSTables into memory). Row contents are also sorted by column comparator, so Cassandra can support larger-than-memory rows too. SSTables are immutable (i.e., they are not written to again after they have been flushed). This means that a row is typically stored across multiple SSTable files.

In the background, Cassandra periodically merges SSTables together into larger SSTables using a process called compaction. Compaction merges row fragments together, removes expired tombstones (deleted columns), and rebuilds primary and secondary indexes. Since the SSTable files are sorted by row key, this merge is efficient (no random disk I/O). Once a newly merged SSTable is complete, the smaller input SSTables are marked as obsolete and eventually deleted by the Java Virtual Machine (JVM) garbage collection (GC) process. However, during compaction, there is a temporary spike in disk space usage and disk I/O on the node.
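Compaction runs automatically, but a major compaction can also be triggered manually with nodetool (the keyspace and column family names are illustrative):

    $ nodetool -h localhost compact keyspace1 Standard1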

What platforms does Cassandra run on?

Cassandra is a Java application, meaning that a compiled binary distribution of Cassandra can run on any platform that has a Java Runtime Environment (JRE), also referred to as a Java Virtual Machine (JVM).

DataStax strongly recommends using the Oracle Sun Java Runtime Environment (JRE), version 1.6.0_19 or later, for optimal performance.

DataStax makes available packaged releases for Red Hat, CentOS, Debian, and Ubuntu Linux, as well as Microsoft Windows and Mac OS X.

What management tools exist for Cassandra?

DataStax supplies both a free and commercial version of DataStax OpsCenter, a visual, browser-based management tool for Cassandra. With OpsCenter, a user can visually carry out many administrative tasks, monitor a cluster for performance, and do much more. Downloads of OpsCenter are available here.

A number of command line tools also ship with Cassandra for querying/writing to the database, performing administration functions, and so on.

Cassandra also exposes a number of statistics and management operations via Java Management Extensions (JMX), a Java technology that supplies tools for managing and monitoring Java applications and services. Any statistic or operation that a Java application has exposed as an MBean can then be monitored or manipulated using JMX.

During normal operation, Cassandra outputs information and statistics that you can monitor using JMX-compliant tools such as JConsole, the Cassandra nodetool utility, or the DataStax OpsCenter centralized management console. With the same tools, you can perform certain administrative commands and operations such as flushing caches or doing a repair.

Finally, third-party vendors make tools that work with Cassandra. Examples include Quest Software with their TOAD for cloud databases product, and Pentaho’s Kettle data integration suite.