If you haven’t begun using Apache Cassandra yet and want a little handholding to get started, you’re in luck. This article will help you get your feet wet with Cassandra and show you the basics so you’ll be ready to start developing Cassandra applications in no time.
Do you need a more flexible data model than what’s offered in the relational database world? Would you like to start with a database that can scale to handle large numbers of concurrent user connections and large data volumes while staying fast? Do you need a database that has no single point of failure and can easily distribute data among multiple geographies, data centers, and the cloud? Well, that’s Cassandra.
In this article, we’ll show you how to kick the tires of Cassandra on a single machine, but note that it’s also very easy to configure a multi-node, clustered setup, which is what allows Cassandra to really flex its muscles where scale and performance are concerned.
The first step is to download and install Cassandra on your target test machine. To download Cassandra, go to www.datastax.com/download and select the DataStax Community Edition. It includes the most up-to-date, stable version of Cassandra, the Cassandra Query Language (CQL) interface, a sample Cassandra application, and a free version of DataStax OpsCenter, a web-based management and monitoring solution for Cassandra.
For this exercise, choose the tarball option for the operating system you’re using (either Linux or Mac). You’ll want to download the DataStax Community server, which includes the CQL shell and the sample application. For now, don’t worry about downloading OpsCenter, as we’ll cover that in another article.
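Once your download of Cassandra finishes, move the file to whatever directory you’d like to use for testing Cassandra. Then uncompress the file, for example with tar -xzf followed by the name of the downloaded archive.
Once you uncompress the Cassandra database server, start Cassandra by running the cassandra command from the bin directory of the unpacked distribution (bin/cassandra; adding the -f option keeps the server in the foreground).
Now you have Cassandra running. Next, connect to Cassandra with the CQL shell, cqlsh, which lives in the same bin directory (bin/cqlsh).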
Now you’re ready to start creating Cassandra keyspaces and data objects.
The nice thing about Cassandra is that CQL makes it very easy to get started for anyone coming from legacy relational databases (and that’s probably you and most everyone you know). CQL is very much like SQL, so the learning curve with Cassandra is gentle where creating objects, manipulating data, and querying data are concerned.
Cassandra has the concept of a keyspace, which is similar to a database in an RDBMS. A keyspace holds data objects and is the level at which you specify options for data partitioning and the replication strategy.
For this brief introduction, we’ll just create a basic keyspace to hold the data objects we’ll create:
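A minimal sketch of such a statement, written in the CQL 2-era syntax that matches the column-family terminology used in this article. The keyspace name demo and the replication factor of 1 (reasonable only for a single test node) are assumptions for illustration, not values taken from the article.

```cql
-- Legacy CQL 2 form: replication is configured through
-- strategy_class and strategy_options.
CREATE KEYSPACE demo
  WITH strategy_class = 'SimpleStrategy'
  AND strategy_options:replication_factor = 1;

-- Rough equivalent on current releases, which use CQL 3:
-- CREATE KEYSPACE demo
--   WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
```

SimpleStrategy places replicas without regard to data centers, which is fine for a one-machine trial; multi-data-center clusters typically use NetworkTopologyStrategy instead.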
Note that you can have multiple keyspaces in a Cassandra server/cluster, so when you’re ready to start creating objects, you need to use the USE command to tell Cassandra which keyspace you want to work with.
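As a small illustration, selecting a keyspace for the current cqlsh session might look like the following, where demo is a hypothetical keyspace name standing in for whatever name you chose:

```cql
-- Make demo the default keyspace for subsequent statements in this session.
USE demo;
```

After this, unqualified object names such as emp resolve against that keyspace.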
Now that you have a keyspace created, it’s time to create a data object to store data. Because Cassandra’s data model is derived from Google Bigtable, you’ll use column families to store data.
Column families are similar to RDBMS tables, but are much more flexible and dynamic. Column families have rows like RDBMS tables, but they are a sparse column type of object, meaning that rows in a column family can have different columns depending on the data you want to store for a particular row.
Let’s create a base column family to hold employee data:
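A sketch of what that statement could look like in the legacy CQL 2 syntax. The column family name emp comes from this article; the column name empid is an assumption chosen for illustration.

```cql
-- One declared column: the numeric employee ID, which is the primary key.
-- CQL 3 prefers the keyword TABLE; COLUMNFAMILY is kept as a synonym.
CREATE COLUMNFAMILY emp (
  empid int PRIMARY KEY
);
```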
The column family is named emp and contains only one column at the moment: the employee ID, which is numeric and acts as the primary key of the column family. Note that a column family must have a primary key, which is used for initial query activity.
Let’s now go ahead and insert data into our new column family using the CQL INSERT command:
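A hedged sketch of two such inserts. The column names (empid, emp_first, emp_last, emp_dept) and the sample values are invented for illustration; the only detail taken from the article is that employee 1 starts in the ‘eng’ department. These statements rely on the older CQL 2 behavior of accepting columns that were never declared.

```cql
-- Two regular employees. Only empid was declared when emp was created;
-- the other columns are supplied at insert time (CQL 2 behavior).
INSERT INTO emp (empid, emp_first, emp_last, emp_dept)
  VALUES (1, 'ann', 'lee', 'eng');
INSERT INTO emp (empid, emp_first, emp_last, emp_dept)
  VALUES (2, 'raj', 'patel', 'mkt');
```

On current CQL 3 releases every column must be part of the table definition first, so the same inserts would be rejected until the columns are declared.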
Notice how you don’t have to predefine columns ahead of time to insert data. You certainly can if you like, and assign specific datatypes, etc., but it’s not required.
Now let’s insert a new row for an employee who is a manager of other employees. Managers might have additional information that regular employees don’t have:
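For illustration, a manager row with two extra columns might be inserted as follows. The columns emp_title and emp_office, like the other column names and values here, are hypothetical and not taken from the article; the statement again assumes the legacy CQL 2 dynamic-column behavior.

```cql
-- Third row: a manager, carrying two columns the other rows do not have.
INSERT INTO emp (empid, emp_first, emp_last, emp_dept, emp_title, emp_office)
  VALUES (3, 'maria', 'chen', 'eng', 'director', 'bldg-2');
```

Under CQL 3 you would first add the extra columns to the schema (for example with ALTER TABLE emp ADD emp_title text), and rows that do not set them simply store nothing for those columns.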
The third row inserted is a manager row and this row has additional columns not needed for regular employees. Because Cassandra column families are sparse row data objects, this is perfectly fine. Row 3 has two extra columns that the other rows do not.
Just as the CQL command for inserting data is nearly identical to SQL, the commands for updating, deleting, and truncating data in a column family also resemble SQL. For example, if you want to update employee number 1 to be in the ‘fin’ department instead of ‘eng’, you would issue the well-known UPDATE command:
One difference between Cassandra and traditional RDBMSs where DML commands are concerned is that you can specify how consistent you want your data to be across the various nodes in a Cassandra cluster upon command completion. This is called “tunable” data consistency in Cassandra and can be specified on a per-operation basis.
For example, if you had a Cassandra cluster of 4 machines with the data replicated to all 4, and you wanted to issue the above update and ensure that it succeeded on a quorum of the replicas (a majority, which would be 3 machines), you would issue the following update command:
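A minimal sketch of that statement. The column names emp_dept and empid are assumptions for illustration, not names confirmed by the article.

```cql
-- Move employee 1 from 'eng' to 'fin'.
UPDATE emp SET emp_dept = 'fin' WHERE empid = 1;
```

Unlike in most relational databases, an UPDATE in Cassandra does not check that the row already exists; it writes the column value for that key either way.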
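In the legacy CQL 2 syntax, a per-statement consistency level could be written roughly as below; the column names are again assumptions for illustration. A quorum is a majority of the replicas, floor(RF / 2) + 1, so with a replication factor of 4 it is 3.

```cql
-- Legacy CQL 2: require acknowledgement from a quorum of replicas.
UPDATE emp USING CONSISTENCY QUORUM
  SET emp_dept = 'fin'
  WHERE empid = 1;

-- CQL 3 dropped the USING CONSISTENCY clause. There the level is set by the
-- client driver, or in cqlsh with the shell command:
-- CONSISTENCY QUORUM;
```

On the single-node setup used in this article, a replication factor of 1 makes QUORUM equal to one replica, so the statement behaves the same as the plain UPDATE.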
If you want to create a column family with predefined columns for the columns you know you’re going to have, you can do that. Let’s recreate our emp column family using predefined column datatypes:
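One way this could look, assuming the same hypothetical column names used for illustration elsewhere (empid, emp_first, emp_last, emp_dept). Dropping the column family discards its rows, so any sample data has to be inserted again afterwards.

```cql
-- Remove the old definition (this deletes its data as well).
DROP COLUMNFAMILY emp;

-- Recreate emp with explicitly typed columns.
CREATE COLUMNFAMILY emp (
  empid int PRIMARY KEY,
  emp_first varchar,
  emp_last varchar,
  emp_dept varchar
);
```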
For more information on creating column families, inserting, and manipulating data, see the online DataStax CQL reference guide.
Querying data in Cassandra with CQL is very easy. The standard SELECT command is used as it is in RDBMSs:
However, watch what happens when you try to query something in the emp column family using a column other than the primary key column:
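Two illustrative queries, assuming hypothetical column names (empid, emp_first, emp_last) that are not confirmed by the article:

```cql
-- Return every row and column in emp.
SELECT * FROM emp;

-- Return selected columns for a single employee, looked up by primary key.
SELECT emp_first, emp_last FROM emp WHERE empid = 1;
```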
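A sketch of such a query, where emp_dept is a hypothetical column that is neither the primary key nor indexed:

```cql
-- Filter on a non-key, non-indexed column.
SELECT * FROM emp WHERE emp_dept = 'fin';
```

Rather than scanning every row as a relational database would, Cassandra rejects this query with an error. The exact message varies by version, so it is not reproduced here.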
In Cassandra, if you want to query columns other than the primary key, you need to create a secondary index on them:
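A minimal sketch, assuming a declared emp_dept column; the index name emp_dept_idx and the column name are chosen for illustration only.

```cql
-- Build a secondary index on the department column.
CREATE INDEX emp_dept_idx ON emp (emp_dept);

-- The same kind of filter is now accepted.
SELECT * FROM emp WHERE emp_dept = 'fin';
```

Secondary indexes tend to work best on columns with relatively few distinct values, such as a department code, where each value matches many rows.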
Note that you have to create a column with a specific column definition (i.e. define it with a datatype) to create an index on it.
If you want to limit the number of rows returned in a query, you can use the LIMIT clause:
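For instance, to return at most two rows from emp (the limit of 2 is arbitrary):

```cql
-- Cap the result set at two rows.
SELECT * FROM emp LIMIT 2;
```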
There’s more you can do with the SELECT command, and for more information, please see the online DataStax CQL reference.
We’ve reached the end of this short article on how to get started with Cassandra. Hopefully, you now have a basic feel for how to install Cassandra, create objects, and manipulate and query data.
To download either the DataStax Community or Enterprise editions, please visit the DataStax downloads page at www.datastax.com/download.