The Next Great Data Developer

DataStax is proud to announce “The Next Great Data Developer,” a $10,000 scholarship and paid internship program!

This program is aimed at computer science/engineering students enrolled in a bachelor’s or master’s program in North America. One of the scholarships will be reserved for a female developer to champion women in IT. The two winning contestants will each receive a $10,000 scholarship, paid travel and passes to OSCON, and a paid internship at DataStax!

Our 6 finalists will be invited to present their Apache Cassandra project live on-stage at the Bay Area Cassandra Summit 2013 (June 11th – 12th). Two winning finalists will be selected and awarded their scholarship at OSCON 2013 (July 22nd – 26th) in Portland by the Apache Cassandra Project Chair, Jonathan Ellis!

What

In order to participate in this scholarship program, you must build any kind of application on Apache Cassandra and document your experiences using the following tools:

  1. Create an Activity Blog (Minimum of 5 postings on your project activity)
  2. Create a YouTube, Vimeo, etc. Channel for Video Recordings (Minimum of 2 recordings, any length)
  3. Using Your Blog, Tell Us Your Business Case (Minimum of 1 page, describing how your application can be applied in a business setting)
  4. Create a Working Prototype and Post to GitHub.com

Our judging panel will evaluate your work based on your working prototype and your ability to document your development process using the tools listed above. The panel will review your application and monitor your progress on an ongoing basis. Judges will review how well you’ve communicated your development process.

Who

Participation is limited to students currently enrolled in an undergraduate or graduate computer science/engineering program at a North American college or university. DataStax is committed to championing women in technology, and will reserve one scholarship for a female developer.

When

By May 15, 2013, the judges will finalize their list of 6 applicants and invite them, all expenses paid, to the Bay Area Cassandra Summit on June 11-12, where they will present their projects. Summit attendees can watch the presentations and participate in a question and answer session. DataStax will announce the final 2 winners at OSCON 2013, which takes place July 22-26 in Portland; Apache Cassandra Project Chair, Jonathan Ellis, will present the awards at the show.

Looking to get started with Apache Cassandra? You’ll find more information in our Getting Started tab.

Do you have any questions? Please feel free to reach out for assistance at brady@datastax.com.

Getting Started with Apache Cassandra

If you haven’t begun using Apache Cassandra yet and you wanted a little handholding to help get you started, you’re in luck. This article will help you get your feet wet with Cassandra and show you the basics so you’ll be ready to start developing Cassandra applications in no time.

Why Cassandra?

Do you need a more flexible data model than what’s offered in the relational database world? Would you like to start with a NoSQL database you know can scale to meet any number of concurrent user connections and/or data volume size and run blazingly fast? Have you needed a database that has no single point of failure and one that can easily distribute data among multiple geographies, data centers, and the cloud? Well, that’s Cassandra.

Step 1 – Installing Cassandra

In this article, we’ll show you how to kick the tires of Cassandra on a single machine, but note that it’s also very easy to configure a multi-node, clustered setup, which is what allows Cassandra to really flex its muscles where scale and performance are concerned.

The first step is to download and install Cassandra on your target test machine. To download Cassandra, go to the downloads page at DataStax.com and select the DataStax Community Edition. It includes the most up-to-date, stable version of Cassandra, the Cassandra Query Language (CQL) interface, a sample Cassandra application, and a free version of DataStax OpsCenter, a web-based management and monitoring solution for Cassandra.

This article will show you how to install and get going with Apache Cassandra on a Mac or Linux machine. If you’re using a Windows setup instead, see this article that I wrote, which will guide you through using Cassandra on Windows.

For this exercise, choose the Tarball option for the operating system you’re using (either Linux or Mac). You’ll want to download the DataStax Community server, which includes the CQL (Cassandra Query Language) shell and sample application. For now, don’t worry about downloading DataStax OpsCenter, as we’ll cover that in another article.

Once your download of Cassandra finishes, move the file to whatever directory you’d like to use for testing Cassandra. Then uncompress the file (whose name will change depending on the version you’re downloading):

tar -xzf dsc-cassandra-1.2.2-bin.tar.gz

Then switch to the new Cassandra bin directory and start up Cassandra:

robinsmac:dev robin$ cd dsc-cassandra-1.2.2/bin
robinsmac:bin robin$ sudo ./cassandra
robinsmac:bin robin$  INFO 14:49:57,739 Logging initialized
INFO 14:49:57,750 JVM vendor/version: Java HotSpot(TM) 64-Bit Server VM/1.6.0_35
INFO 14:49:57,750 Heap size: 2093809664/2093809664
INFO 14:49:57,751 Classpath:
.
.
INFO 14:49:59,208 Completed flushing /var/lib/cassandra/data/system/schema_columns/system-schema_columns-ib-2-Data.db (210 bytes) for commitlog position ReplayPosition(segmentId=1362167398602, position=53130)

Step 2 – Connecting to Cassandra

Now that you have Cassandra running, the next thing to do is connect to the server and begin creating database objects. This is done with the Cassandra Query Language (CQL) utility. CQL is a very SQL-like language that lets you create objects as you’re likely used to doing in the RDBMS world.

The CQL utility (cqlsh) is in the same bin directory as the cassandra executable:

robinsmac:bin robin$ ./cqlsh
Connected to Test Cluster at localhost:9160.

[cqlsh 2.3.0 | Cassandra 1.2.2 | CQL spec 3.0.0 | Thrift protocol 19.35.0]

Use HELP for help.
cqlsh>

Step 3 – Creating a Keyspace

Cassandra has the concept of a keyspace, which is similar to a database in an RDBMS. A keyspace holds data objects and is the level where you specify options for a data partitioning and replication strategy.

For this brief introduction, we’ll just create a basic keyspace to hold some example data objects we’ll create:

cqlsh> create keyspace dev
... with replication = {'class':'SimpleStrategy','replication_factor':1};

Step 4 – Creating Data Objects

Now that you have a keyspace created, it’s time to create a data object to store data. Because Cassandra’s data model is based on Google Bigtable, you’ll use column families (also called tables) to store data.

Tables in Cassandra are similar to RDBMS tables, but are much more flexible and dynamic. Cassandra tables have rows like RDBMS tables, but they are a sparse column type of object, meaning that rows in a column family can have different columns depending on the data you want to store for a particular row.
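To make the idea of sparse rows concrete, here is a minimal Python sketch. It is only a toy model that uses plain dictionaries, not how Cassandra stores data internally. The second employee row and the helper names (read_column, stored_columns) are made up for illustration.

```python
# Toy model of a sparse table: each row keeps only the columns it was given.
# Plain dictionaries stand in for Cassandra's storage; all names are illustrative.
emp = {
    1: {"emp_first": "fred", "emp_last": "smith", "emp_dept": "eng"},
    2: {"emp_first": "wilma"},  # hypothetical row with no last name or department
}


def read_column(table, row_key, column):
    """Return the stored value, or None when the row has no such column."""
    return table.get(row_key, {}).get(column)


def stored_columns(table, row_key):
    """List the columns actually present for one row, sorted by name."""
    return sorted(table.get(row_key, {}))


print(stored_columns(emp, 1))
print(stored_columns(emp, 2))
print(read_column(emp, 2, "emp_dept"))
```

The row with key 2 simply has no entry for emp_last or emp_dept, so nothing is stored for them. Reading a missing column yields None here, much as cqlsh shows null for a column that a row does not contain.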

Let’s create a base table to hold employee data:

cqlsh> use dev;
cqlsh:dev> create table emp (empid int primary key,
... emp_first varchar, emp_last varchar, emp_dept varchar);
cqlsh:dev>

The column family is named emp and contains four columns, including the employee ID, which acts as the primary key of the table. Note that a column family must have a primary key that’s used for initial query activity.

Step 5 – Inserting, Manipulating, and Querying Data

Let’s now go ahead and insert data into our new column family using the CQL INSERT command:

cqlsh:dev> insert into emp (empid, emp_first, emp_last, emp_dept)
... values (1,'fred','smith','eng');

Notice how the CQL INSERT syntax looks just like the RDBMS INSERT command. Other DML statements do as well:

cqlsh:dev> update emp set emp_dept = 'fin' where empid = 1;

Querying data uses the familiar SELECT statement:

cqlsh:dev> select * from emp;
empid | emp_dept | emp_first | emp_last
------+----------+-----------+----------
1     |      fin |      fred |    smith

However, look what happens when you try to use a WHERE predicate and reference a non-primary key column:

cqlsh:dev> select * from emp where empid = 1;
empid | emp_dept | emp_first | emp_last
------+----------+-----------+----------
1     |      fin |      fred |    smith
cqlsh:dev> select * from emp where emp_dept = 'fin';
Bad Request: No indexed columns present in by-columns clause with Equal operator

In Cassandra, if you want to query columns other than the primary key, you need to create a secondary index on them:

cqlsh:dev> create index idx_dept on emp(emp_dept);
cqlsh:dev> select * from emp where emp_dept = 'fin';
empid | emp_dept | emp_first | emp_last
------+----------+-----------+----------
1     |      fin |      fred |    smith
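The behavior above can be pictured with a small Python sketch. This is a toy, single-process model under simplifying assumptions, not Cassandra's actual implementation: rows are found directly by primary key, and a lookup on any other column is rejected until an index (a map from column value to primary keys) exists. The class and method names (ToyTable, write, create_index, select_where) are hypothetical.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a table keyed by primary key (illustrative only)."""

    def __init__(self):
        self.rows = {}     # primary key -> {column: value}
        self.indexes = {}  # indexed column -> {value: set of primary keys}

    def write(self, key, **columns):
        """Store or overwrite columns for one row, keeping indexes in step."""
        old = self.rows.get(key, {})
        for column, index in self.indexes.items():
            if column in columns:
                if column in old:
                    index[old[column]].discard(key)
                index[columns[column]].add(key)
        self.rows[key] = {**old, **columns}

    def create_index(self, column):
        """Build a value -> primary keys map over the rows stored so far."""
        index = defaultdict(set)
        for key, row in self.rows.items():
            if column in row:
                index[row[column]].add(key)
        self.indexes[column] = index

    def get(self, key):
        """Primary-key lookup: always available, no index required."""
        return self.rows.get(key)

    def select_where(self, column, value):
        """Lookup on a non-key column: refused unless the column is indexed."""
        if column not in self.indexes:
            raise LookupError(f"no index on column {column!r}")
        keys = sorted(self.indexes[column].get(value, ()))
        return [dict(self.rows[k], empid=k) for k in keys]


emp = ToyTable()
emp.write(1, emp_first="fred", emp_last="smith", emp_dept="eng")
emp.write(1, emp_dept="fin")

try:
    emp.select_where("emp_dept", "fin")
except LookupError as err:
    print("rejected:", err)

emp.create_index("emp_dept")
print(emp.select_where("emp_dept", "fin"))
```

The first select_where call is rejected, mirroring the Bad Request message, and the same call succeeds once the index exists. In a real cluster the index is maintained by Cassandra across nodes, so treat this only as a mental model of why the extra step is needed.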

Conclusion

We’ve reached the end for this short article on how to get started with Cassandra. Hopefully, you now have a basic feel for how to install, create objects, manipulate data, and query data in Cassandra.

Where can you go for more information? To get a good overview of Cassandra and its architecture, read the Introduction to Apache Cassandra white paper. To learn more about CQL, as well as about setting up a multi-node Cassandra cluster, see the DataStax online documentation for Apache Cassandra. Also visit the DataStax Dev Center for more articles, technical blog posts, videos, and more.

Judging Panel

Kelly Sommers (@kellabyte), Mobile Developer & DataStax MVP for Apache Cassandra

Kelly Sommers is a software developer from eastern Canada with experience in the telecom and mobile industries. Kelly has been an advisor to Microsoft’s Patterns and Practices CQRS Journey project. Kelly is well known on Twitter and her popular blog (kellabyte.com) for her interactions and bringing energy to a diverse set of communities across multiple platforms such as mobile, big data, and distributed systems. She has created a mailing list (distsys-discuss on google groups) to bring people from different ecosystems together to discuss topics about distributed systems. On most days you will find Kelly learning something new and involving the community along the way.

Patrick McFadin (@PatrickMcFadin), Principal Solutions Architect at DataStax

Patrick is an early adopter of the Cassandra project and is proficient at data modeling and operations topics; he has designed and implemented production systems. Prior to working at DataStax as a Principal Solutions Architect, Patrick was the Chief Architect at Hobsons, an education services company, where his responsibilities included ensuring product availability and scaling for all higher education products. He obtained a BS in Computer Engineering from Cal Poly, San Luis Obispo and holds the distinction of being the only recipient of a medal (as far as anyone can find out) for hacking while serving in the US Navy.

Gwen Shapira (@Gwenshap), Lead DBA at Pythian

Gwen’s experience includes Linux system administration, web development, technical leadership and many years of Oracle database administration. She is currently a senior consultant at Pythian, Oracle ACE Director, Board member at NoCOUG and a member of the Oak Table Network. Gwen studied computer science, statistics and operations research at the University of Tel Aviv, and then went on to spend the next 15 years in different technical positions in the IT industry. In her own words: “I love troubleshooting and making things faster. I enjoy stringing together different data technologies to make the best of each, and I love explaining things to people – from one-on-one mentorship to presenting to 300 person crowd.”

Christian Hasker (@chasker), Planet Cassandra Editor at DataStax

Christian leads community and marketing efforts at DataStax. Prior to working at DataStax, he led product marketing for the database business unit of Quest Software, acquired by Dell. When he’s not marketing for a living, Christian enjoys cooking, playing piano, working on his language learning website LingoJingo.com and running around after his 6-year-old twin daughters.

Jonathan Ellis (@spyced), Apache Cassandra Project Chair

Jonathan Ellis is CTO and co-founder at DataStax. Prior to DataStax, Jonathan worked extensively with Apache Cassandra while employed at Rackspace. Before Rackspace, Jonathan built a multi-petabyte, scalable storage system based on Reed-Solomon encoding for backup provider Mozy. Jonathan graduated from Brigham Young University with a BS in Computer Science.

Finalists

Six Finalists

By May 15, 2013, the judges will finalize their list of 6 applicants and invite them, all expenses paid, to the Bay Area Cassandra Summit on June 11-12, where they will present their projects. Summit attendees can watch the presentations and participate in a question and answer session.

Two Winners

DataStax will announce the 2 winners prior to OSCON 2013, which takes place July 22-26 in Portland. The two winners will be flown out to Portland, OR to participate in OSCON 2013 where the Apache Cassandra Project Chair, Jonathan Ellis, will present their $10,000 scholarships and an invitation to join the DataStax team for an internship.

The DataStax paid internship opportunity gives students the chance to experience working for a fast-growing tech startup in Silicon Valley. The internship period is very flexible, and we will work with your schedule to best accommodate you. We understand that you are in school and it’s important that your education come first.

So, are you ready to apply?

Apply Here

If you have any questions or need help, please reach out to Brady Gentile.