Cassandra Summit 2012 Presentations

Slides and videos from Cassandra Summit 2012 will be posted here. Thanks to all who attended; we look forward to seeing you next year!

Session Presentations

The State of Cassandra, 2012

Jonathan Ellis, DataStax
(slides / video)

1, 2, 3, 4, Add Another Data Store

Eric Lubow, SimpleReach
(slides / video)

In order to meet all our data needs including high volume ingestion, Map Reduce capabilities, real-time analytics, historical analytics, and other analysis technologies, we needed to incorporate the use of Redis, Mongo, a MySQL column store and Cassandra. Wrap the whole thing up in a Node.js API for speed and consistent access patterns and you have a whole data storage spread.

Building a Cassandra Based Application from Scratch

Patrick McFadin, Hobsons
(slides / video)

The goal of this talk is to take people unfamiliar with building an application on Cassandra through the design cycle. We’ll take a simple example of a video sharing application and go from design to implementation. One of the first questions people ask when starting with Cassandra is “How will my data look?” By starting from the beginning, you will see how the design process works. We’ll cover topics like data modeling, trade-offs between different design ideas, and considerations for deployment. The language used in this talk will be Java, but the material will be abstract enough for implementation in any language.

Buy It Now! Cassandra at eBay

Jay Patel, eBay
(slides / video)

This talk will cover use cases for Cassandra at eBay. It’ll start with some simple logging & tracking use cases and move into a more complex use case called “eBay Social Signal”, which enables like/own/want social-oriented features on eBay product and item pages. For each use case, Jay will cover in-depth data model design with trade-offs, deployment topology, and lessons learned. To conclude, Jay will summarize the best practices that guide Cassandra utilization at eBay.

Cassandra – A Foundation for Real-time Big Data Applications at Reltio

Manish Sood, Reltio
(slides / video)

Reltio provides the world’s largest enterprise customers with real-time Big Data Applications that help business users understand markets, drive revenue and reduce risk through actionable, real-time and predictive insights integrated into applications and dashboards. These applications require the ability to not only handle data scale, latency, and high availability but also deliver capabilities that address reconciliation of data from multiple sources, handle multiple dimensions (various entities, relationships and interaction data) of varying complexity, with the ability to track audit and bi-temporal data. To solve this challenge, Reltio leverages Cassandra as the enabling data store for multi-tenancy, complex attribute structures, graph storage, real-time search and analytics. These capabilities are delivered as a Service to customers across verticals to address scenarios such as Extended Client views, Client experience & engagement, and Risk & Compliance.

Cassandra at Apigee Usergrid: Powering Mobile

Ed Anuff, Apigee
(slides / video)

Usergrid is a cloud service and open source stack built on top of Cassandra for powering mobile and rich client applications. In building Usergrid, Ed and his team tackled advanced Cassandra topics such as deep indexing of JSON documents and multi-tenancy at scale. Usergrid is based on Hector and key members of the Hector team work on the project, so whether you’re interested in mobile development or solving hard problems with Hector, you’ll find this an interesting session.

Cassandra at Telco Scale – Big Data, Mobile Apps, Optimized Footprint, and A/A Resiliency

Darshan Rawal, Openwave
(slides / video)

The most prominent Cassandra deployments are in high-volume “service” (hosted) environments. This talk discusses use cases of Cassandra as part of a “product” that is deployed at global Tier 1 telcos in A/A and “hybrid” cloud deployments.

Cassandra in Action: Solving Big Data Problems

Eddie Satterly, Splunk
(slides / video)

Cassandra in Rackspace Cloud Monitoring

Russell Haering, Rackspace
(slides / video)

Rackspace Cloud Monitoring is a highly-available API-driven monitoring system built to scale up to 1 million active checks. The product also introduces a lot of new and unique features such as a powerful and flexible DSL for specifying alarm thresholds.

Cassandra Performance and Scalability on AWS

Adrian Cockcroft, Netflix
(slides / video)

Netflix published a Cassandra scalability benchmark in 2011 that showed linear scalability as the number of nodes in the cluster was increased from 48 to 288 and over a million triple replicated writes per second. This talk will summarize a range of new benchmarks that take advantage of more powerful EC2 instances and improvements in Cassandra itself.

Cassandra Plus Solr

Matt Stump, SourceNinja
(slides / video)

Most NoSQL solutions force you to give up ad-hoc queries, and force the developer to write code to maintain indexes and perform basic search. DataStax Enterprise solves this problem with the integration of Solr. Solr integration provides robust, scalable indexing and searching, allowing your team to write less code and focus on what really matters.

Columns Enough and Time – Using Cassandra for a Time Series Analytical Repository

John Akred, Accenture
(slides / video)

Time series data is abundant. Machine logs, process sensors, user chats, short broadcast messages and news stories all share that important dimension – time. Existing solutions for storing and managing time series data include data historians, file-based approaches and use of relational systems, but none of these possess the desired scaling, access pattern and use case support, and cost properties that would truly unlock the potential of these vast data streams. Many of the opportunities in that data must be enabled by analytics that are not effectively enabled on current platforms. We show how our work at Accenture Technology Labs demonstrates that Cassandra effectively supports the time series storage, processing and analytical workloads and discuss how it compares to alternate approaches from a scaling, cost and use-case perspective.

CQL, and the Road to Redemption

Eric Evans, Acunu
(slides / video)

A number of technologies have emerged in recent years to address the difficulties of storage in an increasingly data-rich world. A side effect of this license to reimagine, however, is that each of these so-called NoSQL technologies has also reimagined the way that developers access data. Too often, though, these bespoke query interfaces eschew the lessons of the past in order to differentiate, and usability suffers.

CQL (Cassandra Query Language) is an SQL-alike query language for Apache Cassandra. It provides a high-level query interface based on a time-honored standard that most developers should feel right at home with. This talk will cover the motivation for CQL, the history of its development, and will provide a brief introduction to the latest incarnation, CQL3.

End-to-end Analytic Workflows With Cassandra

Jeremy Hanna, DataStax
(slides / video)

As more data is stored in Cassandra, performing batch oriented analytics over that data becomes increasingly valuable. Users may want to perform data validation, discover trends, evolve their Cassandra data model, or just generally explore their data. This presentation will be an overview of how to use tools such as Pig, Mahout and Oozie with Cassandra to create non-trivial analytic workflows. It will draw on experience and lessons learned building end-to-end analytic workflows in a production environment at The Dachis Group.

Hastur: Open-Source Scalable Metrics

Noah Gibbs, Ooyala
(slides / video)

Ooyala recently open-sourced the Hastur monitoring system, backed by Cassandra. We present our Read-Write-Modify-free Cassandra schema for high-volume, variable-size time series data and how we use Btrfs LZO compression on disk to improve both storage space and retrieval time. We also describe the ZeroMQ routing architecture for our metrics, how scaling and fault-tolerance are guaranteed, and future directions for our work.
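As a rough illustration of what a schema without read-before-write can look like for time series data, here is a minimal Python sketch. It is not Hastur's actual schema: the hourly bucket size, the `metric:bucket` row key format, and the dict-backed `TimeSeriesStore` class are all assumptions made for this example, standing in for a real column family.

```python
from collections import defaultdict

BUCKET_SECONDS = 3600  # hypothetical bucket size: one row per metric per hour


def row_key(metric: str, timestamp: int) -> str:
    """Row key = metric name plus the start of the time bucket containing `timestamp`."""
    bucket_start = timestamp - (timestamp % BUCKET_SECONDS)
    return f"{metric}:{bucket_start}"


class TimeSeriesStore:
    """Dict-backed stand-in for a column family: row key -> {timestamp column: value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def append(self, metric, timestamp, value):
        # A blind write: nothing is read back before the new column is written.
        self._rows[row_key(metric, timestamp)][timestamp] = value

    def read(self, metric, start, end):
        """Return (timestamp, value) pairs with start <= timestamp < end, in time order."""
        points = []
        bucket = start - (start % BUCKET_SECONDS)
        while bucket < end:
            row = self._rows.get(f"{metric}:{bucket}", {})
            points.extend((ts, v) for ts, v in row.items() if start <= ts < end)
            bucket += BUCKET_SECONDS
        return sorted(points)


store = TimeSeriesStore()
store.append("cpu.load", 7200, 0.5)
store.append("cpu.load", 3599, 0.2)
store.append("cpu.load", 7300, 0.7)
print(store.read("cpu.load", 3000, 7300))
```

Bucketing by time keeps any single row from growing without bound, and a range read only touches the rows whose buckets overlap the requested interval.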

Increasing Your Prospects: Cassandra in Online Advertising

Ed Capriolo, M6D
(slides / video)

This presentation explores how M6D (Media6Degrees) leverages its capabilities by:

  • using Cassandra with stream processing as an alternative to batch based ETL processes
  • leveraging the ColumnFamily data model to support both static data with fixed column names as well as dynamic data where entries are optionally tagged with columns
  • utilizing Cassandra’s built-in validators to verify data on insert and its type system to store and access data efficiently
  • optimizing with built-in compression to drastically reduce disk usage while increasing performance
  • benefiting from Cassandra’s consistency and fault tolerance which ease administrative burden

Moving at the Speed of Markets

Gyan Aggarwal, Triple Point Technology
(slides / video)

Our application is a Commodity Trading platform where market data changes very frequently. The actual value change can be in a few columns of only some of the column families but it has a very wide impact in the valuation of trades where most of the relevant columns are computed.

This framework has been developed to implement some of the core functionality of an application that has to process a very large volume of data every day and also performs some very complex computations. It has to access data from many column families to perform its complex computation. It runs a batch process on a very large volume of data, and here speed of processing is of the essence.

Cassandra proved to be a natural and cost-effective platform to implement this kind of application. It is natural because it is a column-oriented persistent store. However, it can be a daunting task for a developer to implement a complex application using the Cassandra API. This framework addresses this issue by encapsulating the Cassandra API complexity into a very simple framework API. This framework also implements the one-to-one and one-to-many secondary index column family pattern.

Servers Fail, Who Cares?

Greg Ulrich, Netflix
(slides / video)

Take Any App to the Cloud of your Choice

Uri Cohen, GigaSpaces
(slides / video)

The massive computing and storage resources that are needed to support big data applications make cloud environments an ideal fit. Now more than ever, there is a growing number of choices of cloud infrastructure providers, from Amazon AWS, OpenStack offered by the likes of HP, Rackspace and soon even Dell, VMware vCloud as well as private cloud offerings based on OpenStack, CloudStack, vCloud, and more. There is also a new class of bare-metal clouds from SoftLayer and PistonCloud that provide high performance resources designed for I/O and CPU intensive applications that don’t run as well on virtualized resources. The recent announcements by Google & Microsoft about their new infrastructure-as-a-service offerings add additional significant players to this growing marketplace.

Given the diverse options, and the dynamic environments involved, it becomes ever more important to maintain the flexibility to choose the right cloud for the job.

In this session, you’ll learn how to deploy and manage a Cassandra cluster on any Cloud, as well as manage the rest of your big data application stack using a new open source framework called “Cloudify.” (Visit cloudifysource.org)

Technical Deep Dive: Cassandra + Solr

Jason Rutherglen
(slides / video)

Technical Deep Dive: Data Modeling

Matt Dennis, DataStax
(slides / video)

Cassandra stores data fundamentally differently than traditional RDBMS. These differences allow for vast improvements in performance, availability and scalability, but in order to achieve the gains one must understand the differences. This talk covers a Cassandra solution to a problem often believed to only be solvable in a strictly ACID compliant system. The focus of the solution is on the data model and the interactions with it to achieve scalability, availability and performance at scale.

Technical Deep Dive: Query Performance

Aaron Morton, Cassandra Committer
(slides / video)

Ever wondered how to make Cassandra faster? In this talk, Aaron Morton will step through the impact different configuration settings, data models and query types have on read and write performance.

Technical Deep Dive: Secondary Indexes

Christian Romming, VigLink
(slides / video)

This talk discusses the custom indexing technique that is used to power VigLink’s analytics dashboard. It is similar to Cassandra’s built-in secondary indexing, but also supports efficient range queries. We’ll also cover pitfalls around scaling this technique to high write throughputs.
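To make the idea of an index that supports range queries more concrete, here is a small Python sketch. It is not VigLink's implementation: the `SortedIndex` class is a hypothetical in-memory stand-in for a single wide index row whose columns are kept sorted by `(indexed value, row key)`, so that a range query becomes a contiguous slice of that row.

```python
import bisect


class SortedIndex:
    """In-memory stand-in for one wide index row whose columns sort by (value, row_key)."""

    def __init__(self):
        self._entries = []  # sorted list of (indexed_value, row_key) tuples

    def add(self, value, row_key):
        """Insert an index entry, keeping the list sorted and free of duplicates."""
        entry = (value, row_key)
        pos = bisect.bisect_left(self._entries, entry)
        if pos == len(self._entries) or self._entries[pos] != entry:
            self._entries.insert(pos, entry)

    def range(self, low, high):
        """Return row keys whose indexed value v satisfies low <= v < high."""
        # A 1-tuple sorts before every 2-tuple that shares its first element,
        # so (low,) marks the first entry with value >= low.
        start = bisect.bisect_left(self._entries, (low,))
        end = bisect.bisect_left(self._entries, (high,))
        return [row_key for _, row_key in self._entries[start:end]]


index = SortedIndex()
for clicks, page in [(12, "page-a"), (3, "page-b"), (47, "page-c"), (12, "page-d")]:
    index.add(clicks, page)
print(index.range(10, 40))
```

Because the entries are ordered by value, the range lookup needs only two binary searches and one slice, rather than a scan over every indexed row.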

Titan: Big Graph Data with Cassandra

Matthias Broecheler, Aurelius
(slides / video)

Unlocking the Value of Big Data

Panel hosted by Matt Pfeil (DataStax) featuring Matt Stump (SourceNinja), Eddie Satterly (Splunk), and Patrick McFadin (Hobsons)
(slides / video)

Using Cassandra in an S3 Cloud Storage System

Gary Ogasawara, Gemini Mobile
(slides / video)

Cloudian is a cloud storage software package that is compatible with Amazon’s S3 API. This allows existing applications that use Amazon S3 to simply be re-pointed to a Cloudian system. Beyond the basic object GET/PUT/DELETE functionality, fully compatible S3 APIs include ACL (access control lists), accounting, multi-part uploads, object versioning, multi-regions (location constraint), and more. Cassandra is used in multiple ways, but most notably for its distributed systems algorithms for data partitioning, replication, and node membership. Actual storage of objects is done on the file system, based on the Cassandra algorithms for distributed data management. In addition, Cassandra is used for object metadata, transaction and reporting data, and user and group data. Cloudian is currently in active production use at cloud service providers (CSPs) and enterprises in Japan, the US, and Europe.

Virtual Nodes – Operational Aspirin

Sam Overton, Acunu
(slides / video)

This talk will explain the concept and implementation of virtual nodes in Cassandra, and the numerous benefits they bring. We will show you how virtual nodes make token management a thing of the past, improve failure characteristics, improve bootstrapping and decommission speed, make incremental cluster growth and shrinkage possible, and much more.
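The general idea behind virtual nodes can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cassandra's implementation: the helper names (`token_for`, `build_ring`, `owner`), the use of MD5 to derive tokens, and the choice of 64 tokens per node are all made up for illustration. Each node owns many small token ranges on the ring instead of one large range.

```python
import bisect
import hashlib
from collections import Counter


def token_for(value: str) -> int:
    """Map a string onto the ring with a stable hash (MD5 here, purely for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def build_ring(nodes, tokens_per_node):
    """Return a sorted list of (token, node) pairs; each node owns several tokens."""
    ring = [(token_for(f"{node}#{i}"), node)
            for node in nodes
            for i in range(tokens_per_node)]
    ring.sort()
    return ring


def owner(ring, key):
    """Find the owner of `key`: the first token at or after the key's token, wrapping around."""
    tokens = [token for token, _ in ring]
    idx = bisect.bisect_left(tokens, token_for(key))
    return ring[idx % len(ring)][1]


keys = [f"key-{i}" for i in range(2000)]
before = build_ring(["node-a", "node-b", "node-c"], tokens_per_node=64)
after = build_ring(["node-a", "node-b", "node-c", "node-d"], tokens_per_node=64)

# Compare ownership before and after node-d joins the ring.
moved = [(owner(before, k), owner(after, k))
         for k in keys if owner(before, k) != owner(after, k)]
donors = Counter(old for old, _ in moved)
print(f"{len(moved)} of {len(keys)} keys moved; donors: {dict(donors)}")
```

In this model, the only keys that change owner when `node-d` joins are the ones that move to `node-d`, and because its tokens are scattered around the ring, those keys are typically drawn from several existing nodes rather than from a single neighbor.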

We Messed Up, So You Don’t Have to

Panel hosted by Matt Pfeil (DataStax) featuring Matt Dennis (DataStax), Jason Brown (Netflix), Lee Parker (Spredfast), and Ed Capriolo (M6D)
(slides / video)

Why Architecture Matters (No, Really)

Rick Branson, Instagram
(slides / video)

Lightning Talks

Actuate: From Data to Insights

Mihail Mihaylov, Actuate
(slides / video)

M2M with Cassandra

Subu Balakrishnan, Aeris
(slides / video)

Scaling MongoDB with Cassandra

Ed Anuff and Nate McCall, Apigee
(slides / video)

Robin Schumacher, DataStax
(slides / video)

Cassandra on ACID

DeWayne Filppi, GigaSpaces
(slides / video)

Reporting and Analytics on Cassandra

Matthew Dahlman, Jaspersoft
(slides / video)

Upcoming Changes in Drivers

Michael Figuiere, DataStax
(slides / video)

Astyanax: Hector’s Smart-ass Son

Eran Landau, Netflix
(video)

Toad for Cloud: How to write SQL against Cassandra

Peter Evans, Quest
(slides / video)

Node.js and Cassandra (Helenus)

Russell Bradberry, SimpleReach
(slides / video)

Apache Cassandra, Cassandra, Apache Hadoop, Hadoop, Apache Solr, Solr and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission as of 2012. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided at this event, which is managed by DataStax.