DataStax Developer Blog

Cassandra Flume Sink and Logsandra Integration

By Tyler Hobbs - September 14, 2010

Flume is an open source project described as a “distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data”. Because that large amount of log data eventually needs to be stored in a distributed, reliable, and available way, there was interest in allowing Cassandra to be used as a destination for data, or “sink,” in Flume terminology.

A new Cassandra plugin for Flume is designed to require very little configuration and produce good results out of the box. The plugin offers two sinks: a simple sink that indexes entries by date, and a sink designed to take syslog events and store them for use with the Logsandra search engine, a tool for searching and analyzing logs stored in Cassandra.

Simple Cassandra Sink

This sink makes use of two column families: one for storing the data, and one for storing an index. Upon receiving a Flume event, the Cassandra sink does the following:

  1. Create a column where the name is a timestamp-based UUID and the value is empty. Insert this column into the index column family with row key YYYYMMDDHH (the current hour).
  2. Create a second column where the name is “data” and the value is the body of the Flume event. Insert this column into the data column family using the UUID from step #1 as the row key.

The bulk of the workload — storing the event bodies — is evenly distributed around the Cassandra cluster, but the index column family still allows you to easily query for all events from a slice of time.
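The two steps above, and the time-slice query they make possible, can be sketched in a few lines of Python. This is only an illustration of the data layout, not the plugin's code: the dictionaries `index_cf` and `data_cf` are in-memory stand-ins for the two column families, and the function names `hour_key`, `store_event`, and `events_for_hour` are made up for this example.

```python
import uuid
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical in-memory stand-ins for the two column families.
# Each maps: row key -> {column name -> column value}
index_cf = defaultdict(dict)
data_cf = defaultdict(dict)


def hour_key(when):
    """Format a datetime as the YYYYMMDDHH row key used by the index."""
    return when.strftime("%Y%m%d%H")


def store_event(body, now=None):
    """Store one event body using the two-step layout described above.

    `now` only selects the index row (handy for testing); the UUID
    itself is always generated from the wall clock.
    """
    now = now or datetime.now(timezone.utc)
    event_id = uuid.uuid1()  # version 1 UUIDs are timestamp-based

    # Step 1: index column; name is the UUID, value is empty,
    # row key is the current hour.
    index_cf[hour_key(now)][event_id] = b""

    # Step 2: data column named "data"; row key is the UUID from step 1.
    data_cf[event_id]["data"] = body
    return event_id


def events_for_hour(when):
    """Return the bodies of all events indexed under the given hour."""
    event_ids = index_cf.get(hour_key(when), {})
    return [data_cf[event_id]["data"] for event_id in event_ids]
```

Reading one hour of events costs one lookup in the index row plus one lookup per event in the data column family. Because the data rows are keyed by UUID, a real cluster would spread them across nodes, while each index row holds only the small, empty-valued columns for a single hour.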

Logsandra Syslog Sink

Logsandra is a project that allows you to search logs by keyword and view graphs of their occurrence over time.

Assuming your Cassandra cluster is set up to work with Logsandra, the sink should be ready to use out of the box; all it requires is a list of Cassandra servers.

Syslog events are stored in Cassandra using a timestamp-based UUID as the row key, and each event is indexed by its source, syslog facility, and syslog severity. A Logsandra query on any of these fields will show a list of all log events that match.
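As a rough illustration of indexing one event under several fields, here is a small Python sketch. It does not reproduce Logsandra's actual schema, which this post does not describe: the `entries` and `by_field` dictionaries and the functions `store_syslog_event` and `search` are hypothetical names chosen for the example.

```python
import uuid
from collections import defaultdict

# Hypothetical in-memory stand-ins; not Logsandra's actual schema.
entries = {}                  # row key (UUID) -> event fields
by_field = defaultdict(list)  # (field name, value) -> list of row keys

INDEXED_FIELDS = ("source", "facility", "severity")


def store_syslog_event(source, facility, severity, message):
    """Store one event and add it to the index for each indexed field."""
    row_key = uuid.uuid1()  # timestamp-based UUID used as the row key
    event = {
        "source": source,
        "facility": facility,
        "severity": severity,
        "message": message,
    }
    entries[row_key] = event
    for field in INDEXED_FIELDS:
        by_field[(field, event[field])].append(row_key)
    return row_key


def search(field, value):
    """Return every stored event whose indexed field equals the value."""
    return [entries[key] for key in by_field.get((field, value), [])]
```

For example, after storing a few events, `search("severity", "err")` would return every event stored with that severity, and `search("source", "web01")` every event from that host, where `"err"` and `"web01"` are made-up sample values.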

License and Location

The plugin is open source under the MIT license and is available here:

http://github.com/thobbs/flume-cassandra-plugin


