Advanced Time Series Data Modelling
date: December 15, 2015
Collecting Time Series Vs Storing Time Series
Cassandra is well known as the database of choice when collecting time series events. These may be messages, events or similar transactions that have a time element to them. If you are not familiar on how Cassandra holds time series, there is a useful data modelling tutorial on the DataStax academy website.
In this document I will try to explain some of the pros and cons of using time series in Cassandra and show some techniques and tips which make make your application better not just for now but also 5 years down the line.
Choosing you long term storage
Choosing your long term storage is not really a trivial thing. In most applications there are business requirements about how long data will need to be held for and sometimes these requirements change. More and more, business want and are required to hold data for longer. For example, a lot of financial companies must keep audit data for up to seven years.
Using some sample applications.
We will look at some examples and see how time series is used for each.
1. A credit card account which shows transaction for a particular account number. Data is streamed in real time.
2. Collecting energy data for a smart meter. Data comes from files sent from devices after one day of activity.
3. Tick data for a financial instrument. Data is streamed in real time.
All of the above use cases are time series examples and would benefit from using Cassandra. But when we look at the queries and retention policies for this data we may look at different ways of storing them.
Clustering columns for time series.
The credit card application will need to query a users transactions and show them to the user. They will need to be in descending order with the latest transaction first. The data may be paged over multiple pages. This data needs to be kept for 7 years.
Using a simply clustering column in the table definition, will allow all the transactions for a particular account to be on one row for extremely fast retrieval.
Our table model would be similar to this
create table if not exists latest_transactions( credit_card_no text, transaction_time timestamp, transaction_id text, user_id text, location text, items map<text, double>, merchant text, amount double, status text, notes text, PRIMARY KEY (credit_card_no, transaction_time) ) WITH CLUSTERING ORDER BY ( transaction_time desc);
The smart meter application is a little different. The data will come in for each meter no with data every 30 mins of increments to the meter value. Eg. 00:00 - 13, 00:30 11, 01:00 3......23:30 10. So the daily amount is an aggregation of all the data points together.
The business requirement state that the data must be held for 5 years and a days data will always be looked up together. Cassandra's has column type of Map which can be used to hold our daily readings in a format of time offset and value.
Our table model would look something like this
create table if not exists smart_meter_reading ( meter_id int, date timestamp, source_id text, readings map<text, double>, PRIMARY KEY(meter_id, date) ) WITH CLUSTERING ORDER BY(date desc);
This seems sensible until we look at how much data we will be holding and for how long. This application has 10 Million meters and hopes to double that over the next 3 years. If we start at 10M customers holding 365*5 years of data which 48 columns of offset data per day (every half hour), this can quickly add up to over 50o billion points (a map of 48 entries is held as 48 columns) and we haven't talked about the increase over those years. Since we don't have to query the reading individually it might suit better to look at other storage capabilities. A map can be simply transformed to and from a JSON string which would allow us to hold the same data but not have the over head of all the columns.
create table if not exists smart_meter_reading ( meter_id int, date timestamp, source_id text, readings text, PRIMARY KEY(meter_id, date) ) WITH CLUSTERING ORDER BY(date desc);
So instead of 500 billion points we now have 5 billion.
Now we finally look at application no 3. In this case we have data streaming to our application for thousands of different instruments. We can expect on 100,000 ticks a day on some of the instruments. If the requirement is to hold this data long term and be able to create different views of the data for charting capabilities, the storage requirement will be extremely large.
Collecting data vs Storing data.
We can collect the data in the traditional way using a clustering column with a table like so
CREATE TABLE tick_data ( symbol text, date timestamp, time_of_day timestamp, value double, PRIMARY KEY ((symbol,time_of_day), date) ) WITH CLUSTERING ORDER BY (date DESC);
When we think of keeping this data long term we have to understand the implications of having billions of columns in our tables. Our normal queries will be charting the last 5 days of instrument data in 15 min intervals or show the open, high, low and close of an instrument for the last 10 days in 30 mins intervals. So 99% of the queries will be looking at the whole days data. In this example we can then change the long term storage for this table to create a second table which handles any requests for data that is not today. At the end of each day we can compress and the store the data more efficiently for the rest of its life in the database.
For example we can use the following
CREATE TABLE tick_data_binary ( symbol text, date timestamp; dates blob, ticks blob, PRIMARY KEY ((symbol,date)) );
Inserting into the tick_data_binary table can sustain inserts and reads of around 5 million ticks per server, compared to 25000 for the tick_data table. The tick_data_binary table is also three times less storage that the tick_data table. This is not surprising as instead of holding 100,000 TTLs for all the columns, in the binary example we only hold 1. But there are bigger advantages when it comes to Cassandra's management services like compaction and repair. Compaction needs to be able to search for tombstones(deleted columns) which means that the more columns we have, the longer compaction can take. A similar problem arises in repair as this is in fact a compaction job. Comparing the repair time of a table with clustering columns and a table with binary data shows an increase of 10 times for the clustering table over the binary table.
There are always trade offs to each of the models above. For example the binary data in particular can't be filtered using CQL, the filtering needs to happen in some code. This post isn't supposed to be a catch all for time series applications but it it is supposed to help with the modelling of your data, both current and future, and the thought process that goes into that. In particular, don't be afraid to change the data model structure once its usefulness has decreased.
Check out https://academy.datastax.com/tutorials for more information of Cassandra and data modelling. Also have a look at the certification options that you can achieve https://academy.datastax.com/certifications.
For examples of this data models and see the github projects below.