Time Series Data Management at Scale

Using time series data for business analytics is not a novel idea. However, the ability to ingest, store, and analyze that data at massive scale is a unique proposition. DataStax is purpose-built to handle time series data at that scale.

High Throughput and Low Latency

DataStax solutions are built to support high-speed data ingestion, and the Apache Cassandra™ masterless architecture allows the database to be deployed close to applications, offers limitless geo-distribution options, and guarantees low latency at the edge.

Granular Partition Control

At its core, DataStax builds on Apache Cassandra™ to deliver granular partition control for data storage, giving users the ability to group time series data into buckets and access those buckets efficiently.

Spark and Kafka

With integrated support for Apache Spark™, DataStax Enterprise delivers a robust analytical platform for time series aggregations and computations; and, with the Apache Kafka™ connector, DataStax offers seamless data ingestion for event-driven architectures as well.

Sample Data Model and Queries for Time Series

The example below demonstrates a sample data model and queries for collecting and storing sensor data and then retrieving time series data.

Code
CREATE TABLE IF NOT EXISTS iot.sensor_values_over_time (
    machine_id UUID,      // source machine
    sensor_name text,     // sensor name
    time timestamp,       // timestamp of collection
    sensor_value double,  // value of the collected reading
    station_id UUID,      // source location
    PRIMARY KEY ((machine_id, sensor_name), time)
) WITH CLUSTERING ORDER BY (time DESC);

Defining Data Model for Time Series

The CREATE TABLE statement above creates a table to store time series sensor data. It defines a compound primary key: machine_id and sensor_name together form the partition key, which keeps all the data from each sensor in a single partition, while time is the clustering key in descending order, so it determines the on-disk sort order and ensures that the latest data is returned first.
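
Because rows in each partition are stored newest-first, the most recent reading for a sensor can be fetched with a simple LIMIT 1 query. Below is a minimal sketch using the DataStax Python driver and one of the machine ids from the sample data in the next step; the contact point (127.0.0.1) is an assumption for the illustration.

Code
from uuid import UUID
from cassandra.cluster import Cluster  # DataStax Python driver

cluster = Cluster(["127.0.0.1"])  # assumed contact point
session = cluster.connect("iot")

# Rows are clustered by time DESC, so LIMIT 1 returns the newest reading
# for this machine/sensor partition.
row = session.execute(
    "SELECT time, sensor_value FROM sensor_values_over_time "
    "WHERE machine_id = %s AND sensor_name = %s LIMIT 1",
    (UUID("28df63b7-cc57-43cb-9752-fae69d1653da"), "rotational_latency"),
).one()

print(row.time, row.sensor_value)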

Code
INSERT INTO iot.sensor_values_over_time (machine_id, sensor_name, time, sensor_value, station_id) VALUES (28df63b7-cc57-43cb-9752-fae69d1653da, 'rotational_latency', '2019-07-01 00:00:00Z', 97.65195455640468, 28df63b7-cc57-43cb-9752-fae69d1653da);
INSERT INTO iot.sensor_values_over_time (machine_id, sensor_name, time, sensor_value, station_id) VALUES (5752fae6-9d16-43da-b20f-557a1dd5c571, 'half_life', '2019-07-01 00:00:00Z', 102.36957817450308, 28df63b7-cc57-43cb-9752-fae69d1653da);
INSERT INTO iot.sensor_values_over_time (machine_id, sensor_name, time, sensor_value, station_id) VALUES (28df63b7-cc57-43cb-9752-fae69d1653da, 'rotational_latency', '2019-07-01 00:00:01Z', 98.65192355640468, 28df63b7-cc57-43cb-9752-fae69d1653da);
INSERT INTO iot.sensor_values_over_time (machine_id, sensor_name, time, sensor_value, station_id) VALUES (5752fae6-9d16-43da-b20f-557a1dd5c571, 'half_life', '2019-07-01 00:00:01Z', 105.3697657450308, 28df63b7-cc57-43cb-9752-fae69d1653da);
INSERT INTO iot.sensor_values_over_time (machine_id, sensor_name, time, sensor_value, station_id) VALUES (28df63b7-cc57-43cb-9752-fae69d1653da, 'rotational_latency', '2019-07-01 00:00:02Z', 100.65197895640468, 28df63b7-cc57-43cb-9752-fae69d1653da);
INSERT INTO iot.sensor_values_over_time (machine_id, sensor_name, time, sensor_value, station_id) VALUES (5752fae6-9d16-43da-b20f-557a1dd5c571, 'half_life', '2019-07-01 00:00:02Z', 107.36989717450308, 28df63b7-cc57-43cb-9752-fae69d1653da);
INSERT INTO iot.sensor_values_over_time (machine_id, sensor_name, time, sensor_value, station_id) VALUES (28df63b7-cc57-43cb-9752-fae69d1653da, 'rotational_latency', '2019-07-01 00:00:03Z', 97.96695455640468, 28df63b7-cc57-43cb-9752-fae69d1653da);
INSERT INTO iot.sensor_values_over_time (machine_id, sensor_name, time, sensor_value, station_id) VALUES (5752fae6-9d16-43da-b20f-557a1dd5c571, 'half_life', '2019-07-01 00:00:03Z', 109.36957885450308, 28df63b7-cc57-43cb-9752-fae69d1653da);
INSERT INTO iot.sensor_values_over_time (machine_id, sensor_name, time, sensor_value, station_id) VALUES (28df63b7-cc57-43cb-9752-fae69d1653da, 'rotational_latency', '2019-07-01 00:00:04Z', 99.87195455640468, 28df63b7-cc57-43cb-9752-fae69d1653da);
INSERT INTO iot.sensor_values_over_time (machine_id, sensor_name, time, sensor_value, station_id) VALUES (5752fae6-9d16-43da-b20f-557a1dd5c571, 'half_life', '2019-07-01 00:00:04Z', 112.37667817450308, 28df63b7-cc57-43cb-9752-fae69d1653da);
INSERT INTO iot.sensor_values_over_time (machine_id, sensor_name, time, sensor_value, station_id) VALUES (28df63b7-cc57-43cb-9752-fae69d1653da, 'rotational_latency', '2019-07-01 00:00:05Z', 98.65985455640468, 28df63b7-cc57-43cb-9752-fae69d1653da);
INSERT INTO iot.sensor_values_over_time (machine_id, sensor_name, time, sensor_value, station_id) VALUES (5752fae6-9d16-43da-b20f-557a1dd5c571, 'half_life', '2019-07-01 00:00:05Z', 115.37857817450308, 28df63b7-cc57-43cb-9752-fae69d1653da);
INSERT INTO iot.sensor_values_over_time (machine_id, sensor_name, time, sensor_value, station_id) VALUES (28df63b7-cc57-43cb-9752-fae69d1653da, 'rotational_latency', '2019-07-01 00:00:06Z', 100.65195455640468, 28df63b7-cc57-43cb-9752-fae69d1653da);
INSERT INTO iot.sensor_values_over_time (machine_id, sensor_name, time, sensor_value, station_id) VALUES (5752fae6-9d16-43da-b20f-557a1dd5c571, 'half_life', '2019-07-01 00:00:06Z', 118.36957817450308, 28df63b7-cc57-43cb-9752-fae69d1653da);

Inserting Time Series Data

Now use the INSERT statements above to load sensor data into the table with increasing timestamps.
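
In practice, readings would be written programmatically as they arrive rather than hand-typed. The sketch below shows one way to do that with the DataStax Python driver and a prepared statement; the contact point and the randomly generated values are assumptions made for the example.

Code
import random
from datetime import datetime, timedelta, timezone
from uuid import UUID
from cassandra.cluster import Cluster  # DataStax Python driver

cluster = Cluster(["127.0.0.1"])  # assumed contact point
session = cluster.connect("iot")

insert = session.prepare(
    "INSERT INTO sensor_values_over_time "
    "(machine_id, sensor_name, time, sensor_value, station_id) "
    "VALUES (?, ?, ?, ?, ?)"
)

machine_id = UUID("28df63b7-cc57-43cb-9752-fae69d1653da")
station_id = UUID("28df63b7-cc57-43cb-9752-fae69d1653da")
start = datetime(2019, 7, 1, tzinfo=timezone.utc)

# Write one simulated reading per second for a minute with increasing timestamps.
for i in range(60):
    session.execute(insert, (
        machine_id,
        "rotational_latency",
        start + timedelta(seconds=i),
        97.0 + random.random() * 5,  # simulated sensor value
        station_id,
    ))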

Code
SELECT sensor_value
FROM iot.sensor_values_over_time
WHERE machine_id = 28df63b7-cc57-43cb-9752-fae69d1653da
AND sensor_name = 'rotational_latency'
AND time > '2019-06-01 00:00:00'
AND time < '2019-08-01 00:00:00';

Querying Time Series Data

Then use a SELECT statement like the one above to fetch sensor values for a given machine and sensor over a time window. It's that easy!

Code
SELECT sensor_name,
MIN(sensor_value) low,
MAX(sensor_value) high
FROM iot.sensor_values_over_time
WHERE time > '2019-06-01 00:00:00'
AND time < '2019-08-01 00:00:00'
GROUP BY sensor_name;

Time Series Data Aggregation with Spark

Let's take it to the next level: the Spark query above returns the low and high values per sensor over a specified time period.
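
The same aggregation can also be expressed through the DataFrame API when the table is read via the Spark Cassandra Connector. The PySpark sketch below is one possible way to do it, assuming the connector package is on the classpath and spark.cassandra.connection.host points at a reachable node.

Code
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumes the Spark Cassandra Connector package is available and the
# connection host points at the cluster.
spark = (SparkSession.builder
         .appName("sensor-aggregation")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

readings = (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace="iot", table="sensor_values_over_time")
            .load())

# Low and high value per sensor over the chosen time window.
(readings
 .where((F.col("time") > "2019-06-01") & (F.col("time") < "2019-08-01"))
 .groupBy("sensor_name")
 .agg(F.min("sensor_value").alias("low"),
      F.max("sensor_value").alias("high"))
 .show())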

DataStax Solutions for Time Series Data

Built for Streaming and Time Series Data

Blog
Advanced Time Series Data Modelling

Collecting Time Series vs. Storing Time Series

Cassandra is well known as the database of choice when collecting time series events. These may be messages, events, or similar transactions that have a time element to them. If you are not familiar with how Cassandra holds time series, there is a useful data modelling tutorial on the DataStax Academy website (https://academy.datastax.com/demos/getting-started-time-series-data-modeling). In this document I will try to explain some of the pros and cons of using time series in Cassandra and show some techniques and tips which make your application better, not just for now but also five years down the line.

Choosing Your Long Term Storage

Choosing your long term storage is not really a trivial thing. In most applications there are business requirements about how long data will need to be held, and sometimes these requirements change. More and more, businesses want, and are required, to hold data for longer. For example, a lot of financial companies must keep audit data for up to seven years.

Using Some Sample Applications

We will look at some examples and see how time series is used for each:

1. A credit card account which shows transactions for a particular account number. Data is streamed in real time.
2. Collecting energy data for a smart meter. Data comes from files sent from devices after one day of activity.
3. Tick data for a financial instrument. Data is streamed in real time.

All of the above use cases are time series examples and would benefit from using Cassandra. But when we look at the queries and retention policies for this data, we may consider different ways of storing it.

Clustering Columns for Time Series

The credit card application will need to query a user's transactions and show them to the user. They will need to be in descending order, with the latest transaction first, and the data may be paged over multiple pages. This data needs to be kept for seven years. Using a simple clustering column in the table definition allows all the transactions for a particular account to be stored on one row for extremely fast retrieval. Our table model would be similar to this:

create table if not exists latest_transactions (
    credit_card_no text,
    transaction_time timestamp,
    transaction_id text,
    user_id text,
    location text,
    items map<text, double>,
    merchant text,
    amount double,
    status text,
    notes text,
    PRIMARY KEY (credit_card_no, transaction_time)
) WITH CLUSTERING ORDER BY (transaction_time desc);

The smart meter application is a little different. The data comes in for each meter number as 30-minute increments to the meter value, e.g. 00:00 - 13, 00:30 - 11, 01:00 - 3, ..., 23:30 - 10, so the daily amount is an aggregation of all the data points together. The business requirements state that the data must be held for five years and that a day's data will always be looked up together. Cassandra has a Map column type which can be used to hold our daily readings as pairs of time offset and value. Our table model would look something like this:

create table if not exists smart_meter_reading (
    meter_id int,
    date timestamp,
    source_id text,
    readings map<text, double>,
    PRIMARY KEY (meter_id, date)
) WITH CLUSTERING ORDER BY (date desc);
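
To make the map model concrete, here is a rough sketch of how one day's readings might be written from an application using the DataStax Python driver. The keyspace name (iot), contact point, meter id, and reading values are placeholders invented for the illustration.

from datetime import datetime, timedelta, timezone
from cassandra.cluster import Cluster  # DataStax Python driver

cluster = Cluster(["127.0.0.1"])  # assumed contact point
session = cluster.connect("iot")  # keyspace assumed for the illustration

# Build one day's half-hourly readings keyed by "HH:MM" offset.
day = datetime(2019, 7, 1, tzinfo=timezone.utc)
readings = {}
for slot in range(48):
    offset = day + timedelta(minutes=30 * slot)
    readings[offset.strftime("%H:%M")] = 10.0 + slot * 0.1  # placeholder values

insert = session.prepare(
    "INSERT INTO smart_meter_reading (meter_id, date, source_id, readings) "
    "VALUES (?, ?, ?, ?)"
)
session.execute(insert, (42, day, "meter-gateway-1", readings))
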
This seems sensible until we look at how much data we will be holding and for how long. This application has 10 million meters and hopes to double that over the next three years. If we start at 10 million customers holding 365 * 5 years of data with 48 columns of offset data per day (one every half hour), this quickly adds up to over 500 billion points (a map of 48 entries is held as 48 columns), and we haven't even talked about the growth over those years. Since we don't have to query the readings individually, it might suit us better to look at other storage options. A map can be simply transformed to and from a JSON string, which lets us hold the same data without the overhead of all those columns:

create table if not exists smart_meter_reading (
    meter_id int,
    date timestamp,
    source_id text,
    readings text,
    PRIMARY KEY (meter_id, date)
) WITH CLUSTERING ORDER BY (date desc);

So instead of 500 billion points we now have 5 billion.

Now we finally look at application number three. In this case we have data streaming into our application for thousands of different instruments, and we can expect 100,000 ticks a day on some of them. If the requirement is to hold this data long term and to create different views of the data for charting, the storage requirement will be extremely large.

Collecting Data vs. Storing Data

We can collect the data in the traditional way, using a clustering column, with a table like so:

CREATE TABLE tick_data (
    symbol text,
    date timestamp,
    time_of_day timestamp,
    value double,
    PRIMARY KEY ((symbol, time_of_day), date)
) WITH CLUSTERING ORDER BY (date DESC);

When we think of keeping this data long term, we have to understand the implications of having billions of columns in our tables. Our typical queries will chart the last 5 days of instrument data in 15-minute intervals, or show the open, high, low, and close of an instrument for the last 10 days in 30-minute intervals. So 99% of the queries will look at a whole day's data. In this example we can change the long term storage by creating a second table which handles any requests for data that is not from today. At the end of each day we can compress the data and store it more efficiently for the rest of its life in the database. For example, we can use the following:

CREATE TABLE tick_data_binary (
    symbol text,
    date timestamp,
    dates blob,
    ticks blob,
    PRIMARY KEY ((symbol, date))
);

The tick_data_binary table can sustain inserts and reads of around 5 million ticks per server, compared to 25,000 for the tick_data table, and it requires roughly three times less storage. This is not surprising: instead of holding 100,000 TTLs for all the columns, in the binary example we hold only one. But there are bigger advantages when it comes to Cassandra's management services like compaction and repair. Compaction needs to search for tombstones (deleted columns), which means that the more columns we have, the longer compaction can take. A similar problem arises in repair, as this is in fact a compaction job. Comparing the repair time of a table with clustering columns against a table with binary data shows a tenfold increase for the clustering table over the binary table.
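
To illustrate the binary approach, here is one possible way an application might pack a day's ticks into the two blob columns using Python's struct module before writing the row. The encoding (big-endian longs for timestamps, doubles for values), the keyspace name, and the sample symbol are assumptions for this sketch; the linked demo projects may encode the data differently.

import struct
from datetime import datetime, timezone
from cassandra.cluster import Cluster  # DataStax Python driver

cluster = Cluster(["127.0.0.1"])  # assumed contact point
session = cluster.connect("iot")  # keyspace assumed for the illustration

# A day's worth of (epoch-millis, price) ticks collected elsewhere.
ticks = [(1561939200000 + i * 100, 100.0 + i * 0.01) for i in range(100_000)]

# Pack each column into a compact blob: 8-byte big-endian longs for the
# timestamps and 8-byte doubles for the values.
dates_blob = struct.pack(">%dq" % len(ticks), *(t for t, _ in ticks))
ticks_blob = struct.pack(">%dd" % len(ticks), *(v for _, v in ticks))

insert = session.prepare(
    "INSERT INTO tick_data_binary (symbol, date, dates, ticks) "
    "VALUES (?, ?, ?, ?)"
)
session.execute(insert, ("NASDAQ:AAPL",
                         datetime(2019, 7, 1, tzinfo=timezone.utc),
                         dates_blob, ticks_blob))
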
In particular, don't be afraid to change the data model structure once its usefulness has decreased. Check out https://academy.datastax.com/tutorials for more information of Cassandra and data modelling. Also have a look at the certification options that you can achieve https://academy.datastax.com/certifications. For examples of this data models and see the github projects below. https://github.com/DataStaxCodeSamples/datastax-creditcard-demo https://github.com/DataStaxCodeSamples/datastax-iot-demo https://github.com/DataStaxCodeSamples/datastax-tickdata-comparison https://github.com/DataStaxCodeSamples/datastax-tickdb-full

Get the Blog
Video
Designing for Time Series Applications at Scale with DataStax Enterprise

Time series data sets are everywhere—and handling them poses many challenges. Large volumes of data need to be stored over long periods of time and retrieved efficiently, and downtime is not an option. In this talk, we will explore how Apache Cassandra and DataStax Enterprise (DSE) have built-in features, such as DSE File System, Search, and Analytics, that can be used to design robust solutions at scale for time series use cases.

Get the Video
Whitepaper
DataStax Enterprise and Apache Kafka™ for Modern Architectures

DataStax Enterprise and Apache Kafka are designed specifically to fit the needs of modern, next-generation businesses. With DataStax Enterprise (DSE) providing the blazing fast, highly-available hybrid cloud data layer and Apache Kafka™ detangling the web of complex architectures via its distributed streaming attributes, these two form a perfect match for event-driven enterprise architectures.

Get the Whitepaper