Ep. 94 Distributed Data Show
One of the most common use cases for Apache Cassandra™ is time series. After introducing the concept of time series in a few words, Amanda and Cedrick analyze why Cassandra has gained so much traction, and detail what we see at customers, what the pitfalls are, and what today’s challenges are.
Amanda: Welcome to this episode of the Distributed Data Show. I am Amanda and this is my coworker Cedrick, all the way from Paris! This is our first DDS episode together! Hopefully, next time we film I will get to go to Paris!
Cedrick: Hi, I’m so glad to be here. By the way, we are recording this episode from sunny Florida today. That’s awesome.
Amanda: I know! Today we wanted to discuss time series. Cedrick, along with all his other duties as a Developer Advocate, has also been working a lot with our customers on different time series applications. So maybe first, could you remind us what time series are?
Cedrick: Sure. Simply put, a time series is a sequence of numerical data points, values in successive order, where the order is time. Most natural phenomena can be described that way: choose one source and measure its values over time.
Amanda: I can think of use cases in today’s IT: logs, stocks, sensors, events. All of those are time series. OK, but why Cassandra?
Cedrick: Amanda, if you met someone at an event who asked you to describe Apache Cassandra in one minute, what would you say? Personally, I say it is a distributed database, which means multiple nodes. Each node holds about 1 TB of data and handles about 3K transactions per second per CPU. If you need more capacity, add nodes; if you need more throughput, add nodes. And I think this is the key point here, not the data replication for resiliency. Need more throughput? Tunable consistency can also help: run some queries at CL=ONE and you speed up.
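To make the "add nodes" rule of thumb concrete, here is a minimal back-of-the-envelope sketch in Python. It only uses the rough per-node figures quoted above (about 1 TB per node, about 3K transactions per second per CPU). The function name `nodes_needed`, the `cpus_per_node` parameter, and the default replication factor of 3 are assumptions made for illustration, not figures from the episode, and real sizing depends on hardware and workload.

```python
import math

# Rough per-node figures quoted in the conversation (orders of magnitude only).
DATA_PER_NODE_TB = 1.0
TX_PER_SEC_PER_CPU = 3000


def nodes_needed(raw_data_tb, target_tx_per_sec, cpus_per_node, replication_factor=3):
    """Estimate a node count that covers both storage and throughput.

    replication_factor=3 is an assumed default; each row is stored that many
    times, so it multiplies the storage requirement. Throughput is treated as
    the total operations the cluster must absorb, a deliberate simplification.
    """
    stored_tb = raw_data_tb * replication_factor
    nodes_for_storage = math.ceil(stored_tb / DATA_PER_NODE_TB)

    tx_per_node = TX_PER_SEC_PER_CPU * cpus_per_node
    nodes_for_throughput = math.ceil(target_tx_per_sec / tx_per_node)

    # Whichever constraint is tighter drives the cluster size.
    return max(nodes_for_storage, nodes_for_throughput)


if __name__ == "__main__":
    # 10 TB of raw data, 100K operations per second, 8 CPUs per node.
    print(nodes_needed(10, 100_000, cpus_per_node=8))
```

In this example storage is the limiting factor; with less data and more traffic, the throughput term would decide the cluster size instead.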
Amanda: Throughput is key: the number of events is increasing exponentially, and 5G is coming. So with Cassandra we can easily write the data at a good pace. What about reading the data, then? I would like to be able to draw graphs and charts, run aggregations, and show trends, with both coarse-grained and fine-grained charts.
Part II - Data Modelling
Cedrick: Haha, the success of a Cassandra project is all about data modelling. When you chart a given stock or a given sensor, you are plotting multiple data points for a single entity, so the entity identifier is a good candidate for the partition key. The data will be evenly distributed across the cluster, and you read a single partition to draw the graph.
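The idea of using the entity identifier as the partition key can be sketched with a small in-memory model in Python. This is not Cassandra code: the `SensorTable` class, the `owning_node` helper, and the four-node cluster are hypothetical, and the MD5-plus-modulo placement is only a stand-in for Cassandra's own partitioner and token ring. It shows that all points for one sensor live together, sorted by time, and that one read serves one chart.

```python
import hashlib
from bisect import insort
from collections import defaultdict

NODE_COUNT = 4  # hypothetical cluster size, for illustration only


def owning_node(partition_key):
    """Map a partition key to a node with a stable hash.

    Stand-in for the real partitioner: the point is only that the same key
    always lands on the same node, and different keys spread across nodes.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODE_COUNT


class SensorTable:
    """Toy table: partition key = sensor id, rows ordered by timestamp."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def write(self, sensor_id, timestamp, value):
        # Keep each partition sorted by time, like a clustering column would.
        insort(self.partitions[sensor_id], (timestamp, value))

    def read_series(self, sensor_id):
        # One chart = one partition = one lookup.
        return list(self.partitions.get(sensor_id, []))


if __name__ == "__main__":
    table = SensorTable()
    table.write("sensor-a", 3, 21.5)
    table.write("sensor-a", 1, 20.0)
    table.write("sensor-b", 2, 18.2)
    print(owning_node("sensor-a"), table.read_series("sensor-a"))
```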
Amanda: Wait. If I don’t have a lot of sensors, I will always hit the same node (a hot spot). And if I have a lot of data, I will hit the partition size limit (100 MB or 2 billion cells).
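To get a feel for how quickly a single-sensor partition reaches those limits, here is a rough arithmetic sketch in Python. The two limits are the ones quoted above; the function name `days_until_limit` and the sample figures (one point per second, one cell and 50 bytes per point) are assumptions chosen only for illustration, and 100 MB is taken here as 100 × 1024 × 1024 bytes.

```python
# Partition limits quoted in the conversation.
MAX_PARTITION_BYTES = 100 * 1024 * 1024  # 100 MB, taken as MiB (assumption)
MAX_PARTITION_CELLS = 2_000_000_000      # 2 billion cells

SECONDS_PER_DAY = 86_400


def days_until_limit(points_per_second, cells_per_point, bytes_per_point):
    """Return how many days one partition can grow before hitting a limit."""
    bytes_per_day = points_per_second * bytes_per_point * SECONDS_PER_DAY
    cells_per_day = points_per_second * cells_per_point * SECONDS_PER_DAY

    days_by_size = MAX_PARTITION_BYTES / bytes_per_day
    days_by_cells = MAX_PARTITION_CELLS / cells_per_day

    # The first limit reached is the one that matters.
    return min(days_by_size, days_by_cells)


if __name__ == "__main__":
    # Hypothetical sensor: 1 point per second, 1 cell and ~50 bytes per point.
    print(round(days_until_limit(1, 1, 50), 1))
```

With these assumed figures, the size limit is reached in under a month, long before the cell limit, which is why the concern about lots of data in one partition is a practical one.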
Amanda: Good. But what about aggregation? You said before that you would like both fine-grained and coarse-grained charts. That means a lot of computation for the aggregations.