Hello:
I recently read a couple of articles on Cassandra and time series data:
http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
I'm already recording the activity performed with a given OAuth token, using the token as the row key. The token lifetime is 24 hours, so I don't think any one row will get too long; thousands of columns, perhaps. But I also want to track what a given client (source IP address) does. What I'm doing is closer to the "Index Column Family" approach in the advanced time series article. This data will not be needed in real time, so I don't want to denormalize it all again; recording which tokens the client used lets me process the actions of each token later. There is an anti-hijacking feature in the tokens, so the relationship of client to tokens is one-to-many.
Unlike a token, however, an IP address does not expire, so the number of columns will keep growing without bound, making for some very long rows. I think 'sharding' by day is sensible, so I want the client-tracking row key to consist of IpAddress-day, for example 1.2.3.4-20120627. I'm doing this in Java, and I can use Date and SimpleDateFormat to get the day suffix.
This is running in a Tomcat app server, and the SimpleDateFormat javadocs say instances are not thread-safe. So I could use a ThreadLocal, use synchronization, or construct a new instance each time. As the number of clients grows, I expect this to be pretty high volume, and I wonder about the overhead of constructing Date objects and using SimpleDateFormat. So, my thought is to do this:
public static final long CLIENT_METRICS_GRANULARITY_MS = 24 * 60 * 60 * 1000; // One day in ms
...
final long interval = System.currentTimeMillis() / CLIENT_METRICS_GRANULARITY_MS;
final String rowKey = String.format("%s-%d", clientAddress, interval);
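To make the comparison concrete, here is a minimal Java sketch of both key styles side by side: the interval key from the snippet above, and the formatted-date key built with a per-thread SimpleDateFormat. The class name ClientRowKeys, the method names, and the choice of UTC for the formatter are assumptions made for illustration, not part of the actual application.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

final class ClientRowKeys {
    // One day in ms; the L suffix keeps the arithmetic in long if the granularity is ever enlarged.
    static final long CLIENT_METRICS_GRANULARITY_MS = 24L * 60 * 60 * 1000;

    // SimpleDateFormat is not thread-safe, so each thread gets its own instance.
    private static final ThreadLocal<SimpleDateFormat> DAY_FORMAT =
        new ThreadLocal<SimpleDateFormat>() {
            @Override
            protected SimpleDateFormat initialValue() {
                SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd");
                // Assumption: UTC, so both key styles roll over at the same instant.
                format.setTimeZone(TimeZone.getTimeZone("UTC"));
                return format;
            }
        };

    private ClientRowKeys() {}

    /** Interval-style key, e.g. 1.2.3.4-15518 (whole days since the Unix epoch). */
    static String intervalKey(String clientAddress, long nowMillis) {
        return clientAddress + "-" + (nowMillis / CLIENT_METRICS_GRANULARITY_MS);
    }

    /** Formatted-date key, e.g. 1.2.3.4-20120627, shown for comparison. */
    static String formattedKey(String clientAddress, long nowMillis) {
        return clientAddress + "-" + DAY_FORMAT.get().format(new Date(nowMillis));
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(intervalKey("1.2.3.4", now));
        System.out.println(formattedKey("1.2.3.4", now));
    }
}
```

The interval key needs only a division and a string concatenation, while the formatted key allocates a Date per call and keeps one formatter per thread. Plain concatenation is used here instead of String.format because String.format parses its pattern on every call; whether that matters at this volume would need measuring.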
My time series CF is defined as:
CREATE COLUMN FAMILY ClientMetrics
WITH comparator=TimeUUIDType
AND key_validation_class=UTF8Type
AND default_validation_class=UTF8Type;
Example content (test data for today) looks like this:
[default@IsecMetrics] list ClientMetrics;
Using default limit of 100
-------------------
RowKey: 1.2.3.4-15518
=> (column=a5281d10-c0a2-11e1-8239-c8bcc88a2db5, value={"category":"security","name":"bindToken","params":{"tokenValue":"269e16d1-9b1f-48ac-8431-6fa54f42e2e0"}}, timestamp=1340834058593000)
=> (column=aa6c3a40-c0a2-11e1-8239-c8bcc88a2db5, value={"category":"security","name":"bindToken","params":{"tokenValue":"eedf7da3-3a99-4746-a27d-2e82afcf21bc"}}, timestamp=1340834067428000)
=> (column=bb06de50-c0a2-11e1-8239-c8bcc88a2db5, value={"category":"security","name":"blacklistedClientAddress","params":{"tokenValue":"f2ca9749-1391-4cd8-9b82-e15acbc91d07","duration":"120"}}, timestamp=1340834095285001)
=> (column=d76040f0-c0a2-11e1-8239-c8bcc88a2db5, value={"category":"security","name":"blacklistedBindingAttempt","params":{"tokenValue":"ceeb67a8-6a94-4c2b-8e09-6a899c864689"}}, timestamp=1340834142847000)
=> (column=6d8d3b00-c0a3-11e1-8239-c8bcc88a2db5, value={"category":"security","name":"blacklistedClientAddress","params":{"tokenValue":"c5fd3021-f431-405c-b8e4-7fd1c0bcaedd","duration":"0"}}, timestamp=1340834394800000)
1 Row Returned.
Elapsed time: 2 msec(s).
[default@IsecMetrics]
So today is 15518, yesterday was 15517, tomorrow will be 15519, and so on. Later, when performing analytics, I can reconstitute the date from the interval. E.g.:
long granularity = 24 * 60 * 60 * 1000L  // one day in ms
long interval = 15518L
def theDate = new Date(interval * granularity)  // midnight UTC at the start of that day
println "interval date: ${theDate}"  // printed in the JVM's default time zone
And that seems to work, once the time zone is taken into account: the buckets roll over at midnight UTC, not local midnight. I don't think there's much overhead in System.currentTimeMillis(), and if I constructed a Date when recording the event I'd be incurring that call anyway, since new Date() calls it internally. There are also no SimpleDateFormat instances to deal with.
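Because the interval counts whole days since the Unix epoch, each bucket starts at midnight UTC regardless of the server's default time zone. The following is a minimal Java sketch of the analytics side, mapping an interval back to a UTC date and listing the row keys that cover a date range. It assumes Java 8 or later for java.time, and the class and method names (ClientMetricsIntervals, startDate, intervalOf, rowKeys) are hypothetical.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;

final class ClientMetricsIntervals {
    static final long GRANULARITY_MS = 24L * 60 * 60 * 1000; // one day in ms

    private ClientMetricsIntervals() {}

    /** UTC calendar date on which the given interval starts. */
    static LocalDate startDate(long interval) {
        return Instant.ofEpochMilli(interval * GRANULARITY_MS)
                .atZone(ZoneOffset.UTC)
                .toLocalDate();
    }

    /** Interval number that contains midnight UTC at the start of the given date. */
    static long intervalOf(LocalDate date) {
        long millis = date.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
        return millis / GRANULARITY_MS;
    }

    /** Row keys for one client covering every interval from 'from' to 'to', inclusive. */
    static List<String> rowKeys(String clientAddress, LocalDate from, LocalDate to) {
        List<String> keys = new ArrayList<>();
        long last = intervalOf(to);
        for (long i = intervalOf(from); i <= last; i++) {
            keys.add(clientAddress + "-" + i);
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(startDate(15518L)); // 2012-06-27
        System.out.println(rowKeys("1.2.3.4",
                LocalDate.of(2012, 6, 26), LocalDate.of(2012, 6, 28)));
    }
}
```

Working in UTC on both the write and read side keeps the mapping unambiguous; a report that wants local-time days would have to read two adjacent buckets and filter by column timestamp.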
Does this make sense, or is there something that's going to bite me later?
Thanks,
Jeff
