We are wondering how to model our column families (CFs) to store the number of unique visitors in a time period so that we can query it quickly.
We thought of sharding them by day (row = 20120118, column = visitor_id, value = '') and performing a getcount. This would work to get unique visitors per day, per week, or per month, but it wouldn't work if I want unique visitors between two specific dates, because two rows can share the same visitors (same columns). For example, I can have 1500 unique visitors today and 1000 unique visitors yesterday, but only 2000 unique visitors when aggregating the two days, so summing the per-row counts overcounts.
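To make the overlap concrete, here is a minimal sketch using the numbers above. It is plain Python with no Cassandra client involved: the `rows` dict and the `get_count` helper are hypothetical stand-ins for the CF and a per-row column count, and the visitor ids are made up.

```python
# Hypothetical in-memory stand-in for the CF:
# row key (day) -> set of column names (visitor ids).
rows = {
    "20120117": {f"visitor_{i}" for i in range(0, 1000)},    # yesterday: 1000 visitors
    "20120118": {f"visitor_{i}" for i in range(500, 2000)},  # today: 1500 visitors
}


def get_count(row_key):
    """Mimic a per-row column count (one column per visitor)."""
    return len(rows[row_key])


# Summing per-day counts counts returning visitors twice.
summed = get_count("20120117") + get_count("20120118")

# The number we actually want is the size of the union of both rows.
distinct = len(rows["20120117"] | rows["20120118"])

# Visitors present in both rows explain the difference.
overlap = len(rows["20120117"] & rows["20120118"])

print(summed, distinct, overlap)  # 2500 2000 500
```

The per-row counts alone cannot tell us the overlap, which is why the day-sharded layout does not answer arbitrary date ranges.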
I could fetch all the columns for these two rows and deduplicate them in client code, but performance won't be good with large data sets, and we might run into out-of-memory (OOM) issues.
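For reference, the client-side approach would look roughly like the sketch below. It assumes the columns of a row can be read page by page; `FAKE_CF`, `fetch_page`, `iter_columns`, and `count_unique` are hypothetical names standing in for a real client, not an actual Cassandra API.

```python
from bisect import bisect_right

# Hypothetical stand-in for the CF: row key -> sorted list of column names.
FAKE_CF = {
    "20120117": sorted(f"v{i:05d}" for i in range(0, 1000)),
    "20120118": sorted(f"v{i:05d}" for i in range(500, 2000)),
}


def fetch_page(row_key, start_after, limit):
    """Hypothetical paged read: up to `limit` column names after `start_after`."""
    cols = FAKE_CF[row_key]
    begin = bisect_right(cols, start_after) if start_after is not None else 0
    return cols[begin:begin + limit]


def iter_columns(row_key, page_size=100):
    """Yield every column name of a row, one page at a time."""
    last = None
    while True:
        page = fetch_page(row_key, last, page_size)
        if not page:
            return
        yield from page
        last = page[-1]


def count_unique(row_keys, page_size=100):
    """Count distinct visitors across rows by collecting every id in a set."""
    seen = set()
    for key in row_keys:
        seen.update(iter_columns(key, page_size))
    return len(seen)


unique_visitors = count_unique(["20120117", "20120118"])
print(unique_visitors)  # 2000
```

Paging keeps each read small, but the `seen` set still holds every distinct visitor id for the whole date range, and every column has to travel to the client, which is exactly the cost we are worried about.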
What is the recommended way to achieve this? We are open to changing our CFs and the rest of the schema to make it happen.