I think you can try the composite key approach, and a secondary index should also give you good read performance. Secondary indexes do add overhead on writes, though: they require a read-before-write to maintain the index, because the indexed value is flipped to become the key in the indexing ColumnFamily. It would also be a good idea to test different row cache sizes (if your load is mostly reads).
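To illustrate the read-before-write point, here is a minimal toy model in Python. It is not Cassandra's implementation: the class `ToyIndexedStore` and its fields are made up for illustration, and the only thing it shows is why changing an indexed value means looking up the old value first.

```python
from collections import defaultdict


class ToyIndexedStore:
    """Toy in-memory row store with one value-to-keys index (hypothetical)."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}                  # row key -> {column: value}
        self.index = defaultdict(set)   # indexed value -> set of row keys
        self.reads_before_write = 0

    def write(self, key, columns):
        if self.indexed_column in columns:
            # Read-before-write: fetch the old value so the stale index
            # entry can be removed before the new one is added.
            self.reads_before_write += 1
            old = self.rows.get(key, {}).get(self.indexed_column)
            if old is not None:
                self.index[old].discard(key)
            self.index[columns[self.indexed_column]].add(key)
        self.rows.setdefault(key, {}).update(columns)


store = ToyIndexedStore("group")
store.write("0,0", {"group": "0"})
store.write("0,0", {"group": "1"})   # moves the key between index entries
print(sorted(store.index["1"]), store.reads_before_write)
```

Writes that don't touch the indexed column skip the extra read in this model, which is why the overhead depends on how often the indexed value is written.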
schema question
(4 posts) (2 voices)-
Posted 10 months ago #
I have a question about the best way to store the data in my schema.
The data
I have millions of nodes, each with a different Cartesian coordinate. The keys for the nodes are hashed based on the coordinate.
My search is a proximity search. I'd like to find all the nodes within a given distance from a particular node. I can create an arbitrary grouping that groups an arbitrary number of nodes together, based on proximity…
e.g.
group 0 contains all points from (0,0) to (10,10)
group 1 contains all points from (10,0) to (20,10).
For each coordinate, I store various metadata:
8 columns: 4 UTF8Type (~20 bytes each), 4 DoubleType
The query
I need a proximity search to return all data within a range from a selected node. The typical read size is ~100 distinct rows (e.g. a 10x10 grid around the selected node). Since it's on a coordinate system, I know ahead of time exactly which 100 rows I need.
The modeling options
Option 1:
- single column family, with key being the coordinate hash
e.g.
'0,0' : { meta }
'0,1' : { meta }
…
'10, 20' : { meta }
- query for 100 rows in parallel
- I think this option sucks because it's essentially 100 non-sequential reads??
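Since the 100 rows are known ahead of time, the key list for Option 1 can be built before any query is sent. A minimal Python sketch, assuming integer coordinates and row keys of the form 'x,y' as in the example above; the function name `grid_keys` and its parameters are hypothetical:

```python
def grid_keys(x, y, width=10, height=10):
    """Return row keys for a width x height block of integer coordinates
    positioned so that (x, y) sits near the centre of the block."""
    x0 = x - width // 2
    y0 = y - height // 2
    return [f"{cx},{cy}"
            for cx in range(x0, x0 + width)
            for cy in range(y0, y0 + height)]


keys = grid_keys(5, 5)
print(len(keys), keys[0], keys[-1])
```

The resulting list is what would be handed to the client's multi-row read, split across parallel requests if needed.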
Option 2:
- column family with composite key from grouping and location
e.g.
'0:0,0': { meta }
...
'0:10,10' : { meta }
'1:10,0' : {meta}
…
'1:20, 10': {meta}
- query by the appropriate grouping
- since I can't guarantee the query won't fall near the boundary of a grouping, I'm looking at querying up to 4 different super column rows for each query
- this seems reasonable, since I'm doing bulk sequential reads, but it has some overhead in terms of pre-filtering and post-filtering
- sucks in terms of flexibility for modifying size of proximity search
Option 3:
- create a secondary index based on the grouping
e.g.
'0,0' : { meta, group='0' }
'0,1' : { meta, group='0' }
…
'10, 20' : { meta, group='1' }
- query by secondary index
- same as above, will return some extra data, and will need to do filtering
- no idea how Cassandra stores this data internally, but will the data access here be sequential?
- a little more flexible in terms of proximity search: can create multiple grouping types based on the size of the search
Questions
- I know there are pros and cons to each approach wrt flexibility of my search size, but assuming my search proximity size is fixed, which method provides the optimal performance?
- I guess the main question is will querying by secondary index be efficient enough or is it worth it to group the data into super columns?
- Is there a better way I haven't thought about to model the data?
Posted 10 months ago #
Thanks for the response. It still seems like you'd be doing 100 random reads even though they're indexed. E.g. Cassandra finds the matching keys based on the secondary index, which points to the original rows by key. Wouldn't that be roughly the same as just fetching 100 keys by index? Or does Cassandra duplicate all data for secondary indexes?
I'm just trying to understand how the secondary index would benefit this use case.
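The lookup path being asked about can be pictured with a toy model. This Python sketch assumes the index holds only row keys (no copy of the row data) and counts each dictionary access as one "read"; it says nothing about Cassandra's actual storage layout, and all names in it are hypothetical.

```python
# 200 toy rows on a 20 x 10 grid, grouped by x // 10 (two groups of 100).
rows = {f"{x},{y}": {"group": str(x // 10)}
        for x in range(20) for y in range(10)}

# Toy index: group value -> row keys. It holds keys only, not the row data.
index = {}
for key, cols in rows.items():
    index.setdefault(cols["group"], []).append(key)


def fetch_by_index(group):
    keys = index.get(group, [])                     # one index lookup
    return [rows[k] for k in keys], 1 + len(keys)   # plus one read per key


def fetch_by_keys(keys):
    return [rows[k] for k in keys], len(keys)       # one read per key


_, index_reads = fetch_by_index("0")
_, key_reads = fetch_by_keys(index["0"])
print(index_reads, key_reads)
```

Under those assumptions the index path costs the same per-row reads as fetching by key, plus the index lookup itself.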
Posted 10 months ago #
Yes, you would be doing random I/O with indexes, and they don't duplicate the data. But you can't escape random I/O even with composite keys: if there are updates to the attributes, for example, Cassandra has to merge the previous versions of the row (in the different SSTables) together to get the latest result. I suggest testing all three solutions on a dummy dataset and tuning the key/row cache; it could even be that Option #1 gives you the best results with an appropriate key/row cache size.
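The merge step mentioned above can be sketched roughly as follows. This is a simplified Python illustration, assuming each SSTable contributes a fragment of the row as column -> (value, timestamp) and that the newest timestamp wins per column; the function `merge_row` and the column names are hypothetical, and real reads involve more than this (tombstones, memtables, and so on).

```python
def merge_row(fragments):
    """Merge per-SSTable fragments of one row, keeping for each column
    the value with the highest write timestamp."""
    merged = {}
    for fragment in fragments:
        for column, (value, ts) in fragment.items():
            if column not in merged or ts > merged[column][1]:
                merged[column] = (value, ts)
    return {column: value for column, (value, _) in merged.items()}


# The same row, written at two different times and flushed to two SSTables.
sstable_a = {"name": ("node-a", 1), "weight": (1.0, 1)}
sstable_b = {"weight": (2.5, 2)}

row = merge_row([sstable_a, sstable_b])
print(row)
```

Each fragment may live in a different file, which is why a frequently updated row can cost several reads regardless of how the key is laid out.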
Posted 10 months ago #
