I have a Cassandra installation with 3 nodes + OpsCenter.
I loaded about 800k rows, each containing a single value of about 2.6 KB.
The content was loaded using a map-only Hadoop job; the input file was 2 GB.
I started with an empty Cassandra cluster (except for the OpsCenter data, about 50 MB).
After the import, each Cassandra node has 20 to 30 GB of disk space in use.
I tried compacting multiple times, draining, shutting down, restarting, compacting again, repairing, etc.
Usage is still high, far above what is needed to store the data. There are no deleted
rows, the commit logs are drained, and everything is compacted.
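
To put rough numbers on this, here is a minimal Python sketch of the back-of-the-envelope calculation. The row count, row size, and observed usage come from the figures above; the replication factor of 3 is an assumption (I have not stated it here), chosen as the worst case in which every node holds a copy of every row. It ignores per-row and per-column storage overhead, so it is only a lower bound on what the nodes should need.

```python
# Rough estimate of the expected on-disk payload versus the observed usage.
ROWS = 800_000                    # about 800k rows loaded
ROW_SIZE_KB = 2.6                 # about 2.6 KB per row
NODES = 3
REPLICATION_FACTOR = 3            # ASSUMPTION: worst case, every node stores every row
OBSERVED_GB_PER_NODE = (20, 30)   # observed range of used space per node


def raw_payload_gb(rows: int, row_size_kb: float) -> float:
    """Total payload in GB (decimal units: 1 GB = 1,000,000 KB)."""
    return rows * row_size_kb / 1_000_000


def expected_gb_per_node(raw_gb: float, replication_factor: int, nodes: int) -> float:
    """Payload each node should hold if the data is spread evenly."""
    return raw_gb * replication_factor / nodes


def overhead_factor(observed_gb: float, expected_gb: float) -> float:
    """How many times larger the observed usage is than the expected payload."""
    return observed_gb / expected_gb


raw = raw_payload_gb(ROWS, ROW_SIZE_KB)
per_node = expected_gb_per_node(raw, REPLICATION_FACTOR, NODES)

print(f"raw payload:       {raw:.2f} GB")
print(f"expected per node: {per_node:.2f} GB")
for observed in OBSERVED_GB_PER_NODE:
    factor = overhead_factor(observed, per_node)
    print(f"observed {observed} GB is {factor:.1f}x the expected payload")
```

Under these assumptions the raw payload is about 2.08 GB, which matches the 2 GB input file, and the nodes are using roughly 10 to 14 times that amount.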
Why on earth does Cassandra require so much disk space?
(I have to store at most 10 GB of real data, yet I now have to allocate 100 GB to hold 2-3 GB and still have free space for compactions, repairs, etc.)
Furthermore, why is one node so much bigger than the others? I use the random partitioner.
http://awesomescreenshot.com/05cd8igb1
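
One thing I could check myself is whether the node tokens are evenly spaced. Below is a minimal Python sketch, assuming the random partitioner's token space of 0 to 2**127, that computes evenly spaced tokens for a given node count and the fraction of the ring each token owns. The function names are my own, and the real tokens would have to be read from the cluster (for example from `nodetool ring`). It only looks at primary ownership; replicas add to each node's load depending on the replication factor.

```python
# Token balance check for the random partitioner (token space 0 .. 2**127).
TOKEN_SPACE = 2 ** 127


def balanced_tokens(node_count: int) -> list[int]:
    """Evenly spaced tokens: node i gets i * 2**127 / node_count."""
    return [i * TOKEN_SPACE // node_count for i in range(node_count)]


def ownership(tokens: list[int]) -> dict[int, float]:
    """Fraction of the ring owned by each token.

    A token owns the range from the previous token (exclusive) up to
    itself (inclusive), wrapping around the ring for the lowest token.
    """
    ordered = sorted(tokens)
    if len(ordered) == 1:
        return {ordered[0]: 1.0}
    shares = {}
    for idx, token in enumerate(ordered):
        previous = ordered[idx - 1]  # idx 0 wraps around to the highest token
        span = (token - previous) % TOKEN_SPACE
        shares[token] = span / TOKEN_SPACE
    return shares


if __name__ == "__main__":
    for token, share in ownership(balanced_tokens(3)).items():
        print(f"token {token}: owns {share:.1%} of the ring")
```

If the actual tokens give shares far from one third each, that alone could explain part of the size difference between the nodes.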
Is there a way to verify that the 30 GB on the biggest node also includes data
arriving from the other peers, and that the data belonging to this biggest node is actually replicated to them? I have the feeling that one node gets more data and doesn't replicate it. (My payloads are quite homogeneous.)
Any help is appreciated!
Marko
