Today a colleague asked me: how can I find out how many keys I just inserted into Cassandra? You’d expect any half-decent database to be able to tell you how much stuff it has got. Cassandra being (somewhat better than) half-decent can, but there are many subtleties that are worth understanding.
Firstly, a clarification on what counting keys actually means. Keys in Cassandra parlance mean rows, so we’re counting the number of rows in a column family. However, there is not actually a distinct row object in Cassandra; rows are just containers for columns. So empty rows don’t exist (caveat: see row deletes later); an empty row is the same as a row that never existed so cannot contribute.
OK, let’s count. In CQL, you can use
select count(*) from cf
However, there is a default limit of 10,000 applied to this statement which will truncate the result for larger column families. The limit can be increased for larger column families:
select count(*) from cf limit 1000000
for example. The thrift equivalent involves some code, but can be done quite simply in pycassa:
import pycassa from pycassa.pool import ConnectionPool from pycassa.columnfamily import ColumnFamily pool = ConnectionPool('ks') col_fam = pycassa.ColumnFamily(pool, 'cf') result = col_fam.get_range() count = sum(1 for _ in result)
The CQL code shows nothing strange; in an SQL DB you would expect it to return quickly and this query looks the same. But why do I need to specify a limit? The pycassa code gives a hint why.
Cassandra doesn’t maintain such counts, unlike the indexing structures used in relational databases. It’s not that someone didn’t bother to implement it, such a count would violate many of Cassandra’s principles.
A key design of Cassandra is that it is write optimized. All of Cassandra’s basic data structures allow writes with zero reads. Cassandra doesn’t check if there are old columns or slot the new column in its final resting place next to the others. It just accepts it into an in memory structure called a memtable, which when large enough gets flushed to disk. This ensures all writes are sequential on disk, which is the primary reason why Cassandra gets such good write performance.
However, what this means is Cassandra doesn’t know if a new column is in a row not yet seen. Or if a delete removes the last column from a row. If it did, it would have had to read to see what data exists which is not allowed.
Actually, Cassandra knows a little bit about this. If a row already exists in a memtable, Cassandra knows this for free. This means when a memtable is flushed to disk (becoming an SSTable), Cassandra knows how many rows there are in it. However, across SSTables, Cassandra doesn’t know if the rows are disjoint or entirely overlapping.
This means the only way to find out is to read through the rows. This is exactly what the pycassa example is doing: read through all the rows. And under the hood this is what the ‘select count’ query does too (with large enough limit). I
f you have 10 TB of data per node in a single column family, your innocuos one line query will scan through 10 TB of data on all nodes just to give you one number!
I don’t care about the exact number, can I have a ballpark estimate?
Because Cassandra knows how many rows there are in each SSTable it is possible to get an estimate. The ‘nodetool cfstats’ output tells you these counts in the ‘Number of Keys (estimate)’ line. This is the sum of rows in each SStable (again approximate due to the indexing used but can’t be off by more than 128 by default).
This will often be fairly accurate but it depends on your workload. For write once workloads with unique row keys it will be close (it doesn’t include memtables). But for workloads with lots of overwrites, or writes to the same row a long time apart, it could be significantly different from the true value.
Because this estimate is the sum across all SSTables, It can’t be more than a factor of your SSTable count (also printed by cfstats) too high. If you are testing you could do a full compaction (‘nodetool compact’) to get all your data in one SSTable to get a more accurate count.
Can I maintain a count myself?
If you only ever wrote to a row once, you could use a Cassandra counter that you increment on each new row insertion. But if rows were written to more than once you would have the same problem as Cassandra itself has: is this the first write or not? And maintaining this would be hard: a single error throws the count off forever.
You could also maintain a row that you insert all row keys to. This will have as many columns as you have total rows but no data. Counting the number of rows is reduced to counting the number of columns in this row. However, now every insert requires an insert to this row, and the row isn’t scalable. You could come up with complex solutions to fix this but it will likely become a maintenance problem that is not worth the effort.
Because deletes are actually insertions of tombstones (see explanation why), rows that only have deleted columns will show up in the above cfstats counts. Only when the tombstones have been garbage collected will they be removed from the count. As a further complication, you can also delete whole rows. This inserts a special row tombstone object. This can result in empty rows being returned in queries messing up counts. By default, pycassa filters these out (you can get them back by setting ‘filter_empty=False’ in the get_range call). Also the CQL count query doesn’t count them.
What about consistency?
In a replicated setting, even if each Cassandra node knew exactly how many unique rows it had seen, it wouldn’t necessarily know the full count. Maybe each node has missed a row, but collectively they have all the rows. This further complicates any solution to finding accurate counts.
Separately, the count returned can be inconsistent in the sense that it might disagree with other counts obtained at the same time. This will happen if you are changing your data while counting: new rows may or may not get counted by ongoing count operations.
Why was I counting anyway?
We’ve seen above just how hard it is to get a count. I can’t think of many important use cases where the count is required though. You may want it for testing or auditing, but it is unlikely to be required by your application. If you think you need it, look to see if it really is required. You don’t want to be initiating a distributed table scan regularly just to find a count that will be out of date by the time you read it.