When running a Hadoop job that uses a Cassandra column family as input, how does the read consistency level affect the range scan used to fetch the input? I.e., if we want to guarantee that a Hadoop job runs on the most up-to-date data in a column family, is it better to write at a consistency level of "ALL" so the job can read at "ONE", or to both read and write at a consistency level of "QUORUM"?
ColumnFamilyInputFormat - read and write consistency levels
This is best summed up by the following: "Note that if W + R > ReplicationFactor, where W is the number of nodes to block for on write, and R the number to block for on reads, you will have strongly consistent behavior"
ALL should only be used in situations where high availability is not required.
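The rule in the quote above can be sketched in a few lines of arithmetic. Note that the class and method names below are purely illustrative, not Cassandra API; only the replica counts for ONE, QUORUM, and ALL follow Cassandra's definitions.

```java
// Sketch: checks whether a read/write consistency pairing yields
// strongly consistent reads under the "W + R > ReplicationFactor" rule.
public class ConsistencyCheck {
    // Replicas that must acknowledge at each consistency level, given RF.
    static int replicas(String level, int rf) {
        switch (level) {
            case "ONE":    return 1;
            case "QUORUM": return rf / 2 + 1;
            case "ALL":    return rf;
            default: throw new IllegalArgumentException(level);
        }
    }

    static boolean stronglyConsistent(String write, String read, int rf) {
        return replicas(write, rf) + replicas(read, rf) > rf;
    }

    public static void main(String[] args) {
        int rf = 3;
        // Both options from the question guarantee up-to-date reads:
        System.out.println(stronglyConsistent("ALL", "ONE", rf));       // true  (3 + 1 > 3)
        System.out.println(stronglyConsistent("QUORUM", "QUORUM", rf)); // true  (2 + 2 > 3)
        // Whereas writing and reading at ONE does not:
        System.out.println(stronglyConsistent("ONE", "ONE", rf));       // false (1 + 1 <= 3)
    }
}
```

Both pairings from the original question are therefore strongly consistent; QUORUM/QUORUM is usually preferred because a write at ALL fails if any single replica is down.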
When we run our Hadoop job at read consistency "QUORUM" we get a timeout in the ColumnFamilyInputFormat. Our rows really aren't that big (a few dozen columns at a few KB each). We have a 4-node cluster running Cassandra 1.0.5. We have tried raising the timeout to a minute, as well as decreasing the range batch size, but we still end up with the same timeout error.
Is there something we can check or try and/or change to keep from timing out?
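For reference, the two knobs mentioned above live in different places: the request timeout is a server-side setting in cassandra.yaml on each node, while the range batch size is a Hadoop job property (settable in code via `ConfigHelper.setRangeBatchSize`). The values below are simply the ones tried in this thread; the property names are as used in the Cassandra 1.0 series.

```
# cassandra.yaml on each node -- Thrift request timeout (one minute here):
rpc_timeout_in_ms: 60000

# Hadoop job configuration -- rows fetched per get_range_slices call
# (equivalent to ConfigHelper.setRangeBatchSize(conf, 32)):
cassandra.range.batch.size=32
```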
What is the replication factor and range batch size? Also, do you have any other metrics for performance of individual nodes? Like what is the OpsCenter output for the nodes when the timeouts occur?
The replication factor is set to 3 and we have dropped the range batch size as low as 32. What metrics would you like, and is there a way to print a report of them?
When we run the job reading at consistency level "QUORUM", the cluster's read latency jumps to over a minute and disk utilization hits 100%.
When we run the same job against a keyspace with a replication factor of 1 at CL "ONE", read latency peaks at 2 or 3 seconds and disk utilization doesn't get above 55%.
Are there settings we can change to help bring down the disk utilization on replicated keyspaces?
What version of Cassandra is this? This behavior certainly does not sound correct.
We are currently looking into this and we'll update here as soon as we find something. Thanks for the extra details and patience.
OK, it looks like you are being hit by https://issues.apache.org/jira/browse/CASSANDRA-3551
If you would like to apply the patch to the source and give it a try, that would actually be super valuable as an external test of the fix.
We tried the patch yesterday and were able to run the Hadoop job reading at CL QUORUM. Thanks for the help.