The symptom is that the repair/compaction/stream information on the cluster views gets "stuck". The percentages no longer move, and existing repairs do not go away in OpsCenter even though the Cassandra node is no longer repairing or compacting. No new information shows up in OpsCenter for nodes that start repairs. Basically, that part of the agent appears to stall indefinitely, while the OS stats and basic ring information still work.
The only ERROR lines I see in the agent log are from the initial configuration. They appear on every restart and seem to come from the auto-discovery of the Thrift port. The agent does connect to JMX on localhost.
ERROR [Initialization] 2012-12-17 12:24:01,871 MARK HOST AS DOWN TRIGGERED for host 10.1.1.43(10.1.1.43):9160
ERROR [Initialization] 2012-12-17 12:24:01,872 Pool state on shutdown: <ConcurrentCassandraClientPoolByHost>:{10.1.1.43(10.1.1.43):9160}; IsActive?: true; Active: 0; Blocked: 0; Idle: 0; NumBeforeExhausted: 1
ERROR [Initialization] 2012-12-17 12:24:01,878 Error when performing thrift operation: #<HectorException me.prettyprint.hector.api.exceptions.HectorException: All host pools marked down. Retry burden pushed out to client.>
ERROR [Thread-5] 2012-12-17 12:24:01,879 Unable to connect to Cassandra #<HectorException me.prettyprint.hector.api.exceptions.HectorException: All host pools marked down. Retry burden pushed out to client.>
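To rule out a plain network or bind-address problem behind the "All host pools marked down" errors, a quick TCP check against the Thrift port can help. This is only a rough sketch: `can_connect` is a hypothetical helper, not part of OpsCenter or Hector, and the address and port are simply the ones from the log lines above. A successful connect only shows that something is listening there, not that Thrift requests will succeed.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and opens a TCP socket;
        # the context manager closes it again immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Address and port taken from the agent log above; adjust for your node.
    print(can_connect("10.1.1.43", 9160))
```

Running it on the node itself against both the listed address and localhost would show whether the port is only reachable on one interface.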
I turned logging up to debug level on our test cluster (we are seeing the same situation there, with far fewer CFs). I see it collecting metrics on a regular basis, but then at random it spams the following, multiple times a second. Debug-level logging did not provide any additional information about the cause. I can send full logs if you are interested.
WARN [Thread-2] 2012-12-17 13:16:13,062 Thrift operation queue is full, discarding thrift operation
WARN [Thread-2] 2012-12-17 13:16:13,062 271315 operations dropped so far.
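To put a number on how fast operations are being discarded, the "operations dropped so far" lines can be pulled out of the agent log and turned into a rate. The following is a minimal sketch that assumes the log format is exactly as in the two WARN lines above; `dropped_samples` and `drop_rate` are hypothetical helper names, and the log file path is passed on the command line rather than assumed.

```python
import re
import sys
from datetime import datetime

# Matches e.g.:
# WARN [Thread-2] 2012-12-17 13:16:13,062 271315 operations dropped so far.
DROPPED_RE = re.compile(
    r"^\s*WARN\s+\[[^\]]+\]\s+"
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(\d+) operations dropped so far\."
)


def dropped_samples(lines):
    """Return a list of (timestamp, cumulative_dropped) from log lines."""
    samples = []
    for line in lines:
        match = DROPPED_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            samples.append((stamp, int(match.group(2))))
    return samples


def drop_rate(samples):
    """Average dropped operations per second between first and last sample.

    Returns None when there are fewer than two samples or no time elapsed.
    """
    if len(samples) < 2:
        return None
    (first_time, first_count), (last_time, last_count) = samples[0], samples[-1]
    elapsed = (last_time - first_time).total_seconds()
    if elapsed <= 0:
        return None
    return (last_count - first_count) / elapsed


if __name__ == "__main__":
    # Usage: python dropped.py <path-to-agent-log>
    with open(sys.argv[1]) as handle:
        found = dropped_samples(handle)
    print("samples:", len(found), "drops/sec:", drop_rate(found))
```

Comparing the timestamp of the first such line with the moment the repair/compaction view stops updating would show whether the queue filling up coincides with the stall.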
The issue doesn't appear to depend on the number of metrics: the "stall" happens in our test cluster with roughly 100 CFs, and I cranked down the metrics in production (using ignore_keyspaces in the OpsCenter server config), but the issue still exists.