We're trying to run truncate on a columnfamily in a cluster running cassandra 1.1.4 but getting an error, "Unable to complete request: one or more nodes were unavailable".
The error *appears* to be spurious, in that the truncate completes, but I don't like what I'm seeing, as the same command works fine on a test cluster without the error (also a 1.1.4 cluster). The prod cluster was initially started on cassandra 1.0.x; and we had two hard node failures that we had to force remove the tokens on (replication factor = 3; so no data loss).
cqlsh:app> truncate b_projectstatsbucket;
Unable to complete request: one or more nodes were unavailable.
But the command actually works:
cqlsh:app> select * from b_projectstatsbucket;
cqlsh:app>
Inserting a row and checking it again:
cqlsh:app> insert into b_projectstatsbucket (uuid, active) values ('5999c46f-ad38-43f2-899c-5c6db97e470b', 'True');
cqlsh:app> select * from b_projectstatsbucket;
uuid | active
--------------------------------------+-------------+
5999c46f-ad38-43f2-899c-5c6db97e470b | True |
cqlsh:app> truncate b_projectstatsbucket;
Unable to complete request: one or more nodes were unavailable.
cqlsh:app> select * from b_projectstatsbucket;
cqlsh:app>
The cassandra logs show these entries when this command is run:
INFO 2012-12-355 16:20:13,612 Cannot perform truncate, some hosts are down
INFO [Thrift:16] 2012-12-20 16:20:13,612 StorageProxy.java (line 1192) Cannot perform truncate, some hosts are down
However, no nodes are down:
nodetool ring
Note: Ownership information does not include topology, please specify a keyspace.
Address DC Rack Status State Load Owns Token
141784319550391026443072753096570088106
10.x.x.1 xxx b Up Normal xxx GB 16.67% 1
10.x.x.2 xxx a Up Normal xxx GB 16.67% 28356863910078205288614550619314017621
10.x.x.3 xxx b Up Normal xxx GB 16.67% 56713727820156410577229101238628035242
10.x.x.4 xxx a Up Normal xxx GB 16.67% 85070591730234615865843651857942052864
10.x.x.5 xxx b Up Normal xxx GB 16.67% 113427455640312821154458202477256070485
10.x.x.6 xxx a Up Normal xxx GB 16.67% 141784319550391026443072753096570088106
And gossip info shows nodes in NORMAL or LEFT state (I ran unsafeAssassinateEndpoint to get a number of "removed" nodes into LEFT state -- nodes that we had to force-remove the token due to failed cluster migration stuff -- similar to https://issues.apache.org/jira/browse/CASSANDRA-3876 -- but restarting nodes did not clear it out. Was this ok?)
nodetool gossipinfo | grep STATUS
STATUS:LEFT,10876113728672349282005072061992923682,1356220177260
STATUS:NORMAL,1
STATUS:NORMAL,113427455640312821154458202477256070485
STATUS:NORMAL,141784319550391026443072753096570088106
STATUS:LEFT,10876113728672349282005072061992923682,1356220258126
STATUS:NORMAL,56713727820156410577229101238628035242
STATUS:NORMAL,28356863910078205288614550619314017621
STATUS:LEFT,10876113728672349282005072061992923682,1356219977587
STATUS:LEFT,10876113728672349282005072061992923682,1356218804433
STATUS:LEFT,10878842786151078770452885825817290206,1356280367454
STATUS:NORMAL,85070591730234615865843651857942052864
(I'm a little confused why the token position on all the LEFT nodes is 108761137...; they were at different positions when in the "removed" state. Also, after running the unsafeAssassinateEndpoint command yesterday on the "removed" nodes, they all showed as LEFT, but today, one of them went back to "removed" -- wondering if there's something not correctly garbage-collected here?)
Checking via mx4j:
http://<nodeip>:<port>/getattribute?objectname=org.apache.cassandra.db:type=StorageService&attribute=UnreachableNodes&format=collection&template=viewcollection
gives an empty list (fwiw; it does not give an empty list on a test cluster where we removed a node but didn't assassinate the endpoint -- but in that test cluster, we didn't force remove a token / failed node -- and in the test cluster the truncate works -- so I may be chasing down the wrong path here. Note: I did not assassinate any nodes before seeing this issue; the first assassination was done while trying to fix this problem.)
Looking at line 1192 of StorageProxy.java for the 1.1.4 src, it's calling isAnyHostDown(); which is defined as:
return !Gossiper.instance.getUnreachableMembers().isEmpty();
which is defined as:
return unreachableEndpoints.keySet();
Calling StorageService.UnreachableNodes calls Gossiper.instance.getUnreachableMembers, so I'm definitely confused by what I'm seeing: the error in the logs is only reachable when the list isn't empty; but JMX is showing the list as empty.
Is it possible that the cause of the error is not, in fact, unreachable members, but an InvalidRequestException, for whatever reason? Or something else?
Any clues at this point would be greatly appreciated.
