Hi,
We are currently running a cluster of 2 Cassandra nodes on 2 large EC2 instances. Each node holds about 65 GB of data. (Dev and Production use the same setup.)
The issue we are having is during writes on Production: the CPU load climbs to 40+ (on a 2-core machine), and eventually the machines become unavailable and need to be rebooted. I have tried multiple tuning strategies: decreasing the total memtable space, changing the ratio of Eden space to survivor space in the young generation, promoting larger objects directly from Eden to the old generation, making compaction run more frequently, and using a smaller number of SSTables. (During the crash, disk utilization was close to zero, so I focused on relieving memory usage.)
Looking at Cassandra's system.log, I could not find any ERROR or WARN entries.
The only thing that shows up during the crash is the StatusLogger output:
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,741 StatusLogger.java (line 50) Pool Name Active Pending Blocked
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,741 StatusLogger.java (line 65) ReadStage 1 1 0
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,742 StatusLogger.java (line 65) RequestResponseStage 0 0 0
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,742 StatusLogger.java (line 65) ReadRepairStage 0 0 0
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,742 StatusLogger.java (line 65) MutationStage 1 600 0
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,743 StatusLogger.java (line 65) ReplicateOnWriteStage 0 0 0
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,743 StatusLogger.java (line 65) GossipStage 0 0 0
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,744 StatusLogger.java (line 65) AntiEntropyStage 0 0 0
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,744 StatusLogger.java (line 65) MigrationStage 0 0 0
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,744 StatusLogger.java (line 65) StreamStage 0 0 0
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,744 StatusLogger.java (line 65) MemtablePostFlusher 0 0 0
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,745 StatusLogger.java (line 65) FlushWriter 0 0 0
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,760 StatusLogger.java (line 65) MiscStage 0 0 0
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,760 StatusLogger.java (line 65) InternalResponseStage 0 0 0
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,761 StatusLogger.java (line 65) HintedHandoff 0 0 0
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,761 StatusLogger.java (line 69) CompactionManager n/a 2
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,761 StatusLogger.java (line 81) MessagingService n/a 0,45
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,761 StatusLogger.java (line 85) ColumnFamily Memtable ops,data Row cache size/cap Key cache size/cap
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,762 StatusLogger.java (line 88) system.NodeIdInfo 0,0 0/0 0/1
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,762 StatusLogger.java (line 88) system.IndexInfo 0,0 0/0 0/1
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,762 StatusLogger.java (line 88) system.LocationInfo 0,0 0/0 2/2
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,762 StatusLogger.java (line 88) system.Versions 3,103 0/0 0/2
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,763 StatusLogger.java (line 88) system.Migrations 0,0 0/0 0/3
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,763 StatusLogger.java (line 88) system.HintsColumnFamily 0,0 0/0 1/1
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,763 StatusLogger.java (line 88) system.Schema 0,0 0/0 2/3
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,763 StatusLogger.java (line 88) test.popular_neighbors 0,0 0/0 105549/200000
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,764 StatusLogger.java (line 88) test.popular_neighbors_root 0,0 0/0 0/200000
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,764 StatusLogger.java (line 88) upp.topcat 113,73 0/0 31472/200000
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,764 StatusLogger.java (line 88) upp.fulllisting 295447,74865591 0/0 101034/200000
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,764 StatusLogger.java (line 88) collector.seo_tags 0,0 0/0 1104/200000
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,765 StatusLogger.java (line 88) collector.seo_tags_full_ids 0,0 0/0 8/200000
INFO [ScheduledTasks:1] 2012-03-08 08:21:21,765 StatusLogger.java (line 88) collector.seo_tags_full 0,0 0/0 14/200000
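To make dumps like the one above easier to scan, a small script can pull out the thread pools that have a backlog. This is only a minimal sketch and not part of Cassandra: the function name `backlogged_pools` and the threshold of 100 are assumptions chosen for illustration, and the regular expression assumes nothing beyond the row layout shown in the log above (pool name followed by Active, Pending, Blocked).

```python
import re

# Matches StatusLogger thread-pool rows such as:
#   "... StatusLogger.java (line 65) MutationStage 1 600 0"
# Header rows, "n/a" rows, and column family rows do not match,
# because they lack three whitespace-separated integers after the name.
POOL_ROW = re.compile(
    r"StatusLogger\.java \(line \d+\)\s+(?P<pool>\S+)\s+"
    r"(?P<active>\d+)\s+(?P<pending>\d+)\s+(?P<blocked>\d+)\s*$"
)


def backlogged_pools(lines, pending_threshold=100):
    """Return {pool_name: pending} for pools above the (assumed) threshold."""
    result = {}
    for line in lines:
        match = POOL_ROW.search(line)
        if match and int(match.group("pending")) > pending_threshold:
            result[match.group("pool")] = int(match.group("pending"))
    return result


# A few rows copied from the log above.
sample = [
    "INFO [ScheduledTasks:1] 2012-03-08 08:21:21,741 StatusLogger.java (line 50) Pool Name Active Pending Blocked",
    "INFO [ScheduledTasks:1] 2012-03-08 08:21:21,741 StatusLogger.java (line 65) ReadStage 1 1 0",
    "INFO [ScheduledTasks:1] 2012-03-08 08:21:21,742 StatusLogger.java (line 65) MutationStage 1 600 0",
    "INFO [ScheduledTasks:1] 2012-03-08 08:21:21,745 StatusLogger.java (line 65) FlushWriter 0 0 0",
    "INFO [ScheduledTasks:1] 2012-03-08 08:21:21,761 StatusLogger.java (line 69) CompactionManager n/a 2",
]

print(backlogged_pools(sample))
```

On these sample rows the script reports only MutationStage, with 600 pending mutations, which is the write backlog visible in the log.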
If I run the same writes on the Dev cluster, everything runs smoothly: no errors, and the load peaks at 1.5-2.
If I reboot the 2 production instances, I can run reads and writes for a while before the crash occurs. If the machines have been up for 2+ days, however, the crash occurs within minutes.
Any suggestions or ideas would be highly appreciated.
Thanks
