You're right Joaquin. That EOFException error was not critical. It actually happened during the Cassandra start, not during the crash. Jonathan Ellis also said it was harmless on the Cassandra mailing list.
However, we have tried running Repair and compaction on the node unbalanced node with no luck. Both Repair and Compaction seem to be crashing the Cassandra java process on the node. I eventually got repair to successfully complete on the node.
Then yesterday, while running compaction, I got the following error in OpsCenter in the evening:
7/27/2011 11:30pm Alert Node reported as being down 10.2.206.x
However, the system.log file for that node shows nothing but informational messages at that time:
INFO [ScheduledTasks:1] 2011-07-28 06:29:21,429 GCInspector.java (line 128) GC for ParNew: 213 ms, 147241608 reclaimed leaving 2152005360 used; max is 4030726144
INFO [ScheduledTasks:1] 2011-07-28 06:29:28,622 GCInspector.java (line 128) GC for ParNew: 219 ms, 147301832 reclaimed leaving 2188187360 used; max is 4030726144
INFO [ScheduledTasks:1] 2011-07-28 06:29:39,666 GCInspector.java (line 128) GC for ParNew: 240 ms, 148102840 reclaimed leaving 2222749344 used; max is 4030726144
After that in the log, there is a long gap till I restart Cassandra this morning. Note that the time stamp in the log is different than what OpsCenter reports maybe because of EC2 or OpsCenter time zone stuff, but they are the same time.
There was definitely something quirky with the Cassandra process on the node this morning. I couldn't run nodetool ring on it:
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
I restarted the Cassandra process this morning and now OpsCenter has it marked as up, but I think Compaction is incomplete now.
This shows me restarting the Cassandra process this morning:
INFO [ScheduledTasks:1] 2011-07-28 06:29:21,429 GCInspector.java (line 128) GC for ParNew: 213 ms, 147241608 reclaimed leaving 2152005360 used; max is 4030726144
INFO [ScheduledTasks:1] 2011-07-28 06:29:28,622 GCInspector.java (line 128) GC for ParNew: 219 ms, 147301832 reclaimed leaving 2188187360 used; max is 4030726144
INFO [ScheduledTasks:1] 2011-07-28 06:29:39,666 GCInspector.java (line 128) GC for ParNew: 240 ms, 148102840 reclaimed leaving 2222749344 used; max is 4030726144
INFO [main] 2011-07-28 17:48:28,584 AbstractCassandraDaemon.java (line 78) Logging initialized
INFO [main] 2011-07-28 17:48:28,617 AbstractCassandraDaemon.java (line 96) Heap size: 3894411264/3894411264
INFO [main] 2011-07-28 17:48:34,199 CLibrary.java (line 106) JNA mlockall successful
I also checked the Linux syslog around that time (July 28 6:29am) and it looks like the OS dumped the java process?
Jul 28 06:25:01 ip-10-2-206-127 CRON[15405]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 28 06:25:01 ip-10-2-206-127 CRON[15406]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627014] apt-get invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627020] apt-get cpuset=/ mems_allowed=0
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627025] Pid: 15420, comm: apt-get Not tainted 2.6.35-30-virtual #54-Ubuntu
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627027] Call Trace:
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627041] [<ffffffff810aefbd>] ? cpuset_print_task_mems_allowed+0x9d/0xb0
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627048] [<ffffffff81104451>] dump_header+0x81/0xc0
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627052] [<ffffffff81104511>] oom_kill_process+0x81/0x180
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627056] [<ffffffff81104a48>] __out_of_memory+0x58/0xd0
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627059] [<ffffffff81104b46>] out_of_memory+0x86/0x1c0
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627063] [<ffffffff811085ae>] __alloc_pages_slowpath+0x58e/0x5a0
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627068] [<ffffffff8110872c>] __alloc_pages_nodemask+0x16c/0x1d0
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627074] [<ffffffff8113aaaa>] alloc_pages_current+0x9a/0x100
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627078] [<ffffffff81101b57>] __page_cache_alloc+0x87/0x90
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627081] [<ffffffff8110168e>] ? find_get_page+0x1e/0x90
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627085] [<ffffffff81103113>] filemap_fault+0x1b3/0x450
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627091] [<ffffffff8111e3d4>] __do_fault+0x54/0x560
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627095] [<ffffffff81121379>] handle_mm_fault+0x1b9/0x440
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627101] [<ffffffff810072f2>] ? check_events+0x12/0x20
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627107] [<ffffffff815aaa35>] do_page_fault+0x125/0x350
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627111] [<ffffffff815a75b5>] page_fault+0x25/0x30
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627114] Mem-Info:
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627116] Node 0 DMA per-cpu:
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627119] CPU 0: hi: 0, btch: 1 usd: 0
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627122] CPU 1: hi: 0, btch: 1 usd: 0
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627124] Node 0 DMA32 per-cpu:
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627127] CPU 0: hi: 186, btch: 31 usd: 41
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627129] CPU 1: hi: 186, btch: 31 usd: 172
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627131] Node 0 Normal per-cpu:
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627134] CPU 0: hi: 186, btch: 31 usd: 0
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627136] CPU 1: hi: 186, btch: 31 usd: 30
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627141] active_anon:665806 inactive_anon:133214 isolated_anon:0
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627142] active_file:259 inactive_file:1039 isolated_file:0
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627143] unevictable:1039172 dirty:0 writeback:259 unstable:0
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627144] free:8797 slab_reclaimable:1978 slab_unreclaimable:3119
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627146] mapped:4552 shmem:49 pagetables:18095 bounce:0
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627148] Node 0 DMA free:7808kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15712kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627159] lowmem_reserve[]: 0 4024 7559 7559
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627165] Node 0 DMA32 free:22208kB min:5916kB low:7392kB high:8872kB active_anon:2135320kB inactive_anon:427136kB active_file:888kB inactive_file:3612kB unevictable:1425952kB isolated(anon):0kB isolated(file):0kB present:4120800kB mlocked:1425952kB dirty:0kB writeback:1036kB mapped:2532kB shmem:4kB slab_reclaimable:756kB slab_unreclaimable:4656kB kernel_stack:1240kB pagetables:50796kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:6976 all_unreclaimable? no
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627176] lowmem_reserve[]: 0 0 3535 3535
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627182] Node 0 Normal free:5172kB min:5196kB low:6492kB high:7792kB active_anon:527904kB inactive_anon:105720kB active_file:148kB inactive_file:544kB unevictable:2730736kB isolated(anon):0kB isolated(file):0kB present:3619840kB mlocked:2730736kB dirty:16kB writeback:0kB mapped:15676kB shmem:192kB slab_reclaimable:7156kB slab_unreclaimable:7820kB kernel_stack:2208kB pagetables:21584kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1088 all_unreclaimable? yes
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627192] lowmem_reserve[]: 0 0 0 0
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627197] Node 0 DMA: 2*4kB 1*8kB 1*16kB 1*32kB 1*64kB 2*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 1*4096kB = 7808kB
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627212] Node 0 DMA32: 672*4kB 662*8kB 501*16kB 4*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 22208kB
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627226] Node 0 Normal: 221*4kB 28*8kB 0*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 5172kB
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627257] 5640 total pagecache pages
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627260] 0 pages in swap cache
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627263] Swap cache stats: add 0, delete 0, find 0/0
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627264] Free swap = 0kB
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.627266] Total swap = 0kB
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.652382] 1966064 pages RAM
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.652387] 54111 pages reserved
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.652388] 14176 pages shared
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.652390] 1897463 pages non-shared
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.652394] Out of memory: kill process 2126 (java) score 20305 or a child
Jul 28 06:29:44 ip-10-2-206-127 kernel: [576488.652414] Killed process 2126 (java) vsz:951250956kB, anon-rss:4554448kB, file-rss:15276kB
I don't see a JVM crashlog ( hs_err_pid[pid].log) in ~/brisk/resources/cassandra/bin or /tmp.
So, I guess my question is: Why did the OS kill Java/Cassandra during compaction? Any more thoughts on why the ring would be so unbalanced?
Also, I'm starting to suspect a bug in the random partitioner in 0.8.1? Has anyone loaded a large amount of data into 0.8.1?