Heyo all. :)
First of all, OpsCenter is pretty sweet. Simple UI, and it gives me good insight into the cluster as a whole, where I would otherwise be using CLI tools and trying to aggregate stats in my head all at once... yuck. :P
I'm unfortunately having a problem with OpsCenter, though, where the agent seems to lock up an entire core on the box for hours on end... seemingly at random.
I have a four-node cluster running Cassandra 1.1 on a RightScale Ubuntu 10.04 LTS image. I had OpsCenter SSH in and install the agents automatically. At first things were working fine, then two nodes went dead. They were pegged at 50% CPU for a few hours, with CPU usage occasionally dropping back down to normal every 45 minutes or so. This lasted 7-8 hours overall, and then the boxes mysteriously went back to normal and the nodes came back UP. The CPU usage problem has shown up on all nodes so far. I tried restarting the agents, and sometimes an agent would go right back into this state and peg a core immediately after the restart.
I made sure to test the problem by running for 24 hours with the agents stopped. CPU barely went above 1% on average. As soon as I re-enabled all the agents, the boxes locked up again and came back randomly a few hours later.
One thing I noticed is that when an agent starts to lock the box up, I can "fix" the issue by bringing down the main OpsCenter instance. For example, one agent in particular started to lock up at around 12:37, and I decided to try the upgrade to 2.1.1 to see if it would remedy the issue, so I brought OpsCenter down at around 13:02. The CPU usage dropped back down to nearly nothing at around 13:04, and the agent log finally got some data again at 13:10, when a connection timed out/closed unexpectedly:
DEBUG [Thread-7] 2012-07-02 12:37:21,835 Connection shut down
ERROR [StompConnection receiver] 2012-07-02 13:10:57,019 Connection closed unexpectedly:
java.io.EOFException: reading verb
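In case it helps anyone line up the quiet periods in the agent log against their CPU graphs, here is a rough Python sketch that finds long gaps between timestamped log lines. It assumes the line format shown in the excerpt above (level, [thread], date, time with milliseconds). The names parse_ts and find_gaps and the 5-minute threshold are my own choices for illustration, nothing that ships with OpsCenter.

```python
import re
from datetime import datetime

# Matches the timestamp that follows the "[thread name]" part of a line such as:
# DEBUG [Thread-7] 2012-07-02 12:37:21,835 Connection shut down
TS_RE = re.compile(r"\] (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) ")


def parse_ts(line):
    """Return the line's timestamp as a datetime, or None if it has none."""
    m = TS_RE.search(line)
    if m is None:
        return None  # e.g. a stack trace line such as "java.io.EOFException: ..."
    base = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return base.replace(microsecond=int(m.group(2)) * 1000)


def find_gaps(lines, min_gap_seconds=300):
    """Return (start, end, duration) for each silence of at least min_gap_seconds."""
    gaps = []
    prev = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip lines without a timestamp
        if prev is not None and (ts - prev).total_seconds() >= min_gap_seconds:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps
```

Passing it the lines of an agent log (any iterable of strings, such as an open file) returns the silent stretches. On the three lines above it reports a single gap of about 33.5 minutes, from 12:37:21 to 13:10:57, which matches the window where the core was pegged.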
Any thoughts? :(