The passing of an era: Nagios

2007-05-20 19:05:00

Well, I have finally unsubscribed myself from the Nagios mailing lists. It was great being a member of those lists while I was working with the software on a daily basis, but these days I've put Nagios behind me. I haven't written one line of Nagios monitoring code for months now.

I'm sure I'll also be skipping this year's Nagios Konferenz unless a job involving monitoring comes up again.

Thanks, Ethan, for making such great software freely available! All the best to you, and maybe we'll meet again o/

TruCluster: an interesting performance problem

2007-05-11 11:24:00

For the past two weeks we've been having a rather mysterious problem with one of our TruClusters.

During hardware maintenance on the B-node we moved all cluster resources to the A-node to stay up and running. Afterwards we let TruCluster rebalance the resources across both nodes to get the performance back. So far so good: everything kept working as it should.

However, on some nights the A-node would slow to a crawl and stop responding to commands and input. We were stumped, because we simply couldn't find the cause. The system didn't look overloaded: the load average was low. The CPU breakdown was a bit odd, though, with 10% user, 50% system and the rest idle. The network wasn't overloaded and there was no traffic corruption. None of the disks were overloaded either, with just two disks seeing moderate to heavy use. It was a mystery, so we asked HP to help us out.

After some analysis they found the cause of the problem :) One of the applications that had been failed over to the A-node included two file systems. After the rebalancing, these file systems stuck with the A-node, while the application itself moved back to the B-node. So now the A-node was serving I/O to the B-node through the cluster interconnect! This also explains the high system-land CPU load, since that was the kernel serving the I/O. :D
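
For what it's worth, this kind of mismatch is easy to check for once you know to look. Below is a minimal sketch in Python, not what HP actually ran: the application names, mount points and the `find_remote_served` helper are all made up for illustration, and I'm assuming you've already collected by hand which node runs each application and which node serves each file system.

```python
# Hypothetical placement data, filled in by hand for illustration.
# Which node each application currently runs on:
app_node = {"app1": "B-node", "app2": "A-node"}
# Which file systems belong to each application:
app_filesystems = {"app1": ["/data1", "/data2"], "app2": ["/data3"]}
# Which node currently serves each file system:
fs_server = {"/data1": "A-node", "/data2": "A-node", "/data3": "A-node"}


def find_remote_served(app_node, app_filesystems, fs_server):
    """Return (app, fs, serving_node, running_node) for every file system
    that is served by a different node than the one running its application."""
    mismatches = []
    for app, filesystems in app_filesystems.items():
        for fs in filesystems:
            if fs_server[fs] != app_node[app]:
                mismatches.append((app, fs, fs_server[fs], app_node[app]))
    return mismatches


for app, fs, server, node in find_remote_served(app_node, app_filesystems, fs_server):
    print(f"{app}: {fs} is served by {server} but the application runs on {node}")
```

Every line it prints is a file system whose I/O has to cross the interconnect, which is exactly the situation we ended up in.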

We'll be moving the file systems back to the B-node as well and we'll see whether that solves the issues. It probably will :)

