Kilala.nl - Personal website of Tess Sluijter


Installing CentOS Linux as default OS on a Macbook

2013-08-12 16:46:00

While preparing for my RHCSA exams, I was in dire need of a Linux playground. At first I could make do with virtual machines running inside Parallels Workstation on my Macbook, but in order to use Michael Jang's practice exams I really needed to run Linux as the main OS (the tests require KVM virtualization). I tried and I tried and I tried, but CentOS refused to boot, mostly ending up on the grey Tux/penguin screen of rEFIt.

On my final attempt I managed to get it running. I started off with this set of instructions, which got me most of the way. After resyncing the partition table using rEFIt's menu, using the rEFIt boot menu would still send me to the grey penguin screen. But then I found this page! It turns out that rEFIt is only needed in order to tell EFI about the Linux boot partition! Booting is then done using the normal Apple boot loader!

Just hold down the ALT button after powering up and then choose the disk labeled "Windows". And presto! It works, CentOS boots up just fine. You can even set it as the default boot disk, provided that you left OS X on there as well (by using the Boot Disk Selector).



How exciting! I'll be presenting at NLUUG

2010-01-31 07:46:00

The NLUUG logo

Yesterday I received an email from the NLUUG (Dutch Unix users group) conference bureau:

Dear Thomas,

We have received your abstract for the NLUUG spring conference 2010. We would like to thank you for your submission and your patience.

It is our pleasure to inform you that your abstract has been chosen by the program committee to be presented on May 6th.

Holy carp! This means that I'll be getting on stage, in front of 50-200 Unix admins, my peers if you will. The last time I got in front of a big crowd it was thirty high school juniors, so this is going to be just a -little- bit different. =_=;

Yeah, this is a little scary, aside from being uber-exciting :)



Network configuration on Red Hat and Fedora WTF?!

2009-12-05 22:10:00

Why the frag does the IPv4 networking setup on Red Hat and Fedora Linii need to be so damn complicated? I've just spent half an hour Googling to find the right commands to ensure that my Fedora 12 VM in Parallels configures its eth0 at boot time. Seriously, compare the two:

Solaris 10:

1. Enter hostname and IP in /etc/hosts

2. Enter hostname in /etc/hostname.ni0

3. Enter network base IP and netmask in /etc/netmasks

Fedora 12:

1. Run system-config-network. Fill out all details.

2. Enter hostname and IP in /etc/hosts

3. Enter hostname in /etc/sysconfig/network

4. Set "ONBOOT" to yes in /etc/sysconfig/networking/devices/ifcfg-eth0

5. Run: chkconfig --level 35 network on
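For reference, a minimal static-IP ifcfg-eth0 might look like this (a sketch; the addresses are made up, adjust to your own network):

DEVICE=eth0
BOOTPROTO=static
IPADDR=192.168.1.10
NETMASK=255.255.255.0
ONBOOT=yes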

Seriously, who the fsck comes up with that last line?! I already have the network startup scripts in /etc/init.d and everything in /etc/sysconfig is set up and I -still- need to enable the network config to be loaded at boot time? WTF?! A few years back I had the same fights with setting up static routes that needed to be carried over reboots.

I fscking hate Red Hat.



Repairing Solaris 10 networking in Parallels Desktop 5

2009-12-04 18:33:00

The Parallels support team finally came through. Earlier, I reported network problems with Solaris 10 and Parallels Desktop 5. After upgrading to Parallels 5 my virtual NICs were not getting detected anymore and ifconfig was completely unable to plumb my ni0 or ni1 interfaces.

It took us over three weeks of mailing back and forth, but finally the Parallels team was able both to reproduce the issue and to provide a fix. Here's a summary of what tech support told me.

Cause of the problem

Parallels Desktop 5 now supports 64-bit operating systems. Furthermore, it will now by default boot any capable OS into 64-bit mode. This means that all of my Solaris 10 VMs that had been running in 32-bit mode all of a sudden switched over to 64-bit. It also means that any 32-bit-only drivers are rendered unusable. This is what broke Parallels' virtual network interfaces.

Solution 1: forcing the OS back to 32-bit mode

1. Stop the VM

2. Go to VM configuration -> Hardware -> Boot order.

3. In the "boot parameters" field enter: devices.apic.disable=1

3a. Alternatively, add the following to /etc/system: set pcplusmp:apic_forceload = -1

4. Boot the VM.

5. As root run: /usr/sbin/eeprom boot-file="kernel/unix"
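After the next reboot you can verify which mode the kernel runs in; isainfo is standard Solaris:

# isainfo -kv
32-bit i386 kernel modules

If it still reports 64-bit amd64 kernel modules, the force back to 32-bit didn't take.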

Solution 2: recompiling the RTL3829 drivers as 64-bit

1. Start the VM

2. Remove the old drivers. Run: rem_drv ni

3. Mount the Parallels tools CDROM ISO image on /cdrom.

4. Run: cd /cdrom/Drivers/Network/RTL3829

5. Run: cp -rp SOLARIS /tmp

6. Run: cd /tmp/SOLARIS

7. Edit the network.sh file and add the following lines right before the echo of "Compiling driver".

PATH=$PATH:/usr/sfw/bin

rm $driver/Makefile

ln -s $tmpdir/$driver/Makefile.amd64_gcc $tmpdir/$driver/Makefile

8. Save the file and run: ./network.sh

9. Answer the usual questions to configure the NIC. Then reboot.

I went with the second solution (might as well stay running in 64-bit mode now that I can). I can confirm that it works and that my NI interface is now back. You may find that network.sh configured the ni0 interface, while it's actually called ni1. Reconfigure if needed by moving /etc/hostname.ni0 to /etc/hostname.ni1.
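That rename is a one-liner (assuming the usual Solaris naming; the interface comes up under its new name on the next reboot):

# mv /etc/hostname.ni0 /etc/hostname.ni1
# reboot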



Converting Unix epoch time to other formats

2009-09-25 09:37:00

The Epoch Converter website is truly a handy resource! Anything you'd ever want to do with epoch dates gathered onto one easy to read/browse page. Awesome!

It's got information for many different operating systems and programming languages, so you're never lost. *bookmark*
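Of course your shell can do quick conversions too. With GNU date (Linux) and BSD date (OS X) respectively, converting an epoch stamp looks like this (the stamp is just an example value):

$ date -d @1253864220
$ date -r 1253864220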



Nagios script: check_cnr

2009-09-14 22:05:00

This script is used to monitor the basic processes that go with Cisco's CNR (Network Registrar), which can be likened to a DHCP server. Cisco's Support Wiki described CNR as follows:

Cisco CNS Network Registrar is a full-featured DNS/DHCP system that provides scalable naming and addressing services for service provider and enterprise networks. Cisco CNS Network Registrar dramatically improves the reliability of naming and addressing services for enterprise networks. For cable ISPs, Cisco CNS Network Registrar provides scalable DNS and DHCP services and forms the basis of a DOCSIS cable modem provisioning system.

As said, my script only checks the basics of CNR, ensuring that the required daemons are running. It does not actually check any of the functionality, though at a later point in time it may be expanded to include this.


Usage of check_cnr

./check_cnr [-nagios|-tivoli] [-d -o FILE]
-nagios	Nagios output mode (default)
-tivoli	Tivoli output mode
-d	Debug mode
-o 	Output file for debug logging

Output

Depending on which mode you've selected, the output of the script will differ slightly.

In Tivoli mode the output will be limited to a numerical value, as the script is to be used as a "numeric script": 0 = OK, 1 = WARNING/UNKNOWN, 2 = SEVERE. The exit code of the script will be identical to this value.

In Nagios mode the exit code of the script will be similar to Tivoli's, with the exception that the value 3 portrays an unknown state. The output on stdout includes the service name and state (CNR OK/NOK) and a helpful error message.
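For example, a hypothetical healthy run in Nagios mode might look like this (the exact message text depends on the script version):

$ ./check_cnr -nagios
CNR OK - all required daemons are running
$ echo $?
0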


Limitations


Download

Download check_cnr.sh
$ wc check_cnr.sh
189     666    4531 check_cnr.sh

$ cksum check_cnr.sh
4161895780 4531 check_cnr.sh


Finally: BoKS has a logo!

2008-11-22 20:37:00

The new BoKS logo

Since I joined $CLIENT in October my life has been nothing but BoKS, BoKS, BoKS. It's great to be working with FoxT's security software again :) A lot of things have changed over the years, though the software is still very, very familiar.

One of the things that's made me happy is that Fox Tech have -finally- made an official logo for their BoKS products! I find it odd that they've been marketing this software for over ten years while their last logo dates back to the nineties. Said decrepit logo hasn't been used in ages, so since then BoKS was known by just that: a plain-text rendition of the name. By request of $CLIENT, Fox Tech have gotten off their hineys and created a new logo that matches their corporate identity.

As a side note: over the past few weeks I've seen a lot of in-depth troubleshooting and I've decided to share some of the stuff I've learnt. Hence you'll find that the BoKS part of the sysadmin section has been revamped :)



Dabbling with SQL

2008-07-17 08:46:00

Bwahah, this is priceless :D

Yesterday I'd spent an hour or two writing a PHP+SQL script for one of my colleagues, so he could get his hands on the report he needed. We have this big database with statistics (gathered over the course of a year) and now it was a matter of getting the right info out of there. Let's say that what we wanted was the following:

For four quarters, per host, the total sum of the reported sizes of file systems.

Now, because my SQL skills aren't stellar, what I did was create a FOR-loop over a "select distinct" of the hostnames from the table. Then, for each loop instance I'd "select sum(size)" to get the totals for one date. But because we wanted to know the totals for four quarters, said query was run four times with a different date. This means that to get my hands on said information I was running 168 * 4 = 672 queries in a row. All in all, it took our box fifteen minutes to come up with the final answer.

On my way to work this morning a thought struck me: I really ought to be able to do this with four queries, or even with -one-! What I want isn't that hard! And in a flash of insight it came to me!

SELECT hostname, date, SUM(size) AS total FROM vdisks WHERE (date="2007-10-03" OR date="2008-01-01" OR date="2008-04-01" OR date="2008-07-01") GROUP BY hostname, date;
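The same filter can also be written with an IN list; this is an equivalent, purely cosmetic rewrite (not part of the original query), which reads a little easier:

SELECT hostname, date, SUM(size) AS total FROM vdisks WHERE date IN ("2007-10-03", "2008-01-01", "2008-04-01", "2008-07-01") GROUP BY hostname, date;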

The runtime of the total query has gone from 15 minutes, to 1 second. o_O

Holy shit :D I guess it -does- pay to optimize your queries and applications!



Certification tips for Unix sysadmins

2008-01-01 00:00:00

It takes at least a couple of years for a fledgling sysadmin to build up his or her experience to a level where people will say: "Yeah! He's a good sysadmin... He knows his way around the OS."

Most of your first two or three years (assuming that you start admining in college) will be spent either with your nose in the books (learning new stuff) or with your nose to the grindstone (practicing the new stuff). A lot of time will be spent on basic grunt work, combined with maybe a couple of nice projects and some programming. But at some point a dreaded new word will drop on you like a brick from up on high... Management level, that is...

_certification_

At first, official vendor certification may seem like a humongous task! Especially if you take a look at the requirements that the vendor publishes on its website and at the sheer volume of the prep books available. I had the same problem! One day my Field Manager mentioned that official certificates would look good on my resume and that I should go order a book or two... Which I did... And which I subsequently tried to read three times over... And just could not get through...

You see, I made the fatal mistake of wanting to cram everything in my head before even setting a date for the exam. This gave me way too much slack, causing me to lose interest at least two times over. So, after a bit of coaching from one of my friends/colleagues I came to the following conclusion on how to prepare for certification.

1. Get some experience :) Don't try to get certified immediately after being introduced to a new OS.

2. Take a look at the vendor's requirements for the certificate. These are usually published on their website.

3. Order one, maybe two good study books. I've created a small list of which books are good and which ones should be avoided.

4. Make a rough guesstimate of how long you think you'll need to study. Don't make this any longer than two months, or you'll simply lose interest.

5. Order an exam voucher from your vendor.

6. Schedule the exam.

7. Start studying.

There's also a couple of other things that can really help you get the knack of things, ensuring that you'll be absolutely ready for the exam:

* Ask your employer to provide a sandbox system: a simple, small server which you are free to tinker with, configure, play with and break. This is an invaluable study tool!

* Purchase an account for a practice exam website (or get your employer to pitch in). The guys at Unixporting.com provide damn good test exams for Solaris Sysadmin 1 and 2, at a low price!

Most important of all: don't sweat it! A little excitement or a couple of shivers are good, but honestly: the fate of the world does not rest on your shoulders. If you don't pass the exam: try, try and try again. :)

Good luck!



LPI-101 summary

2008-01-01 00:00:00

In 2007 I got my LPI-1 certification. This certificate requires one to take two exams: LPI-101 and LPI-102. I've studied hard for both exams and created summaries of all of the stuff I had to learn. I thought I'd share my summaries with all of the other LPI students. I hope they are useful to you!



Recommended Unix Sysadmin books

2008-01-01 00:00:00

When getting certified, one of the most important tools is your cram sessions. With books... You know: dead trees? Treeware? Those big leafy things which you read?... But you gotta know which ones are good and which ones to avoid like the plague.

Sun Solaris SCSA 1 and 2

"Exam cram: Solaris 8 System Administrator" ~ Darrell L. Ambro, Coriolis press

478 pages, comes with a separate cram sheet with "everything you need to know for the exam".



Avoid this one. This is the book I bought at first, as it got some good reviews on Amazon.com. It was also the one that I tried getting through three times over *ugh*. Honestly, the book is written in a very dull style, but worst of all: it really isn't much of a cram book, since the author misses almost all of the important stuff for the exams. Way too little detail, so I wouldn't recommend it to anyone but the starting Solaris sysadmin who needs a place to start.

"Solaris: Sun Certified System Administrator for Solaris 8.0 study guide" ~ Global Knowledge, Osborne McGraw-Hill

892 pages, comes with CD containing practice exams and a digital copy of the book.



Now _this_ is what I'm talking about! My colleague Martijn recommended this book and it really _does_ cover everything you need to know to ace the exam, plus a little more. The authors don't brush over any subject and take on each and every topic in detail. Yes, it's a big book and it may take you a while to get through it, but it's worth it. The exams included on the CD are a bit dodgy and are only good for one, maybe two attempts. In any case I recommend that you go out and get an account for a practice exam site.

Sun SCNA (TCP/IP Network admin)

"Sun certified network administrator for Solaris 8 study guide" ~ Rick Bushnell, Sun Microsystems Press

462 pages, no extras



Martijn also tipped me off about this book; apparently he aced the test with it. I have to admit that the book _does_ take its time in explaining everything to you and that Rick doesn't leave out any details. I have to warn you though that the author made a couple of mistakes, that he likes repetition (sometimes a little too much) and that at times he underestimates the exam (he tells you that you don't need to know what he's about to explain, when you do). All in all a good book, but I'm not too crazy about it.



SCNA summary

2008-01-01 00:00:00

Back in 2004, when I originally studied for my SCNA certification, I wrote a big summary based on the course books. I thought I'd share this summary with the rest of this world's students. Even though it was meant for the Solaris 8 SCNA exam, it should still be useful.



LPI-102 summary

2008-01-01 00:00:00

In 2007 I got my LPI-1 certification. This certificate requires one to take two exams: LPI-101 and LPI-102. I've studied hard for both exams and created summaries of all of the stuff I had to learn. I thought I'd share my summaries with all of the other LPI students. I hope they are useful to you!



Troubleshooting BoKS fault situations

2008-01-01 00:00:00

A PDF version of this document is available. Get it over here.

1.1. Verifying the proper functioning of a BoKS client

People have often asked me how one can check whether a newly installed BoKS client is functioning properly. With these three easy steps you too can become a milliona..!!.... Oops... Wrong show! These easy steps will show you whether your new client is working like it should.

  1. Check the boks_errlog in $BOKS_var.
  2. Run cadm -l -f bcastaddr -h $client from the BoKS master (in a BoKS shell).
  3. Try to login to the new client.

If all three steps go through without error, your system is as healthy as a very healthy good thing... or something.

1.2. SCENARIO: The BoKS master is not replicating to a replica (or all replicas)

Since one or more of the replicas are out of sync, login attempts by users may fail, assuming that the BoKS client on the server in question was looking at the out-of-sync BoKS replica. Other nasty stuff may also occur.

Standard procedure is to follow these steps:

  1. Check the status of all BoKS replicas.
  2. Check BoKS error logs on the master and the replica(s).
  3. Try a forced database download.
  4. Check BoKS replication processes to see if they are all running.
  5. Check the master queue, using the boksdiag fque -master command.
  6. Check BoKS communications, using the cadm command.
  7. Check node keys.
  8. Check the replica server's definition in the BoKS database.
  9. Check the BoKS configuration on the replica.
  10. Debug replication processes.

All commands are run in a BoKS shell, on the master server unless specified otherwise.

1. Check the status of all BoKS replicas.

# /opt/boksm/sbin/boksadm -S boksdiag list

Since last pckt: the number of minutes/seconds since the BoKS master last sent a communication packet to the respective replica server. This should never exceed more than a couple of minutes.

Since last fail: the number of days/hours/minutes since the BoKS master was last unable to update the database on the respective replica server. If an amount of a couple of hours is listed, you'll know that the replica server had a recent failure.

Since last sync: the number of days/hours/minutes since BoKS last sent a database update to the respective replica server.

Last status: yes indeed! The last known status of the replica server in question. OK means that the server is running perfectly and that updates are received. Loading means that the server was just restarted and is still loading the database or any updates. Down indicates that the replica server is down or even dead.

2. Check BoKS error logs on the master and the replica(s).

This should be pretty self-explanatory. Read the /var/opt/boksm/boks_errlog file on both the master and the replicas to see if you can detect any errors there. If the log file mentions anything about the hosts involved, you should be able to find the cause of the problem pretty quickly.

3. Try a forced database download.

Keon> boksdiag download -force $hostname

This will push a database update to the replica. Perform another boksdiag list to see if it worked. Re-read the BoKS error log file to see if things have cleared up.

4. Check BoKS replication processes to see if they are all running.

Keon> ps -ef | grep -i drainmast

This should show two drainmast processes running. If there aren't, you should see errors about this in the error logs and in Tivoli.

Keon> Boot -k
Keon> ps -ef | grep -i boks (kill any remaining BoKS processes)
Keon> Boot

Check to see if the two drainmast processes stay up. Keep checking for at least two minutes. If one of them crashes again, try the following:

Check that /opt/boksm/lib/boks_drainmast is still linked to boks_drainmast_d, which should be in the same directory. Also check that boks_drainmast_d is still the same file as boks_drainmast_d.nonstripped.

If it isn't, copy boks_drainmast_d to boks_drainmast_d.orig and then copy the non-stripped version over boks_drainmast_d. This will allow you to create a core file which is useful to TFS Technology.

Keon> Boot -k
Keon> Boot
Keon> ls -al /core

Check that the core file was just created by boks_drainmast_d.

Keon> Boot -k
Keon> cd /var/opt/boksm/data
Keon> tar -cvf masterspool.tar master_spool
Keon> rm master_spool/*
Keon> Boot

Things should now be back to normal. Send both the tar file and the core file to TFS Technology (support@tfstech.com).

5. Check the master queue.

Keon> boksdiag fque -master

If any messages are stuck, there is most likely still something wrong with the drainmast processes. You may want to try and restart the BoKS master software. Do NOT reboot the master server! Restart the software using the Boot command. If that doesn't help, perform the troubleshooting tips from step 4.

6. Check BoKS communications, using the cadm command.

Verify that the BoKS communication between the master and the replica itself is up and running.

Keon> cadm -l -f bcastaddr -h $replica

If this doesn't work, re-check the error logs on the client and proceed with step 7.

7. Check node keys.

On the replica system run:

Keon> hostkey

Take the output from that command and run the following on the master:

Keon> dumpbase | grep $hostkey

If this doesn't return the configuration for the replica server, the keys have become unsynchronized. If you make any changes you will need to restart the BoKS processes using the Boot command.

8. Check the replica server's definition in the BoKS database.

Keon> dumpbase | grep RNAME | grep $replica

The TYPE field in the definition of the replica should be set to 261. Anything else is wrong, so you need to update the configuration in the BoKS database. Either that or have SecOPS do it for you.

9. Check the BoKS configuration on the replica.

On the replica system, review the settings in /etc/opt/boksm/ENV.

10. Debug replication processes.

If all of the above fails, you should really get cracking with the debugger. Refer to the appropriate chapter of this manual for details.

1.3. SCENARIO: You can't log in to a BoKS client

Most obviously we can't do our work on that particular server and neither can our customers. Naturally this is something that needs to be fixed quite urgently!

  1. Check BoKS transaction log.
  2. Check if you can log in.
  3. Check BoKS communications.
  4. Check bcastaddr and bremotever files.
  5. Check BoKS port number.
  6. Check node keys.
  7. Check BoKS error logs.
  8. Debug servc process on replica server or relevant process on client.

All commands are run in a BoKS shell, on the master server unless specified otherwise.

1. Check BoKS transaction log.

Keon> cd /var/opt/boksm/data
Keon> grep $user LOG | bkslog -f - -wn

This should give you enough output to ascertain why a certain user cannot log in. If there is no output at all, do the following:

Keon> cd /var/junkyard/bokslogs
Keon> for file in `ls -lrt | tail -5 | awk '{print $9}'`
> do
> grep $user $file | bkslog -f - -wn
> done

If this doesn't provide any output, perform step 2 as well to see whether we sysadmins can log in.

2. Check if you can log in.

Pretty self-explanatory, isn't it? See if you can log in yourself.

3. Check BoKS communications.

Keon> cadm -l -f bcastaddr -h $client

4. Check bcastaddr and bremotever files.

Login to the client through its console port.

Keon> cat /etc/opt/boksm/bcastaddr
Keon> cat /etc/opt/boksm/bremotever

These two files should match the same files on another working client. Do not use a replica or master to compare the files; these are different over there. If you make any changes you will need to restart the BoKS processes using the Boot command.

5. Check BoKS port number.

On the client and master run:

Keon> getent services boks

This should return the same value for the BoKS base port on both systems. If it doesn't, check either /etc/services or NIS+. If you make any changes you will need to restart the BoKS processes using the Boot command.
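For example, a run on a typical installation might return something like this (the base port is site-specific; 6500 is merely a common default, so treat the number as an assumption):

Keon> getent services boks
boks              6500/tcp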

6. Check node keys.

On the client system run:

Keon> hostkey

Take the output from that command and run the following on the master:

Keon> dumpbase | grep $hostkey

If this doesn't return the definition for the client server, the keys have become unsynchronized. Reset them and restart the BoKS client software. If you make any changes you will need to restart the BoKS processes using the Boot command.

7. Check BoKS error logs.

This should be pretty self-explanatory. Read the /var/opt/boksm/boks_errlog file on both the master and the client to see if you can detect any errors there. If the log file mentions anything about the hosts involved, you should be able to find the cause of the problem pretty quickly.

8. Debug servc process on replica server or relevant process on client.

If all of the above fails, you should really get cracking with the debugger. Refer to the appropriate chapter of this manual for details (see chapter: SCENARIO: Setting a trace within BoKS).

NOTE: If you need to restart the BoKS software on the client without logging in, try doing so using a remote management tool, like Tivoli.

1.4. SCENARIO: The BoKS client queues are filling up

The whole of BoKS is still up and running and everything's working perfectly. The only client(s) that won't work are the one(s) that have stuck queues. The only way you'll find out about this is by running boksdiag fque -bridge, which reports all of the queues which are stuck.

  1. Check if client is up and running.
  2. Check BoKS communications.
  3. Check node keys.
  4. Check BoKS error logs.

All commands are run in a BoKS shell, on the master server unless specified otherwise.

1. Check if client is up and running.

Keon> ping $client

Also ask your colleagues to see if they're working on the system. Maybe they're performing maintenance.

2. Check BoKS communications.

Keon> cadm -l -f bcastaddr -h $client

3. Check node keys.

On the client system run:

Keon> hostkey

Take the output from that command and run the following on the master:

Keon> dumpbase | grep $hostkey

If this doesn't return the definition for the client server, the keys have become unsynchronized. Reset them and restart the BoKS client software using the Boot command.

4. Check BoKS error logs.

This should be pretty self-explanatory. Read the /var/opt/boksm/boks_errlog file on both the master and the client to see if you can detect any errors there. If the log file mentions anything about the hosts involved, you should be able to find the cause of the problem pretty quickly.

NOTE: What can we do about it?

If you're really desperate to get rid of the queue, do the following:

Keon> boksdiag fque -bridge -delete $client-ip

At one point in time we thought it would be wise to manually delete messages from the spool directories. Do not under any circumstance touch the crypt_spool and master_spool directories in /var/opt/boksm. Really: DON'T DO THIS! It is unnecessary and will lead to trouble with BoKS.

1.5. SCENARIO: Setting a trace within BoKS

We are required to run a BoKS debug trace when either:

  1. People are unable to log in without any apparent reason. A debug will show why logins are getting rejected.
  2. We have run into a bug or a problem with BoKS which cannot easily be dealt with through e-mail. TFS Tech support will usually request us to perform a number of traces and send them the output files.

First off, let me warn you: debug trace log files can grow pretty vast pretty fast! Make sure that you turn on the trace only right before you're ready to use the faulty part of BoKS and also be sure to stop the trace immediately once you're done.

Now, before you can start a trace you will need to make sure that the BoKS client system only performs transactions with one BoKS server. If you don't, you will have no way of knowing on which server you should run the trace.

Login to the client system experiencing problems.

$ su -
# cd /etc/opt/boksm
# cp bcastaddr bcastaddr.orig
# vi bcastaddr

Edit the file in such a way that it only points to one of the available BoKS servers, preferably a BoKS replica. Please refrain from using the BoKS master server.

# /opt/boksm/sbin/boksadm -S Boot -k
# sleep 10; ps -ef | grep -i boks | awk '{print $2}' | xargs kill
# /opt/boksm/sbin/boksadm -S Boot

Now, how you proceed depends on what problems you are experiencing.

If people are having problems logging in:

Log in to the replica server and start a BoKS shell with sx.

# sx /opt/boksm/sbin/boksadm -S
# cd /var/tmp

Now, type the following command, but DO NOT press enter yet.

# bdebug -x 9 bridge_servc_r -f /var/tmp/BR-SERVC.trace

Open a new terminal window, because we will try to log in to the failing client. BEFORE YOU START THE TOOL USED TO LOGIN (SSH, Telnet, FTP, whatever) press enter at the command waiting on the replica server. Attempt to log in as usual. If it fails, you have successfully set a trace.

Switch back to the window on the replica server and run the following command to stop the trace.

# bdebug -x 0 bridge_servc_r

Repeat the same process once more, but this time around debug the servc process instead of bridge_servc_r. Send the output to /var/tmp/SERVC.trace.

You can now read through the files /var/tmp/BR-SERVC.trace and /var/tmp/SERVC.trace to troubleshoot the problem yourself, or you could send them to TFS Tech for analysis. If the attempted login did NOT fail, there's something else going on: one of the other replica servers is not working properly! Find out which one it is by changing the client's bcastaddr file, each time using a different BoKS server as a target.

If you are attempting to troubleshoot another kind of problem:

Tracing any other part of BoKS isn't really all that different from tracing the login process. You prepare in the same way (make bcastaddr point at one BoKS server) and you will probably have to prepare the trace on bridge_servc_r as well (see the text block above; if you do not have to trace bridge_servc_r, TFS Tech will probably tell you so).

Yet again, BEFORE you start the trace on the master side by running

# bdebug -x 9 bridge_servc_r -f /var/tmp/SERVC.trace

you will have to go to the client system with the problematic situation and perform the following.

# cd /var/tmp
# bdebug -x 9 $PROG -f /var/tmp/$PROG.trace

$PROG in this case is the name of the BoKS process (bridge_servc_r, drainmast_download) or the access method (login, su, sshd) that you want to debug.

Now, start both traces and attempt to perform the task that is failing. Once it has failed, stop both traces again using bdebug -x 0 $PROG.

1.6. SCENARIO: Debugging the BoKS SSH daemon

From time to time you may have problems with the BoKS SSH daemon which cannot be explained in any logical way. At such a time a debug trace of the SSH daemon can be very helpful! This can be done by temporarily starting a second daemon on an unused port.

On the troubled system, login and start a BoKS shell:

# /opt/boksm/sbin/boksadm -S
Keon> boks_sshd -d -d -d -p 24 > /tmp/sshd.out 2>&1

From another system:

$ ssh -l $username -p 24 $target-host

Try logging in; it shouldn't work :) Now close the SSH session with Ctrl-C, which should also close the temporary SSH daemon on port 24. /tmp/sshd.out should now contain all of the debugging information you or TFS Technology could need.



Promoting NIS+ replicas into master servers

2008-01-01 00:00:00

EDIT: 23/11/2004

DO NOT USE THE FOLLOWING PROCEDURE! IT HAS PROVEN TO BE FAULTY AND SHOULD ONLY BE USED AS A GUIDELINE FOR MAKING YOUR OWN PROCEDURE!

I will try and correct all of the mistakes as soon as possible.. Please be patient...

At some point in time it may happen that your NIS+ master server has become too old or overloaded to function properly. Maybe you used old, decrepit hardware to begin with, or maybe you have been using NIS+ in your organisation for ages :) Anywho, you've now reached the point where the new hardware has received its proper build and the server is ready to assume its role as NIS+ master.

Of course you want things to go smoothly and with as little downtime as possible. One of the methods to go about this is to use the other procedure in this menu: "Rebuild your master". That way you'll literally build a new master server, after which you reload all of the database contents from raw ASCII dumps.

The other method would be to use the procedure below :) This way you'll transfer mastership of all your NIS+ databases from the current master to the new one. I must admit that I haven't used this procedure in our production environment as of yet (15/11/04), but I will in about a week! But even after that time, after I've added alterations and after I've fixed any errors, don't come suing me because the procedure didn't work for you. NIS+ can be a fickle little bitch if she really wants to...


This procedure requires that your new NIS+ master server is already a replica server. There are numerous books and procedures on the web which describe how to promote a NIS+ client into a replica, but I'll include that procedure in the menu sometime soon.

Before you begin, disable replication of NIS+ on any other replica servers you may have running. This is easily done by killing the rpc.nisd process on each of these systems. Beware though that all of the replicas do need to remain functioning NIS+ clients! This ensures that their NIS_COLD_START gets updated.

Log in to both the current master and the replica server you wish to upgrade. Become root on both systems.

On the master server:

# for table in `nisls`

>do

>nismkdir -m $replica $table

>done

# cd /var/nis/data

# scp root.object $user@$replica:/tmp

On the replica server:

# cd /var/nis/data

# mv /tmp/root.object .

# chown root:root root.object

# chmod 644 root.object

Kill all NIS processes on both the master and the replica in question. Then restart on the replica using:

# /usr/sbin/rpc.nisd -S 0

# /usr/sbin/nis_cachemgr -i

Verify that the replica server is now recognised as the current master server by using the following commands.

# nisshowcache -v

# niscat -o groups_dir.`domainname`.

# niscat -o org_dir.`domainname`.

# niscat -o `domainname`.

If the replica system is not recognised as the master, re-run the for-loop which was described above. This will re-run the nismkdir command for each table that isn't configured properly.

# for table in `nisls`

>do

> nischown `hostname` $table.`domainname`.

> for subtable in `nisls $table | grep -v $table`

> do

> nischown `hostname` $subtable.$table

> done

>done

Once again verify the ownership of the tables which you just modified.

# niscat -o `domainname`.

# niscat -o passwd.org_dir

Checkpoint the whole NIS+ domain.

# for table in `nisls`

>do

>nisping -C $table

>done

Kill all NIS+ daemons on the new master. Then restart using:

# /usr/sbin/rpc.nisd

# /usr/sbin/nis_cachemgr -i

Currently the old master has reverted to replica status. If you want to remove the old master from the infrastructure as a server, proceed with the next section.


Login to both the new master and the old master. Become root on both.

On the new master:

# for table in `nisls`

>do

>nisrmdir -s $oldmaster $table

>done

Checkpoint the whole NIS+ domain.

# for table in `nisls`

>do

>nisping -C $table

>done

Now make the old NIS+ master a client system.

# rm -rf /var/nis

# /usr/sbin/rpc.nisd

# nisinit -c -H $newmaster

# nisclient -i -d `domainname` -h $newmaster



Combining net-SNMP and SUNWmasf on Solaris

2008-01-01 00:00:00

In some cases you're going to want to use Net-SNMP on your Solaris hosts, while still being able to monitor Sun-specific SNMP objects. It took me a while to get all of this to work and it's a bit of a puzzle, but here's how to make it work.

In our current environment at $CLIENT we want to standardise all of our UNIX hosts to the Net-SNMP agent software. This will allow us to use a configuration file which can be at least 60% identical on each host, making life just a little bit easier for all of us. Unfortunately Net-SNMP isn't equipped to deal with all of Sun's specific SNMP objects, so we're going to have to make a few big modifications to the software.

Of course packaging all these changes into one big .PKG is the nicest way of ensuring that all required changes are made in one blow, so that's what I've done. Unfortunately I cannot share this package with you, since it contains quite a large amount of $CLIENT internal information. I may be tempted at another time to recreate a non-$CLIENT version of the package that can be used elsewhere.


Re-compiling Net-SNMP

The latest versions of Net-SNMP come with experimental LM_Sensors support for Sun hardware. Oddly, I've found that you need to drop one version below the latest to get it to work nicely with Solaris 8. So here are the steps to take...

  1. Download the source code for Net-SNMP version 5.2.3 from their website.
  2. Move the .TGZ to your build system and unpack it in your regular build location. Also, building Net-SNMP successfully requires OpenSSL 0.9.7g or higher, so make sure that it's installed on your build system.
  3. Run the configure script with the following options:

    --with-mib-modules="host disman/event-mib ucd-snmp/diskio smux agentx disman/event-mib ucd-snmp/lmSensors" --with-perl-module

  4. Run "make", "make test" and "make install" to complete the creation of Net-SNMP. If "make test" fails on every check, it is likely that your system is unable to find the requisite OpenSSL libraries. This may be solved by running:

    /usr/bin/crle -c /var/ld/ld.config -l /lib:/usr/lib:/usr/local/lib:/usr/local/ssl/lib

  5. After "make install" has finished all the Net-SNMP files have been installed on your build system. Naturally it's important to know which files to include in your package. To help you, I've created a list of the files that are installed.

Installing SUNWmasf and its components

PLEASE NOTE: SUNWmasf will currently (July 2006) only get useful results on the following models: V210, V240, V250, V440, V1280, E2900, N210, N240, N440, N1280. On other systems you may have more luck using the LM_Sensors pieces of Net-SNMP. They have been tested to work on the E450, V880 and 280R.

As I mentioned earlier, Net-SNMP with LM_Sensors can only gather a limited amount of Sun-specific information. That's besides the fact that it's still an experimental feature. So we're going to need an alternative SNMP agent to gather more information for us. Enter the SUNWmasf package.

SUNWmasf and its components may be downloaded from the Sun Microsystems website. Either use this direct link (which may be subject to change), or go to www.sun.com/download and search for "Sun SNMP Management Agent".

You can opt to install SUNWmasf manually on each of your clients, but it would be much nicer to include it into your custom made package. To have a full list of all the files and symlinks that you should include, you can take a peek at the prototype file I made for the package. It includes all the files required for Net-SNMP.

Installation of the software couldn't be easier. Just run the following command, after extracting the .TAR.Z file that contains SUNWmasf.

pkgadd -d . SUNWescdl SUNWescfl SUNWeschl SUNWescnl SUNWescpl SUNWmasf SUNWmasfr


Configuring SUNWmasf

Go into /etc/opt/SUNWmasf/conf and replace the snmpd.conf file with the following:

rocommunity public

agentaddress 1161

agentuser daemon

agentgroup daemon


Configuring Net-SNMP

The configuration file for Net-SNMP is located in /usr/local/share/snmp. You will need to make a whole bunch of changes over here that I won't cover, like security ACLs, SNMP trap hosts and bunches of other stuff. However, you _will_ need to add the following lines to allow Net-SNMP to talk to SUNWmasf.

proxy -c public localhost:1161 .1.3.6.1.4.1.42

proxy -c public localhost:1161 .1.3.6.1.2.1.47


Starting the software

Since SUNWmasf relies upon Net-SNMP, it will need to be started after that piece of software. The prototype file I mentioned earlier already takes this into account, but if you're not going to use it just make sure that /etc/init.d/masfd gets called _after_ /etc/init.d/snmpd during the boot process.

Also, I've noticed that SUNWmasf will need about thirty seconds before it can be read using commands like snmpget and snmpwalk.
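Once both daemons are up, a quick sanity check of the proxying looks like this (assuming the proxy lines and community string shown above):

snmpwalk -c public localhost .1.3.6.1.4.1.42

If the proxy works, this should return the Sun enterprise sub-tree; walking localhost:1161 directly should show the same data coming from the SUNWmasf agent itself.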


Reading values from the agents

As you may well know, SNMP is a tangly web of numerical identifiers. I will make a nice overview of the various useful OIDs that you can use for monitoring through both LM_Sensors and SUNWmasf. However, I will put these in a separate document, since it falls outside the scope of this mini-howto.



Interesting SUNWmasf (Management Agent for Sun Fire) SNMP objects

2008-01-01 00:00:00

In my mini-howto about monitoring Sun-specific SNMP objects through Net-SNMP I referred to a few interesting objects which can be read through SUNWmasf.

Unfortunately I can currently only list details for two of the supported models, since I do not have test boxen for the other models. The following lists are only a small selection of all possible objects, the ones that we found interesting. A full list of available options can be obtained by running:

snmpwalk -c public localhost .1.3.6.1.2.1.47.1.1.1.1.2

Details about the structure of the various MIBs can be found in other articles in the Sysadmin section of my website. Just browse through the menu on the left. Point is that the lists below only list the OID _within_ the specific sub-trees (for example: .1.3.6.1.2.1.47.1.1.1.1.2.46). As I said: details on actually _reading_ these values will be contained in another document.

The possible values for service indicators (enterprise.42.2.70.101.1.1.12.1.2.$OID) are:

1 = unknown, 2 = off, 3 = on, 4 = alternating

The possible values for the keyswitch (enterprise.42.2.70.101.1.1.9.1.1.$OID) are:

1 = unknown, 2 = stand-by, 3 = normal, 4 = locked, 5 = diag

Sun Fire V240

Object                    Description                              Unit
.21 .23 .25 and .27       HDD[0-3] Service required indicator      Integer
.39                       SYSTEM Service required indicator        Integer
.33 and .36               PSU[0-1] Service required indicator      Integer
.69 and .70               CPU[0-1] Core temperature                Degrees
.71                       SYSTEM Enclosure temperature             Degrees
.99 and .100              PSU[0-1] Over-temperature warning        Integer
.81 .82 and .83           SYSTEM Enclosure fan[0-2] tacho meter    Integer
.84 .85 .86 and .87       CPU[0-1] Fan[0-1] tacho meter            Integer
.91 and .92               PSU[0-1] Fan underspeed warning          Integer
.31 and .34               PSU[0-1] Active (power?)                 Integer

Sun Fire V440

Object                    Description                              Unit
.28 .30 .32 and .34       HDD[0-3] Service required indicator      Integer
.37 and .41               PSU[0-1] Service required indicator      Integer
.46                       SYSTEM Service required indicator        Integer
.43                       Keyswitch                                Integer
.98 .100 .102 and .104    CPU[0-3] Core temperature                Degrees
.106                      MOBO temperature                         Degrees
.107                      SCSI temperature                         Degrees
.131 and .132             PSU[0-1] Predict fan fault               Integer
.121                      PCIFAN tacho meter                       Integer
.122 and .123             CPUFAN[0-1] tacho meter                  Integer
.36 and .40               PSU[0-1] Power OK                        Integer
.124 .125 .126 and .127   CPU[0-3] Power fault                     Integer
.128                      MOBO Power fault                         Integer



Interesting SNMP objects for LM_Sensors on Solaris

2008-01-01 00:00:00

In my mini-howto about monitoring Sun-specific SNMP objects through Net-SNMP I referred to a few interesting objects which can be read through LM_Sensors.

Unfortunately I can currently only list details for two of the supported models, since I do not have test boxen for the other models. The following lists are only a small selection of all possible objects, the ones that we found interesting. A full list of available options can be obtained by running:

snmpwalk -c public -m ALL localhost .1.3.6.1.4.1.2021.13

Details about the structure of the various MIBs can be found in other articles in the Sysadmin section of my website. Just browse through the menu on the left. Point is that the lists below only list the OID _within_ the specific sub-trees (for example: .1.3.6.1.4.1.2021.13.16.5.1.2.9). As I said: details on actually _reading_ these values will be contained in another document.

Sun Fire V240

Object                    Description                              Unit
2.1.2.1 and .2            CPU[0-1] Core temperature                Integer *
2.1.2.3                   SYSTEM Enclosure temperature             Integer *
5.1.2.2                   SYSTEM Service required indicator        Integer
5.1.2.5                   PSU[0-1] Service required indicator      Integer
5.1.2.10 .12 .14 and .16  HDD[0-3] Service required indicator      Integer
5.1.2.18                  Keyswitch                                Integer
5.1.2.4 and .7            PSU[0-1] Activity (power?)               Integer

*: In order to get the real temperature, you will need to divide the integer contained in this variable by 65.526. For some odd reason Net-SNMP does not store the real temperature in degrees Centigrade.
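A quick conversion from the shell (the raw value 2883 is just an example reading, not taken from a real box):

$ echo '2883 / 65.526' | bc -l
43.99...

which would put that sensor at roughly 44 degrees.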

Sun Fire V440

Object                    Description                              Unit
2.1.2.1 .2 .3 and .4      CPU[0-3] Core temperature                Integer *
2.1.2.5 .6 .7 and .8      CPU[0-3] Ambient temperature             Integer *
2.1.2.9                   SCSI temperature                         Integer *
2.1.2.10                  MOBO temperature                         Integer *
5.1.2.2                   SYSTEM Service required indicator        Integer
5.1.2.6 and .10           PSU[0-1] Service required indicator      Integer
5.1.2.12 .14 .16 and .18  HDD[0-3] Service required indicator      Integer
5.1.2.20                  Keyswitch                                Integer
5.1.2.4 and .8            PSU[0-1] Power OK                        Integer

*: In order to get the real temperature, you will need to divide the integer contained in this variable by 65.526. For some odd reason Net-SNMP does not store the real temperature in degrees Centigrade.



Reading from Solaris SNMP agents

2008-01-01 00:00:00

I have to admit that figuring out how all the parts of SNMP on Sun stick together took me a little while. Just like when I was learning Nagios it took me about a week of mucking about to gain clarity. Now that I've figured it out, I thought I'd share it with you...

First off, everything I will describe over here depends on the availability of two pieces of software on your clients: Net-SNMP and SUNWmasf. See the article on combining the two for further details on installing and configuring this software.

We should begin by verifying that you can read from each of the important pieces of the SNMP tree. You can verify this by running the following three commands on your client system. Each should return a long list of names, numbers and values. Don't worry if it doesn't make sense yet.

snmpwalk -c public localhost .1.3.6.1.2.1.47

snmpwalk -c public localhost .1.3.6.1.4.1.42

snmpwalk -c public -m ALL localhost .1.3.6.1.4.1.2021.13

Incidentally you should also be able to access the same parts of the SNMP tree remotely (from your Nagios server, for example).

snmpwalk -c public $remote_client .1.3.6.1.2.1.47

snmpwalk -c public $remote_client .1.3.6.1.4.1.42

snmpwalk -c public -m ALL $remote_client .1.3.6.1.4.1.2021.13

Please keep in mind that you should replace the word "public" in all the examples with the community string that you've chosen for your SNMP agents. It could very well be something other than "public".


Which witch is witch?

Now that we've made sure that you can actually talk to your SNMP agent, it's time to figure out which components you want to find out about. The easy way to find out all components that are available to you is by running the following command.

snmpwalk -c public localhost .1.3.6.1.2.1.47.1.1.1.1.2

Let me explain what the output of this command really means... The SNMP sub-tree MIB-2.47.1.1.1.1 contains descriptive information on system-specific SNMP objects. Each object has a sub-object in the following sub-trees (each number follows after MIB-2.47.1.1.1.1).

Sub-OID   Description                 Sub-OID   Description
.1        entPhysicalIndex            .9        entPhysicalFirmwareRev
.2        entPhysicalDescr            .10       entPhysicalSoftwareRev
.3        entPhysicalVendorType       .11       entPhysicalSerialNum
.4        entPhysicalContainedIn      .12       entPhysicalMfgName
.5        entPhysicalClass            .13       entPhysicalModelName
.6        entPhysicalParentRelPos     .14       entPhysicalAlias
.7        entPhysicalName             .15       entPhysicalAssetID
.8        entPhysicalHardwareRev      .16       entPhysicalIsFRU

In this case all the sub-objects under .2 contain human-readable descriptions of the various components. What you need to do now is go through the complete list of descriptions and pick those elements that you want to access remotely through SNMP. You will see that each entry has a number behind the .2. Each of these numbers is the unique component identifier within the system, meaning that we are lucky enough to have the same identifier within other parts of the SNMP tree.


An example

$ snmpwalk -c public localhost .1.3.6.1.2.1.47.1.1.1.1.2 | grep Core

SNMPv2-SMI::mib-2.47.1.1.1.1.2.98 = STRING: "CPU 0 Core Temperature Monitor"

SNMPv2-SMI::mib-2.47.1.1.1.1.2.100 = STRING: "CPU 1 Core Temperature Monitor"

SNMPv2-SMI::mib-2.47.1.1.1.1.2.102 = STRING: "CPU 2 Core Temperature Monitor"

SNMPv2-SMI::mib-2.47.1.1.1.1.2.104 = STRING: "CPU 3 Core Temperature Monitor"

$ snmpwalk -c public localhost .1.3.6.1.2.1.47.1.1.1.1 | grep "\.98 ="

SNMPv2-SMI::mib-2.47.1.1.1.1.2.98 = STRING: "CPU 0 Core Temperature Monitor"

SNMPv2-SMI::mib-2.47.1.1.1.1.3.98 = OID: SNMPv2-SMI::zeroDotZero

SNMPv2-SMI::mib-2.47.1.1.1.1.4.98 = INTEGER: 94

SNMPv2-SMI::mib-2.47.1.1.1.1.5.98 = INTEGER: 8

SNMPv2-SMI::mib-2.47.1.1.1.1.6.98 = INTEGER: -1

SNMPv2-SMI::mib-2.47.1.1.1.1.7.98 = STRING: "040349/adbs04:CH/C0/P0/T_CORE"

SNMPv2-SMI::mib-2.47.1.1.1.1.8.98 = ""

SNMPv2-SMI::mib-2.47.1.1.1.1.9.98 = ""

SNMPv2-SMI::mib-2.47.1.1.1.1.10.98 = ""

SNMPv2-SMI::mib-2.47.1.1.1.1.11.98 = ""

SNMPv2-SMI::mib-2.47.1.1.1.1.12.98 = ""

SNMPv2-SMI::mib-2.47.1.1.1.1.13.98 = ""

SNMPv2-SMI::mib-2.47.1.1.1.1.14.98 = ""

SNMPv2-SMI::mib-2.47.1.1.1.1.15.98 = ""

SNMPv2-SMI::mib-2.47.1.1.1.1.16.98 = INTEGER: 2


Getting some useful data

Aside from the fact that the sub-OID we have found for our object is used in other parts of the tree, there's another parameter that makes a return. The character string in .7 is reused in the Sun MIB as well, as you will see in a moment.

Let's see what happens when we take our sub-OID .98 to the SUN MIB tree...

$ snmpwalk -c public localhost .1.3.6.1.4.1.42.2.70.101.1.1 | grep "\.98 ="

SNMPv2-SMI::enterprises.42.2.70.101.1.1.2.1.1.98 = INTEGER: 2

SNMPv2-SMI::enterprises.42.2.70.101.1.1.2.1.2.98 = INTEGER: 2

SNMPv2-SMI::enterprises.42.2.70.101.1.1.2.1.3.98 = INTEGER: 7

SNMPv2-SMI::enterprises.42.2.70.101.1.1.2.1.4.98 = INTEGER: 2

SNMPv2-SMI::enterprises.42.2.70.101.1.1.2.1.5.98 = STRING: "040349/adbs04:CH/C0/P0"

SNMPv2-SMI::enterprises.42.2.70.101.1.1.6.1.1.98 = INTEGER: 2

SNMPv2-SMI::enterprises.42.2.70.101.1.1.6.1.2.98 = INTEGER: 3

SNMPv2-SMI::enterprises.42.2.70.101.1.1.6.1.3.98 = Gauge32: 60000

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.1.98 = INTEGER: 3

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.2.98 = INTEGER: 0

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.3.98 = INTEGER: 1

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.4.98 = INTEGER: 41

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.5.98 = INTEGER: 0

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.6.98 = INTEGER: 0

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.7.98 = INTEGER: 0

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.8.98 = INTEGER: 0

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.9.98 = INTEGER: 97

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.10.98 = INTEGER: -10

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.11.98 = INTEGER: 102

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.12.98 = INTEGER: -20

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.13.98 = INTEGER: 120

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.14.98 = Gauge32: 0

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.15.98 = Hex-STRING: FC

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.16.98 = INTEGER: 1

Take a look at 2.1.5.98... Look familiar? At least now you're sure that you're reading the right sub-object :) The list in the example above looks quite complicated, but there's a little help in the shape of a .PDF I once made. This .PDF shows the basic structure of the objects inside enterprises.42.2.70.101.1.1.

You should immediately notice though that the returns of the command are divided into three groups: ...101.1.1.2, ...101.1.1.6 and ...101.1.1.8. Matching these groups up to the .PDF you'll see that these groups are respectively sunPlatEquipmentTable (which is an expansion on the information from MIB-2), sunPlatSensorTable (which contains a description of the sensor in question) and sunPlatNumericSensorTable (which contains all kinds of real-life values pertaining to the sensor).

In this case the most interesting sub-OID is enterprises.42.2.70.101.1.1.8.1.4.98, sunPlatNumericSensorCurrent, which obviously contains the current value of the sensor readings. Putting things into perspective this means that the core temperature of CPU0 at the time of the snmpwalk was 41 degrees centigrade.


Going on from there

So... Now you know how to find out which components a system knows about, which sub-OID identifies each component, and where to read that component's actual sensor values.

You can now do loads of things! For example, you can use your monitoring software to verify that certain values don't exceed a set limit. You wouldn't want your CPUs to get hotter than 65 degrees now, do you?
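As a crude sketch of what such a check could look like from the shell (the OID is the CPU0 core temperature from the example above; the 65-degree threshold is arbitrary):

$ snmpget -c public $host .1.3.6.1.4.1.42.2.70.101.1.1.8.1.4.98 | awk '{ if ($NF > 65) print "CRITICAL: CPU0 at " $NF " degrees"; else print "OK: CPU0 at " $NF " degrees" }'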



A closer look at SUN-PLATFORM-MIB

2008-01-01 00:00:00

For some reason unknown to me Sun has always kept their MIB file rather closed and hard to find. There's no place you can actually download the file. You will have to extract the file from the SUNWmasf package if you want to take a look at it.

To help us sysadmins out I've published the file over here. I do not claim ownership of the file in any way. Sun has the sole copyright of the file. I just put it here, so people can easily read through the file.



Combining net-SNMP with Dell Open Manage and HP Insight Manager

2008-01-01 00:00:00

Monitoring Dell and HP systems through SNMP is as big a puzzle as using SNMP on Sun Microsystems' boxen. Luckily I've come a long way into figuring out how to use Net-SNMP together with HP's SIM and Dell's OpenManage.

Just like with our Solaris boxen, we want to use the Net-SNMP daemon as the main daemon on our Linux systems. At $CLIENT we use Red Hat ES3 on a great variety of Dell and HP hardware. And as was the case with SUNWmasf on Solaris, we're going to need both Dell's and HP's custom SNMP agents to monitor our hardware-specific SNMP objects. Enter SIM and OpenManage. In the next few paragraphs I'll tell you all about installing and configuring the whole deal.

Naturally it would be great if you could package all of these files into one nice .RPM, since that'll make the whole installation process a snap. Especially if you want to roll it out across hundreds of servers. I'll be making such a package for $CLIENT, but unfortunately I cannot distribute it (which is logical, what with all the proprietary info that goes into the package). Maybe some day I'll make a generic .RPM which you guys can use.


Installing HP SIM and its components.

Just like everyone else, HP chooses to hide the installer for their SNMP agent quite deeply within their website. You will need to go to their download site and browse to the software section for your model of server. Once there you choose "Download drivers and software" and you pick your Linux flavour (in our case RHEL3). From there go to "Software - Systems management", where you can finally choose "A Collection of SNMP Protocol Tools from Net-SNMP for $YOUR_FLAVOUR". *phew* To help you get there, here's the direct link to the RHES3 version of the package.

As the file name (net-snmp-cmaX-5.1.2) suggests, this package is a modified version of the net-SNMP daemon with added support for a whole bunch of Compaq and HP stuff. But as you can see, the version of net-SNMP used is way behind today's standards, so it's wisest to run this daemon proxied behind a more current version of net-SNMP. The crappy thing though is that HP's package installs its net-SNMP in exactly the same locations as our own net-SNMP. Don't worry, we'll get to that.

The download page doesn't make this immediately clear, but you'll need to download five (or six, if you want the source) files. For your convenience, HP has decided to put all files into a pull-down menu with one "Download" button. Yes, very handy indeed. =_= Another neat thing is that, for some reason, the Safari+RealPlayer combination decides that -it- needs to open the downloaded .RPM file. Very odd; I've never encountered this before with other RPMs.

Because we're going to use two versions of net-SNMP that use the same locations on your hard drive, we're going to have to fiddle around a bit.

First copy these two RPMs to your system: net-snmp-cmaX and net-snmp-cmaX-libs. Install them using RPM, starting with libs and ending with the basic package. Now do the following.

$ cd /usr/sbin
$ sudo mv snmpd HPsnmpd
$ sudo mv snmptrapd HPsnmptrapd
$ cd /etc
$ sudo ln -s ./snmpd.conf ./HPsnmpd.conf
$ cd /etc/rc.d/init.d
$ sudo mv snmpd HPsnmpd
$ sudo mv snmptrapd HPsnmptrapd
$ cd /etc/logrotate.d
$ sudo mv snmpd HPsnmpd

You've now made sure that all parts that are required for the HP SNMP agent are safe from being overwritten by the "real" net-SNMP.

You can now install net-SNMP using the instructions laid out in the following paragraph.


Re-compiling Net-SNMP

PLEASE NOTE: If you're going to use HP SIM, please install that -first- before proceeding. See the previous paragraphs for details.

Basically, recompiling Net-SNMP for your Linux install follows the same procedure as the recompilation on Solaris.

  1. Download the source code for Net-SNMP version 5.2.3 (or a newer version, if you wish) from their website.
  2. Move the .TGZ to your build system and unpack it in your regular build location. Also, building Net-SNMP successfully requires OpenSSL 0.9.7g or higher, so make sure that it's installed on your build system.
  3. Run the configure script with the following options:

    --with-mib-modules="host disman/event-mib ucd-snmp/diskio smux agentx disman/event-mib ucd-snmp/lmSensors" --with-perl-module

  4. Run "make", "make test" and "make install" to complete the creation of Net-SNMP.

  5. After "make install" has finished all the Net-SNMP files have been installed on your build system. Naturally it's important to know which files to include in your package. I will make a full listing of all files RSN(tm)..

Installing Dell OpenManage and its components.

I had a hard time finding the installer files for Dell OM on Dell's download site, until I finally figured out how their "logic" works. :D You can get Dell OM 4.5 for Linux through this direct link (which can be changed at any time by Dell), or you can search their downloads page using the term "openmanage server agent". Adding the keyword "linux" seems to confuse it though, so you're going to have to manually search through the list.

Unfortunately I never did get around to using Dell OpenManage, so I cannot give you the installation instructions ;_;


Configuring HP-SIM

The configuration file for HP's version of net-SNMP is stored in /etc/snmp, unlike the version that'll be used by our own net-SNMP. Edit HP's config file and remove all the current content. Replace it with the following:

rocommunity public 0.0.0.0
agentaddress 1162
pass .1.3.6.1.4.1.4413.4.1 /usr/bin/ucd5820stat

You will not have to make any further changes. The init-script and such can remain unchanged.


Configuring Dell OpenManage

Again, unfortunately I cannot give you instructions on working with OpenManage, since I ran out of time. Going by the HP setup above, its SNMP agent would presumably need a similar configuration, listening on its own port:

rocommunity public 0.0.0.0
agentaddress 1163


Configuring Net-SNMP

The configuration file for Net-SNMP is located in /usr/local/share/snmp. You will need to make a whole bunch of changes over here that I won't cover, like security ACLs, SNMP trap hosts and bunches of other stuff. However, you _will_ need to add the following lines to allow Net-SNMP to talk to HP SIM and/or OpenManage.

# Pass requests to HP SIM

proxy -c public localhost:1162 .1.3.6.1.4.1.232

# Pass requests to Dell OpenManage

proxy -c public localhost:1163 .1.3.6.1.4.1.674
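Once both daemons are up you can verify that the proxying works by walking an HP OID through the master agent on its default port; if all is well, the request gets handed off to HPsnmpd on port 1162 behind the scenes (community string is just an example):

snmpwalk -c public localhost .1.3.6.1.4.1.232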


Starting the software

Make sure that you start Net-SNMP before OpenManage or SIM. These sub-agents rely on Net-SNMP to be running, so that one needs to go first. Take care of this order using the RC scripts of your particular Linux flavour.
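On a Red Hat flavour that ordering comes down to the S-numbers of the symlinks in the runlevel directories. Purely as an illustration (the numbers here are made up, pick whatever fits your system):

# Give the Net-SNMP master agent a lower start number than the HP sub-agent,
# so it is started first when entering runlevel 3.
ln -s ../init.d/snmpd   /etc/rc.d/rc3.d/S50snmpd
ln -s ../init.d/HPsnmpd /etc/rc.d/rc3.d/S90HPsnmpd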



kilala.nl tags: , , ,

View or add comments (curr. 0)

Interesting HP SIM (Insight Manager) SNMP objects

2008-01-01 00:00:00

In my mini-howto about monitoring HP and Dell specific SNMP objects through Net-SNMP I referred to a few interesting objects which could be read through their respective SNMP agents. This page covers the interesting objects for HP Compaq systems.

Right now I've only got a very limited number of different models to test all this stuff on, so bear with me :) The following lists are only a small selection of all available objects, the ones that we found interesting. A full list of available options can be obtained by running:

snmpwalk -c public localhost .1.3.6.1.4.1.232

I've tried my best at making the more interesting parts of the HP and Dell MIBs legible. The results can be found in the PDF, in the menu on the left. But once again, these lists are only a small subset of the complete MIB for both vendors. You won't know all that's available to you unless you start digging through the flat .TXT files yourself. Unlike Sun, HP and Dell -do- publish their MIB files freely, so you'll have no trouble finding them on the web.

I've also expanded on the HP SIM MIB a little in a PDF document. Get it over here.


On the monitoring of disks.

Unfortunately, HP and Compaq have made it impossible to monitor hard disk statuses without add-on software. The plain vanilla SNMP agent has no way of filling the relevant objects. Instead it requires the CPQarrayd add-on.

If you do choose to install this piece of software, you can find all the objects regarding -internal- drives under OID .1.3.6.1.4.1.232.3.2.5.2 (cpqDaPhyDrvErrTable). Refer to CPQIDA.MIB.txt for all relevant details and a full listing of the appropriate OIDs.

Currently I have no way of making sure, but I assume that the alert message for HDD[0-7] can be found in .1.3.6.1.4.1.232.3.2.5.2.1.15.[0-7]. Any value above 0 indicates a failure.
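If that assumption holds, checking the first drive would be as simple as the line below; I haven't been able to verify this, so treat it as a sketch:

# Read the (assumed) alert value for HDD0; anything above 0 would mean trouble.
snmpget -v1 -c public -Ovq localhost .1.3.6.1.4.1.232.3.2.5.2.1.15.0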


Basic Object Identifiers

All object IDs below fit under .1.3.6.1.4.1.232. These objects should be usable on every HP system in the DL/ML range, although I have only tested them on the DL380, DL385, DL580 and ML570.

Object              Description             Values
.1.2.2.1.1.6.OID    CPU[0-3] status         1/2 = ok, 3 = warn, 4 = crit
.3.2.2.1.1.6.OID    HDD controller status   1/2 = ok, 3 = warn, 4 = crit
.3.2.3.1.1.11.OID   LDD[0-X] status         1/2 = ok, 3 = warn, 4 = crit
.3.2.4.1.1.6.OID    Hot spare HDD status    > 2 = crit
.3.2.5.1.1.37.OID   HDD[0-X] status         1/2 = ok, 3 = warn, 4 = crit
.5.2.2.1.1.12.OID   SCSI controller status  1/2 = ok, 3 = warn, 4 = crit
.5.2.3.1.1.8.OID    SCSI LDD[0-X] status    1/2 = ok, 3 = warn, 4 = crit
.5.2.4.1.1.26.OID   SCSI HDD[0-X] status    1/2 = ok, 3 = warn, 4 = crit
.6.2.6.7.1.9.OID    Fan status              1/2 = ok, 3 = warn, 4 = crit
.6.2.6.8.1.4.1      CPU0 temperature        Contains current temperature
.6.2.6.8.1.4.4      CPU1 temperature        Contains current temperature
.6.2.6.8.1.4.5      PSU temperature         Contains current temperature
.6.2.9.3.1.4.0.OID  PSU[0-X] status         1/2 = ok, 3 = warn, 4 = crit
.14.2.2.1.1.5.OID   IDE HDD[0-X] status     1/2 = ok, 3 = warn, 4 = crit


Fan and sensor placement

As I already said, most of the OIDs from the table above can be used to monitor vanilla HP systems (with the exception of the hard disks). The biggest difference lies in the placement of certain fans and sensors. The tables below outline the various locations, depending on the model.

Each system contains multiple fans and temperature sensors and will thus have multiple instances of these objects in its SNMP tree. The locations for each of these instances can be read from .6.2.6.7.1.3.OID (fans) and .6.2.6.8.1.3.OID (temperature sensors). The $OID part of these numeric sequences is always .1.1, .1.2, .1.3, .1.4 and so on.
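To see those placement labels on a live box, just walk the two location columns (community string and host are examples):

# Fan locations:
snmpwalk -c public localhost .1.3.6.1.4.1.232.6.2.6.7.1.3
# Temperature sensor locations:
snmpwalk -c public localhost .1.3.6.1.4.1.232.6.2.6.8.1.3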

Fan     DL380      DL385      DL580      ML570
.1.1    CPU        CPU        System     ?
.1.2    CPU        CPU        System     ?
.1.3    IO Board   IO Board   System     ?
.1.4    IO Board   IO Board   System     ?
.1.5    CPU        CPU        IO Board   ?
.1.6    CPU        CPU        IO Board   ?
.1.7    PSU        PSU        -          ?
.1.8    PSU        PSU        -          ?

Sensor  DL380      DL385      DL580      ML570
.1.1    CPU        CPU        CPU        ?
.1.2    CPU        IO Board   CPU        ?
.1.3    IO Board   CPU        CPU        ?
.1.4    CPU        CPU        CPU        ?
.1.5    PSU        PSU        IO Board   ?
.1.6    -          -          Ambient    ?
.1.7    -          -          System     ?



kilala.nl tags: , , ,

View or add comments (curr. 0)

SAN data migration scripts

2008-01-01 00:00:00

It's been way too long since I used these scripts. I believe they stem from 2003.

I'll need to write some more about them later. For now, know that I used these scripts to prepare for data migrations between local systems and SAN boxen. We moved from local storage to EMC2, then from EMC2 to HP XP-1024.


kilala.nl tags: , , ,

View or add comments (curr. 0)

PHP-Syslog update: admin mode

2008-01-01 00:00:00

At $CLIENT I've built a centralised logging environment based on Syslog-ng, combined with MySQL. To make anything useful out of all the data going into the database we use PHP-syslog-ng. However, I've found a bit of a flaw with that software: any account you create has the ability to add, remove or change other accounts... Which kinda makes things insecure.

So yesterday was spent teaching myself PHP and MySQL to such a degree that I'd be able to modify the guy's source code. In the end I managed to bolt on some sort of "admin-mode" which allows you to set an "admin" flag on certain user accounts (thus giving them the capabilities mentioned above).

The updated PHP files can be found in the TAR-ball in the menu of the Sysadmin section. The only thing you'll need to do to make things work is to either:

  1. Re-create your databases using the dbsetup.sql script.
  2. Add the "admin" column to the "users" table using the following command. ALTER TABLE users ADD COLUMN baka BOOLEAN;

The update is available as a .TAR file. Get it over here.


kilala.nl tags: , ,

View or add comments (curr. 0)

The scope of variables in shell scripts

2008-01-01 00:00:00

Just today I ran into something shiny that piqued my interest. A shell script I'd written in Bash didn't work like I expected it to, with regards to the scope of a variable. I thought the incident was interesting enough to report, although I won't go into the whole scoping story too deeply.

What it basically boils down to is that there was a difference in the way two shells handle a certain situation. A difference that I didn't expect to be there. Not that exciting, but still very educational.


Scope?

Yeah. In most programming languages variables have a certain range within your program, within which they can be used. Some variables only exist within one subroutine, while others exist across the whole program or even across multiple parts of the whole.

In shell scripting things aren't that complicated, luckily. In most cases a variable that's set in one part of the script can be used in every other part of the script. There are some notable exceptions, one of which I ran into today without realising it.


The real code

My situation:

I have a command that outputs a number of lines, some of which I need. The lines that I'm interested in consist of various fields, two of which I need as variables. Depending on the value of one of these variables, a counter needs to be incremented.

I guess that sounds kinda complicated, so here's the real code snippet:

function check_transport_paths
{
    # Total number of transport paths known to Sun Cluster.
    TOTAL=`scstat -W | grep "Transport path:" | wc -l`
    let COUNT=0

    # Count every transport path whose status reads "online".
    scstat -W | grep "Transport path:" | awk '{print $3" "$6}' | while read PATH STATUS
    do
        if [ $STATUS == "online" ]
        then
            let COUNT=$COUNT+1
        fi
    done

    if [ $COUNT -lt 1 ]
    then
        echo "NOK - No transport paths online."
        exit $STATE_CRITICAL
    elif [ $COUNT -lt $TOTAL ]
    then
        echo "NOK - One or more transport paths offline."
        exit $STATE_WARNING
    fi
}


Where it goes wrong

While testing my script, I found out that $COUNT would never retain the value it gained in the while-loop. This of course led to the script always failing the check. After some fiddling about, I found out that the problem lay in the use of the while loop: it was being used at the end of a pipe.

To illustrate, the following -does- work.

let COUNT=0
while read i
do
    let COUNT=$COUNT+$i
    echo $COUNT
done

echo "Total is $COUNT."

This leads to the following output.

$ ./baka.sh
1
1
2
3
3
6
4
10
^D
Total is 10.

However, if I were to create a script called neko.sh that outputs the numbers one through four on separate lines, which is then used in baka.sh... well... it doesn't work :D Regardez!

let COUNT=0
./neko.sh | while read i
do
    let COUNT=$COUNT+$i
    echo $COUNT
done

echo "Total is $COUNT."

This gives the following output.

1
3
6
10
Total is 0.

Conclusions

After discussing the matter with two of my colleagues (one of them as puzzled as I was, and the other knowing what was going wrong) we came to the following conclusion: a while loop at the end of a pipe runs in a sub-shell of its own, so any value a variable gains inside the loop is lost the moment the loop finishes.

This conclusion is supported by an example in the "Advanced Bash-scripting guide" by Mendel Cooper, in which an additional comment is made about the scoping of variables in redirected while loops. The comment warns that older shells branch a redirected while into a sub-shell, but also tells us that newer versions of Bash and Ksh handle this properly.

I guess our version of Bash is too old :3

Work around
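The gist of the workaround is to keep the while loop out of the pipeline, so that it runs in the current shell. A portable sketch using a temporary file (newer shells also offer process substitution for the same thing):

let COUNT=0
./neko.sh > /tmp/neko.$$     # stash the output in a temporary file
while read i
do
    let COUNT=$COUNT+$i
done < /tmp/neko.$$          # input redirection instead of a pipe
rm /tmp/neko.$$

echo "Total is $COUNT."      # now really prints "Total is 10."

Reading neko.sh's output through input redirection means no sub-shell gets branched, so $COUNT survives the loop.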

A word of thanks

I'd like to thank my colleagues Dennis Roos and Tom Scholten for spending a spare hour with me, hacking at this problem. And I'd like to thank Ondrej Jombik for pointing out the fact that this article didn't make my conclusions very clear in its original version.


kilala.nl tags: , , ,

View or add comments (curr. 28)

Installing additional locales on Tru64

2007-11-28 10:48:00

Wow, that was a fight :/

A few days ago we had a "new" TruCluster installed, running Tru64 5.1b. All of the stuff on it was plain vanilla, which meant that we were bound to run into some trouble. Case in point: the EMC/Legato Networker installation.

Upon installation setld complained as follows:

==========

Your choice:

1 LGTOCLNT999 EMC NetWorker Client

cannot be installed as required subset IOSWWEURLOC??? is not available.

==========

As the name suggests (EURLOC) the missing files involve the additional European locales that are not part of the default installation.

After fighting and searching and swearing a lot I got things sorted out as follows:

1. Get the Tru64 CD-ROM that was used for the installation. You'll need the "Associated Products 1" CD.

2. Insert the CD into your system.

3. Mount the CD: mount -r /dev/disk/cdrom1c /mnt

4. cd /mnt/Worldwide_Language_Support/kit

5. setld -l `pwd` IOSWWEURLOC540

This will install the locale I needed. Of course you are free to substitute the names of other locales as well.

EDIT:

Also, feel free to read through the proper instructions.


kilala.nl tags: , , ,

View or add comments (curr. 1)

Sometimes clusters do not guarantee high uptime

2007-10-20 13:36:00

Oh me, oh my... Clustering software does not always guarantee high uptime :/

At $CLIENT we've been having some nasty problems with our development SAP box. The box is part of a Veritas cluster and actually runs a bunch of Solaris Zones. The problems originally started about two months ago when we ran into a rare and newly discovered bug in UFS. It took a while for us to get the proper patches, but we finally managed to get that sorted out.

Remco installed the patches on Thursday morning, though he ran into some trouble. As always, patches can give you crap when it comes to cross-dependencies and this time wasn't any different. Around lunch time we thought we had things sorted out and went for the final reboot. All the zones were transferred to the proper boxen and things looked okay.

Until we tried to make a network connection. D:

None of the zones had access to the network, even though their interfaces were up and running. We searched for hours, but couldn't find anything. And like us, Sun was in the dark as well. In the end Remco and Sun worked all night to get an answer. Unfortunately they didn't make it, so I took over in the morning. Lemme tell you, once I was in the middle of all the tech and the phone calls and the managers, I found some more respect for Remco. He did a great job all through Thursday!

Just before lunch both Sun and one of the other guys came up with the solution. That was an awesome coincidence :) Turns out that the problems we were having are caused by timing issues during the boot-up of the Solaris Zones. Because we let Veritas Cluster handle the network interfaces, things turned sour. It would've worked better if we'd let the Zone framework handle them.

The stopgap solution: freeze all cluster resources to prevent fail-over, then manually restart all virtual interfaces for the zones. And presto! It works again!

Happily we went to lunch, only to come back to more crap!

Turns out that the five SAP instances we were running wouldn't fit into the available swap space anymore. Weird! Before yesterday, things would barely fit in the 30GB of swap space. And now all of a sudden SAP would eat about 38GB! o_O WTF?!

A whole bunch of managers wanted us to work through the whole weekend to sort everything out. Naturally we didn't feel too enthused, especially since the box's SLA doesn't even cover weekend work.

In the end we tacked on some temporary swap space, started SAP and left for the weekend. We'll just have to accept some more downtime on Monday. It also leaves us with two big things to fix:

1. Modify the cluster/zone config for the network interfaces.

2. Find out why SAP has grown gluttonous and fix it.


kilala.nl tags: , ,

View or add comments (curr. 1)

Tivoli script: check_ntpconfig.sh

2007-08-30 11:46:00

This script was written at the time I was hired by T-Systems.

This script is an evolution of my earlier check_ntp_config. This time it's meant for use with Tivoli, although modifying it for use with Nagios is trivial. The script was written to be usable on at least five different Unices, though I've been having trouble with Darwin/OS X.

The script was tested on Red Hat Linux, Tru64, HP-UX, AIX and Solaris. Only Darwin seems to have problems.

Just like my other recent Nagios scripts, check_ntpconfig.sh comes with a debugging option. Set $DEBUG at the top of the file to anything larger than zero and the script will dump information at various stages of its execution.
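For illustration, here's roughly what a run looks like on a Solaris box where everything checks out (the script only prints the numeric state; all the chatter ends up in the debug file):

$ ./check_ntpconfig.sh
0
$ head -3 /tmp/thomas-debug.txt
=== SETUP ===
OS name is SunOS
CFGFILE is /etc/inet/ntp.conf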


#!/usr/bin/ksh
#
# NTP configuration check script for Tivoli.
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of T-Systems, CSS-CCTMO, the Netherlands
# Last Modified: 13-09-2007
# 
# Usage: ./check_ntp_config
#
# Description:
#   Well, there's not much to tell. We have no way of making sure that our 
# NTP clients are all configured in the right way, so I thought I'd make
# a Nagios check for it. ^_^ After that came this derivative Tivoli script.
#   You can change the NTP config at the top of this script, to match your
# own situation.
#
# Limitations:
#   This script should work fine on Solaris, HP-UX, AIX, Tru64 and some
# flavors of Linux. So far Darwin-compatibility has eluded me.
#
# Output:
#   If the NTP client config does not match what has been defined at the 
# top of this script, the script will echo $STATE_NOK. In this case, the 
# STATE variables contain a zero and a one, so you'll need to use a 
# "Numeric Script" monitor definition in Tivoli. Anything above zero is bad.
#
# Other notes:
#   If you ever run into problems with the script, set the DEBUG variable
# to 1. I'll need the output the script generates to do troubleshooting.
# See below for details.
#   I realise that all the debugging commands strewn throughout the script
# may make things a little harder to read. But in the end I'm sure it was
# well worth adding them. It makes troubleshooting so much easier. :3
#

### SETTING THINGS UP ###
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
PROGNAME="./check_ntp_config"
STATE_NOK="1"
STATE_OK="0"

. /opt/Tivoli/lcf/dat/dm_env.sh >/dev/null 2>&1


### DEFINING THE NTP CLIENT CONFIGURATION AS IT SHOULD BE ###
NTPSERVERS="192.168.22.7 192.168.25.7 192.168.16.7"


### DEBUGGING SETUP ###
# Cause you never know when you'll need to squash a bug or two
DEBUG="1"

if [[ $DEBUG -gt 0 ]]
then
        DEBUGFILE="/tmp/thomas-debug.txt"
	if [[ -f $DEBUGFILE ]]
	then
            rm $DEBUGFILE >/dev/null 2>&1
	    [[ $? -gt 0 ]] && echo "Removing old debug file failed."
	    touch $DEBUGFILE
	fi
fi


### REQUISITE COMMAND LINE STUFF ###

print_usage() {
	echo ""
	echo "Usage: $PROGNAME"
}

print_help() {
	echo ""
	echo "NTP client configuration monitor plugin for Tivoli."
	echo ""
	echo "This plugin not developped by IBM."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
	echo ""
	print_usage
	echo ""
}

while test -n "$1" 
do
	case "$1" in
	  *) print_help; exit $STATE_OK;;
	esac
done


### DEFINING SUBROUTINES ###

function SetupEnv
{
    case $(uname) in
	Linux) 	CFGFILE="/etc/ntp.conf"; 
		IPCMD="host" 
		IPMOD="tail -1"
		NAMEMOD="tail -1"
		IPFIELD="4"
		NAMEFIELD="5" 
		GREP="egrep -e" ;;
	SunOS) 	CFGFILE="/etc/inet/ntp.conf"
		IPCMD="getent hosts"
		IPMOD=""
		NAMEMOD=""
		IPFIELD="1"
		NAMEFIELD="2"
		GREP="egrep -e" ;;
	Darwin) CFGFILE="/etc/ntp.conf"
		IPCMD="host"
		IPMOD=""
		NAMEMOD=""
		IPFIELD="4"
		NAMEFIELD="1"
		GREP="egrep -e" ;;
	AIX)    CFGFILE="/etc/ntp.conf"
		IPCMD="host"
		IPMOD=""
		NAMEMOD=""
		IPFIELD="3"
		NAMEFIELD="1"
		GREP="egrep -e" ;;
	HP-UX)  CFGFILE="/etc/ntp.conf"
		IPCMD="nslookup"
		IPMOD="grep ^\"Address\""
		NAMEMOD="grep ^\"Name\""
		IPFIELD="2"
		NAMEFIELD="2"
		GREP="egrep -e" ;;
	OSF1)   CFGFILE="/etc/ntp.conf"
		IPCMD="nslookup"
		IPMOD="grep ^\"Address\" | tail -1"
		NAMEMOD="grep ^\"Name\" |tail -1"
		IPFIELD="2"
		NAMEFIELD="2"
		GREP="egrep -e" ;;
	*) echo "Sorry. OS not supported."; exit 1 ;;
    esac

    FAULT=0

    if [[ $DEBUG -gt 0 ]]
    then
	echo "=== SETUP ===" >> $DEBUGFILE
	echo "OS name is $(uname)" >> $DEBUGFILE
	echo "CFGFILE is $CFGFILE" >> $DEBUGFILE
	echo "IPCMD is $IPCMD" >> $DEBUGFILE
	echo "IPMOD is $IPMOD" >> $DEBUGFILE
	echo "NAMEMOD is $NAMEMOD" >> $DEBUGFILE
	echo "IPFIELD is $IPFIELD" >> $DEBUGFILE
	echo "NAMEFIELD is $NAMEFIELD" >> $DEBUGFILE
	echo "" >> $DEBUGFILE
	echo "NTPSERVERS is $NTPSERVERS" >> $DEBUGFILE
	echo "" >> $DEBUGFILE
    fi
} 

function ListInConf
{
    if [[ -z $NTPSERVERS ]]
    then
	echo "You haven't configured this monitor yet. Set \$NTPSERVERS."; exit 0
	[[ $DEBUG -gt 0 ]] && echo "NTPSERVERS variable not set." >> $DEBUGFILE
    else

    for HOST in $(echo $NTPSERVERS)
    do
    SKIPIP=0
    SKIPNAME=0

    if [[ $DEBUG -gt 0 ]]
    then
	echo "=== LISTINCONF ===" >> $DEBUGFILE
	echo "HOST is $HOST" >> $DEBUGFILE
	echo "" >> $DEBUGFILE
    fi

        if [[ -z $(echo $HOST | $GREP [a-z,A-Z]) ]]	    
        then
            IPADDRESS="$HOST"
	    TEST=$($IPCMD $HOST 2>/dev/null)

	    if [[ ( $? -eq 0 ) && ( -z $(echo $TEST | $GREP NXDOMAIN) ) ]] 
	    then
		[[ $DEBUG -gt 0 ]] && echo "TEST is $TEST" >> $DEBUGFILE
            	HOSTNAME=$($IPCMD $HOST 2>/dev/null | $NAMEMOD | cut -f$NAMEFIELD -d" " | cut -f1 -d.)
	    else
		[[ $DEBUG -gt 0 ]] && echo "TEST is $TEST" >> $DEBUGFILE
		HOSTNAME=""
	    fi

	    if [[ -z $HOSTNAME ]]
	    then
	    	QUERY="$IPADDRESS"
	    	[[ $DEBUG -gt 0 ]] && echo "Skipping hostname verification" >> $DEBUGFILE
	    else
	    	QUERY="$HOSTNAME $IPADDRESS"	
	    	[[ $DEBUG -gt 0 ]] && echo "Checking both IP and name." >> $DEBUGFILE
	    fi
        else
            HOSTNAME="$HOST"
	    TEST=$($IPCMD $HOST 2>/dev/null)

	    if [[ ( $? -eq 0 ) && ( -z $(echo $TEST | $GREP NXDOMAIN) ) ]] 
	    then
		[[ $DEBUG -gt 0 ]] && echo "TEST is $TEST" >> $DEBUGFILE
            	IPADDRESS=$($IPCMD $HOST 2>/dev/null | $IPMOD | cut -f$IPFIELD -d" ")
	    else
		[[ $DEBUG -gt 0 ]] && echo "TEST is $TEST" >> $DEBUGFILE
		IPADDRESS=""
	    fi

	    if [[ -z $IPADDRESS ]]
	    then
		QUERY="$HOSTNAME"
		[[ $DEBUG -gt 0 ]] && echo "Skipping IP address verification" >> $DEBUGFILE
	    else
		QUERY="$HOSTNAME $IPADDRESS"	
		[[ $DEBUG -gt 0 ]] && echo "Checking both IP and name." >> $DEBUGFILE
	    fi
        fi

    if [[ $DEBUG -gt 0 ]]
    then
	echo "IPADDRESS is $IPADDRESS" >> $DEBUGFILE
	echo "HOSTNAME is $HOSTNAME" >> $DEBUGFILE
	echo "" >> $DEBUGFILE
    fi

	for NAME in `echo $QUERY`
	do
       	    [[ -z $($GREP $NAME $CFGFILE | $GREP "server") ]] && let FAULT=$FAULT+1
	done

    done

    fi
}

function ConfInList
{
    NUMSERVERS=$($GREP ^"server" $CFGFILE | wc -l)

    if [[ $DEBUG -gt 0 ]]
    then
	echo "=== CONFINLIST ===" >> $DEBUGFILE
	echo "Number of \"server\" lines in $CFGFILE is $NUMSERVERS" >> $DEBUGFILE
	echo "" >> $DEBUGFILE
    fi

    if [[ $($GREP ^"server" $CFGFILE | wc -l) -gt 0 ]]
    then

	for HOST in $(cat $CFGFILE | $GREP ^"server" | awk '{print $2}')
	do
		if [[ $DEBUG -gt 0 ]]
		then
			echo "HOST is $HOST" >> $DEBUGFILE
			echo "" >> $DEBUGFILE
		fi
		if [[ -z $(echo $HOST | $GREP [a-z,A-Z]) ]]	    
		then
			IPADDRESS="$HOST"
	    		TEST=$($IPCMD $HOST 2>/dev/null)

			if [[ ( $? -eq 0 ) && ( -z $(echo $TEST | $GREP NXDOMAIN) ) ]] 
			then
			    [[ $DEBUG -gt 0 ]] && echo "TEST is $TEST" >> $DEBUGFILE
            		    HOSTNAME=$($IPCMD $HOST 2>/dev/null | $NAMEMOD | cut -f$NAMEFIELD -d" " | cut -f1 -d.)
			else
			    [[ $DEBUG -gt 0 ]] && echo "TEST is $TEST" >> $DEBUGFILE
			    HOSTNAME=""
	    		fi

			if [[ -z $HOSTNAME ]]
			then
			    QUERY="$IPADDRESS"
			    [[ $DEBUG -gt 0 ]] && echo "Skipping hostname verification" >> $DEBUGFILE
			else
			    QUERY="$HOSTNAME $IPADDRESS"	
			    [[ $DEBUG -gt 0 ]] && echo "Checking both IP and name." >> $DEBUGFILE
			fi
		else
			HOSTNAME="$HOST"
	    		TEST=$($IPCMD $HOST 2>/dev/null)

			if [[ ( $? -eq 0 ) && ( -z $(echo $TEST | $GREP NXDOMAIN) ) ]] 
			then
			    [[ $DEBUG -gt 0 ]] && echo "TEST is $TEST" >> $DEBUGFILE
            		    IPADDRESS=$($IPCMD $HOST 2>/dev/null | $IPMOD | cut -f$IPFIELD -d" ")
			else
			    [[ $DEBUG -gt 0 ]] && echo "TEST is $TEST" >> $DEBUGFILE
			    IPADDRESS=""
	    		fi

			if [[ -z $IPADDRESS ]]
			then
				QUERY="$HOSTNAME"
				[[ $DEBUG -gt 0 ]] && echo "Skipping IP address verification" >> $DEBUGFILE
			else
				QUERY="$HOSTNAME $IPADDRESS"	
				[[ $DEBUG -gt 0 ]] && echo "Checking both IP and name." >> $DEBUGFILE
			fi
		fi

		if [[ $DEBUG -gt 0 ]]
		then
			echo "IPADDRESS is $IPADDRESS" >> $DEBUGFILE
			echo "HOSTNAME is $HOSTNAME" >> $DEBUGFILE
			echo "" >> $DEBUGFILE
		fi

		for NAME in `echo $QUERY`
		do
		    [[ -z $(echo $NTPSERVERS | $GREP $NAME) ]] && let FAULT=$FAULT+1
		done

	done
    fi
}

### FINALLY, THE MAIN ROUTINE ###

SetupEnv

    if [[ $DEBUG -gt 0 ]]
    then
	echo "=== STARTING MAIN PHASE ===" >> $DEBUGFILE
	echo "" >> $DEBUGFILE
	echo "=== NTP CONFIG FILE ===" >> $DEBUGFILE
	cat $CFGFILE | grep -v ^"\#" >> $DEBUGFILE
	echo "" >> $DEBUGFILE
	echo "" >> $DEBUGFILE
    fi

ListInConf
ConfInList

# Nothing caused us to exit early, so we're okay.
if [[ $FAULT -gt 0 ]]
then
    echo "$STATE_NOK"
    exit $STATE_NOK
else
    echo "$STATE_OK"
    exit $STATE_OK
fi

kilala.nl tags: , , ,

View or add comments (curr. 0)

Grappling with HP ServiceGuard

2007-08-01 15:26:00

Last night's planned change was supposed to last about two hours: get in, install some patches, switch some cluster resources around the nodes, install some more patches and get out. The fact that the installation involved a HP-UX system didn't get me down, even though we only work with Sun Solaris and Tru64. The fact that it involved a ServiceGuard cluster did make me a little apprehensive, but I felt confident that the procedures $CLIENT had supplied me would suffice.

Everything went great, until the 80% mark... Failing the applications back over to their original node went wrong for some reason and the cluster ended up in a wonky state. The cluster software told me everything was down, even though some of the software was obviously still running. The cluster wouldn't budge, not up, nor down. And that's when I found out that I rather dislike HP ServiceGuard, all because of one stupid flaw.

You see, all the other cluster software I know provides me with a proper listing of all the defined resources and their current state. Sun Cluster, Veritas Cluster Service and Tru Cluster? All of them are able to give me neat list of what's running where and why something went wrong. Well, not HP Damn ServiceGuard. Feh!

We ended up stopping the database manually and resetting all kinds of flags on the cluster software. Finally, after six hours (instead of the original two), I got off from work around 23:00. Yes... /me heartily dislikes HP ServiceGuard.


kilala.nl tags: , , , ,

View or add comments (curr. 1)

Sun Fire V890: pretty, but with a nasty flaw

2007-07-17 10:12:00

The ports section of the V890.

Oy vey! One of the folks on the Sun Fire V890 design team must've been mesjoge! Why else would you decide to make such a weird design decision?!

What's up? I'll tell you what's up!

For some reason the design team decided to throw out the RJ45 console port that's been a Sun standard for nigh on ten years. And what did they replace it with? A DB25 port commonly seen in the Mesozoic Era! Good lord! This left me stranded without the proper cable for this morning's installation (thankfully I could borrow one). However, it also requires us to get completely new and different cables for our Cyclades console server!

Bad Sun! How could you make such a silly decision?!


kilala.nl tags: , , , ,

View or add comments (curr. 5)

Open Solaris, like Linux but with bureaucrats

2007-06-15 18:23:00

Two things before I start this rant:

1. I'm not overly familiar with the OGB and the Open Solaris project's modus operandi. I'm going to bone up on those subjects tonight.

2. It seems that the dutch branch of the OS project doesn't even notice much of the OGB's dealings. When I asked one of "our" leading guys about some recent dealings he hadn't actually even heard of them yet.

Now... On with the show.

When it comes to the Open Solaris project I'm having mixed feelings. On the one hand Solaris and its step-sister Open Solaris are my favourite "true" UNIX and I really want to see the OS succeed. I feel at home in the OS, I admire the great improvements Sun and the community make to it and Solaris has almost never let me down (maybe on one or two occasions).

But then there's discussions such as these: a few members of the Open Solaris community propose to build an official binary distribution (dubbed Project Indiana) and they have executive backing from Sun. The first reply is a rather constructive one: it tells what's wrong with the proposal and why it won't be accepted (in its current form) by the OGB. But then the whole discussion derails with post upon post of bureaucracy, going back and forth about which rules should be applied to whom and what in which situations and at what times... Etc, etc...

While I'm all in favor of having strict project management and of handling your business in an organised and procedural manner, one can go too far. Linux has always felt a little bit too organic to me, although they do seem to get the job done in a rather good way. But the way the Open Solaris group works seems just way too convoluted to me. I hope that it's just a matter of streamlining things over the coming months/years and that things will loosen up a little by then.


kilala.nl tags: , , ,

View or add comments (curr. 0)

Fedora Core 6 image for Parallels Desktop

2007-06-11 16:47:00

Now that I've gotten my mitts on an Intel Macbook I've also started dabbling with Parallels Desktop, a piece of software that'll let you run a whole bunch of virtual machines inside Mac OS X. For my work it's rather handy to have a spare Solaris system lying around, so I went with the Solaris Express image that I mentioned a few weeks ago. And now that it's about time for me to get started on my LPIC-2 exam it's also handy to have at least one Linux at hand.

Enter a pre-installed and configured Fedora Core 6 image for Parallels. At only ~730MB in size that really isn't that bad. Saves me a lot of trouble as well.

Just be sure to set your RAM at 512 MB. Any higher is supposed to crash FC, according to this OS X hint.

EDIT:

Tried it with my last day of the Parallels demo. It works like a charm :)


kilala.nl tags: , , ,

View or add comments (curr. 6)

The passing of an era: Nagios

2007-05-20 19:05:00

Well, I have finally unsubscribed myself from the Nagios mailing lists. It was great being a member of those lists while I was working with the software on a daily basis, but these days I've put Nagios behind me. I haven't written one line of Nagios monitoring code for months now.

I'm sure I'll also be skipping this year's Nagios Konferenz unless a job involving monitoring comes up again.

Thanks Ethan, for making such great software freely available! All the best to you and maybe we'll meet again o/


kilala.nl tags: , , , ,

View or add comments (curr. 0)

TruCluster: an interesting performance problem

2007-05-11 11:24:00

The past two weeks we've been having a rather mysterious problem with one of our TruClusters.

During hardware maintenance of the B-node we moved all cluster resources to the A-node to remain up and running. Afterwards we let TruCluster balance all the resources so performance would benefit again. Sounds good so far and everything kept on working like it should.

However, during some nights the A-node would slow to a crawl, not responding to any commands and inputs. We were stumped, because we simply couldn't find the cause of the problem. The system wasn't overloaded, with a low load average. The CPU load was a bit remarkable, with 10% user, 50% system and the rest in idle. The network wasn't overloaded and there was no traffic corruption. None of the disks were overloaded, with just two disks seeing moderate to heavy use. It was a mystery and we asked HP to help us out.

After some analysis they found the cause of the problem :) Among the resources of one of the applications that had been failed over to the A-node were two file systems. After the balancing of resources these file systems stuck with the A-node, while the application moved back to the B-node. So now the A-node was serving I/O to the B-node through its cluster interconnect! This also explains the high System Land CPU load, since that was the kernel serving the I/O. :D

We'll be moving the file systems back to the B-node as well and we'll see whether that solves the issues. It probably will :)


kilala.nl tags: , , , ,

View or add comments (curr. 0)

Sun's Solaris Express image for Parallels Desktop

2007-04-27 14:32:00

Ever since Apple switched to Intel processors in their systems and Parallels came out with their Parallels Desktop software it's been possible to run Windows, Linux and other Unices inside virtual machines on your Mac. That's totally great, since it allows you to run various test systems without needing additional hardware!

A lot of people also got Solaris 10 to run in PD, although some ran into a little bit of trouble. Well, not anymore! Sun has created a pre-installed Solaris Express image for use with Parallels Desktop. This allows you to immediately get up and running with Solaris, without even having to go through any of the normal installation hoops.

I know what I'll be doing when I get my Macbook in ;)

Thanks to Ben Rockwood for pointing out this little gem.


kilala.nl tags: , , ,

View or add comments (curr. 0)

Cutting down on the use of pipes

2007-04-18 14:38:00

One of the obvious downsides to using a scripting language like ksh, as opposed to a "real" programming language like Perl or PHP (or C for that matter), is that for each command you string together you're forking off a new process.

This isn't much of a problem when your script isn't too convoluted or when your dataset isn't too large. However, when you start processing 40-50MB log files with multiple FOR loops containing a few IF statements for each line, then you start running into performance issues.

And as I'm running into just that I'm trying to find ways to cut down on the forking, which means getting rid of as many IFs and pipes as possible. Here's a few examples of what has worked for me so far...

Instead of running:

[ expr1 ] && command1

[ expr2 ] && command1

Run:

[ \( expr1 \) -a \( expr2 \) ] && command1

Why? Because if test works the way I expect it to, it'll die if the first expression is untrue, meaning that it won't even try the second expression. If you have multiple expressions that complement each other, then you ought to be able to fit them into a single set of parentheses after test, cutting down on more forks.

Instead of running:

if [ `echo $STRING | grep $QUERY | wc -l` -gt 0 ]; then

Run:

if [ ! -z "`echo $STRING | grep $QUERY`" ]; then

More ideas to follow soon. Maybe I ought to start learning a "real" programming language? :D

EDIT:

OMG! I can't believe that I've just learnt this now, after eight years in the field! When using the Korn shell use [[ expr ]] for your tests as opposed to [ expr ].

Why? Because the [ expr ] is a throw-back to Bourne shell compatibility that makes use of the external test binary, as opposed to the built-in test function. This should speed up things considerably!
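If you'd like to see the difference for yourself, a quick-and-dirty timing run in ksh will show it; the exact numbers will of course differ per system and shell version:

# Time 10,000 iterations through old-style [ ] ...
let i=0
time while [ $i -lt 10000 ]; do let i=$i+1; done

# ... and the same through the built-in [[ ]].
let i=0
time while [[ $i -lt 10000 ]]; do let i=$i+1; done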


kilala.nl tags: , , , ,

View or add comments (curr. 0)

On commenting and debugging your code

2007-04-16 16:38:00

When writing shell scripts for my customers I always try to be as clear as possible, allowing them to modify my code even long after I'm gone. In order to achieve this I usually provide a rather lengthy piece of opening comments, with comments added throughout the script for each subroutine and for every switch or command that may be unclear to the untrained eye.

In general I've found that it's best to have at least the following information in your opening blurb:

* Who made the program? When was it finalised? Who requested the script to be made? Where can the author be reached for questions?

* A "usage" line that shows the reader how to call the program and which parameters are at his disposal.

* A description of what the program actually does.

* Descriptions for each of the parameters and options that can be passed to the script.

* The limitations imposed upon the script. Which specific software is needed? What other requisites are there? What are the nasty little things that may pop up unexpectedly?

* What are the current bugs and faults? The so-called FIXMEs.

* A description of the input that the program takes.

* A description of the output that the program generates.

Equally important is the inclusion of debugging capabilities. Of course you can start adding "echo" lines at various, strategic points in the script when you run into problems, but it's oh-so-much nicer if they're already in there! Adding those new lines is usually a messy affair that can make your problems even worse :( I usually prepend the debugging commands with "[ $DEBUG -eq 1 ] &&", which allows me to turn the debugging on or off at the top of the script using one variable.
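For example, a bare-bones sketch of that pattern:

#!/usr/bin/ksh
DEBUG=1    # flip this one variable to turn all debugging output on or off

COUNT=42
[ $DEBUG -eq 1 ] && echo "DEBUG: COUNT is now $COUNT"
echo "Normal output goes here."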

And finally, for the more involved scripts, it's a great idea to write a small test suite. Build a script that actually takes the real script through its loops by automatically generating input and by introducing errors.

Two examples of scripts where I did all of this are check_suncluster and check_log3, with the new TEC-analysis.sh on its way in a few days.

So far, TEC-analysis.sh checks in at:

* 497 lines in total.

* 306 lines of actual code.

* 136 lines of comments.

* 55 lines of debugging code.

Approximately 39% of this script exists solely for the benefit of the reader and user.


kilala.nl tags: , , , ,

View or add comments (curr. 0)

w00t! Passed my LPIC-102!

2007-04-06 10:57:00

Yay! There wasn't much reason for my doubting :) I passed with a 690 score (on a 200-930 scale), which boils down to 87% of 73 questions answered correctly. Not bad... Not bad at all...

Next up: ITIL Foundations!


kilala.nl tags: , , , ,

View or add comments (curr. 7)

LPIC-102 summary

2007-04-03 23:41:00

The LPIC-102 summary is done. You can find it over here, or in the menu on the left. Enjoy!


kilala.nl tags: , , , , , ,

View or add comments (curr. 0)

Finally! I'm done!

2007-04-03 23:37:00

Calvin hard at work

Ruddy heck, what a day! All in all it took me around thirteen hours, but I've finally finished my LPIC-102 summary. 41 pages of Linuxy goodness, bound to drag me through the second part of my LPIC-1 exams.

Argh, now I'm off to bed. =_= *cough* Let's hope I don't get called for any stand-by work.


kilala.nl tags: , , , ,

View or add comments (curr. 2)

Parallelization in shell scripts

2007-03-13 15:05:00

Today I was working on a shell script that's supposed to process multiple text files in the exact same manner. Usually you can get through this by running a FOR-loop where the code inside the loop is repeated for each file in a sequential manner.

Since this would take a lot of time (going over 1e6 lines of text in multiple passes) I wondered whether it wouldn't be possible to run the contents of the FOR-loop in parallel. I rehashed my script into the following form:

subroutine()
{
    contents of old FOR-loop, using $FILE
}

for file in "list of files"
do
    FILE="$file"
    subroutine &
done

This will result in a new instance of your script for each file in the list. Got seven files to process? You'll end up with seven additional processes vying for the CPU's attention.

On average I've found that the performance of my shell script was improved by a factor of 2.5, going from ~40 lines per three seconds to ~100 lines. I was processing seven files in this case.

The only downside to this is that you're going to have to build in some additional code that prevents your shell script from running ahead while the subroutines are still busy in the background; the simplest form of it is shown below. What this code needs to be exactly depends on the stuff you're doing in the subroutine.
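The simplest safeguard is the shell's built-in wait, which blocks until all background children have finished:

for file in one.log two.log three.log    # example file list
do
    FILE="$file"
    subroutine &
done

wait    # don't continue until every backgrounded subroutine is done
echo "All files have been processed."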


kilala.nl tags: , , , ,

View or add comments (curr. 2)

Recovering a broken mirror in Tru64

2007-03-01 14:26:00

Today I faced the task of replacing a failing hard drive in one of our Tru64 boxen. The disk was part of a disk group being used to serve plain data (as opposed to being part of the boot mirror / rootdg), so the replacement should be rather simple.

After some poking about I came to the following procedure. Those in the know will recognize that it's very similar to how Veritas Volume Manager (VXVM) handles things. This is because Tru64 LSM is based on VXVM v2.

* voldiskadm -> option 4 -> list -> select the failing disk, this'll be used as $vmdisk below.

* voldisk list -> select the failing disk, this'll be used as $disk below.

* voldisk rm $disk

* Now replace the hard drive.

* hwmgr -show scsi -> take a note of your current set of disks.

* hwmgr -scan scsi

* hwmgr -show scsi -> the replaced disk should show up as a new disk at the bottom of the list. This'll be used as $newdisk below.

* dsfmgr -e $newdisk $disk

* disklabel -rw $disk

* voldisk list -> $disk should be labeled as "unknown" again.

* voldiskadm -> option 5 -> $vmdisk -> $disk -> y -> y -> your VM disk should now be replaced.

* volrecover -g $diskgroup -sb

The remirroring process will now start for all broken mirrors. Unfortunately there is no way of tracking the actual progress. You can check whether the mirroring's still running with "volprint -ht -g $diskgroup | grep RECOV", but that's about it.
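If you'd rather not keep typing that by hand, a crude polling loop will tell you when the resync is done:

# Poll every minute until no more plexes are in the RECOV state.
while [ `volprint -ht -g $diskgroup | grep RECOV | wc -l` -gt 0 ]
do
    sleep 60
done
echo "Recovery of $diskgroup has finished."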


kilala.nl tags: , , , ,

View or add comments (curr. 2)

I've never liked HP-UX that much ...

2007-02-21 12:47:00

I've never been overly fond of HP-UX, mostly sticking to Solaris and Mac OS X, with a few outings here and there. Given the nature of one of my current projects however, I am forced to delve into HP's own flavour of Unix.

You see, I'm building a script that will retrieve all manner of information regarding firmware levels, driver versions and such, so we can start a network-wide upgrade of our SAN infrastructure. With most OSes I'm having a fairly easy time, but HP-UX takes the cake when it comes to being backwards :[

You see, if I want to find out the firmware level for a server running HP-UX I have two choices:

1. Reboot the system and check the firmware revision from the boot prompt.

2. Use the so-called Support Tools Manager utility, called [x,m,c]stm.

CSTM is the command line interface to STM and thank god it's scriptable. In reality the binary is a menu-driven CLI, but it takes an input file for your commands.

For those who would like to retrieve their firmware version automatically, here's how:

...

Uhm... FSCK! *growl* *snarl* What the heck is this?! For some screwed up reason my shell keeps on adding a NewLine char after the output of each command. That way a variable which gets its value from a string of commands will always be "$VALUE ". WTF?! o_O

I'm going to have to bang on this one a little more. More info later.


kilala.nl tags: , , , ,

View or add comments (curr. 0)

Why ZFS matters to the rest of us

2006-12-22 22:41:00

Thanks to a link on the MacFreak fora I stumbled onto a great blog post explaining why ZFS is actually a big deal. The article approaches ZFS from the normal user's angle and did a good job explaining to me why I should care about ZFS.

Real nice stuff and I'm greatly looking forward to Mac OS X.5 which includes ZFS.


kilala.nl tags: , , , ,

View or add comments (curr. 1)

As promised: adding a new LUN to Tru64

2006-12-22 09:00:00

As promised a few days ago, here's the quick description of how to add a new LUN to a Tru64 box. Instead of what I told you earlier, I thought I'd put it in a separate blog post. No need to edit the original one, since it's right below this one.

Adding a new LUN to a Tru64 box with TruCluster

1. Assign new LUN in the SAN fabric.

Not something I usually do.

2. Let the system search for new hardware.

hwmgr -scan scsi

3. Label the "disk".

disklabel -rw $DISK

4. Add the disk to a file domain (volume group).

mkfdmn $DISK $DOMAIN

5. Create a file set (logical volume).

mkfset $DOMAIN $FILESET

6. Create a file system.

Not required on Tru64. Done by the mkfset command.

7. Test mount.

Mount.

8. Add to fstab.

vi /etc/fstab

Also, if you want to make the new file system fail over with your clustered application, add the appropriate cfsmgr command to the stop/start script in /var/cluster/caa/bin.


kilala.nl tags: , , , ,

View or add comments (curr. 0)

Crash course in new OSes

2006-12-20 20:20:00

The past two weeks I've been learning new stuff at a very rapid pace, because my client uses only a few Solaris boxen and has no Linux whatsoever. So now I need to give myself a crash course in both AIX and Tru64 to do stuff that I used to do in a snap.

For example, there's adding a new SAN device to a box, so it can use it for a new file system. Luckily most of the steps that you need to take are the same on each platform. It's just that you need to use different commands and terms and that you can skip certain steps. The lists below show the instructions for creating a simple volume (no mirroring, striping, RAID tricks, whatever) on all three platforms.

Adding a new LUN to a Solaris box with SDS

1. Assign new LUN in the SAN fabric.

Not something I usually do.

2. Let the system search for new hardware.

devfsadm -C disks

3. Label the "disk".

format -> confirm label request

When using Solaris Volume Manager

4. Add the disk to the volume manager.

metainit -f $META 1 1 $DISK

5. Create a logical volume.

metainit $META -p $SOFTPART $SIZE

6. Create a filesystem

newfs /dev/md/rdsk/$META

7. Test mount.

mount $MOUNT

8. Add to fstab.

vi /etc/vfstab

When using Veritas Volume Manager

4. Let Veritas find the new disk.

vxdctl enable

5. Initialize the disk for VXVM usage and add it to a disk group.

vxdiskadm -> initialize

6. Create a new volume in the diskgroup.

Use the vxassist command.

7. Create a file system.

newfs /dev/vx/rdsk/$VOLUME

8. Test mount

mount $MOUNT

9. Add to vfstab

vi /etc/vfstab

Adding a new LUN to an AIX box with LVM

1. Assign new LUN in the SAN fabric.

Not something I usually do.

2. Let the system search for new hardware.

cfgmgr

3. Label the "disk".

Not required on AIX.

4. Add the disk to a volume group.

mkvg -y $VOLGRP -s 64 -S $DISK

5. Create a logical volume.

mklv -y $VOLNAME -t jfs2 -c1 $VOLGRP $SIZE

6. Create a filesystem

crfs -v jfs2 -d '$VOLNAME' -m '$MOUNT' -p 'rw' -a agblksize='4096' -a logname='INLINE'

7. Test mount

mount $MOUNT

8. Add to fstab.

vi /etc/filesystems

Adding a new LUN to a Tru64 box running TruCluster

I'll edit this post to add these instructions tomorrow, or on Friday. I still need to try them out on a live box ;)

Anywho. It's all pretty damn interesting and it's a blast having to almost instantly know stuff that's completely new to me. An absolute challenge! It's also given me a bunch of eye openers!

For example I've always thought it natural that, in order to make a file system switch between nodes in your cluster, you'd have to jump through a bunch of hoops to make it happen. Well, not so with TruCluster! Here, you add the LUN, go through the hoops described above and that's it! The OS automagically takes care of the rest. That took my brain a few minutes to process ^_^


kilala.nl tags: , , , ,

View or add comments (curr. 0)

Got my LPIC-101

2006-12-14 11:37:00

This morning I went to my local Prometric testing center for my LPI 101 exam (part one of two, for the LPIC-1). Beforehand I knew I wasn't perfectly prepared, since I'd skipped trial exams and hadn't studied that hard, so I was a little anxious. Only a little though, since I usually test quite well.

Anywho: out of a maximum of 890 points I got 660, with 500 points being the minimum passing grade. Read item 2.15 on this page to learn more about the weird scoring method used by the LPI. It boils down to this: out of 70 questions I got 61 correct, with a minimum of 42 to pass. If we'd use the scoring method Sun uses, I'd have gotten an 87%. Not too bad, I'd say!

I did run into two things that I was completely unprepared for. I'd like to mention them here, so you won't run into the same problem.

1. All the time, while preparing, I was told that I'd have to choose a specialization for my exam: either RPM or DPKG. Since I know more about RPM I had decided to solely focus on that subject. But lo and behold! Apparently LPI has _very_ recently changed their requisites for the LPIC-1 exams and now they cover _both_ package managers! D:

2. In total I answered 98 questions, instead of the 70 that were advertised. LPI mentions on their website (item 2.13) that these are test-questions, considered for inclusion in future exams. These questions are not marked as such and they do not count towards your scoring. It would've been nice if there had been some kind of screen or message warning me about this _at_the_test_site_.

Anywho... I made it, and now I'm on to the next step: LPIC-102.


kilala.nl tags: , , , , ,

View or add comments (curr. 0)

LPIC-101 Summary

2006-12-12 22:38:00

Version 1.0 of my LPIC-101 study notes is available. I bashed it together using the two books mentioned below. A word of caution though: this summary was made with my previous knowledge of Solaris and Linux in mind. This means that I'm skipping over a shitload of stuff that might still be interesting to others. Please only use my summary as something extra when studying for your own exam.

I'm up for my exam next Thursday, at ten in the morning. =_=;

Oh yeah... The books:

Ross Brunson - "Exam cram 2: LPIC 1", 0-7897-3127-4

Roderick W. Smith - "LPIC 1 study guide", 978-0-7821-4425-3


kilala.nl tags: , , , , , ,

View or add comments (curr. 0)

NLOSUG meeting

2006-10-25 23:38:00

Phew! That was a long night! I'm not used to staying up this late on weekdays =_=

I went to the first NLOSUG meeting tonight, like I said I would a few days ago. Aside from finally learning a little bit about Open Solaris (although most of it was basic community stuff) and some more in-depth stuff on ZFS, it was also very cool to meet some old acquaintances. There was a bunch of folks from Sun whom I hadn't seen in a long time, as well as Martijn and Job with whom I'd worked as colleagues a long time ago. Shiny :)

So the eve' was mostly for fun, with a little education thrown in. Well worth the hours I put in...


kilala.nl tags: , ,

View or add comments (curr. 3)

Using BSD hardware sensors with SNMP.

2006-10-25 09:05:00

Many thanks to my colleague Guldan who pointed me towards a website giving a short description of using the BSD hardware-sensors daemon, together with Nagios in order to monitor your hardware. Using sensord should make things a lot easier for people running BSD, as they won't have to muck about with SNMP OIDs and so on.


kilala.nl tags: , , , ,

View or add comments (curr. 0)

Open Solaris Users Group

2006-10-20 13:13:00

Sun has made arrangements for the inaugural meeting of the Dutch Open Solaris Users Group. The meeting will be held on the evening of Thursday the 26th, at their office in Amersfoort.

Aside from the stuff you'd expect (like a few lectures on new Solaris features) you could also say it'll be a fun evening :) Meet some new people, have some food'n'drinks all mixed in with some interesting work-related stuff.

I'm game :)


kilala.nl tags: , ,

View or add comments (curr. 0)

Dependency hell

2006-08-23 14:37:00

Damn! I'm really starting to hate Dependency Hell. Installing a few Nagios check scripts requires the Perl Net::SNMP module. This in turn requires three other modules. Each of these three modules requires three other modules, three of which require a C compiler on your system (which we naturally don't install on production systems). And neither can we use the port/emerge/apt-get-like Perl tools from CPAN, since (yet again) these are production systems. Augh!


kilala.nl tags: , , , , ,

View or add comments (curr. 0)

Building RPM packages

2006-08-10 13:48:00

While working on the $CLIENT-internal package for the Nagios client (net-SNMP + NRPE + Nagios scripts + Dell/HP SNMP agent), I've been learning about compound RPM packages. I.e., packages where you combine multiple source .TGZs into one big RPM package. This requires a little magic when it comes to running the various configure and make scripts. Luckily I've found two great examples.

* SPEC file for TCL, a short SPEC file that builds a package from two source .TGZs.

* SPEC file for MythTV, a -huge- SPEC file that builds multiple packages from multiple source .TGZs, along with a very dynamic set of configure rules.


kilala.nl tags: , , , ,

View or add comments (curr. 0)

Creating packages

2006-08-08 11:04:00

Recently I've been trying to learn how to build my own packages, both on Solaris and on Linux. I mean, using real packages to install your custom software is a much better approach than simply working with .TGZ files. In the process I've found two great tutorials/books:

* Maximum RPM, originally written as a book by one of Red Hat's employees.

* Creating Solaris packages, a short HOWTO by Mark.


kilala.nl tags: , , , ,

View or add comments (curr. 0)

SNMP = hard work

2006-08-01 17:23:00

Boy lemme tell ya: making a nice SNMP configuration so you can actually monitor something useful takes a lot of work! :) The menu on the left has been gradually expanding with more and more details regarding the monitoring of Solaris (and Sun hardware) through SNMP. Check'em out!


kilala.nl tags: , , , ,

View or add comments (curr. 0)

Nagios clients for UNIX/Linux

2006-07-27 13:01:00

I've added a small comparison between the various ways in which your Nagios server can communicate with its clients. It's in the menu on the left, or you can go there directly.


kilala.nl tags: , , , ,

View or add comments (curr. 0)

Using SNMP with Solaris and Sun hardware

2006-07-26 16:25:00

After digging through Sun's MIB description (see SUN-PLATFORM-MIB.txt) it became clear to me that things are a lot more convoluted than I originally expected. For example, each sensor in the Sun Fire systems leads to at least five objects, each describing another aspect of the sensor (name, value, expected value, unit, and so on). Unfortunately Sun has no (public) description of all possible SNMP sensor objects, so I've come to the following two conclusions:

1. I'll figure it all out myself. For each model that we're using I'll weasel out every possible sensor and all information relevant to these sensors.

2. I'll have to write my own check script for Nagios which deals with all the various permutations of sensor arrays in an appropriate fashion. Joy...

EDIT:

For your reference, Sun has released the following documents that pertain to their SNMP implementation. Mostly they're a slight expansion on the info from the MIB. At least they're much easier on the eyes when reading :p

* 817-2559-13

* 817-6832-10

* 817-6238-10

* 817-3000-10


kilala.nl tags: , , , ,

View or add comments (curr. 0)

Sun-platform-mib.txt

2006-07-25 09:34:00

Right now I'm working on getting my Sun systems properly monitored through SNMP. Using the LM_sensors module for Net-SNMP has gotten me quite far, but there's one drawback. A lot of Sun's internal counters use some really odd values that don't speak for themselves. This makes it necessary to read through Sun's own MIB and correlate the data in there with the stuff from LM_sensors.

Point is, Sun isn't very forthcoming with their MIB even though it should probably be public knowledge. Nowhere on the web can I find a copy of the file. The only way to get it is by extracting it from Sun's free SUNWmasfr package, which I have done: here's SUN-PLATFORM-MIB.txt

In no way am I claiming this file to be a product of mine and it definitely has Sun's copyright on it. I just thought I'd make the file a -little- bit more accessible through the Internet. If Sun objects, I'm sure they'll tell me :3


kilala.nl tags: , , , ,

View or add comments (curr. 0)

Fixes to check_log2 and check_log3

2006-06-19 15:11:00

Both check_log2 and check_log3 have been thoroughly debugged today. Finally. Thanks to both Kyle Tucker and Ali Khan for pointing out the mistakes I'd made. I also finally learned the importance of proper testing tools, so I wrote test_log2 and test_log3 which run the respective check scripts through all the possible states they can encounter.

Oh... check_ram was also -finally- modified to take the WARN and CRIT percentages through the command line. Shame on me for not doing that earlier.


kilala.nl tags: , , , ,

View or add comments (curr. 0)

Check_log3 is born

2006-06-01 14:53:00

Today I made an improved version of the Nagios monitor "check_log2", which is now aptly called "check_log3". Version 3 of this script gives you the option to add a second query to the monitor. The previous two incarnations of the script only allowed you to search for one query and would return a Critical if it was found. Now you can also add a query which will result in a Warning message. Goody!


kilala.nl tags: , , , ,

View or add comments (curr. 0)

Nagios script: check_ntp_config

2006-06-01 00:00:00

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

As far as I know there was no Nagios plugin that allowed you to really check your client configuration. I mean, it would be nice to know for sure that all your systems are syncing against the proper server... Wouldn't it?

The script was tested on Redhat ES3, Mac OS X and Solaris. Its basic requirement is the bash shell.

EDIT:

Oh! Just like my other recent Nagios scripts, check_ntp_config comes with a debugging option. Set $DEBUG at the top of the file to anything larger than zero and the script will dump information at various stages of its execution.



#!/usr/bin/bash
#
# NTP client configuration monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide, the Netherlands
# Last Modified: 10-07-2006
# 
# Usage: ./check_ntp_config
#
# Description:
#   Well, there's not much to tell. We have no way of making sure that our 
# NTP clients are all configured in the right way, so I thought I'd make
# a Nagios check for it. ^_^ 
#   You can change the NTP config at the top of this script, to match your
# own situation.
#
# Limitations:
#   This script should work properly on all implementations of Linux, Solaris
# and Mac OS X.
#
# Output:
#   If the NTP client config does not match what has been defined at the 
# top of this script, the script will return a WARN.
#
# Other notes:
#   If you ever run into problems with the script, set the DEBUG variable
# to 1. I'll need the output the script generates to do troubleshooting.
# See below for details.
#   I realise that all the debugging commands strewn throughout the script
# may make things a little harder to read. But in the end I'm sure it was
# well worth adding them. It makes troubleshooting so much easier. :3
#

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_ntp_config"
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh


### DEFINING THE NTP CLIENT CONFIGURATION AS IT SHOULD BE ###
NTP_SERVER="ntp.wxs.nl"


### DEBUGGING SETUP ###
# Cause you never know when you'll need to squash a bug or two
DEBUG="0"

if [ $DEBUG -gt 0 ]
then
        DEBUGFILE="/tmp/foobar"
        rm $DEBUGFILE >/dev/null 2>&1
fi


### REQUISITE NAGIOS COMMAND LINE STUFF ###

print_usage() {
	echo "Usage: $PROGNAME"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "NTP client configuration monitor plugin for Nagios"
	echo ""
	echo "This plugin not developped by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
	case "$1" in
	  --help) print_help; exit $STATE_OK;;
	  -h) print_help; exit $STATE_OK;;
	  *) print_usage; exit $STATE_UNKNOWN;;
	esac
done


### DEFINING SUBROUTINES ###

function gather_config()
{
    case `uname` in
	Linux) CFGFILE="/etc/ntp.conf"; IP_SERVER=`host $NTP_SERVER | awk '{print $4}'` ;;
	SunOS) CFGFILE="/etc/inet/ntp.conf"; IP_SERVER=`getent hosts $NTP_SERVER | awk '{print $1}'`;;
	Darwin) CFGFILE="/etc/ntp.conf"; IP_SERVER=`host $NTP_SERVER | awk '{print $4}'` ;;
	*) ;;
    esac

    REAL_SERVER=`grep ^server $CFGFILE | head -1 | awk '{print $2}'`

[ $DEBUG -gt 0 ] && echo "Gather_config: Host name for required server is $NTP_SERVER." >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Gather_config: IP address for required server is $IP_SERVER." >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Gather_config: currently configured server is $REAL_SERVER." >> $DEBUGFILE
} 

function check_config()
{
    if [ $REAL_SERVER != $NTP_SERVER ]
    then
	if [ $REAL_SERVER != $IP_SERVER ]
	then
	    echo "NOK - NTP client is not configured to speak to $NTP_SERVER"
	    exit $STATE_WARNING
     	fi
    fi
}


### FINALLY, THE MAIN ROUTINE ###

gather_config
check_config

# Nothing caused us to exit early, so we're okay.
echo "OK - NTP client configured correctly."
exit $STATE_OK


kilala.nl tags: , , ,

View or add comments (curr. 0)

Nagios script: check_nsca

2006-06-01 00:00:00

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

At $CLIENT we've often run into problems with the NSCA daemon, where the daemon would not crash per se, but where it would also not process incoming service checks. The nsca process was still running, but it simply wasn't transferring the incoming results to the Nagios command file.

I was amazed to find that nobody else had written a script to do this! So I quickly wrote one.


#!/usr/bin/bash
#
# NSCA Nagios service results monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide, the Netherlands
# Last Modified: 16-08-2006
# 
# Usage: ./check_nsca
#
# Description:
# Aside from checking whether the NSCA process is still running, this script
# also attempts to insert a message into the Nagios queue. After sending a 
# message to the NSCA daemon, it will verify that the message is received by
# Nagios, by checking the nagios.log file. 
#
# Limitations:
#   This script should work properly on all implementations of Linux, Solaris
# and Mac OS X.
#
# Output:
# If the NSCA daemon, or something along the message path, is borked, a 
# CRIT message will be issued. 
#

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_nsca"
PATH="/usr/bin:/usr/sbin:/bin:/sbin"

NAGIOSHOME="/usr/local/nagios"
LIBEXEC="$NAGIOSHOME/libexec"
NAGVAR="$NAGIOSHOME/var"
NAGBIN="$NAGIOSHOME/bin"
NAGETC="$NAGIOSHOME/etc"

. $LIBEXEC/utils.sh


### REQUISITE NAGIOS COMMAND LINE STUFF ###

print_usage() {
	echo "Usage: $PROGNAME"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "NSCA Nagios service results monitor plugin for Nagios"
	echo ""
	echo "This plugin not developped by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
	case "$1" in
	  --help) print_help; exit $STATE_OK;;
	  -h) print_help; exit $STATE_OK;;
	  *) print_usage; exit $STATE_UNKNOWN;;
	esac
done


### PLATFORM INDEPENDENCE ###

case `uname` in
	Linux) PSLIST="ps -ef";;
	SunOS) PSLIST="ps -ef";;
	Darwin) PSLIST="ps -ajx";;
	*) ;;
    esac


### CHECKING FOR THE NSCA PROCESS ###

[ `$PSLIST | grep nsca | grep -v grep | wc -l` -lt 1 ] && { echo "NSCA process not running."; exit $STATE_CRITICAL; }


### INSERTING A TEST MESSAGE ###

DATE=`date +%Y%m%d%H%M`
STRING="`hostname`\tFOOBAR\t0\t$DATE This is a test of the emergency broadcast system.\n"

echo -e "$STRING" | $NAGBIN/send_nsca -H localhost -c $NAGETC/send_nsca.cfg >/dev/null 2>&1


### CHECKING THE NAGIOS LOG FILE ###

sleep 10

if [ `tail -1000 $NAGVAR/nagios.log | grep "emergency broadcast system" | grep $DATE | wc -l` -lt 1 ] 
then
	# Giving it a second try
	sleep 10
	if [ `tail -5000 $NAGVAR/nagios.log | grep "emergency broadcast system" | grep $DATE | wc -l` -lt 1 ]	
	then
		echo "NSCA daemon not processing check results."
		exit $STATE_CRITICAL
	fi
fi


### EXITING NORMALLY ###

echo "OK - NSCA working like it should."
exit $STATE_OK


kilala.nl tags: , , ,

View or add comments (curr. 2)

Nagios script: check_processes

2006-06-01 00:00:00

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

A very simple script that takes a list of processes, instead of a single process name (as is the case with check_process). This should make monitoring a basic list of processes a lot easier. I really should change the script in such a way that it takes the process list from the command line, instead of from the $LIST variable that's defined internally. I'll do that when I have the time.

Until I've made those changes, I use the script by copying check_processes to a new file which is used specifically for one purpose. For example, check_linux_processes and check_solaris_processes check a list of processes that should be up and running on Linux and Solaris respectively.

This check script should work on just about any UNIX OS.
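
The change I keep promising would actually be tiny. Something like this at the top of the script, replacing the hard-coded LIST line, should do the trick (an untested sketch; mind you, the argument handling loop further down would also need a small tweak, since right now any argument triggers the usage message):

# Take the process list from the command line instead of hard-coding it,
# falling back to "init" when no argument is given.
# e.g.: ./check_processes "sshd syslogd cron"
LIST="${1:-init}"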



#!/bin/bash
#
# Process monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide, the Netherlands
# Last Modified: 13-07-2006
# 
# Usage: ./check_solaris_processes
#
# Description:
# This script couldn't be simpler than it is. It just checks to see
# whether a predefined list of processes is up and running. 
#
# Limitations:
#   This script should work properly on all implementations of Linux, Solaris
# and Mac OS X.
#
# Output:
# If one of the processes is down, a CRIT is issued.
#

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_linux_processes"
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh


### DEFINING THE PROCESS LIST ###
LIST="init"


### REQUISITE NAGIOS COMMAND LINE STUFF ###

print_usage() {
        echo "Usage: $PROGNAME"
        echo "Usage: $PROGNAME --help"
}

print_help() {
        echo ""
        print_usage
        echo ""
        echo "Basic processes list monitor plugin for Nagios"
        echo ""
        echo "This plugin not developped by the Nagios Plugin group."
        echo "Please do not e-mail them for support on this plugin, since"
        echo "they won't know what you're talking about :P"
        echo ""
        echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
        case "$1" in
          --help) print_help; exit $STATE_OK;;
          -h) print_help; exit $STATE_OK;;
          *) print_usage; exit $STATE_UNKNOWN;;
        esac
done


### FINALLY THE MAIN ROUTINE ###

COUNT="0"
DOWN=""

for PROCESS in `echo $LIST`
do
        if [ `ps -ef | grep -i $PROCESS | grep -v grep | wc -l` -lt 1 ]
        then
                let COUNT=$COUNT+1
                DOWN="$DOWN $PROCESS"
        fi
done

if [ $COUNT -gt 0 ]
then
        echo "NOK - $COUNT processes not running: $DOWN"
        exit $STATE_CRITICAL
fi

# Nothing caused us to exit early, so we're okay.
echo "OK - All requisite processes running."
exit $STATE_OK


kilala.nl tags: , , ,

View or add comments (curr. 2)

Nagios script: check_ram

2006-06-01 00:00:00

This script was written at the time I was hired by UPC / Liberty Global.

Basic monitor to check percentage of used physical RAM.

This script was quickly hacked together for my current customer, as a Q&D solution for their monitoring needs. It's no beauty, but it works. It's written in ksh and, as the limitations note in the script says, it was really only tested on Solaris.

UPDATE 19/06/2006:

Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!

I've also -finally- changed the script so that it takes the Warning and Critical percentages from the command line.

UPDATE 15/07/2006:

Whoops... I just noticed that the file had gone missing <3



#!/bin/ksh
#
# Free physical RAM monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 20-10-2006
# 
# Usage: ./check_ram
#
# Description:
# This plugin determines how much of the physical RAM in the 
# system is in use.
#
# Limitations:
# Currently this plugin will only function correctly on Solaris systems.
# And it really is only usefull at DTV Labs.
#
# Output:
# The script returns either a WARN or a CRIT, depending on the 
# percentage of free physical memory.
#

# Enabling the following dumps information into DEBUGFILE at various
# stages during the execution of this script.
DEBUG="1"
DEBUGFILE="/tmp/foobar"
rm $DEBUGFILE >/dev/null 2>&1
echo "Starting script check_ram." > $DEBUGFILE

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
        echo "WARNING:"
        echo "This script was originally written for use on Solaris."
        echo "You may run into some problems running it on this host."
        echo ""
        echo "Please verify that the script works before using it in a"
        echo "live environment. You can easily disable this message after"
        echo "testing the script."
        echo ""
        exit 1
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_ram"
PATH="/usr/bin:/usr/sbin:/bin:/usr/local/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

print_usage() {
        echo "Usage: $PROGNAME warning-percentage critical-percentage"
        echo ""
        echo "e.g. : $PROGNAME 15 5"
        echo "This will start alerting when more than 85% of RAM has"
        echo "been used."
        echo ""
}

print_help() {
        echo ""
        print_usage
        echo ""
        echo "Free physical RAM plugin for Nagios"
        echo ""
        echo "This plugin not developped by the Nagios Plugin group."
        echo "Please do not e-mail them for support on this plugin, since"
        echo "they won't know what you're talking about :P"
        echo ""
        echo "For contact info, read the plugin itself..."
}

if [ $# -lt 2 ]; then print_help; exit $STATE_WARNING;fi

case "$1" in
        --help) print_help; exit $STATE_OK;;
        -h) print_help; exit $STATE_OK;;
        *) if [  $# -lt 2 ]; then print_help; exit $STATE_WARNING;fi ;;
esac

RAM_WARN=$1
RAM_CRIT=$2
[ $DEBUG -gt 0 ] && echo "Warning and Critical percentages are $RAM_WARN and $RAM_CRIT." >> $DEBUGFILE

if [ $RAM_WARN -le $RAM_CRIT ]
then
        echo "Warning percentage should be larger than critical percentage."
        exit $STATE_WARNING
fi

check_space()
{
[ $DEBUG -gt 0 ] && echo "Starting check_space." >> $DEBUGFILE
        TOTALSPACE=0
        TOTALSPACE=`prtconf | grep ^"Memory size" | awk '{print $3}'`
[ $DEBUG -gt 0 ] && echo "Total space is $TOTALSPACE." >> $DEBUGFILE

        TOTALFREE=0
        TOTALFREE=`vmstat 2 2 | tail -1 | awk '{print $5}'`
[ $DEBUG -gt 0 ] && echo "Free space is $TOTALFREE." >> $DEBUGFILE
        let TOTALFREE=$TOTALFREE/1000
[ $DEBUG -gt 0 ] && echo "Free space, div1000 is $TOTALFREE." >> $DEBUGFILE
}

check_percentile() 
{
[ $DEBUG -gt 0 ] && echo "Starting check_percentile." >> $DEBUGFILE
        FRACTION=`echo "scale=2; $TOTALFREE/$TOTALSPACE" | bc`
[ $DEBUG -gt 0 ] && echo "Fraction is $FRACTION." >> $DEBUGFILE

        PERCENT=`echo "scale=2; $FRACTION*100" | bc | awk -F. '{print $1}'`
[ $DEBUG -gt 0 ] && echo "Percentile is $PERCENT." >> $DEBUGFILE

        if [ $PERCENT -lt $RAM_CRIT ]; then
[ $DEBUG -gt 0 ] && echo "$PERCENT is smaller than $RAM_CRIT. Critical." >> $DEBUGFILE
          echo "RAM NOK - Less than $RAM_CRIT % of physical RAM is unused."
          exitstatus=$STATE_CRITICAL
          exit $exitstatus
        fi

        if [ $PERCENT -lt $RAM_WARN ]; then
[ $DEBUG -gt 0 ] && echo "$PERCENT is smaller than $RAM_WARN. Warning." >> $DEBUGFILE
          echo "RAM NOK - Less than $RAM_WARN % of physical RAM is unused."
          exitstatus=$STATE_WARNING
          exit $exitstatus
        fi
}

check_space
check_percentile

[ $DEBUG -gt 0 ] && echo "$PERCENT is greater than $RAM_WARN. OK." >> $DEBUGFILE
echo "RAM OK - $TOTALFREE MB out of $TOTALSPACE MB RAM unused."
exitstatus=$STATE_OK
exit $exitstatus



kilala.nl tags: , , ,

View or add comments (curr. 0)

Nagios script: check_nfs_stale

2006-06-01 00:00:00

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

There really isn't much to say... This script is so fscking basic that it shames me to even put it up here among all the other projects.


#!/usr/bin/bash
#
# NFS stale mounts monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide, the Netherlands
# Last Modified: 13-07-2006
# 
# Usage: ./check_nfs_stale
#
# Description:
# This script couldn't be simpler than it is. It just checks to see
# whether there are any stale NFS mounts present on the system. 
#
# Limitations:
#   This script should work properly on all implementations of Linux, Solaris
# and Mac OS X.
#
# Output:
# If there are stale NFS mounts, a CRIT is issued.
#

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_nfs_stale"
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh


### REQUISITE NAGIOS COMMAND LINE STUFF ###

print_usage() {
	echo "Usage: $PROGNAME"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "NFS stale mounts monitor plugin for Nagios"
	echo ""
	echo "This plugin not developped by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
	case "$1" in
	  --help) print_help; exit $STATE_OK;;
	  -h) print_help; exit $STATE_OK;;
	  *) print_usage; exit $STATE_UNKNOWN;;
	esac
done

[ `df -k 2>&1 | grep "Stale NFS file handle" | wc -l` -gt 0 ] && { echo "NOK - Stale NFS mounts."; exit $STATE_CRITICAL; }

# Nothing caused us to exit early, so we're okay.
echo "OK - No stale NFS mounts."
exit $STATE_OK


kilala.nl tags: , , ,

View or add comments (curr. 0)

Nagios script: check_networking

2006-06-01 00:00:00

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

I couldn't find an easy way to check whether all interfaces of a host are up and running from the -inside-, so I wrote a Nagios plugin to do this.

Naturally you could also try to ping all of the IP addresses of all of these network cards, but this isn't always possible. Lord knows how many routing issues I had to fight through to get our current IP set monitored. I guess using this script is a bit easier :)

The script was tested on Redhat ES3, Mac OSX and Solaris. Its basic requirement is the Korn shell (due to some conversions happening inside the script). On Linux/RH you'll need mii-tool (and sudo) and on Solaris you'll need Perl (for one lousy piece of math :p ).

EDIT:

Oh! Just like my other recent Nagios scripts, check_networking comes with a debugging option. Set $DEBUG at the top of the file to anything larger than zero and the script will dump information at various stages of its execution.



#!/usr/bin/ksh
#
# Basic UNIX networking check script.
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide SYS, the Netherlands
# Last Modified: 22-06-2006
#
# Usage: ./check_networking
#
# Description:
#   This plugin determines whether the local host's network interfaces
# are all up and running like they should. It uses the following
# questions to determine this.
# * Does /sbin/mii-tool report any problems? (Linux only)
# * Are the gateways for each subnet pingable?
#
# Limitations:
# * I have no clue whether mii-tool is something specific to Redhat ES3,
#   or whether all Linii have it. 
# * Sudo access to mii-tool is required for the nagios account.
# * Perl is required on Solaris, to do just tiny bit of math.
# * KSH is required.
# * The script assumes that the first available IP from a subnet is the
#   router.
#
# Output:
#   The script returns a CRIT when one of the criteria mentioned
# above is not matched.
#
# Other notes:
#   I wish I'd learn Perl. I'm sure that doing all of this stuff in Perl
# would have cut down on the size of this script tremendously. Ah well.
#   If you ever run into problems with the script, set the DEBUG variable
# to 1. I'll need the output the script generates to do troubleshooting.
# See below for details. 
#   I realise that all the debugging commands strewn throughout the script
# may make things a little harder to read. But in the end I'm sure it was
# well worth adding them. It makes troubleshooting so much easier. :3
#

# Enabling the following dumps information into DEBUGFILE at various
# stages during the execution of this script.
DEBUG="0"
DEBUGFILE="/tmp/foobar"


### REQUISITE NAGIOS USER INTERFACE STUFF ###

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_networking"
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

[ $DEBUG -gt 0 ] && rm $DEBUGFILE 

print_usage() {
        echo "Usage: $PROGNAME"
        echo "Usage: $PROGNAME --help"
}

print_help() {
        echo ""
        print_usage
        echo ""
        echo "Basic UNIX networking check plugin for Nagios"
        echo ""
        echo "This plugin not developped by the Nagios Plugin group."
        echo "Please do not e-mail them for support on this plugin, since"
        echo "they won't know what you're talking about :P"
        echo ""
        echo "For contact info, read the plugin itself..."
}

while test -n "$1"
do
        case "$1" in
          --help) print_help; exit $STATE_OK;;
          -h) print_help; exit $STATE_OK;;
          *) print_usage; exit $STATE_UNKNOWN;;
        esac
done


### SETTING UP THE ENVIRONMENT ###

# Host OS check and warning message
MIITOOL="0"
if [ -f /sbin/mii-tool ]
then
        MIITOOL="1"

        sudo /sbin/mii-tool >/dev/null 2>&1
        if [ $? -gt 0 ]
        then
                echo "ERROR: sudo permissions"
                echo ""
                echo "This script requires that the Nagios user account has"
                echo "sudo permissions for the mii-tool command. Currently it"
                echo "does not have these permissions. Please fix this."
                echo ""
                exit $STATE_UNKNOWN
        fi
fi


### SUB-ROUTINE DEFINITIONS ### 

function convert_base
{
        typeset -i${2:-16} x
        x=$1
        echo $x
}

function subnet_router
{
[ $DEBUG -gt 0 ] && echo "- Starting subnet_router -" >> $DEBUGFILE
    first="0"; second="0"; third="0"; fourth="0"
    first=`echo $1 | cut -c 1-8`; FIRST=`convert_base 2#$first 10`
[ $DEBUG -gt 0 ] && echo "First: $first $FIRST" >> $DEBUGFILE
    second=`echo $1 | cut -c 9-16`; SECOND=`convert_base 2#$second 10`
[ $DEBUG -gt 0 ] && echo "Second: $second $SECOND" >> $DEBUGFILE
    third=`echo $1 | cut -c 17-24`; THIRD=`convert_base 2#$third 10`
[ $DEBUG -gt 0 ] && echo "Third: $third $THIRD" >> $DEBUGFILE
    fourth=`echo $1 | cut -c 25-32`
    [ `echo $fourth|wc -c` -gt 1 ] || fourth="0"
    TEMPCOUNT=`echo $fourth | wc -c | awk '{print $1}'`
    let PADDING=9-$TEMPCOUNT 
[ $DEBUG -gt 0 ] && echo "Fourth: padding fourth with $PADDING zeroes" >> $DEBUGFILE
    i=1
    while ((i <= $PADDING));
    do
       fourth=$fourth"0" 
       let i=$i+1
    done
    FOURTH=`convert_base 2#$fourth 10`; let FOURTH=$FOURTH+1
[ $DEBUG -gt 0 ] && echo "Fourth: $fourth $FOURTH" >> $DEBUGFILE

    echo "$FIRST.$SECOND.$THIRD.$FOURTH"
}

gather_interfaces_linux()
{
[ $DEBUG -gt 0 ] && echo "- Starting gather_interfaces_linux -" >> $DEBUGFILE
    for INTF in `ifconfig -a | grep ^[a-z] | grep -v ^lo | awk '{print $1}'`
    do
	if [ `echo $INTF | grep : | wc -l` -gt 0 ]
	then
            export INTERFACES="`echo $INTF|awk -F: '{print $1}'` $INTERFACES"
	else
            export INTERFACES="$INTF $INTERFACES"
	fi
    done

    INTFCOUNT=`echo $INTERFACES | wc -w`
[ $DEBUG -gt 0 ] && echo "Interfaces: There are $INTFCOUNT interfaces: $INTERFACES." >> $DEBUGFILE
    if [ $INTFCOUNT -lt 1 ] 
    then
	echo "NOK - No active network interfaces."
	exit $STATE_CRITICAL
    fi
}

gather_interfaces_darwin()
{
[ $DEBUG -gt 0 ] && echo "- Starting gather_interfaces_darwin -" >> $DEBUGFILE
    for INTF in `ifconfig -a | grep ^[a-z] | grep -v ^gif | grep -v ^stf | grep -v ^lo | awk '{print $1}'`
    do
        [ `echo $INTF | grep : | wc -l` -gt 0 ] && INTF=`echo $INTF|awk -F: '{print $1}'`
	[ `ifconfig $INTF | grep "status: inactive" | wc -l` -gt 0 ] && continue
        INTERFACES="$INTF $INTERFACES" 
    done

    INTFCOUNT=`echo $INTERFACES | wc -w`
[ $DEBUG -gt 0 ] && echo "Interfaces: There are $INTFCOUNT interfaces: $INTERFACES." >> $DEBUGFILE
    if [ $INTFCOUNT -lt 1 ] 
    then
	echo "NOK - No active network interfaces."
	exit $STATE_CRITICAL
    fi
}

gather_gateway_linux()
{
[ $DEBUG -gt 0 ] && echo "- Starting gather_gateway_linux for interface $1 -" >> $DEBUGFILE
    MASKBIN=""
    MASK=`ifconfig $1 | grep Mask | awk '{print $4}' | awk -F: '{print $2}'` 
    for PART in `echo $MASK | awk -F. '{print $1" "$2" "$3" "$4}'`
    do
        MASKBIN="$MASKBIN`convert_base $PART 2  | awk -F# '{print $2}'`"
    done
[ $DEBUG -gt 0 ] && echo "Mask: $MASK $MASKBIN" >> $DEBUGFILE

        BITCOUNT=`echo $MASKBIN | grep -o 1 | wc -l | awk '{print $1}'`

[ $DEBUG -gt 0 ] && echo "Bitcount: $BITCOUNT" >> $DEBUGFILE

    IPBIN=""
    IP=`ifconfig $1 | grep "inet addr" | awk '{print $2}' | awk -F: '{print $2}'` 
    for PART in `echo $IP | awk -F. '{print $1" "$2" "$3" "$4}'`
    do
        TEMPBIN=`convert_base $PART 2 | awk -F# '{print $2}'`
        TEMPCOUNT=`echo $TEMPBIN | wc -c | awk '{print $1}'`
        let PADDING=9-$TEMPCOUNT
        i=1
        while ((i <= $PADDING));
        do
            IPBIN=$IPBIN"0" 
            let i=$i+1
        done
        IPBIN=$IPBIN$TEMPBIN
    done
[ $DEBUG -gt 0 ] && echo "IP address: $IP $IPBIN" >> $DEBUGFILE

    CUT="1-$BITCOUNT"
[ $DEBUG -gt 0 ] && echo "Cutting: Cutting chars $CUT" >> $DEBUGFILE
    NETBIN=`echo $IPBIN | cut -c $CUT`
[ $DEBUG -gt 0 ] && echo "Netbin: $NETBIN" >> $DEBUGFILE
    ROUTER=`subnet_router $NETBIN`
[ $DEBUG -gt 0 ] && echo "Router: $ROUTER" >> $DEBUGFILE
    echo $ROUTER
}

gather_gateway_darwin()
{
[ $DEBUG -gt 0 ] && echo "- Starting gath_gateway_darwin for interface $1 -" >> $DEBUGFILE
    MASKBIN=""
    [ `uname` == "Darwin" ] && MASK=`ifconfig $1 | grep netmask | awk '{print $4}' | awk -Fx '{print $2}'`
    [ `uname` == "SunOS" ] && MASK=`ifconfig $1 | grep netmask | awk '{print $4}'`
    for PART in `echo 1 3 5 7`
    do
	let PLUSPART=$PART+1
	MASKPART=`echo $MASK | cut -c $PART-$PLUSPART`
        MASKBIN="$MASKBIN`convert_base 16#$MASKPART 2  | awk -F# '{print $2}'`"
    done
[ $DEBUG -gt 0 ] && echo "Mask: $MASK $MASKBIN" >> $DEBUGFILE

    BITCOUNT=`echo $MASKBIN | grep -o 1 | wc -l | awk '{print $1}'`
[ $DEBUG -gt 0 ] && echo "Bitcount: $BITCOUNT" >> $DEBUGFILE

    IPBIN=""
    IP=`ifconfig $1 | grep "inet " | awk '{print $2}'`
    for PART in `echo $IP | awk -F. '{print $1" "$2" "$3" "$4}'`
    do
        TEMPBIN=`convert_base $PART 2 | awk -F# '{print $2}'`
        TEMPCOUNT=`echo $TEMPBIN | wc -c | awk '{print $1}'`
        let PADDING=9-$TEMPCOUNT
        i=1
        while ((i <= $PADDING));
        do
            TEMPBIN="0"$TEMPBIN
            let i=$i+1
        done
        IPBIN=$IPBIN$TEMPBIN
    done
[ $DEBUG -gt 0 ] && echo "IP address: $IP $IPBIN" >> $DEBUGFILE

    CUT="1-$BITCOUNT"
[ $DEBUG -gt 0 ] && echo "Cutting: cutting chars $CUT" >> $DEBUGFILE
    NETBIN=`echo $IPBIN | cut -c $CUT`
[ $DEBUG -gt 0 ] && echo "Netbin: $NETBIN" >> $DEBUGFILE
    ROUTER=`subnet_router $NETBIN`
[ $DEBUG -gt 0 ] && echo "Router: $ROUTER" >> $DEBUGFILE
    echo $ROUTER
}

gather_gateway_sunos()
{
[ $DEBUG -gt 0 ] && echo "- Starting gath_gateway_solaris for interface $1 -" >> $DEBUGFILE
    MASKBIN=""
    [ `uname` == "Darwin" ] && MASK=`ifconfig $1 | grep netmask | awk '{print $4}' | awk -Fx '{print $2}'`
    [ `uname` == "SunOS" ] && MASK=`ifconfig $1 | grep netmask | awk '{print $4}'`
    for PART in `echo 1 3 5 7`
    do
        let PLUSPART=$PART+1
        MASKPART=`echo $MASK | cut -c $PART-$PLUSPART`
        MASKBIN="$MASKBIN`convert_base 16#$MASKPART 2  | awk -F# '{print $2}'`"
    done
[ $DEBUG -gt 0 ] && echo "Mask: $MASK $MASKBIN" >> $DEBUGFILE

# This piece of kludge also requires that all tabs are removed from the beginning of each line.
# Additional character needed to trick the counter below
# Shitty thing is that it doesn't work. Stupid "let" arithmetic engine...
#MASKBIN="$MASKBIN-"
#[ $DEBUG -gt 0 ] && echo "Bitcount: kludged binmask is $MASKBIN" >> $DEBUGFILE
#
#IFS="1"
#read TEMP << EOT
#echo $MASKBIN
#EOT
#let "BITCOUNT=(${#TEMP[@]} - 1)"
#IFS=" "

# The kludge above was replaced by this one line of Perl. 

    BITCOUNT=`echo $MASKBIN | perl -ne 'while(/1/g){++$count}; print "$count"'`
[ $DEBUG -gt 0 ] && echo "Bitcount: $BITCOUNT" >> $DEBUGFILE

    IPBIN=""
    IP=`ifconfig $1 | grep "inet " | awk '{print $2}'`
    for PART in `echo $IP | awk -F. '{print $1" "$2" "$3" "$4}'`
    do
[ $DEBUG -gt 0 ] && echo "IP part: converting part $PART" >> $DEBUGFILE
        TEMPBIN=`convert_base $PART 2 | awk -F# '{print $2}'`
[ $DEBUG -gt 0 ] && echo "IP part: converted part is $TEMPBIN" >> $DEBUGFILE
        TEMPCOUNT=`echo $TEMPBIN | wc -c | awk '{print $1}'`
[ $DEBUG -gt 0 ] && echo "IP part: this part is $TEMPCOUNT chars long." >> $DEBUGFILE
        let PADDING=9-$TEMPCOUNT
[ $DEBUG -gt 0 ] && echo "IP part: will be padded with $PADDING zeroes" >> $DEBUGFILE
        i=1
        while ((i <= $PADDING));
        do
            TEMPBIN="0"$TEMPBIN
            let i=$i+1
        done
        IPBIN=$IPBIN$TEMPBIN
    done
[ $DEBUG -gt 0 ] && echo "IP address: $IP $IPBIN" >> $DEBUGFILE

    CUT="1-$BITCOUNT"
[ $DEBUG -gt 0 ] && echo "Cutting: cutting chars $CUT" >> $DEBUGFILE
    NETBIN=`echo $IPBIN | cut -c $CUT`
[ $DEBUG -gt 0 ] && echo "Netbin: $NETBIN" >> $DEBUGFILE
    ROUTER=`subnet_router $NETBIN`
[ $DEBUG -gt 0 ] && echo "Router: $ROUTER" >> $DEBUGFILE
    echo $ROUTER
}

check_miitool()
{
[ $DEBUG -gt 0 ] && echo "- Starting check_miitool -" >> $DEBUGFILE
    COUNT="0"
    for INTF in `echo $INTERFACES`
    do
        [ `sudo /sbin/mii-tool $INTF | head -1 | grep -c ok` -gt 0 ] || let COUNT=$COUNT+1
        [ `sudo /sbin/mii-tool $INTF | head -1 | grep -c 100baseTx-FD` -gt 0 ] || let COUNT=$COUNT+1
        [ `sudo /sbin/mii-tool $INTF | head -1 | grep -c 1000baseTx-FD` -gt 0 ] || let COUNT=$COUNT+1
    done

    [ $COUNT -gt $INTFCOUNT ] && { echo "NOK - Problem with one of the interfaces"; exit $STATE_CRITICAL; }
}

check_ping()
{
[ $DEBUG -gt 0 ] && echo "- Starting check_ping -" >> $DEBUGFILE
    INTF=""
    for INTF in `echo $INTERFACES`
    do
	case `uname` in
	    Linux) GATEWAY=`gather_gateway_linux $INTF`;;
	    Darwin) GATEWAY=`gather_gateway_darwin $INTF`;;
	    SunOS) GATEWAY=`gather_gateway_sunos $INTF`;;
	    *) echo "OS not supported by this check."; exit 1;;
	esac
[ $DEBUG -gt 0 ] && echo "Gateway: $GATEWAY" >> $DEBUGFILE

 	ping -c 3 $GATEWAY >/dev/null 2>&1
        if [ $? -gt 0 ] 
        then
            echo "NOK - Problem pinging gateway $GATEWAY"; exit $STATE_CRITICAL
        fi
    done
}


### THE MAIN ROUTINE FINALLY STARTS ###

case `uname` in
            Linux) gather_interfaces_linux;;
            Darwin) gather_interfaces_darwin;;
            #SunOS) gather_interfaces_sunos;;
            SunOS) gather_interfaces_linux;;
            *) echo "OS not supported by this check."; exit 1;;
        esac

[ $MIITOOL -eq 1 ] && check_miitool

check_ping

# None of the other subroutines forced us to exit 1 before here, so let's quit with a 0.
echo "OK - Everything running like it should"
exit $STATE_OK


kilala.nl tags: , , ,

View or add comments (curr. 0)

Nagios script: check_suncluster

2006-06-01 00:00:00

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

A few of our projects and services are run on Solaris systems running Sun Cluster software. Since there were no Nagios scripts available to perform checks against Sun Cluster I made a basic script that checks the most important factors.

This script performs a different function, depending on the parameter with which it is called. This allows you to define multiple service checks in Nagios, without needing separate check scripts for each.
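
For example, your checkcommands.cfg and services.cfg could contain something like this (a sketch; in practice you'd call the script on the cluster node itself, through NRPE or SSH):

define command{
   command_name check_suncluster
   command_line $USER1$/check_suncluster $ARG1$
}

define service{
   host_name cluster-node1
   service_description SC_QUORUM
   check_command check_suncluster!-q
}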

EDIT:

Oh! Just like my other recent Nagios scripts, check_suncluster comes with a debugging option. Set $DEBUG at the top of the file to anything larger than zero and the script will dump information at various stages of its execution. And like my other, recent scripts it also comes with its own test script.



#!/usr/bin/ksh
#
# Nagios check script for Sun Cluster.
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide SYS, the Netherlands
# Last Modified: 25-09-2006
#
# Usage: ./check_suncluster [-t, -q, -g, -G resource-group, -r, -R resource, -i]
#
# Description:
# This script is capable of performing a number of basic checks on a 
# system running Sun Cluster. Depending on the parameter you pass to 
# it, it will check:
# * Transport paths (-t).
# * Quorum (-q).
# * Resource groups (-g).
# * One selected resource group (-G).
# * Resources (-r).
# * One selected resource (-R).
# * IPMP groups (-i).
#
# Limitations:
# This script will only work with Korn shell, due to some funky while
# looping with pipe forking. Bash doesn't handle this very gracefully,
# due to its sub-shell variable scoping. Maybe I really should learn
# to program in Perl.   
#
# Output:
# * Transport paths return a WARN when one of the paths is down and a
#   CRIT when all paths are offline. 
# * Quorum returns a WARN when not all, but enough quorum devices are
#   available. It returns a CRIT when quorum cannot be reached.
# * Resource groups returns a CRIT when a group is offline on all nodes
#   and a WARN if a group is in an unstable state.
# * Resources returns a CRIT when a resource is offline on all nodes
#   and a WARN if a resource is in an unstable state.
# * IPMP groups returns a CRIT when a group is offline.
#
# Other notes:
# Aside from the debugging output that I've built into most of my recent
# scripts, this check script will also have a testing mode  hacked on, as
# a bag on the side. This testing mode is only engaged when the test_check_suncluster
# script is being run and will intentionally "break" a few things, to 
# verify the failure options of this check script.
#

# Enabling the following dumps information into DEBUGFILE at various
# stages during the execution of this script.
DEBUG=0
DEBUGFILE="/tmp/foobar"

if [ -f /tmp/neko-wa-baka ]
then
	if [ `cat /tmp/neko-wa-baka` == "Nyo!" ]
	then
	   TESTING="1"
	else
	   TESTING="0"
	fi
else
	TESTING="0"
fi


### REQUISITE NAGIOS USER INTERFACE STUFF ###

# You may have to change this, depending on where you installed your
# Nagios plugins
PATH="/usr/bin:/usr/sbin:/bin:/sbin:/usr/cluster/bin"
LIBEXEC="/usr/local/nagios/libexec"
PROGNAME="check_suncluster"
. $LIBEXEC/utils.sh

[ $DEBUG -gt 0 ] && rm $DEBUGFILE 

print_usage() {
        echo "Usage: $PROGNAME [-t, -q, -g, -G resource-group, -r, -R resource, -i]"
        echo "Usage: $PROGNAME --help"
}

print_help() {
        echo ""
        print_usage
        echo ""
        echo "Sun Cluster check plugin for Nagios"
        echo ""
        echo "-t: check transport paths"
        echo "-q: check quorum"
        echo "-g: check resource groups"
        echo "-G: check one individual resource group"
        echo "-r: check all resources"
        echo "-R: check one individual resources"
        echo "-i: check IPMP groups"
        echo ""
        echo "This plugin not developped by the Nagios Plugin group."
        echo "Please do not e-mail them for support on this plugin, since"
        echo "they won't know what you're talking about :P"
        echo ""
        echo "For contact info, read the plugin itself..."
}


### SUB-ROUTINE DEFINITIONS ### 

function check_transport_paths
{
[ $DEBUG -gt 0 ] && echo "Starting check_transport_path subroutine." >> $DEBUGFILE

	TOTAL=`scstat -W | grep "Transport path:" | wc -l`
	let COUNT=0

	scstat -W | grep "Transport path:" | awk '{print $3" "$6}' | while read TPATH STATUS
	do
[ $DEBUG -gt 0 ] && echo "Before math, Count has the value of $COUNT." >> $DEBUGFILE
		if [ $STATUS == "online" ]
		then
		   let COUNT=$COUNT+1
		fi
[ $DEBUG -gt 0 ] && echo "Path: $PATH has status $STATUS" >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Count: $COUNT online transport paths." >> $DEBUGFILE
	done

[ $DEBUG -gt 0 ] && echo "Count: Outside the loop it has a value of $COUNT." >> $DEBUGFILE
[ $TESTING -gt 0 ] && COUNT="0"

	if [ $COUNT -lt 1 ]
	then
	   echo "NOK - No transport paths online."
	   exit $STATE_CRITICAL
	elif [ $COUNT -lt $TOTAL ]
	then
	   echo "NOK - One or more transport paths offline."
	   exit $STATE_WARNING
	fi
}

function check_quorum
{
[ $DEBUG -gt 0 ] && echo "Starting check_quorum subroutine." >> $DEBUGFILE
	NEED=`scstat -q | grep "votes needed:" | awk '{print $4}'`
	PRES=`scstat -q | grep "votes present:" | awk '{print $4}'`

[ $DEBUG -gt 0 ] && echo "Quorum needed: $NEED" >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Quorum present: $PRES" >> $DEBUGFILE

[ $TESTING -gt 0 ] && PRES="0"
	if [ $PRES -ge $NEED ]
	then
[ $DEBUG -gt 0 ] && echo "Enough quorum votes." >> $DEBUGFILE
		scstat -q | grep "votes:" | awk '{print $3" "$6}' | while read VOTE STATUS
		do
[ $DEBUG -gt 0 ] && echo "Vote: $VOTE has status $STATUS." >> $DEBUGFILE
			if [ $STATUS != "Online" ] 
			then
			   echo "NOK - Quorum vote $VOTE not available."
			   exit $STATE_WARNING
			fi
		done		
	else
[ $DEBUG -gt 0 ] && echo "Not enough quorum." >> $DEBUGFILE
		echo "NOK - Not enough quorum votes present."
		exit $STATE_CRITICAL
	fi
}

function check_resource_groups
{
[ $DEBUG -gt 0 ] && echo "Starting check_resource_groups subroutine." >> $DEBUGFILE
	scstat -g | grep "Group:" | awk '{print $2}' | sort -u | while read GROUP
	do
	ONLINE=`scstat -g | grep "Group: $GROUP" | grep "Online" | wc -l`
	WEIRD=`scstat -g | grep "Group: $GROUP" | grep -v "Resources" | grep -v "Online" | grep -v "Offline" | wc -l`
[ $DEBUG -gt 0 ] && echo "Resource Group $GROUP has $ONLINE instances online." >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Resource Group $GROUP has $WEIRD instances in a weird state." >> $DEBUGFILE
[ $TESTING -gt 0 ] && ONLINE="0"
		if [ $ONLINE -lt 1 ] 
		then
		   echo "NOK - Resource group $GROUP not online."
		   exit $STATE_CRITICAL
		fi
                if [ $WEIRD -gt 1 ]
                then
                   echo "NOK - Resource group $GROUP is an unstable state."
                   exit $STATE_WARNING
                fi
	done
}

function check_resource_grp
{
[ $DEBUG -gt 0 ] && echo "Starting check_resource_grp subroutine." >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Selected group: $RGROUP" >> $DEBUGFILE
	ONLINE=`scstat -g | grep $RGROUP | grep "Online" | wc -l`
	WEIRD=`scstat -g | grep $RGROUP | grep -v "Resources" | grep -v "Online" | grep -v "Offline" | wc -l`
[ $DEBUG -gt 0 ] && echo "Resource Group $GROUP has $ONLINE instances online." >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Resource Group $GROUP has $WEIRD instances in a weird state." >> $DEBUGFILE
[ $TESTING -gt 0 ] && ONLINE="0"
	if [ $ONLINE -lt 1 ] 
	then
	   echo "NOK - Resource group $RGROUP not online."
	   exit $STATE_CRITICAL
	fi
	if [ $WEIRD -gt 1 ]
        then
           echo "NOK - Resource group $RGROUP is in an unstable state."
           exit $STATE_WARNING
        fi
}

function check_resources
{
[ $DEBUG -gt 0 ] && echo "Starting check_resources subroutine." >> $DEBUGFILE
	RESOURCES=`scstat -g | grep "Resource:" | awk '{print $2}' | sort -u`
[ $DEBUG -gt 0 ] && echo "List of resources to check: $RESOURCES" >> $DEBUGFILE
	for RESOURCE in `echo $RESOURCES`
	do
	ONLINE=`scstat -g | grep "Resource: $RESOURCE" | awk '{print $4}' | grep "Online" | wc -l` 
	WEIRD=`scstat -g | grep "Resource: $RESOURCE" | awk '{print $4}' | grep -v "Online" | grep -v "Offline" | wc -l`
[ $DEBUG -gt 0 ] && echo "Resource $RESOURCE has $ONLINE instances online." >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Resource $RESOURCE has $WEIRD instances in a weird state." >> $DEBUGFILE
[ $TESTING -gt 0 ] && ONLINE="0"
		if [ $ONLINE -lt 1 ] 
		then
		   echo "NOK - Resource $RESOURCE not online."
		   exit $STATE_CRITICAL
		fi
                if [ $WEIRD -gt 1 ]
                then
                   echo "NOK - Resource $RESOURCE is in an unstable state."
                   exit $STATE_WARNING
                fi
	done
}

function check_rsrce
{
[ $DEBUG -gt 0 ] && echo "Starting check_rsrce subroutine." >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Selected resource: $RSRCE" >> $DEBUGFILE
	ONLINE=`scstat -g | grep "Resource: $RSRCE" | awk '{print $4}' | grep "Online" | wc -l`
	WEIRD=`scstat -g | grep "Resource: $RSRCE" | awk '{print $4}' | grep -v "Online" | grep -v "Offline" | wc -l`
[ $DEBUG -gt 0 ] && echo "Resource $RESOURCE has $ONLINE instances online." >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Resource $RESOURCE has $WEIRD instances in a weird state." >> $DEBUGFILE
[ $TESTING -gt 0 ] && ONLINE="0"
	if [ $ONLINE -lt 1 ] 
	then
	   echo "NOK - Resource $RESOURCE not online."
	   exit $STATE_CRITICAL
	fi
	if [ $WEIRD -gt 1 ]
        then
           echo "NOK - Resource $RESOURCE is in an unstable state."
           exit $STATE_WARNING
        fi
}

function check_ipmp
{
[ $DEBUG -gt 0 ] && echo "Starting check_ipmp subroutine." >> $DEBUGFILE
	scstat -i | grep "IPMP Group:" | awk '{print $3" "$5}' | while read GROUP STATUS
	do
[ $DEBUG -gt 0 ] && echo "IPMP Group: $GROUP has status $STATUS" >> $DEBUGFILE
		if [ $STATUS != "Online" ] 
		then
		   echo "NOK - IPMP group $GROUP not online."
		   exit $STATE_CRITICAL
		fi
if [ $TESTING -gt 0 ]
then
   echo "NOK - IPMP group $GROUP not online."
   exit $STATE_CRITICAL
fi
	done
}

### THE MAIN ROUTINE FINALLY STARTS ###

[ $DEBUG -gt 0 ] && echo "Starting main routine." >> $DEBUGFILE

if [ $# -lt 1 ]
then
	print_usage
	exit $STATE_UNKNOWN
fi

[ $DEBUG -gt 0 ] && echo "More than one argument." >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "" >> $DEBUGFILE

case "$1" in
	--help) print_help; exit $STATE_OK;;
	-h) print_help; exit $STATE_OK;;
	-t) check_transport_paths;;
	-q) check_quorum;;
	-g) check_resource_groups;;
	-G) RGROUP="$2"; check_resource_grp;;
	-r) check_resources;;
	-R) RSRCE="$2"; check_rsrce;;
	-i) check_ipmp;;
	*) print_usage; exit $STATE_UNKNOWN;;
esac

[ $DEBUG -gt 0 ] && echo "No problems. Exiting normally." >> $DEBUGFILE

# None of the other subroutines forced us to exit 1 before here, so let's quit with a 0.
echo "OK - Everything running like it should"
exit $STATE_OK

And here's the accompanying test script, test_check_suncluster:



#!/usr/bin/bash

function testrun()
{
	echo "Running without parameters."
	/usr/local/nagios/libexec/check_suncluster 
	echo "Exit code is $?."
	echo ""

	echo "Testing transport paths."
	/usr/local/nagios/libexec/check_suncluster -t
	echo "Exit code is $?."
	echo ""

	echo "Quorum votes."
	/usr/local/nagios/libexec/check_suncluster -q
	echo "Exit code is $?."
	echo ""

	echo "Checking all resource groups."
	/usr/local/nagios/libexec/check_suncluster -g
	echo "Exit code is $?."
	echo ""

	echo "Checking individual resource groups."
	for GROUP in `scstat -g | grep "Group:" | awk '{print $2}' | sort -u`
	do
		echo "Running for group $GROUP."
		/usr/local/nagios/libexec/check_suncluster -G $GROUP
		echo "Exit code is $?."
		echo ""
	done

	echo "Checking all resources."
	/usr/local/nagios/libexec/check_suncluster -r
	echo "Exit code is $?."
	echo ""
	
	echo "Checking all resources."
	for RESOURCE in `scstat -g | grep "Resource:" | awk '{print $2}' | sort -u`
	do
		echo "Running for resource $RESOURCE."
		/usr/local/nagios/libexec/check_suncluster -R $RESOURCE
		echo "Exit code is $?."
		echo ""
	done
	
	echo "Checking IPMP groups."
	/usr/local/nagios/libexec/check_suncluster -i
	echo "Exit code is $?."
	echo ""
}

function breakstuff()
{
	# Now we'll start breaking things!!
	echo ""
	echo "Now it's time to start breaking things! Gruaargh!"
	echo "Mind you, it's all fake and simulated. I am not changing -anything-"
	echo "about the cluster itself."
	echo ""
	
	echo "Nyo!" > /tmp/neko-wa-baka 
}

echo "Starting clean"
rm /tmp/neko-wa-baka /tmp/foobar >/dev/null 2>&1
echo ""

testrun
breakstuff
testrun

echo "Starting clean at the end"
rm /tmp/neko-wa-baka  >/dev/null 2>&1
echo ""

kilala.nl tags: , , ,

View or add comments (curr. 2)

Nagios script: retrieve_custom_snmp

2006-06-01 00:00:00

This script was written while I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

One of the things we've been looking into recently, is running the standard Nagios plugins through SNMP instead of through NRPE. Putting aside the discussion of the various merits and flaws such a solution has, let's say that it works nicely.

How do you do this?

In your snmpd.conf add a line like:

exec .1.3.6.1.4.1.6886.4.1.1 check_load /usr/local/nagios/libexec/check_load
exec .1.3.6.1.4.1.6886.4.1.2 check_mem /usr/local/nagios/libexec/check_mem -w 85 -c 95
exec .1.3.6.1.4.1.6886.4.1.3 check_swap /usr/local/nagios/libexec/check_swap -w 15% -c 5%

What this does, is tell the SNMP daemon to run the check_load script when someone asks for object .1.3.6.1.4.1.6886.4.1.1 (or .2, or .3). The exit code for the script will be placed in OID.100.0 and the first line of output will be placed in OID.101.1. This script retrieves those two values through SNMP and returns them to Nagios.
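
You can verify the whole chain by hand with snmpget, before Nagios ever gets involved. Per the description above, the first command below should return the check's exit code and the second its output line (hostname and community made up):

snmpget -v 2c -c public client01 .1.3.6.1.4.1.6886.4.1.1.100.0
snmpget -v 2c -c public client01 .1.3.6.1.4.1.6886.4.1.1.101.1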

Your checkcommands.cfg should contain something like:

define command{
   command_name retrieve_custom_snmp
   command_line $USER1$/retrieve_custom_snmp -H $HOSTADDRESS$ -o $ARG1$
}

The "-o" parameter takes the OID you have selected for your custom check.

Now... How do you select an OID? There's two ways:

1. The WRONG way = randomly selecting some OID. You might pick an OID which is needed for other monitoring purposes in your network.

2. The RIGHT way = requesting a private Enterprise ID for your company at IANA. You are free to build an SNMP tree beneath this EID. For example, the EID 6886 mentioned above is registered to KPN (my current client). The sub-tree .4.1 contains all OIDs referring to Nagios checks performed by my department.

Before sending out that request, please check the current EID list to see if your company already owns a private subtree. If that's the case, contact the "owner" to request your own part of the subtree.

UPDATE (2006-10-02):

Thanks to the kind folks on the Nagios Users ML I've found out that my original version of the script was totally bug-ridden. I've made a big bunch of adjustments and now the script should work properly. Thanks especially to Andreas Ericsson.



#!/bin/bash
#
# Script to retrieve custom SNMP objects set using the "exec" handler
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide, the Netherlands
# Last Modified: 18-07-2006
# 
# Usage: ./retrieve_custom_snmp
#
# Description:
#   On our Nagios client systems we use a lot of custom MIB OIDs which are
# registered under our own Enterprise ID. A whole bunch of the 
# original Nagios script are run through the SNMP daemon and their exit
# codes and output are appended to specific OID. This all happens using the
# SNMP "exec" handler.
#   Unfortunately the default check_snmp script doesn't allow for easy 
# handling of these objects, so I hacked together a quick script. 
#
# So basically this script doesn't do any checking. It just retrieves 
# information :)
#
# Limitations:
# This script should work properly on all implementations of Linux, Solaris
# and Mac OS X.
#
# Output:
# The exit code is the exit code retrieved from OID.100.1. It is temporarily
# stored in $EXITCODE.
# The output string is the string retrieved from OID.101.1. It is tempo-
# rarily stored in $OUTPUT.
#
# Other notes:
#   If you ever run into problems with the script, set the DEBUG variable
# to 1. I'll need the output the script generates to do troubleshooting.
# See below for details.
#   I realise that all the debugging commands strewn throughout the script
# may make things a little harder to read. But in the end I'm sure it was
# well worth adding them. It makes troubleshooting so much easier. :3
#   Also, for some reason the case statement with the shifts (to detect
# passed options) doesn't seem to be working right. FIXME!
#
# Check command definition:
# define command{
#       command_name    retrieve_custom_snmp
#       command_line    $USER1$/retrieve_custom_snmp -H $HOSTADDRESS$ -o $ARG1$
#		}
#

# You may have to change this, depending on where you installed your
# Nagios plugins
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh
PROGNAME="retrieve_custom_snmp"
COMMUNITY="public"

[ `uname` == "SunOS" ] && SNMPGET="/usr/local/bin/snmpget -Oqv -v 2c -c $COMMUNITY"
[ `uname` == "Darwin" ] && SNMPGET="/usr/bin/snmpget -Oqv -v 2c -c $COMMUNITY"
[ `uname` == "Linux" ] && SNMPGET="/usr/bin/snmpget -Oqv -v 2c -c $COMMUNITY"

### DEBUGGING SETUP ###
# Cause you never know when you'll need to squash a bug or two
DEBUG="0"

if [ $DEBUG -gt 0 ]
then
        DEBUGFILE="/tmp/foobar"
        rm $DEBUGFILE >/dev/null 2>&1
fi


### REQUISITE NAGIOS COMMAND LINE STUFF ###

print_usage() {
	echo "Usage: $PROGNAME -H hostname -o OID"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "Script to retrieve the status for custom SNMP objects."
	echo ""
	echo "This plugin not developped by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1"; do
    case "$1" in
        --help)
            print_help
            exit $STATE_OK
            ;;
        -h)
            print_help
            exit $STATE_OK
            ;;
        -H)
			HOST=$2
	        shift
            ;;
        -o)
	    	OID=$2
	    	STATUS="$OID.100.1"
	    	STRING="$OID.101.1"
            shift
            ;;
        *)
            echo "Unknown argument: $1"
            print_usage
            exit $STATE_UNKNOWN
            ;;
    esac
    shift
done


### FINALLY... RETRIEVING THE VALUES ###

EXITCODE=`$SNMPGET $HOST $STATUS`
[ $DEBUG -gt 0 ] && echo "Retrieve exit code is $EXITCODE" >> $DEBUGFILE
 
OUTPUT=`$SNMPGET $HOST $STRING | sed 's/"//g'`
[ $DEBUG -gt 0 ] && echo "Retrieve status message is: $OUTPUT" >> $DEBUGFILE

echo $OUTPUT
exit $EXITCODE


kilala.nl tags: , , ,

View or add comments (curr. 0)

How do Nagios clients communicate?

2006-06-01 00:00:00

I know that, the first time I started using Nagios, I got confused a little when it came to monitoring systems other than the one running Nagios. To shed a little light on the subject for the beginning Nagios user, here's a discussion of the various methods of talking to Nagios clients.

First off, let me make it absolutely clear that, in order to monitor systems other than the one running Nagios, you are indeed going to have to communicate with them in some fashion. Unfortunately very few things in the Sysadmin trade are magical, and Nagios is unfortunately not one of them.

So first off, let's look at the -wrong- way of doing things. When I first started with Nagios (actually I made this mistake on my second day with the software) I wrote something like this:

define service{
   host_name           remote-host
   service_description D_ROOT
   check_command       check_disk!85!95!/
}

The problem with this setup is that I was using a -local- check and said it belonged to remote-host. Now this may look alright on the status screen ("Hey! It's green!"), but naturally you're not monitoring the right thing ^_^

So how -do- you monitor remote resources? Here's a table comparing various methods. After that I'll give examples on how you can correct the mistake I made above with each method.

PLEASE NOTE: the following discussion only covers the various UNIX flavours. Later on I'll write a similar article covering Windows and stuff like Cisco.


A quick comparison

                 SSH            NRPE           SNMP              SNMP traps        NSCA
Connection       Srv -> Clnt    Srv -> Clnt    Srv -> Clnt       Clnt -> Srv       Clnt -> Srv
initiation
Security         Encryption     Encryption     Access List (v2)  Access List (v2)  Encryption
                 TCP wrappers   Access List    Password (v3)     Password (v3)     Access List
                 Key pairs      TCP wrappers                     TCP wrappers      TCP wrappers
Configuration    On server      On client      On client         On client and     On client
                                                                 on server
Difficulty       Easy           Moderate       Hard              Hard              Moderate


SSH

Just about everyone should already have SSH running on their servers (except for those few who are still running telnet or, horror of horrors!, rsh). So it's safe to assume that you can immediately start using this communications method to check your clients. You will need to:

* make sure there is a user account for Nagios on each client system;

* put the check scripts in place on each client;

* set up SSH key pairs, so the Nagios user can run the checks without a password.
You can now set up your services.cfg in such a way that each remote service is checked like so:

define service{
   host_name           remote-host
   service_description D_ROOT
   check_command       check_disk_by_ssh!85!95!/
}

Your check command definition would look something like this:

define command {
   command_name check_disk_by_ssh
   command_line /usr/local/nagios/libexec/check_by_ssh -H $HOSTADDRESS$ -C "/usr/local/nagios/libexec/check_disk -w $ARG1$ -c $ARG2$ $ARG3$"
}
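
Before wiring this into Nagios, it pays to test the whole chain by hand, as the nagios user (hostname made up):

/usr/local/nagios/libexec/check_by_ssh -H remote-host -C "/usr/local/nagios/libexec/check_disk -w 85 -c 95 /"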

Working this way will allow you to do most of your configuring centrally (on the Nagios server), thus saving you a lot of work on each client system. All you have to do over there is make sure that there's a working user account and that all the scripts are in place. Quite convenient... The only drawback being that you're making a relatively open account which has full access to the system (sometimes even with sudo access).



NRPE

As a replacement for the SSH access method, Ethan also wrote the NRPE daemon. Using NRPE requires that you (roughly):

1. Install the NRPE daemon and the Nagios plugins on each client system.

2. Define the commands that NRPE is allowed to run in the client's nrpe.cfg file.

3. Install the check_nrpe plugin on the Nagios server itself.

You can now set up your services.cfg in such a way that each remote service is checked like so:

define service{
   host_name            remote-host
   service_description  D_ROOT
   check_command        check_nrpe!check_root
}

And in /usr/local/nagios/etc/nrpe.cfg on the client you would need to include:

command[check_root]=/usr/local/nagios/libexec/check_disk 85 95 /
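For completeness, the matching check_nrpe command definition on the Nagios server would look something like this (a sketch; the plugin path is an example):

define command {
   command_name check_nrpe
   command_line /usr/local/nagios/libexec/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}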

Good thing is that you won't have a semi-open account lying about. Bad things are that, if you want to change the configuration of your client, you're going to have to login. And you're going to have yet another piece of software to keep up to date.



SNMP

Whoo boy! This is something I'm working on right now at $CLIENT and let me tell you: it's hard! At least much harder than I was expecting.

SNMP is a network management protocol used by the more advanced system administrators. Using SNMP you can access just about -any- piece of equipment in your server room to read statistics, alarms and status messages. SNMP is universal and extensible, but it is also quite complicated. Not for the faint of heart.

To make proper use of monitoring through SNMP you'll need to (roughly):

A) Install and configure an SNMP daemon (Net-SNMP, for example) on each client system.

B) Extend the daemon's configuration with "exec" lines that run your monitoring scripts.

C) Register a private Enterprise ID (EID) with IANA, under which you will publish your results.

The reason why point C tells you to register a private EID is that the SNMP tree has a very rigid structure. Technically speaking you -could- just plonk down your results at a random place in the tree, but it's likely that this will screw up something else at a later time. IANA allows each company to have only one private EID, so first check whether your company already has one on the IANA list.

Unfortunately the check_snmp script that comes with Nagios isn't flexible enough to let you monitor custom SNMP objects in a nice way. This is why I wrote the retrieve_custom_snmp script, which is available from the menu. Your service definition would look like this:

define service{
   host_name            remote-host
   service_description  D_ROOT
   check_command        retrieve_custom_snmp!.1.3.6.1.4.1.6886.4.1.4
}

And in this case your snmpd.conf would contain a line like this:

exec .1.3.6.1.4.1.6886.4.1.4 check_d_root /usr/local/nagios/libexec/check_disk -w 85 -c 95 /
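After restarting the daemon you can do a quick sanity check from the Nagios server. A minimal sketch, assuming an SNMP v2c setup with the (insecure) default "public" community string:

# Walk the registered subtree; the exec results, including the check's
# exit code and output line, should show up underneath it.
snmpwalk -v 2c -c public remote-host .1.3.6.1.4.1.6886.4.1.4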

Up to now things are actually not that different from using NRPE, are they? Well, that's because we haven't even started using all the -real- features of SNMP. Point is that using SNMP you can dig very deeply into your system to retrieve all kinds of useful information. And -that's- where things get complicated because you're going to have to dig up all the object IDs (OIDs) that you're going to need. And in some cases you're going to have to install vendor specific sub-agents that know how to speak to your specific hardware.

One of the best features of SNMP though are the so-called traps. Using traps the SNMP daemon will actively undertake action when something goes wrong in your system. So if for instance your hard disk starts failing, it is possible to have the daemon send out an alert to your Nagios server! Awesome! But naturally this will require a boatload of additional configuration :(

So... SNMP is an awesomely powerful tool, but you're going to have to pay through the nose (in effort) to get it 100% perfect.



SNMP traps

SNMP doesn't involve polling alone. SNMP-enabled devices can also be configured to automatically send status updates to a so-called trap host. The downside to receiving SNMP traps with Nagios is that it takes quite a lot of work to get them into Nagios :D

To make proper use of SNMP traps you'll need to (roughly):

A) Configure the SNMP daemon on each client to send its traps to the Nagios server.

B) Run a trap receiver, such as Net-SNMP's snmptrapd, on the Nagios server.

C) Translate each incoming trap into a passive check result that Nagios understands.

There are -many- ways to get the SNMP traps translated for Nagios' purposes, 'cause there's many roads that lead to Rome. Unfortunately none of them are very easy to use; a rough sketch of one such road follows below.
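For instance, Net-SNMP's snmptrapd can hand every incoming trap to a glue script, which in turn writes a passive check result into the Nagios command file. A minimal sketch (the script name, the service name "TRAP" and the fixed CRITICAL severity are my assumptions, not a finished solution). In /etc/snmp/snmptrapd.conf:

traphandle default /usr/local/bin/trap2nagios.sh

And the hypothetical glue script itself:

#!/bin/bash
# /usr/local/bin/trap2nagios.sh
# snmptrapd feeds us the sending hostname and IP address on the first two
# lines of stdin, followed by the trap's variable bindings.
read host
read ip
NOW=`date +%s`
# Submit a passive CRITICAL result for service "TRAP" on that host.
echo "[$NOW] PROCESS_SERVICE_CHECK_RESULT;$host;TRAP;2;Trap received from $ip" \
     >> /usr/local/nagios/var/rw/nagios.cmd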



NSCA

And finally there's NSCA. This daemon is usually used by distributed Nagios servers to send their results to the central Nagios server, which gathers them as so-called "passive checks". It is however entirely possible to install NSCA on each of your Nagios clients, which will then get called to send in the results of local checks. In this case you'll need to (roughly):

1. Run the NSCA daemon on your central Nagios server.

2. Install the send_nsca client and the Nagios plugins on each client system.

3. Schedule the local checks on each client (through cron, for instance) and pipe their results into send_nsca.

4. Define the matching services on the Nagios server as passive checks.

On your Nagios server things would look like this:

define service{
   host_name               remote-host
   service_description     D_ROOT
   check_command           check_disk!85!95!/
   passive_checks_enabled  1
   active_checks_enabled   0
}

For the configuration on the client side I recommend that you read up on NSCA. It's a little bit too much to show over here.
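Still, to give you a taste, a client-side cron job could look something like this (a sketch: the server name and paths are examples; the tab-separated input format comes from the NSCA documentation):

#!/bin/bash
# Run the local check, then ship its result off to the central server.
OUTPUT=`/usr/local/nagios/libexec/check_disk -w 85 -c 95 /`
CODE=$?
printf "remote-host\tD_ROOT\t%d\t%s\n" $CODE "$OUTPUT" | \
    /usr/local/nagios/bin/send_nsca -H nagios-server -c /usr/local/nagios/etc/send_nsca.cfg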

The upside to this is that you won't have to run any daemon on your client to accept incoming connections. This allows you to lock down your system quite tightly.


Naturally you are absolutely free to combine two or more of the methods described above. You could poll through NRPE and receive SNMP traps in one environment. This will have both ups and downs, but it's up to your own discretion. Use the tools that feel natural to you, or use those that are already standard in your environment.

I realise I've rushed through things a little bit, but I was in a slight hurry :) I will go over this article a second time RSN, to apply some polish.


kilala.nl tags: , , , ,

View or add comments (curr. 1)

Nagios script: check_named

2006-06-01 00:00:00

This script was written at the time I was hired by UPC / Liberty Global.

Basic monitor to check whether BIND is up and running. It checks for a number of processes and tries to perform a basic lookup using the localhost.

This script was quickly hacked together for my current customer, as a Q&D solution for their monitoring needs. It's no beauty, but it works. Written in ksh and tested with:

A Critical is sent if:

A) one or more of the required processes is not running, or

B) the script is unable to perform a basic lookup using the localhost.

UPDATE 19/06/2006:

Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!


#!/usr/bin/bash
#
# DNS / Named process monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 19-06-2006
# 
# Usage: ./check_named
#
# Description:
# This plugin determines whether the named DNS server
# is running properly. It will check the following:
# * Are all required processes running?
# * Is it possible to make DNS requests?
#
# Limitations:
# Currently this plugin will only function correctly on Solaris systems.
#
# Output:
# The script returns a CRIT when the abovementioned criteria are
# not matched.
#

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
        echo "WARNING:"
        echo "This script was originally written for use on Solaris."
        echo "You may run into some problems running it on this host."
        echo ""
        echo "Please verify that the script works before using it in a"
        echo "live environment. You can easily disable this message after"
        echo "testing the script."
        echo ""
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

print_usage() {
	echo "Usage: $PROGNAME"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "Named DNS monitor plugin for Nagios"
	echo ""
	echo "This plugin not developped by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
	case "$1" in
	  --help) print_help; exit $STATE_OK;;
	  -h) print_help; exit $STATE_OK;;
	  *) print_usage; exit $STATE_UNKNOWN;;
	esac
done

check_processes()
{
	PROCESS="0"
	if [ `ps -ef | grep named | grep -v grep | grep -v nagios | wc -l` -lt 1 ]; then 
		echo "NAMED NOK - One or more processes not running"
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_service()
{
	SERVICE=0
	nslookup www.google.com localhost >/dev/null 2>&1
	if [ $? -ne 0 ]; then SERVICE=1;fi

	if [ $SERVICE -eq 1 ]; then 
		echo "NAMED NOK - Unable to perform a lookup through the localhost."
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_processes
check_service

echo "NAMED OK - Everything running like it should"
exitstatus=$STATE_OK
exit $exitstatus

kilala.nl tags: , , ,

View or add comments (curr. 0)

Nagios script: check_log3

2006-06-01 00:00:00

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

Today I made an improved version of the Nagios monitor "check_log2", which is now aptly called "check_log3". It includes all the improvements I originally added to "check_log2", so you can simply use this as a drop-in replacement.

Version 3 of this script gives you the option to add a second query to the monitor.

The previous two incarnations of the script only allowed you to search for one query and would return a Critical if it was found. Now you can also add a query which will result in a Warning message as well. Goody! :3
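A hypothetical invocation (file locations and query strings are examples) would be:

./check_log3 -F /var/log/messages -O /usr/local/nagios/var/messages.old -C "FATAL" -W "WARNING"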

1st of Feb, 2006:

Kyle Tucker pointed out that he had problems running this script with bash on Solaris. The changes he suggested have been worked into the newer version. Thanks Kyle :)

5th of Mar, 2006:

I finally got round to fixing the script according to all the changes Kyle (and others) suggested. So here's another try! Right now I've tested the script on Red Hat, Mac OS X and Solaris, so it should be much better than before.

19th of June, 2006:

Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!

Also stomped out a few horrendous bugs! I'm very sorry for putting out such a buggy script earlier... If you've started using the script in your environment, please download the latest version. Thanks to Ali Khan for pointing out these mistakes.



#!/bin/bash
#
# Log file pattern detector plugin for Nagios
# Written by Ethan Galstad (nagios@nagios.org)
# Last Modified: 07-31-1999
# Heavily modified by Thomas Sluyter (nagiosATkilalaDOTnl)
# Last Modified: 19-06-2006
#
# Usage: ./check_log3 -F log_file -O old_log_file -C crit-pattern -W warn-pattern
#
# Description:
#
# This plugin will scan a log file (specified by the log_file option)
# for specific patterns (specified by the XXX-pattern options).  Successive
# calls to the plugin script will only report *new* pattern matches in the
# log file, since a copy of the log file from the previous run is saved
# to old_log_file.
#
# Output:
#
# On the first run of the plugin, it will return an OK state with a message
# of "Log check data initialized".  On successive runs, it will return an OK
# state if *no* pattern matches have been found in the *difference* between the
# log file and the older copy of the log file.  If the plugin detects any 
# pattern matches in the log diff, it will return a CRITICAL state and print
# out a message in the following format: "(x) last_match", where "x" is the
# total number of pattern matches found in the file and "last_match" is the
# last entry in the log file which matches the pattern.
#
# Notes:
#
# If you use this plugin make sure to keep the following in mind:
#
#    1.  The "max_attempts" value for the service should be 1, as this
#        will prevent Nagios from retrying the service check (the
#        next time the check is run it will not produce the same results).
#
#    2.  The "notify_recovery" value for the service should be 0, so that
#        Nagios does not notify you of "recoveries" for the check.  Since
#        pattern matches in the log file will only be reported once and not
#        the next time, there will always be "recoveries" for the service, even
#        though recoveries really don't apply to this type of check.
#
#    3.  You *must* supply a different old_file_log for each service that
#        you define to use this plugin script - even if the different services
#        check the same log_file for pattern matches.  This is necessary
#        because of the way the script operates.
#
#    4.  Changes to the script were made by Thomas Sluyter (cailin@kilala.nl).
#	 * The first set of changes will allow the script to run properly on Solaris, which
#	   it did not do by default. The second set of changes will allow the following:
#	 * State retention. In the original script, if a NOK was put into the log file
#	   at point A in time and it is not repeated at A+1, then an OK is sent to Nagios. 
# 	   Not something that you would like to happen.
#	      I've added the $oldlog.STATE trigger file which retains the last exitstatus. Should
# 	   there be no new lines added to the log, check_log will simply repeat the last state
#	   instead of giving an OK.
#	      In order for this state retention to work properly your client system MUST
#	   HAVE THE DIRECTORY /USR/LOCAL/NAGIOS/VAR.
#        * Two queries. In the original script you could only enter one query which, when
#	   found, would result in  a Critical message being sent to Nagios. I've added the 
#	   possibility to add another query, which will result in a Warning message.
#	 * Bugfix: changed all instances of "crit-count" and "warn-count" to "critcount" and
#	   "warncount" after a tip from Kyle Tucker who ran into problems running this script
#	   with bash on Solaris.
#

# Paths to commands used in this script.  These
# may have to be modified to match your system setup.

PATH="/usr/bin:/usr/sbin:/bin:/sbin"

PROGNAME=`basename $0`
PROGPATH=`echo $0 | sed -e 's,[\\/][^\\/][^\\/]*$,,'`

#. $PROGPATH/utils.sh
. /usr/local/nagios/libexec/utils.sh

print_usage() {
    echo "Usage: $PROGNAME -F logfile -O oldlog -C CRITquery -W WARNquery"
    echo "Usage: $PROGNAME --help"
    echo "Usage: $PROGNAME --version"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "Log file pattern detector plugin for Nagios"
    echo ""
    support
}

# Make sure the correct number of command line
# arguments have been supplied

if [ $# -lt 8 ]; then
    print_usage
    exit $STATE_UNKNOWN
fi

# Grab the command line arguments

exitstatus=$STATE_WARNING #default
while test -n "$1"; do
    case "$1" in
        --help)
            print_help
            exit $STATE_OK
            ;;
        -h)
            print_help
            exit $STATE_OK
            ;;
        -F)
            logfile=$2
            shift
            ;;
        -O)
            oldlog=$2
            shift
            ;;
        -C)
            CRITquery=$2
            shift
            ;;
        -W)
            WARNquery=$2
            shift
            ;;
        *)
            echo "Unknown argument: $1"
            print_usage
            exit $STATE_UNKNOWN
            ;;
    esac
    shift
done

# If the source log file doesn't exist, exit

if [ ! -e $logfile ]; then
    echo "Log check error: Log file $logfile does not exist!"
    echo $STATE_UNKNOWN > $oldlog.STATE
    exit $STATE_UNKNOWN
fi

# If the dump/temp log file doesn't exist, this must be the first time
# we're running this test, so copy the original log file over to
# the old diff file and exit

if [ ! -e $oldlog ]; then
    cat $logfile > $oldlog

    TEMPcount=0
    let TEMPcount=$TEMPcount+$(tail -1 $logfile | grep -i "$WARNquery" | wc -l | awk '{print $1}')
    let TEMPcount=$TEMPcount+$(tail -1 $logfile | grep -i "$CRITquery" | wc -l | awk '{print $1}')

    if [ $TEMPcount -gt 0 ]
    then
       echo "Log check data initialized... Last line contained error message."
       echo $STATE_WARNING > $oldlog.STATE
       exit $STATE_WARNING
    else
       echo "Log check data initialized..."
       echo $STATE_OK > $oldlog.STATE
       exit $STATE_OK
    fi
fi

# A bug which was caught very late:
# If newlog is shorter than oldlog, the diff used below will return
# false positives for the query, because the old entries will still be
# in $oldlog. Why? Because $oldlog is not rolled over / rotated like
# $newlog is. I need to fix this in a kludgy way.

if [ `wc -l $logfile|awk '{print $1}'` -lt `wc -l $oldlog|awk '{print $1}'` ]
then
    rm $oldlog
    cat $logfile > $oldlog
    TEMPcount=0
    let TEMPcount=$TEMPcount+$(tail -1 $logfile | grep -i "$WARNquery" | wc -l | awk '{print $1}')
    let TEMPcount=$TEMPcount+$(tail -1 $logfile | grep -i "$CRITquery" | wc -l | awk '{print $1}')

    if [ $TEMPcount -gt 0 ]
    then
       echo "Log check data initialized... Last line contained error message."
       echo $STATE_WARNING > $oldlog.STATE
       exit $STATE_WARNING
    else
       echo "Log check data initialized..."
       echo $STATE_OK > $oldlog.STATE
       exit $STATE_OK
    fi
fi

# The oldlog file exists, so compare it to the original log now

# The temporary file that the script should use while
# processing the log file.
if command -v mktemp >/dev/null 2>&1; then
    tempdiff=`mktemp /tmp/check_log.XXXXXXXXXX`
else
    tempdate=`/bin/date '+%H%M%S'`
    tempdiff="/tmp/check_log.${tempdate}"
    touch $tempdiff
fi

diff $logfile $oldlog > $tempdiff

if [ `wc -l $tempdiff | awk '{print $1}'` -eq 0 ]
then
     rm $tempdiff
     touch $oldlog.STATE
     exitstatus=`cat $oldlog.STATE`
     echo "LOG FILE - No status change detected. Status = $exitstatus"
     exit $exitstatus
fi

# Count the number of matching log entries we have
CRITcount=`grep -c "$CRITquery" $tempdiff`
WARNcount=`grep -c "$WARNquery" $tempdiff`

# Get the last matching entry in the diff file
CRITlastentry=`grep "$CRITquery" $tempdiff | tail -1`
WARNlastentry=`grep "$WARNquery" $tempdiff | tail -1`

rm $tempdiff
cat $logfile > $oldlog

if [ "$CRITcount" -gt 0 ]; then
    	echo "($CRITcount) $CRITlastentry"
    	echo $STATE_CRITICAL > $oldlog.STATE
	exit $STATE_CRITICAL
fi

if [ "$WARNcount" -gt 0 ]; then
    	echo "($WARNcount) $WARNlastentry"
    	echo $STATE_WARNING > $oldlog.STATE
	exit $STATE_WARNING
fi

echo "Log check ok - 0 pattern matches found"
exit $STATE_OK



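# Quick test harness for check_log3: exercises state retention and log
# rotation handling. Expected exit codes: 0=OK, 1=WARNING, 2=CRITICAL.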
echo "Starting clean"
rm /tmp/foobar /usr/local/nagios/var/foobar*
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "Starting normally"
echo "baka"
echo "normal" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "baka"
echo "normal" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "warning"
echo "bla" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "critical"
echo "neko" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "warning"
echo "bla" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "Log rotation with crit"
rm /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "critical"
echo "neko" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "Log rotation with warn"
rm /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "warning"
echo "bla" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "Normal log rotation"
rm /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""


kilala.nl tags: , , ,

View or add comments (curr. 2)

Nagios script: check_log2

2006-06-01 00:00:00

This script was written at the time I was hired by UPC / Liberty Global.

Improved log checker for Solaris, with state retention.

I found that the version of check_log included in the default monitor package doesn't work perfectly on Solaris: it needs a bit of tweaking... which is what I've done in this version of the script.

Also, I've added state retention. It's a bit of a hack, but hey! I needed a quick solution.

The original script sends a Critical when it detects the string you've queried the log file for, but it clears that same Critical immediately if the same message is not repeated once the monitor runs again. Meaning that, if there are no updates to your log file, the Critical will only be around until the next time the monitor runs.

Not very handy if the Critical occurs during the night.

This new version of the script creates a file called $oldlog.STATE in /usr/local/nagios/var (which should be 755, nagios:nagios), which contains the exit status for the last detected _changed_ status... If there are no changes detected in your log file, this old exit state is repeated.

The script has been tested on Solaris 8, Mac OS X 10.4 and Redhat ES3.

UPDATE 19/06/2006:

Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!

Also stomped out a few horrendous bugs! I'm very sorry for putting out such a buggy script earlier... If you've started using the script in your environment, please download the latest version. Thanks to Ali Khan for pointing out these mistakes.


#!/bin/bash
#
# Log file pattern detector plugin for Nagios
# Written by Ethan Galstad (nagios@nagios.org)
# Last Modified: 07-31-1999
# Updated by Thomas Sluyter (nagiosATkilalaDOTnl)
# Last Modified: 19-06-2006
#
# Usage: ./check_log2 -F log_file -O old_log_file -Q pattern
#
# Description:
#
# This plugin will scan a log file (specified by the log_file option)
# for a specific pattern (specified by the pattern option).  Successive
# calls to the plugin script will only report *new* pattern matches in the
# log file, since a copy of the log file from the previous run is saved
# to old_log_file.
#
# Output:
#
# On the first run of the plugin, it will return an OK state with a message
# of "Log check data initialized".  On successive runs, it will return an OK
# state if *no* pattern matches have been found in the *difference* between the
# log file and the older copy of the log file.  If the plugin detects any 
# pattern matches in the log diff, it will return a CRITICAL state and print
# out a message in the following format: "(x) last_match", where "x" is the
# total number of pattern matches found in the file and "last_match" is the
# last entry in the log file which matches the pattern.
#
# Notes:
#
# If you use this plugin make sure to keep the following in mind:
#
#    1.  The "max_attempts" value for the service should be 1, as this
#        will prevent Nagios from retrying the service check (the
#        next time the check is run it will not produce the same results).
#
#    2.  The "notify_recovery" value for the service should be 0, so that
#        Nagios does not notify you of "recoveries" for the check.  Since
#        pattern matches in the log file will only be reported once and not
#        the next time, there will always be "recoveries" for the service, even
#        though recoveries really don't apply to this type of check.
#
#    3.  You *must* supply a different old_file_log for each service that
#        you define to use this plugin script - even if the different services
#        check the same log_file for pattern matches.  This is necessary
#        because of the way the script operates.
#
#    4.  Changes to the script were made by Thomas Sluyter (nagios@kilala.nl).
#	 The first set of changes will allow the script to run properly on Solaris, which
#	 it did not do by default. The second set of changes will allow the following:
#	 * State retention. If a NOK was generated at point A in time and it is not repeated
# 	   at A+1, then an OK is sent to Nagios. Not something that you would like to happen.
#	   I've added the $oldlog.STATE trigger file which retains the last exitstatus. Should
# 	   there be no new lines added to the log, check_log will simply repeat the last state
#	   instead of giving an OK.
#
# Examples:
#
# Check for login failures in the syslog...
#
#   check_log -F /var/log/messages -O /usr/local/nagios/var/check_log.badlogins.old -Q "LOGIN FAILURE"
#
# Check for port scan alerts generated by Psionic's PortSentry software...
#
#   check_log -F /var/log/messages -O /usr/local/nagios/var/check_log.portscan.old -Q "attackalert"
#

# Paths to commands used in this script.  These
# may have to be modified to match your system setup.

PATH="/usr/bin:/usr/sbin:/bin:/sbin"

PROGNAME=`basename $0`
PROGPATH=`echo $0 | sed -e 's,[\\/][^\\/][^\\/]*$,,'`

#. $PROGPATH/utils.sh
. /usr/local/nagios/libexec/utils.sh

print_usage() {
    echo "Usage: $PROGNAME -F logfile -O oldlog -Q query"
    echo "Usage: $PROGNAME --help"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "Log file pattern detector plugin for Nagios"
    echo ""
    support
}

# Make sure the correct number of command line
# arguments have been supplied

if [ $# -lt 6 ]; then
    print_usage
    exit $STATE_UNKNOWN
fi

# Grab the command line arguments

exitstatus=$STATE_WARNING #default
while test -n "$1"; do
    case "$1" in
        --help)
            print_help
            exit $STATE_OK
            ;;
        -h)
            print_help
            exit $STATE_OK
            ;;
        -F)
            logfile=$2
            shift
            ;;
        -O)
            oldlog=$2
            shift
            ;;
        -Q)
            query=$2
            shift
            ;;
        *)
            echo "Unknown argument: $1"
            print_usage
            exit $STATE_UNKNOWN
            ;;
    esac
    shift
done

# If the source log file doesn't exist, exit

if [ ! -e $logfile ]; then
    echo "Log check error: Log file $logfile does not exist!"
    echo $STATE_UNKNOWN > $oldlog.STATE
    exit $STATE_UNKNOWN
fi

# If the oldlog file doesn't exist, this must be the first time
# we're running this test, so copy the original log file over to
# the old diff file and exit

if [ ! -e $oldlog ]; then
    cat $logfile > $oldlog
    if [ `tail -1 $logfile | grep -i "$query" | wc -l` -gt 0 ]
    then
        echo "Log check data initialized... Last line contained error message."
        echo $STATE_CRITICAL > $oldlog.STATE
	exit $STATE_CRITICAL
    else
        echo "Log check data initialized..."
        echo $STATE_OK > $oldlog.STATE
        exit $STATE_OK
    fi
fi

# A bug which was caught very late:
# If newlog is shorter than oldlog, the diff used below will return
# false positives for the query, because the old entries will still be
# in $oldlog. Why? Because $oldlog is not rolled over / rotated like
# $newlog is. I need to fix this in a kludgy way.

if [ `wc -l $logfile|awk '{print $1}'` -lt `wc -l $oldlog|awk '{print $1}'` ]
then
    rm $oldlog
    cat $logfile > $oldlog
    if [ `tail -1 $logfile | grep -i "$query" | wc -l` -gt 0 ]
    then
        echo "Log check data re-initialized... Last line contained error message."
        echo $STATE_CRITICAL > $oldlog.STATE
	exit $STATE_CRITICAL
    else
        echo "Log check data re-initialized..."
        echo $STATE_OK > $oldlog.STATE
        exit $STATE_OK
    fi
fi

# Everything seems fine, so compare it to the original log now

# The temporary file that the script should use while
# processing the log file.
if command -v mktemp >/dev/null 2>&1; then
    tempdiff=`mktemp /tmp/check_log.XXXXXXXXXX`
else
    tempdate=`/bin/date '+%H%M%S'`
    tempdiff="/tmp/check_log.${tempdate}"
    touch $tempdiff
fi

diff $logfile $oldlog > $tempdiff

if [ `wc -l $tempdiff|awk '{print $1}'` -eq 0 ]
then
     rm $tempdiff
     touch $oldlog.STATE
     exitstatus=`cat $oldlog.STATE`
     echo "LOG FILE - No status change detected. Status = $exitstatus"
     exit $exitstatus
fi

# Count the number of matching log entries we have
count=`grep -c "$query" $tempdiff`

# Get the last matching entry in the diff file
lastentry=`grep "$query" $tempdiff | tail -1`

rm -f $tempdiff
cat $logfile > $oldlog

if [ "$count" = "0" ]; then # no matches, exit with no error
    echo "Log check ok - 0 pattern matches found"
    exitstatus=$STATE_OK
else # Print total matche count and the last entry we found
#    echo "($count) $lastentry"
    echo "Log check NOK - $lastentry"
    exitstatus=$STATE_CRITICAL
    echo $STATE_CRITICAL > $oldlog.STATE
fi

exit $exitstatus


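# Quick test harness for check_log2: exercises state retention and log
# rotation handling. Expected exit codes: 0=OK, 2=CRITICAL.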
echo "Starting clean"
rm /tmp/foobar /usr/local/nagios/var/foobar*
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""

echo "Starting normally"
echo "normal"
echo "normal" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""
echo "normal"
echo "normal" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""
echo "critical"
echo "neko" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""
echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""

echo "Log rotation with crit"
rm /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""
echo "critical"
echo "neko" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""
echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""

echo "Normal log rotation"
rm /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""
echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""
echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""


kilala.nl tags: , , ,

View or add comments (curr. 2)

Hacked admin mode into Syslog-ng

2005-11-22 11:09:00

At $CLIENT I've built a centralised logging environment based on Syslog-ng, combined with MySQL. To make anything useful out of all the data going into the database we use PHP-syslog-ng. However, I've found a bit of a flaw in that software: any account you create has the ability to add, remove or change other accounts... which kinda makes things insecure.

So yesterday was spent teaching myself PHP and MySQL to such a degree that I'd be able to modify the guy's source code. In the end I managed to bolt on some sort of "admin-mode" which allows you to set an "admin" flag on certain user accounts (thus giving them the capabilities mentioned above).

The updated PHP files can be found in the TAR-ball in the menu of the Sysadmin section. The only thing you'll need to do to make things work is to either:

1. Re-create your databases using the dbsetup.sql script.

2. Add the "admin" column to the "users" table using the following command. ALTER TABLE users ADD COLUMN baka BOOLEAN;


kilala.nl tags: , , , ,

View or add comments (curr. 0)

Added Nagios plugins

2005-09-11 01:00:00

I've added all the custom Nagios monitors I wrote for $CLIENT. They might come in handy for any of you. They're not beauties, but they get the job done.


kilala.nl tags: , , , ,

View or add comments (curr. 0)

Nagios and BoKS/Keon

2005-09-11 00:47:00

Major updates in the Sysadmin section! w00t!

In this case: a lot of information on one of my favourite security tools and on Nagios, my new-found love on the monitoring front.


kilala.nl tags: , , , ,

View or add comments (curr. 0)

Jumpstart, FLAR and console servers

2005-07-01 15:22:00

Currently at the office, so I'll make it a quick one :3

Unfortunately I've been making longer days than I should this week. I mean, it's not a horrendous amount of hours, but still I'd rather be at home relaxing. This week has seen the people in charge at $CLIENT up the prio on a centralised Jumpstart/FLAR server, which I was supposed to deliver. I was already working on it part time, but now they have me working on it full time. It's quite a lot of fun, since I get to work together with other departments within $CLIENT, thus making more friends and allies ^_^

I also had to struggle with Perle IOLan+ terminal servers this week, since we need to be able to use the serial management port on our Sun servers. Yes, admittedly these boxen do work for this purpose, but I'd rather have a proper console server instead of a piece of kit which was originally meant as a dial-in box for dumb terminals or modems. Let's just say that I dream of Cyclades.

Oh! Last Wednesday was my birthday by the way... I've hit 26 now :3 We went out for a lovely dinner at Konichi wa in Utrecht, since we wanted to try out a different Japanese restaurant for a change. I must say: their price/quality ratio is really good! If you're ever in the neighbourhood of Utrecht and feel like Japanese, head over there! They're at Mariaplaats 9. BTW, they don't just do Teppanyaki... They also serve excellent sushi and will make you _ramen_ or _udon_ noodles if you ask nicely!!! My new favourite restaurant :9


kilala.nl tags: , , ,

View or add comments (curr. 0)

Nagios script: check_squid

2005-07-01 00:00:00

This script was written at the time I was hired by UPC / Liberty Global.

The text I wrote on Nagios Exchange about this script has been lost. I guess it speaks for itself :)



#!/usr/bin/bash
#
# Squid process monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 19-06-2006
# 
# Usage: ./check_squid
#
# Description:
# This plugin determines whether the Squid proxy server
# is running properly. It will check the following:
# * Are all required processes running?
# * Are all the required TCP/IP ports open?
#
# Limitations:
# Currently this plugin will only function correctly on Solaris systems.
#
# Output:
# The script returns a CRIT when the abovementioned criteria are
# not matched
#

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
        echo "WARNING:"
        echo "This script was originally written for use on Solaris."
        echo "You may run into some problems running it on this host."
        echo ""
        echo "Please verify that the script works before using it in a"
        echo "live environment. You can easily disable this message after"
        echo "testing the script."
        echo ""
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

print_usage() {
	echo "Usage: $PROGNAME"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "Squid monitor plugin for Nagios"
	echo ""
	echo "This plugin not developped by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
	case "$1" in
	  --help) print_help; exit $STATE_OK;;
	  -h) print_help; exit $STATE_OK;;
	  *) print_usage; exit $STATE_UNKNOWN;;
	esac
done

check_processes()
{
	PROCESS="0"
	if [ `ps -ef | grep squid | grep -v grep | grep -v nagios | wc -l` -lt 2 ]; then 
		echo "SQUID NOK - One or more processes not running"
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_ports()
{
	PORTS=0
	PORTLIST="8080 3128 3130"
	for NUM in `echo $PORTLIST`; do
	if [ `netstat -an | grep LISTEN | grep $NUM | grep -v grep | wc -l` -lt 1 ]; then PORTS=1;fi
	done

	if [ $PORTS -eq 1 ]; then 
		echo "SQUID NOK - One or more TCP/IP ports not listening."
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_processes
check_ports

echo "SQUID OK - Everything running like it should"
exitstatus=$STATE_OK
exit $exitstatus


kilala.nl tags: , , ,

View or add comments (curr. 0)

Nagios script: check_retro_client

2005-07-01 00:00:00

This script was written at the time I was hired by UPC / Liberty Global.

Basic monitor that checks if the Retrospect client is up and running.

This script was quickly hacked together for my current customer, as a Q&D solution for their monitoring needs. It's no beauty, but it works. Written in ksh and tested with:

The script sends a Critical if the required process is not running.

UPDATE 19/06/2006:

Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!



#!/usr/bin/bash
#
# Retrospect Backup Client monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 19-06-2006
# 
# Usage: ./check_retro_client
#
# Description:
# This plugin determines whether the Retrospect backup client 
# is running properly. It will check the following:
# * Are all required processes running?
#
# Limitations:
# Currently this plugin will only function correctly on Solaris systems.
#
# Output:
# The script returns a CRIT when the abovementioned criteria are
# not matched
#

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
        echo "WARNING:"
        echo "This script was originally written for use on Solaris."
        echo "You may run into some problems running it on this host."
        echo ""
        echo "Please verify that the script works before using it in a"
        echo "live environment. You can easily disable this message after"
        echo "testing the script."
        echo ""
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

print_usage() {
	echo "Usage: $PROGNAME"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "Retrospect Backup Client monitor plugin for Nagios"
	echo ""
	echo "This plugin not developped by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
	case "$1" in
	  --help) print_help; exit $STATE_OK;;
	  -h) print_help; exit $STATE_OK;;
	  *) print_usage; exit $STATE_UNKNOWN;;
	esac
done

check_processes()
{
	PROCESS="0"
	if [ `ps -ef | grep retroclient | grep -v grep | grep -v nagios | wc -l` -lt 1 ]; then 
		echo "RETROSPECT NOK - One or more processes not running"
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_processes

echo "RETROSPECT OK - Everything running like it should"
exitstatus=$STATE_OK
exit $exitstatus


kilala.nl tags: , , ,

View or add comments (curr. 0)

Nagios script: check_postfix

2005-07-01 00:00:00

This script was written at the time I was hired by UPC / Liberty Global.

Basic monitor that checks if Postfix is up and running. It checks for a number of processes and ports.

This script was quickly hacked together for my current customer, as a Q&D solution for their monitoring needs. It's no beauty, but it works. Written in ksh and tested with:

The script sends a Critical if:

A) One or more processes are not running, or

B) One or more ports are not available for connections.

UPDATE 19/06/2006:

Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!



#!/usr/bin/bash
#
# Postfix process monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 19-06-2006
# 
# Usage: ./check_postfix
#
# Description:
# This plugin determines whether the Postfix SMTP server
# is running properly. It will check the following:
# * Are all required processes running?
# * Are all the required TCP/IP ports open?
#
# Limitations:
# Currently this plugin will only function correctly on Solaris systems.
#
# Output:
# Script returns a CRIT when one of the abovementioned criteria is 
# not matched
#

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
        echo "WARNING:"
        echo "This script was originally written for use on Solaris."
        echo "You may run into some problems running it on this host."
        echo ""
        echo "Please verify that the script works before using it in a"
        echo "live environment. You can easily disable this message after"
        echo "testing the script."
        echo ""
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

print_usage() {
	echo "Usage: $PROGNAME"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "Postfix monitor plugin for Nagios"
	echo ""
	echo "This plugin not developped by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
	case "$1" in
	  --help) print_help; exit $STATE_OK;;
	  -h) print_help; exit $STATE_OK;;
	  *) print_usage; exit $STATE_UNKNOWN;;
	esac
done

check_processes()
{
	PROCESS="0"
	PROCLIST="smtpd qmgr pickup master sendmail"
	for PROC in `echo $PROCLIST`; do
	if [ `ps -ef | grep $PROC | grep -v grep | wc -l` -lt 1 ]; then 
		if [ $PROC == "smtpd" ]; then
			if [ `ps -ef | grep proxymap | grep -v grep | wc -l` -lt 1 ]; then
				PROCESS=1
			else
				PROCESS=0
			fi
		else
			PROCESS=1
		fi
	fi
	done

	if [ $PROCESS -eq 1 ]; then 
		echo "SMTP-S NOK - One or more processes not running"
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_ports()
{
	PORTS="0"
	PORTLIST="25"
	for NUM in `echo $PORTLIST`; do
	if [ `netstat -an | grep LISTEN | grep $NUM | grep -v grep | wc -l` -lt 1 ]; then PORTS=1;fi
	done

	if [ $PORTS -eq 1 ]; then 
		echo "SMTP-S NOK - One or more TCP/IP ports not listening."
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_processes
check_ports

echo "SMTP-S OK - Everything running like it should"
exitstatus=$STATE_OK
exit $exitstatus



kilala.nl tags: , , ,

View or add comments (curr. 0)

Nagios script: check_ntp_s

2005-07-01 00:00:00

This script was written at the time I was hired by UPC / Liberty Global.

Basic monitor that checks if the NTP server is up and running. It checks for a process and whether the server has drifted too far from its higher level Stratum server.

This script was quickly hacked together for my current customer, as a Q&D solution for their monitoring needs. It's no beauty, but it works. Written in ksh and tested with:

The script sends a Critical if:

A) One or more processes are not running, or

B) The server's clock has drifted too far from its higher level Stratum server.

Requires the "check_ntp" plugin which is part of the default monitor package.

UPDATE 19/06/2006:

Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!



#!/usr/bin/bash
#
# NTP server process monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 19-06-2006
# 
# Usage: ./check_ntp_s
#
# Description:
# This plugin determines whether the Nagios client is functioning 
# properly as an NTP server. It does this by checking:
# * Are all required processes running?
# * Is the server's time up to scratch with its higher stratum server?
#
# Limitations:
# Currently this plugin will only function correctly on Solaris systems.
#
# Output:
# The script returns a CRIT when one of the abovementioned criteria
# is not matched.
#

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
        echo "WARNING:"
        echo "This script was originally written for use on Solaris."
        echo "You may run into some problems running it on this host."
        echo ""
        echo "Please verify that the script works before using it in a"
        echo "live environment. You can easily disable this message after"
        echo "testing the script."
        echo ""
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

print_usage() {
	echo "Usage: $PROGNAME"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "NTP server plugin for Nagios"
	echo ""
	echo "This plugin not developped by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
	case "$1" in
	  --help) print_help; exit $STATE_OK;;
	  -h) print_help; exit $STATE_OK;;
	  *) print_usage; exit $STATE_UNKNOWN;;
	esac
done

check_processes()
{
	PROCESS="0"
	if [ `ps -ef | grep xntpd | grep -v grep | grep -v nagios | wc -l` -lt 1 ]; then PROCESS=1;fi
	if [ $PROCESS -eq 1 ]; then 
		echo "NTP-S NOK - One or more processes not running"
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_time()
{
	TIME="0"
	#SERVERS="ntp0.nl.net ntp1.nl.net ntp2.nl.net"
	SERVERS="nl-ams99z-a02-01"
	for SERV in `echo $SERVERS`; do
		if [ `/usr/local/nagios/libexec/check_ntp -H $SERV | awk '{print $2}'` != "OK:" ]; then
			TIME=1
		else
			TIME=0
			break
		fi
	done
	if [ $TIME -eq 1 ]; then
		echo "NTP-S NOK - Time not in synch with higher Stratum."
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_processes
check_time

echo "NTP-S OK - Everything running like it should"
exitstatus=$STATE_OK
exit $exitstatus


kilala.nl tags: , , ,

View or add comments (curr. 1)

Nagios script: check_load2

2005-07-01 00:00:00

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

We are currently in the process of distributing a standard set of Nagios monitoring scripts to over 300 client systems. Among the metrics we would like to monitor are the three load averages (or as Dr. Gunther calls them: the LaLaLa triplets).

Since these 300 servers aren't all alike, we are bound to run into systems with one, two, four, eight or more processors. That means there is no nice way of making one standard configuration, since you would have to define separate LA levels for WARN and CRIT per CPU count. Why? Because a quad-CPU system can take much more load than a single-core system.

One way to get around this would be by defining separate host groups, based on the amount of processors in a system. You could then define a unique check_load command for each CPU host group.

I've gone the other way around though...

My work-around for this is to replace check_load with check_load2. This script takes no command line parameters and works on the basis of standard multipliers: the number of processors multiplied by a certain factor (150%? 200%? and so on) is, in our opinion, a good enough way to define the WARN and CRIT levels. For example, with a 1-minute WARN factor of 2.00, a four-CPU system starts warning once its 1-minute load average reaches 4 x 2.00 = 8.00. These multipliers can easily be modified (at the top of the script) to fit what -you- think is a worrying level of activity.
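Under the hood the comparison is done with bc, along these lines (a stand-alone sketch with made-up numbers):

# 4 CPUs, 1-minute WARN factor 2.00, current 1-minute load 8.50.
# bc prints 1 when the load meets or exceeds the level, 0 otherwise.
echo "if ((4 * 2.00) <= 8.50) 1; if ((4 * 2.00) > 8.50) 0" | bc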

This script was tested on Redhat ES3, Solaris 8 and Mac OS X 10.4. It should run on other versions of these OSes as well.

EDIT:

Oh! Just like my other recent Nagios scripts, check_load2 comes with a debugging option. Set $DEBUG at the top of the file to anything larger than zero and the script will dump information at various stages of its execution.


#!/usr/bin/bash
#
# CPU load monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide, the Netherlands
# Last Modified: 22-06-2006
# 
# Usage: ./check_load2
#
# Description:
#   Ethan's original version of the check_load script is very flexible.
# It allows you to specifically set WARN and CRIT levels regarding 
# the CPU load of the system you're monitoring.
#   However: flexibility is not always a good thing. Say for example that
# you want to monitor the CPU load across a few hundred of systems having
# various CPU configurations. You -could- define host groups for single, dual
# quad (and so on) processor systems and assign unique check_load command
# definitions to each group.
#   Or you could write a script which checks the amount of active CPUs and
# then makes an educated guess at the WARN and CRIT levels for the system. 
# In most cases this should really be enough. 
#
# Limitations:
# This script should work properly on all implementations of Linux, Solaris
# and Mac OS X.
#
# Output:
# Depending on the levels defined at the top of the script,
# the script returns an OK, WARN or CRIT to Nagios based on CPU load.
#
# Other notes:
#   If you ever run into problems with the script, set the DEBUG variable
# to 1. I'll need the output the script generates to do troubleshooting.
# See below for details.
#   I realise that all the debugging commands strewn throughout the script
# may make things a little harder to read. But in the end I'm sure it was
# well worth adding them. It makes troubleshooting so much easier. :3
#

# You may have to change this, depending on where you installed your
# Nagios plugins
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh


### DEBUGGING SETUP ###
# Cause you never know when you'll need to squash a bug or two
DEBUG="1"
DEBUGFILE="/tmp/foobar"
rm -f $DEBUGFILE


### REQUISITE NAGIOS COMMAND LINE STUFF ###

print_usage() {
	echo "Usage: $PROGNAME"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "Semi-intelligent CPU load monitor plugin for Nagios"
	echo ""
	echo "This plugin not developped by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
	case "$1" in
	  --help) print_help; exit $STATE_OK;;
	  -h) print_help; exit $STATE_OK;;
	  *) print_usage; exit $STATE_UNKNOWN;;
	esac
done


### SETTING UP THE WARN AND CRIT FACTORS ###
# Please be aware that these are -factors- and not real load average values.
# The numbers below will be multiplied by the amount of processors to come
# to the desired WARN and CRIT levels. Feel free to adjust these factors, if
# you feel the need to tweak them.

WARN_1min="2.00"
WARN_5min="1.50"
WARN_15min="1.50"
[ $DEBUG -gt 0 ] && echo "Factors: warning factors are at $WARN_1min, $WARN_5min, $WARN_15min." >> $DEBUGFILE

CRIT_1min="3.00"
CRIT_5min="2.00"
CRIT_15min="2.00"
[ $DEBUG -gt 0 ] && echo "Factors: critical factors are at $CRIT_1min, $CRIT_5min, $CRIT_15min." >> $DEBUGFILE


### DEFINING SUBROUTINES ###

function gather_procs_linux()
{
    NUMPROCS=`cat /proc/cpuinfo | grep ^processor | wc -l` 
[ $DEBUG -gt 0 ] && echo "Numprocs: Number of processors detected is $NUMPROCS." >> $DEBUGFILE
} 

function gather_procs_sunos()
{
    NUMPROCS=`/usr/bin/mpstat | grep -v CPU | wc -l` 
[ $DEBUG -gt 0 ] && echo "Numprocs: Number of processors detected is $NUMPROCS." >> $DEBUGFILE
}

function gather_procs_darwin()
{
    NUMPROCS=`/usr/bin/hostinfo | grep "Default processor set" | awk '{print $8}'` 
[ $DEBUG -gt 0 ] && echo "Numprocs: Number of processors detected is $NUMPROCS." >> $DEBUGFILE
}

function gather_load_linux()
{
    REAL_1min=`cat /proc/loadavg | awk '{print $1}'`
    REAL_5min=`cat /proc/loadavg | awk '{print $2}'`
    REAL_15min=`cat /proc/loadavg | awk '{print $3}'`
[ $DEBUG -gt 0 ] && echo "Gather_load: Detected load averages are $REAL_1min, $REAL_5min, $REAL_15min." >> $DEBUGFILE
}

function gather_load_sunos()
{
    REAL_1min=`w | grep "load average" | awk -F, '{print $4}' | awk '{print $3}'`
    REAL_5min=`w | grep "load average" | awk -F, '{print $5}'`
    REAL_15min=`w | grep "load average" | awk -F, '{print $6}'`
[ $DEBUG -gt 0 ] && echo "Gather_load: Detected load averages are $REAL_1min, $REAL_5min, $REAL_15min." >> $DEBUGFILE
}

function gather_load_darwin()
{
    REAL_1min=`sysctl -n vm.loadavg | awk '{print $1}'`
    REAL_5min=`sysctl -n vm.loadavg | awk '{print $2}'`
    REAL_15min=`sysctl -n vm.loadavg | awk '{print $3}'`
[ $DEBUG -gt 0 ] && echo "Gather_load: Detected load averages are $REAL_1min, $REAL_5min, $REAL_15min." >> $DEBUGFILE
}

function check_load()
{
    WARN="0"; CRIT="0"

    [ `echo "if(($NUMPROCS * $WARN_1min) > $REAL_1min) 0; if(($NUMPROCS * $WARN_1min) <= $REAL_1min) 1" | bc` -gt 0 ] && let WARN=$WARN+1
    [ `echo "if(($NUMPROCS * $WARN_5min) > $REAL_5min) 0; if(($NUMPROCS * $WARN_5min) <= $REAL_5min) 1" | bc` -gt 0 ] && let WARN=$WARN+1
    [ `echo "if(($NUMPROCS * $WARN_15min) > $REAL_15min) 0; if(($NUMPROCS * $WARN_15min) <= $REAL_15min) 1" | bc` -gt 0 ] && let WARN=$WARN+1
[ $DEBUG -gt 0 ] && echo "Check_load: warning levels are `echo "$NUMPROCS * $WARN_1min"|bc`, `echo "$NUMPROCS * $WARN_5min"|bc`, `echo "$NUMPROCS * $WARN_15min"|bc`," >> $DEBUGFILE

    [ `echo "if(($NUMPROCS * $CRIT_1min) > $REAL_1min) 0; if(($NUMPROCS * $CRIT_1min) <= $REAL_1min) 1" | bc` -gt 0 ] && let CRIT=$CRIT+1
    [ `echo "if(($NUMPROCS * $CRIT_5min) > $REAL_5min) 0; if(($NUMPROCS * $CRIT_5min) <= $REAL_5min) 1" | bc` -gt 0 ] && let CRIT=$CRIT+1
    [ `echo "if(($NUMPROCS * $CRIT_15min) > $REAL_15min) 0; if(($NUMPROCS * $CRIT_15min) <= $REAL_15min) 1" | bc` -gt 0 ] && let CRIT=$CRIT+1
[ $DEBUG -gt 0 ] && echo "Check_load: critical levels are `echo "$NUMPROCS * $CRIT_1min"|bc`, `echo "$NUMPROCS * $CRIT_5min"|bc`, `echo "$NUMPROCS * $CRIT_15min"|bc`," >> $DEBUGFILE

    # Check CRIT before WARN, so a critical load isn't reported as a mere
    # warning. Plain if-blocks are used here: an exit inside ( ... ) would
    # only leave the subshell, never the script itself.
    if [ $CRIT -gt 0 ]; then
        echo "NOK: load averages are at $REAL_1min, $REAL_5min, $REAL_15min"
        exit $STATE_CRITICAL
    fi
    if [ $WARN -gt 0 ]; then
        echo "NOK: load averages are at $REAL_1min, $REAL_5min, $REAL_15min"
        exit $STATE_WARNING
    fi
}

### FINALLY, THE MAIN ROUTINE ###

NUMPROCS="0"

case `uname` in
            Linux) gather_procs_linux; gather_load_linux; check_load;;
            Darwin) gather_procs_darwin; gather_load_darwin; check_load;;
            SunOS) gather_procs_sunos; gather_load_sunos; check_load;;
            *) echo "OS not supported by this check."; exit 1;;
esac

# Nothing caused us to exit early, so we're okay.
echo "OK - load averages are at $REAL_1min, $REAL_5min, $REAL_15min"
exit $STATE_OK


kilala.nl tags: , , ,

View or add comments (curr. 7)

Nagios script: check_fwm

2005-07-01 00:00:00

This script was written at the time I was hired by UPC / Liberty Global.

Basic monitor that checks if the Checkpoint Firewall-1 Management software is up and running. It checks for a number of processes and ports.

This script was quickly hacked together for my current customer, as a Q&D solution for their monitoring needs. It's no beauty, but it works. Written in ksh and tested with:

The script sends a Critical if:

A) One or more processes are not running, or

B) One or more ports are not available for connections.

UPDATE 19/06/2006:

Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!


#!/usr/bin/bash
#
# Firewall-1 process monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 19-06-2006
# 
# Usage: ./check_fwm
#
# Description:
# This plugin determines whether the Firewall-1 management
# software is running properly. It will check the following:
# * Are all required processes running?
# * Are all the required TCP/IP ports open?
#
# Limitations:
# Currently this plugin will only function correctly on Solaris systems.
#
# Output:
# The script returns a CRIT when one of the criteria mentioned
# above is not met.
#

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
        echo "WARNING:"
        echo "This script was originally written for use on Solaris."
        echo "You may run into some problems running it on this host."
        echo ""
        echo "Please verify that the script works before using it in a"
        echo "live environment. You can easily disable this message after"
        echo "testing the script."
        echo ""
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

# utils.sh provides the STATE_* exit codes used below.
PROGNAME=`basename $0`

print_usage() {
	echo "Usage: $PROGNAME"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "Firewall-1 monitor plugin for Nagios"
	echo ""
	echo "This plugin was not developed by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
	case "$1" in
	  --help) print_help; exit $STATE_OK;;
	  -h) print_help; exit $STATE_OK;;
	  *) print_usage; exit $STATE_UNKNOWN;;
	esac
done

check_processes()
{
	PROCESS="0"
	# PROCLIST="cpd fwd fwm cpwd cpca cpmad cplmd cpstat cpshrd cpsnmpd"
	PROCLIST="cpd fwd fwm cpwd cpca cpmad cpstat cpsnmpd"
	for PROC in $PROCLIST; do
		if [ `ps -ef | grep $PROC | grep -v grep | wc -l` -lt 1 ]; then PROCESS=1; fi
	done

	if [ $PROCESS -eq 1 ]; then 
		echo "FWM NOK - One or more processes not running"
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_ports()
{
	PORTS="0"
	PORTLIST="256 257 18183 18184 18187 18190 18191 18192 18196 18264"
	for NUM in $PORTLIST; do
		# Anchor the port number so that e.g. 256 does not also match 2560 or 18256.
		if [ `netstat -an | grep LISTEN | egrep "[.:]$NUM[^0-9]" | wc -l` -lt 1 ]; then PORTS=1; fi
	done

	if [ $PORTS -eq 1 ]; then 
		echo "FWM NOK - One or more TCP/IP ports not listening."
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_processes
check_ports

echo "FWM OK - Everything running like it should"
exitstatus=$STATE_OK
exit $exitstatus
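
To hook the plugin into Nagios, all that is left is a command and a service definition. A minimal sketch in Nagios object syntax, assuming your plugins live in $USER1$ and that the stock generic-service template exists; the host name fw-mgmt01 is made up:

define command {
    command_name    check_fwm
    command_line    $USER1$/check_fwm
}

define service {
    use                     generic-service
    host_name               fw-mgmt01
    service_description     FW1 Management
    check_command           check_fwm
}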

kilala.nl tags: , , ,

View or add comments (curr. 0)

Migrating to a new NIS+ master

2005-02-21 08:30:00

Currently listening to "Press Conference rag" from the musical Chicago.

What a relief! We finally managed to move NIS+ to a new master server. We put in about twelve hours on Saturday, but in the end we got that bitch tamed! :) Credit where credit is due: our success was mostly thanks to the scripts crafted by Jeroen and Roland.


kilala.nl tags: , ,

View or add comments (curr. 0)

Switching NIS+ to a new master server

2005-01-16 15:38:00

Bad news for those sysadmins out there waiting for word on the NIS+ migration. We tried our best yesterday, but moving NIS+ to a new master server failed again :( This time around we used a tried-and-true procedure (much improved upon by Jeroen) that is usually reserved for worst-case scenarios. Unfortunately we ran into some unforeseen problems. I'll tell you more about them when I deliver the _real_ procedure.


kilala.nl tags: , ,

View or add comments (curr. 0)

Hacking NIS+ and BoKS

2004-11-17 18:25:00

Holy moly, what a weekend! I can tell you guys right now that the procedure I wrote for switching NIS+ master servers is NOT foolproof! We had planned to take four hours at most to switch both NIS+ and BoKS over to a new master server. Unfortunately we were barely one hour into the NIS+ switch before things went horribly sour.

In the end I spent a total of eighteen hours in the office on Saturday and Sunday. I'll spare you the gory details for now (I'll incorporate them in version 2.0 of the master switch procedure).

But God, what a weekend! And the way it looks now we'll be repeating it in a week or so...

Aniwho... I'm still trying to put as much time as possible into my work for the convention, but it's going slowly. I plan on spending every free minute of this coming Thursday on my Foundation work though. That should move me along nicely.


kilala.nl tags: , , ,

View or add comments (curr. 0)

Moving NIS+ to a new master server

2004-11-15 20:14:00

Finally got round to writing the "Switch to a new master" procedure for NIS+. This procedure is damn handy when you want to move your current NIS+ root master to new hardware. This is something that we'll be doing at my employer on the 20th of November, so I'll keep you guys posted. I'll also be sure to update the procedure should anything go wrong :]


kilala.nl tags: , ,

View or add comments (curr. 0)

SCNA erratum

2004-09-22 20:01:00

In the menu of the Sysadmin section you will also find a link to a small errata list which I wrote after reading Rick Bushnell's book. As you can see, I found quite a number of errors. I also e-mailed the list to Prentice Hall, the publisher, in the hope that they will put it to good use.


kilala.nl tags: , , , ,

View or add comments (curr. 0)

SCNA summary done!

2004-09-15 22:13:00

Well, it took me a couple of days, but it's finally done: my summary of the "SCNA study guide" by Rick Bushnell (see the book list). I'll be taking my first shot at the SCNA exam in about a week (the 22nd, keeping my fingers crossed), so I'm happy to have finished the document. I thought I'd share it with the rest of you; maybe it'll be of some use.

All 29 pages are available for download as a PDF from the Sysadmin section.


kilala.nl tags: , , , , ,

View or add comments (curr. 0)

Older blog posts