2009-09-14 22:05:00

This script is used to monitor the basic processes that go with Cisco's CNR (Network Registrar), which can be likened to a DHCP server. Cisco's Support Wiki described CNR as follows:

Cisco CNS Network Registrar is a full-featured DNS/DHCP system that provides scalable naming and addressing services for service provider and enterprise networks. Cisco CNS Network Registrar dramatically improves the reliability of naming and addressing services for enterprise networks. For cable ISPs, Cisco CNS Network Registrar provides scalable DNS and DHCP services and forms the basis of a DOCSIS cable modem provisioning system.

As said, my script only checks the basics of CNR to ensure that the required daemons are running. It does not actually check any of the functionality, though it may be expanded to include this at a later point in time.
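A daemon-only check of this kind could be sketched roughly as follows. This is not the code from check_cnr itself: the function name missing_daemons is made up for illustration, and the daemon names you pass in are whatever your own CNR install runs. It also assumes a POSIX grep (on older Solaris that means the one in /usr/xpg4/bin).

```shell
# Sketch only: report which of the named daemons are absent from a process list.
# Reads command names on stdin (one per line, e.g. from `ps -e -o comm`)
# and prints every name given as an argument that does not appear in that list.
missing_daemons() {
    md_list=$(cat)
    md_missing=""
    for md_name in "$@"; do
        # -x matches whole lines only, -F treats the name as a fixed string
        if ! printf '%s\n' "$md_list" | grep -qxF "$md_name"; then
            md_missing="$md_missing $md_name"
        fi
    done
    # Strip the leading space before printing
    printf '%s\n' "${md_missing# }"
}
```

Usage would look something like `ps -e -o comm | missing_daemons daemon_a daemon_b`, where daemon_a and daemon_b are placeholders. An empty result means every listed daemon was found.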

Usage of check_cnr

./check_cnr [-nagios|-tivoli] [-d -o FILE]
-nagios	Nagios output mode (default)
-tivoli	Tivoli output mode
-d	Debug mode
-o 	Output file for debug logging

Output

Depending on which mode you've selected the output of the script will differ slightly.

In Tivoli mode the output will be limited to a numerical value, as the script is to be used as a "numeric script": 0 = OK, 1 = WARNING/UNKNOWN, 2 = SEVERE. The exit code of the script will be identical to this value.

In Nagios mode the exit code of the script will be similar to Tivoli's, with the exception that the value 3 indicates an unknown state. The output on stdout includes the service name and state (CNR OK/NOK) and a helpful error message.
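To make the difference between the two modes concrete, here is a minimal sketch of how a state could be turned into output and an exit code. It is an illustration under assumptions, not the script's real code: the function name report_state and the exact "CNR OK - message" layout are invented here, only the numeric codes come from the description above.

```shell
# Sketch only: turn a state name into the output and exit code for each mode.
# Usage: report_state MODE STATE MESSAGE   (MODE is "nagios" or "tivoli")
report_state() {
    rs_mode=$1
    rs_state=$2
    rs_msg=$3
    case "$rs_mode:$rs_state" in
        tivoli:OK)       rs_code=0 ;;
        tivoli:WARNING)  rs_code=1 ;;
        tivoli:UNKNOWN)  rs_code=1 ;;   # Tivoli folds UNKNOWN into 1
        tivoli:*)        rs_code=2 ;;   # SEVERE
        nagios:OK)       rs_code=0 ;;
        nagios:WARNING)  rs_code=1 ;;
        nagios:UNKNOWN)  rs_code=3 ;;   # Nagios has a separate UNKNOWN code
        *)               rs_code=2 ;;
    esac
    if [ "$rs_mode" = "tivoli" ]; then
        # Numeric script: the printed value equals the exit code
        echo "$rs_code"
    elif [ "$rs_code" -eq 0 ]; then
        echo "CNR OK - $rs_msg"
    else
        echo "CNR NOK - $rs_msg"
    fi
    return "$rs_code"
}
```

A plugin built this way would end with something like `report_state nagios OK "all daemons running"; exit $?`.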

Limitations

This script has currently only been tested on Solaris and Linux.
The plugin will only check for the required daemons. It currently does no functional checks, though the framework for these checks is already in place.
By default the script runs in Nagios mode.

Download

Download check_cnr.sh

$ wc check_cnr.sh
189     666    4531 check_cnr.sh

$ cksum check_cnr.sh
4161895780 4531 check_cnr.sh

kilala.nl tags: nagios, unix, programming,


Coming up: a refresh of my Nagios and Tivoli plugins

2009-09-14 19:35:00

For the longest of times my Nagios plugins have used a rather old-fashioned approach to configuration: everything's hardcoded into the script and you'll need to modify the script to make changes. Obviously that sucks if you want to use the script for multiple purposes. My newer scripts all use command line flags and parameters to pass variables, making them a lot more versatile. Hence I will soon be rewriting all my Nagios plugins to work the same way.

I will also be changing their individual pages, putting the plugin back into its own .ksh script instead of including the code into the HTML page. Whatever was I thinking when I did that?!

Finally, I will also be modifying all plugins (also the new ones) to work with multiple monitoring systems. By passing a certain command line option one will be able to choose between modes for Nagios and Tivoli, with possible extensions along the way.

I've got my work cut out for me!

kilala.nl tags: sysadmin, nagios,


Interesting HP SIM (Insight Manager) SNMP objects

2008-01-01 00:00:00

In my mini-howto about monitoring HP and Dell specific SNMP objects through Net-SNMP I referred to a few interesting objects which could be read through their respective SNMP agents. This page covers the interesting objects for HP Compaq systems.

Right now I've only got a very limited number of different models to test all this stuff on, so bear with me :) The following lists are only a small selection of all possible objects, namely the ones we found interesting. A full list of available options can be obtained by running:

snmpwalk -c public localhost .1.3.6.1.4.1.232

I've tried my best at making the more interesting parts of the HP and Dell MIBs legible. The results can be found in the PDF, in the menu on the left. But once again, these lists are only a small subset of the complete MIB for both vendors. You won't know all that's available to you unless you start digging through the flat .TXT files yourself. Unlike Sun, HP and Dell -do- publish their MIB files freely, so you'll have no trouble finding them on the web.

I've also expanded on the HP SIM MIB a little in a PDF document. Get it over here.

On the monitoring of disks.

Unfortunately, HP and Compaq have made it impossible to monitor hard disk statuses without add-on software. The plain vanilla SNMP agent has no way of filling the relevant objects. Instead it requires the CPQarrayd add-on.

If you do choose to install this piece of software, you can find all the objects regarding -internal- drives under OID .1.3.6.1.4.1.232.3.2.5.2 (cpqDaPhyDrvErrTable). Refer to CPQIDA.MIB.txt for all relevant details and a full listing of the appropriate OIDs.

Currently I have no way of making sure, but I assume that the alert message for HDD[0-7] can be found in .1.3.6.1.4.1.232.3.2.5.2.1.15.[0-7]. Any value above 0 indicates a failure.

Basic Object Identifiers

All object IDs below fit under .1.3.6.1.4.1.232. These objects should be usable on every HP system in the DL/ML range, although I have only tested them on DL380, DL385, DL580 and ML570.

Object	Description	Values
.1.2.2.1.1.6.OID	CPU[0-3] status	1/2 = ok, 3 = warn, 4 = crit
.3.2.2.1.1.6.OID	HDD controller	1/2 = ok, 3 = warn, 4 = crit
.3.2.3.1.1.11.OID	LDD[0-X] status	1/2 = ok, 3 = warn, 4 = crit
.3.2.4.1.1.6.OID	Hot spare HDD status	> 2 = crit
.3.2.5.1.1.37.OID	HDD[0-X] status	1/2 = ok, 3 = warn, 4 = crit
.5.2.2.1.1.12.OID	SCSI controller status	1/2 = ok, 3 = warn, 4 = crit
.5.2.3.1.1.8.OID	SCSI LDD[0-X] status	1/2 = ok, 3 = warn, 4 = crit
.5.2.4.1.1.26.OID	SCSI HDD[0-x] status	1/2 = ok, 3 = warn, 4 = crit
.6.2.6.7.1.9.OID	Fan status	1/2 = ok, 3 = warn, 4 = crit
.6.2.6.8.1.4.1	CPU0 temperature	Contains current temperature
.6.2.6.8.1.4.4	CPU1 temperature	Contains current temperature
.6.2.6.8.1.4.5	PSU temperature	Contains current temperature
.6.2.9.3.1.4.0.OID	PSU[0-X] status	1/2 = ok, 3 = warn, 4 = crit
.14.2.2.1.1.5.OID	IDE HDD[0-X] status	1/2 = ok, 3 = warn, 4 = crit
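Most of the status objects in the table share the same 1/2 = ok, 3 = warn, 4 = crit scheme, so one small mapping covers them. The sketch below is only an illustration: the function name hp_status_to_exit is made up, and treating any other value as UNKNOWN (exit code 3) is my assumption, not something the MIB prescribes. The hot spare row uses a different scheme and is not covered.

```shell
# Sketch only: map an HP status integer to a Nagios-style exit code.
# 1/2 = ok, 3 = warn, 4 = crit, as listed in the table above.
hp_status_to_exit() {
    case "$1" in
        1|2) return 0 ;;   # OK
        3)   return 1 ;;   # WARNING
        4)   return 2 ;;   # CRITICAL
        *)   return 3 ;;   # anything else: treated as UNKNOWN (assumption)
    esac
}
```

You could feed it the bare value of, say, the first fan with `hp_status_to_exit "$(snmpget -v 2c -c public -Oqve localhost .1.3.6.1.4.1.232.6.2.6.7.1.9.1.1)"`. The -Oqve output options print just the value and keep enums numeric. The -v 2c is an assumption about your agent setup.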

Fan and sensor placement

As I already said, most of the OIDs from the tables above can be used to monitor vanilla HP systems (with the exceptions of the hard disks). The biggest difference lies in the placement of certain fans and sensors. The table below outlines the various locations, depending on the model.

Each system contains multiple fans and temperature sensors and will thus have multiple instances of these objects in its SNMP tree. The locations for each of these instances can be read from .6.2.6.7.1.3.OID (fans) and .6.2.6.8.1.3.OID (temperature sensors). The $OID part of these numeric sequences is always .1.1, .1.2, .1.3, .1.4 and so on.

Fan	DL380	DL385	DL580	ML570
.1.1	CPU	CPU	System	?
.1.2	CPU	CPU	System	?
.1.3	IO Board	IO Board	System	?
.1.4	IO Board	IO Board	System	?
.1.5	CPU	CPU	IO Board	?
.1.6	CPU	CPU	IO Board	?
.1.7	PSU	PSU	-	?
.1.8	PSU	PSU	-	?
Sensor	DL380	DL385	DL580	ML570
.1.1	CPU	CPU	CPU	?
.1.2	CPU	IO Board	CPU	?
.1.3	IO Board	CPU	CPU	?
.1.4	CPU	CPU	CPU	?
.1.5	PSU	PSU	IO Board	?
.1.6	-	-	Ambient	?
.1.7	-	-	System	?
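If you want to look up the location of one particular fan or sensor, the full OID is simply the HP base, the location column and the instance glued together. A minimal sketch, with hp_location_oid as a made-up helper name:

```shell
# Sketch only: print the full "location" OID for a fan or temperature sensor.
# Usage: hp_location_oid fan|sensor INDEX   (INDEX is the last digit of .1.N)
hp_location_oid() {
    case "$1" in
        fan)    echo ".1.3.6.1.4.1.232.6.2.6.7.1.3.1.$2" ;;
        sensor) echo ".1.3.6.1.4.1.232.6.2.6.8.1.3.1.$2" ;;
        *)      echo "usage: hp_location_oid fan|sensor INDEX" >&2; return 1 ;;
    esac
}
```

For example, `snmpget -c public localhost "$(hp_location_oid fan 3)"` would ask for the location of the fan listed as .1.3 in the table above.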

kilala.nl tags: sysadmin, unix, nagios,


Combining net-SNMP with Dell Open Manage and HP Insight Manager

2008-01-01 00:00:00

Monitoring Dell and HP systems through SNMP is as big a puzzle as using SNMP on Sun Microsystems' boxen. Luckily I've come a long way into figuring out how to use Net-SNMP together with HP's SIM and Dell's OpenManage.

Just like with our Solaris boxen, we want to use the Net-SNMP daemon as the main daemon on our Linux systems. At $CLIENT we use Red Hat ES3 on a great variety of Dell and HP hardware. And as was the case with SUNWmasf on Solaris, we're going to need both Dell's and HP's custom SNMP agents to monitor our hardware-specific SNMP objects. Enter SIM and OpenManage. In the next few paragraphs I'll tell you all about installing and configuring the whole deal.

Naturally it would be great if you could package all of these files into one nice .RPM, since that'll make the whole installation process a snap. Especially if you want to roll it out across hundreds of servers. I'll be making such a package for $CLIENT, but unfortunately I cannot distribute it (which is logical, what with all the proprietary info that goes into the package). Maybe, some day I'll make a generic .RPM which you guys can use.

Installing HP SIM and its components.

Just like everyone else, HP chooses to hide the installer for their SNMP agent quite deep inside their website. You will need to go to their download site and browse to the software section for your model of server. Once there you choose "Download drivers and software" and you pick your Linux flavour (in our case RHEL3). From there go to "Software - Systems management" where you can finally choose "A Collection of SNMP Protocol Tools from Net-SNMP for $YOUR_FLAVOUR". *phew* To help you get there, here's the direct link to the RHES3 version of the package.

As the file name (net-snmp-cmaX-5.1.2) suggests, this package is a modified version of the net-SNMP daemon which has added support for a whole bunch of Compaq and HP stuff. But as you can see the version of net-SNMP used is way behind today's standards, so it's wisest to use this daemon while proxied through a more current version of net-SNMP. The crappy thing though is that HP's package installs their net-SNMP in exactly the same location as our own net-SNMP. Don't worry, we'll get to that.

The download page doesn't make this immediately clear, but you'll need to download five (or six if you want the source) files. For your convenience, HP has decided to put all files into a pull-down menu, with one "Download" button. Yes, very handy indeed. =_= Another neat thing is that, for some reason, the combination Safari+Realplayer decides that -they- need to open the .RPM file that's loaded. Very odd and I've never encountered this before with other RPMs.

Because we're going to use two versions of net-SNMP that use the same locations on your hard drive, we're going to have to fiddle around a bit.

First copy these two RPMs to your system: net-snmp-cmaX and net-snmp-cmaX-libs. Install them using RPM, starting with libs and ending with the basic package. Now do the following.

$ cd /usr/sbin
$ sudo mv snmpd HPsnmpd
$ sudo mv snmptrapd HPsnmptrapd
$ cd /etc
$ sudo ln -s ./snmpd.conf ./HPsnmpd.conf
$ cd /etc/rc.d/init.d
$ sudo mv snmpd HPsnmpd
$ sudo mv snmptrapd HPsnmptrapd
$ cd /etc/logrotate.d
$ sudo mv snmpd HPsnmpd

You've now made sure that all parts that are required for the HP SNMP agent are safe from being overwritten by the "real" net-SNMP.

You can now install net-SNMP using the instructions laid out in the following paragraph.

Re-compiling Net-SNMP

PLEASE NOTE: If you're going to use HP SIM, please install that -first- before proceeding. See the section above for details.

Basically, recompiling Net-SNMP for your Linux install follows the same procedure as the recompilation on Solaris.

Download the source code for Net-SNMP version 5.2.3 (or a newer version, if you wish) from their website.
Move the .TGZ to your build system and unpack it in your regular build location. Also, building Net-SNMP successfully requires OpenSSL 0.9.7g or higher, so make sure that it's installed on your build system.
Run the configure script with the following options:
--with-mib-modules="host disman/event-mib ucd-snmp/diskio smux agentx disman/event-mib ucd-snmp/lmSensors" --with-perl-module
Run "make", "make test" and "make install" to complete the creation of Net-SNMP.
After "make install" has finished all the Net-SNMP files have been installed on your build system. Naturally it's important to know which files to include in your package. I will make a full listing of all files RSN(tm).
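Put together, the build steps above come down to something like the following sketch. The function name build_netsnmp and the idea of passing the source directory as an argument are mine; the configure options are copied from the step list as-is.

```shell
# Sketch only: run the configure/make sequence inside an unpacked source tree.
# Usage: build_netsnmp /path/to/net-snmp-source
build_netsnmp() (
    # The function body is a subshell (note the parentheses),
    # so the cd does not leak into the calling shell.
    cd "$1" || exit 1
    ./configure \
        --with-mib-modules="host disman/event-mib ucd-snmp/diskio smux agentx disman/event-mib ucd-snmp/lmSensors" \
        --with-perl-module &&
    make &&
    make test &&
    make install
)
```

Each step only runs if the previous one succeeded, so a failing "make test" stops the sequence before anything gets installed.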

Installing Dell OpenManage and its components.

I had a hard time finding the installer files for Dell OM on Dell's download site, until I finally figured out how their "logic" works. :D You can get Dell OM 4.5 for Linux through this direct link (which can be changed at any time by Dell), or you can search their downloads page using the term "openmanage server agent". Adding the key word "linux" seems to confuse it though, so you're going to have to manually search through the list.

Unfortunately I never did get around to using Dell OpenManage, so I cannot give you the installation instructions ;_;

Configuring HP-SIM

The configuration file for HP's version of net-SNMP is stored in /etc/snmp, unlike the version that'll be used by our own net-SNMP. Edit HP's config file and remove all the current content. Replace it with the following:

rocommunity public 0.0.0.0

agentaddress 1162

pass .1.3.6.1.4.1.4413.4.1 /usr/bin/ucd5820stat

You will not have to make any further changes. The init-script and such can remain unchanged.

Configuring Dell OpenManage

Again, unfortunately I cannot give you instructions on working with OpenManage since I ran out of time.

rocommunity public 0.0.0.0

agentaddress 1163

Configuring Net-SNMP

The configuration file for Net-SNMP is located in /usr/local/share/snmp. You will need to make a whole bunch of changes over here that I won't cover, like security ACLs, SNMP trap hosts and bunches of other stuff. However, you _will_ need to add the following lines to allow Net-SNMP to talk to HP SIM and/or OpenManage.

# Pass requests to HP SIM

proxy -c public localhost:1162 .1.3.6.1.4.1.232

# Pass requests to Dell OpenManage

proxy -c public localhost:1163 .1.3.6.1.4.1.674
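Once both daemons are running, one way to see whether a proxy line does its job is to ask for the first object below the vendor subtree, once directly on the sub-agent's port and once through the main daemon. The sketch below is an assumption-laden illustration: answers_below is a made-up name, and the -v 2c flag assumes your agents accept SNMP v2c with the given community.

```shell
# Sketch only: return 0 if the agent at $1 (host or host:port) returns an
# object that really lives below subtree $3, using community string $2.
answers_below() {
    # -On prints OIDs numerically, so the reply can be compared to the subtree.
    # The sed call escapes the dots so grep treats them literally.
    snmpgetnext -v 2c -c "$2" -On "$1" "$3" 2>/dev/null |
        grep -q "^$(printf '%s' "$3" | sed 's/\./\\./g')\."
}
```

With the configuration above you would expect both `answers_below localhost:1162 public .1.3.6.1.4.1.232` and `answers_below localhost public .1.3.6.1.4.1.232` to succeed. If only the first one does, the sub-agent is fine and the proxy line is the place to look.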

Starting the software

Make sure that you start Net-SNMP before OpenManage or SIM. These sub-agents rely on Net-SNMP to be running, so that one needs to go first. Take care of this order using the RC scripts of your particular Linux flavour.

kilala.nl tags: sysadmin, unix, nagios,


A closer look at SUN-PLATFORM-MIB

2008-01-01 00:00:00

For some reason unknown to me Sun has always kept their MIB file rather closed and hard to find. There's no place you can actually download the file. You will have to extract the file from the SUNWmasf package if you want to take a look at it.

To help us sysadmins out I've published the file over here. I do not claim ownership of the file in any way. Sun has the sole copyright of the file. I just put it here, so people can easily read through the file.

kilala.nl tags: sysadmin, unix, nagios, solaris,


Reading from Solaris SNMP agents

2008-01-01 00:00:00

I have to admit that figuring out how all the parts of SNMP on Sun stick together took me a little while. Just like when I was learning Nagios it took me about a week of mucking about to gain clarity. Now that I've figured it out, I thought I'd share it with you...

First off, everything I will describe over here depends on the availability of two pieces of software on your clients: Net-SNMP and SUNWmasf. See the article on combining the two for further details on installing and configuring this software.

We should begin by verifying that you can read from each of the important pieces of the SNMP tree. You can verify this by running the following three commands on your client system. Each should return a long list of names, numbers and values. Don't worry if it doesn't make sense yet.

snmpwalk -c public localhost .1.3.6.1.2.1.47

snmpwalk -c public localhost .1.3.6.1.4.1.42

snmpwalk -c public -m ALL localhost .1.3.6.1.4.1.2021.13

Incidentally you should also be able to access the same parts of the SNMP tree remotely (from your Nagios server, for example).

snmpwalk -c public $remote_client .1.3.6.1.2.1.47

snmpwalk -c public $remote_client .1.3.6.1.4.1.42

snmpwalk -c public -m ALL $remote_client .1.3.6.1.4.1.2021.13

Please keep in mind that you should replace the word "public" in all the examples with the community string that you've chosen for your SNMP agents. It could very well be something other than "public".
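If you have more than a handful of clients, the three checks can be wrapped in a small loop. This is a rough sketch rather than a finished tool: check_subtrees is a made-up name, and passing -m ALL to all three walks (instead of only the last one) is a simplification on my part.

```shell
# Sketch only: walk the three subtrees and report how many values each returns.
# Usage: check_subtrees HOST COMMUNITY
# Returns 1 if any of the subtrees came back empty.
check_subtrees() {
    cs_host=$1
    cs_community=$2
    cs_failed=0
    for cs_oid in .1.3.6.1.2.1.47 .1.3.6.1.4.1.42 .1.3.6.1.4.1.2021.13; do
        # Count "name = value" lines, ignoring the "No Such Object" reply
        cs_count=$(snmpwalk -c "$cs_community" -m ALL "$cs_host" "$cs_oid" 2>/dev/null |
            grep ' = ' | grep -vc 'No Such')
        echo "$cs_oid: $cs_count values"
        [ "$cs_count" -gt 0 ] || cs_failed=1
    done
    return "$cs_failed"
}
```

Running `check_subtrees localhost public` locally and `check_subtrees $remote_client public` from your Nagios server covers both sets of commands shown above.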

Which witch is witch?

Now that we've made sure that you can actually talk to your SNMP agent, it's time to figure out which components you want to find out about. The easy way to find out all components that are available to you is by running the following command.

snmpwalk -c public localhost .1.3.6.1.2.1.47.1.1.1.1.2

Let me explain what the output of this command really means... The SNMP sub-tree MIB-2.47.1.1.1.1 contains descriptive information of system-specific SNMP objects. Each object has a sub-object in the following sub-trees (each number follows after MIB-2.47.1.1.1.1).

Sub-OID	Description	Sub-OID	Description
.1	entPhysicalIndex	.9	entPhysicalFirmwareRev
.2	entPhysicalDescr	.10	entPhysicalSoftwareRev
.3	entPhysicalVendorType	.11	entPhysicalSerialNum
.4	entPhysicalContainedIn	.12	entPhysicalMfgName
.5	entPhysicalClass	.13	entPhysicalModelName
.6	entPhysicalParentRelPos	.14	entPhysicalAlias
.7	entPhysicalName	.15	entPhysicalAssetID
.8	entPhysicalHardwareRev	.16	entPhysicalIsFRU

In this case all the sub-objects under .2 contain descriptions of the various components that are human readable. What you need to do now is go through the complete list of descriptions to pick those elements that you want to access remotely through SNMP. You will see that each entry has a number behind the .2. Each of these numbers is the unique component identifier within the system, meaning that we are lucky enough to have the same identifier within other parts of the SNMP tree.

An example

$ snmpwalk -c public localhost .1.3.6.1.2.1.47.1.1.1.1.2 | grep Core

SNMPv2-SMI::mib-2.47.1.1.1.1.2.98 = STRING: "CPU 0 Core Temperature Monitor"

SNMPv2-SMI::mib-2.47.1.1.1.1.2.100 = STRING: "CPU 1 Core Temperature Monitor"

SNMPv2-SMI::mib-2.47.1.1.1.1.2.102 = STRING: "CPU 2 Core Temperature Monitor"

SNMPv2-SMI::mib-2.47.1.1.1.1.2.104 = STRING: "CPU 3 Core Temperature Monitor"

$ snmpwalk -c public localhost .1.3.6.1.2.1.47.1.1.1.1 | grep "\.98 ="

SNMPv2-SMI::mib-2.47.1.1.1.1.2.98 = STRING: "CPU 0 Core Temperature Monitor"

SNMPv2-SMI::mib-2.47.1.1.1.1.3.98 = OID: SNMPv2-SMI::zeroDotZero

SNMPv2-SMI::mib-2.47.1.1.1.1.4.98 = INTEGER: 94

SNMPv2-SMI::mib-2.47.1.1.1.1.5.98 = INTEGER: 8

SNMPv2-SMI::mib-2.47.1.1.1.1.6.98 = INTEGER: -1

SNMPv2-SMI::mib-2.47.1.1.1.1.7.98 = STRING: "040349/adbs04:CH/C0/P0/T_CORE"

SNMPv2-SMI::mib-2.47.1.1.1.1.8.98 = ""

SNMPv2-SMI::mib-2.47.1.1.1.1.9.98 = ""

SNMPv2-SMI::mib-2.47.1.1.1.1.10.98 = ""

SNMPv2-SMI::mib-2.47.1.1.1.1.11.98 = ""

SNMPv2-SMI::mib-2.47.1.1.1.1.12.98 = ""

SNMPv2-SMI::mib-2.47.1.1.1.1.13.98 = ""

SNMPv2-SMI::mib-2.47.1.1.1.1.14.98 = ""

SNMPv2-SMI::mib-2.47.1.1.1.1.15.98 = ""

SNMPv2-SMI::mib-2.47.1.1.1.1.16.98 = INTEGER: 2
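Since the number behind the .2 is the part you will be reusing everywhere, it can help to boil the walk output down to plain "identifier description" pairs. A minimal sketch, assuming the output format shown in the example above (list_components is a made-up name):

```shell
# Sketch only: reduce entPhysicalDescr walk output to "index description" pairs.
# Reads snmpwalk output on stdin and skips lines that do not match.
list_components() {
    sed -n 's/^.*\.47\.1\.1\.1\.1\.2\.\([0-9][0-9]*\) = STRING: "\(.*\)"$/\1 \2/p'
}
```

Used as `snmpwalk -c public localhost .1.3.6.1.2.1.47.1.1.1.1.2 | list_components | grep Core`, the first line of the example would come out as `98 CPU 0 Core Temperature Monitor`.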

Getting some useful data

Aside from the fact that the sub-OID we have found for our object is used in other parts of the tree, there's another parameter that makes its return. The character string in .7 is reused in the SUN MIB as well, as you will see in a moment.

Let's see what happens when we take our sub-OID .98 to the SUN MIB tree...

$ snmpwalk -c public localhost .1.3.6.1.4.1.42.2.70.101.1.1 | grep "\.98 ="

SNMPv2-SMI::enterprises.42.2.70.101.1.1.2.1.1.98 = INTEGER: 2

SNMPv2-SMI::enterprises.42.2.70.101.1.1.2.1.2.98 = INTEGER: 2

SNMPv2-SMI::enterprises.42.2.70.101.1.1.2.1.3.98 = INTEGER: 7

SNMPv2-SMI::enterprises.42.2.70.101.1.1.2.1.4.98 = INTEGER: 2

SNMPv2-SMI::enterprises.42.2.70.101.1.1.2.1.5.98 = STRING: "040349/adbs04:CH/C0/P0"

SNMPv2-SMI::enterprises.42.2.70.101.1.1.6.1.1.98 = INTEGER: 2

SNMPv2-SMI::enterprises.42.2.70.101.1.1.6.1.2.98 = INTEGER: 3

SNMPv2-SMI::enterprises.42.2.70.101.1.1.6.1.3.98 = Gauge32: 60000

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.1.98 = INTEGER: 3

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.2.98 = INTEGER: 0

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.3.98 = INTEGER: 1

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.4.98 = INTEGER: 41

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.5.98 = INTEGER: 0

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.6.98 = INTEGER: 0

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.7.98 = INTEGER: 0

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.8.98 = INTEGER: 0

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.9.98 = INTEGER: 97

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.10.98 = INTEGER: -10

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.11.98 = INTEGER: 102

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.12.98 = INTEGER: -20

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.13.98 = INTEGER: 120

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.14.98 = Gauge32: 0

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.15.98 = Hex-STRING: FC

SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.16.98 = INTEGER: 1

Take a look at 2.1.5.98... Looks familiar? At least now you're sure that you're reading the right sub-object :) The list in the example above looks quite complicated, but there's a little help in the shape of a .PDF I once made. This .PDF shows the basic structure of the objects inside enterprises.42.2.70.101.1.1.

You should immediately notice though that the returns of the command are divided into three groups: ...101.1.1.2, ...101.1.1.6 and ...101.1.1.8. Matching these groups up to the .PDF you'll see that these groups are respectively sunPlatEquipmentTable (which is an expansion on the information from MIB-2), sunPlatSensorTable (which contains a description of the sensor in question) and sunPlatNumericSensorTable (which contains all kinds of real-life values pertaining to the sensor).

In this case the most interesting sub-OID is enterprises.42.2.70.101.1.1.8.1.4.98, sunPlatNumericSensorCurrent, which obviously contains the current value of the sensor readings. Putting things into perspective this means that the core temperature of CPU0 at the time of the snmpwalk was 41 degrees centigrade.

Going on from there

So... Now you know how to find out the following things:

What Sun-specific system components are at your disposal?
What unique identifier is used to refer to the component in question?
What is the current value of the component in question?

You can now do loads of things! For example, you can use your monitoring software to verify that certain values don't exceed a set limit. You wouldn't want your CPUs to get hotter than 65 degrees now, would you?
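Such a limit check does not need much. Here is a minimal sketch of the comparison itself, with Nagios-style exit codes. The name check_reading and the warning limit of 60 used in the example are assumptions for illustration; only the 65 degrees comes from the text above.

```shell
# Sketch only: compare a sensor reading against warning and critical limits.
# Usage: check_reading VALUE WARN CRIT   (whole numbers only)
check_reading() {
    if [ "$1" -gt "$3" ]; then
        echo "CRITICAL - reading is $1 (limit $3)"
        return 2
    elif [ "$1" -gt "$2" ]; then
        echo "WARNING - reading is $1 (limit $2)"
        return 1
    fi
    echo "OK - reading is $1"
    return 0
}
```

For the CPU0 core temperature from the example you could call it as `check_reading "$(snmpget -v 2c -c public -Oqv localhost .1.3.6.1.4.1.42.2.70.101.1.1.8.1.4.98)" 60 65`, where -Oqv makes snmpget print only the value and -v 2c is an assumption about your agent.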

kilala.nl tags: sysadmin, unix, nagios, solaris,


Interesting SNMP objects for LM_Sensors on Solaris

2008-01-01 00:00:00

In my mini-howto about monitoring Sun specific SNMP objects through Net-SNMP I referred to a few interesting objects which could be read through LM_Sensors.

Unfortunately I can currently only list details for two of the supported models, since I do not have test boxen for the other models. The following lists are only a small selection of all possible objects, namely the ones we found interesting. A full list of available options can be obtained by running:

snmpwalk -c public -m ALL localhost .1.3.6.1.4.1.2021.13

Details about the structure of the various MIBs can be found in other articles in the Sysadmin section of my website. Just browse through the menu on the left. Point is that the lists below only list the OID _within_ the specific sub-trees (for example: .1.3.6.1.4.1.2021.13.16.5.1.2.9). As I said: details on actually _reading_ these values will be contained in another document.

Sun Fire V240

Object	Description	Unit
2.1.2.1 and .2	CPU[0-1] Core temperature	Integer *
2.1.2.3	SYSTEM Enclosure temperature	Integer *
5.1.2.2	SYSTEM Service required indicator	Integer
5.1.2.5	PSU[0-1] Service required indicator	Integer
5.1.2.10 .12 .14 and .16	HDD[0-3] Service required indicator	Integer
5.1.2.18	Keyswitch	Integer
5.1.2.4 and .7	PSU[0-1] Activity (power?)	Integer

*: In order to get the real temperature, you will need to divide the integer contained within this variable by 65.526. For some odd reason Net-SNMP does not store the real temperature in degrees Centigrade.
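The conversion is easy enough to do in awk. The sketch below simply takes the divisor from the footnote at face value; raw_to_celsius is a made-up name, and on Solaris you will probably need nawk instead of the old /usr/bin/awk for the -v option to work.

```shell
# Sketch only: convert a raw LM_Sensors integer to degrees, using the
# divisor quoted in the footnote above (65.526), rounded to one decimal.
raw_to_celsius() {
    awk -v raw="$1" 'BEGIN { printf "%.1f\n", raw / 65.526 }'
}
```

A raw value of 2687, for example, comes out as 41.0.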

Sun Fire V440

Object	Description	Unit
2.1.2.1 .2 .3 and .4	CPU[0-3] Core temperature	Integer *
2.1.2.5 .6 .7 and .8	CPU[0-3] Ambient temperature	Integer *
2.1.2.9	SCSI temperature	Integer *
.10	MOBO temperature	Integer *
.98 .100 .102 and .104	CPU[0-3] Core temperature	Degrees
.106	MOBO temperature	Degrees
.107	SCSI temperature	Degrees
5.1.2.2	SYSTEM Service required indicator	Integer
5.1.2.6 and .10	PSU[0-1] Service required indicator	Integer
5.1.2.12 .14 .16 and .18	HDD[0-3] Service required indicator	Integer
5.1.2.20	Keyswitch	Integer
5.1.2.4 and .8	PSU[0-1] Power OK	Integer

kilala.nl tags: sysadmin, unix, nagios,


Interesting SUNWmasf (Management Agent for Sun Fire) SNMP objects

2008-01-01 00:00:00

In my mini-howto about monitoring Sun specific SNMP objects through Net-SNMP I referred to a few interesting objects which could be read through SUNWmasf. A full list of available objects can be obtained by running:

snmpwalk -c public localhost .1.3.6.1.2.1.47.1.1.1.1.2

Details about the structure of the various MIBs can be found in other articles in the Sysadmin section of my website. Just browse through the menu on the left. Point is that the lists below only list the OID _within_ the specific sub-trees (for example: .1.3.6.1.2.1.47.1.1.1.1.2.46). As I said: details on actually _reading_ these values will be contained in another document.

The possible values for service indicators (enterprises.42.2.70.101.1.1.12.1.2.$OID) are:

1 = unknown, 2 = off, 3 = on, 4 = alternating

The possible values for the keyswitch (enterprises.42.2.70.101.1.1.9.1.1.$OID) are:

1 = unknown, 2 = stand-by, 3 = normal, 4 = locked, 5 = diag
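In a monitoring script you will usually want the name rather than the bare number. A minimal sketch of both lookups, with the function names indicator_name and keyswitch_name made up for illustration:

```shell
# Sketch only: translate the integers listed above into readable names.
indicator_name() {
    case "$1" in
        1) echo "unknown" ;;
        2) echo "off" ;;
        3) echo "on" ;;
        4) echo "alternating" ;;
        *) echo "unexpected value $1" ;;
    esac
}

keyswitch_name() {
    case "$1" in
        1) echo "unknown" ;;
        2) echo "stand-by" ;;
        3) echo "normal" ;;
        4) echo "locked" ;;
        5) echo "diag" ;;
        *) echo "unexpected value $1" ;;
    esac
}
```

So a "Service required" indicator that reads 3 is switched on, which is typically the one you want your monitoring to complain about.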

Sun Fire V240

Object	Description	Unit
.21 .23 .25 and .27	HDD[0-3] Service required indicator	Integer
.39	SYSTEM Service required indicator	Integer
.33 and .36	PSU[0-1] Service required indicator	Integer
.69 and .70	CPU[0-1] Core temperature	Degrees
.71	SYSTEM Enclosure temperature	Degrees
.99 and .100	PSU[0-1] Over-temperature warning	Integer
.81 .82 and .83	SYSTEM Enclosure fan[0-2] tacho meter	Integer
.84 .85 .86 and .87	CPU[0-1] Fan[0-1] tacho meter	Integer
.91 and .92	PSU[0-1] Fan underspeed warning	Integer
.31 and .34	PSU[0-1] Active (power?)	Integer

Sun Fire V440

Object	Description	Unit
.28 .30 .32 and .34	HDD[0-3] Service Required indicator	Integer
.37 and .41	PSU[0-1] Service Required indicator	Integer
.46	SYSTEM Service Required indicator	Integer
.43	Keyswitch	Integer
.98 .100 .102 and .104	CPU[0-3] Core temperature	Degrees
.106	MOBO temperature	Degrees
.107	SCSI temperature	Degrees
.131 and .132	PSU[0-1] Predict fan fault	Integer
.121	PCIFAN tacho meter	Integer
.122 and .123	CPUFAN[0-1] tacho meter	Integer
.36 and .40	PSU[0-1] Power OK	Integer
.124 .125 .126 and .127	CPU[0-3] Power fault	Integer
.128	MOBO Power fault	Integer

kilala.nl tags: sysadmin, unix, solaris, nagios,


Combining net-SNMP and SUNWmasf on Solaris

2008-01-01 00:00:00

In some cases you're going to want to use Net-SNMP on your Solaris hosts, while still being able to monitor Sun-specific SNMP objects. It took me a while to get all of this to work and it's a bit of a puzzle, but here's how to make it work.

In our current environment at $CLIENT we want to standardise all of our UNIX hosts to the Net-SNMP agent software. This will allow us to use a configuration file which can be at least 60% identical on each host, making life just a little bit easier for all of us. Unfortunately Net-SNMP isn't equipped to deal with all of Sun's specific SNMP objects, so we're going to have to make a few big modifications to the software.

Of course packaging all these changes into one big .PKG is the nicest way of ensuring that all required changes are made in one blow, so that's what I've done. Unfortunately I cannot share this package with you, since it contains quite a large amount of $CLIENT internal information. I may be tempted at another time to recreate a non-$CLIENT version of the package that can be used elsewhere.

Re-compiling Net-SNMP

The latest versions of Net-SNMP come with experimental LM_Sensors support for Sun hardware. Oddly, I've found that you need to drop one version below the latest version to get it to work nicely with Solaris 8. So here are the steps to take...

Download the source code for Net-SNMP version 5.2.3 from their website.
Move the .TGZ to your build system and unpack it in your regular build location. Also, building Net-SNMP successfully requires OpenSSL 0.9.7g or higher, so make sure that it's installed on your build system.
Run the configure script with the following options:
--with-mib-modules="host disman/event-mib ucd-snmp/diskio smux agentx disman/event-mib ucd-snmp/lmSensors" --with-perl-module
Run "make", "make test" and "make install" to complete the creation of Net-SNMP. If "make test" fails on every check, it is likely that your system is unable to find the requisite OpenSSL libraries. This may be solved by running:
/usr/bin/crle -c /var/ld/ld.config -l /lib:/usr/lib:/usr/local/lib:/usr/local/ssl/lib
After "make install" has finished all the Net-SNMP files have been installed on your build system. Naturally it's important to know which files to include in your package. To help you, I've created a list of the files that are installed.

Installing SUNWmasf and its components

PLEASE NOTE: SUNWmasf will currently (July 2006) only get useful results on the following models: V210, V240, V250, V440, V1280, E2900, N210, N240, N440, N1280. On other systems you may have more luck using the LM_Sensors pieces of Net-SNMP. They have been tested to work on E450, V880 and 280R.

As I mentioned earlier Net-SNMP with LM_Sensors can only gather limited amounts of Sun specific information. That's besides the fact that it is also still an experimental feature. So we're going to need an alternative SNMP agent to gather more information for us. Enter the SUNWmasf package.

SUNWmasf and its components may be downloaded from the Sun Microsystems website. Either use this direct link (which may be subject to change), or go to www.sun.com/download and search for "Sun SNMP Management Agent".

You can opt to install SUNWmasf manually on each of your clients, but it would be much nicer to include it into your custom made package. To have a full list of all the files and symlinks that you should include, you can take a peek at the prototype file I made for the package. It includes all the files required for Net-SNMP.

Installation of the software couldn't be easier. Just run the following command, after extracting the .TAR.Z file that contains SUNWmasf.

pkgadd -d . SUNWescdl SUNWescfl SUNWeschl SUNWescnl SUNWescpl SUNWmasf SUNWmasfr

Configuring SUNWmasf

Go into /etc/opt/SUNWmasf/conf and replace the snmpd.conf file with the following:

rocommunity public

agentaddress 1161

agentuser daemon

agentgroup daemon

Configuring Net-SNMP

proxy -c public localhost:1161 .1.3.6.1.4.1.42

proxy -c public localhost:1161 .1.3.6.1.2.1.47

Starting the software

Since SUNWmasf relies upon Net-SNMP, it will need to be started after that piece of software. The prototype file I mentioned earlier already takes this into account, but if you're not going to use it just make sure that /etc/init.d/masfd gets called _after_ /etc/init.d/snmpd during the boot process.

Also, I've noticed that SUNWmasf will need about thirty seconds before it can be read using commands like snmpget and snmpwalk.
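If a script needs to read from SUNWmasf right after boot, it can poll until the proxied subtree starts answering instead of sleeping for a fixed time. The sketch below is only an illustration under assumptions: wait_for_masf is a made-up name, -v 2c assumes your agents accept SNMP v2c, and the number of attempts and the delay are yours to pick.

```shell
# Sketch only: poll until the Sun subtree (.1.3.6.1.4.1.42) answers.
# Usage: wait_for_masf HOST COMMUNITY MAX_TRIES DELAY_SECONDS
wait_for_masf() {
    wfm_n=0
    while [ "$wfm_n" -lt "$3" ]; do
        # Ask for the first object below the subtree; -On prints the OID
        # numerically so we can check the reply really comes from below it.
        if snmpgetnext -v 2c -c "$2" -On "$1" .1.3.6.1.4.1.42 2>/dev/null |
            grep -q '^\.1\.3\.6\.1\.4\.1\.42\.'; then
            return 0
        fi
        wfm_n=$((wfm_n + 1))
        sleep "$4"
    done
    return 1
}
```

Something like `wait_for_masf localhost public 12 5` would give the agent up to a minute, which leaves some margin on top of the thirty seconds mentioned above.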

Reading values from the agents

As you may well know, SNMP is a tangly web of numerical identifiers. I will make a nice overview of the various useful OIDs that you can use for monitoring through both LM_Sensors and SUNWmasf. However, I will put these in a separate document, since it falls outside the scope of this mini-howto.

kilala.nl tags: sysadmin, unix, solaris, nagios,


The passing of an era: Nagios

2007-05-20 19:05:00

Well, I have finally unsubscribed myself from the Nagios mailing lists. It was great being a member of those lists while I was working with the software on a daily basis, but these days I've put Nagios behind me. I haven't written one line of Nagios monitoring code for months now.

I'm sure I'll also be skipping this year's Nagios Konferenz unless a job involving monitoring comes up again.

Thanks Ethan, for making such great software freely available! All the best to you and maybe we'll meet again o/

kilala.nl tags: nagios, unix, work, sysadmin,

Using BSD hardware sensors with SNMP.

2006-10-25 09:05:00

Many thanks to my colleague Guldan who pointed me towards a website giving a short description of using the BSD hardware-sensors daemon, together with Nagios in order to monitor your hardware. Using sensord should make things a lot easier for people running BSD, as they won't have to muck about with SNMP OIDs and so on.

kilala.nl tags: work, nagios, unix, sysadmin,

Great minds think alike

2006-10-03 23:31:00

This goes to show that the proverb above is right: Joerg Linge, whom I met at NagKon 2006, just e-mailed me. He mentioned that right around the same time we had both come up with a similar solution to one problem.

The problem: use Nagios plugins through a normal SNMP daemon.

Our solutions were identical when it came to configuring the daemon, but differed slightly when it came to getting the information from the client. The approach is the same, but while he uses Perl for the plugin, I use Bash ^_^

Life's little coincidences :)

Joerg's solution and write-up.

My solution and write-up.

Anywho... Joerg's a cool guy :) Go check out his website and have a look around.

kilala.nl tags: work, nagios, sysadmin,

Nagios Conference, aftermath

2006-09-24 09:04:00

So I made it back home in one piece. My trip back took me around 7.5 hours, which was mostly due to me driving a little bit faster :p

I have to say that the A45 route up north is much less glamorous than the A3 :( The Rast Hofe all look much older and less fancy than the ones on the A3. Ah, but they sufficed anyway...

I'm thinking of moving my summaries from the previous blog posts into one big page in the Sysadmin section. Reckon that should prevent Google from raising the Archives above the Sysadmin section when it comes to Nagios.

/me starts immediately.

kilala.nl tags: work, nagios, website, conference,

Nagios Conference, day 2

2006-09-22 23:27:00

< moved to Sysadmin section, to keep Google from messing up >

kilala.nl tags: work, nagios,

Nagios Conference, intermission

2006-09-21 17:10:00

Astounding by the way, the amount of Apple laptops I see around here. Less than at SANE'06, but still, around 35%. o/

kilala.nl tags: work, nagios,

Nagios Conference, day 1

2006-09-21 17:01:00

< moved to Sysadmin section, to keep Google from messing up >

kilala.nl tags: work, nagios,

Nagios Conference, intermission

2006-09-21 14:19:00

For the conference I had Snow buy me the iMic and a nice Philips microphone. For now though, I'm not completely happy with the setup.

* The mic is omnidirectional and thus doesn't pick up much of what the person out in front is saying, while it does pick up quite a lot of noise from the room.

* iMic is a USB device and it seems that it claims enough CPU resources to mess with the rest of my system :(

Lunch was nice though! <3

kilala.nl tags: work, nagios,

Nagios Conference, day 0

2006-09-20 23:21:00

< moved to Sysadmin section, to keep Google from messing up >

kilala.nl tags: work, nagios,

Off to Germania I go!

2006-09-19 21:13:00

The next few days I'll be in Germania... Nurnberg, to be precise.

Together with around eighty other Nagios administrators and experts I'll be attending the first, annual Nagios Conference. Over the course of two days, we'll get a chance to meet up together, exchange ideas and generally have a go at improving both Nagios and our knowledge of the software. I'm looking forward to it quite a lot.

Maybe I'll even meet up with a few of the mailing list members :) I'll bring the camera and I'll try to snap a few quick pics.

kilala.nl tags: work, snow, nagios,

Nagios clients for UNIX/Linux

2006-07-27 13:01:00

I've added a small comparison between the various ways in which your Nagios server can communicate with its clients. It's in the menu on the left, or you can go there directly.

kilala.nl tags: work, unix, nagios, sysadmin,

Using SNMP with Solaris and Sun hardware

2006-07-26 16:25:00

After digging through Sun's MIB description (see SUN-PLATFORM-MIB.txt) it became clear to me that things are a lot more convoluted than I originally expected. For example, each sensor in the Sun Fire systems leads to at least five objects, each describing another aspect of the sensor (name, value, expected value, unit, and so on). Unfortunately Sun has no (public) description of all possible SNMP sensor objects, so I've come to the following two conclusions:

1. I'll figure it all out myself. For each model that we're using I'll weasel out every possible sensor and all information relevant to these sensors.

2. I'll have to write my own check script for Nagios which deals with all the various permutations of sensor arrays in an appropriate fashion. Joy...

EDIT:

For your reference, Sun has released the following documents that pertain to their SNMP implementation. Mostly they're a slight expansion on the info from the MIB. At least they're much easier on the eyes when reading :p

* 817-2559-13

* 817-6832-10

* 817-6238-10

* 817-3000-10

kilala.nl tags: unix, work, nagios, sysadmin,

Sun-platform-mib.txt

2006-07-25 09:34:00

Right now I'm working on getting my Sun systems properly monitored through SNMP. Using the LM_sensors module for Net-SNMP has gotten me quite far, but there's one drawback. A lot of Sun's internal counters use some really odd values that don't speak for themselves. This makes it necessary to read through Sun's own MIB and correlate the data in there with the stuff from LM_sensors.

Point is, Sun isn't very forthcoming with their MIB even though it should probably be public knowledge. Nowhere on the web can I find a copy of the file. The only way to get it is by extracting it from Sun's free SUNWmasfr package, which I have done: here's SUN-PLATFORM-MIB.txt

In no way am I claiming this file to be a product of mine and it definitely has Sun's copyright on it. I just thought I'd make the file a -little- bit more accessible through the Internet. If Sun objects, I'm sure they'll tell me :3

kilala.nl tags: unix, work, nagios, sysadmin,

Fixes to check_log2 and check_log3

2006-06-19 15:11:00

Both check_log2 and check_log3 have been thoroughly debugged today. Finally. Thanks to both Kyle Tucker and Ali Khan for pointing out the mistakes I'd made. I also finally learned the importance of proper testing tools, so I wrote test_log2 and test_log3 which run the respective check scripts through all the possible states they can encounter.

Oh... check_ram was also -finally- modified to take the WARN and CRIT percentages through the command line. Shame on me for not doing that earlier.
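
For illustration, the threshold comparison behind such WARN and CRIT parameters boils down to something like the sketch below. This is not the actual check_ram code: the function name classify_usage and its argument order are made up for the example, and the return values follow the usual Nagios plugin exit codes.

```shell
#!/bin/sh
# Hypothetical helper: classify a usage percentage against WARN/CRIT thresholds.
# Prints a Nagios-style status line and returns the matching plugin exit code.
# Usage: classify_usage USED_PCT WARN_PCT CRIT_PCT
classify_usage() {
    used=$1
    warn=$2
    crit=$3
    if [ "$used" -ge "$crit" ]; then
        echo "CRITICAL - ${used}% in use"
        return 2
    elif [ "$used" -ge "$warn" ]; then
        echo "WARNING - ${used}% in use"
        return 1
    fi
    echo "OK - ${used}% in use"
    return 0
}
```

Calling classify_usage 90 85 95 would report a warning, since 90 sits between the two thresholds.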

kilala.nl tags: work, unix, nagios, sysadmin,

Check_log3 is born

2006-06-01 14:53:00

Today I made an improved version of the Nagios monitor "check_log2", which is now aptly called "check_log3". Version 3 of this script gives you the option to add a second query to the monitor. The previous two incarnations of the script only allowed you to search for one query and would return a Critical if it was found. Now you can also add a query which will result in a Warning message as well. Goody!
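
The two-query idea can be sketched roughly as follows. Mind you, this is not the real check_log3: the function name scan_log is hypothetical, it assumes a POSIX grep, and it skips all the bookkeeping needed to ignore log lines that were already seen on a previous run.

```shell
#!/bin/sh
# Hypothetical sketch: scan a log file for a critical and a warning pattern.
# The critical pattern wins if both are present.
# Usage: scan_log LOGFILE CRIT_PATTERN WARN_PATTERN
scan_log() {
    logfile=$1
    crit=$2
    warn=$3
    if [ ! -r "$logfile" ]; then
        echo "UNKNOWN - cannot read $logfile"
        return 3
    fi
    if grep -e "$crit" "$logfile" >/dev/null 2>&1; then
        echo "CRITICAL - found \"$crit\" in $logfile"
        return 2
    fi
    if grep -e "$warn" "$logfile" >/dev/null 2>&1; then
        echo "WARNING - found \"$warn\" in $logfile"
        return 1
    fi
    echo "OK - no matches in $logfile"
    return 0
}
```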

kilala.nl tags: work, unix, nagios, sysadmin,

How do Nagios clients on Windows communicate?

2006-06-01 00:00:00

After reading through my small write-up on Nagios clients on UNIX you may also be interested in the same story for Windows systems.

Since Nagios was originally written with UNIX systems in mind, it'll be a little bit trickier to get the same amount of information from a Windows box. Luckily there are a few tools available that will help you along the way.

For a quick introduction to the Nagios clients, read the write-up linked above. Or pick it from the menu on the left.

A quick comparison

	NSClient	NRPEnt	NSClient++	SNMP	SNMP traps	NC_net **
Connection initiation	Srv -> Clnt	Srv -> Clnt	Srv -> Clnt	Srv -> Clnt	Clnt -> Srv	Clnt -> Srv Srv -> Clnt
Security	Password	Password Encryption	Password Encryption * ACL	Access List Password	Access List Password	Encryption ACL
Configuration	On client	On client	On client	On client	On client and On server	On client
Difficulty	Moderate	Moderate	Moderate	Hard	Hard	Moderate
Resource usage ***	unknown	unknown	9MB RAM	unknown	unknown	30MB RAM
Available	Here	Here	Here	Here	Here	Here

*: Thanks to Jeronimo Zucco for pointing out that encryption in NSClient++ only works when used with the NRPE DLL.

**: Thanks to Anthony Montibello for pointing out recent changes to NC_Net, which is now at version 3.

***: Thanks to Kyle Hasegawa for providing me with resource usage info on the various clients.

NSClient

NSClient was originally written to work with Nagios when it was still called NetSaint: a long, long time ago. NSClient only provides you with access to a very small number of system metrics, including those that are usually available through the Windows Performance Tool.

Personally I have no love for this tool since it is quite fidgety to use. In order to use NSClient on your systems, you will need to do the following.

Install NSClient on your Windows box.
Define check commands based on the check_nt plugin.
Use these check commands for your service checks.

You can now set up your services.cfg in such a way that each remote service is checked like so:

define service{
   host_name remote-host
   service_description D_ROOT
   check_command check_nt_disk!C!85!95
}

Your check command definition would look something like this:

define command {
   command_name check_nt_disk
   command_line /usr/local/nagios/libexec/check_nt -H $HOSTADDRESS$ -p 1248 -v USEDDISKSPACE -l $ARG1$ -w $ARG2$ -c $ARG3$
}

NRPEnt

NRPEnt is basically a drop-in replacement for NRPE on Windows. It really does work the same way: on the Nagios server you run check_nrpe and on the Windows side you have plugins to run locally. These plugins can be binaries, Perl scripts, VBScript, .BAT files, whatever.

To set things up, you'll need the same things as with the normal NRPE.

Download and install NRPEnt on each client,
Install all check scripts on the system, and
Configure the NRPE daemon to have a check command for each check you would like to perform.

You can now set up your services.cfg in such a way that each remote service is checked like so:

define service{
   host_name remote-host
   service_description D_ROOT
   check_command check_nrpe!check_root
}

And in nrpent.cfg on the client you would need to include:

command[check_root]=C:\windows\system32\cscript.exe //NoLogo //T:10 c:\nrpe_nt\check_disk.wsf /drive:"c:/" /w:300 /c:100

NSClient++

Due to the limited use provided by NSClient, someone decided to create NSClient++. This piece of software is a lot more useful because it actually combines the functionality of the original NSClient and that of NRPEnt into one Windows daemon.

NSClient++ includes the same security measures as NRPEnt and NSClient, but adds an ACL functionality on top of that.

On the configuration side things are basically the same as with NSClient and NRPEnt. You can use both methods to talk to a client running NSClient++.

SNMP

Unfortunately I haven't yet worked with SNMP on Windows systems, so I can't tell you much about this. I'm sure though that things won't be much different from the UNIX side. So please check the Nagios UNIX clients story for the full details.

To make proper use of monitoring through SNMP you'll need to:

Install an SNMP daemon / agent on your system,
put all the check scripts you want to run on the system as well,
register a private Enterprise ID with IANA to hold your custom objects,
configure the SNMP daemon in such a way that the results of the check scripts are placed in custom objects, and
configure basic security on the SNMP daemon.

Unfortunately the check_snmp script that comes with Nagios isn't flexible enough to let you monitor custom SNMP objects in a nice way. This is why I wrote the retrieve_custom_snmp script, which is available from the menu. Your service definition would look like this:

define service{
   host_name remote-host
   service_description D_ROOT
   check_command retrieve_custom_snmp!.1.3.6.1.4.1.6886.4.1.4
}

As I said, I haven't configured a Windows SNMP daemon before, so I really can't tell you what the config would look like. Just look for options similar to "EXEC", which allows you to run a certain command on demand.

Just as is the case with UNIX systems you will need to dig around the MIB files provided to you by Microsoft and your hardware vendors to find the OIDs for interesting metrics. It's not an easy job, but with some luck you'll find a website where someone's already done the hard work for you :)

SNMP traps

SNMP doesn't involve polling alone. SNMP-enabled devices can also be configured to automatically send status updates to a so-called trap host. The downside to receiving SNMP traps with Nagios is that it takes quite a lot of work to get them into Nagios :D

To make proper use of monitoring through SNMP traps you'll need to:

install an SNMP daemon / agent on your system,
define all the SNMP traps you would like to send on the client,
install an SNMP trap daemon on your server,
configure the SNMP trap daemon to tell it what to do with the incoming traps,
install something that makes a translation between SNMP traps and Nagios service definitions.

There are -many- ways to get the SNMP traps translated for Nagios' purposes, 'cause there's many roads that lead to Rome. Unfortunately none of them are very easy to use.

SNMPtt, an open source tool.
EventDB, a database driven piece of software by Netways GMBH.
My crappy-ass solution, that just cross-references a list of OIDs to a list of Nagios actions.
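
Whichever tool you pick, the last step is always the same: turning a trap into a passive check result for Nagios. The sketch below shows only that step. The function names, the lookup table and the trap OIDs in it are invented for the example; the PROCESS_SERVICE_CHECK_RESULT line itself follows the standard Nagios external command format.

```shell
#!/bin/sh
# Hypothetical sketch: cross-reference a trap OID to a Nagios state and
# build the external command line for a passive service check result.

# Made-up lookup table: trap OID -> "state;service description".
lookup_trap() {
    case "$1" in
        .1.3.6.1.4.1.6886.4.2.1) echo "2;D_ROOT" ;;   # invented "disk full" trap
        .1.3.6.1.4.1.6886.4.2.2) echo "0;D_ROOT" ;;   # invented "disk ok" trap
        *) return 1 ;;                                # unknown trap, ignore it
    esac
}

# Usage: format_passive_result TIMESTAMP HOST TRAP_OID MESSAGE
format_passive_result() {
    entry=$(lookup_trap "$3") || return 1
    state=${entry%%;*}
    service=${entry#*;}
    echo "[$1] PROCESS_SERVICE_CHECK_RESULT;$2;$service;$state;$4"
}
```

The resulting line would then be appended to the Nagios external command file; where that file lives depends on your installation.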

NC_net

NC_net is another replacement for the original NSClient daemon. It performs the same basic checks, plus a few additional ones, but it is not extensible with your own scripts (like NRPEnt is).

So why run NC_net instead of NSClient++? Because it is capable of sending passive check results to your Nagios server using a send_nsca-alike method. So if you're going all the way in making all your service checks passive, then NC_net is the way to go.

I haven't worked with NC_net yet, so I can't tell you anything about how it works. Too bad :(

UPDATE 31/10/2006:
I was informed by Marlo Bell of the Nagios mailing list that NC_net version 3.x does indeed allow running your own scripts and calling them through the NRPEnt interface! That's great to know, as it does in fact make NC_net the most versatile solution for monitoring your Windows systems with Nagios.

Also, Anthony Montibello (lead NC_Net dev) tells me that NC_Net 3 requires dotNET 2.0.

kilala.nl tags: tutorial, sysadmin, nagios, windows,

Nagios script: retrieve_custom_snmp

2006-06-01 00:00:00

This script was written while I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

One of the things we've been looking into recently, is running the standard Nagios plugins through SNMP instead of through NRPE. Putting aside the discussion of the various merits and flaws such a solution has, let's say that it works nicely.

How do you do this?

In your snmpd.conf add a line like:

exec .1.3.6.1.4.1.6886.4.1.1 check_load /usr/local/nagios/libexec/check_load

exec .1.3.6.1.4.1.6886.4.1.2 check_mem /usr/local/nagios/libexec/check_mem -w 85 -c 95

exec .1.3.6.1.4.1.6886.4.1.3 check_swap /usr/local/nagios/libexec/check_swap -w 15% -c 5%

What this does, is tell the SNMP daemon to run the matching check script (check_load, check_mem or check_swap) when someone asks for object .1.3.6.1.4.1.6886.4.1.1 (or .2, or .3 respectively). The exit code of the script will be placed in OID.100.1 and the first line of output will be placed in OID.101.1. The retrieve_custom_snmp script retrieves those two values through SNMP and returns them to Nagios.
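
To make that layout a bit more concrete: for a check registered at a given base OID, the two result objects can be derived as sketched below. The helper names are my own invention. The snmpget options in the comments mirror the ones used by the script further down, and the community string "public" and the host name are assumptions.

```shell
#!/bin/sh
# Hypothetical helpers that derive the result objects for an "exec" entry,
# using the same .100.1 / .101.1 suffixes as the retrieve_custom_snmp script.
status_oid() { echo "$1.100.1"; }   # exit code of the check script
output_oid() { echo "$1.101.1"; }   # first line of the script's output

# Example for the check_load entry (needs a reachable snmpd, so it is left
# commented out here):
#   BASE=".1.3.6.1.4.1.6886.4.1.1"
#   snmpget -Oqv -v 2c -c public remote-host "$(status_oid "$BASE")"
#   snmpget -Oqv -v 2c -c public remote-host "$(output_oid "$BASE")"
```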

Your checkcommands.cfg should contain something like:

define command{
   command_name retrieve_custom_snmp
   command_line $USER1$/retrieve_custom_snmp -H $HOSTADDRESS$ -o $ARG1$
}

The "-o" parameter takes the OID you have selected for your custom check.

Now... How do you select an OID? There's two ways:

1. The WRONG way = randomly selecting some OID. You might pick an OID which is needed for other monitoring purposes in your network.

2. The RIGHT way = requesting a private Enterprise ID for your company at IANA. You are free to build an SNMP tree beneath this EID. For example, the EID 6886 mentioned above is registered to KPN (my current client). The sub-tree .4.1 contains all OIDs referring to Nagios checks performed by my department.

Before sending out that request, please check the current EID list to see if your company already owns a private subtree. If that's the case, contact the "owner" to request your own part of the subtree.

UPDATE (2006-10-02):

Thanks to the kind folks on the Nagios Users ML I've found out that my original version of the script was totally bug-ridden. I've made a big bunch of adjustments and now the script should work properly. Thanks especially to Andreas Ericsson.


#!/bin/bash
#
# Script to retrieve custom SNMP objects set using the "exec" handler
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide, the Netherlands
# Last Modified: 18-07-2006
# 
# Usage: ./retrieve_custom_snmp -H hostname -o OID
#
# Description:
#   On our Nagios client systems we use a lot of custom MIB OIDs which are
# registered under our own Enterprise ID. A whole bunch of the 
# original Nagios scripts are run through the SNMP daemon and their exit
# codes and output are appended to specific OIDs. This all happens using the
# SNMP "exec" handler.
#   Unfortunately the default check_snmp script doesn't allow for easy 
# handling of these objects, so I hacked together a quick script. 
#
# So basically this script doesn't do any checking. It just retrieves 
# information :)
#
# Limitations:
# This script should work properly on all implementations of Linux, Solaris
# and Mac OS X.
#
# Output:
# The exit code is the exit code retrieved from OID.100.1. It is temporarily
# stored in $EXITCODE.
# The output string is the string retrieved from OID.101.1. It is
# temporarily stored in $OUTPUT.
#
# Other notes:
#   If you ever run into problems with the script, set the DEBUG variable
# to 1. I'll need the output the script generates to do troubleshooting.
# See below for details.
#   I realise that all the debugging commands strewn throughout the script
# may make things a little harder to read. But in the end I'm sure it was
# well worth adding them. It makes troubleshooting so much easier. :3
#   Also, for some reason the case statement with the shifts (to detect
# passed options) doesn't seem to be working right. FIXME!
#
# Check command definition:
# define command{
#       command_name    retrieve_custom_snmp
#       command_line    $USER1$/retrieve_custom_snmp -H $HOSTADDRESS$ -o $ARG1$
#		}
#

# You may have to change this, depending on where you installed your
# Nagios plugins
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh
PROGNAME="retrieve_custom_snmp"
COMMUNITY="public"

[ `uname` == "SunOS" ] && SNMPGET="/usr/local/bin/snmpget -Oqv -v 2c -c $COMMUNITY"
[ `uname` == "Darwin" ] && SNMPGET="/usr/bin/snmpget -Oqv -v 2c -c $COMMUNITY"
[ `uname` == "Linux" ] && SNMPGET="/usr/bin/snmpget -Oqv -v 2c -c $COMMUNITY"

### DEBUGGING SETUP ###
# Cause you never know when you'll need to squash a bug or two
DEBUG="0"

if [ $DEBUG -gt 0 ]
then
        DEBUGFILE="/tmp/foobar"
        rm $DEBUGFILE >/dev/null 2>&1
fi


### REQUISITE NAGIOS COMMAND LINE STUFF ###

print_usage() {
	echo "Usage: $PROGNAME -H hostname -o OID"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "Script to retrieve the status for custom SNMP objects."
	echo ""
	echo "This plugin was not developed by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1"; do
    case "$1" in
        --help)
            print_help
            exit $STATE_OK
            ;;
        -h)
            print_help
            exit $STATE_OK
            ;;
        -H)
            HOST=$2
            shift
            ;;
        -o)
            OID=$2
            STATUS="$OID.100.1"
            STRING="$OID.101.1"
            shift
            ;;
        *)
            echo "Unknown argument: $1"
            print_usage
            exit $STATE_UNKNOWN
            ;;
    esac
    shift
done


### FINALLY... RETRIEVING THE VALUES ###

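EXITCODE=`$SNMPGET "$HOST" "$STATUS"`
[ $DEBUG -gt 0 ] && echo "Retrieve exit code is $EXITCODE" >> $DEBUGFILE

OUTPUT=`$SNMPGET "$HOST" "$STRING" | sed 's/"//g'`
[ $DEBUG -gt 0 ] && echo "Retrieve status message is: $OUTPUT" >> $DEBUGFILE

# If snmpget failed, EXITCODE is not a valid Nagios state. Report UNKNOWN
# instead of passing garbage to "exit".
case "$EXITCODE" in
    0|1|2|3) ;;
    *) echo "UNKNOWN - could not retrieve $STATUS from $HOST"
       exit $STATE_UNKNOWN ;;
esac

echo "$OUTPUT"
exit $EXITCODE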

kilala.nl tags: nagios, unix, programming,

Nagios script: check_suncluster

2006-06-01 00:00:00

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

A few of our projects and services are run on Solaris systems running Sun Cluster software. Since there were no Nagios scripts available to perform checks against Sun Cluster I made a basic script that checks the most important factors.

This script performs a different function, depending on the parameter with which it is called. This allows you to define multiple service checks in Nagios, without needing separate check scripts for each.

EDIT:

Oh! Just like my other recent Nagios scripts, check_suncluster comes with a debugging option. Set $DEBUG at the top of the file to anything larger than zero and the script will dump information at various stages of its execution. And like my other, recent scripts it also comes with its own test script.
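
The header of the script notes that it only works in Korn shell, because bash runs a piped while loop in a sub-shell and so loses any counters set inside that loop. One common workaround, which is not what check_suncluster itself does, is to feed the loop from a here-document instead of a pipe. The sketch below shows the idea; the function name count_online and its input format are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: count "online" entries without a pipe, so that the
# counter survives the loop in bash as well as in ksh.
# Usage: count_online "NAME STATUS pairs, one pair per line"
count_online() {
    count=0
    while read -r name status; do
        [ "$status" = "online" ] && count=$((count + 1))
    done <<EOF
$1
EOF
    echo "$count"
}

# Example (assumes scstat is available, as in the script itself):
#   count_online "$(scstat -W | grep "Transport path:" | awk '{print $3" "$6}')"
```

Because the loop reads from a here-document, it runs in the current shell and the value of count is still there once the loop has finished.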


#!/usr/bin/ksh
#
# Nagios check script for Sun Cluster.
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide SYS, the Netherlands
# Last Modified: 25-09-2006
#
# Usage: ./check_suncluster [-t, -q, -g, -G resource-group, -r, -R resource, -i]
#
# Description:
# This script is capable of performing a number of basic checks on a 
# system running Sun Cluster. Depending on the parameter you pass to 
# it, it will check:
# * Transport paths (-t).
# * Quorum (-q).
# * Resource groups (-g).
# * One selected resource group (-G).
# * Resources (-r).
# * One selected resource (-R).
# * IPMP groups (-i).
#
# Limitations:
# This script will only work with Korn shell, due to some funky while
# looping with pipe forking. Bash doesn't handle this very gracefully,
# due to its sub-shell variable scoping. Maybe I really should learn
# to program in Perl.   
#
# Output:
# * Transport paths return a WARN when one of the paths is down and a
#   CRIT when all paths are offline. 
# * Quorum returns a WARN when not all, but enough quorum devices are
#   available. It returns a CRIT when quorum cannot be reached.
# * Resource groups returns a CRIT when a group is offline on all nodes
#   and a WARN if a group is in an unstable state.
# * Resources returns a CRIT when a resource is offline on all nodes
#   and a WARN if a resource is in an unstable state.
# * IPMP groups returns a CRIT when a group is offline.
#
# Other notes:
# Aside from the debugging output that I've built into most of my recent
# scripts, this check script will also have a testing mode  hacked on, as
# a bag on the side. This testing mode is only engaged when the test_check_suncluster
# script is being run and will intentionally "break" a few things, to 
# verify the failure options of this check script.
#

# Enabling the following dumps information into DEBUGFILE at various
# stages during the execution of this script.
DEBUG=0
DEBUGFILE="/tmp/foobar"

if [ -f /tmp/neko-wa-baka ]
then
	if [ `cat /tmp/neko-wa-baka` == "Nyo!" ]
	then
	   TESTING="1"
	else
	   TESTING="0"
	fi
else
	TESTING="0"
fi


### REQUISITE NAGIOS USER INTERFACE STUFF ###

# You may have to change this, depending on where you installed your
# Nagios plugins
PATH="/usr/bin:/usr/sbin:/bin:/sbin:/usr/cluster/bin"
LIBEXEC="/usr/local/nagios/libexec"
PROGNAME="check_suncluster"
. $LIBEXEC/utils.sh

[ $DEBUG -gt 0 ] && rm $DEBUGFILE 

print_usage() {
        echo "Usage: $PROGNAME [-t, -q, -g, -G resource-group, -r, -R resource, -i]"
        echo "Usage: $PROGNAME --help"
}

print_help() {
        echo ""
        print_usage
        echo ""
        echo "Sun Cluster check plugin for Nagios"
        echo ""
        echo "-t: check transport paths"
        echo "-q: check quorum"
        echo "-g: check resource groups"
        echo "-G: check one individual resource group"
        echo "-r: check all resources"
        echo "-R: check one individual resource"
        echo "-i: check IPMP groups"
        echo ""
        echo "This plugin was not developed by the Nagios Plugin group."
        echo "Please do not e-mail them for support on this plugin, since"
        echo "they won't know what you're talking about :P"
        echo ""
        echo "For contact info, read the plugin itself..."
}


### SUB-ROUTINE DEFINITIONS ### 

function check_transport_paths
{
[ $DEBUG -gt 0 ] && echo "Starting check_transport_path subroutine." >> $DEBUGFILE

	TOTAL=`scstat -W | grep "Transport path:" | wc -l`
	let COUNT=0

	# The loop variable is called TPATH, not PATH, so that the command
	# search path does not get clobbered.
	scstat -W | grep "Transport path:" | awk '{print $3" "$6}' | while read TPATH STATUS
	do
[ $DEBUG -gt 0 ] && echo "Before math, Count has the value of $COUNT." >> $DEBUGFILE
		if [ $STATUS == "online" ]
		then
		   let COUNT=$COUNT+1
		fi
[ $DEBUG -gt 0 ] && echo "Path: $TPATH has status $STATUS" >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Count: $COUNT online transport paths." >> $DEBUGFILE
	done

[ $DEBUG -gt 0 ] && echo "Count: Outside the loop it has a value of $COUNT." >> $DEBUGFILE
[ $TESTING -gt 0 ] && COUNT="0"

	if [ $COUNT -lt 1 ]
	then
	   echo "NOK - No transport paths online."
	   exit $STATE_CRITICAL
	elif [ $COUNT -lt $TOTAL ]
	then
	   echo "NOK - One or more transport paths offline."
	   exit $STATE_WARNING
	fi
}

function check_quorum
{
[ $DEBUG -gt 0 ] && echo "Starting check_quorum subroutine." >> $DEBUGFILE
	NEED=`scstat -q | grep "votes needed:" | awk '{print $4}'`
	PRES=`scstat -q | grep "votes present:" | awk '{print $4}'`

[ $DEBUG -gt 0 ] && echo "Quorum needed: $NEED" >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Quorum present: $PRES" >> $DEBUGFILE

[ $TESTING -gt 0 ] && PRES="0"
	if [ $PRES -ge $NEED ]
	then
[ $DEBUG -gt 0 ] && echo "Enough quorum votes." >> $DEBUGFILE
		scstat -q | grep "votes:" | awk '{print $3" "$6}' | while read VOTE STATUS
		do
[ $DEBUG -gt 0 ] && echo "Vote: $VOTE has status $STATUS." >> $DEBUGFILE
			if [ $STATUS != "Online" ] 
			then
			   echo "NOK - Quorum vote $VOTE not available."
			   exit $STATE_WARNING
			fi
		done		
	else
[ $DEBUG -gt 0 ] && echo "Not enough quorum." >> $DEBUGFILE
		echo "NOK - Not enough quorum votes present."
		exit $STATE_CRITICAL
	fi
}

function check_resource_groups
{
[ $DEBUG -gt 0 ] && echo "Starting check_resource_groups subroutine." >> $DEBUGFILE
	scstat -g | grep "Group:" | awk '{print $2}' | sort -u | while read GROUP
	do
	ONLINE=`scstat -g | grep "Group: $GROUP" | grep "Online" | wc -l`
	WEIRD=`scstat -g | grep "Group: $GROUP" | grep -v "Resources" | grep -v "Online" | grep -v "Offline" | wc -l`
[ $DEBUG -gt 0 ] && echo "Resource Group $GROUP has $ONLINE instances online." >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Resource Group $GROUP has $WEIRD instances in a weird state." >> $DEBUGFILE
[ $TESTING -gt 0 ] && ONLINE="0"
		if [ $ONLINE -lt 1 ] 
		then
		   echo "NOK - Resource group $GROUP not online."
		   exit $STATE_CRITICAL
		fi
                if [ $WEIRD -gt 1 ]
                then
                   echo "NOK - Resource group $GROUP is in an unstable state."
                   exit $STATE_WARNING
                fi
	done
}

function check_resource_grp
{
[ $DEBUG -gt 0 ] && echo "Starting check_resource_grp subroutine." >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Selected group: $RGROUP" >> $DEBUGFILE
	ONLINE=`scstat -g | grep $RGROUP | grep "Online" | wc -l`
	WEIRD=`scstat -g | grep $RGROUP | grep -v "Resources" | grep -v "Online" | grep -v "Offline" | wc -l`
[ $DEBUG -gt 0 ] && echo "Resource Group $RGROUP has $ONLINE instances online." >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Resource Group $RGROUP has $WEIRD instances in a weird state." >> $DEBUGFILE
[ $TESTING -gt 0 ] && ONLINE="0"
	if [ $ONLINE -lt 1 ] 
	then
	   echo "NOK - Resource group $RGROUP not online."
	   exit $STATE_CRITICAL
	fi
	if [ $WEIRD -gt 1 ]
        then
           echo "NOK - Resource group $RGROUP is in an unstable state."
           exit $STATE_WARNING
        fi
}

function check_resources
{
[ $DEBUG -gt 0 ] && echo "Starting check_resources subroutine." >> $DEBUGFILE
	RESOURCES=`scstat -g | grep "Resource:" | awk '{print $2}' | sort -u`
[ $DEBUG -gt 0 ] && echo "List of resources to check: $RESOURCES" >> $DEBUGFILE
	for RESOURCE in `echo $RESOURCES`
	do
	ONLINE=`scstat -g | grep "Resource: $RESOURCE" | awk '{print $4}' | grep "Online" | wc -l` 
	WEIRD=`scstat -g | grep "Resource: $RESOURCE" | awk '{print $4}' | grep -v "Online" | grep -v "Offline" | wc -l`
[ $DEBUG -gt 0 ] && echo "Resource $RESOURCE has $ONLINE instances online." >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Resource $RESOURCE has $WEIRD instances in a weird state." >> $DEBUGFILE
[ $TESTING -gt 0 ] && ONLINE="0"
		if [ $ONLINE -lt 1 ] 
		then
		   echo "NOK - Resource $RESOURCE not online."
		   exit $STATE_CRITICAL
		fi
                if [ $WEIRD -gt 1 ]
                then
                   echo "NOK - Resource $RESOURCE is in an unstable state."
                   exit $STATE_WARNING
                fi
	done
}

function check_rsrce
{
[ $DEBUG -gt 0 ] && echo "Starting check_rsrce subroutine." >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Selected resource: $RSRCE" >> $DEBUGFILE
	ONLINE=`scstat -g | grep "Resource: $RSRCE" | awk '{print $4}' | grep "Online" | wc -l`
	WEIRD=`scstat -g | grep "Resource: $RSRCE" | awk '{print $4}' | grep -v "Online" | grep -v "Offline" | wc -l`
[ $DEBUG -gt 0 ] && echo "Resource $RSRCE has $ONLINE instances online." >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Resource $RSRCE has $WEIRD instances in a weird state." >> $DEBUGFILE
[ $TESTING -gt 0 ] && ONLINE="0"
	if [ $ONLINE -lt 1 ] 
	then
	   echo "NOK - Resource $RSRCE not online."
	   exit $STATE_CRITICAL
	fi
	if [ $WEIRD -gt 1 ]
        then
           echo "NOK - Resource $RSRCE is in an unstable state."
           exit $STATE_WARNING
        fi
}

function check_ipmp
{
[ $DEBUG -gt 0 ] && echo "Starting check_ipmp subroutine." >> $DEBUGFILE
	scstat -i | grep "IPMP Group:" | awk '{print $3" "$5}' | while read GROUP STATUS
	do
[ $DEBUG -gt 0 ] && echo "IPMP Group: $GROUP has status $STATUS" >> $DEBUGFILE
		if [ $STATUS != "Online" ] 
		then
		   echo "NOK - IPMP group $GROUP not online."
		   exit $STATE_CRITICAL
		fi
if [ $TESTING -gt 0 ]
then
   echo "NOK - IPMP group $GROUP not online."
   exit $STATE_CRITICAL
fi
	done
}

### THE MAIN ROUTINE FINALLY STARTS ###

[ $DEBUG -gt 0 ] && echo "Starting main routine." >> $DEBUGFILE

if [ $# -lt 1 ]
then
	print_usage
	exit $STATE_UNKNOWN
fi

[ $DEBUG -gt 0 ] && echo "At least one argument given." >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "" >> $DEBUGFILE

case "$1" in
	--help) print_help; exit $STATE_OK;;
	-h) print_help; exit $STATE_OK;;
	-t) check_transport_paths;;
	-q) check_quorum;;
	-g) check_resource_groups;;
	-G) RGROUP="$2"; check_resource_grp;;
	-r) check_resources;;
	-R) RSRCE="$2"; check_rsrce;;
	-i) check_ipmp;;
	*) print_usage; exit $STATE_UNKNOWN;;
esac

[ $DEBUG -gt 0 ] && echo "No problems. Exiting normally." >> $DEBUGFILE

# None of the other subroutines forced us to exit 1 before here, so let's quit with a 0.
echo "OK - Everything running like it should"
exit $STATE_OK

#!/usr/bin/bash

function testrun()
{
	echo "Running without parameters."
	/usr/local/nagios/libexec/check_suncluster 
	echo "Exit code is $?."
	echo ""

	echo "Testing transport paths."
	/usr/local/nagios/libexec/check_suncluster -t
	echo "Exit code is $?."
	echo ""

	echo "Quorum votes."
	/usr/local/nagios/libexec/check_suncluster -q
	echo "Exit code is $?."
	echo ""

	echo "Checking all resource groups."
	/usr/local/nagios/libexec/check_suncluster -g
	echo "Exit code is $?."
	echo ""

	echo "Checking individual resource groups."
	for GROUP in `scstat -g | grep "Group:" | awk '{print $2}' | sort -u`
	do
		echo "Running for group $GROUP."
		/usr/local/nagios/libexec/check_suncluster -G $GROUP
		echo "Exit code is $?."
		echo ""
	done

	echo "Checking all resources."
	/usr/local/nagios/libexec/check_suncluster -r
	echo "Exit code is $?."
	echo ""
	
	echo "Checking individual resources."
	for RESOURCE in `scstat -g | grep "Resource:" | awk '{print $2}' | sort -u`
	do
		echo "Running for resource $RESOURCE."
		/usr/local/nagios/libexec/check_suncluster -R $RESOURCE
		echo "Exit code is $?."
		echo ""
	done
	
	echo "Checking IPMP groups."
	/usr/local/nagios/libexec/check_suncluster -i
	echo "Exit code is $?."
	echo ""
}

function breakstuff()
{
	# Now we'll start breaking things!!
	echo ""
	echo "Now it's time to start breaking things! Gruaargh!"
	echo "Mind you, it's all fake and simulated. I am not changing -anything-"
	echo "about the cluster itself."
	echo ""
	
	echo "Nyo!" > /tmp/neko-wa-baka 
}

echo "Starting clean"
rm /tmp/neko-wa-baka /tmp/foobar >/dev/null 2>&1
echo ""

testrun
breakstuff
testrun

echo "Starting clean at the end"
rm /tmp/neko-wa-baka  >/dev/null 2>&1
echo ""

kilala.nl tags: nagios, unix, programming,


Nagios script: check_ram

2006-06-01 00:00:00

This script was written at the time I was hired by UPC / Liberty Global.

Basic monitor to check percentage of used physical RAM.

This script was quickly hacked together for my current customer, as a Q&D solution for their monitoring needs. It's no beauty, but it works. Written in ksh and tested with:

Solaris 8
NRPE 1.9
Should work with other versions as well.

UPDATE 19/06/2006:

Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!

I've also -finally- changed the script so that it takes the Warning and Critical percentages from the command line.
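
To give an idea of how the two percentages map onto Nagios states, here is a minimal sketch. It is not the script itself: the function name ram_state and the sample numbers are made up for this example, and the real plugin gets its figures from prtconf and vmstat. Both thresholds are percentages of -free- RAM, so the warning value must be the larger of the two.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# Usage: ram_state FREE_MB TOTAL_MB WARN_PERCENT CRIT_PERCENT
ram_state() {
    free_mb=$1; total_mb=$2; warn=$3; crit=$4

    # Integer percentage of RAM that is still free.
    percent=$(( free_mb * 100 / total_mb ))

    if [ "$percent" -lt "$crit" ]; then
        echo "CRITICAL ${percent}%"
        return 2    # Nagios exit code for CRITICAL
    elif [ "$percent" -lt "$warn" ]; then
        echo "WARNING ${percent}%"
        return 1    # Nagios exit code for WARNING
    fi
    echo "OK ${percent}%"
    return 0
}

ram_state 2048 4096 15 5                            # prints "OK 50%"
ram_state 400 4096 15 5 || echo "exit code: $?"     # prints "WARNING 9%", then "exit code: 1"
ram_state 100 4096 15 5 || echo "exit code: $?"     # prints "CRITICAL 2%", then "exit code: 2"
```

Checking the critical threshold first matters, because anything below the critical value is also below the warning value.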

UPDATE 15/07/2006:

Whoops... I just noticed that the file had gone missing <3


#!/bin/ksh
#
# Free physical RAM monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 20-10-2006
# 
# Usage: ./check_ram warning-percentage critical-percentage
#
# Description:
# This plugin determines how much of the physical RAM in the 
# system is in use.
#
# Limitations:
# Currently this plugin will only function correctly on Solaris systems.
# And it really is only useful at DTV Labs.
#
# Output:
# The script returns either a WARN or a CRIT, depending on the 
# percentage of free physical memory.
#

# Enabling the following dumps information into DEBUGFILE at various
# stages during the execution of this script.
DEBUG="1"
DEBUGFILE="/tmp/foobar"
rm $DEBUGFILE >/dev/null 2>&1
echo "Starting script check_ram." > $DEBUGFILE

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
        echo "WARNING:"
        echo "This script was originally written for use on Solaris."
        echo "You may run into some problems running it on this host."
        echo ""
        echo "Please verify that the script works before using it in a"
        echo "live environment. You can easily disable this message after"
        echo "testing the script."
        echo ""
        exit 1
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_ram"
PATH="/usr/bin:/usr/sbin:/bin:/usr/local/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

print_usage() {
        echo "Usage: $PROGNAME warning-percentage critical-percentage"
        echo ""
        echo "e.g. : $PROGNAME 15 5"
        echo "This will start alerting when more than 85% of RAM has"
        echo "been used."
        echo ""
}

print_help() {
        echo ""
        print_usage
        echo ""
        echo "Free physical RAM plugin for Nagios"
        echo ""
        echo "This plugin was not developed by the Nagios Plugin group."
        echo "Please do not e-mail them for support on this plugin, since"
        echo "they won't know what you're talking about :P"
        echo ""
        echo "For contact info, read the plugin itself..."
}

# Handle the help flags first, then make sure both percentages were passed.
case "$1" in
        --help) print_help; exit $STATE_OK;;
        -h) print_help; exit $STATE_OK;;
        *) if [ $# -lt 2 ]; then print_help; exit $STATE_WARNING; fi ;;
esac

RAM_WARN=$1
RAM_CRIT=$2
[ $DEBUG -gt 0 ] && echo "Warning and Critical percentages are $RAM_WARN and $RAM_CRIT." >> $DEBUGFILE

if [ $RAM_WARN -le $RAM_CRIT ]
then
        echo "Warning percentage should be larger than critical percentage."
        exit $STATE_WARNING
fi

check_space()
{
[ $DEBUG -gt 0 ] && echo "Starting check_space." >> $DEBUGFILE
        TOTALSPACE=0
        TOTALSPACE=`prtconf | grep ^"Memory size" | awk '{print $3}'`
[ $DEBUG -gt 0 ] && echo "Total space is $TOTALSPACE." >> $DEBUGFILE

        TOTALFREE=0
        TOTALFREE=`vmstat 2 2 | tail -1 | awk '{print $5}'`
[ $DEBUG -gt 0 ] && echo "Free space is $TOTALFREE." >> $DEBUGFILE
        let TOTALFREE=$TOTALFREE/1000
[ $DEBUG -gt 0 ] && echo "Free space, div1000 is $TOTALFREE." >> $DEBUGFILE
}

check_percentile() 
{
[ $DEBUG -gt 0 ] && echo "Starting check_percentile." >> $DEBUGFILE
        FRACTION=`echo "scale=2; $TOTALFREE/$TOTALSPACE" | bc`
[ $DEBUG -gt 0 ] && echo "Fraction is $FRACTION." >> $DEBUGFILE

        PERCENT=`echo "scale=2; $FRACTION*100" | bc | awk -F. '{print $1}'`
[ $DEBUG -gt 0 ] && echo "Percentile is $PERCENT." >> $DEBUGFILE

        if [ $PERCENT -lt $RAM_CRIT ]; then
[ $DEBUG -gt 0 ] && echo "$PERCENT is smaller than $RAM_CRIT. Critical." >> $DEBUGFILE
          echo "RAM NOK - Less than $RAM_CRIT % of physical RAM is unused."
          exitstatus=$STATE_CRITICAL
          exit $exitstatus
        fi

        if [ $PERCENT -lt $RAM_WARN ]; then
[ $DEBUG -gt 0 ] && echo "$PERCENT is smaller than $RAM_WARN. Warning." >> $DEBUGFILE
          echo "RAM NOK - Less than $RAM_WARN % of physical RAM is unused."
          exitstatus=$STATE_WARNING
          exit $exitstatus
        fi
}

check_space
check_percentile

[ $DEBUG -gt 0 ] && echo "$PERCENT is greater than $RAM_WARN. OK." >> $DEBUGFILE
echo "RAM OK - $TOTALFREE MB out of $TOTALSPACE MB RAM unused."
exitstatus=$STATE_OK
exit $exitstatus

kilala.nl tags: nagios, unix, programming,


Nagios script: check_processes

2006-06-01 00:00:00

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

A very simple script that takes a list of processes, instead of a single process name (as is the case with check_process). This should make monitoring a basic list of processes a lot easier. I really should change the script in such a way that it takes the process list from the command line, instead of from the $LIST variable that's defined internally. I'll do that when I have the time.

Until I've made that change, I use the script by copying check_processes to a new file which is used specifically for one purpose. For example, check_linux_processes and check_solaris_processes check a list of processes that should be up and running on Linux and Solaris respectively.
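
For what it's worth, taking the list from the command line could look something like the sketch below. This is not the published script, just a rough idea: the function name check_list is hypothetical, and it takes the process listing as its first argument so that it can be tried out on canned data. In real use you would pass it the output of ps -ef.

```shell
#!/bin/sh
# Hypothetical sketch, not the published check_processes script.
# Usage: check_list "PROCESS LISTING" name [name ...]
check_list() {
    listing=$1
    shift

    count=0
    down=""
    for proc in "$@"; do
        # The listing was captured beforehand, so grep never sees itself.
        if ! printf '%s\n' "$listing" | grep -i "$proc" >/dev/null; then
            count=$((count + 1))
            down="$down $proc"
        fi
    done

    if [ "$count" -gt 0 ]; then
        echo "NOK - $count processes not running:$down"
        return 2    # Nagios exit code for CRITICAL
    fi
    echo "OK - All requisite processes running."
    return 0
}

# Canned sample data. In real use: check_list "$(ps -ef)" "$@"
SAMPLE="root     1     0  0 09:00 ?  00:00:01 init
root   412     1  0 09:00 ?  00:00:00 /usr/sbin/sshd"

check_list "$SAMPLE" init sshd
check_list "$SAMPLE" init cron || echo "exit code: $?"
```

With something like this in place, one copy of the script could serve both the Linux and the Solaris process lists, with the list itself living in the Nagios command definition.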

This check script should work on just about any UNIX OS.


#!/bin/bash
#
# Process monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide, the Netherlands
# Last Modified: 13-07-2006
# 
# Usage: ./check_solaris_processes
#
# Description:
# This script couldn't be simpler than it is. It just checks to see
# whether a predefined list of processes is up and running. 
#
# Limitations:
#   This script should work properly on all implementations of Linux, Solaris
# and Mac OS X.
#
# Output:
# If one of the processes is down, a CRIT is issued.
#

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_linux_processes"
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh


### DEFINING THE PROCESS LIST ###
LIST="init"


### REQUISITE NAGIOS COMMAND LINE STUFF ###

print_usage() {
        echo "Usage: $PROGNAME"
        echo "Usage: $PROGNAME --help"
}

print_help() {
        echo ""
        print_usage
        echo ""
        echo "Basic processes list monitor plugin for Nagios"
        echo ""
        echo "This plugin was not developed by the Nagios Plugin group."
        echo "Please do not e-mail them for support on this plugin, since"
        echo "they won't know what you're talking about :P"
        echo ""
        echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
        case "$1" in
          --help) print_help; exit $STATE_OK;;
          -h) print_help; exit $STATE_OK;;
          *) print_usage; exit $STATE_UNKNOWN;;
        esac
done


### FINALLY THE MAIN ROUTINE ###

COUNT="0"
DOWN=""

for PROCESS in `echo $LIST`
do
        if [ `ps -ef | grep -i $PROCESS | grep -v grep | wc -l` -lt 1 ]
        then
                let COUNT=$COUNT+1
                DOWN="$DOWN $PROCESS"
        fi
done

if [ $COUNT -gt 0 ]
then
        echo "NOK - $COUNT processes not running: $DOWN"
        exit $STATE_CRITICAL
fi

# Nothing caused us to exit early, so we're okay.
echo "OK - All requisite processes running."
exit $STATE_OK

kilala.nl tags: nagios, unix, programming,


Nagios script: check_ntp_config

2006-06-01 00:00:00

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

As far as I know there was no Nagios plugin that allowed you to really check your client configuration. I mean, it would be nice to know for sure that all your systems are syncing against the proper server... Wouldn't it?

The script was tested on Redhat ES3, Mac OS X and Solaris. Its basic requirement is the bash shell.
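
The heart of the check is nothing more than reading the server lines from the NTP configuration file and comparing them with the server you expect. Here is a minimal sketch of that idea, assuming an ntp.conf-style file. The function names, the example host ntp.example.org and the 192.0.2.x addresses are made up for this example, and unlike the script below it copes with more than one server line.

```shell
#!/bin/sh
# Hypothetical sketch of the comparison, not the published script.

# Print the configured servers from an ntp.conf-style file, one per line.
configured_servers() {
    awk '$1 == "server" { print $2 }' "$1"
}

# Succeed when the wanted server, by name or by IP address, is configured.
# Usage: uses_server CFGFILE WANTED_NAME WANTED_IP
uses_server() {
    cfgfile=$1; wanted=$2; wanted_ip=$3
    for srv in $(configured_servers "$cfgfile"); do
        if [ "$srv" = "$wanted" ] || [ "$srv" = "$wanted_ip" ]; then
            return 0
        fi
    done
    return 1
}

# Try it out on a throwaway config file.
cfg="${TMPDIR:-/tmp}/ntp_sketch.$$"
printf 'driftfile /var/lib/ntp/drift\nserver ntp.example.org\nserver 192.0.2.10\n' > "$cfg"

if uses_server "$cfg" ntp.example.org 192.0.2.1; then
    echo "OK - NTP client configured correctly."
else
    echo "NOK - NTP client is not configured to speak to ntp.example.org"
fi
rm -f "$cfg"
```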

EDIT:

Oh! Just like my other recent Nagios scripts, check_ntp_config comes with a debugging option. Set $DEBUG at the top of the file to anything larger than zero and the script will dump information at various stages of its execution.
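
If the debug lines strewn throughout the scripts bother you, the same idea can be wrapped in a small helper. This is only a sketch of the pattern and not what the scripts below actually do; the function name debug is my own, and the default file /tmp/foobar is simply the one the scripts use.

```shell
#!/bin/sh
# Hypothetical helper that wraps the "[ $DEBUG -gt 0 ] && echo ..." idiom.
DEBUG="${DEBUG:-0}"
DEBUGFILE="${DEBUGFILE:-/tmp/foobar}"

debug() {
    # Only write to the debug file when debugging is switched on.
    if [ "$DEBUG" -gt 0 ]; then
        echo "$*" >> "$DEBUGFILE"
    fi
}

debug "Starting main routine."
debug "NTP server should be ntp.wxs.nl."
```

Because the test sits inside an if, the helper returns zero when debugging is off, whereas a bare "[ $DEBUG -gt 0 ] && echo" line leaves a non-zero status behind.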


#!/usr/bin/bash
#
# NTP client configuration monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide, the Netherlands
# Last Modified: 10-07-2006
# 
# Usage: ./check_ntp_config
#
# Description:
#   Well, there's not much to tell. We have no way of making sure that our 
# NTP clients are all configured in the right way, so I thought I'd make
# a Nagios check for it. ^_^ 
#   You can change the NTP config at the top of this script, to match your
# own situation.
#
# Limitations:
#   This script should work properly on all implementations of Linux, Solaris
# and Mac OS X.
#
# Output:
#   If the NTP client config does not match what has been defined at the 
# top of this script, the script will return a WARN.
#
# Other notes:
#   If you ever run into problems with the script, set the DEBUG variable
# to 1. I'll need the output the script generates to do troubleshooting.
# See below for details.
#   I realise that all the debugging commands strewn throughout the script
# may make things a little harder to read. But in the end I'm sure it was
# well worth adding them. It makes troubleshooting so much easier. :3
#

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_ntp_config"
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh


### DEFINING THE NTP CLIENT CONFIGURATION AS IT SHOULD BE ###
NTP_SERVER="ntp.wxs.nl"


### DEBUGGING SETUP ###
# Cause you never know when you'll need to squash a bug or two
DEBUG="0"

if [ $DEBUG -gt 0 ]
then
        DEBUGFILE="/tmp/foobar"
        rm $DEBUGFILE >/dev/null 2>&1
fi


### REQUISITE NAGIOS COMMAND LINE STUFF ###

print_usage() {
	echo "Usage: $PROGNAME"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "NTP client configuration monitor plugin for Nagios"
	echo ""
	echo "This plugin was not developed by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
	case "$1" in
	  --help) print_help; exit $STATE_OK;;
	  -h) print_help; exit $STATE_OK;;
	  *) print_usage; exit $STATE_UNKNOWN;;
	esac
done


### DEFINING SUBROUTINES ###

function gather_config()
{
    case `uname` in
	Linux) CFGFILE="/etc/ntp.conf"; IP_SERVER=`host $NTP_SERVER | awk '{print $4}'` ;;
	SunOS) CFGFILE="/etc/inet/ntp.conf"; IP_SERVER=`getent hosts $NTP_SERVER | awk '{print $1}'`;;
	Darwin) CFGFILE="/etc/ntp.conf"; IP_SERVER=`host $NTP_SERVER | awk '{print $4}'` ;;
	*) ;;
    esac

    REAL_SERVER=`cat $CFGFILE | grep ^server | awk '{print $2}'`

[ $DEBUG -gt 0 ] && echo "Gather_config: Host name for required server is $NTP_SERVER." >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Gather_config: IP address for required server is $IP_SERVER." >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "Gather_config: currently configured server is $REAL_SERVER." >> $DEBUGFILE
} 

function check_config()
{
    if [ "$REAL_SERVER" != "$NTP_SERVER" ]
    then
	if [ "$REAL_SERVER" != "$IP_SERVER" ]
	then
	    echo "NOK - NTP client is not configured to speak to $NTP_SERVER"
	    exit $STATE_WARNING
     	fi
    fi
}


### FINALLY, THE MAIN ROUTINE ###

gather_config
check_config

# Nothing caused us to exit early, so we're okay.
echo "OK - NTP client configured correctly."
exit $STATE_OK

kilala.nl tags: nagios, unix, programming,


Nagios script: check_log3

2006-06-01 00:00:00

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

Today I made an improved version of the Nagios monitor "check_log2", which is now aptly called "check_log3". It includes all the improvements I originally added to "check_log2", so you can simply use this as a drop-in replacement.

Version 3 of this script gives you the option to add a second query to the monitor.

The previous two incarnations of the script only allowed you to search for one query and would return a Critical if it was found. Now you can also add a second query, which will result in a Warning message. Goody! :3
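
In a nutshell, the two queries are applied to the new lines in the log and the more serious match wins. The sketch below shows just that decision, assuming the new lines arrive on stdin. It is not the plugin itself: the function name classify_lines is made up, and "neko" and "bla" are simply the test patterns from my test script further down.

```shell
#!/bin/sh
# Hypothetical sketch of the two-query decision, not the plugin itself.
# Usage: classify_lines CRITquery WARNquery   (new log lines on stdin)
classify_lines() {
    critquery=$1; warnquery=$2
    lines=$(cat)

    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    critcount=$(printf '%s\n' "$lines" | grep -c "$critquery" || true)
    warncount=$(printf '%s\n' "$lines" | grep -c "$warnquery" || true)

    if [ "$critcount" -gt 0 ]; then
        echo "CRITICAL ($critcount)"
        return 2
    elif [ "$warncount" -gt 0 ]; then
        echo "WARNING ($warncount)"
        return 1
    fi
    echo "OK"
    return 0
}

printf 'normal\nbaka\n' | classify_lines neko bla
printf 'normal\nbla\n'  | classify_lines neko bla || echo "exit code: $?"
printf 'bla\nneko\n'    | classify_lines neko bla || echo "exit code: $?"
```

The last call contains both patterns and reports Critical, since the critical query is checked first.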

1st of Feb, 2006:

Kyle Tucker pointed out that he had problems running this script with bash on Solaris. The changes he suggested have been worked into the newer version. Thanks Kyle :)

5th of Mar, 2006:

I finally got round to fixing the script according to all the changes Kyle (and others) suggested. So here's another try! Right now I've tested the script on Red Hat, Mac OS X and Solaris, so it should be much better than before.

19th of June, 2006:

Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!

Also stomped out a few horrendous bugs! I'm very sorry for putting out such a buggy script earlier... If you've started using the script in your environment, please download the latest version. Thanks to Ali Khan for pointing out these mistakes.


#!/bin/bash
#
# Log file pattern detector plugin for Nagios
# Written by Ethan Galstad (nagios@nagios.org)
# Last Modified: 07-31-1999
# Heavily modified by Thomas Sluyter (nagiosATkilalaDOTnl)
# Last Modified: 19-06-2006
#
# Usage: ./check_log3 -F log_file -O old_log_file -C crit-pattern -W warn-pattern
#
# Description:
#
# This plugin will scan a log file (specified by the log_file option)
# for specific patterns (specified by the XXX-pattern options).  Successive
# calls to the plugin script will only report *new* pattern matches in the
# log file, since a copy of the log file from the previous run is saved
# to old_log_file.
#
# Output:
#
# On the first run of the plugin, it will return an OK state with a message
# of "Log check data initialized".  On successive runs, it will return an OK
# state if *no* pattern matches have been found in the *difference* between the
# log file and the older copy of the log file.  If the plugin detects any 
# pattern matches in the log diff, it will return a CRITICAL state and print
# out a message in the following format: "(x) last_match", where "x" is the
# total number of pattern matches found in the file and "last_match" is the
# last entry in the log file which matches the pattern.
#
# Notes:
#
# If you use this plugin make sure to keep the following in mind:
#
#    1.  The "max_attempts" value for the service should be 1, as this
#        will prevent Nagios from retrying the service check (the
#        next time the check is run it will not produce the same results).
#
#    2.  The "notify_recovery" value for the service should be 0, so that
#        Nagios does not notify you of "recoveries" for the check.  Since
#        pattern matches in the log file will only be reported once and not
#        the next time, there will always be "recoveries" for the service, even
#        though recoveries really don't apply to this type of check.
#
#    3.  You *must* supply a different old_log_file for each service that
#        you define to use this plugin script - even if the different services
#        check the same log_file for pattern matches.  This is necessary
#        because of the way the script operates.
#
#    4.  Changes to the script were made by Thomas Sluyter (cailin@kilala.nl).
#	 * The first set of changes will allow the script to run properly on Solaris, which
#	   it did not do by default. The second set of changes will allow the following:
#	 * State retention. In the original script, if a NOK was put into the log file
#	   at point A in time and it is not repeated at A+1, then an OK is sent to Nagios. 
# 	   Not something that you would like to happen.
#	      I've added the $oldlog.STATE trigger file which retains the last exitstatus. Should
# 	   there be no new lines added to the log, check_log will simply repeat the last state
#	   instead of give an OK.
#	      In order for this state retention to work properly your client system MUST
#	   HAVE THE DIRECTORY /USR/LOCAL/NAGIOS/VAR.
#        * Two queries. In the original script you could only enter one query which, when
#	   found, would result in  a Critical message being sent to Nagios. I've added the 
#	   possibility to add another query, which will result in a Warning message.
#	 * Bugfix: changed all instances of "crit-count" and "warn-count" to "critcount" and
#	   "warncount" after a tip from Kyle Tucker who ran into problems running this script
#	   with bash on Solaris.
#

# Paths to commands used in this script.  These
# may have to be modified to match your system setup.

PATH="/usr/bin:/usr/sbin:/bin:/sbin"

PROGNAME=`basename $0`
PROGPATH=`echo $0 | sed -e 's,[\\/][^\\/][^\\/]*$,,'`

#. $PROGPATH/utils.sh
. /usr/local/nagios/libexec/utils.sh

print_usage() {
    echo "Usage: $PROGNAME -F logfile -O oldlog -C CRITquery -W WARNquery"
    echo "Usage: $PROGNAME --help"
    echo "Usage: $PROGNAME --version"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "Log file pattern detector plugin for Nagios"
    echo ""
    support
}

# Make sure the correct number of command line
# arguments have been supplied

if [ $# -lt 8 ]; then
    print_usage
    exit $STATE_UNKNOWN
fi

# Grab the command line arguments

exitstatus=$STATE_WARNING #default
while test -n "$1"; do
    case "$1" in
        --help)
            print_help
            exit $STATE_OK
            ;;
        -h)
            print_help
            exit $STATE_OK
            ;;
        -F)
            logfile=$2
            shift
            ;;
        -O)
            oldlog=$2
            shift
            ;;
        -C)
            CRITquery=$2
            shift
            ;;
        -W)
            WARNquery=$2
            shift
            ;;
        *)
            echo "Unknown argument: $1"
            print_usage
            exit $STATE_UNKNOWN
            ;;
    esac
    shift
done

# If the source log file doesn't exist, exit

if [ ! -e $logfile ]; then
    echo "Log check error: Log file $logfile does not exist!"
    echo $STATE_UNKNOWN > $oldlog.STATE
    exit $STATE_UNKNOWN
fi

# If the dump/temp log file doesn't exist, this must be the first time
# we're running this test, so copy the original log file over to
# the old diff file and exit

if [ ! -e $oldlog ]; then
    cat $logfile > $oldlog

    TEMPcount=0
    let TEMPcount=$TEMPcount+$(tail -1 $logfile | grep -i $WARNquery | wc -l | awk '{print $1}')
    let TEMPcount=$TEMPcount+$(tail -1 $logfile | grep -i $CRITquery | wc -l | awk '{print $1}')

    if [ $TEMPcount -gt 0 ]
    then
       echo "Log check data initialized... Last line contained error message."
       echo $STATE_WARNING > $oldlog.STATE
       exit $STATE_WARNING
    else
       echo "Log check data initialized..."
       echo $STATE_OK > $oldlog.STATE
       exit $STATE_OK
    fi
fi

# A bug which was caught very late:
# If newlog is shorter than oldlog, the diff used below will return
# false positives for the query because they will still be in $oldlog. Why?
# Because $oldlog is not rolled over / rotated, like $newlog. I need
# to fix this in a kludgy way.

if [ `wc -l $logfile|awk '{print $1}'` -lt `wc -l $oldlog|awk '{print $1}'` ]
then
    rm $oldlog
    cat $logfile > $oldlog
    TEMPcount=0
    let TEMPcount=$TEMPcount+$(tail -1 $logfile | grep -i $WARNquery | wc -l | awk '{print $1}')
    let TEMPcount=$TEMPcount+$(tail -1 $logfile | grep -i $CRITquery | wc -l | awk '{print $1}')

    if [ $TEMPcount -gt 0 ]
    then
       echo "Log check data initialized... Last line contained error message."
       echo $STATE_WARNING > $oldlog.STATE
       exit $STATE_WARNING
    else
       echo "Log check data initialized..."
       echo $STATE_OK > $oldlog.STATE
       exit $STATE_OK
    fi
fi

# The oldlog file exists, so compare it to the original log now

# The temporary file that the script should use while
# processing the log file.
if type mktemp >/dev/null 2>&1; then
    tempdiff=`mktemp /tmp/check_log.XXXXXXXXXX`
else
    tempdate=`/bin/date '+%H%M%S'`
    tempdiff="/tmp/check_log.${tempdate}"
    touch $tempdiff
fi

diff $logfile $oldlog > $tempdiff

if [ `wc -l $tempdiff | awk '{print $1}'` -eq 0 ]
then
     rm $tempdiff
     touch $oldlog.STATE
     exitstatus=`cat $oldlog.STATE`
     echo "LOG FILE - No status change detected. Status = $exitstatus"
     exit $exitstatus
fi

# Count the number of matching log entries we have
CRITcount=`grep -c "$CRITquery" $tempdiff`
WARNcount=`grep -c "$WARNquery" $tempdiff`

# Get the last matching entry in the diff file
CRITlastentry=`grep "$CRITquery" $tempdiff | tail -1`
WARNlastentry=`grep "$WARNquery" $tempdiff | tail -1`

rm $tempdiff
cat $logfile > $oldlog

if [ "$CRITcount" -gt 0 ]; then
    	echo "($CRITcount) $CRITlastentry"
    	echo $STATE_CRITICAL > $oldlog.STATE
	exit $STATE_CRITICAL
fi

if [ "$WARNcount" -gt 0 ]; then
    	echo "($WARNcount) $WARNlastentry"
    	echo $STATE_WARNING > $oldlog.STATE
	exit $STATE_WARNING
fi

echo "Log check ok - 0 pattern matches found"
echo $STATE_OK > $oldlog.STATE
exit $STATE_OK


echo "Starting clean"
rm /tmp/foobar /usr/local/nagios/var/foobar*
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "Starting normally"
echo "baka"
echo "normal" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "baka"
echo "normal" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "warning"
echo "bla" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "critical"
echo "neko" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "warning"
echo "bla" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "Log rotation with crit"
rm /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "critical"
echo "neko" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "Log rotation with warn"
rm /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "warning"
echo "bla" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "Normal log rotation"
rm /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

kilala.nl tags: nagios, unix, programming,


How do Nagios clients communicate?

2006-06-01 00:00:00

When I first started using Nagios, I got a little confused when it came to monitoring systems other than the one running Nagios. To shed some light on the subject for the beginning Nagios user, here's a discussion of the various methods of talking to Nagios clients.

First off, let me make it absolutely clear that, in order to monitor systems other than the one running Nagios, you are indeed going to have to communicate with them in some fashion. Unfortunately very few things in the Sysadmin trade are magical, and Nagios is not one of them.

So let's start by looking at the -wrong- way of doing things. When I first started with Nagios (actually I made this mistake on my second day with the software) I wrote something like this:

define service{
   host_name remote-host
   service_description D_ROOT
   check_command check_disk!85!95!/
}

The problem with this setup is that I was using a -local- check and said it belonged to remote-host. Now this may look alright on the status screen ("Hey! It's green!"), but naturally you're not monitoring the right thing ^_^

So how -do- you monitor remote resources? Here's a table comparing various methods. After that I'll give examples on how you can correct the mistake I made above with each method.

PLEASE NOTE: the following discussion will not cover the monitoring of systems other than the various UNIX flavours. Later on I'll write a similar article covering Windows and stuff like Cisco.

A quick comparison

	SSH	NRPE	SNMP	SNMP traps	NSCA
Connection initiation	Server -> Client	Server -> Client	Server -> Client	Client -> Server	Client -> Server
Security	Encryption, TCP wrappers, key pairs	Encryption, access list, TCP wrappers	Access list (v2), password (v3)	Access list (v2), password (v3), TCP wrappers	Encryption, access list, TCP wrappers
Configuration	On server	On client	On client	On client and on server	On client
Difficulty	Easy	Moderate	Hard	Hard	Moderate

SSH

Just about everyone should already have SSH running on their servers (except for those few who are still running telnet or, horror of horrors!, rsh). So it's safe to assume that you can immediately start using this communications method to check your clients. You will need to:

create a nagios user on the client,
make sure that the nagios user from the server can log in to this account without a password (through keys),
install all check scripts on the client, and
for each command that you want to run through SSH, create a check command definition in checkcommands.cfg.

You can now set up your services.cfg in such a way that each remote service is checked like so:

define service{
   host_name remote-host
   service_description D_ROOT
   check_command check_disk_by_ssh!85!95!/
}

Your check command definition would look something like this:

define command {
   command_name check_disk_by_ssh
   command_line /usr/local/nagios/libexec/check_by_ssh -H $HOSTADDRESS$ -C "/usr/local/nagios/libexec/check_disk -w $ARG1$ -c $ARG2$ $ARG3$"
}
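
In case the exclamation marks look like magic: Nagios splits the check_command value on "!" and hands the pieces to the command definition as $ARG1$, $ARG2$ and so on. The little sketch below mimics that splitting in plain shell so you can see what ends up where. The function name expand_check_command is made up for this example and has nothing to do with Nagios itself.

```shell
#!/bin/sh
# Hypothetical illustration of how a check_command value is split into macros.
# Usage: expand_check_command 'command_name!arg1!arg2...'
expand_check_command() {
    old_ifs=$IFS
    IFS='!'
    set -f          # no filename expansion while splitting
    set -- $1       # split the value on "!" into positional parameters
    set +f
    IFS=$old_ifs

    echo "command_name=$1"
    shift

    i=1
    for arg in "$@"; do
        echo "ARG$i=$arg"
        i=$((i + 1))
    done
}

expand_check_command 'check_disk_by_ssh!85!95!/'
# command_name=check_disk_by_ssh
# ARG1=85
# ARG2=95
# ARG3=/
```

So for this service Nagios ends up running check_by_ssh against the address of remote-host, with "/usr/local/nagios/libexec/check_disk -w 85 -c 95 /" as the remote command. Running that same command line by hand as the nagios user is a quick way to test your key setup.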

Working this way will allow you to do most of your configuring centrally (on the Nagios server), thus saving you a lot of work on each client system. All you have to do over there is make sure that there's a working user account and that all the scripts are in place. Quite convenient... The only drawback is that you're creating a relatively open account which has full access to the system (sometimes even with sudo access).

NRPE

As a replacement for the SSH access method, Ethan Galstad (the author of Nagios) also wrote the NRPE daemon. Using NRPE requires that you:

create a nagios user on the client,
configure inetd, xinetd, tcp wrappers, services and stuff like that,
download and install NRPE on each client,
install all check scripts on the system, and
configure the NRPE daemon to have a check command for each check you would like to perform.

You can now set up your services.cfg in such a way that each remote service is checked like so:

define service{
   host_name remote-host
   service_description D_ROOT
   check_command check_nrpe!check_root
}

And in /usr/local/nagios/etc/nrpe.cfg on the client you would need to include:

command[check_root]=/usr/local/nagios/libexec/check_disk 85 95 /
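
The name between the square brackets is all the Nagios server ever sends; the client looks it up in its own nrpe.cfg and runs whatever is configured there. Here is a small sketch of that lookup in shell, which can also come in handy when you want to see what a client would run for a given name. The function name nrpe_command and the check_var entry are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the command line configured for an NRPE command.
# Usage: nrpe_command CFGFILE NAME
nrpe_command() {
    cfgfile=$1; name=$2
    # Strip the "command[NAME]=" prefix and print what is left of the line.
    sed -n "s/^command\[$name\]=//p" "$cfgfile"
}

# Try it out on a throwaway copy of the configuration.
cfg="${TMPDIR:-/tmp}/nrpe_sketch.$$"
cat > "$cfg" <<'EOF'
command[check_root]=/usr/local/nagios/libexec/check_disk 85 95 /
command[check_var]=/usr/local/nagios/libexec/check_disk 85 95 /var
EOF

nrpe_command "$cfg" check_root
rm -f "$cfg"
```

From the Nagios server you can test the whole chain by hand with something like "check_nrpe -H remote-host -c check_root", assuming your check_nrpe command definition passes $ARG1$ on to the -c option.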

The good thing is that you won't have a semi-open account lying about. The bad things are that, if you want to change the configuration of your client, you're going to have to log in to it. And you're going to have yet another piece of software to keep up to date.

SNMP

Whoo boy! This is something I'm working on right now at $CLIENT and let me tell you: it's hard! At least much harder than I was expecting.

SNMP is a network management protocol used by the more advanced system administrators. Using SNMP you can access just about -any- piece of equipment in your server room to read statistics, alarms and status messages. SNMP is universal and extensible, but it is also quite complicated. Not for the faint of heart.

To make proper use of monitoring through SNMP you'll need to:

install an SNMP daemon / agent on your system,
put all the check scripts you want to run on the system as well,
register a private Enterprise ID with IANA to hold your custom objects,
configure the SNMP daemon in such a way that the results of the check scripts are placed in custom objects, and
configure basic security on the SNMP daemon.

The reason why the third point tells you to register a private EID is that the SNMP tree has a very rigid structure. Technically speaking you -could- just plonk down your results at a random place in the tree, but it's likely that this will screw up something else at a later time. IANA allows each company to have only one private EID, so first check whether your company already has one on the IANA list.

On the Nagios server the service definition would then look like this:

define service{

   host_name remote-host

   service_description D_ROOT

   check_command retrieve_custom_snmp!.1.3.6.1.4.1.6886.4.1.4

}

And in this case your snmpd.conf would contain a line like this:

exec .1.3.6.1.4.1.6886.4.1.4 check_d_root /usr/local/nagios/libexec/check_disk -w 85 -c 95 /

Up to now things are actually not that different from using NRPE, are they? Well, that's because we haven't even started using all the -real- features of SNMP. Point is that using SNMP you can dig very deeply into your system to retrieve all kinds of useful information. And -that's- where things get complicated because you're going to have to dig up all the object IDs (OIDs) that you're going to need. And in some cases you're going to have to install vendor specific sub-agents that know how to speak to your specific hardware.

One of the best features of SNMP though are the so-called traps. Using traps the SNMP daemon will actively undertake action when something goes wrong in your system. So if for instance your hard disk starts failing, it is possible to have the daemon send out an alert to your Nagios server! Awesome! But naturally this will require a boatload of additional configuration :(

So... SNMP is an awesomely powerful tool, but you're going to have to pay through the nose (in effort) to get it 100% perfect.

SNMP traps

To make proper use of SNMP traps you'll need to:

install an SNMP daemon / agent on your system,
define all the SNMP traps you would like to send on the client,
install an SNMP trap daemon on your server,
configure the SNMP trap daemon to tell it what to do with the incoming traps,
install something that makes a translation between SNMP traps and Nagios service definitions.

There are -many- ways to get the SNMP traps translated for Nagios' purposes, 'cause there are many roads that lead to Rome. Unfortunately none of them are very easy to use. A few options:

SNMPtt, an open source tool.
EventDB, a database driven piece of software by Netways GmbH.
My crappy-ass solution, that just cross-references a list of OIDs to a list of Nagios actions.
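
To give an idea of what such a cross-reference can look like, here is a minimal sketch. It is not the actual script mentioned above: the OIDs, service names and states are invented for illustration. It turns a trap OID into a passive check result in Nagios' external command format (PROCESS_SERVICE_CHECK_RESULT).

```shell
#!/bin/sh
# Sketch only: the OIDs, service names and states below are invented examples.

# trap_to_result TIMESTAMP HOST TRAP_OID
# Prints one external command line for Nagios.
# Prints nothing and returns 1 for OIDs that are not in the list.
trap_to_result() {
    ttr_time="$1"; ttr_host="$2"; ttr_oid="$3"

    # Cross-reference the OID against a list of "service;state;message".
    case "$ttr_oid" in
        .1.3.6.1.4.1.99999.1.1) ttr_action="DISK_HEALTH;2;Disk failure reported" ;;
        .1.3.6.1.4.1.99999.1.2) ttr_action="FAN_STATUS;1;Fan speed degraded" ;;
        .1.3.6.1.4.1.99999.1.3) ttr_action="DISK_HEALTH;0;Disk recovered" ;;
        *) return 1 ;;
    esac

    echo "[$ttr_time] PROCESS_SERVICE_CHECK_RESULT;$ttr_host;$ttr_action"
}
```

Your trap handler would call something like trap_to_result "`date +%s`" remote-host "$oid" and append the result to the Nagios command file (by default /usr/local/nagios/var/rw/nagios.cmd, but check your own nagios.cfg). The service names must match passive services that you've defined on the Nagios server.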

NSCA

And finally there's NSCA. This daemon is usually used by distributed Nagios servers to send their results to the central Nagios server, which gathers them as so-called "passive checks". It is however entirely possible to install NSCA on each of your Nagios clients, which will then get called to send in the results of local checks. In this case you'll need to:

make sure all your check scripts are on the client,
download and install the NSCA binaries on your client,
make a script which can be run from cron to run each script and then to forward the results through NSCA, and
finally, configure cron to run the various scripts at set times.

On your Nagios server things would look like this:

define service{

   host_name remote-host

   service_description D_ROOT

   check_command check_disk!85!95!/

   passive_checks_enabled 1

   active_checks_enabled 0

}

For the configuration on the client side I recommend that you read up on NSCA. It's a little bit too much to show over here.
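
That said, the cron wrapper from the list above can be sketched quite briefly. This is a minimal sketch and not a finished script: the function name nsca_line is made up, and it only formats the result. It runs one check and prints the tab-separated line (host, service, return code, plugin output) that send_nsca expects on its standard input.

```shell
#!/bin/sh
# Sketch only: nsca_line is a hypothetical helper name.

# nsca_line HOST SERVICE COMMAND [ARGS...]
# Runs a check and prints its result in the tab-separated format that
# send_nsca reads on stdin: host, service, return code, plugin output.
nsca_line() {
    nl_host="$1"; nl_service="$2"
    shift 2

    # Run the check, keeping both its output and its exit status.
    if nl_output=$("$@" 2>&1); then
        nl_rc=0
    else
        nl_rc=$?
    fi

    # Only forward the first line of plugin output.
    nl_first=$(printf '%s\n' "$nl_output" | head -n 1)

    printf '%s\t%s\t%s\t%s\n' "$nl_host" "$nl_service" "$nl_rc" "$nl_first"
}
```

From cron you would then run something along the lines of: nsca_line "`hostname`" D_ROOT /usr/local/nagios/libexec/check_disk -w 85 -c 95 / | /usr/local/nagios/bin/send_nsca -H nagios-server -c /usr/local/nagios/etc/send_nsca.cfg. Here nagios-server is a placeholder for your own server, and the host and service names must match the definitions on that server.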

The upside to this is that you won't have to run any daemon on your client to accept incoming connections. This will allow you to lock down your system quite tightly.

Naturally you are absolutely free to combine two or more of the methods described above. You could poll through NRPE and receive SNMP traps in one environment. This will have both ups and downs, but it's up to your own discretion. Use the tools that feel natural to you, or use those that are already standard in your environment.

I realise I've rushed through things a little bit, but I was in a slight hurry :) I will go over this article a second time RSN, to apply some polish.

kilala.nl tags: tutorial, sysadmin, nagios, unix,

View or add comments (curr. 1)

Nagios Conference 2006, Nurnberg

2006-06-01 00:00:00

September 21-22 of 2006 saw the first annual Nagios Conference. Organised by the good folk of Netways, the conference was attended by around 130 people (mostly Germans, with some foreigners thrown in for fun).

Originally I posted some comments about the conference on my blog, but I thought I'd move them over into the Sysadmin section, to keep Google from thinking the Archives had content about Nagios :D

Day 0

Wow... Today was a long day :)

Left Utrecht around 09:30 and finally arrived at the hotel at 17:30. Eight hours, just as I predicted! 6 hours driving (0.5 of which due to delays) and 2 hours spent resting. Speaking of: I -love- the German Autobahn! It's littered with comfortable places to take a break and there's also an abundance of what they call a Rasthof: parking space, combined with restaurants, a gas station, maybe a hotel, a few shops and very cool sanitary facilities (by the Sanifair company). I'll talk about those some more another time :)

What else is there to tell? I showered, I unpacked, we had dinner with the whole group and I met some interesting people. *waves* Hi Stephan! Hi Jorg! *waves*

Now... I feel really tired (I also notice that it's getting harder for me to string together coherent thought, despite the recent cappuccino), so I'd better get to bed... I'm actually quite woozy in the head! :)

Tomorrow the conference'll start, so I'd better be at my best!

Day 1

So far, it's been an interesting day.

In the morning, Ethan Galstad (main Nagios developer) covered his plans for the future. Version 3.x (improved notification, expanded plugin output, custom variables and a greatly improved method for host checking) will Alpha in October and Stable somewhere this winter, while 4.x (a new PHP-based GUI, among other things) is on the long-term roadmap.

After that Michael Kienle and Markus Kosters told us a few things about the practical side of implementing Nagios in your organisation. I was already familiar with most of what they told us, but it must've been an eye opener for a lot of people! The notion that Nagios needs much more than just "download and install" is apparently foreign to a lot of people, which comes back to bite them in the ass later.

Lunch was terrific. I don't know how they do it, but the Nurnberg Holiday Inn are perfectly capable of making a buffet-style meal that -is- quite edible and actually varied and tasty! Kudos to them!

While on the subject of the hotel... The hotel, the rooms, the facilities: they're all wonderful. Nice ambience, a swanky in-house cafe and comfortable furniture. I like it! I just have to wonder about one thing: why the heck are there at least a dozen brothels and sex clubs surrounding the hotel?! o_O

The afternoon saw two sessions regarding data collection and representation: RRDTool and NagiosGrapher. RRD itself couldn't interest me for long, but NagGraph (which relies on RRD) on the other hand could. NagGraph allows you to add somewhat complicated graphs to Nagios (inside the Nagios GUI), which gives you something that is a little similar to Cacti.

I had to skip the session on monitoring storage systems, because I -really- needed some fresh air. So I walked around Nurnberg's Altstadt for a while. Looks nice, I have to say :) Of course I was only able to see a small part of it, but hey... At least I got out for a while. [EDIT: Anand from ASAM told me afterwards that I didn't miss much. Apparently it was kind of a marketing spiel]

So... The plan for the rest of the day:

Visit Ton Voon's lecture on Nagios plugins (past, present, future)
Rest up a little bit in my room and maybe have a shower.
Get a clean shave.
Go out for dinner with all the Conference people. Apparently, Netways have booked a local restaurant for dinner and cocktails :9
Sleep!

See you guys tomorrow!

Day 2

*phew* That was great! <3

I'm sitting here in my hotel room with some apple soda and some Pringles, feeling nice and drowsy thanks to the hotel's sauna. It felt real good, just spending an hour and a half relaxing.

Anywho... The conference today... Pretty darn interesting and it gave me a load of things to think about! In his morning session Ethan covered some things that you usually don't think of when configuring Nagios, but that can save you loads of trouble! A few of the things he mentioned I will actually try to work into the design of $CLIENT's new Nagios infra, 'cause else they may run into some problems later.

The rest of the morning for me was filled with two sessions on varying ways to get info into Nagios. On the one hand there was SNMPtt (trap translator), which to me seemed like a really backward solution to a problem that wasn't too difficult to start with. And on the other hand, there was EventDB whose goal it is to have only one check command to access information provided by a great variety of sources. The only down-side being that you'll need translation adapters for each of these sources (which means that you basically are filling one hole by digging another).

Now I don't mean to be too negative about these two sessions. I'm sure that a lot of people are actually very happy to see these tools and that they will have some great uses for them.

Lunch... What can I say? It was great, just like yesterday. The hotel took great care of us, thanks to Netways.

After lunch, Ton Voon kicked off with a brief session on open source etiquette. Basically telling the attendees both the up and down sides their companies could experience by contributing to the Nagios community. As ever, Ton was charismatic and displayed a good sense of humor ^_^

Two Netways employees gave talks on:
1. The IT Portal they implemented at the Bundesverwaltungsamt. This is actually the same portal that Markus Kosters told us about yesterday, but Julian actually took time to show us the technology behind the portal.
2. Integrating Nagios with Asterisk (among other things), to allow for some nice telephone trickery. Mind you, Asterisk isn't really my thing, but I can imagine some people enjoying the idea of being called by the Nagios server to literally -tell- them (through a .WAV voice) that their server's down.

For me, the con was closed by a guy giving a marketing spiel about the services his company provides, but I was actually able to glean something useful from the talk.

Unfortunately there was no official closing ceremony, so the con ended quite abruptly. Which means that just about everyone stormed out of the building in the span of thirty minutes. I did however get to say goodbye to a few nice acquaintances I've made during these three days. And my hat's off to Anand who decided to drive home during the night (he lives in The Hague)... He should be arriving home, somewhere around 0200 ;_; Wow!

While waving off the last person to leave (Stephan), I met up with Ethan and his SO, Mary. We went to dinner together and I must say I enjoyed their company! Friendly folks and very down to earth. I believe that sometimes Ethan is just overwhelmed by all the attention people are willing to give him... Who could blame him?

Aside from Ethan and Mary, I'm the last conference attendee at the hotel. In the morning, I'll have a nice breakfast, grab some rolls at the bakery and head off home. I reckon I should get there around five-ish.

kilala.nl tags: conference, nagios,

View or add comments (curr. 0)

An introduction to Nagios monitoring

2006-06-01 00:00:00

Working at $CLIENT in 2005 was the first time that I built a complete monitoring infrastructure from the ground up. In order to keep expenses low we went for a free, yet versatile monitoring tool: Nagios.

Nagios, which is available over here, is a free and Open Source monitoring solution based on what was once known as NetSaint.

Nagios allows you to monitor a number of different platforms through the use of plugins which can run on both the server as well as on the monitoring clients. So far I've heard of clients being available for various UNIXen and BSDs (including Mac OS X) and Windows. Windows monitoring requires either the unclear NSClient software, or the NRPE_nt daemon which is basically a port of the UNIX Nagios client.

Setting up the basic server requires some fidgeting with compilers, dependencies and so on. Still, a reasonably experienced sysadmin should be able to have the basic software up and running (and configured) in a day. Adding all the monitors for all the clients, however, is another matter entirely.

Although there are a number of GUI's available which should make configuring Nagios a bit easier, I chose to do it all by hand. Just because that's what I'm used to and because I have little faith in GUI-generated config files. You will need to define each monitor separately for each host, so let's take a look at a quick example.

Say that you have twenty servers that need to be monitored by ten monitors each. Each definition in the configuration file takes up approximately sixteen lines, so in the end your config file will be at least 3200 lines long :)
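
One way to take the sting out of those numbers is to generate the repetitive part. Below is a minimal sketch, assuming invented host names and a made-up "description:check_command" notation. It only prints the three directives used in the short service examples; a real definition needs quite a few more, hence the sixteen lines.

```shell
#!/bin/sh
# Sketch only: host names, service names and commands are invented.

# gen_services HOSTS CHECKS
#   HOSTS  - space-separated list of host names
#   CHECKS - space-separated list of "description:check_command" pairs
gen_services() {
    for gs_host in $1; do
        for gs_check in $2; do
            gs_desc=${gs_check%%:*}     # text before the first colon
            gs_cmd=${gs_check#*:}       # text after the first colon
            printf 'define service{\n'
            printf '   host_name %s\n' "$gs_host"
            printf '   service_description %s\n' "$gs_desc"
            printf '   check_command %s\n' "$gs_cmd"
            printf '}\n\n'
        done
    done
}
```

Calling gen_services "web01 web02" 'D_ROOT:check_nrpe!check_root' prints one definition per host and check combination, which you can redirect into a separate .cfg file. Twenty hosts with ten checks each still gives you two hundred definitions, but at least you won't have typed them.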

But please don't let that deter you! Nagios is a powerful tool and can help you keep an eye on _a_lot_ of different things in your environment. I for one have become quite smitten with it.

In the menu you will find a configuration manual which I wrote for $CLIENT, as well as a bunch of plugins which were either modified or created for their environment. Quite possibly there's one or two in there that will be interesting for you.

kilala.nl tags: tutorial, sysadmin, nagios,

View or add comments (curr. 0)

Nagios script: check_log2

2006-06-01 00:00:00

This script was written at the time I was hired by UPC / Liberty Global.

Improved log checker for Solaris, with state retention.

I found that the version of check_log included in the default monitor package doesn't work perfectly on Solaris: it needs a bit of tweaking... Which is what I've done for the script.

Also, I've added state retention. It's a bit of a hack, but hey! I needed a quick solution.

The original script sends a Critical when it detects the string you've queried the log file for, but it clears that same Critical immediately if the same message is not repeated once the monitor runs again. Meaning that, if there are no updates to your log file, the Critical will only be around until the next time the monitor runs.

Not very handy if the Critical occurs during the night.

This new version of the script creates a file called $oldlog.STATE in /usr/local/nagios/var (which should be 755, nagios:nagios), which contains the exit status for the last detected _changed_ status... If there are no changes detected in your log file, this old exit state is repeated.
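
Stripped of everything else, the state retention trick looks roughly like this. It's a minimal sketch and not the plugin itself: the function names are made up and the state file is passed in as an argument instead of being derived from $oldlog.

```shell
#!/bin/sh
# Sketch only: remember_state, recall_state and report are hypothetical names.

# remember_state FILE STATE - store the exit status of the last changed result.
remember_state() {
    echo "$2" > "$1"
}

# recall_state FILE - print the stored status, or 0 (OK) if nothing is stored yet.
recall_state() {
    if [ -s "$1" ]; then
        cat "$1"
    else
        echo 0
    fi
}

# report LOG_CHANGED NEW_STATE FILE
# If the log changed, store and print the new state; otherwise repeat the old one.
report() {
    if [ "$1" = "yes" ]; then
        remember_state "$3" "$2"
        echo "$2"
    else
        recall_state "$3"
    fi
}
```

So a Critical that was found at night keeps being reported on every run for as long as the log file stays untouched, instead of being cleared by the next check.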

The script has been tested on Solaris 8, Mac OS X 10.4 and Redhat ES3.

UPDATE 19/06/2006:

Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!

#!/bin/bash
#
# Log file pattern detector plugin for Nagios
# Written by Ethan Galstad (nagios@nagios.org)
# Last Modified: 07-31-1999
# Updated by Thomas Sluyter (nagiosATkilalaDOTnl)
# Last Modified: 19-06-2006
#
# Usage: ./check_log2 -F log_file -O old_log_file -Q pattern
#
# Description:
#
# This plugin will scan a log file (specified by the log_file option)
# for a specific pattern (specified by the pattern option).  Successive
# calls to the plugin script will only report *new* pattern matches in the
# log file, since a copy of the log file from the previous run is saved
# to old_log_file.
#
# Output:
#
# On the first run of the plugin, it will return an OK state with a message
# of "Log check data initialized".  On successive runs, it will return an OK
# state if *no* pattern matches have been found in the *difference* between the
# log file and the older copy of the log file.  If the plugin detects any 
# pattern matches in the log diff, it will return a CRITICAL state and print
# out a message in the following format: "(x) last_match", where "x" is the
# total number of pattern matches found in the file and "last_match" is the
# last entry in the log file which matches the pattern.
#
# Notes:
#
# If you use this plugin make sure to keep the following in mind:
#
#    1.  The "max_attempts" value for the service should be 1, as this
#        will prevent Nagios from retrying the service check (the
#        next time the check is run it will not produce the same results).
#
#    2.  The "notify_recovery" value for the service should be 0, so that
#        Nagios does not notify you of "recoveries" for the check.  Since
#        pattern matches in the log file will only be reported once and not
#        the next time, there will always be "recoveries" for the service, even
#        though recoveries really don't apply to this type of check.
#
#    3.  You *must* supply a different old_log_file for each service that
#        you define to use this plugin script - even if the different services
#        check the same log_file for pattern matches.  This is necessary
#        because of the way the script operates.
#
#    4.  Changes to the script were made by Thomas Sluyter (nagios@kilala.nl).
#	 The first set of changes will allow the script to run properly on Solaris, which
#	 it did not do by default. The second set of changes will allow the following:
#	 * State retention. If a NOK was generated at point A in time and it is not repeated
# 	   at A+1, then an OK is sent to Nagios. Not something that you would like to happen.
#	   I've added the $oldlog.STATE trigger file which retains the last exitstatus. Should
# 	   there be no new lines added to the log, check_log will simply repeat the last state
#	   instead of giving an OK.
#
# Examples:
#
# Check for login failures in the syslog...
#
#   check_log -F /var/log/messages -O /usr/local/nagios/var/check_log.badlogins.old -Q "LOGIN FAILURE"
#
# Check for port scan alerts generated by Psionic's PortSentry software...
#
#   check_log -F /var/log/messages -O /usr/local/nagios/var/check_log.portscan.old -Q "attackalert"
#

# Paths to commands used in this script.  These
# may have to be modified to match your system setup.

PATH="/usr/bin:/usr/sbin:/bin:/sbin"

PROGNAME=`basename $0`
PROGPATH=`echo $0 | sed -e 's,[\\/][^\\/][^\\/]*$,,'`

#. $PROGPATH/utils.sh
. /usr/local/nagios/libexec/utils.sh

print_usage() {
    echo "Usage: $PROGNAME -F logfile -O oldlog -Q query"
    echo "Usage: $PROGNAME --help"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "Log file pattern detector plugin for Nagios"
    echo ""
    support
}

# Make sure the correct number of command line
# arguments have been supplied

if [ $# -lt 6 ]; then
    print_usage
    exit $STATE_UNKNOWN
fi

# Grab the command line arguments

exitstatus=$STATE_WARNING #default
while test -n "$1"; do
    case "$1" in
        --help)
            print_help
            exit $STATE_OK
            ;;
        -h)
            print_help
            exit $STATE_OK
            ;;
        -F)
            logfile=$2
            shift
            ;;
        -O)
            oldlog=$2
            shift
            ;;
        -Q)
            query=$2
            shift
            ;;
        *)
            echo "Unknown argument: $1"
            print_usage
            exit $STATE_UNKNOWN
            ;;
    esac
    shift
done

# If the source log file doesn't exist, exit

if [ ! -e "$logfile" ]; then
    echo "Log check error: Log file $logfile does not exist!"
    # Store the state first: nothing after the exit would ever be run.
    echo $STATE_UNKNOWN > "$oldlog.STATE"
    exit $STATE_UNKNOWN
fi

# If the oldlog file doesn't exist, this must be the first time
# we're running this test, so copy the original log file over to
# the old diff file and exit

if [ ! -e $oldlog ]; then
    cat $logfile > $oldlog
    if [ `tail -1 $logfile | grep -i "$query" | wc -l` -gt 0 ]
    then
        echo "Log check data initialized... Last line contained error message."
        echo $STATE_CRITICAL > $oldlog.STATE
	exit $STATE_CRITICAL
    else
        echo "Log check data initialized..."
        echo $STATE_OK > $oldlog.STATE
        exit $STATE_OK
    fi
fi

# A bug which was caught very late:
# If newlog is shorter than oldlog, the diff used below will return
# false positives for the query because they will be in $oldlog. Why?
# Because $oldlog is not rolled over / rotated, like $logfile. I need 
# to fix this in a kludgy way.

if [ `wc -l $logfile|awk '{print $1}'` -lt `wc -l $oldlog|awk '{print $1}'` ]
then
    rm $oldlog
    cat $logfile > $oldlog
    if [ `tail -1 $logfile | grep -i "$query" | wc -l` -gt 0 ]
    then
        echo "Log check data re-initialized... Last line contained error message."
        echo $STATE_CRITICAL > $oldlog.STATE
	exit $STATE_CRITICAL
    else
        echo "Log check data re-initialized..."
        echo $STATE_OK > $oldlog.STATE
        exit $STATE_OK
    fi
fi

# Everything seems fine, so compare it to the original log now

# The temporary file that the script should use while
# processing the log file.
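# Look mktemp up in the PATH; "-x mktemp" would only test the current directory.
if command -v mktemp >/dev/null 2>&1; then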
    tempdiff=`mktemp /tmp/check_log.XXXXXXXXXX`
else
    tempdate=`/bin/date '+%H%M%S'`
    tempdiff="/tmp/check_log.${tempdate}"
    touch $tempdiff
fi

diff $logfile $oldlog > $tempdiff

if [ `wc -l $tempdiff|awk '{print $1}'` -eq 0 ]
then
     rm $tempdiff
     touch $oldlog.STATE
     exitstatus=`cat $oldlog.STATE`
     echo "LOG FILE - No status change detected. Status = $exitstatus"
     exit $exitstatus
fi

# Count the number of matching log entries we have
count=`grep -c "$query" $tempdiff`

# Get the last matching entry in the diff file
lastentry=`grep "$query" $tempdiff | tail -1`

rm -f $tempdiff
cat $logfile > $oldlog

if [ "$count" = "0" ]; then # no matches, exit with no error
    echo "Log check ok - 0 pattern matches found"
    exitstatus=$STATE_OK
    # Remember the OK as well, so it is repeated while the log stays unchanged
    echo $STATE_OK > $oldlog.STATE
else # Print total match count and the last entry we found
#    echo "($count) $lastentry"
    echo "Log check NOK - $lastentry"
    exitstatus=$STATE_CRITICAL
    echo $STATE_CRITICAL > $oldlog.STATE
fi

exit $exitstatus


The following test script walks check_log2 through a number of scenarios:

echo "Starting clean"
rm -f /tmp/foobar /usr/local/nagios/var/foobar*
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""

echo "Starting normally"
echo "normal"
echo "normal" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""
echo "normal"
echo "normal" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""
echo "critical"
echo "neko" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""
echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""

echo "Log rotation with crit"
rm /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""
echo "critical"
echo "neko" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""
echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""

echo "Normal log rotation"
rm /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""
echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""
echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""

kilala.nl tags: nagios, unix, programming,

View or add comments (curr. 2)

Nagios script: check_nsca

2006-06-01 00:00:00

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

At $CLIENT we've often run into problems with the NSCA daemon, where the daemon would not crash per se, but where it would also not process incoming service checks. The nsca process was still running, but it simply wasn't transferring the incoming results to the Nagios command file.

I was amazed to find that nobody else had written a script to check for this! So I quickly wrote one.

#!/usr/bin/bash
#
# NSCA Nagios service results monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide, the Netherlands
# Last Modified: 16-08-2006
# 
# Usage: ./check_nsca
#
# Description:
# Aside from checking whether the NSCA process is still running, this script
# also attempts to insert a message into the Nagios queue. After sending a 
# message to the NSCA daemon, it will verify that the message is received by
# Nagios, by checking the nagios.log file. 
#
# Limitations:
#   This script should work properly on all implementations of Linux, Solaris
# and Mac OS X.
#
# Output:
# If the NSCA daemon, or something along the message path, is borked, a 
# CRIT message will be issued. 
#

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_nsca"
PATH="/usr/bin:/usr/sbin:/bin:/sbin"

NAGIOSHOME="/usr/local/nagios"
LIBEXEC="$NAGIOSHOME/libexec"
NAGVAR="$NAGIOSHOME/var"
NAGBIN="$NAGIOSHOME/bin"
NAGETC="$NAGIOSHOME/etc"

. $LIBEXEC/utils.sh


### REQUISITE NAGIOS COMMAND LINE STUFF ###

print_usage() {
	echo "Usage: $PROGNAME"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "NSCA Nagios service results monitor plugin for Nagios"
	echo ""
	echo "This plugin was not developed by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
	case "$1" in
	  --help) print_help; exit $STATE_OK;;
	  -h) print_help; exit $STATE_OK;;
	  *) print_usage; exit $STATE_UNKNOWN;;
	esac
done


### PLATFORM INDEPENDENCE ###

case `uname` in
	Linux) PSLIST="ps -ef";;
	SunOS) PSLIST="ps -ef";;
	Darwin) PSLIST="ps -ajx";;
	*) PSLIST="ps -ef";;	# fall back to something sensible elsewhere
esac


### CHECKING FOR THE NSCA PROCESS ###

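# Note: an exit inside ( ... ) only leaves the subshell, so use a plain if.
# This script's own name contains "nsca" too, so filter it out as well.
if [ `$PSLIST | grep nsca | grep -v grep | grep -v check_nsca | wc -l` -lt 1 ]
then
	echo "NSCA process not running."
	exit $STATE_CRITICAL
fi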


### INSERTING A TEST MESSAGE ###

DATE=`date +%Y%m%d%H%M`
STRING="`hostname`\tFOOBAR\t0\t$DATE This is a test of the emergency broadcast system.\n"

echo -e "$STRING" | $NAGBIN/send_nsca -H localhost -c $NAGETC/send_nsca.cfg >/dev/null 2>&1


### CHECKING THE NAGIOS LOG FILE ###

sleep 10

if [ `tail -1000 $NAGVAR/nagios.log | grep "emergency broadcast system" | grep $DATE | wc -l` -lt 1 ] 
then
	# Giving it a second try
	sleep 10
	if [ `tail -5000 $NAGVAR/nagios.log | grep "emergency broadcast system" | grep $DATE | wc -l` -lt 1 ]	
	then
		echo "NSCA daemon not processing check results."
		exit $STATE_CRITICAL
	fi
fi


### EXITING NORMALLY ###

echo "OK - NSCA working like it should."
exit $STATE_OK

kilala.nl tags: nagios, unix, programming,

View or add comments (curr. 2)

Nagios script: check_named

2006-06-01 00:00:00

This script was written at the time I was hired by UPC / Liberty Global.

Basic monitor to check whether BIND is up and running. It checks for a number of processes and tries to perform a basic lookup using the localhost.

This script was quickly hacked together for my current customer, as a Q&D solution for their monitoring needs. It's no beauty, but it works. Originally written in ksh (the version below runs under bash) and tested with:

Solaris 8
NRPE 1.9
BIND 8
Should work with other versions as well.

A Critical is sent if:

A) one or more of the required processes is not running, or

B) the script is unable to perform a basic lookup using the localhost.

UPDATE 19/06/2006:

Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!

#!/usr/bin/bash
#
# DNS / Named process monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 19-06-2006
# 
# Usage: ./check_named
#
# Description:
# This plugin determines whether the named DNS server
# is running properly. It will check the following:
# * Are all required processes running?
# * Is it possible to make DNS requests?
#
# Limitations:
# Currently this plugin will only function correctly on Solaris systems.
#
# Output:
# The script returns a CRIT when the abovementioned criteria are
# not matched.
#

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
        echo "WARNING:"
        echo "This script was originally written for use on Solaris."
        echo "You may run into some problems running it on this host."
        echo ""
        echo "Please verify that the script works before using it in a"
        echo "live environment. You can easily disable this message after"
        echo "testing the script."
        echo ""
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_named"
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

print_usage() {
	echo "Usage: $PROGNAME"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "Named DNS monitor plugin for Nagios"
	echo ""
	echo "This plugin was not developed by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
	case "$1" in
	  --help) print_help; exit $STATE_OK;;
	  -h) print_help; exit $STATE_OK;;
	  *) print_usage; exit $STATE_UNKNOWN;;
	esac
done

check_processes()
{
	PROCESS="0"
	if [ `ps -ef | grep named | grep -v grep | grep -v nagios | wc -l` -lt 1 ]; then 
		echo "NAMED NOK - One or more processes not running"
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_service()
{
	SERVICE=0
	nslookup www.google.com localhost >/dev/null 2>&1
	if [ $? -ne 0 ]; then SERVICE=1;fi

	if [ $SERVICE -eq 1 ]; then 
		echo "NAMED NOK - Unable to perform a DNS lookup through localhost."
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_processes
check_service

echo "NAMED OK - Everything running like it should"
exitstatus=$STATE_OK
exit $exitstatus

kilala.nl tags: nagios, unix, programming,

View or add comments (curr. 0)

Nagios script: check_networking

2006-06-01 00:00:00

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

I couldn't find an easy way to check whether all interfaces of a host are up and running from the -inside-, so I wrote a Nagios plugin to do this.

Naturally you could also try to ping all of the IP addresses of all of these network cards, but this isn't always possible. Lord knows how many routing issues I had to fight through to get our current IP set monitored. I guess using this script is a bit easier :)

The script was tested on Redhat ES3, Mac OSX and Solaris. Its basic requirement is the Korn shell (due to some conversions happening inside the script). On Linux/RH you'll need mii-tool (and sudo) and on Solaris you'll need Perl (for one lousy piece of math :p ).

EDIT:

Oh! Just like my other recent Nagios scripts, check_networking comes with a debugging option. Set $DEBUG at the top of the file to anything larger than zero and the script will dump information at various stages of its execution.
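
A minimal sketch of how such a debugging switch can be wrapped in one helper, so the check does not have to be repeated on every line. The function name debug is hypothetical and does not appear in the script; DEBUG and DEBUGFILE are the same variables the script sets at its top.

```shell
# Sketch only: "debug" is a hypothetical helper, not part of check_networking.
DEBUG="${DEBUG:-0}"
DEBUGFILE="${DEBUGFILE:-/tmp/foobar}"

debug() {
    # Append the message to DEBUGFILE only when DEBUG is larger than zero.
    [ "$DEBUG" -gt 0 ] && echo "$*" >> "$DEBUGFILE"
    # Always succeed, so that a disabled debug call never changes $?.
    return 0
}
```

With this in place a line such as debug "Router: $ROUTER" replaces the longer test-and-echo form.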


#!/usr/bin/ksh
#
# Basic UNIX networking check script.
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide SYS, the Netherlands
# Last Modified: 22-06-2006
#
# Usage: ./check_networking
#
# Description:
#   This plugin determines whether the local host's network interfaces
# are all up and running like they should. It uses the following
# questions to determine this.
# * Does /sbin/mii-tool report any problems? (Linux only)
# * Are the gateways for each subnet pingable?
#
# Limitations:
# * I have no clue whether mii-tool is something specific to Redhat ES3,
#   or whether all Linii have it. 
# * Sudo access to mii-tool is required for the nagios account.
# * Perl is required on Solaris, to do just a tiny bit of math.
# * KSH is required.
# * The script assumes that the first available IP from a subnet is the
#   router.
#
# Output:
#   The script returns a CRIT when one of the criteria mentioned
# above is not matched.
#
# Other notes:
#   I wish I'd learn Perl. I'm sure that doing all of this stuff in Perl
# would have cut down on the size of this script tremendously. Ah well.
#   If you ever run into problems with the script, set the DEBUG variable
# to 1. I'll need the output the script generates to do troubleshooting.
# See below for details. 
#   I realise that all the debugging commands strewn throughout the script
# may make things a little harder to read. But in the end I'm sure it was
# well worth adding them. It makes troubleshooting so much easier. :3
#

# Enabling the following dumps information into DEBUGFILE at various
# stages during the execution of this script.
DEBUG="0"
DEBUGFILE="/tmp/foobar"


### REQUISITE NAGIOS USER INTERFACE STUFF ###

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_networking"
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

[ $DEBUG -gt 0 ] && rm -f $DEBUGFILE 

print_usage() {
        echo "Usage: $PROGNAME"
        echo "Usage: $PROGNAME --help"
}

print_help() {
        echo ""
        print_usage
        echo ""
        echo "Basic UNIX networking check plugin for Nagios"
        echo ""
        echo "This plugin was not developed by the Nagios Plugin group."
        echo "Please do not e-mail them for support on this plugin, since"
        echo "they won't know what you're talking about :P"
        echo ""
        echo "For contact info, read the plugin itself..."
}

while test -n "$1"
do
        case "$1" in
          --help) print_help; exit $STATE_OK;;
          -h) print_help; exit $STATE_OK;;
          *) print_usage; exit $STATE_UNKNOWN;;
        esac
done


### SETTING UP THE ENVIRONMENT ###

# Host OS check and warning message
MIITOOL="0"
if [ -f /sbin/mii-tool ]
then
        MIITOOL="1"

        sudo /sbin/mii-tool >/dev/null 2>&1
        if [ $? -gt 0 ]
        then
                echo "ERROR: sudo permissions"
                echo ""
                echo "This script requires that the Nagios user account has"
                echo "sudo permissions for the mii-tool command. Currently it"
                echo "does not have these permissions. Please fix this."
                echo ""
                exit $STATE_UNKNOWN
        fi
fi


### SUB-ROUTINE DEFINITIONS ### 

function convert_base
{
        typeset -i${2:-16} x
        x=$1
        echo $x
}

function subnet_router
{
[ $DEBUG -gt 0 ] && echo "- Starting subnet_router -" >> $DEBUGFILE
    first="0"; second="0"; third="0"; fourth="0"
    first=`echo $1 | cut -c 1-8`; FIRST=`convert_base 2#$first 10`
[ $DEBUG -gt 0 ] && echo "First: $first $FIRST" >> $DEBUGFILE
    second=`echo $1 | cut -c 9-16`; SECOND=`convert_base 2#$second 10`
[ $DEBUG -gt 0 ] && echo "Second: $second $SECOND" >> $DEBUGFILE
    third=`echo $1 | cut -c 17-24`; THIRD=`convert_base 2#$third 10`
[ $DEBUG -gt 0 ] && echo "Third: $third $THIRD" >> $DEBUGFILE
    fourth=`echo $1 | cut -c 25-32`
    [ `echo $fourth|wc -c` -gt 1 ] || fourth="0"
    TEMPCOUNT=`echo $fourth | wc -c | awk '{print $1}'`
    let PADDING=9-$TEMPCOUNT 
[ $DEBUG -gt 0 ] && echo "Fourth: padding fourth with $PADDING zeroes" >> $DEBUGFILE
    i=1
    while ((i <= $PADDING));
    do
       fourth=$fourth"0" 
       let i=$i+1
    done
    FOURTH=`convert_base 2#$fourth 10`; let FOURTH=$FOURTH+1
[ $DEBUG -gt 0 ] && echo "Fourth: $fourth $FOURTH" >> $DEBUGFILE

    echo "$FIRST.$SECOND.$THIRD.$FOURTH"
}

gather_interfaces_linux()
{
[ $DEBUG -gt 0 ] && echo "- Starting gather_interfaces_linux -" >> $DEBUGFILE
    for INTF in `ifconfig -a | grep ^[a-z] | grep -v ^lo | awk '{print $1}'`
    do
	if [ `echo $INTF | grep : | wc -l` -gt 0 ]
	then
            export INTERFACES="`echo $INTF|awk -F: '{print $1}'` $INTERFACES"
	else
            export INTERFACES="$INTF $INTERFACES"
	fi
    done

    INTFCOUNT=`echo $INTERFACES | wc -w`
[ $DEBUG -gt 0 ] && echo "Interfaces: There are $INTFCOUNT interfaces: $INTERFACES." >> $DEBUGFILE
    if [ $INTFCOUNT -lt 1 ] 
    then
	echo "NOK - No active network interfaces."
	exit $STATE_CRITICAL
    fi
}

gather_interfaces_darwin()
{
[ $DEBUG -gt 0 ] && echo "- Starting gather_interfaces_darwin -" >> $DEBUGFILE
    for INTF in `ifconfig -a | grep ^[a-z] | grep -v ^gif | grep -v ^stf | grep -v ^lo | awk '{print $1}'`
    do
        [ `echo $INTF | grep : | wc -l` -gt 0 ] && INTF=`echo $INTF|awk -F: '{print $1}'`
	# Skip inactive interfaces, but keep looking at the remaining ones.
	[ `ifconfig $INTF | grep "status: inactive" | wc -l` -gt 0 ] && continue
        INTERFACES="$INTF $INTERFACES" 
    done

    INTFCOUNT=`echo $INTERFACES | wc -w`
[ $DEBUG -gt 0 ] && echo "Interfaces: There are $INTFCOUNT interfaces: $INTERFACES." >> $DEBUGFILE
    if [ $INTFCOUNT -lt 1 ] 
    then
	echo "NOK - No active network interfaces."
	exit $STATE_CRITICAL
    fi
}

gather_gateway_linux()
{
[ $DEBUG -gt 0 ] && echo "- Starting gather_gateway_linux for interface $1 -" >> $DEBUGFILE
    MASKBIN=""
    MASK=`ifconfig $1 | grep Mask | awk '{print $4}' | awk -F: '{print $2}'` 
    for PART in `echo $MASK | awk -F. '{print $1" "$2" "$3" "$4}'`
    do
        MASKBIN="$MASKBIN`convert_base $PART 2  | awk -F# '{print $2}'`"
    done
[ $DEBUG -gt 0 ] && echo "Mask: $MASK $MASKBIN" >> $DEBUGFILE

        BITCOUNT=`echo $MASKBIN | grep -o 1 | wc -l | awk '{print $1}'`

[ $DEBUG -gt 0 ] && echo "Bitcount: $BITCOUNT" >> $DEBUGFILE

    IPBIN=""
    IP=`ifconfig $1 | grep "inet addr" | awk '{print $2}' | awk -F: '{print $2}'` 
    for PART in `echo $IP | awk -F. '{print $1" "$2" "$3" "$4}'`
    do
        TEMPBIN=`convert_base $PART 2 | awk -F# '{print $2}'`
        TEMPCOUNT=`echo $TEMPBIN | wc -c | awk '{print $1}'`
        let PADDING=9-$TEMPCOUNT
        i=1
        while ((i <= $PADDING));
        do
            IPBIN=$IPBIN"0" 
            let i=$i+1
        done
        IPBIN=$IPBIN$TEMPBIN
    done
[ $DEBUG -gt 0 ] && echo "IP address: $IP $IPBIN" >> $DEBUGFILE

    CUT="1-$BITCOUNT"
[ $DEBUG -gt 0 ] && echo "Cutting: Cutting chars $CUT" >> $DEBUGFILE
    NETBIN=`echo $IPBIN | cut -c $CUT`
[ $DEBUG -gt 0 ] && echo "Netbin: $NETBIN" >> $DEBUGFILE
    ROUTER=`subnet_router $NETBIN`
[ $DEBUG -gt 0 ] && echo "Router: $ROUTER" >> $DEBUGFILE
    echo $ROUTER
}

gather_gateway_darwin()
{
[ $DEBUG -gt 0 ] && echo "- Starting gather_gateway_darwin for interface $1 -" >> $DEBUGFILE
    MASKBIN=""
    [ `uname` == "Darwin" ] && MASK=`ifconfig $1 | grep netmask | awk '{print $4}' | awk -Fx '{print $2}'`
    [ `uname` == "SunOS" ] && MASK=`ifconfig $1 | grep netmask | awk '{print $4}'`
    for PART in `echo 1 3 5 7`
    do
	let PLUSPART=$PART+1
	MASKPART=`echo $MASK | cut -c $PART-$PLUSPART`
        MASKBIN="$MASKBIN`convert_base 16#$MASKPART 2  | awk -F# '{print $2}'`"
    done
[ $DEBUG -gt 0 ] && echo "Mask: $MASK $MASKBIN" >> $DEBUGFILE

    BITCOUNT=`echo $MASKBIN | grep -o 1 | wc -l | awk '{print $1}'`
[ $DEBUG -gt 0 ] && echo "Bitcount: $BITCOUNT" >> $DEBUGFILE

    IPBIN=""
    IP=`ifconfig $1 | grep "inet " | awk '{print $2}'`
    for PART in `echo $IP | awk -F. '{print $1" "$2" "$3" "$4}'`
    do
        TEMPBIN=`convert_base $PART 2 | awk -F# '{print $2}'`
        TEMPCOUNT=`echo $TEMPBIN | wc -c | awk '{print $1}'`
        let PADDING=9-$TEMPCOUNT
        i=1
        while ((i <= $PADDING));
        do
            TEMPBIN="0"$TEMPBIN
            let i=$i+1
        done
        IPBIN=$IPBIN$TEMPBIN
    done
[ $DEBUG -gt 0 ] && echo "IP address: $IP $IPBIN" >> $DEBUGFILE

    CUT="1-$BITCOUNT"
[ $DEBUG -gt 0 ] && echo "Cutting: cutting chars $CUT" >> $DEBUGFILE
    NETBIN=`echo $IPBIN | cut -c $CUT`
[ $DEBUG -gt 0 ] && echo "Netbin: $NETBIN" >> $DEBUGFILE
    ROUTER=`subnet_router $NETBIN`
[ $DEBUG -gt 0 ] && echo "Router: $ROUTER" >> $DEBUGFILE
    echo $ROUTER
}

gather_gateway_sunos()
{
[ $DEBUG -gt 0 ] && echo "- Starting gather_gateway_sunos for interface $1 -" >> $DEBUGFILE
    MASKBIN=""
    [ `uname` == "Darwin" ] && MASK=`ifconfig $1 | grep netmask | awk '{print $4}' | awk -Fx '{print $2}'`
    [ `uname` == "SunOS" ] && MASK=`ifconfig $1 | grep netmask | awk '{print $4}'`
    for PART in `echo 1 3 5 7`
    do
        let PLUSPART=$PART+1
        MASKPART=`echo $MASK | cut -c $PART-$PLUSPART`
        MASKBIN="$MASKBIN`convert_base 16#$MASKPART 2  | awk -F# '{print $2}'`"
    done
[ $DEBUG -gt 0 ] && echo "Mask: $MASK $MASKBIN" >> $DEBUGFILE

# This piece of kludge also requires that all tabs are removed from the beginning of each line.
# Additional character needed to trick the counter below
# Shitty thing is that it doesn't work. Stupid "let" aryth engine...
#MASKBIN="$MASKBIN-"
#[ $DEBUG -gt 0 ] && echo "Bitcount: kludged binmask is $MASKBIN" >> $DEBUGFILE
#
#IFS="1"
#read TEMP << EOT
#echo $MASKBIN
#EOT
#let "BITCOUNT=(${#TEMP[@]} - 1)"
#IFS=" "

# The kludge above was replaced by this one line of Perl. 

    BITCOUNT=`echo $MASKBIN | perl -ne 'while(/1/g){++$count}; print "$count"'`
[ $DEBUG -gt 0 ] && echo "Bitcount: $BITCOUNT" >> $DEBUGFILE

    IPBIN=""
    IP=`ifconfig $1 | grep "inet " | awk '{print $2}'`
    for PART in `echo $IP | awk -F. '{print $1" "$2" "$3" "$4}'`
    do
[ $DEBUG -gt 0 ] && echo "IP part: converting part $PART" >> $DEBUGFILE
        TEMPBIN=`convert_base $PART 2 | awk -F# '{print $2}'`
[ $DEBUG -gt 0 ] && echo "IP part: converted part is $TEMPBIN" >> $DEBUGFILE
        TEMPCOUNT=`echo $TEMPBIN | wc -c | awk '{print $1}'`
[ $DEBUG -gt 0 ] && echo "IP part: this part is $TEMPCOUNT chars long." >> $DEBUGFILE
        let PADDING=9-$TEMPCOUNT
[ $DEBUG -gt 0 ] && echo "IP part: will be padded with $PADDING zeroes" >> $DEBUGFILE
        i=1
        while ((i <= $PADDING));
        do
            TEMPBIN="0"$TEMPBIN
            let i=$i+1
        done
        IPBIN=$IPBIN$TEMPBIN
    done
[ $DEBUG -gt 0 ] && echo "IP address: $IP $IPBIN" >> $DEBUGFILE

    CUT="1-$BITCOUNT"
[ $DEBUG -gt 0 ] && echo "Cutting: cutting chars $CUT" >> $DEBUGFILE
    NETBIN=`echo $IPBIN | cut -c $CUT`
[ $DEBUG -gt 0 ] && echo "Netbin: $NETBIN" >> $DEBUGFILE
    ROUTER=`subnet_router $NETBIN`
[ $DEBUG -gt 0 ] && echo "Router: $ROUTER" >> $DEBUGFILE
    echo $ROUTER
}

check_miitool()
{
[ $DEBUG -gt 0 ] && echo "- Starting check_miitool -" >> $DEBUGFILE
    COUNT="0"
    for INTF in `echo $INTERFACES`
    do
        [ `sudo /sbin/mii-tool $INTF | head -1 | grep -c ok` -gt 0 ] || let COUNT=$COUNT+1
        [ `sudo /sbin/mii-tool $INTF | head -1 | grep -c 100baseTx-FD` -gt 0 ] || let COUNT=$COUNT+1
        [ `sudo /sbin/mii-tool $INTF | head -1 | grep -c 1000baseTx-FD` -gt 0 ] || let COUNT=$COUNT+1
    done

    # Braces instead of parentheses: an exit inside ( ) only leaves the subshell.
    [ $COUNT -gt $INTFCOUNT ] && { echo "NOK - Problem with one of the interfaces"; exit $STATE_CRITICAL; }
}

check_ping()
{
[ $DEBUG -gt 0 ] && echo "- Starting check_ping -" >> $DEBUGFILE
    INTF=""
    for INTF in `echo $INTERFACES`
    do
	case `uname` in
	    Linux) GATEWAY=`gather_gateway_linux $INTF`;;
	    Darwin) GATEWAY=`gather_gateway_darwin $INTF`;;
	    SunOS) GATEWAY=`gather_gateway_sunos $INTF`;;
	    *) echo "OS not supported by this check."; exit 1;;
	esac
[ $DEBUG -gt 0 ] && echo "Gateway: $GATEWAY" >> $DEBUGFILE

 	ping -c 3 $GATEWAY >/dev/null 2>&1
        if [ $? -gt 0 ] 
        then
            echo "NOK - Problem pinging gateway $GATEWAY"; exit $STATE_CRITICAL
        fi
    done
}


### THE MAIN ROUTINE FINALLY STARTS ###

case `uname` in
            Linux) gather_interfaces_linux;;
            Darwin) gather_interfaces_darwin;;
            #SunOS) gather_interfaces_sunos;;
            SunOS) gather_interfaces_linux;;
            *) echo "OS not supported by this check."; exit 1;;
        esac

[ $MIITOOL -eq 1 ] && check_miitool

check_ping

# None of the other subroutines forced us to exit 1 before here, so let's quit with a 0.
echo "OK - Everything running like it should"
exit $STATE_OK
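
The gateway derivation in the script above works on strings of binary digits. As a rough sketch of the same idea ("the first available IP of a subnet is the router"), the calculation can also be done with integer arithmetic. The helper names ip_to_int, int_to_ip and first_host are hypothetical and not part of check_networking, and the sketch assumes a dotted-quad netmask (so a hex mask as printed by Darwin or Solaris would need converting first) and octets without leading zeroes.

```shell
# Sketch only: these helpers are hypothetical, not part of check_networking.

# Convert a dotted quad such as 192.168.1.77 into a single integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<EOF
$1
EOF
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Convert an integer back into a dotted quad.
int_to_ip() {
    echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# Print the first usable address of the subnet that IP ($1) with
# netmask ($2) belongs to: network address plus one.
first_host() {
    local ip mask
    ip=$(ip_to_int "$1")
    mask=$(ip_to_int "$2")
    int_to_ip $(( (ip & mask) + 1 ))
}
```

For example, first_host 10.20.30.200 255.255.255.192 masks the address down to 10.20.30.192 and adds one.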

kilala.nl tags: nagios, unix, programming,


Nagios script: check_nfs_stale

2006-06-01 00:00:00

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

There really isn't much to say... This script is so fscking basic that it shames me to even put it up here among all the other projects.

#!/usr/bin/bash
#
# NFS stale mounts monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide, the Netherlands
# Last Modified: 13-07-2006
# 
# Usage: ./check_nfs_stale
#
# Description:
# This script couldn't be simpler than it is. It just checks to see
# whether there are any stale NFS mounts present on the system. 
#
# Limitations:
#   This script should work properly on all implementations of Linux, Solaris
# and Mac OS X.
#
# Output:
# If there are stale NFS mounts, a CRIT is issued.
#

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_nfs_stale"
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh


### REQUISITE NAGIOS COMMAND LINE STUFF ###

print_usage() {
	echo "Usage: $PROGNAME"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "NFS stale mounts monitor plugin for Nagios"
	echo ""
	echo "This plugin was not developed by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
	case "$1" in
	  --help) print_help; exit $STATE_OK;;
	  -h) print_help; exit $STATE_OK;;
	  *) print_usage; exit $STATE_UNKNOWN;;
	esac
done

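# df reports stale handles on stderr, so merge it into the pipe. Braces are
# used because an exit inside ( ) would only leave the subshell.
[ `df -k 2>&1 | grep "Stale NFS file handle" | wc -l` -gt 0 ] && { echo "NOK - Stale NFS mounts."; exit $STATE_CRITICAL; }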

# Nothing caused us to exit early, so we're okay.
echo "OK - No stale NFS mounts."
exit $STATE_OK
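
A general shell point that matters for one-line checks of the form "test && report and exit": parentheses start a subshell, so an exit inside them only ends that subshell, while braces group commands in the current shell. The sketch below shows the difference; both function names are hypothetical, and return stands in for exit so that the example can be sourced without ending your session.

```shell
# Sketch only: both functions are hypothetical illustrations.
STATE_CRITICAL=2

# Parentheses: the exit ends the subshell, the function carries on.
subshell_variant() {
    true && (echo "NOK"; exit $STATE_CRITICAL)
    echo "still running"
}

# Braces: the commands run in the current shell, so leaving really leaves.
# In a plugin this would be "exit"; "return" is used here for safe sourcing.
brace_variant() {
    true && { echo "NOK"; return $STATE_CRITICAL; }
    echo "still running"
}
```

Calling subshell_variant prints both lines and ends with status 0, whereas brace_variant stops after the first line with status 2.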

kilala.nl tags: nagios, unix, programming,


Added Nagios plugins

2005-09-11 01:00:00

I've added all the custom Nagios monitors I wrote for $CLIENT. They might come in handy for any of you. They're not beauties, but they get the job done.

kilala.nl tags: work, nagios, unix, sysadmin,


Nagios and BoKS/Keon

2005-09-11 00:47:00

Major updates in the Sysadmin section! w00t!

In this case a lot of information on one of my favourite security tools and on Nagios, my new-found love on the monitoring front.

kilala.nl tags: nagios, boks, work, unix,


Nagios script: check_squid

2005-07-01 00:00:00

This script was written at the time I was hired by UPC / Liberty Global.

The text I wrote on Nagios Exchange about this script has been lost. I guess it speaks for itself :)


#!/usr/bin/bash
#
# Squid process monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 19-06-2006
# 
# Usage: ./check_squid
#
# Description:
# This plugin determines whether the Squid proxy server
# is running properly. It will check the following:
# * Are all required processes running?
# * Are all the required TCP/IP ports open?
#
# Limitations:
# Currently this plugin will only function correctly on Solaris systems.
#
# Output:
# The script returns a CRIT when the abovementioned criteria are
# not matched
#

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
        echo "WARNING:"
        echo "This script was originally written for use on Solaris."
        echo "You may run into some problems running it on this host."
        echo ""
        echo "Please verify that the script works before using it in a"
        echo "live environment. You can easily disable this message after"
        echo "testing the script."
        echo ""
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_squid"
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

print_usage() {
	echo "Usage: $PROGNAME"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "Squid monitor plugin for Nagios"
	echo ""
	echo "This plugin was not developed by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
	case "$1" in
	  --help) print_help; exit $STATE_OK;;
	  -h) print_help; exit $STATE_OK;;
	  *) print_usage; exit $STATE_UNKNOWN;;
	esac
done

check_processes()
{
	PROCESS="0"
	if [ `ps -ef | grep squid | grep -v grep | grep -v nagios | wc -l` -lt 2 ]; then 
		echo "SQUID NOK - One or more processes not running"
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_ports()
{
	PORTS=0
	PORTLIST="8080 3128 3130"
	for NUM in `echo $PORTLIST`; do
	if [ `netstat -an | grep LISTEN | grep $NUM | grep -v grep | wc -l` -lt 1 ]; then PORTS=1;fi
	done

	if [ $PORTS -eq 1 ]; then 
		echo "SQUID NOK - One or more TCP/IP ports not listening."
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_processes
check_ports

echo "SQUID OK - Everything running like it should"
exitstatus=$STATE_OK
exit $exitstatus
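
The port check above greps for the bare number, which means that 8080 would also be satisfied by a listener on 18080. A possible stricter match is sketched below. The function name port_is_listening is hypothetical, it reads netstat -an style output from stdin, and it assumes an awk that understands -v (on Solaris that would be nawk or /usr/xpg4/bin/awk rather than the old /usr/bin/awk).

```shell
# Sketch only: port_is_listening is a hypothetical helper.
# Usage: netstat -an | port_is_listening 3128
port_is_listening() {
    awk -v port="$1" '
        /LISTEN/ {
            # The local address ends in ".PORT" (Solaris) or ":PORT" (Linux).
            for (i = 1; i <= NF; i++)
                if ($i ~ ("[.:]" port "$")) found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}
```

The function succeeds only when a LISTEN line has an address field ending in exactly that port number.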

kilala.nl tags: nagios, unix, programming,


Nagios script: check_retro_client

2005-07-01 00:00:00

This script was written at the time I was hired by UPC / Liberty Global.

Basic monitor that checks if the Retrospect client is up and running.

This script was quickly hacked together for my current customer, as a Q&D solution for their monitoring needs. It's no beauty, but it works. Written in ksh and tested with:

Solaris 8
NRPE 1.9
Retrospect 7.0
Should work with other versions as well.

The script sends a Critical if the required process is not running.

UPDATE 19/06/2006:

Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!


#!/usr/bin/bash
#
# Retrospect Backup Client monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 19-06-2006
# 
# Usage: ./check_retro_client
#
# Description:
# This plugin determines whether the Retrospect backup client 
# is running properly. It will check the following:
# * Are all required processes running?
#
# Limitations:
# Currently this plugin will only function correctly on Solaris systems.
#
# Output:
# The script returns a CRIT when the abovementioned criteria are
# not matched
#

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
        echo "WARNING:"
        echo "This script was originally written for use on Solaris."
        echo "You may run into some problems running it on this host."
        echo ""
        echo "Please verify that the script works before using it in a"
        echo "live environment. You can easily disable this message after"
        echo "testing the script."
        echo ""
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_retro_client"
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

print_usage() {
	echo "Usage: $PROGNAME"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "Retrospect Backup Client monitor plugin for Nagios"
	echo ""
	echo "This plugin was not developed by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
	case "$1" in
	  --help) print_help; exit $STATE_OK;;
	  -h) print_help; exit $STATE_OK;;
	  *) print_usage; exit $STATE_UNKNOWN;;
	esac
done

check_processes()
{
	PROCESS="0"
	if [ `ps -ef | grep retroclient | grep -v grep | grep -v nagios | wc -l` -lt 1 ]; then 
		echo "RETROSPECT NOK - One or more processes not running"
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_processes

echo "RETROSPECT OK - Everything running like it should"
exitstatus=$STATE_OK
exit $exitstatus

kilala.nl tags: nagios, unix, programming,


Nagios script: check_load2

2005-07-01 00:00:00

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

We are currently in the process of distributing a standard set of Nagios monitoring scripts to over 300 client systems. One of the metrics we would like to monitor is the three load averages (or as Dr. Gunther calls them: the LaLaLa triplets).

Since these 300 servers aren't all alike, we are bound to run into systems with one, two, four, eight or more processors. That means there is no nice way of making one standard configuration, since you'd have to define separate load average levels for WARN and CRIT per type of system. Why? Because a quad-processor system can take much more load than a single-processor system.

One way to get around this would be by defining separate host groups, based on the amount of processors in a system. You could then define a unique check_load command for each CPU host group.

I've gone the other way around though...

My work-around is to replace check_load with check_load2. This script takes no command line parameters and works on the basis of standard multipliers. We are of the opinion that the number of processors multiplied by a certain factor (150%? 200%? and so on) is a good enough way to define these WARN and CRIT levels. These multipliers can easily be modified (at the top of the script) to fit what -you- think is a worrying level of activity.
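
To make the multiplier idea concrete, here is a small sketch of the comparison for a single load average. The function name classify_load is hypothetical, and it uses awk for the floating point comparison where the script itself uses bc; the factors are passed in as arguments rather than read from the top of a script.

```shell
# Sketch only: classify_load is a hypothetical helper, not part of check_load2.
# Usage: classify_load NUMPROCS LOAD WARN_FACTOR CRIT_FACTOR
classify_load() {
    awk -v n="$1" -v load="$2" -v wf="$3" -v cf="$4" 'BEGIN {
        # Threshold = number of processors multiplied by the factor.
        if (load >= n * cf)      print "CRITICAL"
        else if (load >= n * wf) print "WARNING"
        else                     print "OK"
    }'
}
```

With four processors and factors of 2.00 and 3.00, the WARN level sits at a load of 8 and the CRIT level at 12, so classify_load 4 8.00 2.00 3.00 reports a warning.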

This script was tested on Redhat ES3, Solaris 8 and Mac OS X 10.4. It should run on other versions of these OSes as well.

EDIT:

Oh! Just like my other recent Nagios scripts, check_load2 comes with a debugging option. Set $DEBUG at the top of the file to anything larger than zero and the script will dump information at various stages of its execution.

#!/usr/bin/bash
#
# CPU load monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide, the Netherlands
# Last Modified: 22-06-2006
# 
# Usage: ./check_load2
#
# Description:
#   Ethan's original version of the check_load script is very flexible.
# It allows you to specifically set WARN and CRIT levels regarding 
# the CPU load of the system you're monitoring.
#   However: flexibility is not always a good thing. Say for example that
# you want to monitor the CPU load across a few hundred of systems having
# various CPU configurations. You -could- define host groups for single, dual
# quad (and so on) processor systems and assign unique check_load command
# definitions to each group.
#   Or you could write a script which checks the amount of active CPUs and
# then makes an educated guess at the WARN and CRIT levels for the system. 
# In most cases this should really be enough. 
#
# Limitations:
# This script should work properly on all implementations of Linux, Solaris
# and Mac OS X.
#
# Output:
# Depending on the levels defined at the top of the script,
# the script returns an OK, WARN or CRIT to Nagios based on CPU load.
#
# Other notes:
#   If you ever run into problems with the script, set the DEBUG variable
# to 1. I'll need the output the script generates to do troubleshooting.
# See below for details.
#   I realise that all the debugging commands strewn throughout the script
# may make things a little harder to read. But in the end I'm sure it was
# well worth adding them. It makes troubleshooting so much easier. :3
#

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_load2"
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh


### DEBUGGING SETUP ###
# Cause you never know when you'll need to squash a bug or two
# Set DEBUG to anything larger than zero to enable the debug output.
DEBUG="0"
DEBUGFILE="/tmp/foobar"
[ $DEBUG -gt 0 ] && rm -f $DEBUGFILE


### REQUISITE NAGIOS COMMAND LINE STUFF ###

print_usage() {
	echo "Usage: $PROGNAME"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "Semi-intelligent CPU load monitor plugin for Nagios"
	echo ""
	echo "This plugin was not developed by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
	case "$1" in
	  --help) print_help; exit $STATE_OK;;
	  -h) print_help; exit $STATE_OK;;
	  *) print_usage; exit $STATE_UNKNOWN;;
	esac
done


### SETTING UP THE WARN AND CRIT FACTORS ###
# Please be aware that these are -factors- and not real load average values.
# The numbers below will be multiplied by the amount of processors to come
# to the desired WARN and CRIT levels. Feel free to adjust these factors, if
# you feel the need to tweak them.

WARN_1min="2.00"
WARN_5min="1.50"
WARN_15min="1.50"
[ $DEBUG -gt 0 ] && echo "Factors: warning factors are at $WARN_1min, $WARN_5min, $WARN_15min." >> $DEBUGFILE

CRIT_1min="3.00"
CRIT_5min="2.00"
CRIT_15min="2.00"
[ $DEBUG -gt 0 ] && echo "Factors: critical factors are at $CRIT_1min, $CRIT_5min, $CRIT_15min." >> $DEBUGFILE


### DEFINING SUBROUTINES ###

function gather_procs_linux()
{
    NUMPROCS=`cat /proc/cpuinfo | grep ^processor | wc -l` 
[ $DEBUG -gt 0 ] && echo "Numprocs: Number of processors detected is $NUMPROCS." >> $DEBUGFILE
} 

function gather_procs_sunos()
{
    NUMPROCS=`/usr/bin/mpstat | grep -v CPU | wc -l` 
[ $DEBUG -gt 0 ] && echo "Numprocs: Number of processors detected is $NUMPROCS." >> $DEBUGFILE
}

function gather_procs_darwin()
{
    NUMPROCS=`/usr/bin/hostinfo | grep "Default processor set" | awk '{print $8}'` 
[ $DEBUG -gt 0 ] && echo "Numprocs: Number of processors detected is $NUMPROCS." >> $DEBUGFILE
}

function gather_load_linux()
{
    REAL_1min=`cat /proc/loadavg | awk '{print $1}'`
    REAL_5min=`cat /proc/loadavg | awk '{print $2}'`
    REAL_15min=`cat /proc/loadavg | awk '{print $3}'`
[ $DEBUG -gt 0 ] && echo "Gather_load: Detected load averages are $REAL_1min, $REAL_5min, $REAL_15min." >> $DEBUGFILE
}

function gather_load_sunos()
{
    REAL_1min=`w | grep "load average" | awk -F, '{print $4}' | awk '{print $3}'`
    REAL_5min=`w | grep "load average" | awk -F, '{print $5}'`
    REAL_15min=`w | grep "load average" | awk -F, '{print $6}'`
[ $DEBUG -gt 0 ] && echo "Gather_load: Detected load averages are $REAL_1min, $REAL_5min, $REAL_15min." >> $DEBUGFILE
}

function gather_load_darwin()
{
    REAL_1min=`sysctl -n vm.loadavg | awk '{print $1}'`
    REAL_5min=`sysctl -n vm.loadavg | awk '{print $2}'`
    REAL_15min=`sysctl -n vm.loadavg | awk '{print $3}'`
[ $DEBUG -gt 0 ] && echo "Gather_load: Detected load averages are $REAL_1min, $REAL_5min, $REAL_15min." >> $DEBUGFILE
}

function check_load()
{
    WARN="0"; CRIT="0"

    [ `echo "if(($NUMPROCS * $WARN_1min) > $REAL_1min) 0; if(($NUMPROCS * $WARN_1min) <= $REAL_1min) 1" | bc` -gt 0 ] && let WARN=$WARN+1
    [ `echo "if(($NUMPROCS * $WARN_5min) > $REAL_5min) 0; if(($NUMPROCS * $WARN_5min) <= $REAL_5min) 1" | bc` -gt 0 ] && let WARN=$WARN+1
    [ `echo "if(($NUMPROCS * $WARN_15min) > $REAL_15min) 0; if(($NUMPROCS * $WARN_15min) <= $REAL_15min) 1" | bc` -gt 0 ] && let WARN=$WARN+1
[ $DEBUG -gt 0 ] && echo "Check_load: warning levels are `echo "$NUMPROCS * $WARN_1min"|bc`, `echo "$NUMPROCS * $WARN_5min"|bc`, `echo "$NUMPROCS * $WARN_15min"|bc`," >> $DEBUGFILE

    [ `echo "if(($NUMPROCS * $CRIT_1min) > $REAL_1min) 0; if(($NUMPROCS * $CRIT_1min) <= $REAL_1min) 1" | bc` -gt 0 ] && let CRIT=$CRIT+1
    [ `echo "if(($NUMPROCS * $CRIT_5min) > $REAL_5min) 0; if(($NUMPROCS * $CRIT_5min) <= $REAL_5min) 1" | bc` -gt 0 ] && let CRIT=$CRIT+1
    [ `echo "if(($NUMPROCS * $CRIT_15min) > $REAL_15min) 0; if(($NUMPROCS * $CRIT_15min) <= $REAL_15min) 1" | bc` -gt 0 ] && let CRIT=$CRIT+1
[ $DEBUG -gt 0 ] && echo "Check_load: critical levels are `echo "$NUMPROCS * $CRIT_1min"|bc`, `echo "$NUMPROCS * $CRIT_5min"|bc`, `echo "$NUMPROCS * $CRIT_15min"|bc`," >> $DEBUGFILE

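    # CRIT is tested first, since a critical load also exceeds the WARN level.
    # Braces instead of parentheses: an exit inside ( ) only leaves the subshell.
    [ $CRIT -gt 0 ] && { echo "NOK: load averages are at $REAL_1min, $REAL_5min, $REAL_15min"; exit $STATE_CRITICAL; }
    [ $WARN -gt 0 ] && { echo "NOK: load averages are at $REAL_1min, $REAL_5min, $REAL_15min"; exit $STATE_WARNING; }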
}

### FINALLY, THE MAIN ROUTINE ###

NUMPROCS="0"

case `uname` in
            Linux) gather_procs_linux; gather_load_linux; check_load;;
            Darwin) gather_procs_darwin; gather_load_darwin; check_load;;
            SunOS) gather_procs_sunos; gather_load_sunos; check_load;;
            *) echo "OS not supported by this check."; exit 1;;
esac

# Nothing caused us to exit early, so we're okay.
echo "OK - load averages are at $REAL_1min, $REAL_5min, $REAL_15min"
exit $STATE_OK

kilala.nl tags: nagios, unix, programming,


Nagios script: check_fwm

2005-07-01 00:00:00

This script was written at the time I was hired by UPC / Liberty Global.

Basic monitor that checks if the Checkpoint Firewall-1 Management software is up and running. It checks for a number of processes and ports.

This script was quickly hacked together for my current customer, as a Q&D solution for their monitoring needs. It's no beauty, but it works. Written in ksh and tested with:

Solaris 8
NRPE 1.9
Checkpoint R55.
Should work with other versions as well.

The script sends a Critical if:

A) One or more processes are not running, or

B) One or more ports are not available for connections.
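
As an illustration of condition A, the sketch below reports which required processes are absent from a process listing, matching on the exact command name rather than on a substring. The function name missing_procs is hypothetical, and it assumes the listing is a file with one command per line, for example produced with ps -e -o comm, and an awk that understands -v (nawk on Solaris).

```shell
# Sketch only: missing_procs is a hypothetical helper.
# Usage: missing_procs LISTING_FILE NAME [NAME ...]
missing_procs() {
    local listing="$1" name missing=""
    shift
    for name in "$@"; do
        # Strip any leading directory, then compare the whole name.
        awk -v want="$name" '
            { n = $1; sub(/.*\//, "", n); if (n == want) found = 1 }
            END { exit(found ? 0 : 1) }
        ' "$listing" || missing="$missing $name"
    done
    # Print the missing names, separated by spaces (empty when all run).
    echo "${missing# }"
}
```

An empty result means every required process was found; anything else can go straight into the NOK message.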

UPDATE 19/06/2006:

Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!

#!/usr/bin/bash
#
# Firewall-1 process monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 19-06-2006
# 
# Usage: ./check_fwm
#
# Description:
# This plugin determines whether the Firewall-1 management
# software is running properly. It will check the following:
# * Are all required processes running?
# * Are all the required TCP/IP ports open?
#
# Limitations:
# Currently this plugin will only function correctly on Solaris systems.
#
# Output:
# The script returns a CRIT when one of the criteria mentioned
# above is not matched.
#

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
        echo "WARNING:"
        echo "This script was originally written for use on Solaris."
        echo "You may run into some problems running it on this host."
        echo ""
        echo "Please verify that the script works before using it in a"
        echo "live environment. You can easily disable this message after"
        echo "testing the script."
        echo ""
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_fwm"
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

# Make sure PROGNAME is set for the usage message
PROGNAME=`basename $0`

print_usage() {
	echo "Usage: $PROGNAME"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "Firewall-1 monitor plugin for Nagios"
	echo ""
	echo "This plugin is not developed by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
	case "$1" in
	  --help) print_help; exit $STATE_OK;;
	  -h) print_help; exit $STATE_OK;;
	  *) print_usage; exit $STATE_UNKNOWN;;
	esac
done

check_processes()
{
	PROCESS="0"
	# PROCLIST="cpd fwd fwm cpwd cpca cpmad cplmd cpstat cpshrd cpsnmpd"
	PROCLIST="cpd fwd fwm cpwd cpca cpmad cpstat cpsnmpd"
	for PROC in $PROCLIST; do
		if [ `ps -ef | grep $PROC | grep -v grep | wc -l` -lt 1 ]; then PROCESS=1; fi
	done

	if [ $PROCESS -eq 1 ]; then 
		echo "FWM NOK - One or more processes not running"
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_ports()
{
	PORTS="0"
	PORTLIST="256 257 18183 18184 18187 18190 18191 18192 18196 18264"
	for NUM in $PORTLIST; do
		if [ `netstat -an | grep LISTEN | grep $NUM | grep -v grep | wc -l` -lt 1 ]; then PORTS=1; fi
	done

	if [ $PORTS -eq 1 ]; then 
		echo "FWM NOK - One or more TCP/IP ports not listening."
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_processes
check_ports

echo "FWM OK - Everything running like it should"
exitstatus=$STATE_OK
exit $exitstatus
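
One caveat with check_ports above: `grep $NUM` matches the number anywhere on the line, so a listener on port 18256 would also satisfy the check for port 256. Below is a minimal sketch of a stricter match, not part of the original plugin. It assumes that on a LISTEN line the local address is the only field ending in `.PORT` (Solaris style) or `:PORT` (Linux style). The helper name `port_listening` is hypothetical, and it reads `netstat -an` output from stdin so it can be tried against saved output first.

```shell
#!/usr/bin/bash
# Hypothetical helper, not part of the original check_fwm script.
# Reads `netstat -an` style output on stdin and succeeds only if a
# LISTEN line contains a field ending in ".PORT" or ":PORT".
# Assumption: an awk that supports -v (on Solaris use nawk or
# /usr/xpg4/bin/awk rather than the legacy /usr/bin/awk).
port_listening() {
    awk -v port="$1" '
        /LISTEN/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ ("[.:]" port "$")) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

Inside the loop of check_ports this could be used as `netstat -an | port_listening $NUM || PORTS=1`.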

kilala.nl tags: nagios, unix, programming,

Nagios script: check_postfix

2005-07-01 00:00:00

This script was written at the time I was hired by UPC / Liberty Global.

Basic monitor that checks if Postfix is up and running. It checks for a number of processes and ports.

This script was quickly hacked together for my current customer, as a Q&D solution for their monitoring needs. It's no beauty, but it works. Written in ksh and tested with:

Solaris 8
NRPE 1.9
Should work with other versions as well.

The script sends a Critical if:

A) One or more processes are not running, or

B) One or more ports are not available for connections.

UPDATE 19/06/2006:

Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!


#!/usr/bin/bash
#
# Postfix process monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 19-06-2006
# 
# Usage: ./check_postfix
#
# Description:
# This plugin determines whether the Postfix SMTP server
# is running properly. It will check the following:
# * Are all required processes running?
# * Are all the required TCP/IP ports open?
#
# Limitations:
# Currently this plugin will only function correctly on Solaris systems.
#
# Output:
# Script returns a CRIT when one of the abovementioned criteria is 
# not matched
#

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
        echo "WARNING:"
        echo "This script was originally written for use on Solaris."
        echo "You may run into some problems running it on this host."
        echo ""
        echo "Please verify that the script works before using it in a"
        echo "live environment. You can easily disable this message after"
        echo "testing the script."
        echo ""
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

# Make sure PROGNAME is set for the usage message
PROGNAME=`basename $0`

print_usage() {
	echo "Usage: $PROGNAME"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "Postfix monitor plugin for Nagios"
	echo ""
	echo "This plugin is not developed by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
	case "$1" in
	  --help) print_help; exit $STATE_OK;;
	  -h) print_help; exit $STATE_OK;;
	  *) print_usage; exit $STATE_UNKNOWN;;
	esac
done

check_processes()
{
	PROCESS="0"
	PROCLIST="smtpd qmgr pickup master sendmail"
	for PROC in $PROCLIST; do
		if [ `ps -ef | grep $PROC | grep -v grep | wc -l` -lt 1 ]; then
			if [ "$PROC" = "smtpd" ]; then
				# No smtpd found: accept a running proxymap instead
				if [ `ps -ef | grep proxymap | grep -v grep | wc -l` -lt 1 ]; then
					PROCESS=1
				else
					PROCESS=0
				fi
			else
				PROCESS=1
			fi
		fi
	done

	if [ $PROCESS -eq 1 ]; then 
		echo "SMTP-S NOK - One or more processes not running"
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_ports()
{
	PORTS="0"
	PORTLIST="25"
	for NUM in $PORTLIST; do
		if [ `netstat -an | grep LISTEN | grep $NUM | grep -v grep | wc -l` -lt 1 ]; then PORTS=1; fi
	done

	if [ $PORTS -eq 1 ]; then 
		echo "SMTP-S NOK - One or more TCP/IP ports not listening."
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_processes
check_ports

echo "SMTP-S OK - Everything running like it should"
exitstatus=$STATE_OK
exit $exitstatus
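
A note on check_processes above: `ps -ef | grep $PROC` matches the name anywhere in the line, so `master` is also found in any unrelated process whose path or arguments contain that word. The following is a rough sketch of exact-name matching, not the original author's method. It assumes a list of command names, one per line, on stdin (for example from `ps -e -o comm=`), and the function name `missing_procs` is hypothetical.

```shell
#!/usr/bin/bash
# Hypothetical helper, not part of the original check_postfix script.
# $1    = space-separated list of required process names
# stdin = one command name per line, with or without a leading path
# Prints the required names that were not seen, separated by spaces.
# Assumption: an awk that supports -v (nawk on Solaris).
missing_procs() {
    awk -v required="$1" '
        { name = $1; sub(/.*\//, "", name); seen[name] = 1 }
        END {
            n = split(required, want, " ")
            out = ""
            for (i = 1; i <= n; i++)
                if (!(want[i] in seen))
                    out = (out == "" ? want[i] : out " " want[i])
            print out
        }
    '
}
```

A possible use would be `MISSING=$(ps -e -o comm= | missing_procs "$PROCLIST")`, treating a non-empty result as Critical. The proxymap exception for smtpd would still need to be handled separately.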

kilala.nl tags: nagios, unix, programming,

Nagios script: check_ntp_s

2005-07-01 00:00:00

This script was written at the time I was hired by UPC / Liberty Global.

Basic monitor that checks if the NTP server is up and running. It checks for the xntpd process and whether the server's clock has drifted from its higher-level stratum server.

This script was quickly hacked together for my current customer, as a Q&D solution for their monitoring needs. It's no beauty, but it works. Written in ksh and tested with:

Solaris 8
NRPE 1.9
xntpd
Should work with other versions as well.

The script sends a Critical if:

A) One or more processes are not running, or

B) The server's clock has drifted too far from its higher level Stratum server.

Requires the "check_ntp" plugin, which is part of the default Nagios plugins package.

UPDATE 19/06/2006:

Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!


#!/usr/bin/bash
#
# NTP server process monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 19-06-2006
# 
# Usage: ./check_ntp_s
#
# Description:
# This plugin determines whether the Nagios client is functioning 
# properly as an NTP server. It does this by checking:
# * Are all required processes running?
# * Is the server's time up to scratch with its higher stratum server?
#
# Limitations:
# Currently this plugin will only function correctly on Solaris systems.
#
# Output:
# The script returns a CRIT when one of the abovementioned criteria
# is not matched.
#

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
        echo "WARNING:"
        echo "This script was originally written for use on Solaris."
        echo "You may run into some problems running it on this host."
        echo ""
        echo "Please verify that the script works before using it in a"
        echo "live environment. You can easily disable this message after"
        echo "testing the script."
        echo ""
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

# Make sure PROGNAME is set for the usage message
PROGNAME=`basename $0`

print_usage() {
	echo "Usage: $PROGNAME"
	echo "Usage: $PROGNAME --help"
}

print_help() {
	echo ""
	print_usage
	echo ""
	echo "NTP server plugin for Nagios"
	echo ""
	echo "This plugin is not developed by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info, read the plugin itself..."
}

while test -n "$1" 
do
	case "$1" in
	  --help) print_help; exit $STATE_OK;;
	  -h) print_help; exit $STATE_OK;;
	  *) print_usage; exit $STATE_UNKNOWN;;
	esac
done

check_processes()
{
	PROCESS="0"
	if [ `ps -ef | grep xntpd | grep -v grep | grep -v nagios | wc -l` -lt 1 ]; then PROCESS=1;fi
	if [ $PROCESS -eq 1 ]; then 
		echo "NTP-S NOK - One or more processes not running"
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_time()
{
	TIME="0"
	#SERVERS="ntp0.nl.net ntp1.nl.net ntp2.nl.net"
	SERVERS="nl-ams99z-a02-01"
	for SERV in $SERVERS; do
		# Quoted so that empty plugin output does not break the test
		if [ "`$LIBEXEC/check_ntp -H $SERV | awk '{print $2}'`" != "OK:" ]; then
			TIME=1
		else
			TIME=0
			break
		fi
	done
	if [ $TIME -eq 1 ]; then
		echo "NTP-S NOK - Time not in synch with higher Stratum."
		exitstatus=$STATE_CRITICAL
		exit $exitstatus
	fi
}

check_processes
check_time

echo "NTP-S OK - Everything running like it should"
exitstatus=$STATE_OK
exit $exitstatus
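
check_time above compares the second word of check_ntp's output against `OK:`. Nagios plugins also report their state through the exit code (0 means OK), so the text parsing can be avoided. Here is a minimal sketch under that assumption, not the original implementation. The helper name `any_server_ok` is hypothetical, and the check command is passed in as an argument so that it can be replaced by a stub while testing.

```shell
#!/usr/bin/bash
# Hypothetical helper, not part of the original check_ntp_s script.
# $1 = check command (e.g. the path to check_ntp)
# remaining arguments = servers to try, in order
# Succeeds as soon as the check command exits 0 (OK) for one server.
any_server_ok() {
    check_cmd="$1"
    shift
    for serv in "$@"; do
        if "$check_cmd" -H "$serv" >/dev/null 2>&1; then
            return 0
        fi
    done
    return 1
}
```

In check_time this might replace the loop with `any_server_ok $LIBEXEC/check_ntp $SERVERS || TIME=1`. As in the original, any state other than OK (including WARNING) counts as a failure for that server.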

kilala.nl tags: nagios, unix, programming,

All content, with exception of "borrowed" blogpost images, or unless otherwise indicated, is copyright of Tess Sluijter. The character Kilala the cat-demon is copyright of Rumiko Takahashi and used here without permission.
