2009-09-14 22:05:00
This script is used to monitor the basic processes that go with Cisco's CNR (Network Registrar), which can be likened to a DHCP server. Cisco's Support Wiki described CNR as follows:
Cisco CNS Network Registrar is a full-featured DNS/DHCP system that provides scalable naming and addressing services for service provider and enterprise networks. Cisco CNS Network Registrar dramatically improves the reliability of naming and addressing services for enterprise networks. For cable ISPs, Cisco CNS Network Registrar provides scalable DNS and DHCP services and forms the basis of a DOCSIS cable modem provisioning system.
As said, my script only checks the basics of CNR to ensure that the required daemons are running. It does not actually check any of the functionality, though it may be expanded to include this at a later point in time.
./check_cnr [-nagios|-tivoli] [-d -o FILE]
    -nagios   Nagios output mode (default)
    -tivoli   Tivoli output mode
    -d        Debug mode
    -o        Output file for debug logging
Depending on which mode you've selected the output of the script will differ slightly.
In Tivoli mode the output is limited to a numerical value, as the script is meant to be used as a "numeric script": 0 = OK, 1 = WARNING/UNKNOWN, 2 = SEVERE. The exit code of the script is identical to this value.
In Nagios mode the exit codes are similar to Tivoli's, with the exception that the value 3 indicates an unknown state. The output on stdout includes the service name and state (CNR OK/NOK) and a helpful error message.
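The difference between the two modes can be sketched in a few lines of shell. This is only an illustration of the exit code scheme described above, not the code from check_cnr itself: the function name report_state, its arguments and the exact "CNR OK - message" layout are assumptions.

```shell
#!/bin/sh
# Sketch only: report_state and its arguments are hypothetical names,
# they are not taken from check_cnr.

# report_state MODE STATE MESSAGE
#   MODE    : "nagios" or "tivoli"
#   STATE   : OK, WARNING, CRITICAL or UNKNOWN (Tivoli calls CRITICAL "SEVERE")
#   MESSAGE : free-form text, only printed in Nagios mode
# Prints the line for stdout and returns the matching exit code.
report_state() {
    rs_mode=$1
    rs_state=$2
    rs_msg=$3

    case "$rs_state" in
        OK)       rs_code=0 ;;
        WARNING)  rs_code=1 ;;
        CRITICAL) rs_code=2 ;;
        *)        rs_code=3 ;;   # anything else counts as unknown
    esac

    if [ "$rs_mode" = "tivoli" ]; then
        # Tivoli has no separate "unknown" value, so fold 3 into 1.
        if [ "$rs_code" -eq 3 ]; then
            rs_code=1
        fi
        echo "$rs_code"
    else
        # Assumed output layout: service name, state, message.
        if [ "$rs_code" -eq 0 ]; then
            echo "CNR OK - $rs_msg"
        else
            echo "CNR NOK - $rs_msg"
        fi
    fi
    return "$rs_code"
}
```

Calling report_state tivoli UNKNOWN "whatever" prints 1 and returns 1, while the same call in nagios mode returns 3.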
$ wc check_cnr.sh
189 666 4531 check_cnr.sh
$ cksum check_cnr.sh
4161895780 4531 check_cnr.sh
kilala.nl tags: nagios, unix, programming,
2009-09-14 19:35:00
For the longest of times my Nagios plugins have used a rather old-fashioned approach to configuration: everything's hardcoded into the script and you'll need to modify the script to make changes. Obviously that sucks if you want to use the script for multiple purposes. My newer scripts all use command line flags and parameters to pass variables, making them a lot more versatile. Hence I will soon be rewriting all my Nagios plugins to work the same way.
I will also be changing their individual pages, putting the plugin back into its own .ksh script instead of including the code into the HTML page. Whatever was I thinking when I did that?!
Finally, I will also be modifying all plugins (the new ones too) to work with multiple monitoring systems. By passing a certain command line option one will be able to choose between modes for Nagios and Tivoli, with possible extensions along the way.
I've got my work cut out for me!
kilala.nl tags: sysadmin, nagios,
2008-01-01 00:00:00
In my mini-howto about monitoring HP and Dell specific SNMP objects through Net-SNMP I referred to a few interesting objects which could be read through their respective SNMP agents. This page covers the interesting objects for HP Compaq systems.
Right now I've only got a very limited number of different models to test all this stuff on, so bear with me :) The following lists are only a small selection of all possible objects, namely the ones that we found interesting. A full list of available options can be obtained by running:
snmpwalk -c public localhost .1.3.6.1.4.1.232
I've tried my best at making the more interesting parts of the HP and Dell MIBs legible. The results can be found in the PDF, in the menu on the left. But once again, these lists are only a small subset of the complete MIB for both vendors. You won't know all that's available to you unless you start digging through the flat .TXT files yourself. Unlike Sun, HP and Dell -do- publish their MIB files freely, so you'll have no trouble finding them on the web.
I've also expanded on the HP SIM MIB a little in a PDF document. Get it over here.
Unfortunately, HP and Compaq have made it impossible to monitor hard disk statuses without add-on software. The plain vanilla SNMP agent has no way of filling the relevant objects. Instead it requires the CPQarrayd add-on.
If you do choose to install this piece of software, you can find all the objects regarding -internal- drives under OID .1.3.6.1.4.1.232.3.2.5.2 (cpqDaPhyDrvErrTable). Refer to CPQIDA.MIB.txt for all relevant details and a full listing of the appropriate OIDs.
Currently I have no way of making sure, but I assume that the alert message for HDD[0-7] can be found in .1.3.6.1.4.1.232.3.2.5.2.1.15.[0-7]. Any value above 0 indicates a failure.
All object IDs below fit under .1.3.6.1.4.1.232. These objects should be usable on every HP system in the DL/ML range, although I have only tested them on the DL380, DL385, DL580 and ML570.
Object | Description | Values |
.1.2.2.1.1.6.OID | CPU[0-3] status | 1/2 = ok, 3 = warn, 4 = crit |
.3.2.2.1.1.6.OID | HDD controller | 1/2 = ok, 3 = warn, 4 = crit |
.3.2.3.1.1.11.OID | LDD[0-X] status | 1/2 = ok, 3 = warn, 4 = crit |
.3.2.4.1.1.6.OID | Hot spare HDD status | >2 = crit |
.3.2.5.1.1.37.OID | HDD[0-X] status | 1/2 = ok, 3 = warn, 4 = crit |
.5.2.2.1.1.12.OID | SCSI controller status | 1/2 = ok, 3 = warn, 4 = crit |
.5.2.3.1.1.8.OID | SCSI LDD[0-X] status | 1/2 = ok, 3 = warn, 4 = crit |
.5.2.4.1.1.26.OID | SCSI HDD[0-X] status | 1/2 = ok, 3 = warn, 4 = crit |
.6.2.6.7.1.9.OID | Fan status | 1/2 = ok, 3 = warn, 4 = crit |
.6.2.6.8.1.4.1 | CPU0 temperature | Contains current temperature |
.6.2.6.8.1.4.4 | CPU1 temperature | Contains current temperature |
.6.2.6.8.1.4.5 | PSU temperature | Contains current temperature |
.6.2.9.3.1.4.0.OID | PSU[0-X] status | 1/2 = ok, 3 = warn, 4 = crit |
.14.2.2.1.1.5.OID | IDE HDD[0-X] status | 1/2 = ok, 3 = warn, 4 = crit |
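Most rows in the table share the same "1/2 = ok, 3 = warn, 4 = crit" scheme, which maps neatly onto Nagios exit codes. Below is a rough sketch of that mapping. The function names hp_oid and hp_status_to_nagios are made up for this example, and treating any other value as "unknown" is my own assumption. The hot spare row (>2 = crit) does not follow this scheme and would need its own handling.

```shell
#!/bin/sh
# Sketch only: hp_oid and hp_status_to_nagios are hypothetical helpers.

HP_BASE=".1.3.6.1.4.1.232"

# hp_oid SUFFIX INSTANCE
# Glues the enterprise prefix, a suffix from the table and an instance
# number together, e.g. hp_oid .6.2.6.7.1.9 1.1
hp_oid() {
    echo "${HP_BASE}${1}.${2}"
}

# hp_status_to_nagios VALUE
# Prints the Nagios exit code that goes with an HP status integer.
hp_status_to_nagios() {
    case "$1" in
        1|2) echo 0 ;;   # ok
        3)   echo 1 ;;   # warn
        4)   echo 2 ;;   # crit
        *)   echo 3 ;;   # not in the table: treated as unknown (assumption)
    esac
}

# Possible usage, assuming snmpget prints a bare integer with -Oqv:
#   value=$(snmpget -c public -Oqv localhost "$(hp_oid .6.2.6.7.1.9 1.1)")
#   exit "$(hp_status_to_nagios "$value")"
```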
As I already said, most of the OIDs from the tables above can be used to monitor vanilla HP systems (with the exceptions of the hard disks). The biggest difference lies in the placement of certain fans and sensors. The table below outlines the various locations, depending on the model.
Each system contains multiple fans and temperature sensors and will thus have multiple instances of these objects in its SNMP tree. The locations for each of these instances can be read from .6.2.6.7.1.3.OID (fans) and .6.2.6.8.1.3.OID (temperature sensors). The OID part of these numeric sequences is always .1.1, .1.2, .1.3, .1.4 and so on.
Fan | DL380 | DL385 | DL580 | ML570 |
.1.1 | CPU | CPU | System | ? |
.1.2 | CPU | CPU | System | ? |
.1.3 | IO Board | IO Board | System | ? |
.1.4 | IO Board | IO Board | System | ? |
.1.5 | CPU | CPU | IO Board | ? |
.1.6 | CPU | CPU | IO Board | ? |
.1.7 | PSU | PSU | - | ? |
.1.8 | PSU | PSU | - | ? |
Sensor | DL380 | DL385 | DL580 | ML570 |
.1.1 | CPU | CPU | CPU | ? |
.1.2 | CPU | IO Board | CPU | ? |
.1.3 | IO Board | CPU | CPU | ? |
.1.4 | CPU | CPU | CPU | ? |
.1.5 | PSU | PSU | IO Board | ? |
.1.6 | - | - | Ambient | ? |
.1.7 | - | - | System | ? |
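Since the instances are simply numbered .1.1, .1.2 and so on, the OIDs for every fan can be generated in a loop. A minimal sketch, assuming you already know how many fans your model has (eight on the DL380 according to the table above). The function name fan_oids is hypothetical.

```shell
#!/bin/sh
# Sketch only: fan_oids is a made-up helper name.

HP_BASE=".1.3.6.1.4.1.232"

# fan_oids COUNT
# Prints one line per fan instance (.1.1 up to .1.COUNT), holding the
# OID of its location followed by the OID of its status.
fan_oids() {
    fo_i=1
    while [ "$fo_i" -le "$1" ]; do
        echo "${HP_BASE}.6.2.6.7.1.3.1.${fo_i} ${HP_BASE}.6.2.6.7.1.9.1.${fo_i}"
        fo_i=$((fo_i + 1))
    done
}

# Possible usage: feed each pair to snmpget.
#   fan_oids 8 | while read -r location status; do
#       snmpget -c public localhost "$location" "$status"
#   done
```

The same loop works for the temperature sensors by swapping in .6.2.6.8.1.3 (location) and .6.2.6.8.1.4 (current temperature).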
kilala.nl tags: sysadmin, unix, nagios,
2008-01-01 00:00:00
Monitoring Dell and HP systems through SNMP is as big a puzzle as using SNMP on Sun Microsystems' boxen. Luckily I've come a long way into figuring out how to use Net-SNMP together with HP's SIM and Dell's OpenManage.
Just like with our Solaris boxen, we want to use the Net-SNMP daemon as the main daemon on our Linux systems. At $CLIENT we use Red Hat ES3 on a great variety of Dell and HP hardware. And as was the case with SUNWmasf on Solaris, we're going to need both Dell's and HP's custom SNMP agents to monitor our hardware-specific SNMP objects. Enter SIM and OpenManage. In the next few paragraphs I'll tell you all about installing and configuring the whole deal.
Naturally it would be great if you could package all of these files into one nice .RPM, since that'll make the whole installation process a snap. Especially if you want to roll it out across hundreds of servers. I'll be making such a package for $CLIENT, but unfortunately I cannot distribute it (which is logical, what with all the proprietary info that goes into the package). Maybe, some day I'll make a generic .RPM which you guys can use.
Just like everyone else HP also chooses to hide the installer for their SNMP agent quite deeply into their website. You will need to go to their download site and browse to the software section for your model of server. Once there you choose "Download drivers and software" and you pick your Linux flavour (in our case RHEL3). From there go to "Software - Systems management" where you can finally choose "A Collection of SNMP Protocol Tools from Net-SNMP for $YOUR_FLAVOUR". *phew* To help you get there, here's the direct link to the RHES3 version of the package.
As the file name (net-snmp-cmaX-5.1.2) suggests, this package is a modified version of the net-SNMP daemon which has added support for a whole bunch of Compaq and HP stuff. But as you can see the version of net-SNMP used is way behind today's standards, so it's wisest to use this daemon while proxied through a more current version of net-SNMP. The crappy thing though is that HP's package installs their net-SNMP in exactly the same location as our own net-SNMP. Don't worry, we'll get to that.
The download page doesn't make this immediately clear, but you'll need to download five (or six if you want the source) files. For your convenience, HP has decided to put all files into a pull-down menu, with one "Download" button. Yes, very handy indeed. =_= Another neat thing is that, for some reason, the combination Safari+Realplayer decides that -they- need to open the .RPM file that's loaded. Very odd and I've never encountered this before with other RPMs.
Because we're going to use two versions of net-SNMP that use the same locations on your hard drive, we're going to have to fiddle around a bit.
First copy these two RPMs to your system: net-snmp-cmaX and net-snmp-cmaX-libs. Install them using RPM, starting with libs and ending with the basic package. Now do the following.
$ cd /usr/sbin
$ sudo mv snmpd HPsnmpd
$ sudo mv snmptrapd HPsnmptrapd
$ cd /etc
$ sudo ln -s ./snmpd.conf ./HPsnmpd.conf
$ cd /etc/rc.d/init.d
$ sudo mv snmpd HPsnmpd
$ sudo mv snmptrapd HPsnmptrapd
$ cd /etc/logrotate.d
$ sudo mv snmpd HPsnmpd
You've now made sure that all parts that are required for the HP SNMP agent are safe from being overwritten by the "real" net-SNMP.
You can now install net-SNMP using the instructions laid out in the following paragraph.
PLEASE NOTE: If you're going to use HP SIM, please install that -first- before proceeding. See below for details.
Basically, recompiling Net-SNMP for your Linux install follows the same procedure as the recompilation on Solaris.
--with-mib-modules="host disman/event-mib ucd-snmp/diskio smux agentx disman/event-mib ucd-snmp/lmSensors" --with-perl-module
I had a hard time finding the installer files for Dell OM on Dell's download site, until I finally figured out how their "logic" works. :D You can get Dell OM 4.5 for Linux through this direct link (which can be changed at any time by Dell), or you can search their downloads page using the term "openmanage server agent". Adding the key word "linux" seems to confuse it though, so you're going to have to manually search through the list.
Unfortunately I never did get around to using Dell OpenManage, so I cannot give you the installation instructions ;_;
Configuring HP-SIM
The configuration file for HP's version of net-SNMP is stored in /etc/snmp, unlike the version that'll be used by our own net-SNMP. Edit HP's config file and remove all the current content. Replace it with the following:
rocommunity public 0.0.0.0
agentaddress 1162
pass .1.3.6.1.4.1.4413.4.1 /usr/bin/ucd5820stat
You will not have to make any further changes. The init-script and such can remain unchanged.
Again, unfortunately I cannot give you instructions on working with OpenManage since I ran out of time.
rocommunity public 0.0.0.0
agentaddress 1163
The configuration file for Net-SNMP is located in /usr/local/share/snmp. You will need to make a whole bunch of changes over here that I won't cover, like security ACLs, SNMP trap hosts and bunches of other stuff. However, you _will_ need to add the following lines to allow Net-SNMP to talk to HP SIM and/or OpenManage.
# Pass requests to HP SIM
proxy -c public localhost:1162 .1.3.6.1.4.1.232
# Pass requests to Dell OpenManage
proxy -c public localhost:1163 .1.3.6.1.4.1.674
Make sure that you start Net-SNMP before OpenManage or SIM. These sub-agents rely on Net-SNMP to be running, so that one needs to go first. Take care of this order using the RC scripts of your particular Linux flavour.
kilala.nl tags: sysadmin, unix, nagios,
2008-01-01 00:00:00
For some reason unknown to me Sun has always kept their MIB file rather closed and hard to find. There's no place you can actually download the file. You will have to extract the file from the SUNWmasf package if you want to take a look at it.
To help us sysadmins out I've published the file over here. I do not claim ownership of the file in any way. Sun has the sole copyright of the file. I just put it here, so people can easily read through the file.
kilala.nl tags: sysadmin, unix, nagios, solaris,
2008-01-01 00:00:00
I have to admit that figuring out how all the parts of SNMP on Sun stick together took me a little while. Just like when I was learning Nagios it took me about a week of mucking about to gain clarity. Now that I've figured it out, I thought I'd share it with you...
First off, everything I will describe over here depends on the availability of two pieces of software on your clients: Net-SNMP and SUNWmasf. See the article on combining the two for further details on installing and configuring this software.
We should begin by verifying that you can read from each of the important pieces of the SNMP tree. You can verify this by running the following three commands on your client system. Each should return a long list of names, numbers and values. Don't worry if it doesn't make sense yet.
snmpwalk -c public localhost .1.3.6.1.2.1.47
snmpwalk -c public localhost .1.3.6.1.4.1.42
snmpwalk -c public -m ALL localhost .1.3.6.1.4.1.2021.13
Incidentally you should also be able to access the same parts of the SNMP tree remotely (from your Nagios server, for example).
snmpwalk -c public $remote_client .1.3.6.1.2.1.47
snmpwalk -c public $remote_client .1.3.6.1.4.1.42
snmpwalk -c public -m ALL $remote_client .1.3.6.1.4.1.2021.13
Please keep in mind that you should replace the word "public" in all the examples with the community string that you've chosen for your SNMP agents. It could very well be something other than "public".
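To avoid retyping the community string, the three checks can be wrapped in a small function. This is only a sketch: the names check_subtrees, SNMP_COMMUNITY and SNMPWALK are my own inventions, and it assumes that snmpwalk exits with a non-zero status when the agent cannot be reached.

```shell
#!/bin/sh
# Sketch only: check_subtrees, SNMP_COMMUNITY and SNMPWALK are
# hypothetical names, not part of Net-SNMP.

# Override these from the environment if "public" is not your string.
SNMP_COMMUNITY=${SNMP_COMMUNITY:-public}
SNMPWALK=${SNMPWALK:-snmpwalk}

# check_subtrees HOST
# Walks the three sub-trees used in this article and reports which of
# them answered. Returns 0 if all three did, 1 otherwise.
check_subtrees() {
    cs_failed=0
    for cs_oid in .1.3.6.1.2.1.47 .1.3.6.1.4.1.42 .1.3.6.1.4.1.2021.13; do
        if "$SNMPWALK" -c "$SNMP_COMMUNITY" "$1" "$cs_oid" >/dev/null 2>&1; then
            echo "readable: $cs_oid"
        else
            echo "NOT readable: $cs_oid"
            cs_failed=1
        fi
    done
    return "$cs_failed"
}

# Possible usage:
#   SNMP_COMMUNITY=s3cret check_subtrees localhost
```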
Now that we've made sure that you can actually talk to your SNMP agent, it's time to figure out which components you want to find out about. The easy way to find out all components that are available to you is by running the following command.
snmpwalk -c public localhost .1.3.6.1.2.1.47.1.1.1.1.2
Let me explain what the output of this command really means... The SNMP sub-tree MIB-2.47.1.1.1.1 contains descriptive information about system-specific SNMP objects. Each object has a sub-object in the following sub-trees (each number follows after MIB-2.47.1.1.1.1).
Sub-OID | Description | Sub-OID | Description |
.1 | entPhysicalIndex | .9 | entPhysicalFirmwareRev |
.2 | entPhysicalDescr | .10 | entPhysicalSoftwareRev |
.3 | entPhysicalVendorType | .11 | entPhysicalSerialNum |
.4 | entPhysicalContainedIn | .12 | entPhysicalMfgName |
.5 | entPhysicalClass | .13 | entPhysicalModelName |
.6 | entPhysicalParentRelPos | .14 | entPhysicalAlias |
.7 | entPhysicalName | .15 | entPhysicalAssetID |
.8 | entPhysicalHardwareRev | .16 | entPhysicalIsFRU |
In this case all the sub-objects under .2 contain descriptions of the various components that are human readable. What you need to do now is go through the complete list of descriptions to pick those elements that you want to access remotely through SNMP. You will see that each entry has a number behind the .2. Each of these numbers is the unique component identifier within the system, meaning that we are lucky enough to have the same identifier within other parts of the SNMP tree.
$ snmpwalk -c public localhost .1.3.6.1.2.1.47.1.1.1.1.2 | grep Core
SNMPv2-SMI::mib-2.47.1.1.1.1.2.98 = STRING: "CPU 0 Core Temperature Monitor"
SNMPv2-SMI::mib-2.47.1.1.1.1.2.100 = STRING: "CPU 1 Core Temperature Monitor"
SNMPv2-SMI::mib-2.47.1.1.1.1.2.102 = STRING: "CPU 2 Core Temperature Monitor"
SNMPv2-SMI::mib-2.47.1.1.1.1.2.104 = STRING: "CPU 3 Core Temperature Monitor"
$ snmpwalk -c public localhost .1.3.6.1.2.1.47.1.1.1.1 | grep "\.98 ="
SNMPv2-SMI::mib-2.47.1.1.1.1.2.98 = STRING: "CPU 0 Core Temperature Monitor"
SNMPv2-SMI::mib-2.47.1.1.1.1.3.98 = OID: SNMPv2-SMI::zeroDotZero
SNMPv2-SMI::mib-2.47.1.1.1.1.4.98 = INTEGER: 94
SNMPv2-SMI::mib-2.47.1.1.1.1.5.98 = INTEGER: 8
SNMPv2-SMI::mib-2.47.1.1.1.1.6.98 = INTEGER: -1
SNMPv2-SMI::mib-2.47.1.1.1.1.7.98 = STRING: "040349/adbs04:CH/C0/P0/T_CORE"
SNMPv2-SMI::mib-2.47.1.1.1.1.8.98 = ""
SNMPv2-SMI::mib-2.47.1.1.1.1.9.98 = ""
SNMPv2-SMI::mib-2.47.1.1.1.1.10.98 = ""
SNMPv2-SMI::mib-2.47.1.1.1.1.11.98 = ""
SNMPv2-SMI::mib-2.47.1.1.1.1.12.98 = ""
SNMPv2-SMI::mib-2.47.1.1.1.1.13.98 = ""
SNMPv2-SMI::mib-2.47.1.1.1.1.14.98 = ""
SNMPv2-SMI::mib-2.47.1.1.1.1.15.98 = ""
SNMPv2-SMI::mib-2.47.1.1.1.1.16.98 = INTEGER: 2
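Looking up the identifier by hand gets old quickly, so here is a rough sketch of how to pull it out of the snmpwalk output. The function name component_index is made up, and the sed expression assumes the output looks exactly like the lines shown above (numeric index followed by = STRING: and a quoted description).

```shell
#!/bin/sh
# Sketch only: component_index is a hypothetical helper.

# component_index PATTERN
# Reads snmpwalk output for ...47.1.1.1.1.2 on stdin and prints
# "INDEX DESCRIPTION" for every line that matches PATTERN.
component_index() {
    grep -- "$1" |
        sed -n 's/^.*\.\([0-9][0-9]*\) = STRING: "\(.*\)"$/\1 \2/p'
}

# Possible usage:
#   snmpwalk -c public localhost .1.3.6.1.2.1.47.1.1.1.1.2 | component_index Core
```

Fed with the four "Core Temperature Monitor" lines from the example, this would print the numbers 98, 100, 102 and 104, each followed by its description.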
Aside from the fact that the sub-OID we have found for our object is used in other parts of the tree, there's another parameter that makes its return. The character string in .7 is reused in the SUN MIB as well, as you will see in a moment.
Let's see what happens when we take our sub-OID .98 to the SUN MIB tree...
$ snmpwalk -c public localhost .1.3.6.1.4.1.42.2.70.101.1.1 | grep "\.98 ="
SNMPv2-SMI::enterprises.42.2.70.101.1.1.2.1.1.98 = INTEGER: 2
SNMPv2-SMI::enterprises.42.2.70.101.1.1.2.1.2.98 = INTEGER: 2
SNMPv2-SMI::enterprises.42.2.70.101.1.1.2.1.3.98 = INTEGER: 7
SNMPv2-SMI::enterprises.42.2.70.101.1.1.2.1.4.98 = INTEGER: 2
SNMPv2-SMI::enterprises.42.2.70.101.1.1.2.1.5.98 = STRING: "040349/adbs04:CH/C0/P0"
SNMPv2-SMI::enterprises.42.2.70.101.1.1.6.1.1.98 = INTEGER: 2
SNMPv2-SMI::enterprises.42.2.70.101.1.1.6.1.2.98 = INTEGER: 3
SNMPv2-SMI::enterprises.42.2.70.101.1.1.6.1.3.98 = Gauge32: 60000
SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.1.98 = INTEGER: 3
SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.2.98 = INTEGER: 0
SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.3.98 = INTEGER: 1
SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.4.98 = INTEGER: 41
SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.5.98 = INTEGER: 0
SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.6.98 = INTEGER: 0
SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.7.98 = INTEGER: 0
SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.8.98 = INTEGER: 0
SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.9.98 = INTEGER: 97
SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.10.98 = INTEGER: -10
SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.11.98 = INTEGER: 102
SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.12.98 = INTEGER: -20
SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.13.98 = INTEGER: 120
SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.14.98 = Gauge32: 0
SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.15.98 = Hex-STRING: FC
SNMPv2-SMI::enterprises.42.2.70.101.1.1.8.1.16.98 = INTEGER: 1
Take a look at 2.1.5.98... Looks familiar? At least now you're sure that you're reading the right sub-object :) The list in the example above looks quite complicated, but there's a little help in the shape of a .PDF I once made. This .PDF shows the basic structure of the objects inside enterprises.42.2.70.101.1.1.
You should immediately notice though that the returns of the command are divided into three groups: ...101.1.1.2, ...101.1.1.6 and ...101.1.1.8. Matching these groups up to the .PDF you'll see that these groups are respectively sunPlatEquipmentTable (which is an expansion on the information from MIB-2), sunPlatSensorTable (which contains a description of the sensor in question) and sunPlatNumericSensorTable (which contains all kinds of real-life values pertaining to the sensor).
In this case the most interesting sub-OID is enterprises.42.2.70.101.1.1.8.1.4.98, sunPlatNumericSensorCurrent, which obviously contains the current value of the sensor readings. Putting things into perspective this means that the core temperature of CPU0 at the time of the snmpwalk was 41 degrees centigrade.
So... Now you know how to find out the following things:
You can now do loads of things! For example, you can use your monitoring software to verify that certain values don't exceed a set limit. You wouldn't want your CPUs to get hotter than 65 degrees, now would you?
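Such a limit check could look roughly like the sketch below. The function names snmp_integer and check_temp are hypothetical, the output text is just an example, and the code assumes snmpget or snmpwalk returns lines ending in "= INTEGER: n" as in the listings above.

```shell
#!/bin/sh
# Sketch only: snmp_integer and check_temp are made-up helpers.

# snmp_integer
# Reads lines like "... = INTEGER: 41" on stdin and prints the number.
snmp_integer() {
    sed -n 's/^.* = INTEGER: \(-\{0,1\}[0-9][0-9]*\)$/\1/p'
}

# check_temp CURRENT LIMIT
# Prints a one-line verdict and returns 2 if CURRENT exceeds LIMIT,
# otherwise 0.
check_temp() {
    if [ "$1" -gt "$2" ]; then
        echo "CRITICAL - temperature is $1 degrees (limit $2)"
        return 2
    fi
    echo "OK - temperature is $1 degrees (limit $2)"
    return 0
}

# Possible usage for the CPU0 core sensor (.98) from the example:
#   oid=.1.3.6.1.4.1.42.2.70.101.1.1.8.1.4.98
#   check_temp "$(snmpget -c public localhost "$oid" | snmp_integer)" 65
```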
kilala.nl tags: sysadmin, unix, nagios, solaris,
2008-01-01 00:00:00
In my mini-howto about monitoring Sun specific SNMP objects through Net-SNMP I referred to a few interesting objects which could be read through LM_Sensors.
Unfortunately I can currently only list details for two of the supported models, since I do not have test boxen for the other models. The following lists are only a small selection of all possible objects, namely the ones that we found interesting. A full list of available options can be obtained by running:
snmpwalk -c public -m ALL localhost .1.3.6.1.4.1.2021.13
Details about the structure of the various MIBs can be found in other articles in the Sysadmin section of my website. Just browse through the menu on the left. Point is that the lists below only list the OID _within_ the specific sub-trees (for example: .1.3.6.1.4.1.2021.13.16.5.1.2.9). As I said: details on actually _reading_ these values will be contained in another document.
Object | Description | Unit |
2.1.2.1 and .2 | CPU[0-1] Core temperature | Integer * |
2.1.2.3 | SYSTEM Enclosure temperature | Integer * |
5.1.2.2 | SYSTEM Service required indicator | Integer |
5.1.2.5 | PSU[0-1] Service required indicator | Degrees |
5.1.2.10 .12 .14 and .16 | HDD[0-3] Service required indicator | Integer |
5.1.2.18 | Keyswitch | Integer |
5.1.2.4 and .7 | PSU[0-1] Activity (power?) | Integer |
*: In order to get the real temperature, you will need to divide the integer contained within this variable by 65.526. For some odd reason Net-SNMP does not store the real temperature in degrees Centigrade.
Object | Description | Unit |
2.1.2.1 .2 .3 and .4 | CPU[0-3] Core temperature | Integer * |
2.1.2.5 .6 .7 and .8 | CPU[0-3] Ambient temperature | Integer * |
2.1.2.9 | SCSI temperature | Integer * |
.10 | MOBO temperature | Integer * |
.98 .100 .102 and .104 | CPU[0-3] Core temperature | Degrees |
.106 | MOBO temperature | Degrees |
.107 | SCSI temperature | Degrees |
5.1.2.2 | SYSTEM Service required indicator | Integer |
5.1.2.6 and .10 | PSU[0-1] Service required indicator | Integer |
5.1.2.12 .14 .16 and .18 | HDD[0-3] Service required indicator | Integer |
5.1.2.20 | Keyswitch | Integer |
5.1.2.4 and .8 | PSU[0-1] Power OK | Integer |
*: In order to get the real temperature, you will need to divide the integer contained within this variable by 65.526. For some odd reason Net-SNMP does not store the real temperature in degrees Centigrade.
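The division from the footnote is easy to script. A minimal sketch: the function name raw_to_celsius is made up, the divisor is simply the one quoted in the footnote, and rounding to one decimal is my own choice.

```shell
#!/bin/sh
# Sketch only: raw_to_celsius is a hypothetical helper.

# raw_to_celsius RAW
# Divides the raw LM_Sensors integer by 65.526 (the factor from the
# footnote) and prints the result with one decimal.
raw_to_celsius() {
    # LC_ALL=C keeps the decimal separator a dot in every locale.
    LC_ALL=C awk -v raw="$1" 'BEGIN { printf "%.1f\n", raw / 65.526 }'
}

# Possible usage:
#   raw_to_celsius 2687
```

A raw value of 2687 comes out as 41.0 degrees.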
kilala.nl tags: sysadmin, unix, nagios,
2008-01-01 00:00:00
In my mini-howto about monitoring Sun specific SNMP objects through Net-SNMP I referred to a few interesting objects which could be read through SUNWmasf.
Unfortunately I can currently only list details for two of the supported models, since I do not have test boxen for the other models. The following lists are only a small selection of all possible objects, namely the ones that we found interesting. A full list of available options can be obtained by running:
snmpwalk -c public localhost .1.3.6.1.2.1.47.1.1.1.1.2
Details about the structure of the various MIBs can be found in other articles in the Sysadmin section of my website. Just browse through the menu on the left. Point is that the lists below only list the OID _within_ the specific sub-trees (for example: .1.3.6.1.2.1.47.1.1.1.1.2.46). As I said: details on actually _reading_ these values will be contained in another document.
The possible values for service indicators (enterprises.42.2.70.101.1.1.12.1.2.$OID) are:
1 = unknown, 2 = off, 3 = on, 4 = alternating
The possible values for the keyswitch (enterprises.42.2.70.101.1.1.9.1.1.$OID) are:
1 = unknown, 2 = stand-by, 3 = normal, 4 = locked, 5 = diag
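In a monitoring script you'll probably want to translate these numbers back into words. A small sketch of such a lookup: the function names are hypothetical and the "invalid" fallback for values outside the lists is my own addition.

```shell
#!/bin/sh
# Sketch only: both function names are made up for this example.

# service_indicator_name VALUE
service_indicator_name() {
    case "$1" in
        1) echo unknown ;;
        2) echo off ;;
        3) echo on ;;
        4) echo alternating ;;
        *) echo invalid ;;   # not in the list above (assumption)
    esac
}

# keyswitch_name VALUE
keyswitch_name() {
    case "$1" in
        1) echo unknown ;;
        2) echo stand-by ;;
        3) echo normal ;;
        4) echo locked ;;
        5) echo diag ;;
        *) echo invalid ;;   # not in the list above (assumption)
    esac
}
```

A "service required" indicator that reads 3 (on) is the one you'd normally want to alert on.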
Object | Description | Unit |
.21 .23 .25 and .27 | HDD[0-3] Service required indicator | Integer |
.39 | SYSTEM Service required indicator | Integer |
.33 and .36 | PSU[0-1] Service required indicator | Integer |
.69 and .70 | CPU[0-1] Core temperature | Degrees |
.71 | SYSTEM Enclosure temperature | Degrees |
.99 and .100 | PSU[0-1] Over-temperature warning | Integer |
.81 .82 and .83 | SYSTEM Enclosure fan[0-2] tacho meter | Integer |
.84 .85 .86 and .87 | CPU[0-1] Fan[0-1] tacho meter | Integer |
.91 and .92 | PSU[0-1] Fan underspeed warning | Integer |
.31 and .34 | PSU[0-1] Active (power?) | Integer |
Object | Description | Unit |
.28 .30 .32 and .34 | HDD[0-3] Service Required indicator | Integer |
.37 and .41 | PSU[0-1] Service Required indicator | Integer |
.46 | SYSTEM Service Required indicator | Integer |
.43 | Keyswitch | Integer |
.98 .100 .102 and .104 | CPU[0-3] Core temperature | Degrees |
.106 | MOBO temperature | Degrees |
.107 | SCSI temperature | Degrees |
.131 and .132 | PSU[0-1] Predict fan fault | Integer |
.121 | PCIFAN tacho meter | Integer |
.122 and .123 | CPUFAN[0-1] tacho meter | Integer |
.36 and .40 | PSU[0-1] Power OK | Integer |
.124 .125 .126 and .127 | CPU[0-3] Power fault | Integer |
.128 | MOBO Power fault | Integer |
kilala.nl tags: sysadmin, unix, solaris, nagios,
2008-01-01 00:00:00
In some cases you're going to want to use Net-SNMP on your Solaris hosts, while still being able to monitor Sun-specific SNMP objects. It took me a while to get all of this to work and it's a bit of a puzzle, but here's how to make it work.
In our current environment at $CLIENT we want to standardise all of our UNIX hosts to the Net-SNMP agent software. This will allow us to use a configuration file which can be at least 60% identical on each host, making life just a little bit easier for all of us. Unfortunately Net-SNMP isn't equipped to deal with all of Sun's specific SNMP objects, so we're going to have to make a few big modifications to the software.
Of course packaging all these changes into one big .PKG is the nicest way of ensuring that all required changes are made in one blow, so that's what I've done. Unfortunately I cannot share this package with you, since it contains quite a large amount of $CLIENT internal information. I may be tempted at another time to recreate a non-$CLIENT version of the package that can be used elsewhere.
The latest versions of Net-SNMP come with experimental LM_Sensors support for Sun hardware. Oddly, I've found that you need to drop one version below the latest version to get it to work nicely with Solaris 8. So here are the steps to take...
--with-mib-modules="host disman/event-mib ucd-snmp/diskio smux agentx disman/event-mib ucd-snmp/lmSensors" --with-perl-module
/usr/bin/crle -c /var/ld/ld.config -l /lib:/usr/lib:/usr/local/lib:/usr/local/ssl/lib
PLEASE NOTE: SUNWmasf will currently (July of 2006) only get useful results on the following models: V210, V240, V250, V440, V1280, E2900, N210, N240, N440, N1280. On other systems you may have more luck using the LM_Sensors pieces of Net-SNMP. They have been tested to work on E450, V880 and 280R.
As I mentioned earlier Net-SNMP with LM_Sensors can only gather limited amounts of Sun specific information. That's besides the fact that it is also still an experimental feature. So we're going to need an alternative SNMP agent to gather more information for us. Enter the SUNWmasf package.
SUNWmasf and its components may be downloaded from the Sun Microsystems website. Either use this direct link (which may be subject to change), or go to www.sun.com/download and search for "Sun SNMP Management Agent".
You can opt to install SUNWmasf manually on each of your clients, but it would be much nicer to include it into your custom made package. To have a full list of all the files and symlinks that you should include, you can take a peek at the prototype file I made for the package. It includes all the files required for Net-SNMP.
Installation of the software couldn't be easier. Just run the following command, after extracting the .TAR.Z file that contains SUNWmasf.
pkgadd -d . SUNWescdl SUNWescfl SUNWeschl SUNWescnl SUNWescpl SUNWmasf SUNWmasfr
Go into /etc/opt/SUNWmasf/conf and replace the snmpd.conf file with the following:
rocommunity public
agentaddress 1161
agentuser daemon
agentgroup daemon
The configuration file for Net-SNMP is located in /usr/local/share/snmp. You will need to make a whole bunch of changes over here that I won't cover, like security ACLs, SNMP trap hosts and bunches of other stuff. However, you _will_ need to add the following lines to allow Net-SNMP to talk to SUNWmasf.
proxy -c public localhost:1161 .1.3.6.1.4.1.42
proxy -c public localhost:1161 .1.3.6.1.2.1.47
Since SUNWmasf relies upon Net-SNMP, it will need to be started after that piece of software. The prototype file I mentioned earlier already takes this into account, but if you're not going to use it just make sure that /etc/init.d/masfd gets called _after_ /etc/init.d/snmpd during the boot process.
Also, I've noticed that SUNWmasf will need about thirty seconds before it can be read using commands like snmpget and snmpwalk.
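If a script needs to read from SUNWmasf right after boot, it can simply retry for a while. A rough sketch of such a loop: the function name wait_for_agent is hypothetical, and it assumes that the probe command exits with a non-zero status for as long as the agent isn't answering yet.

```shell
#!/bin/sh
# Sketch only: wait_for_agent is a made-up helper.

# wait_for_agent TRIES DELAY COMMAND [ARGS...]
# Runs COMMAND up to TRIES times, sleeping DELAY seconds in between.
# Returns 0 as soon as COMMAND succeeds, 1 if it never does.
wait_for_agent() {
    wa_tries=$1
    wa_delay=$2
    shift 2
    while [ "$wa_tries" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        wa_tries=$((wa_tries - 1))
        if [ "$wa_tries" -gt 0 ]; then
            sleep "$wa_delay"
        fi
    done
    return 1
}

# Possible usage: six attempts, ten seconds apart, which comfortably
# covers the thirty seconds mentioned above.
#   wait_for_agent 6 10 snmpwalk -c public localhost .1.3.6.1.4.1.42
```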
As you may well know, SNMP is a tangly web of numerical identifiers. I will make a nice overview of the various useful OIDs that you can use for monitoring through both LM_Sensors and SUNWmasf. However, I will put these in a separate document, since it falls outside the scope of this mini-howto.
kilala.nl tags: sysadmin, unix, solaris, nagios,
2007-05-20 19:05:00
Well, I have finally unsubscribed myself from the Nagios mailing lists. It was great being a member of those lists while I was working with the software on a daily basis, but these days I've put Nagios behind me. I haven't written one line of Nagios monitoring code for months now.
I'm sure I'll also be skipping this year's Nagios Konferenz unless a job involving monitoring comes up again.
Thanks Ethan, for making such great software freely available! All the best to you and maybe we'll meet again o/
kilala.nl tags: nagios, unix, work, sysadmin,
2006-10-25 09:05:00
Many thanks to my colleague Guldan who pointed me towards a website giving a short description of using the BSD hardware-sensors daemon, together with Nagios in order to monitor your hardware. Using sensord should make things a lot easier for people running BSD, as they won't have to muck about with SNMP OIDs and so on.
kilala.nl tags: work, nagios, unix, sysadmin,
2006-10-03 23:31:00
This goes to show that the proverb above is right: Joerg Linge, whom I met at NagKon 2006, just e-mailed me. He mentioned that right around the same time we had both come up with a similar solution to one problem.
The problem: use Nagios plugins through a normal SNMP daemon.
Our solutions were identical when it came to configuring the daemon, but differed slightly when it came to getting the information from the client. The approach is the same, but while he uses Perl for the plugin, I use Bash ^_^
Life's little coincidences :)
Joerg's solution and write-up.
Anywho... Joerg's a cool guy :) Go check out his website and have a look around.
kilala.nl tags: work, nagios, sysadmin,
2006-09-24 09:04:00
So I made it back home in one piece. My trip back took me around 7.5 hours, which was mostly due to me driving a little bit faster :p
I have to say that the A45 route up north is much less glamorous than the A3 :( The Rasthöfe all look much older and less fancy than the ones on the A3. Ah, but they sufficed anyway...
I'm thinking of moving my summaries from the previous blog posts into one big page in the Sysadmin section. Reckon that should prevent Google from raising the Archives above the Sysadmin section when it comes to Nagios.
/me starts immediately.
kilala.nl tags: work, nagios, website, conference,
2006-09-22 23:27:00
< moved to Sysadmin section, to keep Google from messing up >
2006-09-21 17:10:00
Astounding by the way, the amount of Apple laptops I see around here. Less than at SANE'06, but still, around 35%. o/
2006-09-21 17:01:00
< moved to Sysadmin section, to keep Google from messing up >
2006-09-21 14:19:00
For the conference I had Snow buy me the iMic and a nice Philips microphone. For now though, I'm not completely happy with the setup.
* The mic is omnidirectional and thus doesn't pick up much of what the person out in front is saying, while it does pick up quite a lot of noise from the room.
* iMic is a USB device and it seems that it claims enough CPU resources to mess with the rest of my system :(
Lunch was nice though! <3
2006-09-20 23:21:00
< moved to Sysadmin section, to keep Google from messing up >
2006-09-19 21:13:00
The next few days I'll be in Germania... Nurnberg, to be precise.
Together with around eighty other Nagios administrators and experts I'll be attending the first, annual Nagios Conference. Over the course of two days, we'll get a chance to meet up together, exchange ideas and generally have a go at improving both Nagios and our knowledge of the software. I'm looking forward to it quite a lot.
Maybe I'll even meet up with a few of the mailing list members :) I'll bring the camera and I'll try to snap a few quick pics.
kilala.nl tags: work, snow, nagios,
2006-07-27 13:01:00
I've added a small comparison between the various ways in which your Nagios server can communicate with its clients. It's in the menu on the left, or you can go there directly.
kilala.nl tags: work, unix, nagios, sysadmin,
2006-07-26 16:25:00
After digging through Sun's MIB description (see SUN-PLATFORM-MIB.txt) it became clear to me that things are a lot more convoluted than I originally expected. For example, each sensor in the Sun Fire systems leads to at least five objects, each describing another aspect of the sensor (name, value, expected value, unit, and so on). Unfortunately Sun has no (public) description of all possible SNMP sensor objects so I've come to the following two conclusions:
1. I'll figure it all out myself. For each model that we're using I'll weasel out every possible sensor and all information relevant to these sensors.
2. I'll have to write my own check script for Nagios which deals with all the various permutations of sensor arrays in an appropriate fashion. Joy...
EDIT:
For your reference, Sun has released the following documents that pertain to their SNMP implementation. Mostly they're a slight expansion on the info from the MIB. At least they're much easier on the eyes when reading :p
* 817-2559-13
* 817-6832-10
* 817-6238-10
* 817-3000-10
kilala.nl tags: unix, work, nagios, sysadmin,
2006-07-25 09:34:00
Right now I'm working on getting my Sun systems properly monitored through SNMP. Using the LM_sensors module for Net-SNMP has gotten me quite far, but there's one drawback. A lot of Sun's internal counters use some really odd values that don't speak for themselves. This makes it necessary to read through Sun's own MIB and correlate the data in there with the stuff from LM_sensors.
Point is, Sun isn't very forthcoming with their MIB even though it should probably be public knowledge. Nowhere on the web can I find a copy of the file. The only way to get it is by extracting it from Sun's free SUNWmasfr package, which I have done: here's SUN-PLATFORM-MIB.txt
In no way am I claiming this file to be a product of mine and it definitely has Sun's copyright on it. I just thought I'd make the file a -little- bit more accessible through the Internet. If Sun objects, I'm sure they'll tell me :3
kilala.nl tags: unix, work, nagios, sysadmin,
2006-06-19 15:11:00
Both check_log2 and check_log3 have been thoroughly debugged today. Finally. Thanks to both Kyle Tucker and Ali Khan for pointing out the mistakes I'd made. I also finally learned the importance of proper testing tools, so I wrote test_log2 and test_log3 which run the respective check scripts through all the possible states they can encounter.
Oh... check_ram was also -finally- modified to take the WARN and CRIT percentages through the command line. Shame on me for not doing that earlier.
kilala.nl tags: work, unix, nagios, sysadmin,
2006-06-01 14:53:00
Today I made an improved version of the Nagios monitor "check_log2", which is now aptly called "check_log3". Version 3 of this script gives you the option to add a second query to the monitor. The previous two incarnations of the script only allowed you to search for one query and would return a Critical if it was found. Now you can also add a query which will result in a Warning message. Goody!
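To illustrate the two-query idea, here is a minimal bash sketch. It is not the real check_log3: the function name check_log_sketch and its argument order are made up for this example, and it simply greps the whole file on every run instead of keeping track of what was already seen. It only shows the precedence: the Critical query wins over the Warning query.

```shell
#!/bin/bash
# Hypothetical sketch of a two-pattern log check; NOT the real check_log3.
# Return codes follow the usual Nagios convention:
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.

check_log_sketch() {
    local logfile="$1" crit_pattern="$2" warn_pattern="$3"

    if [ ! -r "$logfile" ]; then
        echo "UNKNOWN - cannot read $logfile"
        return 3
    fi

    # The critical query takes precedence when both patterns are present.
    if grep -q -- "$crit_pattern" "$logfile"; then
        echo "CRITICAL - found \"$crit_pattern\" in $logfile"
        return 2
    fi

    if grep -q -- "$warn_pattern" "$logfile"; then
        echo "WARNING - found \"$warn_pattern\" in $logfile"
        return 1
    fi

    echo "OK - no matches in $logfile"
    return 0
}
```

Called as check_log_sketch /var/log/messages "panic" "timeout", it would return 2 as soon as "panic" shows up, even if "timeout" is in there as well.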
kilala.nl tags: work, unix, nagios, sysadmin,
2006-06-01 00:00:00
After reading through my small write-up on Nagios clients on UNIX you may also be interested in the same story for Windows systems.
Since Nagios was originally written with UNIX systems in mind, it'll be a little bit trickier to get the same amount of information from a Windows box. Luckily there are a few tools available that will help you along the way.
For a quick introduction to the Nagios clients, read the write-up linked above. Or pick it from the menu on the left.
| | NSClient | NRPEnt | NSClient++ | SNMP | SNMP traps | NC_net |
| Connection | Srv -> Clnt | Srv -> Clnt | Srv -> Clnt | Srv -> Clnt | Clnt -> Srv | Clnt -> Srv |
| Security | Password | Password | Password | Access List | Access List | Encryption |
| Configuration | On client | On client | On client | On client | On client and | On client |
| Difficulty | Moderate | Moderate | Moderate | Hard | Hard | Moderate |
| Resource | unknown | unknown | 9MB RAM | unknown | unknown | 30MB RAM |
| Available | | | | | | |
*: Thanks to Jeronimo Zucco for pointing out that encryption in NSClient++ only works when used with the NRPE DLL.
**: Thanks to Anthony Montibello for pointing out recent changes to NC_Net, which is now at version 3.
***: Thanks to Kyle Hasegawa for providing me with resource usage info on the various clients.
NSClient was originally written to work with Nagios when it was still called NetSaint: a long, long time ago. NSClient only provides you with access to a very small number of system metrics, including those that are usually available through the Windows Performance Tool.
Personally I have no love for this tool since it is quite fidgety to use. In order to use NSClient on your systems, you will need to do the following.
You can now set up your services.cfg in such a way that each remote service is checked like so:
define service{
host_name remote-host
service_description D_ROOT
check_command check_nt_disk!C!85!95
}
Your check command definition would look something like this:
define command {
command_name check_nt_disk
command_line /usr/local/nagios/libexec/check_nt -H $HOSTADDRESS$ -p 1248 -v USEDDISKSPACE -l $ARG1$ -w $ARG2$ -c $ARG3$
}
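In case the exclamation marks in check_command look odd: Nagios splits the value on "!" and hands the pieces to the command definition as $ARG1$, $ARG2$ and so on. The little bash sketch below mimics that split. The helper split_check_command is purely hypothetical (Nagios does this internally), it is only meant to show which value ends up in which macro.

```shell
#!/bin/bash
# Sketch: show how "check_nt_disk!C!85!95" breaks down into
# the command name and $ARG1$..$ARG3$.
# split_check_command is a hypothetical helper, not part of Nagios.

split_check_command() {
    local IFS='!'          # split on exclamation marks only
    local parts i
    read -r -a parts <<< "$1"

    echo "command_name=${parts[0]}"
    for ((i = 1; i < ${#parts[@]}; i++)); do
        echo "ARG$i=${parts[$i]}"
    done
}

split_check_command 'check_nt_disk!C!85!95'
```

So for the service above, the drive letter C lands in $ARG1$ (the -l option), 85 in $ARG2$ (warning) and 95 in $ARG3$ (critical).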
NRPEnt is basically a drop-in replacement for NRPE on Windows. It really does work the same way: on the Nagios server you run check_nrpe and on the Windows side you have plugins to run locally. These plugins can be binaries, Perl scripts, VBScript, .BAT files, whatever.
To set things up, you'll need the same things as with the normal NRPE.
You can now set up your services.cfg in such a way that each remote service is checked like so:
define service{
host_name remote-host
service_description D_ROOT
check_command check_nrpe!check_root
}
And in nrpent.cfg on the client you would need to include:
command[check_root]=C:\windows\system32\cscript.exe //NoLogo //T:10 c:\nrpe_nt\check_disk.wsf /drive:"c:/" /w:300 /c:100
Due to the limited use provided by NSClient, someone decided to create NSClient++. This piece of software is a lot more useful because it actually combines the functionality of the original NSClient and that of NRPEnt into one Windows daemon.
NSClient++ includes the same security measures as NRPEnt and NSClient, but adds an ACL functionality on top of that.
On the configuration side things are basically the same as with NSClient and NRPEnt. You can use both methods to talk to a client running NSClient++.
Unfortunately I haven't yet worked with SNMP on Windows systems, so I can't tell you much about this. I'm sure though that things won't be much different from the UNIX side. So please check the Nagios UNIX clients story for the full details.
To make proper use of monitoring through SNMP you'll need to:
Unfortunately the check_snmp script that comes with Nagios isn't flexible enough to let you monitor custom SNMP objects in a nice way. This is why I wrote the retrieve_custom_snmp script, which is available from the menu. Your service definition would look like this:
define service{
host_name remote-host
service_description D_ROOT
check_command retrieve_custom_snmp!.1.3.6.1.4.1.6886.4.1.4
}
As I said, I haven't configured a Windows SNMP daemon before, so I really can't tell you what the config would look like. Just look for options similar to "EXEC", which allows you to run a certain command on demand.
Just as is the case with UNIX systems you will need to dig around the MIB files provided to you by Microsoft and your hardware vendors to find the OIDs for interesting metrics. It's not an easy job, but with some luck you'll find a website where someone's already done the hard work for you :)
SNMP doesn't involve polling alone. SNMP enabled devices can also be configured to automatically send status updates to a so-called trap host. The downside to receiving SNMP traps with Nagios is that it takes quite a lot of work to get them into Nagios :D
To make proper use of monitoring through SNMP traps you'll need to:
There are -many- ways to get the SNMP traps translated for Nagios' purposes, 'cause there's many roads that lead to Rome. Unfortunately none of them are very easy to use.
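Whichever road you take, the last step usually boils down to handing Nagios a passive check result through its external command file. Below is a minimal bash sketch of that last step only. The helper name format_passive_result, the service name TRAP_LINK and the command file path in the comment are assumptions for the sake of the example; the PROCESS_SERVICE_CHECK_RESULT line format itself is the standard Nagios external command.

```shell
#!/bin/bash
# Sketch: format a passive service check result for the Nagios
# external command file. format_passive_result is a hypothetical helper.
#
# Line format:
#   [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output

format_passive_result() {
    local host="$1" service="$2" code="$3" output="$4"
    # An optional fifth argument overrides the timestamp (handy for testing).
    local now="${5:-$(date +%s)}"

    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$now" "$host" "$service" "$code" "$output"
}

# A trap handler would append the line to the command file, for example
# (the path depends on your installation):
#   format_passive_result remote-host TRAP_LINK 2 "linkDown received" \
#       >> /usr/local/nagios/var/rw/nagios.cmd
format_passive_result remote-host TRAP_LINK 2 "linkDown received" 1160000000
```

The hard part, translating the raw trap into a host, a service and a return code, is exactly the bit that takes all the work and is not covered here.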
NC_net is another replacement for the original NSClient daemon. It performs the same basic checks, plus a few additional ones, but it is not extendable with your own scripts (like NRPEnt is).
So why run NC_net instead of NSClient++? Because it is capable of sending passive check results to your Nagios server using a send_nsca-alike method. So if you're going all the way in passifying all your service checks, then NC_net is the way to go.
I haven't worked with NC_net yet, so I can't tell you anything about how it works. Too bad :(
UPDATE 31/10/2006:
I was informed by Marlo Bell of the Nagios mailing list that NC_net version 3.x does indeed allow running your own scripts and calling them through the NRPEnt interface! That's great to know, as it does in fact make NC_net the most versatile solution for running Nagios on your Windows.
Also, Anthony Montibello (lead NC_Net dev) tells me that NC_Net 3 requires dotNET 2.0.
kilala.nl tags: tutorial, sysadmin, nagios, windows,
2006-06-01 00:00:00
This script was written while I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.
One of the things we've been looking into recently, is running the standard Nagios plugins through SNMP instead of through NRPE. Putting aside the discussion of the various merits and flaws such a solution has, let's say that it works nicely.
How do you do this?
In your snmpd.conf add a line like:
exec .1.3.6.1.4.1.6886.4.1.1 check_load /usr/local/nagios/libexec/check_load
exec .1.3.6.1.4.1.6886.4.1.2 check_mem /usr/local/nagios/libexec/check_mem -w 85 -c 95
exec .1.3.6.1.4.1.6886.4.1.3 check_swap /usr/local/nagios/libexec/check_swap -w 15% -c 5%
What this does, is tell the SNMP daemon to run the check_load script when someone asks for object .1.3.6.1.4.1.6886.4.1.1 (or .2, or .3). The exit code for the script will be placed in OID.100.1 and the first line of output will be placed in OID.101.1. This script retrieves those two values through SNMP and returns them to Nagios.
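To make the OID bookkeeping a bit more concrete, here is a small bash sketch that derives the two result OIDs from the base OID you put in snmpd.conf. The helper exec_result_oids is hypothetical and only does string handling; the snmpget calls in the comments use the same flags as the script further down, with "public" and "remote-host" as placeholder values.

```shell
#!/bin/bash
# Sketch: derive the result OIDs for an snmpd "exec" entry.
# exec_result_oids is a hypothetical helper for illustration only.

exec_result_oids() {
    local base="${1%.}"    # strip a trailing dot, if any

    echo "status=${base}.100.1"    # exit code of the plugin
    echo "output=${base}.101.1"    # first line of plugin output
}

# Manual retrieval would then look something like:
#   snmpget -Oqv -v 2c -c public remote-host .1.3.6.1.4.1.6886.4.1.1.100.1
#   snmpget -Oqv -v 2c -c public remote-host .1.3.6.1.4.1.6886.4.1.1.101.1
exec_result_oids .1.3.6.1.4.1.6886.4.1.1
```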
Your checkcommands.cfg should contain something like:
define command{
command_name retrieve_custom_snmp
command_line $USER1$/retrieve_custom_snmp -H $HOSTADDRESS$ -o $ARG1$
}
The "-o" parameter takes the OID you have selected for your custom check.
Now... How do you select an OID? There's two ways:
1. The WRONG way = randomly selecting some OID. You might pick an OID which is needed for other monitoring purposes in your network.
2. The RIGHT way = requesting a private Enterprise ID for your company at IANA. You are free to build an SNMP tree beneath this EID. For example, the EID 6886 mentioned above is registered to KPN (my current client). The sub-tree .4.1 contains all OIDs referring to Nagios checks performed by my department.
Before sending out that request, please check the current EID list to see if your company already owns a private subtree. If that's the case, contact the "owner" to request your own part of the subtree.
UPDATE (2006-10-02):
Thanks to the kind folks on the Nagios Users ML I've found out that my original version of the script was totally bug-ridden. I've made a big bunch of adjustments and now the script should work properly. Thanks especially to Andreas Ericsson.
#!/bin/bash
#
# Script to retrieve custom SNMP objects set using the "exec" handler
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide, the Netherlands
# Last Modified: 18-07-2006
#
# Usage: ./retrieve_custom_snmp
#
# Description:
#   On our Nagios client systems we use a lot of custom MIB OIDs which are
#   registered under our own Enterprise ID. A whole bunch of the
#   original Nagios scripts are run through the SNMP daemon and their exit
#   codes and output are appended to specific OIDs. This all happens using the
#   SNMP "exec" handler.
#   Unfortunately the default check_snmp script doesn't allow for easy
#   handling of these objects, so I hacked together a quick script.
#
#   So basically this script doesn't do any checking. It just retrieves
#   information :)
#
# Limitations:
#   This script should work properly on all implementations of Linux, Solaris
#   and Mac OS X.
#
# Output:
#   The exit code is the exit code retrieved from OID.100.1. It is temporarily
#   stored in $EXITCODE.
#   The output string is the string retrieved from OID.101.1. It is
#   temporarily stored in $OUTPUT.
#
# Other notes:
#   If you ever run into problems with the script, set the DEBUG variable
#   to 1. I'll need the output the script generates to do troubleshooting.
#   See below for details.
#   I realise that all the debugging commands strewn throughout the script
#   may make things a little harder to read. But in the end I'm sure it was
#   well worth adding them. It makes troubleshooting so much easier. :3
#   Also, for some reason the case statement with the shifts (to detect
#   passed options) doesn't seem to be working right. FIXME!
#
# Check command definition:
#   define command{
#       command_name retrieve_custom_snmp
#       command_line $USER1$/retrieve_custom_snmp -H $HOSTADDRESS$ -o $ARG1$
#   }
#

# You may have to change this, depending on where you installed your
# Nagios plugins
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh
PROGNAME="retrieve_custom_snmp"
COMMUNITY="public"

[ `uname` == "SunOS" ] && SNMPGET="/usr/local/bin/snmpget -Oqv -v 2c -c $COMMUNITY"
[ `uname` == "Darwin" ] && SNMPGET="/usr/bin/snmpget -Oqv -v 2c -c $COMMUNITY"
[ `uname` == "Linux" ] && SNMPGET="/usr/bin/snmpget -Oqv -v 2c -c $COMMUNITY"

### DEBUGGING SETUP ###
# Cause you never know when you'll need to squash a bug or two
DEBUG="0"

if [ $DEBUG -gt 0 ]
then
    DEBUGFILE="/tmp/foobar"
    rm $DEBUGFILE >/dev/null 2>&1
fi

### REQUISITE NAGIOS COMMAND LINE STUFF ###

print_usage() {
    echo "Usage: $PROGNAME -H hostname -o OID"
    echo "Usage: $PROGNAME --help"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "Script to retrieve the status for custom SNMP objects."
    echo ""
    echo "This plugin is not developed by the Nagios Plugin group."
    echo "Please do not e-mail them for support on this plugin, since"
    echo "they won't know what you're talking about :P"
    echo ""
    echo "For contact info, read the plugin itself..."
}

while test -n "$1"; do
    case "$1" in
        --help)
            print_help
            exit $STATE_OK
            ;;
        -h)
            print_help
            exit $STATE_OK
            ;;
        -H)
            HOST=$2
            shift
            ;;
        -o)
            OID=$2
            STATUS="$OID.100.1"
            STRING="$OID.101.1"
            shift
            ;;
        *)
            echo "Unknown argument: $1"
            print_usage
            exit $STATE_UNKNOWN
            ;;
    esac
    shift
done

### FINALLY... RETRIEVING THE VALUES ###

EXITCODE=`$SNMPGET $HOST $STATUS`
[ $DEBUG -gt 0 ] && echo "Retrieve exit code is $EXITCODE" >> $DEBUGFILE

OUTPUT=`$SNMPGET $HOST $STRING | sed 's/"//g'`
[ $DEBUG -gt 0 ] && echo "Retrieve status message is: $OUTPUT" >> $DEBUGFILE

echo $OUTPUT
exit $EXITCODE
kilala.nl tags: nagios, unix, programming,
2006-06-01 00:00:00
This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.
A few of our projects and services are run on Solaris systems running Sun Cluster software. Since there were no Nagios scripts available to perform checks against Sun Cluster I made a basic script that checks the most important factors.
This script performs a different function, depending on the parameter with which it is called. This allows you to define multiple service checks in Nagios, without needing separate check scripts for each.
EDIT:
Oh! Just like my other recent Nagios scripts, check_suncluster comes with a debugging option. Set $DEBUG at the top of the file to anything larger than zero and the script will dump information at various stages of its execution. And like my other, recent scripts it also comes with its own test script.
#!/usr/bin/ksh
#
# Nagios check script for Sun Cluster.
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide SYS, the Netherlands
# Last Modified: 25-09-2006
#
# Usage: ./check_suncluster [-t, -q, -g, -G resource-group, -r, -R resource, -i]
#
# Description:
#   This script is capable of performing a number of basic checks on a
#   system running Sun Cluster. Depending on the parameter you pass to
#   it, it will check:
#   * Transport paths (-t).
#   * Quorum (-q).
#   * Resource groups (-g).
#   * One selected resource group (-G).
#   * Resources (-r).
#   * One selected resource (-R).
#   * IPMP groups (-i).
#
# Limitations:
#   This script will only work with Korn shell, due to some funky while
#   looping with pipe forking. Bash doesn't handle this very gracefully,
#   due to its sub-shell variable scoping. Maybe I really should learn
#   to program in Perl.
#
# Output:
#   * Transport paths return a WARN when one of the paths is down and a
#     CRIT when all paths are offline.
#   * Quorum returns a WARN when not all, but enough quorum devices are
#     available. It returns a CRIT when quorum cannot be reached.
#   * Resource groups returns a CRIT when a group is offline on all nodes
#     and a WARN if a group is in an unstable state.
#   * Resources returns a CRIT when a resource is offline on all nodes
#     and a WARN if a resource is in an unstable state.
#   * IPMP groups returns a CRIT when a group is offline.
#
# Other notes:
#   Aside from the debugging output that I've built into most of my recent
#   scripts, this check script will also have a testing mode hacked on, as
#   a bag on the side. This testing mode is only engaged when the
#   test_check_suncluster script is being run and will intentionally "break"
#   a few things, to verify the failure options of this check script.
#
# Enabling the following dumps information into DEBUGFILE at various
# stages during the execution of this script.
DEBUG=0
DEBUGFILE="/tmp/foobar"

if [ -f /tmp/neko-wa-baka ]
then
    if [ `cat /tmp/neko-wa-baka` == "Nyo!" ]
    then
        TESTING="1"
    else
        TESTING="0"
    fi
else
    TESTING="0"
fi

### REQUISITE NAGIOS USER INTERFACE STUFF ###

# You may have to change this, depending on where you installed your
# Nagios plugins
PATH="/usr/bin:/usr/sbin:/bin:/sbin:/usr/cluster/bin"
LIBEXEC="/usr/local/nagios/libexec"
PROGNAME="check_suncluster"
. $LIBEXEC/utils.sh

[ $DEBUG -gt 0 ] && rm $DEBUGFILE

print_usage() {
    echo "Usage: $PROGNAME [-t, -q, -g, -G resource-group, -r, -R resource, -i]"
    echo "Usage: $PROGNAME --help"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "Sun Cluster check plugin for Nagios"
    echo ""
    echo "-t: check transport paths"
    echo "-q: check quorum"
    echo "-g: check resource groups"
    echo "-G: check one individual resource group"
    echo "-r: check all resources"
    echo "-R: check one individual resource"
    echo "-i: check IPMP groups"
    echo ""
    echo "This plugin is not developed by the Nagios Plugin group."
    echo "Please do not e-mail them for support on this plugin, since"
    echo "they won't know what you're talking about :P"
    echo ""
    echo "For contact info, read the plugin itself..."
}

### SUB-ROUTINE DEFINITIONS ###

function check_transport_paths {
    [ $DEBUG -gt 0 ] && echo "Starting check_transport_path subroutine." >> $DEBUGFILE
    TOTAL=`scstat -W | grep "Transport path:" | wc -l`
    let COUNT=0

    # The loop variable is called TPATH, since reading into PATH would
    # clobber the command search path.
    scstat -W | grep "Transport path:" | awk '{print $3" "$6}' | while read TPATH STATUS
    do
        [ $DEBUG -gt 0 ] && echo "Before math, Count has the value of $COUNT." >> $DEBUGFILE
        if [ $STATUS == "online" ]
        then
            let COUNT=$COUNT+1
        fi
        [ $DEBUG -gt 0 ] && echo "Path: $TPATH has status $STATUS" >> $DEBUGFILE
        [ $DEBUG -gt 0 ] && echo "Count: $COUNT online transport paths." >> $DEBUGFILE
    done

    [ $DEBUG -gt 0 ] && echo "Count: Outside the loop it has a value of $COUNT." >> $DEBUGFILE
    [ $TESTING -gt 0 ] && COUNT="0"

    if [ $COUNT -lt 1 ]
    then
        echo "NOK - No transport paths online."
        exit $STATE_CRITICAL
    elif [ $COUNT -lt $TOTAL ]
    then
        echo "NOK - One or more transport paths offline."
        exit $STATE_WARNING
    fi
}

function check_quorum {
    [ $DEBUG -gt 0 ] && echo "Starting check_quorum subroutine." >> $DEBUGFILE
    NEED=`scstat -q | grep "votes needed:" | awk '{print $4}'`
    PRES=`scstat -q | grep "votes present:" | awk '{print $4}'`
    [ $DEBUG -gt 0 ] && echo "Quorum needed: $NEED" >> $DEBUGFILE
    [ $DEBUG -gt 0 ] && echo "Quorum present: $PRES" >> $DEBUGFILE
    [ $TESTING -gt 0 ] && PRES="0"

    if [ $PRES -ge $NEED ]
    then
        [ $DEBUG -gt 0 ] && echo "Enough quorum votes." >> $DEBUGFILE
        scstat -q | grep "votes:" | awk '{print $3" "$6}' | while read VOTE STATUS
        do
            [ $DEBUG -gt 0 ] && echo "Vote: $VOTE has status $STATUS." >> $DEBUGFILE
            if [ $STATUS != "Online" ]
            then
                echo "NOK - Quorum vote $VOTE not available."
                exit $STATE_WARNING
            fi
        done
    else
        [ $DEBUG -gt 0 ] && echo "Not enough quorum." >> $DEBUGFILE
        echo "NOK - Not enough quorum votes present."
        exit $STATE_CRITICAL
    fi
}

function check_resource_groups {
    [ $DEBUG -gt 0 ] && echo "Starting check_resource_groups subroutine." >> $DEBUGFILE

    scstat -g | grep "Group:" | awk '{print $2}' | sort -u | while read GROUP
    do
        ONLINE=`scstat -g | grep "Group: $GROUP" | grep "Online" | wc -l`
        WEIRD=`scstat -g | grep "Group: $GROUP" | grep -v "Resources" | grep -v "Online" | grep -v "Offline" | wc -l`
        [ $DEBUG -gt 0 ] && echo "Resource Group $GROUP has $ONLINE instances online." >> $DEBUGFILE
        [ $DEBUG -gt 0 ] && echo "Resource Group $GROUP has $WEIRD instances in a weird state." >> $DEBUGFILE
        [ $TESTING -gt 0 ] && ONLINE="0"

        if [ $ONLINE -lt 1 ]
        then
            echo "NOK - Resource group $GROUP not online."
            exit $STATE_CRITICAL
        fi

        if [ $WEIRD -gt 1 ]
        then
            echo "NOK - Resource group $GROUP is in an unstable state."
            exit $STATE_WARNING
        fi
    done
}

function check_resource_grp {
    [ $DEBUG -gt 0 ] && echo "Starting check_resource_grp subroutine." >> $DEBUGFILE
    [ $DEBUG -gt 0 ] && echo "Selected group: $RGROUP" >> $DEBUGFILE

    ONLINE=`scstat -g | grep $RGROUP | grep "Online" | wc -l`
    WEIRD=`scstat -g | grep $RGROUP | grep -v "Resources" | grep -v "Online" | grep -v "Offline" | wc -l`
    [ $DEBUG -gt 0 ] && echo "Resource Group $RGROUP has $ONLINE instances online." >> $DEBUGFILE
    [ $DEBUG -gt 0 ] && echo "Resource Group $RGROUP has $WEIRD instances in a weird state." >> $DEBUGFILE
    [ $TESTING -gt 0 ] && ONLINE="0"

    if [ $ONLINE -lt 1 ]
    then
        echo "NOK - Resource group $RGROUP not online."
        exit $STATE_CRITICAL
    fi

    if [ $WEIRD -gt 1 ]
    then
        echo "NOK - Resource group $RGROUP is in an unstable state."
        exit $STATE_WARNING
    fi
}

function check_resources {
    [ $DEBUG -gt 0 ] && echo "Starting check_resources subroutine." >> $DEBUGFILE
    RESOURCES=`scstat -g | grep "Resource:" | awk '{print $2}' | sort -u`
    [ $DEBUG -gt 0 ] && echo "List of resources to check: $RESOURCES" >> $DEBUGFILE

    for RESOURCE in `echo $RESOURCES`
    do
        ONLINE=`scstat -g | grep "Resource: $RESOURCE" | awk '{print $4}' | grep "Online" | wc -l`
        WEIRD=`scstat -g | grep "Resource: $RESOURCE" | awk '{print $4}' | grep -v "Online" | grep -v "Offline" | wc -l`
        [ $DEBUG -gt 0 ] && echo "Resource $RESOURCE has $ONLINE instances online." >> $DEBUGFILE
        [ $DEBUG -gt 0 ] && echo "Resource $RESOURCE has $WEIRD instances in a weird state." >> $DEBUGFILE
        [ $TESTING -gt 0 ] && ONLINE="0"

        if [ $ONLINE -lt 1 ]
        then
            echo "NOK - Resource $RESOURCE not online."
            exit $STATE_CRITICAL
        fi

        if [ $WEIRD -gt 1 ]
        then
            echo "NOK - Resource $RESOURCE is in an unstable state."
            exit $STATE_WARNING
        fi
    done
}

function check_rsrce {
    [ $DEBUG -gt 0 ] && echo "Starting check_rsrce subroutine." >> $DEBUGFILE
    [ $DEBUG -gt 0 ] && echo "Selected resource: $RSRCE" >> $DEBUGFILE

    ONLINE=`scstat -g | grep "Resource: $RSRCE" | awk '{print $4}' | grep "Online" | wc -l`
    WEIRD=`scstat -g | grep "Resource: $RSRCE" | awk '{print $4}' | grep -v "Online" | grep -v "Offline" | wc -l`
    [ $DEBUG -gt 0 ] && echo "Resource $RSRCE has $ONLINE instances online." >> $DEBUGFILE
    [ $DEBUG -gt 0 ] && echo "Resource $RSRCE has $WEIRD instances in a weird state." >> $DEBUGFILE
    [ $TESTING -gt 0 ] && ONLINE="0"

    if [ $ONLINE -lt 1 ]
    then
        echo "NOK - Resource $RSRCE not online."
        exit $STATE_CRITICAL
    fi

    if [ $WEIRD -gt 1 ]
    then
        echo "NOK - Resource $RSRCE is in an unstable state."
        exit $STATE_WARNING
    fi
}

function check_ipmp {
    [ $DEBUG -gt 0 ] && echo "Starting check_ipmp subroutine." >> $DEBUGFILE

    scstat -i | grep "IPMP Group:" | awk '{print $3" "$5}' | while read GROUP STATUS
    do
        [ $DEBUG -gt 0 ] && echo "IPMP Group: $GROUP has status $STATUS" >> $DEBUGFILE
        if [ $STATUS != "Online" ]
        then
            echo "NOK - IPMP group $GROUP not online."
            exit $STATE_CRITICAL
        fi

        if [ $TESTING -gt 0 ]
        then
            echo "NOK - IPMP group $GROUP not online."
            exit $STATE_CRITICAL
        fi
    done
}

### THE MAIN ROUTINE FINALLY STARTS ###

[ $DEBUG -gt 0 ] && echo "Starting main routine." >> $DEBUGFILE

if [ $# -lt 1 ]
then
    print_usage
    exit $STATE_UNKNOWN
fi

[ $DEBUG -gt 0 ] && echo "More than one argument." >> $DEBUGFILE
[ $DEBUG -gt 0 ] && echo "" >> $DEBUGFILE

case "$1" in
    --help) print_help; exit $STATE_OK;;
    -h) print_help; exit $STATE_OK;;
    -t) check_transport_paths;;
    -q) check_quorum;;
    -g) check_resource_groups;;
    -G) RGROUP="$2"; check_resource_grp;;
    -r) check_resources;;
    -R) RSRCE="$2"; check_rsrce;;
    -i) check_ipmp;;
    *) print_usage; exit $STATE_UNKNOWN;;
esac

[ $DEBUG -gt 0 ] && echo "No problems. Exiting normally." >> $DEBUGFILE

# None of the other subroutines forced us to exit 1 before here, so let's quit with a 0.
echo "OK - Everything running like it should"
exit $STATE_OK
#!/usr/bin/bash

function testrun() {
    echo "Running without parameters."
    /usr/local/nagios/libexec/check_suncluster
    echo "Exit code is $?."
    echo ""

    echo "Testing transport paths."
    /usr/local/nagios/libexec/check_suncluster -t
    echo "Exit code is $?."
    echo ""

    echo "Quorum votes."
    /usr/local/nagios/libexec/check_suncluster -q
    echo "Exit code is $?."
    echo ""

    echo "Checking all resource groups."
    /usr/local/nagios/libexec/check_suncluster -g
    echo "Exit code is $?."
    echo ""

    echo "Checking individual resource groups."
    for GROUP in `scstat -g | grep "Group:" | awk '{print $2}' | sort -u`
    do
        echo "Running for group $GROUP."
        /usr/local/nagios/libexec/check_suncluster -G $GROUP
        echo "Exit code is $?."
        echo ""
    done

    echo "Checking all resources."
    /usr/local/nagios/libexec/check_suncluster -r
    echo "Exit code is $?."
    echo ""

    echo "Checking individual resources."
    for RESOURCE in `scstat -g | grep "Resource:" | awk '{print $2}' | sort -u`
    do
        echo "Running for resource $RESOURCE."
        /usr/local/nagios/libexec/check_suncluster -R $RESOURCE
        echo "Exit code is $?."
        echo ""
    done

    echo "Checking IPMP groups."
    /usr/local/nagios/libexec/check_suncluster -i
    echo "Exit code is $?."
    echo ""
}

function breakstuff() {
    # Now we'll start breaking things!!
    echo ""
    echo "Now it's time to start breaking things! Gruaargh!"
    echo "Mind you, it's all fake and simulated. I am not changing -anything-"
    echo "about the cluster itself."
    echo ""
    echo "Nyo!" > /tmp/neko-wa-baka
}

echo "Starting clean"
rm /tmp/neko-wa-baka /tmp/foobar >/dev/null 2>&1
echo ""

testrun
breakstuff
testrun

echo "Starting clean at the end"
rm /tmp/neko-wa-baka >/dev/null 2>&1
echo ""
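About that "Korn shell only" limitation in check_suncluster: the small bash sketch below shows what goes wrong and one common way around it. In bash (with default settings) every part of a pipeline runs in a sub-shell, so a counter that is changed inside "command | while read ..." is lost once the loop ends. Feeding the loop through process substitution keeps it in the current shell. The sample_status function and its three lines of output are made up; they merely stand in for scstat.

```shell
#!/bin/bash
# Sketch: counting "online" lines with and without a sub-shell.

sample_status() {
    # Stand-in for scstat-style output; these lines are invented.
    printf '%s\n' "pathA online" "pathB faulted" "pathC online"
}

count_online_pipe() {
    local count=0 name status
    sample_status | while read -r name status; do
        if [ "$status" = "online" ]; then
            count=$((count + 1))
        fi
    done
    # In bash the loop above ran in a sub-shell, so count is still 0 here.
    echo "$count"
}

count_online_procsub() {
    local count=0 name status
    while read -r name status; do
        if [ "$status" = "online" ]; then
            count=$((count + 1))
        fi
    done < <(sample_status)
    # The loop ran in the current shell, so the updated count survives.
    echo "$count"
}

count_online_pipe
count_online_procsub
```

Korn shell runs the last stage of a pipeline in the current shell, which is why the original script gets away with the pipe version. Note that bash's "lastpipe" option changes this behaviour, so the first function only misbehaves with default settings.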
kilala.nl tags: nagios, unix, programming,
2006-06-01 00:00:00
This script was written at the time I was hired by UPC / Liberty Global.
Basic monitor to check percentage of used physical RAM.
This script was quickly hacked together for my current customer, as a Q&D solution for their monitoring needs. It's no beauty, but it works. Written in ksh and tested with:
UPDATE 19/06/2006:
Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!
I've also -finally- changed the script so that it takes the Warning and Critical percentages from the command line.
UPDATE 15/07/2006:
Whoops... I just noticed that the file had gone missing <3
#!/bin/ksh
#
# Free physical RAM monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 20-10-2006
#
# Usage: ./check_ram
#
# Description:
#   This plugin determines how much of the physical RAM in the
#   system is in use.
#
# Limitations:
#   Currently this plugin will only function correctly on Solaris systems.
#   And it really is only useful at DTV Labs.
#
# Output:
#   The script returns either a WARN or a CRIT, depending on the
#   percentage of free physical memory.
#
# Enabling the following dumps information into DEBUGFILE at various
# stages during the execution of this script.
DEBUG="1"
DEBUGFILE="/tmp/foobar"
rm $DEBUGFILE >/dev/null 2>&1
echo "Starting script check_ram." > $DEBUGFILE

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
    echo "WARNING:"
    echo "This script was originally written for use on Solaris."
    echo "You may run into some problems running it on this host."
    echo ""
    echo "Please verify that the script works before using it in a"
    echo "live environment. You can easily disable this message after"
    echo "testing the script."
    echo ""
    exit 1
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_ram"
PATH="/usr/bin:/usr/sbin:/bin:/usr/local/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

print_usage() {
    echo "Usage: $PROGNAME warning-percentage critical-percentage"
    echo ""
    echo "e.g. : $PROGNAME 15 5"
    echo "This will start alerting when more than 85% of RAM has"
    echo "been used."
    echo ""
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "Free physical RAM plugin for Nagios"
    echo ""
    echo "This plugin is not developed by the Nagios Plugin group."
    echo "Please do not e-mail them for support on this plugin, since"
    echo "they won't know what you're talking about :P"
    echo ""
    echo "For contact info, read the plugin itself..."
}

if [ $# -lt 2 ]; then print_help; exit $STATE_WARNING; fi

case "$1" in
    --help) print_help; exit $STATE_OK;;
    -h) print_help; exit $STATE_OK;;
    *) if [ $# -lt 2 ]; then print_help; exit $STATE_WARNING; fi ;;
esac

RAM_WARN=$1
RAM_CRIT=$2
[ $DEBUG -gt 0 ] && echo "Warning and Critical percentages are $RAM_WARN and $RAM_CRIT." >> $DEBUGFILE

# Both values are percentages of FREE memory, so WARN has to be the larger one.
if [ $RAM_WARN -le $RAM_CRIT ]
then
    echo "Warning percentage should be larger than critical percentage."
    exit $STATE_WARNING
fi

check_space() {
    [ $DEBUG -gt 0 ] && echo "Starting check_space." >> $DEBUGFILE

    TOTALSPACE=0
    TOTALSPACE=`prtconf | grep ^"Memory size" | awk '{print $3}'`
    [ $DEBUG -gt 0 ] && echo "Total space is $TOTALSPACE." >> $DEBUGFILE

    TOTALFREE=0
    TOTALFREE=`vmstat 2 2 | tail -1 | awk '{print $5}'`
    [ $DEBUG -gt 0 ] && echo "Free space is $TOTALFREE." >> $DEBUGFILE

    let TOTALFREE=$TOTALFREE/1000
    [ $DEBUG -gt 0 ] && echo "Free space, div1000 is $TOTALFREE." >> $DEBUGFILE
}

check_percentile() {
    [ $DEBUG -gt 0 ] && echo "Starting check_percentile." >> $DEBUGFILE

    FRACTION=`echo "scale=2; $TOTALFREE/$TOTALSPACE" | bc`
    [ $DEBUG -gt 0 ] && echo "Fraction is $FRACTION." >> $DEBUGFILE

    PERCENT=`echo "scale=2; $FRACTION*100" | bc | awk -F. '{print $1}'`
    [ $DEBUG -gt 0 ] && echo "Percentile is $PERCENT." >> $DEBUGFILE

    if [ $PERCENT -lt $RAM_CRIT ]; then
        [ $DEBUG -gt 0 ] && echo "$PERCENT is smaller than $RAM_CRIT. Critical." >> $DEBUGFILE
        echo "RAM NOK - Less than $RAM_CRIT % of physical RAM is unused."
        exitstatus=$STATE_CRITICAL
        exit $exitstatus
    fi

    if [ $PERCENT -lt $RAM_WARN ]; then
        [ $DEBUG -gt 0 ] && echo "$PERCENT is smaller than $RAM_WARN. Warning." >> $DEBUGFILE
        echo "RAM NOK - Less than $RAM_WARN % of physical RAM is unused."
        exitstatus=$STATE_WARNING
        exit $exitstatus
    fi
}

check_space
check_percentile

[ $DEBUG -gt 0 ] && echo "$PERCENT is greater than $RAM_WARN. OK." >> $DEBUGFILE
echo "RAM OK - $TOTALFREE MB out of $TOTALSPACE MB RAM unused."
exitstatus=$STATE_OK
exit $exitstatus
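The threshold logic in check_ram trips people up because both percentages refer to free RAM, not used RAM. Here is a stripped-down bash sketch of just that comparison, using shell integer arithmetic instead of bc. The function classify_free_ram is a hypothetical stand-in: it takes the totals as arguments rather than reading them from prtconf and vmstat, and it rounds down to whole percents.

```shell
#!/bin/bash
# Sketch: classify free RAM against WARN/CRIT percentages of FREE memory.
# classify_free_ram is a hypothetical helper, not part of check_ram.

classify_free_ram() {
    local total_mb="$1" free_mb="$2" warn_pct="$3" crit_pct="$4"

    # Thresholds are percentages of free RAM, so WARN must be above CRIT.
    if [ "$warn_pct" -le "$crit_pct" ]; then
        echo "UNKNOWN - warning percentage should be larger than critical percentage"
        return 3
    fi

    # Integer division: the result is rounded down to a whole percent.
    local pct_free=$(( free_mb * 100 / total_mb ))

    if [ "$pct_free" -lt "$crit_pct" ]; then
        echo "CRITICAL - ${pct_free}% of physical RAM unused"
        return 2
    elif [ "$pct_free" -lt "$warn_pct" ]; then
        echo "WARNING - ${pct_free}% of physical RAM unused"
        return 1
    fi

    echo "OK - ${pct_free}% of physical RAM unused"
    return 0
}
```

With 4096 MB in total and thresholds of 15 and 5, having 300 MB free works out to 7% and thus a Warning, while 100 MB free (2%) is Critical.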
kilala.nl tags: nagios, unix, programming,
2006-06-01 00:00:00
This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.
A very simple script that takes a list of processes instead of a single process name (as is the case with check_process). This should make monitoring a basic list of processes a lot easier. I really should change the script so that it takes the process list from the command line, instead of from the internally defined $LIST variable. I'll do that when I have the time.
Until I've made that change, I use the script by copying check_processes to a new file that serves one specific purpose. For example, check_linux_processes and check_solaris_processes check the lists of processes that should be up and running on Linux and Solaris respectively.
This check script should work on just about any UNIX OS.
#!/bin/bash
#
# Process monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide, the Netherlands
# Last Modified: 13-07-2006
#
# Usage: ./check_solaris_processes
#
# Description:
# This script couldn't be simpler than it is. It just checks to see
# whether a predefined list of processes is up and running.
#
# Limitations:
# This script should work properly on all implementations of Linux, Solaris
# and Mac OS X.
#
# Output:
# If one of the processes is down, a CRIT is issued.
#

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_linux_processes"
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

### DEFINING THE PROCESS LIST ###
LIST="init"

### REQUISITE NAGIOS COMMAND LINE STUFF ###
print_usage() {
    echo "Usage: $PROGNAME"
    echo "Usage: $PROGNAME --help"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "Basic processes list monitor plugin for Nagios"
    echo ""
    echo "This plugin was not developed by the Nagios Plugin group."
    echo "Please do not e-mail them for support on this plugin, since"
    echo "they won't know what you're talking about :P"
    echo ""
    echo "For contact info, read the plugin itself..."
}

while test -n "$1"
do
    case "$1" in
        --help) print_help; exit $STATE_OK;;
        -h) print_help; exit $STATE_OK;;
        *) print_usage; exit $STATE_UNKNOWN;;
    esac
done

### FINALLY THE MAIN ROUTINE ###
COUNT="0"
DOWN=""

for PROCESS in `echo $LIST`
do
    # A process counts as down when ps shows no matching line
    if [ `ps -ef | grep -i $PROCESS | grep -v grep | wc -l` -lt 1 ]
    then
        let COUNT=$COUNT+1
        DOWN="$DOWN $PROCESS"
    fi
done

if [ $COUNT -gt 0 ]
then
    echo "NOK - $COUNT processes not running: $DOWN"
    exit $STATE_CRITICAL
fi

# Nothing caused us to exit early, so we're okay.
echo "OK - All requisite processes running."
exit $STATE_OK
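Taking the process list from the command line, as mentioned earlier, could look something like the sketch below. This is not the rewritten plugin: the function name check_process_list and the PS_CMD variable are made up for illustration. It matches process names in the same loose way as the script above (a case-insensitive grep over the process listing), but takes one snapshot of the listing instead of running ps for every process.

```shell
#!/bin/sh
# Hypothetical sketch: take the process list from the arguments
# instead of from a hardcoded $LIST variable.

# Command used to list processes. Being able to override it is an
# assumption of this sketch; it also makes the function easy to test.
PS_CMD="${PS_CMD:-ps -ef}"

# check_process_list NAME...
# Prints a Nagios style status line; returns 0 (OK) or 2 (CRITICAL).
check_process_list() {
    count=0
    down=""
    # One snapshot of the process table, taken before any grep runs.
    pslist=$($PS_CMD)

    for process in "$@"
    do
        if ! echo "$pslist" | grep -i -- "$process" >/dev/null
        then
            count=$((count + 1))
            down="$down $process"
        fi
    done

    if [ "$count" -gt 0 ]
    then
        echo "NOK - $count processes not running:$down"
        return 2
    fi

    echo "OK - All requisite processes running."
    return 0
}
```

A plugin built around this would end with something like check_process_list "$@"; exit $?, so that the same file can serve both the Linux and the Solaris process lists.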
kilala.nl tags: nagios, unix, programming,
2006-06-01 00:00:00
This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.
As far as I know there was no Nagios plugin that allowed you to really check your NTP client configuration. I mean, it would be nice to know for sure that all your systems are syncing against the proper server... Wouldn't it?
The script was tested on Redhat ES3, Mac OS X and Solaris. Its basic requirement is the bash shell.
EDIT:
Oh! Just like my other recent Nagios scripts, check_ntp_config comes with a debugging option. Set $DEBUG at the top of the file to anything larger than zero and the script will dump information at various stages of its execution.
#!/usr/bin/bash
#
# NTP client configuration monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide, the Netherlands
# Last Modified: 10-07-2006
#
# Usage: ./check_ntp_config
#
# Description:
# Well, there's not much to tell. We have no way of making sure that our
# NTP clients are all configured in the right way, so I thought I'd make
# a Nagios check for it. ^_^
# You can change the NTP config at the top of this script, to match your
# own situation.
#
# Limitations:
# This script should work properly on all implementations of Linux, Solaris
# and Mac OS X.
#
# Output:
# If the NTP client config does not match what has been defined at the
# top of this script, the script will return a WARN.
#
# Other notes:
# If you ever run into problems with the script, set the DEBUG variable
# to 1. I'll need the output the script generates to do troubleshooting.
# See below for details.
# I realise that all the debugging commands strewn throughout the script
# may make things a little harder to read. But in the end I'm sure it was
# well worth adding them. It makes troubleshooting so much easier. :3
#

# You may have to change this, depending on where you installed your
# Nagios plugins
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

# Name shown in the usage message
PROGNAME=`basename $0`

### DEFINING THE NTP CLIENT CONFIGURATION AS IT SHOULD BE ###
NTP_SERVER="ntp.wxs.nl"

### DEBUGGING SETUP ###
# Cause you never know when you'll need to squash a bug or two
DEBUG="0"

if [ $DEBUG -gt 0 ]
then
    DEBUGFILE="/tmp/foobar"
    rm $DEBUGFILE >/dev/null 2>&1
fi

### REQUISITE NAGIOS COMMAND LINE STUFF ###
print_usage() {
    echo "Usage: $PROGNAME"
    echo "Usage: $PROGNAME --help"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "NTP client configuration monitor plugin for Nagios"
    echo ""
    echo "This plugin was not developed by the Nagios Plugin group."
    echo "Please do not e-mail them for support on this plugin, since"
    echo "they won't know what you're talking about :P"
    echo ""
    echo "For contact info, read the plugin itself..."
}

while test -n "$1"
do
    case "$1" in
        --help) print_help; exit $STATE_OK;;
        -h) print_help; exit $STATE_OK;;
        *) print_usage; exit $STATE_UNKNOWN;;
    esac
done

### DEFINING SUBROUTINES ###
function gather_config() {
    case `uname` in
        Linux) CFGFILE="/etc/ntp.conf"; IP_SERVER=`host $NTP_SERVER | awk '{print $4}'` ;;
        SunOS) CFGFILE="/etc/inet/ntpd.conf"; IP_SERVER=`getent hosts $NTP_SERVER | awk '{print $2}'`;;
        Darwin) CFGFILE="/etc/ntp.conf"; IP_SERVER=`host $NTP_SERVER | awk '{print $4}'` ;;
        # Without a config file there is nothing to check
        *) echo "NOK - Don't know where to find the NTP config on `uname`"; exit $STATE_UNKNOWN ;;
    esac

    REAL_SERVER=`cat $CFGFILE | grep ^server | awk '{print $2}'`

    [ $DEBUG -gt 0 ] && echo "Gather_config: Host name for required server is $NTP_SERVER." >> $DEBUGFILE
    [ $DEBUG -gt 0 ] && echo "Gather_config: IP address for required server is $IP_SERVER." >> $DEBUGFILE
    [ $DEBUG -gt 0 ] && echo "Gather_config: currently configured server is $REAL_SERVER." >> $DEBUGFILE
}

function check_config() {
    # The variables are quoted so that an empty or multi-line value
    # does not break the comparison.
    if [ "$REAL_SERVER" != "$NTP_SERVER" ]
    then
        if [ "$REAL_SERVER" != "$IP_SERVER" ]
        then
            echo "NOK - NTP client is not configured to speak to $NTP_SERVER"
            exit $STATE_WARNING
        fi
    fi
}

### FINALLY, THE MAIN ROUTINE ###
gather_config
check_config

# Nothing caused us to exit early, so we're okay.
echo "OK - NTP client configured correctly."
exit $STATE_OK
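The debugging lines that are strewn throughout these scripts all follow the same pattern, so they could be wrapped in one small helper. This is just a sketch of the idea: the function name debug is made up, and the defaults simply mirror the $DEBUG and $DEBUGFILE values used in the scripts.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of the original scripts.
DEBUG="${DEBUG:-0}"
DEBUGFILE="${DEBUGFILE:-/tmp/foobar}"

# debug MESSAGE...
# Appends the message to $DEBUGFILE, but only when $DEBUG is above zero.
debug() {
    if [ "$DEBUG" -gt 0 ]
    then
        echo "$*" >> "$DEBUGFILE"
    fi
    return 0
}
```

A line such as debug "Gather_config: currently configured server is $REAL_SERVER." would then replace the longer test-and-echo construction, which keeps the main logic a little easier to read.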
kilala.nl tags: nagios, unix, programming,
2006-06-01 00:00:00
This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.
Today I made an improved version of the Nagios monitor "check_log2", which is now aptly called "check_log3". It includes all the improvements I originally added to "check_log2", so you can simply use this as a drop-in replacement.
Version 3 of this script gives you the option to add a second query to the monitor.
The previous two incarnations of the script only allowed you to search for one query and would return a Critical if it was found. Now you can also add a second query, which will result in a Warning message. Goody! :3
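The idea behind the two queries can be shown in isolation. The sketch below is not the plugin itself: the function name classify_lines is made up, and it reads the new log lines from stdin instead of diffing two files. It does follow the same order of precedence, where a match on the critical query always wins over a match on the warning query.

```shell
#!/bin/sh
# Hypothetical sketch of the two-query logic; not the plugin itself.

# classify_lines CRIT_PATTERN WARN_PATTERN
# Reads new log lines on stdin, prints a status line and returns the
# Nagios state: 2 if the critical pattern matches, 1 if only the
# warning pattern matches, 0 otherwise.
classify_lines() {
    lines=$(cat)
    critcount=$(echo "$lines" | grep -c -- "$1")
    warncount=$(echo "$lines" | grep -c -- "$2")

    if [ "$critcount" -gt 0 ]
    then
        # Report the number of matches and the last matching line
        echo "($critcount) $(echo "$lines" | grep -- "$1" | tail -n 1)"
        return 2
    fi

    if [ "$warncount" -gt 0 ]
    then
        echo "($warncount) $(echo "$lines" | grep -- "$2" | tail -n 1)"
        return 1
    fi

    echo "Log check ok - 0 pattern matches found"
    return 0
}
```

Feeding it a few lines with printf, for example printf 'baka\nbla one\nneko two\n' | classify_lines neko bla, reports the neko line and returns 2 even though the warning query matched as well.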
1st of Feb, 2006:
Kyle Tucker pointed out that he had problems running this script with bash on Solaris. The changes he suggested have been worked into the newer version. Thanks Kyle :)
5th of Mar, 2006:
I finally got round to fixing the script according to all the changes Kyle (and others) suggested. So here's another try! Right now I've tested the script on Red Hat, Mac OS X and Solaris, so it should be much better than before.
19th of June, 2006:
Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!
Also stomped out a few horrendous bugs! I'm very sorry for putting out such a buggy script earlier... If you've started using the script in your environment, please download the latest version. Thanks to Ali Khan for pointing out these mistakes.
#!/bin/bash
#
# Log file pattern detector plugin for Nagios
# Written by Ethan Galstad (nagios@nagios.org)
# Last Modified: 07-31-1999
# Heavily modified by Thomas Sluyter (nagiosATkilalaDOTnl)
# Last Modified: 19-06-2006
#
# Usage: ./check_log3 -F log_file -O old_log_file -C crit-pattern -W warn-pattern
#
# Description:
#
# This plugin will scan a log file (specified by the log_file option)
# for specific patterns (specified by the XXX-pattern options). Successive
# calls to the plugin script will only report *new* pattern matches in the
# log file, since a copy of the log file from the previous run is saved
# to old_log_file.
#
# Output:
#
# On the first run of the plugin, it will return an OK state with a message
# of "Log check data initialized". On successive runs, it will return an OK
# state if *no* pattern matches have been found in the *difference* between the
# log file and the older copy of the log file. If the plugin detects any
# pattern matches in the log diff, it will return a CRITICAL state and print
# out a message in the following format: "(x) last_match", where "x" is the
# total number of pattern matches found in the file and "last_match" is the
# last entry in the log file which matches the pattern.
#
# Notes:
#
# If you use this plugin make sure to keep the following in mind:
#
# 1. The "max_attempts" value for the service should be 1, as this
#    will prevent Nagios from retrying the service check (the
#    next time the check is run it will not produce the same results).
#
# 2. The "notify_recovery" value for the service should be 0, so that
#    Nagios does not notify you of "recoveries" for the check. Since
#    pattern matches in the log file will only be reported once and not
#    the next time, there will always be "recoveries" for the service, even
#    though recoveries really don't apply to this type of check.
#
# 3. You *must* supply a different old_log_file for each service that
#    you define to use this plugin script - even if the different services
#    check the same log_file for pattern matches. This is necessary
#    because of the way the script operates.
#
# 4. Changes to the script were made by Thomas Sluyter (cailin@kilala.nl).
#    * The first set of changes will allow the script to run properly on Solaris, which
#      it did not do by default. The second set of changes will allow the following:
#    * State retention. In the original script, if a NOK was put into the log file
#      at point A in time and it is not repeated at A+1, then an OK is sent to Nagios.
#      Not something that you would like to happen.
#      I've added the $oldlog.STATE trigger file which retains the last exitstatus. Should
#      there be no new lines added to the log, check_log will simply repeat the last state
#      instead of giving an OK.
#      In order for this state retention to work properly your client system MUST
#      HAVE THE DIRECTORY /USR/LOCAL/NAGIOS/VAR.
#    * Two queries. In the original script you could only enter one query which, when
#      found, would result in a Critical message being sent to Nagios. I've added the
#      possibility to add another query, which will result in a Warning message.
#    * Bugfix: changed all instances of "crit-count" and "warn-count" to "critcount" and
#      "warncount" after a tip from Kyle Tucker who ran into problems running this script
#      with bash on Solaris.
#

# Paths to commands used in this script. These
# may have to be modified to match your system setup.
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
PROGNAME=`basename $0`
PROGPATH=`echo $0 | sed -e 's,[\\/][^\\/][^\\/]*$,,'`
#. $PROGPATH/utils.sh
. /usr/local/nagios/libexec/utils.sh

print_usage() {
    echo "Usage: $PROGNAME -F logfile -O oldlog -C CRITquery -W WARNquery"
    echo "Usage: $PROGNAME --help"
    echo "Usage: $PROGNAME --version"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "Log file pattern detector plugin for Nagios"
    echo ""
    support
}

# Make sure the correct number of command line
# arguments have been supplied
if [ $# -lt 8 ]; then
    print_usage
    exit $STATE_UNKNOWN
fi

# Grab the command line arguments
exitstatus=$STATE_WARNING #default
while test -n "$1"; do
    case "$1" in
        --help)
            print_help
            exit $STATE_OK
            ;;
        -h)
            print_help
            exit $STATE_OK
            ;;
        -F)
            logfile=$2
            shift
            ;;
        -O)
            oldlog=$2
            shift
            ;;
        -C)
            CRITquery=$2
            shift
            ;;
        -W)
            WARNquery=$2
            shift
            ;;
        *)
            echo "Unknown argument: $1"
            print_usage
            exit $STATE_UNKNOWN
            ;;
    esac
    shift
done

# If the source log file doesn't exist, exit
if [ ! -e $logfile ]; then
    echo "Log check error: Log file $logfile does not exist!"
    # Save the state before exiting, otherwise this line is never reached
    echo $STATE_UNKNOWN > $oldlog.STATE
    exit $STATE_UNKNOWN
fi

# If the dump/temp log file doesn't exist, this must be the first time
# we're running this test, so copy the original log file over to
# the old diff file and exit
if [ ! -e $oldlog ]; then
    cat $logfile > $oldlog
    TEMPcount=0
    let TEMPcount=$TEMPcount+$(tail -1 $logfile | grep -i "$WARNquery" | wc -l | awk '{print $1}')
    let TEMPcount=$TEMPcount+$(tail -1 $logfile | grep -i "$CRITquery" | wc -l | awk '{print $1}')

    if [ $TEMPcount -gt 0 ]
    then
        echo "Log check data initialized... Last line contained error message."
        echo $STATE_WARNING > $oldlog.STATE
        exit $STATE_WARNING
    else
        echo "Log check data initialized..."
        echo $STATE_OK > $oldlog.STATE
        exit $STATE_OK
    fi
fi

# A bug which was caught very late:
# If newlog is shorter than oldlog, the diff used below will return
# false positives for the query because they will still be in $oldlog. Why?
# Because $oldlog is not rolled over / rotated, like $newlog. I need
# to fix this in a kludgy way.
if [ `wc -l $logfile|awk '{print $1}'` -lt `wc -l $oldlog|awk '{print $1}'` ]
then
    rm $oldlog
    cat $logfile > $oldlog
    TEMPcount=0
    let TEMPcount=$TEMPcount+$(tail -1 $logfile | grep -i "$WARNquery" | wc -l | awk '{print $1}')
    let TEMPcount=$TEMPcount+$(tail -1 $logfile | grep -i "$CRITquery" | wc -l | awk '{print $1}')

    if [ $TEMPcount -gt 0 ]
    then
        echo "Log check data initialized... Last line contained error message."
        echo $STATE_WARNING > $oldlog.STATE
        exit $STATE_WARNING
    else
        echo "Log check data initialized..."
        echo $STATE_OK > $oldlog.STATE
        exit $STATE_OK
    fi
fi

# The oldlog file exists, so compare it to the original log now

# The temporary file that the script should use while
# processing the log file.
# "type" looks mktemp up in $PATH; "-x mktemp" would only look
# for a file called mktemp in the current directory.
if type mktemp >/dev/null 2>&1; then
    tempdiff=`mktemp /tmp/check_log.XXXXXXXXXX`
else
    tempdate=`/bin/date '+%H%M%S'`
    tempdiff="/tmp/check_log.${tempdate}"
    touch $tempdiff
fi

diff $logfile $oldlog > $tempdiff

if [ `wc -l $tempdiff | awk '{print $1}'` -eq 0 ]
then
    rm $tempdiff
    touch $oldlog.STATE
    exitstatus=`cat $oldlog.STATE`
    echo "LOG FILE - No status change detected. Status = $exitstatus"
    exit $exitstatus
fi

# Count the number of matching log entries we have
CRITcount=`grep -c "$CRITquery" $tempdiff`
WARNcount=`grep -c "$WARNquery" $tempdiff`

# Get the last matching entry in the diff file
CRITlastentry=`grep "$CRITquery" $tempdiff | tail -1`
WARNlastentry=`grep "$WARNquery" $tempdiff | tail -1`

rm $tempdiff
cat $logfile > $oldlog

if [ "$CRITcount" -gt 0 ]; then
    echo "($CRITcount) $CRITlastentry"
    echo $STATE_CRITICAL > $oldlog.STATE
    exit $STATE_CRITICAL
fi

if [ "$WARNcount" -gt 0 ]; then
    echo "($WARNcount) $WARNlastentry"
    echo $STATE_WARNING > $oldlog.STATE
    exit $STATE_WARNING
fi

echo "Log check ok - 0 pattern matches found"
exit $STATE_OK
echo "Starting clean"
rm /tmp/foobar /usr/local/nagios/var/foobar*
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "Starting normally"
echo "baka"
echo "normal" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "baka"
echo "normal" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "warning"
echo "bla" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "critical"
echo "neko" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "warning"
echo "bla" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "Log rotation with crit"
rm /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "critical"
echo "neko" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "Log rotation with warn"
rm /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "warning"
echo "bla" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "Normal log rotation"
rm /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""

echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log3 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -C neko -W bla
echo $?
echo ""
kilala.nl tags: nagios, unix, programming,
2006-06-01 00:00:00
I know that, the first time I started using Nagios, I got confused a little when it came to monitoring systems other than the one running Nagios. To shed a little light on the subject for the beginning Nagios user, here's a discussion of the various methods of talking to Nagios clients.
First off, let me make it absolutely clear that, in order to monitor systems other than the one running Nagios, you are indeed going to have to communicate with them in some fashion. Very few things in the Sysadmin trade are magical, and unfortunately Nagios is not one of them.
So first off, let's look at the -wrong- way of doing things. When I first started with Nagios (actually I made this mistake on my second day with the software) I wrote something like this:
define service{
host_name remote-host
service_description D_ROOT
check_command check_disk!85!95!/
}
The problem with this setup is that I was using a -local- check and said it belonged to remote-host. Now this may look alright on the status screen ("Hey! It's green!"), but naturally you're not monitoring the right thing ^_^
So how -do- you monitor remote resources? Here's a table comparing various methods. After that I'll give examples on how you can correct the mistake I made above with each method.
PLEASE NOTE: the following discussion will not cover the monitoring of systems other than the various UNIX flavours. Later on I'll write a similar article covering Windows and stuff like Cisco.
              | SSH         | NRPE        | SNMP             | SNMP traps              | NSCA
Connection    | Srv -> Clnt | Srv -> Clnt | Srv -> Clnt      | Clnt -> Srv             | Clnt -> Srv
Security      | Encryption  | Encryption  | Access List (v2) | Access List (v2)        | Encryption
Configuration | On server   | On client   | On client        | On client and On server | On client
Difficulty    | Easy        | Moderate    | Hard             | Hard                    | Moderate
Just about everyone should already have SSH running on their servers (except for those few who are still running telnet or, horror of horrors!, rsh). So it's safe to assume that you can immediately start using this communications method to check your clients. You will need to:
You can now set up your services.cfg in such a way that each remote service is checked like so:
define service{
host_name remote-host
service_description D_ROOT
check_command check_disk_by_ssh!85!95!/
}
Your check command definition would look something like this:
define command {
command_name check_disk_by_ssh
command_line /usr/local/nagios/libexec/check_by_ssh -H $HOSTADDRESS$ -C "/usr/local/nagios/libexec/check_disk -w $ARG1$ -c $ARG2$ $ARG3$"
}
Working this way will allow you to do most of your configuring centrally (on the Nagios server), thus saving you a lot of work on each client system. All you have to do over there is make sure that there's a working user account and that all the scripts are in place. Quite convenient... The only drawback being that you're making a relatively open account which has full access to the system (sometimes even with sudo access).
As a replacement for the SSH access method, Ethan also wrote the NRPE daemon. Using NRPE requires that you:
You can now set up your services.cfg in such a way that each remote service is checked like so:
define service{
host_name remote-host
service_description D_ROOT
check_command check_nrpe!check_root
}
And in /usr/local/nagios/etc/nrpe.cfg on the client you would need to include:
command[check_root]=/usr/local/nagios/libexec/check_disk 85 95 /
The good thing is that you won't have a semi-open account lying about. The bad things are that you're going to have to log in to the client whenever you want to change its configuration, and that you're going to have yet another piece of software to keep up to date.
Whoo boy! This is something I'm working on right now at $CLIENT and let me tell you: it's hard! At least much harder than I was expecting.
SNMP is a network management protocol used by the more advanced system administrators. Using SNMP you can access just about -any- piece of equipment in your server room to read statistics, alarms and status messages. SNMP is universal, extensible, but it is also quite complicated. Not for the faint of heart.
To make proper use of monitoring through SNMP you'll need to:
The reason why point C tells you to register a private EID, is because the SNMP tree has a very rigid structure. Technically speaking you -could- just plonk down your results at a random place in the tree, but it's likely that this will screw up something else at a later time. IANA allows each company to have only one private EID, so first check if your company doesn't already have one on the IANA list.
Unfortunately the check_snmp script that comes with Nagios isn't flexible enough to let you monitor custom SNMP objects in a nice way. This is why I wrote the retrieve_custom_snmp script, which is available from the menu. Your service definition would look like this:
define service{
host_name remote-host
service_description D_ROOT
check_command retrieve_custom_snmp!.1.3.6.1.4.1.6886.4.1.4
}
And in this case your snmpd.conf would contain a line like this:
exec .1.3.6.1.4.1.6886.4.1.4 check_d_root /usr/local/nagios/libexec/check_disk -w 85 -c 95 /
Up to now things are actually not that different from using NRPE, are they? Well, that's because we haven't even started using all the -real- features of SNMP. Point is that using SNMP you can dig very deeply into your system to retrieve all kinds of useful information. And -that's- where things get complicated because you're going to have to dig up all the object IDs (OIDs) that you're going to need. And in some cases you're going to have to install vendor specific sub-agents that know how to speak to your specific hardware.
One of the best features of SNMP though are the so-called traps. Using traps the SNMP daemon will actively undertake action when something goes wrong in your system. So if for instance your hard disk starts failing, it is possible to have the daemon send out an alert to your Nagios server! Awesome! But naturally this will require a boatload of additional configuration :(
So... SNMP is an awesomely powerful tool, but you're going to have to pay through the nose (in effort) to get it 100% perfect.
SNMP doesn't involve polling alone. SNMP enabled devices can also be configured to automatically send status updates to a so-called trap host. The downside to receiving SNMP traps with Nagios is that it takes quite a lot of work to get them into Nagios :D
To make proper use of monitoring through SNMP you'll need to:
There are -many- ways to get the SNMP traps translated for Nagios' purposes, 'cause there's many roads that lead to Rome. Unfortunately none of them are very easy to use.
And finally there's NSCA. This daemon is usually used by distributed Nagios servers to send their results to the central Nagios server, which gathers them as so-called "passive checks". It is however entirely possible to install NSCA on each of your Nagios clients, which will then get called to send in the results of local checks. In this case you'll need to:
On your Nagios server things would look like this:
define service{
host_name remote-host
service_description D_ROOT
check_command check_disk!85!95!/
passive_checks_enabled 1
active_checks_enabled 0
}
For the configuration on the client side I recommend that you read up on NSCA. It's a little bit too much to show over here.
The upside to this is that you won't have to run any daemon on your client to accept incoming connections. This will allow you to lock down your system quite tightly.
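To give a rough idea of what the client side boils down to: send_nsca reads tab separated lines from stdin, holding the host name, the service description, the return code and the plugin output. The sketch below only shows that formatting step. The function name format_passive_result is made up, and the paths and the nagios-server host name in the comments are assumptions that will differ per installation.

```shell
#!/bin/sh
# Hypothetical sketch of a client side wrapper for passive checks.

# format_passive_result HOST SERVICE RETURN_CODE OUTPUT
# Prints one line in the tab separated format that send_nsca reads
# from stdin: host, service description, return code, plugin output.
format_passive_result() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# Example use on the client (paths and host names are assumptions):
#
#   OUTPUT=$(/usr/local/nagios/libexec/check_disk 85 95 /)
#   format_passive_result remote-host D_ROOT $? "$OUTPUT" | \
#       /usr/local/nagios/bin/send_nsca -H nagios-server \
#       -c /usr/local/nagios/etc/send_nsca.cfg
```

The host name and service description have to match the host_name and service_description of the service definition on the Nagios server, otherwise the result cannot be tied to a service.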
Naturally you are absolutely free to combine two or more of the methods described above. You could poll through NRPE and receive SNMP traps in one environment. This will have both ups and downs, but it's up to your own discretion. Use the tools that feel natural to you, or use those that are already standard in your environment.
I realise I've rushed through things a little bit, but I was in a slight hurry :) I will go over this article a second time RSN, to apply some polish.
kilala.nl tags: tutorial, sysadmin, nagios, unix,
2006-06-01 00:00:00
September 21-22 of 2006 saw the first annual Nagios Conference. Organised by the good folk of Netways, the conference was attended by around 130 people (mostly Germans, with some foreigners thrown in for fun).
Originally I posted some comments about the conference on my blog, but I thought I'd move them over into the Sysadmin section, to keep Google from thinking the Archives had content about Nagios :D
Wow... Today was a long day :)
Left Utrecht around 09:30 and finally arrived at the hotel at 17:30. Eight hours, just as I predicted! 6 hours driving (0.5 of which due to delays) and 2 hours spent resting. Speaking of: I -love- the Germanian Autobahn! They are littered with comfortable places to take a break and there's also an abundance of what they call a Rasthof: parking space, combined with restaurants, gas station, maybe a hotel, a few shops and very cool sanitary facilities (by the Sanifair company). I'll talk about those some more another time :)
What else is there to tell? I showered, I unpacked, we had dinner with the whole group and I met some interesting people. *waves* Hi Stephan! Hi Jorg! *waves*
Now... I feel really tired (I also notice that it's getting harder for me to string together coherent thought, despite the recent cappuccino), so I'd better get to bed... I'm actually quite woozy in the head! :)
Tomorrow the conference'll start, so I'd better be at my best!
So far, it's been an interesting day.
In the morning, Ethan Galstad (main Nagios developer) covered his plans for the future. Version 3.x (improved notification, expanded plugin output, custom variables and a greatly improved method for host checking) will Alpha in October and Stable somewhere this winter, while 4.x (a new PHP-based GUI, among other things) is on the long-term roadmap.
After that Michael Kienle and Markus Kosters told us a few things about the practical side of implementing Nagios in your organisation. I was already familiar with most of what they told us, but it must've been an eye opener for a lot of people! The notion that Nagios needs much more than just "download and install" is apparently foreign to a lot of people, which comes back to bite them in the ass later.
Lunch was terrific. I don't know how they do it, but the Nurnberg Holiday Inn are perfectly capable of making a buffet-style meal that -is- quite edible and actually varied and tasty! Kudos to them!
While on the subject of the hotel... The hotel, the rooms, the facilities: they're all wonderful. Nice ambience, a swanky in-house cafe and comfortable furniture. I like it! I just have to wonder about one thing: why the heck are there at least a dozen brothels and sex clubs surrounding the hotel?! o_O
The afternoon saw two sessions regarding data collection and representation: RRDTool and NagiosGrapher. RRD itself couldn't interest me for long, but NagGraph (which relies on RRD) on the other hand could. NagGraph allows you to add somewhat complicated graphs to Nagios (inside the Nagios GUI), which gives you something that is a little similar to Cacti.
I had to skip the session on monitoring storage systems, because I -really- needed some fresh air. So I walked around Nurnberg's Altstadt for a while. Looks nice, I have to say :) Of course I was only able to see a small part of it, but hey... At least I got out for a while. [EDIT: Anand from ASAM told me afterwards that I didn't miss much. Apparently it was kind of a marketing spiel.]
So... The plan for the rest of the day:
See you guys tomorrow!
*phew* That was great! <3
I'm sitting here in my hotel room with some apple soda and some Pringles, feeling nice and drowsy thanks to the hotel's sauna. It felt real good, just spending an hour and a half relaxing.
Anywho... The conference today... Pretty darn interesting and it gave me a load of things to think about! In his morning session Ethan covered some things that you usually don't think of when configuring Nagios, but that can save you loads of trouble! A few of the things he mentioned I will actually try to work into the design of $CLIENT's new Nagios infra, 'cause else they may run into some problems later.
The rest of the morning for me was filled with two sessions on varying ways to get info into Nagios. On the one hand there was SNMPtt (trap translator), which to me seemed like a really backward solution to a problem that wasn't too difficult to start with. And on the other hand, there was EventDB whose goal it is to have only one check command to access information provided by a great variety of sources. The only down-side being that you'll need translation adapters for each of these sources (which means that you basically are filling one hole by digging another).
Now I don't mean to be too negative about these two sessions. I'm sure that a lot of people are actually very happy to see these tools and that they will have some great uses for them.
Lunch... What can I say? It was great, just like yesterday. The hotel took great care of us, thanks to Netways.
After lunch, Ton Voon kicked off with a brief session on open source etiquette. Basically telling the attendees both the up and down sides their companies could experience by contributing to the Nagios community. As ever, Ton was charismatic and displayed a good sense of humor ^_^
Two Netways employees gave talks on:
1. The IT Portal they implemented at the Bundesverwaltungsamt. This is actually the same portal that Markus Kosters told us about yesterday, but Julian actually took time to show us the technology behind the portal.
2. Integrating Nagios with Asterisk (among other things), to allow for some nice telephone trickery. Mind you, Asterisk isn't really my thing, but I can imagine some people enjoying the idea of being called by the Nagios server to literally -tell- them (through a .WAV voice) that their server's down.
For me, the con was closed by a guy giving a marketing spiel about the services his company provides, but I was actually able to glean something useful from the talk.
Unfortunately there was no official closing ceremony, so the con ended quite abruptly. Which means that just about everyone stormed out of the building in the span of thirty minutes. I did however get to say goodbye to a few nice acquaintances I've made during these three days. And my hat's off to Anand who decided to drive home during the night (he lives in The Hague)... He should be arriving home, somewhere around 0200 ;_; Wow!
While waving off the last person to leave (Stephan), I met up with Ethan and his SO, Mary. We went to dinner together and I must say I enjoyed their company! Friendly folks and very down to earth. I believe that sometimes Ethan is just overwhelmed by all the attention people are willing to give him... Who could blame him?
Aside from Ethan and Mary, I'm the last conference attendee at the hotel. In the morning, I'll have a nice breakfast, grab some rolls at the bakery and head off home. I reckon I should get there around five-ish.
kilala.nl tags: conference, nagios,
2006-06-01 00:00:00
Working at $CLIENT in 2005 was the first time that I built a complete monitoring infrastructure from the ground up. In order to keep expenses low we went for a free, yet versatile monitoring tool: Nagios.
Nagios, which is available over here, is a free and Open Source monitoring solution based on what was once known as NetSaint.
Nagios allows you to monitor a number of different platforms through the use of plugins, which can run both on the server and on the monitored clients. So far I've heard of clients being available for various UNIXen and BSDs (including Mac OS X) and Windows. Windows monitoring requires either the unclear NSClient software, or the NRPE_nt daemon which is basically a port of the UNIX Nagios client.
Setting up the basic server requires some fidgeting with compilers, dependencies and so on. Still, a reasonably experienced sysadmin should be able to have the basic software up and running (and configured) in a day. Adding all the monitors for all the clients, however, is another matter entirely.
Although there are a number of GUIs available which should make configuring Nagios a bit easier, I chose to do it all by hand. Just because that's what I'm used to and because I have little faith in GUI-generated config files. You will need to define each monitor separately for each host, so let's take a look at a quick example.
Say that you have twenty servers that need to be monitored by ten monitors each. Each definition in the configuration file takes up approximately sixteen lines, so in the end your config file will be at least 3200 lines long :)
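To make the multiplication tangible (20 hosts x 10 monitors x 16 lines = 3200 lines), here is a rough sketch of how such a pile of definitions could be generated instead of typed. This is not how I did it at $CLIENT; the host names, the check names and the "generic-service" template are made up for illustration, and the stanza is a lot shorter than the sixteen lines mentioned above.

```sh
#!/bin/sh
# Sketch: print one Nagios service definition per host/check pair.
# HOSTS, CHECKS and the "generic-service" template are hypothetical.
HOSTS="web01 web02 db01"
CHECKS="check_load check_disk"

# emit_service HOST CHECK: print a single service stanza.
# The check name doubles as the service description to keep things short.
emit_service() {
    cat <<EOF
define service {
    use                  generic-service
    host_name            $1
    service_description  $2
    check_command        $2
}
EOF
}

# generate_services: loop over every host/check combination.
generate_services() {
    for host in $HOSTS; do
        for check in $CHECKS; do
            emit_service "$host" "$check"
        done
    done
}

generate_services
```

Three hosts with two checks each already give you six stanzas, so it's easy to see how the real thing balloons.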
But please don't let that deter you! Nagios is a powerful tool and can help you keep an eye on _a_lot_ of different things in your environment. I for one have become quite smitten with it.
In the menu you will find a configuration manual which I wrote for $CLIENT, as well as a bunch of plugins which were either modified or created for their environment. Quite possibly there are one or two in there that will be interesting for you.
kilala.nl tags: tutorial, sysadmin, nagios,
2006-06-01 00:00:00
This script was written at the time I was hired by UPC / Liberty Global.
Improved log checker for Solaris, with state retention.
I found that the version of check_log included in the default monitor package doesn't work perfectly on Solaris: it needs a bit of tweaking... Which is what I've done for the script.
Also, I've added state retention. It's a bit of a hack, but hey! I needed a quick solution.
The original script sends a Critical when it detects the string you've queried the log file for, but it clears that same Critical immediately if the same message is not repeated once the monitor runs again. Meaning that, if there are no updates to your log file, the Critical will only be around until the next time the monitor runs.
Not very handy if the Critical occurs during the night.
This new version of the script creates a file called $oldlog.STATE in /usr/local/nagios/var (which should be 755, nagios:nagios), which contains the exit status for the last detected _changed_ status... If there are no changes detected in your log file, this old exit state is repeated.
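Stripped of everything else, the state retention idea looks roughly like the sketch below. The function names are made up for this example, and falling back to UNKNOWN (3) when no state file exists yet is my assumption here, not something the plugin itself does.

```sh
#!/bin/sh
# Sketch of state retention: remember the last exit state in a file and
# repeat it when the log has not changed. Function names are hypothetical.

# save_state FILE STATE: record the most recent exit state.
save_state() {
    echo "$2" > "$1"
}

# last_state FILE: print the stored state, or 3 (UNKNOWN) if nothing is stored.
last_state() {
    if [ -s "$1" ]; then
        cat "$1"
    else
        echo 3
    fi
}

# Demo: a CRITICAL (2) raised during the night is still reported later on.
STATEFILE="${TMPDIR:-/tmp}/check_log_demo.$$.STATE"
save_state "$STATEFILE" 2
echo "No change in log; repeating state $(last_state "$STATEFILE")"
rm -f "$STATEFILE"
```

The point is simply that "no new lines" is answered with the stored state instead of a fresh OK.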
The script has been tested on Solaris 8, Mac OS X 10.4 and Redhat ES3.
UPDATE 19/06/2006:
Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!
Also stomped out a few horrendous bugs! I'm very sorry for putting out such a buggy script earlier... If you've started using the script in your environment, please download the latest version. Thanks to Ali Khan for pointing out these mistakes.
#!/bin/bash
#
# Log file pattern detector plugin for Nagios
# Written by Ethan Galstad (nagios@nagios.org)
# Last Modified: 07-31-1999
# Updated by Thomas Sluyter (nagiosATkilalaDOTnl)
# Last Modified: 19-06-2006
#
# Usage: ./check_log2 -F log_file -O old_log_file -Q pattern
#
# Description:
#
# This plugin will scan a log file (specified by the log_file option)
# for a specific pattern (specified by the pattern option). Successive
# calls to the plugin script will only report *new* pattern matches in the
# log file, since a copy of the log file from the previous run is saved
# to old_log_file.
#
# Output:
#
# On the first run of the plugin, it will return an OK state with a message
# of "Log check data initialized". On successive runs, it will return an OK
# state if *no* pattern matches have been found in the *difference* between the
# log file and the older copy of the log file. If the plugin detects any
# pattern matches in the log diff, it will return a CRITICAL state and print
# out a message in the following format: "(x) last_match", where "x" is the
# total number of pattern matches found in the file and "last_match" is the
# last entry in the log file which matches the pattern.
#
# Notes:
#
# If you use this plugin make sure to keep the following in mind:
#
# 1. The "max_attempts" value for the service should be 1, as this
#    will prevent Nagios from retrying the service check (the
#    next time the check is run it will not produce the same results).
#
# 2. The "notify_recovery" value for the service should be 0, so that
#    Nagios does not notify you of "recoveries" for the check. Since
#    pattern matches in the log file will only be reported once and not
#    the next time, there will always be "recoveries" for the service, even
#    though recoveries really don't apply to this type of check.
#
# 3. You *must* supply a different old_file_log for each service that
#    you define to use this plugin script - even if the different services
#    check the same log_file for pattern matches. This is necessary
#    because of the way the script operates.
#
# 4. Changes to the script were made by Thomas Sluyter (nagios@kilala.nl).
#    The first set of changes will allow the script to run properly on Solaris, which
#    it did not do by default. The second set of changes will allow the following:
#    * State retention. If a NOK was generated at point A in time and it is not repeated
#      at A+1, then an OK is sent to Nagios. Not something that you would like to happen.
#      I've added the $oldlog.STATE trigger file which retains the last exitstatus. Should
#      there be no new lines added to the log, check_log will simply repeat the last state
#      instead of giving an OK.
#
# Examples:
#
# Check for login failures in the syslog...
#
#   check_log -F /var/log/messages -O /usr/local/nagios/var/check_log.badlogins.old -Q "LOGIN FAILURE"
#
# Check for port scan alerts generated by Psionic's PortSentry software...
#
#   check_log -F /var/log/messages -O /usr/local/nagios/var/check_log.portscan.old -Q "attackalert"
#

# Paths to commands used in this script. These
# may have to be modified to match your system setup.
PATH="/usr/bin:/usr/sbin:/bin:/sbin"

PROGNAME=`basename $0`
PROGPATH=`echo $0 | sed -e 's,[\\/][^\\/][^\\/]*$,,'`

#. $PROGPATH/utils.sh
. /usr/local/nagios/libexec/utils.sh

print_usage() {
    echo "Usage: $PROGNAME -F logfile -O oldlog -Q query"
    echo "Usage: $PROGNAME --help"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "Log file pattern detector plugin for Nagios"
    echo ""
    support
}

# Make sure the correct number of command line
# arguments have been supplied
if [ $# -lt 6 ]; then
    print_usage
    exit $STATE_UNKNOWN
fi

# Grab the command line arguments
exitstatus=$STATE_WARNING #default
while test -n "$1"; do
    case "$1" in
        --help)
            print_help
            exit $STATE_OK
            ;;
        -h)
            print_help
            exit $STATE_OK
            ;;
        -F)
            logfile=$2
            shift
            ;;
        -O)
            oldlog=$2
            shift
            ;;
        -Q)
            query=$2
            shift
            ;;
        *)
            echo "Unknown argument: $1"
            print_usage
            exit $STATE_UNKNOWN
            ;;
    esac
    shift
done

# If the source log file doesn't exist, exit.
# The state is stored *before* exiting; the original order made
# the write unreachable.
if [ ! -e "$logfile" ]; then
    echo "Log check error: Log file $logfile does not exist!"
    echo $STATE_UNKNOWN > "$oldlog.STATE"
    exit $STATE_UNKNOWN
fi

# If the oldlog file doesn't exist, this must be the first time
# we're running this test, so copy the original log file over to
# the old diff file and exit
if [ ! -e "$oldlog" ]; then
    cat "$logfile" > "$oldlog"
    if [ `tail -1 "$logfile" | grep -i "$query" | wc -l` -gt 0 ]
    then
        echo "Log check data initialized... Last line contained error message."
        echo $STATE_CRITICAL > "$oldlog.STATE"
        exit $STATE_CRITICAL
    else
        echo "Log check data initialized..."
        echo $STATE_OK > "$oldlog.STATE"
        exit $STATE_OK
    fi
fi

# A bug which was caught very late:
# If newlog is shorter than oldlog, the diff used below will return
# false positives for the query because they will be in $oldlog. Why?
# Because $oldlog is not rolled over / rotated, like $newlog. I need
# to fix this in a kludgy way.
if [ `wc -l "$logfile"|awk '{print $1}'` -lt `wc -l "$oldlog"|awk '{print $1}'` ]
then
    rm "$oldlog"
    cat "$logfile" > "$oldlog"
    if [ `tail -1 "$logfile" | grep -i "$query" | wc -l` -gt 0 ]
    then
        echo "Log check data re-initialized... Last line contained error message."
        echo $STATE_CRITICAL > "$oldlog.STATE"
        exit $STATE_CRITICAL
    else
        echo "Log check data re-initialized..."
        echo $STATE_OK > "$oldlog.STATE"
        exit $STATE_OK
    fi
fi

# Everything seems fine, so compare it to the original log now

# The temporary file that the script should use while
# processing the log file. Look mktemp up in the PATH; a plain
# "-x mktemp" would only test for a file in the current directory.
if command -v mktemp >/dev/null 2>&1; then
    tempdiff=`mktemp /tmp/check_log.XXXXXXXXXX`
else
    tempdate=`/bin/date '+%H%M%S'`
    tempdiff="/tmp/check_log.${tempdate}"
    touch $tempdiff
fi

diff "$logfile" "$oldlog" > $tempdiff

# No differences at all: repeat the last stored state
if [ `wc -l $tempdiff|awk '{print $1}'` -eq 0 ]
then
    rm $tempdiff
    touch "$oldlog.STATE"
    exitstatus=`cat "$oldlog.STATE"`
    echo "LOG FILE - No status change detected. Status = $exitstatus"
    exit $exitstatus
fi

# Count the number of matching log entries we have
count=`grep -c "$query" $tempdiff`

# Get the last matching entry in the diff file
lastentry=`grep "$query" $tempdiff | tail -1`

rm -f $tempdiff
cat "$logfile" > "$oldlog"

if [ "$count" = "0" ]; then # no matches, exit with no error
    echo "Log check ok - 0 pattern matches found"
    exitstatus=$STATE_OK
else # Print total match count and the last entry we found
    # echo "($count) $lastentry"
    echo "Log check NOK - $lastentry"
    exitstatus=$STATE_CRITICAL
    echo $STATE_CRITICAL > "$oldlog.STATE"
fi

exit $exitstatus
echo "Starting clean"
rm /tmp/foobar /usr/local/nagios/var/foobar*
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""

echo "Starting normally"
echo "normal"
echo "normal" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""

echo "normal"
echo "normal" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""

echo "critical"
echo "neko" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""

echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""

echo "Log rotation with crit"
rm /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""

echo "critical"
echo "neko" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""

echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""

echo "Normal log rotation"
rm /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""

echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""

echo "normal"
echo "baka" >> /tmp/foobar
/usr/local/nagios/libexec/check_log2 -F /tmp/foobar -O /usr/local/nagios/var/foobar.archive -Q neko
echo $?
echo ""
kilala.nl tags: nagios, unix, programming,
2006-06-01 00:00:00
This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.
At $CLIENT we've often run into problems with the NSCA daemon, where the daemon would not crash per se, but where it would also not process incoming service checks. The nsca process was still running, but it simply wasn't transferring the incoming results to the Nagios command file.
I was amazed to find that nobody else had written a script to do this! So I quickly wrote one.
#!/usr/bin/bash
#
# NSCA Nagios service results monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide, the Netherlands
# Last Modified: 16-08-2006
#
# Usage: ./check_nsca
#
# Description:
#   Aside from checking whether the NSCA process is still running, this script
#   also attempts to insert a message into the Nagios queue. After sending a
#   message to the NSCA daemon, it will verify that the message is received by
#   Nagios, by checking the nagios.log file.
#
# Limitations:
#   This script should work properly on all implementations of Linux, Solaris
#   and Mac OS X.
#
# Output:
#   If the NSCA daemon, or something along the message path, is borked, a
#   CRIT message will be issued.
#

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_nsca"
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
NAGIOSHOME="/usr/local/nagios"
LIBEXEC="$NAGIOSHOME/libexec"
NAGVAR="$NAGIOSHOME/var"
NAGBIN="$NAGIOSHOME/bin"
NAGETC="$NAGIOSHOME/etc"

. $LIBEXEC/utils.sh


### REQUISITE NAGIOS COMMAND LINE STUFF ###

print_usage() {
    echo "Usage: $PROGNAME"
    echo "Usage: $PROGNAME --help"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "NSCA Nagios service results monitor plugin for Nagios"
    echo ""
    echo "This plugin was not developed by the Nagios Plugin group."
    echo "Please do not e-mail them for support on this plugin, since"
    echo "they won't know what you're talking about :P"
    echo ""
    echo "For contact info, read the plugin itself..."
}

while test -n "$1"
do
    case "$1" in
        --help) print_help; exit $STATE_OK;;
        -h) print_help; exit $STATE_OK;;
        *) print_usage; exit $STATE_UNKNOWN;;
    esac
done


### PLATFORM INDEPENDENCE ###

case `uname` in
    Linux) PSLIST="ps -ef";;
    SunOS) PSLIST="ps -ef";;
    Darwin) PSLIST="ps -ajx";;
    *) ;;
esac


### CHECKING FOR THE NSCA PROCESS ###

# Grouped with { ...; } instead of ( ... ): an exit inside a subshell
# would only end the subshell, not the plugin.
[ `$PSLIST | grep nsca | grep -v grep | wc -l` -lt 1 ] && { echo "NSCA process not running."; exit $STATE_CRITICAL; }


### INSERTING A TEST MESSAGE ###

DATE=`date +%Y%m%d%H%M`
STRING="`hostname`\tFOOBAR\t0\t$DATE This is a test of the emergency broadcast system.\n"
echo -e "$STRING" | $NAGBIN/send_nsca -H localhost -c $NAGETC/send_nsca.cfg >/dev/null 2>&1


### CHECKING THE NAGIOS LOG FILE ###

sleep 10
if [ `tail -1000 $NAGVAR/nagios.log | grep "emergency broadcast system" | grep $DATE | wc -l` -lt 1 ]
then
    # Giving it a second try
    sleep 10
    if [ `tail -5000 $NAGVAR/nagios.log | grep "emergency broadcast system" | grep $DATE | wc -l` -lt 1 ]
    then
        echo "NSCA daemon not processing check results."
        exit $STATE_CRITICAL
    fi
fi


### EXITING NORMALLY ###

echo "OK - NSCA working like it should."
exit $STATE_OK
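The two moving parts of this check are the tab-separated line that gets piped into send_nsca (host, service, return code, output) and the search for that same message in nagios.log. A minimal sketch of both, with function names that are made up for this example and a marker string chosen purely for illustration:

```sh
#!/bin/sh
# Sketch only: nsca_line and log_has_marker are hypothetical helper names.

# nsca_line HOST SERVICE RETURN_CODE OUTPUT
# Print one passive service check result: four tab-separated fields.
nsca_line() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# log_has_marker LOGFILE MARKER
# Succeed if MARKER appears in the last 1000 lines of LOGFILE.
log_has_marker() {
    tail -n 1000 "$1" | grep -F -- "$2" >/dev/null
}

# Demo: a timestamped marker makes each test message easy to find again.
MARKER="nsca-selftest-$(date +%Y%m%d%H%M%S)"
nsca_line "$(hostname)" "FOOBAR" 0 "$MARKER This is a test of the emergency broadcast system."
```

Using printf sidesteps the differences between the various echo implementations when it comes to "\t" and "-e". In a real run the output of nsca_line would be piped into send_nsca, after which log_has_marker is pointed at nagios.log.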
kilala.nl tags: nagios, unix, programming,
2006-06-01 00:00:00
This script was written at the time I was hired by UPC / Liberty Global.
Basic monitor to check whether BIND is up and running. It checks for a number of processes and tries to perform a basic lookup using the localhost.
This script was quickly hacked together for my current customer, as a Q&D solution for their monitoring needs. It's no beauty, but it works. Written in ksh and tested with:
A Critical is sent if:
A) one or more of the required processes is not running, or
B) the script is unable to perform a basic lookup using the localhost.
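Put differently, the decision boils down to two inputs: how many named processes were found and whether the lookup succeeded. A small sketch of just that decision, with a made-up function name and hard-coded state values (the plugin itself gets these from utils.sh):

```sh
#!/bin/sh
# Sketch only: evaluate_named is a hypothetical name, not part of the plugin.
STATE_OK=0
STATE_CRITICAL=2

# evaluate_named PROCESS_COUNT LOOKUP_EXIT_CODE
# Print a status line and return the matching Nagios state.
evaluate_named() {
    if [ "$1" -lt 1 ]; then
        echo "NAMED NOK - One or more processes not running"
        return $STATE_CRITICAL
    fi
    if [ "$2" -ne 0 ]; then
        echo "NAMED NOK - Unable to perform a lookup using localhost"
        return $STATE_CRITICAL
    fi
    echo "NAMED OK - Everything running like it should"
    return $STATE_OK
}

# Demo: one process found, lookup returned 0.
evaluate_named 1 0
```

In practice the first argument would come from counting ps output and the second from the exit code of the lookup command. Whether that exit code reliably signals a failed lookup is worth verifying on your own platform.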
UPDATE 19/06/2006:
Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!
#!/usr/bin/bash
#
# DNS / Named process monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 19-06-2006
#
# Usage: ./check_named
#
# Description:
#   This plugin determines whether the named DNS server
#   is running properly. It will check the following:
#       * Are all required processes running?
#       * Is it possible to make DNS requests?
#
# Limitations:
#   Currently this plugin will only function correctly on Solaris systems.
#
# Output:
#   The script returns a CRIT when the abovementioned criteria are
#   not matched.
#

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
    echo "WARNING:"
    echo "This script was originally written for use on Solaris."
    echo "You may run into some problems running it on this host."
    echo ""
    echo "Please verify that the script works before using it in a"
    echo "live environment. You can easily disable this message after"
    echo "testing the script."
    echo ""
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_named"
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"

. $LIBEXEC/utils.sh

print_usage() {
    echo "Usage: $PROGNAME"
    echo "Usage: $PROGNAME --help"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "Named DNS monitor plugin for Nagios"
    echo ""
    echo "This plugin was not developed by the Nagios Plugin group."
    echo "Please do not e-mail them for support on this plugin, since"
    echo "they won't know what you're talking about :P"
    echo ""
    echo "For contact info, read the plugin itself..."
}

while test -n "$1"
do
    case "$1" in
        --help) print_help; exit $STATE_OK;;
        -h) print_help; exit $STATE_OK;;
        *) print_usage; exit $STATE_UNKNOWN;;
    esac
done

# Is at least one named process running?
check_processes() {
    PROCESS="0"
    if [ `ps -ef | grep named | grep -v grep | grep -v nagios | wc -l` -lt 1 ]; then
        echo "NAMED NOK - One or more processes not running"
        exitstatus=$STATE_CRITICAL
        exit $exitstatus
    fi
}

# Can we perform a basic lookup using the localhost?
check_service() {
    SERVICE=0
    nslookup www.google.com localhost >/dev/null 2>&1
    # Any non-zero exit code counts as a failed lookup
    if [ $? -ne 0 ]; then SERVICE=1;fi

    if [ $SERVICE -eq 1 ]; then
        echo "NAMED NOK - Unable to perform a lookup using localhost."
        exitstatus=$STATE_CRITICAL
        exit $exitstatus
    fi
}

check_processes
check_service

echo "NAMED OK - Everything running like it should"
exitstatus=$STATE_OK
exit $exitstatus
kilala.nl tags: nagios, unix, programming,
2006-06-01 00:00:00
This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.
I couldn't find an easy way to check whether all interfaces of a host are up and running from the -inside-, so I wrote a Nagios plugin to do this.
Naturally you could also try to ping all of the IP addresses of all of these network cards, but this isn't always possible. Lord knows how many routing issues I had to fight through to get our current IP set monitored. I guess using this script is a bit easier :)
The script was tested on Redhat ES3, Mac OSX and Solaris. Its basic requirement is the Korn shell (due to some conversions happening inside the script). On Linux/RH you'll need mii-tool (and sudo) and on Solaris you'll need Perl (for one lousy piece of math :p ).
EDIT:
Oh! Just like my other recent Nagios scripts, check_networking comes with a debugging option. Set $DEBUG at the top of the file to anything larger than zero and the script will dump information at various stages of its execution.
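The debugging approach itself is nothing fancy. Boiled down to a helper function it looks something like the sketch below; note that the script does not actually use a function for this (it repeats the test on every line), so the name debug() is purely illustrative.

```sh
#!/bin/sh
# Sketch only: debug() is a hypothetical helper, not part of check_networking.
DEBUG="${DEBUG:-0}"
DEBUGFILE="${DEBUGFILE:-/tmp/foobar}"

# debug MESSAGE...
# Append MESSAGE to DEBUGFILE, but only when DEBUG is larger than zero.
debug() {
    if [ "$DEBUG" -gt 0 ]; then
        echo "$*" >> "$DEBUGFILE"
    fi
    return 0
}

# With the default of DEBUG=0 this call writes nothing.
debug "- Starting check_ping -"
```

The explicit "return 0" keeps the helper from handing a non-zero status back to the caller when debugging is switched off.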
#!/usr/bin/ksh
#
# Basic UNIX networking check script.
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide SYS, the Netherlands
# Last Modified: 22-06-2006
#
# Usage: ./check_networking
#
# Description:
#   This plugin determines whether the local host's network interfaces
#   are all up and running like they should. It uses the following
#   questions to determine this.
#       * Does /sbin/mii-tool report any problems? (Linux only)
#       * Are the gateways for each subnet pingable?
#
# Limitations:
#   * I have no clue whether mii-tool is something specific to Redhat ES3,
#     or whether all Linii have it.
#   * Sudo access to mii-tool is required for the nagios account.
#   * Perl is required on Solaris, to do just a tiny bit of math.
#   * KSH is required.
#   * The script assumes that the first available IP from a subnet is the
#     router.
#
# Output:
#   The script returns a CRIT when one of the criteria mentioned
#   above is not matched.
#
# Other notes:
#   I wish I'd learn Perl. I'm sure that doing all of this stuff in Perl
#   would have cut down on the size of this script tremendously. Ah well.
#   If you ever run into problems with the script, set the DEBUG variable
#   to 1. I'll need the output the script generates to do troubleshooting.
#   See below for details.
#   I realise that all the debugging commands strewn throughout the script
#   may make things a little harder to read. But in the end I'm sure it was
#   well worth adding them. It makes troubleshooting so much easier. :3
#

# Enabling the following dumps information into DEBUGFILE at various
# stages during the execution of this script.
DEBUG="0"
DEBUGFILE="/tmp/foobar"


### REQUISITE NAGIOS USER INTERFACE STUFF ###

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_networking"
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"

. $LIBEXEC/utils.sh

[ $DEBUG -gt 0 ] && rm -f $DEBUGFILE

print_usage() {
    echo "Usage: $PROGNAME"
    echo "Usage: $PROGNAME --help"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "Basic UNIX networking check plugin for Nagios"
    echo ""
    echo "This plugin was not developed by the Nagios Plugin group."
    echo "Please do not e-mail them for support on this plugin, since"
    echo "they won't know what you're talking about :P"
    echo ""
    echo "For contact info, read the plugin itself..."
}

while test -n "$1"
do
    case "$1" in
        --help) print_help; exit $STATE_OK;;
        -h) print_help; exit $STATE_OK;;
        *) print_usage; exit $STATE_UNKNOWN;;
    esac
done


### SETTING UP THE ENVIRONMENT ###

# Host OS check and warning message
MIITOOL="0"

if [ -f /sbin/mii-tool ]
then
    MIITOOL="1"
    sudo /sbin/mii-tool >/dev/null 2>&1
    if [ $? -gt 0 ]
    then
        echo "ERROR: sudo permissions"
        echo ""
        echo "This script requires that the Nagios user account has"
        echo "sudo permissions for the mii-tool command. Currently it"
        echo "does not have these permissions. Please fix this."
        echo ""
        exit $STATE_UNKNOWN
    fi
fi


### SUB-ROUTINE DEFINITIONS ###

# Convert a number ($1) to another base ($2, default 16)
function convert_base
{
    typeset -i${2:-16} x
    x=$1
    echo $x
}

# Turn the binary network part of an address ($1) into the dotted-quad
# address of the first host in that subnet
function subnet_router
{
    [ $DEBUG -gt 0 ] && echo "- Starting subnet_router -" >> $DEBUGFILE
    first="0"; second="0"; third="0"; fourth="0"

    first=`echo $1 | cut -c 1-8`; FIRST=`convert_base 2#$first 10`
    [ $DEBUG -gt 0 ] && echo "First: $first $FIRST" >> $DEBUGFILE
    second=`echo $1 | cut -c 9-16`; SECOND=`convert_base 2#$second 10`
    [ $DEBUG -gt 0 ] && echo "Second: $second $SECOND" >> $DEBUGFILE
    third=`echo $1 | cut -c 17-24`; THIRD=`convert_base 2#$third 10`
    [ $DEBUG -gt 0 ] && echo "Third: $third $THIRD" >> $DEBUGFILE
    fourth=`echo $1 | cut -c 25-32`
    [ `echo $fourth|wc -c` -gt 1 ] || fourth="0"
    TEMPCOUNT=`echo $fourth | wc -c | awk '{print $1}'`
    let PADDING=9-$TEMPCOUNT
    [ $DEBUG -gt 0 ] && echo "Fourth: padding fourth with $PADDING zeroes" >> $DEBUGFILE
    i=1
    while ((i <= $PADDING));
    do
        fourth=$fourth"0"
        let i=$i+1
    done
    FOURTH=`convert_base 2#$fourth 10`; let FOURTH=$FOURTH+1
    [ $DEBUG -gt 0 ] && echo "Fourth: $fourth $FOURTH" >> $DEBUGFILE

    echo "$FIRST.$SECOND.$THIRD.$FOURTH"
}

gather_interfaces_linux()
{
    [ $DEBUG -gt 0 ] && echo "- Starting gather_interfaces_linux -" >> $DEBUGFILE

    for INTF in `ifconfig -a | grep ^[a-z] | grep -v ^lo | awk '{print $1}'`
    do
        if [ `echo $INTF | grep : | wc -l` -gt 0 ]
        then
            export INTERFACES="`echo $INTF|awk -F: '{print $1}'` $INTERFACES"
        else
            export INTERFACES="$INTF $INTERFACES"
        fi
    done

    INTFCOUNT=`echo $INTERFACES | wc -w`
    [ $DEBUG -gt 0 ] && echo "Interfaces: There are $INTFCOUNT interfaces: $INTERFACES." >> $DEBUGFILE

    if [ $INTFCOUNT -lt 1 ]
    then
        echo "NOK - No active network interfaces."
        exit $STATE_CRITICAL
    fi
}

gather_interfaces_darwin()
{
    [ $DEBUG -gt 0 ] && echo "- Starting gather_interfaces_darwin -" >> $DEBUGFILE

    for INTF in `ifconfig -a | grep ^[a-z] | grep -v ^gif | grep -v ^stf | grep -v ^lo | awk '{print $1}'`
    do
        [ `echo $INTF | grep : | wc -l` -gt 0 ] && INTF=`echo $INTF|awk -F: '{print $1}'`
        # Skip inactive interfaces. This used to be a "break", which also
        # dropped every interface listed after the first inactive one.
        [ `ifconfig $INTF | grep "status: inactive" | wc -l` -gt 0 ] && continue
        INTERFACES="$INTF $INTERFACES"
    done

    INTFCOUNT=`echo $INTERFACES | wc -w`
    [ $DEBUG -gt 0 ] && echo "Interfaces: There are $INTFCOUNT interfaces: $INTERFACES." >> $DEBUGFILE

    if [ $INTFCOUNT -lt 1 ]
    then
        echo "NOK - No active network interfaces."
        exit $STATE_CRITICAL
    fi
}

gather_gateway_linux()
{
    [ $DEBUG -gt 0 ] && echo "- Starting gather_gateway_linux for interface $1 -" >> $DEBUGFILE

    MASKBIN=""
    MASK=`ifconfig $1 | grep Mask | awk '{print $4}' | awk -F: '{print $2}'`
    for PART in `echo $MASK | awk -F. '{print $1" "$2" "$3" "$4}'`
    do
        MASKBIN="$MASKBIN`convert_base $PART 2 | awk -F# '{print $2}'`"
    done
    [ $DEBUG -gt 0 ] && echo "Mask: $MASK $MASKBIN" >> $DEBUGFILE

    BITCOUNT=`echo $MASKBIN | grep -o 1 | wc -l | awk '{print $1}'`
    [ $DEBUG -gt 0 ] && echo "Bitcount: $BITCOUNT" >> $DEBUGFILE

    IPBIN=""
    IP=`ifconfig $1 | grep "inet addr" | awk '{print $2}' | awk -F: '{print $2}'`
    for PART in `echo $IP | awk -F. '{print $1" "$2" "$3" "$4}'`
    do
        TEMPBIN=`convert_base $PART 2 | awk -F# '{print $2}'`
        TEMPCOUNT=`echo $TEMPBIN | wc -c | awk '{print $1}'`
        let PADDING=9-$TEMPCOUNT
        i=1
        while ((i <= $PADDING));
        do
            IPBIN=$IPBIN"0"
            let i=$i+1
        done
        IPBIN=$IPBIN$TEMPBIN
    done
    [ $DEBUG -gt 0 ] && echo "IP address: $IP $IPBIN" >> $DEBUGFILE

    CUT="1-$BITCOUNT"
    [ $DEBUG -gt 0 ] && echo "Cutting: Cutting chars $CUT" >> $DEBUGFILE
    NETBIN=`echo $IPBIN | cut -c $CUT`
    [ $DEBUG -gt 0 ] && echo "Netbin: $NETBIN" >> $DEBUGFILE

    ROUTER=`subnet_router $NETBIN`
    [ $DEBUG -gt 0 ] && echo "Router: $ROUTER" >> $DEBUGFILE
    echo $ROUTER
}

gather_gateway_darwin()
{
    [ $DEBUG -gt 0 ] && echo "- Starting gather_gateway_darwin for interface $1 -" >> $DEBUGFILE

    MASKBIN=""
    [ `uname` == "Darwin" ] && MASK=`ifconfig $1 | grep netmask | awk '{print $4}' | awk -Fx '{print $2}'`
    [ `uname` == "SunOS" ] && MASK=`ifconfig $1 | grep netmask | awk '{print $4}'`
    for PART in `echo 1 3 5 7`
    do
        let PLUSPART=$PART+1
        MASKPART=`echo $MASK | cut -c $PART-$PLUSPART`
        MASKBIN="$MASKBIN`convert_base 16#$MASKPART 2 | awk -F# '{print $2}'`"
    done
    [ $DEBUG -gt 0 ] && echo "Mask: $MASK $MASKBIN" >> $DEBUGFILE

    BITCOUNT=`echo $MASKBIN | grep -o 1 | wc -l | awk '{print $1}'`
    [ $DEBUG -gt 0 ] && echo "Bitcount: $BITCOUNT" >> $DEBUGFILE

    IPBIN=""
    IP=`ifconfig $1 | grep "inet " | awk '{print $2}'`
    for PART in `echo $IP | awk -F. '{print $1" "$2" "$3" "$4}'`
    do
        TEMPBIN=`convert_base $PART 2 | awk -F# '{print $2}'`
        TEMPCOUNT=`echo $TEMPBIN | wc -c | awk '{print $1}'`
        let PADDING=9-$TEMPCOUNT
        i=1
        while ((i <= $PADDING));
        do
            TEMPBIN="0"$TEMPBIN
            let i=$i+1
        done
        IPBIN=$IPBIN$TEMPBIN
    done
    [ $DEBUG -gt 0 ] && echo "IP address: $IP $IPBIN" >> $DEBUGFILE

    CUT="1-$BITCOUNT"
    [ $DEBUG -gt 0 ] && echo "Cutting: cutting chars $CUT" >> $DEBUGFILE
    NETBIN=`echo $IPBIN | cut -c $CUT`
    [ $DEBUG -gt 0 ] && echo "Netbin: $NETBIN" >> $DEBUGFILE

    ROUTER=`subnet_router $NETBIN`
    [ $DEBUG -gt 0 ] && echo "Router: $ROUTER" >> $DEBUGFILE
    echo $ROUTER
}

gather_gateway_sunos()
{
    [ $DEBUG -gt 0 ] && echo "- Starting gather_gateway_sunos for interface $1 -" >> $DEBUGFILE

    MASKBIN=""
    [ `uname` == "Darwin" ] && MASK=`ifconfig $1 | grep netmask | awk '{print $4}' | awk -Fx '{print $2}'`
    [ `uname` == "SunOS" ] && MASK=`ifconfig $1 | grep netmask | awk '{print $4}'`
    for PART in `echo 1 3 5 7`
    do
        let PLUSPART=$PART+1
        MASKPART=`echo $MASK | cut -c $PART-$PLUSPART`
        MASKBIN="$MASKBIN`convert_base 16#$MASKPART 2 | awk -F# '{print $2}'`"
    done
    [ $DEBUG -gt 0 ] && echo "Mask: $MASK $MASKBIN" >> $DEBUGFILE

# This piece of kludge also requires that all tabs are removed from the beginning of each line.
# Additional character needed to trick the counter below
# Shitty thing is that it doesn't work. Stupid "let" aryth engine...
#MASKBIN="$MASKBIN-"
#[ $DEBUG -gt 0 ] && echo "Bitcount: kludged binmask is $MASKBIN" >> $DEBUGFILE
#
#IFS="1"
#read TEMP << EOT
#echo $MASKBIN
#EOT
#let "BITCOUNT=(${#TEMP[@]} - 1)"
#IFS=" "

    # The kludge above was replaced by this one line of Perl.
    BITCOUNT=`echo $MASKBIN | perl -ne 'while(/1/g){++$count}; print "$count"'`
    [ $DEBUG -gt 0 ] && echo "Bitcount: $BITCOUNT" >> $DEBUGFILE

    IPBIN=""
    IP=`ifconfig $1 | grep "inet " | awk '{print $2}'`
    for PART in `echo $IP | awk -F. '{print $1" "$2" "$3" "$4}'`
    do
        [ $DEBUG -gt 0 ] && echo "IP part: converting part $PART" >> $DEBUGFILE
        TEMPBIN=`convert_base $PART 2 | awk -F# '{print $2}'`
        [ $DEBUG -gt 0 ] && echo "IP part: converted part is $TEMPBIN" >> $DEBUGFILE
        TEMPCOUNT=`echo $TEMPBIN | wc -c | awk '{print $1}'`
        [ $DEBUG -gt 0 ] && echo "IP part: this part is $TEMPCOUNT chars long." >> $DEBUGFILE
        let PADDING=9-$TEMPCOUNT
        [ $DEBUG -gt 0 ] && echo "IP part: will be padded with $PADDING zeroes" >> $DEBUGFILE
        i=1
        while ((i <= $PADDING));
        do
            TEMPBIN="0"$TEMPBIN
            let i=$i+1
        done
        IPBIN=$IPBIN$TEMPBIN
    done
    [ $DEBUG -gt 0 ] && echo "IP address: $IP $IPBIN" >> $DEBUGFILE

    CUT="1-$BITCOUNT"
    [ $DEBUG -gt 0 ] && echo "Cutting: cutting chars $CUT" >> $DEBUGFILE
    NETBIN=`echo $IPBIN | cut -c $CUT`
    [ $DEBUG -gt 0 ] && echo "Netbin: $NETBIN" >> $DEBUGFILE

    ROUTER=`subnet_router $NETBIN`
    [ $DEBUG -gt 0 ] && echo "Router: $ROUTER" >> $DEBUGFILE
    echo $ROUTER
}

check_miitool()
{
    [ $DEBUG -gt 0 ] && echo "- Starting check_miitool -" >> $DEBUGFILE

    COUNT="0"
    for INTF in `echo $INTERFACES`
    do
        [ `sudo /sbin/mii-tool $INTF | head -1 | grep -c ok` -gt 0 ] || let COUNT=$COUNT+1
        [ `sudo /sbin/mii-tool $INTF | head -1 | grep -c 100baseTx-FD` -gt 0 ] || let COUNT=$COUNT+1
        [ `sudo /sbin/mii-tool $INTF | head -1 | grep -c 1000baseTx-FD` -gt 0 ] || let COUNT=$COUNT+1
    done

    # Grouped with { ...; } instead of ( ... ): an exit inside a subshell
    # would only end the subshell, not the plugin.
    [ $COUNT -gt $INTFCOUNT ] && { echo "NOK - Problem with one of the interfaces"; exit $STATE_CRITICAL; }
}

check_ping()
{
    [ $DEBUG -gt 0 ] && echo "- Starting check_ping -" >> $DEBUGFILE

    INTF=""
    for INTF in `echo $INTERFACES`
    do
        case `uname` in
            Linux) GATEWAY=`gather_gateway_linux $INTF`;;
            Darwin) GATEWAY=`gather_gateway_darwin $INTF`;;
            SunOS) GATEWAY=`gather_gateway_sunos $INTF`;;
            *) echo "OS not supported by this check."; exit 1;;
        esac
        [ $DEBUG -gt 0 ] && echo "Gateway: $GATEWAY" >> $DEBUGFILE

        ping -c 3 $GATEWAY >/dev/null 2>&1
        if [ $? -gt 0 ]
        then
            echo "NOK - Problem pinging gateway $GATEWAY"; exit $STATE_CRITICAL
        fi
    done
}


### THE MAIN ROUTINE FINALLY STARTS ###

case `uname` in
    Linux) gather_interfaces_linux;;
    Darwin) gather_interfaces_darwin;;
    #SunOS) gather_interfaces_sunos;;
    SunOS) gather_interfaces_linux;;
    *) echo "OS not supported by this check."; exit 1;;
esac

[ $MIITOOL -eq 1 ] && check_miitool
check_ping

# None of the other subroutines forced us to exit 1 before here, so let's quit with a 0.
echo "OK - Everything running like it should"
exit $STATE_OK
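The binary string juggling in the script comes down to one piece of arithmetic: AND the IP address with the netmask and add one, which gives the first usable address in the subnet (the one the script assumes to be the router). For comparison, here is a sketch of that calculation using plain shell arithmetic. It is not what the script does; the function names are made up, it only accepts dotted-quad netmasks (not the hex form that Darwin and Solaris print), and it assumes a shell whose arithmetic handles values above 2^31.

```sh
#!/bin/sh
# Sketch only: ip_to_int, int_to_ip and first_host are hypothetical names.

# ip_to_int A.B.C.D: print the address as a single integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1    # split the dotted quad into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# int_to_ip N: print the integer N as a dotted quad.
int_to_ip() {
    echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# first_host IP NETMASK: network address plus one.
first_host() {
    ip=$(ip_to_int "$1")
    mask=$(ip_to_int "$2")
    int_to_ip $(( (ip & mask) + 1 ))
}

first_host 192.168.10.57 255.255.255.0
first_host 10.1.2.3 255.255.252.0
```

The two demo calls cover a /24 and a /22, the latter showing that the mask does not have to fall on an octet boundary.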
kilala.nl tags: nagios, unix, programming,
2006-06-01 00:00:00
This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.
There really isn't much to say... This script is so fscking basic that it shames me to even put it up here among all the other projects.
#!/usr/bin/bash
#
# NFS stale mounts monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide, the Netherlands
# Last Modified: 13-07-2006
#
# Usage: ./check_nfs_stale
#
# Description:
# This script couldn't be simpler than it is. It just checks to see
# whether there are any stale NFS mounts present on the system.
#
# Limitations:
# This script should work properly on all implementations of Linux, Solaris
# and Mac OS X.
#
# Output:
# If there are stale NFS mounts, a CRIT is issued.
#

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_nfs_stale"
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

### REQUISITE NAGIOS COMMAND LINE STUFF ###

print_usage() {
    echo "Usage: $PROGNAME"
    echo "Usage: $PROGNAME --help"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "NFS stale mounts monitor plugin for Nagios"
    echo ""
    echo "This plugin was not developed by the Nagios Plugin group."
    echo "Please do not e-mail them for support on this plugin, since"
    echo "they won't know what you're talking about :P"
    echo ""
    echo "For contact info, read the plugin itself..."
}

while test -n "$1"
do
    case "$1" in
        --help) print_help; exit $STATE_OK;;
        -h) print_help; exit $STATE_OK;;
        *) print_usage; exit $STATE_UNKNOWN;;
    esac
done

# df reports stale handles on stderr, hence the 2>&1. Braces instead of
# parentheses, so that exit ends the script and not just a subshell.
[ `df -k 2>&1 | grep "Stale NFS file handle" | wc -l` -gt 0 ] && { echo "NOK - Stale NFS mounts."; exit $STATE_CRITICAL; }

# Nothing caused us to exit early, so we're okay.
echo "OK - No stale NFS mounts."
exit $STATE_OK
kilala.nl tags: nagios, unix, programming,
2005-09-11 01:00:00
I've added all the custom Nagios monitors I wrote for $CLIENT. They might come in handy for any of you. They're not beauties, but they get the job done.
kilala.nl tags: work, nagios, unix, sysadmin,
2005-09-11 00:47:00
Major updates in the Sysadmin section! w00t!
In this case, a lot of information on one of my favourite security tools and on Nagios, my new-found love on the monitoring front.
kilala.nl tags: nagios, boks, work, unix,
2005-07-01 00:00:00
This script was written at the time I was hired by UPC / Liberty Global.
The text I wrote on Nagios Exchange about this script has been lost. I guess it speaks for itself :)
#!/usr/bin/bash
#
# Squid process monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 19-06-2006
#
# Usage: ./check_squid
#
# Description:
# This plugin determines whether the Squid proxy server
# is running properly. It will check the following:
# * Are all required processes running?
# * Are all the required TCP/IP ports open?
#
# Limitations:
# Currently this plugin will only function correctly on Solaris systems.
#
# Output:
# The script returns a CRIT when the abovementioned criteria are
# not matched
#

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
    echo "WARNING:"
    echo "This script was originally written for use on Solaris."
    echo "You may run into some problems running it on this host."
    echo ""
    echo "Please verify that the script works before using it in a"
    echo "live environment. You can easily disable this message after"
    echo "testing the script."
    echo ""
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_squid"    # used by print_usage, was never set before
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

print_usage() {
    echo "Usage: $PROGNAME"
    echo "Usage: $PROGNAME --help"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "Squid monitor plugin for Nagios"
    echo ""
    echo "This plugin was not developed by the Nagios Plugin group."
    echo "Please do not e-mail them for support on this plugin, since"
    echo "they won't know what you're talking about :P"
    echo ""
    echo "For contact info, read the plugin itself..."
}

while test -n "$1"
do
    case "$1" in
        --help) print_help; exit $STATE_OK;;
        -h) print_help; exit $STATE_OK;;
        *) print_usage; exit $STATE_UNKNOWN;;
    esac
done

check_processes() {
    PROCESS="0"
    if [ `ps -ef | grep squid | grep -v grep | grep -v nagios | wc -l` -lt 2 ]; then
        echo "SQUID NOK - One or more processes not running"
        exitstatus=$STATE_CRITICAL
        exit $exitstatus
    fi
}

check_ports() {
    PORTS=0
    PORTLIST="8080 3128 3130"
    for NUM in `echo $PORTLIST`; do
        if [ `netstat -an | grep LISTEN | grep $NUM | grep -v grep | wc -l` -lt 1 ]; then PORTS=1; fi
    done

    if [ $PORTS -eq 1 ]; then
        echo "SQUID NOK - One or more TCP/IP ports not listening."
        exitstatus=$STATE_CRITICAL
        exit $exitstatus
    fi
}

check_processes
check_ports

echo "SQUID OK - Everything running like it should"
exitstatus=$STATE_OK
exit $exitstatus
kilala.nl tags: nagios, unix, programming,
2005-07-01 00:00:00
This script was written at the time I was hired by UPC / Liberty Global.
Basic monitor that checks if the Retrospect client is up and running.
This script was quickly hacked together for my current customer, as a Q&D solution for their monitoring needs. It's no beauty, but it works. Written in ksh and tested with:
The script sends a Critical if the required process is not running.
UPDATE 19/06/2006:
Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!
#!/usr/bin/bash
#
# Retrospect Backup Client monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 19-06-2006
#
# Usage: ./check_retro_client
#
# Description:
# This plugin determines whether the Retrospect backup client
# is running properly. It will check the following:
# * Are all required processes running?
#
# Limitations:
# Currently this plugin will only function correctly on Solaris systems.
#
# Output:
# The script returns a CRIT when the abovementioned criteria are
# not matched
#

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
    echo "WARNING:"
    echo "This script was originally written for use on Solaris."
    echo "You may run into some problems running it on this host."
    echo ""
    echo "Please verify that the script works before using it in a"
    echo "live environment. You can easily disable this message after"
    echo "testing the script."
    echo ""
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_retro_client"    # used by print_usage, was never set before
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

print_usage() {
    echo "Usage: $PROGNAME"
    echo "Usage: $PROGNAME --help"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "Retrospect Backup Client monitor plugin for Nagios"
    echo ""
    echo "This plugin was not developed by the Nagios Plugin group."
    echo "Please do not e-mail them for support on this plugin, since"
    echo "they won't know what you're talking about :P"
    echo ""
    echo "For contact info, read the plugin itself..."
}

while test -n "$1"
do
    case "$1" in
        --help) print_help; exit $STATE_OK;;
        -h) print_help; exit $STATE_OK;;
        *) print_usage; exit $STATE_UNKNOWN;;
    esac
done

check_processes() {
    PROCESS="0"
    if [ `ps -ef | grep retroclient | grep -v grep | grep -v nagios | wc -l` -lt 1 ]; then
        echo "RETROSPECT NOK - One or more processes not running"
        exitstatus=$STATE_CRITICAL
        exit $exitstatus
    fi
}

check_processes

echo "RETROSPECT OK - Everything running like it should"
exitstatus=$STATE_OK
exit $exitstatus
kilala.nl tags: nagios, unix, programming,
2005-07-01 00:00:00
This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.
We are currently in the process of distributing a standard set of Nagios monitoring scripts to over 300 client systems. One of the metrics we would like to monitor is the three load averages (or as Dr. Gunther calls them: the LaLaLa triplets).
Since these 300 servers aren't all alike, we are bound to run into systems with one, two, four, eight or more processors. That means no single standard configuration will do, since you'd have to define separate load average levels for WARN and CRIT per processor count. Why? Because a quad-processor system can take much more load than a single-processor system.
One way to get around this would be by defining separate host groups, based on the amount of processors in a system. You could then define a unique check_load command for each CPU host group.
I've gone the other way around though...
My workaround is to replace check_load with check_load2. This script takes no command line parameters and works on the basis of standard multipliers. We are of the opinion that the number of processors multiplied by a certain factor (150%? 200%? and so on) is a good enough way to define these WARN and CRIT levels. These multipliers can easily be modified (at the top of the script) to fit what -you- think is a worrying level of activity.
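To make the multiplier idea concrete, here is a minimal sketch of the arithmetic. The helper names load_threshold and load_exceeds are made up for this example and do not appear in check_load2, which does its comparisons with bc; awk is used here only because it handles decimals without extra tools.

```shell
#!/usr/bin/env bash
# Sketch only: load_threshold() and load_exceeds() are hypothetical helpers.

# Scale a per-processor factor into an absolute load average threshold.
# $1 = number of processors, $2 = factor (for example 1.50)
load_threshold() {
    awk -v n="$1" -v f="$2" 'BEGIN { printf "%.2f\n", n * f }'
}

# Succeed (exit status 0) when the measured load is at or above the threshold.
# $1 = measured load average, $2 = threshold
load_exceeds() {
    awk -v l="$1" -v t="$2" 'BEGIN { exit !(l >= t) }'
}

# Example: a quad-processor host with a WARN factor of 2.00
WARN=$(load_threshold 4 2.00)
if load_exceeds 8.10 "$WARN"; then
    echo "WARN at $WARN"
fi
```

With a factor of 2.00 the same configuration yields a threshold of 2.00 on a single-processor box and 16.00 on an eight-way system, which is the whole point of using factors rather than fixed levels.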
This script was tested on Redhat ES3, Solaris 8 and Mac OS X 10.4. It should run on other versions of these OSes as well.
EDIT:
Oh! Just like my other recent Nagios scripts, check_load2 comes with a debugging option. Set $DEBUG at the top of the file to anything larger than zero and the script will dump information at various stages of its execution.
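A debugging switch like this boils down to one small pattern. The sketch below wraps it in a helper; the function name debug and the default log path are assumptions made for the example, while check_load2 itself repeats the test inline on every line.

```shell
#!/usr/bin/env bash
# Sketch only: debug() and the default log path are illustrative names,
# not taken from check_load2 itself.
DEBUG="${DEBUG:-1}"
DEBUGFILE="${DEBUGFILE:-/tmp/check_load2.debug}"

debug() {
    # Append the message to the log only when DEBUG is larger than zero.
    if [ "$DEBUG" -gt 0 ]; then
        echo "$*" >> "$DEBUGFILE"
    fi
    return 0
}

: > "$DEBUGFILE"    # start each run with an empty log
debug "Numprocs: Number of processors detected is 4."
```

Returning 0 unconditionally keeps a disabled debug call from looking like a failure to whatever runs next.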
#!/usr/bin/bash
#
# CPU load monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of KPN-IS, i-Provide, the Netherlands
# Last Modified: 22-06-2006
#
# Usage: ./check_load2
#
# Description:
# Ethan's original version of the check_load script is very flexible.
# It allows you to specifically set WARN and CRIT levels regarding
# the CPU load of the system you're monitoring.
# However: flexibility is not always a good thing. Say for example that
# you want to monitor the CPU load across a few hundred of systems having
# various CPU configurations. You -could- define host groups for single, dual
# quad (and so on) processor systems and assign unique check_load command
# definitions to each group.
# Or you could write a script which checks the amount of active CPUs and
# then makes an educated guess at the WARN and CRIT levels for the system.
# In most cases this should really be enough.
#
# Limitations:
# This script should work properly on all implementations of Linux, Solaris
# and Mac OS X.
#
# Output:
# Depending on the levels defined at the top of the script,
# the script returns an OK, WARN or CRIT to Nagios based on CPU load.
#
# Other notes:
# If you ever run into problems with the script, set the DEBUG variable
# to 1. I'll need the output the script generates to do troubleshooting.
# See below for details.
# I realise that all the debugging commands strewn throughout the script
# may make things a little harder to read. But in the end I'm sure it was
# well worth adding them. It makes troubleshooting so much easier. :3
#

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_load2"    # used by print_usage, was never set before
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

### DEBUGGING SETUP ###
# Cause you never know when you'll need to squash a bug or two
DEBUG="1"
DEBUGFILE="/tmp/foobar"
rm -f $DEBUGFILE    # -f: don't complain when there is no old log yet

### REQUISITE NAGIOS COMMAND LINE STUFF ###

print_usage() {
    echo "Usage: $PROGNAME"
    echo "Usage: $PROGNAME --help"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "Semi-intelligent CPU load monitor plugin for Nagios"
    echo ""
    echo "This plugin was not developed by the Nagios Plugin group."
    echo "Please do not e-mail them for support on this plugin, since"
    echo "they won't know what you're talking about :P"
    echo ""
    echo "For contact info, read the plugin itself..."
}

while test -n "$1"
do
    case "$1" in
        --help) print_help; exit $STATE_OK;;
        -h) print_help; exit $STATE_OK;;
        *) print_usage; exit $STATE_UNKNOWN;;
    esac
done

### SETTING UP THE WARN AND CRIT FACTORS ###
# Please be aware that these are -factors- and not real load average values.
# The numbers below will be multiplied by the amount of processors to come
# to the desired WARN and CRIT levels. Feel free to adjust these factors, if
# you feel the need to tweak them.

WARN_1min="2.00"
WARN_5min="1.50"
WARN_15min="1.50"
[ $DEBUG -gt 0 ] && echo "Factors: warning factors are at $WARN_1min, $WARN_5min, $WARN_15min." >> $DEBUGFILE

CRIT_1min="3.00"
CRIT_5min="2.00"
CRIT_15min="2.00"
[ $DEBUG -gt 0 ] && echo "Factors: critical factors are at $CRIT_1min, $CRIT_5min, $CRIT_15min." >> $DEBUGFILE

### DEFINING SUBROUTINES ###

function gather_procs_linux() {
    NUMPROCS=`cat /proc/cpuinfo | grep ^processor | wc -l`
    [ $DEBUG -gt 0 ] && echo "Numprocs: Number of processors detected is $NUMPROCS." >> $DEBUGFILE
}

function gather_procs_sunos() {
    NUMPROCS=`/usr/bin/mpstat | grep -v CPU | wc -l`
    [ $DEBUG -gt 0 ] && echo "Numprocs: Number of processors detected is $NUMPROCS." >> $DEBUGFILE
}

function gather_procs_darwin() {
    NUMPROCS=`/usr/bin/hostinfo | grep "Default processor set" | awk '{print $8}'`
    [ $DEBUG -gt 0 ] && echo "Numprocs: Number of processors detected is $NUMPROCS." >> $DEBUGFILE
}

function gather_load_linux() {
    REAL_1min=`cat /proc/loadavg | awk '{print $1}'`
    REAL_5min=`cat /proc/loadavg | awk '{print $2}'`
    REAL_15min=`cat /proc/loadavg | awk '{print $3}'`
    [ $DEBUG -gt 0 ] && echo "Gather_load: Detected load averages are $REAL_1min, $REAL_5min, $REAL_15min." >> $DEBUGFILE
}

function gather_load_sunos() {
    REAL_1min=`w | grep "load average" | awk -F, '{print $4}' | awk '{print $3}'`
    REAL_5min=`w | grep "load average" | awk -F, '{print $5}'`
    REAL_15min=`w | grep "load average" | awk -F, '{print $6}'`
    [ $DEBUG -gt 0 ] && echo "Gather_load: Detected load averages are $REAL_1min, $REAL_5min, $REAL_15min." >> $DEBUGFILE
}

function gather_load_darwin() {
    REAL_1min=`sysctl -n vm.loadavg | awk '{print $1}'`
    REAL_5min=`sysctl -n vm.loadavg | awk '{print $2}'`
    REAL_15min=`sysctl -n vm.loadavg | awk '{print $3}'`
    [ $DEBUG -gt 0 ] && echo "Gather_load: Detected load averages are $REAL_1min, $REAL_5min, $REAL_15min." >> $DEBUGFILE
}

function check_load() {
    WARN="0"; CRIT="0"

    [ `echo "if(($NUMPROCS * $WARN_1min) > $REAL_1min) 0; if(($NUMPROCS * $WARN_1min) <= $REAL_1min) 1" | bc` -gt 0 ] && let WARN=$WARN+1
    [ `echo "if(($NUMPROCS * $WARN_5min) > $REAL_5min) 0; if(($NUMPROCS * $WARN_5min) <= $REAL_5min) 1" | bc` -gt 0 ] && let WARN=$WARN+1
    [ `echo "if(($NUMPROCS * $WARN_15min) > $REAL_15min) 0; if(($NUMPROCS * $WARN_15min) <= $REAL_15min) 1" | bc` -gt 0 ] && let WARN=$WARN+1
    [ $DEBUG -gt 0 ] && echo "Check_load: warning levels are `echo "$NUMPROCS * $WARN_1min"|bc`, `echo "$NUMPROCS * $WARN_5min"|bc`, `echo "$NUMPROCS * $WARN_15min"|bc`," >> $DEBUGFILE

    [ `echo "if(($NUMPROCS * $CRIT_1min) > $REAL_1min) 0; if(($NUMPROCS * $CRIT_1min) <= $REAL_1min) 1" | bc` -gt 0 ] && let CRIT=$CRIT+1
    [ `echo "if(($NUMPROCS * $CRIT_5min) > $REAL_5min) 0; if(($NUMPROCS * $CRIT_5min) <= $REAL_5min) 1" | bc` -gt 0 ] && let CRIT=$CRIT+1
    [ `echo "if(($NUMPROCS * $CRIT_15min) > $REAL_15min) 0; if(($NUMPROCS * $CRIT_15min) <= $REAL_15min) 1" | bc` -gt 0 ] && let CRIT=$CRIT+1
    [ $DEBUG -gt 0 ] && echo "Check_load: critical levels are `echo "$NUMPROCS * $CRIT_1min"|bc`, `echo "$NUMPROCS * $CRIT_5min"|bc`, `echo "$NUMPROCS * $CRIT_15min"|bc`," >> $DEBUGFILE

    # CRIT is tested before WARN: with the default factors any critical load
    # also trips the warning test. Braces instead of parentheses, so that
    # exit ends the script and not just a subshell.
    [ $CRIT -gt 0 ] && { echo "NOK: load averages are at $REAL_1min, $REAL_5min, $REAL_15min"; exit $STATE_CRITICAL; }
    [ $WARN -gt 0 ] && { echo "NOK: load averages are at $REAL_1min, $REAL_5min, $REAL_15min"; exit $STATE_WARNING; }
}

### FINALLY, THE MAIN ROUTINE ###

NUMPROCS="0"

case `uname` in
    Linux) gather_procs_linux; gather_load_linux; check_load;;
    Darwin) gather_procs_darwin; gather_load_darwin; check_load;;
    SunOS) gather_procs_sunos; gather_load_sunos; check_load;;
    *) echo "OS not supported by this check."; exit 1;;
esac

# Nothing caused us to exit early, so we're okay.
echo "OK - load averages are at $REAL_1min, $REAL_5min, $REAL_15min"
exit $STATE_OK
kilala.nl tags: nagios, unix, programming,
2005-07-01 00:00:00
This script was written at the time I was hired by UPC / Liberty Global.
Basic monitor that checks if the Checkpoint Firewall-1 Management software is up and running. It checks for a number of processes and ports.
This script was quickly hacked together for my current customer, as a Q&D solution for their monitoring needs. It's no beauty, but it works. Written in ksh and tested with:
The script sends a Critical if:
A) One or more processes are not running, or
B) One or more ports are not available for connections.
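Condition A hinges on matching process names. A plain grep on a name also hits longer names that merely contain it, so the sketch below matches names exactly. Treat it as an illustration under assumptions: process_running is a made-up helper, and it expects one command name per line on stdin, as printed by `ps -e -o comm`.

```shell
#!/usr/bin/env bash
# Sketch only: process_running() is a hypothetical helper.
# It reads one command name per line on stdin (as printed by `ps -e -o comm`)
# and succeeds only when a line matches the wanted name exactly.
process_running() {
    awk -v name="$1" '
        {
            n = split($1, parts, "/")      # strip any leading path
            if (parts[n] == name) found = 1
        }
        END { exit !found }'
}

# Example with canned, made-up input instead of a live process table:
if printf 'cpd\n/usr/local/bin/fwd\n' | process_running fwd; then
    echo "fwd is running"
fi
```

On a live system the call would look like `ps -e -o comm | process_running fwd`.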
UPDATE 19/06/2006:
Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!
#!/usr/bin/bash
#
# Firewall-1 process monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 19-06-2006
#
# Usage: ./check_fwm
#
# Description:
# This plugin determines whether the Firewall-1 management
# software is running properly. It will check the following:
# * Are all required processes running?
# * Are all the required TCP/IP ports open?
#
# Limitations:
# Currently this plugin will only function correctly on Solaris systems.
#
# Output:
# The script returns a CRIT when one of the criteria mentioned
# above is not matched.
#

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
    echo "WARNING:"
    echo "This script was originally written for use on Solaris."
    echo "You may run into some problems running it on this host."
    echo ""
    echo "Please verify that the script works before using it in a"
    echo "live environment. You can easily disable this message after"
    echo "testing the script."
    echo ""
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_fwm"    # used by print_usage, was never set before
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

print_usage() {
    echo "Usage: $PROGNAME"
    echo "Usage: $PROGNAME --help"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "Firewall-1 monitor plugin for Nagios"
    echo ""
    echo "This plugin was not developed by the Nagios Plugin group."
    echo "Please do not e-mail them for support on this plugin, since"
    echo "they won't know what you're talking about :P"
    echo ""
    echo "For contact info, read the plugin itself..."
}

while test -n "$1"
do
    case "$1" in
        --help) print_help; exit $STATE_OK;;
        -h) print_help; exit $STATE_OK;;
        *) print_usage; exit $STATE_UNKNOWN;;
    esac
done

check_processes() {
    PROCESS="0"
    # PROCLIST="cpd fwd fwm cpwd cpca cpmad cplmd cpstat cpshrd cpsnmpd"
    PROCLIST="cpd fwd fwm cpwd cpca cpmad cpstat cpsnmpd"
    for PROC in `echo $PROCLIST`; do
        if [ `ps -ef | grep $PROC | grep -v grep | wc -l` -lt 1 ]; then PROCESS=1; fi
    done

    if [ $PROCESS -eq 1 ]; then
        echo "FWM NOK - One or more processes not running"
        exitstatus=$STATE_CRITICAL
        exit $exitstatus
    fi
}

check_ports() {
    PORTS="0"
    PORTLIST="256 257 18183 18184 18187 18190 18191 18192 18196 18264"
    for NUM in `echo $PORTLIST`; do
        if [ `netstat -an | grep LISTEN | grep $NUM | grep -v grep | wc -l` -lt 1 ]; then PORTS=1; fi
    done

    if [ $PORTS -eq 1 ]; then
        echo "FWM NOK - One or more TCP/IP ports not listening."
        exitstatus=$STATE_CRITICAL
        exit $exitstatus
    fi
}

check_processes
check_ports

echo "FWM OK - Everything running like it should"
exitstatus=$STATE_OK
exit $exitstatus
kilala.nl tags: nagios, unix, programming,
2005-07-01 00:00:00
This script was written at the time I was hired by UPC / Liberty Global.
Basic monitor that checks if Postfix is up and running. It checks for a number of processes and ports.
This script was quickly hacked together for my current customer, as a Q&D solution for their monitoring needs. It's no beauty, but it works. Written in ksh and tested with:
The script sends a Critical if:
A) One or more processes are not running, or
B) One or more ports are not available for connections.
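Condition B deserves a note: grepping netstat output for a bare number such as 25 also matches port 2525 or any address containing "25". The following is a minimal sketch of a stricter test, not the method the plugin uses. The helper name port_listening is an assumption; it reads `netstat -an` output on stdin and accepts both the Solaris `*.25` and the Linux `0.0.0.0:25` notation.

```shell
#!/usr/bin/env bash
# Sketch only: port_listening() is a hypothetical helper.
# It reads `netstat -an` output on stdin and succeeds only when a LISTEN
# line carries an address field that ends in exactly the given port.
port_listening() {
    awk -v port="$1" '
        /LISTEN/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ ("[.:]" port "$")) found = 1
        }
        END { exit !found }'
}

# Example with canned input: port 2525 is listening, port 25 is not.
if printf 'tcp 0 0 0.0.0.0:2525 0.0.0.0:* LISTEN\n' | port_listening 25; then
    echo "SMTP port is listening"
else
    echo "SMTP port is NOT listening"
fi
```

On a live system the call would look like `netstat -an | port_listening 25`.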
UPDATE 19/06/2006:
Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!
#!/usr/bin/bash
#
# Postfix process monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 19-06-2006
#
# Usage: ./check_postfix
#
# Description:
# This plugin determines whether the Postfix SMTP server
# is running properly. It will check the following:
# * Are all required processes running?
# * Are all the required TCP/IP ports open?
#
# Limitations:
# Currently this plugin will only function correctly on Solaris systems.
#
# Output:
# Script returns a CRIT when one of the abovementioned criteria is
# not matched
#

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
    echo "WARNING:"
    echo "This script was originally written for use on Solaris."
    echo "You may run into some problems running it on this host."
    echo ""
    echo "Please verify that the script works before using it in a"
    echo "live environment. You can easily disable this message after"
    echo "testing the script."
    echo ""
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_postfix"    # used by print_usage, was never set before
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

print_usage() {
    echo "Usage: $PROGNAME"
    echo "Usage: $PROGNAME --help"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "Postfix monitor plugin for Nagios"
    echo ""
    echo "This plugin was not developed by the Nagios Plugin group."
    echo "Please do not e-mail them for support on this plugin, since"
    echo "they won't know what you're talking about :P"
    echo ""
    echo "For contact info, read the plugin itself..."
}

while test -n "$1"
do
    case "$1" in
        --help) print_help; exit $STATE_OK;;
        -h) print_help; exit $STATE_OK;;
        *) print_usage; exit $STATE_UNKNOWN;;
    esac
done

check_processes() {
    PROCESS="0"
    PROCLIST="smtpd qmgr pickup master sendmail"
    for PROC in `echo $PROCLIST`; do
        if [ `ps -ef | grep $PROC | grep -v grep | wc -l` -lt 1 ]; then
            if [ $PROC == "smtpd" ]; then
                # A missing smtpd is acceptable as long as proxymap is running.
                if [ `ps -ef | grep proxymap | grep -v grep | wc -l` -lt 1 ]; then
                    PROCESS=1
                else
                    PROCESS=0
                fi
            else
                PROCESS=1
            fi
        fi
    done

    if [ $PROCESS -eq 1 ]; then
        echo "SMTP-S NOK - One or more processes not running"
        exitstatus=$STATE_CRITICAL
        exit $exitstatus
    fi
}

check_ports() {
    PORTS="0"
    PORTLIST="25"
    for NUM in `echo $PORTLIST`; do
        if [ `netstat -an | grep LISTEN | grep $NUM | grep -v grep | wc -l` -lt 1 ]; then PORTS=1; fi
    done

    if [ $PORTS -eq 1 ]; then
        echo "SMTP-S NOK - One or more TCP/IP ports not listening."
        exitstatus=$STATE_CRITICAL
        exit $exitstatus
    fi
}

check_processes
check_ports

echo "SMTP-S OK - Everything running like it should"
exitstatus=$STATE_OK
exit $exitstatus
kilala.nl tags: nagios, unix, programming,
2005-07-01 00:00:00
This script was written at the time I was hired by UPC / Liberty Global.
Basic monitor that checks if the NTP server is up and running. It checks for a process and whether the server's clock has drifted from its higher level Stratum server.
This script was quickly hacked together for my current customer, as a Q&D solution for their monitoring needs. It's no beauty, but it works. Written in ksh and tested with:
The script sends a Critical if:
A) One or more processes are not running, or
B) The server's clock has drifted too far from its higher level Stratum server.
Requires the "check_ntp" plugin which is part of the default monitor package.
UPDATE 19/06/2006:
Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!
#!/usr/bin/bash
#
# NTP server process monitor plugin for Nagios
# Written by Thomas Sluyter (nagiosATkilalaDOTnl)
# By request of DTV Labs, Liberty Global, the Netherlands
# Last Modified: 19-06-2006
#
# Usage: ./check_ntp_s
#
# Description:
# This plugin determines whether the Nagios client is functioning
# properly as an NTP server. It does this by checking:
# * Are all required processes running?
# * Is the server's time up to scratch with its higher stratum server?
#
# Limitations:
# Currently this plugin will only function correctly on Solaris systems.
#
# Output:
# The script returns a CRIT when one of the abovementioned criteria
# is not matched.
#

# Host OS check and warning message
if [ `uname` != "SunOS" ]
then
    echo "WARNING:"
    echo "This script was originally written for use on Solaris."
    echo "You may run into some problems running it on this host."
    echo ""
    echo "Please verify that the script works before using it in a"
    echo "live environment. You can easily disable this message after"
    echo "testing the script."
    echo ""
fi

# You may have to change this, depending on where you installed your
# Nagios plugins
PROGNAME="check_ntp_s"    # used by print_usage, was never set before
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
LIBEXEC="/usr/local/nagios/libexec"
. $LIBEXEC/utils.sh

print_usage() {
    echo "Usage: $PROGNAME"
    echo "Usage: $PROGNAME --help"
}

print_help() {
    echo ""
    print_usage
    echo ""
    echo "NTP server plugin for Nagios"
    echo ""
    echo "This plugin was not developed by the Nagios Plugin group."
    echo "Please do not e-mail them for support on this plugin, since"
    echo "they won't know what you're talking about :P"
    echo ""
    echo "For contact info, read the plugin itself..."
}

while test -n "$1"
do
    case "$1" in
        --help) print_help; exit $STATE_OK;;
        -h) print_help; exit $STATE_OK;;
        *) print_usage; exit $STATE_UNKNOWN;;
    esac
done

check_processes() {
    PROCESS="0"
    if [ `ps -ef | grep xntpd | grep -v grep | grep -v nagios | wc -l` -lt 1 ]; then PROCESS=1; fi

    if [ $PROCESS -eq 1 ]; then
        echo "NTP-S NOK - One or more processes not running"
        exitstatus=$STATE_CRITICAL
        exit $exitstatus
    fi
}

check_time() {
    TIME="0"
    #SERVERS="ntp0.nl.net ntp1.nl.net ntp2.nl.net"
    SERVERS="nl-ams99z-a02-01"
    # One server answering OK is enough, so stop at the first success.
    for SERV in `echo $SERVERS`; do
        if [ `/usr/local/nagios/libexec/check_ntp -H $SERV | awk '{print $2}'` != "OK:" ]; then
            TIME=1
        else
            TIME=0
            break
        fi
    done

    if [ $TIME -eq 1 ]; then
        echo "NTP-S NOK - Time not in synch with higher Stratum."
        exitstatus=$STATE_CRITICAL
        exit $exitstatus
    fi
}

check_processes
check_time

echo "NTP-S OK - Everything running like it should"
exitstatus=$STATE_OK
exit $exitstatus
kilala.nl tags: nagios, unix, programming,
All content, with exception of "borrowed" blogpost images, or unless otherwise indicated, is copyright of Tess Sluijter. The character Kilala the cat-demon is copyright of Rumiko Takahashi and used here without permission.