BoKS troubleshooting: replication of the BoKS database

2008-11-21 21:08:00

If one or more of the replicas are out of sync, login attempts by users may fail if the BoKS client on the server in question happens to be talking to the out-of-sync replica. Other nasty stuff may also occur.

Standard procedure is to follow these steps:

  1. Check the status of all BoKS replicas.
  2. Check BoKS error logs on the master and the replica(s).
  3. Try a forced database download.
  4. Check BoKS replication processes to see if they are all running.
  5. Check the master queue, using the boksdiag fque -master command.
  6. Check BoKS communications, using the cadm command.
  7. Check node keys.
  8. Check the replica server's definition in the BoKS database.
  9. Check the BoKS configuration on the replica.
  10. Debug replication processes.

All commands are run in a BoKS shell, on the master server unless specified otherwise.
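
If you're not already in one: running boksadm -S without a trailing command should start an interactive BoKS shell (shown below as the Keon> prompt). That's an assumption on my part, based on the invocation in step 1, where a single command is passed to boksadm -S instead.

# /opt/boksm/sbin/boksadm -S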



1. Check the status of all BoKS replicas.

# /opt/boksm/sbin/boksadm -S boksdiag list

Since last pckt

The number of minutes/seconds since the BoKS master last sent a communication packet to the respective replica server. This should never exceed a couple of minutes.

Since last fail

The number of days/hours/minutes since the BoKS master was last unable to update the database on the respective replica server. If it only shows a couple of hours, you'll know that the replica server had a recent failure.

Since last sync

Shows the number of days/hours/minutes since BoKS last sent a database update to the respective replica server.

Last status

Yes indeed! The last known status of the replica server in question. OK means that the server is running fine and receiving updates. Loading means that the server was just restarted and is still loading the database or pending updates. Down indicates that the replica server is down or even dead.
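
When troubleshooting an intermittent problem it can help to watch the replica status over time. A minimal watch loop, using nothing but the boksdiag command from above and plain Bourne shell:

Keon> while true; do boksdiag list; sleep 60; done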



2. Check BoKS error logs on the master and the replica(s).

This should be pretty self-explanatory. Read the /var/opt/boksm/boks_errlog file on both the master and the replicas to see if you can detect any errors there. If the log file mentions anything about the hosts involved, you should be able to find the cause of the problem pretty quickly.
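
For example, to pull up the most recent entries and anything mentioning a particular replica ($replica being a placeholder for its hostname):

Keon> tail -100 /var/opt/boksm/boks_errlog

Keon> grep -i $replica /var/opt/boksm/boks_errlog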



3. Try a forced database download.

Keon> boksdiag download -force $hostname

This will push a database update to the replica. Perform another boksdiag list to see if it worked. Re-read the BoKS error log file to see if things have cleared up.
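
Put together, a quick verification pass could look like this:

Keon> boksdiag download -force $hostname

Keon> boksdiag list

Keon> tail /var/opt/boksm/boks_errlog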



4. Check BoKS replication processes to see if they are all running.

Keon> ps -ef | grep -i drainmast

This should show two drainmast processes running. If there aren't two, you should see errors about this in the error logs and in Tivoli.
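
A quick way to count them, filtering out the grep process itself:

Keon> ps -ef | grep -i drainmast | grep -v grep | wc -l

This should print 2. If it doesn't, restart the BoKS software as follows: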

Keon> Boot -k

Keon> ps -ef | grep -i boks (kill any remaining BoKS processes)

Keon> Boot

Check to see if the two drainmast processes stay up. Keep checking for at least two minutes. If one of them crashes again, try the following:

Check to see that /opt/boksm/lib/boks_drainmast is still linked to boks_drainmast_d, which should be in the same directory. Also check to see that boks_drainmast_d is still the same file as boks_drainmast_d.nonstripped.

If it isn't, copy boks_drainmast_d to boks_drainmast_d.orig and then copy the non-stripped version over boks_drainmast_d. This will allow a core file to be created, which is useful to TFS Technology.
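
A sketch of those checks and the copy, assuming the files all live in /opt/boksm/lib as mentioned above (cmp prints nothing if the files are identical):

Keon> ls -l /opt/boksm/lib/boks_drainmast

Keon> cmp /opt/boksm/lib/boks_drainmast_d /opt/boksm/lib/boks_drainmast_d.nonstripped

Keon> cp /opt/boksm/lib/boks_drainmast_d /opt/boksm/lib/boks_drainmast_d.orig

Keon> cp /opt/boksm/lib/boks_drainmast_d.nonstripped /opt/boksm/lib/boks_drainmast_d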

Keon> Boot -k

Keon> Boot

Keon> ls -al /core

Check that the core file was just created by boks_drainmast_d.

Keon> Boot -k

Keon> cd /var/opt/boksm/data

Keon> tar -cvf masterspool.tar master_spool

Keon> rm master_spool/*

Keon> Boot

Things should now be back to normal. Send both the tar file and the core file to TFS Technology (support@tfstech.com).



5. Check the master queue.

Keon> boksdiag fque -master

If any messages are stuck there, most likely something is still wrong with the drainmast processes. You may want to try restarting the BoKS software on the master. Do NOT reboot the master server itself! Restart the software using the Boot command. If that doesn't help, go through the troubleshooting tips from step 4.
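
A software-only restart plus a re-check of the queue, using the same Boot command from step 4:

Keon> Boot -k

Keon> ps -ef | grep -i boks (kill any remaining BoKS processes)

Keon> Boot

Keon> boksdiag fque -master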



6. Check BoKS communications, using the cadm command.

Verify that the BoKS communication between the master and the replica is up and running.

Keon> cadm -l -f bcastaddr -h $replica

If this doesn't work, re-check the error logs on the replica and proceed with step 7.



7. Check node keys.

On the replica system run:

Keon> hostkey

Take the output from that command and run the following on the master:

Keon> dumpbase | grep $hostkey

If this doesn't return the configuration for the replica server, the keys have become unsynchronized. If you make any changes you will need to restart the BoKS processes, using the Boot command.
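
If you have ssh access from the master to the replica, the two steps can be combined into one line. This sketch assumes that hostkey prints nothing but the key itself:

Keon> dumpbase | grep "`ssh $replica /opt/boksm/sbin/boksadm -S hostkey`"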



8. Check the replica server's definition in the BoKS database.

Keon> dumpbase | grep RNAME | grep $replica

The TYPE field in the definition of the replica should be set to 261. Anything else is wrong, so you'll need to update the configuration in the BoKS database, or have SecOPS do it for you.
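
A quick check along those lines, assuming dumpbase renders the field literally as TYPE=261 (if your version formats it differently, just eyeball the TYPE field in the output of the command above):

Keon> dumpbase | grep RNAME | grep $replica | grep TYPE=261

If this prints nothing, the definition is wrong.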



9. Check the BoKS configuration on the replica.

On the replica system, review the settings in /etc/opt/boksm/ENV.
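
One practical approach is to diff the file against the ENV of a replica that is known to work. Here $goodreplica is a hypothetical known-good host, and ssh access between the two is assumed:

Keon> ssh $goodreplica cat /etc/opt/boksm/ENV | diff - /etc/opt/boksm/ENV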



10. Debug replication processes.

If all of the above fails, you should really get cracking with the debugger. Refer to the appropriate chapter of this manual for details.



