Troubleshooting BoKS fault situations

2008-01-01 00:00:00

A PDF version of this document is available. Get it over here.

1.1 Verifying the proper functioning of a BoKS client

People have often asked me how one can check whether a newly installed BoKS client is functioning properly. With these three easy steps you too can become a milliona..!!.... Oops... Wrong show! These easy steps will show you whether your new client is working like it should; a quick command sketch follows the list.

  1. Check the boks_errlog in $BOKS_var.
  2. Run cadm -l -f bcastaddr -h $client from the BoKS master (in a BoKS shell).
  3. Try to log in to the new client.
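
For reference, a minimal sketch of those three checks; $client is a placeholder for the new client's hostname:

# tail /var/opt/boksm/boks_errlog (on the new client; there should be no recent errors)

Keon> cadm -l -f bcastaddr -h $client (on the master, in a BoKS shell)

$ ssh $client (from your workstation; a normal login should succeed)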

If all three steps go through without error, your system is as healthy as a very healthy good thing... or something.

1.2 SCENARIO: The BoKS master is not replicating to a replica (or all replicas)

Since one or more of the replicas are out of sync, login attempts by users may fail, assuming that the BoKS client on the server in question was looking at the out-of-sync BoKS replica. Other nasty stuff may also occur.

Standard procedure is to follow these steps:

  1. Check the status of all BoKS replicas.
  2. Check BoKS error logs on the master and the replica(s).
  3. Try a forced database download.
  4. Check BoKS replication processes to see if they are all running.
  5. Check the master queue, using the boksdiag fque -master command.
  6. Check BoKS communications, using the cadm command.
  7. Check node keys.
  8. Check the replica server's definition in the BoKS database.
  9. Check the BoKS configuration on the replica.
  10. Debug replication processes.

All commands are run in a BoKS shell, on the master server unless specified otherwise.

1. Check the status of all BoKS replicas.

# /opt/boksm/sbin/boksadm -S boksdiag list

The output lists the following fields for each replica server:

Since last pckt: The amount of minutes/seconds since the BoKS master last sent a communication packet to the respective replica server. This amount should never exceed a couple of minutes.

Since last fail: The amount of days/hours/minutes since the BoKS master was last unable to update the database on the respective replica server. If an amount of a couple of hours is listed, you'll know that the replica server had a recent failure.

Since last sync: Shows the amount of days/hours/minutes since BoKS last sent a database update to the respective replica server.

Last status: Yes indeed! The last known status of the replica server in question. OK means that the server is running perfectly and that updates are received. Loading means that the server was just restarted and is still loading the database or any updates. Down indicates that the replica server is down or even dead.

2. Check BoKS error logs on the master and the replica(s).

This should be pretty self-explanatory. Read the /var/opt/boksm/boks_errlog file on both the master and the replicas to see if you can detect any errors there. If the log file mentions anything about the hosts involved, you should be able to find the cause of the problem pretty quickly.
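
A quick way to pull the relevant lines out of those logs; $replica is a placeholder for the replica's hostname:

# grep -i $replica /var/opt/boksm/boks_errlog (on the master)
# tail -50 /var/opt/boksm/boks_errlog (on the replica itself)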

3. Try a forced database download.

Keon> boksdiag download -force $hostname

This will push a database update to the replica. Perform another boksdiag list to see if it worked. Re-read the BoKS error log file to see if things have cleared up.

4. Check BoKS replication processes to see if they are all running.

Keon> ps -ef | grep -i drainmast

This should show two drainmast processes running. If there aren't, you should see errors about this in the error logs and in Tivoli.

Keon> Boot -k
Keon> ps -ef | grep -i boks (kill any remaining BoKS processes)
Keon> Boot

Check to see if the two drainmast processes stay up. Keep checking for at least two minutes. If one of them crashes again, try the following:

Check to see that /opt/boksm/lib/boks_drainmast is still linked to boks_drainmast_d, which should be in the same directory. Also check to see that boks_drainmast_d is still the same file as boks_drainmast_d.nonstripped.

If it isn't, copy boks_drainmast_d to boks_drainmast_d.orig and then copy the non-stripped version over boks_drainmast_d. This will allow you to create a core file, which is useful to TFS Technology.
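
A minimal sketch of that check and swap, using the paths named above:

Keon> ls -l /opt/boksm/lib/boks_drainmast* (verify the symlink and both files are present)
Keon> cmp /opt/boksm/lib/boks_drainmast_d /opt/boksm/lib/boks_drainmast_d.nonstripped (no output means the files are identical)
Keon> cp /opt/boksm/lib/boks_drainmast_d /opt/boksm/lib/boks_drainmast_d.orig
Keon> cp /opt/boksm/lib/boks_drainmast_d.nonstripped /opt/boksm/lib/boks_drainmast_d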

Keon> Boot -k
Keon> Boot
Keon> ls -al /core

Check that the core file was just created by boks_drainmast_d.

Keon> Boot -k
Keon> cd /var/opt/boksm/data
Keon> tar -cvf masterspool.tar master_spool
Keon> rm master_spool/*
Keon> Boot

Things should now be back to normal. Send both the tar file and the core file to TFS Technology (support@tfstech.com).

5. Check the master queue.

Keon> boksdiag fque -master

If any messages are stuck, there is most likely still something wrong with the drainmast processes. You may want to try restarting the BoKS master software. Do NOT reboot the master server! Restart only the software, using the Boot command. If that doesn't help, perform the troubleshooting tips from step 4.
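
After the software restart, re-check the queue to confirm that the stuck messages are being drained:

Keon> boksdiag fque -master (the previously stuck messages should gradually disappear)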

6. Check BoKS communications, using the cadm command.

Verify that the BoKS communication between the master and the replica itself is up and running.

Keon> cadm -l -f bcastaddr -h $replica

If this doesn't work, re-check the error logs on the replica and proceed with step 7.

7. Check node keys.

On the replica system run:

Keon> hostkey

Take the output from that command and run the following on the master:

Keon> dumpbase | grep $hostkey

If this doesn't return the configuration for the replica server, the keys have become unsynchronized and will need to be reset. If you make any changes, you will need to restart the BoKS processes using the Boot command.

8. Check the replica server's definition in the BoKS database.

Keon> dumpbase | grep RNAME | grep $replica

The TYPE field in the definition of the replica should be set to 261. Anything else is wrong, so you need to update the configuration in the BoKS database. Either that, or have SecOPS do it for you.

9. Check the BoKS configuration on the replica.

On the replica system, review the settings in /etc/opt/boksm/ENV.
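
One quick sanity check is to compare it against the ENV file of a replica that is known to work; $good_replica below is a hypothetical placeholder for such a host:

# ssh $good_replica cat /etc/opt/boksm/ENV > /tmp/ENV.good
# diff /etc/opt/boksm/ENV /tmp/ENV.good (any differences are worth a closer look)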

10. Debug replication processes.

If all of the above fails, you should really get cracking with the debugger. Refer to the appropriate chapter of this manual for details (see SCENARIO: Setting a trace within BoKS).

1.3 SCENARIO: You can't log in to a BoKS client

Most obviously, we can't do our work on that particular server and neither can our customers. Naturally this is something that needs to be fixed quite urgently!

  1. Check BoKS transaction log.
  2. Check if you can log in.
  3. Check BoKS communications.
  4. Check bcastaddr and bremotever files.
  5. Check BoKS port number.
  6. Check node keys.
  7. Check BoKS error logs.
  8. Debug servc process on replica server or relevant process on client.

All commands are run in a BoKS shell, on the master server unless specified otherwise.

1. Check BoKS transaction log.

Keon> cd /var/opt/boksm/data

Keon> grep $user LOG | bkslog -f - -wn

This should give you enough output to ascertain why a certain user cannot log in. If there is no output at all, do the following:

Keon> cd /var/junkyard/bokslogs

Keon> for file in `ls -lrt | tail -5 | awk '{print $9}'`
> do
> grep $user $file | bkslog -f - -wn
> done

If this doesn't provide any output, perform step 2 as well to see if we sysadmins can log in.

2. Check if you can log in.

Pretty self-explanatory, isn't it? See if you can log in yourself.

3. Check BoKS communications.

Keon> cadm -l -f bcastaddr -h $client

4. Check bcastaddr and bremotever files.

Log in to the client through its console port.

Keon> cat /etc/opt/boksm/bcastaddr

Keon> cat /etc/opt/boksm/bremotever

These two files should match the same files on another working client. Do not use a replica or master to compare the files; these are different over there. If you make any changes, you will need to restart the BoKS processes using the Boot command.
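
A rough comparison sketch, run from the console session on the broken client; $good_client is a hypothetical placeholder for a client that is known to work:

# ssh $good_client cat /etc/opt/boksm/bcastaddr > /tmp/bcastaddr.good
# diff /etc/opt/boksm/bcastaddr /tmp/bcastaddr.good
# ssh $good_client cat /etc/opt/boksm/bremotever > /tmp/bremotever.good
# diff /etc/opt/boksm/bremotever /tmp/bremotever.good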

5. Check BoKS port number.

On the client and master run:

Keon> getent services boks

This should return the same value for the BoKS base port. If it doesn't, check either /etc/services or NIS+. If you make any changes, you will need to restart the BoKS processes using the Boot command.
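
If the values differ, a minimal check on both hosts (assuming the entries live in the local /etc/services file rather than NIS+):

Keon> grep -i boks /etc/services (the base port number should be identical on the client and the master)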

6. Check node keys.

On the client system run:

Keon> hostkey

Take the output from that command and run the following on the master:

Keon> dumpbase | grep $hostkey

If this doesn't return the definition for the client, the keys have become unsynchronized. Reset them and restart the BoKS client software. If you make any changes, you will need to restart the BoKS processes using the Boot command.

7. Check BoKS error logs.

This should be pretty self-explanatory. Read the /var/opt/boksm/boks_errlog file on both the master and the client to see if you can detect any errors there. If the log file mentions anything about the hosts involved, you should be able to find the cause of the problem pretty quickly.

8. Debug servc process on replica server or relevant process on client.

If all of the above fails, you should really get cracking with the debugger. Refer to the appropriate chapter of this manual for details (see SCENARIO: Setting a trace within BoKS).

NOTE: If you need to restart the BoKS software on the client without logging in, try doing so using a remote management tool, like Tivoli.

1.4 SCENARIO: The BoKS client queues are filling up

The whole of BoKS is still up and running and everything's working perfectly. The only client(s) that won't work are the one(s) that have stuck queues. The only way you'll find out about this is by running boksdiag fque -bridge, which reports all of the queues that are stuck.

  1. Check if client is up and running.
  2. Check BoKS communications.
  3. Check node keys.
  4. Check BoKS error logs.

All commands are run in a BoKS shell, on the master server unless specified otherwise.

1. Check if client is up and running.

Keon> ping $client

Also ask your colleagues to see if they're working on the system. Maybe they're performing maintenance.

2. Check BoKS communications.

Keon> cadm -l -f bcastaddr -h $client

3. Check node keys.

On the client system run:

Keon> hostkey

Take the output from that command and run the following on the master:

Keon> dumpbase | grep $hostkey

If this doesn't return the definition for the client, the keys have become unsynchronized. Reset them and restart the BoKS client software using the Boot command.

4. Check BoKS error logs.

This should be pretty self-explanatory. Read the /var/opt/boksm/boks_errlog file on both the master and the client to see if you can detect any errors there. If the log file mentions anything about the hosts involved, you should be able to find the cause of the problem pretty quickly.

NOTE: What can we do about it?

If you're really desperate to get rid of the queue, do the following:

Keon> boksdiag fque -bridge -delete $client-ip

At one point in time we thought it would be wise to manually delete messages from the spool directories. Do not under any circumstance touch the crypt_spool and master_spool directories in /var/opt/boksm. Really: DON'T DO THIS! This is unnecessary and will lead to troubles with BoKS.

1.5 SCENARIO: Setting a trace within BoKS

We are required to run a BoKS debug trace when either:

  1. People are unable to log in without any apparent reason. A debug trace will show why logins are getting rejected.
  2. We have run into a bug or a problem with BoKS which cannot easily be dealt with through e-mail. TFS Tech support will usually request us to perform a number of traces and send them the output files.

First off, let me warn you: debug trace log files can grow pretty vast pretty fast! Make sure that you turn on the trace only right before you're ready to use the faulty part of BoKS, and also be sure to stop the trace immediately once you're done.
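
Before you start, it can't hurt to confirm that there is room for the trace files; the examples below write to /var/tmp:

# df -k /var/tmp (make sure there is ample free space)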

Now, before you can start a trace you will need to make sure that the BoKS client system only performs transactions with one BoKS server. If you don't, you will have no way of knowing on which server you should run the trace.

Log in to the client system experiencing problems.

$ su -

# cd /etc/opt/boksm

# cp bcastaddr bcastaddr.orig

# vi bcastaddr

Edit the file in such a way that it only points to one of the available BoKS servers, preferably a BoKS replica. Please refrain from using the BoKS master server.

# /opt/boksm/sbin/boksadm -S Boot -k
# sleep 10; ps -ef | grep -i boks | awk '{print $2}' | xargs kill
# /opt/boksm/sbin/boksadm -S Boot

Now, how you proceed depends on what problems you are experiencing.

If people are having problems logging in:

Log in to the replica server and start a BoKS shell with sx.

# sx /opt/boksm/sbin/boksadm -S

# cd /var/tmp

Now, type the following command, but DO NOT press enter yet.

# bdebug -x 9 bridge_servc_r -f /var/tmp/BR-SERVC.trace

Open a new terminal window, because we will try to log in to the failing client. BEFORE YOU START THE TOOL USED TO LOG IN (SSH, Telnet, FTP, whatever), press enter at the command waiting on the replica server. Attempt to log in as usual. If it fails, you have successfully set a trace. Switch back to the window on the replica server and run the following command to stop the trace.

# bdebug -x 0 bridge_servc_r

Repeat the same process once more, but this time around debug the servc process instead of bridge_servc_r. Send the output to /var/tmp/SERVC.trace.
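
Spelled out, that second pass follows the same start/stop pattern as above:

# bdebug -x 9 servc -f /var/tmp/SERVC.trace (press enter right before the login attempt)
# bdebug -x 0 servc (run this once the login attempt has failed)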

You can now read through the files /var/tmp/BR-SERVC.trace and /var/tmp/SERVC.trace to troubleshoot the problem yourself, or you can send them to TFS Tech for analysis. If the attempted login did NOT fail, there's something else going on: one of the other replica servers is not working properly! Find out which one it is by changing the client's bcastaddr file, each time using a different BoKS server as a target.

If you are attempting to troubleshoot another kind of problem:

Tracing any other part of BoKS isn't really all that different from tracing the login process. You prepare in the same way (make bcastaddr point at one BoKS server) and you will probably have to prepare the trace on bridge_servc_r as well (see the text block above; if you do not have to trace bridge_servc_r, TFS Tech will probably tell you so).

Yet again, BEFORE you start the trace on the master side by running

# bdebug -x 9 bridge_servc_r -f /var/tmp/SERVC.trace

you will have to go to the client system with the problematic situation and perform the following:

# cd /var/tmp

# bdebug -x 9 $PROG -f /var/tmp/$PROG.trace

$PROG in this case is the name of the BoKS process (bridge_servc_r, drainmast_download) or the access method (login, su, sshd) that you want to debug.

Now, start both traces and attempt to perform the task that is failing. Once it has failed, stop both traces again using bdebug -x 0 $PROG.
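
As an illustration, assuming the failing task is an SSH login and you want to trace the sshd access method on the client:

# cd /var/tmp
# bdebug -x 9 sshd -f /var/tmp/sshd.trace (reproduce the failing SSH login while the trace runs)
# bdebug -x 0 sshd (stop the trace once the attempt has failed)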

1.6 SCENARIO: Debugging the BoKS SSH daemon

From time to time you may have problems with the BoKS SSH daemon which cannot be explained in any logical way. At such a time, a debug trace of the SSH daemon can be very helpful! This can be done by temporarily starting a second daemon on an unused port.

On the troubled system, log in and start a BoKS shell:

# /opt/boksm/sbin/boksadm -S

Keon> boks_sshd -d -d -d -p 24 > /tmp/sshd.out 2>&1

From another system:

$ ssh -l $username -p 24 $target-host

Try logging in; it shouldn't work :) Now close the SSH session with Ctrl-C, which should also close the temporary SSH daemon on port 24. /tmp/sshd.out should now contain all of the debugging information you or TFS Technology could need.
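
A final note: if the second daemon refuses to start, the chosen port may already be in use. A generic check on the troubled system:

# netstat -an | grep 24 (look for anything already listening on port 24 and pick another unused port if so)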

