BoKS troubleshooting: another example of a debugging session

2008-11-22 20:29:00

Original issues

As I mentioned at the end of example 1, the problem with the seemingly random login denials was caused by a misbehaving replica server. We tracked the problem down to REPHOST, where we discovered that three of the database tables were not in sync with the rest. A number of hosts were being reported as non-existent, which was causing login problems for our users.

Now that we had figured out which server was giving us problems and what the symptoms were, we needed to find out what was causing the issues.

Symptoms

One of our replica servers had three database tables that were not getting any updates. Their sequence numbers as reported by "boksdiag sequence" were very different from the sequence numbers on the master, indicating nastiness.

Diagnosis

1. Verify the sequence numbers again

Just to be sure that the replica is still malfunctioning, let's check the sequence numbers again.

BoKS > boksdiag sequence

...

T7 13678d 8:33:46 5053 (5053)

...

T9 13178d 11:05:23 7919 (7919)

...

T15 13178d 11:03:16 1865 (1865)

...



BoKS > boksdiag sequence

...

T7 13678d 8:33:46 6982 (6982)

...

T9 13178d 11:05:23 10258 (10258)

...

T15 13178d 11:03:16 2043 (2043)

Yup, it's still broken :) You may notice that the sequence numbers on the replica are actually AHEAD of the numbers on the master server.
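If you don't feel like eyeballing two sets of output, a quick diff of the table and sequence columns does the trick. This is just a sketch: it assumes you run it on the master, that "boksdiag" is in the PATH on both hosts, that REPHOST is reachable over SSH and that the lines of interest look like the "T7 13678d 8:33:46 5053 (5053)" lines shown above.

# Sketch: compare the replica's sequence numbers against the master's.
boksdiag sequence | awk '/^T/ {print $1, $4}' > /tmp/seq.master
ssh REPHOST "boksdiag sequence" | awk '/^T/ {print $1, $4}' > /tmp/seq.replica
diff /tmp/seq.master /tmp/seq.replica

Any table that shows up in the diff is out of sync; in our case T7, T9 and T15 kept showing up, with the replica's numbers higher than the master's.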

2. Demoting the replica to client status

Because I was not sure what had been done to REPHOST in the past, I wanted to reset it completely without reinstalling the software. I knew that the host had been involved in a disaster recovery test a few months before, so I had a hunch that something had gone awry in the conversion between the various host states.

Hence I chose to convert the replica back to client status.

BoKS > sysreplace restore



BoKS > convert -v client

Stopping daemons...

Setting BOKSINIT=client in ENV file...

Restarting daemons...

Conversion from replica to client done.



BoKS > cd /var/opt/boksm



BoKS > tail -20 boks_errlog

...

WARNING: Dying on signal SIGTERM

boks_authd Nov 17 14:59:30

INFO: Shutdown by signal SIGTERM

boks_csspd@REPHOST Nov 17 14:59:30

INFO: Shutdown by signal SIGTERM



boks_authd Nov 17 14:59:30

INFO: Min idle workers 32

boks_csspd@REPHOST Nov 17 14:59:30

INFO: Min idle workers 32



BoKS > sysreplace replace

I verified that all the running BoKS processes were newly created and that there were no stragglers from before the restart. I also tried to SSH into the replica, to make sure that I could still log in.
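A hedged example of what that check can look like (run the ps on REPHOST itself, the SSH test from any other machine; standard ps and grep only, with the "[b]oks" pattern keeping the grep itself out of the listing, and "myuser" just a placeholder account):

# Sketch: all listed start times should be from after the restart.
ps -ef | grep '[b]oks'

# Quick login test; any account that goes through BoKS will do.
ssh myuser@REPHOST uptime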

3. Change the replica's type in the database

The BoKS master server also needed to know that the replica was now a client, which meant changing the host's TYPE in the database. Initially I tried doing this with the following command.

BoKS > hostadm -a -h REPHOST -t UNIXBOKSHOST

Unfortunately this command refused to work, so I modified the host type through the BoKS web interface instead. Just a matter of a few clicks here and there. Afterwards the BoKS master was aware that the replica was no more.

BoKS > boksdiag list

Server Since last Since last Since last Count Last

REPHOST5 00:49 523d 5:19:20 04:49 1853521 ok

REPHOST4 00:49 136d 22:21:35 04:49 526392 ok

REPHOST3 00:49 04:50 726768 ok

REPHOST2 00:49 107d 5:05:33 04:49 425231 ok

REPHOST 02:59 02:13 11:44 148342 down

It took a little while for REPHOST's entry to completely disappear from the "boksdiag list" output. I sped things up a little bit by restarting the BoKS master using the "Boot -k" and "Boot" commands.
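For reference, that restart boils down to just the two commands named above, run as root on the master:

# Stop the BoKS daemons, then start them again.
Boot -k
Boot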

4. Convert the host back to a replica

Of course I wanted REPHOST to be a replica again, so I changed the host type in the database using the webinterface.

I then ran the "convert" command on REPHOST to promote the host again.

BoKS > convert -v replica

Checking to see if a master can be found...

Stopping daemons...

Setting BOKSINIT=replica in ENV file...

Restarting daemons...

Conversion from client to replica done.



BoKS > ps -ef | grep -i boks

root 16543 16529 0 15:14:33 ? 0:00 boks_bridge -xn -s -l servc.s -Q !/etc/opt/boksm!.servc!servc_queue -q /etc/opt

root 16536 16529 0 15:14:33 ? 0:00 boks_servc -p1 -xn -Q !/etc/opt/boksm!.xservc1!xservc_queue

root 16535 16529 0 15:14:33 ? 0:00 boks_servm -xn

root 16529 1 0 15:14:33 ? 0:00 boks_init -f /etc/opt/boksm/boksinit.replica

root 16540 16529 0 15:14:33 ? 0:00 boks_bridge -xn -r -l servc.r -Q /etc/opt/boksm/xservc_queue -P servc -k -K /et

root 16552 16529 0 15:14:33 ? 0:00 boks_csspd -e/var/opt/boksm/boks_errlog -x -f -c -r 600 -l -k -t 32 -i 20 -a 15

root 16533 16529 0 15:14:33 ? 0:00 boks_bridge -xn -s -l master.s -Q /etc/opt/boksm/master_queue -P master -k -K /

...

...



BoKS > cd ..



BoKS > tail boks_errlog

boks_authd Nov 17 14:59:30

INFO: Min idle workers 32

boks_csspd@REPHOST Nov 17 14:59:30

INFO: Min idle workers 32



boks_init@REPHOST Mon Nov 17 15:02:21 2008

WARNING: Respawn process sshd exited, reason: exit(1). Process restarted.

boks_init@REPHOST Mon Nov 17 15:14:31 2008

WARNING: Dying on signal SIGTERM

boks_aced Nov 17 15:14:33

ERROR: Unable to access configuration file /var/ace/sdconf.rec



On the master server I saw that the replica was communicating with the master again.

BoKS > boksdiag list

Server Since last Since last Since last Count Last

REPHOST5 04:35 523d 5:33:41 06:39 1853555 ok

REPHOST4 04:35 136d 22:35:56 06:42 526426 ok

REPHOST3 04:35 06:43 726802 ok

REPHOST2 04:35 107d 5:19:54 06:41 425265 ok

REPHOST 01:45 16:34 26:05 0 new

Oddly enough, REPHOST was not receiving any real database updates. I also noticed that the sequence numbers for the local database copy hadn't changed. This was a hint that stuck in the back of my head, but I didn't pursue it at the time. Instead, I suspected a problem with the communications bridges between the master and REPHOST.

BoKS > ls -lrt

...

...

-rw-r----- 1 root root 0 Nov 17 15:14 copsable.dat

-rw-r----- 1 root root 0 Nov 17 15:14 cert2user.dat

-rw-r----- 1 root root 0 Nov 17 15:14 cert.dat

-rw-r----- 1 root root 0 Nov 17 15:14 ca.dat

-rw-r----- 1 root root 0 Nov 17 15:14 authenticator.dat

-rw-r----- 1 root root 0 Nov 17 15:14 addr.dat



BoKS >

5. Verify that everything's okay on the replica

I was rather confused by now. Because REPHOST wasn't getting database updates I went through all the basic checks I could think of.

Everything seemed completely fine! It was time to break out the big guns.

6. Clear out the database

I decided to clear out the whole local copy of the database, to make sure that REPHOST got a clean start.

BoKS > Boot -k



BoKS > cd /var/opt/boksm



BoKS > tar -cvf data.20081117.tar data/*

a data/ 0K

a data/crypt_spool/ 0K

a data/crypt_spool/clntd/ 0K

a data/crypt_spool/clntd/ba_fbuf_LCK 0K

a data/crypt_spool/clntd/ba_fbuf_0000000004 6K

a data/crypt_spool/clntd/ba_fbuf_0000000003 98K

a data/crypt_spool/servc/ 0K

a data/crypt_spool/servm/ 0K

...



BoKS > cd data



BoKS > rm *.dat



BoKS > Boot

Checking the contents of /var/opt/boksm/data immediately afterwards showed that BoKS had re-created the database table files. Some of them were getting updates, but over 90% of the tables remained completely empty.
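A quick way to spot the stuck tables is to list the table files that are still zero bytes. A small sketch, using the same path as above:

# Sketch: print the names of all empty database table files.
cd /var/opt/boksm/data
ls -l *.dat | awk '$5 == 0 {print $NF}'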

7. Debugging the communications bridges

As I've explained in an earlier article, it's possible to trace the internal workings of just about every BoKS process. This includes the various communications bridges that connect the BoKS hosts.

I decided to use "bdebug" on the "servm_r" and "servm" processes on REPHOST, while also debugging "drainmast" and "drainmast_s" on the master server. The flow of data starts at drainmast, then goes through drainmast_s and servm_r, to finally end up in servm on the replica. Drainmast is what sends data to the replicas and servm is what commits the received changes to the local database copy.

Unfortunately the trace output didn't show anything remarkable, so I won't go over the details.

8. Calling in tech support

By now I'd drained all my inspiration. I had no clue what was going on and I was one and a half hours into an incident that should've taken half an hour to fix. Since I always say that one should know one's limitations, I decided to call in FoxT's tech support. Because it was already 1600 and I wanted to have the issue resolved before I went home, I called their international support number.

I submitted all the requested files to my engineer at FoxT, who was still investigating the case around 1800. Unfortunately things had gone a bit wrong in the handover between the day and the night shift, so my case had gotten lost. I finally got a call back from an engineer in the US at 2000. I talked things over with him and something in our call triggered that little voice stuck in the back of my head: sequence numbers!

The engineer advised me to go ahead and clear the sequence numbers file on REPHOST. At the same time I also deleted the database files again, for a really clean start.

BoKS > Boot -k



BoKS > cd /var/opt/boksm



BoKS > tar -cvf data.20081117-2.tar data/*

...



BoKS > cd data



BoKS > rm *.dat



BoKS > rm sequence



BoKS > Boot

Lo and behold! The database copy on REPHOST was being updated! All of the tables were getting filled again, including the three tables that had been stuck from the beginning.

The engineer informed me that in BoKS 6.5 the "convert" command is supposed to clear out the database and sequence file when demoting a master/replica to client status. Apparently this is NOT done automatically in BoKS versions 6.0 and lower.

In conclusion

We discovered that the host had at one point played the role of master server and that there was still some leftover crap from that time. During REPHOST's stint as the master, the sequence numbers for tables 7, 9 and 15 had gotten ahead of the sequence numbers on the real master, which was turned off at the time. This happened because those three tables were edited extensively during the original master's downtime. This in turn meant the tables never received updates again: with the replica's sequence numbers already ahead of the master's, the replication mechanism apparently considered those tables up to date and skipped them.

After the whole mess was fixed we concluded that the following four steps are all you need to restart your replica in a clean state (there's a small script sketch right after the list).

  1. Stop the BoKS software on REPHOST.
  2. Delete all the .dat files in $BOKS_var/data.
  3. Delete the sequence file from $BOKS_var/data.
  4. Restart the BoKS software on REPHOST.
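In script form that boils down to something like the sketch below. It is not an official FoxT procedure; the commands and paths are simply the ones used earlier in this article, and the tar backup is optional but cheap insurance.

# Sketch: give a BoKS replica a clean start. Run as root on the replica.
Boot -k                                   # stop the BoKS daemons
cd /var/opt/boksm
tar -cvf data.`date +%Y%m%d`.tar data/*   # optional backup of the old tables
cd data
rm *.dat                                  # remove the local database copy
rm sequence                               # remove the sequence numbers file
Boot                                      # restart; the replica pulls a fresh copy from the master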

I've also asked the folks at FoxT to file a bugfix request with their developers. As I mentioned in step 1, the sequence numbers on the replica were ahead of those on the master. Realistically speaking this should never happen, but BoKS does not currently recognize this situation as a failure.

In the meantime I will write a monitoring script for Nagios and Tivoli that monitors the proper replication of the BoKS database.
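As a starting point it will probably look something like the sketch below: it compares the sequence numbers reported by "boksdiag sequence" on the master with those reported by a replica and exits with a Nagios-style CRITICAL when a table is ahead on the replica or lags too far behind. The replica name, the SSH setup and the lag threshold are my own assumptions, not anything FoxT ships.

#!/bin/sh
# check_boks_replication (sketch): compare BoKS sequence numbers between
# the master (where this runs) and a replica. Table lines are assumed to
# look like "T7 13678d 8:33:46 5053 (5053)", as shown earlier.
REPLICA=${1:-REPHOST}
MAXLAG=${2:-100}

boksdiag sequence | awk '/^T/ {print $1, $4}' > /tmp/boksseq.master.$$
ssh "$REPLICA" "boksdiag sequence" | awk '/^T/ {print $1, $4}' > /tmp/boksseq.replica.$$

STATUS=0
while read TABLE MASTERSEQ; do
    REPLSEQ=`grep "^$TABLE " /tmp/boksseq.replica.$$ | awk '{print $2}'`
    [ -z "$REPLSEQ" ] && continue
    if [ "$REPLSEQ" -gt "$MASTERSEQ" ]; then
        echo "CRITICAL: $TABLE on $REPLICA is ahead of the master ($REPLSEQ > $MASTERSEQ)"
        STATUS=2
    elif [ `expr "$MASTERSEQ" - "$REPLSEQ"` -gt "$MAXLAG" ]; then
        echo "CRITICAL: $TABLE on $REPLICA lags more than $MAXLAG behind the master"
        STATUS=2
    fi
done < /tmp/boksseq.master.$$
rm -f /tmp/boksseq.master.$$ /tmp/boksseq.replica.$$

[ "$STATUS" -eq 0 ] && echo "OK: $REPLICA is in sync with the master"
exit $STATUS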

