2008-11-21 21:52:00
Recently we ran into a rather perplexing problem: a few of our customers had intermittent login problems. There seemed to be no pattern to this issue, with users from different departments being deing access to their servers at random points in time. Sometimes the problem would go away after a few hours, sometimes it took a few days. It took a few days before the penny dropped and we found out that one of our replica servers was misbehaving.
The paragraphs below outline my diagnosis and troubleshooting procedure.
The issues seemed to focus on servers in one specific, physical location.
One of our DBAs created several incidents over the course of a month regarding login issues with user sybase@SYBHOST. Initially this problem was fixed by adding the "ssh_pk" authenticator, but the problem returned with intermittent login denial without an apparent reason.
A number of users from another department indicated intermittent login problems where they were allowed to login one day and denied access the next. My troubleshooting of the problem hadn't given me any real results so far. I'd ran debugging on SSH sessions which didn't clear much up.
For the remainder of this document I will focus on my troubleshooting process for the case involving user sybase.
These denials occur at seemingly random intervals and result in varying BoKS error messages. Most frequent is the rather useless "ERR 223, no authentication" which, as Fox Tech confirms, tells us absolutely nothing. At other times users receive an "ERR 203, no access route" eventhough said user does in fact have the requisite access routes.
In this case the DBAs attempt to use SSH (with keypair authentication) from sybase@UNIXHOST, to sybase@SYBHOST.
The BoKS database shows that both hosts are part of the hostgroup SYBASE.
BoKS > hgrpadm -l | grep UNIXHOST
...
SYBASE UNIXHOST
TRUSTED UNIXHOST
...
BoKS > hgrpadm -l | grep SYBHOST
...
SYBASE SYBHOST
TRUSTED SYBHOST
...
The BoKS database shows that user sybase is allowed SSH inside hostgroup SYBASE.
BoKS > sx /opt/boksm/sbin/boksadm -S dumpbase -t 2 | grep SYBASE:sybase
RUSER="SYBASE:sybase" ROUTE="ssh*:TRUSTED->SYBASE"
...
RUSER="SYBASE:sybase" ROUTE="ssh*:ANY/SYBASE->SYBASE"
...
The BoKS database confirms that sybase is allowed to use SSH keypairs.
BoKS > sx /opt/boksm/sbin/boksadm -S dumpbase -t 31 | grep SYBASE:sybase
RLOGNAME="SYBASE:sybase" TYPE="ssh_pk" VERSION="1.0" FLAGS="1"
The public key of sybase@UNIXHOST is correctly installed in the authorized_keys file of user sybase@SYBHOST.
sybase@UNIXHOST > cat ~/.ssh/id_dsa.pub
ssh-dss AAAAB3NzaC1kc3MAAACBANSl ... WjUgDlUEIA5g== sybase@UNIXHOST
sybase@SYBHOST > cat ~/.ssh/authorized_keys
ssh-dss AAAAB3NzaC1kc3MAAACBAPd/ ... 8Cbt3Gl9hvTa== sybase@OTHERHOST
ssh-dss AAAAB3NzaC1kc3MAAACBANSl ... WjUgDlUEIA5g== sybase@UNIXHOST
The permissions on the .ssh directory for sybase@SYBHOST are also correct.
sybase@SYBHOST > ls -al ~/.ssh
drwx------ 2 sybase sybase 96 Aug 15 2007 .
drwxr-xr-x 3 sybase sybase 8192 Sep 12 15:58 ..
-rw------- 1 sybase sybase 1210 Oct 27 10:53 authorized_keys
Because things seem alright so far it's time to check out what's going wrong on the inside of BoKS. The first step to take is to run an additonal debugging SSH daemon. This can be done using the following command. Key in this are the multiple -d flags and "-p 2222".
BoKS > /opt/boksm/lib/boks_sshd -d -d -d -D -g120 -p 2222 >/tmp/Trace.txt 2>&1
The customer is now instructed to attempt a login to port 2222 by adding "-p 2222" to his usual SSH command. This should of course still fail, but this time we can get a trace.
The trace output file gets pretty long because it no only shows the SSH debug information, but also debugging for the BoKS internals. After going through the hostkey exchange, BoKS will start authentication by requesting valid authentication methods.
debug2: userauth-request for user sybase service ssh-connection method none
debug2: input_userauth_request: setting up authctxt for sybase
...
debug2: get_opt_authmethod_from_servc: INSIDE - user = sybase, need_privsep = 0
debug2: boks_servc_call_vec: INSIDE boks_sshd@SYBHOST[6] 14 Nov 11:21:24:026533 in servc_call_str: To server: {FUNC=route-stat-user FROMUSER = sybase ROUTE = SSH:192.168.0.181->?HOST TOHOST=?HOST TOUSER=sybase FROMHOST = 192.168.0.181}
...
boks_sshd@SYBHOST[6] 14 Nov 11:21:24:264031 in servc_call_str: Return: {FUNC=route-stat-user FROMUSER=sybase ROUTE=SSH:192.168.0.181->?HOST TOHOST=?HOST TOUSER=sybase FROMHOST=192.168.0.181 $HOSTSYM=SYBHOST $ADDR=192.168.40.165 $SERVCADDR=192.168.23.9 METHODS=ssh_pk $SERVCVER=6.0.3}
debug2: get_opt_authmethod_from_servc: Must use BokS authentication methods: "ssh_pk"
debug2: get_opt_authmethod_from_servc: BokS optional authentication methods: ""
debug2: boks_ssh_restrict_authmethods: INSIDE - orginal authmethods = publickey,keyboard-interactive
debug2: boks_ssh_restrict_authmethods: DONE - returning methods = publickey
debug2: userauth-request for user
This confirms that authentication using SSH keypairs is allowed and is actually enforced. The key is now checked and (after some fidgeting) accepted.
debug2: input_userauth_request: try method publickey
debug1: trying public key file /home/sybase/.ssh/authorized_keys
...
debug2: userauth_pubkey: authenticated 1 pkalg ssh-dss
Accepted publickey for sybase from 192.168.0.181 port 63569 ssh2
Now that the user has been authenticated BoKS will check his access routes. Sadly this returns with ERR 203 (no access route)
boks_sshd@SYBHOST[6] 14 Nov 11:21:24:304336 in servc_call_str: To server: {FUNC=auth FROMUSER=sybase ROUTE=SSH:192.168.0.181->?HOST TOHOST=?HOST TOUSER=sybase FROMHOST=192.168.0.181 $ssh_pk=ok}
...
boks_sshd@SYBHOST[6] 14 Nov 11:21:24:314704 in servc_call_str: Return: {FUNC=auth FROMUSER=sybase ROUTE=SSH:UNIXHOST->SYBHOST TOHOST=SYBHOST TOUSER=sybase FROMHOST=192.168.0.181 $ssh_pk=ok 01$HOSTSYM=SYBHOST $ADDR=192.168.40.165 $SERVCADDR=192.168.23.9 WC=#$*-./?_ UKEY=SYBASE:sybase MOD_CONV=1 SEC_USER=sybase VTYPE=ssh_pk MODLIST=optional_ssh_pk=+1,psw=+1,prompt=-1,timeout=+1,login=+1,verbose=+1 $STATE=6 ERROR=-203 $SERVCVER=6.0.3}
debug3: boks_ssh_do_authorization: Servc auth failed ERROR = -203
Please note that the SSH debug trace above shows that address 192.168.23.9 is used for the servc calls. This indicates that the client is communicating with replica REPHOST. In order to further aid the troubleshooting process it's best to force the client to communicate with just this one replica.
BoKS > cd /etc/opt/boksm
BoKS > vi bcastaddr
DONT_BROADCAST
ADDRESS_LIST
192.168.23.9 REPHOST.domain
~
~
:wq
BoKS > Boot -k
BoKS > Boot
Just to play it safe we'll need to check that the client's request is sent and received properly. This can be done by running a BoKS debug on the "servc_bridge_[s|r]" process, "s" being on the sending side and "r" on the receiving end.
Once again we'll be asking the customer to SSH to the system. However, right before he executes his command we'll run the following two commands.
Client: bdebug bridge_servc_s -x 9 -f /tmp/servcs.out
Replica: bdebug bridge_servc_r -x 9 -f /tmp/servcr.out
Right after the customer's SSH session is killed again we'll run the following commands.
Client: bdebug bridge_servc_s -x 0
Replica: bdebug bridge_servc_r -x 0
The two resulting files will be rather large and hard to read. Both log should only be given a cursory glance as they only pertain to the BoKS communications itself. In this case the logs indicate no problems at all, though they might have shown problems with hostkeys or network connectivity.
Again we will ask the customer to attempt another (failed) login through SSH. This time we will trace another subset of BoKS, the "servc" process which handles the actual database lookup and verification.
Right before the client executes his SSH we'll run the following command.
Replica: bdebug servc -x 9 -f /tmp/servc-trace.out
Right after the customer's SSH session is killed again we'll run the following commands.
Replica: bdebug servc -x 0
The resulting log file will most likely be huge as it will contain all authentication requests handled by the replica during the trace. In order to get to the part of the log that is of interest to us it's best to do a search for the username (sybase). The first entry that we'll find is part of the setup of the authentication request.
servc@REPHOST[3] 14 Nov 11:43:35:660033 in servc_func_1: From client (SYBHOST) {FUNC=route-stat-user FROMUSER=sybase ROUTE=SSH:192.168.0.181->?HOST TOHOST=?HOST TOUSER=sybase FROMHOST=192.168.0.181}
BoKS will now go through a rather lengthy process of identifying the parties involved, which includes some BoKS-database and DNS voodoo to identify the hosts and their hostgroups. It's important to read all the log entries, searching for errors.
Having ascertained the identity of the parties involved, BoKS will start checking the appropriate access routes for the user. In this case you will see that BoKS will go over the access routes found at step 2 one by one. As part of this list it will also go over the access route that should have given sybase SSH access. However, instead we see the following.
14 Nov 11:43:35:930834 in fetchrec: Reading record from tab 2 at offset 1878504 (688 bytes)
14 Nov 11:43:35:931016 in get_route_key: got "ssh*:ANY/SYBASE->SYBASE"
14 Nov 11:43:35:931150 in am_methodcmp: ssh* == SSH ?
14 Nov 11:43:35:931254 in am_methodcmp: yes
14 Nov 11:43:35:931354 in hosttype_cmp: wild = ANY/SYBASE, host = UNIXHOST
14 Nov 11:43:35:931453 in domexpand: Enter. host="ANY/SYBASE"
...
14 Nov 11:43:35:931863 in domexpand: Return. "ANY/SYBASE.domain"
14 Nov 11:43:35:931963 in domexpand: Enter. host="UNIXHOST"
...
14 Nov 11:43:35:932367 in domexpand: Return. "UNIXHOST.domain"
...
14 Nov 11:43:35:932721 in host_wild_cmp: wild (SYBASE.domain) is a hostgroup
14 Nov 11:43:35:932824 in hostgroup_match_sub: enter
14 Nov 11:43:35:933336 in hostgroup_match_sub: no match
14 Nov 11:43:35:933641 in get_route_key: mismatch
This indicates that BoKS thinks that host UNIXHOST is not part of hostgroup SYBASE, even though we already confirmed that this is in fact the case (see step 2). This would seem to indicate that there are problems with the local copy of the BoKS database on replica REPHOST.
We won't have to continue reading the log file any further.
Suspecting database problems on the replica we check the following.
BoKS > hgrpadm -l | grep UNIXHOST
...
SYBASE UNIXHOST
TRUSTED UNIXHOST
...
Oddly enough the "hgrpadm" command, which interacts with the database, returns the proper results. However, dumping the local tables shows that we have problems.
BoKS > dumpbase -t 7 | grep UNIXHOST
BoKS > dumpbase -t 9 | grep UNIXHOST
BoKS > dumpbase -t 15 | grep UNIXHOST
Run the following command on both the master server and the replica. Compare the figures for each table, looking for any discrepancies. A difference less than ten is alright, but anything in the dozens or higher is a problem. In this case I found the following.
BoKS > boksdiag sequence
Master Replica
...
T7 13678d 8:33:46 5053 (5053)
...
T9 13178d 11:05:23 7919 (7919)
...
T15 13178d 11:03:16 1865 (1865)
...
T7 13678d 8:33:46 6982 (6982)
...
T9 13178d 11:05:23 10258 (10258)
...
T15 13178d 11:03:16 2043 (2043)
This indicates that there are indeed synchronisation problems between this replica server and the master server.
Now that we've ascertained that there's one replica that's running badly, it's a good idea to check the other replicas as well. Run the "boksdiag sequence" command on the other replicas and verify the figures again.
In this case the figures for the other replicas all look fine, with one exception: REPHOST2 complains about database locking issues. Said error messages also pop up when running "dumpbase" commands on that replica, indicating software errors on that host as well.
boksdiag@REPLICA: INTERNAL DYNDB ERROR in blockbase(): Can't lock database
errno = 28, No space left on device
boksdiag@sREPLICA: INTERNAL DYNDB ERROR in bunlockbase(): Can't unlock database
errno = 28, No space left on device
T0 12549d 6:39:06 94193 (94193)
T1 13907d 7:13:45 637314 (637314)
...
In the end the problem was in fact down to REPHOST being out of synch with the rest of the BoKS domain. The troubleshooting continues with example 2.
kilala.nl tags: boks, sysadmin,
View or add comments (curr. 0)
All content, with exception of "borrowed" blogpost images, or unless otherwise indicated, is copyright of Tess Sluijter. The character Kilala the cat-demon is copyright of Rumiko Takahashi and used here without permission.