At $CLIENT we've been having some nasty problems with our development SAP box. The box is part of a Veritas cluster and actually runs a bunch of Solaris Zones. The problems originally started about two months ago when we ran into a rare and newly discovered bug in UFS. It took a while for us to get the proper patches, but we finally managed to get that sorted out.

Remco installed the patches on Thursday morning, though he ran into some trouble. As always, patches can give you crap when it comes to cross-dependencies and this time wasn't any different. Around lunch time we thought we had things sorted out and went for the final reboot. All the zones were transferred to the proper boxen and things looked okay.

Until we tried to make a network connection. D:

None of the zones had access to the network, even though their interfaces were up and running. We searched for hours, but couldn't find anything. And like us, Sun was in the dark as well. In the end Remco and Sun worked all night to get an answer. Unfortunately they didn't make it, so I took over in the morning. Lemme tell you, once I was in the middle of all the tech and the phone calls and the managers, I found some more respect for Remco. He did a great job all through Thursday!

Just before lunch both Sun and one of the other guys came up with the solution. That was an awesome coincidence :) Turns out that the problems we were having were caused by timing issues during the boot-up of the Solaris Zones. Things turned sour because we let Veritas Cluster handle the network interfaces. They would've worked better if we'd let the Zone framework handle them.

The stopgap solution: freeze all cluster resources to prevent fail-over, then manually restart all virtual interfaces for the zones. And presto! It works again!

Happily we went to lunch, only to come back to more crap!

Turns out that the five SAP instances we were running wouldn't fit into the available swap space anymore. Weird! Before yesterday, things would barely fit in the 30GB of swap space. And now all of a sudden SAP would eat about 38GB! o_O WTF?!
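To put a number on the gap, here's a minimal sketch of the arithmetic in Python. It only uses the round figures mentioned above (30GB of swap available, roughly 38GB wanted by SAP). The function name and the constants are made up for illustration; none of this was run on the actual box.

```python
def swap_shortfall_gb(available_gb: float, required_gb: float) -> float:
    """Return how many GB of extra swap are needed (0 if everything fits)."""
    return max(0.0, required_gb - available_gb)


# Round figures from the post; both are approximations.
AVAILABLE_GB = 30.0  # swap space configured on the box
REQUIRED_GB = 38.0   # what the five SAP instances now ask for in total

shortfall = swap_shortfall_gb(AVAILABLE_GB, REQUIRED_GB)
print(f"Need at least {shortfall:.0f}GB of extra swap")
```

So any temporary fix has to add at least that much swap before all five instances can start.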

A whole bunch of managers wanted us to work through the whole weekend to sort everything out. Naturally we didn't feel too enthused, not to mention that the box's SLA doesn't cover weekend work.

In the end we tacked on some temporary swap space, started SAP and left for the weekend. We'll have to accept more downtime on Monday. It also leaves us with two big things to fix:

1. Modify the cluster/zone config for the network interfaces.

2. Find out why SAP has grown gluttonous and fix it.

kilala.nl tags: work, unix

2007-10-22 11:06:00

Posted by Edmond (website)

And we're at it again ;) Stuff seems to be running, but my guess is we're not even halfway to a decent fix.

All content, with exception of "borrowed" blogpost images, or unless otherwise indicated, is copyright of Tess Sluijter. The character Kilala the cat-demon is copyright of Rumiko Takahashi and used here without permission.


Sometimes clusters do not guarantee high uptime