2005-05-31 22:15:00
A PDF version of this document is available.
We've all experienced that sinking feeling: bleary-eyed and not halfway through your first cup of coffee, you're startled by the phone. Something's gone horribly wrong and your customers demand your immediate attention!
From then on things usually only get worse. Everybody's working on the same problem. Nobody keeps track of who's doing what. The problem has more depth to it than you ever imagined and your customers keep on calling back for updates. It doesn't matter whether the company is small or large: we've all been there at some point in time.
The last time we encountered such an incident at our company wasn't too long ago. It wasn't a pretty sight and actually went pretty much as described above. During the final analysis our manager requested that we produce a small checklist which would prevent us from making the same mistakes again. That small checklist eventually grew into this article, which we thought might be useful for other system administrators as well.
Before we begin we'd like to mention that this article was written with our current employer in mind: large support departments, multiple tiers of management, a few hundred servers and an organization styled after ITIL. Most of the principles described in this document also apply to smaller departments and companies, albeit in a more streamlined form. Meetings will not be as formal, troubleshooting will be more flexible and communication lines between you and the customer will be shorter.
Now, we have been told that ITIL is a mostly European phenomenon and that it is still relatively unknown in the US and Asia. The web site of the British Office of Government Commerce (http://www.itil.co.uk) describes ITIL as follows:
"ITIL (IT Infrastructure Library) is the most widely accepted approach to IT Service Management in the world. ITIL provides a cohesive set of best practice, drawn from the public and private sectors internationally.
ITIL is ... supported by publications, qualifications and an international user group. ITIL is intended to assist organizations to develop a framework for IT Service Management."
Some readers may find our recommendations to be strict, while others may find them completely over the top. It is of course up to your own discretion how you deal with crises.
Now. Enough with the disclaimers. On with the show!
The following paragraphs outline the phases which one should go through when managing a crisis. The way we see things, phases 1 through 3 and phase 11 are all part of the normal day-to-day operations. All steps in between (4 through 10) are steps to be taken by the specially formed crisis team.
1. A fault is detected
2. First analysis
3. First crisis meeting
4. Deciding on a course of action
5. Assigning tasks
6. Troubleshooting
7. Second crisis meeting
8. Fixing the problems
9. Verification of functionality
10. Final analysis
11. Aftercare
"Oh the humanity!..."
Reporter at the crash of the Hindenburg
It really doesn't matter how this happens, but this is naturally the beginning. Either you notice something while grepping through a log file, or a customer calls you, or some alarm bell starts going off in your monitoring software. The end result will be the same: something has gone wrong and people are complaining about it.
In most cases the incident will simply be handled through the normal incident process, since the situation is not of a grand scale. But once every so often something very important breaks and that's when this procedure kicks in.
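For the "alarm bell" scenario, many shops rely on small check scripts that scan a log or probe a service and raise an alert when something looks off. The following is only a rough sketch of that idea in Python, with a made-up log path and error patterns, loosely following the exit-code convention used by monitoring tools such as Nagios; adapt it to whatever monitoring you actually run.

#!/usr/bin/env python3
# Minimal log-scanning check in the spirit of a monitoring plugin.
# The log path and error patterns below are purely hypothetical examples.
import re
import sys

LOG_FILE = "/var/log/app/application.log"   # hypothetical path
PATTERNS = ["ERROR", "OutOfMemory", "connection refused"]

def count_matches(path, patterns):
    """Count the log lines that match any of the given patterns."""
    regex = re.compile("|".join(patterns), re.IGNORECASE)
    hits = 0
    try:
        with open(path) as handle:
            for line in handle:
                if regex.search(line):
                    hits += 1
    except OSError as exc:
        print("UNKNOWN: cannot read %s (%s)" % (path, exc))
        sys.exit(3)   # "unknown" in the Nagios exit-code convention
    return hits

if __name__ == "__main__":
    hits = count_matches(LOG_FILE, PATTERNS)
    if hits:
        print("CRITICAL: %d suspicious lines in %s" % (hits, LOG_FILE))
        sys.exit(2)   # "critical": this is what sets off the alarm bell
    else:
        print("OK: no suspicious lines found")
        sys.exit(0)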
"Elementary, dear Watson."
The famous (yet imaginary) detective Sherlock Holmes
To be sure of the scale of the situation you'll have to make a quick inventory: which systems and services are affected, which customers are impacted and to what degree, and since when the problems have been occurring.
Once you have collected all of this information you will be able to provide your management with a clear picture of the current situation. It will also form the basis for the crisis meeting, which we will discuss next.
This phase underlines the absolute need for detailed and exhaustive documentation of your systems and applications! Things will go so much more smoothly if you have all of the required details available. If you already have things like Disaster Recovery Plans lying around, gather them now.
If you don't have any centralized documentation yet we'd recommend that you start right now. Start building a CMDB, lists of contacts and so-called build documents describing each server.
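What exactly goes into a CMDB record or build document differs per organization and tool. As a purely illustrative sketch (every name and value below is made up), even a simple structured record per server answers most of the questions that come up during the first analysis:

# Purely illustrative, made-up example of a minimal build/CMDB record.
# Real CMDBs are usually databases or dedicated tools; the point is that
# this information should be written down before a crisis hits.
server_record = {
    "hostname": "web01.example.com",            # hypothetical host
    "role": "customer-facing web server",
    "os": "Solaris 9",
    "applications": ["Apache", "custom booking application"],
    "depends_on": ["db01.example.com", "nfs01.example.com"],
    "owner": "Unix hosting team",
    "contacts": {
        "technical": "oncall-unix@example.com",
        "customer": "servicedesk@example.com",
    },
    "build_document": "/docs/build/web01.pdf",  # hypothetical location
}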
"Emergency family meeting!"
From "Cheaper by the dozen"
Now the time has come to determine how to tackle the problem at hand. In order to do this in an orderly fashion you will need to have a small crisis meeting.
Make sure that you have a whiteboard handy, so you can make a list of all of the detected defects. Later on this will make it easier to keep track of progress with the added benefit that the rest of your department won't have to disturb you for updates.
Gather at least the following people: the on-call team member, their supervisor, the problem manager and any specialists whose expertise may be needed.
During this meeting the on-call team member brings everybody up to speed. The supervisor is present so that he/she may be prepared for any escalation from above, while the problem manager needs to be able to inform the rest of your company through the ITIL problem process. Of course it is clear why all of the other people are invited.
One of the goals of the first crisis meeting is to determine a course of action. You will need to set out a clear list of things that will be checked and of actions that need to be taken, to prevent confusion along the way.
It is possible that your department already has documents like a Disaster Recovery Plan or notes from a previous comparable crisis that describe how to treat your current situation. If you do, follow them to the letter. If you do not have documents such as these you will need to continue with the rest of our procedure.
Once a clear list of actions and checks has been created you will have to assign tasks to a number of people. We have determined a number of standard roles, the most important of which are the spokesperson and the troubleshooters.
It is imperative that the spokesperson is not involved with any troubleshooting whatsoever. Should the need arise for the spokesperson to get involved, then somebody else should assume the role of spokesperson in his place. This will ensure that lines of communication don't get muddled and that the real work can go on like it should.
In this phase the designated troubleshooters go over the list of possible checks that was determined in the first crisis meeting. The results for each check need to be recorded of course.
It might be that they find some obvious mistakes that may have led to the situation at hand. We suggest that you refrain from fixing any of these, unless they are really minor. The point is that it would be wiser to save these errors for the meeting that is discussed next.
This might seem counterintuitive, but it could be that these errors aren't related to the fault or that fixing them might lead to other problems. This is why it's wiser to discuss these findings first.
Once the troubleshooters have gathered all of their data the crisis team can enter a second meeting.
At this point in time it is not necessary to have either the supervisor or the problem manager present. The spokesperson and the troubleshooters (perhaps assisted by a specialist who's not on the crisis team) will decide on the new course of action.
Hopefully you have found a number of bugs that are related to the fault. If you haven't, loop back to step 4 to decide on new things to check. If you did, now is the time to decide how to go about fixing things and in which order to tackle them.
Make a list of fixable errors and glance over possible corrections. Don't go into too much detail, since that will take up too much time. Leave the details to the person who's going to fix that particular item. Assign each item on the list to one of the troubleshooters, and decide in which order they should be fixed.
When you're done with that, start thinking about plan B. Yes, it's true that you have already invested a lot of time into troubleshooting your problems, but it might be that you will not be able to fix the problems in time. So decide on a time limit if it hasn't been determined for you and start thinking worst-case scenario: "What if we don't make it? How are we going to make sure people can do their work anyway?"
Obviously you'll now tackle each error, one by one. Make sure that you take notes of all of the changes that are made. Once more though (I'm starting to feel like the school teacher from The Wall): don't be tempted to do anything you shouldn't be doing.
Don't go fixing other faults you've detected. And absolutely do not use the downtime as a convenient time window to perform an upgrade you'd been planning on doing for a while.
Once you've gone over the list of errors and have fixed everything, verify that peace has been brought to the land, so to speak. Also verify that your customers can work again and that they experience no more inconvenience. Strike every fixed item from the whiteboard, so your colleagues are in the know.
If you find that there are still some problems left, or that your fixes broke something else, add them to the board and loop back to phase 3.
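One way to keep this verification honest is to re-run the exact same checks that flagged the problem in the first place, instead of relying on a quick eyeball. Below is a hypothetical sketch of such a re-test; the hosts and ports are placeholders for whatever actually failed in your environment.

# Hypothetical verification run: every service that failed during the
# crisis is probed again and the result is printed with a timestamp,
# so the whiteboard can be updated with facts rather than gut feeling.
import datetime
import socket

def tcp_check(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout):
            return True
    except OSError:
        return False

# Placeholder list; fill in the systems that were actually affected.
CHECKS = [
    ("web01.example.com", 80),
    ("db01.example.com", 1521),
]

if __name__ == "__main__":
    stamp = datetime.datetime.now().isoformat()
    for host, port in CHECKS:
        status = "OK" if tcp_check(host, port) else "STILL BROKEN"
        print("%s %s:%d %s" % (stamp, host, port, status))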
"Analysis not possible... We are lost in the universe of Olympus."
Shirka, the on-board computer from "Ulysses 31"
Naturally your customers will want some explanation for all of the problems you caused them (so to speak). So gather all people involved with the crisis team and hold one final meeting. Go over all of the things you've discovered and make a neat list. Cover how each error was created and its repercussions. You may also want to explain how you'll prevent these errors from happening again in the future.
What you do with this list depends entirely on the demands set out by your organization. It could be that all your customers want is a simple e-mail, while ITIL-reliant organizations may require a full-blown Post Mortem.
"I don't think any problem is solved unless at the end of the day you've turned it into a non-issue. I would say you're not doing your job properly if it's possible to have the same crisis twice.O
Salvaico, Sysadmintalk.com forum member
Apart from the Post Mortem, which was already mentioned, you need to take care of some other things.
Maybe you've discovered that the server in question is underpowered, or that the faults experienced were fixed in a newer version of the software involved. Things like these warrant the start of a new project at your customers' expense. Or maybe you've found that your monitoring is lacking when it comes to the resource(s) that failed. This of course will lead to an internal project for your department.
All in all, aftercare covers all of the activities required to make sure that such a crisis never occurs again. And if you cannot prevent such a crisis from happening again you should document it painstakingly so that it may be solved quickly in the future.
We sincerely hope that our article has provided you with some valuable tips and ideas. Managing crises is hard and confusing work and it's always a good idea to take a structured approach. Keeping a clear and level head will be the biggest help you can find!
kilala.nl tags: writing, tutorial, sysadmin,