You’ve had a disaster at work, the data centre is off line, management are panicing and money is being lost. What should you be doing?
Short answer :
nothing that isn’t in the Disaster Recovery plan (you do have one of those don’t you?)
Longer answer:
Some time ago I got into work at my usual hour to notice a distinct lack of server admins at their desks and other early starters unable to log in to Active Directory, neither of which filled me with confidence for a quiet day working on writing some training material.
A quick walk down to the data centre found some worried looking electricians, some annoyed looking server admins and a distinct lack of server noise. Obviously something major had failed and taken out everything, so time to dig out the disaster recovery plans..
In previous roles I’ve had responsibility for hardware and infrastructure so the urge to jump in and help out was strong. But I could see that the correct guys were on the case, and that with the influx of panicing management another body was only going to slow things down and get in the way.
So I went back to my desk, got out a copy of the SQL Server Disaster Recovery plan and started refreshing (you do have an offline copy you can read without power, network or server infrastructure?) . This was the best place for me to be, and the best use of my time while things were being fixed, I was ensuring that:
- I could be found easily by anyone needing me
- Would have information to hand as soon as needed. (which order to bring back the hardware back online, SAN dependencies, etc)
- I could make a condensed checklist for what to do when things came back up based on time (ie; kill off an import that would fail due to a missing dependancy)
- Keep out of the way of the people who were working hard to get their part of the disaster plan enacted. They really don’t need another person who doesn’t know what’s going on bugging them.
- Work with management to work out the order for bringing back services
- Make sure I had all the contact details I needed for the service owners who I’d have to liase with for final release (you just know that a disaster will happen while X is on leave, and you won’t have the number for Y who’s covering their role)
All of this meant that when the call came up that the physical layers were back up and running I quickly started bring up the SQL Server instances in an orderly fashion, and could quickly run through my checks before handing them over to the business users for clearance.
Disaster recovery plans take a long time to write and to get right, but if you’ve got it right then on the day it makes life go easier if you stick to it. When the business is suffering it’s not the right time to be improvising, or trying to work on something you’re not completely OK with, or the time to be trying out some new technique you’ve read on a random DBA’s blog.
Making sure you study it is even more important if it’s a disaster plan you didn’t write. If you ever get brought in to enact a disaster recovery plan you’ve never seen before and you don’t know the providence of, make sure you read it thoroughly before you start. You don’t want to get half way through and find out that you’re missing either a vital piece of infrastructure or software.
(Burnt server image courtesy of John/Roadhunter – http://www.flickr.com/photos/roadhunter/68017745/)