One of the reasons I became so keen on automating restores was that I’d regularly get requests from various auditors asking for examples of valid restores carried out in the last 3 months, or wanting me to justify my Recovery Time Objectives, or needing to ensure we had adequate DR testing. And there’s always a manager wanting reassurance that we could meet our SLA commitments.
By automating restores of our major production systems and recording the results, whenever a request came in I could quickly dump the information into Excel for them (remember, it’s not Real Information™ unless it’s in Excel).
So what sort of information should be audited about restores? I find the following is a minimum that covers most requests, though be sure to check for any industry/business specifics that may apply to your own case.
- Time of restore
  - Restores should ideally be attempted at different times throughout the day. This will highlight any potential slowdowns due to other activity on the hardware or network
- Which database was restored
- Which server the restore was performed on
  - If you have multiple restore/DR servers, it’s important to have a log of a tested restore on every one of them, so that at a critical point you don’t end up relying on the one server in the set that doesn’t work
- How long it took
- How much data was written out
  - This could be the amount of data on disk at the end of the backup, or you could calculate the total throughput of all backup files restored, or both
- To what size the database files grew during the restore
  - This may not be the same as the previous metric. This value will also include the empty space within data files, and account for any ‘shrinks’ that happened during the period being restored
- User running the restore
  - Just so you can recreate any permissions issues
- Original server of backup
- Location of all backup files restored
- Output (or lack of it) from DBCC
  - If you’re using NO_INFOMSGS you may still want to log the fact that no errors were reported, just to record that it had been run
- Output from in-house check scripts
- Log of any notifications sent to DBAs for further investigation
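
As a rough illustration of how the fields above could be held together and handed over as something Excel will open, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `RestoreAuditRecord` class, its field names, the `to_csv` helper and the sample values are all hypothetical, and in practice you would more likely log to a table on your own audit server.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from datetime import datetime
from typing import List


@dataclass
class RestoreAuditRecord:
    """One row per test restore. All field names are hypothetical."""
    restored_at: datetime        # time of restore
    database: str                # which database was restored
    restore_server: str          # server the restore was performed on
    duration_seconds: float      # how long it took
    bytes_written: int           # how much data was written out
    final_file_bytes: int        # size the database files grew to
    run_by: str                  # user running the restore
    source_server: str           # original server of the backup
    backup_files: List[str]      # location of all backup files restored
    dbcc_output: str             # output from DBCC, or a note that there was none
    check_script_output: str     # output from in-house check scripts
    notifications: str           # notifications sent to DBAs


def to_csv(records):
    """Render audit records as CSV text that Excel can open."""
    buffer = io.StringIO()
    names = [f.name for f in fields(RestoreAuditRecord)]
    writer = csv.DictWriter(buffer, fieldnames=names)
    writer.writeheader()
    for record in records:
        row = asdict(record)
        # Flatten the non-scalar values so each record stays on one row.
        row["restored_at"] = record.restored_at.isoformat()
        row["backup_files"] = ";".join(record.backup_files)
        writer.writerow(row)
    return buffer.getvalue()


# Made-up sample values, purely to show the shape of a record.
example = RestoreAuditRecord(
    restored_at=datetime(2013, 9, 14, 3, 0),
    database="Sales",
    restore_server="DR-A",
    duration_seconds=1800.0,
    bytes_written=50_000_000_000,
    final_file_bytes=64_000_000_000,
    run_by="restore_svc",
    source_server="PROD-1",
    backup_files=["full.bak", "log_01.trn"],
    dbcc_output="no errors reported (NO_INFOMSGS)",
    check_script_output="",
    notifications="",
)

print(to_csv([example]))
```

Note that the DBCC field is filled in even when there was nothing to report, so the log still shows the check was run.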
Once you have this information you can start to mine it for your own use as well. You can make sure that all your ‘matched’ hardware is actually working at the same speed, and check that restoring whilst the network is under normal business load won’t add an extra hour to your RTO.
You can also start looking for trends. Are your restores taking a lot longer since the new SAN was put in? Is Server A producing a lot more alerts on checking? Perhaps there’s an underlying hardware problem to be investigated there.
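
The sort of comparison described above can be as simple as grouping logged durations by server or by time of day. The Python sketch below is only an illustration: the `mean_duration_by` helper, the dictionary keys and the sample figures are all made up for the example, not taken from a real log.

```python
from collections import defaultdict
from statistics import mean


def mean_duration_by(rows, key):
    """Average restore duration (seconds) per group, where key(row) names the group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[key(row)].append(row["duration_seconds"])
    return {group: mean(values) for group, values in grouped.items()}


# Invented figures: two 'matched' DR servers, restores at 03:00 and 14:00.
sample_rows = [
    {"restore_server": "DR-A", "hour": 3, "duration_seconds": 1800},
    {"restore_server": "DR-A", "hour": 14, "duration_seconds": 2400},
    {"restore_server": "DR-B", "hour": 3, "duration_seconds": 1900},
    {"restore_server": "DR-B", "hour": 14, "duration_seconds": 3900},
]

# Is one of the 'matched' servers slower than the other?
by_server = mean_duration_by(sample_rows, lambda row: row["restore_server"])

# Do restores slow down under normal business load?
by_hour = mean_duration_by(sample_rows, lambda row: row["hour"])

print(by_server)
print(by_hour)
```

With real data you would want many more samples per group before reading much into a difference, but even a crude average like this is enough to prompt a closer look at one server or one time window.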
A side bonus of this is that your recovery hardware is being used. Rather than just being sat there waiting for a disaster, you’re actually reading and writing data from the drives. So now at 3am during a panic restore you can also be confident that you don’t have a dead spindle or a flaky drive controller in your server.
This post is part of a series posted between 1st September 2013 and 3rd October 2013, an index for the series is available here.