I tried to use SQL Server’s built-in log shipping to set up a disaster recovery facility but I found there were some serious limitations. So, after some research, I decided to do it myself. Here’s what happened.
A disaster recovery plan is useless unless you can test it, preferably at least once a year. To be really sure, I think you need to prove you can fail over, use, and fail back the production systems. It’s all very well testing on a test setup – you have to make sure it works – but there’s no substitute for confirming that it actually works in the environment it’s meant to preserve. There are so many things that have to fit together and, in a disaster situation, you really don’t want to find yourself saying “Crap, we didn’t think about that!” and explaining to your boss that the business is going to be down for a day or two instead of the hour or two you’d confidently promised.
So you have to be able to smoothly fail over to the standby servers, have all the data consuming systems gracefully adapt to the new data sources, confirm they all work, and then reverse the process. Without losing any data. At least once a year. Easy!
Some specific technical requirements:
- Setup and configuration must be scriptable. I don’t want to screw around with multi-page wizards and write reams of documentation about it. If I have to rebuild the server I want to get it done quickly and easily, and scripts are the way.
- I need it to tell me if it goes wrong, and then I need some tools to help me understand where I am and how I got there so I can fix it.
- When failed over, I want to be able to log ship back the other way. In a real disaster this is probably not an issue but it’s vital for testing the solution. Once live transactions have been applied to the standby server you have to log ship them back to the normal live server or lose them.
- I decided to have a time lag of 3 hours before transactions are applied to the standby server, giving us a chance to rescue data that has been deleted by mistake.
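That delay requirement amounts to checking the backup history before restoring each log. A minimal sketch of the kind of check involved, querying the standard msdb history tables (the `@DelayHours` variable and database name are illustrative assumptions, not the real code):

```sql
-- Only consider log backups that finished more than @DelayHours ago.
DECLARE @DelayHours int;
SET @DelayHours = 3;

SELECT bs.backup_finish_date, bmf.physical_device_name
FROM msdb.dbo.backupset bs
JOIN msdb.dbo.backupmediafamily bmf
    ON bmf.media_set_id = bs.media_set_id
WHERE bs.database_name = N'MyDB'
  AND bs.type = 'L'                                          -- log backups only
  AND bs.backup_finish_date < DATEADD(hour, -@DelayHours, GETDATE())
ORDER BY bs.backup_finish_date;                              -- restore in this order
```

Anything younger than the cut-off simply waits for a later run of the restore job.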
Trying Out-of-the-Box Log Shipping
I filled in about 11 pages of wizard, setting up the source, standby and monitor. Then I captured the script, thinking I could just copy it for each database and find-and-replace the database name. But the script doesn’t run! I spent quite a while searching for how to fix it and couldn’t find an answer. At this point I decided I could live without the scriptability, so I went through the wizard again, building a detailed installation guide for our domain.
It all looked great at first. There’s a report on the monitor server that shows the progress of backups, file copies and restores. I failed over, putting the source database into standby mode ready to receive logs. But when I tried to restore the standby database right up to date, there was no way to override the 3-hour delay I’d specified. I went into the wizard, took out the delay, restored all the logs and brought the database online. My test application adapted to the new data source. Not ideal in a real disaster with everyone breathing down my neck, but I can live with it.
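Stripped of the wizard, catching the standby up and bringing it online boils down to something like this (paths and names here are illustrative, not what the wizard generates):

```sql
-- Apply each outstanding log, keeping the database readable between restores.
RESTORE LOG MyDB
    FROM DISK = N'D:\LogShip\MyDB_0800.trn'
    WITH STANDBY = N'D:\LogShip\MyDB_undo.dat';
-- ...repeat in sequence for every remaining log file, then recover:
RESTORE DATABASE MyDB WITH RECOVERY;   -- brings the database online
```

The STANDBY undo file is what lets the database stay read-only between restores; the final RECOVERY step is the point of no return, after which no further logs can be applied.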
Now to set up log shipping back the other way. I went into the wizard and it wouldn’t let me set up the monitor. Apparently you can only monitor shipping between two servers in one direction. If you remove the initial shipping you can set it up going the other way. So that’s about 11 pages of wizard per database to fail over, and the same to fail back. That’s enough of that!
I looked around hoping to find that somebody else had already done a full solution but I couldn’t find one. I felt Pop Rivett’s Does Log Shipping came the closest, so I cannibalised it into what I wanted.
I have the following jobs on both servers:
- LogShipBackup – Does a log backup on selected databases, compresses the file with WinRAR, copies it to a share on the other server, and does housekeeping on the files and the backup/restore history tables in msdb. The copy stage copies over any files that are not already on the other side, so it can catch up if the connection has been lost for a while.
- LogShipBackupAlert – Sends me an email if the LogShipBackup job has not run recently and successfully. LogShipBackup will send me an email if it fails, but that won’t alert me if it doesn’t run at all.
- LogShipRestore – Decompresses and restores logs in sequence on selected databases, and does housekeeping on the files and the backup/restore history tables in msdb. You can specify a delay so that logs aren’t applied immediately. It checks the backup history so it doesn’t trip up if a file is copied over again, and it actively kills connections to a database to ensure the restore can go ahead.
- LogShipRestoreAlert – Sends me an email if the LogShipRestore job has not run recently and successfully. LogShipRestore will send me an email if it fails, but that won’t alert me if it doesn’t run at all.
- LogShipFailOver – Disables the LogShipBackup and LogShipBackupAlert jobs, backs up the tail of the log on selected databases, puts them in read-only standby mode, and copies the backup files over to the other server. The copy stage copies over any files that are not already on the other side, so it’s quite resilient.
- LogShipBringOnline – Disables the LogShipRestore and LogShipRestoreAlert jobs, restores all remaining logs in sequence on selected databases, and brings those databases online.
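The heart of the failover step – backing up the tail of the log and putting the database into read-only standby – might look like this (a sketch with illustrative paths, not the actual job code):

```sql
-- Back up the tail of the log; NORECOVERY leaves the database in a
-- restoring state with no further transactions possible.
BACKUP LOG MyDB
    TO DISK = N'D:\LogShip\MyDB_tail.trn'
    WITH NORECOVERY;
-- Flip it into read-only standby so it can receive logs shipped back
-- from the other server later.
RESTORE DATABASE MyDB WITH STANDBY = N'D:\LogShip\MyDB_undo.dat';
```

The tail backup is what makes failing over without losing data possible: it captures the transactions that haven’t yet been shipped, and leaving the old primary in standby is what lets logs flow back the other way after failover.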
The jobs LogShipBackup and LogShipBackupAlert are enabled on the primary server, and LogShipRestore and LogShipRestoreAlert are enabled on the standby server. I run them every 15 minutes. To fail over, run LogShipFailOver on the primary and then LogShipBringOnline on the standby server.
Apart from the alert jobs, each job has one or more steps that perform the appropriate actions on a database that is shipping or receiving logs. The jobs rely on a set of stored procedures installed in both servers’ master databases.
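As an example of the kind of work those procedures do, the restore job’s connection-killing step could be sketched like this (illustrative only – the database name is a placeholder and the real procedures differ):

```sql
-- Kill every connection to the target database so the restore isn't blocked.
DECLARE @spid int, @cmd nvarchar(20);
DECLARE spids CURSOR LOCAL FAST_FORWARD FOR
    SELECT spid FROM master.dbo.sysprocesses
    WHERE dbid = DB_ID(N'MyDB');
OPEN spids;
FETCH NEXT FROM spids INTO @spid;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @cmd = N'KILL ' + CAST(@spid AS nvarchar(10));
    EXEC (@cmd);                        -- KILL won't take a variable directly
    FETCH NEXT FROM spids INTO @spid;
END;
CLOSE spids;
DEALLOCATE spids;
```

Without this, a single leftover read-only session on the standby database is enough to make the RESTORE LOG fail.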
The code is available for download here. If you’re interested, please feel free to check it out and use it if you want to. If it doesn’t quite do what you want you can always fix that. I’d be interested to know what you think of it. I’d also be very keen to hear from you if you’ve found ways round the problems I found with the built-in log shipping.
To find out more about log shipping I recommend the following:
- Brad McGehee’s article about doing it yourself. It dates back to SQL 2000 days but log shipping still seems to work just the same.
- Paul Randal’s posts about how backup and recovery work.