Backups

Some hints for backup system setup and integration:

When setting up your backup systems, don't "reinitialize to redo from scratch" whenever everything gets stuck.

These issues will all occur later on during production service. That's why it's important to develop methodologies and practice beforehand.

Be aware that once the backup setup is deployed, it can never be "done over again", especially if you're legally bound to ensure backups are available.

Don't misunderstand me: when the tinkering phase is over, there will be a point at which you start testing deployment of the production setup, and you will be doing that over a few times. But once it goes "live" and has been running for a while, there will be no going back. You can reinstall your backup server. But you can't redo last year's backup.

About backup technologies:

Your final backup target can be a disk-based or a tape-based setup.

A backup done to a medium that can be overwritten from the system being backed up is NOT a backup.

A backup that does not have a defined on-media format that can be scanned after a loss of the backup server is NOT a backup.
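As an illustration of what a "defined on-media format that can be scanned" buys you, here is a minimal sketch that rebuilds a file listing from nothing but the archive itself, with no backup server or catalog database involved. It assumes plain tar as the on-media format, which is only one example of such a format; the function names and the in-memory demo archive are made up for this sketch.

```python
import io
import tarfile


def scan_archive(fileobj):
    """Return (name, size) for every regular file found in a tar stream."""
    entries = []
    # Mode "r|*" reads the archive strictly sequentially, the way a tape
    # would be read, and needs nothing but the archive itself.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if member.isfile():
                entries.append((member.name, member.size))
    return entries


def make_demo_archive():
    """Build a tiny in-memory tar archive that stands in for a backup medium."""
    files = [
        ("etc/hosts", b"127.0.0.1 localhost\n"),
        ("home/alice/notes.txt", b"remember the tapes\n"),
    ]
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name, size in scan_archive(make_demo_archive()):
        print(f"{size:>8}  {name}")
```

If your backup software cannot offer something comparable for its own format, that is worth knowing before the backup server is gone.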

Beware of "smart" ideas from people who show no background in backup systems.

As in cryptography, chances are high that those smart ideas were proven wrong years before you heard about them.

If you look at something snapshot-based, ensure that it's possible to generate a complete copy that can be split off the source system.

Online filesystem snapshots are never truly clean on an application level. That's why Oracle has a "backup" mode in the first place.

If you're using a certain drive technology, you need a spare at hand. If it's a USB drive, you need an extra, compatible and tested USB drive.

If you're using an LTO4 jukebox, you need one standalone drive in case the jukebox fails.

Other things can and will happen:

  • Earthquakes kill hard disks
  • Power surges kill hard disks
  • Fire suppressant flooding kills tapes
  • Encrypted backups are not evil
  • But the crypto key sometimes gets lost

Align with your company policies for that, and don't think you own the place. Storing the key is the CIO's or CEO's problem to worry about, not yours!

If that CxO gets replaced more often than you, then consider giving it to the housekeeper instead.

About setting up backups vs. running backups

Just between us:

I found that people who are good at planning and troubleshooting backup systems are often not good backup admins. A good backup admin is 99% concerned with coming in early in the morning (before the users do), checking everything (are all backups done?), sorting out problems quickly (run them again) and leaving with everything finished (no unsuccessful backups) and prepared (every spool has space, etc.).

He should be someone who is able to fix all the bugs he encounters, but who still personally verifies the same backup log every day, forever, and can be relied on for that. This is a far more professional attitude than most of us geeks could live with. I don't feel bad for saying that I couldn't reliably check backups each day; this needs other strengths. It can't hurt to identify the most reliable guy on the team for this job.
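The mechanical part of that morning routine (all backups done? every spool has space?) can be sketched as a small script. This is only an illustration under stated assumptions: the job names, the status strings "ok" and "failed", and the free-space threshold are all made up, and a real setup would read them from whatever your backup software reports. It does not replace a person who actually reads the log.

```python
import shutil


def morning_report(job_status, spool_free_bytes, min_free_bytes):
    """Return a list of things that still need attention today.

    job_status maps a job name to a status string; anything other than
    "ok" (a hypothetical convention) is treated as needing a rerun or a look.
    """
    todo = []
    for name, status in sorted(job_status.items()):
        if status != "ok":
            todo.append(f"rerun or investigate job {name} ({status})")
    if spool_free_bytes < min_free_bytes:
        todo.append(
            f"spool low on space: {spool_free_bytes} < {min_free_bytes} bytes"
        )
    return todo


if __name__ == "__main__":
    jobs = {"fileserver": "ok", "database": "failed"}
    # "." stands in for the spool directory of your backup software.
    free = shutil.disk_usage(".").free
    for item in morning_report(jobs, free, min_free_bytes=10 * 1024**3):
        print("TODO:", item)
```

An empty list means the admin can go home with everything finished and prepared; anything else is the work for the morning.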

About "we're not a big financial institution or anything", we don't need ...

...an extra server, or an extra disk, or a tape

You're right. You DON'T need:

  • backups that are copied to 2 or more different locations in real time
  • backups that are written to a mirrored VTL in different locations in real time
  • a completely different backup scheme to make basic OS backups for bare-metal restores, independent of the OS and data backups
  • a backup every single night
  • a database backup that gives you a recovery point objective of one minute
  • a separate backup SAN
  • a 10k+ slot media changer and an array of tape drives that stream multiple GB per second
  • a large VTL to buffer between always-slow PC servers and those fast tapes
  • multiple full-time backup admins
  • admins on call in case "something goes wrong with the backup"
  • one or more totally disconnected offline storage locations that survive a nuclear attack

See, you just saved MILLIONS right there!

So shut up and buy another backup disk.

Strategies for problem solving

A few hints for "backup is stuck" scenarios. For some reason I'm really good at running into them, and I've had to fix them just as often. It didn't really matter which software I used, which is why I'm always glad to hear this happens to other people, too :)

  • List the stuck jobs' media before purging
  • Purge jobs only, not media, if possible
  • If you're messing with the database, stop the software first
  • Cross-reference the media of a job with the other jobs on that media before purging anything
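The cross-referencing step in the list above can be sketched like this. It assumes you have already exported a mapping from job name to the media labels that job wrote to; how you get that out of your backup software's catalog is specific to the product and not shown here. The job names and tape labels are invented for the example.

```python
def media_safe_to_purge(job_media, stuck_jobs):
    """Split the media used by the stuck jobs into 'safe' and 'shared'.

    job_media maps a job name to the set of media labels it wrote to.
    Returns (safe, shared): safe is the set of media used only by stuck
    jobs; shared maps each remaining medium to the other jobs found on it.
    """
    stuck = set(stuck_jobs)
    safe = set()
    shared = {}
    for job in stuck:
        for medium in job_media.get(job, set()):
            # Every non-stuck job that also has data on this medium.
            others = sorted(
                other
                for other, media in job_media.items()
                if other not in stuck and medium in media
            )
            if others:
                shared[medium] = others
            else:
                safe.add(medium)
    return safe, shared


if __name__ == "__main__":
    job_media = {
        "nightly-db": {"TAPE01", "TAPE02"},
        "nightly-files": {"TAPE02", "TAPE03"},
        "weekly-full": {"TAPE04"},
    }
    safe, shared = media_safe_to_purge(job_media, ["nightly-db"])
    print("only used by stuck jobs:", sorted(safe))
    for medium, others in sorted(shared.items()):
        print(f"{medium} also holds data from: {', '.join(others)}")
```

Anything that ends up in the shared result is a medium you must not purge along with the stuck job, which is the reason for purging jobs only whenever possible.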