Check_MK and LVM Snapshots

I run virtual machines. Many of them, on many hosts.

The bad thing is I also test a lot of stuff, so I often need a quick backup, and I often forget about that backup once I've made it.

My backups are made using a self-written script that cleanly shuts down the virtual machine, creates a snapshot, then boots the VM and starts saving the backup in the background.
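A rough sketch of that sequence, not the actual vmbackup script: it only builds the list of commands in the order described above and runs nothing. The Xen `xl` toolstack, the `/etc/xen/<vm>.cfg` path, the `/backup` target, the 5G snapshot size and the `dd` copy are all assumptions made up for illustration. The only thing taken from this post is the `snaplv_<vm>` naming.

```python
from typing import List


def backup_plan(vm: str, vg: str = "vgxen", snap_size: str = "5G") -> List[List[str]]:
    """Return the commands for one backup run, in order. Nothing is executed."""
    origin = f"/dev/{vg}/{vm}"            # assumed: the LV is named like the VM
    snap_name = f"snaplv_{vm}"            # snapshot naming as seen in this post
    snap_dev = f"/dev/{vg}/{snap_name}"
    return [
        ["xl", "shutdown", "-w", vm],     # clean shutdown, wait until the domain is gone
        ["lvcreate", "--snapshot", "--size", snap_size, "--name", snap_name, origin],
        ["xl", "create", f"/etc/xen/{vm}.cfg"],              # hypothetical config path
        # The VM is up again; the copy below would run in the background.
        ["dd", f"if={snap_dev}", f"of=/backup/{vm}.img", "bs=1M"],  # hypothetical target
        ["lvremove", "-f", snap_dev],     # cleanup; if this is skipped, a snapshot is left over
    ]


for cmd in backup_plan("xen01"):
    print(" ".join(cmd))
```

The downtime is only the span between the first and the third command. Everything after that reads from the snapshot while the VM is already running again.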

This gives a good balance between VM uptime and data consistency: the VM is down for less than a minute and I get a fully consistent backup.

A problem arises if the backup process is interrupted and a snapshot is left over. It could conflict with future backups or fill up, which invalidates the snapshot, or, if there's an LVM bug, put the source volume's availability at risk.
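Independent of the monitoring setup described below, leftover snapshots and their fill level can also be listed by hand. This is a small sketch that parses the output of `lvs --noheadings --separator '|' -o lv_name,vg_name,origin,snap_percent`. The sample output is made up for illustration, not taken from a real server.

```python
from typing import List, Tuple


def leftover_snapshots(lvs_output: str) -> List[Tuple[str, str, str, float]]:
    """Return (vg, lv, origin, percent_used) for every snapshot LV in the lvs output."""
    found = []
    for line in lvs_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 4:
            continue                      # skip blank or unexpected lines
        lv, vg, origin, percent = fields
        if not origin:
            continue                      # no origin means a regular volume
        # Some locales print a decimal comma; an empty value is treated as 0.
        used = float(percent.replace(",", ".")) if percent else 0.0
        found.append((vg, lv, origin, used))
    return found


# Hypothetical sample output, for illustration only.
SAMPLE = """\
  xen01|vgxen||
  xen01swap|vgxen||
  snaplv_xen01|vgxen|xen01|12.50
"""

for vg, lv, origin, used in leftover_snapshots(SAMPLE):
    print(f"{vg}/{lv} (snapshot of {origin}) is {used:.1f}% full")
```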

Since I use the same script (btw, see http://bitbucket.org/darkfader/vmbackup/) for the backups of some production VMs, that's something I really need to avoid. Note: the script itself is not what worries me. In over 2 years of weekly backups it has never failed to clean up the snapshots. But still, let's be on the safe side, especially considering that yours truly also sometimes creates those snapshots... :)

To support this in Check_MK, I did three things:

Make the Disk IO check ignore snapshot volumes

They're all called snaplv. The following rule in Monitoring Configuration > Inventory and Check_MK Settings > Ignored Services catches the LVM2 snapshot related entries. Note that this is a bleeding-edge server; on older ones you won't have the cow and real pseudo files.

You can also use a regex entry like (-cow|-real) if you want to feel smart, but it makes the rules much harder to read if you deal with larger rulesets.
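To see what the two approaches would catch, here is a quick sketch with Python's `re` module. The service descriptions are invented for illustration; the real Disk IO service names on your server may look different. As far as I know, Check_MK matches such patterns from the start of the service description, so a real rule would probably need a leading `.*`. Treat that as an assumption to verify.

```python
import re

# Patterns from the text: the plain "snaplv" name and the regex variant.
PATTERNS = [re.compile(r"snaplv"), re.compile(r"(-cow|-real)")]

# Hypothetical service descriptions, for illustration only.
services = [
    "Disk IO vgxen-xen01",
    "Disk IO vgxen-xen01-real",
    "Disk IO vgxen-snaplv_xen01",
    "Disk IO vgxen-snaplv_xen01-cow",
]

# re.search() finds the pattern anywhere in the string (unlike a prefix match).
ignored = [s for s in services if any(p.search(s) for p in PATTERNS)]
kept = [s for s in services if s not in ignored]

print("ignored:", ignored)
print("kept:", kept)
```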

Create a fileinfo group

This rule is created in Parameters for Inventorized Checks > Storage, Filesystems and Files. It's called fileinfo Grouping patterns.

This collects all fileinfo services for LVM snapshots into one service based on their OS path name. Now you'd be real happy if I also let you see the fileinfo.cfg from the server, right?

Add entry to fileinfo.cfg

server:~# cat /etc/check_mk/fileinfo.cfg 
/etc/bacula/last-backup
/dev/vg*/snaplv*

Some example data, as the agent outputs it:

<<<fileinfo:sep(124)>>>
1357990268
/etc/bacula/last-backup|0|1357923784
/dev/vgxen/snaplv_xen01|30|1356315949
/dev/vgxen/snaplv_xen01swap|34|1356315949
/dev/vgxen/snaplv_xen02|30|1356315949
/dev/vgxen/snaplv_xen02swap|34|1356315949
/dev/vgxen/snaplv_xen03|30|1357245417
/dev/vgxen/snaplv_xen03swap|34|1356315949
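The section is easy to read once you know the layout: the first line is the agent's current time, and every following line is path, size and modification time, separated by `|` (ASCII 124, hence `sep(124)`). This is a minimal sketch that parses the sample above and picks out the files the grouping pattern would collect. It is not the Check_MK implementation, and it uses Python's `fnmatch` as a stand-in for whatever matching Check_MK does internally.

```python
from fnmatch import fnmatchcase
from typing import Dict, Tuple

SECTION = """\
1357990268
/etc/bacula/last-backup|0|1357923784
/dev/vgxen/snaplv_xen01|30|1356315949
/dev/vgxen/snaplv_xen01swap|34|1356315949
/dev/vgxen/snaplv_xen02|30|1356315949
/dev/vgxen/snaplv_xen02swap|34|1356315949
/dev/vgxen/snaplv_xen03|30|1357245417
/dev/vgxen/snaplv_xen03swap|34|1356315949
"""


def parse_fileinfo(section: str) -> Tuple[int, Dict[str, Tuple[int, int]]]:
    """Return (reference_time, {path: (size, mtime)}) for one fileinfo section."""
    lines = section.strip().splitlines()
    reference_time = int(lines[0])        # first line: current time on the host
    files = {}
    for line in lines[1:]:
        path, size, mtime = line.split("|")
        files[path] = (int(size), int(mtime))
    return reference_time, files


reference_time, files = parse_fileinfo(SECTION)

# Same glob as in fileinfo.cfg; "*" in fnmatch also matches "/".
group = {p: v for p, v in files.items() if fnmatchcase(p, "/dev/vg*/snaplv*")}

for path, (size, mtime) in sorted(group.items()):
    print(f"{path}: {reference_time - mtime} seconds old")
```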

 

Define a maximum age for the fileinfo group

This rule is created in the same place, Parameters for Inventorized Checks > Storage, Filesystems and Files. It's called, well, see below :)

This way it will alert me if any of the LVM snapshots exceeds a certain age.

I've configured it to warn if the age exceeds two days and to go critical at seven days (since that would mean the next backup has been missed).
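Put into numbers, the thresholds work out as below. This is only a sketch of the arithmetic, not the Check_MK check itself, and it assumes the rule is applied to the oldest file in the group. The timestamps are the ones from the example agent output above.

```python
DAY = 86400                  # seconds per day
WARN_AGE = 2 * DAY           # warn after two days
CRIT_AGE = 7 * DAY           # critical after seven days


def age_state(reference_time: int, mtime: int,
              warn: int = WARN_AGE, crit: int = CRIT_AGE) -> str:
    """Map the age of a file (in seconds) to a monitoring state."""
    age = reference_time - mtime
    if age > crit:
        return "CRIT"
    if age > warn:
        return "WARN"
    return "OK"


now = 1357990268             # first line of the agent section
oldest_snapshot = 1356315949 # oldest mtime among the snaplv entries
last_backup = 1357923784     # /etc/bacula/last-backup, for comparison

print(age_state(now, oldest_snapshot), round((now - oldest_snapshot) / DAY, 1), "days")
print(age_state(now, last_backup), round((now - last_backup) / DAY, 1), "days")
```

With the sample data the oldest snapshot is a bit over 19 days old, so the group would be critical.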

If you run daily snapshot backups, you could tweak this some more to ensure there is always a fresh backup around. Let me know if you do that.

And this is the final result: