Check_MK and Bacula

Monitoring Bacula from within Check_MK

 

There are multiple options for monitoring Bacula, ranging from very easy to in-depth.

I'll show some of them, including the most interesting pieces of configuration to use.

Note that for Check_MK I'm using the text-driven configuration for easier understanding, but all of these settings are configurable via WATO!

So if you're not looking for ways to automate your config from Puppet etc. and don't want to make your life complicated, just use WATO.

Monitoring Job status (easy)

  • Let Bacula write a flag file
  • Monitor the flag file's existence and age from Check_MK

I have added the following configuration to write the flag file:

JobDefs {
  Name = DefaultJob
  [....... many more lines ......]
  Priority = 10
  Write Bootstrap = "/var/lib/bacula/%c.bsr"

  RunScript {
      Command = "touch /etc/bacula/last-backup"
      RunsWhen = After
      RunsOnFailure = no
      RunsOnClient  = yes
      RunsOnSuccess = yes
  }
}

The flag is written to /etc/bacula since that directory is not writable by other system users. Importantly, the script runs client-side, since the backup status relates to each client, not to the server itself.

To report the flag file's existence, you also need an entry in /etc/check_mk/fileinfo.cfg. If you use a wildcard *, the same file content will work for all your hosts.

/etc/bacula/last-backup

If you run your agent (and don't have a totally outdated version), the file will show up in the fileinfo section:

<<<fileinfo:sep(124)>>>
1356553230
/etc/bacula/last-backup|0|1356550964
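To see what the server has to work with, here is a minimal Python sketch that reads the section shown above and computes the age of the flag file. The function name `parse_fileinfo` is made up for this illustration and is not part of Check_MK; the field layout (agent time on the first line, then path|size|mtime) is taken from the sample output.

```python
def parse_fileinfo(section_lines):
    """Return (agent_time, {path: (size, mtime)}) from the lines of a fileinfo section."""
    lines = [ln.strip() for ln in section_lines if ln.strip()]
    agent_time = int(lines[0])              # first line: Unix time on the agent
    files = {}
    for line in lines[1:]:
        parts = line.split("|")             # sep(124): fields are separated by "|"
        # Skip anything that does not look like path|size|mtime with numeric values.
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        path, size, mtime = parts
        files[path] = (int(size), int(mtime))
    return agent_time, files


sample = """1356553230
/etc/bacula/last-backup|0|1356550964""".splitlines()

now, files = parse_fileinfo(sample)
size, mtime = files["/etc/bacula/last-backup"]
age = now - mtime                           # seconds since the last successful backup
print("flag file age in seconds:", age)
```

For the sample above the age is 2266 seconds, so the last backup finished a bit under 38 minutes before the agent ran.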

On the Check_MK server end you will need to configure the fileinfo check. Either let the inventory pick up the service, or use a manual check definition.

Either way, you'll then need to configure the maximum age for the file, e.g. 25 hours or 24:01 with a daily backup interval.

The manual approach is better in that it enforces that the flag file exists, so no system can fall through the cracks.

We use the same name on all hosts so the same rule works for all hosts.

My example, based on WATO & inventory. 

checkgroup_parameters['fileinfo'] = [
  ( {'maxage': (90000, 694800)}, [], ALL_HOSTS, ['/etc/bacula/last-backup'], {'comment': u'This monitors the age of the last Bacula backup of the local system.', 'docu_url': ''} ),
] + checkgroup_parameters['fileinfo']

Non-WATO users will have to redo this according to the fileinfo man page.

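This will go WARN once the flag file is older than 25 hours (90000 seconds), i.e. there is no backup from the last day, and CRIT once it is older than 193 hours (694800 seconds), i.e. a whole week has passed on top of that without a backup.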
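The two maxage values are plain seconds, which makes them easy to get wrong. A small Python sketch of the arithmetic and of how an age maps to a state; the function `age_state` is hypothetical, and treating a value equal to the threshold as already exceeded is my assumption, not something taken from the fileinfo check itself.

```python
HOUR = 3600
DAY = 24 * HOUR

WARN_AGE = 25 * HOUR             # 90000 s: one day plus an hour of slack
CRIT_AGE = 7 * DAY + 25 * HOUR   # 694800 s: another full week on top


def age_state(age_seconds, warn=WARN_AGE, crit=CRIT_AGE):
    """Map a file age to a Nagios state: 0 = OK, 1 = WARN, 2 = CRIT."""
    if age_seconds >= crit:
        return 2
    if age_seconds >= warn:
        return 1
    return 0


for age in (2266, 26 * HOUR, 9 * DAY):
    print(age, "->", age_state(age))
```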

 

This is working nicely, just don't forget to update your fileinfo.cfg on all hosts!

 

The result looks really nice:

 

 

Monitoring Job status (less easy)

Find the script bacula2nagios.sh which uses NSCA to send the results to Nagios.

It seems the script will be included here:

http://blog.netways.de/wp-content/uploads/2009/09/netways_bacula.zip

Then configure the correct passive check definition for the NSCA-fed service.

Verify that "something" (e.g. freshness checks) notices if the backup server doesn't run the job again, or you'll keep getting the OK status from the last job.

You can in fact use Python scripting in Check_MK's extra_nagios_conf to avoid manually creating any of these services.
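As a rough sketch of that idea: the Python below builds one passive service definition with a freshness check per host and returns it as a string that could be appended to extra_nagios_conf in main.mk. The directive names are standard Nagios object directives; the function name, the template `generic-service`, the service description and the `backup-stale` command are placeholders I made up, so adjust them to whatever your setup and the NSCA script actually use.

```python
def passive_backup_services(hostnames, freshness_seconds=90000,
                            template="generic-service",      # assumed template name
                            stale_command="backup-stale"):   # hypothetical command
    """Return Nagios service definitions for NSCA-fed backup results."""
    blocks = []
    for host in hostnames:
        blocks.append(
            "define service {\n"
            "  use                     %s\n"
            "  host_name               %s\n"
            "  service_description     Bacula Backup\n"
            "  active_checks_enabled   0\n"
            "  passive_checks_enabled  1\n"
            "  check_freshness         1\n"
            "  freshness_threshold     %d\n"
            "  check_command           %s\n"
            "}\n" % (template, host, freshness_seconds, stale_command)
        )
    return "\n".join(blocks)


# In main.mk this could look like (host names are examples):
#   extra_nagios_conf += passive_backup_services(["web01", "db01"])
print(passive_backup_services(["web01", "db01"]))
```

With check_freshness enabled, Nagios runs the check_command once no passive result has arrived within freshness_threshold seconds, so that command should report a non-OK state.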

This approach also has problems if the host names don't match.

 

 

Monitoring Processes (easy)

This can be done using the process inventory check for the server.

Ensure the database & bacula daemons are running and the filesystems are not full.

inventory_processes_perf += [
    ( "Bacula Director",  "~.*bin/bacula-dir", ANY_USER, 0, 1, 2, 2),
    ( "Bacula Storage",  "~.*bin/bacula-sd", ANY_USER, 0, 1, 2, 2),
    ( "MySQL", "~.*sbin/mysqld", ANY_USER, 0, 1, 5, 8),
    ( "MySQL", "~.*libexec/mysqld", ANY_USER, 0, 1, 5, 8),
]
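Patterns like these are easy to get subtly wrong, so it can help to try them against a few command lines first. A small Python sketch, assuming that a leading "~" marks a regular expression that is matched from the start of the command line and that a pattern without "~" must equal the process name exactly; `pattern_matches` is a made-up helper, and the command lines are examples.

```python
import re


def pattern_matches(pattern, cmdline):
    """Check one process pattern against one command line."""
    if pattern.startswith("~"):
        # Regex case: anchored at the beginning of the command line.
        return re.match(pattern[1:], cmdline) is not None
    # Plain case (assumption): compare with the first word of the command line.
    return cmdline.split(" ")[0] == pattern


cmdlines = [
    "/usr/sbin/bacula-dir -c /etc/bacula/bacula-dir.conf",
    "/usr/sbin/bacula-sd -c /etc/bacula/bacula-sd.conf",
    "/usr/sbin/mysqld --basedir=/usr",
]

for pattern in ("~.*bin/bacula-dir", "~.*bin/bacula-sd", "~.*sbin/mysqld"):
    hits = [c for c in cmdlines if pattern_matches(pattern, c)]
    print(pattern, "->", len(hits), "process(es)")
```

Each of the three patterns should hit exactly one of the sample command lines; a pattern that hits none or several is worth a second look before it goes into the config.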

On the clients, you would do a manual check definition that enforces a running bacula-fd process, or also use inventory for a more lax configuration.

This would be using an enforced manual check:

checks += [
    ( ["tcp", "lnx", "backup"], ALL_HOSTS, "ps.perf", "Bacula FD", ("~.*sbin/bacula-fd", ANY_USER, 0, 1, 2, 2)),
]

And this is using the process inventory:

inventory_processes_perf += [ 
    ( "Bacula FD" ,     "~.*sbin/bacula-fd" , "root", 0, 1, 2, 2),
]

Personally I've stuck with inventory for this so far, but I've also had clients with a missing Bacula agent at times, so choose your own way. :)

Using "perf" also tracks the CPU usage of the file daemon.

A service dependency of the backup job check on the file daemon check could be added, so that a missing backup is not alerted on while the file daemon is down.

 

Monitoring via SNMP (not easy)

This can be done using bacula-snmp. Horrible code.

See http://www.backupcentral.com/phpBB2/two-way-mirrors-of-external-mailing-lists-3/bacula-25/configuring-bacula-snmp-subagent-118285/ for some info to get you started.

I'm hoping to make the bacula-snmp subagent work reliably. After that it would be possible to write an SNMP-based Check_MK check for it.

Beer donations appreciated.

The Debian / Postgres Check_MK plugin:

This works server-side. Unfortunately it seemed hard to adapt for different setups.

http://sts.ono.at/blog/2010/12/07/checkmk-update/

 

The old Nagios Plugin

Using the plugins from Netways you can also get quite decent per-client stats like jobBytes and pool info.

That would be most important if some of your pools go to non-disk media, as then a "df" based check is not possible.

Look for check_netways_bacula.pl, but let me add that it could use some love, e.g. more beautiful PNP graphs.

If you use it, maybe take some hours to make a nicer PNP template for it and commit it back.

 

 

What's missing:

Most of what you're looking for in enterprise backup scenarios.

  • Jukebox status
  • Cleaning status
  • Replication job status (think BI rules)
  • FC throughput statistics (available on AIX if you ask the MK devs)
  • Predictive pool usage analysis 
  • SCSI Error monitoring
  • Tape age monitoring
  • Performance alerting (Finance, Telco businesses face financial or legal risks if the backups get stuck, take too long or have to be aborted)
  • Inter-server relation monitoring, i.e. if SD-SD copy jobs would work if one were to run
  • Much more powerful per-pool statistics.

These gaps are not something to "blame" on any of the components. They took the better part of 10 years to appear in any commercial product, and each OSS component we're looking at is about that old.

Still some way to go until all bonus features show up :)

At least Check_MK already helps with things like the filesystem trend monitoring.

 

 

Query a client's status:

The Nagios server will need bconsole access to run this.

echo "sta client=host.domain.tld" | bconsole | grep -i "failed to connect"

 

Query a director's status:

echo "status dir" | bconsole | grep "Daemon started"
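The two one-liners can be wrapped into local check output along these lines. This is only a Python sketch, not the check from my repo: the function names and the per-client service name are made up, the strings it looks for are the ones grepped for above, and `subprocess.run` simply pipes the command into bconsole the same way echo does.

```python
import subprocess


def run_bconsole(command, timeout=60):
    """Pipe a single command into bconsole and return what it prints."""
    result = subprocess.run(
        ["bconsole"], input=command + "\n",
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout


def director_line(output):
    """Local check line for the director, based on 'status dir' output."""
    if "Daemon started" in output:
        return "0 bacula-dir-connect - OK - Bacula director is accessible"
    return "2 bacula-dir-connect - CRIT - Bacula director did not answer"


def client_line(host, output):
    """Local check line for one client, based on 'status client=...' output."""
    service = "bacula-client-" + host      # hypothetical service name
    if "failed to connect" in output.lower():
        return "2 %s - CRIT - Director failed to connect to the client" % service
    return "0 %s - OK - Client answered the status request" % service


# Usage on a host with bconsole access:
#   print(director_line(run_bconsole("status dir")))
#   print(client_line("host.domain.tld",
#                     run_bconsole("status client=host.domain.tld")))
```

Keeping the bconsole call separate from the two functions that interpret the output means the interpretation can be tried on saved output without a director at hand.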

 

I've made a Check_MK local check from this. Browse & look at my Bitbucket Check_MK repo

 

Stale clients:

Monitor for stale clients that don't have any active backup left.

Those could be clients that were simply never removed after the physical hardware went away, or a restore target, who knows.

This check immediately goes to CRIT. Even though the backup system is perfectly working under this condition there's still either a config maintenance issue to immediately address (if you wanna run a clean operation) or it might really hint at a bigger issue. Any client flagged here HAS a problem.

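#!/bin/bash
# Collect the client name (field 8) from every row of query 10 that contains a standalone 0.
CLIENTS="$(printf 'query\n10\n' | bconsole -n | grep -w 0 | awk '{ printf "%s ", $8 }')"
if [[ $CLIENTS ]] ; then
  # Local check line: <state> <service name> <perfdata or -> <text>
  echo "2 bacula-clients-broken - CRIT - Broken Clients found - all backups total 0Bytes! ( $CLIENTS)."
else
  echo "0 bacula-clients-broken - OK - No broken Clients found."
fi

Note: I originally wanted awk to check the column itself, using { if ( $3 == 0 ); printf $8" " }. The semicolon right after the condition ends the if statement, though, so the printf runs for every row no matter what. The filtering is therefore done by grep -w 0 alone, and the script above spells that out. Keep in mind that grep -w 0 matches a standalone 0 in any column, so a healthy client with a 0 somewhere else in its row would also go critical. :)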
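If you want the check to look at one specific column instead of any standalone 0, splitting the table on the "|" separators is more robust than counting whitespace-separated fields. A minimal Python sketch; the table below is invented sample data (your query 10 depends on your query.sql and will have other columns), and `zero_byte_clients` is a hypothetical helper.

```python
def zero_byte_clients(table_text, bytes_col, name_col):
    """Return the names from rows whose bytes column is exactly 0."""
    broken = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue                         # skips the +----+ border lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) <= max(bytes_col, name_col):
            continue
        value = cells[bytes_col].replace(",", "")
        if value.isdigit() and int(value) == 0:   # header text is not a digit
            broken.append(cells[name_col])
    return broken


# Invented sample output, only to show the parsing.
sample = """
+----------+--------+-------+------------+
| clientid | failed | bytes | name       |
+----------+--------+-------+------------+
|        1 |      3 |     0 | web01-fd   |
|        2 |      0 | 1,024 | db01-fd    |
|        3 |      0 |     0 | old-box-fd |
+----------+--------+-------+------------+
"""

print(zero_byte_clients(sample, bytes_col=2, name_col=3))
```

In this sample db01-fd has a 0 in another column and would be caught by grep -w 0, but it is not reported here because only the bytes column is compared.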

I've made a Check_MK local check from this, find it at Bitbucket - nagios - check_mk - bacula-clients-broken

 

I came up with some ideas for new helpful checks:

  • monitor cross-client restore permissions:
    • the check scans all client configs and looks at their ClientACL. It will alert as Warning if a client is able to perform restores for a different client
  • monitor ACL correctness:
    • Verify a client's config is valid (has correct ACLs defined, will be able to do "setip"). I just too often hack in my configs and forget something. Yes, I do have ansible scripts to generate them by magic. But... I still break it! :) How great would it be to know what I broke.
  • monitor client status:
    • the check will run just once a day (for example) and check all clients' reachability. If a client is unreachable or Bacula authentication is broken, it will alert you.
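The first idea could start out roughly like the Python sketch below. It assumes that each client has its own restricted Console resource in the director config and that anything beyond a single entry in ClientACL (or the *all* keyword) counts as cross-client access. The function name and the sample config are made up, and the parsing is deliberately simple (no nested braces, no included files).

```python
import re


def cross_client_consoles(config_text):
    """Return {console name: [clients]} for consoles allowed more than one client."""
    findings = {}
    for block in re.finditer(r"Console\s*\{(.*?)\}", config_text, re.S | re.I):
        body = block.group(1)
        name = re.search(r'^\s*Name\s*=\s*"?([^"\n;]+)', body, re.M | re.I)
        clients = set()
        # Accepts both "ClientACL" and "Client ACL" spellings.
        for value in re.findall(r"^\s*Client\s*ACL\s*=\s*(.+)$", body, re.M | re.I):
            clients.update(c.strip().strip('"') for c in value.split(",") if c.strip())
        if len(clients) > 1 or "*all*" in clients:
            findings[name.group(1).strip() if name else "?"] = sorted(clients)
    return findings


# Made-up sample config for illustration.
sample_conf = """
Console {
  Name = web01-console
  Password = "secret"
  ClientACL = web01-fd
}
Console {
  Name = db01-console
  Password = "secret"
  ClientACL = db01-fd, web01-fd
}
"""

for console, clients in cross_client_consoles(sample_conf).items():
    print("1 bacula-acl-%s - WARN - may restore for: %s" % (console, ", ".join(clients)))
```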