check_by_ssh is part of the monitoring plugins project. It serves as a replacement for NRPE when you want to define your checks centrally and have a transparent layer between your Nagios server and a remote monitored host. Transparent means there is no extra configuration file to push to the remote host, as there is with NRPE:
For example, where you would locally just define "check_dummy 0", you can use "check_by_ssh -C check_dummy 0" to run it remotely, plus a few arguments such as the remote host's name.
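To make the "transparent layer" idea concrete, here is a minimal Python sketch that wraps a local plugin call into a check_by_ssh call. The host name remote.example.org and the helper name by_ssh_command are made up for illustration; only the -H and -C options come from check_by_ssh itself.

```python
import shlex


def by_ssh_command(host, remote_command):
    """Return the argv that runs `remote_command` on `host` via check_by_ssh."""
    # -H names the remote host, -C carries the command line to run there.
    return ["check_by_ssh", "-H", host, "-C", remote_command]


# The check you would define locally ...
local_check = "check_dummy 0"
# ... and the same check wrapped for remote execution (hypothetical host name).
remote_check = by_ssh_command("remote.example.org", local_check)
print(shlex.join(remote_check))
```

The remote command travels as a single argument, which is why it gets quoted when the call is written out as a shell line.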
check_by_ssh has a rarely used, better-performing passive mode, and it has also been tweaked to use a multiplexed "controlmaster" session. check_by_ssh is an alternative to NRPE with higher flexibility, based on more standard connections.
Both of these modes are horribly underdocumented and hard to set up. The passive mode in particular is something most people get wrong and eventually give up on. After researching this I found that the mode works perfectly fine; people just don't figure out how to use it.
I hope writing a new howto from end to end will help. This should allow you to walk through the setup and successfully use the combination of both modes. That way you'll have much faster Nagios checks over standard ssh with low CPU overhead. And that's why you came here, right?
There are two howtos about using SSH and controlmaster; the more recent one is from Gerhard Lausser at Consol:
I won't link to the older one, since it is not as robust as the solution above.
What I've done is expand this into a better-performing setup that uses the passive mode to push the results back to Nagios.
Enable regex matching to use wildcards in the service dependencies. This is set in the misc.cfg for your Nagios/Icinga installation.
Configure your SSH Config
You have to run chmod 600 on .ssh/config, or SSH will refuse to use it.
Host localhost is my testing example. You can use wildcards like Host * or Host *.my.tld
Note that you can give multiple IdentityFile entries if needed.
You also need to create the SSH RSA key that the config refers to :)
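As a rough illustration of the points above, the following Python sketch renders a Host stanza with several IdentityFile entries and writes it with mode 0600. This is not the config from my test: the key file names and the ControlPath value are placeholders, and the sketch writes to a scratch directory instead of a real ~/.ssh/config.

```python
import os
import stat
import tempfile


def render_host_block(pattern, options):
    """Render one Host stanza; `options` is a list of (keyword, value) pairs."""
    lines = [f"Host {pattern}"]
    lines += [f"    {keyword} {value}" for keyword, value in options]
    return "\n".join(lines) + "\n"


def write_private(path, text):
    """Write `text` to `path` and make sure the file ends up with mode 0600."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.chmod(path, 0o600)  # covers a pre-existing file with looser permissions


# Hypothetical values: both key paths and the socket path are placeholders.
block = render_host_block("localhost", [
    ("IdentityFile", "~/.ssh/id_rsa_nagios"),
    ("IdentityFile", "~/.ssh/id_rsa_backup"),
    ("ControlPath", "~/.ssh/cm-%r@%h:%p"),
])

# Written to a scratch directory so the sketch never touches a real config.
with tempfile.TemporaryDirectory() as scratch:
    target = os.path.join(scratch, "config")
    write_private(target, block)
    print(block, end="")
    print(oct(stat.S_IMODE(os.stat(target).st_mode)))
```

Swap "localhost" for a pattern such as "*.my.tld" to cover a whole domain with one stanza.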
Calling from Services
I'll start the howto with the real services you're checking on the remote system. In this case, check_dummy.
As you can see, Services themselves only use the check_ssh_passive command definition we show later. This is because passive checks can't do anything but wait for a correctly written check result to appear.
60 is a timeout parameter for ssh; you could choose not to use it.
Note: Put the superfluous settings in a real template. Sorry, I didn't have one in my test.
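For reference, the "correctly written check result" a passive service waits for is a line in the Nagios external command format. The sketch below builds such a line in Python; the host and service names are invented, and the function name passive_result is just for illustration.

```python
import time


def passive_result(host, service, return_code, output, now=None):
    """Format one passive service result as a Nagios external command line."""
    timestamp = int(time.time() if now is None else now)
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}"
    )


# Hypothetical host and service names; 0 is the plugin exit code for OK.
print(passive_result("web01", "Dummy 1", 0, "OK: everything fine"))
```

The host and service fields have to match the names in the Nagios configuration exactly, otherwise the result is dropped.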
Define a check command that handles the ssh connection via a master.
For my testing I had numbered master checks (125, running 20 services each). In any sensible scenario you would probably group the SSH master checks per host!
I also think most of the --ssh-option settings could be supplied in a .ssh/config, but I'm not 100% sure about it.
Note that in passive mode the check must list both the commands and the services that are run; this is a requirement of the passive mode. Each check appears twice: once as a -C argument giving the remote check command, and once in the -s list of service titles.
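The pairing of -C and -s is easy to get wrong by hand, so here is a small Python sketch that builds a passive-mode call from a list of (service title, remote command) pairs. The options used (-H, -t, -n, -O, -s, -C) are regular check_by_ssh options; the host name, the output file path and the function name are assumptions for the example.

```python
def passive_by_ssh_command(host, short_name, checks, output_file, timeout=60):
    """Build a passive-mode check_by_ssh argv.

    `checks` is a list of (service_title, remote_command) pairs.
    """
    if not checks:
        raise ValueError("passive mode needs at least one check")
    for title, _ in checks:
        if ":" in title:
            raise ValueError(f"service title may not contain ':': {title!r}")
    argv = [
        "check_by_ssh", "-H", host, "-t", str(timeout),
        "-n", short_name,      # host name written into the result lines
        "-O", output_file,     # file the passive results are written to
        "-s", ":".join(title for title, _ in checks),
    ]
    # One -C per service, in the same order as the -s list.
    for _, command in checks:
        argv += ["-C", command]
    return argv


# Hypothetical names and paths, purely for illustration.
argv = passive_by_ssh_command(
    "web01.example.org", "web01",
    [("Dummy 1", "check_dummy 0"), ("Dummy 2", "check_dummy 1")],
    "/tmp/passive-results.cmd",
)
print(argv)
```

Building both lists from the same pairs guarantees that the number and order of commands and service titles always match.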
You need to have some generator for the checks or you'll end up very unhappy :)
You'll be able to have a look at my Python scripts from this test as a starting point.
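Until then, here is a minimal generator sketch (not my actual scripts). It groups checks per host, splits them into batches of 20 as in my test, and renders a passive service object. The template name generic-service and the exact check_command argument syntax are guesses for illustration.

```python
from collections import defaultdict


def batches(items, size):
    """Split a list into consecutive chunks of at most `size` entries."""
    return [items[start:start + size] for start in range(0, len(items), size)]


def group_per_host(checks, size=20):
    """Map each host to batches of (service, command); one batch per master check."""
    per_host = defaultdict(list)
    for host, service, command in checks:
        per_host[host].append((service, command))
    return {host: batches(entries, size) for host, entries in per_host.items()}


def render_passive_service(host, service, template="generic-service"):
    """Render a passive-only service; template and command args are hypothetical."""
    return (
        "define service {\n"
        f"    use                     {template}\n"
        f"    host_name               {host}\n"
        f"    service_description     {service}\n"
        "    check_command           check_ssh_passive!60\n"
        "    active_checks_enabled   0\n"
        "    passive_checks_enabled  1\n"
        "}\n"
    )


# 45 invented dummy checks on one invented host.
checks = [("web01", f"Dummy {n}", "check_dummy 0") for n in range(1, 46)]
grouped = group_per_host(checks)
print([len(batch) for batch in grouped["web01"]])  # → [20, 20, 5]
print(render_passive_service("web01", "Dummy 1"), end="")
```

Each batch would then be fed into one master check that runs all of its commands over a single SSH call.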
Only if you don't use a .ssh/config file like the one we created above do you also need to pass these additional settings in the check_by_ssh call.
SSH Binary path
The original howto used a non-upstream version of check_by_ssh to support --ssh-program. This wasn't documented, but the story goes like this: some old SLES versions don't support ControlMaster or similar features. The author's Nagios server apparently ran such an old OS, so he compiled his own ssh and put it in $OMD_ROOT/local/ssh/bin/ssh.
If you're not on an ancient OS, you'll do just fine without it. Otherwise, the patch adding support for flexible ssh binary settings is referenced at https://labs.consol.de/blog/nagios/arbitrary-ssh-command-for-check_by_ssh/ . It doesn't apply cleanly to the master version of the monitoring plugins, but I could apply it manually just fine; it's super easy.
Just, I repeat, normally you don't need that.
These two services handle the remote connection from within nagios. By doing this from nagios the connection is brought up on demand and can be fully integrated into service dependencies.
You also need to grab the actual controlmaster script from the consol labs post linked above.
There is an issue with this persistent connection: It stays persistent, meaning it'll be there even if you stop the OMD site. This is kinda shitty and maybe it's better to turn it into an OMD service?
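If you do turn it into a service, the stop action would need to tell the master to exit. A minimal sketch of that call, assuming you know the control socket path (the path and host below are placeholders); -S and -O exit are standard OpenSSH client options.

```python
def stop_master_command(control_socket, host):
    """Return the argv that asks a running SSH master to shut down."""
    # -S names the control socket, -O exit asks the master behind it to quit.
    return ["ssh", "-S", control_socket, "-O", "exit", host]


# Hypothetical socket path and host name.
print(stop_master_command("/tmp/cm-nagios@web01:22", "web01"))
```

Passing "check" instead of "exit" to -O only tests whether the master is still alive, which could serve as the status action of such a service.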
The expect script used in check_ssh_login wasn't included in the consol howto, so I went with something more trivial.
The key function is to be able to tell if ssh works or not.
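The idea can be sketched in a few lines of Python. This is not the script I used, just an illustration: it runs a no-op command over ssh in batch mode and maps the exit status to a Nagios state. The function name and the message texts are made up; BatchMode and ConnectTimeout are standard ssh options.

```python
import subprocess

OK, CRITICAL = 0, 2  # Nagios plugin exit codes


def ssh_login_status(host, run=subprocess.run, timeout=10):
    """Return (exit_code, message) telling whether a key-based ssh login works."""
    argv = [
        "ssh",
        "-o", "BatchMode=yes",               # never prompt for a password
        "-o", f"ConnectTimeout={timeout}",
        host, "true",                        # run a no-op on the remote side
    ]
    try:
        result = run(argv, capture_output=True, text=True, timeout=timeout + 5)
    except (subprocess.TimeoutExpired, OSError) as exc:
        return CRITICAL, f"SSH CRITICAL - {exc}"
    if result.returncode == 0:
        return OK, f"SSH OK - login to {host} works"
    return CRITICAL, f"SSH CRITICAL - ssh exited with {result.returncode}"


# Usage in a plugin script (the host would come from the command line):
#   code, message = ssh_login_status("web01.example.org")
#   print(message); raise SystemExit(code)
```

The run parameter exists only so the function can be exercised without a real SSH connection.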
The first one lets the controlmaster depend on working SSH access.
The second one lets all listed "real" services depend on the controlmaster.
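A sketch of how these two dependencies could be generated in Python. The service names ("SSH login", "SSH controlmaster", "Dummy .*") are invented for the example, and the wildcard in the second one only works with regex matching enabled, which I assume means use_regexp_matching=1. The directive names are standard Nagios servicedependency directives.

```python
def render_dependency(host, service, dependent_service, criteria="w,u,c"):
    """Render a servicedependency: `dependent_service` depends on `service`."""
    return (
        "define servicedependency {\n"
        f"    host_name                       {host}\n"
        f"    service_description             {service}\n"
        f"    dependent_host_name             {host}\n"
        f"    dependent_service_description   {dependent_service}\n"
        f"    execution_failure_criteria      {criteria}\n"
        f"    notification_failure_criteria   {criteria}\n"
        "}\n"
    )


# Hypothetical service names; "Dummy .*" relies on regex matching being enabled.
print(render_dependency("web01", "SSH login", "SSH controlmaster"), end="")
print(render_dependency("web01", "SSH controlmaster", "Dummy .*"), end="")
```

With the criteria w,u,c the dependent services are neither executed nor notified while the service they depend on is in a WARNING, UNKNOWN or CRITICAL state.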
Testing on Ubuntu?
If you see insane load on Ubuntu, remove the scripts from /etc/update.motd.d
They run at every login and, on top of being useless, cause high system load.
The remainder of the load is caused by dbus and policykit, which together still use a lot more CPU time than Nagios and SSH combined.
Load averages running 5000 Checks:
- motd scripts in place: ~12
- 2500/2500 passive checks (mixed controlmaster/non-controlmaster): ~4
- 5000 controlmaster checks: < 0.8
As indicated above, the load would be even lower running the test on a server-class OS without dbus, policykit and similar performance killers.
On second glance, it would also free up a GB of RAM that is wasted by this:
As you can see, SSH and Nagios don't even show in the list, thanks to the above optimizations.