check_by_ssh

check_by_ssh is part of the Monitoring Plugins project. It is used as a replacement for NRPE when you want to define your checks centrally and have a transparent layer between your Nagios server and a remote monitored host. Transparent means there's no extra configuration file to push to each host, as there is with NRPE:

For example, where you would locally just define "check_dummy 0", you can use "check_by_ssh -C check_dummy 0" to run it remotely, plus of course some arguments like the remote host's name.

 

check_by_ssh has a rarely used, better-performing passive mode, and it has also been tweaked to use a multiplexed "controlmaster" session. check_by_ssh is an alternative to NRPE with higher flexibility, based on more standard connections.

Both of these modes are horribly underdocumented and hard to set up. The passive mode especially is something most people get wrong, so their efforts end in vain. After researching this I found the mode works perfectly fine; people just don't figure out how to use it.

I hope writing a new howto from end to end will help. This should allow you to walk through the setup and successfully use the combination of both modes. That way you'll have much faster Nagios checks over standard ssh with low CPU overhead. And that's why you came here, right?

 

References:

There are two howtos about using SSH with controlmaster. The more recent one is from Gerhard Lausser at Consol:

https://labs.consol.de/blog/uncategorized/running-check_by_ssh-over-a-peristent-ssh-connection/

I won't link to the older one since it is not as robust as the solution above.

 

What I've done is expand this into a better-performing setup that uses the passive mode to push the results back to Nagios.

 

Enable regex matching to use wildcards in the service dependencies. This is set in the misc.cfg for your Nagios/Icinga installation.

etc/nagios/nagios.d/misc.cfg
use_regexp_matching=1

 

Configure your SSH Config

/omd/sites/mysite/.ssh/config
Host localhost
    ControlPath=/omd/sites/mysite/tmp/ssh/controlpath/ssh-%r@%h
    ControlMaster=no
    UserKnownHostsFile=/omd/sites/mysite/tmp/ssh/known_hosts
    IdentityFile=/omd/sites/mysite/.ssh/id_rsa
    IdentitiesOnly=yes
    StrictHostKeyChecking=no
    PasswordAuthentication=no
    Compression=yes
    CompressionLevel=1

You have to do chmod 600 on the .ssh/config or SSH will refuse to use it.

Host localhost is my testing example. You can use wildcards like Host * or Host *.my.tld.

Note that you can give multiple IdentityFile entries if needed.

SSH Key

You also need to create the SSH RSA key the config refers to :)

ssh-keygen -t rsa -b 4096 -N ""

 

 

Calling from Services

I'll start the howto with the real services you're checking on the remote system. In this case, check_dummy.

As you can see, Services themselves only use the check_ssh_passive command definition we show later. This is because passive checks can't do anything but wait for a correctly written check result to appear.
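To make "a correctly written check result" a bit more concrete: Nagios accepts passive results as PROCESS_SERVICE_CHECK_RESULT external commands, one per line, in its command file. The following is only a minimal sketch of that line format, not code taken from check_by_ssh; the function name is made up for illustration.

```python
import time


def passive_result_line(host, service, return_code, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line.

    Hypothetical helper for illustration. The layout is
    [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output
    """
    timestamp = int(time.time() if now is None else now)
    # One command per line, so newlines in the plugin output are flattened.
    output = output.replace("\n", " ")
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, output)


if __name__ == "__main__":
    # Fixed timestamp so the printed line is reproducible.
    print(passive_result_line("localhost", "dummy_ctl_01", 0, "OK",
                              now=1700000000), end="")
```

The host and service names in such a line have to match the host_name and service_description of a defined service, otherwise Nagios has nothing to attach the result to.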

60 is a timeout parameter for ssh; you could choose not to use it.

auto-services.cfg
define service{
    service_description    dummy_ctl_01
    check_command          check_ssh_passive
    use                    generic-service
    host_name              localhost
    active_checks_enabled  0
    passive_checks_enabled 1
}
define service{
    service_description    dummy_ctl_02
    check_command          check_ssh_passive
    use                    generic-service
    host_name              localhost
    active_checks_enabled  0
    passive_checks_enabled 1
}

Note: Put the superfluous settings in a real template; sorry, I didn't have that in my test.
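Writing these blocks by hand gets old quickly once you have more than a handful of services. A small generator along the following lines could produce them. This is a sketch, not the script used for the test; the function names and the numbering scheme (prefix plus two-digit counter) are assumptions that simply mirror the two examples above.

```python
# Template for one passive service, mirroring the definitions above.
SERVICE_TEMPLATE = """define service{{
    service_description    {name}
    check_command          check_ssh_passive
    use                    generic-service
    host_name              {host}
    active_checks_enabled  0
    passive_checks_enabled 1
}}
"""


def service_names(prefix, count):
    """Return names like dummy_ctl_01, dummy_ctl_02, ... (hypothetical scheme)."""
    return ["%s_%02d" % (prefix, i) for i in range(1, count + 1)]


def render_services(host, names):
    """Render one 'define service' block per service name."""
    return "".join(SERVICE_TEMPLATE.format(name=n, host=host) for n in names)


if __name__ == "__main__":
    print(render_services("localhost", service_names("dummy_ctl", 2)), end="")
```

Redirect the output into auto-services.cfg, or write the file from the script directly.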

 

Define a check command that handles the ssh connection via a master.

For my testing I had numbered master checks (125 of them, running 20 services each). In any sensible scenario you would probably group the SSH master checks per host!

I also think most of the --ssh-option settings could be supplied in a .ssh/config, but I'm not 100% sure about it.

Note that for passive mode the command must list both the commands and the services that are run; this is a requirement of the passive mode. Each check appears once as a -C argument giving the remote check command, and once in the -s list of service names, in the same order.

You need to have some generator for the checks or you'll end up very unhappy :)

You'll be able to have a look at my Python scripts from this test as a starting point.

 

auto-checkcommands.cfg
define command{
    command_name check_ssh_master_ctl0
    command_line $USER1$/check_by_ssh -l root -H $HOSTNAME$ -n $HOSTNAME$ \
        --skip-stderr 1 \
        -C '/usr/lib/nagios/plugins/check_dummy 0' \
        -C '/usr/lib/nagios/plugins/check_dummy 0' \
        -s "dummy_ctl_01:dummy_ctl_02" -O /omd/sites/mysite/tmp/run/nagios.cmd
}
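Since every -C argument needs a matching entry in the -s list, this is the part where a generator really pays off. The following sketch builds a command definition like the one above from a list of (service name, remote command) pairs, so the two lists cannot drift apart. The function name is hypothetical and the fixed options (-l root, --skip-stderr 1) are copied from the example, not a recommendation.

```python
def render_master_command(command_name, checks, command_file):
    """Render a check_by_ssh passive-mode command definition.

    checks is a list of (service_description, remote_command_line) pairs.
    Each pair yields one -C argument and one entry in the -s list,
    in the same order.
    """
    if not checks:
        raise ValueError("at least one check is required")
    lines = [
        "define command{",
        "    command_name %s" % command_name,
        "    command_line $USER1$/check_by_ssh -l root -H $HOSTNAME$ -n $HOSTNAME$ \\",
        "        --skip-stderr 1 \\",
    ]
    for name, remote in checks:
        # ':' separates service names in -s, and the remote command is
        # wrapped in single quotes, so neither may contain its delimiter.
        if ":" in name or "'" in remote:
            raise ValueError("unsupported character in %r / %r" % (name, remote))
        lines.append("        -C '%s' \\" % remote)
    services = ":".join(name for name, _ in checks)
    lines.append('        -s "%s" -O %s' % (services, command_file))
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    dummy = "/usr/lib/nagios/plugins/check_dummy 0"
    print(render_master_command(
        "check_ssh_master_ctl0",
        [("dummy_ctl_01", dummy), ("dummy_ctl_02", dummy)],
        "/omd/sites/mysite/tmp/run/nagios.cmd"), end="")
```

Run as a script, it prints a definition equivalent to the hand-written one above.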

 

SSH Settings

Only if you don't use a .ssh/config file like the one created above do you also need to give these additional settings in the check_by_ssh call.

      --ssh-option "ControlPath=$USER4$/tmp/ssh/controlpath/ssh-%r@%h" \
      --ssh-option "ControlMaster=no" \
      --ssh-option "UserKnownHostsFile=$USER4$/tmp/ssh/known_hosts" \
      --ssh-option "IdentityFile=$USER4$/.ssh/id_rsa" \
      --ssh-option "IdentitiesOnly=yes" \
      --ssh-option "StrictHostKeyChecking=no" \
      --ssh-option "PasswordAuthentication=no" \

 

SSH Binary path

The original howto used a non-upstream version of check_by_ssh to support --ssh-program. This wasn't documented, but the story goes like this: some old SLES versions don't support ControlMaster or other features like it. The author's Nagios server apparently had such an old OS, so he compiled his own ssh and put it in $OMD_ROOT/local/ssh/bin/ssh

If you're not on an ancient OS, you'll do just fine. Otherwise the patch adding support for flexible ssh binary settings is referenced at https://labs.consol.de/blog/nagios/arbitrary-ssh-command-for-check_by_ssh/ . It doesn't apply cleanly to the master version of the monitoring plugins, but I could apply it manually just fine; it's super easy.

Again, normally you don't need that.

SSH Handling

These two services handle the remote connection from within nagios. By doing this from nagios the connection is brought up on demand and can be fully integrated into service dependencies.

sshbackhand-services.cfg
define service {
  service_description             os_linux_default_check_controlmaster
  use                             generic-service
  host_name                       localhost
  max_check_attempts              1
  check_interval                  15
  check_command                   check_ssh_controlmaster
}
 
define service {
  service_description             os_linux_default_check_shell
  use                             generic-service
  host_name                       localhost
  max_check_attempts              5
  check_interval                  60
  check_command                   check_ssh_login!$HOSTNAME$!60!22
}
define command {
  command_name    check_ssh_controlmaster
  command_line    $USER2$/check_ssh_controlmaster \
      -H $HOSTNAME$ -l root -p 22 -d $USER4$/tmp/ssh
}

You also need to grab the actual controlmaster script from the consol labs post linked above.

 

There is an issue with this persistent connection: It stays persistent, meaning it'll be there even if you stop the OMD site. This is kinda shitty and maybe it's better to turn it into an OMD service?
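One way to deal with the leftover master would be to ask it to exit when the site stops, using ssh's -O exit control command. The sketch below only builds and runs that call. It assumes the ControlPath from the .ssh/config shown earlier, the helper names are made up, and how you hook it into the OMD stop sequence is left open.

```python
import subprocess

# Assumption: the same ControlPath as in the .ssh/config shown earlier.
CONTROL_PATH = "/omd/sites/mysite/tmp/ssh/controlpath/ssh-%r@%h"


def controlmaster_exit_argv(host, user="root", control_path=CONTROL_PATH):
    """Build the ssh call that asks a running master connection to exit."""
    return ["ssh", "-O", "exit",
            "-o", "ControlPath=" + control_path,
            "-l", user, host]


def close_controlmaster(host, user="root", control_path=CONTROL_PATH):
    """Run the call and return ssh's exit status.

    A non-zero status typically just means no master was listening.
    """
    return subprocess.call(controlmaster_exit_argv(host, user, control_path))


if __name__ == "__main__":
    # Only print the command here; call close_controlmaster() to run it.
    print(" ".join(controlmaster_exit_argv("localhost")))
```

With one master per monitored host, you would loop over the hosts and call close_controlmaster() for each of them.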

 

Testing SSH

The expect script used in check_ssh_login wasn't included in the consol howto, so I went with something more trivial.

The key function is to be able to tell if ssh works or not.

sshbackhand-commands.cfg
define command {
   command_name   check_ssh_login
   command_line   ssh -l root $HOSTNAME$ /bin/true
}
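The plain ssh call does the job, but it prints nothing on success, and ssh exits with 255 on connection errors, which is outside the 0 to 3 range plugins are expected to return. If you want a slightly friendlier variant, a small wrapper could look like the sketch below. This is not the expect script from the Consol howto; the function names, the BatchMode and ConnectTimeout options, and the overall timeout of ten seconds on top of the connect timeout are my own choices for illustration.

```python
import subprocess
import sys


def ssh_login_argv(host, user="root", timeout=60, port=22):
    """Build a non-interactive ssh call that just runs /bin/true remotely."""
    return ["ssh",
            "-o", "BatchMode=yes",               # never prompt for a password
            "-o", "ConnectTimeout=%d" % timeout,
            "-p", str(port), "-l", user, host, "/bin/true"]


def nagios_state(returncode):
    """Map ssh's exit status to a Nagios state (0 = OK, 2 = CRITICAL)."""
    if returncode == 0:
        return 0, "OK - ssh login works"
    return 2, "CRITICAL - ssh exited with status %d" % returncode


def main(host, timeout=60):
    try:
        rc = subprocess.call(ssh_login_argv(host, timeout=timeout),
                             stdout=subprocess.DEVNULL,
                             stderr=subprocess.DEVNULL,
                             timeout=timeout + 10)
    except (OSError, subprocess.TimeoutExpired):
        rc = 255  # treat a missing ssh binary or a hang like a failed login
    state, text = nagios_state(rc)
    print(text)
    return state


# Expects exactly one argument: the host to test.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1]))
```

Saved as an executable script, it could replace the ssh call in the command_line above, with $HOSTNAME$ as its only argument.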

 

Dependencies:

The first one makes the controlmaster depend on working SSH access.

The second one makes all listed "real" services depend on the controlmaster.

servicedeps.cfg
define servicedependency {
 name                             dependency_os_linux_default_check_shell
 host_name                        localhost
 service_description              os_linux_default_check_shell
 execution_failure_criteria       u,w,c,p
 notification_failure_criteria    u,w,c,p
 dependent_service_description    os_linux_default_check_controlmaster,\
                                  !os_linux_default_check_shell
}

define servicedependency {
   name                             dependency_os_linux_default_check_controlmaster_uc_localhost
   host_name                        localhost
   service_description              os_linux_default_check_controlmaster
   execution_failure_criteria       u,w,c,p
   notification_failure_criteria    u,w,c,p
   dependent_service_description    dummy_ctl_.*,\
                                    !os_linux_default_check_shell,\
                                    !os_linux_default_check_controlmaster
}

 

 

 

Testing on Ubuntu?

If you see insane load on Ubuntu, remove the scripts from /etc/update.motd.d

They're run at every login and, on top of being useless, incur high system load.

The remainder of the load is caused by dbus and policykit, which together still use up a lot more CPU time than Nagios and SSH combined.

 

Load averages running 5000 Checks:

  • motd scripts in place: ~12
  • 2500/2500 passive checks, mixed controlmaster/non-controlmaster: ~4
  • 5000 controlmaster checks: < 0.8

 

As indicated above, the load would be even lower running the test on a server-class OS without dbus, policykit and similar performance killers.

On second glance, this would also free up a GB of ram wasted by this:

 top - 19:16:23 up 14 days,  7:37,  1 user,  load average: 0.95, 1.04, 1.38
Tasks: 325 total,   1 running, 319 sleeping,   0 stopped,   5 zombie
Cpu(s):  3.2%us,  2.1%sy,  0.7%ni, 93.2%id,  0.8%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8062956k total,  7297120k used,   765836k free,   319412k buffers
Swap:        0k total,        0k used,        0k free,  2948692k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                  
 1593 root      20   0 1114m 953m 2584 S   41 12.1 286:00.00 polkitd                                                                                  
 1303 messageb  20   0 29944 6092  916 S   26  0.1 165:01.73 dbus-daemon                                                                              
 1576 root      20   0  245m  22m 2988 S   12  0.3  52:18.70 NetworkManager  

As you can see, SSH and Nagios don't even show in the list, thanks to the above optimizations.