So, what does Check_MK do and how does it do it - in 10 Minutes.

 

There's a high-level description at Wikipedia:

http://en.wikipedia.org/wiki/Check_MK - Note: I originally wrote that article.

But Wikipedia doesn't really help us understand how Check_MK talks to Nagios, so let's look into that right now!

Monitoring configuration expressed as rules:

Check_MK at its core has one job only:

Let the user express monitoring configuration for Nagios / Icinga in form of rules.

Those rules can match on objects, and they express the configuration far more densely than Nagios does.

These, for example, are two rules as Check_MK's web interface, or a user editing the config by hand, might define them:

all_hosts += [ "myhost|tcp|broken" ]
 
ignored_services += [
   ( [ "tcp", "broken" ], [ "Harddisk" ] )
]

These rules tell Check_MK how to behave when it creates a Nagios config:

  1. define a host named myhost and give it two tags (attributes) named tcp and broken
  2. tell Check_MK to ignore anything that would result in a Nagios service whose name starts with "Harddisk" if it's on a host with the tags "tcp" and "broken".

 

To create a Nagios config, Check_MK

  • takes all objects known to it
  • matches all defined rules (default, manually set, magic) against them, in a specific order
  • compiles this into Nagios' horribly explicit syntax
  • and writes that to a defined location as check_mk-objects.cfg.

This file and a usually unmodified check_mk-templates.cfg make up all the config that Check_MK hands over to Nagios.
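To give an idea of what lands in check_mk-objects.cfg, here is a rough, hand-made sketch of such generated objects. The template and command names are invented for illustration; the real ones come out of check_mk-templates.cfg:

# sketch of generated Nagios objects (template/command names invented)
define host {
    use          check_mk_host_tmpl
    host_name    myhost
    address      10.1.2.3
}

# the one active service per host that runs Check_MK itself
define service {
    use                  check_mk_active_tmpl
    host_name            myhost
    service_description  Check_MK
    check_command        check-mk
}

# one passive service per inventorized check, fed by the active one
define service {
    use                  check_mk_passive_tmpl
    host_name            myhost
    service_description  CPU load
    check_command        check_mk-cpu.loads
}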

 

These two files are what gets written when you use "Activate Changes" in the UI or when you run e.g. cmk -R on the command line.

The core job is:

Have a few lines of Python, write hundreds of thousands of lines of Nagios config.

 

Now, there's the fancy stuff too...

Executing the configured checks:

Since Nagios' NRPE addon is highly inefficient, and so are Nagios active checks, the next thing to know about Check_MK is how it checks stuff.

Suddenly it's supposed to be a monitoring plugin and not a config editor? Oh well, you'll get used to it.

What happens is that there's one active Nagios check per Check_MK-monitored host, and this check goes out and checks everything that is defined to be checked in the ruleset for that host (just wait).

So the check_command for that service is something along the lines of "python check_mk.py". At the latest upon the first run, this call is turned into precompiled Python bytecode for performance reasons.

The check_mk.py heads out and queries a host by contacting e.g. an agent or an SNMP device. It can use an arbitrary method, but the two main types are "tcp" and "snmp". Setting either decides on the structure of the data and the transport to use. A custom transport for the "tcp" method, which basically uses plain text, can be defined as a "datasource_program". This can fetch/accept data any way you can dream up: SSH, wget, email, Morse code, you name it and it can be done.
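For illustration, such a rule in main.mk could look roughly like the sketch below. The SSH command line is made up, and the exact rule syntax (tag list, host list) has varied between versions:

# Sketch: fetch agent output via SSH instead of TCP port 6556 for hosts
# tagged "ssh". <IP> and <HOST> are placeholders Check_MK fills in per host.
# datasource_programs and ALL_HOSTS are provided by Check_MK's config loading.
datasource_programs += [
   ( "ssh -o BatchMode=yes root@<IP> check_mk_agent", [ "ssh" ], ALL_HOSTS ),
]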

Upon retrieving the data, it is first _cached_ (in $OMD_ROOT/tmp/check_mk/cache) to allow quick evaluation and re-use by multiple Check_MK checks. Put short: the cache is written on the first request to a data source and preserved until it is considered timed out or is invalidated (e.g. when you click "Full Scan" in the UI).

After storing the data in cache files, it runs its own checks on that data and writes the results out as Nagios passive check result files (those go to $OMD_ROOT/tmp/nagios/checkresults).

Those files are collected by Nagios once the "active check" is done.

The checks:

They're Python scripts that do nothing but check the status of a component they're told to check. They do not handle "fetching" the data or talking to Nagios, or any of that. If you look above, you see they don't have to.

So a check is really just called as 

def check_funny_thing(item, params, info)

item: the name of the thing it's supposed to check, for example "harddisk1"

params: let's say "spinning_direction"

info: the parsed agent output for this check, for example three harddisks and their spinning directions:

   0: right

   1: left

   2: right

 

Now the check could figure out that your harddisk 1 spins the wrong way round and all your data is sdrawkcab.
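To make that concrete, a legacy-style check function for this made-up "funny_thing" check could look roughly like the sketch below. The check name, agent section layout and parameter handling are invented for illustration; only the (status, message) return convention mirrors what real Check_MK checks do:

# Sketch of a made-up check function. "info" is the parsed agent output,
# one row per harddisk: [ disk_number, spinning_direction ].
def check_funny_thing(item, params, info):
    expected = params                      # e.g. "right"
    for line in info:
        if line[0] == item:                # the row for the disk we monitor
            direction = line[1]
            if direction == expected:
                return (0, "disk %s spins %s" % (item, direction))
            return (2, "disk %s spins %s (expected %s)" % (item, direction, expected))
    return (3, "disk %s missing from agent output" % item)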

 

 

Special case:

Livecheck

Livecheck has been deprecated and is removed from newer releases

 

This puts yet another spin on the whole Nagios interaction.

A broker for active Nagios checks.

Written in C, it supports doing check_icmp inline. It hooks into Nagios but forks outside of it, since forking overhead / context switching is one of the biggest Nagios issues. (smile)

Benchmarked at more than a million Nagios checks per minute. It might still have bugs, and bugs suck: mixing up results, going stale on you. Works for most. Doesn't work for some.

Checks are still submitted back to Nagios as files.

Servers based on Open Monitoring Distribution (OMD) have the submission directory on a ramdisk.

How does Check_MK know what to check?

The answer: rules.

Rules are either defined as "enforced" checks using

checks += []

or added by automatic inventory (autoinventory).
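For the "enforced" variant, a manual entry in main.mk could look something like this sketch. The exact tuple layout has varied between versions, and the check name, item and params here belong to the made-up "funny_thing" example from above:

# Sketch: enforce the made-up "funny_thing" check, item "1", params "right"
# on every host tagged "tcp", regardless of what inventory would find.
# checks and ALL_HOSTS are provided by Check_MK's config loading.
checks += [
   ( [ "tcp" ], ALL_HOSTS, "funny_thing", "1", "right" ),
]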

 

Autoinventory works because the agent returns a lot of info and each check can define an inventory function.

That function picks up the three instances of "harddisk" in the example above and is able to "store" them; Check_MK does that if a check returns something in its inventory list.
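Continuing the made-up "funny_thing" sketch from above, an inventory function and the registration of both functions could look roughly like this; the names and the default params are invented, while the check_info dict keys follow the classic check API:

# Sketch: one inventory entry per harddisk found in the agent output.
# Each entry is (item, default_params); here the default params are the
# direction observed at inventory time.
def inventory_funny_thing(info):
    return [ (line[0], line[1]) for line in info ]

# check_info is provided by Check_MK when it loads the plugin;
# check_funny_thing is the check function sketched earlier.
check_info["funny_thing"] = {
    "check_function":      check_funny_thing,
    "inventory_function":  inventory_funny_thing,
    "service_description": "Funny thing %s",
}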

 

 

What the GUI does:

The first thing to know is that the GUI is a web framework, so displaying status is just one module.

It runs in Apache as a mod_python module and stays mostly resident there. That ties you to the prefork MPM, and mod_python apparently also means you can't run Apache 2.4?

Display monitoring status

Obtained directly from the Nagios Core via the Livestatus module.

This is loaded when Nagios starts and offers a query language with microsecond response times.
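A hedged sketch of such a query, talking straight to the site's Livestatus socket from Python; the site name is a placeholder, and on OMD installations the socket lives under $OMD_ROOT/tmp/run/live:

import socket

# connect to the Livestatus UNIX socket of an (invented) OMD site "mysite"
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.connect("/omd/sites/mysite/tmp/run/live")

# LQL query: all services currently CRIT (state 2)
query = "GET services\nColumns: host_name description state\nFilter: state = 2\n\n"
s.sendall(query.encode("utf-8"))
s.shutdown(socket.SHUT_WR)           # signal end of query

answer = b""
while True:                          # read until Livestatus closes the socket
    chunk = s.recv(4096)
    if not chunk:
        break
    answer += chunk
print(answer.decode("utf-8"))        # one line per matching service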

The GUI is called Multisite since it can, well, connect to multiple Nagios sites at once and display their status as a whole.

You can quickly run searches over info from multiple servers, and it's optimized to deal with medium-latency WAN links.

Generally it's helpful to turn on mod_deflate, and then it's possible (but not fun) to use it even over 64 kbps links.

The GUI can use Livestatus to make changes in the Nagios core, e.g. acknowledge a service or disable all notifications everywhere within a glimpse. Generally, AJAX bugs aside, once a switch is flipped in the GUI it has also happened inside Nagios.
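Writing works over the same socket: the GUI wraps standard Nagios external commands into Livestatus COMMAND lines. A minimal sketch with made-up host, service and author names:

import socket, time

# acknowledge a problem on "myhost" / "CPU load" (names invented)
cmd = "COMMAND [%d] ACKNOWLEDGE_SVC_PROBLEM;myhost;CPU load;2;1;0;admin;known issue\n" % time.time()

s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.connect("/omd/sites/mysite/tmp/run/live")   # placeholder site again
s.sendall(cmd.encode("utf-8"))
s.close()                                     # COMMAND queries return no data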

Managing configuration

Another GUI module is WATO, which has its own way of talking to the CLI side of things (a tiny API) and can:

  • Run stuff like inventory
  • Manage objects (add hosts, users etc)
  • Configure rules
  • Validate input

This goes far beyond what other config tools for Nagios ever could do, since it gives you a certain level of implicit inheritance.

That means: if you put a rule on a folder, it will match on the subobjects, unless they define something else. Nagios templates can achieve the same thing, but they have to be managed manually, and they normally produce an indigestible configuration on top.

Being implicit is very helpful with monitoring config.

But... what this also means is that if you want a really easy-to-maintain configuration, you need some practice to get those folders right.

TL;DR

Any setting (mostly tags) gets attached to the WATO folders, and then you just drop your servers in there.

WATO simply (haha) writes folders and .mk files below check_mk/conf.d/wato.
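What that looks like on disk is, again, plain Check_MK config. A folder's hosts.mk might, very roughly, contain something like the sketch below; the folder name, tags and attributes are made up, and the real files carry more WATO bookkeeping:

# conf.d/wato/linux-servers/hosts.mk -- simplified, hand-written sketch.
# all_hosts, FOLDER_PATH and host_attributes are provided by the WATO loader.
all_hosts += [ "myhost|tcp|broken|wato|/" + FOLDER_PATH + "/" ]

# attributes WATO remembers for its own editing UI
host_attributes.update({
    "myhost": { "ipaddress": "10.1.2.3" },
})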

 

Business Intelligence:

BI doesn't directly interact with Nagios, so it cannot alert without help!

Since it runs in the web UI, it's able to aggregate data from multiple sites into common rulesets (or not, as you please).

It also uses the full rule-matching fun, but it's not easy to configure. Still, it's less work than e.g. NagiosBP. Just more thinking.

It's configured with files in check_mk/multisite.d

You use a helper service to get alerting back into Nagios. It's a kludge and can be troublesome, until you get over it. (smile)

Event Console:

There are some extra Nagios interactions here.

 

 

 

What else:

Alerting

Alerting is done quite differently to allow for more complex notification settings, exceptions, and generally, stuff.

 

 

Microcore

 

The OMD builds for MK customers now also include a more powerful monitoring core. This core definitely replaces the interaction of Check_MK and Nagios, and I'm not yet able to list the changes.

One of the visible differences is the much better spreading of load on the Nagios server.

 

 

Here you see three load scenarios:

  • first: just running Cobbler, or totally idle
  • second: running Nagios with SSH push, checking 65 hosts and around 2000 services. 5-minute load average around 1.5, 15-minute around 0.8.
  • third: from "Sunday" on, switched to the Check_MK microcore, doing the same processing at a 15-minute load average of 0.08.