Which parts am I really monitoring on real systems?
What's in my config?
I'll try to describe and document it!
You can use it as a starting point for your own setups :)
what am I monitoring?
There are a few gaps still open in this configuration; for example, I have not used it with BI. This is because the customer there is trying to get along with Zabbix and PRTG. Both are rather limited IMO. I needed something workable just for my own needs (aka: know what's down, but don't be bothered).
So, it's servers and ESX hosts and a pseudo host for *some* websites.
what I see
Around 60 checks per host, which shows I'm monitoring the systems in some detail. Still, that can't amount to full in-depth monitoring of the OS, the applications and application end-to-end functionality.
An unconfigured setup will give around 20-25 checks/system, usually bumped up by just the network switches.
what I don't see
I'm not monitoring network switches, router logs, firewall logs, vpn, power and rack management systems. Notably also not the Nexus 1000V switches.
I'm also not monitoring most of the applications in-depth.
- E.g. MySQL clusters etc. are pretty, but ignored.
- A virus scanner's process is monitored, but not the signature file age.
- In most cases I don't even verify if the actual web sites are online.
I put some time into building this setup, but only so it serves the basic purpose. It helps me know whether the OS side of the infrastructure is there, but application monitoring doesn't come free just for fun. :)
(To track network partitions I have another secret instance that pings 100-odd out of >1000 IP devices there, but psssst!)
General Monitoring Config
OS health checking
These servers were set up for ISP customers so they have rather few filesystems.
Things like /var/log actually reside in /home there.
If you look carefully you'll see my critical level for filesystems is set to 100%.
This is not a common way of doing it, but I like having a "CRITICAL equals broken" alerting.
Also notice the 720 hour (1 month) trending period for the /var filesystem. If the logs were in that place, it would make even more sense.
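In the classic main.mk notation, such a rule might look roughly like this (the service pattern and the exact levels are illustrative; newer versions configure the same thing via WATO):

```
check_parameters = [
    # (parameters, hosts, service patterns)
    ({"levels": (90.0, 100.0),   # WARN at 90%, CRIT only when actually full
      "trend_range": 720},       # compute trends over 720 hours (1 month)
     ALL_HOSTS, ["fs_/var$"]),
]
```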
On a "real" server it would look like this:
CPU and others
CPU utilization (disk wait)
Set to WARN at 30%, and CRIT at 80%
Very hard to trigger this. Even 4 stray processes on a server are still "OK". Which is fine - the server will still be doing its job.
Either way, take care when setting up the predictive alerts. If you set them too tight you'll get a lot of alerts.
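From memory, the old-style main.mk entry for those disk-wait levels is just a plain tuple (again illustrative; double-check against your version's check manpage):

```
check_parameters = [
    # (warn%, crit%) for disk wait on the CPU utilization check
    ((30.0, 80.0), ALL_HOSTS, ["CPU utilization"]),
]
```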
Important OS Processes
This check tests the firewall. It validates if it's running and has no config errors.
By the same means I can also check that all systems that need a firewall have it enabled. I'd like to extend it to be more specific about the rules that are active (that would make it actually useful), but I have no idea yet how to do that. One idea is to check the rules in the config against the loaded ones. Looking for better ideas...
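The config-vs-loaded idea could be sketched as a Check_MK local check like this; it assumes PF, and the pfctl calls in the comment are live-use examples only (any output normalization between the two listings is left out):

```shell
# Sketch: compare the pf ruleset loaded in the kernel with what a
# dry-run parse of /etc/pf.conf would produce.
compare_rulesets() {
    # $1 = rules currently loaded, $2 = rules parsed from the config file
    if [ "$1" = "$2" ]; then
        echo "0 PF_Rules - loaded ruleset matches pf.conf"
    else
        echo "2 PF_Rules - loaded ruleset differs from pf.conf"
    fi
}

# live use on a PF system (assumption, untested here):
#   compare_rulesets "$(pfctl -sr)" "$(pfctl -nvf /etc/pf.conf)"
```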
Windows system time offset
This check works with any OS; you just need the right client plugin.
It only needs to do a date +%s
WARN at 100s, CRIT at 300s
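The agent side really is tiny; in the stock Linux agent the relevant section boils down to something like this (the server then compares the reported epoch against its own clock):

```shell
# systemtime section as emitted by the agent: a header plus the
# current unix epoch time in seconds
echo '<<<systemtime>>>'
date +%s
```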
State of NTP sync
Sharp on NTP servers
800ms/20000ms on others
(remember, for me critical == broken)
I have written a uCARP check that simply verifies the master IP is present. uCARP by itself has no monitoring options other than sending a mail.
If the cluster has flipped over I'll get a warning like this:
Also, of course monitoring the process...
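The mastership part could look like this as a local check; the virtual IP and the interface in the usage comment are placeholders:

```shell
# Sketch: the uCARP master holds the virtual IP ("VIP") on its
# interface, the backup does not.
carp_status() {
    # $1 = output of `ifconfig <iface>`, $2 = the virtual IP
    if printf '%s\n' "$1" | grep -qF "inet $2 "; then
        echo "0 uCARP_Master - VIP $2 is active, we are master"
    else
        echo "1 uCARP_Master - VIP $2 not present, we are backup"
    fi
}

# live use (placeholder interface and VIP):
#   carp_status "$(ifconfig em0)" 192.0.2.10
```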
The first check alerts on the installed OS release.
I've set it to allow -p12 and -p13 and manually raise the "desired" release after it has gone through testing.
So, while testing -p13 both are allowed. When I want -p13 everywhere, I remove -p12 from the list of allowed OS levels.
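The logic is simple enough to sketch as a local check; the service name and the release strings here are illustrative:

```shell
# Sketch: CRIT unless the running release is on the allowed list.
check_release() {
    rel="$1"; shift              # $1 = running release, rest = allowed list
    for ok in "$@"; do
        if [ "$rel" = "$ok" ]; then
            echo "0 OS_Release - $rel is allowed"
            return
        fi
    done
    echo "2 OS_Release - $rel not in allowed list: $*"
}

# live use: check_release "$(uname -r)" 9.3-RELEASE-p12 9.3-RELEASE-p13
```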
The second check looks for unmerged files in /etc in case I missed a mergemaster run or broke it somehow. This is a lifesaver.
Ports / Packages
Right now the updates check does both. I'm not happy with it, patches would be very welcome :)
The TCP connection stats check is very valuable.
If your system provides any networked services, odd problems will usually trigger this check - provided it's configured. Attacks will also pop up here.
I set values that will not be met during everyday idle or peak loads. So 30% above peak == good.
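As a main.mk fragment, such thresholds might look like this (the numbers are placeholders - derive yours from your own peak graphs, and verify the parameter names against your version's tcp_conn_stats check manpage):

```
check_parameters = [
    ({"ESTABLISHED": (1500, 2000),   # WARN/CRIT on concurrent sessions
      "TIME_WAIT":   (2000, 3000)},  # piles up during attacks or app trouble
     ALL_HOSTS, ["TCP Connections"]),
]
```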
Agent checking - The data must flow
Using rules I enforce a check on all systems matching certain tags, so check_mk will alert if no agent is found on those systems.
No agent config, no template adjustments, no nrpe config hacking. Simply a strict rule enforcement.
The bacula Filesystems get special settings to allow for wiggle room. Up and Down and Out.
So far I got no false alerts with these settings.
I still have those in an old .mk config file.
They are in an inventory_processes_perf group.
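From memory, such an entry looks roughly like this; the pattern and the min/max process counts are illustrative, and the exact field order should be checked against the ps check documentation for your version:

```
inventory_processes_perf = [
    # (service name, match pattern, user, and the count levels)
    ("Bacula Director", "~.*bacula-dir", ANY_USER, 1, 1, 1, 1),
    ("Bacula Storage",  "~.*bacula-sd",  ANY_USER, 1, 1, 1, 1),
]
```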
I configured a TCP Check for both important server ports of bacula.
(WATO->Host & Service Param->Active Checks->TCP Port)
This is what it will look like when you look for 9102:
Three plugins try to gather the basic health of bacula. They are all "real" application checks, using the bconsole command to gather info from Bacula/Bareos to identify if the backup system is doing well.
You can find them at my bitbucket nagios repository.
I use the fileinfo plugin to verify that the tape device files exist (this catches drives being powered off, etc.).
I considered doing the same via camcontrol, which is more powerful. But the fileinfo way was configured in less than 5 minutes.
Since only tape devices will be called /dev/*sa[0-9], there was no point in anything more explicit in fileinfo.cfg.
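The whole agent-side config is therefore a single glob (the file path below is the usual agent location; adjust for your install):

```
# /etc/check_mk/fileinfo.cfg -- one glob pattern per line
/dev/*sa[0-9]
```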
Note camcontrol has still many more powerful options, and the additional ctladm tool can do things like "Test Unit Ready" (TUR) checking of SCSI devices.
This would be handy for Fibre- or iSCSI attached tapes from a VTL.
1. A friendly, stable box
2. A box affected by the PostgreSQL UFS leak
3. Comparing to uptime graph
This clearly shows the issue is solved by rebooting. Also, always check out the filesystem trend graphs.
4. Filesystem trend data
To track them and have visual info, I have the "trend performance data" setting enabled:
Internal DB Health checks using check_postgres
For this I compiled and installed check_postgres. I don't have any record of the installation, but here are the parameters I'm using to check:
I removed one parameter from the example, simply for reasons of formatting.
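In plain command-line form, the database size action might be invoked like this (the thresholds and the monitoring user are placeholders; --action, --warning and --critical are standard check_postgres options):

```
# hypothetical invocation for the database_size action
./check_postgres.pl --action=database_size \
    --dbuser=monitoring --warning='5GB' --critical='10GB'
```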
So, why did the confluence database size double a month ago? I have no idea. But now I know.
If you're an actual business and do proper sizing, this info gets a lot more valuable.
This is a vast topic...
File Grouping Patterns (for bayes)
Size and age of file groups
Size and age of single files
The first rule is for two highly busy servers. It's followed by a stricter catch-all rule for the others.
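In the old main.mk notation a file grouping rule looks roughly like this (group name and path are made up for illustration):

```
fileinfo_groups = [
    # group all bayes files into one service, so size/age levels
    # apply to the group instead of each single file
    ([("bayes", "/var/spool/spamd/bayes*")], ALL_HOSTS),
]
```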
Mail delivery check
(already described here at adminspace)