This is the post-training material mixed with a generic overview.
In summary, you'd end up with around 60 checks per ONE node, which is a good baseline for reasonable host and application monitoring.
This number of checks should best explain why you should not attach them to host names but to folders or host tags.
To start with, consider grouping all your alerts into "this breaks everything now" and "this will need attention soon".
Many more options are explained in my talk from 2013's OpenNebulaConf, but those will be most efficient once you have the basic monitoring sorted.
(clustered or not)
ucarp status check
The status check inventorizes the master / slave status for hosts.
Use this if your master/slave roles are static. It will alert as warning if a failover is detected.
ucarp dual master check
The dual master check is a local check that processes the cache files in the OMD site.
Being a local check it can't do piggyback and will be assigned to the OMD server instead.
nginx / apache
ONE health checks
The checks are here at bitbucket and include agent plugins, the checks and hook scripts.
check for the status of hosts and capacity
adjust thresholds for free memory in check
Contains code example for accessing data from multiple hosts
Attach to cluster host, supply hook script
add code in check to alert on overdue states (e.g. PENDING for a day, MIGRATING for an hour)
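A minimal sketch of what such an overdue-state rule could look like. It is not the code from the actual check: the function name, the `OVERDUE_LIMITS` table and the way the state and its start time are passed in are assumptions for illustration; only the two limits (PENDING for a day, MIGRATING for an hour) come from the notes above.

```python
from datetime import datetime, timedelta

# Hypothetical limit table; only these two values are taken from the text.
OVERDUE_LIMITS = {
    "PENDING": timedelta(days=1),
    "MIGRATING": timedelta(hours=1),
}


def overdue_status(state, entered_at, now):
    """Return (status, text); status 1 (WARNING) if the VM is stuck in a state.

    `state` is the VM state name, `entered_at` the time it entered that state.
    States without an entry in OVERDUE_LIMITS never count as overdue.
    """
    age = now - entered_at
    limit = OVERDUE_LIMITS.get(state)
    if limit is not None and age > limit:
        return 1, f"{state} for {age}, limit is {limit}"
    return 0, f"{state} for {age}"
```

States that may legitimately last forever (a running VM, for example) simply get no entry in the table.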
Generic lifecycle check.
Contains code example for accessing historic data from the RRDs
consider adding a check that does a onedb fsck on a backup copy, if that is possible
saves on a lot of troubleshooting...
- Process mfsmaster
- Processes of metaloggers
- metadata replay (doesn't exist yet; could be a local check. Make sure it's not run too often!)
- MFS replication status (this is the most detailed check)
Just deploy the nfsexports plugin, the server check will do the rest
Add checks for portmapper and anything else you'd hate to miss
There's a local check for XFS fragmentation status
It is IO- and CPU intensive. Make sure it is not run every minute, and that it's run with
Those can be monitored to check the average hit rates. The hit rate is usually hard to judge; keep an eye on the rate over the last weeks, but keep in mind that for short bursts the performance will usually be better than the average.
Monitor standards (/var/log/messages) and the OpenNebula ONEd + Scheduler + Webinterface logfiles
Make sure non-deadly errors are only "WARNING". Non-deadly means anything that can't break an already running VM or a live migration.
Deadly means deeper issues like corrupted DB etc.
SSH connect as oneadmin (optional, host check already knows)
Use the http check, it's able to handle basic authentication (I hope that's enough) and parse the page's contents.
Either add the hosts' IPMI modules as separate hosts (even as a parent of the compute node) using the IPMI special agent
Or you just turn on the openipmi service and use the normal agent to query the sensors/health
Turn off the ipmi summary to avoid having issues when IPMI fails to report!
This will also report OK if 0 sensors were found!
The mode of this check can be re-configured via the normal "inventorized checks" rulesets.
Once you have changed this, run an automatic refresh of the host's inventory.
Afterwards you should see a number of voltage and temperature checks and be able to define e.g. checks for the air intake temperature.
Local check for kernel modules
It needs to be adjusted to look for the right modules.
On an Intel system that means e.g. AES-NI (aesni-intel) and VT-anything (kvm-intel)
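A rough sketch of how such a local check could be written; this is not the check referred to above. It assumes the standard local check output format (`status name perfdata text`, one line per service) and that modules are listed in /proc/modules with underscores instead of dashes. The service names (`kmod_...`) and the module list are placeholders to adjust.

```python
# Assumed module list, taken from the Intel example in the text; adjust per platform.
REQUIRED_MODULES = ["aesni-intel", "kvm-intel"]


def loaded_modules(proc_modules_text):
    """Parse the content of /proc/modules into a set of module names."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def local_check_lines(required, loaded):
    """Build one local check line per required module (0 = OK, 2 = CRIT)."""
    lines = []
    for name in required:
        # /proc/modules lists modules with '_' even if they are written with '-'.
        present = name.replace("-", "_") in loaded
        status = 0 if present else 2
        text = "loaded" if present else "NOT loaded"
        lines.append(f"{status} kmod_{name} - module {name} {text}")
    return lines


if __name__ == "__main__":
    try:
        with open("/proc/modules") as handle:
            print("\n".join(local_check_lines(REQUIRED_MODULES, loaded_modules(handle.read()))))
    except OSError as exc:
        # Status 3 = UNKNOWN; the service name "kmod" is a placeholder.
        print(f"3 kmod - cannot read /proc/modules: {exc}")
```

Note that this only sees loadable modules; features compiled into the kernel will not show up in /proc/modules.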
If possible, configure the EDAC check to detect ECC memory errors. It might have to be adjusted.
While you surely need to monitor your bridges, physical interfaces and bonds, you rarely need to monitor the VMs' interfaces, especially since they live migrate and might just be gone. So make sure you set your system to use Alias + Description for the interface name and ignore all that are vnet[0-9]+ - problem solved.
Alternatively, you can monitor the VM interfaces (you need to fire an inventory of the VM host whenever a VM is created or deleted) to catch spam bots etc.
But this can be monitored inside the VM just the same. So the general advice is to not monitor the VMs' NICs.
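The vnet[0-9]+ exclusion mentioned above boils down to a simple pattern match on the interface name. The sketch below only illustrates that pattern in plain Python; it is not Check_MK configuration, and the function name is made up.

```python
import re

# Interfaces created for VMs on the host, as named in the text.
VM_TAP_PATTERN = re.compile(r"^vnet[0-9]+$")


def interfaces_to_monitor(names):
    """Drop the per-VM vnetN interfaces, keep everything else."""
    return [name for name in names if not VM_TAP_PATTERN.match(name)]
```

Anchoring the pattern at both ends keeps an interface that merely contains "vnet" in its name from being dropped by accident.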
Monitor standards (/var/log/messages)
SMART / Raid / ZFS
The SMART check is included by default, but might need some configuration. This is done via check_parameters and should be done per disk for disks that have known issues which somehow aren't bad.
Raid checks: Most controllers are supported out of the box, otherwise search for Check_MK plugins. Avoid local checks if you can, look for "real" checks.
SW Raid is supported out of the box.
ZFS: ZFS checking is supported by default. If your agent is missing the right bits, grab them from another OS. The check scrubs (intelligently) the output from zpool status -x.
NFS client processes
The Qemu check tracks the VM CPU load and memory usage
It's deployed as a standard plugin, and the services are added to the ONE master as clustered services
Like the one_vms check, they need to be auto-deleted at LCM_DONE using a hook script.
Alerting time needs to be adjusted to allow for deletion of shut down VMs before an alert is fired.
Disk IO check
Configure the check to inventorize the READ + WRITE IO rates separately.
Configure thresholds for READ and WRITE IO.
READ should be tuned so it is not triggered by backups (so allow e.g. 95% of your known local IO rate, plus set it to not alert unless this keeps going for hours)
WRITE should be tuned so it is not triggered by a minor full-speed rebalance or anything like that. But 90% throughput for 30 minutes is worth attention, and as such, a WARNING.
If you would like to catch single IO intensive VMs quickly, use the per-VM checks in addition to this one with lower thresholds.
Also consider monitoring the latency/wait counters.
As with any checks, measure, measure, measure before deciding on the thresholds.
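To make the "90% throughput for 30 minutes" idea from the WRITE threshold concrete, here is a small sketch of a sustained-threshold rule. It is only an illustration of the logic, not how the disk IO check implements it; the function names and the sample format (timestamp in seconds, rate) are assumptions.

```python
def sustained_breach(samples, limit, min_duration):
    """True if the rate has stayed at or above `limit` for `min_duration` seconds.

    `samples` is a list of (timestamp_seconds, rate) tuples, oldest first.
    Any sample below the limit resets the streak.
    """
    breach_start = None
    for timestamp, rate in samples:
        if rate >= limit:
            if breach_start is None:
                breach_start = timestamp
        else:
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= min_duration


def write_io_status(samples, max_rate, fraction=0.9, minutes=30):
    """Return 1 (WARNING) for a sustained high write rate, else 0 (OK)."""
    return 1 if sustained_breach(samples, max_rate * fraction, minutes * 60) else 0
```

`max_rate` is the throughput you measured for your own storage; there is no sensible default for it.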
IO Wait check
The IO wait can be monitored using the CPU usage check.
Set it to alert on 50%+ of IO wait.
On compute nodes you generally assume that most of the RAM is fair game for usage by VMs.
You can use page fault rates to track how much memory pressure is on the system
Besides that, monitor for memory usage staying under about 95%. Keep in mind that at this level there is already no room for live migration. Also monitor swap: on a *server* you'd expect to see no more than 500MB-4GB of swap used, and in no case more than 30% of total swap.
If the system has no swap, don't add that part of the check. Instead, also use the same values for total virtual memory as for memory.
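The memory and swap rules above can be written down as a small decision function. This is a sketch under assumptions: all values are in bytes, the 4 GiB default is the upper end of the 500MB-4GB range from the text, and the function and parameter names are invented for this example.

```python
GIB = 1024 ** 3


def memory_status(mem_used, mem_total, swap_used, swap_total,
                  mem_warn_pct=95.0, swap_warn_bytes=4 * GIB, swap_warn_pct=30.0):
    """Return (status, problems); status 1 (WARNING) if any rule is violated."""
    problems = []
    if 100.0 * mem_used / mem_total >= mem_warn_pct:
        problems.append("memory usage at or above threshold")
    # Without swap, the swap rules are skipped entirely.
    if swap_total > 0:
        if swap_used > swap_warn_bytes:
            problems.append("absolute swap usage too high")
        if 100.0 * swap_used / swap_total > swap_warn_pct:
            problems.append("swap usage above percentage threshold")
    return (1 if problems else 0), problems
```

The percentage rule matters on hosts with little swap, the absolute rule on hosts with a lot of it.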
The backup ages can be monitored using two combined checks:
Logfile checks to ensure an error in the backup run is detected.
File age and size checks with fileinfo_groups to ensure the generated backup files are valid.
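The file age and size part can be illustrated with a few lines of standalone Python. This is not the fileinfo check itself and does not use fileinfo_groups; it only shows the rule such a check applies. The function name and the thresholds passed in are assumptions.

```python
import os
import time


def backup_file_status(path, max_age_seconds, min_size_bytes, now=None):
    """Return (status, text): 2 (CRIT) if the file is missing, too old or too small."""
    now = time.time() if now is None else now
    try:
        info = os.stat(path)
    except OSError:
        return 2, f"{path} is missing"
    age = now - info.st_mtime
    if age > max_age_seconds:
        return 2, f"{path} is {age:.0f}s old, limit is {max_age_seconds}s"
    if info.st_size < min_size_bytes:
        return 2, f"{path} has {info.st_size} bytes, expected at least {min_size_bytes}"
    return 0, f"{path} is {age:.0f}s old and has {info.st_size} bytes"
```

The size limit is what catches a backup job that "succeeded" but wrote an empty or truncated file.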
For VM hosts I'd not try to go and alert as a warning on anything but heavy overload.
Live migration can cause OOM messages on the target nodes, even though there are no bad effects.
You can still monitor for massive, continued memory exhaustion.
The Linux Bridge LACP checks should work out of the box.
The OpenVswitch LACP checks appear broken
For your switches, you need to add the correct SNMP interface type to what Check_MK looks for (and inventorize)
(Use e.g. a --snmpwalk to find it)
As soon as virtualization is involved, you don't need to heavily differentiate between hosts and switches.
- Adjust the thresholds of the Nodes' uplink interfaces. Monitor the up/down state of them
- Adjust the thresholds of the switches' uplink interfaces. Monitor the up/down state of them
- Use only minimal checking (error rates) on all other ports. Do not monitor the up/down state of them
Remember that "averaging" should be used on all counter-based checks. Remember also that averaging (which uses cache data) is less reliable than trending (which uses RRD data) and as such there can be false alarms.
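For illustration, this is roughly what averaging does with a counter-based value: the raw counter is first turned into a per-second rate, and the rates are then smoothed over a window. The sketch is generic Python, not the Check_MK implementation; the function names and the simple moving average are assumptions.

```python
def counter_rates(samples):
    """Turn (timestamp_seconds, counter) samples into per-second rates.

    Pairs where the counter went backwards (reset or wrap) are skipped.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t1 > t0 and c1 >= c0:
            rates.append((c1 - c0) / (t1 - t0))
    return rates


def moving_average(values, window):
    """Average each value with up to `window - 1` values before it."""
    averaged = []
    for index in range(len(values)):
        chunk = values[max(0, index - window + 1):index + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged
```

With samples `[(0, 0), (60, 600), (120, 6600), (180, 7200)]` the rates are 10, 100 and 10 per second; averaged over a window of 3 the one-minute spike shows up as 55 and then 40, so a threshold at 80 would not fire on it.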
Review network alarms and discard those that show they occurred during a period of minimum traffic, unless you suspect the errors caused the low throughput.
If you can wrangle SSH or other data out of your switches, track the time of the last STP recalculation and monitor that.
Also monitor the packet loss to some core and edge points of your network. You can adjust the ping checks for those devices, set them to do more than the usual 5 pings per minute, more like 30.
Spend time to configure LLDP where possible, since it will help you a lot in debugging issues. Check_MK currently has no direct support of LLDP :(
There is an ethernet OAM implementation for Linux, but no checks for it
Use dyndns_hosts for any system that has an uplink with changing IP address.