This is the post-training material mixed with a generic overview.

In summary, you'd end up with around 60 checks per ONE node, which is a good level for reasonable host and application monitoring.

This number of checks is also the best argument for attaching them not to host names but to folders or host tags.

Alerting

For a start, consider grouping all your alerts into "this breaks everything now" and "this will need attention soon".

Many more options are explained in my talk from the 2013 OpenNebulaConf, but those will be most effective once you have the basic monitoring sorted.

Master node

(clustered or not)

ucarp

ucarp status check

The status check inventorizes the master / slave status for hosts.

Use this if your master/slave roles are static: it will raise a WARNING if a failover is detected.
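
If you want to see the idea behind it, here is a minimal local-check sketch (not the actual check from the repository); it assumes your ucarp up/down scripts record the current role in /var/run/ucarp.role, which is a made-up path:

#!/bin/bash
# Hypothetical local check: compare the current ucarp role against the static
# role this host is supposed to have, and warn on a detected failover.
EXPECTED="MASTER"                                  # static role of this box
ROLE=$(cat /var/run/ucarp.role 2>/dev/null || echo "UNKNOWN")
if [ "$ROLE" = "$EXPECTED" ]; then
    echo "0 ucarp_role - OK - role is $ROLE, as expected"
else
    echo "1 ucarp_role - WARNING - role is $ROLE, expected $EXPECTED (failover?)"
fi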

ucarp dual master check

The dual master check is a local check that processes the cache files in the OMD site.

Being a local check, it can't do piggyback and will be assigned to the OMD server instead.
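
As a rough illustration of that approach (not the actual check), a local check inside the OMD site could count how many cached agent outputs claim the MASTER role. It assumes the agent plugin writes a line containing the word MASTER and that the agent cache lives in the usual tmp/check_mk/cache directory of the site:

#!/bin/bash
# Hypothetical local check on the OMD server: count hosts whose cached agent
# output claims the ucarp MASTER role and complain if more than one does.
MASTERS=$(grep -l 'MASTER' "$OMD_ROOT"/tmp/check_mk/cache/* 2>/dev/null | wc -l)
if [ "$MASTERS" -gt 1 ]; then
    echo "2 ucarp_dual_master - CRITICAL - $MASTERS hosts claim to be MASTER"
else
    echo "0 ucarp_dual_master - OK - $MASTERS host(s) claim the MASTER role"
fi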

clustered processes

  • ONE
  • ucarp
  • sunstone
  • nginx / apache

 

ONE health checks

The checks are available at Bitbucket and include the agent plugins, the checks themselves, and the hook scripts.

one-hosts

check for the status of hosts and capacity

adjust thresholds for free memory in check

Contains code example for accessing data from multiple hosts
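
As a hedged illustration of the "one call, all hosts" idea (the real check in the repository parses the XML output instead), a single onehost list call already returns the state of every host, so one agent plugin section can cover the whole cluster:

#!/bin/bash
# Hypothetical agent plugin sketch: dump name and state of all ONE hosts into
# one section; the default column layout of `onehost list` is assumed.
echo '<<<one_hosts>>>'
onehost list 2>/dev/null | awk 'NR > 1 { print $2, $NF }'   # NAME and STAT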

 

one-vms

Attach to cluster host, supply hook script

Add code in the check to alert on overdue states (e.g. PENDING for a day, MIGRATING for an hour).
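
A rough sketch of that overdue-state idea (not the code from the repository) could look like this; it assumes the default onevm list column layout, with STAT in column 5 and the age at the end as "Xd HHhMM":

#!/bin/bash
# Hypothetical local check: WARN if any VM has been sitting in PENDING for a
# day or more. MIGRATING could be handled the same way with a shorter limit.
STUCK=$(onevm list 2>/dev/null | awk '
    $5 == "pend" {
        days = $(NF-1); sub(/d/, "", days)
        if (days + 0 >= 1) names = names " " $4
    }
    END { print names }')
if [ -n "$STUCK" ]; then
    echo "1 one_overdue_states - WARNING - VMs pending for a day or more:$STUCK"
else
    echo "0 one_overdue_states - OK - no VMs stuck in PENDING"
fi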

 

one-stats

Generic lifecycle check.

Contains code example for accessing historic data from the RRDs
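
For instance, pulling the last hour of a metric out of the OMD/PNP4Nagios RRDs can be as simple as the following; host, service and datasource names are only examples and have to match your site:

# Fetch one hour of averaged data from a perfdata RRD inside the OMD site
rrdtool fetch \
    "$OMD_ROOT/var/pnp4nagios/perfdata/one-master/ONE_stats_runningvms.rrd" \
    AVERAGE --start -1h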

 

onedb

Consider adding a check that runs onedb fsck on a backup copy, if that is possible.

It saves a lot of troubleshooting...
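
A hedged sketch of what such a job could look like with a MySQL backend; the database names, paths and credentials are assumptions, and it assumes onedb fsck exits non-zero when it finds problems:

#!/bin/bash
# Hypothetical nightly job: load yesterday's dump into a scratch database and
# run onedb fsck against that copy, so DB corruption is caught early.
BACKUP="/var/lib/one/backups/onedb-$(date +%F -d yesterday).sql"
mysql -u oneadmin -p"$ONE_DB_PASS" \
    -e 'DROP DATABASE IF EXISTS one_fsck; CREATE DATABASE one_fsck;'
mysql -u oneadmin -p"$ONE_DB_PASS" one_fsck < "$BACKUP"
if onedb fsck -S localhost -u oneadmin -p "$ONE_DB_PASS" -d one_fsck >/dev/null; then
    echo "0 onedb_fsck - OK - onedb fsck on the backup copy is clean"
else
    echo "2 onedb_fsck - CRITICAL - onedb fsck reported errors on the backup copy"
fi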

 

Storage

MooseFS/LizardFS

  • Process mfsmaster
  • Processes of metaloggers
  • metadata replay (doesn't exist yet; could be a local check, just make sure it's not run too often!)
  • MFS replication status (this is the most detailed check)

NFS

Exports

Just deploy the nfsexports plugin; the server check will do the rest.

Processes

Add checks for portmapper and anything else you'd hate to miss

 

XFS Fragmentation

There's a local check for XFS fragmentation status

It is IO- and CPU-intensive. Make sure it is not run every minute, and that it's run with

nice -n 19 ionice -c 3
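
One way to do that (paths and schedule are only examples) is to generate the fragmentation report once a day from cron, heavily de-prioritized, and let the local check read the cached result:

# /etc/cron.d/xfs_frag - run at most once a day, with lowest CPU/IO priority
0 4 * * * root nice -n 19 ionice -c 3 xfs_db -r -c frag /dev/sdb1 > /var/cache/xfs_frag_sdb1.txt 2>&1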

 

SSD Caches

There are nice checks for EnhanceIO & bcache.

Those can be monitored to track the average hit rates. The hit rate is usually hard to judge on its own; keep an eye on the rate over the last weeks, but keep in mind that for short bursts the performance will usually be better than the average.

Logfiles

Monitor the standard logs (/var/log/messages) and the OpenNebula oned + scheduler + web interface logfiles.

Make sure non-deadly errors only raise a "WARNING". Non-deadly means anything that can't break an already running VM or a live migration.

Deadly means deeper issues like corrupted DB etc.
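
A hedged starting point for the agent's logwatch configuration could look like this (the log paths are the standard OpenNebula ones; the patterns are examples and will need tuning for your environment):

# /etc/check_mk/logwatch.cfg (excerpt)
/var/log/one/oned.log
 C ERROR.*[Dd]atabase
 W ERROR
/var/log/one/sched.log
 W ERROR
/var/log/one/sunstone.log
 W ERROR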

 

Connectivity

SSH connect as oneadmin (optional; the host check already tells you whether the node is reachable).

 

Sunstone Login

Use the HTTP check; it's able to handle basic authentication (I hope that's enough for your setup) and parse the page's contents.
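
A rough example with the standard check_http plugin (hostname, port, credentials and the expected string are assumptions; Sunstone listens on 9869 by default):

# Log in via basic auth and make sure the page content looks like Sunstone
check_http -H one-master.example.com -p 9869 -u / \
    -a monitoring:secret -s "OpenNebula Sunstone"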

Compute nodes

IPMI

Either add the hosts' IPMI modules as separate hosts (even as a parent of the compute node) using the IPMI special agent,

or just turn on the openipmi service and use the normal agent to query the sensors/health.

Turn off the IPMI summary mode, otherwise you won't notice when IPMI fails to report!

[...]
IPMI Sensor Summary  OK - 14 sensors OK
[...]

This will also report OK if 0 sensors were found!

The mode of this check can be re-configured via the normal "inventorized checks" rulesets.

Once you've changed this, run an automatic refresh of the host's inventory.

Afterwards you should see a number of voltage and temperature checks and be able to define, for example, a check for the air intake temperature.

 

Local check for kernel modules

It needs to be adjusted to look for the right modules.

On an Intel system that means e.g. AES-NI (aesni_intel) and VT-anything (kvm_intel).
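
A minimal sketch of such a local check, with the module list as the obvious thing to adjust (names as they appear in lsmod):

#!/bin/bash
# Hypothetical local check: make sure the virtualization/crypto modules we
# rely on are actually loaded on this compute node.
WANTED="kvm_intel aesni_intel"
MISSING=""
for mod in $WANTED; do
    lsmod | grep -q "^${mod}[[:space:]]" || MISSING="$MISSING $mod"
done
if [ -n "$MISSING" ]; then
    echo "2 kernel_modules - CRITICAL - missing modules:$MISSING"
else
    echo "0 kernel_modules - OK - all expected modules loaded"
fi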

 

EDAC

If possible, configure the EDAC check to detect ECC memory errors. It might have to be adjusted.

 

Network Interfaces

While you surely need to monitor your bridges, physical interfaces and bonds, you rarely need to monitor the VMs' interfaces, especially since they live-migrate and might just be gone. So make sure you set your system to use Alias + Description for the interface name and ignore everything that matches vnet[0-9]+ - problem solved.

Alternatively, you can monitor the VM interfaces (you need to trigger an inventory of the VM host whenever a VM is created or deleted) to catch spam bots etc.

But this can be monitored inside the VM just the same, so the general advice is to not monitor the VMs' NICs.

 

Filesystem usage

  • Root volume
  • Other volumes
  • LizardFS volumes
  • NFS Client

 

Logfiles

Monitor standards (/var/log/messages)

 

SMART / Raid / ZFS

The SMART check is included by default but might demand some configuration. This is done via check_parameters and should be set per disk for disks that have known issues that somehow aren't bad.

RAID checks: most controllers are supported out of the box; otherwise search for Check_MK plugins. Avoid local checks if you can and look for "real" checks.

SW RAID is supported out of the box.

ZFS: ZFS checking is supported by default. If your agent is missing the right bits, grab them from the agent of another OS. The check intelligently parses the output of zpool status -x.

Processes

  • NFS client processes
  • OpenVswitch
  • KSM tuned?

Qemu check

The Qemu check tracks the VM CPU load and memory usage

It's deployed as a standard plugin, and the services are added to the ONE master as clustered services

 

Like the one_vms check, they need to be auto-deleted at LCM_DONE using a hook script.

Alerting time needs to be adjusted to allow for deletion of shut down VMs before an alert is fired.
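
The registration of such a cleanup hook in oned.conf could look roughly like this (classic hook syntax; the hook triggers on the DONE state, and the script name and path are examples):

VM_HOOK = [
    name      = "cmk_cleanup",
    on        = "DONE",
    command   = "/var/lib/one/remotes/hooks/cmk_cleanup.sh",
    arguments = "$ID" ]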

 

Disk IO check

Configure the check to inventorize the READ + WRITE IO rates separately.

Configure thresholds for READ and WRITE IO.

READ should be tuned so it is not triggered by backups (so allow e.g. 95% of your known local IO rate, plus set it to not alert unless this keeps going for hours).

WRITE should be tuned so it is not triggered by a minor full-speed rebalance or anything like that. But 90% throughput for 30 minutes is worth attention and, as such, a WARNING.

If you would like to catch single IO intensive VMs quickly, use the per-VM checks in addition to this one with lower thresholds.

Also consider monitoring the latency/wait counters.

As with any checks, measure, measure, measure before deciding on the thresholds.

IO Wait check

The IO wait can be monitored using the CPU usage check.

Set it to alert on 50%+ of IO wait.

 

 

Memory check

On compute nodes you generally assume that most of the RAM is fair game for usage by VMs.

You can use page fault rates to track how much memory pressure is on the system

Besides that, monitor that memory usage stays below maybe 95%; keep in mind that at that level live migration is already impossible. Also monitor swap - on a *server* you'd expect to see no more than 500MB-4GB of swap used, and in any case not more than 30% of total swap.

If the system has no swap, don't add that part of the check. Instead, apply the same thresholds to total virtual memory as to memory.

 


Other checks

Backups

The backup ages can be monitored using two combined checks:

  • Logfile checks to ensure an error in the backup run is detected.
  • File age and size checks with fileinfo_groups to ensure the generated backup files are valid.
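
For the file age/size part, the agent needs to be told which files to report on; a hedged example of the agent-side configuration (the paths are placeholders for your backup targets):

# /etc/check_mk/fileinfo.cfg - files the agent reports age and size for
/srv/backup/one/*.tar.gz
/srv/backup/one/onedb-*.sql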

 

 

CPU alerts

For VM hosts I would not alert, even as a warning, on anything but heavy overload.

Memory alerts

Live migration can cause OOM messages on the target nodes, even though there are no bad effects.

You can still monitor for massive, continued memory exhaustion.

 


 

Networking

 

Redundancy

The Linux Bridge LACP checks should work out of the box.

The OpenVswitch LACP checks appear to be broken.

For your switches, you need to add the correct SNMP interface type to what Check_MK looks for (and then re-inventorize).

(Use e.g. a --snmpwalk to find it.)
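
The hostname below is a placeholder: take a full walk of the switch from within the OMD site and look up the ifType values, OID .1.3.6.1.2.1.2.2.1.3:

# The walk ends up in ~/var/check_mk/snmpwalks/<hostname> inside the site
cmk --snmpwalk myswitch01
grep '^\.1\.3\.6\.1\.2\.1\.2\.2\.1\.3\.' ~/var/check_mk/snmpwalks/myswitch01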

Interface checks

As soon as virtualization is involved, you don't need to heavily differentiate between hosts and switches.

  • Adjust the thresholds of the nodes' uplink interfaces and monitor their up/down state.
  • Adjust the thresholds of the switches' uplink interfaces and monitor their up/down state.
  • Use only minimal checking (error rates) on all other ports and do not monitor their up/down state.

Remember that "averaging" should be used on all counter-based checks. Remember also that averaging (which uses cache data) is less reliable than trending (which uses RRD data) and as such there can be false alarms.

Review network alarms and discard those that occurred during a period of minimal traffic, unless you suspect the errors caused the low throughput.

 

Topology sniffing

If you can wrangle SSH or other data out of your switches, track the time of the last STP recalculation and monitor that.

Also monitor the packet loss to some core and edge points of your network. You can adjust the ping checks for those devices, set them to do more than the usual 5 pings per minute, more like 30.
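
A hedged example with check_icmp, sending 30 packets per run instead of the default 5 (hostname and thresholds are examples):

check_icmp -H core-switch01 -n 30 -w 100.0,5% -c 250.0,10%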

Spend time to configure LLDP where possible, since it will help you a lot in debugging issues. Check_MK currently has no direct support of LLDP :(

WAN topics

There is an Ethernet OAM implementation for Linux, but no checks for it.

Use dyndns_hosts for any system that has an uplink with changing IP address.