FreeBSD Monitoring

Which parts am I really monitoring on real systems?

What's in my config?

I'll try to describe and document it!

 

You can use it as a starting point for your own setups  :)

what am I monitoring?

There are a few gaps still open in this configuration, e.g. I have not set it up with BI. This is because the customer there tries to get along with Zabbix and PRTG, both of which are rather limited IMO. I needed something workable just for my own needs. (aka: know what's down, but not be bothered)

So, it's servers and ESX hosts and a pseudo host for *some* websites.

 

what I see

 

Around 60 checks per host, which in a way shows I'm monitoring the systems in detail. But it can't all be in-depth monitoring of the OS, the applications, and application end-to-end functionality.

An unconfigured setup will give you around 20-25 checks per system; the average usually gets bumped up just by the network switches.

what I don't see

I'm not monitoring network switches, router logs, firewall logs, VPNs, or the power and rack management systems. Notably, also not the Nexus 1000V switches.

I'm also not monitoring most of the applications in-depth.

  • E.g. MySQL clusters etc. are pretty, but ignored.
  • A virus scanner's process is monitored, but not the signature file age.
  • In most cases I don't even verify if the actual web sites are online.

I put some time into building this setup, but only as much as it takes to serve the basic purpose. It helps me know whether the OS side of the infrastructure is there. But application monitoring doesn't come for free, so it's not something I do just for fun. :)

(To track network partitions I have another secret instance that pings 100-odd out of >1000 IP devices there, but psssst!)

General Monitoring Config

Tuning

Interval

Timeperiods

Ignored Services

Alerting

Notification Periods

Notification delay

Retries

OS health checking

Filesystems

These servers were set up for ISP customers so they have rather few filesystems.

Things like /var/log actually reside in /home there.

If you look carefully you'll see my critical level for filesystems is set to 100%. 

This is not a common way of doing it, but I like having a "CRITICAL equals broken" alerting.

Also notice the 720 hour (1 month) trending period for the /var filesystem. If the logs were in that place, it would make even more sense.
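
If you prefer plain .mk rules over WATO, the same effect looks roughly like this (a sketch using the values visible above; the more specific rule has to come first, because the first matching rule wins):

# /var additionally gets the 720 hour (one month) trend range
check_parameters += [
    ( {"levels": (94.0, 100.0), "trend_range": 720}, ALL_HOSTS, ["fs_/var$"] ),
    # everything else: warn early, but only go CRITICAL when really full
    ( {"levels": (90.0, 100.0)},                     ALL_HOSTS, ["fs_"] ),
]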

 

On a "real" server it would look like this:

fs_/                 OK - 3.5% used (0.99 of 28.6 GB), (levels at 80.00/100.00%), trend: +97.14kB / 24 hours
fs_/tmp              OK - 0.0% used (0.00 of 27.6 GB), (levels at 90.00/100.00%), trend: +0.00B / 24 hours
fs_/usr              OK - 4.9% used (1.41 of 29.0 GB), (levels at 90.00/100.00%), trend: +15.79B / 24 hours
fs_/usr/home         OK - 0.0% used (0.00 of 27.6 GB), (levels at 90.00/100.00%), trend: 0.00B / 24 hours
fs_/usr/ports        OK - 3.1% used (0.88 of 28.4 GB), (levels at 90.00/100.00%), trend: +15.79B / 24 hours
fs_/usr/src          OK - 1.9% used (0.53 of 28.1 GB), (levels at 90.00/100.00%), trend: 0.00B / 24 hours
fs_/var              OK - 1.1% used (0.30 of 27.9 GB), (levels at 94.00/100.00%), trend: +2.62MB / 168 hours
fs_/var/crash        OK - 0.0% used (0.00 of 27.6 GB), (levels at 90.00/100.00%), trend: 0.00B / 24 hours
fs_/var/log          OK - 0.0% used (0.00 of 27.6 GB), (levels at 90.00/100.00%), trend: -33.27kB / 24 hours
fs_/var/mail         OK - 0.0% used (0.00 of 27.6 GB), (levels at 90.00/100.00%), trend: +3.63kB / 24 hours
fs_/var/tmp          OK - 0.0% used (0.00 of 27.6 GB), (levels at 90.00/100.00%), trend: -0.00B / 24 hours
fs_/zdata            OK - 11.6% used (71.89 of 618.2 GB), (levels at 90.00/100.00%), trend: +4.09kB / 24 hours
fs_/zdata2           OK - 51.9% used (2833.23 of 5454.7 GB), (levels at 90.00/100.00%), trend: -86.77GB / 24 hours
fs_/zdata2/mfs       OK - 19.9% used (650.20 of 3271.6 GB), (levels at 90.00/100.00%), trend: +5.78GB / 24 h

 

Network IO

CPU and others

CPU utilization (disk wait)

Set to WARN at 30% and CRIT at 80%.
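
In .mk form that is roughly (a sketch, assuming the classic two-value levels of the CPU utilization check):

check_parameters += [
    # (warn%, crit%) on the disk wait portion
    ( (30, 80), ALL_HOSTS, ["CPU utilization"] ),
]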

CPU load

Very hard to trigger this. Even 4 stray processes on a server are still "OK". Which is fine - the server will still be doing its job.

Either way, take care when setting up the predictive alerts. If you set them too tight you'll get a lot of alerts.
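
Without the predictive part, a plain rule with loose per-core levels would look something like this (sketch, example values):

check_parameters += [
    # (warn, crit) load per CPU core - deliberately loose
    ( (5.0, 10.0), ALL_HOSTS, ["CPU load"] ),
]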

Important OS Processes

# standard unix procs
# each entry: service name, process pattern, user, warnmin, okmin, okmax, warnmax
inventory_processes_perf += [
    ( "SSH" ,           "~.*sbin/sshd" , ANY_USER, 1, 1, 20, 70),
    ( "syslog",         "~.*syslogd", ANY_USER, 0, 1, 2, 6),
    ( "syslog",         "~.*syslog-ng", ANY_USER, 0, 1, 5, 10),
    ( "xinetd",         "~.*xinetd", ANY_USER, 0, 1, 5, 10),
    ( "inetd",          "~.*/sbin/inetd", ANY_USER, 0, 1, 5, 10),
    ( "Bacula FD" ,     "~.*sbin/bacula-fd" , "root", 0, 1, 2, 2),
]

 

PF firewall

This check tests the firewall: it validates that pf is running and that the config has no errors.

By the same means I can also check that all systems that need a firewall actually have it enabled. I'd like to extend it to be more specific about which rules are active, but I have no idea how to do that in a way that's actually useful. One idea is to compare the rules in the config file against the loaded ones. Looking for better ideas...
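
For illustration, a minimal local check along those lines could look like this (a sketch, not my actual plugin - it only covers "pf is enabled" and "config still parses"):

#!/usr/bin/env python
# Sketch of a pf local check: CRITICAL if pf is disabled or if
# /etc/pf.conf no longer parses (pfctl -n only does a dry-run parse).
import subprocess

def run(cmd):
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    out, _ = proc.communicate()
    return proc.returncode, out

rc_info, info = run(["pfctl", "-s", "info"])
rc_parse, _ = run(["pfctl", "-n", "-f", "/etc/pf.conf"])

if rc_info != 0 or "Status: Enabled" not in info:
    print("2 PF_Status - CRITICAL - pf is not enabled")
elif rc_parse != 0:
    print("2 PF_Status - CRITICAL - /etc/pf.conf has syntax errors")
else:
    print("0 PF_Status - OK - pf enabled, config parses cleanly")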

 

Windows system time offset

This check works with any OS; you just need the right agent plugin.

The plugin just has to report the output of a date +%s.

WARN at 100s, CRIT at 300s
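
The levels then go into a normal rule (sketch):

check_parameters += [
    # (warn, crit) offset in seconds between agent clock and monitoring server
    ( (100, 300), ALL_HOSTS, ["System Time"] ),
]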

 

State of NTP sync

Sharp levels on the NTP servers themselves.

800ms/20000ms on the others.

(remember, for me critical == broken) 
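
For the relaxed hosts that translates to something like this (sketch - the first value is the maximum acceptable stratum):

# (max. stratum, warn offset in ms, crit offset in ms)
ntp_default_levels = (10, 800.0, 20000.0)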

 

uCARP clusters

I have written a uCARP check that simply verifies that the master IP is present. uCARP by itself has no monitoring options beyond sending a mail.

If the cluster has flipped over I'll get a warning like this:

Also, of course monitoring the process...
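
As an illustration, such a local check can be as small as this (a sketch with a made-up VIP, not my actual plugin):

#!/usr/bin/env python
# Sketch: warn if this node does not currently hold the uCARP master IP.
import subprocess

VIP = "192.0.2.10"   # placeholder for the cluster's virtual IP

proc = subprocess.Popen(["ifconfig"], stdout=subprocess.PIPE,
                        universal_newlines=True)
out = proc.communicate()[0]

if "inet %s " % VIP in out:
    print("0 uCARP_Master - OK - master IP %s is configured here" % VIP)
else:
    print("1 uCARP_Master - WARNING - master IP %s not on this node" % VIP)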

Updates

Base OS

The first check alerts on the installed OS release.

I've set it to allow -p12 and -p13 and manually up the "desired" release after it's gone through testing.

So, while testing -p13 both are allowed. When I want -p13 everywhere, I remove -p12 from the list of allowed OS levels.
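
A sketch of how such a check can be built as a local check (placeholder release numbers, not my exact plugin):

#!/usr/bin/env python
# Sketch: compare the running patch level against a list of allowed releases.
import subprocess

ALLOWED = ["10.1-RELEASE-p12", "10.1-RELEASE-p13"]   # placeholders

proc = subprocess.Popen(["freebsd-version"], stdout=subprocess.PIPE,
                        universal_newlines=True)
release = proc.communicate()[0].strip()

if release in ALLOWED:
    print("0 OS_Release - OK - %s is an allowed release" % release)
else:
    print("2 OS_Release - CRITICAL - running %s, allowed: %s"
          % (release, ", ".join(ALLOWED)))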

 

The second check looks for unmerged files in /etc in case I missed mergemastering or broke it somehow. This is a lifesaver.

Ports / Packages

Right now the updates check does both. I'm not happy with it; patches would be very welcome :)

Applications

Apache

TCP Sessions

The TCP connection stats check is very valuable.

If your system provides any networked services, odd problems will usually trigger this check - if it's configured, that is. Attacks will also show up here.

I set values that won't be reached during everyday idle or peak loads. So roughly 30% above peak is a good value.
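
A sketch of such a rule, with made-up numbers - pick whatever sits about 30% above your real peaks:

check_parameters += [
    ( {"ESTABLISHED": (800, 1200), "TIME_WAIT": (2000, 4000)},
      ALL_HOSTS, ["TCP Connection stats"] ),
]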

Bacula

Agent checking - The data must flow

Using checks += I enforce a check on all systems matching certain tags. That way check_mk will alert if there's no agent data found on those systems.

No agent config, no template adjustments, no nrpe config hacking. Simply strict rule enforcement.

checks += [
    # force a "Bacula FD" ps.perf check on every prod host with the tcp tag
    # (except those tagged bohmedv) - missing agent data then means an alert
    ( ["tcp", "prod", "!bohmedv" ], ALL_HOSTS, "ps.perf",  "Bacula FD" , ( "~.*sbin/bacula-fd" , "root", 0, 1, 2, 2)),
]

 

Filesystems

The Bacula filesystems get special settings to allow for some wiggle room, since usage there swings up and down a lot.

So far I've had no false alerts with these settings.
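
I won't pretend these are the real numbers, but such a rule looks roughly like this (placeholder filesystem name and values):

check_parameters += [
    # roomier levels plus a one-week trend range for the backup areas
    ( {"levels": (95.0, 100.0), "trend_range": 168},
      ALL_HOSTS, ["fs_/backup$"] ),
]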

Process checks

I still have those in an old .mk config file.

They are in an inventory_processes_perf group.

    ( "Bacula FD" ,     "~.*sbin/bacula-fd" , "root", 0, 1, 2, 2),
    ( "Bacula SD" ,     "~.*sbin/bacula-sd" , ANY_USER, 0, 1, 2, 2),
    ( "Bacula Dir" ,    "~.*sbin/bacula-dir" , ANY_USER, 0, 1, 2, 2),

 

TCP Checks

I configured a TCP check for both of Bacula's important server ports.

(WATO->Host & Service Param->Active Checks->TCP Port)

This is what it will look like when you look for 9102:
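
Written by hand instead of through WATO, the rule would look roughly like this (format from memory, it may differ between versions; the tag is made up):

active_checks["tcp"] = [
    # (port, options) - one rule per Bacula port, here 9102 for the file daemon
    ( (9102, {}), ["bacula"], ALL_HOSTS ),
]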

Local plugins

Three plugins try to gather the basic health of bacula. They are all "real" application checks, using the bconsole command to gather info from Bacula/Bareos to identify if the backup system is doing well.

[/usr/local/lib/check_mk_agent/local]# ./bacula-dir-status 
0 bacula-dir-connect - OK - Bacula director is accessible
[/usr/local/lib/check_mk_agent/local]# ./bacula-clients-broken 
0 bacula-clients-broken - OK - No broken Clients found.
[root@waxu0604 local]# ./bacula-jobs 
0 bacula-jobs - OK - 2 running jobs, 0 waiting jobs

You can find them at my bitbucket nagios repository.

 

Tape drives

I use the fileinfo plugin to verify that the tape device files exist (this catches cases like the drives being powered off, etc.).

/etc/check_mk/fileinfo.cfg
/dev/nsa[0-9]

 

camcontrol

I considered doing the same via camcontrol, which is more powerful. But the fileinfo way was configured in less than 5 minutes.

# camcontrol devlist
<QUANTUM ULTRIUM 5 3180>           at scbus0 target 0 lun 0 (sa0,pass0)
<QUANTUM ULTRIUM 5 3180>           at scbus0 target 1 lun 0 (sa1,pass1)
<WDC WD30EFRX-68AX9N0 80.00A80>    at scbus1 target 0 lun 0 (ada0,pass2)
<WDC WD30EFRX-68AX9N0 80.00A80>    at scbus2 target 0 lun 0 (ada1,pass3)
<TSSTcorp CDDVDW SN-208AB FT00>    at scbus5 target 0 lun 0 (pass4,cd0)

Since only tape devices will be called /dev/*sa[0-9], there was no point in being more explicit in the fileinfo.cfg.

Note that camcontrol has many more powerful options, and the additional ctladm tool can do things like "Test Unit Ready" (TUR) checks of SCSI devices.

This would be handy for Fibre- or iSCSI attached tapes from a VTL.

 

PostgreSQL

DB Filesystems:

1. A friendly, stable box

2. A box affected by the PostgreSQL UFS leak

3. Comparing to uptime graph

This clearly shows that the issue is solved by rebooting. Also, always check out the filesystem trend graphs.

 

4. Filesystem trend data

To track them and have visual info, I have the "trend performance data" setting enabled:
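
In rule form that is just another filesystem parameter (sketch, with a placeholder path for the DB filesystem):

check_parameters += [
    # emit trend values as performance data so they can be graphed
    ( {"trend_perfdata": True, "trend_range": 24},
      ALL_HOSTS, ["fs_/usr/local/pgsql"] ),
]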

Process checks:

    ( "Postgres Postmaster", "/usr/bin/postmaster", ANY_USER, 0, 1, 2, 2),

 

Internal DB Health checks using check_postgres

For this I compiled/installed check_postgres. I don't have any record of the installation, but here are the parameters I'm using for the checks:

/etc/check_mk/mrpe.cfg
PGSQL_Query_Time /usr/local/check_postgres/check_postgres_query_time --warning='10m' --critical='300m'
PGSQL_Autovacuum_Age /usr/local/check_postgres/check_postgres_last_autovacuum --warning='14d' --critical='30d'
PGSQL_Autovacuum_Freeze /usr/local/check_postgres/check_postgres_autovac_freeze --warning="97%" --critical="100%"
PGSQL_DB_Size /usr/local/check_postgres/check_postgres_database_size --warning="2 G"
PGSQL_DB_Bloat /usr/local/check_postgres/check_postgres_bloat --warning="100 M" --critical="5000 M"
PGSQL_Hitratio /usr/local/check_postgres/check_postgres_hitratio --warning="95%" --critical="90%"

I removed one parameter from the example: 

 --dbuser=pgsql

Simply for reasons of formatting.

 

DB Statistics

So, why did the Confluence database double in size a month ago? I have no idea - but at least now I know that it happened.

If you're an actual business and do proper sizing, this info gets a lot more valuable.

Mail Servers

This is a vast topic...

File Grouping Patterns (for bayes)

Size and age of file groups

Size and age of single files

The first rule is for two highly busy servers. It's followed by a stricter catch-all rule for the others.
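
In .mk form the grouping part looks roughly like this (a sketch - the path is a placeholder; the size and age levels are then attached to the resulting file group services via the rules named above):

fileinfo_groups += [
    ( [("Bayes DBs", "/home/*/.spamassassin/bayes_*")], ALL_HOSTS ),
]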

 

Queue checks

(...custom...sendmail...queue...checks)

Mail delivery check

(already described here at adminspace)

 

Process checks