Cutting down excessive Nagios notifications

Notification "spam"

Many people simply get too many Nagios notifications. I'm not even talking about big outages; it's the everyday emails about normal infrastructure events that drive them crazy.

Most of the messages they're getting will be "irrelevant" email notifications that they might simply ignore or delete. 

Often the Nagios admin never had the luxury of one or two week-long Nagios training classes and ends up in a catch-22: they don't know the inner workings of Nagios well enough to fight the problem from that end, and they also never got the "supervised" practice needed to develop a methodical approach to stop getting those many, many emails per night.

The configs are quite often not built to route notifications to different departments - which means the Nagios admin doesn't get only the messages that concern the Nagios server. More likely, everyone gets everything.

It seems many people think learning Nagios configuration is somehow different from learning networking or Unix or whatnot. You need to test, tinker and practice; if you don't, you'll have a much more frustrating experience. It's like running STP on a network without understanding it: it causes issues. You turn it off, it causes more issues. You learn how to use it, it prevents issues.

This article tries to condense that knowledge and experience so you can read up on it.




The road to Nagios notification hell

is paved with a few more things:

  1. Not defining contact groups that receive different notifications (this is usually a symptom of a missing IT policy)
  2. No parenting of devices / network segments (you get 100 notifications for 1 error)
  3. Default notification / retry settings
  4. Using periodic notifications "so nothing gets lost"
  5. No service dependencies (Check_MK mostly fixes that)

Real life "solutions"

to this problem, which I've encountered often:

  • "Hide it away" - Make an outlook filter and put all those mails in a folder. Never look at the folder, unless someone asks about what happened last night.
  • "Yes, our monitoring sends mails" - Make an outlook rule deleting all Nagios email
  • "What? Nagios?" - Pretend the nagios server isn't there for a while, then "forget" about the server when you virtualize your infrastructure and put it to scrap with the rest.
  • "We just use it for analysis" - Turn off all mail notifications, just look at Nagios if something is broken
  • "LOOK at what it did, we can't possibly use this" - Turn them on without configuring contacts, watch your inbox explode. Make sure you find someone to blame, then turn it back off
  • "Yeah, I kinda wondered if it still works" - Misconfigure it by accident to not notify anyone, not roll the change back and don't tell anyone about it

 

I'd like to show more sustainable approaches.

Contacts

Contact definitions are something I don't yet know how to explain really well, but the core requirement is simple: you need good definitions of who gets alerted about what.

But IF your team is constantly struggling AND everyone is responsible for everything, then you should take the hint here.

In a well-designed setup the Nagios admin would only be informed about Nagios-related events (e.g. SMS delivery not working, a 50% increase in the total number of WARN/CRIT services, a report taking longer than usual). If you send yourself all notifications, you're probably not doing it right. Even if you're the single admin, it still makes sense to break things down so you can apply different kinds of notifications to them. (Everything has its time.)
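As an illustration, in classic Check_MK main.mk rule syntax this could look roughly like the following - a sketch only: the group names and the "win" host tag are invented, and the exact rule format may differ between versions.

# main.mk - route notifications to different teams instead of "everyone gets everything".
host_contactgroups = [
    ( "windows-admins", [ "win" ], ALL_HOSTS ),   # hosts tagged "win" also alert the Windows team
    ( "ops", ALL_HOSTS ),                         # the general ops group is on every host
]
# The monitoring admins get the alerts about the monitoring itself.
service_contactgroups = [
    ( "monitoring-admins", ALL_HOSTS, [ "Check_MK", "Check_MK inventory" ] ),
]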

Parenting

Nagios parenting is really best learned in a class, or at least in a lab. Two switches and a few devices are enough; more is nice.

I think it's important to first test in a lab context and then move on to creating or validating your production config. A small lab makes effects clearly visible.

A note about testing - there's more to test than pulling network cables. You can just as well stop your Check_MK agent service, or block it via a firewall. Those are ways to simulate different error scenarios; even drop vs. refuse on the host firewall will make a difference. Play with ideas like having a service dependency from a host's Check_MK & Check_MK inventory services to the host's IPMI board and the power status of the host. Then get rid of all the overly complex ideas. (smile)


 Don't try to get it right. Try to just get it good.

With that said, my current theory is that the check intervals for all critical parent devices (the most important nodes in the parent/child relations) must be lower than the host check retry interval of the "normal" systems. In theory, Nagios holds back service notifications until it has the result of a quick on-demand host check. My feeling is that this doesn't always work as intended, hence the idea of "aiding" Nagios in determining the right result.
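For instance, in classic Check_MK main.mk syntax (a sketch - host names and interval values are made up; if you maintain your hosts via WATO, set the same attributes there instead):

# main.mk - give the critical parent devices a tighter host check schedule than ordinary hosts.
extra_host_conf["check_interval"] = [
    ( "1", [ "core-switch01", "core-router01" ] ),   # check the parents every minute
]
extra_host_conf["retry_interval"] = [
    ( "1", [ "core-switch01", "core-router01" ] ),
]
# The parent/child relation itself: first element is the parent, second the hosts behind it.
parents = [
    ( "core-switch01", [ "appserver01", "appserver02", "dbserver01" ] ),
]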

 

I'll try to outline a few things about the actual "mails" that go out to the defined contacts.

These are my own experiences from cutting down notifications, i.e. you start out with 7k a week in a larger Windows environment and want to get down to 20-30 per night MAX.

My playground Nagios setup, where I regularly re-apply these rules (yes, this is a process that needs to be repeated), is around 1k services, and I try to keep it under 1 notification per week. Since I'm also testing stuff there this is not always possible, but generally if I invest 1-2 hours to match up the notifications with the rules I have, it will run smoothly for months afterwards.

Enough intro and one last thing:

 

Generally, your results will be best if these changes are made together with the intended users, not just by the Nagios admin (you) tuning alone. Scaling out is great!

 

The existing notifications tell you what to do:

The notification reporter:

Look at your last week's notifications using notification_report.sh
(I put it in docs/treasures)
It seems to have a bug if you request reports for date ranges in the past, but it is great for daily reporting.

Run numbers on the notifications:

Search and list ALL notifications of one month (as that is a good cycle).
Identify / count the numbers per notification type (a small counting sketch follows after this list).
  • Do you need flapping or unreachable infos?
  • Do you need warnings?
  • Do you really need warnings?
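If you'd rather script this step than eyeball it, here is a small sketch in plain Python. It assumes the semicolon-separated host;service;plugin-output lines that notification_report.sh prints in the example further down (adjust the field indices if your report looks different; the file name count_notifications.py is arbitrary).

#!/usr/bin/env python
# count_notifications.py - count notifications per service description and state,
# ignoring the host names, so the most common offenders float to the top.
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    parts = line.rstrip("\n").split(";")
    if len(parts) < 3:
        continue                        # skip empty / malformed lines
    service = parts[1]                  # e.g. "Check_MK inventory"
    state = parts[2].split(" ")[0]      # "WARNING", "CRITICAL", ... from the plugin output
    counts[(service, state)] += 1

for (service, state), n in counts.most_common(20):
    print("%5d  %-35s %s" % (n, service, state))

Feed it the report output, e.g. ./notification_report.sh 1 | python count_notifications.py, widened to cover the range you want to analyse.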

Find the most common notifications, and make them go away:

  • Cut away the host names and the states.
  • Identify the most common topics, i.e. CPU utilization.
  • Only the top 10 are interesting for this step.
  • The top 2:
      • Disable notifications for them. If something happens this often, it is obviously normal for it to happen and did not have fatal consequences, i.e. it did not trip something else.
      • Make sure this measure is documented along with the known issue.
  • The top 3-8:
      • Differentiate between gauge-based and status-based alerts.
      • Gauges:
          • Identify the median value of the service on all systems, add 20% and turn that into a catch-all rule. I.e. if your median CPU utilization on all systems is 25%, set your warning level at 45%. This rule should apply to any system.
          • Next, carve out exceptions for systems that by their nature run at high usage levels, i.e. 99%. There you set a level of 101% for warn and crit, thus disabling alerting. Remember this exception rule needs to precede the catch-all to be in effect (see the configuration sketch after this list).
          • If you have systems that only exceed the normal usage levels during peak periods, there are two options: either turn off notifications for the period in question, if you can pinpoint it, or set max_check_attempts high enough to avoid early alerting. This second method has disadvantages if your excess CPU usage is fine at night during batch runs, but not during "interactive" work hours. Flex downtimes should not be considered a solution for this.
      • Status-based:
          • Handle them just the same as above: either turn off notifications for the time period in which they most often happen, if you can pinpoint it, or set max_check_attempts high enough to avoid early alerting, with the same caveats. Flex downtimes should not be considered a solution here either.
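To make the ordering point concrete, here is a minimal sketch in classic Check_MK main.mk rule syntax. The host names, the service name and the 90% critical level are just examples, and as far as I recall the first matching rule wins for check_parameters - double-check that against your version's documentation.

# main.mk - levels for a gauge-based check: the exception comes first, the catch-all second.
check_parameters = [
    # Exception: boxes that run hot by design - 101/101 effectively disables alerting.
    ( (101, 101), [ "numbercruncher01", "rendernode02" ], [ "CPU utilization" ] ),
    # Catch-all: the median across all systems was 25%, so warn at 45% (crit is an example value).
    ( (45, 90), ALL_HOSTS, [ "CPU utilization" ] ),
]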

 

If you run into a corner case defining these rules, go the other way round:
Define a proper SLA for the service, and alert based on that. If an alert comes around that is not relevant, then the SLA needs more work. Rinse and repeat

 

Fix the mail subject and content

The subject does NOT need to indicate the sending server or such. Make it so that it's perfectly readable, sortable and generally user-friendly. Not "Debugging-Friendly" for yourself.

You can run a different notification command for debugging, or use the debug_notifications setting with notify.py.

 

Use delays:

first_notification_delay, meaning: don't send a message if the issue already went away while the delay was running.

Note: Since this only affects the first notification, it will not suppress a recovery message at all. I don't know what the Nagios author(s) were smoking when they designed this feature as such an incomplete means of reducing notifications. Anyway, if you use this, you'll have no choice but to turn off recovery notifications.

It works very well like that and I pretty much like it.

With this and 5 service check retries I'm down to 1-2 notifications every few days for a live ISP setup of over 60 hosts. Every one of those remaining 1-2 is something I really need to act on.
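In classic Check_MK main.mk syntax that combination could look roughly like this - a sketch only; the 15-minute delay and the blanket use of ALL_SERVICES are examples, and extra_service_conf simply passes these attributes through into the generated Nagios config.

# main.mk - delay the first notification and drop recovery mails, as discussed above.
extra_service_conf["first_notification_delay"] = [
    ( "15", ALL_HOSTS, ALL_SERVICES ),    # time units (usually minutes) to wait before the first mail
]
extra_service_conf["max_check_attempts"] = [
    ( "5", ALL_HOSTS, ALL_SERVICES ),     # the "5 service check retries" mentioned above
]
# The delay does not suppress recovery mails, so take "r" out of the notification options.
extra_service_conf["notification_options"] = [
    ( "w,c,u", ALL_HOSTS, ALL_SERVICES ),
]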

 

Change the way you use notifications:

This is where we leave the "classic" Nagios way of doing things.

Don't alert if it doesn't die from it

If you have a service check that hasn't reported a single useful error in a year but produces a weekly false alert, get rid of it.

If nobody needs to interact with the system as an "incident response", then it does not need to alert. Yes, it is probably still relevant for reporting, but not for alerting. The emails are alerts; you should only get one if there is something to act on.

<video here was removed on YT for whatever reason. How sad>

You might want to allow only whitelisted services / service groups to notify at all.
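One way to do that, again as a classic main.mk sketch - the whitelist patterns are invented, and the whitelisting only works because the enabling rule precedes the catch-all:

# main.mk - whitelist: only the services you really want to be woken up for may notify.
extra_service_conf["notifications_enabled"] = [
    ( "1", ALL_HOSTS, [ "HTTP ", "Check_MK$" ] ),   # the whitelist (example patterns)
    ( "0", ALL_HOSTS, ALL_SERVICES ),               # everything else stays quiet
]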

 

Example of the notification reporter:

OMD[sitename]:~$ ./notification_report.sh 1
host1.dmz.domain.de;Check_MK inventory;WARNING - 4 unchecked services (ps.perf:4)
host2.domain-management.de;Check_MK inventory;WARNING - 4 unchecked services (ps.perf:4)
ftp.domain.de;Check_MK inventory;WARNING - 4 unchecked services (ps.perf:4)
host3.domain-intern.de;Check_MK inventory;WARNING - 4 unchecked services (ps.perf:4)
host4.domain.de;Check_MK inventory;WARNING - 4 unchecked services (ps.perf:4)
host5.domain-intern.de;Check_MK inventory;WARNING - 4 unchecked services (ps.perf:4)
host6.domain.de;Check_MK inventory;WARNING - 39 unchecked services (df:5, [...]

Why would we ever want emails in the middle of the night for this...

So, let's do it like this:

Now the automatic inventory check has a Nagios config setting of "notifications_enabled 0", which makes sure it won't alert. Instead, you can create a service group containing all of the inventory checks and just check the status of that service group, or include it in the admin dashboard.
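In classic Check_MK main.mk syntax this could look roughly like the following (the service group name is made up, and the variable names are as I remember them - check them against your version's documentation):

# main.mk - silence the inventory checks and collect them in a service group instead.
extra_service_conf["notifications_enabled"] = [
    ( "0", ALL_HOSTS, [ "Check_MK inventory" ] ),
]
# A service group you can pull up in the GUI or put on the admin dashboard.
define_servicegroups = { "inventory": "Inventory checks" }
service_groups = [
    ( "inventory", ALL_HOSTS, [ "Check_MK inventory" ] ),
]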

Rethink levels and notifications

Warning is your warning

Critical is when it's broken

It's a culture thing - people need to fix WARNINGs, not sit and wait.

 

 

Where is your communication / escalation definition?

Only the relevant people should get "their" notifications; basically anything from the process / Windows service level and above should only be delivered to certain groups.

If you can't pinpoint them, that's where you should call a meeting and start setting up processes. (Actually, talk to the IT manager / CIO - this should already be in place!)

 

 

 

Things I couldn't yet explain and that need more thought: 

Be really tough on what's 24x7 notification-worthy:

- i.e. if your SLA isn't intent on proactively fixing every small error (which would be a great thing, though), don't alert on component failures, only alert on end-to-end errors.

But then you really need a reliable ticket system, or processes for periodic checking, to ensure that component failures are detected and resolved within a defined period (i.e. before the next on-call period starts).
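A sketch of what that could look like in classic Check_MK main.mk syntax - the "workhours" timeperiod is assumed to exist (you have to define it yourself), and the service patterns are only placeholders for your component-level checks:

# main.mk - component-level alerts notify only during working hours;
# everything else keeps its normal 24x7 notification period.
extra_service_conf["notification_period"] = [
    ( "workhours", ALL_HOSTS, [ "NTP Time", "Memory used", "Interface " ] ),
]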

 

-> be smart, have a set of different notification commands, i.e. RSS plus the GUI plus ...

 

 

 

Examples:

 

Case 1:

Sender:

OMD Site ...?

Don't care. It would read better as "Monitoring". Nobody except the monitoring admin really cares about the installation type of Nagios. Also, the site's name is just not that important. It is helpful if you have a larger network disruption and get a lot of false alerts from it, but then you'd better configure more parenting and let Nagios handle that instead of analysing the info in your inbox (smile)

=> If the default username is left in place, you'll get distracted when reading a larger number of email alerts, since it is one more item of information to parse.

Subject:

I would really recommend not putting the notification type (PROBLEM / RECOVERY / FLAPPINGSTART / FLAPPINGSTOP ...) up here. The only time this is really of value is with downtime start/end notifications (and I don't think anyone really emails those to their users?). These are "technical" Nagios states and don't really concern the server being monitored or the issue on that server.

I think the little words "OK" or "CRIT" are much more helpful.

If you look closely you'll see that the actual state is not visible in my notifications in the mail program's preview. That's a big time failure.

It's important to check your notification outcome in a few mail clients and limit it down so that the most important information gets best visibility.

Normally the subject also has an extra "Check_MK:" prefix - I've already removed that and went for $HOSTNAME/$SERVICENAME.

 

Body:

Now why do I need the host name another time in the body? I don't!

Some people will need the Alias, but it's questionable if the Alias is one of the things that should be in the first words.

Same goes for the address. Sure, that's a valuable piece of information for the on-call person: if there are connection issues etc. they will need the IP address, and they'll also need such a second key to the same server if they're debugging CMDB errors. But do they need it at the start of the text? I don't think so. What's more, they don't even need the word "Address" - most people can tell an IP by its looks just fine. I think I added "Address" with good intentions; it was a bad idea nonetheless.

I'd aim for some region at the bottom of the notification body that has this kind of information in a tabular view.

###Server info:###
Hostname: Myserver
Alias: The webserver everyone and their mom relies on
Address: 192.168.10.1

That way you can also add extra info like links in a designated area people can jump to and use this as their tool.

Server uptime: 25min
Multisite URL: https://..../
iLO URL: https://...../
Inventory tag ID: 4335AFE

 

I would ask you to be critical about every single letter in your notification mails. Only once they pass the nit-picking test will you get valuable feedback.

 

Should it even be there?

Now, let's look at the nature of these messages:

The agent was flapping during some period of that night.

The whole thing lasted for about 25 minutes, including flapping and recovery.

Things to look into:

  • If it recovered and flapped, my retries were probably ill-chosen - in fact, they must have been set too low.
  • The host must have stayed pingable; it didn't go down completely.
  • Of course it's important if a host's Check_MK service is not responding, since that means it is not monitorable any more and is probably in a bad state - but here it has been flapping, so it was more likely a load issue.
  • It would thus be interesting to not alert as quickly.
  • And if the issue didn't worry me enough to look into it post-mortem, there should not have been a notification either.

What to do:

  • Since the Check_MK service itself is affected, you can't just turn off notifications.
  • Look at the server's graphical history and check for excessive swap / disk IO / context switches around that time.
  • I would turn off the flapping notifications, but leave the flap detection on.
  • Raise max_check_attempts to 10, meaning it needs to be continuously broken for 10 minutes (assuming a 1-minute retry interval) - see the sketch after this list.
  • I would double-check that the host check retries are considerably fewer than 10.
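Roughly, in classic Check_MK main.mk syntax (a sketch - the values follow the reasoning in the list above and are otherwise arbitrary):

# main.mk - tolerate short hiccups of the Check_MK service before alerting.
extra_service_conf["max_check_attempts"] = [
    ( "10", ALL_HOSTS, [ "Check_MK$" ] ),   # ~10 minutes at a 1-minute retry interval
]
# Keep flap detection on, but drop the flapping notifications by leaving out "f".
extra_service_conf["notification_options"] = [
    ( "w,c,u,r", ALL_HOSTS, [ "Check_MK$" ] ),
]
# Host checks should reach a hard state well before the 10 service retries are used up.
extra_host_conf["max_check_attempts"] = [
    ( "3", ALL_HOSTS ),
]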

If you are into maths, the right thing[tm] to do is to make sure that the host notification kicks in before the flap detection threshold is reached, but the flap detection threshold is reached right on the first hard state.

Ok, that is just a theory so far, but I think it will work.

 

 

Further reading

Little has been written in depth on this topic, but there are a few works you should read.

  • In the book "Web Operations" by John Allspaw and friends there's a good chapter on notification tuning and handling. (If you don't have it, go ahead and buy this book - it'll probably pay for itself within one or two saved on-call hours.)
  • "The Psychology of Alert Design", a very recent presentation on the topic. It's online at: https://speakerdeck.com/auxesis/the-psychology-of-alert-design (half of it is "OH, REALLY" devops findings, but keep reading through it)
  • Reducing Noisy Alerts from Nagios - about scale, and mostly trivial. Example of service dependencies - http://www.slideshare.net/takumisakamoto/reducing-noisy-nagios-alerts
  • iVoyeur: 7 habits of highly effective monitoring designs (;login, dec 2009)

  • Effective Monitoring and Alerting: For Web Operations - A book about monitoring / alerting theory.

I don't buy "Nagios" books anymore since they all just cover the basic "what is NRPE" stuff. This book is much better: I bought it immediately after I found a reference to it. So, I recommend you do the same. The author did a lot of hard work to describe all the basics and you'll find a lot of background on alerting in it.

 

There's also a presentation in the LISA13 proceedings - it's about both cutting down notifications and Nagios redesign. It has relevant examples of issues and attempts to resolve them. Much of it is, unfortunately, exemplary of why you should seriously consider getting 1-2 days of Nagios consulting instead of guesswork tuning in a live ops environment.

1 Comment

  1. New idea: have lower host check timeouts and shorter host check intervals depending on the number of a host's children.

    Routers will be the parent of many hosts, and it would be good to have their host status already available in case an on-demand host check is fired for one of those hosts. AFAIK the results do not propagate upwards!