The Guru College

Using Nagios Logs For Availability Calculation

I’m of the opinion that using Nagios log data for any kind of statistical percentage calculation is a bad idea. My employer runs a distributed, load-balanced Nagios system (that I architected, deployed, and currently maintain), so I am in a position to have looked at this problem repeatedly. Nagios gets almost all of its data by polling a host to see if a given service is responding properly. To keep load down on servers, checks are usually run no more frequently than once every 15 minutes. Two things must be overcome before you can assess Nagios log data properly: the granularity of event data, and the fact that Nagios servers aren’t perfect.

First, granularity. Let’s take a hypothetical look at a service that is checked every 13 minutes by the Nagios polling servers (node1 and node2), which report events to the master notification nodes (master1 and master2):

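To make this concrete, here’s a rough sketch in Python. The poll times and the outage window are made up, but any failure that starts and ends between two polls plays out the same way:

```python
from datetime import datetime, timedelta

CHECK_INTERVAL = timedelta(minutes=13)

# Made-up poll times: 7:47, 8:00 and 8:13 AM on an arbitrary day.
checks = [datetime(2010, 5, 3, 7, 47) + i * CHECK_INTERVAL for i in range(3)]

# Made-up outage: down from 7:48 to 7:59, just shy of one full interval.
outage_start = datetime(2010, 5, 3, 7, 48)
outage_end = datetime(2010, 5, 3, 7, 59)

# Nagios only sees the state at each poll, so the logged downtime runs from
# the first failing poll to the first poll that sees the recovery.
failed_polls = [t for t in checks if outage_start <= t < outage_end]
if failed_polls:
    recovery_poll = next(t for t in checks if t >= outage_end)
    logged = recovery_poll - failed_polls[0]
else:
    logged = timedelta(0)

print("actual downtime:", outage_end - outage_start)  # 0:11:00
print("logged downtime:", logged)                     # 0:00:00
```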

An outage just shy of 13 minutes is never reported, even though it happens at 8:00 AM and many users know their service was down; the official uptime report says that all is well and you are still at 100% availability. The next night, this happens:

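Assume, purely for illustration, a one-minute retry interval and five check attempts before the state goes HARD (made-up numbers, not our production config). The back-of-the-envelope math works out like this:

```python
from datetime import timedelta

# Made-up, illustrative settings (not our production values):
check_interval = timedelta(minutes=13)   # normal polling interval
retry_interval = timedelta(minutes=1)    # recheck interval while the state is SOFT
max_check_attempts = 5                   # failed checks needed to go HARD

# The service dies right after a scheduled poll and is back 4 minutes later.
actual_outage = timedelta(minutes=4)

# Failures are rechecked every retry_interval until the HARD state trips,
# at which point Nagios drops back to the normal check_interval.
time_to_hard = (max_check_attempts - 1) * retry_interval   # 4 minutes of retries

# The service recovers just as the HARD CRITICAL is logged, but the next poll
# is a full check_interval away, so the log shows CRITICAL until then.
logged_critical = time_to_hard + check_interval

print("actual outage:", actual_outage)      # 0:04:00
print("logged outage:", logged_critical)    # 0:17:00
```

Plug a 30-minute check interval into the same arithmetic and you get the 34-minute figure mentioned below.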

By failing enough times and tripping the HARD state, Nagios falls back into its regular checking routine. In this case, a 4-minute outage is reported as CRITICAL for 17 minutes. Yes, I fully acknowledge that this is a worst-case-scenario example, but nothing really strange has to happen to make it fail in this manner. If the check interval is set to 30 minutes instead of 13, an outage of 30 minutes can be totally missed, or a 4-minute outage can be reported as a 34-minute outage. Going the other way and setting the check interval to 2 minutes is just as bad a solution, as we have over 5,000 service checks that are executed on both poller nodes; a 2-minute interval would melt the face off our servers. Perhaps when we move to Merlin this summer, this can be fine-tuned.

The second problem comes from the fact that the Nagios system was designed for reliability and notifications, not for statistical accuracy. The system is designed so that outages don’t impact its ability to report on service problems in a timely way. Another way to phrase this: is an outage seen from master1 but not seen from master2 an outage? Unfortunately, the front-end nodes don’t always have the same data. Recent events at work back this up:

  1. Recent switch maintenance gave master2 a very different view of the network than master1, as master2 was on the switch stack that the Networking group was working on. If you compare the two log files, you get very different numbers for host and service availability.
  2. A few weeks ago, master2 was moved between racks, and it was down for 45 minutes. For those 45 minutes, in the view of master2, no state changes took place.
  3. A few weeks before that, the NSCA daemon on master1 hung, and stopped accepting service check results for over an hour before it was caught and fixed. For that hour, master1 thought that no state changes took place.

That list goes on and on, pretty much forever. The nature of services is that they go down from time to time, including the monitoring services. In order to accurately correlate the data and calculate statistical availability numbers, you’d have to keep track of every time something happened to the Nagios servers, and adjust your results accordingly. Remember: it’s a different adjustment if a poller goes down, if there is network separation, or if a firewall allows one poller node to check a remote host while the other poller node can’t reach it. The data about outages needed to do that kind of processing isn’t recorded by anyone where I work, and even if it were, it wouldn’t be simple to translate that information into availability numbers. And again, Merlin may help make the numbers consistent by making state data a shared resource, but that is very much not a common use scenario.
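To give a feel for what “adjust your results accordingly” would even mean, here is a rough sketch with invented numbers that discounts the time a single master was blind. It doesn’t even begin to deal with disagreeing pollers or firewall asymmetries:

```python
from datetime import datetime, timedelta

# Invented example: one service, one day, as seen from master1 only.
day_start = datetime(2010, 5, 3)
day_end = day_start + timedelta(days=1)

# CRITICAL windows that master1 actually logged for the service.
logged_outages = [(day_start.replace(hour=2), day_start.replace(hour=2, minute=30))]

# Windows when master1 itself was blind (down for a rack move, hung NSCA, ...).
# "No state changes" in the log means nothing at all during these.
master1_blind = [(day_start.replace(hour=14), day_start.replace(hour=14, minute=45))]

def total(windows):
    return sum((end - start for start, end in windows), timedelta(0))

observable = (day_end - day_start) - total(master1_blind)
naive_availability = 1 - total(logged_outages) / (day_end - day_start)
adjusted_availability = 1 - total(logged_outages) / observable

print(f"naive:    {naive_availability:.4%}")     # blind time silently counts as 'up'
print(f"adjusted: {adjusted_availability:.4%}")  # only counts time master1 could see
```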


I’m sure I could go on, but I just can’t see a way to justify relying on anything other than the log files from the hosts actually providing a service to determine when that service was up or down. When people can log in and access their data, the service is up. When they get errors or get denied entry, it’s down.
