Nagios performance. Concepts ~ Monitoring Tips & Tricks

This article, the first in a series that puts Nagios Core performance under the microscope, analyzes what performance means from the engine's point of view and how it can be monitored. The second article, "Nagios performance. Best practices", defines a list of steps to put these concepts into practice.

When does Nagios perform well

To define this concept, it's necessary to take a step back and consider what Nagios Core is: in essence it's just a scheduler, i.e., a process that runs tasks in a predefined, cyclic way. So, in broad terms, we can consider that Nagios Core performs well when these tasks run at the time they were scheduled.

When this scheduling is delayed, latency appears: the difference, in seconds, between the time a task should have been executed and the time it actually was. For instance, if a service check is scheduled to run at 9:00:00.000 AM but actually runs at 9:00:00.500 AM, the check has a latency of 0.5 seconds.
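The definition above can be written down as a minimal Python sketch. The function name check_latency and the sample date are only illustrative; Nagios computes this value internally and nothing here is part of its API.

```python
from datetime import datetime


def check_latency(scheduled: datetime, executed: datetime) -> float:
    """Return the latency in seconds: actual start time minus scheduled time."""
    return (executed - scheduled).total_seconds()


# The example from the text: scheduled at 9:00:00.000, run at 9:00:00.500.
# The date itself is arbitrary.
scheduled = datetime(2013, 3, 9, 9, 0, 0)
executed = datetime(2013, 3, 9, 9, 0, 0, 500000)  # 500000 microseconds = 0.5 s
latency = check_latency(scheduled, executed)
print(latency)  # 0.5
```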

So, to summarize, we can consider (and the design of the core supports it) that Nagios Core performs well when the latency of active host and service checks is as low as possible, which raises another question: how low is low enough? Well, it's up to the system admin to decide what latency level is acceptable on their system. As a personal rule of thumb, the max latency on a system should never reach the value of the interval_length configuration option (60 seconds by default), so that every scheduled check is sure to run within its scheduled time window.
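As a rough sketch, that rule of thumb can be turned into a threshold check. The critical limit is the interval_length value from the text; the warning level at 75% of it is purely an assumption for illustration, so adjust it to whatever you consider acceptable. The return values follow the usual plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL).

```python
def latency_state(max_latency_s: float,
                  interval_length_s: float = 60.0,
                  warn_fraction: float = 0.75) -> int:
    """Classify the max check latency against interval_length.

    warn_fraction is a hypothetical early-warning level, not a Nagios setting.
    """
    if max_latency_s >= interval_length_s:
        return 2  # CRITICAL: latency reached the scheduling interval
    if max_latency_s >= warn_fraction * interval_length_s:
        return 1  # WARNING: getting close to the interval
    return 0      # OK


print(latency_state(0.5))   # 0
print(latency_state(50.0))  # 1
print(latency_state(60.0))  # 2
```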

Why not CPU

There are several reasons for discarding CPU as the main Nagios Core performance indicator. The first is generic: a system performs well if it does well what it's designed to do, regardless of its CPU load. The second is more specific: Nagios lacks load balancing capabilities at core level, so even when using the initial scheduling options (inter-check delay, service interleaving), the server tends to show regular load peaks. The third is more practical: a lack of CPU resources implies an increase in latency, so tracking just this metric gives you the best monitoring system health indicator.

Latency vs. execution time

Don't confuse check latency with check execution time. Execution time is the amount of time Nagios takes to execute a check. It can hardly be considered a performance indicator, since it closely depends on factors such as plugin efficiency, the load on the device being checked and the load on the network infrastructure that connects your server to the checked device.

Monitoring latency

So checking latency seems more important than checking CPU itself, and the question is how to get it programmatically. nagiostats is a command line binary that parses the Nagios status.dat file and reports some interesting statistics about the running core, latency among them. Luckily, there's no need to write a plugin for parsing nagiostats output: there are a bunch of plugins that already do it on the Nagios Exchange and Monitoring Exchange sites, most of them generating average latency performance data.
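If you are curious about what such a plugin does under the hood, here is a minimal sketch. It assumes that nagiostats is in your PATH and that your version supports the MRTG mode (--mrtg --data=...) with the MINACTSVCLAT, AVGACTSVCLAT and MAXACTSVCLAT variables, printing one value per line in milliseconds; check nagiostats --help on your system before relying on it. The function names are mine, not part of any existing plugin.

```python
import subprocess

# Active service check latency variables (assumed to be reported in ms).
LATENCY_VARS = ("MINACTSVCLAT", "AVGACTSVCLAT", "MAXACTSVCLAT")


def parse_mrtg_output(text, names=LATENCY_VARS):
    """Map one-integer-per-line MRTG output to {variable: seconds}."""
    values = [int(line) for line in text.splitlines() if line.strip()]
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} values, got {len(values)}")
    return {name: ms / 1000.0 for name, ms in zip(names, values)}


def read_latency(binary="nagiostats"):
    """Run nagiostats in MRTG mode and return latencies in seconds."""
    result = subprocess.run(
        [binary, "--mrtg", "--data=" + ",".join(LATENCY_VARS)],
        capture_output=True, text=True, check=True,
    )
    return parse_mrtg_output(result.stdout)


# Parsing a made-up sample output (12 ms min, 250 ms avg, 1500 ms max):
sample = parse_mrtg_output("12\n250\n1500\n")
print(sample["MAXACTSVCLAT"])  # 1.5
```

On a real server you would call read_latency() instead of feeding parse_mrtg_output a sample string.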

Among them I recommend the fantastic check_nagiostats, since it generates min, average and max latency performance data metrics (among many others) and, moreover, it's written in Perl, so you can get the efficiency benefits of running it with the core's embedded Perl interpreter. check_nagiostats is supported by Icinga too (renamed to check_icingastats for copyright reasons), sharing usage and configuration documentation in its project wiki.
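To give an idea of what "performance data" means here, the following sketch builds a plugin output line in the standard Nagios format: human readable text, a pipe, and then label=value pairs with a unit. It is not the real check_nagiostats output; the LATENCY prefix and the label names are hypothetical.

```python
STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def plugin_line(state: int, min_s: float, avg_s: float, max_s: float) -> str:
    """Build 'text | perfdata' with min, average and max latency in seconds."""
    perfdata = (f"min_latency={min_s:.3f}s "
                f"avg_latency={avg_s:.3f}s "
                f"max_latency={max_s:.3f}s")
    return f"LATENCY {STATE_NAMES[state]} - max {max_s:.3f}s | {perfdata}"


line = plugin_line(0, 0.012, 0.25, 1.5)
print(line)
```

Everything after the pipe is what graphing addons pick up to draw the latency trend over time.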


Saturday, 9 March 2013