A place for sharing IT monitoring knowledge


Monday, 17 June 2013

Nagios performance. Best practices


This article presents a list of tips for improving the performance of a Nagios 3.x Core based monitoring system. It can be considered the practical side of the article "Nagios performance. Concepts", where I explained what should be considered performance from the point of view of Nagios Core.

Since, as of today, the differences between Nagios Core and its forks Icinga Core and Centreon Core can be considered minor, many, if not all, of the performance optimization tips presented here can be applied to any of these three monitoring engines.

Finally, far from just copying it, I've tried to enrich the information contained in the Nagios documentation chapter "Tuning Nagios For Maximum Performance", a source that should be taken as the starting point when trying to optimize the performance of a Nagios Core based system.

Improving Nagios performance

There is no magic recipe for maximizing Nagios Core performance. Instead, optimization is achieved by systematically applying a set of basic rules; in other words, by systematically following a list of best practices like the ones that follow.

Use passive checks

If possible, opt for passive checks instead of active checks. A passive check is one whose result is not retrieved by the check itself (and thus by a process run by the core scheduler) but delivered by another entity in a non-scheduled way.

The best example of a passive check is one whose result depends on the reception of a given SNMP trap: instead of periodically running a command on the monitoring system to get one or more SNMP values, your system waits for the reception of an SNMP trap to set the status of a service. Specifically, think of a Nagios Core service intended to monitor the temperature of a server power supply: you can periodically ask the server (acting as an SNMP agent) for the power supply temperature value, or you can do nothing but wait to receive a trap from the server if something changes.

Passive checks are extensive enough to deserve an article of their own but, as a rule of thumb, rely on them when you can passively get the information needed to set both Ok and non-Ok (Warning or Critical) service states. Following the previous example, rely on passive checks if the server manages at least two traps to signal whether the power supply temperature is normal (which should set the service to an Ok state) or abnormal (which should set it to a non-Ok state). Conversely, if your server (perhaps periodically) sends a trap while an abnormal state is present and stops doing so when the problem is gone, passive checks are hard to apply, because the trigger of an Ok state is not the reception of a trap but the lack of traps in a given period of time.
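As an illustration, this is roughly how a trap handler could feed such a passive service through the Nagios external command file. Host name, service description and paths are hypothetical, and the demo writes to a plain file under /tmp instead of the real command pipe:

```shell
# Submit a passive check result to Nagios via the external command file.
# Line format: [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;plugin_output
# return_code: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
CMDFILE=${CMDFILE:-/tmp/nagios.cmd.demo}   # in production: the pipe set by command_file, e.g. /usr/local/nagios/var/rw/nagios.cmd
NOW=$(date +%s)
printf '[%s] PROCESS_SERVICE_CHECK_RESULT;server01;PSU Temperature;2;CRITICAL - power supply temperature high\n' \
  "$NOW" > "$CMDFILE"
```

A trap processor such as snmptt would typically run a script like this from its trap-matching rules.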

Define smart check periods

All of us tend to think that more is better, i.e., the more often we check a service, the more accurate a picture of its status we get. That's true, but it has a cost, of course: more checks imply more load, and, due to their nature, not all services need to be checked as often as possible. The best example could be a disk resource size check, a metric that usually grows slowly over time (we're talking about months or even years before a drive gets full). Is it necessary to check the disk resource every few minutes, or even every hour? In my opinion, no: once a day is enough.

Of course you might argue that this approach would not detect a runaway process filling a disk resource in minutes but, again, consider how often that happens and what your response time is: by the time you are ready to act, the disk may already be full.

For these reasons I recommend carefully reviewing and setting the check_interval property of your active host/service checks. Not all checks need to be scheduled every minute.
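A minimal sketch of such a relaxed schedule, assuming Nagios 3.x object syntax and the default interval_length of 60 seconds (host and command names are made up):

```
define service{
    use                     generic-service
    host_name               fileserver01
    service_description     Disk /data
    check_command           check_disk_data
    check_interval          1440    ; in interval_length units: once a day
    retry_interval          60      ; recheck hourly while in a soft non-Ok state
    max_check_attempts      3
    }
```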

Optimize active check plugins

Every time the monitoring core needs to set the status of a host or service based on an active check, it creates a subprocess that executes the plugin bound to the host/service via the command object. The more efficient the plugin, the less load on the system. But how do you select a plugin based on its efficiency? Basically, in two ways:
  • Use binary plugins or, failing that, Perl based plugins with the enable_embedded_perl configuration option set to 1: this allows Nagios to save resources by using the embedded Perl interpreter. If neither of the previous options is available, try not to use shell script plugins since, as a rule of thumb, they are the least efficient.
  • If you develop your own plugins, try to pass as much of the information the plugin needs as command line arguments instead of retrieving it every time the plugin is executed. A common example might be an SNMP based plugin checking the status of a switch port via the IF-MIB: you can program an elegant plugin that identifies the port by its description, but it will require an extra SNMP operation to determine the port index associated with that description before getting the port status. That extra operation will be repeated every time the plugin is executed, i.e., every time the service is checked.
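The cost difference can be sketched with the Net-SNMP command line tools (switch name, community string and index 10101 are made up for the example):

```shell
# Port passed by description: every check needs a walk over ifDescr to
# translate "GigabitEthernet0/1" into its ifIndex, then the status get:
#   snmpwalk -v2c -c public switch01 IF-MIB::ifDescr
#   snmpget  -v2c -c public switch01 IF-MIB::ifOperStatus.10101
#
# Port passed by index: the translation was done once, by hand, when the
# service was defined, so every check is a single SNMP operation:
#   snmpget  -v2c -c public switch01 IF-MIB::ifOperStatus.10101
```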

Use RAM disks

Nagios Core continuously consumes disk I/O resources in different ways: refreshing the status and retention files, creating temporary files every time an active check is performed, storing performance data, reading the external command file... The more optimized the disk access, the fewer resources Nagios will take from the system.

A good way to optimize it, instead of spending money on faster drives, is configuring a TMPFS disk and placing on it the spool directory (check_result_path), the status (status_file) and retention (state_retention_file) files, the performance data files (host_perfdata_file, service_perfdata_file), the external command file (command_file) and the temp path (temp_path). Care must be taken to mount and populate it every time the server boots up, just before launching Nagios, and, conversely, to back it up and unmount it after stopping Nagios every time the server shuts down.
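A sketch of the preparation step, assuming a source install layout (the demo builds the directory tree under /tmp so it can run unprivileged; the commented lines show the production equivalents):

```shell
# Prepare a RAM disk layout for Nagios volatile files.
RAMDISK=${RAMDISK:-/tmp/nagios-ramdisk}   # e.g. /var/nagios/ramdisk in production
mkdir -p "$RAMDISK"
# Production mount (needs root; size and ownership are examples):
#   mount -t tmpfs -o size=256m,uid=nagios,gid=nagios tmpfs "$RAMDISK"
mkdir -p "$RAMDISK/spool/checkresults" "$RAMDISK/rw"
# nagios.cfg would then point at the RAM disk, for instance:
#   check_result_path=/var/nagios/ramdisk/spool/checkresults
#   status_file=/var/nagios/ramdisk/status.dat
#   state_retention_file=/var/nagios/ramdisk/retention.dat
#   command_file=/var/nagios/ramdisk/rw/nagios.cmd
#   temp_path=/var/nagios/ramdisk
```

Remember that the contents vanish on every reboot, hence the boot-time populate and shutdown-time backup mentioned above.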

Limit the NDO broker exported data

When using NDOUtils, try to limit the information exported by ndomod.o to the minimum. Usually the info exported by ndomod.o is sent to a database server (via the ndo2db daemon), so limiting the exported information will reduce both the network traffic and the database load.

To do so, set the right value in the data_processing_options configuration option of ndomod.cfg, a bitmask value where the meaning of each bit is defined in the German Nagios Wiki and that can be conveniently computed using the free Consol Labs calculator. What kind of information can be omitted depends on the system, but usually (if not always) you can omit the timed event data. For Centreon based systems a safe value is 67108661.
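The resulting setting is a single line in ndomod.cfg (the value below is the Centreon-safe bitmask mentioned above; verify it against your own requirements before deploying):

```
# ndomod.cfg excerpt: bitmask of event types the broker module exports.
data_processing_options=67108661
```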

Use tables based on MEMORY engine

For those systems using a MySQL database backend (i.e., using ndo2db), set the nagios_hoststatus and nagios_servicestatus tables of the nagios database to use the MEMORY engine. Since every check result is stored in these tables, keeping them in memory will reduce disk access and thus enhance database performance.

To do so you will have to drop and re-create both tables, setting their engine to "MEMORY". Stop ndo2db, log in as root on the database and follow these steps:
  1. Get the table definition by running the command show create table tablename;, where tablename is nagios_hoststatus or nagios_servicestatus.
  2. Drop the table by running the command drop table tablename;, where tablename is nagios_hoststatus or nagios_servicestatus.
  3. Create the table again by pasting the table definition retrieved in step 1, but this time set "ENGINE=MEMORY" instead of "ENGINE=MyISAM" and change the type of the long_output and perfdata fields from text to varchar(4096), since the text field type is not supported by the MEMORY engine.
Once the ndo2db daemon is started again, restart your Nagios processes in order to fully populate the new, empty tables. Every time the database server is (re)started while Nagios processes are running you will need to repeat this last step, since MEMORY tables lose their contents on restart; luckily the database server usually starts before the Nagios processes and stops after them, so this extra step won't be necessary except in extraordinary situations.
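Condensed, the re-creation of one of the tables looks like this. This is only a sketch: the real column list must come from your own show create table output in step 1:

```sql
-- Hypothetical, abbreviated example for nagios_servicestatus.
DROP TABLE nagios_servicestatus;
CREATE TABLE nagios_servicestatus (
  servicestatus_id INT NOT NULL AUTO_INCREMENT,
  -- ... all the remaining columns from the original definition ...
  long_output VARCHAR(4096) NOT NULL DEFAULT '',   -- was: text
  perfdata    VARCHAR(4096) NOT NULL DEFAULT '',   -- was: text
  PRIMARY KEY (servicestatus_id)
) ENGINE=MEMORY;                                   -- was: ENGINE=MyISAM
```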

Apply large installation tweaks separately

Nagios supports a configuration property called use_large_installations_tweaks that applies a set of optimizations in order to improve the core performance. These optimizations are:

  • Inhibiting the use of summary macros in environment variables. This behavior can also be obtained by setting the value of the enable_environment_macros Nagios configuration option.
  • Using an alternative check memory cleanup procedure. This behavior can also be managed by setting the value of the free_child_process_memory Nagios configuration option.
  • Modifying the way Nagios forks when running a check. This behavior can also be managed by setting the value of the child_processes_fork_twice Nagios configuration option.
As you can see, it is possible to enable or disable each of these optimizations separately, which gives you better control over what's going on and, if something fails, lets you undo it without having to disable the rest of the enabled tweaks.
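A nagios.cfg excerpt applying the three tweaks individually; to my understanding these values reproduce what use_large_installation_tweaks=1 does, but double-check against the documentation of your exact Nagios version:

```
enable_environment_macros=0    # don't export summary macros to the environment
free_child_process_memory=0    # skip memory cleanup in check subprocesses
child_processes_fork_twice=0   # fork only once when running a check
```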

Migrate to a distributed architecture

A distributed architecture, i.e., a design where more than one Nagios instance (and thus more than one server) shares the monitoring tasks, should be considered under basically two circumstances:
  • When the monitoring system reaches a certain size and the server becomes unstable under load peaks like those produced by massive check scheduling (for instance, during a host unreachability condition, when many checks can be scheduled at the same time). In this case try to distribute the checks across more than one server based on geographical, security, etc., rules.
  • When some checks are based on resource-hungry scripts whose execution, again, can lead to a lack of stability in the system. A good example of this scenario could be Selenium based checks (at least until recent Selenium server versions). In this case isolate the heavy checks on a dedicated Nagios poller.

Use dedicated servers

When your system reaches a really big size, use dedicated servers for the web frontend and database management if you're running Nagios XI, Centreon or Icinga. Initially these products deploy database, web and monitoring tasks on the same server, but when the system grows and many or all of the previous tips have been applied, the rational next step is deploying the database and web tasks on dedicated servers.



Saturday, 9 March 2013

Nagios performance. Concepts


This article, the first of a series where Nagios Core performance is put under the microscope, analyzes what performance means from the engine's point of view and how it can be monitored. The second article, "Nagios performance. Best practices", defines a list of steps to bring these concepts into practice.

When does Nagios perform well

To define the previous concept it's necessary to go a step back and consider what Nagios Core is: in essence it's just a scheduler, i.e., a process that runs tasks in a predefined, cyclic way. So, in broad terms, we can consider that Nagios Core performs well when these tasks are run at the time they were scheduled.

If this scheduling gets delayed, what is called latency appears: the difference, in seconds, between the time a task should have been executed and the time when it in fact was. For instance, if a service check is scheduled to be executed at 9:00:00.000 AM but is executed at 9:00:00.500 AM, we get a latency of 0.5 seconds on the check.

So, summarizing, we can consider (and the design of the core supports it) that Nagios Core performs well when the latency of host and service active checks is at its lowest, which raises another question: how low is low enough? Well, it's up to the system administrator to decide what latency level is acceptable in their system. As a personal rule of thumb, the maximum latency on a system should not reach the interval_length configuration option value (60 seconds by default), in order to be sure that every scheduled check is run within its scheduled time window.

Why not CPU

There are several reasons for discarding CPU as the main Nagios Core performance indicator. The first is generic: a system performs well if it does well what it's designed to do, regardless of its CPU load. The second is more specific: Nagios lacks load balancing capabilities at the core level, so even using the initial scheduling options (inter-check delay, service interleaving), the server tends to show regular load peaks. The third is more practical: a lack of CPU resources shows up as an increase in latency, so controlling just this last metric will give you the best monitoring system health indicator.

Latency vs. execution time

You must not confuse check latency with check execution time. Execution time is the amount of time it takes Nagios to execute a check. It can hardly be considered a performance indicator, since the check execution time closely depends on different factors such as plugin efficiency, load on the device being checked and load on the network infrastructure that interconnects your server with the checked device.

Monitoring latency

So it seems that checking latency is more important than checking CPU itself, and the question is how to get it in a programmatic way. nagiostats is a command line binary that parses the Nagios status.dat file and outputs some interesting core runtime performance statistics, latency among them. Luckily there's no need to program a plugin to parse the nagiostats output: there are a bunch of plugins doing it on the Nagios Exchange and Monitoring Exchange sites, most of them generating average latency performance data.
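For a quick manual look, nagiostats can also be queried directly; in MRTG mode it prints just the requested variables. The config path below assumes a standard source install:

```shell
# Full human-readable report:
#   nagiostats -c /usr/local/nagios/etc/nagios.cfg
# Just the min/avg/max active service check latency:
#   nagiostats -c /usr/local/nagios/etc/nagios.cfg \
#              --mrtg --data=MINACTSVCLAT,AVGACTSVCLAT,MAXACTSVCLAT
```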

Among them I recommend the fantastic check_nagiostats, since it generates, among many others, min, average and max latency performance data metrics and, moreover, it's programmed in Perl, so you can get the efficiency benefits of running it with the core's embedded Perl interpreter. check_nagiostats is supported by Icinga too (renamed check_icingastats for copyright reasons), which shares usage and configuration documentation in its project wiki.





 