...A place for sharing IT monitoring knowledge

Showing posts with label Nagios. Show all posts

Monday, 17 June 2013

Nagios performance. Best practices


This article presents a list of tips for improving the performance of a Nagios 3.x Core based monitoring system. It can be considered the practical side of the article Nagios Performance. Concepts, where I discussed what performance means from the point of view of Nagios Core.

Since, as of today, the differences between Nagios Core and its forks Icinga Core and Centreon Core can be considered minor, many (if not all) of the performance optimization tips presented here can be applied to any of these three monitoring engines.

Finally, far from just copying it, I've tried to enrich the information contained in the Nagios documentation chapter "Tuning Nagios For Maximum Performance", a source that should be taken as the starting point when optimizing the performance of a Nagios Core based system.

Improving Nagios performance

There is no magic recipe for maximizing Nagios Core performance. Instead, performance optimization is achieved by systematically applying a set of basic rules; in other words, by following a list of best practices like the ones below.

Use passive checks

If possible, favor passive checks over active checks. A passive check is one whose result is not retrieved by the check itself (that is, by a process run by the core scheduler) but served by another entity in a non-scheduled way.

The best example of a passive check is one whose result depends on the reception of a given SNMP trap: instead of periodically running a command on the monitoring system to get one or more SNMP values, your system waits for an SNMP trap in order to set the status of a service. For instance, think of a Nagios Core service meant to monitor the temperature of a server power supply: you can periodically ask the server (acting as an SNMP agent) for the power supply temperature value, or you can do nothing but wait for a trap from the server when something changes.

Because of its breadth, passive checking could be the matter of an article on its own but, as a rule of thumb, rely on passive checks when you can passively get the information needed to set both OK and non-OK (Warning or Critical) service states. Following the previous example, rely on passive checks if the server sends at least two kinds of traps, one signalling that the power supply temperature is normal (which should set the service to an OK state) and one that it is abnormal (which should set the service to a non-OK state). On the contrary, if your server (perhaps periodically) sends a trap while an abnormal state is present and simply stops doing so when the problem is gone, passive checks are hard to apply, because the trigger of the OK state is not the reception of a trap but the lack of traps during a given period of time.
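To illustrate how a trap handler would feed Nagios, here is a minimal sketch (host name, service name and command file path are assumptions, adapt them to your setup): the handler appends a PROCESS_SERVICE_CHECK_RESULT line to the external command file, setting the service state passively.

```shell
# Sketch of a trap handler action: submit a passive check result.
# The command file path is an assumption; on many installs it is
# /usr/local/nagios/var/rw/nagios.cmd (the command_file option).
CMDFILE="${CMDFILE:-/tmp/nagios.cmd}"
NOW=$(date +%s)
# Return code 2 = CRITICAL; a "temperature normal" trap would submit 0 (OK)
printf '[%s] PROCESS_SERVICE_CHECK_RESULT;server01;PSU_Temperature;2;Power supply temperature abnormal\n' \
    "$NOW" >> "$CMDFILE"
```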

Define smart check periods

All of us tend to think that more is better, i.e., that the faster we check a service the more accurate the picture of its status. That's true, but it has a cost: more checks imply more load, and due to their nature not all services need to be checked as often as possible. The best example could be a disk resource size check, a metric that usually grows slowly over time (we're talking about months or even years before a drive gets full). Is it necessary to check the disk resource every few minutes, or even every hour? In my opinion no: once a day is enough.

Of course you might object that this approach would not detect a runaway process filling a disk resource in minutes but, again, consider how often that happens and what your response time is: maybe by the time you are ready to act, the disk is already full.

For these reasons I recommend carefully reviewing and setting the check_interval property of your active host/service checks. Not all checks need to be scheduled every minute.
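For instance, a once-a-day disk check could be declared like this (host and service names are hypothetical; with the default interval_length of 60 seconds, a check_interval of 1440 means one check every 24 hours):

```
define service {
    host_name           fileserver01
    service_description Disk_Usage
    check_interval      1440    ; 1440 x 60s = 24h
    ...
}
```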

Optimize active check plugins

Every time the monitoring core needs to set the status of a host or service based on an active check, it creates a subprocess that executes the plugin bound to the host/service via the command object. The more efficient the plugin, the less load on the system. But how do you select a plugin based on its efficiency? Basically in two ways:
  • Use binary plugins or, failing that, Perl based plugins with the enable_embedded_perl configuration option set to 1: this allows Nagios to save resources by using the embedded Perl interpreter. If neither option is available, try to avoid shell script plugins since, as a rule of thumb, they are the least efficient.
  • If you develop your own plugins, try to pass as much of the information the plugin needs as command line arguments instead of retrieving it every time the plugin is executed. A common example might be an SNMP based plugin checking the status of a switch port via the IF-MIB: you can write an elegant plugin that takes the port argument as a description, but it will require an extra SNMP get operation to determine the port index associated with that description before getting the port status. That extra operation will be repeated every time the plugin is executed, i.e., every time the service is checked.

Use RAM disks

Nagios Core continuously consumes disk I/O in different ways: refreshing the status and retention files, creating temporary files every time an active check is performed, storing performance data, reading the external command file... The more optimized the disk access, the fewer resources Nagios will take from the system.

A good option to optimize it (instead of spending money on faster drives) is configuring a tmpfs disk and placing on it the spool directory (check_result_path), the status (status_file) and retention (state_retention_file) files, the performance data files (host_perfdata_file, service_perfdata_file), the external command file (command_file) and the temp path (temp_path). Care must be taken to mount and populate it every time the server boots up, just before launching Nagios and, conversely, to back it up and unmount it after stopping Nagios every time the server shuts down.
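A minimal sketch of the idea (mount point, size and file names are assumptions, adjust them to your layout):

```
# /etc/fstab: a tmpfs for Nagios' volatile data
tmpfs  /var/nagios/ramdisk  tmpfs  size=256m,uid=nagios,gid=nagios  0 0

# nagios.cfg: point the volatile paths at the RAM disk
check_result_path=/var/nagios/ramdisk/checkresults
status_file=/var/nagios/ramdisk/status.dat
state_retention_file=/var/nagios/ramdisk/retention.dat
temp_path=/var/nagios/ramdisk/tmp
command_file=/var/nagios/ramdisk/rw/nagios.cmd
```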

Limit the NDO broker exported data

When using NDOUtils, try to limit the information exported by ndomod.o to the minimum. Usually the info exported by ndomod.o is sent to a database server (via the ndo2db daemon), so limiting the exported information will reduce both the network traffic and the database load.

To do so, set the right value in the data_processing_options configuration option of ndomod.cfg, a bitmask whose bit meanings are defined in the German Nagios Wiki and that can be conveniently computed using the free Consol Labs calculator. What kind of information can be omitted depends on the system, but usually (if not always) you can omit the timed event data. For Centreon based systems a safe value is 67108661.
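In ndomod.cfg the setting is a single line (the value below is the Centreon recommendation from this article; verify the bitmask with the calculator for your own needs):

```
# ndomod.cfg: export everything except timed event data
data_processing_options=67108661
```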

Use tables based on MEMORY engine

For systems using a MySQL database backend (i.e., using ndo2db), set the nagios_hoststatus and nagios_servicestatus tables of the nagios database to use the MEMORY engine. Since every check result is stored in these tables, keeping them in memory will reduce disk access and thus enhance database performance.

In order to do it you will have to drop and re-create both tables, setting their engine to "MEMORY". Stop ndo2db, log in as root on the database and follow these steps:
  1. Get the table definition by running show create table tablename;, where tablename is nagios_hoststatus or nagios_servicestatus.
  2. Drop the table by running drop table tablename;.
  3. Create the table again by pasting the definition retrieved in step 1, but this time set "ENGINE=MEMORY" instead of "ENGINE=MyISAM" and change the type of the long_output and perfdata fields from text to varchar(4096), since the text type is not supported by the MEMORY engine.
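The steps above can be sketched in SQL like this (the column list is abbreviated on purpose; use the exact definition returned by step 1):

```sql
-- 1. capture the current definition
SHOW CREATE TABLE nagios_hoststatus;

-- 2. drop the original table
DROP TABLE nagios_hoststatus;

-- 3. re-create it from the captured definition, changing the engine
--    and the TEXT columns (unsupported by MEMORY)
CREATE TABLE nagios_hoststatus (
  ...                            -- columns as returned by step 1, with:
  long_output VARCHAR(4096),     -- was TEXT
  perfdata    VARCHAR(4096)      -- was TEXT
) ENGINE=MEMORY;
```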
Once the ndo2db daemon is started again, restart your Nagios processes in order to fully populate the new, empty tables. Every time the database server is (re)started while Nagios processes are running you will need to repeat this step but, luckily, the database server usually starts before the Nagios processes and stops after them, so this won't be necessary except in extraordinary situations.

Apply large installation tweaks separately

Nagios supports a configuration property called use_large_installation_tweaks that applies a set of optimizations intended to improve core performance. These optimizations are:

  • Inhibiting the use of summary macros in environment variables. This behaviour can also be controlled by setting the value of the enable_environment_macros configuration option.
  • Using an alternative check memory cleanup procedure. This behaviour can also be controlled by setting the value of the free_child_process_memory configuration option.
  • Modifying the way Nagios forks when running a check. This behaviour can also be controlled by setting the value of the child_processes_fork_twice configuration option.
As you can see, each of these optimizations can be enabled or disabled separately, which gives you better control over what's going on and, if something fails, allows you to undo it without having to disable the rest of the enabled tweaks.
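In nagios.cfg that means setting the three options individually instead of the all-or-nothing flag (the values below reproduce, to the best of my knowledge, what use_large_installation_tweaks does; double-check them against your version's documentation):

```
# nagios.cfg: apply the large-installation optimizations one by one
use_large_installation_tweaks=0
enable_environment_macros=0      ; skip summary macros in the environment
free_child_process_memory=0      ; skip the memory cleanup before exec
child_processes_fork_twice=0     ; fork only once per check
```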

Migrate to a distributed architecture

A distributed architecture, i.e., a design where more than one Nagios instance (and thus more than one server) supports the monitoring tasks, should be considered in basically two circumstances:
  • When the monitoring system reaches a certain size and the server(s) become unstable under load peaks like those caused by massive check scheduling (for instance during a host unreachability condition, when many checks can be scheduled at the same time). In this case try to distribute the checks over more than one server based on geographical, security, etc. rules.
  • When some checks are based on resource-hungry scripts and their execution, again, can lead to a lack of stability in the system. A good example of this scenario could be Selenium based checks (at least until recent Selenium server versions). In this case isolate the heavy checks on a dedicated Nagios poller.

Use dedicated servers

When your system reaches a really big size, use dedicated servers for the web frontend and the database if you're running Nagios XI, Centreon or Icinga. Initially these products deploy the database, web and monitoring tasks on the same server but, when the system grows and many or all of the previous tips have been applied, the rational next step is moving the database and web tasks to dedicated servers.


Saturday, 2 February 2013

Restarting Windows services from Linux


The post Monitoring Windows services covered how to remotely check a service status (running, stopped) using WMI, a powerful framework for fetching info from Windows based systems.

All of us managing monitoring systems know how important it is to provide the system with proactive capabilities, so that simple problems are fixed as a first step once an incident is detected. Maybe the best example is a stopped Windows service: it might be desirable that the monitoring system tried to restart it and, after a given number of unsuccessful tries, sent a notification to the administrators so the problem can be handled in a more human way.

Sadly, WMI is not so useful when trying to interact with the remote system. Leaving aside its SQL-like query syntax, it is possible to call a local script when a given condition is true (for instance when a service is stopped), but the Linux WMI client (wmic) only supports SQL-like queries. Moreover, even if such queries supported running commands under certain circumstances, a script would have to exist on the Windows server side in order to be run (and its mere existence might be a problem when dealing with strict remote server administrators).

Let's dance

Samba is the Linux implementation of the Windows SMB protocol and supports, among other things, Remote Procedure Call transport (RPC over SMB)... and obviously RPC allows us to remotely call Windows procedures, which seems a good fit for our purpose.

samba-client is a package available for different platforms (it is called smbclient on Debian-like platforms) that groups several utilities for interacting with remote SMB compatible systems (such as Windows servers) from Linux hosts. One of these utilities is net, which is meant to work just like the net utility available for Windows and DOS.

On a Windows system, we can start a stopped service by calling net this way:

net start my_windows_service

Using the samba net utility, we can do the same action from a remote Linux system in this way:

net rpc service start my_windows_service \
-I 192.168.0.64 \
-U myDomain/jdoe%jdoe_password

The only difference is that, while on Windows you can use either the long (quoted) or the short service name, on Linux you can only use the short service name.

The previous command starts a service called my_windows_service on a remote Windows server with address 192.168.0.64 using the privileges of the user jdoe (authenticated with the password jdoe_password) belonging to the Active Directory domain myDomain. It is also possible to use a local user by omitting the domain name (and the slash):

net rpc service start my_windows_service \
-I 192.168.0.64 \
-U jdoe%jdoe_password

Finally, net makes it possible to check whether a given service is running, something useful for validating that a restart operation succeeded:

net rpc service status my_windows_service \
-I 192.168.0.64 \
-U myDomain/jdoe%jdoe_password

In practice

Let's assume we are managing a Nagios Core based system that monitors the status of some services running on remote Windows servers; how to do that was covered in the post Monitoring Windows services.

Now we want to give our monitoring system proactive capabilities in this way: once a monitored Windows service is detected as stopped, our monitoring system must try to restart it a given number of times and, if that fails, stop trying and notify the incident to the defined contacts (or contact group members).

That can be achieved by defining an event handler bound to the service check. Since an event handler executes a command every time a service or host is in a soft state, and the first time it goes into a hard (OK or non-OK) state, we will create a command that restarts the Windows service when the Nagios service check is in a non-OK, soft state. Since we can define how many checks are run before going into a hard state via the service property max_check_attempts, we can control how many restart tries are performed before entering a hard state and triggering a notification. Let's see it step by step:

1.- Create a script that restarts a Windows service when a Nagios service is in a soft, non-OK state. Name it 'restart_win-service' and save it in the Nagios libexec directory (with the right permissions to be executed by Nagios):

#!/bin/sh
#
# restart_win-service
# Restarts a remote Windows service if the Nagios service is
# in a non-OK, soft state
# Arguments: service_status service_status_type user_id
#            server_address service_name
#

if [ "$1" != 'OK' ] && [ "$2" = 'SOFT' ]; then
    # We are in a soft, non-OK status:
    # restart the service
    net rpc service start "$5" -I "$4" -U "$3" > /dev/null 2>&1
fi


2.- Define a Nagios command representing the previous script:


define command{
command_name restart_win_service
command_line $USER1$/restart_win-service $SERVICESTATE$ $SERVICESTATETYPE$ $ARG1$ $HOSTADDRESS$ $ARG2$
}


3.- Configure the service to use the restart_win_service command as its event handler and to run it three times before notifying the problem:

define service {
...
enable_event_handler 1
event_handler restart_win_service!myDomain/jdoe%jdoe_password!my_windows_service
max_check_attempts 3
...
}
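As a quick sanity check, the state gate used by the handler can be exercised in isolation; this small sketch simulates the arguments Nagios would pass on a failed soft-state check:

```shell
# Simulate the first two arguments Nagios passes to the event handler
state=CRITICAL
statetype=SOFT
# Same condition as in restart_win-service: act only on soft, non-OK states
if [ "$state" != 'OK' ] && [ "$statetype" = 'SOFT' ]; then
    echo "would restart the service"
fi
```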




Thursday, 29 December 2011

Monitoring VMWare: Installing vSphere SDK for Perl


Considering the huge growth that virtualization platforms have experienced in recent years, especially (but not only) VMware based ones, monitoring these platforms and the hosts they support becomes a must for every monitoring administrator.

You can find a bunch of good plugins for monitoring VMware platforms from Nagios compatible tools (Nagios itself, Icinga, Babel or Zenoss, among others), perhaps the most widely adopted being the versatile check_esx from OP5. Anyway, the key point is that most of them are based on VMware's vSphere SDK for Perl, a set of libraries acting as an API, an interface, between the plugins and vCenter/ESX(i) servers.

The vSphere SDK is anything but light in terms of disk space requirements... OK, maybe that was a bit extreme: it requires more than 80 MB of space on your multi-gigabyte drive, most of it used by:
  • vCLI: a command-line client for managing vSphere based platforms.
  • WSMAN libraries for getting server hardware information through vSphere.
Well, the truth is that I want neither to manage vSphere platforms nor to get hardware information through vSphere (I prefer getting it directly from the servers), so these components are of no interest to someone who only wants to monitor vCenter, ESX(i) and virtualized servers. The only thing I need is the Perl modules, a set of files located in the lib/VMware/share/VMware directory of the installation package... and requiring 3 MB of disk space:
  • VICommon.pm
  • VICredStore.pm
  • VIExt.pm
  • VILib.pm
  • VIM2Runtime.pm
  • VIM2Stub.pm
  • VIM25Runtime.pm
  • VIM25Stub.pm
  • VIMRuntime.pm
  • VIRuntime.pm
So, to start playing, the only thing you must do is copy the full VMware libraries directory to one of the locations where your distro stores Perl libraries (one of them is /usr/local/share/perl/ on Debian) and check that all the dependent libraries are properly installed, specifically:
  • Crypt-SSLeay-0.55 (0.55-0.9.7 or 0.55-0.9.8)
  • IO-Compress-Base-2.005
  • Compress-Zlib-2.005
  • IO-Compress-Zlib-2.005
  • Compress-Raw-Zlib-2.017
  • Archive-Zip-1.26
  • Data-Dumper-2.121
  • XML-LibXML-1.63
  • libwww-perl-5.805
  • LWP-Protocol-https-6.02
  • XML-LibXML-Common-0.13
  • XML-NamespaceSupport-1.09
  • XML-SAX-0.16
  • Data-Dump-1.15
  • URI-1.37
  • UUID-0.02
  • SOAP-Lite-0.710.08
  • HTML-Parser-3.60
  • version-0.78

I imagine there must be a proper procedure for checking whether these libraries are present on the system, but I recommend the "trial and error" approach as the fastest: just download one of the plugins that use the API (such as check_esx), run it from the command line and install each unsatisfied dependency the script reports as a runtime error.

When dealing with Debian platforms, you can run into problems when trying to install XML::LibXML from CPAN (perl -MCPAN -e "install XML::LibXML"); specifically, you can get a Makefile.PL error message about the impossibility of finding some libraries:
...
looking for -lxml2... no
looking for -llibxml2... no
libxml2 not found
...
The best way to solve it is installing the libxml-libxml-perl Debian package (apt-get install libxml-libxml-perl) instead of using CPAN.



Friday, 2 December 2011

Managing Nagios logs

Nagios natively supports log rotation, a functionality managed through the log_rotation_method main configuration option. This is the option description taken from the official Nagios documentation:

Format: log_rotation_method=[n/h/d/w/m]
Example: log_rotation_method=d


This is the rotation method that you would like Nagios to use for your log file. Values are as follows:

n = None (don't rotate the log - this is the default)
h = Hourly (rotate the log at the top of each hour)
d = Daily (rotate the log at midnight each day)
w = Weekly (rotate the log at midnight on Saturday)
m = Monthly (rotate the log at midnight on the last day of the month)

People often get confused by Nagios' log management capabilities and believe that, besides rotating, Nagios will erase older files too... or that some angel in our system will do it for us. Sadly this is not the case in either Nagios or Centreon systems, and old logs remain on our disk for months or even years.

This simple script can be very helpful to address the previous fact. It manages log files in two combinable ways: compressing and/or deleting files older than x days. It takes three arguments:

  • Directory where nagios logs are stored
  • Age, in days, for files that will be compressed
  • Age, in days, for files that will be deleted
For instance, given that it is named manage_naglogs, this example would delete files older than 30 days and compress files older than 7 days:

manage_naglogs /var/log/nagios 7 30

And here comes the script:


#!/bin/bash

# manage_naglogs <log_directory> <compress_age_days> <delete_age_days>

if [ "$3" -gt 0 ]
then
        find "$1" -name 'nagios-*.gz'  -mtime +"$3" -exec rm {} \;
        find "$1" -name 'nagios-*.log' -mtime +"$3" -exec rm {} \;
fi
if [ "$2" -gt 0 ]
then
        find "$1" -name 'nagios-*.log' -mtime +"$2" -exec gzip {} \;
fi

To run it periodically, I recommend adding the needed commands to cron. On systems like Debian, where /etc/cron.daily stores scripts run every day, and assuming you saved the previous script in /usr/local/nagios/bin, create a script like this, save it in /etc/cron.daily and set the proper file permissions so it can be run by the cron daemon (chmod 755 will do the job):

#!/bin/bash
/usr/local/nagios/bin/manage_naglogs /var/nagios/logs 7 30

On systems where only crontab is available, the following entry in /etc/crontab will do the job, running our script once a day at 3:00 am:

00 3 * * * root /usr/local/nagios/bin/manage_naglogs /var/nagios/log 7 30

Finally, one piece of advice for Centreon users: keep at least one rotated Nagios log file untouched (i.e., neither compressed nor deleted). Bear in mind that Centreon runs a script every day (usually at 1:00 am) that parses the Nagios log files in order to build the availability reports. To achieve this, use values higher than 1 for the second and third script arguments.



Saturday, 21 May 2011

Monitoring multi-address or multi-identifier devices


When managing monitoring systems it's common to find situations in which one device has more than one identifier, more than one network address, or a combination of specific IP addresses and identifiers. Some cases may be:
  • Servers with different management and production network interfaces. These include, for example, HP ProLiant servers on which the iLO has a dedicated network interface and therefore a network address different from the production network address.
  • Virtual hosts with an IP address and a virtualization-level identifier. A common example is hosts virtualized on VMware ESX, where the identifier at the virtualization system level is completely disconnected from the IP address assigned to the device.
When the monitoring system is based on Nagios, where there is only one property identifying the host address (the address property of the host object), the above situation becomes a problem.
The usual workaround is keeping the second value in the alias property and changing the check command definitions, replacing the $HOSTADDRESS$ macro by the $HOSTALIAS$ macro. However, this approach creates more problems than it solves:
  • The alias, when correctly used, is very useful in reports, identifying and providing valuable information about the host.
  • Some third-party tools, usually topology tools, use this field as the display name.

User Macros
    In addition to the standard macros, Nagios supports so-called custom variable macros: identifier-value pairs defined in host, service or contact objects. Macros of this type are distinguished from standard ones by being necessarily prefixed with a "_" symbol.

    define host {     

        host_name ProliantServer
        address 192.168.1.1 
        _ILOADDRESS 192.168.2.1
        ...
    }
     
    In the above example a macro called $_ILOADDRESS$ is defined, its value being 192.168.2.1, the IP address of the iLO management interface of the server ProliantServer. For all purposes this macro can be considered a standard Nagios macro: it can be invoked from the execution of host or service checks and therefore can be used in a command definition:
     

    define command {
        command_name CheckILOFans
        command_line $USER1$/check_snmp -H $_HOSTILOADDRESS$ ...
        ...
    }
     
    define command {
        command_name CheckHTTPPort 
        command_line $USER1$/check_tcp -H $HOSTADDRESS$ ...
        ...
    }
     

    The above example first defines a command called CheckILOFans, meant to check the status of the fans of a server through its iLO management interface. It then defines CheckHTTPPort, a command intended to establish the availability of the HTTP port on the production interface.


    In the first case the host address used is not $HOSTADDRESS$; instead we use the address stored in our recently created macro, whose name must be prefixed with _HOST because it has been defined as part of a host object, so the macro must be referenced as $_HOSTILOADDRESS$. In the same way, if we define a custom macro in a service object definition it should be referenced by prefixing its identifier with _SERVICE and, finally, if we define a custom macro in a contact object definition it should be prefixed with _CONTACT.
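    The same mechanism works at the service level; for instance, a custom macro defined in a service object (the macro name below is hypothetical) would be referenced with the _SERVICE prefix:

    define service {
        host_name ProliantServer
        service_description FanStatus
        _WARNTEMP 40 ; hypothetical custom service macro
        ...
    }

    A command definition could then reference it as $_SERVICEWARNTEMP$.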


    By following this approach we can now use both commands to define checks on the same host, even though they are based on information available through different network interfaces:




    define service {
        host_name ProliantServer
        service_description FanStatus 
        check_command CheckILOFans
        ...
    }

    define service {
        host_name ProliantServer
        service_description HTTPStatus 
        check_command CheckHTTPPort
        ...
    }

     
    Macros in Centreon
     

    For those who prefer configuring Nagios with the Merethis tool, Centreon has supported the management of custom variable macros since version 1.x: you can create, modify and delete them on the "Macros" tab of the host and service object configuration. Unfortunately, with version 2.2 recently released, the management of macros in contact objects is still not supported.



     