A place for sharing IT monitoring knowledge


Monday, 17 June 2013

Nagios performance. Best practices


This article presents a list of tips for improving the performance of a Nagios 3.x core based monitoring system. It can be considered the practical side of the article Nagios Performance. Concepts, where I explained what should be considered performance from the point of view of Nagios Core. 

Since, as of today, the differences between Nagios Core and its forks Icinga Core and Centreon Core can be considered minor, many -if not all- of the performance optimization tips presented here can be applied to any of these three monitoring engines.

Finally, far from just copying, I've tried to enrich the information contained in the Nagios documentation chapter "Tuning Nagios For Maximum Performance", a source that should be taken as the starting point when trying to optimize the performance of a Nagios core based system.

Improving Nagios performance

There is no magic recipe for maximizing Nagios Core performance. Instead, optimization is achieved by systematically applying a set of basic rules or, in other words, by following a list of best practices like the ones below.

Use passive checks

If possible, favor passive checks over active checks. A passive check is one whose result is not retrieved by the check itself (that is, by a process run by the core scheduler) but submitted by another entity in a non-scheduled way. 

The best example of a passive check is one whose result depends on the reception of a given SNMP trap: instead of periodically running a command on the monitoring system in order to get one or more SNMP values, your system waits for the reception of an SNMP trap in order to set the status of a service. Specifically, think of a Nagios core service intended to monitor the temperature of a server power supply: you can periodically ask the server (acting as an SNMP agent) for the power supply temperature value, or you can do nothing but wait to receive a trap from the server when something changes.

Passive checks are extensive enough to deserve an article of their own but, as a rule of thumb, rely on them when you can passively get the information needed to set both OK and non-OK (Warning or Critical) service states. Following the previous example, rely on passive checks if the server manages at least two traps, allowing you to determine whether the power supply temperature is normal (which should set the service to an OK state) or abnormal (which should set it to a non-OK state). Conversely, if your server sends a trap (perhaps periodically) while an abnormal state is present and stops doing so when the problem is gone, passive checks are hard to apply, because the trigger of the OK state is not the reception of a trap but the absence of traps during a given period of time.
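
As a minimal sketch of how a passive result reaches Nagios, a trap handler (for instance one launched by snmptrapd or SNMPTT) can write a PROCESS_SERVICE_CHECK_RESULT external command straight to the command file. The host name, service description and command file path below are placeholders:

#!/bin/bash
# Hypothetical trap handler fragment: submit a passive check result.
# Adjust CMDFILE to your command_file setting.
CMDFILE=/usr/local/nagios/var/rw/nagios.cmd
NOW=$(date +%s)
# Format: PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<return code>;<plugin output>
echo "[$NOW] PROCESS_SERVICE_CHECK_RESULT;server01;PSU-Temperature;2;Power supply temperature is abnormal" >> "$CMDFILE"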

Define smart check periods

All of us tend to think that more is better, i.e., the more often we check a service, the more accurate the picture of its status. That's true, but of course it has a cost: more checks imply more load, and due to their nature not all services need to be checked as often as possible. The best example could be a disk resource size check, a metric that usually grows slowly over time (we're talking about months or even years before a drive fills up). Is it necessary to check the disk resource every few minutes, or even every hour? In my opinion no, once a day is enough.

Of course you might think that this approach would not detect a runaway process filling a disk resource in minutes but, again, consider how often that happens and what your response time is: by the time you are ready to act, the disk may already be full.

For this reason I recommend carefully reviewing and setting the check_interval property of your active host/service checks. Not all checks need to be scheduled every minute.
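
As an illustration, a once-a-day disk check could be declared like this (host, command and values are placeholders; with the default interval_length of 60 seconds, check_interval is expressed in minutes):

define service {
    host_name             fileserver01
    service_description   Disk-Data
    check_command         check_nrpe!check_disk_data
    check_interval        1440    ; once a day
    retry_interval        60      ; recheck hourly after a non-OK result
    max_check_attempts    3
    ...
}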

Optimize active check plugins

Every time the monitoring core needs to set the status of a host or service based on an active check, it creates a subprocess that executes the plugin bound to the host/service via the command object. The more efficient the plugin, the less load on the system. But how to select a plugin based on its efficiency? Basically in two ways:
  • Use binary plugins or, failing that, Perl based plugins with the enable_embedded_perl configuration option set to 1: this allows Nagios to save resources by using the embedded Perl interpreter. If neither option is available, try not to use shell script plugins since, as a rule of thumb, they are the least efficient.
  • If developing your own plugins, try to pass as much as possible of the information the plugin needs as command line arguments instead of retrieving it every time the plugin is executed. A common example might be an SNMP based plugin checking the status of a switch port via the IF-MIB: you can write an elegant plugin that identifies the port by its description, but it will require an extra SNMP operation to determine the port index associated with that description before getting the port status. That extra operation will be repeated every time the plugin is executed, i.e., every time the service is checked (see the sketch below).
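
For instance, using the Net-SNMP command line tools (IP address, community and interface description are placeholders), resolving the port by its description costs an extra walk on every execution, while passing the already known ifIndex does not:

# Looking the port up by its description: one extra SNMP walk per execution
IFINDEX=$(snmpwalk -v2c -c public 192.168.1.10 IF-MIB::ifDescr \
          | awk -F'[=.]' '/GigabitEthernet0\/1/ {gsub(/ /,"",$2); print $2; exit}')
snmpget -v2c -c public 192.168.1.10 IF-MIB::ifOperStatus."$IFINDEX"

# Passing the ifIndex (here 3) directly as a plugin argument avoids the walk
snmpget -v2c -c public 192.168.1.10 IF-MIB::ifOperStatus.3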

Use RAM disks

Nagios Core continuously consumes disk I/O in different ways: refreshing the status and retention files, creating temporary files every time an active check is performed, storing performance data, reading the external command file... The more optimized the disk access, the fewer resources Nagios will take from the system.

A good way to optimize it -instead of spending money on faster drives- is configuring a tmpfs RAM disk and placing on it the spool directory (check_result_path), the status (status_file) and retention (state_retention_file) files, the performance data files (host_perfdata_file, service_perfdata_file), the external command file (command_file) and the temp path (temp_path). Care must be taken to mount and populate it every time the server boots up, just before launching Nagios, and conversely to back it up and unmount it after stopping Nagios every time the server shuts down.
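
As a minimal sketch, assuming a mount point of /var/nagios/ramdisk owned by the nagios user (size, paths and ownership must be adapted to your installation), the boot-time commands and the matching nagios.cfg directives could look like this:

# Run from a boot script, before starting Nagios
mount -t tmpfs -o size=256m,mode=0755 tmpfs /var/nagios/ramdisk
mkdir -p /var/nagios/ramdisk/spool /var/nagios/ramdisk/tmp /var/nagios/ramdisk/rw
chown -R nagios:nagios /var/nagios/ramdisk

# nagios.cfg directives pointing at the RAM disk
check_result_path=/var/nagios/ramdisk/spool
status_file=/var/nagios/ramdisk/status.dat
state_retention_file=/var/nagios/ramdisk/retention.dat
temp_path=/var/nagios/ramdisk/tmp
host_perfdata_file=/var/nagios/ramdisk/host-perfdata
service_perfdata_file=/var/nagios/ramdisk/service-perfdata
command_file=/var/nagios/ramdisk/rw/nagios.cmd

Remember that the retention file must be copied back to persistent storage before unmounting at shutdown, and restored to the RAM disk before Nagios starts.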

Limit the NDO broker exported data

When using NDOUtils, try to limit the information exported by ndomod.o to the minimum. Usually the info exported by ndomod.o is sent to a database server (via the ndo2db daemon), so limiting the exported information will reduce both the network traffic and the database load.

In order to do it, set the right value in the ndomod.cfg data_processing_options configuration option, a bitmask whose bit meanings are defined in the German Nagios Wiki and that can be conveniently computed with the free ConSol Labs calculator. What kind of information can be omitted depends on the system, but usually (if not always) you can omit the timed event data. For Centreon based systems a safe value is 67108661.
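
In ndomod.cfg this is a single line; the value below is the Centreon-safe one mentioned above, so double-check it against your own needs with the calculator:

# ndomod.cfg
data_processing_options=67108661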

Use tables based on MEMORY engine

For systems using a MySQL database backend (i.e., using ndo2db), set the nagios_hoststatus and nagios_servicestatus tables in the nagios database to use the MEMORY engine. Since every check result is stored in these tables, keeping them in memory will reduce disk access and thus enhance database performance.

In order to do it you will have to drop and re-create both tables, setting their engine to "MEMORY". Stop ndo2db, log in to the database as root and follow these steps:
  1. Get the table definition by running the command show create table tablename;, where tablename is nagios_hoststatus or nagios_servicestatus.
  2. Drop the table by running the command drop table tablename;, where tablename is nagios_hoststatus or nagios_servicestatus.
  3. Create the table again by pasting the definition retrieved in step 1, but this time set "ENGINE=MEMORY" instead of "ENGINE=MyISAM" and change the type of the long_output and perfdata fields from text to varchar(4096), since the text type is not supported by the MEMORY engine.
Once the ndo2db daemon is started again, restart your Nagios processes in order to fully populate the new, empty tables. Every time the database server is (re)started while Nagios processes are running you will need to repeat this step but, luckily, the database server usually starts before the Nagios processes and stops after them, so the previous step won't be necessary except in extraordinary situations.
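
As an alternative sketch to dropping and re-creating the tables, the same result can usually be achieved in place with ALTER TABLE. This assumes the stock NDOUtils schema and a database named nagios; run it with ndo2db stopped, and if the conversion complains about other TEXT columns, convert them the same way:

mysql -u root -p nagios <<'EOF'
ALTER TABLE nagios_hoststatus
  MODIFY long_output VARCHAR(4096),
  MODIFY perfdata VARCHAR(4096),
  ENGINE=MEMORY;
ALTER TABLE nagios_servicestatus
  MODIFY long_output VARCHAR(4096),
  MODIFY perfdata VARCHAR(4096),
  ENGINE=MEMORY;
EOF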

Apply large installation tweaks separately

Nagios supports a configuration property called use_large_installation_tweaks that applies a set of optimizations in order to improve core performance. These optimizations are:

  • Inhibiting the use of summary macros in environment variables. This behavior can also be controlled by setting the value of the enable_environment_macros configuration option.
  • Using an alternative memory cleanup procedure for check child processes. This behavior can also be controlled by setting the value of the free_child_process_memory configuration option.
  • Modifying the way Nagios forks when running a check. This behavior can also be controlled by setting the value of the child_processes_fork_twice configuration option.
As you can see, each of these optimizations can be enabled or disabled separately, which gives you better control over what is going on and, if something fails, lets you undo it without having to disable the rest of the tweaks.
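
As a sketch, the nagios.cfg fragment below applies the three optimizations individually instead of through the umbrella option; double-check the exact semantics of each directive against the documentation of your Nagios version:

# nagios.cfg: large installation tweaks applied one by one
use_large_installation_tweaks=0
# Do not export macros as environment variables (the summary-macro tweak)
enable_environment_macros=0
# Skip the memory cleanup in check child processes
free_child_process_memory=0
# Fork only once instead of twice when running a check
child_processes_fork_twice=0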

Migrate to a distributed architecture

A distributed architecture, i.e., a design where more than one Nagios instance (and thus more than one server) shares the monitoring tasks, should be considered under basically two circumstances:
  • When the monitoring system reaches a certain size and the server(s) become unstable under load peaks like those caused by massive check scheduling (for instance during a host unreachability condition, when many checks can be scheduled at the same time). In this case try to distribute the checks across more than one server based on geographical, security or similar criteria.
  • When some checks are based on resource-hungry scripts whose execution, again, can lead to a lack of stability in the system. A good example of this scenario could be Selenium based checks (at least until recent Selenium server versions). In this case isolate the heavy checks on a dedicated Nagios poller.

Use dedicated servers

When your system reaches a really big size, use dedicated servers for the web frontend and the database if you're running Nagios XI, Centreon or Icinga. Initially these products deploy the database, web and monitoring tasks on the same server, but when the system grows and many or all of the previous tips have been applied, the rational next step is moving the database and web tasks to dedicated servers.



Friday, 2 December 2011

Managing Nagios logs

Nagios natively supports log rotation, a functionality managed through the log_rotation_method main configuration option. This is the option's description, taken from the official Nagios documentation:

Format: log_rotation_method=[n/h/d/w/m]
Example: log_rotation_method=d


This is the rotation method that you would like Nagios to use for your log file. Values are as follows:

n = None (don't rotate the log - this is the default)
h = Hourly (rotate the log at the top of each hour)
d = Daily (rotate the log at midnight each day)
w = Weekly (rotate the log at midnight on Saturday)
m = Monthly (rotate the log at midnight on the last day of the month)

People often get confused by the Nagios log management capabilities and believe that, besides rotating, Nagios will erase older files too... or that some angel in our system will do it for us. Sadly this is not the case in either Nagios or Centreon systems, and older logs remain on disk for months or even years.

This simple script can be very helpful to address the previous fact. It manages log files in two combinable ways: compressing and/or deleting files older than a given number of days. It takes three arguments: 

  • Directory where nagios logs are stored
  • Age, in days, for files that will be compressed
  • Age, in days, for files that will be deleted
For instance, assuming it is named manage_naglogs, this example would compress files older than 7 days and delete files older than 30 days:

manage_naglogs /var/log/nagios 7 30

And here comes the script:


#!/bin/bash
# Usage: manage_naglogs <log directory> <compress age in days> <delete age in days>
LOGDIR=$1
COMPRESS_DAYS=$2
DELETE_DAYS=$3

# Delete rotated logs (compressed or not) older than DELETE_DAYS
if [ "$DELETE_DAYS" -gt 0 ]
then
        find "$LOGDIR" -maxdepth 1 -name 'nagios-*.gz' -mtime +"$DELETE_DAYS" -exec rm {} \;
        find "$LOGDIR" -maxdepth 1 -name 'nagios-*.log' -mtime +"$DELETE_DAYS" -exec rm {} \;
fi

# Compress rotated logs older than COMPRESS_DAYS
if [ "$COMPRESS_DAYS" -gt 0 ]
then
        find "$LOGDIR" -maxdepth 1 -name 'nagios-*.log' -mtime +"$COMPRESS_DAYS" -exec gzip {} \;
fi

In order to run it periodically, I recommend scheduling it with cron. On systems like Debian, where /etc/cron.daily stores scripts run every day, and assuming you have saved the previous script in /usr/local/nagios/bin, create a script like this, save it in /etc/cron.daily and set the proper file permissions so it can be run by the cron daemon (chmod 755 will do the job):

#!/bin/bash
/usr/local/nagios/bin/manage_naglogs /var/nagios/logs 7 30

On systems where only crontab is available, the following entry in /etc/crontab will do the job. It runs our script once a day at 3:00 am:

00 3 * * * root /usr/local/nagios/bin/manage_naglogs /var/nagios/log 7 30

Finally, one piece of advice for those using Centreon: keep at least one rotated Nagios log file untouched (i.e., neither compressed nor deleted). Bear in mind that Centreon runs a script every day (usually at 1:00 am) that parses the Nagios log files in order to build availability reports. To achieve this, use values higher than 1 for the second and third script arguments.

Last but not least...

If you found this article useful, please leave your comments and support the site by clicking on some (or even all!) of our sponsors' advertisements. Thanks in advance!


Saturday, 28 May 2011

Nagios: Service checks based on host status


Notice

This article applies to Nagios Core 2.x and 3.x. Luckily, Nagios Core 4 natively inhibits service notifications when the service's parent (for instance its host) is not UP. Read about this and other Nagios Core 4 features at Nagios Core 4: Overview.


You might expect that when a host switches to a DOWN or UNREACHABLE state, Nagios would stop checking its services: why check them if Nagios itself has determined that the host is not UP?

For better or worse this is not the case: Nagios keeps running regular checks on the services of a non-UP host. The resulting state of each service check depends on how the plugin handles the unavailability of its data source.

Whatever the advantages of this behavior, it has some clear disadvantages:

  • Too much information causes confusion: a burst of service alarms triggered by a host failure can hide real problems in services on other hosts.
  • Resources are consumed running checks that are bound to fail.
  • A notification storm is generated for both the host and its failing services.

Therefore it seems desirable, if not for all then at least for many service types, to follow some steps to avoid the above problems:

  1. Establishing service states to reflect the reality of the situation, such as an UNKNOWN state.
  2. Inhibiting notifications related to service state change.
  3. Disabling active checks of services while their host is not UP.

These steps should mitigate, to a greater or lesser extent, the problems related to misleading information, resource consumption and notification storms.


Howto
So now the question is: how to do it? There are different approaches, each one with its pros and cons. Without analyzing them all, the best solution seems to be using Nagios external commands to perform all the previous tasks every time a host status changes.

The required external commands are those used in the script below: PROCESS_SERVICE_CHECK_RESULT, ENABLE_HOST_SVC_CHECKS, DISABLE_HOST_SVC_CHECKS, ENABLE_HOST_SVC_NOTIFICATIONS and DISABLE_HOST_SVC_NOTIFICATIONS.
All these commands must be used in a script designed to manage host status changes. This script might accept these command line arguments:
  • Host name, available through the $HOSTNAME$ host macro.
  • Host status, available (in numeric format) through the $HOSTSTATUSID$ host macro.

This could be the script algorithm, in pseudocode:

if HOSTSTATUSID=0 then
  # Host has changed to an UP status
   
  # Force status for all host services
  for each host Service
    # Submit an external command to set, as service status,
    # previous current value ($LASTSERVICESTATUSID$ macro)
    ExternalCommand(PROCESS_SERVICE_CHECK_RESULT,Service,
                    $LASTSERVICESTATUSID:HostName:Service$)
  endfor

  # Enable notifications for all host services
  ExternalCommand(ENABLE_HOST_SVC_NOTIFICATIONS, HostName)

  # Enable active checks for all host services
  ExternalCommand(ENABLE_HOST_SVC_CHECKS, HostName)
else
  # Host has changed to a non-UP status
   
  # Disable active checks for all host services
  ExternalCommand(DISABLE_HOST_SVC_CHECKS, Hostname)
   
  # Disable notifications for all host services
  ExternalCommand(DISABLE_HOST_SVC_NOTIFICATIONS, HostName)
  # Set UNKNOWN (3) status for all host services
  for each host Service
    ExternalCommand(PROCESS_SERVICE_CHECK_RESULT,Service,3)
  endfor
endif
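
As a minimal bash sketch (not a complete implementation), the event handler can write the external commands straight to the Nagios command file. The command file path is an assumption, and forcing each service state with PROCESS_SERVICE_CHECK_RESULT is omitted because it requires enumerating the host's services (for instance from status.dat or the NDO database):

#!/bin/bash
# setSvcStatusByHostStatus.sh -h <host name> -s <numeric host state id>
CMDFILE=/usr/local/nagios/var/rw/nagios.cmd   # adjust to your command_file setting

while getopts "h:s:" opt; do
  case $opt in
    h) HOST=$OPTARG ;;
    s) STATE=$OPTARG ;;
  esac
done

NOW=$(date +%s)
if [ "$STATE" -eq 0 ]; then
  # Host is back UP: re-enable active checks and notifications for its services
  echo "[$NOW] ENABLE_HOST_SVC_CHECKS;$HOST" >> "$CMDFILE"
  echo "[$NOW] ENABLE_HOST_SVC_NOTIFICATIONS;$HOST" >> "$CMDFILE"
else
  # Host is DOWN or UNREACHABLE: stop checking and notifying its services
  echo "[$NOW] DISABLE_HOST_SVC_CHECKS;$HOST" >> "$CMDFILE"
  echo "[$NOW] DISABLE_HOST_SVC_NOTIFICATIONS;$HOST" >> "$CMDFILE"
fi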


Configuration
Once the script is written, you must define a command object so it can be used from Nagios:

define command {
command_name setSvcStatusByHostStatus
command_line $USER1$/setSvcStatusByHostStatus.sh -h $HOSTNAME$ -s $HOSTSTATUSID$
}

In the previous example, the script (assumed here to be saved as setSvcStatusByHostStatus.sh in the $USER1$ directory) receives the host name through the -h argument and the host status id through the -s argument.
Finally, it will be necessary to set the previous command as a host event handler. If the defined solution is suitable for managing all host status changes, the command must be set as the global host event handler in the main Nagios configuration (usually the nagios.cfg file):

global_host_event_handler=setSvcStatusByHostStatus

If it's not to be used on all hosts, it must be set as the event handler for every suitable host:

define host {
...
event_handler setSvcStatusByHostStatus
...
}

Centreon
The previous solution is fully supported by Centreon:
  • The command definition is no different from any other command. The only thing to consider is defining it as a "check" type command so it shows up in the event handler configuration lists.
  • You can set the value of global_host_event_handler through the field "Global host event handler" located on the "Checking options" tab of the Configuration > Nagios > nagios.cfg menu.
  • You can set the event_handler directive for each host using the field "Event handler" located on the "Data management" tab of the Configuration > Hosts > (host name) menu.



Saturday, 21 May 2011

Monitoring multi-address or multi-identifier devices


When managing monitoring systems it's common to find situations in which one device has more than one identifier, several network addresses, or a combination of specific IP addresses and identifiers. Some cases may be:
  • Servers with separate management and production network interfaces. These include, for example, HP Proliant servers on which the ILO has a dedicated network interface and therefore a network address different from the production one.
  • Virtual hosts with both an IP address and an identifier at the virtualization platform level. A common example are hosts virtualized on VMware ESX, where the identifier at the virtualization level is completely disconnected from the IP address assigned to the device.
When the monitoring system is based on Nagios, where there is only one property that identifies the host address (the address property of the host object), the above situation becomes a problem.
The usual solution is keeping the second value in the alias property and changing the check command definitions, replacing the $HOSTADDRESS$ macro with the $HOSTALIAS$ macro. However, this approach creates more problems than it solves:
  • The alias is very useful when correctly used in reports, identifying and providing valuable information about the host.
  • Some third-party tools, usually topology tools, use this field as display name. 

User Macros
    In addition to the standard macros, Nagios supports the so-called custom variable macros: identifier-value pairs defined in host, service or contact objects. Macros of this type are distinguished from the standard ones by being necessarily prefixed with a "_" symbol.

    define host {
        host_name ProliantServer
        address 192.168.1.1
        _ILOADDRESS 192.168.2.1
        ...
    }
     
    In the above example a macro called $_ILOADDRESS$ is defined, its value (192.168.2.1) being the IP address of the ILO management interface on the server called ProliantServer. From all points of view this macro can be considered a standard Nagios macro: it can be used when executing host or service checks and can therefore appear in a command definition:
     

    define command {
        command_name CheckILOFans
        command_line $USER1$/check_snmp -H $_HOSTILOADDRESS$ ...
        ...
    }
     
    define command {
        command_name CheckHTTPPort 
        command_line $USER1$/check_tcp -H $HOSTADDRESS$ ...
        ...
    }
     

    The above example first defines a command called CheckILOFans, intended to check the status of the fans on a server through its ILO management interface. The second command, CheckHTTPPort, is intended to establish the availability of the HTTP port on the production interface.


    In the first case the host address used is not $HOSTADDRESS$. Instead we use the address stored in our recently created macro, whose name must be prefixed with _HOST because it has been defined as part of a host object, so the macro must be referenced as $_HOSTILOADDRESS$. In the same way, if we define a custom macro in a service object definition it should be referenced by prefixing its identifier with _SERVICE, and if we define a custom macro in a contact object definition it should be prefixed with _CONTACT.


    By following this approach, we can now use both commands to define checks on the same host, even though they rely on information available through different network interfaces: 




    define service {
        host_name ProliantServer
        service_description FanStatus 
        check_command CheckILOFans
        ...
    }

    define service {
        host_name ProliantServer
        service_description HTTPStatus 
        check_command CheckHTTPPort
        ...
    }
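
    As a hypothetical complement (the service, command and port below are invented for the example), the same mechanism works at the service level: a custom macro defined inside a service object is referenced with the _SERVICE prefix:

    define service {
        host_name ProliantServer
        service_description AltHTTPStatus
        check_command CheckCustomHTTPPort
        _HTTPPORT 8080
        ...
    }

    define command {
        command_name CheckCustomHTTPPort
        command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p $_SERVICEHTTPPORT$
        ...
    }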

     
    Macros in Centreon
     

    For those who prefer configuring Nagios using the Merethis tool, Centreon has supported the management of custom variable macros since version 1.x: you can create, modify and delete them on the "Macros" tab of the host and service object configuration screens. Unfortunately, even in the recently released version 2.2, the management of macros in contact objects is still not supported.



     