Notice
This article applies to Nagios Core 2.x and 3.x. Luckily, Nagios Core 4 natively suppresses service notifications when the service's parent (for instance, its host) is not UP. Read about this and other Nagios Core 4 features at Nagios Core 4: Overview.
It would seem reasonable that when a host switches to a DOWN or UNREACHABLE state, Nagios stops checking its services: why check them if Nagios itself has determined that the host is not UP?
For better or worse, this is not the case: Nagios keeps running regular checks on the services of a non-UP host. The resulting state of each service check depends on how the check handles the unavailability of its data source.
Whatever the advantages of this behavior, it has some disadvantages:
- Too much information causes confusion: a set of alarms on services affected by a host failure can hide real problems in services on other hosts.
- Resources are consumed running checks that are bound to fail.
- A notification storm is triggered by the failure of the host and its services.
It therefore seems desirable, if not for all then at least for many service types, to take some steps to avoid the above problems:
- Establishing service states to reflect the reality of the situation, such as an UNKNOWN state.
- Inhibiting notifications related to service state change.
- Disabling active checks of services while their host is not UP.
These steps should prevent, to a greater or lesser extent, the problems of misleading information, resource consumption and notification storms.
Howto
So now the question is: how to do it? There are different approaches, each with its pros and cons. Without analyzing them all, the best solution seems to be using Nagios external commands to perform all the previous tasks every time the host status changes.
The required external commands are:
- ENABLE_PASSIVE_SVC_CHECKS: Enables service status to be set from an external command. Note that this command itself doesn't set the status; you must use PROCESS_SERVICE_CHECK_RESULT (read on) to do it.
- DISABLE_HOST_SVC_CHECKS, ENABLE_HOST_SVC_CHECKS: Disables/Enables checks for all services of a given host.
- PROCESS_SERVICE_CHECK_RESULT: Sets the status value for a given service.
- DISABLE_HOST_SVC_NOTIFICATIONS and additionally DISABLE_ALL_NOTIFICATIONS_BEYOND_HOST: Disable notifications for all services of a given host and, respectively, for all hosts and services topologically beyond a given host.
- ENABLE_HOST_SVC_NOTIFICATIONS and additionally ENABLE_ALL_NOTIFICATIONS_BEYOND_HOST: Do the opposite of the previous commands.
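As a rough sketch of what submitting these commands involves: Nagios reads external commands as single lines of the form "[timestamp] COMMAND;arg1;arg2" from its command file. The Python below formats such lines and writes them to that file. The helper names format_command and submit, the host and service names, and the command file path are assumptions for illustration; the real path is whatever the command_file directive in nagios.cfg says.

```python
import time

# Assumed path; the real one is set by the command_file directive in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def format_command(name, *args, now=None):
    """Return one external command line: [timestamp] NAME;arg1;arg2..."""
    timestamp = int(time.time() if now is None else now)
    fields = (name,) + tuple(str(arg) for arg in args)
    return "[{}] {}\n".format(timestamp, ";".join(fields))


def submit(line, command_file=COMMAND_FILE):
    """Write one formatted command line to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)


if __name__ == "__main__":
    # Print the lines instead of submitting them, so the sketch runs anywhere.
    print(format_command("DISABLE_HOST_SVC_CHECKS", "web01"), end="")
    print(format_command("PROCESS_SERVICE_CHECK_RESULT",
                         "web01", "HTTP", 3, "Host is not UP"), end="")
```

PROCESS_SERVICE_CHECK_RESULT takes the host name, the service description, the return code and a plugin output text, which is why the second example carries four arguments.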
All these commands must be used in a script designed to handle host status changes. This script might take these command line arguments:
- Host name, available through the $HOSTNAME$ host macro.
- Host status, available (in numeric format) through the $HOSTSTATEID$ host macro.
This could be the script algorithm using metalanguage:
if HOSTSTATEID=0 then
    # Host has changed to an UP status
    # Force status for all host services
    for each host Service
        # Submit an external command to set, as service status,
        # its previous value ($LASTSERVICESTATEID$ macro)
        ExternalCommand(PROCESS_SERVICE_CHECK_RESULT, Service, $LASTSERVICESTATEID:HostName:Service$)
    endfor
    # Enable notifications for all host services
    ExternalCommand(ENABLE_HOST_SVC_NOTIFICATIONS, HostName)
    # Enable active checks for all host services
    ExternalCommand(ENABLE_HOST_SVC_CHECKS, HostName)
else
    # Host has changed to a non-UP status
    # Disable active checks for all host services
    ExternalCommand(DISABLE_HOST_SVC_CHECKS, HostName)
    # Disable notifications for all host services
    ExternalCommand(DISABLE_HOST_SVC_NOTIFICATIONS, HostName)
    # Set UNKNOWN (3) status for all host services
    for each host Service
        ExternalCommand(PROCESS_SERVICE_CHECK_RESULT, Service, 3)
    endfor
endif
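The algorithm above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name handler_commands is hypothetical, the list of services and the mapping of previous service states are assumed to be obtained elsewhere (Nagios offers no macro listing the services of a host), and the plugin output texts are invented. The function only builds the commands; submitting them to Nagios is left out.

```python
def handler_commands(host, host_state_id, services, previous_states=None):
    """Return (command, args) tuples for one host state change.

    services: service descriptions bound to the host (obtained elsewhere).
    previous_states: hypothetical mapping of service description to the
    state ID to restore once the host is UP again.
    """
    previous_states = previous_states or {}
    commands = []
    if host_state_id == 0:
        # Host is UP: restore each service state, then re-enable
        # notifications and active checks.
        for service in services:
            # Assumption: fall back to UNKNOWN (3) if no previous state is known.
            state = previous_states.get(service, 3)
            commands.append(("PROCESS_SERVICE_CHECK_RESULT",
                             (host, service, state, "Host is UP again")))
        commands.append(("ENABLE_HOST_SVC_NOTIFICATIONS", (host,)))
        commands.append(("ENABLE_HOST_SVC_CHECKS", (host,)))
    else:
        # Host is not UP: stop checks and notifications, then mark
        # every service as UNKNOWN (3).
        commands.append(("DISABLE_HOST_SVC_CHECKS", (host,)))
        commands.append(("DISABLE_HOST_SVC_NOTIFICATIONS", (host,)))
        for service in services:
            commands.append(("PROCESS_SERVICE_CHECK_RESULT",
                             (host, service, 3, "Host is not UP")))
    return commands


if __name__ == "__main__":
    for name, args in handler_commands("web01", 1, ["HTTP", "SSH"]):
        print(name, args)
```

Keeping the decision logic separate from the code that writes to the command file makes it possible to check the generated commands without a running Nagios.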
Configuration
Once the script is written, you must define a command object so that Nagios can use it:
define command {
command_name setSvcStatusByHostStatus
command_line -h $HOSTNAME$ -s $HOSTSTATEID$
}
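On the script side, the -h and -s options passed by this command definition have to be parsed. A minimal Python sketch, assuming the handler is written in Python (the function name parse_handler_args is hypothetical):

```python
import argparse


def parse_handler_args(argv):
    """Parse the -h <host name> and -s <numeric host state> options."""
    # add_help=False frees -h, which argparse otherwise reserves for --help.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-h", dest="host", required=True)
    parser.add_argument("-s", dest="state_id", type=int, required=True)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_handler_args(["-h", "web01", "-s", "1"])
    print(args.host, args.state_id)
```

Converting the state to an integer at parse time lets the rest of the script compare it directly against 0 (UP).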
Finally, the previous command must be set as a host event handler. If the solution is suitable for handling status changes on all hosts, set the command as the global host event handler in the main Nagios configuration file (usually nagios.cfg):
global_host_event_handler = setSvcStatusByHostStatus
If it is not to be used on all hosts, set it as the event handler of every suitable host:
define host {
...
event_handler setSvcStatusByHostStatus
...
}
Centreon
The previous solution is fully supported by Centreon:
- The command definition is no different from any other command. The only thing to consider is defining it as "check" type so that it is available in the event handler configuration lists.
- You can set the value of global_host_event_handler through the field "Global host event handler" located on the "Checking options" tab in the Configuration>Nagios>Nagios.cfg menu.
- You can set the event_handler directive for each host using the field "Event handler" located on the "Data management" tab of Configuration>Hosts>(host name).
Great article, helped me a lot! Thank you!
But there is one thing I can't figure out: how can I determine which services are under a host? I couldn't find any Nagios macro that could send this information to my script. Therefore I don't know how to solve your for cycle:
# Force status for all host services
for each host Service
# Submit an external command to set, as service status,
# previous current value ($LASTSERVICESTATUSID$ macro)
ExternalCommand(PROCESS_SERVICE_CHECK_RESULT,Service,
$LASTSERVICESTATUSID:HostName:Service$)
endfor
Could you please give me any advice? Thank you in advance!
Hi Honza:
Happy to know that my article helped you. About what you ask, you are right: there are no macros for getting all services on a host.
You have to get them by parsing the Nagios configuration files, but don't fear: the fantastic Perl Nagios::Config library is here to help us. This script shows how to get all the services of a given host:
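#!/usr/bin/perl
use strict;
use warnings;
use Nagios::Config;
# Arguments: Nagios config file name, then the host name to look up
my $Parser = Nagios::Config->new(Filename => $ARGV[0], Version => 2);
my $Host = $Parser->find_object($ARGV[1],'Nagios::Host');
if ( defined $Host ) {
    # Print the service_description of every service bound to the host
    foreach my $Service ( $Host->list_services ) {
        printf "%s\n", $Service->{'service_description'};
    }
}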
It takes two command line arguments: the Nagios config file name and the name of the host whose services you want to get. The script will output, one per line, the value of the service_description field of every service bound to the host name you pass as second argument.
You can get the Nagios::Config library from CPAN: http://search.cpan.org/~duncs/Nagios-Object-0.21.16/lib/Nagios/Config.pm
Hope it was helpful :)
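An alternative to parsing the configuration files, sketched here in Python, is to read the Nagios status file, where every service appears in a "servicestatus { ... }" block with host_name=... and service_description=... entries. This is a different technique from the Perl script above, and only a minimal sketch: the function name services_of and the sample data are made up for illustration, and the location of the status file is whatever the status_file directive in nagios.cfg says.

```python
def services_of(status_text, host):
    """Return the service descriptions found in servicestatus blocks for host."""
    services = []
    block = None  # key=value pairs collected while inside a servicestatus block
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("servicestatus") and line.endswith("{"):
            block = {}
        elif line == "}" and block is not None:
            if block.get("host_name") == host and "service_description" in block:
                services.append(block["service_description"])
            block = None
        elif block is not None and "=" in line:
            key, _, value = line.partition("=")
            block[key] = value
    return services


# Made-up sample imitating the layout of a Nagios status file.
SAMPLE = """\
hoststatus {
\thost_name=web01
\tcurrent_state=1
\t}

servicestatus {
\thost_name=web01
\tservice_description=HTTP
\tcurrent_state=2
\t}

servicestatus {
\thost_name=db01
\tservice_description=MySQL
\t}
"""

if __name__ == "__main__":
    for description in services_of(SAMPLE, "web01"):
        print(description)
```

In real use the text would come from reading the status file rather than from the SAMPLE string.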
Vicente - could you post or forward to me (stvlange@gmail.com) the scripts you use for this? I tried the Perl script you list above and am not getting any output, and I'd love to see your actual scripts (not just the metalanguage). I'm really new to using external commands.
Hello,
I must admit that your approach is very interesting, but I would like to see the code, because I don't know how to implement it properly.
I'm focused on putting an UNKNOWN status on each service of a down node. That's why I think your post is worth it.
Regards!
Thanks for your feedback Siser. I've developed a public script release but I'm dealing with a bug in the underlying Nagios::Object Perl library it needs. This library is used to retrieve the services bound to the target host, and hence to decide which services must be handled in order to avoid a notification storm.
The bug has been reported to its developers, who have kindly checked it and expect an early resolution. As soon as they fix it I'll test the script and, if you agree, I'll send it to you by email (I'll send it to Steve Lange too) so you can test it as a beta prior to release.
Hello Vicente!
Is it also possible to get your public script for this once the problem in the Perl library is solved? I have been looking for such a solution for a long time.
Thank you very much in advance.
Regards!
Hello Vicente!
I have scripted it on my own. Thank you very much for your input and your declaration. The Nagios::Object Perl library seems to be fixed now, because for me it is working!!
Hi would you mind sharing the script you used to get this working?