...A place where sharing IT monitoring knowledges

Showing posts with label UPS. Show all posts
Showing posts with label UPS. Show all posts

Thursday, 12 October 2017

Plugin Of The Month: Checking UPS alarms

NOTE: This post covers a Nagios/Icinga/Centreon Core compatible plugin addressed to monitor the active alarms. If you are interested in the background information that supports the plugin, see the post Monitoring UPS Devices: UPS-MIB.

Background

Monitoring the UPSs health is critical when trying to keep alive your ITC infrastructure. One of the most supported ways of monitoring UPSs, instead of using proprietary solutions, is SNMP via the MIB UPS (RFC 1628), widely supported by most of the devices in the market.

Specifically, MIB UPS (RFC 1628) supports an active alarms table in the managed device (OID upsAlarmTable, 1.3.6.1.2.1.33.1.6.2). Each table input stores:
  • upsAlarmId: An unique active alarm identifier
  • upsAlarmDescr: A reference to an alarm description object
  • upsAlarmTime: The value of sysUpTime when the alarm condition was detected
MIB includes a list of alarms objects called Well-Known-Alarms:

ID OID Description
1 upsAlarmBatteryBad (1.3.6.1.2.1.33.1.6.3.1) One or more batteries have been determined to require replacement.
2 upsAlarmOnBattery (1.3.6.1.2.1.33.1.6.3.2) The UPS is drawing power from the batteries.
3 upsAlarmLowBattery (1.3.6.1.2.1.33.1.6.3.3) The remaining battery run-time is less than or equal to upsConfigLowBattTime.
4 upsAlarmDepletedBattery (1.3.6.1.2.1.33.1.6.3.4) The UPS will be unable to sustain the present load when and if the utility power is lost.
5 upsAlarmTempBad (1.3.6.1.2.1.33.1.6.3.5) A temperature is out of tolerance.
6 upsAlarmInputBad (1.3.6.1.2.1.33.1.6.3.6) An input condition is out of tolerance.
7 upsAlarmOutputBad (1.3.6.1.2.1.33.1.6.3.7) An output condition (other than OutputOverload) is out of tolerance.
8 upsAlarmOutputOverload (1.3.6.1.2.1.33.1.6.3.8) The output load exceeds the UPS output capacity.
9 upsAlarmOnBypass (1.3.6.1.2.1.33.1.6.3.9) The Bypass is presently engaged on the UPS.
10 upsAlarmBypassBad (1.3.6.1.2.1.33.1.6.3.10) The Bypass is out of tolerance.
11 upsAlarmOutputOffAsRequested (1.3.6.1.2.1.33.1.6.3.11) The UPS has shutdown as requested, i.e., the output is off.
12 upsAlarmUpsOffAsRequested (1.3.6.1.2.1.33.1.6.3.12) The entire UPS has shutdown as commanded.
13 upsAlarmChargerFailed (1.3.6.1.2.1.33.1.6.3.13) An uncorrected problem has been detected within the UPS charger subsystem.
14 upsAlarmUpsOutputOff (1.3.6.1.2.1.33.1.6.3.14) The output of the UPS is in the off state.
15 upsAlarmUpsSystemOff (1.3.6.1.2.1.33.1.6.3.15) The UPS system is in the off state.
16 upsAlarmFanFailure (1.3.6.1.2.1.33.1.6.3.16) The failure of one or more fans in the UPS has been detected.
17 upsAlarmFuseFailure (1.3.6.1.2.1.33.1.6.3.17) The failure of one or more fuses has been detected.
18 upsAlarmGeneralFault (1.3.6.1.2.1.33.1.6.3.18) A general fault in the UPS has been detected.
19 upsAlarmDiagnosticTestFailed (1.3.6.1.2.1.33.1.6.3.19) The result of the last diagnostic test indicates a failure.
20 upsAlarmCommunicationsLost (1.3.6.1.2.1.33.1.6.3.20) A problem has been encountered in the communications between the agent and the UPS.
21 upsAlarmAwaitingPower (1.3.6.1.2.1.33.1.6.3.21) The UPS output is off and the UPS is awaiting the return of input power.
22 upsAlarmShutdownPending (1.3.6.1.2.1.33.1.6.3.22) A upsShutdownAfterDelay countdown is underway.
23 upsAlarmShutdownImminent (1.3.6.1.2.1.33.1.6.3.23) The UPS will turn off power to the load in less than 5 seconds; this may be either a timed shutdown or a low battery shutdown.
24 upsAlarmTestInProgress (1.3.6.1.2.1.33.1.6.3.24) A test is in progress, as initiated and indicated by the Test Group. Tests initiated via other implementation-specific mechanisms can indicate the presence of the testing in the alarm table, if desired, via a OBJECT-IDENTITY macro in the MIB document specific to that implementation and are outside the scope of this OBJECT-IDENTITY

Plugin description

check_ups_alarms is a Nagios/Icinga/Centreon Core compatible plugin for checking the active alarms in a UPS MIB compliant device.

The alarms to be checked (see table above) can be defined as a list of IDs in the warning and critical plugin arguments. If one of these alarms becomes active, the plugin will return the status associated to the list where the active alarm is defined.
check_ups_alarms  is based on fetching session data using SNMP v1/2c, so it's necessary that the device being checkek supported this protocol and served info managed by the UPS MIB (RFC 1628).

You can get detailed help and usage examples by running the script with the  --help option.

Usage examples

check_UPS_alarms -H 192.168.0.1
Checks the active alarms on a host with address 192.168.0.1 using SNMP protocol version 1 and 'public' as community. Plugin returns always OK.

check_UPS_alarms -H 192.168.0.1 -w 1..4,11 -c 5..10
Similar to the previous example but returning WARNING if alarm(s) with ids 1 to 4 and 11 are active and CRITICAL if alarms with ids 5 to 10 are active

Download

You can download the latest version of the plugin from GitHub.

The development of this plugin, that now is freely released, implies hours of reading technical documentation, programming and testing. I will be more than glad if you support this effort by clicking in some of the interesting advertisements that you can find on this website.

Last but not least, if you find some bug don't hesitate in contacting me for fixing it quickly. Feedback comments are welcome too!

Saturday, 5 January 2013

Monitoring UPS Devices: UPS-MIB


In the article "Monitoring UPS Devices" I tried to explain how important is monitoring these key devices and what are the methods when doing it. In this article, much more practical than the previous one, I'll cover where UPS-MIB stores key info from a monitoring point of view. For those of us even more practical I'll name some free Nagios core compatible plugins that do the trick quite well.

UPS-MIB

UPS-MIB (RFC1628) is the most extended MIB in UPS devices with native or added (via network cards) SNMP support. It covers, among others:

  • Electrical levels (voltage, current, power, frequency) in both inputs and outputs
  • Battery system status (temperatures, charge and autonomy levels)
  • Device working mode: online, offline, bypassed...
  • Device alarms
... what is basically all you need to know about your UPS for getting an accurate view of its state.

You can check if your device supports UPS MIB trying to query any OID of the MIB subtree 1.3.6.1.2.1.33. For instance, and supposing you have the net-snmp package, you can get the device model identifier from a UPS addressed as 192.168.1.129 supporting SNMP version 1 with community string "public" using this command:

snmpget -v 1 -c public 192.168.1.129 .1.3.6.1.2.1.33.1.1.2.0

Finally, you can get the full MIB from the IETF site: http://tools.ietf.org/html/rfc1628

Electrical levels

Monitoring electrical levels is useful for:
  • Being sure that input levels are in the right operational margins and, when not, having true data for complaining to your power provider.
  • Knowing when your device is reaching its limits (based on output power) and thus being able of planning upgrades.
  • Having a true monthly estimate about how much power (and hence money) is spent in your datacenter.
All those values are available in the MIB groups upsInput (.1.3.6.1.2.1.33.1.3) and upsOutput (.1.3.6.1.2.1.33.1.4). Following the previous example, you can get the upsInput group OIDs running this command:

snmpwalk -v 1 -c public 192.168.1.129 .1.3.6.1.2.1.33.1.3

Every group contains, among others, a table where line levels (voltage, current, power, frequency for inputs, load for outpus) are stored. Have in mind that, since UPS devices can have more than one input/output line (for instance, three-phase UPS have three input and three output lines) you can find different line info in that table. So for instance, for getting levels in the input line 2 you must run this command:

snmpwalk -v 1 -c public 192.168.1.129 .1.3.6.1.2.1.33.1.3.3.1.2

Some interesting OIDs for input table are:
  • upsInputVoltage (upsInput.3.1.3)
  • upsInputCurrent (upsInput.3.1.4)
  • upsInputTruePower (upsInput.3.1.5)
Some interesting OIDs for output table are:
  • upsOutputVoltage (upsOutput.4.1.2)
  • upsOutputCurrent (upsOutput.4.1.3)
  • upsOutputPower (upsOutput.4.1.4)
  • upsOutputPercentLoad (upsOutput.4.1.5)
Maybe the key value to check is the last one (upsOutputPercentLoad) since it will show the load, in percent over its nominal power, that your UPS is supporting in a given time. If this value reaches levels near 100% its time to start considering to upgrade your UPS. 

In three-phase devices, this value is important for other purpose too: checking how balanced are your output lines. If you find that some lines are highly loaded while other(s) not, its a good idea talking with your electrical technician in order to distribute the load in a more balanced way: Any UPS likes having unbalanced lines.

Specifically for the upsOutput group, the OID upsOutputSource (1.3.6.1.2.1.33.1.4.1.0) stores the source from where the UPS gets its power:
  • If the UPS is online, upsOutputSource is 3 (normal)
  • If the device is in bypass mode, upsOutputSource is 4 (bypass)
  • If the device is offline, upsOutputSource is 5 (battery)

Battery system

Battery system is the vital point in a UPS device. You can be sure that an UPS is healthy only to see how it is unable to support a moderate load during a power cutoff when its battery system is not ok.

upsBattery group (1.3.6.1.2.1.33.1.2) stores all the battery system information available, being the most important:
  • upsBatteryStatus (upsBattery.1.0) stores the status of the battery: unknown (1), normal (2), low (3) or depleted (4).
  • upsEstimatedMinutesRemaining (upsBattery.3.0) shows an estimation about how long, in minutes, are lasting your batteries. This value is established based on battery system capacity and output load: the more load, the less autonomy.
  • upsEstimatedChargeRemaining (upsBattery.4.0) shows the load level of your battery system. In a online system this value must be 100%.
  • upsBatteryTemperature (upsBattery.7.0) shows the temperature in your battery system. Temperature is key for optimizing the life of your battery system, or what is the same, for helping you to save money delaying each battery replacement cycle. Ask your UPS manufacturer for his optimal battery temperature levels, but usually it might be in the 20-25 Celsius degrees range.

Device status

You can get a view of the status of your device in two ways:
  • Combining upsBatteryStatus, upsEstimatedMinutesRemaining, upsEstimatedChargeRemaining, upsOutputPercentLoad and upsOutputSource  for knowing:
    • If your UPS device is online, offline or bypassed
    • If in offline mode, knowing your battery level and getting an estimation of the battery remaining time
  • Cheking the device alarm info. The upsAlarm group (1.3.6.1.2.1.33.1.6) stores the alarm information of the device: upsAlarmsPresent OID (upsAlarm.1.0) stores the number of active alarms present and upsAlarmTable table (upsAlarm.2) stores the information (id, description and time) of each present alarm.

Nagios core compatible plugins

For those managing Nagios Core based monitoring solutions (Nagios, Icinga, Centreon, ...), there are some free plugins that could help you. All these plugins are addressed to monitor UPS-MIB compatible devices and use most of the previous reviewed OIDs:

  • check_ups_alarms is shared in the post Plugin Of The Month: Checking UPS alarms of this website.
  • check_ups_mode, from Boreal Labs, combines different UPS-MIB values for generating a UPS mode status (online, bypass, offline, offline with battery low or offline with battery depleted) and returns battery load percent and backup time as performance data.

Monday, 3 May 2010

Monitoring UPS devices


From a layered design point of view, the monitoring base layer -or layer 0- includes all devices whose mission is creating a solid physic and environmental platform where upper devices and services could run.

UPS devices are a key part of this layer: from affording not only a continuous supply but also a quality signal to showing the consumption per line, they are a MUST BE when designing from datacenter to global monitoring strategies.

DISCLAIMER: This is mainly a theoretical article. If you're looking for practical info about how to get the most of your device read "Monitoring UPS Devices: UPS-MIB"

Why monitoring a UPS?

Many times, when getting data to start a monitoring project, I find that users claim "Really? Can it be monitored?", then I answer "it MUST be monitored".

Perhaps because UPS devices are in the border between computer and power supply worlds, in the limits where a vast, misterious land populated by volts, ampers and watts starts, the IT crowd deliberately or unconsciously doesn't pay them a bit of its attention.

However, all the IT hard -and thus soft- structure depends on them as suppliers of continuous, clean* power supply, just consider the costs in manpower and the amount of offline service clients of an unattended power cutoff in your CPD due to a unmonitorized out of service UPS.

All UPS are based on batteries as power storage devices and battery life is closely environmental temperature dependent, usually 21 to 25 Celsius degrees (depending on battery specs). In order to maximize their life you have two options: turning you CPD into a freezer spending lots of money in conditioning air maintenance, or monitoring their temperature and fit the environmental temperature to its optimal value... bet it, the last one wins.

Finally, monitoring the UPS output line values will allow you, without the need of using PDUs, knowing when you are reaching the UPS nominal limits or helping you to dimension the right device (and thus saving money) if you need to upgrade the current. And last but not least nowadays, monitoring the UPS input line value will allow you getting the overall** power supply consumption, and thus, the CO2 emission levels of your CPD.

...Well, and how do I get that info?

For a usual IT technician, first obvious option should be adopting a TUIN approach (Toothpicks Under Its Nails). Saddly UPS don't have nails.

Now seriously, you can rely on a propietary monitoring solution like those that some times manufacturers give or much more times they sell. If you have good friends and/or money to get it you will get a nice, colorful, program that only will be useful to monitor one kind of UPS. So if you have a diversified UPS park you will have to use more than one, and you will have to change it if you change the brand when upgrading it.

On the other hand, you can increase the value of your monitoring platform (read Nagios, Zenoss, OpenNMS or any other that supports SNMP traps and/or requests) adding UPS monitoring capabilities. You only need a UPS with SNMP capabilites: many of them have it out of the box, other can get it by mounting a network interface adapter.

When dealing with UPS SNMP MIBs you have, again, two options: basing your monitoring on private, manufacturer MIBs, or adopting an standard position and getting the needed info from the RFC 1628 MIB (formely UPS-MIB). First one is sweet and easy because the manufacturer might had chewed the meal to you, however you'll have to create a service set by manufacturer. Second one will require, in some cases, a higher scripting effort but it will work with all UPS supporting RFC 1628 MIB (that should be -if not all- the most).

As most of you -like me- like the extreme pain, let's go for the second one: "Monitoring UPS devices: UPS-MIB"

Summarizing

Using a few words I've tried to show the pros of monitoring our UPS park: more security, costs saving and more info in our hands to base further decisions. From the many ways that could be selected to monitor an UPS, I've argued why selecting a SNMP based approach and why using the info stored in RFC 1628 MIB.

Interesting links


Related posts


(*).- Depending on the device characteristics, not all UPS improve the power signal quality. To know more about it check the fantastic Neil Rasmussen's white paper in references.
(**).- Not counting air conditioning consumptions and stating that all CPD devices are attached to UPS network.



 
Design by Free WordPress Themes | Bloggerized by Lasantha - Premium Blogger Themes