Monitoring HP bladesystem servers ~ Monitoring Tips & Tricks

Wednesday, 14 November 2012

Monitoring HP bladesystem servers

HP Bladesystem servers are a different breed from their brothers in the DL, ML or even BL series: among other things, their management is based not on ILO but on the Onboard Administrator (OA).

ILO supports the great RIBCL protocol, which is by far the best option for monitoring HP servers: it is XML-based, and thus easy to parse, and it is native (no need to install SNMP daemons on our servers). Sadly, Onboard Administrator offers nothing similar to RIBCL. It supports a telnet/ssh command interpreter, but parsing the output of a facility aimed at human administrators rather than machines is more than tricky: you can bet that the output format of the command you parse will change in the next firmware revision.

It's true that the blades contained in a bladesystem enclosure support ILO, since they are considered servers, but the output you get when you submit a RIBCL command is not 100% real: for instance, a single virtual fan is shown to represent all the fans available in the enclosure, and something similar happens with power supplies. What blade servers publish via RIBCL is an abstraction of the enclosure's reality.

SNMP is the answer

So the only option for fine-grained monitoring of the bladesystem is SNMP. HP C3000 and C7000 series bladesystems support the CPQRACK-MIB MIB (1.3.6.1.4.1.232.22), which stores interesting information for monitoring the system's health:

The enclosure itself can be monitored by polling the table cpqRackCommonEnclosureTable (CPQRACK-MIB.2.3.1.1)
Enclosure manager information (the onboard administrators themselves) is located in the table cpqRackCommonEnclosureManagerTable (CPQRACK-MIB.2.3.1.6)
Temperature data can be found in the table cpqRackCommonEnclosureTempTable (CPQRACK-MIB.2.3.1.2)
Fan info is located in the table cpqRackCommonEnclosureFanTable (CPQRACK-MIB.2.3.1.3)
Fuses are represented in the table cpqRackCommonEnclosureFuseTable (CPQRACK-MIB.2.3.1.4)
FRU (Field Replaceable Unit) information is stored in the table cpqRackCommonEnclosureFruTable (CPQRACK-MIB.2.3.1.5)
Power systems (global and power supply specific) can be monitored by polling the tables cpqRackPowerEnclosureTable (CPQRACK-MIB.2.3.3.1) and cpqRackPowerSupplyTable (CPQRACK-MIB.2.5.1.1)
Blade information is stored in the table cpqRackServerBladeTable (CPQRACK-MIB.2.4.1.1)
Finally, network IO subsystems can be polled via the table cpqRackNetConnectorTable (CPQRACK-MIB.2.6.1.1)
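The "CPQRACK-MIB.x.y.z" notation used in the list above is shorthand for a suffix appended to the MIB base OID. As a minimal sketch (the names CPQRACK_BASE, TABLES and table_oid are made up for this illustration and belong to no library), the full numeric OIDs can be built like this:

```python
# Base OID of CPQRACK-MIB, as given in the text above.
CPQRACK_BASE = "1.3.6.1.4.1.232.22"

# Table suffixes relative to the MIB base, copied from the list above.
TABLES = {
    "cpqRackCommonEnclosureTable": "2.3.1.1",
    "cpqRackCommonEnclosureTempTable": "2.3.1.2",
    "cpqRackCommonEnclosureFanTable": "2.3.1.3",
    "cpqRackCommonEnclosureFuseTable": "2.3.1.4",
    "cpqRackCommonEnclosureFruTable": "2.3.1.5",
    "cpqRackCommonEnclosureManagerTable": "2.3.1.6",
    "cpqRackPowerEnclosureTable": "2.3.3.1",
    "cpqRackServerBladeTable": "2.4.1.1",
    "cpqRackPowerSupplyTable": "2.5.1.1",
    "cpqRackNetConnectorTable": "2.6.1.1",
}


def table_oid(name: str) -> str:
    """Return the full numeric OID of a CPQRACK-MIB table."""
    return f"{CPQRACK_BASE}.{TABLES[name]}"


if __name__ == "__main__":
    # Print every table with the OID you would hand to an SNMP walk.
    for name in TABLES:
        print(f"{name:38s} {table_oid(name)}")
```

Each resulting OID can then be handed to whatever SNMP client you use to walk the table on the Onboard Administrator address.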

MIB in detail

All of them store the working status and levels of each item, which is what a monitoring system needs to build a picture of the status and performance of a blade system:

cpqRackCommonEnclosureCondition (cpqRackCommonEnclosureTable.1.16) stores the status of the whole enclosure: OK (2), degraded (3), failed (4) or other (1).
cpqRackCommonEnclosureManagerCondition (cpqRackCommonEnclosureManagerTable.1.12) stores the status of each manager: OK (2), degraded (3), failed (4) or other (1). cpqRackCommonEnclosureManagerRedundant (cpqRackCommonEnclosureManagerTable.1.11) stores the manager redundancy status: redundant (3), notRedundant (2) or other (1).
cpqRackCommonEnclosureTempCondition (cpqRackCommonEnclosureTempTable.1.8) states the temperature condition of a single sensor: OK (2), degraded (3), failed (4) or other (1). You can get the real temperature value (in Celsius) from cpqRackCommonEnclosureTempCurrent (cpqRackCommonEnclosureTempTable.1.6) and its factory threshold from cpqRackCommonEnclosureTempThreshold (cpqRackCommonEnclosureTempTable.1.7).
cpqRackCommonEnclosureFanCondition (cpqRackCommonEnclosureFanTable.1.11) returns a single fan status: OK (2), degraded (3), failed (4) or other (1). cpqRackCommonEnclosureFanRedundant (cpqRackCommonEnclosureFanTable.1.9) returns whether a fan is in a redundant configuration: redundant (3), notRedundant (2) or other (1).
cpqRackCommonEnclosureFuseCondition (cpqRackCommonEnclosureFuseTable.1.7) stores the condition of a single fuse: OK (2), failed (4) or other (1).
cpqRackPowerEnclosureCondition (cpqRackPowerEnclosureTable.1.9) stores the overall power system status: OK (2), degraded (3) or other (1).
cpqRackPowerSupplyCondition (cpqRackPowerSupplyTable.1.17) returns the working condition of a single power supply: OK (2), degraded (3), failed (4) or other (1). If you like LOTS of details, cpqRackPowerSupplyStatus (cpqRackPowerSupplyTable.1.14) stores the real status of the element:

noError (1)
generalFailure (2)
bistFailure (3)
fanFailure (4)
tempFailure (5)
interlockOpen (6)
epromFailed (7)
vrefFailed (8)
dacFailed (9)
ramTestFailed (10)
voltageChannelFailed (11)
orringdiodeFailed (12)
brownOut (13)
giveupOnStartup (14)
nvramInvalid (15)
calibrationTableInvalid (16)

cpqRackServerBladeStatus (cpqRackServerBladeTable.1.21) returns the status of a single blade: OK (2), degraded (3), failed (4) or other (1). cpqRackServerBladePowered (cpqRackServerBladeTable.1.25) returns the power state of a single blade: on (2), off (3), powerStagedOff (4), rebooting (5) or other (1).
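Since most of the condition columns above share the same four codes, a poller can translate them with a single lookup. The following is only a sketch: the mapping to Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) is an assumption of this example and is not dictated by the MIB, and all function names are hypothetical.

```python
# Condition codes shared by the *Condition columns described above.
CONDITIONS = {1: "other", 2: "ok", 3: "degraded", 4: "failed"}

# Assumption: a conventional mapping to Nagios plugin exit codes.
NAGIOS_STATE = {"ok": 0, "degraded": 1, "failed": 2, "other": 3}

# cpqRackPowerSupplyStatus values, copied from the list above.
POWER_SUPPLY_STATUS = {
    1: "noError", 2: "generalFailure", 3: "bistFailure", 4: "fanFailure",
    5: "tempFailure", 6: "interlockOpen", 7: "epromFailed", 8: "vrefFailed",
    9: "dacFailed", 10: "ramTestFailed", 11: "voltageChannelFailed",
    12: "orringdiodeFailed", 13: "brownOut", 14: "giveupOnStartup",
    15: "nvramInvalid", 16: "calibrationTableInvalid",
}

# Order in which states win when several items are combined:
# CRITICAL first, then WARNING, then UNKNOWN, then OK.
SEVERITY_ORDER = [2, 1, 3, 0]


def condition_to_nagios(value: int) -> int:
    """Translate one condition code; unexpected codes count as 'other'."""
    return NAGIOS_STATE[CONDITIONS.get(value, "other")]


def worst_state(values) -> int:
    """Combine several condition codes into one overall Nagios state."""
    states = {condition_to_nagios(v) for v in values}
    for state in SEVERITY_ORDER:
        if state in states:
            return state
    return 3  # nothing was polled, so the state is unknown


def describe_power_supply(condition: int, status: int) -> str:
    """Build a readable message from a power supply condition and status."""
    cond = CONDITIONS.get(condition, "other")
    detail = POWER_SUPPLY_STATUS.get(status, f"unknown status {status}")
    return f"power supply {cond} ({detail})"


if __name__ == "__main__":
    print(worst_state([2, 2, 3]))
    print(describe_power_supply(4, 4))
```

Feeding worst_state with every value walked from, say, the fan condition column yields a single state for the whole fan subsystem.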

Using traps

Maybe you are an experienced monitoring technician and you would rather not poll data continuously, preferring to manage the bladesystem status based on SNMP traps (the truth is that plotting fan speeds and temperatures is cool, but impractical).

If you choose this approach, make sure you handle at least these traps. All of them are derived from cpqHoGenericTrap (.1.3.6.1.4.1.232.0) defined in CPQHOST-MIB (and inherited by CPQRACK-MIB):

Managers:

cpqRackEnclosureManagerDegraded (cpqHoGenericTrap.22037)
cpqRackEnclosureManagerOk (cpqHoGenericTrap.22038)

Temperatures:

cpqRackEnclosureTempFailed (cpqHoGenericTrap.22005)
cpqRackEnclosureTempDegraded (cpqHoGenericTrap.22006)
cpqRackEnclosureTempOk (cpqHoGenericTrap.22007)

Fans:

cpqRackEnclosureFanFailed (cpqHoGenericTrap.22008)
cpqRackEnclosureFanDegraded (cpqHoGenericTrap.22009)
cpqRackEnclosureFanOk (cpqHoGenericTrap.22010)

Power supplies:

cpqRackPowerSupplyFailed (cpqHoGenericTrap.22013)
cpqRackPowerSupplyDegraded (cpqHoGenericTrap.22014)
cpqRackPowerSupplyOk (cpqHoGenericTrap.22015)

Power system:

cpqRackPowerSubsystemNotRedundant (cpqHoGenericTrap.22018)
cpqRackPowerSubsystemLineVoltageProblem (cpqHoGenericTrap.22019)
cpqRackPowerSubsystemOverloadCondition (cpqHoGenericTrap.22020)

Blades:

cpqRackServerBladeStatusRepaired (cpqHoGenericTrap.22052)
cpqRackServerBladeStatusDegraded (cpqHoGenericTrap.22053)
cpqRackServerBladeStatusCritical (cpqHoGenericTrap.22054)

Network IO subsystem:

cpqRackNetConnectorFailed (cpqHoGenericTrap.22046)
cpqRackNetConnectorDegraded (cpqHoGenericTrap.22047)
cpqRackNetConnectorOk (cpqHoGenericTrap.22048)
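A trap receiver can recognise the traps listed above from the last number of the trap OID. Here is a minimal sketch of that lookup; the severity assigned to each trap is inferred from its name, which is an assumption of this example, and the function names are hypothetical.

```python
# Trap prefix, as given in the text above (cpqHoGenericTrap).
TRAP_BASE = "1.3.6.1.4.1.232.0"

# Trap numbers and names, copied from the lists above.
TRAPS = {
    22005: "cpqRackEnclosureTempFailed",
    22006: "cpqRackEnclosureTempDegraded",
    22007: "cpqRackEnclosureTempOk",
    22008: "cpqRackEnclosureFanFailed",
    22009: "cpqRackEnclosureFanDegraded",
    22010: "cpqRackEnclosureFanOk",
    22013: "cpqRackPowerSupplyFailed",
    22014: "cpqRackPowerSupplyDegraded",
    22015: "cpqRackPowerSupplyOk",
    22018: "cpqRackPowerSubsystemNotRedundant",
    22019: "cpqRackPowerSubsystemLineVoltageProblem",
    22020: "cpqRackPowerSubsystemOverloadCondition",
    22037: "cpqRackEnclosureManagerDegraded",
    22038: "cpqRackEnclosureManagerOk",
    22046: "cpqRackNetConnectorFailed",
    22047: "cpqRackNetConnectorDegraded",
    22048: "cpqRackNetConnectorOk",
    22052: "cpqRackServerBladeStatusRepaired",
    22053: "cpqRackServerBladeStatusDegraded",
    22054: "cpqRackServerBladeStatusCritical",
}


def severity(name: str) -> str:
    """Assumption: derive a severity from the suffix of the trap name."""
    if name.endswith(("Ok", "Repaired")):
        return "ok"
    if name.endswith(("Failed", "Critical")):
        return "critical"
    return "warning"  # Degraded, NotRedundant, voltage and overload traps


def classify(trap_oid: str):
    """Return (name, severity) for a known trap OID, or None otherwise."""
    oid = trap_oid.lstrip(".")
    prefix = TRAP_BASE + "."
    if not oid.startswith(prefix):
        return None
    tail = oid[len(prefix):]
    if not tail.isdigit():
        return None
    name = TRAPS.get(int(tail))
    if name is None:
        return None
    return name, severity(name)


if __name__ == "__main__":
    print(classify(".1.3.6.1.4.1.232.0.22008"))
```

Recovery traps (the Ok and Repaired ones) map to "ok", so the same lookup can be used to clear an alert raised by an earlier trap.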

Getting the MIB itself

You can browse CPQRACK-MIB in different places, but be warned that some of them do not show its latest version: mibdepot, for instance, doesn't cover the all-important cpqRackServerBladeStatus field of the cpqRackServerBladeTable table, which defines the status of a blade. If you need the CPQRACK-MIB MIB itself, you can download it from Plixer.

Monitoring Bladesystem servers in Nagios

If you are a practical guy, or you feel too lazy to program, I recommend using Trond H. Amundsen's check_hp_bladechassis Nagios plugin. It is based on polling the tables described above and it is able to generate performance data.


26 comments:

Ross, 24 November 2012 at 21:59
Hi, I have two important questions. When enabling the SNMP agent on a BladeSystem, is it necessary to enable the SNMP agent on the blade servers too? Which MIB do I have to use for the blade servers? Do both use the same one (CPQRACK-MIB)?
dddd, 19 December 2012 at 10:15
Hello,

I'm trying to monitor power at blade level using SNMP. Is that possible?
I see the OID "cpqRackServerBladeWattageAvg" in the table "cpqRackServerBlade", but when I walk the bladesystems these OIDs are "0".

If not possible using SNMP, what can I use to get the power used by each blade?

Thanks in advance,
Nestor
dddd, 20 December 2012 at 11:45
Hello Vicente,

Thanks for the quick answer. It was helpful and accurate.
To summarize, we can monitor power at blade level by SNMP or RIBCL (prerequisite: ILO enabled in the blades).

1. Is this correct?

I'm interested in monitoring the blades using SNMP. When I SNMP-walk the blade, I don't get any OID from CPQPOWER-MIB or CPQHLTH-MIB, only from CPQRACK-MIB, which makes me think that my blade does not support the power meter.

2. When we SNMP-walk the enclosure IP and the blade IP, we always hit the same MIB: CPQRACK-MIB. Why is this? I would expect to find a different MIB if I walk the blade server.

3. What can we do to enable Power meter (CPQPOWER-MIB or CPQHLTH-MIB) at blade level?

Many Thanks in advance Vicente and great blog about monitoring!
Nestor
dddd, 20 December 2012 at 17:33
Hi Vicente,

great information! I think that we have now clarified well how to monitor power at blade-level!

I will try what you suggest (that will be in January) and I will post an update here so everybody can benefit from this experience.

Thanks a lot!
Nestor
dddd, 25 January 2013 at 16:54
Hello,

An update, as promised.
It seems we have not yet been able to poll power data at blade level,
mainly because finding the right OID to poll is not straightforward.
I will give additional updates if this changes.

However, for Sun blades, we have managed to SNMP-poll a specific OID from the blade MIB and extract power data smoothly.

Thanks for this blog, it is an excellent piece of information!
Nestor
dddd, 5 February 2013 at 12:58
Hi Vicente,

Thanks for your answer, I'm curious about the Perl approach.
How would you use it in a large environment?
- Can we run RIBCL queries from a central management server to get power data across 200+ blades?
- Can we get the output in CSV/XML format so it becomes readable for a management station?

Thanks in advance,
Nestor
GoBa, 19 August 2013 at 22:59
Hello Vicente,

First of all, great post, it helped me to understand alerts integration.

Now, I would like to ask you a question regarding Performance management (such as CPU, RAM, etc), I have read that this can be done through Insight Manager, is there any other way to get this information through SNMP?

Thanks in advance for your help
Chris and Rita, 9 December 2013 at 22:21
What about the SL230 Gen 8 blade server, that does not utilize an OA? Will the XML-based method work on those? If so, what needs to be enabled and checked on the ILOs themselves to make this work?

Would the Agentless Nagios plugin work with the SL230 server? I have a customer asking me this very specific question and they would like to avoid SNMP use in general.
Unknown, 10 November 2015 at 09:37
Hello Vicente,

Thanks for the great article. I have a C7000 blade enclosure.
I want to monitor it through OpManager (software from ManageEngine) using SNMP.
I used the CPQRACK MIB but no monitor is returning any value. I also let OpManager query the OID of the blade enclosure, but it returns a different value (.1.3.6.1.4.1.11.5.7.1.2), which is the OID of the Onboard Administrator, I think.
Any thoughts?
Thanks :)
Unknown, 29 September 2016 at 22:34
I was able to use the MIB from my OpManager to monitor the c7000 blade enclosure. Basically, I tried out the available monitors with the MIB browser and then added the corresponding template. I could only monitor the statuses.
Unknown, 24 October 2016 at 11:56
Hi,
Can we install Ovagent on each blade to monitor the C7000 enclosure?
Tuan Anh, 18 December 2017 at 19:04
Hello Vicente,
I have an HP Blade 6120 XG switch and I want to change its default SNMP community string to a different one. Do I have to change the one in the HP Bladesystem C7000? I did not find a way to change the HP Blade's SNMP community string in the Bladesystem C7000.
Thanks.