...A place for sharing IT monitoring knowledge

Wednesday, 14 November 2012

Monitoring HP bladesystem servers

HP Bladesystem servers are a different breed compared with their siblings from the DL, ML or even BL series: among other things, they are managed not through ILO but through the Onboard Administrator (OA).

ILO supports the great RIBCL protocol, which is by far the best option for monitoring HP servers: it is XML-based and thus easy to parse, and it is native (no need to install SNMP daemons on our servers). Sadly, Onboard Administrator has no equivalent to RIBCL. It offers a telnet/SSH command interpreter, but parsing output from a facility meant for human administrators rather than machines is more than tricky: you can bet that the output format of the command you parse will change in the next firmware revision.

It's true that the blades contained in a bladesystem enclosure, since they are considered servers, support ILO, but the output you get when you submit a RIBCL command does not fully reflect reality: for instance, a single virtual fan is shown to represent all the fans available in the enclosure, and something similar happens with power supplies. What blade servers publish via RIBCL is an abstraction of the enclosure's reality.

SNMP is the answer

So the only option for fine-grained monitoring of the bladesystem is SNMP. HP C3000 and C7000 series bladesystems support the CPQRACK-MIB MIB (1.3.6.1.4.1.232.22), which stores interesting information for monitoring the system health:
  • The enclosure itself, by polling the table cpqRackCommonEnclosureTable (CPQRACK-MIB.2.3.1.1)
  • Enclosure manager information (the Onboard Administrators themselves) is located in the table cpqRackCommonEnclosureManagerTable (CPQRACK-MIB.2.3.1.6)
  • Temperature data can be found in the table cpqRackCommonEnclosureTempTable (CPQRACK-MIB.2.3.1.2)
  • Fan info is located in the table cpqRackCommonEnclosureFanTable (CPQRACK-MIB.2.3.1.3)
  • Fuses are represented in the table cpqRackCommonEnclosureFuseTable (CPQRACK-MIB.2.3.1.4)
  • FRUs (Field Replaceable Units) information is stored in the table cpqRackCommonEnclosureFruTable (CPQRACK-MIB.2.3.1.5)
  • Power systems (global and power supply specific) can be monitored by polling the tables cpqRackPowerEnclosureTable (CPQRACK-MIB.2.3.3.1) and cpqRackPowerSupplyTable (CPQRACK-MIB.2.5.1.1)
  • Blade information is stored in the table cpqRackServerBladeTable (CPQRACK-MIB.2.4.1.1)
  • Finally, network IO subsystems can be polled via the table cpqRackNetConnectorTable (CPQRACK-MIB.2.6.1.1)
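
The relative notation used in the list above can be expanded into full numeric OIDs ready for an SNMP walk. The following is a minimal Python sketch, assuming only the CPQRACK-MIB base OID and the suffixes given in this article; the dictionary and the helper name `table_oid` are illustrative and not part of any HP tool:

```python
# Base OID of CPQRACK-MIB, as given in the article.
CPQRACK_MIB = "1.3.6.1.4.1.232.22"

# Table name -> suffix relative to CPQRACK-MIB (copied from the list above).
CPQRACK_TABLES = {
    "cpqRackCommonEnclosureTable": "2.3.1.1",
    "cpqRackCommonEnclosureTempTable": "2.3.1.2",
    "cpqRackCommonEnclosureFanTable": "2.3.1.3",
    "cpqRackCommonEnclosureFuseTable": "2.3.1.4",
    "cpqRackCommonEnclosureFruTable": "2.3.1.5",
    "cpqRackCommonEnclosureManagerTable": "2.3.1.6",
    "cpqRackPowerEnclosureTable": "2.3.3.1",
    "cpqRackServerBladeTable": "2.4.1.1",
    "cpqRackPowerSupplyTable": "2.5.1.1",
    "cpqRackNetConnectorTable": "2.6.1.1",
}


def table_oid(name):
    """Return the full numeric OID of a CPQRACK-MIB table (hypothetical helper)."""
    return f"{CPQRACK_MIB}.{CPQRACK_TABLES[name]}"


if __name__ == "__main__":
    # Print every table with its full OID, e.g. as input for an SNMP walk.
    for name in CPQRACK_TABLES:
        print(f"{table_oid(name):<32} {name}")
```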

MIB in detail

All of these tables store the working status and levels of each item, which is what a monitoring system needs to build a picture of the status and performance of a blade system:

  • cpqRackCommonEnclosureCondition (cpqRackCommonEnclosureTable.1.16) stores the status of the whole enclosure: OK (2), degraded (3), failed (4) or other (1).
  • cpqRackCommonEnclosureManagerCondition (cpqRackCommonEnclosureManagerTable.1.12) stores the status of each manager: OK (2), degraded (3), failed (4) or other (1). cpqRackCommonEnclosureManagerRedundant (cpqRackCommonEnclosureManagerTable.1.11) stores the manager redundancy status: redundant (3), notRedundant (2) or other(1).
  • cpqRackCommonEnclosureTempCondition (cpqRackCommonEnclosureTempTable.1.8) states the temperature condition of a single sensor: OK (2), degraded (3), failed (4) or other (1). You can get the actual temperature value (in Celsius) from cpqRackCommonEnclosureTempCurrent (cpqRackCommonEnclosureTempTable.1.6) and its factory threshold from cpqRackCommonEnclosureTempThreshold (cpqRackCommonEnclosureTempTable.1.7)
  • cpqRackCommonEnclosureFanCondition (cpqRackCommonEnclosureFanTable.1.11) returns the status of a single fan: OK (2), degraded (3), failed (4) or other (1). cpqRackCommonEnclosureFanRedundant (cpqRackCommonEnclosureFanTable.1.9) returns whether a fan is in a redundant configuration: redundant (3), notRedundant (2) or other (1).
  • cpqRackCommonEnclosureFuseCondition (cpqRackCommonEnclosureFuseTable.1.7) stores the condition of a single fuse: OK (2), failed (4) or other (1).
  • cpqRackPowerEnclosureCondition (cpqRackPowerEnclosureTable.1.9) stores the overall power system status: OK (2), degraded (3) or other (1).
  • cpqRackPowerSupplyCondition (cpqRackPowerSupplyTable.1.17) returns the working condition of a single power supply: OK (2), degraded (3), failed (4) or other (1). If you like LOTS of details, cpqRackPowerSupplyStatus (cpqRackPowerSupplyTable.1.14) stores the real status of the element:
    • noError (1)
    • generalFailure (2)
    • bistFailure (3)
    • fanFailure (4)
    • tempFailure (5)
    • interlockOpen (6)
    • epromFailed (7)
    • vrefFailed (8)
    • dacFailed (9)
    • ramTestFailed (10)
    • voltageChannelFailed (11)
    • orringdiodeFailed (12)
    • brownOut (13)
    • giveupOnStartup (14)
    • nvramInvalid (15)
    • calibrationTableInvalid (16)
  • cpqRackServerBladeStatus (cpqRackServerBladeTable.1.21) returns the status of a single blade: OK (2), degraded (3), failed (4) or other (1). cpqRackServerBladePowered (cpqRackServerBladeTable.1.25) returns the power state of a single blade: on (2), off (3), powerStaggedOff (4), rebooting (5) or other (1).
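
Most of the condition fields above share the same four values, so a poller can decode them with one table and reduce many rows to a single state. The following is a minimal Python sketch of that idea; the mapping to Nagios exit codes and the function names are assumptions made for illustration, not something defined by CPQRACK-MIB:

```python
# Value -> label for the *Condition fields, as listed in the article.
CONDITION = {1: "other", 2: "OK", 3: "degraded", 4: "failed"}

# Assumed mapping to the usual Nagios plugin exit codes:
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
NAGIOS_STATE = {"OK": 0, "degraded": 1, "failed": 2, "other": 3}

# Severity ranking used to pick the worst state:
# CRITICAL > WARNING > UNKNOWN > OK.
_RANK = {2: 3, 1: 2, 3: 1, 0: 0}


def condition_label(value):
    """Decode a polled condition value; unknown values are treated as 'other'."""
    return CONDITION.get(value, "other")


def worst_state(values):
    """Return the most severe Nagios state among the polled condition values.

    An empty poll (no rows returned) is reported as UNKNOWN (3).
    """
    states = [NAGIOS_STATE[condition_label(v)] for v in values]
    if not states:
        return 3
    return max(states, key=_RANK.__getitem__)
```

For example, polling cpqRackCommonEnclosureFanCondition for every fan and passing the values to `worst_state` yields a single exit code for the whole fan subsystem.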

Using traps

Maybe you are an experienced monitoring technician and you prefer not to poll data continuously, managing the bladesystem status through SNMP traps instead (the truth is that plotting fan speeds and temperatures is cool, but impractical).

If you choose this approach, make sure you handle at least these traps. All of them are derived from cpqHoGenericTrap (.1.3.6.1.4.1.232.0) defined in CPQHOST-MIB (and inherited by CPQRACK-MIB):
  • Managers: 
    • cpqRackEnclosureManagerDegraded (cpqHoGenericTrap.22037)
    • cpqRackEnclosureManagerOk (cpqHoGenericTrap.22038)
  • Temperatures: 
    • cpqRackEnclosureTempFailed (cpqHoGenericTrap.22005)
    • cpqRackEnclosureTempDegraded (cpqHoGenericTrap.22006)
    • cpqRackEnclosureTempOk (cpqHoGenericTrap.22007)
  • Fans: 
    • cpqRackEnclosureFanFailed (cpqHoGenericTrap.22008)
    • cpqRackEnclosureFanDegraded (cpqHoGenericTrap.22009)
    • cpqRackEnclosureFanOk (cpqHoGenericTrap.22010)
  • Power supplies:
    • cpqRackPowerSupplyFailed (cpqHoGenericTrap.22013)
    • cpqRackPowerSupplyDegraded (cpqHoGenericTrap.22014)
    • cpqRackPowerSupplyOk (cpqHoGenericTrap.22015)
  • Power system:
    • cpqRackPowerSubsystemNotRedundant (cpqHoGenericTrap.22018)
    • cpqRackPowerSubsystemLineVoltageProblem (cpqHoGenericTrap.22019)
    • cpqRackPowerSubsystemOverloadCondition (cpqHoGenericTrap.22020)
  • Blades:
    • cpqRackServerBladeStatusRepaired (cpqHoGenericTrap.22052)
    • cpqRackServerBladeStatusDegraded (cpqHoGenericTrap.22053)
    • cpqRackServerBladeStatusCritical (cpqHoGenericTrap.22054)
  • Network IO subsystem:
    • cpqRackNetConnectorFailed (cpqHoGenericTrap.22046)
    • cpqRackNetConnectorDegraded (cpqHoGenericTrap.22047)
    • cpqRackNetConnectorOk (cpqHoGenericTrap.22048)
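
A trap handler can turn the list above into a lookup table keyed by the last sub-identifier of the trap OID. The following is a minimal Python sketch; the trap names and numbers come from this article, while the severity labels are an assumption derived from the name suffixes and the function names are hypothetical:

```python
# Specific trap number -> trap name, copied from the list above.
TRAPS = {
    22037: "cpqRackEnclosureManagerDegraded",
    22038: "cpqRackEnclosureManagerOk",
    22005: "cpqRackEnclosureTempFailed",
    22006: "cpqRackEnclosureTempDegraded",
    22007: "cpqRackEnclosureTempOk",
    22008: "cpqRackEnclosureFanFailed",
    22009: "cpqRackEnclosureFanDegraded",
    22010: "cpqRackEnclosureFanOk",
    22013: "cpqRackPowerSupplyFailed",
    22014: "cpqRackPowerSupplyDegraded",
    22015: "cpqRackPowerSupplyOk",
    22018: "cpqRackPowerSubsystemNotRedundant",
    22019: "cpqRackPowerSubsystemLineVoltageProblem",
    22020: "cpqRackPowerSubsystemOverloadCondition",
    22052: "cpqRackServerBladeStatusRepaired",
    22053: "cpqRackServerBladeStatusDegraded",
    22054: "cpqRackServerBladeStatusCritical",
    22046: "cpqRackNetConnectorFailed",
    22047: "cpqRackNetConnectorDegraded",
    22048: "cpqRackNetConnectorOk",
}


def severity(name):
    """Guess a severity from the trap name suffix (an assumption, not MIB data)."""
    if name.endswith(("Ok", "Repaired")):
        return "ok"
    if name.endswith(("Degraded", "NotRedundant")):
        return "warning"
    return "critical"


def classify(trap_oid):
    """Map a trap OID such as '.1.3.6.1.4.1.232.0.22008' to (name, severity).

    Returns None for traps that are not in the table.
    """
    specific = int(trap_oid.rstrip(".").rsplit(".", 1)[-1])
    name = TRAPS.get(specific)
    if name is None:
        return None
    return name, severity(name)
```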


Getting the MIB itself

You can browse CPQRACK-MIB in different places, but be warned that some of them do not show its latest version: mibdepot, for instance, doesn't cover the all-important cpqRackServerBladeStatus field of the cpqRackServerBladeTable table, which defines the status of a blade. If you need the CPQRACK-MIB MIB itself, you can download it from Plixer.


Monitoring Bladesystem servers in Nagios

If you are a practical guy or you feel too lazy to program, I recommend using Trond H. Amundsen's check_hp_bladechassis Nagios plugin. It is based on polling the tables described above and it is able to generate performance data.





26 comments:

  1. Hi, I have two important questions. When enabling the SNMP agent on a BladeSystem, is it necessary to enable the SNMP agent on the blade servers too? Which MIB do I have to use for the blade servers? Do both use the same one (CPQRACK-MIB)?

    1. Hi Ross:

      Is it necessary to enable SNMP on the blade servers when enabling it on the bladesystem? No. You can enable SNMP on the bladesystem without enabling it on the blade servers, you can enable SNMP on the servers without enabling it on the bladesystem, and you can enable it on the bladesystem and on some of the blade servers.

      Which MIB do you have to use for the blade servers? It depends on the ILO version, but you can bet that it supports some of these: cpqhost-mib, cpqfca.mib, cpqscsi.mib, cpqide.mib, cpqnic.mib, cpqsm2-mib.

      Do both the bladesystem and the blade servers support CPQRACK-MIB? No, CPQRACK-MIB is only supported by the blade servers.

      Hope it will help you.

    2. With regards to cpqrack not being available on BladeSystem: Does this mean that, if I want to know cpqRackCommonEnclosureCondition of the c7000 in question, I direct the SNMP requests to one of the servers in this enclosure?

      I have tried requesting the OID (1.3.6.1.4.1.232.22.2.3.1.1.1.16) from the iLO IP of the primary OA, the iLO IP of one of the servers, and the IP of one of the NICs available from Windows on the server, but I am not able to get anything: the OA and the NIC available from within Windows on the server give me a "No such name" error, whereas the iLO IP of the server gives me a "No response (check: firewalls, routing, snmp settings of device, IPs, SNMP version, community, passwords etc)".

      What am I doing wrong?

    3. So, this works! 1.3.6.1.4.1.232.22.2.3.1.1.1.16.1 (Note the extra .1 at the end)

    4. Hi Trostein. In fact the OID you've tried to poll (cpqRackCommonEnclosureCondition, or 1.3.6.1.4.1.232.22.2.3.1.1.1.16) is a field of the table cpqRackCommonEnclosureTable (or .1.3.6.1.4.1.232.22.2.3.1.1).

      As with any SNMP table field, you must specify which table row you want to poll. You succeeded with the OID 1.3.6.1.4.1.232.22.2.3.1.1.1.16.1 because it specifies the first table row (the trailing .1).
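
The relationship between a table column and one of its rows can be sketched in a few lines of Python. This is only an illustration of the OID arithmetic discussed in this thread; the helper names `instance_oid` and `split_instance` are hypothetical:

```python
# Column OID of cpqRackCommonEnclosureCondition, as quoted in this thread.
ENCLOSURE_CONDITION = "1.3.6.1.4.1.232.22.2.3.1.1.1.16"


def instance_oid(column_oid, *index):
    """Append a row index to a column OID to get a pollable instance OID."""
    parts = [column_oid.strip(".")] + [str(i) for i in index]
    return ".".join(parts)


def split_instance(oid, column_oid):
    """Return the row index of an instance OID as a tuple of ints.

    Returns None when the OID does not belong to the given column
    (including the bare column OID itself, which has no index).
    """
    prefix = column_oid.strip(".") + "."
    oid = oid.strip(".")
    if not oid.startswith(prefix):
        return None
    return tuple(int(part) for part in oid[len(prefix):].split("."))
```

Here `instance_oid(ENCLOSURE_CONDITION, 1)` produces the OID that worked above, while the bare column OID has no row index and so cannot be fetched with a plain GET.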

  2. Hello,

    I'm trying to monitor power at blade level using SNMP. Is that possible?
    I see the OID "cpqRackServerBladeWattageAvg" in the table "cpqRackServerBlade", but when I walk the bladesystems these OIDs are "0".

    If not possible using SNMP, what can I use to get the power used by each blade?

    Thanks in advance,
    Nestor

    1. Hi Nestor: From what you say, it seems that CPQRACK-MIB defines blade-level power readings but the blades themselves don't report them. However, you can try to get these values not from the bladesystem but from the blades themselves, since they support ILO like a usual tower or rack server.

      Provided that ILO is enabled in a given blade, average, maximum and minimum power readings can be retrieved via RIBCL by sending a GET_POWER_READINGS query. If RIBCL is not an option for you, try SNMP-walking CPQPOWER-MIB (1.3.6.1.4.1.232.165) or getting cpqHePowerMeterCurrReading (1.3.6.1.4.1.232.6.2.15.3) to read the current power consumption (in watts), if your blade supports Power Meter.

      If the info is not available via SNMP, I encourage you to wait until Feb 2013, since I plan to publish an article covering how to monitor HP devices using RIBCL.

      Hope it was helpful. If true please give us feedback in order to help other people.

  3. Hello Vicente,

    thanks for the quick answer. It was helpful and accurate.
    To summarize, we can monitor Power at Blade level by SNMP or RIBCL (pre-requisite: ILO enabled in the blades)

    1. Is this correct?

    I'm interested in monitoring the blades using SNMP. When I SNMP-walk the blade, I don't get any OID from CPQPOWER-MIB or CPQHLTH-MIB, only from CPQRACK-MIB, which makes me think that my blade does not support Power Meter.

    2. When we SNMP-walk the enclosure IP and the blade IP, we always hit the same MIB: CPQRACK-MIB. Why is this? I would expect to find a different MIB if I walk the blade server.

    3. What can we do to enable Power meter (CPQPOWER-MIB or CPQHLTH-MIB) at blade level?

    Many Thanks in advance Vicente and great blog about monitoring!
    Nestor

    1. Hi Nestor. I answer your questions:

      1.- Yes, for both SNMP and RIBCL you must enable ILO at blade level.
      2.- Power Meter must be activated by installing an optional license key. You can check whether you have it by opening the blade's ILO web administration interface and browsing to "/Power Management/Power Meter" (ILO2).

      The good news is that using RIBCL you can get the info you need even without that license. To test whether your system supports it, open an SSH session to the Bladesystem Onboard Administrator CLI (use credentials with access granted to both the bladesystem and the blade ILOs) and run this command:

      HPONCFG ALL << .
      <RIBCL VERSION="2.21">
      <LOGIN USER_LOGIN="" PASSWORD="">
      <SERVER_INFO MODE="read">
      <GET_POWER_READINGS/>
      </SERVER_INFO>
      </LOGIN>
      </RIBCL>
      .

      What you're doing is forwarding the GET_POWER_READINGS RIBCL command to ALL blade ILOs (if you want to send it to just one ILO, replace the 'ALL' argument with the slot number where the blade is installed). If you succeed, you'll get a RIBCL response containing the power being consumed by the blade, the most significant line being:

      <PRESENT_POWER_READING VALUE="144" UNIT="Watts"/>
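
Extracting that value from the captured reply can be sketched in Python as follows. This assumes the reply text may not be a single well-formed XML document (for instance when several blades answer), so it locates each PRESENT_POWER_READING element first and parses only that element; the function name `present_power_readings` is hypothetical:

```python
import re
import xml.etree.ElementTree as ET

# Matches the self-closing element shown above, wherever it appears in the text.
_READING = re.compile(r"<PRESENT_POWER_READING\b[^>]*/>")


def present_power_readings(response):
    """Return one (value, unit) tuple per PRESENT_POWER_READING element found."""
    readings = []
    for match in _READING.finditer(response):
        # Each match is a complete element, so it can be parsed on its own.
        element = ET.fromstring(match.group(0))
        readings.append((int(element.get("VALUE")), element.get("UNIT")))
    return readings


if __name__ == "__main__":
    sample = '<PRESENT_POWER_READING VALUE="144" UNIT="Watts"/>'
    print(present_power_readings(sample))
```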

      I encourage you to test it and, if you succeed, stay tuned to the blog, since I'll cover RIBCL in a new article scheduled for February 2013.

  4. Hi Vicente,

    great information! I think that we have now clarified well how to monitor power at blade-level!

    I will try what you suggest (that will be in January) and I will post an update here so everybody can benefit from this experience.

    Thanks a lot!
    Nestor

  5. Hello,

    update as promised.
    It seems we have not yet been able to poll power data at blade level,
    mainly because finding the right OID to poll is not straightforward.
    I will give additional updates if this changes.

    However, for Sun blades, we have managed to SNMP-poll a specific OID from the blade MIB and extract power data smoothly.

    Thanks for this blog, it is an excellent piece of information!
    Nestor

    1. Hi Nestor. Be confident that, at the very least, you'll be able to get it via RIBCL.

      I have an alpha Perl library for that; please email me if you don't succeed with the SNMP way and want to test via RIBCL.

  6. Hi Vicente,

    thanks for your answer, I'm curious about the Perl approach.
    How would you use it in a large environment:
    - can we run RIBCL query from a central management server to get power data across 200+ blades?
    - can we get the output in a CSV/XML format so it becomes readable for a management station?

    Thanks in advance,
    Nestor

    1. Hi Nestor. If you're monitoring with a Nagios Core based solution, the best option in my view is deploying a Perl plugin that embeds all the RIBCL stuff in order to check power. That is the straightforward solution.

      If you're using another kind of monitoring solution, dumping the RIBCL responses (XML) to a file to be processed later seems a neat solution too.
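
For the CSV output asked about above, a collector could flatten the per-blade readings once they have been extracted from the RIBCL responses. The following is a minimal Python sketch, assuming the readings have already been gathered into a slot-to-watts mapping; the column names and the function name `readings_to_csv` are illustrative choices:

```python
import csv
import io


def readings_to_csv(readings):
    """Render a {slot number: watts} mapping as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["slot", "watts"])
    # Sort by slot so the output is stable between runs.
    for slot in sorted(readings):
        writer.writerow([slot, readings[slot]])
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    print(readings_to_csv({2: 151, 1: 144}), end="")
```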

      Anyway, please send me a private message (my address is in my profile) so that we don't go off-topic here.

  7. Hello Vicente,

    First of all, great post, it helped me to understand alerts integration.

    Now, I would like to ask you a question regarding Performance management (such as CPU, RAM, etc), I have read that this can be done through Insight Manager, is there any other way to get this information through SNMP?

    Thanks in advance for your help

    1. Hi Alebal:

      Some holidays here, sorry for the late answer. In my view, CPU and memory levels (and hence their monitoring) are the business not of the server management appliance (in this case the Onboard Administrator) but of the operating system.

      One clear example is the way Linux/MacOS and Windows each define and manage the CPU metric: on Linux/MacOS it represents the number of CPUs that are being used (or would be, if they existed), while on Windows it represents the percentage of time a CPU is in use. Similar examples could be given for memory usage.

      At the operating system layer there are different ways of getting CPU and memory usage, but to answer just the SNMP part: you can get it by installing Net-SNMP and running snmpd on Linux/MacOS platforms. On Windows platforms you can rely on Net-SNMP or, better, on the native SNMP agent service (read http://support.microsoft.com/kb/324263 for more info). In all cases the CPU and memory info is available via the HOST-RESOURCES-MIB.
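
Once the HOST-RESOURCES-MIB values have been polled, turning them into usage figures is simple arithmetic. The following is a minimal Python sketch, assuming the values were already fetched by whatever SNMP client you use; the column OIDs are those defined in HOST-RESOURCES-MIB (RFC 2790), while the function names are hypothetical:

```python
# Column OIDs from HOST-RESOURCES-MIB (RFC 2790).
HR_STORAGE_ALLOCATION_UNITS = "1.3.6.1.2.1.25.2.3.1.4"
HR_STORAGE_SIZE = "1.3.6.1.2.1.25.2.3.1.5"
HR_STORAGE_USED = "1.3.6.1.2.1.25.2.3.1.6"
HR_PROCESSOR_LOAD = "1.3.6.1.2.1.25.3.3.1.2"


def storage_used_bytes(allocation_units, used):
    """hrStorageUsed is counted in allocation units, so multiply to get bytes."""
    return allocation_units * used


def storage_used_percent(size, used):
    """Percentage of a storage entry (RAM, swap, disk) in use, or None if empty."""
    if size <= 0:
        return None
    return 100.0 * used / size


def average_cpu_load(loads):
    """Average of the hrProcessorLoad values (one per processor), or None."""
    if not loads:
        return None
    return sum(loads) / len(loads)
```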

      I hope this answer is useful to you; in fact, given its scope, it is material for a couple of blog articles. Finally, thanks for your feedback :)

  9. What about the SL230 Gen 8 blade server, that does not utilize an OA? Will the XML-based method work on those? If so, what needs to be enabled and checked on the ILOs themselves to make this work?

    Would the Agentless Nagios plugin work with the SL230 server? I have a customer asking me this very specific question and they would like to avoid SNMP use in general.

    1. Hi Chris and Rita:

      I have no experience with this device; however, if it supports ILO (and so it seems), you can use a RIBCL approach (what you refer to as XML).

      To use RIBCL you just need to create a user in the ILO (a read-only user will work) and use a RIBCL-based Nagios plugin like this one: http://exchange.nagios.org/directory/Plugins/Hardware/Server-Hardware/HP-(Compaq)/check_ilo2_health/details

      Please, share your experience.

  10. Hello Vicente,

    Thanks for the great article. I have a blade enclosure C7000
    I want to monitor it through opmanager (a software from manage engine) using SNMP
    I used the MIB CPQRACK but no monitor is returning any value ... I also let the opmanager query the OID of the blade encl but it returns a different value (.1.3.6.1.4.1.11.5.7.1.2) which is the OID of the on-board admin I think.
    Any thoughts ??
    Thanks :)

  11. I was able to use the MIB from my OpManager to monitor the c7000 blade. Basically, I tried out the available monitors with the MIB browser and then added the corresponding template. I could only monitor the statuses.

  12. Hi,
    Can we install Ovagent on each blade to monitor the C7000 enclosure?

  13. Hello Vicente,
    I have an HP Blade 6120 XG switch and I want to change its default SNMP community string to another one. Do I have to change the one in the HP Bladesystem C7000? I could not find a way to change the HP Blade's SNMP community string in the Bladesystem C7000.
    Thanks.

    1. Hi Tuan:

      I think you shouldn't look for it in the switch but in the blade system's Onboard Administrator. Please check this URL: http://bladesystem.helpmax.net/en/configuring-the-hp-bladesystem-c7000-enclosure-and-enclosure-devices/enclosure-settings/snmp-settings/

      In any case I'm not an expert in administration, so my starting point is always a system already properly configured for monitoring. In that sense, help from any other reader is welcome.
