Tuesday, August 23, 2011

Monitoring CPU load with SNMP

Or: Nothing Is Ever Easy.

We have a small farm of Citrix servers. They run a particular app for about 130 users. After a recent upgrade, we are beginning to suspect that the new version of the app is putting more load on the CPUs. Alas, we have no historical data to refer to... but doesn't it sound like the kind of thing that some Cacti graphs would be perfect for? For example, in this next graph of our network traffic, you can see a sudden jump in outbound traffic at the end of March - that's when the Bacula backup system went live:

So for spotting trends, and detecting changes, this kind of graph is invaluable.

Since we already have Cacti, why not poll the Citrix servers for CPU load, and graph that too... maybe also memory use... all sounds good, right? All you have to do is enable SNMP on the Windows host, figure out the OIDs of each CPU (each core counts as a CPU) and hey presto, graphs! As we'll see, it's not that easy.

OK, first step: enable SNMP on your Windows host - go to Start -> Control Panel -> Add/Remove Programs -> Windows Components -> Management and Monitoring Tools and make sure "Simple Network Management Protocol" is selected:

Next step: go to Services, scroll down to SNMP Service, right-click it and select "Properties".

Now click the Agent tab, and in the Service section, enable Physical:

Without Physical selected, the SNMP service will not report on physical hardware components such as the CPUs or memory. Strangely enough, it will report on physical hardware such as disks. I'm still scratching my head over that one.

OK, so now we should be able to get some info on the CPU load. The MIB that deals with this info is the HOST-RESOURCES-MIB and the interesting bits relating to CPU load are in the hrProcessor table. Let's take a walk... an snmp-walk:

pyarra@iceberg:~$ snmpwalk -v1 -cpublic windows-server
HOST-RESOURCES-MIB::hrProcessorFrwID.12 = OID: SNMPv2-SMI::zeroDotZero
HOST-RESOURCES-MIB::hrProcessorFrwID.13 = OID: SNMPv2-SMI::zeroDotZero
HOST-RESOURCES-MIB::hrProcessorFrwID.14 = OID: SNMPv2-SMI::zeroDotZero
HOST-RESOURCES-MIB::hrProcessorFrwID.15 = OID: SNMPv2-SMI::zeroDotZero
HOST-RESOURCES-MIB::hrProcessorLoad.12 = INTEGER: 30
HOST-RESOURCES-MIB::hrProcessorLoad.13 = INTEGER: 15
HOST-RESOURCES-MIB::hrProcessorLoad.14 = INTEGER: 20
HOST-RESOURCES-MIB::hrProcessorLoad.15 = INTEGER: 35

Cool! We're all done, right? Well... no. You see, as the Net-snmp doco makes clear, the indexes into this table (12,13,14,15 in this example) are the device IDs of the CPUs. As such, this table is sparse - there won't be entries for other device IDs. That's okay, so long as we know the OIDs we want, no dramas. Sort of... except that when you look at how Windows enumerates those device IDs, you'll find that printers get enumerated before CPUs:

pyarra@iceberg:~$ snmpwalk -v1 -cpublic em-fap HOST-RESOURCES-MIB::hrDeviceDescr
HOST-RESOURCES-MIB::hrDeviceDescr.1 = STRING: Microsoft XPS Document Writer
HOST-RESOURCES-MIB::hrDeviceDescr.2 = STRING: Xerox Phaser 8560DT
HOST-RESOURCES-MIB::hrDeviceDescr.3 = STRING: FX DocuPrint C2100 PCL 6
HOST-RESOURCES-MIB::hrDeviceDescr.4 = STRING: HP Universal Printing PCL 6
HOST-RESOURCES-MIB::hrDeviceDescr.5 = STRING: HP Universal Printing PCL 6
HOST-RESOURCES-MIB::hrDeviceDescr.6 = STRING: Dell Laser Printer 1720dn
HOST-RESOURCES-MIB::hrDeviceDescr.7 = STRING: FX DocuPrint C2100 PCL 6
HOST-RESOURCES-MIB::hrDeviceDescr.8 = STRING: HP LaserJet 4
HOST-RESOURCES-MIB::hrDeviceDescr.9 = STRING: HP Universal Printing PCL 6
HOST-RESOURCES-MIB::hrDeviceDescr.10 = STRING: HP Universal Printing PCL 6
HOST-RESOURCES-MIB::hrDeviceDescr.11 = STRING: HP Universal Printing PCL 6
HOST-RESOURCES-MIB::hrDeviceDescr.12 = STRING: Intel
HOST-RESOURCES-MIB::hrDeviceDescr.13 = STRING: Intel
HOST-RESOURCES-MIB::hrDeviceDescr.14 = STRING: Intel
HOST-RESOURCES-MIB::hrDeviceDescr.15 = STRING: Intel

On reboot, if you have added a printer... yes, the device IDs get re-enumerated, and suddenly your CPU device IDs get changed. Some brave souls have attempted to find ways to deal with this. Me, I dunno if it's worth it.
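
One way to sidestep the moving indexes is to never hard-code them: walk hrProcessorLoad, take whatever rows come back, and average them. What follows is only a rough Python sketch of that idea, not what those brave souls did. The function names are made up for illustration, and the wrapper assumes net-snmp's snmpwalk is installed and on the PATH.

```python
import re
import subprocess

# Matches lines such as "HOST-RESOURCES-MIB::hrProcessorLoad.12 = INTEGER: 30"
LOAD_LINE = re.compile(r"hrProcessorLoad\.(\d+)\s*=\s*INTEGER:\s*(\d+)")


def parse_processor_load(walk_output):
    """Return {device_index: load_percent} for every hrProcessorLoad row found."""
    return {int(idx): int(load) for idx, load in LOAD_LINE.findall(walk_output)}


def average_load(loads):
    """Mean load across all CPUs, or None if the walk returned no rows."""
    if not loads:
        return None
    return sum(loads.values()) / len(loads)


def walk_processor_load(host, community="public"):
    """Hypothetical wrapper: run snmpwalk and parse whatever indexes exist today."""
    result = subprocess.run(
        ["snmpwalk", "-v1", "-c", community, host,
         "HOST-RESOURCES-MIB::hrProcessorLoad"],
        capture_output=True, text=True, check=True,
    )
    return parse_processor_load(result.stdout)


if __name__ == "__main__":
    # Sample rows copied from the walk earlier in this post.
    sample = """\
HOST-RESOURCES-MIB::hrProcessorLoad.12 = INTEGER: 30
HOST-RESOURCES-MIB::hrProcessorLoad.13 = INTEGER: 15
HOST-RESOURCES-MIB::hrProcessorLoad.14 = INTEGER: 20
HOST-RESOURCES-MIB::hrProcessorLoad.15 = INTEGER: 35
"""
    loads = parse_processor_load(sample)
    print(loads, average_load(loads))
```

Because the parser keys on the column name and not on the index, the average comes out the same whether the CPUs sit at 12-15 or get bumped to 13-16 by a new printer. The catch is that you get one number per host, not one per core.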

For the record, Linux doesn't re-number the CPU OIDs on a reboot, at least, not when I added a network interface. It seems like the CPU OIDs start at 768 regardless of what else there is... I really am curious about why they start there. I'm sure it's some sort of fixed offset to avoid OID renumbering issues, but why 768? I would have expected a power of 2. Maybe I think too much like a programmer.

Anyhow, back to Windows: I also have my doubts about how useful the moment-to-moment CPU load will be, since sampling it at one- or five-minute intervals doesn't give us a great overall picture of CPU load. What we really need is something more like the Unix load average figures. And lo and behold, UCD MIBs implement this in the laTable in a range of helpful ways:

pyarra@verbena:~$ snmpwalk -v1 -cpublic localhost .
UCD-SNMP-MIB::laIndex.1 = INTEGER: 1
UCD-SNMP-MIB::laIndex.2 = INTEGER: 2
UCD-SNMP-MIB::laIndex.3 = INTEGER: 3
UCD-SNMP-MIB::laNames.1 = STRING: Load-1
UCD-SNMP-MIB::laNames.2 = STRING: Load-5
UCD-SNMP-MIB::laNames.3 = STRING: Load-15
UCD-SNMP-MIB::laLoad.1 = STRING: 1.19
UCD-SNMP-MIB::laLoad.2 = STRING: 1.12
UCD-SNMP-MIB::laLoad.3 = STRING: 0.94
UCD-SNMP-MIB::laConfig.1 = STRING: 12.00
UCD-SNMP-MIB::laConfig.2 = STRING: 14.00
UCD-SNMP-MIB::laConfig.3 = STRING: 14.00
UCD-SNMP-MIB::laLoadInt.1 = INTEGER: 118
UCD-SNMP-MIB::laLoadInt.2 = INTEGER: 112
UCD-SNMP-MIB::laLoadInt.3 = INTEGER: 93
UCD-SNMP-MIB::laLoadFloat.1 = Opaque: Float: 1.190000
UCD-SNMP-MIB::laLoadFloat.2 = Opaque: Float: 1.120000
UCD-SNMP-MIB::laLoadFloat.3 = Opaque: Float: 0.940000
UCD-SNMP-MIB::laErrorFlag.1 = INTEGER: noError(0)
UCD-SNMP-MIB::laErrorFlag.2 = INTEGER: noError(0)
UCD-SNMP-MIB::laErrorFlag.3 = INTEGER: noError(0)
UCD-SNMP-MIB::laErrMessage.1 = STRING: 
UCD-SNMP-MIB::laErrMessage.2 = STRING: 
UCD-SNMP-MIB::laErrMessage.3 = STRING: 

So now, I'm beginning to think that net-snmp on Windows might be the way to go. Time to create some VM images and start experimenting!
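
If the laTable does turn out to be usable, pulling the three load averages out of a walk like the one above is straightforward. Here is a minimal Python sketch, assuming the output looks like that listing (the function name is hypothetical). It pairs each laNames row with the laLoad row of the same index and ignores the laLoadInt and laLoadFloat columns.

```python
import re

# "UCD-SNMP-MIB::laNames.1 = STRING: Load-1" (quotes around the value are optional)
NAME_LINE = re.compile(r'laNames\.(\d+)\s*=\s*STRING:\s*"?([^"\s]+)"?')
# "UCD-SNMP-MIB::laLoad.1 = STRING: 1.19"; "laLoad\." cannot match laLoadInt/laLoadFloat
LOAD_LINE = re.compile(r'laLoad\.(\d+)\s*=\s*STRING:\s*"?([0-9.]+)"?')


def parse_load_averages(walk_output):
    """Return {name: load} from laTable walk output, e.g. {"Load-1": 1.19}."""
    names = dict(NAME_LINE.findall(walk_output))
    loads = dict(LOAD_LINE.findall(walk_output))
    # Fall back to the bare index if a laNames row is missing.
    return {names.get(idx, idx): float(value) for idx, value in loads.items()}


if __name__ == "__main__":
    # Sample rows copied from the walk earlier in this post.
    sample = """\
UCD-SNMP-MIB::laNames.1 = STRING: Load-1
UCD-SNMP-MIB::laNames.2 = STRING: Load-5
UCD-SNMP-MIB::laNames.3 = STRING: Load-15
UCD-SNMP-MIB::laLoad.1 = STRING: 1.19
UCD-SNMP-MIB::laLoad.2 = STRING: 1.12
UCD-SNMP-MIB::laLoad.3 = STRING: 0.94
UCD-SNMP-MIB::laLoadInt.1 = INTEGER: 118
"""
    print(parse_load_averages(sample))
```

A walk that returns no laLoad rows gives back an empty dict, so a caller can tell "no data" apart from "load is zero".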

1 comment:

  1. net-snmp on Windows 2003 server returned empty laTable entries - I don't know if it works on Windows properly. I sure couldn't get it going.