Tuesday, August 23, 2011

Monitoring CPU load with SNMP

Or: Nothing Is Ever Easy.

We have a small farm of Citrix servers. They run a particular app for about 130 users. After a recent upgrade, we are beginning to suspect that the new version is putting more load on the CPUs. Alas, we have no historical data to refer to... but doesn't it sound like the kind of thing that some Cacti graphs would be perfect for? For example, in this next graph of our network traffic, you can see a sudden jump in outbound network traffic at the end of March - that's when the Bacula backup system went live:

So for spotting trends, and detecting changes, this kind of graph is invaluable.

Since we already have Cacti, why not poll the Citrix servers for CPU load, and graph that too... maybe also memory use... all sounds good, right? All you have to do is enable SNMP on the Windows host, figure out the OIDs of each CPU (each core counts as a CPU) and hey presto, graphs! As we'll see, it's not that easy.

OK, first step: enable SNMP on your Windows host - go to Start -> Control Panel -> Add/Remove Programs -> Windows Components -> Management and Monitoring Tools and make sure "Simple Network Management Protocol" is selected:

Next step: go to Services, scroll down to SNMP Service, right-click and select "Properties".

Now click the Agent tab, and in the Service section, enable Physical:

Without Physical selected, the SNMP service will not report on physical hardware components such as the CPUs or memory. Strangely enough, it will report on physical hardware such as disks. I'm still scratching my head over that one.

OK, so now we should be able to get some info on the CPU load. The MIB that deals with this info is the HOST-RESOURCES-MIB and the interesting bits relating to CPU load are in the hrProcessor table. Let's take a walk... an snmp-walk:

pyarra@iceberg:~$ snmpwalk -v1 -cpublic windows-server 1.3.6.1.2.1.25.3.3
HOST-RESOURCES-MIB::hrProcessorFrwID.12 = OID: SNMPv2-SMI::zeroDotZero
HOST-RESOURCES-MIB::hrProcessorFrwID.13 = OID: SNMPv2-SMI::zeroDotZero
HOST-RESOURCES-MIB::hrProcessorFrwID.14 = OID: SNMPv2-SMI::zeroDotZero
HOST-RESOURCES-MIB::hrProcessorFrwID.15 = OID: SNMPv2-SMI::zeroDotZero
HOST-RESOURCES-MIB::hrProcessorLoad.12 = INTEGER: 30
HOST-RESOURCES-MIB::hrProcessorLoad.13 = INTEGER: 15
HOST-RESOURCES-MIB::hrProcessorLoad.14 = INTEGER: 20
HOST-RESOURCES-MIB::hrProcessorLoad.15 = INTEGER: 35

Cool! We're all done, right? Well... no. You see, as the Net-snmp doco makes clear, the indexes into this table (12,13,14,15 in this example) are the device IDs of the CPUs. As such, this table is sparse - there won't be entries for other device IDs. That's okay, so long as we know the OIDs we want, no dramas. Sort of... except that when you look at how Windows enumerates those device IDs, you'll find that printers get enumerated before CPUs:

pyarra@iceberg:~$ snmpwalk -v1 -cpublic em-fap HOST-RESOURCES-MIB::hrDeviceDescr
HOST-RESOURCES-MIB::hrDeviceDescr.1 = STRING: Microsoft XPS Document Writer
HOST-RESOURCES-MIB::hrDeviceDescr.2 = STRING: Xerox Phaser 8560DT
HOST-RESOURCES-MIB::hrDeviceDescr.3 = STRING: FX DocuPrint C2100 PCL 6
HOST-RESOURCES-MIB::hrDeviceDescr.4 = STRING: HP Universal Printing PCL 6
HOST-RESOURCES-MIB::hrDeviceDescr.5 = STRING: HP Universal Printing PCL 6
HOST-RESOURCES-MIB::hrDeviceDescr.6 = STRING: Dell Laser Printer 1720dn
HOST-RESOURCES-MIB::hrDeviceDescr.7 = STRING: FX DocuPrint C2100 PCL 6
HOST-RESOURCES-MIB::hrDeviceDescr.8 = STRING: HP LaserJet 4
HOST-RESOURCES-MIB::hrDeviceDescr.9 = STRING: HP Universal Printing PCL 6
HOST-RESOURCES-MIB::hrDeviceDescr.10 = STRING: HP Universal Printing PCL 6
HOST-RESOURCES-MIB::hrDeviceDescr.11 = STRING: HP Universal Printing PCL 6
HOST-RESOURCES-MIB::hrDeviceDescr.12 = STRING: Intel
HOST-RESOURCES-MIB::hrDeviceDescr.13 = STRING: Intel
HOST-RESOURCES-MIB::hrDeviceDescr.14 = STRING: Intel
HOST-RESOURCES-MIB::hrDeviceDescr.15 = STRING: Intel

On reboot, if you have added a printer... yes, the device IDs get re-enumerated, and suddenly your CPU device IDs get changed. Some brave souls have attempted to find ways to deal with this. Me, I dunno if it's worth it.
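
One way to sidestep the renumbering, sketched here rather than tested against a real Citrix box, is to never hard-code the indexes at all: walk hrProcessorLoad on every poll and aggregate whatever rows come back. The function names below are made up for illustration, and the sample text is just the walk output from earlier in this post.

```python
import re

# Matches lines such as: HOST-RESOURCES-MIB::hrProcessorLoad.12 = INTEGER: 30
LOAD_LINE = re.compile(r"hrProcessorLoad\.(\d+) = INTEGER: (\d+)")


def parse_processor_load(walk_output):
    """Return {device_index: load_percent} from snmpwalk text output."""
    loads = {}
    for line in walk_output.splitlines():
        match = LOAD_LINE.search(line)
        if match:
            loads[int(match.group(1))] = int(match.group(2))
    return loads


def average_load(loads):
    """Average load across all CPUs, whatever their index values are."""
    if not loads:
        raise ValueError("no hrProcessorLoad rows found")
    return sum(loads.values()) / len(loads)


# Sample input copied from the snmpwalk output shown above.
SAMPLE = """\
HOST-RESOURCES-MIB::hrProcessorFrwID.12 = OID: SNMPv2-SMI::zeroDotZero
HOST-RESOURCES-MIB::hrProcessorLoad.12 = INTEGER: 30
HOST-RESOURCES-MIB::hrProcessorLoad.13 = INTEGER: 15
HOST-RESOURCES-MIB::hrProcessorLoad.14 = INTEGER: 20
HOST-RESOURCES-MIB::hrProcessorLoad.15 = INTEGER: 35
"""

if __name__ == "__main__":
    loads = parse_processor_load(SAMPLE)
    print(loads, average_load(loads))
```

This gives one whole-host figure per poll, which survives a printer being added. Per-CPU graphs would still need some stable ordering, for example sorting the indexes and numbering the CPUs by position.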

For the record, Linux doesn't re-number the CPU OIDs on a reboot, at least, not when I added a network interface. It seems like the CPU OIDs start at 768 regardless of what else there is... I really am curious about why they start there. I'm sure it's some sort of fixed offset to avoid OID renumbering issues, but why 768? I would have expected a power of 2. Maybe I think too much like a programmer.
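
A plausible explanation for 768, offered as an assumption rather than something verified here: net-snmp appears to build each hrDevice index as the device type number shifted left by 8 bits, plus a per-type counter. Processors are device type 3 (hrDeviceProcessor is hrDeviceTypes.3 in HOST-RESOURCES-MIB), and 3 << 8 is 768. The constant and function names below are hypothetical.

```python
# Assumption: index = (device type << 8) + position within that type.
TYPE_SHIFT = 8        # hypothetical name for the assumed shift width
PROCESSOR_TYPE = 3    # last sub-identifier of hrDeviceProcessor


def first_index_for(device_type):
    """First hrDeviceIndex that would be handed out for a device type."""
    return device_type << TYPE_SHIFT


def split_device_index(index):
    """Split an index into (device_type, position) under the assumption above."""
    return index >> TYPE_SHIFT, index & ((1 << TYPE_SHIFT) - 1)


if __name__ == "__main__":
    print(first_index_for(PROCESSOR_TYPE))
    print(split_device_index(769))
```

If that reading is right, 768 is a power-of-2 offset after all, just multiplied by the type number, and adding a network interface cannot disturb the CPU indexes because interfaces live in a different block.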

Anyhow, back to Windows: I also have my doubts about how useful the moment-to-moment CPU load will be, since sampling it at one- or five-minute intervals doesn't give us a great overall picture of CPU load. What we really need is something more like the Unix load average figures. And lo and behold, UCD MIBs implement this in the laTable in a range of helpful ways:

pyarra@verbena:~$ snmpwalk -v1 -cpublic localhost .1.3.6.1.4.1.2021.10
UCD-SNMP-MIB::laIndex.1 = INTEGER: 1
UCD-SNMP-MIB::laIndex.2 = INTEGER: 2
UCD-SNMP-MIB::laIndex.3 = INTEGER: 3
UCD-SNMP-MIB::laNames.1 = STRING: Load-1
UCD-SNMP-MIB::laNames.2 = STRING: Load-5
UCD-SNMP-MIB::laNames.3 = STRING: Load-15
UCD-SNMP-MIB::laLoad.1 = STRING: 1.19
UCD-SNMP-MIB::laLoad.2 = STRING: 1.12
UCD-SNMP-MIB::laLoad.3 = STRING: 0.94
UCD-SNMP-MIB::laConfig.1 = STRING: 12.00
UCD-SNMP-MIB::laConfig.2 = STRING: 14.00
UCD-SNMP-MIB::laConfig.3 = STRING: 14.00
UCD-SNMP-MIB::laLoadInt.1 = INTEGER: 118
UCD-SNMP-MIB::laLoadInt.2 = INTEGER: 112
UCD-SNMP-MIB::laLoadInt.3 = INTEGER: 93
UCD-SNMP-MIB::laLoadFloat.1 = Opaque: Float: 1.190000
UCD-SNMP-MIB::laLoadFloat.2 = Opaque: Float: 1.120000
UCD-SNMP-MIB::laLoadFloat.3 = Opaque: Float: 0.940000
UCD-SNMP-MIB::laErrorFlag.1 = INTEGER: noError(0)
UCD-SNMP-MIB::laErrorFlag.2 = INTEGER: noError(0)
UCD-SNMP-MIB::laErrorFlag.3 = INTEGER: noError(0)
UCD-SNMP-MIB::laErrMessage.1 = STRING: 
UCD-SNMP-MIB::laErrMessage.2 = STRING: 
UCD-SNMP-MIB::laErrMessage.3 = STRING: 
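
For feeding these into a graph, a minimal sketch of pulling the three load averages out of that walk might look like this. It pairs each laNames row with the laLoad row that has the same index. The function name is made up, and the sample is a subset of the output above.

```python
import re

# Only laNames.N and laLoad.N are wanted; the "\." right after the column
# name keeps laLoadInt and laLoadFloat rows from matching.
LA_LINE = re.compile(r"UCD-SNMP-MIB::(laNames|laLoad)\.(\d+) = STRING: (\S+)")


def parse_load_averages(walk_output):
    """Return {name: load} such as {"Load-1": 1.19} from snmpwalk text."""
    names, values = {}, {}
    for line in walk_output.splitlines():
        match = LA_LINE.search(line)
        if not match:
            continue
        column, row, value = match.group(1), int(match.group(2)), match.group(3)
        if column == "laNames":
            names[row] = value
        else:
            values[row] = float(value)
    # Join the two columns on their shared row index.
    return {names[row]: values[row] for row in names if row in values}


# Sample input copied from the snmpwalk output shown above.
SAMPLE = """\
UCD-SNMP-MIB::laNames.1 = STRING: Load-1
UCD-SNMP-MIB::laNames.2 = STRING: Load-5
UCD-SNMP-MIB::laNames.3 = STRING: Load-15
UCD-SNMP-MIB::laLoad.1 = STRING: 1.19
UCD-SNMP-MIB::laLoad.2 = STRING: 1.12
UCD-SNMP-MIB::laLoad.3 = STRING: 0.94
UCD-SNMP-MIB::laLoadInt.1 = INTEGER: 118
"""

if __name__ == "__main__":
    print(parse_load_averages(SAMPLE))
```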

So now, I'm beginning to think that net-snmp on Windows might be the way to go. Time to create some VM images and start experimenting!


Tuesday, August 9, 2011

FreeBSD, ALTQ and SNMP

Some history: we use FreeBSD (actually, a cut-down version called nanoBSD) to route and shape our WAN traffic. Works like a charm. ALTQ is the magic kernel bits that make the queuing work. It's a lot like the Linux tc stuff: set up queues, assign queuing disciplines, then push traffic into the appropriate queues based on certain criteria - in our case, usually related to port numbers and IP addresses.

Anyhow, all good, it works just like you'd expect. We can use pftop -v queues to watch in realtime how much traffic is passing through (and being dropped by) each queue.

Then I got ambitious, and decided it'd be really helpful to use Cacti to graph the queue stats. We already do this for overall traffic throughput on the interfaces, and it's handy. We just use SNMP to poll the interface counters, Cacti makes a nice graph, and we can see what's going where, when.
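
For anyone wondering what the graphing side does with those counters, here is a rough sketch of the arithmetic: take two samples of a counter, subtract, allow for the counter wrapping around, and divide by the polling interval. This illustrates the general idea and is not Cacti's actual code; the function name and parameters are made up.

```python
def counter_rate(previous, current, seconds, bits=64):
    """Per-second rate between two samples of a wrapping SNMP counter.

    bits is the counter width: 32 for Counter32, 64 for Counter64.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    modulus = 1 << bits
    # Modular subtraction handles a single wrap between the two samples.
    delta = (current - previous) % modulus
    return delta / seconds


if __name__ == "__main__":
    # Octet counters polled 300 seconds apart, converted to bits per second.
    print(counter_rate(1000, 4000, 300) * 8)
```

Note that a counter reset (say, after the agent restarts) is indistinguishable from a wrap in this sketch and would show up as one huge spike.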

However, that's where things started to get a little complicated. You see, the ALTQ SNMP implementation was incomplete for a while. Specifically, this is what happens if you try to walk the ALTQ section of the MIB:

$ snmpwalk -v1 -cpublic my-router .1.3.6.1.4.1.12325.1.200.1.9.2.1
SNMPv2-SMI::enterprises.12325.1.200.1.9.2.1.2.1 = STRING: "NoRouteIPs"
SNMPv2-SMI::enterprises.12325.1.200.1.9.2.1.2.2 = STRING: "Sequencers"
SNMPv2-SMI::enterprises.12325.1.200.1.9.2.1.3.1 = INTEGER: 2
SNMPv2-SMI::enterprises.12325.1.200.1.9.2.1.3.2 = INTEGER: 2
SNMPv2-SMI::enterprises.12325.1.200.1.9.2.1.4.1 = Timeticks: (1198511000) 138 days, 17:11:50.00
SNMPv2-SMI::enterprises.12325.1.200.1.9.2.1.4.2 = Timeticks: (1198511000) 138 days, 17:11:50.00
[snip]
SNMPv2-SMI::enterprises.12325.1.200.1.9.2.1.20.1 = Counter64: 0
SNMPv2-SMI::enterprises.12325.1.200.1.9.2.1.20.2 = Counter64: 0
Error in packet.
Reason: (genError) A general failure occured
Failed object: SNMPv2-SMI::enterprises.12325.1.200.1.9.2.1.20.2

Sub-optimal! The reason this happens is that the section of code that should walk through the ALTQ table was implemented like this:

int
pf_tbladdr(struct snmp_context __unused *ctx, struct snmp_value __unused *val,
        u_int __unused sub, u_int __unused vindex, enum snmp_op __unused op)
{
        return (SNMP_ERR_GENERR);
} 

Yep, that'll do it. The source file is /usr/src/usr.sbin/bsnmpd/modules/snmp_pf/pf_snmp.c. Fortunately, version 1.14 of pf_snmp.c has a proper implementation. So which release is it in? Sadly, not in FreeBSD releases 8.x, but it is in FreeBSD 9.0 Beta 1. Sooo, I thought I'd run that up and give it a crack.

Had to re-compile my kernel to support ALTQ, with reference to this article and this one too, as I'm something of a FreeBSD n00b. Then reboot and...

Enabling pfpanic mutex pf task mtx owned at /usr/src/sys/contrib/pf/net/if_pfsync.c:3163
cpuid = 0
KDB: enter: panic
[ thread pid 961 tid 100059 ]
Stopped at    kdb_enter+0x3a:  movl    $0,kdb_why
db>

Oh well, nice try. I guess I'll check back when 9.0 is stable.