Sysadmin ramblings: October 2010

Wednesday, October 27, 2010

Cacti and rrdtool

After a few hours' worth of wrangling, I think I've tamed rrdtool into producing a sane graph for devices that report temperatures in degrees C multiplied by 10:

/usr/bin/rrdtool graph test.png \
  --imgformat=PNG --start=-86400 --end=-300 \
  --title="rm-mon-1 - Rack A18" \
  --base=1000 --height=120 --width=500 \
  --alt-autoscale-max --lower-limit=0 \
  --vertical-label="degrees C" \
  --font TITLE:12: --font AXIS:8: --font LEGEND:10: --font UNIT:8: \
  DEF:a=rra/rm-mon-1_snmp_oid_138.rrd:snmp_oid:AVERAGE \
  DEF:b=rra/rm-mon-1_snmp_oid_138.rrd:snmp_oid:MAX \
  CDEF:cdefa=a,0.1,'*' \
  CDEF:cdefb=b,10,/ \
  LINE:cdefa#F50000FF:"degrees C" \
  GPRINT:cdefa:LAST:"Current\:%8.2lf %s" \
  GPRINT:cdefa:AVERAGE:"Average\:%8.2lf %s" \
  GPRINT:cdefb:MAX:"Maximum\:%8.2lf %s\n"

The magic is in the CDEF statements, each of which declares a variable (for example, cdefb) and assigns to it the value of b with an RPN modifier - in this case 10,/ (divide by 10). cdefa gets the same result by multiplying a by 0.1.

The other magic is to then remember to USE the newly-assigned cdefa rather than straight a as the value used by the LINE and GPRINT statements (it took me a while to realise that I was happily assigning the correct values to cdefa and cdefb and then never using them).
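To convince myself that the two CDEF spellings really are equivalent, here is a toy RPN evaluator in shell and awk. It is only a sketch: rpn_eval is a made-up helper, it understands nothing beyond + - * / and the single variable a, and it is not how rrdtool itself is implemented.

```shell
# rpn_eval VALUE EXPR
# Substitute VALUE for the variable "a" and evaluate a comma-separated
# RPN expression. Hypothetical helper for illustration only.
rpn_eval() {
    awk -v value="$1" -v expr="$2" 'BEGIN {
        n = split(expr, tok, ",")
        sp = 0
        for (i = 1; i <= n; i++) {
            t = tok[i]
            if (t == "+" || t == "-" || t == "*" || t == "/") {
                # An operator pops two operands and pushes the result
                y = stack[sp--]
                x = stack[sp--]
                if (t == "+")      r = x + y
                else if (t == "-") r = x - y
                else if (t == "*") r = x * y
                else               r = x / y
                stack[++sp] = r
            } else if (t == "a") {
                stack[++sp] = value + 0   # the raw reading, in tenths of a degree
            } else {
                stack[++sp] = t + 0       # a numeric constant
            }
        }
        printf "%.1f\n", stack[sp]
    }'
}

rpn_eval 235 'a,10,/'     # prints 23.5
rpn_eval 235 'a,0.1,*'    # prints 23.5
```

A raw reading of 235 comes out as 23.5 either way, so which spelling you use is a matter of taste.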

I'm yet to figure out how to wrangle this data into cacti - so far I'm just fooling around in bash. I'm sure I'll figure it out... later.

For bonus points:

assign more meaningful variable names than a, b, cdefa and cdefb - these are the defaults I got from cacti, but they should really be r18_avg_temp_by_ten, r18_max_temp_by_ten, r18_avg_temp, r18_max_temp
plot all related rack temps on the same single graph - they're drawn from multiple rrd files, but that should be easy enough
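A rough sketch of how both bonus points might be tackled from bash: generate one DEF/CDEF/LINE triple per rack, with names derived from a rack label. Everything here is an assumption for illustration - rack_graph_args is a made-up helper, r19 and the second rrd file name are hypothetical, and I haven't tried it against Cacti's own files.

```shell
# rack_graph_args LABEL RRD_FILE COLOUR
# Print the rrdtool graph arguments for one rack, one per line.
# Hypothetical helper; labels must not contain spaces.
rack_graph_args() {
    label=$1
    rrd=$2
    colour=$3
    printf 'DEF:%s_by_ten=%s:snmp_oid:AVERAGE\n' "$label" "$rrd"
    printf 'CDEF:%s_temp=%s_by_ten,10,/\n' "$label" "$label"
    printf 'LINE:%s_temp#%s:%s\n' "$label" "$colour" "$label"
}

# Collect the arguments for two racks (the second file name is made up)
args=$(
    rack_graph_args r18 rra/rm-mon-1_snmp_oid_138.rrd F50000
    rack_graph_args r19 rra/rm-mon-1_snmp_oid_139.rrd 0000F5
)
printf '%s\n' "$args"

# Left unquoted on purpose so each line becomes its own argument:
# /usr/bin/rrdtool graph racks.png --start=-86400 --end=-300 $args
```

The rrdtool call is commented out so the snippet runs without any rrd files present; uncomment it once the file names match what is really in rra/.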

Thursday, October 21, 2010

ReadyNAS and Cacti: grumble

Since we've been having a few temperature issues lately, I thought it might be a good time to start using our Cacti to graph temperatures, so we can see the trends (as well as the alerts which Nagios sends us). We have a SafetyNet5 which for some reason I cannot get to produce graphs... still puzzling over that one.

But we have a ReadyNAS at most of our sites, so why not get the temperature data and graph that? Good idea, right? Yep, in theory. If I ever get the chance, I'd like to ask the authors of the ReadyNAS MIB why they thought that returning temperature data as a String containing both Celsius and Fahrenheit data was A Good Idea. For some funny reason, Cacti isn't that keen on numbers that look like this: "32.0C/89.6F"

Oh well, nice try. There is a neato solution for this on the cacti forums, but:

it requires a more recent Cacti than we have
it requires a more recent ReadyNAS firmware than we have

Bugger.
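A possible workaround that needs neither upgrade is to strip the string down to a plain number before Cacti sees it, for example in a small wrapper script used as a script data input. The sketch below covers only the string handling; readynas_celsius is a made-up name, and fetching the value over SNMP (and the OID to ask for) is deliberately left out because I haven't tested that part.

```shell
# readynas_celsius RAW
# Turn a ReadyNAS temperature string such as "32.0C/89.6F" into 32.0.
# Hypothetical helper; assumes the Celsius value always comes first.
readynas_celsius() {
    raw=$1
    raw=${raw#\"}                 # drop a leading double quote, if there is one
    printf '%s\n' "${raw%%C*}"    # keep everything before the first "C"
}

readynas_celsius '"32.0C/89.6F"'   # prints 32.0
```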

Monday, October 11, 2010

Nagios monitoring ReadyNAS via SNMP

First up: the ReadyNAS does have an SNMP implementation, but it's off by default, so go turn it on: Front Panel -> System -> Alerts, select the SNMP tab.

Test it works:

$ snmpwalk -v1 -cpublic my-nas-box-ip
SNMPv2-MIB::sysDescr.0 = STRING: Linux my-nas-box 2.6.17.8ReadyNAS #1 Tue Jun 9 13:59:28 PDT 2009 padre

[buckets of usual SNMP output omitted]

The interesting stuff is found here:
$ snmpwalk -v1 -cpublic my-nas-box-ip enterprises.4526
SNMPv2-SMI::enterprises.4526.18.1.0 = STRING: "4.01c1-p6"

[smaller buckets of output omitted]

$ snmpwalk -v1 -cpublic my-nas-box-ip enterprises.4526.18.7
SNMPv2-SMI::enterprises.4526.18.7.1.1.1 = INTEGER: 1
SNMPv2-SMI::enterprises.4526.18.7.1.2.1 = STRING: "Volume C"
SNMPv2-SMI::enterprises.4526.18.7.1.3.1 = STRING: " RAID Level X"
SNMPv2-SMI::enterprises.4526.18.7.1.4.1 = STRING: "ok"
SNMPv2-SMI::enterprises.4526.18.7.1.5.1 = INTEGER: 4262912
SNMPv2-SMI::enterprises.4526.18.7.1.6.1 = INTEGER: 3679876

These items are the (RAID) volume table:
1 is the first RAID volume (and indeed my only one)
Volume C is the name of the RAID volume - it's the default ReadyNAS one
RAID Level X means we're using the ReadyNAS default X-RAID
ok is the status of the volume - it's OK. Phew!
4262912 is the size in megabytes of the volume (I can't really make that tally against the actual size, I'm still puzzling over that one)
3679876 is the free space of the volume in megabytes
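As a quick sanity check on those last two numbers, the percentage in use can be worked out with shell integer arithmetic. This is just a sketch with a made-up function name, not anything taken from the ReadyNAS MIB or a plugin.

```shell
# volume_percent_used TOTAL FREE
# Print the percentage of the volume in use, rounded down.
# Hypothetical helper for illustration.
volume_percent_used() {
    total=$1
    free=$2
    if [ "$total" -eq 0 ]; then
        echo "total size is zero" >&2
        return 1
    fi
    echo $(( (total - free) * 100 / total ))
}

volume_percent_used 4262912 3679876   # prints 13
```

So whatever the units really are, the volume is about 13% full.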

You can get the whole MIB here. There are plenty of other interesting things you can monitor, such as temp and fan speeds, but since we've set up email alerts if there are problems with those, I'm happy to leave them out of Nagios.

Now to set up Nagios:
Grab check_readynas_hd.pl from these guys. It only monitors RAID volume, and only the first one - that's all I wanted to monitor, so it's perfect. The code is nice and simple, so it'd be easy enough to expand it to cater for multiple volumes, maybe monitor temperatures, physical disks, if you were that way inclined. I found that the script assumed snmpwalk would be at /usr/bin/snmpwalk - not so for FreeBSD, but it wasn't too hard to hack it in as /usr/local/bin/snmpwalk

You also need the ReadyNAS MIB, so download that.

Running the script was easy:
$ /usr/local/libexec/nagios/check_readynas -H my-nas-box-ip -m /usr/local/libexec/nagios/READYNAS-MIB.txt
Volume C(RAID Level X): 4262912/3679876bytes (13% in use) STATUS: "ok"

Yay, looks good. Now add these lines to commands.cfg:

# 'check_readynas_disk' command definition
define command{
        command_name    check_readynas_disk
        command_line    $USER1$/check_readynas -H $HOSTADDRESS$ -m /usr/local/libexec/nagios/READYNAS-MIB.txt
        }

Then add these lines to the server's cfg file for nagios:
define service{
        use                             local-service
        host_name                       my-nas-box
        service_description             readyNAS RAID
        check_command                   check_readynas_disk
        }

Re-start Nagios and watch things start to come good (where "monitored" equals "good").

Sunday, October 10, 2010

Nagios monitoring Dell PE 2900 via SNMP

I decided I would like to monitor our new file server - you know, so if the RAID became degraded, I'd know... rather than lose two disks from a set like we did recently and um, lose a bit of data. Yeah, oops.

So... how hard can it be? Answer: quite hard for our Windows 2000 server. More on that later.

For Windows 2003, it wasn't too bad, but there were a few hoops to jump through. I'm documenting those hoops here for future reference:

Install OpenManage Server Administrator Managed Node (v6.3) ... ahah, but not so fast - it will probably force you to install new RAID firmware and drivers, so do that, of course, reboot... then here's the trick that got me first time around: Storage Management is deselected for installation by default, so you MUST choose a custom installation and for the love of god, select Storage Management for installation! Why would Dell do this??? Especially after making a song-and-dance that forced me to upgrade my RAID firmware... anyway...

Then to monitor via SNMP you need Windows SNMP installed (Start -> Settings -> Control Panel -> Add/Remove Programs, select "Windows Components" then "Management and Monitoring Tools", click "Details:" button and scroll down to "Simple Network Management Protocol" and make sure that's ticked). By default SNMP only allows polling from localhost (this is either good security, or absolutely stupid, depending on your point of view and level of caffeination). To allow SNMP polling from other hosts, go to the services control panel applet, find "SNMP Service", right-click, select "Properties", click the "Security" tab and either allow SNMP from all hosts, or just the hosts you choose.

Test that SNMP is working:


$ snmpget -v 1 -c public hostname .1.3.6.1.4.1.674.10893.1.20.140.1.1.2.1
SNMPv2-SMI::enterprises.674.10893.1.20.140.1.1.2.1 = STRING: "System"

(this is the name of the "disk label" for virtual disk 1). You can find a list of useful info on OpenManage SNMP here.

Great. From here I could write some simple SNMP checks for Nagios, and so long as the virtualDiskRollUpStatus (1.3.6.1.4.1.674.10893.1.20.140.1.1.19.x) comes back as 3 then we can assume we're all happy. But I thought maybe some helpful soul out there might have already written something more sophisticated for monitoring OpenManage, and they surely have - I settled on check_openmanage as a nice one.
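For the record, the simple home-grown check might look something like the sketch below. It assumes only what is stated above (3 means happy) and treats every other numeric value as critical; rollup_to_nagios is a made-up name, and the snmpget call is left as a comment because I haven't run this against a live box. The exit codes are the standard Nagios plugin ones: 0 OK, 2 CRITICAL, 3 UNKNOWN.

```shell
# rollup_to_nagios STATUS
# Map a virtualDiskRollUpStatus value to a Nagios plugin result.
# Hypothetical helper: 3 is OK, any other number is treated as CRITICAL.
rollup_to_nagios() {
    case "$1" in
        3)
            echo "OK - virtual disk roll-up status is 3"
            return 0 ;;
        ''|*[!0-9]*)
            # Empty or non-numeric: SNMP probably failed
            echo "UNKNOWN - could not read roll-up status"
            return 3 ;;
        *)
            echo "CRITICAL - virtual disk roll-up status is $1"
            return 2 ;;
    esac
}

# Intended use in a plugin script (untested assumption):
# status=$(snmpget -v 1 -c public -Oqv my-server .1.3.6.1.4.1.674.10893.1.20.140.1.1.19.1)
# rollup_to_nagios "$status"; exit $?

rollup_to_nagios 3   # prints the OK line
```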

So on the nagios server, I did this:


# cd /usr/local/libexec/nagios/
# wget http://folk.uio.no/trondham/software/check_openmanage-3.6.0/check_openmanage
# chmod +x check_openmanage
# ./check_openmanage -H my-server
OK - System: 'PowerEdge 2900 III', SN: 'XXXXXX1S', 2 GB ram (2 dimms), 2 logical drives, 4 physical drives
# vi /usr/local/etc/nagios/commands.cfg

Add these lines:

# 'check_openmanage' command definition
define command{
        command_name    check_openmanage
        command_line    $USER1$/check_openmanage -H $HOSTADDRESS$
        }

Then edit the server's .cfg file to call the plugin:
define service{
        use                             local-service         ; Name of service template to use
        host_name                       my-server
        service_description             OpenManage Status
        check_command                   check_openmanage
        }

Then re-start nagios and wait till it polls, and see nice green output. Yay!

Show where Windows home directories are

In our AD, users' home directories are stored on various file servers. So when it's time to migrate them to a new file server, how do we determine who needs to get moved off the old one? ldapsearch to the rescue:


ldapsearch -x -LLL -E pr=2000/noprompt -h rov-dc -D Administrator@example.com -W -b 'cn=Users,dc=example,dc=com' -s sub homeDirectory | awk '$1 == "homeDirectory:" {print $2}' | sort
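To see at a glance how many users live on each file server, the sorted list can be boiled down to a count per server. A minimal sketch, assuming the values are UNC paths of the form \\server\share\user; count_by_server is a made-up helper and the sample paths are invented.

```shell
# count_by_server
# Read UNC paths on stdin and print "server count" lines, sorted by name.
# Hypothetical helper; server names are folded to lower case.
count_by_server() {
    tr '\\' '/' |
        awk -F/ 'NF >= 3 && $3 != "" { n[tolower($3)]++ }
                 END { for (s in n) print s, n[s] }' |
        sort
}

# Invented sample data standing in for the ldapsearch output
printf '%s\n' '\\fs1\home\alice' '\\fs2\home\bob' '\\FS1\home\carol' | count_by_server
# prints:
# fs1 2
# fs2 1
```

Piping the ldapsearch command above into count_by_server (in place of the final sort) should give the same kind of summary for real data.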

Wednesday, October 6, 2010

Trend 10 fills disks

Trend Micro OfficeScan 10 offers us some compelling features, so we're upgrading from Trend 7 (stop laughing in the back there, we like old software!). We've found 2 significant gotchas:

Disk filler

We installed the Trend server software (i.e. the part that farms out the new virus definitions to the clients on that network) in the default location, C:\Program Files\Trend Micro\, on our local file/print server. Soon enough it filled the entire C: drive, progressively killing off services until file and print services died and the users started calling me. The culprit: C:\Program Files\Trend Micro\OfficeScan\PCCSRV\Apache2\logs\error.log had grown to 3 GB (yep, 3 gigabytes of error logs!). I'd love to tell you what was in that file, but Notepad won't open a file that big, and Wordpad wants to make a copy of it (on C:\temp, I guess) before it opens it, making the problem even worse. So I just killed it, shrugged and moved on.

Weird Word Wackiness

OK, this is clearly a corner case, but it happened to three machines on one network. With Trend 10 on the clients (two of them w2k, one XP), trying to open a Word document from a Samba file share fails. With OpenOffice instead of Word, it works. With the documents in question copied to a Windows 2003 file server, it works. With Trend reverted to V7, it works. So: Trend 10 + MS Word 2003 + Samba file server = bizarre errors.