Thursday, April 19, 2012

Nagios UPTIME errors for a Windows 2003 client

We have a Windows 2003 server - one of a farm of 5 similar machines - that suddenly started reporting errors in Nagios:

nagios# /usr/local/libexec/nagios/check_nt -H mq-citrix-5 -v UPTIME 
NSClient - ERROR: Could not get value

The issue turned out to be with the underlying Windows performance counters that the Nagios client (NSClient++) uses to get this information. This thread was useful reading: http://nsclient.org/nscp/discussion/message/1066

To verify we had the same issue, I ran NSClient++ in test mode on the server:

cd "\Program Files\NSClient++"
nsclient++.exe /test 
(error output) 

To fix this on Windows 2003, we need to force a rebuild of the performance counters:

cd \windows\system32 
lodctr /R 

I had to restart the NSClient++ service, then tested again from Nagios:

nagios# /usr/local/libexec/nagios/check_nt -H mq-citrix-5 -v UPTIME 
System Uptime - 0 day(s) 9 hour(s) 7 minute(s) 

And we are happy again. In hindsight, I guess this server crashing 6 times in an hour must have corrupted the counters. A couple of minutes later, this hit my inbox:

** RECOVERY alert - mq-citrix-5/UPTIME is OK **

Yay!
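
Since this server is one of a farm of similar machines, it may be worth running the same UPTIME check against the others to see whether their counters are healthy. Below is a minimal sketch of that idea, not something from the original troubleshooting. It assumes the check_nt path used above, and the function names (is_counter_error, check_farm) are made up for illustration. Only mq-citrix-5 is a host name known from this post, so pass in whatever host names your farm actually uses.

```shell
#!/bin/sh
# Sketch: run the check_nt UPTIME check against several hosts and flag
# the "Could not get value" error seen above.
# Assumption: check_nt lives at the path used in this post; override
# CHECK_NT if yours is installed elsewhere.
CHECK_NT="${CHECK_NT:-/usr/local/libexec/nagios/check_nt}"

# Return 0 if the given check_nt output is the counter error, 1 otherwise.
is_counter_error() {
    case "$1" in
        *"Could not get value"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check every host passed as an argument, print one line per host,
# and return non-zero if any host reported the counter error.
check_farm() {
    rc=0
    for host in "$@"; do
        output=$("$CHECK_NT" -H "$host" -v UPTIME 2>&1)
        if is_counter_error "$output"; then
            echo "$host: counters look broken ($output)"
            rc=1
        else
            echo "$host: $output"
        fi
    done
    return $rc
}
```

For example, after sourcing the file, "check_farm mq-citrix-5" runs the same check shown earlier. Any host flagged as broken would be a candidate for the lodctr /R rebuild and NSClient++ service restart described above.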
