Introduction
When a Linux or Unix system encounters an error on a file system, it will generally remount that file system read-only to prevent further damage. This happens regularly with BTRFS file systems. We need to know about this before issues arise.
The SNMP arcs hrStorage and dskTable report on storage and disks, but they do not indicate the writable status of a mount point.
The best available solution for an SNMP-only implementation of this is to use the SNMP extend mechanism to invoke a server-side script, and send the results of that as SNMP values.
Method
Server-side script
I started with an existing Nagios script as the server-side script. It depended on the Perl module utils.pm, but only for a couple of definitions, which I hard-coded into the script:
# hard-code these in for machines without Nagios plugins installed - pyarra
#use lib "/usr/lib/nagios/plugins";
#use utils qw (%ERRORS &support);
my %ERRORS=('OK'=>0,'WARNING'=>1,'CRITICAL'=>2,'UNKNOWN'=>3,'DEPENDENT'=>4);

sub support () {
    my $support='Send email to help@monitoring-plugins.org if you have questions regarding use\nof this software. To submit patches or suggest improvements, send email to\ndevel@monitoring-plugins.org. Please include version information with all\ncorrespondence (when possible, use output from the --version option of the\nplugin itself).\n';
    $support =~ s/@/\@/g;
    $support =~ s/\\n/\n/g;
    print $support;
}
This script is then deployed at /usr/local/bin/check_ro_mounts and given execution permissions for all users.
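The full plugin is not reproduced here. As a rough illustration of the kind of check it performs, the shell sketch below scans a mounts table in /proc/mounts format and reports read-only mounts, skipping any excluded filesystem types. This is not the Perl plugin itself: the function name check_ro_sketch and its argument order are made up for this example, and only the message format and exit codes are taken from the plugin output shown in this post.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the actual check_ro_mounts plugin.
# Usage: check_ro_sketch MOUNTS_FILE [EXCLUDED_TYPE ...]
check_ro_sketch() {
    mounts_file=$1
    shift
    excluded=" $* "
    found=""
    # Each line of /proc/mounts: device mountpoint fstype options dump pass
    while read -r _dev mountpoint fstype options _rest; do
        case "$excluded" in
            *" $fstype "*) continue ;;   # type excluded by the caller
        esac
        case ",$options," in
            *,ro,*) found="$found $mountpoint" ;;   # "ro" is one of the options
        esac
    done < "$mounts_file"

    if [ -n "$found" ]; then
        echo "RO_MOUNTS CRITICAL: Found ro mounts:$found"
        return 2    # CRITICAL in the %ERRORS table
    fi
    echo "RO_MOUNTS OK: No ro mounts found"
    return 0        # OK
}
```

Calling check_ro_sketch /proc/mounts tmpfs would mirror the -X tmpfs exclusion used below.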
Server-side snmpd configuration
Edit /etc/snmp/snmpd.conf and insert config to tell snmpd to call this script:
extend fsro /usr/local/bin/check_ro_mounts -X tmpfs
Note we are excluding filesystems of type tmpfs, as these are expected to be present read-only. Other types of filesystem may need to be added using repeated -X if they’re found.
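To decide which other types might need an -X, it can help to list the filesystem types that are currently mounted read-only on the server. The helper below is a sketch that assumes a Linux-style /proc/mounts; the name ro_fs_types is made up for this example.

```shell
# Hypothetical helper: print each filesystem type that has at least one
# read-only mount, reading a table in /proc/mounts format.
ro_fs_types() {
    awk '{
        n = split($4, opts, ",")            # $4 is the mount options field
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $3; break }   # $3 is the fs type
    }' "${1:-/proc/mounts}" | sort -u
}
```

Running ro_fs_types with no argument reads /proc/mounts. Any type in the output that is expected to be read-only is a candidate for another -X on the extend line.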
Restart snmpd:
service snmpd restart
Testing with snmpget
Initially I’d considered rooting these extend sections at our company's arc, using our existing PEN from IANA. However, using NET-SNMP-EXTEND-MIB::nsExtendObjects proved to make interrogating the results much easier, as the user-provided name index (such as fsro) can be used to get at values.
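Behind the scenes, a string index like fsro ends up in the numeric OID as the string length followed by the ASCII code of each character. The sketch below shows that encoding; the helper name extend_index_suffix is made up for this example, and it only builds the index suffix, not the full OID.

```shell
# Hypothetical helper: encode an extend name as a numeric OID suffix
# (string length, then the ASCII code of each character).
extend_index_suffix() {
    name=$1
    suffix=${#name}
    while [ -n "$name" ]; do
        rest=${name#?}              # name without its first character
        first=${name%"$rest"}       # the first character itself
        suffix="$suffix.$(printf '%d' "'$first")"
        name=$rest
    done
    printf '%s\n' "$suffix"
}

extend_index_suffix fsro    # prints 4.102.115.114.111
```

If the numeric form is ever needed, snmptranslate -On on the named OID should show the complete numeric OID, including this suffix.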
There are two values that are particularly useful for us:
nsExtendResult: an integer value, where 0 means OK, and 2 means read-only mounts were found
nsExtendOutputFull: a string representation. If an unexpected read-only mount is found, it is listed in this output.
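The integer is simply the script's exit code, so it maps back onto the %ERRORS table hard-coded into the script earlier. A small sketch of that mapping (the function name state_name is made up for this example):

```shell
# Map an nsExtendResult value to the state names from the %ERRORS table.
state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        4) echo DEPENDENT ;;
        *) echo "UNRECOGNISED($1)" ;;   # anything outside the table
    esac
}

state_name 2    # prints CRITICAL
```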
Here’s an example of what we expect when things are okay:
$ snmpget -cpublic -v2c gctssle02 'NET-SNMP-EXTEND-MIB::nsExtendResult."fsro"' 'NET-SNMP-EXTEND-MIB::nsExtendOutputFull."fsro"'
NET-SNMP-EXTEND-MIB::nsExtendResult."fsro" = INTEGER: 0
NET-SNMP-EXTEND-MIB::nsExtendOutputFull."fsro" = STRING: RO_MOUNTS OK: No ro mounts found
Note the weird quoting that is absolutely required to make this work!
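The reason is ordinary shell quoting: the double quotes around fsro are part of the OID that snmpget needs to see, and the outer single quotes stop the shell from removing them. This small demonstration prints what the command would actually receive in each case (the variable names are just for illustration):

```shell
# What snmpget receives when the OID is wrapped in single quotes:
quoted='NET-SNMP-EXTEND-MIB::nsExtendResult."fsro"'
# What it receives without them: the shell strips the double quotes.
unquoted=NET-SNMP-EXTEND-MIB::nsExtendResult."fsro"

printf '%s\n' "$quoted"     # NET-SNMP-EXTEND-MIB::nsExtendResult."fsro"
printf '%s\n' "$unquoted"   # NET-SNMP-EXTEND-MIB::nsExtendResult.fsro
```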
Now let’s simulate a failure - we’re going to remove the server-side config that excludes tmpfs file systems:
Change this line in /etc/snmp/snmpd.conf:
extend fsro /usr/local/bin/check_ro_mounts -X tmpfs
to:
extend fsro /usr/local/bin/check_ro_mounts
Restart snmpd on the server:
service snmpd restart
… And re-query using snmpget from the monitoring host:
$ snmpget -cpublic -v2c gctssle02 'NET-SNMP-EXTEND-MIB::nsExtendResult."fsro"' 'NET-SNMP-EXTEND-MIB::nsExtendOutputFull."fsro"'
NET-SNMP-EXTEND-MIB::nsExtendResult."fsro" = INTEGER: 2
NET-SNMP-EXTEND-MIB::nsExtendOutputFull."fsro" = STRING: RO_MOUNTS CRITICAL: Found ro mounts: /sys/fs/cgroup
Great! The integer value for nsExtendResult is 2, and the nsExtendOutputFull value tells us the read-only mount point we found was /sys/fs/cgroup.
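If you want to act on this from a script of your own rather than from Nagios, the integer can be pulled out of a line of snmpget output with plain shell parameter expansion. This is only a sketch: the helper name extend_result_value is made up, and it assumes the "OID = INTEGER: n" line format shown above.

```shell
# Hypothetical helper: extract the value from one "OID = INTEGER: n" line.
extend_result_value() {
    printf '%s\n' "${1##*INTEGER: }"    # drop everything up to "INTEGER: "
}

# Sample line copied from the failure output above.
line='NET-SNMP-EXTEND-MIB::nsExtendResult."fsro" = INTEGER: 2'
if [ "$(extend_result_value "$line")" -ne 0 ]; then
    echo "read-only mounts reported"
fi
```

Alternatively, the net-snmp output option -Oqv should make snmpget print just the value, which avoids the parsing altogether.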
Now restore your original server-side config to exclude tmpfs, and restart snmpd.
Creating Nagios checks
We’ll start by calling the Nagios command-line tools directly, to confirm that the techniques we used for snmpget will work. We want the check_snmp tool for this, as it allows us to check arbitrary SNMP values.
First, we’ll check that the integer value works as expected - anything above 0 will be CRITICAL:
$ check_snmp -H gctssle02 -o 'NET-SNMP-EXTEND-MIB::nsExtendResult."fsro"' -c 0
SNMP OK - 0 | 'NET-SNMP-EXTEND-MIB::nsExtendResult."fsro"'=0
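The -c 0 threshold relies on the usual Nagios plugin range convention, as I understand it: a bare number N is read as the range 0 to N, and the check alerts when the value falls outside that range. As a rough illustration of that rule (the function name outside_range is made up for this example and is not part of check_snmp):

```shell
# Sketch of the range rule assumed for "-c 0": a bare threshold N means
# "alert when the value is outside 0..N".
outside_range() {
    value=$1
    max=$2
    [ "$value" -lt 0 ] || [ "$value" -gt "$max" ]
}

if outside_range 2 0; then
    echo "2 would be CRITICAL with -c 0"
fi
```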
Second, we check that we can parse the string value - we’re looking for it to contain “OK”, using a regex:
$ check_snmp -H gctssle02 -r 'RO_MOUNTS OK' -o 'NET-SNMP-EXTEND-MIB::nsExtendOutputFull."fsro"'
SNMP OK - RO_MOUNTS OK: No ro mounts found |
Now let’s simulate failure to make sure our checks fail as we expect. Again, we’re going to modify the server-side snmpd.conf to not exclude tmpfs filesystems.
First, check that the non-zero integer value causes a critical alert:
$ check_snmp -H gctssle02 -o 'NET-SNMP-EXTEND-MIB::nsExtendResult."fsro"' -c 0
SNMP CRITICAL - *2* | 'NET-SNMP-EXTEND-MIB::nsExtendResult."fsro"'=2
Second, check that the absence of “OK” in the string output causes an alert:
$ check_snmp -H gctssle02 -r 'RO_MOUNTS OK' -o 'NET-SNMP-EXTEND-MIB::nsExtendOutputFull."fsro"'
SNMP CRITICAL - *RO_MOUNTS CRITICAL: Found ro mounts: /sys/fs/cgroup* |
Excellent! We could theoretically use just one of these. However, integer parsing is a safer way to alert on the presence of an error, and the string representation gives us useful information when an alert fires, so we’ll configure both.
Now let’s put these into Nagios. First, we’ll create the commands in /etc/nagios3/commands.cfg:
define command{
    command_name check_readonlyfs_int
    command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o 'NET-SNMP-EXTEND-MIB::nsExtendResult."fsro"' -c 0
}

define command{
    command_name check_readonlyfs_str
    command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -r 'RO_MOUNTS OK' -o 'NET-SNMP-EXTEND-MIB::nsExtendOutputFull."fsro"' -s OK
}
Now assign these checks to a host:
define host{
    use generic-host ; Name of host template to use
    host_name gctssle02
    alias GC SUSE test server
    address gctssle02
    contact_groups admins
    _SNMP_COMMUNITY public
}

define service{
    use generic-service
    host_name gctssle02
    service_description Read-only filesystem int
    check_command check_readonlyfs_int
    check_interval 5
}

define service{
    use generic-service
    host_name gctssle02
    service_description Read-only filesystem str
    check_command check_readonlyfs_str
    check_interval 5
}
Reload Nagios:
sudo service nagios3 reload
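Finally, simulate a failure as before and confirm that both services go CRITICAL in Nagios, then restore the configuration and confirm that they return to OK.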