Thursday, February 17, 2022

Nagios - monitoring for read-only disks

 

Introduction

When a Linux or Unix system encounters an error on a file system, it will generally remount that file system read-only to prevent further damage. This happens regularly with BTRFS file systems. We need to know about this before further issues arise.

The SNMP arcs hrStorage and dskTable report on storage and disks, but they do not indicate the writable status of a mount point.

For an SNMP-only implementation, the best available solution is to use the SNMP extend mechanism to invoke a server-side script and return its results as SNMP values.

Method

Server-side script

I started with an existing Nagios script as the server-side script. It depended on the Perl module utils.pm, but only for a couple of definitions, which I hard-coded into the script:

    # hard-code these in for machines without Nagios plugins installed - pyarra
    #use lib "/usr/lib/nagios/plugins";
    #use utils qw (%ERRORS &support);
    my %ERRORS=('OK'=>0,'WARNING'=>1,'CRITICAL'=>2,'UNKNOWN'=>3,'DEPENDENT'=>4);

    sub support () {
      my $support='Send email to help@monitoring-plugins.org if you have questions regarding use\nof this software. To submit patches or suggest improvements, send email to\ndevel@monitoring-plugins.org. Please include version information with all\ncorrespondence (when possible, use output from the --version option of the\nplugin itself).\n';
      $support =~ s/@/\@/g;
      $support =~ s/\\n/\n/g;
      print $support;
    }

This script is then deployed to /usr/local/bin/check_ro_mounts and made executable for all users.
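For readers who have not looked inside the plugin, the core idea of such a check can be sketched in a few lines of shell. This is not the plugin's actual code: the function name list_ro_mounts is made up for illustration, and reading a /proc/mounts-style file is an assumption about how a check like this can be done.

```shell
#!/bin/sh
# Sketch only (hypothetical helper, not the check_ro_mounts plugin itself).
# Prints every mount point whose mount options include "ro".
# Usage: list_ro_mounts MOUNTS_FILE [EXCLUDED_FSTYPE...]
list_ro_mounts() {
    mounts_file=$1
    shift
    excluded=" $* "
    # /proc/mounts fields: device mountpoint fstype options dump pass
    while read -r _dev mnt fstype opts _rest; do
        # Skip filesystem types the caller asked to exclude
        case "$excluded" in
            *" $fstype "*) continue ;;
        esac
        # Wrap the option list in commas so "ro" only matches as a whole option
        case ",$opts," in
            *,ro,*) printf '%s\n' "$mnt" ;;
        esac
    done < "$mounts_file"
}

# Example: list_ro_mounts /proc/mounts tmpfs
```

Wrapping the options in commas avoids false matches on options that merely contain the letters "ro", such as errors=remount-ro.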

Server-side snmpd configuration

Edit /etc/snmp/snmpd.conf and insert config to tell snmpd to call this script:

    extend fsro /usr/local/bin/check_ro_mounts -X tmpfs

Note that we are excluding filesystems of type tmpfs, as read-only tmpfs mounts are expected to be present. If other filesystem types turn out to be legitimately read-only, exclude them too by repeating -X once per type.
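As an illustration of the repeated -X form, here is a small shell sketch that assembles the extend line from a list of filesystem types. The helper name extend_line is hypothetical, and squashfs is only an example of a type you might decide to exclude; it is not something the setup above requires.

```shell
#!/bin/sh
# Sketch only: print an snmpd.conf "extend" line with one -X per excluded type.
# Usage: extend_line NAME SCRIPT [FSTYPE...]
extend_line() {
    name=$1
    script=$2
    shift 2
    line="extend $name $script"
    for fstype in "$@"; do
        line="$line -X $fstype"
    done
    printf '%s\n' "$line"
}

# Example: extend_line fsro /usr/local/bin/check_ro_mounts tmpfs squashfs
```

Called as in the example comment, it prints: extend fsro /usr/local/bin/check_ro_mounts -X tmpfs -X squashfs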

Restart snmpd:

    service snmpd restart

Testing with snmpget

Initially I’d considered rooting these extend sections at our company's arc, using our existing Private Enterprise Number (PEN) from IANA. However, using NET-SNMP-EXTEND-MIB::nsExtendObjects made interrogating the results much easier, as the user-provided name index (such as fsro) can be used to get at the values.

There are two values that are particularly useful for us:

nsExtendResult: an integer value, where 0 means OK, and 2 means read-only mounts were found

nsExtendOutputFull: a string representation. If an unexpected read-only mount is found, it is listed in this output.
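Since nsExtendResult carries the script's exit status, the integer lines up with the %ERRORS table hard-coded into the script earlier. A minimal shell sketch of that mapping follows; the function name status_name is hypothetical, and treating any other value as UNKNOWN is my own choice.

```shell
#!/bin/sh
# Sketch only: translate an nsExtendResult integer into a Nagios status name,
# following the %ERRORS values hard-coded into the server-side script.
status_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        4) echo DEPENDENT ;;
        *) echo UNKNOWN ;;   # assumption: anything unexpected counts as UNKNOWN
    esac
}

# Example: status_name 2
```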

Here’s an example of what we expect when things are okay:

    $ snmpget -cpublic -v2c gctssle02 'NET-SNMP-EXTEND-MIB::nsExtendResult."fsro"' 'NET-SNMP-EXTEND-MIB::nsExtendOutputFull."fsro"'
    NET-SNMP-EXTEND-MIB::nsExtendResult."fsro" = INTEGER: 0
    NET-SNMP-EXTEND-MIB::nsExtendOutputFull."fsro" = STRING: RO_MOUNTS OK: No ro mounts found

Note the weird quoting that is absolutely required to make this work!
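The quoting is needed because the double quotes around the index name are part of the OID that snmpget has to see, so the outer single quotes stop the shell from stripping them. One way to avoid typing that by hand is a tiny helper that builds the string; the name extend_oid is hypothetical and this is only a sketch.

```shell
#!/bin/sh
# Sketch only: build a NET-SNMP-EXTEND-MIB OID with the quoted name index.
# Usage: extend_oid COLUMN NAME
extend_oid() {
    # The double quotes inside the format string end up in the output verbatim
    printf 'NET-SNMP-EXTEND-MIB::%s."%s"\n' "$1" "$2"
}

# Example (quote the substitution so the inner double quotes survive):
# snmpget -cpublic -v2c gctssle02 "$(extend_oid nsExtendResult fsro)"
```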

Now let’s simulate a failure - we’re going to remove the server-side config that excludes tmpfs file systems:

Change this line in /etc/snmp/snmpd.conf:

    extend fsro /usr/local/bin/check_ro_mounts -X tmpfs

to:

    extend fsro /usr/local/bin/check_ro_mounts

Restart snmpd on the server:

    service snmpd restart

… And re-query using snmpget from the monitoring host:

    $ snmpget -cpublic -v2c gctssle02 'NET-SNMP-EXTEND-MIB::nsExtendResult."fsro"' 'NET-SNMP-EXTEND-MIB::nsExtendOutputFull."fsro"'
    NET-SNMP-EXTEND-MIB::nsExtendResult."fsro" = INTEGER: 2
    NET-SNMP-EXTEND-MIB::nsExtendOutputFull."fsro" = STRING: RO_MOUNTS CRITICAL: Found ro mounts: /sys/fs/cgroup

Great! The integer value for nsExtendResult is 2, and the nsExtendOutputFull value tells us that the read-only mount point we found was /sys/fs/cgroup.

Now restore your original server-side config to exclude tmpfs, and restart snmpd.

Creating Nagios checks

We’ll start by calling the Nagios command-line tools directly, to confirm that the techniques we used for snmpget will work. We want the check_snmp tool for this, as it allows us to check arbitrary SNMP values.

First, we’ll check that the integer value works as expected - anything above 0 will be CRITICAL:

    $ check_snmp -H gctssle02 -o 'NET-SNMP-EXTEND-MIB::nsExtendResult."fsro"' -c 0
    SNMP OK - 0 | 'NET-SNMP-EXTEND-MIB::nsExtendResult."fsro"'=0

Second, we check that we can parse the string value - we’re looking for it to contain “OK”, using a regex:

    $ check_snmp -H gctssle02 -r 'RO_MOUNTS OK' -o 'NET-SNMP-EXTEND-MIB::nsExtendOutputFull."fsro"'
    SNMP OK - RO_MOUNTS OK: No ro mounts found |

 

Now let’s simulate failure to make sure our checks fail as we expect. Again, we’re going to modify the server-side snmpd.conf to not exclude tmpfs filesystems.

First, check that the non-zero integer value causes a critical alert:

    $ check_snmp -H gctssle02 -o 'NET-SNMP-EXTEND-MIB::nsExtendResult."fsro"' -c 0
    SNMP CRITICAL - *2* | 'NET-SNMP-EXTEND-MIB::nsExtendResult."fsro"'=2

Second, check that the absence of “OK” in the string output causes an alert:

    $ check_snmp -H gctssle02 -r 'RO_MOUNTS OK' -o 'NET-SNMP-EXTEND-MIB::nsExtendOutputFull."fsro"'
    SNMP CRITICAL - *RO_MOUNTS CRITICAL: Found ro mounts: /sys/fs/cgroup* |

Excellent! We could theoretically use just one of these. However, integer parsing is a safer way to detect an error, and the string representation gives us useful information when an alert fires, so we’ll configure both.
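To make that division of labour concrete, here is a rough shell sketch of how the two values complement each other: the integer decides the exit status and the string supplies the human-readable text. The function report_ro_status is hypothetical and is not part of Nagios or check_snmp; it assumes the two values have already been fetched by some other means.

```shell
#!/bin/sh
# Sketch only: combine the two SNMP values into one Nagios-style result.
# $1 = nsExtendResult (integer), $2 = nsExtendOutputFull (string)
report_ro_status() {
    result=$1
    output=$2
    case "$result" in
        0|1|2|3) ;;        # usual plugin exit codes pass through unchanged
        *) result=3 ;;     # assumption: anything else is reported as UNKNOWN
    esac
    printf '%s\n' "$output"
    return "$result"
}

# Example: report_ro_status 2 'RO_MOUNTS CRITICAL: Found ro mounts: /sys/fs/cgroup'
```

The status text is printed as-is, so whatever mount points the server-side script listed remain visible to whoever reads the alert.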

Now let’s put these into Nagios. First, we’ll create the commands in /etc/nagios3/commands.cfg:

    define command{
        command_name check_readonlyfs_int
        command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o 'NET-SNMP-EXTEND-MIB::nsExtendResult."fsro"' -c 0
    }

    define command{
        command_name check_readonlyfs_str
        command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -r 'RO_MOUNTS OK' -o 'NET-SNMP-EXTEND-MIB::nsExtendOutputFull."fsro"' -s OK
    }

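Now assign these checks to a host. Note that the commands above do not pass -C, so check_snmp falls back to its default community string, public; the _SNMP_COMMUNITY custom variable in the host definition is not referenced by them.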

    define host{
        use                 generic-host    ; Name of host template to use
        host_name           gctssle02
        alias               GC SUSE test server
        address             gctssle02
        contact_groups      admins
        _SNMP_COMMUNITY     public
    }

    define service{
        use                 generic-service
        host_name           gctssle02
        service_description Read-only filesystem int
        check_command       check_readonlyfs_int
        check_interval      5
    }

    define service{
        use                 generic-service
        host_name           gctssle02
        service_description Read-only filesystem str
        check_command       check_readonlyfs_str
        check_interval      5
    }

 

Reload Nagios:

    sudo service nagios3 reload

Simulate failure:



And check that it works for success:


