Sysadmin ramblings: 2012

Saturday, November 10, 2012

Word games

So The Age publishes a word game - Target - in the weekend paper. The rules are simple: you get 9 letters and have to make as many words of 4 letters or more out of them as you can, and one of the letters must be included in every word. You also have to figure out the original 9-letter word that was scrambled to provide the 9 letters. I quite like the challenge of making words out of the letters, but I rarely figure out the 9-letter word. It frustrates me. So much so, in fact, that I seem to spend more time figuring out ways to write a program to solve it for me. Yeah, really.

Anyway, I couldn't be bothered doing it "properly" so I decided to do a quick command-line hack to solve it. Today's quiz had the letters H K I A R T B R M. Witness the following ugliness:

grep  '^.........$' /usr/share/dict/words|grep 'k'|grep 'r'|grep 'm'| grep 'h' |grep -E '[i]{1}'|grep -E '[a]+'|grep -v '[eou]'

There must be more elegant ways to achieve this, but still, it got the job done in a couple of minutes.

Oh, and the word was "birthmark" in case you wondered :-)

Edit: I solved this properly in perl today.
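
For comparison, here is a minimal sketch of the same idea as a single awk pass wrapped in a shell function. It is not the Perl solution mentioned above. The function name target_solve and its arguments are made up for illustration, and the compulsory letter is passed in because it changes from puzzle to puzzle. It counts how many of each letter are available and rejects any word that uses a letter more often than that:

```shell
#!/bin/sh
# target_solve LETTERS REQUIRED [WORDLIST]
# Hypothetical helper: prints every word of 4+ letters that can be built
# from LETTERS (each letter used at most as often as it appears) and that
# contains the REQUIRED letter. WORDLIST defaults to /usr/share/dict/words.
target_solve() {
    letters=$1
    required=$2
    wordlist=${3:-/usr/share/dict/words}
    awk -v letters="$letters" -v required="$required" '
    BEGIN {
        letters = tolower(letters)
        required = tolower(required)
        n = length(letters)
        # Count how many times each letter is available.
        for (i = 1; i <= n; i++) avail[substr(letters, i, 1)]++
    }
    {
        word = tolower($0)
        len = length(word)
        if (len < 4 || len > n) next
        if (index(word, required) == 0) next
        split("", used)              # portable way to empty an array
        ok = 1
        for (i = 1; i <= len; i++) {
            c = substr(word, i, 1)
            # Reject as soon as a letter is used more often than available.
            if (++used[c] > avail[c]) { ok = 0; break }
        }
        if (ok) print word
    }' "$wordlist"
}
```

Call it as target_solve hkiartbrm r, with whichever letter is compulsory that day as the second argument. Any 9-letter word in the output is the answer to the second half of the puzzle.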

Monday, September 10, 2012

NanoBSD 6.3 to 8.0 - what do I need to change?

Upgrading some NanoBSD boxes from FreeBSD 6.3 to 8.0, and adding BGP functionality along the way. A couple of config changes that are required:

Enable BGP

add /cfg/local/bgpd.conf and edit to suit (hint: AS and IP addresses ought to match what is assigned for the site)
add openbgpd_enable="YES" to /cfg/rc.conf
add _bgpd user account to /etc/passwd and /etc/group like this:

pw useradd "_bgpd" -u 130 -c "BGP Daemon" -d /var/empty -s /sbin/nologin
mount /cfg
cp /etc/group /cfg
cp /etc/passwd /cfg
cp /etc/pwd.db /cfg
cp /etc/spwd.db /cfg
mount -u -o ro /

NTPD changes

On boot, ntpd fails to start with errors such as:

Starting ntpd.
ERROR: only one configfile option allowed
ntpd - NTP daemon program - Ver. 4.2.4p5

In /cfg/rc.conf, change this:

ntpd_enable="YES"
ntpd_flags="-g -p /var/run/ntpd.pid -f /etc/ntpd.drift -c /etc/ntp.conf -t 3"

to this:

ntpd_enable="YES"
ntpd_config="/etc/ntp.conf" # ntpd(8) configuration file
ntpd_flags="-p /var/run/ntpd.pid -f /etc/ntpd.drift -t 3"

Wireless access point

Change ath0 interface config from this:

ifconfig_ath0="ssid bsdbox media autoselect mode 11g mediaopt hostap up"

... to this...

wlans_ath0="wlan0"
create_args_wlan0="wlanmode hostap"
ifconfig_wlan0="ssid bsdbox media autoselect mode 11g mediaopt hostap up"

Edit /cfg/hostapd.conf and change interface=ath0 to interface=wlan0

Edit /cfg/rc.conf and change the bridge members so that ath0 is removed, and wlan0 added

Other stuff

add "kern.maxfilesperproc=4096" to /cfg/sysctl.conf so that the newer version of BIND can start

Also, you can ignore all this stuff in dmesg:
FAILURE - READ_DMA status=51 error=10 LBA=15625215
ad0: FAILURE - READ_DMA status=51 error=10 LBA=15625215

Apparently it's just FreeBSD's way of telling you to relax and have fun :-) The pfSense folks have some info on it.

You can also relax about this error:
Starting named.
named[1302]: the working directory is not writable
That's because at boot, /etc/namedb/ isn't writable, but it becomes so once the mfs (RAM disk) is mounted there. I think...

DMA and Ultra DMA

Transcend UDMA 16GB CF cards do not work reliably - they will not boot on power-up (this is in a Soekris net5501). I can get them to boot by letting boot fail, drop into COMBios over serial, issue a reboot, then and only then will it boot. I suspect this is related to DMA levels supported by the net5501. Obviously not reliable enough for our purposes, so I have ordered some DMA66 SanDisk 4GB cards.

Monday, July 9, 2012

FreeBSD: show outgoing SMTP connections

To show active NAT sessions: pfctl -s state
To show just those going to SMTP ports: pfctl -s state | awk '$7 ~ /:25$/'

Helpful to find outgoing NAT sessions that might be caused by a spambot like, oh, let's say maybe cutwail.

And to show all SMTP sessions, both directions:
pfctl -s state | awk '$7 ~ /:25$/||$3 ~ /:25$/'
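
The two filters above depend on the address landing in a particular column. A slightly more forgiving sketch is to check every field for the port. The function name port_states is made up for illustration, and it assumes addresses appear as plain addr:port fields (anything wrapped in parentheses will not match):

```shell
#!/bin/sh
# port_states [PORTS]
# Hypothetical helper: reads "pfctl -s state" style lines on stdin and
# prints those where any whitespace-separated field ends in :PORT.
# PORTS is a comma-separated list and defaults to 25.
port_states() {
    ports=${1:-25}
    awk -v ports="$ports" '
    BEGIN { n = split(ports, p, ",") }
    {
        for (f = 1; f <= NF; f++)
            for (i = 1; i <= n; i++)
                # Anchor at end of field so :25 does not match :2525.
                if ($f ~ (":" p[i] "$")) { print; next }
    }'
}
```

Usage would be something like pfctl -s state | port_states 25,465,587 to catch the submission ports as well.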

Tuesday, May 29, 2012

Install as another (administrative) user on Windows XP

Gotta love it when you get an install package as an .exe file - right click and "Run As" to install it as an admin account. All good. But when you get an .msi do you really have to log out, then log in as an administrative user, just to install it? I've always known there must be a better way to do it, but it finally irritated me enough to find out. The trick is to use the runas command:

C:\Documents and Settings\myUser>runas /user:domain\admin-account "msiexec /i \"C:\Documents and Settings\myUser\My Documents\Downloads\someInstaller.msi\""

Some gotchas:

you must include the whole command you're passing to runas in quotes
if the path to the MSI file has spaces in it, you must put it in quotes
you can't nest quotes, so you'll have to use a backslash to escape the inner set of quotes

For the record, some bright sparks have suggested some simple reg hacking to create an "Install as" menu item in the right-click context menu. Nice idea, and it appeals to my sense of order and neatness. I'll have to try that next time it irritates me enough.

Thursday, April 19, 2012

Nagios UPTIME errors for a Windows 2003 client

We have a Windows 2003 server - one of a farm of 5 similar machines - that suddenly started reporting errors in Nagios:

nagios# /usr/local/libexec/nagios/check_nt -H mq-citrix-5 -v UPTIME
NSClient - ERROR: Could not get value

Looks like the issue was with the underlying Windows counters that nagios client uses to get this info - this was useful reading: http://nsclient.org/nscp/discussion/message/1066

So I did this to verify we had the same issue:

cd "\Program Files\NSClient++"
nsclient++.exe /test
(error output)

To fix (on w2k3) we need to force a counter rebuild:

cd \windows\system32
lodctr /R

I had to re-start the NSClient++ service, then test again from nagios:

nagios# /usr/local/libexec/nagios/check_nt -H mq-citrix-5 -v UPTIME
System Uptime - 0 day(s) 9 hour(s) 7 minute(s)

And we are happy again. In hindsight, I guess this server crashing 6 times in an hour must have corrupted the counters. Wait a couple of minutes, and this hits my inbox:

** RECOVERY alert - mq-citrix-5/UPTIME is OK **

Yay!

Courier IMAP - migrating all users' email

So we have a project underway to move all our users off our old postfix+courier+squirrelmail system to Microsoft Exchange 2010. Now, you might think this would be easy, but you would be wrong.

Some bits are okay - getting a list of all users and their job titles, photos etc. from LDAP is easy - ldapsearch is a pretty powerful tool. But the bit that I assumed would be easiest of all - importing all their existing mail into Exchange - has proved a little more difficult.

Exchange seems not to have any native tools to import Maildir (which is of course what we use) so I planned to use imapsync. Good theory, and some useful blog posts here and here point the way to setting up one user account as an administrator who can connect to any mailbox. But after several hours of flailing around, I failed. Here's what I tried:

In /etc/courier/authldaprc, set LDAP_AUXOPTIONS sharedgroup=group

In LDAP, use the (previously unused) sharedgroup attribute:

# ldapsearch -h ldap -x -b 'ou=People,dc=example,dc=com,dc=au' '(uid=migrate1)' sharedgroup -LLL
dn: uid=migrate1,ou=People,dc=example,dc=com,dc=au
sharedgroup: administrators

Then test:

# courieruserinfo migrate1
uid=10381
gid=100
home=/home/migrate1
authaddr=migrate1
authfullname=Email Migration User
maildir=
quota=
options=

Hmmm... options isn't set. OK, try the same with userdb:

# userdb migrate1 set options=group=administrators
# userdb -show migrate1
options=group=administrators
root@zappa:~# courieruserinfo migrate1
uid=10381
gid=100
home=/home/migrate1
authaddr=migrate1
authfullname=Email Migration User
maildir=
quota=
options=

Still not set! Why not? Ahhhh, bugger, we're using PAM auth (in /etc/courier/authdaemonrc, I've set authmodulelist="authpam")

and if you read the documentation carefully enough:
"The authentication library has a facility for keep arbitrary “name=value”-type settings, called “options”, for individual accounts. This feature is only available with userdb, LDAP, MySQL, and PostgresSQL modules. Individual account options are not supported with system-based authentication modules (password/shadow files, or PAM)."

Well that explains why it doesn't work... now how do we fix that? I can see a few options, which I guess I'll be trying out in the next few weeks. More to come.

Saturday, March 24, 2012

Disable offline files on Windows XP

For some reason, the GUI way of disabling offline files was greyed out, regardless of whether I logged in as local admin or a domain admin. Maybe because it was set up with the computer in a previous domain? Don't know, don't care. A regedit fixes it all: edit HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MRxSmb\Parameters\CSCEnabled and set it to 0x0. Reboot. Sorted.

Kudos to http://offlinefiles.blogspot.com.au/2010/10/disable-offline-files-registry-key.html for this info.

Wednesday, March 14, 2012

Secure hard drive destruction (or: Fun With Power Tools)

Usually when I dispose of a hard drive, I wipe it using DBAN - Darik's Boot And Nuke - secure enough for our purposes. But what about when the hard drive electronics or motor are damaged so that the disk can't be wiped?

Well then it's time for physical destruction:

Physical destruction of Maxtor hard drives using a drill press

Very satisfying!

PHP DateInterval gotcha!

I did some quick PHP code a little while back to produce timesheets - pretty simple: given a fixed start date in the past for fortnightly time sheets (say, perhaps, 2011-10-24), figure out what date the most recent time sheet should start on. So this function ought to be able to do that, right?

# this function is subtly wrong, do not use it!

function getCurrentTimesheetStartDate($ts_start_date)
{
        $current_date = date_create("now");
        $interval_from_start = date_diff($current_date, $ts_start_date);
        $interval_from_current = ($interval_from_start->d) % 14;
        $date_interval_spec = new DateInterval("P" . $interval_from_current . "D");
        $retval = date_sub($current_date, $date_interval_spec);
        return $retval;
}

Well, not quite :-( It turns out that $interval_from_start->d is only the days component of the difference, i.e. what's left over after whole years and months have been taken out. If the ts_start_date is more than a month back, it will be "wrong" unless you also account for $interval_from_start->m (and ->y) being higher than zero. The member I need is $interval_from_start->days, the total number of days between the two dates. Doh! So the corrected code is:

function getCurrentTimesheetStartDate($ts_start_date)
{
        $current_date = date_create("now");
        $interval_from_start = date_diff($current_date, $ts_start_date);
        $interval_from_current = ($interval_from_start->days) % 14;
        $date_interval_spec = new DateInterval("P" . $interval_from_current . "D");
        $retval = date_sub($current_date, $date_interval_spec);
        return $retval;
}

One long-running bug successfully squashed :-)
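
The same "total days, modulo 14" arithmetic can be sanity-checked from the command line. This is only a sketch, and it assumes GNU date (the -d option and the @epoch syntax), so it won't run as-is with FreeBSD's date. The function name timesheet_start is made up, and everything is done in UTC to keep daylight saving out of it:

```shell
#!/bin/sh
# timesheet_start START_DATE [TODAY]
# Hypothetical helper, GNU date assumed. Prints the start date (YYYY-MM-DD)
# of the fortnightly period containing TODAY, counting fortnights from
# START_DATE. Assumes START_DATE is not later than TODAY.
timesheet_start() {
    start=$1
    today=${2:-$(date -u +%Y-%m-%d)}
    start_s=$(date -u -d "$start" +%s) || return 1
    today_s=$(date -u -d "$today" +%s) || return 1
    # Total whole days between the two dates, not just the day-of-month part.
    days=$(( (today_s - start_s) / 86400 ))
    offset=$(( days % 14 ))
    date -u -d "@$(( today_s - offset * 86400 ))" +%Y-%m-%d
}
```

For example, timesheet_start 2011-10-24 2012-03-14 prints 2012-03-12: the two dates are 142 days apart, and 142 % 14 is 2.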

The Mysterious Case of the HP ProCurve PoE switch(es)

Lots of our sites use oldish Dell switches. They're nothing fancy, but they're pretty cheap, and they've stood the test of time - many of them are more than 5 years old, and never miss a beat. They just work.

One of our newer sites has a pair of fancy-pants HP ProCurve 2910al-48G-PoE switches (48 port, gigabit, power over ethernet). These switches were sufficiently expensive that we started joking that they must have gold-plated connectors and platinum cases. Still, we figured, you get what you pay for - spendy switches == extra good, right? Sadly not.

The first unpleasant surprise was when they arrived - both had chassis that were noticeably bent - as if someone had taken the "curve" part of ProCurve too literally. One was dead on arrival - wouldn't even power on. We kept the working-but-bent one, as we needed it ASAP. HP shipped the replacement for the DOA one, and life went on...

A year later another one died - quite spectacularly - the PoE power supply died, so while the switch kept on working, there was no power over ethernet, and any phones connected to it stopped working. I took out the power cord to power cycle the unit, and when I restored power, there was an unpleasant crackling noise, and acrid smoke issued from the unit. You've never seen two sysadmins move so quickly!

Again HP shipped a replacement, which died within a month, again with a fault in the PoE power supply. HP support asked us to run "show tech all" but we found this from "show log" most useful:

W 01/25/90 22:59:43 00576 chassis: 50V Power Supply 1 is Faulted. Failures: 2
W 01/25/90 22:59:44 00071 chassis: Power Supply failure: Supply: 1, Failures: 1
W 01/25/90 22:59:45 00578 chassis: Co-processor Unrecoverable fault on PoE controller 1

Yep, another dead PoE power supply. That was in 2010. Replacement shipped, life went on...

You guessed it... yesterday, I got a call to say that some of the phones at that site had stopped working. It wasn't too hard to guess the cause! The web UI showed no faults, but the port status display showed 5 ports were delivering PoE but had no ethernet link. Definitely not what I expect to see - once the phones are getting power, they boot and establish an ethernet link. So a 55 km drive to the site, and what do you know, another dead PoE power supply:


PoE fault lights on an HP ProCurve switch - a most unpleasant sight!

Just to be thorough, I also hooked up the LinkRunner to confirm what common sense was already telling me - yep, no power being delivered. Sigh. Before calling HP I checked the warranty status and it told me the warranty expires in 2108 - 96 years from now! I assumed this was an error, but when I called HP to order the replacement, I was told that it is correct - these switches are covered for life (I doubt I have another 96 years on this mortal coil, so probably somewhat beyond my life-span). So that's nice, I guess, given how unreliable they are.

Anyway, the nice lady at HP is shipping out another one, so I guess I'll go to the site again, and replace it yet again. I wonder how long this one will last.

In summary: nice features, expensive switches, totally unreliable.

I'm contemplating buying or making some sort of PoE monitoring device that we could monitor from Nagios, so we know when it fails.

Thursday, February 23, 2012

Bacula - testing excludes

I've been having some trouble getting my Bacula jobs to exclude certain directories. To test which files will be included (or excluded) you can get a listing of the files that Bacula would include for a particular job from bconsole:

root@rm-bac-1:~# bconsole
Connecting to Director localhost:9101
1000 OK: rm-bac-1-dir Version: 5.0.1 (24 February 2010)
Enter a period to cancel a command.
*estimate job=ep-server-user-shares listing
[listing follows]

Handy! Beats waiting for the job to run to find out what will get backed up.

In my case, the culprit was incorrect case for the directory to exclude.

Tuesday, February 21, 2012

ldapsearch - exporting photos to named files

Simple requirement: export all staff photos from our LDAP repository. That bit is easy:

ldapsearch -h ldap -x -t -b "ou=People,dc=example,dc=com,dc=au"

But the files end up being called things like file:///tmp/ldapsearch-jpegPhoto-G4hm3V - not quite what we're after here. We want the pictures with the user ID as part of the name - e.g. pyarra.jpeg

After a bit of experimentation, I came up with this simple, elegant one-liner. OK, it's not all that simple, or elegant, but it is one-line. One long, ugly line:

ldapsearch -h ldap -t -x -b 'ou=People,dc=example,dc=com,dc=au'  uid | awk '$1 ~ /uid:/ {print $2}' | while read LUID; do ldapsearch -h ldap -t -x -b 'ou=People,dc=example,dc=com,dc=au' "uid=$LUID" jpegPhoto | (FILENAME=$(awk '$1 ~ /jpegPhoto:</ {print $2}' | sed -e 's/file:\/\///'); mv "$FILENAME" "/tmp/mugshots/$LUID.jpeg"); done 

I probably should have bitten the bullet and done it as a Perl script. But that's the lure of the one-liner, eh? If I just go a little further, I'll have it!
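
For what it's worth, here is a rough sketch of how the one-liner might be split into something readable while staying in shell. The function names are made up. The ldapsearch flags, base DN and /tmp/mugshots directory are the ones used above. It assumes each entry has at most one jpegPhoto, and it skips users with no photo rather than letting mv fail on an empty filename:

```shell
#!/bin/sh
BASE='ou=People,dc=example,dc=com,dc=au'
OUTDIR=/tmp/mugshots

# Print one uid per line from LDIF on stdin.
uids_from_ldif() {
    awk '$1 == "uid:" { print $2 }'
}

# Print the temp file path from the first "jpegPhoto:< file://..." line
# on stdin, with the file:// prefix stripped.
photo_path_from_ldif() {
    awk '$1 == "jpegPhoto:<" { sub(/^file:\/\//, "", $2); print $2; exit }'
}

# Export every user's photo to $OUTDIR/<uid>.jpeg.
export_photos() {
    mkdir -p "$OUTDIR" || return 1
    ldapsearch -h ldap -x -b "$BASE" uid | uids_from_ldif |
    while read -r luid; do
        tmpfile=$(ldapsearch -h ldap -t -x -b "$BASE" "uid=$luid" jpegPhoto |
                  photo_path_from_ldif)
        # No jpegPhoto for this user: nothing to move.
        if [ -n "$tmpfile" ]; then
            mv "$tmpfile" "$OUTDIR/$luid.jpeg"
        fi
    done
}
```

Nothing happens until export_photos is called, so the two parsing functions can be tried out on saved ldapsearch output first.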

Tuesday, February 7, 2012

Nagios, check_openmanage and the dreaded out-of-date firmware

I started to add some nagios monitoring for one of our Dell PowerEdge 1950 servers, but was a bit puzzled when I got this response:

nagios# /usr/local/libexec/nagios/check_openmanage -s -H mq-citrix-4
WARNING: Controller 0 [PERC 6/i Integrated]: Firmware '6.1.1-0047' is out of date

Hmmm... I'm not sure I want to start dropping production servers to upgrade firmware, just to make the monitoring system happy. Luckily, the check_openmanage script is intelligently written, and offers lots of options to blacklist checks of some items. Cool!

So for us, I can simply do this:

nagios# /usr/local/libexec/nagios/check_openmanage -H mq-citrix-4 -b ctrl_fw=0

OK - System: 'PowerEdge 1950 III', SN: 'FW86Y1S', 16 GB ram (4 dimms), 1 logical drives, 2 physical drives

To make this work in the config file for Nagios, I added the _openmanage_options line to the host definition:

define host{
        use                     windows-server
        host_name               mq-citrix-4
        _openmanage_options     -b ctrl_fw=0
        }

Now I'm wondering if it's a little bit wrong to hide warnings about out-dated firmware. Oh well...

Sunday, January 15, 2012

Nsclient++ on Windows 2000 can't understand hostnames?

Just installed Nsclient++ on two boxes - one Windows 2003 server, the other Windows 2000 Server (yeah, I know, Windows 2000 server is getting a bit long in the tooth, but if it ain't totally broke...)

Anyway, I restricted which hosts could talk to the NSClient to just the Nagios server, called, amazingly, nagios.mydomain.com. For w2k3, that works, access is allowed. For the Windows 2000 Server, I had to go put the IP address in place of the hostname in nsc.ini before it would allow access. And yeah, the Windows 2000 server can resolve the IP address back to the hostname, using nslookup.

I don't know if that's a Windows 2000 oddity, a NSClient for w2k oddity, or just a sign that it's time to call it a day :-)

Tuesday, January 10, 2012

Solaris and tape drives

I was faced with an interesting question today: we have a DDS tape drive of some sort attached to a Solaris machine, and a tape we need to read from. How do we discover what type of tape drive it is, what device it's attached with, and go get some data off the tape?

To show attached SCSI devices: cfgadm -al (which lists a tape device at rmt/3)

To show the device details: iostat -E

bash-3.00# iostat -E
st4       Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ARCHIVE  Product: Python 04687-XXX Revision: 6610 Serial No:þÊÝºþÊÝºþÊÝº


Looked up the Python 04687-XXX - it's a DDS-2 tape drive according to
http://www.freebsd.org/doc/en/articles/storage-devices/x528.html#HW-STORAGE-PYTHON-04687.

The tape we wanted to restore from is a DDS-1 (identified by looking here):
http://en.wikipedia.org/wiki/Digital_Data_Storage#DDS-1

A tar tf /dev/rmt/3 failed - I guess the DDS-2 drive expects to find a DDS-2 tape in there, not a DDS-1. Maybe it would be happier if we told it to use a low-density type of tape?

You can tell the tape drive what density to use depending on which device
file you refer to - details here:
http://www.cyberciti.biz/tips/solaris-tape-device-names-and-control-the-tape-drive.html

To get Solaris to read from it as a low-density device: mt -f /dev/rmt/3l status

or tar xvf /dev/rmt/3l

And it works!

Monday, January 9, 2012

The Sparc ghetto

I recently obtained a Sun Fire V250 and an old Ultra 10 - some nice UltraSparc goodies to play with - yay!

Both are running Solaris 8, so of course, the first thing to do is spend some time getting some modern tools installed - SSH, Firefox. It looks like the old Blastwave team have split into two rival efforts - Blastwave and OpenCSW. OpenCSW looked a little simpler to get up and running so I went with that. Oh yeah, and of course, ditch NIS and use DNS.

The Ultra 10 is now dual-booting Solaris 8 and Debian for Sparc which I'm hoping will provide a native sparc buildhost for ReadyNAS binaries (since as noted earlier I had not much luck with cross-compiling!). I simply added another IDE hard drive, and from OpenBoot I either boot disk0 or boot disk1 depending on what flavour I feel like. There was one trick to getting the Debian installer working correctly, all cool after that. While it's not exactly fast, it's tolerable using WindowMaker for a desktop. I somehow think KDE or Gnome might be a bit too demanding for it though.

The Sun Fire V250 is rather different. It sounds like a light aircraft taxiing for take-off. This thing is noisy! However, it's also rather faster than the old Ultra 10. So far, it's just vanilla Solaris 8. However, since I pretty much don't wish to permanently reside in the Jurassic era, it's either going to need a newer Solaris, or some other operating system. I thought OpenIndiana might be worth a look-see, since it's the continuation of the now-murdered OpenSolaris project. Hmmm... no Sparc version available for download... what??? I mean, I know we all use x86 these days, but why no Sparc ISO images for what is essentially Solaris?

Seems the malady is wider-spread than I thought... can you download the latest Oracle VirtualBox binaries for Sparc? Why no, you cannot. x86 and amd64, for sure. You can get VirtualBox for Solaris 10 from Oracle. Still... odd, no?

Kinda feels like poor old Sparc users are in a ghetto :-(