Wednesday, July 16, 2014

Fun with tar

I had an interesting challenge today: we use an application (Pinnacle) that saves a collection of patient files as a tar archive. How can we tell which tar file contains which patients?

Each patient in the archive has a file called "Patient", so I started by extracting these and grepping them. Because tar doesn't match wildcards in member names by default, you first have to build a list of the files (one per patient) with a command substitution (the backticks), then pass it to tar:

$ tar xfO 20131025_03_HD1.1.tar `tar tf 20131025_03_HD1.1.tar | grep 'Patient$'` | grep -i lastname

Note: on Solaris you'll need gtar (on Solaris 10 boxes it's installed at /usr/sfw/bin/gtar) so that you can extract the file's contents to STDOUT using xfO (that's a capital letter O, not a zero).
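
If your tar is GNU tar, it can also do the pattern matching itself, which saves listing the archive first. This is only a sketch: it assumes a GNU tar recent enough to accept --wildcards and --wildcards-match-slash (an older gtar may behave differently), and that every Patient file sits inside at least one directory. The patient_files function and the TAR variable are names made up for this example.

```shell
# Sketch only: assumes GNU tar. On Solaris 10, run with TAR=/usr/sfw/bin/gtar.
TAR=${TAR:-tar}

# patient_files ARCHIVE
# Write the contents of every member whose last path component is "Patient"
# to STDOUT, in a single pass over the archive.
patient_files() {
    # --wildcards             treat the member name as a glob pattern
    # --wildcards-match-slash let "*" span several directory levels
    "$TAR" --wildcards --wildcards-match-slash -xOf "$1" '*/Patient'
}
```

Usage is then much the same as before: patient_files 20131025_03_HD1.1.tar | grep -i lastname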

However, there's an easier way to manage this. At the start of each tar file, there's a file called Institution which stores header information about the patients contained in this archive file. The section we're interested in looks like this:

  PatientLite ={
    PatientID = 98765;
    PatientPath = "Institution_123/Mount_0/Patient_98765";
    MountPoint = "Mount_0";
    FormattedDescription = "CLAUS&&Santa&&&&098765&&SL&&2013-10-23 11:20:03";
    DirSize = 349.607;
  };
  PatientLite ={
    PatientID = 12345;
    PatientPath = "Institution_123/Mount_0/Patient_12345";
    MountPoint = "Mount_0";
    FormattedDescription = "CHRISTMAS&&Mary&&&&012345&&MAG&&2013-10-23 11:20:14";
    DirSize = 262.177;
  };


OK, so it shouldn't be too hard to get this info:

$ gtar xfO 20131025_03_HD1.1.tar Institution | awk -F'=' '$1 ~ /FormattedDescription/ {print $2}'

 "CLAUS&&Santa&&&&098765&&SL&&2013-10-23 11:20:03";
 "CHRISTMAS&&Mary&&&&012345&&MAG&&2013-10-23 11:20:14";

... clean up those ampersands:

$ gtar xfO 20131025_03_HD1.1.tar Institution | awk -F'=' '$1 ~ /FormattedDescription/ {print $2}' | sed -e 's/&/ /g'

 "CLAUS  Santa    098765  SL  2013-10-23 11:20:03";
 "CHRISTMAS  Mary    012345  MAG  2013-10-23 11:20:14";

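The output above still carries the surrounding quotes and the trailing semicolon, and the empty field between the names and the ID has turned into a run of spaces. A rough alternative is to split on the double quote and then on "&&", which gives one tab-separated line per patient. The function name format_descriptions is invented here, and I'm not assuming anything about what each field means. On Solaris this probably wants nawk or /usr/xpg4/bin/awk rather than the old /usr/bin/awk.

```shell
# format_descriptions
# Read Institution text on STDIN and print each FormattedDescription as
# tab-separated fields, without the quotes or the trailing semicolon.
format_descriptions() {
    awk -F'"' '
        $1 ~ /FormattedDescription/ {
            # $2 is the text between the double quotes
            n = split($2, field, "&&")
            line = field[1]
            for (i = 2; i <= n; i++)
                line = line "\t" field[i]   # empty fields are kept as empty columns
            print line
        }'
}
```

For example: gtar xfO 20131025_03_HD1.1.tar Institution | format_descriptions
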

Now let's write something to catalog a directory full of these tar files:

$ for TARFILE in *.tar; do echo "Filename: $TARFILE"; (gtar xfO "$TARFILE" Institution | awk -F'=' '$1 ~ /FormattedDescription/ {print $2}' | sed -e 's/&/ /g'); done

Filename: 20131025_03_HD1.1.tar
 "EXAMPLE  Fred    012345  SL  2013-10-23 11:20:03";
 "EG  Robert    123456  MAG  2013-10-23 11:20:14";

  [etc]
Filename: 20131025_04_HD1.1.tar
 "CITIZEN  Jeanette    234567  SL  2013-10-23 11:20:03";
 "ALIAS  Dean    345678  MAG  2013-10-23 11:20:14";

  [etc]
Filename: 20131025_05_HD1.1.tar
 "MANCHU  Fu    456789  SL  2013-10-23 11:20:03";
 "KHAN  Ghengis    567890  MAG  2013-10-23 11:20:14";


Hey presto!
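
One possible refinement: if every patient line carries the name of its archive, a single grep over the catalog answers "which tar file is this patient in?" directly. The sketch below is one way to do that, not part of Pinnacle. The catalog function and the TAR variable are names invented for the example, and it assumes an awk with -v and gsub (nawk or /usr/xpg4/bin/awk on Solaris).

```shell
# Sketch only: assumes GNU tar. On Solaris 10, run with TAR=/usr/sfw/bin/gtar.
TAR=${TAR:-tar}

# catalog DIR
# Print "archive-name<TAB>description" for every patient listed in the
# Institution file of each DIR/*.tar.
catalog() {
    for tarfile in "$1"/*.tar; do
        [ -f "$tarfile" ] || continue   # no .tar files: skip the unexpanded pattern
        "$TAR" -xOf "$tarfile" Institution |
            awk -F'"' -v name="${tarfile##*/}" '
                $1 ~ /FormattedDescription/ {
                    gsub(/&+/, " ", $2)   # each run of ampersands becomes one space
                    print name "\t" $2
                }'
    done
}
```

Something like catalog . > catalog.txt followed by grep -i claus catalog.txt then shows the archive name at the start of each matching line.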