Command Line Processing for awk Shebang Scripts

Most of the credit for this blog post goes to user "agama" over at https://www.unix.com/shell-programming-and-scripting/162459-awk-script-file-command-line-options.html - that may be worth reading too as it's probably better explained. I have however built out the example script (below) a fair bit.

Awk is a text processing and pattern matching programming language. It's Turing-complete, and therefore surprisingly powerful, while at the same time being rather limited outside its intended domain. Essentially it's meant to slice and dice text content, and it can do that incredibly well. But most Linux/Bash users know it as a provider of one-liners they barely understand, embedded in the midst of a stream of pipes.

Having used it that way myself for roughly twenty-five years, I thought I ought to understand it a bit better. So I borrowed O'Reilly's Effective awk Programming (Fourth edition) by Arnold Robbins from the library. There aren't a lot of books written about awk because of its limited usefulness: this one is the language's bible. I've been writing Bash scripts as long as I've been using Linux, so the first thing I wanted to know was: can I use awk with a shebang to write scripts? The answer is yes - but unlike Bash scripts, awk scripts need file or text input to do anything useful (yes, yes - awk can do things without file input ... but in practical terms you can't achieve much).

Which led me to the next question: is awk aware of its own command line, so I can validate and react to command line choices? The answer is again yes, with a CAVEAT: the book says command line parsing is inconsistent across platforms. The script I'm showing below works great in Linux ... but might see a different argument count on Solaris because their version of awk doesn't count awk as the zeroth parameter, or hit some similar problem. These days you could probably avoid this entirely by using recent builds of gawk (preferable not only for consistency, but also features), but it should definitely be kept in mind.
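To see what awk knows about its own command line, a quick experiment from the shell helps. This is only a sketch: show_args is a hypothetical helper name, and the two file names are placeholders that never get opened, because a program consisting only of a BEGIN block reads no input.

```shell
# Hypothetical helper: print every ARGV element awk receives.
# ARGV is indexed from 0 to ARGC - 1; ARGV[0] is the name awk was started under.
show_args() {
    awk 'BEGIN {
        for (i = 0; i < ARGC; i++)
            printf("ARGV[%d] = (%s)\n", i, ARGV[i])
    }' "$@"
}

# The files do not need to exist: a BEGIN-only program never opens them.
show_args first.txt "second file.txt"
```

The ARGV[0] line differs between implementations, while ARGV[1] and ARGV[2] hold the two file names. Note that arguments starting with a dash would be claimed by awk itself here, which is exactly the problem the script below works around.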

#!/bin/awk --exec
#
# https://www.unix.com/shell-programming-and-scripting/162459-awk-script-file-command-line-options.html
#
# Test with their example: <scriptname>  -v -f "some value" input-file1 input-file2
# (fails on any dash-options other than "v" and "f")
#

BEGIN {
    # Look for a gawk-specific variable:
    if ( PROCINFO["pid"] == "" )
        print "This is not gawk";
    else
        print "This is gawk";

    # Print our variables before processing:
    print "Before: ARGC is " ARGC;
    for ( k = 0; k < ARGC; k++ ) {      # ARGV runs from 0 to ARGC - 1
        printf( "ARGV[%d] = (%s)\n", k, ARGV[k] );
    }

    # Process our command line:
    for( i = 1; i < ARGC; i++ ) {
        if ( substr( ARGV[i], 1, 1 ) != "-" )  # assume first non -x is a file name
            break;

        if( ARGV[i] == "-v" )   # example option with no trailing data
        {
            verbose = 1;
            continue;           # loop to avoid error trap
        }

        if( ARGV[i] == "-f" )   # example option with trailing data
        {
            if ( i + 1 >= ARGC )    # make sure the option's value exists
            {
                print "option -f requires a value" >"/dev/stderr";
                exit(1);
            }
            foo = ARGV[i+1];
            i++;                # bad form, but it works
            continue;           # loop to avoid error catch
        }

        # suss out other desired options like above

        printf( "unrecognised option: %s\n", ARGV[i] ) >"/dev/stderr";
        exit(1);
    }

    j = 1;
    c = 1;
    for( i; i < ARGC; i++ ) {   # copy input file names down in argv
        ARGV[j++] = ARGV[i];
        c++;                    # new setting for ARGC
    }

    ARGC = c;    # number of file names shifted + 1 for argv[0] value

    # Show us remaining command line and variables:
    print "After: ARGC is " ARGC;
    for ( k = 0; k < ARGC; k++ ) {      # ARGV runs from 0 to ARGC - 1
        printf( "ARGV[%d] = (%s)\n", k, ARGV[k] );
    }
    print "verbose ('-v') is " verbose
}

This script does nothing useful; it's only an educational tool. It prints all the command line arguments before and after processing to show how awk command line processing can be made to work.

As well as doing a before-and-after printout of the variables, I've also added a test to determine if you're using gawk (as opposed to some other version of awk) by testing for the availability of a gawk-specific variable (tested against mawk with a "not gawk" result). (I would have preferred to use a built-in variable like AWK_VERSION, but it appears no such value exists.) This test should allow termination or a warning when "not gawk" is found.
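The same kind of check can be run from the shell before committing to a script. The sketch below relies on PROCINFO["version"], which gawk documents as holding its version string; other awks simply see an empty array. The helper name awk_flavour is made up for this example.

```shell
# Hypothetical helper: report whether a given awk command is gawk.
# Usage: awk_flavour [command]   (defaults to "awk")
awk_flavour() {
    "${1:-awk}" 'BEGIN {
        # "in" tests for the key without creating it as a side effect
        if ("version" in PROCINFO)
            print "gawk " PROCINFO["version"]
        else
            print "not gawk"
    }'
}

awk_flavour
```

On a system where awk is gawk this prints "gawk" followed by the version number; anywhere else it prints "not gawk".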

The author of the answer this is based on says "Personally, I prefer to wrap my awk with a shell script and let it do all of the command line parsing and other error checking. The script then invokes awk with one or more -v var=value options to pass in the desired data." This is probably a good idea although my "gawk detection" might make this more useful?
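For comparison, here is roughly what that wrapper approach might look like. It is a sketch rather than agama's actual code: the function name run_report and the tiny awk program are invented for illustration, and only the -v and -f options from the example above are handled.

```shell
# Hypothetical wrapper: the shell parses the options, awk only gets data.
run_report() {
    verbose=0
    foo=""
    OPTIND=1                        # reset so the function can be called again
    while getopts "vf:" opt; do
        case "$opt" in
            v) verbose=1 ;;         # option with no trailing data
            f) foo=$OPTARG ;;       # option with trailing data
            *) echo "usage: run_report [-v] [-f value] [file ...]" >&2
               return 1 ;;
        esac
    done
    shift $((OPTIND - 1))           # what remains are the input files

    # Pass the parsed settings in as awk variables.
    # Note: awk interprets backslash escapes in -v values.
    awk -v verbose="$verbose" -v foo="$foo" '
        BEGIN { if (verbose) printf("verbose on, foo is (%s)\n", foo) }
        { print NR ": " $0 }
    ' "$@"
}
```

Calling run_report -v -f "some value" input-file1 input-file2 leaves only the two file names for awk, so nothing inside the awk program needs to shuffle ARGV around.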

I have no idea if I'll use this, I may go no further in learning awk. But it was interesting and I hope it proves useful and/or educational to someone else.

Update / Partial Fix

2019-06-03: Multiple problems have been found with this idea.

  • on the Mac the system version is located at /usr/bin/awk (not the /bin/awk my shebang above calls)
  • Mac's default awk dates from 2007, and won't run this script at all: it barfs on the shebang and never gets to my oh-so-clever "is it gawk" test. My limited attempts to make it work failed.
  • if you've installed gawk with Homebrew (you really, really should), it will be first on the path as /usr/local/bin/awk, thus solving this problem ... if you use Homebrew
  • on Fedora, both /bin/awk and /usr/bin/awk are links to /usr/bin/gawk, but ...
  • I had assumed that Debian used gawk by default, but no! Their version is based on mawk and so behaves differently ... it will run the script above if you change --exec to -f. (gawk is optionally available for Debian.)
  • #!/usr/bin/env awk -f solves the path problem, but still won't work with Apple's default awk

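I've decided the best way to deal with this (and I admit it's not a good solution) is to use #!/usr/bin/env gawk --exec so the script fails immediately if gawk isn't available. One caveat: Linux hands everything after the interpreter path to env as a single argument, so an env shebang that carries an option needs to be written as #!/usr/bin/env -S gawk --exec there (the -S option requires GNU coreutils 8.30 or later). This will work for an individual who is willing to install gawk when needed, but obviously won't work for mass distribution of scripts.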
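If the shebang gymnastics prove too fragile, the same fail-fast behaviour could also come from a tiny shell launcher. This is a sketch only: require_cmd is a hypothetical helper, and myscript.awk is a placeholder name for the awk script.

```shell
# Hypothetical helper: succeed only if the named command is on the PATH.
require_cmd() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "error: $1 is required but was not found on PATH" >&2
        return 1
    fi
}

# Usage in a launcher script (myscript.awk is a placeholder name):
#   require_cmd gawk || exit 1
#   exec gawk -f myscript.awk -- "$@"
```

The -- in the usage lines tells gawk to stop looking for its own options, so dash-options meant for the script arrive in ARGV untouched.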