Gawk for dummies - Part I

Filed under Gawk, Programming.

Series table of contents:

  1. Gawk for dummies - Part I
  2. Gawk for dummies - Part II
  3. Gawk for dummies - Part III

Deep in the bowels of most UNIX-based systems lies “gawk”, a little-known command-line application that can make your dealings with the ever-pervasive text files much easier. This is the first in a series of posts that introduce the basics of this powerful tool.

Gawk can receive commands straight from the command line, or it can be used as an interpreter for a more complex script. Each “gawk” script has three different parts, all of which are optional.

  • The “BEGIN” block is run before any input files are read and is mostly used for initialization and command-line parameter validation.
  • The “END” block is run after all input files have been read and can be used to print summaries and statistics.
  • Every other code block is of the form /REGEXP/{CODE}, where /REGEXP/ is an optional regular expression that selects which records the CODE is applied to. If the regexp is absent, the {CODE} is applied to every record.
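
To make the third kind of block a little more concrete, here is a minimal sketch. The file “sample.txt” and its contents are made up for this illustration, and so are the variable names “starts_with_a” and “total”. The script falls back to the system awk if gawk is not installed, since nothing here is gawk-specific.

```shell
# Hypothetical sample data, created only for this illustration.
printf 'apple 3\nbanana 5\navocado 2\n' > sample.txt

# Use gawk if present; otherwise fall back to the system awk (only POSIX features are used).
AWK=$(command -v gawk || command -v awk)

# The first block runs only for records matching /^a/; the second block has
# no pattern, so it runs for every record. END prints the two counters.
"$AWK" '
/^a/ { starts_with_a++ }
     { total++ }
END  { print starts_with_a, "of", total, "records start with a" }
' sample.txt
```

With the three lines above as input, this should print “2 of 3 records start with a”.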

Let’s sink our teeth into a simple example:

#!/sw/bin/gawk -f
# The "shebang" includes the -f option to tell gawk it is being used as an interpreter
 
BEGIN { # Run before any files are read.
  print "Let's start";
}
 
{ # For every record.
  printf("%u %s\n", NR, $0);
}
 
END{ # Executed at the very end.
  print "Is that the fat lady singing?";
}

We can copy this code into a file called “test1.awk”, make it executable, and run it from the command line:

chmod u+x test1.awk
./test1.awk test1.dat

The script simply prints a string before and after echoing the contents of the text file “test1.dat”, with each line preceded by its line number. This relatively simple example introduces several concepts:

  1. The “print” statement works similarly to the “echo” command in bash, with the difference that multiple arguments can be separated by commas; by default they are printed with a single space between them.
  2. Gawk treats each file as a flat database of records, each of which contains several fields. “NR” is a built-in variable that holds the total number of records gawk has seen so far. Because gawk by default treats each line of a text file as an individual record, NR gives us the number of lines read so far. If I had passed multiple file names to the script, NR would count the lines across all the files. FNR is a variable similar to NR, except that it gets reset at the beginning of each file, so the first record of every file has FNR equal to 1.
  3. The “printf” function should already be familiar to C programmers.
  4. Finally, $0 represents field 0, the whole record (here, the whole line). By default gawk splits each record into fields separated by white space. The total number of fields in each record (words in each line) is stored in NF, and each field can be accessed individually using $i, where i is any expression that evaluates to an integer greater than or equal to 1 and less than or equal to NF. Each field is treated as a string or as a number depending on the context in which it is evaluated.
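
As a quick sketch of how NR, FNR, NF and $i behave when more than one file is passed, consider the following. The files “one.txt” and “two.txt” and their contents are invented for this illustration. FILENAME, which was not mentioned above, is another built-in variable that holds the name of the file currently being read, and $NF is simply $i with i equal to NF, that is, the last field.

```shell
# Two small hypothetical input files, created only for this illustration.
printf 'a b c\nd e\n' > one.txt
printf 'f g h i\n' > two.txt

# Use gawk if present; otherwise fall back to the system awk (only POSIX features are used).
AWK=$(command -v gawk || command -v awk)

# For each record print: the file name, NR (running total across files),
# FNR (count within the current file), NF (number of fields) and the last field.
"$AWK" '{ print FILENAME, NR, FNR, NF, $NF }' one.txt two.txt
```

This should print “one.txt 1 1 3 c”, “one.txt 2 2 2 e” and “two.txt 3 1 4 i”: NR keeps counting across the two files, while FNR starts again at 1 for the second one.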

Armed with this basic understanding of how gawk works, we can already perform several simple tasks, such as:

  • Count the number of words in a file:
    gawk '{sum+=NF;}END{print sum;}' foo.dat
  • Switch the order or remove columns from a file:
    gawk '{print $2,$1,$4;}' foo.dat
  • Perform simple calculations:
    gawk '{print $1,$2*$3;}' foo.dat
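
To see what these one-liners actually produce, here is a small worked sketch. The contents of “foo.dat” below are an assumption made purely for illustration (two lines with four whitespace-separated columns each); any file with at least four columns and numbers in columns 2 and 3 would do.

```shell
# Hypothetical contents for foo.dat: four whitespace-separated columns per line.
printf 'a 2 3 x\nb 4 5 y\n' > foo.dat

# Use gawk if present; otherwise fall back to the system awk (only POSIX features are used).
AWK=$(command -v gawk || command -v awk)

# Count the words: NF is 4 on each of the two lines, so the sum is 8.
"$AWK" '{ sum += NF } END { print sum }' foo.dat

# Reorder and drop columns: prints "2 a x" and then "4 b y".
"$AWK" '{ print $2, $1, $4 }' foo.dat

# Simple calculation on columns 2 and 3: prints "a 6" and then "b 20".
"$AWK" '{ print $1, $2 * $3 }' foo.dat
```

Note that in the last command the fields $2 and $3 are used as numbers because they appear in a multiplication, while $1 is printed as a plain string.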

In the next part of this series I’ll look into how we can tell gawk to split the file into records and each record into fields, plus a little bit on how loops work.

2 Responses to “Gawk for dummies - Part I”

  1. Brent Ashley Says:

    Great to see awk being promoted!

    To be clear, the expression before each code block isn’t limited to regular
    expressions - it can be anything resulting in a boolean value. For
    instance:

    # count comment-only lines in a file
    /^[ \t]*#/{ count++ }
    (count % 10) == 0 { print count, "comment lines" }
    END{ print count, "total comment lines" }
  2. L’angolo del Basetta | Alessios’ blog Says:

    […] Gawk for beginners (I wonder how anyone can do without it) […]


© Copyright 2004 Bruno Goncalves - All rights reserved
