Gawk for dummies - Part I
Filed under Gawk, Programming.
Series table of contents:
- Gawk for dummies - Part I
- Gawk for dummies - Part II
- Gawk for dummies - Part III
Deep in the bowels of most UNIX-based systems lies “gawk”, a little-known command line application that can make your dealings with the ever-pervasive text files much easier. This is the first in a series of posts introducing the basics of this powerful tool.
Gawk can receive commands straight from the command line, or it can be used as an interpreter for a more complex script. Each “gawk” script has three different parts, all of which are optional.
- The “BEGIN” block is run before any input files are read and is mostly used for initialization and command line parameter validation.
- The “END” block is run after all input files have been read and can be used to print summaries and statistics.
- Every other code block is of the form /REGEXP/{CODE}, where /REGEXP/ is an optional regular expression used to select which records the CODE should be applied to. If the regexp is absent, the {CODE} is applied to every record.
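To make the /REGEXP/{CODE} form concrete, here is a minimal sketch that only acts on records containing the word “error”. The sample data and the variable names (AWK, tmp, out, n) are made up for this illustration, and the first line falls back to whatever awk is installed if gawk is missing.

```shell
# Use gawk if it is installed, otherwise fall back to any POSIX awk.
AWK=$(command -v gawk || command -v awk)

# Hypothetical sample data, created only for this illustration.
tmp=$(mktemp)
printf '%s\n' 'alpha ok' 'beta error' 'gamma ok' 'delta error' > "$tmp"

# BEGIN runs first, the /error/ block runs only for matching records,
# and END runs after the last record has been read.
out=$("$AWK" 'BEGIN { print "start" }
/error/ { n++; print "match:", $0 }
END { print n, "matching records" }' "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

With this input the two “error” lines are echoed between the BEGIN and END messages, and the final line reports 2 matching records.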
Let’s sink our teeth into a simple example:
    #!/sw/bin/gawk -f
    # The "shebang" includes the -f option to tell gawk it is being used as an interpreter.
    # Adjust the path if gawk lives elsewhere on your system (for example /usr/bin/gawk).
    BEGIN { # Run before any files are read.
        print "Let's start";
    }
    { # For every record.
        printf("%d %s\n", NR, $0);
    }
    END { # Executed at the very end.
        print "Is that the fat lady singing?";
    }
If we copy this code into a file called “test1.awk” and type at the command line:
    chmod u+x test1.awk
    ./test1.awk test1.dat
This script would simply print one string before and another after echoing the contents of the text file “test1.dat”, with each line preceded by its line number. With this relatively simple example I introduced several concepts:
- The “print” statement works similarly to the “echo” command in bash, with the difference that multiple arguments are separated by commas; by default gawk prints them with a single space between them.
- Gawk treats each file as a flat database of records, each of which contains several fields. “NR” is a variable maintained by gawk that holds the total number of records it has seen so far. By default, gawk treats each line in a text file as an individual record, so NR gives us the number of lines read so far. If I had passed multiple file names to the script, NR would keep counting across all the files. “FNR” is a similar variable, except that it is reset to zero at the beginning of each file, so it holds the record number within the current file.
- The “printf” function should already be familiar to C programmers. Unlike “print”, it does not append a newline, so the format string has to include “\n” itself.
- Finally, $0 represents field 0, the whole record (here, the whole line). By default, gawk splits each record into fields separated by white space. The total number of fields in each record (words in each line) is stored in NF, and each field can be accessed individually using $i, where i is any expression that evaluates to an integer greater than or equal to 1 and less than or equal to NF. Each field is treated as a string or as a number depending on the context in which it is evaluated.
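The difference between NR, FNR and NF is easiest to see with two input files. The sketch below is only an illustration: the files one.dat and two.dat and the variable names (AWK, dir, out) are made up, and the first line falls back to whatever awk is installed if gawk is missing.

```shell
# Use gawk if it is installed, otherwise fall back to any POSIX awk.
AWK=$(command -v gawk || command -v awk)

# Two hypothetical input files with a different number of words per line.
dir=$(mktemp -d)
printf '%s\n' 'a b c' 'd e' > "$dir/one.dat"
printf '%s\n' 'f' 'g h i j' > "$dir/two.dat"

# For every record print: records seen so far, record number within the
# current file, number of fields, the first field and the last field.
out=$("$AWK" '{ print NR, FNR, NF, $1, $NF }' "$dir/one.dat" "$dir/two.dat")

printf '%s\n' "$out"
rm -rf "$dir"
```

The first column (NR) keeps climbing from 1 to 4 across both files, while the second column (FNR) starts again at 1 when two.dat begins. Note that $NF uses an expression as the field index, so it always picks the last field of the record.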
Armed with this basic understanding of the workings of gawk, we can already perform several simple tasks, such as:
- Count the number of words in a file:
    gawk '{sum+=NF;}END{print sum;}' foo.dat
- Reorder or remove columns from a file:
    gawk '{print $2,$1,$4;}' foo.dat
- Perform simple calculations:
    gawk '{print $1,$2*$3;}' foo.dat
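To see what these three one-liners produce, here is a small sketch that runs them against a made-up two-line data file. The file contents and the variable names (AWK, tmp, words, cols, calc) are assumptions for this illustration only, and the first line falls back to whatever awk is installed if gawk is missing.

```shell
# Use gawk if it is installed, otherwise fall back to any POSIX awk.
AWK=$(command -v gawk || command -v awk)

# Hypothetical stand-in for foo.dat: four white-space separated columns.
tmp=$(mktemp)
printf '%s\n' 'apples 3 2 red' 'pears 10 5 green' > "$tmp"

# Count the words: add up the number of fields of every record.
words=$("$AWK" '{ sum += NF } END { print sum }' "$tmp")

# Swap columns 1 and 2 and drop column 3.
cols=$("$AWK" '{ print $2, $1, $4 }' "$tmp")

# Multiply columns 2 and 3; the fields are treated as numbers here.
calc=$("$AWK" '{ print $1, $2 * $3 }' "$tmp")

printf 'words: %s\n' "$words"
printf 'columns:\n%s\n' "$cols"
printf 'products:\n%s\n' "$calc"
rm -f "$tmp"
```

For this input the word count is 8, the reordered lines read “3 apples red” and “10 pears green”, and the products are 6 and 50.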
In the next part of this series I’ll look into how we can tell gawk to split the file into records and each record into fields, plus a little bit on how loops work.

April 17th, 2007 at 9:28 am
Great to see awk being promoted!
To be clear, the expression before each code block isn’t limited to regular
expressions - it can be anything resulting in a boolean value. For
instance:
April 24th, 2007 at 1:32 am
[…] Gawk for beginners (I wonder how anyone can do without it) […]