
Gawk for dummies - Part II

Filed under Gawk, Programming.

Series table of contents:

  1. Gawk for dummies - Part I
  2. Gawk for dummies - Part II
  3. Gawk for dummies - Part III

Deep in the bowels of most UNIX-based systems lies “gawk”, a little-known command line application that can make your dealings with the ever-pervasive text files much easier. In this second post of the series (you can find the first here), I look into how we can tell gawk to split a file into records and each record into fields, and a little bit into how loops work.

As we saw in Part I, Gawk looks at each file as if it were a flat database, divided into several records, each subdivided into fields. By default, each line is considered to be a record, with each whitespace-separated word being a field. Each record can have a different number of fields, and at each step the number of fields in the current record is stored in the variable NF. The total number of records seen so far is stored in NR, and the number of records read so far from the current file (you can pass several file names to gawk at the command line) is given by FNR.
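A quick way to see how these three counters behave is to print them for every record of two small files. This is only a sketch: the file names and their contents are made up for the demonstration, and I call plain awk, which behaves like gawk here.

```shell
# Two small throwaway files, made up for this demonstration.
tmp=$(mktemp -d)
printf 'a b c\nd e\n' > "$tmp/first.txt"
printf 'f\ng h i j\n' > "$tmp/second.txt"

show_counts() {
  # NR counts records across all files, FNR restarts at 1 for each
  # new file, and NF is the number of fields in the current record.
  awk '{ printf("%s NR=%d FNR=%d NF=%d\n", $1, NR, FNR, NF) }' "$@"
}

show_counts "$tmp/first.txt" "$tmp/second.txt"
# a NR=1 FNR=1 NF=3
# d NR=2 FNR=2 NF=2
# f NR=3 FNR=1 NF=1
# g NR=4 FNR=2 NF=4
```

Note how FNR goes back to 1 when the second file starts, while NR keeps counting.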

Defining Records

Gawk looks at RS, the record separator, to find out how each record is terminated. By default,

RS="\n";

representing a new line. By assigning a different value to this built-in variable, you can override the default behavior. For example, you can make sure each “C” statement is on its own line if you type:

gawk 'BEGIN{RS=";";}{printf("%s;\n",$0);}' HelloWorld.c

where we assigned “;” to the Record Separator variable. Since Gawk removes the RS from the $0 field, we had to reinclude it in the output. If HelloWorld.c is a simple C program:

#include<stdio.h>
int main(){ printf("Hello world\n"); return 0;}

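the end result would be:

#include<stdio.h>
int main(){ printf("Hello world\n");
 return 0;
}
;

Note that the space after the first “;” is kept at the start of the second record, and that the closing “}” (together with the final new line) forms a third record, so it also gets a “;” appended, on a line of its own.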
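To see exactly where the record boundaries fall, it can help to print each record between brackets together with its number. A minimal sketch, using a made-up input string and a hypothetical helper called show_records (plain awk is used, gawk behaves the same way here):

```shell
show_records() {
  # RS=";" ends a record at every semicolon; the separator itself
  # is not part of $0, so it does not appear between the brackets.
  awk 'BEGIN { RS = ";" } { printf("record %d: [%s]\n", NR, $0) }'
}

printf 'a = 1; b = 2;c = 3' | show_records
# record 1: [a = 1]
# record 2: [ b = 2]
# record 3: [c = 3]
```

Everything between two separators is kept as is, including the leading space in the second record.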

Defining Fields

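Now that we know how to delimit records, we look at how to split each one into individual fields. By default, gawk splits fields on white space (spaces “ ” and tabs “\t”), which behaves much like:

FS="[ \t]+";

The notation “[ ]+” matches one or more of the characters listed between the brackets, so any run of spaces and tabs counts as a single separator. (Strictly speaking, the built-in default is a single space, FS=" ", which gawk treats as a special case: it splits on runs of white space like the expression above, but also ignores white space at the start and end of the record.) By modifying this variable we can process other types of files. We can use: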

FS=",";

to process the CSV file format that is commonly used as a simple format to exchange data between different applications. Given the example input:

Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234,Blankman,,SomeTown, SD, 00298

we can process it using:

#!/sw/bin/awk -f
 
BEGIN{
  FS=",";
}
{
  printf("Name: %s\n\
Surname: %s\n\
Street: %s\n\
City: %s\n\
State: %s\n\
Zip Code: %s\n",$1,$2,$3,$4,$5,$6);
}

where we used “\” to be able to split the format string over multiple lines.
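It is worth checking what a plain FS="," does with the trickier parts of a CSV line. The sketch below uses a hypothetical helper, list_fields, and two made-up input lines (the second one is not from the example above) to show that empty fields are preserved, but that a comma inside a quoted value is still treated as a separator:

```shell
list_fields() {
  # Print every field of every record between brackets.
  awk 'BEGIN { FS = "," }
       { for (i = 1; i <= NF; i++) printf("%d=[%s]\n", i, $i) }'
}

# An empty field between two commas is kept as an empty string.
printf 'Blankman,,SomeTown\n' | list_fields
# 1=[Blankman]
# 2=[]
# 3=[SomeTown]

# A comma inside quotes is still treated as a separator.
printf '"Doe, John",NYC\n' | list_fields
# 1=["Doe]
# 2=[ John"]
# 3=[NYC]
```

So this simple approach is fine as long as no field contains a comma of its own.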

Variables, arrays and loops

So far we have only used built-in variables, but Gawk lets us easily define our own, simply by assigning to them. There is no need to declare a variable to be of any particular type, and Gawk will treat each variable in a sensible way depending on context. Gawk initializes all variables to the empty string when they are first used, and is smart enough to treat them as numbers in numerical operations (where the empty string counts as 0) and as strings in string operations.
Arrays can be defined in a similar way, by assigning directly to the position we are interested in. The array will automagically grow to accommodate all the data we assign to it, and we can use any mix of strings and numbers to index it. Since there is no restriction on the indices used in an array (they can be any valid value and are not constrained to be sequential), a smart way to iterate over all entries in an array is necessary. This is done using a special form of the for loop:

for(i in array)
    print i,array[i];
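The same loop can be tried out in a complete program. The sketch below counts how often each word appears in its input; the helper name count_words and the input sentence are made up, and the output is piped through sort only so that the order is predictable.

```shell
count_words() {
  # array[$i]++ creates the entry on first use (starting from 0),
  # so no declaration or initialization is needed.
  awk '{ for (i = 1; i <= NF; i++) array[$i]++ }
       END { for (i in array) print i, array[i] }' | LC_ALL=C sort
}

printf 'to be or not to be\n' | count_words
# be 2
# not 1
# or 1
# to 2
```

Here the indices of “array” are the words themselves and the values are the counts.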

In this example, at each step “i” will be assigned a different index of “array”. Gawk offers no guarantees on the order in which the array is traversed, even for purely numerical indices. The usual “C”-like form of the for loop is also available, and we use it to build a Gawk version of the common seq command line utility.

#!/sw/bin/awk -f
 
BEGIN{
  if(ARGC<3 || ARGC>4)
    {
      printf("Correct usage:\n\n%s <start> <end> [<step>]\n",ARGV[0]);
      exit(1);
    }
 
  start=ARGV[1];
  end=ARGV[2];
 
  if(ARGC==4)
    step=ARGV[3];
  else
    step=1;
 
  for(i=start;i<=end;i+=step)
    print i;
}

In this example, we use the for loop to iterate between “start” and “end” in steps of size “step”. We also introduced two built-in variables: ARGC, the number of command line parameters (counting ARGV[0]), and ARGV, the array containing those parameters. The syntax of the “if” statement should also be familiar to “C” programmers, and we use it to validate the command line parameters. If the number of parameters differs from what is expected, an error message is written, showing the parameters the script accepts and the order in which they appear. As in “C”, ARGV[0] is the name of the executable and the real parameters start at position 1, which is why the script requires ARGC to be 3 or 4.
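To try this out without saving the script to a file, the same ideas can be wrapped in two small shell functions. This is only a sketch: the names show_args and awk_seq are made up, the “+ 0” is my own addition to force the parameters to be treated as numbers, and the parameter check of the full script is left out.

```shell
show_args() {
  # ARGC also counts ARGV[0], so three parameters give ARGC=4.
  awk 'BEGIN {
         printf("ARGC=%d\n", ARGC)
         for (i = 1; i < ARGC; i++) printf("ARGV[%d]=%s\n", i, ARGV[i])
       }' "$@"
}

awk_seq() {
  # Same loop as the script above; "+ 0" forces numeric values.
  awk 'BEGIN {
         step = (ARGC == 4) ? ARGV[3] + 0 : 1
         for (i = ARGV[1] + 0; i <= ARGV[2] + 0; i += step) print i
       }' "$@"
}

show_args 1 10 3
# ARGC=4
# ARGV[1]=1
# ARGV[2]=10
# ARGV[3]=3

awk_seq 1 10 3
# 1
# 4
# 7
# 10
```

Since both programs consist only of a BEGIN block, the parameters are never opened as input files.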

In the next and final post of this series, I will look a little further into the possibilities of Gawk and provide several useful examples.

© Copyright 2004 Bruno Goncalves - All rights reserved
