Wednesday, February 17, 2010

Awk

Today we will work with awk, a command-line tool on Unix. This article gives you a basic understanding of how awk works and of some of its internal structure.

Awk is a simple and elegant pattern scanning and processing language. It was created in the late 1970s, and its name is composed of the initials of its three original authors: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. It is commonly used as a command-line filter in pipelines to reformat the output of other commands.

Some other features of awk are:
Its ability to view a text file as made up of records and fields in a textual database.
Its use of variables to manipulate the database.
Its use of arithmetic and string operators.
Its use of common programming constructs such as loops and conditionals.
Its ability to generate formatted reports.

awk takes two inputs: a data file and a program (command) file.
The program file can be omitted, in which case the necessary commands are passed as a command-line argument.
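
As a small sketch of the two ways of supplying the commands, the snippet below runs the same one-line program first from a program file (using awk's -f option) and then as a command-line argument. The file names sample.txt and names.awk and the two sample records are made up for this illustration.

```shell
# Hypothetical data file with two records (illustration only).
printf 'kiran 30\nmadan 345\n' > sample.txt

# The program kept in its own file and loaded with -f.
cat > names.awk <<'EOF'
# print the first field of every record
{ print $1 }
EOF
awk -f names.awk sample.txt

# The same program passed directly as a command-line argument.
awk '{ print $1 }' sample.txt
```

Both invocations should print the same two names.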

The command-line syntax of awk is:

awk 'pattern { action }' filename

The pattern represents what awk is looking for in the data, and the action is a series of commands executed when a match for the pattern is found. Curly brackets enclose the action; they can be left out when there is no action, and they are used to group a series of instructions that belong to a specific pattern.

Understanding Fields: a common use of awk is to process files by formatting and then displaying the necessary data from them. Awk separates each input file into records. By default a record is a single line of input, and each record contains multiple fields. The default field separator is whitespace (spaces or tabs), and it can be changed.
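
To see the field separator at work, here is a minimal sketch that feeds one colon-separated line to awk, first with the default separator and then with -F: to split on colons. The sample line is an assumption made up for the illustration; it is not part of any file used in this article.

```shell
# Hypothetical colon-separated record (illustration only).
line='jagadesh:10:2010:h'

# Default separator: the line contains no blanks, so it is one single field.
printf '%s\n' "$line" | awk '{ print NF }'

# -F: makes the colon the field separator, which gives four fields.
printf '%s\n' "$line" | awk -F: '{ print NF, $1, $3 }'
```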

Let's use a simple text file to illustrate awk. Create a file named student with the following data:
jagadesh  10    2010  h
kiran        30    1997  k
pavan      123  10000 n
pavan      345   2009 j
jagan       345  400   p
madan    345   2007  k
naren     1009  1200 l
gagan    234   100    m

Now we will try a simple awk command:

awk '{print $1 " " $2}' student

This command prints the first field and the second field of every record. We passed awk an action and a file name as arguments.

Try this: awk '{print "first Name is MR." $1 " " $2}' student

Working With Patterns: an awk program consists of a pattern and a procedure:

pattern { procedure }

Both are optional. If the pattern is missing, { procedure } is applied to all lines; if { procedure } is missing, the matched line is printed.
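
The three combinations can be tried side by side. This sketch recreates the student file from above so that it runs on its own; the search string naren is just one of the names in that file.

```shell
# Recreate the sample data file used throughout this article.
cat > student <<'EOF'
jagadesh 10 2010 h
kiran 30 1997 k
pavan 123 10000 n
pavan 345 2009 j
jagan 345 400 p
madan 345 2007 k
naren 1009 1200 l
gagan 234 100 m
EOF

# Pattern only: every matching record is printed whole.
awk '/naren/' student

# Procedure only: the procedure runs for every record.
awk '{ print $1 }' student

# Pattern and procedure: the procedure runs only for matching records.
awk '/naren/ { print $1, $3 }' student
```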

A pattern can take any of the following forms:
/regular expression/
Relational expression
Pattern-matching expression
BEGIN and END

Regular Expression: let's search for a string in the file above. To use a regular expression, we write it as /string to search/. Here is a simple command that searches for "jagadesh":
awk '/jagadesh/' student : this command searches for the string jagadesh and prints every record that contains it.

Relational expression: we can use a relational expression to retrieve results. A simple one is awk '$2==10' student, which prints the records whose second field equals 10.

We can also print only the fields that we are interested in: awk '/kiran/ {print $2}' student
This prints the second field of each record that matches the pattern.
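
Relational expressions are not limited to ==. As a sketch, the commands below select records with a numeric comparison and with a string comparison and print only some of their fields. The threshold 300 and the name pavan are arbitrary choices for the illustration.

```shell
# Recreate the sample data file used throughout this article.
cat > student <<'EOF'
jagadesh 10 2010 h
kiran 30 1997 k
pavan 123 10000 n
pavan 345 2009 j
jagan 345 400 p
madan 345 2007 k
naren 1009 1200 l
gagan 234 100 m
EOF

# Numeric comparison: records whose second field is greater than 300.
awk '$2 > 300 { print $1, $2 }' student

# String comparison: third field of the records whose first field is "pavan".
awk '$1 == "pavan" { print $3 }' student
```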

Multiple commands can be applied to the same record by separating them with a ; as in
awk '/10/ {print $1 ;print $2}' student
This prints the first field and the second field, each on its own line, for every record in the student data file that matches the pattern.

We can also insert escape sequences such as newline '\n' or tab '\t' between the fields to display the data in an appropriate way.
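
A minimal sketch of both escape sequences, using the same student data; the labels "name:" and "value:" are made up for the illustration.

```shell
# Recreate the sample data file used throughout this article.
cat > student <<'EOF'
jagadesh 10 2010 h
kiran 30 1997 k
pavan 123 10000 n
pavan 345 2009 j
jagan 345 400 p
madan 345 2007 k
naren 1009 1200 l
gagan 234 100 m
EOF

# A tab between the first and second field of every record.
awk '{ print $1 "\t" $2 }' student

# A newline inside the output, so one record is shown on two lines.
awk '/kiran/ { print "name: " $1 "\nvalue: " $2 }' student
```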

Searching for multiple patterns is also possible with awk. This is done by including a '|' (alternation) in the regular expression, as in awk '/jagadesh|2010/' student. Here I am searching for records that contain jagadesh or 2010.
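
Because '|' means "or", the alternatives can match different records. A small sketch with two names that appear on different lines of the student file:

```shell
# Recreate the sample data file used throughout this article.
cat > student <<'EOF'
jagadesh 10 2010 h
kiran 30 1997 k
pavan 123 10000 n
pavan 345 2009 j
jagan 345 400 p
madan 345 2007 k
naren 1009 1200 l
gagan 234 100 m
EOF

# Records that contain kiran OR gagan.
awk '/kiran|gagan/ { print $1, $2 }' student
```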

Now let's try a more advanced example: searching for k in the file.
awk '/k/' student : this returns all the records with 'k' in them. From the data file above we get kiran and madan, who have k in their records. But I want only the records whose first field contains 'k'. This is where pattern-matching expressions come in.

Pattern Matching Expressions: as said above, if we need to check a particular field against our pattern, we use a pattern-matching expression:

awk '$1 ~ /k/' student : the "~" (tilde) operator makes sure that k is searched for in the first field only. This gives us only one record, [kiran].

Similarly, we can search for a 4 in the 3rd field:
awk '$3 ~ /4/' student

The opposite of the tilde operator is the negation operator '!~', which returns all the records that do not match the pattern:
awk '$1 !~ /k/' student : displays all the records that don't have 'k' in their first field.
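
The two operators can be compared on the same data. This sketch matches against the first and the fourth field and then negates an anchored expression. The pattern /an$/ (first field ending in "an") is an arbitrary choice for the illustration.

```shell
# Recreate the sample data file used throughout this article.
cat > student <<'EOF'
jagadesh 10 2010 h
kiran 30 1997 k
pavan 123 10000 n
pavan 345 2009 j
jagan 345 400 p
madan 345 2007 k
naren 1009 1200 l
gagan 234 100 m
EOF

# k in the first field only.
awk '$1 ~ /k/ { print $1 }' student

# k in the fourth field only.
awk '$4 ~ /k/ { print $1 }' student

# First field does NOT end in "an".
awk '$1 !~ /an$/ { print $1 }' student
```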

Before going on to the remaining patterns, 'BEGIN' and 'END', we will have a look at the built-in variables and operators that awk supports.

Built-in Variables: awk provides some built-in variables which can be used while performing a search on a data file. These are the built-in variables available in awk:

FS: field separator
NF: number of fields in the current record
NR: number of the current record
OFMT: output format for numbers, "%.6g" by default
OFS: output field separator
ORS: output record separator
RS: record separator
$0: entire input record
$n: nth field in current record

We can use the built-in variables to get better results. If we need to select data depending on the number of fields, we can write awk 'NF==4' student, which gets all the records that have 4 fields.
awk 'NF==4 && /jagadesh/' student : retrieves all the records that have 4 fields and contain 'jagadesh'.
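
A few more of the built-in variables in action, as a sketch on the same student file. The choice of record 2 and of "-" as output separator is arbitrary.

```shell
# Recreate the sample data file used throughout this article.
cat > student <<'EOF'
jagadesh 10 2010 h
kiran 30 1997 k
pavan 123 10000 n
pavan 345 2009 j
jagan 345 400 p
madan 345 2007 k
naren 1009 1200 l
gagan 234 100 m
EOF

# NR selects the second record; NF and $0 describe it.
awk 'NR == 2 { print NR, NF, $0 }' student

# OFS is placed between the items of print; $NF is the last field.
awk 'BEGIN { OFS = "-" } NR <= 2 { print $1, $NF }' student
```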

Operators: the following operators are available in awk:
= += -= *= /= %= ^= **=   : Assignment
||                        : Logical OR (short-circuit)
&&                        : Logical AND (short-circuit)
~ !~                      : Match regular expression and negation
< <= > >= != ==           : Relational operators
(blank)                   : Concatenation
+ -                       : Addition, subtraction
* / %                     : Multiplication, division, and modulus (remainder)
+ - !                     : Unary plus and minus, and logical negation
^ **                      : Exponentiation
++ --                     : Increment and decrement, either prefix or postfix
$                         : Field reference
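
Several of these operators can be tried without any input file by putting the whole program in a BEGIN rule. This is only a sketch; the variable name x and the numbers are made up.

```shell
# Run a BEGIN-only program (no input file needed) and keep its output.
out=$(awk 'BEGIN {
    x = 7
    x += 3                    # assignment operator: x becomes 10
    print x % 4               # modulus
    print 2 ^ 5               # exponentiation
    print "a" "b" x           # concatenation by writing values side by side
    print (x > 5 && x < 20)   # relational and logical operators give 1 or 0
    x++                       # increment
    print x
}')
printf '%s\n' "$out"
```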

BEGIN and END: BEGIN and END pattern rules can be applied to get better results. A BEGIN rule is executed before the first record is read, and an END rule is executed after all records are read. Normally, awk executes each block of your script's code once for each input line. However, there are many programming situations where you may need to execute initialization code before awk begins processing the text from the input file. For such situations, awk allows you to define a BEGIN block. Because the BEGIN block is evaluated before awk starts processing the input file, it's an excellent place to initialize the FS (field separator) variable, print a heading, or initialize other global variables that you'll reference later in the program.

Awk also provides another special block, called the END block. Awk executes this block after all lines in the input file have been processed. Typically, the END block is used to perform final calculations or print summaries that should appear at the end of the output stream.

Let's see a simple example:

awk '
BEGIN { print "jagadesh" }
/jagadesh/
END { print "done" }' student

Here I am searching for the pattern /jagadesh/. Before the first record is read, I print "jagadesh"; then awk searches for the string and prints the matching records. After all records are read, I print "done".

An awk program may have multiple BEGIN and/or END rules. They are executed in the order in which they appear: all the BEGIN rules at startup and all the END rules at termination. BEGIN and END rules may be intermixed with other rules. Multiple BEGIN and END rules are useful for writing library functions, because each library file can have its own BEGIN and/or END rule to do its own initialization and/or cleanup. The order in which library functions are named on the command line controls the order in which their BEGIN and END rules are executed.
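
As a sketch of how several BEGIN rules and an END rule fit around the per-record rule, the program below prints a two-line heading, lists the records whose third field looks like a year in the 2000s, and finishes with a summary. The program file name report.awk, the heading text, and the year range are assumptions made for this illustration.

```shell
# Recreate the sample data file used throughout this article.
cat > student <<'EOF'
jagadesh 10 2010 h
kiran 30 1997 k
pavan 123 10000 n
pavan 345 2009 j
jagan 345 400 p
madan 345 2007 k
naren 1009 1200 l
gagan 234 100 m
EOF

# Hypothetical program file; the BEGIN rules run in the order they appear.
cat > report.awk <<'EOF'
BEGIN { print "Name Year" }
BEGIN { print "---------" }
# per-record rule: count and print records with a third field between 2000 and 2100
$3 > 2000 && $3 < 2100 { print $1, $3; n++ }
# END runs once, after the last record; NR still holds the record count
END   { print n " of " NR " records matched" }
EOF
awk -f report.awk student
```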

Some examples:

Print the first names in the data file:

awk '
BEGIN { print "First Names" }
{ print $1 }' student

Print the sum of columns 2, 3 and 4 for each record (the 4th column is not numeric, so it counts as 0):

awk '
BEGIN { print "Totals" }
BEGIN { print "------" }
{ print $2+$3+$4 }
END { } ' student

Count the records that match jagadesh:
awk '/jagadesh/ {++x} END {print x}' student
Add up the second column of all records:
awk '{total += $2 } END {print total }' student
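
Running totals like these are often combined with NR in the END rule. A minimal sketch that sums the second column of the student file and also prints its average:

```shell
# Recreate the sample data file used throughout this article.
cat > student <<'EOF'
jagadesh 10 2010 h
kiran 30 1997 k
pavan 123 10000 n
pavan 345 2009 j
jagan 345 400 p
madan 345 2007 k
naren 1009 1200 l
gagan 234 100 m
EOF

# Sum the second field of every record, then print the sum and the average.
# The NR > 0 check avoids a division by zero on an empty file.
awk '{ total += $2 } END { if (NR > 0) print total, total / NR }' student
```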

There are many more examples to dig into.

Empty Pattern: an empty pattern is considered a match for every record in the file.
awk '{ print $0 }' student

Variables: variables in awk are assigned with the "=" operator, like
FS=","
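
Two common places to assign variables are a BEGIN rule and the -v command-line option. The sketch below shows both. The comma-separated sample line and the variable names limit and count are made up for the illustration.

```shell
# Recreate the sample data file used throughout this article.
cat > student <<'EOF'
jagadesh 10 2010 h
kiran 30 1997 k
pavan 123 10000 n
pavan 345 2009 j
jagan 345 400 p
madan 345 2007 k
naren 1009 1200 l
gagan 234 100 m
EOF

# Assign FS in a BEGIN rule so it is set before the first record is split.
printf 'kiran,30,1997\n' | awk 'BEGIN { FS = "," } { print $2 }'

# -v assigns a variable before the program starts; count is created on first use.
awk -v limit=300 '$2 > limit { count++ } END { print count }' student
```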

Arrays: arrays in awk are associative arrays, that is, each element consists of an index and the value associated with that index.

     Element 3     Value 30
     Element 1     Value "foo"
     Element 0     Value 8
     Element 2     Value ""

The pairs are shown in jumbled order because their order is irrelevant. One advantage of associative arrays is that the elements can be added at any time.

An array can be created by assigning to an element:

arr[0]="jagadesh"

or in a loop:

for(i=0;i<5;i++)
    arr[i]=i

Iterating over arrays: awk has a handy mechanism for iterating over arrays, a for construct that visits every index:
  for(x in myarray)
    print myarray[x]

Elements in the array can be deleted with the delete statement:
  delete myarray[1]
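
Putting these pieces together, here is a sketch that uses an associative array indexed by the first field to find names that occur more than once in the student file, followed by a small BEGIN-only program that shows the effect of delete. The array names seen and a are made up for the illustration.

```shell
# Recreate the sample data file used throughout this article.
cat > student <<'EOF'
jagadesh 10 2010 h
kiran 30 1997 k
pavan 123 10000 n
pavan 345 2009 j
jagan 345 400 p
madan 345 2007 k
naren 1009 1200 l
gagan 234 100 m
EOF

# Count how often each first field occurs, then report the repeated ones.
awk '{ seen[$1]++ }
     END { for (name in seen) if (seen[name] > 1) print name, seen[name] }' student

# After delete, the element is gone: one element is left and "x" is no longer an index.
awk 'BEGIN { a["x"] = 1; a["y"] = 2; delete a["x"]
             n = 0; for (k in a) n++
             print n, ("x" in a) }'
```

The order in which for (name in seen) visits the indexes is not defined, so output with more than one line is usually piped through sort.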

Escape Sequences :
Within string and regular expression constants, the following escape sequences may be used. Note: The \x escape sequence is a common extension.

Sequence   Meaning
\a         Alert (bell)
\b         Backspace
\f         Form feed
\n         Newline
\r         Carriage return
\t         Tab
\v         Vertical tab
\\         Literal backslash
\nnn       Octal value nnn
\xnn       Hexadecimal value nn
\"         Literal double quote (in strings)
\/         Literal slash (in regular expressions)


Functions: let's move on to the more advanced concept of using functions and writing our own.
There are 2 types of functions available:
Built-in
User-defined

Built-in functions come in 3 types: I/O, string and math. To call one of awk's built-in functions, write the name of the function followed by arguments in parentheses. A simple example is

awk '{ print sqrt(16) }' student

awk provides functions that work on numbers, like sin(x), cos(x) and sqrt(x), and string functions for tasks like getting the length of a string or splitting a string. There are also I/O functions, and gawk adds internationalization (i18n) and time and date functions.
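
Here is a sketch that calls a few built-in string functions and defines one user-defined function. The function name max, the program file name funcs.awk, and the sample strings are made up for this illustration.

```shell
# Recreate the sample data file used throughout this article.
cat > student <<'EOF'
jagadesh 10 2010 h
kiran 30 1997 k
pavan 123 10000 n
pavan 345 2009 j
jagan 345 400 p
madan 345 2007 k
naren 1009 1200 l
gagan 234 100 m
EOF

# Hypothetical program file holding the function and two rules.
cat > funcs.awk <<'EOF'
# User-defined function: returns the larger of two numbers.
function max(a, b) { return (a > b) ? a : b }

BEGIN {
    s = "jagadesh"
    print length(s)                      # number of characters in s
    print toupper(substr(s, 1, 3))       # first three characters, upper-cased
    n = split("10:2010:h", parts, ":")   # split on ":" into the array parts
    print n, parts[2]                    # element count and second element
}

# "+ 0" forces both fields to be compared as numbers inside max().
{ print $1, max($2 + 0, $3 + 0) }
EOF
awk -f funcs.awk student
```

The BEGIN rule prints three lines before any record is read, and the last rule then prints one line per record with the larger of its second and third fields.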

These are the basics of awk.

More detailed information can be found at
http://www.gnu.org/manual/gawk/html_node/index.html