-
awk
search for and process a pattern in a file

Format
awk [-Fc] -f program-file [file-list]
awk program [file-list]

Summary
The awk utility is a pattern-scanning and processing language. It
searches one or more files to see if they contain lines that match
specified patterns and then performs actions, such as writing the
line to the standard output or incrementing a counter, each time it
finds a match.

You can use awk to generate reports or filter text. It works
equally well with numbers and text; when you mix the two, awk will
almost always come up with the right answer.

The authors of awk (Alfred V. Aho, Peter J. Weinberger, and Brian
W. Kernighan) designed it to be easy to use and, to this end, they
sacrificed execution speed.
-
The awk utility takes its input from files you specify on the
command line or from its standard input.

The awk utility takes many of its constructs from the C programming
language. It includes the following features:

flexible format
conditional execution
looping statements
numeric variables
string variables
regular expressions
C's printf

Arguments
The first format uses a program-file, which is the pathname of a
file containing an awk program. See Description, below.

The second format uses a program, which is an awk program included
on the command line. This format allows you to write simple, short
awk programs without having to create a separate program-file. To
prevent the shell from interpreting the awk commands as shell
commands, it is a good idea to enclose the program in single
quotation marks.

The file-list contains pathnames of the ordinary files that awk
processes. These are the input files.
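The two formats can be compared side by side. The following is a minimal sketch, not taken from the original text: the scratch directory and the file names data and first_field are hypothetical and exist only for this illustration.

```shell
# Hypothetical scratch files; the names are illustrative only.
tmpdir=$(mktemp -d)
printf 'alpha 1\nbeta 2\n' > "$tmpdir/data"

# Second format: the program is given on the command line, in single quotes.
inline_out=$(awk '{ print $1 }' "$tmpdir/data")

# First format: the same program is stored in a program-file named with -f.
printf '{ print $1 }\n' > "$tmpdir/first_field"
file_out=$(awk -f "$tmpdir/first_field" "$tmpdir/data")

printf '%s\n' "$inline_out"
printf '%s\n' "$file_out"
rm -r "$tmpdir"
```

Both invocations should print the first field of each input line, because they run the same program.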
-
Options
If you do not use the -f option, awk uses the first command line
argument as its program.

-f program-file (file)  This option causes awk to read its program
from the program file given as the first command line argument.

-Fc (field)  This option specifies an input field separator c, to
be used in place of the default separators ([SPACE] and [TAB]). The
field separator can be any single character.

Description
An awk program consists of one or more program lines containing
a pattern and/or action in the following format:

pattern { action }

The pattern selects lines from the input file. The awk utility
performs the action on all lines that the pattern selects. You must
enclose the action within braces so that awk can differentiate it
from the pattern. If a program line does not contain a pattern,
awk selects all lines in the input file. If a program line does not
contain an action, awk copies the selected lines to its standard
output.
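The three forms of a program line (pattern only, action only, and both) can be sketched as follows. The three-line input is made up for this illustration and is not one of the document's data files.

```shell
# Hypothetical three-line input, used only for illustration.
input='apple 3
banana 7
cherry 5'

# Pattern only: the selected lines are copied to standard output.
pattern_only=$(printf '%s\n' "$input" | awk '/an/')

# Action only: the action runs on every line.
action_only=$(printf '%s\n' "$input" | awk '{ print $1 }')

# Pattern and action: the action runs only on lines the pattern selects.
both=$(printf '%s\n' "$input" | awk '/e/ { print $2 }')

printf '%s\n' "$pattern_only" "$action_only" "$both"
```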
-
To start, awk compares the first line in the input file (from
the file-list) with each pattern in the program-file or program.
If a pattern selects the line (if there is a match), awk takes the
action associated with the pattern. If the line is not selected,
awk takes no action. When awk has completed its comparisons for the
first line of the input file, it repeats the process for the next
line of input. It continues this process, comparing subsequent
lines in the input file, until it has read the entire file-list.

If several patterns select the same line, awk takes the actions
associated with each of the patterns in the order in which they
appear. It is therefore possible for awk to send a single line from
the input file to its standard output more than once.
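A small sketch of a line being selected by more than one pattern follows. The two-line input is invented for this illustration; each program line is a pattern with no action, so each match copies the line.

```shell
# Hypothetical input: the first line matches both patterns, the second neither.
lines='red apple
green pear'

# Two pattern-only program lines; a line that matches both is printed twice.
twice=$(printf '%s\n' "$lines" | awk '
/red/
/apple/')

printf '%s\n' "$twice"
```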
-
Patterns
You can use a regular expression (refer to Appendix A), enclosed
within slashes, as a pattern. The ~ operator tests to see if a
field or variable matches a regular expression. The !~ operator
tests for no match.

You can process arithmetic and character relational expressions
with the following relational operators.

Operator   Meaning
<          less than
<=         less than or equal to
==         equal to
!=         not equal to
>=         greater than or equal to
>          greater than

You can combine any of the patterns described above using the
Boolean operators || (OR) or && (AND).
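The relational and Boolean operators can be tried on a small table. This is a sketch under assumed data: the stock list (name, quantity, price) is hypothetical.

```shell
# Hypothetical stock list: name, quantity, unit price.
stock='bolt 120 0.25
nut 80 0.10
gear 15 4.50'

# Relational pattern on a numeric field: quantity below 100.
low=$(printf '%s\n' "$stock" | awk '$2 < 100 { print $1 }')

# Two patterns joined with &&: quantity at least 50 AND price at most 0.20.
cheap=$(printf '%s\n' "$stock" | awk '$2 >= 50 && $3 <= 0.20 { print $1 }')

# A regular expression match joined with || to a relational test.
either=$(printf '%s\n' "$stock" | awk '$1 ~ /^g/ || $2 == 120 { print $1 }')

printf '%s\n' "$low" "$cheap" "$either"
```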
-
The comma is the range operator. If you separate two patterns
with a comma on a single awk program line, awk selects a range of
lines beginning with the first line that contains the first
pattern. The last line awk selects is the next subsequent line that
contains the second pattern. After awk finds the second pattern, it
starts the process over by looking for the first pattern again.

Two unique patterns, BEGIN and END, allow you to execute commands
before awk starts its processing and after it finishes. The awk
utility executes the actions associated with the BEGIN pattern
before, and with the END pattern after, it processes all the files
in the file-list.

Actions
The action portion of an awk command causes awk to take action when
it matches a pattern. If you do not specify an action, awk performs
the default action, which is the print command (explicitly
represented as {print}). This action copies the record (normally a
line; see Variables, below) from the input file to awk's standard
output.

You can follow a print command with arguments, causing awk to print
just the arguments you specify. The arguments can be variables or
string constants. Using awk, you can send the output from a print
command to a file (>), append it to a file (>>), or pipe it to the
input of another program (|).

Unless you separate items in a print command with commas, awk
concatenates them. Commas cause awk to separate the items with the
output field separator (normally a [SPACE]; see Variables, below).

You can include several actions on one line within a set of braces
by separating them with semicolons.
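The effect of commas in a print command, and of semicolons between actions, can be sketched as follows. The one-line input and the variable name n are chosen for this illustration only.

```shell
# Without a comma the two fields are joined with nothing between them.
joined=$(printf 'ford mustang\n' | awk '{ print $1 $2 }')

# With a comma they are separated by the output field separator (a space).
spaced=$(printf 'ford mustang\n' | awk '{ print $1, $2 }')

# Two actions on one line, separated by a semicolon.
multi=$(printf 'ford mustang\n' | awk '{ n = 2; print $n, "is field", n }')

printf '%s\n' "$joined" "$spaced" "$multi"
```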
-
Comments
The awk utility disregards anything on a program line following a
pound sign (#). You can document an awk program by preceding
comments with this symbol.
Variables
You declare and initialize user variables when you use them (that
is, you do not have to declare them before you use them). In
addition, awk maintains program variables for your use. You can use
both user and program variables in the pattern and in the action
portion of an awk program. Following is a list of program
variables.

Variable   Represents
NR         record number of current record
$0         the current record (as a single variable)
NF         number of fields in the current record
$1-$NF     fields in the current record
FS         input field separator (default: [SPACE] or [TAB])
OFS        output field separator (default: [SPACE])
RS         input record separator (default: [NEWLINE])
ORS        output record separator (default: [NEWLINE])
FILENAME   name of the current input file
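A short sketch of the program variables in use follows. The colon-separated input is hypothetical; the -F option sets the input field separator for the run.

```shell
# Hypothetical colon-separated records with different numbers of fields.
records='a:b:c
d:e'

# -F: sets the input field separator to a colon. For each record, print
# the record number (NR), the field count (NF), and the last field ($NF).
report=$(printf '%s\n' "$records" | awk -F: '{ print NR, NF, $NF }')

printf '%s\n' "$report"
```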
-
The input and output record separators are, by default,
[NEWLINE] characters. Thus, awk takes each line in the input file
to be a separate record and appends a [NEWLINE] to the end of each
record that it sends to its standard output. The input field
separators are, by default, [SPACE]s and [TAB]s. The output field
separator is a [SPACE]. You can change the value of any of the
separators at any time by assigning a new value to its associated
variable. Also, the input field separator can be set on the command
line using the -F option.

Functions
The functions that awk provides for manipulating numbers and
strings follow.
Name                  Function
length(str)           returns the number of characters in str; if
                      you do not supply an argument, it returns the
                      number of characters in the current input
                      record
int(num)              returns the integer portion of num
index(str1, str2)     returns the index of str2 in str1 or 0 if
                      str2 is not present
split(str, arr, del)  places elements of str, delimited by del, in
                      the array arr[1]...arr[n]; returns the number
                      of elements in the array
sprintf(fmt, args)    formats args according to fmt and returns the
                      formatted string; mimics the C programming
                      language function of the same name
substr(str, pos, len) returns a substring of str that begins at pos
                      and is len characters long
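The functions listed above can be exercised from a BEGIN action, which needs no input file. This is a sketch; the string values and the names s, n, and parts are chosen for illustration.

```shell
result=$(awk 'BEGIN {
    s = "chevy impala"
    print length(s)                  # number of characters in s
    print index(s, "imp")            # position of "imp" within s
    print substr(s, 7, 3)            # 3 characters starting at position 7
    print int(7.9)                   # integer portion of 7.9
    n = split("a:b:c", parts, ":")   # fills parts[1] through parts[3]
    print n, parts[1], parts[3]
    print sprintf("%03d", 7)         # returns the formatted string
}')

printf '%s\n' "$result"
```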
-
Operators
The following awk arithmetic operators are from the C programming
language.

Operator   Function
*          multiplies the expression preceding the operator by the
           expression following it
/          divides the expression preceding the operator by the
           expression following it
%          takes the remainder after dividing the expression
           preceding the operator by the expression following it
+          adds the expression preceding the operator and the
           expression following it
-          subtracts the expression following the operator from the
           expression preceding it
=          assigns the value of the expression following the
           operator to the variable preceding it
++         increments the variable preceding the operator
--         decrements the variable preceding the operator
+=         adds the expression following the operator to the
           variable preceding it and assigns the result to the
           variable preceding the operator
-=         subtracts the expression following the operator from the
           variable preceding it and assigns the result to the
           variable preceding the operator
*=         multiplies the variable preceding the operator by the
           expression following it and assigns the result to the
           variable preceding the operator
/=         divides the variable preceding the operator by the
           expression following it and assigns the result to the
           variable preceding the operator
%=         takes the remainder, after dividing the variable
           preceding the operator by the expression following it,
           and assigns the result to the variable preceding the
           operator

Associative Arrays
An associative array is one of awk's most powerful features. An
associative array uses strings as its indexes. Using an associative
array, you can mimic a traditional array by using numeric strings
as indexes.

You assign a value to an element of an associative array just as
you would assign a value to any other awk variable. The format is
shown below.

array[string] = value

The array is the name of the array, string is the index of the
element of the array you are assigning a value to, and value is the
value you are assigning to the element of the array.
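Assigning to elements of an associative array can be sketched with a simple tally. The input words and the array name count are illustrative; each input line supplies the string used as the index.

```shell
# Tally how often each word appears, using the word itself as the index.
tally=$(printf 'chevy\nford\nchevy\n' | awk '
{ count[$1] = count[$1] + 1 }      # array[string] = value
END { print count["chevy"], count["ford"] }')

printf '%s\n' "$tally"
```

An element that has not been assigned yet evaluates to zero in a numeric context, so no initialization is needed before the first addition.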
-
There is a special for structure you can use with an awk array.
The format is:

for (elem in array) action

The elem is a variable that takes on the value of each index of
the array in turn as the for structure loops through them, array is
the name of the array, and action is the action that awk takes for
each element in the array. You can use the elem variable in this
action (for example, as array[elem] to retrieve the element).

The Examples section contains programs that use associative arrays.

Printf
You can use the printf command in place of print to control the
format of the output that awk generates. The awk version of printf
is similar to that of the C language. A printf command takes the
following format:

printf "control-string", arg1, arg2, ..., argn

The control-string determines how printf will format arg1-n. The
arg1-n can be variables or other expressions. Within the
control-string, you can use \n to indicate a [NEWLINE] and \t to
indicate a [TAB].

The control-string contains conversion specifications, one for each
argument (arg1-n). A conversion specification has the following
format:
-
%[-][x[.y]]conv

The - causes printf to left justify the argument. The x is the
minimum field width, and the .y is the number of places to the
right of a decimal point in a number. The conv is a letter from the
following list. Refer to the following Examples section for
examples of how to use printf.

conv   Conversion
d      decimal
e      exponential notation
f      floating-point number
g      use f or e, whichever is shorter
o      unsigned octal
s      string of characters
x      unsigned hexadecimal
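A few conversion specifications can be tried directly from a BEGIN action. This is a sketch with made-up values; the square brackets are literal characters added so that the padding is visible.

```shell
formatted=$(awk 'BEGIN {
    printf "[%-8s]\n", "nova"      # - left justifies within 8 columns
    printf "[%8s]\n", "nova"       # without -, the string is right justified
    printf "[%5d]\n", 42           # decimal, minimum width 5
    printf "[%8.2f]\n", 3000       # width 8, two places after the point
    printf "[%x] [%o]\n", 255, 8   # hexadecimal and octal
}')

printf '%s\n' "$formatted"
```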
-
Examples
A simple awk program is shown below.

{ print }

This program consists of one program line that is an action. It
uses no pattern. Because the pattern is missing, awk selects all
lines in the input file. Without any arguments, the print command
prints each selected line in its entirety. This program copies the
input file to its standard output.

The following program has a pattern part without an explicit
action.

/jenny/

In this case, awk selects all lines from the input file that
contain the string jenny. When you do not specify an action, awk
assumes the action to be print. This program copies all the lines
in the input file that contain jenny to its standard output.

The following examples work with the cars data file. From left to
right, the columns in the file contain each car's make, model, year
of manufacture, mileage, and price. All white space in this file is
composed of single [TAB]s (there are no [SPACE]s in the file).
-
$ cat cars
plym    fury    77      73      2500
chevy   nova    79      60      3000
ford    mustang 65      45      10000
volvo   gl      78      102     9850
ford    ltd     83      15      10500
chevy   nova    80      50      3500
fiat    600     65      115     450
honda   accord  81      30      6000
ford    thundbd 84      10      17000
toyota  tercel  82      180     750
chevy   impala  65      85      1550
ford    bronco  83      25      9500

The first example below selects all lines that contain the string
chevy. The slashes indicate that chevy is a regular expression.
This example has no action part.

Although neither awk nor shell syntax requires single quotation
marks on the command line, it is a good idea to use them, because
they prevent many problems. If the awk program you create on the
command line includes [SPACE]s or any special characters that the
shell will interpret, you must quote them. Always enclosing the
program in single quotation marks is the easiest way of making sure
you have quoted any characters that need to be quoted.

$ awk '/chevy/' cars
chevy   nova    79      60      3000
chevy   nova    80      50      3500
chevy   impala  65      85      1550
The next example selects all lines from the file (it has no
pattern part). The braces enclose the action part; you must always
use braces to delimit the action part, so that awk can distinguish
the pattern part from the action part. This example prints the
third field ($3), a [SPACE] (indicated by the comma), and the first
field ($1) of each selected line.
-
$ awk '{print $3, $1}' cars
77 plym
79 chevy
65 ford
78 volvo
83 ford
80 chevy
65 fiat
81 honda
84 ford
82 toyota
65 chevy
83 ford

The next example includes both a pattern and an action part. It
selects all lines that contain the string chevy and prints the
third and first fields from the lines it selects.

$ awk '/chevy/ {print $3, $1}' cars
79 chevy
80 chevy
65 chevy
-
The next example selects lines that contain a match for the
regular expression h. Because there is no explicit action, it
prints all the lines it selects.

$ awk '/h/' cars
chevy   nova    79      60      3000
chevy   nova    80      50      3500
honda   accord  81      30      6000
ford    thundbd 84      10      17000
chevy   impala  65      85      1550

The next pattern uses the matches operator (~) to select all
lines that contain the letter h in the first field.

$ awk '$1 ~ /h/' cars
chevy   nova    79      60      3000
chevy   nova    80      50      3500
honda   accord  81      30      6000
chevy   impala  65      85      1550

The caret (^) in a regular expression forces a match at the
beginning of the line or, in this case, the beginning of the first
field.

$ awk '$1 ~ /^h/' cars
honda   accord  81      30      6000

A pair of brackets surrounds a character class definition
(refer to Appendix A, Regular Expressions). Below, awk selects all
lines that have a second field that begins with t or m. Then it
prints the third and second fields, a dollar sign, and the fifth
field.
-
$ awk '$2 ~ /^[tm]/ {print $3, $2, "$" $5}' cars
65 mustang $10000
84 thundbd $17000
82 tercel $750

The next example shows three roles that a dollar sign can play
in an awk program. A dollar sign followed by a number forms the
name of a field. Within a regular expression, a dollar sign forces
a match at the end of a line or field (5$). Within a string, you
can use a dollar sign as itself.

$ awk '$3 ~ /5$/ {print $3, $1, "$" $5}' cars
65 ford $10000
65 fiat $450
65 chevy $1550

Below, the equals relational operator (==) causes awk to perform
a numeric comparison between the third field in each line and the
number 65. The awk command takes the default action, print, on
each line that matches.

$ awk '$3 == 65' cars
ford    mustang 65      45      10000
fiat    600     65      115     450
chevy   impala  65      85      1550
-
The next example finds all cars priced at or under $3000.

$ awk '$5 <= 3000' cars
plym    fury    77      73      2500
chevy   nova    79      60      3000
fiat    600     65      115     450
toyota  tercel  82      180     750
chevy   impala  65      85      1550

When you use double quotation marks, awk performs textual
comparisons, using the ASCII collating sequence as the basis of the
comparison. Below, awk shows that the strings 450 and 750 fall in
the range that lies between the strings 2000 and 9000.

$ awk '$5 >= "2000" && $5 < "9000"' cars
plym    fury    77      73      2500
chevy   nova    79      60      3000
chevy   nova    80      50      3500
fiat    600     65      115     450
honda   accord  81      30      6000
toyota  tercel  82      180     750

When you need a numeric comparison, do not use quotation marks.
The next example gives the correct results. It is the same as the
previous example but omits the double quotation marks.
-
$ awk '$5 >= 2000 && $5 < 9000' cars
plym    fury    77      73      2500
chevy   nova    79      60      3000
chevy   nova    80      50      3500
honda   accord  81      30      6000

Next, the range operator (,) selects a group of lines. The first
line it selects is the one specified by the pattern before the
comma. The last line is the one selected by the pattern after the
comma. If there is no line that matches the pattern after the
comma, awk selects every line up to the end of the file. The
example selects all lines starting with the line that contains
volvo and concluding with the line that contains fiat.

$ awk '/volvo/ , /fiat/' cars
volvo   gl      78      102     9850
ford    ltd     83      15      10500
chevy   nova    80      50      3500
fiat    600     65      115     450

After the range operator finds its first group of lines, it
starts the process over, looking for a line that matches the
pattern before the comma. In the following example, awk finds three
groups of lines that fall between chevy and ford. Although the
fifth line in the file contains ford, awk does not select it
because, at the time it is processing the fifth line, it is
searching for chevy.
-
$ awk '/chevy/ , /ford/' cars
chevy   nova    79      60      3000
ford    mustang 65      45      10000
chevy   nova    80      50      3500
fiat    600     65      115     450
honda   accord  81      30      6000
ford    thundbd 84      10      17000
chevy   impala  65      85      1550
ford    bronco  83      25      9500

When you are writing a longer awk program, it is convenient to put
the program in a file and reference the file on the command line.
Use the -f option, followed by the name of the file containing the
awk program.

Following is an awk program that has two actions and uses the BEGIN
pattern. The awk utility performs the action associated with BEGIN
before it processes any of the lines of the data file. The
pr_header awk program uses BEGIN to print a header.

The second action, {print}, has no pattern part and prints all the
lines in the file.

$ cat pr_header
BEGIN   {print "Make    Model   Year    Miles   Price"}
        {print}
-
$ awk -f pr_header cars
Make    Model   Year    Miles   Price
plym    fury    77      73      2500
chevy   nova    79      60      3000
ford    mustang 65      45      10000
volvo   gl      78      102     9850
ford    ltd     83      15      10500
chevy   nova    80      50      3500
fiat    600     65      115     450
honda   accord  81      30      6000
ford    thundbd 84      10      17000
toyota  tercel  82      180     750
chevy   impala  65      85      1550
ford    bronco  83      25      9500

In the previous and following examples, the white space in the
headers is composed of single [TAB]s, so that the titles line up
with the columns of data.

$ cat pr_header2
BEGIN   {
print "Make    Model   Year    Miles   Price"
print "----------------------------------------"
}
        {print}
-
$ awk -f pr_header2 cars
Make    Model   Year    Miles   Price
----------------------------------------
plym    fury    77      73      2500
chevy   nova    79      60      3000
ford    mustang 65      45      10000
volvo   gl      78      102     9850
ford    ltd     83      15      10500
chevy   nova    80      50      3500
fiat    600     65      115     450
honda   accord  81      30      6000
ford    thundbd 84      10      17000
toyota  tercel  82      180     750
chevy   impala  65      85      1550
ford    bronco  83      25      9500

When you call the length function without an argument, it
returns the number of characters in the current line, including
field separators. The $0 variable always contains the value of the
current line. In the next example, awk prepends the length to each
line, and then a pipe sends the output from awk to sort, so that
the lines of the cars file appear in order of length. Because the
formatting of the report depends on [TAB]s, including three extra
characters at the beginning of each line throws off the format of
the last line. A remedy for this situation will be covered
shortly.
-
$ awk '{print length, $0}' cars | sort
19 fiat 600     65      115     450
20 ford ltd     83      15      10500
20 plym fury    77      73      2500
20 volvo        gl      78      102     9850
21 chevy        nova    79      60      3000
21 chevy        nova    80      50      3500
22 ford bronco  83      25      9500
23 chevy        impala  65      85      1550
23 honda        accord  81      30      6000
24 ford mustang 65      45      10000
24 ford thundbd 84      10      17000
24 toyota       tercel  82      180     750

The NR variable contains the record (line) number of the current
line. The following pattern selects all lines that contain more
than 23 characters. The action prints the line number of all the
selected lines.

$ awk 'length > 23 {print NR}' cars
3
9
10
-
You can combine the range operator (,) and the NR variable to
display a group of lines of a file based on their line numbers. The
next example displays lines 2 through 4.

$ awk 'NR == 2 , NR == 4' cars
chevy   nova    79      60      3000
ford    mustang 65      45      10000
volvo   gl      78      102     9850

The END pattern works in a manner similar to the BEGIN pattern,
except awk takes the actions associated with it after it has
processed the last of its input lines. The following report
displays information only after it has processed the entire data
file. The NR variable retains its value after awk has finished
processing the data file, so that an action associated with an END
pattern can use it.

$ awk 'END {print NR, "cars for sale." }' cars
12 cars for sale.

The next example uses if statements to change the values of some
of the first fields. As long as awk does not make any changes to a
record, it leaves the entire record, including separators, intact.
Once it makes a change to a record, it changes all separators in
that record to the default. The default output field separator is a
[SPACE].
-
$ cat separ_demo
{
if ($1 ~ /ply/)  $1 = "plymouth"
if ($1 ~ /chev/) $1 = "chevrolet"
print
}

$ awk -f separ_demo cars
plymouth fury 77 73 2500
chevrolet nova 79 60 3000
ford    mustang 65      45      10000
volvo   gl      78      102     9850
ford    ltd     83      15      10500
chevrolet nova 80 50 3500
fiat    600     65      115     450
honda   accord  81      30      6000
ford    thundbd 84      10      17000
toyota  tercel  82      180     750
chevrolet impala 65 85 1550
ford    bronco  83      25      9500
-
You can change the default value of the output field separator
by assigning a value to the OFS variable. The following example
sets OFS to a single [TAB] character, written between the quotation
marks as the escape sequence \t. This fix improves the appearance
of the report but does not properly line up the columns.

$ cat ofs_demo
BEGIN   {OFS = "\t"}
{
if ($1 ~ /ply/)  $1 = "plymouth"
if ($1 ~ /chev/) $1 = "chevrolet"
print
}

$ awk -f ofs_demo cars
plymouth        fury    77      73      2500
chevrolet       nova    79      60      3000
ford    mustang 65      45      10000
volvo   gl      78      102     9850
ford    ltd     83      15      10500
chevrolet       nova    80      50      3500
fiat    600     65      115     450
honda   accord  81      30      6000
ford    thundbd 84      10      17000
toyota  tercel  82      180     750
chevrolet       impala  65      85      1550
ford    bronco  83      25      9500
-
You can use printf to refine the output format (refer to Printf,
above). The following example uses a backslash at the end of a
program line to mask the following [NEWLINE] from awk. You can use
this technique to continue a long line over one or more lines
without affecting the outcome of the program.

$ cat printf_demo
BEGIN   {
print "                         Miles"
print "Make       Model    Year (000)      Price"
print \
"-----------------------------------------"
}
{
if ($1 ~ /ply/)  $1 = "plymouth"
if ($1 ~ /chev/) $1 = "chevrolet"
printf "%-10s %-8s 19%2d %5d $ %8.2f\n",\
        $1, $2, $3, $4, $5
}
-
$ awk -f printf_demo cars
                         Miles
Make       Model    Year (000)      Price
-----------------------------------------
plymouth   fury     1977    73 $  2500.00
chevrolet  nova     1979    60 $  3000.00
ford       mustang  1965    45 $ 10000.00
volvo      gl       1978   102 $  9850.00
ford       ltd      1983    15 $ 10500.00
chevrolet  nova     1980    50 $  3500.00
fiat       600      1965   115 $   450.00
honda      accord   1981    30 $  6000.00
ford       thundbd  1984    10 $ 17000.00
toyota     tercel   1982   180 $   750.00
chevrolet  impala   1965    85 $  1550.00
ford       bronco   1983    25 $  9500.00
-
The next example creates two new files, one with all the lines
that contain chevy and the other with lines containing ford.

$ cat redirect_out
/chevy/ {print > "chevfile"}
/ford/  {print > "fordfile"}
END     {print "done."}

$ awk -f redirect_out cars
done.

$ cat chevfile
chevy   nova    79      60      3000
chevy   nova    80      50      3500
chevy   impala  65      85      1550

The summary program produces a summary report on all cars and
newer cars. The first two lines of declarations are not required;
awk automatically declares and initializes variables as you use
them. After awk reads all the input data, it computes and displays
averages.
-
$ cat summary
BEGIN   {
yearsum = 0 ; costsum = 0
newcostsum = 0 ; newcount = 0
}
{
yearsum += $3
costsum += $5
}
$3 > 80 {newcostsum += $5 ; newcount ++}
END     {
printf "Average age of cars is %3.1f years\n",\
        90 - (yearsum/NR)
printf "Average cost of cars is $%7.2f\n",\
        costsum/NR
printf "Average cost of newer cars is $%7.2f\n",\
        newcostsum/newcount
}

$ awk -f summary cars
Average age of cars is 13.2 years
Average cost of cars is $6216.67
Average cost of newer cars is $8750.00

Following, grep shows the format of a line from the passwd file
that the next example uses.
Following, grep shows the format of a line from the passwd file
that the next example uses.
-
$ grep mark /etc/passwd
mark:4zvDGYGEbYHJg:107:ext 112:/home/mark:/bin/csh

The next example demonstrates a technique for finding the
largest number in a field. Because it works with the passwd file,
which delimits fields with colons (:), it changes the input field
separator (FS) before reading any data. (Alternatively, the -F
option could be used on the command line to change the input field
separator.) This example reads the passwd file and determines the
next available user ID number (field 3). The numbers do not have to
be in order in the passwd file for this program to work.

The pattern causes awk to select records that contain a user ID
number greater than any previous user ID number that it has
processed. Each time it selects a record, it assigns the value of
the new user ID number to the saveit variable. Then awk uses the
new value of saveit to test the user ID of all subsequent records.
Finally, awk adds 1 to the value of saveit and displays the result.

$ cat find_uid
BEGIN           {FS = ":"
                saveit = 0}
$3 > saveit     {saveit = $3}
END             {print "Next available UID is " saveit + 1}

$ awk -f find_uid /etc/passwd
Next available UID is 192
-
The next example shows another report based on the cars file.
This report uses nested if else statements to substitute values
based on the contents of the price field. The program has no
pattern part; it processes every record.

$ cat price_range
{
if ($5 <= 5000)                    $5 = "inexpensive"
else if ($5 > 5000 && $5 < 10000)  $5 = "please ask"
else if ($5 >= 10000)              $5 = "expensive"
printf "%-10s %-8s 19%2d %5d %-12s\n",\
        $1, $2, $3, $4, $5
}

$ awk -f price_range cars
plym       fury     1977    73 inexpensive
chevy      nova     1979    60 inexpensive
ford       mustang  1965    45 expensive
volvo      gl       1978   102 please ask
ford       ltd      1983    15 expensive
chevy      nova     1980    50 inexpensive
fiat       600      1965   115 inexpensive
honda      accord   1981    30 please ask
ford       thundbd  1984    10 expensive
toyota     tercel   1982   180 inexpensive
chevy      impala   1965    85 inexpensive
ford       bronco   1983    25 please ask
-
Problem 1) Find the number of annotated genes on each strand of
the E. coli genome sequence.
Problem 2) Find the number of putatively identified,
hypothetical, and unknown genes in the E. coli genome sequence.
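
One possible starting point for Problem 1 is an associative array indexed by strand, traversed with the for (elem in array) structure. This is only a sketch: the layout of the annotation file is not given here, so the three-line, TAB-separated input (gene name, then + or - for the strand) and the names genes, n, and s are assumptions made for illustration.

```shell
# Hypothetical annotation table: gene name, then strand (+ or -),
# separated by a TAB. The real file layout may differ.
genes=$(printf 'thrA\t+\nthrB\t+\nyaaA\t-\n')

# Count records per strand, then loop over the indexes of the array.
# The order of a for-in loop is unspecified, so the result is sorted.
by_strand=$(printf '%s\n' "$genes" | awk -F'\t' '
{ n[$2]++ }
END { for (s in n) print s, n[s] }' | LC_ALL=C sort)

printf '%s\n' "$by_strand"
```

The field number ($2) would need to be changed to whichever column holds the strand in the actual file.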