Tools for processing text: awk
David Morgan
awk scripts
• patterns - actions
Kinds of patterns
• /regular expression/
• relational expression
• pattern-matching expression
• BEGIN
• END
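The pattern kinds listed above can be tried out with a short sketch like the following. The three-line sample data and the printed labels are made up for illustration, and the command is written as awk so that it also runs where gawk is not installed (gawk can be substituted).

```shell
data='alice 30 NY
bob 17 LA
carol 45 NY'

out=$(printf '%s\n' "$data" | awk '
BEGIN           { print "start" }             # BEGIN: runs before any input
/NY/            { print "regex:", $1 }        # /regular expression/
$2 > 40         { print "relational:", $1 }   # relational expression
$1 ~ /^b/       { print "match:", $1 }        # pattern-matching expression
END             { print "end, lines:", NR }   # END: runs after all input
')
echo "$out"
```

Each input line is tested against every pattern in turn, so the last line triggers two actions.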
Kinds of actions
• variable assignments
• input/output commands
• built-in functions
• control flow commands
• user-defined functions
Ways to run gawk
• gawk 'pattern' file            default action: print line
• gawk '{action}' file           default pattern: match every line
• gawk 'pattern {action}' file
• gawk -f script file            pattern {action} pairs are in the script
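A minimal sketch of the four invocation forms, run against a small made-up file. The file names fruit.txt and big.awk are hypothetical, and awk is used in place of gawk only so that the sketch runs where gawk is absent.

```shell
dir=$(mktemp -d)
printf 'apple 3\nbanana 7\ncherry 5\n' > "$dir/fruit.txt"

# pattern only: the default action prints the matching line
r1=$(awk '/an/' "$dir/fruit.txt")
# action only: the default pattern selects every line
r2=$(awk '{ print $1 }' "$dir/fruit.txt")
# pattern plus action
r3=$(awk '$2 > 4 { print $1 }' "$dir/fruit.txt")
# the same pattern {action} pair kept in a script file
printf '$2 > 4 { print $1 }\n' > "$dir/big.awk"
r4=$(awk -f "$dir/big.awk" "$dir/fruit.txt")

printf '%s\n' "$r1" "$r2" "$r3" "$r4"
```

The command-line form and the script-file form give the same result.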
Script operation
• process each line from an input source
• apply to it each pattern {action} line in the script
• if the input line matches (per the pattern), apply the action to it
AWK PROGRAM EXECUTION
read program source from the "script" file
execute any code in BEGIN block(s)
read input "file" (or if none, standard input)
test each input record against patterns in the AWK program in order of appearance
if it matches any pattern, execute the associated action
execute any code in END block(s)
--gawk man page, heavily abridged and adapted
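The execution order above can be observed with a sketch in which the blocks are deliberately written out of order. The two-line input is made up, and awk stands in for gawk for portability.

```shell
out=$(printf 'x\ny\n' | awk '
END   { print "END after", NR, "records" }
      { print "record", NR ": " $0 }
BEGIN { print "BEGIN before any input" }
')
echo "$out"
```

BEGIN still runs first and END last, whatever their position in the program text.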
Ways to run gawk
• data
• pattern, no action (default: print line)
• action, no pattern (default: select line)
• pattern plus action
• multiple pattern {action} pairs, on command line or in script file (same result)
BEGIN to trigger action
Correct syntax, but ineffective! gawk is waiting for standard input, for something to match; nothing is printed.
gawk only “acts” on a matched “pattern”, so give it one to spur it on.
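A small sketch of the difference, assuming an empty standard input (redirected from /dev/null) to stand in for "nothing to match"; awk is used so the sketch also runs without gawk.

```shell
# No BEGIN: the action needs an input line to match.
# Standard input is empty here, so nothing is printed.
silent=$(awk '{ print "hello" }' < /dev/null)

# BEGIN matches before any input is read, so the action runs at once.
spoken=$(awk 'BEGIN { print "hello" }')

echo "without BEGIN: [$silent]"
echo "with BEGIN:    [$spoken]"
```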
Some gawk numeric functions
Numeric Functions
atan2(y, x) returns arctangent of y/x in radians.
cos(expr) returns cosine of expr
exp(expr) returns e raised to expr
int(expr) truncates expr to integer.
log(expr) returns natural logarithm of expr
rand() returns random number 0 <= N < 1
sin(expr) returns sine of expr
sqrt(expr) returns square root of expr
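A few of the numeric functions in use, as a quick sketch. The argument values are chosen only for illustration; awk is used in place of gawk since these functions are common to both.

```shell
nums=$(awk 'BEGIN {
    print int(3.9), int(-3.9)        # int() truncates toward zero
    print sqrt(16), exp(0), log(1)
    printf "%.4f\n", atan2(0, -1)    # pi
    r = rand()                       # 0 <= r < 1
    print (r >= 0 && r < 1) ? "rand in range" : "rand out of range"
}')
echo "$nums"
```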
Some gawk string functions
String Functions
asort(s [, d]) sort array s
gsub(r, s [, t]) search and replace
index(s, t) returns position of substring t within string s
length([s]) returns length of string s
match(s, r [, a]) returns the position of r in s
split(s, a [, r]) split string s into array a on regular expression separator r
sprintf(fmt, expr-list) returns expr-list formatted according to fmt
strtonum(str) returns the numeric value of string str
substr(s, i [, n]) returns the at most n-character substring of s starting at i
tolower(str) lower-cases string str
toupper(str) upper-cases string str
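A sketch exercising several of the string functions on one made-up sentence. The gawk-only functions asort and strtonum are left out so that the sketch also runs under other awks; the variable names are hypothetical.

```shell
strs=$(awk 'BEGIN {
    s = "the quick brown fox"
    print length(s)                       # 19
    print toupper(substr(s, 5, 5))        # QUICK
    n = split(s, w, " ")                  # w[1]..w[4]
    print n, w[3]
    t = s; k = gsub(/o/, "0", t)          # k = number of replacements
    print k, t
    if (match(s, /b[a-z]+/)) print RSTART, RLENGTH
    label = sprintf("%-5s|%03d", "id", 7) # sprintf returns, does not print
    print label
}')
echo "$strs"
```

match() also sets RSTART and RLENGTH, which is how the position and length of "brown" are reported here.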
Some gawk bitwise functions
and(v1, v2) return bitwise AND of v1 and v2
compl(val) return the bitwise complement of val
lshift(val, count) return val, shifted left by count bits
or(v1, v2) return bitwise OR of v1 and v2
rshift(val, count) return val, shifted right by count bits
xor(v1, v2) return bitwise XOR of v1 and v2
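The bitwise functions are gawk extensions, so this sketch checks for gawk before running. The operands 12 (binary 1100) and 10 (binary 1010) are chosen only for illustration; compl() is left out because its result depends on the integer width.

```shell
if command -v gawk >/dev/null 2>&1; then
    bits=$(gawk 'BEGIN {
        print and(12, 10), or(12, 10), xor(12, 10), lshift(1, 4), rshift(256, 4)
    }')
    echo "$bits"
fi
```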
system function
• gawk 'BEGIN{system("date")}'
• gawk 'BEGIN{"date" | getline d ; print d}'
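The two forms above differ in where the command's output goes. The sketch below uses echo instead of date so that the result is predictable; the variable names are hypothetical, and awk stands in for gawk.

```shell
out=$(awk 'BEGIN {
    status = system("echo from system")   # output goes straight to stdout
    cmd = "echo from getline"
    cmd | getline line                    # output is captured in a variable
    close(cmd)
    print toupper(line), "status=" status
}')
echo "$out"
```

With system() the program only gets the exit status back; with the pipe into getline it gets the text itself and can process it further.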
Extracting a substring
index( )
substr( )
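One common way to combine the two functions, sketched on made-up key=value lines: index() finds where the separator is, and substr() cuts on either side of it. This is not necessarily the example shown on the slide.

```shell
out=$(printf 'user=alice\nhost=example.org\n' | awk '{
    eq = index($0, "=")               # position of the first "="
    key = substr($0, 1, eq - 1)       # text before it
    val = substr($0, eq + 1)          # text after it, to end of line
    print key " -> " val
}')
echo "$out"
```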
Variables
• categories
   • user-defined
   • built-in
   • field
• data types
   • string or number
   • context-inferred
Field variable names
• $1, $2 … for first, second …
• $0 for whole line
• shell positional parameters use same names
   • for command line arguments
   • args, similarly, are command line “fields”
   • but shell $1 and awk $1 are not the same
   • don’t confuse them
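A sketch that keeps the two kinds of $1 apart. The value "shellarg" and the variable name arg are made up; passing the shell value in with -v is one common approach, not the only one.

```shell
set -- shellarg             # the shell's $1 is now "shellarg"
line='first second third'

# Inside single quotes the shell leaves $1 alone, so awk sees its own $1.
a=$(printf '%s\n' "$line" | awk '{ print $1 }')

# To hand the shell's $1 to awk, pass it in as an awk variable with -v.
b=$(printf '%s\n' "$line" | awk -v arg="$1" '{ print arg, $2 }')

echo "$a"
echo "$b"
```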
Variable naming and typing
naming
type inference from context
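Type inference from context can be seen in a sketch like this one, where the same variable is used once as a number and once as a string. The variable names are arbitrary.

```shell
out=$(awk 'BEGIN {
    x = "3"            # assigned as a string
    print x + 4        # arithmetic context: treated as the number 3
    print x "4"        # concatenation context: treated as the string "3"
    y = 10
    print y " apples"  # a number used as a string
    print (z == 0), (z == "")   # an unset variable is both 0 and ""
}')
echo "$out"
```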
Main built-in variables
NF The number of fields in the current input record
NR The total number of input records seen so far
FS The input field separator, a space by default
RS The input record separator, by default a newline
OFS The output field separator, a space by default
ORS The output record separator, by default a newline
FILENAME The name of the current input file
ARGC The number of command line arguments
ARGV Array of command line arguments
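A short sketch showing NR, NF and OFS at work on two made-up input lines. $NF, the last field, follows from NF being the field count.

```shell
out=$(printf 'a b c\nd e\n' | awk '
BEGIN { OFS = "-" }                       # output field separator
      { print NR, NF, $NF }               # record number, field count, last field
END   { print "total", NR }
')
echo "$out"
```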
Field separator specification
recognizes passwd file’s colon as field separator
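Two equivalent ways to set the colon separator, sketched on two made-up passwd-style lines rather than the real /etc/passwd: the -F option on the command line, and an assignment to FS in a BEGIN block.

```shell
passwd_sample='root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin'

by_flag=$(printf '%s\n' "$passwd_sample" | awk -F: '{ print $1, $7 }')
by_var=$(printf '%s\n' "$passwd_sample" | awk 'BEGIN { FS = ":" } { print $1, $7 }')

echo "$by_flag"
```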
Functions, arrays, loops (bubble sort)
Current row into (unsorted) array
print (sorted) array
send for sorting (by reference!)
loop that spanned these lines, not in script’s code, is gawk’s internal main loop
loop to increasing high-water marks, from 2
Visit descendingly successive pairs, swapping if out-of-order, stopping upon an in-order pair
Quigley textbook, p. 249
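The callouts above describe a sort of this general shape. The following is a sketch written from those callouts, not the textbook's listing; the function name bsort, the array name row and the sample numbers are all hypothetical.

```shell
sorted=$(printf '5 3 9 1\n20 10\n' | awk '
# Arrays are passed by reference, so the caller sees the sorted result.
# tmp, i and j are extra parameters used as local variables.
function bsort(a, n,    tmp, i, j) {
    for (i = 2; i <= n; i++)                       # rising high-water mark
        for (j = i; j > 1 && a[j-1] > a[j]; j--) { # walk down adjacent pairs
            tmp = a[j]; a[j] = a[j-1]; a[j-1] = tmp
        }
}
{
    for (i = 1; i <= NF; i++) row[i] = $i + 0      # current row into array, as numbers
    bsort(row, NF)
    for (i = 1; i <= NF; i++) printf "%s%s", row[i], (i < NF ? " " : "\n")
}')
echo "$sorted"
```

The main block runs once per input line without any explicit loop around it, because awk supplies that loop itself.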
Save/restore text files
Save files’ lines, mark each with its filename
we have 3 files
bye, files
bring them back
they’re back
* Consumes a file descriptor. In production use, first line: … { close(prev); prev = $1 } so as not to exhaust available descriptors
+ The redirection operators > and >> are used to put output into files instead of the standard output. ... It is also important to note that a redirection operator opens a file only once; each successive print or printf statement adds more data to the open file. When the redirection operator > is used, the file is initially cleared before any output is written to it. If >> is used instead of >, the file is not initially cleared; output is appended after the original contents.
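A sketch of the save/restore idea under stated assumptions: the file names and contents are made up, file names contain no spaces, and the restore step is one possible reading of the slide, including its close(prev) advice. Because a file reopened with > is cleared, this relies on each file's lines being contiguous in the saved copy, which is how the save step writes them.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'alpha\nbeta\n'      > one.txt
printf 'gamma\n'            > two.txt
printf 'delta\n\nepsilon\n' > three.txt

# Save: prefix every line with the name of the file it came from.
awk '{ print FILENAME, $0 }' one.txt two.txt three.txt > saved.all

rm one.txt two.txt three.txt

# Restore: strip the filename prefix and write the rest back to that file.
awk '
$1 != prev { if (prev != "") close(prev); prev = $1 }   # free the descriptor
{ name = $1; sub(/^[^ ]+ /, ""); print > name }
' saved.all

restored=$(cat one.txt two.txt three.txt)
echo "$restored"
```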
RS - record separator
Print records containing “New York”
one record containing “New York”
another record containing “New York”
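One way to get multi-line records is to set RS to the empty string, so that blank lines separate records. Whether the slide used this setting is an assumption; the address data below is made up.

```shell
addresses='Jane Doe
New York, NY

John Roe
Boston, MA

Ann Poe
New York, NY'

out=$(printf '%s\n' "$addresses" | awk '
BEGIN { RS = "" }                 # empty RS: blank lines separate records
/New York/ { n++; print $1, $2 }  # the pattern is tested against the whole record
END { print n, "records matched" }
')
echo "$out"
```

The pattern matches on the second line of each record, yet fields from the first line are printed, because both lines belong to the same record.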
split - distribute fields-in-record into elements-in-array
for - two kinds of for loop
C-style for
awk-style for, for array elements
“The order in which the subscripts are considered is implementation dependent.” The AWK Programming Language p.51
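A sketch that uses split() to fill an array and then walks it with both kinds of for loop. The colour string and the names color and seen are made up. Since the awk-style loop's order is not guaranteed, the sketch only counts the subscripts it visits.

```shell
out=$(awk 'BEGIN {
    n = split("red:green:blue", color, ":")   # fields into color[1..n]
    for (i = 1; i <= n; i++)                  # C-style for: fixed order
        print i, color[i]
    for (k in color)                          # awk-style for: order not guaranteed
        seen++
    print seen, "subscripts visited"
}')
echo "$out"
```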
? : conditional operator
expr1 ? expr2 : expr3
evaluates to expr2 or expr3 according as expr1 is true or false, respectively
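A one-line sketch of the operator, with a made-up threshold of 5 and made-up labels:

```shell
out=$(printf '3\n8\n' | awk '{ print $1, ($1 > 5 ? "big" : "small") }')
echo "$out"
```

The parentheses around the conditional expression keep it from being misread as part of the print statement's argument list.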
awk approach to word frequency
the loop covers all fields (words) in a line, while gawk does so for all lines
an array element “counting bucket” for each word that occurs; the word is the associative array subscript for the array element that counts that word’s occurrences
most commonly occurring words in kjv.txt (King James Version of the bible)
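The counting-bucket idea can be sketched as follows, on a two-line made-up text instead of kjv.txt. Piping to sort for the ranking is one common choice and an assumption here; the slide's own ranking step is not shown in the text.

```shell
text='the cat the dog
the cat'

freq=$(printf '%s\n' "$text" |
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }    # one bucket per word
         END { for (w in count) print count[w], w }' |
    sort -k1,1nr)
echo "$freq"
```

The sort is needed because the order in which for (w in count) visits the words is implementation dependent.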
getline function