Chapter 11 The awk Programming Language
Contents:
Conceptual Overview
Command-Line Syntax
Patterns and Procedures
Built-in Variables
Operators
Variables and Array Assignments
User-Defined Functions
Group Listing of awk Functions and Commands
Implementation Limits
Alphabetical Summary of Functions and Commands
11.1 Conceptual Overview
awk is a pattern-matching program for processing files, especially when they are databases. The new version of awk, called nawk, provides additional capabilities.[1] Every modern Unix system comes with a version of new awk, and its use is recommended over old awk.
[1] It really isn't so new. The additional features were added in 1984, and it was first shipped with System V Release 3.1 in 1987. Nevertheless, the name was never changed on most systems.
Different systems vary in what the two versions are called. Some have oawk and awk, for the old and new versions, respectively. Others have awk and nawk. Still others only have awk, which is the new version. This example shows what happens if your awk is the old one:
$ awk 1 /dev/null
awk: syntax error near line 1
awk: bailing out near line 1
awk exits silently if it is the new version.
Source code for the latest version of awk, from Bell Labs, can be downloaded starting at Brian Kernighan's home page: http://cm.bell-labs.com/~bwk. Michael Brennan's mawk is available via anonymous FTP from ftp://ftp.whidbey.net/pub/brennan/mawk1.3.3.tar.gz. Finally, the Free Software Foundation has a version of awk called gawk, available from ftp://gnudist.gnu.org/gnu/gawk/gawk-3.0.4.tar.gz. All three programs implement "new" awk. Thus, references below such as "nawk only" apply to all three. gawk has additional features.
With original awk, you can:
Think of a text file as made up of records and fields in a textual database.
Perform arithmetic and string operations.
Use programming constructs such as loops and conditionals.
Produce formatted reports.
With nawk, you can also:
Define your own functions.
Execute Unix commands from a script.
Process the results of Unix commands.
Process command-line arguments more gracefully.
Work more easily with multiple input streams.
Flush open output files and pipes (latest Bell Labs awk).
In addition, with GNU awk (gawk), you can:
Use regular expressions to separate records, as well as fields.
Skip to the start of the next file, not just the next record.
Perform more powerful string substitutions.
Retrieve and format system time values.
11.2 Command-Line Syntax
The syntax for invoking awk has two forms:
awk [options] 'script' var=value file(s)
awk [options] -f scriptfile var=value file(s)
You can specify a script directly on the command line, or you can store a script in a scriptfile and specify it with -f. nawk allows multiple -f scripts. Variables can be assigned a value on the command line. The value can be a literal, a shell variable ($name), or a command substitution (`cmd`), but the value is available only after the BEGIN statement is executed.
awk operates on one or more files. If none are specified (or if - is specified), awk reads from the standard input.
The recognized options are:
-F fs
Set the field separator to fs. This is the same as setting the system variable FS. Original awk allows the field separator to be only a single character. nawk allows fs to be a regular expression. Each input line, or record, is divided into fields by whitespace (blanks or tabs) or by some other user-definable field separator. Fields are referred to by the variables $1, $2, ..., $n. $0 refers to the entire record.
-v var=value
Assign a value to variable var. This allows assignment before the script begins execution (available in nawk only).
To print the first three (colon-separated) fields of each record on separate lines:
awk -F: '{ print $1; print $2; print $3 }' /etc/passwd
More examples are shown in the section "Simple Pattern-Procedure Examples."
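The difference between -v and a var=value operand can be sketched as follows. This is a minimal illustration, assuming a new awk is installed as awk; the variable name n and the one-line input are made up for the example.

```shell
# -v assigns before the BEGIN procedure runs (nawk and later).
early=$(awk -v n=5 'BEGIN { print n }')

# A var=value operand is assigned only when awk reaches it in the
# argument list, which happens after BEGIN has already run.
# The trailing "-" names standard input explicitly.
late=$(echo x | awk 'BEGIN { print "[" n "]" } { print n }' n=5 -)

echo "$early"
echo "$late"
```

In the second command, n is still empty inside BEGIN and holds 5 by the time the first input record is processed.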
11.3 Patterns and Procedures
awk scripts consist of patterns and procedures:
pattern { procedure }
Both are optional. If pattern is missing, { procedure } is applied to all lines; if { procedure } is missing, the matched line is printed.
11.3.1 Patterns
A pattern can be any of the following:
/regular expression/
relational expression
pattern-matching expression
BEGIN
END
Expressions can be composed of quoted strings, numbers, operators, functions, defined variables, or any of the predefined variables described later in the section "Built-in Variables."
Regular expressions use the extended set of metacharacters and are described in Chapter 6, Pattern Matching.
^ and $ refer to the beginning and end of a string (such as the fields), respectively, rather than the beginning and end of a line. In particular, these metacharacters will not match at a newline embedded in the middle of a string.
Relational expressions use the relational operators listed in the section "Operators" later in this chapter. For example, $2 > $1 selects lines for which the second field is greater than the first. Comparisons can be either string or numeric. Thus, depending on the types of data in $1 and $2, awk does either a numeric or a string comparison. This can change from one record to the next.
Pattern-matching expressions use the operators ~ (match) and !~ (don't match). See the section "Operators" later in this chapter.
The BEGIN pattern lets you specify procedures that take place before the first input line is processed. (Generally, you set global variables here.)
The END pattern lets you specify procedures that take place after the last input record is read.
In nawk, BEGIN and END patterns may appear multiple times. The procedures are merged as if there had been one large procedure.
Except for BEGIN and END, patterns can be combined with the Boolean operators || (or), && (and), and ! (not). A range of lines can also be specified using comma-separated patterns:
pattern,pattern
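A range pattern and a Boolean combination of patterns might look like this in practice. The five-line sample input and the START/END markers are invented for illustration; with no procedure given, matching lines are simply printed.

```shell
input=$(printf '%s\n' a START b END c)

# Range: from the first line matching /START/ through the next
# line matching /END/, inclusive.
range=$(printf '%s\n' "$input" | awk '/START/,/END/')

# Boolean combination: past line 1 AND not made up only of capitals.
combined=$(printf '%s\n' "$input" | awk 'NR > 1 && !/^[A-Z]+$/')

echo "$range"
echo "$combined"
```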
11.3.2 Procedures
Procedures consist of one or more commands, functions, or variable assignments, separated by newlines or semicolons, and contained within curly braces. Commands fall into five groups:
Variable or array assignments
Printing commands
Built-in functions
Control-flow commands
User-defined functions (nawk only)
11.3.3 Simple Pattern-Procedure Examples
Print first field of each line:
{ print $1 }
Print all lines that contain pattern:
/pattern/
Print first field of lines that contain pattern:
/pattern/ { print $1 }
Select records containing more than two fields:
NF > 2
Interpret input records as a group of lines up to a blank line. Each line is a single field:
BEGIN { FS = "\n"; RS = "" }
Print fields 2 and 3 in switched order, but only on lines whose first field matches the string "URGENT":
$1 ~ /URGENT/ { print $3, $2 }
Count and print the number of lines that contain pattern:
/pattern/ { ++x } END { print x }
Add numbers in second column and print total:
{ total += $2 } END { print "column total is", total}
Print lines that contain fewer than 20 characters:
length($0) < 20
Print each line that begins with Name: and that contains exactly seven fields:
NF == 7 && /^Name:/
Print the fields of each input record in reverse order, one per line:
{ for (i = NF; i >= 1; i--) print $i }
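Two of the examples above, run against small made-up inputs, give a feel for how the pieces fit together. The names, addresses, and numbers here are sample data only.

```shell
# Blank-line-separated records, one field per line.
records=$(printf 'Alice\n12 Elm St\n\nBob\n9 Oak Ave\n' |
    awk 'BEGIN { FS = "\n"; RS = "" }
         { print NR ": " $1 " lives at " $2 }')

# Add the numbers in the second column and print the total.
total=$(printf 'x 1\ny 2\nz 3\n' |
    awk '{ total += $2 } END { print "column total is", total }')

echo "$records"
echo "$total"
```

Setting RS to the empty string makes each group of lines up to a blank line one record, so NR counts groups rather than lines.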
11.4 Built-in Variables
Version  Variable     Description
awk      FILENAME     Current filename
         FS           Field separator (a space)
         NF           Number of fields in current record
         NR           Number of the current record
         OFMT         Output format for numbers ("%.6g") and for conversion to string
         OFS          Output field separator (a space)
         ORS          Output record separator (a newline)
         RS           Record separator (a newline)
         $0           Entire input record
         $n           nth field in current record; fields are separated by FS
nawk     ARGC         Number of arguments on command line
         ARGV         An array containing the command-line arguments, indexed from 0 to ARGC - 1
         CONVFMT      String conversion format for numbers ("%.6g") (POSIX)
         ENVIRON      An associative array of environment variables
         FNR          Like NR, but relative to the current file
         RLENGTH      Length of the string matched by match() function
         RSTART       First position in the string matched by match() function
         SUBSEP       Separator character for array subscripts ("\034")
gawk     ARGIND       Index in ARGV of current input file
         ERRNO        A string indicating the error when a redirection fails for getline or if close() fails
         FIELDWIDTHS  A space-separated list of field widths to use for splitting up the record, instead of FS
         IGNORECASE   When true, all regular expression matches, string comparisons, and calls to index() ignore case
         RT           The text matched by RS, which can be a regular expression in gawk
11.5 Operators
The following table lists the operators, in order of increasing precedence, that are available in awk. Note: while ** and **= are common extensions, they are not part of POSIX awk.
Symbol                    Meaning
= += -= *= /= %= ^= **=   Assignment
?:                        C conditional expression (nawk only)
||                        Logical OR (short-circuit)
&&                        Logical AND (short-circuit)
in                        Array membership (nawk only)
~ !~                      Match regular expression and negation
< <= > >= != ==           Relational operators
(blank)                   Concatenation
+ -                       Addition, subtraction
* / %                     Multiplication, division, and modulus (remainder)
+ - !                     Unary plus and minus, and logical negation
^ **                      Exponentiation
++ --                     Increment and decrement, either prefix or postfix
$                         Field reference
11.6 Variables and Array Assignments
Variables can be assigned a value with an = sign. For example:
FS = ","
Expressions using the operators +, -, *, /, and % (modulo) can be assigned to variables.
Arrays can be created with the split() function (see below), or they can simply be named in an assignment statement. Array elements can be subscripted with numbers (array[1], ..., array[n]) or with strings. Arrays subscripted by strings are called associative arrays.[2] For example, to count the number of widgets you have, you could use the following script:
[2] In fact, all arrays in awk are associative; numeric subscripts are converted to strings before using them as array subscripts. Associative arrays are one of awk's most powerful features.
/widget/ { count["widget"]++ }      # Count widgets
END      { print count["widget"] }  # Print the count
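As a slightly larger sketch of associative arrays in use, the following counts how often each word appears in made-up input. It relies on three nawk features: the for (item in array) loop, the in operator, and delete. The color names are sample data, and because the order in which the loop visits subscripts is not defined, the output is piped through sort.

```shell
counts=$(printf '%s\n' red blue red green red | awk '
{ count[$1]++ }                   # subscript by the string in field 1
END {
    delete count["green"]         # remove a single element
    if (!("green" in count))      # tests existence, not the value
        print "green deleted"
    for (color in count)          # visit every remaining subscript
        print color, count[color]
}' | sort)

echo "$counts"
```

Testing "green" in count does not create the element, whereas referring to count["green"] directly would.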
You can use the special for loop to read all the elements of an associative array:
for (item in array)
    process array[item]
The index of the array is available as item, while the value of an element of the array can be referenced as array[item].
You can use the operator in to see if an element exists by testing to see if its index exists (nawk only):
if (index in array) ...
This sequence tests that array[index] exists, but you cannot use it to test the value of the element referenced by array[index].
You can also delete individual elements of the array using the delete statement (nawk only).
11.6.1 Escape Sequences
Within string and regular expression constants, the following escape sequences may be used. Note: The \x escape sequence is a common extension; it is not part of POSIX awk.
Sequence  Meaning           Sequence  Meaning
\a        Alert (bell)      \v        Vertical tab
\b        Backspace         \\        Literal backslash
\f        Form feed         \nnn      Octal value nnn
\n        Newline           \xnn      Hexadecimal value nn
\r        Carriage return   \"        Literal double quote (in strings)
\t        Tab               \/        Literal slash (in regular expressions)
11.7 User-Defined Functions
nawk allows you to define your own functions. This makes it easy to encapsulate sequences of steps that need to be repeated into a single place, and reuse the code from anywhere in your program. Note: for user-defined functions, no space is allowed between the function name and the left parenthesis when the function is called.
The following function capitalizes each word in a string. It has one parameter, named input, and five local variables, which are written as extra parameters.
# capitalize each word in a string
function capitalize(input,    result, words, n, i, w)
{
    result = ""
    n = split(input, words, " ")
    for (i = 1; i <= n; i++) {
        w = words[i]
        w = toupper(substr(w, 1, 1)) substr(w, 2)
        if (i > 1)
            result = result " "
        result = result w
    }
    return result
}
# main program, for testing
{ print capitalize($0) }
With this input data:
A test line with words and numbers like 12 on it.
This program produces:
A Test Line With Words And Numbers Like 12 On It.
11.8 Group Listing of awk Functions and Commands
The following table classifies awk functions and commands.
Arithmetic  String      Control Flow  I/O         Time        Programming
Functions   Functions   Statements    Processing  Functions
atan2[3]    gensub[4]   break         close[3]    strftime[4] delete[3]
cos[3]      gsub[3]     continue      fflush[5]   systime[4]  function[3]
exp         index       do/while[3]   getline[3]              system[3]
int         length      exit          next
log         match[3]    for           nextfile[5]
rand[3]     split       if            print
sin[3]      sprintf     return[3]     printf
sqrt        sub[3]      while
srand[3]    substr
            tolower[3]
            toupper[3]
[3] Available in nawk.
[4] Available in gawk.
[5] Available in Bell Labs awk and gawk.
11.9 Implementation Limits
Many versions of awk have various implementation limits, on things such as:
Number of fields per record
Number of characters per input record
Number of characters per output record
Number of characters per field
Number of characters per printf string
Number of characters in literal string
Number of characters in character class
Number of files open
Number of pipes open
The ability to handle 8-bit characters and characters that are all zero (ASCII NUL)
gawk does not have limits on any of these items, other than those imposed by the machine architecture and/or the operating system.
11.10 Alphabetical Summary of Functions and Commands
The following alphabetical list of keywords and functions includes all that are available in awk, nawk, and gawk. nawk includes all old awk functions and keywords, plus some additional ones (marked as {N}). gawk includes all nawk functions and keywords, plus some additional ones (marked as {G}). Items marked with {B} are available in the Bell Labs awk. Items that aren't marked with a symbol are available in all versions.
atan2
atan2(y, x)
Return the arctangent of y/x in radians. {N}
break
break
Exit from a while, for, or do loop.
close
close(filename-expr)
close(command-expr)
In most implementations of awk, you can have only 10 files open simultaneously and one pipe. Therefore, nawk provides a close function that allows you to close a file or a pipe. It takes as an argument the same expression that opened the pipe or file. This expression must be identical, character by character, to the one that opened the file or pipe; even whitespace is significant. {N}
continue
continue
Begin the next iteration of a while, for, or do loop.
cos
cos(x)
Return the cosine of x, an angle in radians. {N}
delete
delete array[element]
delete array
Delete element from array. The brackets are typed literally. The second form is a common extension, which deletes all elements of the array in one shot. {N}
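The rule that close() takes the same expression that opened the pipe can be sketched as follows. This assumes a sort command is on the PATH; the two input lines are sample data.

```shell
out=$(printf '%s\n' b a | awk '
{ print | "sort" }      # every record goes to a single sort process
END {
    close("sort")       # identical string: ends the pipe and waits for sort
    print "done"
}')

echo "$out"
```

Because close() waits for the command to finish, the sorted lines appear before the final "done". Without the call, the relative order of the two outputs is not guaranteed.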
do
do
    statement
while (expr)
Looping statement. Execute statement, then evaluate expr and, if true, execute statement again. A series of statements must be put within braces. {N}
exit
exit [expr]
Exit from script, reading no new input. The END procedure, if it exists, will be executed. An optional expr becomes awk's return value.
exp
exp(x)
Return the exponential of x (e raised to the power x).
fflush
fflush([output-expr])
Flush any buffers associated with open output file or pipe output-expr. {B}
gawk extends this function. If no output-expr is supplied, it flushes standard output. If output-expr is the null string (""), it flushes all open files and pipes. {G}
for
for (init-expr; test-expr; incr-expr)
    statement
C-style looping construct. init-expr assigns the initial value of a counter variable. test-expr is a relational expression that is evaluated each time before executing the statement. When test-expr is false, the loop is exited. incr-expr increments the counter variable after each pass. All the expressions are optional. A missing test-expr is considered to be true. A series of statements must be put within braces.
for
for (item in array)
    statement
Special loop designed for reading associative arrays. For each element of the array, the statement is executed; the element can be referenced by array[item]. A series of statements must be put within braces.
function
function name(parameter-list) {
    statements
}
Create name as a user-defined function consisting of awk statements that apply to the specified list of parameters. No space is allowed between name and the left paren when the function is called. {N}
getline
getline [var] [< file]
or
command | getline [var]
Read next line of input. Original awk doesn't support the syntax to open multiple input streams. The first form reads input from file; the second form reads the output of command. Both forms read one record at a time, and each time the statement is executed, it gets the next record of input. The record is assigned to $0 and is parsed into fields, setting NF, NR, and FNR. If var is specified, the result is assigned to var, and $0 and NF aren't changed. Thus, if the result is assigned to a variable, the current record doesn't change. getline is actually a function and returns 1 if it reads a record successfully, 0 if end-of-file is encountered, and -1 if it's otherwise unsuccessful. {N}
gensub
gensub(r, s, h [, t])
General substitution function. Substitute s for matches of the regular expression r in the string t. If h is a number, replace the hth match. If it is "g" or "G", substitute globally. If t is not supplied, $0 is used. Return the new string value. The original t is not modified. (Compare gsub and sub.) {G}
gsub
gsub(r, s [, t])
Globally substitute s for each match of the regular expression r in the string t. If t is not supplied, defaults to $0. Return the number of substitutions. {N}
if
if (condition)
    statement
[else
    statement]
If condition is true, do statement(s); otherwise do statement in the optional else clause. The condition can be an expression using any of the relational operators <, <=, ==, !=, >=, or >, as well as the array membership operator in, and the pattern-matching operators ~ and !~ (e.g., if ($1 ~ /[Aa].*/)). A series of statements must be put within braces. Another if can directly follow an else in order to produce a chain of tests or decisions.
index
index(str, substr)
Return the position (starting at 1) of the first occurrence of substr in str, or 0 if substr does not occur.
int
int(x)
Return the integer value of x by truncating any fractional digits.
length
length([arg])
Return the length of arg, or the length of $0 if no argument is given.
log
log(x)
Return the natural logarithm (base e) of x.
match
match(s, r)
Function that matches the pattern, specified by the regular expression r, in the string s, and returns either the position in s where the match begins, or 0 if no occurrences are found. Sets the values of RSTART and RLENGTH to the start and length of the match, respectively. {N}
next
next
Read the next input record and start a new cycle through the pattern/procedure statements.
nextfile
nextfile
Stop processing the current input file and start new cycle through pattern/procedures statements, beginning with the first record of the next file. {B} {G}
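The way match() sets RSTART and RLENGTH, as described in its entry above, can be illustrated with a short sketch. The input line is invented; substr() is used to pull out the matched text.

```shell
found=$(echo 'order id: 48213 shipped' | awk '
{
    if (match($0, /[0-9]+/))     # nonzero start position means a match
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
}')

echo "$found"
```

The first digit sits at character 11 and the run of digits is 5 long, so the script prints those two numbers followed by the digits themselves.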
print
print [output-expr[, ...]] [dest-expr]
Evaluate the output-expr and direct it to standard output, followed by the value of ORS. Each comma-separated output-expr is separated in the output by the value of OFS. With no output-expr, print $0.
Output Redirections
dest-expr is an optional expression that directs the output to a file or pipe.
> file
Directs the output to a file, overwriting its previous contents.
>> file
Appends the output to a file, preserving its previous contents. In both cases, the file is created if it does not already exist.
| command
Directs the output as the input to a Unix command.
Be careful not to mix > and >> for the same file. Once a file has been opened with >, subsequent output statements continue to append to the file until it is closed.
Remember to call close() when you have finished with a file or pipe. If you don't, eventually you will hit the system limit on the number of simultaneously open files.
printf
printf(format [, expr-list]) [dest-expr]
An alternative output statement borrowed from the C language. It can produce formatted output and also output data without automatically producing a newline. format is a string of format specifications and constants. expr-list is a list of arguments corresponding to format specifiers. See print for a description of dest-expr.
format follows the conventions of the C-language printf(3S) library function. Here are a few of the most common formats:
%s
A string.
%d
A decimal number.
%n.mf
A floating-point number; n = minimum total field width, m = number of digits after the decimal point.
%[-]nc
n specifies minimum field length for format type c, while - left-justifies value in field; otherwise, value is right-justified.
Like any string, format can also contain embedded escape sequences: \n (newline) or \t (tab) being the most common. Spaces and literal text can be placed in the format argument by quoting the entire argument. If there are multiple expressions to be printed, there should be multiple formats specified.
Example
Using the script:
{ printf("The sum on line %d is %.0f.\n", NR, $1+$2) }
The following input line:
5 5
produces this output, followed by a newline:
The sum on line 1 is 10.
rand
rand()
Generate a random number between 0 and 1. This function returns the same series of numbers each time the script is executed, unless the random number generator is seeded using srand(). {N}
return
return [expr]
Used within a user-defined function to exit the function, returning value of expr. The return value of a function is undefined if expr is not provided. {N}
sin
sin(x)
Return the sine of x, an angle in radians. {N}
split
split(string, array [, sep])
Split string into elements of array array[1], ..., array[n]. The string is split at each occurrence of separator sep. If sep is not specified, FS is used. The number of array elements created is returned.
sprintf
sprintf(format [, expressions])
Return the formatted value of one or more expressions, using the specified format (see printf). Data is formatted but not printed. {N}
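split and sprintf often work together: one breaks a string apart, the other reassembles it in a new layout. The following is a minimal sketch with a made-up date string; the array name d is arbitrary.

```shell
parts=$(awk 'BEGIN {
    n = split("2018-02-25", d, "-")     # n is the number of elements created
    s = sprintf("%02d/%02d/%04d", d[2], d[3], d[1])   # formatted, not printed
    print n, s
}')

echo "$parts"
```

Nothing is written until the final print; sprintf only returns the formatted string.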
sqrt
sqrt(arg)
Return the square root of arg.
srand
srand([expr])
Use optional expr to set a new seed for the random number generator. Default is the time of day. Return value is the old seed. {N}
strftime
strftime([format [, timestamp]])
Format timestamp according to format. Return the formatted string. The timestamp is a time-of-day value in seconds since midnight, January 1, 1970, UTC. The format string is similar to that of sprintf. (See the Example for systime.) If timestamp is omitted, it defaults to the current time. If format is omitted, it defaults to a value that produces output similar to that of date. {G}
sub
sub(r, s [, t])
Substitute s for first match of the regular expression r in the string t. If t is not supplied, defaults to $0. Return 1 if successful; 0 otherwise. {N}
substr
substr(string, beg [, len])
Return substring of string at beginning position beg and the characters that follow to maximum specified length len. If no length is given, use the rest of the string.
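The return values of gsub and sub, and the position arguments of substr, can be seen in a short sketch. The input sentence is sample text; both substitution calls act on $0 because no third argument is given.

```shell
result=$(echo 'the cat and the hat' | awk '
{
    n = gsub(/the/, "a")      # replaces every match; returns the count
    ok = sub(/cat/, "dog")    # replaces the first match only; returns 1 or 0
    print n, ok, $0
    print substr($0, 3, 3)    # 3 characters starting at position 3
}')

echo "$result"
```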
system
system(command)
Function that executes the specified command and returns its status. The status of the executed command typically indicates success or failure. A value of 0 means that the command executed successfully. A nonzero value indicates a failure of some sort. The documentation for the command you're running will give you the details.
The output of the command is not available for processing within the awk script. Use command | getline to read the output of a command into the script. {N}
systime
systime()
Return a time-of-day value in seconds since midnight, January 1, 1970, UTC. {G}
Example
Log the start and end times of a data-processing program:
BEGIN {
    now = systime()
    mesg = strftime("Started at %m/%d/%Y %H:%M:%S", now)
    print mesg
}
process data ...
END {
    now = systime()
    mesg = strftime("Ended at %m/%d/%Y %H:%M:%S", now)
    print mesg
}
tolower
tolower(str)
Translate all uppercase characters in str to lowercase and return the new string.[6] {N}
[6] Very early versions of nawk don't support tolower() and toupper(). However, they are now part of the POSIX specification for awk, and are included in the SVR4 nawk.
toupper
toupper(str)
Translate all lowercase characters in str to uppercase and return the new string. {N}
while
while (condition)
    statement
Do statement while condition is true (see if for a description of allowable conditions). A series of statements must be put within braces.
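A while loop is the usual partner for the command | getline form, since getline returns 1 for each record read and 0 at end-of-file. The sketch below assumes a POSIX shell is available to run the command string; the variable names cmd, line, and n are illustrative.

```shell
lines=$(awk 'BEGIN {
    cmd = "echo one; echo two"
    while ((cmd | getline line) > 0)   # 1 = record read, 0 = EOF, -1 = error
        n++
    close(cmd)                         # same expression that opened the pipe
    print n, line
}')

echo "$lines"
```

Testing for > 0 rather than plain truth matters: a return value of -1 is also true, so a bare while (cmd | getline line) would loop forever on an error.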
11.10.1 printf Formats
Format specifiers for printf and sprintf have the following form:
%[flag][width][.precision]letter
The control letter is required. The format conversion control letters are as follows.
Character  Description
c          ASCII character
d          Decimal integer
i          Decimal integer (added in POSIX)
e          Floating-point format ([-]d.precisione[+-]dd)
E          Floating-point format ([-]d.precisionE[+-]dd)
f          Floating-point format ([-]ddd.precision)
g          e or f conversion, whichever is shortest, with trailing zeros removed
G          E or f conversion, whichever is shortest, with trailing zeros removed
o          Unsigned octal value
s          String
x          Unsigned hexadecimal number; uses a-f for 10 to 15
X          Unsigned hexadecimal number; uses A-F for 10 to 15
%          Literal %
The optional flag is one of the following.
Character  Description
-          Left-justify the formatted value within the field.
space      Prefix positive values with a space and negative values with a minus.
+          Always prefix numeric values with a sign, even if the value is positive.
#          Use an alternate form: %o has a preceding 0; %x and %X are prefixed with 0x and 0X, respectively; %e, %E, and %f always have a decimal point in the result; and %g and %G do not have trailing zeros removed.
0          Pad output with zeros, not spaces. This happens only when the field width is wider than the converted result.
The optional width is the minimum number of characters to output. The result will be padded to this size if it is smaller. The 0 flag causes padding with zeros; otherwise, padding is with spaces.
The precision is optional. Its meaning varies by control letter, as shown in this table.
Conversion              Precision Means
%d, %i, %o, %u, %x, %X  The minimum number of digits to print
%e, %E, %f              The number of digits to the right of the decimal point
%g, %G                  The maximum number of significant digits
%s                      The maximum number of characters to print
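The effect of the flag, width, and precision fields is easiest to see side by side. The following sketch prints a few values inside brackets so that the padding is visible; the values themselves are arbitrary.

```shell
formatted=$(awk 'BEGIN {
    printf("[%5d]\n", 42)          # width 5, right-justified
    printf("[%-5d]\n", 42)         # - flag: left-justified
    printf("[%05d]\n", 42)         # 0 flag: pad with zeros
    printf("[%+d]\n", 42)          # + flag: always show the sign
    printf("[%8.3f]\n", 3.14159)   # width 8, 3 digits after the point
    printf("[%.3s]\n", "abcdef")   # precision caps the string length
    printf("[%x %X %o]\n", 255, 255, 8)
}')

echo "$formatted"
```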
Back to: UNIX in a Nutshell: System V Edition, 3rd Edition