An Introduction to AWK

awk is a powerful tool. It is actually a Turing-complete language, meaning that you can technically write any kind of program with it. You could implement the classic sorting algorithms or more complex things such as a parser or an interpreter. Examples of this kind can be found in “The AWK Programming Language”, the book written by awk's authors. The reason awk is still popular today, though, has little to do with its generality and everything to do with its usefulness on the command line.
Whether you are trying to extract and format some textual data or build a nifty command to make your life easier, awk can really help you get the job done. Indeed, if you search for “awk” in our code base, it appears about 520 times.
awk was created by Alfred Aho, Peter J. Weinberger and Brian Kernighan back in 1977 while working at Bell Labs. It was designed for text processing and typically used as a data extraction and reporting tool.
During the early days of Unix, awk was the only scripting language, besides Bourne shell, available by default. Surprisingly enough, it is still widely used today by many people for its simplicity and power.
Basics
In some of the examples that follow, I’ll assume input data is coming from the standard input. That’s how awk is typically used in the real world. For example:
```shell
# Example 1
cat my-input.txt | awk '1'   # This is how to execute the code
my_super_command | awk '1'   # This is how to execute the code
awk '1'                      # This is how I will write code snippets in this post
```
I encourage you to follow this post with an open terminal to try the examples and experiment and make them fail; that’s the best way of having a good learning experience if you are new to awk. Let’s start with the most basic rules:
- Any awk program is composed of a series of pattern-action statements.
- In a pattern-action statement, either the pattern or the action may be missing.
- If the pattern is missing, the action is applied to every single line of input.
- A missing action is equivalent to { print }.
- A missing pattern is always a match.
- An action is applied to the line only if the pattern matches, i.e., the pattern is true.
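To make these rules concrete, here is a quick sketch you can paste into a terminal; the sample input strings are made up for illustration:

```shell
# A pattern with no action: the default action { print } runs
# for every line where the pattern is true.
printf 'short\na much longer line\n' | awk 'length($0) > 10'
# → a much longer line

# An action with no pattern: the action runs for every input line.
printf 'alpha\nbeta\n' | awk '{ print NR, $0 }'
# → 1 alpha
# → 2 beta
```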
In the first example, since the pattern is 1, and 1 is always true, this one-liner translates further into a print statement:
```shell
awk '1 { print }'
```
Because any missing pattern is a match, we could also write it like this:
```shell
awk '{ print }'
```
What does this program do? Well, it just prints out every single input line. Sound familiar? We just implemented cat in awk!
```shell
# Example 2
# All of the following are equivalent implementations of 'cat'
awk '1'             # short but confusing
awk '1 { print }'   # long and also a bit confusing
awk '{ print }'     # this one gets it right
```
Moving Forward
awk's default field separator (FS) is whitespace. An input line is divided into individual fields delimited by the FS. Other interesting variables available to us are the following:
- $0: the current record (line) being processed
- $1, $2, …, $NF: individual fields within a record
- NR: ordinal number of the current record
- NF: number of fields in the current record
- FS: field separator; regular expression used to separate fields (also settable with -F)
- OFS: output field separator; string inserted between fields in the output (a single space by default)
- RS: input record separator, newline by default
- ORS: output record separator, added after each record. Newline by default.
- BEGIN: special pattern used to execute statements before any record is processed
- END: special pattern used to execute statements after the last record is processed
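The list above is easier to digest with a tiny demo; the input strings here are invented for illustration:

```shell
# NF is the field count; $NF dereferences the last field.
echo 'one two three' | awk '{ print NF, $1, $NF }'
# → 3 one three

# FS and OFS at work: reassigning a field ($1 = $1) forces awk
# to rebuild the record with OFS between the fields.
echo 'a:b:c' | awk -F: -v OFS='-' '{ $1 = $1; print }'
# → a-b-c
```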
The following examples make use of every variable just described. awk can do almost the same as the cut command and much more. For example, get the user, group and filename for each file in the current directory:
```shell
# Example 3
$ ls -l | awk 'NR>1 { print $3, $4, $NF }'
xesco staff Documents/
xesco staff absecret/
xesco staff ansible/
xesco staff archive-upload-validate/
xesco staff authme/
xesco staff batchjobs/
[...]
```
Here we have a single pattern-action statement. The pattern NR>1 skips the first line of the ls -l output (the “total” summary line) so it won't be printed. The action selects fields 3, 4 and the last field and prints them out. Take a moment to understand what $NF actually represents: we are dereferencing (accessing a field) using the NF variable.
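If the dereferencing bit feels abstract, here is a small sketch with made-up input:

```shell
# NF holds the number of fields; $NF is "the field whose index is NF",
# and arithmetic works inside $( ) too.
echo 'a b c d' | awk '{ print NF, $NF, $(NF-1) }'
# → 4 d c
```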
Now, imagine you want to extract some data from the /etc/services file in your system.
```shell
$ cat /etc/services
# The Well Known Ports are those from 0 through 1023.
# The Registered Ports are those from 1024 through 49151
# The Dynamic and/or Private Ports are those from 49152 through 65535
#
# $FreeBSD: src/etc/services,v 1.89 2002/12/17 23:59:10 eric Exp $
#       From: @(#)services      5.8 (Berkeley) 5/9/91
#
# WELL KNOWN PORT NUMBERS
#
rtmp          1/ddp    #Routing Table Maintenance Protocol
tcpmux        1/udp    # TCP Port Service Multiplexer
tcpmux        1/tcp    # TCP Port Service Multiplexer
#             Mark Lottor <MKL@nisc.sri.com>
nbp           2/ddp    #Name Binding Protocol
compressnet   2/udp    # Management Utility
compressnet   2/tcp    # Management Utility
compressnet   3/udp    # Compression Process
compressnet   3/tcp    # Compression Process
#             Bernie Volz <VOLZ@PROCESS.COM>
echo          4/ddp    #AppleTalk Echo Protocol
#             4/tcp    Unassigned
#             4/udp    Unassigned
rje           5/udp    # Remote Job Entry
rje           5/tcp    # Remote Job Entry
#             Jon Postel <postel@isi.edu>
zip           6/ddp    #Zone Information Protocol
#             6/tcp    Unassigned
#             6/udp    Unassigned
[...]
```
The file format is the following: service_name port/protocol #comment. We are interested in extracting the name and the protocol of each service. We also want to skip commented lines. The problem here is that the default field separator does not split the line the way we need. Luckily, with awk, you can define your own regular expression as a field separator with -F:
```shell
# Example 4
$ cat /etc/services | awk -F"[0-9]+/" '!/^#/ { print $1, $2 }'
rtmp          ddp    #Routing Table Maintenance Protocol
tcpmux        udp    # TCP Port Service Multiplexer
tcpmux        tcp    # TCP Port Service Multiplexer
nbp           ddp    #Name Binding Protocol
compressnet   udp    # Management Utility
compressnet   tcp    # Management Utility
compressnet   udp    # Compression Process
compressnet   tcp    # Compression Process
echo          ddp    #AppleTalk Echo Protocol
rje           udp    # Remote Job Entry
[...]
```
This last example introduced a few new constructs and some new syntax. Let's take a look:
- -F"regex": set the field separator (FS) to the regular expression "regex"
- /regex/: in a pattern, match the line against the regular expression "regex"
- ! expression: reverse the truth value of "expression"
The regular expression [0-9]+/ matches one or more digits followed by a /. For example, in 48128/ whatever, it would match 48128/. Essentially, we are setting our field separator to the part of the line we want to throw away. Next comes the pattern !/^#/. The ! reverses the value of the expression /^#/, a regular expression that matches any line starting with #; so we are skipping commented lines. Finally, we print the fields on either side of our separator. The result is almost what we want, but unfortunately, we also get the comment at the end of the line. We could pipe the output to another awk and print the first two fields:
```shell
# Example 5
$ cat /etc/services | awk -F"[0-9]+/" '!/^#/ { print $1, $2 }' | awk '{ print $1, $2 }'
rtmp ddp
tcpmux udp
tcpmux tcp
nbp ddp
compressnet udp
compressnet tcp
compressnet udp
compressnet tcp
echo ddp
rje udp
rje tcp
zip ddp
[...]
```
It worked! (There’s of course a better way of doing this using the built-in function split. I won’t get into built-in functions in this post, but it’s good to know they are there to help.)
```shell
# Example 6
$ cat /etc/services | awk -F"[0-9]+/" '!/^#/ { split($2,a,"#"); print $1, a[1] }'
rtmp ddp
tcpmux udp
tcpmux tcp
nbp ddp
compressnet udp
compressnet tcp
compressnet udp
compressnet tcp
echo ddp
rje udp
rje tcp
zip ddp
echo udp
echo tcp
discard udp
discard tcp
[...]
```
In this case, we have two statements. The first one splits $2 using # as a separator and stores the parts in the array a. In awk, variables are dynamic, and they don’t have to be declared or initialized before being used. Finally, we print out the first field as before, but then we print the first element of the array a. Later, I will show you an easy way of getting rid of repeated lines, whether they’re consecutive or not.
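Here is a minimal, self-contained illustration of split, using an invented input string:

```shell
# split returns the number of pieces and fills the array a;
# neither n nor a needs to be declared beforehand.
awk 'BEGIN { n = split("a-b-c", a, "-"); print n, a[1], a[3] }'
# → 3 a c
```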
Let's see a few more simple examples. Imagine you want to double-space a file. By default, awk uses the newline \n as the output record separator (ORS), and we can easily change it to double-space a file. The following one-liners are equivalent:
```shell
# Example 7
$ awk -v ORS="\n\n" '{ print }'        # use -v to set a variable
$ awk 'BEGIN { ORS="\n\n" } { print }' # use a BEGIN block
```
We can define variables using the option -v on the command line. In the first example, we are overwriting the default ORS value to add an extra newline at the end of every line. In the second example, we are using the special pattern BEGIN. This pattern always matches before processing any line and can be used to execute initializations or if we don’t actually care about the input. In both cases, the result is the same: We get the file double-spaced.
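-v works for any variable, not just the built-in ones; the variable name in this sketch is made up:

```shell
# Variables set with -v are available from the very first pattern,
# including BEGIN.
awk -v greeting="hi there" 'BEGIN { print greeting }'
# → hi there
```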
In the next example, we just print “hello world”. The code does not expect any input, and the only pattern-action statement we have is the BEGIN one.
```shell
# Example 8
awk 'BEGIN { print "hello world" }'
```
Let's move on. Now we are interested in getting the total byte count for all the YAML files in a directory. That's an easy task for awk:
```shell
# Example 9
$ ls -l | awk '/.yaml$/ { sum+=$5 } END { print sum+0 }'
330953
```
I know, I could have used ls -l *.yaml and piped that to awk. What I did has some interesting side effects I want you to remember. If you want to filter the input or the output of a command, do not grep | awk or awk | grep unless it is strictly necessary. You can do the same filtering with the expression /regex/ inside an awk pattern. That saves you an extra pipe and gives you a cleaner, more compact expression. In this example, we first filter the input, making sure we only get files ending in .yaml. For each of those files, we accumulate column 5 (the byte count) into the sum variable. Finally, we use the special pattern END to print the value of sum once all input has been processed. If we had no matching files at all, sum would never be assigned, and printing an uninitialized variable yields an empty string, which is not very nice. That's why we use the expression sum+0. Before you ask: yes, an uninitialized value plus 0 is 0.
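You can check the uninitialized-plus-zero behavior yourself with empty input:

```shell
# sum is never assigned because there is no input; sum+0 coerces
# the uninitialized value to the number 0.
printf '' | awk 'END { print sum+0 }'
# → 0

# Without the +0, the uninitialized variable prints as an empty string.
printf '' | awk 'END { print sum }'
```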
A Final Example
Before I finish, I would like to show you something that is non-trivial. The following example is what you would call a very idiomatic awk one-liner. Can you guess what it does?
```shell
# Example 10
awk '!a[$0]++'
```
Notice we don't have an action, just a pattern. And what the heck, this pattern is not the usual expression we have seen so far: it's doing stuff. In the end, what a pattern does is filter input lines, so the pattern should evaluate to either true or false. Let's do a step-by-step execution to see what's going on.
We get the first line, and at this point the array a does not exist yet, but that's not a problem for awk: a gets created on first use. Then a[$0] is also uninitialized, and it gets created with an empty value. The next step is the post-increment ++, that is, incrementing an uninitialized value, and awk is smart enough to give us back a 0 as the result of the expression. Finally, we have !0, which evaluates to true, and the line gets printed out. Easy peasy. What happens with the next line? Well, there are two possible outcomes. And do you know what? I have already written too much, so I will let you figure that out.
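Once you have made your prediction, you can check it by running the one-liner over a small input with duplicates; the fruit names here are made up:

```shell
# Each line is printed only the first time its counter a[$0] is 0.
printf 'apple\nbanana\napple\ncherry\nbanana\n' | awk '!a[$0]++'
# → apple
# → banana
# → cherry
```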