
An Introduction to AWK

15 Sep 2021 12:00pm, by

Francesc Vendrell
Francesc Vendrell is a Site Reliability Engineer at LogDNA where he focuses on automation, containers, and Kubernetes. He works remotely from Catalonia, Spain and loves tinkering with retro computers, cooking, and playing sports.

awk is a powerful tool. It is actually a Turing-complete language, meaning that you can technically write any kind of program with it. You could implement the classic sorting algorithms or more complex things such as a parser or an interpreter. Examples of this kind can be found in the book “The AWK Programming Language,” written by awk’s authors. The reason awk is still popular today, though, has less to do with its generality than with its usefulness on the command line.

Whether you are trying to extract and format some textual data or build a nifty command to make your life easier, awk can really help you get the job done. Indeed, if you search for “awk” in our code base, it appears about 520 times.

awk was created by Alfred Aho, Peter J. Weinberger and Brian Kernighan back in 1977 while working at Bell Labs. It was designed for text processing and is typically used as a data extraction and reporting tool.

During the early days of Unix, awk was the only scripting language, besides Bourne shell, available by default. Surprisingly enough, it is still widely used today by many people for its simplicity and power.

Basics

In some of the examples that follow, I’ll assume input data is coming from the standard input. That’s how awk is typically used in the real world. For example:
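A minimal sketch of that usage, with printf standing in for whatever command produces the data (the sample text is made up for illustration):

```shell
# Feed one line to awk on standard input.
# The program `1` is a pattern with no action; the basic rules
# listed in this section explain why it prints the line.
out=$(printf 'one two three\n' | awk '1')
printf '%s\n' "$out"
```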

I encourage you to follow this post with an open terminal so you can try the examples, experiment, and make them fail; that’s the best way to learn if you are new to awk. Let’s start with the most basic rules:

  • Any awk program is composed of a series of pattern-action statements.
  • In a pattern-action statement, either the pattern or the action may be missing.
  • If the pattern is missing, the action is applied to every single line of input.
  • A missing action is equivalent to { print }.
  • A missing pattern is always a match.
  • An action is applied to the line only if the pattern matches, i.e., the pattern is true.

In the first example, since the pattern is 1, and 1 is always true, this one-liner translates further into a print statement:
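As a small sketch of that translation (the two input lines are invented sample data), both forms below should print exactly the same thing:

```shell
# A pattern with no action...
short=$(printf 'alpha\nbeta\n' | awk '1')
# ...behaves like the same pattern with an explicit print action.
long=$(printf 'alpha\nbeta\n' | awk '1 { print }')
printf '%s\n' "$long"
```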

Because any missing pattern is a match, we could also write it like this:
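A sketch of the pattern-less form, again on invented sample input:

```shell
# No pattern at all: the action runs for every input line.
out=$(printf 'alpha\nbeta\n' | awk '{ print }')
printf '%s\n' "$out"
```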

What does this program do? Well, it just prints out every single input line. Sound familiar? We just implemented cat in awk!

Moving Forward

awk’s default field separator (FS) is whitespace. An input line is divided into individual fields delimited by the FS. Other useful built-in variables and special patterns are the following:

  • $0: the current record (line) being processed
  • $1, $2, …, $NF: individual fields within a record
  • NR: ordinal number of the current record
  • NF: number of fields in the current record
  • FS: field separator; regular expression used to separate fields (also settable with -F)
  • OFS: output field separator; string placed between fields in the output, a single space by default
  • RS: input record separator; newline by default
  • ORS: output record separator, added after each record; newline by default
  • BEGIN: special pattern used to execute statements before any record is processed
  • END: special pattern used to execute statements after the last record is processed

The following examples make use of most of the variables just described. awk can do almost everything the cut command does, and much more. For example, get the user, group and filename for each file in the current directory:
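The command itself is missing from this copy of the post, so here is a sketch reconstructed from the description that accompanies it. To keep the result predictable, it runs on a made-up listing in the style of ls -l instead of a real directory:

```shell
# Sample input standing in for `ls -l` output (made-up values).
sample='total 8
-rw-r--r-- 1 alice staff 120 Sep 15 12:00 notes.txt
-rw-r--r-- 1 bob   staff 340 Sep 15 12:01 todo.md'

# NR>1 skips the leading "total" line; $3 is the owner,
# $4 the group, and $NF the last field (the file name).
owners=$(printf '%s\n' "$sample" | awk 'NR>1 { print $3, $4, $NF }')
printf '%s\n' "$owners"
```

In a terminal, you would pipe the real thing instead: ls -l | awk 'NR>1 { print $3, $4, $NF }'. Keep in mind that $NF only holds the last whitespace-separated word, so file names containing spaces would be cut short.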

Here we have a single pattern-action statement. The pattern NR>1 skips the first line of input (the “total” summary line, which reports a block count, not a file) so it won’t be printed. The action selects fields 3, 4 and the last field and prints them out. Take a moment to understand what $NF actually represents: NF holds the number of fields in the current record, so $NF dereferences it to access the last field.

Now, imagine you want to extract some data from the /etc/services file in your system.

The file format is the following: service_name port/protocol #comment. We are interested in extracting the name and the protocol of each service. We also want to skip commented lines. The problem here is the default field separator does not match the data we want to extract. Luckily, with awk, you can define your own regular expression as a field separator with -F:
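Here is a sketch of what such a command could look like, based on the breakdown that comes next. The sample lines are invented, but they follow the service_name port/protocol #comment format:

```shell
# Made-up lines in the /etc/services format.
services='# Network services, Internet style
ssh 22/tcp # SSH Remote Login Protocol
http 80/tcp # World Wide Web HTTP'

# -F sets FS to "one or more digits followed by a slash";
# !/^#/ skips the lines that start with a comment marker.
out=$(printf '%s\n' "$services" | awk -F"[0-9]+/" '!/^#/ { print $1, $2 }')
printf '%s\n' "$out"
```

Against the real file, the same program would be run as awk -F"[0-9]+/" '!/^#/ { print $1, $2 }' /etc/services.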

This last example introduced a few new constructs and syntax. Let’s give it a look:

  • -F"regex": set the field separator (FS) to the regular expression “regex”
  • /regex/: in a pattern, match the line against the regular expression “regex”
  • ! expression: reverse the truth value of “expression”

The regular expression [0-9]+/ matches one or more digits followed by a /. For example, the 48128/ in “48128/ whatever” would be a match. Essentially, we are setting our field separator to the section of the line we want to remove. Next comes the pattern ! /^#/. The ! reverses the value of the expression /^#/, a regular expression that matches any line starting with #, so we skip lines starting with #. Finally, we print the fields on either side of our field separator. The result is almost what we want, but unfortunately, we also get the comment at the end of the line. We could pipe the output to another awk and print the first two fields:
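A sketch of that two-stage pipeline, using invented sample lines in the /etc/services format:

```shell
# Made-up lines in the /etc/services format.
services='# Network services, Internet style
ssh 22/tcp # SSH Remote Login Protocol
http 80/tcp # World Wide Web HTTP'

# Stage 1 splits on the port number; stage 2 re-splits the result
# on whitespace and keeps only the first two words.
out=$(printf '%s\n' "$services" |
  awk -F"[0-9]+/" '!/^#/ { print $1, $2 }' |
  awk '{ print $1, $2 }')
printf '%s\n' "$out"
```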

It worked! (There’s of course a better way of doing this using the built-in function split. I won’t get into built-in functions in this post, but it’s good to know they are there to help.)
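A sketch of the split variant, reconstructed from the explanation in this section and run on invented sample lines:

```shell
# Made-up lines in the /etc/services format.
services='# Network services, Internet style
ssh 22/tcp # SSH Remote Login Protocol
http 80/tcp # World Wide Web HTTP'

# split($2, a, "#") cuts the second field at the comment marker
# and stores the pieces in the array a; a[1] is the protocol part.
names=$(printf '%s\n' "$services" |
  awk -F"[0-9]+/" '!/^#/ { split($2, a, "#"); print $1, a[1] }')
printf '%s\n' "$names"
```

The output still carries the spaces that surrounded the port and the comment marker, which is harmless on screen but worth knowing if you compare strings.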

In this case, we have two statements. The first one splits $2 using # as a separator and stores the parts in the array a. In awk, variables are dynamic, and they don’t have to be declared or initialized before being used. Finally, we print out the first field as before, but then we print the first element of the array a. Later, I will show you an easy way of getting rid of repeated lines, whether they’re consecutive or not.

Let’s see a few more simple examples. Imagine you want to double-space a file. By default, awk uses the newline \n as the output record separator (ORS), and we can easily change it to double-space a file. The following one-liners are equivalent:
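A sketch of the two one-liners, applied to two invented input lines instead of a file:

```shell
# Option 1: set ORS on the command line with -v.
v1=$(printf 'a\nb\n' | awk -v ORS='\n\n' '1')
# Option 2: set ORS in a BEGIN block, before any line is read.
v2=$(printf 'a\nb\n' | awk 'BEGIN { ORS = "\n\n" } 1')
printf '%s\n' "$v1"
```

For a real file, put its name after the program, for example awk -v ORS='\n\n' '1' notes.txt, where notes.txt is a hypothetical file name.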

We can define variables using the option -v on the command line. In the first example, we are overwriting the default ORS value to add an extra newline at the end of every line. In the second example, we are using the special pattern BEGIN. This pattern always matches before any line is processed, so it can be used to run initializations, or to run code when we don’t care about the input at all. In both cases, the result is the same: We get the file double-spaced.

In the next example, we just print “hello world!” The code does not expect any input, and the only pattern-action statement we have is the BEGIN one.
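A sketch of such a program:

```shell
# A program made only of a BEGIN block never reads any input.
out=$(awk 'BEGIN { print "hello world!" }')
printf '%s\n' "$out"
```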

Let’s move on. Now we are interested in getting the total byte count for all the YAML files in a directory. That’s an easy task for awk:
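The original command is missing here, so this is a sketch pieced together from the explanation of it. It runs on a made-up ls -l style listing so the numbers are predictable:

```shell
# Sample input standing in for `ls -l` output (made-up values).
listing='total 24
-rw-r--r-- 1 alice staff 120 Sep 15 12:00 deploy.yaml
-rw-r--r-- 1 alice staff 300 Sep 15 12:01 service.yaml
-rw-r--r-- 1 alice staff 999 Sep 15 12:02 README.md'

# Add up column 5 for lines ending in .yaml; print the sum at the end.
total=$(printf '%s\n' "$listing" | awk '/\.yaml$/ { sum += $5 } END { print sum+0 }')
printf '%s\n' "$total"

# With no input at all, sum is never assigned; sum+0 still prints a number.
empty=$(printf '' | awk '/\.yaml$/ { sum += $5 } END { print sum+0 }')
printf '%s\n' "$empty"
```

In a real directory, the input would come from ls -l piped into the same awk program.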

I know, I could have used ls -l *.yaml and piped that to awk. What I did instead illustrates something I want you to remember: if you want to filter the input or the output of a command, do not grep | awk or awk | grep unless it is strictly necessary. You can do the same with a /regex/ filter inside awk’s pattern. That saves you an extra pipe and gives you a cleaner, more compact expression. In this example, we first filter the input, making sure we only get files ending in .yaml. For each of those files, we accumulate column 5 (the byte count) into the sum variable. Finally, we use the special pattern END to print the value of sum once all input has been processed. If no file matched, sum would never be assigned, and printing an unassigned variable gives an empty line, which is not very nice. That’s why we use the expression sum+0. Before you ask: yes, an unassigned variable plus 0 is 0.

A Final Example

Before I finish, I would like to show you something that is non-trivial. The following example is what you would call a very idiomatic awk one-liner. Can you guess what it does?
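The one-liner itself did not survive in this copy of the post. Judging by the walkthrough that comes after it (an array a indexed by $0, a ++ increment and a ! negation), it is presumably the well-known idiom below, so treat it as a reconstruction. The input lines are invented; run it and compare the result with your guess:

```shell
# The whole program is a single pattern with no action.
lines=$(printf 'a\nb\na\nc\nb\n' | awk '!a[$0]++')
printf '%s\n' "$lines"
```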

Notice we don’t have an action, just a pattern. And what the heck, this pattern is not the usual expression we have seen so far. It’s doing stuff. In the end, what a pattern does is filter input lines, so it must evaluate to either true or false. Let’s do a step-by-step execution to see what’s going on.

We get the first line, and at this point the array a does not exist, but that’s not a problem for awk: a gets created. a[$0] does not exist either, so it also gets created, with an uninitialized value. Next comes the increment, ++. Applied to an uninitialized value as a post-increment, it makes awk treat the value as a number and give us back a 0 as the result. Finally, we have !0, which evaluates to true, and the line gets printed out. Easy peasy. What happens with the next line? Well, there are two possible outcomes, and do you know what? I have already written too much, so I will let you figure that out.


