Dr. Torq: Data Processing at the Edge with Linux awk

4 Feb 2020 12:00pm, by

Last June, data scientist and visualization expert Nick Strayer learned a valuable lesson in large-scale data processing: Sometimes even the latest “Big Data”-oriented software doesn’t work as well as what we already have in the Unix toolbox. Looking to parse 25TB of genetic data, he tried using tools such as Parquet and Spark, but in the end, he found the best solution was a combination of the R statistical programming language and the humble awk.

Sometimes we need to take large data sets and hack them into something easily analyzed by a human. Other times, that same data might need to be converted from one format to another, such as when you move from using one application to a new one.

awk is an awesome Linux command-line program for performing those types of tasks. It typically takes plain text as input and produces specifically formatted output. Of course, you could do that with common programming languages like Python or C. If you like to develop lots of custom code, that’s one way to do it. Linux has awk, a built-in utility that’s programmable, so why not use it?

Today we’ll begin exploring awk with a few basic examples. Future articles will cover more advanced topics.

Start Simple

Say we want to print a big listing of files on our system and just show the file names. As always, Linux offers multiple solutions for this task. Use the standard “ls” command with the “-1” option (that’s the number one).

rob% ls -1

ls -1 (one) listing results

Simple, right?

The same could be done with awk, although we’ll need to do a little extra work. Again, get a list of the files, this time using the “ls” command with the “-l” option (the letter l) and redirect the output to a file.

rob% ls -l > rob2.txt

ls -l listing

I used the “head” command with the “-n” option for this screenshot, to display the first dozen lines of the rob2.txt file. The first line of the listing shows the number of 1k blocks used by the files in the directory. For our purposes, we can just delete it to clean up the file. Keep in mind that most real-world data conversions or transformations need a bit of manual intervention before everything can be automated. It’s just the nature of slicing and dicing data. I removed the line using the vi editor and re-saved the file for later use.
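As an alternative to deleting that first line by hand, awk itself can skip it. The following is only a sketch: the listing is made-up sample data written to a temporary file (the file names, owner and sizes are invented for illustration). It relies on awk's built-in NR variable, which holds the current line number.

```shell
# Build a small, made-up "ls -l" style listing (hypothetical data).
listing=$(mktemp)
printf '%s\n' \
  'total 8' \
  '-rw-r--r-- 1 rob rob 120 Mar 3 2015 notes.txt' \
  '-rw-r--r-- 1 rob rob 512 Jun 14 2016 report.txt' > "$listing"

# NR is the current line number, so NR > 1 skips the "total" line.
cleaned=$(awk 'NR > 1 {print}' "$listing")
echo "$cleaned"

rm -f "$listing"
```

The same “NR > 1” pattern can be put in front of any of the print commands in this article, so the original file does not have to be edited at all.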

Notice the printout is much more complicated.

Start by running the file through awk, using its standard print syntax. This outputs each line in the rob2.txt file, much like the normal Linux “cat” command.

rob% awk '{print}' rob2.txt

Simple awk print result

We only want the file names, so we should use a field with the print statement. The file names are in field nine, and by default awk splits each line into fields on whitespace (runs of spaces or tabs). Fields are referenced with the “$” sign.

rob% awk '{print $9}' rob2.txt

awk using only field nine result

That’s better. It looks just like the “ls -1” command output.

A slightly more complex example prints the file names, followed by their modification dates (the date that “ls -l” shows is when the file was last modified, not when it was created).

rob% awk '{print $9,$6,$7,$8}' rob2.txt

awk using 4 fields result

Notice that we used other fields and they can be placed anywhere you like. We could easily have put the date in front of the file name, if we wanted. Field six is the month. Field seven is the day. And field eight is the year. (For recently modified files, “ls -l” typically shows the time of day in that position instead of the year.)
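To illustrate the reordering, here is a small sketch that puts the date in front of the file name. The input line is invented sample data, not taken from the rob2.txt file.

```shell
# One made-up "ls -l" style line (hypothetical data).
line='-rw-r--r-- 1 rob rob 120 Mar 3 2015 notes.txt'

# Print month, day and year first, then the file name.
reordered=$(printf '%s\n' "$line" | awk '{print $6,$7,$8,$9}')
echo "$reordered"
```

The commas in the print statement tell awk to put its output field separator, a single space by default, between the values.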

Sometimes files use commas or other characters for their field separator. Spreadsheets generate a comma separator when you export a .csv file (Comma Separated Values) from MS Excel or LibreOffice Calc. Use the “-F” option to specify the desired field separator in awk. Here’s an example of the command line you’d use for a comma.

rob% awk -F',' '{print $9,$6,$7,$8}' filename
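Here is a small sketch of the “-F” option on comma-separated input. The three-column layout (name, size, year) and the values are assumptions made up for this example, so the field numbers differ from the “ls -l” listing.

```shell
# Made-up CSV data with three columns: name, size, year (hypothetical).
csv=$(mktemp)
printf '%s\n' \
  'notes.txt,120,2015' \
  'report.txt,512,2016' > "$csv"

# Split on commas and print the first and third columns.
name_year=$(awk -F',' '{print $1,$3}' "$csv")
echo "$name_year"

rm -f "$csv"
```

Keep in mind that this simple split treats every comma as a separator, so it would not cope with quoted values that contain commas themselves.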

You can also insert text into the printout. Adding a “Date =” label might be useful.

rob% awk '{print $9,"Date =",$6,$7,$8}' rob2.txt

awk using four fields and adding a date label

You Can Search, Too

awk has built-in search capabilities. Suppose we want to print out only the lines that contain “2015”. We could use the following.

rob% awk '/2015/ {print $9,"Date =",$6,$7,$8}' rob2.txt

awk searching for 2015 results

I verified the output with a quick grep for “2015” in the file.

rob% grep 2015 rob2.txt

grep for 2015 in rob2.txt results

Another way to search is by comparing a field to a value. We can compare field eight (the year) to “2015”.

rob% awk '{if ($8==2015) print $9,"Date =",$6,$7,$8}' rob2.txt

awk compare of field 8 to 2015 results
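The difference between the two kinds of search matters when “2015” can turn up somewhere other than the year. The sketch below uses invented sample data in which one file happens to be 2015 bytes long. It also writes the field comparison as a pattern in front of the braces, which behaves the same as the “if” form.

```shell
# Made-up listing: the second file is 2015 bytes long but dated 2016.
listing=$(mktemp)
printf '%s\n' \
  '-rw-r--r-- 1 rob rob 120 Mar 3 2015 notes.txt' \
  '-rw-r--r-- 1 rob rob 2015 Jun 14 2016 report.txt' > "$listing"

# /2015/ matches "2015" anywhere on the line, including the size column.
whole_line=$(awk '/2015/ {print $9}' "$listing")

# $8 == 2015 only looks at field eight, the year.
year_only=$(awk '$8 == 2015 {print $9}' "$listing")

echo "whole-line search: $whole_line"
echo "field comparison:  $year_only"

rm -f "$listing"
```

The whole-line search reports both files, while the field comparison reports only notes.txt.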

Maybe you’d want to search for years greater than “2015.” Use a comparison there too.

rob% awk '{if ($8>2015) print $9,"Date =",$6,$7,$8}' rob2.txt

awk is field 8 greater-than 2015 results

One More Thing

I mentioned at the beginning of the article that awk is great for data conversion or translation.

Suppose we want to change the year from 2015 to 2016, when it occurs in field 8 (the year). It is as easy as replacing the “$8” field, in the print part, with “2016”.

rob% awk '{if ($8==2015) print $9,"Date =",$6,$7,"2016"}' rob2.txt

awk replacing 2015 with 2016 results
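A variation on the same idea is to assign a new value to the field itself and then print every line, so that lines from other years are kept as well. This is a sketch on invented sample data, not the command used for the screenshot above.

```shell
# Made-up listing with one 2015 entry and one 2016 entry (hypothetical).
listing=$(mktemp)
printf '%s\n' \
  '-rw-r--r-- 1 rob rob 120 Mar 3 2015 notes.txt' \
  '-rw-r--r-- 1 rob rob 512 Jun 14 2016 report.txt' > "$listing"

# First rule: when field eight is 2015, overwrite it with 2016.
# Second rule: print every line, changed or not.
updated=$(awk '$8 == 2015 {$8 = 2016} {print}' "$listing")
echo "$updated"

rm -f "$listing"
```

One side effect worth knowing: when a field is assigned, awk rebuilds that line with single spaces between fields, so any column alignment in the original line is lost.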

Although this is a trivial example, in principle it could be used in quite a few practical situations.

Going Further

awk has a lot of options and it can handle seriously large files. I usually use quick one-liners and either output the results on the fly to my terminal or save them to a new file using standard Linux redirection (the > character). awk also has scripting capabilities, and those can get quite complex. We can investigate those details in a future story.

Data conversions and translations can be tedious. awk, while practically magical, does have a learning curve. With a little bit of practice, awk is certainly better than going through a data file manually.

Don’t forget that awk is available almost everywhere Linux runs. You will find it on Linux servers, desktops, notebooks, Raspberry Pi boards and a variety of nano-Linux machines. Maybe use awk for standalone high-powered data processing at the edge.

TNS Managing Editor Joab Jackson contributed to this post. 

Contact Rob “drtorq” Reilly for consultation, speaking engagements and commissioned projects at doc@drtorq.com or 407-718-3274.

Feature image via Pixabay.
