
Tutorial: Tame Your Access Log with Unix Pipes

19 Feb 2019 10:11am

Last month the developer community took note of a blog post by Jessie Frazelle — a well-respected programmer who’s done time at big-name companies like Docker, Google, and Microsoft, and describes herself as a “hacker of all things Linux & Containers.” In the blog post, Frazelle explained why she appreciates the Unix pipe, calling it “brilliant software design… that enables creativity, values simplicity, and doesn’t put users in a box.”

“[P]rograms being simple and doing one thing well makes them easy to grok. The beautiful part, though, is the fact that in combination with an operator like | the program becomes one step in a much larger plan.”

I’d have to agree. One of my friends once joked that you could write a command for anything in Unix — but that command might be six lines long, connecting a long series of commands with pipes.

Let me share my own favorite real-world example — using pipes to extract a very handy report from a typical web server’s access log — with a series of commands that actually have their own colorful history.

A Half Century of Pipes?

Doug McIlroy, who headed Bell Labs’ famous Computing Techniques Research Department from 1965 through 1986, first came up with the idea for pipes, more than 50 years ago. “We should have some ways of coupling programs like a garden hose,” he wrote in 1964 — he’d actually pecked the words out on a typewriter — saying this method would let programmers “screw in another segment when it becomes necessary to massage data in another way.”

It wasn’t actually implemented until 1973, but McIlroy identifies this as the development that led to the Unix philosophy of combining powerful individual tools which each do one thing well.

And it was still on his mind 13 years later, in 1986, when he touted pipes as a superior solution to a 10-page Pascal program that was written by math legend Donald Knuth.

Knuth’s program reported the most frequently-used words in a text file and was published in Jon Bentley’s “Programming Pearls” column in the Communications of the Association for Computing Machinery. Doug McIlroy sent a letter in response to the column which praised Knuth’s cleverness — before pointing out that the same thing could be accomplished with just six Unix commands joined by pipes.

tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q

“[O]ld UNIX hands know instinctively how to solve this one in a jiffy…” McIlroy wrote, adding “The simple pipeline given above will suffice to get answers right now, not next week or next month.”
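The pipeline reads from standard input, and ${1}, the script's first positional parameter, sets how many of the top words to print. Substituting a literal count, you can try it on any stream of text (the sample sentence here is my own, not from McIlroy's letter):

```shell
# McIlroy's pipeline with a literal count (2) in place of ${1}:
# split text into one word per line, lowercase everything,
# count the duplicates, and print the top two
echo "the quick brown fox jumps over the lazy dog the fox" |
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed 2q
# the two most frequent words, "the" (3) and "fox" (2), come out on top
```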

I was thrilled when I read this example because I’d stumbled across at least part of McIlroy’s solution myself — the useful three-command combination sort | uniq -c | sort -rn.

Pipes versus Your Access Log

I’ve been using that same combination for many years just to crunch data from my web server’s access log (which for decades has followed a standardized format that includes the date, time, and time zone of each request, plus its originating IP address and any information on referrers).

68.186.130.95 - - [15/Feb/2019:08:18:46 -0800] "GET /sloop-john-b.htm HTTP/1.1" 200 2106 "https://www.google.com/" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.96 Safari/537.36"

What I want to know is whether someone’s sending traffic to my website — and which pages are most popular on any given day.

Pipes to the rescue…

It’s a bit of a kludge, but I noticed that the name of the file being accessed was always the seventh text block (separated by spaces). And Unix’s awk command does let you print out specific parts of a line, after defining (with the -F flag) what’s separating those parts. So in my case, that’s a space.

cat access.log | awk -F" " '{print $7}'
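To sanity-check the field numbering, you can feed the sample log line from above through the same awk invocation and confirm that field 7 really is the path:

```shell
# Count fields from the left: IP (1), two dashes (2, 3), the bracketed
# timestamp and time zone (4, 5), the quoted "GET (6) -- so the path is 7
echo '68.186.130.95 - - [15/Feb/2019:08:18:46 -0800] "GET /sloop-john-b.htm HTTP/1.1" 200 2106 "https://www.google.com/" "Mozilla/5.0"' |
awk -F" " '{print $7}'
# prints /sloop-john-b.htm
```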

This command gives a glorious list of every file that’s been accessed during the last day — spewing them all across my screen. Obviously, I could pipe them into Unix’s less command to read them one screenful at a time — but what I really want is a report that just lists every file that’s been accessed today – listing each file just once — and ideally, sorted from most-accessed to least-accessed (and displaying the count).

Fortunately, that’s exactly what sort | uniq -c | sort -rn does. The sort command puts the whole list in alphabetical order — which allows the uniq command to remove (while counting) all of the adjacent duplicates. The only problem with this list is it’s not sorted from most to least — but that’s easily fixed by piping the output for a second time through the sort command, this time using its -r flag for “reverse order” and its -n flag for numeric sorting — largest numbers first.
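Each stage is easy to watch in isolation. With a toy list of page names (my own sample data, not the article's log):

```shell
# Stage 1: sort groups the duplicates together
printf '/a.htm\n/b.htm\n/a.htm\n/c.htm\n/a.htm\n' | sort

# Stage 2: uniq -c collapses adjacent duplicates into a count
printf '/a.htm\n/b.htm\n/a.htm\n/c.htm\n/a.htm\n' | sort | uniq -c

# Stage 3: sort -rn puts the biggest counts first, so "3 /a.htm" leads
printf '/a.htm\n/b.htm\n/a.htm\n/c.htm\n/a.htm\n' | sort | uniq -c | sort -rn
```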

cat access.log | awk -F" " '{print $7}' | sort | uniq -c | sort -nr

Then just pipe everything into the more command so you can read it one screenful at a time.

cat access.log | awk -F" " '{print $7}' | sort | uniq -c | sort -nr | more

Or pipe everything into the tail command if you only want to see the 20 most-accessed files — after an ascending numeric sort, the biggest counts land at the bottom of the list.

cat access.log | awk -F" " '{print $7}' | sort | uniq -c | sort -n | tail -20

That’s one of the unappreciated joys of these strings of piped commands. You can change one parameter, and the rest of your glorious pipeline still works – which makes it incredibly easy to experiment with alternate outputs. For example, I could generate an equally-useful report by changing which part of the access log the awk command prints out. Let’s say that instead of which pages were most-viewed, I wanted to see which site was sending the most referrals. Just replace print $7 with print $11

cat access.log | awk -F" " '{print $11}' | sort | uniq -c | sort -nr | more

This is handy if, for example, you want to see only the traffic that’s coming from Google.
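One way to narrow it down is to slip a grep into the same pipeline before the counting happens (access.log is the file name used throughout this article; adjust for your server):

```shell
# Keep only the referrer entries that mention google, then count them
cat access.log | awk -F" " '{print $11}' | grep -i google | sort | uniq -c | sort -nr
```

The -i makes the match case-insensitive, so www.google.com and any Google country-specific domains all get counted.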

And the fun experimentation was just beginning. This pipeline ultimately became a kind of stub that I used to generate several other useful log-crunching reports. I wrote scripts that would use grep to find accesses of just one specific page — then put them in my Unix bin and gave them an alias, so typing a single word on the command line would magically display a report of how often the corresponding web page had been accessed today.

grep 'pagename' access.log | awk -F" " '{print $11}' | sort | uniq -c | sort -rn | more
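One way such a script might look (the file name pagehits.sh and the alias below are my own inventions, not from the article):

```shell
#!/bin/sh
# pagehits.sh: wrap the grep pipeline so one argument picks the page.
# Usage: pagehits.sh pagename
# Prints a per-referrer count of today's hits on the matching page.
grep "$1" access.log | awk -F" " '{print $11}' | sort | uniq -c | sort -rn
```

Drop it in your bin directory and add something like alias sloop='pagehits.sh sloop-john-b' to your shell config, and a single word at the prompt produces the report.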

Sometimes I’d also search access.log.0 (along with access.log) to get a 48-hour report that combined yesterday’s traffic too.
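Concretely, that just means handing cat both files (the access.log.0 name follows the article; log-rotation naming schemes vary by system):

```shell
# 48-hour report: yesterday's rotated log, then today's, crunched together
cat access.log.0 access.log | awk -F" " '{print $7}' | sort | uniq -c | sort -nr | more
```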

From my perspective, the only problem with pipes was that “drunk with power” feeling. Soon I’d duplicated the same functionality using Perl scripts, then put them all in my web-accessible cgi-bin, so I could use my web browser to access my own home-rolled, real-time site analytics. Then I coupled this with Firefox’s keyword functionality, so that just typing the word “hits” instead of a URL into my browser would take me to the URL of that CGI script – which would instantly pull up that up-to-the-minute report (freshly crunched from my access log).

And it was at this point that I realized my friend was right.

You reach a point where you have a powerful and unshakeable faith that whatever magical transformation you want to perform on the data that’s available to you…

…you can almost certainly do it with a long series of piped commands.

Feature image: Creative Commons photo of a stack of pipes by Paul Goyette
