Unix Pioneer Brian Kernighan Still Loves AWK After All These Years

Although most of the maintenance around Unix AWK these days is done by long-time Unix programmer Arnold Robbins, AWK creator Brian Kernighan stepped in to bring the Unix utility into the Unicode era.
Aug 28th, 2022 6:00am by
Feature image via Wikipedia.

Brian Kernighan may be the closest thing we have to a living legend. He coined the term “Unix” back in 1970 and is recognized for pioneering work at Bell Labs (where the operating system was born). As the co-author of Unix’s AWK tool, Kernighan even has his name living on in our developer environments: his last initial provided both the “K” in AWK and the “K” when people cite the iconic 1978 “K&R book” about C programming.

Earlier this month Kernighan gave an interview to the YouTube channel Computerphile (which has 2.18 million subscribers). Chatting with David F. Brailsford, a computer science professor at the University of Nottingham, Kernighan weighed in on everything from the best programming language for 10-year-olds to his memories of AWK’s “very short development cycle in 1977.”

In describing AWK’s utility, succinctly and clearly, Kernighan almost becomes an accidental evangelist, calling AWK “an example of what’s the right tool for the job.”

And what kind of job is that? “Something that you know in your heart you could probably write in one line if you had the right language. AWK is the language that lets you write it in one line. Because it takes care of a lot of the baggage that you would otherwise need in some other language.”

Kernighan admits Python’s the language if you’re only going to choose one for the rest of your life. But with Python, “you need the baggage of how do you get the input? How do you split it into various components? How do you write it out? All of those things happen for free in AWK, and that’s one of the reasons why AWK programs tend to be very very short compared to programs in other languages.

“They run at about the same speed as they would in Python.”
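
To make the “baggage” concrete, here is a minimal Python sketch of a task that AWK can express as the one-liner `awk '{ sum += $2 } END { print sum }'`: adding up the second field of every input line. The function name and the sample data are invented for illustration and do not come from the interview.

```python
import io


def sum_second_field(lines):
    """Add up the second whitespace-separated field of each line."""
    total = 0.0
    for line in lines:           # AWK loops over input records implicitly
        fields = line.split()    # AWK splits each record into $1, $2, ... implicitly
        if len(fields) >= 2:     # skip lines that have no second field
            total += float(fields[1])
    return total


# Hypothetical sample input; a real script would read sys.stdin or a file.
sample = io.StringIO("apples 3\npears 4.5\nplums 2\n")
print(sum_second_field(sample))  # prints 9.5
```

One difference worth noting: AWK treats a non-numeric field as zero, while this sketch would raise a `ValueError` on one, so the two are not exact equivalents.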

Enter the Unicode

But six minutes in, Brailsford asks a revealing question: does Kernighan keep AWK under active maintenance? Kernighan says it’s been on GitHub for “quite a while” now, without a formal release schedule, and credits long-time Unix programmer Arnold Robbins for “most of the active work.” Robbins is also the current maintainer of the GNU project’s version of AWK, and Kernighan describes Robbins as “incredibly good at this kind of thing” and “a very good friend… I think of him as actually the person keeping an eye on it, for the most part.” Robbins has even augmented Kernighan’s own test suites for AWK.

But Kernighan hasn’t abandoned AWK development altogether.

“It’s always been an embarrassment that AWK only works with ASCII, or maybe 8-bit, inputs, but it doesn’t really handle Unicode at all. And so a few months ago,” Kernighan said.

Unicode is the successor to the older, much more limited ASCII character set, incorporating the world’s languages and emojis.

With an anticipating laugh, Kernighan said, “I spent some time working with an incredibly old program — and I have it at this point where it will actually handle UTF-8 [a Unicode encoding] input and output, so that you can have regular expressions that, you know, pick up Japanese characters or something like that. And that appears to work correctly.”
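
Why does this matter for regular expressions? In UTF-8, a single Japanese character occupies several bytes, so a matcher that works byte by byte sees fragments rather than characters. The Python sketch below illustrates the general problem only; it says nothing about how AWK’s own code was changed, and the sample sentence is made up.

```python
import re

text = "AWK は 1977 年に生まれた"  # made-up sample: "AWK was born in 1977"

# One hiragana character is three bytes once encoded as UTF-8.
encoded = "は".encode("utf-8")
print(len(encoded))  # 3

# Byte-oriented matching: "." consumes a single byte, i.e. a fragment.
first_byte = re.match(rb".", encoded, re.DOTALL).group()

# Character-oriented matching: "." consumes the whole character.
first_char = re.match(r".", "は").group()

# A character class over the Unicode hiragana block (U+3040 to U+309F)
# only makes sense when the matcher works on code points, not bytes.
hiragana_runs = re.findall("[\u3040-\u309f]+", text)
print(hiragana_runs)  # ['は', 'に', 'まれた']
```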

Kernighan notes Robbins also worked on egrep, a tool whose pattern-matching mechanism relies on parsing that’s “essentially the same” as AWK’s. “The code is pretty — what’s the correct technical word? — inscrutable.” Kernighan laughs. “But fortunately I was able to figure enough of it out that I could put in the UTF processing and Unicode processing inside.”

Kernighan describes his updates as “sort of in the staging version” on GitHub. But to the question of whether he’s still working on AWK, 45 years later, the answer is yes: “that was actual real work, trying to understand old code and insert something into it. I think I’ve got it right, but…it needs more tests.”

“The other thing I did was just a quick-and-dirty thing to make it possible to handle CSV inputs — comma-separated values. Because that was never really done, and so now if you have the kind of straightforward CSV input… it will handle that properly on input. That’s basically all the development I’ve done.”
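
CSV needs special handling because splitting on commas alone breaks as soon as a quoted field contains a comma. The sketch below shows that pitfall using Python’s standard `csv` module. The sample row is invented, and the interview does not spell out exactly which CSV cases the new AWK code covers, so this illustrates the problem rather than describing AWK’s behavior.

```python
import csv
import io

# Invented sample row: the quoted middle field contains a comma.
line = 'Kernighan,"The AWK Programming Language, 1st ed.",1988'

# Naive approach: split on every comma, which cuts the quoted field in two.
naive = line.split(",")
print(len(naive))   # 4

# CSV-aware approach: the quoted field stays in one piece.
parsed = next(csv.reader(io.StringIO(line)))
print(len(parsed))  # 3
print(parsed[1])    # The AWK Programming Language, 1st ed.
```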

And then he’s off for a discussion on how programs should be tested, calling the issue “fraught.”

And the Internet Cheers

It’s been fun to watch the reactions. “Unix legend, who owes us nothing, keeps fixing foundational AWK code,” reads a headline at Ars Technica, which marveled at the text of an email Kernighan sent this May to Robbins (in lieu of a longer git commit message). “Brian Kernighan said hello, asked how their U.S. visit was going, and dropped off hundreds of lines of code that could add Unicode support for AWK, the text-parsing tool he helped create for Unix at Bell Labs in 1977.”

And their post attracted 360 comments, more than one expressing relief that they’re not the only ones who have trouble with Git. “Nobody understands the git cli,” wrote one commenter. “Some people have just memorized more commands than others.” (That comment attracted 249 upvotes.)

So the geekery keeps coming. Later in the interview, Kernighan even says he’s had conversations with the other two original authors of AWK (both now in their 80s) and publisher Addison-Wesley about whether they need to update their 1988 book.

Kernighan quips that the new version “would deal with things like, ‘Well, we can now represent Unicode characters at least plausibly.'”

“But I think more generally, the computing environment is just incredibly different today than what it was 35 or 40 years ago. Machines are, you know, a hundred to a thousand times faster. Memories are a million times bigger. And that changes the way you think about things.

“It used to be you couldn’t afford to run AWK programs on big data, and now that’s not true. It processes megabytes in milliseconds. And so that changes the tradeoffs that you might make.”

Kernighan also couldn’t help noticing how much our tools have changed since the first version of the book in 1988, which was written at the high point of the Unix document-formatting tool troff.

At one point Kernighan says he still has the original file for the 1988 AWK book, saved in the PostScript file format, which predates even PDF.

Publishing and Committing

So what else has Kernighan been up to lately?

It turns out — quite a lot.

Brian Kernighan turned 80 this January — and he’s been publishing regularly, according to Kernighan’s web page at Princeton University.

  • Last year Kernighan published a new book exploring “the social, political and legal issues that new technology creates.” The book’s title: Understanding the Digital World: What You Need to Know about Computers, the Internet, Privacy, and Security.
  • In 2019 Kernighan also self-published a Kindle ebook titled Unix: A History and a Memoir, exploring not just the origins of Unix but “how it came about, and why it matters.”
  • In 2018 Kernighan also published Millions, Billions, Zillions: Defending Yourself in a World of Too Many Numbers, which his web page describes as “an essential survival guide for a world drowning in big — and often bad — data.”
  • In 2015, Kernighan even co-authored a book about the Go programming language for Addison-Wesley.

And of course, in 1978 Kernighan authored what’s been called the world’s very first “Hello, world” program, and in 1988 Kernighan co-authored a book on the AWK programming language.

Kernighan’s interests are surprisingly eclectic. In 2020, Kernighan also co-authored a paper about the real-world challenges of applying optical character recognition to 180,000 pages of court records from 1674 to 1913. The “Proceedings of the Old Bailey,” the official records from the central criminal court of England and Wales, offered “an ideal benchmark” for testing how optical character recognition performs on historical documents, their paper notes.

And since human transcripts already exist for the 180,000 pages, they were able to use them to test the accuracy of three top cloud-based OCR services: Amazon Web Services’ Textract, Microsoft Azure’s Cognitive Services, and Google Cloud Platform’s Vision.

“Our results found that AWS had the lowest median error rate, Azure had the lowest median round trip time, and Google Cloud Platform had the best combination of a low error rate and a low duration.”

Since 2000 Kernighan’s been part of Princeton University’s computer science faculty, where this spring he taught a course on “Digital Humanities” exploring how digital representations (and other technologies) are being used for everything from literature, languages, and history to music, art, and religion. (“Digital humanities data is intrinsically messy,” explains the course’s description, “and there is always a considerable effort devoted to cleaning it up even before study can begin.”)

Kernighan’s class promises a seminar “aimed at building tools and developing techniques that will help humanities scholars work more effectively with their data. This might include machine learning, natural language processing, data visualization, data cleaning, and user interface design for making the processes available to scholars just starting out in technology.”

So maybe it was inevitable Kernighan would start thinking about Unicode characters…

