
Regular Expressions and Solving the Food Taster Dilemma

A look at lookaround functions in regular expressions, and a reminder of why paranoid kings and emperors employed food tasters.
Aug 24th, 2022 3:00am by
Feature image via Shutterstock.

In my previous regex post, we covered simple regular expressions that work with most code base libraries, as well as with the search functions in many editors. Things were fairly simple to comprehend, and all was good with the world. So now I’m going to ruin all that.

Just a reminder of why we should value regular expressions. By knowing one way to search through text (even a 70-year-old way) in a computational style, we can not only solve problems with different tools, we can also better understand the problems with search itself.

This time I’m going for the next step up in difficulty: the lookaround functions. But first, we have to understand consumption. You’ll soon appreciate why I didn’t cover this in the first post.

Consumption

When the regex process matches successfully, it includes the matched character in the result and then moves on to the next character. This sounds like the right thing to do, but it does mean you can’t check actions before you commit to them. This is why paranoid kings and emperors employed food tasters; you couldn’t un-eat an apple after you discovered it was poisoned.

But if you remember, when we saw anchors we noticed they didn’t consume. They were used to clamp the search to the beginning or end of a line. We used this to find the first word in a line of Shakespearean text:
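As a minimal sketch in Python: the two sample lines and the pattern `^\w+` are assumptions for illustration, not necessarily the ones used in the earlier post. The anchor pins the match to the start of each line without consuming anything, and `\w+` then consumes the first word.

```python
import re

# Two sample lines from Hamlet, chosen for illustration.
text = "To be, or not to be, that is the question:\nWhether 'tis nobler in the mind to suffer"

# ^ anchors the match to the start of each line (with re.MULTILINE)
# but consumes nothing; \w+ then consumes the first word.
first_words = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(first_words)  # ['To', 'Whether']
```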

Here is an example that shows the food taster dilemma in detail. How can you find letter combinations that break the rule “i before e except after c”? This familiar but rather unfortunate rule of the English language has a lot of exceptions.

If we just apply two simple searches for the two sets of offending exceptions, it won’t work as you can see here:

Clearly, the search can’t tell if it’s looking at a good “cei” or it has found a genuine rule-breaking “ei”.
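A rough Python sketch of the problem, assuming the first simple pattern is a plain `ei` and using a hypothetical word list of my own (three rule-abiding words and five renegades):

```python
import re

# Hypothetical sample words; the article's own list is not reproduced here.
good = ["receive", "ceiling", "believe"]
renegades = ["science", "species", "seize", "veil", "weird"]

# A bare "ei" search cannot tell a legitimate "cei" from a rule-breaking "ei".
ei_hits = [w for w in good + renegades if re.search(r"ei", w)]
print(ei_hits)  # ['receive', 'ceiling', 'seize', 'veil', 'weird']
```

The good words "receive" and "ceiling" are flagged alongside the genuine offenders.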

At least applying the second pattern is straightforward for catching half the offenders that contain “cie”:
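In Python this might look as follows, assuming the second pattern is simply `cie` and again using a hypothetical word list:

```python
import re

# Hypothetical sample words, a mix of good spellings and renegades.
words = ["receive", "ceiling", "believe", "science", "species", "seize", "veil", "weird"]

# "cie" always breaks the rule, so a literal search is enough for this half.
cie_hits = [w for w in words if re.search(r"cie", w)]
print(cie_hits)  # ['science', 'species']
```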

But how can we detect both sets of offenders correctly with just one query? The answer would seem to be that we can use an alternation (an “either-or” rule) to combine both rules and then disallow the “c” when checking for the “ei” rule:

The above solution looks good at first, but the last three of the renegade spellings have also captured the “s”, “v” and “w”. These are not problem letters in themselves, but they matched the negated character class and were consumed along with the “ei”.
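A minimal sketch of that attempt, assuming the combined pattern is `cie|[^c]ei` and that the three words in question are "seize", "veil" and "weird" (my guess from the captured letters):

```python
import re

# Hypothetical sample words: two good spellings followed by five renegades.
words = ["receive", "ceiling", "science", "species", "seize", "veil", "weird"]

# Alternation: either "cie", or "ei" preceded by any character that is not "c".
pattern = re.compile(r"cie|[^c]ei")

matched = {}
for w in words:
    m = pattern.search(w)
    if m:
        matched[w] = m.group(0)  # the text that was actually consumed

print(matched)
# {'science': 'cie', 'species': 'cie', 'seize': 'sei', 'veil': 'vei', 'weird': 'wei'}
```

The good words are correctly skipped, but `[^c]` has to consume a character to do its job, so the preceding letter ends up in the match.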

Look Around without Consuming

We need to be able to “look around” without consuming. Command the food taster to have a nibble, and see what happens. Below we use the correct expression to ignore the result of the “ei” check when a “c” appears in front of it:

That worked. The new terrible-looking profusion of characters is a negative lookbehind. A close inspection of the screenshot above shows that there is a little warning flag on the right side — it is warning us that “the browser may not support negative lookbehind”!
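A sketch of the same check in Python, assuming the expression is `cie|(?<!c)ei` and reusing a hypothetical word list. Python's `re` module supports fixed-width negative lookbehind, so no warning flag appears here.

```python
import re

# Hypothetical sample words: two good spellings followed by five renegades.
words = ["receive", "ceiling", "science", "species", "seize", "veil", "weird"]

# (?<!c) peeks at the previous character and fails if it is "c",
# but consumes nothing, so only "ei" itself lands in the match.
pattern = re.compile(r"cie|(?<!c)ei")

matched = {}
for w in words:
    m = pattern.search(w)
    if m:
        matched[w] = m.group(0)

print(matched)
# {'science': 'cie', 'species': 'cie', 'seize': 'ei', 'veil': 'ei', 'weird': 'ei'}
```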

Let us admire the whole lookaround family:
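The four members can be sketched in Python as below. The test string and the choice of "b" as the consumed character are my own assumptions; "a" plays the role of the sub-expression being looked at.

```python
import re

text = "ab ba bb"  # "b" sits at indexes 1, 3, 6 and 7

family = {
    "positive lookahead": r"b(?=a)",    # b followed by a
    "negative lookahead": r"b(?!a)",    # b not followed by a
    "positive lookbehind": r"(?<=a)b",  # b preceded by a
    "negative lookbehind": r"(?<!a)b",  # b not preceded by a
}

# Where does each pattern match?
positions = {
    name: [m.start() for m in re.finditer(p, text)]
    for name, p in family.items()
}
# What text is consumed? Only ever the "b"; the "a" is looked at, not eaten.
consumed = {m.group(0) for p in family.values() for m in re.finditer(p, text)}

for name, pos in positions.items():
    print(name, pos)
print(consumed)  # {'b'}
```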

The Sub Regex

In the diagram above, where I have put an “a”, you can place a character or indeed any regex. This is matched by the lookaround before the rest of the expression is processed. So technically, we can have a form of “if .. then”, like a programming fork. Let’s say we want to find words for a Wordle-type puzzle; for example, five-letter words that include the renegade combinations from our former problem.

So we want to look only at five-letter words. To do this we want to use the following functions, most of which I introduced in my previous regex article:


So, a word of length five letters would be matched by the expression:


Think of this as five consecutive letters sandwiched between two word boundaries, where a boundary is the position next to a non-word character (e.g. a space or punctuation) or the start or end of a line.
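A minimal sketch in Python, assuming the expression is `\b\w{5}\b`: `\b` is a word boundary, `\w` is a word character, and `{5}` repeats the previous item exactly five times. The sample words are hypothetical.

```python
import re

# Hypothetical sample text with words of four, five and seven letters.
text = "seize weird veil science receive"

# Exactly five word characters with a word boundary on each side.
five_letter = re.findall(r"\b\w{5}\b", text)
print(five_letter)  # ['seize', 'weird']
```

Like the anchors, `\b` consumes nothing, so the match contains only the five letters.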

It works with our example words:

So now, we can combine this with our “i before e” renegade detector.

Now, doing one “sub” calculation before going on to do another is fine for a general-purpose programming language, but a little bit of a stretch for regex. Sure it works, but you will start to produce some difficult-to-read code that might be tough to debug.

Nevertheless, if we do a positive lookahead sub calculation with our length solution, and then follow it on with our renegade test:

It doesn’t work! But that’s because we have positioned ourselves in front of the word and we are not just freely looking for the combo anywhere on the line. We need to represent the entire word.
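The failure can be reproduced with a short Python sketch. The combined pattern here is my assumption of what the attempt looked like: a positive lookahead for the length, followed directly by the renegade test.

```python
import re

text = "seize weird"  # two five-letter renegades (hypothetical sample)

# The lookahead confirms a five-letter word starts here, but consumes nothing.
# The renegade test is then applied at that same spot: the first letter.
naive = re.findall(r"\b(?=\w{5}\b)(?:cie|(?<!c)ei)", text)
print(naive)  # []
```

Neither word starts with "cie" or "ei", so nothing matches.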

So we pad our detector expression with word characters that might appear before and after:
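A sketch of the padded version in Python, assuming `\w*` is the padding used on both sides and working on a hypothetical word list:

```python
import re

# Hypothetical sample text mixing renegades and good spellings of various lengths.
text = "seize weird veil science species receive ceiling believe field"

# Lookahead: a five-letter word starts here.
# Then: any letters, the renegade combination, any letters.
pattern = r"\b(?=\w{5}\b)\w*(?:cie|(?<!c)ei)\w*"
five_letter_renegades = re.findall(pattern, text)
print(five_letter_renegades)  # ['seize', 'weird']
```

The five-letter good guy "field" passes the length check but fails the renegade test, while "veil" and "science" never get past the lookahead.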

That worked! Finally, let’s just prove that if we expand the repetition length to between four and 10 letters, we really capture all the renegade examples (and none of the good guys):
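To experiment with the length range, the expression can be wrapped in a small helper. `find_renegades` is a hypothetical name of my own, and the word list is again an assumption.

```python
import re

def find_renegades(text, lo, hi):
    """Return words of lo..hi letters containing 'cie' or a non-c 'ei'."""
    # %d placeholders fill in the repetition range, e.g. \w{4,10}.
    pattern = r"\b(?=\w{%d,%d}\b)\w*(?:cie|(?<!c)ei)\w*" % (lo, hi)
    return re.findall(pattern, text)

text = "seize weird veil science species receive ceiling believe field"

wide = find_renegades(text, 4, 10)
print(wide)  # ['seize', 'weird', 'veil', 'science', 'species']
```

Calling `find_renegades(text, 5, 5)` narrows the result back to the five-letter words only.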

Now part of the reason we are doing this is that we can use this tool in other editors. Or can we? Let us try just the renegade detector in two different editors. First, Microsoft Word:

“In its own unique way” is a rather large red flag. Not only does it not support lookaround (no surprise there), it doesn’t even support alternation. Sad face.

Sublime is a developer-friendly editor, and has no trouble with regex. You just have to hit the “.*” button (that is the informal sign that regex is welcome here) and off you go:

So, some success at least. I hope text editors and search facilities retain the faithful regex, and that you remember this independent solution when you need to find text treasure hidden in a forest of words.
