
Taming Text Search with the Power of Regular Expressions

3 Apr 2022 3:00am, by

Pattern matching is the one skill that kept Homo sapiens on top through the harsh millennia of predatory competition. Our brains are very good at spotting a snake in the grass, even if we are not familiar with that particular snake, or that patch of ground.

You realize quickly how good we are at it when you see how relatively hard it is to persuade computers to do similar “fuzzy” matching; computing is about following rules defined in advance. This is why autonomous vehicles don’t ever quite arrive. However, there is a tool that leaves the human to work out the pattern, and just manages the mechanics.

For the domain of words, Regular Expressions (or regex) are a great way to specify a search pattern in order to find words and characters hiding in a large amount of text. For example, how would you find all the dates mentioned in an email? This is exactly the sort of problem for regex. And with so many websites that let you dump a lump of text and apply regex searches to it, it is very easy to both learn and do valuable work with no more than a browser.

And yet regex is one of those skills, like touch typing, that developers know they should learn but often leave until the last moment. But it is a useful skill for everyone working with text, as many text editors will allow regex searches. What I recommend is that you learn a small core, and then extend your knowledge when you need it. If you like word games (even Wordle), you won’t have any trouble understanding the whys and wherefores of regex.

Because regex is a computing discipline, we need to dip quickly into how text is represented in computing. A printing letter or a dollar symbol are examples of characters, which make up a string within the computer’s memory. A line of text is a string ending with a new line character.

When you press return on a keyboard, that new line character gets appended to the string. So an entire Shakespeare text may be expressed as a single string, or many lines. This text, for example, probably uses a new line character to indicate the end of a paragraph; this web page will naturally use word wrapping to split the paragraph into many lines on the screen.
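To make that concrete, here is a small sketch in Python (my choice for illustration; any language would do) that looks at the same text first as one string and then as a list of lines. The sample text is only an example.

```python
# One string holding two lines; "\n" is the new line character.
text = "To be, or not to be,\nthat is the question:\n"

newline_count = text.count("\n")  # how many new line characters are stored
lines = text.splitlines()         # the same text, viewed as separate lines

print(newline_count)
print(lines)
```

Whether a tool treats the text as one string or as many lines matters later, when we start anchoring patterns to the start or end of a line.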

There are plenty of regex symbols, but I’ll introduce a few at a time so that you will find it easier to work out what you can do. First of all, every printable character stands for itself unless it is a special regex symbol.

Let me introduce you to two important regex symbols:

. (full stop) – any character

* (asterisk) – zero or many times

The full stop can stand in for any single character, except a new line.

The asterisk, an example of a quantifier, specifies that the character immediately before it may be repeated zero or many times.

You might be slightly surprised that regex reuses ordinary punctuation rather than dedicated special symbols. This is down to the age of regex: it dates from a time when only keyboard characters were available.

The following are all valid regular expressions that could represent my first name:

David

Da.id

D...d

D.*d

.*

Maybe you spotted the trick that a full stop followed by an asterisk can stand in for any string of any length. Now regex is often known for being “eager” (the more usual term is “greedy”) in that a quantifier will match as much as it possibly can, which is why it is important to be very careful with that combination.
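As a quick check, the sketch below tries those five patterns against the name using Python's `re` module (Python is my assumption here; the article does not depend on any one language). `re.fullmatch` asks whether a pattern accounts for the whole string. The third pattern is written with three separate full stops. The sentence in the second half is made up purely to show how far `.*` will reach.

```python
import re

name = "David"
patterns = ["David", "Da.id", "D...d", "D.*d", ".*"]

# fullmatch succeeds only if the pattern accounts for the entire string
results = {p: re.fullmatch(p, name) is not None for p in patterns}
print(results)

# ".*" takes as much as it can: the match runs to the LAST "d", not the first
greedy = re.search(r"D.*d", "David and Donald").group()
print(greedy)
```

All five patterns match the name, and the last search swallows the whole sentence rather than stopping at the end of "David".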

These important symbols are anchors:

^ (hat) – start of string or line

$ (dollar) – end of string or line

These ensure you can find things relative to the start or end of a line. Understanding whether your text is made of lines, or is a single string, is part of the initial analysis when working out how to solve a problem.
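Here is a minimal sketch of the two anchors in Python (an assumed choice of language). In Python the "string or line" distinction is controlled by the `re.MULTILINE` flag: without it, `^` and `$` refer to the whole string; with it, they apply to every line. The three-line sample text is invented for the example.

```python
import re

text = "To be\nor not\nto be"

# Without MULTILINE, ^ matches only at the very start of the string
start_of_string = re.findall(r"^..", text)

# With MULTILINE, ^ also matches just after each new line character
start_of_lines = re.findall(r"^..", text, flags=re.MULTILINE)

# With MULTILINE, $ matches just before each new line and at the very end
end_of_lines = re.findall(r"..$", text, flags=re.MULTILINE)

print(start_of_string, start_of_lines, end_of_lines)
```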

Finally, we have alternation which is an OR operator. The pipe symbol gives us a bit of logic control:

(this|that)

You can read the above as “this or that”.
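A short Python sketch of alternation follows (Python and the sample sentences are assumptions for illustration). The second pattern shows that the brackets limit how far the "or" reaches, so the alternatives can sit in the middle of a longer pattern.

```python
import re

sentence = "I'll take this one, not that one or the other."

# Each hit is whichever alternative matched at that point
found = [m.group() for m in re.finditer(r"(this|that)", sentence)]
print(found)

# The brackets confine the choice to one letter inside a longer pattern
spellings = [m.group() for m in re.finditer(r"gr(a|e)y", "gray or grey, never groy")]
print(spellings)
```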

Now, I’ll use regexr.com to try out some examples; a number of other sites do the same thing. Regex is also embedded in many applications that need search capabilities, and in utilities such as sed and AWK.

Before we start searches in the wild, here are a few notes:

  • Just like we use quotation marks to enclose a string, we use forward-leaning slashes to enclose a regex:

/my regex/

  • There is a set of options called flags, written after the closing slash. For example, “g” denotes “global” mode, which keeps searching after the first hit:

/my regex will keep searching/g

  • Because it is important to notice spaces and new line characters, most regex tools can make these normally invisible marks visible.
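The slash-and-flag notation belongs to tools such as regexr.com and to some programming languages. Other environments express the same ideas differently. As a rough sketch, assuming Python's `re` module, the pattern is an ordinary string and "global" mode corresponds to choosing a function that returns every hit instead of only the first. The sample text is made up, and the mapping is approximate rather than exact.

```python
import re

text = "my regex, your regex, every regex"

# Roughly /regex/ : stop at the first hit
first_hit = re.search(r"regex", text)
print(first_hit.start())

# Roughly /regex/g : keep searching after the first hit
all_hits = re.findall(r"regex", text)
print(len(all_hits))

# repr() is a simple way to make spaces and new line characters visible
print(repr("line one \nline two"))
```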

Let’s examine the result of several simple regex patterns applied to a well-known text that is made of lines:

OK, this is no more than a simple search for the word “to” across the whole text. Not very useful.

This is a reminder that regex searches are case-sensitive unless you turn the case-insensitive flag on, explicitly look for uppercase characters, or do this:
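The original screenshots are not reproduced here, so the following Python sketch is only my reconstruction of the idea. It assumes the well-known text is the opening of Hamlet's soliloquy and compares a plain search for "to", the same search with the case-insensitive flag, and an alternation that accepts either a capital or a lowercase "t".

```python
import re

# Assumed sample text: four lines separated by new line characters
text = (
    "To be, or not to be, that is the question:\n"
    "Whether 'tis nobler in the mind to suffer\n"
    "The slings and arrows of outrageous fortune,\n"
    "Or to take arms against a sea of troubles"
)

# Case-sensitive by default: the capitalized "To" is missed
plain = re.findall(r"to", text)

# The case-insensitive flag (the "i" flag in slash notation)
ignore_case = re.findall(r"to", text, flags=re.IGNORECASE)

# Alternation: a capital T or a lowercase t, followed by o
either = [m.group() for m in re.finditer(r"(T|t)o", text)]

print(len(plain), ignore_case, either)
```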

Now let’s use our anchor to find the first word of each line. That should be easy. First we add the multiline flag “m”. And we search for the start of the line, followed by any number of characters and a space…

OH NO! Despite my intention, the expression has (as I warned earlier) been too eager and got more than I hoped it would. So we need to learn a bit more — in fact like skiing we need to know how to stop!
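To see the over-eager match for yourself, here is a Python sketch. Both the pattern (`^.* ` with the multiline flag) and the sample text are my assumptions, reconstructed from the description above rather than copied from the original example.

```python
import re

text = (
    "To be, or not to be, that is the question:\n"
    "Whether 'tis nobler in the mind to suffer\n"
    "The slings and arrows of outrageous fortune,\n"
    "Or to take arms against a sea of troubles"
)

# Start of line, any number of characters, then a space.
# ".*" grabs as much as it can, so the match ends at the LAST space on each line.
too_much = re.findall(r"^.* ", text, flags=re.MULTILINE)

print(repr(too_much[0]))
```

Instead of the first word, each hit covers everything up to the final space on its line.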

Fortunately, we can also restrict searches to a “class” or range of characters. We use square brackets and stick the allowed characters inside. We can also use a dash to express a range.

So we can create these filters:

[AEIOU] only capitalized vowels

[0123456789] only a digit

[0-9] only a digit (the same class, written as a range)

[A-Z] only a capital letter

[A-Za-z] only a letter
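Each class matches exactly one character from the allowed set. The Python sketch below (language and sample strings are assumptions for illustration) applies each filter in turn.

```python
import re

upper_vowels = re.findall(r"[AEIOU]", "An Owl In Eden")
digits_listed = re.findall(r"[0123456789]", "Room 101, floor 3")
digits_range = re.findall(r"[0-9]", "Room 101, floor 3")  # same class as a range
capitals = re.findall(r"[A-Z]", "Hello World")
letters = re.findall(r"[A-Za-z]", "R2-D2")

print(upper_vowels, digits_listed, digits_range, capitals, letters)
```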

So we can go back and solve our “only first word” problem with the following:

This says “starting from the beginning of the line, look for a capital letter, then look for any number of lowercase letters”. So this time the space character is implicitly excluded because it is not in the class: it gets stopped by the doorkeeper. This is good enough for the Bard.
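In Python, that description might be sketched as follows. The pattern `^[A-Z][a-z]*` is my reading of the description above, and the sample text is again an assumed excerpt from Hamlet.

```python
import re

text = (
    "To be, or not to be, that is the question:\n"
    "Whether 'tis nobler in the mind to suffer\n"
    "The slings and arrows of outrageous fortune,\n"
    "Or to take arms against a sea of troubles"
)

# Start of line, one capital letter, then any number of lowercase letters.
# A space is not in [a-z], so the match stops at the end of the first word.
first_words = re.findall(r"^[A-Z][a-z]*", text, flags=re.MULTILINE)

print(first_words)
```

A first word containing an apostrophe or a hyphen would be cut short by this pattern, which is the sense in which it is only "good enough".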

So with our handful of tricks, are we ready to do something useful, like search for dates in a text as I promised? Well, let’s just look at a narrow band of US-style dates:

23rd of June

June 13, 2023

06/23/2023

Now we can probably do the last one… but wait. Isn’t a forward-slash part of the special symbols we need? It is — so when we want to use a special regex symbol as part of our search characters we “escape” them using a backslash:

 

The escaping of forward slashes with backslashes is — ugly. Welcome to Leaning toothpick syndrome. Can you see ways of improving the filtering in the above example? For example, the month can only start with a “0” or “1”.
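Here is one way the numeric date search might look in Python (a sketch under assumptions: the patterns and sample sentence are mine, not the original example). Python does not wrap a regex in slashes, so the forward slash needs no escaping there, but the escaped form `\/` is still accepted and is kept here to mirror the slash notation.

```python
import re

text = "Due 06/23/2023, not 23/06/2023."

# In slash notation: /[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9]/g
loose = re.findall(r"[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9]", text)

# Tighter: the month starts with 0 or 1, the day with 0 to 3
tighter = re.findall(r"[01][0-9]\/[0-3][0-9]\/[0-9][0-9][0-9][0-9]", text)

print(loose)
print(tighter)
```

The tighter version still lets through impossible dates such as 19/39/2023, so there is room to refine it further.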

For completeness, I will make this a bit neater by using a shorthand that defines “a digit” more simply, and a quantifier that states the exact number of repeats:

Now, armed with this knowledge, you should be able to:
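The usual shorthand for a digit is `\d`, and a number in curly braces after an item says exactly how many times it repeats. A Python sketch of the neater date pattern, with a made-up sample sentence, could look like this. One caveat: in Python 3, `\d` on ordinary strings also matches digits from other scripts unless the `re.ASCII` flag is set, whereas `[0-9]` never does.

```python
import re

text = "Invoices: 06/23/2023 and 12/01/2024; ref 1234/56"

# Two digits, a slash, two digits, a slash, four digits
dates = re.findall(r"\d{2}/\d{2}/\d{4}", text)

print(dates)
```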

  • Try to search for the other style of dates.
  • Learn about some of the other classes, quantifiers and logical things to make richer regex.
  • Learn how to group search results and to replace things.
  • Be wary of some tricks like lookahead, as these are genuinely tricky and are not necessarily implemented in every regex tool in the wild.
  • Look inside your own editor, and see if it uses the regex functionality many of them have built-in.
  • Learn to read long and ugly-looking regexes and work out how they do what they do.

Feature image by Nick Fewings on Unsplash.