Magic-RegExp: A JavaScript Package for Regular Expressions

David Eastman takes a break from the Matrix to decode magic-regexp, a JavaScript package that uses an English-friendly representation.
Mar 4th, 2023 9:00am
Much as I like regular expressions (or regex), I suspect that part of their longevity is due to their development separately from mainstream languages. Regex is a method of pattern matching, and the regex language uses standard keyboard symbols as the special “meta” characters. This gives it a strange look at first glance.

Building a regex can be a bit of a chore; I’ve seen it suggested that a lot of developers use Copilot to help them out. Now, I don’t use JavaScript as my everyday language, but as I’ve talked about regex quite a bit, it seemed natural to look at magic-regexp, a JavaScript package that uses a more English-friendly representation.

Now, we can pretty much guess the reasons for using a package to represent a regex:

  • If the pattern is replaced with methods and code, it will be type-safe. The concept of a regexp is already native to JavaScript.
  • Precompiled methods should be more efficient than live interpretation.
  • Theoretically, it will be easier to read.
  • Regex patterns include meta-rules at the end to guide behavior, which is unwieldy.

So it could provide a valuable new arrow for your quiver.
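To make the last point in the list concrete, here is a minimal sketch in plain JavaScript (no package involved) of how a trailing flag changes what a pattern does. The sample text and variable names are made up for illustration.

```javascript
// Hypothetical sample text: the capital letter only appears on the second line.
const text = "first line\nSecond line";

// Without the trailing "m" flag, ^ only matches at the very start of the string.
const withoutFlag = /^[A-Z]/;

// With the "m" (multiline) flag, ^ matches at the start of every line.
const withFlag = /^[A-Z]/m;

const missed = text.match(withoutFlag); // null: "f" is not a capital letter
const found = text.match(withFlag);     // matches the "S" on line two

// The same pattern built with the RegExp constructor; the flag is a separate argument.
const viaConstructor = new RegExp("^[A-Z]", "m");

console.log(missed, found[0], viaConstructor.flags);
```

The rule that decides the behavior sits after the closing slash, away from the pattern it modifies, which is the unwieldiness the list refers to.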

There seem to be a billion ways of starting a JavaScript project, but I’ll just keep it simple. After doing the various upgrade dances on the command line to get my local environment vaguely up to date:


Let’s hop back to a previous article on regex and look at an expression that simply captured the first word of a sentence and its operation on some sample Shakespearean text:

Now, just to make sure that the test above captures single-letter words and that we don’t capture sentences that start mid-line, I’m going to purposely ruin Shakespeare (yet again) and add one additional line from Ms. Gaynor and thus extend the test text:

Just like that guy in the Matrix, I’m happy reading regex patterns directly, but what if I did want to transcribe the one above?

So how would I describe its operation over the phone?

“Starting at the beginning, look for a capital letter, then any number of lower case letters. Apply that over the whole multiline text.”
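As a rough illustration, the phone description can be written as a native JavaScript pattern. This is a sketch reconstructed from the description alone, not necessarily the exact expression from the earlier article, and the sample text is a hypothetical stand-in.

```javascript
// Hypothetical stand-in text: one sentence starts mid-line, one line starts
// with a single-letter word.
const text = [
  "Now is the winter of our discontent.",
  "Made glorious summer by this sun of York. And all the clouds",
  "I will survive",
].join("\n");

// ^       start of a line (because of the m flag)
// [A-Z]   one capital letter
// [a-z]*  zero or more lowercase letters
// g       find every match, m = treat each line separately
const firstWord = /^[A-Z][a-z]*/gm;

const matches = text.match(firstWord);
console.log(matches); // [ 'Now', 'Made', 'I' ]
```

The mid-line "And" is skipped because it does not sit at the start of a line, and "I" is kept because the lowercase run is allowed to be empty.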

Looking at the Magic-regexp usage:

“starting at the beginning” = .at.lineStart()

“Look for a capital letter” = letter.uppercase

“any number of lower case letters” = oneOrMore(letter.lowercase).optionally()

“until a space” = .and(whitespace)

So actually we need the “*” for zero or more, but we can use oneOrMore and then add optionally.
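The equivalence can be checked in plain JavaScript. The sketch below assumes that "one or more, made optional" corresponds to the native form `(?:[a-z]+)?`, and compares it with the `*` form on a few made-up samples.

```javascript
// "Zero or more" written directly with *.
const star = /^[A-Z][a-z]*/;

// "One or more", wrapped in a group and made optional with ?.
const optionalPlus = /^[A-Z](?:[a-z]+)?/;

// Hypothetical samples, including a single-letter word and a non-match.
const samples = ["Now is the winter", "I will survive", "A", "lowercase start", "ABc"];

// For each sample, record what each pattern matched (or null for no match).
const results = samples.map((s) => {
  const a = s.match(star);
  const b = s.match(optionalPlus);
  return [a ? a[0] : null, b ? b[0] : null];
});

console.log(results);
```

Both patterns give the same answer for every sample here, so the two spellings are interchangeable for this purpose; the second is just longer.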

Here is the testmagicregex.js file with the code:

And this returns a regular expression on the command line:

And it works:

But what is the extra cruft? Well, the question mark and colon pair “?:” is used to mark the part within the parentheses as a non-capturing group. This just means that the parentheses are used only to group things, as you normally would when you want to apply an operator to all the stuff inside. It just so happens that the default use of parentheses in regex is to mark a capturing group.
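A small sketch in plain JavaScript shows the practical difference. The two patterns below are illustrative, not the package's actual output: they match the same text, but only the first one records its groups in the result.

```javascript
// Plain parentheses: each group is captured and returned alongside the match.
const capturing = /^([A-Z])([a-z]+)?/;

// (?: ... ) groups the same parts but captures nothing.
const nonCapturing = /^(?:[A-Z])(?:[a-z]+)?/;

const withGroups = "Now is".match(capturing);
const withoutGroups = "Now is".match(nonCapturing);

console.log(withGroups[0], withGroups[1], withGroups[2]); // Now N ow
console.log(withoutGroups[0], withoutGroups.length);      // Now 1
```

So the "?:" does not change what is matched; it only keeps the result free of sub-matches that nobody asked for.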

It strikes me that the Magic-regexp package is only partially successful in its aims. You still have to think in regex. (I’m reminded of the 1982 Clint Eastwood film “Firefox,” where our hero must steal a plane from the Soviets, but to operate it he must think in Russian.)

Our Shakespeare example is fine, but is highly unlikely to be useful within an actual JavaScript project. A more practical example would be to check for a valid email address.

Again, how would we define a valid email scheme over the phone?

“It starts with a name that can have dots, dashes or underscores within it, as long as it starts and ends in word characters. There must then be an at character (@) followed by two bunches of letters separated by a dot. Now stop ringing me!”

Now we all know there are specific packages to do this and we wouldn’t want to cook up our own for production. But a quick one will do no harm.
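For comparison, here is a quick native-regex sketch of that phone description. It is deliberately naive and not a production validator. It reads the separator between the two bunches of letters as a dot, as in ordinary addresses, and the helper name `looksLikeEmail` and the sample addresses are hypothetical.

```javascript
// \w+              the name starts with word characters
// (?:[._-]\w+)*    any number of dot/underscore/dash separators, each followed
//                  by more word characters, so the name also ends in one
// @                the at character
// \w+\.\w+         two bunches of word characters separated by a dot
const simpleEmail = /^\w+(?:[._-]\w+)*@\w+\.\w+$/;

// Hypothetical helper for illustration only.
function looksLikeEmail(candidate) {
  return simpleEmail.test(candidate);
}

console.log(looksLikeEmail("jane.doe@example.com"));    // true
console.log(looksLikeEmail("jane-doe_x@example.org"));  // true
console.log(looksLikeEmail(".jane@example.com"));       // false: starts with a dot
console.log(looksLikeEmail("jane.@example.com"));       // false: name ends with a dot
console.log(looksLikeEmail("jane@example"));            // false: no second bunch
```

Note that `\w` is wider than "letters" (it also allows digits and underscores), which is one of many reasons a real project should lean on a dedicated package instead.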

So from the conversation, we would need at least:

  • oneOrMore(wordChar)
  • and(anyOf(".", "_", "-"))
  • and(oneOrMore(wordChar))
  • exactly("@")
  • and(oneOrMore(wordChar))
  • and(exactly("."))
  • and(oneOrMore(wordChar))

I’ve added some validity tests using assert this time, not all of which pass with just the code above, but with a bit of work we get:

So the final concern for a professional developer is: which one is easier to maintain? A straight regex pattern, or the representative methods above? As I said, this feels to me a little too close to the “now you have two problems” meme, which is already thrown at regex; you now have to deal with the whims of another package as well as the whims of regex.

However, with a little more development, this may well be a solid way to avoid staring at the Matrix.
