Creating more readable regular expressions with Simple Regex Language

Clear-Sighted

Article from Issue 199/2017
Regular expressions are a powerful tool, but they can also be very hard to digest. The Simple Regex Language lets you write regular expressions in natural language.

Regular expressions are a fundamental feature of Linux – and many other modern operating systems. A regular expression is a search term with special placeholders representing several possible characters at the same time. The concept of a regular expression is an extension of the idea behind the "wildcard" character used in many GUI search tools, but the power and subtlety of regular expressions far exceeds what you can do with a simple wildcard.

For example, suppose you want to search the system.log file for errors, but you don't know whether the term Error will appear with initial cap or all lowercase (Error or error). You could use a regular expression as part of the Grep command:

grep -e '[eE]rror' system.log

The expression [eE] means: There is either a lowercase e or uppercase E.
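The same character class can be tried outside of grep. The following is a minimal sketch using Python's standard re module; the log lines are invented for illustration and do not come from a real system.log.

```python
import re

# Hypothetical log lines, made up for this example.
lines = [
    "Error: disk full",
    "error: retrying",
    "warning: low memory",
]

# [eE] matches exactly one character: either "e" or "E".
pattern = re.compile(r"[eE]rror")

# Keep every line in which the pattern occurs somewhere.
matches = [line for line in lines if pattern.search(line)]
print(matches)
```

Like grep, pattern.search() reports a hit as soon as the pattern occurs anywhere in the line.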

A quick check for capitalization is easy to read and interpret, but some regular expressions are much more exotic. Who is able to say right away what text the following expression describes:

/^(?:\w|[\.\-\+])+(?:@)(?:[a-z]|[0-9]|[\.\-])+(?:\.)[a-z]{2,}$/i

Once you derive an expression like this, it can be a powerful tool for a script or a string search tool like Grep, but for the human who created the expression, and for the other humans who come along later and want to read it, decoding a regular expression can be a time-consuming endeavor. What is more, a small error that creeps into the expression can be difficult to spot, even though it might have a significant effect on the search results. An error in a complex regular expression could even form the basis for malicious code and an Internet attack.

The fledgling Simple Regex Language (SRL, [1]) from the developer Karim Geiger aims to address the problem of incomprehensibility in regular expressions. Geiger started SRL as a bit of fun in Fall 2016, and since then, other developers have helped to implement SRL in various coding languages.

The SRL allows you to write regular expressions in natural English. In the previous example of the logfile, the two words Error and error start with either E or e. In SRL, you could say:

one of "eE"

and follow it with the character string rror:

one of "eE" literally "rror"

This line forms a complete SRL expression. SRL keywords are not case-sensitive, so LITERALLY is the same as literally. Literal strings, however, are case-sensitive: literally "Error" means something completely different from literally "error".

In SRL, the developer can frame strings – in the example rror – with single or double quotes. You have the option of separating the individual components of the complete expression with a comma or a line break. Adding a break does not change the logic but instead simply improves the legibility:

one of "eE",
literally "rror"

The example expression matches every text passage in which the character string error or Error appears. Hence, the word Terrorism would also produce a match.

Empty Words

To match error only as a separate word, require whitespace on both sides:

whitespace one of "eE" literally "rror" whitespace
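The effect of the two whitespace keywords can be checked with conventional regular expressions. This sketch assumes that whitespace corresponds to \s; the sample strings are invented for illustration.

```python
import re

# Assumed regex counterparts of the two SRL expressions.
anywhere = re.compile(r"[eE]rror")        # one of "eE" literally "rror"
standalone = re.compile(r"\s[eE]rror\s")  # the same, framed by whitespace

samples = ["Terrorism", "an error occurred", "error at start"]

anywhere_hits = [bool(anywhere.search(s)) for s in samples]
standalone_hits = [bool(standalone.search(s)) for s in samples]

print(anywhere_hits)
print(standalone_hits)
```

The framed version no longer matches Terrorism, but it also misses error at the very start of a string, because no whitespace character precedes the word there.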

In logfiles, the word error usually appears at the beginning of a line. If you are only interested in these lines, write:

begin with one of "eE" literally "rror"

The tested text must now start with Error or error. However, the expression only works if the program treats each line of the file as a separate text to be tested (as grep does).
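The line-by-line caveat is easy to reproduce. In this sketch, ^[eE]rror is assumed to be the regex counterpart of the SRL expression, and the log text is made up for the example.

```python
import re

# Hypothetical multi-line log text.
log_text = "service ok\nError: disk full\nno error here\nerror: retrying\n"

pattern = re.compile(r"^[eE]rror")

# Applied to the whole text, ^ anchors only at the very first character.
whole_text_hits = pattern.findall(log_text)

# Applied line by line (as grep does), every line gets its own start anchor.
per_line_hits = [line for line in log_text.splitlines() if pattern.search(line)]

# Alternatively, re.MULTILINE lets ^ match at the start of each line.
multiline_hits = re.findall(r"^[eE]rror", log_text, flags=re.MULTILINE)

print(whole_text_hits, per_line_hits, multiline_hits)
```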

Some logfiles mark errors with the abbreviation EE, which you could include in the expression with:

begin with any of (literally "EE", (one of "eE" literally "rror"))

As with traditional regular expressions, parentheses group subexpressions. The term any of serves as a logical Or. In the example, the expression looks for lines beginning either with the character string EE or with Error or error. The comma is cosmetic.
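In a conventional regular expression, the logical Or is written with the | operator. The pattern below is a hand-written equivalent, not necessarily the exact output SRL generates, and the sample lines are invented.

```python
import re

# Hand-written counterpart of:
#   begin with any of (literally "EE", (one of "eE" literally "rror"))
pattern = re.compile(r"^(?:EE|[eE]rror)")

# Hypothetical log lines.
lines = [
    "EE failed to load module",
    "Error: disk full",
    "error: retrying",
    "ee lowercase marker",
    "II informational",
]

hits = [line for line in lines if pattern.search(line)]
print(hits)
```

The lowercase ee line is skipped because literal strings remain case-sensitive.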

When the Post Rings

Sometimes a character needs to be repeated several times. For example, the abbreviation EE consists of exactly two Es in succession. In SRL, you could say: literally "E" exactly 2 times. Instead of exactly 2 times, you could also write twice.
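In conventional regex syntax, an exact repetition count is written in braces. A small sketch, assuming E{2} as the counterpart of literally "E" exactly 2 times:

```python
import re

# Assumed counterpart of: literally "E" exactly 2 times
exactly_two = re.compile(r"E{2}")

candidates = ["E", "EE", "EEE"]

# fullmatch() succeeds only if the entire string fits the pattern.
results = [bool(exactly_two.fullmatch(c)) for c in candidates]
print(results)
```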

In the following expression:

begin with any of (any character, one of ".-+") once or more

the expression any character stands for any letter between A and Z, a digit between 0 and 9, or an underscore _. Uppercase and lowercase are of no importance. The entry once or more lets the permitted characters repeat as often as desired but requires at least one of them.
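The character set described here corresponds to the \w class in conventional regular expressions. The sketch below uses Python's re.ASCII flag because Python 3 otherwise lets \w match non-ASCII letters as well; the test characters and the sample string are chosen only for illustration.

```python
import re

# \w restricted to ASCII: letters, digits, and the underscore.
word_char = re.compile(r"\w", re.ASCII)

chars = ["a", "Z", "7", "_", "-", "@"]
results = {c: bool(word_char.fullmatch(c)) for c in chars}

# Assumed counterpart of: any of (any character, one of ".-+") once or more
local_part = re.compile(r"(?:\w|[.\-+])+", re.ASCII)
local_ok = bool(local_part.fullmatch("first.last+tag"))
empty_ok = bool(local_part.fullmatch(""))

print(results, local_ok, empty_ok)
```

The empty string fails because once or more (written as + in the regex) demands at least one character.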

If the string you are looking for is an email address, you'll also need to ensure the presence of the @ character: literally "@". The domain name behind it may, in turn, be made up of several letters or numbers and the special characters . and -:

any of (letter, digit, one of ".-") once or more

The any character expression does not work for the domain name because domain names prohibit the underscore _. The letter and digit expressions specify letters and numerals without additional characters. The top-level domain, which starts with a period, forms the end:

literally "."

At least two more letters follow:

letter at least 2 times must end

Explicitly adding case insensitive tells SRL that uppercase and lowercase are irrelevant.

Listing 1 shows the whole expression. The expression deliberately keeps the email address test simple; for example, the standard allows other special characters in front of the @ and requires the domain name in front of the top-level domain to end with a letter or a number, which the expression does not enforce.
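The fragments described above line up with the cryptic regular expression from the beginning of the article. The following sketch applies that regex with Python's re module; the helper name looks_like_email and all test addresses except edit@linux-magazine.com are hypothetical.

```python
import re

# The regex from the start of the article, split by SRL fragment.
EMAIL = re.compile(
    r"^(?:\w|[.\-+])+"          # begin with any of (any character, one of ".-+") once or more
    r"@"                        # literally "@"
    r"(?:[a-z]|[0-9]|[.\-])+"   # any of (letter, digit, one of ".-") once or more
    r"\."                       # literally "."
    r"[a-z]{2,}$",              # letter at least 2 times, must end
    re.IGNORECASE | re.ASCII,   # case insensitive; ASCII keeps \w to A-Z, 0-9, _
)

def looks_like_email(text: str) -> bool:
    """Return True if text passes the simplified email check."""
    return EMAIL.search(text) is not None

checks = {
    "edit@linux-magazine.com": looks_like_email("edit@linux-magazine.com"),
    "user_name@example.org": looks_like_email("user_name@example.org"),
    "user@exam_ple.org": looks_like_email("user@exam_ple.org"),
    "user@example.c": looks_like_email("user@example.c"),
    "user@example-.com": looks_like_email("user@example-.com"),
}
print(checks)
```

The last address is accepted even though its domain name ends in a hyphen, which illustrates how simple the check deliberately remains.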

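Listing 1

Checking an Email Address

begin with any of (any character, one of ".-+") once or more,
literally "@",
any of (letter, digit, one of ".-") once or more,
literally ".",
letter at least 2 times,
must end, case insensitive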

Testing, Testing, 1, 2, 3

You can test your SRL expression directly at the SRL project website under the menu item Build [2]. Just enter the SRL expression under Your SRL Query, type a test text under Test Input, and have it checked via Run Query (Figure 1). At the bottom of the page, developers immediately find out whether the test text matches the SRL expression. In addition, the page supplies the corresponding regular expression for comparison.

Figure 1: The SRL website uses the specified SRL expression to check whether edit@linux-magazine.com is a valid email address. The answer?

Figure 2 shows the expression for Listing 1 as an example – which, by the way, is identical to the cryptic regular expression at the beginning of this article. If the tester places a check mark in front of Save Query (to the right of Test Input), the server keeps track of all entries. The tester can use the URL at the bottom of the page to access the page with the SRL expression at any time. It remains unclear where the stored data will reside, so testers should not use sensitive data with Test Input.

Figure 2: … appears at the bottom of the page, along with the associated regular expression.
