Comments on Eggex and Regular Languages

2020-07-21 (Last updated 2020-07-23)

The last post enumerated #blog-topics that are central to the future of the project. This one lists topics that may interest users of Eggex, Oil's regex-like pattern matching language.

You can try Eggex right now with the latest release! (and ask me questions)

It may also appeal to /r/ProgrammingLanguages/ readers, who liked my recent wiki page on Why Lexing And Parsing Should Be Separate (comments).

Roughly speaking, the main point is:

"Regexes" / regular expressions are Perl-like. In contrast, "regular languages" are a mathematical idea, with important engineering properties.

You have access to good implementations of the latter in shell scripts!

I outline the main points below, but the full explanations take more space. They'll have to wait I get through more of the posts outlined yesterday.

Table of Contents

Think of "Regexes" and "Regular Languages" Separately

Backtracking is Bad in Theory

Linear Time And Constant Space

Composing With Alternation Scales Well (Sample code)

Backtracking is Bad in Practice

Backtracking is Still Misunderstood

Regular Language Engines in Shell Don't Backtrack (Sample code)

Eggex: Syntax vs. Semantics

Conclusion

Appendix: Older Blog Posts

Think of "Regexes" and "Regular Languages" Separately

This is exactly the point in Russ Cox's articles on implementing regular expressions (2007-2012), but I think the additional terminology of "regular languages" is important.

I gave a 5 minute presentation at Recurse Center a couple years ago about this:

Regex Tips, and Theory vs. Practice (would have been 1-2 hours of material)

In addition to the three tips at the top, the table that compares regexes and regular languages is pretty good (scroll down). I should publish it on its own page.

Note: Eggex is for both regexes and regular languages, but concentrates on the latter.

Backtracking is Bad in Theory

Perl-style backtracking regexes fail to give you two important properties:

Linear time and constant space
Composition by | (regular language alternation)

TODO: Give an example of failure to compose by |. It's order dependent, so a shorter match can be chosen over a longer match. (Caveat: PEGs use ordered choice, and compose in a specific mathematical sense. However, it's arguable whether they're intuitive to debug.)

Linear Time And Constant Space

I use regular languages as a linear-time, constant space programming language. Key point: This is almost never too slow for any real system!

That is, I think of such algorithms as free in terms of performance. For example, I leaned hard on regular languages in Oil's parser, and it ended up faster than bash's hand-written parser in C, and 10x faster than zsh's hand-written parser in C.

If anything, parsing is the bottleneck, not lexing with code generated from regular languages.

When you have a performance problem with regexes, it's almost always because of backtracking, whether that's inside the engine or "outside" of it, by attempting multile matches in sequence at the same character position.

Composing With Alternation Scales Well (Sample code)

Oil's lexer matches many patterns "in parallel" with re2c. All patterns are compiled to a single DFA, which is expressed as a series of switch and goto statements.

It can do this because it accepts regular languages, not regexes.

I wrote sample code to demonstrate that re2c scales in both compile time and runtime to around 48,000 constant strings:

blog-code/fgrep-problem-benchmarks

I found some issues with other engines, like egrep crashing after on a few hundred patterns.

Key point: In addition to scaling linearly with respect to the length of the input, regular languages also scale with respect to the number of simultaneous patterns. My benchmark showed approximately constant scaling, e.g. O(1), but there are caveats to dig into further.

HyperScan: "uses hybrid automata techniques to allow simultaneous matching of large numbers of regular expressions across streams of data"
What the syntect syntax highlighting engine doesn't do (largely due to the constraints of the badly-designed SublimeText interface.)
Extra RE2 and rust/regex features for simultaneous matching.

Backtracking is Bad in Practice

In this lobste.rs thread, I list positive experiences with regular languages (lexing and big data), as well as negative experiences with regexes (e.g. the ReDoS problem that everyone eventually runs into.)

I regret that this thread got heated, but the points are valid, and hopefully I can clean them up and create a useful and informative blog post.

Backtracking is Still Misunderstood

Discussions about regular languages led me to this November 2018 thread:

https://news.ycombinator.com/item?id=18553393

User ridiculous_fish says that beating native JavaScript regexes with a custom engine in WebAssembly is "totally shocking".

As the reply indicated, this isn't shocking at all: JavaScript uses Perl-style regexes, while the WebAssembly implementation uses regular languages. Backtracking is the difference.

I recognize the commenter (https://ridiculousfish.com/) as an experienced engineer who works on the fish shell (which is more mature and widely used than Oil), and worked at Apple.

So my point is not to say that he should know this, and shouldn't have been shocked. My point is that experienced engineers either don't know this, or temporarily forget about it.

If we use the term regular languages, they might remember. It matters in practice.

Regular Language Engines in Shell Don't Backtrack (Sample code)

I've been calling these the "shell-ish" regular language engines:

libc, which underlies bash's [[ $str =~ $pat ]] construct
grep
sed
awk

They all support POSIX regular expression syntax (whether basic or extended), as opposed to Perl-style extensions.

I recently verified that the common GNU implementations (e.g. in Debian and most Linux distros) do not backtrack:

blog-code/regular-languages. See the README for details.

GNU implementations appear to be automata-based.

Caveats: I haven't tested musl libc, which is used on Alpine Linux, or BSD. Apparently the glob implementation in BSD is backtracking.

(Note: Testing multiple engines with shell is an example of #shell-the-good-parts. I should write a separate post about this, or even do a video.)

Eggex: Syntax vs. Semantics

I had a conversation with James Davis about Eggex and the Lingua Franca paper, which I found very interesting:

Why Aren't Regexes a Lingua Franca?

I noted a possible limitation of Eggex: that it's mostly syntactic. However, after reviewing the paper and writing some tests, this doesn't seem to be a problem. Results:

The shell-ish engines are not only consistent in terms of performance, but consistent semantically (e.g. regarding submatches).
I verified that the Perl and Python engines backtrack, but semantically they're very closely to the shell-ish engines. I could probably dig deeper and find more differences, e.g. with ordered choice. Either way, using these engines are not crucial right now, whereas using the shell-ish engines is.
The JavaScript engine exhibited the most semantic divergences, e.g. with (a*)+ and []], as shown in Table 4 of the paper.

Conclusion

This post argued that we should start using the term regular languages, in contrast to "regexes", for reasons rooted in both theory and practice.

I don't expect that casual readers will have followed it all, since I only sketched the arguments and gave links. The full explanations will have to wait I get through some of the summer blog roadmap.

In the meantime, please leave comments with any questions, and again I recommend reading through the Regex Tips page linked above.

The next post will have some speculative ideas and questions that I cut from this post. After that, there may be another post with #blog-topics, and then I want to write the plan for Oil 0.8 and 0.9.

Appendix: Older Blog Posts

OSH 0.1 and string strip operations (September 2017)
Parser Performance (October 2017)
Lexing vs. Parsing. (December 2017). I refined this argument on the wiki.
Regular Languages vs. Regexes (July 2018)
Eggex Example: Recognizing Python Integer Literals (December 2019)