Regular Languages, Part 2: Ideas and Questions
2020-07-22
(Last updated 2020-12-23)
The last post was about Eggex and regular
languages, in theory and in practice.
It had a long appendix, which I've moved to this post. It's unpolished, but
may spark some ideas. Feel free to leave comments!
A Shortcut for Implementing an Engine
In the long term, Oil may grow its own regular language / pattern matching
engine. I've tried to avoid it, but it seems inevitable because most shells do
it to a degree.
For example, rather than using libc's glob() and fnmatch(), which are restricted forms of regular languages, they implement their own. One reason is to implement the globstar option (**) and extended globs like @(foo|bar). (Unfortunately, I found that these implementations backtrack.)
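To make the "restricted" point concrete: Python's standard fnmatch module really does translate a glob into an equivalent regular expression, and an extended glob like @(foo|bar) maps directly onto regex alternation. (The @() rewrite below is my own two-line illustration; fnmatch itself doesn't support extended globs.)

```python
import fnmatch
import re

# The stdlib can translate a glob to a regex, showing that globs are a
# restricted form of regular expressions.
print(fnmatch.translate('*.py'))  # something like '(?s:.*\\.py)\\Z'

# Extended globs add alternation, which regexes already have.
# This rewrite of @(...) to (...) is just for illustration.
ext_glob = '@(foo|bar).c'
regex = re.sub(r'@\(([^)]*)\)', r'(\1)', ext_glob.replace('.', r'\.'))
print(bool(re.fullmatch(regex, 'foo.c')))  # True
print(bool(re.fullmatch(regex, 'baz.c')))  # False
```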
Another reason to have a pattern matching engine: I've found re2c very useful in implementing Oil, and I'd like to "lift" its functionality to the user level.
But writing one from scratch is a huge amount of work! So here's a crazy idea: compile re2c to WebAssembly and embed a WebAssembly VM in Oil. Summary:
- re2c could grow a new WebAssembly backend, and
- the tool itself could be compiled to WebAssembly.
(These T or J diagrams may help in thinking about that.)
Another reason for WebAssembly: I want to let users avoid the tortured string manipulation syntax in shell. Letting people write arbitrary string functions in fast, portable code is one way to do that.
Problem: I don't like opaque binary blobs, in either the browser or the shell, even if they're sandboxed. I want the code to have provenance. To start, maybe this feature can be disabled by default.
This idea feels like a pipe dream, since there are so many other things to do
in Oil. But it does make sense because regular languages are a crucial
part of shell.
On the other hand, there's a concrete way to start it. Let me
know if you have ever compiled C code to WebAssembly. I
haven't! re2c is plain C++ with few dependencies, and we actually compile
a specific version of it in Oil's continuous
build.
Regular Expression Derivatives: Papers and Code
A different way to implement an engine would be through the derivatives technique. I don't have experience with it, so I'm just dropping a few links, plus a tiny sketch after the list. Let me know if you do.
- Derivatives of Regular Expressions (1964, PDF) (Brzozowski)
- Regular Expression Search Algorithm (1968, PDF) (Thompson)
- Curiously, this paper mentions derivatives but not NFAs or DFAs.
- Comment about Antimirov
derivatives from Darius, and sample code:
- 12/2020 Update: Partial derivatives of regular expressions and finite automaton constructions by Valentin Antimirov. The abstract says that Antimirov partial derivatives are to NFAs as Brzozowski derivatives are to DFAs. The Brzozowski paper that Thompson cites constructs a minimal DFA, while Thompson's paper appears to construct an NFA. So Darius' claim now makes more sense: Thompson was using Antimirov partial derivatives before they had a name. (Though I think I would have to implement both styles to really understand the difference.)
- Yacc is Not Dead (Cox)
- Big difference: parsing with derivatives is exponential, but regular expression matching with derivatives isn't. In fact it has some advantages over the common automata-based algorithms.
- Thompson learned regular expressions from Brzozowski himself while both
were at Berkeley, and he credits the method in the 1968 paper. The
now-standard framing of Thompson's paper as an NFA construction and the
addition of caching states to produce a DFA were both introduced later, by
Aho and Ullman in their various textbooks. Like many primary sources, the
original is different from the textbook presentations of it. (I myself am
also guilty of this particular misrepresentation.)
- 12/2020 Update: As far as I understand, the misrepresentation is that
Thompson's code used NFAs only, and never used DFAs. Related: Thompson's
construction on
Wikipedia.
- Regular Expression Derivatives Re-Examined (2009, PDF)
(Owens, Reppy, Turon). This paper revived the technique.
- Cox on this paper: Because the implementation manipulates regular
expressions symbolically instead of the usual sets of graph nodes, it has a
very natural expression in most functional programming languages. So it's
more efficient and easier to implement: win-win. I like efficient code
that's easy to implement!
- Smart constructors are smarter than you think, by Michael Greenberg of Smoosh.
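Here's that tiny sketch: a minimal Brzozowski-style matcher in Python, written from the 1964 definitions rather than taken from any implementation linked above. Regexes are plain tuples, and matching a string is just taking one derivative per character:

```python
# Regexes as tuples: ('empty',) matches nothing, ('eps',) matches '',
# plus ('chr', c), ('cat', r, s), ('alt', r, s), and ('star', r).

EMPTY, EPS = ('empty',), ('eps',)

def nullable(r):
    """Does r match the empty string?"""
    tag = r[0]
    if tag == 'eps': return True
    if tag in ('empty', 'chr'): return False
    if tag == 'cat': return nullable(r[1]) and nullable(r[2])
    if tag == 'alt': return nullable(r[1]) or nullable(r[2])
    if tag == 'star': return True

def deriv(r, c):
    """Brzozowski derivative: a regex matching what r matches after c."""
    tag = r[0]
    if tag in ('empty', 'eps'): return EMPTY
    if tag == 'chr': return EPS if r[1] == c else EMPTY
    if tag == 'cat':
        d = ('cat', deriv(r[1], c), r[2])
        return ('alt', d, deriv(r[2], c)) if nullable(r[1]) else d
    if tag == 'alt': return ('alt', deriv(r[1], c), deriv(r[2], c))
    if tag == 'star': return ('cat', deriv(r[1], c), r)

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)

# (a|b)* c
r = ('cat', ('star', ('alt', ('chr', 'a'), ('chr', 'b'))), ('chr', 'c'))
print(matches(r, 'abbac'))  # True
print(matches(r, 'abba'))   # False
```

Without simplification the intermediate regexes grow quickly; that's what the "smart constructors" in Greenberg's post address, e.g. collapsing ('alt', EMPTY, r) to r.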
Some implementations:
(I recovered these links from a draft of "Dev Log #11" last year, which I never published. Hat tip to Santi from Recurse Center for sending me down this rabbit hole of research :-) )
Research Questions
More Powerful Linear Time, Constant Space Languages?
As mentioned in the last post, I use regular languages as a linear-time, constant-space programming language. They do a lot of work for "free", performance-wise.
Is there a more powerful model that's still constrained in resource usage? For example, Langdale mentioned that many uses of backreferences aren't exponential.
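For example, here's a backreference in Python's re module. The pattern recognizes "doubled words", a language that's not regular (not even context-free), yet this particular use doesn't trigger exponential backtracking:

```python
import re

# \1 must repeat exactly what group 1 matched.
pat = re.compile(r'^(\w+) \1$')
print(bool(pat.match('hello hello')))  # True
print(bool(pat.match('hello world')))  # False
```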
"Powerful" can be defined either:
- theoretically (what subsets of infinite strings can be recognized?), or
- practically (can it recognize Python, JavaScript, C++?)
Related:
- I use the lexer modes technique in Oil, which involves a trivial state machine on top of regular languages. Vim and other text editors have similar mechanisms for syntax highlighting. (A minimal sketch appears after this list.)
- Calc Regular Languages (2017, PDF), from the LangSec line of research, motivated by the fact that network protocols often have length fields.
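Here's the minimal sketch of lexer modes mentioned above, with invented token names: each mode is its own list of regular expressions, and a trivial state machine picks which list applies. (In Oil the hand-written parsers drive the mode changes; this standalone loop is a simplification.)

```python
import re

# One list of (regex, token_name) rules per mode.  The hypothetical
# DQ mode handles double-quoted strings with $var interpolation.
MODES = {
    'OUTER': [
        (re.compile(r'"'),       'LEFT_DQUOTE'),
        (re.compile(r'[a-z]+'),  'WORD'),
        (re.compile(r'\s+'),     'SPACE'),
    ],
    'DQ': [
        (re.compile(r'"'),       'RIGHT_DQUOTE'),
        (re.compile(r'\$\w+'),   'VAR_SUB'),
        (re.compile(r'[^"$]+'),  'LITERAL'),
    ],
}

def lex(s):
    mode, pos = 'OUTER', 0
    while pos < len(s):
        for pat, tok in MODES[mode]:
            m = pat.match(s, pos)
            if m:
                yield tok, m.group()
                pos = m.end()
                # The trivial state machine: two transitions.
                if tok == 'LEFT_DQUOTE':  mode = 'DQ'
                if tok == 'RIGHT_DQUOTE': mode = 'OUTER'
                break
        else:
            raise SyntaxError('unexpected char at %d' % pos)

print(list(lex('echo "hi $name"')))
```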
Ad-Hoc Context in Common Lexing Tools? Recursion?
All practical parsing tools have developed ad-hoc mechanisms to deal with the
problems that lexer modes solve, e.g. recognizing
string interpolation syntax.
- How powerful are these mechanisms in both a theoretical and a practical sense?
- Can one mechanism express everything the others can? Are any of them
equivalent?
These questions are complicated by the interaction with the parsing model. Oil
relies on explicit mode changes in its hand-written parsers, which doesn't work
well with many parser generators.
Here are some notes on the mechanisms that different tools have come up with:
- Tree-sitter Lexical Analysis: Context-aware lexing, Lexical Precedence, Match Length, Match Specificity.
- GNU Flex Start Conditions: flex provides a mechanism for conditionally activating rules.
- Lark Contextual Lexing: Lark extends the traditional YACC-based architecture with a contextual lexer, which automatically provides feedback from the parser to the lexer, making the LALR(1) algorithm stronger than ever ... This is an improvement to LALR(1) that is unique to Lark.
- OCamllex and Menhir. According to the Morbig paper:
- OCamllex extends the specification language of Lex with many features, two of which are exploited in our implementation. First, a lexer can be defined by a set of mutually recursive entry points ... Second, each entry point of the lexer can be parameterized by one or several arguments.
- Note: In Oil, I define lexing as non-recursive and parsing as recursive. Atomic tokens are fed into the WordParser, producing "words", which are fed into the CommandParser, ArithParser, and BoolParser. That is, the parsers are mutually recursive but the lexer is flat and "modal". (A toy version of this shape is sketched after this list.)
- ANTLR 4 Lexer Rules
- Modes allow you to group lexical rules by context, such as inside and outside of XML tags. It's like having multiple sublexers, one for each context. The lexer can only return tokens matched by entering a rule in the current mode.
- ANTLR lexer rules can be recursive, unlike most lexical grammar tools. This comes in really handy when you want to match nested tokens like nested action blocks.
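And here's the toy version of the Oil shape from the note above: one flat, non-recursive lexer, and two mutually recursive parsers named after Oil's WordParser and CommandParser. The grammar itself (words containing parenthesized command substitutions) is invented for illustration; only the architecture mirrors the description.

```python
import re

TOKENS = re.compile(r'\(|\)|[^()\s]+|\s+')

def lex(s):
    # Flat and non-recursive: one token stream, no nesting tracked here.
    return [t for t in TOKENS.findall(s) if not t.isspace()]

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def parse_command(self):
        # "CommandParser": a command is a sequence of words.
        words = []
        while self.peek() not in (None, ')'):
            words.append(self.parse_word())
        return ('command', words)

    def parse_word(self):
        # "WordParser": a word is a literal, or a parenthesized
        # substitution that recurses back into commands.
        tok = self.tokens[self.pos]
        self.pos += 1
        if tok == '(':
            inner = self.parse_command()   # mutual recursion
            assert self.tokens[self.pos] == ')'
            self.pos += 1
            return ('word-sub', inner)
        return ('word', tok)

print(Parser(lex('echo (date) done')).parse_command())
```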
Conclusion
Let me know if any of those ideas are interesting. Especially if you want to help write a pattern matching engine!
I think it would be fun to compile some C code to WebAssembly. Though we can likely do something much simpler (maybe with derivatives) that will address the deficiencies of libc glob().