Regular Languages, Part 2: Ideas and Questions
2020-07-22
(Last updated 2020-12-23)
The last post was about Eggex and regular
languages, in theory and in practice.
It had a long appendix, which I've moved to this post. It's unpolished, but
may spark some ideas. Feel free to leave comments!
A Shortcut for Implementing an Engine
In the long term, Oil may grow its own regular language / pattern matching
engine. I've tried to avoid it, but it seems inevitable because most shells do
it to a degree.
For example, rather than using libc's glob() and fnmatch(), which are restricted forms of regular languages, they implement their own. One reason is to implement the globstar option (**) and extended globs like @(foo|bar). (Unfortunately, I found that these implementations backtrack.)
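To make the "restricted" point concrete: Python's standard fnmatch module really does translate a glob into an equivalent regular expression, and an extended glob like @(foo|bar) maps directly onto regex alternation. (The @() rewrite below is my own two-line illustration; fnmatch itself doesn't support extended globs.)

```python
import fnmatch
import re

# The stdlib can translate a glob to a regex, showing that globs are a
# restricted form of regular expressions.
print(fnmatch.translate('*.py'))  # something like '(?s:.*\\.py)\\Z'

# Extended globs add alternation, which regexes already have.
# This rewrite of @(...) to (...) is just for illustration.
ext_glob = '@(foo|bar).c'
regex = re.sub(r'@\(([^)]*)\)', r'(\1)', ext_glob.replace('.', r'\.'))
print(bool(re.fullmatch(regex, 'foo.c')))  # True
print(bool(re.fullmatch(regex, 'baz.c')))  # False
```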
Another reason to have a pattern matching engine: I've found re2c very useful in implementing Oil, and I'd like to "lift" its functionality to the user level.
But writing one from scratch is a huge amount of work! So here's a crazy idea: compile re2c to WebAssembly and embed a WebAssembly VM in Oil. Summary:
- re2c could grow a new WebAssembly backend, and
- the tool itself could be compiled to WebAssembly.
(These T or J diagrams may help in thinking about that.)
Another reason for WebAssembly: I want to let users avoid the tortured string manipulation syntax in shell. Letting people write arbitrary string functions in fast, portable code is one way to do that.
Problem: I don't like opaque binary blobs, in either the browser or the shell, even if they're sandboxed. I want the code to have provenance. To start, maybe this feature can be disabled by default.
This idea feels like a pipe dream, since there are so many other things to do
in Oil. But it does make sense because regular languages are a crucial
part of shell.
On the other hand, there's a concrete way to start it. Let me
know if you have ever compiled C code to WebAssembly. I
haven't! re2c is plain C++ with few dependencies, and we actually compile
a specific version of it in Oil's continuous
build.
Regular Expression Derivatives: Papers and Code
A different way to implement an engine would be through the derivatives technique. I don't have experience with it, so I'm just dropping a few links, plus a tiny sketch after the list. Let me know if you do.
- Derivatives of Regular Expressions (1964, PDF) (Brzozowski)
- Regular Expression Search Algorithm (1968, PDF) (Thompson)
- Curiously, this paper mentions derivatives but not NFAs or DFAs.
- Comment about Antimirov
derivatives from Darius, and sample code:
- 12/2020 Update: Partial derivatives of regular expressions and finite automaton constructions by Valentin Antimirov. The abstract says that Antimirov partial derivatives are to NFAs as Brzozowski derivatives are to DFAs. The Brzozowski paper that Thompson cites constructs a minimal DFA, while Thompson's paper appears to construct an NFA. So Darius' claim now makes more sense: Thompson was using Antimirov partial derivatives before they had a name. (Though I think I would have to implement both styles to really understand the difference.)
- Yacc is Not Dead (Cox)
- Big difference: parsing with derivatives is exponential, but regular expression matching with derivatives isn't. In fact it has some advantages over the common automata-based algorithms.
- Thompson learned regular expressions from Brzozowski himself while both
were at Berkeley, and he credits the method in the 1968 paper. The
now-standard framing of Thompson's paper as an NFA construction and the
addition of caching states to produce a DFA were both introduced later, by
Aho and Ullman in their various textbooks. Like many primary sources, the
original is different from the textbook presentations of it. (I myself am
also guilty of this particular misrepresentation.)
- 12/2020 Update: As far as I understand, the misrepresentation is that
Thompson's code used NFAs only, and never used DFAs. Related: Thompson's
construction on
Wikipedia.
- Regular Expression Derivatives Re-Examined (2009, PDF)
(Owens, Reppy, Turon). This paper revived the technique.
- Cox on this paper: Because the implementation manipulates regular
expressions symbolically instead of the usual sets of graph nodes, it has a
very natural expression in most functional programming languages. So it's
more efficient and easier to implement: win-win. I like efficient code
that's easy to implement!
- Smart constructors are smarter than you think, by Michael Greenberg of Smoosh.
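Here's that tiny sketch: a minimal Brzozowski-style matcher in Python, written from the 1964 definitions rather than taken from any implementation linked above. Regexes are plain tuples, and matching a string is just taking one derivative per character:

```python
# Regexes as tuples: ('empty',) matches nothing, ('eps',) matches '',
# plus ('chr', c), ('cat', r, s), ('alt', r, s), and ('star', r).

EMPTY, EPS = ('empty',), ('eps',)

def nullable(r):
    """Does r match the empty string?"""
    tag = r[0]
    if tag == 'eps': return True
    if tag in ('empty', 'chr'): return False
    if tag == 'cat': return nullable(r[1]) and nullable(r[2])
    if tag == 'alt': return nullable(r[1]) or nullable(r[2])
    if tag == 'star': return True

def deriv(r, c):
    """Brzozowski derivative: a regex matching what r matches after c."""
    tag = r[0]
    if tag in ('empty', 'eps'): return EMPTY
    if tag == 'chr': return EPS if r[1] == c else EMPTY
    if tag == 'cat':
        d = ('cat', deriv(r[1], c), r[2])
        return ('alt', d, deriv(r[2], c)) if nullable(r[1]) else d
    if tag == 'alt': return ('alt', deriv(r[1], c), deriv(r[2], c))
    if tag == 'star': return ('cat', deriv(r[1], c), r)

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)

# (a|b)* c
r = ('cat', ('star', ('alt', ('chr', 'a'), ('chr', 'b'))), ('chr', 'c'))
print(matches(r, 'abbac'))  # True
print(matches(r, 'abba'))   # False
```

Without simplification the intermediate regexes grow quickly; that's what the "smart constructors" in Greenberg's post address, e.g. collapsing ('alt', EMPTY, r) to r.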
Some implementations:
(I recovered these links from a draft of "Dev Log #11" last year, which I never published. Hat tip to Santi from Recurse Center for sending me down this rabbit hole of research :-) )
Research Questions
More Powerful Linear Time, Constant Space Languages?
As mentioned in the last post, I use regular languages as a linear-time, constant-space programming language. They do a lot of work for "free", performance-wise.
Is there a more powerful model that's still constrained in resource usage? For example, Langdale mentioned that many uses of backreferences aren't exponential.
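For example, here's a backreference in Python's re module. The pattern recognizes "doubled words", a language that's not regular (not even context-free), yet this particular use doesn't trigger exponential backtracking:

```python
import re

# \1 must repeat exactly what group 1 matched.
pat = re.compile(r'^(\w+) \1$')
print(bool(pat.match('hello hello')))  # True
print(bool(pat.match('hello world')))  # False
```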
"Powerful" can be defined either:
- theoretically (what subsets of infinite strings can be recognized?), or
- practically (can it recognize Python, JavaScript, C++?)
Related:
- I use the lexer modes technique in Oil, which involves a trivial state machine on top of regular languages. Vim and other text editors have similar mechanisms for syntax highlighting. (A minimal sketch appears after this list.)
- Calc Regular Languages (2017, PDF), from the LangSec line of research, motivated by the fact that network protocols often have length fields.
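Here's the minimal sketch of lexer modes mentioned above, with invented token names: each mode is its own list of regular expressions, and a trivial state machine picks which list applies. (In Oil the hand-written parsers drive the mode changes; this standalone loop is a simplification.)

```python
import re

# One list of (regex, token_name) rules per mode.  The hypothetical
# DQ mode handles double-quoted strings with $var interpolation.
MODES = {
    'OUTER': [
        (re.compile(r'"'),       'LEFT_DQUOTE'),
        (re.compile(r'[a-z]+'),  'WORD'),
        (re.compile(r'\s+'),     'SPACE'),
    ],
    'DQ': [
        (re.compile(r'"'),       'RIGHT_DQUOTE'),
        (re.compile(r'\$\w+'),   'VAR_SUB'),
        (re.compile(r'[^"$]+'),  'LITERAL'),
    ],
}

def lex(s):
    mode, pos = 'OUTER', 0
    while pos < len(s):
        for pat, tok in MODES[mode]:
            m = pat.match(s, pos)
            if m:
                yield tok, m.group()
                pos = m.end()
                # The trivial state machine: two transitions.
                if tok == 'LEFT_DQUOTE':  mode = 'DQ'
                if tok == 'RIGHT_DQUOTE': mode = 'OUTER'
                break
        else:
            raise SyntaxError('unexpected char at %d' % pos)

print(list(lex('echo "hi $name"')))
```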
Ad-Hoc Context in Common Lexing Tools? Recursion?
All practical parsing tools have developed ad-hoc mechanisms to deal with the
problems that lexer modes solve, e.g. recognizing
string interpolation syntax.
- How powerful are these mechanisms in both a theoretical and a practical sense?
- Can one mechanism express everything the others can? Are any of them
equivalent?
These questions are complicated by the interaction with the parsing model. Oil
relies on explicit mode changes in its hand-written parsers, which doesn't work
well with many parser generators.
Here are some notes on the mechanisms that different tools have come up with:
- Tree-sitter Lexical Analysis: Context-aware lexing, Lexical Precedence, Match Length, Match Specificity.
- GNU Flex Start Conditions: flex provides a mechanism for conditionally activating rules.
- Lark Contextual Lexing: Lark extends the traditional YACC-based architecture with a contextual lexer, which automatically provides feedback from the parser to the lexer, making the LALR(1) algorithm stronger than ever ... This is an improvement to LALR(1) that is unique to Lark.
- OCamllex and Menhir. According to the Morbig paper:
- OCamllex extends the specification language of Lex with many features, two of which are exploited in our implementation. First, a lexer can be defined by a set of mutually recursive entry points ... Second, each entry point of the lexer can be parameterized by one or several arguments.
- Note: In Oil, I define lexing as non-recursive and parsing as recursive. Atomic tokens are fed into the WordParser, producing "words", which are fed into the CommandParser, ArithParser, and BoolParser. That is, the parsers are mutually recursive but the lexer is flat and "modal". (A toy version of this shape is sketched after this list.)
- ANTLR 4 Lexer Rules
- Modes allow you to group lexical rules by context, such as inside and outside of XML tags. It's like having multiple sublexers, one for each context. The lexer can only return tokens matched by entering a rule in the current mode.
- ANTLR lexer rules can be recursive, unlike most lexical grammar tools. This comes in really handy when you want to match nested tokens like nested action blocks.
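And here's the toy version of the Oil shape from the note above: one flat, non-recursive lexer, and two mutually recursive parsers named after Oil's WordParser and CommandParser. The grammar itself (words containing parenthesized command substitutions) is invented for illustration; only the architecture mirrors the description.

```python
import re

TOKENS = re.compile(r'\(|\)|[^()\s]+|\s+')

def lex(s):
    # Flat and non-recursive: one token stream, no nesting tracked here.
    return [t for t in TOKENS.findall(s) if not t.isspace()]

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def parse_command(self):
        # "CommandParser": a command is a sequence of words.
        words = []
        while self.peek() not in (None, ')'):
            words.append(self.parse_word())
        return ('command', words)

    def parse_word(self):
        # "WordParser": a word is a literal, or a parenthesized
        # substitution that recurses back into commands.
        tok = self.tokens[self.pos]
        self.pos += 1
        if tok == '(':
            inner = self.parse_command()   # mutual recursion
            assert self.tokens[self.pos] == ')'
            self.pos += 1
            return ('word-sub', inner)
        return ('word', tok)

print(Parser(lex('echo (date) done')).parse_command())
```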
Conclusion
Let me know if any of those ideas are interesting. Especially if you want to help write a pattern matching engine!
I think it would be fun to compile some C code to WebAssembly. Though we can likely do something much simpler (maybe with derivatives) that will address the deficiencies of libc glob().