Why Sponsor Oils? | blog | oilshell.org
The last post explained why the Oil lexer is interesting, and reviewed the idea of lexer modes.
If you want to peek ahead, and know something about C and regexes, scroll through osh-lex.re2c.h. It's generated from osh/lex.py (a Python module that is less readable due to all the metaprogramming).
Regardless of whether that code makes sense right now, the idea is simple. Most lexers look something like this:
Token Read() // return the next token
In contrast, a modal lexer has an "enum" parameter:
// return the next token, using rules determined by the mode
Token Read(lex_mode_t mode)
You'll rarely see this technique in textbook lexers or generated lexers, but it solves several common problems.
The first half of this post shows motivating examples. The second half discusses implementation issues, and makes several arguments about the distinction between lexing and parsing.
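To make the mode-as-parameter interface concrete, here's a minimal sketch in Python. The mode names, token names, and regexes are invented for illustration; Oil's real lexer in osh/lex.py is organized differently.

```python
import re
from enum import Enum

class LexMode(Enum):
    OUTER = 1    # normal command context
    DQUOTE = 2   # inside double quotes

# Hypothetical rules: each mode has its own list of (regex, token type).
RULES = {
    LexMode.OUTER: [
        (re.compile(r'[a-z]+'), 'WORD'),
        (re.compile(r'"'), 'LEFT_DQUOTE'),
        (re.compile(r'\s+'), 'SPACE'),
    ],
    LexMode.DQUOTE: [
        (re.compile(r'\$'), 'DOLLAR'),
        (re.compile(r'"'), 'RIGHT_DQUOTE'),
        (re.compile(r'[^$"]+'), 'LITERAL'),
    ],
}

class Lexer:
    def __init__(self, s):
        self.s = s
        self.pos = 0

    def read(self, mode):
        # The parser passes the mode, so the same input position can
        # yield different tokens depending on the grammatical context.
        if self.pos >= len(self.s):
            return ('EOF', '')
        for pat, tok_type in RULES[mode]:
            m = pat.match(self.s, self.pos)
            if m:
                self.pos = m.end()
                return (tok_type, m.group())
        raise ValueError('No rule matched at position %d' % self.pos)
```

The parser drives it: it calls `read(LexMode.OUTER)` until it sees `LEFT_DQUOTE`, then switches to `read(LexMode.DQUOTE)` until the string closes.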
Consider string literals with \ escapes:
>>> print("Hello W\u00f4rld\n")
Hello Wôrld
Although there is lexical structure within the double quotes, a lexer mode is not required here. C-style string literals are usually scanned with hand-written code and treated as a single token. I prefer to treat \n and \u00f4 as separate tokens, but that's not the way it's usually done.
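As an illustration of that tokenizing style, here's a sketch that splits the body of a double-quoted literal into separate LITERAL and ESCAPE tokens. The token names and the set of recognized escapes are invented for this example:

```python
import re

# One alternative per token kind: a known escape sequence, or a run of
# literal characters (a stray backslash also counts as a literal).
TOKEN = re.compile(r'(\\u[0-9a-fA-F]{4}|\\[nt"\\])|(\\|[^\\]+)')

def lex_dq_body(s):
    tokens = []
    for m in TOKEN.finditer(s):
        if m.group(1):
            tokens.append(('ESCAPE', m.group(1)))
        else:
            tokens.append(('LITERAL', m.group(2)))
    return tokens
```

A later stage can decode each ESCAPE token on its own, rather than re-scanning one big string token.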
Lexer modes become useful when you have a recursive language within double quotes, as shell does:
$ echo "$((1 + 2 * 3)) equals 7"
7 equals 7
This shell-like syntax appears in JavaScript as of version 6:
> console.log(`${1 + 2 * 3} equals 7`)
7 equals 7
as well as in Apple's Swift language:
> print("\(1 + 2 * 3) equals 7")
7 equals 7
This is the most compelling use case for lexer modes because the language inside strings is not only recursive, but it's mutually recursive with the "main" language. The same expressions can appear both inside and outside the string literal.
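To see what mutual recursion means concretely, here's a toy sketch in Python: parse_string() calls parse_expr() for each ${} interpolation, and parse_expr() calls parse_string() back when it encounters a quote. The structure is invented for illustration and assumes well-formed input; a real parser would work on tokens, not characters.

```python
def parse_string(s, i):
    # Assumes s[i] is the opening double quote.
    parts = []
    i += 1
    while s[i] != '"':
        if s.startswith('${', i):
            node, i = parse_expr(s, i + 2)   # recurse into the expression language
            assert s[i] == '}', 'expected closing brace'
            parts.append(node)
            i += 1
        else:
            parts.append(s[i])
            i += 1
    return ('str', parts), i + 1

def parse_expr(s, i):
    # Toy expression "parser": either a quoted string (mutual recursion)
    # or an opaque run of characters up to the closing brace.
    if s[i] == '"':
        return parse_string(s, i)
    j = i
    while s[j] not in '}"':
        j += 1
    return ('expr', s[i:j]), j
```

Note that an input like `"${"x"}"` exercises the cycle in both directions: a string containing an expression containing a string.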
The next three examples are recursive, but not mutually recursive.
Consider these two lines of JavaScript:
var i = ab + c * d // addition and multiplication
var pattern = /ab+c*d/ // 1 or more b's and 0 or more c's
The letters abcd and the characters + and * mean completely different things.
A lexer mode could be used here. When the JavaScript parser sees the opening /, it can tell the lexer to switch to a different mode.
Note that the regex language is arbitrarily recursive, e.g.:
var p = /(((a+b)+c)+d)+/ // matches abcd, aabcd, etc.
But again, it's not mutually recursive. Regex literals don't contain JavaScript code.
(However, Perl regexes are mutually recursive. Then again, parsing Perl is undecidable. I discuss this issue at the end of Parsing Bash is Undecidable.)
Not only do embedded regex literals lend themselves to a separate lexer mode, but regex syntax considered on its own has natural modes.
For example, the ^ character has three different meanings in these three regexes:
^line$    # ^ is a metacharacter that matches the beginning of a line
[^0-9]    # ^ is the negation operator
[ab^]     # ^ is a character literal
Depending on the context, the lexer can return different tokens for each instance of the ^ character. This makes the parser's job easier.
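Here's a sketch of how a lexer might classify each ^ in a pattern by context. The token names are invented, and it's simplified (for example, it ignores the rule that a ] right after [ is a literal):

```python
def lex_caret(pattern):
    # Return a token name for each ^ in the pattern, depending on where
    # it appears: outside a char class, first inside [...], or elsewhere
    # inside [...].
    tokens = []
    in_class = False
    class_start = -1
    for i, ch in enumerate(pattern):
        if ch == '[' and not in_class:
            in_class = True
            class_start = i
        elif ch == ']' and in_class:
            in_class = False
        elif ch == '^':
            if not in_class:
                tokens.append('LINE_ANCHOR')
            elif i == class_start + 1:
                tokens.append('NEGATE')
            else:
                tokens.append('LITERAL_CARET')
    return tokens
```

Because the lexer has already classified each ^, the parser never has to re-inspect characters to decide what a token means.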
If you squint, HTML resembles shell: it involves language composition, just like shell does. It happens in both tags and attributes:
<head>
<!-- arbitrary JS here -->
<script type="text/javascript">
function init() { return 42; }
</script>
<!-- arbitrary CSS here -->
<style>
body {
width: 80em;
}
</style>
</head>
<body onload="init();"> <!-- arbitrary JS here -->
<p style="font-size: large;"> <!-- arbitrary CSS here -->
Hello World
</p>
</body>
I'm not claiming that lexer modes are required to parse this kind of document.
For example, when you see a <script> tag, you could wait until you see the closing tag, then pass that opaque string to a separate lexer and parser. In the case of attributes, you can wait for the closing double quote.
This is because there is recursion, but no mutual recursion. (Although something like JSX introduces that possibility again.)
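A sketch of that wait-for-the-closing-tag approach, heavily simplified (real HTML has edge cases, such as the sequence </script> appearing inside a JavaScript string):

```python
def extract_script(html, body_start):
    # body_start: index just past the opening <script ...> tag.
    end = html.find('</script>', body_start)
    if end == -1:
        raise ValueError('unclosed <script> tag')
    # The opaque body can now be handed to a separate JS lexer/parser.
    return html[body_start:end], end + len('</script>')
```

The point is that the HTML tokenizer never interprets the script body itself; it only finds its extent.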
In any case, I do think lexer modes are a nice way to handle this problem, but I won't say much more because I don't know much about web browser implementation. If you do, leave a comment.
The mode doesn't have to be a parameter to the lexer's Read() function. Instead, your lexer can be a state machine that makes mode transitions by itself, rather than needing the parser to tell it when to change modes.
In fact, I wrote a parser for regexes this way, and it worked well.
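Here's a sketch of that autonomous style for a fragment of regex syntax: the lexer flips its own in-class flag when it crosses [ and ], so + and * lex differently inside a character class without any help from the parser. Token names are invented:

```python
class RegexLexer:
    # Sketch of an autonomous lexer: it tracks its own mode and makes
    # transitions itself, so Read() takes no mode parameter.
    def __init__(self, s):
        self.s = s
        self.pos = 0
        self.in_class = False  # are we inside [...] ?

    def read(self):
        if self.pos >= len(self.s):
            return ('EOF', '')
        ch = self.s[self.pos]
        self.pos += 1
        if self.in_class:
            if ch == ']':
                self.in_class = False   # mode transition on ]
                return ('RIGHT_BRACKET', ch)
            return ('CHAR', ch)         # + and * are literals in a class
        if ch == '[':
            self.in_class = True        # mode transition on [
            return ('LEFT_BRACKET', ch)
        if ch == '+':
            return ('REPEAT_PLUS', ch)
        if ch == '*':
            return ('REPEAT_STAR', ch)
        return ('CHAR', ch)
```

This works because regexes aren't mutually recursive with a host language; the lexer can tell where a character class begins and ends on its own.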
For Oil, I used the mode-as-parameter style, somewhat inspired by Ninja's lexer, and it turned out well. This style is almost certainly more powerful, because the parser knows more about the structure of the document than the lexer does.
One drawback is that you may have problems interfacing a generated parser with a modal lexer. The parser expects to call the lexer without a parameter.
Likewise, generated lexers are most often "autonomous", because they expect to be called without a parameter.
However, re2c is a notable exception. Because it's a library and not a framework, it can easily accommodate the style of modal lexers. You write the function signatures; re2c generates a bunch of switch and goto statements that implement a DFA. (More about this later.)
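The generated code is conceptually a hand-unrolled state machine. As a rough Python analogue (not re2c's actual output), here's an explicit-state loop that recognizes a number or a word with no lookahead and no backtracking:

```python
def dfa_read(s, pos):
    # Explicit DFA states; each input byte is examined exactly once.
    # State and token names are invented for this sketch.
    state = 'START'
    start = pos
    while pos < len(s):
        ch = s[pos]
        if state == 'START':
            if ch.isdigit():
                state = 'NUM'
            elif ch.isalpha():
                state = 'WORD'
            else:
                break
        elif state == 'NUM' and not ch.isdigit():
            break   # end of the number token
        elif state == 'WORD' and not ch.isalnum():
            break   # end of the word token
        pos += 1
    return state, s[start:pos], pos
```

re2c compiles your regexes into exactly this shape of code, but with switch and goto instead of a state variable, which is why it composes cleanly with hand-written modal drivers.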
This is yet another reason that parser generators are less than optimal for recognizing "real" languages: they don't handle language composition well.
Laurence Tratt points this out in his excellent article Parsing: The Solved Problem That Isn't.
One of the things that's become increasingly obvious to me over the past few years is that the general consensus breaks down for one vital emerging trend: language composition.
The article also addresses the problem of the lexer when composing languages. Some parsing techniques like Parsing Expression Grammars unify lexing and parsing, a style known as scannerless parsing.
Scannerless parsing does away with a separate tokenization phase; the grammar now contains the information necessary to dynamically tokenize text. Combining grammars with markedly different tokenization rules is now possible.
I believe that this is the wrong approach. (It's also uncommon; most real languages are parsed with separate lexers.)
Rather than throwing out the lexer, you can use the technique I describe in this post.
Here are two arguments for separate lexing and parsing: it makes the parser simpler, because the parser works with whole tokens rather than characters, and it's more efficient, because lexing is cheap while parsing is expensive.
To elaborate on the first argument, here are examples of non-recursive, lexical structure:
while 3.14159 \n
Here are examples of recursive, grammatical structure:
(1 + (2 + (3 + 4)))
/(((a+b)+c)+d)+/
if (x) { if (y) { print("x and y") } }
Parsers are detailed pieces of code. If you can collapse substrings like while or 3.14159 into a single "thing" you have to keep track of, it makes them easier to understand.
To elaborate on the second argument:
Lexing is easy and fast. You can write a switch statement in C, or you can match regular expressions in linear time and constant space. You don't need to look ahead or backtrack; you just march forward through the bytes.
Parsing is hard and resource-intensive. There are dozens of parsing algorithms with different tradeoffs.
In this thread, Aidenn0 on reddit points out that scannerless parsing wastes time when combined with backtracking, and wastes space when combined with memoization (packrat parsing).
(This is somewhat similar to the issue mentioned in the appendix of the last post: lookahead can interact with the "lexer hack".)
2022 Update: I argued this further on the wiki:
I quoted Laurence Tratt above. He argued that parsing isn't solved because language composition is common and not well understood.
To address a different issue, Trevor Jim argues that parsing isn't solved because most languages are not context-free, while parser generators assume that they are. Instead, their grammars are unrestricted, which means they require Turing Complete parsers.
This is true of the Oil parser, or any other shell parser. There is no way to express shell as a context-free grammar. I ported the POSIX shell grammar to ANTLR back in March 2016, and found that it only covered one of the four major sublanguages in shell.
Here are the points I want you to take away from this post:
Now we should be in a better position to understand language composition in shell. As I've said, it's not one language, but a composition of at least four mutually recursive sublanguages. I will show examples of this in the next post.
Leave a comment if you have other good examples of language composition!