Why Sponsor Oils? | blog | oilshell.org

How OSH Uses Lexer Modes

2016-10-19 (Last updated 2019-02-06)

OSH uses a simple lexing technique to recognize the shell's many sublanguages in a single pass. I now call it modal lexing.

This is how we address the language composition problem.

Table of Contents
Demo
List of Lexer Modes
Why A Shell Lexer Needs Modes
Example: Where the ARITH Mode is Used
Conclusion
2019 Update

Demo

What do the characters :- mean in this code?

$ echo "foo:-bar  ${foo:-bar}  $(( foo > 1 ? 5:- 5 ))"
foo:-bar  bar  -5

Three different things, depending on the context:

  1. Literal characters to be printed to stdout.
  2. The "if empty or unset" operator within ${}.
  3. The : in the C-style ternary operator, then the unary minus operator for negation.

This is why we need lexer modes. Most lexers look like this:

Token Read()  // return the next token

But our modal lexer has an "enum" parameter:

// return the next token, using rules determined by the mode
Token Read(lex_mode_t mode)

The concept is easy, but it needs a name because:

List of Lexer Modes

In OSH, there are currently 8 modes:

And there are two more unimplemented modes:

(2019 Update: I published an updated list of lexer modes.)

Why A Shell Lexer Needs Modes

In the implementation of many languages, you get by without any modes. Most languages have string literals, where you could consider \t a token, but that can be worked around by writing code to treat the whole double-quoted string (e.g. "a\tb\tc") as a single token (and that seems to be what most languages do).

You can't do this in shell, because a double-quoted string can contain an entire subprogram:

echo "BEGIN $(if grep pat input.txt; then echo 'FOUND'; fi) END"

That is, the "token" would have a recursive tree structure, which means it's not really a token anymore. The modal lexer pattern lets us easily handle these more complicated string literals.

For examples of modes in other languages, see When Are Lexer Modes Useful?. I observe that both Python and JavaScript have grown shell-like string interpolation in the last 10 years.

Example: Where the ARITH Mode is Used

Where are modes get used in the OSH parser? Let's consider the ARITH mode. It gets used in all of these places:

So when the parser sees a $(( token, it starts calling the lexer with lex_mode_t.ARITH, rather than say lex_mode_t.UNQUOTED.

Likewise, when it sees a ${ it will switch to lex_mode_t.BRACED_VAR_SUB_1.

The current mode can be stored on the stack, since paired delimiters like "quotes", ${}, and $(()) are naturally parsed with recursive function calls.

Conclusion

This post described how the OSH lexer supports parsing shell scripts in a single pass. The lexer cannot run by itself — it needs the parser to send it information so it knows what tokens to return.

This is useful background for explaining the one place where bash cannot be parsed up front: associative array indexing. We'll see this tomorrow.

2019 Update

Note: This post was formerly titled Lexical State and How We Use It.

I'm using "modal lexing" over "lexical state" because the OSH lexer has other state, like a stack of hints to disambiguate the many meanings of ).

I found the term "lexical state" in the DSL Book by Martin Fowler. The Alternative Tokenization pattern is about "completely replacing the lexer" when you get a certain token.