How OSH Uses Lexer Modes

2016-10-19 (Last updated 2019-02-06)

OSH uses a simple lexing technique to recognize the shell's many sublanguages in a single pass. I now call it modal lexing.

This is how we address the language composition problem.

Table of Contents

Demo

List of Lexer Modes

Why A Shell Lexer Needs Modes

Example: Where the ARITH Mode is Used

Conclusion

2019 Update

Demo

What do the characters :- mean in this code?

$ echo "foo:-bar  ${foo:-bar}  $(( foo > 1 ? 5:- 5 ))"
foo:-bar  bar  -5

Three different things, depending on the context:

Literal characters to be printed to stdout.
The "if empty or unset" operator within ${}.
The : in the C-style ternary operator, then the unary minus operator for negation.

This is why we need lexer modes. Most lexers look like this:

Token Read()  // return the next token

But our modal lexer has an "enum" parameter:

// return the next token, using rules determined by the mode
Token Read(lex_mode_t mode)

The concept is easy, but it needs a name because:

It deviates from what you might have learned in a compiler class.
Many lexer and parser generators don't support it. It changes the interface between the lexer and parser.

List of Lexer Modes

In OSH, there are currently 8 modes:

UNQUOTED: the start mode, for echo foo
SQ: for 'single-quoted strings'
DQ: for "double quoted strings, which allow $ expansions"
ARITH: for arithmetic expressions, e.g. in $((1+2))
BRACED_VAR_SUB_1: for the first token after ${
BRACED_VAR_SUB_2: for a token in ${ after a name, like :- in ${foo:-bar}
UNQUOTED_VAROP: for the argument after an operator, e.g. ${foo:-var op}
DQ_VAROP: for the argument after an operator when double quoted, e.g. "${foo:-var op}"

And there are two more unimplemented modes:

DOLLAR_SQ: for $'\n' -- literal strings that accept C escapes
REGEX: for the right hand side of [[ foo =~ ^foo$ ]]. (I believe the existence of this mode is a bug in bash, but let's discuss that later.)

(2019 Update: I published an updated list of lexer modes.)

Why A Shell Lexer Needs Modes

In the implementation of many languages, you get by without any modes. Most languages have string literals, where you could consider \t a token, but that can be worked around by writing code to treat the whole double-quoted string (e.g. "a\tb\tc") as a single token (and that seems to be what most languages do).

You can't do this in shell, because a double-quoted string can contain an entire subprogram:

echo "BEGIN $(if grep pat input.txt; then echo 'FOUND'; fi) END"

That is, the "token" would have a recursive tree structure, which means it's not really a token anymore. The modal lexer pattern lets us easily handle these more complicated string literals.

For examples of modes in other languages, see When Are Lexer Modes Useful?. I observe that both Python and JavaScript have grown shell-like string interpolation in the last 10 years.

Example: Where the ARITH Mode is Used

Where are modes get used in the OSH parser? Let's consider the ARITH mode. It gets used in all of these places:

Arithmetic command: (( y = x + 2 )) (useful in if or while conditions)
Alternative let syntax for arithmetic commands: let y=x+2
Arithmetic word: echo $(( y = x + 2 )) and the bash alias $[y = x + 2]
C-style for loops: for ((i=0; i<5; ++i); do echo; done. This is distinct from the (( command because it uses the ; token.
Array or string slicing: echo ${a : i : i+length}
Three places for indexed arrays (but not associative arrays)
- echo ${a[x + 2]} : R-value subscript
- a[x + 2]=foo : L-value subscript
- a=([x + 2]=foo [y]=z) : literal subscript

So when the parser sees a $(( token, it starts calling the lexer with lex_mode_t.ARITH, rather than say lex_mode_t.UNQUOTED.

Likewise, when it sees a ${ it will switch to lex_mode_t.BRACED_VAR_SUB_1.

The current mode can be stored on the stack, since paired delimiters like "quotes", ${}, and $(()) are naturally parsed with recursive function calls.

Conclusion

This post described how the OSH lexer supports parsing shell scripts in a single pass. The lexer cannot run by itself — it needs the parser to send it information so it knows what tokens to return.

This is useful background for explaining the one place where bash cannot be parsed up front: associative array indexing. We'll see this tomorrow.

2019 Update

Note: This post was formerly titled Lexical State and How We Use It.

I'm using "modal lexing" over "lexical state" because the OSH lexer has other state, like a stack of hints to disambiguate the many meanings of ).

I found the term "lexical state" in the DSL Book by Martin Fowler. The Alternative Tokenization pattern is about "completely replacing the lexer" when you get a certain token.