OSH uses a simple lexing technique to recognize the shell's many sublanguages in a single pass. I now call it modal lexing.
This is how we address the language composition problem.
What do the characters :- mean in this code?
$ echo "foo:-bar ${foo:-bar} $(( foo > 1 ? 5:- 5 ))"
foo:-bar bar -5
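Three different things, depending on the context:
Inside the double-quoted string, literal characters that are echoed to stdout.
The "use a default if unset or empty" operator inside ${}.
The : in the C-style ternary operator, then the unary minus operator for negation.
This is why we need lexer modes. Most lexers look like this: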
Token Read() // return the next token
But our modal lexer has an "enum" parameter:
// return the next token, using rules determined by the mode
Token Read(lex_mode_t mode)
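To make the signature concrete, here is a minimal Python sketch of a lexer whose read() takes a mode. It is not the OSH implementation: the LexMode class, the token names, and the three tiny rule tables are assumptions made up for illustration, covering only the :- example above.

```python
import re
from enum import Enum, auto


class LexMode(Enum):
    """Hypothetical subset of lexer modes, just enough for the :- example."""
    DQ = auto()                # inside "double quotes"
    BRACED_VAR_SUB_2 = auto()  # inside ${ after the variable name
    ARITH = auto()             # inside $(( ))


# One ordered rule table per mode. Token names are invented for this sketch.
RULES = {
    LexMode.DQ: [("Lit_Chars", re.compile(r'[^$"\\]+'))],
    LexMode.BRACED_VAR_SUB_2: [("VOp_ColonHyphen", re.compile(r":-"))],
    LexMode.ARITH: [
        ("Arith_Colon", re.compile(r":")),
        ("Arith_Minus", re.compile(r"-")),
    ],
}


class Lexer:
    def __init__(self, line):
        self.line = line
        self.pos = 0

    def read(self, mode):
        """Return the next (token type, text) pair, using the rules for `mode`."""
        for tok_type, pattern in RULES[mode]:
            m = pattern.match(self.line, self.pos)
            if m:
                self.pos = m.end()
                return tok_type, m.group()
        raise ValueError("no %s rule matches at position %d" % (mode.name, self.pos))


def tokens(text, mode):
    """Lex all of `text` in one mode. A real parser would switch modes as it goes."""
    lexer = Lexer(text)
    out = []
    while lexer.pos < len(text):
        out.append(lexer.read(mode))
    return out
```

With these rules, tokens(":-", LexMode.DQ) gives one literal token, tokens(":-", LexMode.BRACED_VAR_SUB_2) gives one operator token, and tokens(":-", LexMode.ARITH) gives two tokens, a colon and a minus. The same two characters produce three different results, and the only thing that changed is the argument to read().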
The concept is easy, but it needs a name.
In OSH, there are currently 8 modes:
UNQUOTED: the start mode, for echo foo
SQ: for 'single-quoted strings'
DQ: for "double quoted strings, which allow $ expansions"
ARITH: for arithmetic expressions, e.g. in $((1+2))
BRACED_VAR_SUB_1: for the first token after ${
BRACED_VAR_SUB_2: for a token in ${ after a name, like :- in ${foo:-bar}
UNQUOTED_VAROP: for the argument after an operator, e.g. ${foo:-var op}
DQ_VAROP: for the argument after an operator when double quoted, e.g. "${foo:-var op}"
And there are two more unimplemented modes:
DOLLAR_SQ: for $'\n' -- literal strings that accept C escapes
REGEX: for the right hand side of [[ foo =~ ^foo$ ]]. (I believe the existence of this mode is a bug in bash, but let's discuss that later.)
(2019 Update: I published an updated list of lexer modes.)
In the implementations of many languages, you can get by without any modes. Most languages have string literals, where you could consider \t a token, but that can be worked around by writing code to treat the whole double-quoted string (e.g. "a\tb\tc") as a single token (and that seems to be what most languages do).
You can't do this in shell, because a double-quoted string can contain an entire subprogram:
echo "BEGIN $(if grep pat input.txt; then echo 'FOUND'; fi) END"
That is, the "token" would have a recursive tree structure, which means it's not really a token anymore. The modal lexer pattern lets us easily handle these more complicated string literals.
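As a rough illustration of that tree structure, the sketch below parses a double-quoted string into nested parts. It uses ${name:-default} as a simplified stand-in for $(...), since a full command parser would not fit here. The token names, the rule tables, and the parse_* helpers are all hypothetical and are not taken from OSH; only the mode names come from the list above.

```python
import re
from enum import Enum, auto


class LexMode(Enum):
    DQ = auto()
    BRACED_VAR_SUB_1 = auto()
    BRACED_VAR_SUB_2 = auto()
    DQ_VAROP = auto()


# Hypothetical rule tables: (token type, regex) pairs, tried in order.
RULES = {
    LexMode.DQ: [
        ("Left_VarSub", re.compile(r"\$\{")),
        ("Right_DQuote", re.compile(r'"')),
        ("Lit_Chars", re.compile(r'[^$"]+')),
    ],
    LexMode.BRACED_VAR_SUB_1: [
        ("VarName", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ],
    LexMode.BRACED_VAR_SUB_2: [
        ("VOp_ColonHyphen", re.compile(r":-")),
        ("Right_VarSub", re.compile(r"\}")),
    ],
    LexMode.DQ_VAROP: [
        ("Left_VarSub", re.compile(r"\$\{")),
        ("Right_VarSub", re.compile(r"\}")),
        ("Lit_Chars", re.compile(r'[^$}"]+')),
    ],
}


class Lexer:
    def __init__(self, line, pos=0):
        self.line = line
        self.pos = pos

    def read(self, mode):
        """Return the next (token type, text) pair under the given mode."""
        for tok_type, pattern in RULES[mode]:
            m = pattern.match(self.line, self.pos)
            if m:
                self.pos = m.end()
                return tok_type, m.group()
        raise ValueError("no %s rule matches at position %d" % (mode.name, self.pos))


def parse_var_sub(lexer):
    """Parse from just after '${' through the matching '}'."""
    _, name = lexer.read(LexMode.BRACED_VAR_SUB_1)
    tok_type, _ = lexer.read(LexMode.BRACED_VAR_SUB_2)
    default = []
    if tok_type == "VOp_ColonHyphen":
        while True:
            tok_type, text = lexer.read(LexMode.DQ_VAROP)
            if tok_type == "Right_VarSub":
                break
            if tok_type == "Left_VarSub":
                default.append(parse_var_sub(lexer))  # nested ${...}
            else:
                default.append(("lit", text))
    return ("var_sub", name, default)


def parse_dq(lexer):
    """Parse the body of a double-quoted string, after the opening quote."""
    parts = []
    while True:
        tok_type, text = lexer.read(LexMode.DQ)
        if tok_type == "Right_DQuote":
            return parts
        if tok_type == "Left_VarSub":
            parts.append(parse_var_sub(lexer))
        else:
            parts.append(("lit", text))


def parse_double_quoted(src):
    if not src.startswith('"'):
        raise ValueError("expected an opening double quote")
    return parse_dq(Lexer(src, pos=1))
```

For example, parse_double_quoted('"foo:-bar ${foo:-bar}"') returns a literal part followed by a var_sub node that has its own list of parts. The string is never handed over as one token; the lexer only ever returns flat tokens, and the parser builds the tree.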
For examples of modes in other languages, see When Are Lexer Modes Useful?. I observe that both Python and JavaScript have grown shell-like string interpolation in the last 10 years.
Where do modes get used in the OSH parser? Let's consider the ARITH mode.
It gets used in all of these places:
(( y = x + 2 )) (useful in if or while conditions)
let syntax for arithmetic commands: let y=x+2
echo $(( y = x + 2 )) and the bash alias $[y = x + 2]
for ((i=0; i<5; ++i)); do echo; done. This is distinct from the (( command because it uses the ; token.
echo ${a : i : i+length}
echo ${a[x + 2]}: R-value subscript
a[x + 2]=foo: L-value subscript
a=([x + 2]=foo [y]=z): literal subscript
So when the parser sees a $(( token, it starts calling the lexer with lex_mode_t.ARITH, rather than, say, lex_mode_t.UNQUOTED.
Likewise, when it sees a ${ it will switch to lex_mode_t.BRACED_VAR_SUB_1.
The current mode can be stored on the stack, since paired delimiters like "quotes", ${}, and $(()) are naturally parsed with recursive function calls.
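Here is a small Python sketch of what "the mode lives on the call stack" can look like. It assumes only two modes and invented token names and rules, so treat it as an illustration rather than the OSH parser. Each parse function passes its own mode to read(), and when parse_arith() returns, the caller's loop simply resumes in UNQUOTED mode.

```python
import re
from enum import Enum, auto


class LexMode(Enum):
    UNQUOTED = auto()
    ARITH = auto()


# Hypothetical rule tables: (token type, regex) pairs, tried in order.
RULES = {
    LexMode.UNQUOTED: [
        ("Left_ArithSub", re.compile(r"\$\(\(")),
        ("WS", re.compile(r"\s+")),
        ("Lit_Chars", re.compile(r"[^\s$]+")),
    ],
    LexMode.ARITH: [
        ("Right_ArithSub", re.compile(r"\)\)")),
        ("WS", re.compile(r"\s+")),
        ("Number", re.compile(r"[0-9]+")),
        ("Op", re.compile(r"[-+*/?:<>]")),
    ],
}


class Lexer:
    def __init__(self, line):
        self.line = line
        self.pos = 0

    def read(self, mode):
        """Return the next (token type, text) pair under the given mode."""
        if self.pos >= len(self.line):
            return "Eof", ""
        for tok_type, pattern in RULES[mode]:
            m = pattern.match(self.line, self.pos)
            if m:
                self.pos = m.end()
                return tok_type, m.group()
        raise ValueError("no %s rule matches at position %d" % (mode.name, self.pos))


def parse_arith(lexer, trace):
    """Called after '$((' was seen; reads in ARITH mode up to the closing '))'."""
    while True:
        tok_type, text = lexer.read(LexMode.ARITH)
        if tok_type == "Eof":
            raise ValueError("unterminated $((")
        if tok_type == "WS":
            continue
        trace.append((LexMode.ARITH.name, text))
        if tok_type == "Right_ArithSub":
            return  # the caller continues in its own mode


def parse_command(lexer, trace):
    """Reads words in UNQUOTED mode, recursing into parse_arith on '$(('."""
    while True:
        tok_type, text = lexer.read(LexMode.UNQUOTED)
        if tok_type == "Eof":
            return
        if tok_type == "WS":
            continue
        trace.append((LexMode.UNQUOTED.name, text))
        if tok_type == "Left_ArithSub":
            parse_arith(lexer, trace)


def trace_modes(line):
    """Return (mode name, token text) pairs in the order tokens were read."""
    trace = []
    parse_command(Lexer(line), trace)
    return trace
```

Running trace_modes("echo $(( 5:- 5 )) 5:-5") shows the :- inside $(( )) read as two ARITH tokens, while the trailing 5:-5 is read as a single UNQUOTED word. There is no explicit mode stack in the code; the nesting of function calls plays that role.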
This post described how the OSH lexer supports parsing shell scripts in a single pass. The lexer cannot run by itself — it needs the parser to send it information so it knows what tokens to return.
This is useful background for explaining the one place where bash cannot be parsed up front: associative array indexing. We'll see this tomorrow.
Note: This post was formerly titled Lexical State and How We Use It.
I'm using "modal lexing" over "lexical state" because the OSH lexer has other state, like a stack of hints to disambiguate the many meanings of ).
I found the term "lexical state" in the DSL Book by Martin Fowler. The Alternative Tokenization pattern is about "completely replacing the lexer" when you get a certain token.