blog | oilshell.org

OSH Parses Shell Scripts Up Front in a Single Pass. Other Shells Don't.

2016-10-13 (Last updated 2019-02-06)

The last post discussed parse-time vs. runtime errors, using the example of [ vs. [[.

To improve programmer productivity, I want OSH to emit as many errors as possible at parse-time. That is, you should get errors before running a shell script, rather than in the middle of a 5-minute or 5-hour invocation.

To do this, OSH parses the whole program at once. I thought this was the simplest and more obvious way to parse shell, but it turns out that other shells do not do this. This post demonstrates that.

Table of Contents
Three Kinds of Substitutions in Five Shells
Problems with Multi-Stage Parsing
Recap

Three Kinds of Substitutions in Five Shells

Let's start with these 3 valid programs:

$ echo ${x:-VarSub}
VarSub
$ echo $(echo CommandSub)  # `echo CommandSub` is equivalent
CommandSub
$ echo ArithSub $((1 + 2))
ArithSub 3

Now we create syntax errors by removing characters:

$ echo ${x:}
/bin/bash: line 1: ${x:}: bad substitution
$ echo $(echo CommandSub >)
/bin/bash: command substitution: line 2: syntax error near unexpected token `)'
/bin/bash: command substitution: line 2: `echo CommandSub >)'
$ echo ArithSub $((1 + ))
/bin/bash: line 1: 1 + : syntax error: operand expected (error token is "+ ")

And then we use the same if false technique from the last post:

$ if false; then echo ${foo:};      else echo VarSub not parsed;     fi &&
> if false; then echo $(echo hi >); else echo CommandSub not parsed; fi &&
> if false; then echo $(( 1 + ));   else echo ArithSub not parsed;   fi
VarSub not parsed
CommandSub not parsed
ArithSub not parsed

Wow, bash doesn't parse any of them! It interleaves parsing and execution. Let's test dash, mksh, and zsh:

$ for sh in dash mksh zsh; do
>   echo -----
>   echo $sh
>   $sh -c 'if false; then echo ${foo:};      else echo VarSub not parsed;     fi'
>   $sh -c 'if false; then echo $(echo hi >); else echo CommandSub not parsed; fi'
>   $sh -c 'if false; then echo $(( 1 + ));   else echo ArithSub not parsed;   fi'
>   echo
> done
-----
dash
dash: 1: Syntax error: Missing '}'
dash: 1: Syntax error: ")" unexpected
ArithSub not parsed

-----
mksh
VarSub not parsed
mksh: syntax error: ')' unexpected
ArithSub not parsed

-----
zsh
VarSub not parsed
zsh:1: parse error near `)'
zsh:1: parse error near `$(echo hi >)'
ArithSub not parsed

Interesting!

Let's try OSH:

Line 1 of '<-c arg>'
  if false; then echo ${foo:};      else echo VarSub not parsed;     fi
                             ^
Unknown token in arith context: <UNKNOWN_TOK ";">

Line 1 of '<-c arg>'
  if false; then echo $(echo hi >); else echo CommandSub not parsed; fi
                                 ^
Expected filename after redirect operator

Line 1 of '<-c arg>'
  if false; then echo $(( 1 + ));   else echo ArithSub not parsed;   fi
                              ^
Expected value or parenthesized expression, got {ASOP_RPAREN ")"}

It catches all 3 errors up front. The error messages need a little polish, but I like that they point to the correct token like the Python parser and Clang. No other shell does this, even if manages to catch the error.

I think this is pretty cool!

Problems with Multi-Stage Parsing

Multi-stage parsing, i.e. parsing at execution time, is harder, not easier. The issue is that you have to "pre-parse" the code anyway to figure out where the closing } or ) or )) is.

This is because the non-recursive lexer needs knowledge from a recursive parser in order to be able to "skip over" text. A shell lexer can't run by itself -- it will start returning the wrong tokens without feedback from the parser.

This is described well in the AOSA book chapter on bash:

Much of the work to recognize the end of the command substitution during the parsing stage is encapsulated into a single function (parse_comsub), which knows an uncomfortable amount of shell syntax and duplicates rather more of the token-reading code than is optimal. This function has to know about here documents, shell comments, metacharacters and word boundaries, quoting, and when reserved words are acceptable (so it knows when it's in a case statement); it took a while to get that right.

(This article by bash maintainer Chet Ramey has been very useful. I've read it several times, and I'll continue to refer to it in future posts.)

I also believe that multi-stage parsing is slower. You end up doing more total work because you have to pre-parse as well as parse. And parsing at execution time will obviously slow execution down. I haven't seen any indication that existing shells cache parse trees, so they will end up repeatedly parsing the text inside ${} and $() if they appear inside in a loop.

I'll have more to say about the performance of the OSH parser once it's ported to C++. But for now, I claim it is algorithmically more efficient.

Tomorrow I will talk about the simple technique that lets OSH parse in one pass: lexical states.

Recap

Yesterday I showed that [[ is part of the language, while [ is a builtin. Thus the arguments to [ are opaque to the shell parser, while [[ expressions can be parsed up front.

This post shows that the ${}, $(), and $(( )) constructs are not parsed up front, even though they're part of the shell language. The bash parser skips over them, essentially treating them as builtins.