The last post discussed parse-time vs. runtime errors, using the
example of [
vs. [[
.
To improve programmer productivity, I want OSH to emit as many errors as possible at parse-time. That is, you should get errors before running a shell script, rather than in the middle of a 5-minute or 5-hour invocation.
To do this, OSH parses the whole program at once. I thought this was the simplest and more obvious way to parse shell, but it turns out that other shells do not do this. This post demonstrates that.
Let's start with these 3 valid programs:
$ echo ${x:-VarSub} VarSub
$ echo $(echo CommandSub) # `echo CommandSub` is equivalent CommandSub
$ echo ArithSub $((1 + 2)) ArithSub 3
Now we create syntax errors by removing characters:
$ echo ${x:} /bin/bash: line 1: ${x:}: bad substitution
$ echo $(echo CommandSub >) /bin/bash: command substitution: line 2: syntax error near unexpected token `)' /bin/bash: command substitution: line 2: `echo CommandSub >)'
$ echo ArithSub $((1 + )) /bin/bash: line 1: 1 + : syntax error: operand expected (error token is "+ ")
And then we use the same if false
technique from the last post:
$ if false; then echo ${foo:}; else echo VarSub not parsed; fi && > if false; then echo $(echo hi >); else echo CommandSub not parsed; fi && > if false; then echo $(( 1 + )); else echo ArithSub not parsed; fi VarSub not parsed CommandSub not parsed ArithSub not parsed
Wow, bash doesn't parse any of them! It interleaves parsing and execution. Let's test dash, mksh, and zsh:
$ for sh in dash mksh zsh; do > echo ----- > echo $sh > $sh -c 'if false; then echo ${foo:}; else echo VarSub not parsed; fi' > $sh -c 'if false; then echo $(echo hi >); else echo CommandSub not parsed; fi' > $sh -c 'if false; then echo $(( 1 + )); else echo ArithSub not parsed; fi' > echo > done ----- dash dash: 1: Syntax error: Missing '}' dash: 1: Syntax error: ")" unexpected ArithSub not parsed ----- mksh VarSub not parsed mksh: syntax error: ')' unexpected ArithSub not parsed ----- zsh VarSub not parsed zsh:1: parse error near `)' zsh:1: parse error near `$(echo hi >)' ArithSub not parsed
Interesting!
Let's try OSH:
Line 1 of '<-c arg>' if false; then echo ${foo:}; else echo VarSub not parsed; fi ^ Unknown token in arith context: <UNKNOWN_TOK ";"> Line 1 of '<-c arg>' if false; then echo $(echo hi >); else echo CommandSub not parsed; fi ^ Expected filename after redirect operator Line 1 of '<-c arg>' if false; then echo $(( 1 + )); else echo ArithSub not parsed; fi ^ Expected value or parenthesized expression, got {ASOP_RPAREN ")"}
It catches all 3 errors up front. The error messages need a little polish, but I like that they point to the correct token like the Python parser and Clang. No other shell does this, even if manages to catch the error.
I think this is pretty cool!
Multi-stage parsing, i.e. parsing at execution time, is harder, not easier.
The issue is that you have to "pre-parse" the code anyway to figure out where
the closing }
or )
or ))
is.
This is because the non-recursive lexer needs knowledge from a recursive parser in order to be able to "skip over" text. A shell lexer can't run by itself -- it will start returning the wrong tokens without feedback from the parser.
This is described well in the AOSA book chapter on bash:
Much of the work to recognize the end of the command substitution during the parsing stage is encapsulated into a single function (
parse_comsub
), which knows an uncomfortable amount of shell syntax and duplicates rather more of the token-reading code than is optimal. This function has to know about here documents, shell comments, metacharacters and word boundaries, quoting, and when reserved words are acceptable (so it knows when it's in a case statement); it took a while to get that right.
(This article by bash maintainer Chet Ramey has been very useful. I've read it several times, and I'll continue to refer to it in future posts.)
I also believe that multi-stage parsing is slower. You end up doing more
total work because you have to pre-parse as well as parse. And parsing at
execution time will obviously slow execution down. I haven't seen any
indication that existing shells cache parse trees, so they will end up
repeatedly parsing the text inside ${}
and $()
if they appear inside in a
loop.
I'll have more to say about the performance of the OSH parser once it's ported to C++. But for now, I claim it is algorithmically more efficient.
Tomorrow I will talk about the simple technique that lets OSH parse in one pass: lexical states.
Yesterday I showed that [[
is part of the language, while [
is a builtin.
Thus the arguments to [
are opaque to the shell parser, while [[
expressions can be parsed up front.
This post shows that the ${}
, $()
, and $(( ))
constructs are not
parsed up front, even though they're part of the shell language. The bash
parser skips over them, essentially treating them as builtins.