Why Sponsor Oils? | blog | oilshell.org
Writing a parser for the OSH language revealed an unusual feature of the bash language. This post explains the theory, and why it matters in practice.
What does this line of bash code mean?
array[X=b+2*3]=c
It depends on what array
is. Let's wrap it in some code:
if test $# -eq 0; then # any arguments? declare -a array # indexed array else declare -A array # associative array fi b=1 array[X=b+2*3]=c # What does this mean? echo "Index: ${!array[@]}" echo "Value: ${array[@]}" echo "X: ${X:-undefined}"
If I run it with no arguments, I get:
$ bash undecidable-parse.sh Index: 7 Value: c X: 7
But if I run it with an argument, I get:
$ bash undecidable-parse.sh ZZZ Index: X=b+2*3 Value: c X: undefined
In the first case, bash:
help declare
).b+2*3
to 7
. (The $
prefix for variables is optional in an
arithmetic context, so b+2*3
is the same as $b+2*3
.)7
to X
(a side effect)c
to array[7]
.In the second case, it
c
with the the string index 'X=b+2*3'
X
remains undefined.That is, parsing what's inside array[]
depends on the dynamic type of the
variable. We can't tell until runtime which one of these is the parse tree:
; indexed array case
(assign
(L-value
"array"
(assign
(L-value "X")
(add
(var "b")
(multiply 2 3))))
(string "c"))
; associative array case
(assign
(L-value
"array"
(string "X=b+2*3"))
(string "c"))
More formally, we can't make a parse tree without being able to statically
determine properties of arbitrary programs, which means that it's
undecidable. To emphasize this, you can replace the test $# -eq 0
line
with arbitrary code — an interpreter for a Turing machine, or a function
that tests the network latency to Google, etc. The parser can't say anything
about the results of those computations.
OSH has the philosophy of catching errors earlier, so it aims to parse shell programs entirely up front. It can't do this if it's mathematically impossible!
Fortunately, there's a simple solution: always quote string literals used as array keys. The parser will then know that unquoted strings are variables. That is:
array[mystring]=x # BAD: is it a string or a variable?
array['mystring']=x # GOOD: it's a string
This rule technically makes OSH an incompatible variant of bash, but you can turn any valid bash program into a valid OSH program without breaking it. In other words, it doesn't remove any functionality; it just tightens up the syntax by requiring quotes.
Associative arrays were added in bash 4.0, which means that they're one of the
most non-portable parts of bash. Most shells don't implement associative
arrays, and zsh
uses an entirely different syntax.
declare -a a # indexed array
declare -A A # associative array
a['1+2']=x # ERROR: '1+2' doesn't look like a number
A['1+2']=x # GOOD: a literal string key
a[1+2]=x # GOOD: set index 3
A[1+2]=x # OK: set index "3" ("type coercion")
a[3]=x # GOOD: set index 3
A[3]=x # OK: set index "3" ("type coercion")
a['3']=x # OK: set index 3 ("type coercion")
A['3']=x # GOOD: set index "3"
a[v]=x # GOOD: substitute variable (runtime error if
# not integer-like)
A[v]=x # CHANGED: substitute variable
# Use A['v'] if you want a string literal
The only line with changed behavior is the last one. With this change, bash can be parsed up front, with only trivial modifications to existing programs. In practice, I see people following this style already.
The following are acceptable and equivalent to a[v]=x
or A[v]=x
, but the
simpler forms are preferred:
a[$v]=x # strings converted to integers
A[$v]=x
a["$v"]=x # strings converted to integers
A["$v"]=x
There could be an even stricter mode to disallow "type coercions", but that's a bigger change at odds with existing shell semantics.
A major goal of Oil is to treat shell as a real programming language, rather than an OS command shell with sloppy macro processing.
Here's the narrative so far:
I showed that that [[
is the "compile-time"
version of [
, illustrating the difference between
parsing up front vs. parsing during execution. The latter opens up the
possibility of depending on arbitrary execution state, which makes parsing
undecidable. All shells that implement [[
parse it up front.
Even when it's possible to parse up front, all popular shells defer some
parsing to execution time, e.g. what's inside ${}
or $()
.
OSH uses the technique of lexical state to parse up front, in a single pass. I believe this strategy is simpler, faster, and results in a better user experience.
In this post, I showed the only construct in bash
that's impossible to
parse up front. Then I proposed a simple fix: always quote string literals
used as array keys.
Similar issues come up in other programming languages.
Parsing C++ is Literally Undecidable. As far as I understand, the issue is that C++ has two Turing-complete execution contexts: compile-time and runtime, and parsing sits between them. Parsing can depend on arbitrary compile-time computation, making it undecidable.
Parsing Perl Considered Undecidable.
What does this code mean?
whatever / 25 ; # / ; die "this dies!";
Kati internals. Kati is a modern parser for GNU Make, and it finds that parsing and execution are interleaved.
What does this code mean?
$(VAR) X=hoge echo $${X}
I'll let you decide whether these are language design errors. On the one hand, these are obscure corners that don't come up very often.
On the other hand, C++, Perl, Make, and bash all have a reputation for being difficult to learn, in a large part due to their syntax. Languages like Python and Java don't have these types of issues.
On a related note, parsing JavaScript is relatively easy, and gives the expected parse errors. But it does have a lack of runtime errors, preferring nonsensical type coercion to stopping the program. The shell shares this unfortunate property. Future posts will discuss how OSH will improve upon this situation.