At the bottom of this page is a grammar for the language accepted inside ${}. I wrote it to guide the implementation of a recursive descent parser.
I've tested it on over a hundred thousand lines of shell, and it now appears to parse everything correctly. The previous iteration had problems with code like ${@:1:2}.
This grammar describes part of what I call the "word language". Shell is actually composed of four interleaved sublanguages:
1. The command language: for, if, functions, ...
2. The word language: ${}, $(), $(()), ...
3. The arithmetic language: a**2 + b**2
4. The boolean language ([[, which is statically parsed).
There are other mini-languages in shell, like globbing and brace expansion, but I don't consider them full-fledged languages because they're not recursive and they have no syntax errors.
I hope to publish grammars for all 4 languages at some point, but right now I'm doing the minimum to get a correct parser working.
The arithmetic language appears inside the word language in two places: inside subscripts like ${a[x+1]} and inside slices like ${a:x+1:y+2}.
It's a recursive grammar because words can contain other words. Words like this are valid:
    $ echo ${a-${a-${a-unset}}}
    unset
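To make the recursion concrete, here is a minimal Python sketch of a recursive descent parser for just one operator, ${NAME-WORD}. It is not the OSH implementation: the names parse_word, parse_default_sub, and evaluate are hypothetical, and quoting, escapes, and every other operator are ignored.

```python
import re

NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def parse_word(s, pos, in_braces=False):
    """Parse literal text mixed with ${...} substitutions.

    Returns (parts, new_pos). Inside braces, a '}' ends the word.
    """
    parts = []
    literal = ""
    while pos < len(s):
        if in_braces and s[pos] == "}":
            break
        if s.startswith("${", pos):
            if literal:
                parts.append(literal)
                literal = ""
            node, pos = parse_default_sub(s, pos + 2)
            parts.append(node)
        else:
            literal += s[pos]
            pos += 1
    if literal:
        parts.append(literal)
    return parts, pos


def parse_default_sub(s, pos):
    """Parse NAME ('-' WORD)? '}' starting just after '${'."""
    m = NAME_RE.match(s, pos)
    if not m:
        raise SyntaxError("expected a name at position %d" % pos)
    name, pos = m.group(0), m.end()
    default = None
    if s.startswith("-", pos):
        # The operand is itself a word, so the parser recurses here.
        default, pos = parse_word(s, pos + 1, in_braces=True)
    if not s.startswith("}", pos):
        raise SyntaxError("expected '}' at position %d" % pos)
    return ("VarSub", name, default), pos + 1


def evaluate(parts, env):
    """Expand parsed parts; '-' supplies a default only when NAME is unset."""
    out = []
    for part in parts:
        if isinstance(part, str):
            out.append(part)
            continue
        _, name, default = part
        if name in env:
            out.append(env[name])
        elif default is not None:
            out.append(evaluate(default, env))
    return "".join(out)


parts, _ = parse_word("${a-${a-${a-unset}}}", 0)
print(evaluate(parts, {}))  # prints "unset", matching the shell session above
```

The mutual recursion between parse_word and parse_default_sub mirrors the way the grammar's VarExpr rule refers to WORD, which can in turn contain another substitution.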
The # and ! tokens need LL(2) lookahead. For example, the # token could be either a variable like ${#} (length of arguments array), or a prefix operator like ${#var}.
The grammar tries to strike a balance between being faithful to bash and following the philosophy of early errors. Bash accepts some code as syntactically valid, but doesn't interpret it correctly.
For example, bash allows multiple subscripts during parsing, but ignores them during execution:
    $ array=(abc def ghi)
    > echo ${array[0]}
    > echo ${array[0][1]}
    > echo ${array[0][1][2]}
    > echo ${array[0][1][2] : 1 : 2}
    abc
    abc
    abc
    bc
Here is an example where slices are accepted, but ignored:
    $ array=(abc def ghi jkl)
    > echo ${#array[@]}            # length of array, OK
    > echo ${array[@] : 1 : 2}     # slice of array, OK
    > echo ${#array[@] : 1 : 2}    # why is this 4?
    4
    def ghi
    4
The OSH parser disallows both of these constructs, since they don't seem to be implemented correctly.
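Both points above, the lookahead on # and the early errors, can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the OSH parser: it works on the raw text between ${ and }, peeks at characters rather than tokens, leaves arithmetic expressions as unparsed strings, and the names parse_var_sub and ParseError are made up for the example.

```python
import re

NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


class ParseError(Exception):
    """Raised for constructs this sketch rejects at parse time."""


def parse_var_sub(body):
    """Parse the text between '${' and '}' for a small subset of the grammar."""
    pos = 0
    is_length = False

    # Lookahead: a lone '#' is the variable $#, but '#' followed by anything
    # else is treated as the length prefix operator.
    if body.startswith("#") and len(body) > 1:
        is_length = True
        pos = 1

    # VarOf = NAME Subscript? | VarSymbol   (only '#' and '@' handled here)
    m = NAME_RE.match(body, pos)
    if m:
        name, pos = m.group(0), m.end()
    elif body[pos:pos + 1] in ("#", "@"):
        name, pos = body[pos], pos + 1
    else:
        raise ParseError("expected a variable name")

    subscript = None
    if m and body[pos:pos + 1] == "[":
        end = body.find("]", pos)  # naive: does not handle nested brackets
        if end == -1:
            raise ParseError("unterminated subscript")
        subscript = body[pos + 1:end]
        pos = end + 1
        if body[pos:pos + 1] == "[":
            raise ParseError("multiple subscripts are not allowed")

    rest = body[pos:].strip()
    if is_length:
        if rest:
            raise ParseError("operators are not allowed after a length")
        return ("Length", name, subscript)
    if not rest:
        return ("Var", name, subscript)
    if rest.startswith(":") and rest[1:2] not in ("-", "=", "+", "?"):
        # Slice: VarOf ':' ArithExpr (':' ArithExpr)?
        args = tuple(part.strip() for part in rest[1:].split(":"))
        return ("Slice", name, subscript, args)
    raise ParseError("operator not handled by this sketch")
```

With this sketch, parse_var_sub("#") yields a plain variable node and parse_var_sub("#var") a length node, while "array[0][1]" and "#array[@] : 1 : 2" raise ParseError instead of being accepted and then silently misinterpreted.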
    NAME        = [a-zA-Z_][a-zA-Z0-9_]*
    NUMBER      = [0-9]+                    # ${10}, ${11}, ...

    Subscript   = '[' ('@' | '*' | ArithExpr) ']'
    VarSymbol   = '!' | '@' | '#' | ...
    VarOf       = NAME Subscript?
                | NUMBER      # no subscript allowed, none of these are arrays
                | VarSymbol

    TEST_OP     = '-' | ':-' | '=' | ':=' | '+' | ':+' | '?' | ':?'
    STRIP_OP    = '#' | '##' | '%' | '%%'
    CASE_OP     = ',' | ',,' | '^' | '^^'

    UnaryOp     = TEST_OP | STRIP_OP | CASE_OP | ...
    Match       = ('/' | '#' | '%') WORD    # match all / prefix / suffix
    VarExpr     = VarOf
                | VarOf UnaryOp WORD
                | VarOf ':' ArithExpr (':' ArithExpr )?
                | VarOf '/' Match '/' WORD

    LengthExpr  = '#' VarOf    # can't apply operators after length

    RefOrKeys   = '!' VarExpr  # CAN apply operators after a named ref
                               # ${!ref[0]} vs ${!keys[@]} resolved later

    PrefixQuery = '!' NAME ('*' | '@')  # list variable names with a prefix

    VarSub      = LengthExpr
                | RefOrKeys
                | PrefixQuery
                | VarExpr
Again, this isn't the entire word grammar — it's the grammar for variable substitutions inside words.