Why Sponsor Oils? | blog | oilshell.org
In the first post, I mentioned that I used Python to prototype the OSH parser, in order to figure out an algorithm for parsing shell. After the algorithm is nailed down, implementing it in C++ should be easy.
Why is this interesting?
${my_var}
or $(my-command)
, which are
part of the word sublanguage.I'm about to release the OSH parser, which statically parses many real shell programs, so I'm confident I've contributed something to this problem.
I discuss the issue of token lookahead here. There will be more to write about after the parser is released.
Let's define a new term first: the OSH language.
Most people won't be able to tell the difference between OSH and bash, because OSH is designed for compatibility. But it's useful when talking about the parser. From now on, I'll be careful to make the distinction between the new Oil language and the compatible OSH language.
The main differences between bash
and OSH are:
${var[0][1][2]}
. OSH prefers to catch errors early where it won't
adversely affect compatibility.$((foo))
is always an arithmetic substitution, not a command
substitution with a subshell in it. I will talk more about this issue
tomorrow.Repeating the thesis: the OSH language can be parsed with two
tokens of lookahead. In other words, it's expressible with an LL(2)
grammar.
This is good because it means the parser is relatively efficient.
When the parser encounters tokens like for
, if
, while
, $(
, ${
, etc.,
it doesn't have to look any further to know what to parse. If these were the
only constructs, it would be a language expressible with an LL(1)
grammar.
But each of the four sublanguages requires two tokens of lookahead in certain cases:
Functions. When we see a token like my-func
, we must peek ahead to see if
the next token is (
to know whether it's a function definition or a
command.
my-func() { echo hi; } # function definition
my-func x y # command
Coprocesses. In bash, after we see coproc foo
, we must peek ahead to see if
the next word is {
to know whether it's a coprocess definition or
command.
coproc foo bar
coproc foo {
echo hi
}
When we see ${#
, we have to look past the #
token to know if it's a
variable or an operator. See this in-depth
discussion). The same issue occurs with ${!}
vs. ${!var}
.
(The former is equivalent to $!
, the last background process PID.)
echo ${#} # # is a variable
echo ${#foo} # # is a prefix operator
When we see a token like var=
, we have to check if the next token is the (
operator to see if the word is an array assignment, which requires
recursive word parsing, or a word assignment.
var=value
var=(1 2 3)
The arithmetic language uses Pratt Parsing, which I believe is a form of LL(2)
parsing, although I've never heard anyone say that.
The nud
methods (null denotation) are for parsing decisions that can be
made with one token of lookahead, while the led
methods (left
denotation) are for decisions that can be made with two tokens.
This handles mathematical expressions easily, since the infix operator token is second, after the first operand, telling you the type of AST node to create. In Pratt parsing, there's no possibility of looking ahead three tokens.
Like the arithmetic language, the boolean language contains infix operators. So it has to look ahead one word when it sees an operand word.
[[ $var ]] # test if $var is non-empty
[[ $var -eq foo ]] # compare $var with another value
So OSH has parsers for four sublanguages, each of which use two tokens of lookahead.
Arithmetic is handled with Pratt parsing, and the other three languages are handled with recursive descent. These top-down techniques are easily interwoven.
(I thought about using Pratt parsing for the boolean language as well, but it
has two extra lexical states for regular expressions (operator
=~
), which complicates things.)
In contrast, bash uses a generated, bottom-up parser for the command sublanguage, and ad hoc parsers for the other three. It also dynamically parses some sublanguages, as demonstrated in this post.
${}
parser is mixed in with the runtime in subst.c,
weighing in at ~10,700 lines! Arithmetic in expr.c is another
~1500 lines, also including the runtime.In OSH, I've taken care to keep the definition of the language in one place, apart from the runtime, which should make it easier to read and extend.
A longer-term goal is to define the OSH language in a meta-language or grammar rather than in Python, but I believe that is beyond the state of the art in parsing technology. I will have more to say about this.
Credit: Thanks to Daniel Marti for pointing out the coproc
lookahead issue.