Parser Architecture

This doc has rough notes on the architecture of the parser. How to Parse Shell Like a Programming Language (on the blog) covers some of the same material and is more polished.

Table of Contents
Lexing
List of Regex-Based Lexers
Sublanguages We Don't Lex or Parse
Lexer Unread
Parsing Issues
Where We Parse More Than Once (statically, and unfortunately)
Where We Read More Than Once (VirtualLineReader)
Extra Passes Over the LST
Lookahead in Recursive Descent Parsers
Where the Arena Invariant is Broken
Where Parsers are Instantiated
Runtime Parsing
Where OSH Dynamically Parses
Where Bash Dynamically Parses (perhaps unintentionally)
Parse Errors at Runtime (Need Line Numbers)

Lexing

List of Regex-Based Lexers

Oils uses regex-based lexers, which are turned into efficient C code with re2c. We intentionally avoid hand-written code that manipulates strings char-by-char, since that strategy is error prone; it's inevitable that rare cases will be mishandled.

The list of lexers can be found by looking at native/fastlex.c.

Sublanguages We Don't Lex or Parse

These constructs aren't recognized by the Oils front end. Instead, they're punted to libc:

Lexer Unread

osh/word_parse.py calls lexer.MaybeUnreadOne() to handle right parens in this case:

(case x in x) ;; esac )

This is sort of like the ungetc() I've seen in other shell lexers.

Parsing Issues

This section is about extra passes ("irregularities") at parse time. In the "Runtime Issues" section below, we discuss cases that involve parsing after variable expansion, etc.

Where We Parse More Than Once (statically, and unfortunately)

This makes it harder to produce good error messages with source location info. It also implications for translation, because we break the "arena invariant".

(1) Array L-values like a[x+1]=foo. bash allows splitting arithmetic expressions across word boundaries: a[x + 1]=foo. But I don't see this used, and it would significantly complicate the OSH parser.

(in _MakeAssignPair in osh/cmd_parse.py)

(2) Backticks. There is an extra level of backslash quoting that may happen compared with $().

(in _ReadCommandSubPart in osh/word_parse.py)

Where We Read More Than Once (VirtualLineReader)

Extra Passes Over the LST

These are handled up front, but not in a single pass.

Lookahead in Recursive Descent Parsers

Where the Arena Invariant is Broken

Where Parsers are Instantiated

Runtime Parsing

Where OSH Dynamically Parses

  1. Alias expansion like alias foo='ls | wc -l'. Aliases are like "lexical macros".
  2. Prompt strings. $PS1 and family first undergo \ substitution, and then the resulting strings are parsed as words, with $ escaped to \$.
  3. Builtins.

Where Bash Dynamically Parses (perhaps unintentionally)

All of the cases above, plus:

(1) Recursive Arithmetic Evaluation:

$ a='1+2'
$ b='a+3'
$ echo $(( b ))
6

This also happens for the operands to [[ x -eq x ]].

Note that a='$(echo 3)' results in a syntax error. I believe this was due to the ShellShock mitigation.

(2) The unset builtin takes an LValue. (not yet implemented in OSH)

$ a=(1 2 3 4)
$ expr='a[1+1]'
$ unset "$expr"
$ argv "${a[@]}"
['1', '2', '4']

(3) printf -v takes an "LValue".

(4) Var refs with ${!x} takes a "cell". (not yet implemented OSH. Relied on by bash-completion, as discovered by Greg Price)

$ a=(1 2 3 4)
$ expr='a[$(echo 2 | tee BAD)]'
$ echo ${!expr}
3
$ cat BAD
2

(5) test -v takes a "cell".

(6) ShellShock (removed from bash): export -f, all variables were checked for a certain pattern.

Parse Errors at Runtime (Need Line Numbers)


Generated on Wed, 06 Dec 2023 01:04:01 -0500