Oil Uses Its Parser For History And Completion

2020-01-17

This is part one of "The Interactive Shell Needs a Principled Parser", which was mentioned in the January blog roadmap.


Last February, I described the interactive features in Oil. I also wrote How to Parse Shell Like a Programming Language to review how the parser works.

Implementing these features taught me that:

This post starts by showing three bugs in the bash ecosystem, which Oil avoids. I discuss another design issue with autocompletion. And the next post will discuss more coupling between the shell parser and the interactive shell.

Table of Contents
Three Bugs
Bash Expands History Without Knowing the Language
Bash Autocompletes Without Knowing the Language
Bash Punts the Problem To A Separate Project
How It Works
zsh does well
Autocompletion Should Understand Two Languages Separately
Operant Conditioning By Software Bugs
Summary
Caveat
Appendix: Maintainer Comments on Bash Parser

Three Bugs

Bash Expands History Without Knowing the Language

Let's start with a bug in history expansion, because it's simple to see. I've reproduced this with the newest version of bash:

bash-5.0$ echo ${x:-a b c}
a b c

bash-5.0$ echo !$       # !$ is supposed to be the "last word"
echo c}                 # It splits the word incorrectly!
c}

Oil uses its own parser, so it isn't fooled by the spaces:

osh$ echo ${x:-a b c}
a b c

osh$ echo !$
! echo ${x:-a b c}
a b c

So the bash history mechanism uses a partial, incorrect parser for its own language. Another part of bash obviously knows how to parse words correctly, but it's not used here.
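
To make the difference concrete, here is a minimal Python sketch. It is neither bash's nor Oil's actual code, and both function names are hypothetical. It contrasts a whitespace-only notion of the "last word" with one that tracks `${...}` nesting. The brace tracking is a toy: it ignores quotes, backslashes, `$(...)`, and the rest of the shell language.

```python
def last_word_naive(line):
    """Treat the line as whitespace-separated words, ignoring shell syntax."""
    return line.split()[-1]


def last_word_brace_aware(line):
    """Split on whitespace only when not inside a ${...} substitution.

    Toy version: quotes, backslashes, and $(...) are not handled.
    """
    words, current, depth = [], "", 0
    i = 0
    while i < len(line):
        if line.startswith("${", i):
            depth += 1
            current += "${"
            i += 2
            continue
        ch = line[i]
        if ch == "}" and depth > 0:
            depth -= 1
        if ch.isspace() and depth == 0:
            # A word boundary only counts outside of ${...}
            if current:
                words.append(current)
            current = ""
        else:
            current += ch
        i += 1
    if current:
        words.append(current)
    return words[-1]


line = "echo ${x:-a b c}"
print(last_word_naive(line))        # c}
print(last_word_brace_aware(line))  # ${x:-a b c}
```

The naive version reproduces the `c}` result from the bash session above. The second version only gets the right answer because it knows a little of the language's syntax, which is the argument for reusing the real parser.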

Bash Autocompletes Without Knowing the Language

Bash will complete variable names and command names if you press TAB:

bash$ echo $HOM<TAB>           # (1) completes $HOME
bash$ ech<TAB>                 # (2) completes echo

However, it doesn't perform these similar completions:

bash$ echo ${undef:-$HOM<TAB>   # (3) does NOT complete $HOME
bash$ if true; then ech<TAB>    # (4) does NOT complete echo
                                # ditto for commands in while loops,
                                # for, case, etc.

This gave me a hint that bash doesn't use its own parser for completion, which I verified by reading the source. Scanning the ~4300-line bashline.c file may give you a sense of how it works — e.g. starting at bash_forward_shellword. I call this style "groveling through backslashes and braces one-by-one".

In short, bash's duplicate, ad hoc parser for completion isn't accurate.

In Oil, examples (3) and (4) behave like examples (1) and (2), and there are no special cases involved. We can use the same parser for execution and autocompletion, because it now emits extra information on incomplete input.

Bash Punts the Problem To A Separate Project

I mentioned this confusion over completion in Dev Log #9, but I've since discovered more gory details.

bash-completion is a collection of autocompletion scripts for bash. Linux distros like Debian use it by default.

Because bash's second ad hoc parser isn't accurate, bash-completion makes a third attempt, but it also does poorly.

It tries to parse bash in bash!

Here's an example on Ubuntu 16.04:

andy@host:~$ bash --norc   # start with a clean state
bash-4.3$ source /usr/share/bash-completion/bash_completion

bash-4.3$ echo $(readlink <TAB>bash: unexpected EOF while looking for matching `)'
bash: syntax error: unexpected end of file

It gives you a syntax error rather than completion candidates. (I've also reproduced this bug on a Debian machine with bash 4.4.)

In contrast, OSH does what you'd expect:

osh$ echo $(readlink <TAB>
baylisa/                      demo/                         logs/

How It Works

Briefly, Oil's parser has two outputs:

  1. The lossless syntax tree when the parse succeeds, and
  2. A "trail" of words and tokens when it fails on incomplete input. Callers can inspect the trail after catching the ParseError exception.
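
The two outputs can be sketched with a toy recursive-descent parser in Python. This is an illustration under stated assumptions, not Oil's implementation: `ToyParser`, `completion_context`, and the `expect` field are hypothetical names, the "language" is only space-separated words plus nested `$( ... )`, and the lexer is deliberately crude.

```python
class ParseError(Exception):
    """Raised when the input is incomplete or malformed."""


class ToyParser:
    def __init__(self, line):
        # Crude lexer: pad "$(" and ")" with spaces, then split on whitespace.
        self.tokens = line.replace("$(", " $( ").replace(")", " ) ").split()
        self.pos = 0
        self.trail = []          # (kind, text) for each plain word parsed so far
        self.expect = "command"  # kind of word the grammar allows next

    def parse_command(self, nested=False):
        self.expect = "command"
        words = []
        while self.pos < len(self.tokens):
            tok = self.tokens[self.pos]
            self.pos += 1
            if tok == ")":
                if not nested:
                    raise ParseError("unexpected )")
                return words
            if tok == "$(":
                words.append(self.parse_command(nested=True))
            else:
                self.trail.append((self.expect, tok))
                words.append(tok)
            self.expect = "arg"
        if nested:
            raise ParseError("unexpected EOF while looking for matching )")
        return words


def completion_context(line):
    """Return (kind, partial_word) describing what TAB should complete."""
    parser = ToyParser(line)
    try:
        parser.parse_command()
    except ParseError:
        pass  # Incomplete input is expected here; the trail is still filled in.
    if parser.trail and line.endswith(parser.trail[-1][1]):
        return parser.trail[-1]   # cursor sits at the end of a partial word
    return (parser.expect, "")    # cursor sits where a new word would start


print(completion_context("echo $(readl"))      # ('command', 'readl')
print(completion_context("echo $(readlink "))  # ('arg', '')
```

The point of the sketch is that the completion code does no parsing of its own. It catches the error from the one parser and reads the state that parser left behind, so nested constructs like `$(` need no special cases in the caller.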

zsh does well

I tested zsh, and it behaves like Oil on all three of these examples. It also has an interesting feature where it prints the parse state in the prompt:

zsh% if
if> true
if> then
then> echo hi
then> fi
hi
zsh%

If you know how zsh or another shell solves these problems, please leave a comment. What confuses me is that zsh does not statically parse ${x:-} or $((1+2)). So I believe it must have a second, more accurate parser for completion?

Autocompletion Should Understand Two Languages Separately

Another thing I learned from implementing autocompletion is that bash's completion API has a fundamental confusion. It treats the command line as a string and does ad-hoc splitting into a COMP_WORDS array.

This means that it conflates two separate problems:

  1. Completing the shell language itself, e.g. $HOM or ${HOM
  2. Completing the "language" of the command being invoked. For example, after grep --color= may come never or always.

Why should these problems be treated separately?

One reason is that you'll have quoting and de-quoting bugs otherwise. From the perspective of grep, arguments like 'file with spaces.txt' and file\ with\ spaces.txt are identical. From the perspective of the line editor (the shell UI), they're different.
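
The quoting point can be checked with Python's standard `shlex` module, which implements POSIX-like word splitting. It only approximates a real shell, and the `grep pattern` command line is made up for illustration.

```python
import shlex

# Two different strings as seen by the line editor ...
quoted = "grep pattern 'file with spaces.txt'"
escaped = r"grep pattern file\ with\ spaces.txt"

# ... but the same argv as seen by the command being invoked.
print(shlex.split(quoted))   # ['grep', 'pattern', 'file with spaces.txt']
print(shlex.split(escaped))  # ['grep', 'pattern', 'file with spaces.txt']

print(quoted == escaped)                            # False
print(shlex.split(quoted) == shlex.split(escaped))  # True
```

A completion system that works on the raw string has to re-derive this equivalence itself, which is where quoting and de-quoting bugs come from.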

A second reason is that conflating the two makes it impossible to create shell-agnostic autocompletion. For example, in Elvish, the syntax for environment variables is $E:USER, but it's $USER in POSIX shell. The syntax of grep, however, is the same when used from either shell.
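
One way to picture the separation is a two-layer design, sketched below in Python. This is a hypothetical illustration, not the bash-completion API or any existing protocol. `GREP_FLAG_VALUES`, `complete_grep_arg`, and `quote_for_posix_shell` are invented names, and the table covers only the `--color=` flag.

```python
import shlex

# Layer 1: the argv "language" of grep. Nothing here knows about any shell.
GREP_FLAG_VALUES = {"--color=": ["always", "auto", "never"]}


def complete_grep_arg(partial):
    """Return full candidate arguments for a partial, already de-quoted arg."""
    for flag, values in GREP_FLAG_VALUES.items():
        if partial.startswith(flag):
            prefix = partial[len(flag):]
            return [flag + v for v in values if v.startswith(prefix)]
    return [flag for flag in GREP_FLAG_VALUES if flag.startswith(partial)]


# Layer 2: shell-specific. Each shell would supply its own quoting function.
def quote_for_posix_shell(candidate):
    return shlex.quote(candidate)


print(complete_grep_arg("--col"))      # ['--color=']
print(complete_grep_arg("--color=a"))  # ['--color=always', '--color=auto']
print(quote_for_posix_shell("file with spaces.txt"))  # 'file with spaces.txt'
```

Only the second layer would change between shells. The first could in principle be written once by the author of the command.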

Although bash itself is confused, the bash-completion project "papers over" this and attempts to separate the shell language from the argv language. It's not perfect, but it succeeds enough that Oil can reuse most of the project's code. I've forked oilshell/bash-completion and run it in my interactive Oil sessions.


Aside: As part of cutting scope in 2020, I'm deferring work on the Shellac Protocol for shell-agnostic completion. The goal of this project is to help upstream authors target something other than bash's flawed API. But I'd still like to see it happen, and there are shell authors interested in it on #shell-autocompletion on oilshell.zulipchat.com.

Operant Conditioning By Software Bugs

I've used bash on Ubuntu for about 15 years now, but I only recently became aware of these obvious bugs.

This is puzzling, but Operant Conditioning by Software Bugs is the most likely explanation. That is, users subconsciously train themselves to avoid bugs.

I can think of another example in the domain of GUIs and Windows. I avoid moving the mouse at certain times when the computer appears "busy". Subconsciously, I fear it may crash.

I was probably trained to do this 20 years ago by Windows 95. I don't know if such crash bugs exist in Ubuntu, but the subtle avoidance is still there.

If you've been using bash for many years, you may be unknowingly avoiding these bugs! To me, hitting TAB seems to carry a "downside risk", which I hope to avoid in Oil. You should be able to hit TAB all the time and bad things won't happen.

Summary

This post showed the benefits of reusing a shell's parser for interactive features like history and autocompletion. I was a stickler about parsing back in 2016, but I didn't realize it would pay off when implementing the interactive shell in 2018.

The next post will discuss how parsing interacts with the interactive prompt and alias expansion — ideally, it shouldn't.

Caveat

There are undoubtedly areas where Oil is less polished than bash. But I'm optimistic that the principled architecture will hold up as those areas are fixed. I've fixed dozens of user-reported bugs in other areas of Oil, and they've largely fallen into the "right" place.

Appendix: Maintainer Comments on Bash Parser

I believe that top-down parsing makes it easier to generate the "trail" for incomplete input mentioned above. This is in contrast to bash's use of bottom-up LR parsing via yacc.

The AOSA chapter on bash seems to support this:

The bash parser is derived from an early version of the Posix grammar, and is, as far as I know, the only Bourne-style shell parser implemented using Yacc or Bison. This has presented its own set of difficulties — the shell grammar isn't really well-suited to yacc-style parsing and requires some complicated lexical analysis and a lot of cooperation between the parser and the lexical analyzer.

...

One thing I've considered multiple times, but never done, is rewriting the bash parser using straight recursive-descent rather than using bison.