Dev Log #9: Progress on Oil Subprojects

2018-12-16

This is the third of four posts that describe what's happened since August:

Dev Log #7: Hollowing Out the Python Interpreter
Dev Log #8: Shell Protocol Designs
Dev Log #9: This post is about core components of the Oil project.
Dev Log #10: Coding Experiments and Research

This post has a dry title, but after writing it, I think it's the most interesting one of the series. It covers many topics, as you can see from the tags I've given it:

#project-updates #oil-release #blog-topics #shell-the-bad-parts #opy #OVM #oheap #interactive-shell

And it's dense with details, since I'm writing for present and future Oil developers. Please leave a comment if you have questions.

Table of Contents

The Interactive Shell

OPy: A Bytecode Compiler

OVM2: A Virtual Machine for a Subset of Python

OHeap2

OSH: A Compatible Shell Language

Oil: A New Shell Language

Zephyr ASDL and Algebraic Data Types

Summary

Appendix: Blog Post Ideas

The Interactive Shell

I rewrote the interactive completion code to use the OSH lexer and parser as of Oil 0.6.pre11, which I released yesterday.

I'm excited by this because I don't believe any other POSIX shell does it. I believe they all work essentially like bash, which has two ad hoc parsers:

The ~6500-line parse.y file which lexes and parses (a portion of) the bash language. There's also significant parsing code in subst.c, which is ~10,700 lines long.
The ~4300-line bashline.c file, which duplicates much of that logic in a heuristic way.

It's filled with complicated code and comments like this:

/* Flags == SD_NOJMP only because we want to skip over command
substitutions in assignment statements.  Have to test whether this
affects `standalone' command substitutions as individual words. */
while (((s = skip_to_delim (rl_line_buffer, os,
                            COMMAND_SEPARATORS, SD_NOJMP|SD_COMPLETE/*|SD_NOSKIPCMD*/))
         <= start) && rl_line_buffer[s]) {
    /* Handle >| token crudely; treat as > not | */
    if (rl_line_buffer[s] == '|' && rl_line_buffer[s-1] == '>')

In contrast, I figured out a clean way to simply use the lexer and parser I already wrote and documented.

This isn't trivial because the "real" parser has to reject invalid input, while the completion parser has to understand invalid input, so it can suggest completions.

Oil's first attempt at this dates back to 2016 and involved partial parse trees, but that wasn't a good strategy.

Instead, I now "prime" the ParseContext object that I thread throughout the mutually recursive parsers. And the parsers leave "trails" of their partial parsing in the ParseContext. I also introduced a dummy token into the lexer right before EOF.

It required very little code. I may write more about this later.
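The idea can be sketched with a toy parser. This is a minimal illustration in Python, not OSH's actual code: the grammar (words and pipes only), the names ParseContext.completing, trail, lex, parse_pipeline and complete, and the rule for when the dummy token appears are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field

@dataclass
class ParseContext:
    """Threaded through the parsers, which leave a trail of their progress here."""
    completing: bool = False
    trail: list = field(default_factory=list)

def lex(line, ctx):
    """Split a line into (kind, text) tokens, ending with EOF."""
    tokens = [('Pipe' if m.group() == '|' else 'Word', m.group())
              for m in re.finditer(r'\||[^\s|]+', line)]
    if ctx.completing and (not line or line[-1].isspace() or line[-1] == '|'):
        # The cursor isn't touching a word, so a dummy token stands in for
        # the empty word the user is about to type.
        tokens.append(('Dummy', ''))
    tokens.append(('EOF', ''))
    return tokens

def parse_pipeline(tokens, ctx):
    """Recursive descent for: pipeline = command ('|' command)*"""
    pos = 0
    commands = []
    while True:
        words = []
        ctx.trail = words  # trail: the words of the command being parsed
        while tokens[pos][0] in ('Word', 'Dummy'):
            words.append(tokens[pos][1])
            pos += 1
        if not words:
            raise SyntaxError('expected a command')
        commands.append(words)
        if tokens[pos][0] != 'Pipe':
            return commands
        pos += 1

def complete(line):
    """Return what kind of thing is being completed, and its prefix."""
    ctx = ParseContext(completing=True)
    try:
        parse_pipeline(lex(line, ctx), ctx)
    except SyntaxError:
        pass  # the trail still describes what was parsed before the error
    *before, prefix = ctx.trail or ['']
    return ('argument' if before else 'command'), prefix

print(complete('git ch'))   # ('argument', 'ch')
print(complete('ls | gr'))  # ('command', 'gr')
print(complete('ls | '))    # ('command', '')
```

With a plain ParseContext(), the same parser raises SyntaxError on 'ls |'. In completion mode the dummy token gives it an empty word to parse, so the trail shows that a command name is expected next.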

Please download and try OSH 0.6.pre11, and send me feedback:

https://www.oilshell.org/download/oil-0.6.pre11.tar.xz (gzip version)

Caveat: OSH is still a basic interactive shell. However, it already handles completion better than bash in many cases, and I believe this use of the OSH parser is the foundation for it to be superior in all cases.

OPy: A Bytecode Compiler

I fixed the first bug related to name analysis in the compiler!

As part of hollowing out the Python interpreter, I analyzed Oil's bytecode, and I found an anomaly in the code produced from generator expressions like:

print(''.join(t.val for t in tokens))

The compiler produced a closure rather than a plain function, but this is wrong. The inner function never refers to the name tokens from the parent scope. Rather, an iterator over tokens is passed to it.

I fixed this bug, and now Oil uses 73 unique Python bytecodes, as shown near the bottom of the oil-with-opy metric. It was 88 upon first measurement.
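For comparison, the expected behavior can be checked against CPython 3 using its own introspection attributes. This is a small sketch that says nothing about OPy's internals. The helper genexpr_code and the two example functions are made up for the illustration.

```python
import types

def join_vals(tokens):
    # Same shape as the example above: only the iterable mentions `tokens`.
    return ''.join(t.val for t in tokens)

def join_with(tokens, sep):
    # Here the element expression uses `sep`, so a closure really is needed.
    return ''.join(t.val + sep for t in tokens)

def genexpr_code(func):
    """Return the code object compiled for the generator expression in func."""
    return next(c for c in func.__code__.co_consts
                if isinstance(c, types.CodeType))

plain = genexpr_code(join_vals)
closure = genexpr_code(join_with)

print(plain.co_varnames)    # the iterator arrives as the hidden argument '.0'
print(plain.co_freevars)    # empty: nothing is captured from the parent scope
print(closure.co_freevars)  # 'sep' is captured, so this one is a closure
```

The outermost iterable of a generator expression is evaluated in the enclosing scope and handed to the inner function as an argument, which is why the first case needs no free variables.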

More background:

This bug dates back to the original compiler2 code. Generator expressions were added relatively late in Python 2's life, and this was more about efficiency than correctness.
In March, I wrote that I wildly refactored the OPy bytecode compiler. I wrote many new tests, which are published with every release.
Yesterday, this comment reminded me of my challenge back in April:
- Challenge: Can I read your compiler in 100 lines? At least one person told me that they changed the structure of their compiler as a result of this thread. (I've been getting good feedback on lexer modes as well.)

OVM2: A Virtual Machine for a Subset of Python

I started OVM2, a new Python VM for Oil, and it can now run a toy Fibonacci program! It's a small start, but I'm excited by it.

I've reached the limit of what I can do with CPython, hence the new VM.

OHeap2

There's a new data format called OHeap2 at the center of OVM2. Recall that the original OHeap was a read-only, compact encoding of the lossless syntax tree. It was too limited, but its ideas influenced the work I'm doing now.

OHeap2 serves the same purpose as a Smalltalk image or V8 snapshot. You can also think of it as a replacement for these three Python modules:

marshal - A file format used to represent Python bytecode. It can only represent a fixed set of tree-shaped data structures.
pickle - A file format that can represent graph-shaped Python data structures, including user-defined types.
zipimport - The module we use to implement the Oil app bundle, which is essentially a .zip file. We don't need a separate file format for this simple task.
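The difference between the first two formats can be seen with the standard library alone. The following is a small sketch using CPython 3's marshal and pickle. The Token class and the sample data are invented for the illustration and have nothing to do with OHeap2's encoding.

```python
import marshal
import pickle

class Token:
    """A stand-in for a user-defined type."""
    def __init__(self, val):
        self.val = val

# marshal round-trips the built-in types that bytecode needs ...
code_like = ('print', [1, 2.5, 'three'], {'names': ('a', 'b')})
assert marshal.loads(marshal.dumps(code_like)) == code_like

# ... but it refuses instances of user-defined classes.
try:
    marshal.dumps(Token('ls'))
    marshal_ok = True
except ValueError:
    marshal_ok = False

# pickle preserves graph shape: shared references and cycles survive.
shared = ['shared']
graph = {'left': shared, 'right': shared}
graph['self'] = graph
copy = pickle.loads(pickle.dumps(graph))

print(marshal_ok)                      # False
print(copy['left'] is copy['right'])   # True
print(copy['self'] is copy)            # True
```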

OVM2 should start very quickly due to its use of OHeap2. That is, I'm consciously avoiding the problem of VMs that start slowly, which I discussed in the last post. There, I proposed a coprocess protocol to solve this problem in general, but Oil itself won't need it.

Also note that OVM2 is not just for implementing Oil. It will also be exposed to users as part of the Oil language. In other words, Oil's bootstrapping process is unusual. I'll write more about this later.

OSH: A Compatible Shell Language

The next step for the OSH language is also motivated by interactivity.

Running bash completion scripts with OSH uncovered a common pattern: a dynamic sublanguage for referencing both string variables and array elements. It's found in three places:

${!expr}: "indirect references" treat data as a variable name to be evaluated.
printf -v: A misfeature that performs an assignment! The -v flag takes a variable name.
unset: Makes a variable nil or unset.

Wherever a variable name is valid, an expression like a[1+1] is also valid.

Examples:

$ a=(4 9 16)        # An array with 3 elements
$ expr='a[1+1]'     # Data that is parsed dynamically as code by
                    # the three constructs below

$ echo ${!expr}     # An indirect reference
16

# A very odd way to assign a variable, which bash-completion
# actually uses.
$ printf -v "$expr" "%s" foo
$ echo "${a[@]}"
4 9 foo

$ unset "$expr"     # Another builtin that dynamically parses
$ echo "${a[@]}"
4 9

(Thanks to contributor Greg Price for the discussions on these issues, and for digging through some ugly shell code!)

In addition, all shells have both static and dynamic variants of assignment. This relates to the issue I pointed out in the FAQ on POSIX. That is, POSIX doesn't say anything about this, but all shells implement it.

$ val='foo bar'   # a variable with spaces
$ argv s=$val     # Normal word splitting occurs
['s=foo', 'bar']

# No word splitting here, so it must be statically parsed.
# declare is different than an ordinary command like argv!
$ declare s=$val
$ argv "$s"
['foo bar']

# The same expression, but it must be dynamically parsed.
$ expr='s=foo bar'
$ declare "$expr"
$ argv "$s"
['foo bar']

In summary, in order to run the bash-completion project, I may have to introduce a new dynamically parsed "cell sublanguage".
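To make the shape of such a sublanguage concrete, here is a rough sketch in Python. It is not OSH's design. The function names, the restriction of index expressions to integers with + - *, and the modeling of bash's sparse arrays as dicts keyed by index are all assumptions for the example.

```python
import ast
import operator
import re

CELL_RE = re.compile(r'([A-Za-z_]\w*)(?:\[(.+)\])?\Z')
OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

def eval_arith(node):
    """Evaluate a tiny arithmetic subset: integers combined with + - *."""
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](eval_arith(node.left), eval_arith(node.right))
    raise ValueError('unsupported index expression')

def parse_cell(ref):
    """Parse 'name' or 'name[expr]' into (name, index); index is None for 'name'."""
    m = CELL_RE.match(ref)
    if not m:
        raise ValueError('not a cell reference: %r' % ref)
    name, index_src = m.groups()
    if index_src is None:
        return name, None
    return name, eval_arith(ast.parse(index_src, mode='eval').body)

def get_cell(mem, ref):
    """Roughly what ${!ref} does: read the cell that the string names."""
    name, index = parse_cell(ref)
    return mem[name] if index is None else mem[name][index]

def set_cell(mem, ref, value):
    """Roughly what printf -v "$ref" does: assign to the named cell."""
    name, index = parse_cell(ref)
    if index is None:
        mem[name] = value
    else:
        mem[name][index] = value

def unset_cell(mem, ref):
    """Roughly what unset "$ref" does: remove the named cell."""
    name, index = parse_cell(ref)
    if index is None:
        del mem[name]
    else:
        del mem[name][index]

mem = {'a': {0: '4', 1: '9', 2: '16'}}   # a=(4 9 16), indices start at 0
print(get_cell(mem, 'a[1+1]'))           # 16
set_cell(mem, 'a[1+1]', 'foo')
print(list(mem['a'].values()))           # ['4', '9', 'foo']
unset_cell(mem, 'a[1+1]')
print(list(mem['a'].values()))           # ['4', '9']
```

The point of the sketch is that all three operations share one small parser, which is what would justify treating the cell reference as its own sublanguage.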

I've drafted a post about all the sublanguages I've found in shell. I call the four main ones Command, Word, Arith, and Bool.

Oil: A New Shell Language

The Oil language isn't very far along yet, but I've now started it! Here's what I've done:

Made a first pass at the Oil lexer.
Created a skeleton for the parser, and decided that Oil still needs the concept of "words". That is, it has mutually recursive command and word parsers like OSH, rather than a single parser like Python.
Reorganized the repo to accommodate Oil.
- Oil and OSH won't share as much code as I originally thought. Although I've avoided the worst parts of bash, OSH necessarily has warts in the name of compatibility. The main "leverage" will be through sharing OPy and OVM2, which I discussed above.
Added namespaces to variants ("constructors") in ASDL. This is like Rust, and unlike OCaml. This makes it easier for OSH and Oil to co-exist in the same codebase.
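As a rough picture of why namespaced variants help, here is a Python sketch that is not the code ASDL generates. The type names osh_command and oil_command, their fields, and the describe function are hypothetical.

```python
from dataclasses import dataclass

class osh_command:
    """Namespace for the variants of a hypothetical OSH 'command' sum type."""
    @dataclass
    class Simple:
        words: list

    @dataclass
    class Pipeline:
        children: list

class oil_command:
    """A second sum type can reuse the variant name Simple without clashing."""
    @dataclass
    class Simple:
        name: str
        args: list

def describe(node):
    # Dispatch on the variant's class, as a stand-in for pattern matching.
    if isinstance(node, osh_command.Simple):
        return 'osh simple: ' + ' '.join(node.words)
    if isinstance(node, oil_command.Simple):
        return 'oil simple: %s/%d' % (node.name, len(node.args))
    if isinstance(node, osh_command.Pipeline):
        return ' | '.join(describe(c) for c in node.children)
    raise TypeError(node)

pipeline = osh_command.Pipeline([osh_command.Simple(['ls']),
                                 osh_command.Simple(['wc', '-l'])])
print(describe(pipeline))
print(describe(oil_command.Simple('echo', ['hi'])))
```

Without the namespace, both languages would have to agree on a single global Simple, or one of them would need an artificial prefix on every variant.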

I plan to use the same strategy for Oil's front end as I did for OSH:

Lexing with regular expressions via re2c.
- Oil will use lexer modes, but there will be fewer of them, because it has fewer sublanguages than OSH.
Static parsing in a single pass. Several warts from OSH will be removed, including the dynamic parsing described above.
- With one or two tokens of lookahead
- Using Recursive Descent for commands / statements.
- Using Pratt Parsing for expressions.
Representation of the code with Zephyr ASDL and the Lossless Syntax Tree.

This style enables easy language composition.
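The expression part of this strategy can be sketched in a few dozen lines. The example below is a generic Pratt parser with a regex-based lexer and one token of lookahead, written in Python with the re module standing in for re2c. The operator set, the binding powers, and the tuple-based tree are assumptions for the illustration and do not reflect Oil's grammar.

```python
import re

TOKEN_RE = re.compile(r'\s*(\d+|\*\*|[-+*/()])')

# Infix operators map to (left, right) binding powers.
# left < right gives left associativity; left > right gives right associativity.
INFIX = {'+': (10, 11), '-': (10, 11), '*': (20, 21), '/': (20, 21),
         '**': (31, 30)}
UNARY_BP = 25  # binds tighter than * but looser than **

def tokenize(src):
    """Regex-based lexer: returns a list of token strings ending with 'EOF'."""
    tokens, pos = [], 0
    src = src.rstrip()
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if not m:
            raise SyntaxError('bad character at %d' % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens + ['EOF']

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """One token of lookahead."""
        return self.tokens[self.pos]

    def next(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def parse_expr(self, min_bp=0):
        # Prefix position: a number, a parenthesized expression, or unary minus.
        tok = self.next()
        if tok.isdigit():
            left = int(tok)
        elif tok == '(':
            left = self.parse_expr(0)
            if self.next() != ')':
                raise SyntaxError("expected ')'")
        elif tok == '-':
            left = ('neg', self.parse_expr(UNARY_BP))
        else:
            raise SyntaxError('unexpected token %r' % tok)
        # Infix position: keep going while the operator binds tightly enough.
        while self.peek() in INFIX:
            left_bp, right_bp = INFIX[self.peek()]
            if left_bp < min_bp:
                break
            op = self.next()
            left = (op, left, self.parse_expr(right_bp))
        return left

def parse(src):
    p = Parser(tokenize(src))
    tree = p.parse_expr()
    if p.peek() != 'EOF':
        raise SyntaxError('unexpected token %r' % p.peek())
    return tree

print(parse('1 + 2 * 3'))    # ('+', 1, ('*', 2, 3))
print(parse('2 ** 3 ** 2'))  # ('**', 2, ('**', 3, 2))
print(parse('-2 * 3'))       # ('*', ('neg', 2), 3)
```

A recursive descent parser for statements can call parse_expr wherever the grammar expects an expression, which is one way the two techniques compose.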

Zephyr ASDL and Algebraic Data Types

I'm happy with ASDL, although it's changing based on Oil's use cases. This is good, because I may "hoist" it to the user level, akin to OVM2.

Here are threads which will inform future changes:

As background, I seem to touch ASDL about once a year:

Success with ASDL (December 2016): The initial use of ASDL in Oil.
In the release notes for OSH 0.3 (December 2017), I wrote that I started using textual code generation rather than Python metaprogramming for ASDL types.

Summary

I described recent progress, including:

Rewriting the parsing logic for interactive completion, along with the release of OSH 0.6.pre11.
Fixes in the OPy bytecode compiler.
A new VM that can run a toy program.
"Straggler" features for the OSH language.
The beginnings of the Oil language.

I'm glad that I managed to make progress on everything I mentioned in August.

On the other hand, this post shows me how big the Oil project is, and I'm searching for ways to cut its scope. As a result, I've drafted a post that prioritizes the many subprojects.

Before I publish it, I need to finish writing Dev Log #10: Coding Experiments and Research.

Appendix: Blog Post Ideas

Writing this post gave me a few more ideas:

Data Structures and Algorithms for Shell Autocompletion. As mentioned above, OSH does it differently than other shells.
- I have examples of completions that OSH handles but bash doesn't.
A List of Shell Sublanguages. I've already drafted this. It will be useful to underscore the simplicity of the Oil language vs. the OSH language.
Lexing and Parsing Techniques in OSH and Oil. I believe that the collection of techniques above forms a useful pattern.
re2c is a Useful, High-Quality Project. Akin to my post on CommonMark (which happens to use re2c!)
Visualization of bash's source code vs. OSH. I was surprised by the massive line counts above. A visualization is probably the best way to drive the point home.

If you have a question about any of these things, leave a comment, and the answer may help me write the post!