
Dev Log #9: Progress on Oil Subprojects

2018-12-16

This is the third of four posts that describe what's happened since August:

  1. Dev Log #7: Hollowing Out the Python Interpreter
  2. Dev Log #8: Shell Protocol Designs
  3. Dev Log #9: This post is about core components of the Oil project.
  4. Dev Log #10: Coding Experiments and Research

This post has a dry title, but after writing it, I think it's the most interesting one of the series. It covers many topics, as you can see from the tags I've given it:

#project-updates #oil-release #blog-topics #shell-the-bad-parts #opy #OVM #oheap #interactive-shell

And it's dense with details, since I'm writing for present and future Oil developers. Please leave a comment if you have questions.

Table of Contents
The Interactive Shell
OPy: A Bytecode Compiler
OVM2: A Virtual Machine for a Subset of Python
OHeap2
OSH: A Compatible Shell Language
Oil: A New Shell Language
Zephyr ASDL and Algebraic Data Types
Summary
Appendix: Blog Post Ideas

The Interactive Shell

I rewrote the interactive completion code to use the OSH lexer and parser as of Oil 0.6.pre11, which I released yesterday.

I'm excited by this because I don't believe any other POSIX shell does it. I believe they all work essentially like bash, which has two ad hoc parsers:

It's filled with complicated code and comments like this:

/* Flags == SD_NOJMP only because we want to skip over command
substitutions in assignment statements.  Have to test whether this
affects `standalone' command substitutions as individual words. */
while (((s = skip_to_delim (rl_line_buffer, os,
                            COMMAND_SEPARATORS, SD_NOJMP|SD_COMPLETE/*|SD_NOSKIPCMD*/))
         <= start) && rl_line_buffer[s]) {
    /* Handle >| token crudely; treat as > not | */
    if (rl_line_buffer[s] == '|' && rl_line_buffer[s-1] == '>')

In contrast, I figured out a clean way to simply use the lexer and parser I already wrote and documented.

This isn't trivial because the "real" parser has to reject invalid input, while the completion parser has to understand invalid input, so it can suggest completions.

Oil had an attempt to do this dating back to 2016, involving partial parse trees, but that wasn't a good strategy.

Instead, I now "prime" the ParseContext object that I thread throughout the mutually recursive parsers. And the parsers leave "trails" of their partial parsing in the ParseContext. I also introduced a dummy token into the lexer right before EOF.
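To make the idea concrete, here is a minimal sketch of a "primed" context, a parser trail, and a dummy token, using a toy grammar of whitespace-separated words. It is not Oil's code: `ParseContext` here is a simplified stand-in, and `lex`, `parse_command`, `complete`, and the token kinds are hypothetical names invented for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ParseContext:
    """Hypothetical stand-in for the object threaded through the parsers."""
    primed: bool = False                        # True only when completing
    trail: list = field(default_factory=list)   # (position, text) per word


def lex(line, ctx):
    """Split into WORD and SPACE tokens; add a DUMMY right before EOF."""
    tokens = [("SPACE" if s.isspace() else "WORD", s)
              for s in re.findall(r"\s+|\S+", line)]
    if ctx.primed:
        tokens.append(("DUMMY", ""))   # zero-width token, completion only
    tokens.append(("EOF", ""))
    return tokens


def parse_command(tokens, ctx):
    """Parse `name arg*`, leaving a trail of words in ctx when primed."""
    words = []
    i = 0
    while tokens[i][0] != "EOF":
        kind, text = tokens[i]
        i += 1
        if kind == "SPACE":
            continue
        # A DUMMY directly after a WORD is part of that word. A DUMMY on
        # its own (after a space or at the start) is a new, empty word.
        if kind == "WORD" and tokens[i][0] == "DUMMY":
            i += 1
        words.append(text)
        if ctx.primed:
            ctx.trail.append((len(words) - 1, text))
    return words


def complete(line, commands, files):
    """Use the same lexer and parser, then inspect the trail."""
    ctx = ParseContext(primed=True)
    parse_command(lex(line, ctx), ctx)
    position, prefix = ctx.trail[-1]   # the word under the cursor
    candidates = commands if position == 0 else files
    return [c for c in candidates if c.startswith(prefix)]


print(complete("ec", ["echo", "exit"], ["a.txt", "b.txt"]))      # ['echo']
print(complete("echo ", ["echo", "exit"], ["a.txt", "b.txt"]))   # ['a.txt', 'b.txt']
```

Because of the dummy token, there is always a word under the cursor, even when the line is empty or ends in whitespace, so the completion logic only has to look at the last entry in the trail.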

It required very little code. I may write more about this later.

Please download and try OSH 0.6.pre11, and send me feedback:

Caveat: OSH is still a basic interactive shell. However, it already handles completion better than bash in many cases, and I believe this use of the OSH parser is the foundation for it to be superior in all cases.

OPy: A Bytecode Compiler

I fixed the first bug related to name analysis in the compiler!

As part of hollowing out the Python interpreter, I analyzed Oil's bytecode, and I found an anomaly in the code produced from generator expressions like:

print(''.join(t.val for t in tokens))

The compiler produced a closure rather than a plain function, but this is wrong. The name tokens from the parent scope isn't used. Rather, an iterator over tokens is used.
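The expected behavior can be observed with CPython's own compiler. This sketch shows CPython 3, not OPy. The `Token` type and the two function names are hypothetical stand-ins. The outermost iterable of a generator expression is evaluated in the enclosing scope, so the generator's code object needs no free variables unless its body mentions the name again.

```python
from collections import namedtuple

Token = namedtuple("Token", "val")   # hypothetical stand-in for a token type


def plain_function_case():
    tokens = [Token("a"), Token("b")]
    gen = (t.val for t in tokens)   # iter(tokens) is evaluated right here
    tokens = []                     # rebinding the name doesn't affect gen
    return gen


def closure_case():
    tokens = [Token("a"), Token("b")]
    # The condition mentions `tokens` again, so a closure is required.
    return (t.val for t in tokens if len(tokens) > 1)


gen = plain_function_case()
print(gen.gi_code.co_freevars)              # ()
print("".join(gen))                         # ab
print(closure_case().gi_code.co_freevars)   # ('tokens',)
```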

I fixed this bug, and now Oil uses 73 unique Python bytecodes, as shown near the bottom of the oil-with-opy metric. It was 88 upon first measurement.

More background:

OVM2: A Virtual Machine for a Subset of Python

I started OVM2, a new Python VM for Oil, and it can now run a toy Fibonacci program! It's a small start, but I'm excited by it.

I've reached the limit of what I can do with CPython, hence the new VM.

OHeap2

There's a new data format called OHeap2 at the center of OVM2. Recall that the original OHeap was a read-only, compact encoding of the lossless syntax tree. It was too limited, but its ideas influenced the work I'm doing now.

OHeap2 serves the same purpose as a Smalltalk image or V8 snapshot. You can also think of it as a replacement for these three Python modules:

OVM2 should start very quickly due to its use of OHeap2. That is, I'm consciously avoiding the problem of VMs that start slowly, which I discussed in the last post. There, I proposed a coprocess protocol to solve this problem in general, but Oil itself won't need it.

Also note that OVM2 is not just for implementing Oil. It will also be exposed to users as part of the Oil language. In other words, Oil's bootstrapping process is unusual. I'll write more about this later.

OSH: A Compatible Shell Language

The next step for the OSH language is also motivated by interactivity.

Running bash completion scripts with OSH uncovered a common pattern: a dynamic sublanguage for referencing both string variables and array elements. It's found in three places:

Wherever a variable name is valid, an expression like a[1+1] is also valid.

Examples:

$ a=(4 9 16)        # An array with 3 elements
$ expr='a[1+1]'     # Data that is parsed dynamically as code by
                    # the three constructs below

$ echo ${!expr}     # An indirect reference
16

# A very odd way to assign a variable, which bash-completion
# actually uses.
$ printf -v "$expr" "%s" foo
$ echo "${a[@]}"
4 9 foo

$ unset "$expr"     # Another builtin that dynamically parses
$ echo "${a[@]}"
4 9

(Thanks to contributor Greg Price for the discussions on these issues, and for digging through some ugly shell code!)
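To illustrate what "parsed dynamically" means, here is a rough Python model of such a sublanguage. It is only a sketch under simplifying assumptions: the grammar accepts just NAME or NAME[ARITH], the arithmetic is limited to integer literals with + - *, and arrays are plain Python lists rather than bash's sparse arrays. All function names are hypothetical and are not part of OSH.

```python
import ast
import operator
import re

# Assumed toy grammar: NAME or NAME[ARITH].
_CELL_RE = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)(?:\[(.+)\])?")
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def _eval_arith(node):
    """Evaluate a tiny arithmetic AST without calling eval()."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_arith(node.left),
                                   _eval_arith(node.right))
    raise ValueError("unsupported arithmetic")


def parse_cell(expr):
    """Parse a string at run time into (name, index); index may be None."""
    m = _CELL_RE.fullmatch(expr)
    if m is None:
        raise ValueError("not a cell reference: %r" % expr)
    name, index_src = m.groups()
    if index_src is None:
        return name, None
    return name, _eval_arith(ast.parse(index_src, mode="eval").body)


def get_cell(variables, expr):              # models ${!expr}
    name, index = parse_cell(expr)
    value = variables[name]
    return value if index is None else value[index]


def set_cell(variables, expr, new_value):   # models printf -v "$expr"
    name, index = parse_cell(expr)
    if index is None:
        variables[name] = new_value
    else:
        variables[name][index] = new_value


def unset_cell(variables, expr):            # models unset "$expr"
    name, index = parse_cell(expr)
    if index is None:
        del variables[name]
    else:
        del variables[name][index]          # a list shifts; bash would not


variables = {"words": ["git", "commit", "-m"]}
expr = "words[1+1]"                  # data that is parsed as code
print(get_cell(variables, expr))     # -m
set_cell(variables, expr, "--amend")
print(variables["words"])            # ['git', 'commit', '--amend']
unset_cell(variables, expr)
print(variables["words"])            # ['git', 'commit']
```

The point of the sketch is that all three operations share one small parser, which is why a single sublanguage could serve every construct that accepts such a reference.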

In addition, all shells have both static and dynamic variants of assignment. This relates to the issue I pointed out in the FAQ on POSIX. That is, POSIX doesn't say anything about this, but all shells implement it.

$ val='foo bar'   # a variable with spaces
$ argv s=$val     # Normal word splitting occurs
['s=foo', 'bar']

# No word splitting here, so it must be statically parsed.
# declare is different than echo!
$ declare s=$val
$ argv "$s"
['foo bar']

# The same expression, but it must be dynamically parsed.
$ expr='s=foo bar'
$ declare "$expr"
$ argv "$s"
['foo bar']
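The three cases above can be modeled roughly in Python. This is a sketch with hypothetical function names, and it assumes the default whitespace-only word splitting. It is not how any shell implements assignment. It only shows where the parsing decision is made in each case.

```python
import re

_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def ordinary_words(literal, value):
    """Model `argv s=$val`: the expanded word is split on whitespace."""
    return (literal + value).split()


def static_assignment(env, name, value):
    """Model `declare s=$val`: the parser saw NAME=RHS, so no splitting."""
    env[name] = value


def dynamic_assignment(env, arg):
    """Model `declare "$expr"`: the argument string is parsed at run time."""
    name, sep, value = arg.partition("=")
    if not sep or not _NAME_RE.fullmatch(name):
        raise ValueError("not an assignment: %r" % arg)
    env[name] = value


val = "foo bar"
print(ordinary_words("s=", val))     # ['s=foo', 'bar']

env = {}
static_assignment(env, "s", val)
print(env)                           # {'s': 'foo bar'}

env = {}
dynamic_assignment(env, "s=foo bar")
print(env)                           # {'s': 'foo bar'}
```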

In summary, in order to run the bash-completion project, I may have to introduce a new dynamically parsed "cell sublanguage".

I drafted a post about all the sublanguages I've found in shell. I call the four main ones Command, Word, Arith, and Bool.

Oil: A New Shell Language

The Oil language isn't very far along yet, but I've now started it! Here's what I've done:

I plan to use the same strategy for Oil's front end as I did for OSH:

This style enables easy language composition.

Zephyr ASDL and Algebraic Data Types

I'm happy with ASDL, although it's changing based on Oil's use cases. This is good, because I may "hoist" it to the user level, akin to OVM2.

Here are threads which will inform future changes:

As background, I seem to touch ASDL about once a year:

Summary

I described recent progress, including:

I'm glad that I managed to make progress on everything I mentioned in August.

On the other hand, this post shows me how big the Oil project is, and I'm searching for ways to cut its scope. As a result, I've drafted a post that prioritizes the many subprojects.

Before I publish it, I need to finish writing Dev Log #10: Coding Experiments and Research.

Appendix: Blog Post Ideas

Writing this post gave me a few more ideas:

If you have a question about any of these things, leave a comment, and the answer may help me write the post!