Why Sponsor Oils? | blog | oilshell.org
Yesterday I recapped the project priorities, then showed an example ASDL schema and data structure.
Today I'll describe oil's ASDL implementation, then explain the role it will play in each of the three priorities.
I think ASDL is a big win so far, but the implementation is quite short:
ASDL 433 asdl/asdl_.py 271 asdl/encode.py 309 asdl/py_meta.py 313 asdl/format.py 471 asdl/gen_cpp.py 1797 total
I copied asdl_.py
from Python's ASDL implementation. It uses
a simple regex-based lexer and recursive
descent parser to turn the input schema (e.g. osh.asdl)
into tree of Python objects. That is, it produces the AST for the ASDL
schema itself.
The other four files implement four recursive algorithms. Two of them walk the AST:
py_meta.py
dynamically generates classes from the schema using
metaprogramming, e.g. Python's type function. It creates a class for
each product type and sum type (e.g. the token
and word
types in
yesterday's example).gen_cpp.py
generates C++ code that can read a binary encoding of ASTs,
called oheap
. For example, the 129 lines in osh.asdl produces ~1100
lines of C++.The other two algorithms operator on both the schema and the data, e.g. an oil AST for a particular program:
encode.py
walks an instance of an AST in Python and encodes it in the
oheap
format.format.py
is similar, but instead of binary data, it produces the text
format that we saw at the end of yesterday's post.I'll discuss oheap
in a subsequent post. For now, think of it as a binary
version of the text representation.
So it's possible to write a shell interpreter with a front end in Python and a back end in C++ like this:
[Lines of Code] -> Lexer (osh/lex.py) -> [Tokens] -> Four Parsers (osh/*_parse.py) -> [Tree of Python ASDL objects] -> Encoder (asdl/encode.py) -> [oheap encoding of the tree] -> Decoder (generated osh.asdl.h) -> Runtime in C++
(Above, data is surrounded by brackets and code is surrounded by arrows.)
But oil will likely use another stage, as discussed below.
All shells I've looked at (including bash, dash, zsh, mksh) are
tree interpreters. That is, after the parser builds the AST, the executor
just traverses it and makes system calls like fork()
, exec()
, pipe
(), and
dup()
. In contrast, Python is a bytecode interpreter.
If I were just writing a shell, I would write a tree interpreter as well. But oil is an interpreter with two languages: the compatible osh language and the new oil language.
And now that I'm closer to tackling this problem, it looks more complex. I
naively thought that I could just compile both languages to the same AST, and
use that same AST to convert shell source to oil
source.
I think I'll need multiple representations of the code, for two reasons:
oil
is a superset of the shell language. For example, it's
dynamically typed rather than "unityped". It doesn't make sense to use the
same AST for two different languages, even if one is inspired by the other.I'll write more about this when I've actually implemented it, because I'm sure my thoughts will evolve.
For example, I wrote this post before I
implemented ASDL and used it to describe osh
. You can see that I understood
that the three use cases would impose different requirements on "the AST". But
I didn't yet realize that oil needs need multiple tree representations.
Luckily, ASDL is perfect for this, and it's no accident. Compilers written in ML are typically composed of many small stages connected by typed trees (algebraic data types). The types enforce properties of the code at each stage.
Immediately after integrating ASDL, it caught a lot of bugs in oil
, which I
quickly fixed. Surprisingly, dynamic type checking plus a good test suite
is a decent replacement for static type checking, at least for our prototype.
(Update on MyPy: Type checking with MyPy appears to be a lost cause, since ASDL uses extensive metaprogramming. But I wanted ML's type system and not MyPy's Java-like type system.)
I plan to use three AST schemas:
oil.asdl
— A lossless representation of oil source, not yet written.ovm.asdl
— A simpler language/tree that both osh and oil compile to,
which is easily executed by C++.Again, the top three project priorities are:
ASDL will help with each of these tasks:
osh.asdl
will be a lossless representation — that is,
we should be able to reconstruct the shell source from it. So we can also
use it to convert any shell script to oil
.The main blocker is to clean up the source location schema, so we can
preserve whitespace and comments. Right now I'm attaching a line_span
object to each token, but this information should be propagated up to the
node level.
The Python runtime has substantial functionality and can be easily extended. I hope to get help on this, but I need to get the spec tests in better shape before accepting contributions. Let me know in the comments if you're interested in contributing, and I'll increase the priority of this task!
The C++ executor is blocked on designing ovm.asdl
, which is itself blocked
on validating the oil language design. To design ovm
, we need to have a
good grasp on the languages that compile to it.
So this is further off, but I'm pleased to have implemented the oheap
format, which makes progress toward this goal.
We talked about four tree-walking algorithms in oil's ASDL implementation,
and then explained how ASDL helps with the top three priorities. And I
suspect that we will have three ASDL schemas: osh
, oil
, and ovm
.
I have one more post on the ASDL implementation, and then I'll share some
thoughts that came from the experience of implementing ASDL and using it in
oil
.
For one, I'm surprised that ASDL seems to be uncommon. It's used in one of most popular languages in the world, but there are only a couple blog posts about it, and not too many modern successors.
Not to jump forward too much, but I want oil to be a shell for distributed computing, and schema languages are widely used in that domain. So this lesson from oil's implementation may inform the design of the oil language itself.