Why Sponsor Oils? | blog | oilshell.org
In the last post, I mentioned that oil's ASDL implementation can
serialize ASTs into a binary format I'm calling oheap
.
Although it will evolve, I'll continue to mention it, so this post is a general overview. At the end, I'll make note of limitations and future work.
First, I should mention that the 1997 paper on ASDL discusses an
ASN.1-like binary encoding of ASTs. As far as I know, this format is
unused. oheap
can be thought of as a modern incarnation of that idea.
I wanted oheap
to solve a practical problem: integrating the osh
parser in
Python and a shell runtime in C++. The simplest and most efficient way to do
this is by sharing the AST with an in-memory binary format.
The other, perhaps more conventional, way is to use the
Python-C API, which is what Python itself does. But this is
verbose and error-prone: in this post, I mentioned
that 123 lines of Python.asdl
translates to ~8100 lines of C code using
the Python-C API.
In contrast, with oheap
, 129 lines of osh.asdl translates to ~1100
lines of C++ code in osh.asdl.h. As we'll see below, most these lines
define types and do no work at runtime.
The second motivation is further in the future: I want much of oil to be
written in oil. But it doesn't make sense to parse oil for every shell
script it runs. Compiling oil code to oheap
will avoid this.
Thus oheap
can be compared to the .pyc format in Python. However, oil
won't use the file system as a cache, because that scheme causes subtle
problems in production. Also, I don't believe that running a shell script
should litter your system with temp files (although surprisingly, almost all
shells use temp files to implement here docs).
I call the format oheap
because it's conceptually like the C heap — a
block of bytes representing integers, pointers, strings, arrays, and records.
However, it's a relocatable heap, which means you can store it in a file
and load it with a single read()
call.
Understanding the following methods is a shortcut to understanding the format. They're essentially the "runtime library" for oheap:
inline int Obj::Int(int n) const {
return bytes_[n] + (bytes_[n+1] << 8) + (bytes_[n+2] << 16);
}
inline const Obj& Obj::Ref(const uint32_t* base, int n) const {
int offset = Int(n);
return reinterpret_cast<const Obj&>(base[offset]);
}
The Int()
method takes an offset n
from the beginning of the Obj
instance
— i.e. the only bytes_
member — and treats that location as the
beginning of a little-endian three-byte integer.
The Ref()
method takes a pointer base
to the oheap
image, an offset n
to be treated as an Int
, and returns a reference to another Obj
. In other
words, it looks up a pointer field on an Obj
instance.
Here's the analogy to the C heap:
Obj
, which makes Obj&
analogous
to void*
.Ref()
call is analogous to deferencing a pointer. This means that
oheap
structures are lazily decoded. We can place the image anywhere
in memory and immediately begin computation, without a decoding step.The rest of the osh.asdl.h header file, generated from osh.asdl, uses
static_cast<>
to give you a typed API over the heap. It's similar to a
subset of the protobuf API.
Also note that every method in the header is inline
. The runtime library does
little real work: just array indexing, left shifts, and addition.
ASDL has these logical types:
which can be represented with these respective physical storage types in
oheap
's C-like model:
Int
0
or 1
Int
Ref
to a NUL
-terminated character sequenceInt
and Ref
(ints and pointers)Int
and Ref
Int
; then adjacent Int
or Ref
oheap
can use integers and pointers of any width, but in oil
they're three
bytes wide. I figure that any shell script can be represented with an AST of
less than 224 = 16 Mi locations.
In addition to location independence, compression is another benefit of
representing pointers with small integers. For example, it takes oheap
16
bytes to represent a sum type with five fields: one byte for the tag, and
three bytes for each of five fields.
A native representation on a 32-bit machine would take 1 + 5*4
= 21
bytes. On a 64-bit machine, it would take 1 + 5*8
= 41 bytes. In the
latter case, oheap
is over 60% smaller.
Together, ASDL and oheap
have an architecture like
protocol buffers, with these components:
foo.asdl
vs. foo.proto
)oheap
vs. a tree of variable length integers and byte
strings)As mentioned, a key difference is that there's no decoding step. This reminds me of capnproto, which is roughly a successor to protocol buffers (having been developed by the author of "proto2"). capnproto avoids the decoding step by using the in-memory format as the serialization format (with some message-independent compression.)
You could say that oheap
is the opposite: we're using the serialization
format as the in-memory format. The format is designed to be both small and
efficiently decoded on the fly.
Calling Ref()
is undoubtedly faster than parsing shell, so we've achieved
our goal of avoiding parsing. But it remains to be seen how appropriate
oheap
is in other situations.
Right now, oil uses oheap
in an immutable fashion. Everything is packed
tightly together: there's no operation for appending to an array, for example.
I have ideas for implementing mutating operations, but it's also possible that we only need this ML-like model of transformations on immutable trees.
Another important difference between oheap
and protobuf or capnproto
is that there are no integer tags in the binary encoding to identify fields.
This makes the encoding smaller at the expense of it being fragile with
respect to schema changes. Tags can be added, but since we're not saving
oheap
files or sending them across the network, they're not currently needed.
oheap
is a compact, lazily-decoded file format and type-safe API for
pointer-based data structures.
We're using it to represent ASDL trees, but it can also represent graphs — i.e. data structures with cyclic pointers.
It can be compared to .pyc files, protobuf, or capnproto, albeit with significant differences.
Its limitations are that the API is read-only, and the format isn't suitable for interchange. It's possible to fix these issues, so I'll be looking for new, related use cases.
It may even influence the oil language itself, rather than being a just a part of its implementation. Binary formats are not Unix-y, but they're often useful and widely used in practice.
Emacs Dumper Dispute: (comments on Hacker News) — This article was very timely, appearing at exactly the time I was thinking about serializing ASTs.
Serializing the heap is common in Lisp implementations, but I didn't realize
that Emacs uses special glibc
APIs to do this! oheap
appears to solve
the same problem, and does so in a portable fashion.
oheap
may benefit from similar techniques.