What is oheap?

In the last post, I mentioned that oil's ASDL implementation can serialize ASTs into a binary format I'm calling oheap.

Although it will evolve, I'll continue to mention it, so this post is a general overview. At the end, I'll make note of limitations and future work.

Motivation

First, I should mention that the 1997 paper on ASDL discusses an ASN.1-like binary encoding of ASTs. As far as I know, this format is unused. oheap can be thought of as a modern incarnation of that idea.

I wanted oheap to solve a practical problem: integrating the osh parser in Python and a shell runtime in C++. The simplest and most efficient way to do this is by sharing the AST with an in-memory binary format.

The other, perhaps more conventional, way is to use the Python-C API, which is what Python itself does. But this is verbose and error-prone: in this post, I mentioned that 123 lines of Python.asdl translates to ~8100 lines of C code using the Python-C API.

In contrast, with oheap, 129 lines of osh.asdl translates to ~1100 lines of C++ code in osh.asdl.h. As we'll see below, most these lines define types and do no work at runtime.

The second motivation is further in the future: I want much of oil to be written in oil. But it doesn't make sense to parse oil for every shell script it runs. Compiling oil code to oheap will avoid this.

Thus oheap can be compared to the .pyc format in Python. However, oil won't use the file system as a cache, because that scheme causes subtle problems in production. Also, I don't believe that running a shell script should litter your system with temp files (although surprisingly, almost all shells use temp files to implement here docs).

Description

I call the format oheap because it's conceptually like the C heap — a block of bytes representing integers, pointers, strings, arrays, and records. However, it's a relocatable heap, which means you can store it in a file and load it with a single read() call.

Understanding the following methods is a shortcut to understanding the format. They're essentially the "runtime library" for oheap:

inline int Obj::Int(int n) const {
    return bytes_[n] + (bytes_[n+1] << 8) + (bytes_[n+2] << 16);
}

inline const Obj& Obj::Ref(const uint32_t* base, int n) const {
    int offset = Int(n);
    return reinterpret_cast<const Obj&>(base[offset]);
}

The Int() method takes an offset n from the beginning of the Obj instance — i.e. the only bytes_ member — and treats that location as the beginning of a little-endian three-byte integer.

The Ref() method takes a pointer base to the oheap image, an offset n to be treated as an Int, and returns a reference to another Obj. In other words, it looks up a pointer field on an Obj instance.

The rest of the osh.asdl.h header file, generated from osh.asdl, uses static_cast<> to give you a typed API over the heap. It's similar to a subset of the protobuf API.

Also note that every method in the header is inline. The runtime library does little real work: just array indexing, left shifts, and addition.

Encoding

which can be represented with these respective physical storage types in oheap's C-like model:

oheap can use integers and pointers of any width, but in oil they're three bytes wide. I figure that any shell script can be represented with an AST of less than 2²⁴ = 16 Mi locations.

In addition to location independence, compression is another benefit of representing pointers with small integers. For example, it takes oheap 16 bytes to represent a sum type with five fields: one byte for the tag, and three bytes for each of five fields.

A native representation on a 32-bit machine would take 1 + 5*4 = 21 bytes. On a 64-bit machine, it would take 1 + 5*8 = 41 bytes. In the latter case, oheap is over 60% smaller.

Comparison to Other Serialization Frameworks

Together, ASDL and oheap have an architecture like protocol buffers, with these components:

As mentioned, a key difference is that there's no decoding step. This reminds me of capnproto, which is roughly a successor to protocol buffers (having been developed by the author of "proto2"). capnproto avoids the decoding step by using the in-memory format as the serialization format (with some message-independent compression.)

You could say that oheap is the opposite: we're using the serialization format as the in-memory format. The format is designed to be both small and efficiently decoded on the fly.

Calling Ref() is undoubtedly faster than parsing shell, so we've achieved our goal of avoiding parsing. But it remains to be seen how appropriate oheap is in other situations.

Limitations

Right now, oil uses oheap in an immutable fashion. Everything is packed tightly together: there's no operation for appending to an array, for example.

I have ideas for implementing mutating operations, but it's also possible that we only need this ML-like model of transformations on immutable trees.

Another important difference between oheap and protobuf or capnproto is that there are no integer tags in the binary encoding to identify fields. This makes the encoding smaller at the expense of it being fragile with respect to schema changes. Tags can be added, but since we're not saving oheap files or sending them across the network, they're not currently needed.

Summary

oheap is a compact, lazily-decoded file format and type-safe API for pointer-based data structures.

We're using it to represent ASDL trees, but it can also represent graphs — i.e. data structures with cyclic pointers.

Its limitations are that the API is read-only, and the format isn't suitable for interchange. It's possible to fix these issues, so I'll be looking for new, related use cases.

It may even influence the oil language itself, rather than being a just a part of its implementation. Binary formats are not Unix-y, but they're often useful and widely used in practice.