Why Sponsor Oils? | blog | oilshell.org
This post shows C++ snippets from our garbage-collected data structures: Str, List<T>, and Dict<K, V>. I hope it will catch the interest of programmers with experience in compilers and language runtimes.
Recent posts in the series:
(If you're new to the project, see Why Create a New Unix Shell?.)
I've never worked on a garbage collector before, but I think this code is fun. It's simple, small, and does something useful. One or two people should be able to handle it — it's not a huge industrial project.
I'm sure there are many ways to do better, and I look forward to working on it with experienced engineers! Most people who work on language runtimes don't read this blog, so please share this post. This is a critical part of the project.
If you want to jump straight to the code, see mycpp/gc_heap.h. I started this post by updating the comments in that file.
Like all Unix shells, Oil is an AST interpreter. I once described it as "the first half of Crafting Interpreters, plus 20 or so syscalls"! It's a thin layer over the kernel.
Unlike all Unix shells, it's written in statically-typed Python. We translate this Python to C++ code that uses garbage-collected data structures, which currently use the Cheney semi-space algorithm.
Goals: Speed up the interpreter, and break the dependency on the CPython runtime.
Let's recap an appendix of the last post. The two resources I remember being most helpful are:
Other resources:
I'd appreciate more references! This list is biased toward languages I'm familiar with. For example, I've never used Smalltalk, and since it's not statically typed, I don't expect it to be that similar.
(Update: I just stumbled across this Cheney collector by the Google v8 authors: two_space_heap.cc.)
These data structures are used by both hand-written and generated code. Here's some code generated from Mem::GetAllVars() in state.py:
Dict<Str*, Str*>* result = nullptr;
StackRoots _roots({&result});
result = Alloc<Dict<Str*, Str*>>();
// ...
result->set(name, str_val->s);
Equivalent Python:
result = {}
result[name] = str_val.s
So gc_heap::Alloc<> is like new, except that unreferenced objects are periodically collected. The manual registration of stack roots is OK for our mostly generated code, but could probably be improved for hand-written bindings. (I think I backed out of that experiment, and it might be worth another try.)
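To make the rooting protocol concrete, here's a simplified, hypothetical sketch of what StackRoots-style registration does. Oil's actual class is in mycpp/gc_heap.h and differs in detail; this version uses void** and a global root list purely for illustration.

```cpp
#include <cassert>
#include <initializer_list>
#include <vector>

// Global list of *addresses* of local variables that may hold heap
// pointers, so a moving collector can find live pointers and also
// update them after objects move.
std::vector<void**> g_roots;

class StackRoots {
 public:
  explicit StackRoots(std::initializer_list<void**> roots) {
    for (void** p : roots) {
      g_roots.push_back(p);
      n_++;
    }
  }
  ~StackRoots() {  // unregister when the C++ scope exits
    for (int i = 0; i < n_; ++i) {
      g_roots.pop_back();
    }
  }

 private:
  int n_ = 0;
};
```

Because the registered values are addresses of locals, an assignment made after registration is still visible to the collector; that's why the generated code roots result before calling Alloc<>.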
You can also look at the mycpp-examples benchmark, published with every release. It links to small Python examples and measures the speedup after translation to C++. (Unfortunately it doesn't show the C++ code.)
Here's another view of what we're talking about:
~/git/oilshell/oil$ wc -l mycpp/gc_heap.*
303 mycpp/gc_heap.cc
1403 mycpp/gc_heap.h
1706 total
So it's small! More at Line Counts for Translation.
Direct Links: gc_heap.h and gc_heap.cc
I suppose you can think of any garbage collector as "two graphs overlaid", but I think it's particularly true here. C++ templates and the garbage collector interact in an interesting way.
The data model is simple — it's shaped like statically typed Python!
Str*
List<T>*
Dict<K, V>*
But each of these typed nodes can be viewed as one or more Obj*
instances,
with an 8 byte header:
#define OBJ_HEADER() \
uint8_t heap_tag_; \
uint8_t type_tag_; \
uint16_t field_mask_; \
uint32_t obj_len_;
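A quick sanity check: these four fields pack into exactly 8 bytes with no padding (1 + 1 + 2 + 4), which a static_assert can verify.

```cpp
#include <cassert>
#include <cstdint>

// The same four fields as OBJ_HEADER(), gathered into a struct so we
// can check the size.
struct Header {
  uint8_t heap_tag_;
  uint8_t type_tag_;
  uint16_t field_mask_;
  uint32_t obj_len_;
};

static_assert(sizeof(Header) == 8, "object header should be 8 bytes");
```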
What are these fields? Their roles are explained below, with List and Dict as examples.
Here are the 5 heap tags:
namespace Tag {
const int Forwarded = 1; // For the Cheney algorithm.
const int Global = 3; // Neither copy nor scan.
const int Opaque = 5; // Copy but don't scan. List<int> and Str
const int FixedSize = 7; // Fixed size headers: consult field_mask_
const int Scanned = 9; // Copy AND scan for non-NULL pointers.
}
(The tags are odd numbers because I ran into an issue with vtables, explained below.)
In reading about garbage collectors, one thing I didn't find emphasized is that the GC metadata for Cheney can be smaller than for mark and sweep, especially on 64-bit machines.
For example, CPython has next and prev pointers for its sweep phase. That's 16 bytes alone, compared to our 8 bytes, which also include type info and field masks for static types.
This per-object overhead is important in all language runtimes, but especially so in Oil, because it has many small records in its fine-grained code representation.
(Update: Aidenn0 mentioned the obvious issue that Cheney has 2x space overhead, which is very true. Though I still think this difference is worth mentioning and understanding.)
- Str is a variable-length subtype of Obj that has no pointers, so it's Tag::Opaque.
- Slab<T> is also a variable-length subtype of Obj. It helps with lists and dicts:
  - Slab<int> is Tag::Opaque.
  - Slab<T*> requires scanning for pointers, so it's Tag::Scanned.
- List<T> is a fixed-size structure with int fields and a pointer to a Slab<T> (the items).
- Dict<K, V> is also fixed-size with int fields, with pointers to 3 slabs: Slab<int>, Slab<K>, and Slab<V>.
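As a rough sketch of those shapes (the field names here are guesses for orientation, not necessarily the ones in mycpp/gc_heap.h):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layouts.  A Slab is a variable-length run of items; a
// List or Dict is a small fixed-size record pointing at slabs.
template <typename T>
struct Slab {
  uint64_t header;  // stand-in for the 8-byte Obj header
  T items_[1];      // actually variable-length, like Str
};

template <typename T>
struct List {
  uint64_t header;
  int len_;
  int capacity_;
  Slab<T>* slab_;  // items live out-of-line, so append() can realloc
};

template <typename K, typename V>
struct Dict {
  uint64_t header;
  int len_;
  int capacity_;
  Slab<int>* entry_;  // hash table indices
  Slab<K>* keys_;
  Slab<V>* values_;
};
```

The fixed-size records get Tag::FixedSize, while each slab is independently Tag::Opaque or Tag::Scanned depending on whether it holds pointers.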
The following types are never instantiated. After examining the heap tag, the garbage collector does a reinterpret_cast<> from Obj* to LayoutX* for convenience.

- LayoutFixed - for tracing up to ~16 fields of a user-defined type.
- LayoutForwarded - for storing the forwarding pointer that the Cheney algorithm uses.

// for Tag::FixedSize
class LayoutFixed : public Obj {
public:
// only entries denoted in field_mask will be valid
Obj* children_[16];
};
// for Tag::Forwarded
class LayoutForwarded : public Obj {
public:
Obj* new_location;
};
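Here's a sketch of how a collector can use field_mask_ to trace a LayoutFixed. The copy/forward step is elided; this version just counts the children a real collector would visit.

```cpp
#include <cassert>
#include <cstdint>

// Same header shape as OBJ_HEADER(), plus the LayoutFixed view.
struct Obj {
  uint8_t heap_tag_;
  uint8_t type_tag_;
  uint16_t field_mask_;
  uint32_t obj_len_;
};

struct LayoutFixed : Obj {
  Obj* children_[16];  // only entries named by field_mask_ are valid
};

// Walk the mask: bit i set means slot i holds an Obj pointer.
int CountTraceableChildren(LayoutFixed* fixed) {
  int n = 0;
  for (int i = 0; i < 16; ++i) {
    if (fixed->field_mask_ & (1 << i)) {
      Obj* child = fixed->children_[i];
      if (child != nullptr) {
        n++;  // a real collector would copy/forward 'child' here
      }
    }
  }
  return n;
}
```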
So GC graph nodes can be either Tag::Scanned or Tag::Opaque: Tag::Scanned means the object has possibly-NULL pointers to trace, while Tag::Opaque means it contains no pointers.

Here are some more facets of the design.
- List::append() and extend() can realloc. These are literally Python methods.
- Dict::set(K key, V val) can realloc and rehash (still TODO). It implements mydict[key] = val.
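As an illustration of that growth behavior, here's a toy append with a doubling policy. The factor and initial capacity are assumptions, not Oil's exact numbers, and a std::vector stands in for the reallocated Slab<int>.

```cpp
#include <cassert>
#include <vector>

// Toy model of List::append(): when the slab is full, allocate a
// bigger one and keep going.
struct IntList {
  int len_ = 0;
  int capacity_ = 0;
  std::vector<int> slab_;  // stand-in for a realloc'd Slab<int>

  void append(int x) {
    if (len_ == capacity_) {
      capacity_ = (capacity_ == 0) ? 4 : capacity_ * 2;  // assumed policy
      slab_.resize(capacity_);  // models allocating a bigger slab
    }
    slab_[len_++] = x;
  }
};
```

Amortized over many appends, this gives the same O(1) cost profile as CPython's list.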
Tuples, on the other hand, are value types:
Tuple2<A, B>
Tuple3<A, B, C>
Tuple4<A, B, C, D>
They're only used for destructuring function return values. For example:
s, num_chars = single_f.GetRaw()
becomes:
Tuple2<Str*, int> tup2 = single_f->GetRaw(); // value, not pointer
s = tup2.at0();
num_chars = tup2.at1();
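A minimal sketch of the value-type tuple shape (illustrative only; the generated class lives in mycpp's runtime):

```cpp
#include <cassert>

// Two fields, two accessors -- just enough to destructure a return
// value by copying.
template <typename A, typename B>
class Tuple2 {
 public:
  Tuple2(A a, B b) : a_(a), b_(b) {}
  A at0() const { return a_; }
  B at1() const { return b_; }

 private:
  A a_;
  B b_;
};

// Returned by value: no heap allocation, nothing for the GC to track.
Tuple2<const char*, int> GetRaw() {
  return Tuple2<const char*, int>("hi", 2);
}
```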
So "mycpp" is a fairly different language than Python! And again, the generated code is meant to be readable and stepped through in a debugger.
In my experience, the more "exotic" C++ features you use, the more likely it is that the DWARF source location data is confusing or wrong.
In my mind, the generated C++ code is about the performance model, and the Python source is about the application logic.
I mentioned above that the heap tags are odd numbers. This was the result of a fun debugging session that reminded me that the compiler reserves a vtable pointer on objects with virtual methods :-)
Vtables are stored at aligned addresses, so if the collector sees an odd number in an Obj* header, it assumes it's a heap tag. If it sees an even number, then it advances 4 or 8 bytes past the pointer, and there should be an odd heap tag there.
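This trick can be demonstrated in isolation. The sketch below assumes a little-endian machine and the common Itanium C++ ABI layout (vtable pointer first, pointer-aligned), which is exactly where the portability questions come from.

```cpp
#include <cassert>
#include <cstdint>

struct PlainObj {  // no virtual methods: the tag is the first byte
  uint8_t heap_tag_;
};

struct VirtualObj {  // the compiler puts a vtable pointer first
  virtual ~VirtualObj() {}
  uint8_t heap_tag_;
};

// On little-endian, the first byte of a pointer-aligned vtable pointer
// is even, while heap tags are deliberately odd.
uint8_t FindHeapTag(void* obj) {
  uint8_t* p = static_cast<uint8_t*>(obj);
  if (*p % 2 == 0) {      // even: skip the presumed vtable pointer
    p += sizeof(void*);
  }
  return *p;              // an odd heap tag
}
```

On a big-endian machine the first byte of the vtable pointer would be its high byte, so this exact check wouldn't work, which is why the TODO below matters.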
TODO: Can we make this portable C++? What about big-endian machines? One goal of Oil is to be portable like Lua (few #ifdefs), not like Python (many #ifdefs). I need help with these things.
As mentioned, I tend to use a modest subset of C++, similar to old versions of Google's style guide.
But the problem of computing field masks seems like a natural application of constexpr and offsetof() in C++.
Our code works, but I had some problems, and ended up using more macros than expected. I don't quite remember what happened, since it was about a year ago, but it's something to make note of.
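For reference, here's a hypothetical version of that constexpr/offsetof approach. The header size, slot granularity, and field names are all assumptions made for the example, not Oil's scheme.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// A standard-layout type: an 8-byte header, then fields in
// pointer-sized slots.
struct Node {
  uint64_t header;  // stand-in for the Obj header
  Node* left;       // pointer field: its slot's bit is set
  long value;       // not a pointer: its bit stays clear
  Node* right;      // pointer field: bit set
};

// Map a field's byte offset to a bit: one bit per pointer-sized slot
// after the header.
constexpr uint16_t MaskBit(size_t offset) {
  return 1 << ((offset - sizeof(uint64_t)) / sizeof(void*));
}

// The mask is computed at compile time from the actual layout, so it
// can't drift out of sync with the struct definition.
constexpr uint16_t kNodeMask =
    MaskBit(offsetof(Node, left)) | MaskBit(offsetof(Node, right));
```

One catch: offsetof() is only guaranteed for standard-layout types, which is part of why inheritance complicates this (see below).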
Our Python classes only use single inheritance. But we also use algebraic data types via Zephyr ASDL, and its code generator also generates C++ classes.
An AST is a big and elaborate data structure, and a problem in ML is that you sometimes have trivial variants that wrap the same type, adding indirection. For example, the leaf Token type can be used as a variant of both word_t and expr_t.
These little wrappers would increase the load on the garbage collector. So I implemented variants as types, which translates to "nominal" multiple inheritance in C++:
- Token IS A word_t
- Token IS A expr_t
That is, there's no field inheritance in the application, but the object header consists of the 4 C++ fields shown above. We wouldn't want those fields to be inherited twice, so that's one reason you see macros instead of inheritance in certain places.
Also, offsetof() apparently isn't well-specified in the presence of inheritance.
femtolisp has heuristics for this that I started to copy, but I ran into problems. This needs more thought.
I want this article to be accurate in the future, so I've moved this section to a wiki page Tasks Under NLnet Grant.
The first pass was Compiler Engineer Job, but now I realize there are many subtasks we could use help on!
To get this done, I see two opposite strategies, with gradations in between:
Why not do this all myself? I touched on this in Oil is Being Implemented "Middle Out". The project is very big: it's more like implementing three languages than one!
So I think this is a fun and small piece of code, and it should lead us to a production quality shell. For some evidence, see Oil's Parser is 160x to 200x Faster Than It Was 2 Years Ago (2020).
Remember that Oil is the most bash-compatible shell by a mile, and has been for several years! It's also a brand-new language that I've gotten great feedback on.
Please share this post with compiler and language runtime engineers.
And please sponsor Oil if you can! We now have a small grant, but it may not cover all of this specialized work.
There was some good discussion on the last post:
I expect these questions to come up, and I've made light comments on them in this series:
The real answer to many of these questions is more like "I haven't had time to look into it". So I wrote this post in hopes of parallelizing the project!
(Although "Shell has different implementation constraints than other languages" is also common!)
I started a wiki page: FAQ: Why Not Write Oil in X?