Why Sponsor Oils?
The last post, Garbage Collector Problems, was a status update on Oil's C++ runtime, about 4 months in.
Since then, we solved both the "rooting" and fork()-friendly problems. So this post shows a demo of a fast shell in C++, with a working garbage collector! Remember, the oil-native tarball contains no Python code at all, and now it uses less memory.
This topic can be dense and wordy, so I break it up with screenshots of the many tools we used: ASAN, GDB, perf and flame graphs, R for benchmarking, and more. I want to give contributors a sense of what working on Oil is like.
We can still use more help! Good news: we now have a second grant with NLnet. Let me know if working on this sounds fun. Follow our new account @oilsforunix on Mastodon.
Big thanks to Jesse Hughes, Melvin Walls, and CoffeeTableEspresso for essential help on the C++ runtime. Thanks again to NLnet for funding us.
If you click on this screenshot, you'll see OSH running ./configure from CPython's tarball, with GC debug output. This is _OIL_GC_VERBOSE=1 logging.

It was a thrill to get this working! Remember that the interpreter took years to write.
Here's another way to look at it. We now run benchmarks on every commit, including memory measurements with getrusage().
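For reference, here's roughly how that kind of measurement works. This is a minimal standalone sketch, not Oil's actual benchmark harness:

#include <stdio.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
  if (fork() == 0) {
    // Child: run the command we want to measure.
    execlp("bash", "bash", "-c", "echo hi", (char*)NULL);
    _exit(127);  // only reached if exec fails
  }
  int status;
  wait(&status);

  // After waiting, RUSAGE_CHILDREN includes the child we just reaped.
  struct rusage usage;
  getrusage(RUSAGE_CHILDREN, &usage);
  printf("max RSS: %ld KB\n", usage.ru_maxrss);  // kilobytes on Linux
  return 0;
}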
In November, our "leaky" prototype used ~550 MB of memory running Python's ./configure, which is 20x more than bash!
When we turned on the garbage collector, the memory usage was reduced by 10x.
Note that this is one of our hardest workloads. On 3 other hand-written ./configure scripts, OSH already matches the speed and memory usage of bash. Most shell scripts are I/O bound.
We've also made progress since then: as of the latest CI run, the autoconf-generated script uses 1.6x more memory under OSH, not 2.0x. It's now 1.4x slower, not 1.9x slower. After we fix a few "dumb things", it should be just as fast.
How did we get it working? What were the wrong turns? This section summarizes a part of this long journey.
In the last post, I mentioned that we might change our rooting policy. The rooting problem is: Where does the collector start tracing?
A major motivation was "the f(g(), h()) problem", which we found with ASAN. (I just noticed that LLVM's docs call out precisely this problem, naming it h(f(), g())!)
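To recap the problem, here's a sketch in terms of Oil's Str type. The function names are placeholders; the point is the unrooted temporary:

// g() and h() both allocate. In f(g(), h()), the compiler may evaluate
// g() first, leaving its result in a temporary that no GC root knows
// about. If h() then allocates and triggers a collection, g()'s result
// is swept before f() even runs.
Str* g();  // returns a freshly allocated object
Str* h();  // allocating here may call Collect()

void f(Str* a, Str* b);  // receives a dangling pointer in arg a

void demo() {
  f(g(), h());  // heap-use-after-free, like the ones ASAN reported
}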
I proposed a policy called "Return Value Rooting", ran it by a few people, and mentioned it on the blog.
Some contributors were convinced by the logic, so I went ahead and implemented it, testing it on the same mycpp examples. It fixed 5 failures, but our recursive descent parsing test case parse.py failed in a new way.
Parsing 1+2 =>
=================================================================
==15704==ERROR: AddressSanitizer: heap-use-after-free on address 0x6020000023f0 at pc 0x55b1eeba4d01 bp 0x7ffde9fe6e90 sp 0x7ffde9fe6e80
READ of size 1 at 0x6020000023f0 thread T0
#0 0x55b1eeba4d00 in ObjHeader(Obj*) /home/andy/git/oilshell/oil/mycpp/gc_obj.h:114
#1 0x55b1eeba5f31 in MarkSweepHeap::MarkObjects(Obj*) mycpp/mark_sweep_heap.cc:125
#2 0x55b1eeba6099 in MarkSweepHeap::MarkObjects(Obj*) mycpp/mark_sweep_heap.cc:136
#3 0x55b1eeba54f3 in RootSet::MarkRoots(MarkSweepHeap*) mycpp/mark_sweep_heap.cc:11
#4 0x55b1eeba687a in MarkSweepHeap::Collect() mycpp/mark_sweep_heap.cc:202
I realized that the problem is fundamental: it doesn't handle mutating member variables correctly!
I was discouraged. This garbage collector seemed to have more failed attempts than any program I've ever written! I went back to the drawing board, reviewing my large collection of links, notes, and books.
My first thought was to pursue the simplest possible conservative/imprecise stack scanner, just to get something working (on one architecture). It would temporarily "unblock" the project.
I had conjectured with contributors that Boehm GC was essentially a way of saying "screw it" to the rooting problem in C++, and thus to precise garbage collection. That is, it's a hack that language implementers came up with ~30 years ago to work around exactly the problems we were having.
Here's one thread of notes: #oil-dev > Conservative GC Design Links.
It includes a long 2015 conversation on an option for conservative GC in Julia, which basically scared me off again. There is a fair amount of controversy about conservative collection. (It now appears to be a workaround for integrating with unusual libraries, not Julia's main approach.)
To make a long story short, I went back to an even simpler solution, which I call:
We simply insert mylib.MaybeCollect() manually in the interpreter's Python source. It gets preserved as mylib::MaybeCollect() in C++.
You could call these ad hoc GC safe points. A safe point answers the question: When does the GC have enough information to collect?
I noted that the term "safe point" doesn't appear in the index of the Garbage Collection Handbook. It's not an algorithm issue, but an implementation issue which is nonetheless difficult.
What's surprising is that we only need a handful of manual collection points in our 40K-line interpreter — primarily the Python loop that interprets shell for and while loops! If you think about it, loops are the primary way for programs to create unbounded garbage.
(It's possible for a single MaybeCollect() call to run after every shell statement, not just loops. Max Bernstein pointed out that recursion in shell is possible, so we'll probably do this.)
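What might MaybeCollect() itself look like? Here's a minimal sketch, assuming a simple live-object threshold; the real policy lives in mycpp/mark_sweep_heap.cc, and these field names are illustrative:

void MaybeCollect() {
  // Most calls are just a cheap comparison; we only collect when the
  // heap has grown past a threshold.
  if (gHeap.num_live_ > gHeap.gc_threshold_) {
    gHeap.Collect();
    // Raise the threshold so steady-state loops don't collect every time.
    gHeap.gc_threshold_ = gHeap.num_live_ * 2;
  }
}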
Either way, this solution has a number of nice properties:
(1) It drastically reduces the state space explosion that collecting at every allocation creates.
The old policy turned every allocation into a branch — collect, or don't collect — which needs test coverage.
(2) It avoids the f(g(), h()) problem, and more problems like #oil-dev > Allocating in a constructor, which CoffeeTableEspresso ran into.
// Bug: if Foo::Foo() allocates a member variable, the instance
// may be swept away before it's bound to myobj and thus rooted
myobj = Alloc<Foo>();
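For example, a hypothetical class like this would trigger the bug (the class is illustrative, not from Oil's codebase):

class Foo {
 public:
  // The hazard: allocating inside the constructor. Under the old
  // collect-on-allocation policy, StrFromC() could trigger Collect()
  // while this Foo is not yet bound to myobj, i.e. reachable from no
  // root, so the collector sweeps it mid-construction. With manual
  // collection points, a collection simply can't happen here.
  Foo() : s_(StrFromC("init")) {
  }
  Str* s_;
};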
More problems: #oil-dev > Interactions Between C++ and GC
(3) It let us delete all rooting annotations from hand-written code!
To tell the GC where to start tracing, mycpp generates a line like this for every function:
StackRoots _r({ &local1, &local2, ... });
But we also have ~7,000 lines of hand-written code, which means that hundreds of functions had such manual annotations.
With the new manual collection points, we can throw them all out!
This is because hand-written functions like Str::strip() and posix::fork() can never be lower on the stack than a manual collection point. In the old scheme, Str::strip() could call a function that may collect. So they're now a bit like "leaf nodes" of the call tree.
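Concretely, here's what that means for a hand-written function. This is a simplified sketch with illustrative names, not Oil's actual code:

// Old scheme: any allocation could trigger a collection, so locals and
// parameters had to be rooted by hand.
Str* Pad(Str* s) {
  Str* space = nullptr;
  StackRoots _r({&s, &space});   // annotation required under the old policy
  space = StrFromC(" ");         // allocates: could have called Collect()
  return str_concat(space, s);   // allocates again, so both must be rooted
}

// New scheme: Collect() runs only at mylib::MaybeCollect(), which never
// appears below a hand-written leaf function like this, so the StackRoots
// line (and the nullptr dance) can simply be deleted.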
This slightly unexpected property makes it easier to work on Oil. We try to write the simplest code possible.
Our runtime is instrumented with OIL_GC_STATS=1, and we collect data on real shell programs in our CI. If anything, we're still checking for collection too often, not too rarely.
So I don't see any real downsides!
While in theory the shell may use more memory than necessary in some cases, typical workloads consist of many tiny objects. And there are many other ways we should optimize memory usage, mentioned below.
So I think I misconceptualized our bugs as a rooting issue — being fastidious about where to start tracing — but it's more useful to think about when to collect.
(An appendix has another framing of this design problem. It was essentially impossible without changes to the C++ language, which have been proposed on and off for ~30 years.)
In the Reddit comments on my last post, two readers pointed out very relevant research. (Thank you! This is why I write these posts.)
I wrote Zulip notes on these papers, as well as on Boehm GC, but here I'll just say that reading them made me feel better.
So after switching to manual GC points, all of our tests passed under ASAN and #ifdef GC_ALWAYS. This feels good after months of latent segfaults, but it isn't too surprising, since there are many fewer code paths.
Does the shell run? No, there were still quite a few crashes.
I'll recap that thread after giving some color on GC bugs in general, and our problem specifically.
From a former v8 lead, and WebAssembly co-author:
I utterly hate debugging garbage collection bugs. Besides stress mode, fuzzing, and tracing output, there isn't really much to help you. A good debugger is a must. But even then, I feel like it is a special kind of hell.
— Ben L. Titzer (@TitzerBL) August 30, 2022
I definitely had this experience with the moving Cheney collector. It had the additional complication of being at the metalanguage level, where it interacts with dozens of C++ features.
The C++ translation essentially stalled because it was too difficult to debug!
Hans Boehm motivates conservative garbage collection:
It is difficult to design a compiler such that the generated code is guaranteed to preserve garbage collection invariants. [Errors] in the object code can potentially go unnoticed for long periods of time. Even when they do appear, the problem is difficult to isolate, since symptoms usually appear a long time after the erroneous code was executed.
Garbage collection bugs tend to be among the last to be removed from a new programming language implementation.
— Garbage Collection in an Uncooperative Environment (Boehm, 1988)
So how did we find and fix our GC bugs? Are there more lurking?
Again, the most useful technique is to use ASAN as much as possible. It's built into both Clang and GCC, and trivial to use (pass -fsanitize=address). It compiles your program with runtime instrumentation of allocations.
It's mainly used to find classic C and C++ errors with manual memory management, but it works just as well for a garbage collector that uses malloc() and free(). (We'll have a custom allocator for performance, but will always retain this option!)
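If you haven't used ASAN, here's the smallest possible demo of the kind of report we relied on. Compile this with c++ -fsanitize=address -g demo.cc and run it:

#include <cstdlib>

int main() {
  int* p = static_cast<int*>(malloc(sizeof(int)));
  *p = 42;
  free(p);
  return *p;  // heap-use-after-free: ASAN prints a report and aborts
}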
There were two common failure modes:
(A) Heap Use After Free. A common sequence of events: the collector fails to find a root for a live object, so it calls free() on it; the program then uses the freed object.

(B) Seg Faults, e.g. due to missing object headers. ASAN basically gives you a stack trace, so in many cases you don't have to bother starting the debugger.
Even though ASAN reliably turns use-after-free into crashes, and these bugs ultimately had simple causes, they were not trivial to diagnose. The core problem is that by the time the crash happens, the debugger lacks type information.
I want to re-emphasize this "reality sandwich" view of garbage collection, described in May:
The top layer: typed objects, like Dict, List, Str, etc.
The bottom layer: untyped objects viewed through an ObjHeader*, with children traced by HeapTag::FixedSize and HeapTag::Scanned.

The middle layer is carefully designed to be compatible with the top and bottom views. Precise garbage collection requires this.
While you usually debug your program thinking about the top layer, the crashes happen in the bottom layer.
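To make the bottom layer concrete, here's a rough sketch of the kind of information an object header carries. The exact fields are illustrative; the real definition is in mycpp/gc_obj.h:

#include <stdint.h>

// Illustrative only: the collector's untyped view of every object.
struct ObjHeader {
  uint8_t heap_tag;     // HeapTag::FixedSize, HeapTag::Scanned, ...
  uint16_t field_mask;  // FixedSize: one bit per pointer field to trace
  uint32_t obj_len;     // Scanned: how many bytes of pointers to scan
};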
Here are more debugging techniques I used:
Validate the roots as soon as they're registered, not just on collection. Assert everything you can about the object headers, e.g. that they have valid tags like HeapTag::FixedSize.
GDB watchpoints, using the command awatch -l myvar.
This is useful because ASAN gives you 3 stack traces:
Where the object was allocated.
Where it was freed, i.e. the Sweep() phase of the collector: the bottom layer.
Where it was used after being freed: the top layer.

So by putting a watchpoint on a memory location, you can see where an object is touched by both "bread" layers.
Now that we understand the debugging task, here's a summary of that Zulip thread: #oil-dev > Summary of GC Bugs
Logic errors generating GC metadata:
The constexpr field masks overflowed 16 bits! In other words, a few classes had more than 16 members. (There's a sketch of these masks after this list.)
Tag::Scanned was used instead of Tag::FixedSize.
A problem with Tag::Scanned objects in mycpp, due to a legacy of the Cheney collector, which has now been cleaned up.
An incorrect Tag::Opaque optimization.
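Here's the field mask sketch promised above. The helper and class are hypothetical, not mycpp's actual generated code, but they show how a 16-bit mask can silently overflow:

#include <stddef.h>
#include <stdint.h>

// One bit per pointer-sized slot within the object.
constexpr uint16_t maskbit(size_t offset) {
  return 1u << (offset / sizeof(void*));
}

struct Node {
  void* vtable_;  // illustrative layout
  void* left_;
  void* right_;
};

// The generated mask ORs together a bit per pointer field. If a field
// lands at slot 16 or beyond, its bit falls outside the 16-bit mask and
// is silently dropped -- exactly the overflow we hit.
constexpr uint16_t kNodeMask =
    maskbit(offsetof(Node, left_)) | maskbit(offsetof(Node, right_));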
Missing GC metadata:
In the frontend/flag_gen.py code generator.
In the core/optview_gen.py code generator.
In the ParserSpec class.
Missing global roots:
stdout and stderr. (These are the only 3 globals in Oil!)
The getline() binding, which ASAN flagged.
One involving libc, which means ASAN's stack trace was confusing.

So none of the bugs were in the collector itself; they were issues with the metadata used for rooting and tracing.
So I now envision a garbage collector as an octopus! Its brain is the collector itself, and its tentacles are the metadata that reaches all over the program. If there's a single mistake in the metadata, the octopus's brain explodes.

So this was the cause of essentially all 9 bugs. Precise garbage collection requires precise metadata, all over the program.
After this experience, I'm now confident our collector will be reliable. Why?
An unexpected benefit of writing the shell in Python is that the vast majority of GC metadata is in generated source code. We don't have dozens or hundreds of bugs like the ones Mozilla described in Clawing Our Way Back to Precision (2013).
So our source code isn't littered with manual GC bookkeeping like Py_INCREF and Py_DECREF.

So I expect that the C++ runtime will change slowly, and new features will be added to the shell in Python, where it's impossible to make mistakes with memory.
This great PyPy retrospective makes a similar point:
RPython is a garbage collected language, and the interpreter does not have to care much about GC in most cases. When the C source code is generated, a GC is automatically inserted. This is a source of great flexibility. Over time we experimented with a number of different GC approaches, from reference counting to Boehm to our current incremental generational collector
In fact, I revived the Cheney collector, as an experiment in performance. It may work well with our new manual GC points.
I have a lot to say about performance, including 3 immediate fixes. One of them, a "free()" algorithm tweak, fell out of recycling object IDs, and improved performance.

Future plans include putting the ObjHeader before any vtable pointer.

This work is guided by many benchmarks.
This post is already long, so I won't go into detail. Each piece of work above has a thread on Zulip.
Follow our new Mastodon account or Twitter account if you want to learn more.
Writing this collector was a lot of work, but we ended up with something that's simple and reliable.
It's now clear that the OSH part of the project will "exist". We still have some work to do to replace the Python tarball with the oil-native tarball, but it will happen.
But the Oil language has stalled for nearly 6 months. I've gotten great feedback from people like Samuel Hierholzer and James Sully, which I hope to address soon.
I've also been falling behind on shell advocacy (#shell-the-good-parts). For now, I'll point out that shell was the sixth fastest growing language on GitHub in 2022. We still need a new shell.
And the #1 fastest growing language is the HashiCorp Configuration Language. A new component of Oil is very similar to it — interleaved with a principled shell, and decoupled from any application.
To get these efforts going again, I'll try to get more people involved with the project. Let me know if you want to be paid to work on open source code!
More answers for the curious:
For example, why not write it in Go? Go's runtime multiplexes goroutines onto OS threads, and threads don't mix well with the fork()-based concurrency that shells use. Go also doesn't use libc, which shells rely on extensively.
(That said, Go's collector is very readable, with many interesting comments!)
Here's another way of looking at our "manual collection point" design choice.
I believe the original problem was over-constrained, to the point that it was impossible to solve. One constraint was readable generated code: nobody wants to wade through variables named var345 and var346 to debug! In contrast, our generated code is readable, and multiple contributors have debugged it with normal tools. The last constraint was the ability to collect at any allocation point.

When viewed this way, it's pretty clear to me that the last constraint is the one to relax.
Are there any other options? I don't think so, other than changing the C++ language itself, which seems to have stalled.

Manual collection points aren't fully general, but they solve exactly the problem we have. That is, we're able to precisely collect garbage in a subset of portable C++ code. It's easy to read, write, and generate code in this style.
If you got this far and were disappointed by the pictures, I like these animations.