
Pictures of a Working Garbage Collector

2023-01-11

The last post, Garbage Collector Problems, was a status update on Oil's C++ runtime, about 4 months in.

Since then, we solved both the "rooting" problem and the fork()-friendliness problem. So this post shows a demo of a fast shell in C++, with a working garbage collector! Remember, the oil-native tarball contains no Python code at all, and now it uses less memory.


This topic can be dense and wordy, so I break it up with screenshots of the many tools we used: ASAN, GDB, perf and flame graphs, R for benchmarking, and more. I want to give contributors a sense of what working on Oil is like.

We can still use more help! Good news: we now have a second grant with NLnet. Let me know if working on this sounds fun. Follow our new account @oilsforunix on Mastodon.


Big thanks to Jesse Hughes, Melvin Walls, and CoffeeTableEspresso for essential help on the C++ runtime. Thanks again to NLnet for funding us.

Table of Contents
Screencast
Process Metrics
Where vs. When: Rooting Policy vs. "Safe Points"
Three Benefits of Our Simple Solution
Drawbacks?
Related Papers
Debugging the Collector
GC Bugs Are Hard to Diagnose, and They Lurk
Leaning Hard on ASAN
A "Reality Sandwich" of Statically Typed Bread
Techniques (GDB Watch Points)
Gap Between Testing and Production: 9 Bugs
A Delicate Octopus With Thousands of Arms
Performance
Summary
What's Next?
Appendix
FAQ: Why Should a Shell Have a Garbage Collector?
We Must Relax A Constraint
Algorithm Animations

Screencast

If you click on this screenshot, you'll see OSH running ./configure from CPython's tarball, with GC debug output.


It was a thrill to get this working! Remember that the interpreter took years to write.

Process Metrics

Here's another way to look at it. We now run benchmarks on every commit, including memory measurements with getrusage().
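Here's a minimal, self-contained sketch of that kind of measurement using the standard POSIX API (this isn't our actual benchmark code):

    #include <stdio.h>
    #include <sys/resource.h>

    int main() {
      struct rusage usage;
      getrusage(RUSAGE_SELF, &usage);
      // ru_maxrss is the peak resident set size, in KB on Linux
      printf("max RSS: %ld KB\n", usage.ru_maxrss);
      return 0;
    }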

In November, our "leaky" prototype used ~550 MB of memory running Python's ./configure, which is 20x more than bash!

Leaky Translation (November)

When we turned on the garbage collector, the memory usage was reduced by 10x.

Garbage Collector Working (December)

Note that this is one of our hardest workloads. On 3 other hand-written ./configure scripts, OSH already matches the speed and memory usage of bash. Most shell scripts are I/O bound.

We've also made progress since then: as of the latest CI run, the autoconf-generated script uses 1.6x more memory under OSH, not 2.0x. It's now 1.4x slower, not 1.9x slower. After we fix a few "dumb things", it should be just as fast.

Where vs. When: Rooting Policy vs. "Safe Points"

How did we get it working? What were the wrong turns? This section summarizes a part of this long journey.


In the last post, I mentioned that we might change our rooting policy. The rooting problem is: Where does the collector start tracing?

A major motivation was "the f(g(), h()) problem", which we found with ASAN. (I just noticed that LLVM's docs call out precisely this problem, naming it h(f(), g())!)

I proposed a policy called "Return Value Rooting", ran it by a few people, and mentioned it on the blog.

Some contributors were convinced by the logic, so I went ahead and implemented it, testing it on the same mycpp examples. It fixed 5 failures, but our recursive descent parsing test case parse.py failed in a new way.

Parsing 1+2 =>
=================================================================
==15704==ERROR: AddressSanitizer: heap-use-after-free on address 0x6020000023f0 at pc 0x55b1eeba4d01 bp 0x7ffde9fe6e90 sp 0x7ffde9fe6e80
READ of size 1 at 0x6020000023f0 thread T0
    #0 0x55b1eeba4d00 in ObjHeader(Obj*) /home/andy/git/oilshell/oil/mycpp/gc_obj.h:114
    #1 0x55b1eeba5f31 in MarkSweepHeap::MarkObjects(Obj*) mycpp/mark_sweep_heap.cc:125
    #2 0x55b1eeba6099 in MarkSweepHeap::MarkObjects(Obj*) mycpp/mark_sweep_heap.cc:136
    #3 0x55b1eeba54f3 in RootSet::MarkRoots(MarkSweepHeap*) mycpp/mark_sweep_heap.cc:11
    #4 0x55b1eeba687a in MarkSweepHeap::Collect() mycpp/mark_sweep_heap.cc:202

I realized that the problem is fundamental: it doesn't handle mutating member variables correctly!

I was discouraged. This garbage collector seemed to have more failed attempts than any program I've ever written! I went back to the drawing board, reviewing my large collection of links, notes, and books.

Books on Garbage Collection

My first thought was to pursue the simplest possible conservative/imprecise stack scanner, just to get something working (on one architecture). It would temporarily "unblock" the project.

I had conjectured with contributors that Boehm GC was essentially a way of saying "screw it" to the rooting problem in C++, and thus to precise garbage collection. That is, it's a hack that language implementers came up with ~30 years ago to work around exactly the problems we were having.

Here's one thread of notes: #oil-dev > Conservative GC Design Links.

It includes a long 2015 conversation on an option for conservative GC in Julia, which basically scared me off again. There is a fair amount of controversy about conservative collection. (It now appears to be a workaround for integrating with unusual libraries, not Julia's main approach.)


To make a long story short, I went back to an even simpler solution, which I call manual collection points.

We simply insert mylib.MaybeCollect() manually in the interpreter's Python source. It gets preserved as mylib::MaybeCollect() in C++.

You could call these ad hoc GC safe points. A safe point answers the question: When does the GC have enough information to collect?

I noted that the term "safe point" doesn't appear in the index of the Garbage Collection Handbook. It's not an algorithm issue, but an implementation issue which is nonetheless difficult.

What's surprising is that we only need a handful of manual collection points in our 40K-line interpreter — primarily the Python loop that interprets shell for and while loops! If you think about it, loops are the primary way for programs to create unbounded garbage.

(It's possible for a single MaybeCollect() call to run after every shell statement, not just loops. Max Bernstein pointed out that recursion in shell is possible, so we'll probably do this.)
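Here's a minimal sketch of the idea; the loop structure and names are illustrative, not our actual interpreter code:

    // One safe point per iteration of the interpreted loop, which is
    // where unbounded garbage accumulates. At this line, every live
    // object is reachable from a registered root.
    while (EvalToBool(cond)) {
      ExecuteBody(body);       // may allocate arbitrarily many objects
      mylib::MaybeCollect();   // the GC may run here, and only here
    }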

Three Benefits of Our Simple Solution

Either way, this solution has a number of nice properties:

(1) It drastically reduces the state space explosion that collecting at every allocation creates.

The old policy turned every allocation into a branch — collect, or don't collect — which needs test coverage.
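To illustrate with made-up code, the old policy meant that every allocation was a potential collection, so every interleaving needed to be correct:

    Str* a = Alloc<Str>();                // collection may or may not happen here
    List<Str*>* b = Alloc<List<Str*>>();  // ... and here, while a is live but perhaps unrooted
    b->append(a);                         // all combinations need test coverage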

(2) It avoids the f(g(), h()) problem, and more problems like #oil-dev> Allocating in a constructor, which CoffeeTableEspresso ran into.

// Bug: if Foo::Foo() allocates a member variable, the instance 
// may be swept away before it's bound to myobj and thus rooted

myobj = Alloc<Foo>();

More problems: #oil-dev > Interactions Between C++ and GC
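Here's the f(g(), h()) problem itself, as a one-line sketch:

    // C++ may evaluate h() after g(). If h() allocates and triggers a
    // collection, g()'s fresh result isn't in any root set yet, so the
    // sweep phase can free it before f() ever runs.
    f(g(), h());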

(3) It let us delete all rooting annotations from hand-written code!

To tell the GC where to start tracing, mycpp generates a line like this for every function:

 StackRoots _r({ &local1, &local2, ... });

But we also have ~7,000 lines of hand-written code, which means that hundreds of functions had such manual annotations.

With the new manual collection points, we can throw them all out!

This is because hand-written functions like Str::strip() and posix::fork() can never be lower on the stack than a manual collection point. In the old scheme, Str::strip() could call a function that may collect.

So they're now a bit like "leaf nodes" of the call tree.
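For example, a hand-written helper can now look like this. It's a simplified sketch: Upper() and AllocStr() are hypothetical names, not our real API:

    // No StackRoots needed: nothing this function calls can reach
    // mylib::MaybeCollect(), so a collection can't happen while this
    // stack frame is live.
    Str* Upper(Str* s) {
      Str* result = AllocStr(len(s));
      for (int i = 0; i < len(s); ++i) {
        result->data_[i] = toupper(s->data_[i]);  // assumes <cctype>
      }
      return result;  // the caller's generated code keeps it alive
    }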

This slightly unexpected property makes it easier to work on Oil. We try to write the simplest code possible.

Drawbacks?

Our runtime is instrumented with OIL_GC_STATS=1, and we collect data on real shell programs in our CI. If anything, we're still checking for collection too often, not too rarely.

So I don't see any real downsides!

While in theory the shell may use more memory than necessary in some cases, typical workloads consist of many tiny objects. And there are many other ways we should optimize memory usage, mentioned below.


So I think I misconceptualized our bugs as a rooting issue — being fastidious about where to start tracing — but it's more useful to think about when to collect.

(An appendix has another framing of this design problem. It was essentially impossible without changes to the C++ language, which have been proposed on and off for ~30 years.)

Related Papers

In the Reddit comments on my last post, two readers pointed out very relevant research. (Thank you! This is why I write these posts.)

Accurate Garbage Collection in Uncooperative Environments

I wrote Zulip notes on these papers, as well as on Boehm GC, but here I'll just say that reading them made me feel better.

Debugging the Collector

So after switching to manual GC points, all of our tests passed under ASAN and #ifdef GC_ALWAYS. This feels good after months of latent segfaults, but it isn't too surprising, since there are many fewer code paths.

Does the shell run? No, there were still quite a few crashes.

I'll recap the Zulip thread on our bugs after giving some color on GC bugs in general, and on our problem specifically.

GC Bugs Are Hard to Diagnose, and They Lurk

From a former V8 lead and WebAssembly co-author:

I utterly hate debugging garbage collection bugs. Besides stress mode, fuzzing, and tracing output, there isn't really much to help you. A good debugger is a must. But even then, I feel like it is a special kind of hell.

— Ben L. Titzer (@TitzerBL) August 30, 2022

I definitely had this experience with the moving Cheney collector. It had the additional complication of being at the metalanguage level, where it interacts with dozens of C++ features.

The C++ translation essentially stalled because it was too difficult to debug!


Hans Boehm motivates conservative garbage collection:

It is difficult to design a compiler such that the generated code is guaranteed to preserve garbage collection invariants. [Errors] in the object code can potentially go unnoticed for long periods of time. Even when they do appear, the problem is difficult to isolate, since symptoms usually appear a long time after the erroneous code was executed.

Garbage collection bugs tend to be among the last to be removed from a new programming language implementation.

Garbage Collection in an Uncooperative Environment (Boehm, 1988)

So how did we find and fix our GC bugs? Are there more lurking?

Leaning Hard on ASAN

Again, the most useful technique is to use ASAN as much as possible. It's built into both Clang and GCC, and trivial to use (pass -fsanitize=address). It compiles your program with runtime instrumentation of allocations.

It's mainly used to find classic C and C++ errors with manual memory management, but it works just as well for a garbage collector that uses malloc() and free(). (We'll have a custom allocator for performance, but will always retain this option!)

ASAN Terminal

There were two common failure modes:

(A) Heap Use After Free. A common sequence of events (sketched in code below):

  1. You forget a GC root in hand-written code. There might be an object referenced only by a C++ global.
  2. The mark phase can't find this object.
  3. So the sweep phase calls free() on it.
  4. It's later used by the program, and ASAN flags the use-after-free.

(B) Seg Faults, e.g. due to missing object headers. ASAN basically gives you a stack trace, so in many cases you don't have to bother starting the debugger.
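Here's failure mode (A) as a sketch, with hypothetical names (SignalState and InitSignals() aren't our real code):

    SignalState* g_signal_state;  // a C++ global the collector doesn't know about

    void InitSignals() {
      g_signal_state = Alloc<SignalState>();  // bug: never registered as a root
    }

    // The mark phase can't reach *g_signal_state, so the sweep phase
    // frees it. The next signal access is the use-after-free ASAN reports.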

A "Reality Sandwich" of Statically Typed Bread

Even though ASAN reliably turns use-after-free into crashes, and these bugs ultimately had simple causes, they were not trivial to diagnose. The core problem is that by the time the crash happens, the debugger lacks type information.

Sandwich

Image Credit

I want to re-emphasize this "reality sandwich" view of garbage collection, described in May:

The middle layer is carefully designed to be compatible with the top and bottom views. Precise garbage collection requires this.

While you usually debug your program thinking about the top layer, the crashes happen in the bottom layer.

Techniques (GDB Watch Points)

Here are more debugging techniques I used:

  1. Validate the roots as soon as they're registered, not just on collection. Assert everything you can about the object headers, e.g. that they have valid tags like HeapTag::FixedSize. (There's a sketch of this below.)

  2. GDB watch points using the command awatch -l myvar.

This is useful because ASAN gives you 3 stack traces:

  1. Where the use-after-free occurred.
  2. Where it was freed.
  3. Where it was originally allocated.

So by putting a watchpoint on a memory location, you can see where an object is touched by both "bread" layers.

GDB watchpoint thread
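Here's a sketch of technique 1. RootSet appears in the ASAN trace above, but AddRoot(), the header fields, and roots_ are illustrative:

    void RootSet::AddRoot(Obj* obj) {
      // Fail fast: a corrupt header should crash here, at the
      // registration site, rather than later inside the mark phase.
      auto* header = ObjHeader(obj);  // same helper as in mycpp/gc_obj.h
      assert(header->heap_tag == HeapTag::FixedSize ||
             header->heap_tag == HeapTag::Scanned ||
             header->heap_tag == HeapTag::Opaque);
      roots_.push_back(obj);
    }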

Gap Between Testing and Production: 9 Bugs

Now that we understand the debugging task, here's a summary of that Zulip thread: #oil-dev > Summary of GC Bugs

Logic errors generating GC metadata:

  1. The C++ compiler warned us that some of our constexpr field masks overflowed 16 bits! In other words, a few classes had more than 16 members. (See the sketch after this list.)
  2. Off-by-one error for Tag::Scanned objects in mycpp. This was due to a legacy of the Cheney collector, which has now been cleaned up.
  3. Bad Tag::Opaque optimization.

Missing GC metadata:

  1. Field masks in the frontend/flag_gen.py code generator.
  2. Object headers in the core/optview_gen.py code generator.
  3. Object header in hand-written ParserSpec class.

Missing global roots:

  1. Shared signal handler state. (We had it under the abandoned "return value rooting" scheme, but not the new, simpler design.)
  2. Global stdout and stderr.

(These are the only 3 globals in Oil!)

And the ninth bug: a trivial memory leak in the getline() binding, which ASAN flagged.
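For context on bug 1, here's a simplified sketch of the per-class metadata involved; the Pair class and maskbit() are illustrative, not our exact generated code:

    #include <cstddef>  // offsetof
    #include <cstdint>

    // One bit per field that holds a managed pointer. A 16-bit mask
    // can't describe a pointer past the 16th field, which is the
    // overflow the compiler caught.
    class Pair : public Obj {
     public:
      Pair() : Obj(Tag::FixedSize, field_mask(), sizeof(Pair)) {}

      static constexpr uint16_t field_mask() {
        return maskbit(offsetof(Pair, first)) |
               maskbit(offsetof(Pair, second));
      }

      Str* first;   // traced by the mark phase
      Str* second;  // traced
      int count;    // not a pointer, so no bit
    };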

A Delicate Octopus With Thousands of Arms

So none of the bugs were in the collector itself; they were issues with the metadata used for rooting and tracing.

So I now envision a garbage collector as an octopus!

Octopus

Image Credit

  1. It has a small head, which is, say, 500 or 2,000 lines of elegant marking and sweeping logic.
  2. It has tentacles that touch all of your program! This is 50,000 or 500,000 lines of code that interact with garbage-collected objects.

If there's a single mistake in the logic of the second part, the octopus's brain explodes.

So this was the cause of essentially all 9 bugs. Precise garbage collection requires precise metadata, all over the program.


After this experience, I'm now confident our collector will be reliable. Why?

An unexpected benefit of writing the shell in Python is that the vast majority of GC metadata is in generated source code. We don't have dozens or hundreds of bugs like Mozilla explained in Clawing Our Way Back to Precision (2013).

So our source code isn't littered with any of:

  1. Rooting annotations (remember that we deleted all of them in hand-written code)
  2. Manually created object headers (with field masks)
  3. Manual memory management
  4. Reference counting annotations like Py_INCREF and Py_DECREF

So I expect that the C++ runtime will change slowly, and new shell features will be added in Python, where it's impossible to make mistakes with memory.

This great PyPy retrospective makes a similar point:

RPython is a garbage collected language, and the interpreter does not have to care much about GC in most cases. When the C source code is generated, a GC is automatically inserted. This is a source of great flexibility. Over time we experimented with a number of different GC approaches, from reference counting to Boehm to our current incremental generational collector.

In fact, I revived the Cheney collector, as an experiment in performance. It may work well with our new manual GC points.

Performance

I have a lot to say about performance, including describing 3 immediate fixes:

  1. Fixed thrashing in the collection policy
  2. Removed hash table in favor of a mark bitmap (sketched below)
  3. Non-recursive marking (surprisingly didn't improve performance, but makes the collector more robust)
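Here's a sketch of fix 2, under an assumed design (not our exact code). Mark bits live in a side bitmap indexed by object ID, so "is it marked?" is one array lookup, with no hashing or rehashing during the mark phase:

    #include <vector>

    class MarkSet {
     public:
      void ReInit(int num_objects) { bits_.assign(num_objects, false); }
      bool IsMarked(int obj_id) { return bits_[obj_id]; }
      void Mark(int obj_id) { bits_[obj_id] = true; }

     private:
      std::vector<bool> bits_;  // one bit per allocated object
    };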

And future plans:

  1. True lazy sweeping, for O(num live objects) rather than O(num allocated), like Cheney
  2. A custom allocator to take over small allocations
  3. Consistently put the ObjHeader before any vtable pointer
  4. Optimizations from the app side (big parser refactoring)
  5. Small string optimization (hat tip to Max Bernstein), which interacts with GC
  6. "Boxless" optimization for the interpreter

This work is guided by many benchmarks.


This post is already long, so I won't go into detail. Each piece of work above has a thread on Zulip.

Follow our new Mastodon account or Twitter account if you want to learn more.

Recursive Marking Flame Graph

Non-recursive Marking Flame Graph

Summary

Writing this collector was a lot of work, but we ended up with something that's simple and reliable.

What's Next?

It's now clear that the OSH part of the project will "exist". We still have some work to do to replace the Python tarball with the oil-native tarball, but it will happen.

But the Oil language has stalled for nearly 6 months. I've gotten great feedback from people like Samuel Hierholzer and James Sully, which I hope to address soon.

I've also been falling behind on shell advocacy (#shell-the-good-parts). For now, I'll point out that shell was the sixth fastest growing language on Github in 2022. We still need a new shell.

And the #1 fastest growing language is the Hashicorp Config Language. A new component of Oil is very similar to it — interleaved with a principled shell, and decoupled from any application.

To get these efforts going again, I'll try to get more people involved with the project. Let me know if you want to be paid to work on open source code!

Github bug: TODO on core GC

Appendix

FAQ: Why Should a Shell Have a Garbage Collector?

  1. It will allow recursive JSON-like data structures in Oil.
  2. It means that the Oil interpreter itself is memory safe, not just Oil programs.

More answers for the curious:

For example, why not write it in Go? Go's runtime implements goroutines with OS threads, while shells use fork()-based concurrency. Go also doesn't use libc, which shells rely on extensively.

(That said, Go's collector is very readable, with many interesting comments!)

We Must Relax A Constraint

Here's another way of looking at our "manual collection point" design choice.

I believe the original problem was over-constrained, to the point that it was impossible to solve:

  1. We will generate C++ source code to make Oil fast.
  2. We should use and generate standard, portable C++ (to the degree possible).
  3. We should collect accurately / precisely.
  4. The generated code should be readable.
  5. mycpp mostly prints MyPy's typed AST.
  6. We can collect at any allocation.

When viewed this way, it's pretty clear to me that the last constraint is the one to relax.

Are there any other options? I don't think so, other than changing the C++ language itself, which seems to have stalled.

Manual collection points aren't fully general, but they solve exactly the problem we have. That is, we're able to precisely collect garbage in a subset of portable C++ code. It's easy to read, write, and generate code in this style.

Algorithm Animations

If you got this far and were disappointed by the pictures, I like these animations: