
Release of Oil 0.8.8

2021-03-19

This is the latest version of Oil, a Unix shell that's our upgrade path from bash:

Oil version 0.8.8 - Source tarballs and documentation.

To build and run it, follow the instructions in INSTALL.txt. The wiki has tips on How To Test OSH.

If you're new to the project, see Why Create a New Shell? and posts tagged #FAQ.

Table of Contents
Closed Issues
Commit Log
The Garbage Collector Works On A Variety of Examples
What's Next?
Appendix A: Metrics for oil-native
Spec Tests: Python vs. C++
Lines of Code (Mostly Generated)
Native Binary Size
Compilation Speed
Parsing Speed
Compute Speed
Appendix B: Other Release Metrics
Spec Tests: OSH and Oil
Lines of Source Code
Runtime Benchmarks (for the slow build)

This is the first release in almost two months! I'm writing a blog post that gives color on this (#project-updates), but let's do the usual release announcement first.

Most of the work was under the hood, with major progress on the garbage collector (previously) and the build system. I also do a thorough review of project #metrics in the appendices.

Closed Issues

There were at least two user-visible changes:

#907 "precision" for integers to zero-pad not implemented
#901 Error building on macOS

Thanks to Jason Miller for testing OSH, reporting the printf bug, and sending a patch that I built on top of.

I was forced to learn some trivia by implementing it: %06d and %6.6d both zero-pad 42 to 000042, but they're not the same! With -42, %06d counts the minus sign toward the width and prints -00042, while %6.6d always prints at least 6 digits, giving -000042. That is, the "precision" field, which means digits after the decimal point for %f, is overloaded to mean minimum digits for %d. This annoys me because I think syntax and semantics should correspond.
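
Here's a minimal sketch of the difference in C++, using std::printf, which follows the same conversion rules as the shell builtin for %d:

    // Minimal sketch: the '0' flag vs. the "precision" field for %d.
    #include <cstdio>

    int main() {
      std::printf("[%06d]\n", 42);    // [000042]  zero-padded to width 6
      std::printf("[%6.6d]\n", 42);   // [000042]  precision 6 = at least 6 digits
      std::printf("[%06d]\n", -42);   // [-00042]  the sign counts toward the width
      std::printf("[%6.6d]\n", -42);  // [-000042] 6 digits, plus the sign
      return 0;
    }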

Commit Log

Reviewing the full changelog confirms that most work happened under the hood.

Using Ninja turned out to be a big success! In contrast, writing GNU makefiles from scratch for Oil revealed many limitations and pitfalls of the classic tool. I'd like to write a #comments post about Ninja, based on links collected in this Zulip thread.

The Garbage Collector Works On A Variety of Examples

In my view, this is the most important part of this announcement!

As mentioned in January, writing a garbage collector has been harder than I expected. One reason for this is the unusual set of requirements: it's moving, precise, and written almost entirely in portable C++. We use it for both a little hand-written code and a lot of generated code.

To see evidence of progress, take a look at the second table on each of these pages, Max Resident Set Size. It compares the memory usage of a Python program to the same program translated to C++ with mycpp:

In other words, the garbage collector is working. We also have unit tests, stress tests, and various kinds of instrumentation for it. I fixed many crashes to get to this point.

But it's not done. Importantly, it's not yet hooked up to osh_eval.cc aka oil-native. Off the top of my head, we still need to generate field bitmasks for classes, including subclasses.
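
To make that concrete, here's a rough sketch of what "field bitmasks" could look like for a precise, moving collector. This is not Oil's actual object layout; the names and header format are made up for illustration:

    // Rough sketch of per-class field masks for a precise, moving GC.
    // This is NOT Oil's actual layout; all names here are made up.
    #include <cstdint>

    struct Obj {
      uint32_t field_mask;  // bit i set => slot i holds a heap pointer
      uint32_t num_slots;   // pointer-sized slots that follow the header
    };

    // A "generated" class.  The code generator would emit a mask covering
    // its own pointer fields AND any inherited from base classes.
    struct Token : Obj {
      Obj* name;      // slot 0: heap pointer  -> bit 0 set
      Obj* location;  // slot 1: heap pointer  -> bit 1 set
      int64_t id;     // slot 2: plain integer -> bit 2 clear
    };

    // Tracing visits only the slots the mask marks as pointers, so the
    // collector can rewrite them when it moves objects.
    inline void TraceFields(Obj* obj, void (*visit)(Obj**)) {
      Obj** slots = reinterpret_cast<Obj**>(obj + 1);
      for (uint32_t i = 0; i < obj->num_slots; ++i) {
        if (obj->field_mask & (1u << i)) {
          visit(&slots[i]);
        }
      }
    }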

I hope to write about the garbage collector in detail after it's working on oil-native, rather than just small examples. This Zulip thread has some color and comments on the experience (starting in November 2020).

What's Next?

I want to review the project's progress and write about future plans in the next post. Oil is approaching the five year mark, which is crazy, so it's again time to take stock of things.

I'm still worried about the scope of the project. I also moved for the first time in 10 years (within San Francisco), and that pushed things off track for several weeks.

For the impatient, there are immediate plans in the Zulip thread for the 0.8.8 release, but that's not all I want to write about.

Appendix A: Metrics for oil-native

Doing these reviews helps me keep track of the project. Potential contributors may also be interested (help wanted).

I reviewed Metrics for Oil 0.8.4 in November, so let's use it as the baseline for these comparisons.

Spec Tests: Python vs. C++

Here are the OSH spec test stats for oil-native:

Out of the 14 newly passing tests in Python, 11 pass in C++. That's not bad: it means that we can fix bugs in Python and, for the most part, things just work. The ones that don't could be the result of stubbed-out C++ "bindings", e.g. where I used assert(0) as the body of a function.
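
For example, a stubbed-out binding might look roughly like this; the name and signature are made up, but the assert(0) pattern is the one described above:

    // Hypothetical sketch of a stubbed-out C++ "binding".  Any spec test
    // that reaches this code path aborts under the C++ build, even though
    // the corresponding Python code works.
    #include <cassert>
    #include <string>

    // Name and signature are made up for illustration.
    std::string ReadLineFromTerminal() {
      assert(0);  // TODO: not translated to C++ yet
      return "";  // unreachable, but keeps the compiler happy
    }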

The bad news is that the C++ test count has stalled around 920 for a few months. I've been working on the garbage collector and other things.

So the translation process is working, but it's taking longer than expected.

Lines of Code (Mostly Generated)

There's a small increase in generated source code due to normal feature development:

Native Binary Size

This increase in native binary size seems larger than the increase in lines of source would suggest. Off the top of my head, it's probably because every function (generated or hand-written) now has to register stack roots with the garbage collector.
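
As a sketch of why that costs code size: with a precise, moving collector, each function has to tell the GC where its local heap pointers live for the duration of the call, roughly like this (simplified, and not necessarily Oil's exact API):

    // Simplified sketch of per-function stack root registration.  Not
    // necessarily Oil's exact API; it just shows why every function
    // grows by a few instructions.
    #include <cstddef>
    #include <initializer_list>
    #include <vector>

    struct Obj;  // some GC-managed heap object

    // Addresses of local pointers, so a moving collector can find and
    // rewrite them after relocating objects.
    static std::vector<Obj**> g_roots;

    class StackRoots {
     public:
      explicit StackRoots(std::initializer_list<Obj**> roots) : n_(roots.size()) {
        for (Obj** r : roots) g_roots.push_back(r);
      }
      ~StackRoots() {  // unregister on scope exit
        for (std::size_t i = 0; i < n_; ++i) g_roots.pop_back();
      }
     private:
      std::size_t n_;
    };

    // Every translated function has to do something like this:
    void SomeGeneratedFunction(Obj* a, Obj* b) {
      Obj* result = nullptr;
      StackRoots _roots({&a, &b, &result});  // registered for this scope
      // ... code that may allocate, and therefore may trigger a collection
      // that moves a, b, or result ...
    }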

Compilation Speed

Despite more lines of code, the compilation time didn't change much.

And I suspect that it's mainly affected by the #include structure, which we can optimize.
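
One common way to do that in C++ is to forward-declare types in headers instead of including their full definitions, so touching one header doesn't recompile the world. A sketch, with made-up names:

    // Sketch of a header that avoids heavy #includes (names are made up).
    //
    // Instead of:
    //   #include "word_eval.h"
    //   #include "state.h"
    // forward-declare the types that only appear by pointer or reference:
    namespace runtime {
      class WordEvaluator;
      class Mem;
    }

    class CommandEvaluator {
     public:
      CommandEvaluator(runtime::WordEvaluator* word_ev, runtime::Mem* mem)
          : word_ev_(word_ev), mem_(mem) {}

     private:
      runtime::WordEvaluator* word_ev_;  // full definition only needed in the .cc
      runtime::Mem* mem_;
    };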

Parsing Speed

I think the following change is within the parser benchmark noise, but it reminds me that we need to add a stable metric to the benchmarks (issue 871).

Compute Speed

These benchmarks are hard to summarize, but it looks like Oil did get slower, probably due to the extra code generated to track stack roots for the garbage collector.

Appendix B: Other Release Metrics

Spec Tests: OSH and Oil

These tests measure the correctness of the Python "reference build" / "executable spec". We have steady progress on both OSH:

and the Oil language:

Lines of Source Code

We have a few hundred more significant lines of code:

And physical lines of code:

Runtime Benchmarks (for the slow build)

The slowness shown in the following runtime benchmarks is why oil-native exists. However, I noticed a surprising speedup, which seems to hold for releases 0.8.5 through 0.8.7 as well:

Honestly, I have no idea what happened here. All I can say is that this isn't the first time I've seen something unusual while keeping track of performance over the years. I almost always learn something new when I look carefully, but I want to spend my attention on other things now.

It doesn't matter since both results are slow, and we care about the speed of oil-native. Off the top of my head, one thing that changed in January was adding enhanced xtrace, but I'd expect that to make things slower, not faster.


OK, that's it for the metrics. Feel free to leave a comment if you have any questions about this post!