mycpp

This is a Python-to-C++ translator based on MyPy. It only handles the small subset of Python that we use in Oils.

It's inspired by both mypyc and Shed Skin. These posts give background:

As of March 2024, the translation to C++ is done. So it's no longer experimental!

However, it's still pretty hacky. This doc exists mainly to explain the hacks. (We may want to rewrite mycpp as "yaks", although it's low priority right now.)


Source for this doc: mycpp/README.md. The code is all in mycpp/.

Table of Contents
Instructions
Translating and Compiling oils-cpp
Notes on the Algorithm / Architecture
mycpp Idioms / "Creative Hacks"
with {,tag,str_}switch → Switch statement
valUP_valval Downcasting pattern
Python context manager → C++ constructor and destructor (RAII)
MyPy "Shimming" Technique
NewDict() for ordered dicts
StackArray[T]
BigInt
ByteAt(), ByteEquals(), ...
File / LineReader / BufWriter
Fast JSON - avoid intermediate allocations
Limitations Requiring Source Rewrites
WARNING: Assumptions Not Checked
Global Constants Can't Be Mutated
Gotcha about Returning Variants (Subclasses) of a Type
Exceptions Can't Leave Destructors / Python __exit__
More Translation Notes
Hacky Heuristics
Hacky Hard-Coded Names
Major Features
Minor Translations
Optimizations
Rooting Policy
The mycpp Runtime
Differences from CPython
C++ Notes
Gotchas
Minor Features Used
Not Used

Instructions

Translating and Compiling oils-cpp

Running mycpp is best done on a Debian / Ubuntu-ish machine. Follow the instructions at https://github.com/oilshell/oil/wiki/Contributing to create the "dev build" first, which is DISTINCT from the C++ build. Make sure you can run:

oil$ build/py.sh all

This will give you a working shell:

oil$ bin/osh -c 'echo hi'  # running interpreted Python
hi

To run mycpp, we will build Python 3.10, clone MyPy, and install MyPy's dependencies. First install packages:

# We need libssl-dev, libffi-dev, zlib1g-dev to bootstrap Python
oil$ build/deps.sh install-ubuntu-packages

Then fetch data, like the Python 3.10 tarball and MyPy repo:

oil$ build/deps.sh fetch

Then build from source:

oil$ build/deps.sh install-wedges

To build oil-native, use:

oil$ ./NINJA-config.sh
oil$ ninja              # translate and compile, may take 30 seconds

oil$ _bin/cxx-asan/osh -c 'echo hi'  # running compiled C++ !
hi

To run the tests and benchmarks:

oil$ mycpp/TEST.sh test-translator
... 200+ tasks run ...

If you have problems, post a message on #oil-dev at https://oilshell.zulipchat.com. Not many people have contributed to mycpp, so I can use your feedback!

Related:

Notes on the Algorithm / Architecture

There are four passes over the MyPy AST.

(1) const_pass.py: Collect string constants

Turn turn the constant in myfunc("foo") into top-level GLOBAL_STR(str1, "foo").

(2) Three passes in cppgen_pass.py.

(a) Forward Declaration Pass.

class Foo;
class Bar;

This pass also determines which methods should be declared virtual in their declarations. The virtual keyword is written in the next pass.

(b) Declaration Pass.

class Foo {
  void method();
};
class Bar {
  void method();
};

More work in this pass:

(c) Definition Pass.

void Foo:method() {
  ...
}

void Bar:method() {
  ...
}

Note: I really wish we were not using visitors, but that's inherited from MyPy.

mycpp Idioms / "Creative Hacks"

Oils is written in typed Python 2. It will run under a stock Python 2 interpreter, and it will typecheck with stock MyPy.

However, there are a few language features that don't map cleanly from typed Python to C++:

So this describes the idioms we use. There are some hacks in mycpp/cppgen_pass.py to handle these cases, and also Python runtime equivalents in mycpp/mylib.py.

with {,tag,str_}switch → Switch statement

We have three constructs that translate to a C++ switch statement. They use a Python context manager with Xswitch(obj) ... as a little hack.

Here are examples like the ones in mycpp/examples/test_switch.py. (ninja mycpp-logs-equal translates, compiles, and tests all the examples.)

Simple switch:

myint = 99
with switch(myint) as case:
    if case(42, 43):
        print('forties')
    else:
        print('other')

Switch on object type, which goes well with ASDL sum types:

val = value.Str('foo)  # type: value_t
with tagswitch(val) as case:
    if case(value_e.Str, value_e.Int):
        print('string or int')
    else:
        print('other')

We usually need to apply the UP_val pattern here, described in the next section.

Switch on string, which generates a fast two-level dispatch -- first on length, and then with str_equals_c():

s = 'foo'
with str_switch(s) as case:
    if case("foo")
        print('FOO')
    else:
        print('other')

valUP_valval Downcasting pattern

Summary: variable names like UP_* are special in our Python code.

Consider the downcasts marked BAD:

val = value.Str('foo)  # type: value_t

with tagswitch(obj) as case:
    if case(value_e.Str):
        val = cast(value.Str, val)  # BAD: conflicts with first declaration
        print('s = %s' % val.s)

    elif case(value_e.Int):
        val = cast(value.Int, val)  # BAD: conflicts with both
        print('i = %d' % val.i)

    else:
        print('other')

MyPy allows this, but it translates to invalid C++ code. C++ can't have a variable named val, with 2 related types value_t and value::Str.

So we use this idiom instead, which takes advantage of local vars in case blocks in C++:

val = value.Str('foo')  # type: value_t

UP_val = val  # temporary variable that will be casted

with tagswitch(val) as case:
    if case(value_e.Str):
        val = cast(value.Str, UP_val)  # this works
        print('s = %s' % val.s)

    elif case(value_e.Int):
        val = cast(value.Int, UP_val)  # also works
        print('i = %d' % val.i)

    else:
        print('other')

This translates to something like:

value_t* val = Alloc<value::Str>(str42);
value_t* UP_val = val;

switch (val->tag()) {
    case value_e::Str: {
        // DIFFERENT local var
        value::Str* val = static_cast<value::Str>(UP_val);
        print(StrFormat(str43, val->s))
    }
        break;
    case value_e::Int: {
        // ANOTHER DIFFERENT local var
        value::Int* val = static_cast<value::Int>(UP_val);
        print(StrFormat(str44, val->i))
    }
        break;
    default:
        print(str45);
}

This works because there's no problem having different variables with the same name within each case { } block.

Again, the names UP_* are special. If the name doesn't start with UP_, the inner blocks will look like:

    case value_e::Str: {
        val = static_cast<value::Str>(val);  // BAD: val reused
        print(StrFormat(str43, val->s))
    }

And they will fail to compile. It's not valid C++ because the superclass value_t doesn't have a field val->s. Only the subclass value::Str has it.

(Note that Python has a single flat scope per function, while C++ has nested scopes.)

Python context manager → C++ constructor and destructor (RAII)

This Python code:

with ctx_Foo(42):
  f()

translates to this C++ code:

{
  ctx_Foo tmp(42);
  f()

  // destructor ~ctx_Foo implicitly called
}

MyPy "Shimming" Technique

We have an interesting way of "writing Python and C++ at the same time":

  1. First, all Python code must pass the MyPy type checker, and run with a stock Python 2 interpreter.
  2. We translate most .py files to C++, except some files, in particular mycpp/mylib.py and files starting with py like core/{pyos.pyutil}.py.
  3. In C++, we can substitute custom implementations with the properties we want, like Dict<K, V> being ordered, BigInt being distinct from C int, BufWriter being efficient, etc.

The MyPy type system is very powerful! It lets us do all this.

NewDict() for ordered dicts

Dicts in Python 2 aren't ordered, but we make them ordered at runtime by using mylib.NewDict(), which returns collections_.OrderedDict.

The static type is still Dict[K, V], but change the "spec" to be an ordered dict.

In C++, Dict<K, V> is implemented as an ordered dict. (Note: we don't implement preserving order on deletion, which seems OK.)

StackArray[T]

TODO: describe this when it works.

BigInt

ByteAt(), ByteEquals(), ...

Hand optimization to reduce 1-byte strings. For IFS algorithm, LooksLikeGlob(), GlobUnescape().

File / LineReader / BufWriter

TODO: describe how this works.

Can it be more type safe? I think we can cast File to both LineReader and BufWriter.

Or can we invert the relationship, so File derives from both LineReader and BufWriter?

Fast JSON - avoid intermediate allocations

Limitations Requiring Source Rewrites

mycpp itself may cause limitations on expressiveness, or the C++ language may be able express what we want.

Also see mycpp/examples/invalid_* for Python code that fails to translate.

WARNING: Assumptions Not Checked

Global Constants Can't Be Mutated

We translate top level constants to statically initialized C data structures (zero startup cost):

gStr = 'foo'   
gList = [1, 2]  # type: List[int]
gDict = {'bar': 42}  # type: Dict[str, int]

Even though List and Dict are mutable in general, you should NOT mutate these global instances! The C++ code will break at runtime.

Gotcha about Returning Variants (Subclasses) of a Type

MyPy will accept this code:

if cond:
  sig = proc_sig.Open  # type: proc_sig_t
                       # bad because mycpp HOISTS this
else:
  sig = proc_sig.Closed.CreateNull()
  sig.words = words    # assignment fails
return sig

It will translate to C++, but fail to compile. Instead, rewrite it like this:

sig = None  # type: proc_sig_t
if cond:
  sig = proc_sig.Open  # type: proc_sig_t
                       # bad because mycpp HOISTS this
else:
  closed = proc_sig.Closed.CreateNull()
  closed.words = words    # assignment fails
  sig = closed
return sig

Exceptions Can't Leave Destructors / Python __exit__

Context managers like with ctx_Foo(): translate to C++ constructors and destructors.

In C++, a destructor can't "leave" an exception. It results in a runtime error.

You can throw and CATCH an exception WITHIN a destructor, but you can't let it propagate outside.

This means you must be careful when coding the __exit__ method. For example, in vm::ctx_Redirect, we had this bug due to IOError being thrown and not caught when restoring/popping redirects.

To fix the bug, we rewrote the code to use an out param List[IOError_OSError].

Related:

More Translation Notes

Hacky Heuristics

Hacky Hard-Coded Names

These are signs of coupling between mycpp and Oils, which ideally shouldn't exist.

Issue on mycpp improvements: https://github.com/oilshell/oil/issues/568

Major Features

Minor Translations

Optimizations

Rooting Policy

The translated code roots local variables in every function

StackRoots _r({&var1, &var2});

We have two kinds of hand-written code:

  1. Methods like Str::strip() in mycpp/
  2. OS bindings like stat() in cpp/

Neither of them needs any rooting! This is because we use manual collection points in the interpreter, and these functions don't call any functions that can collect. They are "leaves" in the call tree.

The mycpp Runtime

The mycpp translator targets a runtime that's written from scratch. It implements garbage-collected data structures like:

It also has functions based on CPython's:

Differences from CPython

C++ Notes

Gotchas

Minor Features Used

In addition to classes, templates, exceptions, etc. mentioned above, we use:

Not Used

Generated on Fri, 31 May 2024 01:26:56 -0400