Pretty Printing ASTs with ASDL, and Metaprogramming

2017-01-21

Prior to ASDL, oil used hand-written Python classes to represent the shell AST, and ad hoc print() statements to show their debug representation.

Now our ASDL implementation can print an instance of any type automatically. For example, compare:

the old text format for debootstrap, and
the new color HTML format

(original shell source)

The new format is more readable, consistent, and complete. We can also print ASTs in plain text or ANSI-colored text.

Abbreviated AST Format

This program:

echo "hello $name"

is represented with this AST:

(SimpleCommand
  words: [
    (CompoundWord parts:[(LiteralPart token:(token id:Lit_Chars val:echo span_id:0))])
    (CompoundWord
      parts: [
        (DoubleQuotedPart
          parts: [
            (LiteralPart token:(token id:Lit_Chars val:"hello " span_id:3))
            (SimpleVarSub token:(token id:VSub_Name val:"$name" span_id:4))
          ]
        )
      ]
    )
  ]
)

That is, we have a SimpleCommand node with two children:

A CompoundWord with a LiteralPart, which has a token echo
A CompoundWord with a DoubleQuotedPart, which has
- A LiteralPart with a token "hello "
- A SimpleVarSub with a token $name

A span is a subsequence of characters in a line. They're numbered with a span_id from 0 to n-1, and concatenating the spans reproduces the source file exactly. Spans are used for error messages and for automatic conversion of osh source to oil source.

This representation is complete but can be unwieldy, so ASDL has an application-specific hook to abbreviate nodes.

Except for these locations, the abbreviated format has the same information, and is more readable:

(C {(echo)} {(DQ ("hello ") ($ VSub_Name "$name"))})

A partial list of abbreviations:

SimpleCommand → (C ...)
CompoundWord →> { ... };
DoubleQuotedPart → (DQ ...)
LiteralPart → (echo) or (Lit_Other "=")

Pretty Printing Requires Metaprogramming

I don't want to write print statements for every AST node type I define — I want "generic" pretty printing, which works for any type.

As shown above, oil's ASDL implementation now supports this. Notably, it requires treating types as objects. In other words, it requires reflection, a kind of metaprogramming.

The end of the last post outlined some metaprogramming topics. I don't have a big conclusion now, but here are two related thoughts:

When Polymorphism Fails: Steve Yegge mentions OCaml's lack of "polymorphic print". I encountered this issue in oil. There isn't a straightforward way to "generically" import ASDL trees into OCaml. You have to write a code generator or use metaprogramming like Camlp4.

I think there is room for a dynamic language with algebraic data types, like Pyret.
printf() can print any type, but it's not type-safe. Using %s instead of %d can crash your program. This is an instance of the type checking vs. metaprogramming problem.

So, in a statically-typed language, you have to "jump out of" the language into a meta-language to implement something as basic as printing.

printf() and scanf() are both little languages on top of C, and in fact printf is Turing complete! This is a mistake. I started a Language Design and Theory of Computation page which surveys similar issues. A lot of the hard work was done by Accidentally Turing Complete.

I'm trying to stay on the track I laid out in the last post, so the next post should be an update on the source code size.