Up to 4 Pretty Printers?

(March 2024)

Oils parses a lot of text, and it's becoming apparent than we need a print a lot too! The text should be nicely formatted because a shell is a user interface.

This doc describes 4 possible pretty printers in Oils. Traditional shells don't have appear to have any pretty printing.

Table of Contents
Screenshots
Background - go fmt style versus PPL style
Why PPL style?
Warning
Four Printers - What and Why?
Print YSH data types in a JSON-like format
Replace our ad hoc Zephyr ASDL pretty printer
OSH-YSH Code Formatter
Experimental: Export the Oils "Syntax Graph" to Users with "NIL8"
Implementation Notes
Do the printers depend on each other?
Code Skeleton
Design Questions
Non-Wrapping Printers aka "Indenters" - same PPL IR?
Columnar Layouts (spending more time)
Two Levels of Coarse Parsing for YSH? (not at first)
NIL8 - Uses cases for both Code and Data
Conclusion
Related
Docs
Fun Computer Science Problems in Oils
Appendix: Older Notes
Dynamically Typed YSH Data (value_t)
Statically Typed ASDL Data

Screenshots

Let's be concrete first, because there's a lot of brainstorming below.

YSH prints its JSON-like data structures with the = keyword, which takes a Python-like expression on the right:

ysh issues.json

Right now, it looks bad on big data structures. It should look something like nodejs or jq:

node.js issues.json

jq issues.json

We may want to omit the quotes, like nodejs. (This syntax isn't meant to be parsed. JSON8 may have unquoted dict keys, although it's not essential).

Background - go fmt style versus PPL style

To back up a bit, I'm writing this doc organize my thoughts, and to explain problems and requirements to contributors.

There are at least two schools of thought on pretty printers, which this lobste.rs thread has a good discussion of:

More comments on a blog post by Justin Pombrio (which I circulated):

Let's call the two styles the "go fmt style" and the "PPL style" (functional pretty printing language).

I was probably "biased" toward go fmt, because the two formatters we actually use in Oils are influenced by it:

  1. clang-format for our C++ code.
  2. yapf for our Python code.

(Details: they have line wrapping algorithms, while go fmt doesn't. I'm not calling them "graph search" printers, because I think of line wrapping as a subproblem of pretty printing.)

Why PPL style?

However, Justin's post helped me understand Wadler's printer, a popular example of the PPL style. This style might have some advantages for Oils:

(Does that last idea work?)

Warning

Sometimes I pile on too many requirements, which I mentioned in the latest release announcement:

We should do the simplest things first, and I think the PPL approach will allow that.

BTW there are many threads on #data-languages on Zulip where I'm brainstorming and learning about pretty printing.

Four Printers - What and Why?

Here's a sketch of what I think we need. It goes from concreteexperimental and research-y.

Print YSH data types in a JSON-like format

What is it? This is the = keyword shown in the screenshots above. (BTW, Lua uses the same syntax = to evaluate expressions.)

Motivation: We should look as good as node.js or jq.


Currently we use our JSON printer with the options SHOW_NON_DATA | SHOW_CYCLES.

Example:

ysh$ var d = {}
ysh$ setvar d.eggex = /dot+/  # Eggex object

ysh$ = d
(Dict 0x7feb87cb4050)   {"eggex":<Eggex 0x7feb87dbfd00>}

ysh$ setvar d.cycle = d

ysh$ = d
(Dict 0x7feb87cb4050)   {"eggex":<Eggex 0x7feb87dbfd00>,"cycle":{ --> 0x7feb87cb4050 }}

We should replace this with a new pretty printer that wraps lines, and has COLOR.

Replace our ad hoc Zephyr ASDL pretty printer

What is it? Zephyr ASDL is the statically typed schema language we use to implement Oils. It's "one level down" from the shell.

We used it to define the syntax of shell with algebraic data types. We create a lossless syntax tree, which is also an IR for shell.

Motivation: We already wrote an ad hoc pretty printer, and it should be replaced! It tries to fit records on a single line, and if that fails, it uses multiple lines. I think it's slow.

If we already have a YSH data printer, this printer should "obviously" be unified with it. We should have a nice separation of policy and mechanism.


Use osh -n myscript.sh to see what it does:

osh -n

Notes:

OSH-YSH Code Formatter

What is it? A formatter for shell code. I think it's more essential to have a YSH formatter, but an OSH formatter is also possible. They both use the same lossless syntax tree.

Motivation - Formatters make a new language easier to use, and there are many users who don't know shell well.

For example, I don't know TypeScript well, and I had a good experience with deno fmt. It reduced the mental load of adopting a new tool.


Justin had a nice idea on on lobste.rs - we should create coarse tree with these elements:

Why? We don't don't want to take responsibility for every formatting decision!

I actually think the command mode / statement formatter should be non-wrapping, while expressions can wrap. Commands will likely be more common than expressions in most YSH code.


The formatter will be invoked with ysh --tool format myfile.ysh.

It can also be used with the output of osh --tool ysh-ify, which roughly translates OSH to YSH (doesn't preserve semantics). This may help generate test data, since there's plenty of shell code in the wild, but not much YSH code.

Experimental: Export the Oils "Syntax Graph" to Users with "NIL8"

What is it? A more human AND machine-readable format for the syntax tree, which is actually a graph.

Motivation: The pretty-printed AST could be a parseable format. Allow users to reuse all the hard work we did parsing shell!


Printing raw ASDL data is useful, but it could be more readable with custom logic to print the natural layers of the graph. There are 4 logical layers in frontend/syntax.asdl:

  1. source_t describes whether shell code comes from foo.sh or ysh -c 'echo mycode'
  2. SourceLine represents physical lines of code
  3. Token refers to portions of lines
  4. The syntax tree of command_t word_t word_part_t expr_t. The leaves are tokens.

And instead of a pretty format meant for humans, we may want to print an s-expression-like format I'm calling "NIL8".

NIL8 can be parsed. You may want to write tree-shaking code to deploy YSH into containers.

More on NIL8 below. Again, it's experimental.

Implementation Notes

Do the printers depend on each other?

Risks:

Code Skeleton

I added some stubs in the code:

To generate Python code from the ASDL schema, run build/py.sh all. Otherwise, Oils is a plain Python 2 program, with a few C extensions.

C++ translation is a separate step, and it's now pretty polished.

For new contributors:

There is also a stub for the formatter:

Design Questions

This section has some less concrete thoughts.

Non-Wrapping Printers aka "Indenters" - same PPL IR?

I think the PPL IRs are also useful if you're not line wrapping? You can just fix indentation.

Columnar Layouts (spending more time)

nodejs does this in its console.log().

python node.js

Future work? We should get the basic pretty printer working first.

Two Levels of Coarse Parsing for YSH? (not at first)

Making the coarse tree has some similarity to syntax highlighting. I wrote a simple syntax highlighter for 5+ languages called micro-syntax, and it should support YSH too.

Sketch:

  1. First make the really coarse tree, something like: Comment | Code | StringLiteral
  2. Then make a less coarse tree for pretty printing:

Then do a trivial linear pass to fix up indentation. The { } or do done tokens determine indentation.


Though honestly it's probably better to just reuse our elaborate parser at first. I like to "compress" different algorithms together, but it may not be worth it here.

NIL8 - Uses cases for both Code and Data

What is "NIL8"? We don't know if it's a good idea yet, but it may be part of J8 Notation.

Think:

If NIL8 can represent both the lossless syntax tree and a new IR for a mycpp rewrite ("yaks"), that's a good test of the design.

Note that the AST is a statically typed data structure, which means we may also want to export the ASDL schema as NIL8!

Links:

At a high level, we're trying to nudge users toward a small set of syntaxes for shell-like programming, rather than inventing ad hoc syntax every time. String literals are a pain point: people often implement them badly, or not at all.

Conclusion

I think we should have shared infrastructure for 3 printers:

  1. YSH data structures
  2. ASDL data structures
  3. OSH and YSH code

And then there's this idea of "replacing" or rationalizing the ASDL syntax tree with "NIL8":

Related

Docs

Fun Computer Science Problems in Oils

Pretty printing is adjacent to other fun problems in Oils, like GC performance, "boxless" optimization, etc.

Things to think about:

(1) Unified Code Representation for Oils

We want to pack all these tools into Oils:

Related: Retrospective on the Go ast package.

(2) Yaks - mycpp from the bottom up

NIL8 leads into "Yaks", which is an IR for garbage collected C++.

(3) Pretty printing will cause many small allocations.

I think that naive implementations should be fast enough. If not, any slowness may be due to allocation, not necessarily the pretty printing algorithm itself.

Some solutions:

Appendix: Older Notes

This is about ref counting for printing graph-shaped data.

I no longer think this is as important. I think we should manually construct 4 layers of the graph, as described in the section on formatter (4).

Dynamically Typed YSH Data (value_t)

Similar to JSON / JSON8 printing, except we

  1. count references, and then print ... instead of repeating
  2. line wrap
  3. assign colors

Step 1: Count References

This is a global pass that computes a Dict[int, int]

object ID -> number of times referenced in the graph

The graph is specified by single root node, e.g. the argument to

pp value (obj)

Pass this dict into the second step.

Step 2: Convert To Homogeneous Representation

value.List   -> hnode.Compound with []
value.Dict   -> hnode.Compound with {}

null, true, false -> Atom
Cycle detected -> Atom, with { --> 4beef2 }
                             [ --> 4beef2 ]

Repetition:

{ key: { ... 4beef2 }, key2: { ... 4beef2 }

Or maybe omit the type, since strings don't have that:

{ key: ... 4beef2, key2: ... 4beef2 }

hnode Schema

The schema looks like this now?

hnode = 
  Atom(str s, color color) - # External objects can use this?
| Compound(hnode* items)

The length of 'str s' is the input to line wrapping.

Step 3: Figure out what's on each line

TODO: what's the heuristic here? Is it global?

Dynamic programming?

do we insert hnode.Newline() or something?

Statically Typed ASDL Data

Reduce it to the case above.

Step 1 - Ref Counting / Cycle Detection?

We do this all at once?

Because to convert to value.Record, you have to do cycle detection anyway.

And that's similar to ref counting.

Step 2 - ASDL records -> value.Record

value = 
    ...
  | Record(str type_name, Dict[str, value_t] fields)

The special "-" key can be used for JSON:

{"-": "command.Simple, "name": "hi"}

Though this loses some information, and it doesn't solve the problem with shared references. We would need Packle for that.

Step 2a: Optional Abbreviation?

Is this separate? Or part of step 2.

We need something between value.Record and hnode.Compound to do ABBREVIATION:

Also need nodes for

Steps 3 and 4 - Homogeneous Representation, Line Wrapping

Identical to the dynamically typed case above.

Generated on Sun, 10 Nov 2024 13:11:50 -0500