Oil is on GitHub

2016-11-19

As promised, I published the oil repo. The README tells you how to run it, run tests, and get an overview of the code.

The count.sh script shows various line counts:

$ ./count.sh all

BUILD/TEST AUTOMATION
...
  305 spec.sh
  394 wild.sh
 1053 total

SHELL TEST FRAMEWORK
  604 sh_spec.py

SHELL SPEC TESTS
  260 tests/var-sub.test.sh
  285 tests/array.test.sh
 3509 total

OIL UNIT TESTS
...
  451 osh/word_parse_test.py
 1135 osh/cmd_parse_test.py
 3058 total

OIL
...
  1148 osh/word_parse.py
  1505 osh/cmd_parse.py
 11497 total

So there are ~11,500 lines of code, ~7,000 lines of tests, and ~1,000 lines of developer scripts.

The tests contain a lot of the value — I expect them to last longer than the code, which will be ported to C++, hopefully with a fair amount of bespoke code generation.

sh_spec.py is a test framework that runs shell snippets against multiple shells and makes assertions on stdout, stderr, and the exit code.
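To make the idea concrete, here is a minimal sketch of that kind of cross-shell check. It is not the interface of sh_spec.py: the function names run_snippet and check_case are hypothetical, and this version only compares stdout and the exit code, although stderr is captured and could be checked the same way.

```python
import subprocess


def run_snippet(shell, code):
    """Run code with `shell -c` and return (stdout, stderr, exit status)."""
    proc = subprocess.run(
        [shell, '-c', code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True)
    return proc.stdout, proc.stderr, proc.returncode


def check_case(shells, code, stdout='', status=0):
    """Run one snippet under every shell; return a list of failure messages."""
    failures = []
    for shell in shells:
        out, _err, actual_status = run_snippet(shell, code)
        if out != stdout:
            failures.append('%s: stdout %r, expected %r' % (shell, out, stdout))
        if actual_status != status:
            failures.append(
                '%s: status %d, expected %d' % (shell, actual_status, status))
    return failures


if __name__ == '__main__':
    # Only `sh` is assumed to be installed; add bash, dash, etc. as available.
    for msg in check_case(['sh'], 'echo hi; exit 3', stdout='hi\n', status=3):
        print(msg)
```

An empty list means every shell agreed with the expectation, which is what makes the same test file reusable against a new implementation.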

Lexing, Parsing, and the AST

Here is another way to view the code:

$ ./count.sh parser

    78 osh/parse_lib.py
   197 osh/arith_parse.py
   319 osh/bool_parse.py
   410 osh/lex.py
  1148 osh/word_parse.py
  1505 osh/cmd_parse.py
  3657 total

This first chunk of ~3600 lines is the algorithm for parsing the osh language. As I wrote two days ago, it's three recursive descent parsers and a Pratt parser.

   80 core/bool_node.py
  101 core/arith_node.py
  472 core/tokens.py
  557 core/cmd_node.py
  997 core/word_node.py
 2207 total

This chunk of ~2200 lines is the AST representation. word_node.py is big because it contains "smart" polymorphic methods, not just "dumb" AST nodes.

  229 core/lexer.py
  313 core/tdop.py
  542 total

And these two files are the lexing and parsing infrastructure. I wrote about tdop.py in the Pratt parsing post, but I need to write more about the lexer.
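For readers who haven't seen Pratt parsing, here is a small standalone sketch of the technique. It is not the code in tdop.py: the token set, the binding power table, and the names tokenize, Parser, and parse_expr are all made up for illustration, and the AST is just nested tuples.

```python
import re

TOKEN_RE = re.compile(r'\s*(\d+|\*\*|[-+*/()])')

# Hypothetical table: operator -> (left binding power, right binding power).
# A right power above the left one gives left associativity; below gives right.
INFIX = {
    '+': (10, 11), '-': (10, 11),
    '*': (20, 21), '/': (20, 21),
    '**': (31, 30),
}
UNARY_MINUS_BP = 25


def tokenize(s):
    s = s.rstrip()
    tokens, pos = [], 0
    while pos < len(s):
        m = TOKEN_RE.match(s, pos)
        if not m:
            raise SyntaxError('bad character at position %d' % pos)
        tokens.append(m.group(1))
        pos = m.end()
    tokens.append('EOF')
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.i = 0

    def peek(self):
        return self.tokens[self.i]

    def next(self):
        tok = self.tokens[self.i]
        self.i += 1
        return tok

    def parse(self, min_bp=0):
        # Prefix position: a number, a parenthesized expression, or unary minus.
        tok = self.next()
        if tok.isdigit():
            left = int(tok)
        elif tok == '(':
            left = self.parse(0)
            if self.next() != ')':
                raise SyntaxError('expected )')
        elif tok == '-':
            left = ('neg', self.parse(UNARY_MINUS_BP))
        else:
            raise SyntaxError('unexpected token %r' % tok)

        # Infix position: keep consuming operators that bind tightly enough.
        while True:
            op = self.peek()
            if op not in INFIX:
                break
            left_bp, right_bp = INFIX[op]
            if left_bp < min_bp:
                break
            self.next()
            left = (op, left, self.parse(right_bp))
        return left


def parse_expr(s):
    p = Parser(tokenize(s))
    tree = p.parse(0)
    if p.peek() != 'EOF':
        raise SyntaxError('unexpected token %r' % p.peek())
    return tree


if __name__ == '__main__':
    print(parse_expr('1 + 2 * 3'))    # ('+', 1, ('*', 2, 3))
    print(parse_expr('2 ** 3 ** 2'))  # ('**', 2, ('**', 3, 2))
```

The whole precedence and associativity story lives in the INFIX table, which is why this style suits arithmetic sublanguages with many operators.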

In particular, a lexical state is now called a lexer mode, because there's another lexer hint mechanism that's also stateful. osh ended up requiring a whopping thirteen lexer modes (up from eight a month ago).
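As a rough illustration of what a lexer mode means, here is a toy two-mode lexer. It is only a sketch under my own assumptions: the mode names OUTER and DQ, the token types, and the rule that the caller passes the mode on every read are invented for this example and are far simpler than what core/lexer.py does.

```python
import re

# Each hypothetical mode has its own ordered list of (pattern, token type).
LEXER_DEF = {
    'OUTER': [
        (re.compile(r'[a-zA-Z0-9_/.-]+'), 'LITERAL'),
        (re.compile(r'[ \t]+'), 'SPACE'),
        (re.compile(r'"'), 'DQUOTE'),
    ],
    'DQ': [
        # Inside double quotes, spaces are ordinary characters.
        (re.compile(r'[^"$\\]+'), 'LITERAL'),
        (re.compile(r'\$[a-zA-Z_][a-zA-Z0-9_]*'), 'VAR'),
        (re.compile(r'"'), 'DQUOTE'),
    ],
}


class Lexer:
    def __init__(self, line):
        self.line = line
        self.pos = 0

    def read(self, mode):
        """Return the next (type, text) token, lexed under the given mode."""
        if self.pos >= len(self.line):
            return ('EOF', '')
        for pattern, tok_type in LEXER_DEF[mode]:
            m = pattern.match(self.line, self.pos)
            if m:
                self.pos = m.end()
                return (tok_type, m.group(0))
        raise ValueError(
            'no rule in mode %s at position %d' % (mode, self.pos))


def lex_all(line):
    """Stand-in for a parser: flip modes whenever a double quote is seen."""
    lexer = Lexer(line)
    mode = 'OUTER'
    tokens = []
    while True:
        tok = lexer.read(mode)
        if tok[0] == 'EOF':
            return tokens
        tokens.append(tok)
        if tok[0] == 'DQUOTE':
            mode = 'DQ' if mode == 'OUTER' else 'OUTER'


if __name__ == '__main__':
    for tok in lex_all('echo "hi $x"'):
        print(tok)
```

The point is that the same characters produce different tokens depending on the mode: the space after echo is a SPACE token, while the space after hi is part of a LITERAL.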

And I need to explain what the little-known tool re2c does -- in some sense it's the foundation of the parser and what enabled me to write it.

Although my parser is more compact than the bash parser (details here), ~5800 lines is still big. I hope to express the oil language in a more compact way, but that remains to be figured out.

Execution

$ ./count.sh runtime
   73 core/arith_eval.py
  122 core/value.py
  230 core/bool_eval.py
  370 core/builtin.py
  545 core/process.py
  827 core/cmd_exec.py
  851 core/word_eval.py
 3018 total

Parallel to the four parsers are four evaluators: arithmetic, boolean, command, and word. They interpret the AST in the obvious way, taking care to give good runtime error messages.
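Here is a sketch of what "the obvious way" looks like for the arithmetic sublanguage. It is not taken from core/arith_eval.py: the tuple-shaped nodes, the ArithEvalError class, and the eval_arith function are hypothetical, chosen only to show a recursive tree walk that reports runtime errors in terms the user can act on.

```python
class ArithEvalError(Exception):
    """Raised for runtime errors, with a message aimed at the script author."""


def eval_arith(node, variables):
    """Evaluate a tiny arithmetic AST made of tuples.

    Hypothetical node shapes:
      ('num', 42)
      ('var', 'x')
      ('binop', '+', left, right)
    """
    kind = node[0]
    if kind == 'num':
        return node[1]
    if kind == 'var':
        name = node[1]
        if name not in variables:
            raise ArithEvalError('undefined variable %r' % name)
        return variables[name]
    if kind == 'binop':
        _, op, left, right = node
        lhs = eval_arith(left, variables)
        rhs = eval_arith(right, variables)
        if op == '+':
            return lhs + rhs
        if op == '-':
            return lhs - rhs
        if op == '*':
            return lhs * rhs
        if op == '/':
            if rhs == 0:
                raise ArithEvalError('division by zero: %d / 0' % lhs)
            # Truncate toward zero, as C-style integer division does.
            quotient = abs(lhs) // abs(rhs)
            return quotient if (lhs < 0) == (rhs < 0) else -quotient
        raise ArithEvalError('unknown operator %r' % op)
    raise ArithEvalError('unknown node type %r' % kind)


if __name__ == '__main__':
    tree = ('binop', '+', ('num', 1), ('binop', '*', ('var', 'x'), ('num', 3)))
    print(eval_arith(tree, {'x': 2}))  # 7
```

The other three evaluators would follow the same shape, each walking its own node types and calling into the others where the sublanguages nest.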

They make use of a runtime that handles processes, file descriptors, builtin commands, variables, and data types.

The evaluators and runtime are less complete than the parser, but they can execute some real scripts. I hope that open source contributions will help fill out the evaluators and runtime. It seems like parsing is about 60% of the work of a shell, and execution is 40%.

Completion

The component missing from these counts is completion.py, which is the beginning of a pretty decent completion engine. The completion engine is interesting because it makes use of the parsers and evaluators as libraries:

It needs to parse incomplete lines every time you press TAB, without disrupting the interactive parser currently in progress (e.g. think about typing echo one; f() { echo <TAB>).
It needs to run user-defined completion functions when you hit TAB.

Conclusion

Overall, the architecture is an AST interpreter, where the AST is heterogeneous. That is, it has four interleaved sublanguages. Every shell I've looked at is also an AST interpreter, written in C, but only for the command sublanguage. The other three sublanguages are implemented in an ad hoc manner, interleaving parsing and execution.

Now that the code is public, tomorrow I will write about project priorities for the next few months.