(This is the update on source code size mentioned in the Blog TODO stack.)
In Analogies for ASDL, I counted lines of code in the osh
front
end prior to integrating ASDL. There were ~3500 lines in the lexer and
parser, and ~1900 lines to represent the AST, for a total of ~5400
lines.
A few weeks ago, I counted again. As shown below, there are ~1000 fewer
lines, and tokens.py
and {expr,cmd,word}_node.py
are gone. That is, there
are no more hand-written AST nodes, and the basic Token
type is now
represented in ASDL.
# as of 12/16/2016              # as of 1/6/2017
$ ./count.sh parser             $ ./count.sh parser

Lexer/Parser                    Lexer/Parser
  77 osh/parse_lib.py             77 osh/parse_lib.py
 196 osh/arith_parse.py          167 osh/arith_parse.py
 291 osh/bool_parse.py           264 osh/bool_parse.py
 334 osh/lex.py                  334 osh/lex.py
1144 osh/word_parse.py           406 core/word.py
1455 osh/cmd_parse.py           1096 osh/word_parse.py
3497 total                      1442 osh/cmd_parse.py
                                3786 total

AST and IDs                     AST and IDs
  80 core/tokens.py              129 osh/osh.asdl
  99 core/expr_node.py           456 core/id_kind.py
 441 core/id_kind.py             585 total
 491 core/cmd_node.py
 777 core/word_node.py
1888 total

TOTAL: 5385                     TOTAL: 4371
I'm not counting the ASDL implementation itself, which takes ~1800 lines of code. On the other hand, it also implements new features like the oheap encoding, C++ code generation and generic pretty printing.
Observations:
(1) There is a new core/word.py
file. What used to be methods on the
CompoundWord
class are now free functions:
# OLD:
w.HasArrayPart()
w.AsFuncName()
w.AsArithVarName()
w.LooksLikeAssignment()
# NEW (word is a module, w is an object):
word.HasArrayPart(w)
word.AsFuncName(w)
word.AsArithVarName(w)
word.LooksLikeAssignment(w)
In other words, I've switched from an object-oriented style with methods to a functional style. This might seem like a step backward from the syntactic point of view, but I think that's a minor difference compared to the architectural advantages.
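To make the new style concrete, here is a minimal sketch of how such a module of free functions can be laid out. The fields and the NumParts function below are illustrative only; they are not the actual definitions in core/word.py or osh.asdl.

# word.py -- hypothetical sketch of the "module of free functions" style.
# The node class stands in for an ASDL-generated type; it only holds data.

class CompoundWord(object):
    def __init__(self, parts=None):
        self.parts = parts or []   # a list of word parts (illustrative)

def NumParts(w):
    # A free function that inspects the node, instead of a method on it.
    return len(w.parts)

Call sites then read word.NumParts(w), matching the NEW column above, and the node classes stay as plain data that the ASDL tools can generate, serialize, and pretty-print.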
(2) I've called the AST the backbone of the interpreter. It's now
represented by just 600 lines of code in core/id_kind.py
and
osh.asdl. Together, these two files form a concise description of every
element of the language.
So now I think of them as the thin waist of the interpreter. They divide the code cleanly between the front end and the back end. This helps in a few ways.
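As a rough sketch of what that division means, the toy code below (not oil's actual code) has a front end that only produces nodes and a back end that only consumes them; the tiny Id and TokenNode definitions stand in for core/id_kind.py and the types generated from osh.asdl.

# Toy "thin waist": the two halves share only the node and id definitions.

class Id(object):                  # stand-in for core/id_kind.py
    Lit_Chars = 1

class TokenNode(object):           # stand-in for an ASDL-generated node
    def __init__(self, id_, val):
        self.id = id_
        self.val = val

# Front end: turns text into nodes; knows nothing about execution.
def Parse(s):
    return [TokenNode(Id.Lit_Chars, part) for part in s.split()]

# Back end: consumes nodes; knows nothing about the concrete syntax.
def Execute(nodes):
    for node in nodes:
        if node.id == Id.Lit_Chars:
            print(node.val)

Execute(Parse('echo hello'))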
Of course, this is exactly the kind of modular architecture you would learn about in school. But "real programs" rarely have a clean architecture.
For example, I've spent significant time reading the source of various shells,
including bash, which has ~150K
lines of C code. Bash even has a great
article in the AOSA book. But I would say that the architecture it describes covers only a small portion of the program, and the portion that it does cover is necessarily inaccurate.
In any case, bash cannot have this clean "thin waist" architecture, because it is not statically parsed. Execution and parsing are interleaved, and the same functions are called in both stages.
For example, issue 3 on Github is a head-scratcher. It's an
undocumented case of parsing and evaluating code inside strings at runtime,
without an explicit eval.
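The toy code below illustrates the general difference between the two styles, in Python rather than C. The made-up run-string construct merely stands in for code discovered inside a string at runtime; nothing here models bash's internals or issue 3 itself.

# Style 1: static parsing -- a one-way pipeline: parse once, then run.
def Parse(src):
    return src.split(';')            # pretend this builds a full AST

def Execute(nodes):
    for n in nodes:
        print('run: ' + n.strip())

Execute(Parse('echo a; echo b'))

# Style 2: interleaved -- running a node re-enters the parser on text
# discovered at runtime, so the stages can't be cleanly separated.
def ExecuteInterleaved(nodes):
    for n in nodes:
        if n.strip().startswith('run-string '):
            ExecuteInterleaved(Parse(n.strip()[len('run-string '):]))
        else:
            print('run: ' + n.strip())

ExecuteInterleaved(Parse('run-string echo c; echo d'))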
I should note that bash's author Chet Ramey was refreshingly honest about the parser, and his article was part of the motivation for oil.
When I open-sourced the code, it had ~11,500 lines of code. After the 1000-line reduction above, and a few other cleanups, the code dipped below 10,000 lines!
$ ./count.sh all
...
 546 core/process.py
 786 core/completion.py
 843 core/cmd_exec.py
 876 core/word_eval.py
1096 osh/word_parse.py
1442 osh/cmd_parse.py
9992 total
This doesn't mean that much, especially since I'm not counting ASDL. But I'd like to celebrate this as the last time the code will ever be so small. I like that you can describe (most of) a large language like bash in a small amount of code.
I've always agreed that Size is Code's Worst Enemy. I have strategies to avoid writing 50K or 100K lines of C++, including metaprogramming and bootstrapping oil in oil.
Using the model of algebraic data types in osh.asdl is powerful and concise.
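As a rough illustration of why that model is concise: one sum type declaration replaces several hand-written node classes. The declaration in the comment below is written in the spirit of ASDL but is not copied from osh.asdl, and the Python encoding is a simplification, not the real generator output.

# Hypothetical ASDL-style sum type (in the comment), and one possible
# Python encoding of it:
#
#   word_part = LiteralPart(string s)
#             | CommandSubPart(string command)

class word_part(object):
    pass                               # base class shared by all variants

class LiteralPart(word_part):
    def __init__(self, s):
        self.s = s

class CommandSubPart(word_part):
    def __init__(self, command):
        self.command = command

part = CommandSubPart('for i in a b c; do echo $i; done')
print(isinstance(part, word_part))     # True: variants share one base type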
I expect bootstrapping oil
to be a future theme in the blog, although I'm
also planning posts on the central topic of the oil language design.
Right now, I'm working on automatically converting osh to oil, and it's working better than expected. I've also been in touch with the authors of four different alternative shells, and I hope to get design feedback from them.
Kartik Agaram is working on the problem of conveying the global structure of programs. After puzzling through the source code of many shells (bash, dash, mksh, zsh), this idea resonates with me.
I'd like to be able to easily answer questions like:
What are the interaction points of the lexer and parser(s)? Shell is
complicated, necessarily breaking out of the simple "lexer to parser" model
you learn in school. For instance, entire subprograms are embedded in string literals, like "abc: $(for i in a b c; do echo $i; done)".
How are the parsers "reused" for interactive completion, or is knowledge of the language duplicated?
How do they solve the "$PS2
problem"? This is the problem of
interactive parsing: when the user hits enter, do you print >
, waiting
for more input, or do you try to parse and run the command, possibly getting
a syntax error?
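One way to frame that decision is as a loop that keeps prompting with $PS2 while the input is syntactically incomplete. The sketch below is a toy version of the idea, with a deliberately naive notion of "incomplete" (an unclosed single quote); it is not how bash or osh actually implements it.

# Toy sketch of the $PS2 decision.  typed_lines stands in for what the
# user types at the $ and > prompts.

def IsIncomplete(src):
    return src.count("'") % 2 == 1     # naive: an unclosed single quote

def ReadCommand(typed_lines):
    lines = [typed_lines.pop(0)]       # read at the primary prompt
    while IsIncomplete('\n'.join(lines)) and typed_lines:
        lines.append(typed_lines.pop(0))   # re-prompt with $PS2 ('> ')
    return '\n'.join(lines)

print(ReadCommand(["echo 'hello", "world'"]))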