Oil can now parse 183,000 lines of git source code, with just two remaining issues, explained below. See the source files and serialized ASTs.
Git is a good test case for a shell. It's big: 183,000 lines is more than three times the size of the next biggest project I've found.
Though much of this is test code, it hits corner cases in the language. For example, its usage of here docs taught me that they're processed in a post-order traversal of the AST.
It also taught me that comments are not lexical constructs in shell, as they are in most languages. Correctly recognizing them depends on knowledge of what words and operators are, which isn't dealt with in the lexer. Consider:
$ echo foo:#not-comment
> echo foo;#comment
foo:#not-comment
foo
The colon is not an operator, so it and the next # are part of a single word. In contrast, the semi-colon is an operator. This means the next token must begin a new word, and in this case it's a comment.
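To make the rule concrete, here is a minimal Python sketch of a scanner that treats # as a comment only where a new word could begin. It is a toy written for this example, not Oil's lexer: the names OPERATORS and split_words are hypothetical, and it ignores quoting, escapes, and every operator except the semi-colon.

```python
# Hypothetical toy scanner, not Oil's lexer. It only knows about
# whitespace, the ';' operator, and '#'. Quoting is ignored.
OPERATORS = {";"}


def split_words(line):
    """Return (tokens, comment) for one line of a tiny shell-like language.

    A '#' starts a comment only where a new word could begin: at the
    start of the line, after whitespace, or after an operator.
    """
    tokens = []
    word = ""
    for i, ch in enumerate(line):
        if ch == "#" and word == "":
            # At a word boundary: the rest of the line is a comment.
            return tokens, line[i:]
        if ch in OPERATORS:
            if word:
                tokens.append(word)
                word = ""
            tokens.append(ch)
        elif ch.isspace():
            if word:
                tokens.append(word)
                word = ""
        else:
            # Ordinary character, including ':' and a '#' inside a word.
            word += ch
    if word:
        tokens.append(word)
    return tokens, None


print(split_words("echo foo:#not-comment"))
print(split_words("echo foo;#comment"))
```

The first call keeps foo:#not-comment as one word, while the second ends the word at the semi-colon and reports #comment as a comment. The decision hinges on whether a word is currently in progress, which is why it can't be made by a lexer that knows nothing about words and operators.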
The two remaining issues are:
The parser uses Python 3 strings, so it has no problem with code in the UTF-8 encoding. But git has two test scripts that are not valid UTF-8 (t4201 and t7831).
I'm torn on the issue of supporting other encodings, and the best way to resolve that is by examining real world usage.
On the one hand, compatibility with bash is good. On the other hand, if there are only two files with non-UTF-8 encodings out of dozens of projects totalling a million lines of code, then I'll be tempted to follow Go's approach and require UTF-8 source.
This approach is simpler, uses less memory, and should reduce portability problems stemming from libc. (Bash uses various libc functions to support multiple encodings and locales.)
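Examining real-world usage can start with a check as simple as trying to decode each file. The following is a rough sketch of that idea, not the script used for the numbers above; is_valid_utf8 and find_non_utf8 are hypothetical helper names.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_non_utf8(paths):
    """Return the subset of paths whose contents are not valid UTF-8."""
    bad = []
    for path in paths:
        # Read raw bytes so that no locale-dependent decoding happens first.
        with open(path, "rb") as f:
            if not is_valid_utf8(f.read()):
                bad.append(path)
    return bad


print(is_valid_utf8("\u00a9 git".encode("utf-8")))    # copyright sign as UTF-8
print(is_valid_utf8("\u00a9 git".encode("latin-1")))  # same text as Latin-1
```

Running find_non_utf8 over a corpus of shell scripts gives a count of how many files a UTF-8-only parser would reject.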
The second remaining issue that git uncovered relates to static parsing. If you look at line 10 of git-gui.sh, you'll see something odd:
exec wish "$argv0" -- "$@"
set appvers {@@GITGUI_VERSION@@}
set copyright [string map [list (c) \u00a9] {
# ...
Wait, that's not shell anymore! It turned into Tcl code. Even when non-interactive, the shell is a REPL that parses and executes each top-level command in sequence. When it hits exec, the REPL must stop, so nothing else is parsed.
Here's a similar example with exit:
$ echo "This script runs, despite bad syntax after exit"
> exit
> | invalid |
> ; syntax ;
This script runs, despite bad syntax after exit
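The parse-then-execute loop can be modelled in a few lines of Python. This is a toy under loud assumptions: parse_command and run_incrementally are hypothetical names, each line is treated as one command, and only echo and exit are understood. It is not how any real shell is implemented.

```python
def parse_command(line):
    """Split one line into words, rejecting lines that start with an operator."""
    words = line.split()
    if words and words[0] in ("|", ";"):
        raise SyntaxError("unexpected token %r" % words[0])
    return words


def run_incrementally(lines):
    """Parse and execute one top-level command at a time."""
    output = []
    for line in lines:
        words = parse_command(line)  # only now is this line parsed
        if not words:
            continue
        if words[0] == "exit":
            break  # the lines after this one are never parsed
        if words[0] == "echo":
            output.append(" ".join(words[1:]))
    return output


script = [
    "echo This script runs, despite bad syntax after exit",
    "exit",
    "| invalid |",
    "; syntax ;",
]
print(run_incrementally(script))
```

Because the loop breaks at exit, parse_command never sees the last two lines, so their syntax errors go unreported.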
In contrast, Python parses everything up front:
#!/usr/bin/python
import sys
sys.exit()
| invalid |
; syntax ;
$ demo/py_parse_before_run.py
  File "demo/py_parse_before_run.py", line 4
    | invalid |
    ^
SyntaxError: invalid syntax
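The same behaviour can be observed from inside Python with the built-in compile(), which parses a whole source string without running any of it. The sketch below reuses the script text from above; try_compile and the "<demo>" filename are names made up for this example.

```python
SOURCE = """\
#!/usr/bin/python
import sys
sys.exit()
| invalid |
; syntax ;
"""


def try_compile(source, filename="<demo>"):
    """Return None if the whole source parses, else the SyntaxError raised."""
    try:
        # compile() parses everything but executes nothing,
        # so sys.exit() is never called here.
        compile(source, filename, "exec")
    except SyntaxError as e:
        return e
    return None


error = try_compile(SOURCE)
print(type(error).__name__, "on line", error.lineno)
```

The error is reported for line 4 even though sys.exit() on line 3 would have ended the program first, which shows the entire file is parsed up front.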
Although oil can't do this without breaking certain shell scripts like git-gui.sh, it's not a problem in practice because there's a difference between "executing" a function definition and calling the function. When the shell "executes" a definition, it just puts the function's parsed representation in a lookup table:
$ echo before
> f() { echo 'not called, but parsed and stored'; }
> echo after
before
after
So, as long as all code is in functions, and there is a single top-level main "$@" call, oil will statically parse all of the code.
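Here is a small Python model of the distinction between defining and calling. ToyShell and its methods are hypothetical and exist only for illustration; the "parsed representation" is simplified to a list of word lists, and echo is the only command.

```python
class ToyShell:
    """Toy model: defining a function stores it, calling a function runs it."""

    def __init__(self):
        self.functions = {}  # name -> already-parsed body
        self.output = []

    def run(self, words):
        """Execute one simple command; only 'echo' is supported."""
        if words[0] == "echo":
            self.output.append(" ".join(words[1:]))

    def define(self, name, body):
        """'Executing' a definition: store the parsed body, run nothing."""
        self.functions[name] = body

    def call(self, name):
        """Calling: look up the stored body and run each command in it."""
        for words in self.functions[name]:
            self.run(words)


shell = ToyShell()
shell.run(["echo", "before"])
shell.define("f", [["echo", "not called, but parsed and stored"]])
shell.run(["echo", "after"])
print(shell.output)
```

The body of f has been fully parsed by the time define returns, yet nothing in it has run. That is the property a script gets when all of its code lives in functions.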
In past status updates, I've shown oil parsing these projects:
The parser is converging pretty quickly. Git is the eighth project I've discussed, but there are now many more projects that it handles correctly.
The nice thing is that there have been no architectural changes for a while; it's all been polish around the edges. I will write about this architecture in detail later, but a core observation is that it's four interleaved parsers for four sublanguages.
I've also consciously avoided any "clever" features of Python, so this parser can be ported almost line-for-line to many languages, including C++.
I want to set it up with a few more blog posts, but otherwise there's no reason not to release the code so people can play with it. I expect that to be this month.