
Oils 0.22.0 - Docs, Pretty Printing, Nix, and Zsh

2024-06-19

This is the latest version of Oils, a Unix shell. It's our upgrade path from bash to a better language and runtime.

Oils version 0.22.0 - Source tarballs and documentation.

To build and run it, follow the instructions in INSTALL.txt, which have been updated with this release. The wiki has tips on How To Test OSH.

If you're new to the project, see Why Create a New Shell? and posts tagged #FAQ.


Reminder: As of the last release, Oils is a pure native binary! No more Python. It passes the same spec tests (2700+ cases), and it's 2x to 50x faster.

I still need to write a retrospective on this, which I may do after some performance work.

Table of Contents
Intro
Highlights and Screenshots
Contributors
Breaking Changes
Docs
Portability
OSH
Features and Fixes
Statically parsing a bit of ZSH
Driven by Nix
Interlude - History of OSH and Nix
YSH
Features
Fixes
Refining the design with shell options
Design Interlude - Exterior-first Philosophy
Data Languages
New UTF-8 Decoder
Features
Fixes
Design Interlude
Performance
Optimizations
Closed Issues
Conclusion
What's Next?
Appendix
Metrics for the 0.22.0 Release
Reader-Friendly Posts

Intro

This post is long because it's been 3 months since the last release! What's changed?

I left these areas out of the title:

I describe all these changes, but I also inserted a few design interludes, so you can see the big picture too:

  1. Brief History of Oils and Nix
  2. YSH Design Philosophy
  3. J8 Notation Analogy, and Unicode Summary

Highlights and Screenshots

Let's start with the changes that are easy to see.

I re-organized and prettified the Oils Reference:

Many things are still undocumented, but we now have metrics to track this. (See the appendix.)


Justin Pombrio implemented a new pretty printer! It uses Wadler's algorithm, as described in our design doc.

Here's what it looks like with some realistic Github issue data:

Remember that the = keyword takes an expression on the right, as in var x = myexpr (similar to Lua).

This example shows off the line wrapping algorithm:

It also works with OSH data structures:

A few things we should polish:


I should write a more detailed blog post: Unix Shell Now Has JSON and Pretty Printing

Contributors

Before describing changes in detail, let's credit contributors.

I want to repeat that failing spec tests are valuable contributions! Figuring out what bash and other shells do is often more than half the work.

I also improved our contributor setup a couple of weeks ago. So I need to write blog posts about that.

Breaking Changes

Some of these changes are explained in detail below. I like to highlight breaking changes early in the announcement.


Now let's go through the changes in each category. You can also view the full changelog.

Docs

As mentioned, the Oils Reference has been overhauled and expanded.

Portability

OSH

As mentioned, I removed the colon "pseudo-sigil" in read :myvar and mapfile :myvar. It was intended to make variable names distinct, but we now have & for that (known as value.Place). Summary:

read myvar           # OSH style

read --all           # YSH style with implicit _reply
read --all (&myvar)  # YSH style with explicit var

Features and Fixes

Statically parsing a bit of ZSH

Unlike other shells, we look inside ${} for syntax errors. This is a consequence of our static parsing philosophy.

But this meant that we couldn't run "polyglot" scripts with zsh code, like git-completion.bash:

if [[ -n ${ZSH_VERSION-} ]]; then
    # zsh-only syntax in this condition
    unset ${(M)${(k)parameters[@]}:#__gitcomp_builtin_*} 2>/dev/null
...

So we now recognize the zsh-only syntax ${(x)myvar}. We parse it, but don't execute it at runtime.
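For contrast, here is a minimal sketch of why such polyglot scripts work in bash (this is plain bash, not OSH). Bash does not look inside ${} until it expands the word, so the zsh-only branch is harmless as long as it never runs. The function name polyglot_demo is hypothetical, and ZSH_VERSION is forced to empty so the result doesn't depend on the environment.

```shell
# Hypothetical demo: run a tiny polyglot snippet under bash.
polyglot_demo() {
  ZSH_VERSION= bash -c '
    if [[ -n ${ZSH_VERSION-} ]]; then
      # zsh-only syntax: bash only complains if this word is expanded
      echo ${(U)HOME}
    else
      echo "not zsh"
    fi
  '
}

polyglot_demo
```

A shell that parses statically has to accept that ${(U)HOME} word up front, which is why recognizing the syntax is enough to get past it.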

Driven by Nix

Samuel did some excellent testing with Nix. It led to fixes that improved OSH for everyone, not just Nix.

Historically, Nix has been the hardest test of bash compatibility. It uses more bash features than any distro I've seen, e.g.: https://github.com/oilshell/oil/issues/26

I also overhauled bash-style regex parsing, i.e. lex_mode_e.BashRegex. For example, we now implement the very special rule of allowing spaces inside () inside a regex pattern, like:

[[ x =~ a(b c)d ]]

Here's a comment related to this design flaw in bash: https://news.ycombinator.com/item?id=38414011

From the bash manual:

It is sometimes difficult to specify a regular expression properly without using quotes, or to keep track of the quoting used by regular expressions while paying attention to shell quoting and the shell’s quote removal. Storing the regular expression in a shell variable is often a useful way to avoid problems with quoting characters that are special to the shell. For example, the following is equivalent to the pattern used above:

I recommend doing what the manual says:

  1. Use a string variable, like pat='a(b c)d'
  2. Then write the expression [[ x =~ $pat ]]
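The two steps above can be sketched in bash as follows. The helper name match_pat and the sample input are made up for illustration. BASH_REMATCH is the standard bash array holding the whole match and its groups.

```shell
# Store the regex in a variable so the shell parser never sees its spaces or parens.
pat='a(b c)d'

# Hypothetical helper: print the whole match and the first group, or fail.
match_pat() {
  if [[ $1 =~ $pat ]]; then
    echo "${BASH_REMATCH[0]}|${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

match_pat 'xab cdy'   # prints: ab cd|b c
```

Leaving $pat unquoted on the right of =~ matters: quoting it would make bash match it as a literal string instead of a regex.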

But we also "conceded to reality", for Nix. That is, you no longer have to do this refactoring, because you may not control the code in the first place.

So we are delivering on our goal: OSH is the most bash-compatible shell, by a mile.


Join us on the #nix channel on Zulip to help with Nix compatibility!

Samuel started this repo:

Interlude - History of OSH and Nix

Very briefly:

2021. Nix user Raphael Megzari tested out Oils and liked it. He inspired some documentation, as well as a Nix RFC to use Oils (then called "Oil").

To be honest, it was a bit premature, because only the slow Python implementation was usable. And we needed more help and testing from Nix users.

But Raphael also told me about NLnet and the grants they offer.

2022. I applied for a grant, and we got the first one in April!

2024. Oils is now pure native code, as mentioned at the top of this post. You can also see metrics in the appendix.

This deserves a full retrospective, including crediting contributors. But I hope this summary is useful for now.

YSH

Now let's look at what changed in YSH. This is the new shell with Python-like data types.

Features

Fixes


Block args and typed args are no longer confused. We now have a third argument group, after a semicolon:

cd /tmp (; ; myblock)  # myblock is of type value.Command

This is equivalent to using a block literal, which is what you'll see 99% of the time:

cd /tmp {
  echo hi
}

In contrast, we will still have eval (block), not eval (; ; block). This is a subtle distinction: eval takes a positional value.Command arg, not a block arg.


I tightened up the parsing of command.Simple, and allowed redirects after a block arg (issue #1850):

json write (x) >out.txt
cd /tmp { echo hi } > out.txt

ARGV is a normal value.List var, not an alias for "$@", which is the "argv stack" (commit).

So now we have two different "worlds":

  1. shell "functions" and "$@"
  2. YSH procs and @ARGV

This distinction fixes a bug, simplifies the YSH language, and opens up more optimization for a pure YSH runtime.

Refining the design with shell options

Design Interlude - Exterior-first Philosophy

Last June, I published a "design roadmap" for YSH, which included the concept of interior vs. exterior:

This principle continues to play a big role in our design decisions. I want to write a post based on this thread:

For example, procs that take typed arguments can now be declared with a typed keyword:

typed proc p (; x, y) {  # new 'typed' keyword
  echo "sum is $[x + y]"
}

This is so we have a clean distinction: plain procs are exterior, but typed procs are interior. This keyword is now optional, but will become required.

A related issue is that we don't do any auto-serialization, like Python's multiprocessing module does with pickle. Serialization in YSH is short, but not invisible.

Data Languages

Now let's review changes to data languages. Recall that J8 Notation is a compatible upgrade of JSON, and is built on UTF-8.

New UTF-8 Decoder

Prior to this release, we used the "Bjoern DFA" to decode UTF-8.

But there was a problem: it has a binary yes/no error model, which isn't sufficient for JSON. Valid JSON can represent invalid UTF-8, i.e. surrogate halves:


So Aidan wrote a brand new decoder, with precise error handling. It's very clean, and better than what I had in mind, which was more of an "inverted" state machine!

So we can now round-trip JSON. And we can also show precise decoding errors to users, though we haven't hooked that up yet.

Aidan also wrote a decoder in JavaScript, which you can try here!


I agree that UTF-8 is not well explained. Here's a checklist of UTF-8 decoding errors I keep in mind, which helped me fix a few bugs below:

  1. "Overlong encoding" - like 042 or 0042 rather than 42
  2. Decoded integer greater than maximum code point
  3. Decoded integer is in the surrogate range
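As a rough illustration of the checklist, here is a sketch in bash arithmetic. It is not the Oils decoder: the hypothetical helper check_decoded takes an already-decoded integer plus the number of bytes it was encoded with, and only classifies the three error cases.

```shell
# Hypothetical helper: classify a decoded integer and its encoded length.
#   $1 = decoded integer (code point candidate), $2 = number of bytes used
check_decoded() {
  local cp=$1 nbytes=$2 min

  # Minimum number of bytes UTF-8 needs for this value
  if   (( cp < 0x80 ));    then min=1
  elif (( cp < 0x800 ));   then min=2
  elif (( cp < 0x10000 )); then min=3
  else                          min=4
  fi

  if (( cp > 0x10FFFF )); then
    echo 'too large'      # beyond the maximum code point
  elif (( cp >= 0xD800 && cp <= 0xDFFF )); then
    echo 'surrogate'      # reserved for UTF-16 surrogate halves
  elif (( nbytes > min )); then
    echo 'overlong'       # more bytes than necessary, like 0042 for 42
  else
    echo 'ok'
  fi
}

check_decoded 0x41 1       # ok
check_decoded 0x0 2        # overlong
check_decoded 0xD800 3     # surrogate
check_decoded 0x110000 4   # too large
```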

Features

J8 Lines Used by @(spliced command sub)

We have a new format "J8 Lines":

99% of the time, it behaves like lines of text:

/etc/os-release
/etc/passwd
http://www.example.com/

But you can also use quoted J8 strings:

 "multiple \n lines \n"
b'binary data \y00\y01\yff'

It's now hooked up to the @(spliced command sub) construct, which is like the "array" version of $(command sub):

ls @(cat j8-lines.txt)       # list all of the directories

for x in @(cat other.txt) {  # iterate over decoded lines
  echo $x
}

Invariant: any argv array can be represented with J8 Lines. This is not true with text split by $IFS. That style leads to data-dependent bugs.
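To make "data-dependent" concrete, here is a small bash sketch (plain bash with the default $IFS, not J8 Lines itself). The helper count_args and the file name are hypothetical.

```shell
# Hypothetical helper: report how many arguments it received.
count_args() { echo "$#"; }

name='my file.txt'

# Unquoted expansion is split by $IFS, so the result depends on the data.
count_args $name            # 2 arguments

# An array keeps each element intact, whatever bytes it contains.
files=("$name")
count_args "${files[@]}"    # 1 argument
```

The first call only breaks when the data happens to contain a space, which is what makes this class of bug easy to miss in testing.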

Added "sigil pairs" to make string literals unambiguous

Double quoted strings unfortunately have two different meanings in Oils:

  1. Code: In OSH and YSH "hi $x" respects $ substitution, just like POSIX shell.
  2. Data: In JSON, the $ in "Price is $3.99" isn't special.

To distinguish these cases, we now allow optional sigils before the left quote.

In YSH, you can add a leading $:

var x = $"hi $x"  # identical to "hi $x"

In JSON8, you can add a leading j:

j"$3.99"  # identical to "$3.99"

You won't use these sigils in the vast majority of cases. But I want to write a blog post to emphasize that our syntax is simpler and more powerful than bash + JSON.

And using explicit sigils shows off the simplicity. We have just four styles:

r'raw without \ escapes'
b'j8 style bytes'       u'unicode'
$"shell double quotes"
j"JSON double quotes"

For each of the code strings, there's a multi-line version with triple quotes:

r'''
raw
multiline
'''
b'''
bytes
multiline
'''
u'''
unicode
multiline
'''
$"""
shell interpolated
multiline
"""

That's it!

These sigils were motivated by our pretty-printing work. We were thinking about printing strings in an unambiguous way, regardless of the surrounding context. Without context, it may not be obvious if you're looking at OSH or YSH or JSON.

json API

As mentioned in the list of breaking changes, the way to control indentation is now:

json write (x)           # default is 2 spaces
json write (x, space=0)  # no indentation
json write (x, space=4)  # 4 spaces

See chap-builtin-cmd.html#json.

Fixes


We now consistently check for code points greater than the max, and in the surrogate range. These checks happen in:

But not in OSH, basically because bash and other shells don't. For example:

Design Interlude

Analogy

I want to write a blog post about this analogy in Oils:

Shell : YSH :: JSON : J8 Notation

The surrogate pair work shows this. We faithfully implement the warts in JSON, but we upgrade it to something where you can avoid warts.

Summary

I think we're done implementing JSON in Oils. And I noticed this "trichotomy" while writing this post:

  1. Python uses UTF-32 internally (roughly speaking)
  2. JavaScript uses UTF-16
  3. Oils uses UTF-8

So this is interesting: JSON is implemented differently in Python, JavaScript, and Oils, precisely because of the interior representation of strings! (Encoding takes you from interior to exterior, and decoding from exterior to interior.)

This is also an interesting exception to our Language Design Principles. In terms of strings:

Performance

I reduced the number of HereDocWriter processes, a performance bug I mentioned in the last release:

OSH now starts 5% - 10% fewer processes than bash or dash on the Python configure workload!

But surprisingly, that doesn't make us faster overall.


Both Melvin and I got kinda worked up about this, and landed many more optimizations, which I describe below.

We made great progress, but it appears we need to back up a bit to really improve performance. For example, Melvin is working on adding a control flow graph representation to mycpp to make it smarter.

We're also improving benchmark workloads and measurements. Surprisingly, OSH is slower relative to bash on real hardware, compared to the virtual machines that our CI runs on.

This work will take a while, but I have no doubt that Oils will get faster over time. It's very workload-dependent, but roughly speaking, I'd say we're at 50% to 120% the speed of bash, despite being written in typed Python! And it feels like 80% - 200% is feasible, though I don't know how long that will take.

Optimizations

Melvin did a ton of deep debugging and analysis, which led to several fixes:

Some of the optimizations I landed:

I think we've now settled on the code representation. I did this refactoring not just for performance, but also because we want to write a pretty printer for YSH (and maybe OSH). I think this style is simple and general, and I'd like to write an update on it:

Closed Issues

These issues are a subset of the work above. Again, you can view the full changelog.

#1974 command -v "$emptyvar" returns zero
#1968 "Float" in J8 should probably be "Decimal"
#1943 OpenBSD `ln` and `install` do not have `-v` flag
#1937 Bug: read -n strips leading and trailing whitespace
#1924 cd { pwd } should be an error - dir name required when block is passed
#1906 [[ foo =~ pat ]] parsing doesn't match bash and zsh
#1902 [BUG] Json read won't work with negative numbers
#1900 _build/oils.sh requires bash, but should only require /bin/sh (build/common.sh )
#1898 Assoc error key should be strings error is confusing with `unset`
#1895 eggex 'a'{N *} crashes, needs a proper error
#1884 "${array[@]+foo}" should behave like bash (for Nix)
#1864 ysh exits after `ctx push (&a) { true }`
#1862 osh doesn't expand tilde in assignment
#1850 Parsing bug with comma after typed arg
#1849 Typed args and block arg can get confused
#1841 [YSH] setglobal d.key mutates local instead of global
#1130 Reorganize into new doc/ref scheme
#1103 echo and printf don't check write() failure
#280 Implement `ulimit` builtin

Conclusion

To summarize:

What's Next?

This announcement was long, but it didn't cover all parts of the project! These threads have color on other things I've been working on:

But I really want to get back to YSH. In particular:

Our "north star" is still a minimal YSH that's pretty stable. YSH has many features, but it's paradoxically small (metrics below).


Let me know what you think in the comments!

Appendix

Metrics for the 0.22.0 Release

These metrics help me keep track of the project. Let's compare this release with the previous one, version 0.21.0.

Docs

We'll track this new metric from now on:

Wild tests

I don't usually track this suite, but the case ;& ;;& change is visible:

Spec Tests

Big progress on OSH, e.g. for Nix compatibility:

Everything works in fast C++, even though we write typed Python:

(The negative delta is due to NUL bytes and integer semantics.)


Good progress on YSH:

Likewise, everything still works in C++:

Benchmarks

The parser is faster, probably due to the Token representation:

Warning: this may regress in the next release. We're measuring both the parser/mutator and the GC, and using less memory by freeing objects has made things slower! We could also change the definition of this benchmark, or make a new one.


Big reduction in memory usage, due to the parser refactoring:

Slight increase in time taken for Fibonacci:

We did better on our "problem" workload, measured on real hardware. As mentioned, we'll improve the way we measure performance.

To summarize OSH running time vs. bash:

We really want to close this gap!

Code Size

Oils is still a small program in terms of source code:

And generated C++:

And compiled binary size:

GC rooting still takes up a lot of code size. I also want "mycpp modules" to speed up the build.

Reader-Friendly Posts

I haven't been blogging as much, so I think Oils is now "underexplained"! I mentioned these shorter posts above:


If you got this far, check out yesterday's post! Comments about Scripting, CGI, and FastCGI