This is the latest version of Oils, a Unix shell. It's our upgrade path from bash to a better language and runtime:
Oils version 0.21.0 - Source tarballs and documentation.
If you're new to the project, see the Oils 2023 FAQ and posts tagged #FAQ.
Big news! As of this release, the C++ tarball passes all spec tests. It's 2x to 50x faster than the Python tarball.
I will have a lot to say about this, but the takeaway is that I now recommend using and testing the C++ tarball. See README-native.txt in oils-for-unix-0.21.0.tar.gz.
The release page also emphasizes that the project is called Oils or "Oils for Unix". It's no longer "Oil". I haven't done a good job of renaming everything (including this site), so I understand the confusion!
Shouldn't this announcement be all about how we finished translating Oil to C++? It's a big milestone!
I planned to write something like Oils 0.21.0 - Two Megabytes of Pure, Native Code.
But I'm not "declaring victory" yet. Despite being 2x to 50x faster, the C++ translation is 5-10% slower than bash and dash on a realistic configure workload.
I'll talk about the mystery at the end of this long post. Also: you can follow development on #oil-dev on Zulip.

Before diving into Flags, Integers, Starship Bug, and Speed, let's credit contributors.
This release includes:

- the new ctx builtin (discussed below)
- Str => trim() methods, with Unicode support
- trim(), startsWith(), and endsWith() now accept an eggex and constant strings. I appreciate the thought that went into the API design!
- a fix for bare const x, as opposed to const x = null

Also:

- notwa, for reporting the semantic issue with the / and % operators (fixed, discussed below)
- bar-g, for great testing, which resulted in a few bug fixes (more upcoming)

I try to call out breaking changes in each release. OSH is stable, but YSH is still under development.
(1) Removed read --line. This is because the buffered version is incorrect, and the unbuffered version is slow. You can use shell's read -r line for now, which is unbuffered.
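For example, this works in OSH, because the last part of a pipeline runs in the shell itself (a feature discussed later in this post):

$ echo hi | read -r line; echo "line=$line"
line=hi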
Also see my comment about strace and read(fd, 1) on:

Why does an extraneous build step make my Zig app 10x faster? (mtlynch.io via lobste.rs)
57 points, 14 comments on 2024-03-19
(2) Renamed toJ8() and fromJ8() to toJson8() and fromJson8(). This is more accurate because J8 Notation is a set of data languages, which includes JSON8.
(3) Tilde substitution is now respected in combination with brace expansion, like ~{alice,bob}. This is technically a breaking change, but it improves compatibility:
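For example, something like this, assuming users alice and bob with home directories under /home:

$ echo ~{alice,bob}/src
/home/alice/src /home/bob/src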
The "Closed Issues" section at the end of this post links to more fixes, as does the raw change log.
Now let's talk about Flags, Integers, Starship Bug, and Speed.
You should be able to easily parse command line flags in a shell! POSIX shell has a weak getopts builtin, and Python has a nice but verbose argparse API.
Here's what Python looks like:
parser = argparse.ArgumentParser()
parser.add_argument('filename') # positional argument
parser.add_argument('-c', '--count') # option that takes a value
parser.add_argument('-v', '--verbose',
action='store_true') # on/off flag
args = parser.parse_args()
print(args.filename, args.count, args.verbose)
We can emulate this "OOP builder" style in YSH, but I find it too clunky:
var parser = ArgumentParser()
call parser->addArgument('filename') # call keyword for expressions
call parser->addArgument('-c', '--count')
call parser->addArgument('-v', '--verbose',
                         action='store_true')
var args = parser=>parse_args()
echo "$[args.filename] $[args.count] $[args.verbose]"
Here's our new API, which is still in progress:
parser (&spec) { # describe the CLI syntax
arg filename # positional argument
flag -c --count (type=Int) # flag (TODO: builtin type Int)
flag -v --verbose
}
var args = parseArgs(spec, ARGV)
echo "$[args.filename] $[args.count] $[args.verbose]"
What do you think of this style? It's a bit like Hay Ain't YAML: we're reusing the syntax of shell for declarative data. This means you can also mix code in:
parser (&spec) {
flag --first
if (debug) {
flag --other
}
}
What I really like about the flag parser: it's built on the new ctx builtin, which is implemented in ~100 lines of typed Python code, with ~200 lines of stdlib/args.ysh on top.

Why is this good? First, there's a lot of variation in CLI syntax, so we want you to be able to write your own flag parsers in YSH. We provide something that should handle 90% of cases, but YSH is powerful enough to handle the remaining 10%.
Second, having other people involved makes YSH feel more real! I'm also glad that my Zulip ramblings were actually implementable, and there wasn't a hidden flaw.
I've been saying that we're going for the "whole enchilada" with YSH — a language that's powerful enough to write libraries and internal DSLs.
So what did it take to implement declarative flag parsing in YSH?
- value.Place, from Oils 0.19.0 - Dicts, Procs, Funcs, and Places
- the new ctx builtin
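To give a flavor of the ctx builtin: it accumulates values into a Dict while a block runs. Here's a rough sketch of the idea; treat the details as illustrative rather than the exact args.ysh code:

var spec = {}
ctx push (spec) {                      # accumulate into the spec Dict
  ctx set (prog='demo')                # set a key directly
  ctx emit flag ({short: '-v'})        # append to a List named 'flag'
  ctx emit flag ({short: '-c', long: '--count'})
}
json write (spec)                      # see what was accumulated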
I re-implemented the / and % operators so that they don't simply delegate to / and % in Python and C++. In particular, we never pass negative operands to the host language, since Python and C++ differ!

Specifically:
(1) Integer division rounds toward zero (not negative infinity):

- In OSH, $(( -5 / 2 )) is -2, similar to other shells, and C++.
- In YSH, -5 // 2 is also -2. The // operator is integer division, with syntax borrowed from Python, whereas -5 / 2 is the float -2.5.
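Concretely, a quick session (the (Float) display format is approximate):

osh$ echo $(( -5 / 2 ))
-2

ysh$ = -5 // 2
(Int) -2

ysh$ = -5 / 2
(Float) -2.5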
(2) The % operator disallows a negative divisor, because it's confusing, and arguably not useful.
ysh$ = -5 % 2
(Int) -1
ysh$ = -5 % -2
= -5 % -2
^
[ interactive ]:3: fatal: Divisor can't be negative
This means that YSH is different than Python, which chooses to round toward negative infinity:
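For comparison, Python floors, so the quotient is one smaller, and the remainder is positive:

$ python3 -c 'print(-5 // 2, -5 % 2)'
-3 1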
Related story:
Why Python's integer division floors (2010) (python-history.blogspot.com
via Hacker News)
103 points, 98 comments - 20 days ago
We could provide a separate divrem() with Python-like semantics, but I don't think these operations are common in shell. I think divrem() is a better name than Python's divmod(), because:
Mod and remainder are not the same (conery.io
via Hacker News)
165 points, 48 comments on Aug 23, 2018
Like the / and % operators, the size of integers also depended on the host language prior to this release:

- The Python build inherits CPython's arbitrary precision integers.
- The C++ build used the C int type, which is usually 32 bits, even on 64-bit machines.

That was obviously bad, so we now use a mylib::BigInt type for shell operations, which is distinct from C int. It's currently 64 bits, even on 32-bit platforms, but I'd like to make it arbitrary precision.
Nevertheless, Oils already behaves better than other shells. I wrote a little demo: in each language, we generate a program to print the literal 11112222333344445555666677778888999, which is larger than 64 bits.
Look at all this exciting behavior!
$ demo/survey-arith.sh bigint
python3
11112222333344445555666677778888999
node.js
1.1112222333344446e+34
Lua
1.1112222333344e+34
Perl
1.11122223333444e+34
awk
11112222333344446376450007752507392
dash
9223372036854775807
bash
8402588706881157415
mksh
-77842137
zsh
zsh:1: number truncated after 20 digits: 11112222333344445555666677778888999
-7334521740365106061
_bin/cxx-dbg/osh
echo $(( 11112222333344445555666677778888999 ))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[ -c flag ]:1: fatal: Invalid integer constant '11112222333344445555666677778888999'
Summary:

- Only Python prints the big literal correctly. The other languages silently fall back to floating point, overflow, or truncate in various ways.
- You could argue that a shell should at least use a 64-bit int, like we did, but some of the behavior above seems worse than that.

This bug was reported by several people. It was most common after initializing the Starship prompt, but it also happened without it. The symptom was either these job control error messages:
$ echo foo | head
osh: parent failed to set process group for PID 42996 ...
osh: child failed to set process group for PID 42996 ...
foo
Or a crash like this:
$ echo hi|head
oils I/O error (main): No such process
If you don't care about a single hairy bug that's already fixed, you can skip the whole section below, and move on to the performance mystery.
But I'm writing it up because it touches on #shell-runtime topics that I haven't explained on this blog.
It's hard to understand the Starship bug without that background, but luckily Thorsten Ball just explained a similar bug on his blog.
My comments on
How to Lose Control of your Shell (registerspill.thorstenball.com
via lobste.rs)
59 points, 32 comments on 2024-03-09
This other bug is: when the Zed editor shells out to zsh -i, why does Ctrl-C stop working in Zed? This is surprising because you wouldn't expect a child process to be able to mess up its parent.
Just like the Starship bug, the zed/zsh bug touches on POSIX job control, and the syscalls you need to implement it.
Thorsten dug down to the root cause: a "missing" job control syscall after execve(). But the question is why shells behave like that.
I wrote three long comments explaining why. I won't repeat them here, but the outline is:

- Why you can hit Ctrl-Z in Vim to "go back" to your shell.
- Why you can kill ls | cat | cat by hitting Ctrl-C once. Those 3 processes are killed, but not the others.
- How many processes does sh -c 'ls | wc -l' start?

(If the lobste.rs comments aren't interesting, again feel free to skip to the next section on performance.)
The Starship bug is different, but it's another interaction between features in the #shell-runtime. It's actually an interaction between three features, one of them being job control:

- The last part of a pipeline runs in the shell process itself (like bash with shopt -s lastpipe). This is why echo hi | read myvar; echo $myvar works. It doesn't work in bash and other shells.
- Job control: we call setpgid($first_pid_of_pipeline, $last_pid_of_pipeline), but only if we actually start a process for the last part! Not if it's the shell itself.
- The DEBUG trap runs before commands, and Starship uses it.

The interaction was that user-provided code in the DEBUG trap could start and wait() for other processes in between the first and last parts of the pipeline. This caused a stray getpgid() call to fail with errno ESRCH, causing the shell to "crash".
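Here's a sketch of the shape of the trigger; the trap body is a stand-in for what Starship actually runs:

# the DEBUG trap runs user code before commands, and that code
# may fork and wait() for its own child processes
trap 'date > /dev/null' DEBUG

# a pipeline: the trap could fire between its first and last parts
echo hi | head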
There were a few fixes:

- Don't call getpgid() in an assertion, since we can easily store this variable in the Pipeline object.
- Don't run the DEBUG trap between parts of a pipeline.
- Report such job control problems on the xtrace stream, since job control is inherently racy.

There are some things that are not entirely satisfying about this. But if you read the lobste.rs threads above, you'll see why there don't appear to be clean solutions. Shells, and even kernels, are a bit under-specified.
I would like to explore more exhaustive testing/reasoning strategies in the shell runtime, just like we have exhaustive reasoning with Zephyr ASDL and re2c in the front end.
Nonetheless, it now works:
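For example, the repro from the bug report, which should now print only this:

$ echo foo | head
foo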
OK, so that was a big hairy bug that took several days to diagnose and fix. Now we have a hairy performance problem that also took several days to diagnose! Writing a shell is hard, at least if you want it to be good.
I was really happy after fixing the last C++ translation bug. What a long journey!
But I was dismayed that there was a lingering performance issue: running Python's configure may take 11.3 or 11.4 seconds with dash or bash, but 12.9 seconds with OSH. This is a 12-14% slowdown, which is actually worse than the 5-6% we see in the CI (for unknown reasons).
You could argue: who cares about 1.6 seconds? But there's a mystery here, and I want to get to the bottom of it.
By the way, I haven't explained all the C++ translation bugs yet. I'll summarize these threads when we "declare victory":
I sometimes tell people that one reason I started this project was so I could use every code quality tool available, including performance tools.
It's honestly fascinating to learn what you need to speed "naive" Python code up by more than 50x, by translating it to predictable C++.
So this section is going to show the tools I used, since I think that will be interesting to readers. But I won't explain everything, to save space. Original thread:
First, I ran the perf sampling profiler, which is able to profile across processes:
# Overhead  Samples  Command  Shared Object          Symbol
# ........  .......  .......  .....................  ....................................
   2.44%       3241  ld       libbfd-2.40-system.so  [.] bfd_link_hash_traverse
   1.45%       1908  cc1      [kernel.kallsyms]      [k] 0xffffffffb34bf8f0
   1.42%       1871  ld       libbfd-2.40-system.so  [.] bfd_elf_link_add_symbols
   1.26%       1740  cc1      [kernel.kallsyms]      [k] 0xffffffffb347a135
   1.22%       1805  osh      [kernel.kallsyms]      [k] 0xffffffffb34bf8f0
   1.19%       1614  cc1      cc1                    [.] _cpp_lex_direct
   1.14%       1658  osh      [kernel.kallsyms]      [k] 0xffffffffb34dded0
   1.09%       1484  cc1      cc1                    [.] ht_lookup_with_hash
   1.08%       1415  ld       libc.so.6              [.] __memmove_evex_unaligned_erms
   1.00%       1341  cc1      cc1                    [.] ira_init_register_move_cost
   0.98%       1326  cc1      cc1                    [.] ira_init
   0.97%       1289  ld       libbfd-2.40-system.so  [.] bfd_hash_lookup
   0.96%       1385  osh      oils-for-unix          [.] MarkSweepHeap::MaybeMarkAndPush
(This is benchmarks/perf.sh profile-cpython-configure - source)
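The script is roughly equivalent to this; paths and flags are simplified for illustration:

$ perf record -o perf.data -- _bin/cxx-opt/osh ./configure   # records children too
$ perf report -i perf.data --sort=comm,dso,symbol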
The way I read this is:
- The linker ld and compiler cc1 processes are hot, which makes sense because this is a configure script. It probes for system features by "shelling out" to the compiler and linker.
- osh does show up in the profile. Mostly in the kernel, but also in the garbage collector, e.g. MaybeMarkAndPush(). (Other shells don't have this!)

I zoomed into the OSH process:
17.25% 1118 osh [.] MarkSweepHeap::MaybeMarkAndPush
11.48% 737 osh [.] MarkSweepHeap::TraceChildren
8.22% 543 osh [.] state::Mem::GetExported
7.08% 511 osh [.] MarkSweepHeap::Allocate
5.75% 355 osh [.] MarkSweepHeap::Sweep
1.93% 132 osh [.] posix::execve
We've seen this problem a lot: allocation and garbage collection are expensive!
It's not the bottleneck in this problem, though, since osh itself takes so little time. That is, 40% of 5% is only 2%: less than 2% of the configure time is spent in Oils memory management.
Don't ask me why, but I then got side-tracked fixing allocation issues, despite what's clear in the profile.
To profile allocations, I've found callgrind very useful. Rather than sampling what the CPU is doing like perf, it traces every function call. Then you can load the results into the kcachegrind GUI and click around:
(This is benchmarks/autoconf.sh measure-callgrind - source)
We also use other tools for precise tracing: uftrace for measurement, and the R language for analysis. It's easier to put in the CI than the GUI, and the reports are custom:
task                        ex.compute-fib  ex.bashcomp-excerpt  parse.abuild
total_bytes_                     2,206,788            4,025,766     4,173,385
num_allocs_                         70,159              117,374       102,837
sum_typed_strs_slabs                70,159              117,649       102,837
num_reserve_calls                   32,497               64,900        61,229
percent_list_allocs                  33.1%                21.0%         29.4%
percent_slab_allocs                  23.7%                24.0%         22.6%
percent_string_allocs                 6.3%                24.8%          8.4%
percent_other_typed_allocs           36.9%                30.4%         39.6%
(This is benchmarks/uftrace.sh)
Still, clicking around in the GUI is nice when you're in the middle of debugging something unknown. (But I wish I could use one tool, instead of two.)
I found 3 hotspots with more than 400K allocations of one-byte strings, out of 2.3M total. I used a mycpp trick to turn them into integers, eliminating all those GC objects. (We're approaching the point where we are not "hilariously unoptimized"!)
This was a nice win on some workloads, but it was not the bottleneck in the configure workload!
If it's not garbage collection, why are we slow? I had a hunch, so I added a call to the times builtin at the end of the configure script.
==> _tmp/elapsed/bash.times.txt <==
0m0.213s 0m0.477s
0m8.233s 0m2.931s
==> _tmp/elapsed/dash.times.txt <==
0m0.217s 0m0.463s
0m8.281s 0m2.922s
==> _tmp/elapsed/osh.times.txt <==
0m0.360s 0m0.720s
0m8.790s 0m2.960s
The times builtin has a terrible format, which goes like this:
[shell process - user space time] [shell process - system/kernel time]
[child processes - user space time] [child processes - system/kernel time]
(Contribution idea: add times --tsv8 to Oils)
So the configure workload spends only 360 ms in OSH user space, which covers everything our interpreter does, including garbage collection.

But we spend 8.79 seconds in child processes, versus ~8.2 for bash and dash. This 500-600 ms difference is closer to the 1.6 second gap in elapsed time.
Why would child processes take longer under OSH? I had some random theories, like the cost of dynamically linking against libstdc++, which shells written in C don't do.
But the problem became even clearer with our trusty tool strace. I wrote benchmarks/autoconf.sh measure-syscalls, which runs strace -ff, spewing thousands of files:
Total processes:
~/git/oilshell/oil/_tmp/strace$ ls bash.txt* | wc -l
6022
~/git/oilshell/oil/_tmp/strace$ ls dash.txt* | wc -l
6063
~/git/oilshell/oil/_tmp/strace$ ls osh.txt* | wc -l
6336
OSH is starting >300 more processes? Huh?
I actually optimized that, as explained in the lobste.rs thread above. In our test/syscall.sh harness, OSH starts fewer processes than bash, dash, zsh, and every other shell except for yash (from Japan).
I further narrowed it down to processes that don't call execve():
~/git/oilshell/oil/_tmp/strace/all$ ls bash*|wc -l
1723
~/git/oilshell/oil/_tmp/strace/all$ ls dash*|wc -l
1760
~/git/oilshell/oil/_tmp/strace/all$ ls osh*|wc -l
2039
So those 300 processes are not starting executables like cc1 or ld? What's going on?
Well, I know exactly when we fork processes, and HereDocWriter is one of them! I wrote about how shells implement here docs many years ago, when we were younger, the air was cleaner, and the sun brighter:
[bash, zsh, and mksh use temp files], which surprised me.
dash does something more expected and elegant, which is to start cat with one end of a pipe() as stdin, rather than a temp file. Strings longer than PIPE_SIZE will cause write() to block, but I think that just requires a little extra care in the implementation.
So OSH follows dash, forking a process for each here doc, rather than writing it to a temp file and reading it back in. The HereDocWriter and the main shell communicate through a pipe().
And the configure script has a bunch of here docs (C code snippets), so I think this must be the root cause of the slowdown. I had thought that temp files would be slower, but now I can see that there's probably forking and context-switch overhead.
Why isn't dash slower too? I'm not sure, but it could relate to the PIPE_SIZE optimization.
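For reference, this is the pattern configure scripts use over and over; a hypothetical probe looks like:

cat > conftest.c <<EOF
int main() { return 0; }
EOF
cc -o conftest conftest.c && echo 'compiler works'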
Here's a little intuition. Forking zero processes:
$ time for i in {1..1000}; do true; done
real 0m0.003s
user 0m0.003s
sys 0m0.000s
Forking a thousand processes, without execve(). The ( true ) below is a subshell:
$ time for i in {1..1000}; do ( true ); done
real 0m0.702s
user 0m0.465s
sys 0m0.270s
So each fork() takes ~0.7 milliseconds in bash.
I want to fix this bug soon, and evaluate performance again. But the bottom line is that I'm fairly sure that we'll meet the performance of bash and dash, with a codebase in typed Python. That's pretty cool, because the codebase is small.
And we may even exceed their speed. A promising sign is that OSH makes fewer syscalls than bash or dash on the configure workload, not just the synthetic workloads. This is despite the here doc problem.
But Oils in C++ is already much faster than Oils in Python, so I recommend using it as of this release. For example, Oils in Python takes 27.7 seconds to run CPython's configure, and we got all the way down to 12.9 by translating the interpreter to C++!
You can see this from the OSH Runtime benchmarks, which I review in the appendix, as always.
We're almost done with this announcement! Let me mention some near- and medium-term work that will come up in the future:

- ysh --tool fmt, a pretty printer

Also, these test files used to be crappy shell scripts, and now they're pretty decent. They find bugs, and they'll help improve our error catalog.
$ wc -l test/*-errors.sh
868 test/parse-errors.sh
1131 test/runtime-errors.sh
1406 test/ysh-parse-errors.sh
899 test/ysh-runtime-errors.sh
4304 total
Their output is published on the release quality page.
A subset of the work we did:
#1854 | Brace expansion combined with tilde expansion should behave like bash and zsh, not mksh |
#1846 | `try eval` is inconsistent |
#1826 | ysh-0.20.0$ echo b'byte \xff' #=> Traceback |
#1825 | single backslash \ at end of line could be disallowed within $'...' (and maybe others) |
#1822 | Simple pipe crashes after initialising starship |
#1785 | try { } should be the only form for simplicity |
#1784 | 'const c' without null on RHS can be re-declared |
#1762 | Disallow proc / func keywords in OSH to avoid misleading syntax errors |
#1342 | / and % operators behave like Python and not bash |
This post had something for everybody!

If you'd like to contribute, remember the times builtin idea above; the code lives in the oils-for-unix / oils repo on GitHub.

Done! Let me know what you think in the comments!
These great conversations inspired the pretty printing design doc:
Why is Prettier rock solid? (mrmr.io
via lobste.rs)
41 points, 18 comments on 2024-02-19
A Twist on Wadler's Printer (justinpombrio.net
via Hacker News)
65 points, 7 comments - 22 days ago
I also think we need to explain UTF-8, since comments over many months show great confusion:
UTF-8: Bits, Bytes, and Benefits (2010) (research.swtch.com
via lobste.rs)
12 points, 8 comments on 2024-03-06
Flexible and Economical UTF-8 Decoder (bjoern.hoehrmann.de
via lobste.rs)
6 points, 1 comment on 2024-03-07
Let's compare this release with the previous one, version 0.20.0.
A little progress on OSH:
We finished the C++ translation, and blew past the mark! :-)
The difference is that the C++ version correctly allows NUL bytes in strings, while the Python version inherits its behavior from CPython.
New YSH features are reflected here:
Again, we blew past the mark with the C++ translation:
The difference is again inheriting integer behavior from CPython. We may smooth over this difference at some point.
I did some frontend refactoring, which may have caused a slight slowdown:
However, I already optimized the parser some more after this release.
- Oils 0.20.0: parse.configure-coreutils - 1.83 M objects comprising 64.8 MB, max RSS 69.5 MB
- Oils 0.21.0: parse.configure-coreutils - 1.87 M objects comprising 65.7 MB, max RSS 70.5 MB (small correction on 2024-05)

The Fibonacci benchmark is stable:
- Oils 0.20.0: fib takes 33.5 million irefs, mut+alloc+free+gc
- Oils 0.21.0: fib takes 32.1 million irefs, mut+alloc+free+gc

After resolving the big mystery above, we are going to close the remaining gap on the configure workloads in the OSH Runtime benchmarks.
Oils is still a small program, in terms of source code, generated C++, and compiled binary size.