Why Sponsor Oils?
I introduced a distinction in Narrow Waists Can Be Interior or Exterior: for example, PyObject* is an interior narrow waist, while Unix files are an exterior one.

This post uses the interior-exterior idea to describe Oils. For example, OSH and YSH are exterior-first, while PowerShell, Elvish, Nushell, and others are interior-first. To see this, we review three aspects of the design:
This is the third post in a series about YSH. I want it to be a "design roadmap" for contributors and for me, but I hope casual readers will also take something away.
I "forked" another post while writing this one: How to Create a UTF-16 Surrogate Pair by Hand, with Python. It started with an implementation detail, and led to a good discussion about Unicode history, e.g. Windows vs. Unix.
Now let's see how the distinction helps us with the design of YSH. Last year, The Sketch of the Biggest Idea in Software Architecture asked:
Should shells have two tiers? Both external processes and internal "functions"?
Both pipelines of bytes and pipelines of structured data?
We now have answers. YSH will have both:
On the other hand, our pipelines are identical to those that Thompson's original shell pioneered: "real" OS processes communicating over channels created with pipe(). Structured data is layered on top, with textual data languages based on JSON and TSV.
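As a rough illustration in plain shell (not YSH-specific syntax; the commands and the TSV sample are made up), both kinds of pipeline are still just processes connected by pipe(), and the "structure" lives in the text that flows through them:

# A pipeline of bytes: three OS processes connected by pipe()
ls /etc | grep conf | wc -l

# Structured data layered on top: the bytes flowing through the pipe
# happen to be TSV text, so they can be sliced by field
printf 'name\tsize\nfoo.py\t123\nbar.py\t456\n' | cut -f 1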
This table summarizes my impressions of a few alternative shells (corrections are welcome):
Style | Shell | VM / Scheduler | What's in a pipeline? | What's Piped?
Interior | PowerShell | .NET VM | cmdlet, a kind of class | .NET objects, instances of classes
Interior | Elvish | Goroutine scheduler | Function, Wrapped Process | Garbage-collected Go records, or JSON
Interior | Nushell | Rust async I/O scheduler | Builtins or Plugins | Rust/serde Objects, JSON/msgpack
Exterior | OSH and YSH (Oils) | Unix Kernel | procs or processes | Bytes ⊃ Text ⊃ Data Languages
So it appears that most alternative shells are interior-first, but Oils is exterior-first.
The distinction isn't black and white: All shells have both facilities (even bash), so it's more a matter of what's "primary" in the design. It's also a matter of how awkward the interface is — do you have two different "worlds" or tiers to bridge?
Nonetheless, we'll say an interior-first shell favors code that lives within a process, while an exterior-first shell favors coordinating data between processes.
Notice the layering: structured data sits on top of text (e.g. lines delimited by '\n'), which sits on top of bytes. I would draw this as:

Bytes ⊃ ASCII ⊃ UTF-8 ⊃ Data Languages
I'd also say the exterior style is one level below interior shells, which preserves shell's role as universal glue. If you want to glue together a .NET VM and a Go process, or a Clojure program and an R script, your lowest common denominator is probably a pipe, socket, or bash script.
So does YSH have two tiers? Despite having both proc and func, I'm trying to avoid two tiers, at least to the extent that they would reduce the whipupitude of shell.
It's "wrong" to think about YSH programs in this Python-like way:
The "right way" is to program directly with text, including our data languages for strings, records and tables. They are designed to eliminate ad-hoc parsing, which is the main downside of text.
Our in-memory data structures map one-to-one with text, and are in service of text. The encode() and decode() operations on J8 strings are perfect inverses, for arbitrary byte strings.
More details below.
Why did I say procs are either interior or transparently exterior? Because that's how Bourne shell works, and it's powerful and underused. The simplest usage of a proc occurs in a single process, making it interior:
myproc() {
cp *.py /tmp
echo done
}
myproc # interior call
But you have at least two ways of making procs exterior:

sudo $0 myproc "$@" is the $0 Dispatch Pattern. This is how you run a shell function with different privileges. The proc itself becomes a child process of sudo.

myproc | wc -l transparently "remotes" myproc into another process, via fork() (related: thume.ca).

(Implementation status: procs exist in YSH, but we still need to implement functions.)
proc main versus func main
This is a good time to answer a great question from Mastodon. I expect it to be common, so I'll paraphrase:
Now that YSH has functions, can we just ignore procs? Start with func main, and call other functions with typed data?
I don't want to dictate the way people write code, but I think there are some downsides:
You'd lose the 2 styles of composition I mentioned above: Forth-like words, and pipelines.
When units of code are the same shape, they compose, and there's less room for bugs. Shell is a language that grows.
The exterior usage of procs is useful for distributed computing, including building containers. That is, I write shell scripts on one machine, and distribute them to another machine, or to an isolated container. This is analogous to what Docker does implicitly with "contexts".
procs are easy to use and test interactively, e.g. with the "task file" idiom (see the sketch below).
I test my shell programs multiple times a minute when developing.
Textual flags vs. typed arguments?
Procs with flags can and should have a stable exterior interface! On the other hand, funcs are interior, and may break during refactoring.
Flags often follow Hickey's description of versionless evolution: Strengthen a Promise, Relax a Requirement. Adding a flag is a compatible change, while changing the type of an argument is a breaking change.
Literal command line flags are used to evolve the largest distributed systems at "hyperscalers" like Google. This is because distributed systems can't be upgraded atomically. The site featureflags.io seems to elaborate on this idea.
The advantages of procs will probably become clearer when actually writing code. I should write more #shell-the-good-parts posts with concrete examples, but until then you can see them all over the Oil repo.
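Until then, here's a minimal sketch of the "task file" idiom in plain shell (the file name and task names are made up; a YSH version would use proc instead of functions):

# do.sh -- run project tasks by name
build() {
  echo 'building...'
}

deploy() {
  echo "deploying as $(whoami)"
}

"$@"   # dispatch on the arguments: ./do.sh build, ./do.sh deploy

Each task is interior when called as a function, and trivially exterior when invoked as ./do.sh build from another process, or with the sudo $0 pattern shown earlier.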
Now that we've discussed interior and exterior code, let's discuss text. It's central to not just shell, but all programming languages.
Text is also complex and controversial. This article, linked in the appendix of the surrogate pair post, shows that languages disagree on the length operation:
Programming Languages | Length of 🤦🏼♂️
Go, Rust, Python 2 | 17 UTF-8 code units, aka bytes
JavaScript, Java | 7 UTF-16 code units
Python 3, bash | 5 UTF-32 code units, aka code points
Swift | 1 extended grapheme cluster, which doesn't have a fixed definition
The surrogate pair post also sketches the history of this divergence, which is basically a Unix vs. Windows problem. Languages tend to follow operating systems, so JavaScript, Python, and JSON were dragged along for the 30-year ride.
The length issue correlates with — but isn't identical to — another controversial issue: the representation of strings in memory. That is, the interior representation.
Oils follows the Go language, using an array of bytes, which may or may not be valid UTF-8:
Contrary to popular belief, and contrary to Python, C, and C++, UTF-8 is a great interior representation. It's naturally compressed in memory, and you can search for ASCII substrings like { or // within it, without decoding.
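This property is easy to see from the shell; it's a general fact about UTF-8 (ASCII bytes never appear inside a multi-byte sequence), not anything specific to Oils:

# grep searches bytes, yet reliably finds ASCII substrings in UTF-8 text
printf 'héllo // wörld\n' | grep -o '//'
# //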
At some point, I may write Four Reasons New Programming Languages Should Adopt a UTF-8 Centric Design:
Those 3 reasons should be enough. If not, PyPy showed us in 2019 how to use UTF-8 internally, while still retaining O(1) random code point access. You probably don't need this operation, but if you really do, it can be made both time- and space-efficient.
PyPy v7.1 released; now uses UTF-8 internally for Unicode strings (morepypy.blogspot.com via Hacker News)
153 points, 27 comments - on March 24, 2019
Important: even though Oils is UTF-8 centric, it works with languages that use any string representation. The post above would explain why we're diverging from bash.
I would also mention a bug I found in 2018: bash's ${#s}, which measures length in code points, is a non-monotonic function of bytes. That is, adding a byte on the end of a string can reduce its length! This happens because bash doesn't handle invalid UTF-8 properly.
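Here's a sketch of how that can happen; the exact counts depend on your bash version and require a UTF-8 locale, so treat it as illustrative:

s=$'\xe2\x98'        # 2 bytes: a truncated UTF-8 sequence
echo "${#s}"         # bash counts each invalid byte, e.g. 2
s=$'\xe2\x98\x85'    # append 1 byte, completing U+2605 '★'
echo "${#s}"         # now a single code point: 1 -- the length went down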
Still, I recognize there is a tremendous amount of confusion around strings and UTF-8. We could make our APIs more explicit:
Instead of len(s), we could have
s->numBytes() # O(1)
s->countRunes() # O(n), may raise decode error
Decoding:
$ var runes = s->toRunes()
$ write (runes)
[65, 20, 66, ... ]
$ var s2 = Str.fromRunes(runes) # not a method?
Indexing:
s->byteAt(i) # O(1)
s->findCharAt(i) # NOT useful, use toRunes() instead
Or maybe indexing should be s[i] because there's only one O(1) operation. Same question with slicing:
s[i:j] # O(1)
s->byteSlice(i, j) # O(1), is this better?
Iteration:
for byte in (s) { # Go iterates over runes, not bytes
write (byte)
}
for rune in (s->toRunes()) {
write (rune)
}
Substring search:
var i = s->find('//') # remember this works without decoding
Regex:
# replace a byte or rune?
var result = s->replace( / <dot> %end /, ^"$1" )
This is just an idea. Right now we have len(s) giving the number of bytes.
Either way, the point is that strings in Oils follow exterior reality. They're arbitrary byte strings that may or may not be UTF-8 encoded. In contrast, bash strings are NUL-terminated, but they also don't have to be valid Unicode. UTF-8 is not present in the Unix kernel — it's layered on top.
Now that we've talked about text, let's talk about structured data. Remember that it's layered on top of text, and that it's a big YSH feature:
Shell Should Be More Like Python, JavaScript, and Ruby
This is another way of saying that our data model is designed to be serialized, rather than serialization being an afterthought. In the intro, I said that:
What interior data structures will YSH have? To follow our exterior languages, I've decided on the following data model:
This can be described as either:
The idea is that interior structures and exterior languages map one-to-one, to the degree possible. I think having both ints and floats is important, because both JavaScript and Lua originally had a single number type, and grew proper integers after real usage.
So by choosing data structures to be in service of data languages, YSH is exterior-first.
But there are more one-to-one mapping problems. Here's the biggest one: JSON strings don't correspond to Unix strings.
This is a fairly technical issue, so I "forked" another post from this one:
I mentioned another demo in that post: Can \xff Be JSON-Piped Between Python and JavaScript? This would demonstrate more of the JSON-Unix String Mismatch. See blog-code/j8-notation if you want a preview.
How do we fix this? As mentioned in Sketches of YSH Features, we're adding \yff and \u{123456} to JSON strings, and calling those "J8 strings". This is the basis of "J8 Notation".
Mathematically:

j8encode() should be a total function over bytes, which is the set of values the Unix kernel gives you.

j8encode() and j8decode() should be a pair of bijective functions.

Practically speaking, these properties make it easier to write correct shell programs. You can use J8 strings instead of ad-hoc parsing with spaces, newlines, or commas.
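To see why the ad-hoc alternative is fragile, here's a plain-shell example of the kind of bug this avoids (the filenames are made up, and no J8 tools appear since they're still being implemented):

touch ok.txt $'bad\nname.txt'   # one filename contains a newline
ls | wc -l                      # 3 "lines" for 2 files -- line-based
                                # parsing of filenames silently breaks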
We reviewed code, text, and structured data, which showed how YSH favors the exterior viewpoint.
This is because it's meant to compose seamlessly with processes not written in shell!
In other words, Oils is not a closed world. It's part of an operating system, and part of distributed systems. Again, shell is a language that grows: Unix Shell: Philosophy, Design, and FAQs.
My "JSON Template" project from 2009 is relevant to the exterior-first philosophy. It's a string templating language that puts serialized data first — hence the name.
It no longer has an official repo because it was hosted on Google Code, which is now defunct. Ironically, I wrote it when I worked on Google Code itself!
(It lives on in the Oil repo as test/jsontemplate.py. It's been part of the "Wild" test report for years, and I recently ported our Soil CI dashboard to it.)
Why did I create JSON Template? Google Code was written in Python and JavaScript, and I didn't like using 2 template languages: one on the server, and another on the client. (Remember how different the ecosystem was prior to 2009: node.js didn't yet exist.)
So I designed a data-driven template language, wrote an interpreter for it in Python, and ported the interpreter line-for-line to JavaScript.
For a ~1200 line program, it was surprisingly influential! It was the "version 1" of Go's text/template:
We were technically co-workers, but Rob and Russ actually just found the project on Reddit. It was exciting to get this validation from much more experienced engineers!
After Go 1.0, text/template was redesigned in a more imperative style. The JSON Template influence is still present in:
{{.}} # "dot"
{{with X}} {{end}} # push a scope, and conditionally execute
Those correspond to the primitives of JSON Template:
{@} # the "cursor"
{.section X} {.end} # conditionally expand in a JSON namespace
Squarespace also started using it in 2010. I met the founder Anthony when they were a small company with a new office in Manhattan.
I thought they had moved off it, but I found this pretty recent YouTube video, which shows that it's still part of the Squarespace platform? I'd be interested if anyone understands how exactly it's used.
(The video shows templates using the {.section} {.end} syntax.)

I bring this up to show that it's useful to think about serialized text first, in the Unix style. I don't think of JSON Template as a language for Python, JavaScript, or Go. It stands alone — floating in the cloud — and that requires using a language-independent representation like JSON.
I might as well drop another JSON story here: I introduced JSON to Python creator Guido van Rossum in 2006, although I'm not sure it led to anything consequential. JSON was added to the Python library a few years later via the library simplejson, which would have happened anyway.
Another one of my defunct Google Code projects was "chutils", which had a program called dice. It was basically JSON Lines or ndjson in 2006: a set of Unix utilities that communicated with JSON over pipes.

I used it to analyze logs from Google's internal dev tools. In particular, the "hist" operator avoided the ad-hoc parsing of sort | uniq -c | sort -n:
$ cat x.tar.gz | to-json-lines | hist cmd # histogram by field name
905 log
405 commit
89 rm
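For comparison, here's a rough sketch of the ad-hoc pipeline that hist replaces (the log file and field position are hypothetical, and the output is illustrative):

# histogram the first field of a plain-text log, by hand
awk '{ print $1 }' commands.log | sort | uniq -c | sort -rn
#     905 log
#     405 commit
#      89 rm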
Guido was my office-mate at the time, and I remember he was pleasantly surprised by this "cool" use of Unix. I then showed him https://json.org/, and he said, "That's just Python!"
I believe that's almost literally true: all JSON will successfully eval() in the Python interpreter, as long as you define null, true, false = None, True, False.
Remember, this was 2006, and JSON was quietly invented in 2001. GMail and Google Maps popularized "AJAX" in 2004 and 2005, which was nominally based on XML. Server-side JavaScript didn't exist (or it was a failed Netscape experiment most people were unaware of.)
(History question: Why do Python and JavaScript share nearly the same syntax for {} and [] container literals? Python appeared in 1990, and JavaScript in 1995. Did they have a common ancestor, or did Python influence JavaScript? I recall that Guido said Python didn't invent this syntax, but I don't know where it came from. C doesn't have it.)
So despite playing with JSON for ~15 years, why is Oils coming around to it now? Well, Oils has had rough JSON support since 2019:
JSON Support in Oil Shell (oilshell.org via lobste.rs)
18 points, 19 comments on 2019-12-07
(Hmm, re-reading this thread is interesting; I may comment on the issue of objects vs. data later — it's interior vs. exterior!)
JSON-based data languages are becoming more central though. I'd say the main issues are:

Moving from PyObject* to our ASDL value_t. This is all over the codebase. I mentioned this in the 2023 Roadmap, and it's a big deal.

Implementing JSON in terms of value_t! For some reason I had put "writing our own JSON library" out of scope for Oils, but now I think it's in scope.

As I learned with dice and JSON Template, JSON isn't always appropriate for command line tools. Some things are more naturally modeled as tables, as I mentioned in What Is a Data Frame? (2018).

JSON has a few other weaknesses besides the JSON-Unix String Mismatch:
To summarize:

PyObject* is an interior waist, while Unix files are an exterior waist.

It also applies to shell design: Oils is exterior-first.
Now, what was the point of introducing PyObject*? It was an example of an interior narrow waist, but how does that relate to shell?
It explains the design: We won't have extensible data types like Python does! YSH is not for writing vector and matrix libraries :-)
In other words, the narrow waist of Oils is still exterior Unix files, not interior like PyObject*.
It's a Python-like language, but it's still a shell. You program "directly" with text, which is now structured.
Let's end this post with another question: do these abstract ideas matter?
I think they will be our north star for a clean, focused, and bounded language design. Even though shell is a popular, fast-growing language in 2023, I frequently see comments like these, with many upvotes:
We really gotta stop writing and using software written in shell. There are so many footguns in shell that these types of mistakes are inevitable.
Comment on acme.sh runs arbitrary commands from a remote server (github.com via lobste.rs)
60 points, 35 comments on 2023-06-08
This tells me that the shell language has become so complex that many users have given up hope of ever writing it correctly. They don't even want to start learning it.
For this case, the problem was the difference between "$@" and eval "$@", which I mentioned in this issue.
But even that's confusing: the four characters "$@" look similar to "$x", but have wildly different semantics. And eval implicitly joins its arguments, which is even more confusing.
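A short sketch of both points, in standard bash/POSIX shell (nothing YSH-specific; the sample values are made up):

set -- 'a b' c           # two arguments, one containing a space
printf '<%s>\n' "$@"     # "$@": each argument stays a separate word
# <a b>
# <c>

x='a b c'
printf '<%s>\n' "$x"     # "$x": always a single word
# <a b c>

eval 'echo hi;' 'date'   # eval joins its arguments with spaces, then
                         # parses the result as code: echo hi; date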
We have an opportunity to fix this with YSH. We're deep in the middle of it, with a lot left to do. But writing this series of posts has greatly clarified its design. We have a decision for essentially all design issues, although we'll certainly revise the language as we implement it.
I think we can produce something great!
Let me know what you think in the comments, which are now on Zulip.