Translating Shell to Oil

2017-02-05 (Last updated 2019-10-05)

Update 10/2019: The Oil language has changed. This post is accurate in spirit but not in detail.

In Success with ASDL, I mentioned that a top project priority is to automatically translate shell programs to the oil language. The ability to express real programs is a test of the language's design, especially when they're written by others.

I've done perhaps 25% of the work, but the translations are starting to look accurate. Language features are apparently used in a Pareto or "long tail" distribution: a small set of features accounts for most of the code in real programs.

In this post, I'll show a translated program and explain the oil language features it uses.

Why a New Language?

Before looking at code, let's remind ourselves of the motivation. At first glance, this project seems similar to CoffeeScript. We want a better syntax for shell, in order to reveal its powerful semantics, e.g. Bernstein chaining and pipelines.

I believe this is important because syntax matters.

But more important is that shell syntax leaves no room for extension. New features will necessarily have a tortured syntax, such as bash's ^ and ^^ operators for uppercasing a string and its , and ,, operators for lowercasing one. I plan to justify this further in a post called Declaring Syntax Bankruptcy on Shell.

So the bigger motivation for the oil language is to add features to shell, in particular borrowing some from awk and make, as well as providing a dialect for config files. I'm excited about these goals, but they require getting past some tedious work.

Selected Translations

I plan to show two files from Aboriginal Linux and two files from the /etc/init.d directory on my Ubuntu machine. (Early blog posts: Aboriginal, init.d.)

I chose these files because they're short, use a variety of language features, and can now be translated automatically. We'll see the first file today, and the remaining three tomorrow.

Open sources/toys/make-hdb.sh in a new window to see:

original shell source,
an oil translation, and
the pretty-printed AST.

Make sure to widen the window so that the two code panes appear side-by-side.

Notice that whitespace and comments are intentionally preserved. That is, if your style is to put then on its own line, the opening { in oil will also be on its own line. I'll describe the algorithm for style-preserving translation in a future post.

To repeat, the original code is:

make_hdb()
{
  # Some distros don't put /sbin:/usr/sbin in the $PATH for non-root users.
  if [ -z "$(which  mke2fs)" ] || [ -z "$(which tune2fs)" ]
  then
    export PATH=/sbin:/usr/sbin:$PATH
  fi

  truncate -s ${HDBMEGS}m "$HDB" &&
  mke2fs -q -b 1024 -F "$HDB" -i 4096 &&
  tune2fs -j -c 0 -i 0 "$HDB"

  [ $? -ne 0 ] && exit 1
}

And here is the oil code, slightly reformatted by hand:

proc make_hdb {
  # Some distros don't put /sbin:/usr/sbin in the $PATH for non-root users.
  if test -z $[which mke2fs] || test -z $[which tune2fs] {
    export PATH = "/sbin:/usr/sbin:$PATH"
  }

  truncate -s $(HDBMEGS)m $HDB      &&
  mke2fs -q -b 1024 -F $HDB -i 4096 &&
  tune2fs -j -c 0 -i 0 $HDB

  test $Status -ne 0 && exit 1
}

They look similar from a distance, which is good. But notice the following changes:

(1) The proc keyword. Oil will have both "procs" and functions, denoted with keywords proc and func.

Procs are what we call shell "functions": they accept an argv array of strings, return an integer status, and have file descriptors. They resemble both processes and procedures.

Functions are like those in Python or JavaScript. They have typed arguments and return values.

One important use case for functions is user-defined interactive completion. Bash has a convention to mutate globals, e.g. COMPREPLY, but proper return values are preferable.
Another use case is string manipulation, e.g. to escape HTML or SQL. You can fake this by writing a "return value" to stdout and capturing it with a subshell, but this requires forking for every function call.

So it makes sense to have proper functions, but procs are important too because they're isomorphic to an external process. I'll explain how they work together in a future post.
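To make the two shell workarounds from (1) concrete, here is a minimal sketch in plain shell, not Oil. The names html_escape, set_result, and RESULT are made up for illustration and don't come from make-hdb.sh.

```shell
# Hypothetical helper: "returns" a string by printing it to stdout.
html_escape() {
  printf '%s\n' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Capturing the "return value" needs command substitution,
# which runs the function in a forked subshell.
escaped=$(html_escape '<a & b>')
echo "$escaped"

# The other convention: mutate a global, as bash completion does with COMPREPLY.
set_result() {
  RESULT="hello $1"
}
set_result world
echo "$RESULT"
```

Every call through $(...) pays for a fork, which is the cost that proper functions with return values would avoid.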

(2) if uses curly braces as block delimiters instead of then and fi. Reasons for this:

Consistency: In shell, function bodies are delimited by braces, while other blocks are delimited by keywords like do and done. In oil, all blocks use braces.
Huffman coding: Block delimiters are common, so they should be short, and braces are shorter than keywords. Python-style indented blocks are even shorter, but aren't suitable for a shell because the language is meant to be typed interactively.

Note that { is an operator in oil, but confusingly it isn't in shell. See discussion below.

(3) The conversion uses test instead of [. Oil will have C-style infix boolean expressions, but legacy code may use test.

Not only is the [ command an ugly syntactic pun, but the [ character is an operator in oil, so it requires quoting when it appears in a command name.

The fact that [ and { aren't operators in shell prevents the language from evolving, because existing programs may use them as ordinary characters, even in a file name. For example:

$ echo 'echo hi from script with funny name' > ]{
$ chmod +x ]{
$ ./]{
hi from script with funny name

In oil, you would just add single quotes like this: ']{'.
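The "syntactic pun" in (3) is easier to see in a small sketch: in shell, [ is an ordinary command name whose last argument happens to be ]. This is plain shell, and the variable names are illustrative only.

```shell
# '[' is a command; the closing ']' is just its last argument.
if [ -z "" ]; then bracket=empty; fi

# 'test' is the same command without the closing ']'.
if test -z ""; then plain=empty; fi

echo "$bracket $plain"

# Since '[' is only a word, it can be looked up like any other command name.
command -v '['
```

This is why the translator can mechanically rewrite [ ... ] as test ... without changing behavior.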

(4) Special variables look like $Status rather than $?. In oil code, we prefer readable names. A completion system that's configured well by default will make them easy to type.
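For readers less familiar with $?, the sketch below shows what the last line of make_hdb checks. It is plain shell, not Oil. run_steps is a hypothetical stand-in for the truncate, mke2fs, and tune2fs chain, and it uses return instead of exit so that the calling shell survives.

```shell
run_steps() {
  true &&
  false &&               # this step fails, so the chain stops here
  echo "never reached"

  # $? holds the status of the chain above, i.e. of the failing step.
  [ $? -ne 0 ] && return 1
  return 0
}

if run_steps; then result=ok; else result=failed; fi
echo "$result"
```

The name $Status would carry the same value; only the spelling changes.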

The remaining observations require some background. Recall that shell is composed of four mutually recursive sublanguages:

the command language: for, if, functions, ...
the word language: ${}, $(), $(()), ...
the arithmetic language: a**2 + b**2
the boolean language: [[ a =~ b ]]
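As a rough illustration, the following snippet touches all four sublanguages in a few lines. It assumes bash, since the [[ ]] construct is not POSIX, and the variable names are invented for the example.

```shell
a=3
b=4
name=hello

# command language: 'if', 'then', and 'fi' structure the program
if [[ $name =~ ^h ]]; then          # boolean language: [[ ... ]]
  sum=$(( a * a + b * b ))          # arithmetic language: inside $(( ))
  msg="${name}: $(echo "$sum")"     # word language: ${} and $()
fi

echo "$sum"
echo "$msg"
```

Each construct has its own parsing rules, which is the complexity the list above describes.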

Roughly speaking, shell has a separate expression language for each type: strings, integers, and booleans. Oil does away with this complexity with a single expression language for all types, like C or Python.

As a result, it has just two sublanguages: commands and expressions.

The [] characters are used for arrays, and the () characters are used for grouping expressions, as in most languages. So it makes sense for $[] to be command substitution and $() to be expression substitution. Commands are simply arrays of strings.

Keeping the two sublanguages in mind, notice:

(5) $(HDBMEGS) is a delimited variable substitution, in contrast to ${HDBMEGS}.

(6) $[which mke2fs] is command substitution, in contrast to $(which mke2fs).

Arithmetic substitution will be $(x + 1) instead of $((x + 1)). Strings and integers don't need different substitution syntax.

(7) Substitutions aren't quoted. Oil doesn't split words, because word splitting is a misfeature designed to simulate arrays. (Most shell implementations have arrays as an extension, but they're not in POSIX.)

Splitting can be done explicitly with @split(HDB) or @[which mke2fs]. The @ character is associated with arrays, e.g. for splitting and splicing.
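To see why word splitting is treated as a misfeature, here is a small sketch in plain shell, not Oil. HDB is given a hypothetical value containing a space, and count_args is a made-up helper that reports how many arguments it received.

```shell
HDB='my disk.img'   # hypothetical value containing a space

count_args() { echo "$#"; }

unquoted=$(count_args $HDB)     # the shell splits the value into two words
quoted=$(count_args "$HDB")     # quoting keeps it as one word

echo "$unquoted $quoted"
```

This is the reason the original script quotes "$HDB" everywhere, and the reason the oil translation can drop the quotes.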

(8) In contrast, strings on the right-hand side of assignments must be quoted. In expression mode, strings must be quoted; and everything to the right of = is parsed in expression mode as opposed to command mode. This will be implemented with the lexer modes technique (formerly lexical state).

Examples:

echo foo bar  # command mode: command and two literal words
foo = bar     # expression mode: bar is a variable, as in C or Python
foo = 'bar'   # bar is a string

x = 1 + 2 * 3           # an integer expression
s = myStr or 'default'  # a string expression

Also notice that = is a proper operator and may have spaces around it.
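For contrast, here is how shell itself treats spaces around =. This is a sketch in plain shell. The function name foo is hypothetical and exists only to make the behavior visible.

```shell
foo() { echo "called with: $*"; }

foo=bar            # assignment: the variable foo now holds the literal string 'bar'
out=$(foo = bar)   # not an assignment: runs the command foo with arguments '=' and 'bar'

echo "$foo"
echo "$out"
```

In shell the unquoted bar on the right-hand side is a string, whereas in oil's expression mode it would name a variable.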

The seven-line make-hdb.sh script showed us many features of the oil language. It has:

"procs",
blocks with curly braces {},
arrays with brackets [],
= [ ] { and } as proper operators,
more descriptive variable names, and
command and expression sublanguages.

Also, word splitting is explicit rather than automatic.

Tomorrow, I'll reveal more language features with three more automatic translations.
