Oil Has Multi-line Commands and String Literals

2021-09-19

This post describes two new syntaxes that make Oil programs easier to read and write. Let me know what you think in the comments!

Table of Contents

Multi-Line Commands With A ... Prefix

Multi-Line String Literals: """ and ''' and $'''

An Orthogonal Design Is Easier to Use and Remember

But Nothing Is Perfect

They Can Be Combined

Reminder: Doc Comments With ###

What's Next?

Please Test Oil 0.9.2

Appendices

How Are Multi-Line Commands Parsed?

How Are Multi-Line Strings Parsed?

Review of Syntax Proposals

Multi-Line Commands With A `...` Prefix

In Proposed Changes to Oil's Syntax (November 2020), I mentioned this problem with shell:

cat file.txt \    
  | sort \          # I can't put a comment here
  | cut -f 1 \
    # And I can't put one here
  | grep foo

That is, documenting long commands is hard because you can't mix \ line continuations and comments. I just released Oil 0.9.2, which solves this problem:

... cat file.txt    
  | sort            # Comment to the right is valid
  | cut -f 1 
    # Comment on its own line is valid
  | grep foo
  ;                 # Explicit terminator required

In the multiline context started by the ... prefix:

A single newline behaves like a space.
A blank line (two newlines in a row) is illegal. This means that forgetting the ; terminator won't cause multi-line mode to "bleed" into the next command.
A line containing only a comment is allowed.

The appendix describes how this is implemented.

I've tagged this post #real-problems, since this mechanism solves a problem that multiple shell users have encountered. For example, see this January Reddit thread on Shell Scripts Are Executable Documentation.

Multi-Line String Literals: `"""` and `'''` and `$'''`

In June's post Recent Progress on the Oil Language, I wrote that Oil has Python-like multi-line string literals, with enhancements in the style of the Julia language.

Here are examples from the Oil Language Tour.

Double-quoted multi-line strings allow interpolation with $:

var x = 6  # x must be 6 to produce the output shown below

sort <<< """
  var sub: $x
  command sub: $(echo hi)
  expression sub: $[x + 3]
  """
# =>
# command sub: hi
# expression sub: 9
# var sub: 6

In single-quoted multi-line strings, every character is literal, including $:

sort <<< '''
  $2.00  # literal $, no interpolation
  $1.99
  '''
# =>
# $1.99
# $2.00

C-style multi-line strings interpret character escapes:

sort <<< $'''
  C\tD
  A\tB
  '''
# =>
# A        B
# C        D

An Orthogonal Design Is Easier to Use and Remember

(This section is long and relies on shell expertise. If you only care about using Oil, as opposed to understanding the design, feel free to skip it.)

These string literals are better than shell's here doc syntax in three ways:

(1) Leading whitespace is stripped in a more useful way.

Oil uses the indentation of the closing delimiter (for example, ''') to decide how much leading whitespace to strip from each line. If you don't want any stripping, don't indent the closing quote. (This rule is similar but not identical to Julia's rule.)
In contrast, shell has an obscure <<-EOF syntax (as opposed to <<EOF), which strips leading tabs but not spaces.
Python never strips leading whitespace, which means that multi-line strings often mess up the indentation of your program.
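
The tabs-versus-spaces behavior of <<- can be checked in bash or any POSIX-style shell. This is a small sketch, not something from the Oil docs: it writes the script with printf so that the tab characters are explicit, and the variable name out is just for illustration.

```shell
# Feed a tiny script to a child shell. printf turns each \t into a real tab.
# The here doc body has one tab-indented line and one space-indented line.
out=$(printf 'cat <<-EOF\n\ttab-indented\n    space-indented\n\tEOF\n' | sh)

# <<- removed the leading tab, but the four leading spaces are still there.
printf '%s\n' "$out"
```

So with here docs, whether your indentation gets stripped depends on which whitespace character your editor inserted.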

(2) Multi-line strings are consistent with regular strings with respect to $var interpolation and character escapes like \n.

That is, you can use triple quotes instead of single quotes in "hello $name", 'single', and $'\n', and it means the same thing.
Shell has the odd rule that a here doc with an unquoted delimiter like EOF allows $var interpolation, but here docs with quoted delimiters like 'EOF' or \EOF don't interpret $var.
Shell doesn't have here docs that allow character escapes like \n (at least not statically-parsed ones). That is, the design isn't orthogonal.
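
To make the here doc delimiter rule concrete, here is a short sketch that runs in bash or any POSIX-style shell. The variable names are only for illustration.

```shell
name=world

# Unquoted delimiter: $name is expanded, like in a double-quoted string.
unquoted=$(cat <<EOF
hello $name
EOF
)

# Quoted delimiter: the body is literal, like in a single-quoted string.
quoted=$(cat <<'EOF'
hello $name
EOF
)

printf '%s\n' "$unquoted" "$quoted"
# prints:
# hello world
# hello $name
```

Whether interpolation happens is decided by the spelling of the delimiter rather than by the quotes around the text itself, which is the inconsistency that triple-quoted strings avoid.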

(3) Multi-line strings can be used in either commands or redirects.

In contrast, here docs can't be used directly with commands like echo, and the alternative causes too much I/O.

To elaborate, recall that this use of the <<< "here string" operator works in bash and OSH:

$ tr a-z A-Z <<< 'hello'
HELLO

And remember that the sort examples above used the <<< operator and not the << "here doc" operator. This is because Oil's multi-line strings are actually string literals!

Another consequence of this is that you can use a multi-line string directly in a command, as part of argv:

echo '''
  one
  two
  three
  '''
# =>
# one
# two
# three

In shell, regular strings can span multiple lines, but there's no way to strip leading whitespace, which makes code hard to read:

echo 'one
two
three'
# =>
# one
# two
# three

You could use a here doc and cat:

# This does too much I/O for a simple task
cat <<EOF
one
two
three
EOF

For such a simple task, this is inefficient in two ways:

It causes I/O because Shells Use Temp Files to Implement Here Documents. (Oil doesn't use disk I/O, but it does start a "here doc writer" process.)
It starts an external process, cat, rather than using the echo builtin.

To recap, I like this design because it's more orthogonal in at least 3 dimensions:

Whether whitespace is stripped
Whether $var interpolation and character escapes like \n are respected
Whether the string is used in a command or redirect

Also note:

Multi-line string literals are enabled in bin/oil, but not bin/osh. (You can also explicitly set shopt --set parse_triple_quote in bin/osh.)
Here docs aren't recommended, but you can still use them in the rare case that choosing the delimiter, like EOF, is useful.

But Nothing Is Perfect

However, Oil's string literal syntax still has a "wart": you can't put (statically-parsed) character escapes like \n in double quoted strings.

Unfortunately, this is not an orthogonal design. (We even document the warts for you; most languages don't.)

I've lived with this for a while and think it's OK. I believe it's important to keep not just the Oil language small, but also the combined OSH+Oil "surface area". In other words, I'm happy with 6 kinds of string literal (3 x 2 for the multi-line variants), but I would not like 8, 10, or 12 kinds.

As always, I welcome contributions in this direction. However, I'd also suggest that this isn't the issue to start with, since it's one of the most difficult design issues.

They Can Be Combined

This ugly example combines multi-line commands and multi-line strings, and gives our parsing algorithms a workout! There's no reason for this in production code, but it illustrates the principle.

var x = 'one'

# print 3 args without separators
... write --sep '' --end '' --  
    """
    $x
    """         # 1. Double Quoted Multi-Line String
    '''
    two
    three
    '''         # 2. Single Quoted Multi-Line String
    $'four\n'   # 3. C-style string with explicit newline
  | tac         # Reverse
  | tr a-z A-Z  # Uppercase
  ;
# =>
# FOUR
# THREE
# TWO 
# ONE

Reminder: Doc Comments With `###`

I also described Oil's doc comment feature in November of last year:

More Changes to Oil's Syntax

The line below a proc can have a special ### comment, and its value can be retrieved with pp proc.

proc restart(pid) {
  ### Restart server by sending it a signal
  
  kill $pid
}

What's Next?

A Tour of the Oil Language describes both of these features, and it was discussed on Hacker News a few days ago.

A few familiar questions about the project came up, so I drafted Blog Backlog: FAQ, Project Review, and the Future.

But I might just cut to the chase with What To Expect From Oil in the Near Future.

Please Test Oil 0.9.2

Try these features out and tell me if there are any bugs! That is the main purpose of these blog posts.

Oil version 0.9.2 - Source tarballs and documentation.

Appendices

How Are Multi-Line Commands Parsed?

These notes are for contributors and people who want to reimplement the Oil language. I used our style of #parsing-shell to implement the subtle multi-line command syntax. It falls slightly outside what you'll see in textbooks on parsing.

First, here's an unusual fact: Oil has two levels of tokenization due to the inherent structure of the shell language.

The Lexer outputs Token objects, and the WordParser consumes them.
- Example: --flag="val $x" is a word consisting of multiple tokens.
The WordParser outputs word_t objects (compound_word or Token), and the CommandParser consumes them.
- Example: mycommand --flag="val $x" > out.txt is a command consisting of multiple words.

To parse multi-line commands, we look for the ... prefix word at the start of an AndOr production in the shell grammar. This production handles chains like cd / && ls | wc -l && echo OK.

If we see ..., then we use a Python context manager to flip a flag on the WordParser to enter multi-line mode. When it's in this mode, it treats newlines and blank lines differently: a single newline behaves like a space, and a blank line is an error. (Python context managers are translated to C++ constructors and destructors by mycpp.)

Because ... is an unusual command prefix, I don't expect this to break existing shell code. So multi-line commands are valid in both bin/osh and bin/oil.

(Productivity note: I search the code for symbols like WordParser with grep $name */*.py.)

How Are Multi-Line Strings Parsed?

On the other hand, '''foo''' already has a meaning in shell. It's three string literals side by side using implicit concatenation.

''
'foo'
''
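
This is easy to confirm in bash or any POSIX-style shell. In the short check below (the variable names are illustrative), both triple-quoted spellings collapse to plain foo because the outer pairs of quotes are empty strings:

```shell
# Three adjacent single-quoted literals: '' + 'foo' + ''
s='''foo'''

# The same thing with double quotes: "" + "foo" + ""
d="""foo"""

printf '%s\n' "$s" "$d"
# prints:
# foo
# foo
```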

We take advantage of this to parse multi-line string literals when shopt --set parse_triple_quote is on. That is, we do not have tokens for ''', """, and $'''. Instead, we actually look for an empty string at the start of a word, then switch into another WordParser mode, and strip whitespace when we're done.

This is unusual, but it means that OSH and Oil share the same command and word lexer modes. This is a desirable property for keeping the upgrade path from OSH to Oil smooth, and I think it will make syntax highlighters and other tools easier to write.

Review of Syntax Proposals

This post described two syntax features, which happen to be the first two in Proposed Changes to Oil's Syntax (November 2020).

What about the others?

${x|html} and ${x %.2f} for string formatting. I think these are important, and I've only just started implementing them.
$/ d+ / for inline Egg expressions. I think we can do this now that we have shopt --set strict_dollar, which disallows echo $/ because it's equivalent to echo \$/ and echo '$/'. That is, we don't need another parsing option.
shopt --set parse_amp for redirects. This is deferred. I believe it's OK to "memorize" a few idioms for redirects, and again I want to keep the combined OSH+Oil surface area small.
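
The equivalence mentioned in the $/ item above can be checked in bash, without any Oil options set. This is only a sketch of the existing shell behavior that strict_dollar rules out; the variable names are illustrative.

```shell
# bash keeps a $ literally when it isn't followed by something it can
# expand, so all three commands print the same two characters.
a=$(echo $/)
b=$(echo \$/)
c=$(echo '$/')

printf '%s\n' "$a" "$b" "$c"
```

Since the escaped and quoted spellings already say the same thing, disallowing the bare one frees up $/ for new syntax.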
