QSN ("quoted string notation") is an interchange format for byte strings. Examples:
''                           # empty string
'my favorite song.mp3'
'bob\t1.0\ncarol\t2.0\n'     # tabs and newlines
'BEL = \x07'                 # byte escape
'mu = \u{03bc}'              # char escapes are encoded in UTF-8
'mu = μ'                     # represented literally, not escaped
It's meant to be emitted and parsed by programs written in different languages, as UTF-8 or JSON are. It's both human- and machine-readable.
Oil understands QSN, and other Unix shells should too.
QSN has many uses, but one is that Oil needs a way to safely and readably display filenames in a terminal.
Filenames may contain arbitrary bytes, including ones that will change your terminal color, and more. Most command line programs need something like QSN, or they'll have subtle bugs.
For example, as of 2016, coreutils quotes funny filenames to avoid the same problem. However, the format isn't specified, so other programs can't reliably parse it. In contrast, QSN can be parsed and printed like JSON.
Another use is set -x in shell. Like filenames, Unix argv arrays may contain arbitrary bytes. See the example below.
JavaScript Object Literals are to JSON as Rust String Literals are to QSN.
But QSN is not tied to either Rust or shell, just like JSON isn't tied to JavaScript.
It's a language-independent format like UTF-8 or HTML. We're only borrowing a design, so that it's well-specified and familiar.
TODO: The short description above should be sufficient, but we might want to write it out.
Special escapes: \t \r \n \' \" \\ \0
Byte escapes: \x7F
Character escapes: \u{03bc} or \u{0003bc}. These are encoded as UTF-8.
QSN can represent any byte string, like '\x00\xff\x00'. JSON can't represent binary data directly.
QSN can represent any code point, like '\u{01f600}' for 😀. JSON needs awkward surrogate pairs to represent this code point.
The input to a QSN encoder is a raw byte string. However, the string may have additional structure, like being UTF-8 encoded.
The encoder has three options to deal with this structure:
1. Don't decode UTF-8. Show non-ASCII bytes with byte escapes like \xce\xbc. Never emit escapes like \u{3bc} or literals like μ. This option is OK for machines, but isn't friendly to humans who can read Unicode characters.
Or speculatively decode UTF-8. After decoding a valid UTF-8 sequence, there are two options:
2. Show escaped code points, like \u{3bc}. The encoded string is limited to the ASCII subset, which is useful in some contexts.
3. Show them literally, like μ.
QSN encoding should never fail; it should only fall back to byte escapes like \xff.
TODO: Show the state machine for detecting and decoding UTF-8.
Note: Strategies 2 and 3 indicate whether the string is valid UTF-8.
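The three strategies above can be sketched in a few lines of Python. This is an illustration only, not the code in qsn_/qsn.py: the names qsn_encode and _utf8_len and the strategy parameter are made up for this sketch, and it leans on Python's strict UTF-8 decoder instead of an explicit state machine.

```python
# Hypothetical sketch; names and structure are assumptions, not Oil's API.
_SPECIAL = {0x09: "\\t", 0x0A: "\\n", 0x0D: "\\r", 0x27: "\\'", 0x5C: "\\\\"}


def _utf8_len(first: int) -> int:
    """Expected sequence length for a UTF-8 lead byte, or 0 if it isn't one."""
    if 0xC2 <= first <= 0xDF:
        return 2
    if 0xE0 <= first <= 0xEF:
        return 3
    if 0xF0 <= first <= 0xF4:
        return 4
    return 0


def qsn_encode(data: bytes, strategy: int = 3) -> str:
    """Encode bytes as a QSN string.

    strategy 1: never decode UTF-8, emit byte escapes only
    strategy 2: emit valid UTF-8 sequences as \\u{...} escapes
    strategy 3: emit valid UTF-8 sequences literally
    """
    out = ["'"]
    i = 0
    while i < len(data):
        b = data[i]
        if b in _SPECIAL:
            out.append(_SPECIAL[b])
            i += 1
        elif 0x20 <= b < 0x7F:  # printable ASCII
            out.append(chr(b))
            i += 1
        else:
            # Speculatively decode one UTF-8 sequence (strategies 2 and 3).
            n = _utf8_len(b) if strategy != 1 else 0
            ch = None
            if n:
                try:
                    ch = data[i:i + n].decode("utf-8")
                except UnicodeDecodeError:
                    ch = None  # truncated, overlong, or otherwise invalid
            if ch is None:
                out.append("\\x%02x" % b)  # never fail: fall back to a byte escape
                i += 1
            elif strategy == 2:
                out.append("\\u{%x}" % ord(ch))
                i += n
            else:
                out.append(ch)
                i += n
    out.append("'")
    return "".join(out)


print(qsn_encode(b"mu = \xce\xbc", strategy=1))
print(qsn_encode(b"mu = \xce\xbc", strategy=2))
print(qsn_encode(b"mu = \xce\xbc", strategy=3))
```

With strategies 2 and 3, any \x escape for a byte of \x80 or above in the output means the input contained invalid UTF-8 at that position.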
The reference implementation has two functions:
IsUnprintableLow: any byte below an ASCII space ' ' is escaped.
IsUnprintableHigh: the byte \x7f and all bytes above are escaped, unless they're part of a valid UTF-8 sequence.
In theory, only escapes like \' \n \\ are strictly necessary, and no bytes need to be hex-escaped. But that strategy would defeat the purpose of QSN for many applications, like printing filenames in a terminal.
You can see Oil's implementation in qsn_/qsn.py. It has an encoder with the various UTF-8 strategies. As of August 2020, the decoder is incomplete.
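To give a feel for what decoding involves, here is a rough Python sketch that handles the escapes listed in this document. It is not Oil's decoder: the function name qsn_decode and the regex-based approach are assumptions made for illustration, and a real decoder may be stricter about what it accepts.

```python
import re

# Hypothetical sketch; not the decoder in qsn_/qsn.py.
_SIMPLE = {
    "t": b"\t", "r": b"\r", "n": b"\n",
    "'": b"'", '"': b'"', "\\": b"\\", "0": b"\0",
}

_TOKEN = re.compile(
    r"\\x([0-9a-fA-F]{2})"          # byte escape, like \x7F
    r"|\\u\{([0-9a-fA-F]{1,6})\}"   # char escape, like \u{03bc}
    r"|\\(.)"                       # any other backslash escape
    r"|([^\\']+)",                  # run of literal characters
    re.DOTALL,
)


def qsn_decode(s: str) -> bytes:
    """Decode a single-quoted QSN string into the bytes it represents."""
    if len(s) < 2 or s[0] != "'" or s[-1] != "'":
        raise ValueError("QSN string must be single-quoted")
    body = s[1:-1]
    out = bytearray()
    pos = 0
    while pos < len(body):
        m = _TOKEN.match(body, pos)
        if m is None:
            # Unescaped quote or trailing backslash
            raise ValueError("unexpected character at offset %d" % pos)
        hex_byte, code_point, simple, literal = m.groups()
        if hex_byte is not None:
            out.append(int(hex_byte, 16))
        elif code_point is not None:
            # Char escapes are encoded as UTF-8
            out += chr(int(code_point, 16)).encode("utf-8")
        elif simple is not None:
            if simple not in _SIMPLE:
                raise ValueError("unknown escape \\%s" % simple)
            out += _SIMPLE[simple]
        else:
            out += literal.encode("utf-8")
        pos = m.end()
    return bytes(out)


print(qsn_decode(r"'mu = \u{03bc}'"))
print(qsn_decode(r"'\x00\xff\x00'"))
```

Unlike encoding, decoding can fail, so this sketch raises ValueError on malformed input.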
Note that we also have options to emit shell-compatible strings, which you probably don't need.
That is, C-escaped strings in bash look $'like this\n'. A subset of QSN is compatible with this syntax. Example:
$'\x01\n' # A valid bash string. Removing $ makes it valid QSN.
Something like $'\0065' is never emitted, because QSN doesn't contain octal escapes. It can be encoded with hex or character escapes.
The general idea: Rust string literals are like C and JavaScript string literals, without cruft like octal (\755 or \0755: which is it?) and vertical tabs (\v).
Comparison with shell strings:
'Single quoted strings' in shell can't represent arbitrary byte strings.
$'C-style shell strings\n' are similar to QSN, but have cruft like octal and \v.
"Double quoted strings" have unneeded features like $var and $(command sub).
Comparison with Python's repr():
A single quote is "'" in Python, whereas it's '\'' in QSN.
Python has \uxxxx and \Uxxxxxxxx, whereas QSN has the more natural \u{xxxxxx}.
In-band signaling is the fundamental problem with filenames and terminals. Code (control codes) and data are intermingled.
set -x example
When arguments don't have any spaces, there's no ambiguity:
$ set -x
$ echo two args
+ echo two args
Here we need quotes to show that the argv array has 3 elements:
$ set -x
$ x='a b'
$ echo "$x" c
+ echo 'a b' c
And we want the trace to fit on a single line, so we print a QSN string with \n:
$ set -x
$ x=$'a\nb'
$ echo "$x" c
+ echo $'a\nb' c
Here's an example with unprintable characters:
$ set -x
$ x=$'\e\001'
$ echo "$x"
+ echo $'\x1b\x01'
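The quoting behavior in these traces can be approximated with a short Python sketch. This is a guess at the logic, not how Oil implements set -x: quote_arg, trace_line, and the set of characters treated as safe in a bare word are all assumptions. For simplicity it never decodes UTF-8, so non-ASCII bytes come out as \x escapes.

```python
import re

# Hypothetical sketch; the "bare word" character set is an assumption.
_BARE = re.compile(rb"[A-Za-z0-9_./=+-]+\Z")
_ESCAPES = {0x09: "\\t", 0x0A: "\\n", 0x0D: "\\r", 0x27: "\\'", 0x5C: "\\\\"}


def quote_arg(arg: bytes) -> str:
    """Quote one argv entry so it is unambiguous and fits on one line."""
    if _BARE.match(arg):
        return arg.decode("ascii")  # no quotes needed
    if all(0x20 <= b < 0x7F and b not in (0x27, 0x5C) for b in arg):
        return "'" + arg.decode("ascii") + "'"  # plain single quotes suffice
    parts = []
    for b in arg:
        if b in _ESCAPES:
            parts.append(_ESCAPES[b])
        elif 0x20 <= b < 0x7F:
            parts.append(chr(b))
        else:
            parts.append("\\x%02x" % b)  # unprintable or non-ASCII byte
    return "$'" + "".join(parts) + "'"  # shell-compatible C-style string


def trace_line(argv) -> str:
    """Format an argv array (a list of bytes) like a set -x trace line."""
    return "+ " + " ".join(quote_arg(a) for a in argv)


print(trace_line([b"echo", b"two", b"args"]))
print(trace_line([b"echo", b"a b", b"c"]))
print(trace_line([b"echo", b"a\nb", b"c"]))
print(trace_line([b"echo", b"\x1b\x01"]))
```

The four calls reproduce the four trace lines shown in the examples above.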