QSN ("quoted string notation") is a data format for byte strings. Examples:
''                          # empty string
'my favorite song.mp3'
'bob\t1.0\ncarol\t2.0\n'    # tabs and newlines
'BEL = \x07'                # byte escape
'mu = \u{03bc}'             # Unicode char escape
'mu = μ'                    # represented literally, not escaped
It's an adaptation of Rust's string literal syntax with a few use cases. For example, QSN writes a newline as \n, so literal newlines can be used to delimit records.

Oil uses QSN because it's well-defined and parsable. It's both human- and machine-readable.
Any programming language or tool that understands JSON should also understand QSN.
- Newlines are encoded as \n, not literal.
- Tabs are encoded as \t, not literal.
- The encoded string 'μ \xff' is valid UTF-8, even though the decoded string is not.
- The encoded string '\xce\xbc' is valid ASCII, even though the decoded string is not.

- set -x in shell. Like filenames, Unix argv arrays may contain arbitrary bytes. There's an example in the appendix.
- ps has to display untrusted argv arrays.
- ls has to display untrusted filenames.
- env has to display untrusted byte strings. (Most versions of env don't handle newlines well.)

JavaScript Object Literals are to JSON as Rust String Literals are to QSN.
But QSN is not tied to either Rust or shell, just like JSON isn't tied to JavaScript.
It's a language-independent format like UTF-8 or HTML. We're only borrowing a design, so that it's well-specified and familiar.
TODO: The short description above should be sufficient, but we might want to write it out.
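- Backslash escapes: \t \r \n \' \" \\ \0
- Byte escapes, like \x7F
- Unicode character escapes, like \u{03bc} or \u{0003bc}. These are encoded as UTF-8.

Compared with JSON strings:

- QSN can represent arbitrary bytes, like '\x00\xff\x00'. JSON can't represent binary data directly.
- QSN can write any code point directly, like '\u{01f600}' for 😀. JSON needs awkward surrogate pairs to represent this code point.

The input to a QSN encoder is a raw byte string. However, the string may have additional structure, like being UTF-8 encoded.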
The encoder has three options to deal with this structure:
1. Don't decode UTF-8. Show non-ASCII bytes as byte escapes like \xce\xbc. Never emit escapes like \u{3bc} or literals like μ. This option is OK for machines, but isn't friendly to humans who can read Unicode characters.

Or speculatively decode UTF-8. After decoding a valid UTF-8 sequence, there are two options:

2. Show escaped code points, like \u{3bc}. The encoded string is limited to the ASCII subset, which is useful in some contexts.
3. Show them literally, like μ.

QSN encoding should never fail; it should only fall back to byte escapes like \xff. TODO: Show the state machine for detecting and decoding UTF-8.
Note: Strategies 2 and 3 indicate whether the string is valid UTF-8.
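The three strategies can be sketched in Python as follows. This is a minimal illustration, not Oil's reference encoder: the names qsn_encode, BYTES_ONLY, U_ESCAPE, and U_LITERAL are hypothetical, and it leans on Python's strict UTF-8 decoder instead of an explicit state machine. It also prints every valid code point literally under strategy 3, which a real encoder might refine.

```python
# Hypothetical sketch of the three encoder strategies; not Oil's qsn.py.
BYTES_ONLY, U_ESCAPE, U_LITERAL = 1, 2, 3

# Bytes that get a named backslash escape in this sketch.
_SPECIAL = {0x09: "\\t", 0x0A: "\\n", 0x0D: "\\r", 0x27: "\\'", 0x5C: "\\\\"}


def _utf8_len(data: bytes, i: int) -> int:
    """Length of the valid multi-byte UTF-8 sequence at data[i], or 0 if none."""
    for n in (2, 3, 4):
        chunk = data[i:i + n]
        if len(chunk) < n:
            break
        try:
            if len(chunk.decode("utf-8")) == 1:
                return n
        except UnicodeDecodeError:
            continue
    return 0


def qsn_encode(data: bytes, strategy: int = U_LITERAL) -> str:
    parts = []
    i = 0
    while i < len(data):
        b = data[i]
        n = 1
        if b in _SPECIAL:
            parts.append(_SPECIAL[b])
        elif b < 0x20 or b == 0x7F:
            parts.append("\\x%02x" % b)      # unprintable ASCII
        elif b < 0x80:
            parts.append(chr(b))             # printable ASCII
        else:
            # Strategy 1 never decodes; 2 and 3 decode speculatively.
            n = 0 if strategy == BYTES_ONLY else _utf8_len(data, i)
            if n == 0:
                parts.append("\\x%02x" % b)  # fall back to a byte escape
                n = 1
            elif strategy == U_ESCAPE:
                parts.append("\\u{%x}" % ord(data[i:i + n].decode("utf-8")))
            else:
                parts.append(data[i:i + n].decode("utf-8"))
        i += n
    return "'" + "".join(parts) + "'"


print(qsn_encode(b"mu = \xce\xbc", BYTES_ONLY))  # 'mu = \xce\xbc'
print(qsn_encode(b"mu = \xce\xbc", U_ESCAPE))    # 'mu = \u{3bc}'
print(qsn_encode(b"\xce\xbc \xff", U_LITERAL))   # 'μ \xff'
```

In the last call the invalid byte \xff falls back to a byte escape instead of raising an error, so encoding never fails.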
The reference implementation has two functions:
- IsUnprintableLow: any byte below an ASCII space ' ' is escaped.
- IsUnprintableHigh: the byte \x7f and all bytes above are escaped, unless they're part of a valid UTF-8 sequence.

In theory, only escapes like \' \n \\ are strictly necessary, and no
bytes need to be hex-escaped. But that strategy would defeat the purpose of
QSN for many applications, like printing filenames in a terminal.
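As a rough illustration of the two checks, here is a Python sketch. The snake_case names are hypothetical stand-ins for IsUnprintableLow and IsUnprintableHigh, and the "part of a valid UTF-8 sequence" exception is assumed to be handled by the caller. The reference implementation may differ in its details.

```python
# Hypothetical Python versions of the two predicates described above.
def is_unprintable_low(byte: int) -> bool:
    """True for any byte below an ASCII space (0x20)."""
    return byte < 0x20


def is_unprintable_high(byte: int) -> bool:
    """True for 0x7f and above. The caller is assumed to skip this check
    for bytes that belong to a valid UTF-8 sequence."""
    return byte >= 0x7F


# A filename that embeds a terminal escape sequence and a DEL byte.
name = b"evil\x1b[2Jfile\x7f"
flagged = [hex(b) for b in name
           if is_unprintable_low(b) or is_unprintable_high(b)]
print(flagged)  # ['0x1b', '0x7f']
```

Escaping the flagged bytes keeps the ESC byte from reaching the terminal, which is the point of hex-escaping when printing filenames.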
QSN decoders must enforce (at least) these syntax errors:
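- Literal tabs or newlines. They must be written as \t or \n. (The lack of literal tabs and newlines is essential for QTT.)
- Invalid backslash escapes, like \z
- Invalid byte escapes, like \xgg
- Invalid Unicode escapes, like \u{123 (incomplete)

Separate messages aren't required for each error; the only requirement is that decoders not accept these sequences.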
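A decoder that rejects these sequences might look like the following Python sketch. It is a simplified illustration rather than Oil's decoder: the function name qsn_decode and the error messages are hypothetical, and it additionally assumes that an unescaped single quote inside the string is an error.

```python
# Hypothetical QSN decoder sketch; rejects the syntax errors listed above.
_SIMPLE = {"t": 9, "r": 13, "n": 10, "'": 39, '"': 34, "\\": 92, "0": 0}
_HEX = "0123456789abcdefABCDEF"


def _all_hex(digits: str) -> bool:
    return all(d in _HEX for d in digits)


def qsn_decode(s: str) -> bytes:
    if len(s) < 2 or s[0] != "'" or s[-1] != "'":
        raise ValueError("expected surrounding single quotes")
    body = s[1:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        c = body[i]
        if c in "\t\n":
            raise ValueError("literal tab or newline")
        if c == "'":
            raise ValueError("unescaped single quote")
        if c != "\\":
            out += c.encode("utf-8")      # literal character, e.g. μ
            i += 1
            continue
        kind = body[i + 1:i + 2]          # empty if the backslash is last
        if kind in _SIMPLE:
            out.append(_SIMPLE[kind])
            i += 2
        elif kind == "x":
            digits = body[i + 2:i + 4]
            if len(digits) != 2 or not _all_hex(digits):
                raise ValueError("invalid byte escape")
            out.append(int(digits, 16))
            i += 4
        elif kind == "u":
            end = body.find("}", i)
            digits = body[i + 3:end] if end != -1 else ""
            if (body[i + 2:i + 3] != "{" or not 1 <= len(digits) <= 6
                    or not _all_hex(digits)):
                raise ValueError("invalid Unicode escape")
            # chr() and encode() raise ValueError for out-of-range
            # code points and surrogates.
            out += chr(int(digits, 16)).encode("utf-8")
            i = end + 1
        else:
            raise ValueError("invalid backslash escape")
    return bytes(out)


print(qsn_decode("'mu = \\u{03bc}'"))    # b'mu = \xce\xbc'
print(qsn_decode("'\\x00\\xff\\x00'"))   # b'\x00\xff\x00'
```

All failures raise ValueError, which matches the note that separate messages per error aren't required.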
The encoder has options to emit shell-compatible strings, which you probably
don't need. That is, C-escaped strings in bash look $'like this\n'.
A subset of QSN is compatible with this syntax. Example:
$'\x01\n' # A valid bash string. Removing $ makes it valid QSN.
Something like $'\0065' is never emitted, because QSN doesn't contain octal
escapes. It can be encoded with hex or character escapes.
The general idea: Rust string literals are like C and JavaScript string
literals, without cruft like octal (\755 or \0755: which is it?) and
vertical tabs (\v).
Comparison with shell strings:
- 'Single quoted strings' in shell can't represent arbitrary byte strings.
- $'C-style shell strings\n' are similar to QSN, but have cruft like octal and \v.
- "Double quoted strings" have unneeded features like $var and $(command sub).

Comparison with Python's repr():

- Python's repr() shows a single quote as "'", whereas it's '\'' in QSN.
- Python has \uxxxx and \Uxxxxxxxx, whereas QSN has the more natural \u{xxxxxx}.

set -x example

When arguments don't have any spaces, there's no ambiguity:
$ set -x
$ echo two args
+ echo two args
Here we need quotes to show that the argv array has 3 elements:
$ set -x
$ x='a b'
$ echo "$x" c
+ echo 'a b' c
And we want the trace to fit on a single line, so we print a QSN string with \n:
$ set -x
$ x=$'a\nb'
$ echo "$x" c
+ echo $'a\nb' c
Here's an example with unprintable characters:
$ set -x
$ x=$'\e\001'
$ echo "$x"
+ echo $'\x1b\x01'