Warning: Work in progress! Leave feedback on Zulip or GitHub if you'd like this doc to be updated.
Oils is UTF-8 centric, unlike bash and other shells.
That is, its Unicode support is like that of Go, Rust, Julia, and Swift, as opposed to JavaScript and Python (despite its Python heritage). The former languages use UTF-8, and the latter have the notion of "multibyte characters".
Shell programs should be encoded in UTF-8 (or its ASCII subset). Unicode characters can be encoded directly in the source:
echo 'μ'
or denoted in ASCII with C-escaped strings:
echo $'[\u03bc]'
(Such strings are preferred over echo -e because they're statically parsed.)
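To see that a literal character and a C-escaped string can denote the same bytes, here is a minimal sketch. It assumes bash (or another shell with $'...' strings), a source file saved as UTF-8, and the standard od and tr utilities. hex_of is a hypothetical helper, not part of Oils. The sketch uses \x byte escapes rather than \u so the result doesn't depend on the locale.

```shell
# The same character, once as a literal and once as ASCII-only byte escapes.
literal='μ'
escaped=$'\xce\xbc'   # the two UTF-8 bytes that encode U+03BC

# Print the bytes of a string as hex digits (od dumps, tr strips padding).
hex_of() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

hex_of "$literal"; echo
hex_of "$escaped"; echo
```

Both calls should print the same hex digits, since the shell stores the two strings as identical byte sequences.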
Strings in OSH are arbitrary sequences of bytes, which may be valid UTF-8. Details:
- Strings passed to external programs are truncated at the first NUL ('\0') byte. This is a consequence of how Unix and C work.
- ${#s} and slicing ${s:1:3} require the string to be valid UTF-8. Decoding errors are fatal if shopt -s strict_word_eval is on.

- ${#s} -- length in code points (buggy in bash)
  - Note: YSH len(s) counts bytes.
- ${s:1:2} -- offsets in code points

Globs have character classes [^a] and ?.
This is a glob() call:
echo my?glob
These glob patterns are fnmatch() calls:
case $x in ?) echo 'one char' ;; esac
[[ $x == ? ]]
${s#?} # remove one character prefix, quadratic loop for globs
This uses our glob to ERE translator for position info:
echo ${s/?/x}
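What ? treats as "one character" in these matches can depend on the locale. Here is a minimal sketch, assuming bash and a source file saved as UTF-8. match_len is a hypothetical helper. Forcing the C locale makes bash match byte by byte, so the two-byte encoding of μ needs two ? wildcards.

```shell
# Report how many ? wildcards are needed to match the whole string.
match_len() {
  case $1 in
    ?)  echo 1 ;;
    ??) echo 2 ;;
    *)  echo other ;;
  esac
}

# Force the C locale inside a subshell so the setting doesn't leak out.
# There, matching is byte-wise, and 'μ' is two bytes in UTF-8.
( LC_ALL=C; match_len 'μ' )   # prints 2 in bash
```

In a UTF-8 locale, the same call would typically print 1, because the pattern matcher then decodes μ as a single character.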
Regexes have character classes [^a] and . (dot).
- [[ $x =~ $pat ]] where pat='.'
- printf '%d' \'c where c is an arbitrary character. This is an obscure syntax for ord(), i.e. getting an integer from an encoded character.

Locale-aware operations:
- printf also formats time.

Other:
- wcswidth(), which doesn't just count code points. It calculates the display width of characters, which is different in general.

- mystr ~ / [ \xff ] /
- case (x) { / dot / }
- for offset, rune in (mystr) decodes UTF-8, like Go
- Str.{trim,trimLeft,trimRight} respect Unicode space, like JavaScript does
- Str.{upper,lower} also need Unicode case folding
- split() respects Unicode space?
- \u{123456} in j"" strings
- The iconv program converts text from one encoding to another.

Unlike bash and CPython, Oils doesn't call setlocale(). (Although GNU readline may call it.)
It's expected that your locale will respect UTF-8. This is true on most distros. If not, then some string operations will support UTF-8 and some won't.
For example:
- ${#s} is implemented in Oils code, not libc, so it will always respect UTF-8.
- [[ $s =~ $pat ]] is implemented with libc, so it is affected by the locale settings. Same with Oils (x ~ pat).

TODO: Oils should support LANG=C for some operations, but not LANG=X for other X.
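To make the difference between counting bytes and counting code points concrete, here is a sketch that does both without consulting the locale at all. It is not how Oils implements ${#s}. byte_len and codepoint_len are hypothetical helpers, and the code assumes valid UTF-8 input, a UTF-8 encoded source file, and the standard od, wc, and tr utilities.

```shell
# Number of bytes in the string, independent of locale.
byte_len() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

# Number of code points, assuming the input is valid UTF-8.
# Continuation bytes are 0x80..0xBF (128..191); every other byte
# starts a new code point, so we count only those.
codepoint_len() {
  local n=0 b
  for b in $(printf '%s' "$1" | od -An -v -tu1); do
    if [ "$b" -lt 128 ] || [ "$b" -gt 191 ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

byte_len '[μ]'        # prints 4
codepoint_len '[μ]'   # prints 3
```

The two results differ because μ takes two bytes in UTF-8 while the brackets take one byte each.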