Warning: Work in progress! Leave feedback on Zulip or GitHub if you'd like this doc to be updated.

Notes on Unicode in Shell

Table of Contents
Philosophy
A Mental Model
Program Encoding
Data Encoding
List of Unicode-Aware Operations in OSH / bash
Length / Slicing
Globs
Regexes (ERE)
More bash operations
YSH-Specific
Data Languages
Tips
Implementation Notes

Philosophy

Oils is UTF-8 centric, unlike bash and other shells.

That is, its Unicode support is like that of Go, Rust, Julia, and Swift, as opposed to JavaScript and Python (despite its Python heritage). The former languages use UTF-8, and the latter have the notion of "multibyte characters".

A Mental Model

Program Encoding

Shell programs should be encoded in UTF-8 (or its ASCII subset). Unicode characters can be encoded directly in the source:

echo 'μ'

or denoted in ASCII with C-escaped strings:

echo $'[\u03bc]'

(Such strings are preferred over echo -e because they're statically parsed.)
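As a quick check that a literal character and an ASCII spelling of it denote the same bytes, here is a minimal sketch. It assumes the script file itself is saved as UTF-8, and it uses printf octal escapes rather than \u so the comparison doesn't depend on the locale; the variable names are illustrative only.

```shell
# Literal character, stored in the script as the UTF-8 bytes 0xCE 0xBC.
literal='μ'

# The same two bytes spelled in pure ASCII with printf octal escapes.
escaped=$(printf '\316\274')

# [ = ] compares the strings byte for byte.
if [ "$literal" = "$escaped" ]; then
  same=yes
else
  same=no
fi
echo "literal and escaped forms match: $same"
```

In a UTF-8 setup, $'\u03bc' should yield these same two bytes.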

Data Encoding

Strings in OSH are arbitrary sequences of bytes, which may be valid UTF-8. Details:
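A minimal sketch of the byte-oriented view, assuming a POSIX printf and wc are available. It shows that a shell variable can hold both a valid UTF-8 sequence and a byte sequence that isn't valid UTF-8; count_bytes is a hypothetical helper, not part of OSH or bash.

```shell
# Hypothetical helper: count the bytes in a string, independent of locale.
count_bytes() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

# Valid UTF-8: the two-byte encoding of U+03BC (μ).
valid=$(printf '\316\274')

# Not valid UTF-8: a lone continuation byte. The shell still stores it.
invalid=$(printf '\274')

valid_bytes=$(count_bytes "$valid")
invalid_bytes=$(count_bytes "$invalid")

echo "valid string:   $valid_bytes bytes"
echo "invalid string: $invalid_bytes bytes"
```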

List of Unicode-Aware Operations in OSH / bash

Length / Slicing
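Assuming the length operation meant here is ${#s}, the sketch below shows why its result is locale-sensitive in bash. It pins the locale to C so the outcome is predictable. This illustrates bash behavior only; Oils may differ, since it is UTF-8 centric.

```shell
# The two bytes 0xCE 0xBC encode U+03BC (μ) in UTF-8.
s=$(printf '\316\274')

# Assumption: in the C locale the shell treats each byte as one character.
# This holds for bash; other shells may behave differently.
LC_ALL=C
len_in_c_locale=${#s}

echo "length under LC_ALL=C: $len_in_c_locale"
```

In bash this reports 2, the byte count. Under a UTF-8 locale, bash would be expected to report 1 for the same string.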

Globs

Globs have character classes like [^a] and the single-character wildcard ?.

This is a glob() call:

echo my?glob

These glob patterns are fnmatch() calls:

case $x in ?) echo 'one char' ;; esac
[[ $x == ? ]]
${s#?}  # remove one character prefix, quadratic loop for globs

This uses our glob to ERE translator for position info:

echo ${s/?/x}
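The fnmatch()-style operations above can be sketched with ASCII input, where the result doesn't depend on locale. The variable names are illustrative. For non-ASCII input such as μ, whether ? consumes a whole character or a single byte depends on whether the shell and libc treat the string as UTF-8.

```shell
# case: ? matches exactly one character.
x='a'
case $x in
  ?) kind='one char' ;;
  *) kind='other' ;;
esac
echo "$x -> $kind"

s='abc'

# ${s#?} strips one leading character (a prefix).
without_first=${s#?}

# ${s%?} strips one trailing character (a suffix).
without_last=${s%?}

echo "without first: $without_first"
echo "without last:  $without_last"
```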

Regexes (ERE)

Regexes have character classes like [^a] and the any-character metacharacter (.).
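A small sketch of these two constructs, assuming the regex operation meant is bash's [[ $x =~ pattern ]], which takes an ERE. ASCII input is used so the outcome doesn't depend on locale. With non-ASCII input, whether . spans a whole character typically depends on how the underlying regex library sees the locale.

```shell
# . matches any single character.
x='a'
if [[ $x =~ ^.$ ]]; then
  dot_result='match'
else
  dot_result='no match'
fi

# [^a] matches any single character except 'a'.
y='b'
if [[ $y =~ ^[^a]$ ]]; then
  negated_result='match'
else
  negated_result='no match'
fi

echo "'$x' against ^.\$    : $dot_result"
echo "'$y' against ^[^a]\$ : $negated_result"
```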

More bash operations

Locale-aware operations:

Other:

YSH-Specific

Data Languages

Tips

Implementation Notes

Unlike bash and CPython, Oils doesn't call setlocale(). (Although GNU readline may call it.)

It's expected that your locale will respect UTF-8. This is true on most distros. If not, then some string operations will support UTF-8 and some won't.

For example:

TODO: Oils should support LANG=C for some operations, but not LANG=X for other X.


Generated on Wed, 06 Dec 2023 01:04:01 -0500