Why Sponsor Oils? | blog | oilshell.org
We plan to write a JSON parser as part of "deconstructing and augmenting JSON".
One problem is that JSON only allows Unicode character escapes with 4 hex digits, like \u03bc for μ. It doesn't allow \U00abcdef (8 hex digits) or \u{abcdef} (1 to 6 hex digits), as Python and modern JavaScript do.
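For concreteness, here's how the two escape lengths look in Python (the \u{...} form is JavaScript syntax, so it's not shown):

```python
# Python accepts both the 4-digit and 8-digit escape forms:
print('\u03bc')      # 4 hex digits: μ (JSON has this form)
print('\U0001f926')  # 8 hex digits: 🤦 (no JSON equivalent)
```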
So how do you express a code point of 2^16 = 65,536 or greater in JSON?
Let's use U+1F926, aka \u{1f926}, aka 🤦, as a concrete example. What are PPPP and QQQQ in this code?
>>> json_str = r'"\uPPPP\uQQQQ"' # fill in the correct values
>>> print(json.loads(json_str))
🤦
This post shows how to manually calculate this "surrogate pair" in Python. Together they denote one "character", not two.
I also discuss consequences of this wart, OS and language history, and what it means for Oils.
Wikipedia's UTF-8 page helped me write an encoder-decoder a few years ago, so let's go there. This is a concise description:
But I had trouble transcribing it to Python: the ordering is fiddly, and I misread the xxx and yyy bit masks.
Here's what I did, without the 20 minutes of mistakes:
(1) First, it's easy to compute 0x1f926 - 0x10000 = 0xf926 without Python.
(2) Then apply the bit masks. At first, I didn't notice that they are 10 bits long; writing them with _ separators in groups of 5 makes that clearer.
$ python3
# least significant 10 bits of 20
>>> 0xf926 & 0b11111_11111
294
# most significant 10 bits of 20
>>> (0xf926 & 0b11111_11111_00000_00000) >> 10
62
(3) Then put each value in the surrogate pair range, by adding the special "base" values 0xd800 and 0xdc00:
>>> hex(0xd800 + 62)
'0xd83e'
>>> hex(0xdc00 + 294)
'0xdd26'
The resulting code points are guaranteed not to represent real Unicode characters. In other words, surrogate values occupy a reserved, disjoint part of the code point space.
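A quick way to see that reservation is with the standard unicodedata module:

```python
import unicodedata

# Code points in the surrogate range U+D800..U+DFFF have no character
# assigned, so looking up a name fails:
try:
    unicodedata.name(chr(0xD83E))
except ValueError as e:
    print(e)  # no such name
```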
(4) Now we have our answer:
>>> import json
>>> json_str = r'"\ud83e\udd26"'
>>> print(json.loads(json_str))
🤦
That is, PPPP = d83e and QQQQ = dd26.
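Steps (1) through (3) can be combined into a small helper; this is just a sketch, with a function name I made up:

```python
def to_surrogate_pair(cp):
    """Split a code point above 0xFFFF into a UTF-16 surrogate pair."""
    assert cp > 0xFFFF
    offset = cp - 0x10000             # step (1)
    high = 0xD800 + (offset >> 10)    # steps (2)+(3): most significant 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # least significant 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F926)
print(hex(high), hex(low))  # 0xd83e 0xdd26
```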
I also wanted to understand the raw bytes on the wire. My first attempts were wrong, since again the ordering is fiddly.
It's easiest to copy the \uabcd pairs to \xab \xcd bytes in order, and decode them as big endian. The b prefix in Python 3 denotes a bytes object, and decode() returns a string object:
>>> b'\xd8\x3e\xdd\x26'.decode('utf-16-be')
'🤦'
Then swap each pair of bytes (not surrogates) for the more common little endian:
>>> b'\x3e\xd8\x26\xdd'.decode('utf-16-le')
'🤦'
On my machine, utf-16 behaves like utf-16-le.
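To double-check the byte ordering, we can go the other direction: encode the emoji and compare against the bytes assembled by hand.

```python
s = '\U0001f926'  # 🤦
print(s.encode('utf-16-be').hex(' '))  # d8 3e dd 26
print(s.encode('utf-16-le').hex(' '))  # 3e d8 26 dd
```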
Here's another quirk. Even though JSON has only UTF-16-like \uabcd escapes, potentially paired, encoded JSON is specified to be UTF-8!
For example, this is valid JSON:
{"Literal UTF-8": "🤦"}
You don't have to write it like this:
{"ASCII-only encoding": "\ud83e\udd26"}
On the other hand, this is invalid because the entire message isn't valid UTF-8:
{"invalid": <bytes for 0xd83e> }
But this is valid, because JSON syntax is ignorant of the surrogate range, and of surrogate pairs:
# doesn't represent ANY character, but is valid!
{"valid": "\ud83e"}
So here's an interesting conclusion: the set of strings that JSON can express corresponds to neither:

- the set of valid Unicode strings, because JSON escapes can denote unpaired surrogates, nor
- the set of byte strings used in Unix, because JSON messages must be valid UTF-8. In contrast, syscalls like read() return arbitrary bytes; paths are NUL-terminated bytes, etc.

Let's use Python to see what that means concretely:
>>> json_str = r'"\ud83e"' # first code unit only
>>> s = json.loads(json_str) # successfully decoded!
>>> print(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character ...
'\ud83e' in position 0: surrogates not allowed
The data was successfully decoded, but you can't print it, because it's not a valid character.
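If you do need to smuggle the lone surrogate through an encoder, Python's 'surrogatepass' error handler will emit the (invalid) UTF-8 bytes for it:

```python
import json

s = json.loads(r'"\ud83e"')
print(hex(ord(s)))  # 0xd83e -- the decoded value is really there
# 'surrogatepass' encodes it anyway, producing bytes that a strict
# UTF-8 decoder will reject:
print(s.encode('utf-8', errors='surrogatepass').hex(' '))  # ed a0 be
```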
As another data point, the node.js interpreter chooses to print the � replacement char instead of raising an exception:
$ nodejs
> json_str = '"\\ud83e"'
'"\\ud83e"'
> decoded = JSON.parse(json_str)
'�'
Either way, this is a bad property! It means that JSON can carry silent errors over the wire, between processes, like "\ud83e".
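Python's json module will serialize the lone surrogate right back, so the invalid value survives a full round trip between processes:

```python
import json

s = json.loads(r'"\ud83e"')  # decodes without complaint
print(json.dumps(s))         # re-encodes without complaint, too
```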
This is really the tip of an iceberg. I'm working on another demo: Can the Filename \xff Be JSON-Piped Between Python and JavaScript?
Someone recently asked:
Why is text such a shitshow?
Cuneicode, and the Future of Text in C (thephd.dev via Hacker News)
The short story is that Ken Thompson invented UTF-8 for Plan 9 in 1993, but this was slightly too late for Windows to adopt it.
Instead, Windows adopted the incomplete UCS-2 encoding, which had to be upgraded with surrogate pairs, giving UTF-16.
The Tragedy of UCS-2 (unascribed.com via Hacker News)
Java and JavaScript appeared in the 90's, when Windows was overwhelmingly dominant, so they inherited UTF-16-centric design. JavaScript then infected JSON (2001).
Windows also infected Python! Python isn't UTF-16-centric like Java and JavaScript, but juggling encodings caused two decades of implementation pain. Contrary to popular belief, the introduction of Python 3 was less than half of the story.
I may write up this history separately, but for now, here is a detailed description of the immense complexity:
And six great blog posts by CPython developer Victor Stinner, ending with
The third post in the series begins:
Between Python 3.0 released in 2008 and Python 3.4 released in 2014, the Python filesystem encoding changed multiple times. It took 6 years to choose the best Python filesystem encoding on each platform.
But the story isn't over!
Windows also took steps toward UTF-8, starting with Windows 10 in 2019:
At least some teams have acknowledged that UTF-8 is better than UTF-16, as well as fundamental problems with the Windows design:
By operating in UTF-8, you can ensure maximum compatibility in international scenarios and data interchange with minimal effort and test burden.
Windows operates natively in UTF-16 (or WCHAR), which requires code page conversions by using MultiByteToWideChar and WideCharToMultiByte. This is a unique burden that Windows places on code that targets multiple platforms. Even more challenging is the Windows concept ...
That burden was placed on CPython for two decades, and still is!
I started this post while justifying the YSH design with ideas from #software-architecture: Narrow Waists Can Be Interior or Exterior: PyObject vs. Unix Files.
Key idea: Even though YSH is Python-influenced, the narrow waist is still exterior files, not interior data structures.
The natural conclusion is then:
If the power of Python is in PyObject, then the power of Oils will be its data languages. To improve shell, we can't just change its code (the language design); we also have to change its data.
Our solution is "J8 Notation", a set of languages for strings, records, and tables based on JSON. They're designed with correctness, compatibility, and composition in mind. I mentioned this in the Sketches of YSH Features, and future posts will go into detail.
Feel free to ask questions in the comments, which are now on Zulip!
This post was inspired by:
Why does the farmer emoji have a length of 7 in JavaScript? (evanhahn.com via lobste.rs)
which links to a post on hsivonen.fi.

Note: both posts focus on a grapheme cluster, which is a sequence of code points. This post deals with just a single code point.