How to Create a UTF-16 Surrogate Pair by Hand, with Python

2023-06-15

We plan to write a JSON parser as part of "deconstructing and augmenting JSON".

One problem is that JSON only allows Unicode character escapes with 4 hex digits, like \u03bc for μ.

It doesn't allow \U00abcdef (8 hex digits) or \u{abcdef} (1 to 6 hex digits), as Python and modern JavaScript do.

So how do you express a code point of 2¹⁶ = 65,536 or greater in JSON?

Let's use U+1f926 aka \u{1f926} aka 🤦 as a concrete example. What are PPPP and QQQQ in this code?

>>> json_str = r'"\uPPPP\uQQQQ"'  # fill in the correct values
>>> print(json.loads(json_str))
🤦

This post shows how to manually calculate this "surrogate pair" in Python. Together, the two escapes denote one "character", not two.

I also discuss consequences of this wart, OS and language history, and what it means for Oils.

Table of Contents

Python Demo

Quirks

UTF-16 can be little- or big-endian

Encoded JSON can be and must be UTF-8

Valid JSON strings != Valid Unicode strings, or all bytes

History: Windows Infected JavaScript, JSON, and Python

Future: Windows and Python Are Moving Toward UTF-8

Conclusion

Oils Should Fix Text, Not Just Fix Shell!

Appendix: Links to More Examples

Python Demo

Wikipedia's UTF-8 page helped me write an encoder-decoder a few years ago, so let's go to Wikipedia again. Its UTF-16 page has a concise description:

https://en.wikipedia.org/wiki/UTF-16

But I had trouble transcribing it to Python: the ordering is fiddly, and I misread the xxx and yyy bit masks.

Here's what I did, without the 20 minutes of mistakes:

(1) First, it's easy to compute 0x1f926 - 0x10000 = 0xf926 without Python.

(2) Then apply the bit masks. At first, I didn't notice that they are 10 bits long. Writing them with _ separators in groups of 5 makes that clearer.

$ python3

# least significant 10 bits of 20
>>> 0xf926 & 0b11111_11111
294

# most significant 10 bits of 20
>>> (0xf926 & 0b11111_11111_00000_00000) >> 10
62

(3) Then put each value in the surrogate pair range, with the special 0xd800 and 0xdc00 "base" values:

>>> hex(0xd800 + 62)
'0xd83e'

>>> hex(0xdc00 + 294)
'0xdd26'

The resulting code points are guaranteed not to represent real Unicode characters. In other words, surrogate values occupy a reserved, disjoint part of the code point space.

(4) Now we have our answer:

>>> import json

>>> json_str = r'"\ud83e\udd26"'
>>> print(json.loads(json_str))
🤦

That is, PPPP = d83e and QQQQ = dd26.
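Steps (1) through (4) can be folded into a pair of small functions. This is only a sketch: the names to_surrogate_pair and from_surrogate_pair are hypothetical helpers for this post, not part of the json module or any other library.

```python
import json

def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is outside the range that needs a pair')
    offset = code_point - 0x10000              # step 1: a 20-bit value
    high = 0xD800 + (offset >> 10)             # steps 2-3: top 10 bits
    low = 0xDC00 + (offset & 0b11111_11111)    # steps 2-3: bottom 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Inverse: combine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a valid surrogate pair')
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x1F926)
json_str = '"\\u%04x\\u%04x"' % (high, low)    # step 4: build the JSON escapes
print(json_str)                                # "\ud83e\udd26"

# The JSON parser combines the two escapes into a single code point.
decoded = json.loads(json_str)
assert len(decoded) == 1 and ord(decoded) == 0x1F926
```

The range checks matter because the arithmetic silently produces nonsense for code points below U+10000, which never need a pair.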

Quirks

UTF-16 can be little- or big-endian

I also wanted to understand the raw bytes on the wire. My first attempts were wrong, since again the ordering is fiddly.

It's easiest to copy the \uabcd pairs to \xab \xcd bytes in order, and decode it as big endian. The b prefix in Python 3 denotes a bytes object, and decode() returns a string object:

>>> b'\xd8\x3e\xdd\x26'.decode('utf-16-be')
'🤦'

Then swap the two bytes within each 16-bit code unit (not the surrogates themselves) for the more common little endian:

>>> b'\x3e\xd8\x26\xdd'.decode('utf-16-le')
'🤦'

On my machine, utf-16 behaves like utf-16-le.
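To avoid swapping bytes by hand, the struct module can lay out the two 16-bit code units in either byte order. This is a sketch that assumes the surrogate values computed earlier. It also shows that a byte order mark (BOM) lets the plain utf-16 codec choose the order itself.

```python
import struct

high, low = 0xD83E, 0xDD26  # the surrogate pair computed earlier

# '>' is big-endian, '<' is little-endian; 'H' is an unsigned 16-bit integer.
big = struct.pack('>HH', high, low)
little = struct.pack('<HH', high, low)

print(big.hex())     # d83edd26
print(little.hex())  # 3ed826dd

# Both byte strings decode to the same single code point, U+1F926.
assert big.decode('utf-16-be') == little.decode('utf-16-le') == '\U0001F926'

# With a BOM in front, the plain 'utf-16' codec picks the byte order itself.
assert (b'\xfe\xff' + big).decode('utf-16') == '\U0001F926'
assert (b'\xff\xfe' + little).decode('utf-16') == '\U0001F926'
```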

Encoded JSON can be and must be UTF-8

Here's another quirk. Even though JSON has only UTF-16-like \uabcd escapes, potentially paired, encoded JSON is specified to be UTF-8!

For example, this is valid JSON:

{"Literal UTF-8": "🤦"}

You don't have to write it like this:

{"ASCII-only encoding": "\ud83e\udd26"}

On the other hand, this is invalid because the entire message isn't valid UTF-8:

{"invalid": <bytes for 0xd83e> }

But this is valid, because JSON syntax is ignorant of the surrogate range, and of surrogate pairs:

# doesn't represent ANY character, but is valid!
{"valid": "\ud83e"}
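The four cases above can be checked in Python. This is a sketch, not a conformance test: the key name "k" is arbitrary, and the bytes ed a0 be are what a UTF-8-style encoding of the lone code point 0xd83e would look like, which a strict UTF-8 decoder refuses.

```python
import json

FACEPALM = '\U0001F926'  # same code point as the emoji in this post

# Literal UTF-8 and the escaped surrogate pair decode to the same string.
literal = ('{"k": "%s"}' % FACEPALM).encode('utf-8')
escaped = b'{"k": "\\ud83e\\udd26"}'
assert json.loads(literal) == json.loads(escaped) == {'k': FACEPALM}

# json.dumps() writes the ASCII-only form by default.
# ensure_ascii=False keeps the literal character instead.
assert json.dumps(FACEPALM) == '"\\ud83e\\udd26"'
assert json.dumps(FACEPALM, ensure_ascii=False) == '"%s"' % FACEPALM

# A strict UTF-8 decoder rejects the bytes a lone surrogate would need.
try:
    b'\xed\xa0\xbe'.decode('utf-8')
    rejected = False
except UnicodeDecodeError:
    rejected = True
print('lone surrogate bytes rejected as UTF-8:', rejected)

# But the escape for the same lone surrogate passes the JSON parser.
lone = json.loads(r'"\ud83e"')
assert len(lone) == 1 and ord(lone) == 0xD83E
```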

Valid JSON strings != Valid Unicode strings, or all bytes

So here's an interesting conclusion: the set of valid JSON strings corresponds to neither:

The set of valid Unicode strings.
- Abstractly, a string is a sequence of "Unicode scalars", which are code points not in the surrogate range.
The set of all byte strings.
- Unix APIs like read() return arbitrary bytes; paths are NUL-terminated bytes, etc.

Let's use Python to see what that means concretely:

>>> json_str = r'"\ud83e"'  # first code unit only

>>> s = json.loads(json_str)  # successfully decoded!

>>> print(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character ...
  '\ud83e' in position 0: surrogates not allowed

The data was successfully decoded, but you can't print it, because it's not a valid character.

As another data point, the Node.js interpreter chooses to print the � replacement char instead of raising an exception:

$ nodejs

> json_str = '"\\ud83e"'
'"\\ud83e"'

> decoded = JSON.parse(json_str)
'�'

Either way, this is a bad property! It means that JSON can denote silent errors traveling over the wire, between processes, like "\ud83e".
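A receiver that wants to catch such values early could scan decoded strings for code points in the surrogate range. This is a sketch with a hypothetical helper name, find_lone_surrogates. It relies on Python's parser combining valid pairs into one code point, so any surrogate left in the decoded string is unpaired.

```python
import json

def find_lone_surrogates(s):
    """Return indexes of code points in the surrogate range U+D800..U+DFFF."""
    return [i for i, ch in enumerate(s) if 0xD800 <= ord(ch) <= 0xDFFF]

good = json.loads(r'"\ud83e\udd26"')  # a pair: combined into U+1F926
bad = json.loads(r'"\ud83e"')         # lone surrogate: decodes "successfully"

print(find_lone_surrogates(good))  # []
print(find_lone_surrogates(bad))   # [0]

# Encoding to UTF-8 is another way to surface the problem before the
# data travels on to another process.
try:
    bad.encode('utf-8')
    encodable = True
except UnicodeEncodeError as e:
    encodable = False
    print('cannot encode:', e.reason)  # surrogates not allowed
```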

This is really the tip of an iceberg. I'm working on another demo: Can the Filename \xff Be JSON-Piped Between Python and JavaScript?

History: Windows Infected JavaScript, JSON, and Python

Someone recently asked:

Why is text such a shitshow?

The short story is that Ken Thompson invented UTF-8 for Plan 9 in 1992, but this was slightly too late for Windows to adopt it.

Instead, Windows adopted the incomplete UCS-2 encoding, which had to be upgraded with surrogate pairs, giving UTF-16.

Java and JavaScript appeared in the 90's, when Windows was overwhelmingly dominant, so they inherited UTF-16-centric design. JavaScript then infected JSON (2001).

Future: Windows and Python Are Moving Toward UTF-8

Windows also infected Python! Python isn't UTF-16-centric like Java and JavaScript, but juggling encodings caused two decades of implementation pain. Contrary to popular belief, the introduction of Python 3 was less than half of the story.

I may write up this history separately, but for now, here is a detailed description of the immense complexity:

Python behind the scenes #9: how Python strings work by Victor Skvortsov

And six great blog posts by CPython developer Victor Stinner, ending with

Python 3.7 UTF-8 Mode (2018)

The third post in the series begins:

Between Python 3.0 released in 2008 and Python 3.4 released in 2014, the Python filesystem encoding changed multiple times. It took 6 years to choose the best Python filesystem encoding on each platform.

But the story isn't over!

PEP 686 – Make UTF-8 mode default - This change has been accepted for Python 3.15, which hasn't been released yet.

Windows also took steps toward UTF-8, starting with Windows 10 in 2019:

Unicode in Microsoft Windows

At least some teams have acknowledged that UTF-8 is better than UTF-16, as well as fundamental problems with the Windows design:

By operating in UTF-8, you can ensure maximum compatibility in international scenarios and data interchange with minimal effort and test burden.

Windows operates natively in UTF-16 (or WCHAR), which requires code page conversions by using MultiByteToWideChar and WideCharToMultiByte. This is a unique burden that Windows places on code that targets multiple platforms. Even more challenging is the Windows concept ...

That burden was placed on CPython for two decades, and still is!

Conclusion

I started this post while justifying the YSH design with ideas from #software-architecture: Narrow Waists Can Be Interior or Exterior: PyObject vs. Unix Files.

Key idea: Even though YSH is Python-influenced, the narrow waist is still exterior files, not interior data structures.

The natural conclusion is then:

Oils Should Fix Text, Not Just Fix Shell!

If the power of Python is in PyObject, then the power of Oils will be its data languages. To improve shell, we can't just change its code (the language design), we also have to change its data.

Our solution is "J8 Notation", a set of languages for strings, records, and tables based on JSON. They're designed with correctness, compatibility, and composition in mind. I mentioned this in the Sketches of YSH Features, and future posts will go into detail.

Feel free to ask questions in the comments, which are now on Zulip!

Appendix: Links to More Examples

This post was inspired by:

which links to

It’s Not Wrong that “🤦🏼‍♂️”.length == 7, But It’s Better that “🤦🏼‍♂️”.len() == 17 and Rather Useless that len(“🤦🏼‍♂️”) == 5 (hsivonen.fi)

Note: both posts focus on a grapheme cluster, or a sequence of code points. This post deals with just a single code point.