Git Log in HTML: A Harder Problem and A Safe Solution

2017-09-29

This is an update to How to Quickly and Correctly Generate a Git Log in HTML.

I simplified the problem to make it easy to explain, but some reader comments indicate that I oversimplified it. So here I pose a harder problem that requires arbitrary text.

To be brief, the solution is like the one in the last post, except:

Use git's %x00 to insert NUL bytes, because bash can't do this.
For adversarial input, check the number of NUL bytes before escaping.

Read on for details and the justification.

Table of Contents

Recap of the Argument

A Harder Variant of the Problem

The Worst Part of Shell: Pushing the Problem Around

A Secure and Maintainable Solution

Note 1: Bash Strings Can't Contain NUL

Note 2: Searching for NUL Using Point-Free Style

Lessons for Oil

How Tools Should Integrate with Oil

Conclusion

Recap of the Argument

The previous article had multiple goals:

To explain a practical trick for escaping HTML in shell. I used this in real life!
To make a point about programming style: I compared the naive vs. the pedantic solution.
To explore a design space for the Oil shell. In particular, how should we communicate structured data between processes?

Regarding point #2, some of the comments missed the forest for the trees, so let me repeat the argument here:

There is a naive way to solve this problem (the quick-but-unsafe way). A reader argued for ignoring escaping until you need it, so this is not a strawman.
There is a pedantic way (the correct-but-clunky way). This also isn't a strawman. Another reader said they implemented something similar in Haskell using libgit2, but later switched to shell because of dependency problems (which was the issue I predicted).
I presented a middle-ground solution, which is both quick and correct (for text). I used the trick of surrounding user-supplied data with 0x01 and 0x02 bytes, and then a single re.sub() invocation in Python can escape those demarcated substrings.
I think the middle-ground solution is the best one, but it also has drawbacks. How can we address these issues in the Oil shell?

A Harder Variant of the Problem

The harder problem includes more fields, including the full git commit description. Here's an excerpt of example output.

4c353d1	Wed Sep 20 00:44:45 2017 -0700	Andy Chu
	Simplify, and add a git-specific solution.
a089617	Tue Sep 19 11:23:33 2017 -0700	Andy Chu
	Modify example: Use bash $'' because it's not specific to git. It can be used with other tools too. 0x01 and 0x02 so as not to confuse the issue of NUL in bash strings

The Worst Part of Shell: Pushing the Problem Around

Although it's my fault for oversimplifying the problem, I think many of the alternative solutions given on Reddit, Lobsters, and Hacker News illustrate two of the worst parts of shell.

(1) The first problem is that a new, possibly buggy, solution is constructed for every text processing problem, depending on the format of the input data. For example, various solutions assumed:

knowledge of particular fields, e.g. the git commit hash looks like [0-9a-f].
that only one field contained spaces.
that there are no newlines in any field.

It would be better to have a single solution that works for a wide range of problems.

(2) These kinds of assumptions may be valid in certain contexts, but the solutions don't check them. In other words, data validation and error checking are ignored.

If assumptions are violated, the script might keep chugging until it fails later, or it might appear to succeed while producing bad output, in which case your end users are left to cope with it.

These are reasons that shell scripts have a reputation for being difficult and unreliable.

A Secure and Maintainable Solution

In section 4 of the last post, I admitted that I also pushed the problem around. Instead of avoiding the HTML characters < and >, we now avoid 0x01 and 0x02. Those bytes are unlikely, but they're trivial to insert in an adversarial context: git commit -m $'\x01'.
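To make the weakness concrete, here is a minimal Python sketch of the 0x01/0x02 trick and of how a hostile commit message defeats it. Only the single re.sub() idea comes from the last post; the helper name escape_marked and the sample rows are made up for illustration.

```python
import html
import re

# Hypothetical helper: escape whatever sits between a 0x01 byte and the
# next 0x02 byte, and drop the two marker bytes.
MARKED = re.compile(r'\x01(.*?)\x02', re.DOTALL)

def escape_marked(s):
    return MARKED.sub(lambda m: html.escape(m.group(1)), s)

# Benign input: the description is escaped as intended.
benign = '<td>\x01fix a < b\x02</td>'
print(escape_marked(benign))

# Hostile input: the description itself contains 0x02, so the marked
# region ends early and the rest of the description is emitted raw.
hostile = '<td>\x01oops\x02<script>alert(1)</script>\x02</td>'
print(escape_marked(hostile))
```

The second call leaves the script tag unescaped, which is exactly the kind of failure that goes unnoticed until someone exploits it.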

I'm still pushing the problem around, but I argue that there's a uniquely desirable place to push it to. Now I use a pair of NUL bytes for delimiters, so that:

The only restriction we place on the untrusted input is that it can't contain NUL bytes (0x00). This means we handle all UTF-8 text.

The new solution is in demo-multiline.sh in the oilshell/blog-code repository.

You can run it like this:

~/git/blog-code/git-changelog$ ./demo-multiline.sh write-file
Wrote git-log-multiline.html

See git-log-multiline.html.

Here is an excerpt:

echo '<table>'

# git inserts NUL bytes when you specify %x00
local num_fields=2  # 2 fields contain arbitrary text
local format='<tr>
  ...
  <td> %x00%an%x00 </td>
  ...
  <td>%x00%B%x00</td>
</tr>
'

# Write raw data to a file
local num_entries=5
git log -n $num_entries --pretty="format:$format" > tmp.bin

# Check the raw data for the right number of NULs
expect-nul-count $((num_entries * num_fields * 2)) < tmp.bin

# Escapes text between pairs of NULs, writing HTML to stdout.
escape-segments < tmp.bin

echo '</table>'
}
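Here is a rough Python rendering of the last two steps in the excerpt. It's a sketch rather than the code in the repository: the function names mirror expect-nul-count and escape-segments, the bodies are an illustrative reimplementation, and raw is made-up data standing in for tmp.bin.

```python
import html
import re

def expect_nul_count(data, expected):
    """Raise if the raw bytes don't contain exactly `expected` NULs."""
    actual = data.count(b'\0')
    if actual != expected:
        raise ValueError('Expected %d NULs, got %d' % (expected, actual))

# Text between a pair of NULs; DOTALL lets a segment span newlines.
SEGMENT = re.compile(rb'\x00(.*?)\x00', re.DOTALL)

def escape_segments(data):
    """HTML-escape each NUL-delimited segment and drop the NULs."""
    def repl(m):
        text = m.group(1).decode('utf-8')
        return html.escape(text).encode('utf-8')
    return SEGMENT.sub(repl, data)

# Made-up stand-in for tmp.bin: 1 entry, 2 fields, so 1 * 2 * 2 = 4 NULs.
raw = b'<tr><td>\0A & B\0</td><td>\0line <1>\nline 2\0</td></tr>\n'
expect_nul_count(raw, 1 * 2 * 2)
print(escape_segments(raw).decode('utf-8'))
```

Counting first matters: if a description smuggled in an extra NUL, the pairs would no longer line up, and the count check is what turns that into an error instead of bad HTML.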

For the case of generating release notes, I wouldn't bother checking for NUL bytes.

But if I don't trust my input, it's easy to check. I don't think git change descriptions can contain NUL characters, but there's no guarantee.

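Git is written in C, where a string ends at the first NUL, but other implementations need not share that limit. C++ strings do allow internal NULs, and in an adversarial context, it's not wise to rely on implementation details that may change.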

Note 1: Bash Strings Can't Contain NUL

Notice that we're now using a git feature (%x00) rather than a bash feature ($'\x00'). This is because bash is written in C and truncates the string at the first NUL:

$ mystr=$'abc\x00def'
$ echo ${#mystr}  # length should be 7, not 3
3

Compare with Python:

>>> mystr = 'abc\0def'
>>> print(len(mystr))
7

Note 2: Searching for NUL Using Point-Free Style

Even if bash supported internal NULs, you still couldn't grep for them like this:

$ grep $'\x00' myfile  # matches the empty string on every line

That's because the kernel is also written in C, and the items in the argv array passed to main() are NUL-terminated strings.
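The same limit is visible from Python, which allows NULs inside its own strings but refuses to pass one through argv. A small sketch, assuming CPython 3; the helper name can_pass_as_arg is made up for illustration:

```python
import subprocess
import sys

def can_pass_as_arg(arg):
    """Return True if `arg` can be handed to a child process via argv."""
    try:
        # Run a do-nothing Python child; the extra argument is ignored.
        subprocess.run([sys.executable, '-c', 'pass', arg], check=True)
        return True
    except ValueError:
        # CPython rejects arguments with an embedded NUL before exec().
        return False

print(can_pass_as_arg('abcdef'))
print(can_pass_as_arg('abc\0def'))
```

So a language that stores NULs correctly still has to send them over a pipe or a file, not on the command line.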

So instead I convert the file to hex digits with od, then grep and count lines:

count-nul() {
  # -v makes od print every line instead of collapsing repeats into '*'.
  # -o puts every match on its own line.  (grep -o -c doesn't work.)
  od -v -A n -t x1 | grep -o '00' | wc -l
}

Aside: this function is written in point-free style. That means it doesn't mention variables, constants, or any other kind of data. I wrote about this in Pipelines Support Vectorized, Point-Free, and Imperative Style.

Lessons for Oil

This solution isn't perfect, but that gives me things to think about for the Oil language:

(1) Oil should be able to store NUL bytes in strings. This is trivial because Oil is currently based on the Python VM.

(2) Oil should let you directly write loops that check for NUL. The od ... | grep ... | wc -l solution is clever, but I'd rather be straightforward and efficient.

Maybe something like:

proc expect-nul-count -- num_expected {
  var n = 0  # integer variable
  while (true) {
    var s = read(stdin, 4096)
    if (s == '') { break }  # stop at EOF, assuming read() returns '' there
    set n += s.count('\0')  # .count() is borrowed from Python
  }
  test $n -eq $num_expected || die "Expected $num_expected NULs"
}
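For comparison, this is roughly what that loop looks like in Python today, reading fixed-size chunks so the whole file never has to fit in memory. The function name follows the shell version, and the 4096-byte chunk size is the same arbitrary choice.

```python
import io
import sys

def expect_nul_count(f, num_expected, chunk_size=4096):
    """Count NUL bytes in binary stream `f`; exit if the count is off."""
    n = 0
    while True:
        chunk = f.read(chunk_size)
        if not chunk:  # empty bytes means end of file
            break
        n += chunk.count(b'\0')
    if n != num_expected:
        sys.exit('Expected %d NULs, got %d' % (num_expected, n))
    return n

# Usage with an in-memory stream; a script would pass sys.stdin.buffer.
print(expect_nul_count(io.BytesIO(b'a\0b\0'), 2))
```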

(3) Left-to-right syntax would be nice. Instead of,

expect-nul-count $n < tmp.bin
escape-segments < tmp.bin

perhaps:

open tmp.bin -- expect-nul-count $n 
open tmp.bin -- escape-segments

This can also be done with cat, but some people don't like the useless use of cat.

How Tools Should Integrate with Oil

I had originally thought that a broader ecosystem around the Oil shell would need its own implementation of tools like ls and ps. This is because I want to transfer structured data over pipes.

But after thoroughly analyzing this problem, I think that the existing convention of using NUL bytes suffices for nearly all shell problems. It's wise to avoid boiling the ocean.

Although bash doesn't properly support NUL bytes, their use is already an informal convention in shell:

git log supports them with %x00.
find in findutils supports them with -print0 and -printf "%s\0%P\0".

On the other side:

xargs -0 splits its input with NUL bytes
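On the consuming side, a program in a language with real byte strings can split such output directly. A minimal sketch in Python, where split_nul_records is a made-up helper and the sample bytes imitate what find -print0 would emit:

```python
def split_nul_records(data):
    """Split NUL-terminated records, as produced by e.g. find -print0."""
    if not data:
        return []
    if not data.endswith(b'\0'):
        raise ValueError('last record is not NUL-terminated')
    # Drop the final terminator, then split on the remaining NULs.
    return data[:-1].split(b'\0')

# File names may contain spaces and even newlines, but never NUL.
sample = b'./a b.txt\0./weird\nname.txt\0'
for name in split_nul_records(sample):
    print(repr(name))
```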

However, shell should also be able to handle binary data with arbitrary bytes. Possible solutions:

Pass the path of a temp file that contains the data. The receiver calls open() and read().
Use the simple length-prefixed netstring format by DJB.
base64-encode the data.
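Of these, the netstring option is the least familiar, so here is a small Python sketch of it. The wire format, a decimal length, a colon, the payload, and a trailing comma, is DJB's; the function names are hypothetical.

```python
def netstring_encode(payload):
    """Encode bytes as a netstring: <decimal length>:<payload>,"""
    return str(len(payload)).encode('ascii') + b':' + payload + b','

def netstring_decode(data):
    """Decode one netstring from the front of `data`.

    Returns (payload, rest), where rest is whatever follows the comma.
    """
    length_str, sep, rest = data.partition(b':')
    if not sep or not length_str.isdigit():
        raise ValueError('bad netstring length')
    n = int(length_str)
    if len(rest) < n + 1 or rest[n:n + 1] != b',':
        raise ValueError('bad netstring payload')
    return rest[:n], rest[n + 1:]

# The payload may contain any byte, including NUL.
encoded = netstring_encode(b'abc\0def')
print(encoded)
print(netstring_decode(encoded))
```

Because the length comes first, the receiver never has to scan the payload for a delimiter, which is why no byte value is off limits.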

I'm not making any strong recommendations for external tools right now, but feel free to leave comments if there's anything I'm missing.

Conclusion

In this post, I posed a more general HTML escaping problem, and gave a simple and secure solution. If you find a problem with it, leave a comment.

This problem was worth thinking about because it made me realize that very little in the shell ecosystem needs to change in order to support structured data over pipes.

Tools can provide JSON or CSV output if they want. Oil will be able to understand those formats. But the minimum set of mechanisms we need is:

Tools that use printf-style format strings should support something like %x00. This should be an easy change to make.
The Oil shell should be able to store NUL bytes in the middle of strings.

Except for aesthetics, that's it. Oil may provide optional metaprogramming libraries to do transformations like this:

# The format string is lazily evaluated here, in the context of a
# record.  We should also do auto-escaping, as in JavaScript
# ES6 template strings.
git-log-wrapper | format '<td>$hash</td> <td>$description</td>'
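As a rough idea of what such a library could do, here is a Python sketch that fills a template from a record and escapes every field on the way in. Everything in it is hypothetical: format_record and the sample record are invented for illustration, and string.Template merely stands in for whatever syntax Oil ends up with.

```python
import html
from string import Template

def format_record(template, record):
    """Substitute $name fields, HTML-escaping every value first."""
    escaped = {k: html.escape(v) for k, v in record.items()}
    return Template(template).substitute(escaped)

# Invented record; a wrapper around git log would produce these.
record = {'hash': '4c353d1', 'description': 'Simplify <b> & more'}
print(format_record('<td>$hash</td> <td>$description</td>', record))
```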

But that's a topic for later.