This is an update to How to Quickly and Correctly Generate a Git Log in HTML.
I simplified the problem to make it easy to explain, but some reader comments indicate that I oversimplified it. So here I pose a harder problem that requires arbitrary text.
To be brief, the solution is like the one in the last post, except:
- We use git's %x00 to insert NUL bytes, because bash can't do this.
- We check that the input has the expected number of NUL bytes before escaping.
Read on for details and the justification.
The previous article had multiple goals:
Regarding point #2, some of the comments missed the forest for the trees, so let me repeat the argument here:
- The text fields are demarcated with 0x01 and 0x02 bytes, and then a single re.sub() invocation in Python can escape those demarcated substrings.
The harder problem includes more fields, including the full git commit description. Here's an excerpt of example output.
4c353d1 | Wed Sep 20 00:44:45 2017 -0700 | Andy Chu |
Simplify, and add a git-specific solution. |
||
a089617 | Tue Sep 19 11:23:33 2017 -0700 | Andy Chu |
Modify example:
|
Although it's my fault for oversimplifying the problem, I think many of the alternative solutions given on Reddit, Lobsters, and Hacker News illustrate two of the worst parts of shell.
(1) The first problem is that a new, possibly buggy, solution is constructed for every text processing problem, depending on the format of the input data. For example, various solutions assumed:
- That the commit hash contains only characters in [0-9a-f].
It would be better to have a single solution that works for a wide range of problems.
(2) These kinds of assumptions may be valid in certain contexts, but the solutions don't check them. In other words, data validation and error checking are ignored.
If assumptions are violated, the script might keep chugging until it fails later, or it might succeed, in which case your end users are left to cope with it.
These are reasons that shell scripts have a reputation for being difficult and unreliable.
In section 4 of the last post, I admitted that I also pushed the problem around. Instead of avoiding the HTML characters < and >, we now avoid 0x01 and 0x02. Those bytes are unlikely, but they're trivial to insert in an adversarial context: git commit -m $'\x01'.
I'm still pushing the problem around, but I argue that there's a uniquely desirable place to push it to. Now I use a pair of NUL bytes for delimiters, so that:
- The only bytes the text fields can't contain are NUL bytes (0x00). This means we handle all UTF-8 text.
The new solution is in demo-multiline.sh in the oilshell/blog-code repository.
You can run it like this:
~/git/blog-code/git-changelog$ ./demo-multiline.sh write-file
Wrote git-log-multiline.html
Here is an excerpt:
echo '<table>'
# git inserts NUL bytes when you specify %x00
local num_fields=2 # 2 fields contain arbitrary text
local format='<tr>
...
<td> %x00%an%x00 </td>
...
<td>%x00%B%x00</td>
</tr>
'
# Write raw data to a file
local num_entries=5
git log -n $num_entries --pretty="format:$format" > tmp.bin
# Check the raw data for the right number of NULs
expect-nul-count $((num_entries * num_fields * 2)) < tmp.bin
# Escapes text between pairs of NULs, writing HTML to stdout.
escape-segments < tmp.bin
echo '</table>'
}
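To make the escaping step concrete, here is a minimal Python sketch of what a function like escape-segments could do. It is not the code from the blog-code repository: the names escape_segments and SEGMENT_RE are made up for illustration, and it assumes the fields are UTF-8 text wrapped in pairs of NUL bytes, as in the format string above.

```python
import html
import re

# Hypothetical pattern: one NUL, any run of non-NUL bytes, one NUL.
SEGMENT_RE = re.compile(rb'\x00([^\x00]*)\x00')


def escape_segments(data: bytes) -> bytes:
    """HTML-escape the text between each pair of NULs and drop the NULs."""
    def replace(match):
        # Assumption: fields are UTF-8; undecodable bytes are replaced.
        text = match.group(1).decode('utf-8', errors='replace')
        return html.escape(text).encode('utf-8')
    return SEGMENT_RE.sub(replace, data)


if __name__ == '__main__':
    raw = b'<td>\x00Fix <b> & "quotes"\x00</td>'
    print(escape_segments(raw).decode('utf-8'))
```

The markup outside the NUL pairs passes through untouched, which is why the surrounding tags in the format string survive while the commit text is escaped.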
For the case of generating release notes, I wouldn't bother checking for NUL bytes.
But if I don't trust my input, it's easy to check. I don't think git change descriptions can contain NUL characters, but there's no guarantee. C++ strings do allow internal NULs, and in an adversarial context, it's not wise to rely on implementation details that may change.
Notice that we're now using a git feature (%x00) rather than a bash feature ($'\x00'). This is because bash is written in C and truncates the string at the first NUL:
$ mystr=$'abc\x00def'
$ echo ${#mystr}  # length should be 7, not 3
3
Compare with Python:
>>> mystr = 'abc\0def'
>>> print(len(mystr))
7
Even if bash supported internal NULs, you still couldn't grep for them like this:
$ grep $'\x00' myfile # matches the empty string on every line
That's because the kernel is also written in C, and the items in the argv array passed to main() are NUL-terminated strings.
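The argv limitation can be observed from Python as well. The sketch below is only an illustration: the helper name can_pass_as_argument is made up, and it relies on CPython's subprocess module raising ValueError for an argument that contains an embedded NUL.

```python
import subprocess
import sys


def can_pass_as_argument(arg: str) -> bool:
    """Return True if `arg` can be handed to a child process via argv."""
    try:
        # The child is a Python interpreter that does nothing.
        subprocess.run([sys.executable, '-c', 'pass', arg], check=True)
    except ValueError:
        # CPython rejects arguments with an embedded NUL, since a
        # NUL-terminated argv entry cannot represent them.
        return False
    return True


if __name__ == '__main__':
    print(can_pass_as_argument('abc'))
    print(can_pass_as_argument('abc\0def'))
```

So even a language that stores NULs in strings can't pass them on a command line; they have to travel through a file or a pipe.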
So instead I convert the file to hex digits with od, then grep and count lines:
count-nul() {
  # -v stops od from collapsing repeated output lines into '*', which would
  # hide NULs. -o puts every match on its own line. (grep -o -c doesn't work.)
  od -A n -t x1 -v | grep -o '00' | wc -l
}
Aside: this function is written in point-free style. That means it doesn't mention variables, constants, or any other kind of data. I wrote about this in Pipelines Support Vectorized, Point-Free, and Imperative Style.
This solution isn't perfect, but that gives me things to think about for the Oil language:
(1) Oil should be able to store NUL bytes in strings. This is trivial because Oil is currently based on the Python VM.
(2) Oil should let you directly write loops that check for NUL. The od ... | grep ... | wc -l solution is clever, but I'd rather be straightforward and efficient.
Maybe something like:
proc expect-nul-count -- num_expected {
  var n = 0  # integer variable
  while (true) {
    var s = read(stdin, 4096)
    if (s == '') { break }  # stop at end of input
    set n += s.count('\0')  # .count() is borrowed from Python
  }
  test $n -eq $num_expected || die "Expected $num_expected NULs"
}
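Since the Oil syntax here is only a proposal, the same loop may be easier to judge in today's Python. This is a sketch, not code from the repository: count_nul and expect_nul_count are illustrative names, and the 4096-byte chunk size is copied from the proposal above.

```python
import io


def count_nul(stream, chunk_size=4096):
    """Count NUL bytes in a binary stream, reading one chunk at a time."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty bytes object means end of input
            break
        total += chunk.count(b'\0')
    return total


def expect_nul_count(stream, num_expected):
    """Raise ValueError unless the stream holds exactly num_expected NULs."""
    actual = count_nul(stream)
    if actual != num_expected:
        raise ValueError(f'Expected {num_expected} NULs, got {actual}')


if __name__ == '__main__':
    # One row with 2 fields, each wrapped in a pair of NULs: 4 in total.
    sample = io.BytesIO(b'<td>\0Andy Chu\0</td><td>\0Modify example:\0</td>')
    expect_nul_count(sample, 4)
    print('OK')
```

Reading in fixed-size chunks keeps memory use constant, and because a NUL is a single byte, a chunk boundary can never split one.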
(3) Left-to-right syntax would be nice. Instead of,
expect-nul-count $n < tmp.bin
escape-segments < tmp.bin
perhaps:
open tmp.bin -- expect-nul-count $n
open tmp.bin -- escape-segments
This can also be done with cat, but some people don't like the useless use of cat.
I had originally thought that a broader ecosystem around the Oil shell would need its own implementation of tools like ls and ps. This is because I want to transfer structured data over pipes.
But after thoroughly analyzing this problem, I think that the existing convention of using NUL bytes suffices for nearly all shell problems. It's wise to avoid boiling the ocean.
Although bash doesn't properly support NUL bytes, their use is already an informal convention in shell:
- git log supports them with %x00.
- find in findutils supports them with -print0 and -printf "%s\0%P\0".
On the other side:
- xargs -0 splits its input with NUL bytes.
However, shell should also be able to handle binary data with arbitrary bytes. Possible solutions:
- open() and read().
I'm not making any strong recommendations for external tools right now, but feel free to leave comments if there's anything I'm missing.
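For the consuming side of the NUL convention mentioned above (find -print0 feeding xargs -0), here is a rough Python sketch of splitting NUL-terminated records. The function name split_nul_records is made up, and requiring a trailing NUL is a choice made for this sketch rather than something xargs itself enforces.

```python
def split_nul_records(data: bytes) -> list:
    """Split NUL-terminated records, such as the output of `find -print0`."""
    if not data:
        return []
    if not data.endswith(b'\0'):
        # Assumption for this sketch: every record, including the last,
        # is terminated by a NUL. Anything else is treated as truncated.
        raise ValueError('input does not end with a NUL terminator')
    # Drop the final terminator so split() yields no trailing empty record.
    return data[:-1].split(b'\0')


if __name__ == '__main__':
    sample = b'a file.txt\0dir/new\nline.txt\0'
    for record in split_nul_records(sample):
        print(repr(record))
```

Spaces and newlines inside a record need no special treatment, which is the main attraction of the convention for file names.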
In this post, I posed a more general HTML escaping problem, and gave a simple and secure solution. If you find a problem with it, leave a comment.
This problem was worth thinking about because it made me realize that very little in the shell ecosystem needs to change in order to support structured data over pipes.
Tools can provide JSON or CSV output if they want. Oil will be able to understand those formats. But the minimum set of mechanisms we need is:
- Tools with printf-style format strings should support something like %x00. This should be an easy change to make.
- The shell should be able to store NUL bytes in the middle of strings.
Except for aesthetics, that's it. Oil may provide optional metaprogramming libraries to do transformations like this:
# The format string is lazily evaluated here, in the context of a
# record. We should also do auto-escaping, as in JavaScript
# ES6 template strings.
git-log-wrapper | format '<td>$hash</td> <td>$description</td>'
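As a rough idea of what auto-escaping in such a format step could look like, here is a Python sketch built on string.Template. It is not a proposal for Oil's design: format_record and the dict-shaped record are assumptions made for illustration, and the template is evaluated eagerly rather than lazily.

```python
import html
from string import Template


def format_record(template: str, record: dict) -> str:
    """Substitute $name placeholders, HTML-escaping every value first."""
    escaped = {key: html.escape(str(value)) for key, value in record.items()}
    # substitute() raises KeyError if the template names a missing field.
    return Template(template).substitute(escaped)


if __name__ == '__main__':
    row = {'hash': 'a089617', 'description': 'Use <b> & </b>'}
    print(format_record('<td>$hash</td> <td>$description</td>', row))
```

Because values are escaped before substitution, the template author can't forget to escape a field, which is the property the shell pipelines earlier in this post have to work for by hand.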
But that's a topic for later.