blog | oilshell.org
I write every page on this site in Markdown syntax. At first, I used the original markdown.pl to generate HTML. But I've just switched to cmark, the C implementation of CommonMark. I had a great experience, which I document here.
We need more projects like this: ones that fix existing, widely-deployed technology rather than create new technology.
The home page says:
We propose a standard, unambiguous syntax specification for Markdown, along with a suite of comprehensive tests ...
Much like Unix shell, Markdown is a complex language with many implementations. I happened to use markdown.pl, but another popular implementation is pandoc. Sites like Reddit, GitHub, and StackOverflow have their own variants as well.
However, shell has a POSIX spec. It specifies many non-obvious parts of the language, and shells widely agree on these cases. (Caveat: there are many things that POSIX doesn't specify, as mentioned in the FAQ on POSIX.)
But CommonMark goes further. In addition to a detailed written specification, the project provides reference implementations in C (cmark) and JavaScript (commonmark.js), along with a comprehensive test suite. Perfect!
CommonMark's tests and Oil's spec tests follow the same philosophy. In order to specify the OSH language, I test over a thousand shell snippets against bash, dash, mksh, busybox ash, and zsh. (See blog posts tagged #testing.)
I'd like to see executable specs for more data formats and languages. Of course, POSIX has to specify not just the shell, but an entire operating system, so it's perhaps understandable that they don't provide exhaustive tests. However, some tests would be better than none.
I wanted to parse <h1>, <h2>, ... headers in the HTML output in order to generate a table of contents, like the one at the top of this post. That is, the build process now starts by rendering Markdown to HTML, and then parsing that HTML.
The TOC used to be generated on the client side by traversing the DOM, using JavaScript borrowed from AsciiDoc. But it caused a noticeable rendering glitch. Since switching to static HTML, my posts no longer "flash" at load time.
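As a sketch of what that header-parsing step can look like, here's a minimal TOC extractor using Python's html.parser. The names and structure are my own illustration, not the blog's actual build code:

```python
from html.parser import HTMLParser

class TocBuilder(HTMLParser):
    """Collect <h1>..<h3> heading text from rendered HTML."""
    def __init__(self):
        super().__init__()
        self.in_heading = None  # tag name while inside a heading, else None
        self.entries = []       # [level, text] pairs, in document order

    def handle_starttag(self, tag, attrs):
        if tag in ('h1', 'h2', 'h3'):
            self.in_heading = tag
            self.entries.append([int(tag[1]), ''])

    def handle_endtag(self, tag):
        if tag == self.in_heading:
            self.in_heading = None

    def handle_data(self, data):
        if self.in_heading:
            self.entries[-1][1] += data

def make_toc(html_text):
    """Return (level, title) tuples for each heading in the HTML."""
    p = TocBuilder()
    p.feed(html_text)
    return [(level, text.strip()) for level, text in p.entries]
```

A real build step would then turn those (level, title) pairs into a nested list of anchor links and prepend it to the page.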
I could have simply parsed the output of markdown.pl, but I didn't trust it. I knew it was a Perl script that was last updated in 2004, and Perl and shell share a similar sloppiness with text. They like to confuse code and data. This is one of the things I aim to fix with Oil. (See blog posts tagged #escaping-quoting.)
I had a more concrete reason for this suspicion, too. A few months ago, I noticed markdown.pl producing MD5 checksums in the HTML output when none were in the input. I believe I "fixed" this bug by moving whitespace around, but I still don't know what the cause was. I see several calls to the Perl function md5_hex() in the source code, but there's no explanation for them.
This 2009 reddit blog post has a clue: it says that MD5 checksums are used to prevent double-escaping. But this makes no sense to me: checksums seem irrelevant to that problem, precisely because you can't tell apart checksums that the user wrote and checksums that the rendering process inserted. These bugs feel predictable — almost inevitable.
(However, I have some sympathy, because there are multiple kinds and multiple layers of escaping in shell. Most of these cases took more than one try to get right. The next post will list the different meanings of \ in shell.)
I changed the oilshell.org Makefile to use cmark instead of markdown.pl, and every blog post rendered the same way! When I looked at the underlying HTML, there were a few differences, which were either neutral changes or improvements:
Unicode characters are represented as themselves rather than as HTML entities. For example, &mdash; turned into a literal "—". I like this change, but it means that the output HTML is now UTF-8 rather than ASCII. See the next section for a tip about charset declarations.
Insignificant whitespace in the HTML output changed.
<p>&quot;Oil&quot;</p> turned into <p>"Oil"</p>. The former might be valid HTML, but the latter is better. Correction: Readers have pointed out that escaping " here is unnecessary in both HTML and XML. I thought that being explicit with &quot; would be easier to parse in Python, but I now doubt that too.
A better example is >:

$ echo 'a > b' | markdown
<p>a > b</p>

$ echo 'a > b' | cmark
<p>a &gt; b</p>
I believe cmark's output is better. (However, I couldn't find an occurrence of this problem on my site, since markdown.pl does escape > within <code> tags.)
So every blog post rendered correctly. Correction: I found another cmark incompatibility after publishing this post. See the update below.
But when I rendered the blog index, which includes generated HTML, I ran into a difference. A Markdown heading between HTML tags was rendered literally, rather than with an <h3> tag:

<table>
...
</table>
### Heading
<table>
...
</table>
I fixed it by adding whitespace. I wouldn't write Markdown like this anyway; it was arguably an artifact of generating HTML inside Markdown.
Still, I'm glad that I have a git repository for the generated HTML as well as the source Markdown, so I can do a git diff after a build and eyeball changes.
charset in Both HTTP and HTML

As noted above, the HTML output now has UTF-8 characters, rather than using ASCII representations like &mdash;.
This could be a problem if your web server isn't properly configured. I checked, and my web host is not sending a charset in the Content-Type header:
$ curl --head http://www.oilshell.org/
HTTP/1.1 200 OK
...
Content-Type: text/html
But I remembered that the default charset for HTTP is ISO-8859-1, not UTF-8. Luckily, my HTML boilerplate already declared UTF-8. If you "View Source", you'll see this line in the <head> of this document:
<meta charset=utf-8>
So I didn't need to change anything. When there's no encoding in the HTTP Content-Type header, the browser will use the HTML encoding.
In summary, if you use markdown.pl, I recommend switching to CommonMark, but be aware of the encoding you declare in both HTTP and HTML.
cmark Uses re2c, AFL, and AddressSanitizer

I haven't yet looked deeply into the cmark implementation, but I see three things I like:
It uses re2c, a tool that generates state machines, in the form of switch and goto statements, from regular expressions.
I also used this code generator to implement the OSH lexer. For example, see osh-lex.re2c.h, which I describe in my (unfinished) series of posts on lexing.
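To give a flavor of the technique: re2c compiles regular expressions into hard-coded state machines. Conceptually, the generated code resembles this hand-written sketch, shown here in Python for readability rather than as re2c's actual C output:

```python
def match_number(s, pos=0):
    """Match the regex [0-9]+ starting at pos, as an explicit state machine.

    Returns the end index of the longest match, or -1 if there is no match.
    re2c generates the C equivalent (switch/goto over states) automatically.
    """
    state = 'start'
    i = pos
    while True:
        ch = s[i] if i < len(s) else ''
        if state == 'start':
            if ch.isdigit():
                state = 'digits'   # saw the first digit
                i += 1
            else:
                return -1          # no digit at all: no match
        elif state == 'digits':
            if ch.isdigit():
                i += 1             # stay in the accepting state
            else:
                return i           # end of the longest match
```

A real lexer has many such states, one per regex position, which is exactly why you want a generator rather than writing them by hand.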
It uses American Fuzzy Lop, a relatively new fuzzer that has uncovered many old bugs.
The first time I used it, I found a null pointer dereference in toybox sed in less than a second. Roughly speaking, it relies on compiler technology to know what if statements are in the code. This means it can cover more code paths with less execution time than other fuzzers.
It uses AddressSanitizer, a compiler option that adds dynamic checks for memory errors to the generated code.
I used it to find at least one bug in Brian Kernighan's awk implementation, as well as several bugs in toybox. It's like Valgrind, but it has less overhead.
In summary, these are exactly the tools you should use if you're writing a parser in C that needs to be safe against adversarial input.
Fundamentally, parsers have a larger state space than most code you write. It's impossible to reason about every case, so you need tools like the three above.
Another technique I've wanted to explore, but haven't yet, is property-based testing. As far as I understand, it's related to and complementary to fuzzing.
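The core idea can be sketched without any framework (real tools like Hypothesis add input shrinking and smarter generation): produce many random inputs and check a property that must hold for every one of them. Here the property is that HTML escaping round-trips:

```python
import random
import html

def check_escape_roundtrip(trials=1000, seed=0):
    """Property: html.unescape(html.escape(s)) == s for any string s."""
    rng = random.Random(seed)
    alphabet = 'abc<>&"\' \n'   # include the characters escaping cares about
    for _ in range(trials):
        n = rng.randrange(20)
        s = ''.join(rng.choice(alphabet) for _ in range(n))
        assert html.unescape(html.escape(s)) == s, repr(s)
    return True
```

Unlike example-based tests, this explores a slice of the input space automatically, which is the same motivation as fuzzing, but the oracle is a stated property rather than "doesn't crash."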
I had a great experience with CommonMark, and I'm impressed by its thoroughness. I created oilshell.org/site.html to acknowledge it and all the other projects I depend on.
What other open source projects are fixing widely-deployed technology? Let me know in the comments.
After publishing this post, I noticed that some of my posts had been broken for a while. I shell out to Pygments to render code blocks like:
def Foo():
  pass

def Bar():
  pass
Its output is piped back into the Markdown document as embedded HTML:
<div class="highlight">
<!-- Python code highlighted with <span>.
View source to see it. -->
</div>
However, the blank line triggers issue 490, an intentional incompatibility that allows Markdown in embedded HTML blocks.
I fixed it with this Awk filter:
# Replace blank lines with an HTML comment
awk '
/^[ \t]*$/ { print "<!-- blank -->"; next }
{ print }
'
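For comparison, here is my own translation of that filter into Python; it is not part of the actual build:

```python
def protect_blank_lines(lines):
    """Replace blank (or space/tab-only) lines with an HTML comment,
    like the Awk filter above, so cmark keeps the HTML block intact."""
    for line in lines:
        if line.strip() == '':
            yield '<!-- blank -->\n'
        else:
            yield line
```

Wired between Pygments and cmark, it guarantees the highlighted HTML block never contains a blank line.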
So unfortunately, a few of my posts were broken, and I didn't notice for a while. I had inspected the diffs, but trivial changes drowned out these more important ones.
On the other hand, I've often wanted to use Markdown inside HTML tables, so I may intentionally use this feature of CommonMark.
A few readers asked me this. The answer is technically no: I probably could have generated the TOC with the output of markdown.pl.
But I want firmer foundations for my blog's source text, and more rigorously defined HTML output. CommonMark has a spec, tests, and multiple implementations, while markdown.pl is a Perl script that hasn't been updated since 2004, and has known bugs.
I also learned that the author of pandoc works on CommonMark, which gives me confidence that CommonMark is "grounded in reality" and not inventing something too divergent.
Also, note that Markdown has no syntax errors. Every text file is a valid Markdown document. So, in theory, every divergence from markdown.pl breaks a document.
In that sense, fixing Markdown is harder than fixing shell. In OSH, if I can generate a good error at parse time that leads the author to a trivial fix, I worry less about the incompatibility.