Why Sponsor Oils? | blog | oilshell.org
Eggex is Oil's regular expression syntax. You may want to skim the headings of the eggex doc if it's unfamiliar.
This post demonstrates egg expressions on a realistic problem:
recognizing integer literals like 42
, 0xFF
, or 0o755
.
It also shows that eggexes are appropriate for two problems that traditionally use different regex syntax:
The Oil language and its eggex subset are both still under development. I made a few changes motivated by this example, and I'm happy with the results. Let me know what you think in the comments or follow the links on the wiki!
Considered by themselves, integer literals are a regular language. You can write them in decimal, binary, octal, or hexadecimal; and add underscores for readability.
The reference manual describes them like this:
integer ::= decinteger | bininteger | octinteger | hexinteger
decinteger ::= nonzerodigit (["_"] digit)* | "0"+ (["_"] "0")*
bininteger ::= "0" ("b" | "B") (["_"] bindigit)+
octinteger ::= "0" ("o" | "O") (["_"] octdigit)+
hexinteger ::= "0" ("x" | "X") (["_"] hexdigit)+
nonzerodigit ::= "1"..."9"
digit ::= "0"..."9"
bindigit ::= "0" | "1"
octdigit ::= "0"..."7"
hexdigit ::= digit | "a"..."f" | "A"..."F"
Examples of valid integer literals:
7 2147483647 0o177 0b100110111
3 79228162514264337593543950336 0o377 0xdeadbeef
100_000_000_000 0b_1110_0101
I'm unfamiliar with this metalanguage, but I can guess that ...
is a
character range, that []
means "optional", and that *
and +
denote
repetition.
This syntax is meant for humans to read — i.e. it isn't executable.
Source: https://docs.python.org/3/reference/lexical_analysis.html#integer-literals
Here's the same language in eggex syntax:
DecDigit = / [0-9] /
BinDigit = / [0-1] /
OctDigit = / [0-7] /
HexDigit = / [0-9 a-f A-F] /
DecInt = / [1-9] ('_'? DecDigit)* | '0'+ ('_'? '0')* /
BinInt = / '0' [b B] ('_'? BinDigit)+ /
OctInt = / '0' [o O] ('_'? OctDigit)+ /
HexInt = / '0' [x X] ('_'? HexDigit)+ /
Integer = / DecInt | BinInt | OctInt | HexInt /
It shows some of the differences between eggex and commmon Perl-style regex syntax:
BinDigit
.'0'
and '_'
. Quotes are optional in
character classes for brevity, and to make them similar to Perl/POSIX syntax.Unlike the Python spec, this syntax is executable. Compared with POSIX regexes, it's more readable.
(Note that I also re-organized the rules: They're in bottom-up order rather
than top-down, and I inlined the nonzerodigit
rule because it's only used
once.)
An eggex in Oil evaluates to a POSIX extended regex (ERE) by default, so
we simply use echo
to see it.
$ echo $Integer
([1-9](_?[0-9])*|0+(_?0)*|0[bB](_?[0-1])+|0[oO](_?[0-7])+|0[xX](_?[0-9a-fA-F])+)
I wouldn't want to type this by hand! Even if I succeeded, reading it and modifying it later would be a challenge.
Perl's /x
and Python's re.VERBOSE
help to a degree, but they're not
available in shell tools like grep and awk.
The integer literal problem seems naturally closer to lexing, but let's try it with Unix tools first.
Here I grep for integer literals using Oil's seamless integration:
bash$ bin/oil
~/git/oilshell/oil$ egrep $Integer */*.py
...
core/process.py: fd = posix.open(path, fd_mode, 0o666)
...
osh/string_ops.py: return ''.join(chr(b & 0xFF) for b in bytes_)
...
osh/string_ops.py: elif (starting_byte >> 3) == 0b11110:
It works! We found:
0o666
for file permissions (about the only thing they're
good for).0xFF
for truncating an integer to a single byte.0b11110
for recognizing with UTF-8. (UTF-8 is
best understood in binary
notation.)In addition to translating to POSIX ERE syntax, eggexes should also translate to:
To be concrete, here's an except from Oil's frontend/lex.py:
VAR_NAME_RE = r'[a-zA-Z_][a-zA-Z0-9_]*'
rules = [
R(r'\$' + VAR_NAME_RE, Id.VSub_DollarName),
...
R(VAR_NAME_RE + '\+?=', Id.Lit_VarLike),
R(VAR_NAME_RE + '\[', Id.Lit_ArrayLhsOpen),
]
As you can see, we use the syntax from Python's re
module and string
concatenation.
We do this because Oil started as a pure Python program, and because the lexer requires "metaprogramming" to remove duplication.
But Eggex could also remove that duplication with pattern composition. It's also improved by "first class" regex syntax, whitespace, and quoted literals.
VarName = / [a-z A-Z _] [a-z A-Z 0-9 _]* /
rules = [
( / '$' VarName /, Id.VSub_DollarName),
...
( / VarName '+'? '=' /, Id.Lit_VarLike),
( / VarName '[' /, Id.Lit_ArrayLhsOpen),
]
See Appendix A for details on the lexer's architecture.
I used the integer literal example to test whether Eggex is a good syntax.
I removed punctuation in:
Subpattern
is allowed in addition to @subpattern
.
The latter is still supported if you don't want to use capital letters.[b B]
is allowed in addition to ['b' 'B']
. This is
consistent with ranges like [a-f]
, and with Perl/POSIX syntax.I also implemented the capturing and grouping syntax, which contains less punctuation than the Perl-style syntax:
Eggex:
(mygroup+)
< mycapture+ >
< mycapture+ : name >
Perl/Python:
(?:mygroup+)
(mycapture+)
(?P<name>mycapture+)
These changes were motivated by Zulip chats. Thanks to Eric Bergstrome for discussing the grouping and capturing syntax, and to Aur Saraf for discussing the use of punctuation.
The Oil language is still under development, and I've made multiple changes based on user feedback. Feel free to lurk on Zulip and see what we're talking about!
Like Oil itself, eggexes are a seamless upgrade. You write patterns in a
modern, composable syntax, but tools like egrep
and awk
are passed the
ERE syntax they expect.
The same eggex syntax can be used for both ad-hoc text processing and writing fast and correct lexers.
Try porting regex to eggex and see if it's more maintainable. Start a thread
on #oil-discuss
on Zulip if you have problems.
Also see the Help Wanted section if you want to help make Oil happen! There are many opportunities to parallelize development.
Since I didn't finish the #lexing series, now is a good time to explain the architecture of Oil's lexer.
I started publishing the following dense source files on each /release/$VERSION/ page. If anyone wants to implement Oil in another language, this portable and principled lexer will save them many hours of work.
frontend/lex.py is a pure Python source file using the re
module
syntax. The dynamic lists are evaluated into (pattern, Id
) rules for 14
lexer modes.
Then we use the (undocumented) regex parse tree from Python's sre_parse.py
to produce
osh-lex.re2c.h, a file in re2c syntax. This syntax has quoted literals like lex and Eggex.
Then re2c produces
osh-lex.h, a large C file with 20K lines of switch
and goto
statements, which nicely express a DFA.
Then the C compiler produces
Machine code. Even though the the source code is large, the machine code isn't, according to the metrics. This is because C compilers perform a switch lowering optimization.
Why go to all this trouble? Because regexes compiled to state machines are more correct than bash's style of groveling through backslashes and braces one at a time. It's also more concise: Oil's source code is many times smaller than bash's.
Eggex obeys some rough refactoring properties. You might interactively type:
$ egrep [0-9]+ foo.txt
And then extract a pattern so you can test and enhance it:
$ pat = / [0-9]+ /
$ egrep $pat foo.txt
In general the eggex should resemble the regex, but be more readable. The integer literal example is intended to show that.
Eggex also uses postfix repetition operators like +
and *
so it's
consistent with:
[optional]
should be optional?
.grep [OPTION]... PATTERN [FILE]...
could begrep OPTION* PATTERN FILE*
to be consistent.Note that ...
meant "character range" in Python's spec, but means
"repetition" in command lines. One of Oil's main goals to avoid such
punning across the many languages and metalanguages that
compose a large software system.