Andy Chu — 2018/6/7

Regex Tips, and Theory vs. Practice

Intro

What are regexes good for?

Quotes

I define UNIX as 30 definitions of regular expressions living under one roof.

— Donald Knuth

(Examples: grep, awk, Perl, Python, vim, emacs, ...)

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

— Jamie Zawinski

(related to tip #4: follow-on talk)

Tips

  1. Use extended regular expressions (ERE) at the shell to minimize the amount of syntax you need to remember (grep/sed/awk vs. Perl/Python/JS/Ruby)

  2. Always write unit tests for your regular expressions.

  3. Use re.VERBOSE in Python for readability (/x in Perl; JavaScript doesn't have this)

  4. Use regular expressions only to match regular languages (a mathematical idea).

Intro

Q: What's good about both "regexes" and regular languages?

A: They are a compact notation for pattern matching on strings (e.g. without writing an if statement for every character). [example]

[A-Za-z_][A-Za-z0-9_]*                    # recognize a variable name
                                          # (programming language world)

^(http|https)://.*\.(png|gif|jpg|jpeg)$   # recognize certain URLs
                                          # ("big data" and system
                                          # administration worlds)

Excerpt from Crafting Interpreters

private boolean isDigit(char c) {
  return c >= '0' && c <= '9';
} 

private boolean isAlpha(char c) {
  return (c >= 'a' && c <= 'z') ||
         (c >= 'A' && c <= 'Z') ||
          c == '_';
}

private boolean isAlphaNumeric(char c) {
  return isAlpha(c) || isDigit(c);
}

private void identifier() {
  while (isAlphaNumeric(peek())) advance();

  addToken(IDENTIFIER);
}

  default:
    ...
    } else if (isAlpha(c)) {
      identifier();
    ...

Comparison

Perl-style "regexes" Regular Languages
Mental Model Machine-like / Imperative (order matters, debugging through time) Mathematical / Algebraic / Linguistic (sets, alphabets, functional programming)
Execution Model Exhaustive backtracking search. Potentially exponential. NFA/DFA. Guaranteed O(n) time, O(1) space.
Example Constructs

Everything in regular languages plus:
Assertions: (?=foo) (?<=foo)
Backreferences: \1 (in pattern, not in replacement)
End of last match: \G
Conditionals, Recursion, Atomic Grouping

Repetition: x+, x*, x?
Alternation: png|gif|jpg
Char Classes: [a-zA-Z], \d, \w, \s
Start and End: ^ and $
Backslash Quotation: \+

Interpreter Implementations The ones in Perl, Python, Ruby, JavaScript, Java. PCRE. libc, awk, re2
Compiler Implementations - re2c, lex/flex
People Larry Wall, Friedl (O'Reilly Book)

Theory: Kleene, Chomsky (Type 3 Grammars)
Bridge to Practice: Ken Thompson, Alfred Aho, Rob Pike, Russ Cox

Recap: What's Good About Regular Languages (but NOT regexes)?

  1. They have efficiency guarantees -- O(n) time, O(1) space.
  2. They can represent nondeterminism
  3. They can be translated to native code (re2c).
  4. They can be checked for exhaustiveness by a compiler (re2c). Compare with ML/Haskell).

Links