blog | oilshell.org

Pipelines Support Vectorized, Point-Free, and Imperative Style

2017-01-15

This is the second post in a series about unique features of the shell language. These features are missing or awkward in languages like Python, Ruby, Perl, and JavaScript.

As always, code is available in the blog-code repository.

Vectorized Code

Vectorized code operates on collections without explicitly mentioning each item. It's found in both high level languages like R, Matlab, and Julia, as well as in assembly language.

In the R language you can multiply two vectors with a single expression:

$ R
...
> a = c(1, 2, 3)  # vector of 3 integers
> b = c(4, 5, 6)  # another one
>
> a * b           # just use * to multiply them
[1]  4 10 18

In Python, this would be an error:

>>> [1, 2, 3] * [4, 5, 6]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  TypeError: can't multiply sequence by non-int of type 'list'

Instead, you need an explicit loop or list comprehension:

>>> [i * j for (i, j) in zip(a, b)]
[4, 10, 18]

R borrows from the Unix tradition and works on vectors of strings as well as vectors of integers. The grep() function returns the indices of items that match a regular expression, counting from index 1:

> grep('e$', c('bart', 'homer', 'maggie', 'marge'))
[1] 3 4

Because many Unix tools operate on streams lines, shell pipelines are also an example of vectorized code.

For example, this short pipeline lists the first five symlinks in /bin:

$ ls -l /bin | grep ^l | head -n 5
lrwxrwxrwx 1 root root       6 Jul  4  2019 bzcmp -> bzdiff
lrwxrwxrwx 1 root root       6 Jul  4  2019 bzegrep -> bzgrep
lrwxrwxrwx 1 root root       6 Jul  4  2019 bzfgrep -> bzgrep
lrwxrwxrwx 1 root root       6 Jul  4  2019 bzless -> bzmore
lrwxrwxrwx 1 root root       8 Jun 10  2017 dnsdomainname -> hostname

It's nicer than an explicit loop:

$ ls -l /bin | (
> count=0
> while read line; do
>   if test ${line::1} = 'l'; then
>     echo $line
>     count=$((count + 1))
>     if test $count -ge 5; then
>       break
>     fi
>   fi
> done
> )
lrwxrwxrwx 1 root root 6 Jul 4 2019 bzcmp -> bzdiff
lrwxrwxrwx 1 root root 6 Jul 4 2019 bzegrep -> bzgrep
lrwxrwxrwx 1 root root 6 Jul 4 2019 bzfgrep -> bzgrep
lrwxrwxrwx 1 root root 6 Jul 4 2019 bzless -> bzmore
lrwxrwxrwx 1 root root 8 Jun 10 2017 dnsdomainname -> hostname

Point-Free Style

If vectorized style is when you mention an entire collection rather than the items within it, then point-free style is when you mention no data at all.

This notation comes from abstract algebra, where you reason about functions without regard to the things they operate on. It's supported by languages like Lisp and Haskell.

About 10 years ago, the following shell idiom blew me away. It doesn't look like code in any other language. I don't think I saw it in any documentation, books, or existing code. It just seemed like it "should" work, and it does.

Suppose you have these pipelines to calculate histograms of HTTP status code and URL:

$ awk '{print $9}' access.log | sort | uniq -c | sort -n -r
   9337 200
    417 404
     88 304
     60 301
     26 206
      1 418
$ egrep -o 'GET /blog/2017/../..\.html' access.log \
  | sort | uniq -c | sort -n -r \
  | head -n 5
    742 GET /blog/2017/01/13.html
    108 GET /blog/2016/12/30.html
     34 GET /blog/2016/12/11.html
     27 GET /blog/2016/10/10.html
     17 GET /blog/2016/11/14.html

Wouldn't it be nice to factor out the sort | uniq -c | sort -n -r pattern? Here's one way to do it:

hist() {
  "$@" | sort | uniq -c | sort -n -r
}
hist awk '{print $9}' access.log 
hist egrep -o 'GET /blog/2017/../..\.html' access.log | head -n 10

But it's awkward: it requires that the input to the histogram is exactly one command. You might want two commands, e.g. grep | awk, or no commands, using a here doc for literal input instead.

This is a better way:

hist() {
  sort | uniq -c | sort -n -r
}
awk '{print $9}' access.log | hist
egrep -o 'GET /blog/2017/../..\.html' access.log | hist | head -n 10

The hist function doesn't look like anything in Python or JavaScript, but it transforms stdin to stdout exactly as expected. We say it's written in point-free style because it doesn't mention any concrete data — neither individual items nor entire collections.

How does the shell do this? The short answer is that when a function occurs in the middle of a pipeline, it forks a subshell. Like all processes, the subshell has its own stdin and stdout, and can be connected to programs like awk, head, or another subshell. I will go into detail on this mechanism in a future post.

Though I haven't heard this term before, Wikipedia says that tacit programming is a synonym for point-free style , and the last example on the page is identical to ours.

Note that pipelines are vectorized code, and they compose in point-free style, but the two concepts are different. For example, the Forth function to square an integer in this post is point-free but not vectorized:

: square dup * ;

Imperative Code in Pipelines

Another nice feature of pipelines is that you can put imperative code right in the middle of them. We can turn our histogram into an HTML table like this:

egrep -o 'GET /blog/2017/../..\.html' access.log |
  hist |
  head -n 10 |
  { echo "<tr> <td>Count</td> <td>Name</td> </tr>";
    while read count name; do
      echo "<tr> <td>$count</td> <td>$name</td> </tr>"
    done
  } | 
  cat  # no-op command for illustration

Notes:

  # ...
  head -n 10 |
  awk '
    BEGIN { print "<tr> <td>Count</td> <td>Name</td> </tr>"}
          { print "<tr> <td>" $1 "</td> <td>" $2 "</td> </tr>"}
  '

To me, this is evidence that shell and awk should be combined.

Conclusion

We saw some interesting properties of pipelines:

As a shortcut, I think of vectorized code as native to R, Matlab, and Julia. Point-free style is native to Lisp and Haskell.

Shell has a bit of both styles. I take this as evidence that it's a well-designed language underneath the hastily evolved syntax.

I will have at least one more post about interesting shell features coming up. Comments welcome!