This is the second post in a series about unique features of the shell language. These features are missing or awkward in languages like Python, Ruby, Perl, and JavaScript.
As always, code is available in the blog-code repository.
Vectorized code operates on collections without explicitly mentioning each item. It's found in both high level languages like R, Matlab, and Julia, as well as in assembly language.
In the R language you can multiply two vectors with a single expression:
$ R
...
> a = c(1, 2, 3) # vector of 3 integers
> b = c(4, 5, 6) # another one
>
> a * b # just use * to multiply them
[1] 4 10 18
In Python, this would be an error:
>>> [1, 2, 3] * [4, 5, 6]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't multiply sequence by non-int of type 'list'
Instead, you need an explicit loop or list comprehension:
>>> [i * j for (i, j) in zip(a, b)]
[4, 10, 18]
R borrows from the Unix tradition and works on vectors of strings as well
as vectors of integers. The grep()
function returns the indices of items
that match a regular expression, counting from index 1
:
> grep('e$', c('bart', 'homer', 'maggie', 'marge'))
[1] 3 4
Because many Unix tools operate on streams lines, shell pipelines are also an example of vectorized code.
For example, this short pipeline lists the first five symlinks in /bin
:
$ ls -l /bin | grep ^l | head -n 5 lrwxrwxrwx 1 root root 6 Jul 4 2019 bzcmp -> bzdiff lrwxrwxrwx 1 root root 6 Jul 4 2019 bzegrep -> bzgrep lrwxrwxrwx 1 root root 6 Jul 4 2019 bzfgrep -> bzgrep lrwxrwxrwx 1 root root 6 Jul 4 2019 bzless -> bzmore lrwxrwxrwx 1 root root 8 Jun 10 2017 dnsdomainname -> hostname
It's nicer than an explicit loop:
$ ls -l /bin | ( > count=0 > while read line; do > if test ${line::1} = 'l'; then > echo $line > count=$((count + 1)) > if test $count -ge 5; then > break > fi > fi > done > ) lrwxrwxrwx 1 root root 6 Jul 4 2019 bzcmp -> bzdiff lrwxrwxrwx 1 root root 6 Jul 4 2019 bzegrep -> bzgrep lrwxrwxrwx 1 root root 6 Jul 4 2019 bzfgrep -> bzgrep lrwxrwxrwx 1 root root 6 Jul 4 2019 bzless -> bzmore lrwxrwxrwx 1 root root 8 Jun 10 2017 dnsdomainname -> hostname
If vectorized style is when you mention an entire collection rather than the items within it, then point-free style is when you mention no data at all.
This notation comes from abstract algebra, where you reason about functions without regard to the things they operate on. It's supported by languages like Lisp and Haskell.
About 10 years ago, the following shell idiom blew me away. It doesn't look like code in any other language. I don't think I saw it in any documentation, books, or existing code. It just seemed like it "should" work, and it does.
Suppose you have these pipelines to calculate histograms of HTTP status code and URL:
$ awk '{print $9}' access.log | sort | uniq -c | sort -n -r
9337 200 417 404 88 304 60 301 26 206 1 418
$ egrep -o 'GET /blog/2017/../..\.html' access.log \
| sort | uniq -c | sort -n -r \
| head -n 5
742 GET /blog/2017/01/13.html 108 GET /blog/2016/12/30.html 34 GET /blog/2016/12/11.html 27 GET /blog/2016/10/10.html 17 GET /blog/2016/11/14.html
Wouldn't it be nice to factor out the sort | uniq -c | sort -n -r
pattern?
Here's one way to do it:
hist() {
"$@" | sort | uniq -c | sort -n -r
}
hist awk '{print $9}' access.log
hist egrep -o 'GET /blog/2017/../..\.html' access.log | head -n 10
But it's awkward: it requires that the input to the histogram is exactly one
command. You might want two commands, e.g. grep | awk
, or no commands, using
a here doc for literal input instead.
This is a better way:
hist() {
sort | uniq -c | sort -n -r
}
awk '{print $9}' access.log | hist
egrep -o 'GET /blog/2017/../..\.html' access.log | hist | head -n 10
The hist
function doesn't look like anything in Python or JavaScript, but it
transforms stdin
to stdout
exactly as expected. We say it's written in
point-free style because it doesn't mention any concrete data —
neither individual items nor entire collections.
How does the shell do this? The short answer is that when a function occurs
in the middle of a pipeline, it forks a subshell. Like all processes, the
subshell has its own stdin
and stdout
, and can be connected to programs
like awk
, head
, or another subshell. I will go into detail on this
mechanism in a future post.
Though I haven't heard this term before, Wikipedia says that tacit programming is a synonym for point-free style , and the last example on the page is identical to ours.
Note that pipelines are vectorized code, and they compose in point-free style, but the two concepts are different. For example, the Forth function to square an integer in this post is point-free but not vectorized:
: square dup * ;
Another nice feature of pipelines is that you can put imperative code right in the middle of them. We can turn our histogram into an HTML table like this:
egrep -o 'GET /blog/2017/../..\.html' access.log |
hist |
head -n 10 |
{ echo "<tr> <td>Count</td> <td>Name</td> </tr>";
while read count name; do
echo "<tr> <td>$count</td> <td>$name</td> </tr>"
done
} |
cat # no-op command for illustration
Notes:
|
makes line continuation with \
unnecessary.{ }
block could be factored into its own function, perhaps using
point-free style. # ...
head -n 10 |
awk '
BEGIN { print "<tr> <td>Count</td> <td>Name</td> </tr>"}
{ print "<tr> <td>" $1 "</td> <td>" $2 "</td> </tr>"}
'
To me, this is evidence that shell and awk should be combined.
We saw some interesting properties of pipelines:
As a shortcut, I think of vectorized code as native to R, Matlab, and Julia. Point-free style is native to Lisp and Haskell.
Shell has a bit of both styles. I take this as evidence that it's a well-designed language underneath the hastily evolved syntax.
I will have at least one more post about interesting shell features coming up. Comments welcome!