
An Opinionated Guide to xargs

2021-08-21

This post has everything you need to know about xargs, an essential tool for shell programming. It's based on my #comments on this "wrong" post:

Apologies to the author for the criticism, but it generated a great discussion! Here is what lies ahead:

Table of Contents
Preliminaries
What Is xargs?
Which Flags Should I Know About?
Globs May Be Enough
Prefer xargs Over Shell's Word Splitting
Usage Tips
Choose One of 3 Ways of Splitting stdin
xargs Can Invoke Shell Functions With the $0 Dispatch Pattern
Preview Tasks With an echo Prefix
xargs -P Automatically Parallelizes Tasks
Use -n To Batch Args, not -L
Benefits Of This Style
Start One rm process, Not 10,000
xargs Composes With Other Tools
Recap
Conclusions
Slogan: Shell-Centric Shell Programming
Oil Language: each Builtin
Appendices
Alternative Shell Challenge #2
More Comments

Preliminaries

What Is xargs?

It's an adapter between text streams and argv arrays, two essential concepts in shell. You pass it flags that specify how to split stdin. Then it generates arguments and invokes processes. Example:

$ echo 'alice bob' | xargs -n 1 -- echo hi
hi alice
hi bob

What's happening here?

  1. xargs splits the input stream on whitespace, producing 2 arguments, alice and bob.
  2. We passed -n 1, so xargs then passes each argument to a separate echo hi $ARG command. By default, it passes as many args to a command as possible, like echo hi alice bob.
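To make the batching behavior concrete, here is a small sketch that runs the same input with the default batching and with -n 2. The variable names are only for illustration.

```shell
# Default: xargs packs as many arguments as fit into one command,
# so echo runs once with all three names.
batched=$(echo 'alice bob carol' | xargs -- echo hi)
echo "$batched"

# -n 2: at most two arguments per command, so echo runs twice.
pairs=$(echo 'alice bob carol' | xargs -n 2 -- echo hi)
echo "$pairs"
```

The first command prints a single line, hi alice bob carol. The second prints hi alice bob and then hi carol.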

It may help to mentally replace xargs with the word each. As in, for each word, line, or token, invoke this process with these args. In fact, I propose an each builtin for the Oil language below.

(This explanation was derived from this comment on the same thread.)

Which Flags Should I Know About?

You should know how to control:

  1. The algorithm for splitting text into arguments (-d, -0). Discussed below.
  2. How many arguments are passed to each process (-n). This determines the total number of processes started.
  3. Whether processes are run in sequence or in parallel (-P).

Globs May Be Enough

The blog post suggests

rm $(ls | grep foo)

as an alternative to xargs. In this case, it's better to use a glob, which is built into the shell:

rm *foo*

Prefer xargs Over Shell's Word Splitting

Besides the extra ls, the suggestion is bad because it relies on shell's word splitting. This is due to the unquoted $(). It's better to rely on the splitting algorithms in xargs, because they're simpler and more powerful. (Related: Oil Doesn't Require Quoting Everywhere.)

For example, if you want the power of regexes to filter names, you can pipe to egrep, then explicitly split its output by newlines:

# Remove Python and C++ unit tests
ls | egrep '.*_test\.(py|cc)' | xargs -d $'\n' -- rm

(Note that parsing ls is an anti-pattern. In Oil you can use write --qsn * | egrep instead.)
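To see why the unquoted $() is fragile, here is a rough sketch that counts how many arguments a command receives for a filename containing a space. It assumes GNU xargs (for -d). The scratch directory and the count_args helper are hypothetical names, not from the original post.

```shell
# Hypothetical scratch directory so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch 'foo bar.txt' 'other.txt'

# Helper for illustration: print how many arguments were received.
count_args() { echo "$#"; }

# Unquoted $(): the shell splits 'foo bar.txt' into two words.
split_count=$(count_args $(ls | grep foo))

# Glob: the shell passes the filename as a single argument.
glob_count=$(count_args *foo*)

# xargs splitting on newlines: also a single argument.
# GNU xargs parses the '\n' escape itself, so this is equivalent
# to the post's -d $'\n' without needing bash.
xargs_count=$(ls | grep foo | xargs -d '\n' -- sh -c 'echo "$#"' sh)

echo "unquoted=$split_count glob=$glob_count xargs=$xargs_count"
```

This prints unquoted=2 glob=1 xargs=1: only the unquoted command substitution breaks the filename apart.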

Usage Tips

Now that we've introduced xargs and discussed alternatives, here's some advice for using it.

Choose One of 3 Ways of Splitting stdin

In the comment, I suggest using only these three styles of splitting:

  1. xargs (the default): When you want whitespace-separated "words". For example, you can produce two args from the string 'alice bob'.
  2. xargs -d $'\n': When you want the args to be lines, as in the egrep example above. (Note that $'\n' is bash syntax for a newline character, and Oil uses this syntax too.)
  3. xargs -0: When you want to handle untrusted data. Someone could put a newline in a filename, but this is safe with NUL-delimited tokens.
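The three styles can be compared side by side with a small sketch like the one below. It assumes GNU xargs. The inline sh -c script that prints its argument count is an illustration, not something from the original post.

```shell
# Illustration helper: an inline script that prints its argument count.
# (xargs cannot call a shell function directly, hence sh -c.)
count='echo "$#"'

# 1. Default: split on whitespace, so two lines of two words give 4 args.
words=$(printf 'a b\nc d\n' | xargs -- sh -c "$count" sh)

# 2. -d '\n': split on newlines only, so the same input gives 2 args.
lines=$(printf 'a b\nc d\n' | xargs -d '\n' -- sh -c "$count" sh)

# 3. -0: split on NUL. The second item contains a newline and
#    still arrives as one argument, so this gives 2 args.
nuls=$(printf '%s\0' 'a b' 'c
d' | xargs -0 -- sh -c "$count" sh)

echo "words=$words lines=$lines nuls=$nuls"
```

This prints words=4 lines=2 nuls=2.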

Most of my scripts use the second style, and occasionally the third. Unix tools generally work better on streams of lines than streams of "words" or NUL-delimited tokens.

(This is one motivation for Oil's QSN serialization format. It's line-based, so regular grep still works, rather than needing GNU grep -z. It can also represent every string, including those with NULs.)

xargs Can Invoke Shell Functions With the $0 Dispatch Pattern

The original post discusses xargs -I {}, which allows you to control where each argument is substituted in the argv array.

I occasionally use -I, but more often I use xargs with what I call the $0 Dispatch Pattern. I outlined this shell programming pattern last month, but I still need to elaborate on it.

The basic idea is to avoid the mini language of -I {} and just use shell — by recursively invoking shell functions. I use this all over Oil's own shell scripts, and elsewhere.

Example:

do_one() {
  # Rather than xargs -I {}, it's more flexible to
  # use a function with $1
  echo "Do something with $1"  
  cp --verbose "$1" /tmp
}

do_all() {
  # Call the do_one function for each item.
  # Also add -P to make it parallel
  cat tasks.txt | xargs -n 1 -d $'\n' -- $0 do_one
}

"$@"  # dispatch on $0; or use 'runproc' in Oil

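Now run this script with either do_one and a filename as arguments (to process a single item), or do_all (to process every line of tasks.txt).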

This breaks the problem down nicely: make it work on one item, and then figure out which items to run it on. When you combine them, they will work, unlike the "sed into bash" solution given in the original post.

In other words: Use the Shell Language, Not Mini-Languages Like xargs -I {}. This reduces language cacophony.
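If you want to try the pattern end to end without touching real files, here is a self-contained sketch. It writes a tiny demo.sh and a tasks.txt into a scratch directory and runs do_all. It assumes GNU xargs, and it differs from the example above in two labeled ways: do_one only echoes, and the script is re-invoked as sh "$0" so that it does not need the executable bit.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# One task per line; the second line contains spaces on purpose.
printf '%s\n' 'a.txt' 'file with spaces.txt' > tasks.txt

cat > demo.sh <<'EOF'
do_one() {
  # Stand-in for real work on a single item.
  echo "task: $1"
}

do_all() {
  # "$0" is this script; xargs re-invokes it once per line of tasks.txt.
  # Assumption: called via 'sh' here; the post invokes $0 directly.
  xargs -n 1 -d '\n' -- sh "$0" do_one < tasks.txt
}

"$@"  # dispatch: the first argument names the function to run
EOF

dispatch_out=$(sh demo.sh do_all)
echo "$dispatch_out"
```

Each line of tasks.txt reaches do_one as a single $1, including the one with spaces.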

Preview Tasks With an echo Prefix

Before running a command like:

$ cat tasks.txt | xargs -n 1 -- $0 do_one

It's often useful to preview it with echo:

$ cat tasks.txt | xargs -n 1 -- echo $0 do_one
demo.sh do_one filename
demo.sh do_one with
demo.sh do_one spaces.txt
# Oops!  We split the input the wrong way.
# We wanted xargs -d $'\n'.
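Continuing the preview above, this sketch shows the wrong and the corrected splitting next to each other. It assumes GNU xargs. The literal demo.sh stands in for $0, and the task name is made up.

```shell
task='filename with spaces.txt'

# Default splitting: the preview shows three bogus tasks.
wrong=$(printf '%s\n' "$task" | xargs -n 1 -- echo demo.sh do_one)

# Newline splitting: the preview shows the one task we intended.
right=$(printf '%s\n' "$task" | xargs -n 1 -d '\n' -- echo demo.sh do_one)

echo "$wrong"
echo "---"
echo "$right"
```

Once the preview looks right, drop the echo to run the tasks for real.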

xargs -P Automatically Parallelizes Tasks

In the do_all example above, you can add -P 8 to the xargs invocation to automatically parallelize it! For example, if you have 1000 independent tasks, xargs will keep up to 8 processes running at once, so on a machine with at least 8 cores the tasks finish as quickly as possible.

I've used -P 32 to make day-long jobs take an hour! You can't do that with a for loop.
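A rough way to see the effect is to time a few sleep commands with and without -P. This is only a sketch: sleep stands in for real work, and the timings have one-second resolution.

```shell
# Three independent one-second tasks, run one after another.
start=$(date +%s)
printf '%s\n' 1 1 1 | xargs -n 1 -- sleep
serial_secs=$(( $(date +%s) - start ))

# The same tasks with up to three processes running at once.
start=$(date +%s)
printf '%s\n' 1 1 1 | xargs -n 1 -P 3 -- sleep
parallel_secs=$(( $(date +%s) - start ))

echo "serial=${serial_secs}s parallel=${parallel_secs}s"
```

The serial run needs at least three seconds, while the parallel run should finish in roughly one.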

This is one of my favorite tricks, and 3 years ago I gave a 5 minute presentation at #recurse-center about it:

Try xargs -P Before GNU Parallel

Some shell users use GNU parallel to parallelize processes. I avoid it because it has yet another mini-language with {} and :::.

However, it does have features that xargs doesn't, as surfaced in the Hacker News thread. For example, it buffers to avoid interleaving stdout of parallel processes, and it supports remote execution with load balancing.

I've found that invoking shell functions with xargs -P goes a long way, but you may have a problem where GNU parallel is appropriate.

Use -n To Batch Args, not -L

The original article talks a lot about xargs -L, which I never use. It looks like a mini data language to be avoided: e.g. trailing blanks meaning something special.

I asserted that -n was always better than -L in the comments, and nobody found a counterexample.
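The trailing-blank rule is easy to trip over. In this sketch the first input line ends with a space, which -L treats as a continuation onto the next line, while -n simply counts arguments.

```shell
# Note the trailing space after "a" on the first line.
with_L=$(printf 'a \nb\n' | xargs -L 1 -- echo)
with_n=$(printf 'a \nb\n' | xargs -n 1 -- echo)

echo "-L 1 output:"
echo "$with_L"
echo "-n 1 output:"
echo "$with_n"
```

With -L 1, both words end up in one echo invocation (a b). With -n 1, each word gets its own.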

Benefits Of This Style

Start One rm process, Not 10,000

A lobste.rs user asked why you would use find | xargs rather than find -exec.

The answer is that it can be much faster. If you’re trying to rm 10,000 files, you can start one process instead of 10,000 processes!

It’s basically

rm one two three

vs.

rm one
rm two
rm three

Other commenters pointed out that you can use find -exec + instead of find -exec \;, but I'd say that's another mini-language to be avoided.
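Here is a small sketch of the difference in process count. It uses echo as a harmless stand-in for rm, so each line of output corresponds to one process started. It assumes GNU xargs, and the scratch directory and filenames are made up.

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch one two three four five

# find -exec with \; starts one echo per file.
per_file=$(find . -type f -exec echo {} \; | wc -l | tr -d ' ')

# find | xargs batches all five names into a single echo.
batched=$(find . -type f | xargs -d '\n' -- echo | wc -l | tr -d ' ')

echo "one process per file: $per_file; batched: $batched"
```

This prints one process per file: 5; batched: 1.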


xargs Composes With Other Tools

In addition to avoiding mini-languages, find | xargs lets you interpose other tools in the pipeline. That is, find -exec is "hard-coded", while the pipeline allows obvious extensions like:

# Filter tasks by name
find ... | grep ... | xargs ...

# Limit the number of tasks.  I use this all the time
# for faster testing
find ... | head | xargs ...

# Believe it or not, I use this to randomize music
# and videos :)
find ... | shuf | xargs mplayer

So shell is a more compositional language than find (which is indeed a separate language).
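As a concrete version of the pipelines sketched above, the following filters tasks by name, limits them to the first two, and passes them to a single command. It assumes GNU xargs. The filenames are hypothetical, sort is added only to make the order deterministic, and echo stands in for the real command.

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch a_test.py b_test.py c_test.py notes.txt

# Filter by name, keep the first two tasks, then run one command on them.
limited=$(find . -type f | grep '_test\.py$' | sort | head -n 2 |
  xargs -d '\n' -- echo run)
echo "$limited"
```

This prints run ./a_test.py ./b_test.py. Each stage can be swapped or removed without touching the others.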

Recap

To repeat, here are the benefits of the style I advocate:

  1. Incremental Development: Figure out what to do on each item (what's a task?), then figure out what items to do it on (what tasks should I run?).
  2. Easy Testing by using echo to preview tasks. This avoids running long batch jobs on the wrong input!
  3. Better Performance, from batching many args into one process and from parallelizing with -P.
  4. Fewer Languages to Remember. We use plain shell and a few flags to xargs.
  5. Composition via Pipelines. The task list becomes a "noun" that other shell tools can operate on.

Conclusions

This post explained xargs, gave advice on using it, and justified the advice. The most important takeaway is that you can invoke and parallelize shell functions with xargs, via the $0 Dispatch Pattern.

I use this pattern all the time, but I've rarely seen it used in the wild. Try it out and let me know what you think! You can start from the sample code in the blog-code/xargs directory.

Slogan: Shell-Centric Shell Programming

You may have noticed a high level pattern to this advice: We avoid "mini-languages" in various tools (xargs -I {}, xargs -L, GNU parallel's {} and :::, and find -exec), and use the shell language instead.

This preference is partly aesthetic, but it also relates to:

(Remember that I'm working on Slogans, Fallacies, and Concepts for shell programming.)

Oil Language: each Builtin

I sketched an idea for two Oil builtins, each and every, in this comment on the same thread. On second thought, I think we should only have each, and it should look something like this:

# Items are lines by default.

# Start as few processes as possible, like xargs
# and 'find -exec +'
find . | each {
  rm --verbose @_items  # remove many files 
}

# Start one process for each item, like 'xargs -n 1'
# and 'find -exec \;'
find . | each --one {
  echo "Do something with $_item"
}

# Parallelize it
find . | each --one -P 8 {
  echo "Do something with $_item"
  sleep 0.1
}

So the separate do_one and do_all functions are avoided with Ruby-like blocks in the Oil language. Just like the cd builtin accepts a block, the each builtin can as well.

Let me know what you think!

Appendices

Alternative Shell Challenge #2

I issued an "alternative shell challenge" last year: Can you redirect stdout of a shell function that invokes both builtins and external processes?

Here's another challenge for alternative shells: can you parallelize your shell's notion of functions with xargs -P 8? As shown above, it's done in Bourne shell with xargs -P 8 -- $0 myfunc.

Both of these challenges relate to shell functions and the Perlis-Thompson Principle, an important idea in #software-architecture. They explain why the Oil language is designed around procs rather than Python- or JavaScript-like functions.

More Comments