This post has everything you need to know about xargs, an essential tool for shell programming. It's based on my #comments on this "wrong" post:
xargs considered harmful (codefaster.substack.com via lobste.rs)
24 points, 31 comments on 2021-07-16
Apologies to the author for the criticism, but it generated a great discussion! Here is what lies ahead:
What is xargs?
It's an adapter between text streams and argv arrays, two essential concepts in shell. You pass it flags that specify how to split stdin. Then it generates arguments and invokes processes. Example:
$ echo 'alice bob' | xargs -n 1 -- echo hi
hi alice
hi bob
What's happening here?
1. xargs splits the input stream on whitespace, producing 2 arguments, alice and bob.
2. We passed -n 1, so xargs then passes each argument to a separate echo hi $ARG command. By default, it passes as many args to a command as possible, like echo hi alice bob.
It may help to mentally replace xargs with the word each. As in, for each word, line, or token, invoke this process with these args. In fact, I propose an each builtin for the Oil language below.
(This explanation was derived from this comment on the same thread.)
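To make the effect of -n concrete, here's a minimal sketch that only uses echo. The names are sample input, and it assumes an xargs that supports -n and `--` (GNU and BSD xargs both should).

```shell
#!/usr/bin/env bash
# Default: xargs packs as many args as it can into one command.
echo 'alice bob carol' | xargs -- echo hi
# prints: hi alice bob carol

# -n 1: one process per argument, so three separate echo commands run.
echo 'alice bob carol' | xargs -n 1 -- echo hi

# -n 2: batches of at most two arguments, so two echo commands run.
echo 'alice bob carol' | xargs -n 2 -- echo hi
```

The number of output lines is the number of processes xargs started, which is a handy way to check your batching.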
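You should know how to control:
- How stdin is split into arguments (-d, -0). Discussed below.
- How many arguments are passed to each process (-n). This determines the total number of processes started.
- How many processes run in parallel (-P).
The blog post suggests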
rm $(ls | grep foo)
as an alternative to xargs. In this case, it's better to use a glob, which is built into the shell:
rm *foo*
xargs Over Shell's Word Splitting
Besides the extra ls, the suggestion is bad because it relies on shell's word splitting. This is due to the unquoted $(). It's better to rely on the splitting algorithms in xargs, because they're simpler and more powerful. (Related: Oil Doesn't Require Quoting Everywhere.)
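Here's a small sketch of how the unquoted $() goes wrong on a filename with a space, and how an explicit delimiter avoids it. The scratch directory and filenames are made up for the demo, and -d assumes GNU xargs.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory with one awkward filename.
dir=$(mktemp -d)
cd "$dir"
touch 'foo bar.txt' other.txt

# Unquoted $() is word-split by the shell: rm receives 'foo' and
# 'bar.txt', neither of which exists, so nothing is removed.
rm $(ls | grep foo) 2>/dev/null || echo 'rm failed: the name was split in two'

# xargs with an explicit newline delimiter passes the name intact.
ls | grep foo | xargs -d $'\n' -- rm
ls   # only other.txt remains
```

A filename containing a newline would still break the second form, which is what xargs -0 is for.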
For example, if you want the power of regexes to filter names, you can pipe to egrep, then explicitly split its output by newlines:
# Remove Python and C++ unit tests
ls | egrep '.*_test\.(py|cc)' | xargs -d $'\n' -- rm
(Note that parsing ls is an anti-pattern. In Oil you can use write --qsn * | egrep instead.)
Now that we've introduced xargs and discussed alternatives, here's some advice for using it.
stdin
In the comment, I suggest using only these three styles of splitting:
- xargs (the default): When you want "words" without spaces. For example, you can produce two args from the string 'alice bob'.
- xargs -d $'\n': When you want the args to be lines, as in the egrep example above. (Note that $'\n' is bash syntax for a newline character, and Oil uses this syntax too.)
- xargs -0: When you want to handle untrusted data. Someone could put a newline in a filename, but this is safe with NUL-delimited tokens.
Most of my scripts use the second style, and occasionally the third. Unix tools generally work better on streams of lines than streams of "words" or NUL-delimited tokens.
(This is one motivation for Oil's QSN serialization format. It's line-based, so regular grep still works, rather than needing GNU grep -z. It can also represent every string, including those with NULs.)
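The three styles can be compared side by side with a sketch like this. The sh -c helper is only there for illustration: it prints how many arguments it received. -d assumes GNU xargs.

```shell
#!/usr/bin/env bash
# Helper command: report the number of arguments xargs passed along.
show='echo "argc=$#"'

# 1. Default: split on whitespace, so 'alice bob' becomes 2 args.
printf 'alice bob\n' | xargs -- sh -c "$show" sh

# 2. Lines: split on newlines only, so 'alice bob' stays 1 arg.
printf 'alice bob\n' | xargs -d $'\n' -- sh -c "$show" sh

# 3. NUL-delimited: the first token contains a newline, and it still
#    arrives as a single argument. 2 args in total.
printf 'one\ntwo\0three\0' | xargs -0 -- sh -c "$show" sh
```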
xargs Can Invoke Shell Functions With the $0 Dispatch Pattern
The original post discusses xargs -I {}, which allows you to control where each argument is substituted in the argv array.
I occasionally use -I, but more often I use xargs with what I call the $0 Dispatch Pattern. I outlined this shell programming pattern last month, but I still need to elaborate on it.
The basic idea is to avoid the mini language of -I {} and just use shell, by recursively invoking shell functions. I use this all over Oil's own shell scripts, and elsewhere.
Example:
do_one() {
  # Rather than xargs -I {}, it's more flexible to
  # use a function with $1
  echo "Do something with $1"
  cp --verbose "$1" /tmp
}
do_all() {
  # Call the do_one function for each item.
  # Also add -P to make it parallel
  cat tasks.txt | xargs -n 1 -d $'\n' -- $0 do_one
}
"$@"  # dispatch on $0; or use 'runproc' in Oil
Now run this script with either:
- demo.sh do_one $ARG to test the work that's done on each item. You want to make this correct first.
- demo.sh do_all to do work on all items.
This breaks the problem down nicely: make it work on one item, and then figure out which items to run it on. When you combine them, they will work, unlike the "sed into bash" solution given in the original post.
In other words: Use the Shell Language, Not Mini-Languages Like xargs -I {}. This reduces language cacophony.
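If you want to try the pattern end to end, here's a self-contained sketch that writes a hypothetical demo.sh and tasks.txt into a scratch directory and runs both entry points. The file contents are made up, -d assumes GNU xargs, and I quote "$0" so a script path with spaces would still work.

```shell
#!/usr/bin/env bash
# Set up a scratch directory with a task list (names contain spaces).
dir=$(mktemp -d)
cd "$dir"
printf '%s\n' 'file one.txt' 'file two.txt' > tasks.txt

# Write the hypothetical demo script.
cat > demo.sh <<'EOF'
#!/usr/bin/env bash
do_one() {
  # Any shell logic can go here; -I {} can only substitute a string.
  local name=$1
  echo "processing [$name] (${#name} chars)"
}
do_all() {
  # Re-invoke this same script once per line of tasks.txt.
  cat tasks.txt | xargs -n 1 -d $'\n' -- "$0" do_one
}
"$@"
EOF
chmod +x demo.sh

./demo.sh do_one 'file one.txt'   # test a single item first
./demo.sh do_all                  # then run over every line
```

The script is invoked as ./demo.sh so that xargs can find it again through $0 without relying on PATH.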
echo Prefix
Before running a command like:
$ cat tasks.txt | xargs -n 1 -- $0 do_one
it's often useful to preview it with echo:
$ cat tasks.txt | xargs -n 1 -- echo $0 do_one
demo.sh do_one filename
demo.sh do_one with
demo.sh do_one spaces.txt
# Oops! We split the input the wrong way.
# We wanted xargs -d $'\n'.
xargs -P Automatically Parallelizes Tasks
In the do_all example above, you can add -P 8 to the xargs invocation to automatically parallelize it! For example, if you have 1000 independent tasks, xargs will run up to 8 of them at a time, keeping 8 CPUs busy until they're all done.
I've used -P 32 to make day-long jobs take an hour! You can't do that with a for loop.
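A quick way to see the speedup is a sketch with sleep standing in for real work. The task here is invented for the demo; with -P the order of the output lines isn't guaranteed.

```shell
#!/usr/bin/env bash
# Four one-second tasks. Run one at a time, this takes about 4 seconds.
time (seq 4 | xargs -n 1 -- sh -c 'sleep 1; echo "done $1"' sh)

# With -P 4 the sleeps overlap, so it takes about 1 second.
time (seq 4 | xargs -n 1 -P 4 -- sh -c 'sleep 1; echo "done $1"' sh)
```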
This is one of my favorite tricks, and 3 years ago I gave a 5 minute presentation at #recurse-center about it.
xargs -P Before GNU Parallel
Some shell users use GNU parallel to parallelize processes. I avoid it because it has yet another mini-language with {} and :::.
However, it does have features that xargs doesn't, as surfaced in the Hacker News thread. For example, it buffers to avoid interleaving stdout of parallel processes, and it supports remote execution with load balancing.
I've found that invoking shell functions with xargs -P goes a long way, but you may have a problem where GNU parallel is appropriate.
Use -n To Batch Args, not -L
The original article talks a lot about xargs -L, which I never use. It looks like a mini data language to be avoided: e.g. trailing blanks meaning something special.
I asserted that -n was always better than -L in the comments, and nobody found a counterexample.
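Here's a sketch of the trailing-blank behavior, using made-up input where the first line ends with a space. The results described in the comments are what GNU xargs does; POSIX specifies the same line-continuation rule for -L.

```shell
#!/usr/bin/env bash
# Three lines of input; the first one ends with a trailing space.
input=$'a \nb\nc\n'

# -L 1 means "one input line per command", but the trailing blank
# joins line 1 with line 2, so 'a' and 'b' land in the same command.
printf '%s' "$input" | xargs -L 1 -- echo got:

# -n 1 counts arguments instead, so every token gets its own command.
printf '%s' "$input" | xargs -n 1 -- echo got:
```

The first command runs echo twice and the second runs it three times, even though the input is the same.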
One rm Process, Not 10,000
A lobste.rs user asked why you would use find | xargs rather than find -exec.
The answer is that it can be much faster. If you're trying to rm 10,000 files, you can start one process instead of 10,000 processes!
It's basically
rm one two three
vs.
rm one
rm two
rm three
Other commenters pointed out that you can use find -exec + instead of find -exec \;, but I'd say that's another mini-language to be avoided.
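You can count the processes yourself with a sketch like the following. The scratch directory and the sh -c helper are invented for the demo: each invocation of the helper prints exactly one line.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory with 5 files.
dir=$(mktemp -d)
cd "$dir"
touch one two three four five

# One output line per process started.
count='echo "1 process, $# args"'

# find -exec \; starts one process per file: 5 lines of output.
find . -type f -exec sh -c "$count" sh {} \;

# find | xargs batches all the files into one process: 1 line of output.
find . -type f -print0 | xargs -0 -- sh -c "$count" sh
```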
Links:
- find -exec is slower: https://www.reddit.com/r/ProgrammingLanguages/comments/frhplj/some_syntax_ideas_for_a_shell_please_provide/fm07izj/
xargs Composes With Other Tools
In addition to avoiding mini-languages, find | xargs lets you interpose other tools in the pipeline. That is, find -exec is "hard-coded", while the pipeline allows obvious extensions like:
# Filter tasks by name
find ... | grep ... | xargs ...
# Limit the number of tasks. I use this all the time
# for faster testing
find ... | head | xargs ...
# Believe it or not, I use this to randomize music
# and videos :)
find ... | shuf | xargs mplayer
So shell is a more compositional language than find (which is indeed a separate language).
To repeat, here are the benefits of the style I advocate:
- You can use echo to preview tasks. This avoids running long batch jobs on the wrong input!
- xargs lets you start as few processes as possible.
This post explained xargs, gave advice on using it, and justified the advice.
The most important takeaway is that you can invoke and parallelize shell functions with xargs, via the $0 Dispatch Pattern.
I use this pattern all the time, but I've rarely seen it used in the wild. Try it out and let me know what you think! You can start from the sample code in the blog-code/xargs directory.
You may have noticed a high level pattern to this advice: We avoid "mini-languages" in various tools, and use the shell language instead:
- A shell function with $1, instead of xargs -I {}
- xargs and xargs -n 1 instead of find -exec + and find -exec \;
- -n instead of -L (to avoid an ad hoc data language)
- Shell syntax $'\n' instead of tool syntax '\n'. Using the shell is simpler than relying on every tool to understand the 2 character escape sequence \n.
This preference is partly aesthetic, but it also relates to:
(Remember that I'm working on Slogans, Fallacies, and Concepts for shell programming.)
each Builtin
I sketched an idea for each and every builtins in Oil in this comment on the same thread. On second thought, I think we should only have each, and it should look something like this:
# Items are lines by default.
# Start as few processes as possible, like xargs
# and 'find -exec +'
find . | each {
  rm --verbose @_items  # remove many files
}
# Start one process for each item, like 'xargs -n 1'
# and 'find -exec \;'
find . | each --one {
  echo "Do something with $_item"
}
# Parallelize it
find . | each --one -P 8 {
  echo "Do something with $_item"
  sleep 0.1
}
So the separate do_one and do_all functions are avoided with Ruby-like blocks in the Oil language. Just like the cd builtin accepts a block, the each builtin can as well.
Let me know what you think!
I issued an "alternative shell challenge" last year: Can you redirect stdout of a shell function that invokes both builtins and external processes?
Here's another challenge for alternative shells: can you parallelize your shell's notion of functions with xargs -P 8? As shown above, it's done in Bourne shell with xargs -P 8 -- $0 myfunc.
Both of these challenges relate to shell functions and the Perlis-Thompson Principle, an important idea in #software-architecture. They explain why the Oil language is designed around procs rather than Python- or JavaScript-like functions.
- sed | bash is a dangerous pattern. You can pipe data rather than code.
- xargs still splits very long input into several batches when -n isn't specified. So it may start a few rm processes, but not 10,000! (for performance reasons)
- Another flag worth knowing is --no-run-if-empty. I've used this in Oil's own shell scripts. Tools like rm fail if passed no arguments, so you may want to avoid running them on empty input.
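Here's a minimal sketch of what the flag changes, assuming GNU xargs. (BSD xargs doesn't run the command on empty input in the first place, and accepts -r only for compatibility.)

```shell
#!/usr/bin/env bash
# GNU xargs runs the command once even when stdin is empty,
# so this prints 'ran with args:' followed by nothing.
printf '' | xargs -- echo 'ran with args:'

# --no-run-if-empty (short form: -r) skips the command entirely.
printf '' | xargs --no-run-if-empty -- echo 'ran with args:'

# Practical use: rm is never invoked when the filter matches nothing,
# so there's no "missing operand" error.
printf '' | xargs --no-run-if-empty -d $'\n' -- rm --verbose
```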