xargs for each element in a list in the middle of a command, not just the end like -L does - xargs

What is the correct way to xargs to feed a list of strings as inputs to the middle of a command?
For example, say I want to move all files that come through a complicated series of "pipes" to the home directory. Something like
$ ... | ... | ... | awk '{print $2}' | xargs -L1 mv ~/
Tries to move the home directory to each input, rather than the desired order.
Someone previously asked a question about this, but the answers were not helpful:
Unix - "xargs" - output "in the middle" (not at the end!)
Is there a way to place the xargs input into a specific part of the command and not just at the end.

Using GNU Parallel it looks like this:
$ ... | ... | ... | awk '{print $2}' | parallel mv {} ~/
It will run one job per file. Faster is:
$ ... | ... | ... | awk '{print $2}' | parallel -X mv {} ~/
It will insert multiple names before running the job.

You didn't mention what xargs are you using.
If you use BSD xargs instead of GNU xargs, there is a powerful option -L that can do what you want:
-J replstr
If this option is specified, xargs will use the data read from
standard input to replace the first occurrence of replstr instead
of appending that data after all other arguments. This option
will not affect how many arguments will be read from input (-n),
or the size of the command(s) xargs will generate (-s). The op-
tion just moves where those arguments will be placed in the com-
mand(s) that are executed. The replstr must show up as a dis-
tinct argument to xargs. It will not be recognized if, for in-
stance, it is in the middle of a quoted string. Furthermore,
only the first occurrence of the replstr will be replaced. For
example, the following command will copy the list of files and
directories which start with an uppercase letter in the current
directory to destdir:
/bin/ls -1d [A-Z]* | xargs -J % cp -Rp % destdir
Examples
$ seq 3 | xargs -J# echo # fourth fifth
1 2 3 fourth fifth
seq 3 | xargs -J# ruby -e 'pp ARGV' # fourth fifth
["1", "2", "3", "fourth", "fifth"]
The Ruby code pp ARGV here would pretty print the command arguments it receives. I usually use this way to debug xargs scripts.

Related

How can I deduplicate filenames across directories?

I run the following gsutil command:
gsutil ls -d gs://mybucket/v${version}/folder1/*/*.whl |
sort -V |
grep -e "/*.whl"
I get:
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561595893/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561654308/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563319372/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563319400/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563329633/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563411368/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1565916833/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1565921265/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1566258114/file1-cp27-cp27mu-linux_x86_64.whl
Since some files in different folders have the same names, how can I retrieve unique filenames ignoring the path?
I would do it like this:
blabla_your_command | rev | sort -t'/' -u -k1,1 | rev
rev reverses lines. Then I unique sort using / as a separator on the first field. After the line is reversed, the first field will be the filename, so sorting -u on it would return only unique filenames. Then the line needs to be reversed back.
The following command:
cat <<EOF |
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561595893/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561654308/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563319372/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563319400/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563329633/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563411368/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1565916833/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1565921265/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1566258114/file1-cp27-cp27mu-linux_x86_64.whl
EOF
rev | sort -t'/' -u -k1,1 | rev
outputs:
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
Please check awk option given below, this will print the last occurrence of delimiter '/', it worked for me
example:
gsutil ls gs://mybucket/v1.0.0/folder1/1560930522 | awk -F/ '{print $(NF)}'
print all the file names under '1560930522'
your_command|awk -F/ '!($NF in a){a[$NF]; print}'
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
4 different ways of saying the same thing
nawk -F'^.+/' '++_[$NF]<NF'
gawk -F'/' '__[$NF]++<!_'
mawk -F/ '_^__[$NF]++'
mawk2 -F/ '!_[$NF]--'
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
Here's a simple, straightforward solution:
$ your_gsutil_command | xargs -L 1 basename | sort -u
The easiest way to remove paths is with basename. Unfortunately it accepts only a single filename, which must be on the command line (not from stdin), so we need to take the following steps:
Create the list of files.
We do this with your_gsutil_command, but you can use any command that generates a list of files.
Send each one to basename to remove its path.
The xargs command does this for us by reading its stdin and invoking basename repeatedly, passing the data as command-line arguments. But xargs efficiently tries to reduce the number of invocations by passing multiple filenames on each command line, and that breaks basename. We prevent that with -L 1, limiting it to only one line (that is, one filename) at a time.
Remove duplicates.
The sort -u command does this.
Using your example data:
$ gsutil ls -d gs://mybucket/v${version}/folder1/*/*.whl |
xargs -L 1 basename | sort -u
file1-cp27-cp27mu-linux_x86_64.whl
file1-cp35-cp35m-linux_x86_64.whl
file1-cp36-cp36m-linux_x86_64.whl
file1-cp37-cp37m-linux_x86_64.whl
Caveat: Spaces break everything. 😡
So far we've assumed the filenames and folders do not contain spaces. Spaces break basename because needs exactly one filename, and it would interpret spaces as separators between multiple filenames. We can get around this in two ways:
ls -Q: If you're deduplicating local filenames, you can use the (non-gsutil) ls command with the -Q flag to put the filenames in quotes, so basename will interpret spaces as part of the filenames rather than separators.
gsutil: The -Q flag is unfortunately not supported, so we'll need to escape the spaces manually:
$ your_gsutil_command | sed 's/ /\\ /g' | xargs -L 1 basename | sort -u
Here we use the sed command to escape each space by inserting a backslash before it. (That is, we replace with \ . Note that we also need to escape the backslash in the sed command, which is why we use \\ and not just \.)

perform an operation for *each* item listed by grep

How can I perform an operation for each item listed by grep individually?
Background:
I use grep to list all files containing a certain pattern:
grep -l '<pattern>' directory/*.extension1
I want to delete all listed files but also all files having the same file name but a different extension: .extension2.
I tried using the pipe, but it seems to take the output of grep as a whole.
In find there is the -exec option, but grep has nothing like that.
If I understand your specification, you want:
grep --null -l '<pattern>' directory/*.extension1 | \
xargs -n 1 -0 -I{} bash -c 'rm "$1" "${1%.*}.extension2"' -- {}
This is essentially the same as what #triplee's comment describes, except that it's newline-safe.
What's going on here?
grep with --null will return output delimited with nulls instead of newline. Since file names can have newlines in them delimiting with newline makes it impossible to parse the output of grep safely, but null is not a valid character in a file name and thus makes a nice delimiter.
xargs will take a stream of newline-delimited items and execute a given command, passing as many of those items (one as each parameter) to a given command (or to echo if no command is given). Thus if you said:
printf 'one\ntwo three \nfour\n' | xargs echo
xargs would execute echo one 'two three' four. This is not safe for file names because, again, file names might contain embedded newlines.
The -0 switch to xargs changes it from looking for a newline delimiter to a null delimiter. This makes it match the output we got from grep --null and makes it safe for processing a list of file names.
Normally xargs simply appends the input to the end of a command. The -I switch to xargs changes this to substitution the specified replacement string with the input. To get the idea try this experiment:
printf 'one\ntwo three \nfour\n' | xargs -I{} echo foo {} bar
And note the difference from the earlier printf | xargs command.
In the case of my solution the command I execute is bash, to which I pass -c. The -c switch causes bash to execute the commands in the following argument (and then terminate) instead of starting an interactive shell. The next block 'rm "$1" "${1%.*}.extension2"' is the first argument to -c and is the script which will be executed by bash. Any arguments following the script argument to -c are assigned as the arguments to the script. This, if I were to say:
bash -c 'echo $0' "Hello, world"
Then Hello, world would be assigned to $0 (the first argument to the script) and inside the script I could echo it back.
Since $0 is normally reserved for the script name I pass a dummy value (in this case --) as the first argument and, then, in place of the second argument I write {}, which is the replacement string I specified for xargs. This will be replaced by xargs with each file name parsed from grep's output before bash is executed.
The mini shell script might look complicated but it's rather trivial. First, the entire script is single-quoted to prevent the calling shell from interpreting it. Inside the script I invoke rm and pass it two file names to remove: the $1 argument, which was the file name passed when the replacement string was substituted above, and ${1%.*}.extension2. This latter is a parameter substitution on the $1 variable. The important part is %.* which says
% "Match from the end of the variable and remove the shortest string matching the pattern.
.* The pattern is a single period followed by anything.
This effectively strips the extension, if any, from the file name. You can observe the effect yourself:
foo='my file.txt'
bar='this.is.a.file.txt'
baz='no extension'
printf '%s\n'"${foo%.*}" "${bar%.*}" "${baz%.*}"
Since the extension has been stripped I concatenate the desired alternate extension .extension2 to the stripped file name to obtain the alternate file name.
If this does what you want, pipe the output through /bin/sh.
grep -l 'RE' folder/*.ext1 | sed 's/\(.*\).ext1/rm "&" "\1.ext2"/'
Or if sed makes you itchy:
grep -l 'RE' folder/*.ext1 | while read file; do
echo rm "$file" "${file%.ext1}.ext2"
done
Remove echo if the output looks like the commands you want to run.
But you can do this with find as well:
find /path/to/start -name \*.ext1 -exec grep -q 'RE' {} \; -print | ...
where ... is either the sed script or the three lines from while to done.
The idea here is that find will ... well, "find" things based on the qualifiers you give it -- namely, that things match the file glob "*.ext", AND that the result of the "exec" is successful. The -q tells grep to look for RE in {} (the file supplied by find), and exit with a TRUE or FALSE without generating any of its own output.
The only real difference between doing this in find vs doing it with grep is that you get to use find's awesome collection of conditions to narrow down your search further if required. man find for details. By default, find will recurse into subdirectories.
You can pipe the list to xargs:
grep -l '<pattern>' directory/*.extension1 | xargs rm
As for the second set of files with a different extension, I'd do this (as usual use xargs echo rm when testing to make a dry run; I haven't tested it, it may not work correctly with filenames with spaces in them):
filelist=$(grep -l '<pattern>' directory/*.extension1)
echo $filelist | xargs rm
echo ${filelist//.extension1/.extension2} | xargs rm
Pipe the result to xargs, it will allow you to run a command for each match.

moving files that contain part of a line from a file

I have a file that on each line is a string of some numbers such as
1234
2345
...
I need to move files that contain that number in their name followed by other stuff to a directory examples being
1234_hello_other_stuff_2334.pdf
2345_more_stuff_3343.pdf
I tried using xargs to do this, but my bash scripting isn't the best. Can anyone share the proper command to accomplish what I want to do?
for i in `cat numbers.txt`; do
mv ${i}_* examples
done
or (look ma, no cat!)
while read i; do
mv ${i}_* examples
done < numbers.txt
You could use a for loop, but that could make for a really long command line. If you have 20000 lines in numbers.txt, you might hit shell limits. Instead, you could use a pipe:
cat numbers.txt | while read number; do
mv ${number}_*.pdf /path/to/examples/
done
or:
sed 's/.*/mv -v &_*.pdf/' numbers.txt | sh
You can leave off the | sh for testing. If there are other lines in the file and you only want to match lines with 4 digits, you could restrict your match:
sed -r '/^[0-9]{4}$/s//mv -v &_*.pdf/' numbers.txt | sh
cat numbers.txt | xargs -n1 -I % find . -name '%*.pdf' -exec mv {} /path/to \;
% is your number (-n1 means one at a time), and '%*.pdf' to find means it'll match all files whose names begin with that number; then it just copies to /path/to ({} is the actual file name).

Weird behaviour of a bash script

Here's a snippet:
var=`ls | shuf | head -2 | xargs cat | sed -e 's/\(.\)/\1\n/g' | shuf | tr -d '\n'`
This will select two random files from the current directory, combine their contents, shuffle them, and assign the result to var. This works fine most of the time, but about once in a thousand cases, instead just the output of ls is bound to var (It's not just the output, see EDIT II). What could be the explanation?
Some more potentially relevant facts:
the directory contains at least two files
there are only text files in the directory
file names don't contain spaces
the files are anywhere from 5 to about 1000 characters in length
the snippet is a part of a larger script that it ran two instances in parallel
bash version: GNU bash, version 4.1.5(1)-release (i686-pc-linux-gnu)
uname: Linux 2.6.35-28-generic-pae #50-Ubuntu
EDIT: I ran the snippet by itself a couple of thousand times with no errors. Then I tried running it with various other parts of the whole script. Here's a configuration that produces errors:
cd dir_with_text_files
var=`ls | shuf | head -2 | xargs cat | sed -e 's/\(.\)/\1\n/g' | shuf | tr -d '\n'`
cd ..
There are several hundred lines of the script between the cds, but this is the minimal configuration to reproduce the error. Note that the anomalous output binds to var the output of the current directory, not dir_with_text_files.
EDIT II: I've been looking at the outputs in more detail. The ls output doesn't appear alone, it's along with with two shuffled files (between their contents, or after or before them, intact). But it gets better; let me set up the stage to talk about particular directories.
[~/projects/upload] ls -1
checked // dir
lines // dir, the files to shuffle are here
pages // also dir
proxycheck
singlepost
uploader
indexrefresh
t
tester
So far, I've seen the output of ls ran from upload, but now I saw the output of ls */* (also ran from upload). It was in the form of "someMangledText ls moreMangledText ls */* finalBatchOfText". Is it possible that the sequence ls that undoubtedly was generated was somehow executed?
No problems here either.
I would also rewrite the above to this:
sed 's:\(.\):\1\n:g' < <(shuf -e * | head -2 | xargs cat) | shuf | tr -d '\n'
Do not use ls to list a directory's content, use *.
Moreover, do some debugging. Use a shebang followed by:
set -e
set -o pipefail
and run the script like this:
/bin/bash -x /path/to/script
and do inspect the output.
Instead of debugging the whole script, you can surround just the part that seems to be problematic with -x
set -x
...code that may have problems...
set +x
so that the output focuses on that part of the code.
Also, use the pipefail option.
Some definitions:
-e : Exit immediately if a simple command exits with a non-zero status, unless the command that fails is part of the command list immediately following a while or until keyword, part of the test in an if statement, part of a && or || list, or if the command's return status is being inverted using !. A trap on ERR, if set, is executed before the shell exits
-x : Print a trace of simple commands, for commands, case commands, select commands, and arithmetic for commands and their arguments or associated word lists after they are expanded and before they are executed. The value of the PS4 variable is expanded and the resultant value is printed before the command and its expanded arguments
pipefail : If set, the return value of a pipeline is the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands in the pipeline exit successfully
For debugging purposes you may also clear the environment using env -i and filter out non-printable characters:
#!/usr/bin/env -i /bin/bash --
set -ef
set -o pipefail
unset IFS PATH LC_ALL
IFS=$' \t\n'
PATH="$(PATH=/bin:/usr/bin getconf PATH)"
LC_ALL=C
export IFS PATH LC_ALL
#var="$((find . -type f -maxdepth 1 -print0 | shuf -z -n 2 | xargs -0 cat) | sed -e 's/\(.\)/\1\n/g' | shuf | tr -d '\n')"
var="$((find . -type f -maxdepth 1 -print0 | shuf -z -n 2 | xargs -0 cat) | tr -cd '[[:print:]]' | grep -o '.' | shuf | tr -d '\n')"
Before running the script you may also disable the GNU readline library and ! style history expansion:
bash --noediting
set +H
Based on what you say wrt to your failure rates, and given the success of the other tests performed by the posters above, it sounds like a problem that could be caused by an occasional directory-change failure. Is the directory you're accessing in this script mounted from a remote machine by chance? If so, it might just be a small and temporary network-related failure that's not being handled properly. (Just a guess.)

How to apply shell command to each line of a command output?

Suppose I have some output from a command (such as ls -1):
a
b
c
d
e
...
I want to apply a command (say echo) to each one, in turn. E.g.
echo a
echo b
echo c
echo d
echo e
...
What's the easiest way to do that in bash?
It's probably easiest to use xargs. In your case:
ls -1 | xargs -L1 echo
The -L flag ensures the input is read properly. From the man page of xargs:
-L number
Call utility for every number non-empty lines read.
A line ending with a space continues to the next non-empty line. [...]
You can use a basic prepend operation on each line:
ls -1 | while read line ; do echo $line ; done
Or you can pipe the output to sed for more complex operations:
ls -1 | sed 's/^\(.*\)$/echo \1/'
for s in `cmd`; do echo $s; done
If cmd has a large output:
cmd | xargs -L1 echo
You can use a for loop:
for file in * ; do
echo "$file"
done
Note that if the command in question accepts multiple arguments, then using xargs is almost always more efficient as it only has to spawn the utility in question once instead of multiple times.
You actually can use sed to do it, provided it is GNU sed.
... | sed 's/match/command \0/e'
How it works:
Substitute match with command match
On substitution execute command
Replace substituted line with command output.
A solution that works with filenames that have spaces in them, is:
ls -1 | xargs -I %s echo %s
The following is equivalent, but has a clearer divide between the precursor and what you actually want to do:
ls -1 | xargs -I %s -- echo %s
Where echo is whatever it is you want to run, and the subsequent %s is the filename.
Thanks to Chris Jester-Young's answer on a duplicate question.
xargs fails with with backslashes, quotes. It needs to be something like
ls -1 |tr \\n \\0 |xargs -0 -iTHIS echo "THIS is a file."
xargs -0 option:
-0, --null
Input items are terminated by a null character instead of by whitespace, and the quotes and backslash are
not special (every character is taken literally). Disables the end of file string, which is treated like
any other argument. Useful when input items might contain white space, quote marks, or backslashes. The
GNU find -print0 option produces input suitable for this mode.
ls -1 terminates the items with newline characters, so tr translates them into null characters.
This approach is about 50 times slower than iterating manually with for ... (see Michael Aaron Safyans answer) (3.55s vs. 0.066s). But for other input commands like locate, find, reading from a file (tr \\n \\0 <file) or similar, you have to work with xargs like this.
i like to use gawk for running multiple commands on a list, for instance
ls -l | gawk '{system("/path/to/cmd.sh "$1)}'
however the escaping of the escapable characters can get a little hairy.
Better result for me:
ls -1 | xargs -L1 -d "\n" CMD

Resources