Doing parallel processing in bash?

I have thousands of PNG files that I'd like to make smaller with pngcrush. I have a simple find ... -exec job, but it's sequential. My machine has plenty of resources and I'd like to do this in parallel.
The operation to be performed on every png is:
pngcrush input output && mv output input
Ideally I can specify the maximum number of parallel operations.
Is there a way to do this with bash and/or other shell helpers? I'm on Ubuntu or Debian.

You can use xargs to run multiple processes in parallel:
find /path -print0 | xargs -0 -n 1 -P <nr_procs> sh -c 'pngcrush "$1" temp.$$ && mv temp.$$ "$1"' sh
xargs will read the list of files produced by find (separated by NUL characters (-0)) and run the provided command (sh -c '...' sh) with one file at a time (-n 1). xargs will run up to <nr_procs> processes (-P <nr_procs>) in parallel.
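For example, a concrete invocation might look like this (the -iname '*.png' filter and the job count of 4 are assumptions; adjust both to your setup):
find /path -iname '*.png' -print0 | xargs -0 -n 1 -P 4 sh -c 'pngcrush "$1" "$1.tmp.$$" && mv "$1.tmp.$$" "$1"' sh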

You can use custom find/xargs solutions (see Bart Sas' answer), but when things become more complex you have at least two powerful options:
parallel (from package moreutils)
GNU parallel

With GNU Parallel http://www.gnu.org/software/parallel/ it can be done like:
find /path -print0 | parallel -0 pngcrush {} {.}.temp '&&' mv {.}.temp {}
Here {} is the input file and {.} is the input file with its extension removed; the && is quoted so that the shell passes it to parallel rather than interpreting it itself.
Learn more:
Watch the intro video for a quick introduction:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial (man parallel_tutorial). Your command line
will love you for it.

Related

awk for many compressed files

The following command calculates the GC content for each fastq file
identified with the find command. Briefly, a fastq file stores a large number of data points as records of four lines each; the second line of each record (the only one I'm interested in) contains only the characters A, T, G and C. For testing, (identical) example files can be found here.
find . -iname '*.fastq' -exec awk '(NR%4==2) {N1+=length($0);gsub(/[AT]/,"");N2+=length($0);}END{print N2/N1;}' "{}" \;
How can I modify/rewrite it into a one-liner that works on gzipped fastq files? I need the regex option currently used with find.
find's -exec can be used to invoke (and pass arguments to) a single program. The challenge here is that two commands (zcat | awk) need to be combined with a pipe. There are two possible paths: construct a shell command, or use the more flexible xargs.
# Using sh -c; the filename is passed as a positional parameter ($1)
# instead of substituting {} into the command string, which would break
# on file names containing shell metacharacters
find . -iname '*.fastq.gz' -exec sh -c "zcat \"\$1\" | awk '(NR%4==2) \
{N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}'" sh {} \;
# OR, using process substitution
find . -iname '*.fastq.gz' -exec bash -c "awk '(NR%4==2) \
{N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}' <(zcat \"\$1\")" bash {} \;
See the many other find/xargs examples on Stack Overflow.
If, as you say, you have many large files, I would suggest processing them in parallel. If the issue is that you are having problems quoting your awk, I would suggest putting the script in a separate file, called, say, script.awk, like this:
(NR%4==2) {N1+=length($0);gsub(/[AT]/,"");N2+=length($0);}END{print N2/N1;}
Now you can simply process them all in parallel with GNU Parallel:
find . -iname \*fastq.gz -print0 | parallel -0 gzcat {} \| awk -f ./script.awk
(On most Linux systems the decompressor is zcat; gzcat is the BSD/macOS name.)

How to convert a find command to instead use grep to filter and then `exec` commands on output?

I have a scenario where I need to execute a series of commands on each file that find locates. This normally works great, except that I have over 100 files and folders to exclude from find's results, which makes the command unwieldy and too long to execute from the shell directly. It seems like it would be optimal to use an "exclusion file", similar to what tar or grep allows.
Since find does not accept a file for exclusion, but grep does, I want to know: how can the following be converted into a command that replaces find's exclusion (-prune) and -exec functions with grep and an exclusion file (grep -v -f excludefile), and then executes a series of commands on the result, just as the current command does?
find $IN_PATH -regextype posix-extended \
-regex "/(excluded1|excluded2|excluded3|...|excludedN)" -prune \
-o -type f \
-exec sh -c "( cmd -with_args 1 '{}'; cmd -args2 '{}'; cmd3 '{}') \
| cmd4 | cmd5 | cmd6; cmd7 '{}'" \; \
> output
As a side note (not critical): I've read that this process becomes much less efficient if you don't use exec, and it already takes over 100 minutes each time it runs, so I don't want to slow it down any more than necessary.
The best way I can think of to fulfill your scenario is to split the one-liner into two lines and introduce xargs with parallel execution.
find $IN_PATH -regextype posix-extended \
-regex "/(excluded1|excluded2|excluded3|...|excludedN)" -prune \
-o -type f > /tmp/full_file_list
grep -v -f excludefile /tmp/full_file_list | xargs -n 1 -P <nr_procs> sh -c 'command here "$1"' sh > output
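If file names may contain spaces or other odd characters, a NUL-safe variant of the same idea is possible. This is a sketch, assuming GNU grep (whose -z option switches it to NUL-delimited records):
find "$IN_PATH" -regextype posix-extended \
    -regex "/(excluded1|excluded2|excluded3|...|excludedN)" -prune \
    -o -type f -print0 > /tmp/full_file_list
grep -z -v -f excludefile /tmp/full_file_list | xargs -0 -n 1 -P <nr_procs> sh -c 'command here "$1"' sh > output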
See Bash script processing limited number of commands in parallel and Doing parallel processing in bash? to learn more about parallel execution in bash.
Running find and the per-file commands in a single one-liner makes the two stages compete for disk I/O; splitting the one-liner up can speed the process up a little.
Hint: remember to add full_file_list, excludefile and output to your exclude rules, and always debug your command on a smaller directory first to reduce waiting time.
Why not simply:
find . -type f |
grep -v -f excludefile |
xargs whatever
With respect to "this process is already consuming over 100 minutes to execute": that's almost certainly a problem with whatever command line you wrote to replace whatever above, and we could probably help you improve it if you post a separate question.

How can I search and execute two scripts simultaneously in Unix?

Suppose you have a folder that contains two files.
Example: stop_tomcat_center.sh and start_tomcat_center.sh.
In my example, ls *tomcat* returns these two scripts.
How can I search and execute these two scripts simultaneously?
I tried
ls *tomcat* | xargs sh
but only the first script is executed (not the second).
An easy way to do multiple things in parallel is with GNU Parallel:
parallel ::: ./*tomcat*
Or, if your scripts don't have a shebang at the first line:
parallel bash ::: ./*tomcat*
Or if you like xargs:
ls *tomcat* | xargs -n 1 -P 2 sh
xargs is missing the -n 1 option.
From man xargs:
-n max-args, --max-args=max-args
Use at most max-args arguments per command line. Fewer than max-args arguments will be used if the size (see the -s option) is exceeded, unless the -x option is given, in which case xargs will exit.
xargs otherwise tries to execute the command with as many parameters as possible, which makes sense for most commands.
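A quick way to see the batching behaviour (the letters are just placeholder arguments):
$ echo a b c | xargs echo
a b c
$ echo a b c | xargs -n 1 echo
a
b
c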
In your case ls *tomcat* | xargs sh runs sh stop_tomcat_center.sh start_tomcat_center.sh, and stop_tomcat_center.sh is probably just ignoring the $1 parameter.
Also, it is not a good idea to use the output of ls. A better way would be:
find . -maxdepth 1 -name '*tomcat*' -print0 | xargs -0 -n 1 sh
or:
for command in *tomcat*; do sh "$command"; done
This answer is based on the assumption that the OP meant "both with one command line" when he wrote "simultaneously".
For solutions that run the scripts in parallel, take a look at the other answers.
You can do the following to search and execute:
find . -name "*.sh" -exec sh {} \;
find will locate each matching file and -exec will run it, one file at a time.

Find recursive/xargs/cp/awk/sed/single quote in single quote together in a one-liner

I am trying to create a shell one-liner to find all jpegs in a directory recursively. Then I want to copy them all out to an external directory, renaming them according to their date and time, and appending a random integer in order to avoid overwrites with images that have the same timestamp.
First Attempt:
find /storage/sdcard0/tencent/MicroMsg/ -type f -iname '*.jpg' -print0 | xargs -0 sh -c 'for filename; do echo "$filename" && cp "$filename" $(echo /storage/primary/legacy/image3/$(stat $filename |awk '/Mod/ {print $2"_"$3}'|sed s/:/-/g)_$RANDOM.jpg);done' fnord
Among other things, the above doesn't work because the single quotes of the awk program sit inside the sh -c single quotes.
The second attempt should do the same thing without sh -c, but gives me this error on stat:
stat: can't stat '': No such file or directory
/system/bin/sh: file: not found
Second Attempt:
find /storage/sdcard0/tencent/MicroMsg/ -type f -iname '*.jpg' -print0 | xargs -0 file cp "$file" $(echo /storage/primary/legacy/image3/$(stat "$file" | awk '/Mod/ {print $2"_"$3}'|sed s/:/-/g)_$RANDOM.jpg)
I think the problem with the second attempt may be too many subshells?
Can anyone help me know where I'm going wrong here?
On another note: if anyone knows how to preserve the actual modified date/time stamps when copying a file, I would love to throw that in here.
Thank you!
Were it my problem, I'd create a script (call it filecopy.sh) like this:
#!/bin/bash
TARGET="/storage/primary/legacy/image3"
for file in "$@"
do
basetime=$(date +'%Y-%m-%d.%H-%M-%S' -d @$(stat -c '%Y' "$file"))
cp "$file" "$TARGET/$basetime.$RANDOM.jpg"
done
The basetime line runs stat to get the modification time of the file in seconds since The Epoch, then uses that with date to format the time as a modified ISO 8601 format (using - in place of :, and . in place of T). This is then used to create the target file name, along with a semi-random number.
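As a worked example (the file name and timestamp are hypothetical, and the rendered time depends on your time zone):
$ stat -c '%Y' photo.jpg
1357048800
$ date +'%Y-%m-%d.%H-%M-%S' -d @1357048800
2013-01-01.14-00-00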
Then the find command becomes simply:
SOURCE="/storage/sdcard0/tencent/MicroMsg"
find "$SOURCE" -type f -iname '*.jpg' -exec /path/to/filecopy.sh {} +
Personally, I'd not bother to try making it work without a separate shell script. It could be done, but it would not be trivial:
SOURCE="/storage/sdcard0/tencent/MicroMsg"
find "$SOURCE" -type f -iname '*.jpg' -exec bash -c \
'TARGET="/storage/primary/legacy/image3"
for file in "$@"
do
basetime=$(date +%Y-%m-%d.%H-%M-%S -d @$(stat -c %Y "$file"))
cp "$file" "$TARGET/$basetime.$RANDOM.jpg"
done' command {} +
I've taken some liberties in that by removing the single quotes that I used in the main shell script. They were optional, but I'd use them automatically under normal circumstances.
If you have GNU Parallel newer than version 20140722, you can run:
find . | parallel 'cp {} ../destdir/{= $a = int(10000*rand); $_ = `date -r "$_" +%FT%T"$a"`; chomp; =}'
It will work on file names containing ' and space, but fail on file names containing ".
All new computers have multiple cores, but most programs are serial in nature and will therefore not use them. However, many tasks are extremely parallelizable:
Run the same program on many files
Run the same program for every line in a file
Run the same program for every block in a file
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
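For example, a minimal sketch of the "same program on many files" pattern (the *.log glob and the job count of 4 are assumptions):
parallel -j 4 gzip ::: *.log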
Installation
A personal installation does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

How can I execute a list of binaries from a pipe with desired parameters?

I'm doing a find and locating several executables that I want to run with -v. I tried something like this:
find somefilters | xargs -I % % -v
Unfortunately, xargs seems to require that the "utility" be a fixed binary rather than a binary provided by stdin. Does anyone have a recipe for doing this command line magic?
Use the -exec primary:
find ... -exec '{}' -v \;
Yet another way around this - use xargs to write a shell script for you:
find somefilters | xargs -n 1 -I % echo % -v | ${SHELL}
That won't work out so well if any of the programs require interactivity, but if the -v option is just to spit out the version numbers or something (one common meaning, the other being a verbose flag), it should work fine.
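To see what would be executed, you can drop the final pipe to ${SHELL}; if find emitted ./prog1 and ./prog2 (hypothetical names), xargs would generate the lines:
./prog1 -v
./prog2 -v
which the shell then runs one by one.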
