How to run parallel for loops - bash

I'm not very familiar with bash, but I would like to split up this code so that I can run it on a server with 12 processors:
#!/bin/bash
#bashScript.sh
for i in {1..209}
do
Rscript Compute.R $i
done
How would I go about achieving this?
Thanks!

Use xargs with the option --max-procs (-P). If there are enough arguments, xargs will use exactly this number of concurrent processes to process the input:
#! /bin/bash
seq 209 |
xargs -P12 -r -n1 Rscript Compute.R

Try:
#!/bin/bash
#bashScript.sh
for i in {1..209}
do
Rscript Compute.R $i &
done
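Note that the bare & version above starts all 209 jobs at once. A minimal sketch that caps the pool at 12 (the core count from the question), using bash's wait -n (bash >= 4.3); echo stands in for Rscript Compute.R here so the sketch runs anywhere — swap it back for real use:

```shell
#!/bin/bash
# Keep at most 12 background jobs running at any moment.
max_jobs=12
for i in {1..209}; do
  echo "job $i" &   # stand-in for: Rscript Compute.R "$i" &
  # If the pool is full, block until any one job exits.
  while (( $(jobs -rp | wc -l) >= max_jobs )); do
    wait -n
  done
done
wait   # wait for the final batch to finish
```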

Use GNU Parallel:
parallel Rscript Compute.R ::: {1..209}
10-second installation:
wget -O - pi.dk/3 | sh
Watch the intro video for a quick introduction:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Related

GNU Parallel: Run bash code that reads (seq number) from pipe?

I would like parallel to read the seq numbers from the pipe, so I would like to run something like this:
seq 2000 | parallel --max-args 0 --jobs 10 "{ read test; echo $test; }"
Would be equivalent to running:
echo 1
echo 2
echo 3
echo 4
...
echo 2000
But unfortunately, the pipe was not read by parallel, meaning that it was instead run like:
echo
echo
echo
...
echo
And the output is empty.
Does anyone know how to make parallel read the seq numbers from the pipe? Thanks.
An alternative with GNU xargs that does not require GNU parallel:
seq 2000 | xargs -P 10 -I {} "echo" "hello world {}"
Output:
hello world 1
hello world 2
hello world 3
hello world 4
hello world 5
...
From man xargs:
-P max-procs: Run up to max-procs processes at a time; the default is 1. If max-procs is 0, xargs will run as many processes as possible at a time.
-I replace-str: Replace occurrences of replace-str in the initial-arguments with names read from standard input.
You want the input to be piped into the command you run, so use --pipe:
seq 2000 |
parallel --pipe -N1 --jobs 10 'read test; echo $test;'
But if you really just need it for a variable, I would do one of these:
seq 2000 | parallel --jobs 10 echo
seq 2000 | parallel --jobs 10 echo {}
seq 2000 | parallel --jobs 10 'test={}; echo $test'
I encourage you to spend 20 minutes reading chapters 1+2 of https://doi.org/10.5281/zenodo.1146014. Your command line will love you for it.
Using xargs instead of parallel while still using a shell (instead of starting up a new copy of the /bin/echo executable per line to run) would look like:
seq 2000 | xargs -P 10 \
sh -c 'for arg in "$@"; do echo "hello world $arg"; done' _
This is likely to be faster than the existing answer by Cyrus, because starting executables takes time. Even though starting a new copy of /bin/sh takes longer than starting a copy of /bin/echo, this version doesn't use -I {}, so it can pass many arguments to each copy of /bin/sh, amortizing that startup cost over more numbers; and that way we're able to use the echo built into sh, instead of the separate echo executable.
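The batching is easy to observe: with no -n or -I limit, xargs packs as many numbers as fit into a single sh invocation (a runnable sketch, with 20 standing in for 2000):

```shell
# Each sh invocation reports how many arguments it received;
# 20 short numbers easily fit into one invocation.
seq 20 | xargs sh -c 'echo "one sh invocation got $# args"' _
# prints: one sh invocation got 20 args
```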

Running programs in parallel using xargs

I currently have the following script.
#!/bin/bash
# script.sh
for i in {0..99}; do
script-to-run.sh input/ output/ $i
done
I wish to run it in parallel using xargs. I have tried
script.sh | xargs -P8
But doing the above only executes one command at a time. No luck with -n8 either.
Adding & at the end of the line inside the loop would start all of the jobs at once. How do I run only 8 at a time, up to 100 in total?
From the xargs man page:
This manual page documents the GNU version of xargs. xargs reads items
from the standard input, delimited by blanks (which can be protected
with double or single quotes or a backslash) or newlines, and executes
the command (default is /bin/echo) one or more times with any initial-
arguments followed by items read from standard input. Blank lines on
the standard input are ignored.
Which means that, for your example, xargs waits and collects all of the output from your script and then runs echo <that output>. Not exactly useful, nor what you wanted.
The -n argument is how many items from the input to use with each command that gets run (nothing, by itself, about parallelism here).
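A quick illustration of -n on its own, with echo as the command:

```shell
# -n 3 hands each echo invocation at most three arguments:
seq 7 | xargs -n 3 echo
# prints:
#   1 2 3
#   4 5 6
#   7
```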
To do what you want with xargs you would need to do something more like this (untested):
printf %s\\n {0..99} | xargs -n 1 -P 8 script-to-run.sh input/ output/
Which breaks down like this.
printf %s\\n {0..99} - Print one number per line, from 0 to 99.
Run xargs
taking at most one argument per run command line
and run up to eight processes at a time
With GNU Parallel you would do:
parallel script-to-run.sh input/ output/ {} ::: {0..99}
Add in -P8 if you do not want to run one job per CPU core.
Unlike xargs, it will do The Right Thing even if the input contains spaces, ', or " (not the case here, though). It also makes sure the output from different jobs is not mixed together, so if you use the output you are guaranteed not to get half a line from two different jobs.
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process whenever one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
$ (wget -O - pi.dk/3 || lynx -source pi.dk/3 || curl pi.dk/3/ || \
fetch -o - http://pi.dk/3 ) > install.sh
$ sha1sum install.sh | grep 883c667e01eed62f975ad28b6d50e22a
12345678 883c667e 01eed62f 975ad28b 6d50e22a
$ md5sum install.sh | grep cc21b4c943fd03e93ae1ae49e28573c0
cc21b4c9 43fd03e9 3ae1ae49 e28573c0
$ sha512sum install.sh | grep da012ec113b49a54e705f86d51e784ebced224fdf
79945d9d 250b42a4 2067bb00 99da012e c113b49a 54e705f8 6d51e784 ebced224
fdff3f52 ca588d64 e75f6033 61bd543f d631f592 2f87ceb2 ab034149 6df84a35
$ bash install.sh
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
You can use this simple one-line command:
seq 0 99 | xargs -n 1 -P 8 script-to-run.sh input/ output/
Here's an example of running commands in parallel in conjunction with find:
find -name "*.wav" -print0 | xargs -0 -t -I % -P $(nproc) flac %
-print0 terminates filenames with a null byte rather than a newline, so we can use -0 in xargs to prevent filenames with spaces from being treated as two separate arguments.
-t means verbose, makes xargs print every command it's executing, can be useful, remove if not needed.
-I % means replace occurrences of % in the command with arguments read from standard input.
-P $(nproc) means run a maximum of nproc instances of our command in parallel (nproc prints the number of available processing units).
flac % is our command, the -I % from earlier means this will become flac foo.wav
See also: Manual for xargs(1)
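A runnable sketch of the -print0 / -0 pairing, using printf as a stand-in for flac (so no encoder is needed) and {} as the replacement token so it cannot collide with printf's % format characters:

```shell
# Create two filenames containing spaces in a throwaway directory,
# then show that the null-delimited pipeline keeps each name whole.
tmp=$(mktemp -d)
touch "$tmp/a b.wav" "$tmp/c d.wav"
find "$tmp" -name '*.wav' -print0 |
  xargs -0 -P 2 -I {} printf 'would encode: %s\n' {}
rm -r "$tmp"
```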

Running bash script in parallel

I have a very simple command that I would like to execute in parallel rather than sequential.
for i in ../data/*; do ./run.sh $i; done
run.sh processes the input files from the ../data directory and I would like to perform this process all at the same time using a shell script rather than a Python program or something like that. Is there a way to do this using GNU Parallel?
You can try this:
shopt -s nullglob
FILES=(../data/*)
[[ ${#FILES[@]} -gt 0 ]] && printf '%s\0' "${FILES[@]}" | parallel -0 --jobs 2 ./run.sh
I have not used GNU Parallel but you can use & to run your script in the background. Add a wait (optional) later if you want to wait for all the scripts to finish.
for i in ../data/*; do ./run.sh $i & done
# Below wait command is optional
wait
echo "All scripts executed"
You can try this:
find ../data -maxdepth 1 -name '[^.]*' -print0 | parallel -0 --jobs 2 ./run.sh
The -name argument of the find command is needed because you used shell globbing (../data/*) in your example, and so we need to ignore files starting with a dot.

bash script only starts new jobs when old batch is finished

I am trying to run 24 versions of the same code on an 8-core machine. The code takes many, many hours to run and I only want to run 8 at a time, so I was wondering if it is possible to write a bash script that runs 8 and then, when those are complete, immediately starts the next 8, and so on?
I basically don't want all 24 to start and then run incredibly slowly!
Thanks,
Jack
EDIT 1: (More details on the run)
The code runs with the following command:
nohup ./MyCode MyInputFile 2> Myoutput
You could use gnu parallel
seq 1 24 | parallel -P 8 ./myscript
Or with xargs:
seq 1 24 | xargs -l -P 8 ./myscript
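With echo standing in for ./myscript, the effect of -l (one invocation per input line) is easy to see in a runnable sketch:

```shell
# -l runs the command once per input line; -P 8 allows up to
# eight of those invocations to run concurrently (so output
# order is not guaranteed).
seq 6 | xargs -l -P 8 echo run
```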
Update:
If you want to run the script with Myinput1 Myinput2 Myinput3 .. as parameters you can do
find . -name 'Myinput*' -print0 | parallel -0 -P 8 ./myscript {1}
or with your command:
find . -name 'Myinput*' -print0 | parallel -0 -P 8 nohup ./myscript {1} 2> Myoutput
yet another way :
for f1 in {1..3};do
for f2 in {1..8};do
echo "$f1,$f2;"
nohup ./MyCode MyInputFile 2> Myoutput &
done
wait
done

How to correctly wrap multiple command calls in bash?

My problem can be summed up by making this simple command work:
nice -n 10 "ls|xargs -I% echo \"%\""
Which fails :
nice: ls|xargs -I% echo "%": No such file or directory
Removing the quotes makes it work, but my point is to wrap multiple quoted commands into one, to do something more complex like:
ftphost="192.168.1.1"
dirinputtopush="/tmp/archivedir/"
ftpoutputdir="mydir/"
nice -n 19 ls $dirinputtopush | xargs -I% "lftp $ftphost -e \"mirror -R $dirinputtopush% $ftpoutputdirrecent ;quit\"; sleep 10"
Try using nice -n 10 bash -c 'your; commands | or_complex pipelines' as command. This way bash is the binary and the string after -c contains a sequence interpreted by bash so it can contain pipelines, loops etc. Watch out for proper quoting. You need to do it this way because nice expects a binary, not expressions interpreted by the shell. In contrast, shell builtins such as time (but not /usr/bin/time which is a separate binary) will accept shell expressions as the command to execute. They can because they're built into the shell. nice is not, so it requires a binary to execute.
Children inherit nice value:
nice -n 10 bash -c 'ls | xargs -I% echo %'
Nice each command separately:
nice -n 10 ls | nice -n 10 xargs -I% echo %
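That inheritance is easy to verify, since nice with no command prints the current niceness:

```shell
# The child bash (and anything it spawns) inherits the parent's
# nice value: the inner "nice" reports 10 above the shell's
# starting niceness (usually 0, so it prints 10).
base=$(nice)
nice -n 10 bash -c 'nice'
```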

Resources