Parallel nested for loop in bash - bash

I am trying to run a c executable through bash. The executable will take a different argument in each iteration, and I want to do it in parallel since I have 12 cores available.
I tried
w=1;
for i in {1..100}
do
l=$(($i-1));
for j in {12*l..12*i}
do
./run $w/100 > "$w"_out &
done
expr=$w % 12;
if ["$expr" -eq "0"]
then wait;
fi;
done
run is the c executable. I want to run it with increasing argument w in each step, and I want to wait until all processes are done if 12 of the cores are in use. SO basically, I will run 12 executables at the same time, then wait until they are completed, and then move to the next 12.
Hope I made my point clear.
Cheers.

Use gnu parallel instead:
parallel ./myscript {1} ::: {1..100}
You can specify the number of parallel processes with the -P option, but it defaults to the number of cores in the system.
You can also specify -k to keep the output order and redirect the file.
To redirect the output to individual files, you can specify the output redirection, but you have to quote it, so that it is not parsed by the shell. For example:
parallel ./run {1} '>' {1}_out ::: {1..10}
is equivalent to running ./run 1 > 1_out to ./run 10 > 10_out

Related

parallel execution of script on cluster for multiple inputs

I have a script pytonscript.py that I want to run on 500 samples. I have 50 CPUs available and want to run the script in parallel using 1 CPU for each sample, so that 50 samples are constantly running with 1 CPU each. Any ideas how to set this up without typing 500 lines with the different inputs? I know how to make a loop for each sample, but not how to make 50 samples running in parallel. I guess GNU parallel is a way?
Input samples in folder samples:
sample1
sample2
sample2
...
sample500
pytonscript.py -i samples/sample1.sam.bz2 -o output_folder
What about GNU xargs?
printf '%s\0' samples/sample*.sam.bz |
xargs -0L1 -P 50 pytonscript.py -o output_dir -i
This launches a new instance of the python script for each file, concurrently, maintaining a maximum of 50 at once.
If the wildcard glob expansion isn't specific enough, you can use bash's extglob: shopt -s exglob; # samples/sample+([0-9]).sam.bz

For loop in parallel

Is there a quick, easy, and efficient way of running iterations in this for loop in parallel?
for i in `seq 1 5000`; do
repid="$(printf "%05d" "$i")"
inp="${repid}.inp"
out="${repid}.out"
/command "$inp" "$out"
done
If you want to take advantage of all your lovely CPU cores that you paid Intel so handsomely for, turn to GNU Parallel:
seq -f "%05g" 5000 | parallel -k echo command {}.inp {}.out
If you like the look of that, run it again without the -k (which keeps the output in order) and without the echo. You may need to enclose the command in single quotes:
seq -f "%05g" 5000 | parallel '/command {}.inp {}.out'
It will run 1 instance per CPU core in parallel, but, if you want say 32 in parallel, use:
seq ... | parallel -j 32 ...
If you want an "estimated time of arrival", use:
parallel --eta ...
If you want a progress meter, use:
parallel --progress ...
If you have bash version 4+, it can zero-pad brace expansions. And if your ARGMAX is big enough, so you can more simply use:
parallel 'echo command {}.inp {}.out' ::: {00001..05000}
You can check your ARGMAX with:
sysctl -a kern.argmax
and it tells you how many bytes long your parameter list can be. You are going to need 5,000 numbers at 5 digits plus a space each, so 30,000 minimum.
If you are on macOS, you can install GNU Parallel with homebrew:
brew install parallel
for i in `seq 1 5000`; do
repid="$(printf "%05d" "$i")"
inp="${repid}.inp"
out="${repid}.out"
/command "$inp" "$out" &
done

Generating sequences of numbers and characters with bash

I have written a script that accept two files as input. I want to run all in parallel at the same time on different CPUs.
inputs:
x00.A x00.B
x01.A x01.B
...
x30.A x30.B
instead of running 30 times:
./script x00.A x00.B
./script x01.A x01.B
...
./script x30.A x30.B
I wanted to use paste and seq to generate and send them to the script.
paste & seq | xargs -n1 -P 30 ./script
But I do not know how to combine letters and numbers using paste and seq commands.
for num in $(seq -f %02.f 0 30); do
./script x$num.A x$num.B &
done
wait
Although I personally prefer to not use GNU seq or BSD jot but (ksh/bash) builtins:
num=-1; while (( ++num <= 30 )); do
./script x$num.A x$num.B &
done
wait
The final wait is just needed to make sure they all finish, after having run spread across your available CPU cores in the background. So, if you need the output of ./script, you must keep the wait.
Putting them into the background with & is the simplest way for parallelism. If you really want to exercise any sort of control over lots of backgrounded jobs like that, you will need some sort of framework like GNU Parallel instead.
You can use pure bash for generating the sequence:
printf "%s %s\n" x{00..30}.{A..B} | xargs -n1 -P 30 ./script
Happy holidays!

running parallel bash background

I have a script that I want to run on a number of files
my_script file_name
but I have many so I have written some code that is meant to process multiple at the same time by first creating 5 'equal' lists of the files I want to process followed by this
my_function() {
while read i; do
my_script $i
done < $1
}
my_function list_1 &
my_function list_2 &
my_function list_3 &
my_function list_4 &
my_function list_5 &
wait
This works for the first file in each list but then finishes. If I change the function to a simple echo it works fine
my_function() {
while read i; do
echo $i
done < $1
}
it prints all the files in each list as I would expect.
Why does it not work if I use 'my_script'?? And is there a 'nicer' way of doing this?
GNU Parallel is made for this:
parallel my_script ::: files*
You can find more about GNU Parallel at: http://www.gnu.org/s/parallel/
You can install GNU Parallel in just 10 seconds with:
wget -O - pi.dk/3 | sh
Watch the intro video on http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Edit:
If the reason for not installing GNU Parallel is not covered by
http://oletange.blogspot.dk/2013/04/why-not-install-gnu-parallel.html
would you then be kind to elaborate why?
There must be an exit statement in my_script. Replace the exit statement(s) with return statement(s).
Another thing to check is the possibility that the same file is contained in more than one list. There may be contention issues in processing - the file is already being processed and another process attempts to open the same file. Check for any duplicate files with-:
sort file_[1-5] | uniq -d
As an alternative to GNU parallel, there is https://github.com/mauvilsa/run_parallel which is simply a function in bash, so it does not require root access or compiling.
To use it, first source the file
source run_parallel.inc.sh
Then in your example, execute it as
run_parallel -T 5 my_function 'list_{%}'
It could also do the splitting of the lists for you as
run_parallel -T 5 -l full_list -n split my_function '{#}'
To see the usage explanation and some examples, execute run_parallel without any arguments.

How to feed a large array of commands to GNU Parallel?

I'm evaluating if GNU Parallel can be used to search files stored on a system in parallel. There can be only one file for each day of year (doy) on the system (so a maximum of 366 files per year). Let's say there are 3660 files on the system (about 10 years worth of data). The system could be a multi-CPU multi-core Linux or a multi-CPU Solaris.
I'm storing the search commands to run on the files in an array (one command per file). And this is what I'm doing right now (using bash) but then I have no control on how many searches to start in parallel (definitely don't want to start all 3660 searches at once):
#!/usr/bin/env bash
declare -a cmds
declare -i cmd_ctr=0
while [[ <condition> ]]; do
if [[ -s $cur_archive_path/log.${doy_ctr} ]]; then
cmds[$cmd_ctr]="<cmd_to_run>"
let cmd_ctr++
fi
done
declare -i arr_len=${#cmds[#]}
for (( i=0; i<${arr_len}; i++ ));
do
# Get the command and run it in background
eval ${cmds[$i]} &
done
wait
If I were to use parallel (which will automatically figure out the max. CPUs/cores and start only so many searches in parallel), how can I reuse the array cmds with parallel and rewrite the above code? The other alternative is to write all commands to a file and then do cat cmd_file | parallel
https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Using-shell-variables says:
parallel echo ::: "${V[#]}"
You do not want the echo, so:
parallel ::: "${cmds[#]}"
If you do not need $cmds for anything else, then use 'sem' (which is an alias for parallel --semaphore) https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Working-as-mutex-and-counting-semaphore
while [[ <condition> ]]; do
if [[ -s $cur_archive_path/log.${doy_ctr} ]]; then
sem -j+0 <cmd_to_run>
fi
done
sem --wait
You have not described what <condition> might be. If you are simply doing a something like a for-loop you could replace the whole script with:
parallel 'if [ -s {} ] ; then cmd_to_run {}; fi' ::: $cur_archive_path/log.{1..3660}
(based on https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Composed-commands).

Resources