Executing commands in parallel and waiting - bash

I have a set of scripts, e.g.
01_some_stuff1
02_some_stuff2
03_some_stuff3a
03_some_stuff3b
04_some_stuff4a
04_some_stuff4b
These scripts should run ordered by their number, and scripts with the same number should run in parallel.
My first idea was to iterate over the possible numbers:
for n in $(seq -f %02.0f 0 99); do
    for s in "${n}_*"; do
        export CURRENT_JOB="${s}"
        "${s}" &
    done
    wait
done
Is this a safe method? Is there a more elegant solution that also allows setting a different environment for the inner-loop elements?

You could use GNU Parallel like this:
#!/bin/bash
# Don't barf if no matching files when globbing
shopt -s nullglob
for n in $(printf "%02d " {1..4}); do
    # Get list (array) of matching scripts
    scripts=( ${n}_* )
    if [ ${#scripts[@]} -gt 0 ]; then
        parallel --dry-run -k 'CURRENT_JOB={} ./{}' ::: "${scripts[@]}"
    fi
    echo barrier
done
Sample Output
CURRENT_JOB=01_some_stuff1 ./01_some_stuff1
barrier
CURRENT_JOB=02_some_stuff2 ./02_some_stuff2
barrier
CURRENT_JOB=03_some_stuff3a ./03_some_stuff3a
CURRENT_JOB=03_some_stuff3b ./03_some_stuff3b
CURRENT_JOB=03_some_stuff3c ./03_some_stuff3c
barrier
CURRENT_JOB=04_some_stuff4a ./04_some_stuff4a
CURRENT_JOB=04_some_stuff4b ./04_some_stuff4b
barrier
Remove the echo barrier and --dry-run to actually run it properly.
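The question also asked about setting a different environment for each job. Since parallel runs each command line through a shell, the same VAR=value prefix used for CURRENT_JOB extends to any extra per-job variables; a minimal sketch (OTHER_VAR is a made-up name, not something the scripts need):
parallel 'CURRENT_JOB={} OTHER_VAR=some_value ./{}' ::: "${scripts[@]}"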

The only real change you need is to avoid quoting the * in your pattern. If you are using bash 4.0 or later, you can use brace expansion to eliminate the dependency on seq.
# for n in $(seq -f %02.0f 0 99); do
for n in {00..99}; do
    for s in "${n}"_*; do
        export CURRENT_JOB="${s}"
        "${s}" &
    done
    wait
done
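One caveat: if nothing matches a given prefix, the unexpanded pattern itself is handed to the loop and the shell will try to execute it. Borrowing the nullglob option from the answer above avoids that; a minimal sketch:
shopt -s nullglob              # empty matches expand to nothing instead of the literal pattern
for n in {00..99}; do
    for s in "${n}"_*; do
        export CURRENT_JOB="${s}"
        ./"${s}" &             # ./ needed unless the scripts are on your PATH
    done
    wait
done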

Related

loop within a loop vcftools bash

I am trying to use the vcftools package to calculate Weir and Cockerham's Fst. I would like to loop over two pairs of populations in the first instance, and then loop these populations across all variants from the 1000 Genomes project; each chromosome is a separate VCF file. For example, for pop1 vs pop2 and for pop3 vs pop4, calculate Fst for chromosomes 1-10. Each population file, for example LWKfile, contains a list of individuals that belong to that population.
I have attempted:
for population in LWK_GBR YRI_FIN; do
    firstpop=$(echo $population | cut -d '_' -f1)
    secondpop=$(echo $population | cut -d '_' -f2)
    for filename in *.vcf.gz; do
        vcftools --gzvcf ${filename} \
            --weir-fst-pop /outdir/${firstpop}file \
            --weir-fst-pop /outdir/${secondpop}file \
            --out /out/${population}_${filename}
    done
done
However, this does not loop through all the files and seems to get stuck on chromosome 10. Is there a more efficient way to do this in bash, as I am concerned the loop within a loop will be too slow?
"However this does not loop through all the files and seems to get stuck on chromosome 10."
Are you sure that it is the for filename in *.vcf.gz loop that fails to get through all the files? Try putting an echo before vcftools to see whether the loop itself gets stuck.
"I am concerned the loop within loop will be too slow."
You need to know what is actually taking the time in order to make the right choice. For example, if vcftools is the slow part, you may not need to wait for each command to finish and could run them asynchronously. If there are too many files for one loop, you should also consider processing some of them in parallel.
Also, you seem to repeat the loop over all the .vcf.gz files twice (once per population pair); it would probably be slightly faster to swap your two loops.
Here is an example with parallel and asynchronous processing in bash:
#!/bin/bash

MAX_PARALLEL_PIDS=4   # adjust to your machine's capacity (available CPUs, etc.; it could be computed dynamically)

declare -a POPS
declare -a PIDS

POPS=("LWK_GBR" "YRI_FIN")

# your heavy processing in a function
process() {
    pop="${1}"
    filename="${2}"
    firstpop="${pop%%_*}"   # no need to call an external program here
    secondpop="${pop#*_}"   # same here
    vcftools --gzvcf "${filename}" \
        --weir-fst-pop "/outdir/${firstpop}file" \
        --weir-fst-pop "/outdir/${secondpop}file" \
        --out "/out/${pop}_${filename}"
}

# a function that waits for all processes once your "thread pool" has reached its limit
wait_for_pids() {
    for pid in "${PIDS[@]}"; do
        [[ $pid =~ ^[0-9]+ ]] && wait $pid
    done
    unset PIDS
}

i=0
for filename in *.vcf.gz; do
    if [[ $i -ge $MAX_PARALLEL_PIDS ]]; then
        i=0
        wait_for_pids
    fi
    for population in "${POPS[@]}"; do
        process "${population}" "${filename}" &   # don't wait for it to finish here
        PIDS[$i]=$!
        (( i++ ))
    done
done

# at the end, wait for the remaining processes
wait_for_pids
N.B.: Apart from variables inside [[ ]] conditions, you should pay attention to quoting any variables that can contain spaces, especially file names. It will break otherwise.
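For example, a hypothetical file name containing a space shows the breakage:
filename="chr 10.vcf.gz"       # hypothetical name with a space
vcftools --gzvcf $filename     # word-splits: vcftools sees "chr" and "10.vcf.gz"
vcftools --gzvcf "$filename"   # quoted: vcftools sees the single file name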

Tricky bash to try to run program with different params

I want to run a program multiple times with different parameters, and then send the results to files that use the parameters in their names. Here is what I've come up with:
#!/bin/bash
for i in 'seq 1 5';
do
    for j in 'seq 1 8';
    do
        for m in 'seq 1 8';
        do
            ./program -s i -v j -k m ../input_files/input_file1.txt < results_ijm.txt
        done
    done
done
This doesn't work. It says "no file results_ijm.txt"... I know that; I want it to create this file implicitly.
I also doubt it will put i, j, m into the filename correctly - how does it know whether I want the variables i, j, m or just the characters? It's ambiguous.
You must use the variables $i, $j, $m, etc. It is better to use the ((...)) construct in bash. In bash you can do:
#!/bin/bash
for ((i=1; i<=5; i++)); do
    for ((j=1; j<=8; j++)); do
        for ((m=1; m<=8; m++)); do
            ./program -s $i -v $j -k $m ../input_files/input_file1.txt > "results_${i}${j}${m}.txt"
        done
    done
done
Two problems. As I mentioned in the comments, your arrow is backwards: we want the results of the program to go from stdout to the file, so flip it around. Second, variables gain a dollar sign in front of them when they are used, so the filename won't be ambiguous.
Edited to add: a third thing: use backticks instead of single quotes around seq 1 5. You want the results of that command, not the literal text "seq 1 5". Thanks @PSkocik.
#!/bin/bash
for i in `seq 1 5`; do
    for j in `seq 1 8`; do
        for m in `seq 1 8`; do
            ./program -s $i -v $j -k $m ../input_files/input_file1.txt > results_${i}${j}${m}.txt
        done
    done
done
./program -s i -v j -k m ../input_files/input_file1.txt < results_ijm.txt
I believe the less-than symbol should be flipped to a greater-than, so that the output goes into the file instead of being read from it. I've not worked much with bash, but it seems logical.
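A minimal illustration of the two directions:
./program args > results.txt   # stdout of program is written to results.txt (created if missing)
./program args < input.txt     # program reads its stdin from input.txt (must already exist)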

For loop with an argument-based range

I want to run certain actions on a group of lexicographically named files (01-09 before 10). I have to use a rather old version of FreeBSD (7.3), so I can't use yummies like echo {01..30} or seq -w 1 30.
The only working solution I found is printf "%02d " {1..30}. However, I can't figure out why I can't use $1 and $2 instead of 1 and 30. When I run my script (bash ~/myscript.sh 1 30), printf says {1..30}: invalid number.
AFAIK, variables in bash are untyped, so why won't printf accept my argument as an integer?
Bash supports C-style for loops:
s=1
e=30
for ((i=s; i<=e; i++)); do printf "%02d " "$i"; done
The syntax you attempted doesn't work because brace expansion happens before parameter expansion, so when the shell tries to expand {$1..$2}, it's still literally {$1..$2}, not {1..30}.
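You can see the ordering for yourself; a quick demonstration:
s=1; e=30
echo {$s..$e}    # prints "{1..30}": brace expansion ran first and saw no numbers
echo {1..3}      # prints "1 2 3": literal numbers expand as expected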
The answer given by @Kent works because eval goes back to the beginning of the parsing process. I tend to suggest avoiding making habitual use of it, as eval can introduce hard-to-recognize bugs: if your command were whitelisted to be run by sudo and $1 were, say, '$(rm -rf /; echo 1)', the C-style for-loop example would safely fail, and the eval example... not so much.
Granted, 95% of the scripts you write may not be accessible to folks executing privilege escalation attacks, but the remaining 5% can really ruin one's day; following good practices 100% of the time avoids being in sloppy habits.
Thus, if one really wants to pass a range of numbers to a single command, the safe thing is to collect them in an array:
a=( )
for ((i=s; i<=e; i++)); do a+=( "$i" ); done
printf "%02d " "${a[@]}"
I guess you are looking for this trick:
#!/bin/bash
s=1
e=30
printf "%02d " $(eval echo {$s..$e})
Ok, I finally got it!
#!/bin/bash
# BSD-only iteration method
# for day in `jot $1 $2`
for ((day=$1; day<$2; day++))
do
    echo $(printf %02d $day)
done
I initially wanted to use the loop variable as a "day" in file names, but now I see that in my exact case it's easier to iterate over plain numbers (1, 2, 3, etc.) and convert them to the lexicographic form inside the loop. When using jot, remember that $1 is the count of numbers to produce and $2 is the starting point.
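If memory serves, a couple of jot invocations illustrate those argument positions:
jot 3 5            # prints 5 6 7: three numbers starting at 5
jot -w %02d 30 1   # prints 01 02 ... 30: jot can also do the zero-padding itself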

Parallel nested for loop in bash

I am trying to run a C executable from bash. The executable will take a different argument in each iteration, and I want to do it in parallel since I have 12 cores available.
I tried
w=1;
for i in {1..100}
do
    l=$(($i-1));
    for j in {12*l..12*i}
    do
        ./run $w/100 > "$w"_out &
    done
    expr=$w % 12;
    if ["$expr" -eq "0"]
    then wait;
    fi;
done
run is the C executable. I want to run it with an increasing argument w at each step, and I want to wait until all processes are done whenever 12 of the cores are in use. So basically, I will run 12 executables at the same time, wait until they are completed, and then move on to the next 12.
Hope I made my point clear.
Cheers.
Use GNU Parallel instead:
parallel ./myscript {1} ::: {1..100}
You can specify the number of parallel processes with the -P option, but it defaults to the number of cores in the system.
You can also specify -k to keep the output in order.
To redirect the output to individual files, you can specify the output redirection, but you have to quote it so that it is not parsed by the shell. For example:
parallel ./run {1} '>' {1}_out ::: {1..10}
is equivalent to running ./run 1 > 1_out through ./run 10 > 10_out
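Applied to the question (assuming the intent is one job per w from 1 to 100, and that ./run really takes the literal w/100 as its argument, as in the question's loop), a sketch:
parallel -P 12 './run {}/100 > {}_out' ::: {1..100}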

How to feed a large array of commands to GNU Parallel?

I'm evaluating whether GNU Parallel can be used to search files stored on a system in parallel. There can be only one file for each day of year (doy) on the system (so a maximum of 366 files per year). Let's say there are 3660 files on the system (about 10 years' worth of data). The system could be a multi-CPU multi-core Linux or a multi-CPU Solaris.
I'm storing the search commands to run on the files in an array (one command per file). This is what I'm doing right now (using bash), but I have no control over how many searches start in parallel (and I definitely don't want to start all 3660 searches at once):
#!/usr/bin/env bash

declare -a cmds
declare -i cmd_ctr=0

while [[ <condition> ]]; do
    if [[ -s $cur_archive_path/log.${doy_ctr} ]]; then
        cmds[$cmd_ctr]="<cmd_to_run>"
        let cmd_ctr++
    fi
done

declare -i arr_len=${#cmds[@]}
for (( i=0; i<${arr_len}; i++ )); do
    # Get the command and run it in the background
    eval ${cmds[$i]} &
done
wait
If I were to use parallel (which will automatically figure out the max CPUs/cores and start only that many searches in parallel), how can I reuse the array cmds with parallel and rewrite the above code? The other alternative is to write all the commands to a file and then do cat cmd_file | parallel.
https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Using-shell-variables says:
parallel echo ::: "${V[@]}"
You do not want the echo, so:
parallel ::: "${cmds[@]}"
If you do not need $cmds for anything else, then use sem (which is an alias for parallel --semaphore); see https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Working-as-mutex-and-counting-semaphore
while [[ <condition> ]]; do
    if [[ -s $cur_archive_path/log.${doy_ctr} ]]; then
        sem -j+0 <cmd_to_run>
    fi
done
sem --wait
You have not described what <condition> might be. If you are simply doing something like a for loop, you could replace the whole script with:
parallel 'if [ -s {} ] ; then cmd_to_run {}; fi' ::: $cur_archive_path/log.{1..3660}
(based on https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Composed-commands).
