I'm evaluating whether GNU Parallel can be used to search files stored on a system in parallel. There can be only one file for each day of the year (doy) on the system (so a maximum of 366 files per year). Let's say there are 3660 files on the system (about 10 years' worth of data). The system could be a multi-CPU, multi-core Linux box or a multi-CPU Solaris box.
I'm storing the search commands to run on the files in an array (one command per file). This is what I'm doing right now (using bash), but it gives me no control over how many searches start in parallel (I definitely don't want to start all 3660 searches at once):
#!/usr/bin/env bash
declare -a cmds
declare -i cmd_ctr=0
while [[ <condition> ]]; do
  if [[ -s $cur_archive_path/log.${doy_ctr} ]]; then
    cmds[$cmd_ctr]="<cmd_to_run>"
    let cmd_ctr++
  fi
done

declare -i arr_len=${#cmds[@]}
for (( i=0; i<arr_len; i++ )); do
  # Get the command and run it in the background
  eval "${cmds[$i]}" &
done
wait
If I were to use parallel (which will automatically figure out the maximum number of CPUs/cores and start only that many searches in parallel), how can I reuse the array cmds with parallel and rewrite the above code? The other alternative is to write all the commands to a file and then do cat cmd_file | parallel.
https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Using-shell-variables says:
parallel echo ::: "${V[@]}"
You do not want the echo, so:
parallel ::: "${cmds[@]}"
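Putting it together, a minimal sketch of the rewritten script, keeping the original <condition> and <cmd_to_run> placeholders:
#!/usr/bin/env bash
declare -a cmds
while [[ <condition> ]]; do
  if [[ -s $cur_archive_path/log.${doy_ctr} ]]; then
    cmds+=( "<cmd_to_run>" )   # += appends, so no manual counter is needed
  fi
done
# parallel defaults to one job per CPU core; override with -j N if needed.
parallel ::: "${cmds[@]}"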
If you do not need $cmds for anything else, then use 'sem' (which is an alias for parallel --semaphore) https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Working-as-mutex-and-counting-semaphore
while [[ <condition> ]]; do
  if [[ -s $cur_archive_path/log.${doy_ctr} ]]; then
    sem -j+0 <cmd_to_run>
  fi
done
sem --wait
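For illustration, a hedged sketch with a concrete search standing in for <cmd_to_run> (grep and $pattern are assumptions, not from the question):
for doy_ctr in {1..3660}; do
  if [[ -s $cur_archive_path/log.${doy_ctr} ]]; then
    # sem queues the job; -j+0 runs at most one job per CPU core.
    sem -j+0 grep -F "$pattern" "$cur_archive_path/log.${doy_ctr}"
  fi
done
sem --wait   # block until every queued search has finished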
You have not described what <condition> might be. If you are simply doing something like a for-loop, you could replace the whole script with:
parallel 'if [ -s {} ] ; then cmd_to_run {}; fi' ::: $cur_archive_path/log.{1..3660}
(based on https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Composed-commands).
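Note that log.{1..3660} is brace-expanded by the shell before parallel starts, so parallel receives one argument per possible file; the -s test then skips days that have no log. Adding --dry-run previews the generated commands without running anything:
parallel --dry-run 'if [ -s {} ] ; then cmd_to_run {}; fi' ::: $cur_archive_path/log.{1..3660}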
Related
I have a set of scripts, e.g.
01_some_stuff1
02_some_stuff2
03_some_stuff3a
03_some_stuff3b
04_some_stuff4a
04_some_stuff4b
These scripts should run ordered by their number and scripts with the same number should run in parallel.
My first idea was to iterate over the possible numbers:
for n in $(seq -f %02.0f 0 99); do
  for s in "${n}_*"; do
    export CURRENT_JOB="${s}"
    "${s}" &
  done
  wait
done
Is this a safe method? Is there a more elegant solution that also allows setting a different environment for the inner loop elements?
You could use GNU Parallel like this:
#!/bin/bash
# Don't barf if there are no matching files when globbing
shopt -s nullglob
for n in $(printf "%02d " {1..4}); do
    # Get list (array) of matching scripts
    scripts=( ${n}_* )
    if [ ${#scripts[@]} -gt 0 ]; then
        parallel --dry-run -k 'CURRENT_JOB={} ./{}' ::: "${scripts[@]}"
    fi
    echo barrier
done
Sample Output
CURRENT_JOB=01_some_stuff1 ./01_some_stuff1
barrier
CURRENT_JOB=02_some_stuff2 ./02_some_stuff2
barrier
CURRENT_JOB=03_some_stuff3a ./03_some_stuff3a
CURRENT_JOB=03_some_stuff3b ./03_some_stuff3b
barrier
CURRENT_JOB=04_some_stuff4a ./04_some_stuff4a
CURRENT_JOB=04_some_stuff4b ./04_some_stuff4b
barrier
Remove the echo barrier and --dry-run to actually run it properly.
The only real change you need is to avoid quoting the * in your pattern. If you are using bash 4.0 or later, you can use brace expansion to eliminate the dependency on seq.
# for n in $(seq -f %02.0f 0 99); do
for n in {00..99}; do
  for s in "${n}"_*; do
    export CURRENT_JOB="${s}"
    "${s}" &
  done
  wait
done
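As for setting a different environment for the inner loop elements: in bash, variable assignments placed directly before a command apply only to that command's environment, so each script can get its own settings without export. A sketch (LOG_LEVEL is a made-up example variable, not from the question):
for n in {00..99}; do
  for s in "${n}"_*; do
    # Per-job environment: these assignments affect only "${s}", not the shell.
    CURRENT_JOB="${s}" LOG_LEVEL=debug "${s}" &
  done
  wait
done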
I have a directory and inside I have two file types, *.sai and *.fastq, and I want to use both variables in one shell for loop:
for j in *sai *fastq
do
  bwa samse $j $j > ${j%.sai}.sam
done
After the do I want to load the corresponding *.sai and *.fastq data into the program (bwa samse). Could you please help me with the syntax?
EXAMPLE:
In one directory are xx.fastq, xx.sai, yy.fastq and yy.sai, and the program bwa samse needs to process the two corresponding files at a time: bwa samse xx.fastq xx.sai...
Many thanks for any ideas.
Try doing this with bash parameter expansion:
for j in *.sai; do
  [[ -s ${j%.sai}.fastq ]] &&
    bwa samse "$j" "${j%.sai}.fastq" > "${j%.sai}.sam"
done
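The ${j%.sai} expansion strips the trailing .sai, so both companion file names can be derived from the loop variable; for example:
j=xx.sai
echo "${j%.sai}.fastq"   # prints: xx.fastq
echo "${j%.sai}.sam"     # prints: xx.sam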
and please, stop killing kitties by parsing ls output (not you, Incorigible).
Try not to use ls to feed the loop. Use brace expansion to include only the *.sai and *.fastq files in your loop:
for j in ./*.{sai,fastq}
do
    ## do what you need to the *.sai & *.fastq files
done
You can also provide a path variable:
mypath=/path/to/files
for j in "${mypath}"/*.{sai,fastq}
(snip)
NOTE: No clue what bwa samse $j $j > ${j%.*}.sam does. Explain how you need to process the files and I can help further.
If there is a 1-to-1 relationship (matching .sai and .fastq files), then just:
for j in ./*.sai
do
    fname="${j%.*}"   # remove the extension ($fname is the filename w/o ext)
    ## do what you need to the *.sai & *.fastq files
    # bwa samse "${fname}.sai" "${fname}.fastq" whatever else
done
Using GNU Parallel it looks like this:
parallel bwa samse ref.fasta {} {.}.fastq '>' {.}.sam ::: *.sai
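Here {} is each input file and {.} is the same name with its extension removed; the '>' is quoted so the redirection happens inside the job rather than in the shell that launches parallel. To preview the generated commands first (xx.sai and yy.sai are assumed example files):
parallel --dry-run bwa samse ref.fasta {} {.}.fastq '>' {.}.sam ::: *.sai
which would print something like:
bwa samse ref.fasta xx.sai xx.fastq > xx.sam
bwa samse ref.fasta yy.sai yy.fastq > yy.sam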
GNU Parallel is a general parallelizer that makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU, but if the jobs take different amounts of time, some CPUs finish early and sit idle. GNU Parallel instead spawns a new process whenever one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
(edited to reflect the comments: using ls to list filenames isn't necessary)
To strip the file extension you'll need to use ${j%.*}, which retains all characters before the last dot:
for j in *.sai *.fastq
do
  bwa samse "$j" "$j" > "${j%.*}.sam"
done
I am trying to run a C executable through bash. The executable takes a different argument in each iteration, and I want to run the iterations in parallel since I have 12 cores available.
I tried
w=1;
for i in {1..100}
do
  l=$(($i-1));
  for j in {12*l..12*i}
  do
    ./run $w/100 > "$w"_out &
  done
  expr=$w % 12;
  if ["$expr" -eq "0"]
  then wait;
  fi;
done
run is the C executable. I want to run it with an increasing argument w in each step, and I want to wait until all processes are done whenever 12 of the cores are in use. So basically, I will run 12 executables at the same time, wait until they are completed, and then move on to the next 12.
Hope I made my point clear.
Cheers.
Use GNU Parallel instead:
parallel ./myscript {1} ::: {1..100}
You can specify the number of parallel processes with the -P option; it defaults to the number of cores in the system.
You can also specify -k to keep the output in the same order as the input.
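For example, to cap it at the asker's 12 cores and keep ordered output (a sketch):
parallel -k -P 12 ./run {1} ::: {1..100}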
To redirect the output to individual files, you can specify the output redirection in the command, but you have to quote it so that it is not parsed by the shell. For example:
parallel ./run {1} '>' {1}_out ::: {1..10}
is equivalent to running ./run 1 > 1_out through ./run 10 > 10_out.
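Since there is only one input source here, {1} and {} are interchangeable, and --dry-run shows the generated commands without executing them:
parallel --dry-run ./run {} '>' {}_out ::: {1..10}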
I want to loop over these kinds of files, where files with the same Sample_ID have to be used together:
Sample_51770BL1_R1.fastq.gz
Sample_51770BL1_R2.fastq.gz
Sample_52412_R1.fastq.gz
Sample_52412_R2.fastq.gz
e.g. Sample_51770BL1_R1.fastq.gz and Sample_51770BL1_R2.fastq.gz are used together in one command to create an output.
Similarly, Sample_52412_R1.fastq.gz and Sample_52412_R2.fastq.gz are used together to create output.
I want to write a for loop in bash to iterate over and create output.
sourcedir=/sourcepath/
destdir=/destinationpath/
bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta Sample_52412_R1.fastq.gz Sample_52412_R2.fastq.gz > $destdir/Sample_52412_R1_R2.sam
How should I pattern-match the file names Sample_ID_R1 and Sample_ID_R2 so they can be used in one command?
Thanks,
for fname in *_R1.fastq.gz
do
    base=${fname%_R1*}
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" > "$destdir/${base}_R1_R2.sam"
done
In the comments, you ask about running several, but not too many, jobs in parallel. Below is my first stab at that:
#!/bin/bash
# Limit background jobs to no more than $maxproc at once.
maxproc=3
for fname in *_R1.fastq.gz
do
    while [ "$(jobs | wc -l)" -ge "$maxproc" ]
    do
        sleep 1
    done
    base=${fname%_R1*}
    echo "starting new job with ongoing=$(jobs | wc -l)"
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" > "$destdir/${base}_R1_R2.sam" &
done
The optimal value of maxproc will depend on how many processors your PC has. You may need to experiment to find what works best.
Note that the above script uses jobs, which is a bash builtin. Thus, it has to be run under bash, not dash, which is the default for scripts under Debian-like distributions.
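If GNU Parallel is installed, the same throttling can be written more compactly. A hedged sketch under the same file-naming assumptions; --dry-run previews the commands and should be removed to actually run them:
# Derive each sample's base name, then let GNU Parallel cap concurrency at 3 jobs.
for fname in *_R1.fastq.gz; do
    printf '%s\n' "${fname%_R1.fastq.gz}"
done | parallel --dry-run -j 3 \
    "bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta {}_R1.fastq.gz {}_R2.fastq.gz > $destdir/{}_R1_R2.sam"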
I have a script that I want to run on a number of files:
my_script file_name
but I have many, so I have written some code that is meant to process multiple files at the same time by first creating 5 'equal' lists of the files I want to process, followed by this:
my_function() {
  while read i; do
    my_script $i
  done < $1
}
my_function list_1 &
my_function list_2 &
my_function list_3 &
my_function list_4 &
my_function list_5 &
wait
This works for the first file in each list but then finishes. If I change the function to a simple echo, it works fine:
my_function() {
  while read i; do
    echo $i
  done < $1
}
it prints all the files in each list, as I would expect.
Why does it not work if I use my_script? And is there a nicer way of doing this?
GNU Parallel is made for this:
parallel my_script ::: files*
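If the five pre-split lists already exist, they can be fed straight in instead of defining my_function (a sketch):
# One my_script per line across all lists; defaults to one job per CPU core.
cat list_1 list_2 list_3 list_4 list_5 | parallel my_script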
You can find more about GNU Parallel at: http://www.gnu.org/s/parallel/
You can install GNU Parallel in just 10 seconds with:
wget -O - pi.dk/3 | sh
Watch the intro video on http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Edit:
If the reason for not installing GNU Parallel is not covered by http://oletange.blogspot.dk/2013/04/why-not-install-gnu-parallel.html, would you then be kind enough to elaborate on why?
There must be an exit statement in my_script. Replace the exit statement(s) with return statement(s).
Another thing to check is the possibility that the same file is contained in more than one list. There may be contention issues in processing: the file is already being processed while another process attempts to open it. Check for any duplicate files with:
sort list_[1-5] | uniq -d
As an alternative to GNU Parallel, there is https://github.com/mauvilsa/run_parallel, which is simply a function in bash, so it does not require root access or compiling.
To use it, first source the file:
source run_parallel.inc.sh
Then, in your example, execute it as:
run_parallel -T 5 my_function 'list_{%}'
It can also do the splitting of the lists for you:
run_parallel -T 5 -l full_list -n split my_function '{#}'
To see the usage explanation and some examples, execute run_parallel without any arguments.