I have a list of configuration files:
cfg1.cfg
cfg2.cfg
cfg3.cfg
cfg4.cfg
cfg5.cfg
cfg6.cfg
cfg7.cfg
...
that serve as input for two scripts:
script1.sh
script2.sh
which I run sequentially as follows:
script1.sh cfgX.cfg && script2.sh cfgX.cfg
where X=1, 2, 3, ...
These scripts are not parallelised and take a long time to run. How can I launch them in parallel, say four at a time, so I do not kill the server I run them on?
For just one script I tried a brute force approach similar to:
export COUNTER_LIMIT=4
export COUNTER=1
for each in $(ls *.cfg)
do
    INSTRUCTION="./script1.sh $each "
    if (($COUNTER >= $COUNTER_LIMIT)) ;
    then
        $INSTRUCTION &&
        export COUNTER=$(($COUNTER-$COUNTER_LIMIT));
        echo
        sleep 600s
    else
        $INSTRUCTION &
        sleep 5s
    fi
    echo $COUNTER
    export COUNTER=$(($COUNTER+1));
done
(the sleeps are because for some reason the scripts cannot be initiated at the same time...)
So, how can I make sure that the double ampersand in
script1.sh cfgX.cfg && script2.sh cfgX.cfg
doesn't block the brute-force parallelisation?
I also accept better and simpler approaches ;)
Cheers
jorge
UPDATE
I should have mentioned that the config files are not necessarily sequentially named and can have any name, I just made them like this to make the example as simple as possible.
parallel --jobs 4 \
         --load 50% \
         --bar \
         --eta "( echo 1st-for-{}; echo 2nd-for-{} )" < aListOfAdHocArguments.txt
0% 0:5=0s
1st-for-Abraca
2nd-for-Abraca
20% 1:4=0s
1st-for-Dabra
2nd-for-Dabra
40% 2:3=0s
1st-for-Hergot
2nd-for-Hergot
60% 3:2=0s
1st-for-Fagot
2nd-for-Fagot
80% 4:1=0s
100% 5:0=0s
Q : How can I launch them in parallel, say four at a time, so I do not kill the server I run them on?
A lovely task for GNU parallel.
First, let's check the localhost ecosystem (running parallel jobs over ssh-connected remote hosts is also possible, but exceeds the scope of this post):
parallel --number-of-cpus
parallel --number-of-cores
parallel --show-limits
For more configuration details beyond --jobs 4 (potentially --memfree or --noswap, --load <max-load> or --keep-order, and --results <aFile> or --output-as-files), see:
man parallel
parallel --jobs 4 \
         --bar \
         --eta "( script1.sh cfg{}.cfg; script2.sh cfg{}.cfg )" ::: {1..123}
Below, the pair of scripts is emulated by just a pair of tandem echo-s over down-counted indexes, so the progress bar barely shows and the Estimated-Time-of-Arrival (--eta) indications are almost instant:
parallel --jobs 4 \
         --load 50% \
         --bar \
         --eta "( echo 1st-for-cfg-{}; echo 2nd-for-cfg-{} )" ::: {10..0}
0% 0:11=0s 7
1st-for-cfg-10
2nd-for-cfg-10
9% 1:10=0s 6
1st-for-cfg-9
2nd-for-cfg-9
18% 2:9=0s 5
1st-for-cfg-8
2nd-for-cfg-8
27% 3:8=0s 4
1st-for-cfg-7
2nd-for-cfg-7
36% 4:7=0s 3
1st-for-cfg-6
2nd-for-cfg-6
45% 5:6=0s 2
1st-for-cfg-5
2nd-for-cfg-5
54% 6:5=0s 1
1st-for-cfg-4
2nd-for-cfg-4
63% 7:4=0s 0
1st-for-cfg-3
2nd-for-cfg-3
72% 8:3=0s 0
1st-for-cfg-2
2nd-for-cfg-2
81% 9:2=0s 0
1st-for-cfg-1
2nd-for-cfg-1
90% 10:1=0s 0
1st-for-cfg-0
2nd-for-cfg-0
Update
You added:
I should have mentioned that the config files are not necessarily sequentially named and can have any name, I just made them like this to make the example as simple as possible.
The < list_of_arguments form handles this updated problem definition:
parallel [options] [command [arguments]] < list_of_arguments
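A minimal sketch of that form for arbitrarily named config files, assuming they sit in the current directory (find ... -print0 piped into parallel -0 keeps odd file names safe):
find . -maxdepth 1 -name '*.cfg' -print0 \
    | parallel -0 --jobs 4 --bar "( script1.sh {} && script2.sh {} )"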
This would be fairly simple with find and xargs. This would run four processes in parallel, and for any given config file will complete script1.sh before running script2.sh:
find . -name '*.cfg' -print0 | xargs -0 -P 4 -I CFG sh -c 'script1.sh CFG && script2.sh CFG'
I did a simulation test. First I created a file listing names like the ones you describe:
printf '%s\n' cfg{1..100}.cfg > file.txt
Now the script to process it.
#!/bin/bash
file=file.txt
limit=2
array=()
while read -r cfg; do
    array+=("$cfg")
done < "$file"

for ((n=0; n<limit; n++)); do
    for ((i=n; i<${#array[@]}; i+=limit)); do
        echo script1.sh "${array[i]}" && echo script2.sh "${array[i]}" && sleep 2; echo
    done &
done

wait
Now if you run that script you should see what's going to happen. The echo and sleep are there just as a visual aid :-); you can remove them if you decide to actually run the script. Change the value of limit to your heart's content. The idea and technique for solving that particular problem did not come from me; it came from this guy: https://github.com/e36freak/. Give credit where it is due...
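If you do run it for real, the inner line would presumably become something like this (assuming the two scripts live in the current directory):
./script1.sh "${array[i]}" && ./script2.sh "${array[i]}"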
Related
I'm using the SciNet computers for some simulation work, and part of this requires me to simulate data and test models using many different parameter combinations that have to be set in each bash file individually before running them through the terminal. These are just .sh files that run a .R script with certain options set, and I edit all of them using a simple text editor on Windows.
This can take a lot of time, however, if I have to specify over 100 combinations by hand in a text file. I'm wondering: is there any way to speed up this process? Possibly a script or program that I can run to copy from one base file and make the corresponding changes to the parameters. Or, if there were a way I could create a dataset filled with columns of parameter values that the bash file could then pull from, that would be great. I haven't found anything in my search so far, but I feel I may be looking in the wrong places.
Currently I can make some shortcuts by copying and pasting the files for simulations that have the most similarities between parameters, but this still requires me to go into each file manually to change the parameters that do not match.
The main body of the parameters for one setting looks as follows:
--studies 10 --rate 0.003 --alpha 3 --beta 3 --reps 10 --mc 500 --job ${SLURM_ARRAY_TASK_ID} --out /Results
The way you described the problem is confusing. Hence, the number of comments.
The below script is what I visualized to be the issue you are grappling with. Namely,
codifying scenario parameters, and
looping using those parameters.
The script:
#!/bin/bash

DBG=0
COMMAND_ROOT="something_mix_of_command_and_shared_parameters"

###
### Defining Scenario Parameters
###
rates=( 0.001 0.002 0.003 0.004 0.005 )
alphas=( 1 2 3 )
betas=( 1 2 3 )
reps=( 5 10 20 50 100 )
mcs=( 400 450 500 550 600 )

###
### Looping on parameters for "test" plan
###
for rate in "${rates[@]}"
do
    test ${DBG} -eq 1 && echo -e "\t [rate= ${rate}] ..."
    for alpha in "${alphas[@]}"
    do
        test ${DBG} -eq 1 && echo -e "\t\t [alpha= ${alpha}] ..."
        for beta in "${betas[@]}"
        do
            test ${DBG} -eq 1 && echo -e "\t\t\t [beta= ${beta}] ..."
            for rep in "${reps[@]}"
            do
                test ${DBG} -eq 1 && echo -e "\t\t\t\t [rep= ${rep}] ..."
                for mc in "${mcs[@]}"
                do
                    test ${DBG} -eq 1 && echo -e "\t\t\t\t\t [mc= ${mc}] ..."
                    SLURM_ARRAY_TASK_ID="${some_unique_identifier}"
                    COMMAND_PARAMETERS="\
                        --studies 10 \
                        --rate ${rate} \
                        --alpha ${alpha} \
                        --beta ${beta} \
                        --reps ${rep} \
                        --mc ${mc} \
                        --job ${SLURM_ARRAY_TASK_ID} --out /Results"
                    echo -e " Running permutation [${rate}|${alpha}|${beta}|${rep}|${mc}] ..."
                    #${COMMAND_ROOT} ${COMMAND_PARAMETERS}
                done
                echo ""
            done
        done
    done
done
exit
I need to do a set of calculations, changing one parameter each time. A calculation directory contains a control file named 'test.ctrl', a job submission file named 'job-test' and a bunch of data files. Each calculation should be submitted with the same control file name (written inside job-test), and the output is written to those data files without changing their names, which creates an overwriting problem. For this reason, I want to automate the job submission process with a bash script so that I don't need to submit each calculation by hand.
As an example, I have done the first calculation in directory b1-k1-a1 (I chose this format of directory names to indicate the calculation parameters). Its test.ctrl file has the parameters:
Beta=1
Kappa=1
Alpha=0 1
and I submitted this job using the 'sbatch job-test' command. For the following calculations, my script should copy this whole directory under the name bX-kY-aZ, make the changes in the control file, and finally submit the job. I naively tried writing the whole thing in the job-test file, as you can see in the MWE below:
#!/bin/sh
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --time=0:15:00 ##hh:mm:ss
for n in $(seq 0 5)
do
for m in $(seq 0 5)
do
for v in $(seq 0 5)
do
mkdir b$n-k$m-a$v
cd b$n-k$m-a$v
cp ~/home/b01-k1-a01/* .
sed "s/Beta=1/Beta=$n/" test.ctrl
sed "s/Kappa=1/Kappa=$m/" test.ctrl
sed "s/Alpha=0 1/Alpha=0 $v/" test.ctrl
cd ..<<EOF
EOF
mpirun soft.x test.ctrl
sleep 5
done
done
done
I would appreciate it if you could suggest how to make it work this way.
It worked after I moved cd .. to the very end of the loops and removed sed, as suggested in the comments. Hence this works now:
#!/bin/sh
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --time=0:15:00 ##hh:mm:ss
for n in $(seq 0 5)
do
    for m in $(seq 0 5)
    do
        for v in $(seq 0 5)
        do
            mkdir b$n-k$m-a$v
            cd b$n-k$m-a$v
            cp ~/home/b01-k1-a01/* .
            cat >test.ctrl <<EOF
Beta=$n
Kappa=$m
Alpha=0 $v
EOF
            mpirun soft.x test.ctrl
            sleep 5
            cd ..
        done
    done
done
The immediate problem is that sed without any options does not modify the file at all; it just prints the results to standard output.
It is frightfully unclear what you were hoping the here document was accomplishing. cd does not read its standard input, so it wasn't accomplishing anything at all, anyway.
#!/bin/sh
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --time=0:15:00 ##hh:mm:ss
for n in $(seq 0 5); do
    for m in $(seq 0 5); do
        for v in $(seq 0 5); do
            mkdir "b$n-k$m-a$v"
            cd "b$n-k$m-a$v"
            cp ~/home/b01-k1-a01/* .
            sed -e "s/Beta=1/Beta=$n/" \
                -e "s/Kappa=1/Kappa=$m/" \
                -e "s/Alpha=0 1/Alpha=0 $v/" ~/home/b01-k1-a01/test.ctrl >test.ctrl
            mpirun soft.x test.ctrl
            cd ..
            sleep 5
        done
    done
done
Notice also the merging of the multiple sed commands into a single script (though, as noted elsewhere, printf might be even better if that is everything you have in the configuration file).
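For completeness, a minimal sketch of that printf alternative, assuming those three parameters are the entire content of test.ctrl:
printf 'Beta=%s\nKappa=%s\nAlpha=0 %s\n' "$n" "$m" "$v" > test.ctrl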
I have around 135,000 .TIF files (1.2 KB to 1.4 KB each) sitting on my hard drive. I need to extract text out of those files. If I run tesseract as a cron job I get 500 to 600 per hour at the most. Can anyone suggest strategies so I can get at least 500 per minute?
UPDATE:
Below is my code after implementing the suggestions given by @Mark, but I still don't seem to get beyond 20 files per minute.
#!/bin/bash
cd /mnt/ramdisk/input
function tess()
{
    if [ -f /mnt/ramdisk/output/$2.txt ]
    then
        echo skipping $2
        return
    fi
    tesseract --tessdata-dir /mnt/ramdisk/tessdata -l eng+kan $1 /mnt/ramdisk/output/$2 > /dev/null 2>&1
}
export -f tess
find . -name \*.tif -print0 | parallel -0 -j100 --progress tess {/} {/.}
You need GNU Parallel. Here I process 500 TIF files of 3kB each in 37s on an iMac. By way of comparison, the same processing takes 160s if done in a sequential for loop.
The basic command looks like this:
parallel --bar 'tesseract {} {.} > /dev/null 2>&1' ::: *.tif
which will show a progress bar and use all available cores on your machine.
If you want to see what it would do without actually doing anything, use parallel --dry-run.
As you have 135,000 files it will probably overflow your command line length - you can check with sysctl like this:
sysctl -a kern.argmax
kern.argmax: 262144
So you need to pump the filenames into GNU Parallel on its stdin and separate them with null characters so you don't get problems with spaces:
find . -iname \*.tif -print0 | parallel -0 --bar 'tesseract {} {.} > /dev/null 2>&1'
If you are dealing with very large numbers of files, you probably need to consider the possibility of being interrupted and restarted. You could either mv each TIF file after processing to a subdirectory called processed so that it won't get done again on restarting, or you could test for the existence of the corresponding txt file before processing any TIF like this:
#!/bin/bash
doit() {
    if [ -f "${2}.txt" ]; then
        echo Skipping $1...
        return
    fi
    tesseract "$1" "$2" > /dev/null 2>&1
}
export -f doit
time parallel --bar doit {} {.} ::: *.tif
If you run that twice in a row, you will see it is near instantaneous the second time because all the processing was done the first time.
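A hedged sketch of the other restart strategy mentioned above, moving each TIF into a processed subdirectory once its text has been extracted (the directory name and the function name are assumptions):
doit_mv() {
    # OCR the file, then park it in processed/ so a restart skips it
    tesseract "$1" "${1%.*}" > /dev/null 2>&1 && mv "$1" processed/
}
export -f doit_mv
mkdir -p processed
parallel --bar doit_mv ::: *.tif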
If you have millions of files, you could consider using multiple machines in parallel, so just make sure you have ssh logins to each of the machines on your network and then run across 4 machines, including the localhost like this:
parallel -S :,remote1,remote2,remote3 ...
where : is shorthand for the machine on which you are running.
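A hedged sketch of such a multi-machine run (remote1, remote2 and remote3 are placeholder host names, each needing passwordless ssh and tesseract installed; --trc transfers each TIF, returns the matching .txt and cleans up on the remote side):
find . -iname \*.tif -print0 | parallel -0 -S :,remote1,remote2,remote3 --trc {.}.txt 'tesseract {} {.} > /dev/null 2>&1'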
I have read similar questions about this topic but none of them help me with the following problem:
I have a bash script that looks like this:
#!/bin/bash
for filename in /home/user/Desktop/emak/*.fa; do
mkdir ${filename%.*}
cd ${filename%.*}
mkdir emak
cd ..
done
This script basically does the following:
Iterate through all files in a directory
Create a new directory with the name of each file
Go inside the new directory and create a new directory called "emak"
The real task does something much more computationally expensive than creating the "emak" directory...
I have thousands of files to iterate through.
As each iteration is independent of the previous one, I would like to split the work across different processors (I have 24 cores) so I can process multiple files at the same time.
I read some previous posts about running in parallel (using GNU Parallel) but I do not see a clear way to apply it in this case.
thanks
No need for parallel; you can simply use
N=10
for filename in /home/user/Desktop/emak/*.fa; do
mkdir -p "${filename%.*}/emak" &
(( ++count % N == 0)) && wait
done
The line with wait pauses after every Nth job, allowing all the previous jobs to complete before continuing.
Something like this with GNU Parallel, whereby you create and export a bash function called doit:
#!/bin/bash
doit() {
    dir=${1%.*}
    mkdir "$dir"
    cd "$dir"
    mkdir emak
}
export -f doit
parallel doit ::: /home/user/Desktop/emak/*.fa
You will really see the benefit of this approach if the time taken by your "computationally expensive" part is longer, or especially variable. If it takes, say up to 10 seconds and is variable, GNU Parallel will submit the next job as soon as the shortest of the N parallel processes completes, rather than waiting for all N to complete before starting the next batch of N jobs.
As a crude benchmark, this takes 58 seconds:
#!/bin/bash
doit() {
    echo $1
    # Sleep up to 10 seconds
    sleep $((RANDOM*11/32768))
}
export -f doit
parallel -j 10 doit ::: {0..99}
and this is directly comparable and takes 87 seconds:
#!/bin/bash
N=10
for i in {0..99}; do
    echo $i
    sleep $((RANDOM*11/32768)) &
    (( ++count % N == 0)) && wait
done
I have a large set of files for which some heavy processing needs to be done.
This processing is single-threaded, uses a few hundred MiB of RAM (on the machine used to start the job) and takes a few minutes to run.
My current use case is to start a hadoop job on the input data, but I've had this same problem in other cases before.
In order to fully utilize the available CPU power I want to be able to run several of those tasks in parallel.
However, a very simple example shell script like this will trash the system performance due to excessive load and swapping:
find . -type f | while read name ;
do
some_heavy_processing_command ${name} &
done
So what I want is essentially similar to what "gmake -j4" does.
I know bash supports the "wait" command, but that only waits until all child processes have completed. In the past I've created scripts that run a 'ps' command and then grep the child processes out by name (yes, I know ... ugly).
What is the simplest/cleanest/best solution to do what I want?
Edit: Thanks to Frederik: Yes indeed this is a duplicate of How to limit number of threads/sub-processes used in a function in bash
The "xargs --max-procs=4" works like a charm.
(So I voted to close my own question)
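For reference, a minimal sketch of that xargs approach (assuming some_heavy_processing_command takes a single file argument; -print0/-0 keeps file names with spaces safe):
find . -type f -print0 | xargs -0 --max-procs=4 -n 1 some_heavy_processing_command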
I know I'm late to the party with this answer but I thought I would post an alternative that, IMHO, makes the body of the script cleaner and simpler. (Clearly you can change the values 2 & 5 to be appropriate for your scenario.)
function max2 {
    while [ `jobs | wc -l` -ge 2 ]
    do
        sleep 5
    done
}
find . -type f | while read name ;
do
max2; some_heavy_processing_command ${name} &
done
wait
#! /usr/bin/env bash

set -o monitor
# means: run background processes in separate process groups...
trap add_next_job CHLD
# execute add_next_job when we receive a child complete signal

todo_array=($(find . -type f)) # places output into an array
index=0
max_jobs=2

function add_next_job {
    # if there are still jobs to do, add one
    if [[ $index -lt ${#todo_array[*]} ]]
    # the hash in ${#todo_array[*]} is not a comment - it is bash's (admittedly awkward) way of getting the array length
    then
        echo adding job ${todo_array[$index]}
        do_job ${todo_array[$index]} &
        # replace the line above with the command you want
        index=$(($index+1))
    fi
}

function do_job {
    echo "starting job $1"
    sleep 2
}

# add the initial set of jobs
while [[ $index -lt $max_jobs ]]
do
    add_next_job
done

# wait for all jobs to complete
wait
echo "done"
Having said that, Fredrik makes the excellent point that xargs does exactly what you want...
With GNU Parallel it becomes simpler:
find . -type f | parallel some_heavy_processing_command {}
Learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
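If you want to cap the concurrency like the "gmake -j4" mentioned in the question, a hedged variant:
find . -type f | parallel -j 4 some_heavy_processing_command {}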
I think I found a more handy solution using make:
#!/usr/bin/make -f
THIS := $(lastword $(MAKEFILE_LIST))
TARGETS := $(shell find . -name '*.sh' -type f)
.PHONY: all $(TARGETS)
all: $(TARGETS)
$(TARGETS):
	some_heavy_processing_command $@
$(THIS): ; # Avoid trying to remake this makefile
Call it e.g. 'test.mak' and add execute rights. If you call ./test.mak it will run some_heavy_processing_command one by one. But if you call it as ./test.mak -j 4, it will run four subprocesses at once. You can also use it in a more sophisticated way: run it as ./test.mak -j 5 -l 1.5, and it will run at most 5 subprocesses while the system load is under 1.5, but will limit the number of processes if the system load exceeds 1.5.
It is more flexible than xargs, and make is part of the standard distribution, unlike parallel.
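For instance, using the invocations described above (same file name and thresholds as in the text):
chmod +x test.mak
./test.mak               # runs the commands one by one
./test.mak -j 4          # up to four subprocesses at once
./test.mak -j 5 -l 1.5   # up to five subprocesses, throttled once the load exceeds 1.5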
This code worked quite well for me, but I noticed one issue where the script couldn't end: if max_jobs is greater than the number of elements in the array, the script will never quit. To prevent that scenario, I've added the following right after the "max_jobs" declaration.
if [ $max_jobs -gt ${#todo_array[*]} ]
then
    # there are more elements in the array than max_jobs, so set max_jobs to the number of array elements
    max_jobs=${#todo_array[*]}
fi
Another option:
PARALLEL_MAX=...
function start_job() {
    while [ $(ps --no-headers -o pid --ppid=$$ | wc -l) -gt $PARALLEL_MAX ]; do
        sleep .1 # Wait for background tasks to complete.
    done
    "$@" &
}
start_job some_big_command1
start_job some_big_command2
start_job some_big_command3
start_job some_big_command4
...
Here is a very good function I used to control the maximum # of jobs from bash or ksh. NOTE: the - 1 in the pgrep subtracts the wc -l subprocess.
function jobmax
{
    typeset -i MAXJOBS=$1
    sleep .1
    while (( ($(pgrep -P $$ | wc -l) - 1) >= $MAXJOBS ))
    do
        sleep .1
    done
}
nproc=5
for i in {1..100}
do
    sleep 1 &
    jobmax $nproc
done
wait # Wait for the rest