How to start a large number of quick jobs in Bash

I have 3000 very quick jobs to run that on average take 2/3 seconds.
The list of jobs is in a file, and I want to control how many I have open.
However, starting a job in the background (the & line) seems to take some time itself, so some jobs have already finished before INTOTAL of them have been started.
Therefore, I am not using my 32 cores efficiently.
Is there a better approach than the one below?
#!/bin/bash
# (bash rather than sh: the C-style for loop below is a bash feature)
#set -x
INTOTAL=28
while true
do
    NUMRUNNING=$(tasklist | grep Prod.exe | wc -l)
    JOBS=$(cat jobs.lst | wc -l)
    if [ "$JOBS" -gt 0 ]
    then
        MAXSTART=$((INTOTAL - NUMRUNNING))
        NUMTOSTART=$JOBS
        if [ "$NUMTOSTART" -gt "$MAXSTART" ]
        then
            NUMTOSTART=$MAXSTART
        fi
        echo "Starting: $NUMTOSTART"
        for ((i = 1; i <= NUMTOSTART; i++))
        do
            JOB=$(head -n1 jobs.lst)
            sed -i 1d jobs.lst
            /Prod $JOB &
        done
        sleep 2
    fi
    sleep 3
done

You may want to have a look at parallel, which you should be able to install on Cygwin according to the release notes. Then running the tasks in parallel can be as easy as:
parallel /Prod {} < jobs.lst
See here for an example of this in its man page (and have a look through the plethora of examples for more about the many options it has).
To control how many jobs to run at a time use the -j flag. By default it will run 1 job per core at a time, so 32 for you. To limit to 16 for instance:
parallel -j 16 /Prod {} < jobs.lst
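If GNU parallel cannot be installed, something similar can be sketched with xargs -P, which GNU and BSD xargs both provide (it is not part of POSIX). This is a minimal sketch rather than the method recommended above: run_job_list is a hypothetical helper name, and it assumes the job lines contain no quotes or backslashes. The words on each line become separate arguments, as with the unquoted $JOB in the question's script.

```shell
#!/bin/bash
# run_job_list FILE MAX CMD...: hypothetical helper, not part of any tool.
# Runs "CMD... <words of line>" once per line of FILE, at most MAX at a time.
run_job_list() {
    local file=$1 max=$2
    shift 2
    # -L 1: one input line per invocation; -P: cap on simultaneous processes.
    xargs -L 1 -P "$max" "$@" < "$file"
}
```

With the question's names this would be called as run_job_list jobs.lst 28 /Prod. xargs starts a replacement as soon as one process exits, so there is no polling delay between jobs.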

Related

delay between runs in bash on cluster

I have to submit a large number of jobs on a cluster. I have a script like:
#!/bin/bash
for runname in bcc BNU Can CNRM GFDLG GFDLM
do
cd given_directory/$runname
cat another_directory | while read LINE ; do
qsub $LINE
done
done
There are 4000 lines in the script, i.e. 4000 jobs for each runname.
The cluster limits the number of jobs a user can have submitted at any given time.
So, within the for loop, I want to delay each run until the previous batch, e.g. all the runs in the bcc directory, is done.
How can I do that? Is there a command that I can put after the first done to make the script wait until bcc is done and then move on to BNU?
One option is to use a counter to monitor how many jobs are currently submitted, and wait when the limit is reached. Querying the number of jobs can be a costly operation for the head node, so it is better not to do it after every submitted job. Here, it is done at most once every SLEEP seconds.
#!/bin/bash
TARGET=4000
SLEEP=300
# Count the current jobs, pending or running
get_job_count(){
    # The grep is to remove the header, there may be a better way.
    qstat -u "$USER" | grep -c "$USER"
}
# Wait until the number of jobs is under the limit, then submit.
submit_when_possible(){
    while [ "$COUNTER" -ge "$TARGET" ]; do
        sleep "$SLEEP"
        COUNTER=$(get_job_count)
    done
    qsub "$1"
    COUNTER=$((COUNTER + 1))
}
# Global job counter
COUNTER=$(get_job_count)
for RUNNAME in bcc BNU Can CNRM GFDLG GFDLM
do
    cd given_directory/$RUNNAME
    # Redirect into the loop instead of piping from cat: a piped loop runs
    # in a subshell, so its COUNTER updates would be lost after each batch.
    while read -r JOB ; do
        submit_when_possible "$JOB"
    done < another_directory
done
Note: the script is untested, so it may need minor fixes, but the idea should work.
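The question also asks how to hold back the next directory until the current batch has drained. One possible sketch is a small polling helper. wait_until_below is a hypothetical name, and the counting command is passed in by name so that it can be the get_job_count function above, or a stub when trying it out away from the cluster.

```shell
#!/bin/bash
# wait_until_below LIMIT COUNT_CMD [POLL_SECONDS]: hypothetical helper.
# Blocks until COUNT_CMD prints a number smaller than LIMIT, polling
# every POLL_SECONDS (default 300) to go easy on the head node.
wait_until_below() {
    local limit=$1 count_cmd=$2 poll=${3:-300}
    while [ "$("$count_cmd")" -ge "$limit" ]; do
        sleep "$poll"
    done
}
```

Calling wait_until_below 1 get_job_count "$SLEEP" after the inner loop would pause until none of your jobs are left in the queue before moving on to the next RUNNAME.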

For loop in parallel

Is there a quick, easy, and efficient way of running iterations in this for loop in parallel?
for i in `seq 1 5000`; do
repid="$(printf "%05d" "$i")"
inp="${repid}.inp"
out="${repid}.out"
/command "$inp" "$out"
done
If you want to take advantage of all your lovely CPU cores that you paid Intel so handsomely for, turn to GNU Parallel:
seq -f "%05g" 5000 | parallel -k echo command {}.inp {}.out
If you like the look of that, run it again without the -k (which keeps the output in order) and without the echo. You may need to enclose the command in single quotes:
seq -f "%05g" 5000 | parallel '/command {}.inp {}.out'
It will run 1 instance per CPU core in parallel, but, if you want say 32 in parallel, use:
seq ... | parallel -j 32 ...
If you want an "estimated time of arrival", use:
parallel --eta ...
If you want a progress meter, use:
parallel --progress ...
If you have bash version 4+, it can zero-pad brace expansions, so if your ARGMAX is big enough you can more simply use:
parallel 'echo command {}.inp {}.out' ::: {00001..05000}
You can check your ARGMAX with:
sysctl -a kern.argmax
and it tells you how many bytes long your parameter list can be. You are going to need 5,000 numbers at 5 digits plus a space each, so 30,000 minimum.
If you are on macOS, you can install GNU Parallel with homebrew:
brew install parallel
for i in `seq 1 5000`; do
repid="$(printf "%05d" "$i")"
inp="${repid}.inp"
out="${repid}.out"
/command "$inp" "$out" &
done
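The loop above puts every iteration in the background at once, so all 5000 processes compete for the CPU at the same time. A minimal sketch of a throttled variant follows, assuming bash 4.3 or later for wait -n. run_throttled is a hypothetical helper name, not a bash builtin.

```shell
#!/bin/bash
# run_throttled MAX CMD...: hypothetical helper. Reads one argument per line
# from stdin and runs "CMD... <line>" in the background, at most MAX at a time.
run_throttled() {
    local max=$1 running=0 arg
    shift
    while IFS= read -r arg; do
        if [ "$running" -ge "$max" ]; then
            wait -n                      # block until any one job exits (bash 4.3+)
            running=$((running - 1))
        fi
        # stdin from /dev/null so the job cannot eat the loop's input
        "$@" "$arg" < /dev/null &
        running=$((running + 1))
    done
    wait                                 # let the remaining jobs finish
}
```

For the loop in the question this could be used as seq -f "%05g" 5000 | run_throttled 8 sh -c '/command "$1.inp" "$1.out"' _, where 8 is an arbitrary example limit.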

Looping files in bash

I want to loop over these kinds of files, where the files with the same Sample_ID have to be used together
Sample_51770BL1_R1.fastq.gz
Sample_51770BL1_R2.fastq.gz
Sample_52412_R1.fastq.gz
Sample_52412_R2.fastq.gz
e.g. Sample_51770BL1_R1.fastq.gz and Sample_51770BL1_R2.fastq.gz are used together in one command to create an output.
Similarly, Sample_52412_R1.fastq.gz and Sample_52412_R2.fastq.gz are used together to create output.
I want to write a for loop in bash to iterate over and create output.
sourcedir=/sourcepath/
destdir=/destinationpath/
bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta Sample_52412_R1.fastq.gz Sample_52412_R2.fastq.gz>$destdir/Sample_52412_R1_R2.sam
How should I pattern match the file names Sample_ID_R1 and Sample_ID_R2 to be used in one command?
Thanks,
for fname in *_R1.fastq.gz
do
base=${fname%_R1*}
bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam"
done
In the comments, you ask about running several, but not too many, jobs in parallel. Below is my first stab at that:
#!/bin/bash
# Limit background jobs to no more than $maxproc at once.
maxproc=3
for fname in *_R1.fastq.gz
do
    while [ "$(jobs -r | wc -l)" -ge "$maxproc" ]
    do
        sleep 1
    done
    base=${fname%_R1*}
    echo "starting new job with ongoing=$(jobs -r | wc -l)"
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam" &
done
# Wait for the last jobs to finish before the script exits.
wait
The optimal value of maxproc will depend on how many processors your PC has. You may need to experiment to find what works best.
Note that the above script uses jobs, which is a bash builtin. Thus, it has to be run under bash, not dash, which is the default /bin/sh on Debian-like distributions.
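Before launching the aligner it can help to confirm that every R1 file really has an R2 mate. The following is a minimal sketch that uses the same ${fname%_R1*} expansion as the loops above. list_pairs is a hypothetical helper name.

```shell
#!/bin/bash
# list_pairs DIR: hypothetical helper. Prints "R1<TAB>R2" for each complete
# pair in DIR and warns on stderr about any R1 file with no R2 mate.
list_pairs() {
    local dir=$1 r1 r2 base
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue          # the glob matched nothing
        base=${r1%_R1*}                   # strip "_R1.fastq.gz"
        r2=${base}_R2.fastq.gz
        if [ -e "$r2" ]; then
            printf '%s\t%s\n' "$r1" "$r2"
        else
            echo "no R2 mate for $r1" >&2
        fi
    done
}
```

The output can be fed to a while IFS=$'\t' read -r r1 r2 loop that runs bwa on each pair.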

Strange behaviour of bash script for running parallel subprocess in bash

The following script is used for running parallel subprocesses in bash. It is slightly changed from "Running a limited number of child processes in parallel in bash?"
#!/bin/bash
set -o monitor # means: enable job control, so background processes run in separate process groups
N=1000
todo_array=($(seq 0 $((N-1))))
max_jobs=5
trap add_next_job CHLD
index=0
function add_next_job {
if [[ $index -lt ${#todo_array[@]} ]]
then
do_job $index &
index=$(($index+1))
fi
}
function do_job {
echo $1 start
time=$(echo "scale=0;x=$RANDOM % 10;scale=5;x/20+0.05" |bc);sleep $time;echo $time
echo $1 done
}
while [[ $index -lt $max_jobs ]] && [[ $index -lt ${#todo_array[@]} ]]
do
add_next_job
done
wait
Each job picks a random number in 0.05:0.05:0.50 and sleeps for that many seconds.
For example, with N=10, a sample output is
1 start
4 start
3 start
2 start
0 start
.25000
2 done
5 start
.30000
3 done
6 start
.35000
0 done
7 start
.40000
1 done
8 start
.40000
4 done
9 start
.05000
7 done
.20000
5 done
.25000
9 done
.45000
6 done
.50000
8 done
which has 30 lines in total.
But for big N such as 1000, the result can be strange. One run gives 2996 lines of output, with 998 lines with start, 999 with done, and 999 with a float number. 644 and 652 are missing in start, 644 is missing in done.
These tests were run on Arch Linux with bash 4.2.10(2). Similar results can be produced on Debian stable with bash 4.1.5(1).
EDIT: I tried parallel in moreutils and GNU parallel for this test. Parallel in moreutils has the same problem. But GNU parallel works perfectly.
I think this is just due to all of the subprocesses inheriting the same file descriptor and trying to append to it in parallel. Very rarely two of the processes race and both start appending at the same location and one overwrites the other. This is essentially the reverse of what one of the comments suggests.
You could easily check this by redirecting through a pipe, such as with your_script | tee file because pipes have rules about atomicity of data delivered by single write() calls that are smaller than a particular size.
There's another question on SO that's similar to this (I think it just involved two threads both quickly writing numbers) where this is also explained but I can't find it.
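If lost or interleaved output is the concern, one way to sidestep the shared file descriptor entirely is to give every job its own output file and concatenate the files afterwards. This is only a sketch of that idea, not a fix taken from the answer above. run_isolated and the per-job file naming are assumptions, and for brevity it starts all jobs at once instead of limiting them to max_jobs.

```shell
#!/bin/bash
# run_isolated N CMD...: hypothetical helper. Runs "CMD... i" for i in 0..N-1
# in the background, each writing to its own file, then prints the files in order.
run_isolated() {
    local n=$1 dir i
    shift
    dir=$(mktemp -d) || return 1
    for ((i = 0; i < n; i++)); do
        "$@" "$i" > "$dir/$i.out" 2>&1 < /dev/null &
    done
    wait                                 # all jobs have finished writing
    for ((i = 0; i < n; i++)); do
        cat "$dir/$i.out"
    done
    rm -rf "$dir"
}
```

With the script's own function this would be run_isolated 1000 do_job. Because no two jobs ever write to the same file, a missing line would then point at something other than a write race.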
The only thing I can imagine is that you're running out of resources; check "ulimit -a" and look for "max user processes". If that's less than the number of processes you want to spawn, you will end up with errors.
Try raising the limit for your user (if you're not running as root). On Redhatish systems you can do this by:
Adding this line to /etc/pam.d/login:
session required pam_limits.so
Adding the following content to /etc/security/limits.conf:
myuser soft nproc 1000
myuser hard nproc 1024
where "myuser" is the user being granted the right, 1000 is the default (soft) value of "max user processes" and 1024 is the maximum (hard) number of user processes. The soft and hard limits shouldn't be too far apart. The hard limit only determines how far the user may raise the limit with the "ulimit" command in their shell.
So myuser will start with a limit of 1000 processes (including the shell and all other spawned processes), but may raise it to 1024 using ulimit:
$ ulimit -u
1000
$ ulimit -u 1024
$ ulimit -u
1024
$ ulimit -u 2000
-bash: ulimit: max user processes: cannot modify limit: Operation not permitted
A reboot is not required, it works instantly.
Good luck!
Alex.

Easy parallelisation

I often find myself writing simple for loops to perform an operation on many files, for example:
for i in `find . | grep ".xml$"`; do bzip2 $i; done
It seems a bit depressing that on my 4-core machine only one core is getting used. Is there an easy way I can add parallelism to my shell scripting?
EDIT: To introduce a bit more context to my problems, sorry I was not more clear to start with!
I often want to run simple(ish) scripts, such as plot a graph, compress or uncompress, or run some program, on reasonable sized datasets (usually between 100 and 10,000). The scripts I use to solve such problems look like the one above, but might have a different command, or even a sequence of commands to execute.
For example, just now I am running:
for i in `find . | grep ".xml.bz2$"`; do find_graph -build_graph $i.graph $i; done
So my problems are in no way bzip specific! (Although parallel bzip does look cool, I intend to use it in future).
Solution: Use xargs to run in parallel (don't forget the -n option!)
find . -name \*.xml -print0 | xargs -0 -n 1 -P 3 bzip2
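The question also mentions running a sequence of commands per file. With xargs that can be done by handing each file name to a small sh -c script. The sketch below is an illustration under assumptions: each_xml is a hypothetical helper, and the script string stands for whatever sequence you need.

```shell
#!/bin/bash
# each_xml DIR MAX SCRIPT: hypothetical helper. Runs SCRIPT (a sh snippet that
# sees the file name as "$1") once per *.xml file under DIR, MAX at a time.
each_xml() {
    local dir=$1 max=$2 script=$3
    # -print0 / -0 keep file names with spaces or newlines intact;
    # the "_" fills $0 so that the file name arrives as $1.
    find "$dir" -name '*.xml' -print0 | xargs -0 -n 1 -P "$max" sh -c "$script" _
}
```

For example, each_xml . 4 'bzip2 "$1" && find_graph -build_graph "$1.bz2.graph" "$1.bz2"' would compress each file and then build its graph, using the find_graph command from the question.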
This perl program fits your needs fairly well, you would just do this:
runN -n 4 bzip2 `find . | grep ".xml$"`
GNU make has a nice parallelism feature (e.g. -j 5) that would work in your case. Create a Makefile
%.xml.bz2 : %.xml
	bzip2 $<
all: $(patsubst %.xml,%.xml.bz2,$(shell find . -name '*.xml'))
then do a
nice make -j 5
Replace '5' with some number, probably 1 more than the number of CPUs. You might want to 'nice' this just in case someone else wants to use the machine while you are on it.
The answer to the general question is difficult, because it depends on the details of the things you are parallelizing.
On the other hand, for this specific purpose, you should use pbzip2 instead of plain bzip2 (chances are that pbzip2 is already installed or at least in the repositories of your distro). See here for details: http://compression.ca/pbzip2/
I find this kind of operation counterproductive. The more processes access the disk at the same time, the higher the read/write times go, so the whole job ends up taking longer. The bottleneck here won't be the CPU, no matter how many cores you have.
Haven't you ever run two big file copies at the same time on the same hard drive? It is usually faster to copy one and then the other.
I know this task involves some CPU power (bzip2 is a demanding compression method), but try measuring the CPU load first before going down the "challenging" path that we technicians tend to choose much more often than needed.
I did something like this for bash. The parallel make trick is probably a lot faster for one-offs, but here is the main code section to implement something like this in bash, you will need to modify it for your purposes though:
#!/bin/bash
# Replace NNN with the number of loops you want to run through
# and CMD with the command you want to parallel-ize.
set -m
nodes=`grep processor /proc/cpuinfo | wc -l`
job=($(yes 0 | head -n $nodes | tr '\n' ' '))
isin()
{
local v=$1
shift 1
while (( $# > 0 ))
do
if [ $v = $1 ]; then return 0; fi
shift 1
done
return 1
}
dowait()
{
while true
do
nj=( $(jobs -p) )
if (( ${#nj[@]} < nodes ))
then
for (( o=0; o<nodes; o++ ))
do
if ! isin ${job[$o]} ${nj[*]}; then let job[o]=0; fi
done
return;
fi
sleep 1
done
}
let x=0
while (( x < NNN ))
do
for (( o=0; o<nodes; o++ ))
do
if (( job[o] == 0 )); then break; fi
done
if (( o == nodes )); then
dowait;
continue;
fi
CMD &
let job[o]=$!
let x++
done
wait
If you had to solve the problem today you would probably use a tool like GNU Parallel (unless there is a specialized parallelized tool for your task like pbzip2):
find . | grep ".xml$" | parallel bzip2
To learn more:
Watch the intro video for a quick introduction:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial (man parallel_tutorial). Your command line will love you for it.
I think you could do the following
for i in `find . | grep ".xml$"`; do bzip2 $i & done
But that would instantly spin off as many processes as you have files, which isn't as optimal as just running four processes at a time.

Resources