I have sets of jobs and all of the jobs can be run in parallel, so I want to parallelize them for better throughput.
This is what I am currently doing:
I wrote a Python script using the multiprocessing library that runs the jobs in a set at the same time. After all of the jobs in a set are finished, the next set of jobs (another script) is invoked. This is inefficient because each job in a set has a different execution time.
Recently I learned about GNU parallel and I think it may help improve my script. However, each set of jobs has some pre-processing and post-processing tasks, so it is not possible to run the jobs in an arbitrary order.
In summary, I want to
1) make sure that pre-processing is completed before launching a job and
2) run post-processing after the jobs in a set are all completed.
And this is what I am trying to do:
Run a separate script for each set of jobs.
Run pre-processing in each script; after that, the set is free to run all of its jobs.
Each script registers its jobs in the GNU parallel job queue.
GNU parallel runs the queued jobs in parallel.
Each script monitors whether its own jobs are finished.
When all of the jobs in a set are done, run post-processing.
I am wondering how I can do this with GNU parallel, or even whether GNU parallel is the right tool for this at all.
If we assume you are limited by CPU (and not mem or I/O) then this might work:
do_jobset() {
  jobset=$1
  preprocess "$jobset"
  parallel --load 100% do_job ::: "$jobset"/*
  postprocess "$jobset"
}
export -f do_jobset
parallel do_jobset ::: *.jobset
If do_job does not use a full CPU from the start, but takes 10 seconds to load data to be processed, add --delay 10 before --load 100%.
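With that change, the inner call inside do_jobset would look something like this (just a sketch of the same command with the delay added; 10 is the number of seconds mentioned above):
# stagger job starts so --load sees the real CPU usage of the jobs already running
parallel --delay 10 --load 100% do_job ::: "$jobset"/*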
The alternative is to do:
parallel preprocess ::: *.jobset
parallel do_job ::: *.jobset/*
parallel postprocess ::: *.jobset
I am running GNU parallel to run a bash script, but it seems GNU parallel automatically kills my program and I am not sure why. The script runs normally when I run it on its own.
I wonder why this happens and how to solve it?
Your help is really appreciated!
Here is my code:
parallel --progress --joblog ${home}/data/hsc-admmt/Projects/log_a.sh -j 5 :::: a.sh
Here is the message at the end of the output of GNU parallel:
/scratch/eu82/bt0689/data/hsc-admmt/Projects/sim_causal_mantel_generate.sh: line 54: 3050285 Killed $home/data/hsc-admmt/Tools/mtg2 -plink plink_all${nsamp}${nsnp}_1 -simreal snp.lst1
I just had the same problem and found this possible explanation:
“Some jobs need a lot of memory, and should only be started when there is enough memory free. Using --memfree GNU parallel can check if there is enough memory free. Additionally, GNU parallel will kill off the youngest job if the memory free falls below 50% of the size. The killed job will put back on the queue and retried later.”
(from https://www.gnu.org/software/parallel/parallel_tutorial.html).
In my case the killed job was not resumed. I'm not sure if this is the reason for your problem, but it would explain mine, since the error only occurs when I parallelize my script with more than 3 jobs.
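If memory pressure is indeed the cause, one possible mitigation (a sketch; 2G is a placeholder for the real memory footprint of one job) is to only start jobs when enough memory is free and to let GNU parallel retry jobs that were killed:
# only start a job when at least 2G is free; retry a killed job up to 3 times
parallel --progress --joblog ${home}/data/hsc-admmt/Projects/log_a.sh --memfree 2G --retries 3 -j 5 :::: a.sh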
I have a Bash script that I submit to a cluster that calls a pipeline of Python scripts which are built to be multithreaded for parallel processing. I need to call this pipeline on all files in a directory, which I can accomplish with a for-loop. However, I am worried that this will run the operations (i.e. the pipeline) on just a single-thread rather than the full range that was intended.
The batch file for submission looks like this:
#!/bin/bash
##SBATCH <parameters>
for filename in /path/to/*.txt; do
  PythonScript1.py "$filename"
  PythonScript2.py "$filename"
done
Will this work as intended, or will the for loop hamper the efficiency/parallel processing of the Python scripts?
If you are running on a single server:
parallel ::: PythonScript1.py PythonScript2.py ::: /path/to/*.txt
This will generate all combinations of {PythonScript1.py,PythonScript2.py} and *.txt. These combinations will be run in parallel but GNU parallel will only run as many at a time as there are CPU threads in the server.
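If you want fewer (or more) simultaneous combinations than the default of one per CPU thread, you can cap it explicitly with -j, for example:
# run at most 4 of the script/file combinations at a time
parallel -j 4 ::: PythonScript1.py PythonScript2.py ::: /path/to/*.txt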
If you are running on multiple servers in a cluster, it really depends on what system is used for controlling the cluster. On some systems you ask for a list of servers and then you can ssh to those:
get list of servers > serverlist
parallel --slf serverlist ::: PythonScript1.py PythonScript2.py ::: /path/to/*.txt
On others you have to give each of the commands you want to run to the queuing system:
parallel queue_this ::: PythonScript1.py PythonScript2.py ::: /path/to/*.txt
Without knowing more about which cluster control system is used, it is hard to help you more.
As originally written, PythonScript2.py won't run until PythonScript1.py returns, and the for loop won't iterate until PythonScript2.py returns.
Note that I said "returns", not "finishes"; if PythonScript1.py and/or PythonScript2.py forks or otherwise goes into the background on its own, then it will return before it is finished, and will continue processing while the calling bash script continues on to its next step.
You could have the calling script put them into the background with PythonScript1.py & and PythonScript2.py &, but this might or might not be what you want, since PythonScript1.py and PythonScript2.py will thus (likely) be running at the same time.
If you want multiple files processed at the same time, but want PythonScript1.py and PythonScript2.py to run in strict order, follow the comment from William Pursell:
for filename in /path/to/*.txt; do
  { PythonScript1.py "$filename"; PythonScript2.py "$filename"; } &
done
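One caveat: the loop backgrounds every pipeline at once, and the submitting script will reach its end as soon as the loop finishes, so you will usually want a wait so the batch job does not exit before the work is done. A sketch of the same loop with that added:
for filename in /path/to/*.txt; do
  { PythonScript1.py "$filename"; PythonScript2.py "$filename"; } &
done
wait   # block until all backgrounded pipelines have finished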
I'm running parallel MATLAB or Python tasks on a cluster that is managed by PBS Torque. The embarrassing situation is that PBS thinks I'm using 56 cores, but that is only true at the start; eventually only the 7 hardest tasks are still running and 49 cores are wasted.
My parallel tasks take very different amounts of time because they search different model parameters, and I cannot know in advance how long each task will take. At the start all cores were used, but soon only the hardest tasks remained. Since the whole job was not finished yet, PBS Torque still thought I was using all 56 cores and prevented new tasks from running, even though most cores were actually idle. I want PBS to detect this and use the idle cores to run new tasks.
So my question is: are there settings in PBS Torque that can automatically detect the cores actually in use by a job and allocate the idle ones to new tasks? Here is my submission script:
#PBS -S /bin/sh
#PBS -N alps_task
#PBS -o stdout
#PBS -e stderr
#PBS -l nodes=1:ppn=56
#PBS -q batch
#PBS -l walltime=1000:00:00
#HPC -x local
cd /tmp/$PBS_O_WORKDIR
alpspython spin_half_correlation.py 2>&1 > tasklog.log
A short answer to your question is No: PBS has no way to reclaim unused resources allocated to a job.
Since your computation is essentially a bunch of independent tasks, what you could and probably should do is split your job into 56 independent jobs, each running an individual combination of model parameters, and when all the jobs are finished run an additional job to collect and summarize the results. This is a well-supported way of doing things. PBS provides some useful features for this type of job, such as array jobs and job dependencies.
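As a rough sketch of the array-job approach (it assumes a hypothetical params.txt listing one parameter combination per line, and a Torque version that supports -t array jobs and afterokarray dependencies):
#PBS -S /bin/sh
#PBS -N alps_array
#PBS -l nodes=1:ppn=1
#PBS -l walltime=1000:00:00
#PBS -t 1-56
cd $PBS_O_WORKDIR
# each array element picks one parameter combination from the (hypothetical) params.txt
PARAMS=$(sed -n "${PBS_ARRAYID}p" params.txt)
alpspython spin_half_correlation.py $PARAMS > task_${PBS_ARRAYID}.log 2>&1
The summary job can then be made to wait for the whole array with something like qsub -W depend=afterokarray:<array_job_id> summarize.sh, where summarize.sh is whatever collects the results.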
I would like to run commands in parallel, so that if one of them fails, the whole job exits as a failure. More specifically, I have 5 aws sync commands that currently run sequentially. I would like them to run in parallel so that if any one fails, the whole job fails. How can I do that?
GNU Parallel is a really handy and powerful tool that works with anything you can run from bash:
http://www.gnu.org/software/parallel/
https://www.youtube.com/watch?v=OpaiGYxkSuQ
# run lines from a file, 8 at a time
cat commands.txt | parallel --eta -j 8 "{}"
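For the aws sync case specifically, --halt makes GNU Parallel abort the remaining commands as soon as one fails, and its exit status is then non-zero so the whole job fails. A sketch, assuming a hypothetical sync_commands.txt with one full aws s3 sync command per line:
# kill the still-running syncs as soon as any one of them fails
cat sync_commands.txt | parallel --halt now,fail=1 -j 5 "{}"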
Currently, I have a driver program that runs several thousand instances of a "payload" program and does some post-processing of the output. The driver currently calls the payload program directly, using a shell() function, from multiple threads. The shell() function executes a command in the current working directory, blocks until the command is finished running, and returns the data that was sent to stdout by the command. This works well on a single multicore machine. I want to modify the driver to submit qsub jobs to a large compute cluster instead, for more parallelism.
Is there a way to make the qsub command output its results to stdout instead of a file and block until the job is finished? Basically, I want it to act as much like "normal" execution of a command as possible, so that I can parallelize to the cluster with as little modification of my driver program as possible.
Edit: I thought all the grid engines were pretty much standardized. If they're not and it matters, I'm using Torque.
You don't mention what queuing system you're using, but SGE supports the '-sync y' option to qsub which will cause it to block until the job completes or exits.
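For example (SGE only, not Torque):
# does not return until the job has finished
qsub -sync y myscript.sh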
In TORQUE this is done using the -x and -I options. qsub -I specifies that it should be interactive and -x says run only the command specified. For example:
qsub -I -x myscript.sh
will not return until myscript.sh finishes execution.
In PBS you can use qsub -Wblock=true <command>
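Combining that blocking behaviour with a fixed output file gets you close to "run and capture stdout" from the driver's point of view (a sketch; out.$$ is just a placeholder file name, and it assumes the scheduler has copied the output file back by the time qsub returns):
# -j oe merges stderr into stdout, -o fixes where it lands, -Wblock=true waits for the job
qsub -Wblock=true -j oe -o out.$$ myscript.sh
cat out.$$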