Run shell script in parallel with more jobs than CPUs, and after a job is finished, instantly take the available spot [duplicate] - bash

This question already has answers here:
Parallelize Bash script with maximum number of processes
(16 answers)
I'm running a shell script, script.sh, in parallel; each of its lines goes into a folder and runs a Fortran code:
cd folder1 && ./code &
cd folder2 && ./code &
cd folder3 && ./code &
cd folder4 && ./code &
..
cd folder96 && ./code &
wait
cd folder97 && ./code &
..
..
..
cd folder2500 && ./code
There are around 2500 folders and the code outputs are independent of each other. I have access to 96 CPUs and each job uses around 1% of a CPU, so I run 96 jobs in parallel using & and the wait command. For various reasons, not all 96 jobs finish at the same time: some take 40 minutes, some 90 minutes, a significant difference. So I was wondering whether it is possible for new jobs to take the CPUs freed by the jobs that finish earlier, in order to optimize the total execution time.
I also tried GNU Parallel:
parallel -a script.sh
but it had the same issue, and I could not find anyone on the internet with a similar problem.

You can use GNU Parallel:
parallel 'cd {} && ./code' ::: folder*
That will keep all your cores busy, starting a new job immediately as each job finishes.
If you only want to run 48 jobs in parallel, use:
parallel -j 48 ...
If you want to do a dry run and see what would be run, without actually running anything, use:
parallel --dry-run ...
If you want to see a progress report, use:
parallel --progress ...
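For the folder layout in the question (folder1 through folder2500, each containing ./code), a combined invocation might look like this; it is only a sketch, and -j 96 is needed only if you want to cap the job count explicitly, since parallel defaults to one job per CPU core:
parallel -j 96 --progress 'cd {} && ./code' ::: folder*
Appending --dry-run to the same line first is a cheap way to inspect the generated commands before running anything.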

One bash/wait -n approach:
jobmax=96
jobcnt=0
for ((i=1; i<=2500; i++))
do
    ((++jobcnt))
    [[ "${jobcnt}" -gt "${jobmax}" ]] && { wait -n; ((--jobcnt)); }    # if jobcnt > 96: wait for any one job to finish (regardless of its exit status), then free a slot
    ( cd "folder$i" && ./code ) &    # kick off new job
done
wait    # wait for rest of jobs to complete
NOTES:
when the jobs complete quickly (e.g., < 1 sec) it's possible that more than one job completes during the wait -n; start new job; wait -n cycle, in which case you could end up with fewer than jobmax jobs running at a time (i.e., jobcnt is higher than the actual number of running jobs)
however, in this scenario, where each job is expected to take XX minutes to complete, the likelihood of multiple jobs completing during the wait -n; start new job; wait -n cycle should be greatly diminished, if not eliminated
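A variation on the same idea that avoids the counter drift described above is to recompute the number of running jobs each time instead of tracking jobcnt; this is only a sketch, assuming bash 4.3 or later for wait -n:
jobmax=96
for ((i=1; i<=2500; i++))
do
    # block while the real number of running background jobs is at the limit
    while (( $(jobs -pr | wc -l) >= jobmax )); do
        wait -n    # returns as soon as any one job finishes
    done
    ( cd "folder$i" && ./code ) &
done
wait    # wait for the remaining jobs to complete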

Related

Holding job script after completion of one simulation

I run multiple serial jobs on HPC. For example, if I have 10 simulations, I use 10 cores on the HPC, one core per simulation. However, the simulations end at different times, and as soon as one simulation completes, all the others stop as well. How do I hold the job script so that even if one simulation is completed, the others keep running; in simple words, so the job script stays on the HPC? An example of my job script:
#!/bin/bash
#SBATCH --job-name=CaseName # name of the job
#SBATCH --ntasks=60 # number of requested cores
#SBATCH --cpus-per-task=1
#SBATCH --time=7-00:00:00 # time limit
#SBATCH --partition=core64 # queue
cd Folder1
for i in {1..5}
do
    cd Folder$i
    for j in {1..6}
    do
        cd SubFolder$j
        application > log 2>&1 &
        cd ..
    done
    cd ..
done
cd ..
cd LastFolder
application > log 2>&1
Is there any command I can add to the job script to do so? That is, a command that keeps the remaining jobs running on the HPC after one simulation ends.
You need a wait at the end of your script: you run the jobs in the background, and you want the script to exit only when all of them have finished.
from man bash:
wait [-fn] [-p varname] [id ...]
Wait for each specified child process and return
its termination status. ...
...
If id is not given, wait waits for all running background jobs...
There's something wrong with your cd logic.
Perhaps try running the cd and the application in a subshell, e.g.
(cd SubFolder$j ; application > log 2>&1 & )
That way, you can be assured that every command runs concurrently in its own subdirectory, without impacting the others.
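Putting the two points together (a trailing wait plus a subshell per simulation), the body of the job script might look like the following sketch; it reuses the folder names from the question, but the exact nesting of Folder1/Folder$i was flagged as suspect above, so adjust the paths to your real layout:
#!/bin/bash
#SBATCH --job-name=CaseName
#SBATCH --ntasks=60
#SBATCH --cpus-per-task=1
#SBATCH --time=7-00:00:00
#SBATCH --partition=core64
for i in {1..5}
do
    for j in {1..6}
    do
        # the cd is confined to the subshell, so jobs do not affect each other
        ( cd "Folder$i/SubFolder$j" && application > log 2>&1 ) &
    done
done
( cd LastFolder && application > log 2>&1 ) &
wait    # keep the job script alive until every simulation has finished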

is there a way to trigger 10 scripts at any given time in Linux shell scripting?

I have a requirement where I need to trigger 10 shell scripts at a time. I may have 200+ shell scripts to be executed.
e.g., if I trigger 10 jobs and two of them complete, I need to trigger another 2 jobs to bring the number currently executing back to 10.
I need your help and suggestions to meet this requirement.
Yes with GNU Parallel like this:
parallel -j 10 < ListOfJobs.txt
Or, if your jobs are called job_1.sh to job_200.sh:
parallel -j 10 job_{}.sh ::: {1..200}
Or, if your jobs are named with discontiguous, random names but are all shell scripts with a .sh suffix in one directory:
parallel -j 10 ::: *.sh
There is a very good overview in the GNU Parallel documentation, and there are lots of related questions and answers on Stack Overflow.
Simply run them as background jobs:
for i in {1..10}; { ./script.sh & }
Adding more jobs if fewer than 10 are running:
while true; do
    pids=($(jobs -pr))
    ((${#pids[@]} < 10)) && ./script.sh &
done &> /dev/null
There are different ways to handle this:
Launch them together as background tasks (1)
Launch them in parallel (1)
Use the crontab (2)
Use at (3)
Explanations:
(1) You can launch the processes exactly when you like (by running a command, clicking a button, or whatever event you choose)
(2) The processes will be launched at the same time, every (working) day, periodically.
(3) You choose a time when the processes will be launched together once (see the sketch below).
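For options (2) and (3), hypothetical one-liners could look like this; the launcher path is a placeholder for whatever script starts your 10 jobs:
# (2) crontab entry: start the launcher every working day at 02:00
# 0 2 * * 1-5 /path/to/launch_jobs.sh
# (3) at: start the launcher once, tonight at 23:00
echo "/path/to/launch_jobs.sh" | at 23:00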
I have used the snippet below to trigger 10 jobs at a time.
max_jobs_trigger=10
while mapfile -t -n ${max_jobs_trigger} ary && ((${#ary[@]})); do
    jobs_to_trigger=$(printf '%s\n' "${ary[@]}")
    # Trigger script in background
done
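A more complete sketch of that mapfile idea, assuming the commands live one per line in a file (ListOfJobs.txt here is just an assumed name):
#!/bin/bash
max_jobs_trigger=10
# read the list in chunks of 10 lines and run each chunk as background jobs
while mapfile -t -n "$max_jobs_trigger" ary && ((${#ary[@]})); do
    for job in "${ary[@]}"; do
        bash -c "$job" &
    done
    wait    # let this batch of 10 finish before reading the next chunk
done < ListOfJobs.txt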

How to run multiple instances of command-line tool in bash script? + user input for script

I am trying to launch multiple instances of imagesnap simultaneously from a single bash script on a Mac. Also, it would be great to give (some of) the arguments by user input when running the script.
I have 4 webcams connected, and want to take series of images from each camera with a given interval. Being an absolute beginner with bash scripts, I don't know where to start searching. I have tested that 4 instances of imagesnap works nicely when running them manually from Terminal, but that's about it.
To summarise, I'm looking to make a bash script that:
runs multiple instances of imagesnap;
takes user input for some of the arguments to imagesnap;
ideally starts all the imagesnap instances at (almost) the same time.
--EDIT--
After thinking about this I have a vague idea of how this script could be organised using the ability to take interval images with imagesnap -t x.xx:
Run multiple scripts from within the main script
or
Use subshells to run multiple instances of imagesnap
Start each sub script or subshell in parallel if possible.
Since each instance of imagesnap will run until terminated, it would be great if they could all be stopped with a single command.
the following quick hack (saved as run-periodically.sh) might do the right thing:
#!/bin/bash
interval=5
start=$(date +%s)
while true; do
    # run four copies of the given command in the background
    for i in 1 2 3 4; do
        "$@" &
    done
    # wait for all background jobs to finish
    wait
    # figure out how long we have to sleep
    end=$(date +%s)
    delta=$((start + interval - end))
    # if it's positive, sleep for this amount of time
    if [ "$delta" -gt 0 ]; then
        sleep "$delta" || exit
    fi
    start=$((start + interval))
done
if you put this script somewhere appropriate and make it executable, you can run it like:
run-periodically.sh imagesnap arg1 arg2
but while testing, I ran with:
sh run-periodically.sh sh -c "date; sleep 2"
which will cause four copies of "start a shell that displays the date then waits a couple of seconds" to be run in parallel every interval seconds. if you want to run different things in the different jobs, then you might want to put them into this script explicitly, or maybe into another script which this one calls…
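To cover the other two wishes (user input and a single stop command), a hypothetical wrapper could look like the sketch below; the -d and -t options and the camera names are assumptions, so check imagesnap -l and the man page for your setup:
#!/bin/bash
# prompt the user for the capture interval
read -rp "Capture interval in seconds: " interval
# one Ctrl-C (or kill of this script) stops every capture
trap 'kill $(jobs -pr) 2>/dev/null' INT TERM EXIT
# camera names are placeholders; imagesnap -l lists the real device names
for cam in "Camera 1" "Camera 2" "Camera 3" "Camera 4"; do
    imagesnap -d "$cam" -t "$interval" &
done
wait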

Running a queue of MPI calls in parallel with SLURM and limited resources

I'm trying to run a Particle Swarm Optimization problem on a cluster using SLURM, with the optimization algorithm managed by a single-core matlab process. Each particle evaluation requires multiple MPI calls that alternate between two Python programs until the result converges. Each MPI call takes up to 20 minutes.
I initially naively submitted each MPI call as a separate SLURM job, but the resulting queue time made it slower than running each job locally in serial. I am now trying to figure out a way to submit an N node job that will continuously run MPI tasks to utilize the available resources. The matlab process would manage this job with text file flags.
Here is a pseudo-code bash file that might help to illustrate what I am trying to do on a smaller scale:
#!/bin/bash
#SBATCH -t 4:00:00 # walltime
#SBATCH -N 2 # number of nodes in this job
#SBATCH -n 32 # total number of processor cores in this job
# Set required modules
module purge
module load intel/16.0
module load gcc/6.3.0
# Job working directory
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
# Run Command
while <"KeepRunning.txt” == 1>
do
for i in {0..40}
do
if <“RunJob_i.txt” == 1>
then
mpirun -np 8 -rr -f ${PBS_NODEFILE} <job_i> &
fi
done
done
wait
This approach doesn't work (just crashes), but I don't know why (probably overutilization of resources?). Some of my peers have suggested using parallel with srun, but as far as I can tell this requires that I call the MPI functions in batches. This will be a huge waste of resources, as a significant portion of the runs finish or fail quickly (this is expected behavior). A concrete example of the problem would be starting a batch of 5 8-core jobs and having 4 of them crash immediately; now 32 cores would be doing nothing while they wait up to 20 minutes for the 5th job to finish.
Since the optimization will likely require upwards of 5000 mpi calls, any increase in efficiency will make a huge difference in absolute walltime. Does anyone have any advice as to how I could run a constant stream of MPI calls on a large SLURM job? I would really appreciate any help.
A couple of things: under SLURM you should be using srun, not mpirun.
The second thing is that the pseudo-code you provided launches an infinite number of jobs without waiting for any completion signal. You should put the wait inside the while loop, so you launch just one set of jobs, wait for them to finish, evaluate the condition and, maybe, launch the next set of jobs:
#!/bin/bash
#SBATCH -t 4:00:00    # walltime
#SBATCH -N 2          # number of nodes in this job
#SBATCH -n 4          # total number of tasks in this job
#SBATCH -c 8          # number of processor cores for each task
# Set required modules
module purge
module load intel/16.0
module load gcc/6.3.0
# Job working directory
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
# Run Command
while <"KeepRunning.txt" == 1>
do
    for i in {0..40}
    do
        if <"RunJob_i.txt" == 1>
        then
            srun -n 8 --exclusive <job_i> &
        fi
    done
    wait
    <Update "KeepRunning.txt">
done
Take care also to distinguish tasks from cores: -n says how many tasks will be used, -c says how many CPUs per task will be allocated.
The code above will launch 41 jobs in the background (0 to 40, inclusive), but they will only start once the resources are available (--exclusive), waiting while resources are occupied. Each job will use 8 CPUs. The script then waits for them to finish, and I assume you will update KeepRunning.txt after that round.
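If waiting for a whole batch wastes too many cores (the concern above about runs that fail quickly), the wait -n pattern from the first answer can be combined with srun so that a new step starts as soon as any step finishes; this is only a sketch, assuming bash 4.3+ and a placeholder job_$i for the real MPI invocation:
#!/bin/bash
#SBATCH -t 4:00:00
#SBATCH -N 2
#SBATCH -n 32
maxsteps=4    # 4 concurrent steps x 8 tasks each = the 32 allocated tasks
for i in {0..40}
do
    while (( $(jobs -pr | wc -l) >= maxsteps )); do
        wait -n    # refill the slot as soon as any step ends
    done
    srun -n 8 --exclusive job_$i &    # job_$i stands in for the real MPI program
done
wait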

shell script to loop and start processes in parallel?

I need a shell script that will create a loop to start parallel tasks read in from a file...
Something along the lines of:
#!/bin/bash
mylist=/home/mylist.txt
for i in $(cat "$mylist")
do
    cp -rp "$i" /destination &    # do something like this for each entry
done
wait
So what I am trying to do is send a bunch of tasks to the background with "&" for each line in $mylist, and wait for them to finish before exiting.
However, there may be a lot of lines in there, so I want to control how many parallel background processes get started; I want to be able to cap it at, say, 5 or 10.
Any ideas?
Thank you
Your operating system's scheduler will make it seem like you can run many parallel jobs; how many you can actually run with maximum efficiency depends on your processor. Overall, you don't have to worry about starting too many processes, because the system will schedule them for you. If you want to limit them anyway, because the number could get absurdly high, you could use something like this (provided you execute a cp command every time):
...
while ...; do
    jobs=$(pgrep 'cp' | wc -l)
    [[ $jobs -gt 50 ]] && { sleep 100; continue; }    # too many copies running: pause, then re-check
...
done
The number of running cp commands will be stored in the jobs variable and before starting a new iteration it will check if there are too many already. Note that we jump to a new iteration so you'd have to keep track of how many commands you already executed. Alternatively you could use wait.
Edit:
On a side note, you can assign a specific CPU core to a process using taskset; it may come in handy when you have fewer, more complex commands.
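Filled in with the question's mylist.txt, the pgrep idea above might look like this sketch (the limit of 5 is arbitrary):
#!/bin/bash
mylist=/home/mylist.txt
maxjobs=5
while IFS= read -r src; do
    # throttle: pause while the maximum number of copies is already running
    while (( $(pgrep -cx cp) >= maxjobs )); do
        sleep 1
    done
    cp -rp "$src" /destination &
done < "$mylist"
wait    # let the last copies finish before the script exits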
You are probably looking for something like this using GNU Parallel:
parallel -j10 cp -rp {} /destination :::: /home/mylist.txt
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines to which you have ssh access.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new job as soon as one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
