I have a gnu parallel script that imports data (100,000 jobs distributed to 100 remote servers) into a central database. The first jobs are slamming the central db because they finish almost at the same time. The jobs after that eventually spread out and don't try to import all at the same time.
Is there a way to delay the execution of the first job on each remote server? So the script can say "run process #1 on server1 now, run process #2 on server2 in 5 seconds, run process #3 on server3 in 10 seconds, run process #4 on server4 in 20 seconds, ...". After that first batch has been sent to each server, I'd like the rest of the processes to run as soon as possible.
Is there a param for this?
Yes: --delay (available from version 20121222) and --sshdelay (from version 20130122).
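As a rough sketch of how the two options might be combined for the setup in the question (servers.txt, import_job, and the data_*.csv arguments are placeholders, not from the original script):
# --delay 5: wait 5 seconds between starting jobs; --sshdelay 0.2: wait 0.2 seconds between ssh invocations
parallel --delay 5 --sshdelay 0.2 --slf servers.txt 'import_job {}' ::: data_*.csv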
EDIT:
The --delay is measured from the start of one job to the start of the next, so if your jobs run longer than delay*jobslots (e.g. 5 sec * 100 servers = 500 sec in your example), it will feel as if there is no delay after the first batch.
Compare:
time parallel -S 2/: --delay 1 'sleep {};hostname' ::: 2 2
To:
time parallel -S 2/: --delay 1 'sleep {};hostname' ::: 2 2 2
The first takes 3 seconds, the second 4: with 2 job slots and --delay 1 the first two jobs start at t=0 and t=1 and each sleeps 2 seconds, so the last finishes at t=3; in the second command the third job starts as soon as a slot frees at t=2 (the 1-second delay since the previous start has already elapsed) and finishes at t=4.
There is no built-in functionality for delaying only the first batch. What you can do is something like:
parallel 'if [ {#} -lt 100 ] ; then sleep {#} ; fi; do_stuff {}'
where 100 is the size of the first batch.
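Applied to the 100-server setup from the question, that workaround might look roughly like this (servers.txt, import_job, the data_*.csv arguments, and the 5-second stagger are illustrative assumptions, not part of the original):
# stagger only the first 100 jobs (5, 10, 15, ... seconds); every later job starts immediately
parallel --slf servers.txt 'if [ {#} -le 100 ]; then sleep $(( {#} * 5 )); fi; import_job {}' ::: data_*.csv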
I'm running a shell script script.sh in parallel; each of its lines goes into a folder and runs a Fortran code:
cd folder1 && ./code &
cd folder2 && ./code &
cd folder3 && ./code &
cd folder4 && ./code &
..
cd folder96 && ./code
wait
cd folder97 && ./code
..
..
..
cd folder2500 && ./code
There are around 2500 folders and the code outputs are independent of each other. I have access to 96 CPUs and each job uses around 1% of the CPU, so I run 96 jobs in parallel using & and the wait command. For various reasons, not all 96 jobs finish at the same time: some take 40 minutes, some 90 minutes, a significant difference. So I was wondering whether the jobs that finish earlier could hand their CPUs to new jobs in order to optimize the total execution time.
I also tried GNU Parallel:
parallel -a script.sh
but it had the same issue, and I could not find anyone on the internet with a similar problem.
You can use GNU Parallel:
parallel 'cd {} && ./code' ::: folder*
That will keep all your cores busy, starting a new job immediately as each job finishes.
If you only want to run 48 jobs in parallel, use:
parallel -j 48 ...
If you want to do a dry run and see what would run but without actually running anything, use:
parallel --dry-run ...
If you want to see a progress report, use:
parallel --progress ...
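Putting those options together for this case might look like the following; the folder* glob and ./code come from the question, while the -j 48 limit is just the illustrative value from above:
# preview the generated commands without running anything
parallel --dry-run 'cd {} && ./code' ::: folder*
# run at most 48 jobs at a time and show a progress report
parallel --progress -j 48 'cd {} && ./code' ::: folder*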
One bash/wait -n approach:
jobmax=96
jobcnt=0
for ((i=1; i<=2500; i++))
do
    ((++jobcnt))
    [[ "${jobcnt}" -gt "${jobmax}" ]] && wait -n && ((--jobcnt))   # if jobcnt > 96 => wait for a job to finish, decrement jobcnt, then continue
    ( cd "folder$i" && ./code ) &                                  # kick off new job
done
wait # wait for rest of jobs to complete
NOTES:
when jobs complete quickly (e.g., < 1 sec), more than one job could finish during a wait -n; start new job; wait -n cycle, in which case you could end up with fewer than jobmax jobs running at a time (i.e., jobcnt is higher than the actual number of running jobs)
however, in this scenario, where each job is expected to take tens of minutes to complete, the likelihood of multiple jobs finishing during that cycle should be greatly diminished (if not eliminated); see the sketch after these notes for a variant that cannot drift
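If that drift matters, one option is to recount the running background jobs instead of keeping a separate counter. This is only a sketch, assuming bash 4.3+ for wait -n; jobs -pr prints the PIDs of the still-running background jobs:
#!/bin/bash
jobmax=96
for ((i=1; i<=2500; i++))
do
    # while we are at the limit, wait for any one job to finish, then recount
    while (( $(jobs -pr | wc -l) >= jobmax ))
    do
        wait -n
    done
    ( cd "folder$i" && ./code ) &   # kick off new job
done
wait    # wait for the rest of the jobs to complete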
I'm trying to run a Particle Swarm Optimization problem on a cluster using SLURM, with the optimization algorithm managed by a single-core matlab process. Each particle evaluation requires multiple MPI calls that alternate between two Python programs until the result converges. Each MPI call takes up to 20 minutes.
I initially naively submitted each MPI call as a separate SLURM job, but the resulting queue time made it slower than running each job locally in serial. I am now trying to figure out a way to submit an N node job that will continuously run MPI tasks to utilize the available resources. The matlab process would manage this job with text file flags.
Here is a pseudo-code bash file that might help to illustrate what I am trying to do on a smaller scale:
#!/bin/bash
#SBATCH -t 4:00:00 # walltime
#SBATCH -N 2 # number of nodes in this job
#SBATCH -n 32 # total number of processor cores in this job
# Set required modules
module purge
module load intel/16.0
module load gcc/6.3.0
# Job working directory
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
# Run Command
while <"KeepRunning.txt” == 1>
do
for i in {0..40}
do
if <"RunJob_i.txt" == 1>
then
mpirun -np 8 -rr -f ${PBS_NODEFILE} <job_i> &
fi
done
done
wait
This approach doesn't work (just crashes), but I don't know why (probably overutilization of resources?). Some of my peers have suggested using parallel with srun, but as far as I can tell this requires that I call the MPI functions in batches. This will be a huge waste of resources, as a significant portion of the runs finish or fail quickly (this is expected behavior). A concrete example of the problem would be starting a batch of 5 8-core jobs and having 4 of them crash immediately; now 32 cores would be doing nothing while they wait up to 20 minutes for the 5th job to finish.
Since the optimization will likely require upwards of 5000 mpi calls, any increase in efficiency will make a huge difference in absolute walltime. Does anyone have any advice as to how I could run a constant stream of MPI calls on a large SLURM job? I would really appreciate any help.
A couple of things: under SLURM you should be using srun, not mpirun.
The second thing is that the pseudo-code you provided launches an unbounded number of jobs without waiting for any completion signal. You should move the wait inside the while loop, so you launch just one set of jobs, wait for them to finish, evaluate the condition and, maybe, launch the next set of jobs:
#!/bin/bash
#SBATCH -t 4:00:00 # walltime
#SBATCH -N 2 # number of nodes in this job
#SBATCH -n 4 # total number of tasks in this job
#SBATCH -c 8 # number of processor cores for each task
# Set required modules
module purge
module load intel/16.0
module load gcc/6.3.0
# Job working directory
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
# Run Command
while <"KeepRunning.txt” == 1>
do
for i in {0..40}
do
if <"RunJob_i.txt" == 1>
then
srun -n 8 --exclusive <job_i> &
fi
done
wait
<Update "KeepRunning.txt”>
done
Take care also to distinguish tasks from cores: -n says how many tasks will be used, -c says how many CPUs per task will be allocated.
The code I wrote will launch 41 jobs in the background (from 0 to 40, inclusive), but they will only start once the resources are available (--exclusive), waiting while the resources are occupied. Each job will use 8 CPUs. Then you wait for them to finish, and I assume you will update KeepRunning.txt after that round.
I have a script that runs through a list of servers, connects to each, and grabs files over SCP to store.
Occasionally, for various reasons, one of the servers crashes and my script gets stuck for around 4 hours before moving on through the list.
I would like to detect a connection issue, or a certain amount of time elapsed after the command has started, then kill that command and move on to the next.
I suspect this would involve a wait or sleep and continue, but I am new to loops and bash.
#!/bin/bash
#
# Generate a list of backups to grab
df|grep backups|awk -F/ '{ print $NF }'>/tmp/backuplistsmb
# Get each backup in turn
for BACKUP in `cat /tmp/backuplistsmb`
do
cd /srv/backups/$BACKUP
scp -o StrictHostKeyChecking=no $BACKUP:* .
sleep 3h
done
The above script works fine but gets stuck for 4 hours should there be a connection issue. It is worth noting that some of the transfers take 10 minutes and some 2.5 hours.
Any ideas or help would be very much appreciated.
Try to use the timeout program for that:
Usage:
timeout [OPTION] DURATION COMMAND [ARG]...
E.g. time (timeout 3 sleep 5) will run for 3 secs.
So in your code you can use:
timeout 300 scp -o StrictHostKeyChecking=no $BACKUP:* .
This limits the copy to 5 minutes.
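In the loop from the question that could look roughly like this; GNU timeout exits with status 124 when it kills the command, so the script can log the failure and move on. The 3-hour limit is only an assumption based on the longest transfers mentioned, not a value from the original:
for BACKUP in `cat /tmp/backuplistsmb`
do
    cd /srv/backups/$BACKUP || continue
    # kill the copy if it has not finished within 3 hours
    timeout 3h scp -o StrictHostKeyChecking=no $BACKUP:* .
    if [ $? -eq 124 ]; then
        echo "Transfer from $BACKUP timed out, moving on" >&2
    fi
done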
Here's what I'm trying to do. I have 4 shell scripts. Script 1 needs to be run first, then 2, then 3, then 4, and they must be run in that order. Script 1 needs to be running (and waiting in the background) for 2 to function properly, but 1 takes about 3 seconds to get ready for use. I tried doing ./1.sh & ./2.sh & ./3.sh & ./4.sh, but this results in a total mess, since 2 starts requesting things from 1 when 1 is not ready yet. So, my question is: from one shell script, how do I start script 1, wait about 5 seconds, start script 2, wait about 5 seconds, and so on, without stopping any previous script (i.e. they all have to keep running in the background for any higher-numbered script to work)? Any suggestions would be much appreciated!
May I introduce you to the sleep command?
./1.sh & sleep 5
./2.sh & sleep 5
./3.sh & sleep 5
./4.sh
#!/bin/sh
./1.sh & sleep 5; ./2.sh & sleep 5; ./3.sh & sleep 5; ./4.sh
Is there a simple way to do the equivalent of this, but run the two processes concurrently with bash?
$ time sleep 5; sleep 8
time should report a total of 8 seconds (or the amount of time of the longest task)
$ time (sleep 5 & sleep 8 & wait)
real 0m8.019s
user 0m0.005s
sys 0m0.005s
Without any arguments, the shell built-in wait waits for all backgrounded jobs to complete.
Using sleeps as examples.
If you want to only time the first process, then
time sleep 10 & sleep 20
If you want to time both processes, then
time (sleep 10 & sleep 20)
time sleep 8 & time sleep 5
The & operator causes the first command to run in the background, which practically means that the two commands will run concurrently.
Sorry my question may not have been exactly clear the first time around, but I think I've found an answer, thanks to some direction given here.
time sleep 5& time sleep 8
will time both processes while they run concurrently, then I'll just take the larger result.
If you have GNU Parallel http://www.gnu.org/software/parallel/ installed you can do this:
time parallel sleep ::: 5 8
You can install GNU Parallel simply by:
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
Watch the intro videos for GNU Parallel to learn more:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1