I have the following array job setup:
#BSUB -J "myArray[1-50]"
#BSUB -o o.%J.%I
#BSUB -e e.%J.%I
#BSUB -W 10:00
#BSUB -N
#BSUB -u myEmail@somewhere.com
This will send me 50 emails once the jobs complete, with something like this in the subject line:
Email1: Job 123456[1]: <myArray[1-50]> in cluster <xyz-cluster> Done
Email2: Job 123456[2]: <myArray[1-50]> in cluster <xyz-cluster> Done
...
Email50: Job 123456[50]: <myArray[1-50]> in cluster <xyz-cluster> Done
Is it possible to send an email only if a job has failed? For example, if the 11th, 12th, and 25th jobs out of 50 fail, I would get only 3 emails:
Email1: Job 123456[11]: <myArray[1-50]> in cluster <xyz-cluster> Exited
Email2: Job 123456[12]: <myArray[1-50]> in cluster <xyz-cluster> Exited
Email3: Job 123456[25]: <myArray[1-50]> in cluster <xyz-cluster> Exited
Let me know if anything is unclear.
LSF Application Center has a notification feature that can send emails when a job starts or ends; please check this page for the configuration details: https://www.ibm.com/support/knowledgecenter/en/SSZRJV_10.2.0/manage_jobs/submit_jobs_help.html
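If you cannot use Application Center, one possible workaround is to drop -N/-u and have the job script send the mail itself only when the payload fails. This is only a sketch: LSB_JOBID and LSB_JOBINDEX are set by LSF in the job environment, but the payload command and the use of mail(1) below are assumptions, not something LSF provides for this.
#!/bin/bash
#BSUB -J "myArray[1-50]"
#BSUB -o o.%J.%I
#BSUB -e e.%J.%I
#BSUB -W 10:00
# note: no -N / -u; the script mails on failure itself

./my_command "$LSB_JOBINDEX"    # hypothetical per-element payload
status=$?
if [ "$status" -ne 0 ]; then
    echo "exit status $status" | mail -s "Job ${LSB_JOBID}[${LSB_JOBINDEX}] Exited" myEmail@somewhere.com
fi
exit "$status"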
Has anyone had a problem with snakemake recognizing a timed-out job? I submit jobs to a cluster using qsub with a time-out set per rule:
snakemake --jobs 29 -k -p --latency-wait 60 --use-envmodules \
--cluster "qsub -l walltime={resources.walltime},nodes=1:ppn={threads},mem={resources.mem_mb}mb"
If a job fails within a script, the next one in line is executed. When a job hits the time-out defined in a rule, however, the next job in line is not executed, reducing the total number of jobs run in parallel on the cluster over time. According to the MOAB scheduler (PBS server), a timed-out job returns a -11 exit status. As far as I understand, any non-zero exit status means failure - or does this only apply to positive integers?
Thanks in advance for any hint:)
If you don't provide a --cluster-status script, snakemake internally checks job status via hidden marker files that the submitted job script touches. When a job times out, the snakemake process on the node doesn't get a chance to report the failure back to the main snakemake instance, because qsub kills it first.
You can try a cluster profile or just grab a suitable cluster-status script (be sure to chmod it executable and to have qsub report a parsable job ID).
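As a minimal sketch of such a cluster-status script for a PBS/TORQUE-style scheduler (the exact qstat field names are an assumption and vary between installations): snakemake calls the script with the job ID as its only argument and expects it to print running, success, or failed.
#!/bin/bash
# minimal --cluster-status sketch; assumes TORQUE/PBS `qstat -f` output
# with "job_state = ..." and "exit_status = ..." lines
jobid="$1"
state=$(qstat -f "$jobid" 2>/dev/null | awk '/job_state/ {print $3}')
case "$state" in
    Q|R|H|W) echo running ;;
    C|E)
        rc=$(qstat -f "$jobid" | awk '/exit_status/ {print $3}')
        if [ "${rc:-1}" -eq 0 ]; then echo success; else echo failed; fi
        ;;
    *)  # job no longer known to the scheduler (e.g. killed at the walltime limit)
        echo failed ;;
esac
Saved as, say, status.sh and made executable, it would be passed as --cluster-status ./status.sh alongside a --cluster command that prints only the job ID.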
Hello friendly people,
my question is rather specific.
For more than a week, I have been trying to submit thousands of single-thread jobs for a scientific experiment using sbatch and srun.
The problem is that these jobs may take different amounts of time to finish, and some may even be aborted when they exceed the memory limit. Both behaviours are fine and my evaluation deals with them.
But I am facing the problem that some of the jobs never start, even though they have been submitted.
My sbatch script looks like this:
#!/usr/bin/bash
#SBATCH --nodes=4
#SBATCH --tasks-per-node=12
#SBATCH --mem-per-cpu=10000
for i in {1..500}
do
srun -N1 -n1 -c1 --exclusive --time=60 ${mybinary} $i &
wait 5s
done
Now, my error log shows the following message:
srun: Job 1846955 step creation temporarily disabled, retrying
1) What does 'step creation temporarily disabled' mean? Are all CPUs busy and the step dropped, or is it started again later when resources are free?
2) Why are some of my jobs not carried out, and how can I fix this? Am I using the correct parameters for srun?
Thanks for your help!
srun: Job 1846955 step creation temporarily disabled, retrying
This is normal: you reserve 4 x 12 CPUs and start 500 instances of srun. Only 48 instances run at any one time, while the others output that message. Whenever a running instance stops, a pending instance starts.
wait 5s
The wait command waits for processes, not for a certain amount of time; for that, use the sleep command. The wait command must instead go at the end of the script, otherwise the job could stop before all srun instances have finished.
So the script should look like this:
#!/usr/bin/bash
#SBATCH --nodes=4
#SBATCH --tasks-per-node=12
#SBATCH --mem-per-cpu=10000
for i in {1..500}
do
srun -N1 -n1 -c1 --exclusive --time=60 ${mybinary} $i &
done
wait
After using the Slurm cluster manager to sbatch a job with multiple processes, is there a way to know the status (running or finished) of each process? Can it be implemented in a Python script?
Just use the command sacct that comes with Slurm.
Given this code (my.sh):
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=2
srun -n1 sleep 10 &
srun -n1 sleep 3
wait
I run it:
sbatch my.sh
And then check on it with sacct:
sacct
Which gives me per-step info:
JobID JobName Partition Account AllocCPUS State ExitCode
---------- ---------- ---------- ---------- ---------- ---------- --------
8021 my.sbatch CLUSTER me 2 RUNNING 0:0
8021.0 sleep me 1 RUNNING 0:0
8021.1 sleep me 1 COMPLETED 0:0
sacct has a lot of options to customize its output. For example,
sacct --format='JobID%6,State'
will give you just the job IDs (up to 6 characters) and the current state of the jobs:
JobID State
------ ----------
8021 RUNNING
8021.0 RUNNING
8021.1 COMPLETED
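If you want to consume this from a script (for example the Python script mentioned in the question), sacct's machine-readable output is easier to split than the fixed-width table; the job ID below is just a placeholder:
# one "JobID|State|ExitCode" line per job and step, '|'-separated, no header
sacct -j 8021 --format=JobID,State,ExitCode --parsable2 --noheader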
If the processes you mention are distinct steps, then sacct can give you the information, as explained by @Christopher Bottoms.
But if the processes are different tasks within a single step, then you can use this script, which uses parallel SSH to run ps on the compute nodes and offers a summarised view, as @Tom de Geus suggests.
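A rough sketch of that idea (not the linked script itself; the ps field list and SSH access to the nodes are assumptions): ask Slurm which nodes the job is running on, then run ps on each of them over SSH.
#!/bin/bash
# sketch: list this user's processes on every node of job $1
jobid="$1"
# expand the job's node list (e.g. node[01-03]) into individual host names
nodes=$(squeue -j "$jobid" -h -o %N | xargs scontrol show hostnames)
for node in $nodes; do
    echo "== $node =="
    ssh "$node" "ps -u $USER -o pid,stat,etime,comm"
done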
I have built SGE from source on a four-node cluster. The operating system is CentOS 7. When I submit some simple tasks to the cluster, I find that only one task runs on one node. What's the problem? Here is my task script:
sleep 60
echo "done"
and this is my command to submit the tasks:
DIR=`pwd`
option=""
for((i=0;i<5;i++));do
qsub -q multislots $option -V -cwd -o stdout -e stderr -S /bin/bash $DIR/test.sh
sleep 1
done
When I run qstat -f, it shows the following (posted as a screenshot, not reproduced here).
Given the error message about jobs failing because they "can not find an unused add_grp_id": you should check what gid_range is set to in the SGE configuration (both the global configuration and any per-host one). It should be a range of otherwise unused group IDs, with at least as many GIDs as the number of jobs you want per node.
If that isn't it, try running qalter -w v and qalter -w p on one of the queued jobs to see why they aren't being started.
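For reference, gid_range can be inspected and changed with qconf (the host name below is a placeholder):
# show the global configuration and look for gid_range
qconf -sconf | grep gid_range
# check whether a host-specific configuration overrides it
qconf -sconf node01 | grep gid_range
# widen the range in the global configuration (opens an editor)
qconf -mconf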
I have tried a simple Map/Reduce task using Amazon Elastic MapReduce and it took just 3 minutes to complete. Is it possible to re-use the same instance to run another task?
Even though I used the instance for only 3 minutes, Amazon will charge for a full hour, so I want to use the remaining 57 minutes to run several other tasks.
The answer is yes.
Here's how you do it using the command-line client:
When you create a job flow, pass the --alive flag; this tells EMR to keep the cluster around after your job has run.
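For example (the job-flow name is just a placeholder):
elastic-mapreduce --create --alive --name "my persistent job flow"
# the client prints the id of the new job flow; use it as <job-id> below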
Then you can submit more tasks to the cluster:
elastic-mapreduce --jobflow <job-id> --stream --input <s3dir> --output <s3dir> --mapper <script1> --reducer <script2>
To terminate the cluster later, simply run:
elastic-mapreduce --jobflow <job-id> --terminate
Try running elastic-mapreduce --help to see all the commands you can run.
If you don't have the command line client, get it here.
Using:
elastic-mapreduce --jobflow job-id \
--jar s3n://some-path/x.jar \
--step-name "New step name" \
--args ...
You can also add non-streaming steps to your cluster this way (just so you don't have to try it out yourself ;-)).
http://aws.amazon.com/elasticmapreduce/faqs/#dev-6
Q: Can I run a persistent job flow? Yes. Amazon Elastic MapReduce job flows that are started with the --alive flag will continue until explicitly terminated. This allows customers to add steps to a job flow on demand. You may want to use this to debug your job flow logic without having to repeatedly wait for job flow startup. You may also use a persistent job flow to run a long-running data warehouse cluster. This can be combined with data warehouse and analytics packages that run on top of Hadoop, such as Hive and Pig.