SLURM requeue with new JOBID

Is it possible to set some requeue option so that the JOBID changes when Slurm decides to requeue a job (after a node failure, for instance)? That way the folder associated with the first JOBID is not overwritten.
Thanks,

A requeued job is still the same job, so the job ID will not change.
What you can do is prevent requeuing with the --no-requeue option, but then you will need to re-submit the job yourself, either by hand or with a workflow manager.
Another option is to append the restart count to the folder name. For instance, if your submission script has lines such as
WORKDIR=/some/path/${SLURM_JOB_ID}
mkdir -p $WORKDIR
cd $WORKDIR
you can replace it with
WORKDIR=/some/path/${SLURM_JOB_ID}${SLURM_RESTART_COUNT}
mkdir -p $WORKDIR
cd $WORKDIR
Upon the first run, $SLURM_RESTART_COUNT will be unset, which keeps the original behaviour; on each requeue it will be set to 1, 2, and so on, effectively suffixing the job ID with the requeue number.
For the name of the output file, you can use --open-mode=append to avoid overwriting the output file when the job restarts.
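Putting the two pieces together, a minimal sketch of a submission script could look like the following; the job name, resource directives and the executable (my_model) are placeholders, not taken from your setup:
#!/bin/bash
#SBATCH --job-name=mymodel          # hypothetical job name
#SBATCH --open-mode=append          # do not truncate the output file on requeue
#SBATCH --requeue                   # allow Slurm to requeue after a node failure

# SLURM_RESTART_COUNT is unset on the first run, so the suffix is empty;
# on requeue it becomes 1, 2, ... giving each attempt its own folder.
WORKDIR=/some/path/${SLURM_JOB_ID}${SLURM_RESTART_COUNT}
mkdir -p "$WORKDIR"
cd "$WORKDIR"

srun ./my_model                     # hypothetical executable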

Related

How to skip an always-running task in Airflow after a few hours?

My example DAG:
Task1-->Task2-->Task3
I have a pipeline with a BashOperator task that should not stop (at least for a few hours).
Task1: It watches a folder for zip files and extracts them to another folder
#!/bin/bash
inotifywait -m /<path> -e create -e moved_to |
while read dir action file; do
    echo "The file '$file' appeared in directory '$dir' via '$action'"
    unzip -o -q "/<path>/$file" "*.csv" -d /<output_path>
    rm "/<path>/$file"
done
Task2: PythonOperator(loads the CSV into MySQL database after cleaning)
The problem is that my task is always running due to the loop, and I want it to proceed to the next task after (execution_date + x hours).
I was thinking of changing the trigger rules of the downstream task. I have tried execution_timeout on the BashOperator, but then the task shows as failed on the graph.
What should be my approach to solve this kind of problem?
There are several ways to address the issue you are facing.
Option 1: Use execution_timeout on the parent task and trigger_rule='all_done' on the child task. This is basically what you suggested, but just for clarification: Airflow doesn't mind that one of the tasks in the pipeline has failed. In your use case you describe it as a valid state for the task, so it's OK, but it is not very intuitive, as people often associate "failed" with something being wrong, so it's understandable that this is not the preferred solution.
Option 2: Airflow has AirflowSkipException. You can set a timer in your Python code; if the timer exceeds the limit you defined, then do:
from airflow.exceptions import AirflowSkipException
raise AirflowSkipException("Snap. Time is OUT")
This will set the parent task to the Skipped status; the child task can then use trigger_rule='none_failed'. That way, if the parent task fails, it is due to an actual failure (not a timeout). A valid execution will yield either the success status or skipped.
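If the watcher itself is the BashOperator script, the same "set a timer" idea can be applied directly in bash, for example by bounding the loop with inotifywait's --timeout option. A minimal sketch, with the paths and the time limit as placeholders; a clean exit leaves the task successful, so the downstream tasks run as usual:
#!/bin/bash
# Watch for new zip files, but stop on our own after MAX_SECONDS so the
# Airflow task ends instead of running forever. Paths are placeholders.
WATCH_DIR='/<path>'                         # replace with the real watch directory
OUT_DIR='/<output_path>'                    # replace with the real output directory
MAX_SECONDS=$(( 4 * 3600 ))                 # e.g. stop after ~4 hours
end=$(( SECONDS + MAX_SECONDS ))

while (( SECONDS < end )); do
    # wait for one event, but never longer than the time we have left
    file=$(inotifywait -q -t $(( end - SECONDS )) \
                       -e create -e moved_to --format '%f' "$WATCH_DIR") || break
    echo "The file '$file' appeared in directory '$WATCH_DIR'"
    unzip -o -q "$WATCH_DIR/$file" "*.csv" -d "$OUT_DIR"
    rm -f "$WATCH_DIR/$file"
done
exit 0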

How to make sbatch wait until last submitted job is *running* when submitting multiple jobs?

I'm running a numerical model whose parameters are in a "parameter.input" file. I use sbatch to submit multiple iterations of the model, with one parameter in the parameter file changing every time. Here is the loop I use:
#!/bin/bash -l
for a in {01..30}
do
    sed -i "s/control_[0-9][0-9]/control_${a}/g" parameter.input
    sbatch --time=21-00:00:00 run_model.sh
    sleep 60
done
The sed line changes a parameter in the parameter file. The run_model.sh file runs the model.
The problem: depending on the resources available, a job might run immediately or stay pending for a few hours. With my default loop, if 60 seconds is not enough time to find resources for job n to run, the parameter file will be modified while job n is pending, meaning job n will run with the wrong parameters. (and I can't wait for job n to complete before submitting job n+1 because each job takes several days to complete)
How can I force sbatch to wait to submit job n+1 until job n is running?
I am not sure how to create an until loop that would grab the status of job n and wait until it changes to 'running' before submitting job n+1. I have experimented with a few things, but the server I use also hosts another 150 people's jobs, and I'm afraid too much experimenting might create some issues...
Use the following to grab the last submitted job's ID and its status, and wait until it isn't pending anymore to start the next job:
sentence=$(sbatch --time=21-00:00:00 run_model.sh)   # get the output from sbatch
stringarray=($sentence)                              # split the output into words
jobid=${stringarray[3]}                              # isolate the job ID
sentence="$(squeue -j $jobid)"                       # read the job's Slurm status
stringarray=($sentence)
jobstatus=${stringarray[12]}                         # isolate the status of job $jobid
Check that the job status is 'running' before submitting the next job with:
if [ "$jobstatus" = "R" ];then
# insert here relevant code to run next job
fi
You can insert that last snippet in an until loop that checks the job's status every few seconds.
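Put together, a minimal sketch of the whole submission loop could look like this; it uses sbatch --parsable and squeue -h -o %t to avoid parsing the human-readable output by word position (both are standard options, but check your Slurm version), and the polling interval is an arbitrary choice:
#!/bin/bash -l
for a in {01..30}
do
    sed -i "s/control_[0-9][0-9]/control_${a}/g" parameter.input
    jobid=$(sbatch --parsable --time=21-00:00:00 run_model.sh)   # prints only the job ID
    # wait until job $jobid has left the pending state before editing parameter.input again
    while [ "$(squeue -j "$jobid" -h -o %t)" = "PD" ]; do
        sleep 30
    done
done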

How can I tell if a PBS script was called by bash or qsub

I have a PBS script that processes several environment variables. PBS is a wrapper for bash that sends the bash script to a job scheduling queue. The processed variables form a command to run a scientific application. A PBS script is written in bash with additional information for the job scheduler encoded in the bash comments.
How can I determine programmatically whether my script was called by qsub, the command that interprets PBS scripts, or whether it was called by bash?
If the script is running under bash I would like to treat the call as a dry run and only print out the command that was generated. In that way it bypasses the job queue entirely.
This may not be completely robust, but one heuristic that may work is to test for the existence of any of the following environment variables, which tend to be defined under qsub:
PBS_O_HOST (the name of the host upon which the qsub command is running)
PBS_SERVER (the hostname of the pbs_server which qsub submits the job to)
PBS_O_QUEUE (the name of the original queue to which the job was submitted)
PBS_O_WORKDIR (the absolute path of the current working directory of the qsub command)
PBS_ARRAYID (each member of a job array is assigned a unique identifier)
PBS_ENVIRONMENT (set to PBS_BATCH to indicate the job is a batch job, or to PBS_INTERACTIVE to indicate the job is a PBS interactive job)
PBS_JOBID (the job identifier assigned to the job by the batch system)
PBS_JOBNAME (the job name supplied by the user)
PBS_NODEFILE (the name of the file containing the list of nodes assigned to the job)
PBS_QUEUE (the name of the queue from which the job was executed)
PBS_WALLTIME (the walltime requested by the user or default walltime allotted by the scheduler)
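For example, a minimal sketch of the dry-run logic based on one of those variables; the command itself is a hypothetical placeholder standing in for whatever your script builds from its environment variables:
#!/bin/bash
# Build the command from the processed environment variables (placeholder here).
CMD="./my_scientific_app --input data.in"

# PBS_ENVIRONMENT is only set when the script runs under the batch system.
if [ -z "${PBS_ENVIRONMENT:-}" ]; then
    echo "Dry run (plain bash): $CMD"
    exit 0
fi

$CMD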
You can check the parent caller of bash:
CALLER=$(ps -p "$PPID" -o comm=)
if [[ <compare $CALLER with expected process name> ]]; then
    <script was called by qsub or something>
fi
Extra note: Bash always sets an unexported variable, $BASH_VERSION, so if it is set you can be sure the script is running under Bash. The remaining question would just be which process called it.
Also, don't run the check inside a subshell (), as $PPID would then probably give you the process of the same shell, not of the caller.
If your script is called through deeper levels of nesting, in which case $PPID would not be enough, you can always recursively scan the parent PIDs with ps -p <pid> -o ppid=.
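A sketch of that recursive scan might look like this; the process name to match ("pbs_mom" below) is an assumption, so check which daemon actually launches your jobs on your site:
#!/bin/bash
# Walk up the process tree and report whether any ancestor matches a given name.
has_ancestor() {
    local name=$1 pid=$PPID comm
    while [ "$pid" -gt 1 ] 2>/dev/null; do
        comm=$(ps -p "$pid" -o comm= | tr -d ' ')
        [ "$comm" = "$name" ] && return 0
        pid=$(ps -p "$pid" -o ppid= | tr -d ' ')
    done
    return 1
}

if has_ancestor pbs_mom; then       # assumed daemon name; adjust for your site
    echo "Called by the batch system"
else
    echo "Called directly from bash (dry run)"
fi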

"qsub -now" equivalent using bsub

In SGE, we have
qsub -now yes/no <command>
By "-now yes" the job is scheduled immediately(if possible) or not at all . We are not put in pending queue .
By "-now no " the job is put in pending queue if it cannot be executed immediately .
But in LSF, qsub's equivalent is bsub.
With bsub, we are put in the pending queue if the job cannot be executed immediately; there is no option like qsub's "-now yes".
Do we have something in bsub like "qsub -now"?
P.S.: One solution is to check for some time (a few seconds) after running bsub whether we have been scheduled, and then exit. I am searching for a more elegant way.
I found an answer the LSF way.
LSF does provide a way to quit a job if it is unable to schedule the resources: the environment variable LSF_NIOS_PEND_TIMEOUT (specified in minutes) terminates the job if it is still in the pending queue after that time.
env LSF_NIOS_PEND_TIMEOUT=1 bsub -Is -m host /bin/bash
From Somewhere on the web:
LSF_NIOS_PEND_TIMEOUT
Syntax: LSF_NIOS_PEND_TIMEOUT=minutes
Description: Applies only to interactive batch jobs. Maximum amount of time that an interactive batch job can remain pending. If this parameter is defined, and an interactive batch job is pending for longer than the specified time, the interactive batch job is terminated.
Valid values: Any integer greater than zero
LSF doesn't have the same thing. You could use expect with a timeout: LSF will output something like the following when the job starts, and your expect script could expect <<Starting on. (But this is basically what your P.S. says.)
$ bsub -Is -m hostA /bin/bash
Job <7536> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on hostA>>
hostA$
You could maybe use lsrun. But it won't work with the batch system to allocate a slot or other resource.
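A rough bash sketch of that "check for some time, then give up" idea from the P.S.; the command and host are hypothetical, and bjobs -o / -noheader only exist in newer LSF releases, so with older ones you would have to parse the default bjobs output instead:
#!/bin/bash
# Submit, give the scheduler a short window to start the job, otherwise cancel it.
out=$(bsub -m hostA ./my_command)                  # hypothetical command and host
jobid=$(echo "$out" | head -1 | sed 's/.*<\([0-9]*\)>.*/\1/')

for i in $(seq 1 30); do                           # poll for ~30 seconds
    state=$(bjobs -noheader -o stat "$jobid" 2>/dev/null)
    [ "$state" = "RUN" ] && exit 0                 # scheduled: like "qsub -now yes" succeeding
    sleep 1
done

bkill "$jobid"                                     # still pending: cancel, like "-now yes" refusing
exit 1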

DataStage: how to run multiple instances of the same job in parallel using dsjob

I have a question.
I want to run multiple instances of the same job in parallel from within a script: I have a loop in which I invoke the jobs with dsjob, without the "-wait" and "-jobstatus" options.
I want the jobs to complete before the script terminates, but I don't know how to verify whether a job instance has finished.
I thought of using the wait command, but it is not appropriate here.
Thanks in advance
First, make sure the job's compile option "Allow Multiple Instance" is selected.
Second:
#!/bin/bash
. /home/dsadm/.bash_profile
INVOCATION=(1 2 3 4 5)
cd $DSHOME/bin
for id in "${INVOCATION[@]}"
do
    ./dsjob -run -mode NORMAL -wait test demo.$id
done
project -- test
job -- demo
$id -- invocation id
The two lines at the top of the shell script guarantee that the environment and path are set up correctly.
Run the jobs like you say without the -wait, and then loop around running dsjob -jobinfo and parse the output for a job status of 1 or 2. When all jobs return this status, they are all finished.
You might find, though, that you check the status of the job before it actually starts running and you might pick up an old status. You might be able to fix this by first resetting the job instance and waiting for a status of "Not running", prior to running the job.
Invoke the jobs in a loop without the -wait or -jobstatus options.
After your loop, check the job status with the dsjob command, for example: dsjob -jobinfo projectname jobname.invocationid
You can code one more loop for this check, use the sleep command inside it, and write your further logic according to the status of the jobs.
It is better, though, to create a Job Sequence that invokes this multi-instance job simultaneously with different invocation IDs: create one sequence job if the instances belong to the same process, or create different sequences (or different scripts) that trigger these jobs simultaneously with their invocation IDs and schedule them at the same time.
The best option is to create a standard, generalized script in which everything (for example, the log files, based on jobname + invocation ID) is created or derived from the command-line parameters, and then schedule that same script with different parameters or invocations.
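A minimal sketch of the "run without -wait, then poll" approach from the answers above; the exact text printed by dsjob -jobinfo varies between versions, so adjust the parsing, and the project/job names simply follow the earlier example:
#!/bin/bash
. /home/dsadm/.bash_profile
PROJECT=test
JOB=demo
INVOCATIONS=(1 2 3 4 5)

# start every instance without -wait so they run in parallel
for id in "${INVOCATIONS[@]}"; do
    "$DSHOME/bin/dsjob" -run -mode NORMAL "$PROJECT" "$JOB.$id"
done

# poll until each instance reports status 1 (finished OK) or 2 (finished with warnings)
for id in "${INVOCATIONS[@]}"; do
    while :; do
        status=$("$DSHOME/bin/dsjob" -jobinfo "$PROJECT" "$JOB.$id" \
                 | sed -n 's/.*Job Status.*(\([0-9]*\)).*/\1/p')
        { [ "$status" = "1" ] || [ "$status" = "2" ]; } && break
        sleep 30
    done
done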
