how to check failed and terminated jobs - shell

How do I write a shell script that checks for failed and terminated jobs in a particular directory every hour? If any job has failed, we need to receive an email.

You can use these commands.
To list current jobs:
user@mysystem:~$ jobs
and to bring a job to the foreground so you can see how it terminates:
user@mysystem:~$ fg
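For the hourly check and mail alert, here is a minimal sketch, assuming each job writes a log into a known directory and marks failures with a recognizable string (the directory, the markers, and the mail address below are all assumptions; adapt them to how your jobs record failures):

#!/usr/bin/env bash
# check_jobs.sh - hourly check for failed/terminated jobs.
LOGDIR="/var/jobs/logs"          # assumed location of the job logs
MAILTO="ops@example.com"         # assumed alert recipient

# Collect the names of logs that contain a failure marker.
failures=$(grep -lE 'FAILED|TERMINATED' "$LOGDIR"/*.log 2>/dev/null)

if [ -n "$failures" ]; then
    printf 'Failed/terminated jobs detected:\n%s\n' "$failures" \
        | mail -s "Job failure alert on $(hostname)" "$MAILTO"
fi

Schedule it hourly with cron:
0 * * * * /usr/local/bin/check_jobs.sh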

Related

Snakemake does not recognise job failure due to timeout with error code -11

Has anyone had a problem with Snakemake recognizing a timed-out job? I submit jobs to a cluster using qsub, with a timeout set per rule:
snakemake --jobs 29 -k -p --latency-wait 60 --use-envmodules \
--cluster "qsub -l walltime={resources.walltime},nodes=1:ppn={threads},mem={resources.mem_mb}mb"
If a job fails within a script, the next one in line is executed. When a job hits the timeout defined in a rule, however, the next job in line is not executed, reducing the total number of jobs run in parallel on the cluster over time. According to the MOAB scheduler (PBS server), a timed-out job raises a -11 exit status. As far as I understood, any non-zero exit status means failure - or does this only apply to positive integers?
Thanks in advance for any hint :)
If you don't provide a --cluster-status script, Snakemake internally checks job status by touching hidden files from within the submitted job script. When a job times out, the Snakemake process on the node doesn't get a chance to report the failure to the main Snakemake instance, because qsub kills it first.
You can try a cluster profile, or just grab a suitable cluster status script (be sure to chmod it as an executable and have qsub report a parsable job id).
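Here is a minimal sketch of such a status script for a PBS/Torque cluster. Snakemake calls it with the job id and expects one of the words "running", "success" or "failed" on stdout; the qstat parsing below is an assumption, so adapt it to your scheduler's output:

#!/usr/bin/env bash
# status.sh - pass to snakemake via: --cluster-status ./status.sh
jobid="$1"

# "job_state = R" while queued/running; the line disappears once done.
state=$(qstat -f "$jobid" 2>/dev/null | awk '$1 == "job_state" {print $3}')

case "$state" in
    Q|R|H|E) echo running ;;
    *)
        # Finished job: look up its exit status (qstat -x showing
        # completed jobs is an assumption about your PBS server).
        rc=$(qstat -f -x "$jobid" 2>/dev/null \
            | awk '$1 == "exit_status" {print $3}')
        if [ "$rc" = "0" ]; then
            echo success
        else
            echo failed    # any non-zero status, including the -11 timeout
        fi
        ;;
esac

Remember to chmod +x the script, and strip the job id down to something parsable if your qsub prints more than the bare id.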

SLURM status string on job completion / exit

How do I get the slurm job status (e.g. COMPLETED, FAILED, TIMEOUT, ...) on job completion (within the submission script)?
That is, I want to separately keep track of jobs which timed out or failed.
Currently I work with the exit code; however, jobs which TIMEOUT also get exit code 0.
For future reference, here is how I finally do it:
Retrieve the job id at the beginning of the job and write some information (e.g. "${SLURM_JOB_ID} ${PWD}") to a summary file.
Then process this file and use something like sacct -X -n -o State -j ${jid} to get the job status.
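A sketch of that workflow, assuming a summary file of "jobid workdir" lines (the file paths are assumptions):

# Inside each sbatch script: record the job id and working directory.
echo "${SLURM_JOB_ID} ${PWD}" >> "${HOME}/jobs_summary.txt"

# Later, classify every recorded job by its sacct state.
while read -r jid wdir; do
    state=$(sacct -X -n -o State -j "$jid" | awk 'NR==1 {print $1}')
    case "$state" in
        COMPLETED) ;;                                    # fine, skip
        TIMEOUT)   echo "$jid $wdir" >> timed_out.txt ;;
        *)         echo "$jid $wdir" >> failed.txt ;;    # FAILED, CANCELLED, ...
    esac
done < "${HOME}/jobs_summary.txt"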

After triggering a Jenkins job remotely via a Bash script, when should I retrieve the job id?

I already built a script trigger_jenkins_job.sh which works perfectly fine for now. It’s composed mainly of 3 functions:
input_checkpoint
run_remotejob  # Run the Jenkins job remotely using the JSON API.
sleep 10       # Estimated time until the pending period is over and
               # the Jenkins job starts running, i.e. a given slave has
               # been assigned to run the job.
get_buildID    # Retrieve the build state, the last build ID and the
               # last stable build ID.
The problem is that I want to get rid of that 10-second sleep. At the same time, before executing the function get_buildID, I want to be sure the remotely triggered job is actually running on a node.
That way I will retrieve the triggered job's id, and not the id of the last job in the queue before I triggered it.
Regarding the Jenkins file of the job, I specified:
agent {
    label 'linux-node'
}
So I guess the question is: I need some way, from my bash script, to test whether linux-node is running the remotely triggered job, and if so, execute the function get_buildID.
Get rid of the sleep command and use the wait command.
If you are triggering the job with tokens, the trigger command itself should return the build number.
Another way is the REST API: see the "nextBuildNumber" field there (if the build is still pending), else "number".
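For the queue-based variant, here is a hedged sketch. When you POST to the build trigger URL, Jenkins returns a Location header pointing at the queue item; polling that item's api/json until it gains an "executable" object yields the real build number (JENKINS_URL, JOB, USER, TOKEN and the use of jq are assumptions):

#!/usr/bin/env bash
# Trigger the job and capture the queue item URL from the Location header.
queue_url=$(curl -s -i -X POST "${JENKINS_URL}/job/${JOB}/build" \
        --user "${USER}:${TOKEN}" \
    | awk -F': ' 'tolower($1) == "location" {print $2}' | tr -d '\r')

# Poll until a node picks the job up; "executable.number" is the build id.
build_id=""
while [ -z "$build_id" ]; do
    sleep 2
    build_id=$(curl -s "${queue_url}api/json" --user "${USER}:${TOKEN}" \
        | jq -r '.executable.number // empty')
done
echo "Triggered build id: ${build_id}"

This replaces the fixed sleep 10: the loop returns as soon as the job is actually running on linux-node, however long the queue takes.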

Autosys Job failing due to underlying script not being exited

I have an Autosys Job, which calls a wrapper script, which in turn calls a script which starts up a server and keeps running.
Now the problem I am facing is that once the job starts, it keeps waiting for an exit signal or for the script to exit; since neither arrives, the job is sent into a FAILED state.
Can anyone provide a workaround for this?
For a demo, let's say I have:
Autosys job: A
Wrapper script: W.sh
Main restart script: serverrestart.sh
A
|--- W.sh
      |--- serverrestart.sh (always running)
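One common workaround is to detach the long-running server from the wrapper, so that W.sh can exit cleanly and the Autosys job can complete. A minimal sketch (the log path and the 5-second grace period are assumptions):

#!/usr/bin/env bash
# W.sh - start the server detached so this wrapper can exit.
nohup ./serverrestart.sh > /tmp/serverrestart.log 2>&1 &
server_pid=$!
disown                      # drop it from the shell's job table

# Optional: give the server a short grace period and verify it survived.
sleep 5
if ! kill -0 "$server_pid" 2>/dev/null; then
    echo "serverrestart.sh died immediately" >&2
    exit 1
fi

exit 0                      # Autosys now sees a clean SUCCESS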

DATASTAGE: how to run more instance jobs in parallel using DSJOB

I have a question.
I want to run several instances of the same job in parallel from within a script: I have a loop in which I invoke the jobs with dsjob, without the options "-wait" and "-jobstatus".
I want the jobs to complete before the script terminates, but I don't know how to verify whether a job instance has terminated.
I thought of using the wait command, but it is not appropriate here.
Thanks in advance
First, make sure the job compile option "Allow Multiple Instance" is checked.
Second:
#!/bin/bash
. /home/dsadm/.bash_profile
INVOCATION=(1 2 3 4 5)
cd $DSHOME/bin
for id in "${INVOCATION[@]}"
do
    ./dsjob -run -mode NORMAL -wait test demo.$id
done
project: test
job: demo
$id: invocation id
The first two lines in the shell script guarantee that the environment paths work.
Run the jobs, as you say, without -wait, and then loop around running dsjob -jobinfo, parsing the output for a job status of 1 or 2. When all jobs return one of these statuses, they are all finished.
You might find, though, that you check the status of the job before it actually starts running and you might pick up an old status. You might be able to fix this by first resetting the job instance and waiting for a status of "Not running", prior to running the job.
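A sketch of that polling approach, assuming project "test", job "demo", and that dsjob -jobinfo prints a line like "Job Status : RUN OK (1)":

#!/bin/bash
. /home/dsadm/.bash_profile
cd "$DSHOME/bin"

INVOCATION=(1 2 3 4 5)

# Start all instances without -wait so they run in parallel.
for id in "${INVOCATION[@]}"; do
    ./dsjob -run -mode NORMAL test "demo.$id"
done

# Poll each instance until it reports status 1 (OK) or 2 (warnings).
# Extend the case to handle aborted runs (e.g. status 3) as needed, so
# the loop cannot spin forever on a failed instance.
for id in "${INVOCATION[@]}"; do
    while :; do
        status=$(./dsjob -jobinfo test "demo.$id" \
            | sed -n 's/.*Job Status.*(\([0-9]*\)).*/\1/p')
        case "$status" in
            1|2) break ;;
        esac
        sleep 30
    done
done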
Invoke the jobs in a loop without the -wait or -jobstatus options.
After your loop, check the job statuses with the dsjob command.
Example: dsjob -jobinfo projectname jobname.invocationid
You can code one more loop for this as well, with a sleep command inside it, and then write your further logic based on the status of the jobs.
However, it is better to create a job sequence to invoke this multi-instance job simultaneously with the different invocation ids:
create one sequence job if these belong to the same process;
create different sequences, or directly create different scripts, to trigger these jobs simultaneously with their invocation ids, and schedule them at the same time.
The best option is to create a standard, generalized script where everything (for example, log files based on jobname + invocation id) is created or given its value from command-line input parameters, and then schedule that same script with different parameters or invocations.
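A sketch of such a generalized script, parameterized by project, job, and invocation id (all paths and names below are placeholders):

#!/bin/bash
# run_instance.sh <project> <job> <invocation-id>
project="$1"
job="$2"
invid="$3"
logfile="/tmp/${job}.${invid}.log"   # one log per jobname + invocation id

. /home/dsadm/.bash_profile
"$DSHOME/bin/dsjob" -run -mode NORMAL -wait "$project" "${job}.${invid}" \
    > "$logfile" 2>&1
echo "dsjob exited with status $? for ${job}.${invid}; log: $logfile"

Then schedule run_instance.sh with different parameters at the same time to get the parallel instances.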
