How get information of completed PBS or Torque jobs? - shell

I have IDs of completed jobs. How do I check its detailed information, such as execution time, allocated nodes, etc? I remember SGE has a command for it (qacct?). But I could not find it for PBS or Torque. Thanks.

Since job accounting requires root access to view completed jobs, or that the cluster admins have installed pbstools (both out of the control of a user), I've found that the easiest thing to do is to place a
tracejob $PBS_JOBID
on the last line of the submission script. If the scheduler is MAUI, then checkjob -vv $PBS_JOBID is another alternative. These commands could be redirected to a separate outfile:
tracejob $PBS_JOBID > $PBS_O_WORKDIR/$PBS_JOBID.tracejob
Should also be possible to have this run as a user epilog script to make it more reusable from job to job.

I was looking at this thread searching how to do this in my HPC running PBSPro 19.2.3 and as of PBSPro 18 the solution is similar to John Damm Sørensen's reply, but the -w flag is used instead of -1 to display output of each field in a single line and you need to add -x flag to see the details of finished jobs as well, so you don't need to run it within the job script. (p.203, section 2.59.2.2 of the Reference Guide)
qstat -fxw $PBS_JOBID
You can then grep out of it the requested information, such as resources used, Exit status, etc:
qstat -fxw $PBS_JOBID | grep -E "resources_used|Exit_status|array_index"

For Torque, you can check at least part of the information you seek using the "tracejob" command.
Official documentation:
http://docs.adaptivecomputing.com/torque/Content/topics/11-troubleshooting/usingTracejobToLocateFailures.htm
One thing you should notice is that this tool is a convenience that parses the logs. By default it will only check the last day. Be sure to read the doc for the "-n" option.

On a Torque based system. I find that the best way to get stats from a job is to add this to the end of the submitted job script. The output will be added to the STDOUT file.
qstat -f -1 $PBS_JOBID

Right now the only way to get this in TORQUE is to look at the accounting logs. You can grep for the job id and view the accounting records for the job, which look like this:
04/30/2014 15:20:18;Q;5000.bob;queue=batch
04/30/2014 15:33:00;S;5000.bob;user=dbeer group=dbeer jobname=STDIN queue=batch ctime=1398892818 qtime=1398892818 etime=1398892818 start=1398893580 owner=dbeer#bob exec_host=bob/0
04/30/2014 15:36:20;E;5000.bob;user=dbeer group=dbeer jobname=STDIN queue=batch ctime=1398892818 qtime=1398892818 etime=1398892818 start=1398893580 owner=dbeer#bob exec_host=bob/0 session=22933 end=1398893780 Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=2580kb resources_used.vmem=37072kb resources_used.walltime=00:03:20
Unfortunately, to do this directly you have to have root access. To get around this, there are tools such as pbsacct that help better browse this. pbsacct is part of the pbstools package, which is where that link takes you.

Related

LSF - automatic job rerun using sasbatch script

I am trying to create an auto-rerun mechanism by implementing some code into sasbatch script after sascommand will finish. General idea is to:
locate a log of sas process and an id of the flow containing current job,
check if the log contains particular ORA-xxxxx errors which we know that solution for them is just rerun of the process,
if so, then trigger jrerun class from LSF Platform Command Line Interface,
exit sasbatch passing $rc to LSF
The idea was implemented as:
#define used paths
log_dir=/path/to/sas_logs_directory
out_log=/path/to/auto-rerun_log.txt
out_log2=/path/to/lsf_rerun_log.txt
if [ -n "${LSB_JOBNAME}"]; then
if [ ! -f "$out_log"]; then
touch $out_log
fi
#get flow runtime attributes
IFS-: read -r flow_id username flow_name job_name <<< "${LSB_JOBNAME}"
#find log of the current process
log_path=$(ls -t $log_dir/*.log | xargs grep -li "job:\s*$job_name" | grep -i "/$flow_name_" | head -1)
#set path to txt file containing lines which represents ORA errors we look for
conf_path-/path/to/error_list
#analyse process' log line by line
while read -r line;
do
#if error is found in log then try to rerun flow
if grep -q "$line" $log_path; then
(nohup /path/to/rerun_script.sh $flow_id >$out_log2 2>&1) &
disown
break
fi
done < $conf_path
fi
While rerun_script is the script which calls jrerun class after sleep command - in order to let parent script exit $rc in the meanwhile. It looks like:
sleep 10
/some/lsf/path/jrerun
Problem is that job is running for the all time. In LSF history I can see that jrerun was called before job exited.
Furthermore in $out_log2 I can see message: <flow_id> has no starting or exit points.
Do anyone have an idea how I can pass return code to LSF before jrerun calling? Or maybe some simplier way to perform autorerun of SAS jobs in Platform LSF?
I am using SAS 9.4 and Platform Process Manager 9.1
Or maybe some simplier way to perform autorerun of SAS jobs in Platform LSF?
I'm not knowledgeable about the SAS part. But on the LSF side there's at least a couple of ways to requeue the job.
If you have control of the job script, you can use special process exit value to automatically requeue the job.
https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_admin/job_requeue_about.html
If you have control outside of the job script, you can use brequeue -r to requeue a running job.
https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_command_ref/brequeue.1.html
Good Luck
I managed to get this working by using two additional configuration files. When my grep returnes 1 I add found flow_id to flow_list.txt configuration file and modify especially made trigger_file.txt.
I scheduled additional flow execute_rerun in LSF which is triggered after file trigger_file.txt is modified. The execute_rerun flow reads flow_list.txt configuration file line by line and calls jrerun method on each flow.
I managed to achieve an automatic rerun of the flows which fails due to particular errors.

QSUB: Specify output and error files for each task in Job Array

Hopefully this is not a dublicate and also not just a problem of our cluster's configuration...
I am submitting a job array to a cluster using qsub with the following command:
qsub -q QUEUE -N JOBNAME -t 1:10 -e ${ERRFILE}_$SGE_TASK_ID /path/to/script.sh
where
ERRFILE=/home/USER/somedir/errors.
The idea is to specify an error file (also analogously the output file) that also contains the task ID from within the job array.
So far I have learned that the line
#$ -e ${ERRFILE}_$SGE_TASK_ID
inside the script.sh, does not work, because it is a comment and not evaluated by bash. My first line does not work however because $SGE_TASK_ID is only set AFTER the job is submitted.
I read here that escaping the evaluation of $SGE_TASK_ID (in that link it's PBS' $PBS_JOBID, but a similar problem) should work, but when I tried
qsub -q QUEUE -N JOBNAME -t 1:10 -e ${ERRFILE}_\$SGE_TASK_ID /path/to/script.sh
it did not work as expected.
Am I missing something obvious? Is it possible to use $SGE_TASK_ID in the name of an error file (the automatic naming of error files does that, but I want to specify the directory and if possible the name, too)?
Some additional remarks:
I am using the -cwd option for qsub inside script.sh, but that is NOT where I want my error files to be stored.
I have next to no control over how the cluster works and no root access (wouldn't know what I could need it for in this context but anyway...).
Apparently our cluster does not use PBS.
Yes my scripts are all executable and where applicable started with #!/bin/bash (I also specified the use of bash with the -S /bin/bash option for qsub).
There seems to be a solution here, but I am not quite sure how that works and it also appears to be using PBS. If that answer DOES apply to my question and I misunderstood it, please let me know.
I would appreciate any hint into the right direction.
Thank You!
I didn't know this either, but it looks like Grid Engine has something called "pseudo environment variables" like $TASK_ID for this purpose. This should work:
qsub -q QUEUE -N JOBNAME -t 1:10 -e ${ERRFILE}_\$TASK_ID /path/to/script.sh
From the man page:
-e [[hostname]:]path,...
...
If the pathname contains certain pseudo
environment variables, their value will be expanded at
runtime of the job and will be used to constitute the
standard error stream path name. The following pseudo
environment variables are supported currently:
$HOME home directory on execution machine
$USER user ID of job owner
$JOB_ID current job ID
$JOB_NAME current job name (see -N option)
$HOSTNAME name of the execution host
$TASK_ID array job task index number

How to see the output of a job submitted through qsub in my terminal?

I am submitting this simple job to SGE through qsub. How can I see the output of the job which is a simple echo in my terminal. I mean I want it directly on screen not diverting the output to a logfile or something.
So here is the job stored in Dummyjob:
#!/bin/sh
#$ -j y
#$ -S /bin/sh
#$ -q long.q
sleep 30
echo "I'm done!"
And this is the qsub command:
qsub -N job_1 -cwd./Dummyjob
Thank you!
It doesn't do that. You're referring to a batch facility, e.g., How to submit a job using qsub.
Looking at the command-line options, these are the possibilities:
-o <output_logfile> name of the output log file
-e <error_logfile> name of the error log file
-m ea Will send email when job ends or aborts
You can ask it to send mail when the job is done (successfully or not). Or you might be able to make it write to a fifo, e.g., in one terminal you would do
mkfifo myFakeFile
tail -f myFakeFile
and then use
-o myFakeFile
when submitting (in that order, so that something is waiting). But if the program does any checking, it will not write to a fifo (because it is not a regular file).
Further reading:
qsub - submit a batch job to Sun Grid Engine.
6.3.2 Creating a FIFO (The Linux Programmer's Guide)
The previous answer mentions that you are submitting a 'batch job script' and this is true, so you will not see the output on your terminal (tty) but the stdout/stderr will be sent to output files. However that doesn't mean you can't run an interactive job through Grid Engine. You can, just use 'qrsh' instead of using 'qsub' and the script will be run on a remote machine chosen by Grid Engine - the results will be displayed on your screen.
Note: You might have to configure qrsh in your Grid Engine Cluster for this to work.

Capture job id of a job submitted by qsub

I have been looking for a simple way to capture the job ID of a job submitted by qsub. I saw a suggestion was given by providing a name to the job, and using that name. But that's an indirect method. I tried this way but getting an error
jobID="qsub job.sh"
35546.cell0 (This is the output I want to capture)
$jobID
qsub -W depend=afterok:$jobID analyze.sh
Can anyone please suggest a neat way to capture the job ID from qsub?
Thank you very much.
You may try
qsub -W depend=afterok:$(qsub job.sh) analyze.sh

QSUB a process for every file in a directory?

I've been using
qsub -t 1-90000 do_stuff.sh
to submit my tasks on a Sun GridEngine cluster, but now find myself with data sets (super large ones, too) which are not so conveniently named. What's the best way to go about this? I could try to rename them all, but the names contain information which needs to be preserved, and this obviously introduces a host of problems. I could just preprocess everything into jsons, but if there's a way to just qsub -all_contents_of_directory, that would be ideal.
Am I SOL? Should I just go to the directory in question and find . -exec 'qsub setupscript.sh {}'?
Use another script to submit the job - here's an example I used where I want the directory name in the job name. "run_openfoam" is the pbs script in the particular directory.
#!/bin/bash
cd $1
qsub -N $1 run_openfoam
You can adapt this script to suit your job and then run it through a loop on the command line. So rather than submitting a job array, you submit a job for each dir name passed as the first parapmeter to this script.
I tend to use Makefiles to automate this stuff:
INPUTFILES=$(wildcard *.in)
OUTPUTFILES=$(patsubst %.in,%.out,$(INPUTFILES))
all : $(OUTPUTFILES)
%.out : %.in
#echo "mycommand here < $< > $#" | qsub
Then type 'make', and all files will be submitted to qsub. Of course, this will submit everything all at once, which may do unfortunate things to your compute cluster and your sysadmin's blood pressure.
If you remove the "| qsub", the output of make is a list of commands to run. Feed that list into one or more qsub commands, and you'll get an increase in efficiency and a reduction in qsub jobs. I've been using GNU parallel for that, but it needs a qsub that blocks until the job is done. I wrote a wrapper that does that, but it calls qstat a lot, which means a lot of hitting on the system. I should modify it somehow, but there aren't a lot of computationally 'good' options here.
I cannot understand "-t 1-90000" in your qsub command. My searching of qsub manual doesn't show such "-t" option.
Create a file with a list of the datasets in it
find . -print >~/list_of_datasets
Script:
#!/bin/bash
exec ~/setupscript.sh $(sed -n -e "${SGE_TASK_ID}p" <~/list_of_datasets)
qsub -t 1-$(wc -l ~/list_of_datasets) job_script

Resources