Check real time output after qsub a job on cluster - cluster-computing

Here is my pbs file:
#!/bin/bash
#PBS -N myJob
#PBS -j oe
#PBS -k o
#PBS -V
#PBS -l nodes=hpg6-15:ppn=12
cd ${PBS_O_WORKDIR}
./mycommand
According to the qsub documentation, if I add the line
#PBS -k o, I should be able to check the output in real time in a file named myJob.oJOBID in my home directory. However, when I check the file with tail -f, cat, or more while the job is running, it is empty. The output shows up only after I terminate the job. Is there anything I should check to make the stream flush to the output file in real time?

By default, the files are created on the nodes and copied to your home directory when the job completes. The cluster admin can change this behavior by adding "$spool_as_final_name true" to the config file in the mom_priv directory on each node.
Torque MOM Configuration, parameters

Assuming you are allowed to log in to the node running your process (the admin of our cluster allows this for the duration of the job; I am not sure how common that is), you can get real-time output as follows:
1. Get the PID of your process.
2. Browse the files that this process has open with lsof -n -p <PID>, and find the file whose name "looks like" that of a log. In our cluster the files are
/cm/local/apps/pbspro-ce/var/spool/spool/[JOBID][server].OU
/cm/local/apps/pbspro-ce/var/spool/spool/[JOBID][server].ER
The .OU is stdout and the .ER is stderr. You can then tail -f to get real time output.
The output of lsof can be pretty long though, so you should try grepping your JOBID, or maybe this bit pbspro-ce/var/spool/.
Curious to know if this can be replicated in clusters other than our own.
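The steps above can be sketched as a small script. This is only a minimal sketch, assuming you are already logged in to the compute node and that lsof prints the file path as its last column. The helper names spool_paths and job_spool_files and the process name mycommand are made up for illustration, and the .OU/.ER suffixes are simply the ones seen on our cluster.

```shell
#!/bin/sh
# Sketch: list the spool files that a running job process has open.

# Read lsof output on stdin and print the path (last field) of every
# entry whose name ends in .OU (stdout) or .ER (stderr).
spool_paths() {
    awk '$NF ~ /\.(OU|ER)$/ { print $NF }'
}

# List the spool files opened by the process with the given PID.
job_spool_files() {
    lsof -n -p "$1" | spool_paths
}

# Usage on the compute node (hypothetical process name "mycommand"):
#   pid=$(pgrep -u "$USER" -n mycommand)
#   tail -f "$(job_spool_files "$pid" | grep '\.OU$' | head -n 1)"
```

Filtering on the file suffix keeps the long lsof listing down to the one or two lines that matter; on a cluster with a different spool layout the pattern would need adjusting.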

Related

QSUB: Specify output and error files for each task in Job Array

Hopefully this is not a duplicate and also not just a problem of our cluster's configuration...
I am submitting a job array to a cluster using qsub with the following command:
qsub -q QUEUE -N JOBNAME -t 1:10 -e ${ERRFILE}_$SGE_TASK_ID /path/to/script.sh
where
ERRFILE=/home/USER/somedir/errors.
The idea is to specify an error file (also analogously the output file) that also contains the task ID from within the job array.
So far I have learned that the line
#$ -e ${ERRFILE}_$SGE_TASK_ID
inside the script.sh, does not work, because it is a comment and not evaluated by bash. My first line does not work however because $SGE_TASK_ID is only set AFTER the job is submitted.
I read here that escaping the evaluation of $SGE_TASK_ID (in that link it's PBS' $PBS_JOBID, but a similar problem) should work, but when I tried
qsub -q QUEUE -N JOBNAME -t 1:10 -e ${ERRFILE}_\$SGE_TASK_ID /path/to/script.sh
it did not work as expected.
Am I missing something obvious? Is it possible to use $SGE_TASK_ID in the name of an error file (the automatic naming of error files does that, but I want to specify the directory and if possible the name, too)?
Some additional remarks:
I am using the -cwd option for qsub inside script.sh, but that is NOT where I want my error files to be stored.
I have next to no control over how the cluster works and no root access (wouldn't know what I could need it for in this context but anyway...).
Apparently our cluster does not use PBS.
Yes my scripts are all executable and where applicable started with #!/bin/bash (I also specified the use of bash with the -S /bin/bash option for qsub).
There seems to be a solution here, but I am not quite sure how that works and it also appears to be using PBS. If that answer DOES apply to my question and I misunderstood it, please let me know.
I would appreciate any hint into the right direction.
Thank You!
I didn't know this either, but it looks like Grid Engine has something called "pseudo environment variables" like $TASK_ID for this purpose. This should work:
qsub -q QUEUE -N JOBNAME -t 1:10 -e ${ERRFILE}_\$TASK_ID /path/to/script.sh
From the man page:
-e [[hostname]:]path,...
...
If the pathname contains certain pseudo
environment variables, their value will be expanded at
runtime of the job and will be used to constitute the
standard error stream path name. The following pseudo
environment variables are supported currently:
$HOME home directory on execution machine
$USER user ID of job owner
$JOB_ID current job ID
$JOB_NAME current job name (see -N option)
$HOSTNAME name of the execution host
$TASK_ID array job task index number
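To see why the escaping matters, here is a small local sketch that only builds the string qsub would receive as its -e argument; it does not call qsub. The empty SGE_TASK_ID stands in for the submitting shell, where that variable has no value yet, and the variable names unescaped, escaped, and quoted are just for illustration.

```shell
#!/bin/sh
# Shows what the submitting shell passes on in each case.
ERRFILE=/home/USER/somedir/errors
SGE_TASK_ID=""   # stand-in: the variable has no value before submission

# Unescaped: the submitting shell substitutes the (empty) value itself.
unescaped="${ERRFILE}_$SGE_TASK_ID"

# Escaped: the literal text $TASK_ID survives for Grid Engine to expand at runtime.
escaped="${ERRFILE}_\$TASK_ID"

# Single quotes around the variable part have the same effect as the backslash.
quoted="${ERRFILE}_"'$TASK_ID'

printf 'unescaped: %s\n' "$unescaped"
printf 'escaped:   %s\n' "$escaped"
printf 'quoted:    %s\n' "$quoted"
```

The first line printed ends in errors_ with nothing after the underscore, while the other two end in the literal text errors_$TASK_ID, which is the form the man page excerpt says is expanded when the job runs.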

How to see the output of a job submitted through qsub in my terminal?

I am submitting this simple job to SGE through qsub. How can I see the output of the job, which is a simple echo, in my terminal? I want it directly on screen, without diverting the output to a log file or anything similar.
So here is the job stored in Dummyjob:
#!/bin/sh
#$ -j y
#$ -S /bin/sh
#$ -q long.q
sleep 30
echo "I'm done!"
And this is the qsub command:
qsub -N job_1 -cwd ./Dummyjob
Thank you!
It doesn't do that. You're referring to a batch facility, e.g., How to submit a job using qsub.
Looking at the command-line options, these are the possibilities:
-o <output_logfile> name of the output log file
-e <error_logfile> name of the error log file
-m ea Will send email when job ends or aborts
You can ask it to send mail when the job is done (successfully or not). Or you might be able to make it write to a fifo, e.g., in one terminal you would do
mkfifo myFakeFile
tail -f myFakeFile
and then use
-o myFakeFile
when submitting (in that order, so that something is waiting). But if the program does any checking, it will not write to a fifo (because it is not a regular file).
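The FIFO mechanics can be tried locally before involving the scheduler. The sketch below is only a stand-in: a background subshell plays the role of the job, and cat replaces tail -f so that the script terminates. Keep in mind that, as far as I know, a FIFO only passes data between processes on the same host, so with a real job the reader would have to run on the machine where the job writes its output.

```shell
#!/bin/sh
# Local demonstration of the FIFO idea; no scheduler involved.
dir=$(mktemp -d)
fifo="$dir/myFakeFile"
mkfifo "$fifo"

# Stand-in for the job: opening the FIFO for writing blocks until a reader is attached.
( sleep 1; echo "I'm done!" > "$fifo" ) &

# Reader: cat returns once the writer has closed its end of the FIFO.
received=$(cat "$fifo")
wait

rm -rf "$dir"
echo "$received"
```

The reader is started before the writer gets to the FIFO, which mirrors the "in that order" advice above.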
Further reading:
qsub - submit a batch job to Sun Grid Engine.
6.3.2 Creating a FIFO (The Linux Programmer's Guide)
The previous answer mentions that you are submitting a 'batch job script' and this is true, so you will not see the output on your terminal (tty) but the stdout/stderr will be sent to output files. However that doesn't mean you can't run an interactive job through Grid Engine. You can, just use 'qrsh' instead of using 'qsub' and the script will be run on a remote machine chosen by Grid Engine - the results will be displayed on your screen.
Note: You might have to configure qrsh in your Grid Engine Cluster for this to work.

command in .bashrc file cannot be executed correctly when submitted a pbs job

I have a script to submit a job in bash shell, which looks like
#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:30:00
#PBS -N xxxxx
However, after I submitted my job, I got an error message in xxxxx.e8980 file as follows:
/home/xxxxx/.bashrc: line 1: /etc/ini.modules: No such file or directory
but the file /etc/ini.modules is there. Why can't the system find it?
Thank you very much!
When referencing files in a job that will be submitted to a cluster, you must either force the job to the specific node(s) that have the file or make sure the file is present on all compute nodes in the cluster.
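One way to find out which hosts are missing the file is to guard the line in .bashrc. This is a minimal sketch, assuming line 1 of the .bashrc sources /etc/ini.modules; the helper name source_if_readable is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper for ~/.bashrc: source a file only if this host has it.
source_if_readable() {
    if [ -r "$1" ]; then
        . "$1"
    else
        # Name the host that lacks the file; stderr ends up in the job's error file.
        echo "$(uname -n): $1 not readable, skipping" >&2
        return 1
    fi
}

# In ~/.bashrc, instead of sourcing the file unconditionally:
#   source_if_readable /etc/ini.modules
```

With this in place the job's error file names the compute node where the file is absent, which is usually enough to confirm that it exists only on the login node.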

Is there a way in a shell script to figure out where its output is redirected?

We have scripts of following nature (in cron)
someScript.sh > /tmp/cronlog/somescript.$(date +%Y%m%d).log 2>&1
Now, is there a way within someScript.sh to figure out which file the output has gone to?
The script sends an email with a summary. In that email I would also like to mention that the details can be found in such-and-such output file.
I am aware of the construct if [ -t 1 ] to detect stdout etc but how to get the output file name?
Note that I want this to be generic so that some one can change the output file in cron and the script does not need to be modified.
The simplest thing I could think is that:
readlink -f /proc/$$/fd/1
$$ is the PID of the script (inside the script). On most unix systems, /proc/[pid] is the pseudo-directory containing info for process [pid].
/proc/[pid]/fd is a directory containing a list of symlinks for the open file-descriptors of the process. fd/0 is input, fd/1 is the output of the script, etc.
readlink then gives you the target file or tty if you don't redirect the output.
Of course, if you want to display it, you have to display it somewhere other than standard output, or it will be redirected! To debug, try standard error (2).
Various callings give those results on my box (script.sh just calls readlink -f /proc/$$/fd/1 >&2)
# ./script.sh
/dev/pts/0
# ./script.sh > /var/tmp/foo
/var/tmp/foo
# ./script.sh | more
/proc/12132/fd/pipe:[916212]
Rather than trying to find a hack (and a platform-dependent one at that), it's better to take a slightly different approach here.
Set your cron job like this:
someScript.sh /tmp/cronlog/somescript.$(date +%Y%m%d).log
i.e. without any > or 2>&1 (stdout/stderr stream redirections); just pass the desired log file name as an argument.
Now inside someScript.sh redirect streams to your log file like this:
LOGFILE=$1
exec &>"${LOGFILE}"
And finally you can then message your clients that:
"output details could be found in ${LOGFILE}"

QSUB a process for every file in a directory?

I've been using
qsub -t 1-90000 do_stuff.sh
to submit my tasks on a Sun GridEngine cluster, but now find myself with data sets (super large ones, too) which are not so conveniently named. What's the best way to go about this? I could try to rename them all, but the names contain information which needs to be preserved, and this obviously introduces a host of problems. I could just preprocess everything into jsons, but if there's a way to just qsub -all_contents_of_directory, that would be ideal.
Am I SOL? Should I just go to the directory in question and find . -exec 'qsub setupscript.sh {}'?
Use another script to submit the job - here's an example I used where I want the directory name in the job name. "run_openfoam" is the pbs script in the particular directory.
#!/bin/bash
cd "$1" || exit 1
qsub -N "$1" run_openfoam
You can adapt this script to suit your job and then run it through a loop on the command line. So rather than submitting a job array, you submit one job for each directory name passed as the first parameter to this script.
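The loop mentioned above could look like the following. This is a sketch under assumptions: the function names submit_dir and submit_all and the QSUB override are made up for illustration, and each subdirectory is assumed to contain a run_openfoam script as in the example.

```shell
#!/bin/sh
# QSUB can be overridden (for example QSUB=echo) to do a dry run without a scheduler.
QSUB=${QSUB:-qsub}

# Submit the job found in one directory, named after that directory.
# The body runs in a subshell so the cd does not affect the caller.
submit_dir() (
    cd "$1" || exit 1
    "$QSUB" -N "$(basename "$1")" run_openfoam
)

# Submit one job per subdirectory of the given parent directory.
submit_all() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue
        submit_dir "${d%/}"
    done
}

# Usage (hypothetical parent directory):
#   submit_all /path/to/cases
```

Running it first with QSUB=echo prints the qsub arguments that would be used, which is a cheap way to check the job names before flooding the queue.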
I tend to use Makefiles to automate this stuff:
INPUTFILES=$(wildcard *.in)
OUTPUTFILES=$(patsubst %.in,%.out,$(INPUTFILES))
all : $(OUTPUTFILES)
%.out : %.in
	@echo "mycommand here < $< > $@" | qsub
Then type 'make', and all files will be submitted to qsub. Of course, this will submit everything all at once, which may do unfortunate things to your compute cluster and your sysadmin's blood pressure.
If you remove the "| qsub", the output of make is a list of commands to run. Feed that list into one or more qsub commands, and you'll get an increase in efficiency and a reduction in qsub jobs. I've been using GNU parallel for that, but it needs a qsub that blocks until the job is done. I wrote a wrapper that does that, but it calls qstat a lot, which means a lot of hitting on the system. I should modify it somehow, but there aren't a lot of computationally 'good' options here.
I don't understand the "-t 1-90000" in your qsub command; my search of the qsub manual doesn't show a "-t" option.
Create a file with a list of the datasets in it
find . -type f -print >~/list_of_datasets
Script:
#!/bin/bash
exec ~/setupscript.sh $(sed -n -e "${SGE_TASK_ID}p" <~/list_of_datasets)
qsub -t 1-$(wc -l < ~/list_of_datasets) job_script
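The two pieces this approach relies on, picking line N of the list and counting its lines, can be checked locally without a scheduler. This is a minimal sketch; the helper names nth_line and count_lines and the file names in the throwaway list are made up for illustration.

```shell
#!/bin/sh
# Print line N of a list file; inside an array task N would be $SGE_TASK_ID.
nth_line() {
    sed -n -e "${2}p" < "$1"
}

# Number of lines in the list. Reading from stdin keeps wc from
# appending the file name, which would break the "-t 1-N" range.
count_lines() {
    wc -l < "$1" | tr -d '[:space:]'
}

# Small local demonstration with a throwaway list.
list=$(mktemp)
printf '%s\n' './a.dat' './b.dat' './c.dat' > "$list"
echo "tasks: 1-$(count_lines "$list")"
echo "task 2 gets: $(nth_line "$list" 2)"
rm -f "$list"
```

Each array task then handles exactly one entry of the list, so the original file names stay untouched.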
