qsub: requesting a job array from within a qsub session - cluster-computing

I have a matlab script that processes a large amount of data using torque job arrays.
The server that I SSH into lacks the memory to load the data in the first place, so I need to request the compute node resources as a torque job, as follows:
qsub -I -V -l nodes=1:ppn=1,walltime=12:00:00,vmem=80G
However, when I now run the matlab script I am unable to submit torque job array requests. The error I am getting is as follows:
qsub: submit error (Job rejected by all possible destinations (check syntax, queue resources, ...))
The job array request given was:
qsub -t 1-$1 -l vmem=16G -l nodes=1:ppn=1,walltime=48:00:00 -v batchID=$2,batchDir=$3,funcName=$4 -e $5 -o $6 $HOME/scripts/job.sh
This command works fine outside of a qsub session, and the above error is not transient, so it appears that I cannot submit a request for a torque job array from within a qsub session.
How do I obtain the necessary memory resources from the compute nodes while also being able to submit requests for torque job arrays?

The cluster may not allow you to submit jobs from compute nodes. You may be able to ask the admin to change this behavior, or you can ssh to the head node from within your first job and run the qsub there:
ssh head "qsub -t .........."
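A minimal sketch of the ssh approach, assuming the head node is reachable as `head` (the name used above) and that passwordless ssh from the compute node works. The helper names `quote_arg` and `submit_via_head` are made up for illustration. The quoting step is there because the remote shell parses the command line a second time, so arguments with spaces or special characters would otherwise be split or expanded.

```shell
#!/bin/sh
# Sketch: run qsub on the head node from inside an interactive job.
# HEAD_NODE is an assumed hostname; change it to match your cluster.
HEAD_NODE="${HEAD_NODE:-head}"

# Wrap one argument in single quotes, escaping embedded single quotes,
# so it survives re-parsing by the remote shell.
quote_arg() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Build "qsub 'arg1' 'arg2' ..." and execute it on the head node.
submit_via_head() {
    remote_cmd="qsub"
    for arg in "$@"; do
        remote_cmd="$remote_cmd $(quote_arg "$arg")"
    done
    ssh "$HEAD_NODE" "$remote_cmd"
}

# Example use (same options as the job array request in the question):
#   submit_via_head -t 1-10 -l vmem=16G "$HOME/scripts/job.sh"
```

Variables such as `$HOME` are expanded on the compute node before the command is sent, which is usually what you want when home directories are shared.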

Related

Snakemake does not recognise job failure due to timeout with error code -11

Has anyone had a problem with snakemake not recognizing a timed-out job? I submit jobs to a cluster using qsub with a time-out set per rule:
snakemake --jobs 29 -k -p --latency-wait 60 --use-envmodules \
--cluster "qsub -l walltime={resources.walltime},nodes=1:ppn={threads},mem={resources.mem_mb}mb"
If a job fails within a script, the next one in line is executed. However, when a job hits the time-out defined in a rule, the next job in line is not executed, which reduces the total number of jobs running in parallel on the cluster over time. According to the MOAB scheduler (PBS server), a timed-out job returns a -11 exit status. As far as I understand, any non-zero exit status means failure, or does this only apply to positive integers?
Thanks in advance for any hint:)
If you don't provide a --cluster-status script, snakemake internally checks job status by touching some hidden files from within the submitted job script. When a job times out, qsub kills it, so snakemake (on the node) doesn't get a chance to report the failure to the main snakemake instance.
You can try a cluster profile or just grab a suitable cluster status script (be sure to make it executable with chmod and have qsub report a parsable job id).
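As a rough sketch of what such a status script could look like for Torque: snakemake calls the script with the job id as its first argument and expects one of the words running, success, or failed on stdout. The function names are hypothetical, and the script assumes that `qstat -f` still lists completed jobs with `job_state` and `exit_status` fields (that depends on how the server is configured).

```shell
#!/bin/sh
# Sketch of a --cluster-status helper for Torque/PBS.
# Assumption: the server keeps completed jobs visible to `qstat -f`.

# Map a job state and exit status to the word snakemake expects.
classify_job() {
    job_state_val="$1"
    exit_status_val="$2"
    case "$job_state_val" in
        C)
            # Completed: only exit status 0 is a success, so a timed-out
            # job with a negative status such as -11 counts as failed.
            if [ "$exit_status_val" = "0" ]; then
                echo success
            else
                echo failed
            fi
            ;;
        "")
            # qstat no longer knows the job; treated as failed here.
            echo failed
            ;;
        *)
            echo running
            ;;
    esac
}

# Read `qstat -f` output on stdin and print the value of field $1.
get_field() {
    sed -n "s/^[[:space:]]*$1 = //p" | head -n 1
}

# Only query the scheduler when a job id was passed on the command line.
if [ "$#" -ge 1 ]; then
    info=$(qstat -f "$1" 2>/dev/null) || info=""
    classify_job "$(printf '%s\n' "$info" | get_field job_state)" \
                 "$(printf '%s\n' "$info" | get_field exit_status)"
fi
```

Saved as, say, `status.sh` and made executable, it would be passed with `--cluster-status ./status.sh` next to the existing `--cluster` option.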

sun grid engine qsub to all nodes

I have a master and two nodes with SGE installed, and I have a shell script ready on all the nodes as well. Now I want to use qsub to submit the job to all my nodes.
I used:
qsub -V -b n -cwd /root/remotescript.sh
but it seems that only one node is doing the job. How do I submit jobs to all nodes? What would the command be?
SGE is meant to dispatch jobs to worker nodes. In your example, you create one job, so one node will run it. If you want to run a job on each of your nodes, you need to submit more than one job. If you want to target specific nodes, you should probably use something closer to
qsub -V -b n -cwd -l hostname=node001 /root/remotescript.sh
qsub -V -b n -cwd -l hostname=node002 /root/remotescript.sh
The "-l hostname=*" parameter will require a specific host to run the job.
What are you trying to do? The general use case of a grid engine is to let the scheduler dispatch the jobs so you don't have to use the "-l hostname=*" parameter. So technically you should just submit a bunch of jobs to SGE and let it dispatch them according to node availability.
Finch_Powers' answer is good for describing how SGE allocates resources. So I'll elaborate below on the specifics of your question, which may be why you are not getting the desired outcome.
You mention launching remote script via:
qsub -V -b n -cwd /root/remotescript.sh
Also, you mention again that these scripts are located on the nodes:
"And I have a shell script ready on all the nodes as well"
This is not how SGE is designed to work, although it can do this. Typical usage is to have the same script (or scripts) accessible to all execution nodes via network-mounted storage and let SGE decide which nodes to run the script on.
To run remote code, you may be better served using plain SSH.

Running script on my local computer when jobs submitted by qsub on a server finish

I am submitting jobs via qsub to a server, and then want to analyze the results on my local machine after the jobs have finished. I can find a way to submit the analysis job on the server, but I don't know how to run that script on my local machine.
jobID=$(qsub job.sh)
qsub -W depend=afterok:$jobID analyze.sh
But instead of the above, I want something like
if(qsub -W depend=afterok:$jobID) finished successfully
sh analyze.sh
else
some script
How can I accomplish the above task?
Thank you very much.
I've faced a similar issue and I'll try to sketch the solution that worked for me:
After submitting your actual job,
jobID=$(qsub job.sh)
I would create a loop in your script that checks if the job is still running using
qstat $jobID | grep $jobID | awk '{print $5}'
I'm not 100% sure the status is in the 5th column, so you'd better double check. While the job is idle, the status will be I or Q; while it is running, R; and afterwards, C.
Once it's finished, I usually grep the output files for signs that the run was a success or not, and then run the appropriate post-processing script.
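The polling loop described above could be sketched roughly as follows. The helper names `job_state` and `wait_for_job` are made up, and the sketch assumes Torque's default `qstat <id>` layout with the state in the 5th column. A job that qstat no longer lists is treated as finished.

```shell
#!/bin/sh
# Sketch: poll qstat until a job has left the queue.

# Print the state letter of one job, taken from the 5th column of the
# last line of `qstat <id>` output. Prints nothing if qstat fails.
job_state() {
    qstat "$1" 2>/dev/null | tail -n 1 | awk '{ print $5 }'
}

# Block until the job is completed (C) or no longer known to qstat.
# $1 = job id, $2 = seconds between polls (default 30).
wait_for_job() {
    job_id="$1"
    interval="${2:-30}"
    while :; do
        case "$(job_state "$job_id")" in
            C|"") return 0 ;;
        esac
        sleep "$interval"
    done
}

# Example use (results.log, the DONE marker, and cleanup.sh are
# placeholders, not names from the question):
#   job_id=$(qsub job.sh)
#   wait_for_job "$job_id"
#   if grep -q DONE results.log; then sh analyze.sh; else sh cleanup.sh; fi
```

Polling every 30 seconds keeps the load on the scheduler low; a much shorter interval is rarely worth it for jobs that run for minutes or hours.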
One thing that works for me is to use qsub synchronous with the option
qsub -sync y job.sh
(either on the command line or as
#$ -sync y
in the script (job.sh) itself).
qsub will then exit with code 0 only if the job (or all array jobs) have finished successfully.
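With a blocking qsub, the if/else from the question can be written almost literally. This is only a sketch: `run_and_branch` is a hypothetical helper name, and it assumes an SGE-style qsub that supports `-sync y`.

```shell
#!/bin/sh
# Sketch: branch on the exit code of a synchronous SGE submission.

# $1 = job script, $2 = script to run on success, $3 = script to run
# on failure. Blocks until the submitted job has finished.
run_and_branch() {
    if qsub -sync y "$1"; then
        sh "$2"
    else
        sh "$3"
    fi
}

# Example use (fallback.sh is a placeholder name):
#   run_and_branch job.sh analyze.sh fallback.sh
```

Because qsub itself waits, no polling loop is needed, but the shell that called it has to stay alive for the whole run.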

Run a job on all nodes of Sun Grid Engine cluster, only once

I want to run a job on all the active nodes of a 64-node Sun Grid Engine cluster, scheduled using qsub. I am currently using an array job for this, but sometimes the program is scheduled multiple times on the same node.
qsub -t 1-64:1 -S /home/user/.local/bin/bash program.sh
Is it possible to schedule only one job per node, on all nodes in parallel?
You could use a parallel environment. Create a parallel environment with :
qconf -ap "parallel_environment_name"
and set "allocation_rule" to 1, which means that all processes will have to reside on different hosts. Then, when submitting your array job, specify the number of nodes you want to use with your parallel environment. In your case:
qsub -t 1-64:1 -pe "parallel_environment_name" 64 -S /home/user/.local/bin/bash program.sh
For more information, check these links: http://linux.die.net/man/5/sge_pe and Configuring a new parallel environment at DanT's Grid Blog (link no longer working; there are copies on the wayback machine and softpanorama).
If you have a bash terminal, you can run
for host in $(qhost | tail -n +4 | cut -d " " -f 1); do qsub -l hostname=$host program.sh; done
"-l hostname=" specifies on which host to run the job.
The for loop iterates over the hosts returned by qhost and submits one job per host, specifying the host to use.

Making qsub block until job is done?

Currently, I have a driver program that runs several thousand instances of a "payload" program and does some post-processing of the output. The driver currently calls the payload program directly, using a shell() function, from multiple threads. The shell() function executes a command in the current working directory, blocks until the command is finished running, and returns the data that was sent to stdout by the command. This works well on a single multicore machine. I want to modify the driver to submit qsub jobs to a large compute cluster instead, for more parallelism.
Is there a way to make the qsub command output its results to stdout instead of a file and block until the job is finished? Basically, I want it to act as much like "normal" execution of a command as possible, so that I can parallelize to the cluster with as little modification of my driver program as possible.
Edit: I thought all the grid engines were pretty much standardized. If they're not and it matters, I'm using Torque.
You don't mention what queuing system you're using, but SGE supports the '-sync y' option to qsub which will cause it to block until the job completes or exits.
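For SGE, a wrapper along these lines would make a submission look like a normal blocking command that prints the job's output. It is only a sketch: `run_blocking` and `SHARED_DIR` are made-up names, and the output file has to live on storage that both the submit host and the compute nodes can see (defaulting to `$HOME` is an assumption about the cluster).

```shell
#!/bin/sh
# Sketch: submit an SGE job, wait for it, and print its output.
# SHARED_DIR must be visible on the submit host and the compute nodes.
SHARED_DIR="${SHARED_DIR:-$HOME}"

# Arguments are passed straight to qsub (job script plus any options).
# Prints the job's combined stdout/stderr and returns qsub's exit code.
run_blocking() {
    out_file=$(mktemp "$SHARED_DIR/qsub_out.XXXXXX") || return 1
    rc=0
    # -sync y blocks until the job ends, -j y merges stderr into stdout,
    # -o writes both to out_file. qsub's own messages are discarded.
    qsub -sync y -j y -o "$out_file" "$@" >/dev/null || rc=$?
    cat "$out_file"
    rm -f "$out_file"
    return "$rc"
}

# Example use (payload.sh is a placeholder name):
#   result=$(run_blocking payload.sh)
```

Each call occupies one waiting qsub process on the submit host, so several thousand concurrent calls may need throttling in the driver.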
In TORQUE this is done using the -x and -I options. qsub -I specifies that it should be interactive and -x says run only the command specified. For example:
qsub -I -x myscript.sh
will not return until myscript.sh finishes execution.
In PBS you can use qsub -Wblock=true <command>
