sge can only run one task per node - bash

I built SGE from source on a four-node cluster. The operating system is CentOS 7. When I submit some simple tasks to the cluster, I find that only one task runs per node. What's the problem? Here is my task code:
sleep 60
echo "done"
and this is my cmd to submit the tasks:
DIR=`pwd`
option=""
for ((i = 0; i < 5; i++)); do
    qsub -q multislots $option -V -cwd -o stdout -e stderr -S /bin/bash $DIR/test.sh
    sleep 1
done
When I run qstat -f, it shows the jobs failing with the error "can not find an unused add_grp_id" (screenshot omitted).

Given the error message about jobs failing with "can not find an unused add_grp_id": check what gid_range is set to in the SGE configuration (both the global configuration and any per-host configuration). It should be a range of otherwise unused group IDs, containing at least as many GIDs as the number of jobs you want per node.
If that isn't it, try running qalter -w v and qalter -w p on one of the queued jobs to see why they aren't being started.
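A quick way to inspect and diagnose (a sketch; node01 and job ID 123 are placeholders for your own host and queued job):
# Check gid_range in the global configuration
qconf -sconf | grep gid_range
# Check a per-host configuration, if one exists
qconf -sconf node01 | grep gid_range
# Ask the scheduler why a queued job is not being started
qalter -w v 123
qalter -w p 123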


I would like to find the process id of a Jenkins job

I would like a way to find the process ID of a Jenkins job, so I can kill the process if the job hangs. The Jenkins instance is on Ubuntu. Sometimes we are unable to stop a job via the Jenkins interface. I am able to stop a job by killing its process ID if the job runs a simple shell script where I manually print the process ID, such as:
#!/bin/bash
echo "Process ID: $$"
for i in {1..10000}
do
    sleep 10
    echo "Welcome $i times"
done
In the command shell, I can run sudo kill -9 [process id] and it successfully kills the job.
The problem is that most of our jobs have multiple build steps, and we have multiple projects running on this server. Many of our build steps are shell scripts or Windows batch files, and a few are Ant scripts. I'm wondering how to find the process ID of the Jenkins job that is the parent process of all of the build steps. As of now, I have to wait until all other builds have completed and then restart the server. Thanks for any help!
On *nix systems you can review the environment variables of a running process by inspecting /proc/$pid/environ and looking for Jenkins-specific variables like BUILD_ID, BUILD_URL, etc.
cat /proc/$pid/environ | tr '\0' '\n' | grep BUILD_URL
You can do this if you already know the $pid, or you can search through all running processes.
This is an update to my question. For killing hung jobs, I believe this will only work for cases where Jenkins runs its jobs on the same server; I doubt it would work if you are trying to kill a hung process running on a Jenkins slave.
# Find the process ID based on the Jenkins job's BUILD_TAG
user@ubuntu01x64:~$ sudo egrep -l -i 'BUILD_TAG=jenkins-Wait_Job-11' /proc/*/environ
/proc/5222/environ
/proc/6173/environ
/proc/self/environ
# One of the processes listed in the egrep output is the egrep command itself
# (/proc/self); loop through the process IDs to determine which one is
# still running
user@ubuntu01x64:~$ if [[ -e /proc/6173 ]]; then echo "yes"; fi
user@ubuntu01x64:~$ if [[ -e /proc/5222 ]]; then echo "yes"; fi
yes
# Kill the process
sudo kill -9 5222
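The same steps can be wrapped in a small helper script (a sketch; kill_jenkins_job.sh and its BUILD_TAG argument are hypothetical names, and sudo is assumed so that other users' /proc entries are readable):
#!/bin/bash
# Usage: ./kill_jenkins_job.sh jenkins-Wait_Job-11
tag="$1"
# Match only numeric /proc entries, which skips the /proc/self symlink
for environ in $(sudo grep -l "BUILD_TAG=$tag" /proc/[0-9]*/environ 2>/dev/null); do
    pid=$(basename "$(dirname "$environ")")
    # Double-check the process is still alive before killing it
    if [[ -e /proc/$pid ]]; then
        echo "Killing PID $pid"
        sudo kill -9 "$pid"
    fi
done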

sun grid engine qsub to all nodes

I have a master and two nodes. They all have SGE installed, and I have a shell script ready on all the nodes as well. Now I want to use qsub to submit the job to all of my nodes.
I used:
qsub -V -b n -cwd /root/remotescript.sh
but it seems that only one node is doing the job. I am wondering how to submit jobs to all the nodes. What would the command be?
SGE is meant to dispatch jobs to worker nodes. In your example, you create one job, so one node will run it. If you want to run a job on each of your nodes, you need to submit more than one job. If you want to target specific nodes, you should use something closer to:
qsub -V -b n -cwd -l hostname=node001 /root/remotescript.sh
qsub -V -b n -cwd -l hostname=node002 /root/remotescript.sh
The "-l hostname=*" parameter will require a specific host to run the job.
What are you trying to do? The general use case of using a grid engine is to let the scheduler dispatch the jobs so you don't have to use the "-l hostname=*" parameter. So technically you should just submit a bunch of jobs to SGE and let it dispatch it with the nodes availability.
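For example, to launch five copies of the job and let the scheduler spread them out (a sketch; /shared/remotescript.sh is an assumed path on storage the cluster can read):
for i in {1..5}; do
    qsub -V -b n -cwd /shared/remotescript.sh
done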
Finch_Powers' answer is good for describing how SGE allocates resources. So, I'll elaborate below on the specifics of your question, which may be why you are not getting the desired outcome.
You mention launching remote script via:
qsub -V -b n -cwd /root/remotescript.sh
Also, you mention again that these scripts are located on the nodes:
"And I have a shell script ready on all the nodes as well"
This is not how SGE is designed to work, although it can do this. Typical usage is to have the same script (or scripts) accessible to all execution nodes via network-mounted storage, and to let SGE decide which nodes run it.
To run remote code, you may be better served using plain SSH.
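If all you need is to execute a node-local script on every node, a plain SSH loop bypasses the scheduler entirely (a sketch; node001 and node002 are example host names, and passwordless SSH from the master is assumed):
for host in node001 node002; do
    ssh "$host" /root/remotescript.sh &
done
wait    # wait for all the remote scripts to finish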

Running a script on my local computer when jobs submitted by qsub on a server finish

I am submitting jobs via qsub to a server, and then I want to analyze the results on my local machine after the jobs finish. I can find a way to submit the analysis job on the server, but I don't know how to run that script on my local machine.
jobID=$(qsub job.sh)
qsub -W depend=afterok:$jobID analyze.sh
But instead of the above, I want something like
if (qsub -W depend=afterok:$jobID) finished successfully
    sh analyze.sh
else
    some script
How can I accomplish the above task?
Thank you very much.
I've faced a similar issue and I'll try to sketch the solution that worked for me:
After submitting your actual job,
jobID=$(qsub job.sh)
I would create a loop in your script that checks whether the job is still running, using
qstat $jobID | grep $jobID | awk '{print $5}'
Although I'm not 100% sure the status is in the 5th column, so you had better double-check. While the job is idling, the status will be I or Q; while running, R; and after completion, C.
Once it's finished, I usually grep the output files for signs that the run was a success or not, and then run the appropriate post-processing script.
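Put together, the polling approach might look like this (a sketch; the status column and letters match the answer above but can differ between schedulers, "SUCCESS" is an assumed marker your job writes, and job.sh.o* assumes the default output file naming):
jobID=$(qsub job.sh)
# Poll until the job disappears from the queue or reports C (completed)
while true; do
    status=$(qstat "$jobID" 2>/dev/null | grep "$jobID" | awk '{print $5}')
    if [[ -z "$status" || "$status" == "C" ]]; then
        break
    fi
    sleep 30
done
# Check the output files for signs of success before post-processing
if grep -q "SUCCESS" job.sh.o*; then
    sh analyze.sh
else
    echo "job failed; skipping analysis" >&2
fi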
One thing that works for me is to run qsub synchronously with the option
qsub -sync y job.sh
(either on the command line or as
#$ -sync y
in the script (job.sh) itself).
qsub will then exit with code 0 only if the job (or all tasks of an array job) finished successfully.
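This makes the success/failure branch from the question straightforward (a sketch; handle_failure.sh is a hypothetical stand-in for the question's "some script"):
if qsub -sync y job.sh; then
    sh analyze.sh
else
    sh handle_failure.sh
fi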

Run a job on all nodes of a Sun Grid Engine cluster, only once

I want to run a job on all the active nodes of a 64-node Sun Grid Engine cluster, scheduled using qsub. I am currently using an array job for this, but sometimes the program is scheduled multiple times on the same node.
qsub -t 1-64:1 -S /home/user/.local/bin/bash program.sh
Is it possible to schedule only one job per node, across all nodes in parallel?
You could use a parallel environment. Create a parallel environment with:
qconf -ap "parallel_environment_name"
and set "allocation_rule" to 1, which means that all processes will have to reside on different hosts. Then, when submitting your array job, specify the number of nodes you want to use with your parallel environment. In your case:
qsub -t 1-64:1 -pe "parallel_environment_name" 64 -S /home/user/.local/bin/bash program.sh
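For reference, a minimal PE definition could look like this (a sketch; field names as printed by qconf -sp, but the values here are assumptions for a 64-node setup):
pe_name            parallel_environment_name
slots              64
allocation_rule    1
control_slaves     FALSE
job_is_first_task  TRUE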
For more information, check these links: http://linux.die.net/man/5/sge_pe and "Configuring a new parallel environment" at DanT's Grid Blog (the link no longer works; there are copies on the Wayback Machine and softpanorama).
If you have a bash terminal, you can run
for host in $(qhost | tail -n +4 | cut -d " " -f 1); do qsub -l hostname=$host program.sh; done
"-l hostname=" specifies on which host to run the job.
The for loop iterates over the result returned by qstat to take each node and call the command specifying the host to use.

"qsub -now" equivalent using bsub

In SGE, we have
qsub -now yes/no <command>
By "-now yes" the job is scheduled immediately(if possible) or not at all . We are not put in pending queue .
By "-now no " the job is put in pending queue if it cannot be executed immediately .
But in LSF , we have qsub's equivalent as bsub .
in bsub, we are put in pending queue, if it cannot be executed immediately. We don't have option as "-now yes" as in qsub .
Do we something in bsub as "qsub -now"
P.S : One solution is that we can check for some time(some secondss) after running bsub, if we are scheduled or not and then exit . I am searching for a more elegant way .
I found the answer the LSF way.
LSF does provide a way to quit a job if it is unable to schedule the resources. There is an environment variable, LSF_NIOS_PEND_TIMEOUT (specified in minutes), which quits the job if it is still in the pending queue after that long.
env LSF_NIOS_PEND_TIMEOUT=1 bsub -Is -m host /bin/bash
From somewhere on the web:
LSF_NIOS_PEND_TIMEOUT
Syntax: LSF_NIOS_PEND_TIMEOUT=minutes
Description: Applies only to interactive batch jobs. Maximum amount of time that an interactive batch job can remain pending. If this parameter is defined, and an interactive batch job is pending for longer than the specified time, the interactive batch job is terminated.
Valid values: any integer greater than zero
LSF doesn't have the same thing. You could use expect with a timeout. LSF will output something like the following when the job starts, so your expect script could expect <<Starting on. (But this is basically what your P.S. says.)
$ bsub -Is -m hostA /bin/bash
Job <7536> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on hostA>>
hostA$
You could maybe use lsrun. But it won't work with the batch system to allocate a slot or other resource.
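If you prefer to stay in plain bash, here is a sketch of the "check after a few seconds" approach from the P.S. (the 10-second wait is arbitrary, and the STAT column position in bjobs output should be verified on your installation):
# Submit and capture the job ID from bsub's "Job <NNN> is submitted ..." line
jobid=$(bsub sleep 60 | sed 's/Job <\([0-9]*\)>.*/\1/')
sleep 10
# STAT is usually the third column of bjobs output
status=$(bjobs "$jobid" | awk 'NR==2 {print $3}')
if [[ "$status" == "PEND" ]]; then
    echo "Job $jobid is still pending; killing it"
    bkill "$jobid"
fi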
