Launching and monitoring a job at the same time in Ansible Tower CLI

We have installed Ansible Tower and set up the CLI tools. We can launch jobs from the CLI using the following:
tower-cli job launch -J 5
This returns the output like so -
Resource changed.
=== ============ ======================== ======= =======
id  job_template created                  status  elapsed
=== ============ ======================== ======= =======
119 5            2017-12-05T20:26:31.197Z pending 0.0
=== ============ ======================== ======= =======
And then we can monitor the status like this -
tower-cli job monitor 119
Is it possible to pass the job ID from the launch output into the monitor command in some way (or to launch and monitor in a single step)? Since we have multiple jobs running on the server, we need to be able to reliably get the job ID each time.
I didn't see anything about this when I read over the documentation at http://tower-cli.readthedocs.io/en/latest/cli_ref/index.html.
Thanks.

I'm on Tower CLI 3.3.0. I ran tower-cli job launch --help, which listed the following related options:
--monitor    If sent, immediately calls `job monitor` on the
             newly launched job rather than exiting with a
             success.
--wait       Monitor the status of the job, but do not print
             while job is in progress.
So I think you can just do the following:
tower-cli job launch -J 5 --monitor
(I add the --wait option when I'm running this in my CI build, which is why I included it above.)

I fixed this by doing the following -
OUTPUT="$(tower-cli job launch -J 5 | grep -oE '[0-9]+' | head -1)"
tower-cli job monitor "$OUTPUT"
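The grep above takes the first run of digits anywhere in the output. That works for the table shown in the question, but it would pick the wrong value if a number ever appeared before the id. A slightly stricter sketch, assuming the human-readable table layout from the question (the helper name extract_job_id is made up for illustration), takes the first row whose first column is purely numeric:

```shell
# Hypothetical helper: read `tower-cli job launch` table output on stdin
# and print the id from the first row whose first column is all digits.
extract_job_id() {
  awk '$1 ~ /^[0-9]+$/ { print $1; exit }'
}

# Usage (assumes tower-cli is installed and job template 5 exists):
#   JOB_ID="$(tower-cli job launch -J 5 | extract_job_id)"
#   tower-cli job monitor "$JOB_ID"
```

If the built-in --monitor option from the other answer is available in your version, it is probably the simpler choice; this helper is only useful when the id is needed for something else as well.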

Related

Snakemake does not recognise job failure due to timeout with error code -11

Has anyone had a problem with snakemake not recognizing a timed-out job? I submit jobs to a cluster using qsub with a timeout set per rule:
snakemake --jobs 29 -k -p --latency-wait 60 --use-envmodules \
--cluster "qsub -l walltime={resources.walltime},nodes=1:ppn={threads},mem={resources.mem_mb}mb"
If a job fails within a script, the next one in line is executed. However, when a job hits the timeout defined in a rule, the next job in line is not executed, which reduces the total number of jobs running in parallel on the cluster over time. According to the MOAB scheduler (PBS server), a timed-out job raises a -11 exit status. As far as I understand, any non-zero exit status means failure, or does this only apply to positive integers?
Thanks in advance for any hint:)
If you don't provide a --cluster-status script, snakemake checks job status internally by touching hidden files from within the submitted job script. When a job times out, the snakemake process on the node doesn't get a chance to report the failure to the main snakemake instance, because the scheduler kills it first.
You can try a cluster profile, or just grab a suitable cluster status script (be sure to make it executable with chmod +x, and have qsub report a parsable job id).
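As a rough illustration of what such a status script does: snakemake calls it with the job id as its first argument and expects it to print success, running, or failed. The sketch below assumes a Torque/PBS-style `qstat -f` that reports `job_state = ...` and `exit_status = ...` lines, and that completed jobs stay visible long enough to be queried; the function name classify_qstat is hypothetical. Treating a job that qstat no longer knows about as failed is a choice made for this sketch, not something snakemake requires.

```shell
# Hypothetical helper: read `qstat -f <jobid>` output on stdin and print
# one of: running, success, failed (the words snakemake expects).
classify_qstat() {
  awk -F' = ' '
    /^[ \t]*job_state = /   { state = $2; gsub(/[ \t\r]/, "", state) }
    /^[ \t]*exit_status = / { code = $2;  gsub(/[ \t\r]/, "", code) }
    END {
      if (state == "")       print "failed"   # job unknown (assumption: treat as failed)
      else if (state != "C") print "running"  # queued, running, held, ...
      else if (code == "0")  print "success"
      else                   print "failed"   # any non-zero status, including -11
    }'
}

# In an executable status.sh, the last line would be something like:
#   qstat -f "$1" 2>/dev/null | classify_qstat
# and snakemake would be started with: --cluster-status ./status.sh
```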

After triggering a Jenkins job remotely via a Bash script, when should I retrieve the job id?

I already built a script trigger_jenkins_job.sh which works perfectly fine for now. It’s composed mainly of 3 functions:
input_checkpoint
run_remotejob   #: Running Jenkins job remotely using JSON API.
sleep 10        #: 10 sec estimated time until pending duration is over
                #  and Jenkins job starts running, i.e. a given slave was
                #  assigned to run the job.
get_buildID     #: Retrieving build state, last build ID and last stable
                #  build ID using
The problem is that I want to get rid of that 10-second sleep. At the same time, I want to be sure, before executing get_buildID, that the remotely triggered job is actually running on a node.
That way I will retrieve the triggered job's ID, and not the ID of the last build in the queue before I triggered that job.
Regarding the Jenkins file of the job, I specified:
agent {
label 'linux-node'
}
So I guess the question is: from my bash script, I somehow need to test whether linux-node is running the remotely triggered job, and if so, execute get_buildID.
Get rid of the sleep command and use the wait command.
If you are triggering the job with a token, the command itself should return the build number.
Another way could be the REST API. See the "nextBuildNumber" field there (if the build is still pending), else "number".
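One way to replace the fixed sleep with polling can be sketched as follows. It relies on Jenkins answering a build trigger with a Location header that points at a queue item, and on that queue item's api/json gaining an "executable" object with a "number" once the build has actually started. JENKINS_URL, JOB_NAME and JENKINS_AUTH (user:apitoken) are placeholders, the function names are invented for this example, and the sed-based JSON parsing assumes the compact JSON Jenkins normally returns.

```shell
# Print the queue item URL from HTTP response headers read on stdin.
queue_url_from_headers() {
  tr -d '\r' | awk 'tolower($1) == "location:" { print $2; exit }'
}

# Print the build number from queue item JSON read on stdin
# (prints nothing while the item is still waiting for an executor).
build_number_from_queue_json() {
  sed -n 's/.*"executable":{[^}]*"number":\([0-9][0-9]*\).*/\1/p'
}

# Trigger the job, then poll its queue item until a build number appears.
trigger_and_get_build_id() {
  queue_url="$(curl -s -o /dev/null -D - -X POST --user "$JENKINS_AUTH" \
    "$JENKINS_URL/job/$JOB_NAME/build" | queue_url_from_headers)"
  [ -n "$queue_url" ] || return 1
  while :; do
    number="$(curl -s --user "$JENKINS_AUTH" "${queue_url%/}/api/json" \
      | build_number_from_queue_json)"
    [ -n "$number" ] && { echo "$number"; return 0; }
    sleep 2   # short poll interval instead of one fixed 10-second guess
  done
}
```

The loop has no upper bound, so a cancelled queue item would make it spin forever; a real script would probably want a maximum number of attempts.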

SGE runs only one task per node

I built SGE from source code on a four-node cluster. The operating system is CentOS 7. When I submit some simple tasks to the cluster, I find that only one task runs on each node. What's the problem? Here is my task code:
sleep 60
echo "done"
and this is my cmd to submit the tasks:
DIR=$(pwd)
option=""
for ((i=0; i<5; i++)); do
    qsub -q multislots $option -V -cwd -o stdout -e stderr -S /bin/bash "$DIR/test.sh"
    sleep 1
done
When I run qstat -f, it shows: (screenshot)
Given the error message about jobs failing with "can not find an unused add_grp_id", you should check what gid_range is set to in the SGE configuration (both the global one and the per-host one, if there is one). It should be a range of otherwise unused group ids, with at least as many gids as the number of jobs you want to run on a node.
If that isn't it, try running qalter -w v and qalter -w p on one of the queued jobs to see why they aren't being started.
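To check how many group ids the configuration actually provides, one could count them from the gid_range line. This is only a sketch: it assumes the configuration dump contains a line of the form "gid_range 20000-20100" (optionally several comma-separated ranges or single ids), and count_gids is a name invented here.

```shell
# Hypothetical helper: read an SGE configuration dump on stdin and print
# how many group ids the gid_range line covers.
count_gids() {
  awk '$1 == "gid_range" {
         n = split($2, parts, ",")
         total = 0
         for (i = 1; i <= n; i++) {
           m = split(parts[i], bounds, "-")
           total += (m == 2) ? bounds[2] - bounds[1] + 1 : 1
         }
         print total
       }'
}

# Usage (requires SGE's qconf on the PATH):
#   qconf -sconf | count_gids               # global configuration
#   qconf -sconf <hostname> | count_gids    # per-host configuration, if any
```

A result of 1 would match the symptom in the question, since only one job per node could then be given an additional group id.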

ExitCode of RunProgramInGuest in Jenkins job

I'm running a batch file in a virtual machine from a Jenkins job. I'm using the following command to run it:
..path..\vmrun.exe -T ws -gu username -gp password runProgramInGuest "c:\vm_image.vmx" -activeWindow -interactive "C:\Installer.bat"
The job runs correctly and installs the software (by running the batch file).
But sometimes it exits with exit code 2,
so Jenkins shows the job as failed.
What does exit code 2 mean in this job?
What are the other possible exit codes for this command, and what do they mean?
How can I find out whether the job passed or failed?
If I understood what you ran, it's:
0 – VIX_OK: The operation was successful.
1 – VIX_E_FAIL: Unknown error.
2 – VIX_E_OUT_OF_MEMORY: Memory allocation failed: out of memory.

"qsub -now" equivalent using bsub

In SGE, we have
qsub -now yes/no <command>
With "-now yes", the job is scheduled immediately (if possible) or not at all; it is never put in the pending queue.
With "-now no", the job is put in the pending queue if it cannot be executed immediately.
In LSF, the equivalent of qsub is bsub.
With bsub, the job is put in the pending queue if it cannot be executed immediately. There is no option like "-now yes" as in qsub.
Do we have something in bsub equivalent to "qsub -now"?
P.S.: One solution is to check for some time (a few seconds) after running bsub whether the job has been scheduled, and exit if it has not. I am looking for a more elegant way.
I found the answer in an LSF way.
LSF does provide a way to quit a job if it is unable to schedule the resource. There is an environment variable, LSF_NIOS_PEND_TIMEOUT (specified in minutes), which terminates the job if it is still in the pending queue after that time.
env LSF_NIOS_PEND_TIMEOUT=1 bsub -Is -m host /bin/bash
From somewhere on the web:
LSF_NIOS_PEND_TIMEOUT
Syntax: LSF_NIOS_PEND_TIMEOUT=minutes
Description: Applies only to interactive batch jobs. Maximum amount of time that an interactive batch job can remain pending. If this parameter is defined, and an interactive batch job is pending for longer than the specified time, the interactive batch job is terminated.
Valid values: Any integer greater than zero
LSF doesn't have the same thing. You could use expect with a timeout. LSF will output something like this when the job starts, so your expect script could expect "<<Starting on". (But this is basically what your P.S. says.)
$ bsub -Is -m hostA /bin/bash
Job <7536> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on hostA>>
hostA$
You could maybe use lsrun. But it won't work with the batch system to allocate a slot or other resource.
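For non-interactive jobs, the polling approach from the question's P.S. could look roughly like this. It is a sketch, not an LSF feature: it assumes bsub prints a line such as "Job <7536> is submitted to default queue <interactive>." and that STAT is the third column of the default bjobs listing. The function names and the PEND_LIMIT variable (seconds to wait) are made up for this example.

```shell
# Print the job id from a bsub submission line read on stdin, e.g.
# "Job <7536> is submitted to default queue <interactive>."
bsub_job_id() {
  sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Submit "$@" with bsub; if the job is still PEND after PEND_LIMIT
# seconds (default 10), kill it and return non-zero.
submit_now_or_never() {
  job_id="$(bsub "$@" | bsub_job_id)"
  [ -n "$job_id" ] || return 1
  sleep "${PEND_LIMIT:-10}"
  # Line 2 of the default bjobs output is the job row; column 3 is STAT.
  stat="$(bjobs "$job_id" 2>/dev/null | awk 'NR == 2 { print $3 }')"
  if [ "$stat" = "PEND" ]; then
    bkill "$job_id" >/dev/null
    return 1
  fi
  echo "$job_id"
}

# Usage (requires LSF): PEND_LIMIT=5 submit_now_or_never -m hostA ./myjob.sh
```

Unlike "qsub -now yes", this still places the job in the pending queue for a few seconds, so it only approximates the SGE behaviour.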
