DATASTAGE: how to run multiple instances of the same job in parallel using DSJOB - parallel-processing

I have a question.
I want to run multiple instances of the same job in parallel from within a script: I have a loop in which I invoke the jobs with dsjob, without the "-wait" and "-jobstatus" options.
I want the jobs to complete before the script terminates, but I don't know how to verify that a job instance has finished.
I thought about using the wait command, but it is not appropriate here.
Thanks in advance

First, make sure the job is compiled with the "Allow Multiple Instance" option enabled.
Second:
#!/bin/bash
. /home/dsadm/.bash_profile    # load the DataStage environment (DSHOME, PATH)
INVOCATION=(1 2 3 4 5)         # invocation ids
cd $DSHOME/bin
for id in "${INVOCATION[@]}"
do
    ./dsjob -run -mode NORMAL -wait test demo.$id
done
project -- test
job -- demo
$id -- invocation id
The first two lines of the shell script guarantee that the environment is set up so that dsjob can be found and run.

Run the jobs as you describe, without -wait, and then loop around running dsjob -jobinfo and parse the output for a job status of 1 or 2. When all jobs return one of these statuses, they are all finished.
You might find, though, that you check the status of a job before it actually starts running and pick up an old status. You might be able to fix this by first resetting the job instance and waiting for a status of "Not running", prior to running the job.
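For example, here is a minimal polling sketch along those lines (assuming project "test", job "demo" and invocation ids 1 to 5 as in the script above; the exact status strings printed by dsjob -jobinfo vary by version, so adjust the grep pattern to what your installation prints):
#!/bin/bash
. /home/dsadm/.bash_profile
cd $DSHOME/bin
# start all instances without -wait so they run in parallel
for id in 1 2 3 4 5
do
    ./dsjob -run -mode NORMAL test demo.$id
done
# poll each instance until dsjob -jobinfo reports a finished status (1 or 2)
for id in 1 2 3 4 5
do
    until ./dsjob -jobinfo test demo.$id | grep -Eq "RUN OK|RUN with WARNINGS"
    do
        sleep 10
    done
done
In real use you would also break out of the loop on a failed or aborted status instead of polling forever.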

Invoke the jobs in a loop without the -wait or -jobstatus options.
After the loop, check the job statuses with the dsjob command.
Example - dsjob -jobinfo projectname jobname.invocationid
You can code one more loop for this as well and use the sleep command inside it.
Write your further logic according to the status of the jobs.
It is better, though, to create a job sequence to invoke this multi-instance job simultaneously with the help of different invocation ids:
create a sequence job if these belong to the same process;
create different sequences, or directly create different scripts, to trigger these jobs simultaneously with their invocation ids and schedule them at the same time.
The best option is to create a standard, generalized script where everything is created or given its value from the command-line parameters.
Example - log files based on jobname + invocation-id.
Then schedule the same script with different parameters or invocations.
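A rough sketch of such a generalized wrapper, where the project, job and invocation id come in as command-line parameters and the log file name is built from jobname + invocation-id (the paths and file names are placeholders, not a standard):
#!/bin/bash
# Usage: run_instance.sh <project> <job> <invocation_id>   (hypothetical wrapper)
. /home/dsadm/.bash_profile
PROJECT=$1
JOB=$2
INVID=$3
LOGFILE=/tmp/${JOB}.${INVID}.log                  # one log file per jobname + invocation-id
cd $DSHOME/bin
./dsjob -run -mode NORMAL -wait -jobstatus $PROJECT ${JOB}.${INVID} > $LOGFILE 2>&1
echo "dsjob exit status for ${JOB}.${INVID}: $?" >> $LOGFILE
You can then schedule this one script several times at the same moment, once per invocation id.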

Related

SLURM status string on job completion / exit

How do I get the Slurm job status (e.g. COMPLETED, FAILED, TIMEOUT, ...) on job completion (within the submission script)?
I.e. I want to separately keep track of jobs which timed out or failed.
Currently I work with the exit code; however, jobs which TIMEOUT also get exit code 0.
For future reference, here is how I finally do it.
Retrieve the job id at the beginning of the job and write some information (e.g. "${SLURM_JOB_ID} ${PWD}") to a summary file.
Then process this file and use something like sacct -X -n -o State -j ${jid} to get the job status.
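Roughly, that post-processing step could look like this (assuming each job appended "${SLURM_JOB_ID} ${PWD}" to a summary file called jobs_summary.txt; the file name is just an example):
#!/bin/bash
while read -r jid wdir
do
    state=$(sacct -X -n -o State -j "$jid" | tr -d ' ')   # e.g. COMPLETED, FAILED, TIMEOUT
    echo "$jid ($wdir): $state"
done < jobs_summary.txt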

How to make sbatch wait until last submitted job is *running* when submitting multiple jobs?

I'm running a numerical model whose parameters are in a "parameter.input" file. I use sbatch to submit multiple iterations of the model, with one parameter in the parameter file changing every time. Here is the loop I use:
#!/bin/bash -l
for a in {01..30}
do
sed -i "s/control_[0-9][0-9]/control_${a}/g" parameter.input
sbatch --time=21-00:00:00 run_model.sh
sleep 60
done
The sed line changes a parameter in the parameter file; the run_model.sh file runs the model.
The problem: depending on the resources available, a job might run immediately or stay pending for a few hours. With my default loop, if 60 seconds is not enough time to find resources for job n to run, the parameter file will be modified while job n is pending, meaning job n will run with the wrong parameters. (and I can't wait for job n to complete before submitting job n+1 because each job takes several days to complete)
How can I force sbatch to wait to submit job n+1 until job n is running?
I am not sure how to create an until loop that would grab the status of job n and wait until it changes to 'running' before submitting job n+1. I have experimented with a few things, but the server I use also hosts another 150 people's jobs, and I'm afraid too much experimenting might create some issues...
Use the following to grab the last submitted job's ID and its status, and wait until it isn't pending anymore to start the next job:
sentence=$(sbatch --time=21-00:00:00 run_model.sh) # get the output from sbatch
stringarray=($sentence) # separate the output in words
jobid=(${stringarray[3]}) # isolate the job ID
sentence="$(squeue -j $jobid)" # read job's slurm status
stringarray=($sentence)
jobstatus=(${stringarray[12]}) # isolate the status of job number jobid
Check that the job status is 'running' before submitting the next job with:
if [ "$jobstatus" = "R" ];then
# insert here relevant code to run next job
fi
You can insert that last snippet in an until loop that checks the job's status every few seconds.
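Put together, the loop could look roughly like this (using squeue's -o "%t" output format to read the compact state code instead of word-splitting the whole line, which is a little less fragile; the 30-second poll interval is arbitrary):
#!/bin/bash -l
for a in {01..30}
do
    sed -i "s/control_[0-9][0-9]/control_${a}/g" parameter.input
    sentence=$(sbatch --time=21-00:00:00 run_model.sh)   # e.g. "Submitted batch job 123456"
    stringarray=($sentence)
    jobid=${stringarray[3]}                              # isolate the job ID
    jobstatus=""
    until [ "$jobstatus" = "R" ]                         # wait until the job is running
    do
        sleep 30
        jobstatus=$(squeue -h -j "$jobid" -o "%t")       # PD = pending, R = running
    done
done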

After triggering a Jenkins job remotely via a Bash script, when should I retrieve the job id?

I already built a script trigger_jenkins_job.sh which works perfectly fine for now. It’s composed mainly of 3 functions:
input_checkpoint
run_remotejob   #: runs the Jenkins job remotely using the JSON API
sleep 10        #: 10 sec estimated time until the pending duration is over
                #  and the Jenkins job starts running, i.e. a given slave has
                #  been assigned to run the job
get_buildID     #: retrieves the build state, last build ID and last stable
                #  build ID using
The problem is that I want to get rid of that sleep 10 seconds, and at the same time I want to be sure, before executing the function get_buildID, that the remotely-triggered job is actually running on a node.
That way I will be retrieving the triggered job's id, and not the last one in the queue before triggering that job.
Regarding the Jenkinsfile of the job, I specified:
agent {
label 'linux-node'
}
So, I guess the question is: I need, somehow, from my bash script, to test whether linux-node is running the remotely-triggered job, and if yes, execute the function get_buildID.
Get rid of the sleep command and use the wait command.
If you are triggering the job with tokens, the command itself should return the buildNumber.
Another way could be the REST API. Please see the "nextBuildNumber" field there (if the build is still pending), else "number".
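One hedged sketch of that REST-API route: when you trigger the build remotely, Jenkins answers with a Location header pointing at the queue item, and once a node picks the job up, that queue item's JSON gains an "executable" object containing the real build number. JENKINS_URL, JOB_NAME, USER, API_TOKEN and TRIGGER_TOKEN are placeholders, jq is assumed to be available, and a CSRF crumb may additionally be required on your instance.
#!/bin/bash
# trigger the job and capture the queue item URL from the Location header
queue_url=$(curl -s -i -u "$USER:$API_TOKEN" -X POST \
    "$JENKINS_URL/job/$JOB_NAME/build?token=$TRIGGER_TOKEN" \
    | awk 'tolower($1)=="location:" {print $2}' | tr -d '\r')
# poll the queue item until a node has started the build
build_number=""
until [ -n "$build_number" ]
do
    sleep 5
    build_number=$(curl -s -u "$USER:$API_TOKEN" "${queue_url}api/json" \
        | jq -r '.executable.number // empty')
done
echo "Build $build_number is running"   # now it is safe to call get_buildID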

how to check failed and terminated jobs

How do I write a shell script that checks for failed and terminated jobs in a particular directory every hour? If there is any failure we need to get a mail.
You can use these commands.
For current jobs, use:
user@mysystem:~$ jobs
and for termination, use:
user@mysystem:~$ fg

Script didn't Finish execution but cron job started again

I am trying to run a cron job which will execute my shell script; the shell script contains Hive & Pig scripts. I set the cron job to execute every 2 minutes, but the cron job starts again before my shell script has finished. Is that going to affect my results, or will the next run only start once the script finishes its execution? I am in a bit of a dilemma here. Please help.
Thanks
I think there are two ways to better resolve this, a long way and a short way:
Long way (probably most correct):
Use something like Luigi to manage job dependencies, then run that with Cron (it won't run more than one of the same job).
Luigi will handle all your job dependencies for you and you can make sure that a particular job only executes once. It's a little more work to get set-up, but it's really worth it.
Short Way:
Lock files have already been mentioned, but you can do this on HDFS too, that way it doesn't depend on where you run the cron job from.
Instead of checking for a lock file, put a flag on HDFS when you start and finish the job, and have this as a standard thing in all of your cron jobs:
# at start
hadoop fs -touchz /jobs/job1/2016-07-01/_STARTED
# at finish
hadoop fs -touchz /jobs/job1/2016-07-01/_COMPLETED
# Then check them (pseudocode):
if(!started && !completed): run_job; add_completed; remove_started
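A hedged bash rendering of that pseudocode, assuming the date-stamped directory layout above and that "hadoop fs -test -e" (exit code 0 when the path exists) is available; the script path is a placeholder:
#!/bin/bash
DIR=/jobs/job1/$(date +%F)
hadoop fs -mkdir -p "$DIR"
if ! hadoop fs -test -e "$DIR/_STARTED" && ! hadoop fs -test -e "$DIR/_COMPLETED"
then
    hadoop fs -touchz "$DIR/_STARTED"
    /path/to/your_hive_pig_script.sh        # the actual hive/pig shell script
    hadoop fs -touchz "$DIR/_COMPLETED"
    hadoop fs -rm "$DIR/_STARTED"
fi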
At the start of the script, have a check:
#!/bin/bash
# /tmp/file.lock acts as a "previous run completed" flag; create it once
# manually before the very first run.
if [ -e /tmp/file.lock ]; then
    rm /tmp/file.lock   # flag found: the previous run finished, so remove it and continue
else
    exit                # no flag: the previous execution has not completed yet
fi
....                    # Your script here
touch /tmp/file.lock    # recreate the flag to mark this run as completed
There are many other ways of achieving the same. I am giving a simple example.
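One of those other ways, sketched with the util-linux flock utility (assuming it is installed; the lock file and script paths are placeholders): the kernel releases the lock automatically when the script exits, so there is no flag file to pre-create and no stale lock to clean up.
*/2 * * * * /usr/bin/flock -n /tmp/myjob.lock /path/to/your_script.sh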
