I am running an LSF job array to create a target in a makefile.
However as soon as the array is submitted make considers the command for the target to be executed, and throws an error as the target does not exist.
How can I force make to wait until the completion of the LSF job array before moving onto other dependent targets?
Example:
all: final.txt
first_%.txt:
bsub -J" "jarray[1-100]" < script.sh
final.txt: first_%.txt
cat first_1.txt first_50.txt first_100.txt > final.txt
Unfortunately the -K flag isn't supported for job arrays.
Try bsub -K which should force bsub to stay in the foreground until the job completes.
Edit
Since the option isn't supported on arrays, I think you'll have to submit your array as separate jobs, something like:
for i in `seq 1 100`; do
export INDEX=$i
bsub -K < script.sh &
done
wait
You'll have to pass the index to your script manually instead of using the job array index.
You need to ask the bsub command to wait for the job to complete. I have never used it, but according to the man page you can add the -K option to do this.
Related
I am trying to create an auto-rerun mechanism by implementing some code into sasbatch script after sascommand will finish. General idea is to:
locate a log of sas process and an id of the flow containing current job,
check if the log contains particular ORA-xxxxx errors which we know that solution for them is just rerun of the process,
if so, then trigger jrerun class from LSF Platform Command Line Interface,
exit sasbatch passing $rc to LSF
The idea was implemented as:
#define used paths
log_dir=/path/to/sas_logs_directory
out_log=/path/to/auto-rerun_log.txt
out_log2=/path/to/lsf_rerun_log.txt
if [ -n "${LSB_JOBNAME}"]; then
if [ ! -f "$out_log"]; then
touch $out_log
fi
#get flow runtime attributes
IFS-: read -r flow_id username flow_name job_name <<< "${LSB_JOBNAME}"
#find log of the current process
log_path=$(ls -t $log_dir/*.log | xargs grep -li "job:\s*$job_name" | grep -i "/$flow_name_" | head -1)
#set path to txt file containing lines which represents ORA errors we look for
conf_path-/path/to/error_list
#analyse process' log line by line
while read -r line;
do
#if error is found in log then try to rerun flow
if grep -q "$line" $log_path; then
(nohup /path/to/rerun_script.sh $flow_id >$out_log2 2>&1) &
disown
break
fi
done < $conf_path
fi
While rerun_script is the script which calls jrerun class after sleep command - in order to let parent script exit $rc in the meanwhile. It looks like:
sleep 10
/some/lsf/path/jrerun
Problem is that job is running for the all time. In LSF history I can see that jrerun was called before job exited.
Furthermore in $out_log2 I can see message: <flow_id> has no starting or exit points.
Do anyone have an idea how I can pass return code to LSF before jrerun calling? Or maybe some simplier way to perform autorerun of SAS jobs in Platform LSF?
I am using SAS 9.4 and Platform Process Manager 9.1
Or maybe some simplier way to perform autorerun of SAS jobs in Platform LSF?
I'm not knowledgeable about the SAS part. But on the LSF side there's at least a couple of ways to requeue the job.
If you have control of the job script, you can use special process exit value to automatically requeue the job.
https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_admin/job_requeue_about.html
If you have control outside of the job script, you can use brequeue -r to requeue a running job.
https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_command_ref/brequeue.1.html
Good Luck
I managed to get this working by using two additional configuration files. When my grep returnes 1 I add found flow_id to flow_list.txt configuration file and modify especially made trigger_file.txt.
I scheduled additional flow execute_rerun in LSF which is triggered after file trigger_file.txt is modified. The execute_rerun flow reads flow_list.txt configuration file line by line and calls jrerun method on each flow.
I managed to achieve an automatic rerun of the flows which fails due to particular errors.
I would like to submit an array job on a cluster running SGE.
I know how to use array jobs with the -t option (for instance, qsub -t 1-1000 somescript.sh).
What if I don't know how many tasks I have to submit? The idea would be to use something like (not working):
qsub -t 1- somescript.sh
The submission would then go for all the n tasks, with unknown n.
No, open-ended arrays are not a built-in capability (nor can you add jobs to an array after initial submission).
I'm guessing about why you want to do this, but here's one idea for keeping track of a group of jobs like this: specify a shared name for the set of jobs, appending a counter.
So, for example, you'd include -N myjob.<counter> in your qsub (or add a #PBS script line for it):
-N myjob.1
-N myjob.2
...
-N myjob.n
Hopefully this is not a dublicate and also not just a problem of our cluster's configuration...
I am submitting a job array to a cluster using qsub with the following command:
qsub -q QUEUE -N JOBNAME -t 1:10 -e ${ERRFILE}_$SGE_TASK_ID /path/to/script.sh
where
ERRFILE=/home/USER/somedir/errors.
The idea is to specify an error file (also analogously the output file) that also contains the task ID from within the job array.
So far I have learned that the line
#$ -e ${ERRFILE}_$SGE_TASK_ID
inside the script.sh, does not work, because it is a comment and not evaluated by bash. My first line does not work however because $SGE_TASK_ID is only set AFTER the job is submitted.
I read here that escaping the evaluation of $SGE_TASK_ID (in that link it's PBS' $PBS_JOBID, but a similar problem) should work, but when I tried
qsub -q QUEUE -N JOBNAME -t 1:10 -e ${ERRFILE}_\$SGE_TASK_ID /path/to/script.sh
it did not work as expected.
Am I missing something obvious? Is it possible to use $SGE_TASK_ID in the name of an error file (the automatic naming of error files does that, but I want to specify the directory and if possible the name, too)?
Some additional remarks:
I am using the -cwd option for qsub inside script.sh, but that is NOT where I want my error files to be stored.
I have next to no control over how the cluster works and no root access (wouldn't know what I could need it for in this context but anyway...).
Apparently our cluster does not use PBS.
Yes my scripts are all executable and where applicable started with #!/bin/bash (I also specified the use of bash with the -S /bin/bash option for qsub).
There seems to be a solution here, but I am not quite sure how that works and it also appears to be using PBS. If that answer DOES apply to my question and I misunderstood it, please let me know.
I would appreciate any hint into the right direction.
Thank You!
I didn't know this either, but it looks like Grid Engine has something called "pseudo environment variables" like $TASK_ID for this purpose. This should work:
qsub -q QUEUE -N JOBNAME -t 1:10 -e ${ERRFILE}_\$TASK_ID /path/to/script.sh
From the man page:
-e [[hostname]:]path,...
...
If the pathname contains certain pseudo
environment variables, their value will be expanded at
runtime of the job and will be used to constitute the
standard error stream path name. The following pseudo
environment variables are supported currently:
$HOME home directory on execution machine
$USER user ID of job owner
$JOB_ID current job ID
$JOB_NAME current job name (see -N option)
$HOSTNAME name of the execution host
$TASK_ID array job task index number
I am submitting this simple job to SGE through qsub. How can I see the output of the job which is a simple echo in my terminal. I mean I want it directly on screen not diverting the output to a logfile or something.
So here is the job stored in Dummyjob:
#!/bin/sh
#$ -j y
#$ -S /bin/sh
#$ -q long.q
sleep 30
echo "I'm done!"
And this is the qsub command:
qsub -N job_1 -cwd./Dummyjob
Thank you!
It doesn't do that. You're referring to a batch facility, e.g., How to submit a job using qsub.
Looking at the command-line options, these are the possibilities:
-o <output_logfile> name of the output log file
-e <error_logfile> name of the error log file
-m ea Will send email when job ends or aborts
You can ask it to send mail when the job is done (successfully or not). Or you might be able to make it write to a fifo, e.g., in one terminal you would do
mkfifo myFakeFile
tail -f myFakeFile
and then use
-o myFakeFile
when submitting (in that order, so that something is waiting). But if the program does any checking, it will not write to a fifo (because it is not a regular file).
Further reading:
qsub - submit a batch job to Sun Grid Engine.
6.3.2 Creating a FIFO (The Linux Programmer's Guide)
The previous answer mentions that you are submitting a 'batch job script' and this is true, so you will not see the output on your terminal (tty) but the stdout/stderr will be sent to output files. However that doesn't mean you can't run an interactive job through Grid Engine. You can, just use 'qrsh' instead of using 'qsub' and the script will be run on a remote machine chosen by Grid Engine - the results will be displayed on your screen.
Note: You might have to configure qrsh in your Grid Engine Cluster for this to work.
I've been using
qsub -t 1-90000 do_stuff.sh
to submit my tasks on a Sun GridEngine cluster, but now find myself with data sets (super large ones, too) which are not so conveniently named. What's the best way to go about this? I could try to rename them all, but the names contain information which needs to be preserved, and this obviously introduces a host of problems. I could just preprocess everything into jsons, but if there's a way to just qsub -all_contents_of_directory, that would be ideal.
Am I SOL? Should I just go to the directory in question and find . -exec 'qsub setupscript.sh {}'?
Use another script to submit the job - here's an example I used where I want the directory name in the job name. "run_openfoam" is the pbs script in the particular directory.
#!/bin/bash
cd $1
qsub -N $1 run_openfoam
You can adapt this script to suit your job and then run it through a loop on the command line. So rather than submitting a job array, you submit a job for each dir name passed as the first parapmeter to this script.
I tend to use Makefiles to automate this stuff:
INPUTFILES=$(wildcard *.in)
OUTPUTFILES=$(patsubst %.in,%.out,$(INPUTFILES))
all : $(OUTPUTFILES)
%.out : %.in
#echo "mycommand here < $< > $#" | qsub
Then type 'make', and all files will be submitted to qsub. Of course, this will submit everything all at once, which may do unfortunate things to your compute cluster and your sysadmin's blood pressure.
If you remove the "| qsub", the output of make is a list of commands to run. Feed that list into one or more qsub commands, and you'll get an increase in efficiency and a reduction in qsub jobs. I've been using GNU parallel for that, but it needs a qsub that blocks until the job is done. I wrote a wrapper that does that, but it calls qstat a lot, which means a lot of hitting on the system. I should modify it somehow, but there aren't a lot of computationally 'good' options here.
I cannot understand "-t 1-90000" in your qsub command. My searching of qsub manual doesn't show such "-t" option.
Create a file with a list of the datasets in it
find . -print >~/list_of_datasets
Script:
#!/bin/bash
exec ~/setupscript.sh $(sed -n -e "${SGE_TASK_ID}p" <~/list_of_datasets)
qsub -t 1-$(wc -l ~/list_of_datasets) job_script