Running tasks in parallel with Makefile

I'm having issues structuring my Makefile to run my shell scripts in the desired order.
Here is my current Makefile:
## Create data splits
raw_data: src/data/get_data.sh
    src/data/get_data.sh
    hadoop fs -cat data/raw/target/* >> data/raw/target.csv
    hadoop fs -cat data/raw/control/* >> data/raw/control.csv
    hadoop fs -rm -r -f data/raw
    touch raw_data_loaded

split_data: raw_data_loaded
    rm -rf data/interim/splits
    mkdir data/interim/splits
    $(PYTHON_INTERPRETER) src/data/split_data.py

## Run Models
random_forest: split_data
    nohup $(PYTHON_INTERPRETER) src/models/random_forest.py > random_forest &

under_gbm: split_data
    nohup $(PYTHON_INTERPRETER) src/models/undersampled_gbm.py > under_gbm &

full_gbm: split_data
    nohup $(PYTHON_INTERPRETER) src/models/full_gbm.py > full_gbm &

# Create predictions from model files
predictions: random_forest under_gbm full_gbm
    nohup $(PYTHON_INTERPRETER) src/models/predictions.py > predictions &
The Problem
Everything works fine until I reach the ## Run Models section. These are all independent scripts that can run once split_data is finished. I want to run the three model scripts simultaneously, so I start each in the background with &.
The problem is that my last task, predictions, begins to run at the same time as the three preceding tasks. What I want is for the three simultaneous model scripts to finish, and only then for predictions to run.
My Attempt
My proposed solution is to run my final model task, full_gbm, without the &, so that predictions doesn't run until it finishes. This should work, but I'm wondering if there is a less 'hacky' way to achieve this -- is there some way to structure the targets to achieve the same result?

You don't say which implementation of Make you're using. If it's GNU Make, you can invoke it with the -j option so it decides for itself which jobs can run in parallel. Then you can remove the nohup and & from all the recipes; predictions won't start until all of random_forest, under_gbm and full_gbm have completed, and the build itself won't end until predictions has completed.
Also, you won't lose the all-important exit status of the commands.
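As a sketch, using the target names and commands from the question, with the recipes unchanged apart from dropping nohup and & (the output redirections are kept so each target file still gets created):

## Run Models
random_forest: split_data
    $(PYTHON_INTERPRETER) src/models/random_forest.py > random_forest

under_gbm: split_data
    $(PYTHON_INTERPRETER) src/models/undersampled_gbm.py > under_gbm

full_gbm: split_data
    $(PYTHON_INTERPRETER) src/models/full_gbm.py > full_gbm

# Create predictions from model files
predictions: random_forest under_gbm full_gbm
    $(PYTHON_INTERPRETER) src/models/predictions.py > predictions

Then invoke make with, for example, make -j3 predictions: the three model recipes run in parallel, and the predictions recipe only starts once all three prerequisites have finished successfully.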

Related

How to run multiple unique parallel jobs on independent nodes using slurm with master / agent set up

I have a physical model optimization program that uses a master / agent design to run unique parameterizations of the model across multiple nodes in parallel. I reserve the nodes and create the working directories using a batch script that ultimately uses an srun --multi-prog pest.conf command to call the optimization software (PEST++). The optimization program then calls a bash script which ultimately calls the model executable. I've been using something like srun -n 20 process.exe, but keep getting a "step creation temporarily disabled" error.
So the workflow is (1) call the batch script, which sets up directories and creates the multi-prog .conf script:
#SBATCH -N 4
#SBATCH --hint=nomultithread
#SBATCH -p workq
#SBATCH --time=1:00:00
(2) The resulting multi-prog pest.conf script looks like this:
0 bash -c 'cd /caldera/projects/usgs/water/waiee/wrftest/base_pp_dir_3593956 && pestpp-glm wrftest.v2.pst /h :10497'
1-3 bash -c 'cd ${WORKER_DIR}${SLURM_PROCID} && pestpp-glm wrftest.v2.pst /h nid00413:10497'
(3) wrftest.v2.pst calls a bash script which ultimately calls the model:
printf "Running WRF-H \n"
srun -n 20 ./wrf_hydro_NoahMP.exe
wait
printf "Finished WRF-H Run.\n\n"
Simply calling srun -n 20 ./wrf_hydro.exe from the command line works as expected, so I'm wondering whether Slurm isn't recognizing the final srun command, which results in the "step creation temporarily disabled" error?

How do I create a new directory for a Slurm job prior to setting the working directory?

I want to create a unique directory for each Slurm job I run. However, mkdir appears to interrupt SBATCH commands. E.g. when I try:
#!/bin/bash
#SBATCH blah blah other Slurm commands
mkdir /path/to/my_dir_$SLURM_JOB_ID
#SBATCH --chdir=/path/to/my_dir_$SLURM_JOB_ID
touch test.txt
...the Slurm execution faithfully creates the directory at /path/to/my_dir_$SLURM_JOB_ID, but skips over the --chdir command and executes the sbatch script from the working directory the batch was called from.
Is there a way to create a unique directory for the output of a job and set the working directory there within a single sbatch script?
First off, the #SBATCH options must be at the top of the file; citing the documentation, they must come
before any executable commands
So it is expected behaviour that --chdir is not honoured in this case. The rationale is that the #SBATCH options, and --chdir in particular, are used by Slurm to set up the environment in which the job starts. That environment must be decided before the job starts and cannot be modified afterwards by Slurm.
For similar reasons, environment variables are not processed in #SBATCH options; they are simply ignored by Bash because they sit on a commented line, and Slurm makes no effort to expand them itself.
Also note that --chdir is used to
Set the working directory of the batch script to directory before it is executed.
and that directory must exist. Slurm will not create it for you.
What you need to do is call the cd command in your script.
#!/bin/bash
#SBATCH blah blah other Slurm commands
WORKDIR=/path/to/my_dir_$SLURM_JOB_ID
mkdir -p "$WORKDIR" && cd "$WORKDIR" || exit -1
touch test.txt
Note the exit -1 so that if the directory creation fails, your job stops rather than continuing in the submission directory.
As a side note, it is often useful to add a set -euo pipefail line near the top of your script. It makes sure the script stops if any command fails, if an unset variable is referenced, or if any part of a pipeline fails.
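Putting both suggestions together, a minimal sketch of the submission script might look like this (the #SBATCH line is a placeholder for your own options):

#!/bin/bash
#SBATCH blah blah other Slurm commands
set -euo pipefail                      # stop on errors, unset variables and failed pipes

WORKDIR=/path/to/my_dir_$SLURM_JOB_ID
mkdir -p "$WORKDIR"                    # create the per-job directory...
cd "$WORKDIR"                          # ...and move into it
touch test.txt                         # output now lands in $WORKDIR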

How to adjust bash file to execute on a single node

I would like to know whether it is possible (and if so, how) to adjust the bash file below.
I have a principal Matlab script main.m, which in turn calls another Matlab script f.m.
f.m should be executed many times with different inputs.
I structure this as an array job.
I typically use the following bash file, called td.sh, to run the array job on my university's HPC cluster:
#$ -S /bin/bash
#$ -l h_vmem=5G
#$ -l tmem=5G
#$ -l h_rt=480:0:0
#$ -cwd
#$ -j y
#Run 237 tasks where each task has a different $SGE_TASK_ID ranging from 1 to 237
#$ -t 1-237
#$ -N mod
date
hostname
#Output the Task ID
echo "Task ID is $SGE_TASK_ID"
/share/[...]/matlab -nodisplay -nodesktop -nojvm -nosplash -r "main; ID = $SGE_TASK_ID; f; exit"
What I do in the terminal is (1) cd to the folder where the scripts main.m, f.m and td.sh are located, and (2) type qsub td.sh.
Question: I need to change the bash file above because the script f.m calls a solver (Gurobi) whose license is single node single user. This is what I have been told:
" This license has been installed already and works only on node A.
You will not be able to qsub your scripts as the jobs have to run on this node.
Instead you should ssh into node A and run the job on this node directly instead
of submitting to the scheduler. "
Could you guide me through how I should change the bash file above? In particular, how should I force the execution onto node A?
Even though I am restricted to a single node, am I still able to parallelise using array jobs? Or are array jobs by definition executed on multiple nodes?
If you cannot use your scheduler, then you cannot use its array jobs. You will have to find another way to parallelize those jobs. Array jobs are not executed on multiple nodes by definition (but they are usually executed on multiple nodes due to resource availability).
Regarding the adaptation of your script, just follow the guidelines provided by your sysadmins: forget about SGE and start your computations directly on the node you have been told to use, via ssh:
date
hostname
for TASK_ID in {1..237}
do
    # Output the Task ID
    echo "Task ID is $TASK_ID"
    ssh user@A "/share/[...]/matlab -nodisplay -nodesktop -nojvm -nosplash -r \"main; ID = $TASK_ID; f; exit\""
done
If the license is single-node and single-user (but allows multiple simultaneous executions), you can try to parallelize the computations. You will have to take into account the resources available on node A (number of CPUs, memory, ...) and the resources each run needs, and then start as many runs simultaneously as possible without overloading the node (otherwise they will take longer or even fail).
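For instance, a minimal sketch of such a throttled run (assuming, purely as an example, that node A can comfortably host 8 concurrent Matlab sessions; adjust -P to the node's actual resources):

# run at most 8 tasks at a time; each ssh blocks until its Matlab run ends
seq 1 237 | xargs -P 8 -I{} \
    ssh user@A "/share/[...]/matlab -nodisplay -nodesktop -nojvm -nosplash -r \"main; ID = {}; f; exit\""

xargs starts the next task as soon as one of the 8 running sessions finishes, so the node stays busy without being oversubscribed.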

LSF Job Array in Make File

I am running an LSF job array to create a target in a makefile.
However, as soon as the array is submitted, make considers the command for the target to have been executed and throws an error because the target does not exist.
How can I force make to wait until the completion of the LSF job array before moving onto other dependent targets?
Example:
all: final.txt

first_%.txt:
    bsub -J "jarray[1-100]" < script.sh

final.txt: first_%.txt
    cat first_1.txt first_50.txt first_100.txt > final.txt
Try bsub -K, which should force bsub to stay in the foreground until the job completes.
Unfortunately the -K flag isn't supported for job arrays.
Edit
Since the option isn't supported on arrays, I think you'll have to submit your array as separate jobs, something like:
for i in `seq 1 100`; do
    export INDEX=$i
    bsub -K < script.sh &
done
wait
You'll have to pass the index to your script manually instead of using the job array index.
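A sketch of what that looks like inside script.sh (the command is a placeholder; the point is to read the exported $INDEX wherever the array version would read $LSB_JOBINDEX):

# script.sh: $INDEX is exported by the submission loop above
my_command --task "$INDEX" > "first_${INDEX}.txt"   # my_command is hypothetical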
You need to ask the bsub command to wait for the job to complete. I have never used it, but according to the man page you can add the -K option to do this.

QSUB a process for every file in a directory?

I've been using
qsub -t 1-90000 do_stuff.sh
to submit my tasks on a Sun GridEngine cluster, but now find myself with data sets (super large ones, too) which are not so conveniently named. What's the best way to go about this? I could try to rename them all, but the names contain information which needs to be preserved, and this obviously introduces a host of problems. I could just preprocess everything into jsons, but if there's a way to just qsub -all_contents_of_directory, that would be ideal.
Am I SOL? Should I just go to the directory in question and run find . -exec qsub setupscript.sh {} \; ?
Use another script to submit the job. Here's an example I used where I wanted the directory name in the job name; "run_openfoam" is the PBS script in the particular directory.
#!/bin/bash
cd $1
qsub -N $1 run_openfoam
You can adapt this script to suit your job and then run it through a loop on the command line. So rather than submitting a job array, you submit a job for each dir name passed as the first parameter to this script.
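For example, if the wrapper above is saved as submit_dir.sh (the name is just for illustration), the command-line loop could be:

for d in */ ; do
    ./submit_dir.sh "${d%/}"    # strip the trailing slash before passing the dir name
done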
I tend to use Makefiles to automate this stuff:
INPUTFILES=$(wildcard *.in)
OUTPUTFILES=$(patsubst %.in,%.out,$(INPUTFILES))

all : $(OUTPUTFILES)

%.out : %.in
    @echo "mycommand here < $< > $@" | qsub
Then type 'make', and all files will be submitted to qsub. Of course, this will submit everything all at once, which may do unfortunate things to your compute cluster and your sysadmin's blood pressure.
If you remove the "| qsub", the output of make is a list of commands to run. Feed that list into one or more qsub commands and you'll gain efficiency and cut down the number of qsub jobs. I've been using GNU parallel for that, but it needs a qsub that blocks until the job is done. I wrote a wrapper that does that, but it calls qstat a lot, which puts a lot of load on the system. I should modify it somehow, but there aren't a lot of computationally 'good' options here.
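A hedged sketch of that batching idea (chunk size and file names are arbitrary): drop the "| qsub" from the recipe, let make print the command list, split it into chunks, and submit each chunk as a single job:

make | split -l 50 - chunk_     # 50 commands per chunk file
for c in chunk_*; do
    qsub "$c"                   # each chunk runs as one qsub job
done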
I don't understand the "-t 1-90000" in your qsub command; my search of the qsub manual doesn't turn up such a "-t" option.
Create a file with a list of the datasets in it
find . -print >~/list_of_datasets
Job script (job_script):
#!/bin/bash
exec ~/setupscript.sh "$(sed -n -e "${SGE_TASK_ID}p" < ~/list_of_datasets)"
Submit it with:
qsub -t 1-$(wc -l < ~/list_of_datasets) job_script
