Best practice for job resource scaling (environment) on cluster computing? - bash

I am quite new to programming on a cluster and am having great difficulty finding my way around. I am on an SGE cluster, writing bash scripts, and using OpenMPI.
I have a task where I want to run several variations of my process; the only difference between them is the configuration, i.e. how many resources I allocate to the program. Take this example:
#$ -pe openmpi $process_num
Here I am allocating process_num processes to my job's parallel environment. I want that to vary: for example, I want to try 1, 2, and 3 for process_num; in other words, I have 3 variations. I was thinking of submitting a single shell job containing a simple loop such as:
# ... other environment variable definitions
for process_num in 1 2 3
do
# ... some other environment variable definitions
#$ -pe openmpi $process_num
mpirun ./my_prog -npernode 1
done
In other words, one 'packed' job would execute all my variations and take care of the resource allocation/scaling. I thought that this way I could allocate different resources to each of my 3 variations on each iteration. Is this possible, i.e. can the job environment scale in the way described, or will I have to submit 3 separate jobs?
Of course, if the answer is that I must submit separate jobs, what happens when I have some 50 such configurations I want to try? What is the best-practice approach to submitting 50 (or a large number of) separate jobs?
Unfortunately as the cluster is a shared resource, I am not free to experiment as I would like to.

A job is 'defined' by the resources it uses. If you want to test three resource configurations, you need to submit three jobs.
The other option would be to allocate the maximal config and run the three jobs sequentially. This is what the script in the question suggests. But you would be wasting cluster resources by allocating but not using CPUs.
The best practice is to use all resources you allocate to the fullest possible extent.
It's easy to submit multiple jobs via a script on the front-end node. I believe SGE uses qsub, so it would be something like parallel "qsub -pe openmpi {} -v CPUS={} -l n_cpus={} test-job.sh" ::: 1 2 3. The exact syntax of the qsub call depends a lot on your environment. In test-job.sh you would use $CPUS to start your MPI job correctly (this may not even be needed; a correctly initialized SGE parallel environment via -pe might be enough). I'm using parallel instead of a bash loop only because of its nicer, more compact syntax; it makes no functional difference.
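For illustration, a minimal sketch of the two pieces (the parallel environment name "openmpi", the CPUS variable, and test-job.sh are placeholders that depend on your site's SGE setup):

# submit_all.sh -- run on the front-end node, one qsub per configuration
for cpus in 1 2 3; do
    qsub -pe openmpi "$cpus" -v CPUS="$cpus" test-job.sh
done

# test-job.sh -- the job script; SGE sets $NSLOTS to the number of slots granted by -pe
#$ -S /bin/bash
#$ -cwd
mpirun -np "$NSLOTS" ./my_prog -npernode 1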

Related

SLURM `srun` vs `sbatch` and their parameters

I am trying to understand what the difference is between SLURM's srun and sbatch commands. I will be happy with a general explanation, rather than specific answers to the following questions, but here are some specific points of confusion that can be a starting point and give an idea of what I'm looking for.
According to the documentation, srun is for submitting jobs, and sbatch is for submitting jobs for later execution, but the practical difference is unclear to me, and their behavior seems to be the same. For example, I have a cluster with 2 nodes, each with 2 CPUs. If I execute srun testjob.sh & five times in a row, it will nicely queue up the fifth job until a CPU becomes available, as will executing sbatch testjob.sh.
To make the question more concrete, I think a good place to start might be: What are some things that I can do with one that I cannot do with the other, and why?
Many of the arguments to both commands are the same. The ones that seem the most relevant are --ntasks, --nodes, --cpus-per-task, --ntasks-per-node. How are these related to each other, and how do they differ for srun vs sbatch?
One particular difference is that srun will cause an error if testjob.sh does not have executable permission (i.e. chmod +x testjob.sh), whereas sbatch will happily run it. What is happening "under the hood" that causes this to be the case?
The documentation also mentions that srun is commonly used inside sbatch scripts. This leads to the question: how do they interact with each other, and what is the "canonical" use case for each of them? Specifically, would I ever use srun by itself?
The documentation says
srun is used to submit a job for execution in real time
while
sbatch is used to submit a job script for later execution.
They both accept practically the same set of parameters. The main difference is that srun is interactive and blocking (you get the result in your terminal and you cannot write other commands until it is finished), while sbatch is batch processing and non-blocking (results are written to a file and you can submit other commands right away).
If you use srun in the background with the & sign, you remove the 'blocking' feature of srun, which becomes interactive but non-blocking. It is still interactive though, meaning that the output will clutter your terminal and the srun processes remain linked to your terminal. If you disconnect, you will lose control over them, or they might be killed (depending basically on whether they use stdout or not). And they will be killed if the machine to which you connect to submit jobs is rebooted.
If you use sbatch, you submit your job and it is handled by Slurm; you can disconnect, kill your terminal, etc. with no consequence. Your job is no longer linked to a running process.
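As a small illustration of the difference (hostname stands in for any quick command):

srun --ntasks=1 hostname                 # blocks; output appears in your terminal
sbatch --ntasks=1 --wrap="hostname"      # returns at once; output goes to slurm-<jobid>.out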
What are some things that I can do with one that I cannot do with the other, and why?
A feature that is available to sbatch and not to srun is job arrays. As srun can be used within an sbatch script, there is nothing that you cannot do with sbatch.
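For example, a job array covering the 50-configuration case mentioned earlier can be submitted in one line (job.sh is a placeholder script that would read $SLURM_ARRAY_TASK_ID to pick its configuration):

sbatch --array=0-49 job.sh    # 50 array tasks, each sees SLURM_ARRAY_TASK_ID = 0..49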
How are these related to each other, and how do they differ for srun vs sbatch?
All the parameters --ntasks, --nodes, --cpus-per-task, --ntasks-per-node have the same meaning in both commands. That is true for nearly all parameters, with the notable exception of --exclusive.
What is happening "under the hood" that causes this to be the case?
srun immediately executes the script on the remote host, while sbatch copies the script into internal storage and then uploads it to the compute node when the job starts. You can check this by modifying your submission script after it has been submitted; changes will not be taken into account (see this).
How do they interact with each other, and what is the "canonical" use-case for each of them?
You typically use sbatch to submit a job, and srun within the submission script to create what Slurm calls job steps. srun is used to launch the processes. If your program is a parallel MPI program, srun takes care of creating all the MPI processes. If not, srun will run your program as many times as specified by the --ntasks option. There are many use cases depending on whether your program is parallel or not, has a long running time or not, is composed of a single executable or not, etc. Unless otherwise specified, srun inherits by default the pertinent options of the sbatch or salloc which it runs under (from here).
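A hedged sketch of that canonical pattern (my_prog is a placeholder and the resource numbers are arbitrary):

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00

# Each srun below is a job step; it inherits --ntasks=4 from the allocation.
srun hostname        # step 1: quick sanity check on the allocated nodes
srun ./my_prog       # step 2: launch the (MPI) program across all 4 tasks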
Specifically, would I ever use srun by itself?
Other than for small tests, no. A common use is srun --pty bash to get a shell on a compute node.
This doesn't actually fully answer the question, but here is some more information I found that may be helpful for someone in the future:
From a related thread I found with a similar question:
In a nutshell, sbatch and salloc allocate resources to the job, while srun launches parallel tasks across those resources. When invoked within a job allocation, srun will launch parallel tasks across some or all of the allocated resources. In that case, srun inherits by default the pertinent options of the sbatch or salloc which it runs under. You can then (usually) provide srun different options which will override what it receives by default. Each invocation of srun within a job is known as a job step.
srun can also be invoked outside of a job allocation. In that case, srun requests resources, and when those resources are granted, launches tasks across those resources as a single job and job step.
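A small illustration of the two modes (resource numbers are arbitrary):

# Outside an allocation: srun creates the allocation, runs one step, then releases it
srun --ntasks=2 ./my_prog

# Inside an allocation: salloc (or an sbatch script) holds the resources,
# and each srun launches a job step within them
salloc --ntasks=2
srun ./step_one
srun ./step_two
exit    # release the allocation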
There's a relatively new web page which goes into more detail regarding the -B and --exclusive options.
doc/html/cpu_management.shtml
Additional information from the SLURM FAQ page.
The srun command has two different modes of operation. First, if not run within an existing job (i.e. not within a Slurm job allocation created by salloc or sbatch), then it will create a job allocation and spawn an application. If run within an existing allocation, the srun command only spawns the application. For this question, we will only address the first mode of operation and compare creating a job allocation using the sbatch and srun commands.
The srun command is designed for interactive use, with someone monitoring the output. The output of the application is seen as output of the srun command, typically at the user's terminal. The sbatch command is designed to submit a script for later execution and its output is written to a file. Command options used in the job allocation are almost identical. The most noticeable difference in options is that the sbatch command supports the concept of job arrays, while srun does not. Another significant difference is in fault tolerance. Failures involving sbatch jobs typically result in the job being requeued and executed again, while failures involving srun typically result in an error message being generated with the expectation that the user will respond in an appropriate fashion.
Another relevant conversation here

Reading file in parallel from multiple processes

I'm running multiple processes in parallel and each of these processes read the same file in parallel. It looks like some of the processes see a corrupted version of the file if I increase the number of processes to > 15 or so. What is the recommended way of handling such a scenario?
More details:
The file being read in parallel is actually a perl script. The multiple jobs are python processes, and each of them launches this perl script independently with different input parameters. When the number of jobs is increased, some of these jobs give errors that the perl script has invalid syntax (which is not true). Hence, I suspect that some of these jobs read in corrupted versions of the perl script.
I'm running all of this on a 32-core machine.
If any process is also writing to the file, then you need to enforce some synchronization, for example with a global named mutex.
If there is no asynchronous writing going on, I would not expect to see corruption during the reads. Are you opening the files with "r" access? If you're still encountering trouble, it might be worth experimenting with reducing the read buffer size, or calling out to a native win32 API for the file access.
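If some process does turn out to be writing, one common way to get the synchronization mentioned above on Linux is an advisory lock via flock(1). A minimal sketch, assuming all jobs are launched through a shell wrapper and can see the same lock file:

# run_script.sh -- wrap every invocation of the perl script in a shared lock;
# any writer must take an exclusive lock (flock without --shared) on the same file
(
    flock --shared 9
    perl /path/to/script.pl "$@"
) 9>/tmp/script.pl.lock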
Good luck!

Parallel processing in condor

I have a java program that will process 800 images.
I decided to use Condor as a platform for distributed computing, the aim being that I can divide those images among the available nodes -> get them processed -> combine the results back.
Say I have 4 nodes. I want to divide the processing so that each node handles 200 images, and then combine the end results.
I have tried executing it normally by submitting it as a java program and stating requirements = Machine == .. (listing all nodes). But it doesn't seem to work.
How can I divide the processing and execute it in parallel?
HTCondor can definitely help you but you might need to do a little bit of work yourself :-)
There are two possible approaches that come to mind: job arrays and DAG applications.
Job arrays: as you can see from example 5 on the HTCondor Quick Start Guide, you can use the queue command to submit more than 1 job. For instance, queue 800 at the bottom of your job file would submit 800 jobs to your HTCondor pool.
What people do in this case is organize the data to process using a filename convention and exploit that convention in the job file. For instance you could rename your images as img_0.jpg, img_1.jpg, ... img_799.jpg (possibly using symlinks rather than renaming the actual files) and then use a job file along these lines:
Executable = /path/to/my/script
Arguments = /path/to/data/dir/img_$(Process)
Queue 800
When the 800 jobs run, $(Process) gets automatically assigned the value of the corresponding process ID (i.e. an integer going from 0 to 799), which means that your code will pick up the correct image to process.
DAG: Another approach is to organize your processing in a simple DAG. In this case you could have a pre-processing script (SCRIPT PRE entry in your DAG file) organizing your input data (possibly creating symlinks named appropriately). The real job would be just like the example above.
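A hedged sketch of such a DAG file (prepare_input.sh and process.sub are hypothetical names; process.sub would be a submit file like the one above):

# images.dag -- run a preparation script on the submit node, then the 800-job cluster
JOB process process.sub
SCRIPT PRE process prepare_input.sh

You would then submit it with condor_submit_dag images.dag.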

Can I program to choose a free CPU (I have multiple) to run my shell script?

I have a shell script. In that script I am starting 6 new processes. My system has 4 CPUs.
If I run the shell script, the new processes spawned are automatically allocated to one of the CPUs by the operating system. Now, I want to reduce the total running time of my script. Is there a way I can check each processor's utilization and then choose one to run my process on?
I do not want to run a process on a CPU which is >75% utilized. I would wait instead and run on a CPU which is <75% utilized.
I need to program my script in such a way that it should check the 4 CPUs' utilization and then run the process on the chosen CPU.
Can someone please help me with an example?
I recommend GNU Parallel:
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. A job can also be a command that reads from a pipe. GNU parallel can then split the input and pipe it into commands in parallel.
In addition, use nice.
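A minimal sketch combining the two (process_image and the input file names are placeholders):

# Run at most 4 jobs at once (one per CPU), each niced so interactive work is not starved
parallel -j 4 nice -n 10 ./process_image {} ::: input_*.dat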
You can tell the scheduler that a certain CPU should be used, using the taskset command:
taskset -c 1 process
will tell the scheduler that process should run on CPU1.
However, I think in most cases the built-in Linux scheduler should work well.

Creating a shell script which can spawn multiple concurrent processes which call a specified web service

I am trying to create a load-testing shell script. Essentially, what I am looking to do is have the script spawn some number N of concurrent processes, each of which calls a specified URL and performs a few basic actions. I am having a hard time figuring this out - any help would be awesome!!
If you really need to use shell, take a look at Bash: parallel processes. But there are load testing tools like ab (Apache HTTP server benchmarking) that can do the job for you.
You can use ab as simply as:
ab -n 10 -c 2 -A myuser:mypassword http://localhost:8080/
For more examples, look at Howto: Performance Benchmarks a Webserver.
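If you do want a plain-shell version, a minimal sketch might look like this (the URL and the counts are placeholders):

#!/bin/bash
URL="http://localhost:8080/"
WORKERS=10               # number of concurrent processes
REQUESTS_PER_WORKER=20   # requests each process will issue

for i in $(seq 1 "$WORKERS"); do
    (
        for j in $(seq 1 "$REQUESTS_PER_WORKER"); do
            # -s silences progress, -o discards the body, -w prints the status code
            curl -s -o /dev/null -w "%{http_code}\n" "$URL"
        done
    ) &
done
wait    # block until every worker has finished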
Have a look at this article:
http://prll.sourceforge.net/shell_parallel.html
as described:
"Parallel batch processing in the shell
How to process a large batch job using several concurrent processes in bash or zsh
This article describes three methods of parallel execution: the first is not very performant and the other two are not safe to use. A more complete solution is called prll and can be found here. This article is not meant to be good advice, but instead laments the state of scriptable job control in shells."
