How to distribute executions over a cluster

I am doing research, and I often need to execute the same program with different inputs (each combination of inputs repeatedly) and store the results, for aggregation.
I would like to speed up the process by executing these experiments in parallel, over multiple machines. However, I would like to avoid the hassle of launching them manually. Furthermore, I would like to keep my program implemented as a single thread and only add parallelization on top of it.
I work with Ubuntu machines, all reachable in the local network.
I know GNU Parallel can solve this problem, but I am not familiar with it. Can someone help me to setup an environment for my experiments?

Please note that this answer has been adapted from one of my scripts and is untested. If you find bugs, you are welcome to edit the answer.
First of all, to make the process completely batch, we need a non-interactive SSH login (that's what GNU Parallel uses to launch commands remotely).
To do this, first generate a pair of RSA keys (if you don't already have one) with:
ssh-keygen -t rsa
which will generate a pair of private and public keys, stored by default in ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub. It is important to use these locations, as openssh will look for them there. While the openssh commands let you specify the private key file (by passing -i PRIVATE_KEY_FILE_PATH), GNU Parallel does not have such an option.
Next, we need to copy the public key to all the remote machines we are going to use. For each machine in your cluster (I will call them "workers"), run this command on your local machine:
ssh-copy-id -i ~/.ssh/id_rsa.pub WORKER_USER@WORKER_HOST
This step is interactive, as you will need to log in to each of the workers with a user id and password.
From this moment on, logging in from your client to each of the workers is non-interactive. Next, let's set up a bash variable with a comma-separated list of your workers. We will use GNU Parallel's special syntax, which lets you indicate how many CPUs to use on each worker:
WORKERS_PARALLEL="2/user1@192.168.0.10,user2@192.168.0.20,4/user3@10.0.111.69"
Here, I specified that on 192.168.0.10 I want only 2 parallel processes, while on 10.0.111.69 I want four. For 192.168.0.20, since I did not specify any number, GNU Parallel will figure out how many CPUs (CPU cores, actually) the remote machine has and run that many parallel processes.
Since I will also need the same list in a format that openssh can understand, I will create a second variable without the CPU information and with spaces instead of commas. I do this automatically with:
WORKERS=`echo $WORKERS_PARALLEL | sed 's/[0-9]*\///g' | sed 's/,/ /g'`
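A quick local sanity check of that transformation (using the placeholder worker values from above):

```shell
# Hypothetical worker list in GNU Parallel's NCPUs/user@host format
WORKERS_PARALLEL="2/user1@192.168.0.10,user2@192.168.0.20,4/user3@10.0.111.69"
# Strip the "N/" CPU prefixes, then turn commas into spaces
WORKERS=$(echo "$WORKERS_PARALLEL" | sed 's/[0-9]*\///g' | sed 's/,/ /g')
echo "$WORKERS"   # user1@192.168.0.10 user2@192.168.0.20 user3@10.0.111.69
```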
Now it's time to set up my code. I assume that each of the workers is already configured to run my code, so I just need to copy it over. On the workers I usually work in the /tmp folder, so what follows assumes that. The code will be copied through an SSH tunnel and extracted remotely:
WORKING_DIR=/tmp/myexperiments
TAR_PATH=/tmp/code.tar.gz
# Clean from previous executions
parallel --nonall -S $WORKERS rm -rf $WORKING_DIR $TAR_PATH
# Copy the tar.gz file to the workers
parallel scp LOCAL_TAR_PATH {}:/tmp ::: `echo $WORKERS`
# Create the working directory on the worker
parallel --nonall -S $WORKERS mkdir -p $WORKING_DIR
# Extract the tar file in the working directory
parallel --nonall -S $WORKERS tar --warning=no-timestamp -xzf $TAR_PATH -C $WORKING_DIR
Notice that multiple executions on the same machine will use the same working directory. I assume only one version of the code will be run at a specific time; if this is not the case you will need to modify the commands to use different working directories.
I use the --warning=no-timestamp option to avoid annoying warnings that could be issued if the clock of your machine is ahead of that of your workers.
We now need to create directories in the local machine for storing the results of the runs, one for each group of experiments (that is, multiple executions with the same parameters). Here, I am using two dummy parameters alpha and beta:
GROUP_DIRS="results/alpha=1,beta=1 results/alpha=0.5,beta=1 results/alpha=0.2,beta=0.5"
N_GROUPS=3
parallel --header : mkdir -p {DIR} ::: DIR $GROUP_DIRS
Notice that using parallel here is not necessary: a loop would have worked, but I find this more readable. I also stored the number of groups, which we will use in the next step.
A final preparation step consists in creating a list of all the combinations of parameters that will be used in the experiments, each repeated as many times as necessary. Each repetition is coupled with an incremental number for identifying different runs.
ALPHAS="1.0 0.5 0.2"
BETAS="1.0 1.0 0.5"
REPETITIONS=1000
PARAMS_FILE=/tmp/params.txt
# Create header
echo REP GROUP_DIR ALPHA BETA > $PARAMS_FILE
# Populate
parallel \
--header : \
--xapply \
if [ ! -e {GROUP_DIR}"/exp"{REP}".dat" ]';' then echo {REP} {GROUP_DIR} {ALPHA} {BETA} '>>' $PARAMS_FILE ';' fi \
::: REP $(for i in `seq $REPETITIONS`; do printf $i" %.0s" $(seq $N_GROUPS) ; done) \
::: GROUP_DIR $GROUP_DIRS \
::: ALPHA $ALPHAS \
::: BETA $BETAS
In this step I also implemented a control: if a .dat file already exists, I skip that set of parameters. This is something that comes out of practice: I often interrupt the execution of GNU Parallel and later decide to resume it by re-executing these commands. With this simple control I avoid running more experiments than necessary.
Now we can finally run the experiments. The algorithm in this example generates a file as specified in the parameter --save-data which I want to retrieve. I also want to save the stdout and stderr in a file, for debugging purposes.
cat $PARAMS_FILE | parallel \
--sshlogin $WORKERS_PARALLEL \
--workdir $WORKING_DIR \
--return {GROUP_DIR}"/exp"{REP}".dat" \
--return {GROUP_DIR}"/exp"{REP}".txt" \
--cleanup \
--xapply \
--header 1 \
--colsep " " \
mkdir -p {GROUP_DIR} ';' \
./myExperiment \
--random-seed {REP} \
--alpha {ALPHA} \
--beta {BETA} \
--save-data {GROUP_DIR}"/exp"{REP}".dat" \
'&>' {GROUP_DIR}"/exp"{REP}".txt"
A little bit of explanation about the parameters. --sshlogin, which can be abbreviated as -S, passes the list of workers that Parallel will use to distribute the computational load. --workdir sets the working directory of Parallel, which defaults to ~. The --return directives copy the specified files back after the execution is completed. --cleanup removes the transferred and returned files from the workers afterwards. --xapply tells Parallel to interpret the parameters as tuples (rather than as sets to multiply by Cartesian product). --header 1 tells Parallel that the first line of the parameters file is to be interpreted as a header (whose entries will be used as names for the columns). --colsep tells Parallel that columns in the parameters file are space-separated.
WARNING: Ubuntu's version of parallel is outdated (2013). In particular, there is a bug that prevents the above code from running properly, which was fixed only a few days ago. To get the latest monthly snapshot, run (no root privileges needed):
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
Notice that the fix to the bug I mentioned above will only be included in the next snapshot, on September 22nd, 2015. If you are in a hurry, you can instead install the latest development ("smoking hot") version manually from the git repository.
Finally, it is a good habit to clean our working environments:
rm $PARAMS_FILE
parallel --nonall -S $WORKERS rm -rf $WORKING_DIR $TAR_PATH
If you use this for research and publish a paper, remember to cite the original work by Ole Tange (see parallel --bibtex).

Related

How to adjust bash file to execute on a single node

I would like your help to know whether it is possible (and if yes how) to adjust the bash file below.
I have a principal Matlab script main.m, which in turn calls another Matlab script f.m.
f.m should be executed many times with different inputs.
I structure this as an array job.
I typically use the following bash file called td.sh to execute the array job into the HPC of my university
#$ -S /bin/bash
#$ -l h_vmem=5G
#$ -l tmem=5G
#$ -l h_rt=480:0:0
#$ -cwd
#$ -j y
#Run 237 tasks where each task has a different $SGE_TASK_ID ranging from 1 to 237
#$ -t 1-237
#$ -N mod
date
hostname
#Output the Task ID
echo "Task ID is $SGE_TASK_ID"
/share/[...]/matlab -nodisplay -nodesktop -nojvm -nosplash -r "main; ID = $SGE_TASK_ID; f; exit"
What I do in the terminal is
cd to the folder where the scripts main.m, f.m, td.sh are located
type in the terminal qsub td.sh
Question: I need to change the bash file above because the script f.m calls a solver (Gurobi) whose license is single node single user. This is what I have been told:
" This license has been installed already and works only on node A.
You will not be able to qsub your scripts as the jobs have to run on this node.
Instead you should ssh into node A and run the job on this node directly instead
of submitting to the scheduler. "
Could you guide me through understanding how I should change the bash file above? In particular, how should I force the execution onto node A?
Even though I am restricted to one node only, am I still able to parallelise using array jobs? Or are array jobs by definition executed on multiple nodes?
If you cannot use your scheduler, then you cannot use its array jobs. You will have to find another way to parallelize those jobs. Array jobs are not executed on multiple nodes by definition (but they are usually executed on multiple nodes due to resource availability).
Regarding the adaptation of your script, just follow the guidelines provided by your sysadmins: forget about SGE and start your computations directly on the node you have been told, through ssh:
date
hostname
for TASK_ID in {1..237}
do
#Output the Task ID
echo "Task ID is $TASK_ID"
ssh user@A "/share/[...]/matlab -nodisplay -nodesktop -nojvm -nosplash -r \"main; ID = $TASK_ID; f; exit\""
done
If the license is single node and single user (but allows multiple simultaneous executions), you can try to parallelize the computations. You will have to take into account the resources available on node A (number of CPUs, memory...) and the resources needed by every single execution, and then start simultaneously as many runs as possible without overloading the node (otherwise they will take longer or even fail).
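If you do go that route, here is a minimal sketch of the idea using xargs -P, with echo standing in for the actual ssh/matlab command so it can run anywhere; the task count and concurrency limit are just placeholders:

```shell
# Run up to 4 tasks concurrently; substitute the ssh command for 'echo'
seq 1 8 | xargs -P4 -I{} echo "Task ID is {}"
```

xargs -P caps the number of simultaneous processes, so the node is never asked to run more jobs than you decide it can handle.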

sending an ssh -t command to multiple systems simultaneously (without ansible)

Because of the nature of the script (done at work, on a work RHEL machine) I cannot show the code, but I can at least provide pseudocode to help with a starting point. Currently:
start loop
1) read in the first line of a host text file (then the next and such per
the loop) of a file and assign it to a variable (host name)
2) send ssh -t command to the host (which takes anywhere between 2 to 6
minutes to receive a response back)
3) log response to a text file (repeat loop with new host from read in
text file)
end loop
Currently I have to run this script over night because of how many systems this script hits.
I want to be able to achieve the same goal and get the response from the command in that file per host, but I want the command to be sent out at the same time so that it takes anywhere between 2 to 6 minutes all together.
But because this is for work, I am not allowed to install ansible on the system; would there be another way to achieve this goal? If so, please point me in the right direction.
With GNU Parallel:
parallel -j0 --slf hosts.txt --nonall mycommand > out.txt
But maybe you want a bit more info:
parallel -j0 --slf hosts.txt --joblog my.log --tag --nonall mycommand > out.txt
I did this using sh years ago using something like:
while true
do
if [ numberOfFileinSomeDir -lt N ]
then
(touch SomeDir/hostname; ssh hostname ... > someotherDir/hostname.txt ; rm SomeDir/hostname) &
...
But this stops working after ~100 hosts. It sucks - don't do it. If you have fewer than ~500 hosts, pssh may be the easiest - maybe you can install it in your home directory?
Google something like python parallel execute process multiple and someone's bound to have a script that will do what you need already.
More than ~500 hosts and you really need to start installing some tools as others have mentioned in the comments.

whether a shell script can be executed if another instance of the same script is already running

I have a shell script which usually takes nearly 10 minutes for a single run, but I need to know: if another request to run the script comes in while an instance of the script is already running, does the new request have to wait for the existing instance to complete, or will a new instance be started?
I need a new instance to be started whenever a request arrives for the same script.
How do I do it?
The shell script is a polling script which looks for a file in a directory and executes the file. The execution of the file takes nearly 10 minutes or more. But if a new file arrives during execution, it also has to be executed simultaneously.
The shell script is below; how do I modify it to execute multiple requests?
#!/bin/bash
while [ 1 ]; do
newfiles=`find /afs/rch/usr8/fsptools/WWW/cgi-bin/upload/ -newer /afs/rch/usr$
touch /afs/rch/usr8/fsptools/WWW/cgi-bin/upload/.my_marker
if [ -n "$newfiles" ]; then
echo "found files $newfiles"
name2=`ls /afs/rch/usr8/fsptools/WWW/cgi-bin/upload/ -Art |tail -n 2 |head $
echo " $name2 "
mkdir -p -m 0755 /afs/rch/usr8/fsptools/WWW/dumpspace/$name2
name1="/afs/rch/usr8/fsptools/WWW/dumpspace/fipsdumputils/fipsdumputil -e -$
$name1
touch /afs/rch/usr8/fsptools/WWW/dumpspace/tempfiles/$name2
fi
sleep 5
done
When writing scripts like the one you describe, I take one of two approaches.
First, you can use a pid file to indicate that a second copy should not run. For example:
#!/bin/sh
pidfile=/var/run/${0##*/}.pid
# remove pid if we exit normally or are terminated
trap "rm -f $pidfile" 0 1 3 15
# Write the pid as a symlink
if ! ln -s "pid=$$" "$pidfile"; then
echo "Already running. Exiting." >&2
exit 0
fi
# Do your stuff
I like using symlinks to store pid because writing a symlink is an atomic operation; two processes can't conflict with each other. You don't even need to check for the existence of the pid symlink, because a failure of ln clearly indicates that a pid cannot be set. That's either a permission or path problem, or it's due to the symlink already being there.
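A quick local demonstration of that atomicity (the paths here are throwaway temp files, not part of the original script): only the first ln succeeds, and the second fails without any race window.

```shell
tmpdir=$(mktemp -d)
pidfile="$tmpdir/demo.pid"
# First attempt creates the symlink: the "lock" is acquired
ln -s "pid=$$" "$pidfile" && echo "acquired"
# Second attempt fails atomically because the symlink already exists
ln -s "pid=$$" "$pidfile" 2>/dev/null || echo "already running"
rm -rf "$tmpdir"
```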
Second option is to make it possible .. nay, preferable .. not to block additional instances, and instead configure whatever it is that this script does to permit multiple servers to run at the same time on different queue entries. "Single-queue-single-server" is never as good as "single-queue-multi-server". Since you haven't included code in your question, I have no way to know whether this approach would be useful for you, but here's some explanatory meta bash:
#!/usr/bin/env bash
workdir=/var/tmp # Set a better $workdir than this.
a=( $(get_list_of_queue_ids) ) # A command? A function? Up to you.
for qid in "${a[@]}"; do
# Set a "lock" for this item .. or don't, and move on.
if ! ln -s "pid=$$" $workdir/$qid.working; then
continue
fi
# Do your stuff with just this $qid.
...
# And finally, clean up after ourselves
remove_qid_from_queue $qid
rm $workdir/$qid.working
done
The effect of this is to transfer the idea of "one at a time" from the handler to the data. If you have a multi-CPU system, you probably have enough capacity to handle multiple queue entries at the same time.
ghoti's answer shows some helpful techniques, if modifying the script is an option.
Generally speaking, for an existing script:
Unless you know with certainty that:
the script has no side effects other than to output to the terminal or to write to files with shell-instance specific names (such as incorporating $$, the current shell's PID, into filenames) or some other instance-specific location,
OR that the script was explicitly designed for parallel execution,
I would assume that you cannot safely run multiple copies of the script simultaneously.
It is not reasonable to expect the average shell script to be designed for concurrent use.
From the viewpoint of the operating system, several processes may of course execute the same program in parallel. No need to worry about this.
However, it is conceivable that a (careless) programmer wrote the program in such a way that it produces incorrect results when two copies are executed in parallel.

log of parallel computations, how do I prevent interleaved write? lockfile or flock?

I have seen it discussed several times how to keep scripts from running concurrently, but I have not seen the topic of concurrent writes.
I am doing some parallel computation with xargs launching the commands for the actual computations. At the end of each computation I want that process to access a file and put its results in there. I am getting into trouble because the writes to the log file happen in such a way that several processes can access the log file at the same time, resulting in interleaved entries: one line from one run, another line from another run that finished at about the same time (which is likely to happen due to the parallel nature of the run with xargs).
So in practice, let's say that using xargs I run in parallel several instances of a script that reads:
#!/bin/bash
#### do something that takes some time
#### define content of the log
folder="<folder>"$PWD"</folder>\n"
datetag="<enddate>"`date`"</enddate>\n"
#### store log in XML ####
echo -e "<myrun>\n""$folder""$datetag""</myrun>" >> $outputfile
At present I get output file with interleaved runs log like this
<myrun>
<myrun>
<folder>./generations/test/run1</folder>
<folder>./generations/test/run2</folder>
<enddate>Sun Jul 6 11:17:58 CEST 2014</enddate>
</myrun>
<enddate>Sun Jul 6 11:17:58 CEST 2014</enddate>
</myrun>
Is there a way to give "exclusive access" to one instance of the script at a time, so that each script is writing its log without interference with the others?
I have seen flock and lockfile, but I am not sure which fits my case best, and I am seeking advice/suggestions.
Thanks,
Roberto
I will use traceroute as an example, as it prints its output slowly, but any other command would also work. Compare:
(echo 8.8.8.8;echo 8.8.4.4) | xargs -P6 -n1 traceroute > traceroute.xarg
to:
(echo 8.8.8.8;echo 8.8.4.4) | parallel traceroute > traceroute.para
Make sure you install GNU Parallel and not another parallel, and that /etc/parallel/config is empty.
I think this, in the end, does the job: lockfile blocks until this instance of the script can lock the log file for itself, then the script writes its entry and releases the lock.
The other instances of the script that are running in parallel and trying to write will wait on the lock until they can take it for themselves.
# Block until log.lock can be created (retrying every second), then write and unlock
lockfile -1 log.lock
echo -e "accessing file at "`date`
echo -e "$logblock" >> log
rm -f log.lock
Does anybody see any drawbacks in this type of solution?
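For comparison, a sketch of the flock(1) alternative the question mentions (file names here are placeholders): flock takes an exclusive lock on a file descriptor, and the lock is released automatically when the subshell exits, so there is no stale lock file to clean up if a writer crashes.

```shell
logfile=$(mktemp)
# Each writer takes an exclusive lock on fd 9 before appending its block
for i in 1 2 3; do
    (
        flock -x 9
        echo "<myrun>run $i</myrun>" >> "$logfile"
    ) 9>"$logfile.lock" &
done
wait
cat "$logfile"    # three complete, non-interleaved entries
rm -f "$logfile" "$logfile.lock"
```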

QSUB a process for every file in a directory?

I've been using
qsub -t 1-90000 do_stuff.sh
to submit my tasks on a Sun GridEngine cluster, but now find myself with data sets (super large ones, too) which are not so conveniently named. What's the best way to go about this? I could try to rename them all, but the names contain information which needs to be preserved, and this obviously introduces a host of problems. I could just preprocess everything into jsons, but if there's a way to just qsub -all_contents_of_directory, that would be ideal.
Am I SOL? Should I just go to the directory in question and run find . -exec qsub setupscript.sh {} \; ?
Use another script to submit the job - here's an example I used where I want the directory name in the job name. "run_openfoam" is the pbs script in the particular directory.
#!/bin/bash
cd $1
qsub -N $1 run_openfoam
You can adapt this script to suit your job and then run it through a loop on the command line. So rather than submitting a job array, you submit a job for each dir name passed as the first parameter to this script.
I tend to use Makefiles to automate this stuff:
INPUTFILES=$(wildcard *.in)
OUTPUTFILES=$(patsubst %.in,%.out,$(INPUTFILES))
all : $(OUTPUTFILES)
%.out : %.in
	@echo "mycommand here < $< > $@" | qsub
Then type 'make', and all files will be submitted to qsub. Of course, this will submit everything all at once, which may do unfortunate things to your compute cluster and your sysadmin's blood pressure.
If you remove the "| qsub", the output of make is a list of commands to run. Feed that list into one or more qsub commands, and you'll get an increase in efficiency and a reduction in qsub jobs. I've been using GNU parallel for that, but it needs a qsub that blocks until the job is done. I wrote a wrapper that does that, but it calls qstat a lot, which means a lot of hitting on the system. I should modify it somehow, but there aren't a lot of computationally 'good' options here.
I cannot understand "-t 1-90000" in your qsub command. My search of the qsub manual doesn't show such a "-t" option.
Create a file with a list of the datasets in it
find . -print >~/list_of_datasets
Script:
#!/bin/bash
exec ~/setupscript.sh $(sed -n -e "${SGE_TASK_ID}p" <~/list_of_datasets)
qsub -t 1-$(wc -l < ~/list_of_datasets) job_script
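The sed -n "${SGE_TASK_ID}p" idiom simply prints line number $SGE_TASK_ID of the list, so each array task picks up its own dataset. A quick local check, with made-up dataset names standing in for the real list:

```shell
list=$(mktemp)
printf 'dataset_a\ndataset_b\ndataset_c\n' > "$list"   # stand-in for ~/list_of_datasets
SGE_TASK_ID=2                                          # normally set by SGE for each task
sed -n -e "${SGE_TASK_ID}p" "$list"                    # prints: dataset_b
rm -f "$list"
```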
