What is the difference between
$ make all
and
$ make all -j8
From man make
-j [jobs], --jobs[=jobs]
Specifies the number of jobs (commands) to run simultaneously. If there is more than one -j option, the last one is effective. If the -j option is given without an argument, make will not limit the number of jobs that can run simultaneously.
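In short, make all runs one recipe at a time, while make all -j8 lets make run up to 8 independent recipes in parallel. A minimal sketch to see the effect (the toy Makefile and the timings are illustrative assumptions, not from the original post):
# Toy Makefile: four independent targets that each sleep 2 seconds
# (recipe lines must start with a tab)
cat > Makefile <<'EOF'
all: a b c d
a b c d:
	sleep 2
	touch $@
EOF
time make all          # serial: recipes run one after another (~8 s)
rm -f a b c d          # force a rebuild
time make all -j8      # parallel: up to 8 recipes at once (~2 s)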
I am running an LSF job array to create a target in a makefile.
However, as soon as the array is submitted, make considers the target's command to have completed and throws an error because the target does not exist yet.
How can I force make to wait until the completion of the LSF job array before moving onto other dependent targets?
Example:
all: final.txt

first_%.txt:
	bsub -J "jarray[1-100]" < script.sh

final.txt: first_%.txt
	cat first_1.txt first_50.txt first_100.txt > final.txt
Unfortunately the -K flag isn't supported for job arrays.
Try bsub -K, which should force bsub to stay in the foreground until the job completes.
Edit: since the option isn't supported on arrays, I think you'll have to submit your array as separate jobs, something like:
for i in `seq 1 100`; do
    export INDEX=$i
    bsub -K < script.sh &
done
wait
You'll have to pass the index to your script manually instead of using the job array index.
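For example, script.sh could fall back to the LSF array index when INDEX is not set (a sketch; the INDEX variable and the first_N.txt output files are assumptions based on the question):
#!/bin/bash
# Use the manually exported INDEX if present, otherwise the LSF job array index
i=${INDEX:-$LSB_JOBINDEX}
# ... do the real work for task $i, producing the file the Makefile expects
echo "result for task $i" > "first_${i}.txt"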
You need to ask the bsub command to wait for the job to complete. I have never used it, but according to the man page you can add the -K option to do this.
I would like to submit an array job on a cluster running SGE.
I know how to use array jobs with the -t option (for instance, qsub -t 1-1000 somescript.sh).
What if I don't know how many tasks I have to submit? The idea would be to use something like (not working):
qsub -t 1- somescript.sh
The submission would then cover all n tasks, with n unknown in advance.
No, open-ended arrays are not a built-in capability (nor can you add jobs to an array after initial submission).
I'm guessing at why you want to do this, but here's one idea for keeping track of a group of jobs like this: give the whole set of jobs a shared name and append a counter.
So, for example, you'd include -N myjob.<counter> in each qsub call (or add the equivalent directive line to the script), as in the sketch after these examples:
-N myjob.1
-N myjob.2
...
-N myjob.n
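A minimal sketch of submitting jobs this way as tasks become known (the counter, the task_inputs directory, and passing the task as an argument are illustrative assumptions):
#!/bin/bash
# Submit one job per task, all sharing the base name "myjob" plus an incrementing counter
counter=1
for task in task_inputs/*; do
    qsub -N "myjob.${counter}" somescript.sh "$task"
    counter=$((counter + 1))
done
# Later, the whole group can be inspected at once:
qstat | grep myjob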
I am doing research, and I often need to execute the same program with different inputs (each combination of inputs repeatedly) and store the results, for aggregation.
I would like to speed up the process by executing these experiments in parallel, over multiple machines. However, I would like to avoid the hassle of launching them manually. Furthermore, I would like my program to be implemented as a single thread and only add parallelization on top of it.
I work with Ubuntu machines, all reachable in the local network.
I know GNU Parallel can solve this problem, but I am not familiar with it. Can someone help me set up an environment for my experiments?
Please note that this answer has been adapted from one of my scripts and is untested. If you find bugs, you are welcome to edit the answer.
First of all, to make the process completely batch, we need a non-interactive SSH login (that's what GNU Parallel uses to launch commands remotely).
To do this, first generate a pair of RSA keys (if you don't already have one) with:
ssh-keygen -t rsa
which will generate a pair of private and public keys, stored by default in ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub. It is important to use these locations, as openssh will look for them there. While the openssh commands let you specify the private key file (passing it with -i PRIVATE_KEY_FILE_PATH), GNU Parallel does not have such an option.
Next, we need to copy the public key to all the remote machines we are going to use. For each machine in your cluster (I will call them "workers"), run this command on your local machine:
ssh-copy-id -i ~/.ssh/id_rsa.pub WORKER_USER@WORKER_HOST
This step is interactive, as you will need to log in to each of the workers with a user ID and password.
From this moment on, logging in from your client to each of the workers is non-interactive. Next, let's set up a bash variable with a comma-separated list of the workers. We will build it using GNU Parallel's special syntax, which lets you indicate how many CPUs to use on each worker:
WORKERS_PARALLEL="2/user1@192.168.0.10,user2@192.168.0.20,4/user3@10.0.111.69"
Here, I specified that on 192.168.0.10 I want only 2 parallel processes, while on 10.0.111.69 I want four. For 192.168.0.20, since I did not specify any number, GNU Parallel will figure out how many CPUs (CPU cores, actually) the remote machine has and run that many parallel processes.
Since I will also need the same list in a format that openssh can understand, I will create a second variable without the CPU information and with spaces instead of commas. I do this automatically with:
WORKERS=`echo $WORKERS_PARALLEL | sed 's/[0-9]*\///g' | sed 's/,/ /g'`
Now it's time to set up my code. I assume that each of the workers is configured to run it, so I just need to copy the code over. On the workers I usually work in the /tmp folder, so what follows assumes that. The code will be copied through an SSH tunnel and extracted remotely:
WORKING_DIR=/tmp/myexperiments
TAR_PATH=/tmp/code.tar.gz
# Clean from previous executions
parallel --nonall -S $WORKERS rm -rf $WORKING_DIR $TAR_PATH
# Copy the tar.gz file to each worker (LOCAL_TAR_PATH is the path of the archive on your local machine)
parallel scp LOCAL_TAR_PATH {}:/tmp ::: `echo $WORKERS`
# Create the working directory on the worker
parallel --nonall -S $WORKERS mkdir -p $WORKING_DIR
# Extract the tar file in the working directory
parallel --nonall -S $WORKERS tar --warning=no-timestamp -xzf $TAR_PATH -C $WORKING_DIR
Notice that multiple executions on the same machine will share the same working directory. I assume only one version of the code will be run at any given time; if this is not the case, you will need to modify the commands to use different working directories.
I use the --warning=no-timestamp option to avoid the annoying warnings that can be issued if the clock of your local machine is ahead of that of your workers.
We now need to create directories on the local machine for storing the results of the runs, one for each group of experiments (that is, each set of executions with the same parameters). Here, I am using two dummy parameters, alpha and beta:
GROUP_DIRS="results/alpha=1,beta=1 results/alpha=0.5,beta=1 results/alpha=0.2,beta=0.5"
N_GROUPS=3
parallel --header : mkdir -p {DIR} ::: DIR $GROUP_DIRS
Notice that using parallel here is not strictly necessary: a loop would have worked, but I find this more readable. I also stored the number of groups, which we will use in the next step.
A final preparation step consists of creating a list of all the combinations of parameters that will be used in the experiments, each repeated as many times as necessary. Each repetition is paired with an incremental number that identifies the different runs.
ALPHAS="1.0 0.5 0.2"
BETAS="1.0 1.0 0.5"
REPETITIONS=1000
PARAMS_FILE=/tmp/params.txt
# Create header
echo REP GROUP_DIR ALPHA BETA > $PARAMS_FILE
# Populate
parallel \
--header : \
--xapply \
if [ ! -e {GROUP_DIR}"/exp"{REP}".dat" ]';' then echo {REP} {GROUP_DIR} {ALPHA} {BETA} '>>' $PARAMS_FILE ';' fi \
::: REP $(for i in `seq $REPETITIONS`; do printf $i" %.0s" $(seq $N_GROUPS) ; done) \
::: GROUP_DIR $GROUP_DIRS \
::: ALPHA $ALPHAS \
::: BETA $BETAS
In this step I also implemented a check: if a .dat file already exists, I skip that set of parameters. This is something that comes from practice: I often interrupt the execution of GNU Parallel and later decide to resume it by re-running these commands. With this simple check I avoid running more experiments than necessary.
Now we can finally run the experiments. The algorithm in this example generates a file, whose path is given via the --save-data parameter, which I want to retrieve. I also want to save stdout and stderr to a file for debugging purposes.
cat $PARAMS_FILE | parallel \
--sshlogin $WORKERS_PARALLEL \
--workdir $WORKING_DIR \
--return {GROUP_DIR}"/exp"{REP}".dat" \
--return {GROUP_DIR}"/exp"{REP}".txt" \
--cleanup \
--xapply \
--header 1 \
--colsep " " \
mkdir -p {GROUP_DIR} ';' \
./myExperiment \
--random-seed {REP} \
--alpha {ALPHA} \
--beta {BETA} \
--save-data {GROUP_DIR}"/exp"{REP}".dat" \
'&>' {GROUP_DIR}"/exp"{REP}".txt"
A little explanation of the parameters. --sshlogin, which can be abbreviated to -S, passes the list of workers that Parallel will use to distribute the computational load. --workdir sets Parallel's working directory, which is ~ by default. The --return directives copy the specified files back once the execution is completed. --cleanup removes the transferred and returned files from the workers after they have been copied back. --xapply tells Parallel to interpret the parameters as tuples (rather than as sets to combine by Cartesian product). --header 1 tells Parallel that the first line of the parameters file is to be interpreted as a header (whose entries will be used as column names). --colsep tells Parallel that the columns in the parameters file are space-separated.
WARNING: the version of parallel packaged in Ubuntu is outdated (2013). In particular, there is a bug preventing the above code from running properly, which has only been fixed recently. To get the latest monthly snapshot, run (root privileges are not needed):
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
Notice that the fix for the bug mentioned above will only be included in the next snapshot, on September 22nd, 2015. If you are in a hurry, you should instead perform a manual installation of the "smoking hottest" development version.
Finally, it is a good habit to clean our working environments:
rm $PARAMS_FILE
parallel --nonall -S $WORKERS rm -rf $WORKING_DIR $TAR_PATH
If you use this for research and publish a paper, remember to cite the original work by Ole Tange (see parallel --bibtex).
Is it possible to get one single result of the current stats from virt-top?
I've tried using the --stream parameter, but with that I get a new result every second.
I only need one result per execution of the command.
How can I achieve that?
From the virt-top man page:
-b
Batch mode. In this mode keypresses are ignored.
-n iterations
Set the number of iterations to run. The default is to run continuously.
So I think what you want is:
virt-top -b -n 1
This is exactly how you would achieve the same thing with the normal "top" command.
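For example, to capture a single snapshot to a file (the file name is arbitrary):
# One batch-mode iteration, redirected to a file for later inspection
virt-top -b -n 1 > virt-top-snapshot.txt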
I've been using
qsub -t 1-90000 do_stuff.sh
to submit my tasks on a Sun GridEngine cluster, but now find myself with data sets (super large ones, too) which are not so conveniently named. What's the best way to go about this? I could try to rename them all, but the names contain information which needs to be preserved, and this obviously introduces a host of problems. I could just preprocess everything into jsons, but if there's a way to just qsub -all_contents_of_directory, that would be ideal.
Am I SOL? Should I just go to the directory in question and run something like find . -exec qsub setupscript.sh {} \; ?
Use another script to submit the job - here's an example I used where I want the directory name in the job name. "run_openfoam" is the pbs script in the particular directory.
#!/bin/bash
# Submit the job from inside the given directory, using the directory name as the job name
cd "$1"
qsub -N "$1" run_openfoam
You can adapt this script to suit your job and then run it through a loop on the command line. So rather than submitting a job array, you submit one job per directory, with the directory name passed as the first parameter to this script; see the sketch below.
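A minimal sketch of such a loop (the script name submit_dir.sh is an assumption; use whatever name you saved the script above under):
# Submit one job per sub-directory, passing the directory name as the first argument
for dir in */; do
    ./submit_dir.sh "${dir%/}"
done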
I tend to use Makefiles to automate this stuff:
INPUTFILES=$(wildcard *.in)
OUTPUTFILES=$(patsubst %.in,%.out,$(INPUTFILES))

all : $(OUTPUTFILES)

%.out : %.in
	@echo "mycommand here < $< > $@" | qsub
Then type 'make', and all files will be submitted to qsub. Of course, this will submit everything all at once, which may do unfortunate things to your compute cluster and your sysadmin's blood pressure.
If you remove the "| qsub", the output of make is the list of commands to run. Feed that list into one or more qsub commands and you'll get an increase in efficiency and fewer qsub jobs. I've been using GNU parallel for that, but it needs a qsub that blocks until the job is done. I wrote a wrapper that does that, but it calls qstat a lot, which means hitting the system hard. I should modify it somehow, but there aren't a lot of computationally 'good' options here.
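A rough sketch of such a blocking wrapper (the script name, the polling interval, and the parsing of the qsub output are assumptions; it simply polls qstat until the job leaves the queue, which is exactly the heavy qstat traffic mentioned above):
#!/bin/bash
# qsub_blocking.sh: submit a job (script passed on stdin or as arguments) and wait for it to finish.
# Assumes qsub prints a line like "Your job 12345 (...) has been submitted".
jobid=$(qsub "$@" | grep -o '[0-9]\+' | head -n 1)
echo "Submitted job $jobid, waiting for it to finish..."
# Poll the queue until the job is no longer listed
while qstat -j "$jobid" >/dev/null 2>&1; do
    sleep 30
done
echo "Job $jobid finished."
Used, for instance, as: echo "mycommand < in > out" | ./qsub_blocking.sh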
I cannot understand the "-t 1-90000" in your qsub command; my search of the qsub manual doesn't show such a "-t" option.
Create a file with a list of the datasets in it
find . -print >~/list_of_datasets
Script:
#!/bin/bash
# Pick the dataset named on line $SGE_TASK_ID of the list and hand it to the setup script
exec ~/setupscript.sh $(sed -n -e "${SGE_TASK_ID}p" <~/list_of_datasets)
Then submit one array task per line of the list (note the redirection, so that wc prints only the count):
qsub -t 1-$(wc -l < ~/list_of_datasets) job_script