SunGridEngine, Condor, Torque as Resource Managers for PVM - cluster-computing

Does anyone have any idea which resource manager is good for PVM? Or should I not have used PVM at all and instead relied on MPI (or some implementation of it, such as MPICH-2 [are there better ones?])? The main reason for using PVM was that the person who started this project before me assumed PVM. However, now that the project is mine (he hasn't done any significant work that relies on PVM), this can easily be changed, preferably to something that is easy to install, because installing and setting up PVM was a big hassle.
I'm leaning towards Sun Grid Engine, seeing as I have dedicated hardware, and after reading another post about which managers are better for dedicated hardware, SGE seems to be the winner. However, I'm unsure of its performance with PVM. Has anyone had any experience with PVM and SGE?
If you use SGE, what do you use to communicate from computer to computer (or virtual machine to virtual machine)?
Oh, and I will be running Perl applications/scripts, if that matters.
Any suggestions or ideas?
Thanks in advance to all comments,
Tyug

I run PVM on Linux systems using Torque, SGE, and LSF without any problems. Are you asking "Is it possible to use SGE, Torque, etc. to run PVM applications?"?
If so, check out my example Linux C-shell job scripts below. Note that the scripts are nearly identical, except for the header of each script, which conforms to the format required by each resource manager.
SGE job script:
#!/bin/csh
#$ -N LTR-001
#$ -o LTR-001.output
#$ -e LTR-001.error
#$ -pe comp 24
#$ -l h_rt=04:00:00
#$ -A cmit2
#$ -cwd
#$ -V
# Set up environment
setenv LD_LIBRARY_PATH /lfs0/projects/cmit2/opt-intel/overture-noX/lib:${LD_LIBRARY_PATH}
setenv PVM_ARCH LINUX
setenv PVM_ROOT /lfs0/projects/cmit2/opt-intel/pvm3
setenv PVM_BIN ${PVM_ROOT}/bin
setenv PVM_RSH /usr/bin/ssh
setenv MY_HOSTS pvm_hostfile
rm -f ~/.pvmprofile
env | grep PVM_ > ~/.pvmprofile
# Create file containing _unique_ host names. Note that there are two possible sources of available hosts
sort -k 1,1 -u ${MACHINE_FILE} >! ${MY_HOSTS}
# Start PVM & add nodes
printf "%s\n%s\n" conf quit|${PVM_ROOT}/lib/pvm ${MY_HOSTS}
wait
sleep 2
#
# Run apps requiring PVM.
#
wait
# Exit PVM daemon
echo "reset" | $PVM_ROOT/lib/pvm
echo "halt" | $PVM_ROOT/lib/pvm
Torque job script:
#!/bin/csh
#PBS -N LTR-001
#PBS -o LTR-001.output
#PBS -e LTR-001.error
#PBS -l nodes=3:ppn=8
#PBS -l walltime=04:00:00
#PBS -q compute
#PBS -d .
# Set up environment
setenv LD_LIBRARY_PATH /users/ps14/opt-intel/overture/lib:${LD_LIBRARY_PATH}
setenv PVM_ARCH LINUX64
setenv PVM_ROOT /users/ps14/opt-intel/pvm3
setenv PVM_BIN ${PVM_ROOT}/bin
setenv PVM_RSH ${PVM_ROOT}/ssh
setenv MY_HOSTS pvm_hostfile
rm -f ~/.pvmprofile
env | grep PVM_ > ~/.pvmprofile
# Create file containing _unique_ host names. Note that there are two possible sources of available hosts
sort -k 1,1 -u ${PBS_NODEFILE} >! ${MY_HOSTS}
# Start PVM & add nodes
printf "%s\n%s\n" conf quit|${PVM_ROOT}/lib/pvm ${MY_HOSTS}
wait
sleep 2
#
# Run apps requiring PVM.
#
wait
# Exit PVM daemon
echo "reset" | $PVM_ROOT/lib/pvm
echo "halt" | $PVM_ROOT/lib/pvm

Related

Is there any way to run codes across multiple nodes on HPC?

I am trying to run, let's say, 10 different codes, each saved in its respective directory named 1, 2, 3, ..., 10.
#PBS -l nodes=10:cores=1
This means I have 1 thread each on 10 different CPUs. Now I have to submit a job so that each directory gets 1 thread of 1 CPU only: directory 1 on one CPU, and similarly for directories 2, 3, ..., 10.
The codes are for molecular dynamics and run for several hours, and they are independent of each other. I tried GNU Parallel but failed to employ all 10 CPUs; maybe GNU Parallel only distributes jobs among the cores of a single CPU. I know MPI can do this, but I don't know exactly how. Can anyone please suggest something?
I do not have access to a PBS cluster, but Example 2 from
https://www.nas.nasa.gov/hecc/support/kb/using-gnu-parallel-to-package-multiple-jobs-in-a-single-pbs-job_303.html might be what you are looking for:
#PBS -lselect=6:ncpus=4:model=san
#PBS -lwalltime=4:00:00
cd $PBS_O_WORKDIR
seq 64 | parallel -j 4 -u --sshloginfile $PBS_NODEFILE \
"cd $PWD; ./myscript.csh {}"
Adapted to your situation (untested):
#PBS -l place=scatter
#PBS -l nodes=10:cores=1
cd $PBS_O_WORKDIR
seq 10 | parallel -j 1 --sshloginfile $PBS_NODEFILE --wd $PBS_O_WORKDIR ./myscript {}
You need place=scatter because otherwise the same host may be listed twice in $PBS_NODEFILE, and GNU Parallel ignores duplicates.
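Here myscript stands for a small wrapper that enters the directory named by its argument and starts the MD run there. A minimal sketch (the md_run executable and md.log file names are placeholders for whatever the actual code is):
#!/bin/bash
# myscript: run the MD code inside directory $1 (one of 1..10)
cd "$1" || exit 1
./md_run > md.log 2>&1    # placeholder for the real MD command line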

Job crashes with OOM when I submit it to a queue (PBS/Torque) but not when I simply run the command in a terminal

Here's the job script I use,
#!/bin/bash
#PBS -q batch
#PBS -N simulation
#PBS -j n
#PBS -o /dev/null
#PBS -l nodes=1:ppn=1,pmem=3400mb
#PBS -l ncpus=1,mem=3400mb
cd ${PBS_O_WORKDIR} && \
./executable
I get this error:
Operating system error: Cannot allocate memory
Allocation would exceed memory limit
I've tried increasing ppn, ncpus, and mem so that they match the requirement of my program (~6 GB). That doesn't help either.
This doesn't happen when I run the command in terminal. It works just fine.
Try running the program with all the memory in the system allocated to the job.
Then, at the very end, execute qstat -f $PBS_JOBID | grep resources to see what Torque thinks about the resources used.
resources_used.cput = 00:00:00
resources_used.energy_used = 0
resources_used.mem = 94724kb
resources_used.vmem = 1564024kb
resources_used.walltime = 00:03:25
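Put together, such a diagnostic job script could look roughly like this (a sketch; the job name and the 16gb request are placeholders, with the memory figure meant to be raised to the node's full physical memory, and ./executable is the same program as in the question):
#!/bin/bash
#PBS -q batch
#PBS -N simulation-memcheck
#PBS -j oe
#PBS -l nodes=1:ppn=1,mem=16gb
cd ${PBS_O_WORKDIR}
./executable
# Report what Torque accounted for this job
qstat -f $PBS_JOBID | grep resources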

How to run a TCP/IP game on a cluster

I wrote a TCP/IP game that involves one server and two clients who then play the game, write some statistics about it and close. (It's two AIs playing a game)
I wrote a shell script that opens those three child scripts. However, currently no statistics are being written. Since sometimes the setup stage (clients connect to server) works and sometimes not even that, I assume that those children are wrongly distributed over the cores and can't communicate with the server. (?)
How would I generally solve this problem? Perhaps not with tmux? Running SGE, version 6.2u3beta.
Here's my shell script:
#!/bin/bash
# This script is supposed to take a json problem instance (name is problemNNNNN.json)
# Then open server with -i problem$SGE_TASK_ID.json -o -p open-port,
# then open two clients with -p open-port.
#$ -S /bin/bash
#$ -m n
#$ -l h_vmem=4G
## Tasks
#$ -t 1-1
#$ -cwd
problem_file=problem$SGE_TASK_ID.json
function find_open_port(){
    # Ports between 49152 - 65535 are usually unused.
    port=$(shuf -i '49152-65535' -n '1')
    # Then check if port is open
    if lsof -Pi :$port -sTCP:LISTEN -t >/dev/null ; then
        find_open_port
    else
        # There is no service currently running on this port
        return $port
    fi
}
find_open_port
tmux new-session -d -s '$SGE_TASK_ID' "python server.py -p $port -i $problem_file"
sleep 1
tmux split-window -v -t '$SGE_TASK_ID' "python client.py -p $port"
sleep 1
tmux split-window -h -t '$SGE_TASK_ID' "python client.py -p $port"
exit

Run NetLogo in parallel with MPI using Sun Grid Engine

#!/bin/bash
#$ -N new
#$ -q all.q
#$ -pe mpi 30
unset SGE_ROOT
/opt/mpi/1.8.1/bin/mpirun -np $NSLOTS -hostfile $TMPDIR/machines /home/abhishekb/netlogo/netlogo-5.2.0/netlogo-headless.sh \
--model /home/abhishekb/scale_med/try4.nlogo \
--experiment experiment1 \
--table /home/abhishekb/Trash/anything.csv
Error:
The: Command not found.
queuing: Command not found.
time-to-exit: Command not found.
Badly placed ()'s.
Output:
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
--------------------------------------------------------------------------
A hostfile was provided that contains at least one node not
present in the allocation:
hostfile: /tmp/8396.1.all.q/machines
node: compute-0-1
If you are operating in a resource-managed environment, then only
nodes that are in the allocation can be used in the hostfile. You
may find relative node syntax to be a useful alternative to
specifying absolute node names see the orte_hosts man page for
further information.
--------------------------------------------------------------------------
PE file:
rm: cannot remove `/tmp/8396.1.all.q/rsh': No such file or directory
Earlier, I used to run the below:
#!/bin/bash
#$ -N new
#$ -q all.q
#$ -pe mpi 30
/home/abhishekb/netlogo/netlogo-5.2.0/netlogo-headless.sh \
--model /home/abhishekb/std_low/try4.nlogo \
--experiment experiment1 \
--table /home/abhishekb/Trash/anything.csv \
--threads 30
which simply runs on just one core (as checked at the HPC end), even though it grabs 30 slots.
Edit:
Documentation for submitting jobs: http://it.iiitd.edu.in/HPC_final_doc.pdf (please refer to pages 4 and 5, section 10, "Job submission steps").
The job was submitted with qsub <filename.sh>.

Why does qstat report a job submitted with qsub as an unknown job id?

This is about creating a PBS script file to run long-term jobs on a server with 256 GB of RAM and two CPUs, each with 12 cores and 24 threads, yielding 48 computing units. I tried to do it, but I think something is wrong.
I created a PBS script named run_trinity and submitted it to the server with the qsub command (qsub run_trinity.sh) from the same directory that contains my desired program (Trinity) and data, and it returned something like 47.chpc. But when I try to check the status of the job with the qstat command, it says: unknown job id 47.chpc. I'm a biology student and really new to this field; could you please help me figure out what happened? Here is my PBS script:
#!/bin/bash
#PBS -N run_trinity
#PBS -l nodes=1:ppn=6
#PBS -l walltime=100:00:00
#PBS -l mem=200gb
#PBS -j oe
#Set stack size to unlimited
ulimit -s unlimited
cd /home/mary/software/trinityrnaseq_r20140717
perl /home/mary/software/trinityrnaseq_r20140717/Trinity.pl --seqType fq --JM 200G --normalize_reads --left reads8_1.fq.gz --right reads8_2.fq.gz --SS_lib_type FR --CPU 6 --full_cleanup --output /home/mary/software/trinityrnaseq_r20140717
Looking forward to hearing your solutions.
