How to make each command use one GPU in a multi-GPU PBS job [closed] - shell

I submitted a multi-GPU job with PBS, i.e., the job requests 3 GPUs for deep learning model training. In fact, a single GPU is enough to train one model, so I want to run 3 models inside this job, each using a single GPU. What do I need to modify? Setting CUDA_VISIBLE_DEVICES does not seem to take effect.
#!/bin/bash
#PBS -N Train
#PBS -l select=1:mem=36GB:ncpus=18:ngpus=3
#PBS -l walltime=96:00:00
#PBS -o run_out.out
#PBS -e run_err.out
CUDA_VISIBLE_DEVICES=0 python train1.py
CUDA_VISIBLE_DEVICES=1 python train2.py
CUDA_VISIBLE_DEVICES=2 python train3.py
When I submit this script with qsub, some errors occur. After checking in debug mode, I found that CUDA_VISIBLE_DEVICES does not take effect. Our cluster node has 8 GPUs and GPU 0 is already occupied, but before submitting my script I cannot see detailed GPU information (e.g., via the nvidia-smi command) or pick specific GPUs.
I hope someone can point out the underlying reason and suggest modifications. Thank you.
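One pattern that typically works (a sketch, not from the original post; details depend on the cluster's PBS and cgroup setup) is to background the three trainings and wait for all of them. Note that inside a job, GPU indices are usually renumbered relative to the GPUs PBS assigned to the job, so 0, 1, 2 are correct even if the physical devices on the node are different:

```shell
#!/bin/bash
#PBS -N Train
#PBS -l select=1:mem=36GB:ncpus=18:ngpus=3
#PBS -l walltime=96:00:00
#PBS -o run_out.out
#PBS -e run_err.out

cd "$PBS_O_WORKDIR"

# Launch each training in the background, pinned to one visible GPU.
# Indices 0-2 are relative to the GPUs PBS granted this job.
CUDA_VISIBLE_DEVICES=0 python train1.py > train1.log 2>&1 &
CUDA_VISIBLE_DEVICES=1 python train2.py > train2.log 2>&1 &
CUDA_VISIBLE_DEVICES=2 python train3.py > train3.log 2>&1 &

# Block until all three finish, so PBS does not end the job early
wait
```

Without the `&` the three commands run one after another on a single GPU; without the `wait` the job script exits immediately and PBS kills the backgrounded trainings.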

Related

PBS Torque: how to avoid wasting cores on parallel tasks that take very different amounts of time?

I'm running parallel MATLAB or Python tasks on a cluster managed by PBS Torque. The embarrassing situation is that PBS thinks I'm using 56 cores, but that was only true at the start; by now only the 7 hardest tasks are still running, and 49 cores are wasted.
My parallel tasks take very different amounts of time because they search different model parameters, and I couldn't know in advance how long each task would take. At the start all cores were used, but soon only the hardest tasks were left running. Since the whole job was not finished yet, PBS Torque still thought I was using all 56 cores and prevented new jobs from starting, even though most cores were actually idle. I want PBS to detect this and use the idle cores to run new tasks.
So my question is: are there settings in PBS Torque that can automatically detect the cores a job really uses and allocate the truly idle ones to new tasks?
#PBS -S /bin/sh
#PBS -N alps_task
#PBS -o stdout
#PBS -e stderr
#PBS -l nodes=1:ppn=56
#PBS -q batch
#PBS -l walltime=1000:00:00
#HPC -x local
cd /tmp/$PBS_O_WORKDIR
alpspython spin_half_correlation.py > tasklog.log 2>&1
A short answer to your question is No: PBS has no way to reclaim unused resources allocated to a job.
Since your computation is essentially a bunch of independent tasks, what you could (and probably should) do is split your job into 56 independent jobs, each running an individual combination of model parameters; when all of them have finished, run one more job to collect and summarize the results. This is a well-supported way of working, and PBS provides some useful features for this type of workload, such as array jobs and job dependencies.
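As a sketch of the array-job approach on Torque (assuming a hypothetical params.txt with one parameter combination per line; PBS Pro uses -J and $PBS_ARRAY_INDEX instead of -t and $PBS_ARRAYID):

```shell
#PBS -S /bin/sh
#PBS -N alps_array
#PBS -o stdout
#PBS -e stderr
#PBS -l nodes=1:ppn=1
#PBS -q batch
#PBS -l walltime=1000:00:00
#PBS -t 1-56

cd /tmp/$PBS_O_WORKDIR
# Each sub-job picks its own line from params.txt (hypothetical file,
# one parameter combination per line) using its array index.
params=$(sed -n "${PBS_ARRAYID}p" params.txt)
alpspython spin_half_correlation.py $params > "task_${PBS_ARRAYID}.log" 2>&1
```

Each sub-job requests a single core and releases it as soon as it finishes, so the scheduler can hand idle cores to other work; a final job with a dependency on the array (e.g. `qsub -W depend=afterokarray:<jobid>`) can collect the results.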

GNU parallel one job per processor

I am trying to use GNU parallel (version 20160922) to launch a large number of protein docking jobs (using UCSF Dock 6.7). I am running on a high-performance cluster with several dozen nodes, each with 28-40 cores. The system runs CentOS 7.1.1503 and uses Torque for job management.
I am trying to submit each config file in dock.n.d to the dock executable, one per core on the cluster. Here is my PBS file:
#PBS -l walltime=01:00:00
#PBS -N pardock
#PBS -l nodes=1:ppn=28
#PBS -j oe
#PBS -o /home/path/to/pardock.log
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE > temp.txt
#f=$(pwd)
ls dock.in.d/*.in | parallel -j 300 --sshloginfile $PBS_NODEFILE "/path/to/local/bin/dock6 -i {} -o {}.out"
This works fine on a single node as written above. But when I scale up to, say, 300 processors (with -l procs=300) across several nodes, I begin to get these errors:
parallel: Warning: ssh to node026 only allows for 99 simultaneous logins.
parallel: Warning: You may raise this by changing /etc/ssh/sshd_config:MaxStartups and MaxSessions on node026.
What I do not understand is why there are so many logins. Each node only has 28-40 cores so, as specified in $PBS_NODEFILE, I would expect there to only be 28-40 SSH logins at any point in time on these nodes.
Am I misunderstanding or misexecuting something here? Please advise what other information I can provide or what direction I should go to get this to work.
UPDATE
So my problem above was the combination of -j 300 and the use of $PBS_NODEFILE, which has a separate entry for each core on each node. In that case it seems I should use -j 1. But then all the jobs seem to run on a single node.
So my question remains, how to get gnu parallel to balance the jobs between nodes, utilizing all cores, but not creating an excessive number of SSH logins due to multiple jobs per core.
Thank you!
You are asking GNU Parallel to ignore the number of cores and run 300 jobs on each server.
Try instead:
ls dock.in.d/*.in | parallel --sshloginfile $PBS_NODEFILE /path/to/local/bin/dock6 -i {} -o {}.out
This will default to --jobs 100% which is one job per core on all machines.
If you are not allowed to use all cores on the machines, you can prepend X/ to the hosts in --sshloginfile to force X as the number of cores:
28/server1.example.com
20/server2.example.com
16/server3.example.net
This will force GNU Parallel to skip the detection of cores, and instead use 28, 20, and 16 respectively. This combined with -j 100% can control how many jobs you want started on the different servers.

Why does output behave differently over ssh? [closed]

I execute a script foo.sh (which does some more complex things, like launching multiple Python processes) over SSH in two different ways.
First, the following way (interactive shell):
ssh me@server
./foo.sh
This way I get all the output (stderr and stdout) of foo.sh
Then, the other way:
ssh me@server "./foo.sh"
This way I don't get any output from any of the subprocesses. What is the difference between the two methods? Why does stderr/stdout behave differently?
An example for foo.sh is
#! /bin/bash
./bar.py
Where bar.py is
#! /usr/bin/python3
from sys import stdout, stderr
from time import sleep
while True:
    stdout.write("A\n")
    stderr.write("B\n")
    sleep(0.5)
You need to force pseudo-TTY allocation:
ssh -t me@server "./foo.sh"
When you run ssh remote "somecommand", you spawn a non-login, non-interactive shell session on the remote host that is not bound to any terminal, so there is no job control. That's why you need to force pseudo-TTY allocation.
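One observable consequence, sketched locally rather than over ssh (this is an addition, not part of the original answer): when a command's stdout is a pipe rather than a terminal, `isatty()` is false and stdio typically switches to block buffering, which can delay or hide output from long-running subprocesses like bar.py above.

```python
import subprocess
import sys

# Run a child whose stdout is a pipe -- the same situation a remote
# command sees under `ssh host "cmd"` without -t (no pseudo-terminal).
out = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.stdout.isatty())"],
    capture_output=True,
    text=True,
).stdout.strip()
print(out)  # False: no terminal attached, so stdio block-buffers output
```

With `ssh -t`, the remote command's stdout is a pseudo-terminal instead, `isatty()` is true, and output is line-buffered, so it appears as it is written.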

Using killall to terminate bash [closed]

I must terminate a Python script and 2 bash scripts using crontab.
I need a command to terminate all the bash scripts ('killall Python' already works for terminating the Python script), but when I use 'killall bash' it doesn't work...
Does anyone know a solution to my problem? Maybe another command, or a specific way to use killall!
Thanks in advance!
Try the following command :
killall -s SIGHUP bash
but you shouldn't do this, as you could potentially kill every bash process of every user. Instead, I recommend using
pkill -f script_name.bash
and
pkill -1 -f script_name.bash
if needed.
Bash traps many signals like SIGTERM(15) and SIGQUIT(3). You could send SIGHUP(1) or SIGKILL(9):
killall -s SIGHUP bash ## or killall -s 1 bash
killall -s SIGKILL bash ## or killall -s 9 bash

PBS running multiple instances of the same program with different arguments

How do you go about running the same program multiple times, but with different arguments for each instance, on a cluster, submitted through PBS? Also, is it possible to assign each of these instances to a separate node? Currently, if I have a PBS script with the following:
#PBS -l nodes=1:ppn=1
/myscript
it will run the single program once, on a single node. If I use the following script:
#PBS -l nodes=1:ppn=1
/myscript -arg arg1 &
/myscript -arg arg2
I believe this will run each program in serial, but it will use only one node. Can I declare multiple nodes and then delegate specific ones out to each instance of the program I wish to run?
Any help or suggestions will be much appreciated. I apologize if I am unclear or am using incorrect terminology; I am very new to cluster computing.
You want to do this using a form of MPI. MPI stands for Message Passing Interface, and there are a number of libraries that implement it. I would recommend OpenMPI, as it integrates very well with PBS. As you say you are new, you might appreciate this tutorial.
GNU Parallel would be ideal for this purpose. An example PBS script for your case:
#PBS -l nodes=2:ppn=4 # set ppn for however many cores per node on your cluster
#Other PBS directives
module load gnu-parallel # this will depend on your cluster setup
parallel -j4 --sshloginfile $PBS_NODEFILE /myscript -arg {} \
    ::: arg1 arg2 arg3 arg4 arg5 arg6 arg7 arg8
GNU Parallel will handle the SSH connections to the various nodes. I've written the example with arguments on the command line, but you'd probably want to read the arguments from a text file. Here are links to the man page and tutorial. The -j4 option should match ppn (the number of cores per node).
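To read the arguments from a file instead, GNU Parallel's `::::` operator takes an input source file. A sketch of the same job script under that assumption (args.txt is a hypothetical file with one argument per line):

```shell
#PBS -l nodes=2:ppn=4 # set ppn for however many cores per node on your cluster
#Other PBS directives
module load gnu-parallel # this will depend on your cluster setup

cd "$PBS_O_WORKDIR"

# :::: reads one argument per line from args.txt (hypothetical file)
# instead of listing the arguments after ::: on the command line
parallel -j4 --sshloginfile "$PBS_NODEFILE" /myscript -arg {} :::: args.txt
```

This keeps the job script unchanged as the parameter list grows; only args.txt needs editing.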
