I have a huge job list (hundreds of millions of entries) and want to run a Java-written tool to perform the feature comparisons. The tool completes one calculation in:
real 0m0.179s
user 0m0.005s
sys 0m0.000s
Running on 5 nodes (each with 72 CPUs) with the PBS/Torque scheduler and GNU Parallel, the tool runs fine and produces results, but since I set 72 jobs per node it should run 72 x 5 jobs at a time, yet I can see only 25-35 jobs running!
Checking CPU utilization on each node also shows low utilization.
I want to run 72 x 5 jobs or more at a time and produce the results by utilizing all the available resources (72 x 5 CPUs).
As mentioned, I have ~200 million jobs to run and would like to complete them faster (1-2 hours) by using/increasing the number of nodes/CPUs.
Current code, input and job state:
example.lst (it has ~300 million lines)
ZNF512-xxxx_2_N-THRA-xxtx_2_N
ZNF512-xxxx_2_N-THRA-xxtx_3_N
ZNF512-xxxx_2_N-THRA-xxtx_4_N
.......
cat job_script.sh
#!/bin/bash
#PBS -l nodes=5:ppn=72
#PBS -N job01
#PBS -j oe
#work dir
export WDIR=/shared/data/work_dir
cd $WDIR;
# use available 72 cpu in each node
export JOBS_PER_NODE=72
#gnu parallel command
parallelrun="parallel -j $JOBS_PER_NODE --slf $PBS_NODEFILE --wd $WDIR --joblog process.log --resume"
$parallelrun -a example.lst sh run_script.sh {}
cat run_script.sh
#!/bin/bash
# parallel command options
i=$1
data=/shared/TF_data
# create tmp dir and work in
TMP_DIR=/shared/data/work_dir/$i
mkdir -p $TMP_DIR
cd $TMP_DIR/
# get file name
mk=$(echo "$i" | cut -d- -f1-2)
nk=$(echo "$i" | cut -d- -f3-6)
#run a tool to compare the features of pair files
/shared/software/tool_v2.1/tool -s1 $data/inf_tf/$mk -s1cf $data/features/$mk-cf -s1ss $data/features/$mk-ss -s2 $data/inf_tf/$nk.pdb -s2cf $data/features/$nk-cf.pdb -s2ss $data/features/$nk-ss.pdb > $data/$i.out
# move output files
mv matrix.txt $data/glosa_tf/matrix/$mk"_"$nk.txt
mv ali_struct.pdb $data/glosa_tf/aligned/$nk"_"$mk.pdb
# move back and remove tmp dir
cd $TMP_DIR/../
rm -rf $TMP_DIR
exit 0
PBS submission
qsub job_script.sh
Logging in to one of the nodes: ssh ip-172-31-9-208
top - 09:28:03 up 15 min, 1 user, load average: 14.77, 13.44, 8.08
Tasks: 928 total, 1 running, 434 sleeping, 0 stopped, 166 zombie
Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 98.4%id, 1.4%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 193694612k total, 1811200k used, 191883412k free, 94680k buffers
Swap: 0k total, 0k used, 0k free, 707960k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15348 ec2-user 20 0 16028 2820 1820 R 0.3 0.0 0:00.10 top
15621 ec2-user 20 0 169m 7584 6684 S 0.3 0.0 0:00.01 ssh
15625 ec2-user 20 0 171m 7472 6552 S 0.3 0.0 0:00.01 ssh
15626 ec2-user 20 0 126m 3924 3492 S 0.3 0.0 0:00.01 perl
.....
top on all of the nodes shows a similar state, and the results are produced with only ~26 jobs running at a time!
I have an aws-parallelcluster of 5 nodes (each with 72 CPUs) with the Torque scheduler and GNU Parallel 2018 (Mar 2018).
Update
Introducing a new function that takes input on stdin and running the script in parallel works great and utilizes all of the CPUs on the local machine.
However, when it runs on remote machines it produces:
parallel: Error: test.lst is neither a file nor a block device
MCVE:
A simple script that just echoes the list gives the same error when run on remote machines but works great on the local machine:
cat test.lst # contains list
DNMT3L-5yx2B_1_N-DNMT3L-5yx2B_2_N
DNMT3L-5yx2B_1_N-DNMT3L-6brrC_3_N
DNMT3L-5yx2B_1_N-DNMT3L-6f57B_2_N
DNMT3L-5yx2B_1_N-DNMT3L-6f57C_2_N
DNMT3L-5yx2B_1_N-DUX4-6e8cA_4_N
DNMT3L-5yx2B_1_N-E2F8-4yo2A_3_P
DNMT3L-5yx2B_1_N-E2F8-4yo2A_6_N
DNMT3L-5yx2B_1_N-EBF3-3n50A_2_N
DNMT3L-5yx2B_1_N-ELK4-1k6oA_3_N
DNMT3L-5yx2B_1_N-EPAS1-1p97A_1_N
cat test_job.sh # GNU parallel submission script
#!/bin/bash
#PBS -l nodes=1:ppn=72
#PBS -N test
#PBS -k oe
# introduce new function and Run from ~/
dowork() {
parallel sh test_work.sh {}
}
export -f dowork
parallel -a test.lst --env dowork --pipepart --slf $PBS_NODEFILE --block -10 dowork
cat test_work.sh # run/work script
#!/bin/bash
i=$1
data=$(pwd)
#create temporary folder in current dir
TMP_DIR=$data/$i
mkdir -p $TMP_DIR
cd $TMP_DIR/
# split list
mk=$(echo "$i" | cut -d- -f1-2)
nk=$(echo "$i" | cut -d- -f3-6)
# echo list and save in echo_test.out
echo $mk, $nk >> $data/echo_test.out
cd $TMP_DIR/../
rm -rf $TMP_DIR
From your timing:
real 0m0.179s
user 0m0.005s
sys 0m0.000s sec
it seems the tool uses very little CPU power. When GNU Parallel runs local jobs it has an overhead of 10 ms of CPU time per job. Your jobs take 179 ms of wall-clock time and use 5 ms of CPU time, so GNU Parallel's overhead accounts for quite a bit of the time spent.
The overhead is much worse when running jobs remotely. Here we are talking 10 ms plus running an ssh command, which can easily be on the order of 100 ms.
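If you want to sanity-check these overhead figures on your own setup, a rough sketch (the host name below is just the example node from the question; adjust as needed):
# ~100 trivial local jobs: with ~10 ms of overhead per job this should use
# on the order of a second of CPU time in total
time parallel -j1 true ::: $(seq 100)
# rough cost of a single ssh round trip to one of the nodes
time ssh ip-172-31-9-208 true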
So how can we minimize the number of ssh commands, and how can we spread the overhead over multiple cores?
First let us make a function that can take input on stdin and run the script - one job per CPU thread in parallel:
dowork() {
[...set variables here; that becomes particularly important when we run remotely...]
parallel sh run_script.sh {}
}
export -f dowork
Test that this actually works by running:
head -n 1000 example.lst | dowork
Then let us look at running jobs locally. This can be done in a way similar to what is described here: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Running-more-than-250-jobs-workaround
parallel -a example.lst --pipepart --block -10 dowork
This will split example.lst into 10 blocks per CPU thread, so on a machine with 72 CPU threads this will make 720 blocks. It will then start 72 doworks, and when one is done it will get another of the 720 blocks. The reason I choose 10 instead of 1 is that if one of the jobs "gets stuck" for a while, you are unlikely to notice it.
This should make sure 100% of the CPUs on the local machine are busy.
If that works, we need to distribute this work to remote machines:
parallel -j1 -a example.lst --env dowork --pipepart --slf $PBS_NODEFILE --block -10 dowork
This should start 10 sshs per CPU thread in total (i.e. 5*72*10) - namely one for each block - with one running per server listed in $PBS_NODEFILE in parallel.
Unfortunately this means that --joblog and --resume will not work. There is currently no way to make that work, but if it is valuable to you, contact me via parallel@gnu.org.
I am not sure what tool does, but if the copying takes most of the time and tool only reads the files, then you might just be able to symlink the files into $TMP_DIR instead of copying, as sketched below.
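A minimal sketch of that idea (an assumption on my side: it only helps if your real run_script.sh copies input files into $TMP_DIR, and the tool never writes to them):
# inside run_script.sh, replace a copy such as
#   cp "$data/inf_tf/$mk" "$TMP_DIR/"
# with symlinks:
ln -s "$data/inf_tf/$mk"      "$TMP_DIR/"
ln -s "$data/features/$mk-cf" "$TMP_DIR/"
ln -s "$data/features/$mk-ss" "$TMP_DIR/"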
A good indication of whether you can do it faster is to look at top on the 5 machines in the cluster. If they are all using all cores at >90%, then you cannot expect to get it faster.
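One way to get that overview from the head node without logging in to each machine is a sketch like this (it assumes mpstat from the sysstat package is installed on the nodes):
# one-shot load and CPU summary from every node in the PBS node file
parallel --nonall --tag --slf $PBS_NODEFILE 'uptime; mpstat 1 1'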
I don't understand why the 2700X is slower than the i7-6700 in the redis benchmark. What else should I add?
Below are my tests.
1st system
OS : CentOS Linux release 7.6.1810 (Core)
Linux localhost.localdomain 5.2.8-1.el7.elrepo.x86_64 #1 SMP Fri Aug 9 13:40:33 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux
CPU : AMD Ryzen 7 2700X Eight-Core Processor (pinnacle ridge 2700x)
RAM : 16GiB DIMM DDR4 Synchronous Unbuffered (Unregistered) 3000 MHz (0.3 ns) * 2
[root@localhost ~]# redis-cli --latency
min: 0, max: 1, avg: 0.21 (5042 samples)
[root@localhost ~]# redis-benchmark -h 127.0.0.1 -p 6379 -n 100000 -t set -q
SET: 110619.47 requests per second
[root@localhost ~]# redis-benchmark -h 127.0.0.1 -p 6379 -n 100000 -t get -q
GET: 138504.16 requests per second
2nd system
OS : CentOS release 6.7 (Final)
Linux dmlocalhost.localdoamin 2.6.32-573.22.1.el6.x86_64 #1 SMP Wed Mar 23 03:35:39 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
CPU : Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
RAM : 16GiB DIMM Synchronous 2133 MHz (0.5 ns)
[root@dmvault ~]# redis-cli --latency
min: 0, max: 1, avg: 0.11 (5038 samples)
[root@dmvault ~]# redis-benchmark -h 127.0.0.1 -p 6379 -n 100000 -t set -q
SET: 248138.95 requests per second
[root@dmvault ~]# redis-benchmark -h 127.0.0.1 -p 6379 -n 100000 -t get -q
GET: 244498.77 requests per second
I am trying to see the number of active puma threads on my server.
I cannot see it through ps:
$ ps aux | grep puma
healthd 2623 0.0 1.8 683168 37700 ? Ssl May02 5:38 puma 2.11.1 (tcp://127.0.0.1:22221) [healthd]
root 8029 0.0 0.1 110460 2184 pts/0 S+ 06:34 0:00 grep --color=auto puma
root 18084 0.0 0.1 56836 2664 ? Ss May05 0:00 su -s /bin/bash -c puma -C /opt/elasticbeanstalk/support/conf/pumaconf.rb webapp
webapp 18113 0.0 0.8 83280 17324 ? Ssl May05 0:04 puma 2.16.0 (unix:///var/run/puma/my_app.sock) [/]
webapp 18116 3.5 6.2 784992 128924 ? Sl May05 182:35 puma: cluster worker 0: 18113 [/]
As in the configuration I have:
threads 8, 32
I was expecting to see at least 8 puma threads.
To quickly answer the question: the number of threads used by a process with a given PID can be obtained using the following:
% ps -h -o nlwp <pid>
This will just return the total number of threads used by the process. The option -h removes the header and the option -o nlwp formats the output of ps so that it only outputs the Number of Light Weight Processes (NLWP), i.e. threads. For example, when only a single puma process is running and its PID is obtained with pgrep, you get:
% ps -h -o nlwp $(pgrep puma)
4
What is the difference between process, thread and light-weight process?
This question has been answered already in various places [see here, here and the excellent geekstuff article]. The quick, short and ugly version is:
a process is essentially any running instance of a program.
a thread is a flow of execution of the process. A process containing multiple execution flows is known as a multi-threaded process and shares its resources amongst its threads (memory, open files, I/O, ...). The Linux kernel has no knowledge of what threads are and only knows processes. In the past, multi-threading was handled on a user level and not on a kernel level. This made it hard for the kernel to do proper process management.
Enter lightweight processes (LWP). This is essentially the answer to the issue with threads. Each thread is considered to be an LWP at the kernel level. The main difference between a process and an LWP is that the LWP shares resources. In other words, a Light Weight Process is kernel-speak for what users call a thread.
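A quick way to see this sharing on a live multi-threaded process (a sketch; it assumes a single puma process is running and that its /proc entries are readable):
# every LWP (thread) has its own Pid but shares the Tgid of the owning process
for t in /proc/$(pgrep -o -x puma)/task/*/status; do
    grep -E '^(Tgid|Pid):' "$t"; echo
done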
Can ps show information about threads or LWP's?
The ps command, or process status command, provides information about the currently running processes, including their corresponding LWPs or threads. To do this, it makes use of the /proc directory, which is a virtual filesystem regarded as the control and information centre of the kernel [see here and here].
By default ps will not give you any information about the LWPs; however, adding the options -L and -m to the command generally does the trick.
man ps :: THREAD DISPLAY
H Show threads as if they were processes.
-L Show threads, possibly with LWP and NLWP columns.
m Show threads after processes.
-m Show threads after processes.
-T Show threads, possibly with SPID column.
For a single puma process with the PID given by pgrep puma:
% ps -fL $(pgrep puma)
UID PID PPID LWP C NLWP STIME TTY STAT TIME CMD
kvantour 2160 2876 2160 0 4 15:22 pts/39 Sl+ 0:00 ./puma
kvantour 2160 2876 2161 99 4 15:22 pts/39 Rl+ 0:14 ./puma
kvantour 2160 2876 2162 99 4 15:22 pts/39 Rl+ 0:14 ./puma
kvantour 2160 2876 2163 99 4 15:22 pts/39 Rl+ 0:14 ./puma
however, adding the -m option clearly gives a nicer overview. This
is especially handy when multiple processes are running with the same
name.
% ps -fmL $(pgrep puma)
UID PID PPID LWP C NLWP STIME TTY STAT TIME CMD
kvantour 2160 2876 - 0 4 15:22 pts/39 - 0:44 ./puma
kvantour - - 2160 0 - 15:22 - Sl+ 0:00 -
kvantour - - 2161 99 - 15:22 - Rl+ 0:14 -
kvantour - - 2162 99 - 15:22 - Rl+ 0:14 -
kvantour - - 2163 99 - 15:22 - Rl+ 0:14 -
In this example, you see that process puma with PID 2160 runs with 4 threads (NLWP) having the IDs 2160-2163. Under STAT you see two different values, Sl+ and Rl+. Here the l is an indicator for multi-threaded, while S and R stand for interruptible sleep (waiting for an event to complete) and running, respectively. So we see that 3 of the 4 threads are running at 99% CPU and one thread is sleeping.
You also see the total accumulated CPU time (44s), while a single thread has only run for 14s.
Another way to obtain information is by directly using the format
specifiers with -o or -O.
man ps :: STANDARD FORMAT SPECIFIERS
lwp lightweight process (thread) ID of the dispatchable
entity (alias spid, tid). See tid for additional
information. Show threads as if they were processes.
nlwp number of lwps (threads) in the process. (alias thcount).
So you can use any of lwp,spid or tid and nlwp or thcount.
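For example, a combined per-thread view using a few of these specifiers (a sketch; the process name puma is assumed, as above):
% ps -L -o pid,lwp,nlwp,stat,pcpu,comm -p $(pgrep -d, -x puma)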
If you only want to get the number of threads of a process called
puma, you can use :
% ps -o nlwp $(pgrep puma)
NLWP
4
or if you don't like the header
% ps -h -o nlwp $(pgrep puma)
4
You can get a bit more information with :
% ps -O nlwp $(pgrep puma)
PID NLWP S TTY TIME COMMAND
19304 4 T pts/39 00:00:00 ./puma
Finally, you can combine the flags with ps aux to list the threads.
% ps aux -L
USER PID LWP %CPU NLWP %MEM VSZ RSS TTY STAT START TIME COMMAND
...
kvantour 1618 1618 0.0 4 0.0 33260 1436 pts/39 Sl+ 15:17 0:00 ./puma
kvantour 1618 1619 99.8 4 0.0 33260 1436 pts/39 Rl+ 15:17 0:14 ./puma
kvantour 1618 1620 99.8 4 0.0 33260 1436 pts/39 Rl+ 15:17 0:14 ./puma
kvantour 1618 1621 99.8 4 0.0 33260 1436 pts/39 Rl+ 15:17 0:14 ./puma
...
Can top show information about threads or LWP's?
top has the option to show threads by hitting H in the interactive mode or by launching top with top -H. The problem is that it lists the threads as processes (similar to ps -fH).
% top
top - 09:42:10 up 17 days, 3 min, 1 user, load average: 3.35, 3.33, 2.75
Tasks: 353 total, 3 running, 347 sleeping, 3 stopped, 0 zombie
%Cpu(s): 75.5 us, 0.6 sy, 0.5 ni, 22.6 id, 0.0 wa, 0.0 hi, 0.8 si, 0.0 st
KiB Mem : 16310772 total, 8082152 free, 3662436 used, 4566184 buff/cache
KiB Swap: 4194300 total, 4194300 free, 0 used. 11363832 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
868 kvantour 20 0 33268 1436 1308 S 299.7 0.0 46:16.22 puma
1163 root 20 0 920488 282524 258436 S 2.0 1.7 124:48.32 Xorg
...
Here you see that puma runs at about 300% CPU for an accumulated time of 46:16.22. There is, however, no indicator that this is a threaded process. The only indicator is the CPU usage, but this could be below 100% if 3 of the threads are sleeping. Furthermore, the status flag says S, which indicates that the first thread is asleep. Hitting H then gives you:
% top -H
top - 09:48:30 up 17 days, 10 min, 1 user, load average: 3.18, 3.44, 3.02
Threads: 918 total, 5 running, 910 sleeping, 3 stopped, 0 zombie
%Cpu(s): 75.6 us, 0.2 sy, 0.1 ni, 23.9 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 16310772 total, 8062296 free, 3696164 used, 4552312 buff/cache
KiB Swap: 4194300 total, 4194300 free, 0 used. 11345440 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
870 kvantour 20 0 33268 1436 1308 R 99.9 0.0 21:45.35 puma
869 kvantour 20 0 33268 1436 1308 R 99.7 0.0 21:45.43 puma
872 kvantour 20 0 33268 1436 1308 R 99.7 0.0 21:45.31 puma
1163 root 20 0 920552 282288 258200 R 2.0 1.7 124:52.05 Xorg
...
Now we see only 3 threads. As one of the threads is "sleeping", it is far down the list, since top sorts by CPU usage.
In order to see all threads, it is best to ask top to display a specific PID (for a single process):
% top -H -p $(pgrep puma)
top - 09:52:48 up 17 days, 14 min, 1 user, load average: 3.31, 3.38, 3.10
Threads: 4 total, 3 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 75.5 us, 0.1 sy, 0.2 ni, 23.6 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
KiB Mem : 16310772 total, 8041048 free, 3706460 used, 4563264 buff/cache
KiB Swap: 4194300 total, 4194300 free, 0 used. 11325008 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
869 kvantour 20 0 33268 1436 1308 R 99.9 0.0 26:03.37 puma
870 kvantour 20 0 33268 1436 1308 R 99.9 0.0 26:03.30 puma
872 kvantour 20 0 33268 1436 1308 R 99.9 0.0 26:03.22 puma
868 kvantour 20 0 33268 1436 1308 S 0.0 0.0 0:00.00 puma
When you have multiple processes running, you might be interested in hitting f and toggling PGRP on. This shows the group PID of the process (which corresponds to PID in ps, while PID in top is LWP in ps).
How do I get the thread count without using ps or top?
The file /proc/$PID/status contains a line stating how many threads
the process with PID $PID is using.
% grep Threads /proc/19304/status
Threads: 4
General comments
It is possible that you do not find the process of another user
and therefore cannot get the number of threads that process is using. This could be due to the mount options of /proc/ (hidepid=2).
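You can check whether /proc is mounted with such an option (findmnt is part of util-linux; this is just a quick sketch):
% findmnt -o TARGET,FSTYPE,OPTIONS /proc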
Used example program:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
    char c = 0;
    /* the structured block of the parallel region starts on the line after the #pragma */
    #pragma omp parallel shared(c)
    {
        int i = 0;
        if (omp_get_thread_num() == 0) {
            printf("Read character from input : ");
            c = getchar();
        } else {
            /* busy-wait (and count) until thread 0 reads a character */
            while (c == 0) i++;
            printf("Total sum is on thread %d : %d\n", omp_get_thread_num(), i);
        }
    }
    return 0;
}
compiled with gcc -fopenmp -o puma puma.c (assuming the source is saved as puma.c)
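A possible way to reproduce the listings above with this program (assuming it was built as ./puma and OpenMP starts 4 threads on your machine):
# terminal 1: the program spins until you type a character
OMP_NUM_THREADS=4 ./puma
# terminal 2: count its threads while it is running
ps -h -o nlwp $(pgrep -x puma)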
If you are just looking for the number of threads spawned by a process, you can count the task folders created under /proc/[pid-of-process]/task, because each thread creates a folder under this path. Counting the number of folders is therefore sufficient.
In fact, the ps utility itself reads its information from this path (for instance /proc/[PID]/cmdline) and presents it in a more readable way.
From Linux Filesystem Hierarchy
/proc is very special in that it is also a virtual file-system. It's sometimes referred to as a process information pseudo-file system. It doesn't contain 'real' files but run-time system information (e.g. system memory, devices mounted, hardware configuration, etc). For this reason it can be regarded as a control and information center for the kernel. In fact, quite a lot of system utilities are simply calls to files in this directory.
All you need is the PID of the puma process; get it with ps or any utility of your choice:
ps aux | awk '/[p]uma/{print $2}'
or, more directly, use pidof(8), which gives you the PID for a given process name:
pidof -s puma
Now that you have the PID, count the number of task/ folders your process has created, using find:
find /proc/<PID>/task -mindepth 1 -maxdepth 1 -type d -print | wc -l
ps aux | grep puma will give you the list of processes matching puma. You need to find out how many threads are run by a particular process. Maybe this will help you:
ps -T -p 2623
You need to provide the process ID for which you want to find out the number of threads. Make sure you provide the correct process ID.
Using ps and wc to count puma threads:
ps --no-headers -T -C puma | wc -l
The string "puma" can be replaced as desired. Example, count bash threads:
ps --no-headers -T -C bash | wc -l
On my system that outputs:
9
The code in the question, ps aux | grep puma, has a few grep-related problems:
It returns grep --color=auto puma, which isn't a puma thread at all.
Similarly, any util or command containing the string "puma", e.g. a util called notpuma, would be matched by grep.
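A way to sidestep those grep pitfalls is pgrep with exact matching, combined with the nlwp field used elsewhere in this thread (a sketch):
# exact-match the process name, then print its thread count without a header
ps -h -o nlwp -p $(pgrep -d, -x puma)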
I found "htop" to be an excellent solution. Just toggle "tree view" and you can view each puma-worker and the threads under that worker.
Number of puma threads for each worker:
ps aux | awk '/[p]uma/{print $2}' | xargs ps -h -o nlwp
Sample output:
7
59
59
61
59
60
59
59
59
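If you also want the total across all workers, a small extension of the same pipeline:
ps aux | awk '/[p]uma/{print $2}' | xargs ps -h -o nlwp | awk '{sum+=$1} END{print sum}'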
This is a followup to GNU Parallel - which job failed?
I'm asking parallel to output its status to a logfile test.log, but it consistently only logs the last job it tried to run.
weedom#host1: ~/$ parallel --tag --nonall -j8 --joblog test.log -S host1,host2 uptime
host2 10:41:17 up 36 days, 20:45, 1 user, load average: 0.00, 0.00, 0.00
host1 10:41:17 up 22:34, 3 users, load average: 0.06, 0.11, 0.04
weedom#host1: ~/$ cat test.log
Seq Host Starttime Runtime Send Receive Exitval Signal Command
1 host1 1403689277.067 0.519999980926514 0 0 0
weedom#host1: ~/$
this is with "GNU parallel 20130522"
You experience a bug that was fixed in version 20130922.
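To see which version is installed (and therefore whether the fix is present):
parallel --version | head -n 1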