Intel MPI benchmark fails when # bytes > 128: IMB-EXT

I just installed Linux and Intel MPI on two machines:
(1) A fairly old (~8 years) SuperMicro server with 24 cores (Intel Xeon X7542 × 4) and 32 GB of memory.
OS: CentOS 7.5
(2) A new HP ProLiant DL380 server with 32 cores (Intel Xeon Gold 6130 × 2) and 64 GB of memory.
OS: OpenSUSE Leap 15
After installing the OS and Intel MPI, I compiled the Intel MPI Benchmarks and ran them:
$ mpirun -np 4 ./IMB-EXT
Surprisingly, I hit the same error when running IMB-EXT and IMB-RMA on both machines, even though the OS and everything else differ (even the GCC version used to compile the Intel MPI Benchmarks: GCC 6.5.0 on CentOS and GCC 7.3.1 on OpenSUSE).
On the CentOS machine, I get:
#---------------------------------------------------
# Benchmarking Unidir_Put
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#
# MODE: AGGREGATE
#
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.05 0.00
4 1000 30.56 0.13
8 1000 31.53 0.25
16 1000 30.99 0.52
32 1000 30.93 1.03
64 1000 30.30 2.11
128 1000 30.31 4.22
and on the OpenSUSE machine, I get
#---------------------------------------------------
# Benchmarking Unidir_Put
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#
# MODE: AGGREGATE
#
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.04 0.00
4 1000 14.40 0.28
8 1000 14.04 0.57
16 1000 14.10 1.13
32 1000 13.96 2.29
64 1000 13.98 4.58
128 1000 14.08 9.09
When I don't use mpirun (i.e., only one process runs IMB-EXT), the benchmark completes, but Unidir_Put needs >= 2 processes, so that doesn't help much. I also find that the functions using MPI_Put and MPI_Get are far slower than I would expect from experience. Using MVAPICH on the OpenSUSE machine did not help either; the output is:
#---------------------------------------------------
# Benchmarking Unidir_Put
# #processes = 2
# ( 6 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#
# MODE: AGGREGATE
#
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.03 0.00
4 1000 17.37 0.23
8 1000 17.08 0.47
16 1000 17.23 0.93
32 1000 17.56 1.82
64 1000 17.06 3.75
128 1000 17.20 7.44
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 49213 RUNNING AT iron-0-1
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Update: I tested Open MPI, and it runs through smoothly (although my application does not recommend Open MPI, and I still don't understand why Intel MPI or MVAPICH fails...):
#---------------------------------------------------
# Benchmarking Unidir_Put
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#
# MODE: AGGREGATE
#
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.06 0.00
4 1000 0.23 17.44
8 1000 0.22 35.82
16 1000 0.22 72.36
32 1000 0.22 144.98
64 1000 0.22 285.76
128 1000 0.30 430.29
256 1000 0.39 650.78
512 1000 0.51 1008.31
1024 1000 0.84 1214.42
2048 1000 1.86 1100.29
4096 1000 7.31 560.59
8192 1000 15.24 537.67
16384 1000 15.39 1064.82
32768 1000 15.70 2086.51
65536 640 12.31 5324.63
131072 320 10.24 12795.03
262144 160 12.49 20993.49
524288 80 30.21 17356.93
1048576 40 81.20 12913.67
2097152 20 199.20 10527.72
4194304 10 394.02 10644.77
Is there any chance that I am missing something when installing MPI, or when installing the OS, on these servers? Actually, I assume the OS is the problem, but I am not sure where to start...
Thanks a lot in advance,
Jae

Although this question is well written, you were not explicit about the following (one way to collect most of it is sketched after this list):
the Intel MPI Benchmarks version (please add the output header)
the Intel MPI version
the Open MPI version
the MVAPICH version
the host network fabrics supported by each MPI distribution
the fabric selected while running the MPI benchmark
the compilation settings
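One way to collect most of this, assuming the binaries are on your PATH (exact flags vary between MPI distributions and versions, so treat this as a sketch):
# Compiler and kernel details
gcc --version
uname -a
# MPI launcher version (Open MPI and MPICH-based launchers understand --version)
mpirun --version
# Intel MPI prints its library version and the selected fabric with debug output enabled
I_MPI_DEBUG=5 mpirun -np 2 ./IMB-EXT Unidir_Put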
Debugging this kind of trouble with disparate host machines, multiple Linux distributions, and compiler versions can be quite hard. Remote debugging on Stack Overflow is even harder.
First of all, ensure reproducibility; that seems to be the case here. Of the many debugging approaches, the one I would recommend is to reduce the complexity of the system as a whole, test smaller sub-systems, and shift responsibility to third parties: replace self-compiled executables with software packages provided by your distribution's repositories or by third parties like Conda.
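In that spirit, a quick way to reduce complexity here is to run only the failing benchmark and force the simplest fabric, which separates a fabric/driver problem from a library problem. A sketch for Intel MPI (valid I_MPI_FABRICS values depend on the release; shm:tcp is the usual fallback in pre-2019 versions):
# Run only Unidir_Put over shared memory + TCP
I_MPI_FABRICS=shm:tcp mpirun -np 2 ./IMB-EXT Unidir_Put
# If this passes while the default run segfaults, suspect the default fabric (DAPL/OFA/OFI) or its drivers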
Intel recently started providing its libraries through YUM/APT repositories as well as through Conda and PyPI. I have found this helps a lot with reproducible deployments of HPC clusters and even runtime/development environments. I recommend using it for CentOS 7.5.
YUM/APT repository for Intel MKL, Intel IPP, Intel DAAL, and Intel® Distribution for Python* (for Linux*):
Installing Intel® Performance Libraries and Intel® Distribution for Python* Using YUM Repository
Installing Intel® Performance Libraries and Intel® Distribution for Python* Using APT Repository
Conda* package/ Anaconda Cloud* support (Intel MKL, Intel IPP, Intel DAAL, Intel Distribution for Python):
Installing Intel Distribution for Python and Intel Performance Libraries with Anaconda
Available Intel packages can be viewed here
Install from the Python Package Index (PyPI) using pip (Intel MKL, Intel IPP, Intel DAAL)
Installing the Intel® Distribution for Python* and Intel® Performance Libraries with pip and PyPI
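As an illustration of the PyPI route (hedged: the package name below is one Intel has published on PyPI and may change between releases):
# Intel MKL runtime from PyPI; related libraries ship as similar packages
python3 -m pip install mkl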
I do not know much about OpenSUSE Leap 15.

Related

Measure performance of a VPS intended for hosting a website

I just bought, for one month, an extremely cheap VPS with 16 GB RAM and 6 cores (from Contabo).
Now my question is: how can I get some benchmark results in order to compare it with other VPSes, like the ones Hostinger provides?
I did a Geekbench benchmark on it and the results can be seen here: https://browser.geekbench.com/v4/cpu/15852309
The problem with Geekbench is that it doesn't feel very web-oriented, as the scores are influenced by the GPU as well.
What should I use to compare the VPSes with each other?
Would the plan be enough to host a Magento 2 website, or possibly more?
For web-server performance, the network, disk (random read), and CPU are the most important factors.
I like to benchmark and compare each one separately.
For disk I/O performance, you can use sysbench:
apt install sysbench
sysbench fileio --file-num=4 prepare
sysbench fileio --file-num=4 --file-test-mode=rndrw run
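The prepare step leaves test files behind; they can be removed afterwards with the matching cleanup step:
sysbench fileio --file-num=4 cleanup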
For CPU performance, you can use stress-ng:
apt install stress-ng
stress-ng -t 5 -c 2 --metrics-brief
-c 2 uses 2 logical processors. Adjust if necessary.
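To exercise every logical processor rather than a fixed two, the count can be taken from nproc (assuming GNU coreutils is available):
stress-ng -t 5 -c $(nproc) --metrics-brief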
For network performance, you can use speedtest-cli:
apt install speedtest-cli
speedtest-cli
Example output:
# sysbench fileio --file-num=4 --file-test-mode=rndrw run
<skip>
Throughput:
read, MiB/s: 45.01
written, MiB/s: 30.00
# stress-ng -t 5 -c 2 --metrics-brief
stress-ng: info: [14993] dispatching hogs: 2 cpu
stress-ng: info: [14993] successful run completed in 5.00s
stress-ng: info: [14993] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: info: [14993] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: info: [14993] cpu 3957 5.00 9.99 0.00 790.92 396.10
# speedtest-cli
Retrieving speedtest.net configuration...
Testing from <skip> ...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Uganda Hosting Limited (Helsinki) [0.20 km]: 1.807 ms
Testing download speed................................................................................
Download: 575.68 Mbit/s
Testing upload speed................................................................................................
Upload: 499.89 Mbit/s

How to optimize the memory used by Ruby with GitLab

Running top shows:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13960 git 20 0 2032080 336220 13304 S 1.0 16.3 0:31.50 ruby
14284 git 20 0 554792 300168 10844 S 0.0 14.5 0:04.27 ruby
14287 git 20 0 546056 291068 10652 S 0.0 14.1 0:03.13 ruby
2705 mysql 20 0 1082876 287544 380 S 0.0 13.9 0:01.70 mysqld
14104 git 20 0 524072 276016 13324 S 0.0 13.4 0:24.69 ruby
14281 git 20 0 524072 267504 4812 S 0.0 13.0 0:00.00 ruby
13978 gitlab-+ 20 0 579824 39872 39280 S 0.0 1.9 0:00.12 postgres
1404 www 20 0 142196 31304 820 S 0.0 1.5 0:00.05 nginx
1405 www 20 0 142196 31304 820 S 0.0 1.5 0:00.05 nginx
1403 www 20 0 142196 30992 508 S 0.0 1.5 0:00.04 nginx
My machine only has 2GB of memory.
Is there a way to optimize the configuration and reduce the memory consumption?
Not really: see GitLab Requirements for memory
You need at least 8GB of addressable memory (RAM + swap) to install and use GitLab!
The operating system and any other running applications will also be using memory so keep in mind that you need at least 4GB available before running GitLab. With less memory GitLab will give strange errors during the reconfigure run and 500 errors during usage.
We recommend having at least 2GB of swap on your server, even if you currently have enough available RAM. Having swap will help reduce the chance of errors occurring if your available memory changes.
We also recommend configuring the kernel’s swappiness setting to a low value like 10 to make the most of your RAM while still having the swap available when needed.
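For reference, a low swappiness can be applied immediately and made persistent across reboots like this (run as root):
sysctl vm.swappiness=10
echo 'vm.swappiness=10' >> /etc/sysctl.conf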

Vultr virtual CPU vs DigitalOcean Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz

When I run a uname -ar command on Vultr command line I see the following:
Linux my.vultr.account.com 4.12.10-coreos #1 SMP Tue Sep 5 20:29:13
UTC 2017 x86_64 Virtual CPU a7769a6388d5 GenuineIntel GNU/Linux
On DigitalOcean I get:
Linux master 4.11.11-coreos #1 SMP Tue Jul 18 23:06:59 UTC 2017 x86_64
Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz GenuineIntel GNU/Linux
I don't know what the difference means. Is a virtual CPU worse than, the same as, or better than the "Intel(R) Xeon(R)" I see in the DigitalOcean output?
The real Intel Xeon E5-2650 v4 is a CPU with 12 cores. Depending on your VPS configuration, you get some number of cores assigned from that CPU. Hence "virtual CPUs".
As for the specs of Vultr, the official response from Vultr support is:
"We do not provide specific information on the CPUs we offer. They are all late-model Intel Xeon CPUs."
The a7769a6388d5 is a 2.4 GHz virtual CPU, according to:
wget freevps.us/downloads/bench.sh -O - -o /dev/null|bash
From there it could be any of a variety of 2.4 GHz Intel E5 Xeons from the v2, v3, or v4 generations. You can get to the bottom of it with:
cat /proc/cpuinfo
Family 6 Model 61 Stepping 2 = Broadwell? etc.
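To pull out just those identification fields on a Linux guest:
grep -E 'vendor_id|cpu family|model|stepping' /proc/cpuinfo | sort -u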
Tip: CPU speed is not the best way to compare your VPS though. Focus more on I/O speed, datacenter location, uplink speed and ping times.

Why is my 980 Ti outperforming my 1080?

While trying to make sure my new computer is set up properly, I noticed that its GeForce 1080 GPU significantly underperforms the 980 Ti in my old system when running TensorFlow jobs. Since the systems differ in more than just the GPU, I wrote a small benchmark to isolate GPU matrix-multiplication performance in TensorFlow. The results confirm that the new GPU is significantly slower. I know this has something to do with the installed software, but I've checked the obvious things: same python3, same cudnn, same numpy. What could be causing this strange performance gap?
Benchmarking Script:
import tensorflow as tf
import time

sess = tf.Session()
# Chain 1000 matmuls of a 1000x1000 random matrix; the single
# sess.run below executes the whole chain on the GPU
A = tf.random_uniform((1000, 1000))
for i in range(int(1e3)):
    A = tf.matmul(A, A)
# time.clock() was removed in Python 3.8; time.perf_counter() is the modern equivalent
cur_time = time.clock()
sess.run(A)
print(time.clock() - cur_time)
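Since only python3, cudnn, and numpy were compared, it may also be worth recording the driver version and clock state on both machines before reading the timings below; a quick check, assuming the NVIDIA driver's nvidia-smi tool is installed:
# GPU name, driver version, and current SM clock
nvidia-smi --query-gpu=name,driver_version,clocks.sm --format=csv
# TensorFlow version as seen by the interpreter
python3 -c 'import tensorflow as tf; print(tf.__version__)'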
Old System (980 Ti):
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.19
pciBusID 0000:01:00.0
Total memory: 5.93GiB
Free memory: 5.39GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:01:00.0)
time elapsed: 0.81484
New System (1080):
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.898
pciBusID 0000:03:00.0
Total memory: 7.92GiB
Free memory: 7.57GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0)
time elapsed: 1.2753620000000003

shmget return ENOMEM with 12GB free

I am trying to allocate 22 MB of shared memory using shmget(), but it fails with errno set to ENOMEM. The first lines of top's output look as if there were enough memory:
Processes: 114 total, 4 running, 110 sleeping, 579 threads
Load Avg: 0.50, 0.42, 0.35 CPU usage: 0.24% user, 0.60% sys, 99.15% idle
SharedLibs: 17M resident, 5356K data, 0B linkedit.
MemRegions: 20375 total, 1361M resident, 59M private, 1176M shared.
PhysMem: 1487M wired, 1887M active, 576M inactive, 3950M used, 12G free.
VM: 286G vsize, 1052M framework vsize, 123007(0) pageins, 0(0) pageouts.
The program runs on OS X 10.8.5. Any idea what the cause might be?
The following sysctl variables affect shared memory: kern.sysv.shmmax, kern.sysv.shmmin, kern.sysv.shmmni, kern.sysv.shmseg, and kern.sysv.shmall. Note that kern.sysv.shmall is measured in 4 KiB pages while kern.sysv.shmmax is in bytes, so kern.sysv.shmall should generally be set to at least kern.sysv.shmmax divided by 4096.
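A hedged example of inspecting and raising these limits on OS X (changes made with sysctl -w do not survive a reboot; put them in /etc/sysctl.conf to persist, and remember that shmall is counted in 4 KiB pages):
# Inspect the current System V shared memory limits
sysctl kern.sysv.shmmax kern.sysv.shmall kern.sysv.shmmni kern.sysv.shmseg kern.sysv.shmmin
# Example: allow segments up to 64 MB (67108864 bytes = 16384 pages)
sudo sysctl -w kern.sysv.shmmax=67108864
sudo sysctl -w kern.sysv.shmall=16384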
