zeromq high latency in multiprocessor multicore server

Running the latency test provided in perf/*_lat on a single-processor machine and on a multiprocessor server shows a huge difference in latency figures.
16-CPU machine:
./remote_lat tcp:// 30 1000
message size: 30 [B]
roundtrip count: 1000
average latency: 97.219 [us]
Single-CPU machine:
./remote_lat tcp:// 30 1000
message size: 30 [B]
roundtrip count: 1000
average latency: 27.195 [us]
In each case, both processes run on the same machine.
Using libzmq 4.2.5.
My laptop:
Intel Core i5, 7th generation
16-CPU server:
Intel(R) Xeon(R) CPU @ 2.30GHz


Performance of my MPI code does not improve when I use two NUMA nodes (dual Xeon chips)

I have a computer, a Precision-Tower-7810 with dual Xeon E5-2680v3 @ 2.50GHz × 48 threads.
Here is the output of lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Stepping: 2
CPU MHz: 1200.000
CPU max MHz: 3300,0000
CPU min MHz: 1200,0000
BogoMIPS: 4988.40
Virtualization: VT-x
L1d cache: 768 KiB
L1i cache: 768 KiB
L2 cache: 6 MiB
L3 cache: 60 MiB
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
My MPI code is based on basic MPI calls (Isend, Irecv, Wait, Bcast). Fundamentally, the data is distributed and sent to all processes. On each process, the data is used in a calculation and its values are changed. After that, the data held by each process is exchanged between all processes. This work is repeated up to a limit.
Now, the main issue is that performance increases as I raise the number of processes within the limit of one chip (24 threads). However, performance does not improve once the number of processes exceeds 24.
An example:
$mpiexec -n 6 ./mywork : 72s
$mpiexec -n 12 ./mywork : 46s
$mpiexec -n 24 ./mywork : 36s
$mpiexec -n 32 ./mywork : 36s
$mpiexec -n 48 ./mywork : 35s
I have tried both OpenMPI and MPICH, and the result is the same. So I suspect the issue lies in the physical connection (NUMA nodes) between the two chips. This is only my assumption; I have never used a real supercomputer. I hope someone knows this issue and can help me. Thank you for reading.
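To make the flattening easier to see, the timings above can be turned into speedup and parallel efficiency relative to the 6-process run. This is only a sketch: the `timings` dictionary is copied from the figures in the question, `scaling_table` is a hypothetical helper name, and using n=6 as the baseline is an assumption, since no single-process time was given.

```python
# Wall-clock times quoted in the question: process count -> seconds.
timings = {6: 72.0, 12: 46.0, 24: 36.0, 32: 36.0, 48: 35.0}


def scaling_table(timings):
    """Return (n, speedup, efficiency) tuples relative to the smallest run."""
    base_n = min(timings)
    base_t = timings[base_n]
    rows = []
    for n in sorted(timings):
        speedup = base_t / timings[n]   # how much faster than the baseline run
        ideal = n / base_n              # speedup under perfect scaling
        rows.append((n, speedup, speedup / ideal))
    return rows


rows = scaling_table(timings)
for n, speedup, efficiency in rows:
    print(f"n={n:2d}  speedup={speedup:.2f}  efficiency={efficiency:.0%}")
```

With these numbers the efficiency is already down to about 50% at 24 processes, and the speedup stays near 2.0 for 32 and 48. A plateau like this is often seen when a code is limited by memory bandwidth or communication rather than by core count, though the timings alone cannot confirm which.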

strange CPU binding/pinning result within OpenMPI

I have been evaluating an OpenMPI program that implements a matrix multiplication algorithm. The code scales very well on a machine in our laboratory that has a single thread per core (close to ideal speedup with 48 and 64 cores). However, on some other machines that are hyperthreaded, the behavior is strange. As the htop screenshot shows, the CPU utilization is different and odd when I run the same experiment with the same command. I executed the program with
mpirun --bind-to hwthread --use-hwthread-cpus -n 2 ...
Here I bind each MPI worker to a hardware thread, and with -n 2 the run should occupy exactly two processors (here, hwthreads). However, it seems to use another hwthread as well, at roughly 50% utilization! I find this strange because there is no extra CPU utilization on the other machines. I repeated the experiment many times, and I'm sure this is not a temporary check or something similar by the OS, but is due to the execution model of OpenMPI.
I would appreciate it if someone could explain this behavior and the extra CPU utilization when I run this on the hyperthreaded machine.
The output of lscpu is as below:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD Ryzen Threadripper 1950X 16-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 2200.000
CPU max MHz: 3400.0000
CPU min MHz: 2200.0000
BogoMIPS: 6786.36
Virtualization: AMD-V
L1d cache: 512 KiB
L1i cache: 1 MiB
L2 cache: 8 MiB
L3 cache: 32 MiB
The version of OpenMPI is the same on all machines: 2.1.1.
Maybe hyperthreading is not the cause and I was misled by it, but the only big differences between these environments are 1) hyperthreading and 2) the clock frequency of the processors, which varies from 2200 MHz to 4.8 GHz depending on the CPU.

What's the reason for strange experiment results?

I have a problem with an experiment on my computer. I ran 300 tests of a parallel algorithm (32 threads) and saw that about 10% of the tests have a shorter runtime than the others. It looks like this: 100 tests with a runtime of about 100 ms each, then 30 tests with a runtime of ~80 ms, and then 170 more tests with a runtime of ~100 ms. It happens in every experiment. I used OpenMP, TBB, pthreads, and std::thread, and it happens with every parallel technology.
What's the reason for that?
CPU: Intel® Core™ i7 Kaby Lake H 2800 - 3800 MHz
Cores: 4
Threads: 8

MPICH2 on a machine with two NUMA nodes

I am new to MPI. I am using MPICH2 on a Linux machine with the following information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Stepping: 4
CPU MHz: 799.844
CPU max MHz: 3000.0000
CPU min MHz: 800.0000
BogoMIPS: 4400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 14080K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
My understanding is that I've got 2 nodes, 20 cores and 40 threads (i.e. processors) on this machine. Is this correct? If yes, I think I should set MPICH to spawn 20 processes (one process on each physical core), right? However, when I run the command mpiexec -n 20 MyProgram, the average CPU usage is only 50%. If I change to mpiexec -n 40 MyProgram, the CPU usage is 100% but the overall performance is actually becoming worse so I think I might be over-specifying.
CPU usage is a misleading metric. CPU usage reflects the portion of time some task was scheduled on a logical CPU. The CPU average is just that: the average over all logical CPUs. So a 50% CPU average can simply mean that every other logical CPU has 100% usage (and the others 0%). You can therefore observe 50% even when every physical core is busy all the time.
CPU usage does not mean resource utilization. There are workloads that benefit from using hyperthreading and workloads that don't. There are workloads that can be faster using fewer threads than physical cores (e.g. memory bandwidth limited). There are workloads that can be faster using more threads than logical CPUs (e.g. I/O latency limited).
Always use your performance metric (e.g. time) to figure out the best configuration. If you want to understand resource utilization, you must look at many different performance metrics: cycles, instructions, memory bandwidth, cache, ....
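The averaging point can be checked with a little arithmetic on the lscpu values from the question. The sketch below is illustrative only: `expand_cpu_list` and `average_usage` are hypothetical helper names, and it assumes each MPI process keeps exactly one logical CPU 100% busy while the remaining logical CPUs stay idle.

```python
def expand_cpu_list(spec):
    """Expand an lscpu-style list such as '0-9,20-29' into sorted CPU ids."""
    cpus = []
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)


def average_usage(busy_cpus, total_cpus):
    """Average utilization (%) when busy_cpus run at 100% and the rest idle."""
    return 100.0 * busy_cpus / total_cpus


# Values copied from the lscpu output in the question.
sockets, cores_per_socket, threads_per_core = 2, 10, 2
numa_nodes = {0: "0-9,20-29", 1: "10-19,30-39"}

physical_cores = sockets * cores_per_socket          # 20
logical_cpus = physical_cores * threads_per_core     # 40
cpus_per_node = {node: len(expand_cpu_list(spec)) for node, spec in numa_nodes.items()}

print(physical_cores, logical_cpus, cpus_per_node)
print(average_usage(20, logical_cpus))  # mpiexec -n 20 under the stated assumption
```

Under that assumption, 20 busy logical CPUs out of 40 average to 50%, which matches the reading reported for `mpiexec -n 20` without implying that any physical core is idle.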

NodeJS on Ubuntu slow?

I just installed Ubuntu 10.10 Server with Node.js 0.4.6, following a guide, on my laptop:
Acer 5920G (Intel Core 2 Duo (2 GHz), 4 GB RAM)
After that I wanted to see how Node.js would perform, so I wrote this little hello world script:
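var http = require('http');
// Minimal HTTP server that answers every request with a short HTML body.
http.createServer(function(req, res) {
  res.writeHead(200, {'Content-Type': 'text/html'});
  res.write('Hello World');
  res.end(); // finish the response; without this the client never gets a complete reply
}).listen(8000); // port 8000, matching "Server Port" in the ab output below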
Now, to test the performance, I used Apache Benchmark on Windows with the following settings:
ab -r -c 1000 -n 10000
But the results are very low compared to
Server Software:
Server Hostname:
Server Port: 8000
Document Path: /
Document Length: 12 bytes
Concurrency Level: 1000
Time taken for tests: 23.373 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 760000 bytes
HTML transferred: 120000 bytes
Requests per second: 427.84 [#/sec] (mean)
Time per request: 2337.334 [ms] (mean)
Time per request: 2.337 [ms] (mean, across all concurrent requests)
Transfer rate: 31.75 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 1.3 1 28
Processing: 1236 2236 281.2 2327 2481
Waiting: 689 1522 169.5 1562 1785
Total: 1237 2238 281.2 2328 2484
Percentage of the requests served within a certain time (ms)
50% 2328
66% 2347
75% 2358
80% 2364
90% 2381
95% 2397
98% 2442
99% 2464
100% 2484 (longest request)
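The headline figures in this report follow from three inputs (request count, concurrency, total time), which helps in reading them. The sketch below recomputes them; `ab_summary` is a hypothetical helper, and the formulas are the usual definitions of these ab fields, applied to the rounded values printed above, so the last digits can differ slightly from the report.

```python
def ab_summary(total_requests, concurrency, seconds, total_bytes):
    """Recompute the main ab figures from the raw totals."""
    requests_per_second = total_requests / seconds
    # Mean time per request across all concurrent requests, in ms.
    per_request_across_ms = 1000.0 * seconds / total_requests
    # Mean time per request as seen by one of the concurrent clients, in ms.
    per_request_ms = concurrency * per_request_across_ms
    bytes_per_request = total_bytes / total_requests
    return {
        "requests_per_second": requests_per_second,
        "per_request_ms": per_request_ms,
        "per_request_across_ms": per_request_across_ms,
        "bytes_per_request": bytes_per_request,
    }


# Totals copied from the report above.
summary = ab_summary(total_requests=10000, concurrency=1000,
                     seconds=23.373, total_bytes=760000)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Read this way, the roughly 2.3 s per request is the 2.3 ms of work per request multiplied by the 1000 requests kept in flight, and each request moves only 76 bytes.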
Anyone got a clue? (Compile, hardware problem, drivers, configuration, slow script)
Edit 4-17 14:04 GMT+1
I am testing the machine over a 1 Gbit local connection. When I ping it, I get 0 ms, so that should be good, I guess. When I run Apache Benchmark on my Windows 7 machine, its CPU usage rises to 100% :|
It seems like you are running the test over a medium with a high bandwidth-delay product; in your case, high latency (>1 s). Assuming a 1 s delay, a 100 Mbit/s link and 76 bytes per request, you need more than 150000 requests in parallel to saturate it.
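The estimate above can be reproduced directly. This is a small sketch under the same assumptions as the answer (1 s delay, 100 Mbit/s link, 76 bytes per request); `requests_in_flight` is a hypothetical helper name.

```python
def requests_in_flight(link_bits_per_second, delay_seconds, bytes_per_request):
    """Return (bandwidth-delay product in bytes, requests needed to fill it)."""
    bdp_bytes = link_bits_per_second / 8 * delay_seconds
    return bdp_bytes, bdp_bytes / bytes_per_request


bdp_bytes, needed = requests_in_flight(
    link_bits_per_second=100e6,  # 100 Mbit/s link (assumed in the answer)
    delay_seconds=1.0,           # 1 s delay (assumed in the answer)
    bytes_per_request=76,        # 760000 bytes / 10000 requests in the ab output
)
print(f"bandwidth-delay product: {bdp_bytes:.0f} bytes")
print(f"requests in flight to saturate the link: {needed:.0f}")
```

That gives a bandwidth-delay product of 12.5 MB and a little over 164000 requests in flight, consistent with the "more than 150000" figure.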
First, test the latency (with ping or so). Also, watch the CPU and network usage on all participating machines. This will give you an indication of the bottleneck in your tests. What are the benchmark results for an Apache webserver?
Also, it could be a hardware/driver problem. Watch dmesg on both machines. And although it's probably not the reason for this specific problem, don't forget to change the CPU speed governor to performance on both machines!
