What's the reason for the strange experiment results? - parallel-processing

I have a problem with an experiment on my computer. I ran 300 tests of a parallel algorithm (32 threads) and saw that the runtime of about 10% of the tests is lower than that of the others. It looks like this: 100 tests with a runtime of about 100 ms each, then 30 tests with a runtime of ~80 ms, and then another 170 tests with a runtime of ~100 ms. It happens in every experiment. I have used OpenMP, TBB, pthreads, and std::thread, and it happens with every parallel technology.
What is the reason for that?
CPU: Intel® Core™ i7 Kaby Lake H 2800 - 3800 MHz
Cores: 4
Threads: 8

Related

Strange CPU binding/pinning result within OpenMPI

I have evaluated an OpenMPI program that implements a matrix multiplication algorithm. The code scales very well on a machine in our laboratory with a single thread per core (close to ideal speedup with 48 and 64 cores). However, on some other machines, which are hyperthreaded, it behaves strangely. As you can see in the screenshot from htop, the CPU utilization is different when I run the same experiment with the same command. I executed the program with
mpirun --bind-to hwthread --use-hwthread-cpus -n 2 ...
Here I bind the MPI workers to hwthreads, and -n 2 restricts the execution to two processors (here, hwthreads). However, it seems to use another hwthread as well, at roughly 50% utilization. I find this strange because there is no extra CPU utilization on the other machines. I have repeated this experiment many times, and I am sure this is not a temporary check or something else done by the OS, but is due to the execution model of OpenMPI.
I would appreciate it if someone could explain this behavior and the extra CPU utilization when I run this on the hyperthreaded machine.
The output of lscpu is as below:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD Ryzen Threadripper 1950X 16-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 2200.000
CPU max MHz: 3400.0000
CPU min MHz: 2200.0000
BogoMIPS: 6786.36
Virtualization: AMD-V
L1d cache: 512 KiB
L1i cache: 1 MiB
L2 cache: 8 MiB
L3 cache: 32 MiB
The version of OpenMPI is the same on all machines: 2.1.1.
Maybe hyperthreading is not the cause and I was misled by it, but the only big differences between these environments are 1) hyperthreading and 2) the clock frequency of the processors, which ranges from 2200 MHz to 4.8 GHz across the different CPUs.

How to calculate speedup and efficiency in a hybrid CPU and GPU algorithm?

I have an algorithm that I have executed in parallel using only the CPU, and I achieved a speedup of 30x, that is, an efficiency of about 0.94 (efficiency = speedup/cores, i.e. 30/32 ≈ 0.94).
Later I added 2 GPUs (Tesla C2075, 448 cores each) to the 32 CPU cores.
To calculate the efficiency including CPUs and GPUs, should I add the number of GPU cores to the CPU cores? That is, I would calculate the efficiency using 928 cores (32 + 448 + 448 = 928). Or should it be calculated differently?
Speedup and efficiency have been calculated as described here:
https://software.intel.com/en-us/articles/predicting-and-measuring-parallel-performance
GPUs have larger "core complex" units called "SM" (Nvidia) or "CU" (AMD), each with tens of pipelines. Although they are not very similar to the SIMD units of a CPU, they can issue commands to these pipelines in parallel from "single-threaded" kernel code.
You counted CPU cores and not SIMD pipelines (of which there are 4 to 16 times as many as cores), so it would not be wrong to count the SM units of Nvidia, the CUs of AMD, the slice subsets of Intel, and so on.
The Tesla C2075 has 14 SM units, so you could add 14 for each GPU (32 + 14 + 14).
If you have also used SIMD-ified code on the CPU, then it would not be wrong to count each pipeline of a GPU, of which there are 32 to 192 times the number of SMs/CUs (448 per GPU in your case): 32*SIMD_WIDTH + 448 + 448.
At least this is how I would compute "core efficiency" and "pipeline efficiency". If data transfer to/from the GPU is not a bottleneck, efficiency should not drop much after the GPUs are added.
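The two counting conventions described in this answer can be written out as a minimal Python sketch. The unit counts (32 CPU cores, 14 SMs and 448 pipelines per Tesla C2075) come from the question and answer; the function names, the SIMD width of 4, and the hybrid speedup of 50x are hypothetical values chosen only to make the example run.

```python
def efficiency(speedup: float, units: int) -> float:
    """Parallel efficiency = speedup / number of processing units."""
    if units <= 0:
        raise ValueError("units must be positive")
    return speedup / units


CPU_CORES = 32           # from the question
NUM_GPUS = 2             # from the question
SMS_PER_GPU = 14         # Tesla C2075 SM count, from the answer
PIPELINES_PER_GPU = 448  # "cores" per GPU, from the question


def core_level_units() -> int:
    """Count CPU cores plus GPU SMs ("core efficiency")."""
    return CPU_CORES + NUM_GPUS * SMS_PER_GPU


def pipeline_level_units(simd_width: int) -> int:
    """Count CPU SIMD lanes plus GPU pipelines ("pipeline efficiency")."""
    return CPU_CORES * simd_width + NUM_GPUS * PIPELINES_PER_GPU


if __name__ == "__main__":
    # CPU-only case from the question: 30 / 32.
    print("CPU only:", efficiency(30.0, CPU_CORES))

    hybrid_speedup = 50.0  # hypothetical measured speedup with both GPUs
    simd_width = 4         # hypothetical CPU SIMD width
    print("core level:", efficiency(hybrid_speedup, core_level_units()))
    print("pipeline level:",
          efficiency(hybrid_speedup, pipeline_level_units(simd_width)))
```

Whichever convention is chosen, it has to be applied to the CPU and the GPUs alike, otherwise the CPU-only and hybrid efficiencies are not comparable.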

JMeter performance analysis

I am running a performance test in the perf environment.
Below are the results:
CPU Utilization
Server     Apdex [T]   Resp. time  Throughput  Error rate  CPU usage  Memory
per001205  0.97 [0.5]  220 ms      2,670 rpm   0.0009 %    493.00%    2.2 GB
per001206  0.95 [0.5]  280 ms      2,670 rpm   0.0043 %    516.00%    2.4 GB
per011079  0.83 [0.5]  526 ms      2,670 rpm   0.0034 %    598.00%    2.5 GB
per011080  0.67 [0.5]  1,110 ms    2,670 rpm   0.0026 %    639.00%    2.6 GB
Can you comment on the average response time? Is it acceptable?
I can see that CPU usage is more than 100%. Is that dangerous?
How can I improve this? I am running it for 250 users.
First of all, check out the "CPU usage mismatch or usage over 100%" article.
Consider another monitoring method: go to the hosts directly and check CPU usage with your operating system's built-in commands, or use the JMeter PerfMon plugin, to either confirm the picture or get an alternative view of the CPU load. Depending on the result, you have 2 options:
Either the CPU usage of the individual servers is acceptable, and you can decide whether the throughput is good or not
Or you need to fix the issue in your application code: use profiling tools for the programming language your application is written in to detect the most CPU-intensive functions, and refactor them to be less processor-time-hungry
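A CPU figure above 100% often means that the monitoring tool sums usage over all logical CPUs, so each CPU contributes up to 100%. Under that assumption, the reported values can be normalized as in the sketch below. The count of 8 logical CPUs per server is hypothetical (the question does not state it), and the function name is made up for illustration.

```python
def per_cpu_percent(total_percent: float, logical_cpus: int) -> float:
    """Average utilization per logical CPU, assuming the total is a sum over CPUs."""
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    return total_percent / logical_cpus


# CPU usage values reported in the question.
REPORTED = {
    "per001205": 493.0,
    "per001206": 516.0,
    "per011079": 598.0,
    "per011080": 639.0,
}

ASSUMED_LOGICAL_CPUS = 8  # hypothetical: not given in the question

if __name__ == "__main__":
    for server, total in REPORTED.items():
        share = per_cpu_percent(total, ASSUMED_LOGICAL_CPUS)
        print(f"{server}: {share:.1f}% per logical CPU")
```

The real number of logical CPUs should be read from each host before drawing any conclusion from the normalized values.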

Unenhanced performance of MATLAB GPU computing

With the intention of comparing the speed of GPU vs CPU computing, I ran the example code available here (a Mandelbrot set on the GPU) from MATLAB Central. Below are the results that I obtained:
Case 1 (without GPU): 6.2 secs
Case 2 (using parallel.gpu.GPUArray): 6.518 secs (1.39 secs in the example)
Case 3 (Using Element-wise Operation): 1.259 secs (0.14 secs in the example)
As can be seen, there is no improvement in case 2 and only a modest improvement of around 5 times in case 3. As the example did not state the details of the GPU used, may I know whether this is simply due to the limited capability of my graphics card, or am I missing something important?
The graphics card is also responsible for driving my display (HP Z Display Z23i 23-inch IPS LED Backlit Monitor).
CPU: Intel i7-4790, 3.6 GHz (4 cores, 8 threads)
GPU:
Name: 'NVS 510'
Index: 1
ComputeCapability: '3.0'
SupportsDouble: 1
DriverVersion: 6
ToolkitVersion: 5
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 2.1475e+09
FreeMemory: 1.6934e+09
MultiprocessorCount: 1
ClockRateKHz: 797000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Thank you!
Edit
The GPU used in the example here is a Tesla C2050. (Credits to @Sam Roberts)
The times on that link are most likely for a different GPU than yours. They don't specify what kind of graphics card they're using, but my guess is that they're using a higher-end card.
After Googling the NVS 510, its specs look similar to those of the card in my machine. However, your card is geared towards business use, while mine is geared towards gaming. I have a GTX 660, which is one of the higher-end GPUs available on the market.
These are the attributes of my graphics card:
CUDADevice with properties:
Name: 'GeForce GTX 660'
Index: 1
ComputeCapability: '3.0'
SupportsDouble: 1
DriverVersion: 6.5000
ToolkitVersion: 5.5000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 2.1475e+09
FreeMemory: 1.5357e+09
MultiprocessorCount: 5
ClockRateKHz: 1084500
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
The differences between my card and yours are that I have 5 multiprocessors to your 1, and my clock rate is about 300 MHz faster than yours. For a side-by-side comparison, see the specifications of both cards:
NVS 510: http://www.nvidia.ca/object/nvs-510-graphics-card.html#pdpContent=2
GTX 660: http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-660/specifications
Upon further inspection, my card has a much higher memory bandwidth than yours. I also have 960 GPU cores compared to your 192.
I decided to run these tests to compare my performance with your timings. My CPU is an Intel i7-4770 (3.6 GHz) and I have 16 GB of RAM in my machine.
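The core counts quoted here follow from the device listings: both cards report compute capability 3.0, and that architecture has 192 CUDA cores per multiprocessor. A minimal Python sketch of the arithmetic, using the MultiprocessorCount and ClockRateKHz values from the two listings (the function names are made up, and cores times clock is only a crude proxy that ignores memory bandwidth):

```python
# CUDA cores per multiprocessor, by compute capability.
# Only the capability reported by both cards is listed.
CORES_PER_MULTIPROCESSOR = {"3.0": 192}


def cuda_cores(multiprocessor_count: int, compute_capability: str) -> int:
    """Total CUDA cores = multiprocessors x cores per multiprocessor."""
    return multiprocessor_count * CORES_PER_MULTIPROCESSOR[compute_capability]


def raw_throughput_proxy(cores: int, clock_khz: int) -> int:
    """Crude proxy for peak compute: cores x clock (ignores memory bandwidth)."""
    return cores * clock_khz


# Values taken from the two device listings.
nvs_510 = cuda_cores(1, "3.0")
gtx_660 = cuda_cores(5, "3.0")

if __name__ == "__main__":
    print("NVS 510 cores:", nvs_510)
    print("GTX 660 cores:", gtx_660)
    ratio = raw_throughput_proxy(gtx_660, 1084500) / raw_throughput_proxy(nvs_510, 797000)
    print(f"GTX 660 / NVS 510 proxy ratio: {ratio:.1f}")
```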
The times that I get by running those examples are the following:
Case #1 - Without GPU: 6.46 seconds
Case #2 - Naive GPU: 0.82 seconds - 7.9x faster
Case #3 - Through CUDA: 0.09 seconds - 71.7x faster
With this, my guess is that your graphics card is less capable than the one used in the tests that MathWorks performed. Maybe try updating your graphics drivers and see if that helps. However, my guess is that my performance is much better because of the higher multiprocessor count, faster clock, larger number of cores, and higher memory bandwidth.

Application not running at full speed?

I have the following scenario:
machine 1: receives messages from outside and processes them (via a Java application). For processing it relies on a database (on machine 2)
machine 2: an Oracle DB
As a performance metric I usually look at the number of processed messages per unit of time.
Now, what puzzles me: neither of the 2 machines is working at "full speed". If I look at typical parameters (CPU utilization, CPU load, I/O bandwidth, etc.), both machines look as if they do not have enough to do.
What I expect is that one machine, or one of the performance-related parameters, limits the overall processing speed. Since I cannot observe this, I would expect a higher message processing rate.
Any ideas what might limit the overall performance? What is the bottleneck?
Here are some key values during workload:
Machine 1:
CPU load average: 0.75
CPU Utilization: System 12%, User 13%, Wait 5%
Disk throughput: 1 MB/s (write), almost no reads
average tps (as reported by iostat): 200
network: 500 kB/s in, 300 kB/s out, 1600 packets/s in, 1600 packets/s out
Machine 2:
CPU load average: 0.25
CPU Utilization: System 3%, User 15%, Wait 17%
Disk throughput: 4.5 MB/s (write), 3.5 MB/s (read)
average tps (as reported by iostat): 190 (very short peaks to 1000-1500)
network: 250 kB/s in, 800 kB/s out, 1100 packets/s in, 1100 packets/s out
So to me, none of the values seems to be at any limit.
PS: for testing, of course, the message queue is always full, so that both machines have enough work to do.
To find bottlenecks you typically also need to measure INSIDE the application. That means profiling the Java application code and possibly what happens inside Oracle.
The good news is that you have excluded at least some possible hardware bottlenecks.
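As a rough illustration of what measuring inside the application can look like, here is a minimal sketch that accumulates wall-clock time per named section. It is written in Python for brevity, although the application in question is Java, where a profiler would normally do this job. The section names, the process_message function, and the 2 ms sleep standing in for a database round trip are all hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

totals = defaultdict(float)  # seconds of wall-clock time per section
counts = defaultdict(int)    # number of times each section was entered


@contextmanager
def timed(section: str):
    """Add the wall-clock time spent inside the block to totals[section]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[section] += time.perf_counter() - start
        counts[section] += 1


def process_message(msg: str) -> str:
    """Hypothetical message handler with two instrumented sections."""
    with timed("db_round_trip"):
        time.sleep(0.002)  # stand-in for waiting on the database
    with timed("local_processing"):
        return msg.upper()


def report():
    """Return (section, total seconds, share of all measured time), largest first."""
    grand_total = sum(totals.values()) or 1.0
    rows = [(name, secs, secs / grand_total) for name, secs in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for i in range(50):
        process_message(f"message {i}")
    for name, secs, share in report():
        print(f"{name}: {secs:.3f} s ({share:.0%}), {counts[name]} calls")
```

A breakdown like this shows where each message spends its time, including time spent waiting, which does not appear as CPU or disk load on either machine.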
