Measuring Efficiency of GPU Program - performance

I have both serial and parallel program using GPU.
The serial program takes 112.9 seconds to finish.
The parallel program with GPU takes 3.16 second to finish.
Thus, I have the speedup of 35.73.
Can I measure the efficiency of the program using the formula SpeedupTime/NumberOfThread ?
The threads will be 1024

The efficiency is the ratio of the time on the CPU to time on GPU. You might want to try multicore implementation as well and compare it with GPU implementation.

Related

Type of CPU load to perform constant relaxed work

I'm trying to figure out how to program a certain type of load to the CPU that makes it work constantly but with average stress.
The only approach I know how to load a CPU with some work to do which wouldn't be at its maximum possible performance is when we alternate the part of giving the CPU something to do with sleep for some time. E.g. to achieve 20% CPU usage, do some computation which would take e.g. 0.2 seconds and then sleep for 0.8 seconds. Then the CPU usage will be roughly 20%.
However this essentially means the CPU will be jumping between max performance to idle all the time.
I wrote a small Python program where I'm making a process for each CPU core, set its affinity so each process runs on a designated core, and I'm giving it some absolutely meaningless load:
def actual_load_cycle():
x = list(range(10000))
del x
while repeating a call to this procedure in cycle and then sleeping for some time to ensure the working time is N% of total time:
while 1:
timer.mark_time(timer_marker)
for i in range(coef):
actual_load_cycle()
elapsed = timer.get_time_since(timer_marker)
# now we need to sleep for some time. The elapsed time is CPU_LOAD_TARGET% of 100%.
time_to_sleep = elapsed / CPU_LOAD_TARGET * (100 - CPU_LOAD_TARGET)
sleep(time_to_sleep)
It works well, giving the load within 7% of desired value of CPU_LOAD_TARGET - I don't need a precise amount of load.
But it sets the temperature of the CPU very high, with CPU_LOAD_TARGET=35 (real CPU usage reported by the system is around 40%) the CPU temps go up to 80 degrees.
Even with the minimal target like 5%, the temps are spiking, maybe just not as much - up to 72-73.
I believe the reason for this is that those 20% of time the CPU works as hard as it can, and it doesn't get cooler fast enough while sleeping afterwards.
But when I'm running a game, like Uncharted 4, the CPU usage as measured by MSI Afterburner is 42-47%, but the temperatures stay under 65 degrees.
How can I achieve similar results, how can I program such load to make CPU usage high but the work itself would be quite relaxed, as is done e.g. in the games?
Thanks!
The heat dissipation of a CPU is mainly dependent of its power consumption which is very dependent of the workload, and more precisely the instruction being executed and the number of active cores. Modern processors are very complex so it is very hard to predict the power consumption based on a given workload, especially when the executed code is a Python code executed in the CPython interpreter.
There are many factors that can impact the power consumption of a modern processors. The most important one is frequency scaling. Mainstream x86-64 processors can adapt the frequency of a core based on the kind of computation done (eg. use of wide SIMD floating-point vectors like the ZMM registers of AVX-512F VS scalar 64-bit integers), the number of active cores (the higher the number of core the lower the frequency), the current temperature of the core, the time executing instructions VS sleeping, etc. On modern processor, the memory hierarchy can take a significant amount of power so operations involving the memory controller and more generally the RAM can eat more power than the one operating on in-core registers. In fact, regarding the instructions actually executed, the processor needs to enable/disable some parts of its integrated circuit (eg. SIMD units, integrated GPU, etc.) and not all can be enabled at the same time due to TDP constraints (see Dark silicon). Floating-point SIMD instructions tend to eat more energy than integer SIMD instructions. Something even weirder: the consumption can actually be dependent of the input data since transistors may switch more frequently from one state to another with some data (researchers found this while computing matrix multiplication kernels on different kind of platforms with different kind of input). The power is automatically adapted by the processor since it would be insane (if even possible) for engineers to actually consider all possible cases and all possible dynamic workload.
One of the cheapest x86 instruction is NOP which basically mean "do nothing". That being said, the processor can run at the highest turbo frequency so to execute a loop of NOP resulting in a pretty high power consumption. In fact, some processor can run the NOP in parallel on multiple execution units of a given core keeping busy all the available ALUs. Funny point: running dependent instructions with a high latency might actually reduce the power consumption of the target processor.
The MWAIT/MONITOR instructions provide hints to allow the processor to enter an implementation-dependent optimized state. This includes a lower-power consumption possibly due to a lower frequency (eg. no turbo) and the use of sleep states. Basically, your processor can sleep for a very short time so to reduce its power consumption and then be able to use a high frequency for a longer time due to a lower power / heat-dissipation before. The behaviour is similar to humans: the deeper the sleep the faster the processor can be after that, but the deeper the sleep the longer the time to (completely) wake up. The bad news is that such instruction requires very-high privileges AFAIK so you basically cannot use them from a user-land code. AFAIK, there are instructions to do that in user-land like UMWAIT and UMONITOR but they are not yet implemented except maybe in very recent processors. For more information, please read this post.
In practice, the default CPython interpreter consumes a lot of power because it makes a lot of memory accesses (including indirection and atomic operations), does a lot of branches that needs to be predicted by the processor (which has special power-greedy units for that), performs a lot of dynamic jumps in a large code. The kind of pure-Python code executed does not reflect the actual instructions executed by the processor since most of the time will be spent in the interpreter itself. Thus, I think you need to use a lower-level language like C or C++ to better control kind of workload to be executed. Alternatively, you can use JIT compiler like Numba so to have a better control while still using a Python code (but not a pure-Python one anymore). Still, one should keep in mind that the JIT can generate many unwanted instructions that can result in an unexpectedly higher power consumption. Alternatively, a JIT compiler can optimize trivial codes like a sum from 1 to N (simplified as just a N*(N+1)/2 expression).
Here is an example of code:
import numba as nb
def test(n):
s = 1
for i in range(1, n):
s += i
s *= i
s &= 0xFF
return s
pythonTest = test
numbaTest = nb.njit('(int64,)')(test) # Compile the function
pythonTest(1_000_000_000) # takes about 108 seconds
numbaTest(1_000_000_000) # takes about 1 second
In this code, the Python function takes 108 times more time to execute than the Numba function on my machine (i5-9600KF processor) so one should expect a 108 higher energy needed to execute the Python version. However, in practice, this is even worse: the pure-Python function causes the target core to consume a much higher power (not just more energy) than the equivalent compiled Numba implementation on my machine. This can be clearly seen on the temperature monitor:
Base temperature when nothing is running: 39°C
Temperature during the execution of pythonTest: 55°C
Temperature during the execution of numbaTest: 46°C
Note that my processor was running at 4.4-4.5 GHz in all cases (due to the performance governor being chosen). The temperature is retrieved after 30 seconds in each cases and it is stable (due to the cooling system). The function are run in a while(True) loop during the benchmark.
Note that game often use multiple cores and they do a lot of synchronizations (at least to wait for the rendering part to be completed). A a result, the target processor can use a slightly lower turbo frequency (due to TDP constraints) and have a lower temperature due to the small sleeps (saving energy).

Embarrasingly parallel execution, no speedup (MEEP, openMPI)

I've been trying to exploit parallelization to run some simulations with the MEEP simulation software a bit faster. By default the software only uses one CPU, and FDTD simulations are easily sped up by parallelization. In the end I found there was no difference between running 1 or 4 cores, the simulation times were the same.
I then figured I would instead run individual simulations on each core to increase my total simulation throughput (for example running 4 different simulations at the same time).
What I found surprising is that whenever I start a new simulation, the already started simulations would slow down, even though they run on separate cores. For example, if I run only 1 simulation on 1 core, each time step of the FDTD simulation takes around 0.01 seconds. If I start another process on another core, each simulation now spends 0.02 seconds per time step, and so on, meaning that even when I run different simulations that have nothing to do with each other on separate cores, they all slow down giving me no net increase in speed.
I'm not necessarily looking for help to solve this problem as much as I'm looking for help understanding it, because it peaked my curiosity. Each instance of the simulation requires less than 1% of my total memory, so it's not a memory issue. The only thing I can think of is the cores sharing the cache memory, or the memory bandwidth being saturated, is there any way to check if this is the case?
The simulations are fairly simple and I've ran programs which are much more memory hungry than this one and had great speedup with parallelization.
Any tips to help me understand this phenomena?
I think it should be better look on bigger simulations because the well known issue with the turbo boost like technology (the single core performance change with the number of threads) cannot explain your result. It will explain just if have a single core processor.
So, I think that can be explain with memory cache levels. Maybe if you try simulations much bigger than L3 Cache (> 8MB for i7).
MY test on a Intel(R) Core(TM) i7-3517U CPU # 1.90GHz Dual Core (4 Threads). All simulations for 1 mpi thread (-np 1)
10mb simulation:
Four simulation 0.0255 s/step
Two simulation 0.0145 s/step
One simulation 0.0129 s/step
100mb simulation:
Four simulation 1.13 s/step
Two simulation 0.61 s/step
One simulation 0.53 s/step
A curious thing is that two simulation with 2 threads each run at almost the same speed as two simulations with 1 thread.

Estimate parallel efficiency using unicore processor

We know that the parallel efficiency of a program running on a multicore system can be calculated as speedup/N where N is the number of cores. So in order to use this formula first we need to execute the code on a multicore system and need to know the speedup.
I would like to know if I don't have a multicore system,then is it possible to estimate the speedup of the given code on a multicore system just by executing it on an unicore processor?
I have access to performance counters ( Instruction per cycles , number of cache misses , Number of Instructions etc) and I only have binaries of the code.
[Note: I estimated the parallel_running_time (T_P) = serial_running_time/N but this estimation has unacceptable error]
Thanks
Read up on Amdahl's Law, especially the bit about parallelization.
For you to determine how much you can speed up your program, you have to know what parts of the program can benefit from parallelization and what parts must be executed sequentially. If you know that, and if you know how long the serial and the parallel parts take (individually) on a single processor, then you can estimate how fast the program will be on multiple processors.
From your description, it seems that you don't know which parts can make use of parallel processing and which parts have to be executed sequentially. So it won't be possible to estimate the parallel running time.

Does modern GPU (e.g Fermi/Evergreen) supports out of order execution?

I am writing an OpenCL kernel which involves a few barriers in a loop. I have tested the kernel on CPU (8-core FX8150) and the result shows these barriers reduced running speed by a factor of 50~100 times (I further verified this by re-implementing the kernel on Java using multi-threading + CyclicBarrier). I suspect the reason was barrier essentially stops the CPU taking advantage of out-of-order execution, so I am a little worried if I would observe the same magnitude of speed decrease on GPU. I checked a few official documents and googled around a bit but there is little information available on this topic.
Current state-of-the art GPUs are in-order pipelined processor. GPUs fill the pipeline effectively by interleaving instructions from different warps (wavefronts). In comparisons, CPUs use out-of-order speculative execution to fill the pipeline. There are different functional units like ALUs and SFUs which have separated pipelines. But notice that instruction dependency stalls the warp. For more information on instruction dependency resolving on GPUs refer to this NVIDIA patent.
NVIDIA’s Next Generation
CUDA Compute and Graphics Architecture, Code-Named “Fermi”:
Nvidia GigaThread Engine has capabilities of(at page 5)
10x faster application context switching
Concurrent kernel execution
Out of Order thread block execution :)
Dual overlapped memory transfer engines
Evergreen has SIMD capabilities and has a chance outperform some fermi but i dont know about oooe of it. There is also "local atomic add" upper hand of HD 7000 series compared to GTX 600 series (nearly 10x faster)

How to evaluate CUDA performance?

I programmed CUDA kernel my own.
Compare to CPU code, my kernel code is 10 times faster than CPUs.
But I have question with my experiments.
Does my program fully be optimized using all GPU cores, proper shared memory use, adequate register count, enough occupancy?
How can I evaluate my kernel code's performance?
How can I calcuate CUDA's maximum throughput theoretically?
Am I right that compare between CPU's GFLOPS and GPU's GFLOPS and GFLOPS rate is transparent theoretical performance of them?
Thanks in advance.
Does my program fully be optimized using all GPU cores, proper shared memory use, adequate register count, enough occupancy?
To find this out, you use one of the CUDA profilers. See How Do You Profile & Optimize CUDA Kernels?
How can I calcuate CUDA's maximum throughput theoretically?
That math is slightly involved, different for each architecture and easy to get wrong. Better to look the numbers up in the specs for your chip. There are tables on Wikipedia, such as this one, for the GTX500 cards. For instance, you can see from the table that a GTX580 has a theoretical peak bandwidth of 192.4GB/s and compute throughput of 1581.1GFLOPs.
Am I right that compare between CPU's GFLOPS and GPU's GFLOPS and GFLOPS rate is transparent theoretical performance of them?
If I understand correctly, you are asking if the number of theoretical peak GFLOPs on a GPU can be directly compared with the corresponding number on a CPU. There are some things to consider when comparing these numbers:
Older GPUs did not support double precision (DP) floating point, only single precision (SP).
GPUs that do support DP do so with a significant performance degradation as compared to SP. The GFLOPs number I quoted above was for SP. On the other hand, numbers quoted for CPUs are often for DP, and there is less difference between the performance of SP and DP on a CPU.
CPU quotes can be for rates that are achievable only when using SIMD (single instruction, multiple data) vectorized instructions, and is typically very hard to write algorithms that can approach the theoretical maximum (and they may have to be written in assembly). Sometimes, CPU quotes are for a combination of all computing resources available through different types of instructions and it often virtually impossible to write a program that can exploit them all simultaneously.
The rates quoted for GPUs assume that you have enough parallel work to saturate the GPU and that your algorithm is not bandwidth bound.
The preferred measure of performance is elapsed time. GFLOPs can be used as a comparison method but it is often difficult to compare between compilers and architectures due to differences in instruction set, compiler code generation, and method of counting FLOPs.
The best method is to time the performance of the application. For the CUDA code you should time all code that will occur per launch. This includes memory copies and synchronization.
Nsight Visual Studio Edition and the Visual Profiler provide the most accurate measurement of each operation. Nsight Visual Studio Edition provides theoretical bandwidth and FLOPs values for each device. In addition the Achieved FLOPs experiment can be used to capture the FLOP count both for single and double precision.

Resources