How to evaluate CUDA performance?

I wrote my own CUDA kernel.
Compared to the equivalent CPU code, my kernel is 10 times faster.
But I have some questions about my experiments.
Is my program fully optimized: does it use all GPU cores, use shared memory properly, keep the register count reasonable, and achieve enough occupancy?
How can I evaluate my kernel code's performance?
How can I calculate CUDA's theoretical maximum throughput?
Am I right that comparing the CPU's GFLOPS and the GPU's GFLOPS gives a transparent picture of their theoretical performance?
Thanks in advance.

Is my program fully optimized: does it use all GPU cores, use shared memory properly, keep the register count reasonable, and achieve enough occupancy?
To find this out, use one of the CUDA profilers. See How Do You Profile & Optimize CUDA Kernels?
How can I calculate CUDA's theoretical maximum throughput?
That math is slightly involved, different for each architecture, and easy to get wrong. It is better to look the numbers up in the specs for your chip. There are tables on Wikipedia, such as this one for the GTX 500 cards. For instance, you can see from the table that a GTX 580 has a theoretical peak bandwidth of 192.4 GB/s and a compute throughput of 1581.1 GFLOP/s.
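If you prefer to query the card at run time, the device properties reported by the CUDA runtime give the raw numbers that the bandwidth figure is derived from. A minimal sketch, assuming cudaGetDeviceProperties and the usual doubling factor for DDR memory; the vendor spec sheet remains the authoritative source:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Estimate theoretical peak memory bandwidth from the memory clock and
    // bus width reported by the driver. The factor of 2 assumes DDR memory;
    // cross-check the result against the spec sheet for your card.
    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        double peakGBs = 2.0 * prop.memoryClockRate * 1e3    // kHz -> Hz
                             * (prop.memoryBusWidth / 8.0)   // bits -> bytes per transfer
                             / 1e9;                          // bytes/s -> GB/s
        printf("%s: ~%.1f GB/s theoretical peak memory bandwidth\n", prop.name, peakGBs);
        return 0;
    }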
Am I right that comparing the CPU's GFLOPS and the GPU's GFLOPS gives a transparent picture of their theoretical performance?
If I understand correctly, you are asking if the number of theoretical peak GFLOPs on a GPU can be directly compared with the corresponding number on a CPU. There are some things to consider when comparing these numbers:
Older GPUs did not support double precision (DP) floating point, only single precision (SP).
GPUs that do support DP do so with a significant performance degradation as compared to SP. The GFLOPs number I quoted above was for SP. On the other hand, numbers quoted for CPUs are often for DP, and there is less difference between the performance of SP and DP on a CPU.
CPU quotes can be for rates that are achievable only when using SIMD (single instruction, multiple data) vectorized instructions, and it is typically very hard to write algorithms that can approach the theoretical maximum (they may have to be written in assembly). Sometimes, CPU quotes are for a combination of all computing resources available through different types of instructions, and it is often virtually impossible to write a program that can exploit them all simultaneously.
The rates quoted for GPUs assume that you have enough parallel work to saturate the GPU and that your algorithm is not bandwidth bound.

The preferred measure of performance is elapsed time. GFLOPs can be used as a comparison method but it is often difficult to compare between compilers and architectures due to differences in instruction set, compiler code generation, and method of counting FLOPs.
The best method is to time the performance of the application. For the CUDA code you should time all code that will occur per launch. This includes memory copies and synchronization.
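A minimal timing sketch with CUDA events that wraps the copies as well as the kernel; the kernel here is just a placeholder, so substitute your own kernel and buffers:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Placeholder kernel so the timing harness is self-contained.
    __global__ void scale(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float* h = (float*)malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) h[i] = 1.0f;
        float* d;
        cudaMalloc(&d, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Time everything that happens per launch: copy in, kernel, copy out.
        cudaEventRecord(start);
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
        scale<<<(n + 255) / 256, 256>>>(d, n);
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);   // block until the stop event has completed

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("end-to-end time per launch: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d);
        free(h);
        return 0;
    }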
Nsight Visual Studio Edition and the Visual Profiler provide the most accurate measurement of each operation. Nsight Visual Studio Edition provides theoretical bandwidth and FLOPs values for each device. In addition the Achieved FLOPs experiment can be used to capture the FLOP count both for single and double precision.

Related

Fast hardware integer division

Hardware instructions for integer division have historically been very slow. For example, DIVQ on Skylake has a latency of 42-95 cycles [1] (and a reciprocal throughput of 24-90) for 64-bit inputs.
There are newer processors, however, which perform much better: Goldmont has 14-43 cycle latency and Ryzen has 14-47 [1], the M1 apparently has "throughput of 2 clock cycles per divide" [2], and even the Raspberry Pi Pico has an "8-cycle signed/unsigned divide/modulo circuit, per core" (though that seems to be for 32-bit inputs) [3].
My question is, what has changed? Was there a new algorithm invented? What algorithms do the new processors employ for division, anyway?
[1] https://www.agner.org/optimize/#manuals
[2] https://ridiculousfish.com/blog/posts/benchmarking-libdivide-m1-avx512.html
[3] https://raspberrypi.github.io/pico-sdk-doxygen/group__hardware__divider.html#details
On Intel before Ice Lake, 64-bit operand-size is an outlier, much slower than 32-bit operand size for integer division. div r32 is 10 uops, with 26 cycle worst-case latency but 6 cycle throughput. (https://uops.info/ and https://agner.org/optimize/, and Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux has detailed exploration.)
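As a concrete illustration of that operand-size point (a sketch, not a general recommendation): when 64-bit values are known to fit in 32 bits, doing the division at 32-bit width lets the compiler emit the much cheaper div r32 on those CPUs.

    #include <cstdint>

    // If both operands fit in 32 bits, divide at 32-bit operand size, which
    // is much faster than div r64 on Intel CPUs before Ice Lake.
    uint64_t div_narrow(uint64_t a, uint64_t b) {
        if ((a >> 32) == 0 && (b >> 32) == 0)
            return (uint32_t)a / (uint32_t)b;   // compiles to a 32-bit div
        return a / b;                           // full 64-bit division
    }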
There wasn't a fundamental change in how divide units are built, just widening the HW divider to not need extended-precision microcode. (Intel has had fast-ish dividers for FP for much longer, and that's basically the same problem just with only 53 bits instead of 64. The hard part of FP division is integer division of the mantissas; subtracting the exponents is easy and done in parallel.)
The incremental changes are things like widening the radix to handle more bits with each step. And for example pipelining the refinement steps after the initial (table lookup?) value, to improve throughput but not latency.
Related:
How sqrt() of GCC works after compiled? Which method of root is used? Newton-Raphson? gives a brief high-level overview of the div/sqrt units that modern CPUs use, with for example a Radix-1024 divider being new in Broadwell.
Do FP and integer division compete for the same throughput resources on x86 CPUs? (No in Ice Lake and later on Intel; having a dedicated integer unit instead of using the low element of the FP mantissa divide/sqrt unit is presumably related to making it 64 bits wide.)
Divide units historically were often not pipelined at all, as that's hard because it requires replicating a lot of gates instead of iterating on the same multipliers, I think. And most software usually avoids (or avoided) integer division because it was historically very expensive, or at least does it infrequently enough not to benefit very much from higher-throughput dividers with the same latency.
But with wider CPU pipelines with higher IPC shrinking the cycle gap between divisions, it's more worth doing. Also with huge transistor budgets, spending a bunch on something that will sit idle for a lot of the time in most programs still makes sense if it's very helpful for a few programs. (Like wider SIMD, and specialized execution units like x86 BMI2 pdep / pext). Dark silicon is necessary or chips would melt; power density is a huge concern, see Modern Microprocessors: A 90-Minute Guide!
Also, with more and more software being written by people who don't know anything about performance, and more code avoiding compile-time constants in favour of being flexible (function args that ultimately come from some config option), I'd guess modern software doesn't avoid division as much as older programs did.
Floating-point division is often harder to avoid than integer, so it's definitely worth having fast FP dividers. And integer can borrow the mantissa divider from the low SIMD element, if there isn't a dedicated integer-divide unit.
So that FP motivation was likely the actual driving force behind Intel's improvements to divide throughput and latency even though they left 64-bit integer division with garbage performance until Ice Lake.

How do I interpret this difference in matrix multiplication GFLOP/s?

I'm trying some matrix multiplication optimizations from this wiki here. While measuring the GFLOP/s for the naive, triple-for-loop matmul, I expected to see a drop in the GFLOP/s after a particular size, which according to the wiki represents the point where the data stops fitting in the cache.
I ran the benchmark on 2 different PCs:
3rd gen Intel i5 (3210M): (L1=32KB per core, L2=256KB per core, L3=3MB shared).
I got the expected graph, with a sharp drop from ~2GFLOP/s to 0.5.
6th gen Intel i7 (6500U): (L1=32KB per core, L2=256KB per core, L3=4MB shared)
On this, I instead see a gradual decrease in GFLOP/s, even if I try for larger sizes. Looking at the Ubuntu system monitor, one of the CPU cores was always at 100% usage.
I'm trying to understand the following:
How do I interpret the change in GFLOP/s with matrix size? If the expected drop corresponds to the data no longer fitting in the cache, why do I not see such a drop even for much bigger sizes on the i7?
How does the 3rd gen i5 perform faster for smaller sizes?
How do I interpret the CPU occupancy? Would I see a reduction in CPU usage if more time was being spent in fetching data from cache/RAM?
Edit:
I switched to double from float and tried -O3 and -O0; here are the plots. I couldn't check frequencies on the older i5, but the Skylake i7 goes to turbo frequency almost instantaneously for most of the process' duration.
Code from here, used GCC 7.4.0 on i7, and clang(Apple LLVM 7) on i5.
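For reference, GFLOP/s figures for a naive matmul are normally computed as 2*N^3 floating-point operations (one multiply and one add per inner iteration) divided by the elapsed time. A minimal sketch of that measurement, not the wiki's exact benchmark code:

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Naive triple-loop matmul in double (as in the edited benchmark),
    // timed with std::chrono. Prints GFLOP/s = 2*N^3 / seconds / 1e9.
    // Build with optimization enabled (e.g. -O3), as the answers below note.
    void benchmark(int n) {
        std::vector<double> a(n * n, 1.0), b(n * n, 1.0), c(n * n, 0.0);

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                for (int k = 0; k < n; ++k)
                    c[i * n + j] += a[i * n + k] * b[k * n + j];
        auto t1 = std::chrono::steady_clock::now();

        double seconds = std::chrono::duration<double>(t1 - t0).count();
        // Print an element of the result so the optimizer cannot drop the loops.
        printf("N=%4d  %6.2f GFLOP/s  (c[0]=%.0f)\n",
               n, 2.0 * n * n * n / seconds / 1e9, c[0]);
    }

    int main() {
        for (int n = 128; n <= 1024; n *= 2) benchmark(n);
        return 0;
    }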
Regarding question 2:
While both CPUs have the same base and turbo frequency, the Ivy Bridge has a TDP of 35 W while the Skylake has 15 W. Even though the Skylake is built on a much newer process, it is possible that the Ivy Bridge is able to use its turbo for a bigger part of the calculation. (Peter Cordes already mentioned checking the actual turbo.)
Regarding question 3:
CPU utilization doesn't depend on what the CPU is doing; waiting for RAM still counts as utilized. There are performance counters you can query which would tell you if the Ivy Bridge is slower because it stalls for memory more often.
With efficient cache-blocking, dense matmul should bottleneck on ALU, not memory bandwidth. O(N^3) work over O(N^2) memory.
But you're measuring a naive matmul. That means it's always horrible if you're striding down columns of one input. This is the classic problem for cache-blocking / loop-tiling.
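A minimal loop-tiling sketch of that idea, with the tile size B as an assumed tuning parameter and the output matrix c assumed zero-initialized:

    // Blocked (tiled) matmul: operate on B x B sub-blocks so the tiles of all
    // three row-major matrices stay in cache while they are reused.
    // c must be zero-initialized; B is a tuning knob (try 32-64 for doubles).
    const int B = 64;

    void matmul_blocked(const double* a, const double* b, double* c, int n) {
        for (int ii = 0; ii < n; ii += B)
            for (int kk = 0; kk < n; kk += B)
                for (int jj = 0; jj < n; jj += B)
                    for (int i = ii; i < ii + B && i < n; ++i)
                        for (int k = kk; k < kk + B && k < n; ++k) {
                            double aik = a[i * n + k];   // reused across the j loop
                            for (int j = jj; j < jj + B && j < n; ++j)
                                c[i * n + j] += aik * b[k * n + j];
                        }
    }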
Your Skylake has significantly better bandwidth to L3 cache and DRAM, and less-associative L2 cache (4-way instead of 8-way). Still, I would have expected better performance when your working set fits in L2 than when it doesn't.
SKL probably also has better HW prefetching, and definitely a larger out-of-order window size, than IvyBridge.
IvyBridge (including your 3210M) was the generation that introduced next-page hardware prefetching, but I think the feature with that name is just TLB prefetching, not data. It probably isn't a factor, especially if transparent hugepages are avoiding any TLB misses.
But if not, TLB misses might be the real cause of the dropoff on IvB. Use performance counters to check. (e.g. perf stat)
Was your CPU frequency shooting up to max turbo right away and staying there for both CPUs? #idspispopd's answer also makes some good points about total power / cooling budget, but yeah check that your two systems are maintaining the same CPU frequencies for this. Or if not, record what they are.
You did compile with optimization enabled, right? If not, that could be enough overhead to hide a memory bottleneck. Did you use the same compiler/version/options on both systems? Did you use -march=native?

Power consumption estimation from number of FLOPS (floating point operations)?

I have extracted how many FLOPs (floating-point operations) each of my algorithms consumes.
I wonder: if I implement these algorithms on an FPGA or on a CPU, can I predict (at least roughly) how much power is going to be consumed?
A power estimate for either a CPU or an ASIC/FPGA would work for me. I am seeking something like a formula. I have this journal paper for Intel CPUs. It gives power consumption per instruction (not only floating-point operations but also addressing, control, etc. instructions), so I need something more general that gives power based on FLOPs rather than the number of instructions of the code on a particular processor.
Re CPU: It's not really possible with modern architectures. Let's assume your program is running on bare metal (i.e. avoiding the complexities of modern OSs, other applications, interrupt processing, optimizing compilers, etc.). A modern processor will run circuitry that isn't being used at a reduced power level. There are also hardware power conservation states, such as P (Power) and C (Sleep) states, that are instruction independent and will vary your power consumption even with the same instruction sequence. Even if we assume your app is CPU-bound (meaning there are no periods long enough to allow the processor to drop into hardware power saving states), we can't predict power usage except at a gross statistical level. Instruction streams are pipelined, executed out of order, fused, etc. And this doesn't even include the memory hierarchy, etc.
FPGA: Oh, heck. My experience with FPGA is so old, that I really don't want to say from when. All I can say is that way back, when huge monsters roamed the earth, you could estimate power usage since you knew the circuit design, and the power consumption of on and off gates. I can't imagine that there aren't modern power conservation technologies built into modern FPGAs. Even so, what small literature I scanned implies that a lot of FPGA power technology is based upon a-priori analysis and optimization. See Design techniques for FPGA power optimization, and 40-nm FPGA Power Management and Advantages. (I just did a quick search and scan of the papers, by the way, so don't pay too much attention to my conclusion.)

hyperthreading and turbo boost in matrix multiply - worse performance using hyper threading

I am tuning my GEMM code and comparing with Eigen and MKL. I have a system with four physical cores. Until now I have used the default number of threads from OpenMP (eight on my system). I assumed this would be at least as good as four threads. However, I discovered today that if I run Eigen and my own GEMM code on a large dense matrix (1000x1000) I get better performance using four threads instead of eight. The efficiency jumped from 45% to 65%. I think this can also be seen in this plot
https://plafrim.bordeaux.inria.fr/doku.php?id=people:guenneba
The difference is quite substantial. However, the performance is much less stable. The performance jumps around quite a bit each iteration, both with Eigen and my own GEMM code. I'm surprised that hyperthreading makes the performance so much worse. I guess this is not really a question. It's an unexpected observation which I'm hoping to get feedback on.
I see that not using hyper threading is also suggested here.
How to speed up Eigen library's matrix product?
I do have a question regarding measuring max performance. What I do now is run CPU-Z and look at the frequency as I'm running my GEMM code and then use that number in my code (4.3 GHz on one overclocked system I use). Can I trust this number for all threads? How do I know the frequency per thread to determine the maximum? How do I properly account for turbo boost?
The purpose of hyperthreading is to improve CPU usage for code exhibiting high latency. Hyperthreading masks this latency by processing two threads at once, thus providing more instruction-level parallelism.
However, a well-written matrix product kernel exhibits excellent instruction-level parallelism and thus exploits nearly 100% of the CPU resources. Therefore there is no room for a second "hyper" thread, and the overhead of its management can only decrease the overall performance.
Unless I've missed something, always possible, your CPU has one clock shared by all its components, so if you measure its rate at 4.3 GHz (or whatever) then that's the rate of all the components for which it makes sense to figure out a rate. Imagine the chaos if this were not so, some cores running at one rate, others at another rate; the shared components (e.g. memory access) would become unmanageable.
As to hyperthreading actually worsening the performance of your matrix multiplication, I'm not surprised. After all, hyperthreading is a poor person's parallelisation technique, duplicating instruction pipelines but not functional units. Once you've got your code screaming along, pushing your n*10^6 contiguous memory locations through the FPUs, a context switch in response to a pipeline stall isn't going to help much. At best the other pipeline will scream along for a while before another context switch robs you of useful clock cycles; at worst all the careful arrangement of data in the memory hierarchy will be horribly mangled at each switch.
Hyperthreading is designed not for parallel numeric computational speed but for improving the performance of a much more general workload; we use general-purpose CPUs in high-performance computing not because we want hyperthreading but because all the specialist parallel numeric CPUs have gone the way of all flesh.
As a provider of multithreaded concurrency services, I have explored how hyperthreading affects performance under a variety of conditions. I have found that with software that limits its own high-utilization threads to no more than the actual physical processors available, the presence or absence of HT makes very little difference. Software that attempts to use more threads than that for heavy computational work is likely unaware that it is doing so, relying merely on the total processor count (which doubles under HT), and predictably runs more slowly. Perhaps the largest benefit that enabling HT may provide is that you can max out all physical processors without bringing the rest of the system to a crawl. Without HT, software often has to leave one CPU free to keep the host system running normally. Hyperthreads are just more switchable threads; they are not additional processors.
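In OpenMP code that amounts to capping the thread count at the physical core count rather than the hardware-thread count. A minimal sketch, assuming the asker's four-core machine (Eigen and MKL can usually be capped the same way via the OMP_NUM_THREADS environment variable):

    #include <cstdio>
    #include <omp.h>

    // Cap compute threads at the physical core count (4 here) instead of the
    // OpenMP default of one thread per hardware thread. Build with -fopenmp.
    int main() {
        omp_set_num_threads(4);   // physical cores, not 8 hyperthreads

        #pragma omp parallel
        {
            printf("worker %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
            // ... this thread's share of the GEMM would run here ...
        }
        return 0;
    }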

Optimum performance of GPU

I have been asked to measure how "efficiently" my code uses the GPU, i.e. what percentage of peak performance the algorithms are achieving. I am not sure how to do this comparison. Till now I have basically put timers in my code and measured the execution time. How can I compare this to optimal performance and find what the bottlenecks might be? (I did hear about the Visual Profiler but couldn't get it to work; it keeps giving me a "cannot load output" error.)
Each card has a maximum memory bandwidth and processing speed. For example, the GTX 480 bandwidth is 177.4 GB/s. You will need to know the specs for your card.
The first thing to decide is whether your code is memory bound or computation bound. If it is clearly one or the other, that will help you focus on the correct "efficiency" to measure. If your program is memory bound, then you need to compare your bandwidth with the card's maximum bandwidth.
You can calculate memory bandwidth by computing the amount of memory you read/write and dividing by run time (I use cuda events for timing). Here is a good example of calculating bandwidth efficiency (look at the whitepaper for the parallel reduction) and using it to help validate a kernel.
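A sketch of that arithmetic, with the byte count and kernel time coming from your own bookkeeping and cudaEventElapsedTime, and the GTX 480's 177.4 GB/s quoted above used as the peak:

    #include <cstdio>

    // Effective bandwidth = bytes actually read and written by the kernel,
    // divided by its measured run time; efficiency is the fraction of peak.
    void report_bandwidth(double bytes_moved, double ms, double peak_gbs) {
        double effective = bytes_moved / (ms / 1e3) / 1e9;   // GB/s
        printf("effective: %.1f GB/s  (%.0f%% of %.1f GB/s peak)\n",
               effective, 100.0 * effective / peak_gbs, peak_gbs);
    }

    // Example (made-up numbers): a kernel that reads and writes 1<<26 floats
    // once each in 5.2 ms, measured against the GTX 480 peak:
    //   report_bandwidth(2.0 * (1 << 26) * sizeof(float), 5.2, 177.4);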
I don't know very much about determining the efficiency if instead you are ALU bound. You can probably count (or profile) the number of instructions, but what is the card's maximum?
I'm also not sure what to do in the likely case that your kernel is something in between memory bound and ALU bound.
Anyone...?
Generally "efficiently" would probably be a measure of how much memory and GPU cycles (average, min, max) of your program is using. Then the efficiency measure would be avg(mem)/total memory for the time period and so on with AVG(GPU cycles)/Max GPU cycles.
Then I'd compare these metrics to metrics from some GPU benchmark suites (which you can assume to be pretty efficient at using most of the GPU). Or you could measure against some random GPU intensive programs of your choice. That'd be how I'd do it but I've never thought to try so good luck!
As for bottlenecks and "optimal" performance. These are probably NP-Complete problems that no one can help you with. Get out the old profiler and debuggers and start working your way through your code.
Can't help with the profiler and micro-optimisation, but there is a CUDA occupancy calculator http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls , which tries to estimate how your CUDA code uses the hardware resources, based on these values (a programmatic alternative is sketched after the list):
Threads Per Block
Registers Per Thread
Shared Memory Per Block (bytes)
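Newer CUDA toolkits also expose this calculation programmatically. A minimal sketch using cudaOccupancyMaxActiveBlocksPerMultiprocessor, with a placeholder kernel standing in for yours:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel; the occupancy query works for any __global__ function.
    __global__ void myKernel(float* x) { x[threadIdx.x] *= 2.0f; }

    int main() {
        // Same inputs as the spreadsheet: block size and dynamic shared memory;
        // the register count is taken from the compiled kernel automatically.
        int blockSize = 256, maxBlocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, myKernel,
                                                      blockSize, 0 /* dyn. smem bytes */);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        double occupancy = (double)(maxBlocksPerSM * blockSize)
                         / prop.maxThreadsPerMultiProcessor;
        printf("theoretical occupancy at blockSize %d: %.0f%%\n",
               blockSize, 100.0 * occupancy);
        return 0;
    }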
