Using Compiler Optimization in Calculating Speed-up - gcc

When we say that a parallel CUDA program on a GPU achieves a speed-up over a similar sequential program on a CPU, should the sequential one be compiled with compiler optimization enabled (e.g. gcc -O2)?
I have parallelized a program on the GPU. It has a speed-up of 18 compared with its CPU implementation built without compiler optimization. But when I add the -O2 option to the nvcc compiler, the speed-up drops to 8.

Of course, optimization should be enabled for both the GPU and the CPU program when comparing performance.
If your focus is on GPU vs. CPU, the comparison should not be affected by the quality of the software code. We usually assume the code achieves the best performance it can on its hardware.
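The effect of the baseline on the reported number can be sketched with a little arithmetic. The timings below are hypothetical, chosen only to reproduce the 18x and 8x ratios from the question, and they assume the GPU time stays the same while only the CPU baseline changes.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Speed-up = sequential time / parallel time (same units for both)."""
    if t_parallel <= 0:
        raise ValueError("parallel time must be positive")
    return t_sequential / t_parallel


# Hypothetical timings in seconds (assumptions, not taken from the question).
t_gpu = 1.0       # GPU run, assumed unchanged between the two comparisons
t_cpu_o0 = 18.0   # CPU code built without optimization
t_cpu_o2 = 8.0    # same CPU code built with -O2

print(speedup(t_cpu_o0, t_gpu))  # 18.0
print(speedup(t_cpu_o2, t_gpu))  # 8.0

# Under these assumptions, -O2 made the CPU baseline this many times faster:
print(t_cpu_o0 / t_cpu_o2)       # 2.25
```

Under these assumptions the GPU code did not get slower; the CPU baseline simply got about 2.25 times faster, which is why the optimized figure is the fairer one to report.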

Related

What GCC optimization flags and techniques are safe across CPUs?

When compiling/linking C/C++ libraries or programs that are meant to work on all implementations of an ISA (e.g. x86-64), what optimization flags are safe from the correctness and run-time performance perspectives? I want optimizations that yield correct results and won't be detrimental performance-wise for a particular CPU. E.g. I would like to avoid optimization flags that yield run-time performance improvements on an 8th-gen Intel Core i7 but result in performance degradation on an AMD Ryzen.
Are PGO, LTO, and -O3 safe? Is it solely dependent on -march and -mtune (or the absence thereof)?
They're all supposed to be "safe", assuming that your code is well defined.
If you don't want to specialize for a particular CPU family then just leave -march and -mtune alone; the default suits a generic x86_64.
PGO is generally a good idea; it mostly helps the compiler with branch-related decisions.
LTO and -O3 can have different effects on different code-bases. For example, if your code benefits from vectorization then -O3 is a big win over -O2, but the extra inlining and unrolling can lead to larger code sizes, and that can be a disadvantage on systems with more limited caches.
In the end, the only advice that ever really means anything here is: measure it and see what's good for your code.
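One way to act on "measure it" is to time the same program built with each flag set several times and compare medians, which are less sensitive to a single noisy run than means. The sketch below only does the comparison step; the flag sets and timings are hypothetical placeholders, not measurements from this thread.

```python
import statistics


def rank_builds(timings):
    """Return (flags, median seconds) pairs sorted from fastest to slowest.

    `timings` maps a flag string to a list of wall-clock times in seconds.
    """
    medians = {flags: statistics.median(runs) for flags, runs in timings.items()}
    return sorted(medians.items(), key=lambda item: item[1])


# Hypothetical wall-clock times (seconds) for three builds of the same program.
measured = {
    "-O2": [1.92, 1.95, 1.90],
    "-O3": [1.71, 1.74, 2.40],       # one noisy run; the median ignores it
    "-O2 -flto": [1.88, 1.85, 1.86],
}

for flags, median in rank_builds(measured):
    print(f"{flags:10s} {median:.2f} s")
```

Repeating the same measurement on each CPU family you care about is what tells you whether a flag that helps on one machine hurts on another.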

OpenCL speedup obtained is above 7000

I have an OpenCL sequential program and a parallel program that implement the same algorithm. I measured an execution time of 133000 milliseconds for the sequential version and a kernel time of 17 milliseconds for the parallel version. So when I calculate the speedup as 133000/17, I get 7823. Is this much speedup possible?
Such a speedup might happen, but it seems quite big; to me, a speedup of 7823 looks suspicious but not entirely impossible (see e.g. these slides and that). A 100x factor would seem more reasonable. Costly graphics cards are rumored to be able to run at several teraflops, while a single CPU core gives only gigaflops. Some particular programs can even run slower on GPGPU than on the CPU.
When benchmarking your CPU code, be sure to enable optimizations in your compiler (e.g. compile with at least gcc -O2 when using GCC). Without any optimization (e.g. gcc -O0) the CPU code runs slowly (e.g. a 3x factor between binaries built with gcc -O0 and gcc -O2 is common).
BTW, cache considerations matter a lot for CPU performance. If you wrote your numerical CPU code without taking that into account, it may be quite slow (in the weird case when it has bad locality of reference).
If the kernel function has a problem and was not actually executed, the timing results will be inaccurate.
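The numbers in the question compare a whole sequential run against kernel time only. A rough sketch of how much that choice matters is shown below; the 500 ms overhead for data transfers, kernel compilation, and synchronization is a made-up assumption for illustration, not a value from the question.

```python
def speedup(t_sequential_ms: float, t_parallel_ms: float) -> float:
    """Speedup = sequential time / parallel time, both in milliseconds."""
    return t_sequential_ms / t_parallel_ms


t_sequential_ms = 133000.0  # from the question
t_kernel_ms = 17.0          # from the question (kernel time only)

# Kernel-only comparison, as computed in the question.
print(round(speedup(t_sequential_ms, t_kernel_ms), 1))  # 7823.5

# Hypothetical overhead (assumption): host<->device copies, kernel build,
# and synchronization that the sequential program does not pay.
t_overhead_ms = 500.0
print(round(speedup(t_sequential_ms, t_kernel_ms + t_overhead_ms), 1))  # 257.3
```

Checking that the parallel output matches the sequential output is a quick way to rule out the case where the kernel never ran.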

GCC optimization for CPU and MEMORY usage

Is there a way to optimize GCC-compiled code in terms of CPU and memory usage using option flags?
Does using -O3 rather than -O1 increase or decrease the amount of memory or CPU usage?
About memory usage:
-Os reduces the binary size of a program. It has limited effect on runtime memory usage (C/C++ memory allocation and deallocation is "manual").
I say limited since tail recursion optimization can lower stack usage (this optimization will also be performed with -O2 / -O3).
The -flto (link time optimization) option can also lower binary size.
CPU usage:
Highly optimized code (e.g. -O3) will stress the CPU but that doesn't automatically mean a higher total CPU power consumption (it may lead to minimum execution times).
E.g. in Compiler-Based Optimizations Impact on Embedded Software Power Consumption (not strictly GCC related but interesting), the authors find that enabling various global speed compiler optimizations leads to a considerable increase in the power consumption of the DSP (on average, by 25%). Although these optimizations increase the power consumed by the DSP, the energy used while running an algorithm decreased, on average, by 95%.
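Higher power with lower energy is not a contradiction, because energy is power multiplied by time. The sketch below works out the run-time ratio implied by the two averages quoted above; the implied ratio is derived arithmetic under the assumption that both averages apply to the same run, not a figure reported by the paper.

```python
def implied_time_ratio(power_ratio: float, energy_ratio: float) -> float:
    """Since energy = power * time, time_ratio = energy_ratio / power_ratio."""
    return energy_ratio / power_ratio


power_ratio = 1.25   # power up by 25% with optimizations enabled
energy_ratio = 0.05  # energy down by 95% with optimizations enabled

time_ratio = implied_time_ratio(power_ratio, energy_ratio)
print(f"optimized run time is {time_ratio:.2f} of the original")  # 0.04
print(f"that is about {1 / time_ratio:.0f}x faster")              # 25x
```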
Profile-guided optimization could lower CPU consumption (see The risks of using PGO (profile-guided optimization) with production environment).
Take a look at Can we optimize code to reduce power consumption?
You should probably use -O2 and not worry about it: if you're looking to save power / memory, the overall design of your application will have more effect than a compiler switch.
You might try -Os which is like -O2 (good CPU speed) while simultaneously trying to reduce the binary size.
Check out the various optimizations here.
Code size optimizations are addressed above.
I'm only looking at CPU optimization. You can write really good/optimized code that has low processor utilization, and really bad/unoptimized code that maximizes CPU utilization.
So how do you most effectively use your processor?
First, use a good optimizing compiler. I won't speak to GCC, but Intel and some other purchased compilers (e.g. PGI) are very good at optimization.
Exploit the underlying hardware, such as vector instructions, FMA, registers, etc.
Follow best practices for use of peripherals, such as cellular, wifi, gps, etc.
Follow best practices for SW design, such as latency hiding, avoiding polling by using interrupts, using a thread pool if appropriate, etc.
Good luck.

How to evaluate CUDA performance?

I wrote my own CUDA kernel.
Compared to the CPU code, my kernel code is 10 times faster.
But I have some questions about my experiments.
Is my program fully optimized, i.e. does it use all GPU cores, use shared memory properly, use an adequate register count, and reach enough occupancy?
How can I evaluate my kernel code's performance?
How can I calculate CUDA's maximum theoretical throughput?
Am I right that comparing a CPU's GFLOPS with a GPU's GFLOPS gives a transparent comparison of their theoretical performance?
Thanks in advance.
Is my program fully optimized, i.e. does it use all GPU cores, use shared memory properly, use an adequate register count, and reach enough occupancy?
To find this out, you use one of the CUDA profilers. See How Do You Profile & Optimize CUDA Kernels?
How can I calculate CUDA's maximum theoretical throughput?
That math is slightly involved, different for each architecture, and easy to get wrong. It is better to look the numbers up in the specs for your chip. There are tables on Wikipedia, such as this one, for the GTX 500 cards. For instance, you can see from the table that a GTX 580 has a theoretical peak bandwidth of 192.4 GB/s and a compute throughput of 1581.1 GFLOPS.
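For that particular card, the table values can be reproduced with the usual back-of-the-envelope formulas. The inputs below (512 CUDA cores, 1544 MHz shader clock, 384-bit memory bus, 4008 MT/s effective memory rate, 2 FLOPs per core per cycle for a fused multiply-add) are assumptions taken from public GTX 580 spec tables, not from this answer, and other architectures need different formulas.

```python
def peak_gflops(cores: int, shader_clock_mhz: float,
                flops_per_core_per_cycle: int = 2) -> float:
    """Single-precision peak: cores * clock * FLOPs per cycle, in GFLOPS."""
    return cores * shader_clock_mhz * flops_per_core_per_cycle / 1000.0


def peak_bandwidth_gbs(bus_width_bits: int, effective_rate_mtps: float) -> float:
    """Memory peak: bytes per transfer * transfers per second, in GB/s."""
    return bus_width_bits / 8 * effective_rate_mtps / 1000.0


# Assumed GTX 580 specs (from public spec tables, not from the answer).
print(round(peak_gflops(512, 1544), 1))         # 1581.1
print(round(peak_bandwidth_gbs(384, 4008), 1))  # 192.4
```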
Am I right that comparing a CPU's GFLOPS with a GPU's GFLOPS gives a transparent comparison of their theoretical performance?
If I understand correctly, you are asking if the number of theoretical peak GFLOPs on a GPU can be directly compared with the corresponding number on a CPU. There are some things to consider when comparing these numbers:
Older GPUs did not support double precision (DP) floating point, only single precision (SP).
GPUs that do support DP do so with a significant performance degradation as compared to SP. The GFLOPs number I quoted above was for SP. On the other hand, numbers quoted for CPUs are often for DP, and there is less difference between the performance of SP and DP on a CPU.
CPU quotes can be for rates that are achievable only when using SIMD (single instruction, multiple data) vectorized instructions, and it is typically very hard to write algorithms that can approach the theoretical maximum (and they may have to be written in assembly). Sometimes, CPU quotes are for a combination of all computing resources available through different types of instructions, and it is often virtually impossible to write a program that can exploit them all simultaneously.
The rates quoted for GPUs assume that you have enough parallel work to saturate the GPU and that your algorithm is not bandwidth bound.
The preferred measure of performance is elapsed time. GFLOPs can be used as a comparison method but it is often difficult to compare between compilers and architectures due to differences in instruction set, compiler code generation, and method of counting FLOPs.
The best method is to time the performance of the application. For the CUDA code you should time all code that will occur per launch. This includes memory copies and synchronization.
Nsight Visual Studio Edition and the Visual Profiler provide the most accurate measurement of each operation. Nsight Visual Studio Edition provides theoretical bandwidth and FLOPs values for each device. In addition the Achieved FLOPs experiment can be used to capture the FLOP count both for single and double precision.

How to get the best performance from an 8-core system using Intel Fortran

Please let me know how to set the Intel Fortran compiler options to get the best performance from an 8-core system, for both IA-32 and x64. I want to run a Fortran program and take advantage of all the CPU time available on an 8-core system. Currently the program uses only 13% of the CPU time.
You can learn about autovectorization and guided auto-parallelization features of Intel FORTRAN in this tutorial: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/start/win/tutorial_comp_for_win.pdf.
If you are doing linear algebra, solvers, or FFTs, you might get the best results by mapping your problem onto calls to the Intel Math Kernel Library (http://software.intel.com/en-us/articles/intel-mkl/), which is already multithreaded, vectorized, and cache optimized.
If you are doing media / signal processing you might map your problem into calls into the Intel Performance Primitives library: http://software.intel.com/en-us/articles/intel-ipp/
Happy hacking!
In my specific application, a computational network model containing several loops running throughout 20k iterations, with each iteration going through a number of nested ifs, just enabling /Q2 level optimization in the compiler was sufficient to reduce the computing time drastically, while keeping the CPU load around 15%.
On a similar note, I have noticed that raising the optimization setting to the last level (/Q3) did do what you were asking (running all CPUs at about full load), but the computing time was NOT reduced at all.
Therefore, if one has a small problem and several cases to test and processing capacity is the only bottleneck, it could be a good idea to open more than one Fortran solution and run those cases simultaneously.
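A load of about 13% on an 8-core machine is roughly one fully busy core (100 / 8 = 12.5%), so independent cases can fill the remaining cores. The sketch below launches several cases at once from a small driver script; the command it runs is a stand-in (a Python one-liner) for a hypothetical solver executable, and the case IDs are made up.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_case(case_id: int) -> str:
    """Run one independent case in its own process and return its output."""
    # Stand-in command (assumption): replace with the real solver and its
    # input file, e.g. ["solver.exe", f"case_{case_id}.inp"] (hypothetical).
    cmd = [sys.executable, "-c", f"print({case_id} ** 2)"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def run_cases(case_ids, max_workers: int = 8):
    """Run cases concurrently; each worker thread waits on one process."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_case, case_ids))


if __name__ == "__main__":
    print(run_cases(range(4)))  # ['0', '1', '4', '9']
```

Setting max_workers to the number of cores keeps each single-threaded case on its own core without oversubscribing the machine.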

Resources