I have a sequential program and an OpenCL parallel program that implement the same algorithm. I measured an execution time of 133000 milliseconds for the sequential version and 17 milliseconds of kernel time for the parallel version. So when I calculate the speedup as 133000/17, I get about 7823. Is a speedup this large possible?
Such a speedup might happen, but it seems very large: to me, a speedup of 7823 looks suspicious, though not entirely impossible (see e.g. these slides and that). A 100x factor would seem more reasonable. Costly graphics cards are rumored to reach several teraflops, while a single CPU core delivers only gigaflops. Note that some particular programs even run slower on a GPGPU than on the CPU.
When benchmarking your CPU code, be sure to enable optimizations in your compiler (e.g. with GCC, compile with at least gcc -O2). Without any optimization (e.g. gcc -O0), CPU performance is poor: a 3x difference between binaries built with gcc -O0 and gcc -O2 is common.
BTW, cache considerations matter a lot for CPU performance. If you wrote your numerical CPU code without taking that into account, it may be quite slow (in the weird case when it has bad locality of reference).
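To illustrate the locality point, here is a minimal sketch (not the asker's code; the function names are made up for the example). Both functions compute the same sum over a matrix stored row-major in one contiguous vector, but only the first one walks memory in address order.

```cpp
#include <cstddef>
#include <vector>

// Sums a rows x cols matrix stored row-major in a flat vector.
// The inner loop touches consecutive addresses, which is cache-friendly.
long long sum_row_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Same result, but the inner loop jumps cols elements at a time.
// On large matrices this access pattern tends to miss the cache far more often.
long long sum_col_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

How much slower the second version is depends on the matrix size and the machine, so it is worth timing both on your own hardware.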
If the kernel function has a problem and has not actually been executed, the timing results will be inaccurate.
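For reference, the arithmetic in the question can be written as a small helper. This is only a sketch: time_ms and speedup are hypothetical names, and the check for non-positive timings is my own addition, meant to catch the case where a kernel did not run and reported no time.

```cpp
#include <chrono>
#include <stdexcept>

// Wall-clock time of one call to f, in milliseconds.
template <typename F>
double time_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Speedup = baseline time / parallel time.
// A non-positive timing usually means the measurement is broken,
// so it is rejected instead of producing a meaningless ratio.
double speedup(double baseline_ms, double parallel_ms) {
    if (baseline_ms <= 0.0 || parallel_ms <= 0.0)
        throw std::invalid_argument("timings must be positive");
    return baseline_ms / parallel_ms;
}
```

With the numbers from the question, speedup(133000.0, 17.0) is roughly 7823.5. Keep in mind that 17 ms is kernel time only, so it is not directly comparable to a whole-program CPU time.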
Related
I have a relatively large C++ program that does a lot of dynamic memory allocation (size and structure are calculated at runtime; it's pointers all the way down).
The program is embarrassingly parallel, and OpenMP does speed it up, up to a point. However, I found that launching single-threaded binaries, and having them sync up through the file system runs faster, and slows down less as the number of cores used increases.
Is there an intuitive reason why this might happen? (sorry I don't have a code snippet.)
Thank you!
Is there a way to optimize GCC-compiled code in terms of CPU and memory usage using option flags?
Does using -O3 rather than -O1 increase or decrease the amount of memory or CPU usage?
About memory usage:
-Os reduces the binary size of a program. It has limited effect on runtime memory usage (C/C++ memory allocation and deallocation is "manual").
I say limited since tail recursion optimization can lower stack usage (this optimization will also be performed with -O2 / -O3).
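As a sketch of what tail recursion means here (the function names are made up for illustration): in the first function the recursive call is the last thing that happens, so an optimizing compiler may reuse the current stack frame instead of pushing a new one. Whether it actually does depends on the compiler and the flags.

```cpp
#include <cstdint>

// Tail-recursive: nothing is left to do after the recursive call,
// so the optimizer may turn the recursion into a loop.
std::uint64_t sum_to(std::uint64_t n, std::uint64_t acc = 0) {
    if (n == 0) return acc;
    return sum_to(n - 1, acc + n);
}

// Not in tail form: the addition happens after the call returns,
// so each level needs its own frame unless the compiler rewrites it.
std::uint64_t sum_to_naive(std::uint64_t n) {
    if (n == 0) return 0;
    return n + sum_to_naive(n - 1);
}
```

Both return the same value. The difference only shows up in stack usage for large n when the optimization is applied.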
The -flto (link time optimization) option can also lower binary size.
CPU usage:
Highly optimized code (e.g. -O3) will stress the CPU, but that doesn't automatically mean higher total energy consumption, since it may also lead to much shorter execution times.
E.g. in Compiler-Based Optimizations Impact on Embedded Software Power Consumption (not strictly GCC related but interesting), they find that enabling various global speed compiler optimizations leads to a considerable increase in the power consumption of the DSP (on average, by 25%). Although these optimizations increase the power consumed by the DSP, the energy usage while running an algorithm decreased, on average, by 95%.
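The two figures fit together because energy is power multiplied by time. A rough back-of-the-envelope sketch, taking the quoted averages at face value (averages of ratios do not strictly compose, so this is only indicative; the function name is hypothetical):

```cpp
// Energy = average power x execution time.
// Given the power and energy ratios (optimized / unoptimized),
// the implied execution time ratio is energy_ratio / power_ratio.
double implied_time_ratio(double power_ratio, double energy_ratio) {
    return energy_ratio / power_ratio;
}
```

With power up 25% (ratio 1.25) and energy down 95% (ratio 0.05), the implied time ratio is 0.05 / 1.25 = 0.04, i.e. the optimized code would run about 25 times faster.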
Profile guided optimization could lower CPU consumption (The risks of using PGO (profile-guided optimization) with production environment).
Take a look at Can we optimize code to reduce power consumption?
Probably you should use -O2 and do not worry about it: if you're looking to save power / memory, the overall design of your application will have more effect than a compiler switch.
You might try -Os which is like -O2 (good CPU speed) while simultaneously trying to reduce the binary size.
Check out the various optimizations here.
Code size optimizations are addressed above.
I'm only looking at CPU optimization. You can write really good/optimized code that has low processor utilization, and really bad/unoptimized code that maximizes CPU utilization.
So how do you most effectively use your processor?
First, use a good optimizing compiler. I won't speak to GCC, but Intel and some other purchased compilers (e.g. PGI) are very good at optimization.
Exploit the underlying hardware, such as vector instructions, FMA, registers, etc.
Follow best practices for use of peripherals, such as cellular, wifi, gps, etc.
Follow best practices for SW design, such as latency hiding, avoiding polling by using interrupts, using a thread pool if appropriate, etc.
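As one small example of exploiting the hardware, here is a sketch of a dot product written with std::fma from the standard library (dot_fma is a made-up name). Whether this compiles down to a hardware FMA instruction depends on the target CPU and the compiler flags, so treat it as an illustration and not a guarantee.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Dot product that accumulates with fused multiply-add:
// each step computes a[i] * b[i] + acc with a single rounding.
double dot_fma(const std::vector<double>& a, const std::vector<double>& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        acc = std::fma(a[i], b[i], acc);
    return acc;
}
```

If the target has no hardware FMA enabled, std::fma can fall back to a much slower software routine, so it is worth checking the generated code or timing it.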
Good luck.
When we say that a parallel CUDA program on the GPU has a speedup over a similar sequential program on the CPU, should the sequential one be compiled with compiler optimization (gcc -O2)?
I have parallelized a program on the GPU. It has a speedup of 18 in comparison with its CPU implementation compiled without optimization. But when I add the option -O2 to the nvcc compiler, the speedup decreases to 8.
Of course, the optimizer should be enabled for both the GPU and the CPU program when comparing performance.
If your focus is on GPU vs. CPU, the comparison should not be affected by the quality of the software. We usually assume that the code achieves the best performance possible on its hardware.
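The two speedups in the question also tell you how much the optimizer helped the CPU code. A short derivation, assuming the GPU time $T_{\text{gpu}}$ is the same in both measurements (an assumption, since the question does not say so):

```latex
S_{\text{O0}} = \frac{T_{\text{cpu}}^{\text{O0}}}{T_{\text{gpu}}} = 18,
\qquad
S_{\text{O2}} = \frac{T_{\text{cpu}}^{\text{O2}}}{T_{\text{gpu}}} = 8
\quad\Longrightarrow\quad
\frac{T_{\text{cpu}}^{\text{O0}}}{T_{\text{cpu}}^{\text{O2}}}
  = \frac{S_{\text{O0}}}{S_{\text{O2}}}
  = \frac{18}{8} = 2.25
```

So under that assumption -O2 made the CPU version about 2.25 times faster, and 8 is the fairer number to report.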
Please let me know how to set the Intel Fortran compiler options to get the best performance on an 8-core system, for both IA-32 and x64. I want to run a Fortran program that takes advantage of all the CPU time available on the 8-core system. Currently the program uses only 13% of the CPU time.
You can learn about autovectorization and guided auto-parallelization features of Intel FORTRAN in this tutorial: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/start/win/tutorial_comp_for_win.pdf.
If you are doing linear algebra, solvers, or FFTs, you might get the best results by mapping your problem onto calls into the Intel Math Kernel Library, which is already multithreaded, vectorized, and cache optimized: http://software.intel.com/en-us/articles/intel-mkl/
If you are doing media / signal processing you might map your problem into calls into the Intel Performance Primitives library: http://software.intel.com/en-us/articles/intel-ipp/
Happy hacking!
In my specific application, a computational network model containing several loops running through 20k iterations, with each iteration evaluating a number of nested ifs, just enabling /Q2 level optimization in the compiler was sufficient to reduce the computing time drastically, while keeping the CPU load around 15%.
On a similar note, I noticed that raising the optimization setting to the last level (/Q3) did do what you were asking (running all CPUs at about full load), but the computing time was NOT reduced at all.
Therefore, if one has a small problem and several cases to test and processing capacity is the only bottleneck, it could be a good idea to open more than one Fortran solution and run those cases simultaneously.
My professor came across this interesting experiment on 3D linearly separable kernel convolution using SSE and OpenMP, and gave me the task of benchmarking it on our system. The author claims a crazy 18-fold speedup over the serial approach! That might not always hold, but we were expecting at least a 2-4x speedup running this on a dual-core Intel.
http://software.intel.com/en-us/articles/16bit-3d-convolution-sse4openmp-implementation-on-penryn-cpu/#comment-41994
Alas, we found no speedup at all. The serial code always performs better, with or without OpenMP.
I am using Linux, and observed a certain trend: when no other processes are running on the system, after a while the loadavg starts increasing and the %CPU utilization drops.
Another probable false positive, which I ran into accidentally: I started the program, then immediately paused it. Then I ran it in the background with bg, and saw a speedup of more than 2. This happens every time!
Any advice would be great.
Thanks,
Sayan
You really need to profile your program to identify the bottlenecks. You also need to look at optimisation in a more "holistic" way. Your performance issues may be related to poor design, poor coding, memory bandwidth limitations, and a host of other problems, none of which will be addressed by micro-optimisations such as using SIMD instead of scalar code.
Start with a profile (use a tool like Zoom for this) and work from there.
Well, I groped around a bit, and then tried the following: I compiled the program using the -O0 option (no optimization) and got a speedup of almost 2 for almost all the XYZ values. I could also see that 2 threads were being utilized on my dual core (previously, it was using only one).
But now, when I remove the OpenMP pragmas, I see no speedup. This bothers me, because SSE should be able to speed things up considerably. So the speedup can be attributed entirely to OpenMP, and I have to find out why SSE is failing. Somebody told me that if the operations are trivial (what counts as trivial is debatable, since it differs from person to person), using SSE gains no speedup. But I wrote a small program that calculates sqrt(i)/i for i_max_size = 64000, and the SSE version gave a speedup of 3.5 to 4.0.
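For anyone who wants to reproduce the small sqrt(i)/i test, here is a sketch of what such a comparison might look like. It is not the original program (which was not posted); the function names are made up, and it assumes an x86 target where the SSE intrinsics from xmmintrin.h are available.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Scalar reference: out[k] = sqrt(i) / i for i = k + 1.
std::vector<float> ratio_scalar(std::size_t n) {
    std::vector<float> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        const float i = static_cast<float>(k + 1);
        out[k] = std::sqrt(i) / i;
    }
    return out;
}

// SSE version: processes four values of i per iteration.
std::vector<float> ratio_sse(std::size_t n) {
    std::vector<float> out(n);
    std::size_t k = 0;
    for (; k + 4 <= n; k += 4) {
        const float base = static_cast<float>(k + 1);
        const __m128 i = _mm_setr_ps(base, base + 1.0f, base + 2.0f, base + 3.0f);
        const __m128 r = _mm_div_ps(_mm_sqrt_ps(i), i);
        _mm_storeu_ps(&out[k], r);  // unaligned store, safe for std::vector
    }
    for (; k < n; ++k) {  // scalar tail for the last n % 4 elements
        const float i = static_cast<float>(k + 1);
        out[k] = std::sqrt(i) / i;
    }
    return out;
}

// Largest absolute difference between two result vectors.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    float worst = 0.0f;
    for (std::size_t k = 0; k < n; ++k) {
        const float d = std::fabs(a[k] - b[k]);
        if (d > worst) worst = d;
    }
    return worst;
}
```

Note that at -O2 or -O3 the compiler may auto-vectorize the scalar loop as well, which can hide the difference between the two versions, so compare the generated code before drawing conclusions from the timings.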
I will post more once I find the root cause.