Is there a way to optimize GCC-compiled code in terms of CPU and memory usage using option flags?
Does using -O3 rather than -O1 increase or decrease memory or CPU usage?
About memory usage:
-Os reduces the binary size of a program. It has limited effect on runtime memory usage (C/C++ memory allocation and deallocation is "manual").
I say limited since tail recursion optimization can lower stack usage (this optimization will also be performed with -O2 / -O3).
The -flto (link time optimization) option can also lower binary size.
CPU usage:
Highly optimized code (e.g. -O3) will stress the CPU, but that doesn't automatically mean higher total energy consumption, since the shorter execution time may more than compensate.
E.g. in "Compiler-Based Optimizations Impact on Embedded Software Power Consumption" (not strictly GCC related but interesting), the authors find that enabling various global speed optimizations leads to a considerable increase in the power consumption of the DSP (on average, by 25%). Although these optimizations increase the power drawn by the DSP, the energy used while running an algorithm decreased, on average, by 95%.
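Those two figures fit together once runtime is taken into account, since energy is power multiplied by time. A rough sketch of the arithmetic, using only the two averages quoted above (the variable names are mine, and the averages may not come from exactly the same set of runs, so treat the result as an order of magnitude):

```python
# Energy = power * time, so time_ratio = energy_ratio / power_ratio.
power_ratio = 1.25    # optimized / unoptimized power: +25% on average (quoted above)
energy_ratio = 0.05   # optimized / unoptimized energy: -95% on average (quoted above)

time_ratio = energy_ratio / power_ratio   # optimized / unoptimized runtime
speedup = 1.0 / time_ratio

print(f"runtime shrinks to {time_ratio:.0%} of the original, about {speedup:.0f}x faster")
```

In other words, the optimized code draws more power while it runs but finishes so much sooner that the total energy drops.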
Profile-guided optimization (PGO) could lower CPU consumption (see "The risks of using PGO (profile-guided optimization) with production environment").
Take a look at Can we optimize code to reduce power consumption?
You should probably use -O2 and not worry about it: if you're looking to save power / memory, the overall design of your application will have more effect than a compiler switch.
You might try -Os which is like -O2 (good CPU speed) while simultaneously trying to reduce the binary size.
Check out the various optimizations here.
Code size optimizations are addressed above.
I'm only looking at CPU optimization. You can write really good/optimized code that has low processor utilization, and really bad/unoptimized code that maximizes CPU utilization.
So how do you most effectively use your processor?
First, use a good optimizing compiler. I won't speak to GCC, but Intel and some other purchased compilers (e.g. PGI) are very good at optimization.
Exploit the underlying hardware, such as vector instructions, FMA, registers, etc.
Follow best practices for use of peripherals, such as cellular, wifi, gps, etc.
Follow best practices for software design, such as latency hiding, avoiding polling by using interrupts, using a thread pool if appropriate, etc.
Good luck.
Related
I'm trying some matrix multiplication optimizations from this wiki here. While measuring the GFLOP/s for the naive, triple-for-loop matmul, I expected to see a drop in the GFLOP/s after a particular size, which, according to the wiki, represents the point where the data stops fitting in the cache.
I ran the benchmark on 2 different PCs:
3rd gen Intel i5 (3210M): (L1=32KB per core, L2=256KB per core, L3=3MB shared).
I got the expected graph, with a sharp drop from ~2GFLOP/s to 0.5.
6th gen Intel i7 (6500U): (L1=32KB per core, L2=256KB per core, L3=4MB shared)
On this, I instead see a gradual decrease in GFLOP/s, even if I try for larger sizes. Looking at the Ubuntu system monitor, one of the CPU cores was always at 100% usage.
I'm trying to understand the following:
How do I interpret the change in GFLOP/s with matrix size? If the expected drop corresponds to the data no longer fitting in the cache, why do I not see such a drop even for much bigger sizes on the i7?
How does the 3rd gen i5 perform faster for smaller sizes?
How do I interpret the CPU occupancy? Would I see a reduction in CPU usage if more time was being spent in fetching data from cache/RAM?
Edit:
I switched to double from float and tried -O3 and -O0; here are the plots. I couldn't check frequencies on the older i5, but the Skylake i7 goes to turbo frequency almost instantaneously and stays there for most of the process's duration.
Code from here; I used GCC 7.4.0 on the i7 and clang (Apple LLVM 7) on the i5.
Regarding question 2:
While both CPUs have the same base and turbo frequency, the Ivy Bridge has a TDP of 35W while the Skylake has 15W. Even with a much newer process, it is possible that the Ivy Bridge is able to use its turbo for a bigger part of the calculation. (Peter Cordes already mentioned checking the actual turbo.)
Regarding question 3:
CPU utilization doesn't depend on what the CPU is doing; waiting for RAM still counts as utilized. There are performance counters you can query which would tell you if the Ivy Bridge is slower because it stalls for memory more often.
With efficient cache-blocking, dense matmul should bottleneck on ALU, not memory bandwidth. O(N^3) work over O(N^2) memory.
But you're measuring a naive matmul. That means it's always horrible if you're striding down columns of one input. This is the classic problem for cache-blocking / loop-tiling.
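For reference, loop tiling restructures the triple loop so that small blocks of the inputs get reused while they are still in cache, and the innermost loop walks along rows rather than down columns. A minimal sketch of the loop structure only, in Python, so it will not show the speedup itself; `matmul_naive`, `matmul_tiled`, and the `tile` parameter are illustrative names, not the asker's code:

```python
def matmul_naive(a, b, n):
    """Naive i-j-k product: the inner loop strides down a column of b."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c


def matmul_tiled(a, b, n, tile=4):
    """Same product, computed block by block (tile x tile sub-matrices)."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Work on one block; min() handles n not divisible by tile.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        aik = a[i][k]
                        # Inner loop walks along a row of b and of c.
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c


n = 5
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i * j + 1) for j in range(n)] for i in range(n)]
print(matmul_tiled(a, b, n, tile=2) == matmul_naive(a, b, n))
```

In a compiled language the tile size would be picked so that the blocks being worked on fit in L1 or L2; the value used here is arbitrary.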
Your Skylake has significantly better bandwidth to L3 cache and DRAM, and less-associative L2 cache (4-way instead of 8-way). Still, I would have expected better performance when your working set fits in L2 than when it doesn't.
SKL probably also has better HW prefetching, and definitely a larger out-of-order window size, than IvyBridge.
IvyBridge (including your 3210M) was the generation that introduced next-page hardware prefetching, but I think the feature with that name is just TLB prefetching, not data. It probably isn't a factor, especially if transparent hugepages are avoiding any TLB misses.
But if not, TLB misses might be the real cause of the dropoff on IvB. Use performance counters to check. (e.g. perf stat)
Was your CPU frequency shooting up to max turbo right away and staying there for both CPUs? @idspispopd's answer also makes some good points about total power / cooling budget, but yeah, check that your two systems are maintaining the same CPU frequencies for this. Or if not, record what they are.
You did compile with optimization enabled, right? If not, that could be enough overhead to hide a memory bottleneck. Did you use the same compiler/version/options on both systems? Did you use -march=native?
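To put rough numbers on where the working set stops fitting in each cache level, here is a back-of-the-envelope sketch. It assumes all three N x N matrices need to be resident at once and ignores everything else competing for cache (other data, associativity conflicts, the L3 being shared), so the real thresholds will be somewhat lower; the function name is mine, and the cache sizes are the ones listed in the question:

```python
import math


def largest_n_that_fits(cache_bytes, elem_bytes, matrices=3):
    """Largest N such that `matrices` dense N x N arrays fit in cache_bytes."""
    return math.isqrt(cache_bytes // (matrices * elem_bytes))


KIB = 1024
MIB = 1024 * KIB
caches = {
    "L1 (32KB)": 32 * KIB,
    "L2 (256KB)": 256 * KIB,
    "L3 on the i5 (3MB)": 3 * MIB,
    "L3 on the i7 (4MB)": 4 * MIB,
}

for name, size in caches.items():
    n_float = largest_n_that_fits(size, 4)    # 4-byte float
    n_double = largest_n_that_fits(size, 8)   # 8-byte double
    print(f"{name}: N <= {n_float} (float), N <= {n_double} (double)")
```

Under those assumptions, float matrices fall out of L2 at around N = 147 and out of the i5's 3MB L3 at N = 512, so those are the sizes to look at on the plots.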
I have a sequential program and a parallel OpenCL program that implement the same algorithm. I measured an execution time of 133000 milliseconds for the sequential version and 17 milliseconds of kernel time for the parallel version. So when I calculate the speedup as 133000/17, I get 7823. Is this much speedup possible?
Such a speedup might happen, but it seems quite big; to me, a speedup of 7823 looks suspicious but not entirely impossible (see e.g. these slides and that). A 100x factor would seem more reasonable. Costly graphics cards are rumored to be able to run at several teraflops. A single core gives only gigaflops. Some particular programs can even run slower on GPGPU than on the CPU.
When benchmarking your CPU code, be sure to enable optimizations in your compiler (e.g. compile with at least gcc -O2 if you use GCC). Without any optimization (e.g. gcc -O0) the CPU code is slow (e.g. a 3x factor between the binaries obtained with gcc -O0 and gcc -O2 is common).
BTW, cache considerations matter a lot for CPU performance. If you wrote your numerical CPU code without taking that into account, it may be quite slow (in the weird case when it has bad locality of reference).
If the kernel function has a problem and has not actually been executed, the timing results will be inaccurate.
When we talk about a parallel CUDA program on the GPU having a speedup over a similar sequential one on the CPU, should the sequential one be compiled with compiler optimization (gcc -O2)?
I have parallelized a program on the GPU. It has a speedup of 18 compared with its CPU implementation built without compiler optimization. But when I add the -O2 option to the nvcc compiler, the speedup drops to 8.
Of course, the optimizer should be used for both the GPU and the CPU program when comparing performance.
If your focus is GPU vs. CPU, the comparison should not be affected by the quality of the software code. We usually assume the code achieves the best performance on its hardware.
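The drop from 18x to 8x says more about the CPU baseline than about the GPU. A small sketch of the arithmetic, under the assumption (mine, not stated in the question) that the GPU time stayed the same and only the host code got faster with -O2:

```python
def speedup(cpu_seconds, gpu_seconds):
    """Speedup of the GPU version relative to the CPU version."""
    return cpu_seconds / gpu_seconds


# Express times in units of the GPU run time (ASSUMED unchanged by -O2).
gpu_time = 1.0
cpu_time_unoptimized = 18.0 * gpu_time   # speedup of 18 reported without -O2
cpu_time_optimized = 8.0 * gpu_time      # speedup of 8 reported with -O2

cpu_gain_from_o2 = cpu_time_unoptimized / cpu_time_optimized
print(f"-O2 made the CPU baseline {cpu_gain_from_o2:.2f}x faster")
print(f"fair speedup to report: {speedup(cpu_time_optimized, gpu_time):.0f}x")
```

Under that assumption, the optimizer alone made the CPU code about 2.25x faster, so 8x is the number to report.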
I am tuning my GEMM code and comparing it with Eigen and MKL. I have a system with four physical cores. Until now I have used the default number of threads from OpenMP (eight on my system). I assumed this would be at least as good as four threads. However, I discovered today that if I run Eigen and my own GEMM code on a large dense matrix (1000x1000) I get better performance using four threads instead of eight. The efficiency jumped from 45% to 65%. I think this can also be seen in this plot:
https://plafrim.bordeaux.inria.fr/doku.php?id=people:guenneba
The difference is quite substantial. However, the performance is much less stable. It jumps around quite a bit from iteration to iteration, both with Eigen and with my own GEMM code. I'm surprised that Hyperthreading makes the performance so much worse. I guess this is not a question. It's an unexpected observation which I'm hoping to get feedback on.
I see that not using hyper threading is also suggested here.
How to speed up Eigen library's matrix product?
I do have a question regarding measuring max performance. What I do now is run CPU-Z and look at the frequency while I'm running my GEMM code, and then use that number in my code (4.3 GHz on one overclocked system I use). Can I trust this number for all threads? How do I know the frequency per thread to determine the maximum? How do I properly account for turbo boost?
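For context, the efficiency figures above are achieved GFLOP/s divided by a theoretical peak of cores x clock x FLOPs per cycle per core. A sketch of that arithmetic: the core count and the 4.3 GHz clock come from the question, while `FLOPS_PER_CYCLE = 16` is purely an assumption (double precision with two 256-bit FMA units per core) and has to be replaced with the right value for the actual microarchitecture:

```python
def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical peak in GFLOP/s, assuming every core holds the same clock."""
    return cores * ghz * flops_per_cycle


def efficiency(achieved_gflops, cores, ghz, flops_per_cycle):
    """Fraction of the theoretical peak that was actually achieved."""
    return achieved_gflops / peak_gflops(cores, ghz, flops_per_cycle)


CORES = 4              # physical cores (from the question)
GHZ = 4.3              # clock observed under load (from the question)
FLOPS_PER_CYCLE = 16   # ASSUMPTION: DP, 2 FMA units x 4 doubles x 2 FLOPs

peak = peak_gflops(CORES, GHZ, FLOPS_PER_CYCLE)
print(f"peak = {peak:.1f} GFLOP/s")
for eff in (0.45, 0.65):
    print(f"{eff:.0%} efficiency corresponds to {eff * peak:.1f} GFLOP/s")
```

Because the peak scales linearly with the clock, any error in the assumed frequency (for example, turbo dropping when all cores are busy) shows up directly as an error in the reported efficiency.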
The purpose of hyperthreading is to improve CPU usage for code exhibiting high latency. Hyperthreading masks this latency by treating two threads at once thus having more instruction level parallelism.
However, a well-written matrix product kernel exhibits excellent instruction-level parallelism and thus exploits nearly 100% of the CPU resources. Therefore there is no room for a second "hyper" thread, and the overhead of managing it can only decrease the overall performance.
Unless I've missed something, always possible, your CPU has one clock shared by all its components, so if you measure its rate at 4.3GHz (or whatever) then that's the rate of all the components for which it makes sense to figure out a rate. Imagine the chaos if this were not so, some cores running at one rate, others at another rate; the shared components (e.g. memory access) would become unmanageable.
As to hyperthreading actually worsening the performance of your matrix multiplication, I'm not surprised. After all, hyperthreading is a poor-person's parallelisation technique, duplicating instruction pipelines but not functional units. Once you've got your code screaming along pushing your n*10^6 contiguous memory locations through the FPUs a context switch in response to a pipeline stall isn't going to help much. At best the other pipeline will scream along for a while before another context switch robs you of useful clock cycles, at worst all the careful arrangement of data in the memory hierarchy will be horribly mangled at each switch.
Hyperthreading is designed not for parallel numeric computational speed but for improving the performance of a much more general workload; we use general-purpose CPUs in high-performance computing not because we want hyperthreading but because all the specialist parallel numeric CPUs have gone the way of all flesh.
As a provider of multithreaded concurrency services, I have explored how hyperthreading affects performance under a variety of conditions. I have found that with software that limits its own high-utilization threads to no more than the number of physical processors available, the presence or absence of HT makes very little difference. Software that attempts to use more threads than that for heavy computational work is likely unaware that it is doing so, relying merely on the total processor count (which doubles under HT), and predictably runs more slowly. Perhaps the largest benefit that enabling HT may provide is that you can max out all physical processors without bringing the rest of the system to a crawl. Without HT, software often has to leave one CPU free to keep the host system running normally. Hyperthreads are just more switchable threads; they are not additional processors.
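A sketch of the "no more threads than physical processors" rule. The Python standard library only reports logical CPUs, so the `threads_per_core=2` default below is an assumption (typical for Hyper-Threading) and not something detected from the hardware; `worker_count` is a made-up helper name:

```python
import os


def worker_count(logical_cpus, threads_per_core=2):
    """Number of compute-heavy threads to start: one per physical core.

    threads_per_core is an ASSUMPTION (2 with Hyper-Threading enabled,
    1 without); it is not queried from the hardware here.
    """
    return max(1, logical_cpus // threads_per_core)


logical = os.cpu_count() or 1   # os.cpu_count() may return None
print(f"{logical} logical CPUs -> {worker_count(logical)} compute threads")
```

On the system from the question (eight logical CPUs, four physical cores) this gives four, which could then be used as the OpenMP thread count, e.g. through the `OMP_NUM_THREADS` environment variable.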
I wrote my own CUDA kernel.
Compared to the CPU code, my kernel is 10 times faster.
But I have some questions about my experiments.
Is my program fully optimized, i.e. does it use all GPU cores, use shared memory properly, and have an adequate register count and enough occupancy?
How can I evaluate my kernel code's performance?
How can I calculate CUDA's maximum throughput theoretically?
Am I right that comparing a CPU's GFLOPS with a GPU's GFLOPS is a transparent comparison of their theoretical performance?
Thanks in advance.
Is my program fully optimized, i.e. does it use all GPU cores, use shared memory properly, and have an adequate register count and enough occupancy?
To find this out, you use one of the CUDA profilers. See How Do You Profile & Optimize CUDA Kernels?
How can I calculate CUDA's maximum throughput theoretically?
That math is slightly involved, different for each architecture and easy to get wrong. Better to look the numbers up in the specs for your chip. There are tables on Wikipedia, such as this one, for the GTX500 cards. For instance, you can see from the table that a GTX580 has a theoretical peak bandwidth of 192.4GB/s and compute throughput of 1581.1GFLOPs.
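For what it's worth, the two GTX580 figures can be reproduced from the card's headline specifications. The constants below are taken from the commonly published GTX580 spec sheet, not from this thread, so treat them as assumptions and check them against the table for your own chip:

```python
# ASSUMED GTX580 specifications (from public spec tables, not from this thread).
CUDA_CORES = 512
SHADER_CLOCK_MHZ = 1544          # shader clock, twice the 772 MHz core clock
FLOPS_PER_CORE_PER_CYCLE = 2     # one fused multiply-add counted as 2 FLOPs
BUS_WIDTH_BITS = 384
MEM_RATE_MTPS = 4008             # effective GDDR5 transfer rate, MT/s

# Single-precision peak: cores x clock x FLOPs per cycle.
peak_gflops = CUDA_CORES * SHADER_CLOCK_MHZ * FLOPS_PER_CORE_PER_CYCLE / 1000

# Memory bandwidth: bytes per transfer x transfers per second.
peak_gbps = BUS_WIDTH_BITS / 8 * MEM_RATE_MTPS / 1000

print(f"peak compute:   {peak_gflops:.1f} GFLOPs (single precision)")
print(f"peak bandwidth: {peak_gbps:.1f} GB/s")
```

The same formula does not carry over unchanged to other architectures (clock domains and FLOPs per cycle differ), which is why looking the numbers up is usually the safer route.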
Am I right that comparing a CPU's GFLOPS with a GPU's GFLOPS is a transparent comparison of their theoretical performance?
If I understand correctly, you are asking if the number of theoretical peak GFLOPs on a GPU can be directly compared with the corresponding number on a CPU. There are some things to consider when comparing these numbers:
Older GPUs did not support double precision (DP) floating point, only single precision (SP).
GPUs that do support DP do so with a significant performance degradation as compared to SP. The GFLOPs number I quoted above was for SP. On the other hand, numbers quoted for CPUs are often for DP, and there is less difference between the performance of SP and DP on a CPU.
CPU quotes can be for rates that are achievable only when using SIMD (single instruction, multiple data) vectorized instructions, and it is typically very hard to write algorithms that can approach the theoretical maximum (and they may have to be written in assembly). Sometimes, CPU quotes are for a combination of all computing resources available through different types of instructions, and it is often virtually impossible to write a program that can exploit them all simultaneously.
The rates quoted for GPUs assume that you have enough parallel work to saturate the GPU and that your algorithm is not bandwidth bound.
The preferred measure of performance is elapsed time. GFLOPs can be used as a comparison method but it is often difficult to compare between compilers and architectures due to differences in instruction set, compiler code generation, and method of counting FLOPs.
The best method is to time the performance of the application. For the CUDA code you should time all code that will occur per launch. This includes memory copies and synchronization.
Nsight Visual Studio Edition and the Visual Profiler provide the most accurate measurement of each operation. Nsight Visual Studio Edition provides theoretical bandwidth and FLOPs values for each device. In addition the Achieved FLOPs experiment can be used to capture the FLOP count both for single and double precision.