GCC optimization options for AMD Opteron 4280: benchmark

We're moving from one local computational server with 2*Xeon X5650 to another one with 2*Opteron 4280. Today I tried running my C programs on the new (AMD) machine and found a performance drop of more than 50%, with all parameters kept the same (even the seed for the random number generator). I started digging into the problem: googling "amd opteron 4200 compiler options" gave me a couple of suggestions, i.e., flags (options) for the GCC 4.6.3 compiler available to me. I played with these flags and summarized my findings in plots.
I'm not allowed to upload pictures, so the charts are here https://plus.google.com/117744944962358260676/posts/EY6djhKK9ab
I'd welcome any comments on the subject. In particular, why do "... -march=bdver1 -fprefetch-loop-arrays" and "... -fprefetch-loop-arrays -march=bdver1" yield different runtimes?
I'm also not sure whether, say, "-funroll-all-loops" is already included in "-O3" or "-Ofast". If it is, why does adding the flag again make any difference at all?
Why do additional flags make performance even worse on the Intel processor (except "-ffast-math", which is kind of obvious because, as I understand it, it enables faster but less precise floating-point arithmetic)?
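One way to check which optimizer flags a given option set actually turns on is to ask GCC itself with `-Q --help=optimizers`. The sketch below is only an illustration: the helper names `parse_optimizer_report` and `enabled_flags` are made up here, and the parser assumes the usual report layout of one flag per line ending in `[enabled]` or `[disabled]`.

```python
import shutil
import subprocess


def parse_optimizer_report(report):
    """Map each -f flag in a `gcc -Q --help=optimizers` report to True/False.

    Lines that do not end in [enabled] or [disabled] (headers, flags that
    carry a value) are skipped.
    """
    flags = {}
    for line in report.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].startswith("-f"):
            if parts[-1] == "[enabled]":
                flags[parts[0]] = True
            elif parts[-1] == "[disabled]":
                flags[parts[0]] = False
    return flags


def enabled_flags(options, gcc="gcc"):
    """Ask the installed gcc which optimizer flags `options` switch on."""
    cmd = [gcc, "-Q"] + list(options) + ["--help=optimizers"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return parse_optimizer_report(result.stdout)


if __name__ == "__main__":
    if shutil.which("gcc") is None:
        print("gcc not found on PATH")
    else:
        base = enabled_flags(["-O3"])
        extra = enabled_flags(["-O3", "-funroll-all-loops"])
        changed = sorted(f for f in extra if extra[f] != base.get(f))
        print("flags that differ from plain -O3:", changed)
```

If the list printed at the end is non-empty, the extra flag is not implied by the base level on that compiler version, which would explain why adding it changes the runtime.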
A few more details about the machines and my program:
The 2*Xeon X5650 machine is an Ubuntu Server with gcc 4.4.3. It is a 2 (CPUs on the motherboard) x 6 (real cores per CPU) x 2 (HyperThreading) = 24-thread machine, and something else was running on it during my "experiments" or benchmarks.
The 2*Opteron 4280 machine is an Ubuntu Server with gcc 4.6.3. It is a 2 (CPUs on the motherboard) x 4 (real cores per CPU = Bulldozer modules) x 2 (AMD Bulldozer threading, each thread being kind of a core) = 16-thread machine, and I was using it solely for my "benchmarks".
My benchmarking program is just a Monte Carlo simulation: it does some IO in the beginning, and then ~10^5 Monte Carlo loops to give me the result. So I assume it does both integer and floating-point calculations, looping and checking every now and then whether the randomly generated "result" is "good" enough for me or not. The program is single-threaded, and I launched it with the very same parameters for every benchmark (it is obvious, but I should mention it anyway), including the random generator seed (so the results were 100% identical). The program IS NOT MEMORY INTENSIVE. The resulting runtime is just the "user" time reported by the standard "/usr/bin/time" command.
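Since one of the machines was busy during the measurements, it may help to repeat each run several times and look at the spread before comparing flag sets. The following is a minimal sketch, assuming a Unix system: it reads the children's user CPU time through `resource.getrusage`, the same kind of figure that `/usr/bin/time` reports as "user". The function names and the stand-in workload are illustrative only.

```python
import resource
import statistics
import subprocess
import sys


def user_times(cmd, repeats=5):
    """Run `cmd` `repeats` times; return the user CPU seconds of each run."""
    times = []
    for _ in range(repeats):
        before = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        after = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
        times.append(after - before)
    return times


def summarize(times):
    """Return (minimum, median, spread) of a list of timings."""
    return min(times), statistics.median(times), max(times) - min(times)


if __name__ == "__main__":
    # Stand-in workload so the sketch runs anywhere; in practice `cmd`
    # would be the compiled simulation with its fixed seed and inputs.
    cmd = [sys.executable, "-c", "sum(i * i for i in range(200000))"]
    lowest, median, spread = summarize(user_times(cmd))
    print(f"min {lowest:.3f}s  median {median:.3f}s  spread {spread:.3f}s")
```

If the spread between identical runs is about as large as the difference between two flag orderings, that difference is probably noise rather than an effect of the flags.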

Related

How many cores do I need to use when I want to benchmark the performance of my compiler?

I have reordered the optimizations in my compiler.
I want to compare the performance of its output with gcc -O3.
I have a test suite.
How many cores do I need to use for the benchmark?
I'm sure that the two executables are different.
When I use a single core to measure their run time, the times are about the same.
But when I don't limit the number of cores, the executable from my compiler is faster than the one from gcc -O3.
How can I determine which compiler is better?
Question
How many cores do I need to use when I want to benchmark the performance of my compiler?
Well, the more the merrier. Single-core as you mentioned is definitely not recommended. Since you have mentioned gcc, you have to look into GCC benchmarks.
However, in the context of the aforementioned "the more the merrier", beware of the "law of diminishing returns", as rightly put in the answer quoted below:
In the benchmark wars the individual manufacturers will throw as many cores/processors/CPUs at the problem as they can be effective with. But there's always (except in some very weird circumstances) a "law of diminishing returns" -- the second core will only add 60-80%, the third core less than that, etc. (And this assumes a problem that is sufficiently multi-threaded to actually make use of the added cores.) So you can't look at a given benchmark and assume that twice as many cores will provide twice the performance. In fact, in some cases you could double the number of cores and actually reduce performance. Achieving good performance in a highly multi-threaded application is somewhere between an art and black magic.
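The quoted answer gives only rough per-core gains. One common textbook model for this kind of diminishing return is Amdahl's law, which the answer itself does not name, so treat the sketch below as an illustration. The parallel fraction of 0.9 in the demo is an assumed value, not a measurement of any program discussed here.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup under Amdahl's law.

    parallel_fraction: share of the single-core run time that can be
    spread over cores (between 0 and 1).
    cores: number of cores used (at least 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Assumed: 90% of the work parallelizes perfectly.
    for cores in (1, 2, 4, 8, 16):
        print(cores, "cores:", round(amdahl_speedup(0.9, cores), 2))
```

The model is optimistic: it ignores synchronization, cache and memory-bandwidth contention, so it cannot reproduce the case where adding cores makes things slower.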

Is it possible to compare ARM and x86 performance via benchmarks?

Judging by the latest news, Apple's new A11 Bionic processor scores higher than the mobile Intel Core i7 in the Geekbench benchmark.
As I understand it, this benchmark contains many different tests. They simulate different kinds of load, including load that can occur in everyday use.
Some people state that these results cannot be compared to x86 results. They say that x86 is able to perform "more complex tasks", citing Photoshop, video conversion, and scientific calculations as examples. I agree that software for ARM is often only a "lightweight" version of desktop software. But it seems to me that this limitation is caused by the format of mobile operating systems (work on the go, no mouse, etc.), and not by the performance of ARM.
As a counterexample, let's look at Safari. A browser is a complex program, and Safari works just as well on the iPad as on the Mac. Moreover, going by the results of SunSpider (a JS benchmark), Safari on the iPad comes out ahead.
I think that in everyday tasks (Web, Office, Music/Films) the performance of ARM (A10X, A11) and x86 (dual core mobile Intel i7) is comparable.
Are there any kinds of tasks where ARM really lags far behind x86? If so, what is the reason for this? What's stopping Apple from releasing an ARM laptop? They already did the same thing when migrating from PowerPC to x86. Is this a technical restriction, or just marketing?
(Intended this as a comment since this question is off topic, but it got long..).
Of course you can compare, you just need to be very careful, which most people aren't. The fact that companies publishing (or "leaking") results are biased also doesn't help much.
The common misconception is that you can compare a benchmark across two systems and get a single score for each. That ignores the fact that different systems have different optimization points, most often with regards to power (or "TDP"). What you need to look at is the power/performance curve - this graph shows how the system reacts to more power (raising the frequency, enabling more performance features, etc), and how much it contributes to its performance.
One system can win over the low power range, but lose when the available power increases since it doesn't scale that well (or even stops scaling at some point). This is usually the case with Arm, as most of these CPUs are tuned for low power, while x86 covers a larger domain and scales much better.
If you are forced to observe a single point along the graph (which is a legitimate scenario, for example if you're looking for a CPU for a low-power device), at least make sure the comparison is fair and uses the same power envelope.
There are of course other factors that must be aligned (and sometimes aren't, due to negligence or an intention to cheat). The workload should be the same (I've seen different versions compared), and the compiler should be as close as possible: generating Arm vs x86 code is already a difference, but the compiler's intermediate optimizations should be similar. When comparing two x86 CPUs, such as Intel and AMD, you should prefer the same binary, unless you also want to allow machine-specific optimizations.
Finally, the system should also be similar, which is not the case when comparing a smartphone against a pc/macbook. The memory could differ, the core count, etc. This could be legitimate difference, but it's not really related to one architecture being better than the other.
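To make the "same power envelope" idea concrete, here is a small sketch that reads two power/performance curves at a common power budget by linear interpolation. The function name `performance_at` and the curve layout are assumptions made for illustration, and the numbers in the demo are invented: they describe no real chip.

```python
from bisect import bisect_left


def performance_at(curve, watts):
    """Linearly interpolate a [(watts, score), ...] curve sorted by watts.

    Returns None when `watts` lies outside the measured range, because
    extrapolating a power/performance curve would be guesswork.
    """
    powers = [p for p, _ in curve]
    if watts < powers[0] or watts > powers[-1]:
        return None
    i = bisect_left(powers, watts)
    if powers[i] == watts:
        return curve[i][1]
    (p0, s0), (p1, s1) = curve[i - 1], curve[i]
    return s0 + (s1 - s0) * (watts - p0) / (p1 - p0)


if __name__ == "__main__":
    # Invented numbers, purely to exercise the function.
    chip_a = [(2.0, 100.0), (5.0, 160.0), (10.0, 180.0)]
    chip_b = [(5.0, 120.0), (15.0, 300.0), (45.0, 520.0)]
    for watts in (5.0, 10.0, 30.0):
        print(watts, "W:", performance_at(chip_a, watts), performance_at(chip_b, watts))
```

A `None` in the output marks a power budget at which one of the systems was never measured, so no fair single-point comparison exists there.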
The topic is bogus. Between the ISA and an application or its source code there are many levels of abstraction, and the only metrics we have (execution time or throughput) depend on many factors that can favor one side or the other: the choice of algorithm, optimizations written into the source code, the compiler/interpreter implementation and its optimizations, and operating system behaviour. So the two are not exactly (mathematically) comparable.
However, looking at the numbers and at the usefulness of mobile applications, and speaking as a management engineer, ARM chips seem capable of performing quite well.
I think the only reason is the inertia of the established standard (note that Microsoft offers a variant of Windows that runs on ARM processors, and Debian ARM variants are ready: https://www.debian.org/distrib/netinst).
Judging by raw numbers, ARMv8 cores seem close to x86-64 ones.
Note the i7-3770K results: https://en.wikipedia.org/wiki/Instructions_per_second#MIPS
Here is a summary of the latest ARMv8 CPU characteristics; note the decode and dispatch widths and the caches, and compare the last column for the Cortex-A73 with the i7-3770K:
https://en.wikipedia.org/wiki/Comparison_of_ARMv8-A_cores
intel ivy bridge characteristics:
https://en.wikichip.org/wiki/intel/microarchitectures/ivy_bridge_(client)
A75 details. https://www.anandtech.com/show/11441/dynamiq-and-arms-new-cpus-cortex-a75-a55
The topic of power consumption is complex too. The basic principle underneath the frequency/voltage rule (used and abused all over the web) is transistor rise time. https://en.wikipedia.org/wiki/Rise_time
A transistor has a fixed switching delay, which determines the maximum frequency at which it can switch. When many transistors are chained in a cascade, these delays add up in a nonlinear way (showing this takes some integration). As a result, about ten years ago, to raise clock rates, companies split the execution of each operation into more stages and ran the operations in a pipeline, even within a single logical pipeline stage. https://en.wikipedia.org/wiki/Instruction_pipelining
The rise time depends on physical characteristics (the materials and shape of the transistors). It can be reduced by increasing the voltage so that the transistor switches faster, since switching is associated (if you'll allow the term) with charging and discharging a capacitor that opens and closes the transistor channel.
These ARM chips are designed for low-power applications. With a different design they could easily gain MHz, but they would use much more power. How much? Again, that can't be compared unless you work inside a foundry and have the numbers.
Examples of ARM server processors whose power consumption is closer to that of desktop/workstation CPUs are the Cavium and Qualcomm Falkor CPUs, and some benchmarks report that they are not bad.

Numerical differences between older Mac Mini and newer Macbook

I have a project that I compile on both my Mac Mini (Core2 Duo) and a 2014 Macbook quadcore i7. Both are running the latest version of Yosemite. The application is single threaded and I am compiling the tool and libraries using the exact same version of cmake and the clang (xcode) compiler. I am getting test failures due to slight numeric differences.
I am wondering whether the inconsistency comes from the clang compiler automatically doing processor-specific optimizations (which I did not select in cmake). Could the difference be between the processors? Do the frameworks use processor-specific optimizations? I am using the BLAS/LAPACK routines from the Accelerate framework. They are called from the SuperLU sparse matrix factorization package.
In general you should not expect results from BLAS or LAPACK to be bitwise reproducible across machines. There are a number of factors that implementors tune to get the best performance, all of which result in small differences in rounding:
your two machines have different numbers of processors, which will result in work being divided differently for threading purposes (even if your application is single threaded, BLAS may use multiple threads internally).
your two machines handle hyper threading quite differently, which may also cause BLAS to use different numbers of threads.
the cache and TLB hierarchy is different between your two machines, which means that different block sizes are optimal for data reuse.
the SIMD vector size on the newer machine is twice as large as that on the older machine, which again will affect how arithmetic is grouped.
finally, the newer machine supports FMA (and using FMA is necessary to get the best performance on it); this also contributes to small differences in rounding.
Any one of these factors would be enough to result in small differences; taken together it should be expected that the results will not be bitwise identical. And that's OK, so long as both results satisfy the error bounds of the computation.
Making the results identical would require severely limiting the performance on the newer machine, which would result in your shiny expensive hardware going to waste.
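The effect of regrouping arithmetic can be seen without BLAS at all. The sketch below is a toy illustration, not how Accelerate works internally: it sums the same numbers once left to right and once in interleaved "lanes", roughly the way a wider SIMD unit would, and then compares the results with a tolerance instead of bitwise. The function names are made up here, and the tolerance in the demo is an arbitrary example value.

```python
import math


def sum_left_to_right(values):
    """Plain sequential summation."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_in_lanes(values, lanes):
    """Accumulate into `lanes` interleaved partial sums, then combine them.

    This mimics how a vectorized loop groups additions; more lanes means
    a different grouping and possibly different rounding.
    """
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    return sum_left_to_right(partial)


if __name__ == "__main__":
    data = [0.1] * 10
    a = sum_left_to_right(data)
    b = sum_in_lanes(data, 2)
    print("sequential:", repr(a))
    print("two lanes: ", repr(b))
    print("bitwise equal:", a == b)
    # Example tolerance only; a real test should derive it from the
    # error bounds of the computation.
    print("equal within 1e-12:", math.isclose(a, b, rel_tol=1e-12))
```

For the test failures in the question, this suggests comparing against reference values with a relative or absolute tolerance rather than expecting identical bits.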

How to get best performance of 8 core system using INTEL fortran

Please let me know how to set the Intel Fortran compiler options to get the best performance from an 8-core system, for both IA-32 and x64. I want to run a Fortran program and take advantage of all the CPU time available on the 8-core system. Currently the program uses only 13% of CPU time.
You can learn about autovectorization and guided auto-parallelization features of Intel FORTRAN in this tutorial: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/start/win/tutorial_comp_for_win.pdf.
If you are doing linear algebra, solvers, FFTs, you might get best results if you map your problem into calls into the Intel Math Kernel Libraries: http://software.intel.com/en-us/articles/intel-mkl/
which are already multithreaded and vectorized and cache optimized.
If you are doing media / signal processing you might map your problem into calls into the Intel Performance Primitives library: http://software.intel.com/en-us/articles/intel-ipp/
Happy hacking!
In my specific application, a computational network model with several loops running through 20k iterations, each iteration passing through a number of nested ifs, simply enabling /Q2-level optimization in the compiler was enough to reduce the computing time drastically, while keeping the CPU load around 15%.
On a similar note, I noticed that raising the optimization setting to the last level (/Q3) did do what you were asking (running all CPUs at about full load), but the computing time was NOT reduced at all.
Therefore, if you have a small problem, several cases to test, and processing capacity is the only bottleneck, it could be a good idea to open more than one Fortran solution and run those cases simultaneously.
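Running several independent cases side by side can also be scripted instead of opening multiple solutions by hand. This is a minimal sketch under the assumption that each case is a separate command-line run; `run_case` and `run_cases` are illustrative names, and the demo uses small stand-in commands rather than a real Fortran executable.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_case(cmd):
    """Run one independent case and return its exit code."""
    return subprocess.run(cmd, stdout=subprocess.DEVNULL).returncode


def run_cases(commands, workers=None):
    """Run independent commands side by side, at most `workers` at a time.

    Threads are enough here because each one only waits for its own
    child process; the heavy work happens in the children.
    """
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_case, commands))


if __name__ == "__main__":
    # Stand-in commands; in practice each entry would be the compiled
    # program followed by the input of one case.
    cases = [[sys.executable, "-c", "sum(range(10**6))"] for _ in range(4)]
    print("exit codes:", run_cases(cases))
```

With one single-threaded process per core, total throughput scales with the number of cases even though each individual run is no faster.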

OpenMP + SSE gives no speedup

My professor came across this interesting experiment on 3D linearly separable kernel convolution using SSE and OpenMP, and gave me the task of benchmarking it on our system. The author claims a crazy 18-fold speedup over the serial approach! That might not always hold, but we were expecting at least a 2-4x speedup running this on a dual-core Intel.
http://software.intel.com/en-us/articles/16bit-3d-convolution-sse4openmp-implementation-on-penryn-cpu/#comment-41994
Alas, we found no speedup at all. The serial code always performs better, with or without OpenMP.
I am using Linux, and observed a certain trend: when no other processes are running on the system, after a while the loadavg starts increasing and the %CPU utilization falls.
Another probable false positive which I ran into accidentally: I started the program, then immediately paused it. Then I resumed it in the background with bg, and saw a speedup of more than 2. This happens every time!
Any advice would be great.
Thanks,
Sayan
You really need to profile your program to identify the bottlenecks. You also need to look at optimisation in a more "holistic" way. Your performance issues may be related to poor design, poor coding, memory bandwidth limitations, and a host of other problems, none of which will be addressed by micro-optimisations such as using SIMD instead of scalar code.
Start with a profile (use a tool like Zoom for this) and work from there.
Well, I groped around a bit, and then tried the following: I compiled the program with the -O0 option (no optimization) and got a speedup of almost 2 for almost all the XYZ values. I could also see that 2 threads were utilized on my dual core (previously, it was using only one).
But now, when I remove the OpenMP pragmas, I see no speedup. This bothers me, because SSE should be able to speed things up considerably. So the speedup can be attributed entirely to OpenMP, and I have to find out why SSE is failing. Somebody told me that if the operations are trivial (what counts as trivial is debatable, since it differs from person to person), using SSE yields no speedup. But I wrote a small program that calculates sqrt(i)/i for i_max_size = 64000, and the SSE version gave a speedup of 3.5 ~ 4.0.
I will post more once I find the root cause.
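Before reaching for a full profiler, a quick check of whether a run really keeps both cores busy is to compare the CPU time a process consumed with the wall-clock time it took. The sketch below assumes a Unix system; `cpu_and_wall` and `busy_cores` are names invented for this illustration, and the demo command is a stand-in for the convolution binary.

```python
import resource
import subprocess
import sys
import time


def cpu_and_wall(cmd):
    """Run `cmd`; return (cpu_seconds, wall_seconds) for the child process.

    CPU time is user plus system time summed over all threads of the
    child, so it can exceed wall time for a multi-threaded program.
    """
    r0 = resource.getrusage(resource.RUSAGE_CHILDREN)
    t0 = time.perf_counter()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    wall = time.perf_counter() - t0
    r1 = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = (r1.ru_utime - r0.ru_utime) + (r1.ru_stime - r0.ru_stime)
    return cpu, wall


def busy_cores(cpu, wall):
    """Average number of cores kept busy over the run."""
    return cpu / wall if wall > 0 else 0.0


if __name__ == "__main__":
    # Stand-in workload; replace with the serial and OpenMP builds.
    cmd = [sys.executable, "-c", "sum(i * i for i in range(10**6))"]
    cpu, wall = cpu_and_wall(cmd)
    print(f"cpu {cpu:.2f}s  wall {wall:.2f}s  busy cores {busy_cores(cpu, wall):.2f}")
```

A ratio near 1 means the run is effectively serial, a ratio near 2 on a dual core means both threads are working, and any speedup claim for the OpenMP version should be based on wall time, since CPU time adds up across threads.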

Resources