I have access to many different computers, and I want to compare their performance. So I ran a benchmark program to see which computer runs it fastest.
Is it important where I compile the program? Are there compilation flags that make it matter (like -xHost or -march=native)?
Related
I reordered the optimization passes in my compiler.
Now I want to compare the performance of its output with gcc -O3.
I have a test suite.
How many cores do I need to use for the benchmark?
I'm sure that the two executable files are different.
When I use a single core to measure their run times, the times are nearly the same.
But when I don't limit the number of cores, the executable from my compiler is faster than the gcc -O3 one.
How can I determine which compiler is better?
Question
How many cores do I need to use when I want to benchmark the performance of my compiler?
Well, the more the merrier. Benchmarking on a single core, as you mentioned, is definitely not recommended. Since you mentioned gcc, you should look into existing GCC benchmarks.
However, in the context of the aforementioned "the more the merrier", beware of the law of diminishing returns, as rightly put by the answer quoted below:
In the benchmark wars the individual manufacturers will throw as many cores/processors/CPUs at the problem as they can be effective with. But there's always (except in some very weird circumstances) a "law of diminishing returns" -- the second core will only add 60-80%, the third core less than that, etc. (And this assumes a problem that is sufficiently multi-threaded to actually make use of the added cores.) So you can't look at a given benchmark and assume that twice as many cores will provide twice the performance. In fact, in some cases you could double the number of cores and actually reduce performance. Achieving good performance in a highly multi-threaded application is somewhere between an art and black magic.
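To put a rough number on that diminishing-returns curve, Amdahl's law says that if a fraction p of a program parallelizes, n cores give a speedup of 1 / ((1 - p) + p/n). A minimal sketch in C (the 90% parallel fraction is an assumption for illustration):

    /* amdahl.c - a rough sketch of the diminishing returns described above,
     * using Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n), where p is
     * the parallelizable fraction of the program (0.9 is an assumption).
     * Build: gcc amdahl.c -o amdahl
     */
    #include <stdio.h>

    int main(void)
    {
        double p = 0.9;  /* assume 90% of the work parallelizes */
        for (int n = 1; n <= 16; n *= 2) {
            double speedup = 1.0 / ((1.0 - p) + p / n);
            printf("%2d cores: %.2fx\n", n, speedup);
        }
        return 0;
    }

With p = 0.9 the second core adds roughly 80%, and 16 cores manage only about 6.4x, which is the effect the quoted answer describes.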
I have a C program that I am compiling with MinGW, but it runs on only one core of my 8-core machine. How do I compile it to run on multiple cores?
(To clarify: I am not looking to use multiple cores to compile, as compilation time is low. It's runtime where I want to use my full CPU capacity.)
There is no other way but to write a multithreaded program. You first need to see how to split your task into independent parts which can then be run in threads simultaneously.
It cannot be fully automated. You may consider making use of the threading additions in the C11 standard (<threads.h>), or taking a look at pthreads or OpenMP; see the sketch below.
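For example, a minimal pthreads sketch of the split-into-independent-parts idea (the workload, summing ranges, and the thread count of 8 are made up for illustration):

    /* threads_demo.c - a minimal sketch of splitting independent work
     * across threads with pthreads; the workload (summing ranges) is
     * made up for illustration.
     * Build: gcc -O2 -pthread threads_demo.c -o threads_demo
     */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 8             /* e.g., one per core on an 8-core box */
    #define N 80000000LL           /* total amount of (made-up) work */

    struct task { long long start, end, sum; };

    static void *worker(void *arg)
    {
        struct task *t = arg;
        long long s = 0;
        for (long long i = t->start; i < t->end; i++)  /* independent chunk */
            s += i;
        t->sum = s;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        struct task tasks[NTHREADS];
        long long chunk = N / NTHREADS;

        for (int i = 0; i < NTHREADS; i++) {           /* fork the work */
            tasks[i].start = i * chunk;
            tasks[i].end = (i == NTHREADS - 1) ? N : (i + 1) * chunk;
            pthread_create(&tid[i], NULL, worker, &tasks[i]);
        }

        long long total = 0;
        for (int i = 0; i < NTHREADS; i++) {           /* join and combine */
            pthread_join(tid[i], NULL);
            total += tasks[i].sum;
        }
        printf("total = %lld\n", total);
        return 0;
    }

The key step is the one the answer names: the loop iterations are independent, so each thread can take a disjoint range and the results are combined only after the joins.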
We're moving from one local computational server with 2*Xeon X5650 to another one with 2*Opteron 4280... Today I was trying to launch my wonderful C programs on the new (AMD) machine, and discovered a significant performance drop of more than 50%, keeping all possible parameters the same (even the seed for the random number generator). I started digging into this problem: googling "amd opteron 4200 compiler options" gave me a couple of suggestions, i.e., flags (options) for the GCC 4.6.3 compiler available to me. I played with these flags and summarized my findings on the plots down here...
I'm not allowed to upload pictures, so the charts are here https://plus.google.com/117744944962358260676/posts/EY6djhKK9ab
I'm wondering if anyone (coding folks) could give me any comments on the subject; especially, I'm interested in why "... -march=bdver1 -fprefetch-loop-arrays" and "... -fprefetch-loop-arrays -march=bdver1" yield different runtimes.
I'm also not sure whether, say, "-funroll-all-loops" is already included in "-O3" or "-Ofast" - and if it is, why does adding this flag one more time make any difference at all?
Why do any additional flags for the Intel processor make the performance even worse (with the sole exception of "-ffast-math", which is kind of obvious, because it enables less precise and, by definition, faster floating-point arithmetic, as I understand it, though...)?
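(To illustrate what I mean about precision: as I understand it, -ffast-math lets the compiler reassociate floating-point operations, which can change results. A toy sketch, with made-up values, of the kind of expression it may rewrite:)

    /* fastmath_demo.c - a toy sketch: evaluated as written, (big + small)
     * - big loses the small value to rounding and prints 0.0; with
     * -ffast-math the compiler is allowed to reassociate it to just
     * small, so it may print 1.0 instead.
     * Build and compare:
     *   gcc -O2 fastmath_demo.c && ./a.out 1e16 1
     *   gcc -O2 -ffast-math fastmath_demo.c && ./a.out 1e16 1
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        if (argc < 3)
            return 1;
        /* values come from the command line so they can't be
         * constant-folded at compile time */
        double big = atof(argv[1]);    /* e.g., 1e16 */
        double small = atof(argv[2]);  /* e.g., 1 */
        double result = (big + small) - big;
        printf("result = %f\n", result);  /* 0.0 as written; -ffast-math
                                             may reassociate to small */
        return 0;
    }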
A bit more detail about the machines and my program:
The 2*Xeon X5650 machine is an Ubuntu Server with gcc 4.4.3; it is a 2 (CPUs on the motherboard) x 6 (real cores each) x 2 (HyperThreading) = 24-thread machine, and there was something else running on it during my "experiments" or benchmarks...
The 2*Opteron 4280 machine is an Ubuntu Server with gcc 4.6.3; it is a 2 (CPUs on the motherboard) x 4 (Bulldozer modules each) x 2 (AMD's Bulldozer threading, kind of a core each) = 16-thread machine, and I was using it solely for my wonderful "benchmarks"...
My benchmarking program is just a Monte Carlo simulation thing; it does some IO in the beginning, and then runs ~10^5 Monte Carlo loops to give me the result. So, I assume it is both an integer and floating-point calculation program, looping every now and then and checking whether a randomly generated "result" is "good" enough for me or not... The program is single-threaded, and I was launching it with the very same parameters for every benchmark (it is obvious, but I should mention it anyway), including the random generator seed (so the results were 100% identical)... The program IS NOT MEMORY INTENSIVE. The resulting runtime is just the "user" time reported by the standard "/usr/bin/time" command.
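(For reference, a minimal sketch of that shape of benchmark - a seeded, single-threaded Monte Carlo loop; the pi-estimation workload is made up for illustration:)

    /* mc_sketch.c - a minimal sketch of a seeded, single-threaded Monte
     * Carlo benchmark of the shape described above (the workload,
     * estimating pi from points in the unit square, is made up).
     * Build: gcc -O3 mc_sketch.c -o mc_sketch
     * Time:  /usr/bin/time ./mc_sketch
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        srand(12345);                 /* fixed seed -> 100% identical runs */
        const long trials = 100000;   /* ~10^5 Monte Carlo loops */
        long hits = 0;

        for (long i = 0; i < trials; i++) {
            double x = (double)rand() / RAND_MAX;
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y <= 1.0)  /* is this random "result" good? */
                hits++;
        }
        printf("pi ~= %f\n", 4.0 * (double)hits / trials);
        return 0;
    }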
I tried to scour the GCC man page for this, but I still don't really get it.
What's the difference between -march and -mtune?
When does one use just -march, vs. both? Does it ever make sense to use just -mtune?
If you use -march then GCC will be free to generate instructions that work on the specified CPU, but (typically) not on earlier CPUs in the architecture family.
If you just use -mtune, then the compiler will generate code that works on any of them, but will favour instruction sequences that run fastest on the specific CPU you indicated. e.g. setting loop-unrolling heuristics appropriately for that CPU.
-march=foo implies -mtune=foo unless you also specify a different -mtune. This is one reason why using -march is better than just enabling options like -mavx without doing anything about tuning.
Caveat: -march=native on a CPU that GCC doesn't specifically recognize will still enable new instruction sets that GCC can detect, but will leave -mtune=generic. Use a new enough GCC that knows about your CPU if you want it to make good code.
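(As an aside, a binary can also test at run time for the instruction sets that -march would otherwise assume unconditionally. A minimal sketch using GCC's x86 builtins __builtin_cpu_init / __builtin_cpu_supports:)

    /* cpu_check.c - a small sketch: instead of letting -march emit newer
     * instructions unconditionally, a program can detect CPU features at
     * run time. __builtin_cpu_init / __builtin_cpu_supports are GCC
     * builtins for x86 targets.
     * Build: gcc -O2 cpu_check.c -o cpu_check
     */
    #include <stdio.h>

    int main(void)
    {
        __builtin_cpu_init();  /* populate the CPU feature data */
        printf("sse4.2: %s\n", __builtin_cpu_supports("sse4.2") ? "yes" : "no");
        printf("avx:    %s\n", __builtin_cpu_supports("avx")    ? "yes" : "no");
        printf("avx2:   %s\n", __builtin_cpu_supports("avx2")   ? "yes" : "no");
        return 0;
    }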
This is what I've googled up:
The -march=X option takes a CPU name X and allows GCC to generate code that uses all features of X. The GCC manual explains exactly which CPU names mean which CPU families and features.
Because features are usually added, but not removed, a binary built with -march=X will run on CPU X and has a good chance to run on CPUs newer than X, but it will almost assuredly not run on anything older than X. Certain instruction sets (3DNow!, I guess?) may be specific to a particular CPU vendor; making use of these will probably get you binaries that don't run on competing CPUs, newer or otherwise.
The -mtune=Y option tunes the generated code to run faster on Y than on other CPUs it might run on. -march=X implies -mtune=X. -mtune=Y will not override -march=X, so, for example, it probably makes no sense to use -march=core2 with -mtune=i686 - your code will not run on anything older than core2 anyway, because of -march=core2, so why on Earth would you want to optimize for something older (less featureful) than core2? -march=core2 -mtune=haswell makes more sense: don't use any features beyond what core2 provides (which is still a lot more than what -march=i686 gives you!), but do optimize the code for much newer haswell CPUs, not for core2.
There's also -mtune=generic. generic makes GCC produce code that runs best on current CPUs (the meaning of generic changes from one version of GCC to another). There are rumors on Gentoo forums that -march=X -mtune=generic produces code that runs faster on X than the code produced by -march=X -mtune=X does (or just -march=X, as -mtune=X is implied). No idea if this is true or not.
Generally, unless you know exactly what you need, it seems that the best course is to specify -march=<oldest CPU you want to run on> and -mtune=generic (-mtune=generic is there to counter the implicit -mtune=<oldest CPU you want to run on>, because you probably don't want to optimize for the oldest CPU). Or just -march=native, if you're only ever going to run on the same machine you build on.
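(To make the "won't run on anything older" point concrete, a sketch: the code below uses AVX intrinsics explicitly, so a build with -mavx - or any -march that implies it - will die with an illegal-instruction signal on a pre-AVX CPU:)

    /* avx_demo.c - a sketch of why -march matters: this explicitly uses
     * AVX instructions, so the binary will crash with SIGILL on CPUs
     * without AVX support.
     * Build: gcc -O2 -mavx avx_demo.c -o avx_demo
     */
    #include <stdio.h>
    #include <immintrin.h>

    int main(void)
    {
        __m256 a = _mm256_set1_ps(1.0f);    /* eight copies of 1.0 */
        __m256 b = _mm256_set1_ps(2.0f);    /* eight copies of 2.0 */
        __m256 c = _mm256_add_ps(a, b);     /* vaddps: an AVX instruction */

        float out[8];
        _mm256_storeu_ps(out, c);
        printf("first lane: %f\n", out[0]); /* 3.0 */
        return 0;
    }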
The GCC 4.1.2 documentation has this to say about the -pipe option:
-pipe
Use pipes rather than temporary files for communication between the various stages of compilation. This fails to work on some systems where the assembler is unable to read from a pipe; but the GNU assembler has no trouble.
I assume I'd be able to tell from an error message if my systems' assemblers didn't support pipes, so besides that issue, when does it matter whether I use that option? What factors should go into deciding whether to use it?
In our experience with a medium-sized project, adding -pipe made no discernible difference in build times. We ran into a couple of problems with it (sometimes failing to delete intermediate files if an error was encountered, IIRC), and so since it wasn't gaining us anything, we quit using it rather than trying to troubleshoot those problems.
It doesn't usually make any difference.
It has pros and cons. Historically, running the compiler and assembler simultaneously would stress RAM resources.
GCC is small by today's standards, and -pipe adds a bit of multi-core-accessible parallel execution.
But by the same token the CPU is so fast that it can create that temporary file and read it back without you even noticing. And since -pipe was never the default mode, it occasionally acts up a little. A single developer will generally not notice the time difference.
Now, there are some large projects out there. You can check out a single tree that will build all of Firefox, or NetBSD, or something like that, something that is really big. Something that includes all of X, say, as a minor subsystem component. You may or may not notice a difference when the job involves millions of lines of code in thousands and thousands of C files. As I'm sure you know, people normally work on only a small part of something like this at one time. But if you are a release engineer or working on a build server, or changing something in stdio.h, you may well want to build the whole system to see if you broke anything. And now, every drop of performance probably counts...
Trying this out now, it looks to be moderately faster to build when the source and build destinations are on NFS (Linux network). Memory usage is higher, though. If you never fill the RAM and have the source on NFS, -pipe seems like a win.
Honestly, there is very little reason not to use it. -pipe will only use a tad more RAM, and if this box is building code, I'd assume it has a decent amount. It can significantly improve build time if your system uses a more conservative filesystem that writes out and then deletes all the temporary files along the way (ext3, for example).
One advantage of -pipe is that the compiler interacts less with the file system. Even when the temporary directory is a RAM disk, the data still needs to go through the block I/O and file system layers when using temporary files, whereas with a pipe it becomes a bit more direct.
With files, the compiler first needs to finish writing before it can invoke the assembler. Another advantage of pipes is that both the compiler and the assembler can run at the same time, making a bit more use of SMP architectures. In particular, when the compiler has to wait for data from the source file (because of blocking I/O calls), the operating system can give the assembler full CPU time and let it do its job faster; a toy sketch of that producer/consumer overlap follows.
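(A toy sketch of the mechanism only, not GCC's actual driver code: a "compiler" process feeds an "assembler" process through a pipe, so the consumer starts working before the producer has finished:)

    /* pipe_demo.c - a toy sketch of the mechanism behind -pipe: a producer
     * (the "compiler") and a consumer (the "assembler") connected by a
     * pipe run concurrently, instead of one finishing a temporary file
     * before the other starts.
     * Build: gcc pipe_demo.c -o pipe_demo
     */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void)
    {
        int fd[2];
        if (pipe(fd) == -1) { perror("pipe"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {                    /* child: the "assembler" */
            close(fd[1]);
            char buf[64];
            ssize_t n;
            while ((n = read(fd[0], buf, sizeof buf)) > 0)
                write(STDOUT_FILENO, buf, n);  /* consume as data arrives */
            close(fd[0]);
            return 0;
        }

        /* parent: the "compiler" produces output a chunk at a time */
        close(fd[0]);
        const char *chunks[] = { "mov eax, 1\n", "add eax, 2\n", "ret\n" };
        for (int i = 0; i < 3; i++) {
            write(fd[1], chunks[i], strlen(chunks[i]));
            sleep(1);  /* pretend compilation takes time between chunks */
        }
        close(fd[1]);                      /* EOF lets the child finish */
        wait(NULL);
        return 0;
    }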
From a hardware point of view, I guess you would use -pipe to extend the lifetime of your hard drive.