What compiler commands can be used to make GCC and ICC compile programs as fast as each other? [closed] - performance

[Screenshots in the original question: the benchmark code, and an asm diff of the GCC vs ICC output.]
I don't really understand what optimization this is; all I know is that it's really fast, and I've tried many options from the manual to no avail. Can anyone explain in detail what optimization this is, and what option makes GCC generate the same asm as ICC, or better?

I'm not sure GCC has an option that enables this particular optimization. Don't write loops that redo the same work 100k times if you don't want your program to spend time doing that.
Defeating benchmark repeat-loops can make compilers look good on benchmarks, but AFAIK it's rarely useful in real-world code, where something else usually happens between runs of the loop you want optimized.
ICC is defeating your benchmark repeat-loop by turning it into this:
for (unsigned c = 0; c < arraySize; ++c)
{
    if (data[c] >= 128)
        for (unsigned i = 0; i < 100000; ++i)
            sum += data[c];
}
The first step, swapping the inner and outer loops, is called loop interchange. Making one pass over the array is good for cache locality and enables further optimizations.
Turning for() if() into if() for(){} else for(){} is called loop unswitching. In this case, there is no "else" work to do; the only thing in the loop was if()sum+=..., so it becomes just an if controlling a repeated-addition loop.
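In the general case, unswitching hoists a loop-invariant test out of the loop and duplicates the body under each branch. A minimal sketch of the transformation (with a hypothetical array a and invariant flag cond, not from the question):
// before: the invariant test runs every iteration
for (unsigned i = 0; i < n; ++i) {
    if (cond) a[i] += 1;
    else      a[i] -= 1;
}
// after unswitching: test once, run one of two specialized loops
if (cond)
    for (unsigned i = 0; i < n; ++i) a[i] += 1;
else
    for (unsigned i = 0; i < n; ++i) a[i] -= 1;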
ICC unrolls and vectorizes that sum +=; strangely, it doesn't just strength-reduce it to a multiply. Instead it does 100000 64-bit add operations. ymm0 holds _mm256_set1_epi64x(data[c]), broadcast by vpbroadcastq.
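For comparison, the multiply-based strength reduction it could have used looks like this at the source level (a hypothetical sketch, not what ICC actually emits):
// one multiply per element instead of 100000 adds
for (unsigned c = 0; c < arraySize; ++c)
    if (data[c] >= 128)
        sum += (long long)data[c] * 100000;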
The repeated-addition inner loop only runs conditionally; it's worth branching if that saves 6250 iterations of the loop (100000 / 16 adds per iteration). And with only one pass over the array, that's one branch per element total, not 100k.
..B1.9: # Preds ..B1.9 ..B1.8
add edx, 16 #20.2
vpaddq ymm4, ymm4, ymm0 #26.5
vpaddq ymm3, ymm3, ymm0 #26.5
vpaddq ymm2, ymm2, ymm0 #26.5
vpaddq ymm1, ymm1, ymm0 #26.5
cmp edx, 100000 #20.2
jb ..B1.9 # Prob 99% #20.2
Every iteration does 16 additions, 4 per instruction, unrolled by 4 into separate accumulators that are reduced to 1 vector and then horizontally summed after the loop. Unrolling lets Skylake and later run 3 vpaddq per clock cycle.
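The separate accumulators matter because each vector add to a given register has to wait for the previous add to that same register. Independent chains hide that latency; a scalar sketch of the same idea, where x stands for the broadcast data[c] (illustrative only):
// four independent dependency chains instead of one serial chain
long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
for (unsigned i = 0; i < 100000; i += 4) {
    s0 += x;   // each chain waits only on itself,
    s1 += x;   // so several adds can be in flight
    s2 += x;   // in the same clock cycle
    s3 += x;
}
long long sum = (s0 + s1) + (s2 + s3);   // reduce after the loop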
By contrast, GCC does multiple passes over the array, vectorizing the inner loop to branchlessly do 8 compares per iteration:
.L85:
vmovdqa ymm4, YMMWORD PTR [rax] # load 8 ints
add rax, 32
vpcmpgtd ymm0, ymm4, ymm3 # signed-compare them against 128
vpand ymm0, ymm0, ymm4 # mask data[c] to 0 or data[c]
vpmovsxdq ymm2, xmm0 # sign-extend to 64-bit
vextracti128 xmm0, ymm0, 0x1
vpaddq ymm1, ymm2, ymm1 # and add
vpmovsxdq ymm0, xmm0 # sign-extend the high half and add it
vpaddq ymm1, ymm0, ymm1 # to the same accumulator
cmp rbx, rax
jne .L85
This is inside a repeat loop that makes multiple passes over the array, and it might bottleneck on 1 shuffle uop per clock (vextracti128 and the two vpmovsxdq all need the shuffle port), i.e. about 8 elements per 3 clock cycles on Skylake.
So it just vectorized the inner if() sum += data[c] like you'd expect, without defeating the repeat loop at all. Clang does something similar; see Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?
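If you want a repeat-loop that the compiler can't collapse like this, one common trick is an empty extended-asm statement that pretends to use the result, so the work can't be hoisted out of the loop. A sketch for GCC/clang, where work() is a stand-in for whatever you're actually timing:
for (int rep = 0; rep < 100000; ++rep) {
    long long sum = work(data, arraySize);   // the code under test
    // optimization barrier: tells the compiler that sum is "used" and
    // that memory may have changed, so the work can't be hoisted or dropped
    asm volatile("" : : "g"(sum) : "memory");
}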

Compilers generate code that is functionally equivalent; there is no reason to assume there is one perfect output for a given target from one input. In fact, if two compilers produced the same output for a decent-sized function/project, then one is derived from the other, legally or not.
In general there is no reason to assume any two compilers generate the same output, or that any two versions of the same compiler generate the same output. Amplify that by command-line options that change the output of the compiler.
In general the expectation is that, for your code, one compiler may produce "better" code than another, depending on the definition of better: smaller, or faster on one particular computer, operating system, etc.
gcc is a generic compiler: it does okay for each target but is not great for any target. Some tools designed from scratch to aim at one target/system CAN (but do not necessarily) do better.
And then there can be some cheating. Take code like the possibly intentionally horribly written dhrystone and whetstone. When compiler X, which perhaps you already paid five figures for (or are evaluating to pay for), does not produce code for dhrystone that is as fast as the free tool, the vendor says: oh, sure, try this command-line option for dhrystone. Hmm, okay, that does work much better (been there, seen this).
gcc has been getting worse since versions 3.x.x/4.x.x, for various reasons I assume: partly that the folks who truly worked at this level are dying off and being replaced with folks without the low-level experience and skills, and partly that processors are getting more complicated while gcc and others grow more targets. But the volume of missed optimizations compared to older versions is increasing, and the size of the binaries from the same source with the same settings is increasing by a significant amount for a decent-sized project, not just a tiny bit.
This is not a case of "I need to get the wheel on the car, and I have a choice of tools to tighten the nuts, and the result is the same regardless of tool".
And there is no reason to expect that you can get any two compilers to generate the same output. Even if both tools generate assembly language, and even if they generated the same sequence of instructions and data, the assembly itself would still differ in label names, spacing, function ordering, and other syntax that would make a diff difficult to deal with.

Related

Is there a good reason why GCC would generate jump to jump just over one cheap instruction?

I was benchmarking some "count in a loop" code.
g++ was used with -O2, and I noticed that it has some perf problems when some condition is true in 50% of the cases. I assumed that may mean the code does unnecessary jumps (since clang produces faster code, it is not some fundamental limitation).
What I find funny in this asm output is that the code jumps over one simple add.
=> 0x42b46b <benchmark_many_ints()+1659>: movslq (%rdx),%rax
0x42b46e <benchmark_many_ints()+1662>: mov %rax,%rcx
0x42b471 <benchmark_many_ints()+1665>: imul %r9,%rax
0x42b475 <benchmark_many_ints()+1669>: shr $0xe,%rax
0x42b479 <benchmark_many_ints()+1673>: and $0x1ff,%eax
0x42b47e <benchmark_many_ints()+1678>: cmp (%r10,%rax,4),%ecx
0x42b482 <benchmark_many_ints()+1682>: jne 0x42b488 <benchmark_many_ints()+1688>
0x42b484 <benchmark_many_ints()+1684>: add $0x1,%rbx
0x42b488 <benchmark_many_ints()+1688>: add $0x4,%rdx
0x42b48c <benchmark_many_ints()+1692>: cmp %rdx,%r8
0x42b48f <benchmark_many_ints()+1695>: jne 0x42b46b <benchmark_many_ints()+1659>
Note that my question is not how to fix my code; I am just asking if there is a reason why a good compiler at -O2 would generate a jne instruction to jump over 1 cheap instruction.
I ask because, from what I understand, one could "simply" get the comparison result and use it to increment the counter (rbx in my example) by 0 or 1 without any jumps.
edit: source:
https://godbolt.org/z/v0Iiv4
The relevant part of the source (from a Godbolt link in a comment which you should really edit into your question) is:
const auto cnt = std::count_if(lookups.begin(), lookups.end(), [](const auto& val){
    return buckets[hash_val(val) % 16] == val;
});
I didn't check the libstdc++ headers to see if count_if is implemented with an if() { count++; }, or if it uses a ternary to encourage branchless code. Probably a conditional. (The compiler can choose either, but a ternary is more likely to compile to a branchless cmovcc or setcc.)
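Written out, the two plausible styles look like this (a sketch, not libstdc++'s actual source; vals, n, and cnt are stand-ins):
// branchy style: tends to compile to cmp/jcc skipping an add
for (size_t i = 0; i < n; ++i)
    if (buckets[hash_val(vals[i]) % 16] == vals[i])
        ++cnt;
// value style: more likely to become setcc or cmov (branchless)
for (size_t i = 0; i < n; ++i)
    cnt += (buckets[hash_val(vals[i]) % 16] == vals[i]) ? 1 : 0;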
It looks like gcc overestimated the cost of branchless for this code with generic tuning. -mtune=skylake (implied by -march=skylake) gives us branchless code for this regardless of -O2 vs. -O3, or -fno-tree-vectorize vs. -ftree-vectorize. (On the Godbolt compiler explorer, I also put the count in a separate function that counts a vector<int>&, so we don't have to wade through the timing and cout code-gen in main.)
branchy code: gcc8.2 -O2 or -O3, and -O2/-O3 -march=haswell or broadwell
branchless code: gcc8.2 -O2/-O3 -march=skylake
That's weird. The branchless code it emits has the same cost on Broadwell vs. Skylake. I wondered if Skylake vs. Haswell was favouring branchless because of cheaper cmov. GCC's internal cost model isn't always in terms of x86 instructions when it's optimizing in the middle-end (in GIMPLE, an architecture-neutral representation); it doesn't yet know what x86 instructions would actually be used for a branchless sequence. So maybe a conditional-select operation is involved, and gcc models it as more expensive on Haswell, where cmov is 2 uops? But I tested -march=broadwell and still got branchy code. Hopefully we can rule that out, assuming gcc's cost model knows that Broadwell (not Skylake) was the first Intel P6/SnB-family uarch to have single-uop cmov, adc, and sbb (3-input integer ops).
I don't know what else about gcc's Skylake tuning option makes it favour branchless code for this loop. Gather is efficient on Skylake, but gcc auto-vectorizes (with vpgatherqd xmm) even with -march=haswell, where it doesn't look like a win: gather is expensive there, and the hashing requires 32x64 => 64-bit SIMD multiplies using 2x vpmuludq per input vector. Maybe worth it with SKL, but I doubt it for HSW. It's also probably a missed optimization not to pack back down to dword elements, to gather twice as many elements with nearly the same throughput using vpgatherdd.
I did rule out the function being less optimized because it was called main (and marked cold). It's generally recommended not to put your microbenchmarks in main: compilers at least used to optimize main differently (e.g. for code-size instead of just speed).
Clang does make it branchless even with just -O2.
When compilers have to decide between branchless and branchy code, they have heuristics that guess which will be better. If they think the condition is highly predictable (e.g. probably mostly not-taken), that leans in favour of branchy.
In this case, the heuristic could have decided that out of all 2^32 possible values for an int, finding exactly the value you're looking for is rare. The == may have fooled gcc into thinking it would be predictable.
Branchy can be better sometimes, depending on the loop, because it can break a data dependency. See gcc optimization flag -O3 makes code slower than -O2 for a case where it was very predictable, and the -O3 branchless code-gen was slower.
-O3 at least used to be more aggressive at if-conversion of conditionals into branchless sequences like cmp ; lea 1(%rbx), %rcx; cmove %rcx, %rbx, or in this case more likely xor-zero / cmp/ sete / add. (Actually gcc -march=skylake uses sete / movzx, which is pretty much strictly worse.)
Without any runtime profiling / instrumentation data, these guesses can easily be wrong. Stuff like this is where Profile-Guided Optimization shines. Compile with -fprofile-generate, run it, then compile with -fprofile-use, and you'll probably get branchless code.
BTW, -O3 is generally recommended these days; see Is optimisation level -O3 dangerous in g++?. It does not enable -funroll-loops by default, so it only bloats code when it auto-vectorizes (especially with a very large fully-unrolled scalar prologue/epilogue around a tiny SIMD loop that bottlenecks on loop overhead. /facepalm.)

What does gcc -fno-trapping-math do?

I cannot find any example where the -fno-trapping-math option has an effect.
I would expect -ftrapping-math to disable optimizations that may affect whether traps are generated or not. For example the calculation of an intermediate value with extended precision using x87 instructions or FMA instructions may prevent an overflow exception from occurring. The -ftrapping-math option does not prevent this.
Common subexpression elimination may result in one exception occurring rather than two, for example the optimization 1./x + 1./x = 2./x will generate one trap rather than two when x=0. The -ftrapping-math option does not prevent this.
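Written out as code, that CSE case is (a sketch):
// may be folded to 2./x, so x == 0. raises one FE_DIVBYZERO
// exception instead of two
double f(double x) { return 1./x + 1./x; }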
Please give some examples of optimizations that are prevented by the default -ftrapping-math (i.e. enabled by -fno-trapping-math).
Can you recommend any documents that explain the different floating point optimization options better than the gcc manual, perhaps with specific examples of code that is optimized by each option? Possibly for other compilers.
A simple example is as follows:
float foo()
{
    float a = 0;
    float nan = a/a;
    return nan;
}
Compiled with GCC 7.3 for x64, at -O3:
foo():
    pxor   xmm0, xmm0
    divss  xmm0, xmm0
    ret
...which is pretty self-explanatory. Note that it's actually doing the div (despite knowing that 0/0 is nan), which is not especially cheap! It has to do that, because your code might be trying to deliberately raise a floating point trap.
With -O3 -fno-signaling-nans -fno-trapping-math:
foo():
    movss  xmm0, DWORD PTR .LC0[rip]
    ret
.LC0:
    .long  2143289344
That is, "just load in a NaN and return it". Which is identical behavior, as long as you're not relying on there being a trap.
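To actually observe the difference you would have to unmask the exception, e.g. with glibc's feenableexcept (a sketch; this is a glibc extension, not standard C, and typically needs linking with -lm):
#define _GNU_SOURCE
#include <fenv.h>
float foo(void);   // the function from above
int main(void)
{
    feenableexcept(FE_INVALID);    // deliver SIGFPE on invalid ops like 0.0/0.0
    volatile float sink = foo();   // traps inside foo() if the divss was kept
    (void)sink;                    // with -fno-trapping-math: quietly returns NaN
    return 0;
}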

Subtract and detect underflow, most efficient way? (x86/64 with GCC)

I'm using GCC 4.8.1 to compile C code, and I need to detect whether underflow occurs in a subtraction on the x86/64 architecture. Both operands are UNSIGNED. I know it is very easy in assembly, but I'm wondering if I can do it in C code and have GCC optimize it that way, because I can't find a way. This is a heavily used function (or low-level, is that the term?), so I need it to be efficient, but GCC seems to be too dumb to recognize this simple operation. I tried so many ways to give it hints in C, but it always uses two registers instead of just a sub and a conditional jump. And to be honest I get annoyed seeing such stupid code written so MANY times (the function is called a lot).
My best approach in C seemed to be the following:
if ((a -= b) + b < b) {
    // underflow here
}
Basically, subtract b from a, and if the result underflows, detect it and do some conditional processing (which is unrelated to a's value; for example, it reports an error).
GCC seems too dumb to reduce the above to just a sub and a conditional jump, and believe me I tried so many ways to express it in C code, and tried a lot of command-line options (-O3 and -Os included, of course). What GCC does is something like this (Intel syntax assembly):
mov rax, rcx ; 'a' is in rcx
sub rcx, rdx ; 'b' is in rdx
cmp rax, rdx ; useless comparison since sub already sets flags
jc underflow
Needless to say the above is stupid, when all it needs is this:
sub rcx, rdx
jc underflow
This is so annoying because GCC does understand that sub sets the flags that way: if I typecast the result to an int, it generates exactly the above, except it uses js (jump on sign) instead of jc. That will not work if the difference of the unsigned values is big enough to set the high bit, but it does show GCC is aware of the sub instruction affecting those flags.
Now, maybe I should give up on trying to make GCC optimize this properly and do it with inline assembly, which I have no problems with. Unfortunately, this requires asm goto, because I need a conditional JUMP, and asm goto is not very efficient with an output because it's volatile.
I tried something, but I have no idea if it is "safe" to use or not. asm goto can't have outputs for some reason. I do not want to make it flush all registers to memory; that would kill the entire point of doing this, which is efficiency. But if I use empty asm statements with outputs set to the 'a' variable before and after it, will that work, and is it safe? Here's my macro:
#define subchk(a,b,g) { typeof(a) _a=a; \
asm("":"+rm"(_a)::"cc"); \
asm goto("sub %1,%0;jc %l2"::"r,m,r"(_a),"r,r,m"(b):"cc":g); \
asm("":"+rm"(_a)::"cc"); }
and using it like this:
subchk(a,b,underflow)
// normal code with no underflow
// ...
underflow:
// underflow occured here
It's a bit ugly, but it works just fine. In my test scenario it compiles just FINE, without volatile overhead (flushing registers to memory) and without generating anything bad, and it seems to work OK. However, this is just a limited test; I can't possibly test it everywhere I use this function/macro, since as I said it is used A LOT. So, for anyone knowledgeable: is there something unsafe about the above construct?
Particularly, the value of 'a' is NOT NEEDED if underflow occurs. With that in mind, are there any side effects or unsafe things that can happen with my inline asm macro? If not, I'll use it without problems until they improve the compiler, and then I can switch back, I guess.
Please don't turn this into a debate about premature optimization or whatnot; stay on topic, I'm fully aware of all that, so thank you.
I'm probably missing something obvious, but why isn't this good?
extern void underflow(void) __attribute__((noreturn));
unsigned foo(unsigned a, unsigned b)
{
    unsigned r = a - b;
    if (r > a)
    {
        underflow();
    }
    return r;
}
I have checked, gcc optimizes it to what you want:
foo:
    movl  %edi, %eax
    subl  %esi, %eax
    jb    .L6
    rep
    ret
.L6:
    pushq %rax
    call  underflow
Of course you can handle underflow however you want, I have just done this to keep the asm simple.
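For what it's worth, GCC 5 and later (so not the asker's 4.8.1) also have a builtin that expresses this directly; a sketch reusing the underflow() declaration above:
unsigned foo2(unsigned a, unsigned b)
{
    unsigned r;
    if (__builtin_sub_overflow(a, b, &r))   // true if a - b wrapped around
        underflow();
    return r;   // compiles to essentially the same sub + jb
}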
How about the following assembly code (you can wrap it into GCC format):
sub rcx, rdx    ; assuming operands are in rcx, rdx
setc al         ; capture the carry bit into AL (see Intel "setcc" instructions)
                ; return AL as a boolean to the compiler
Then you invoke/inline the assembly code, and branch on the resulting boolean.
Have you tested whether this is actually faster? Modern x86 microarchitectures turn single assembly instructions into sequences of simpler micro-operations, and some also do macro-fusion, in which a sequence of instructions is turned into a single micro-op. In particular, sequences like test %reg, %reg; jcc target are fused, probably because global processor flags are a bane of performance.
If cmp %reg, %reg; jcc target is macro-fused, gcc might use that to get faster code. In my experience, gcc is very good at scheduling and similar low-level optimizations.

Usefulness of LOOPNE

I am unable to understand the usefulness of LOOPNE. Even if LOOPNE was not there and only LOOP was there, it would have done the same thing here. Please help me out.
MOV  CX, 80
BACK:
MOV  AH, 1
INT  21H
CMP  AL, ' '
LOOPNE BACK
CMP is more or less a SUB instruction that doesn't store the result; it only sets flags such as ZF (the zero flag).
LOOPNE has 2 conditions to loop: cx > 0 and ZF = 0
LOOP has 1 condition to loop: cx > 0
So a normal LOOP would go through all 80 characters, whereas LOOPNE will go through all characters or stop when a space is encountered, whichever comes first.
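A C-level model of what that loop does (a sketch; read_char() stands in for the INT 21H, AH=1 call):
unsigned cx = 80;
char al;
do {
    al = read_char();              // INT 21H, AH=1: read one character
    cx--;                          // LOOPNE decrements CX without touching flags
} while (cx != 0 && al != ' ');    // keep looping while count remains AND ZF = 0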
LOOPNE loops when a comparison fails, and when there is a remaining nonzero iteration count (after decrementing it). This is arguably very convenient for finding an element in a linear list of known length.
There is little use for it in modern x86 CPUs.
The LOOPNE instruction is likely implemented internally in the CPU by microinstructions and thus effectively equivalent to JNE/DEC CX/JNE.
Because the CPU designers invest vast amounts of effort to optimize compare/branch/register arithmetic, the equivalent instruction sequence is likely, on a highly pipelined CPU, to execute virtually just as fast. It may actually execute slower; you'll only know by timing it. And the fact that you are confused about what it does makes it a source of coding errors.
I presently code the equivalent instruction sequence, because I got bit by a misunderstanding once. I'm not confused about CMP and JNE.

Most Efficient way to set Register to 1 or (-1) on original 8086

I am taking an assembly course now, and the guy who checks our home assignments is a very pedantic old-school optimization freak. For example he deducts 10% if he sees:
mov ax, 0
instead of:
xor ax,ax
even if it's only used once.
I am not a complete beginner in assembly programming, but I'm not an optimization expert, so I need your help with something (it might be a very stupid question, but I'll ask anyway):
if I need to set a register value to 1 or (-1), is it better to use:
mov ax, 1
or do something like:
xor ax,ax
inc ax
I really need a good grade, so I'm trying to get it as optimized as possible. (I need to optimize for both time and code size.)
A quick google for "8086 instruction timings size" turned up a listing of instruction timings which seems to have all the timings and sizes for the 8086/8088 through the Pentium.
Although you should note that this probably doesn't include code-fetch memory bottlenecks, which can be very significant, especially on an 8088. This usually makes optimizing for code-size the better choice. See here for some details on this.
No doubt you could find official Intel documentation on the web with similar information, such as the "8086/8088 User's Manual: Programmer's and Hardware Reference".
For your specific question, the table below gives a comparison indicating that the latter (mov ax, 1) is better: fewer cycles for the same size.
Instructions    Clock cycles    Bytes
xor ax, ax      3               2
inc ax          3               1
(total)         6               3
mov ax, 1       4               3
But you might want to talk to your educational institute about this guy. A 10% penalty for a simple thing like that seems quite harsh. You should ask what should be done in the case where you have two possibilities, one faster and one shorter.
Then, once they've admitted that there are different ways to optimise code depending on what you're trying to achieve, tell them that what you're trying to do is optimise for readability and maintainability, and seriously couldn't give a damn about a wasted cycle or byte here or there(1).
Optimisation is something you generally do if and when you have a performance problem, after a piece of code is in a near-complete state - it's almost always wasted effort when the code is still subject to a not-insignificant likelihood of change.
For what it's worth, sub ax,ax appears to be on par with xor ax,ax in terms of clock cycles and size, so maybe you could throw that into the mix next time to cause him some more work.
(1) No, don't really do that, but it's fun to vent occasionally :-)
You're better off with
mov AX,1
on the 8086. If you're tracking register contents, you can possibly do better if you know that, for example, BX already has a 1 in it:
mov AX,BX
or if you know that AH is 0:
mov AL,1
etc.
Depending upon your circumstances, you may be able to get away with ...
sbb ax, ax
The result will either be 0 if the carry flag is not set or -1 if the carry flag is set.
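In C terms, sbb ax, ax computes ax = ax - ax - CF, i.e. 0 - CF (a sketch of the semantics, with a hypothetical carry_flag variable, not code you'd feed to an 8086 assembler):
// 0 when the carry flag was clear, -1 (all-ones mask) when it was set
short ax = carry_flag ? -1 : 0;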
However, if the above example is not applicable to your situation, I would recommend the
xor ax, ax
inc ax
method. It should satisfy your professor for size. However, if your processor employs any pipelining, I would expect there to be some coupling-like delay between the two instructions (I could very well be wrong on that). If such a coupling exists, the speed could be improved slightly by reordering your instructions to have another instruction between them (one that does not use ax).
Hope this helps.
I would use mov [e]ax, 1 under any circumstances. Its encoding is no longer than the hackier xor sequence, and I'm pretty sure it's faster just about anywhere. The 8086 is just weird enough to be the exception, and since that thing is so slow, a micro-optimization like this would make the most difference there. Anywhere else, executing 2 "easy" instructions will always be slower than executing 1, especially if you consider data hazards and long pipelines. You're trying to read a register in the very next instruction after you modify it, so unless your CPU can bypass the result from stage N of the pipeline (where the xor is executing) to stage N-1 (where the inc is trying to load the register, never mind adding 1 to its value), you're going to have stalls.
Other things to consider: instruction-fetch bandwidth (moot for 16-bit code, since both are 3 bytes); mov avoids changing flags (more likely to be useful than forcing them all to zero); depending on what values other registers might hold, you could perhaps do lea ax,[bx+1] (also 3 bytes, even in 32-bit code, no effect on flags); and as others have said, sbb ax,ax could work too in some circumstances; it's also shorter, at 2 bytes.
When faced with these sorts of micro-optimizations you really should measure the alternatives instead of blindly relying even on processor manuals.
P.S. New homework: is xor bx,bx any faster than xor bx,cx (on any processor)?
