Purpose of "subsd 0x0(%rip),%xmm0" - gcc

I'm trying to understand/debug a library I don't have source code for. In doing so, I came across the following function (and a bunch of other, more complicated, ones like it, e.g. [0]):
0000000000000000 <old_bern_poly_power02(double, double)>:
0: 66 0f 28 d0 movapd %xmm0,%xmm2
4: 66 0f 28 c1 movapd %xmm1,%xmm0
8: f2 0f 59 d1 mulsd %xmm1,%xmm2
c: f2 0f 5c 05 00 00 00 subsd 0x0(%rip),%xmm0 # 14 <old_bern_poly_power02(double, double)+0x14>
13: 00
14: f2 0f 59 c2 mulsd %xmm2,%xmm0
18: c3 retq
19: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
Apart from the subsd instruction, it's pretty obvious what the function does. What confuses me is the subsd itself: what does the value at the address it points to, 0x0(%rip), have to do with this calculation?
So, this is my question: What is the purpose of the "subsd" instruction above? I assume the function was compiled with GCC. Why would GCC (or an assembly language programmer) include an instruction like this? [1]
[0] Here's another, similar example:
0000000000000020 <old_bern_poly_power03(double, double)>:
20: 66 0f 28 d0 movapd %xmm0,%xmm2
24: 66 0f 28 d8 movapd %xmm0,%xmm3
28: 66 0f 28 c1 movapd %xmm1,%xmm0
2c: f2 0f 59 15 00 00 00 mulsd 0x0(%rip),%xmm2 # 34 <old_bern_poly_power03(double, double)+0x14>
33: 00
34: f2 0f 5c 05 00 00 00 subsd 0x0(%rip),%xmm0 # 3c <old_bern_poly_power03(double, double)+0x1c>
3b: 00
3c: f2 0f 59 d9 mulsd %xmm1,%xmm3
40: f2 0f 59 c3 mulsd %xmm3,%xmm0
44: f2 0f 58 c2 addsd %xmm2,%xmm0
48: f2 0f 59 c3 mulsd %xmm3,%xmm0
4c: c3 retq
4d: 0f 1f 00 nopl (%rax)
[1] Note: Based on stepping through the function in gdb, the subsd instruction seems to have no practical effect when %xmm0 is away from 0.0, because 0x0(%rip) is tiny (I am not sure this is correct):
(gdb) x $rip
0x4004e2 <old_bern_poly_power02+12>: 0x00000000055c0ff2
(gdb) p (double)(void *)(0x00000000055c0ff2)
$22 = 4.4426122995515177e-316
To investigate the effect of the subsd instruction further, I made a version of the old_bern_poly_power02 function that does not include the subsd instruction, but is otherwise identical to the listing above. The following table compares the return values of each version of the function given a variety of inputs.
Indeed, the return values are identical except when the xmm1 argument is 0, in which case the result is -0 (0x8000000000000000) in the original and 0 (0x0) in the version without the subsd instruction.
xmm0 xmm1 ret_val_with_subsd (as hex) ret_val_without_subsd (as hex)
0.000000 0.000000 -0.0000000000 (8000000000000000) 0.0000000000 ( 0)
0.000000 1.000000 0.0000000000 ( 0) 0.0000000000 ( 0)
0.000000 2.000000 0.0000000000 ( 0) 0.0000000000 ( 0)
0.000000 3.000000 0.0000000000 ( 0) 0.0000000000 ( 0)
1.000000 0.000000 -0.0000000000 (8000000000000000) 0.0000000000 ( 0)
1.000000 1.000000 1.0000000000 (3ff0000000000000) 1.0000000000 (3ff0000000000000)
1.000000 2.000000 4.0000000000 (4010000000000000) 4.0000000000 (4010000000000000)
1.000000 3.000000 9.0000000000 (4022000000000000) 9.0000000000 (4022000000000000)
2.000000 0.000000 -0.0000000000 (8000000000000000) 0.0000000000 ( 0)
2.000000 1.000000 2.0000000000 (4000000000000000) 2.0000000000 (4000000000000000)
2.000000 2.000000 8.0000000000 (4020000000000000) 8.0000000000 (4020000000000000)
2.000000 3.000000 18.0000000000 (4032000000000000) 18.0000000000 (4032000000000000)
3.000000 0.000000 -0.0000000000 (8000000000000000) 0.0000000000 ( 0)
3.000000 1.000000 3.0000000000 (4008000000000000) 3.0000000000 (4008000000000000)
3.000000 2.000000 12.0000000000 (4028000000000000) 12.0000000000 (4028000000000000)
3.000000 3.000000 27.0000000000 (403b000000000000) 27.0000000000 (403b000000000000)
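Read as C, the listing appears to compute (y - c) * (x * y), where c is the 8-byte value loaded through 0x0(%rip). The following is a minimal sketch under that reading, not the library's source: the function names and the explicit c parameter are hypothetical, and the tiny constant used below is simply the value from the gdb note, not the library's real constant.

```c
#include <math.h>

/* Hypothetical C equivalent of old_bern_poly_power02: x arrives in xmm0,
   y in xmm1, and c stands in for the constant read via 0x0(%rip). */
double bern_poly_sketch(double x, double y, double c)
{
    double xy = x * y;   /* mulsd %xmm1,%xmm2      */
    double d  = y - c;   /* subsd 0x0(%rip),%xmm0  */
    return d * xy;       /* mulsd %xmm2,%xmm0      */
}

/* Returns nonzero when v is negative zero. */
int is_negative_zero(double v)
{
    return v == 0.0 && signbit(v) != 0;
}
```

With y = 0 and a tiny positive c, y - c is a tiny negative number, so multiplying it by x * y = 0 gives -0, which matches the rows of the table where xmm1 is 0. For larger y the subtraction of such a small c rounds away and the result is the same as x * y * y.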
Edit 1: Per Harold and Ross's comments, I linked the library into a simple executable (the one that generated the above table) and disassembled the function in GDB, as follows:
$ gdb ./tmp
[snip]
(gdb) b old_bern_poly_power02
Breakpoint 1 at 0x400516
(gdb) r
Starting program: [...]/tmp
xmm0 xmm1 ret_val_with_subsd (as hex) ret_val_without_subsd (as hex)
Breakpoint 1, 0x0000000000400516 in old_bern_poly_power02 ()
(gdb) disassemble
Dump of assembler code for function old_bern_poly_power02:
=> 0x0000000000400516 <+0>: movapd %xmm0,%xmm2
0x000000000040051a <+4>: movapd %xmm1,%xmm0
0x000000000040051e <+8>: mulsd %xmm1,%xmm2
0x0000000000400522 <+12>: subsd 0x0(%rip),%xmm0 # 0x40052a <old_bern_poly_power02+20>
0x000000000040052a <+20>: mulsd %xmm2,%xmm0
0x000000000040052e <+24>: retq
0x000000000040052f <+25>: nopl (%rax)
End of assembler dump.
It still seems to have the same zero-offset reference to %rip here, though it's entirely possible that I am still doing something wrong.
Edit 2: In the above listing, I was disassembling a version I'd copied and pasted from a disassembly of the unlinked static library, then assembled and linked into my test program above. When I disassemble the function from the original static library, as linked into my real program, the offsets are indeed filled in:
(gdb) disassemble old_bern_poly_power02
Dump of assembler code for function _Z21old_bern_poly_power02dd:
0x00007f65cf642a50 <+0>: movapd %xmm0,%xmm2
0x00007f65cf642a54 <+4>: movapd %xmm1,%xmm0
0x00007f65cf642a58 <+8>: mulsd %xmm1,%xmm2
0x00007f65cf642a5c <+12>: subsd 0x38cc(%rip),%xmm0 # 0x7f65cf646330
0x00007f65cf642a64 <+20>: mulsd %xmm2,%xmm0
0x00007f65cf642a68 <+24>: retq
End of assembler dump.
So I guess that the subsd instruction is just referencing a constant in a table here.
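The address in the gdb comment can be checked by hand: a RIP-relative operand is resolved against the address of the next instruction, not the current one. A minimal sketch of that arithmetic, where the function name is hypothetical and the instruction length of 8 bytes is taken from the listing (+12 to +20):

```c
#include <stdint.h>

/* Address referenced by a RIP-relative operand: the displacement is added
   to the address of the NEXT instruction, i.e. insn_addr + insn_len. */
uint64_t rip_relative_target(uint64_t insn_addr, unsigned insn_len, int32_t disp)
{
    return insn_addr + insn_len + (uint64_t)(int64_t)disp;
}
```

For the linked library, rip_relative_target(0x7f65cf642a5c, 8, 0x38cc) gives 0x7f65cf646330, the address gdb prints. For the unrelocated object file, a displacement of 0 at offset 0xc gives 0x14, which is why objdump annotates the operand with the address of the following instruction.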

Related

Removing CMOV Instructions using GCC-9.2.0 (x86)

I am looking to compile a set of benchmark suites using traditional GCC optimizations (as in -O2/-O3) and compare the result with the same benchmarks built with no cmov instructions. I have seen several posts and websites addressing this issue (all from several years ago, and therefore referencing versions of GCC older than 9.2.0). Essentially, the advice in all of them was to use the following four flags (this is a good summary of everything I've found online):
-fno-if-conversion -fno-if-conversion2 -fno-tree-loop-if-convert -fno-tree-loop-if-convert-stores
Following this advice, I am using the following command to compile my benchmarks (theoretically with no cmov instructions).
g++-9.2.0 -std=c++11 -O2 -g -fno-if-conversion -fno-if-conversion2 -fno-tree-loop-if-convert -fno-tree-loop-if-convert-stores -fno-inline *.C -o bfs-nocmov
However, I am still finding instances where cmov is being used. If I change the optimization flag to -O0, the cmov instructions are not generated, so I am assuming there must be some way to disable this in GCC without modifying the C code or the assembly.
Below is a code snippet example of what I am trying to disable (the last instruction is the cmov I am looking to avoid):
int mx = 0;
for (int i=0; i < n; i++)
41bc8a: 45 85 e4 test %r12d,%r12d
41bc8d: 7e 71 jle 41bd00 <_Z11suffixArrayPhi+0xe0>
41bc8f: 41 8d 44 24 ff lea -0x1(%r12),%eax
41bc94: 48 89 df mov %rbx,%rdi
.../suffix/src/ks.C:92
int mx = 0;
41bc97: 31 c9 xor %ecx,%ecx
41bc99: 48 8d 54 03 01 lea 0x1(%rbx,%rax,1),%rdx
41bc9e: 66 90 xchg %ax,%ax
.../suffix/src/ks.C:94
if (s[i] > mx) mx = s[i];
41bca0: 44 0f b6 07 movzbl (%rdi),%r8d
41bca4: 44 39 c1 cmp %r8d,%ecx
41bca7: 41 0f 4c c8 cmovl %r8d,%ecx
Finally, here is the code generated with -O0. I cannot use any optimization level lower than -O2, and while manually modifying the code is an option, I am using a lot of benchmarks, so I would like to find a general solution.
for (int i=0; i < n; i++)
c67: 8b 45 e8 mov -0x18(%rbp),%eax
c6a: 3b 45 d4 cmp -0x2c(%rbp),%eax
c6d: 7d 34 jge ca3 <_Z11suffixArrayPhi+0x105>
.../suffix/src/ks.C:94
if (s[i] > mx) mx = s[i];
c6f: 8b 45 e8 mov -0x18(%rbp),%eax
c72: 48 63 d0 movslq %eax,%rdx
c75: 48 8b 45 d8 mov -0x28(%rbp),%rax
c79: 48 01 d0 add %rdx,%rax
c7c: 0f b6 00 movzbl (%rax),%eax
c7f: 0f b6 c0 movzbl %al,%eax
c82: 39 45 e4 cmp %eax,-0x1c(%rbp)
c85: 7d 16 jge c9d <_Z11suffixArrayPhi+0xff>
.../suffix/src/ks.C:94 (discriminator 1)
c87: 8b 45 e8 mov -0x18(%rbp),%eax
c8a: 48 63 d0 movslq %eax,%rdx
c8d: 48 8b 45 d8 mov -0x28(%rbp),%rax
c91: 48 01 d0 add %rdx,%rax
c94: 0f b6 00 movzbl (%rax),%eax
c97: 0f b6 c0 movzbl %al,%eax
c9a: 89 45 e4 mov %eax,-0x1c(%rbp)
If somebody has any advice or direction to look in, it would be much appreciated.
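To try flag combinations without rebuilding the whole suite, the loop can be isolated in a small file and its assembly inspected with gcc's -S option. This is a minimal reproducer, assuming s is an array of unsigned bytes (suggested by the movzbl loads in the listing); the function name max_byte is hypothetical and does not appear in ks.C.

```c
/* Reduced form of the loop from the listing: the conditional assignment
   inside the loop is the statement that GCC turns into cmovl at -O2. */
int max_byte(const unsigned char *s, int n)
{
    int mx = 0;
    for (int i = 0; i < n; i++)
        if (s[i] > mx) mx = s[i];
    return mx;
}
```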

Why do I find some never called instructions nopl, nopw after ret or jmp in GCC compiled code? [duplicate]

I've been working with C for a short while and very recently started to get into ASM. When I compile a program:
int main(void)
{
int a = 0;
a += 1;
return 0;
}
The objdump disassembly has the code, but nops after the ret:
...
08048394 <main>:
8048394: 55 push %ebp
8048395: 89 e5 mov %esp,%ebp
8048397: 83 ec 10 sub $0x10,%esp
804839a: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp)
80483a1: 83 45 fc 01 addl $0x1,-0x4(%ebp)
80483a5: b8 00 00 00 00 mov $0x0,%eax
80483aa: c9 leave
80483ab: c3 ret
80483ac: 90 nop
80483ad: 90 nop
80483ae: 90 nop
80483af: 90 nop
...
From what I've learned, nops do nothing, and since these come after the ret they wouldn't even be executed.
My question is: why bother? Couldn't ELF (linux-x86) work with a .text section (+main) of any size?
I'd appreciate any help, just trying to learn.
First of all, gcc doesn't always do this. The padding is controlled by -falign-functions, which is automatically turned on by -O2 and -O3:
-falign-functions
-falign-functions=n
Align the start of functions to the next power-of-two greater than n, skipping up to n bytes. For instance,
-falign-functions=32 aligns functions to the next 32-byte boundary, but -falign-functions=24 would align to the next 32-byte boundary only
if this can be done by skipping 23 bytes or less.
-fno-align-functions and -falign-functions=1 are equivalent and mean that functions will not be aligned.
Some assemblers only support this flag when n is a power of two; in
that case, it is rounded up.
If n is not specified or is zero, use a machine-dependent default.
Enabled at levels -O2, -O3.
There could be multiple reasons for doing this, but the main one on x86 is probably this:
Most processors fetch instructions in aligned 16-byte or 32-byte blocks. It can be
advantageous to align critical loop entries and subroutine entries by 16 in order to minimize
the number of 16-byte boundaries in the code. Alternatively, make sure that there is no 16-byte boundary in the first few instructions after a critical loop entry or subroutine entry.
(Quoted from "Optimizing subroutines in assembly
language" by Agner Fog.)
edit: Here is an example that demonstrates the padding:
// align.c
int f(void) { return 0; }
int g(void) { return 0; }
When compiled using gcc 4.4.5 with default settings, I get:
align.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: c9 leaveq
a: c3 retq
000000000000000b <g>:
b: 55 push %rbp
c: 48 89 e5 mov %rsp,%rbp
f: b8 00 00 00 00 mov $0x0,%eax
14: c9 leaveq
15: c3 retq
Specifying -falign-functions gives:
align.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: c9 leaveq
a: c3 retq
b: eb 03 jmp 10 <g>
d: 90 nop
e: 90 nop
f: 90 nop
0000000000000010 <g>:
10: 55 push %rbp
11: 48 89 e5 mov %rsp,%rbp
14: b8 00 00 00 00 mov $0x0,%eax
19: c9 leaveq
1a: c3 retq
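The amount of padding in the second listing follows directly from the alignment: f ends at 0xa, so the next free byte is 0xb, and five bytes (the 2-byte jmp plus three nops) are needed to reach 0x10. A minimal sketch of that calculation, with a hypothetical helper name:

```c
#include <stdint.h>

/* Bytes of padding needed to move addr up to the next multiple of align.
   align must be a power of two. Returns 0 if addr is already aligned. */
uint64_t padding_for(uint64_t addr, uint64_t align)
{
    return (align - (addr & (align - 1))) & (align - 1);
}
```

Here padding_for(0xb, 16) is 5, matching the bytes between the end of f and the start of g.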
This is done to align the next function on an 8-, 16- or 32-byte boundary.
From “Optimizing subroutines in assembly language” by A.Fog:
11.5 Alignment of code
Most microprocessors fetch code in aligned 16-byte or 32-byte blocks. If an important subroutine entry or jump label happens to be near the end of a 16-byte block then the microprocessor will only get a few useful bytes of code when fetching that block of code. It may have to fetch the next 16 bytes too before it can decode the first instructions after the label. This can be avoided by aligning important subroutine entries and loop entries by 16.
[...]
Aligning a subroutine entry is as simple as putting as many NOP's as needed before the subroutine entry to make the address divisible by 8, 16, 32 or 64, as desired.
As far as I remember, instructions are pipelined in the CPU, and different CPU blocks (loader, decoder and so on) process successive instructions. When a RET instruction is being executed, the next few instructions are already loaded into the pipeline. It's a guess, but you can start digging here, and if you find out more (maybe the specific number of NOPs that is safe), please share your findings.

gcc likely() unlikely() macros and assembly code

I'm trying to see how gcc's likely() and unlikely() branch prediction macros affect the generated assembly code. In the following piece of code I don't see any difference in the generated assembly regardless of which macro I use. Any pointers on what's happening?
int main() {
    volatile int x;
    unlikely(x)?x++:x--;
}
Asm code:
0000000000000014 <main>:
int main() {
14: 55 push rbp
15: 48 89 e5 mov rbp,rsp
volatile int x;
likely(x)?x++:x--;
18: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
1b: 85 c0 test eax,eax
1d: 0f 95 c0 setne al
20: 0f b6 c0 movzx eax,al
23: 48 85 c0 test rax,rax
26: 74 0b je 33 <main+0x1f>
28: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
2b: 83 c0 01 add eax,0x1
2e: 89 45 fc mov DWORD PTR [rbp-0x4],eax
31: eb 09 jmp 3c <main+0x28>
33: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
36: 83 e8 01 sub eax,0x1
39: 89 45 fc mov DWORD PTR [rbp-0x4],eax
}
3c: 5d pop rbp
3d: c3 ret
It looks like you compiled without optimization. Basic block reordering is an optimization, so without it, __builtin_expect does not have this effect. With optimization, I observe that the sense of the branch is inverted when switching the expected result.
Note that whether this has any effect on current x86 processors is difficult to say.
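The question does not show how likely() and unlikely() are defined. A minimal sketch, assuming the common definitions built on GCC's __builtin_expect (the style used in the Linux kernel); the function step is hypothetical and exists only to give the hint a branch to act on:

```c
/* Common definitions of the macros; !!(x) normalizes the condition to
   0 or 1 so it can be compared against the expected value. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* The hint does not change the result, only which path the compiler
   lays out as the fall-through when optimization is enabled. */
int step(int x)
{
    if (unlikely(x))
        return x + 1;
    return x - 1;
}
```

Compiling this with optimization enabled (for example -O2) once with likely and once with unlikely, and comparing the two assembly outputs, is one way to see the inverted branch described above.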

Objdump disassemble doesn't match source code

I'm investigating the execution flow of an OpenMP program linked to libgomp. It uses #pragma omp parallel for. I already know that this construct becomes, among other things, a call to the GOMP_parallel function, which is implemented as follows:
void
GOMP_parallel (void (*fn) (void *), void *data,
unsigned num_threads, unsigned int flags)
{
num_threads = gomp_resolve_num_threads (num_threads, 0);
gomp_team_start (fn, data, num_threads, flags, gomp_new_team (num_threads));
fn (data);
ialias_call (GOMP_parallel_end) ();
}
When executing objdump -d on libgomp, GOMP_parallel appears as:
000000000000bc80 <GOMP_parallel@@GOMP_4.0>:
bc80: 41 55 push %r13
bc82: 41 54 push %r12
bc84: 41 89 cd mov %ecx,%r13d
bc87: 55 push %rbp
bc88: 53 push %rbx
bc89: 48 89 f5 mov %rsi,%rbp
bc8c: 48 89 fb mov %rdi,%rbx
bc8f: 31 f6 xor %esi,%esi
bc91: 89 d7 mov %edx,%edi
bc93: 48 83 ec 08 sub $0x8,%rsp
bc97: e8 d4 fd ff ff callq ba70 <GOMP_ordered_end@@GOMP_1.0+0x70>
bc9c: 41 89 c4 mov %eax,%r12d
bc9f: 89 c7 mov %eax,%edi
bca1: e8 ca 37 00 00 callq f470 <omp_in_final@@OMP_3.1+0x2c0>
bca6: 44 89 e9 mov %r13d,%ecx
bca9: 44 89 e2 mov %r12d,%edx
bcac: 48 89 ee mov %rbp,%rsi
bcaf: 48 89 df mov %rbx,%rdi
bcb2: 49 89 c0 mov %rax,%r8
bcb5: e8 16 39 00 00 callq f5d0 <omp_in_final@@OMP_3.1+0x420>
bcba: 48 89 ef mov %rbp,%rdi
bcbd: ff d3 callq *%rbx
bcbf: 48 83 c4 08 add $0x8,%rsp
bcc3: 5b pop %rbx
bcc4: 5d pop %rbp
bcc5: 41 5c pop %r12
bcc7: 41 5d pop %r13
bcc9: e9 32 ff ff ff jmpq bc00 <GOMP_parallel_end@@GOMP_1.0>
bcce: 66 90 xchg %ax,%ax
First, there isn't any call to GOMP_ordered_end in the source code of GOMP_parallel, for example. Second, that function consists of:
void
GOMP_ordered_end (void)
{
}
According to the objdump output, this function starts at ba00 and finishes at bbbd. How can an empty function contain so much code? By the way, there is a comment in the libgomp source saying that it should appear only when the ORDERED construct is used (as the name suggests), which is not the case in my test.
Finally, my main concern here is: why does the source code differ so much from the disassembly? Why, for example, isn't there any mention of gomp_team_start in the assembly?
The system has gcc version 5.4.0
According to the objdump output, this function starts at ba00 and finishes at bbbd.
How could it have so much code in a function that is empty?
The function itself is small, but GCC used some additional bytes to align the next function and to store some static data (probably used by other functions in this file). Here's what I see in my local ordered.o:
00000000000003b0 <GOMP_ordered_end>:
3b0: f3 c3 repz retq
3b2: 66 66 66 66 66 2e 0f data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
3b9: 1f 84 00 00 00 00 00
First, there isn't any call to GOMP_ordered_end in the source code of GOMP_parallel, for example.
Don't get distracted by the GOMP_ordered_end@@GOMP_1.0+0x70 label in the disassembly. All it says is that this calls some local library function (for which objdump didn't find any symbol info) that happens to be located 112 bytes after the start of GOMP_ordered_end. This is most likely gomp_resolve_num_threads.
Why, for example, isn't there any mention of gomp_team_start in the assembly?
Hm, this looks pretty much like it:
bcb5: e8 16 39 00 00 callq f5d0 <omp_in_final@@OMP_3.1+0x420>
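The targets objdump prints can be reproduced from the instruction bytes: an e8 call carries a signed 32-bit little-endian displacement relative to the address of the instruction after the call. A minimal sketch of that decoding, with a hypothetical helper name:

```c
#include <stdint.h>

/* Target of a 5-byte "e8 rel32" call located at insn_addr. rel32_bytes
   are the four bytes that follow the e8 opcode, in memory order. */
uint64_t call_target(uint64_t insn_addr, const unsigned char rel32_bytes[4])
{
    uint32_t raw = (uint32_t)rel32_bytes[0]
                 | (uint32_t)rel32_bytes[1] << 8
                 | (uint32_t)rel32_bytes[2] << 16
                 | (uint32_t)rel32_bytes[3] << 24;
    /* Sign-extend the 32-bit displacement to 64 bits. */
    int64_t rel = (raw & 0x80000000u) ? (int64_t)raw - 0x100000000LL
                                      : (int64_t)raw;
    return insn_addr + 5 + (uint64_t)rel;
}
```

For the call at bc97 (bytes d4 fd ff ff) this gives ba70, which is ba00 plus 0x70. objdump has no symbol at ba70 itself, so it names the nearest preceding symbol, GOMP_ordered_end, and appends the offset.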

Weird SSE assembler instructions for double negation

The GCC and Clang compilers seem to employ some dark magic. The C code just negates the value of a double, but the assembler instructions involve a bit-wise XOR and the instruction pointer. Can somebody explain what is happening and why it is an optimal solution? Thank you.
Contents of test.c:
void function(double *a, double *b) {
*a = -(*b); // This line.
}
The resulting assembler instructions:
(gcc)
0000000000000000 <function>:
0: f2 0f 10 06 movsd xmm0,QWORD PTR [rsi]
4: 66 0f 57 05 00 00 00 xorpd xmm0,XMMWORD PTR [rip+0x0] # c <function+0xc>
b: 00
c: f2 0f 11 07 movsd QWORD PTR [rdi],xmm0
10: c3 ret
(clang)
0000000000000000 <function>:
0: f2 0f 10 06 movsd xmm0,QWORD PTR [rsi]
4: 0f 57 05 00 00 00 00 xorps xmm0,XMMWORD PTR [rip+0x0] # b <function+0xb>
b: 0f 13 07 movlps QWORD PTR [rdi],xmm0
e: c3 ret
The assembler instruction at address 0x4 represents "This line", however I can't understand how it works. The xorpd/xorps instructions are supposed to be bit-wise XOR and PTR [rip] is the instruction pointer.
I suspect that at the moment of execution rip is pointing somewhere near the 0f 57 05 00 00 00 0f strip of bytes, but I can't quite figure out, how is this working and why do both compilers choose this approach.
P.S. I should point out that this is compiled using -O3
For me, the output of gcc with the -S -O3 options for the same code is:
.file "test.c"
.text
.p2align 4,,15
.globl function
.type function, @function
function:
.LFB0:
.cfi_startproc
movsd (%rsi), %xmm0
xorpd .LC0(%rip), %xmm0
movsd %xmm0, (%rdi)
ret
.cfi_endproc
.LFE0:
.size function, .-function
.section .rodata.cst16,"aM",@progbits,16
.align 16
.LC0:
.long 0
.long -2147483648
.long 0
.long 0
.ident "GCC: (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406"
.section .note.GNU-stack,"",@progbits
Here the xorpd instruction uses instruction-pointer-relative addressing. The offset points to the .LC0 label, whose low 8 bytes hold the 64-bit value 0x8000000000000000 (only bit 63 is set):
.LC0:
.long 0
.long -2147483648
If the target were big-endian, these two lines would be swapped.
XORing the double with 0x8000000000000000 flips the sign bit (bit 63), which negates the value whether it was positive or negative.
clang uses the xorps instruction for the same purpose. It performs the same bitwise XOR over the whole register; its encoding is one byte shorter than that of xorpd, as the two listings in the question show.
If you run objdump with the -r option, it will show you the relocations that have to be applied to the program before it runs.
objdump -d test.o -r
test.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <function>:
0: f2 0f 10 06 movsd (%rsi),%xmm0
4: 66 0f 57 05 00 00 00 xorpd 0x0(%rip),%xmm0 # c <function+0xc>
b: 00
8: R_X86_64_PC32 .LC0-0x4
c: f2 0f 11 07 movsd %xmm0,(%rdi)
10: c3 retq
Disassembly of section .text.startup:
0000000000000000 <main>:
0: 31 c0 xor %eax,%eax
2: c3 retq
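Here, at offset 0x8 (the displacement field of the xorpd instruction), we have a relocation of type R_X86_64_PC32. Until it is applied, the displacement is zero, which is why the disassembly shows 0x0(%rip).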
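For R_X86_64_PC32 the linker stores S + A - P in the 4-byte field, where S is the symbol's address, A is the addend and P is the address of the field being patched. A minimal sketch of that formula; the helper name and the addresses used in the note below are hypothetical and chosen only for illustration:

```c
#include <stdint.h>

/* Value written by an R_X86_64_PC32 relocation: S + A - P.
   Assumes the result fits in 32 bits, which the linker checks. */
int32_t pc32_value(uint64_t s, int64_t a, uint64_t p)
{
    int64_t v = (int64_t)s + a - (int64_t)p;
    return (int32_t)v;
}
```

Suppose function were placed at 0x1000 and .LC0 at 0x2000. The field is then at P = 0x1008, and with the addend of -4 from the listing the stored displacement is 0xff4. At run time the CPU adds it to the address of the next instruction, 0x100c, and reaches 0x2000. The addend of -4 accounts for the 4 bytes between the field and the end of the instruction.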
PS: I'm using gcc 6.3.0
xorps xmm0,XMMWORD PTR [rip+0x0]
Any part of an instruction surrounded by [] is an indirect reference to memory.
In this case a reference to the memory at address RIP+0
(I doubt it is actually RIP+0, you might have edited the actual offset)
The x86-64 instruction set adds instruction-pointer-relative addressing. This means you can have (usually read-only) data in your program that you can address easily even if the program is moved around in memory.
A XOR xmm0,Y inverts all bits in xmm0 that are set in Y.
Negation involves inverting the sign bit, so that's why xor is used. xorpd and xorps are the packed-double and packed-single forms of the same bitwise operation, so either one can flip the sign bit of a double.
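The same trick can be written in portable C, which may make the role of the .LC0 mask easier to see. A minimal sketch, assuming IEEE 754 doubles (true on x86-64); the function name is hypothetical:

```c
#include <stdint.h>
#include <string.h>

/* Negate a double by flipping bit 63, the IEEE 754 sign bit. This is the
   scalar equivalent of XORing xmm0 with the mask stored at .LC0. */
double negate_by_xor(double v)
{
    uint64_t bits;
    memcpy(&bits, &v, sizeof bits);        /* reinterpret without aliasing UB */
    bits ^= UINT64_C(0x8000000000000000);  /* flip only the sign bit */
    memcpy(&v, &bits, sizeof v);
    return v;
}
```

Because only the sign bit changes, this also turns 0.0 into -0.0, which a subtraction from zero would not do.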

Resources