Reading through GCC code comments and online documentation, it seems there are two types of inliners - Early inliner and the IPA inliner.
For example, in gcc/ipa-fnsummary.c
/* When optimizing and analyzing for IPA inliner, initialize loop
optimizer so we can produce proper inline hints.
When optimizing and analyzing for early inliner, initialize node
params so we can produce correct BB predicates. */
What are these two kinds of inliners, and what is the difference between the two?
Simply put:
The early inliner operates on the single source file level, when compiling a single file. It will inline functions in the scope of the compiled source file and its included header files only (the scope of a single compilation unit).
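The IPA inliner operates at link time, during whole-program optimization, when the -flto (Link Time Optimization) option is enabled. More precisely, it is an inter-procedural pass that also runs in ordinary per-file compilation, but it only sees the whole program when -flto is in effect.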
When -flto is specified, gcc embeds the intermediate program representation, called a GIMPLE tree, into specialized sections in each object file. Later on, the link time optimizer (GCC's lto1 executable) reads this information, and executes different optimization passes, including the IPA inliner, to produce the final optimized executable.
The impact of the two inliners could be illustrated with a simple example:
// foo.h
void foo() {}
// goo.h
int goo();
// goo.cpp
#include "goo.h"
int goo() { return 0x123; }
// foo.cpp
#include "foo.h"
#include "goo.h"
int main()
{
foo();
return goo();
}
First, usual -O3 compilation:
g++ -O3 foo.cpp goo.cpp
By disassembling a.out (objdump a.out -d) we get the following code for main:
00000000000004f0 <main>:
4f0: e9 0b 01 00 00 jmpq 600 <_Z3goov>
4f5: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4fc: 00 00 00
4ff: 90 nop
The call to foo() is gone - this is the work of the early inliner. The definition of goo(), however, is not visible to the compiler while it compiles foo.cpp, so the call cannot be inlined and remains as a jump to _Z3goov.
Now, repeating compilation with -flto:
g++ -O3 -flto foo.cpp goo.cpp
We would get the following disassembly:
00000000000004f0 <main>:
4f0: b8 23 01 00 00 mov $0x123,%eax
4f5: c3 retq
4f6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4fd: 00 00 00
This time, the call to goo was inlined and replaced with its result, 0x123 - this is the work of the IPA inliner.
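The visibility point can also be seen without -flto. The following is a minimal sketch in C (not the original C++ files), assuming both functions are defined in the same translation unit; call_both is a hypothetical stand-in for main. With optimization enabled, GCC would typically be expected to reduce call_both to a single constant load, much like the -flto disassembly above, because both definitions are visible in one file.

```c
/* Single-file variant of the example: both callees are visible to the
 * compiler, so no link-time optimization is needed to inline them. */

static void foo(void)
{
    /* Empty body: the call disappears once it is inlined. */
}

static int goo(void)
{
    return 0x123;
}

/* Hypothetical stand-in for main() in the example above. */
int call_both(void)
{
    foo();
    return goo();
}
```

Compiling this with gcc -O2 -c and inspecting it with objdump -d is one way to check what the compiler actually did on a given GCC version.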
According to the internal documentation in ipa-inline.c, the early inliner is a simple local inlining pass that inlines callees into the current function based on local properties only. The main strength of this pass is its ability to remove the abstraction penalty present in most C++ code and prepare the code for the more advanced inter-procedural analysis (IPA).
The IPA inliner is a more advanced inliner based on the information collected during IPA. Since it has more information it can make a better estimate on which callees are most beneficial to inline. It will also prune the call-graph and remove functions where all the call sites have been inlined.
For more information refer to the internal documentation of ipa-inline.c
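To experiment with what either inliner does, GCC's function attributes can pin the decision per function. This is a small sketch, assuming a GCC-compatible compiler; the SKETCH_* macros and the function names are made up for illustration. The always_inline helper should be folded into its caller, while the noinline helper should keep its own body.

```c
/* Attribute macros; they fall back to plain C on non-GNU compilers. */
#if defined(__GNUC__)
#  define SKETCH_NOINLINE      __attribute__((noinline))
#  define SKETCH_ALWAYS_INLINE inline __attribute__((always_inline))
#else
#  define SKETCH_NOINLINE
#  define SKETCH_ALWAYS_INLINE inline
#endif

/* Requested to be inlined into every caller. */
static SKETCH_ALWAYS_INLINE int twice_inlined(int x)
{
    return 2 * x;
}

/* Requested to stay out of line. */
static SKETCH_NOINLINE int twice_outlined(int x)
{
    return 2 * x;
}

int sum_of_both(int x)
{
    return twice_inlined(x) + twice_outlined(x);
}
```

Comparing objdump -d output for sum_of_both at -O0, -O2 and -O2 -flto shows which calls survive in each configuration.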
I am tracing wireshark-2.6.10 using Pin. At several points during the initialization, I can see some calls, such as this:
00000000004e9400 <__libc_csu_init@@Base>:
...
4e9449: 41 ff 14 dc callq *(%r12,%rbx,8)
...
The target of this call is 0x197db0, shown here:
0000000000197cb0 <_start@@Base>:
...
197db0: 55 push %rbp
197db1: 48 89 e5 mov %rsp,%rbp
197db4: 5d pop %rbp
197db5: e9 66 ff ff ff jmpq 197d20 <_start@@Base+0x70>
197dba: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
...
Pin says that this is in the middle of the containing routine, i.e., _start@@Base. But when I reach this target using gdb, I see the following output:
>│0x5555556ebdb0 <frame_dummy> push %rbp
│0x5555556ebdb1 <frame_dummy+1> mov %rsp,%rbp
│0x5555556ebdb4 <frame_dummy+4> pop %rbp
│0x5555556ebdb5 <frame_dummy+5> jmpq 0x5555556ebd20 <register_tm_clones>
│0x5555556ebdba <frame_dummy+10> nopw 0x0(%rax,%rax,1)
│0x5555556ebdc0 <main_window_update()> xor %edi,%edi
Note that if I subtract the bias value, the runtime target address is consistent with the compile-time value (i.e., 0x5555556ebdb0 - 0x555555554000 = 0x197db0). It seems that there exists a pseudo-routine called frame_dummy inside _start@@Base. How is that possible? How can I extract the addresses of these pseudo-routines beforehand (i.e., before execution)?
UPDATE:
These types of calls to the middle of functions were not present in GIMP and Anjuta (which are written almost purely in C and were built from source), but they are present in Inkscape and Wireshark (which are written in C++ and were installed from packages, although I do not think that the language is the cause).
At first, it seemed that this situation occurs only during initialization, before main() is called. But in wireshark-2.6.10 it also occurs in at least one place after main() starts. Here, we have wireshark-qt.cpp, lines 522-524 (which are part of main()):
/* Get the compile-time version information string */
comp_info_str = get_compiled_version_info(get_wireshark_qt_compiled_info,
get_gui_compiled_info);
This is a call to get_compiled_version_info(). In assembly, the function is called at address 0x5555556e74c2 (0x1934c2 without bias), as shown below:
>│0x5555556e74c2 <main(int, char**)+178> callq 0x5555556f5870 <get_compiled_version_info>
│0x5555556e74c7 <main(int, char**)+183> lea 0x4972(%rip),%rdi # 0x5555556ebe40 <get_wireshark_runtime_info(_GString*)>
│0x5555556e74ce <main(int, char**)+190> mov %rax,%r13
Again, the target is in the middle of another function, _ZN7QStringD1Ev@@Base:
00000000001980f0 <_ZN7QStringD1Ev@@Base>:
...
1a1870: 41 54 push %r12
...
This is the output of gdb (0x5555556f5870 - 0x555555554000 = 0x1a1870):
>│0x5555556f5870 <get_compiled_version_info> push %r12
│0x5555556f5872 <get_compiled_version_info+2> mov %rdi,%r12
│0x5555556f5875 <get_compiled_version_info+5> push %rbp
│0x5555556f5876 <get_compiled_version_info+6> lea 0x349445(%rip),%rdi # 0x555555a3ecc2
As can be seen, the debugger recognizes that this address is the start address of get_compiled_version_info(). This is because it has access to debug_info. In all the cases that I found, the symbols for these pseudo-routines had been removed from the original binary (because .symtab was stripped). But the strange thing is that the function is located inside _ZN7QStringD1Ev@@Base. Therefore, Pin considers get_compiled_version_info() to be inside _ZN7QStringD1Ev@@Base.
How is that possible?
The frame_dummy is a bona-fide C function. If Pin thinks it's in the middle of _start, it's probably because:
_start is an assembly function, and
its .st_size is set incorrectly in the symbol table.
You can confirm this by looking at readelf -Ws a.out | egrep ' (_start|frame_dummy)'.
You are probably using a binary linked against a fairly old GLIBC.
GLIBC used to generate the C runtime startup files (where _start comes from) by using gcc -S to create assembly from C source, then splitting and editing the assembly with sed. Getting the .size directive wrong was one problem with that approach, and it is no longer used on x86_64 as of 2012 (commit).
How can I extract the addresses for these pseudo-routines, beforehand (i.e., before execution)?
Pin doesn't magically create these pseudo-routines; they must be visible in the readelf -Ws output of the original binary.
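To illustrate why a wrong .st_size produces this effect, here is a rough sketch in C of how a tool might attribute an address to a symbol: it subtracts the load bias and then looks for a symbol whose [value, value + size) range contains the result. The struct, the function names and the sizes in the table are assumptions made for illustration only; the two start addresses and the bias come from the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified view of a function symbol (real tools read Elf64_Sym). */
struct func_sym {
    const char *name;
    uint64_t value;   /* start address, like st_value */
    uint64_t size;    /* length in bytes, like st_size */
};

/* Hypothetical table: the size of _start is deliberately too large,
 * so its range overlaps frame_dummy. Both sizes are invented. */
const struct func_sym example_syms[2] = {
    { "_start",      0x197cb0, 0x110 },
    { "frame_dummy", 0x197db0, 0x00a },
};

/* Convert a runtime address to a link-time address. */
uint64_t file_address(uint64_t runtime_addr, uint64_t load_bias)
{
    return runtime_addr - load_bias;
}

/* Naive lookup: return the first symbol whose range contains addr,
 * or NULL if none does. */
const struct func_sym *containing_symbol(const struct func_sym *syms,
                                         size_t count, uint64_t addr)
{
    for (size_t i = 0; i < count; i++) {
        if (addr >= syms[i].value && addr - syms[i].value < syms[i].size)
            return &syms[i];
    }
    return NULL;
}
```

With this table, looking up file_address(0x5555556ebdb0, 0x555555554000) returns the _start entry even though frame_dummy begins at exactly that address, which mirrors what Pin reports.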
I am trying to get the raw instruction code output for a simple C program with function calls.
I have already searched on here and Google for the answer but can only find answers that are correct for single functions (no function calls).
A trivial example:
int main(){
return addition(5, 7);
}
int addition(int a, int b){
return a + b;
}
When I use gcc -c test.c -o test.o and then objdump -d test.o on this, the JAL (jump and link) instruction shows a jump to address 0x00000000, which is obviously incorrect. However, when I compile the program fully, I get an enormous amount of junk in the objdump output.
I am compiling with the mips compiler (and associated mips-objdump, etc).
The programs I am compiling are self contained (no external libraries, or system include files). What I want is a dump of the instructions where the JAL and equivalent instructions point to the correct addresses for the functions they call.
While your code is not linked, the addresses may or may not be resolved (absolute addresses certainly won't be). If they are not resolved, you should see relocation entries when you use objdump -dr. Once you link your program, those issues should be gone, and if the program is really standalone (i.e. it does not even use the C library) then all you see should be your code. You might want to pass the -nostdlib switch to gcc. I don't have a mips gcc available, but here is the x86 version for illustration:
080480d8 <addition>:
80480d8: 8b 44 24 08 mov 0x8(%esp),%eax
80480dc: 03 44 24 04 add 0x4(%esp),%eax
80480e0: c3 ret
080480e1 <main>:
80480e1: 83 ec 08 sub $0x8,%esp
80480e4: c7 44 24 04 07 00 00 movl $0x7,0x4(%esp)
80480eb: 00
80480ec: c7 04 24 05 00 00 00 movl $0x5,(%esp)
80480f3: e8 e0 ff ff ff call 80480d8 <addition>
80480f8: 83 c4 08 add $0x8,%esp
80480fb: c3 ret
That is all the code in the binary, and as you can see at 80480f3 the call is resolved. I expect it works similarly for mips.
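For checking by hand where a call lands, the target can be computed from the instruction bytes. This is a sketch in C with made-up function names: the x86 part decodes the E8 rel32 form seen at 80480f3, and the MIPS part decodes a 32-bit JAL word, whose low 26 bits hold the target divided by 4. In an unlinked object that field is still zero, which is why objdump shows a jump to 0.

```c
#include <stdint.h>

/* x86 "call rel32" (opcode E8, 5 bytes): the target is the address of
 * the next instruction plus the signed 32-bit displacement. */
uint32_t x86_call_target(uint32_t insn_addr, int32_t rel32)
{
    return insn_addr + 5u + (uint32_t)rel32;
}

/* MIPS32 JAL: bits 25..0 hold a word index; the upper 4 bits of the
 * target come from the address of the delay-slot instruction. */
uint32_t mips_jal_target(uint32_t insn_addr, uint32_t insn)
{
    uint32_t region = (insn_addr + 4u) & 0xF0000000u;
    uint32_t offset = (insn & 0x03FFFFFFu) << 2;
    return region | offset;
}
```

For the x86 dump above, the bytes e0 ff ff ff are the displacement -0x20, and x86_call_target(0x80480f3, -0x20) gives 0x80480d8, the address of addition.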
I'm using NASM for some projects on Windows. I'd like to call C's printf function, but I don't want GCC with all its bureaucracy, which turns my small project (the assembly is actually 30 lines) into 24000 lines of disassembly. How can I link (that is, call a function from the system definitions, like MessageBox) without a linker?
Edit:
I made it using a disassembler. It's funny to see that almost everything is add and nop.
0000000000402b90 <MessageBoxA>:
402b90: ff 25 2a 58 00 00 jmpq *0x582a(%rip) # 4083c0 <__imp_MessageBoxA>
402b96: 90 nop
402b97: 90 nop
402b98: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
402b9f: 00
and __imp_MessageBoxA:
00000000004083c0 <__imp_MessageBoxA>:
4083c0: 14 87 adc $0x87,%al
4083c2: 00 00 add %al,(%rax)
4083c4: 00 00 add %al,(%rax)
What does it actually do?
You can link against the C library manually. Use the following command on the object file produced by NASM:
ld -dynamic-linker /lib64/ld-linux-x86-64.so.2 objectfile -lc
The above command was tested on 64-bit Linux, so if you want to use it on Windows you must change /lib64/ld-linux-x86-64.so.2 to the path of your dynamic linker library, which is usually found under C:/cygwin/lib/.. if you are using Cygwin, or at C:/MingW/msys/1.0/lib64/ld-linux-x86-64.so.2 if you are using MinGW.
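As for the MessageBoxA stub shown in the question's edit: it is a single indirect jump through a pointer-sized slot named __imp_MessageBoxA, which the loader is expected to fill with the real function address. The bytes at 4083c0 are therefore data, not instructions, so their disassembly as adc/add is meaningless. The following C sketch shows the address arithmetic and a portable analogue of such a thunk; the function names are hypothetical, and strlen merely stands in for an imported function.

```c
#include <stdint.h>
#include <string.h>

/* Address referenced by a RIP-relative operand: the address of the
 * next instruction plus the signed 32-bit displacement. */
uint64_t rip_relative_target(uint64_t insn_addr, unsigned insn_len,
                             int32_t disp32)
{
    return insn_addr + insn_len + (uint64_t)(int64_t)disp32;
}

/* Analogue of an import slot: a pointer variable holding the address
 * of the real function (here filled in at program start-up). */
typedef size_t (*strlen_fn)(const char *);
static strlen_fn imp_strlen = strlen;

/* Analogue of the stub: it only forwards through the slot. */
size_t strlen_thunk(const char *s)
{
    return imp_strlen(s);
}
```

For the stub above, rip_relative_target(0x402b90, 6, 0x582a) yields 0x4083c0, matching the address objdump prints in its comment.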
I am trying to understand the shared libraries. From what I know, shared libraries have their base addresses as zero so they can be loaded at any address during runtime and so the variables are correctly relocated either during runtime or load time. So before the loading of the library, all the symbols are given some offsets from the base of the library. Hence, I tried to investigate some existing libraries and also created one library. But I found some differences. For libc.so, what I found is this :
$ objdump -D /lib64/libc.so.6
/lib64/libc.so.6: file format elf64-x86-64
Disassembly of section .note.gnu.build-id:
0000003b47a00270 <.note.gnu.build-id>:
3b47a00270: 04 00 add $0x0,%al
3b47a00272: 00 00 add %al,(%rax)
3b47a00274: 14 00 adc $0x0,%al
3b47a00276: 00 00 add %al,(%rax)
3b47a00278: 03 00 add (%rax),%eax
[More contents...]
From what I know, is that the elf headers take some space. But even if it does, it won't take up the addresses from 0 to 0x3b47a00270.
So, I created my own library (using -fPIC and -shared flags) and I saw this :
$ objdump -D ./libvector.so
./libvector.so: file format elf64-x86-64
Disassembly of section .note.gnu.build-id:
00000000000001c8 <.note.gnu.build-id>:
1c8: 04 00 add $0x0,%al
1ca: 00 00 add %al,(%rax)
1cc: 14 00 adc $0x0,%al
1ce: 00 00 add %al,(%rax)
1d0: 03 00 add (%rax),%eax
[More contents...]
This one seems more reasonable in terms of addresses: .note.gnu.build-id here starts at 0x1c8. So, any idea why the case is different for libc or other existing libraries like libpthread? I am using Fedora 18 x86_64. I think that this may be a case of prelinking, but I am not sure, and even if it is, how can I find out whether a library is prelinked?
Thanks a lot in advance...
objdump -D /lib64/libc.so.6
You are disassembling sections which do not contain code. Since you don't (yet) understand what you are doing, stick to objdump -d -- it will confuse you less.
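To see why the -D output is misleading here: the bytes that objdump rendered as add/adc instructions are the header of an ELF note, which consists of three little-endian 32-bit fields (name size, descriptor size, type). Below is a small C sketch that decodes the first twelve bytes shown in both listings; the struct and function names are invented for this example.

```c
#include <stdint.h>

/* The three header fields at the start of every ELF note. */
struct note_header {
    uint32_t namesz;  /* length of the owner name, including the NUL */
    uint32_t descsz;  /* length of the descriptor (the build ID itself) */
    uint32_t type;    /* 3 is NT_GNU_BUILD_ID */
};

/* First 12 bytes of .note.gnu.build-id from the objdump output. */
const unsigned char build_id_note_start[12] = {
    0x04, 0x00, 0x00, 0x00,
    0x14, 0x00, 0x00, 0x00,
    0x03, 0x00, 0x00, 0x00,
};

/* Read one little-endian 32-bit value. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

struct note_header parse_note_header(const unsigned char *bytes)
{
    struct note_header h;
    h.namesz = read_le32(bytes);
    h.descsz = read_le32(bytes + 4);
    h.type   = read_le32(bytes + 8);
    return h;
}
```

Decoded this way, the note has a 4-byte name ("GNU" plus terminator), a 20-byte descriptor and type 3, none of which is executable code.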
From what I know, shared libraries have their base addresses as zero
The above statement is incorrect: shared libraries may have their base address as zero, but they don't have to.
why in case of libc or the other existing libraries like libpthread, the case is different
Because these libraries have been prelinked. See "man prelink", e.g. here.
You can see this more clearly with readelf -l. Look at the VirtAddr of the first LOAD segment.
For a non-prelinked library, that address would be 0. For your prelinked libc.so.6, it will be 0x3b47a00000. Also note that RedHat systems are often set up to re-run prelink every two weeks, so the address your libc.so.6 is prelinked at may change over time.
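The check described above can be sketched in C as a scan over the program headers for the first LOAD segment. To keep the example self-contained, it uses a simplified struct rather than Elf64_Phdr from <elf.h>, and the two segment tables are illustrative stand-ins built from the addresses in the question, not real readelf output.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PT_LOAD 1u   /* same value as PT_LOAD in <elf.h> */
#define SKETCH_PT_NOTE 4u   /* same value as PT_NOTE in <elf.h> */

/* Simplified program header: only the fields needed here. */
struct seg {
    uint32_t type;
    uint64_t vaddr;
};

/* Illustrative tables for the two libraries in the question. */
const struct seg libvector_segs[2] = {
    { SKETCH_PT_LOAD, 0x0 },
    { SKETCH_PT_NOTE, 0x1c8 },
};
const struct seg libc_segs[2] = {
    { SKETCH_PT_LOAD, 0x3b47a00000 },
    { SKETCH_PT_NOTE, 0x3b47a00270 },
};

/* Store the vaddr of the first LOAD segment in *out and return 1,
 * or return 0 if there is no LOAD segment. */
int first_load_vaddr(const struct seg *segs, size_t count, uint64_t *out)
{
    for (size_t i = 0; i < count; i++) {
        if (segs[i].type == SKETCH_PT_LOAD) {
            *out = segs[i].vaddr;
            return 1;
        }
    }
    return 0;
}

/* A non-zero first LOAD address suggests the library was prelinked. */
int looks_prelinked(const struct seg *segs, size_t count)
{
    uint64_t base = 0;
    return first_load_vaddr(segs, count, &base) && base != 0;
}
```

A non-zero base is only a hint; a library can also be linked at a fixed non-zero address for other reasons.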
I'm not sure what a good subject line for this question is, but here we go:
In order to force code locality/compactness for a critical section of code, I'm looking for a way to call a function in an external (dynamically-loaded) library through a "jump slot" (an ELF R_X86_64_JUMP_SLOT relocation) directly at the call site - what the linker ordinarily puts into PLT / GOT, but have these inlined right at the call site.
If I emulate the call like:
#include <stdio.h>
int main(int argc, char **argv)
{
asm ("push $1f\n\t"
"jmp *0f\n\t"
"0: .quad %P0\n"
"1:\n\t"
: : "i"(printf), "D"("Hello, World!\n"));
return 0;
}
to reserve space for a 64-bit word, the call itself works (please, no comments about this being a lucky coincidence because it breaks certain ABI rules; those issues are not the subject of this question and, for my case, can be worked around or addressed in other ways. I'm trying to keep this example brief).
It creates the following assembly:
0000000000000000 <main>:
0: bf 00 00 00 00 mov $0x0,%edi
1: R_X86_64_32 .rodata.str1.1
5: 68 00 00 00 00 pushq $0x0
6: R_X86_64_32 .text+0x19
a: ff 24 25 00 00 00 00 jmpq *0x0
d: R_X86_64_32S .text+0x11
...
11: R_X86_64_64 printf
19: 31 c0 xor %eax,%eax
1b: c3 retq
But (due to using printf as the immediate, I guess ... ?) the target address here is still that of the PLT hook - the same R_X86_64_64 reloc. Linking the object file against libc into an actual executable results in:
0000000000400428 <printf@plt>:
400428: ff 25 92 04 10 00 jmpq *1049746(%rip) # 5008c0 <_GLOBAL_OFFSET_TABLE_+0x20>
[ ... ]
0000000000400500 <main>:
400500: bf 0c 06 40 00 mov $0x40060c,%edi
400505: 68 19 05 40 00 pushq $0x400519
40050a: ff 24 25 11 05 40 00 jmpq *0x400511
400511: [ .quad 400428 ]
400519: 31 c0 xorl %eax, %eax
40051b: c3 retq
[ ... ]
DYNAMIC RELOCATION RECORDS
OFFSET TYPE VALUE
[ ... ]
00000000005008c0 R_X86_64_JUMP_SLOT printf
I.e. this still gives the two-step redirection, first transfer execution to the PLT hook, then jump into the library entry point.
Is there a way how I can instruct the compiler/assembler/linker to - in this example - "inline" the jump slot target at address 0x400511?
I.e. replace the "local" (resolved at program link time by ld) R_X86_64_64 reloc with the "remote" (resolved at program load time by ld.so) R_X86_64_JUMP_SLOT one (and force non-lazy-load for this section of code) ? Maybe linker mapfiles might make this possible - if so, how?
Edit:
To make this clear, the question is about how to achieve this in a dynamically-linked executable / for an external function that's only available in a dynamic library. Yes, it's true static linking resolves this in a simpler way, but:
There are systems (like Solaris) where static libraries are generally not shipped by the vendor
There are libraries that aren't available as either source code or static versions
Hence static linking is not helpful here :(
Edit2:
I've found that on some architectures (SPARC, notably; see the section on SPARC relocations in the GNU as manual), GNU as is able to create certain types of relocation references for the linker in place, using modifiers. The quoted SPARC one would use %gdop(symbolname) to make the assembler emit instructions to the linker stating "create that relocation right here". Intel's assembler on Itanium knows the @fptr(symbol) link-relocation operator for the same kind of thing (see also section 4 in the Itanium psABI). But does an equivalent mechanism - something to instruct the assembler to emit a specific linker relocation type at a specific position in the code - exist for x86_64?
I've also found that the GNU assembler has a .reloc directive which supposedly is to be used for this purpose; still, if I try:
#include <stdio.h>
int main(int argc, char **argv)
{
asm ("push %%rax\n\t"
"lea 1f(%%rip), %%rax\n\t"
"xchg %%rax, (%%rsp)\n\t"
"jmp *0f\n\t"
".reloc 0f, R_X86_64_JUMP_SLOT, printf\n\t"
"0: .quad 0\n"
"1:\n\t"
: : "D"("Hello, World!\n"));
return 0;
}
I get an error from the linker (note that 7 == R_X86_64_JUMP_SLOT):
error: /tmp/cc6BUEZh.o: unexpected reloc 7 in object file
The assembler creates an object file for which readelf says:
Relocation section '.rela.text.startup' at offset 0x5e8 contains 2 entries:
Offset Info Type Symbol's Value Symbol's Name + Addend
0000000000000001 000000050000000a R_X86_64_32 0000000000000000 .rodata.str1.1 + 0
0000000000000017 0000000b00000007 R_X86_64_JUMP_SLOT 0000000000000000 printf + 0
This is what I want - but the linker doesn't take it.
The linker does accept R_X86_64_64 in place of R_X86_64_JUMP_SLOT above; doing that creates the same kind of binary as in the first case, redirecting to printf@plt, not the "resolved" address.
This optimization has since been implemented in GCC. It can be enabled with the -fno-plt option and the noplt function attribute:
Do not use the PLT for external function calls in position-independent code. Instead, load the callee address at call sites from the GOT and branch to it. This leads to more efficient code by eliminating PLT stubs and exposing GOT loads to optimizations. On architectures such as 32-bit x86 where PLT stubs expect the GOT pointer in a specific register, this gives more register allocation freedom to the compiler. Lazy binding requires use of the PLT; with -fno-plt all external symbols are resolved at load time.
Alternatively, the function attribute noplt can be used to avoid calls through the PLT for specific external functions.
In position-dependent code, a few targets also convert calls to functions that are marked to not use the PLT to use the GOT instead.
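As a sketch of the attribute form in C: the declaration below redeclares getenv with noplt where the compiler supports it. The SKETCH_NOPLT macro and lookup_env are hypothetical names, and getenv is only a convenient external function to demonstrate with. Whether the call really bypasses the PLT depends on the GCC version, the target, and whether the code is compiled as position-independent, so the result is best confirmed with objdump -d.

```c
#include <stdlib.h>

/* Use the attribute only where the compiler reports support for it. */
#if defined(__has_attribute)
#  if __has_attribute(noplt)
#    define SKETCH_NOPLT __attribute__((noplt))
#  endif
#endif
#ifndef SKETCH_NOPLT
#  define SKETCH_NOPLT   /* unsupported: fall back to a normal call */
#endif

/* Redeclare the external function with the attribute attached. */
extern char *getenv(const char *name) SKETCH_NOPLT;

/* Calls to getenv from here on are candidates for a GOT-based call. */
const char *lookup_env(const char *name)
{
    return getenv(name);
}
```

Compiled with something like gcc -O2 -fPIC, the call in lookup_env would be expected to appear as an indirect call through the GOT rather than a call to getenv@plt.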
In order to inline the call you would need a code (.text) relocation whose result is the final address of the function in the dynamically loaded shared library. No such relocation exists (and modern static linkers don't allow them) on x86_64 using a GNU toolchain for GNU/Linux, therefore you cannot inline the entire call as you wish to do.
The closest you can get is a direct call through the GOT (avoids PLT):
.section .rodata
.LC0:
.string "Hello, World!\n"
.text
.globl main
.type main, @function
main:
pushq %rbp
movq %rsp, %rbp
movl $.LC0, %eax
movq %rax, %rdi
call *printf@GOTPCREL(%rip)
nop
popq %rbp
ret
.size main, .-main
This should generate an R_X86_64_GLOB_DAT relocation against printf in the GOT, to be used by the sequence above. You need to avoid C code because, in general, the compiler may use any number of caller-saved registers in the prologue and epilogue, which forces you to save and restore all such registers around the asm function call or risk corrupting those registers for later use in the wrapper function. Therefore it is easier to write the wrapper in pure assembly.
Another option is to compile with -Wl,-z,now -Wl,-z,relro which ensures the PLT and PLT-related GOT entries are resolved at startup to increase code locality and compactness. With full RELRO you'll only have to run code in the PLT and access data in the GOT, two things which should already be somewhere in the cache hierarchy of the logical core. If full RELRO is enough to meet your needs then you wouldn't need wrappers and you would have added security benefits.
The best options are really static linking or LTO if they are available to you.
You can statically link the executable. Just add -static to the final link command, and all your indirect jumps will be replaced by direct calls.