Want good understanding on shared libraries at the memory level - gcc

Please somebody help.
I am creating a shared library, and running this command gives an error:
gcc -shared libx.o -o libx.so
/usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld: libx.o: relocation R_X86_64_32 against `.rodata' can not be used when making a shared object;
recompile with -fPIC
libx.o: could not read symbols: Bad value
collect2: ld returned 1 exit status
So I compiled with -fPIC and it builds. Can you please give me a good understanding of the significance of -fPIC at the memory level? I mean, how is the library shared in physical memory between two programs that use it?
Thanks a lot.

-fpic stands for position-independent code.
You can read Drepper's paper for more on dynamic linking: http://www.akkadia.org/drepper/dsohowto.pdf
This seems to be a duplicate of a similar post: GCC -fPIC option

For systems with virtual memory the loader is likely to map the shared code into some contiguous pages in the memory space of the applications that are using that library. In order to share these pages between multiple processes they must be:
read-only.
able to be mapped at an arbitrary address in the address space of a process.
consequences:
Most code is not read-only in this sense: it cannot simply be mapped into the memory space of a process and run, because the loader must first modify it in ways that are specific to each process. To get read-only text you pass the -fpic option to the compiler. This causes the compiler to generate less optimal machine code, with the benefit that the code is read-only.
Efficient code often cannot be mapped to an arbitrary location in the address space. Commonly, efficient code is constrained either to a particular address or to a low range of addresses. The -fpic option instructs the compiler to use less efficient code generation, with the benefit that there is no constraint on where the code runs.
Now we can understand your problem:
relocation R_X86_64_32 against `.rodata' - Here the linker is telling you that the compiler has used code generation that is constrained to run in a low range of addresses. Therefore, it is unsuitable for use in a shared library.
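To make the failing relocation concrete, here is a minimal sketch of what libx.c might contain. The file contents and the function name libx_greeting are assumptions, since the original source was not posted; the build commands in the comment use only the flags already mentioned in this thread.

```c
/* libx.c (hypothetical contents; the original source was not shown).
 *
 * Returning the address of a string literal forces the compiler to
 * materialize an address inside .rodata.
 *
 *   gcc -c libx.c -o libx.o          non-PIC object; on x86-64, with a
 *                                    compiler that does not default to
 *                                    PIE, this may use an absolute
 *                                    32-bit address (R_X86_64_32)
 *   gcc -fPIC -c libx.c -o libx.o    address is computed relative to
 *                                    the instruction pointer instead
 *   gcc -shared libx.o -o libx.so    links only with the -fPIC object
 */
const char *libx_greeting(void)
{
    return "hello from libx";
}
```

Because the -fPIC version contains no absolute address in its instructions, the loader does not have to patch the text, and the same physical pages can be mapped into every process that uses the library.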

Related

Dynamic Memory Allocation in kernel's VDSO

For an experiment, I need to instrument and allocate entry nodes for a hashtable inside arch/x86/vdso/vclock_gettime.c through the following typical approach.
struct h_struct *phe = (struct h_struct*) kmalloc(sizeof(struct h_struct), GFP_HIGHUSER);
I have tested and used this in other areas of the kernel, where it compiled and worked as expected. However, in the case of the VDSO it fails to link:
CC arch/x86/vdso/vclock_gettime.o
VDSO arch/x86/vdso/vdso.so.dbg
arch/x86/vdso/vclock_gettime.o: In function `kmalloc':
linux-3.10.0/include/linux/slub_def.h:171: undefined reference to `kmalloc_caches'
linux-3.10.0/include/linux/slub_def.h:171: undefined reference to `kmem_cache_alloc_trace'
collect2: error: ld returned 1 exit status
OBJCOPY arch/x86/vdso/vdso.so
I am aware that the VDSO has a special status: although it is allocated in kernel space, it is mapped into userspace, in the address space of every process.
I wonder if someone more experienced can spot or suggest a way to allocate memory in the VDSO for my needs.
P.S. malloc can't be used, as that requires stdlib.h, which results in linking against glibc.

Instruction pointer value of dynamic linking and static linking

By using Intel's pin, I printed out the instruction pointer (ip) values for a program with dynamic linking and static linking.
And I've found that their ip values are quite different, even though they are the same program.
A program with static linking shows 0x400f50 for its very first ip value.
but a program with dynamic linking shows 0x7f94f0762090 for its first ip value
I am not sure why there is such a large gap.
It would be appreciated if anyone could help me find out the reason.
I am not sure why there is such a large gap.
Because a dynamically linked program does not start executing in the binary: the first few thousands of instructions are executed in the dynamic linker (ld-linux), before control is transferred to _start in the main executable.
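One way to see both addresses from inside the process is the auxiliary vector the kernel passes at startup. The sketch below assumes Linux with glibc 2.16 or later (for getauxval); the two helper names are made up for illustration.

```c
#include <stdint.h>
#include <sys/auxv.h>   /* getauxval, AT_BASE, AT_ENTRY (glibc 2.16+) */

/* Load address of the dynamic linker, or 0 when no interpreter was
 * loaded, as in a statically linked program. */
uintptr_t interpreter_base(void)
{
    return (uintptr_t)getauxval(AT_BASE);
}

/* Entry point of the main executable (_start). */
uintptr_t program_entry(void)
{
    return (uintptr_t)getauxval(AT_ENTRY);
}
```

In a dynamically linked build, the first instruction pointer that pin reports should lie in the mapping that starts at interpreter_base(), not at program_entry(); in a static build there is no interpreter, so execution begins at the entry point directly.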
See also this answer.

Why ELF executables have a fixed load address?

ELF executables have a fixed load address (0x08048000 on 32-bit x86 Linux binaries, and 0x400000 on 64-bit x86_64 binaries).
I read the SO answers (e.g., this one) about the historical reasons for those specific addresses. What I still don't understand is why to use a fixed load address and not a randomized one (given some range to be randomized within)?
why to use a fixed load address and not a randomized one
Traditionally that's how executables worked. If you want a randomized load address, build a PIE binary (which is really a special case of shared library that has startup code in it) with -fPIE and link with -pie flags.
Building with -fPIE introduces runtime overhead, in some cases as bad as 10% performance degradation, which may not be tolerable if you have a large cluster or you need every last bit of performance.
Not sure I understood your question correctly, but assuming I did, this is mostly a legacy/historical issue. ELF is the file format used by many Unix-like operating systems, Linux among them.
For a traditional executable, the ELF file specifies a resolved, absolute virtual address at which the code is loaded and begins running.
That is simply how such executables were defined, and for historical reasons it has not changed: you could not just "throw" the executable at any memory address and have it run successfully. Back in the 90s, when the ELF format was introduced, problems such as calling functions through virtual tables were raised, and it was decided that executables would contain absolute addresses.
Also think about it; take a look at the ELF format: https://en.wikipedia.org/wiki/Executable_and_Linkable_Format
How would you design an OS executable loader that could load an executable at ANY desired virtual address and have the code run successfully without changing the binary itself? To do that you would need to substantially change either the output compilers generate or the format itself, which again isn't possible.
As time passed, the need for position-independent code and executables (PIC/PIE) arose, and shared objects were introduced to allow that, along with ASLR (Address Space Layout Randomization), which means the code can be placed at any memory address and still execute.
This is implemented by making sure that all calls within the code itself are relative to the address of the current instruction, and that whatever the OS loader must change when the shared object is loaded is data (RW, e.g. the .data segment) rather than executable instructions (R E). External functions are called through "jump tables" that are filled in at load time, for example the PLT/GOT.
Shared objects therefore allow full randomization of the addresses the code is loaded at. If you want to run more "secure" code, compile it as a shared object and link it dynamically at load time or run time.
(Hope I've cleared some things up :) )

ELF, PIE ASLR and everything in between, specifically within Linux

Before asking my question, I would like to cover a few technical details I want to make sure I've got correct:
A Position Independent Executable (PIE) is a program that would be able to execute regardless of which memory address it is loaded into, right?
ASLR (Address Space Layout Randomization) pretty much states that in order to keep addresses static, we would randomize them in some manner,
I've read that specifically within Linux and Unix based systems, implementing ASLR is possible regardless of if our code is a PIE, if it is PIE, all jumps, calls and offsets are relative hence we have no problem.
If it's not, code somehow gets modified and addresses are edited regardless of whether the code is an executable or a shared object.
Now this leads me to ask a few questions
If ASLR is possible to implement within codes that aren't PIE and are executables AND NOT SHARED / RELOCATABLE OBJECT (I KNOW HOW RELOCATION WORKS WITHIN RELOCATABLE OBJECTS!!!!), how is it done? ELF format should hold no section that states where within the code sections are functions so the kernel loader could modify it, right? ASLR should be a kernel functionality so how on earth could, for example, an executable containing, for example, these instructions.
pseudo code:
inc_eax:
add eax, 5
ret
main:
mov eax, 5
mov ebx, 6
call ABSOLUTE_ADDRES{inc_eax}
How would the kernel executable loader know how to change the
addresses if they aren't stored in some relocatable table within the ELF
file and aren't relative in order to load the executable into some
random address?
Let's say I'm wrong, and in order to implement ASLR you must have a
PIE executable. All segments are relative. How would one compile a
C++ OOP code and make it work, for example, if I have some instance
of a class using a pointer to a virtual table within its struct,
and that virtual table should hold absolute addresses, hence I
wouldn't be able to compile a pure PIE for C++ programs that have
usage of run time virtual tables, and again ASLR isn't possible....
I doubt that virtual tables would contain relative addresses and
there would be a different virtual table for each call of some
virtual function...
My last and least significant question is regarding ELF and PIE — is there some special way to detect an ELF executable is PIE? I'm familiar with the ELF format so I doubt that there is a way, but I might be wrong. Anyway, if there isn't a way, how does the kernel loader know if our executable is PIE hence it could use ASLR on it.
I've got this all messed up in my head and I'd love it if someone could help me here.
Your question appears to be a mish-mash of confusion and misunderstanding.
A Position Independent Executable (PIE) is a program that would be able to execute regardless of which memory address it is loaded into, right?
Almost. A PIE binary usually cannot be loaded into memory at an arbitrary address, as its PT_LOAD segments will have some alignment requirements (e.g. 0x400, or 0x10000). But it can be loaded and will run correctly if loaded into memory at an address satisfying the alignment requirements.
ASLR (Address Space Layout Randomization) pretty much states that in order to keep addresses static we would randomize them in some manner,
I can't parse the above statement in any meaningful way.
ASLR is a technique for randomizing various parts of address space, in order to make "known address" attacks more difficult.
Note that ASLR predates PIE binaries, and does not in any way require PIE. When ASLR was introduced, it randomized placement of stack, heap, and shared libraries. The placement of (non-PIE) main executable could not be randomized.
ASLR has been considered a success, and therefore extended to also support PIE main binary, which is really a specially crafted shared library (and has ET_DYN file type).
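A rough way to observe which regions move is to sample one address from each and compare the values across runs. This is only a sketch: the struct and function names are invented for illustration, and casting a function pointer to an integer is implementation-defined (it works with gcc on Linux).

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct layout_sample {
    uintptr_t stack;  /* address of a local variable */
    uintptr_t heap;   /* address returned by malloc */
    uintptr_t lib;    /* libc's stdout object (inside the shared C
                         library when dynamically linked) */
    uintptr_t code;   /* address of a function in the main program */
};

struct layout_sample sample_layout(void)
{
    struct layout_sample s;
    int local = 0;
    void *block = malloc(16);

    s.stack = (uintptr_t)&local;
    s.heap  = (uintptr_t)block;          /* 0 if malloc failed */
    s.lib   = (uintptr_t)stdout;
    s.code  = (uintptr_t)&sample_layout;

    free(block);
    return s;
}
```

With ASLR enabled, the stack, heap and library values should differ from run to run, while the code value should stay constant for a non-PIE build and change for a PIE build.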
call ABSOLUTE_ADDRES{inc_eax}
how would the kernel executable loader know how to change the addresses if they aren't stored in some relocatable table
Simple: on x86, there is no direct near call that takes an ABSOLUTE_ADDRESS -- direct calls encode an offset relative to the next instruction.
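The arithmetic behind a relative call can be sketched as follows. The function name call_target is made up for illustration; the encoding it decodes is the 5-byte direct near call (opcode 0xE8 followed by a signed 32-bit little-endian displacement).

```c
#include <stdint.h>

/* Decode a 5-byte direct near call located at insn_addr.
 * Returns the call target, or 0 if the bytes are not such a call. */
uint64_t call_target(uint64_t insn_addr, const uint8_t insn[5])
{
    if (insn[0] != 0xE8)
        return 0;

    /* Assemble the little-endian 32-bit displacement. */
    uint32_t raw = (uint32_t)insn[1]
                 | (uint32_t)insn[2] << 8
                 | (uint32_t)insn[3] << 16
                 | (uint32_t)insn[4] << 24;
    int32_t rel = (int32_t)raw;

    /* The displacement is relative to the address of the NEXT
     * instruction, i.e. insn_addr + 5. */
    return insn_addr + 5 + (uint64_t)(int64_t)rel;
}
```

The same five bytes give a target that is always the same distance from the call site, so if caller and callee are moved together the instruction needs no patching.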
2 ... I wouldn't be able to compile a pure PIE for C++ programs that have usage of run time virtual tables, and again ASLR isn't possible..
A PIE binary requires relocation, just like a shared library. Virtual tables in PIE binaries work exactly the same way they work in shared libraries: ld-linux.so.2 applies the relocations, updating the GOT (global offset table), before transferring control to the PIE binary.
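The situation can be imitated in plain C with a constant table of function pointers, which stands in for a C++ virtual table here. This is an analogy under stated assumptions, not how any particular C++ compiler lays out its vtables; all names are hypothetical.

```c
/* A hand-rolled stand-in for a C++ vtable: a constant table of
 * function pointers stored in the program's data. */
struct shape_ops {
    int (*area)(int side);
};

static int square_area(int side)
{
    return side * side;
}

/* The initializer needs the absolute address of square_area. In a PIE
 * or shared library that address is unknown until load time, so the
 * linker records a relocation for this slot and the dynamic loader
 * patches it before the program's own code runs. The instructions
 * themselves are left untouched. */
static const struct shape_ops square_ops = { square_area };

const struct shape_ops *square(void)
{
    return &square_ops;
}

int shape_area(const struct shape_ops *ops, int side)
{
    return ops->area(side);   /* indirect call through the patched slot */
}
```

On a PIE build for x86-64, readelf -r should typically list an R_X86_64_RELATIVE entry covering the table slot.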
3 ... is there some special way to detect an ELF executable is PIE
Simple: a PIE binary has its ELF file type set to ET_DYN (a non-PIE binary will have type ET_EXEC). If you run file a.out on a PIE executable, you'll see it reported as a "shared object" (newer versions of file say "pie executable").
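Reading that field programmatically takes only a few lines. The sketch below assumes Linux with <elf.h> available and a file in the host's byte order; the function name elf_file_type is made up for illustration.

```c
#include <elf.h>      /* Elf64_Ehdr, ET_EXEC, ET_DYN, ELFMAG, SELFMAG */
#include <stdio.h>
#include <string.h>

/* Returns the e_type field of the ELF file at path, or -1 on error.
 * e_type sits at the same offset in 32- and 64-bit headers, so the
 * 64-bit header struct is enough for reading this one field. */
int elf_file_type(const char *path)
{
    Elf64_Ehdr hdr;
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    size_t got = fread(&hdr, 1, sizeof hdr, f);
    fclose(f);

    if (got < sizeof hdr || memcmp(hdr.e_ident, ELFMAG, SELFMAG) != 0)
        return -1;
    return hdr.e_type;
}
```

A result of ET_DYN covers both PIE executables and ordinary shared libraries, so this check alone does not tell the two apart.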

How does gcc's linktime optimisation (-flto flag) work

I understand more or less the idea: When compiling separate modules and producing assembly code, functions calling each other have to respect strictly the calling convention, which kills the opportunity for many optimisations when compiling separate modules.
For instance if I have function A which calls function B which calls function C, all 3 in their own separate source files, it becomes possible to allocate registers evenly within the functions so that no register saving on the stack is necessary at all during those calls. With traditional compile-assembly-linking this is not possible, as the caller-saved and callee-saved registers are imposed by the calling convention.
Another optimisation is to inline functions which are called only once. This previously was possible only if a function is local, but thanks to linktime optimisation it's now possible even if the function is in another source file.
Now, if I compile with both -flto and -S flags, I see that instead of normal assembly instructions, gcc generates an encoded representation of the program, such as this:
.section .gnu.lto_.inline.c3c5e6ef8ec983c,"dr0"
.ascii "x\234mQ;N\303#\20}\273\353\17\370C\234\20\242`\"!Q\20\11Ah\322&\25\242\314\231|\4\32\220\220(,$.#\205D\343\3P Z.\341Tn\231\35\274\31L\342\342\355\314\274\371<\317\30\354\376\356\365\357\333\7\262"
.ascii "1\240G\325\273\202\7\216\232\204\36\205"
.ascii "8\242\370\240|\222"
.ascii "8\374\21\205ty\352\"*r\340!:!n\357n%]\224\345\10|\304\23\342\274z\346"
.ascii "8\35\23\370\7\4\1\366s\362\203j\271]\27bb{\316\353\27\343\310\4\371\374\237*n#\220\342rA\31"
.ascii "7\365\263\327\231\26\364\10"
.ascii "2\\-\311\277\255^w\220}|\340\233\306\352\263\362Qo+e+\314\354\277\246\354\252\277\20\364\224%T\233'eR\301{\32\340\372\313\362\263\242\331\314\340\24\6\21s\210\243!\371\347\325\333&m\210\305\203\355\277*\326\236\34\300-\213\327\306\2Td\317\27\231\26tl,\301\26\21cd\27\335#\262L\223"
.ascii "8\353\30\351\264{I\26\316\11\14"
.ascii "9\326h\254\220B}6a\247\13\353\27M\274\231"
.ascii "0\23M\332\272\272%d[\274\36Q\200\37\321\1&\35"
Since the data is in its own particular section, the linker sees this, and does the code generation. If the module was written in either assembly or with no -flto flag, then the linker would see data in the .text section instead, so there is no confusion possible for the linker.
The problem is: How can the linker generate code? Normally only gcc can generate code; the linker's role is just to change a few offsets and adapt the binary format. In order to generate code, the linker would need to contain a second copy of the entire gcc backend (the half of the compiler which generates assembly code from the intermediate representation), as well as the entire assembler (since no assembly code was produced). How is such a thing possible, especially considering that binutils is a completely separate entity from gcc, developed by different teams?
GCC's -flto emits a serialized form of GCC's internal representation, as you discovered.
Then, at link time, the linker reinvokes GCC and passes it the objects that need final compilation. GCC reads the internal representation and does the work.
I think the actual work is done in collect2, which is part of GCC that is used when invoking the linker (I'm a little fuzzy on the details). There is also a "linker plugin" system that enables this to work a little better (like letting the linker decide how to split the compilation). This is implemented at least by the binutils ld and by gold; but as far as I recall this is just an optimization and isn't needed to get the basic -flto feature to work. You can see a bit more information on the original LTO project page; and maybe links from there would explain more.
There is more overlap between the GCC and binutils teams than you might think. The two projects share some code and have a long history of working together. Some people work on both projects.
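As a small illustration of what the re-invoked compiler can do, consider two functions that would normally live in separate source files. The file names, the function names and main.o are hypothetical; both "files" are shown in one block here so that it compiles as a single unit.

```c
/* --- square.c (hypothetical) --- */
int square(int x)
{
    return x * x;
}

/* --- sum.c (hypothetical) --- */
int square(int x);            /* only a declaration is visible here */

int sum_of_squares(int n)
{
    int total = 0;
    for (int i = 1; i <= n; i++)
        total += square(i);   /* without LTO this stays a real call */
    return total;
}

/* Possible build, going through the gcc driver so that it can re-run
 * the compiler at link time:
 *   gcc -O2 -flto -c square.c sum.c
 *   gcc -O2 -flto square.o sum.o main.o -o demo
 */
```

When built as separate files without -flto, the compiler cannot see the body of square while compiling sum.c; with -flto, the serialized representation of both files is available at link time, so inlining the call becomes possible.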
From https://gcc.gnu.org/wiki/LinkTimeOptimization:
Despite the "link time" name, LTO does not need to use any special
linker features. The basic mechanism needed is the detection of GIMPLE
sections inside object files. This is currently implemented in
collect2 [which is called by gcc; -ps]. Therefore, LTO will work on any linker already supported by
GCC.
I assume this means you must link by calling the compiler driver gcc. Simply linking with the system's vanilla linker wouldn't optimize the whole program, as you already concluded.
Update:
https://gcc.gnu.org/onlinedocs/gccint/Collect2.html says
The program collect2 is installed as ld in the directory where the
passes of the compiler are installed. When collect2 needs to find the
real ld it tries the following file names: [...]
(The page goes on detailing how collect2 looks for configuration-dependent executables and ones with well-known names like real-ld, finally even ld; but will not call itself recursively.)

Resources