Sorry if the questions are dumb, but they are really confusing me!
According to the ELF standard, a binary is divided into segments such as the text segment (containing code and read-only data) and the data segment (containing read-write data and BSS). These segments are loaded into memory when the program is executed and the process is created, and they provide the information needed to prepare the environment for process execution.
The question is: how is it decided how much stack to allocate to a process, when I am not providing a stack size during process creation?
Also, using the data segment we can determine how much memory the process requires (for global variables), but once this memory is allocated, how are the variables mapped to addresses inside it?
Lastly, is there any relation between this and scatter loading? I think not, since scatter loading is done when the image is loaded into memory, and once control is passed to the OS, the memory to be allocated to an executable or application is taken care of by the OS itself.
I know these are too many questions, but any help will be greatly appreciated.
If you can provide any reference books or links where I can study this in detail, that would also be appreciated.
Thanks a tonne! :)
The question is: how is it decided how much stack to allocate to a process, when I am not providing a stack size during process creation?
When a new process is created, the execve() system call is used to load the new program into memory as the process image, replacing the image of the currently running program. In other words, when the new program is loaded, execve() replaces the old .text and .data segments and the heap, and resets the stack. The ELF executable file is then mapped into the memory address space, and the stack is initialized with the environment array and the argument array for main().
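For completeness, here is a minimal userspace sketch (mine, not part of the original answer) of that sequence: fork() creates the new process and execve() then replaces its image. /bin/ls and the PATH string are only illustrative values.
/* A hedged userspace sketch: fork() creates the child, execve() then
 * replaces its .text/.data/heap and resets the stack for the new program. */
#include <unistd.h>
#include <sys/wait.h>
#include <stdio.h>

int main(void)
{
    char *child_argv[] = { "/bin/ls", "-l", NULL };
    char *child_envp[] = { "PATH=/bin:/usr/bin", NULL };

    pid_t pid = fork();
    if (pid == 0) {
        execve("/bin/ls", child_argv, child_envp);
        perror("execve");          /* reached only if execve() failed */
        return 1;
    }

    waitpid(pid, NULL, 0);         /* parent waits for the child to finish */
    return 0;
}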
In do_execve_common(), the subroutine bprm_mm_init() handles tasks such as:
Allocating a new instance of mm_struct to manage the process address space, via a call to mm_alloc().
Initializing this instance with init_new_context().
Initializing the stack.
The search_binary_handler() routine then searches for a suitable binary-format handler, i.e. load_binary or load_shlib, to load the program or a dynamic library respectively. The segments are then mapped into the virtual address space, and the process is ready to run once the scheduler selects it.
The stack therefore ends up looking like the diagram below when main() starts executing. From then on, the environment of each function call, including parameters and local variables, is pushed onto the stack dynamically as the calls happen.
 -----------------
|                 | <--- Top of the Stack
|  environment    |
|  variables and  |
|  the other      |
|  parameters to  |
|  main()         |
 ----------------- <--- Stack Pointer
|                 |
|   Stack Space   |
|                 |
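As a quick hedged experiment (mine, not the answer's) you can print a few addresses yourself; with ASLR the exact values change each run, but the argument and environment strings passed to main() should sit near the top of the stack region, above locals created later, matching the diagram above.
/* Hedged experiment: compare addresses of argv/environment strings and a local. */
#include <stdio.h>

extern char **environ;   /* POSIX environment pointer */

int main(int argc, char **argv)
{
    int local = 0;

    printf("argv[0] string at    %p\n", (void *)argv[0]);
    printf("environ[0] string at %p\n", (void *)environ[0]);
    printf("a local variable at  %p\n", (void *)&local);

    return 0;
}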
Also, using the data segment we can determine how much memory the process requires (for global variables), but once this memory is allocated, how are the variables mapped to addresses inside it?
Let's try to figure out how variables are mapped to the different memory sections by debugging a simple C program:
/* File Name: elf.c : Demonstrating Global variables */
#include <stdio.h>
int add_numbers(void);
int value1 = 10;    // Global initialized: .data section
int value2;         // Global uninitialized: .bss section

int add_numbers(void)
{
    int result;     // Local uninitialized: stack
    result = value1 + value2;
    return result;
}

int main(void)
{
    int final_result;   // Local uninitialized: stack

    value2 = 20;
    final_result = add_numbers();
    printf("The sum of %d + %d is %d\n",
           value1, value2, final_result);
    return 0;
}
Using readelf to display the .data and .bss section headers:
$readelf -a elf
...
Section Headers:
[26] .data PROGBITS 00000000006c2060 000c2060
00000000000016b0 0000000000000000 WA 0 0 32
[27] .bss NOBITS 00000000006c3720 000c3710
0000000000002bc8 0000000000000000 WA 0 0 32
...
$readelf -x 26 elf
Hex dump of section '.data':
0x006c2060 00000000 00000000 00000000 00000000 ................
0x006c2070 0a000000 00000000 00000000 00000000 ................
...
Let's use GDB to look at what these sections contain:
(gdb) disassemble 0x006c2060
Dump of assembler code for function `data_start`:
0x00000000006c2060 <+0>: add %al,(%rax)
0x00000000006c2062 <+2>: add %al,(%rax)
0x00000000006c2064 <+4>: add %al,(%rax)
0x00000000006c2066 <+6>: add %al,(%rax)
End of assembler dump.
The first address of the .data section corresponds to the data_start symbol.
(gdb) disassemble 0x006c2070
Dump of assembler code for function `value1`:
0x00000000006c2070 <+0>: or (%rax),%al
0x00000000006c2072 <+2>: add %al,(%rax)
End of assembler dump.
....
The dump above lands on the address of the global variable value1, initialized to 10 (the 0a byte visible in the hex dump). But we don't see the uninitialized global variable value2 at the following addresses.
Let's print the address of value2 instead:
(gdb) p &value2
$1 = (int *) 0x6c5eb0
(gdb) info symbol 0x6c5eb0
value2 in section .bss
(gdb) disassemble 0x6c5eb0
Dump of assembler code for function `value2`:
0x00000000006c5eb0 <+0>: add %al,(%rax)
0x00000000006c5eb2 <+2>: add %al,(%rax)
End of assembler dump.
Tada! Disassembling the address of value2 reveals that the variable is stored in the .bss section. This explains how uninitialized global variables are mapped into the process memory space.
Lastly, is there any relation of this with scatter loading?
No.
Related
I am investigating a SIGBUS caused by an unaligned data access. I am tracking one of these errors and would like to know how it can happen on a Sitara AM335x. Can someone please give me example code that describes or reliably triggers it?
Adding code snippet:
// DBANode and ourDataSize are members of the enclosing class (not shown here)
int Read( void *value, uint32_t *size, const uint32_t baseAddress )
{
    // Skip past the DBANode header to reach the user data
    uint8_t *userDataAddress = (uint8_t *)( baseAddress + sizeof( DBANode ));

    memcpy( value, userDataAddress, ourDataSize );
    *size = ourDataSize;

    return 0;
}
DBANode is a class object of 20 bytes.
baseAddress is an mmap() of a shared-memory file, again of class type DBANode, cast to a uint32_t so that the arithmetic can be done.
This is the disassembly of the section:
91a8: e51b3010 ldr r3, [fp, #-16]
91ac: e5933000 ldr r3, [r3]
91b0: e51b0014 ldr r0, [fp, #-20] ; 0xffffffec
91b4: e51b1008 ldr r1, [fp, #-8]
91b8: e1a02003 mov r2, r3
91bc: ebffe72b bl 2e70 <memcpy@plt>
91c0: e51b3010 ldr r3, [fp, #-16]
91c4: e5932000 ldr r2, [r3]
91c8: e51b3018 ldr r3, [fp, #-24] ; 0xffffffe8
91cc: e5832000 str r2, [r3]
00002e70 <memcpy@plt>:
2e70: e28fc600 add ip, pc, #0, 12
2e74: e28cca08 add ip, ip, #8, 20 ; 0x8000
2e78: e5bcf868 ldr pc, [ip, #2152]! ; 0x868
When the exact same code base was rebuilt, the problem just disappeared. Can gcc generate two different versions of the instructions with the same -O0 optimization specified?
I also diffed the objdumps of the library .so files from both compilations. They are exactly the same. The API is used quite often; however, the crash only happens after prolonged use over a few days. I am reading the same node every 500 ms, so this is not consistent.
Should I be looking at pointer corruption?
It turns out the baseAddress is the issue. As I mentioned, it's an mmap() of a shared memory location, and the mmap can fail. A failed mmap returns -1, but the code was checking for NULL and then proceeding to write to -1, i.e. 0xFFFFFFFF, causing the SIGBUS.
Code 1 is seen when we use memcpy; trying any other access, like direct byte addressing, gives code 3 with the SIGBUS.
I am still not sure why it triggers SIGBUS instead of SIGSEGV. Shouldn't this be a memory violation instead ?
Here is an example:
// Includes needed to build this snippet (not in the original post)
#include <cstdio>
#include <cstring>
#include <cstdint>
#include <iostream>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    // Shared memory example
    const char *NAME = "SharedMemory";
    const int SIZE = 10 * sizeof(uint8_t);
    uint8_t src[] = {0x11,0x22,0x33,0x44,0x55,0x66,0x77,0x88,0x99,0x00};

    // Note: the descriptor is opened O_RDONLY but mapped PROT_WRITE with
    // MAP_SHARED below, which is one reason the mmap() can fail here
    int shm_fd = -1;
    shm_fd = shm_open(NAME, O_CREAT | O_RDONLY, 0666);
    ftruncate(shm_fd, SIZE);

    // Map shared memory segment to address space
    // (_NOCACHE is a non-standard, platform-specific flag)
    uint8_t *ptr = (uint8_t *) mmap(0, SIZE, PROT_READ | PROT_WRITE | _NOCACHE, MAP_SHARED, shm_fd, 0);
    if (ptr == MAP_FAILED)
    {
        std::cerr << "ERROR in mmap()" << std::endl;
        // return -1;   // execution deliberately continues to reproduce the crash
    }
    printf("ptr = %p\n", (void *)ptr);

    std::cout << "Now storing data to mmap() memory" << std::endl;
#if 0
    ptr[0] = 0x11;
    ptr[1] = 0x22;
    ptr[2] = 0x33;
    ptr[3] = 0x44;
    ptr[4] = 0x55;
    ptr[5] = 0x66;
    ptr[6] = 0x77;
    ptr[7] = 0x88;
    ptr[8] = 0x99;
    ptr[9] = 0x00;
#endif
    memcpy(ptr, src, SIZE);   // causes SIGBUS, code 1, when ptr is invalid
    shm_unlink(NAME);
}
I still do not know why mmap() is failing on a shm even though I have 100 MB of RAM available, all my resource limits are set to unlimited, and over 400 file descriptors are still available out of the 1000-fd limit.
From the Cortex-A8 Technical Reference Manual:
The processor supports loads and stores of unaligned words and
halfwords. The processor makes the required number of memory accesses
and transfers adjacent bytes transparently.
Note: Data accesses that cross a word boundary can add to the access time.
Setting the A bit in the CP15 c1 Control Register enables alignment
checking. When the A bit is set to 1, two types of memory access
generate a Data Abort signal and an Alignment fault status code:
a 16-bit access that is not halfword-aligned
a 32-bit load or store that is not word-aligned
Alignment fault detection is a mandatory address-generation function
rather than an optionally supported function of external memory
management hardware. See the ARM Architecture Reference Manual for
more information on unaligned data access support.
From the ARM ARM, instructions which always generate an alignment fault if not aligned to the transfer size:
LDREX, STREX, LDREXD, STREXD, LDM, STM, LDRD, RFE, SRS, STRD, SWP, LDC, LDC2, STC, STC2, VLDM, VLDR, VPOP, VPUSH, VSTM, VSTR.
Also, most PUSH, POP and VLDx where :align: is specified.
Further,
In an implementation that includes the Virtualization Extensions, an
unaligned access to Device or Strongly-ordered memory always causes an
Alignment fault Data Abort exception
As in the linked question, structs are the most obvious way to cause 'intended' unaligned accesses, but any corruption of the stack pointer or of another pointer variable would give the same result. How the core is configured determines whether normal single word/halfword accesses are just slow or trigger a fault.
If you have access to the ETM trace, you would be able to identify the exact accesses. It seems that part has ETM/ETB (so no fancy trace capture device is required), but I've no idea how easy it will be to get tools to work with it.
As regards what code can trigger this: yes, even memcpy() can be a problem. Since the ARM instruction set has optimisations for transferring multiple registers (or register pairs in AArch64), the optimised library functions will prefer to 'stream' data rather than perform byte-by-byte loads and stores. Depending on the data structures and the compilation target, it is perfectly possible to end up with an illegal LDM to an unaligned address.
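As a hedged illustration of that point (not taken from the question's code base), the snippet below forces a misaligned word read through a pointer; whether it is merely slower, fixed up transparently, or raises SIGBUS depends on the core configuration (the SCTLR A bit) and the memory type, as quoted above.
/* Hedged sketch: force a misaligned 32-bit read. The cast also violates
 * strict aliasing; it is only here to illustrate the alignment issue. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    uint8_t raw[8] = { 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88 };

    /* raw + 1 is deliberately not 4-byte aligned */
    uint32_t *misaligned = (uint32_t *)(raw + 1);

    /* On a Cortex-A8 this is handled transparently for Normal memory with
     * alignment checking off, but faults with the A bit set or when the
     * address maps Device/Strongly-ordered memory. */
    printf("direct read : 0x%08x\n", *misaligned);

    /* The alignment-agnostic alternative: copy byte-wise into an aligned object */
    uint32_t value;
    memcpy(&value, raw + 1, sizeof value);
    printf("memcpy read : 0x%08x\n", value);

    return 0;
}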
I have one file-level static C variable that isn't getting initialized.
const size_t VGA_WIDTH = 80;
const size_t VGA_HEIGHT = 25;
static uint16_t* vgat_buffer = (uint16_t*)0x62414756; // VGAb
static char vgat_initialized= '\0';
In particular, vgat_initialized isn't always 0 the first time it is accessed. (Of course, the problem only appears on certain machines.)
I'm playing around with writing my own OS, so I'm pretty sure this is a problem with my linker script; but, I'm not clear how exactly the variables are supposed to be organized in the image produced by the linker (i.e., I'm not sure if this variable is supposed to go in .data, .bss, some other section, etc.)
VGA_WIDTH and VGA_HEIGHT get placed in the .rodata section as expected.
vgat_buffer is placed in the .data section, as expected (by initializing this variable to 0x62414756, I can clearly see where the linker places it in the resulting image file).
I can't figure out where vgat_initialized is supposed to go. I've included the relevant parts of the assembly file below. From what I understand, the .comm directive is supposed to allocate space for the variable in the data section; but, I can't tell where. Looking in the linker's map file didn't provide any clues either.
Interestingly enough, if I change the initialization to
static char vgat_initialized= 'x';
everything works as expected: I can clearly see where the variable is placed in the resulting image file (i.e., I can see the x in the hexdump of the image file).
Assembly code generated from the C file:
.text
.LHOTE15:
.local buffer.1138
.comm buffer.1138,100,64
.local buffer.1125
.comm buffer.1125,100,64
.local vgat_initialized
.comm vgat_initialized,1,1
.data
.align 4
.type vgat_buffer, @object
.size vgat_buffer, 4
vgat_buffer:
.long 1648445270
.globl VGA_HEIGHT
.section .rodata
.align 4
.type VGA_HEIGHT, @object
.size VGA_HEIGHT, 4
VGA_HEIGHT:
.long 25
.globl VGA_WIDTH
.align 4
.type VGA_WIDTH, @object
.size VGA_WIDTH, 4
VGA_WIDTH:
.long 80
.ident "GCC: (GNU) 4.9.2"
Compilers can certainly use their own names for sections, but going by the common .data, .text, .rodata and .bss names that we know from typical toolchains, this should land in .bss.
But that doesn't in any way automatically zero it out. There needs to be a mechanism. Sometimes, depending on your toolchain, the toolchain takes care of it and creates a binary in which, in addition to .data and .rodata (and naturally .text) being filled in, .bss is filled in as well. But that depends on a few things, primarily whether this is a simple RAM-only image and whether everything lives under one memory-space definition in the linker script.
You could, for example, put .data after .bss in the linker script, and depending on the binary format you use and/or the tools that convert it, you could end up with zeroed memory in the binary without any other work.
Normally, though, you should expect to use a toolchain-specific mechanism (linker scripts are linker-specific; do not assume they are universal to all tools) for defining where .bss is from your perspective, plus some form of communication from the linker as to where it starts and how big it is. That information is used by the bootstrap, whose job it is to zero it; one can assume it is always the bootstrap's job to zero .bss, with naturally some exceptions. Likewise, if the binary is meant to live on read-only media (ROM, flash, etc.) while .data and .bss are read/write, you need .data in its entirety on that media, and someone has to copy it to its runtime position in RAM; .bss is either part of that, depending on the toolchain and how you used it, or its start address and size are on the read-only media and someone has to zero that space at some point before main(). Here again this is the job of the bootstrap. Setting the stack pointer, moving .data if needed, and zeroing .bss are the typical minimal jobs of the bootstrap; you can shortcut them in special cases or avoid using .data or .bss at all.
Since it is the linker's job to take all the little .data and .bss (and other) definitions from the objects being linked and combine them per the directions from the user (linker script, command line, whatever that tool uses), the linker ultimately knows where everything ends up.
In the case of gcc you use what I would call variables defined in the linker script; the linker script can expose these values under matching variable/label names for the assembler, so that a generic bootstrap can be used and you don't have to do any more work than that.
Like this, but possibly more complicated:
MEMORY
{
bob : ORIGIN = 0x8000, LENGTH = 0x1000
ted : ORIGIN = 0xA000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > bob
__data_rom_start__ = .;
.data : {
__data_start__ = .;
*(.data*)
} > ted AT > bob
__data_end__ = .;
__data_size__ = __data_end__ - __data_start__;
.bss : {
__bss_start__ = .;
*(.bss*)
} > bob
__bss_end__ = .;
__bss_size__ = __bss_end__ - __bss_start__;
}
Then you can pull these into the assembly-language bootstrap:
.globl bss_start
bss_start: .word __bss_start__
.globl bss_end
bss_end: .word __bss_end__
.word __bss_size__
.globl data_rom_start
data_rom_start:
.word __data_rom_start__
.globl data_start
data_start:
.word __data_start__
.globl data_end
data_end:
.word __data_end__
.word __data_size__
and then write some code to operate on those as needed for your design.
You can simply put things like that in a linked-in assembly-language file without any other code using them, assemble it, compile the other code, and link; the disassembly (or whatever tools you prefer) will then show you what the linker generated. Tweak that until you are satisfied, and then you can write, borrow or steal bootstrap code that uses them.
For bare metal I prefer not to completely conform to the standard with my code: I don't have any .data and don't expect .bss to be zero, so my bootstrap sets the stack pointer and calls main, done. For an operating system, you should conform. The toolchains already have this solved for the native platform, but if you are taking that over with your own linker script and bootstrap, then you need to deal with it; if you want to use an existing toolchain's solution for an existing operating system, then... done... just do that.
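For reference, here is a minimal C sketch (mine, not the answer's code) of the two bootstrap loops described above, assuming the __data_... and __bss_... symbols from a linker script like the one shown earlier; it is a sketch, not a drop-in bootstrap.
/* Sketch of the copy-.data and zero-.bss bootstrap loops, assuming the
 * symbol names defined in the example linker script above. */
#include <stdint.h>

extern uint32_t __data_rom_start__[], __data_start__[], __data_end__[];
extern uint32_t __bss_start__[], __bss_end__[];

void bootstrap_init_memory(void)
{
    /* Copy initialised data from its load address (ROM/flash) to RAM */
    const uint32_t *src = __data_rom_start__;
    uint32_t *dst = __data_start__;
    while (dst < __data_end__)
        *dst++ = *src++;

    /* Zero .bss so uninitialised statics really start out as zero */
    uint32_t *bss = __bss_start__;
    while (bss < __bss_end__)
        *bss++ = 0;
}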
This answer is simply an extension of the others. As has been mentioned, the C standard has rules about initialization:
10) If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. If an object that has static storage duration is not initialized explicitly, then:
if it has pointer type, it is initialized to a null pointer;
if it has arithmetic type, it is initialized to (positive or unsigned) zero;
if it is an aggregate, every member is initialized (recursively) according to these rules;
if it is a union, the first named member is initialized (recursively) according to these rules.
The problem in your code is that a computer's memory is not always initialized to zero. It is up to you to make sure the BSS section is initialized to zero in a freestanding environment (like your OS and bootloader).
BSS sections usually don't (by default) take up space in a binary file; they usually occupy memory in the area beyond the limits of the code and data that appear in the binary. This is done to reduce the size of the binary that has to be read into memory.
I know you are writing an OS for x86 booting with legacy BIOS. I know that you are using GCC from your other recent questions. I know you are using GNU assembler for part of your bootloader. I know that you have a linker script, but I don't know what it looks like. The usual mechanism to do this is via a linker script that places the BSS data at the end, and creates start and end symbols to define the address extents of the section. Once these symbols are defined by the linker they can be used by C code (or assembly code) to loop through the region and set it to zero.
I present a reasonably simple MCVE that does this. The code reads an extra sector with the kernel with Int 13h/AH=2h; enables the A20 line (using fast A20 method); loads a GDT with 32-bit descriptors; enables protected mode; completes the transition into 32-bit protected mode; and then calls a kernel entry point in C called kmain. kmain calls a C function called zero_bss that initializes the BSS section based on the starting and ending symbols (__bss_start and __bss_end) generated by a custom linker script.
boot.S:
.extern kmain
.globl mbrentry
.code16
.section .text
mbrentry:
# If trying to create USB media, a BPB here may be needed
# At entry DL contains boot drive number
# Segment registers to zero
xor %ax, %ax
mov %ax, %ds
mov %ax, %es
# Set stack to grow down from area under the place the bootloader was loaded
mov %ax, %ss
mov $0x7c00, %sp
cld # Ensure forward direction of MOVS/SCAS/LODS instructions
# which is required by generated C code
# Load kernel into memory
mov $0x02, %ah # Disk read
mov $1, %al # Read 1 sector
xor %ch, %ch # Cylinder 0
xor %dh, %dh # Head 0
mov $2, %cl # Start reading from second sector
mov $0x7e00, %bx # Load kernel at 0x7e00
int $0x13
# Quick and dirty A20 enabling. May not work on all hardware
a20fast:
in $0x92, %al
or $2, %al
out %al, $0x92
loadgdt:
cli # Turn off interrupts until an Interrupt
# Vector Table (IVT) is set
lgdt (gdtr)
mov %cr0, %eax
or $1, %al
mov %eax, %cr0 # Enable protected mode
jmp $0x08,$init_pm # FAR JMP to next instruction to set
# CS selector with a 32-bit code descriptor and to
# flush the instruction prefetch queue
.code32
init_pm:
# Set remaining 32-bit selectors
mov $DATA_SEG, %ax
mov %ax, %ds
mov %ax, %es
mov %ax, %fs
mov %ax, %gs
mov %ax, %ss
# Start executing kernel
call kmain
cli
loopend: # Infinite loop when finished
hlt
jmp loopend
.align 8
gdt_start:
.long 0 # null descriptor
.long 0
gdt_code:
.word 0xFFFF # limit low
.word 0 # base low
.byte 0 # base middle
.byte 0b10011010 # access
.byte 0b11001111 # granularity/limit high
.byte 0 # base high
gdt_data:
.word 0xFFFF # limit low (Same as code)
.word 0 # base low
.byte 0 # base middle
.byte 0b10010010 # access
.byte 0b11001111 # granularity/limit high
.byte 0 # base high
end_of_gdt:
gdtr:
.word end_of_gdt - gdt_start - 1
# limit (Size of GDT)
.long gdt_start # base of GDT
CODE_SEG = gdt_code - gdt_start
DATA_SEG = gdt_data - gdt_start
kernel.c:
#include <stdint.h>
extern uint32_t __bss_start[];
extern uint32_t __bss_end[];

/* Zero the BSS section 4 bytes at a time */
static void zero_bss(void)
{
    uint32_t *memloc = __bss_start;

    while (memloc < __bss_end)
        *memloc++ = 0;
}

int kmain(void)
{
    zero_bss();
    return 0;
}
link.ld:
ENTRY(mbrentry)
SECTIONS
{
. = 0x7C00;
.mbr : {
boot.o(.text);
boot.o(.*);
}
. = 0x7dfe;
.bootsig : {
SHORT(0xaa55);
}
. = 0x7e00;
.kernel : {
*(.text*);
*(.data*);
*(.rodata*);
}
.bss : SUBALIGN(4) {
__bss_start = .;
*(COMMON);
*(.bss*);
}
. = ALIGN(4);
__bss_end = .;
/DISCARD/ : {
*(.eh_frame);
*(.comment);
}
}
To compile, link and generate a binary file that can be used in a disk image from this code, you could use commands like:
as --32 boot.S -o boot.o
gcc -c -m32 -ffreestanding -O3 kernel.c
gcc -ffreestanding -nostdlib -Wl,--build-id=none -m32 -Tlink.ld \
-o boot.elf -lgcc boot.o kernel.o
objcopy -O binary boot.elf boot.bin
The C standard says that static variables must be zero-initialized even in the absence of an explicit initializer, so static char vgat_initialized = '\0'; is equivalent to static char vgat_initialized;.
In ELF and other similar formats, zero-initialized data such as this vgat_initialized goes into the .bss section. If you load such an executable into memory yourself, you need to explicitly zero the .bss part of the data segment.
The other answers are very complete and very helpful. It turns out that, in my specific case, I just needed to know that static variables initialized to 0 are put in .bss and not .data. Adding a .bss section to the linker script placed a zeroed-out region of memory in the image, which solved the problem.
I have a question about the ELF dynamic symbol table. For symbols of type FUNC, I have noticed a value of 0 in some binaries, but in other binaries they have a non-zero value. Both binaries were generated by gcc, and I want to know why there is this difference. Is there any compiler option to control this?
EDIT: This is the output of readelf --dyn-syms prog1
Symbol table '.dynsym' contains 5 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 00000000 0 NOTYPE LOCAL DEFAULT UND
1: 00000000 0 NOTYPE WEAK DEFAULT UND __gmon_start__
2: 000082f0 0 FUNC GLOBAL DEFAULT UND printf@GLIBC_2.4 (2)
3: 00008314 0 FUNC GLOBAL DEFAULT UND abort@GLIBC_2.4 (2)
4: 000082fc 0 FUNC GLOBAL DEFAULT UND __libc_start_main@GLIBC_2.4
Here the value of the "printf" symbol is 82f0, which happens to be the address of the PLT entry for printf.
Output of readelf --dyn-syms prog2
Symbol table '.dynsym' contains 6 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 00000000 0 NOTYPE LOCAL DEFAULT UND
1: 00000000 0 NOTYPE WEAK DEFAULT UND __gmon_start__
2: 00000000 0 FUNC GLOBAL DEFAULT UND puts@GLIBC_2.4 (2)
3: 00000000 0 FUNC GLOBAL DEFAULT UND printf@GLIBC_2.4 (2)
4: 00000000 0 FUNC GLOBAL DEFAULT UND abort@GLIBC_2.4 (2)
5: 00000000 0 FUNC GLOBAL DEFAULT UND __libc_start_main@GLIBC_2.4
Here the values for all the symbols are zero.
The x86_64 System V ABI mandates that (emphasis mine):
To allow comparisons of function addresses to work as expected,
if an executable file references a function defined in a shared object,
the link editor will place the address of the procedure linkage table
entry for that function in its associated symbol table entry.
This will result in symbol table entries with section index of
SHN_UNDEF but a type of STT_FUNC and a non-zero st_value.
A reference to the address of a function from within a shared
library will be satisfied
by such a definition in the executable.
With my GCC, this program:
#include <stdio.h>
int main()
{
printf("hello %i\n", 42);
return 0;
}
when compiled directly into an executable generates a null value:
1: 0000000000000000 0 FUNC GLOBAL DEFAULT UND printf@GLIBC_2.2.5 (2)
But this program with a comparison of the printf function:
#include <stdio.h>
int main()
{
printf("hello %i\n", 42);
if (printf == puts)
return 1;
return 0;
}
generates a non-null value:
3: 0000000000400410 0 FUNC GLOBAL DEFAULT UND printf@GLIBC_2.2.5 (2)
In the .o file, the first program generates:
000000000014 000a00000002 R_X86_64_PC32 0000000000000000 printf - 4
and the second:
000000000014 000a00000002 R_X86_64_PC32 0000000000000000 printf - 4
000000000019 000a0000000a R_X86_64_32 0000000000000000 printf + 0
The difference is caused by the extra R_X86_64_32 relocation for getting the address of the function.
Observations from running readelf on some binaries:
All the FUNCTION symbols which are UNDEFINED have size zero.
These undefined functions are those which are called through libraries. In my small ELF binary, all references to glibc are undefined with size zero.
From http://docs.oracle.com/cd/E19457-01/801-6737/801-6737.pdf, page 21:
It becomes clear that the symbol table can have three types of symbols. Among these three, the UNDEFINED and TENTATIVE symbols are the ones without storage assigned. In the latter case you can see, in the readelf output, some functions which are not undefined (they have a section index) yet do not have storage.
For clarity: undefined symbols are those which are referenced but have no storage assigned (they have not been created yet), while tentative symbols are those which are created but without assigned storage, e.g. uninitialized symbols.
Edit:
If you are talking about the .plt, shared-library symbols are bound lazily.
For how to control the binding, see http://www.linuxjournal.com/article/1060:
This feature is known as lazy symbol binding. The idea is that if you have lots of shared libraries, it could take the dynamic loader lots of time to look up all of the functions to initialize all of the .plt slots, so it would be preferable to defer binding addresses to the functions until we actually need them. This turns out to be a big win if you only end up using a small fraction of the functions in a shared library. It is possible to instruct the dynamic loader to bind addresses to all of the .plt slots before transferring control to the application—this is done by setting the environment variable LD_BIND_NOW=1 before running the program. This turns out to be useful in some cases when you are debugging a program, for example. Also, I should point out that the .plt is in read-only memory. Thus the addresses used for the target of the jump are actually stored in the .got section. The .got also contains a set of pointers for all of the global variables that are used within a program that come from a shared library.
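The same eager-versus-lazy choice is also visible from within a program through dlopen(); the sketch below (my illustration, not from the linked article, using libm.so.6 as a convenient example library on Linux) opens a library with RTLD_LAZY, which defers function resolution like the default PLT behaviour, and with RTLD_NOW, which resolves everything up front, much like running under LD_BIND_NOW=1.
/* Hedged sketch: lazy vs eager symbol binding via dlopen(). */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    void *lazy  = dlopen("libm.so.6", RTLD_LAZY);  /* resolve functions on first use */
    void *eager = dlopen("libm.so.6", RTLD_NOW);   /* resolve everything immediately */

    if (!lazy || !eager) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* The usual (if slightly awkward) idiom for converting the dlsym() result */
    double (*cosine)(double) = (double (*)(double))dlsym(lazy, "cos");
    printf("cos(0) = %f\n", cosine(0.0));

    dlclose(lazy);
    dlclose(eager);
    return 0;
}
On older glibc you may need to link this with -ldl.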
I'm trying to write trampolines for x86 and amd64 so that a given function invocation is immediately vectored to an address stored at a known memory location (the purpose is to ensure the first target address lives within a given DLL (windows)).
The following code is attempting to use _fn as a memory location (or group of them) to start actual target addresses:
(*_fn[IDX])(); // rough equivalent in C
.globl _asmfn
_asmfn:
jmp *_fn+8*IDX(%rip)
The IDX is intended to be constructed using some CPP macros to provide a range of embedded DLL vectors each uniquely mapped to a slot in the _fn array of function pointers.
This works in a simple test program, but when I actually put it into a shared library (for the moment testing on OSX), I get a bus error when attempting to vector to the _asmfn code:
Invalid memory access of location 0x10aa1f320 rip=0x10aa1f320
The final target of this code is Windows, though I haven't tried it there yet (I figured I could at least prove out the assembly in a test case on OSX/intel first). Is the amd64 jump at least nominally correct, or have I missed something?
A good reference on trampolines on amd64.
EDIT
The jump does work properly on Windows 7 (I finally got a chance to test). However, I'm still curious to know why it is failing on OSX. The bus error is caused by a KERN_PROTECTION_FAILURE, which would appear to indicate that OS protections are preventing execution of that code. The target address is in allocated memory (it's a trampoline generated by libffi), but I believe it to be properly marked as executable memory. If it's an executable-memory issue, that would explain why my standalone test code works (the callback trampoline is compiled, not allocated).
When using PC-relative addressing, keep in mind that the offset must be within ±2 GB. That means your jump table and trampoline can't be too far away from each other. Regarding trampolines as such, here is what can be done on Windows x64 to transfer control without clobbering any registers:
a sequence:
PUSH <low32>
MOV DWORD PTR [ RSP + 4 ], <high32>
RET
this works both on Win64 and UN*X x86_64. Although on UN*X, if the function uses the redzone then you're clobbering ...
a sequence:
JMP [ RIP ]
.L: <tgtaddr64>
again, applicable to both Win64 and UN*X x86_64.
a sequence:
MOV DWORD PTR [ RSP + 8 ], <low32>
MOV DWORD PTR [ RSP + c ], <high32>
JMP [ RSP + 8 ]
this is Win64-specific as it (ab)uses part of the 32-byte "argument space" reserved (just above the return address on the stack) by the Win64 ABI; the UN*X x86_64 equivalent would be to (ab)use part of the 128-byte "red zone" reserved (just below the return address on the stack) there:
MOV DWORD PTR [ RSP - 8 ], <low32>
MOV DWORD PTR [ RSP - 4 ], <high32>
JMP [ RSP - 8 ]
Both are only usable if it's acceptable to clobber (overwrite) what's in there at the point of invoking the trampoline.
It is possible to directly construct such a position-independent, register-neutral trampoline in memory, like this (for method 1.):
#include <stdint.h>
#include <stdio.h>
#include <string.h>
char *mystr = "Hello, World!\n";
int main(int argc, char **argv)
{
struct __attribute__((packed)) {
char PUSH;
uint32_t CONST_TO_PUSH;
uint32_t MOV_TO_4PLUS_RSP;
uint32_t CONST_TO_MOV;
char RET;
} mycode = {
0x68, (uint32_t)(uintptr_t)printf,
0x042444c7, (uint32_t)((uintptr_t)printf >> 32),
0xc3
};
void *buf = /* fill in an OS-specific way to get an executable buffer */;
memcpy(buf, &mycode, sizeof(mycode));
__asm__ __volatile__(
"push $0f\n\t" // this is to make the "jmp" return
"jmp *%0\n\t"
"0:\n\t" : : "r"(buf), "D"(mystr), "a"(0));
return 0;
}
Note that this doesn't take into account whether any non-volatile registers are clobbered by the function "invoked"; I've also left out how to make the trampoline buffer executable (the stack ordinarily isn't, on Win64/x86_64).
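For the part left out above, one common way (an assumption on my side, not part of the original answer) to obtain an executable buffer on UN*X systems is an anonymous mapping with PROT_EXEC; on Windows the rough equivalent is VirtualAlloc with PAGE_EXECUTE_READWRITE. Note that hardened systems may refuse writable-and-executable mappings.
/* Sketch, assuming a UN*X target: allocate an executable buffer with mmap().
 * On OS X the flag may be spelled MAP_ANON instead of MAP_ANONYMOUS. */
#include <sys/mman.h>
#include <stddef.h>

static void *alloc_exec_buffer(size_t size)
{
    void *buf = mmap(NULL, size,
                     PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return (buf == MAP_FAILED) ? NULL : buf;
}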
@HarryJohnston had the right of it: the permissions issue was encountered on OS X only. The code runs fine in its target Windows environment.
I was playing around a bit to get a better grip on calling conventions and how the stack is handled, but I can't figure out why main allocates three extra double words when setting up the stack (at <main+0>). It's neither aligned to 8 bytes nor 16 bytes, so that's not why as far as I know. As I see it, main requires 12 bytes for the two parameters to func and the return value.
What am I missing?
The program is C code compiled with "gcc -ggdb" on an x86 architecture.
Edit: I removed the -O0 flag from gcc, and it made no difference to the output.
(gdb) disas main
Dump of assembler code for function main:
0x080483d1 <+0>: sub esp,0x18
0x080483d4 <+3>: mov DWORD PTR [esp+0x4],0x7
0x080483dc <+11>: mov DWORD PTR [esp],0x3
0x080483e3 <+18>: call 0x80483b4 <func>
0x080483e8 <+23>: mov DWORD PTR [esp+0x14],eax
0x080483ec <+27>: add esp,0x18
0x080483ef <+30>: ret
End of assembler dump.
Edit: Of course I should have posted the C code:
int func(int a, int b) {
int c = 9;
return a + b + c;
}
void main() {
int x;
x = func(3, 7);
}
The platform is Arch Linux i686.
The parameters to a function (including, but not limited to main) are already on the stack when you enter the function. The space you allocate inside the function is for local variables. For functions with simple return types such as int, the return value will normally be in a register (eax, with a typical 32-bit compiler on x86).
If, for example, main was something like this:
int main(int argc, char **argv) {
char a[35];
return 0;
}
...we'd expect to see at least 35 bytes allocated on the stack as we entered main to make room for a. Assuming a 32-bit implementation, that would normally be rounded up to the next multiple of 4 (36, in this case) to maintain 32-bit alignment of the stack. We would not expect to see any space allocated for the return value. argc and argv would be on the stack, but they'd already be on the stack before main was entered, so main would not have to do anything to allocate space for them.
In the case above, after allocating space for a, a would typically start at [esp-36], argv would be at [esp-44] and argc would be at [esp-48] (or those two might be reversed -- depending on whether arguments were pushed left to right or right to left). In case you're wondering why I skipped [esp-40], that would be the return address.
Edit: Here's a diagram of the stack on entry to the function and after setting up the stack frame (image not reproduced here).
Edit 2: Based on your updated question, what you have is slightly roundabout, but not particularly hard to understand. Upon entry to main, it's allocating space not only for the variables local to main, but also for the parameters you're passing to the function you call from main.
That accounts for at least some of the extra space being allocated (though not necessarily all of it).
It's alignment. I assumed for some reason that esp would be aligned from the start, which it clearly isn't.
gcc aligns stack frames to 16 bytes by default, which is what happened here.
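If you want to see the effect yourself, a hedged experiment (assuming the source file is main.c) is to rebuild with a smaller preferred stack boundary, a real gcc option on 32-bit x86; the sub esp,0x18 above should then shrink towards the bytes actually needed.
# Hedged experiment: lower the stack-frame alignment and re-check main's prologue
gcc -m32 -mpreferred-stack-boundary=2 -ggdb main.c -o main
gdb -batch -ex 'disassemble main' ./main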