gfortran for dummies: What does mcmodel=medium do exactly?

I have some code that is giving me relocation errors when compiling. Below is an example that illustrates the problem:
      program main
      common/baz/a,b,c
      real a,b,c
      b = 0.0
      call foo()
      print*, b
      end
      subroutine foo()
      common/baz/a,b,c
      real a,b,c
      integer, parameter :: nx = 450
      integer, parameter :: ny = 144
      integer, parameter :: nz = 144
      integer, parameter :: nf = 23*3
      real :: bar(nf,nx*ny*nz)
      !real, allocatable,dimension(:,:) :: bar
      !allocate(bar(nf,nx*ny*nz))
      bar = 1.0
      b = bar(12,32*138*42)
      return
      end
Compiling this with gfortran -O3 -g -o test test.f, I get the following error:
relocation truncated to fit: R_X86_64_PC32 against symbol `baz_' defined in COMMON section in /tmp/ccIkj6tt.o
But it works if I use gfortran -O3 -mcmodel=medium -g -o test test.f. Also note that it works if I make the array allocatable and allocate it within the subroutine.
My question is what exactly does -mcmodel=medium do? I was under the impression that the two versions of the code (the one with allocatable arrays and the one without) were more or less equivalent ...

Since bar is quite large, the compiler generates a static allocation instead of an automatic allocation on the stack. Static arrays are created with the .comm assembly directive, which creates an allocation in the so-called COMMON section. Symbols from that section are gathered, same-named symbols are merged (reduced to one symbol request with size equal to the largest size requested), and what is left is mapped to the BSS (uninitialised data) section in most executable formats. With ELF executables the .bss section is located in the data segment, just before the data segment part of the heap (there is another heap part, managed by anonymous memory mappings, which does not reside in the data segment).
With the small memory model, 32-bit addressing instructions are used to address symbols on x86_64. This makes code smaller and also faster. Some assembly output when using the small memory model:
movl $bar.1535, %ebx <---- Instruction length saving
...
movl %eax, baz_+4(%rip) <---- Problem!!
...
.local bar.1535
.comm bar.1535,2575411200,32
...
.comm baz_,12,16
This uses a 32-bit move instruction (5 bytes long) to put the value of the bar.1535 symbol (this value equals the address of the symbol location) into the lower 32 bits of the RBX register (the upper 32 bits get zeroed). The bar.1535 symbol itself is allocated using the .comm directive. Memory for the baz COMMON block is allocated afterwards. Because bar.1535 is very large, baz_ ends up more than 2 GiB from the start of the .bss section. This poses a problem in the second movl instruction, since the offset from RIP needed to address the b variable (where the value of EAX has to be stored) no longer fits in a signed 32-bit value. This is only detected at link time. The assembler itself does not know the appropriate offset since it doesn't know what the value of the instruction pointer (RIP) would be (it depends on the absolute virtual address where the code is loaded, and this is determined by the linker), so it simply puts in an offset of 0 and creates a relocation request of type R_X86_64_PC32. That request instructs the linker to patch the 0 with the real offset value. But the linker cannot do that, since the offset would not fit inside a signed 32-bit integer, and hence it bails out.
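As a quick sanity check on the numbers, here is a minimal Python sketch that recomputes the static size of bar from the dimensions in the question and compares it with the largest value a signed 32-bit offset can hold. The function name is made up for illustration, and a 4-byte default REAL is assumed:

```python
# Dimensions copied from the Fortran example in the question.
NF, NX, NY, NZ = 23 * 3, 450, 144, 144
REAL_BYTES = 4          # assumption: default REAL is 4 bytes
INT32_MAX = 2**31 - 1   # largest offset a signed 32-bit field can hold


def bar_static_bytes():
    """Size in bytes of real :: bar(nf, nx*ny*nz) (hypothetical helper)."""
    return NF * NX * NY * NZ * REAL_BYTES


size = bar_static_bytes()
print(size)               # matches the size operand of .comm bar.1535
print(size > INT32_MAX)   # True: anything placed after bar is out of reach
```

The result is the same 2575411200 that appears in the .comm directive, which is larger than 2 GiB, so any symbol laid out after bar cannot be reached with a 32-bit PC-relative offset.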
With the medium memory model in place things look like this:
movabsq $bar.1535, %r10
...
movl %eax, baz_+4(%rip)
...
.local bar.1535
.largecomm bar.1535,2575411200,32
...
.comm baz_,12,16
First, a 64-bit immediate move instruction (10 bytes long) is used to put the 64-bit value which represents the address of bar.1535 into register R10. Memory for the bar.1535 symbol is allocated using the .largecomm directive and thus it ends up in the .lbss section of the ELF executable. .lbss is used to store symbols which might not fit in the first 2 GiB (and hence should not be addressed using 32-bit instructions or RIP-relative addressing), while smaller things go to .bss (baz_ is still allocated using .comm and not .largecomm). Since the .lbss section is placed after the .bss section in the ELF linker script, baz_ stays within reach of 32-bit RIP-relative addressing.
All addressing modes are described in the System V ABI: AMD64 Architecture Processor Supplement. It is heavy technical reading but a must-read for anybody who really wants to understand how 64-bit code works on most x86_64 Unixes.
When an ALLOCATABLE array is used instead, gfortran allocates heap memory (most likely implemented as an anonymous memory map given the large size of the allocation):
movl $2575411200, %edi
...
call malloc
movq %rax, %rdi
This is basically RDI = malloc(2575411200). From then on elements of bar are accessed by using positive offsets from the value stored in RDI:
movl 51190040(%rdi), %eax
movl %eax, baz_+4(%rip)
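The displacement 51190040 in the listing above can be reproduced from Fortran's column-major layout. The sketch below is only an illustration: the helper name is hypothetical, and 4-byte REAL elements with 1-based indices are assumed.

```python
NF = 23 * 3      # leading dimension of bar, from the question
ELEM_BYTES = 4   # assumption: default REAL is 4 bytes
INT32_MAX = 2**31 - 1


def column_major_offset(i, j, leading_dim=NF, elem_bytes=ELEM_BYTES):
    """Byte offset of element (i, j) of a 1-based, column-major 2-D array."""
    return ((i - 1) + (j - 1) * leading_dim) * elem_bytes


near = column_major_offset(12, 32 * 138 * 42)
far = column_major_offset(12, 144 * 144 * 450)
print(near, near <= INT32_MAX)   # fits in a 32-bit displacement
print(far, far <= INT32_MAX)     # does not fit, needs a register offset
```

The first element lands at byte offset 51190040, which fits in a displacement field. The element in the last column is more than 2 GiB from the start of the array, so its offset has to be computed into a register instead.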
For locations that are more than 2 GiB from the start of bar, a more elaborate method is used. E.g. to implement b = bar(12,144*144*450) gfortran emits:
; Some computations that leave the offset in RAX
movl (%rdi,%rax), %eax
movl %eax, baz_+4(%rip)
This code is not affected by the memory model since nothing is assumed about the address where the dynamic allocation would be made. Also, since the array is not passed around, no descriptor is being built. If you add another function that takes an assumed-shape array and pass bar to it, a descriptor for bar is created as an automatic variable (i.e. on the stack of foo). If the array is made static with the SAVE attribute, the descriptor is placed in the .bss section:
movl $bar.1580, %edi
...
; RAX still holds the address of the allocated memory as returned by malloc
; Computations, computations
movl -232(%rax,%rdx,4), %eax
movl %eax, baz_+4(%rip)
The first move prepares the argument of a function call (in my sample case call boo(bar) where boo has an interface that declares it as taking an assumed-shape array). It moves the address of the array descriptor of bar into EDI. This is a 32-bit immediate move so the descriptor is expected to be in the first 2 GiB. Indeed, it is allocated in the .bss in both small and medium memory models like this:
.local bar.1580
.comm bar.1580,72,32

No, large static arrays (like your bar) may exceed the limit unless you use -mcmodel=medium. But allocatables are better of course. For allocatables only the array descriptor must fit into 2 GB, not the whole array.
From GCC reference:
-mcmodel=small
Generate code for the small code model: the program and its symbols must be linked in the lower 2 GB of the address space. Pointers are 64 bits. Programs can be statically or dynamically linked. This is the default code model.
-mcmodel=kernel
Generate code for the kernel code model. The kernel runs in the negative 2 GB of the address space. This model has to be used for Linux kernel code.
-mcmodel=medium
Generate code for the medium model: The program is linked in the lower 2 GB of the address space but symbols can be located anywhere in the address space. Programs can be statically or dynamically linked, but building of shared libraries are not supported with the medium model.
-mcmodel=large
Generate code for the large model: This model makes no assumptions about addresses and sizes of sections. Currently GCC does not implement this model.

Related

How to load address of function or label into register

I am trying to load the address of 'main' into a register (R10) in the GNU Assembler. I am unable to. Here is what I have and the error message I receive.
main:
lea main, %r10
I also tried the following syntax (this time using mov)
main:
movq $main, %r10
With both of the above I get the following error:
/usr/bin/ld: /tmp/ccxZ8pWr.o: relocation R_X86_64_32S against symbol `main' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status
Compiling with -fPIC does not resolve the issue and just gives me the same exact error.
In x86-64, most immediates and displacements are still 32 bits because 64-bit would waste too much code size (I-cache footprint and fetch/decode bandwidth).
lea main, %reg is an absolute disp32 addressing mode which would stop load-time address randomization (ASLR) from choosing a random 64-bit (or 47-bit) address. So it's not supported on Linux except in position-dependent executables, or at all on MacOS where static code/data are always loaded outside the low 32 bits. (See the x86 tag wiki for links to docs and guides.) On Windows, you can build executables as "large address aware" or not. If you choose not, addresses will fit in 32 bits.
The standard efficient way to put a static address into a register is a RIP-relative LEA:
# RIP-relative LEA always works. Syntax for various assemblers:
lea main(%rip), %r10 # AT&T syntax
lea r10, [rip+main] # GAS .intel_syntax noprefix equivalent
lea r10, [rel main] ; NASM equivalent, or use default rel
lea r10, [main] ; FASM defaults to RIP-relative. MASM may also
See How do RIP-relative variable references like "[RIP + _a]" in x86-64 GAS Intel-syntax work? for an explanation of the 3 syntaxes, and Why are global variables in x86-64 accessed relative to the instruction pointer? (and this) for reasons why RIP-relative is the standard way to address static data.
This uses a 32-bit relative displacement from the end of the current instruction, like jmp/call. This can reach any static data in .data, .bss, .rodata, or function in .text, assuming the usual 2GiB total size limit for static code+data.
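To make the "relative to the end of the current instruction" rule concrete, here is a small Python sketch that assembles lea main(%rip), %r10 by hand. The addresses are hypothetical, and the helper name is made up; the opcode bytes 4C 8D 15 are the REX.W+R prefix, the LEA opcode, and a ModRM byte selecting R10 with RIP-relative addressing.

```python
import struct

LEA_R10_RIP = bytes([0x4C, 0x8D, 0x15])  # lea disp32(%rip), %r10
INSN_LEN = len(LEA_R10_RIP) + 4          # 7 bytes in total


def lea_r10_rip(insn_addr, target):
    """Encode lea target(%rip), %r10 for an instruction placed at insn_addr.

    The displacement is measured from the END of the instruction.
    struct.pack raises struct.error if it does not fit in 32 signed bits,
    which is the situation the linker reports as a truncated relocation.
    """
    disp = target - (insn_addr + INSN_LEN)
    return LEA_R10_RIP + struct.pack("<i", disp)


# Hypothetical layout: the lea is the first instruction of main itself.
print(lea_r10_rip(0x401000, 0x401000).hex())
```

With the instruction located at main itself, the displacement is -7, the length of the instruction. A target 3 GiB away makes struct.pack raise, which mirrors the 2GiB reach limit mentioned above.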
In position dependent code (built with gcc -fno-pie -no-pie for example) on Linux, you can take advantage of 32-bit absolute addressing to save code size. Also, mov r32, imm32 has slightly better throughput than RIP-relative LEA on Intel/AMD CPUs, so out-of-order execution may be able to overlap it better with the surrounding code. (Optimizing for code-size is usually less important than most other things, but when all else is equal pick the shorter instruction. In this case all else is at least equal, or also better with mov imm32.)
See 32-bit absolute addresses no longer allowed in x86-64 Linux? for more about how PIE executables are the default. (Which is why you got a link error about -fPIC with your use of a 32-bit absolute.)
# in a non-PIE executable, mov imm32 into a 32-bit register is even better
# same as you'd use in 32-bit code
## GAS AT&T syntax
mov $main, %r10d # 6 bytes
mov $main, %edi # 5 bytes: no REX prefix needed for a "legacy" register
## GAS .intel_syntax
mov edi, OFFSET main
;; mov edi, main ; NASM and FASM syntax
Note that writing any 32-bit register always zero-extends into the full 64-bit register (R10 and RDI).
lea main, %edi or lea main, %rdi would also work in a Linux non-PIE executable, but never use LEA with a [disp32] absolute addressing mode (even in 32-bit code where that doesn't require a SIB byte); mov is always at least as good.
The operand-size suffix is redundant when you have a register operand that uniquely determines it; I prefer to just write mov instead of movl or movq.
The stupid/bad way is a 10-byte 64-bit absolute address as an immediate:
# Inefficient, DON'T USE
movabs $main, %r10 # 10 bytes including the 64-bit absolute address
This is what you get in NASM if you use mov rdi, main instead of mov edi, main, so many people end up doing this. Linux dynamic linking does actually support runtime fixups for 64-bit absolute addresses. But the use-case for that is for jump tables, not for absolute addresses as immediates.
movq $sign_extended_imm32, %reg (7 bytes) still uses a 32-bit absolute address, but wastes code bytes on a sign-extended mov to a 64-bit register, instead of implicit zero-extension to 64-bit from writing a 32-bit register.
By using movq, you're telling GAS you want a R_X86_64_32S relocation instead of a R_X86_64_64 64-bit absolute relocation.
The only reason you'd ever want this encoding is for kernel code where static addresses are in the upper 2GiB of 64-bit virtual address space, instead of the lower 2GiB. mov has slight performance advantages over lea on some CPUs (e.g. running on more ports), but normally if you can use a 32-bit absolute it's in the low 2GiB of virtual address space where a mov r32, imm32 works.
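The byte counts quoted above (5, 6, 7 and 10 bytes) can be checked with a short Python sketch that builds the three mov encodings by hand. The helper names and the address 0x401000 are hypothetical; in a real object file the immediate would be a placeholder filled in through a relocation.

```python
import struct

EDI, R10 = 7, 10  # hardware register numbers


def mov_r32_imm32(reg, imm):
    """mov $imm32, %r32: opcode B8+rd, plus a REX.B prefix for r8d..r15d."""
    prefix = b"\x41" if reg >= 8 else b""
    return prefix + bytes([0xB8 + (reg & 7)]) + struct.pack("<I", imm)


def movq_r64_simm32(reg, imm):
    """movq $simm32, %r64: REX.W + C7 /0, immediate sign-extended to 64 bits."""
    rex = 0x48 | (1 if reg >= 8 else 0)
    return bytes([rex, 0xC7, 0xC0 | (reg & 7)]) + struct.pack("<i", imm)


def movabs_r64_imm64(reg, imm):
    """movabs $imm64, %r64: REX.W + B8+rd with a full 8-byte immediate."""
    rex = 0x48 | (1 if reg >= 8 else 0)
    return bytes([rex, 0xB8 + (reg & 7)]) + struct.pack("<Q", imm)


addr = 0x401000  # hypothetical address of main in a non-PIE executable
for name, code in [("mov $main, %edi", mov_r32_imm32(EDI, addr)),
                   ("mov $main, %r10d", mov_r32_imm32(R10, addr)),
                   ("movq $main, %r10", movq_r64_simm32(R10, addr)),
                   ("movabs $main, %r10", movabs_r64_imm64(R10, addr))]:
    print(f"{name:20s} {len(code):2d} bytes  {code.hex()}")
```

The extra byte for R10 is the REX prefix, and movq pays for both REX.W and a ModRM byte while still carrying only a 32-bit immediate.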
(Related: Difference between movq and movabsq in x86-64)
PS: I intentionally left out any discussion of "large" or "huge" memory / code models, where RIP-relative +-2GiB addressing can't reach static data, or maybe can't even reach other code addresses. The above is for x86-64 System V ABI's "small" and/or "small-PIC" code models. You may need movabs $imm64 for medium and large models, but that's very rare.
I don't know if mov $imm32, %r32 works in Windows x64 executables or DLLs with runtime fixups, but RIP-relative LEA certainly does.
Semi-related: Call an absolute pointer in x86 machine code - if you're JITing, try to put the JIT buffer near existing code so you can call rel32, otherwise movabs a pointer into a register.

How do I leave memory uninitialized in GNU ARM assembly?

I'm using GCC on my Raspberry Pi to compile some assembly code for a course I'm taking. It is my understanding from information in the GNU Assembler Reference that I can reproduce the following C code in GNU ARM Assembly:
int num = 0;
By writing this:
.data
num: .word 0
Great! Now how would I write this?
int num;
It is my understanding that leaving a variable uninitialized like this means I should treat it as containing whatever garbage value was in the memory location before. Therefore, I shouldn't use it before I've given it a value in some way.
But suppose for some reason I intended to store a huge amount of data in memory and needed to reserve a massive amount of space for it. It seems to me like it would be a massive waste of resources to initialize the entire area of memory to some value if I'm about to fill it with some data anyway. Yet from what I can find there seems to be no way to make a label in GCC ARM Assembly without initializing it to some value. According to my assembly textbook the .word directive can have zero expressions after it, but if used this way "then the address counter is not advanced and no bytes are reserved." My first thought was to use the ".space" or ".skip" directives instead, but for these directives even the official documentation says that "if the comma and fill are omitted, fill is assumed to be zero."
Is there no way for me to reserve a chunk of memory without initializing it in GCC ARM Assembly?
Generally, data that you don't need to initialize should be placed in the .bss section.
.bss
foobar:
.skip 99999999
This will allocate 99999999 bytes in the .bss section, and label foobar will be its address. It won't make your object files or executable 99999999 bytes bigger; the executable header just indicates how many bytes of .bss are needed, and at load time, the system allocates an appropriate amount and initializes it to zero.
You can't skip the load-time zero initialization. The system needs to initialize it to something, because it might otherwise contain sensitive data from the kernel or some other process. But zeroing out memory is quite fast, and the kernel will use an efficient algorithm, so I wouldn't worry about the performance impact. It might even zero pages in its idle time, so that when your program loads, there is zeroed memory already available. Anyway, the time your program spends actually using the memory will swamp it.
This means that you can also safely use .bss for data that you do want to have initialized to zero (though not to any nonzero value; if you want int foo = 3; you'll have to put it in .data as in your original example.).
What happened when you tried it?
When I tried it:
int num = 0;
int mun;
With gnu I got
.cpu arm7tdmi
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 1
.eabi_attribute 30, 2
.eabi_attribute 34, 0
.eabi_attribute 18, 4
.file "so.c"
.text
.comm mun,4,4
.global num
.bss
.align 2
.type num, %object
.size num, 4
num:
.space 4
.ident "GCC: (GNU) 8.3.0"
.comm symbol , length
.comm declares a common symbol named symbol.
When linking, a common symbol in one object file may be merged with a
defined or common symbol of the same name in another object file. If
ld does not see a definition for the symbol--just one or more common
symbols--then it will allocate length bytes of uninitialized memory.
length must be an absolute expression. If ld sees multiple common
symbols with the same name, and they do not all have the same size, it
will allocate space using the largest size.
When using ELF, the .comm directive takes an optional third argument.
This is the desired alignment of the symbol, specified as a byte
boundary (for example, an alignment of 16 means that the least
significant 4 bits of the address should be zero). The alignment must
be an absolute expression, and it must be a power of two. If ld
allocates uninitialized memory for the common symbol, it will use the
alignment when placing the symbol. If no alignment is specified, as
will set the alignment to the largest power of two less than or equal
to the size of the symbol, up to a maximum of 16.
The syntax for .comm differs slightly on the HPPA. The syntax is
`symbol .comm, length'; symbol is optional.
Assembly language is defined by the assembler, not the target. So the answer will be specific to the assembler (the tool that reads and assembles assembly language programs), and there is no reason to assume that the answer for one assembler is the same as for another. The above is for the gnu assembler, gas.
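The two rules in the quoted .comm documentation (merged size is the largest request, default alignment is the largest power of two not exceeding the size, capped at 16) can be written down as a tiny Python sketch. This is only a reading of the quoted text with made-up function names, not a check against what as and ld actually do:

```python
def merged_common_size(requested_sizes):
    """Size ld allocates when several common symbols share one name."""
    return max(requested_sizes)


def default_comm_alignment(size):
    """Largest power of two <= size, capped at 16 (size assumed >= 1)."""
    align = 1
    while align * 2 <= size and align * 2 <= 16:
        align *= 2
    return align


print(merged_common_size([4, 12, 8]))
for size in (1, 3, 4, 12, 72):
    print(size, default_comm_alignment(size))
```

For the 4-byte mun in the compiler output above, the default rule would give the same alignment of 4 that GCC requests explicitly with .comm mun,4,4.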
You could have looked at the documentation you referenced or read other gnu documentation, but the easiest way to answer a "what happens when you do this in a compiled program" is to just compile it and look at the compiler output.
But don't necessarily assume that it isn't initialized:
unsigned int num;
unsigned int fun ( void )
{
return(num);
}
Just enough to link it:
Disassembly of section .text:
00001000 <fun>:
1000: e59f3004 ldr r3, [pc, #4] ; 100c <fun+0xc>
1004: e5930000 ldr r0, [r3]
1008: e12fff1e bx lr
100c: 00002000 andeq r2, r0, r0
Disassembly of section .bss:
00002000 <__bss_start>:
2000: 00000000
It ends up in .bss, initialized to zero.
If you really want uninitialized access to something, then just pick an address that you know isn't initialized (SRAM) and access it:
ldr r0,=0x1234
ldr r0,[r0]

MingW Windows GCC cant compile c program with 2gb global data

GCC/G++ of MingW gives Relocation Errors when Building Applications with Large Global or Static Data.
Understanding the x64 code models
References to both code and data on x64 are done with
instruction-relative (RIP-relative in x64 parlance) addressing modes.
The offset from RIP in these instructions is limited to 32 bits.
small code model promises to the compiler that 32-bit relative offsets
should be enough for all code and data references in the compiled
object. The large code model, on the other hand, tells it not to make
any assumptions and use absolute 64-bit addressing modes for code and
data references. To make things more interesting, there's also a
middle road, called the medium code model.
For the example program below, the code fails to compile even after adding the options -mcmodel=medium or -mcmodel=large
#define SIZE 16384
float a[SIZE][SIZE], b[SIZE][SIZE];
int main(){
return 0;
}
gcc -mcmodel=medium example.c fails to compile on MingW/Cygwin Windows, Intel windows /MSVC
You are limited to 32 bits for an offset, but this is a signed offset, so in practice you are limited to 2 GiB. You asked why this is not possible, but your two arrays alone total 2 GiB (1 GiB each), and there are things in the data segment other than just your arrays. C is a high-level language. You get the ease of just being able to define a main function and you get all of these other things for free: standard input and output, etc. The C runtime implements this for you, and all of this consumes stack space and room in your data segment. For example, if I build this on x86_64-pc-linux-gnu, my .bss is 0x80000020 bytes in size, an additional 32 bytes. (I've erased PE information from my brain, so I don't remember how those are laid out.)
I don't remember much about the various machine models, but it's probably helpful to note that the x86_64 instruction set doesn't even contain instructions (that I'm aware of, although I'm not an x86 assembly expert) to access any register-relative address beyond a signed 32-bit value. For example, when you want to cram that much stuff on the stack, gcc has to do weird things like this stack pointer allocation:
movabsq $-10000000016, %r11
addq %r11, %rsp
You can't addq $-10000000016, %rsp because it's more than a signed 32-bit offset. The same applies to RIP-relative addressing:
movq $10000000016(%rip), %rax # No such addressing mode

Windows x86 assembly language syntax [duplicate]

(1) What does the following code mean? I cannot find any reference about the ds:[ ] syntax anywhere online. How is it different from without the ds:?
cmp eax,dword ptr ds:[12B656Ch]
(2) In the following instruction,
movsx eax,word ptr [esi+24h]
What is the esi register used for? Is it possible to guess what the original C code is doing from using such a rare register?
DS refers to the Data Segment.
In Win32, CS, DS, ES and SS all have a base address of 0.
That is, these segments do not matter and a flat 32-bit address space is used.
The Data segment is the default segment when accessing memory. Some disassemblers mistakenly list it, even though it serves no purpose to list a default segment.
You can list a different segment if you do wish by using a segment override.
CS is the Code Segment, which is the default segment for jumps and calls, and SS is the Stack Segment, which is the default for addresses based on ESP or EBP.
ES is the Extra Segment which is used for string instructions.
The only segment override that makes sense in Win32 is FS (The F does not stand for anything, but it comes after E).
FS links to the Thread Information Block (TIB) which houses thread specific data and is very useful for Thread Local Storage and multi-threading in general.
There is also a GS which is reserved for future use in Win32 and is used for the TIB in Win64.
In Linux the picture is more or less the same.
What is register X for
You must let go of the notion that registers have special purposes.
In x86 you can use almost any register for almost any purpose.
Only a few complex instructions use specific registers, but the normal instructions can use any register.
The compiler will try and use as many registers as possible to avoid having to use memory.
Having said this the original purposes of the 8 x86 registers are as follows:
EAX : accumulator, some instructions using this register have 'short versions'.
EDX : overflow for EAX, used to store 64 bit values when multiplying or dividing.
ECX : counter, used in string instructions like rep movs and in shifts.
EBX : miscellaneous general purpose register.
ESI : Source Index register, used as source pointer for string instructions
EDI : Destination Index register, used as destination pointer
ESP : Stack pointer, used to keep track of the stack
EBP : Base pointer, used in stack frames
You can use any register pretty much as you please, with the exception of ESP. Although ESP will work in many instructions, it is just too awkward to lose track of the stack.
Is it possible to guess what the original C code is doing from using such a rare register?
My guess:
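struct x {
int a,b,c,d,e,f,g,h,i; // 9 ints = 36 bytes, so s lands at offset 24h
short s;
};
....
struct x *p; // the pointer held in ESI
int v = p->s; // movsx eax,word ptr [esi+24h]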
ESI likely points to some structure or object. At offset 24h (36) a short is present, which is transferred into an int (hence the mov with Sign eXtend).
ESI does not point to a local variable, because in that case EBP or ESP would be used.
If you want to know more about the c code you'd need more context.
Many c constructs translate into multiple cpu instructions.
The best way to see this is to write c code and inspect the cpu code that gets generated.
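The struct guess above can also be checked without a C compiler. The following Python sketch uses ctypes to lay out a struct with nine 4-byte ints followed by a short, and models what movsx does to a 16-bit value. The class and function names are made up for illustration, and the layout assumes the platform's normal C alignment rules.

```python
import ctypes


class X(ctypes.Structure):
    # Nine ints (a..i) followed by a short, as in the guessed C struct.
    _fields_ = [(name, ctypes.c_int) for name in "abcdefghi"]
    _fields_ += [("s", ctypes.c_short)]


def movsx_word(raw16):
    """Sign-extend a 16-bit value, like movsx eax, word ptr [...]."""
    return ctypes.c_int16(raw16).value


print(hex(X.s.offset))     # offset of s inside the struct
print(movsx_word(0xFFFF))  # a short holding all ones becomes a negative int
print(movsx_word(0x7FFF))  # positive values are unchanged
```

On platforms where int is 4 bytes, the short sits at offset 0x24, matching the [esi+24h] operand in the question.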

Understanding OSX 16-Byte alignment

So it seems like everyone knows that OSX syscalls are always 16 byte stack aligned. Great, that makes sense when you have code like this:
section .data
msg db 'something', 10, 0
section .text
global start
start:
push 10 ; size of the message (4 bytes)
push msg ; the address of the message (4 bytes)
push 1 ; we want to write to STD_OUT (4 bytes)
mov eax, 4 ; write(...) syscall
sub esp, 4 ; move stack pointer down by 4 bytes for a total of 16.
int 0x80 ; invoke
add esp, 16 ; clean
Perfect, the stack is aligned to 16 bytes, makes perfect sense. But what about calling syscall 1 (exit)? Logically that would look something like this:
push 69 ; return value
mov eax, 1 ; exit(...) syscall
sub esp, 12 ; push down stack for total of 16 bytes.
int 0x80 ; invoke
This doesn't work though, but this does:
push 69 ; return value
mov eax, 1 ; exit(...) syscall
sub esp, 4 ; push down stack for total of 8 bytes.
int 0x80 ; invoke
That works fine, but that's only 8 bytes???? Osx is cool, but this ABI is driving me nuts. Can someone shed some light on what I'm not understanding?
Short version: you probably don't need to align to 16 bytes, you just need to always leave a 4-byte gap before your argument list.
Long version:
Here's what I think is happening: I'm not sure that it's true that the stack should be 16-byte aligned. However, logic dictates that if it is and if padding or adjusting the stack is necessary to achieve that alignment, it must happen before the arguments for the syscall are pushed, not after. There can't be an arbitrary number of bytes between the stack pointer at the time of the int 0x80 instruction and where the arguments actually are. The kernel wouldn't know where to find the actual arguments. Subtracting from the stack pointer after pushing the arguments to achieve "alignment" doesn't align the arguments, it aligns the stack pointer by inserting an arbitrary number of bytes between the stack pointer and the arguments. Whatever else may be true, that can't be right.
Then why do the first and third snippets work at all? Don't they also insert arbitrary bytes there? They work by accident. It's because they both happen to insert 4 bytes. That adjustment isn't "successful" because it achieves stack alignment, it's part of the syscall ABI. Apparently, the syscall ABI expects and requires that there be a 4-byte slot before the argument list.
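Here is a toy Python model of that explanation. It is only a sketch of the hypothesis in this answer, not of the real kernel: it assumes the kernel skips one 4-byte slot at [esp] and reads argument k from [esp + 4 + 4*k]. The function name is made up, and None stands for an uninitialised stack slot.

```python
def kernel_view(pushes, gap_bytes, nargs):
    """Return the words a kernel would read as syscall arguments.

    pushes:    values pushed in program order (last push = lowest address)
    gap_bytes: amount subtracted from ESP after the pushes (multiple of 4)
    nargs:     number of arguments the syscall takes
    """
    # stack[0] is the word at [esp]; the gap slots hold garbage (None).
    stack = [None] * (gap_bytes // 4) + list(reversed(pushes))
    return stack[1:1 + nargs]   # skip the one slot the ABI seems to expect


print(kernel_view([10, "msg", 1], gap_bytes=4, nargs=3))   # write, works
print(kernel_view([69], gap_bytes=12, nargs=1))            # exit, broken
print(kernel_view([69], gap_bytes=4, nargs=1))             # exit, works
```

With a 4-byte gap the arguments line up in both cases. With the 12-byte gap the model reads an uninitialised slot instead of 69, which is consistent with the second snippet not working.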
The source for the syscall() function can be found here. It looks like this:
LEAF(___syscall, 0)
popl %ecx // ret addr
popl %eax // syscall number
pushl %ecx
UNIX_SYSCALL_TRAP
movl (%esp),%edx // add one element to stack so
pushl %ecx // caller "pop" will work
jnb 2f
BRANCH_EXTERN(cerror)
2:
END(___syscall)
To call this library function, the caller will have set up the stack pointer to point to the arguments to the syscall() function, which starts with the syscall number and then has the real arguments for the actual syscall. However, the caller will then have used a call instruction to call it, which pushed the return address onto the stack.
So, the above code pops the return address, pops the syscall number into %eax, pushes the return address back onto the stack (where the syscall number originally was), and then does int 0x80. So, the stack pointer points to the return address and then the arguments. There's the extra 4 bytes: the return address. I suspect the kernel ignores the return address. I guess its presence in the syscall ABI may just be to make the ABI for system calls similar to that of function calls.
What does this mean for the alignment requirement of syscalls? Well, this function is guaranteed to change the alignment of the stack from how it was set up by its caller. The caller presumably set up the stack with 16-byte alignment and this function moves it by 4 bytes before the interrupt. It may just be a myth that the stack needs to be 16-byte aligned for syscalls. On the other hand, the 16-byte alignment requirement is definitely real for calling system library functions. The Wine project, for which I develop, was burned by it. It is mostly necessary for 128-bit SSE argument data types, but Apple made their lazy symbol resolver deliberately blow up if the alignment is wrong even for functions which don't use such arguments so that problems would be found early. Syscalls would not be subject to that early-failure mechanism. It may be that the kernel doesn't require the 16-byte alignment. I'm not sure if any syscalls take 128-bit arguments.
