How to load address of function or label into register

How to load address of function or label into register - gcc

I am trying to load the address of 'main' into a register (R10) in the GNU Assembler. I am unable to. Here I what I have and the error message I receive.
main:
lea main, %r10
I also tried the following syntax (this time using mov)
main:
movq $main, %r10
With both of the above I get the following error:
/usr/bin/ld: /tmp/ccxZ8pWr.o: relocation R_X86_64_32S against symbol `main' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status
Compiling with -fPIC does not resolve the issue and just gives me the same exact error.

In x86-64, most immediates and displacements are still 32-bits because 64-bit would waste too much code size (I-cache footprint and fetch/decode bandwidth).
lea main, %reg is an absolute disp32 addressing mode which would stop load-time address randomization (ASLR) from choosing a random 64-bit (or 47-bit) address. So it's not supported on Linux except in position-dependent executables, or at all on MacOS where static code/data are always loaded outside the low 32 bits. (See the x86 tag wiki for links to docs and guides.) On Windows, you can build executables as "large address aware" or not. If you choose not, addresses will fit in 32 bits.
The standard efficient way to put a static address into a register is a RIP-relative LEA:
# RIP-relative LEA always works. Syntax for various assemblers:
lea main(%rip), %r10 # AT&T syntax
lea r10, [rip+main] # GAS .intel_syntax noprefix equivalent
lea r10, [rel main] ; NASM equivalent, or use default rel
lea r10, [main] ; FASM defaults to RIP-relative. MASM may also
See How do RIP-relative variable references like "[RIP + _a]" in x86-64 GAS Intel-syntax work? for an explanation of the 3 syntaxes, and Why are global variables in x86-64 accessed relative to the instruction pointer? (and this) for reasons why RIP-relative is the standard way to address static data.
This uses a 32-bit relative displacement from the end of the current instruction, like jmp/call. This can reach any static data in .data, .bss, .rodata, or function in .text, assuming the usual 2GiB total size limit for static code+data.
In position dependent code (built with gcc -fno-pie -no-pie for example) on Linux, you can take advantage of 32-bit absolute addressing to save code size. Also, mov r32, imm32 has slightly better throughput than RIP-relative LEA on Intel/AMD CPUs, so out-of-order execution may be able to overlap it better with the surrounding code. (Optimizing for code-size is usually less important than most other things, but when all else is equal pick the shorter instruction. In this case all else is at least equal, or also better with mov imm32.)
See 32-bit absolute addresses no longer allowed in x86-64 Linux? for more about how PIE executables are the default. (Which is why you got a link error about -fPIC with your use of a 32-bit absolute.)
# in a non-PIE executable, mov imm32 into a 32-bit register is even better
# same as you'd use in 32-bit code
## GAS AT&T syntax
mov $main, %r10d # 6 bytes
mov $main, %edi # 5 bytes: no REX prefix needed for a "legacy" register
## GAS .intel_syntax
mov edi, OFFSET main
;; mov edi, main ; NASM and FASM syntax
Note that writing any 32-bit register always zero-extends into the full 64-bit register (R10 and RDI).
lea main, %edi or lea main, %rdi would also work in a Linux non-PIE executable, but never use LEA with a [disp32] absolute addressing mode (even in 32-bit code where that doesn't require a SIB byte); mov is always at least as good.
The operand-size suffix is redundant when you have a register operand that uniquely determines it; I prefer to just write mov instead of movl or movq.
The stupid/bad way is a 10-byte 64-bit absolute address as an immediate:
# Inefficient, DON'T USE
movabs $main, %r10 # 10 bytes including the 64-bit absolute address
This is what you get in NASM if you use mov rdi, main instead of mov edi, main so many people end up doing this. Linux dynamic linking does actually support runtime fixups for 64-bit absolute addresses. But the use-case for that is for jump tables, not for absolute addresses as immediates.
movq $sign_extended_imm32, %reg (7 bytes) still uses a 32-bit absolute address, but wastes code bytes on a sign-extended mov to a 64-bit register, instead of implicit zero-extension to 64-bit from writing a 32-bit register.
By using movq, you're telling GAS you want a R_X86_64_32S relocation instead of a R_X86_64_64 64-bit absolute relocation.
The only reason you'd ever want this encoding is for kernel code where static addresses are in the upper 2GiB of 64-bit virtual address space, instead of the lower 2GiB. mov has slight performance advantages over lea on some CPUs (e.g. running on more ports), but normally if you can use a 32-bit absolute it's in the low 2GiB of virtual address space where a mov r32, imm32 works.
(Related: Difference between movq and movabsq in x86-64)
PS: I intentionally left out any discussion of "large" or "huge" memory / code models, where RIP-relative +-2GiB addressing can't reach static data, or maybe can't even reach other code addresses. The above is for x86-64 System V ABI's "small" and/or "small-PIC" code models. You may need movabs $imm64 for medium and large models, but that's very rare.
I don't know if mov $imm32, %r32 works in Windows x64 executables or DLLs with runtime fixups, but RIP-relative LEA certainly does.
Semi-related: Call an absolute pointer in x86 machine code - if you're JITing, try to put the JIT buffer near existing code so you can call rel32, otherwise movabs a pointer into a register.

Related

What's the purpose of xchg ax,ax prior to the break instruction int 3 in DebugBreak()?

In MASM, I've always inserted a standalone break instruction
00007ff7`63141120 cc int 3
However, replacing that instruction with the MSVC DebugBreak function generates
KERNELBASE!DebugBreak:
00007ff8`6b159b90 6690 xchg ax,ax
00007ff8`6b159b92 cc int 3
00007ff8`6b159b93 c3 ret
I was surprised to see the xchg instruction prior to the break instruction
xchg ax,ax
As noted from another S.O. article:
Actually, xchg ax,ax is just how MS disassembles "66 90". 66 is the
operand size override, so it supposedly operates on ax instead of eax.
However, the CPU still executes it as a nop. The 66 prefix is used
here to make the instruction two bytes in size, usually for alignment
purposes.
MSVC, like most compilers, aligns functions to 16 byte boundaries.
Question What is the purpose of that xchg instruction?

MSVC generates 2 byte nop before any single-byte instruction at the beginning of a function (except ret in empty functions1). I've tried __halt, _enable, _disable intrinsics and seen the same effect.
Apparently it is for patching. /hotpatch option gives the same change for x86, and /hotpatch option is not recognized on x64. According to the /hotpatch documentation, it is expected behavior (emphasis mine):
Because instructions are always two bytes or larger on the ARM architecture, and because x64 compilation is always treated as if /hotpatch has been specified, you don't have to specify /hotpatch when you compile for these targets;
So hotpatching support is unconditional for x64, and its result is seen in DebugBreak implementation.
See here: https://godbolt.org/z/1G737cErf
See this post on why it is needed for hotpatching: Why do Windows functions all begin with a pointless MOV EDI, EDI instruction?. Looks like that currently hotpatching is smart enough to use any two bytes or more instruction, not just MOV EDI, EDI, still it cannot use single-byte instruction, as two-byte backward jump may be written at exact moment when the instruction pointer points at the second instruction.
1 As discussed in comments, empty functions have three-byte ret 0, although it is not apparent from MSVC assembly output, as it is represented there as just ret)

How does RIP-relative addressing perform compared to mov reg, imm64?

It is known fact that x86-64 instructions do not support 64-bit immediate values (except for mov). Hence, when migrating code from 32 to 64 bits, an instruction like this:
cmp rax, addr32
cannot be replaced with the following:
cmp rax, addr64
Under these circumstances, I'm considering two alternatives: (a) using a scratch register for loading the constant or (b) using rip-relative addressing. The two approaches look like this:
mov r11, addr64 ; scratch register
cmp rax, r11
ptr64: dq addr64
...
cmp rax, [rel ptr64] ; encoded as cmp rax, [rip+offset]
I wrote a very simple loop to compare the performance of both approaches (which I paste below). While (b) uses an indirect pointer, (a) has the the immediate encoded in the instruction (which could lead to a worse usage of i-cache). Surprisingly, I found that (b) run ~10% faster than (a). Is this result something to be expected in more common real-world code?
true: dq 0xFFFF0000FFFF0000
false: dq 0xAAAABBBBAAAABBBB
main:
or rax, 1 ; rax is odd and constant "true" is even
mov rcx, 0x1
shl rcx, 30
branch:
mov r11, 0xFFFF0000FFFF0000 ; not present in (b)
cmp rax, r11 ; vs cmp rax, [rel true]
je next
add rax, 2
loop branch
next:
mov rax, 0
ret

Surprisingly, I found that (b) run ~10% faster than (a)
You probably tested on a CPU other than AMD Bulldozer-family or Ryzen, which have a fast loop instruction. On other CPUs, loop is very slow, mostly on purpose for historical reasons, so you bottleneck on it. e.g. 7 uops, one per 5c throughput on Haswell.
mov r64, imm64 is bad for uop cache throughput because of the large immediate taking 2 slots in Intel's uop cache. (See the Sandybridge uop cache section in Agner Fog's microarch pdf), and Which is faster, imm64 or m64 for x86-64? where I listed the details.
Even apart from that, it's not too surprising that 1 extra uop in the loop makes it run slower. You're probably not on an AMD CPU (with single-uop / 1 per 2 clock loop), because the extra mov in such a tiny loop would make more than 10% difference. Or no difference at all, since it's just 3 vs. 4 uops per 2 clocks, if that's correct that even tiny loop loops are limited to one jump per 2 clocks.
On Intel, loop is 7 uops, one per 5 clocks throughput on most CPUs, so the 4-per-clock issue/rename bottleneck won't be what you're hitting. loop is micro-coded, so the front-end can't run from the loop buffer. (And Skylake CPUs have their LSD disabled by a microcode update to fix the partial-register erratum anyway.) So the mov r64,imm64 uop has to be re-read from the uop cache every time through the loop.
A load that hits in cache has very good throughput (2 loads per clock, and in this case micro-fusion means no extra uops to use a memory operand instead of register for cmp). So the main penalty in using a constant from memory is the extra cache footprint and cache misses, but your microbenchmark won't reveal that at all. It also has no other pressure on the load ports.
In the general case:
If possible, use a RIP-relative lea to generate 64-bit address constants.
e.g. lea rax, [rel addr64]. Yes, this takes an extra instruction to get the constant into a register. (BTW, just use default rel. You can use [abs fs:0] if you need it.
You can avoid the extra instruction if you build position-dependent code with the default (small) code model, so static addresses fit in the low 32 bits of virtual address space and can be used as immediates. (Actually low 2GiB, so sign or zero extending both work). See 32-bit absolute addresses no longer allowed in x86-64 Linux? if gcc complains about absolute addressing; -pie is enabled by default on most distros. This of course doesn't work in Linux shared libraries, which only support text relocations for 64-bit addresses. But you should avoid relocations whenever possible by using lea to make position-indepdendent code.
Most integer build-time constants fit in 32 bits, so you can use cmp r64, imm32 or cmp r32, imm32 even in PIC code.
If you do need a 64-bit non-address constant, try to hoist the mov r64, imm64 out of a loop. Your cmp loop would have been fine if the mov wasn't inside the loop. x86-64 has enough registers that you (or the compiler) can usually avoid reloads inside inner-most loops in integer code.

Mach-O 64-bit format does not support 32-bit absolute addresses. NASM Accessing Array

Running this code off my Mac computer, using command:
nasm -f macho64 -o max.a maximum.asm
This is the code I am attempting to run on my computer that finds the largest number inside an array.
section .data
data_items:
dd 3,67,34,222,45,75,54,34,44,33,22,11,66,0
section .text
global _start
_start:
mov edi, 0
mov eax, [data_items + edi*4]
mov ebx, eax
start_loop:
cmp eax, 0
je loop_exit
inc edi
mov eax, [data_items + edi*4]
cmp eax, ebx
jle start_loop
mov ebx, eax
jmp start_loop
loop_exit:
mov eax, 1
int 0x80
Error:
maximum.asm:14: error: Mach-O 64-bit format does not support 32-bit absolute addresses
maximum.asm:21: error: Mach-O 64-bit format does not support 32-bit absolute addresses

First of all, beware of NASM bugs with the macho64 output format with 64-bit absolute addressing (NASM 2.13.02+) and with RIP-relative in NASM 2.11.08. 64-bit absolute addressing is not recommended, so this answer should work even for buggy NASM 2.13.02 and higher. (The bugs don't cause this error, they lead to wrong addresses being used at runtime.)
[data_items + edi*4] is a 32-bit addressing mode. Even [data_items + rdi*4] can only use a 32-bit absolute displacement, so it wouldn't work either. Note that using an address as a 32-bit (sign-extended) immediate like cmp rdi, data_items is also a problem: only mov allows a 64-bit immediate.
64-bit code on OS X can't use 32-bit absolute addressing at all. Executables are loaded at a base address above 4GiB, so label addresses just plain don't fit in 32-bit integers, with zero- or sign-extension. RIP-relative addressing is the best / most efficient solution, whether you need it to be position-independent or not1.
In NASM, default rel at the top of your file will make all [] memory operands prefer RIP-relative addressing. See also Section 3.3 Effective Addresses in the NASM manual.
default rel ; near the top of file; affects all instructions
my_func:
...
mov ecx, [data_items] ; uses the default: RIP-relative
;mov ecx, [abs data_items] ; override to absolute [disp32], unusuable
mov ecx, [rel data_items] ; explicitly RIP-relative
But RIP-relative is only possible when there are no other registers involved, so for indexing a static array you need to get the address in a register first. Use a RIP-relative lea rsi, [rel data_items].
lea rsi, [data_items] ; can be outside the loop
...
mov eax, [rsi + rdi*4]
Or you could add rsi, 4 inside the loop and use a simpler addressing mode like mov eax, [rsi].
Note that mov rsi, data_items will work for getting an address into a register, but you don't want that because it's less efficient.
Technically, any address within +-2GiB of your array will work, so if you have multiple arrays you can address the others relative to one common base address, only tieing up one register with a pointer. e.g. lea rbx, [arr1] / ... / mov eax, [rbx + rdi*4 + arr2-arr1]. Relative Addressing errors - Mac 10.10 mentions that Agner Fog's "optimizing assembly" guide has some examples of array addressing, including one using the __mh_execute_header as a reference point. (The code in that question looks like another attempt to port this 32-bit Linux example from the PGU book to 64-bit OS X, at the same time as learning asm in the first place.)
Note that on Linux, position-dependent executables are loaded in the low 32 bits of virtual address space, so you will see code like mov eax, [array + rdi*4] or mov edi, symbol_name in Linux examples or compiler output on http://gcc.godbolt.org/. gcc -pie -fPIE will make position-independent executables on Linux, and is the default on many recent distros, but not Godbolt.
This doesn't help you on MacOS, but I mention it in case anyone's confused about code they've seen for other OSes, or why AMD64 architects bothered to allow [disp32] addressing modes at all on x86-64.
And BTW, prefer using 64-bit addressing modes in 64-bit code. e.g. use [rsi + rdi*4], not [esi + edi*4]. You usually don't want to truncate pointers to 32-bit, and it costs an extra address-size prefix to encode.
Similarly, you should be using syscall to make 64-bit system calls, not int 0x80. What are the calling conventions for UNIX & Linux system calls on i386 and x86-64 for the differences in which registers to pass args in.
Footnote 1:
64-bit absolute addressing is supported on OS X, but only in position-dependent executables (non-PIE). This related question x64 nasm: pushing memory addresses onto the stack & call function includes an ld warning from using gcc main.o to link:
ld: warning: PIE disabled. Absolute addressing (perhaps -mdynamic-no-pic) not
allowed in code signed PIE, but used in _main from main.o. To fix this warning,
don't compile with -mdynamic-no-pic or link with -Wl,-no_pie
So the linker checks if any 64-bit absolute relocations are used, and if so disables creation of a Position-Independent Executable. A PIE can benefit from ASLR for security. I think shared-library code always has to be position-independent on OS X; I don't know if jump tables or other cases of pointers-as-data are allowed (i.e. fixed up by the dynamic linker), or if they need to be initialized at runtime if you aren't making a position-dependent executable.
mov r64, imm64 is larger (10 bytes) and not faster than lea r64, [RIP_rel32] (7 bytes).
So you could use mov rsi, qword data_items instead of a RIP-relative LEA which runs about as fast, and takes less space in code caches and the uop cache. 64-bit immediates also have a uop-cache fetch penalty for on Sandybridge-family (http://agner.org/optimize/): they take 2 cycles to read from a uop cache line instead of 1.
x86 also has a form of mov that loads/store from/to a 64-bit absolute address, but only for AL/AX/EAX/RAX. See http://felixcloutier.com/x86/MOV.html. You don't want this either, because it's larger and not faster than mov eax, [rel foo].
(Related: an AT&T syntax version of the same question)

Allocating memory using malloc() in 32-bit and 64-bit assembly language

I have to do a 64 bits stack. To make myself comfortable with malloc I managed to write two integers(32 bits) into memory and read from there:
But, when i try to do this with 64 bits:

The first snippet of code works perfectly fine. As Jester suggested, you are writing a 64-bit value in two separate (32-bit) halves. This is the way you have to do it on a 32-bit architecture. You don't have 64-bit registers available, and you can't write 64-bit chunks of memory at once. But you already seemed to know that, so I won't belabor it.
In the second snippet of code, you tried to target a 64-bit architecture (x86-64). Now, you no longer have to write 64-bit values in two 32-bit halves, since 64-bit architectures natively support 64-bit integers. You have 64-bit wide registers available, and you can write a 64-bit chunk to memory directly. Take advantage of that to simplify (and speed up) the code.
The 64-bit registers are Rxx instead of Exx. When you use QWORD PTR, you will want to use Rxx; when you use DWORD PTR, you will want to use Exx. Both are legal in 64-bit code, but only 32-bit DWORDs are legal in 32-bit code.
A couple of other things to note:
Although it is perfectly valid to clear a register using MOV xxx, 0, it is smaller and faster to use XOR eax, eax, so this is generally what you should write. It is a very old trick, something that any assembly-language programmer should know, and if you ever try to read other people's assembly programs, you'll need to be familiar with this idiom. (But actually, in the code you're writing, you don't need to do this at all. For the reason why, see point #2.)
In 64-bit mode, all instructions implicitly zero the upper 32 bits when writing the lower 32 bits, so you can simply write XOR eax, eax instead of XOR rax, rax. This is, again, smaller and faster.
The calling convention for 64-bit programs is different than the one used in 32-bit programs. The exact specification of the calling convention is going to vary, depending on which operating system you're using. As Peter Cordes commented, there is information on this in the x86 tag wiki. Both Windows and Linux x64 calling conventions pass at least the first 4 integer parameters in registers (rather than on the stack like the x86-32 calling convention), but which registers are actually used is different. Also, the 64-bit calling conventions have different requirements than do the 32-bit calling conventions for how you must set up the stack before calling functions.
(Since your screenshot says something about "MASM", I'll assume that you're using Windows in the sample code below.)
; Set up the stack, as required by the Windows x64 calling convention.
; (Note that we use the 64-bit form of the instruction, with the RSP register,
; to support stack pointers larger than 32 bits.)
sub rsp, 40
; Dynamically allocate 8 bytes of memory by calling malloc().
; (Note that the x64 calling convention passes the parameter in a register, rather
; than via the stack. On Windows, the first parameter is passed in RCX.)
; (Also note that we use the 32-bit form of the instruction here, storing the
; value into ECX, which is safe because it implicitly zeros the upper 32 bits.)
mov ecx, 8
call malloc
; Write a single 64-bit value into memory.
; (The pointer to the memory block allocated by malloc() is returned in RAX.)
mov qword ptr [rax], 1
; ... do whatever
; Clean up the stack space that we allocated at the top of the function.
add rsp, 40
If you wanted to do this in 32-bit halves, even on a 64-bit architecture, you certainly could. That would look like the following:
sub rsp, 40 ; set up stack
mov ecx, 8 ; request 8 bytes
call malloc ; allocate memory
mov dword ptr [eax], 1 ; write "1" into low 32 bits
mov dword ptr [eax+4], 2 ; write "2" into high 32 bits
; ... do whatever
add rsp, 40 ; clean up stack
Note that these last two MOV instructions are identical to what you wrote in the 32-bit version of the code. That makes sense, because you're doing exactly the same thing.
The reason the code you originally wrote didn't work is because EAX doesn't contain a QWORD PTR, it contains a DWORD PTR. Hence, the assembler generated the "invalid instruction operands" error, because there was a mismatch. This is the same reason that you don't offset by 8, because a DWORD PTR is only 4 bytes. A QWORD PTR is indeed 8 bytes, but you don't have one of those in EAX.
Or, if you wanted to write 16 bytes:
sub rsp, 40 ; set up stack
mov ecx, 16 ; request 16 bytes
call malloc ; allocate memory
mov qword ptr [rax], 1 ; write "1" into low 64 bits
mov qword ptr [rax+8], 2 ; write "2" into high 64 bits
; ... do whatever
add rsp, 40 ; clean up stack
Compare these three snippets of code, and make sure you understand the differences and why they need to be written as they are!

gfortran for dummies: What does mcmodel=medium do exactly?

I have some code that is giving me relocation errors when compiling, below is an example which illustrates the problem:
program main
common/baz/a,b,c
real a,b,c
b = 0.0
call foo()
print*, b
end
subroutine foo()
common/baz/a,b,c
real a,b,c
integer, parameter :: nx = 450
integer, parameter :: ny = 144
integer, parameter :: nz = 144
integer, parameter :: nf = 23*3
real :: bar(nf,nx*ny*nz)
!real, allocatable,dimension(:,:) :: bar
!allocate(bar(nf,nx*ny*nz))
bar = 1.0
b = bar(12,32*138*42)
return
end
Compiling this with gfortran -O3 -g -o test test.f, I get the following error:
relocation truncated to fit: R_X86_64_PC32 against symbol `baz_' defined in COMMON section in /tmp/ccIkj6tt.o
But it works if I use gfortran -O3 -mcmodel=medium -g -o test test.f. Also note that it works if I make the array allocatable and allocate it within the subroutine.
My question is what exactly does -mcmodel=medium do? I was under the impression that the two versions of the code (the one with allocatable arrays and the one without) were more or less equivalent ...

Since bar is quite large the compiler generates static allocation instead of automatic allocation on the stack. Static arrays are created with the .comm assembly directive which creates an allocation in the so-called COMMON section. Symbols from that section are gathered, same-named symbols are merged (reduced to one symbol request with size equal to the largest size requested) and then what is rest is mapped to the BSS (uninitialised data) section in most executable formats. With ELF executables the .bss section is located in the data segment, just before the data segment part of the heap (there is another heap part managed by anonymous memory mappings which does not reside in the data segment).
With the small memory model 32-bit addressing instructions are used to address symbols on x86_64. This makes code smaller and also faster. Some assembly output when using small memory model:
movl $bar.1535, %ebx <---- Instruction length saving
...
movl %eax, baz_+4(%rip) <---- Problem!!
...
.local bar.1535
.comm bar.1535,2575411200,32
...
.comm baz_,12,16
This uses a 32-bit move instruction (5 bytes long) to put the value of the bar.1535 symbol (this value equals to the address of the symbol location) into the lower 32 bits of the RBX register (the upper 32 bits get zeroed). The bar.1535 symbol itself is allocated using the .comm directive. Memory for the baz COMMON block is allocated afterwards. Because bar.1535 is very large, baz_ ends up more than 2 GiB from the start of the .bss section. This poses a problem in the second movl instruction since a non-32bit (signed) offset from RIP should be used to address the b variable where the value of EAX has to be moved into. This is only detected during link time. The assembler itself does not know the appropriate offset since it doesn't know what the value of the instruction pointer (RIP) would be (it depends on the absolute virtual address where the code is loaded and this is determined by the linker), so it simply puts an offset of 0 and then creates a relocation request of type R_X86_64_PC32. It instructs the linker to patch the value of 0 with the real offset value. But it cannot do that since the offset value would not fit inside a signed 32-bit integer and hence bails out.
With the medium memory model in place things look like this:
movabsq $bar.1535, %r10
...
movl %eax, baz_+4(%rip)
...
.local bar.1535
.largecomm bar.1535,2575411200,32
...
.comm baz_,12,16
First a 64-bit immediate move instruction (10 bytes long) is used to put the 64-bit value which represents the address of bar.1535 into register R10. Memory for the bar.1535 symbol is allocated using the .largecomm directive and thus it ends in the .lbss section of the ELF exectuable. .lbss is used to store symbols which might not fit in the first 2 GiB (and hence should not be addressed using 32-bit instructions or RIP-relative addressing), while smaller things go to .bss (baz_ is still allocated using .comm and not .largecomm). Since the .lbss section is placed after the .bss section in the ELF linker script, baz_ would not end up being inaccessible using 32-bit RIP-related addressing.
All addressing modes are described in the System V ABI: AMD64 Architecture Processor Supplement. It is a heavy technical reading but a must read for anybody who really wants to understand how 64-bit code works on most x86_64 Unixes.
When an ALLOCATABLE array is used instead, gfortran allocates heap memory (most likely implemented as an anonymous memory map given the large size of the allocation):
movl $2575411200, %edi
...
call malloc
movq %rax, %rdi
This is basically RDI = malloc(2575411200). From then on elements of bar are accessed by using positive offsets from the value stored in RDI:
movl 51190040(%rdi), %eax
movl %eax, baz_+4(%rip)
For locations that are more than 2 GiB from the start of bar, a more elaborate method is used. E.g. to implement b = bar(12,144*144*450) gfortran emits:
; Some computations that leave the offset in RAX
movl (%rdi,%rax), %eax
movl %eax, baz_+4(%rip)
This code is not affected by the memory model since nothing is assumed about the address where the dynamic allocation would be made. Also, since the array is not passed around, no descriptor is being built. If you add another function that takes an assumed-shaped array and pass bar to it, a descriptor for bar is created as an automatic variable (i.e. on the stack of foo). If the array is made static with the SAVE attribute, the descriptor is placed in the .bss section:
movl $bar.1580, %edi
...
; RAX still holds the address of the allocated memory as returned by malloc
; Computations, computations
movl -232(%rax,%rdx,4), %eax
movl %eax, baz_+4(%rip)
The first move prepares the argument of a function call (in my sample case call boo(bar) where boo has an interface that declares it as taking an assumed-shape array). It moves the address of the array descriptor of bar into EDI. This is a 32-bit immediate move so the descriptor is expected to be in the first 2 GiB. Indeed, it is allocated in the .bss in both small and medium memory models like this:
.local bar.1580
.comm bar.1580,72,32

No, large static arrays (as your bar) may exceed the limit if you do not use -mcmodel=medium. But allocatables are better of course. For allocatables only the array descriptor must fit into 2 GB, not the whole array.
From GCC reference:
-mcmodel=small
Generate code for the small code model: the program and its symbols must be linked in the lower 2 GB of the address space. Pointers are 64 bits. Programs can be statically or dynamically linked. This is the default code model.
-mcmodel=kernel
Generate code for the kernel code model. The kernel runs in the negative 2 GB of the address space. This model has to be used for Linux kernel code.
-mcmodel=medium
Generate code for the medium model: The program is linked in the lower 2 GB of the address space but symbols can be located anywhere in the address space. Programs can be statically or dynamically linked, but building of shared libraries are not supported with the medium model.
-mcmodel=large
Generate code for the large model: This model makes no assumptions about addresses and sizes of sections. Currently GCC does not implement this model.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio