How do I leave memory uninitialized in GNU ARM assembly? - gcc

I'm using GCC on my Raspberry Pi to compile some assembly code for a course I'm taking. It is my understanding from information in the GNU Assembler Reference that I can reproduce the following C code in GNU ARM Assembly:
int num = 0;
By writing this:
.data
num: .word 0
Great! Now how would I write this?
int num;
It is my understanding that leaving a variable uninitialized like this means I should treat it as containing whatever garbage value was in the memory location before. Therefore, I shouldn't use it before I've given it a value in some way.
But suppose for some reason I intended to store a huge amount of data in memory and needed to reserve a massive amount of space for it. It seems to me like it would be a massive waste of resources to initialize the entire area of memory to some value if I'm about to fill it with some data anyways. Yet from what I can find there seems to be no way to make a label in GCC ARM Assembly without initializing it to some value. According to my assembly textbook the .word directive can have zero expressions after it, but if used this way "then the address counter is not advanced and no bytes are reserved." My first though was to use the ".space" or ".skip" directives instead, but for this directive even the official documentation says that "if the comma and fill are omitted, fill is assumed to be zero."
Is there no way for me to reserve a chunk of memory without initializing it in GCC ARM Assembly?

Generally, data that you don't need to initialize should be placed in the .bss section.
.bss
foobar:
.skip 99999999
This will allocate 99999999 bytes in the .bss section, and label foobar will be its address. It won't make your object files or executable 99999999 bytes bigger; the executable header just indicates how many bytes of .bss are needed, and at load time, the system allocates an appropriate amount and initializes it to zero.
You can't skip the load-time zero initialization. The system needs to initialize it to something, because it might otherwise contain sensitive data from the kernel or some other process. But zeroing out memory is quite fast, and the kernel will use an efficient algorithm, so I wouldn't worry about the performance impact. It might even zero pages in its idle time, so that when your program loads, there is zeroed memory already available. Anyway, the time your program spends actually using the memory will swamp it.
This means that you can also safely use .bss for data that you do want to have initialized to zero (though not to any nonzero value; if you want int foo = 3; you'll have to put it in .data as in your original example.).

What happened when you tried it?
When I tried it:
int num = 0;
int mun;
With gnu I got
.cpu arm7tdmi
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 1
.eabi_attribute 30, 2
.eabi_attribute 34, 0
.eabi_attribute 18, 4
.file "so.c"
.text
.comm mun,4,4
.global num
.bss
.align 2
.type num, %object
.size num, 4
num:
.space 4
.ident "GCC: (GNU) 8.3.0"
.comm symbol , length
.comm declares a common symbol named symbol.
When linking, a common symbol in one object file may be merged with a
defined or common symbol of the same name in another object file. If
ld does not see a definition for the symbol--just one or more common
symbols--then it will allocate length bytes of uninitialized memory.
length must be an absolute expression. If ld sees multiple common
symbols with the same name, and they do not all have the same size, it
will allocate space using the largest size.
When using ELF, the .comm directive takes an optional third argument.
This is the desired alignment of the symbol, specified as a byte
boundary (for example, an alignment of 16 means that the least
significant 4 bits of the address should be zero). The alignment must
be an absolute expression, and it must be a power of two. If ld
allocates uninitialized memory for the common symbol, it will use the
alignment when placing the symbol. If no alignment is specified, as
will set the alignment to the largest power of two less than or equal
to the size of the symbol, up to a maximum of 16.
The syntax for .comm differs slightly on the HPPA. The syntax is
`symbol .comm, length'; symbol is optional.
Assembly language is defined by the assembler not the target. So the answer will be assembler (the tool that reads and assembles assembly language programs) specific and no reason to assume that the answer for one assembler is the same as another. The above is for the gnu assembler, gas.
You could have looked at the documentation you referenced or read other gnu documentation, but the easiest way to answer a "what happens when you do this in a compiled program" is to just compile it and look at the compiler output.
But don't necessarily assume that it isn't initialized:
unsigned int num;
unsigned int fun ( void )
{
return(num);
}
Just enough to link it:
Disassembly of section .text:
00001000 <fun>:
1000: e59f3004 ldr r3, [pc, #4] ; 100c <fun+0xc>
1004: e5930000 ldr r0, [r3]
1008: e12fff1e bx lr
100c: 00002000 andeq r2, r0, r0
Disassembly of section .bss:
00002000 <__bss_start>:
2000: 00000000
it ends up in bss initialized.
You really want uninitialized access to something then just pick an address (that you know isn't initialized (sram)) and access it:
ldr r0,=0x1234
ldr r0,[r0]

Related

What's the purpose of xchg ax,ax prior to the break instruction int 3 in DebugBreak()?

In MASM, I've always inserted a standalone break instruction
00007ff7`63141120 cc int 3
However, replacing that instruction with the MSVC DebugBreak function generates
KERNELBASE!DebugBreak:
00007ff8`6b159b90 6690 xchg ax,ax
00007ff8`6b159b92 cc int 3
00007ff8`6b159b93 c3 ret
I was surprised to see the xchg instruction prior to the break instruction
xchg ax,ax
As noted from another S.O. article:
Actually, xchg ax,ax is just how MS disassembles "66 90". 66 is the
operand size override, so it supposedly operates on ax instead of eax.
However, the CPU still executes it as a nop. The 66 prefix is used
here to make the instruction two bytes in size, usually for alignment
purposes.
MSVC, like most compilers, aligns functions to 16 byte boundaries.
Question What is the purpose of that xchg instruction?
MSVC generates 2 byte nop before any single-byte instruction at the beginning of a function (except ret in empty functions1). I've tried __halt, _enable, _disable intrinsics and seen the same effect.
Apparently it is for patching. /hotpatch option gives the same change for x86, and /hotpatch option is not recognized on x64. According to the /hotpatch documentation, it is expected behavior (emphasis mine):
Because instructions are always two bytes or larger on the ARM architecture, and because x64 compilation is always treated as if /hotpatch has been specified, you don't have to specify /hotpatch when you compile for these targets;
So hotpatching support is unconditional for x64, and its result is seen in DebugBreak implementation.
See here: https://godbolt.org/z/1G737cErf
See this post on why it is needed for hotpatching: Why do Windows functions all begin with a pointless MOV EDI, EDI instruction?. Looks like that currently hotpatching is smart enough to use any two bytes or more instruction, not just MOV EDI, EDI, still it cannot use single-byte instruction, as two-byte backward jump may be written at exact moment when the instruction pointer points at the second instruction.
1 As discussed in comments, empty functions have three-byte ret 0, although it is not apparent from MSVC assembly output, as it is represented there as just ret)

How to load address of function or label into register

I am trying to load the address of 'main' into a register (R10) in the GNU Assembler. I am unable to. Here I what I have and the error message I receive.
main:
lea main, %r10
I also tried the following syntax (this time using mov)
main:
movq $main, %r10
With both of the above I get the following error:
/usr/bin/ld: /tmp/ccxZ8pWr.o: relocation R_X86_64_32S against symbol `main' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status
Compiling with -fPIC does not resolve the issue and just gives me the same exact error.
In x86-64, most immediates and displacements are still 32-bits because 64-bit would waste too much code size (I-cache footprint and fetch/decode bandwidth).
lea main, %reg is an absolute disp32 addressing mode which would stop load-time address randomization (ASLR) from choosing a random 64-bit (or 47-bit) address. So it's not supported on Linux except in position-dependent executables, or at all on MacOS where static code/data are always loaded outside the low 32 bits. (See the x86 tag wiki for links to docs and guides.) On Windows, you can build executables as "large address aware" or not. If you choose not, addresses will fit in 32 bits.
The standard efficient way to put a static address into a register is a RIP-relative LEA:
# RIP-relative LEA always works. Syntax for various assemblers:
lea main(%rip), %r10 # AT&T syntax
lea r10, [rip+main] # GAS .intel_syntax noprefix equivalent
lea r10, [rel main] ; NASM equivalent, or use default rel
lea r10, [main] ; FASM defaults to RIP-relative. MASM may also
See How do RIP-relative variable references like "[RIP + _a]" in x86-64 GAS Intel-syntax work? for an explanation of the 3 syntaxes, and Why are global variables in x86-64 accessed relative to the instruction pointer? (and this) for reasons why RIP-relative is the standard way to address static data.
This uses a 32-bit relative displacement from the end of the current instruction, like jmp/call. This can reach any static data in .data, .bss, .rodata, or function in .text, assuming the usual 2GiB total size limit for static code+data.
In position dependent code (built with gcc -fno-pie -no-pie for example) on Linux, you can take advantage of 32-bit absolute addressing to save code size. Also, mov r32, imm32 has slightly better throughput than RIP-relative LEA on Intel/AMD CPUs, so out-of-order execution may be able to overlap it better with the surrounding code. (Optimizing for code-size is usually less important than most other things, but when all else is equal pick the shorter instruction. In this case all else is at least equal, or also better with mov imm32.)
See 32-bit absolute addresses no longer allowed in x86-64 Linux? for more about how PIE executables are the default. (Which is why you got a link error about -fPIC with your use of a 32-bit absolute.)
# in a non-PIE executable, mov imm32 into a 32-bit register is even better
# same as you'd use in 32-bit code
## GAS AT&T syntax
mov $main, %r10d # 6 bytes
mov $main, %edi # 5 bytes: no REX prefix needed for a "legacy" register
## GAS .intel_syntax
mov edi, OFFSET main
;; mov edi, main ; NASM and FASM syntax
Note that writing any 32-bit register always zero-extends into the full 64-bit register (R10 and RDI).
lea main, %edi or lea main, %rdi would also work in a Linux non-PIE executable, but never use LEA with a [disp32] absolute addressing mode (even in 32-bit code where that doesn't require a SIB byte); mov is always at least as good.
The operand-size suffix is redundant when you have a register operand that uniquely determines it; I prefer to just write mov instead of movl or movq.
The stupid/bad way is a 10-byte 64-bit absolute address as an immediate:
# Inefficient, DON'T USE
movabs $main, %r10 # 10 bytes including the 64-bit absolute address
This is what you get in NASM if you use mov rdi, main instead of mov edi, main so many people end up doing this. Linux dynamic linking does actually support runtime fixups for 64-bit absolute addresses. But the use-case for that is for jump tables, not for absolute addresses as immediates.
movq $sign_extended_imm32, %reg (7 bytes) still uses a 32-bit absolute address, but wastes code bytes on a sign-extended mov to a 64-bit register, instead of implicit zero-extension to 64-bit from writing a 32-bit register.
By using movq, you're telling GAS you want a R_X86_64_32S relocation instead of a R_X86_64_64 64-bit absolute relocation.
The only reason you'd ever want this encoding is for kernel code where static addresses are in the upper 2GiB of 64-bit virtual address space, instead of the lower 2GiB. mov has slight performance advantages over lea on some CPUs (e.g. running on more ports), but normally if you can use a 32-bit absolute it's in the low 2GiB of virtual address space where a mov r32, imm32 works.
(Related: Difference between movq and movabsq in x86-64)
PS: I intentionally left out any discussion of "large" or "huge" memory / code models, where RIP-relative +-2GiB addressing can't reach static data, or maybe can't even reach other code addresses. The above is for x86-64 System V ABI's "small" and/or "small-PIC" code models. You may need movabs $imm64 for medium and large models, but that's very rare.
I don't know if mov $imm32, %r32 works in Windows x64 executables or DLLs with runtime fixups, but RIP-relative LEA certainly does.
Semi-related: Call an absolute pointer in x86 machine code - if you're JITing, try to put the JIT buffer near existing code so you can call rel32, otherwise movabs a pointer into a register.

How to make gcc emit multibyte NOPs for -fpatchable-function-entry?

gcc does have the ability to use multi-byte NOPs for aligning loops and functions. However when I tried the -fpatchable-function-entry option it always emits single-byte NOPs
You can see in this example that gcc aligns the function with nop DWORD PTR [rax+rax*1+0x0] and nop WORD PTR cs:[rax+rax*1+0x0] but uses eight 0x90 NOPs at the function entry when I specify -fpatchable-function-entry=8,3
I saw this in the document
-fpatchable-function-entry=N[,M]
Generate N NOPs right at the beginning of each function, with the function entry point before the Mth NOP. If M is omitted, it defaults to 0 so the function entry points to the address just at the first NOP. The NOP instructions reserve extra space which can be used to patch in any desired instrumentation at run time, provided that the code segment is writable. The amount of space is controllable indirectly via the number of NOPs; the NOP instruction used corresponds to the instruction emitted by the internal GCC back-end interface gen_nop. This behavior is target-specific and may also depend on the architecture variant and/or other compilation options.
It clearly said that N NOPs will be inserted. However I think this should be an N-byte NOP (or whatever optimal number of NOPs to fill the N-byte space). Similarly if M is specified you need to emit an M-byte and an (N − M)-byte NOP
So why does gcc do this? Can we make it generate multi-byte NOPs? And are two 0x90 NOPs better than Microsoft's mov edi, edi?

Unnecessary instructions generated for _mm_movemask_epi8 intrinsic in x64 mode

The intrinsic function _mm_movemask_epi8 from SSE2 is defined by Intel with the following prototype:
int _mm_movemask_epi8 (__m128i a);
This intrinsic function directly corresponds to the pmovmskb instruction, which is generated by all compilers.
According to this reference, the pmovmskb instruction can write the resulting integer mask to either a 32-bit or a 64-bit general purpose register in x64 mode. In any case, only 16 lower bits of the result can be nonzero, i.e. the result is surely within range [0; 65535].
Speaking of the intrinsic function _mm_movemask_epi8, its returned value is of type int, which as a signed integer of 32-bit size on most platforms. Unfortunately, there is no alternative function which returns a 64-bit integer in x64 mode. As a result:
Compiler usually generates pmovmskb instruction with 32-bit destination register (e.g. eax).
Compiler cannot assume that upper 32 bits of the whole register (e.g. rax) are zero.
Compiler inserts unnecessary instruction (e.g. mov eax, eax) to zero the upper half of 64-bit register, given that the register is later used as 64-bit value (e.g. as an index of array).
An example of code and generated assembly with such a problem can be seen in this answer. Also the comments to that answer contain some related discussion. I regularly experience this problem with MSVC2013 compiler, but it seems that it is also present on GCC.
The questions are:
Why is this happening?
Is there any way to reliably avoid generation of unnecessary instructions on popular compilers? In particular, when result is used as index, i.e. in x = array[_mm_movemask_epi8(xmmValue)];
What is the approximate cost of unnecessary instructions like mov eax, eax on modern CPU architectures? Is there any chance that these instructions are completely eliminated by CPU internally and they do not actually occupy time of execution units (Agner Fog's instruction tables document mentions such a possibility).
Why is this happening?
gcc's internal instruction definitions that tells it what pmovmskb does must be failing to inform it that the upper 32-bits of rax will always be zero. My guess is that it's treated like a function call return value, where the ABI allows a function returning a 32bit int to leave garbage in the upper 32bits of rax.
GCC does know about 32-bit operations in general zero-extending for free, but this missed optimization is widespread for intrinsics, also affecting scalar intrinsics like _mm_popcnt_u32.
There's also the issue of gcc (not) knowing that the actual result has set bits only in the low 16 of its 32-bit int result (unless you used AVX2 vpmovmskb ymm). So actual sign extension is unnecessary; implicit zero extension is totally fine.
Is there any way to reliably avoid generation of unnecessary instructions on popular compilers? In particular, when result is used as index, i.e. in x = array[_mm_movemask_epi8(xmmValue)];
No, other than fixing gcc. Has anyone reported this as a compiler missed-optimization bug?
clang doesn't have this bug. I added code to Paul R's test to actually use the result as an array index, and clang is still fine.
gcc always either zero or sign extends (to a different register in this case, perhaps because it wants to "keep" the 32-bit value in the bottom of RAX, not because it's optimizing for mov-elimination.
Casting to unsigned helps with GCC6 and later; it will use the pmovmskb result directly as part of an addressing mode, but also returning it results in a mov rax, rdx.
And with older GCC, at least gets it to use mov instead of movsxd or cdqe.
What is the approximate cost of unnecessary instructions like mov eax, eax on modern CPU architectures? Is there any chance that these instructions are completely eliminated by CPU internally and they do not actually occupy time of execution units (Agner Fog's instruction tables document mentions such a possibility).
mov same,same is never eliminated on SnB-family microarchitectures or AMD zen. mov ecx, eax would be eliminated. See Can x86's MOV really be "free"? Why can't I reproduce this at all? for details.
Even if it doesn't take an execution unit, it still takes a slot in the fused-domain part of the pipeline, and a slot in the uop-cache. And code-size. If you're close to the front-end 4 fused-domain uops per clock limit (pipeline width), then it's a problem.
It also costs an extra 1c of latency in the dep chain.
(Back-end throughput is not a problem, though. On Haswell and newer, it can run on port6 which has no vector execution units. On AMD, the integer ports are separate from the vector ports.)
gcc.godbolt.org is a great online resource for testing this kind of issue with different compilers.
clang seems to do the best with this, e.g.
#include <xmmintrin.h>
#include <cstdint>
int32_t test32(const __m128i v) {
int32_t mask = _mm_movemask_epi8(v);
return mask;
}
int64_t test64(const __m128i v) {
int64_t mask = _mm_movemask_epi8(v);
return mask;
}
generates:
test32(long long __vector(2)): # #test32(long long __vector(2))
vpmovmskb eax, xmm0
ret
test64(long long __vector(2)): # #test64(long long __vector(2))
vpmovmskb eax, xmm0
ret
Whereas gcc generates an extra cdqe instruction in the 64-bit case:
test32(long long __vector(2)):
vpmovmskb eax, xmm0
ret
test64(long long __vector(2)):
vpmovmskb eax, xmm0
cdqe
ret

gfortran for dummies: What does mcmodel=medium do exactly?

I have some code that is giving me relocation errors when compiling, below is an example which illustrates the problem:
program main
common/baz/a,b,c
real a,b,c
b = 0.0
call foo()
print*, b
end
subroutine foo()
common/baz/a,b,c
real a,b,c
integer, parameter :: nx = 450
integer, parameter :: ny = 144
integer, parameter :: nz = 144
integer, parameter :: nf = 23*3
real :: bar(nf,nx*ny*nz)
!real, allocatable,dimension(:,:) :: bar
!allocate(bar(nf,nx*ny*nz))
bar = 1.0
b = bar(12,32*138*42)
return
end
Compiling this with gfortran -O3 -g -o test test.f, I get the following error:
relocation truncated to fit: R_X86_64_PC32 against symbol `baz_' defined in COMMON section in /tmp/ccIkj6tt.o
But it works if I use gfortran -O3 -mcmodel=medium -g -o test test.f. Also note that it works if I make the array allocatable and allocate it within the subroutine.
My question is what exactly does -mcmodel=medium do? I was under the impression that the two versions of the code (the one with allocatable arrays and the one without) were more or less equivalent ...
Since bar is quite large the compiler generates static allocation instead of automatic allocation on the stack. Static arrays are created with the .comm assembly directive which creates an allocation in the so-called COMMON section. Symbols from that section are gathered, same-named symbols are merged (reduced to one symbol request with size equal to the largest size requested) and then what is rest is mapped to the BSS (uninitialised data) section in most executable formats. With ELF executables the .bss section is located in the data segment, just before the data segment part of the heap (there is another heap part managed by anonymous memory mappings which does not reside in the data segment).
With the small memory model 32-bit addressing instructions are used to address symbols on x86_64. This makes code smaller and also faster. Some assembly output when using small memory model:
movl $bar.1535, %ebx <---- Instruction length saving
...
movl %eax, baz_+4(%rip) <---- Problem!!
...
.local bar.1535
.comm bar.1535,2575411200,32
...
.comm baz_,12,16
This uses a 32-bit move instruction (5 bytes long) to put the value of the bar.1535 symbol (this value equals to the address of the symbol location) into the lower 32 bits of the RBX register (the upper 32 bits get zeroed). The bar.1535 symbol itself is allocated using the .comm directive. Memory for the baz COMMON block is allocated afterwards. Because bar.1535 is very large, baz_ ends up more than 2 GiB from the start of the .bss section. This poses a problem in the second movl instruction since a non-32bit (signed) offset from RIP should be used to address the b variable where the value of EAX has to be moved into. This is only detected during link time. The assembler itself does not know the appropriate offset since it doesn't know what the value of the instruction pointer (RIP) would be (it depends on the absolute virtual address where the code is loaded and this is determined by the linker), so it simply puts an offset of 0 and then creates a relocation request of type R_X86_64_PC32. It instructs the linker to patch the value of 0 with the real offset value. But it cannot do that since the offset value would not fit inside a signed 32-bit integer and hence bails out.
With the medium memory model in place things look like this:
movabsq $bar.1535, %r10
...
movl %eax, baz_+4(%rip)
...
.local bar.1535
.largecomm bar.1535,2575411200,32
...
.comm baz_,12,16
First a 64-bit immediate move instruction (10 bytes long) is used to put the 64-bit value which represents the address of bar.1535 into register R10. Memory for the bar.1535 symbol is allocated using the .largecomm directive and thus it ends in the .lbss section of the ELF exectuable. .lbss is used to store symbols which might not fit in the first 2 GiB (and hence should not be addressed using 32-bit instructions or RIP-relative addressing), while smaller things go to .bss (baz_ is still allocated using .comm and not .largecomm). Since the .lbss section is placed after the .bss section in the ELF linker script, baz_ would not end up being inaccessible using 32-bit RIP-related addressing.
All addressing modes are described in the System V ABI: AMD64 Architecture Processor Supplement. It is a heavy technical reading but a must read for anybody who really wants to understand how 64-bit code works on most x86_64 Unixes.
When an ALLOCATABLE array is used instead, gfortran allocates heap memory (most likely implemented as an anonymous memory map given the large size of the allocation):
movl $2575411200, %edi
...
call malloc
movq %rax, %rdi
This is basically RDI = malloc(2575411200). From then on elements of bar are accessed by using positive offsets from the value stored in RDI:
movl 51190040(%rdi), %eax
movl %eax, baz_+4(%rip)
For locations that are more than 2 GiB from the start of bar, a more elaborate method is used. E.g. to implement b = bar(12,144*144*450) gfortran emits:
; Some computations that leave the offset in RAX
movl (%rdi,%rax), %eax
movl %eax, baz_+4(%rip)
This code is not affected by the memory model since nothing is assumed about the address where the dynamic allocation would be made. Also, since the array is not passed around, no descriptor is being built. If you add another function that takes an assumed-shaped array and pass bar to it, a descriptor for bar is created as an automatic variable (i.e. on the stack of foo). If the array is made static with the SAVE attribute, the descriptor is placed in the .bss section:
movl $bar.1580, %edi
...
; RAX still holds the address of the allocated memory as returned by malloc
; Computations, computations
movl -232(%rax,%rdx,4), %eax
movl %eax, baz_+4(%rip)
The first move prepares the argument of a function call (in my sample case call boo(bar) where boo has an interface that declares it as taking an assumed-shape array). It moves the address of the array descriptor of bar into EDI. This is a 32-bit immediate move so the descriptor is expected to be in the first 2 GiB. Indeed, it is allocated in the .bss in both small and medium memory models like this:
.local bar.1580
.comm bar.1580,72,32
No, large static arrays (as your bar) may exceed the limit if you do not use -mcmodel=medium. But allocatables are better of course. For allocatables only the array descriptor must fit into 2 GB, not the whole array.
From GCC reference:
-mcmodel=small
Generate code for the small code model: the program and its symbols must be linked in the lower 2 GB of the address space. Pointers are 64 bits. Programs can be statically or dynamically linked. This is the default code model.
-mcmodel=kernel
Generate code for the kernel code model. The kernel runs in the negative 2 GB of the address space. This model has to be used for Linux kernel code.
-mcmodel=medium
Generate code for the medium model: The program is linked in the lower 2 GB of the address space but symbols can be located anywhere in the address space. Programs can be statically or dynamically linked, but building of shared libraries are not supported with the medium model.
-mcmodel=large
Generate code for the large model: This model makes no assumptions about addresses and sizes of sections. Currently GCC does not implement this model.

Resources