GCC compiled assembly - gcc

I am trying to learn assembly language by example, by compiling simple C files with GCC using the -S option, Intel syntax, and CFI calls disabled (every other free way is extremely confusing).
My C file is literally just int main() {return 0;}, but GCC spits out this:
.file "simpleCTest.c"
.intel_syntax noprefix
.def ___main; .scl 2; .type 32; .endef
.text
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
push ebp
mov ebp, esp
and esp, -16
call ___main
mov eax, 0
leave
ret
.ident "GCC: (GNU) 5.3.0"
My real question is why does the main function have any processor instructions (push ebp, mov ebp, esp, etc.)? Are these even necessary (I guess it would be a way of data management to prepare/shut down programs, but I'm not sure)? Why doesn't it just issue a ret statement after the main function? Also why are there TWO main functions (_main & ___main)?
To sum it up, why is it not just like this?
.def _main
_main:
mov eax, 0 ;(for return integer)
ret

GCC spits out this
This would probably be a bit clearer if you actually had your main function do some things, oddly enough, including calling another function.
Your compiled code sets up a stack frame with its first two instructions, push ebp and mov ebp, esp: the caller's base pointer is saved, and ebp then holds a copy of the stack pointer. This would be used if you had local variables, which could then be referred to as ebp plus a constant. Then it aligns the stack to a multiple of 16 bytes with the and instruction; that is, it skips between 0 and 15 bytes of the provided stack so that esp is a multiple of 16. This matters because of the calling convention in use.
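The alignment step can be sketched in C. This is only an illustration of the arithmetic, not compiler output; the function name align_down_16 and the sample stack-pointer value are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Model of `and esp, -16`: clear the low four bits of a 32-bit stack
   pointer, which rounds it DOWN to a multiple of 16.
   (align_down_16 is a hypothetical name used only in this sketch.) */
uint32_t align_down_16(uint32_t esp)
{
    uint32_t aligned = esp & (uint32_t)-16;   /* -16 is 0xFFFFFFF0 */
    assert(aligned % 16 == 0);                /* result is a multiple of 16 */
    assert(esp - aligned <= 15);              /* at most 15 bytes are skipped */
    return aligned;
}
```

For instance, align_down_16(0x0028FF3C) gives 0x0028FF30, so 12 bytes of the provided stack go unused.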
The final instruction before ret, leave, copies the base pointer (which holds the stack pointer value saved at function entry) back into the stack pointer, and then restores the caller's base pointer with a pop.
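To see that leave undoes the prologue, here is a toy model in C. It is a sketch under simplifying assumptions: the struct cpu type, the tiny 256-byte memory, and the helper names are all invented for illustration and have nothing to do with how GCC works internally.

```c
#include <assert.h>
#include <stdint.h>

/* Toy 32-bit machine: two registers and 256 bytes of stack memory,
   stored as 64 words. Every name here is hypothetical. */
struct cpu {
    uint32_t esp, ebp;
    uint32_t mem[64];
};

void push32(struct cpu *c, uint32_t value)
{
    assert(c->esp >= 4 && c->esp <= sizeof c->mem && c->esp % 4 == 0);
    c->esp -= 4;
    c->mem[c->esp / 4] = value;
}

uint32_t pop32(struct cpu *c)
{
    assert(c->esp < sizeof c->mem && c->esp % 4 == 0);
    uint32_t value = c->mem[c->esp / 4];
    c->esp += 4;
    return value;
}

/* push ebp ; mov ebp, esp ; and esp, -16 */
void prologue(struct cpu *c)
{
    push32(c, c->ebp);
    c->ebp = c->esp;
    c->esp &= (uint32_t)-16;
}

/* leave  ==  mov esp, ebp ; pop ebp */
void leave(struct cpu *c)
{
    c->esp = c->ebp;
    c->ebp = pop32(c);
}
```

Starting from esp = 0xEC and ebp = 0xF8, prologue leaves esp = 0xE0 and ebp = 0xE8, and leave puts both registers back to their starting values. Because leave reloads esp from ebp, it does not matter how many bytes the alignment skipped.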
My real question is why does the main function have any processor instructions
It's setting stuff up for things that you aren't doing (but that nontrivial programs would do), and is not making the most optimized "return 0" program that it could. By having a base pointer that is mostly a backup of the original stack pointer, the program is free to refer to local variables as an offset plus the base pointer (including implied stuff you aren't using like the argument count, the pointer to the pointers to argument listing, and the pointer to the environment), and by having a stack pointer that is a multiple of 16, the program is free to make calls to functions according to its calling standard.

Related

Why does generated assembly mov edi to variable on stack?

I am a newcomer to assembly trying to understand the objdump of the following function:
int nothing(int num) {
return num;
}
This is the result (linux, x86-64, gcc 8):
push rbp
mov rbp,rsp
mov DWORD PTR [rbp-0x4],edi
mov eax,DWORD PTR [rbp-0x4]
pop rbp
ret
My questions are:
1. Where does edi come from? Reading through some intro docs, I was under the impression that [rbp-0x4] would contain num.
2. From the above, apparently edi contains the argument. But then what role does [rbp-0x4] play? Why not just mov eax, edi?
Thanks!
Where does edi come from?
... From the above, apparently edi contains the argument.
This is the calling convention (for Linux and many other OSs):
Code for these OSs follows the x86-64 System V calling convention, which passes the first integer or pointer parameter in rdi. The result (value returned) is passed in rax.
And because your C compiler interprets int as 32 bits, only the low 32 bits of rdi and rax are used - which is edi and eax.
Programming languages for Windows pass the first parameter in rcx...
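The relationship between rdi and edi can be modelled in C. This is a rough sketch, not what the compiler does internally; edi_of, nothing_model and demo_edi are made-up names, and the register is represented as a plain 64-bit integer.

```c
#include <assert.h>
#include <stdint.h>

/* edi names the low 32 bits of rdi, so a 32-bit int argument is just
   the truncated register value. (Hypothetical helper.) */
uint32_t edi_of(uint64_t rdi)
{
    return (uint32_t)rdi;
}

/* Model of the unoptimized nothing(): the value arrives in edi, is parked
   in a 4-byte stack slot, reloaded, and returned in eax. */
int32_t nothing_model(uint64_t rdi)
{
    int32_t slot = (int32_t)edi_of(rdi);   /* mov DWORD PTR [rbp-0x4], edi */
    int32_t eax = slot;                    /* mov eax, DWORD PTR [rbp-0x4] */
    return eax;
}

/* The function only reads edi, so the upper half of rdi has no effect. */
void demo_edi(void)
{
    assert(edi_of(0xAAAAAAAA00000007ULL) == 7);
    assert(nothing_model(0xAAAAAAAA00000007ULL) == 7);
}
```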
But then what role does [rbp-0x4] play?
Using rbp has mainly historic reasons here. In 16-bit code (as it was used in 1980s and 1990s PCs) it was not possible to address data on the stack using the sp register (which corresponds to rsp). The only register that allowed addressing values on the stack easily was the bp register (corresponding to rbp).
And even in 32- or 64-bit code it is more difficult to write a compiler that addresses local variables (on the stack) using rsp rather than using rbp.
The compiler generates the first 3 instructions of assembler code before it knows what is done in the C function. The compiler puts the value on the stack because you could do something like address = &num in the code. This is however not possible when num is in a register but only when num is located in the memory.
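A variant of the function that really does need num in memory might look like the following. It is only an illustration (nothing_via_pointer is a made-up name); an optimizing build may still see through the pointer and keep everything in registers.

```c
#include <assert.h>

/* Same result as nothing(), but the address of the parameter is taken.
   A register has no address, so in an unoptimized build num must live
   in a stack slot such as [rbp-0x4] for &num to mean anything. */
int nothing_via_pointer(int num)
{
    int *address = &num;
    return *address;
}

void demo_nothing_via_pointer(void)
{
    assert(nothing_via_pointer(42) == 42);
}
```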
Why not just mov eax, edi?
If you tell the compiler to optimize the code, it will first check the content of the C function before generating the first assembler instruction. It will find out that it is not required to put the value into the memory.
In this case the code will indeed look like this:
mov eax, edi
ret

Translating Go assembler to NASM

I came across the following Go code:
type Element [12]uint64
//go:noescape
func CSwap(x, y *Element, choice uint8)
//go:noescape
func Add(z, x, y *Element)
where the CSwap and Add functions are basically coming from an assembly, and look like the following:
TEXT ·CSwap(SB), NOSPLIT, $0-17
MOVQ x+0(FP), REG_P1
MOVQ y+8(FP), REG_P2
MOVB choice+16(FP), AL // AL = 0 or 1
MOVBLZX AL, AX // AX = 0 or 1
NEGQ AX // RAX = 0x00..00 or 0xff..ff
MOVQ (0*8)(REG_P1), BX
MOVQ (0*8)(REG_P2), CX
// Rest removed for brevity
TEXT ·Add(SB), NOSPLIT, $0-24
MOVQ z+0(FP), REG_P3
MOVQ x+8(FP), REG_P1
MOVQ y+16(FP), REG_P2
MOVQ (REG_P1), R8
MOVQ (8)(REG_P1), R9
MOVQ (16)(REG_P1), R10
MOVQ (24)(REG_P1), R11
// Rest removed for brevity
What I am trying to do is translate the assembly into a syntax that is more familiar to me (I think mine is more like NASM), while the above syntax is Go assembler. I didn't have much trouble with the Add function and translated it correctly (according to test results). It looks like this in my case:
.text
.global add_asm
add_asm:
push r12
push r13
push r14
push r15
mov r8, [reg_p1]
mov r9, [reg_p1+8]
mov r10, [reg_p1+16]
mov r11, [reg_p1+24]
// Rest removed for brevity
But, I have a problem when translating the CSwap function, I have something like this:
.text
.global cswap_asm
cswap_asm:
push r12
push r13
push r14
mov al, 16
mov rax, al
neg rax
mov rbx, [reg_p1+(0*8)]
mov rcx, [reg_p2+(0*8)]
But this doesn't seem to be quite correct, as I get an error when compiling it. Any ideas how to translate the above CSwap assembly part to something like NASM?
EDIT (SOLUTION):
Okay, after the two answers below, and some testing and digging, I found out that the code uses the following three registers for parameter passing:
#define reg_p1 rdi
#define reg_p2 rsi
#define reg_p3 rdx
Accordingly, rdx has the value of the choice parameter. So, all that I had to do was use this:
movzx rax, dl // Get the lower 8 bits of rdx (reg_p3)
neg rax
Using byte [rdx] or byte [reg_p3] was giving an error, but using dl seems to work fine for me.
Basic docs about Go's asm: https://golang.org/doc/asm. It's not totally equivalent to NASM or AT&T syntax: FP is a pseudo-register name for whichever register it decides to use as the frame pointer. (Typically RSP or RBP). Go asm also seems to omit function prologue (and probably epilogue) instructions. As @RossRidge comments, it's a bit more like an internal representation like LLVM IR than truly asm.
Go also has its own object-file format, so I'm not sure you can make Go-compatible object files with NASM.
If you want to call this function from something other than Go, you'll also need to port the code to a different calling convention. Go appears to be using a stack-args calling convention even for x86-64, unlike the normal x86-64 System V ABI or the x86-64 Windows calling convention. (Or maybe those mov function args into REG_P1 and so on instructions disappear when Go builds this source for a register-arg calling convention?)
(This is why you had to use movzx eax, dl instead of loading from the stack at all.)
BTW, rewriting this code in C instead of NASM would probably make even more sense if you want to use it with C. Small functions are best inlined and optimized away by the compiler.
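A C version could look roughly like this. Note the assumption: the body of CSwap is cut off in the question, so the masked-XOR loop below is a guess at what a conditional swap like this usually does, not a translation of the real code. cswap_c and the element typedef are names invented here.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t element[12];   /* mirrors Go's `type Element [12]uint64` */

/* Swap x and y when choice == 1, leave them untouched when choice == 0,
   without branching on choice.
   ASSUMPTION: this loop is a guess at what the elided assembly does. */
void cswap_c(element x, element y, uint8_t choice)
{
    assert(choice <= 1);
    /* Same idea as movzx + neg: 0 stays 0, 1 becomes all ones. */
    uint64_t mask = (uint64_t)0 - (uint64_t)choice;
    for (int i = 0; i < 12; i++) {
        uint64_t diff = (x[i] ^ y[i]) & mask;
        x[i] ^= diff;
        y[i] ^= diff;
    }
}
```

If the original is meant to run in constant time, check the compiler's output: C gives no guarantee that the generated code stays branch-free.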
It would be a good idea to check your translation, or get a starting point, by assembling with the Go assembler and using a disassembler.
objdump -drwC -Mintel or Agner Fog's objconv disassembler would be good, but they don't understand Go's object-file format. If Go has a tool to extract the actual machine code or get it in an ELF object file, do that.
If not, you could use ndisasm -b 64 (which treats input files as flat binaries, disassembling all the bytes as if they were instructions). You can specify an offset/length if you can find out where the function starts. x86 instructions are variable length, and disassembly will likely be "out of sync" at the start of the function. You might want to add a bunch of single-byte NOP instructions (kind of a NOP sled) for the disassembler, so if it decodes some 0x90 bytes as part of an immediate or disp32 for a long instruction that was really not part of the function, it will be in sync. (But the function prologue will still be messed up).
You might add some "signpost" instructions to your Go asm functions to make it easy to find the right place in the mess of crazy asm from disassembling metadata as instructions. e.g. put a pmuludq xmm0, xmm0 in there somewhere, or some other instruction with a unique mnemonic that you can search for which the Go code doesn't include. Or an instruction with an immediate that will stand out, like addq $0x1234567, SP. (An instruction that will crash so you don't forget to take it out again is good here.)
Or you could use gdb's built-in disassembler: add an instruction that will segfault (like a load from a bogus absolute address (movl 0, AX null-pointer deref), or a register holding a non-pointer value e.g. movl (AX), AX). Then you'll have an instruction-pointer value for the instructions in memory, and can disassemble from some point behind that. (Probably the function start will be 16-byte aligned.)
Specific instructions.
MOVBLZX AL, AX reads AL, so that's definitely an 8-bit operand. The size for AX is given by the L part of the mnemonic, meaning long for 32 bit, like in GAS AT&T syntax. (The gas mnemonic for that form of movzx is movzbl %al, %eax). See What does cltq do in assembly? for a table of cdq / cdqe and the AT&T equivalent, and the AT&T / Intel mnemonic for the equivalent MOVSX instruction.
The NASM instruction you want is movzx eax, al. Using rax as the destination would be a waste of a REX prefix. Using ax as the destination would be a mistake: it wouldn't zero-extend into the full register, and would leave whatever high garbage. Go asm syntax for x86 is very confusing when you're not used to it, because AX can mean AX, EAX, or RAX depending on the operand size.
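The difference between the destination sizes can be modelled in C, treating rax as a 64-bit integer. This is just a sketch of the register semantics; the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* movzx eax, al: the result depends only on al. Writing eax also
   zeroes bits 32..63 of rax. */
uint64_t movzx_eax_al(uint64_t rax)
{
    return rax & 0xFF;
}

/* movzx ax, al: only the low 16 bits are replaced; bits 16..63 keep
   whatever was there before. */
uint64_t movzx_ax_al(uint64_t rax)
{
    return (rax & ~(uint64_t)0xFFFF) | (rax & 0xFF);
}

/* neg rax */
uint64_t neg64(uint64_t rax)
{
    return (uint64_t)0 - rax;
}

void demo_movzx(void)
{
    uint64_t garbage = 0xDEADBEEFCAFE1201ULL;               /* al == 0x01 */
    assert(movzx_eax_al(garbage) == 1);
    assert(neg64(movzx_eax_al(garbage)) == UINT64_MAX);     /* clean all-ones mask */
    assert(movzx_ax_al(garbage) == 0xDEADBEEFCAFE0001ULL);  /* high garbage survives */
}
```

With the 16-bit destination, the following neg would produce a meaningless value instead of a 0 / all-ones mask.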
Obviously mov rax, al isn't a possibility: Like most instructions, mov requires both its operands to be the same size. movzx is one of the rare exceptions.
MOVB choice+16(FP), AL is a byte load into AL, not an immediate move. choice+16 is an offset from FP. This syntax is basically the same as AT&T addressing modes, with FP as a register and choice as an assemble-time constant.
FP is a pseudo-register name. It's pretty clear that it should simply be loading the low byte of the 3rd arg-passing slot, because choice is the name of a function arg. (In Go asm, choice is just syntactic sugar, or a constant defined as zero.)
Before a call instruction, rsp points at the first stack arg, so that + 16 is the 3rd arg. It appears that FP is that base address (and might actually be rsp+8 or something). After a call (which pushes an 8 byte return address), the 3rd stack arg is at rsp + 24. After more pushes, the offset will be even larger, so adjust as necessary to reach the right location.
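The offset bookkeeping is simple enough to write down as a helper. This is only arithmetic for illustration (stack_arg_offset is a made-up name), and it assumes every push is 8 bytes, as in 64-bit mode.

```c
#include <assert.h>
#include <stddef.h>

/* Offset from the current rsp of a stack argument that was at
   [rsp + offset_before_call] just before the call instruction,
   after the callee has executed `pushes` 8-byte pushes. */
size_t stack_arg_offset(size_t offset_before_call, size_t pushes)
{
    const size_t return_address = 8;   /* pushed by call */
    return offset_before_call + return_address + 8 * pushes;
}

void demo_stack_arg_offset(void)
{
    assert(stack_arg_offset(16, 0) == 24);   /* right after call */
    assert(stack_arg_offset(16, 1) == 32);   /* after push rbp */
    assert(stack_arg_offset(16, 3) == 48);   /* after three pushes */
}
```

The three pushes at the top of the question's cswap_asm would put the 3rd stack arg at rsp + 48, if the arguments were on the stack at all.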
If you're porting this function to be called with a standard calling convention, the 3 integer args will be passed in registers, with no stack args. Which 3 registers depends on whether you're building for Windows vs. non-Windows. (See Agner Fog's calling conventions doc: http://agner.org/optimize/)
BTW, a byte load into AL and then movzx eax, al is just dumb. Much more efficient on all modern CPUs to do it in one step with
movzx eax, byte [rsp + 24] ; or rbp+32 if you made a stack frame.
I hope the source in the question is from un-optimized Go compiler output? Or the assembler itself makes such optimizations?
I think you can translate these as just
mov rbx, [reg_p1]
mov rcx, [reg_p2]
Unless I'm missing some subtlety, the offsets which are zero can just be ignored. The *8 isn't a size hint since that's already in the instruction.
The rest of your code looks wrong though. The MOVB choice+16(FP), AL in the original is supposed to be fetching the choice argument into AL, but you're setting AL to a constant 16, and the code for loading the other arguments seems to be completely missing, as is the code for all of the arguments in the other function.

gcc, __atomic_exchange seems to produce non-atomic asm, why?

I am working on a nice tool, which requires the atomic swap of two different 64-bit values. On the amd64 architecture it is possible with the XCHGQ instruction (see here in doc, warning: it is a long pdf).
Correspondingly, gcc has some atomic builtins which would ideally do the same, as it is visible for example here.
Using these 2 docs I produced the following simple C function, for the atomic swapping of two, 64-bit values:
void theExchange(u64* a, u64* b) {
__atomic_exchange(a, b, b, __ATOMIC_SEQ_CST);
};
(Btw, it wasn't really clear to me why an "atomic exchange" needs 3 operands.)
It seemed a little fishy to me that the gcc __atomic_exchange macro uses 3 operands, so I checked its asm output. I compiled this with gcc -O6 -masm=intel -S and got the following output:
.LHOTB0:
.p2align 4,,15
.globl theExchange
.type theExchange, @function
theExchange:
.LFB16:
.cfi_startproc
mov rax, QWORD PTR [rsi]
xchg rax, QWORD PTR [rdi] /* WTF? */
mov QWORD PTR [rsi], rax
ret
.cfi_endproc
.LFE16:
.size theExchange, .-theExchange
.section .text.unlikely
As we can see, the resulting function contains not a single data move but three different data movements. Thus, as I understand this asm code, the function is not really atomic.
How is it possible? Maybe I misunderstood some of the docs? I admit, the gcc builtin doc wasn't really clear to me.
This is the generic version of __atomic_exchange_n (type *ptr, type val, int memorder); its own signature is __atomic_exchange (type *ptr, type *val, type *ret, int memorder). Only the exchange operation on *ptr is atomic; the read of *val and the write to *ret are not. In the generic version, val is accessed via a pointer, but atomicity still does not apply to it. Pointers are used so that it will work with multiple sizes, when the compiler has to call an external helper:
The four non-arithmetic functions (load, store, exchange, and
compare_exchange) all have a generic version as well. This generic
version works on any data type. It uses the lock-free built-in
function if the specific data type size makes that possible;
otherwise, an external call is left to be resolved at run time. This
external call is the same format with the addition of a ‘size_t’
parameter inserted as the first parameter indicating the size of the
object being pointed to. All objects must be the same size.
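Here is a short sketch of both forms using the GCC builtins. The wrapper names exchange_n and exchange_generic are invented for the example. In both cases only the access to *shared is atomic; desired and previous are ordinary local variables.

```c
#include <assert.h>
#include <stdint.h>

/* _n form: the new value is passed by value and the old one is returned. */
uint64_t exchange_n(uint64_t *shared, uint64_t desired)
{
    return __atomic_exchange_n(shared, desired, __ATOMIC_SEQ_CST);
}

/* Generic form: ptr, val and ret are all pointers. The builtin reads
   *val (not atomically), atomically swaps it into *ptr, and writes the
   previous contents of *ptr to *ret (not atomically). */
uint64_t exchange_generic(uint64_t *shared, uint64_t desired)
{
    uint64_t previous;
    __atomic_exchange(shared, &desired, &previous, __ATOMIC_SEQ_CST);
    return previous;
}

void demo_exchange(void)
{
    uint64_t value = 1;
    assert(exchange_n(&value, 2) == 1 && value == 2);
    assert(exchange_generic(&value, 3) == 2 && value == 3);
}
```

This also accounts for the three moves in the question's output: a plain load of *b, the xchg on *a, and a plain store back to *b. Passing b as both val and ret is why *b ends up holding the old *a.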

Some questions about prologue/calling a function gcc intel x86

I don't quite understand the gcc prologue, especially for main.
Why is there the instruction and esp, 0xfffffff0? I know what it does, but why is it necessary?
When we call a function, we first have to push the arguments, but why doesn't gcc use the push instruction instead of movs? Moreover, with those movs it creates empty padding. It looks like a waste of memory; why is that?
Finally, gcc first applies the sub instruction to esp in order to "reserve" memory on the stack, but what makes sure that this memory is not used by another program, for instance?
I think I understood the theory quite well, but I couldn't find a document that explains more about memory in practice (how the memory of several programs doesn't overlap, ...). Thank you for your answers.
PS : I add the assembly code and the cpp code :
Dump of assembler code for function main(int, char**):
0x08048657 <+0>: push ebp
0x08048658 <+1>: mov ebp,esp
0x0804865a <+3>: and esp,0xfffffff0
0x0804865d <+6>: sub esp,0x20
0x08048660 <+9>: mov DWORD PTR [esp+0x1c],0x3
0x08048668 <+17>: mov BYTE PTR [esp+0x1b],0x61
=> 0x0804866d <+22>: mov DWORD PTR [esp],0x8048771
0x08048674 <+29>: call 0x804863c <p(char*)>
0x08048679 <+34>: mov eax,0x0
0x0804867e <+39>: leave
0x0804867f <+40>: ret
End of assembler dump.
int main(int argc, char *argv[]) {
int b = 3;
char c = 'a';
p("hello woooooooooorld !!");}
The stack alignment is only done for main, the rest of the functions just keep the alignment required by the ABI.
The compiler uses mov instructions for locals so they can be accessed randomly. For outgoing function arguments you can ask for push instructions using the -mpush-args compiler option which might produce smaller code.
As for the wasted memory, you probably didn't compile with optimizations enabled (which would of course eliminate your b and c altogether since they are not used ;))
Each process has its own virtual memory address space, so there is no chance of anybody else using the memory allocated from the stack.

Can I choose RIP-relative or absolute addressing for different variables with gcc in x86-64

I write my own link script to put different variables in two different data sections (A & B).
A is linked to zero address;
B is linked near to code, and in high address space (higher than 4G, which is not available for normal absolute addressing in x86-64).
A can be accessed through absolute addressing, but not RIP-relative;
B can be accessed through RIP-relative addressing, but not absolute;
My question: Is there any way to choose RIP-relative or absolute addressing for different variables in gcc? Perhaps with some annotation like #pragma?
Without hacking the GCC source code, you're not going to get it to emit 32-bit absolute addressing, but there are cases where gcc will use 64-bit absolute addresses.
-mcmodel=medium puts large objects into a separate section, using 64-bit absolute addresses for the large-data section. (With a size threshold that all objects have to agree on, set by -mlarge-data-threshold=). But still uses RIP-relative for all other variables.
See the x86-64 System V ABI doc for more about the different memory models. And/or GCC docs for -mcmodel= and -mlarge-data-threshold= : https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
The default is -mcmodel=small : everything is within 2GiB of everything else, so RIP-relative works. And for non-PIE executables, that's the low 2GiB of virtual address space so static addresses can be 32-bit absolute sign- or zero-extended immediates or disp32 in addressing modes.
int a[1000000];
int b[1];
int fa() { return a[0]; }
int fb() { return b[0]; }
ASM output (Godbolt):
# gcc9.2 -O3 -mcmodel=medium
fa():
movabs eax, DWORD PTR [a] # 64-bit absolute address, special encoding for EAX
ret
fb():
mov eax, DWORD PTR b[rip]
ret
For loading into a register other than AL/AX/EAX/RAX, GCC would use movabs r64, imm64 with the address and then use mov reg, [reg].
You won't get gcc to use 32-bit absolute addressing for section A. It will always be using 64-bit absolute, never [array + rdx*4] or [abs foo] (NASM syntax). And never mov edi, msg (imm32) for putting an address in a register, always mov rdi, qword msg (imm64).
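Whether an address can be used as a 32-bit absolute or reached RIP-relative comes down to a range check, sketched below. The helper names and the example addresses are made up; the only facts used are that disp32/imm32 values are sign-extended to 64 bits and that RIP-relative addressing uses a signed 32-bit displacement from the end of the instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Can addr be encoded as a 32-bit value that is sign-extended to 64 bits? */
bool fits_sign_extended_32(uint64_t addr)
{
    return addr <= 0x7FFFFFFFULL || addr >= 0xFFFFFFFF80000000ULL;
}

/* Can target be reached with a rel32 displacement from next_rip
   (the address of the following instruction)? */
bool reachable_rip_relative(uint64_t next_rip, uint64_t target)
{
    return fits_sign_extended_32(target - next_rip);   /* wraps modulo 2^64 */
}

void demo_addressing(void)
{
    uint64_t code = 0x100000000ULL;     /* hypothetical code address above 4 GiB */
    uint64_t var_a = 0x1000ULL;         /* hypothetical section-A variable near zero */
    uint64_t var_b = code + 0x1000;     /* hypothetical section-B variable near the code */

    assert(fits_sign_extended_32(var_a) && !reachable_rip_relative(code, var_a));
    assert(!fits_sign_extended_32(var_b) && reachable_rip_relative(code, var_b));
}
```

That matches the layout in the question: section A is addressable only as an absolute address, section B only RIP-relative.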
GCC puts the large array in the .lbss section and the small one in the regular .bss (in the listing below, b is the large array and a the small one). Presumably you can use __attribute__((section("name"))) on
.globl b
.section .lbss,"aw" # "aw" = allocate(?), writeable
.align 32
.size b, 4000000
b:
.zero 4000000
.globl a
.bss # shortcut for .section
.align 4
a:
.zero 4
Things that don't work:
__attribute__((optimize("mcmodel=large"))) on a per-function basis. Doesn't actually work, and is per-function not per-variable anyway.
https://gcc.gnu.org/onlinedocs/gcc/Variable-Attributes.html doesn't document any x86 or common variable attributes related to memory-model or size. The only x86-specific variable attribute is ms vs gcc struct layout.
There are x86-specific attributes for functions and types, but those don't help.
Possible hacks:
Put all your section-A variables in a large struct, larger than any section-B global/static objects. Possibly pad it at the end with a dummy array to make it larger: your linker script can probably avoid actually allocating extra space for that dummy array.
Then compile with -mcmodel=medium -mlarge-data-threshold=that size.
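A sketch of what that struct might look like follows. Everything in it is hypothetical: the member names, the struct name, and the threshold value, which would have to match whatever you pass to -mlarge-data-threshold=. I haven't checked which section GCC actually picks for it, so verify the output with -S.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical value; keep it in sync with -mlarge-data-threshold= */
#define LARGE_DATA_THRESHOLD 65536

/* All section-A variables gathered into one object that is larger
   than the threshold, and so larger than any section-B object
   (which must stay at or below it). */
struct section_a_vars {
    int counter;                       /* made-up example members */
    long flags;
    char pad[LARGE_DATA_THRESHOLD];    /* dummy tail, only there for size */
};

struct section_a_vars section_a;

_Static_assert(sizeof(struct section_a_vars) > LARGE_DATA_THRESHOLD,
               "section_a must exceed the large-data threshold");
```

Accesses then go through section_a.counter and so on, and the linker script decides where the section holding section_a ends up.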

Resources