How can I know which registers WinAPI functions use for arguments? [duplicate]

I'm writing a function in x86 assembly that should be callable from C code, and I'm wondering which registers I have to restore before I return to the caller.
Currently I'm only restoring esp and ebp, while the return value is in eax.
Are there any other registers I should be concerned about, or could I leave whatever pleases me in them?

Using Microsoft's 32 bit ABI (cdecl or stdcall or other calling conventions), EAX, EDX and ECX are scratch registers (call clobbered). The other general-purpose integer registers are call-preserved.
The condition codes in EFLAGS are call-clobbered. DF=0 is required on call/return so you can use rep movsb without a cld first. The x87 stack must be empty on call, or on return from a function that doesn't return an FP value. (FP return values go in st0, with the x87 stack empty other than that.) XMM6 and 7 are call-preserved, the rest are call-clobbered scratch registers.
Outside of Windows, most 32-bit calling conventions (including i386 System V on Linux) agree with this choice of EAX, EDX and ECX as call-clobbered, but all the xmm registers are call-clobbered.
For x64 under Windows, you only need to restore RBX, RBP, RDI, RSI, R12, R13, R14, and R15; XMM6..15 are call-preserved. (And you have to reserve 32 bytes of shadow space for use by the callee, whether or not there are any args that don't fit in registers.)
See https://en.wikipedia.org/wiki/X86_calling_conventions#Microsoft_x64_calling_convention for more details.
Other OSes use the x86-64 System V ABI (see figure 3.4), where the call-preserved integer registers are RBP, RBX, RSP, R12, R13, R14, and R15. All the XMM/YMM/ZMM registers are call-clobbered.
EFLAGS and the x87 stack are the same as in 32-bit conventions: DF=0, condition flags are clobbered, and x87 stack is empty. (x86-64 conventions return FP values in XMM0, so the x87 stack registers always need to be empty on call/return.)
For links to official calling convention docs, see https://stackoverflow.com/tags/x86/info
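For example, here's a minimal sketch (function name hypothetical) of a 32-bit cdecl function that uses EBX and ESI, so it must save and restore them, while EAX, ECX and EDX are free to clobber:
; int sum2(int a, int b) -- NASM, 32-bit cdecl sketch
global _sum2
_sum2:
push ebx ; save the call-preserved registers we use
push esi
mov ebx, [esp + 12] ; a (the two pushes moved esp down by 8)
mov esi, [esp + 16] ; b
lea eax, [ebx + esi] ; return value goes in EAX
pop esi ; restore in reverse order
pop ebx
ret ; cdecl: the caller pops the args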

32-bit: EBX, ESI, EDI, EBP
64-bit Windows: RBX, RSI, RDI, RBP, R12-R15, XMM6-XMM15
64-bit Linux, BSD, Mac: RBX, RBP, R12-R15
For details see "Software optimization resources" by Agner Fog; his calling-conventions PDF describes all of these conventions.
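To illustrate the 64-bit System V case, a minimal sketch (function name hypothetical): arguments arrive in RDI and RSI, and only the registers listed above need saving:
; long f(long a, long b) -- NASM, x86-64 System V sketch
global f
f:
push rbx ; RBX is call-preserved, so save it before using it
mov rbx, rdi
add rbx, rsi ; RDI/RSI/RAX/RCX/RDX/R8-R11 may be clobbered freely
mov rax, rbx ; return value goes in RAX
pop rbx ; restore before returning
ret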

If you are unsure which registers you've clobbered, the instructions below can save the day easily.
PUSHA/PUSHAD -- Push all General Registers
POPA/POPAD -- Pop all General Registers
These instructions push and pop the general-purpose registers, including SI/ESI and DI/EDI, in a fixed order.
The order for PUSHA/PUSHAD instruction is as follows.
Opcode  Instruction  Clocks  Description
60      PUSHA        18      Push AX, CX, DX, BX, original SP, BP, SI, and DI
60      PUSHAD       18      Push EAX, ECX, EDX, EBX, original ESP, EBP, ESI, and EDI
And the order for POPA/POPAD instruction is as follows. (in reverse order)
Opcode  Instruction  Clocks  Description
61      POPA         24      Pop DI, SI, BP, SP, BX, DX, CX, and AX
61      POPAD        24      Pop EDI, ESI, EBP, ESP(***), EBX, EDX, ECX, and EAX
*** The ESP value is discarded instead of loaded into ESP.
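A sketch of how these might bracket a routine in 32-bit code (note that PUSHA/PUSHAD and POPA/POPAD are not available in 64-bit mode, so this assembles only for 16/32-bit targets):
; NASM, 32-bit: preserve every general-purpose register around a body
do_stuff:
pushad ; push EAX..EDI (8 dwords, 32 bytes of stack)
; ... body may freely modify any general-purpose register ...
popad ; restore all of them; the saved ESP value is discarded
ret
One caveat: POPAD also restores EAX, so this pattern can't return a value in EAX unless you stash the result in memory (or patch the saved EAX on the stack) before the POPAD.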

Mach-O 64-bit format does not support 32-bit absolute addresses. NASM Accessing Array

I'm running this code on my Mac, using the command:
nasm -f macho64 -o max.a maximum.asm
This is the code I'm attempting to run on my computer; it finds the largest number in an array.
section .data
data_items:
dd 3,67,34,222,45,75,54,34,44,33,22,11,66,0
section .text
global _start
_start:
mov edi, 0
mov eax, [data_items + edi*4]
mov ebx, eax
start_loop:
cmp eax, 0
je loop_exit
inc edi
mov eax, [data_items + edi*4]
cmp eax, ebx
jle start_loop
mov ebx, eax
jmp start_loop
loop_exit:
mov eax, 1
int 0x80
Error:
maximum.asm:14: error: Mach-O 64-bit format does not support 32-bit absolute addresses
maximum.asm:21: error: Mach-O 64-bit format does not support 32-bit absolute addresses
First of all, beware of NASM bugs with the macho64 output format with 64-bit absolute addressing (NASM 2.13.02+) and with RIP-relative in NASM 2.11.08. 64-bit absolute addressing is not recommended, so this answer should work even for buggy NASM 2.13.02 and higher. (The bugs don't cause this error, they lead to wrong addresses being used at runtime.)
[data_items + edi*4] is a 32-bit addressing mode. Even [data_items + rdi*4] can only use a 32-bit absolute displacement, so it wouldn't work either. Note that using an address as a 32-bit (sign-extended) immediate like cmp rdi, data_items is also a problem: only mov allows a 64-bit immediate.
64-bit code on OS X can't use 32-bit absolute addressing at all. Executables are loaded at a base address above 4GiB, so label addresses just plain don't fit in 32-bit integers, with zero- or sign-extension. RIP-relative addressing is the best / most efficient solution, whether you need it to be position-independent or not (see footnote 1).
In NASM, default rel at the top of your file will make all [] memory operands prefer RIP-relative addressing. See also Section 3.3 Effective Addresses in the NASM manual.
default rel ; near the top of file; affects all instructions
my_func:
...
mov ecx, [data_items] ; uses the default: RIP-relative
;mov ecx, [abs data_items] ; override to absolute [disp32], unusable
mov ecx, [rel data_items] ; explicitly RIP-relative
But RIP-relative is only possible when there are no other registers involved, so for indexing a static array you need to get the address in a register first. Use a RIP-relative lea rsi, [rel data_items].
lea rsi, [data_items] ; can be outside the loop
...
mov eax, [rsi + rdi*4]
Or you could add rsi, 4 inside the loop and use a simpler addressing mode like mov eax, [rsi].
Note that mov rsi, data_items will work for getting an address into a register, but you don't want that because it's less efficient.
Technically, any address within +-2GiB of your array will work, so if you have multiple arrays you can address the others relative to one common base address, only tying up one register with a pointer. e.g. lea rbx, [arr1] / ... / mov eax, [rbx + rdi*4 + arr2-arr1]. Relative Addressing errors - Mac 10.10 mentions that Agner Fog's "optimizing assembly" guide has some examples of array addressing, including one using the __mh_execute_header as a reference point. (The code in that question looks like another attempt to port this 32-bit Linux example from the PGU book to 64-bit OS X, at the same time as learning asm in the first place.)
Note that on Linux, position-dependent executables are loaded in the low 32 bits of virtual address space, so you will see code like mov eax, [array + rdi*4] or mov edi, symbol_name in Linux examples or compiler output on http://gcc.godbolt.org/. gcc -pie -fPIE will make position-independent executables on Linux, and is the default on many recent distros, but not Godbolt.
This doesn't help you on MacOS, but I mention it in case anyone's confused about code they've seen for other OSes, or why AMD64 architects bothered to allow [disp32] addressing modes at all on x86-64.
And BTW, prefer using 64-bit addressing modes in 64-bit code. e.g. use [rsi + rdi*4], not [esi + edi*4]. You usually don't want to truncate pointers to 32-bit, and it costs an extra address-size prefix to encode.
Similarly, you should be using syscall to make 64-bit system calls, not int 0x80. See What are the calling conventions for UNIX & Linux system calls on i386 and x86-64 for the differences in which registers args are passed in.
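Putting the pieces together, here's a sketch of the whole program fixed up for 64-bit OS X (untested; it assumes your linker accepts _start as the entry point, and it uses 0x2000001, the 64-bit MacOS exit() syscall number, instead of int 0x80):
default rel ; make all [] memory operands RIP-relative

section .data
data_items:
dd 3,67,34,222,45,75,54,34,44,33,22,11,66,0

section .text
global _start
_start:
xor edi, edi ; index = 0, zero-extended into RDI
lea rsi, [data_items] ; RIP-relative LEA, once, outside the loop
mov eax, [rsi + rdi*4] ; 64-bit addressing mode, 32-bit load
mov ebx, eax ; current max
start_loop:
cmp eax, 0 ; a 0 element terminates the list
je loop_exit
inc edi
mov eax, [rsi + rdi*4]
cmp eax, ebx
jle start_loop
mov ebx, eax
jmp start_loop
loop_exit:
mov edi, ebx ; exit status = the max (the OS truncates it to a byte)
mov eax, 0x2000001 ; MacOS 64-bit exit() syscall number
syscall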
Footnote 1:
64-bit absolute addressing is supported on OS X, but only in position-dependent executables (non-PIE). This related question x64 nasm: pushing memory addresses onto the stack & call function includes an ld warning from using gcc main.o to link:
ld: warning: PIE disabled. Absolute addressing (perhaps -mdynamic-no-pic) not
allowed in code signed PIE, but used in _main from main.o. To fix this warning,
don't compile with -mdynamic-no-pic or link with -Wl,-no_pie
So the linker checks if any 64-bit absolute relocations are used, and if so disables creation of a Position-Independent Executable. A PIE can benefit from ASLR for security. I think shared-library code always has to be position-independent on OS X; I don't know if jump tables or other cases of pointers-as-data are allowed (i.e. fixed up by the dynamic linker), or if they need to be initialized at runtime if you aren't making a position-dependent executable.
mov r64, imm64 is larger (10 bytes) and not faster than lea r64, [RIP_rel32] (7 bytes).
So you could use mov rsi, qword data_items, but a RIP-relative LEA runs about as fast and takes less space in code caches and the uop cache. 64-bit immediates also have a uop-cache fetch penalty on Sandybridge-family (http://agner.org/optimize/): they take 2 cycles to read from a uop cache line instead of 1.
x86 also has a form of mov that loads/stores from/to a 64-bit absolute address, but only for AL/AX/EAX/RAX. See http://felixcloutier.com/x86/MOV.html. You don't want this either, because it's larger and not faster than mov eax, [rel foo].
(Related: an AT&T syntax version of the same question)

Packing two DWORDs into a QWORD to save store bandwidth

Imagine a load-store loop like the following which loads DWORDs from non-contiguous locations and stores them contiguously:
top:
mov eax, DWORD [rsi]
mov DWORD [rdi], eax
mov eax, DWORD [rdx]
mov DWORD [rdi + 4], eax
; unroll the above a few times
; increment rdi and rsi somehow
cmp ...
jne top
On modern Intel and AMD hardware, when running in-cache, such a loop will usually bottleneck on stores at one store per cycle. That's kind of wasteful, since that's only an IPC of 2 (one store, one load).
One idea that naturally arises is to combine two DWORD loads into a single QWORD store which is possible since the stores are contiguous. Something like this could work:
top:
mov eax, DWORD [rsi]
mov ebx, DWORD [rdx]
shl rbx, 32
or rax, rbx
mov QWORD [rdi], rax
Basically do the two loads and use two ALU ops to combine them into a single QWORD which we can store with a single store. Now we're bottlenecked on uops: 5 uops per 2 DWORDs - so 1.25 cycles per QWORD or 0.625 cycles per DWORD.
Already much better than the first option, but I can't help but think there is a better option for this shuffling - for example, we are wasting uop throughput by using plain loads. It feels like we should be able to combine at least some of the ALU ops with the loads via memory source operands, but I was mostly stymied on Intel: shl on memory only has a RMW form, and shlx and rorx don't micro-fuse.
It also seems like we could maybe get the shift for free by making the second load a QWORD load offset by -4, but then we are left clearing out garbage in the load DWORD.
I'm interested in scalar code, and code for both the base x86-64 instruction set and better versions if possible with useful extensions like BMI.
Taking up the idea of getting the shift for free by making the second load a QWORD load offset by -4: if wider loads are OK for correctness and performance (cache-line splits...), we can use shld:
top:
mov eax, DWORD [rsi]
mov rbx, QWORD [rdx-4] ; unaligned(?) 64-bit load
shld rax, rbx, 32 ; 1 uop on Intel SnB-family, 0.5c recip throughput
mov QWORD [rdi], rax
MMX punpckldq mm0, [mem] micro-fuses on SnB-family (including Skylake).
top:
movd mm0, DWORD [rsi]
punpckldq mm0, QWORD [rdx] ; 1 micro-fused uop on Intel SnB-family
movq QWORD [rdi], mm0
; required after the loop, making it only worth-while for long-running loops
emms
punpckl instructions unfortunately have a vector-width memory operand, not half-width. This often spoils them for uses where they'd otherwise be perfect (especially the SSE2 version where the 16B memory operand must be aligned). But note that the MMX versions (with only a qword memory operand) don't have an alignment requirement.
You could also use the 128-bit AVX version, but that's even more likely to cross a cache line boundary and be slow. (Skylake does not optimize by loading only the required 8 bytes; a loop with an aligned mov + vpunpckldq xmm1, xmm0, [cache_line-8] runs at 1 iter per 2 clocks vs. 1 iter per clock for aligned loads.) The AVX version is required to fault if the 16-byte load crosses into an unmapped page, so it couldn't just use a narrower load without extra support from the load port. :/
Such a frustrating and useless design decision (presumably made before load ports could zero-extend for free, and not fixed with AVX). At least we have movhps as a replacement for memory-source punpcklqdq, but narrower widths that actually shuffle can't be replaced.
To avoid CL-splits, you could also use a separate movd load and punpckldq, or SSE4.1 pinsrd. With this, there's no reason for MMX.
top:
movd xmm0, DWORD [rsi]
movd xmm1, DWORD [rdx] ; SSE2
punpckldq xmm0, xmm1
; or pinsrd xmm0, DWORD [rdx], 1 ; 2 uops not micro-fused
movq QWORD [rdi], xmm0
Obviously AVX2 vpgatherdd is a possibility, and may perform well on Skylake.
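For concreteness, here's a hedged sketch of the gather idea (not from the original answer): it assumes both source streams advance in lockstep, so d = rdx - rsi is loop-invariant and fits in a signed 32-bit index, and that an index vector has been built once in ymm2 before the loop. The end pointer in r8 is hypothetical.
; ymm2 = {0, d, 4, d+4, 8, d+8, 12, d+12} ; byte offsets from rsi, built once
top:
vpcmpeqd ymm1, ymm1, ymm1 ; all-ones mask; gather clears its mask register
vpgatherdd ymm0, [rsi + ymm2*1], ymm1 ; 8 DWORDs, 4 from each stream, interleaved
vmovdqu [rdi], ymm0 ; one 32-byte store
add rsi, 16 ; 4 DWORDs consumed from each stream
add rdi, 32
cmp rdi, r8 ; hypothetical end pointer
jne top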

Strange `lea` instruction [duplicate]

LEA EAX, [EAX]
I encountered this instruction in a binary compiled with the Microsoft C compiler. It clearly can't change the value of EAX. Then why is it there?
It is a NOP.
The following are typically used as NOPs. They all do the same thing, but they result in machine code of different lengths. Depending on the alignment requirement, one of them is chosen:
xchg eax, eax = 90
mov eax, eax = 89 C0
lea eax, [eax + 0x00] = 8D 40 00
From this article:
This trick is used by the MSVC++ compiler to emit NOP instructions of different lengths (for padding before jump targets). For example, MSVC++ generates the following code if it needs 4-byte and 6-byte padding:
8d6424 00       lea esp, [esp+00]        ; 4-byte padding
8d9b 00000000   lea ebx, [ebx+00000000]  ; 6-byte padding
The first line is marked as "npad 4" in assembly listings generated by the compiler, and the second as "npad 6". The registers (esp, ebx) can be chosen from the rarely used ones to avoid false dependencies in the code.
So this is just a kind of NOP, appearing right before targets of jmp instructions in order to align them.
Interestingly, you can identify the compiler from the characteristic nature of such instructions.
LEA EAX, [EAX]
Indeed, it doesn't change the value of EAX. As far as I understand, it's identical in function to:
MOV EAX, EAX
Did you see it in optimized code, or unoptimized code?

How to reset the general-purpose performance counters on Intel

I know we can use the wrmsr and rdmsr instructions to set up a performance event and to read the general-purpose performance counter registers.
However, my question is:
Do we need to reset the general purpose performance counter register before we issue the wrmsr?
In other words, for the following code, do we need to reset the performance counter before the following code? If we have to, how can we reset it?
mov $0x0001010E, %eax # Write selector value to EAX
xor %edx, %edx # Zero EDX
mov $0x187, %ecx # Write logical register id to ECX (IA32_PERFEVTSEL1)
wrmsr
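If the count does need to start from zero, one approach (a sketch, using MSR addresses from Intel's SDM: IA32_PMC1 is 0xC2, the counter paired with IA32_PERFEVTSEL1 at 0x187) is to zero the counter MSR before writing the event select:
xor %eax, %eax # low 32 bits of the new counter value = 0
xor %edx, %edx # high 32 bits = 0
mov $0xC2, %ecx # IA32_PMC1, the counter that IA32_PERFEVTSEL1 controls
wrmsr # counter now reads as zero
# ...then program and enable the event with the wrmsr sequence above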

What is this assembly function prologue / epilogue code doing with rbp / rsp / leave?

I am just starting to learn assembly for the Mac using the GCC compiler to assemble my code. Unfortunately, there are VERY limited resources for learning how to do this if you are a beginner. I finally managed to find some simple sample code that I could start to wrap my head around, and I got it to assemble and run correctly. Here is the code:
.text # start of code indicator.
.globl _main # make the main function visible to the outside.
_main: # actually label this spot as the start of our main function.
push %rbp # save the base pointer to the stack.
mov %rsp, %rbp # put the previous stack pointer into the base pointer.
subl $8, %esp # Balance the stack onto a 16-byte boundary.
movl $0, %eax # Stuff 0 into EAX, which is where result values go.
leave # leave cleans up base and stack pointers again.
ret
The comments explain some things in the code (I kind of understand what lines 2 - 5 do), but I don't understand what most of this means. I do understand the basics of what registers are and what each register here (rbp, rsp, esp and eax) is used for and how big they are, and I also understand (generally) what the stack is, but this is still going over my head. Can anyone tell me exactly what this is doing? Also, could anyone point me in the direction of a good tutorial for beginners?
Stack is a data structure that follows LIFO principle. Whereas stacks in everyday life (outside computers, I mean) grow upward, stacks in x86 and x86-64 processors grow downward. See Wikibooks article on x86 stack (but please take into account that the code examples are 32-bit x86 code in Intel syntax, and your code is 64-bit x86-64 code in AT&T syntax).
So, what your code does (my explanations here are with Intel syntax):
push %rbp
Pushes rbp to the stack: practically, it subtracts 8 from rsp (because the size of rbp is 8 bytes) and then stores rbp to [ss:rsp].
So, in Intel syntax push rbp practically does this:
sub rsp, 8
mov [ss:rsp], rbp
Then:
mov %rsp, %rbp
This is obvious. Just store the value of rsp into rbp.
subl $8, %esp
Subtract 8 from esp and store the result in esp. Actually this is a bug in your code, even if it causes no problems here. Any instruction with a 32-bit register (eax, ebx, ecx, edx, ebp, esp, esi or edi) as destination in x86-64 sets the topmost 32 bits of the corresponding 64-bit register (rax, rbx, rcx, rdx, rbp, rsp, rsi or rdi) to zero, causing the stack pointer to point somewhere below the 4 GiB limit, effectively doing this (in Intel syntax):
sub rsp,8
and rsp,0x00000000ffffffff
However, this causes no problems on a computer with less than 4 GiB of memory. On computers with more than 4 GiB of memory, it may result in a segmentation fault. The leave further down in your code restores a sane value to rsp. Generally, in x86-64 code you never need esp (excluding possibly some optimizations or tweaks). To fix this bug:
subq $8, %rsp
The instructions so far are the standard entry sequence (replace $8 according to the stack usage). Wikibooks has a useful article on x86 functions and stack frames (but note again that it uses 32-bit x86 assembly with Intel syntax, not 64-bit x86-64 assembly with AT&T syntax).
Then:
movl $0, %eax
This is obvious. Store 0 into eax. This has nothing to do with the stack.
leave
This is equivalent to mov rsp, rbp followed by pop rbp.
ret
And this, finally, sets rip to the value stored at [ss:rsp], effectively returning control back to where this procedure was called, and adds 8 to rsp.
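For reference, a sketch of the same program with that one bug fixed (subq instead of subl), everything else kept as-is:
.text # start of code indicator.
.globl _main # make the main function visible to the outside.
_main:
push %rbp # save the base pointer to the stack.
mov %rsp, %rbp # put the previous stack pointer into the base pointer.
subq $8, %rsp # 64-bit subtract keeps the full stack pointer intact.
movl $0, %eax # stuff 0 into EAX, which is where result values go.
leave # equivalent to: mov %rbp, %rsp then pop %rbp.
ret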
