What's the purpose of xchg ax,ax prior to the break instruction int 3 in DebugBreak()? - windows

In MASM, I've always inserted a standalone break instruction
00007ff7`63141120 cc int 3
However, replacing that instruction with the MSVC DebugBreak function generates
KERNELBASE!DebugBreak:
00007ff8`6b159b90 6690 xchg ax,ax
00007ff8`6b159b92 cc int 3
00007ff8`6b159b93 c3 ret
I was surprised to see the xchg instruction prior to the break instruction
xchg ax,ax
As noted in another S.O. answer:
Actually, xchg ax,ax is just how MS disassembles "66 90". 66 is the
operand size override, so it supposedly operates on ax instead of eax.
However, the CPU still executes it as a nop. The 66 prefix is used
here to make the instruction two bytes in size, usually for alignment
purposes.
MSVC, like most compilers, aligns functions to 16 byte boundaries.
Question: What is the purpose of that xchg instruction?

MSVC generates a 2-byte nop before any single-byte instruction at the beginning of a function (except ret in empty functions1). I've tried the __halt, _enable, and _disable intrinsics and seen the same effect.
Apparently it is for patching. The /hotpatch option produces the same change for x86, while /hotpatch is not recognized when targeting x64. According to the /hotpatch documentation, this is expected behavior (emphasis mine):
Because instructions are always two bytes or larger on the ARM architecture, and because x64 compilation is always treated as if /hotpatch has been specified, you don't have to specify /hotpatch when you compile for these targets;
So hotpatching support is unconditional for x64, and its result is seen in the DebugBreak implementation.
See here: https://godbolt.org/z/1G737cErf
See this post on why it is needed for hotpatching: Why do Windows functions all begin with a pointless MOV EDI, EDI instruction?. It looks like hotpatching is now smart enough to use any instruction of two bytes or more, not just MOV EDI, EDI; it still cannot use a single-byte instruction, because the two-byte backward jump might be written at the exact moment some thread's instruction pointer sits on the second of two single-byte instructions.
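As an illustrative sketch (the padding layout, the NewDebugBreak label, and the patch bytes are assumptions for illustration, not taken from the disassembly above), hotpatching relies on a layout roughly like this:
; 5+ bytes of padding precede the function (filled with int 3 or nop,
; typically requested with the /FUNCTIONPADMIN linker option)
        nop
        nop
        nop
        nop
        nop
DebugBreak:
        xchg ax, ax             ; 66 90 -- the 2-byte nop in question
        int 3
        ret
; To patch: write "jmp NewDebugBreak" (E9 rel32, 5 bytes) into the padding, then
; atomically overwrite the two bytes 66 90 with a short jump back to it (EB F9).
; A single-byte first instruction could not be swapped this way, because another
; thread's instruction pointer might already be at the second byte when the two
; patch bytes are written.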
1 As discussed in comments, empty functions have a three-byte ret 0, although this is not apparent from MSVC assembly output, where it is shown as just ret.

Related

How to load address of function or label into register

I am trying to load the address of 'main' into a register (R10) in the GNU Assembler. I am unable to. Here is what I have and the error message I receive.
main:
lea main, %r10
I also tried the following syntax (this time using mov)
main:
movq $main, %r10
With both of the above I get the following error:
/usr/bin/ld: /tmp/ccxZ8pWr.o: relocation R_X86_64_32S against symbol `main' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status
Compiling with -fPIC does not resolve the issue and just gives me the same exact error.
In x86-64, most immediates and displacements are still 32-bits because 64-bit would waste too much code size (I-cache footprint and fetch/decode bandwidth).
lea main, %reg is an absolute disp32 addressing mode, which would stop load-time address randomization (ASLR) from choosing a random 64-bit (or 47-bit) address. So it's not supported on Linux except in position-dependent executables, and not at all on MacOS, where static code/data are always loaded outside the low 32 bits. (See the x86 tag wiki for links to docs and guides.) On Windows, you can build executables as "large address aware" or not; if you choose not, addresses will fit in 32 bits.
The standard efficient way to put a static address into a register is a RIP-relative LEA:
# RIP-relative LEA always works. Syntax for various assemblers:
lea main(%rip), %r10 # AT&T syntax
lea r10, [rip+main] # GAS .intel_syntax noprefix equivalent
lea r10, [rel main] ; NASM equivalent, or use default rel
lea r10, [main] ; FASM defaults to RIP-relative; MASM may default to it as well
See How do RIP-relative variable references like "[RIP + _a]" in x86-64 GAS Intel-syntax work? for an explanation of the 3 syntaxes, and Why are global variables in x86-64 accessed relative to the instruction pointer? (and this) for reasons why RIP-relative is the standard way to address static data.
This uses a 32-bit relative displacement from the end of the current instruction, like jmp/call. This can reach any static data in .data, .bss, .rodata, or function in .text, assuming the usual 2GiB total size limit for static code+data.
In position dependent code (built with gcc -fno-pie -no-pie for example) on Linux, you can take advantage of 32-bit absolute addressing to save code size. Also, mov r32, imm32 has slightly better throughput than RIP-relative LEA on Intel/AMD CPUs, so out-of-order execution may be able to overlap it better with the surrounding code. (Optimizing for code-size is usually less important than most other things, but when all else is equal pick the shorter instruction. In this case all else is at least equal, or also better with mov imm32.)
See 32-bit absolute addresses no longer allowed in x86-64 Linux? for more about how PIE executables are the default. (Which is why you got a link error about -fPIC with your use of a 32-bit absolute.)
# in a non-PIE executable, mov imm32 into a 32-bit register is even better
# same as you'd use in 32-bit code
## GAS AT&T syntax
mov $main, %r10d # 6 bytes
mov $main, %edi # 5 bytes: no REX prefix needed for a "legacy" register
## GAS .intel_syntax
mov edi, OFFSET main
;; mov edi, main ; NASM and FASM syntax
Note that writing any 32-bit register always zero-extends into the full 64-bit register (R10 and RDI).
lea main, %edi or lea main, %rdi would also work in a Linux non-PIE executable, but never use LEA with a [disp32] absolute addressing mode (even in 32-bit code where that doesn't require a SIB byte); mov is always at least as good.
The operand-size suffix is redundant when you have a register operand that uniquely determines it; I prefer to just write mov instead of movl or movq.
The stupid/bad way is a 10-byte 64-bit absolute address as an immediate:
# Inefficient, DON'T USE
movabs $main, %r10 # 10 bytes including the 64-bit absolute address
This is what you get in NASM if you use mov rdi, main instead of mov edi, main, so many people end up doing this. Linux dynamic linking does actually support runtime fixups for 64-bit absolute addresses, but the use-case for that is jump tables, not absolute addresses as immediates.
movq $sign_extended_imm32, %reg (7 bytes) still uses a 32-bit absolute address, but wastes code bytes on a sign-extended mov to a 64-bit register, instead of implicit zero-extension to 64-bit from writing a 32-bit register.
By using movq, you're telling GAS you want a R_X86_64_32S relocation instead of a R_X86_64_64 64-bit absolute relocation.
The only reason you'd ever want this encoding is for kernel code where static addresses are in the upper 2GiB of 64-bit virtual address space, instead of the lower 2GiB. mov has slight performance advantages over lea on some CPUs (e.g. running on more ports), but normally if you can use a 32-bit absolute it's in the low 2GiB of virtual address space where a mov r32, imm32 works.
(Related: Difference between movq and movabsq in x86-64)
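To put the size trade-offs side by side (a rough summary, not verbatim from the answer above; byte counts are for %r10, and a "legacy" register like %edi drops the REX prefix from the 32-bit mov form, making it 5 bytes):
lea    main(%rip), %r10    # 7 bytes:  works in PIE/PIC and non-PIE alike
mov    $main, %r10d        # 6 bytes:  non-PIE only, zero-extends into %r10
movq   $main, %r10         # 7 bytes:  sign-extended imm32 (R_X86_64_32S), for high-2GiB addresses
movabs $main, %r10         # 10 bytes: full 64-bit absolute (R_X86_64_64)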
PS: I intentionally left out any discussion of "large" or "huge" memory / code models, where RIP-relative +-2GiB addressing can't reach static data, or maybe can't even reach other code addresses. The above is for x86-64 System V ABI's "small" and/or "small-PIC" code models. You may need movabs $imm64 for medium and large models, but that's very rare.
I don't know if mov $imm32, %r32 works in Windows x64 executables or DLLs with runtime fixups, but RIP-relative LEA certainly does.
Semi-related: Call an absolute pointer in x86 machine code - if you're JITing, try to put the JIT buffer near existing code so you can call rel32, otherwise movabs a pointer into a register.
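A minimal sketch of that last point (helper_func is a placeholder for whatever the JIT wants to call):
# if the JIT buffer is within +-2GiB of the callee, a direct near call works:
        call   helper_func          # E8 rel32, 5 bytes
# otherwise, materialize the pointer and do an indirect call:
        movabs $helper_func, %rax   # 64-bit absolute address patched in by the JIT
        call   *%rax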

How to make gcc emit multibyte NOPs for -fpatchable-function-entry?

gcc does have the ability to use multi-byte NOPs for aligning loops and functions. However, when I try the -fpatchable-function-entry option, it always emits single-byte NOPs.
You can see in this example that gcc aligns the function with nop DWORD PTR [rax+rax*1+0x0] and nop WORD PTR cs:[rax+rax*1+0x0] but uses eight 0x90 NOPs at the function entry when I specify -fpatchable-function-entry=8,3
I saw this in the document
-fpatchable-function-entry=N[,M]
Generate N NOPs right at the beginning of each function, with the function entry point before the Mth NOP. If M is omitted, it defaults to 0 so the function entry points to the address just at the first NOP. The NOP instructions reserve extra space which can be used to patch in any desired instrumentation at run time, provided that the code segment is writable. The amount of space is controllable indirectly via the number of NOPs; the NOP instruction used corresponds to the instruction emitted by the internal GCC back-end interface gen_nop. This behavior is target-specific and may also depend on the architecture variant and/or other compilation options.
It clearly says that N NOPs will be inserted. However, I think this should be an N-byte NOP (or whatever optimal set of NOPs fills the N-byte space). Similarly, if M is specified, you need to emit an M-byte NOP and an (N − M)-byte NOP.
So why does gcc do this? Can we make it generate multi-byte NOPs? And are two 0x90 NOPs better than Microsoft's mov edi, edi?
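For concreteness, here is a sketch of the two behaviors being compared (the func label and the exact NOP encodings are illustrative, not the linked Godbolt output):
# what gcc emits today for -fpatchable-function-entry=8,3: eight single-byte NOPs
        nop
        nop
        nop
func:
        nop
        nop
        nop
        nop
        nop
# what the question asks for instead: one 3-byte NOP before the entry point
# and one 5-byte NOP after it
        nop     DWORD PTR [rax]                 # 3 bytes (0f 1f 00)
func:
        nop     DWORD PTR [rax+rax*1+0x0]       # 5 bytes (0f 1f 44 00 00)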

Allocating memory using malloc() in 32-bit and 64-bit assembly language

I have to implement a 64-bit stack. To make myself comfortable with malloc, I managed to write two integers (32 bits) into memory and read them from there:
But when I try to do this with 64 bits:
The first snippet of code works perfectly fine. As Jester suggested, you are writing a 64-bit value in two separate (32-bit) halves. This is the way you have to do it on a 32-bit architecture. You don't have 64-bit registers available, and you can't write 64-bit chunks of memory at once. But you already seemed to know that, so I won't belabor it.
In the second snippet of code, you tried to target a 64-bit architecture (x86-64). Now, you no longer have to write 64-bit values in two 32-bit halves, since 64-bit architectures natively support 64-bit integers. You have 64-bit wide registers available, and you can write a 64-bit chunk to memory directly. Take advantage of that to simplify (and speed up) the code.
The 64-bit registers are Rxx instead of Exx. When you use QWORD PTR, you will want to use Rxx; when you use DWORD PTR, you will want to use Exx. Both are legal in 64-bit code, but only 32-bit DWORDs are legal in 32-bit code.
A couple of other things to note:
Although it is perfectly valid to clear a register using MOV xxx, 0, it is smaller and faster to use XOR eax, eax, so this is generally what you should write. It is a very old trick, something that any assembly-language programmer should know, and if you ever try to read other people's assembly programs, you'll need to be familiar with this idiom. (But actually, in the code you're writing, you don't need to do this at all. For the reason why, see point #2.)
In 64-bit mode, all instructions implicitly zero the upper 32 bits when writing the lower 32 bits, so you can simply write XOR eax, eax instead of XOR rax, rax. This is, again, smaller and faster.
The calling convention for 64-bit programs is different than the one used in 32-bit programs. The exact specification of the calling convention is going to vary, depending on which operating system you're using. As Peter Cordes commented, there is information on this in the x86 tag wiki. Both Windows and Linux x64 calling conventions pass at least the first 4 integer parameters in registers (rather than on the stack like the x86-32 calling convention), but which registers are actually used is different. Also, the 64-bit calling conventions have different requirements than do the 32-bit calling conventions for how you must set up the stack before calling functions.
(Since your screenshot says something about "MASM", I'll assume that you're using Windows in the sample code below.)
; Set up the stack, as required by the Windows x64 calling convention.
; (Note that we use the 64-bit form of the instruction, with the RSP register,
; to support stack pointers larger than 32 bits.)
sub rsp, 40
; Dynamically allocate 8 bytes of memory by calling malloc().
; (Note that the x64 calling convention passes the parameter in a register, rather
; than via the stack. On Windows, the first parameter is passed in RCX.)
; (Also note that we use the 32-bit form of the instruction here, storing the
; value into ECX, which is safe because it implicitly zeros the upper 32 bits.)
mov ecx, 8
call malloc
; Write a single 64-bit value into memory.
; (The pointer to the memory block allocated by malloc() is returned in RAX.)
mov qword ptr [rax], 1
; ... do whatever
; Clean up the stack space that we allocated at the top of the function.
add rsp, 40
If you wanted to do this in 32-bit halves, even on a 64-bit architecture, you certainly could. That would look like the following:
sub rsp, 40 ; set up stack
mov ecx, 8 ; request 8 bytes
call malloc ; allocate memory
mov dword ptr [rax], 1 ; write "1" into low 32 bits
mov dword ptr [rax+4], 2 ; write "2" into high 32 bits
; ... do whatever
add rsp, 40 ; clean up stack
Note that these last two MOV instructions do exactly what you wrote in the 32-bit version of the code; the only difference is the address register (RAX rather than EAX), because the pointer returned by malloc is a 64-bit value. That makes sense, because you're doing exactly the same thing.
The reason the code you originally wrote didn't work is that EAX doesn't contain a QWORD PTR; it contains a DWORD PTR. Hence, the assembler generated the "invalid instruction operands" error, because there was a mismatch. This is the same reason that you don't offset by 8: a DWORD PTR is only 4 bytes. A QWORD PTR is indeed 8 bytes, but you don't have one of those in EAX.
Or, if you wanted to write 16 bytes:
sub rsp, 40 ; set up stack
mov ecx, 16 ; request 16 bytes
call malloc ; allocate memory
mov qword ptr [rax], 1 ; write "1" into low 64 bits
mov qword ptr [rax+8], 2 ; write "2" into high 64 bits
; ... do whatever
add rsp, 40 ; clean up stack
Compare these three snippets of code, and make sure you understand the differences and why they need to be written as they are!

Unnecessary instructions generated for _mm_movemask_epi8 intrinsic in x64 mode

The intrinsic function _mm_movemask_epi8 from SSE2 is defined by Intel with the following prototype:
int _mm_movemask_epi8 (__m128i a);
This intrinsic function directly corresponds to the pmovmskb instruction, which is generated by all compilers.
According to this reference, the pmovmskb instruction can write the resulting integer mask to either a 32-bit or a 64-bit general purpose register in x64 mode. In any case, only 16 lower bits of the result can be nonzero, i.e. the result is surely within range [0; 65535].
Speaking of the intrinsic function _mm_movemask_epi8, its return value is of type int, which is a signed 32-bit integer on most platforms. Unfortunately, there is no alternative function which returns a 64-bit integer in x64 mode. As a result:
The compiler usually generates the pmovmskb instruction with a 32-bit destination register (e.g. eax).
The compiler cannot assume that the upper 32 bits of the whole register (e.g. rax) are zero.
The compiler inserts an unnecessary instruction (e.g. mov eax, eax) to zero the upper half of the 64-bit register when that register is later used as a 64-bit value (e.g. as an index into an array).
An example of code and generated assembly with such a problem can be seen in this answer. Also the comments to that answer contain some related discussion. I regularly experience this problem with MSVC2013 compiler, but it seems that it is also present on GCC.
The questions are:
Why is this happening?
Is there any way to reliably avoid generation of unnecessary instructions on popular compilers? In particular, when the result is used as an index, i.e. in x = array[_mm_movemask_epi8(xmmValue)];
What is the approximate cost of unnecessary instructions like mov eax, eax on modern CPU architectures? Is there any chance that these instructions are completely eliminated by the CPU internally, so that they do not actually occupy time on execution units (Agner Fog's instruction tables document mentions such a possibility)?
Why is this happening?
gcc's internal instruction definitions that tell it what pmovmskb does must be failing to inform it that the upper 32 bits of rax will always be zero. My guess is that it's treated like a function-call return value, where the ABI allows a function returning a 32-bit int to leave garbage in the upper 32 bits of rax.
GCC does know that 32-bit operations in general zero-extend for free, but this missed optimization is widespread for intrinsics, also affecting scalar intrinsics like _mm_popcnt_u32.
There's also the issue of gcc not knowing that the actual result has set bits only in the low 16 of its 32-bit int result (unless you used AVX2 vpmovmskb ymm). So sign extension is actually unnecessary; implicit zero extension is totally fine.
Is there any way to reliably avoid generation of unnecessary instructions on popular compilers? In particular, when the result is used as an index, i.e. in x = array[_mm_movemask_epi8(xmmValue)];
No, other than fixing gcc. Has anyone reported this as a compiler missed-optimization bug?
clang doesn't have this bug. I added code to Paul R's test to actually use the result as an array index, and clang is still fine.
gcc always either zero- or sign-extends (to a different register in this case, perhaps because it wants to "keep" the 32-bit value in the bottom of RAX, not because it's optimizing for mov-elimination).
Casting to unsigned helps with GCC6 and later: it will use the pmovmskb result directly as part of an addressing mode, although returning it still results in a mov rax, rdx.
With older GCC, the cast at least gets it to use mov instead of movsxd or cdqe.
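As a rough sketch of the difference (the register choices and the int array pointer in rdi are assumptions, not code from the linked answer):
# x = array[_mm_movemask_epi8(v)];  -- signed int index: gcc may sign-extend first
        pmovmskb  eax, xmm0
        cdqe                                 # sign-extend EAX into RAX (wasted: bits 16..31 are already 0)
        mov       eax, DWORD PTR [rdi+rax*4]
# x = array[(unsigned)_mm_movemask_epi8(v)];  -- GCC6+ folds the result into the addressing mode
        pmovmskb  eax, xmm0
        mov       eax, DWORD PTR [rdi+rax*4]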
What is the approximate cost of unnecessary instructions like mov eax, eax on modern CPU architectures? Is there any chance that these instructions are completely eliminated by the CPU internally, so that they do not actually occupy time on execution units (Agner Fog's instruction tables document mentions such a possibility)?
mov same,same is never eliminated on SnB-family microarchitectures or AMD zen. mov ecx, eax would be eliminated. See Can x86's MOV really be "free"? Why can't I reproduce this at all? for details.
Even if it doesn't take an execution unit, it still takes a slot in the fused-domain part of the pipeline, a slot in the uop cache, and code size. If you're close to the front-end limit of 4 fused-domain uops per clock (the pipeline width), then it's a problem.
It also costs an extra 1c of latency in the dep chain.
(Back-end throughput is not a problem, though. On Haswell and newer, it can run on port6 which has no vector execution units. On AMD, the integer ports are separate from the vector ports.)
gcc.godbolt.org is a great online resource for testing this kind of issue with different compilers.
clang seems to do the best with this, e.g.
#include <emmintrin.h>  // SSE2 intrinsics: __m128i, _mm_movemask_epi8
#include <cstdint>
int32_t test32(const __m128i v) {
    int32_t mask = _mm_movemask_epi8(v);
    return mask;
}
int64_t test64(const __m128i v) {
    int64_t mask = _mm_movemask_epi8(v);
    return mask;
}
generates:
test32(long long __vector(2)): # #test32(long long __vector(2))
vpmovmskb eax, xmm0
ret
test64(long long __vector(2)): # #test64(long long __vector(2))
vpmovmskb eax, xmm0
ret
Whereas gcc generates an extra cdqe instruction in the 64-bit case:
test32(long long __vector(2)):
vpmovmskb eax, xmm0
ret
test64(long long __vector(2)):
vpmovmskb eax, xmm0
cdqe
ret

Understanding optimized assembly code generated by gcc

I'm trying to understand what kind of optimizations are performed by gcc when the -O3 flag is set. I'm quite confused by these two lines:
xor %esi, %esi
lea 0x0(%esi), %esi
It seems redundant to me. What's the point of using the lea instruction here?
That instruction is used to fill space for alignment purposes. Loops can be faster when they start on aligned addresses, because the processor loads memory into the decoder in chunks. By aligning the beginnings of loops and functions, it becomes more likely that they will be at the beginning of one of these chunks. This prevents previous instructions which will not be used from being loaded, maximizes the number of future instructions that will, and, possibly most importantly, ensures that the first instruction is entirely in the first chunk, so it does not take two loads to execute it.
The compiler knows that it is best to align the loop, and has two options to do so. It can either place a jump to the beginning of the loop, or fill the gap with no-ops and let the processor flow through them. Jump instructions break the flow of instructions and often cause wasted cycles on modern processors, so adding them unnecessarily is inadvisable. For a short distance like this no-ops are better.
The x86 architecture contains an instruction specifically for the purpose of doing nothing, nop. However, this is one byte long, so it would take more than one to align the loop. Decoding each one and deciding it does nothing takes time, so it is faster to simply insert another longer instruction that has no side effects. Therefore, the compiler inserted the lea instruction you see. It has absolutely no effects, and is chosen by the compiler to have the exact length required. In fact, recent processors have standard multi-byte no-op instructions, so this will likely be recognized during decode and never even executed.
As explained by ughoavgfhw, these are padding for better code alignment.
You can find this lea in the following link -
http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2010-September/003881.html
quote:
1-byte: XCHG EAX, EAX
2-byte: 66 NOP
3-byte: LEA REG, 0 (REG) (8-bit displacement)
4-byte: NOP DWORD PTR [EAX + 0] (8-bit displacement)
5-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (8-bit displacement)
6-byte: LEA REG, 0 (REG) (32-bit displacement)
7-byte: NOP DWORD PTR [EAX + 0] (32-bit displacement)
8-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (32-bit displacement)
9-byte: NOP WORD PTR [EAX + EAX*1 + 0] (32-bit displacement)
Also note this SO question describing it in more detail -
What does NOPL do in x86 system?
Note that the xor itself is not a nop (it changes the value of the reg), but it is also very cheap to perform since it's a zero idiom - What is the purpose of XORing a register with itself?
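For reference, the likely encodings of the two lines from the question (assuming 32-bit code, as the %esi addressing mode suggests; the exact bytes gcc chose are not shown in the post):
31 f6                 xor    %esi,%esi          # 2 bytes: zeroing idiom, not a no-op
8d 76 00              lea    0x0(%esi),%esi     # 3 bytes: the disp8 lea form used as padding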

Resources