Allocating memory using malloc() in 32-bit and 64-bit assembly language

I have to implement a 64-bit stack. To get comfortable with malloc, I managed to write two 32-bit integers into memory and read them back:
But when I try to do this with 64 bits:

The first snippet of code works perfectly fine. As Jester suggested, you are writing a 64-bit value in two separate (32-bit) halves. This is the way you have to do it on a 32-bit architecture. You don't have 64-bit registers available, and you can't write 64-bit chunks of memory at once. But you already seemed to know that, so I won't belabor it.
In the second snippet of code, you tried to target a 64-bit architecture (x86-64). Now, you no longer have to write 64-bit values in two 32-bit halves, since 64-bit architectures natively support 64-bit integers. You have 64-bit wide registers available, and you can write a 64-bit chunk to memory directly. Take advantage of that to simplify (and speed up) the code.
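The 64-bit registers are named Rxx instead of Exx. When you store a register with QWORD PTR, you will want to use Rxx; when you store one with DWORD PTR, you will want to use Exx. Both sizes are legal in 64-bit code, but only 32-bit DWORDs are legal in 32-bit code. (The pointer inside the brackets is a separate matter: in 64-bit code it should always be a full 64-bit register such as RAX.)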
A couple of other things to note:
1. Although it is perfectly valid to clear a register using MOV xxx, 0, it is smaller and faster to use XOR eax, eax, so this is generally what you should write. It is a very old trick, something that any assembly-language programmer should know, and if you ever try to read other people's assembly programs, you'll need to be familiar with this idiom. (But actually, in the code you're writing, you don't need to do this at all. For the reason why, see point #2.)
2. In 64-bit mode, every instruction that writes a 32-bit register implicitly zeros the upper 32 bits of the corresponding 64-bit register, so you can simply write XOR eax, eax instead of XOR rax, rax. This is, again, smaller and faster.
3. The calling convention for 64-bit programs is different from the one used in 32-bit programs. The exact specification of the calling convention is going to vary, depending on which operating system you're using. As Peter Cordes commented, there is information on this in the x86 tag wiki. Both Windows and Linux x64 calling conventions pass at least the first 4 integer parameters in registers (rather than on the stack like the x86-32 calling convention), but which registers are actually used is different. Also, the 64-bit calling conventions have different requirements than do the 32-bit calling conventions for how you must set up the stack before calling functions.
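To make the zero-extension rule from the second point concrete, here is a small C model of what happens to a 64-bit register when one of its narrower names is written. It is only a sketch: the function names are invented for illustration and nothing here is generated by an assembler.

```c
#include <assert.h>
#include <stdint.h>

/* Writing a 32-bit register (e.g. EAX) in 64-bit mode clears bits 63..32. */
uint64_t write_32bit_subreg(uint64_t old_reg, uint32_t value)
{
    (void)old_reg;            /* the old upper half is discarded */
    return (uint64_t)value;
}

/* Writing a 16-bit register (e.g. AX) keeps bits 63..16 of the old value. */
uint64_t write_16bit_subreg(uint64_t old_reg, uint16_t value)
{
    return (old_reg & ~(uint64_t)0xFFFF) | value;
}
```

This is why mov ecx, 8 is enough to pass the value 8 in RCX, whereas a 16-bit write would leave stale upper bits behind.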
(Since your screenshot says something about "MASM", I'll assume that you're using Windows in the sample code below.)
; Set up the stack, as required by the Windows x64 calling convention.
; (Note that we use the 64-bit form of the instruction, with the RSP register,
; to support stack pointers larger than 32 bits.)
sub rsp, 40
; Dynamically allocate 8 bytes of memory by calling malloc().
; (Note that the x64 calling convention passes the parameter in a register, rather
; than via the stack. On Windows, the first parameter is passed in RCX.)
; (Also note that we use the 32-bit form of the instruction here, storing the
; value into ECX, which is safe because it implicitly zeros the upper 32 bits.)
mov ecx, 8
call malloc
; Write a single 64-bit value into memory.
; (The pointer to the memory block allocated by malloc() is returned in RAX.)
mov qword ptr [rax], 1
; ... do whatever
; Clean up the stack space that we allocated at the top of the function.
add rsp, 40
If you wanted to do this in 32-bit halves, even on a 64-bit architecture, you certainly could. That would look like the following:
sub rsp, 40 ; set up stack
mov ecx, 8 ; request 8 bytes
call malloc ; allocate memory
mov dword ptr [rax], 1 ; write "1" into low 32 bits
mov dword ptr [rax+4], 2 ; write "2" into high 32 bits
; ... do whatever
add rsp, 40 ; clean up stack
Note that these last two MOV instructions are the same as what you wrote in the 32-bit version of the code, except that the pointer is now the full 64-bit RAX returned by malloc() rather than EAX. That makes sense, because you're doing exactly the same thing.
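The code you originally wrote most likely failed because of a mismatch between the operand size and the registers involved. The size keyword describes the data being written, not the pointer: a QWORD PTR store needs 64-bit code, and in 64-bit code the pointer returned by malloc() is the full 64-bit RAX, of which EAX holds only the low half. A mismatch of this kind is what the assembler's "invalid instruction operands" error reports. The same reasoning explains the offsets: two DWORDs sit 4 bytes apart, because a DWORD is only 4 bytes. Two QWORDs sit 8 bytes apart, but they need a 16-byte allocation.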
Or, if you wanted to write 16 bytes:
sub rsp, 40 ; set up stack
mov ecx, 16 ; request 16 bytes
call malloc ; allocate memory
mov qword ptr [rax], 1 ; write "1" into low 64 bits
mov qword ptr [rax+8], 2 ; write "2" into high 64 bits
; ... do whatever
add rsp, 40 ; clean up stack
Compare these three snippets of code, and make sure you understand the differences and why they need to be written as they are!
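For comparison, here is roughly what the two 8-byte snippets do, expressed in C. This is a sketch rather than a translation of anyone's actual code: the function names are made up, and the read-back values assume a little-endian machine such as x86-64.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Like: mov ecx, 8 / call malloc / mov qword ptr [rax], value */
uint64_t store_one_qword(uint64_t value)
{
    uint64_t *p = malloc(sizeof *p);   /* request 8 bytes */
    assert(p != NULL);
    *p = value;                        /* one 64-bit store */
    uint64_t readback = *p;
    free(p);
    return readback;
}

/* Like: mov dword ptr [rax], low / mov dword ptr [rax+4], high */
uint64_t store_two_dwords(uint32_t low, uint32_t high)
{
    unsigned char *p = malloc(8);      /* same 8-byte block */
    assert(p != NULL);
    memcpy(p, &low, sizeof low);       /* bytes 0..3 */
    memcpy(p + 4, &high, sizeof high); /* bytes 4..7: offset 4, not 8 */
    uint64_t readback;
    memcpy(&readback, p, sizeof readback);
    free(p);
    return readback;
}
```

On a little-endian machine, store_two_dwords(1, 0) reads back the same value as store_one_qword(1), which is why the two approaches are interchangeable for the same 8-byte block.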

Related

When a push or pop instruction is being executed then what is the byte size of the value that is being pushed or popped? [duplicate]

I can push 4 bytes onto the stack by doing this:
push DWORD 123
But I have found out that I can use push without specifying the operand size:
push 123
In this case, how many bytes does the push instruction push onto the stack? Does the number of bytes pushed depend on the operand size (so in my example it will push 1 byte)?

Does the number of bytes pushed depend on the operand size
It doesn't depend on the value of the number. The technical x86 term for how many bytes push pushes is "operand-size", but that's a separate thing from whether the number fits in an imm8 or not.
See also Does each PUSH instruction push a multiple of 8 bytes on x64?
(so in my example it will push 1 byte)?
No, the size of the immediate is not the operand-size. It always pushes 4 bytes in 32-bit code, or 8 bytes in 64-bit code, unless you do something weird.
Recommendation: always just write push 123 or push 0x12345 to use the default push size for the mode you're in and let the assembler pick the encoding. That is almost always what you want. If that's all you wanted to know, you can stop reading now.
First of all, it's useful to know what sizes of push are even possible in x86 machine code:
In 16-bit mode, you can push 16 or (with operand-size prefix on 386 and later) 32 bits.
In 32-bit mode, you can push 32 or (with operand-size prefix) 16 bits.
In 64-bit mode, you can push 64 or (with operand-size prefix) 16 bits.
A REX.W=0 prefix does not let you encode a 32-bit push (see footnote 1).
There are no other options. The stack pointer is always decremented by the operand-size of the push (see footnote 2). (So it's possible to "misalign" the stack by pushing 16 bits). pop has the same choices of size: 16, 32, or 64, except no 32-bit pop in 64-bit mode.
This applies whether you're pushing a register or an immediate, and regardless of whether the immediate fits in a sign-extended imm8 or it needs an imm32 (or imm16 for 16-bit pushes). (A 64-bit push imm32 sign-extends to 64-bit. There is no push imm64, only mov reg, imm64)
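The size rules above fit in a small lookup. The following C sketch encodes them; the function name and parameters are invented for illustration, and it deliberately ignores the special cases discussed in the footnotes (such as segment registers).

```c
#include <assert.h>

/* Bytes pushed by one push instruction.
   mode_bits: 16, 32 or 64.
   has_opsize_prefix: nonzero if a 0x66 operand-size prefix is present. */
int push_size_bytes(int mode_bits, int has_opsize_prefix)
{
    assert(mode_bits == 16 || mode_bits == 32 || mode_bits == 64);
    switch (mode_bits) {
    case 16: return has_opsize_prefix ? 4 : 2;  /* prefix selects 32-bit (386+) */
    case 32: return has_opsize_prefix ? 2 : 4;
    default: return has_opsize_prefix ? 2 : 8;  /* no 32-bit push in 64-bit mode */
    }
}
```

Note that the size of the immediate never appears as an input: only the mode and the prefix decide how far the stack pointer moves.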
In NASM source code, push 123 assembles to the operand-size that matches the mode you're in. In your case, I think you're writing 32-bit code, so push 123 is a 32-bit push, even though it can (and does) use the push imm8 encoding.
Your assembler always knows what kind of code it's assembling, since it has to know when to use or not use operand-size prefixes when you do force the operand-size.
MASM is the same; the only thing that might be different is the syntax for forcing a different operand-size.
Anything you write in assembler will assemble to one of the valid machine-code options (because the people that wrote the assembler know what is and isn't encodeable), so no, you can't push a single byte with a push instruction. If you wanted that, you could emulate it with dec esp / mov byte [esp], 123
NASM Examples:
Output from nasm -l /dev/stdout to dump a listing to the terminal, along with the original source line.
Lightly edited to separate opcode and prefix bytes from the operands. (Unlike objdump -drwC -Mintel, NASM's disassembly format doesn't leave spaces between bytes in the machine-code hexdump).
68 80000000 push 128
6A 80 push -128 ;; signed imm8 is -128 to +127
6A 7B push byte 123
6A 7B push dword 123 ;; still optimized to the imm8 encoding
68 7B000000 push strict dword 123
6A 80 push strict byte 0x80 ;; will decode as push -128
****************** warning: signed byte value exceeds bounds [-w+number-overflow]
dword is normally an operand-size thing, while strict dword is how you request that the assembler doesn't optimize it to a smaller encoding.
All the preceding instructions are 32-bit pushes (or 64-bit in 64-bit mode, with the same machine code). All the following instructions are 16-bit pushes, regardless of what mode you assemble them in. (If assembled in 16-bit mode, they won't have a 0x66 operand-size prefix)
66 6A 7B push word 123
66 68 8000 push word 128
66 68 7B00 push strict word 123
NASM seems to treat the byte and dword overrides as applying to the size of the immediate, while word applies to the operand-size of the instruction. In fact, using o32 push 12 in 64-bit mode doesn't get a warning either. push eax does get rejected, though: "error: instruction not supported in 64-bit mode".
Notice that push imm8 is encoded as 6A ib in all modes. With no operand-size prefix, the operand size is the mode's size. (e.g. 6A FF decodes in long mode as a 64-bit operand-size push with an operand of -1, decrementing RSP by 8 and doing an 8-byte store.)
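As a sanity check on the 6A ib decoding, here is a C sketch of the two effects described above: the immediate byte is sign-extended to the operand size, and in 64-bit mode RSP drops by 8. The function names are hypothetical and only model the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the ib byte of a "6A ib" push to the value that gets stored. */
int64_t push_imm8_value(uint8_t ib)
{
    return ib >= 0x80 ? (int64_t)ib - 256 : (int64_t)ib;
}

/* New stack pointer after an unprefixed push in 64-bit mode. */
uint64_t rsp_after_push64(uint64_t rsp)
{
    return rsp - 8;
}
```

So 6A FF stores -1 and 6A 80 stores -128, matching the NASM listing, while 6A 7B stores 123.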
The address-size prefix only affects the explicit addressing mode used for push with a memory-source, e.g. in 64-bit mode: push qword [rsi] (no prefixes) vs. push qword [esi] (address-size prefix for 32-bit addressing mode). push dword [rsi] is not encodeable, because nothing can make the operand-size 32-bit in 64-bit code (see footnote 1). push qword [esi] does not truncate rsp to 32-bit. Apparently "Stack Address Width" is a different thing, probably set in a segment descriptor. (It's always 64 in 64-bit code on a normal OS, I think even for Linux's x32 ABI: ILP32 in long mode.)
When would you ever want to push 16 bits? If you're writing in asm for performance reasons, then probably never. In my code-golf adler32, a narrow push -> wide pop took fewer bytes of code than shift/OR to combine two 16b integers into a 32b value.
Or maybe in an exploit for 64-bit code, you might want to push some data onto the stack without gaps. You can't just use push imm32, because that sign-extends to 64-bit. You could do it in 16-bit chunks with multiple 16-bit push instructions. But it's still probably more efficient to mov rax, imm64 / push rax (10B+1B = 11B for an 8B imm payload). Or push 0xDEADBEEF / mov dword [rsp+4], 0xDEADC0DE (5B + 8B = 13B and doesn't need a register). Four 16-bit pushes would take 16B.
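The push imm32 / mov dword [rsp+4] trick can be checked with plain integer arithmetic. This C sketch models the 8-byte stack slot as a uint64_t (little-endian layout assumed, function names invented for illustration).

```c
#include <assert.h>
#include <stdint.h>

/* Value stored by a 64-bit "push imm32": the immediate is sign-extended. */
uint64_t push_imm32_slot(uint32_t imm)
{
    uint64_t slot = imm;
    if (imm & 0x80000000u)
        slot |= 0xFFFFFFFF00000000ULL;
    return slot;
}

/* Effect of "mov dword [rsp+4], high" on that 8-byte slot. */
uint64_t overwrite_high_dword(uint64_t slot, uint32_t high)
{
    return (slot & 0xFFFFFFFFULL) | ((uint64_t)high << 32);
}
```

After the push alone the slot holds 0xFFFFFFFFDEADBEEF. The second instruction replaces the sign-extension bytes, leaving 0xDEADC0DEDEADBEEF with no gap.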
Footnotes:
1. In fact REX.W=0 is ignored, and doesn't modify the operand-size away from its default 64-bit. NASM, YASM, and GAS all assemble push r12 to 41 54, not 49 54. GNU objdump thinks 49 54 is unusual, and decodes it as 49 54 rex.WB push r12. (Both execute the same). Microsoft agrees as well, using a 40h REX as padding on push rbx in some Windows DLLs.
Intel just says that 32-bit pushes are "not encodeable" (N.E. in the table) in long mode. I don't understand why W=1 isn't the standard encoding for push / pop when a REX prefix is needed, but apparently the choice is arbitrary.
Fun-fact: only stack instructions and a few others default to 64-bit operand size in 64-bit mode. In machine code, add rax, rdx needs a REX prefix (with the W bit set). Otherwise it would decode as add eax, edx. But you can't decrease the operand-size with a REX.W=0 when it defaults to 64-bit, only increase it when it defaults to 32.
http://wiki.osdev.org/X86-64_Instruction_Encoding#REX_prefix lists the instructions that default to 64-bit in 64-bit mode. Note that jrcxz doesn't strictly belong in that list, because the register it checks (cx/ecx/rcx) is determined by address-size, not operand-size, so it can be overridden to 32-bit (but not 16-bit) in 64-bit mode. loop is the same.
It's strange that Intel's instruction reference manual entry for push (HTML extract: http://felixcloutier.com/x86/PUSH.html) shows what would happen for a 32-bit operand-size push in 64-bit mode (the only case where stack address width can be 64, so it uses rsp). Perhaps it's achievable somehow with some non-standard settings in the code-segment descriptor, so you can't do it in normal 64-bit code running under a normal OS. Or more likely it's an oversight, and that's what would happen if it was encodeable, but it's not.
2. Except segment registers are 16-bit, but a normal push fs will still decrement the stack pointer by the stack-width (operand-size). Intel documents that recent Intel CPUs only do a 16b store in that case, leaving the rest of the 32 or 64b unmodified.
x86 doesn't officially have a stack width that's enforced in hardware. It's a software / calling convention term, e.g. char and short args passed on the stack in any calling conventions are padded out to 4B or 8B, so the stack stays aligned. (Modern 32 and 64-bit calling conventions such as the x86-32 System V psABI used by Linux keep the stack 16B aligned before function calls, even though an arg "slot" on the stack is still only 4B). Anyway, "stack width" is only a programming convention on any architecture.
The closest thing in the x86 ISA to a "stack width" is the default operand-size of push/pop. But you can manipulate the stack pointer however you want, e.g. sub esp,1. You can, but don't for performance reasons :P

The "stack width" in a computer, which is the smallest amount of data that can be pushed onto the stack, is defined to be the register size of the processor. This means that if you are dealing with a processor with 16 bit registers, the stack width will be 2 bytes. If the processor has 32 bit registers, the stack width is 4 bytes. If the processor has 64 bit registers, the stack width is 8 bytes.
Don't be confused when using modern x86/x86_64 systems; if the system is running in a 32 bit mode, the stack width and register size is 32 bits or 4 bytes. If you switch to 64 bit mode, then and only then will the register and stack size change.

How does RIP-relative addressing perform compared to mov reg, imm64?

It is a known fact that x86-64 instructions do not support 64-bit immediate values (except for mov). Hence, when migrating code from 32 to 64 bits, an instruction like this:
cmp rax, addr32
cannot be replaced with the following:
cmp rax, addr64
Under these circumstances, I'm considering two alternatives: (a) using a scratch register for loading the constant or (b) using rip-relative addressing. The two approaches look like this:
mov r11, addr64 ; scratch register
cmp rax, r11
ptr64: dq addr64
...
cmp rax, [rel ptr64] ; encoded as cmp rax, [rip+offset]
I wrote a very simple loop to compare the performance of both approaches (which I paste below). While (b) uses an indirect pointer, (a) has the immediate encoded in the instruction (which could lead to worse usage of the i-cache). Surprisingly, I found that (b) ran ~10% faster than (a). Is this result something to be expected in more common real-world code?
true: dq 0xFFFF0000FFFF0000
false: dq 0xAAAABBBBAAAABBBB
main:
or rax, 1 ; rax is odd and constant "true" is even
mov rcx, 0x1
shl rcx, 30
branch:
mov r11, 0xFFFF0000FFFF0000 ; not present in (b)
cmp rax, r11 ; vs cmp rax, [rel true]
je next
add rax, 2
loop branch
next:
mov rax, 0
ret

Surprisingly, I found that (b) ran ~10% faster than (a)
You probably tested on a CPU other than AMD Bulldozer-family or Ryzen, which have a fast loop instruction. On other CPUs, loop is very slow, mostly on purpose for historical reasons, so you bottleneck on it. e.g. 7 uops, one per 5c throughput on Haswell.
mov r64, imm64 is bad for uop cache throughput because the large immediate takes 2 slots in Intel's uop cache. (See the Sandybridge uop cache section in Agner Fog's microarch pdf, and Which is faster, imm64 or m64 for x86-64? where I listed the details.)
Even apart from that, it's not too surprising that 1 extra uop in the loop makes it run slower. You're probably not on an AMD CPU (with single-uop / 1 per 2 clock loop), because the extra mov in such a tiny loop would make more than a 10% difference there. Or it would make no difference at all, since it's just 3 vs. 4 uops per 2 clocks, if it's correct that even tiny loops built on the loop instruction are limited to one jump per 2 clocks.
On Intel, loop is 7 uops, one per 5 clocks throughput on most CPUs, so the 4-per-clock issue/rename bottleneck won't be what you're hitting. loop is micro-coded, so the front-end can't run from the loop buffer. (And Skylake CPUs have their LSD disabled by a microcode update to fix the partial-register erratum anyway.) So the mov r64,imm64 uop has to be re-read from the uop cache every time through the loop.
A load that hits in cache has very good throughput (2 loads per clock, and in this case micro-fusion means no extra uops to use a memory operand instead of register for cmp). So the main penalty in using a constant from memory is the extra cache footprint and cache misses, but your microbenchmark won't reveal that at all. It also has no other pressure on the load ports.
In the general case:
If possible, use a RIP-relative lea to generate 64-bit address constants.
e.g. lea rax, [rel addr64]. Yes, this takes an extra instruction to get the constant into a register. (BTW, just use default rel. You can use [abs fs:0] if you need it.)
You can avoid the extra instruction if you build position-dependent code with the default (small) code model, so static addresses fit in the low 32 bits of virtual address space and can be used as immediates. (Actually low 2GiB, so sign or zero extending both work). See 32-bit absolute addresses no longer allowed in x86-64 Linux? if gcc complains about absolute addressing; -pie is enabled by default on most distros. This of course doesn't work in Linux shared libraries, which only support text relocations for 64-bit addresses. But you should avoid relocations whenever possible by using lea to make position-independent code.
Most integer build-time constants fit in 32 bits, so you can use cmp r64, imm32 or cmp r32, imm32 even in PIC code.
If you do need a 64-bit non-address constant, try to hoist the mov r64, imm64 out of a loop. Your cmp loop would have been fine if the mov wasn't inside the loop. x86-64 has enough registers that you (or the compiler) can usually avoid reloads inside inner-most loops in integer code.
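Whether a constant can be used directly as an immediate comes down to a range check. The following C sketch shows the two checks implied above; the function names are made up, and this is not how any particular assembler implements it.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if v can be encoded as a sign-extended imm32 (e.g. cmp r64, imm32). */
int fits_sign_extended_imm32(int64_t v)
{
    return v >= INT32_MIN && v <= INT32_MAX;
}

/* Nonzero if an address lies in the low 2GiB, where sign- and
   zero-extension of a 32-bit value give the same 64-bit result. */
int in_low_2gib(uint64_t addr)
{
    return addr < 0x80000000ULL;
}
```

A constant that fails the first check needs either mov r64, imm64 (ideally hoisted out of the loop) or a load from memory.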

Mach-O 64-bit format does not support 32-bit absolute addresses. NASM Accessing Array

Running this code off my Mac computer, using command:
nasm -f macho64 -o max.a maximum.asm
This is the code I am attempting to run on my computer that finds the largest number inside an array.
section .data
data_items:
dd 3,67,34,222,45,75,54,34,44,33,22,11,66,0
section .text
global _start
_start:
mov edi, 0
mov eax, [data_items + edi*4]
mov ebx, eax
start_loop:
cmp eax, 0
je loop_exit
inc edi
mov eax, [data_items + edi*4]
cmp eax, ebx
jle start_loop
mov ebx, eax
jmp start_loop
loop_exit:
mov eax, 1
int 0x80
Error:
maximum.asm:14: error: Mach-O 64-bit format does not support 32-bit absolute addresses
maximum.asm:21: error: Mach-O 64-bit format does not support 32-bit absolute addresses

First of all, beware of NASM bugs with the macho64 output format with 64-bit absolute addressing (NASM 2.13.02+) and with RIP-relative in NASM 2.11.08. 64-bit absolute addressing is not recommended, so this answer should work even for buggy NASM 2.13.02 and higher. (The bugs don't cause this error, they lead to wrong addresses being used at runtime.)
[data_items + edi*4] is a 32-bit addressing mode. Even [data_items + rdi*4] can only use a 32-bit absolute displacement, so it wouldn't work either. Note that using an address as a 32-bit (sign-extended) immediate like cmp rdi, data_items is also a problem: only mov allows a 64-bit immediate.
64-bit code on OS X can't use 32-bit absolute addressing at all. Executables are loaded at a base address above 4GiB, so label addresses just plain don't fit in 32-bit integers, with zero- or sign-extension. RIP-relative addressing is the best / most efficient solution, whether you need it to be position-independent or not (see Footnote 1).
In NASM, default rel at the top of your file will make all [] memory operands prefer RIP-relative addressing. See also Section 3.3 Effective Addresses in the NASM manual.
default rel ; near the top of file; affects all instructions
my_func:
...
mov ecx, [data_items] ; uses the default: RIP-relative
;mov ecx, [abs data_items] ; override to absolute [disp32], unusable
mov ecx, [rel data_items] ; explicitly RIP-relative
But RIP-relative is only possible when there are no other registers involved, so for indexing a static array you need to get the address in a register first. Use a RIP-relative lea rsi, [rel data_items].
lea rsi, [data_items] ; can be outside the loop
...
mov eax, [rsi + rdi*4]
Or you could add rsi, 4 inside the loop and use a simpler addressing mode like mov eax, [rsi].
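In C terms, lea rsi, [data_items] followed by mov eax, [rsi + rdi*4] is just indexing through a base pointer. The sketch below mirrors the loop from the question under that reading; the function name is invented, and it assumes the array is terminated by a 0 entry as in the original data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Largest element of a zero-terminated array, mirroring the loop in the question. */
int32_t max_until_zero(const int32_t *items)
{
    assert(items != NULL);
    size_t i = 0;                 /* plays the role of rdi */
    int32_t current = items[i];   /* mov eax, [rsi + rdi*4] */
    int32_t largest = current;    /* mov ebx, eax */
    while (current != 0) {        /* cmp eax, 0 / je loop_exit */
        i++;
        current = items[i];
        if (current > largest)    /* jle skips the update */
            largest = current;
    }
    return largest;
}
```

The base pointer is computed once, outside the loop, and only the index changes per iteration, which is the same split the lea version makes.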
Note that mov rsi, data_items will work for getting an address into a register, but you don't want that because it's less efficient.
Technically, any address within +-2GiB of your array will work, so if you have multiple arrays you can address the others relative to one common base address, only tying up one register with a pointer. e.g. lea rbx, [arr1] / ... / mov eax, [rbx + rdi*4 + arr2-arr1]. Relative Addressing errors - Mac 10.10 mentions that Agner Fog's "optimizing assembly" guide has some examples of array addressing, including one using the __mh_execute_header as a reference point. (The code in that question looks like another attempt to port this 32-bit Linux example from the PGU book to 64-bit OS X, at the same time as learning asm in the first place.)
Note that on Linux, position-dependent executables are loaded in the low 32 bits of virtual address space, so you will see code like mov eax, [array + rdi*4] or mov edi, symbol_name in Linux examples or compiler output on http://gcc.godbolt.org/. gcc -pie -fPIE will make position-independent executables on Linux, and is the default on many recent distros, but not Godbolt.
This doesn't help you on MacOS, but I mention it in case anyone's confused about code they've seen for other OSes, or why AMD64 architects bothered to allow [disp32] addressing modes at all on x86-64.
And BTW, prefer using 64-bit addressing modes in 64-bit code. e.g. use [rsi + rdi*4], not [esi + edi*4]. You usually don't want to truncate pointers to 32-bit, and it costs an extra address-size prefix to encode.
Similarly, you should be using syscall to make 64-bit system calls, not int 0x80. See What are the calling conventions for UNIX & Linux system calls on i386 and x86-64 for the differences in which registers to pass args in.
Footnote 1:
64-bit absolute addressing is supported on OS X, but only in position-dependent executables (non-PIE). This related question x64 nasm: pushing memory addresses onto the stack & call function includes an ld warning from using gcc main.o to link:
ld: warning: PIE disabled. Absolute addressing (perhaps -mdynamic-no-pic) not
allowed in code signed PIE, but used in _main from main.o. To fix this warning,
don't compile with -mdynamic-no-pic or link with -Wl,-no_pie
So the linker checks if any 64-bit absolute relocations are used, and if so disables creation of a Position-Independent Executable. A PIE can benefit from ASLR for security. I think shared-library code always has to be position-independent on OS X; I don't know if jump tables or other cases of pointers-as-data are allowed (i.e. fixed up by the dynamic linker), or if they need to be initialized at runtime if you aren't making a position-dependent executable.
mov r64, imm64 is larger (10 bytes) and not faster than lea r64, [RIP_rel32] (7 bytes).
So you could use mov rsi, qword data_items, but a RIP-relative LEA runs about as fast and takes less space in code caches and the uop cache. 64-bit immediates also have a uop-cache fetch penalty on Sandybridge-family (http://agner.org/optimize/): they take 2 cycles to read from a uop cache line instead of 1.
x86 also has a form of mov that loads/store from/to a 64-bit absolute address, but only for AL/AX/EAX/RAX. See http://felixcloutier.com/x86/MOV.html. You don't want this either, because it's larger and not faster than mov eax, [rel foo].
(Related: an AT&T syntax version of the same question)

Understanding optimized assembly code generated by gcc

I'm trying to understand what kind of optimizations gcc performs when the -O3 flag is set. I'm quite confused by these two lines:
xor %esi, %esi
lea 0x0(%esi), %esi
They seem redundant to me. What's the point of using the lea instruction here?

That instruction is used to fill space for alignment purposes. Loops can be faster when they start on aligned addresses, because the processor loads memory into the decoder in chunks. By aligning the beginnings of loops and functions, it becomes more likely that they will be at the beginning of one of these chunks. This prevents previous instructions which will not be used from being loaded, maximizes the number of future instructions that will, and, possibly most importantly, ensures that the first instruction is entirely in the first chunk, so it does not take two loads to execute it.
The compiler knows that it is best to align the loop, and has two options to do so. It can either place a jump to the beginning of the loop, or fill the gap with no-ops and let the processor flow through them. Jump instructions break the flow of instructions and often cause wasted cycles on modern processors, so adding them unnecessarily is inadvisable. For a short distance like this no-ops are better.
The x86 architecture contains an instruction specifically for the purpose of doing nothing, nop. However, this is one byte long, so it would take more than one to align the loop. Decoding each one and deciding it does nothing takes time, so it is faster to simply insert another longer instruction that has no side effects. Therefore, the compiler inserted the lea instruction you see. It has absolutely no effects, and is chosen by the compiler to have the exact length required. In fact, recent processors have standard multi-byte no-op instructions, so this will likely be recognized during decode and never even executed.
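The amount of filler the compiler has to emit is simple arithmetic on the current address. Here is a C sketch of that calculation, assuming a power-of-two alignment such as 16; the function name is hypothetical and this is not taken from gcc's source.

```c
#include <assert.h>
#include <stdint.h>

/* Padding bytes needed so that the next instruction starts at a multiple of
   `alignment`, which must be a power of two (e.g. 16). */
uint64_t padding_for_alignment(uint64_t address, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (alignment - (address & (alignment - 1))) & (alignment - 1);
}
```

For example, code ending at address 0x100A needs 6 bytes of filler to reach 0x1010, which a single long no-op-like instruction can cover instead of six one-byte nops.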

As explained by ughoavgfhw - these are paddings for better code alignment.
You can find this lea in the following link -
http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2010-September/003881.html
quote:
1-byte: XCHG EAX, EAX
2-byte: 66 NOP
3-byte: LEA REG, 0 (REG) (8-bit displacement)
4-byte: NOP DWORD PTR [EAX + 0] (8-bit displacement)
5-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (8-bit displacement)
**6-byte: LEA REG, 0 (REG) (32-bit displacement)**
7-byte: NOP DWORD PTR [EAX + 0] (32-bit displacement)
8-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (32-bit displacement)
9-byte: NOP WORD PTR [EAX + EAX*1 + 0] (32-bit displacement)
Also note this SO question describing it in more details -
What does NOPL do in x86 system?
Note that the xor itself is not a nop (it changes the value of the reg), but it is also very cheap to perform since it's a zero idiom - What is the purpose of XORing a register with itself?

How does writing to CPU register actually work?

When writing to a register, say, like mov ax, 1, it overwrites the value it may have had earlier.
What I wonder is how large a number or string I can put into a register, and whether another application can overwrite my app's register values. I mean, are the registers shared among processes, or does each process get its own sandboxed/virtual registers?
I am interested in Intel x86(-64) Core CPUs and Windows.

Only one thread is scheduled at a time on a single core. The core is what has the registers.
When a new thread is scheduled, the registers are first saved, and the previously-saved registers of the thread are restored. This includes the Program Counter register, which points to the next instruction to execute.
Registers (from memory):
AX, BX, CX, DX are 16 bits, broken into bytes (AH, AL, BH, BL)
SI, DI, SP and BP are also 16 bits
EAX, EBX, ECX etc. are 32 bits
I'm not sure what they're called on a 64-bit system. I think I saw RAX, but I'm not sure.
There are also special-purpose registers, floating-point registers, etc.

1) The size of registers depends (in well-defined ways) on what names you're using for them. For instance, eax is 32 bits wide, ax is 16 bits, and ah/al are 8 bits. If you're on a 64-bit system, rax is 64 bits wide.
The exact limits of these register sizes will depend somewhat on how you're interpreting the values (in particular, whether you're treating them as signed or unsigned). The size is what fundamentally matters, though.
2) The operating system kernel will save your process's registers while other processes, or the kernel, are running. The registers do take on other values while you're not running, but it's all transparent -- while your process is running, registers won't change out from under you.
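The relationship between the register names in point 1 can be modelled with masks and shifts. This C sketch treats rax as a plain 64-bit integer and extracts the narrower views; the helper names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Narrower views of the same 64-bit register value. */
uint32_t eax_of(uint64_t rax) { return (uint32_t)(rax & 0xFFFFFFFFu); }  /* low 32 bits */
uint16_t ax_of(uint64_t rax)  { return (uint16_t)(rax & 0xFFFFu); }      /* low 16 bits */
uint8_t  al_of(uint64_t rax)  { return (uint8_t)(rax & 0xFFu); }         /* bits 7..0 */
uint8_t  ah_of(uint64_t rax)  { return (uint8_t)((rax >> 8) & 0xFFu); }  /* bits 15..8 */
```

With rax holding 0x1122334455667788, eax is 0x55667788, ax is 0x7788, al is 0x88 and ah is 0x77: the names select different widths of one register, not separate storage.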
