How does writing to CPU register actually work?

When I write to a register, say with mov ax, 1, it overwrites whatever value the register held before.
What I wonder is how large a number or string I can put into a register, and whether another application can overwrite my app's register values. Are the registers shared among processes, or does each process get its own sandboxed/virtual registers?
I am interested in Intel x86(-64) Core CPUs and Windows.

Only one thread is scheduled at a time on a single core. The core is what has the registers.
When a new thread is scheduled, the registers are first saved, and the previously-saved registers of the thread are restored. This includes the Program Counter register, which points to the next instruction to execute.
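The save/restore step described above can be sketched in C. This is only a toy model: the `cpu_regs` and `thread_ctx` types and the `context_switch` function are hypothetical names invented for illustration, and a real kernel does this in assembly with many more registers (flags, segment, floating-point/SIMD state, and so on).

```c
#include <stdint.h>

/* Hypothetical, heavily simplified register set of one core. */
typedef struct {
    uint64_t rax, rbx, rcx, rdx;
    uint64_t rsp;
    uint64_t rip; /* program counter: address of the next instruction */
} cpu_regs;

/* Hypothetical per-thread record kept by the scheduler. */
typedef struct {
    cpu_regs saved; /* register values from when the thread last ran */
} thread_ctx;

/* Save the outgoing thread's registers, then load the incoming thread's.
 * After this call the core holds exactly what `to` had when it was
 * switched out, so `to` never observes the values used by `from`. */
void context_switch(cpu_regs *core, thread_ctx *from, thread_ctx *to) {
    from->saved = *core;
    *core = to->saved;
}
```

Because each thread's values live in its own saved copy while it is not running, the single physical set of registers can be reused without threads seeing each other's data.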
Registers (from memory):
AX, BX, CX, DX are 16 bits, broken into bytes (AH, AL, BH, BL)
SI, DI, SP and BP are also 16 bits
EAX, EBX, ECX etc. are 32 bits
On a 64-bit system these are extended to 64 bits and named RAX, RBX, RCX, etc., and there are eight additional registers, R8 to R15.
There are also special-purpose registers, floating-point registers, etc.

1) The size of registers depends (in well-defined ways) on what names you're using for them. For instance, eax is 32 bits wide, ax is 16 bits, and ah/al are 8 bits. If you're on a 64-bit system, rax is 64 bits wide.
The exact limits of these register sizes will depend somewhat on how you're interpreting the values (in particular, whether you're treating them as signed or unsigned). The size is what fundamentally matters, though.
2) The operating system kernel will save your process's registers while other processes, or the kernel, are running. The registers do take on other values while you're not running, but it's all transparent -- while your process is running, registers won't change out from under you.
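To make point 1 concrete, here is a small C model of how the differently named views of one register relate. The function names are made up for this sketch; the behaviour modelled is that 8-bit and 16-bit writes keep the remaining bits of the full register, while a 32-bit write in 64-bit mode zeroes the upper 32 bits.

```c
#include <stdint.h>

/* mov al, v : replaces bits 0..7 of RAX, keeps the rest. */
uint64_t write_low8(uint64_t rax, uint8_t v) {
    return (rax & ~0xFFULL) | v;
}

/* mov ax, v : replaces bits 0..15 of RAX, keeps the rest. */
uint64_t write_low16(uint64_t rax, uint16_t v) {
    return (rax & ~0xFFFFULL) | v;
}

/* mov eax, v : in 64-bit mode the upper 32 bits of RAX become zero. */
uint64_t write_low32(uint64_t rax, uint32_t v) {
    (void)rax; /* the old value does not survive */
    return (uint64_t)v;
}

/* Reading AH: bits 8..15 of the full register. */
uint8_t read_ah(uint64_t rax) {
    return (uint8_t)(rax >> 8);
}

/* The same 16 bits read as a signed value: 0..65535 becomes -32768..32767. */
int32_t as_signed16(uint16_t bits) {
    return bits >= 0x8000 ? (int32_t)bits - 0x10000 : (int32_t)bits;
}
```

So a 16-bit register such as ax holds 0 to 65535 when treated as unsigned, or -32768 to 32767 when treated as signed; the bit pattern is the same and only the interpretation differs.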

Are some general purpose registers faster than others?

In x86-64, will certain instructions execute faster if some general purpose registers are preferred over others?
For instance, would mov eax, ecx execute faster than mov r8d, ecx? I can imagine that the latter would need a REX prefix which would make the instruction fetch slower?
What about using rax instead of rcx? What about add or xor? Other operations? Smaller registers like r15b vs al? al vs ah?
AMD vs Intel? Newer processors? Older processors? Combinations of instructions?
Clarification: Should certain general purpose registers be preferred over others, and which ones are they?

In general, architectural registers are all equal, and renamed onto a large array of physical registers.
(Except that partial registers can be slower, especially the high-byte registers AH/BH/CH/DH, which are slow to read after writing the full register on Haswell and later. See "How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent" and also "Why doesn't GCC use partial registers?" for problems when writing 8-bit and 16-bit registers. The rest of this answer is just going to consider 32/64-bit operand-size.)
But some instructions require specific registers: legacy variable-count shifts (without BMI2 shrx etc.) require the count in CL, and division requires the dividend in EDX:EAX (or RDX:RAX for the slower 64-bit version).
Using a call-preserved register like RBX means your function has to spend extra instructions saving/restoring it.
But of course there are perf differences if you need more instructions. So let's assume all else is equal, and just talk about the uops, latency, and code-size of a single instruction when changing only which register is used for one of its operands. TL;DR: the only perf difference is due to instruction-encoding restrictions / differences. Sometimes a different register will allow / require (or get the assembler to pick) a different encoding, which will often be smaller / larger as a special case, and sometimes even executes differently.
Generally smaller code is faster, and packs better in the uop cache and I-cache, so unless you've analyzed a specific case and found a problem, favour the smaller encoding. Often that means keeping a byte value in AL so you can use those special-case instructions, and avoiding RBP / R13 for pointers.
Special cases where a specific encoding is extra slow, not just size
LEA with RBP or R13 as a base can be slower on Intel if the addressing mode didn't already have a +displacement constant.
e.g. lea eax, [rbp + 12] is encodeable as-written, and is just as fast as lea eax, [rcx + 12].
But lea eax, [rbp + rcx*4] can only be encoded in machine code as lea eax, [rbp + rcx*4 + 0] (because of addressing mode escape-code stuff), which is a 3-component LEA, and thus slower on Intel (3 cycle latency on Sandybridge-family instead of 1 cycle, see https://agner.org/optimize/ instruction tables and microarch PDF). On AMD, having a scaled-index would already make it a slow-LEA even with lea eax, [rdx + rcx*4]
Outside of LEA, using RBP / R13 as the base in any addressing mode always requires a disp8/32 byte or dword, but I don't think the actual AGUs are slower for a 3-component addressing mode. So it's just a code-size effect.
Other cases include Which Intel microarchitecture introduced the ADC reg,0 single-uop special case? where the short-form 2-byte encoding for adc al, imm8 is 2 uops even on modern uarches like Skylake, where adc bl, imm8 is 1 uop.
So not only does the adc reg,0 special case not work for adc al,0 on Sandybridge through Haswell, Broadwell and newer forgot (or chose not to) optimize how that encoding decodes to uops. (Of course you could manually encode adc al,0 using the 3-byte Mod/RM encoding, but assemblers will always pick the shortest encoding so adc al,0 will assemble to the short form by default.) Only a problem with byte registers; adc eax,0 will use the opcode ModRM imm8 3-byte encoding, not 5-byte opcode imm32.
For other cases of op al,imm8, the only difference is code-size, which only indirectly matters for performance. (Because of decoding, uop-cache packing, and I-cache misses).
See Tips for golfing in x86/x64 machine code for more about special cases of code-size, like xchg eax, ecx being 1-byte vs. xchg edx, ecx being 2 bytes.
add rsp, 8 can need an extra stack-sync uop if there hasn't been an explicit use of RSP or ESP since the last push/pop/call/ret (along the path of execution of course, not in the static code layout). (What is the stack engine in the Sandybridge microarchitecture?). This is why compilers like clang use a dummy push or pop to reserve / free a single stack slot: Why does this function push RAX to the stack as the first operation?

LEA will be slower with EBP, RBP, or R13 as the base (PDF warning, page 3-22). But generally the answer is No.
Taking a step back, it's important to realize that since the advent of register renaming that architectural registers don't deal with actual, physical registers on most micro-architectures. For example, each Cascade Lake core has a register file of 180 integer and 168 FP registers.

You have packed a lot of questions together. If I understood the question correctly, you are confusing the processor architecture with the small but fast register file, which bridges the speed gap between the processor and memory technologies. The register file is small enough that it can only support one instruction at a time, i.e. the current instruction, and fast enough that it can almost keep up with the processor speed.
First, a short background. The naming conventions of these registers serve two purposes: one, they keep older versions of the x86 ISA compatible to this day, and two, each register name has a special purpose in addition to its general-purpose use. For example, the ECX register is used as a counter to implement loops, i.e. instructions like JECXZ and LOOP use the ECX register exclusively. You do need to watch out for some flags that you would not want to lose.
The answer to your question stems from the second purpose. Some registers may seem faster because these special registers are hardcoded into the processor and can be accessed more quickly; however, the difference should not be large.
Second, as you might know, not all instructions are of the same complexity. Especially in x86, the opcode of an instruction can be 1 to 3 bytes long, and as more functionality is added to the instruction in terms of prefixes, addressing modes, etc., these instructions start to become slower. So it is not the case that some registers are slower than others; it is just that some registers are encoded into the instruction, and therefore those instructions run faster with that combination of registers, while other combinations seem slower. I hope that helps. Thanks

How often do the contents of a CPU register change?

Does the data that CPU registers hold change often? The Wikipedia article describes the registers as "a quickly accessible location...of a small amount of fast storage". I'm assuming the memory is fast because a register is accessed and modified often?

Yes, data registers may change on subsequent instructions which is quite often. There are more complications with superscalarity, out-of-order execution, pipelining, register renaming, etc which complicate the analysis, but even on a simple in-order CPU, a register can change as often as once per instruction. A plausible program may have a run of many instructions, all affecting the same register:
int polynom(int num) {
return num * num + 2 * num + 1;
}
which compiles as:
polynom(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
* mov eax, DWORD PTR [rbp-4]
* imul eax, eax
mov edx, DWORD PTR [rbp-4]
add edx, edx
* add eax, edx
* add eax, 1
pop rbp
ret
Note the writes to the eax register, marked with an asterisk. In this little function, four almost-consecutive instructions write to this specific register, meaning that we can expect the program-visible state of eax [1] to change at a rate of over 1 GHz if this code were to be called in a tight loop.
On a more fundamental note, there are some architectural registers that almost always change on every instruction. The most evident of these is the program counter (called PC in many contexts, EIP on x86, RIP on x86_64). Because this register points to the currently executing instruction, it must certainly change with every instruction, barring counterexamples like x86 REP encodings or an instruction that simply jumps to itself.
[1] Again, barring microarchitectural considerations like register renaming, which uses multiple physical registers to implement a single logical, program-visible register.

Since modern CPUs run at GHz clock speeds, CPU registers can change what they are storing hundreds of millions or even billions of times per second.
Since most modern CPUs have ~128 registers, each would typically change value a few million times per second when performing many operations.

Which is generally faster to test for zero in x86 ASM: "TEST EAX, EAX" versus "TEST AL, AL"?

Which is generally faster to test the byte in AL for zero / non-zero?
TEST EAX, EAX
TEST AL, AL
Assume a previous "MOVZX EAX, BYTE PTR [ESP+4]" instruction loaded a byte parameter with zero-extension to the remainder of EAX, preventing the combine-value penalty that I already know about.
So AL=EAX and there are no partial-register penalties for reading EAX.
Intuitively just examining AL might let you think it's faster, but I'm betting there are more penalty issues to consider for byte access of a >32-bit register.
Any info/details appreciated, thanks!

Code-size is equal, and so is performance on all x86 CPUs AFAIK.
Intel CPUs (with partial-register renaming) definitely don't have a penalty for reading AL after writing EAX. Other CPUs also have no penalty for reading low-byte registers.
Reading AH would have a penalty on Intel CPUs, like some extra latency. (How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent)
In general 32-bit operand-size and 8-bit operand size (with low-8 not high-8) are equal speed except for the false-dependencies or later partial-register reading penalties of writing an 8-bit register. Since TEST only reads registers, this can't be a problem. Even add al, bl is fine: the instruction already had an input dependency on both registers, and on Sandybridge-family a RMW to the low byte of a register doesn't rename it separately. (Haswell and later don't rename low-byte registers separately anyway).
Pick whichever operand-size you like. 8-bit and 32-bit are basically equal. The choice is just a matter of human readability. If you're going to work with the value as a 32-bit integer later, then go 32-bit. If it's logically still an 8-bit value and you were only using movzx as the x86 equivalent of ARM ldrb or MIPS lbu, then using 8-bit makes sense.
There are code-size advantages to instructions like cmp al, imm which can use the no-modrm short-form encoding. cmp al, 0 is still worse than test al,al on some old CPUs (Core 2), where cmp/jcc macro-fusion is less flexible than test/jcc macro-fusion. (Test whether a register is zero with CMP reg,0 vs OR reg,reg?)
There is one difference between these instructions: test al,al sets SF according to the high bit of AL (which can be non-zero). test eax,eax will always clear SF. If you only care about ZF then that makes no difference, but if you have a use for the high bit in SF for a later branch or cmovcc/setcc then you can avoid doing a 2nd test.
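The SF difference just described can be modelled in a few lines of C. This is a sketch with made-up names (`test_al_al`, `test_eax_eax`, `movzx_byte`), tracking only ZF and SF, and assuming the byte was loaded with zero-extension as in the question.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool zf; /* result was zero */
    bool sf; /* top bit of the result at the chosen operand size */
} test_flags;

/* movzx eax, byte [mem] : zero-extend a byte into a 32-bit register. */
uint32_t movzx_byte(uint8_t b) {
    return (uint32_t)b;
}

/* test al, al : looks only at the low 8 bits, SF = bit 7. */
test_flags test_al_al(uint32_t eax) {
    uint8_t al = (uint8_t)eax;
    test_flags f = { al == 0, (al & 0x80u) != 0 };
    return f;
}

/* test eax, eax : looks at all 32 bits, SF = bit 31. */
test_flags test_eax_eax(uint32_t eax) {
    test_flags f = { eax == 0, (eax & 0x80000000u) != 0 };
    return f;
}
```

For a zero-extended byte such as 0x80, both forms agree on ZF, but only the 8-bit form reports SF=1, since bit 31 of the zero-extended value is always clear.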
Other ways to test a byte in memory:
If you're consuming the flag result with setcc or cmovcc, not a jcc branch, then macro-fusion doesn't matter in the discussion below.
If you also need the actual value in a register later, movzx/test/jcc is almost certainly best. Otherwise you can consider a memory-destination compare.
cmp [mem], immediate can micro-fuse into a load+cmp uop on Intel, as long as the addressing mode is not RIP-relative. (On Sandybridge-family, indexed addressing modes will un-laminate even on Haswell and later: See Micro fusion and addressing modes). Agner Fog doesn't mention whether AMD has this limitation for fusing cmp/jcc with a memory operand.
;;; no downside for setcc or cmovcc, only with JCC on Intel
;;; unknown on AMD
cmp byte [esp+4], 0 ; micro-fuses into load+cmp with this addressing mode
jnz ... ; breaks macro-fusion on SnB-family
I don't have an AMD CPU to test whether Ryzen or any other AMD still fuses cmp/jcc when the cmp is mem, immediate. Modern AMD CPUs do in general do cmp/jcc and test/jcc fusion. (But not add/sub/and/jcc fusion like SnB-family).
cmp mem,imm / jcc (vs. movzx/test+jcc):
smaller code-size in bytes
same number of front-end / fused-domain uops (2) on mainstream Intel. This would be 3 front-end uops if micro-fusion of the cmp+load wasn't possible, e.g. with a RIP-relative addressing mode + immediate. Or on Sandybridge-family with an indexed addressing mode, it would unlaminate to 3 uops after decode but before issuing into the back-end.
Advantage: this is still 2 on Silvermont/Goldmont / KNL or very old CPUs without macro-fusion. The main advantage of movzx/test/jcc over this is macro-fusion, so it falls behind on CPUs where that doesn't happen.
3 back-end uops (unfused domain = execution ports and space in the scheduler aka RS) because cmp-immediate can't macro-fuse with a JCC on Intel Sandybridge-family CPUs (tested on Skylake). The uops are load, cmp, and a separate branch uop. (vs. 2 for movzx / test+jcc). Back-end uops usually aren't a bottleneck directly, but if the load isn't ready for a while it takes up more space in the RS, limiting how much further past this out-of-order execution can see.
cmp [mem], reg / jcc can macro + micro-fuse into a single compare+branch uop so it's excellent. If you need a zeroed register for anything later in your function, do xor-zero it first and use it for a single-uop compare+branch on memory.
movzx eax, byte [esp+4] ; 1 uop (load-port only on Intel and Ryzen)
test al,al ; fuses with jcc
jnz ... ; 1 uop
This is still 2 uops for the front-end, and it is also only 2 for the back-end. The test/jcc pair macro-fuses into one uop. It costs more code-size, though.
If you aren't branching but instead using the FLAGS result for cmovcc or setcc, using cmp mem, imm has no downside. It can micro-fuse as long as you don't use a RIP-relative addressing mode (which always blocks micro-fusion when there's also an immediate), or an indexed addressing mode.

Setting and clearing the zero flag in x86

What's the most efficient way to set and also to clear the zero flag (ZF) in x86-64?
Methods that work without the need for a register with a known value, or without any free registers at all are preferred, but if a better method is available when those or other assumptions are true it is also worth mentioning.

ZF=0
This is harder. cmp between any two regs known to be not equal. Or cmp reg,imm with any value some reg couldn't possibly have. e.g. cmp reg,1 with any known-zero register.
In general test reg,reg is good with any known-non-0 register value, e.g. a pointer.
test rsp, rsp is probably a good choice, or even test esp, esp to save a byte will work except if your stack is in the unusual location of spanning a 4G boundary.
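The caveat about test esp, esp can be illustrated with a tiny C model (function names are made up for this sketch). The 32-bit form only sees the low half of RSP, so it reports ZF=1 in the rare case where those low 32 bits are all zero even though the full pointer is non-zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* ZF after `test rsp, rsp`: set only if all 64 bits are zero. */
bool zf_after_test64(uint64_t rsp) {
    return rsp == 0;
}

/* ZF after `test esp, esp`: set if the low 32 bits are zero. */
bool zf_after_test32(uint64_t rsp) {
    return (uint32_t)rsp == 0;
}
```

With a typical user-space stack pointer both return false, which is the ZF=0 you wanted; with a value such as 0x0000000100000000 the 32-bit form would return true.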
I don't see a way to create ZF=0 in one instruction without a false dependency on some input reg. xor eax,eax / inc eax or dec will do the trick in 2 uops if you don't mind destroying a register, breaking false dependencies. (not doesn't set FLAGS, and neg will just do 0-0 = 0.)
or eax, -1 doesn't need any pre-condition for the register value. (False dependency, but not a true dependency so you can pick any register even if it might be zero.) It doesn't have to be -1, it's not gaining you anything so if you can make it something useful so much the better.
or eax,-1 FLAG results: ZF=0 PF=1 SF=1 CF=0 OF=0 (AF=undefined).
If you need to do this in a loop, you can obviously set up for it outside the loop, if you can dedicate a register to being non-zero for use with test.
ZF=1
Least destructive: cmp eax,eax - but has a false dependency (I assume) and needs a back-end uop: not a zeroing idiom. RSP doesn't usually change much so cmp esp, esp could be a good choice. (Unless that forces a stack-sync uop).
Most efficient: xor-zeroing (like xor eax,eax using any free register) is definitely the most efficient way on SnB-family (same cost as a 2-byte nop, or 3-byte if it needs a REX because you want to zero one of r8d..r15d): 1 front-end uop, zero back-end uops on SnB-family, and the FLAGS result is ready in the same cycle it issues. (Relevant only in case the front-end was stalled, or some other case where a uop depending on it issues in the same cycle and there aren't any older uops in the RS with ready inputs, otherwise such uops would have priority for whichever execution port.)
Flag results: ZF=1 PF=1 SF=0 CF=0 OF=0 (AF=undefined). (Or use sub eax,eax to get well-defined AF=0. In practice modern CPUs pick AF=0 for xor-zeroing, too, so they can decode both zeroing idioms the same way. Silvermont only recognizes 32-bit operand-size xor as a zeroing idiom, not sub.)
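The flag listings for or eax,-1 and for xor-zeroing can be checked with a small C model. This is a sketch with hypothetical helper names; it derives only ZF, PF and SF from a 32-bit result (CF and OF are simply cleared by these logical instructions, and AF is left out).

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool zf; /* result == 0 */
    bool pf; /* low byte of the result has an even number of 1 bits */
    bool sf; /* bit 31 of the result */
} result_flags;

/* Flags that a 32-bit logical instruction derives from its result. */
result_flags flags_of(uint32_t result) {
    int ones = 0;
    for (int i = 0; i < 8; i++) {
        ones += (int)((result >> i) & 1u);
    }
    result_flags f = { result == 0, (ones % 2) == 0, (result >> 31) != 0 };
    return f;
}

/* xor eax, eax : result is 0 whatever the old value was. */
uint32_t xor_zero(uint32_t eax) {
    return eax ^ eax;
}

/* or eax, -1 : result is all-ones whatever the old value was. */
uint32_t or_minus1(uint32_t eax) {
    return eax | 0xFFFFFFFFu;
}
```

Neither result depends on the old register value, which is why no pre-condition on the register is needed in either case.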
xor-zero is very cheap on all other uarches as well, of course: no input dependencies, and doesn't need any pre-existing register value. (And thus doesn't contribute to P6-family register-read stalls). So it will be at worst tied with anything else you could do on any other uarch (where it does require an execution unit.)
(On early P6-family, before Pentium M, xor-zeroing does not break dependencies; it only triggers the special al=eax state that avoids partial-register stuff. But none of those CPUs are x86-64, all 32-bit only.)
It's pretty common to want a zeroed register for something anyway, e.g. as a sub destination for 0 - x to copy-and-negate, so take advantage of it by putting the xor-zeroing where you need it to also create a useful FLAG condition.
Interesting but probably not useful: test al, 0 is 2 bytes long. But so is cmp esp,esp.
As @prl suggested, cmp same,same with any register will work without disturbing a value. I suspect this is not special-cased as dependency breaking the way sub same,same is on some CPUs, so pick a "cold" register. Again 2 or 3 bytes, 1 uop. It can macro-fuse with a JCC, but that would be dumb (unless the JCC is also a branch target from some other condition?)
Flag results: same as xor-zeroing.
Downsides:
(probably) false dependency
on P6-family can contribute to a register-read stall, so pick a cold register you're already reading in nearby instructions.
needs a back-end execution unit on SnB-family
Just for fun, other as-cheap alternatives include test al, 0. 2 bytes for AL, 3 or 4 bytes for any other 8-bit register. (REX) + opcode + modrm + imm8. The original register value doesn't matter because an imm8 of zero guarantees that reg & 0 = 0.
If you happen to have a 1 or -1 in a register you can destroy, 32-bit mode inc or dec would set ZF in only 1 byte. But in x86-64 that's at least 2 bytes. Nothing comes to mind for a 1-byte instruction in 64-bit mode that's actually efficient and sets FLAGS.
ZF=!CF
sbb same,same can set ZF=!CF (leaving CF unmodified), while setting the reg to 0 (CF=0) or -1 (CF=1). On AMD since Bulldozer (BD-family and Zen-family), this has no dependency on the GP register, only CF. But on other uarches it's not special cased and there is a false dep on the reg. And it's 2 uops on Intel before Broadwell.
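A C model of subtract-with-borrow shows why sbb same,same behaves this way. The `sbb32` function and `sbb_out` type are names made up for this sketch, and only the result, CF and ZF are tracked.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t result;
    bool cf; /* borrow out of bit 31 */
    bool zf; /* result == 0 */
} sbb_out;

/* sbb dst, src : dst = dst - src - CF, computed on 32 bits. */
sbb_out sbb32(uint32_t dst, uint32_t src, bool cf_in) {
    /* Do the subtraction in 64 bits so a borrow shows up in the upper half. */
    uint64_t wide = (uint64_t)dst - (uint64_t)src - (uint64_t)cf_in;
    sbb_out out;
    out.result = (uint32_t)wide;
    out.cf = (wide >> 32) != 0;
    out.zf = out.result == 0;
    return out;
}
```

When dst and src are the same register the operand values cancel, leaving 0 - CF: the result is 0 with ZF=1 when CF was clear, and -1 with ZF=0 when CF was set, and the outgoing CF equals the incoming one.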
ZF=!bool(integer register)
To set ZF=!integer_reg, obviously the normal test reg,reg is your best bet. (Better than and reg,reg or or reg,reg, unless you're intentionally rewriting the register to avoid P6 register-read stalls.)
ZF=1 if the register value is zero, so it's like C's logical inverse operator.
ZF=!ZF
Perhaps setz al / test al, al. No single instruction: I don't think any read ZF and write FLAGS. setz materializes ZF in a register, then test is just ZF = !reg.
Other FLAGS conditions:
How to read and write x86 flags registers directly?
One instruction to clear PF (Parity Flag) -- get odd number of bits in result register (not possible without pre-existing register values for test or cmp).
How can I set or clear overflow flag in x86 assembly? (e.g. for the start of an ADOX chain.)
pushf/pop rax is not terrible, but writing flags with popf is very slow (e.g. 1/20c throughput on SKL). It's microcoded because flags like IF also live in EFLAGS, and there isn't a condition-codes-only version or a special fast-path for user-space. (Or maybe 20c is the fast path.)
lahf (FLAGS->AH) / sahf (AH->FLAGS) can be useful but miss OF.
CF has clc/stc/cmc instructions. (clc is as efficient as xor-zeroing on SnB-family.)

The least intrusive way to manipulate any(i) of the lower 8 bits of Flags is to use the classic LAHF/SAHF instructions which bring them to/from AH, on which any bit operation can be applied.
(i) Just bits 7 (SF), 6 (ZF), 4 (AF), 2 (PF), and 0 (CF)
Turning off ZF
LAHF ; Load lower 8 bit from Flags into AH
AND AH,010111111b ; Clear bit for ZF
SAHF ; Store AH back to Flags
Turning on ZF
LAHF ; Load AH from FLAGS
OR AH,001000000b ; Set bit for ZF
SAHF ; Store AH back to Flags
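The two sequences above can be modelled in C to show that only ZF changes. This is a sketch with made-up function names; it treats the flags register as a plain integer and assumes the documented behaviour that SAHF writes only SF, ZF, AF, PF and CF from AH.

```c
#include <stdint.h>

/* Bits that SAHF copies from AH: 7 (SF), 6 (ZF), 4 (AF), 2 (PF), 0 (CF). */
#define SAHF_MASK 0xD5u
#define ZF_BIT    0x40u

/* LAHF: AH receives the low 8 bits of the flags register. */
uint8_t lahf(uint32_t eflags) {
    return (uint8_t)(eflags & 0xFFu);
}

/* SAHF: only the five status flags are taken from AH. */
uint32_t sahf(uint32_t eflags, uint8_t ah) {
    return (eflags & ~SAHF_MASK) | (ah & SAHF_MASK);
}

/* LAHF / AND AH,10111111b / SAHF */
uint32_t clear_zf(uint32_t eflags) {
    uint8_t ah = lahf(eflags);
    ah &= (uint8_t)~ZF_BIT;
    return sahf(eflags, ah);
}

/* LAHF / OR AH,01000000b / SAHF */
uint32_t set_zf(uint32_t eflags) {
    uint8_t ah = lahf(eflags);
    ah |= (uint8_t)ZF_BIT;
    return sahf(eflags, ah);
}
```

Every bit other than ZF comes back unchanged, which is the property that distinguishes this approach from a cmp or test.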
Of course any CMP (E)AX,(E)AX will set ZF faster and with less code; the point of this is to leave other FLAGS unmodified, as in How to read and write x86 flags registers directly? and how to change flags manually (in assembly code) for 8086?
CAVEAT for early AMD64 - LAHF in long mode is an extension
Some very early x86-64 CPUs do not support LAHF/SAHF in 64-bit mode, most notably all
AMD Athlon 64, Opteron and Turion 64 before revision D (March 2005) and
Intel before Pentium 4 stepping G1* (December 2005)
That is because the instruction was originally removed from the AMD64 instruction subset, but later reintroduced. Luckily that happened before x86-64 became a common sight, so only a few early high-end CPUs are affected, and even fewer survive today. More so as these are the CPUs that are not able to run Windows 10, or any 64-bit Windows before Windows 10 (see this answer at SuperUser.SE).
If you really expect that someone might try to run that software on a high-end CPU more than 17 years old, support can be checked by executing CPUID with EAX=80000001h and testing bit 0 of ECX (set means LAHF/SAHF are available in 64-bit mode).
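As a rough sketch, the CPUID check could look like this in C. The helper names are made up for illustration, and the hardware query assumes GCC or Clang on an x86 target, where `<cpuid.h>` provides `__get_cpuid`; other compilers need their own intrinsic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of ECX from CPUID leaf 80000001h reports LAHF/SAHF in 64-bit mode. */
bool lahf_sahf_bit(uint32_t ecx_leaf_80000001) {
    return (ecx_leaf_80000001 & 1u) != 0;
}

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>

/* Query the running CPU; returns false if the leaf is not available. */
bool cpu_has_lahf_sahf_long_mode(void) {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx)) {
        return false;
    }
    return lahf_sahf_bit(ecx);
}
#endif
```

Keeping the bit test separate from the hardware query makes the decoding easy to check on any machine, while the guarded function is only compiled where the intrinsic exists.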

Assuming you don’t need to preserve the values of the other flags,
cmp eax, eax

Allocating memory using malloc() in 32-bit and 64-bit assembly language

I have to implement a 64-bit stack. To make myself comfortable with malloc, I managed to write two integers (32 bits each) into memory and read them back from there:
But when I try to do this with 64 bits:

The first snippet of code works perfectly fine. As Jester suggested, you are writing a 64-bit value in two separate (32-bit) halves. This is the way you have to do it on a 32-bit architecture. You don't have 64-bit registers available, and you can't write 64-bit chunks of memory at once. But you already seemed to know that, so I won't belabor it.
In the second snippet of code, you tried to target a 64-bit architecture (x86-64). Now, you no longer have to write 64-bit values in two 32-bit halves, since 64-bit architectures natively support 64-bit integers. You have 64-bit wide registers available, and you can write a 64-bit chunk to memory directly. Take advantage of that to simplify (and speed up) the code.
The 64-bit registers are Rxx instead of Exx. When you use QWORD PTR, you will want to use Rxx; when you use DWORD PTR, you will want to use Exx. Both are legal in 64-bit code, but only 32-bit DWORDs are legal in 32-bit code.
A couple of other things to note:
1) Although it is perfectly valid to clear a register using MOV xxx, 0, it is smaller and faster to use XOR eax, eax, so this is generally what you should write. It is a very old trick, something that any assembly-language programmer should know, and if you ever try to read other people's assembly programs, you'll need to be familiar with this idiom. (But actually, in the code you're writing, you don't need to do this at all. For the reason why, see point #2.)
2) In 64-bit mode, all instructions implicitly zero the upper 32 bits when writing the lower 32 bits, so you can simply write XOR eax, eax instead of XOR rax, rax. This is, again, smaller and faster.
3) The calling convention for 64-bit programs is different than the one used in 32-bit programs. The exact specification of the calling convention is going to vary, depending on which operating system you're using. As Peter Cordes commented, there is information on this in the x86 tag wiki. Both Windows and Linux x64 calling conventions pass at least the first 4 integer parameters in registers (rather than on the stack like the x86-32 calling convention), but which registers are actually used is different. Also, the 64-bit calling conventions have different requirements than do the 32-bit calling conventions for how you must set up the stack before calling functions.
(Since your screenshot says something about "MASM", I'll assume that you're using Windows in the sample code below.)
; Set up the stack, as required by the Windows x64 calling convention.
; (Note that we use the 64-bit form of the instruction, with the RSP register,
; to support stack pointers larger than 32 bits.)
sub rsp, 40
; Dynamically allocate 8 bytes of memory by calling malloc().
; (Note that the x64 calling convention passes the parameter in a register, rather
; than via the stack. On Windows, the first parameter is passed in RCX.)
; (Also note that we use the 32-bit form of the instruction here, storing the
; value into ECX, which is safe because it implicitly zeros the upper 32 bits.)
mov ecx, 8
call malloc
; Write a single 64-bit value into memory.
; (The pointer to the memory block allocated by malloc() is returned in RAX.)
mov qword ptr [rax], 1
; ... do whatever
; Clean up the stack space that we allocated at the top of the function.
add rsp, 40
If you wanted to do this in 32-bit halves, even on a 64-bit architecture, you certainly could. That would look like the following:
sub rsp, 40 ; set up stack
mov ecx, 8 ; request 8 bytes
call malloc ; allocate memory
mov dword ptr [rax], 1 ; write "1" into low 32 bits
mov dword ptr [rax+4], 2 ; write "2" into high 32 bits
; ... do whatever
add rsp, 40 ; clean up stack
Note that, apart from addressing through the full 64-bit pointer in RAX rather than EAX, these last two MOV instructions are identical to what you wrote in the 32-bit version of the code. That makes sense, because you're doing exactly the same thing.
The reason the code you originally wrote didn't work is because EAX doesn't contain a QWORD PTR, it contains a DWORD PTR. Hence, the assembler generated the "invalid instruction operands" error, because there was a mismatch. This is the same reason that you don't offset by 8, because a DWORD PTR is only 4 bytes. A QWORD PTR is indeed 8 bytes, but you don't have one of those in EAX.
Or, if you wanted to write 16 bytes:
sub rsp, 40 ; set up stack
mov ecx, 16 ; request 16 bytes
call malloc ; allocate memory
mov qword ptr [rax], 1 ; write "1" into low 64 bits
mov qword ptr [rax+8], 2 ; write "2" into high 64 bits
; ... do whatever
add rsp, 40 ; clean up stack
Compare these three snippets of code, and make sure you understand the differences and why they need to be written as they are!
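For comparison, here is roughly what the first two assembly variants do, expressed in C. The function names are made up for this sketch, and the comment about byte order assumes a little-endian machine such as x86.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Like the QWORD PTR version: one 64-bit store into an 8-byte block.
 * Returns 0 on success, -1 if malloc fails. */
int store_qword(uint64_t value, uint64_t *out) {
    unsigned char *p = malloc(sizeof value);
    if (p == NULL) {
        return -1;
    }
    memcpy(p, &value, sizeof value);  /* mov qword ptr [rax], value */
    memcpy(out, p, sizeof *out);      /* read the 8 bytes back */
    free(p);
    return 0;
}

/* Like the DWORD PTR version: two 32-bit stores at offsets 0 and 4.
 * On a little-endian machine the first half is the low 32 bits. */
int store_halves(uint32_t low, uint32_t high, uint64_t *out) {
    unsigned char *p = malloc(8);
    if (p == NULL) {
        return -1;
    }
    memcpy(p, &low, sizeof low);       /* mov dword ptr [rax], low */
    memcpy(p + 4, &high, sizeof high); /* mov dword ptr [rax+4], high */
    memcpy(out, p, sizeof *out);       /* read all 8 bytes as one value */
    free(p);
    return 0;
}
```

Writing 1 and 2 as halves and reading the block back as a single 64-bit value gives 0x0000000200000001 on x86, which shows that the second store lands in the upper half of the same 8 bytes.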
