What's the ARM instruction equivalent to Intel's xchgl? - gcc

I found LDREX and STREX might be the ones to use. But they are two instructions (and thus do not by themselves provide the atomicity of xchgl). The value I want to exchange atomically is a 32-bit value.
Can LDREX and STREX be used in a way that provides an atomic exchange of a 32-bit value, or are there other ways to achieve it (provided it works on armv7l or higher)?
Normally, I'd use gcc's atomic builtins or the more recent (C++11-equivalent) builtin functions
for such atomic operations. But in this case, I have to use inline assembly in C (to port an x86-based futex implementation to the ARM architecture). Thanks!
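(For reference, a minimal sketch of what the builtin version would look like if inline assembly weren't required; futex_word and the wrapper name are illustrative:)

#include <stdint.h>

static uint32_t futex_word;  /* illustrative */

static uint32_t exchange(uint32_t newval)
{
    /* GCC/Clang builtin: atomically stores newval, returns the old value. */
    return __atomic_exchange_n(&futex_word, newval, __ATOMIC_SEQ_CST);
}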

In the ARM instruction set, there is no atomic exchange instruction. Instead you use ldrex and strex and code like this:
@ exchange r0 and [r1]
ldrex r2, [r1]      @ load-exclusive: r2 = old value, address marked exclusive
strex r3, r0, [r1]  @ store-exclusive: [r1] = r0 if still exclusive; r3 = 0 on success
mov r0, r2          @ hand back the old value in r0
When [r1] is modified between ldrex and strex, or the exchange cannot be guaranteed to be atomic for some other reason, 1 is returned in r3 and the store isn't performed. If the sequence is atomic, 0 is returned in r3. Thus, by executing this snippet in a loop until you get a zero in r3, you eventually achieve an atomic exchange operation. That's actually how gcc and clang implement the corresponding intrinsics; pass -S to the compiler to observe what it does.
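Wrapped in GCC inline assembly for use from C (as the question requires), the retry loop might look like the following sketch. The function name and constraints are illustrative, and a real futex port would likely also need memory barriers (e.g. dmb), which are omitted here:

#include <stdint.h>

static inline uint32_t atomic_xchg32(volatile uint32_t *ptr, uint32_t newval)
{
    uint32_t oldval, tmp;
    __asm__ __volatile__(
        "1: ldrex %0, [%3]\n\t"   /* oldval = *ptr, mark address exclusive   */
        "strex %1, %2, [%3]\n\t"  /* try *ptr = newval; tmp = status         */
        "cmp %1, #0\n\t"          /* status 0 means the store succeeded      */
        "bne 1b"                  /* someone intervened: retry               */
        : "=&r" (oldval), "=&r" (tmp)
        : "r" (newval), "r" (ptr)
        : "cc", "memory");
    return oldval;
}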

SWP is still supported on some cores despite what the docs say (they often say "please don't use" rather than "we have removed it"), but it is going away or may already be gone on your core.
Atomic operations are costly. On CISC, instructions are costly anyway, so perhaps a dedicated atomic exchange is fine there; on a RISC design, what ARM has done makes sense. You are basically synthesizing the atomic operation, but you may have to repeat it until it succeeds (rather than stopping all data movement on the bus while the atomic happens). This isn't strictly a RISC/CISC thing but simply a performance thing.

Related

Is it legal to begin an x64 function with a single-byte instruction?

According to masm's macamd64.inc, rex_push_reg,
...rex_push_reg must be used in lieu of push_reg when it appears as
the first instruction in a function, as the calling standard dictates
that functions must not begin with a single byte instruction.
I wasn't able to find any documentation expressing this, however. Is this true? Where is it documented? Why is this the case?
The operative portion of this claim seems to be "the calling standard"—which calling standard? The joke may be an old one, but it remains apt: the great thing about standards is there are so many to choose from.
In this case, since you're speaking of MASM, we can assume that the target platform is Windows, so the Windows 64-bit calling convention would be assumed, rather than something in the official AMD64 specification. However, like you, I can't find anything there that speaks to this requirement.
However, I think what this comment is referring to is Microsoft's internal standard designed to allow hot patching of system binaries. By "hot patching" is meant the ability to dynamically patch binaries in memory—e.g. to apply a system update—without the need to restart.
The minimum requirement for this to work is that there is room for a 2-byte short JMP instruction to be patched in at the beginning of every function. (Note that a short jump only allows execution to be passed anywhere from −128 to +127 bytes from the current instruction pointer, but that's enough to branch to a long jump, which then branches to the patched function provided by the update. In practice, the long jump instruction is patched into the padding between functions.)
Therefore, a function cannot begin with a 1-byte instruction because then a hot patch could potentially result in the instruction pointer pointing into the middle of an instruction. (Think about multi-threading race conditions.) So the rule is, if you want to begin a function with a prologue instruction like PUSH RBP that would normally be only 1 byte, you need to add a 1-byte REX prefix. This unnecessary REX prefix is ignored by the CPU and functions essentially as a 1-byte NOP.
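To make the size argument concrete, here is an illustrative encoding comparison (my example, not from the original discussion; byte values per the standard PUSH r64 encoding, 50h+rd):

unsigned char push_rbp[]     = { 0x55 };        /* push rbp: 1 byte                   */
unsigned char rex_push_rbp[] = { 0x48, 0x55 };  /* rex push rbp: 2 bytes, same effect */
/* A 2-byte short JMP (EBh, rel8) can be patched atomically over the 2-byte
   form; patched over the 1-byte form, another thread's RIP could end up in
   the middle of the newly written instruction. */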
In 32-bit builds, hot patching was provided for by the 2-byte instruction MOV EDI, EDI. This copies the EDI register to itself without affecting flags, so it is effectively a NOP.
For 32-bit builds, you have to specifically pass the /hotpatch switch to the compiler to have it insert this instruction. However, on 64-bit builds, the compiler always acts as if /hotpatch has been specified, so this requirement that the first instruction be 2 bytes in length effectively becomes part of the platform standard.
So, why make this complicated rule instead of just having the compiler insert a 2-byte NOP at the beginning of every function, as is done in 32-bit builds? Well, I can't say for certain, but I can speculate. One problem is that MOV EDI, EDI is not a NOP on x64, because it implicitly zeroes the upper 32 bits of the RDI register. You'd have to choose a different instruction as the NOP, and once you've done that, you might as well rethink the whole business. Second, there is a (slight) performance cost to having that NOP there, and since most instructions in long mode are at least 2 bytes long, it hardly seems worth it to require a pointless NOP instruction when the instruction that would normally be there is sufficient with only a few exceptions.

Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?

LOOP (Intel ref manual entry)
decrements ecx / rcx, and then jumps if non-zero. It's slow, but couldn't Intel have cheaply made it fast? dec/jnz already macro-fuses into a single uop on Sandybridge-family; the only difference is that dec/jnz sets flags.
loop on various microarchitectures, from Agner Fog's instruction tables:
K8/K10: 7 m-ops
Bulldozer-family/Ryzen: 1 m-op (same cost as macro-fused test-and-branch, or jecxz)
P4: 4 uops (same as jecxz)
P6 (PII/PIII): 8 uops
Pentium M, Core2: 11 uops
Nehalem: 6 uops. (11 for loope / loopne). Throughput = 4c (loop) or 7c (loope/ne).
SnB-family: 7 uops. (11 for loope / loopne). Throughput = one per 5 cycles, as much of a bottleneck as keeping your loop counter in memory! jecxz is only 2 uops with same throughput as regular jcc
Silvermont: 7 uops
AMD Jaguar (low-power): 8 uops, 5c throughput
Via Nano3000: 2 uops
Couldn't the decoders just decode loop the same as lea rcx, [rcx-1] / jrcxz? That would be 3 uops. At least that would be the case with no address-size prefix; otherwise it has to use ecx and truncate RIP to EIP if the jump is taken. Maybe the odd choice of address-size controlling the width of the decrement explains the many uops? (Fun fact: rep-string instructions have the same behaviour of using ecx with 32-bit address-size.)
Or better, just decode it as a fused dec-and-branch that doesn't set flags? dec ecx / jnz on SnB decodes to a single uop (which does set flags).
I know that real code doesn't use it (because it's been slow since at least P5 or something), but AMD decided it was worth it to make it fast for Bulldozer. Probably because it was easy.
Would it be easy for SnB-family uarch to have fast loop? If so, why don't they? If not, why is it hard? A lot of decoder transistors? Or extra bits in a fused dec&branch uop to record that it doesn't set flags? What could those 7 uops be doing? It's a really simple instruction.
What's special about Bulldozer that made a fast loop easy / worth it? Or did AMD waste a bunch of transistors on making loop fast? If so, presumably someone thought it was a good idea.
If loop was fast, it would be perfect for BigInteger arbitrary-precision adc loops, to avoid partial-flag stalls / slowdowns (see my comments on my answer), or any other case where you want to loop without touching flags. It also has a minor code-size advantage over dec/jnz. (And dec/jnz only macro-fuses on SnB-family).
On modern CPUs where dec/jnz is ok in an ADC loop, loop would still be nice for ADCX / ADOX loops (to preserve OF).
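To make the flag-preservation point concrete, here's a hedged sketch (GCC inline asm; names are illustrative, and it deliberately uses the slow loop instruction) of an extended-precision add loop: adc must see CF from the previous limb, and both lea and loop leave the flags untouched (loop would likewise preserve OF for an ADCX/ADOX dual-chain loop):

#include <stddef.h>
#include <stdint.h>

/* dst[0..n-1] += src[0..n-1] with carry propagation; n must be >= 1. */
static void bigint_add(uint64_t *dst, const uint64_t *src, size_t n)
{
    __asm__ volatile(
        "clc\n\t"                    /* clear CF before the first limb     */
        "1:\n\t"
        "mov   (%[s]), %%rax\n\t"
        "adc   %%rax, (%[d])\n\t"    /* dst[i] += src[i] + CF              */
        "lea   8(%[s]), %[s]\n\t"    /* advance pointers; lea leaves flags */
        "lea   8(%[d]), %[d]\n\t"
        "loop  1b"                   /* dec rcx + branch, writes no flags  */
        : [d] "+r" (dst), [s] "+r" (src), "+c" (n)
        :
        : "rax", "cc", "memory");
}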
If loop had been fast, compilers would already be using it as a peephole optimization for code-size + speed on CPUs without macro-fusion.
It wouldn't stop me from getting annoyed at all the questions with bad 16bit code that uses loop for every loop, even when they also need another counter inside the loop. But at least it wouldn't be as bad.
In 1988, IBM fellow Glenn Henry had just come on board at Dell, which had a few hundred employees at the time, and in his first month he gave a tech talk about 386 internals. A bunch of us BIOS programmers had been wondering why LOOP was slower than DEC/JNZ so during the question/answer section somebody posed the question.
His answer made sense. It had to do with paging.
LOOP consists of two parts: decrementing CX, then jumping if CX is not zero. The first part cannot cause a processor exception, whereas the jump part can. For one, you could jump (or fall through) to an address outside segment boundaries, causing a SEGFAULT. For two, you could jump to a page that is swapped out.
A SEGFAULT usually spells the end for a process, but page faults are different. When a page fault occurs, the processor throws an exception, and the OS does the housekeeping to swap in the page from disk into RAM. After that, it restarts the instruction that caused the fault.
Restarting means restoring the state of the process to what it was just before the offending instruction. In the case of the LOOP instruction in particular, it meant restoring the value of the CX register. One might think you could just add 1 to CX, since we know CX got decremented, but apparently, it's not that simple. For example, check out this erratum from Intel:
The protection violations involved usually indicate a probable
software bug and restart is not desired if one of these violations
occurs. In a Protected Mode 80286 system with wait states during any
bus cycles, when certain protection violations are detected by the
80286 component, and the component transfers control to the exception
handling routine, the contents of the CX register may be unreliable.
(Whether CX contents are changed is a function of bus activity at the
time internal microcode detects the protection violation.)
To be safe, they needed to save the value of CX on every iteration of a LOOP instruction, in order to reliably restore it if needed.
It's this extra burden of saving CX that made LOOP so slow.
Intel, like everyone else at the time, was moving more and more toward RISC. The old CISC instructions (LOOP, ENTER, LEAVE, BOUND) were being phased out. We still used them in hand-coded assembly, but compilers ignored them completely.
Now that I googled after writing my question, it turns out to be an exact duplicate of one on comp.arch, which came up right away. I expected it to be hard to google (lots of "why is my loop slow" hits), but my first try (why is the x86 loop instruction slow) got results.
This is not a good or complete answer.
It might be the best we'll get, and will have to suffice unless someone can shed some more light on it. I didn't set out to write this as an answer-my-own-question post.
Good posts with different theories in that thread:
Robert
LOOP became slow on some of the earliest machines (circa 486) when
significant pipelining started to happen, and running any but the
simplest instruction down the pipeline efficiently was technologically
impractical. So LOOP was slow for a number of generations. So nobody
used it. So when it became possible to speed it up, there was no real
incentive to do so, since nobody was actually using it.
Anton Ertl:
IIRC LOOP was used in some software for timing loops; there was
(important) software that did not work on CPUs where LOOP was too fast
(this was in the early 90s or so). So CPU makers learned to make LOOP
slow.
(Paul, and anyone else: You're welcome to re-post your own writing as your own answer. I'll remove it from my answer and up-vote yours.)
Paul A. Clayton (occasional SO poster and CPU architecture guy) took a guess at how you could use that many uops. (This looks like loope/ne, which checks both the counter and ZF):
I could imagine a possibly sensible 6-µop version:
virtual_cc = cc
temp = test(cc)
rCX = rCX - temp    // also setting cc
cc = temp & cc      // assumes branch handling is not
                    // substantially changed for the sake of LOOP
branch
cc = virtual_cc
(Note that this is 6 uops, not SnB's 11 for LOOPE/LOOPNE, and is a total guess not even trying to take into account anything known from SnB perf counters.)
Then Paul said:
I agree that a shorter sequence should be possible, but I was trying
to think of a bloated sequence that might make sense if minimal
microarchitectural adjustments were permitted.
summary: The designers wanted loop to be supported only via microcode, with no adjustments whatsoever to the hardware proper.
If a useless, compatibility-only instruction is handed to the
microcode developers, they might reasonably not be able or willing to
suggest minor changes to the internal microarchitecture to improve
such an instruction. Not only would they rather use their "change
suggestion capital" more productively but the suggestion of a change
for a useless case would reduce the credibility of other suggestions.
(My opinion: Intel is probably still making it slow on purpose, and hasn't bothered to rewrite their microcode for it for a long time. Modern CPUs are probably too fast for anything using loop in a naive way to work correctly.)
... Paul continues:
The architects behind Nano may have found avoiding the special casing
of LOOP simplified their design in terms of area or power. Or they
may have had incentives from embedded users to provide a fast
implementation (for code density benefits). Those are just WILD
guesses.
If optimization of LOOP fell out of other optimizations (like fusion
of compare and branch), it might be easier to tweak LOOP into a fast
path instruction than to handle it in microcode even if the
performance of LOOP was unimportant.
I suspect that such decisions are based on specific details of the
implementation. Information about such details does not seem to be
generally available and interpreting such information would be
beyond the skill level of most people. (I am not a hardware
designer--and have never played one on television or stayed at a
Holiday Inn Express. :-)
The thread then went off-topic into the realm of AMD blowing our one chance to clean up the cruft in x86 instruction encoding. It's hard to blame them, since every change is a case where the decoders can't share transistors. And before Intel adopted x86-64, it wasn't even clear that it would catch on. AMD didn't want to burden their CPUs with hardware nobody used if AMD64 didn't catch on.
But still, there are so many small things: setcc could have been changed to write 32 bits. (Usually you have to use xor-zero / test / setcc to avoid false dependencies, or because you need a zero-extended reg.) Shift could have unconditionally written flags, even with a zero shift count (removing the input data dependency on eflags for variable-count shifts in OOO execution). Last time I typed this list of pet peeves, I think there was a third one... Oh yeah, bt / bts etc. with memory operands have the address dependent on the upper bits of the index (bit string, not just bit within a machine word).
bts instructions are very useful for bit-field stuff, and are slower than they need to be so you almost always want to load into a register and then use that. (It's usually faster to shift/mask to get an address yourself, instead of using 10 uop bts [mem], reg on Skylake, but it does take extra instructions. So it made sense on 386, but not on K8). Atomic bit-manipulation has to use the memory-dest form, but the locked version needs lots of uops anyway. It's still slower than if it couldn't access outside the dword it's operating on.
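A sketch of the shift/mask alternative (my example, in C): the explicit index arithmetic makes the bit-string addressing visible, whereas a memory-destination bts with a register index performs it implicitly:

#include <stdint.h>

/* Setting bit idx of a bit array: bit 100 lives in bits[3], not bits[0].
   This is exactly the addressing that "bts [mem], reg" hides in the upper
   bits of the register index. */
static inline void set_bit(uint32_t *bits, unsigned idx)
{
    bits[idx / 32] |= (uint32_t)1 << (idx % 32);
}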
Please see the nice article by Michael Abrash, published in Dr. Dobb's Journal, March 1991, v16 n3 p16(8): http://archive.gamedev.net/archive/reference/articles/article369.html
The summary of the article is the following:
Optimizing code for 8088, 80286, 80386 and 80486 microprocessors is
difficult because the chips use significantly different memory
architectures and instruction execution times. Code cannot be
optimized for the 80x86 family; rather, code must be designed to
produce good performance on a range of systems or optimized for
particular combinations of processors and memory. Programmers must
avoid the unusual instructions supported by the 8088, which have lost
their performance edge in subsequent chips. String instructions
should be used but not relied upon. Registers should be used rather
than memory operations. Branching is also slow for all four
processors. Memory accesses should be aligned to improve
performance. Generally, optimizing an 80486 requires exactly the
opposite steps as optimizing an 8088.
By "unusual instructions supported by the 8088" the author also means "loop":
Any 8088 programmer would instinctively replace:
DEC CX
JNZ LOOPTOP
with:
LOOP LOOPTOP
because LOOP is significantly faster on the 8088.
LOOP is also faster on the 286. On the 386, however, LOOP is actually
two cycles slower than DEC/JNZ. The pendulum swings still further on
the 486, where LOOP is about twice as slow as DEC/JNZ--and, mind you,
we're talking about what was originally perhaps the most obvious
optimization in the entire 80x86 instruction set.
This is a very good article, and I highly recommend it. Even though it was published in 1991, it is surprisingly relevant today.
But this article only gives advice; it encourages you to test execution speed and choose the faster variants. It doesn't explain WHY some instructions became very slow, so it doesn't fully address your question.
The answer is that earlier processors, like the 80386 (released in 1985) and before, executed instructions one by one, sequentially.
Later processors started to use instruction pipelining – initially simple, on the 80486 – until, finally, the Pentium Pro (released in 1995) introduced a radically different internal pipeline, calling it the Out-Of-Order (OOO) core, where instructions were transformed into small fragments of operations called micro-ops (µops), and all the micro-ops of different instructions were put into a large pool where they could execute simultaneously as long as they did not depend on one another. This OOO pipeline principle is still used, almost unchanged, in modern processors. You can find more information about instruction pipelining in this brilliant article: https://www.gamedev.net/resources/_/technical/general-programming/a-journey-through-the-cpu-pipeline-r3115
To simplify chip design, Intel decided to build processors in such a way that some instructions transform into micro-ops very efficiently, while others do not.
Efficient conversion from instructions to micro-ops requires more transistors, so Intel decided to save on transistors at the cost of slower decoding and execution of some "complex" or "rarely used" instructions.
For example, the “Intel® Architecture Optimization Reference Manual” http://download.intel.com/design/PentiumII/manuals/24512701.pdf mentions the following: “Avoid using complex instructions (for example, enter, leave, or loop) that generally have more than four µops and require multiple cycles to decode. Use sequences of simple instructions instead.”
So Intel somehow decided that the loop instruction is "complex", and since then it has been very slow. However, there is no official Intel reference on instruction breakdown: how many micro-ops each instruction produces, or how many cycles are required to decode it.
You can also read about the Out-of-Order Execution Engine in the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", section 2.1.2:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

How do I force the CPU to perform in order execution of a program without any loops or branches?

Is it possible, for a small piece of code without any branches/loops?
Are there any gcc flags or intrinsic instructions, like SSE's for x86, for other processor families? I am just curious, since all the processors available these days follow the out-of-order execution model.
Thanks in advance
Most modern out-of-order CPUs are inherently out-of-order; no switching between in-order and out-of-order modes is possible.
You can try to find some in-order CPU, and there are some:
x86: Intel Atom (only the 45 nm and older versions; they have two parallel pipelines but execute all instructions in order)
arm: Cortex-A8 and many older cores
While it is not possible to directly turn off instruction reordering in a typical out-of-order CPU, you can inject something serializing (like cpuid in the x86 world) between every pair of your instructions to simulate in-order execution.
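A minimal sketch of that idea in GCC inline asm (the wrapper name is mine): cpuid is serializing at any privilege level, at the cost of clobbering EAX/EBX/ECX/EDX:

static inline void serialize(void)
{
    unsigned a = 0, b, c, d;
    __asm__ volatile("cpuid"               /* serializing instruction */
                     : "+a" (a), "=b" (b), "=c" (c), "=d" (d)
                     :
                     : "memory");          /* also a compiler barrier */
}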
There is a part of Intel manuals (vol 3a) about serializing instructions (copied from http://objectmix.com/asm-x86-asm-370/69413-serializing-instructions.html):
Volume 3A: System Programming Guide states
7.4 SERIALIZING INSTRUCTIONS
The Intel 64 and IA-32 architectures define several serializing
instructions. These instructions force the processor to complete all
modifications to flags, registers, and memory by previous instructions
and to drain all buffered writes to memory before the next instruction
is fetched and executed. For example, when a MOV to control register
instruction is used to load a new value into control register CR0 to
enable protected mode, the processor must perform a serializing
operation before it enters protected mode. This serializing operation
insures that all operations that were started while the processor was
in real-address mode are completed before the switch to protected
mode is made.
The concept of serializing instructions was introduced into the IA-32
architecture with the Pentium processor to support parallel
instruction execution. Serializing instructions have no meaning for
the Intel486 and earlier processors that do not implement parallel
instruction execution.
It is important to note that executing of serializing instructions on
P6 and more recent processor families constrain speculative execution
because the results of speculatively executed instructions are
discarded. The following instructions are serializing instructions:
o Privileged serializing instructions - MOV (to control register,
with the exception of MOV CR8), MOV (to debug register), WRMSR, INVD,
INVLPG, WBINVD, LGDT, LLDT, LIDT, and LTR.
o Non-privileged serializing instructions - CPUID, IRET, and RSM.
When the processor serializes instruction execution, it ensures that
all pending memory transactions are completed (including writes
stored in its store buffer) before it executes the next instruction.
Nothing can pass a serializing instruction and a serializing
instruction cannot pass any other instruction (read, write,
instruction fetch, or I/O). For example, CPUID can be executed at any
privilege level to serialize instruction execution with no effect on
program flow, except that the EAX, EBX, ECX, and EDX registers are
modified.
It is possible, but it depends on the CPU. Then again, the instructions themselves don't matter, the memory accesses matter.
AFAIK all CPUs guarantee that registers (and thus the internal state) appear updated in order, regardless of how execution happens. In some CPUs temporary registers are allocated with a value and the result is "written back" (register renaming, so there's no copy per se) at the appropriate time.
For memory accesses, most CPUs have memory barriers of some kind, which limit the reordering of memory accesses. There are several different kinds of memory barriers, and they differ from CPU to CPU. You can conceivably place a full memory barrier between each instruction and you'll get halfway there. If you have a multi-processor machine you might need to do some extra work to make sure the caches are also flushed. Without explicit instructions the other core may not see the results in order.
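As a hedged sketch of the "full barrier between instructions" idea, using the GCC builtin from the reading list below (variable names illustrative):

extern volatile int x, y;

void ordered_stores(void)
{
    x = 1;
    __sync_synchronize();  /* full hardware memory barrier (e.g. mfence on x86) */
    y = 1;                 /* other cores cannot observe y==1 before x==1       */
}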
It very much depends on what you're trying to achieve and on which specific CPU. Every CPU out there is different in some way. And there won't be any magical gcc flags. In gcc the best you'll have are builtin atomic types (link below). The topic is huge. No simple answers.
Recommended reading list:
http://lxr.free-electrons.com/source/Documentation/memory-barriers.txt
https://en.wikipedia.org/wiki/Memory_ordering
http://gcc.gnu.org/onlinedocs/gcc-4.4.5/gcc/Atomic-Builtins.html
http://www.akkadia.org/drepper/cpumemory.pdf

Implementing registers in a C virtual machine

I've written a virtual machine in C as a hobby project. This virtual machine executes code that's very similar to Intel syntax x86 assembly. The problem is that the registers this virtual machine uses are only registers in name. In my VM code, registers are used just like x86 registers, but the machine stores them in system memory. There are no performance improvements to using registers over system memory in VM code. (I thought that the locality alone would increase performance somewhat, but in practice, nothing has changed.)
When interpreting a program, this virtual machine stores arguments to instructions as pointers. This allows a virtual instruction to take a memory address, constant value, virtual register, or just about anything as an argument.
Since hardware registers don't have addresses, I can't think of a way to actually store my VM registers in hardware registers. Using the register keyword on my virtual register type doesn't work, because I have to get a pointer to the virtual register to use it as an argument. Is there any way to make these virtual registers perform more like their native counterparts?
I'm perfectly comfortable delving into assembly if necessary. I'm aware that JIT compiling this VM code could allow me to utilize hardware registers, but I'd like to be able to use them with my interpreted code as well.
Machine registers don't have indexing support: you can't access the register with a runtime-specified "index", whatever that would mean, without code generation. Since you're likely decoding the register index from your instructions, the only way is to make a huge switch (i.e. switch (opcode) { case ADD_R0_R1: r[0] += r[1]; break; ... }). This is likely a bad idea since it increases the interpreter loop size too much, so it will introduce instruction cache thrashing.
If we're talking about x86, the additional problem is that the amount of general-purpose registers is pretty low; some of them will be used for bookkeeping (storing PC, storing your VM stack state, decoding instructions, etc.) - it's unlikely that you'll have more than one free register for the VM.
Even if register indexing support were available, it's unlikely it would give you a lot of performance. Commonly in interpreters the largest bottleneck is instruction decoding; x86 supports fast and compact memory addressing based on register values (i.e. mov eax, dword ptr [ebx * 4 + ecx]), so you would not win much. It's worthwhile though to check the generated assembly - i.e. to make sure the 'register pool' address is stored in the register.
The best way to accelerate interpreters is JITting; even a simple JIT (i.e. without smart register allocation - basically just emitting the same code you would execute with the instruction loop and a switch statement, except the instruction decoding) can boost your performance 3x or more (these are actual results from a simple JITter on top of a Lua-like register-based VM). An interpreter is best kept as reference code (or for cold code to decrease JIT memory cost - the JIT generation cost is a non-issue for simple JITs).
Even if you could directly access hardware registers, the code wrapped around the decision of whether to use a register or memory for each operand would itself make things that much slower.
To get performance you need to design for performance up front.
A few examples.
Prepare an x86 VM by setting up all the traps to catch the code leaving its virtual memory space. Execute the code directly; don't emulate. Branch to it and run. When the code reaches out of its memory/I/O space to talk to a device, etc., trap that, emulate the device or whatever it was reaching for, then return control back to the program. If the code is processor-bound, it will run really fast; if I/O-bound, slow, but not as slow as emulating each instruction.
Static binary translation. Disassemble and translate the code before running it; for example, the instruction bytes 0x34,0x2E (xor al, 0x2E) would turn into ASCII in a .c file:
al ^= 0x2E;      /* xor al, 0x2E */
of = 0;          /* xor clears OF and CF */
cf = 0;
sf = al >> 7;    /* sign flag = high bit of the result */
zf = (al == 0);
Ideally performing tons of dead-code removal (if the next instruction modifies the flags as well, then don't modify them here, etc.), and letting the optimizer in the compiler do the rest. You can get a performance gain this way over an emulator; how good a gain depends on how well you can optimize the code. Being a new program, it runs on the hardware, registers, memory and all, so the processor-bound code is slower than a VM. In some cases you don't have to deal with the processor raising exceptions to trap memory/I/O, because you have simulated the memory accesses in the code, but that still has a cost and calls a simulated device anyway, so no savings there.
Dynamic translation, similar to SBT, but done at runtime. I have heard of this being done, for example, when simulating x86 code on some other processor, say a DEC Alpha: the code is slowly changed into native Alpha instructions from x86 instructions, so the next time around the processor executes the Alpha instruction directly instead of emulating the x86 instruction. Each time through the code, the program executes faster.
Or maybe just redesign your emulator to be more efficient from an execution standpoint. Look at the emulated processors in MAME, for example: the readability and maintainability of the code has been sacrificed for performance. When it was written that was important; today, with multi-core gigahertz processors, you don't have to work so hard to emulate a 6502 at an effective 1.5 GHz or a Z80 at 3 GHz. Something as simple as looking the next opcode up in a table and deciding not to emulate some or all of the flag calculation for an instruction can give you a noticeable boost.
Bottom line: if you are interested in using the x86 hardware registers AX, BX, etc. to emulate AX, BX, etc. registers when running a program, the only efficient way to do that is to actually execute the instructions, and not execute-and-trap as in single-stepping a debugger, but execute long strings of instructions while preventing them from leaving the VM space. There are different ways to do this, and performance results will vary, and none of it is guaranteed to be faster than a performance-efficient emulator. This approach limits you to matching the processor to the program. Emulating the registers with efficient code and a really good compiler (good optimizer) will give you reasonable performance and portability, in that you don't have to match the hardware to the program being run.
Transform your complex, register-based code before execution (ahead of time). A simple solution would be a Forth-like dual-stack VM for execution, which offers the possibility of caching the top-of-stack element (TOS) in a register. If you prefer a register-based solution, choose an "opcode" format which bundles as many instructions as possible (as a rule of thumb, up to four instructions can be bundled into a byte if a MISC-style design is chosen). This way, virtual register accesses are locally resolvable to physical register references for each static super-instruction (clang and gcc are able to perform such optimization). As a side effect, the lowered BTB mis-prediction rate results in far better performance, regardless of the specific register allocation.
The best threading techniques for C-based interpreters are direct threading (using the labels-as-values extension) and replicated switch threading (ANSI-conformant); direct threading is sketched below.
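A minimal sketch of direct threading with GCC's labels-as-values ("computed goto") extension; the opcode set and bytecode layout are illustrative:

/* Bytecode layout (illustrative): opcode byte, then operand bytes. */
enum { OP_ADD = 0, OP_HALT = 1 };

static void run(const unsigned char *pc, long *reg)
{
    static void *dispatch[] = { &&op_add, &&op_halt };
    goto *dispatch[*pc++];         /* dispatch the first opcode */

op_add:
    reg[pc[0]] += reg[pc[1]];      /* reg[dst] += reg[src] */
    pc += 2;
    goto *dispatch[*pc++];         /* each handler dispatches the next */

op_halt:
    return;
}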
So you're writing an x86 interpreter, which is bound to be between 1 and 3 powers of 10 slower than the actual hardware. In real hardware, mov mem, foo takes a lot more time than mov reg, foo, while in your program mem[adr] = foo takes about as long as myRegVars[regnum] = foo (modulo caching). So you're expecting the same speed differential?
If you want to simulate the speed differential between registers and memory, you'll have to do something like what Cachegrind does: keep a simulated clock, and whenever the program makes a memory reference, add a big number to that clock.
Your VM seems too complicated for efficient interpretation. An obvious optimisation is to have a "microcode" VM with register load/store instructions, probably even a stack-based one. You can translate your high-level VM code into the simpler form before execution. Another useful optimisation relies on gcc's computed-labels extension; see the Objective Caml VM interpreter for an example of such a threaded VM implementation.
To answer the specific question you asked:
You could instruct your C compiler to leave a bunch of registers free for your use. Pointers into the first page of memory are usually not allowed; that page is reserved for NULL pointer checks, so you can abuse such pointer values as register markers. It helps if you have a few native registers to spare, so my example uses 64-bit mode to simulate 4 registers. It may very well be that the additional overhead of the switch slows down execution instead of speeding it up. Also see the other answers for general advice.
/* compile with gcc */
register long r0 asm("r12");
register long r1 asm("r13");
register long r2 asm("r14");
register long r3 asm("r15");

inline long get_argument(long *arg)
{
    unsigned long val = (unsigned long)arg;
    switch (val)
    {
    /* leave 0 for NULL pointer */
    case 1: return r0;
    case 2: return r1;
    case 3: return r2;
    case 4: return r3;
    default: return *arg;  /* a real pointer: load from memory */
    }
}

Why does Windows64 use a different calling convention from all other OSes on x86-64?

AMD has an ABI specification that describes the calling convention to use on x86-64. All OSes follow it, except for Windows, which has its own x86-64 calling convention. Why?
Does anyone know the technical, historical, or political reasons for this difference, or is it purely a matter of NIH syndrome?
I understand that different OSes may have different needs for higher level things, but that doesn't explain why for example the register parameter passing order on Windows is rcx - rdx - r8 - r9 - rest on stack while everyone else uses rdi - rsi - rdx - rcx - r8 - r9 - rest on stack.
P.S. I am aware of how these calling conventions differ generally and I know where to find details if I need to. What I want to know is why.
Edit: for the how, see e.g. the wikipedia entry and links from there.
Choosing four argument registers on x64 - common to UN*X / Win64
One of the things to keep in mind about x86 is that the register name to "reg number" encoding is not obvious; in terms of instruction encoding (the MOD R/M byte, see http://www.c-jump.com/CIS77/CPU/x86/X77_0060_mod_reg_r_m_byte.htm), register numbers 0...7 are - in that order - ?AX, ?CX, ?DX, ?BX, ?SP, ?BP, ?SI, ?DI.
Hence choosing A/C/D (regs 0..2) for return value and the first two arguments (which is the "classical" 32bit __fastcall convention) is a logical choice. As far as going to 64bit is concerned, the "higher" regs are ordered, and both Microsoft and UN*X/Linux went for R8 / R9 as the first ones.
Keeping that in mind, Microsoft's choice of RAX (return value) and RCX, RDX, R8, R9 (arg[0..3]) are an understandable selection if you choose four registers for arguments.
I don't know why the AMD64 UN*X ABI chose RDX before RCX.
Choosing six argument registers on x64 - UN*X specific
UN*X, on RISC architectures, has traditionally done argument passing in registers - specifically, for the first six arguments (that's so on PPC, SPARC, MIPS at least). Which might be one of the major reasons why the AMD64 (UN*X) ABI designers chose to use six registers on that architecture as well.
So if you want six registers to pass arguments in, and it's logical to choose RCX, RDX, R8 and R9 for four of them, which other two should you pick ?
The "higher" regs require an additional instruction prefix byte to select them and therefore have a bigger instruction size footprint, so you wouldn't want to choose any of those if you have options. Of the classical registers, due to the implicit meaning of RBP and RSP these aren't available, and RBX traditionally has a special use on UN*X (global offset table) which seemingly the AMD64 ABI designers didn't want to needlessly become incompatible with.
Ergo, the only choice were RSI / RDI.
So if you have to take RSI / RDI as argument registers, which arguments should they be ?
Making them arg[0] and arg[1] has some advantages. See cHao's comment.
?SI and ?DI are the string instructions' source / destination operands, and as cHao mentioned, their use as argument registers means that with the AMD64 UN*X calling conventions, the simplest possible block-copy function essentially reduces to the two CPU instructions rep movsb; ret, because the source/target addresses have been put into the correct registers by the caller (the count still has to land in rcx). There is, particularly in low-level and compiler-generated "glue" code (think, for example, some C++ heap allocators zero-filling objects on construction, or the kernel zero-filling heap pages on sbrk(), or copy-on-write pagefaults) an enormous amount of block copy/fill, hence it'll be useful for code so frequently used to save the two or three CPU instructions that'd otherwise load such source/target address arguments into the "correct" registers.
So in a way, UN*X and Win64 are only different in that UN*X "prepends" two additional arguments, in purposefully chosen RSI/RDI registers, to the natural choice of four arguments in RCX, RDX, R8 and R9.
Beyond that ...
There are more differences between the UN*X and Windows x64 ABIs than just the mapping of arguments to specific registers. For the overview on Win64, check:
http://msdn.microsoft.com/en-us/library/7kcdt6fy.aspx
Win64 and AMD64 UN*X also strikingly differ in the way stack space is used; on Win64, for example, the caller must allocate stack space for function arguments even though args 0...3 are passed in registers. On UN*X, on the other hand, a leaf function (i.e. one that doesn't call other functions) is not even required to allocate stack space at all if it needs no more than 128 bytes of it (yes, you own and can use a certain amount of stack without allocating it ... well, unless you're kernel code, a source of nifty bugs). All these are particular optimization choices, and most of the rationale for them is explained in the full ABI references that the original poster's wikipedia reference points to.
IDK why Windows did what they did. See the end of this answer for a guess. I was curious about how the SysV calling convention was decided on, so I dug into the mailing list archive and found some neat stuff.
It's interesting reading some of those old threads on the AMD64 mailing list, since AMD architects were active on it. e.g. Choosing register names was one of the hard parts: AMD considered renaming the original 8 registers r0-r7, or calling the new registers UAX etc.
Also, feedback from kernel devs identified things that made the original design of syscall and swapgs unusable. That's how AMD updated the instruction to get this sorted out before releasing any actual chips. It's also interesting that in late 2000, the assumption was that Intel probably wouldn't adopt AMD64.
The SysV (Linux) calling convention, and the decision on how many registers should be callee-preserved vs. caller-save, was made initially in Nov 2000, by Jan Hubicka (a gcc developer). He compiled SPEC2000 and looked at code size and number of instructions. That discussion thread bounces around some of the same ideas as answers and comments on this SO question. In a 2nd thread, he proposed the current sequence as optimal and hopefully final, generating smaller code than some alternatives.
He's using the term "global" to mean call-preserved registers, that have to be push/popped if used.
The choice of rdi, rsi, rdx as the first three args was motivated by:
minor code-size saving in functions that call memset or other C string function on their args (where gcc inlines a rep string operation?)
rbx is call-preserved because having two call-preserved regs accessible without REX prefixes (rbx and rbp) is a win. Presumably chosen because they're the only "legacy" registers that aren't implicitly used by any common instruction. (rep string, shift count, and mul/div outputs/inputs touch everything else).
None of the registers that common instructions force you to use are call-preserved (see prev point), so a function that wants to use a variable-count shift or division might have to move function args somewhere else, but doesn't have to save/restore the caller's value. cmpxchg16b and cpuid need RBX, but are rarely used so not a big factor. (cmpxchg16b wasn't part of original AMD64, but RBX would still have been the obvious choice. cmpxchg8b exists but was obsoleted by qword cmpxchg)
We are trying to avoid RCX early in the sequence, since it is register
used commonly for special purposes, like EAX, so it has same purpose to be
missing in the sequence.
Also it can't be used for syscalls and we would like to make syscall sequence
to match function call sequence as much as possible.
(Background: syscall / sysret unavoidably destroy rcx (with RIP) and r11 (with RFLAGS), so the kernel can't see what was originally in rcx when syscall ran.)
The kernel system-call ABI was chosen to match the function-call ABI, except that it uses r10 instead of rcx, so a libc wrapper function like mmap(2) can just do mov %rcx, %r10 / mov $0x9, %eax / syscall.
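A hedged sketch of such a wrapper in GCC inline asm (9 is __NR_mmap on x86-64; the function name and types are illustrative): the first three arguments are already where the kernel wants them, and only the fourth moves from rcx to r10:

#include <stdint.h>

static void *raw_mmap(void *addr, uint64_t len, uint64_t prot,
                      uint64_t flags, uint64_t fd, uint64_t off)
{
    void *ret;
    register uint64_t r10 __asm__("r10") = flags;  /* 4th arg: rcx -> r10 */
    register uint64_t r8  __asm__("r8")  = fd;
    register uint64_t r9  __asm__("r9")  = off;
    __asm__ volatile("syscall"
                     : "=a" (ret)
                     : "0" (9L), "D" (addr), "S" (len), "d" (prot),
                       "r" (r10), "r" (r8), "r" (r9)
                     : "rcx", "r11", "memory");    /* syscall clobbers rcx/r11 */
    return ret;
}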
Note that the SysV calling convention used by i386 Linux sucks compared to Windows' 32-bit __vectorcall. It passes everything on the stack, and only returns int64 in edx:eax, not small structs. It's no surprise little effort was made to maintain compatibility with it. When there was no reason not to, they did things like keeping rbx call-preserved, since they decided that having another call-preserved register among the original 8 (which don't need a REX prefix) was good.
Making the ABI optimal is much more important long-term than any other consideration. I think they did a pretty good job. I'm not totally sure about returning structs packed into registers, instead of different fields in different regs. I guess code that passes them around by value without actually operating on the fields wins this way, but the extra work of unpacking seems silly. They could have had more integer return registers, more than just rdx:rax, so returning a struct with 4 members could return them in rdi, rsi, rdx, rax or something.
They considered passing integers in vector regs, because SSE2 can operate on integers. Fortunately they didn't do that. Integers are used as pointer offsets very often, and a round-trip to stack memory is pretty cheap. Also SSE2 instructions take more code bytes than integer instructions.
I suspect Windows ABI designers might have been aiming to minimize differences between 32 and 64bit for the benefit of people that have to port asm from one to the other, or that can use a couple #ifdefs in some ASM so the same source can more easily build a 32 or 64bit version of a function.
Minimizing changes in the toolchain seems unlikely. An x86-64 compiler needs a separate table of which register is used for what, and what the calling convention is. Having a small overlap with 32bit is unlikely to produce significant savings in toolchain code size / complexity.
Remember that Microsoft was initially "officially noncommittal toward the early AMD64 effort" (from "A History of Modern 64-bit Computing" by Matthew Kerner and Neil Padgett) because they were strong partners with Intel on the IA64 architecture. I think this meant that even if they had otherwise been open to working with GCC engineers on an ABI to use both on Unix and Windows, they wouldn't have done so, as it would have meant publicly supporting the AMD64 effort when they hadn't yet officially done so (and would have probably upset Intel).
On top of that, back in those days Microsoft had absolutely no leanings toward being friendly with open source projects. Certainly not Linux or GCC.
So why would they have cooperated on an ABI? I'd guess that the ABIs are different simply because they were designed at more or less the same time and in isolation.
Another quote from "A History of Modern 64-bit Computing":
In parallel with the Microsoft collaboration, AMD also engaged the
open source community to prepare for the chip. AMD contracted with
both Code Sorcery and SuSE for tool chain work (Red Hat was already
engaged by Intel on the IA64 tool chain port). Russell explained that
SuSE produced C and FORTRAN compilers, and Code Sorcery produced a
Pascal compiler. Weber explained that the company also engaged with
the Linux community to prepare a Linux port. This effort was very
important: it acted as an incentive for Microsoft to continue to
invest in the AMD64 Windows effort, and also ensured that Linux, which
was becoming an important OS at the time, would be available once the
chips were released.
Weber goes so far as to say that the Linux work was absolutely crucial
to AMD64’s success, because it enabled AMD to produce an end-to-end
system without the help of any other companies if necessary. This
possibility ensured that AMD had a worst-case survival strategy even
if other partners backed out, which in turn kept the other partners
engaged for fear of being left behind themselves.
This indicates that even AMD didn't feel that cooperation was necessarily the most important thing between MS and Unix, but that having Unix/Linux support was very important. Maybe even trying to convince one or both sides to compromise or cooperate wasn't worth the effort or risk(?) of irritating either of them? Perhaps AMD thought that even suggesting a common ABI might delay or derail the more important objective of simply having software support ready when the chip was ready.
Speculation on my part, but I think the major reason the ABIs are different was the political reason that MS and the Unix/Linux sides just didn't work together on it, and AMD didn't see that as a problem.
Win32 has its own uses for ESI and EDI, and requires that they not be modified (or at least that they be restored before calling into the API). I'd imagine 64-bit code does the same with RSI and RDI, which would explain why they're not used to pass function arguments around.
I couldn't tell you why RCX and RDX are switched, though.
