Why my kernel module performs float division perfectly? [duplicate] - linux-kernel

I'm running on an x86 CentOS 6.3 (kernel v2.6.32) system.
I compiled the following function into a bare-bones character driver module as an experiment to see how the Linux kernel reacts to floating point operations.
static unsigned floatstuff(void) {
    float x = 3.14;
    x *= 2.5;
    return x;
}
...
printk(KERN_INFO "x: %u", x);
The code compiled (which I wasn't expecting), so I inserted the module and checked the log with dmesg. The log showed: x: 7.
This seems strange; I thought you couldn't perform floating point operations in the Linux kernel -- save some exceptions such as kernel_fpu_begin(). How did the module perform the floating point operation?
Is this because I'm on an x86 processor?
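For what it's worth, the same arithmetic in plain user-space C reproduces the 7, so the value itself is unsurprising; it's the kernel context that's the problem. A minimal user-space replica (renamed to avoid confusion with the module's function):

```c
/* User-space replica of the module's arithmetic. 3.14f * 2.5f is
   about 7.85, and the implicit float -> unsigned conversion in the
   return statement truncates toward zero, matching the "x: 7" seen
   in dmesg. */
static unsigned floatstuff_user(void) {
    float x = 3.14f;
    x *= 2.5f;
    return x; /* truncating float -> unsigned conversion */
}
```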

I thought you couldn't perform floating point operations in the Linux kernel
You can't safely: failure to use kernel_fpu_begin() / kernel_fpu_end() doesn't mean FPU instructions will fault (not on x86 at least).
Instead it will silently corrupt user-space's FPU state. This is bad; don't do that.
The compiler doesn't know what kernel_fpu_begin() means, so it can't check / warn about code that compiles to FPU instructions outside of FPU-begin regions.
There may be a debug mode where the kernel does disable SSE, x87, and MMX instructions outside of kernel_fpu_begin / end regions, but that would be slower and isn't done by default.
It is possible, though: setting CR0::TS = 1 makes x87 instructions fault, so lazy FPU context switching is possible, and there are other bits for SSE and AVX.
There are many ways for buggy kernel code to cause serious problems. This is just one of many. In C, you pretty much always know when you're using floating point (unless a typo results in a 1. constant or something in a context that actually compiles).
Why is the FP architectural state different from integer?
Linux has to save/restore the integer state any time it enters/exits the kernel. All code needs to use integer registers (except for a giant straight-line block of FPU computation that ends with a jmp instead of a ret (ret modifies rsp).)
But kernel code avoids FPU generally, so Linux leaves the FPU state unsaved on entry from a system call, only saving before an actual context switch to a different user-space process or on kernel_fpu_begin. Otherwise, it's common to return to the same user-space process on the same core, so FPU state doesn't need to be restored because the kernel didn't touch it. (And this is where corruption would happen if a kernel task actually did modify the FPU state. I think this goes both ways: user-space could also corrupt your FPU state).
The integer state is fairly small: only 16x 64-bit registers + RFLAGS and segment regs. FPU state is more than twice as large even without AVX: 8x 80-bit x87 registers, and 16x XMM or YMM, or 32x ZMM registers (+ MXCSR, and the x87 status + control words). Also the MPX bnd0-3 registers are lumped in with "FPU"; at this point "FPU state" just means all non-integer registers. On my Skylake, dmesg says x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format.
See Understanding FPU usage in linux kernel; modern Linux doesn't do lazy FPU context switches by default for context switches (only for kernel/user transitions). (But that article explains what Lazy is.)
Most processes use SSE for copying/zeroing small blocks of memory in compiler-generated code, and most library string/memcpy/memset implementations use SSE/SSE2. Also, hardware supported optimized save/restore is a thing now (xsaveopt / xrstor), so "eager" FPU save/restore may actually do less work if some/all FP registers haven't actually been used. e.g. save just the low 128b of YMM registers if they were zeroed with vzeroupper so the CPU knows they're clean. (And mark that fact with just one bit in the save format.)
With "eager" context switching, FPU instructions stay enabled all the time, so bad kernel code can corrupt them at any time.

Don't do that!
In kernel space the FPU is disabled for several reasons:
It allows Linux to run on architectures that do not have an FPU
It avoids having to save and restore the whole set of FPU registers on every kernel/user-space transition (which could double the cost of a context switch)
Basically all kernel functions use integers, even for representing decimal numbers -> you probably don't need floating point
In Linux, preemption is disabled when kernel-space is running in FPU mode
Floating point numbers are evil and may generate very bad unexpected behaviour
If you really want to use FP numbers (and you should not), you must use the kernel_fpu_begin and kernel_fpu_end primitives to avoid corrupting the user-space registers, and you should take into account all of the possible problems (security included) of dealing with FP numbers.
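To illustrate the "integers for decimal numbers" point above, here is a hedged sketch of the fixed-point style that kernel code typically uses instead of floats (the scale factor of 1000 and the helper names are chosen for this example, not taken from any kernel API):

```c
/* Fixed-point arithmetic with a scale of 1000 (milli-units): the
   value 3.140 is stored as the integer 3140. Multiplication stays
   entirely in integer registers, so no FPU state is touched. */
typedef long fixed_t;           /* value scaled by FIXED_SCALE */
#define FIXED_SCALE 1000L

static fixed_t fixed_mul(fixed_t a, fixed_t b) {
    /* (a/1000) * (b/1000), rescaled back to milli-units */
    return a * b / FIXED_SCALE;
}
```

With this representation, the question's 3.14 * 2.5 becomes fixed_mul(3140, 2500), giving 7850, i.e. 7.850.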

Not sure where this perception is coming from. But the kernel executes on the same processor as the user mode code, and therefore has access to the same instruction set. If the processor can do floating point (directly or by a co-processor), the kernel can too.
Maybe you are thinking of cases where floating point arithmetic is emulated in software. But even so, it would be available in kernel (well, unless disabled somehow).
I am curious, where is this perception coming from? Maybe I am missing something.
Found this. Seems to be a good explanation.

The OS kernel may simply turn the FPU off in kernel mode.
When a floating-point operation is needed, the kernel turns the FPU on, performs the operation, and then turns the FPU off again.
But you cannot print the value as a float (printk has no floating-point format specifiers).

Related

What does the following assembly instruction mean "mov rax,qword ptr gs:[20h]" [duplicate]

So I know what the following registers and their uses are supposed to be:
CS = Code Segment (used for IP)
DS = Data Segment (used for MOV)
ES = Destination Segment (used for MOVS, etc.)
SS = Stack Segment (used for SP)
But what are the following registers intended to be used for?
FS = "File Segment"?
GS = ???
Note: I'm not asking about any particular operating system -- I'm asking about what they were intended to be used for by the CPU, if anything.
There is what they were intended for, and what they are used for by Windows and Linux.
The original intention behind the segment registers was to allow a program to access many different (large) segments of memory that were intended to be independent and part of a persistent virtual store. The idea was taken from the 1966 Multics operating system, that treated files as simply addressable memory segments. No BS "Open file, write record, close file", just "Store this value into that virtual data segment" with dirty page flushing.
Our current 2010 operating systems are a giant step backwards, which is why they are called "Eunuchs". You can only address your process space's single segment, giving a so-called "flat (IMHO dull) address space". The segment registers on the x86-32 machine can still be used for real segment registers, but nobody has bothered (Andy Grove, former Intel president, had a rather famous public fit last century when he figured out after all those Intel engineers spent energy and his money to implement this feature, that nobody was going to use it. Go, Andy!)
AMD in going to 64 bits decided they didn't care if they eliminated Multics as a choice (that's the charitable interpretation; the uncharitable one is they were clueless about Multics) and so disabled the general capability of segment registers in 64 bit mode. There was still a need for threads to access thread-local store, and each thread needed a pointer ... somewhere in the immediately accessible thread state (e.g., in the registers) ... to thread-local store. Since Windows and Linux both used FS and GS (thanks Nick for the clarification) for this purpose in the 32-bit version, AMD decided to let the 64-bit segment registers (GS and FS) be used essentially only for this purpose (I think you can make them point anywhere in your process space; I don't know if application code can load them or not). Intel, in their panic not to lose market share to AMD on 64 bits, and Andy being retired, decided to just copy AMD's scheme.
It would have been architecturally prettier IMHO to make each thread's memory map have an absolute virtual address (e.g, 0-FFF say) that was its thread local storage (no [segment] register pointer needed!); I did this in an 8 bit OS back in the 1970s and it was extremely handy, like having another big stack of registers to work in.
So, the segment registers are now kind of like your appendix. They serve a vestigial purpose. To our collective loss.
Those that don't know history aren't doomed to repeat it; they're doomed to doing something dumber.
The registers FS and GS are segment registers. They have no processor-defined purpose; instead, they are given a purpose by the OS running on the machine. FS and GS are commonly used by OS kernels to access thread-specific memory: in 64-bit Windows, GS points to operating-system-defined per-thread structures, while the Linux kernel uses GS to access per-CPU memory.
FS is used to point to the Thread Information Block (TIB) in Windows processes.
One typical example is structured exception handling (SEH), which stores a pointer to a callback function at FS:[0x00].
GS is commonly used as a pointer to thread-local storage (TLS).
One example that you might have seen before is stack canary protection (stackguard); in gcc you might see something like this:
mov eax,gs:0x14
mov DWORD PTR [ebp-0xc],eax
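The same segment-relative TLS access can be seen from C: each thread gets its own copy of a __thread variable, which gcc on x86-64 Linux addresses relative to %fs (on 32-bit, %gs, as in the canary snippet above). The function names here are invented for the demo:

```c
#include <pthread.h>
#include <stddef.h>

/* One copy of tls_slot per thread; the compiler emits
   segment-relative addressing to reach the current thread's copy. */
static __thread int tls_slot = 1;

static void *bump(void *unused) {
    (void)unused;
    tls_slot = 99;          /* writes only this thread's copy */
    return NULL;
}

static int tls_demo(void) {
    pthread_t t;
    pthread_create(&t, NULL, bump, NULL);
    pthread_join(t, NULL);
    return tls_slot;        /* still 1: the worker touched its own copy */
}
```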
TL;DR:
What is the “FS”/“GS” register intended for?
Simply to access data beyond the default data segment (DS). Exactly like ES.
The Long Read:
So I know what the following registers and their uses are supposed to be:
[...]
Well, almost, but DS is not 'some' data segment, it's the default one, where all operations take place by default (*1). This is where all default variables are located - essentially data and bss. It's in some ways part of the reason why x86 code is rather compact. All essential data, which is what is accessed most often (plus code and stack), is within 16-bit shorthand distance.
ES is used to access everything else (*2), everything beyond the 64 KiB of DS. Like the text of a word processor, the cells of a spreadsheet, or the picture data of a graphics program and so on. Contrary to what is often assumed, this data doesn't get accessed as much, so needing a prefix hurts less than using longer address fields would.
Similarly, it's only a minor annoyance that DS and ES might have to be loaded (and reloaded) when doing string operations - this at least is offset by one of the best character handling instruction sets of its time.
What really hurts is when user data exceeds 64 KiB and operations have to span segments. While some operations are simply done on a single data item at a time (think A=A*2), most require two (A=A*B) or three data items (A=B*C). If these items reside in different segments, ES will be reloaded several times per operation, adding quite some overhead.
In the beginning, with small programs from the 8-bit world (*3) and equally small data sets, it wasn't a big deal, but it soon became a major performance bottleneck - and even more so a true pain in the ass for programmers (and compilers). With the 386, Intel finally delivered relief by adding two more segments, so any series of unary, binary or ternary operations, with elements spread out in memory, could take place without reloading ES all the time.
For programming (at least in assembly) and compiler design, this was quite a gain. Of course, there could have been even more, but with three the bottleneck was basically gone, so no need to overdo it.
Naming wise the letters F/G are simply alphabetic continuations after E. At least from the point of CPU design nothing is associated.
*1 - The usage of ES as the string destination is an exception, as two segment registers are simply needed there. Without it, string instructions wouldn't be very useful - or would always need a segment prefix. That could kill one of their surprising features: the extreme performance that (non-repetitive) string instructions deliver due to their single-byte encoding.
*2 - So in hindsight 'Everything Else Segment' would have been a way better naming than 'Extra Segment'.
*3 - It's always important to keep in mind that the 8086 was only meant as a stop gap measure until the 8800 was finished and mainly intended for the embedded world to keep 8080/85 customers on board.
According to the Intel Manual, in 64-bit mode these registers are intended to be used as additional base registers in some linear address calculations. I pulled this from section 3.7.4.1 (pg. 86 in the 4 volume set). Usually when the CPU is in this mode, linear address is the same as effective address, because segmentation is often not used in this mode.
So in this flat address space, FS & GS play a role in addressing not just local data but certain operating-system data structures (pg 2793, section 3.2.4); thus these registers were intended to be used by the operating system, in whatever way its particular designers determine.
There is some interesting trickery when using overrides in both 32 & 64-bit modes but this involves privileged software.
From the perspective of "original intentions," that's tough to say other than they are just extra registers. When the CPU is in real address mode, this is like the processor is running as a high speed 8086 and these registers have to be explicitly accessed by a program. For the sake of true 8086 emulation you'd run the CPU in virtual-8086 mode and these registers would not be used.
The FS and GS segment registers were very useful in 16-bit real mode or 16-bit protected mode under 80386 processors, when there were just 64KB segments, for example in MS-DOS.
When the 80386 processor was introduced in 1985, PC computers with 640KB RAM under MS-DOS were common. RAM was expensive and PCs were mostly running under MS-DOS in real mode with a maximum of that amount of RAM.
So, by using FS and GS, you could effectively address two more 64KB memory segments from your program without the need to change DS or ES registers whenever you need to address other segments than were loaded in DS or ES. Essentially, Raffzahn has already replied that these registers are useful when working with elements spread out in memory, to avoid reloading other segment registers like ES all the time. But I would like to emphasize that this is only relevant for 64KB segments in real mode or 16-bit protected mode.
The 16-bit protected mode was a very interesting mode that provided a feature not seen since then. The segments could have lengths in the range from 1 to 65536 bytes. The range checking (the check against the segment size) on each memory access was implemented by the CPU, which raised an interrupt on accessing memory beyond the size of the segment specified in the selector table for that segment. That prevented buffer overruns at the hardware level. You could allocate your own segment for each memory block (with a certain limitation on the total number). There were compilers like Borland Pascal 7.0 that made programs that ran under MS-DOS in 16-bit protected mode, known as the DOS Protected Mode Interface (DPMI), using their own DOS extender.
The 80286 processor had 16-bit protected mode, but no FS/GS registers. So a program first had to check whether it was running on an 80386 before using these registers, even in 16-bit real mode. Please see an example of the use of the FS and GS registers in a program for MS-DOS real mode.

Is there performance advantage to ARM64

Recently 64-bit ARM mobiles started appearing. But is there any practical advantage to building an application 64-bit? Specifically considering application that does not have much use for the increased virtual address space¹, but would waste some space due to increased pointer size.
So does ARM64 have any other advantages than the larger address that would actually warrant building such application 64bit?
Note: I've seen 64-bit Performance Advantages, but it only mentions x86-64 which does have other improvements besides extended virtual address space. I also recall that the situation is indeed specific to x86 and on some other platforms that went 64-bit like Sparc the usual approach was to only compile kernel and the applications that actually did use lot of memory as 64-bit and everything else as 32-bit.
¹The application is multi-platform and it still needs to be built for and run on devices with as little as 48MiB of memory. It does have some large data that it reads from external storage, but it never needs more than a few megabytes of it at once.
I am not sure a general response can be given, but I can provide some examples of differences. There are of course additional differences added in version 8 of the ARM architecture, which apply regardless of target instruction set.
Performance-positive additions in AArch64
32 general-purpose registers give compilers more wiggle room.
I/D cache synchronization mechanisms accessible from user mode (no system call needed).
Load/Store-Pair instructions make it possible to load 128 bits of data with one instruction, and still remain RISC-like.
The removal of near-universal conditional execution makes more out-of-ordering possible.
The change in layout of NEON registers (D0 is still lower half of Q0, but D1 is now lower half of Q1 rather than upper half of Q0) makes more out-of-ordering possible.
64-bit pointers make pointer tagging possible.
CSEL enables all kind of crazy optimizations.
Performance-negative changes in AArch64
More registers may also mean higher pressure on the stack.
Larger pointers mean larger memory footprint.
Removal of near-universal conditional execution may cause higher pressure on branch predictor.
Removal of load/store-multiple means more instructions needed for function entry/exit.
Performance-relevant changes in ARMv8-A
Load-Acquire/Store-Release semantics remove the need for explicit memory barriers for basic synchronization operations.
I probably forgot lots of things, but those are some of the more obvious changes.
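As an aside on the pointer-tagging item above, a minimal sketch of what 64-bit pointers enable: with 8-byte-aligned pointers the three low bits are free for a tag (AArch64's top-byte-ignore feature permits the same trick with the high byte). The helper names are invented for this sketch:

```c
#include <stdint.h>

/* Stash a 3-bit tag in the low bits of an 8-byte-aligned pointer. */
static void *tag_ptr(void *p, unsigned tag) {
    return (void *)((uintptr_t)p | (tag & 7u));
}

static unsigned ptr_tag(const void *p) {
    return (unsigned)((uintptr_t)p & 7u);
}

static void *untag_ptr(void *p) {
    /* mask the tag bits off before dereferencing */
    return (void *)((uintptr_t)p & ~(uintptr_t)7);
}
```

Dynamic-language runtimes use this to distinguish, say, small integers from heap objects without an extra memory access.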

Why 64 bit mode ( Long mode ) doesn't use segment registers?

I'm a beginner level of student :) I'm studying about intel architecture,
and I'm studying a memory management such as a segmentation and paging.
I'm reading Intel's manual and it's pretty nice to understand intel's architectures.
However I'm still curious about something fundamental.
Why, in 64-bit long mode, are all the segment register bases forced to 0?
Why doesn't the system use segment registers any longer?
Is it because 64-bit registers (such as the GP registers) are large enough to contain a whole logical address at once?
Does protection work properly in 64-bit mode?
I tried to search for 64-bit addressing but couldn't find anything on Google. Perhaps I have terrible searching skills, or I may need some specific background knowledge to search effectively.
So I'd like to know why the 16-bit segment registers are no longer used in 64-bit mode,
and how protection can work properly in 64-bit mode.
Thank you!
In a manner of speaking, when you perform array ("indexed") type addressing with general registers, you are doing essentially the same thing as the segment registers. In the bad old days of 8-bit and 16-bit programming, many applications required much more data (and occasionally more code) than a 16-bit address could reach.
So many CPUs solved this by having a larger addressable memory space than the 16-bit addresses could reach, and made those regions of memory accessible by means of "segment registers" or similar. A program would set the address in a "segment register" to an address above the (65536 byte) 16-bit address space. Then when certain instructions were executed, they would add the instruction specified address to the appropriate (or specified) "segment register" to read data (or code) beyond the range of 16-bit addresses or 16-bit offsets.
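The "add the segment register" scheme described above can be sketched concretely for the 8086, where the 16-bit segment value is shifted left 4 bits before the 16-bit offset is added, yielding a 20-bit physical address:

```c
#include <stdint.h>

/* 8086 real-mode address formation: physical = segment * 16 + offset.
   The classic reset vector F000:FFF0 maps to physical 0xFFFF0. */
static uint32_t real_mode_linear(uint16_t seg, uint16_t off) {
    return ((uint32_t)seg << 4) + off;
}
```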
However, the situation today is the opposite!
How so? Today, a 64-bit CPU can address far more memory than any machine could actually contain. Most 64-bit CPUs today can address something like 40 to 48 bits of physical memory. True, there is nothing to stop them from addressing a full 64-bit memory space, but nobody (except maybe the NSA) can afford that much RAM, and besides, hanging that much RAM on the CPU bus would load it down with capacitance and slow down ALL memory accesses outside the CPU chip.
Therefore, the current generation of mainstream CPUs can address 40 to 48 bits of memory space, which is more than 99.999% of the market would ever imagine reaching. Note that 32 bits is 4 gigabytes (which some people do exceed today by a factor of 2, 4, 8, 16), but even 40 bits can address 256 * 4GB == 1024GB == 1TB. While 64GB of RAM is reasonable today, and perhaps even 256GB in extreme cases, 1024GB just isn't necessary except for perhaps 0.001% of applications, and is unaffordable to boot.
And if you are in that 0.001% category, just buy one of the CPUs that address 48 bits of physical memory, and you're talking 256TB... which is currently impractical because it would load down the memory bus with vastly too much capacitance (maybe even to the point where the memory bus would completely stop working).
The point is this. When your normal addressing modes with normal 64-bit registers can already address vastly more memory than your computer can contain, the conventional reason to add segment registers vanishes.
This doesn't mean people could not find useful purposes for segment registers in 64-bit CPUs. They could. Several possibilities are evident. However, with 64-bit general registers and a 64-bit address space, there is nothing segment registers can do that general registers cannot. And general-purpose registers have a great many uses, which segment registers do not. Therefore, if anyone were planning to add more registers to a modern 64-bit CPU, they would add general-purpose registers (which can do "anything") rather than very-limited-purpose "segment registers".
And indeed they have. As you may have noticed, AMD and Intel keep adding more [sorta] general-purpose registers to the SIMD register-file, and AMD doubled the number of [truly] general purpose registers when they designed their 64-bit x86_64 CPUs (which Intel copied).
Most answers to questions on the irrelevance of segment registers in a 32/64-bit world center around memory addressing. We all agree that the primary purpose of segment registers was to get around the address-space limitation of the 16-bit DOS world. However, from a security capability perspective, segment registers provide 4 rings of address-space isolation, which is not available in 64-bit long mode, say for a 64-bit OS. This is not a problem for currently popular OSes such as Windows and Linux, which use only ring 0 and ring 3, giving two levels of isolation. Rings 1 and 2 are sometimes part of the kernel and sometimes part of user space, depending on how the code is written. With the advent of hardware virtualization (as opposed to OS virtualization), from an isolation perspective, hypervisors did not quite fit into either ring 0 or rings 1/2/3. Intel and AMD added additional instructions (e.g., Intel VMX) for root and non-root operation of VMs.
So what is the point being made? If one is designing a new secure OS with 4 rings of isolation, then we run into problems if segmentation is disabled. As an example, we could use one ring each for hardware mux code, hypervisor code/containers/VMs, the OS kernel, and user space. So we can make a case for leveraging the additional security afforded by segmentation, based on the requirements stated above. However, Intel/AMD still allow the FS and GS segment registers to have non-zero values (i.e., segmentation is not fully disabled). To the best of my knowledge, no OS exploits this ray of hope to write a more secure OS/hypervisor for hardware virtualization.

Implementing registers in a C virtual machine

I've written a virtual machine in C as a hobby project. This virtual machine executes code that's very similar to Intel syntax x86 assembly. The problem is that the registers this virtual machine uses are only registers in name. In my VM code, registers are used just like x86 registers, but the machine stores them in system memory. There are no performance improvements to using registers over system memory in VM code. (I thought that the locality alone would increase performance somewhat, but in practice, nothing has changed.)
When interpreting a program, this virtual machine stores arguments to instructions as pointers. This allows a virtual instruction to take a memory address, constant value, virtual register, or just about anything as an argument.
Since hardware registers don't have addresses, I can't think of a way to actually store my VM registers in hardware registers. Using the register keyword on my virtual register type doesn't work, because I have to get a pointer to the virtual register to use it as an argument. Is there any way to make these virtual registers perform more like their native counterparts?
I'm perfectly comfortable delving into assembly if necessary. I'm aware that JIT compiling this VM code could allow me to utilize hardware registers, but I'd like to be able to use them with my interpreted code as well.
Machine registers don't have indexing support: you can't access the register with a runtime-specified "index", whatever that would mean, without code generation. Since you're likely decoding the register index from your instructions, the only way is to make a huge switch (i.e. switch (opcode) { case ADD_R0_R1: r[0] += r[1]; break; ... }). This is likely a bad idea since it increases the interpreter loop size too much, so it will introduce instruction cache thrashing.
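Spelled out, the switch-dispatch idea looks like the sketch below; the 3-byte {opcode, operand, operand} encoding and the opcode names are invented for this example:

```c
/* Minimal switch-dispatch interpreter over a 4-register file kept
   in an ordinary array - which is exactly why the "registers" live
   in memory, as the question observes. */
enum { OP_LOADI, OP_ADD, OP_HALT };

static int run_switch(const unsigned char *pc) {
    int r[4] = {0};
    for (;;) {
        switch (pc[0]) {
        case OP_LOADI: r[pc[1]] = pc[2];     pc += 3; break;
        case OP_ADD:   r[pc[1]] += r[pc[2]]; pc += 3; break;
        default:       return r[0];          /* OP_HALT */
        }
    }
}
```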
If we're talking about x86, the additional problem is that the amount of general-purpose registers is pretty low; some of them will be used for bookkeeping (storing PC, storing your VM stack state, decoding instructions, etc.) - it's unlikely that you'll have more than one free register for the VM.
Even if register indexing support were available, it's unlikely it would give you a lot of performance. Commonly in interpreters the largest bottleneck is instruction decoding; x86 supports fast and compact memory addressing based on register values (i.e. mov eax, dword ptr [ebx * 4 + ecx]), so you would not win much. It's worthwhile though to check the generated assembly - i.e. to make sure the 'register pool' address is stored in the register.
The best way to accelerate interpreters is JITting; even a simple JIT (i.e. without smart register allocation - basically just emitting the same code you would execute with the instruction loop and a switch statement, except the instruction decoding) can boost your performance 3x or more (these are actual results from a simple JITter on top of a Lua-like register-based VM). An interpreter is best kept as reference code (or for cold code to decrease JIT memory cost - the JIT generation cost is a non-issue for simple JITs).
Even if you could directly access hardware registers, wrapping code around the decision of whether to use a register or memory makes it that much slower.
To get performance you need to design for performance up front.
A few examples.
Prepare an x86 VM by setting up all the traps to catch the code leaving its virtual memory space. Execute the code directly - don't emulate; branch to it and run. When the code reaches out of its memory/I/O space to talk to a device, etc., trap that, emulate the device or whatever it was reaching for, then return control back to the program. If the code is processor-bound it will run really fast; if I/O-bound then slowly, but not as slowly as emulating each instruction.
Static binary translation. Disassemble and translate the code before running it; for example the instruction 0x34,0x2E (xor al, 0x2E) would turn into ASCII in a .c file:
al ^= 0x2E;
of = 0;
cf = 0;
sf = al & 0x80;
zf = (al == 0);
Ideally performing tons of dead-code removal (if the next instruction modifies the flags as well, then don't compute them here, etc.), and letting the optimizer in the compiler do the rest. You can get a performance gain this way over an emulator; how good a gain depends on how well you can optimize the code. Being a new program, it runs on the hardware - registers, memory and all - so the processor-bound code is slower than with the direct-execution VM above. In some cases you don't have to deal with the processor raising exceptions to trap memory/I/O, because you have simulated the memory accesses in the code, but that still has a cost and calls a simulated device anyway, so no savings there.
Dynamic translation, similar to SBT, but done at runtime. I have heard of this being done, for example, when simulating x86 code on some other processor, say a DEC Alpha: the code is gradually changed from x86 instructions into native Alpha instructions, so the next time around it executes the Alpha instruction directly instead of emulating the x86 instruction. Each time through the code, the program executes faster.
Or maybe just redesign your emulator to be more efficient from an execution standpoint. Look at the emulated processors in MAME, for example: the readability and maintainability of the code has been sacrificed for performance. When they were written that was important; today, with multi-core gigahertz processors, you don't have to work so hard to emulate a 1.5MHz 6502 or a 3MHz Z80. Something as simple as looking the next opcode up in a table and deciding not to emulate some or all of the flag calculations for an instruction can give you a noticeable boost.
Bottom line: if you are interested in using the x86 hardware registers AX, BX, etc. to emulate AX, BX, etc. registers when running a program, the only efficient way to do that is to actually execute the instructions - not execute-and-trap as in single-stepping a debugger, but execute long strings of instructions while preventing them from leaving the VM space. There are different ways to do this, performance results will vary, and it doesn't mean it will be faster than a performance-efficient emulator. This approach limits you to matching the processor to the program. Emulating the registers with efficient code and a really good compiler (good optimizer) will give you reasonable performance and portability, in that you don't have to match the hardware to the program being run.
Transform your complex, register-based code before execution (ahead of time). A simple solution would be a Forth-like dual-stack VM for execution, which offers the possibility of caching the top-of-stack element (TOS) in a register. If you prefer a register-based solution, choose an "opcode" format which bundles as many instructions as possible (rule of thumb: up to four instructions can be bundled into a byte if a MISC-style design is chosen). This way virtual register accesses are locally resolvable to physical register references for each static super-instruction (clang and gcc are able to perform such an optimization). As a side effect, the lowered BTB mis-prediction rate results in far better performance regardless of the specific register allocation.
Best threading techniques for C based interpreters are direct threading (label-as-address extension) and replicated switch-threading (ANSI conform).
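The direct-threading (labels-as-addresses) technique named above can be sketched with GCC's computed-goto extension; the 3-opcode accumulator machine (0 = INC, 1 = DBL, 2 = HALT) is invented for the example and requires gcc or clang:

```c
/* Direct-threaded dispatch: each handler fetches the next opcode
   and jumps straight to its label, with no central switch. */
static int run_threaded(const unsigned char *pc) {
    static void *dispatch[] = { &&op_inc, &&op_dbl, &&op_halt };
    int acc = 0;
    goto *dispatch[*pc++];                     /* prime the pump */
op_inc:  acc += 1; goto *dispatch[*pc++];
op_dbl:  acc *= 2; goto *dispatch[*pc++];
op_halt: return acc;
}
```

Spreading the dispatch over every handler gives the branch predictor one indirect branch per opcode site instead of a single shared one, which is where the speedup over a plain switch comes from.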
So you're writing an x86 interpreter, which is bound to be between 1 and 3 powers of 10 slower than the actual hardware. On the real hardware, mov mem, foo is going to take a lot more time than mov reg, foo, while in your program mem[adr] = foo is going to take about as long as myRegVars[regnum] = foo (modulo caching). So are you expecting the same speed differential?
If you want to simulate the speed differential between registers and memory, you're going to have to do something like what Cachegrind does. That is, keep a simulated clock, and when it does a memory reference, it adds a big number to that.
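A minimal sketch of that simulated-clock idea, in the spirit of Cachegrind; the cost constants are made-up illustration values, not measurements:

```c
/* Charge the simulated clock differently for register vs. memory
   operands, so the interpreter models the speed differential the
   answer describes. */
enum { COST_REG = 1, COST_MEM = 10 };

static long sim_clock = 0;

static int vm_read(const int *regs, const int *mem, int is_mem, int idx) {
    sim_clock += is_mem ? COST_MEM : COST_REG;
    return is_mem ? mem[idx] : regs[idx];
}
```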
Your VM seems to be too complicated for efficient interpretation. An obvious optimization is to have a "microcode" VM, with register load/store instructions, probably even a stack-based one. You can translate your high-level VM into a simpler one before execution. Another useful optimization depends on the gcc computed-goto (labels as values) extension; see the Objective Caml VM interpreter for an example of such a threaded VM implementation.
To answer the specific question you asked:
You could instruct your C compiler to leave a few registers free for your use. Pointers into the first page of memory are usually not allowed — it is reserved for NULL-pointer checks — so you can abuse those initial pointer values as register tags. It helps if you have a few native registers to spare, so my example uses 64-bit mode to simulate 4 registers. It may very well be that the additional overhead of the switch slows down execution instead of speeding it up. Also see the other answers for general advice.
/* compile with gcc -- pin four host registers as virtual registers */
register long r0 asm("r12");
register long r1 asm("r13");
register long r2 asm("r14");
register long r3 asm("r15");
inline long get_argument(long* arg)
{
    unsigned long val = (unsigned long)arg;
    switch (val)
    {
    /* leave 0 for the NULL pointer */
    case 1: return r0;      /* "addresses" 1..4 tag the virtual registers */
    case 2: return r1;
    case 3: return r2;
    case 4: return r3;
    default: return *arg;   /* a real pointer: load from memory */
    }
}

Alignment requirements for atomic x86 instructions vs. MS's InterlockedCompareExchange documentation?

Microsoft offers the InterlockedCompareExchange function for performing atomic compare-and-swap operations. There is also an _InterlockedCompareExchange intrinsic.
On x86 these are implemented using the lock cmpxchg instruction.
However, reading through the documentation on these three approaches, they don't seem to agree on the alignment requirements.
Intel's reference manual says nothing about alignment (other than that if alignment checking is enabled and an unaligned memory reference is made, an exception is generated)
I also looked up the lock prefix, which specifically states that
The integrity of the LOCK prefix is not affected by the alignment of the memory field.
(emphasis mine)
So Intel seems to say that alignment is irrelevant. The operation will be atomic no matter what.
The _InterlockedCompareExchange intrinsic documentation also says nothing about alignment, however the InterlockedCompareExchange function states that
The parameters for this function must be aligned on a 32-bit boundary; otherwise, the function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems.
So what gives?
Are the alignment requirements for InterlockedCompareExchange just to make sure the function will work even on pre-486 CPUs where the cmpxchg instruction isn't available?
That seems likely based on the above information, but I'd like to be sure before I rely on it. :)
Or is alignment required by the ISA to guarantee atomicity, and I'm just looking in the wrong places in Intel's reference manuals?
x86 does not require alignment for a lock cmpxchg instruction to be atomic. However, alignment is necessary for good performance.
This should be no surprise, backward compatibility means that software written with a manual from 14 years ago will still run on today's processors. Modern CPUs even have a performance counter specifically for split-lock detection because it's so expensive. (The core can't just hold onto exclusive access to a single cache line for the duration of the operation; it does have to do something like a traditional bus lock).
Why exactly Microsoft documents an alignment requirement is not clear. It's certainly necessary for supporting RISC architectures, but the specific claim of unpredictable behaviour on multiprocessor x86 might not even be valid. (Unless they mean unpredictable performance, rather than a correctness problem.)
Your guess of applying only to pre-486 systems without lock cmpxchg might be right; a different mechanism would be needed there which might have required some kind of locking around pure loads or pure stores. (Also note that 486 cmpxchg has a different and currently-undocumented opcode (0f a7) from modern cmpxchg (0f b1) which was new with 586 Pentium; Windows might have only used cmpxchg on P5 Pentium and later, I don't know.) That could maybe explain weirdness on some x86, without implying weirdness on modern x86.
Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 3 (3A): System Programming Guide
January 2013
8.1.2.2 Software Controlled Bus Locking
To explicitly force the LOCK semantics, software can use the LOCK prefix with the following instructions when they are used to modify a memory location. [...]
• The exchange instructions (XADD, CMPXCHG, and CMPXCHG8B).
• The LOCK prefix is automatically assumed for XCHG instruction.
• [...]
[...] The integrity of a bus lock is not affected by the alignment of the
memory field. The LOCK semantics are followed for as many bus cycles
as necessary to update the entire operand. However, it is recommended
that locked accesses be aligned on their natural boundaries for better
system performance:
• Any boundary for an 8-bit access (locked or otherwise).
• 16-bit boundary for locked word accesses.
• 32-bit boundary for locked doubleword accesses.
• 64-bit boundary for locked quadword accesses.
Fun fact: cmpxchg without a lock prefix is still atomic wrt. context switches, so is usable for multi-threading on a single-core system.
Even misaligned it's still atomic wrt. interrupts (either completely before or completely after), and only memory reads by other devices (e.g. DMA) could see tearing. But such accesses could also see the separation between load and store, so even if old Windows did use that for a more efficient InterlockedCompareExchange on single-core systems, it still wouldn't require alignment for correctness, only performance. (And if this could be used on addresses subject to hardware access, Windows probably wouldn't use that trick at all.)
If the library function needed to do a pure load separate from the lock cmpxchg this might make sense, but it doesn't need to do that. (If not inlined, the 32-bit version would have to load its args from the stack, but that's private, not access to the shared variable.)
The PDF you are quoting from is from 1999 and CLEARLY outdated.
The up-to-date Intel documentation, specifically Volume-3A tells a different story.
For example, on a Core-i7 processor, you STILL have to make sure your data doesn't span cache lines, or else the operation is NOT guaranteed to be atomic.
On Volume 3A, System Programming, For x86/x64 Intel clearly states:
8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following
basic memory operations will always be carried out atomically:
• Reading or writing a byte
• Reading or writing a word aligned on a 16-bit boundary
• Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following
additional memory operations will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following
additional memory operation will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache
line
Accesses to cacheable memory that are split across cache lines and page boundaries
are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core
Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors.
The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon,
and P6 family processors provide bus control signals that permit external memory
subsystems to make split accesses atomic; however, nonaligned data accesses will
seriously impact the performance of the processor and should be avoided
See this SO question: natural alignment is important for performance, and is required on the x64 architecture (so it's not just PRE-x86 systems, but POST-x86 ones too — x64 may still be a bit of a niche case, but it's growing in popularity after all ;-). That may be why Microsoft documents it as required. It's hard to find docs on whether MS has decided to FORCE the alignment issue by enabling alignment checking — that may vary by Windows version; by claiming in the docs that alignment is required, MS keeps the freedom to force it in some versions of Windows even if they did not force it in others.
Microsoft's Interlocked APIs also applied to ia64 (while it still existed). There was no lock prefix on ia64, only the cmpxchg.acq and cmpxchg.rel instructions (or fetchadd and other similar beasties), and these all required alignment if I recall correctly.
