How can I speed up my code coverage tool?

How can I speed up my code coverage tool? - winapi

I've written a small code coverage utility to log which basic blocks are hit in an x86 executable. It runs without source code or debugging symbols for the target, and just takes a lost of basic blocks which it monitors.
However, it is becoming the bottleneck in my application, which involves repeated coverage snapshots of a single executable image.
It has gone through a couple of phases as I've tried to speed it up. I started off just placing an INT3 at the start of each basic block, attaching as debugger, and logging hits. Then I tried to improve performance by patching in a counter to any block bigger than 5 bytes (the size of a JMP REL32). I wrote a small stub ('mov [blah], 1 / jmp backToTheBasicBlockWeCameFrom') in the process memory space and patch a JMP to that. This greatly speeds things up, since there's no exception and no debugger break, but I'd like to speed things up more.
I'm thinking of one of the following:
1) Pre-instrument the target binary with my patched counters (at the moment I do this at runtime). I could make a new section in the PE, throw my counters in it, patch in all the hooks I need, then just read data out of the same section with my debugger after each execution. That'll gain me some speed (about a 16% according to my estimation) but there are still those pesky INT3's which I need to have in the smaller blocks, which are really going to cripple performance.
2) Instrument the binary to include its own UnhandledExceptionFilter and handle its own int3's in conjunction with the above. This would mean there's no process switch from the debuggee to my coverage tool on every int3, but there'd still be the breakpoint exception raised and the subsequent kernel transition - am I right in thinking this wouldn't actually gain me much performance?
3) Try to do something clever using Intel's hardware branch profiling instructions. This sounds pretty awesome but I'm not clear on how I'd go about it - is it even possible in a windows usermode application? I might go as far as to write a kernel-mode driver if it's fairly straightforward but I'm not a kernel coder (I dabble a bit) and would probably cause myself lots of headaches. Are there any other projects using this approach? I see the Linux kernel has it to monitor the kernel itself, which makes me think that monitoring a specific usermode application will be difficult.
4) Use an off-the-shelf application. It'd need to work without any source or debugging symbols, be scriptable (so I can run in batches), and preferably be free (I'm pretty stingy). For-pay tools aren't off the table, however (if I can spend less on a tool and increase perf enough to avoid buying new hardware, that'd be good justification).
5) Something else. I'm running in VMWare on Windows XP, on fairly old hardware (Pentium 4-ish) - is there anything I've missed, or any leads I should read up on? Can I get my JMP REL32 down to less than 5 bytes (and catch smaller blocks without the need for an int3)?
Thanks.

If you insist on instrumenting binaries, pretty much your fastest coverage is the 5-byte jump-out jump-back trick. (You're covering standard ground for binary instrumentation tools.)
The INT 3 solution will always involve a trap. Yes, you could handle the trap in your space instead of a debugger space and that would speed up it, but it will never be close to competitive to the jump-out/back patch. You may need it as backup anyway, if the function you are instrumenting happens to be shorter than 5 bytes (e.g., "inc eax/ret") because then you don't have 5 bytes you can patch.
What you might do to optimize things a little is examine the patched code. Without such examination, with original code:
instrn 1
instrn 2
instrn N
next:
patched, in general to look like this:
jmp patch
xxx
next:
has to generally have a patch:
patch: pushf
inc count
popf
instrn1
instrn2
instrnN
jmp back
If all you want is coverage, you don't need to increment, and the means you don't need to save the flags:
patch: mov byte ptr covered,1
instrn1
instrn2
instrnN
jmp back
You should use a byte rather than a word to keep the patch size down. You should align the patch on a cache line so the processor doesn't have fetch 2 cache lines to execute the patch.
If you insist on counting, you can analyze the instrn1/2/N to see if they care about the flags that "inc" fools with, and only pushf/popf if needed, or you can insert the increment between two instructions in the patch that don't care. You must be analyzing these to some extent to handle complications such as instn being ret anyway; you can generate a better patch (e.g., don't "jmp back").
You may find that using add count,1 is faster than inc count because this avoids partial condition code updates and consequent pipeline interlocks. This will affect your cc-impact-analysis a bit, since inc doesn't set the carry bit, and add does.
Another possibility is PC sampling. Don't instrument the code at all; just interrupt the thread periodically and take a sample PC value. If you know where the basic blocks are, a PC sample anywhere in the basic block is evidence the entire block got executed. This won't necessarily give precise coverage data (you may miss critical PC values), but the overhead is pretty low.
If you are willing to patch source code, you can do better: just insert "covered[i]=true;" in the beginning the ith basic block, and let the compiler take care of all the various optimizations. No patches needed. The really cool part of this is that if you have basic blocks inside nested loops, and you insert source probes like this, the compiler will notice that the probe assignments are idempotent with respect to the loop and lift the probe out of the loop. Viola, zero probe overhead inside the loop. What more more could you want?

Related

Dynamically executing large volumes of execute-once, straight-line x86 code

Dynamically generating code is pretty well-known technique, for example to speed up interpreted languages, domain-specific languages and so on. Whether you want to work low-level (close to 1:1 with assembly), or high-level you can find libraries you help you out.
Note the distinction between self-modifying code and dynamically-generated code. The former means that some code that has executed will be modified in part and then executed again. The latter means that some code, that doesn't exist statically in the process binary on disk, is written to memory and then executed (but will not necessarily ever be modified). The distinction might be important below or simply because people treat self-modifying code as a smell, but dynamically generated code as a great performance trick.
The usual use-case is that the generated code will be executed many times. This means the focus is usually on the efficiency of the generated code, and to a lesser extent the compilation time, and least of all the mechanics of actually writing the code, making it executable and starting execution.
Imagine however, that your use case was generating code that will execute exactly once and that this is straight-line code without loops. The "compilation" process that generates the code is very fast (close to memcpy speed). In this case, the actual mechanics of writing to the code to memory and executing it once become important for performance.
For example, the total amount of code executed may be 10s of GBs or more. Clearly you don't want to just write all out to a giant buffer without any re-use: this would imply writing 10GB to memory and perhaps also reading 10GB (depending on how generation and execution was interleaved). Instead you'd probably want to use some reasonably sized buffer (say to fit in the L1 or L2 cache): write out a buffer's worth of code, execute it, then overwrite the buffer with the next chunk of code and so on.
The problem is that this seems to raise the spectre of self-modifying code. Although the "overwrite" is complete, you are still overwriting memory that was at one point already executed as instructions. The newly written code has to somehow make its way from the L1D to the L1I, and the associated performance hit is not clear. In particular, there have been reports that simply writing to the code area that has already been executed may suffer penalties of 100s of cycles and that the number of writes may be important.
What's the best way of generating a large about of dynamically generated straight-line code on x86 and executing it?

I think you're worried unnecessarily. Your case is more like when a process exits and its pages are reused for another process (with different code loaded into them), which shouldn't cause self-modifying code penalties. It's not the same as when a process writes into its own code pages.
The self-modifying code penalties are significant when the overwritten instructions have been prefetched or decoded to the trace cache. I think it is highly unlikely that any of the generated code will still be in the prefetch queue or trace cache by the time the code generator starts overwriting it with the next bit (unless the code generator is trivial).
Here's my suggestion: Allocate pages up to some fraction of L2 (as suggested by Peter), fill them with code, and execute them. Then map the same pages at the next higher virtual address and fill them with the next part of the code. You'll get the benefit of cache hits for the reads and the writes but I don't think you'll get any self-modifying code penalty. You'll use 10s of GB of virtual address space, but keep using the same physical pages.
Use a serializing operation such as CPUID before each time you start executing the modified instructions, as described in sections 8.1.3 and 11.6 of the Intel SDM.

I'm not sure you'll stand to gain much performance by using a gigantic amount of straight-line code instead of much smaller code with loops, since there's significant overhead in continually thrashing the instruction cache for so long, and the overhead of conditional jumps has gotten much better over the past several years. I was dubious when Intel made claims along those lines, and some of their statements were rather hyperbolic, but it has improved a lot in common cases. You can still always avoid call instructions if you need to for simplicity, even for tree recursive functions, by effectively simulating "the stack" with "a stack" (possibly itself on "the stack"), in the worst case.
That leaves two reasons I can think of that you'd want to stick with straight-line code that's only executed once on a modern computer: 1) it's too complicated to figure out how to express what needs to be computed with less code using jumps, or 2) it's an extremely heterogeneous problem being solved that actually needs so much code. #2 is quite uncommon in practice, though possible in a computer theoretical sense; I've just never encountered such a problem. If it's #1 and the issue is just how to efficiently encode the jumps as either short or near jumps, there are ways. (I've also just recently gotten back into x86-64 machine code generation in a side project, after years of not touching my assembler/linker, but it's not ready for use yet.)
Anyway, it's a bit hard to know what the stumbling block is, but I suspect that you'll get much better performance if you can figure out a way to avoid generating gigabytes of code, even if it may seem suboptimal on paper. Either way, it's usually best to try several options and see what works best experimentally if it's unclear. I've sometimes found surprising results that way. Best of luck!

Drhystone benchmark on 32 bit micrcontroller

Currently I am doing a performance comparison on two 32bit microcontrollers. I used Dhrystone benchmark to run on both microcontrollers. One microcontroller has 4KB I-cache while second coontroller has 8KB of I-cache. Both microcontrollers are using same tool chain. As much as possible I kept same static and run-time settings onboth microcontrollers. But microcontroller with 4KB cache are faster than 8KB cache microcontroller. Both microcontroller are from same vendor and based on same CPU.
Could anyone provide some information why microcontroller with 4KB cache is faster than other?

Benchmarks are in general useless. dhrystone being one of the oldest ones and it may have had a little bit of value back then, before pipelines and too much compiler optimization. I think I gave up on dhrystone 15 years ago roughly about the time I started using it.
It is trivial to demonstrate that this code
.globl ASMDELAY
ASMDELAY:
sub r0,r0,#1
bne ASMDELAY
bx lr
Which is primarily two INSTRUCTIONS, can vary WIDELY in execution time on the same chip if you understand how modern processors work. The simple trick to seeing this is turn off the caches and prefetchers and such, and place this code at offset 0x0000, call it with some value. place it at 0x0004 repeat, then at 0x0008, repeat. keep doing this. You can put one, two, etc nops between the subtract and branch. try it at various offsets.
THEN, turn on and of caches for each of these alignments, if you have a prefetch outside the processor for the flash, turn that on and off.
THEN, vary your clocks, esp for those MCUs where you have to adjust the wait states based on clock rate.
On a SINGLE MCU, you are going to see those essentially two instructions vary in execution time by a very large amount. Lets just say 20 times longer in some cases than others.
Now take a small program or a small fraction of the dhrystone program. Compile that for your mcu, how many instructions do you see? Make minor to major optimization and other variations on the compile command line, how much does the code change. If two instructions can vary by lets call it 20 times in execution time, how bad can it get for 200 instructions or 2000 instructions? It can get pretty bad.
If you take the dhrystone programs you have right now with the compiler options you have right now, go into your bootstrap, add one nop (causing the entire binary to shift by one instruction in flash) run again. Add two, three, four. You are still not comparing different mcus just running your benchmark on one system.
Run with and without d cache, with and without i cache if you have each of those. turn on and off the flash prefetch if you have one and if you have a write buffer you can turn on and of try that. Still remaining on the same compiler same options same mcu.
Take the different functions in the dhrystone source code, and re-arrange them in the source code. instead of Proc_1, Proc_2, Proc_3 make it Proc_1, Proc_3, Proc_2. Do all of the above again. Re-arrange again, repeat.
Before leaving this mcu you should now see that the execution time of the same source code which is completely unmodified (other than perhaps re-arranging functions) can and will have vastly different execution times.
if you then start changing compiler options or keep the same source and change compilers, you will see even more of a difference in execution time.
How is it possible that dhrystone benchmarks today or from way back had a single result for each platform? Simple, that result was just one of a wide range, not really representing the platform.
So then if you try to compare two different hardware platforms be it the same arm core inside a different mcu from the same vendor or different vendors. the arm core, assuming (which is not a safe assumption) it is the same source with the same compile/build options, even assuming the same verilog compile and synthesis was used. you can have that same core change based on arm provided options. Anyway, how the vendor be it the same vendor in two instances or two different vendors wraps that same core, you will see variations. Then take a completely different core be it another arm or a mips, etc. How could you gain any value comparing those using a program like this that itself varies widely on each platform?
You cant. What you can do is use benchmarks to give the illusion of one thing being better than another, one computer is faster than another, one compiler is faster than another. In order to sell computers or compilers. Sprints coverage is within one percent of Verizons...does that tell us anything useful? Nope.
If you were to eliminate the compiler from the equation, and if these are truly the same "CPU", same rev of source from ARM, built the same way, then they should fetch the same, but the size of the cache is part of that so it may already be a different cpu implementation as the width or depth of the cache can affect things. In software it is like needing a 32 bit pointer rather than a 16 bit pointer (17 bit instead of 16, but you cant have a 17 bit generally in logic you can).
Anyway, if you compile the code under test one time for an address space that is common to both platforms, use that same binary exactly for that space, can attach different bootstrap code as needed, note the C library calls strcpy, etc also have to be the same in the same space between platforms to eliminate the compiler and alignment from messing you up. this may or may not level the playing field.
If you want to believe these are the same cpu, then turn the caches off, eliminate the compiler variations by doing the above. See if they execute the same. Copy the program to ram and run in ram, eliminate the flash issues. I assume you have them both clocked the same with the same wait states in the flash?
if they are the same cpu and the chip vendor has with these two chips made the memory system take the same number of clocks say for ram accesses, and it is really the same cpu, you should be able to get the same time by eliminating optimizations (caching, flash prefetching, alignment).
what you are probably seeing is some form of alignment with how the code lies in memory from the compiler vs cache lines, or it could be much much simpler it could be just the differences in the caches, how the hits and misses work and the 4KB is just more lucky than the 8KB for this particular program compiled a certain way, aligned in memory a certain way, etc.
With the simple two instruction loop above it is easy to see some of the reasons why performance varies on the same system, if your modern cpu fetches 8 instructions at a time and your loop gets too close to the tail end of that fetch, the prefetch may think it needs to fetch another 8 beyond that costing you those clock cycles. Certainly as you exactly straddle two "fetch lines" as I call them with those two instructions it is going to cost you a lot more cycles per loop even with a cache. Same problem happens when those two instructions approach a cache line (as you vary their alignment per test) eventually it takes two cache line reads instead of one to fetch those two instructions. At least the first time through there is an extra cache line read. The extra clocks for that first time through is something you can see using a simple benchmark like that while playing with alignment.
Michael Abrash, Zen of Assembly Language. There is an epub/etc you can build from github of this book. the 8088 was obsolete when this book came out if that is all you see 8088 stuff, then you are completely missing the point. It applies to the most modern processors today, how to view the problem, how to test, how to time the test, how to interpret the results. All the stuff I have mentioned so far, and all the things I know about this that I have not mentioned, all came from that books knowledge applied for however many decades I have been doing this.
So again if you have truly eliminated the compiler, alignment, the cpu, the memory system tied to that cpu, etc and it is down to only the size of the cache varies. Then it is probably related to how the cache line fetches hit and miss differently based on alignment of the code relative to the cache lines for the two caches. One is hitting more and missing less and/or evicting better for this particular binary. You can try rearranging the functions, can add nops, or if you cant get at the bootstrap then add whole functions or more code (another printf, etc) at a lower address in the binary causing the linker to slide the code under test through to different addresses changing how the program lines up with the cache lines. Since the functions in the code under test are so big (more than a few instructions to implement) you would have to start modifying the program in order to get a finer grained adjustment of binary relative to cache lines.
You most definitely should see execution time differences on the two platforms if you adjust alignment and or wholesale re-arrange the binary based on functions being re-arranged.
Bottom line benchmarks dont really tell you much, the results have more of a negative stink to them than a positive joy. Without a re-write a particular benchmark or application may just do better on one platform (be it everything is the same but the size of the cache or two completely different architectures) even when you try to vary the results due to alignment, turning on and off prefetching, write buffering, branch prediction, etc. Pulling out all the tricks you can come up with one may vary from x to y the other from n to m and maybe there is some overlap in the ranges. For these two platforms that you say are similiar except for cache size I would hope you could find a combination where sometimes A is faster than B and at least one combination where B is faster than A, with the same features turned on and off for both in that comparision. If/when you switch to some other application/benchmark this all starts over again, no reason to assume dhrystone results predict any other code under test. The only program that matters, esp on an MCU is the final build of your application. Just remember changing a single line of code, or even adding a single nop in the bootstrap can have dramatic results in performance sometimes several TIMES slower or faster from a single nop.

Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?

LOOP (Intel ref manual entry)
decrements ecx / rcx, and then jumps if non-zero. It's slow, but couldn't Intel have cheaply made it fast? dec/jnz already macro-fuses into a single uop on Sandybridge-family; the only difference being that that sets flags.
loop on various microarchitectures, from Agner Fog's instruction tables:
K8/K10: 7 m-ops
Bulldozer-family/Ryzen: 1 m-op (same cost as macro-fused test-and-branch, or jecxz)
P4: 4 uops (same as jecxz)
P6 (PII/PIII): 8 uops
Pentium M, Core2: 11 uops
Nehalem: 6 uops. (11 for loope / loopne). Throughput = 4c (loop) or 7c (loope/ne).
SnB-family: 7 uops. (11 for loope / loopne). Throughput = one per 5 cycles, as much of a bottleneck as keeping your loop counter in memory! jecxz is only 2 uops with same throughput as regular jcc
Silvermont: 7 uops
AMD Jaguar (low-power): 8 uops, 5c throughput
Via Nano3000: 2 uops
Couldn't the decoders just decode the same as lea rcx, [rcx-1] / jrcxz? That would be 3 uops. At least that would be the case with no address-size prefix, otherwise it has to use ecx and truncate RIP to EIP if the jump is taken; maybe the odd choice of address-size controlling the width of the decrement explains the many uops? (Fun fact: rep-string instructions have the same behaviour with using ecx with 32-bit address-size.)
Or better, just decode it as a fused dec-and-branch that doesn't set flags? dec ecx / jnz on SnB decodes to a single uop (which does set flags).
I know that real code doesn't use it (because it's been slow since at least P5 or something), but AMD decided it was worth it to make it fast for Bulldozer. Probably because it was easy.
Would it be easy for SnB-family uarch to have fast loop? If so, why don't they? If not, why is it hard? A lot of decoder transistors? Or extra bits in a fused dec&branch uop to record that it doesn't set flags? What could those 7 uops be doing? It's a really simple instruction.
What's special about Bulldozer that made a fast loop easy / worth it? Or did AMD waste a bunch of transistors on making loop fast? If so, presumably someone thought it was a good idea.
If loop was fast, it would be perfect for BigInteger arbitrary-precision adc loops, to avoid partial-flag stalls / slowdowns (see my comments on my answer), or any other case where you want to loop without touching flags. It also has a minor code-size advantage over dec/jnz. (And dec/jnz only macro-fuses on SnB-family).
On modern CPUs where dec/jnz is ok in an ADC loop, loop would still be nice for ADCX / ADOX loops (to preserve OF).
If loop had been fast, compilers would already be using it as a peephole optimization for code-size + speed on CPUs without macro-fusion.
It wouldn't stop me from getting annoyed at all the questions with bad 16bit code that uses loop for every loop, even when they also need another counter inside the loop. But at least it wouldn't be as bad.

In 1988, IBM fellow Glenn Henry had just come on board at Dell, which had a few hundred employees at the time, and in his first month he gave a tech talk about 386 internals. A bunch of us BIOS programmers had been wondering why LOOP was slower than DEC/JNZ so during the question/answer section somebody posed the question.
His answer made sense. It had to do with paging.
LOOP consists of two parts: decrementing CX, then jumping if CX is not zero. The first part cannot cause a processor exception, whereas the jump part can. For one, you could jump (or fall through) to an address outside segment boundaries, causing a SEGFAULT. For two, you could jump to a page that is swapped out.
A SEGFAULT usually spells the end for a process, but page faults are different. When a page fault occurs, the processor throws an exception, and the OS does the housekeeping to swap in the page from disk into RAM. After that, it restarts the instruction that caused the fault.
Restarting means restoring the state of the process to what it was just before the offending instruction. In the case of the LOOP instruction in particular, it meant restoring the value of the CX register. One might think you could just add 1 to CX, since we know CX got decremented, but apparently, it's not that simple. For example, check out this erratum from Intel:
The protection violations involved usually indicate a probable
software bug and restart is not desired if one of these violations
occurs. In a Protected Mode 80286 system with wait states during any
bus cycles, when certain protection violations are detected by the
80286 component, and the component transfers control to the exception
handling routine, the contents of the CX register may be unreliable.
(Whether CX contents are changed is a function of bus activity at the
time internal microcode detects the protection violation.)
To be safe, they needed to save the value of CX on every iteration of a LOOP instruction, in order to reliably restore it if needed.
It's this extra burden of saving CX that made LOOP so slow.
Intel, like everyone else at the time, was getting more and more RISC. The old CISC instructions (LOOP, ENTER, LEAVE, BOUND) were being phased out. We still used them in hand-coded assembly, but compilers ignored them completely.

Now that I googled after writing my question, it turns out to be an exact duplicate of one on comp.arch, which came up right away. I expected it to be hard to google (lots of "why is my loop slow" hits), but my first try (why is the x86 loop instruction slow) got results.
This is not a good or complete answer.
It might be the best we'll get, and will have to suffice unless someone can shed some more light on it. I didn't set out to write this as an answer-my-own-question post.
Good posts with different theories in that thread:
Robert
LOOP became slow on some of the earliest machines (circa 486) when
significant pipelining started to happen, and running any but the
simplest instruction down the pipeline efficiently was technologically
impractical. So LOOP was slow for a number of generations. So nobody
used it. So when it became possible to speed it up, there was no real
incentive to do so, since nobody was actually using it.
Anton Ertl:
IIRC LOOP was used in some software for timing loops; there was
(important) software that did not work on CPUs where LOOP was too fast
(this was in the early 90s or so). So CPU makers learned to make LOOP
slow.
(Paul, and anyone else: You're welcome to re-post your own writing as your own answer. I'll remove it from my answer and up-vote yours.)
#Paul A. Clayton (occasional SO poster and CPU architecture guy) took a guess at how you could use that many uops. (This looks like loope/ne which checks both the counter and ZF):
I could imagine a possibly sensible 6-µop version:
virtual_cc = cc;
temp = test (cc);
rCX = rCX - temp; // also setting cc
cc = temp & cc; // assumes branch handling is not
// substantially changed for the sake of LOOP
branch
cc = virtual_cc
(Note that this is 6 uops, not SnB's 11 for LOOPE/LOOPNE, and is a total guess not even trying to take into account anything known from SnB perf counters.)
Then Paul said:
I agree that a shorter sequence should be possible, but I was trying
to think of a bloated sequence that might make sense if minimal
microarchitectural adjustments were permitted.
summary: The designers wanted loop to be supported only via microcode, with no adjustments whatsoever to the hardware proper.
If a useless, compatibility-only instruction is handed to the
microcode developers, they might reasonably not be able or willing to
suggest minor changes to the internal microarchitecture to improve
such an instruction. Not only would they rather use their "change
suggestion capital" more productively but the suggestion of a change
for a useless case would reduce the credibility of other suggestions.
(My opinion: Intel is probably still making it slow on purpose, and hasn't bothered to rewrite their microcode for it for a long time. Modern CPUs are probably too fast for anything using loop in a naive way to work correctly.)
... Paul continues:
The architects behind Nano may have found avoiding the special casing
of LOOP simplified their design in terms of area or power. Or they
may have had incentives from embedded users to provide a fast
implementation (for code density benefits). Those are just WILD
guesses.
If optimization of LOOP fell out of other optimizations (like fusion
of compare and branch), it might be easier to tweak LOOP into a fast
path instruction than to handle it in microcode even if the
performance of LOOP was unimportant.
I suspect that such decisions are based on specific details of the
implementation. Information about such details does not seem to be
generally available and interpreting such information would be
beyond the skill level of most people. (I am not a hardware
designer--and have never played one on television or stayed at a
Holiday Inn Express. :-)
The thread then went off-topic into the realm of AMD blowing our one chance to clean up the cruft in x86 instruction encoding. It's hard to blame them, since every change is a case where the decoders can't share transistors. And before Intel adopted x86-64, it wasn't even clear that it would catch on. AMD didn't want to burden their CPUs with hardware nobody used if AMD64 didn't catch on.
But still, there are so many small things: setcc could have changed to 32bits. (Usually you have to use xor-zero / test / setcc to avoid false dependencies, or because you need a zero-extended reg). Shift could have unconditionally written flags, even with zero shift count (removing the input data dependency on eflags for variable-count shift for OOO execution). Last time I typed this list of pet peeves, I think there was a third one... Oh yeah, bt / bts etc. with memory operands has the address dependent on the upper bits of the index (bit string, not just bit within a machine word).
bts instructions are very useful for bit-field stuff, and are slower than they need to be so you almost always want to load into a register and then use that. (It's usually faster to shift/mask to get an address yourself, instead of using 10 uop bts [mem], reg on Skylake, but it does take extra instructions. So it made sense on 386, but not on K8). Atomic bit-manipulation has to use the memory-dest form, but the locked version needs lots of uops anyway. It's still slower than if it couldn't access outside the dword it's operating on.

Please see the nice article by Abrash, Michael, published in Dr. Dobb's Journal March 1991 v16 n3 p16(8): http://archive.gamedev.net/archive/reference/articles/article369.html
The summary of the article is the following:
Optimizing code for 8088, 80286, 80386 and 80486 microprocessors is
difficult because the chips use significantly different memory
architectures and instruction execution times. Code cannot be
optimized for the 80x86 family; rather, code must be designed to
produce good performance on a range of systems or optimized for
particular combinations of processors and memory. Programmers must
avoid the unusual instructions supported by the 8088, which have lost
their performance edge in subsequent chips. String instructions
should be used but not relied upon. Registers should be used rather
than memory operations. Branching is also slow for all four
processors. Memory accesses should be aligned to improve
performance. Generally, optimizing an 80486 requires exactly the
opposite steps as optimizing an 8088.
By "unusual instructions supported by the 8088" the author also means "loop":
Any 8088 programmer would instinctively replace: DEC CX JNZ LOOPTOP
with: LOOP LOOPTOP because LOOP is significantly faster on the 8088.
LOOP is also faster on the 286. On the 386, however, LOOP is actually
two cycles slower than DEC/JNZ. The pendulum swings still further on
the 486, where LOOP is about twice as slow as DEC/JNZ--and, mind you,
we're talking about what was originally perhaps the most obvious
optimization in the entire 80x86 instruction set.
This is a very good article, and I highly recommend it. Even though it was published in 1991, it is surprisingly highly relevant today.
But this article just gives advices, it encourages to test execution speed and choose faster variants. It doesn’t explain WHY some commands become very slow, so it doesn’t fully address your question.
The answer is that earlier processors, like 80386 (released in 1985) and before, executed instructions one-by-one, sequentially.
Later processors have started to use instruction pipelining – initially, simple, for 804086, and, finally, Pentium Pro (released in 1995) introduced radically different internal pipeline, calling it the Out Of Order (OOO) core where instructions were transformed to small fragments of operations called micro-ops or µops, and then all micro-ops of different instructions were put to a large pool of micro-ops where they were supposed to execute simultaneously as long as they do not depend on one another. This OOO pipeline principle is still used, almost unchanged, on modern processors. You can find more information about instruction pipelining in this brilliant article: https://www.gamedev.net/resources/_/technical/general-programming/a-journey-through-the-cpu-pipeline-r3115
In order to simplify chip design, Intel decided to build processors in such a way that one instructions did transform to micro-ops in a very efficient way, while others are not.
Efficient conversion from instructions to micro-ops requires more transistors, so Intel have decided to save on transistors at a cost of slower decoding and execution of some “complex” or “rarely-used” instructions.
For example, the “Intel® Architecture Optimization Reference Manual” http://download.intel.com/design/PentiumII/manuals/24512701.pdf mentions the following: “Avoid using complex instructions (for example, enter, leave, or loop) that generally have more than four µops and require multiple cycles to decode. Use sequences of simple instructions instead.”
So, Intel somehow have decided that the “loop” instruction is “complex”, and, since then, it became very slow. However, there is no official Intel reference on instruction breakdown: how many micro-ops each instruction produces, and how many cycles are required to decode it.
You can also read about The Out-of-Order Execution Engine
in the "Intel® 64 and IA-32 Architectures Optimization Reference Manual"
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf section the 2.1.2.

Better to encode ADD EAX,4 as r/m32 imm32 or r32 imm8? x86 assembly

Given the instruction:
ADD EAX, 4
in x86 assembly P2+, would it be better to encode this as ADD EAX, imm32 (op code 05) or ADD r/m32, imm8 (op code 83 /0), if the goal is execution speed and code size is irrelevant?
Note that EAX is a pointer in this case so ADD AL, imm8 is not viable.
Option 2 (op code 83 /0) will lead to a smaller code size, but from my limited understanding it may not pipeline as well as option 1, even though it is shorter.

If you're not optimising for one specific CPU (e.g. "Intel Pentium II, model 2" and not just "P2+"), just guess. Regardless of how you guess, some CPUs might be faster and some might be slower (but I'd expect that for most CPUs it won't make any difference at all).
If you are optimising for one specific CPU, just guess. The alternative is to profile it under the exact conditions you plan to use it, and then realise that you want to change the code 2 days later and that you wasted 8 hours messing about with profiling for nothing.
Finally; if you're writing a code optimiser (e.g. as part of a compiler back-end or an optimising assembler), then just guess - you've got far more important things to worry about than this. The alternative is to get about 50 different computers and profile many different sequences of instructions on all of them; and use the results to create rules for many different cases.

Choose the short form 83 /0. Code density will improve, which means performance and power consumption have a modest chance to improve on many processors. There is no upside to encoding fat immediate values.
Instruction cache size on a modern x86 processor is at least 32KB, which means modest code density improvements won't really improve caching. See the paper "The Impact of Code Density on Instruction Cache Performance" by Steenkiste. Sorry, I can't seem to find a free link for that one.
The real potential win is for dense code executing in a tight loop. If the loop is tight enough, modern processors can take shortcuts in which they fetch and decode the instructions only once, then execute the small loop from internal buffering inside the pipeline itself. Even if loop performance is limited due to data hazards or whatever, these tricks still reduce power consumption.

Implementing registers in a C virtual machine

I've written a virtual machine in C as a hobby project. This virtual machine executes code that's very similar to Intel syntax x86 assembly. The problem is that the registers this virtual machine uses are only registers in name. In my VM code, registers are used just like x86 registers, but the machine stores them in system memory. There are no performance improvements to using registers over system memory in VM code. (I thought that the locality alone would increase performance somewhat, but in practice, nothing has changed.)
When interpreting a program, this virtual machine stores arguments to instructions as pointers. This allows a virtual instruction to take a memory address, constant value, virtual register, or just about anything as an argument.
Since hardware registers don't have addresses, I can't think of a way to actually store my VM registers in hardware registers. Using the register keyword on my virtual register type doesn't work, because I have to get a pointer to the virtual register to use it as an argument. Is there any way to make these virtual registers perform more like their native counterparts?
I'm perfectly comfortable delving into assembly if necessary. I'm aware that JIT compiling this VM code could allow me to utilize hardware registers, but I'd like to be able to use them with my interpreted code as well.

Machine registers don't have indexing support: you can't access the register with a runtime-specified "index", whatever that would mean, without code generation. Since you're likely decoding the register index from your instructions, the only way is to make a huge switch (i.e. switch (opcode) { case ADD_R0_R1: r[0] += r[1]; break; ... }). This is likely a bad idea since it increases the interpreter loop size too much, so it will introduce instruction cache thrashing.
If we're talking about x86, the additional problem is that the amount of general-purpose registers is pretty low; some of them will be used for bookkeeping (storing PC, storing your VM stack state, decoding instructions, etc.) - it's unlikely that you'll have more than one free register for the VM.
Even if register indexing support were available, it's unlikely it would give you a lot of performance. Commonly in interpreters the largest bottleneck is instruction decoding; x86 supports fast and compact memory addressing based on register values (i.e. mov eax, dword ptr [ebx * 4 + ecx]), so you would not win much. It's worthwhile though to check the generated assembly - i.e. to make sure the 'register pool' address is stored in the register.
The best way to accelerate interpreters is JITting; even a simple JIT (i.e. without smart register allocation - basically just emitting the same code you would execute with the instruction loop and a switch statement, except the instruction decoding) can boost your performance 3x or more (these are actual results from a simple JITter on top of a Lua-like register-based VM). An interpreter is best kept as reference code (or for cold code to decrease JIT memory cost - the JIT generation cost is a non-issue for simple JITs).

Even if you could directly access hardware registers, wrapping code around the decision to use a register instead of memory is that much slower.
To get performance you need to design for performance up front.
A few examples.
Prepare an x86 VM by setting up all the traps to catch the code leaving its virtual memory space. Execute the code directly, dont emulate, branch to it and run. When the code reaches out of its memory/i/o space to talk to a device, etc, trap that and emulate that device or whatever it was reaching for then return control back to the program. If the code is processor bound it will run really fast, if I/O bound then slow but not as slow as emulating each instruction.
Static binary translation. Disassemble and translate the code before running, for example an instruction 0x34,0x2E would turn into ascii in a .c file:
al ^= 0x2E;
of =0;
cf=0;
sf=al
Ideally performing tons of dead code removal (if the next instruction modifies the flags as well then dont modify them here, etc). And letting the optimizer in the compiler do the rest. You can get a performance gain this way over an emulator, how good of a performance gain depends on how well you can optimize the code. Being a new program it runs on the hardware, registers memory and all, so the processor bound code is slower than a VM, in some cases you dont have to deal with the processor doing exceptions to trap memory/io because you have simulated the memory accesses in the code, but that still has a cost and calls a simulated device anyway so no savings there.
Dynamic translation, similar to sbt but you do this at runtime, I have heard this done for example when simulating x86 code on some other processor say a dec alpha, the code is slowly changed into native alpha instructions from x86 instructions so the next time around it executes the alpha instruction directly instead of emulating the x86 instruction. Each time through the code the program executes faster.
Or maybe just redesign your emulator to be more efficient from an execution standpoint. Look at the emulated processors in MAME for example, the readability and maintainability of the code has been sacrificed for performance. When written that was important, today with multi-core gigahertz processors you dont have to work so hard to emulate a 1.5ghz 6502 or 3ghz z80. Something as simple as looking the next opcode up in a table and deciding not to emulate some or all of the flag calculation for an instruction can give you a noticeable boost.
Bottom line, if you are interested in using the x86 hardware registers, Ax, BX, etc to emulate AX, BX, etc registers when running a program, the only efficient way to do that is to actually execute the instruction, and not execute and trap as in single stepping a debugger, but execute long strings of instructions while preventing them from leaving the VM space. There are different ways to do this, and performance results will vary, and that doesnt mean it will be faster than a performance efficient emulator. This limits you to matching the processor to the program. Emulating the registers with efficient code and a really good compiler (good optimizer) will give you reasonable performance and portability in that you dont have to match the hardware to the program being run.

transform your complex, register-based code before execution (ahead of time). A simple solution would be a forth like dual-stack vm for execution which offering the possibility to cache the top-of-stack element (TOS) in a register. If you prefer a register-based solution choose an "opcode" format which bundles as much as possible instructions (thumb rule, up to four instructions can be bundled into a byte if a MISC style design is chosen). This way virtual register accesses are locally resolvable to physical register references for each static super-instruction (clang and gcc able to perform such optimization). As side effect the lowered BTB mis-prediction rate would result in far better performance regardless of specific register allocations.
Best threading techniques for C based interpreters are direct threading (label-as-address extension) and replicated switch-threading (ANSI conform).

So you're writing an x86 interpreter, which is bound to be between 1 and 3 powers of 10 slower that the actual hardware. In the real hardware, saying mov mem, foo is going take a lot more time than mov reg, foo, while in your program mem[adr] = foo is going to take about as long as myRegVars[regnum] = foo (modulo cacheing). So you're expecting the same speed differential?
If you want to simulate the speed differential between registers and memory, you're going to have to do something like what Cachegrind does. That is, keep a simulated clock, and when it does a memory reference, it adds a big number to that.

Your VM seems to be too complicated for an efficient interpretation. An obvious optimisation is to have a "microcode" VM, with register load/store instructions, probably even a stack-based one. You can translate your high level VM into a simpler one before the execution. Another useful optimisation depends on a gcc computable labels extension, see the Objective Caml VM interpreter for the example of such a threaded VM implementation.

To answer the specific question you asked:
You could instruct your C compiler to leave a bunch of registers free for your use. Pointers to the first page of memory are usually not allowed, they are reserved for NULL pointer checks, so you could abuse the initial pointers for marking registers. It helps if you have a few native registers to spare, so my example uses 64 bit mode to simulate 4 registers. It may very well be that the additional overhead of the switch slows down execution instead of making it faster. Also see the other answers for general advices.
/* compile with gcc */
register long r0 asm("r12");
register long r1 asm("r13");
register long r2 asm("r14");
register long r3 asm("r15");
inline long get_argument(long* arg)
{
unsigned long val = (unsigned long)arg;
switch(val)
{
/* leave 0 for NULL pointer */
case 1: return r0;
case 2: return r1;
case 3: return r2;
case 4: return r3;
default: return *arg;
}
}

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio