Collecting performance data on heap usage

I am working on a compiler module that adds extra instructions around every load/store in an application with an emphasis on heap objects. One of the performance factors is the size of the object being accessed; different instructions are chosen at run time for different object sizes.
I have chosen a few benchmarks from SPEC to grade the performance impact of my changes. Currently, I am limited to just looking at overheads measured by perf. This leads to a substantial amount of guessing about why certain benchmarks are more severely impacted than others. Supporting each hypothesis with more data seems like a good step to take. From each benchmark, for each object allocated on the heap, it would be useful to know:
1. The size of each heap allocation or reallocation.
2. The number of times each allocation is accessed throughout the run of the application.
I have been successful in #1. It was easy enough to inject a few printf() calls into glibc as I am already tinkering in glibc. I do not know how to get #2; access counts seem much better suited to a framework or wrapper tool and I don't know which one would work best.
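For what it's worth, #1 can also be captured without touching glibc at all by interposing malloc/realloc with LD_PRELOAD. A minimal sketch (the file name logalloc.c and the ./benchmark invocation are placeholders; a real tool would avoid stdio inside the hooks to dodge re-entrancy issues):
/* Build: gcc -shared -fPIC -o logalloc.so logalloc.c -ldl
   Run:   LD_PRELOAD=./logalloc.so ./benchmark */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

static void *(*real_malloc)(size_t);
static void *(*real_realloc)(void *, size_t);

void *malloc(size_t size)
{
    if (!real_malloc)
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    void *p = real_malloc(size);
    fprintf(stderr, "malloc %zu -> %p\n", size, p);   /* log size and address */
    return p;
}

void *realloc(void *old, size_t size)
{
    if (!real_realloc)
        real_realloc = (void *(*)(void *, size_t))dlsym(RTLD_NEXT, "realloc");
    void *p = real_realloc(old, size);
    fprintf(stderr, "realloc %p %zu -> %p\n", old, size, p);
    return p;
}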
Can you provide a recommendation on how to collect this information?

If you're doing the instrumentation in assembly (which I think you are?), you can, in the data segment, just stick in a label with a value:
.data
# probably some other stuff goes here
.align 4
count:
.long 0
and increment it like so:
movl count, %eax      # load the current count
addl $1, %eax         # increment it
movl %eax, count      # store it back
choosing an appropriate register. Though I suppose if you're doing it at the head of a function call, %eax would get clobbered anyway.
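At the C level the equivalent is just a global counter; if the instrumented program is multi-threaded you'd also want the increment to be atomic, since the plain load/add/store above can lose counts. A sketch, assuming GCC-style builtins (count_access is my name):
/* One counter per instrumented site; the atomic add is the C analogue of
   doing "lock incl count" instead of the three-instruction sequence above. */
static volatile unsigned long access_count;

static inline void count_access(void)
{
    __sync_fetch_and_add(&access_count, 1);   /* atomic, so threads don't lose updates */
}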

Valgrind has a tool, DHAT (dynamic heap analysis tool), that can collect this data. The output is not exactly in the format I want, but it is close enough for research work. The access counts are summarized as "average [reads|writes] per byte per allocation"; the exact access count isn't reported, nor is it recoverable from other report data. Maybe some open source development in my future?
http://valgrind.org/docs/manual/dh-manual.html
valgrind --tool=exp-dhat --show-top-n=100000 --trace-children=yes --log-file="log.file" ./benchmark

Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?

LOOP (Intel ref manual entry)
decrements ecx / rcx, and then jumps if non-zero. It's slow, but couldn't Intel have cheaply made it fast? dec/jnz already macro-fuses into a single uop on Sandybridge-family; the only difference being that dec/jnz sets flags and loop doesn't.
loop on various microarchitectures, from Agner Fog's instruction tables:
K8/K10: 7 m-ops
Bulldozer-family/Ryzen: 1 m-op (same cost as macro-fused test-and-branch, or jecxz)
P4: 4 uops (same as jecxz)
P6 (PII/PIII): 8 uops
Pentium M, Core2: 11 uops
Nehalem: 6 uops. (11 for loope / loopne). Throughput = 4c (loop) or 7c (loope/ne).
SnB-family: 7 uops. (11 for loope / loopne). Throughput = one per 5 cycles, as much of a bottleneck as keeping your loop counter in memory! jecxz is only 2 uops with same throughput as regular jcc
Silvermont: 7 uops
AMD Jaguar (low-power): 8 uops, 5c throughput
Via Nano3000: 2 uops
Couldn't the decoders just decode the same as lea rcx, [rcx-1] / jrcxz? That would be 3 uops. At least that would be the case with no address-size prefix, otherwise it has to use ecx and truncate RIP to EIP if the jump is taken; maybe the odd choice of address-size controlling the width of the decrement explains the many uops? (Fun fact: rep-string instructions have the same behaviour with using ecx with 32-bit address-size.)
Or better, just decode it as a fused dec-and-branch that doesn't set flags? dec ecx / jnz on SnB decodes to a single uop (which does set flags).
I know that real code doesn't use it (because it's been slow since at least P5 or something), but AMD decided it was worth it to make it fast for Bulldozer. Probably because it was easy.
Would it be easy for SnB-family uarch to have fast loop? If so, why don't they? If not, why is it hard? A lot of decoder transistors? Or extra bits in a fused dec&branch uop to record that it doesn't set flags? What could those 7 uops be doing? It's a really simple instruction.
What's special about Bulldozer that made a fast loop easy / worth it? Or did AMD waste a bunch of transistors on making loop fast? If so, presumably someone thought it was a good idea.
If loop was fast, it would be perfect for BigInteger arbitrary-precision adc loops, to avoid partial-flag stalls / slowdowns (see my comments on my answer), or any other case where you want to loop without touching flags. It also has a minor code-size advantage over dec/jnz. (And dec/jnz only macro-fuses on SnB-family).
On modern CPUs where dec/jnz is ok in an ADC loop, loop would still be nice for ADCX / ADOX loops (to preserve OF).
If loop had been fast, compilers would already be using it as a peephole optimization for code-size + speed on CPUs without macro-fusion.
It wouldn't stop me from getting annoyed at all the questions with bad 16bit code that uses loop for every loop, even when they also need another counter inside the loop. But at least it wouldn't be as bad.
In 1988, IBM fellow Glenn Henry had just come on board at Dell, which had a few hundred employees at the time, and in his first month he gave a tech talk about 386 internals. A bunch of us BIOS programmers had been wondering why LOOP was slower than DEC/JNZ so during the question/answer section somebody posed the question.
His answer made sense. It had to do with paging.
LOOP consists of two parts: decrementing CX, then jumping if CX is not zero. The first part cannot cause a processor exception, whereas the jump part can. For one, you could jump (or fall through) to an address outside segment boundaries, causing a SEGFAULT. For two, you could jump to a page that is swapped out.
A SEGFAULT usually spells the end for a process, but page faults are different. When a page fault occurs, the processor throws an exception, and the OS does the housekeeping to swap in the page from disk into RAM. After that, it restarts the instruction that caused the fault.
Restarting means restoring the state of the process to what it was just before the offending instruction. In the case of the LOOP instruction in particular, it meant restoring the value of the CX register. One might think you could just add 1 to CX, since we know CX got decremented, but apparently, it's not that simple. For example, check out this erratum from Intel:
The protection violations involved usually indicate a probable
software bug and restart is not desired if one of these violations
occurs. In a Protected Mode 80286 system with wait states during any
bus cycles, when certain protection violations are detected by the
80286 component, and the component transfers control to the exception
handling routine, the contents of the CX register may be unreliable.
(Whether CX contents are changed is a function of bus activity at the
time internal microcode detects the protection violation.)
To be safe, they needed to save the value of CX on every iteration of a LOOP instruction, in order to reliably restore it if needed.
It's this extra burden of saving CX that made LOOP so slow.
Intel, like everyone else at the time, was getting more and more RISC. The old CISC instructions (LOOP, ENTER, LEAVE, BOUND) were being phased out. We still used them in hand-coded assembly, but compilers ignored them completely.
Now that I googled after writing my question, it turns out to be an exact duplicate of one on comp.arch, which came up right away. I expected it to be hard to google (lots of "why is my loop slow" hits), but my first try (why is the x86 loop instruction slow) got results.
This is not a good or complete answer.
It might be the best we'll get, and will have to suffice unless someone can shed some more light on it. I didn't set out to write this as an answer-my-own-question post.
Good posts with different theories in that thread:
Robert
LOOP became slow on some of the earliest machines (circa 486) when
significant pipelining started to happen, and running any but the
simplest instruction down the pipeline efficiently was technologically
impractical. So LOOP was slow for a number of generations. So nobody
used it. So when it became possible to speed it up, there was no real
incentive to do so, since nobody was actually using it.
Anton Ertl:
IIRC LOOP was used in some software for timing loops; there was
(important) software that did not work on CPUs where LOOP was too fast
(this was in the early 90s or so). So CPU makers learned to make LOOP
slow.
(Paul, and anyone else: You're welcome to re-post your own writing as your own answer. I'll remove it from my answer and up-vote yours.)
Paul A. Clayton (occasional SO poster and CPU architecture guy) took a guess at how you could use that many uops. (This looks like loope/ne which checks both the counter and ZF):
I could imagine a possibly sensible 6-µop version:
virtual_cc = cc;
temp = test (cc);
rCX = rCX - temp; // also setting cc
cc = temp & cc; // assumes branch handling is not
// substantially changed for the sake of LOOP
branch
cc = virtual_cc
(Note that this is 6 uops, not SnB's 11 for LOOPE/LOOPNE, and is a total guess not even trying to take into account anything known from SnB perf counters.)
Then Paul said:
I agree that a shorter sequence should be possible, but I was trying
to think of a bloated sequence that might make sense if minimal
microarchitectural adjustments were permitted.
summary: The designers wanted loop to be supported only via microcode, with no adjustments whatsoever to the hardware proper.
If a useless, compatibility-only instruction is handed to the
microcode developers, they might reasonably not be able or willing to
suggest minor changes to the internal microarchitecture to improve
such an instruction. Not only would they rather use their "change
suggestion capital" more productively but the suggestion of a change
for a useless case would reduce the credibility of other suggestions.
(My opinion: Intel is probably still making it slow on purpose, and hasn't bothered to rewrite their microcode for it for a long time. Modern CPUs are probably too fast for anything using loop in a naive way to work correctly.)
... Paul continues:
The architects behind Nano may have found avoiding the special casing
of LOOP simplified their design in terms of area or power. Or they
may have had incentives from embedded users to provide a fast
implementation (for code density benefits). Those are just WILD
guesses.
If optimization of LOOP fell out of other optimizations (like fusion
of compare and branch), it might be easier to tweak LOOP into a fast
path instruction than to handle it in microcode even if the
performance of LOOP was unimportant.
I suspect that such decisions are based on specific details of the
implementation. Information about such details does not seem to be
generally available and interpreting such information would be
beyond the skill level of most people. (I am not a hardware
designer--and have never played one on television or stayed at a
Holiday Inn Express. :-)
The thread then went off-topic into the realm of AMD blowing our one chance to clean up the cruft in x86 instruction encoding. It's hard to blame them, since every change is a case where the decoders can't share transistors. And before Intel adopted x86-64, it wasn't even clear that it would catch on. AMD didn't want to burden their CPUs with hardware nobody used if AMD64 didn't catch on.
But still, there are so many small things: setcc could have changed to 32bits. (Usually you have to use xor-zero / test / setcc to avoid false dependencies, or because you need a zero-extended reg). Shift could have unconditionally written flags, even with zero shift count (removing the input data dependency on eflags for variable-count shift for OOO execution). Last time I typed this list of pet peeves, I think there was a third one... Oh yeah, bt / bts etc. with memory operands has the address dependent on the upper bits of the index (bit string, not just bit within a machine word).
bts instructions are very useful for bit-field stuff, and are slower than they need to be so you almost always want to load into a register and then use that. (It's usually faster to shift/mask to get an address yourself, instead of using 10 uop bts [mem], reg on Skylake, but it does take extra instructions. So it made sense on 386, but not on K8). Atomic bit-manipulation has to use the memory-dest form, but the locked version needs lots of uops anyway. It's still slower than if it couldn't access outside the dword it's operating on.
Please see the nice article by Michael Abrash, published in Dr. Dobb's Journal, March 1991, v16 n3 p16(8): http://archive.gamedev.net/archive/reference/articles/article369.html
The summary of the article is the following:
Optimizing code for 8088, 80286, 80386 and 80486 microprocessors is
difficult because the chips use significantly different memory
architectures and instruction execution times. Code cannot be
optimized for the 80x86 family; rather, code must be designed to
produce good performance on a range of systems or optimized for
particular combinations of processors and memory. Programmers must
avoid the unusual instructions supported by the 8088, which have lost
their performance edge in subsequent chips. String instructions
should be used but not relied upon. Registers should be used rather
than memory operations. Branching is also slow for all four
processors. Memory accesses should be aligned to improve
performance. Generally, optimizing an 80486 requires exactly the
opposite steps as optimizing an 8088.
By "unusual instructions supported by the 8088" the author also means "loop":
Any 8088 programmer would instinctively replace:
DEC CX
JNZ LOOPTOP
with:
LOOP LOOPTOP
because LOOP is significantly faster on the 8088.
LOOP is also faster on the 286. On the 386, however, LOOP is actually
two cycles slower than DEC/JNZ. The pendulum swings still further on
the 486, where LOOP is about twice as slow as DEC/JNZ--and, mind you,
we're talking about what was originally perhaps the most obvious
optimization in the entire 80x86 instruction set.
This is a very good article, and I highly recommend it. Even though it was published in 1991, it is surprisingly relevant today.
But the article only gives advice; it encourages you to test execution speed and choose the faster variants. It doesn't explain WHY some instructions became so slow, so it doesn't fully address your question.
The answer is that earlier processors, like 80386 (released in 1985) and before, executed instructions one-by-one, sequentially.
Later processors started to use instruction pipelining – at first a simple one, in the 80486 – and the Pentium Pro (released in 1995) finally introduced a radically different internal pipeline, the Out-Of-Order (OOO) core, in which instructions are translated into small fragments of operations called micro-ops (µops), and all the micro-ops of different instructions are put into a large pool where they can execute simultaneously as long as they do not depend on one another. This OOO pipeline principle is still used, almost unchanged, in modern processors. You can find more information about instruction pipelining in this brilliant article: https://www.gamedev.net/resources/_/technical/general-programming/a-journey-through-the-cpu-pipeline-r3115
To simplify chip design, Intel decided to build processors in such a way that some instructions translate to micro-ops very efficiently, while others do not.
Efficient conversion from instructions to micro-ops requires more transistors, so Intel decided to save transistors at the cost of slower decoding and execution of some "complex" or "rarely-used" instructions.
For example, the “Intel® Architecture Optimization Reference Manual” http://download.intel.com/design/PentiumII/manuals/24512701.pdf mentions the following: “Avoid using complex instructions (for example, enter, leave, or loop) that generally have more than four µops and require multiple cycles to decode. Use sequences of simple instructions instead.”
So, at some point Intel decided that the "loop" instruction is "complex", and since then it has been very slow. There is, however, no official Intel reference on the per-instruction breakdown: how many micro-ops each instruction produces and how many cycles are required to decode it.
You can also read about the Out-of-Order Execution Engine in the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", section 2.1.2: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

How can I speed up my code coverage tool?

I've written a small code coverage utility to log which basic blocks are hit in an x86 executable. It runs without source code or debugging symbols for the target, and just takes a list of basic blocks which it monitors.
However, it is becoming the bottleneck in my application, which involves repeated coverage snapshots of a single executable image.
It has gone through a couple of phases as I've tried to speed it up. I started off just placing an INT3 at the start of each basic block, attaching as a debugger, and logging hits. Then I tried to improve performance by patching a counter into any block bigger than 5 bytes (the size of a JMP REL32). I wrote a small stub ('mov [blah], 1 / jmp backToTheBasicBlockWeCameFrom') in the process memory space and patched a JMP to it. This greatly speeds things up, since there's no exception and no debugger break, but I'd like to speed things up more.
I'm thinking of one of the following:
1) Pre-instrument the target binary with my patched counters (at the moment I do this at runtime). I could make a new section in the PE, throw my counters in it, patch in all the hooks I need, then just read data out of the same section with my debugger after each execution. That'll gain me some speed (about 16% by my estimate), but there are still those pesky INT3's which I need to have in the smaller blocks, and those are really going to cripple performance.
2) Instrument the binary to include its own UnhandledExceptionFilter and handle its own int3's in conjunction with the above. This would mean there's no process switch from the debuggee to my coverage tool on every int3, but there'd still be the breakpoint exception raised and the subsequent kernel transition - am I right in thinking this wouldn't actually gain me much performance?
3) Try to do something clever using Intel's hardware branch profiling instructions. This sounds pretty awesome but I'm not clear on how I'd go about it - is it even possible in a windows usermode application? I might go as far as to write a kernel-mode driver if it's fairly straightforward but I'm not a kernel coder (I dabble a bit) and would probably cause myself lots of headaches. Are there any other projects using this approach? I see the Linux kernel has it to monitor the kernel itself, which makes me think that monitoring a specific usermode application will be difficult.
4) Use an off-the-shelf application. It'd need to work without any source or debugging symbols, be scriptable (so I can run in batches), and preferably be free (I'm pretty stingy). For-pay tools aren't off the table, however (if I can spend less on a tool and increase perf enough to avoid buying new hardware, that'd be good justification).
5) Something else. I'm running in VMWare on Windows XP, on fairly old hardware (Pentium 4-ish) - is there anything I've missed, or any leads I should read up on? Can I get my JMP REL32 down to less than 5 bytes (and catch smaller blocks without the need for an int3)?
Thanks.
If you insist on instrumenting binaries, pretty much your fastest coverage is the 5-byte jump-out jump-back trick. (You're covering standard ground for binary instrumentation tools.)
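For reference, the rel32 arithmetic and the write into the target are only a few lines; a sketch with the usual Win32 calls (patch_jmp, site and stub are my names, and error handling is minimal):
#include <windows.h>
#include <string.h>

/* Overwrite 5 bytes at 'site' in the target process with JMP rel32 to 'stub'.
   'site' must start a run of at least 5 bytes of whole instructions. */
static BOOL patch_jmp(HANDLE process, BYTE *site, BYTE *stub)
{
    BYTE patch[5];
    LONG rel = (LONG)(stub - (site + 5));   /* rel32 is relative to the end of the JMP */
    DWORD old_protect;
    SIZE_T written = 0;
    BOOL ok;

    patch[0] = 0xE9;                        /* JMP rel32 opcode */
    memcpy(&patch[1], &rel, sizeof(rel));

    if (!VirtualProtectEx(process, site, sizeof(patch), PAGE_EXECUTE_READWRITE, &old_protect))
        return FALSE;
    ok = WriteProcessMemory(process, site, patch, sizeof(patch), &written);
    VirtualProtectEx(process, site, sizeof(patch), old_protect, &old_protect);
    FlushInstructionCache(process, site, sizeof(patch));
    return ok && written == sizeof(patch);
}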
The INT 3 solution will always involve a trap. Yes, you could handle the trap in your space instead of a debugger space and that would speed it up, but it will never be close to competitive with the jump-out/back patch. You may need it as a backup anyway, if the function you are instrumenting happens to be shorter than 5 bytes (e.g., "inc eax/ret"), because then you don't have 5 bytes you can patch.
What you might do to optimize things a little is examine the code being patched. Without such examination, original code like this:
instrn 1
instrn 2
instrn N
next:
patched, in general, to look like this:
jmp patch
xxx
next:
generally has to have a patch of the form:
patch: pushf
inc count
popf
instrn1
instrn2
instrnN
jmp back
If all you want is coverage, you don't need to increment, and that means you don't need to save the flags:
patch: mov byte ptr covered,1
instrn1
instrn2
instrnN
jmp back
You should use a byte rather than a word to keep the patch size down. You should align the patch on a cache line so the processor doesn't have to fetch 2 cache lines to execute the patch.
If you insist on counting, you can analyze instrn1/2/N to see if they care about the flags that "inc" fools with, and only pushf/popf if needed, or you can insert the increment between two instructions in the patch that don't care. You have to analyze these to some extent anyway, to handle complications such as one of the instructions being a ret; then you can generate a better patch (e.g., don't "jmp back").
You may find that using add count,1 is faster than inc count because this avoids partial condition code updates and consequent pipeline interlocks. This will affect your cc-impact-analysis a bit, since inc doesn't set the carry bit, and add does.
Another possibility is PC sampling. Don't instrument the code at all; just interrupt the thread periodically and take a sample PC value. If you know where the basic blocks are, a PC sample anywhere in the basic block is evidence the entire block got executed. This won't necessarily give precise coverage data (you may miss critical PC values), but the overhead is pretty low.
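A sketch of that sampling loop for a 32-bit target on Windows (the thread handle needs THREAD_SUSPEND_RESUME and THREAD_GET_CONTEXT access; sample_thread is my name):
#include <windows.h>
#include <stdio.h>

/* Periodically sample EIP of a target thread; a sample that lands inside a
   known basic block is evidence the block executed. */
static void sample_thread(HANDLE thread, int samples, DWORD interval_ms)
{
    int i;
    for (i = 0; i < samples; i++) {
        CONTEXT ctx;
        ctx.ContextFlags = CONTEXT_CONTROL;        /* EIP/ESP/EBP are enough */
        if (SuspendThread(thread) == (DWORD)-1)
            return;
        if (GetThreadContext(thread, &ctx))
            printf("sample EIP = %08lx\n", (unsigned long)ctx.Eip);  /* look up in block map */
        ResumeThread(thread);
        Sleep(interval_ms);
    }
}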
If you are willing to patch source code, you can do better: just insert "covered[i]=true;" at the beginning of the ith basic block, and let the compiler take care of all the various optimizations. No patches needed. The really cool part of this is that if you have basic blocks inside nested loops, and you insert source probes like this, the compiler will notice that the probe assignments are idempotent with respect to the loop and lift the probe out of the loop. Voila, zero probe overhead inside the loop. What more could you want?
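To make that last point concrete, a source probe is nothing more than this (a sketch; the block numbering and sum_squares are just illustrative names):
static unsigned char covered[2];   /* one flag per basic block */

/* Example instrumented function.  The store in the loop body is
   loop-invariant, so the optimizer can move it out of the loop, as the
   answer describes: zero probe overhead per iteration. */
static long sum_squares(int n)
{
    long total = 0;
    int i;
    covered[0] = 1;                /* probe: function entry block */
    for (i = 0; i < n; i++) {
        covered[1] = 1;            /* probe: loop body block */
        total += (long)i * i;
    }
    return total;
}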

Better to encode ADD EAX,4 as r/m32 imm32 or r32 imm8?

Given the instruction:
ADD EAX, 4
in x86 assembly P2+, would it be better to encode this as ADD EAX, imm32 (op code 05) or ADD r/m32, imm8 (op code 83 /0), if the goal is execution speed and code size is irrelevant?
Note that EAX is a pointer in this case so ADD AL, imm8 is not viable.
Option 2 (op code 83 /0) will lead to a smaller code size, but from my limited understanding it may not pipeline as well as option 1, even though it is shorter.
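For reference, here are the two encodings being compared, dumped from C so the byte counts are visible (the mnemonics are in the comments):
#include <stdio.h>

int main(void)
{
    /* add eax, 4 with a 32-bit immediate: opcode 05 id (no ModRM) */
    unsigned char imm32_form[] = { 0x05, 0x04, 0x00, 0x00, 0x00 };
    /* add eax, 4 with a sign-extended 8-bit immediate: 83 /0 ib
       (ModRM 0xC0 = mod 11, reg /0, r/m = EAX) */
    unsigned char imm8_form[]  = { 0x83, 0xC0, 0x04 };

    printf("imm32 form: %u bytes, imm8 form: %u bytes\n",
           (unsigned)sizeof imm32_form, (unsigned)sizeof imm8_form);
    return 0;
}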
If you're not optimising for one specific CPU (e.g. "Intel Pentium II, model 2" and not just "P2+"), just guess. Regardless of how you guess, some CPUs might be faster and some might be slower (but I'd expect that for most CPUs it won't make any difference at all).
If you are optimising for one specific CPU, just guess. The alternative is to profile it under the exact conditions you plan to use it, and then realise that you want to change the code 2 days later and that you wasted 8 hours messing about with profiling for nothing.
Finally; if you're writing a code optimiser (e.g. as part of a compiler back-end or an optimising assembler), then just guess - you've got far more important things to worry about than this. The alternative is to get about 50 different computers and profile many different sequences of instructions on all of them; and use the results to create rules for many different cases.
Choose the short form 83 /0. Code density will improve, which means performance and power consumption have a modest chance to improve on many processors. There is no upside to encoding fat immediate values.
Instruction cache size on a modern x86 processor is at least 32KB, which means modest code density improvements won't really improve caching. See the paper "The Impact of Code Density on Instruction Cache Performance" by Steenkiste. Sorry, I can't seem to find a free link for that one.
The real potential win is for dense code executing in a tight loop. If the loop is tight enough, modern processors can take shortcuts in which they fetch and decode the instructions only once, then execute the small loop from internal buffering inside the pipeline itself. Even if loop performance is limited due to data hazards or whatever, these tricks still reduce power consumption.

Does omitting the frame pointers really have a positive effect on performance and a negative effect on debug-ability?

As was advised a long time ago, I always build my release executables without frame pointers (which is the default if you compile with /Ox).
However, I have now read in the paper http://research.microsoft.com/apps/pubs/default.aspx?id=81176 that frame pointers don't have much of an effect on performance. So optimizing fully (using /Ox) versus optimizing fully while keeping frame pointers (using /Ox /Oy-) doesn't really make a difference in performance.
Microsoft seems to indicate that adding frame pointers (/Oy-) makes debugging easier, but is this really the case?
I did some experiments and noticed that:
in a simple 32-bit test executable (compiled using /Ox /Ob0) the omission of frame pointers does increase performance (by about 10%). But this test executable only performs some function calls, nothing else.
in my own application, adding/removing frame pointers doesn't seem to have a big effect. Adding frame pointers seems to make the application about 5% faster, but that could be within the error margin.
What is the general advice regarding frame pointers?
Should they be omitted (/Ox) in a release executable because they really have a positive effect on performance?
Should they be added (/Ox /Oy-) in a release executable because they improve debuggability (when debugging with a crash-dump file)?
Using Visual Studio 2010.
Phoronix tested the performance downside of -O2 -fno-omit-frame-pointer with x86-64 GCC 12.1 on a Zen 3 laptop CPU for multiple open-source programs, as proposed for Fedora 37. Most of them had performance regressions, a few of them very serious, although the biggest ones are probably some kind of fluke or other interaction. Geometric mean slowdown of 14% (including those possible outliers).
Short answer: By omitting the frame pointer,
You need to use the stack pointer to access local variables and arguments. The compiler doesn't mind, but if you are coding in assembler, this makes your life slightly harder. Much harder if you don't use macros.
You save four bytes (32-bit architecture) of stack space per function call. Unless you are using deep recursion, this isn't a win.
You save a memory write to a cached memory (the stack) and you (theoretically) save a few clock ticks on function entry/exit, but you can increase the code size. Unless your function is doing very little very often (in which case it should be inlined), this shouldn't be noticeable.
You free up a general-purpose register. If the compiler can utilize the register, it will produce code that is both substantially smaller and potentially faster. But if most of the CPU time is spent talking to main memory (or even the hard drive), omitting the frame pointer is not going to save you from that.
The debugger will lose an easy way to generate the stack trace. It might still be able to generate the stack trace from a different source (such as a PDB file).
Long answer:
The typical function entry and exit is (16-bit processor):
PUSH BP ;push the base pointer (frame pointer)
MOV BP,SP ;store the stack pointer in the frame pointer
SUB SP,xx ;allocate space for local variables et al.
...
LEAVE ;restore the stack pointer and pop the old frame pointer
RET ;return from the function
An entry and exit without a frame pointer could look like (32-bit processor):
SUB ESP,xx ;allocate space for local variables et al.
...
ADD ESP,xx ;de-allocate space for local variables et al.
RET ;return from the function.
You will save two instructions, but you also duplicate a literal value, so the code doesn't get shorter (quite the opposite, especially with [esp+xx] addressing modes taking an extra byte vs. [ebp+xx]), but you might have saved a few clock cycles (or not, if it causes a cache miss in the instruction cache). You did save some space on the stack, though.
You do free up a general-purpose register. This has only benefits.
In regcall/fastcall, this is one extra register where you can store arguments to your function. Thus, if your function takes seven (on x86; more on most other architectures) or more arguments (including this), the seventh argument still fits into a register. (Although most calling conventions don't pass that many in registers, e.g., two for MS fastcall, three for GCC regparm(3) on 32-bit x86. Up to six integer register arguments on x86-64 System V, or 4 register arguments on most RISC processors.)
The same, more importantly, applies to local variables as well. Arrays and large objects don't fit into registers (but pointers to them do), but if your function is using seven different local variables (including temporary variables needed to calculate complex expressions), chances are the compiler will be able to produce smaller code. Smaller code means lower instruction cache footprint, which means reduced miss rate and thus even less memory access (but Intel Atom has a 32K instruction cache, meaning that your code will probably fit anyway).
The x86 architecture features the [BX/BP/SI/DI] and [BX/BP + SI/DI] addressing modes. This makes the BP register an extremely useful place for a scaled array index, especially if the array pointer resides in the SI or DI registers. Two offset registers are better than one.
Utilising a register avoids memory access, but if a variable is worth storing in a register, chances are it will survive just fine in the L1 cache (especially since it's going to be on the stack). There is still the cost of moving to/from the cache, but since modern CPUs do a lot of move optimisation and parallelisation, it is possible that an L1 access would be just as fast as a register access. Thus, the speed benefit from not moving data around is still present, but not as enormous. I can easily imagine the CPU avoiding the data cache completely, at least as far as reading is concerned (and writing to cache can be done in parallel).
A register that is utilised is a register that needs preserving. It is not worth storing much in the registers if you are going to push it to the stack anyway before you use it again. In preserve-by-caller calling conventions (such as the one above), this means that registers as persistent storage are not as useful in a function that calls other functions a lot.
See What are callee and caller saved registers? for more about how calling conventions are designed with a mix of call-clobbered and call-preserved registers to give compilers a good mix of each, so functions have some scratch registers for temporaries that don't need to live across function calls, but also some registers that callees will preserve. Also Why make some registers caller-saved and others callee-saved? Why not make the caller save everything it wants saved?
Also note that x86 has a separate register space for floating point registers, meaning that floats cannot utilise the BP register without extra data movement instructions anyway. Only integers and memory pointers do.
You do lose debuggability by omitting frame pointers. This answer shows why:
If the code crashes, all the debugger needs to do to generate the stack trace is:
PUSH BP ; log the current frame pointer as well
$1: CALL log_BP ; log the frame pointer currently on stack
LEAVE ; pop the frame pointer to get the next one
CMP [BP+4],0
JNZ $1 ; until the stack cannot be popped (the return address is some specific value)
If the code crashes without a frame pointer, the debugger might not have any way to generate the stack trace, because it might not know how much needs to be subtracted from the stack pointer (it would need to locate the function entry/exit points). If the debugger doesn't know the frame pointer is not being used, it might even crash itself.
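The same walk, sketched in C for a 32-bit build that keeps frame pointers (__builtin_frame_address is a GCC/Clang builtin; this is illustrative only, and falls apart as soon as any frame in the chain omits EBP):
#include <stdio.h>

struct frame {
    struct frame *prev;     /* saved EBP of the caller */
    void *return_addr;      /* pushed by the CALL instruction */
};

/* Walk the EBP chain of the current thread.  Works only if every frame in
   the chain was built with push ebp / mov ebp, esp. */
static void backtrace_fp(void)
{
    struct frame *fp = (struct frame *)__builtin_frame_address(0);
    while (fp && fp->return_addr) {
        printf("return address: %p\n", fp->return_addr);
        fp = fp->prev;
    }
}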
Modern debug-info formats have metadata that still allows stack backtraces in optimized code where the compiler defaults to not using [E/R]BP as a frame pointer. Compilers know how to use assembler directives to create this extra metadata, or write it directly in the object file, not in the parts that normally get mapped into memory. If you don't do this for hand-written assembly, then debugability would suffer, especially for crashes in functions called by a hand-written assembly function.

How many CPU cycles are needed for each assembly instruction?

I heard there is an Intel book online which describes the CPU cycles needed for a specific assembly instruction, but I cannot find it (after trying hard). Could anyone show me how to find the CPU cycles, please?
Here is an example, in the below code, mov/lock is 1 CPU cycle, and xchg is 3 CPU cycles.
// This part is Platform dependent!
#ifdef WIN32
inline int CPP_SpinLock::TestAndSet(int* pTargetAddress, int nValue)
{
    __asm
    {
        mov edx, dword ptr [pTargetAddress]
        mov eax, nValue
        lock xchg eax, dword ptr [edx]
    }
    // mov = 1 CPU cycle
    // lock = 1 CPU cycle
    // xchg = 3 CPU cycles
}
#endif // WIN32
BTW: here is the URL for the code I posted: http://www.codeproject.com/KB/threads/spinlocks.aspx
Modern CPUs are complex beasts, using pipelining, superscalar execution, and out-of-order execution among other techniques which make performance analysis difficult... but not impossible!
While you can no longer simply add together the latencies of a stream of instructions to get the total runtime, you can still get a (often) highly accurate analysis of the behavior of some piece of code (especially a loop) as described below and in other linked resources.
Instruction Timings
First, you need the actual timings. These vary by CPU architecture, but the best resource currently for x86 timings is Agner Fog's instruction tables. Covering no less than thirty different microarchitectures, these tables list the instruction latency, which is the minimum/typical time that an instruction takes from inputs ready to output available. In Agner's words:
Latency: This is the delay that the instruction generates in a
dependency chain. The numbers are minimum values. Cache misses,
misalignment, and exceptions may increase the clock counts
considerably. Where hyperthreading is enabled, the use of the same
execution units in the other thread leads to inferior performance.
Denormal numbers, NAN's and infinity do not increase the latency. The
time unit used is core clock cycles, not the reference clock cycles
given by the time stamp counter.
So, for example, the add instruction has a latency of one cycle, so a series of dependent add instructions, as shown, will have a latency of 1 cycle per add:
add eax, eax
add eax, eax
add eax, eax
add eax, eax # total latency of 4 cycles for these 4 adds
Note that this doesn't mean that add instructions will only take 1 cycle each. For example, if the add instructions were not dependent, it is possible that on modern chips all 4 add instructions can execute independently in the same cycle:
add eax, eax
add ebx, ebx
add ecx, ecx
add edx, edx # these 4 instructions might all execute, in parallel in a single cycle
Agner provides a metric which captures some of this potential parallelism, called reciprocal throughput:
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind
in the same thread.
For add this is listed as 0.25 meaning that up to 4 add instructions can execute every cycle (giving a reciprocal throughput of 1 / 4 = 0.25).
The reciprocal throughput number also gives a hint at the pipelining capability of an instruction. For example, on most recent x86 chips, the common forms of the imul instruction have a latency of 3 cycles, and internally only one execution unit can handle them (unlike add which usually has four add-capable units). Yet the observed throughput for a long series of independent imul instructions is 1/cycle, not 1 every 3 cycles as you might expect given the latency of 3. The reason is that the imul unit is pipelined: it can start a new imul every cycle, even while the previous multiplication hasn't completed.
This means a series of independent imul instructions can run at up to 1 per cycle, but a series of dependent imul instructions will run at only 1 every 3 cycles (since the next imul can't start until the result from the prior one is ready).
So with this information, you can start to see how to analyze instruction timings on modern CPUs.
Detailed Analysis
Still, the above is only scratching the surface. You now have multiple ways of looking at a series of instructions (latency or throughput) and it may not be clear which to use.
Furthermore, there are other limits not captured by the above numbers, such as the fact that certain instructions compete for the same resources within the CPU, and restrictions in other parts of the CPU pipeline (such as instruction decoding) which may result in a lower overall throughput than you'd calculate just by looking at latency and throughput. Beyond that, you have factors "beyond the ALUs" such as memory access and branch prediction: entire topics unto themselves - you can mostly model these well, but it takes work. For example here's a recent post where the answer covers in some detail most of the relevant factors.
Covering all the details would increase the size of this already long answer by a factor of 10 or more, so I'll just point you to the best resources. Agner Fog has an Optimizing Assembly guide that covers in detail the precise analysis of a loop with a dozen or so instructions. See "12.7 An example of analysis for bottlenecks in vector loops", which starts on page 95 in the current version of the PDF.
The basic idea is that you create a table, with one row per instruction and mark the execution resources each uses. This lets you see any throughput bottlenecks. In addition, you need to examine the loop for carried dependencies, to see if any of those limit the throughput (see "12.16 Analyzing dependencies" for a complex case).
If you don't want to do it by hand, Intel has released the Intel Architecture Code Analyzer, which is a tool that automates this analysis. It currently hasn't been updated beyond Skylake, but the results are still largely reasonable for Kaby Lake since the microarchitecture hasn't changed much and therefore the timings remain comparable. This answer goes into a lot of detail and provides example output, and the user's guide isn't half bad (although it is out of date with respect to the newest versions).
Other sources
Agner usually provides timings for new architectures shortly after they are released, but you can also check out instlatx64 for similarly organized timings in the InstLatX86 and InstLatX64 results. The results cover a lot of interesting old chips, and new chips usually show up fairly quickly. The results are mostly consistent with Agner's, with a few exceptions here and there. You can also find memory latency and other values on this page.
You can even get the timing results directly from Intel in their IA32 and Intel 64 optimization manual in Appendix C: INSTRUCTION LATENCY AND THROUGHPUT. Personally I prefer Agner's version because they are more complete, often arrive before the Intel manual is updated, and are easier to use as they provide a spreadsheet and PDF version.
Finally, the x86 tag wiki has a wealth of resources on x86 optimization, including links to other examples of how to do a cycle accurate analysis of code sequences.
If you want a deeper look into the type of "dataflow analysis" described above, I would recommend A Whirlwind Introduction to Data Flow Graphs.
Given pipelining, out of order processing, microcode, multi-core processors, etc there's no guarantee that a particular section of assembly code will take exactly x CPU cycles/clock cycle/whatever cycles.
If such a reference exists, it will only be able to provide broad generalizations given a particular architecture, and depending on how the microcode is implemented you may find that the Pentium M is different than the Core 2 Duo which is different than the AMD dual core, etc.
Note that this article was updated in 2000, and written earlier. Even the Pentium 4 is hard to pin down regarding instruction timing - PIII, PII, and the original pentium were easier, and the texts referenced were probably based on those earlier processors that had a more well-defined instruction timing.
These days people generally use statistical analysis for code timing estimation.
What the other answers say about it being impossible to accurately predict the performance of code running on a modern CPU is true, but that doesn't mean the latencies are unknown, or that knowing them is useless.
The exact latencies for Intel's and AMD's processors are listed in Agner Fog's instruction tables. See also the Intel® 64 and IA-32 Architectures Optimization Reference Manual, and Instruction latencies and throughput for AMD and Intel x86 processors (from Can Berk Güder's now-deleted link-only answer). AMD also has PDF manuals on their own website with their official values.
For (micro-)optimizing tight loops, knowing the latencies for each instruction can help a lot in manually trying to schedule your code. The programmer can make a lot of optimizations that the compiler can't (because the compiler can't guarantee it won't change the meaning of the program).
Of course, this still requires you to know a lot of other details about the CPU, such as how deeply pipelined it is, how many instructions it can issue per cycle, number of execution units and so on. And of course, these numbers vary for different CPU's. But you can often come up with a reasonable average that more or less works for all CPU's.
It's worth noting though, that it is a lot of work to optimize even a few lines of code at this level. And it is easy to make something that turns out to be a pessimization. Modern CPUs are hugely complicated, and they try extremely hard to get good performance out of bad code. But there are also cases they're unable to handle efficiently, or where you think you're clever and making efficient code, and it turns out to slow the CPU down.
Edit
Looking in Intel's optimization manual, table C-13:
The first column is instruction type, then there is a number of columns for latency for each CPUID. The CPUID indicates which processor family the numbers apply to, and are explained elsewhere in the document. The latency specifies how many cycles it takes before the result of the instruction is available, so this is the number you're looking for.
The throughput columns show how many of this type of instructions can be executed per cycle.
Looking up xchg in this table, we see that depending on the CPU family, it takes 1-3 cycles, and a mov takes 0.5-1. These are for the register-to-register forms of the instructions, not for a lock xchg with memory, which is a lot slower. And more importantly, hugely-variable latency and impact on surrounding code (much slower when there's contention with another core), so looking only at the best-case is a mistake. (I haven't looked up what each CPUID means, but I assume the .5 are for Pentium 4, which ran some components of the chip at double speed, allowing it to do things in half cycles)
I don't really see what you plan to use this information for, however, but if you know the exact CPU family the code is running on, then adding up the latency tells you the minimum number of cycles required to execute this sequence of instructions.
Measuring and counting CPU-cycles does not make sense on the x86 anymore.
First off, ask yourself: for which CPU are you counting cycles? A Core 2? An Athlon? A Pentium-M? An Atom? All these CPUs execute x86 code, but all of them have different execution times. Execution even varies between different steppings of the same CPU.
The last x86 where cycle-counting made sense was the Pentium-Pro.
Also consider that inside the CPU most instructions are transcoded into microcode and executed out of order by an internal execution unit that does not even remotely look like an x86. The performance of a single CPU instruction depends on how many resources in the internal execution unit are available.
So the time for an instruction depends not only on the instruction itself but also on the surrounding code.
Anyway: You can estimate the throughput-resource usage and latency of instructions for different processors. The relevant information can be found at the Intel and AMD sites.
Agner Fog has a very nice summary on his web-site. See the instruction tables for latency, throughput, and uop counts. See the microarchitecture PDF to learn how to interpret those.
http://www.agner.org/optimize
But note that xchg-with-memory does not have predictable performance, even if you look at only one CPU model. Even in the no-contention case with the cache line already hot in L1D cache, being a full memory barrier means its impact depends a lot on loads and stores to other addresses in the surrounding code.
Btw - since your example code is a basic building block for a lock-free data structure: have you considered using the compiler built-in functions? On Win32 you can include intrin.h and use functions such as _InterlockedExchange.
That'll give you better execution time because the compiler can inline the instructions. Inline-assembler always forces the compiler to disable optimizations around the asm-code.
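For example (a sketch, MSVC-specific; spin_lock/spin_unlock are my names):
#include <intrin.h>

/* _InterlockedExchange is a full-barrier atomic exchange, so the compiler
   can inline it instead of shutting off optimizations around an __asm block. */
static void spin_lock(volatile long *lock_word)
{
    while (_InterlockedExchange(lock_word, 1) != 0)
        _mm_pause();                      /* tell the CPU we're spinning */
}

static void spin_unlock(volatile long *lock_word)
{
    _InterlockedExchange(lock_word, 0);   /* release the lock */
}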
lock xchg eax, dword ptr [edx]
Note that the lock prefix locks memory for the memory fetch across all cores; this can take 100 cycles on some multi-core chips, and a cache line will also need to be flushed. It will also stall the pipeline. So I wouldn't worry about the rest.
So optimal performance gets back to tuning your algorithm's critical regions.
Note that on a single core you can optimize this by removing the lock, but it is needed for multi-core.
