(Un-)Deterministic CPU behavior and reasoning about (physical) execution duration

(Un-)Deterministic CPU behavior and reasoning about (physical) execution duration - caching

In the past I have dealt with time-critical software development. The development of these applications basically proceeded as follows: "let's write the code, test latency and jitter, and optimize both until they are in the acceptable range." I find that highly frustrating; it isn't what I call
proper engineering and I want to do better.
So I looked into the question: why do we have jitter at all? And the answers, of course, are:
caching: getting a piece of code or data from main memory takes some 2 orders of magnitude more time than getting the same data from L1 cache. So
the physical execution time depends on what is in the cache. Which, in turn, depends on several factors:
code and data layout of the application: we all know the horrifying row- vs. column-major matrix traversal example
caching strategy of the CPU, including speculative pre-fetching of cache lines
other processes on the same core doing stuff
branch prediction: the CPU tries to guess which branch of a conditional jump will be executed. Even if the same conditional jump is executed twice,
the prediction may differ and hence a "pipeline bubble" might form one time, but not the other.
Interrupts: asynchronous behavior obviously leads to jitter
Frequency scaling: thankfully disabled in real-time systems
That's a lot of things that can interfere with the behavior of a piece of code. Nonetheless: if I have two instructions, located on the same cache
line, not depending on any data and not containing (conditional) jumps. Then the jitter from caching and branch prediction should be eliminated, and
only interrupts should play a role. Right? Well, I wrote a small program getting the Time Stamp Counter (tsc) twice, and writing the difference to stdout. I executed it on a rt-patched linux kernel with frequency scaling disabled.
The code has glibc-based init and cleanup, and calls printf which I presume is sometimes in cache and sometimes isn't. But between the calls to "rdtsc" (writing the tsc into edx:eax), everything should be deterministic over each execution of the binary. Just to be sure, I disassembled the elf file, here the part with the two rdtsc calls:
00000000000006b0 <main>:
6b0: 0f 31 rdtsc
6b2: 48 c1 e2 20 shl $0x20,%rdx
6b6: 48 09 d0 or %rdx,%rax
6b9: 48 89 c6 mov %rax,%rsi
6bc: 0f 31 rdtsc
6be: 48 c1 e2 20 shl $0x20,%rdx
6c2: 48 09 d0 or %rdx,%rax
6c5: 48 29 c6 sub %rax,%rsi
6c8: e8 01 00 00 00 callq 6ce <print_rsi>
6cd: c3 retq
No conditional jumps, located on the same cache line (although I am not 100% sure about that - where exactly does the elf loader put the instructions? Do 64 byte boundaries here map to 64 byte boundaries in memory?)... where is the jitter coming from? If I execute that code 1000 times (via zsh, re-starting the program each time), I get values from 12 to 46 with several values in between. As frequency scaling is disabled in my kernel, that leaves interrupts. Now I am willing to believe that, out of 1000 executions, one is interrupted. I am not prepared to believe that 90% are interrupted (we are talking about ns-intervals here! Where should the interrupts come from?!).
So my questions are these:
why is the code not deterministic, i.e., why don't I get the same number with every run?
is it possible to reason about the running time, at least of this very simple piece of code, at all? Is there at least a bound on the running time that I can guarantee (using engineering principles, not measurement combined with hope)?
if there isn't, what exactly is the source of the non-deterministic behavior? What component of the CPU (or the rest of the computer?) is rolling the dice here?

Once you remove the external sources of jitter, CPUs are still not entirely deterministic - at least based on the factors you can control.
More to the point, you seem to be operating under a model where each instruction executes serially, taking a certain about of time. Of course, modern out-of-order CPUs are generally executing more than one instruction at once, and in general may reorder the instruction stream such that instructions are executing 200+ or more instructions in front of the oldest unexecuted instruction.
In that model, it is hard to say exactly where an instruction begins or ends (it is when it is decoded, executed, retired, or something else) and it is certainly tough for "timing" instructions to have a reasonable cycle-accurate interpretation while participating in this highly parallel pipeline.
Since the rdstc doesn't serialize the pipeline, the timings you get may be quite random, even if the process is totally deterministic - it will depend entirely on the other instructions in the pipeline and so on. The second call to rdtsc is never going to have the same pipeline state as the first, and the initial pipeline state is going to be different as well.
The usual solution here is to issue a cpuid instruction prior to issuing a rdstc, but some refinements have been discussed.
If you want a good model of how a piece CPU bound code operates1, you can get most of the way there by reading the first three guides on Agner Fog's optimization page (skip the C++ if you are only interesting in assembly level), as well as What every programmer should know about memory. There is a PDF version of the latter that might be easier to read.
That will allow to take a piece of code and model how it will perform, without every running it. I've done it and sometimes received cycle-accurate results for my effort. In other cases, the results are slower than the model would predict and you have to dig around to understand what other bottleneck you are hitting - and occasionally you will discover something entirely undocumented about an architecture!
If you just want cycle-accurate (or nearly so) timings for short segments of code, I recommend libpfc which on x86 gives you userland access to the performance counters and claims cycle accurate results under the right conditions (basically you have the pin the process to a CPU and prevent context switches, which it seems you are likely already doing). The perf counters can give you better results than rdstc.
Finally, note that rdtsc is measuring wall clock time, which is fundamentally different than CPU cycles on nearly all modern cores with DVFS. As the CPU slows down, your apparent measured cost will increase and vice-versa. This also adds some slowdown to the instruction itself which has to go out and read a counter tied to a clock domain different than the CPU clock.
1 That is, one which is bound my computation, memory access and so on - and not by IO, user input, external devices etc.

The loader has placed the instructions in the addresses that you see on the left. I do not know whether the cache works on physical or logical addresses, but that's irrelevant, because the granularity of the mapping between physical and logical addresses is quite coarse, (at least 4k if I am not mistaken,) and in any case it is certain to be a multiple of the size of a cache line. So, you probably have one cache line boundary at address 680, and then the next one is at address 6C0, so you are most probably fine with respect to cache lines.
If your code had been pre-empted by an interrupt then one of your readings would have probably been off by hundreds, possibly thousands of cycles, not by tens of cycles as you witnessed. So that's not it, either.
Besides the factors that you have identified, there are many more that can influence the readings:
DMA access being done on behalf of another thread
The state of the CPU pipeline
CPU register allocation
CPU register allocation is of specific interest, because it gives an idea of how complicated modern CPUs are, and therefore how difficult it is to predict how much time is going to be taken by any given instruction. The registers that you are using are not real registers; they are "virtual" in a way. The CPU contains an internal bank of general purpose registers, and it assigns some of them to your thread, mapping them to whatever you want to think of as "rax" or "rdx". The complexity of this is mind-boggling.
At the end of the day, what you are discovering is that CPU timing is (not really, but) practically non-deterministic in x86-x64-based modern desktop systems. That's to be expected.
Luckily these systems are so fast that it hardly ever matters, and when it does matter, we do not use a desktop system, we use an embedded system.
And for those who have an academic need for predictable instruction execution time, there are emulators, which sum the number of clock cycles taken, according to the book, by each emulated instruction. Those are absolutely deterministic.

Just to explain in simple words: RDTSC cannot be used reliable to measure time between two instructions. It can be used to measure much longer periods of time (for example, time taken by a subroutine that calculats a checksum of a memory buffer).
On older processors, the time-stamp counter increments with every internal processor clock cycle, but on newer ones, since Core, the time-stamp counter increments at a constant rate regardless of the internal clock cycles.
For longer periods of time, the constant rate of the counter increase matches the internal clock cycles (if the processor does not change frequency), but for small periods of time, that just happens between two instructions, there may be dissonance between the constant rate of counter increase and the processor clock cycles.
The second reason why RDTSC cannot be used to measure time between two instructions is out-of-order execution and instructino pipelining. The CPU mixes the order of instructions that don't depend on each other, and splits instructions to micro-ops to further execute these micro-ops, so you may never know when the RDTSC itself will get executed.

Related

Optimizing double->long conversion with memory operand using perf?

I have an x86-64 Linux program which I am attempting to optimize via perf. The perf report shows the hottest instructions are scalar conversions from double to long with a memory argument, for example:
cvttsd2si (%rax, %rdi, 8), %rcx
which corresponds to C code like:
extern double *array;
long val = (long)array[idx];
(This is an unusual bottleneck but the code itself is very unusual.)
To inform optimizations I want to know if these instructions are hot because of the load from memory, or because of the arithmetic conversion itself. What's the best way to answer this question? What other data should I collect and how should I proceed to optimize this?
Some things I have looked at already. CPU counter results show 1.5% cache misses per instruction:
30686845493287      cache-references
     2140314044614      cache-misses           #    6.975 % of all cache refs
          52970546      page-faults
  1244774326560850      instructions
   194784478522820      branches
     2573746947618      branch-misses          #    1.32% of all branches
          52970546      faults
Top-down performance monitors show we are primarily backend-bound:
frontend bound retiring bad speculation backend bound
10.1% 25.9% 4.5% 59.5%
Ad-hoc measurement with top shows all CPUs pegged at 100% suggesting we are not waiting on memory.
A final note of interest: when run on AWS EC2, the code is dramatically slower (44%) on AMD vs Intel with the same core count. (Tested on Ice Lake 8375C vs EPYC 7R13). What could explain this discrepancy?
Thank you for any help.

To inform optimizations I want to know if these instructions are hot because of the load from memory, or because of the arithmetic conversion itself. What's the best way to answer this question?
I think there is two main reason for this instruction to be slow. 1. There is a dependency chain and the latency of this instruction is a problem since the processor is waiting on it to execute other instructions. 2. There is a cache miss (saturating the memory with such instruction is improbable unless many cores are doing memory-based operations).
First of all, tracking what is going on on a specific instruction is hard (especially if the instruction is not executed a lot of time). You need to use precise events to track the root of the problem, that is, events for which the exact instruction addresses that caused the event are available. Only a (small) subset of all events are precise one.
Regarding (1), the latency of the instruction should be about 12 cycles on both architecture (although it might be slightly more on the AMD processor, I do not expect a 44% difference). The target processor are able to execute multiple instruction at the same time in a given cycle. Instructions are executed on different port and are also pipelined. The port usage matters to understand what is going on. This means all the instruction in the hot loop matters. You cannot isolate this specific instruction. Modern processors are insanely complex so a basic analysis can be tricky. On Ice Lake processors, you can measure the average port usage with events like UOPS_DISPATCHED.PORT_XXX where XXX can be 0, 1, 2_3, 4_9, 5, 6, 7_8. Only the first three matters for this instruction. The EXE_ACTIVITY.XXX events may also be useful. You should check if a port is saturated and which one. AFAIK, none of these events are precise so you can only analyse a block of code (typically the hot loop). On Zen 3, the ports are FP23 and FP45. IDK what are the useful events on this architecture (I am not very familiar with it).
On Ice Lake, you can check the FRONTEND_RETIRED.LATENCY_GE_XXX events where XXX is a power of two integer (which should be precise one so you can see if this instruction is the one impacting the events). This help you to see whether the front-end or the back-end is the limiting factor.
Regarding (2), you can check the latency of the memory accesses as well as the number of L1/L2/L3 cache hits/misses. On Ice Lake, you can use events like MEM_LOAD_RETIRED.XXX where XXX can be for example L1_MISS L1_HIT, L2_MISS, L2_HIT, L3_MISS and L3_HIT. Still on Ice Lake, t may be useful to track the latency of the memory operation with MEM_TRANS_RETIRED.LOAD_LATENCY_GT_XXX where XXX is again a power of two integer.
You can also use LLVM-MCA to simulate the scheduling of the loop instruction statically on the target architecture (do not consider branches). This is very useful to understand deeply what the scheduler can do pretty easily.
What could explain this discrepancy?
The latency and reciprocal throughput should be about the same on the two platform or at least close. That being said, for the same core count, the two certainly do not operate at the same frequency. If this is not coming from that, then I doubt this instruction is actually the problem alone (tricky scheduling issues, wrong/inaccurate profiling results, etc.).
CPU counter results show 1.5% cache misses per instruction
The thing is the cache-misses event is certainly not very informative here. Indeed, it references the last-level cache (L3) misses. Thus, it does not give any information about the L1/L2 misses (previous events do).
how should I proceed to optimize this?
If the code is latency bound, the solution is to first break any dependency chain in this loop. Unrolling the loop dans rewriting it so to make it more SIMD-friendly can help a lot to improve performance (the reciprocal throughput of this instruction is about 1 cycle as opposed to 12 for the latency so there is a room for improvements in this case).
If the code is memory bound, they you should care about data locality. Data should fit in the L1 cache if possible. There are many tips to do so but it is hard to guide you without more context. This includes for example sorting data, reordering loop iterations, using smaller data types.
There are many possible source of weird unusual unexpected behaviours that can occurs. If such a thing happens, then it is nearly impossible to understand what is going on without the exact code executed. All details matter in this case.

Is it preferable to use the total time taken for a canonical workload as a benchmark or count the cycles/time taken by the individual operations?

I'm designing a benchmark for a critical system operation. Ideally the benchmark can be used to detect performance regressions. I'm debating between using the total time for a large workload passed into the operation or counting the cycles taken by the operation as the measurement criterion for the benchmark.
The time to run each iteration of the operation in question is fast perhaps 300-500 nanoseconds.

A total time is much easier to measure accurately / reliably, and the measurement overhead is irrelevant. It's what I'd recommend, as long as you're sure you can stop your compiler from optimizing across iterations of whatever you're measuring. (Check the generated asm if necessary).
If you think your runtime might be data-dependent and want to look into variation across iterations, then you might consider recording timestamps somehow. But 300 ns is only ~1k clock cycles on a 3.3GHz CPU, and recording a timestamp takes some time. So you definitely need to worry about measurement overhead.
Assuming you're on x86, raw rdtsc around each operation is pretty lightweight, but out-of-order execution can reorder the timestamps with the work. Get CPU cycle count?, and clflush to invalidate cache line via C function.
An lfence; rdtsc; lfence to stop the timing from reordering with each iteration of the workload will block out-of-order execution of the steps of the workload, distorting things. (The out-of-order execution window on Skylake is a ROB size of 224 uops. At 4 per clock that's a small fraction of 1k clock cycles, but in lower-throughput code with stalls for cache misses there could be significant overlap between independent iterations.)
Any standard timing functions like C++ std::chrono will normally call library functions that ultimately use rdtsc, but with many extra instructions. Or worse, will make an actual system call taking well over a hundred clock cycles to enter/leave the kernel, and more with Meltdown+Spectre mitigation enabled.
However, one thing that might work is using Intel-PT (https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing) to record timestamps on taken branches. Without blocking out-of-order exec at all, you can still get timestamps on when the loop branch in your repeat loop executed. This may well be independent of your workload and able to run soon after its issued into the out-of-order part of the core, but that can only happen a limited distance ahead of the oldest not-yet-retired instruction.

Reliability of Xcode Instrument's disassembly time profiling

I've profiled my code using Instrument's time profiler, and zooming in to the disassembly, here's a snippet of its results:
I wouldn't expect a mov instruction to take 23.3% of the time while a div instruction to take virtually nothing.
This causes me to believe these results are unreliable.
Is this true and known? Or am I just experiencing an Instruments bug? Or is there some option I need to use to obtain reliable results?
Is there any reference expanding on this issue?

First of all, it's possible that some counts that really belong to divss are being charged to later instructions, which is called a "skid". (Also see the rest of that comment thread for some more details.) Presumably Xcode is like Linux perf, and uses the fixed cpu_clk_unhalted.thread counter for cycles instead of one of the programmable counters. This is not a "precise" event (PEBS), so skids are possible. As #BeeOnRope points out, you can use a PEBS event that ticks once per cycle (like UOPS_RETIRED < 16) as a PEBS substitute for the fixed cycles counter, removing some of the dependence on interrupt behaviour.
But the way counters fundamentally work for pipelined / out-of-order execution also explains most of what you're seeing. Or it might; you didn't show the complete loop so we can't simulate the code on a simple pipeline model like IACA does, or by hand using hardware guides like http://agner.org/optimize/ and Intel's optimization manual. (And you haven't even specified what microarchitecture you have. I guess it's some member of Intel Sandybridge-family on a Mac).
Counts for cycles are typically charged to the instruction that's waiting for the result, not usually the instruction that's slow to produce the result. Pipelined CPUs don't stall until you try to read a result that isn't ready yet.
Out-of-order execution massively complicates this, but it's still generally true when there's one really slow instruction, like a load that often misses in cache. When the cycles counter overflows (triggering an interrupt), there are many instruction in flight, but only one can be the RIP associated with that performance-counter event. It's also the RIP where execution will resume after the interrupt.
So what happens when an interrupt is raised? See Andy Glew's answer about that, which explains the internals of perf-counter interrupts in the Intel P6 microarchitecture's pipeline, and why (before PEBS) they were always delayed. Sandybridge-family is similar to P6 for this.
I think a reasonable mental model for perf-counter interrupts on Intel CPUs is that it discards any uops that haven't yet been dispatched to an execution unit. But ALU uops that have been dispatched already go through the pipeline to retirement (if there aren't any younger uops that got discarded) instead of being aborted, which makes sense because the maximum extra latency is ~16 cycles for sqrtpd, and flushing the store queue can easily take longer than that. (Pending stores that have already retired can't be rolled back). IDK about loads/stores that haven't retired; at least the loads are probably discarded.
I'm basing this guess on the fact that it's easy to construct loops that don't show any counts for divss when the CPU is sometimes waiting for it to produce its outputs. If it was discarded without retiring, it would be the next instruction when resuming the interrupt, so (other than skids) you'd see lots of counts for it.
Thus, the distribution of cycles counts shows you which instructions spend the most time being the oldest not-yet-dispatched instruction in the scheduler. (Or in case of front-end stalls, which instructions the CPU is stalled trying to fetch / decode / issue). Remember, this usually means it shows you the instructions that are waiting for inputs, not the instructions that are slow to produce them.
(Hmm, this might not be right, and I haven't tested this much. I usually use perf stat to look at overall counts for a whole loop in a microbenchmark, not statistical profiles with perf record. addss and mulss are higher latency than andps, so you'd expect andps to get counts waiting for its xmm5 input if my proposed model was right.)
Anyway, the general problem is, with multiple instructions in flight at once, which one does the HW "blame" when the cycles counter wraps around?
Note that divss is slow to produce the result, but is only a single-uop instruction (unlike integer div which is microcoded on AMD and Intel). If you don't bottleneck on its latency or its not-fully-pipelined throughput, it's not slower than mulss because it can overlap with surrounding code just as well.
(divss / divps is not fully pipelined. On Haswell for example, an independent divps can start every 7 cycles. But each only takes 10-13 cycles to produce its result. All other execution units are fully pipelined; able to start a new operation on independent data every cycle.)
Consider a large loop that bottlenecks on throughput, not latency of any loop-carried dependency, and only needs divss to run once per 20 FP instructions. Using divss by a constant instead of mulss with the reciprocal constant should make (nearly) no difference in performance. (In practice out-of-order scheduling isn't perfect, and longer dependency chains hurt some even when not loop-carried, because they require more instructions to be in flight to hide all that latency and sustain max throughput. i.e. for the out-of-order core to find the instruction-level parallelism.)
Anyway, the point here is that divss is a single uop and it makes sense for it not to get many counts for the cycles event, depending on the surrounding code.
You see the same effect with a cache-miss load: the load itself mostly only gets counts if it has to wait for the registers in the addressing mode, and the first instruction in the dependency chain that uses the loaded data gets a lot of counts.
What your profile result might be telling us:
The divss isn't having to wait for its inputs to be ready. (The movaps %xmm3, %xmm5 before the divss sometimes takes some cycles, but the divss never does.)
We may come close to bottlenecking on the throughput of divss
The dependency chain involving xmm5 after divss is getting some counts. Out-of-order execution has to work to keep multiple independent iterations of that in flight at once.
The maxss / movaps loop-carried dependency chain may be a significant bottleneck. (Especially if you're on Skylake where divss throughput is one per 3 clocks, but maxss latency is 4 cycles. And resource conflicts from competition for ports 0 and 1 will delay maxss.)
The high counts for movaps might be due to it following maxss, forming the only loop-carried dependency in the part of the loop you show. So it's plausible that maxss really is slow to produce results. But if it really was a loop-carried dep chain that was the major bottleneck, you'd expect to see lots of counts on maxss itself, as it would be waiting for its input from the last iteration.
But maybe mov-elimination is "special", and all the counts for some reason get charged to movaps? On Ivybridge and later CPUs, register copies doesn't need an execution unit, but instead are handled in the issue/rename stage of the pipeline.

Is this true and known?
Yes, it is a known problem with profiling tools on Intel x86. I've observed it (time spent suspiciously assigned to seemingly innocent instructions) both with Linux perf_events and Intel VTune. It has also been reported elsewhere by other people.
A better and more honest visualization of collected results would have summed up all samples inside every basic block, and demonstrated the resulting value associated with a basic block, not its individual instructions. Not 100% fool-proof but a bit better and honest,
Or is there some option I need to use to obtain reliable results?
I do not know if newer profiling hardware, namely tools based on Intel Processor Trace (available starting from Broadwell, but improved in Skylake) instead of older PEBS, would give more accurate data. I guess one needs to experiment with such tools first.

Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?

LOOP (Intel ref manual entry)
decrements ecx / rcx, and then jumps if non-zero. It's slow, but couldn't Intel have cheaply made it fast? dec/jnz already macro-fuses into a single uop on Sandybridge-family; the only difference being that that sets flags.
loop on various microarchitectures, from Agner Fog's instruction tables:
K8/K10: 7 m-ops
Bulldozer-family/Ryzen: 1 m-op (same cost as macro-fused test-and-branch, or jecxz)
P4: 4 uops (same as jecxz)
P6 (PII/PIII): 8 uops
Pentium M, Core2: 11 uops
Nehalem: 6 uops. (11 for loope / loopne). Throughput = 4c (loop) or 7c (loope/ne).
SnB-family: 7 uops. (11 for loope / loopne). Throughput = one per 5 cycles, as much of a bottleneck as keeping your loop counter in memory! jecxz is only 2 uops with same throughput as regular jcc
Silvermont: 7 uops
AMD Jaguar (low-power): 8 uops, 5c throughput
Via Nano3000: 2 uops
Couldn't the decoders just decode the same as lea rcx, [rcx-1] / jrcxz? That would be 3 uops. At least that would be the case with no address-size prefix, otherwise it has to use ecx and truncate RIP to EIP if the jump is taken; maybe the odd choice of address-size controlling the width of the decrement explains the many uops? (Fun fact: rep-string instructions have the same behaviour with using ecx with 32-bit address-size.)
Or better, just decode it as a fused dec-and-branch that doesn't set flags? dec ecx / jnz on SnB decodes to a single uop (which does set flags).
I know that real code doesn't use it (because it's been slow since at least P5 or something), but AMD decided it was worth it to make it fast for Bulldozer. Probably because it was easy.
Would it be easy for SnB-family uarch to have fast loop? If so, why don't they? If not, why is it hard? A lot of decoder transistors? Or extra bits in a fused dec&branch uop to record that it doesn't set flags? What could those 7 uops be doing? It's a really simple instruction.
What's special about Bulldozer that made a fast loop easy / worth it? Or did AMD waste a bunch of transistors on making loop fast? If so, presumably someone thought it was a good idea.
If loop was fast, it would be perfect for BigInteger arbitrary-precision adc loops, to avoid partial-flag stalls / slowdowns (see my comments on my answer), or any other case where you want to loop without touching flags. It also has a minor code-size advantage over dec/jnz. (And dec/jnz only macro-fuses on SnB-family).
On modern CPUs where dec/jnz is ok in an ADC loop, loop would still be nice for ADCX / ADOX loops (to preserve OF).
If loop had been fast, compilers would already be using it as a peephole optimization for code-size + speed on CPUs without macro-fusion.
It wouldn't stop me from getting annoyed at all the questions with bad 16bit code that uses loop for every loop, even when they also need another counter inside the loop. But at least it wouldn't be as bad.

In 1988, IBM fellow Glenn Henry had just come on board at Dell, which had a few hundred employees at the time, and in his first month he gave a tech talk about 386 internals. A bunch of us BIOS programmers had been wondering why LOOP was slower than DEC/JNZ so during the question/answer section somebody posed the question.
His answer made sense. It had to do with paging.
LOOP consists of two parts: decrementing CX, then jumping if CX is not zero. The first part cannot cause a processor exception, whereas the jump part can. For one, you could jump (or fall through) to an address outside segment boundaries, causing a SEGFAULT. For two, you could jump to a page that is swapped out.
A SEGFAULT usually spells the end for a process, but page faults are different. When a page fault occurs, the processor throws an exception, and the OS does the housekeeping to swap in the page from disk into RAM. After that, it restarts the instruction that caused the fault.
Restarting means restoring the state of the process to what it was just before the offending instruction. In the case of the LOOP instruction in particular, it meant restoring the value of the CX register. One might think you could just add 1 to CX, since we know CX got decremented, but apparently, it's not that simple. For example, check out this erratum from Intel:
The protection violations involved usually indicate a probable
software bug and restart is not desired if one of these violations
occurs. In a Protected Mode 80286 system with wait states during any
bus cycles, when certain protection violations are detected by the
80286 component, and the component transfers control to the exception
handling routine, the contents of the CX register may be unreliable.
(Whether CX contents are changed is a function of bus activity at the
time internal microcode detects the protection violation.)
To be safe, they needed to save the value of CX on every iteration of a LOOP instruction, in order to reliably restore it if needed.
It's this extra burden of saving CX that made LOOP so slow.
Intel, like everyone else at the time, was getting more and more RISC. The old CISC instructions (LOOP, ENTER, LEAVE, BOUND) were being phased out. We still used them in hand-coded assembly, but compilers ignored them completely.

Now that I googled after writing my question, it turns out to be an exact duplicate of one on comp.arch, which came up right away. I expected it to be hard to google (lots of "why is my loop slow" hits), but my first try (why is the x86 loop instruction slow) got results.
This is not a good or complete answer.
It might be the best we'll get, and will have to suffice unless someone can shed some more light on it. I didn't set out to write this as an answer-my-own-question post.
Good posts with different theories in that thread:
Robert
LOOP became slow on some of the earliest machines (circa 486) when
significant pipelining started to happen, and running any but the
simplest instruction down the pipeline efficiently was technologically
impractical. So LOOP was slow for a number of generations. So nobody
used it. So when it became possible to speed it up, there was no real
incentive to do so, since nobody was actually using it.
Anton Ertl:
IIRC LOOP was used in some software for timing loops; there was
(important) software that did not work on CPUs where LOOP was too fast
(this was in the early 90s or so). So CPU makers learned to make LOOP
slow.
(Paul, and anyone else: You're welcome to re-post your own writing as your own answer. I'll remove it from my answer and up-vote yours.)
#Paul A. Clayton (occasional SO poster and CPU architecture guy) took a guess at how you could use that many uops. (This looks like loope/ne which checks both the counter and ZF):
I could imagine a possibly sensible 6-µop version:
virtual_cc = cc;
temp = test (cc);
rCX = rCX - temp; // also setting cc
cc = temp & cc; // assumes branch handling is not
// substantially changed for the sake of LOOP
branch
cc = virtual_cc
(Note that this is 6 uops, not SnB's 11 for LOOPE/LOOPNE, and is a total guess not even trying to take into account anything known from SnB perf counters.)
Then Paul said:
I agree that a shorter sequence should be possible, but I was trying
to think of a bloated sequence that might make sense if minimal
microarchitectural adjustments were permitted.
summary: The designers wanted loop to be supported only via microcode, with no adjustments whatsoever to the hardware proper.
If a useless, compatibility-only instruction is handed to the
microcode developers, they might reasonably not be able or willing to
suggest minor changes to the internal microarchitecture to improve
such an instruction. Not only would they rather use their "change
suggestion capital" more productively but the suggestion of a change
for a useless case would reduce the credibility of other suggestions.
(My opinion: Intel is probably still making it slow on purpose, and hasn't bothered to rewrite their microcode for it for a long time. Modern CPUs are probably too fast for anything using loop in a naive way to work correctly.)
... Paul continues:
The architects behind Nano may have found avoiding the special casing
of LOOP simplified their design in terms of area or power. Or they
may have had incentives from embedded users to provide a fast
implementation (for code density benefits). Those are just WILD
guesses.
If optimization of LOOP fell out of other optimizations (like fusion
of compare and branch), it might be easier to tweak LOOP into a fast
path instruction than to handle it in microcode even if the
performance of LOOP was unimportant.
I suspect that such decisions are based on specific details of the
implementation. Information about such details does not seem to be
generally available and interpreting such information would be
beyond the skill level of most people. (I am not a hardware
designer--and have never played one on television or stayed at a
Holiday Inn Express. :-)
The thread then went off-topic into the realm of AMD blowing our one chance to clean up the cruft in x86 instruction encoding. It's hard to blame them, since every change is a case where the decoders can't share transistors. And before Intel adopted x86-64, it wasn't even clear that it would catch on. AMD didn't want to burden their CPUs with hardware nobody used if AMD64 didn't catch on.
But still, there are so many small things: setcc could have changed to 32bits. (Usually you have to use xor-zero / test / setcc to avoid false dependencies, or because you need a zero-extended reg). Shift could have unconditionally written flags, even with zero shift count (removing the input data dependency on eflags for variable-count shift for OOO execution). Last time I typed this list of pet peeves, I think there was a third one... Oh yeah, bt / bts etc. with memory operands has the address dependent on the upper bits of the index (bit string, not just bit within a machine word).
bts instructions are very useful for bit-field stuff, and are slower than they need to be so you almost always want to load into a register and then use that. (It's usually faster to shift/mask to get an address yourself, instead of using 10 uop bts [mem], reg on Skylake, but it does take extra instructions. So it made sense on 386, but not on K8). Atomic bit-manipulation has to use the memory-dest form, but the locked version needs lots of uops anyway. It's still slower than if it couldn't access outside the dword it's operating on.

Please see the nice article by Abrash, Michael, published in Dr. Dobb's Journal March 1991 v16 n3 p16(8): http://archive.gamedev.net/archive/reference/articles/article369.html
The summary of the article is the following:
Optimizing code for 8088, 80286, 80386 and 80486 microprocessors is
difficult because the chips use significantly different memory
architectures and instruction execution times. Code cannot be
optimized for the 80x86 family; rather, code must be designed to
produce good performance on a range of systems or optimized for
particular combinations of processors and memory. Programmers must
avoid the unusual instructions supported by the 8088, which have lost
their performance edge in subsequent chips. String instructions
should be used but not relied upon. Registers should be used rather
than memory operations. Branching is also slow for all four
processors. Memory accesses should be aligned to improve
performance. Generally, optimizing an 80486 requires exactly the
opposite steps as optimizing an 8088.
By "unusual instructions supported by the 8088" the author also means "loop":
Any 8088 programmer would instinctively replace: DEC CX JNZ LOOPTOP
with: LOOP LOOPTOP because LOOP is significantly faster on the 8088.
LOOP is also faster on the 286. On the 386, however, LOOP is actually
two cycles slower than DEC/JNZ. The pendulum swings still further on
the 486, where LOOP is about twice as slow as DEC/JNZ--and, mind you,
we're talking about what was originally perhaps the most obvious
optimization in the entire 80x86 instruction set.
This is a very good article, and I highly recommend it. Even though it was published in 1991, it is surprisingly highly relevant today.
But this article just gives advices, it encourages to test execution speed and choose faster variants. It doesn’t explain WHY some commands become very slow, so it doesn’t fully address your question.
The answer is that earlier processors, like 80386 (released in 1985) and before, executed instructions one-by-one, sequentially.
Later processors have started to use instruction pipelining – initially, simple, for 804086, and, finally, Pentium Pro (released in 1995) introduced radically different internal pipeline, calling it the Out Of Order (OOO) core where instructions were transformed to small fragments of operations called micro-ops or µops, and then all micro-ops of different instructions were put to a large pool of micro-ops where they were supposed to execute simultaneously as long as they do not depend on one another. This OOO pipeline principle is still used, almost unchanged, on modern processors. You can find more information about instruction pipelining in this brilliant article: https://www.gamedev.net/resources/_/technical/general-programming/a-journey-through-the-cpu-pipeline-r3115
In order to simplify chip design, Intel decided to build processors in such a way that one instructions did transform to micro-ops in a very efficient way, while others are not.
Efficient conversion from instructions to micro-ops requires more transistors, so Intel have decided to save on transistors at a cost of slower decoding and execution of some “complex” or “rarely-used” instructions.
For example, the “Intel® Architecture Optimization Reference Manual” http://download.intel.com/design/PentiumII/manuals/24512701.pdf mentions the following: “Avoid using complex instructions (for example, enter, leave, or loop) that generally have more than four µops and require multiple cycles to decode. Use sequences of simple instructions instead.”
So, Intel somehow have decided that the “loop” instruction is “complex”, and, since then, it became very slow. However, there is no official Intel reference on instruction breakdown: how many micro-ops each instruction produces, and how many cycles are required to decode it.
You can also read about The Out-of-Order Execution Engine
in the "Intel® 64 and IA-32 Architectures Optimization Reference Manual"
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf section the 2.1.2.

How many CPU cycles are needed for each assembly instruction?

I heard there is Intel book online which describes the CPU cycles needed for a specific assembly instruction, but I can not find it out (after trying hard). Could anyone show me how to find CPU cycle please?
Here is an example, in the below code, mov/lock is 1 CPU cycle, and xchg is 3 CPU cycles.
// This part is Platform dependent!
#ifdef WIN32
inline int CPP_SpinLock::TestAndSet(int* pTargetAddress,
int nValue)
{
__asm
{
mov edx, dword ptr [pTargetAddress]
mov eax, nValue
lock xchg eax, dword ptr [edx]
}
// mov = 1 CPU cycle
// lock = 1 CPU cycle
// xchg = 3 CPU cycles
}
#endif // WIN32
BTW: here is the URL for the code I posted: http://www.codeproject.com/KB/threads/spinlocks.aspx

Modern CPUs are complex beasts, using pipelining, superscalar execution, and out-of-order execution among other techniques which make performance analysis difficult... but not impossible!
While you can no longer simply add together the latencies of a stream of instructions to get the total runtime, you can still get a (often) highly accurate analysis of the behavior of some piece of code (especially a loop) as described below and in other linked resources.
Instruction Timings
First, you need the actual timings. These vary by CPU architecture, but the best resource currently for x86 timings is Agner Fog's instruction tables. Covering no less than thirty different microarchitecures, these tables list the instruction latency, which is the minimum/typical time that an instruction takes from inputs ready to output available. In Agner's words:
Latency: This is the delay that the instruction generates in a
dependency chain. The numbers are minimum values. Cache misses,
misalignment, and exceptions may increase the clock counts
considerably. Where hyperthreading is enabled, the use of the same
execution units in the other thread leads to inferior performance.
Denormal numbers, NAN's and infinity do not increase the latency. The
time unit used is core clock cycles, not the reference clock cycles
given by the time stamp counter.
So, for example, the add instruction has a latency of one cycle, so a series of dependent add instructions, as shown, will have a latency of 1 cycle per add:
add eax, eax
add eax, eax
add eax, eax
add eax, eax # total latency of 4 cycles for these 4 adds
Note that this doesn't mean that add instructions will only take 1 cycle each. For example, if the add instructions were not dependent, it is possible that on modern chips all 4 add instructions can execute independently in the same cycle:
add eax, eax
add ebx, ebx
add ecx, ecx
add edx, edx # these 4 instructions might all execute, in parallel in a single cycle
Agner provides a metric which captures some of this potential parallelism, called reciprocal throughput:
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind
in the same thread.
For add this is listed as 0.25 meaning that up to 4 add instructions can execute every cycle (giving a reciprocal throughput of 1 / 4 = 0.25).
The reciprocal throughput number also gives a hint at the pipelining capability of an instruction. For example, on most recent x86 chips, the common forms of the imul instruction have a latency of 3 cycles, and internally only one execution unit can handle them (unlike add which usually has four add-capable units). Yet the observed throughput for a long series of independent imul instructions is 1/cycle, not 1 every 3 cycles as you might expect given the latency of 3. The reason is that the imul unit is pipelined: it can start a new imul every cycle, even while the previous multiplication hasn't completed.
This means a series of independent imul instructions can run at up to 1 per cycle, but a series of dependent imul instructions will run at only 1 every 3 cycles (since the next imul can't start until the result from the prior one is ready).
So with this information, you can start to see how to analyze instruction timings on modern CPUs.
Detailed Analysis
Still, the above is only scratching the surface. You now have multiple ways of looking at a series of instructions (latency or throughput) and it may not be clear which to use.
Furthermore, there are other limits not captured by the above numbers, such as the fact that certain instructions compete for the same resources within the CPU, and restrictions in other parts of the CPU pipeline (such as instruction decoding) which may result in a lower overall throughput than you'd calculate just by looking at latency and throughput. Beyond that, you have factors "beyond the ALUs" such as memory access and branch prediction: entire topics unto themselves - you can mostly model these well, but it takes work. For example here's a recent post where the answer covers in some detail most of the relevant factors.
Covering all the details would increase the size of this already long answer by a factor of 10 or more, so I'll just point you to the best resources. Agner Fog has an Optimizing Asembly guide that covers in detail the precise analysis of a loop with a dozen or so instructions. See "12.7 An example of analysis for bottlenecks in vector loops" which starts on page 95 in the current version of the PDF.
The basic idea is that you create a table, with one row per instruction and mark the execution resources each uses. This lets you see any throughput bottlenecks. In addition, you need to examine the loop for carried dependencies, to see if any of those limit the throughput (see "12.16 Analyzing dependencies" for a complex case).
If you don't want to do it by hand, Intel has released the Intel Architecture Code Analyzer, which is a tool that automates this analysis. It currently hasn't been updated beyond Skylake, but the results are still largely reasonable for Kaby Lake since the microarchitecture hasn't changed much and therefore the timings remain comparable. This answer goes into a lot of detail and provides example output, and the user's guide isn't half bad (although it is out of date with respect to the newest versions).
Other sources
Agner usually provides timings for new architectures shortly after they are released, but you can also check out instlatx64 for similarly organized timings in the InstLatX86 and InstLatX64 results. The results cover a lot of interesting old chips, and new chips usually show up fairly quickly. The results are mostly consistent with Agner's, with a few exceptions here and there. You can also find memory latency and other values on this page.
You can even get the timing results directly from Intel in their IA32 and Intel 64 optimization manual in Appendix C: INSTRUCTION LATENCY AND THROUGHPUT. Personally I prefer Agner's version because they are more complete, often arrive before the Intel manual is updated, and are easier to use as they provide a spreadsheet and PDF version.
Finally, the x86 tag wiki has a wealth of resources on x86 optimization, including links to other examples of how to do a cycle accurate analysis of code sequences.
If you want a deeper look into the type of "dataflow analysis" described above, I would recommend A Whirlwind Introduction to Data Flow Graphs.

Given pipelining, out of order processing, microcode, multi-core processors, etc there's no guarantee that a particular section of assembly code will take exactly x CPU cycles/clock cycle/whatever cycles.
If such a reference exists, it will only be able to provide broad generalizations given a particular architecture, and depending on how the microcode is implemented you may find that the Pentium M is different than the Core 2 Duo which is different than the AMD dual core, etc.
Note that this article was updated in 2000, and written earlier. Even the Pentium 4 is hard to pin down regarding instruction timing - PIII, PII, and the original pentium were easier, and the texts referenced were probably based on those earlier processors that had a more well-defined instruction timing.
These days people generally use statistical analysis for code timing estimation.

What the other answers say about it being impossible to accurately predict the performance of code running on a modern CPU is true, but that doesn't mean the latencies are unknown, or that knowing them is useless.
The exact latencies for Intels and AMD's processors are listed in Agner Fog's instruction tables. See also Intel® 64 and IA-32 Architectures Optimization Reference Manual, and Instruction latencies and throughput for AMD and Intel x86 processors (from Can Berk Güder's now-deleted link-only answer). AMD also has pdf manuals on their own website with their official values.
For (micro-)optimizing tight loops, knowing the latencies for each instruction can help a lot in manually trying to schedule your code. The programmer can make a lot of optimizations that the compiler can't (because the compiler can't guarantee it won't change the meaning of the program).
Of course, this still requires you to know a lot of other details about the CPU, such as how deeply pipelined it is, how many instructions it can issue per cycle, number of execution units and so on. And of course, these numbers vary for different CPU's. But you can often come up with a reasonable average that more or less works for all CPU's.
It's worth noting though, that it is a lot of work to optimize even a few lines of code at this level. And it is easy to make something that turns out to be a pessimization. Modern CPUs are hugely complicated, and they try extremely hard to get good performance out of bad code. But there are also cases they're unable to handle efficiently, or where you think you're clever and making efficient code, and it turns out to slow the CPU down.
Edit
Looking in Intel's optimization manual, table C-13:
The first column is instruction type, then there is a number of columns for latency for each CPUID. The CPUID indicates which processor family the numbers apply to, and are explained elsewhere in the document. The latency specifies how many cycles it takes before the result of the instruction is available, so this is the number you're looking for.
The throughput columns show how many of this type of instructions can be executed per cycle.
Looking up xchg in this table, we see that depending on the CPU family, it takes 1-3 cycles, and a mov takes 0.5-1. These are for the register-to-register forms of the instructions, not for a lock xchg with memory, which is a lot slower. And more importantly, hugely-variable latency and impact on surrounding code (much slower when there's contention with another core), so looking only at the best-case is a mistake. (I haven't looked up what each CPUID means, but I assume the .5 are for Pentium 4, which ran some components of the chip at double speed, allowing it to do things in half cycles)
I don't really see what you plan to use this information for, however, but if you know the exact CPU family the code is running on, then adding up the latency tells you the minimum number of cycles required to execute this sequence of instructions.

Measuring and counting CPU-cycles does not make sense on the x86 anymore.
First off, ask yourself for which CPU you're counting cycles? Core-2? a Athlon? Pentium-M? Atom? All these CPUs execute x86 code but all of them have different execution times. The execution even varies between different steppings of the same CPU.
The last x86 where cycle-counting made sense was the Pentium-Pro.
Also consider, that inside the CPU most instructions are transcoded into microcode and executed out of order by a internal execution unit that does not even remotely look like a x86. The performance of a single CPU instruction depends on how much resources in the internal execution unit is available.
So the time for a instruction depends not only on the instruction itself but also on the surrounding code.
Anyway: You can estimate the throughput-resource usage and latency of instructions for different processors. The relevant information can be found at the Intel and AMD sites.
Agner Fog has a very nice summary on his web-site. See the instruction tables for latency, throughput, and uop count. See the microarchictecture PDF to learn how to interpret those.
http://www.agner.org/optimize
But note that xchg-with-memory does not have predictable performance, even if you look at only one CPU model. Even in the no-contention case with the cache-line already hot in L1D cache, being a full memory barrier will mean it's impact depends a lot on loads and stores to other addresses in the surrounding code.
Btw - since your example-code is a lock-free datastructure basic building block: Have you considered using the compiler built-in functions? On win32 you can include intrin.h and use functions such as _InterlockedExchange.
That'll give you better execution time because the compiler can inline the instructions. Inline-assembler always forces the compiler to disable optimizations around the asm-code.

lock xchg eax, dword ptr [edx]
Note the lock will lock memory for the memory fetch for all cores, this can take 100 cycles on some multi cores and a cache line will also need to be flushed. It will also stall the pipeline. So i wouldnt worry about the rest.
So optimal performance gets back to tuning your algorithms critical regions.
Note on a single core you can optmize this by removing the lock but it is needed for multi core.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio