Why isn't RDTSC a serializing instruction?

The Intel manuals for the RDTSC instruction warn that out of order execution can change when RDTSC is actually executed, so they recommend inserting a CPUID instruction in front of it because CPUID will serialize the instruction stream (CPUID is never executed out of order). My question is simple: if they had the ability to make instructions serializing, why didn't they make RDTSC serializing? The entire point of it appears to be to get cycle accurate timings. Is there a situation under which you would not want to precede it with a serializing instruction?
Newer Intel CPUs have a separate RDTSCP instruction that is serializing. Intel opted to introduce a separate instruction rather than change the behavior of RDTSC, which suggests to me that there has to be some situation where a potentially out of order timing is what you want. What is it?

The time stamp counter was introduced on the Pentium microarchitecture. Out-of-order execution didn't show up until the Pentium Pro. Intel could have made rdtsc serializing (architecturally or internally), but it seems that they decided to keep it non-serializing, which is OK for general-purpose time measurements, and leave it up to the programmer to add serializing instructions if necessary. This is good for reducing the overhead of the measurement.
That's actually confirmed in the document you provide, with the following comment about Pentium and Pentium/MMX (in 4.2, slightly paraphrased):
All of the rules and code samples described in section 4.1 (Pentium Pro and Pentium II) also apply to the Pentium and Pentium/MMX. The only difference is, the CPUID instruction is not necessary for serialization.
And, from Wikipedia:
The Time Stamp Counter is a 64-bit register present on all x86 processors since the Pentium.
: : :
Starting with the Pentium Pro, Intel processors have supported out-of-order execution, where instructions are not necessarily performed in the order they appear in the executable. This can cause RDTSC to be executed later than expected, producing a misleading cycle count.
One of the two uses of RDTSCP is to give you the processor ID in addition to the time stamp information (it's right there in the name Read Time-Stamp Counter *AND* Processor ID), which is useful on systems with unsynced TSCs across cores or sockets (See: How to get the CPU cycle count in x86_64 from C++?). The additional serialization properties of rdtscp make it more convenient at the end of the region of interest (See: Is there any difference in between (rdtsc + lfence + rdtsc) and (rdtsc + rdtscp) in measuring execution time?).
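As a rough illustration of bracketing a region of interest, here is a minimal sketch, assuming an x86-64 target and a GCC, Clang, or MSVC toolchain. The helper name time_region_ticks is made up for this example, and the placement of the LFENCE instructions is one common pattern (it assumes LFENCE waits for earlier instructions to finish, as documented for Intel CPUs), not the only valid one.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>      // __rdtsc, __rdtscp, _mm_lfence on MSVC
#else
#include <x86intrin.h>   // the same intrinsics on GCC / Clang
#endif

// Hypothetical helper: returns the number of TSC ticks that elapsed while
// `work` ran. On recent CPUs these are reference ticks, not core clock cycles.
template <typename F>
std::uint64_t time_region_ticks(F&& work) {
    unsigned aux = 0;                     // receives IA32_TSC_AUX (usually a CPU id)
    _mm_lfence();                         // let earlier instructions finish first
    std::uint64_t start = __rdtsc();      // plain, non-serializing read
    _mm_lfence();                         // keep the region from starting before the read
    work();
    std::uint64_t end = __rdtscp(&aux);   // waits for earlier instructions to execute
    _mm_lfence();                         // keep later instructions from starting early
    return end - start;
}
```

The measured delta includes the overhead of the fences and the two reads, so for short regions you would normally measure an empty region as well and subtract it.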

If you are trying to use rdtsc to see if a branch mispredicts, the non-serializing version is what you want.
//math here
rdtsc
branch if zero to done
//do some work that always takes 1 cycle
done: rdtsc
If the branch is predicted correctly, the delta will be small (maybe even negative?). If the branch is mispredicted, the delta will be large.
With the serializing version, the branch condition will be resolved because the first rdtsc waits for the math to finish.

why didn't they make RDTSC serializing? The entire point of it appears to be to get cycle accurate timings
Well, most of the time it's to get high-resolution timestamps. At least some of the time, these timestamps are used for performance metrics. Making the instruction serializing would likely require a pipeline flush, which can be very expensive for CPU-bound applications.
Intel opted to introduce a separate instruction rather than change the behavior of RDTSC, which suggests to me that there has to be some situation where a potentially out of order timing is what you want.
Changing the behavior is almost always undesirable. Intel's customers would be disappointed to find out that RDTSC does something different on newer parts.

As paxdiablo explains, RDTSC predates the concept of "serializing" instructions because it was implemented on an in-order CPU. Adding that behavior later would change the memory access behavior of code using it, and thus be incompatible for some purposes.
Instead, more recent CPUs have a related RDTSCP instruction with stronger ordering, for exactly this reason: it promises to wait until all instructions issued before it have executed before it reads the counter. (Intel's manual does not class it as a fully serializing instruction, and later instructions may still begin executing before the read happens.) Use that if you are running on modern CPUs.

Related

How does perf record (or other profilers) pick which instruction to count as costing time?

Recently I found out that perf (or pprof) may, in its disassembly view, attribute time to an instruction that didn't actually take that time. The instruction that really took the time comes before it. I know a vague explanation that this happens due to instruction pipelining in the CPU. However, I would like to find out the following:
Is there a more detailed explanation of this effect?
Is it documented in perf or pprof? I haven't found any references.
Is there a way to obtain correctly placed timings?

(quick not super detailed answer; a more detailed one would be good if someone wants to write one).
perf just uses the CPU's own hardware performance counters, which can be put into a mode where they record an event when the counter counts down to zero or up to a threshold, either raising an interrupt or writing an event into a buffer in memory (with PEBS precise events). That event will include a code address that the CPU picked to associate with the event (i.e. the point at which the interrupt was raised), even for events like cycles which, unlike instructions, don't inherently have a specific instruction associated. The out-of-order exec back-end can have a couple hundred instructions in flight when the counter wraps, but has to pick exactly one for any given sample.
Generally the CPU "blames" the instruction that was waiting for a slow-to-produce result, not the one producing it, especially cache-miss loads.
For an example with Intel x86 CPUs, see Why is this jump instruction so expensive when performing pointer chasing?, which also appears to depend on the effect of letting the last instruction in the ROB retire when an interrupt is raised. (Intel CPUs at least do seem to do that; makes sense for ensuring forward progress even with a potentially slow instruction.)
In general there can be "skew" when the instruction that gets blamed is later than the one actually taking the time, possibly with different causes. (Perhaps especially for uncore events, since they happen asynchronously to the core clock.)
Other related Q&As with interesting examples or other things
Inconsistent `perf annotate` memory load/store time reporting
Linux perf reporting cache misses for unexpected instruction
https://travisdowns.github.io/blog/2019/08/20/interrupts.html - some experiments into which instructions tend to get counts on Skylake.

Reliability of Xcode Instrument's disassembly time profiling

I've profiled my code using Instrument's time profiler, and zooming in to the disassembly, here's a snippet of its results:
I wouldn't expect a mov instruction to take 23.3% of the time and a div instruction to take virtually nothing.
This causes me to believe these results are unreliable.
Is this true and known? Or am I just experiencing an Instruments bug? Or is there some option I need to use to obtain reliable results?
Is there any reference expanding on this issue?

First of all, it's possible that some counts that really belong to divss are being charged to later instructions, which is called a "skid". (Also see the rest of that comment thread for some more details.) Presumably Xcode is like Linux perf, and uses the fixed cpu_clk_unhalted.thread counter for cycles instead of one of the programmable counters. This is not a "precise" event (PEBS), so skids are possible. As @BeeOnRope points out, you can use a PEBS event that ticks once per cycle (like UOPS_RETIRED < 16) as a PEBS substitute for the fixed cycles counter, removing some of the dependence on interrupt behaviour.
But the way counters fundamentally work for pipelined / out-of-order execution also explains most of what you're seeing. Or it might; you didn't show the complete loop so we can't simulate the code on a simple pipeline model like IACA does, or by hand using hardware guides like http://agner.org/optimize/ and Intel's optimization manual. (And you haven't even specified what microarchitecture you have. I guess it's some member of Intel Sandybridge-family on a Mac).
Counts for cycles are typically charged to the instruction that's waiting for the result, not usually the instruction that's slow to produce the result. Pipelined CPUs don't stall until you try to read a result that isn't ready yet.
Out-of-order execution massively complicates this, but it's still generally true when there's one really slow instruction, like a load that often misses in cache. When the cycles counter overflows (triggering an interrupt), there are many instructions in flight, but only one can be the RIP associated with that performance-counter event. It's also the RIP where execution will resume after the interrupt.
So what happens when an interrupt is raised? See Andy Glew's answer about that, which explains the internals of perf-counter interrupts in the Intel P6 microarchitecture's pipeline, and why (before PEBS) they were always delayed. Sandybridge-family is similar to P6 for this.
I think a reasonable mental model for perf-counter interrupts on Intel CPUs is that it discards any uops that haven't yet been dispatched to an execution unit. But ALU uops that have been dispatched already go through the pipeline to retirement (if there aren't any younger uops that got discarded) instead of being aborted, which makes sense because the maximum extra latency is ~16 cycles for sqrtpd, and flushing the store queue can easily take longer than that. (Pending stores that have already retired can't be rolled back). IDK about loads/stores that haven't retired; at least the loads are probably discarded.
I'm basing this guess on the fact that it's easy to construct loops that don't show any counts for divss when the CPU is sometimes waiting for it to produce its outputs. If it was discarded without retiring, it would be the next instruction when resuming the interrupt, so (other than skids) you'd see lots of counts for it.
Thus, the distribution of cycles counts shows you which instructions spend the most time being the oldest not-yet-dispatched instruction in the scheduler. (Or in case of front-end stalls, which instructions the CPU is stalled trying to fetch / decode / issue). Remember, this usually means it shows you the instructions that are waiting for inputs, not the instructions that are slow to produce them.
(Hmm, this might not be right, and I haven't tested this much. I usually use perf stat to look at overall counts for a whole loop in a microbenchmark, not statistical profiles with perf record. addss and mulss are higher latency than andps, so you'd expect andps to get counts waiting for its xmm5 input if my proposed model was right.)
Anyway, the general problem is, with multiple instructions in flight at once, which one does the HW "blame" when the cycles counter wraps around?
Note that divss is slow to produce the result, but is only a single-uop instruction (unlike integer div which is microcoded on AMD and Intel). If you don't bottleneck on its latency or its not-fully-pipelined throughput, it's not slower than mulss because it can overlap with surrounding code just as well.
(divss / divps is not fully pipelined. On Haswell for example, an independent divps can start every 7 cycles. But each only takes 10-13 cycles to produce its result. All other execution units are fully pipelined; able to start a new operation on independent data every cycle.)
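To make the latency vs. throughput distinction concrete, here is a back-of-the-envelope sketch. It assumes a deliberately simplified model (one divider, a fixed result latency, a fixed issue interval, nothing else competing for the unit), and the function names are invented for this example.

```cpp
#include <cstdint>

// Simplified model, not a simulator: one divider unit that can accept a new
// independent divide every `interval` cycles and delivers each result
// `latency` cycles after that divide starts.

// Cycles until the last of n independent divides has produced its result.
constexpr std::uint64_t independent_div_cycles(std::uint64_t n,
                                               std::uint64_t latency,
                                               std::uint64_t interval) {
    return n == 0 ? 0 : (n - 1) * interval + latency;
}

// Cycles for a chain of n divides where each one needs the previous result.
constexpr std::uint64_t dependent_div_cycles(std::uint64_t n,
                                             std::uint64_t latency) {
    return n * latency;
}
```

With the Haswell divps figures quoted above (one start per 7 cycles, up to 13 cycles of latency), ten independent divides finish after about 9 * 7 + 13 = 76 cycles in this model, while ten dependent ones need about 130.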
Consider a large loop that bottlenecks on throughput, not latency of any loop-carried dependency, and only needs divss to run once per 20 FP instructions. Using divss by a constant instead of mulss with the reciprocal constant should make (nearly) no difference in performance. (In practice out-of-order scheduling isn't perfect, and longer dependency chains hurt some even when not loop-carried, because they require more instructions to be in flight to hide all that latency and sustain max throughput. i.e. for the out-of-order core to find the instruction-level parallelism.)
Anyway, the point here is that divss is a single uop and it makes sense for it not to get many counts for the cycles event, depending on the surrounding code.
You see the same effect with a cache-miss load: the load itself mostly only gets counts if it has to wait for the registers in the addressing mode, and the first instruction in the dependency chain that uses the loaded data gets a lot of counts.
What your profile result might be telling us:
The divss isn't having to wait for its inputs to be ready. (The movaps %xmm3, %xmm5 before the divss sometimes takes some cycles, but the divss never does.)
We may come close to bottlenecking on the throughput of divss
The dependency chain involving xmm5 after divss is getting some counts. Out-of-order execution has to work to keep multiple independent iterations of that in flight at once.
The maxss / movaps loop-carried dependency chain may be a significant bottleneck. (Especially if you're on Skylake where divss throughput is one per 3 clocks, but maxss latency is 4 cycles. And resource conflicts from competition for ports 0 and 1 will delay maxss.)
The high counts for movaps might be due to it following maxss, forming the only loop-carried dependency in the part of the loop you show. So it's plausible that maxss really is slow to produce results. But if it really was a loop-carried dep chain that was the major bottleneck, you'd expect to see lots of counts on maxss itself, as it would be waiting for its input from the last iteration.
But maybe mov-elimination is "special", and all the counts for some reason get charged to movaps? On Ivybridge and later CPUs, register copies don't need an execution unit, but instead are handled in the issue/rename stage of the pipeline.

Is this true and known?
Yes, it is a known problem with profiling tools on Intel x86. I've observed it (time spent suspiciously assigned to seemingly innocent instructions) both with Linux perf_events and Intel VTune. It has also been reported elsewhere by other people.
A better and more honest visualization of collected results would have summed up all samples inside every basic block, and shown the resulting value associated with the basic block, not with its individual instructions. Not 100% fool-proof, but a bit better and more honest.
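The per-basic-block summing suggested above could look roughly like this. It is only a sketch: the InsnSamples type, the samples_per_block function and the idea of passing in a sorted list of block start addresses are all assumptions made for illustration, and no real profiler's data format is implied.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// One row of a per-instruction profile: an address and its sample count.
struct InsnSamples {
    std::uint64_t address;
    std::uint64_t samples;
};

// Sums samples per basic block. `block_starts` must be sorted ascending and
// contain the first address of every basic block (hypothetical input; a real
// tool would derive it from branch targets in the disassembly).
std::map<std::uint64_t, std::uint64_t> samples_per_block(
        const std::vector<InsnSamples>& insns,
        const std::vector<std::uint64_t>& block_starts) {
    std::map<std::uint64_t, std::uint64_t> totals;
    for (const InsnSamples& insn : insns) {
        // Find the first block start strictly greater than this address...
        auto it = std::upper_bound(block_starts.begin(), block_starts.end(),
                                   insn.address);
        if (it == block_starts.begin()) continue;  // address precedes every block
        // ...so the entry just before it is the block containing the address.
        totals[*(it - 1)] += insn.samples;
    }
    return totals;
}
```

Skid within a block then no longer matters, although samples that skid across a block boundary are still charged to the wrong block.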
Or is there some option I need to use to obtain reliable results?
I do not know if newer profiling hardware, namely tools based on Intel Processor Trace (available starting from Broadwell, but improved in Skylake) instead of older PEBS, would give more accurate data. I guess one needs to experiment with such tools first.

Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?

LOOP (Intel ref manual entry) decrements ecx / rcx, and then jumps if non-zero. It's slow, but couldn't Intel have cheaply made it fast? dec/jnz already macro-fuses into a single uop on Sandybridge-family; the only difference being that dec sets flags.
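To spell out that one architectural difference, here is a toy model. ToyCpu and both functions are invented for illustration, and only ZF is tracked (the real dec also writes OF, SF, AF and PF, but leaves CF alone).

```cpp
#include <cstdint>

// Toy architectural state: just the loop counter and the zero flag.
struct ToyCpu {
    std::uint64_t rcx;
    bool zf;
};

// loop: decrement rcx and branch if the result is non-zero; flags untouched.
inline bool loop_taken(ToyCpu& cpu) {
    cpu.rcx -= 1;
    return cpu.rcx != 0;
}

// dec rcx / jnz: same branch decision, but dec overwrites ZF on the way.
inline bool dec_jnz_taken(ToyCpu& cpu) {
    cpu.rcx -= 1;
    cpu.zf = (cpu.rcx == 0);
    return !cpu.zf;
}
```

Whatever flag state the loop body produced survives loop_taken but not dec_jnz_taken, which is why a fast loop would be attractive when the body needs its flags carried into the next iteration.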
loop on various microarchitectures, from Agner Fog's instruction tables:
K8/K10: 7 m-ops
Bulldozer-family/Ryzen: 1 m-op (same cost as macro-fused test-and-branch, or jecxz)
P4: 4 uops (same as jecxz)
P6 (PII/PIII): 8 uops
Pentium M, Core2: 11 uops
Nehalem: 6 uops. (11 for loope / loopne). Throughput = 4c (loop) or 7c (loope/ne).
SnB-family: 7 uops. (11 for loope / loopne). Throughput = one per 5 cycles, as much of a bottleneck as keeping your loop counter in memory! jecxz is only 2 uops with same throughput as regular jcc
Silvermont: 7 uops
AMD Jaguar (low-power): 8 uops, 5c throughput
Via Nano3000: 2 uops
Couldn't the decoders just decode the same as lea rcx, [rcx-1] / jrcxz? That would be 3 uops. At least that would be the case with no address-size prefix, otherwise it has to use ecx and truncate RIP to EIP if the jump is taken; maybe the odd choice of address-size controlling the width of the decrement explains the many uops? (Fun fact: rep-string instructions have the same behaviour with using ecx with 32-bit address-size.)
Or better, just decode it as a fused dec-and-branch that doesn't set flags? dec ecx / jnz on SnB decodes to a single uop (which does set flags).
I know that real code doesn't use it (because it's been slow since at least P5 or something), but AMD decided it was worth it to make it fast for Bulldozer. Probably because it was easy.
Would it be easy for SnB-family uarch to have fast loop? If so, why don't they? If not, why is it hard? A lot of decoder transistors? Or extra bits in a fused dec&branch uop to record that it doesn't set flags? What could those 7 uops be doing? It's a really simple instruction.
What's special about Bulldozer that made a fast loop easy / worth it? Or did AMD waste a bunch of transistors on making loop fast? If so, presumably someone thought it was a good idea.
If loop was fast, it would be perfect for BigInteger arbitrary-precision adc loops, to avoid partial-flag stalls / slowdowns (see my comments on my answer), or any other case where you want to loop without touching flags. It also has a minor code-size advantage over dec/jnz. (And dec/jnz only macro-fuses on SnB-family).
On modern CPUs where dec/jnz is ok in an ADC loop, loop would still be nice for ADCX / ADOX loops (to preserve OF).
If loop had been fast, compilers would already be using it as a peephole optimization for code-size + speed on CPUs without macro-fusion.
It wouldn't stop me from getting annoyed at all the questions with bad 16bit code that uses loop for every loop, even when they also need another counter inside the loop. But at least it wouldn't be as bad.

In 1988, IBM fellow Glenn Henry had just come on board at Dell, which had a few hundred employees at the time, and in his first month he gave a tech talk about 386 internals. A bunch of us BIOS programmers had been wondering why LOOP was slower than DEC/JNZ so during the question/answer section somebody posed the question.
His answer made sense. It had to do with paging.
LOOP consists of two parts: decrementing CX, then jumping if CX is not zero. The first part cannot cause a processor exception, whereas the jump part can. For one, you could jump (or fall through) to an address outside segment boundaries, causing a SEGFAULT. For two, you could jump to a page that is swapped out.
A SEGFAULT usually spells the end for a process, but page faults are different. When a page fault occurs, the processor throws an exception, and the OS does the housekeeping to swap in the page from disk into RAM. After that, it restarts the instruction that caused the fault.
Restarting means restoring the state of the process to what it was just before the offending instruction. In the case of the LOOP instruction in particular, it meant restoring the value of the CX register. One might think you could just add 1 to CX, since we know CX got decremented, but apparently, it's not that simple. For example, check out this erratum from Intel:
The protection violations involved usually indicate a probable
software bug and restart is not desired if one of these violations
occurs. In a Protected Mode 80286 system with wait states during any
bus cycles, when certain protection violations are detected by the
80286 component, and the component transfers control to the exception
handling routine, the contents of the CX register may be unreliable.
(Whether CX contents are changed is a function of bus activity at the
time internal microcode detects the protection violation.)
To be safe, they needed to save the value of CX on every iteration of a LOOP instruction, in order to reliably restore it if needed.
It's this extra burden of saving CX that made LOOP so slow.
Intel, like everyone else at the time, was getting more and more RISC. The old CISC instructions (LOOP, ENTER, LEAVE, BOUND) were being phased out. We still used them in hand-coded assembly, but compilers ignored them completely.

Now that I googled after writing my question, it turns out to be an exact duplicate of one on comp.arch, which came up right away. I expected it to be hard to google (lots of "why is my loop slow" hits), but my first try (why is the x86 loop instruction slow) got results.
This is not a good or complete answer.
It might be the best we'll get, and will have to suffice unless someone can shed some more light on it. I didn't set out to write this as an answer-my-own-question post.
Good posts with different theories in that thread:
Robert:
LOOP became slow on some of the earliest machines (circa 486) when
significant pipelining started to happen, and running any but the
simplest instruction down the pipeline efficiently was technologically
impractical. So LOOP was slow for a number of generations. So nobody
used it. So when it became possible to speed it up, there was no real
incentive to do so, since nobody was actually using it.
Anton Ertl:
IIRC LOOP was used in some software for timing loops; there was
(important) software that did not work on CPUs where LOOP was too fast
(this was in the early 90s or so). So CPU makers learned to make LOOP
slow.
(Paul, and anyone else: You're welcome to re-post your own writing as your own answer. I'll remove it from my answer and up-vote yours.)
@Paul A. Clayton (occasional SO poster and CPU architecture guy) took a guess at how you could use that many uops. (This looks like loope/ne which checks both the counter and ZF):
I could imagine a possibly sensible 6-µop version:
virtual_cc = cc;
temp = test (cc);
rCX = rCX - temp; // also setting cc
cc = temp & cc; // assumes branch handling is not
// substantially changed for the sake of LOOP
branch
cc = virtual_cc
(Note that this is 6 uops, not SnB's 11 for LOOPE/LOOPNE, and is a total guess not even trying to take into account anything known from SnB perf counters.)
Then Paul said:
I agree that a shorter sequence should be possible, but I was trying
to think of a bloated sequence that might make sense if minimal
microarchitectural adjustments were permitted.
summary: The designers wanted loop to be supported only via microcode, with no adjustments whatsoever to the hardware proper.
If a useless, compatibility-only instruction is handed to the
microcode developers, they might reasonably not be able or willing to
suggest minor changes to the internal microarchitecture to improve
such an instruction. Not only would they rather use their "change
suggestion capital" more productively but the suggestion of a change
for a useless case would reduce the credibility of other suggestions.
(My opinion: Intel is probably still making it slow on purpose, and hasn't bothered to rewrite their microcode for it for a long time. Modern CPUs are probably too fast for anything using loop in a naive way to work correctly.)
... Paul continues:
The architects behind Nano may have found avoiding the special casing
of LOOP simplified their design in terms of area or power. Or they
may have had incentives from embedded users to provide a fast
implementation (for code density benefits). Those are just WILD
guesses.
If optimization of LOOP fell out of other optimizations (like fusion
of compare and branch), it might be easier to tweak LOOP into a fast
path instruction than to handle it in microcode even if the
performance of LOOP was unimportant.
I suspect that such decisions are based on specific details of the
implementation. Information about such details does not seem to be
generally available and interpreting such information would be
beyond the skill level of most people. (I am not a hardware
designer--and have never played one on television or stayed at a
Holiday Inn Express. :-)
The thread then went off-topic into the realm of AMD blowing our one chance to clean up the cruft in x86 instruction encoding. It's hard to blame them, since every change is a case where the decoders can't share transistors. And before Intel adopted x86-64, it wasn't even clear that it would catch on. AMD didn't want to burden their CPUs with hardware nobody used if AMD64 didn't catch on.
But still, there are so many small things: setcc could have changed to 32bits. (Usually you have to use xor-zero / test / setcc to avoid false dependencies, or because you need a zero-extended reg). Shift could have unconditionally written flags, even with zero shift count (removing the input data dependency on eflags for variable-count shift for OOO execution). Last time I typed this list of pet peeves, I think there was a third one... Oh yeah, bt / bts etc. with memory operands has the address dependent on the upper bits of the index (bit string, not just bit within a machine word).
bts instructions are very useful for bit-field stuff, and are slower than they need to be so you almost always want to load into a register and then use that. (It's usually faster to shift/mask to get an address yourself, instead of using 10 uop bts [mem], reg on Skylake, but it does take extra instructions. So it made sense on 386, but not on K8). Atomic bit-manipulation has to use the memory-dest form, but the locked version needs lots of uops anyway. It's still slower than if it couldn't access outside the dword it's operating on.

Please see the nice article by Abrash, Michael, published in Dr. Dobb's Journal March 1991 v16 n3 p16(8): http://archive.gamedev.net/archive/reference/articles/article369.html
The summary of the article is the following:
Optimizing code for 8088, 80286, 80386 and 80486 microprocessors is
difficult because the chips use significantly different memory
architectures and instruction execution times. Code cannot be
optimized for the 80x86 family; rather, code must be designed to
produce good performance on a range of systems or optimized for
particular combinations of processors and memory. Programmers must
avoid the unusual instructions supported by the 8088, which have lost
their performance edge in subsequent chips. String instructions
should be used but not relied upon. Registers should be used rather
than memory operations. Branching is also slow for all four
processors. Memory accesses should be aligned to improve
performance. Generally, optimizing an 80486 requires exactly the
opposite steps as optimizing an 8088.
By "unusual instructions supported by the 8088" the author also means "loop":
Any 8088 programmer would instinctively replace: DEC CX JNZ LOOPTOP
with: LOOP LOOPTOP because LOOP is significantly faster on the 8088.
LOOP is also faster on the 286. On the 386, however, LOOP is actually
two cycles slower than DEC/JNZ. The pendulum swings still further on
the 486, where LOOP is about twice as slow as DEC/JNZ--and, mind you,
we're talking about what was originally perhaps the most obvious
optimization in the entire 80x86 instruction set.
This is a very good article, and I highly recommend it. Even though it was published in 1991, it is surprisingly highly relevant today.
But this article just gives advice: it encourages you to test execution speed and choose the faster variants. It doesn’t explain WHY some instructions became very slow, so it doesn’t fully address your question.
The answer is that earlier processors, like 80386 (released in 1985) and before, executed instructions one-by-one, sequentially.
Later processors started to use instruction pipelining: initially a simple pipeline on the 80486, and finally the Pentium Pro (released in 1995) introduced a radically different internal pipeline, the Out Of Order (OOO) core, where instructions were transformed into small fragments of operations called micro-ops or µops, and then all micro-ops of different instructions were put into a large pool where they could execute simultaneously as long as they did not depend on one another. This OOO pipeline principle is still used, almost unchanged, on modern processors. You can find more information about instruction pipelining in this brilliant article: https://www.gamedev.net/resources/_/technical/general-programming/a-journey-through-the-cpu-pipeline-r3115
In order to simplify chip design, Intel decided to build processors in such a way that some instructions are converted to micro-ops very efficiently, while others are not.
Efficient conversion from instructions to micro-ops requires more transistors, so Intel decided to save on transistors at the cost of slower decoding and execution of some “complex” or “rarely-used” instructions.
For example, the “Intel® Architecture Optimization Reference Manual” http://download.intel.com/design/PentiumII/manuals/24512701.pdf mentions the following: “Avoid using complex instructions (for example, enter, leave, or loop) that generally have more than four µops and require multiple cycles to decode. Use sequences of simple instructions instead.”
So, Intel somehow decided that the “loop” instruction is “complex”, and it has been very slow ever since. However, there is no official Intel reference on instruction breakdown: how many micro-ops each instruction produces, and how many cycles are required to decode it.
You can also read about the Out-of-Order Execution Engine in section 2.1.2 of the "Intel® 64 and IA-32 Architectures Optimization Reference Manual": http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

Instruction to get the current time on x86

Is there an x86 instruction to get the current time?
Basically... something like a replacement for clock_gettime ... something with the minimum overhead... where I don't really care about getting the time in any specific format... as long as it's a format I can use.
Basically I'm doing some work to "Detect how much PHYSICAL REAL LIFE TIME" has gone by... and I want to be able to measure time as frequently as possible!
I guess you can imagine I'm doing something like a profiling app... :)
I really need aggressively efficient access to the hardware time. So ideally... some ASM to get the time... store it somewhere... then massage it later into some format that I can actually process.
I'm not interested in _rdtsc as that measures the number of cycles gone by. I need to know how much physical time has elapsed... not cycles, which can vary due to thermal fluctuations or so..

For profiling, often it's most useful to profile in terms of CPU clock cycles, rather than wall-clock time. CPU dynamic clocking (turbo and power saving) makes it annoying to get the CPU ramped up to full speed before the start of a measurement period.
If you still need wall-clock time after that:
Recent x86 CPUs have a TSC that runs at a fixed rate, regardless of CPU frequency adjustment for power-saving. Also, the TSC doesn't stop when the CPU is halted. (i.e. no work to do, so it ran the HLT instruction to wait for an interrupt in low-power mode.)
It turned out that efficient access to a useful time-source was more useful to have in hardware than an actual clock cycle counter, so that's what RDTSC morphed into, a few CPU generations after its introduction. Now we're back to using hardware performance counters for measuring clock cycles.
In Linux, look for constant_tsc and nonstop_tsc in the CPU feature flags in /proc/cpuinfo. IDK if there are CPUID bits for those. If not, use Linux's code for it (if you can use GPLed code).
On a CPU with those two key features, Linux uses the TSC as its clocksource, IIRC.
The lowest overhead way to get the current time in user-space will be to work out the conversion between RDTSC ticks and real time. While profiling, you might just store 64bit TSC snapshots, and convert to real-time later. (So you can handle TSC wraparound then). RDTSC only takes about 24 cycles (Agner Fog's instruction table, Intel Haswell). I think the overhead of a system call will be an order of magnitude higher than that. (The kernel will have to do a RDTSC in there somewhere anyway).
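The "convert later" step is plain arithmetic once you know the TSC frequency. Here is a minimal sketch, assuming a constant-rate TSC; the function name is made up, and tsc_hz is an input you would have to calibrate yourself or get from the OS, which this example does not attempt.

```cpp
#include <cstdint>

// Converts a difference between two stored TSC snapshots to nanoseconds.
// `tsc_hz` is the TSC frequency in ticks per second (an assumed, externally
// calibrated input). Splitting into whole seconds plus a remainder keeps the
// intermediate products inside 64 bits for any tsc_hz below about 18 GHz.
constexpr std::uint64_t tsc_ticks_to_ns(std::uint64_t ticks,
                                        std::uint64_t tsc_hz) {
    return (ticks / tsc_hz) * 1000000000ULL
         + (ticks % tsc_hz) * 1000000000ULL / tsc_hz;
}
```

For example, with an assumed 3 GHz TSC, a delta of 1500 ticks works out to 500 ns. The profiling hot path then only has to store raw 64-bit snapshots.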
Agner Fog has documented his profiling / timing methods, and has some example code. I haven't looked recently, but it might have useful stuff for this application.

How many CPU cycles are needed for each assembly instruction?

I heard there is an Intel book online that lists the CPU cycles needed for each assembly instruction, but I can't find it (after trying hard). Could anyone show me where to find the cycle counts, please?
Here is an example: in the code below, mov and lock are 1 CPU cycle each, and xchg is 3 CPU cycles.
// This part is Platform dependent!
#ifdef WIN32
inline int CPP_SpinLock::TestAndSet(int* pTargetAddress, int nValue)
{
    __asm
    {
        mov edx, dword ptr [pTargetAddress]
        mov eax, nValue
        lock xchg eax, dword ptr [edx]
    }
    // mov = 1 CPU cycle
    // lock = 1 CPU cycle
    // xchg = 3 CPU cycles
}
#endif // WIN32
BTW: here is the URL for the code I posted: http://www.codeproject.com/KB/threads/spinlocks.aspx

Modern CPUs are complex beasts, using pipelining, superscalar execution, and out-of-order execution among other techniques which make performance analysis difficult... but not impossible!
While you can no longer simply add together the latencies of a stream of instructions to get the total runtime, you can still get an (often) highly accurate analysis of the behavior of some piece of code (especially a loop) as described below and in other linked resources.
Instruction Timings
First, you need the actual timings. These vary by CPU architecture, but the best resource currently for x86 timings is Agner Fog's instruction tables. Covering no fewer than thirty different microarchitectures, these tables list the instruction latency, which is the minimum/typical time that an instruction takes from inputs ready to output available. In Agner's words:
Latency: This is the delay that the instruction generates in a
dependency chain. The numbers are minimum values. Cache misses,
misalignment, and exceptions may increase the clock counts
considerably. Where hyperthreading is enabled, the use of the same
execution units in the other thread leads to inferior performance.
Denormal numbers, NAN's and infinity do not increase the latency. The
time unit used is core clock cycles, not the reference clock cycles
given by the time stamp counter.
So, for example, the add instruction has a latency of one cycle, so a series of dependent add instructions, as shown, will have a latency of 1 cycle per add:
add eax, eax
add eax, eax
add eax, eax
add eax, eax # total latency of 4 cycles for these 4 adds
Note that this doesn't mean that add instructions will only take 1 cycle each. For example, if the add instructions were not dependent, it is possible that on modern chips all 4 add instructions can execute independently in the same cycle:
add eax, eax
add ebx, ebx
add ecx, ecx
add edx, edx # these 4 instructions might all execute in parallel in a single cycle
Agner provides a metric which captures some of this potential parallelism, called reciprocal throughput:
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind
in the same thread.
For add this is listed as 0.25 meaning that up to 4 add instructions can execute every cycle (giving a reciprocal throughput of 1 / 4 = 0.25).
The reciprocal throughput number also gives a hint at the pipelining capability of an instruction. For example, on most recent x86 chips, the common forms of the imul instruction have a latency of 3 cycles, and internally only one execution unit can handle them (unlike add which usually has four add-capable units). Yet the observed throughput for a long series of independent imul instructions is 1/cycle, not 1 every 3 cycles as you might expect given the latency of 3. The reason is that the imul unit is pipelined: it can start a new imul every cycle, even while the previous multiplication hasn't completed.
This means a series of independent imul instructions can run at up to 1 per cycle, but a series of dependent imul instructions will run at only 1 every 3 cycles (since the next imul can't start until the result from the prior one is ready).
So with this information, you can start to see how to analyze instruction timings on modern CPUs.
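To make the add and imul examples above concrete, here is a deliberately simplistic model in C++. The function names are mine, and the model is an assumption-laden sketch: it only knows about latency and how many instructions of one kind can start per cycle, and it ignores decode limits, port sharing, memory, and everything else.

```cpp
#include <cstdint>

// Dependent chain: each instruction has to wait for the previous result,
// so the latencies simply add up.
std::uint64_t dependent_chain_cycles(std::uint64_t count, std::uint64_t latency) {
    return count * latency;
}

// Independent instructions of one kind on pipelined units: up to per_cycle
// of them can start each cycle, and the last one to start still needs
// `latency` cycles before its result is ready.
std::uint64_t independent_cycles(std::uint64_t count, std::uint64_t latency,
                                 std::uint64_t per_cycle) {
    if (count == 0) return 0;
    const std::uint64_t last_start_cycle = (count - 1) / per_cycle;
    return last_start_cycle + latency;
}
```

With the numbers from the text (add: latency 1, 4 per cycle; imul: latency 3, 1 per cycle), four dependent adds come out at 4 cycles and four independent adds at 1 cycle, while four dependent imuls take 12 cycles and four independent ones 6. For long independent runs the result approaches count / per_cycle, which is the reciprocal throughput figure from the tables.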
Detailed Analysis
Still, the above is only scratching the surface. You now have multiple ways of looking at a series of instructions (latency or throughput) and it may not be clear which to use.
Furthermore, there are other limits not captured by the above numbers, such as the fact that certain instructions compete for the same resources within the CPU, and restrictions in other parts of the CPU pipeline (such as instruction decoding) which may result in a lower overall throughput than you'd calculate just by looking at latency and throughput. Beyond that, you have factors "beyond the ALUs" such as memory access and branch prediction: entire topics unto themselves - you can mostly model these well, but it takes work. For example here's a recent post where the answer covers in some detail most of the relevant factors.
Covering all the details would increase the size of this already long answer by a factor of 10 or more, so I'll just point you to the best resources. Agner Fog has an Optimizing Assembly guide that covers in detail the precise analysis of a loop with a dozen or so instructions. See "12.7 An example of analysis for bottlenecks in vector loops" which starts on page 95 in the current version of the PDF.
The basic idea is that you create a table, with one row per instruction and mark the execution resources each uses. This lets you see any throughput bottlenecks. In addition, you need to examine the loop for carried dependencies, to see if any of those limit the throughput (see "12.16 Analyzing dependencies" for a complex case).
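The table idea can be sketched in a few lines of C++. Everything here is hypothetical: the 4-port machine, the Row struct, and the way uops are charged to ports are illustration only, not real port assignments for any CPU (take those from Agner's tables for your microarchitecture).

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kPorts = 4;  // hypothetical machine with 4 execution ports

// One row of the table: an instruction and the uops it puts on each port
// per loop iteration. A uop that can go to either of two ports can be
// charged as 0.5 to each.
struct Row {
    const char* instruction;          // label only, not interpreted
    std::array<double, kPorts> uops;  // uops charged to each port
};

// Column sums: total uops per port for one loop iteration.
std::array<double, kPorts> port_totals(const std::vector<Row>& table) {
    std::array<double, kPorts> totals{};
    for (const Row& row : table)
        for (std::size_t p = 0; p < kPorts; ++p)
            totals[p] += row.uops[p];
    return totals;
}

// The busiest port gives a lower bound on cycles per iteration,
// assuming each port accepts one uop per cycle.
double bottleneck_cycles(const std::vector<Row>& table) {
    const std::array<double, kPorts> totals = port_totals(table);
    return *std::max_element(totals.begin(), totals.end());
}
```

This only covers the throughput side of the table; a loop-carried dependency chain can still make the real loop slower than the busiest port suggests.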
If you don't want to do it by hand, Intel has released the Intel Architecture Code Analyzer, which is a tool that automates this analysis. It currently hasn't been updated beyond Skylake, but the results are still largely reasonable for Kaby Lake since the microarchitecture hasn't changed much and therefore the timings remain comparable. This answer goes into a lot of detail and provides example output, and the user's guide isn't half bad (although it is out of date with respect to the newest versions).
Other sources
Agner usually provides timings for new architectures shortly after they are released, but you can also check out instlatx64 for similarly organized timings in the InstLatX86 and InstLatX64 results. The results cover a lot of interesting old chips, and new chips usually show up fairly quickly. The results are mostly consistent with Agner's, with a few exceptions here and there. You can also find memory latency and other values on this page.
You can even get the timing results directly from Intel in their IA32 and Intel 64 optimization manual in Appendix C: INSTRUCTION LATENCY AND THROUGHPUT. Personally I prefer Agner's version because it is more complete, often arrives before the Intel manual is updated, and is easier to use as it comes in both spreadsheet and PDF form.
Finally, the x86 tag wiki has a wealth of resources on x86 optimization, including links to other examples of how to do a cycle accurate analysis of code sequences.
If you want a deeper look into the type of "dataflow analysis" described above, I would recommend A Whirlwind Introduction to Data Flow Graphs.

Given pipelining, out of order processing, microcode, multi-core processors, etc there's no guarantee that a particular section of assembly code will take exactly x CPU cycles/clock cycle/whatever cycles.
If such a reference exists, it will only be able to provide broad generalizations given a particular architecture, and depending on how the microcode is implemented you may find that the Pentium M is different than the Core 2 Duo which is different than the AMD dual core, etc.
Note that this article was updated in 2000, and written earlier. Even the Pentium 4 is hard to pin down regarding instruction timing. The PIII, PII, and the original Pentium were easier, and the texts referenced were probably based on those earlier processors, which had more well-defined instruction timing.
These days people generally use statistical analysis for code timing estimation.
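A minimal sketch of what "statistical" timing can look like in C++: run the code many times, keep every sample, and report something robust like the median instead of a single measurement. The helper names are mine; std::chrono::steady_clock is just one reasonable clock choice, and its resolution may be too coarse for very short snippets.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Time `runs` separate calls of f and return one duration (in ns) per call.
template <typename F>
std::vector<std::int64_t> sample_ns(F&& f, std::size_t runs) {
    std::vector<std::int64_t> samples;
    samples.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        f();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
    }
    return samples;
}

// Median of the samples (the upper median for even sizes).
// Takes a copy so the caller's vector keeps its original order.
std::int64_t median(std::vector<std::int64_t> samples) {
    if (samples.empty()) return 0;
    const auto mid = samples.begin() + samples.size() / 2;
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}
```

The median is less sensitive than the mean to the occasional sample that got hit by an interrupt or a cache miss; the minimum is another common choice when you want the best case.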

What the other answers say about it being impossible to accurately predict the performance of code running on a modern CPU is true, but that doesn't mean the latencies are unknown, or that knowing them is useless.
The exact latencies for Intel's and AMD's processors are listed in Agner Fog's instruction tables. See also Intel® 64 and IA-32 Architectures Optimization Reference Manual, and Instruction latencies and throughput for AMD and Intel x86 processors (from Can Berk Güder's now-deleted link-only answer). AMD also has PDF manuals on their own website with their official values.
For (micro-)optimizing tight loops, knowing the latencies for each instruction can help a lot in manually trying to schedule your code. The programmer can make a lot of optimizations that the compiler can't (because the compiler can't guarantee it won't change the meaning of the program).
Of course, this still requires you to know a lot of other details about the CPU, such as how deeply pipelined it is, how many instructions it can issue per cycle, the number of execution units and so on. And of course, these numbers vary between CPUs. But you can often come up with a reasonable average that more or less works for all CPUs.
It's worth noting though, that it is a lot of work to optimize even a few lines of code at this level. And it is easy to make something that turns out to be a pessimization. Modern CPUs are hugely complicated, and they try extremely hard to get good performance out of bad code. But there are also cases they're unable to handle efficiently, or where you think you're clever and making efficient code, and it turns out to slow the CPU down.
Edit
Looking in Intel's optimization manual, table C-13:
The first column is instruction type, then there is a number of columns for latency for each CPUID. The CPUID indicates which processor family the numbers apply to, and are explained elsewhere in the document. The latency specifies how many cycles it takes before the result of the instruction is available, so this is the number you're looking for.
The throughput columns show how many instructions of this type can be executed per cycle.
Looking up xchg in this table, we see that depending on the CPU family, it takes 1-3 cycles, and a mov takes 0.5-1. These are for the register-to-register forms of the instructions, not for a lock xchg with memory, which is a lot slower. More importantly, lock xchg has hugely variable latency and impact on surrounding code (it is much slower when there's contention with another core), so looking only at the best case is a mistake. (I haven't looked up what each CPUID means, but I assume the .5 values are for the Pentium 4, which ran some components of the chip at double speed, allowing it to do things in half cycles.)
I don't really see what you plan to use this information for, but if you know the exact CPU family the code is running on, then adding up the latencies tells you the minimum number of cycles required to execute this sequence of instructions.

Measuring and counting CPU-cycles does not make sense on the x86 anymore.
First off, ask yourself which CPU you're counting cycles for. Core 2? An Athlon? Pentium-M? Atom? All these CPUs execute x86 code, but all of them have different execution times. The execution time even varies between different steppings of the same CPU.
The last x86 where cycle-counting made sense was the Pentium-Pro.
Also consider that inside the CPU most instructions are decoded into micro-ops (uops) and executed out of order by an internal execution unit that does not even remotely look like an x86. The performance of a single CPU instruction depends on how many resources in the internal execution unit are available.
So the time for an instruction depends not only on the instruction itself but also on the surrounding code.
Anyway: You can estimate the throughput-resource usage and latency of instructions for different processors. The relevant information can be found at the Intel and AMD sites.
Agner Fog has a very nice summary on his website. See the instruction tables for latency, throughput, and uop count. See the microarchitecture PDF to learn how to interpret those.
http://www.agner.org/optimize
But note that xchg-with-memory does not have predictable performance, even if you look at only one CPU model. Even in the no-contention case with the cache line already hot in L1D cache, the fact that it is a full memory barrier means its impact depends a lot on loads and stores to other addresses in the surrounding code.
Btw - since your example-code is a lock-free datastructure basic building block: Have you considered using the compiler built-in functions? On win32 you can include intrin.h and use functions such as _InterlockedExchange.
That'll give you better execution time because the compiler can inline the instructions. Inline-assembler always forces the compiler to disable optimizations around the asm-code.
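For illustration, here is roughly what the TestAndSet from the question looks like without inline asm. Note the swap: instead of the MSVC-specific _InterlockedExchange mentioned above, this sketch uses std::atomic<int>::exchange from standard C++ as a portable stand-in, and the SpinLock struct is a hypothetical minimal wrapper, not the CodeProject class.

```cpp
#include <atomic>

// Atomically store new_value into target and return the value it held before.
// std::atomic<int>::exchange is used as a portable stand-in for
// _InterlockedExchange; on x86, compilers typically emit an xchg with memory.
inline int test_and_set(std::atomic<int>& target, int new_value) {
    return target.exchange(new_value, std::memory_order_seq_cst);
}

// Hypothetical minimal spinlock built on it: 0 = free, 1 = held.
struct SpinLock {
    std::atomic<int> state{0};

    void lock() {
        // Keep swapping in 1 until the previous value was 0 (lock was free).
        while (test_and_set(state, 1) != 0) {
            // spin
        }
    }

    void unlock() { state.store(0, std::memory_order_release); }
};
```

Because the exchange is an ordinary inlinable function call as far as the optimizer is concerned, the compiler is free to schedule and optimize the code around it, which is the point made above about built-ins versus inline asm.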

lock xchg eax, dword ptr [edx]
Note that the lock prefix locks memory for the memory fetch across all cores; this can take 100 cycles on some multi-core systems, and a cache line will also need to be flushed. It will also stall the pipeline, so I wouldn't worry about the rest.
So optimal performance comes back to tuning your algorithm's critical regions.
Note that on a single core you can optimize this by removing the lock, but it is needed for multi-core.
