Reorder buffer and not modified registers - performance

The reorder buffer can handle no more than three reads per clock cycle
from registers that have not been modified recently.
That comes from Agner Fog's material. But, my doubt is:
Why is important when registers have been modified? Why does that matter?

First, that only applies to pre-Sandybridge Intel P6-family microarchitectures. (PPro to Nehalem).
This P6-family bottleneck is usually called register-read stalls, or sometimes "ROB-read stalls" because the values get read into the ROB during issue/rename; inputs for execution units are always read from the ROB (or bypass network) when dispatching uops to execution ports.
AMD, Intel P4, and Intel Sandybridge-family, use a physical register file design. SnB-family has no bottlenecks on register read ports. (Or does it? Some effect seems to limit throughput when we read very many registers on HSW and (higher limit) SKL. My post on Agner blog. But it might be bottlenecking dispatch instead of issue/rename; I haven't done experiments with bursts, just steady state. No reason to think the mechanism is the same as on P6-family.)
Recently-modified architectural registers are ones that were written by an instruction that hasn't retired yet. The data is already available in the ROB1, instead of having to be read from the permanent register file during issue/rename. There is a limit on how many such "cold" registers can be read per cycle during issue/rename as instructions are added to the ROB. Nehalem increased it some.
Agner Fog's microarch PDF (which you were already reading) explains the details; see the PPro section, and the update in the Core2/Nehalem section.
Footnote 1: or is in flight being forwarded from one execution unit to the next in the bypass network, for inputs that haven't even finished executing.

Related

How does perf record (or other profilers) pick which instruction to count as costing time?

Recently, I found out that actually perf (or pprof) may show in disassembly view instruction timing near the line that didn't actually take this time. The real instruction, which actually took this time, is before it. I know a vague explanation that this happens due to instruction pipelining in CPU. However, I would like to find out the following:
Is there a more detailed explanation of this effect?
Is it documented in perf or pprof? I haven't found any references.
Is there a way to obtain correctly placed timings?
(quick not super detailed answer; a more detailed one would be good if someone wants to write one).
perf just uses the CPU's own hardware performance counters, which can be put into a mode where they record an event when the counter counts down to zero or up to a threshold.
Either raising an interrupt or writing an event into a buffer in memory (with PEBS precise events). That event will include a code address that the CPU picked to associate with the event (i.e. the point at which the interrupt was raised), even for events like cycles which unlike instructions don't inherently have a specific instruction associated. The out-of-order exec back-end can have a couple hundred instructions in flight when counter wraps, but has to pick exactly one for any given sample.
Generally the CPU "blames" the instruction that was waiting for a slow-to-produce result, not the one producing it, especially cache-miss loads.
For an example with Intel x86 CPUs, see Why is this jump instruction so expensive when performing pointer chasing?
which also appears to depend on the effect of letting the last instruction in the ROB retire when an interrupt is raised. (Intel CPUs at least do seem to do that; makes sense for ensuring forward progress even with a potentially slow instruction.)
In general there can be "skew" when a later instruction is blamed than the one actually taking the time, possibly with different causes. (Perhaps especially for uncore events, since they happen asynchronously to the core clock.)
Other related Q&As with interesting examples or other things
Inconsistent `perf annotate` memory load/store time reporting
Linux perf reporting cache misses for unexpected instruction
https://travisdowns.github.io/blog/2019/08/20/interrupts.html - some experiments into which instructions tend to get counts on Skylake.

Clock Cycles for the invlpg instruction

I was reading some documentation about the invlpg instruction for Intel Pentium processors and it says that it takes 25 clock cycles. I thought that this depended on the implementation (the particular CPU) and not the actual instruction set architecture? Or is the fact that this instruction must take 25 clock cycles to run also part of the instruction set specification?
The documentation is saying that it took 25 clock cycles on the Pentium. The number of clock cycles the instruction takes on other CPUs may be more or fewer. The performance of instructions is not part of the instruction set specification.
That number is not part of any official ISA documentation, it's just performance data that someone annotated into an old (then-current) copy of Intel's ISA docs.
It's from some random microarchitecture, presumably P5 Pentium that was relevant back when Tripod was a widely used web host, and which that guide labels itself as documenting. (These days there are Pentium/Celeron CPUs that are just cut-down versions of i3/i5/i7 of the same generation, with stuff like AVX and BMI1/2 disabled. But Pentium used to refer to the P5 microarchitecture.)
It's not from Intel's documentation; it was added by whoever compiled that HTML. The formatting is similar to modern versions of Intel's vol.2 x86 SDM instruction-set reference manual. You can find HTML extracts of that at https://github.com/HJLebbink/asm-dude/wiki/INVLPG and https://www.felixcloutier.com/x86/invlpg for example. The encoding / mnemonic / description table at the top has identical formatting in your Tripod link, but the actual text is somewhat different. Also, the text for inc (current Intel vs. tripod) is word for word identical.
So yes, this is based on an old PDF->HTML of Intel's vol.2 manual, with P5 cycles and instruction-pairing info added (inc pairs in the U or V pipe on that dual-issue in-order pipeline that doesn't break instructions down into uops). Also with FLAGS updating section turned into tables.
That instruction-pairing and cycle-count info is totally irrelevant when tuning for modern microarchitectures like Skylake and Zen, but you can find it in Agner Fog's instruction tables: his spreadsheet has a sheet for P5, as well as for later Intel, AMD, and Via microarchitectures. (Also see his optimization guide and microarch pdf for background info to help you make sense of uops / ports / latency / throughput info.) Agner doesn't test most kernel instructions so invlpg isn't in his list.
http://faydoc.tripod.com/cpu/index.htm is obviously not an official Intel source. IDK where the author of this got their info from. Maybe they tested themselves. Or Intel has sometimes published some timing numbers for some microarchitectures, e.g. as part of their optimization manual. This is totally separate from the x86 ISA manuals, and is not something you can rely on for correctness. And other people have published their test results.
Another good source for experimental test results of instruction performance (uops for which ports, latency, and throughput) is https://uops.info/. Their testing for invlpg m8 shows it has a back-to-back throughput of ~194 cycles in practice on Skylake-client, ~157 on Nehalem, and ~126.25 on Zen+ and Zen2, to pick some random examples. But it may interleave better with other instructions, taking "only" 47 front-end uops on recent Intel CPUs and thus can issue in under 12 cycles if the back-end has room in the ROB / RS, maybe letting later instructions execute while the invlpg operation is in progress. (Although if it takes over 100 cycles for its uops to retire, that will often stall OoO exec at some point for a fraction of the total time.)
Remember that instruction performance can't be characterized by a single number on out-of-order CPUs; it's not one dimensional. Perf analysis is not as simple as adding up a cycle costs for all instructions in a loop, you have to analyze how the can overlap with each other. Or for complex cases like invlpg, measure.

Estimating Cycles Per Instruction

I have disassembled a small C++ program compiled with MSVC v140 and am trying to estimate the cycles per instruction in order to better understand how code design impacts performance. I've been following Mike Acton's CppCon 2014 talk on "Data-Oriented Design and C++", specifically the portion I've linked to.
In it, he points out these lines:
movss 8(%rbx), %xmm1
movss 12(%rbx), %xmm0
He then claims that these 2 x 32-bit reads are probably on the same cache line therefore cost roughly ~200 cycles.
The Intel 64 and IA-32 Architectures Optimization Reference Manual has been a great resource, specifically "Appendix C - Instruction Latency and Throughput". However on page C-15 in "Table C-16. Streaming SIMD Extension Single-precision Floating-point Instructions" it states that movss is only 1 cycle (unless I'm understanding what latency means here wrong... if so, how do I read this thing?)
I know that a theoretical prediction of execution time will never be correct, but nevertheless this is important to learn. How are these two commands 200 cycles, and how can I learn to reason about execution time beyond this snippet?
I've started to read some things on CPU pipelining... maybe the majority of the cycles are being picked up there?
PS: I'm not interested in actually measuring hardware performance counters here. I'm just looking to learn how to reasonable sight read ASM and cycles.
As you already pointed out, the theoretical throughput and latency of a MOVSS instruction is at 1 cycle. You were looking at the right document (Intel Optimization Manual). Agner Fog (mentioned in the comments) measured the same numbers in his Intruction Tables for Intel CPUs (AMD is has a higher latency).
This leads us to the first problem: What specific microarchitecture are you investigating? This can make a big difference, even for the same vendor. Agner Fog reports that MOVSS has a 2-6cy latency on AMD Bulldozer depending on the source and destination (register vs memory). This is important to keep in mind when looking into performance of computer architectures.
The 200cy are most likely cache misses, as already pointed out be dwelch in the comments. The numbers you get from the Optimization Manual for any memory accessing instructions are all under the assumption that the data resides in the first level cache (L1). Now, if you have never touched the data by previous instructions the cache line (64 bytes with Intel and AMD x86) will need to be loaded from memory into the last level cache, form there into the second level cache, then into L1 and finally into the XMM register (within 1 cycle). Transfers between L3-L2 and L2-L1 have a throughput (not latency!) of two cycles per cache line on current Intel microarchitectures. And the memory bandwidth can be used to estimate the throughput between L3 and memory (e.g, a 2 GHz CPU with an achievable memory bandwidth of 40 GB/s will have a throughput of 3.2 cycles per cache line). Cache lines or memory blocks are typically the smallest unit caches and memory can operate on, they differ between microarchitectures and may even be different within the architecture, depending on the cache level (L1, L2 and so on).
Now this is all throughput and not latency, which will not help you estimate what you have described above. To verify this, you would need to execute the instructions over and over (for at least 1/10s) to get cycle accurate measurements. By changing the instructions you can decide if you want to measure for latency (by including dependencies between instructions) or throughput (by having the instructions input independent of the result of previous instructions). To measure for caches and memory accesses you would need to predict if an access is going to a cache or not, this can be done using layer conditions.
A tool to estimate instruction execution (both latency and throughput) for Intel CPUs is the Intel Architecture Code Analyzer, which supports multiple microarchitectures up to Haswell. The latency predictions are to be taken with the grain of salt, since it much harder to estimate latency than throughput.

Why isn't RDTSC a serializing instruction?

The Intel manuals for the RDTSC instruction warn that out of order execution can change when RDTSC is actually executed, so they recommend inserting a CPUID instruction in front of it because CPUID will serialize the instruction stream (CPUID is never executed out of order). My question is simple: if they had the ability to make instructions serializing, why didn't they make RDTSC serializing? The entire point of it appears to be to get cycle accurate timings. Is there a situation under which you would not want to precede it with a serializing instruction?
Newer Intel CPUs have a separate RDTSCP instruction that is serializing. Intel opted to introduce a separate instruction rather than change the behavior of RDTSC, which suggests to me that there has to be some situation where a potentially out of order timing is what you want. What is it?
The time stamp counter was introduced on the Pentium microarchitecture. Out-of-order execution didn't show up until the Pentium Pro. Intel could have made rdtsc serializing (architecturally or internally), but it seems that they decided to keep it non-serializing, which is OK for general-purpose time measurements, and leave it up to the programmer to add serializing instructions if necessary. This is good for reducing the overhead of the measurement.
That's actually confirmed in the document you provide, with the following comment about Pentium and Pentium/MMX (in 4.2, slightly paraphrased):
All of the rules and code samples described in section 4.1 (Pentium Pro and Pentium II) also apply to the Pentium and Pentium/MMX. The only difference is, the CPUID instruction is not necessary for serialization.
And, from Wikipedia:
The Time Stamp Counter is a 64-bit register present on all x86 processors since the Pentium.
: : :
Starting with the Pentium Pro, Intel processors have supported out-of-order execution, where instructions are not necessarily performed in the order they appear in the executable. This can cause RDTSC to be executed later than expected, producing a misleading cycle count.
One of the two uses of RDTSCP is to give you the processor ID in addition to the time stamp information (it's right there in the name Read Time-Stamp Counter *AND* Processor ID), which is useful on systems with unsynced TSCs across cores or sockets (See: How to get the CPU cycle count in x86_64 from C++?). The additional serialization properties of rdtscp makes it more convenient at the end of the region of interest (See: Is there any difference in between (rdtsc + lfence + rdtsc) and (rdtsc + rdtscp) in measuring execution time?).
If you are trying to use rdtsc to see if a branch mispredicts, the non-serializing version is what you want.
//math here
rdtsc
branch if zero to done
//do some work that always takes 1 cycle
done: rdtsc
If the branch is predicted correctly, the delta will be small (maybe even negative?). If the branch is mispredicted, the delta will be large.
With the serializing version, the branch condition will be resolved because the first rdtsc waits for the math to finish.
why didn't they make RDTSC serializing? The entire point of it appears to be to get cycle accurate timings
Well, most of the time it's to get high-resolution timestamps. At least some of the time, these timestamps are used for performance metrics. Making the intruction serializing would likely require a pipeline flush, which can be very expensive for CPU-bound applications.
Intel opted to introduce a separate instruction rather than change the behavior of RDTSC, which suggests to me that there has to be some situation where a potentially out of order timing is what you want.
Changing the behavior is almost always undesirable. Intel's customers would be disappointed to find out that RDTSC does something different on newer parts.
As paxdiably explains, RDTSC predates the concept of "serializing" instructions because it was implemented on an in-order CPU. Adding that behavior later would change the memory access behavior of code using it, and thus be incompatible for some purposes.
Instead, more recent CPUs have a related RDTSCP instruction that is defined as serializing (actually stronger: it promises to wait until all instructions issued before it have completed, not just that memory accesses have been done), for exactly this reason. Use that if you are running on modern CPUs.

How many CPU cycles are needed for each assembly instruction?

I heard there is Intel book online which describes the CPU cycles needed for a specific assembly instruction, but I can not find it out (after trying hard). Could anyone show me how to find CPU cycle please?
Here is an example, in the below code, mov/lock is 1 CPU cycle, and xchg is 3 CPU cycles.
// This part is Platform dependent!
#ifdef WIN32
inline int CPP_SpinLock::TestAndSet(int* pTargetAddress,
int nValue)
{
__asm
{
mov edx, dword ptr [pTargetAddress]
mov eax, nValue
lock xchg eax, dword ptr [edx]
}
// mov = 1 CPU cycle
// lock = 1 CPU cycle
// xchg = 3 CPU cycles
}
#endif // WIN32
BTW: here is the URL for the code I posted: http://www.codeproject.com/KB/threads/spinlocks.aspx
Modern CPUs are complex beasts, using pipelining, superscalar execution, and out-of-order execution among other techniques which make performance analysis difficult... but not impossible!
While you can no longer simply add together the latencies of a stream of instructions to get the total runtime, you can still get a (often) highly accurate analysis of the behavior of some piece of code (especially a loop) as described below and in other linked resources.
Instruction Timings
First, you need the actual timings. These vary by CPU architecture, but the best resource currently for x86 timings is Agner Fog's instruction tables. Covering no less than thirty different microarchitecures, these tables list the instruction latency, which is the minimum/typical time that an instruction takes from inputs ready to output available. In Agner's words:
Latency: This is the delay that the instruction generates in a
dependency chain. The numbers are minimum values. Cache misses,
misalignment, and exceptions may increase the clock counts
considerably. Where hyperthreading is enabled, the use of the same
execution units in the other thread leads to inferior performance.
Denormal numbers, NAN's and infinity do not increase the latency. The
time unit used is core clock cycles, not the reference clock cycles
given by the time stamp counter.
So, for example, the add instruction has a latency of one cycle, so a series of dependent add instructions, as shown, will have a latency of 1 cycle per add:
add eax, eax
add eax, eax
add eax, eax
add eax, eax # total latency of 4 cycles for these 4 adds
Note that this doesn't mean that add instructions will only take 1 cycle each. For example, if the add instructions were not dependent, it is possible that on modern chips all 4 add instructions can execute independently in the same cycle:
add eax, eax
add ebx, ebx
add ecx, ecx
add edx, edx # these 4 instructions might all execute, in parallel in a single cycle
Agner provides a metric which captures some of this potential parallelism, called reciprocal throughput:
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind
in the same thread.
For add this is listed as 0.25 meaning that up to 4 add instructions can execute every cycle (giving a reciprocal throughput of 1 / 4 = 0.25).
The reciprocal throughput number also gives a hint at the pipelining capability of an instruction. For example, on most recent x86 chips, the common forms of the imul instruction have a latency of 3 cycles, and internally only one execution unit can handle them (unlike add which usually has four add-capable units). Yet the observed throughput for a long series of independent imul instructions is 1/cycle, not 1 every 3 cycles as you might expect given the latency of 3. The reason is that the imul unit is pipelined: it can start a new imul every cycle, even while the previous multiplication hasn't completed.
This means a series of independent imul instructions can run at up to 1 per cycle, but a series of dependent imul instructions will run at only 1 every 3 cycles (since the next imul can't start until the result from the prior one is ready).
So with this information, you can start to see how to analyze instruction timings on modern CPUs.
Detailed Analysis
Still, the above is only scratching the surface. You now have multiple ways of looking at a series of instructions (latency or throughput) and it may not be clear which to use.
Furthermore, there are other limits not captured by the above numbers, such as the fact that certain instructions compete for the same resources within the CPU, and restrictions in other parts of the CPU pipeline (such as instruction decoding) which may result in a lower overall throughput than you'd calculate just by looking at latency and throughput. Beyond that, you have factors "beyond the ALUs" such as memory access and branch prediction: entire topics unto themselves - you can mostly model these well, but it takes work. For example here's a recent post where the answer covers in some detail most of the relevant factors.
Covering all the details would increase the size of this already long answer by a factor of 10 or more, so I'll just point you to the best resources. Agner Fog has an Optimizing Asembly guide that covers in detail the precise analysis of a loop with a dozen or so instructions. See "12.7 An example of analysis for bottlenecks in vector loops" which starts on page 95 in the current version of the PDF.
The basic idea is that you create a table, with one row per instruction and mark the execution resources each uses. This lets you see any throughput bottlenecks. In addition, you need to examine the loop for carried dependencies, to see if any of those limit the throughput (see "12.16 Analyzing dependencies" for a complex case).
If you don't want to do it by hand, Intel has released the Intel Architecture Code Analyzer, which is a tool that automates this analysis. It currently hasn't been updated beyond Skylake, but the results are still largely reasonable for Kaby Lake since the microarchitecture hasn't changed much and therefore the timings remain comparable. This answer goes into a lot of detail and provides example output, and the user's guide isn't half bad (although it is out of date with respect to the newest versions).
Other sources
Agner usually provides timings for new architectures shortly after they are released, but you can also check out instlatx64 for similarly organized timings in the InstLatX86 and InstLatX64 results. The results cover a lot of interesting old chips, and new chips usually show up fairly quickly. The results are mostly consistent with Agner's, with a few exceptions here and there. You can also find memory latency and other values on this page.
You can even get the timing results directly from Intel in their IA32 and Intel 64 optimization manual in Appendix C: INSTRUCTION LATENCY AND THROUGHPUT. Personally I prefer Agner's version because they are more complete, often arrive before the Intel manual is updated, and are easier to use as they provide a spreadsheet and PDF version.
Finally, the x86 tag wiki has a wealth of resources on x86 optimization, including links to other examples of how to do a cycle accurate analysis of code sequences.
If you want a deeper look into the type of "dataflow analysis" described above, I would recommend A Whirlwind Introduction to Data Flow Graphs.
Given pipelining, out of order processing, microcode, multi-core processors, etc there's no guarantee that a particular section of assembly code will take exactly x CPU cycles/clock cycle/whatever cycles.
If such a reference exists, it will only be able to provide broad generalizations given a particular architecture, and depending on how the microcode is implemented you may find that the Pentium M is different than the Core 2 Duo which is different than the AMD dual core, etc.
Note that this article was updated in 2000, and written earlier. Even the Pentium 4 is hard to pin down regarding instruction timing - PIII, PII, and the original pentium were easier, and the texts referenced were probably based on those earlier processors that had a more well-defined instruction timing.
These days people generally use statistical analysis for code timing estimation.
What the other answers say about it being impossible to accurately predict the performance of code running on a modern CPU is true, but that doesn't mean the latencies are unknown, or that knowing them is useless.
The exact latencies for Intels and AMD's processors are listed in Agner Fog's instruction tables. See also Intel® 64 and IA-32 Architectures Optimization Reference Manual, and Instruction latencies and throughput for AMD and Intel x86 processors (from Can Berk Güder's now-deleted link-only answer). AMD also has pdf manuals on their own website with their official values.
For (micro-)optimizing tight loops, knowing the latencies for each instruction can help a lot in manually trying to schedule your code. The programmer can make a lot of optimizations that the compiler can't (because the compiler can't guarantee it won't change the meaning of the program).
Of course, this still requires you to know a lot of other details about the CPU, such as how deeply pipelined it is, how many instructions it can issue per cycle, number of execution units and so on. And of course, these numbers vary for different CPU's. But you can often come up with a reasonable average that more or less works for all CPU's.
It's worth noting though, that it is a lot of work to optimize even a few lines of code at this level. And it is easy to make something that turns out to be a pessimization. Modern CPUs are hugely complicated, and they try extremely hard to get good performance out of bad code. But there are also cases they're unable to handle efficiently, or where you think you're clever and making efficient code, and it turns out to slow the CPU down.
Edit
Looking in Intel's optimization manual, table C-13:
The first column is instruction type, then there is a number of columns for latency for each CPUID. The CPUID indicates which processor family the numbers apply to, and are explained elsewhere in the document. The latency specifies how many cycles it takes before the result of the instruction is available, so this is the number you're looking for.
The throughput columns show how many of this type of instructions can be executed per cycle.
Looking up xchg in this table, we see that depending on the CPU family, it takes 1-3 cycles, and a mov takes 0.5-1. These are for the register-to-register forms of the instructions, not for a lock xchg with memory, which is a lot slower. And more importantly, hugely-variable latency and impact on surrounding code (much slower when there's contention with another core), so looking only at the best-case is a mistake. (I haven't looked up what each CPUID means, but I assume the .5 are for Pentium 4, which ran some components of the chip at double speed, allowing it to do things in half cycles)
I don't really see what you plan to use this information for, however, but if you know the exact CPU family the code is running on, then adding up the latency tells you the minimum number of cycles required to execute this sequence of instructions.
Measuring and counting CPU-cycles does not make sense on the x86 anymore.
First off, ask yourself for which CPU you're counting cycles? Core-2? a Athlon? Pentium-M? Atom? All these CPUs execute x86 code but all of them have different execution times. The execution even varies between different steppings of the same CPU.
The last x86 where cycle-counting made sense was the Pentium-Pro.
Also consider, that inside the CPU most instructions are transcoded into microcode and executed out of order by a internal execution unit that does not even remotely look like a x86. The performance of a single CPU instruction depends on how much resources in the internal execution unit is available.
So the time for a instruction depends not only on the instruction itself but also on the surrounding code.
Anyway: You can estimate the throughput-resource usage and latency of instructions for different processors. The relevant information can be found at the Intel and AMD sites.
Agner Fog has a very nice summary on his web-site. See the instruction tables for latency, throughput, and uop count. See the microarchictecture PDF to learn how to interpret those.
http://www.agner.org/optimize
But note that xchg-with-memory does not have predictable performance, even if you look at only one CPU model. Even in the no-contention case with the cache-line already hot in L1D cache, being a full memory barrier will mean it's impact depends a lot on loads and stores to other addresses in the surrounding code.
Btw - since your example-code is a lock-free datastructure basic building block: Have you considered using the compiler built-in functions? On win32 you can include intrin.h and use functions such as _InterlockedExchange.
That'll give you better execution time because the compiler can inline the instructions. Inline-assembler always forces the compiler to disable optimizations around the asm-code.
lock xchg eax, dword ptr [edx]
Note the lock will lock memory for the memory fetch for all cores, this can take 100 cycles on some multi cores and a cache line will also need to be flushed. It will also stall the pipeline. So i wouldnt worry about the rest.
So optimal performance gets back to tuning your algorithms critical regions.
Note on a single core you can optmize this by removing the lock but it is needed for multi core.

Resources