Does changing one single bit consume fewer cycles per byte than adding/subtracting/XORing an entire processor word? - cpu

Let's suppose I change one single bit in a word and add two other words.
Does changing one bit in a word consume less CPU cycles than changing an entire word?
If it consumes less CPU cycles, how much faster would it be?

Performance (in clock cycles) is not data-dependent for integer ALU instructions other than division on most CPUs. ADD and XOR have the same 1-cycle latency on the majority of modern pipelined CPUs. (And the same cycle cost as each other on most older / simpler CPUs, whether or not it's 1 cycle.)
See https://agner.org/optimize/ and https://uops.info/ for numbers on modern x86 CPUs.
Lower power can indirectly affect performance by allowing higher boost clocks without having to slow down for thermal limits. But the difference in this case is so small that I don't expect it would be a measurable difference on a mainstream CPU, like the efficiency cores of an Alder Lake, or even a mobile phone CPU that's more optimized for low power.
Power in a typical CPU (using CMOS logic) scales with how many gates have their outputs change value per cycle. When a transistor switches on, it conducts current from Vcc or to ground, charging or discharging the tiny parasitic capacitance of the things the logic gate's output is connected to. Since the majority of the (low) resistance in the path of that current is in the transistor itself, that's where the electrical energy turns into heat.
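That switching cost is commonly summarized by the standard dynamic-power approximation, with α the activity factor (fraction of nodes toggling per cycle), C the switched capacitance, V the supply voltage, and f the clock frequency:

```latex
P_{\text{dynamic}} \approx \alpha \, C \, V^{2} \, f
```

Fewer bit-flips per cycle means a smaller α, which is the only term operand values can influence.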
For more details, see:
Why does switching cause power dissipation? on electronics.SE for the details for one CMOS gate
For a mathematical operation in CPU, could power consumption depend on the operands?
Modern Microprocessors: A 90-Minute Guide! has a section about power. (And read the whole article if you have any general interest in CPU architecture; it's good stuff.)
ADD does require carry propagation potentially across the whole width of the word, e.g. for 0xFFFFFFFF + 1, so ALUs use tricks like carry-lookahead or carry-select to keep the worst case gate-delay latency within one cycle.
So ADD involves more gates than a simple bitwise operation like XOR, but still not many compared to the amount of gates involved in controlling all the decode and other control logic to get the operands to the ALU and the result written back (and potentially bypass-forwarded to later instructions that use the result right away.)
Also, a typical ALU probably doesn't have fully separate adder vs. bitwise units, so a lot of those adder gates are probably seeing their inputs change, but control signals block carry propagation. (i.e. a typical ALU implements XOR using a lot of the same gates as ADD, but with control signals gating AND gates to allow or block carry propagation. XOR is add-without-carry.) An integer ALU in a CPU will usually be at least an adder-subtractor, so one of the inputs comes through multiple gates, with other control signals that can make it do bitwise ops.
But there are still maybe a few fewer bit-flips when doing an XOR operation than an ADD. Partly it would depend on what the previous outputs were (of whatever computation the ALU did in the previous cycle, not the value of one of the inputs to the XOR). But with carry propagation blocked by AND gates, flipping the inputs to those gates doesn't change the outputs, so less capacitance is charged or discharged.
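The adder-vs-XOR relationship can be sketched in software. This is a toy model, not any real ALU's netlist: each bit is a full-adder cell, and a control input blocks carry propagation, turning ADD into XOR.

```c
#include <stdint.h>

/* Toy bit-slice model of an ALU cell.  Per bit:
 *   sum  = a ^ b ^ cin
 *   cout = (a & b) | (cin & (a ^ b))
 * Forcing the carry chain to zero (propagate_carry == 0) yields XOR:
 * XOR is add-without-carry. */
static uint32_t ripple_add(uint32_t a, uint32_t b, int propagate_carry)
{
    uint32_t sum = 0, cin = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t ai = (a >> i) & 1, bi = (b >> i) & 1;
        sum |= (ai ^ bi ^ cin) << i;
        cin = propagate_carry ? ((ai & bi) | (cin & (ai ^ bi))) : 0;
    }
    return sum;
}
```

With the carry inputs held at zero, toggling `a` or `b` flips fewer downstream nodes per cycle, which is the power difference discussed above.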
In a high-performance CPU, a lot of power is spent on pipelining and out-of-order exec, tracking instructions in flight, and writing back the results. So even the whole ALU ADD operation is a pretty minor component of total energy cost to execute the instruction. Small differences in that power due to operands are an even smaller difference. Pretty much negligible compared to how many gates flip every clock cycle just to get data and control signals sent to the right place.
Another tiny effect: if your CPU didn't do register renaming, then possibly a few fewer transistors might flip (in the register file's SRAM) when writing back the result if it's almost the same as what that register held before.
(Assuming an ISA like x86 where you do xor dst, src for dst ^= src, not a 3-operand ISA where xor dst, src1, src2 could be overwriting a different value if you didn't happen to pick the same register for dst and src1.)
If your CPU does out-of-order exec with register renaming, writes to the register file won't be overwriting the same SRAM cells as the original destination value, so it depends what other values were computed recently in registers.
If you want to see a measurable difference in power, run instructions like integer multiply, or FP mul or FMA. Or SIMD instructions, so the CPU is doing 4x or 8x 32-bit addition or shuffle in parallel. Or 8x 32-bit FMA. The max-power workload on a typical modern x86 CPU is two 256-bit FMAs per clock cycle.
See also:
Do sse instructions consume more power/energy? - Mysticial's answer is excellent, and discusses the race-to-sleep benefit of doing the same work faster and with fewer instructions, even if each one costs somewhat more power.
Why does the CPU get hotter when performing heavier calculations, compared to being idle?
How do I achieve the theoretical maximum of 4 FLOPs per cycle?

Related

Optimizing double->long conversion with memory operand using perf?

I have an x86-64 Linux program which I am attempting to optimize via perf. The perf report shows the hottest instructions are scalar conversions from double to long with a memory argument, for example:
cvttsd2si (%rax, %rdi, 8), %rcx
which corresponds to C code like:
extern double *array;
long val = (long)array[idx];
(This is an unusual bottleneck but the code itself is very unusual.)
To inform optimizations I want to know if these instructions are hot because of the load from memory, or because of the arithmetic conversion itself. What's the best way to answer this question? What other data should I collect and how should I proceed to optimize this?
Some things I have looked at already. CPU counter results show 1.5% cache misses per instruction:
  30686845493287      cache-references
   2140314044614      cache-misses           #    6.975 % of all cache refs
        52970546      page-faults
1244774326560850      instructions
 194784478522820      branches
   2573746947618      branch-misses          #    1.32% of all branches
        52970546      faults
Top-down performance monitors show we are primarily backend-bound:
frontend bound   retiring   bad speculation   backend bound
10.1%            25.9%      4.5%              59.5%
Ad-hoc measurement with top shows all CPUs pegged at 100%, suggesting we are not waiting on memory.
A final note of interest: when run on AWS EC2, the code is dramatically slower (44%) on AMD vs Intel with the same core count. (Tested on Ice Lake 8375C vs EPYC 7R13). What could explain this discrepancy?
Thank you for any help.
To inform optimizations I want to know if these instructions are hot because of the load from memory, or because of the arithmetic conversion itself. What's the best way to answer this question?
I think there are two main reasons for this instruction to be slow: 1. There is a dependency chain and the latency of this instruction is a problem since the processor is waiting on it to execute other instructions. 2. There is a cache miss (saturating the memory bandwidth with such instructions is improbable unless many cores are doing memory-based operations).
First of all, tracking what is going on for a specific instruction is hard (especially if the instruction is not executed many times). You need to use precise events to track the root of the problem, that is, events for which the exact instruction addresses that caused the event are available. Only a (small) subset of all events are precise ones.
Regarding (1), the latency of the instruction should be about 12 cycles on both architectures (although it might be slightly more on the AMD processor, I do not expect a 44% difference). The target processors are able to execute multiple instructions at the same time in a given cycle. Instructions are executed on different ports and are also pipelined. The port usage matters for understanding what is going on. This means all the instructions in the hot loop matter; you cannot isolate this specific instruction. Modern processors are insanely complex, so a basic analysis can be tricky. On Ice Lake processors, you can measure the average port usage with events like UOPS_DISPATCHED.PORT_XXX where XXX can be 0, 1, 2_3, 4_9, 5, 6, 7_8. Only the first three matter for this instruction. The EXE_ACTIVITY.XXX events may also be useful. You should check if a port is saturated, and which one. AFAIK, none of these events are precise, so you can only analyse a block of code (typically the hot loop). On Zen 3, the ports are FP23 and FP45. IDK what the useful events are on this architecture (I am not very familiar with it).
On Ice Lake, you can check the FRONTEND_RETIRED.LATENCY_GE_XXX events where XXX is a power-of-two integer (these should be precise ones, so you can see if this instruction is the one impacting the events). This helps you see whether the front-end or the back-end is the limiting factor.
Regarding (2), you can check the latency of the memory accesses as well as the number of L1/L2/L3 cache hits/misses. On Ice Lake, you can use events like MEM_LOAD_RETIRED.XXX where XXX can be, for example, L1_MISS, L1_HIT, L2_MISS, L2_HIT, L3_MISS or L3_HIT. Still on Ice Lake, it may be useful to track the latency of the memory operations with MEM_TRANS_RETIRED.LOAD_LATENCY_GT_XXX where XXX is again a power-of-two integer.
You can also use LLVM-MCA to statically simulate the scheduling of the loop instructions on the target architecture (it does not consider branches). This is very useful for understanding in depth what the scheduler can do.
What could explain this discrepancy?
The latency and reciprocal throughput should be about the same on the two platforms, or at least close. That being said, for the same core count, the two certainly do not operate at the same frequency. If the difference is not coming from that, then I doubt this instruction alone is actually the problem (tricky scheduling issues, wrong/inaccurate profiling results, etc.).
CPU counter results show 1.5% cache misses per instruction
The thing is, the cache-misses event is certainly not very informative here. Indeed, it references the last-level cache (L3) misses. Thus, it does not give any information about L1/L2 misses (the events above do).
how should I proceed to optimize this?
If the code is latency bound, the solution is to first break any dependency chain in this loop. Unrolling the loop and rewriting it to make it more SIMD-friendly can help a lot to improve performance (the reciprocal throughput of this instruction is about 1 cycle as opposed to 12 for the latency, so there is room for improvement in this case).
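As a sketch of that advice, assuming the hot loop is essentially `out[i] = (long)in[i]` (a guess; the real loop isn't shown), unrolling so several independent conversions are in flight at once lets the ~1-cycle reciprocal throughput dominate instead of the ~12-cycle latency, and gives the compiler a chance to vectorize:

```c
#include <stddef.h>

/* Hypothetical rewrite: four independent double->long conversions per
 * iteration, so no result feeds the next conversion and the out-of-order
 * core can overlap them.  (long) truncates toward zero, matching
 * cvttsd2si. */
void convert_unrolled(const double *in, long *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i]     = (long)in[i];
        out[i + 1] = (long)in[i + 1];
        out[i + 2] = (long)in[i + 2];
        out[i + 3] = (long)in[i + 3];
    }
    for (; i < n; i++)          /* scalar tail */
        out[i] = (long)in[i];
}
```

Whether this helps depends on whether the original loop actually carries a dependency through the converted values; check the generated asm.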
If the code is memory bound, then you should care about data locality. Data should fit in the L1 cache if possible. There are many tips for doing so, but it is hard to guide you without more context. They include, for example, sorting data, reordering loop iterations, and using smaller data types.
There are many possible sources of weird, unexpected behaviour that can occur. If such a thing happens, then it is nearly impossible to understand what is going on without the exact code executed. All details matter in this case.

Fast hardware integer division

The hardware instruction for integer division has historically been very slow. For example, DIVQ on Skylake has a latency of 42-95 cycles [1] (and a reciprocal throughput of 24-90) for 64-bit inputs.
There are newer processors, however, which perform much better: Goldmont has 14-43 latency and Ryzen has 14-47 latency [1], M1 apparently has "throughput of 2 clock cycles per divide" [2], and even the Raspberry Pi Pico has an "8-cycle signed/unsigned divide/modulo circuit, per core" (though that seems to be for 32-bit inputs) [3].
My question is, what has changed? Was there a new algorithm invented? What algorithms do the new processors employ for division, anyway?
[1] https://www.agner.org/optimize/#manuals
[2] https://ridiculousfish.com/blog/posts/benchmarking-libdivide-m1-avx512.html
[3] https://raspberrypi.github.io/pico-sdk-doxygen/group__hardware__divider.html#details
On Intel before Ice Lake, 64-bit operand-size is an outlier, much slower than 32-bit operand size for integer division. div r32 is 10 uops, with 26 cycle worst-case latency but 6 cycle throughput. (https://uops.info/ and https://agner.org/optimize/, and Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux has detailed exploration.)
There wasn't a fundamental change in how divide units are built, just widening the HW divider to not need extended-precision microcode. (Intel has had fast-ish dividers for FP for much longer, and that's basically the same problem just with only 53 bits instead of 64. The hard part of FP division is integer division of the mantissas; subtracting the exponents is easy and done in parallel.)
The incremental changes are things like widening the radix to handle more bits with each step. And for example pipelining the refinement steps after the initial (table lookup?) value, to improve throughput but not latency.
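To illustrate why radix matters, here is a toy shift-and-subtract (radix-2, restoring) divider, which is the baseline hardware algorithm: it retires one quotient bit per step, so 32 steps for 32-bit operands, whereas a radix-16 unit retires 4 bits per step and a radix-1024 unit 10 bits per step.

```c
#include <stdint.h>

/* Toy radix-2 restoring divider (not any shipping design): one quotient
 * bit per iteration.  Works for d >= 1 and d < 2^31 in this simple form.
 * Hardware raises the radix so each step compares against multiples of d
 * and retires several bits at once. */
static uint32_t div_shift_sub(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);  /* bring down next dividend bit */
        if (r >= d) {                    /* trial subtraction succeeds */
            r -= d;
            q |= 1u << i;
        }
    }
    *rem = r;
    return q;
}
```

The cycle counts quoted above (8 to ~40-ish) line up with "bits of operand divided by bits per step, plus fixed overhead".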
Related:
How sqrt() of GCC works after compiled? Which method of root is used? Newton-Raphson? - a brief high-level overview of the div/sqrt units that modern CPUs use, with for example a Radix-1024 divider being new in Broadwell.
Do FP and integer division compete for the same throughput resources on x86 CPUs? (No in Ice Lake and later on Intel; having a dedicated integer unit instead of using the low element of the FP mantissa divide/sqrt unit is presumably related to making it 64 bits wide.)
Divide units historically were often not pipelined at all, as that's hard because it requires replicating a lot of gates instead of iterating on the same multipliers, I think. And most software usually avoids (or avoided) integer division because it was historically very expensive, or at least does it infrequently enough to not benefit very much from higher-throughput dividers with the same latency.
But with wider CPU pipelines with higher IPC shrinking the cycle gap between divisions, it's more worth doing. Also with huge transistor budgets, spending a bunch on something that will sit idle for a lot of the time in most programs still makes sense if it's very helpful for a few programs. (Like wider SIMD, and specialized execution units like x86 BMI2 pdep / pext). Dark silicon is necessary or chips would melt; power density is a huge concern, see Modern Microprocessors: A 90-Minute Guide!
Also, with more and more software being written by people who don't know anything about performance, and more code avoiding compile-time constants in favour of being flexible (function args that ultimately come from some config option), I'd guess modern software doesn't avoid division as much as older programs did.
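For illustration, this is the kind of strength-reduction compilers apply when the divisor is a compile-time constant, sidestepping the divide unit entirely (sketch for unsigned division by 3; the magic constant 0xAAAAAAAB is ceil(2^33 / 3)):

```c
#include <stdint.h>

/* Reciprocal-multiplication trick for a constant divisor:
 * x / 3 == (x * ceil(2^33 / 3)) >> 33 for all 32-bit unsigned x.
 * One widening multiply plus a shift, instead of a DIV instruction. */
static uint32_t div3(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}
```

A runtime divisor forecloses this trick (libraries like libdivide recompute the magic constant at runtime to get it back), which is exactly why flexible, config-driven code executes more real divide instructions.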
Floating-point division is often harder to avoid than integer, so it's definitely worth having fast FP dividers. And integer can borrow the mantissa divider from the low SIMD element, if there isn't a dedicated integer-divide unit.
So that FP motivation was likely the actual driving force behind Intel's improvements to divide throughput and latency even though they left 64-bit integer division with garbage performance until Ice Lake.

latency vs throughput in intel intrinsics

I think I have a decent understanding of the difference between latency and throughput, in general. However, the implications of latency on instruction throughput are unclear to me for Intel Intrinsics, particularly when using multiple intrinsic calls sequentially (or nearly sequentially).
For example, let's consider:
_mm_cmpestrc
This has a latency of 11, and a throughput of 7 on a Haswell processor. If I ran this instruction in a loop, would I get a continuous per cycle-output after 11 cycles? Since this would require 11 instructions to be running at a time, and since I have a throughput of 7, do I run out of "execution units"?
I am not sure how to use latency and throughput other than to get an impression of how long a single instruction will take relative to a different version of the code.
For a much more complete picture of CPU performance, see Agner Fog's microarchitecture guide and instruction tables. (Also his Optimizing C++ and Optimizing Assembly guides are excellent). See also other links in the x86 tag wiki, especially Intel's optimization manual.
See also
https://uops.info/ for accurate tables collected programmatically from microbenchmarks, so they're free from editing errors like Agner's tables sometimes have.
How many CPU cycles are needed for each assembly instruction?
and What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? for more details about using instruction-cost numbers.
What is the efficient way to count set bits at a position or lower? For an example of analyzing short sequences of asm in terms of front-end uops, back-end ports, and latency.
Modern Microprocessors: A 90-Minute Guide! very good intro to the basics of CPU pipelines and HW design constraints like power.
Latency and throughput for a single instruction are not actually enough to get a useful picture for a loop that uses a mix of vector instructions. Those numbers don't tell you which intrinsics (asm instructions) compete with each other for throughput resources (i.e. whether they need the same execution port or not). They're only sufficient for super-simple loops that e.g. load / do one thing / store, or e.g. sum an array with _mm_add_ps or _mm_add_epi32.
You can use multiple accumulators to get more instruction-level parallelism, but you're still only using one intrinsic so you do have enough information to see that e.g. CPUs before Skylake can only sustain a throughput of one _mm_add_ps per clock, while SKL can start two per clock cycle (reciprocal throughput of one per 0.5c). It can run ADDPS on both its fully-pipelined FMA execution units, instead of having a single dedicated FP-add unit, hence the better throughput but worse latency than Haswell (3c lat, one per 1c tput).
Since _mm_add_ps has a latency of 4 cycles on Skylake, that means 8 vector-FP add operations can be in flight at once. So you need 8 independent vector accumulators (which you add to each other at the end) to expose that much parallelism. (e.g. manually unroll your loop with 8 separate __m256 sum0, sum1, ... variables. Compiler-driven unrolling (compile with -funroll-loops -ffast-math) will often use the same register, but loop overhead wasn't the problem).
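Here is the multiple-accumulator idea as a minimal scalar-C sketch; the same pattern applies with __m256 vectors and 8 accumulators, just with wider elements per add:

```c
#include <stddef.h>

/* Four independent sums hide FP-add latency: each dependency chain only
 * sees every 4th addition, so the others can be in flight in between.
 * (Without -ffast-math a compiler must keep the serial order, which is
 * why manual accumulators help.) */
float sum_4acc(const float *a, size_t n)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);  /* combine chains at the end */
    for (; i < n; i++)                /* scalar tail */
        s += a[i];
    return s;
}
```

With 4-cycle latency and 2-per-clock throughput on Skylake, 8 such chains (not 4) are needed to saturate the FP add ports, as noted above.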
Those numbers also leave out the third major dimension of Intel CPU performance: fused-domain uop throughput. Most instructions decode to a single uop, but some decode to multiple uops. (Especially the SSE4.2 string instructions like the _mm_cmpestrc you mentioned: PCMPESTRI is 8 uops on Skylake). Even if there's no bottleneck on any specific execution port, you can still bottleneck on the frontend's ability to keep the out-of-order core fed with work to do. Intel Sandybridge-family CPUs can issue up to 4 fused-domain uops per clock, and in practice can often come close to that when other bottlenecks don't occur. (See Is performance reduced when executing loops whose uop count is not a multiple of processor width? for some interesting best-case frontend throughput tests for different loop sizes.) Since load/store instructions use different execution ports than ALU instructions, this can be the bottleneck when data is hot in L1 cache.
And unless you look at the compiler-generated asm, you won't know how many extra MOVDQA instructions the compiler had to use to copy data between registers, to work around the fact that without AVX, most instructions replace their first source register with the result. (i.e. destructive destination). You also won't know about loop overhead from any scalar operations in the loop.
I think I have a decent understanding of the difference between latency and throughput
Your guesses don't seem to make sense, so you're definitely missing something.
CPUs are pipelined, and so are the execution units inside them. A "fully pipelined" execution unit can start a new operation every cycle (throughput = one per clock)
(reciprocal) Throughput is how often an operation can start when no data dependencies force it to wait, e.g. one per 7 cycles for this instruction.
Latency is how long it takes for the results of one operation to be ready, and usually matters only when it's part of a loop-carried dependency chain.
If the next iteration of a loop operates independently from the previous, then out-of-order execution can "see" far enough ahead to find the instruction-level parallelism between two iterations and keep itself busy, bottlenecking only on throughput.
See also Latency bounds and throughput bounds for processors for operations that must occur in sequence for an example of a practice problem from CS:APP with a diagram of two dep chains, one also depending on results from the other.

Instruction Completion Rate Vs. Instruction Throughput Vs. Instructions Per Clock

From what I understand:
ICR (Instruction Completion Rate): Is (# of instructions / time)
Instruction Throughput: Is usually an average of the number of instructions completed each clock cycle.
IPC (Instructions Per Clock): Is how many instructions are being completed each clock cycle. (Maybe this is usually an average?)
I'm confused on these definitions, I'm definitely looking for clarification. They might even be wrong, I've been having a tough time finding clear definitions of them.
How does the instruction completion rate affect overall performance of the processor?
How is Instruction Throughput affected compared to IPC?
Instruction throughput is typically used with respect to a specific type of instruction and is meant to provide instruction scheduling information in the context of structural hazards. For example, one might say "this fully pipelined multiplier has a latency of three cycles and an instruction throughput of one". Repeat rate is the inverse of throughput.
IPC describes performance per cycle, while your definition of instruction completion rate describes performance directly (independent of clock frequency).
(Of course, the performance value of "instruction" depends on the instruction set, the compiler, and the application — all of which influence the number of (and types of) instructions executed to complete a task. In addition, the relative performance of different instructions can depend on the hardware implementation; this can, in turn, drive compilation changes and sometimes application programming changes and even ISA changes.)
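To make the distinction concrete (with made-up counter values, not measurements): IPC is normalized per cycle, while the completion rate folds in the clock frequency.

```c
/* IPC: per-cycle measure, independent of clock speed. */
static double ipc(double instructions, double cycles)
{
    return instructions / cycles;
}

/* Completion rate (instructions per second) = IPC * frequency,
 * since cycles / frequency = elapsed time. */
static double completion_rate(double instructions, double cycles, double hz)
{
    return ipc(instructions, cycles) * hz;
}
```

So a core retiring 8e9 instructions in 4e9 cycles has an IPC of 2.0 regardless of frequency, but its completion rate doubles if you double the clock.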

Count clock cycles from assembly source code?

I have the source code written and I want to measure efficiency as how many clock cycles it takes to complete a particular task. Where can I learn how many clock cycles different commands take? Does every command take the same amount of time on 8086?
RDTSC is the high-resolution clock fetch instruction.
Bear in mind that cache misses, context switches, instruction reordering and pipelining, and multicore contention can all interfere with the results.
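A minimal RDTSC sketch, assuming x86 and a GCC/Clang-style compiler; for careful measurements you would add serialization (lfence or rdtscp), precisely because of the instruction reordering mentioned above, and note that on modern CPUs the TSC counts fixed-rate reference cycles, not core clock cycles:

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(); x86 + GCC/Clang assumed */

/* Time a busy loop in TSC ticks.  volatile keeps the compiler from
 * deleting the work between the two timestamp reads. */
static uint64_t cycles_for_busy_loop(uint64_t iters)
{
    uint64_t t0 = __rdtsc();
    volatile uint64_t sink = 0;
    for (uint64_t i = 0; i < iters; i++)
        sink += i;
    return __rdtsc() - t0;
}
```

Run it several times and take the minimum; the first runs absorb cache and branch-predictor warm-up, and any interrupt or context switch inflates a single reading.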
Clock cycles and efficiency are not the same thing.
For efficiency of code you need to consider, in particular, how the memory is utilised, especially the differing levels of the cache. Also important is the branch prediction behaviour of the code. You want a profiler that tells you these things, ideally one that gives you microarchitecture-specific information; one example is CodeAnalyst for AMD chips.
To answer your question, particular base instructions do have a given (average) number of cycles (AMD releases approximate numbers for the basic maths functions in their maths library). These numbers are a poor place to start optimising code, however.
