For example, a modern i7-8700k can supposedly do ~60 GFLOPS (single-precision, source) while its maximum frequency is 4.7GHz. As far as I am aware, an instruction has to take at least one cycle to complete, so how is this possible?
There are multiple factors that are all multiplied together for this large effect:
SIMD, Intel 8700k and similar processors support AVX and AVX2, which includes many instructions that operate on registers that can hold 8 floats at the same time.
multiple cores, 8700k has 6 cores.
fused multiply-add, part of AVX2, has both a multiplication and addition in the same instruction.
high throughput execution. The latency (time an individual instruction takes) is not directly important to how much computation a processor can do in a unit of time. A modern CPU such as 8700k can start executing two (independent) FMAs in the same cycle (and keep in mind these are still SIMD instructions so that represents a lot of floating point operations) even through the latency of the operation is actually 4 cycles.
Multiplying all those factors together we get: 8 * 6 * 2 * 2 * 4.3 = 825 GFLOPS (matching the stats reported here). This calculation certainly does not mean that it can actually be attained. For example the processor may downclock significantly under such a workload in order to stay within its power budget, which is what Intel has been doing at least since Haswell (though the specifics have changed and it applied to server parts). Also, most real code has significant trouble feeding that many FMAs with data. Large matrix multiplications can get close though, and for example according to these stats the 8700k reached 496.7 Gflops in their SGEMM benchmark. Possibly the 8700k's max AVX2 turbo speed on 6 cores is 2.6GHz but as far as I can find it does not have an AVX offset by default (only needed when overclocked), or that GEMM is just not that close to hitting peak FLOPS.
Related
I got access to the AMD Zen4 server and tested AVX-512 packed double performance. I chose Harmonic Series Sum[1/n over positive integers] and compared the performance using standard doubles, AVX2 (4 packed doubles) and AVX-512 (8 packed doubles). The test code is here.
AVX-256 version runs four times faster than the standard double version. I was expecting the AVX-512 version to run two times faster than the AVX-256 version, but there was barely any improvement in runtimes:
Method Runtime (minutes:seconds)
HarmonicSeriesPlain 0:41.33
HarmonicSeriesAVX256 0:10.32
HarmonicSeriesAVX512 0:09.82
I was scratching my head over the results and tested individual operations. See full results. Here is runtime for the division:
Method Runtime (minutes:seconds)
div_plain 1:53.80
div_avx256f 0:28.47
div_avx512f 0:14.25
Interestingly, div_avx256f takes 28 seconds, while HarmonicSeriesAVX256 takes only 10 seconds to complete. HarmonicSeriesAVX256 is doing more operations than div_avx256f - summing up the results and increasing the denominator each time (the number of packed divisions is the same). The speed-up has to be due to the instructions pipelining.
However, I need help finding out more details.
The analysis with the llvm-mca (LLVM Machine Code Analyzer) fails because it does not support Zen4 yet:
gcc -O3 -mavx512f -mfma -S "$file" -o - | llvm-mca -iterations 10000 -timeline -bottleneck-analysis -retire-stats
error: found an unsupported instruction in the input assembly sequence.
note: instruction: vdivpd %zmm0, %zmm4, %zmm2
On the Intel platform, I would use
perf stat -M pipeline binary
to find more details, but this metricgroup is not available on Zen4. Any more suggestions on how to analyze the instructions pipelining on Zen4? I have tried these perf stat events:
cycles,stalled-cycles-frontend,stalled-cycles-backend,cache-misses,sse_avx_stalls,fp_ret_sse_avx_ops.all,fp_ret_sse_avx_ops.div_flops,fpu_pipe_assignment.total,fpu_pipe_assignment.total0,
fpu_pipe_assignment.total1,fpu_pipe_assignment.total2,fpu_pipe_assignment.total3
and got the results here.
From this I can see, that the workload is backed bound. AMD's performance event fp_ret_sse_avx_ops.all ( the number of retired SSE/AVX operations) helps, but I still want to get better insights into instructions pipelining on Zen4. Any tips?
Zen 4 execution units are mostly 256-bit wide; handling a 512-bit uop occupies it for 2 cycles. It's normal that 512-bit vectors don't have more raw throughput for any math instructions in general on Zen 4. Although using them on Zen4 does mean more work per uop so out-of-order exec has an easier time.
Or in the case of division, they're occupied for longer since division isn't fully pipelined, like on all modern CPUs. Division is hard to implement.
On Intel Ice Lake for example, divpd throughput is 2 doubles per 4 clocks whether you're using 128-bit, 256-bit, or 512-bit vectors. 512-bit takes extra uops, so we can infer that the actual divider execution unit is 256-bit wide in Ice Lake, but that divpd xmm can use the two halves of it independently. (Unlike AMD).
https://agner.org/optimize/ has instructing timing tables (and his microarch PDF has details on how CPUs work that are essential to making sense of them). https://uops.info/ also has good automated microbenchmark results, free from typos and other human error except sometimes in choosing what to benchmark. (But the actual instruction sequences tested are available, so you can check what they actually tested.) Unfortunately they don't yet have Zen 4 results up, only up to Zen 3.
Zen4 has execution units 256-bit wide for the most part, so 512-bit instructions are single uop but take 2 cycles on most execution units. (Unlike Zen1 where they took 2 uops and thus hurt OoO exec). And it has efficient 512-bit shuffles, and lets you use the power of new AVX-512 instructions for 256-bit vector width, which is where a lot of the real value is. (Better shuffles, masking, vpternlogd, vector popcount, etc.)
Division isn't fully pipelined on any modern x86 CPU. Even on Intel CPUs 512-bit vdivpd zmm has about the same doubles-per-clock throughput as vdivpd ymm (Floating point division vs floating point multiplication has some older data on the YMM vs. XMM situation which is similar, although Zen4 apparently can't send different XMM vectors through the halves of its 256-bit-wide divide unit; vdivpd xmm has the same instruction throughput as vdivpd ymm)
Fast-reciprocal + Newton iterations
For something that's almost entirely bottlenecked on division throughput (not front-end or other ports), you might consider approximate-reciprocal with a Newton-Raphson iteration or two to refine the accuracy to close to 1 ulp. (Not quite the 0.5 ulp you'd get from exact division).
AVX-512 has vrcp14pd approx-reciprocal for packed-double. So two rounds of Newton iterations should double the number of correct bits each time, to 28 then 56 (which is more than the 53-bit mantissa of a double). Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision mostly talks about rsqrt, but similar idea.
SSE/AVX1 only had single-precision versions of the fast-reciprocal and rsqrt instructions, with only 12-bit precision. e.g. rcpps.
AVX-512ER has 28-bit precision versions, but only Xeon Phi ever had those; mainstream CPUs haven't included them. (Xeon Phi had very vdivps / pd exact division, so it was much better to use the reciprocals.)
I got the answer for the question from title: How to analyze the instructions pipelining on Zen4? directly from AMD:
For determining if a workload is backend-bound, the recommended
method on Zen 4 is to use the pipeline utilization metrics. We are
the process of providing similar metrics and metric groups through
the perf JSON event files for Zen 4 and they will be out very soon.
Read more details in this email thread
AMD has already posted the patches.
Before the patches land in favorite Linux distribution, you can use the raw events on Zen4. Check this example
I'm learning SIMD intrinsics and parallel computing. I am not sure if Intel's definition for the x86 instruction sqrtpd says that the square root of the two numbers that are passed to it will be calculated at the same time:
Performs a SIMD computation of the square roots of the two, four, or eight packed double-precision floating-point values in the source operand (the second operand) and stores the packed double-precision floating-point results in the destination operand (the first operand).
I understand that it explicitly says SIMD computation but does this imply that for this operation the root will be calculated simultaneously for both numbers?
For sqrtpd xmm, yes, modern CPUs do that truly in parallel, not running it through a narrower execution unit one at a time. Older (especially low-power) CPUs did do that. For AVX vsqrtpd ymm, some CPUs do perform it in two halves.
But if you're just comparing performance numbers against narrower operations, note that some CPUs like Skylake can use different halves of their wide div/sqrt unit for separate sqrtpd/sd xmm, so those have twice the throughput of YMM, even though it can do a full vsqrtpd ymm in parallel.
Same for AVX-512 vsqrtpd zmm, even Ice Lake splits it up into two halves, as we can see from it being 3 uops (2 for port 0 where Intel puts the div/sqrt unit, and that can run on other ports.)
Being 3 uops is the key tell-tale for a sqrt instruction being wider than the execution unit on Intel, but you can look at the throughput of YMM vs. XMM vs. scalar XMM to see how it's able to feed narrower operations do different pipes of a wide execution unit independently.
The only difference is performance; the destination x/y/zmm register definitely has the square roots of each input element. Check performance numbers (and uop counts) on https://uops.info/ (currently down but normally very good), and/or https://agner.org/optimize/.
It's allowed but not guaranteed that CPUs internally have wide execution units, as wide as the widest vectors they support, and thus truly compute all results in parallel pipes.
Full-width execution units are common for instructions other than divide and square root, although AMD from Bulldozer through before Zen1 supported AVX/AVX2 with only 128-bit execution units, so vaddps ymm decoded to 2 uops, doing each half separately. Intel Alder Lake E-cores work the same way.
Some ancient and/or low-power CPUs (like Pentium-M and K8, and Bobcat) have had only 64-bit wide execution units, running SSE instructions in two halves (for all instructions, not just "hard" ones like div/sqrt).
So far only Intel has supported AVX-512 on any CPUs, and (other than div/sqrt) they've all had full-width execution units. And unfortunately they haven't come up with a way to expose the powerful new capabilities like masking and better shuffles for 128 and 256-bit vectors on CPUs without the full AVX-512. There's some really nice stuff in AVX-512 totally separate from wider vectors.
The SIMD div / sqrt unit is often narrower than others
Divide and square root are inherently slow, not really possible to make low latency. It's also expensive to pipeline; no current CPUs can start a new operation every clock cycle. But recent CPUs have been doing that, at least for part of the operation: I think they normally end with a couple steps of Newton-Raphson refinement, and that part can be pipelined as it only involves multiply/add/FMA type of operations.
Intel has supported AVX since Sandybridge, but it wasn't until Skylake that they widened the FP div/sqrt unit to 256-bit.
For example, Haswell runs vsqrtpd ymm as 3 uops, 2 for port 0 (where the div/sqrt unit is) and one for any port, presumably to recombine the results. The latency is just about a factor of 2 longer, and throughput is half. (A uop reading the result needs to wait for both halves to be ready.)
Agner Fog may have tested latency with vsqrtpd ymm reading its own result; IDK if Intel can let one half of the operation start before the other half is ready, of if the merging uop (or whatever it is) would end up forcing it to wait for both halves to be ready before starting either half of another div or sqrt. Instructions other than div/sqrt have full-width execution units and would always need to wait for both halves.
I also collected divps / pd / sd / ss throughputs and latencies for YMM and XMM on various CPUs in a table on Floating point division vs floating point multiplication
To complete the great answer of #PeterCordes, this is indeed dependent of architecture. One can expect the two square roots to be computed in parallel (or possibly efficiently pipelined at the ALU level) on most recent mainstream processors though. Here is the latency and throughput for intel architectures (you can get it from Intel):
Architecture
Latency single
Latency packed XMM
Throughput single
Throughput packed XMM
Skylake
18
18
6
6
Knights Landing
40
38
33
10
Broadwell
20
20
7
13
Haswell
20
20
13
13
Ivy Bridge
21
21
14
14
The throughput (number of cycle per instruction) is generally what matter in SIMD codes, as long as out-of-order exec can overlap the latency chains for independent iterations. As you can see, on Skylake, Haswell and Ivy Bridge, the throughput is the same meaning that sqrtsd and sqrtpd xmm are equally fast. The pd version gets twice as much work done, so it must be computing two elements in parallel. Note that Coffee Lake, Cannon Lake and Ice Lake have the same timings as Skylake for this specific instruction.
For Broadwell, sqrtpd does not execute the operation in parallel on the two lanes. Instead, it pipelines the operation and most of the computation is serialized (sqrtpd takes 1 cycle less than two sqrtsd). Or it has a parallel 2x 64-bit div/sqrt unit, but can independently use halves of it for scalar sqrt, which would explain the latency being the same but the throughput being better for scalar instructions (like how Skylake is for sqrt ymm vs. xmm).
For KNL Xeon Phi, the results are a bit surprising as sqrtpd xmm is much faster than sqrtsd while computing more items in parallel. Agner Fog's testing confirmed that, and that it takes many more uops. It's hard to imagine why; just merging the scalar result into the bottom of an XMM register shouldn't be much different from merging an XMM into the bottom of a ZMM, which is the same speed as a full vsqrtpd zmm. (It's optimized for AVX-512 with 512-bit registers, but it's also slow at div/sqrt in general; you're intended to use vrsqrt28pd on Xeon Phi CPUs, to get an approximation that only needs one Newton iteration to get close to double precision. Other AVX-512 CPUs only support vrsqrt14pd/ps, lacking the AVX-512ER extension)
PS: It turns out that Intel reports the maximum throughput cost (worst case) when it is variable. (0.0 is one of the best cases, for example). The latency is a bit different from the one reported from Agner Fog's instruction table. The overall analysis remains the same though.
Yes, SIMD (vector) instructions on packed operands perform the same operation on all vector elements "in parallel". This follows from the fact that sqrtsd (scalar square root on one double) and sqrtpd (packed square root on two doubles in a 128-bit register) have the same latency.
vsqrtpd for 256-bit and larger vectors may have higher latency on some processors, as the operation is performed on 128-bit parts of the vector sequentially. This may be true for vdivpd as well, but not other instructions - most of the time you can expect that the latency is the same regardless of the vector size. Consult with instruction tables if you want to be sure.
I am confused about the theoretical maximum performance of the Intel Xeon E5-2640 v4 CPU (Boardwell-based). In this post, >800GFLOPS; in this post, about 200GFLOPS; in this post, 3.69GFLOPS per core, 147.70GFLOPS per computer. So what is the theoretical maximum performance of Intel Xeon E5-2640 v4 CPU?
Some specifications:
Processor Base Frequency = 2.4GHz;
Max turbo frequency = 3.4GHz;
IPC (instruction per cycle) = 2;
Instruction Set Extensions: AVX2, so #SIMD = 256/32 = 8;
I tried to compute the theoretical maximum FLOPS. Based on my understanding, it should be (Max turbo frequency) * (IPC) * (#SIMD), which is 3.4 * 2 * 8 = 54.4GFLOPS, is it right?
Should it be multiplied by 2 (due to the pipeline technique which makes addition and multiplication can be done in parallel)? What if additions and multiplications do not appear at the same time? (eg. if the workload only contains additions, is *2 appropriate?)
Besides, the above computation should be the maximum FLOPS per core, right?
3.4 GHz is the max single-core turbo (and also 2-core), so note that this isn't the per-core GFLOPS, it's the single-core GFLOPS.
The max all-cores turbo is 2.6 GHz on that CPU, and probably won't sustain that for long with all cores maxing out their SIMD FP execution units. That's the most power-intensive thing x86 CPUs can do. So it will likely drop back to 2.4 GHz if you actually keep all cores busy.
And yes you're missing a factor of two because FMA counts as two FP operations, and that's what you need to do to achieve the hardware's theoretical max FLOPS. FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2 . (Your Broadwell is the same as Haswell for max-throughput purposes.)
If you're only using addition then only have one FLOP per SIMD element per instruction, and also only 1/clock FP instruction throughput on a v4 (Broadwell) or earlier.
Haswell / Broadwell have two fully-pipelined SIMD FMA units (on ports 0 and 1), and one fully-pipelined SIMD FP-add unit (on port 1) with lower latency than FMA.
The FP-add unit is on the same execution port as one of the FMA units, so it can start 2 FP uops per clock, up to one of which can be pure addition. (Unless you do addition x+y as fma(x, 1.0, y), trading higher latency for more throughput.)
IPC (instruction per cycle) = 2;
Nope, that's the number of FP math instructions per cycle, max, not total instructions per clock. The pipeline's narrowest point is 4 uops wide, so there's room for a bit of loop overhead and a store instruction every cycle as well as two SIMD FP operations.
But yes, 2 FP operations started per clock, if they're not both addition.
Should it be multiplied by 2 (due to the pipeline technique which makes addition and multiplication can be done in parallel)?
You're already multiplying by IPC=2 for parallel additions and multiplications.
If you mean FMA (Fused Multiply-Add), then no, that's literally doing them both as part of a single operation, not in parallel as a "pipeline technique". That's why it's called "fused".
FMA has the same latency as multiply in many CPUs, not multiply and then addition. (Although on Broadwell, FMA latency = 5 cycles, vmulpd latency = 3 cycles, vaddpd latency = 3 cycles. All are fully pipelined, with a throughput discussed in the rest of this answer, since theoretical max throughput requires arranging your calculations to not bottleneck on the latency of addition or multiplication. e.g. using multiple accumulators for a dot product or other reduction.) Anyway, point being, a hardware FMA execution unit is not terribly more complex than an FP multiplier or adder, and you shouldn't think of it as two separate operations.
If you write a*b + c in the source, a compiler can contract that into an FMA, instead of rounding the a*b to a temporary result before addition, depending on compiler options (and defaults) to allow that or not.
How to use Fused Multiply-Add (FMA) instructions with SSE/AVX
FMA3 in GCC: how to enable
Instruction Set Extensions: AVX2, so #SIMD = 256/64 = 8;
256/64 = 4, not 8. In a 32-byte (256-bit) SIMD vector, you can fit 4 double-precision elements.
Per core per clock, Haswell/Broadwell can begin up to:
two FP math instructions (FMA/MUL/ADD), up to one of which can be addition.
FMA counts as 2 FLOPs per element, MUL/ADD only count as 1 each.
on up to 32 byte wide inputs (e.g. 4 doubles or 8 floats)
I'm trying to optimize my code with SIMD ( on ARM CPUs ), and want to know its arithmetic intensity (flops/byte, AI) and FLOPS.
In order to calculate AI and FLOPS, I have to count the number of floating point operations(FLOPs).
However, I can't find any precise definition of FLOPs.
Of course, mul, add, sub, div are clearly FLOPs, but how about move operations, shuffle operations (e.g. _mm_shuffle_ps), set operations (e.g. _mm_set1_ps), conversion operations (e.g. _mm_cvtps_pi32), etc. ?
They're operations that deal with floating point values. Should I count them as FLOPs ? If not, why ?
Which operations do profilers like Intel VTune and Nvidia's nvprof, or PMUs usually count ?
EDIT:
What all operations does FLOPS include?
This question is mainly about mathematically complex operations.
I also want to know the standard way to deal with "not mathematical" operations which take floating point values or vectors as inputs.
Shuffle / blend on FP values are not considered FLOPs. They are just overhead of using SIMD on not purely "vertical" problems, or for problems with branching that you do branchlessly with a blend.
Neither are FP AND/OR/XOR. You could try to justify counting FP absolute value using andps (_mm_and_ps), but normally it's not counted. FP abs doesn't require looking at the exponent / significand, or normalizing the result, or any of the things that make FP execution units expensive. abs (AND) / sign-flip (XOR) or make negative (OR) are trivial bitwise ops.
FMA is normally counted as two floating point ops (the mul and add), even though it's a single instruction with the same (or similar) performance to SIMD FP add or mul. The most important problem that bottlenecks on raw FLOP/s is matmul, which does need an equal mix of mul and add, and can take advantage of FMA perfectly.
So the FLOP/s of a Haswell core is
its SIMD vector width (8 float elements per vector)
times SIMD FMA per clock (2)
times FLOPs per FMA (2)
times clock speed (max single core turbo it can sustain while maxing out both FMA units; long-term depends on cooling, short term just depends on power limits).
For a whole CPU, not just a single core: multiply by number of cores and use the max sustained clock speed with all cores busy, usually lower than single-core turbo on CPUs that have turbo at all.)
Intel and other CPU vendors don't count the fact that their CPUs can also sustain a vandps in parallel with 2 vfma132ps instructions per clock, because FP abs is not a difficult operation.
See also How do I achieve the theoretical maximum of 4 FLOPs per cycle?. (It's actually more than 4 on modern CPUs :P)
Peak FLOPS (FP ops per second, or FLOP/s) isn't achievable if you have much other overhead taking up front-end bandwidth or creating other bottlenecks. The metric is just the raw amount of math you can do when running in a straight line, not on any specific practical problem.
Although people would think it's silly if theoretical peak flops is much higher than a carefully hand-tuned matmul or Mandelbrot could ever achieve, even for compile-time-constant problem sizes. e.g. if the front-end couldn't keep up with doing any stores as well as the FMAs. e.g. if Haswell had four FMA execution units, so it could only sustain max FLOPs if literally every instruction was an FMA. Memory source operands could micro-fuse for loads, but there'd be no room to store without hurting throughput.
The reason Intel doesn't have even 3 FMA units is that most real code has trouble saturating 2 FMA units, especially with only 2 load ports and 1 store port. They'd be wasted almost all of the time, and 256-bit FMA unit takes a lot of transistors.
(Ice Lake widens issue/rename stage of the pipeline to 5 uops/clock, but also widens SIMD execution units to 512-bit with AVX-512 instead of adding a 3rd 256-bit FMA unit. It has 2/clock load and 2/clock store, although that store throughput is only sustainable to L1d cache for 32-byte or narrower stores, not 64-byte.)
When it comes to optimisation, it is common practise to only measure FLOPs on the hotspots of your code, for example, the number of Floating Point Multiply & Accumulate operations in Convolution. This is mainly because other operations might be insignificant or irreplaceable and therefore can't be exploited for any kind of optimization.
For example, all instructions under Vector Floating Point Instructions in A4.13 in ARMv7 Reference Manual fall under a Floating Point Operation as a FLOPs/Cycle for an FPU instruction is typically constant in a processor.
Not just ARM, but many micro-processors have a dedicated Floating Point Unit, so when you are measuring FLOPs, you're measuring the speed of this unit. With this and FLOPs/cycle you can more or less calculate the theoretical peak performance.
But, FLOPs are to be taken with a grain of salt, as they can only be used to approximately estimate the speed of your code because they fail to take into account other conditions your processor operates under. This is why counting FLOPs only for your hotspots (usually arithmetic ops) is more or less enough in most cases.
Having said that, FLOPs can act as a comparative metric for two strenuous piece of code but doesn't say much about your code per se.
I've wondered about the amount of bits that a given CPU can handles and how should I manage to calculate this specific value.
I wanted to make sure that my thought and also my calculation are right.
Given that I have 64 bit CPU which feature 2.3 Ghz. The amount of bits that being handled per second should be the following :
(2.3 * 10^9) * 64
Is it that simple or I need consider any other variables?
Calculating throughput of a CPU is not as simple. Modern CPUs are pipelined and can execute multiple instructions concurrently if there are no data dependencies, yielding instructions per cycle (IPC) > 1. Some instructions may also have a higher latency than one cycle, meaning IPC can be < 1.
For some operations they might be constrained by memory bandwidth.
And then there are SIMD instructions which can process more data per instruction than the width of a single general purpose register.
http://www.agner.org/optimize/instruction_tables.pdf
https://en.wikipedia.org/wiki/Instruction_pipelining