I am investigating how many FLOPs can be done in one CPU cycle using the GotoBLAS library. I used 32-bit floating-point numbers to run a matrix multiplication and got roughly 8 FLOPs per CPU cycle by hand calculation. I guess this may be because there are two FPUs in my processor (Intel Xeon E5430), each of which handles one SSE instruction over 128-bit XMM registers. Therefore, using 32-bit floating-point numbers, I got 2*4 FLOPs per CPU cycle.
Is my guess correct? Is there an official manual I can refer to for the number of FPUs in an Intel processor?
Thanks!
I think I found the reason. Theoretically, the Intel Xeon E5430 can do a 4-wide SSE addition plus a 4-wide SSE multiplication together in one CPU cycle for single-precision floating-point numbers.
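A quick sanity check of that reasoning (the 2.66 GHz figure is the E5430's nominal clock, used here only for illustration):

4 floats per 128-bit SSE vector × (1 packed add + 1 packed mul) per cycle = 8 SP FLOPs per cycle
8 FLOPs/cycle × 2.66 GHz ≈ 21.3 SP GFLOP/s per core

which is consistent with the roughly 8 FLOPs per cycle measured for the matrix multiplication.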
Related
I got access to an AMD Zen 4 server and tested AVX-512 packed-double performance. I chose the harmonic series sum (1/n over the positive integers) and compared the performance using standard doubles, AVX2 (4 packed doubles) and AVX-512 (8 packed doubles). The test code is here.
The AVX-256 version runs four times faster than the standard double version. I was expecting the AVX-512 version to run two times faster than the AVX-256 version, but there was barely any improvement in runtime:
Method Runtime (minutes:seconds)
HarmonicSeriesPlain 0:41.33
HarmonicSeriesAVX256 0:10.32
HarmonicSeriesAVX512 0:09.82
I was scratching my head over the results and tested individual operations. See the full results. Here is the runtime for the division:
Method Runtime (minutes:seconds)
div_plain 1:53.80
div_avx256f 0:28.47
div_avx512f 0:14.25
Interestingly, div_avx256f takes 28 seconds, while HarmonicSeriesAVX256 takes only 10 seconds to complete, even though HarmonicSeriesAVX256 is doing more operations than div_avx256f - it also sums the results and increments the denominators each time (the number of packed divisions is the same). The speed-up has to be due to instruction pipelining.
However, I need help finding out more details.
Analysis with llvm-mca (the LLVM Machine Code Analyzer) fails because it does not support Zen 4 yet:
gcc -O3 -mavx512f -mfma -S "$file" -o - | llvm-mca -iterations 10000 -timeline -bottleneck-analysis -retire-stats
error: found an unsupported instruction in the input assembly sequence.
note: instruction: vdivpd %zmm0, %zmm4, %zmm2
On the Intel platform, I would use
perf stat -M pipeline binary
to find more details, but this metric group is not available on Zen 4. Any more suggestions on how to analyze instruction pipelining on Zen 4? I have tried these perf stat events:
cycles,stalled-cycles-frontend,stalled-cycles-backend,cache-misses,sse_avx_stalls,fp_ret_sse_avx_ops.all,fp_ret_sse_avx_ops.div_flops,fpu_pipe_assignment.total,fpu_pipe_assignment.total0,
fpu_pipe_assignment.total1,fpu_pipe_assignment.total2,fpu_pipe_assignment.total3
and got the results here.
From this I can see that the workload is backend-bound. AMD's performance event fp_ret_sse_avx_ops.all (the number of retired SSE/AVX operations) helps, but I still want better insight into instruction pipelining on Zen 4. Any tips?
Zen 4 execution units are mostly 256 bits wide; a 512-bit uop occupies a unit for 2 cycles. So it's normal that 512-bit vectors don't give more raw throughput than 256-bit for most math instructions on Zen 4, although using them does mean more work per uop, so out-of-order exec has an easier time.
In the case of division, the units are occupied even longer, since division isn't fully pipelined (as on all modern CPUs). Division is hard to implement.
On Intel Ice Lake for example, divpd throughput is 2 doubles per 4 clocks whether you're using 128-bit, 256-bit, or 512-bit vectors. 512-bit takes extra uops, so we can infer that the actual divider execution unit is 256-bit wide in Ice Lake, but that divpd xmm can use the two halves of it independently. (Unlike AMD).
https://agner.org/optimize/ has instruction timing tables (and his microarch PDF has details on how CPUs work that are essential to making sense of them). https://uops.info/ also has good automated microbenchmark results, free from typos and other human error except sometimes in choosing what to benchmark. (But the actual instruction sequences tested are available, so you can check what they actually tested.) Unfortunately they don't have Zen 4 results up yet, only up to Zen 3.
Zen 4 has execution units that are 256 bits wide for the most part, so 512-bit instructions are a single uop but take 2 cycles on most execution units. (Unlike Zen 1, where 256-bit instructions took 2 uops and thus hurt OoO exec.) And it has efficient 512-bit shuffles, and lets you use the power of new AVX-512 instructions at 256-bit vector width, which is where a lot of the real value is (better shuffles, masking, vpternlogd, vector popcount, etc.).
Division isn't fully pipelined on any modern x86 CPU. Even on Intel CPUs 512-bit vdivpd zmm has about the same doubles-per-clock throughput as vdivpd ymm (Floating point division vs floating point multiplication has some older data on the YMM vs. XMM situation which is similar, although Zen4 apparently can't send different XMM vectors through the halves of its 256-bit-wide divide unit; vdivpd xmm has the same instruction throughput as vdivpd ymm)
Fast-reciprocal + Newton iterations
For something that's almost entirely bottlenecked on division throughput (not front-end or other ports), you might consider approximate-reciprocal with a Newton-Raphson iteration or two to refine the accuracy to close to 1 ulp. (Not quite the 0.5 ulp you'd get from exact division).
AVX-512 has vrcp14pd, a 14-bit-accurate approximate reciprocal for packed double. Two rounds of Newton iteration should double the number of correct bits each time, to 28 and then 56 (which is more than the 53-bit mantissa of a double). Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision mostly talks about rsqrt, but it's a similar idea.
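As a concrete illustration of that approach, here is a sketch of my own (not from the question) of a packed-double reciprocal using vrcp14pd plus two Newton-Raphson steps; each step roughly doubles the number of correct bits (~14 -> ~28 -> ~56), so the result is close to, but not exactly, the correctly rounded quotient vdivpd would give:

#include <immintrin.h>

// Approximate 1/d for 8 packed doubles: vrcp14pd + two Newton-Raphson steps.
// Requires AVX-512F and FMA; compile with e.g. gcc -O2 -mavx512f -mfma.
static inline __m512d recip_nr_pd(__m512d d)
{
    const __m512d one = _mm512_set1_pd(1.0);
    __m512d x = _mm512_rcp14_pd(d);           // ~14-bit approximation of 1/d
    __m512d e = _mm512_fnmadd_pd(d, x, one);  // e = 1 - d*x  (residual)
    x = _mm512_fmadd_pd(x, e, x);             // x += x*e  -> ~28 correct bits
    e = _mm512_fnmadd_pd(d, x, one);
    x = _mm512_fmadd_pd(x, e, x);             // second step -> ~56 correct bits
    return x;
}

In a harmonic-series-style loop, sum += 1/n then becomes a multiply by the refined reciprocal instead of a vdivpd, trading the last fraction of a ulp for much better throughput.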
SSE/AVX1 only had single-precision versions of the fast-reciprocal and rsqrt instructions, with only 12-bit precision. e.g. rcpps.
AVX-512ER has 28-bit precision versions, but only Xeon Phi ever had those; mainstream CPUs haven't included them. (Xeon Phi had very slow exact vdivps / vdivpd, so it was much better to use the reciprocals there.)
I got the answer to the question in the title (how to analyze instruction pipelining on Zen 4?) directly from AMD:
For determining if a workload is backend-bound, the recommended method on Zen 4 is to use the pipeline utilization metrics. We are in the process of providing similar metrics and metric groups through the perf JSON event files for Zen 4 and they will be out very soon.
Read more details in this email thread
AMD has already posted the patches.
Before the patches land in your favorite Linux distribution, you can use the raw events on Zen 4. Check this example
I'm learning SIMD intrinsics and parallel computing. I'm not sure whether Intel's definition of the x86 instruction sqrtpd means that the square roots of the two numbers passed to it are calculated at the same time:
Performs a SIMD computation of the square roots of the two, four, or eight packed double-precision floating-point values in the source operand (the second operand) and stores the packed double-precision floating-point results in the destination operand (the first operand).
I understand that it explicitly says SIMD computation, but does this imply that the roots are calculated simultaneously for both numbers?
For sqrtpd xmm, yes, modern CPUs do that truly in parallel, not running it through a narrower execution unit one at a time. Older (especially low-power) CPUs did do that. For AVX vsqrtpd ymm, some CPUs do perform it in two halves.
But if you're just comparing performance numbers against narrower operations, note that some CPUs like Skylake can use different halves of their wide div/sqrt unit for separate sqrtpd/sd xmm, so those have twice the throughput of YMM, even though it can do a full vsqrtpd ymm in parallel.
Same for AVX-512 vsqrtpd zmm: even Ice Lake splits it up into two halves, as we can see from it being 3 uops (2 for port 0, where Intel puts the div/sqrt unit, and 1 that can run on other ports).
Being 3 uops is the key tell-tale for a sqrt instruction being wider than the execution unit on Intel, but you can also look at the throughput of YMM vs. XMM vs. scalar XMM to see whether it's able to feed narrower operations to different halves of a wide execution unit independently.
The only difference is performance; the destination x/y/zmm register definitely has the square roots of each input element. Check performance numbers (and uop counts) on https://uops.info/ (currently down but normally very good), and/or https://agner.org/optimize/.
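If you'd rather measure than read tables, a rough throughput probe along these lines can show the XMM vs. YMM ratio on your own CPU (this is my own sketch, not from the answer; iteration counts and the number of chains are arbitrary, and a serious measurement needs care with warm-up and turbo):

#include <immintrin.h>
#include <stdio.h>
#include <time.h>

// Times many independent sqrtpd xmm vs. vsqrtpd ymm operations. Four
// independent dependency chains per width keep the div/sqrt unit busy, so
// the loops are throughput-bound rather than latency-bound on typical CPUs.
// If the xmm time per instruction is about half the ymm time, separate
// 128-bit square roots are using the two halves of a 256-bit unit
// independently (Skylake-style). Requires AVX; compile with e.g. gcc -O2 -mavx.
static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(int argc, char **argv)
{
    (void)argv;
    const long iters = 20 * 1000 * 1000;
    double s = 3.0 + argc;                       // defeat constant folding
    __m128d a0 = _mm_set1_pd(s), a1 = _mm_set1_pd(s + 1),
            a2 = _mm_set1_pd(s + 2), a3 = _mm_set1_pd(s + 3);
    __m256d b0 = _mm256_set1_pd(s), b1 = _mm256_set1_pd(s + 1),
            b2 = _mm256_set1_pd(s + 2), b3 = _mm256_set1_pd(s + 3);

    double t0 = seconds();
    for (long i = 0; i < iters; i++) {           // 4 independent 128-bit chains
        a0 = _mm_sqrt_pd(a0); a1 = _mm_sqrt_pd(a1);
        a2 = _mm_sqrt_pd(a2); a3 = _mm_sqrt_pd(a3);
    }
    double t1 = seconds();
    for (long i = 0; i < iters; i++) {           // 4 independent 256-bit chains
        b0 = _mm256_sqrt_pd(b0); b1 = _mm256_sqrt_pd(b1);
        b2 = _mm256_sqrt_pd(b2); b3 = _mm256_sqrt_pd(b3);
    }
    double t2 = seconds();

    // Keep results live so the loops aren't optimized away.
    double sink = _mm_cvtsd_f64(_mm_add_pd(_mm_add_pd(a0, a1), _mm_add_pd(a2, a3)))
                + _mm_cvtsd_f64(_mm256_castpd256_pd128(
                      _mm256_add_pd(_mm256_add_pd(b0, b1), _mm256_add_pd(b2, b3))));
    printf("sqrtpd xmm: %.2f ns/insn   vsqrtpd ymm: %.2f ns/insn   (sink=%g)\n",
           (t1 - t0) * 1e9 / (iters * 4), (t2 - t1) * 1e9 / (iters * 4), sink);
    return 0;
}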
It's allowed but not guaranteed that CPUs internally have wide execution units, as wide as the widest vectors they support, and thus truly compute all results in parallel pipes.
Full-width execution units are common for instructions other than divide and square root, although AMD CPUs from Bulldozer up until Zen 1 supported AVX/AVX2 with only 128-bit execution units, so vaddps ymm decoded to 2 uops, doing each half separately. Intel Alder Lake E-cores work the same way.
Some ancient and/or low-power CPUs (like Pentium-M and K8, and Bobcat) have had only 64-bit wide execution units, running SSE instructions in two halves (for all instructions, not just "hard" ones like div/sqrt).
Until Zen 4, only Intel had supported AVX-512 on any CPUs, and (other than div/sqrt) those have all had full-width execution units. And unfortunately Intel hasn't come up with a way to expose the powerful new capabilities like masking and better shuffles for 128 and 256-bit vectors on CPUs without full AVX-512. There's some really nice stuff in AVX-512 totally separate from wider vectors.
The SIMD div / sqrt unit is often narrower than others
Divide and square root are inherently slow; it's not really possible to make them low latency. They are also expensive to pipeline fully: no current CPU can start a new divide or sqrt every clock cycle. But recent CPUs do pipeline at least part of the operation: I think they normally end with a couple of steps of Newton-Raphson refinement, and that part can be pipelined since it only involves multiply/add/FMA type operations.
Intel has supported AVX since Sandybridge, but it wasn't until Skylake that they widened the FP div/sqrt unit to 256-bit.
For example, Haswell runs vsqrtpd ymm as 3 uops, 2 for port 0 (where the div/sqrt unit is) and one for any port, presumably to recombine the results. The latency is just about a factor of 2 longer, and throughput is half. (A uop reading the result needs to wait for both halves to be ready.)
Agner Fog may have tested latency with vsqrtpd ymm reading its own result; IDK if Intel can let one half of the operation start before the other half is ready, or if the merging uop (or whatever it is) would end up forcing it to wait for both halves to be ready before starting either half of another div or sqrt. Instructions other than div/sqrt have full-width execution units and would always need to wait for both halves.
I also collected divps / pd / sd / ss throughputs and latencies for YMM and XMM on various CPUs in a table on Floating point division vs floating point multiplication
To complement the great answer from @PeterCordes: this is indeed architecture-dependent. One can expect the two square roots to be computed in parallel (or at least efficiently pipelined at the ALU level) on most recent mainstream processors, though. Here are the latency and throughput figures (in cycles) for Intel architectures (you can get them from Intel):
Architecture      Latency single   Latency packed XMM   Throughput single   Throughput packed XMM
Skylake           18               18                   6                   6
Knights Landing   40               38                   33                  10
Broadwell         20               20                   7                   13
Haswell           20               20                   13                  13
Ivy Bridge        21               21                   14                  14
The throughput (number of cycles per instruction) is generally what matters in SIMD code, as long as out-of-order exec can overlap the latency chains of independent iterations. As you can see, on Skylake, Haswell and Ivy Bridge the throughput is the same, meaning that sqrtsd and sqrtpd xmm are equally fast. The pd version gets twice as much work done, so it must be computing two elements in parallel. Note that Coffee Lake, Cannon Lake and Ice Lake have the same timings as Skylake for this specific instruction.
For Broadwell, sqrtpd does not execute the operation in parallel on the two lanes. Instead, it pipelines the operation and most of the computation is serialized (sqrtpd takes 1 cycle less than two sqrtsd). Or it has a parallel 2x 64-bit div/sqrt unit, but can independently use halves of it for scalar sqrt, which would explain the latency being the same but the throughput being better for scalar instructions (like how Skylake is for sqrt ymm vs. xmm).
For KNL Xeon Phi, the results are a bit surprising as sqrtpd xmm is much faster than sqrtsd while computing more items in parallel. Agner Fog's testing confirmed that, and that it takes many more uops. It's hard to imagine why; just merging the scalar result into the bottom of an XMM register shouldn't be much different from merging an XMM into the bottom of a ZMM, which is the same speed as a full vsqrtpd zmm. (It's optimized for AVX-512 with 512-bit registers, but it's also slow at div/sqrt in general; you're intended to use vrsqrt28pd on Xeon Phi CPUs, to get an approximation that only needs one Newton iteration to get close to double precision. Other AVX-512 CPUs only support vrsqrt14pd/ps, lacking the AVX-512ER extension)
PS: It turns out that Intel reports the maximum throughput cost (worst case) when it is variable (an input of 0.0 is one of the best cases, for example). The latency is a bit different from the one reported in Agner Fog's instruction tables. The overall analysis remains the same though.
Yes, SIMD (vector) instructions on packed operands perform the same operation on all vector elements "in parallel". This follows from the fact that sqrtsd (scalar square root on one double) and sqrtpd (packed square root on two doubles in a 128-bit register) have the same latency.
vsqrtpd for 256-bit and larger vectors may have higher latency on some processors, where the operation is performed on 128-bit parts of the vector sequentially. This may be true for vdivpd as well, but not for other instructions - most of the time you can expect the latency to be the same regardless of the vector size. Consult instruction tables if you want to be sure.
We know that Intel CPUs do integer division and FP div / sqrt on a not-fully-pipelined divide execution unit on port 0. We know this from IACA output, other published stuff, and experimental testing. (e.g. https://agner.org/optimize/)
But are there independent dividers for FP and integer (competing only for dispatch via port 0), or does interleaving two div-throughput-bound workloads make their cost add nearly linearly, if one is integer and the other is FP?
This is complicated by Intel CPUs (unlike AMD) decoding integer division to multiple uops, e.g. 10 for div r32 on Skylake.
AMD CPUs similarly have their divider on one execution port, but I don't know as much about them and don't have one to test on. AMD integer division decodes to only a couple uops (to write RDX and RAX), not microcoded. Experiments on AMD might be easier to interpret without lots of uops flying around being a possible cause for contention between int and fp div.
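One way to probe this empirically (my own sketch, not from the question; names, divisors and counts are arbitrary) is to time a throughput-bound integer-division loop, a throughput-bound packed FP division loop, and the two interleaved. If the interleaved time is close to the max of the first two, the divides overlap; if it's close to their sum, they're contending for the same unit:

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

// Compile with e.g. gcc -O2. Divisors depend on argc so the compiler emits
// real div instructions instead of multiply-by-reciprocal tricks. Four
// independent chains of each kind keep the loops divider-throughput-bound.
static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(int argc, char **argv)
{
    (void)argv;
    const long iters = 20 * 1000 * 1000;
    uint64_t dv0 = 122u + argc, dv1 = 455u + argc, dv2 = 788u + argc, dv3 = 320u + argc;
    uint64_t q0 = ~0ULL, q1 = ~0ULL / 3, q2 = ~0ULL / 5, q3 = ~0ULL / 7;
    const __m128d d = _mm_set1_pd(1.0000001);
    __m128d f0 = _mm_set1_pd(1e300), f1 = _mm_set1_pd(1e299),
            f2 = _mm_set1_pd(1e298), f3 = _mm_set1_pd(1e297);

    double t0 = seconds();
    for (long i = 0; i < iters; i++) {   // 4 independent 64-bit integer div chains
        q0 = q0 / dv0 + 1000000; q1 = q1 / dv1 + 1000000;
        q2 = q2 / dv2 + 1000000; q3 = q3 / dv3 + 1000000;
    }
    double t1 = seconds();
    for (long i = 0; i < iters; i++) {   // 4 independent packed FP div chains
        f0 = _mm_div_pd(f0, d); f1 = _mm_div_pd(f1, d);
        f2 = _mm_div_pd(f2, d); f3 = _mm_div_pd(f3, d);
    }
    double t2 = seconds();
    for (long i = 0; i < iters; i++) {   // both, interleaved
        q0 = q0 / dv0 + 1000000; f0 = _mm_div_pd(f0, d);
        q1 = q1 / dv1 + 1000000; f1 = _mm_div_pd(f1, d);
        q2 = q2 / dv2 + 1000000; f2 = _mm_div_pd(f2, d);
        q3 = q3 / dv3 + 1000000; f3 = _mm_div_pd(f3, d);
    }
    double t3 = seconds();

    // Keep results live so the loops aren't optimized away.
    double fsink = _mm_cvtsd_f64(_mm_add_pd(_mm_add_pd(f0, f1), _mm_add_pd(f2, f3)));
    printf("int div: %.2fs   fp div: %.2fs   interleaved: %.2fs   (sinks: %llu %g)\n",
           t1 - t0, t2 - t1, t3 - t2,
           (unsigned long long)(q0 + q1 + q2 + q3), fsink);
    return 0;
}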
Further reading:
Semi-related: Radix divider internals
Floating point division vs floating point multiplication - FP div/sqrt vs. multiply/FMA throughputs on various Intel and AMD CPUs.
Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux - Intel's 64-bit integer division is a lot slower. Decoding to more uops (36 vs. 10 on SKL) and not even saturating the arith.divider_active perf counter.
Intel CPU architect Ronak Singhal mentions on Twitter that Broadwell (and by implication subsequent architectures until ICL) use the FP hardware for division, but that Ice Lake has a dedicated integer division unit:
Keep in mind that Broadwell that this was benchmarked on does integer division on the FP divider. In Ice Lake, there is now a dedicated integer divide unit.
So I would expect significant competition. Many of the uops that integer division decodes to are no doubt plain ALU ops that don't use the divider, so I wouldn't necessarily expect their inverse throughputs to be strictly cumulative, but they will definitely compete.
Ronak doesn't imply anything about pre-Broadwell implementation, but based on the similar port assignment and performance going back to at least Sandy Bridge, I think we can expect that the same sharing holds.
I'm trying to optimize my code with SIMD (on ARM CPUs) and want to know its arithmetic intensity (FLOPs/byte, AI) and FLOPS.
In order to calculate AI and FLOPS, I have to count the number of floating-point operations (FLOPs).
However, I can't find any precise definition of FLOPs.
Of course, mul, add, sub, div are clearly FLOPs, but how about move operations, shuffle operations (e.g. _mm_shuffle_ps), set operations (e.g. _mm_set1_ps), conversion operations (e.g. _mm_cvtps_pi32), etc. ?
They're operations that deal with floating point values. Should I count them as FLOPs ? If not, why ?
Which operations do profilers like Intel VTune and Nvidia's nvprof, or PMUs usually count ?
EDIT:
What all operations does FLOPS include?
This question is mainly about mathematically complex operations.
I also want to know the standard way to deal with "not mathematical" operations which take floating point values or vectors as inputs.
Shuffle / blend on FP values are not considered FLOPs. They are just overhead of using SIMD on not purely "vertical" problems, or for problems with branching that you do branchlessly with a blend.
Neither are FP AND/OR/XOR. You could try to justify counting FP absolute value using andps (_mm_and_ps), but normally it's not counted. FP abs doesn't require looking at the exponent / significand, or normalizing the result, or any of the things that make FP execution units expensive. abs (AND) / sign-flip (XOR) or make negative (OR) are trivial bitwise ops.
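For concreteness, here is what that non-FLOP FP absolute value looks like with intrinsics (a trivial sketch of my own):

#include <immintrin.h>

// FP absolute value as one bitwise AND-NOT that clears the sign bit of each
// lane; no exponent/significand handling, which is why it isn't counted as a FLOP.
static inline __m128 abs_ps(__m128 x)
{
    const __m128 sign_bit = _mm_set1_ps(-0.0f);   // only the sign bit set
    return _mm_andnot_ps(sign_bit, x);            // ~sign_bit & x
}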
FMA is normally counted as two floating point ops (the mul and add), even though it's a single instruction with the same (or similar) performance to SIMD FP add or mul. The most important problem that bottlenecks on raw FLOP/s is matmul, which does need an equal mix of mul and add, and can take advantage of FMA perfectly.
So the FLOP/s of a Haswell core is
its SIMD vector width (8 float elements per vector)
times SIMD FMA per clock (2)
times FLOPs per FMA (2)
times clock speed (max single core turbo it can sustain while maxing out both FMA units; long-term depends on cooling, short term just depends on power limits).
For a whole CPU, not just a single core: multiply by the number of cores and use the max sustained clock speed with all cores busy, usually lower than single-core turbo on CPUs that have turbo at all.
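Putting those factors together for one Haswell core (the 3.5 GHz clock here is a hypothetical sustained all-FMA figure, purely for illustration):

8 floats/vector × 2 FMA/clock × 2 FLOPs/FMA = 32 SP FLOPs per cycle
32 FLOPs/cycle × 3.5 GHz ≈ 112 SP GFLOP/s per core (half that for double precision)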
Intel and other CPU vendors don't count the fact that their CPUs can also sustain a vandps in parallel with 2 vfmadd132ps instructions per clock, because FP abs is not a difficult operation.
See also How do I achieve the theoretical maximum of 4 FLOPs per cycle?. (It's actually more than 4 on modern CPUs :P)
Peak FLOPS (FP ops per second, or FLOP/s) isn't achievable if you have much other overhead taking up front-end bandwidth or creating other bottlenecks. The metric is just the raw amount of math you can do when running in a straight line, not on any specific practical problem.
Although people would think it's silly if the theoretical peak FLOPs were much higher than a carefully hand-tuned matmul or Mandelbrot could ever achieve, even for compile-time-constant problem sizes - e.g. if the front-end couldn't keep up with doing any stores as well as the FMAs. For example, if Haswell had four FMA execution units, it could only sustain max FLOPs if literally every instruction were an FMA. Memory source operands could micro-fuse for loads, but there'd be no room to store without hurting throughput.
The reason Intel doesn't have even 3 FMA units is that most real code has trouble saturating 2 FMA units, especially with only 2 load ports and 1 store port. They'd be wasted almost all of the time, and a 256-bit FMA unit takes a lot of transistors.
(Ice Lake widens issue/rename stage of the pipeline to 5 uops/clock, but also widens SIMD execution units to 512-bit with AVX-512 instead of adding a 3rd 256-bit FMA unit. It has 2/clock load and 2/clock store, although that store throughput is only sustainable to L1d cache for 32-byte or narrower stores, not 64-byte.)
When it comes to optimisation, it is common practice to measure FLOPs only for the hotspots of your code, for example the number of floating-point multiply-accumulate operations in a convolution. This is mainly because other operations might be insignificant or irreplaceable and therefore can't be exploited for any kind of optimization.
For example, all instructions under Vector Floating Point Instructions in section A4.13 of the ARMv7 Reference Manual count as floating-point operations, as the FLOPs/cycle for an FPU instruction is typically constant on a given processor.
Not just ARM, but many microprocessors have a dedicated floating-point unit, so when you are measuring FLOPs, you're measuring the speed of this unit. With this and FLOPs/cycle you can more or less calculate the theoretical peak performance.
But, FLOPs are to be taken with a grain of salt, as they can only be used to approximately estimate the speed of your code because they fail to take into account other conditions your processor operates under. This is why counting FLOPs only for your hotspots (usually arithmetic ops) is more or less enough in most cases.
Having said that, FLOPs can act as a comparative metric for two strenuous pieces of code, but they don't say much about your code per se.
I have a C/C++ program which involves intensive 32-bit floating-point matrix math computations such as addition, subtraction, multiplication, division, etc.
Can I speed up my program by converting 32-bit floating-point numbers into 16-bit fixed-point numbers? How much of a speed gain can I get?
Currently I'm working on an Intel i5 CPU. I'm using OpenBLAS to perform the matrix calculations. How should I re-implement OpenBLAS functions such as cblas_dgemm to perform fixed-point calculations?
I know that SSE (Streaming SIMD Extensions) operates on 4x32 = 8x16 = 128 bits of data at one time, i.e., four 32-bit floating-point values or eight 16-bit fixed-point values. I guess that after conversion from 32-bit floating point to 16-bit fixed point, my program would be twice as fast.
Summary: Modern FPU hardware is hard to beat with fixed-point, even if you have twice as many elements per vector.
Modern BLAS libraries are typically very well tuned for cache performance (with cache blocking / loop tiling) as well as for instruction throughput. That makes them very hard to beat. DGEMM especially has lots of room for this kind of optimization, because it does O(N^3) work on O(N^2) data, so it's worth transposing just a cache-sized chunk of one input, and stuff like that.
What might help is reducing memory bottlenecks by storing your floats in 16-bit half-float format. There is no hardware support for doing math on them in that format, just a couple of instructions to convert between that format and normal 32-bit-element float vectors while loading/storing: VCVTPH2PS (__m256 _mm256_cvtph_ps(__m128i)) and VCVTPS2PH (__m128i _mm256_cvtps_ph(__m256 m1, const int imm8_rounding_control)). These two instructions comprise the F16C extension, first supported by AMD Bulldozer and Intel IvyBridge.
IDK if any BLAS libraries support that format.
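To show what that load/store conversion looks like in practice, here is a small sketch of my own (the helper names are made up; requires F16C, e.g. compile with -mavx -mf16c):

#include <immintrin.h>

// Keep matrices as 16-bit half floats in memory to halve bandwidth and cache
// footprint; widen to 32-bit floats only while computing.
static inline __m256 load8_half(const unsigned short *src)
{
    __m128i h = _mm_loadu_si128((const __m128i *)src);  // 8 packed half floats
    return _mm256_cvtph_ps(h);                          // widen to 8 floats
}

static inline void store8_half(unsigned short *dst, __m256 v)
{
    // Narrow back to half precision with round-to-nearest-even.
    __m128i h = _mm256_cvtps_ph(v, _MM_FROUND_TO_NEAREST_INT);
    _mm_storeu_si128((__m128i *)dst, h);
}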
Fixed point:
SSE/AVX doesn't have any integer division instructions. If you're only dividing by constants, you might not need a real div instruction, though. So that's one major stumbling block for fixed point.
Another big downside of fixed point is the extra cost of shifting to correct the position of the decimal (binary?) point after multiplies. That will eat into any gain you could get from having twice as many elements per vector with 16-bit fixed point.
SSE/AVX actually has quite a good selection of packed 16-bit multiplies (better than for any other element size). There's packed multiply producing the low half, the high half (signed or unsigned), and even one that takes 16 bits from 2 bits below the top, with rounding (PMULHRSW). Skylake runs those at two per clock, with 5 cycle latency. There are also integer multiply-add instructions, but they do a horizontal add between pairs of multiply results. (See Agner Fog's insn tables, and also the x86 tag wiki for performance links.) Haswell and previous don't have as many integer-vector add and multiply execution units. Often code bottlenecks on total uop throughput, not on a specific execution port anyway. (But a good BLAS library might even have hand-tuned asm.)
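As an illustration of how PMULHRSW absorbs that fixed-point rescaling, here is a one-line Q1.15 multiply (my own sketch; requires SSSE3):

#include <immintrin.h>

// Q1.15 * Q1.15 -> Q1.15: pmulhrsw returns ((a*b >> 14) + 1) >> 1 per lane,
// i.e. the 32-bit product scaled back down by 2^15 with rounding, so no
// separate shift instruction is needed after the multiply.
static inline __m128i q15_mul(__m128i a, __m128i b)
{
    return _mm_mulhrs_epi16(a, b);
}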
If your inputs and outputs are integer, it's often faster to work with integer vectors, instead of converting to floats. (e.g. see my answer on Scaling byte pixel values (y=ax+b) with SSE2 (as floats)?, where I used 16-bit fixed-point to deal with 8-bit integers).
But if you're really working with floats, and have a lot of multiplying and dividing to do, just use the hardware FPUs. They're amazingly powerful in modern CPUs, and have made fixed point mostly obsolete for many tasks. As @Iwill points out, FMA instructions are another big boost for FP throughput (and sometimes latency).
Integer add/subtract/compare instructions (but not multiply) are also lower latency than their FP counterparts.