On most modern 64-bit processors, does the speed of `mulq` depend on the operands? - performance

On most modern 64-bit processors (such as Intel Core 2 Duo or the Intel i7 series), does the speed of the x86_64 instruction mulq and its variants depend on the operands? For example, will multiplying 11 * 13 be faster than 11111111 * 13131313? Or does it always take the time of the worst case?

TL;DR: No. Constant-length integer math operations (barring division, which is non-linear) consume a constant number of cycles, regardless of the numerical value of the operands.
mulq multiplies two QWORD (64-bit) operands: the explicit operand and the implicit RAX.
As 64-bit binary values, written here least-significant bit first, the operands look like this:
1011000000000000000000000000000000000000000000000000000000000000 = 13
1000110001111010000100110000000000000000000000000000000000000000 = 13131313
The processor sees both of these as the same "size", as both are 64-bit values.
Therefore, the cycle count should always be the same, regardless of the actual numerical value of the operands.
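As a quick check of the bit patterns above, here is a minimal Python sketch that pads a value to 64 bits and prints it least-significant bit first. The helper name `bits_lsb_first` is made up for illustration. Both operands occupy the same 64 bit positions, however small the number is.

```python
def bits_lsb_first(value, width=64):
    """Return `value` as a fixed-width binary string, least-significant bit first."""
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the requested width")
    # format() gives the usual most-significant-bit-first string; reverse it.
    return format(value, f"0{width}b")[::-1]

for n in (13, 13131313):
    print(bits_lsb_first(n), "=", n)
```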
More info:
There are the concepts of Leading Zero Anticipation and Leading Zero Detection[1][2] (LZA/LZD) that can be employed to speed up floating-point operations.
To the best of my knowledge however, there are no mainstream processors that employ either of these methods towards integer arithmetic. This is most likely due to the simplistic nature of most integer arithmetic (multiplication in this case). The overhead of LZA/LZD may simply not be worth it, for simple integer math circuits that can complete the full multiplication in less time anyhow.

I don't have any reference to hand, but I would place money on the latency/throughput being invariant of the values of the operands. Otherwise, it would be a nightmare to schedule.

For decades, Agner Fog has been publishing tables of instruction timings. His August 2019 tables confirm what I had expected: that the CPU chips in modern laptops and desktop computers have invariant timing for their integer-multiply units. These are extremely fast and rather power-hungry.
The CPU design space is quite different for battery-limited devices such as smartphones. On such devices, the integer multiply may be implemented in a microcoded loop with variable timing.
In (approx) 2016, Thomas Pornin had this to say about "the problem" posed by variable-latency multiplication instructions to the design of his SSL/TLS library:
"... integer multiplication opcodes in CPU may or may not execute in constant time; when they do not, implementations that use such operations may exhibit execution time variations that depend on the involved data, thereby potentially leaking secret information... When a CPU has non-constant-time multiplication opcodes, the execution time depends on the size (as integers) of one or both of the operands. For instance, on the ARM Cortex-M3, the umull opcode takes 3 to 5 cycles to complete, the `short' counts (3 or 4) being taken only if both operands are numerically less than 65536 ... In general, Intel x86 CPU all provide constant-time multiplications since the days of the first Pentium. The same does not necessarily hold for other vendors, in particular the early VIA Nano." 2

Related

How to analyze the instructions pipelining on Zen4 for AVX-512 packed double computations? (backend bound)

I got access to the AMD Zen4 server and tested AVX-512 packed double performance. I chose Harmonic Series Sum[1/n over positive integers] and compared the performance using standard doubles, AVX2 (4 packed doubles) and AVX-512 (8 packed doubles). The test code is here.
The AVX2 (256-bit) version runs four times faster than the standard double version. I was expecting the AVX-512 version to run two times faster than the 256-bit version, but there was barely any improvement in runtimes:
| Method | Runtime (minutes:seconds) |
|---|---|
| HarmonicSeriesPlain | 0:41.33 |
| HarmonicSeriesAVX256 | 0:10.32 |
| HarmonicSeriesAVX512 | 0:09.82 |
I was scratching my head over the results and tested individual operations. See full results. Here is runtime for the division:
| Method | Runtime (minutes:seconds) |
|---|---|
| div_plain | 1:53.80 |
| div_avx256f | 0:28.47 |
| div_avx512f | 0:14.25 |
Interestingly, div_avx256f takes 28 seconds, while HarmonicSeriesAVX256 takes only 10 seconds to complete. HarmonicSeriesAVX256 is doing more operations than div_avx256f - summing up the results and increasing the denominator each time (the number of packed divisions is the same). The speed-up has to be due to the instructions pipelining.
However, I need help finding out more details.
The analysis with the llvm-mca (LLVM Machine Code Analyzer) fails because it does not support Zen4 yet:
gcc -O3 -mavx512f -mfma -S "$file" -o - | llvm-mca -iterations 10000 -timeline -bottleneck-analysis -retire-stats
error: found an unsupported instruction in the input assembly sequence.
note: instruction: vdivpd %zmm0, %zmm4, %zmm2
On the Intel platform, I would use
perf stat -M pipeline binary
to find more details, but this metricgroup is not available on Zen4. Any more suggestions on how to analyze the instructions pipelining on Zen4? I have tried these perf stat events:
cycles,stalled-cycles-frontend,stalled-cycles-backend,cache-misses,sse_avx_stalls,fp_ret_sse_avx_ops.all,fp_ret_sse_avx_ops.div_flops,fpu_pipe_assignment.total,fpu_pipe_assignment.total0,fpu_pipe_assignment.total1,fpu_pipe_assignment.total2,fpu_pipe_assignment.total3
and got the results here.
From this I can see that the workload is backend bound. AMD's performance event fp_ret_sse_avx_ops.all (the number of retired SSE/AVX operations) helps, but I still want to get better insights into instruction pipelining on Zen4. Any tips?
Zen 4 execution units are mostly 256-bit wide; handling a 512-bit uop occupies it for 2 cycles. It's normal that 512-bit vectors don't have more raw throughput for any math instructions in general on Zen 4. Although using them on Zen4 does mean more work per uop so out-of-order exec has an easier time.
Or in the case of division, they're occupied for longer since division isn't fully pipelined, like on all modern CPUs. Division is hard to implement.
On Intel Ice Lake for example, divpd throughput is 2 doubles per 4 clocks whether you're using 128-bit, 256-bit, or 512-bit vectors. 512-bit takes extra uops, so we can infer that the actual divider execution unit is 256-bit wide in Ice Lake, but that divpd xmm can use the two halves of it independently. (Unlike AMD).
https://agner.org/optimize/ has instruction timing tables (and his microarch PDF has details on how CPUs work that are essential to making sense of them). https://uops.info/ also has good automated microbenchmark results, free from typos and other human error except sometimes in choosing what to benchmark. (But the actual instruction sequences tested are available, so you can check what they actually tested.) Unfortunately they don't yet have Zen 4 results up, only up to Zen 3.
Zen4 has execution units 256-bit wide for the most part, so 512-bit instructions are single uop but take 2 cycles on most execution units. (Unlike Zen1 where they took 2 uops and thus hurt OoO exec). And it has efficient 512-bit shuffles, and lets you use the power of new AVX-512 instructions for 256-bit vector width, which is where a lot of the real value is. (Better shuffles, masking, vpternlogd, vector popcount, etc.)
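A rough way to see why 512-bit vectors don't raise raw throughput here is a toy model with a single fully pipelined execution unit. The function name and the single-unit assumption are mine, for illustration only; real Zen 4 has several FP pipes, but the ratio between the two widths works out the same way.

```python
import math

def cycles_and_uops(n_doubles, vector_bits, unit_bits=256):
    """Toy model: one fully pipelined unit that is `unit_bits` wide.

    A uop wider than the unit occupies it for several cycles.
    Returns (busy cycles on the unit, number of uops issued).
    """
    lanes = vector_bits // 64                        # doubles per vector
    uops = math.ceil(n_doubles / lanes)
    cycles_per_uop = max(1, vector_bits // unit_bits)
    return uops * cycles_per_uop, uops

print(cycles_and_uops(1024, 256))  # 256-bit vectors
print(cycles_and_uops(1024, 512))  # 512-bit vectors: same cycles, half the uops
```

In this model the unit is busy for the same number of cycles either way; the 512-bit version only halves the uop count, which is the "easier time for out-of-order exec" effect described above.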
Division isn't fully pipelined on any modern x86 CPU. Even on Intel CPUs 512-bit vdivpd zmm has about the same doubles-per-clock throughput as vdivpd ymm (Floating point division vs floating point multiplication has some older data on the YMM vs. XMM situation which is similar, although Zen4 apparently can't send different XMM vectors through the halves of its 256-bit-wide divide unit; vdivpd xmm has the same instruction throughput as vdivpd ymm)
Fast-reciprocal + Newton iterations
For something that's almost entirely bottlenecked on division throughput (not front-end or other ports), you might consider approximate-reciprocal with a Newton-Raphson iteration or two to refine the accuracy to close to 1 ulp. (Not quite the 0.5 ulp you'd get from exact division).
AVX-512 has vrcp14pd approx-reciprocal for packed-double. So two rounds of Newton iterations should double the number of correct bits each time, to 28 then 56 (which is more than the 53-bit mantissa of a double). Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision mostly talks about rsqrt, but similar idea.
SSE/AVX1 only had single-precision versions of the fast-reciprocal and rsqrt instructions, with only 12-bit precision. e.g. rcpps.
AVX-512ER has 28-bit precision versions, but only Xeon Phi ever had those; mainstream CPUs haven't included them. (Xeon Phi had very slow vdivps / vdivpd exact division, so it was much better to use the reciprocals.)
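The refinement step can be sketched in scalar Python. This is only an illustration of the arithmetic, not of the AVX-512 instructions: `approx_reciprocal` is a made-up stand-in that rounds the exact reciprocal to about 14 significant bits, which is an assumption about what a vrcp14pd-style estimate looks like. Each Newton-Raphson round computes x * (2 - d*x), which roughly squares the relative error.

```python
import math

def approx_reciprocal(d, bits=14):
    """Stand-in for a hardware estimate: 1/d rounded to about `bits` significant bits."""
    m, e = math.frexp(1.0 / d)            # 1/d == m * 2**e, with 0.5 <= |m| < 1
    scale = 2.0 ** bits
    return math.ldexp(round(m * scale) / scale, e)

def newton_reciprocal(d, iterations=2):
    """Refine the rough estimate; the relative error is roughly squared each round."""
    x = approx_reciprocal(d)
    for _ in range(iterations):
        x = x * (2.0 - d * x)
    return x

for rounds in range(3):
    err = abs(newton_reciprocal(3.0, rounds) * 3.0 - 1.0)
    print(rounds, "rounds: relative error about", err)
```

On real hardware each round would be a couple of FMA instructions on whole vectors, so the trade-off is more uops on the fully pipelined FMA units against fewer cycles on the divider.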
I got the answer for the question from title: How to analyze the instructions pipelining on Zen4? directly from AMD:
For determining if a workload is backend-bound, the recommended
method on Zen 4 is to use the pipeline utilization metrics. We are
in the process of providing similar metrics and metric groups through
the perf JSON event files for Zen 4 and they will be out very soon.
Read more details in this email thread
AMD has already posted the patches.
Before the patches land in your favorite Linux distribution, you can use the raw events on Zen4. Check this example

Is time cost of integer multiplication the same as any binary operation on ARM or Intel processors?

Is the processing time of an integer multiplication the same as that of any integer binary operation on a modern pipelined CPU (e.g. Intel, ARM)?
In Intel's assembly documentation, it is said that an integer multiplication takes 1 cycle, like any integer binary operation. Is this cycle equivalent to the time duration, supposing the operations are pipelined?
There is more than the cycle count to consider:
latency
pipeline
While the results of simple ALU instructions are available with no extra latency, multiply instructions have to go through the MAC (multiply-accumulate) unit, which usually costs more cycles and comes with a latency of multiple cycles.
And often there is only one MAC unit which means the core doesn't allow two mul instructions to be dual issued.
Example: ARMv5E:
smulxy (16-bit): one cycle plus three cycles latency
mul (32-bit): two cycles plus three cycles latency
umull (64-bit): three cycles plus four (lower half) and five (upper half) cycles latency
No, multiply is much more complicated than XOR, ADD, OR, NOT, etc. While binary makes it much easier than base 10 you still have to have a larger adder (than just a two operand ADD or other operation).
Take the bits abcd
       abcd
*      1011
===========
       abcd
      abcd.
     0000..
+   abcd...
===========
As in grade-school base-10 multiplication, you still multiply once per digit, but here each digit is only one or zero, so you either copy and shift the first operand or copy and shift zeros. The sum gets very big, and the addition is cascaded. Look up the XOR gate on Wikipedia and see the full adder, or just google it. A simple two-operand add uses a single-column adder with three inputs and two outputs, where the carry out of one bit is the carry in of the next. No logic is instantaneous; even a single transistor inversion (NOT) takes a non-zero amount of time. You can start to think about how many gates are lined up just to make one 32-bit two-operand ADD, and then think about a 32-bit multiply, where each adder takes 32 operand bits and some number of carry bits, and all of that is cascaded. The chip real estate and the settling time grow much faster for multiply than for a simple add, and you then start to worry about whether you can meet timing (whether the msbit of the result can settle within the desired/designed clock period).
So you will see optimizations, including multiple pipe stages: not 32 clocks to do a 32-bit multiply, but maybe not one clock either, maybe two or four. With a dozen-stage-deep pipe, though, you can bury that in there and still meet an advertised average of one clock per instruction.
With Intel, ARM, etc., the 1-cycle figure is an illusion: the math operation itself might take that long, but the execution of the instruction takes a few to a handful of cycles, and pipe depths may be several to a dozen or more stages. There is limited use in attempting to count cycles these days. Feeding the pipe and handling memory operations tend to dominate performance, not the pipe/instructions themselves, outside a carefully crafted sim of the core.
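The partial-product layout above can be written out as a short software sketch. The function name is hypothetical, and this is a sequential loop, whereas a hardware multiplier sums the partial products in parallel with a tree of adders. It does show how many shifted copies have to be added for a single multiply.

```python
def shift_add_multiply(a, b, width=32):
    """Multiply two unsigned `width`-bit integers by summing shifted partial products."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    result = 0
    for i in range(width):
        if (b >> i) & 1:          # bit i of the multiplier is 1: add a copy of a shifted by i
            result += a << i      # otherwise the partial product is all zeros
    return result                  # full product, up to 2 * width bits

print(shift_add_multiply(0b1101, 0b1011))  # 13 * 11
```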
For the Cortex-Ms, which are perhaps not what you are asking about but are very much part of our daily life, the documentation shows that the chip vendor can choose either the larger, faster multiplier or the slower, smaller one, which helps with overall chip size and perhaps performance. (I do not examine the Cortex-A docs that much, as I do not use them as often.) This is a compile-time option when the vendor compiles the core, and there are many such options, which is why, for any ARM core (Cortex-M or Cortex-A), you cannot compare, say, two Cortex-M4s from different vendors, or from different chip families within a vendor: they could have been compiled differently and behave/perform differently (they still execute the enabled instructions in the same functional way, of course).
So no, you cannot assume the "execution time" or "cycle time" of ANY instruction, and in particular ones like multiply and divide and anything floating point cannot be assumed to be single cycle. Yes, like all the other instructions, the one cycle advertised is based on pipeline effects: no instruction takes one cycle start to finish, and depending on the pipe depth of the design, the multiply and divide may take more than one clock but be hidden by the pipe to still average one clock per instruction.
Note that this question is "too broad", as there are many Intel and ARM implementations past and present. And chip implementation details are often not available or protected by NDA, all you have if anything are public documents that can hide the reality.

Do FP and integer division compete for the same throughput resources on x86 CPUs?

We know that Intel CPUs do integer division and FP div / sqrt on a not-fully-pipelined divide execution unit on port 0. We know this from IACA output, other published stuff, and experimental testing. (e.g. https://agner.org/optimize/)
But are there independent dividers for FP and integer (competing only for dispatch via port 0), or does interleaving two div-throughput-bound workloads make their cost add nearly linearly, if one is integer and the other is FP?
This is complicated by Intel CPUs (unlike AMD) decoding integer division to multiple uops, e.g. 10 for div r32 on Skylake.
AMD CPUs similarly have their divider on one execution port, but I don't know as much about them and don't have one to test on. AMD integer division decodes to only a couple uops (to write RDX and RAX), not microcoded. Experiments on AMD might be easier to interpret without lots of uops flying around being a possible cause for contention between int and fp div.
Further reading:
Semi-related: Radix divider internals
Floating point division vs floating point multiplication - FP div/sqrt vs. multiply/FMA throughputs on various Intel and AMD CPUs.
Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux - Intel's 64-bit integer division is a lot slower. Decoding to more uops (36 vs. 10 on SKL) and not even saturating the arith.divider_active perf counter.
Intel CPU architect Ronak Singhal mentions on Twitter that Broadwell (and by implication subsequent architectures until ICL) use the FP hardware for division, but that Ice Lake has a dedicated integer division unit:
Keep in mind that Broadwell that this was benchmarked on does integer division on the FP divider. In Ice Lake, there is now a dedicated integer divide unit.
So I would expect significant competition. Many of the operations that integer division performs are no doubt plain ALU ops that don't use the divider, so I wouldn't necessarily expect the inverse throughputs to be strictly cumulative, but they will definitely compete.
Ronak doesn't imply anything about pre-Broadwell implementation, but based on the similar port assignment and performance going back to at least Sandy Bridge, I think we can expect that the same sharing holds.

What are the relative cycle times for the 6 basic arithmetic operations?

When I try to optimize my code, for a very long time I've just been using a rule of thumb that addition and subtraction are worth 1, multiplication and division are worth 3, squaring is worth 3 (I rarely use the more general pow function so I have no rule of thumb for it), and square roots are worth 10. (And I assume squaring a number is just a multiplication, so worth 3.)
Here's an example from a 2D orbital simulation. To calculate and apply acceleration from gravity, first I get distance from the ship to the center of earth, then calculate the acceleration.
D = sqrt( sqr(Ship.x - Earth.x) + sqr(Ship.y - Earth.y) ); // this is worth 19
A = G*Earth.mass/sqr(D); // this is worth 9, total is 28
However, notice that in calculating D, you take a square root, but when using it in the next calculation, you square it. Therefore you can just do this:
A = G*Earth.mass/( sqr(Ship.x - Earth.x) + sqr(Ship.y - Earth.y) ); // this is worth 15
So if my rule of thumb is true, I almost cut in half the cycle time.
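The tallies in the comments above can be reproduced with a few lines of Python. The cost table is just the rule of thumb from this question, not measured data, and the operation lists are my own transcription of the three expressions.

```python
# Rule-of-thumb costs from the question (not measured timings).
COST = {"add": 1, "sub": 1, "mul": 3, "div": 3, "sqr": 3, "sqrt": 10}

def total_cost(ops):
    """Sum the rule-of-thumb cost of a list of operation names."""
    return sum(COST[op] for op in ops)

# D = sqrt( sqr(Ship.x - Earth.x) + sqr(Ship.y - Earth.y) )
distance = ["sub", "sqr", "sub", "sqr", "add", "sqrt"]
# A = G*Earth.mass/sqr(D)
accel_from_distance = ["mul", "sqr", "div"]
# A = G*Earth.mass/( sqr(Ship.x - Earth.x) + sqr(Ship.y - Earth.y) )
accel_direct = ["mul", "sub", "sqr", "sub", "sqr", "add", "div"]

print(total_cost(distance) + total_cost(accel_from_distance))  # two-step version
print(total_cost(accel_direct))                                # combined version
```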
However, I cannot even remember where I heard that rule before. I'd like to ask: what are the actual cycle times for those basic arithmetic operations?
Assumptions:
everything is a 64-bit floating number in x64 architecture.
everything is already loaded into registers, so no worrying about hits and misses from caches or memory.
no interrupts to the CPU
no if/branching logic such as look ahead prediction
Edit: I suppose what I'm really trying to do is look inside the ALU and only count the cycle time of its logic for the 6 operations. If there is still variance within that, please explain what and why.
Note: I did not see any tags for machine code, so I chose the next closest thing, assembly. To be clear, I am talking about actual machine code operations in x64 architecture. Thus it doesn't matter whether those lines of code I wrote are in C#, C, Javascript, whatever. I'm sure each high-level language will have its own varying times so I don't wanna get into an argument over that. I think it's a shame that there's no machine code tag because when talking about performance and/or operation, you really need to get down into it.
At a minimum, one must understand that an operation has at least two interesting timings: the latency and the throughput.
Latency
The latency is how long any particular operation takes, from its inputs to its output. If you had a long series of operations where the output of one operation is fed into the input of the next, the latency would determine the total time. For example, an integer multiplication on most recent x86 hardware has a latency of 3 cycles: it takes 3 cycles to complete a single multiplication operation. Integer addition has a latency of 1 cycle: the result is available the cycle after the addition executes. Latencies are generally positive integers.
Throughput
The throughput is the number of independent operations that can be performed per unit time. Since CPUs are pipelined and superscalar, this is often more than the inverse of the latency. For example, on most recent x86 chips, 4 integer addition operations can execute per cycle, even though the latency is 1 cycle. Similarly, 1 integer multiplication can execute, on average per cycle, even though any particular multiplication takes 3 cycles to complete (meaning that you must have multiple independent multiplications in progress at once to achieve this).
Inverse Throughput
When discussing instruction performance, it is common to give throughput numbers as "inverse throughput", which is simply 1 / throughput. This makes it easy to directly compare with latency figures without doing a division in your head. For example, the inverse throughput of addition is 0.25 cycles, versus a latency of 1 cycle, so you can immediately see that if you have sufficient independent additions, they use only something like 0.25 cycles each.
Below I'll use inverse throughput.
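To make the distinction concrete, here is a simplified cost model in Python. The function names are hypothetical, and the model ignores everything except latency and inverse throughput (no port contention, no front-end limits). A dependent chain pays the full latency for every operation, while independent operations overlap and pay roughly the inverse throughput each.

```python
def dependent_chain_cycles(n_ops, latency):
    """Each operation needs the previous result, so latencies add up."""
    return n_ops * latency

def independent_ops_cycles(n_ops, latency, inverse_throughput):
    """Operations overlap: the first result arrives after `latency` cycles,
    then one more result completes every `inverse_throughput` cycles."""
    return latency + (n_ops - 1) * inverse_throughput

# Integer multiply figures quoted above: latency 3, inverse throughput 1.
print(dependent_chain_cycles(1000, 3))
print(independent_ops_cycles(1000, 3, 1))
# Integer add figures quoted above: latency 1, inverse throughput 0.25.
print(independent_ops_cycles(1000, 1, 0.25))
```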
Variable Timings
Most simple instructions have fixed timings, at least in their reg-reg form. Some more complex mathematical operations, however, may have input-dependent timings. For example, addition, subtraction and multiplication usually have fixed timings in their integer and floating point forms, but on many platforms division has variable timings in integer, floating point or both. Agner's numbers often show a range to indicate this, but you shouldn't assume the operand space has been tested extensively, especially for floating point.
The Skylake numbers below, for example, show a small range, but it isn't clear if that's due to operand dependency (which would likely be larger) or something else.
Passing denormal inputs, or results that themselves are denormal may incur significant additional cost depending on the denormal mode. The numbers you'll see in the guides generally assume no denormals, but you might be able to find a discussion of denormal costs per operation elsewhere.
More Details
The above is necessary but often not sufficient information to fully qualify performance, since you have other factors to consider such as execution port contention, front-end bottlenecks, and so on. It's enough to start though and you are only asking for "rule of thumb" numbers if I understand it correctly.
Agner Fog
My recommended source for measured latency and inverse throughput numbers is Agner Fog's guides. You want the files under "4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs", which list fairly exhaustive timings on a huge variety of AMD and Intel CPUs. You can also get the numbers for some CPUs directly from Intel's guides, but I find them less complete and more difficult to use than Agner's.
Below I'll pull out the numbers for a couple of modern CPUs, for the basic operations you are interested in.
Intel Skylake
| Operation | Latency (cycles) | Inverse throughput (cycles) |
|---|---|---|
| add/sub (addsd, subsd) | 4 | 0.5 |
| multiply (mulsd) | 4 | 0.5 |
| divide (divsd) | 13-14 | 4 |
| sqrt (sqrtpd) | 15-16 | 4-6 |
So a "rule of thumb" for latency would be add/sub/mul all cost 1, and division and sqrt are about 3 and 4, respectively. For throughput, the rule would be 1, 8, 8-12 respectively. Note also that the latency is much larger than the inverse throughput, especially for add, sub and mul: you'd need 8 parallel chains of operations if you wanted to hit the max throughput.
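The "8 parallel chains" figure comes from dividing latency by inverse throughput. A minimal sketch, using the Skylake numbers from the table above (taking the lower end of each range, which is my simplification):

```python
import math

# (latency, inverse throughput) in cycles, from the Skylake table above;
# where the table gives a range, the lower end is used here.
SKYLAKE = {
    "add/sub": (4, 0.5),
    "multiply": (4, 0.5),
    "divide": (13, 4),
    "sqrt": (15, 4),
}

def chains_needed(latency, inverse_throughput):
    """Independent dependency chains needed to keep the execution unit busy."""
    return math.ceil(latency / inverse_throughput)

for name, (lat, inv_tpt) in SKYLAKE.items():
    print(name, chains_needed(lat, inv_tpt))
```

With fewer independent chains than this, the code is limited by latency rather than by throughput.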
AMD Ryzen
| Operation | Latency (cycles) | Inverse throughput (cycles) |
|---|---|---|
| add/sub (addsd, subsd) | 3 | 0.5 |
| multiply (mulsd) | 4 | 0.5 |
| divide (divsd) | 8-13 | 4-5 |
| sqrt (sqrtpd) | 14-15 | 4-8 |
The Ryzen numbers are broadly similar to recent Intel. Addition and subtraction are slightly lower latency, multiplication is the same. Latency-wise, the rule of thumb could still generally be summarized as 1/3/4 for add,sub,mul/div/sqrt, with some loss of precision.
Here, the latency range for divide is fairly large, so I expect it is data dependent.

Floating-point number vs fixed-point number: speed on Intel I5 CPU

I have a C/C++ program which involves intensive 32-bit floating-point matrix math computations such as addition, subtraction, multiplication, division, etc.
Can I speed up my program by converting 32-bit floating-point numbers into 16-bit fixed-point numbers? How much speed gain can I get?
Currently I'm working on an Intel i5 CPU. I'm using OpenBLAS to perform the matrix calculations. How should I re-implement OpenBLAS functions such as cblas_dgemm to perform fixed-point calculations?
I know that SSE (Streaming SIMD Extensions) operates on 4x32=8x16=128 bits of data at one time, i.e., 4 32-bit floating-point values or 8 16-bit fixed-point values. I guess that after conversion from 32-bit floating-point to 16-bit fixed-point, my program would be twice as fast.
Summary: Modern FPU hardware is hard to beat with fixed-point, even if you have twice as many elements per vector.
Modern BLAS libraries are typically very well tuned for cache performance (with cache blocking / loop tiling) as well as for instruction throughput. That makes them very hard to beat. DGEMM especially has lots of room for this kind of optimization, because it does O(N^3) work on O(N^2) data, so it's worth transposing just a cache-sized chunk of one input, and stuff like that.
What might help is reducing memory bottlenecks by storing your floats in 16-bit half-float format. There is no hardware support for doing math on them in that format, just a couple instructions to convert between that format and normal 32-bit element float vectors while loading/storing: VCVTPH2PS (__m256 _mm256_cvtph_ps(__m128i)) and VCVTPS2PH (__m128i _mm256_cvtps_ph(__m256 m1, const int imm8_rounding_control)). These two instructions comprise the F16C extension, first supported by AMD Bulldozer and Intel IvyBridge.
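To get a feel for what half-float storage costs in precision, here is a small Python sketch that round-trips a value through the 16-bit IEEE half format using the standard `struct` module. This only emulates the storage format in software; it does not use the F16C instructions, and the helper name is made up.

```python
import struct

def to_half_and_back(x):
    """Store x as a 16-bit IEEE half-float, then load it back as a Python float."""
    packed = struct.pack("<e", x)          # "e" is the half-precision format code
    return struct.unpack("<e", packed)[0]

print(struct.calcsize("<e"), "bytes per element instead of 4 for a 32-bit float")
for value in (1.0, 0.1, 1000.5):
    print(value, "->", to_half_and_back(value))
```

Values that need more than about 11 significant bits lose precision on the way, which is the price of halving the memory traffic.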
IDK if any BLAS libraries support that format.
Fixed point:
SSE/AVX doesn't have any integer division instructions. If you're only dividing by constants, you might not need a real div instruction, though. So that's one major stumbling block for fixed point.
Another big downside of fixed point is the extra cost of shifting to correct the position of the decimal (binary?) point after multiplies. That will eat into any gain you could get from having twice as many elements per vector with 16-bit fixed point.
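The extra shift can be seen in a tiny scalar sketch. The choice of 8 fractional bits and the helper names are assumptions for illustration. Python integers never overflow, so this skips the widening that real 16-bit lanes would need for the intermediate product.

```python
FRAC_BITS = 8  # assumed format: 8 fractional bits

def to_fixed(x):
    """Convert a float to fixed point by scaling and rounding."""
    return int(round(x * (1 << FRAC_BITS)))

def from_fixed(q):
    """Convert a fixed-point value back to a float."""
    return q / (1 << FRAC_BITS)

def fixed_mul(a, b):
    """The raw product carries 2 * FRAC_BITS fractional bits; shift to correct it."""
    return (a * b) >> FRAC_BITS

a, b = to_fixed(1.5), to_fixed(2.25)
print(from_fixed(a * b))            # without the shift: wrong scale
print(from_fixed(fixed_mul(a, b)))  # with the shift: 1.5 * 2.25
```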
SSE/AVX actually has quite a good selection of packed 16-bit multiplies (better than for any other element size). There's packed multiply producing the low half, high half (signed or unsigned), and even one that takes 16 bits from 2 bits below the top, with rounding (PMULHRSW). Skylake runs those at two per clock, with 5 cycle latency. There are also integer multiply-add instructions, but they do a horizontal add between pairs of multiply results. (See Agner Fog's insn tables, and also the x86 tag wiki for performance links.) Haswell and previous don't have as many integer-vector add and multiply execution units. Often code bottlenecks on total uop throughput, not on a specific execution port anyway. (But a good BLAS library might even have hand-tuned asm.)
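As an illustration of the "16 bits from 2 bits below the top, with rounding" behaviour, here is a scalar Python emulation of a single 16-bit lane of PMULHRSW, as I understand its documented operation. The function name is made up. With inputs interpreted as Q15 fixed point (value / 32768), it returns the rounded Q15 product, so the shift that fixed-point multiplies usually need is built in.

```python
def pmulhrsw_lane(a, b):
    """Emulate one signed 16-bit lane: ((a * b >> 14) + 1) >> 1, kept to 16 bits."""
    if not (-32768 <= a <= 32767 and -32768 <= b <= 32767):
        raise ValueError("inputs must fit in signed 16 bits")
    t = ((a * b) >> 14) + 1            # keep the top bits plus one rounding bit
    r = (t >> 1) & 0xFFFF              # drop the rounding bit, wrap to 16 bits
    return r - 0x10000 if r & 0x8000 else r

print(pmulhrsw_lane(16384, 16384))     # 0.5 * 0.5 in Q15
print(pmulhrsw_lane(-16384, 16384))    # -0.5 * 0.5 in Q15
print(pmulhrsw_lane(-32768, -32768))   # the one input pair that wraps around
```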
If your inputs and outputs are integer, it's often faster to work with integer vectors, instead of converting to floats. (e.g. see my answer on Scaling byte pixel values (y=ax+b) with SSE2 (as floats)?, where I used 16-bit fixed-point to deal with 8-bit integers).
But if you're really working with floats, and have a lot of multiplying and dividing to do, just use the hardware FPUs. They're amazingly powerful in modern CPUs, and have made fixed-point mostly obsolete for many tasks. As @Iwill points out, FMA instructions are another big boost for FP throughput (and sometimes latency).
Integer add/subtract/compare instructions (but not multiply) are also lower latency than their FP counterparts.
