Is time cost of integer multiplication the same as any binary operation on ARM or Intel processors? - performance

Is the processing time of an integer multiplication the same as any integer binary operation on modern CPU with pipelining (e.g Intel, ARM) ?
In the Assembly documentation of Intel, it is said that an integer multiplication takes 1 cycle, like any integer binary operation. Is this cycle equivalent to the time duration supposing the operations are pipelined ?

There are more than the cycles to consider:
latency
pipeline
While the results of ALU instructions are instantaneous, multiply instructions have to go through MAC(multiply accumulate) which usually costs more cycles and comes with a latency of multiple cycles.
And often there is only one MAC unit which means the core doesn't allow two mul instructions to be dual issued.
example: ARMv5E:
smulxy(16bit): one cycle plus three cycles latency
mul(32bit): two cycles plus three cycles latency
umull(64bit): three cycles plus four(lower half) and five(upper half) cycles latency

No, multiply is much more complicated than XOR, ADD, OR, NOT, etc. While binary makes it much easier than base 10 you still have to have a larger adder (than just a two operand ADD or other operation).
Take the bits abcd
abcd
* 1011
========
abcd
abcd.
0000..
+abcd...
=========
In base 10 like grade school you had to multiply each time, you are still multiplying here but only by one or zero so either you copy and shift the first operand or you copy and shift zeros. And it gets very big, addition is cascaded. Look up xor gate at wikipedia and see the full adder or just google it. You have a single column adder for a simple two operand add with three inputs and two outputs but the carry out of one bit is the carry in of the other. No logic is instantaneous even a single transistor inversion (NOT) takes a non-zero amount of time. You can start to think about how many gates are lined up just to make one 32 bit two operand ADD, and then think about a 32 bit multiply where each adder is 32 operand bits and some number of carry bits, and then all of that is cascaded. The chip real estate and the time to settle multiply almost exponentially for multiply, and you then start to worry about can you meet timing (can you settle the msbit of the result within the desired/designed clock speed).
So you will see optimizations made including multiple pipe stages, not 32 clocks to do a 32 bit multiply but maybe not one clock maybe two or four. With a dozen stage deep pipe though you can bury that in there and still meet an advertised one clock per instruction average.
Intel, ARM, etc the 1 cycle thing is an illusion, the math operation itself might take that long, but the execution of the instruction takes a few to a handful, and your pipe depths may be several to a dozen or more. There is limited use in attempting to count cycles these days. And feeding the pipe and handling memory operations tend to dominate the performance not the pipe/instructions themselves outside a carefully crafted sim of the core.
For the cortex-ms which are perhaps not what you are asking about but are very much part of our daily life you see in the documentation that it is the chip vendor that can choose the larger faster multiply or the slower smaller that helps with overall chip size and perhaps performance. (I do not examine the cortex-a docs that much as I do not use them as often) A compile time option when they compile the core, there are many compile time options (which is why for any arm core cortex-m or cortex-a) you cannot compare, say, two cortex-m4s from different vendors or chip families within a vendor as they could have been compiled differently and behave/perform differently (they still execute the enabled instructions in the same functional way of course).
So no you cannot assume the "execution time" or "cycle time" of ANY instruction, and in particular ones like multiply and divide and anything floating point cannot assumed to be single cycle. Yes like all the other instructions the one cycle advertised is based on the pipeline effects, no instruction takes one cycle start to finish, and based on pipe depth of the design the multiply and divide may take more than one clock but be hidden by the pipe to still average one clock per instruction.
Note that this question is "too broad", as there are many Intel and ARM implementations past and present. And chip implementation details are often not available or protected by NDA, all you have if anything are public documents that can hide the reality.

Related

How can x86 bsr/bsf have fixed latency, not data dependent? Doesn't it loop over bits like the pseudocode shows?

I am on the hook to analyze some "timing channels" of some x86 binary code. I am posting one question to comprehend the bsf/bsr opcodes.
So high-levelly, these two opcodes can be modeled as a "loop", which counts the leading and trailing zeros of a given operand. The x86 manual has a good formalization of these opcodes, something like the following:
IF SRC = 0
THEN
ZF ← 1;
DEST is undefined;
ELSE
ZF ← 0;
temp ← OperandSize – 1;
WHILE Bit(SRC, temp) = 0
DO
temp ← temp - 1;
OD;
DEST ← temp;
FI;
But to my suprise, bsf/bsr instructions seem to have fixed cpu cycles. According to some documents I found here: https://gmplib.org/~tege/x86-timing.pdf, seems that they always take 8 CPU cycles to finish.
So here are my questions:
I am confirming that these instructions have fixed cpu cycles. In other words, no matter what operand is given, they always take the same amount of time to process, and there is no "timing channel" behind. I cannot find corresponding specifications in Intel's official documents.
Then why it is possible? Apparently this is a "loop" or somewhat, at least high-levelly. What is the design decision behind? Easier for CPU pipelines?
BSF/BSR performance is not data dependent on any modern CPUs. See https://agner.org/optimize/, https://uops.info/, or http://instlatx64.atw.hu/ for experimental timing results, as well as the https://gmplib.org/~tege/x86-timing.pdf you found.
On modern Intel, they decode to 1 uop with 3 cycle latency and 1/clock throughput, running only on port 1. Ryzen also runs them with 3c latency for BSF, 4c latency for BSR, but multiple uops. Earlier AMD is sometimes even slower.
(Prefer rep bsf aka tzcnt in code that might run on AMD CPUs, if you don't need the FLAGS difference between bsf and tzcnt for zero inputs. lzcnt and tzcnt are fast on AMD as well, like 1 cycle latency with 3/clock throughput for lzcnt on Zen 2 (https://uops.info/). Unfortunately lzcnt and bsr aren't compatible that way, so you can't use it in an "optimistic" forward-compatible way, you have to know which you're getting.)
Your "8 cycle" (latency and throughput) cost appears to be for 32-bit BSF on AMD K8, from Granlund's table that you linked. Agner Fog's table agrees, (and shows it decodes to 21 uops instead of having a dedicated bit-scan execution unit. But the microcoded implementation is presumably still branchless and not data-dependent). No clue why you picked that number; K8 doesn't have SMT / Hyperthreading so the opportunity for an ALU-timing side channel is much reduced.
Do note that they have an output dependency on the destination register, which they leave unmodified if the input was zero. AMD documents this behaviour, Intel implements it in hardware but documents it as an "undefined" result, so unfortunately compilers won't take advantage of it and human programmers maybe should be cautious. IDK if some ancient 32-bit only CPU had different behaviour, or if Intel is planning to ever change (doubtful!), but I wish Intel would document the behaviour at least for 64-bit mode (which excludes any older CPUs).
lzcnt/tzcnt and popcnt on Intel CPUs (but not AMD) have the same output dependency before Skylake and before Cannon Lake (respectively), even though architecturally the result is well-defined for all inputs. They all use the same execution unit. (How is POPCNT implemented in hardware?). AMD Bulldozer/Ryzen builds their bit-scan execution unit without the output dependency baked in, so BSF/BSR are slower than LZCNT/TZCNT (multiple uops to handle the input=0 case, and probably also setting ZF according to the input, not the result).
(Taking advantage of that with intrinsics isn't possible; not even with MSVC's _BitScanReverse64 which uses a by-reference output arg that you could set first. MSVC doesn't respect the previous value and assumes it's output-only. VS: unexpected optimization behavior with _BitScanReverse64 intrinsic)
The pseudocode in the manual is not the implementation
(i.e. it's not necessarily how hardware or microcode works).
It gives precisely the same result in all cases, so you can use it to understand exactly what will happen for any corner cases the text leaves you wondering about. That is all.
The point is to be simple and easy to understand, and that means modeling things in terms of simple 2-input operations which happen serially. C / Fortran / typical pseudocode doesn't have operators for many-input AND, OR, or XOR, but you can build that in hardware up to a point (limited by fan-in, the opposite of fan-out).
Integer addition can be modelled as bit-serial ripple carry, but that's not how it's implemented! Instead, we get single-cycle latency for 64-bit addition with far fewer than 64 gate delays using tricks like carry lookahead adders.
The actual implementation techniques used in Intel's bit-scan / popcnt execution unit are described in US Patent US8214414 B2.
Abstract
A merged datapath for PopCount and BitScan is described. A hardware
circuit includes a compressor tree utilized for a PopCount function,
which is reused by a BitScan function (e.g., bit scan forward (BSF) or
bit scan reverse (BSR)).
Selector logic enables the compressor tree to
operate on an input word for the PopCount or BitScan operation, based
on a microprocessor instruction. The input word is encoded if a
BitScan operation is selected.
The compressor tree receives the input
word, operates on the bits as though all bits have same level of
significance (e.g., for an N-bit input word, the input word is treated
as N one-bit inputs). The result of the compressor tree circuit is a
binary value representing a number related to the operation performed
(the number of set bits for PopCount, or the bit position of the first
set bit encountered by scanning the input word).
It's fairly safe to assume that Intel's actual silicon works similarly to this. Other Intel patents for things like out-of-order machinery (ROB, RS) do tend to match up with performance experiments we can perform.
AMD may do something different, but regardless we know from performance experiments that it's not data-dependent.
It's well known that fixed latency is a hugely beneficial thing for out-of-order scheduling, so it's very surprising when instructions don't have fixed latency. Sandybridge even went so far as to standardize uop latencies to simplify the scheduler and reduce the opportunities write-back conflicts. (e.g. a 3-cycle latency uop followed by a 2-cycle latency uop to the same port would produce 2 results in the same cycle). This meant making complex-LEA (with all 3 components: [disp + base + idx*scale]) take 3 cycles instead of just 2 for the 2 additions like on previous CPUs. There are no 2-cycle latency uops on Sandybridge-family. (There are some 2-cycle latency instructions, because they decode to 2 uops with 1c latency each. The scheduler schedules uops, not instructions).
One of the few exceptions to the rule of fixed latency for ALU uops is division / sqrt, which uses a not-fully-pipelined execution unit. Division is inherently iterative, unlike multiplication where you can make wide hardware that does the partial products and partial additions in parallel.
On Intel CPUs, variable-latency for L1d cache access can produce replays of dependent uops if the data wasn't ready when the scheduler optimistically hoped it would be.
Is there a penalty when base+offset is in a different page than the base?
Why does the number of uops per iteration increase with the stride of streaming loads?
Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?
The 80x86 manual has a good description of the expected behavior, but that has nothing to do with how it's actually implemented in silicon in any model from any manufacturer.
Let's say that there's been 50 different CPU designs from Intel, 25 CPU designs from AMD, then 25 more from other manufacturers (VIA, Cyrix, SiS/Vortex, NSC, ...). Out of those 100 different CPU designs, maybe there's 20 completely different ways that BSF has been implemented, and maybe 10 of them have fixed timing, 5 have timing that depends on every bit of the source operand, and 5 depend on groups of bits of the source operand (e.g. maybe like "if highest 32 bits of 64-bit operand are zeros { switch to 32-bit logic that's 2 cycles faster }").
I am confirming that these instructions have fixed cpu cycles. In other words, no matter what operand is given, they always take the same amount of time to process, and there is no "timing channel" behind. I cannot find corresponding specifications in Intel's official documents.
You can't. More specifically, you can test or research existing CPUs, but that's a waste of time because next week Intel (or AMD or VIA or someone else) can release a new CPU that has completely different timing.
As soon as you rely on "measured from existing CPUs" you're doing it wrong. You have to rely on "architectural guarantees" that apply to all future CPUs. There is no "architectural guarantee". You have to assume that there may be a timing side-channel (even if there isn't for current CPUs)
Then why it is possible? Apparently this is a "loop" or somewhat, at least high-levelly. What is the design decision behind? Easier for CPU pipelines?
Instead of doing a 64-bit BSF, why not split it into a pair of 32-bit pieces and do them in parallel, then merge the results? Why not split it into eight 8-bit pieces? Why not use a table lookup for each 8-bit piece?
The answers posted have explained well that the implementation is different from pseudocode. But if you are still curious why the latency is fixed and not data dependent or uses any loops for that matter, you need to see electronic side of things.
One way you could implement this feature in hardware is by using a Priority encoder.
A priority encoder will accept n input lines that can be one or off (0 or 1) and give out the index of the highest priority line that is on. Below is a table from the linked Wikipedia article modified for a most significant set bit function.
input | output index of first set bit
0000 | xx undefined
0001 | 00 0
001x | 01 1
01xx | 10 2
1xxx | 11 3
x denotes the bit value does not matter and can be anything
If you see the circuit diagram on the article, there are no loops of any kind, it is all parallel.

What is the definition of Floating Point Operations ( FLOPs )

I'm trying to optimize my code with SIMD ( on ARM CPUs ), and want to know its arithmetic intensity (flops/byte, AI) and FLOPS.
In order to calculate AI and FLOPS, I have to count the number of floating point operations(FLOPs).
However, I can't find any precise definition of FLOPs.
Of course, mul, add, sub, div are clearly FLOPs, but how about move operations, shuffle operations (e.g. _mm_shuffle_ps), set operations (e.g. _mm_set1_ps), conversion operations (e.g. _mm_cvtps_pi32), etc. ?
They're operations that deal with floating point values. Should I count them as FLOPs ? If not, why ?
Which operations do profilers like Intel VTune and Nvidia's nvprof, or PMUs usually count ?
EDIT:
What all operations does FLOPS include?
This question is mainly about mathematically complex operations.
I also want to know the standard way to deal with "not mathematical" operations which take floating point values or vectors as inputs.
Shuffle / blend on FP values are not considered FLOPs. They are just overhead of using SIMD on not purely "vertical" problems, or for problems with branching that you do branchlessly with a blend.
Neither are FP AND/OR/XOR. You could try to justify counting FP absolute value using andps (_mm_and_ps), but normally it's not counted. FP abs doesn't require looking at the exponent / significand, or normalizing the result, or any of the things that make FP execution units expensive. abs (AND) / sign-flip (XOR) or make negative (OR) are trivial bitwise ops.
FMA is normally counted as two floating point ops (the mul and add), even though it's a single instruction with the same (or similar) performance to SIMD FP add or mul. The most important problem that bottlenecks on raw FLOP/s is matmul, which does need an equal mix of mul and add, and can take advantage of FMA perfectly.
So the FLOP/s of a Haswell core is
its SIMD vector width (8 float elements per vector)
times SIMD FMA per clock (2)
times FLOPs per FMA (2)
times clock speed (max single core turbo it can sustain while maxing out both FMA units; long-term depends on cooling, short term just depends on power limits).
For a whole CPU, not just a single core: multiply by number of cores and use the max sustained clock speed with all cores busy, usually lower than single-core turbo on CPUs that have turbo at all.)
Intel and other CPU vendors don't count the fact that their CPUs can also sustain a vandps in parallel with 2 vfma132ps instructions per clock, because FP abs is not a difficult operation.
See also How do I achieve the theoretical maximum of 4 FLOPs per cycle?. (It's actually more than 4 on modern CPUs :P)
Peak FLOPS (FP ops per second, or FLOP/s) isn't achievable if you have much other overhead taking up front-end bandwidth or creating other bottlenecks. The metric is just the raw amount of math you can do when running in a straight line, not on any specific practical problem.
Although people would think it's silly if theoretical peak flops is much higher than a carefully hand-tuned matmul or Mandelbrot could ever achieve, even for compile-time-constant problem sizes. e.g. if the front-end couldn't keep up with doing any stores as well as the FMAs. e.g. if Haswell had four FMA execution units, so it could only sustain max FLOPs if literally every instruction was an FMA. Memory source operands could micro-fuse for loads, but there'd be no room to store without hurting throughput.
The reason Intel doesn't have even 3 FMA units is that most real code has trouble saturating 2 FMA units, especially with only 2 load ports and 1 store port. They'd be wasted almost all of the time, and 256-bit FMA unit takes a lot of transistors.
(Ice Lake widens issue/rename stage of the pipeline to 5 uops/clock, but also widens SIMD execution units to 512-bit with AVX-512 instead of adding a 3rd 256-bit FMA unit. It has 2/clock load and 2/clock store, although that store throughput is only sustainable to L1d cache for 32-byte or narrower stores, not 64-byte.)
When it comes to optimisation, it is common practise to only measure FLOPs on the hotspots of your code, for example, the number of Floating Point Multiply & Accumulate operations in Convolution. This is mainly because other operations might be insignificant or irreplaceable and therefore can't be exploited for any kind of optimization.
For example, all instructions under Vector Floating Point Instructions in A4.13 in ARMv7 Reference Manual fall under a Floating Point Operation as a FLOPs/Cycle for an FPU instruction is typically constant in a processor.
Not just ARM, but many micro-processors have a dedicated Floating Point Unit, so when you are measuring FLOPs, you're measuring the speed of this unit. With this and FLOPs/cycle you can more or less calculate the theoretical peak performance.
But, FLOPs are to be taken with a grain of salt, as they can only be used to approximately estimate the speed of your code because they fail to take into account other conditions your processor operates under. This is why counting FLOPs only for your hotspots (usually arithmetic ops) is more or less enough in most cases.
Having said that, FLOPs can act as a comparative metric for two strenuous piece of code but doesn't say much about your code per se.

What are the relative cycle times for the 6 basic arithmetic operations?

When I try to optimize my code, for a very long time I've just been using a rule of thumb that addition and subtraction are worth 1, multiplication and division are worth 3, squaring is worth 3 (I rarely use the more general pow function so I have no rule of thumb for it), and square roots are worth 10. (And I assume squaring a number is just a multiplication, so worth 3.)
Here's an example from a 2D orbital simulation. To calculate and apply acceleration from gravity, first I get distance from the ship to the center of earth, then calculate the acceleration.
D = sqrt( sqr(Ship.x - Earth.x) + sqr(Ship.y - Earth.y) ); // this is worth 19
A = G*Earth.mass/sqr(D); // this is worth 9, total is 28
However, notice that in calculating D, you take a square root, but when using it in the next calculation, you square it. Therefore you can just do this:
A = G*Earth.mass/( sqr(Ship.x - Earth.x) + sqr(Ship.y - Earth.y) ); // this is worth 15
So if my rule of thumb is true, I almost cut in half the cycle time.
However, I cannot even remember where I heard that rule before. I'd like to ask what is the actual cycle times for those basic arithmetic operations?
Assumptions:
everything is a 64-bit floating number in x64 architecture.
everything is already loaded into registers, so no worrying about hits and misses from caches or memory.
no interrupts to the CPU
no if/branching logic such as look ahead prediction
Edit: I suppose what I'm really trying to do is look inside the ALU and only count the cycle time of its logic for the 6 operations. If there is still variance within that, please explain what and why.
Note: I did not see any tags for machine code, so I chose the next closest thing, assembly. To be clear, I am talking about actual machine code operations in x64 architecture. Thus it doesn't matter whether those lines of code I wrote are in C#, C, Javascript, whatever. I'm sure each high-level language will have its own varying times so I don't wanna get into an argument over that. I think it's a shame that there's no machine code tag because when talking about performance and/or operation, you really need to get down into it.
At a minimum, one must understand that an operation has at least two interesting timings: the latency and the throughput.
Latency
The latency is how long any particular operation takes, from its inputs to its output. If you had a long series of operations where the output of one operation is fed into the input of the next, the latency would determine the total time. For example, an integer multiplication on most recent x86 hardware has a latency of 3 cycles: it takes 3 cycles to complete a single multiplication operation. Integer addition has a latency of 1 cycle: the result is available the cycle after the addition executes. Latencies are generally positive integers.
Throughput
The throughput is the number of independent operations that can be performed per unit time. Since CPUs are pipelined and superscalar, this is often more than the inverse of the latency. For example, on most recent x86 chips, 4 integer addition operations can execute per cycle, even though the latency is 1 cycle. Similarly, 1 integer multiplication can execute, on average per cycle, even though any particular multiplication takes 3 cycles to complete (meaning that you must have multiple independent multiplications in progress at once to achieve this).
Inverse Throughput
When discussing instruction performance, it is common to give throughput numbers as "inverse throughput", which is simply 1 / throughput. This makes it easy to directly compare with latency figures without doing a division in your head. For example, the inverse throughput of addition is 0.25 cycles, versus a latency of 1 cycle, so you can immediately see that you if you have sufficient independent additions, they use only something like 0.25 cycles each.
Below I'll use inverse throughput.
Variable Timings
Most simple instructions have fixed timings, at least in their reg-reg form. Some more complex mathematical operations, however, may have input-dependent timings. For example, addition, subtraction and multiplication usually have fixed timings in their integer and floating point forms, but on many platforms division has variable timings in integer, floating point or both. Agner's numbers often show a range to indicate this, but you shouldn't assume the operand space has been tested extensively, especially for floating point.
The Skylake numbers below, for example, show a small range, but it isn't clear if that's due to operand dependency (which would likely be larger) or something else.
Passing denormal inputs, or results that themselves are denormal may incur significant additional cost depending on the denormal mode. The numbers you'll see in the guides generally assume no denormals, but you might be able to find a discussion of denormal costs per operation elsewhere.
More Details
The above is necessary but often not sufficient information to fully qualify performance, since you have other factors to consider such as execution port contention, front-end bottlenecks, and so on. It's enough to start though and you are only asking for "rule of thumb" numbers if I understand it correctly.
Agner Fog
My recommended source for measured latency and inverse throughput numbers are Agner's Fogs guides. You want the files under 4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs, which lists fairly exhaustive timings on a huge variety of AMD and Intel CPUs. You can also get the numbers for some CPUs directly from Intel's guides, but I find them less complete and more difficult to use than Agner's.
Below I'll pull out the numbers for a couple of modern CPUs, for the basic operations you are interested in.
Intel Skylake
Lat Inv Tpt
add/sub (addsd, subsd) 4 0.5
multiply (mulsd) 4 0.5
divide (divsd) 13-14 4
sqrt (sqrtpd) 15-16 4-6
So a "rule of thumb" for latency would be add/sub/mul all cost 1, and division and sqrt are about 3 and 4, respectively. For throughput, the rule would be 1, 8, 8-12 respectively. Note also that the latency is much larger than the inverse throughput, especially for add, sub and mul: you'd need 8 parallel chains of operations if you wanted to hit the max throughput.
AMD Ryzen
Lat Inv Tpt
add/sub (addsd, subsd) 3 0.5
multiply (mulsd) 4 0.5
divide (divsd) 8-13 4-5
sqrt (sqrtpd) 14-15 4-8
The Ryzen numbers are broadly similar to recent Intel. Addition and subtraction are slightly lower latency, multiplication is the same. Latency-wise, the rule of thumb could still generally be summarized as 1/3/4 for add,sub,mul/div/sqrt, with some loss of precision.
Here, the latency range for divide is fairly large, so I expect it is data dependent.

Faster cpu wastes more time as compared to slower cpu

Suppose I have a program that has an instruction to add two numbers and that operation takes 10 nanoseconds(constant, as enforced by the gate manufactures).
Now I have 3 different processors A, B and C(where A< B < C in terms of clock cycles). A's one clock cycle has 15 nanosec, B has 10 nanosec and C has 7 nanosec.
Firstly am I correct on my following assumptions-
1. Add operation takes 1 complete cycle of processor A(slow processor) and wastes rest of 5 ns of the cycle.
2. Add operation takes 1 complete cycle of processor B wasting no time.
3. Add operation takes 2 complete cycles(20 ns) of processor C(fast processor) wasting rest of the 20-14=7 ns.
If the above assumptions are correct then isn't this a contradiction to the regular assumption that processors with high clock cycles are faster. Here processor C which is the fastest actually takes 2 cycles and wastes 7ns whereas, the slower processor A takes just 1 cycle.
Processor C is fastest, no matter what. It takes 7 ns per cycle and therefore performs more cycles than A and B. It's not C's fault that the circuit is not fast enough. If you would implement the addition circuit in a way that it gives result in 1 ns, all processors will give the answer in 1 clock cycle (i.e. C will give you the answer in 7ns, B in 10ns and A in 15ns).
Firstly am I correct on my following assumptions-
1. Add operation takes 1 complete cycle of processor A(slow processor) and wastes rest of 5 ns of the cycle.
2. Add operation takes 1 complete cycle of processor B wasting no time.
3. Add operation takes 2 complete cycles(20 ns) of processor C(fast processor) wasting rest of the 20-7=13 ns.
No. It is because you are using incomplete data to express the time for an operation. Measure the time taken to finish an operation on a particular processor in clock cycles instead of nanoseconds as you are doing here. When you say ADD op takes 10 ns and you do not mention the processor on which you measured the time for the ADD op, the time measurement in ns is meaningless.
So when you say that ADD op takes 2 clock cycles on all three processors, then you have standardized the measurement. A standardized measurement can then be translated as:
Time taken by A for addition = 2 clock cycles * 15 ns per cycle = 30 ns
Time taken by B for addition = 2 clock cycles * 10 ns per cycle = 20 ns
Time taken by C for addition = 2 clock cycles * 07 ns per cycle = 14 ns
In case you haven't noticed, when you say:
A's one clock cycle has 15 nanosec, B has 10 nanosec and C has 7 nanosec.
which of the three processors is fastest?
Answer: C is fastest. It's one cycle is finished in 7ns. It implies that it finishes 109/7 (~= 1.4 * 108) cycles in one second, compared to B which finishes 109/10 (= 108) cycles in one second, compared to A which finishes only 109/15 (~= 0.6 * 108) cycles in one second.
What does a ADD instruction mean, does it purely mean only and only ADD(with operands available at the registers) or does it mean getting
the operands, decoding the instruction and then actually adding the
numbers.
Getting the operands is done by MOV op. If you are trying to compare how fast ADD op is being done, it should be compared by time to perform ADD op only. If you, on the other hand want to find out how fast addition of two numbers is being done, then it will involve more operations than simple ADD. However, if it's helpful, the list of all Original 8086/8088 instructions is available on Wikipedia too.
Based on the above context to what add actually means, how many cycles does add take, one or more than one.
It will depend on the processor because each processor may have the adder differently implemented. There are many ways to generate addition of two numbers. Quoting Wikipedia again - A full adder can be implemented in many different ways such as with a custom transistor-level circuit or composed of other gates.
Also, there may be pipelining in the instructions which can result in parallelizing of the addition of the numbers resulting in huge time savings.
Why is clock cycle a standard since it can vary with processor to processor. Shouldn't nanosec be the standard. Atleast its fixed.
Clock cycle along with the processor speed can be the standard if you want to tell the time taken by a processor to execute an instruction. Pick any two from:
Time to execute an instruction,
Processor Speed, and
Clock cycles needed for an instruction.
The third can be derived from it.
When you say the clock cycles taken by ADD is x and you know the processor speed is y MHz, you can calculate that the time to ADD is x / y. Also, you can mention the time to perform ADD as z ns and you know the processor speed is same y MHz as earlier, you can calculate the cycles needed to execute ADD as y * z.
I'm no expert BUT I'd say ...
the regular assumption that processors with high clock cycles are faster FOR THE VAST MAJORITY OF OPERATIONS
For example, a more intelligent processor might perform an "overhead task" that takes X ns. The "overhead task" might make it faster for repetitive operations but might actually cause it to take longer for a one-off operation such as adding 2 numbers.
Now, if the same processor performed that same operation 1 million times, it should be massively faster than the slower less intelligent processor.
Hope my thinking helps. Your feedback on my thoughts welcome.
Why would a faster processor take more cycles to do the same operation than a slower one?
Even more important: modern processors use Instruction pipelining, thus executing multiple operations in one clock cycle.
Also, I don't understand what you mean by 'wasting 5ns', the frequency determines the clock speed, thus the time it takes to execute 1 clock. Of course, cpu's can have to wait on I/O for example, but that holds for all cpu's.
Another important aspect of modern cpu's are the L1, L2 and L3 caches and the architecture of those caches in multicore systems. For example: if a register access takes 1 time unit, a L1 cache access will take around 2 while a normal memory access will take between 50 and 100 (and a harddisk access would take thousands..).
This is actually almost correct, except that on processor B taking 2 cycles means 14ns, so with 10ns being enough the next cycle starts 4ns after the result was already "stable" (though it is likely that you need some extra time if you chop it up, to latch the partial result). It's not that much of a contradiction, setting your frequency "too high" can require trade-offs like that. An other thing you might do it use more a different circuit or domino logic to get the actual latency of addition down to one cycle again. More likely, you wouldn't set addition at 2 cycles to begin with. It doesn't work out so well in this case, at least not for addition. You could do it, and yes, basically you will have to "round up" the time a circuit takes to an integer number of cycles. You can also see this in bitwise operations, which take less time than addition but nevertheless take a whole cycle. On machine C you could probably still fit bitwise operations in a single cycle, for some workloads it might even be worth splitting addition like that.
FWIW, Netburst (Pentium 4) had staggered adders, which computed the lower half in one "half-cycle" and the upper half in the next (and the flags in the third half cycle, in some sense giving the whole addition a latency of 1.5). It's not completely out of this world, though Netburst was over all, fairly mad - it had to do a lot of weird things to get the frequency up that high. But those half-cycles aren't very half (it wasn't, AFAIK, logic that advanced on every flank, it just used a clock multiplier), you could also see them as the real cycles that are just very fast, with most of the rest of the logic (except that crazy ALU) running at half speed.
Your broad point that 'a CPU will occasionally waste clock cycles' is valid. But overall in the real world, part of what makes a good CPU a good CPU is how it alleviates this problem.
Modern CPUs consist of a number of different components, none of whose operations will end up taking a constant time in practice. For example, an ADD instruction might 'burst' at 1 instruction per clock cycle if the data is immediately available to it... which in turn means something like 'if the CPU subcomponents required to fetch that data were immediately available prior to the instruction'. So depending on if e.g. another subcomponent had to wait for a cache fetch, the ADD may in practice take 2 or 3 cycles, say. A good CPU will attempt to re-order the incoming stream of instructions to maximise the availability of subcomponents at the right time.
So you could well have the situation where a particular series of instructions is 'suboptimal' on one processor compared to another. And the overall performance of a processor is certainly not just about raw clock speed: it is as much about the clever logic that goes around taking a stream of incoming instructions and working out which parts of which instructions to fire off to which subcomponents of the chip when.
But... I would posit that any modern chip contains such logic. Both a 2GHz and a 3GHz processor will regularly "waste" clock cycles because (to put it simply) a "fast" instruction executed on one subcomponent of the CPU has to wait for the result of the output from another "slower" subcomponent. But overall, you will still expect the 3GHz processor to "execute real code faster".
First, if the 10ns time to perform the addition does not include the pipeline overhead (clock skew and latch delay), then Processor B cannot complete an addition (with these overheads) in one 10ns clock cycle, but Processor A can and Processor C can still probably do it in two cycles.
Second, if the addition itself is pipelined (or other functional units are available), then a subsequent non-dependent operation can begin executing in the next cycle. (If the addition was width-pipelined/staggered (as mentioned in harold's answer) then even dependent additions, logical operations and left shifts could be started after only one cycle. However, if the exercise is constraining addition timing, it presumably also prohibits other optimizations to simplify the exercise.) If dependent operations are not especially common, then the faster clock of Processor C will result in higher performance. (E.g., if a dependence stall occurred every fourth cycle, then, ignoring other effects, Processor C can complete four instructions every five 7ns cycles (35 ns; the first three instruction overlap in execution) compared to 40ns for Processor B (assuming the add timing included pipelining overhead).) (Note: Your assumption 3 is incorrect, two cycles for Processor C would be 14ns.)
Third, the extra time in a clock cycle can be used to support more complex operations (e.g., preshifting one operand by a small immediate value and even adding three numbers — a carry-save adder has relatively little delay), to steal work from other pipeline stages (potentially reducing the number of pipeline stages, which generally reduces branch misprediction penalties), or to reduce area or power by using simpler logic. In addition, the extra time might be used to support a larger (or more associative) cache with fixed latency in cycles, reducing miss rates. Such factors can compensate for the "waste" of 5ns in Processor A.
Even for scalar (single issue per cycle) pipelines clock speed is not the single determinant of performance. Design choices become even more complex when power, manufacturing cost (related to yield, adjusted according to sellable bins, and area), time-to-market (and its variability/predictability), workload diversity, and more advanced architectural and microarchitectural techniques are considered.
The incorrect assumption that clock frequency determines performance even has a name: the Megahertz myth.

On most modern 64-bit processors, does the speed of `mulq` depend on the operands?

On most moder 64-bit processors (such as Intel Core 2 Duo or the Intel i7 series), does the speed of the x86_64 command mulq and its variants depend on the operands? For example, will multiplying 11 * 13 be faster than 11111111 * 13131313? Or does it always take the time of the worst case?
TL;DR: No. Constant-length integer math operations (barring division, which is non-linear) consume a constant number of cycles, regardless of the numerical value of the operands.
mulq takes two QWORD arguments.
The values are represented in little-endian binary format (used by x86 architecture) as follows:
1011000000000000000000000000000000000000000000000000000000000000 = 13
1000110001111010000100110000000000000000000000000000000000000000 = 13131313
The processor sees both of these as the same "size", as both are 64-bit values.
Therefore, the cycle count should always be the same, regardless of the actual numerical value of the operands.
More info:
There are the concepts of Leading Zero Anticipation and Leading Zero Detection[1][2] (LZA/LZD) that can be employed to speed up floating-point operations.
To the best of my knowledge however, there are no mainstream processors that employ either of these methods towards integer arithmetic. This is most likely due to the simplistic nature of most integer arithmetic (multiplication in this case). The overhead of LZA/LZD may simply not be worth it, for simple integer math circuits that can complete the full multiplication in less time anyhow.
I don't have any reference to hand, but I would place money on the latency/throughput being invariant of the values of the operands. Otherwise, it would be a nightmare to schedule.
For decades, Agner Fog has been publishing tables of instruction timings. His August 2019 tables confirm what I had expected: that the CPU chips in modern laptops and desktop computers have invariant timing for their integer-multiply units. These are extremely fast and rather power-hungry.
The CPU design space is quite different for battery-limited devices such as smartphones. On such devices, the integer multiply may be implemented in a microcoded loop with variable timing.
In (approx) 2016, Thomas Pornin had this to say about "the problem" posed by variable-latency multiplication instructions to the design of his SSL/TLS library:
"... integer multiplication opcodes in CPU may or may not execute in constant time; when they do not, implementations that use such operations may exhibit execution time variations that depend on the involved data, thereby potentially leaking secret information... When a CPU has non-constant-time multiplication opcodes, the execution time depends on the size (as integers) of one or both of the operands. For instance, on the ARM Cortex-M3, the umull opcode takes 3 to 5 cycles to complete, the `short' counts (3 or 4) being taken only if both operands are numerically less than 65536 ... In general, Intel x86 CPU all provide constant-time multiplications since the days of the first Pentium. The same does not necessarily hold for other vendors, in particular the early VIA Nano." 2

Resources