Speed up without a serial fraction - parallel-processing

I ran a set of experiments on a parallel package, say superlu-dist, with different processor numbers e.g.: 4, 16, 32, 64
I got the wall clock time for each experiment, say: 53.17s, 32.65s, 24.30s, 16.03s
The formula for speedup is:

              serial time
  Speedup = -----------------
             parallel time
But there is no information about the serial fraction.
Can I simply take the reciprocal of the wall clock time?

Can I simply take the reciprocal of the wall clock time?
No, true Speedup figures require comparing Apples to Apples:
This means that an original, pure-[SERIAL] process-scheduling ought to be compared with any other scenario, where parts may get modified, so as to use some sort of parallelism ( the parallel fraction may get re-organised, so as to run on N CPUs / computing-resources, whereas the serial fraction is left as it was ).
This obviously means that the original [SERIAL]-code was extended ( both in code ( #pragma-decorators, OpenCL-modifications, CUDA-{ host_to_dev | dev_to_host }-tooling etc. ) and in time ( to execute these added functionalities, that were not present in the original [SERIAL]-code, to benchmark against ), so as to add some new sections, where the ( possibly [PARALLEL] ) other part of the processing will take place.
This comes at a cost -- add-on overhead costs ( to setup, to terminate and to communicate data from the [SERIAL]-part there, to the [PARALLEL]-part and back ) -- which all adds additional [SERIAL]-part workload ( and execution time + latency ).
For more details, feel free to read the section Criticism in the article on the re-formulated Amdahl's Law.
The [PARALLEL]-portion seems interesting, yet the principal Speedup ceiling lies in the [SERIAL]-portion duration ( s = 1 - p ) of the original, to which the add-on durations and added latency costs have to be added, as accumulated alongside the re-"organisation" of work from the original, pure-[SERIAL] process scheduling into the wished-to-have [PARALLEL]-code execution process scheduling, if a realistic evaluation is to be achieved.
To "run the test on a single processor and set that as the serial time, ...", as @VictorSong has proposed, sounds easy, but it benchmarks an incoherent system ( not the pure-[SERIAL] original ) and records a skewed yardstick to compare against.
This is the reason why fair methods ought to be engineered. The pure-[SERIAL] original code-execution can be time-stamped, so as to show the real durations of the unchanged parts, but the add-on overhead times have to get incorporated into the add-on extensions of the serial part of the now-parallelised tests.
The re-articulated Amdahl's Law of Diminishing Returns explains this, together with the impacts of the add-on overheads and also of atomicity-of-processing, which will not allow further fictions of speedup growth once more computing resources are added but the parallel fraction of the processing does not permit a further split of task workloads, due to some form of its internal atomicity-of-processing that cannot be divided any further in spite of having free processors available.
The simpler of the two re-formulated expressions stands like this:

                           1
  S =  ------------------------------------- ;  where s, ( 1 - s ), N were defined above,
                    ( 1 - s )                          pSO := [PAR]-Setup-Overhead     add-on,
         s + pSO + ----------- + pTO                   pTO := [PAR]-Terminate-Overhead add-on
                        N
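A minimal numeric sketch of this overhead-strict formulation (the function name and the example values of s, pSO, pTO below are illustrative assumptions, not measurements of superlu-dist):

    #include <cstdio>

    // Overhead-strict re-formulation of Amdahl's Law:
    //   S = 1 / ( s + pSO + ( 1 - s ) / N + pTO )
    // s   ... serial fraction of the original, pure-[SERIAL] run
    // pSO ... [PAR]-Setup-Overhead,     as a fraction of the original run time
    // pTO ... [PAR]-Terminate-Overhead, as a fraction of the original run time
    double overhead_strict_speedup( double s, double N, double pSO, double pTO )
    {
        return 1.0 / ( s + pSO + ( 1.0 - s ) / N + pTO );
    }

    int main()
    {
        const double s = 0.05, pSO = 0.01, pTO = 0.01;   // illustrative values only
        for ( int N : { 4, 16, 32, 64 } )
            std::printf( "N = %2d  ->  S = %.2f\n",
                         N, overhead_strict_speedup( s, N, pSO, pTO ) );
    }

Even with a small 5% [SERIAL]-fraction, the non-zero pSO + pTO overheads visibly flatten the achievable speedup as N grows.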
Some interactive GUI-tools for visualising these add-on overhead costs are available for parametric simulations here - just move the p-slider towards the actual value of ( 1 - s ), i.e. towards having a non-zero fraction of the very [SERIAL]-part of the original code.

What do you mean when you say "serial fraction"? According to a Google search apparently superlu-dist is C, so I guess you could just use ctime or chrono and take the time the usual way; it works for me with both manual std::threads and omp.
I'd just run the test on a single processor and set that as the serial time, then do the test again with more processors (just like you said).
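For what it is worth, a minimal timing sketch along those lines, using std::chrono plus OpenMP (compile with something like g++ -O2 -fopenmp; the work() function and the thread counts are placeholders, and, as the answer above warns, the single-thread run of the parallelised binary is only an approximation of the pure-[SERIAL] original):

    #include <chrono>
    #include <cstdio>
    #include <omp.h>

    // Placeholder workload standing in for one run of the real solver.
    double work()
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for ( long i = 0; i < 200000000L; ++i )
            sum += 1e-9 * (double)i;
        return sum;
    }

    double timed_run( int threads )
    {
        omp_set_num_threads( threads );
        auto t0 = std::chrono::steady_clock::now();
        double result = work();
        auto t1 = std::chrono::steady_clock::now();
        std::printf( "(result %.3f with %d threads)\n", result, threads );  // keep the work observable
        return std::chrono::duration<double>( t1 - t0 ).count();
    }

    int main()
    {
        double t_serial   = timed_run( 1 );   // single-thread yardstick, with the caveat above
        double t_parallel = timed_run( 4 );
        std::printf( "T(1) = %.3f s, T(4) = %.3f s, speedup = %.2f\n",
                     t_serial, t_parallel, t_serial / t_parallel );
    }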

Related

What is the performance of 100 processors capable of 2 GFLOPs running 2% sequential and 98% parallelizable code?

This is how I solved the following question; I want to be sure my solution is correct.
A multiprocessor consists of 100 processors, each capable of a peak execution rate of 2 Gflops. What is the performance of the system as measured in Gflops when 2% of the code is sequential and 98% is parallelizable?
Solution :
I think I'm right in thinking that 2% of the program is run at 2 GFLOPs, and 98% is run at 200 GFLOPs, and that I can average these speeds to find the performance of the multiprocessor in GFLOPs
(2/100)*2 + (98/100)*200 = 196.04 Gflops
I want to be sure that my solution is correct.
From my understanding, it is 2% of the program that is sequential, not 2% of the execution time. This means the sequential code takes a significant portion of the time, since there are a lot of processors and the parallel part is therefore drastically accelerated.
With your method, a program with 50% sequential code and 1000 processors would run at (50/100)*2 + (50/100)*2000 = 1001 Gflops. This would mean that, on average, all processors are used at ~50% of their maximum capacity during the entire execution of the program, which is too good to be possible. Indeed, the parallel part of the program should be so fast that it takes only a tiny fraction of the execution time (<5%), while the sequential part takes almost all the time of the program (>95%). Since the largest part of the execution time runs at 2 Gflops, the processors cannot be used at ~50% of their capacity!
Based on Amdahl's law, you can compute the actual speedup of this code:
Slat = 1 / ((1-p) + p/s), where Slat is the speedup of the whole program, p the portion of parallel code (0.98) and s the number of processors (100). This means Slat = 33.6. Since one processor runs at 2 Gflops and the program is 33.6 times faster overall using many processors, the overall program runs at 33.6 * 2 = 67.2 Gflops.
What Amdahl's law shows is that even a tiny fraction of the execution time being sequential strongly impacts the scalability, and thus the performance, of parallel programs.
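The arithmetic is quick to reproduce (a small sketch; the values are exactly the ones from the question and the formula above):

    #include <cstdio>

    int main()
    {
        const double p      = 0.98;    // parallelizable portion of the program
        const double s      = 100.0;   // number of processors
        const double gflops = 2.0;     // peak rate of one processor

        const double Slat = 1.0 / ( ( 1.0 - p ) + p / s );       // Amdahl's law
        std::printf( "speedup = %.1f\n", Slat );                  // ~33.6
        std::printf( "overall = %.1f Gflops\n", Slat * gflops );  // ~67
    }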
Forgive me for starting light & anecdotally, citing a meme from my beloved math professor; later we will see why & how well it helps us here:
2 + 7 = 15 . . . , particularly so for higher values of 7
It is ideal if I start off by stating some definitions:
a) GFLOPS is a unit, that measures how many operations in FLO-ating point arithmetics, with no particular kind thereof specified (see remark 1), were performed P-er S-econd ~ FLOPS, here expressed for convenience in multiples of billion ( G-iga ), i.e. the said GFLOPS
b) processor, multi-processor is a device ( or some kind of composition of multiple such same devices, expressed as multi-p. ), used to perform some kind of a useful work - a processing
This pair of definitions is necessary in order to further judge the question we were asked to solve.
The term (a) is a property of (b), irrespective of all other factors, if we assume such a "device" not to be some kind of polymorphic, self-modifying FPGA or evolutionary, reflective, self-evolving amoeboid, which both processors & multi-processors prefer not to be, at least in our part of the Universe as we know it in 2022-Q2.
Once manufactured, each kind of processor(b) (be it a monolithic or a multi-processor device) has certain, observable, repetitively measurable, qualities of processing (doing a work).
"A multiprocessor consists of 100 processors, each capable of a peak execution rate of 2Gflops. What is the performance of the system as measured in Gflops when 2% of the code is sequential and 98% is parallelizable"
A multiprocessor . . . (device)
consists         . . . has a property of being composed of
100              . . . (quantitative factor) ~ 100
processors,      . . . (device)
each             . . . declaration of equality
capable          . . . having a property of
of a peak        . . . peak ( not having any higher )
execution        . . . execution of work ( process / code )
rate             . . . being measured in time [1/s]
of 2 Gflops      . . . (quantitative factor) ~ 2E+9 FLOPS
What is          . . . Questioning
the PERFORMANCE  . . . (property) a term ( not defined yet )
of the SYSTEM    . . . (system)   a term ( not defined yet )
as measured in   . . . using some measure to evaluate a property of (system) in
Gflops           . . . (units of measure) to express such property in
when             . . . (proposition)
2%               . . . (quantitative factor) ~ 0.02 fraction of
of the code      . . . (subject-being-processed)
is               . . . has a property of being
sequential       . . . sequential, i.e. steps follow one-after-another
and
98%              . . . (quantitative factor) ~ 0.98 fraction of ( the same code )
is               . . . has a property of being
parallelizable   . . . possible to re-factor into some other form, from a (sequential) original form

( emphasis added )
Fact #1 )
the processor(b) ( a (device) ), from which the introduced multiprocessor ( a macro-(device) ) is internally composed, has a declared (granted) property of not being able to process more FLOPS than the said 2 GFLOPS.
This property does not say how many actual { INTOPS | FLOPS } it will perform at any particular moment in time.
This property does say that any device which was measured and got labeled to indeed have X {M|G|P|E}FLOPS has that very same "glass ceiling" of not being able to perform a single instruction per second more, even when it is doing nothing at all ( chopping NOP-s ) or even when it is switched off and powered down.
This property is a static supremum, an artificial one ( in relation to real-world workloads' instruction mixes ), a temperature-dependent constant ( which often degrades in-vivo, not only due to thermal throttling but due to many other reasons in real-world { processor + !processor }-composed SYSTEM ecosystems ).
Fact #2 )
the problem, as visible to us here, has no particular definition of what is or is not a part of the said "SYSTEM". Is it just the (multi)processor? If so, then why introduce a new, not yet defined term SYSTEM, if it were a pure identity with the already defined & used term (multi)processor per se? Is it both the (multi)processor and memory or other peripherals? If so, then why do we know literally nothing about such an important neighbourhood ( a complement ) of the said (multi)processor, without which a SYSTEM would not be The SYSTEM, but a mere part of it, the (multi)processor, which is NOT a SYSTEM without its ( such a SYSTEM-defining and completing ) neighbourhood?
Fact #3 )
the original Amdahl's Law, often dubbed The Law of Diminishing Returns ( of extending the System with more and more resources ), speaks about a SYSTEM and its re-organised forms. It compares the same amount and composition of work, as performed in the original SYSTEM ( with a pure-[SERIAL] flow of operations, one-step-after-another-after-another ), with another, improved SYSTEM' ( created by re-organising and extending the original SYSTEM with more resources of some kind, and by turning such a new SYSTEM' into operating more parts of the original work-to-be-done in an improved organisation of work, where more resources can & do perform parts of the work-to-be-done independently of one another ~ in a concurrent, some parts even in a truly parallel fashion, using all degrees of parallelism the SYSTEM' resources can provide & sustain to serve ).
Given that no particular piece of information was present about a SYSTEM, the less about a SYSTEM', we have no right to use The Law of Diminishing Returns to address the problem as defined above. Having no facts does not give us a right to guesstimate, the less to turn to feelings-based evidencing, if we strive to remain serious with ourselves, don't we?
Given (a) and (b) above, the only fair claim, that indeed holds true, can be to say:
"From what has been defined so far, we know that such a multiprocessor will never work on more than 100 x 2 GFLOP per second of time."
There is zero other knowledge to claim a single bit more ( and yet we still have to silently assume that such above-claimed peak FLOP-s have no side-effects and remain sustainable for at least one whole second ( see remark 2 ) -- otherwise even this claim will become skewed ).
An extended, stronger version :
"No matter what kind of code is actually being run, for this above-specified multiprocessor we cannot say more than that such a multiprocessor will never work on more than 100 x 2 GFLOPS in any moment of time."
Remarks :
remark 1 ) see how often this is misused by the promotion of "Exaflops performance" by marketing people, when FMUL f8,f8 is being claimed and "sold" to the public as if it "looked" equal to FMUL f512,f512, which it by far is not, when using the same yardstick to measure, is it?
remark 2 ) a similar skewed argument ( if not straight misinformation ) has been repeated countless times in a (false) claim, that the world's "largest" femtosecond-LASER was capable of emitting a light pulse carrying more power than XY-Suns ( a-WOW-moment! ), without adding how long it took to pump up the energy for a single such femtosecond-long ( 1 [fs] ~ 1E-15 [s] ) "packet-of-a-few-photons" ... careful readers have already rejected the a-"WOW"-moment artificial stupidity, as it is not possible to hold such an astronomic amount of energy, an energy of XY-Suns, on a tiny, energy-poor planet, the less to carry it "over" a wire towards that "superpower" LASER )
If 2% is the run-time percentage of the serial part, then you cannot surpass a 50x speedup. This means you cannot surpass 50x the gflops of the serial version.
If the unoptimized program ran fully serially at 2 gflops, then the optimized version with perfect scaling compresses the 98% of the runtime down to 0.98%.
2% plus 0.98% gives ~3% as the new total run time. This means the program spends 2/3 of the time in the serial part and only 1/3 in the parallelized part. If the parallel part runs at 200 gflops, then you have to average it over the whole 3/3 of the time: 200 gflops for 1 microsecond and 2 gflops for 2 microseconds.
This is roughly equal to 67 gflops. If there is a single-core turbo to boost the serial part, then a 20% turbo boost during 2/3 of the time means shaving ~13% off the total run time, hence a 15%-20% higher gflops average. Turbo core frequency is important even if it boosts only a single core.
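The time-weighted averaging above can be checked in a few lines (a sketch using the same 2% / 0.98% split and the 2 / 200 gflops rates; the 20% turbo factor is the illustrative one from the previous paragraph):

    #include <cstdio>

    int main()
    {
        const double t_serial   = 0.02;     // serial part: 2% of the original run time
        const double t_parallel = 0.0098;   // parallel 98%, compressed 100x: 0.98%
        const double total      = t_serial + t_parallel;             // ~2.98% of the original

        // time-weighted average throughput over the new run time
        const double avg = ( t_serial * 2.0 + t_parallel * 200.0 ) / total;
        std::printf( "average throughput            : %.1f gflops\n", avg );   // ~67

        // with a 20% single-core turbo boost applied to the serial part only
        const double t_serial_turbo = t_serial / 1.2;
        const double total_turbo    = t_serial_turbo + t_parallel;
        const double avg_turbo = ( t_serial_turbo * 2.4 + t_parallel * 200.0 ) / total_turbo;
        std::printf( "with 20%% turbo on serial part : %.1f gflops\n", avg_turbo );
    }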

Is time cost of integer multiplication the same as any binary operation on ARM or Intel processors?

Is the processing time of an integer multiplication the same as that of any integer binary operation on modern CPUs with pipelining (e.g. Intel, ARM)?
In Intel's assembly documentation, it is said that an integer multiplication takes 1 cycle, like any integer binary operation. Is this cycle equivalent to the actual time duration, supposing the operations are pipelined?
There is more than the cycle count to consider:
latency
pipeline
While the results of simple ALU instructions are available immediately, multiply instructions have to go through the MAC (multiply-accumulate) unit, which usually costs more cycles and comes with a latency of multiple cycles.
And often there is only one MAC unit, which means the core cannot dual-issue two mul instructions.
example: ARMv5E:
smulxy(16bit): one cycle plus three cycles latency
mul(32bit): two cycles plus three cycles latency
umull(64bit): three cycles plus four(lower half) and five(upper half) cycles latency
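A generic way to see the issue-cost vs latency difference from C++ (a portable sketch, not ARMv5E-specific; the constant and the iteration counts are arbitrary): a dependent multiply chain pays the full multiply latency on every step, while independent chains can overlap in the pipeline. How much the second loop gains depends on whether the multiplier is pipelined and how many of them the core has.

    #include <chrono>
    #include <cstdint>
    #include <cstdio>

    int main()
    {
        const long N = 100000000L;

        // dependent chain: each multiply needs the previous result, so it pays the full latency
        std::uint32_t a = 3;
        auto t0 = std::chrono::steady_clock::now();
        for ( long i = 0; i < N; ++i )
            a = a * 2654435761u + 1u;
        auto t1 = std::chrono::steady_clock::now();

        // four independent chains doing the same total work: the multiplies can overlap
        std::uint32_t b0 = 3, b1 = 5, b2 = 7, b3 = 11;
        for ( long i = 0; i < N / 4; ++i ) {
            b0 = b0 * 2654435761u + 1u;
            b1 = b1 * 2654435761u + 1u;
            b2 = b2 * 2654435761u + 1u;
            b3 = b3 * 2654435761u + 1u;
        }
        auto t2 = std::chrono::steady_clock::now();

        std::printf( "dependent   : %.3f s\n", std::chrono::duration<double>( t1 - t0 ).count() );
        std::printf( "independent : %.3f s\n", std::chrono::duration<double>( t2 - t1 ).count() );
        std::printf( "(keep results live: %u)\n", (unsigned)( a ^ b0 ^ b1 ^ b2 ^ b3 ) );
    }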
No, multiply is much more complicated than XOR, ADD, OR, NOT, etc. While binary makes it much easier than base 10 you still have to have a larger adder (than just a two operand ADD or other operation).
Take the bits abcd

      abcd
    * 1011
  ========
      abcd
     abcd.
    0000..
 + abcd...
 =========
In base 10, like grade school, you had to actually multiply each digit; you are still multiplying here, but only by one or zero, so you either copy and shift the first operand or copy and shift zeros. And it gets very big: the addition is cascaded. Look up "xor gate" at wikipedia and see the full adder, or just google it. You have a single-column adder for a simple two-operand add, with three inputs and two outputs, and the carry out of one bit is the carry in of the next. No logic is instantaneous; even a single transistor inversion (NOT) takes a non-zero amount of time. You can start to think about how many gates are lined up just to make one 32-bit two-operand ADD, and then think about a 32-bit multiply where each adder takes 32 operand bits and some number of carry bits, and then all of that is cascaded. The chip real estate and the time to settle grow almost exponentially for multiply, and you then start to worry about whether you can meet timing (can you settle the msbit of the result within the desired/designed clock speed).
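Written out as software, that shift-and-add scheme looks like the sketch below (a didactic illustration of the partial products drawn above, not how any real multiplier circuit is implemented):

    #include <cstdint>
    #include <cstdio>

    // "Grade school" multiplication in base 2: for every set bit of b,
    // add a copy of a shifted into that bit position (one partial product per bit).
    static std::uint32_t shift_add_multiply( std::uint16_t a, std::uint16_t b )
    {
        std::uint32_t product = 0;
        for ( int bit = 0; bit < 16; ++bit )
            if ( b & ( 1u << bit ) )
                product += static_cast<std::uint32_t>( a ) << bit;
        return product;
    }

    int main()
    {
        std::printf( "%u\n", (unsigned)shift_add_multiply( 43981, 11 ) );  // arbitrary example operands
        std::printf( "%u\n", 43981u * 11u );                               // cross-check
    }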
So you will see optimizations made, including multiple pipe stages: not 32 clocks to do a 32-bit multiply, but maybe not one clock either, maybe two or four. With a dozen-stage-deep pipe, though, you can bury that in there and still meet an advertised one-clock-per-instruction average.
Intel, ARM, etc.: the 1-cycle thing is an illusion. The math operation itself might take that long, but the execution of the instruction takes a few to a handful of cycles, and your pipe depths may be several to a dozen or more. There is limited use in attempting to count cycles these days. Feeding the pipe and handling memory operations tend to dominate the performance, not the pipe/instructions themselves, outside a carefully crafted sim of the core.
For the cortex-ms, which are perhaps not what you are asking about but are very much part of our daily life, you see in the documentation that it is the chip vendor that can choose the larger, faster multiplier or the smaller, slower one, which helps with overall chip size and perhaps performance. (I do not examine the cortex-a docs that much as I do not use them as often.) It is a compile-time option when they compile the core, and there are many compile-time options, which is why, for any arm core (cortex-m or cortex-a), you cannot compare, say, two cortex-m4s from different vendors or chip families within a vendor: they could have been compiled differently and behave/perform differently (they still execute the enabled instructions in the same functional way, of course).
So no, you cannot assume the "execution time" or "cycle time" of ANY instruction, and in particular ones like multiply and divide and anything floating point cannot be assumed to be single cycle. Yes, like all the other instructions, the one cycle advertised is based on pipeline effects; no instruction takes one cycle start to finish, and based on the pipe depth of the design, multiply and divide may take more than one clock yet be hidden by the pipe so as to still average one clock per instruction.
Note that this question is "too broad", as there are many Intel and ARM implementations, past and present. And chip implementation details are often not available or are protected by NDA; all you have, if anything, are public documents that can hide the reality.

Why actual runtime for a larger search value is smaller than a lower search value in a sorted array?

I executed a linear search on an array containing all unique elements in the range [1, 10000], sorted in increasing order, with every search value from 1 to 10000, and plotted the runtime vs search value graph as follows:
Upon closely analysing the zoomed-in version of the plot as follows:
I found that the runtime for some larger search values is smaller than for some lower search values, and vice versa.
My best guess for this phenomenon is that it is related to how data is processed by the CPU using primary memory and cache, but I don't have a firm, quantifiable reason to explain it.
Any hint would be greatly appreciated.
PS: The code was written in C++ and executed on a Linux platform hosted on a virtual machine with 4 vCPUs on Google Cloud. The runtime was measured using the C++ chrono library.
CPU cache size depends on the CPU model and there are several cache levels, so your experiment should take all those factors into account. An L1 cache is usually 8 KiB, which is about 4 times smaller than your 10000-element array. But I don't think this is cache misses. L2 latency is about 100 ns, which is much smaller than the difference between the lowest and the second line, which is about 5 usec. I suppose this (the second line-cloud) is contributed by context switching. The longer the task, the more probable it is that a context switch occurs. This is why the cloud on the right side is thicker.
Now for the zoomed-in figure. As Linux is not a real-time OS, its time measuring is not very reliable. IIRC its minimal reporting unit is the microsecond. Now, if a certain task takes exactly 15.45 microseconds, then its ending time depends on when it started. If the task started at exactly zero on the clock, the time reported would be 15 microseconds. If it started when the internal clock was 0.1 microseconds in, then you will get 16 microseconds. What you see on the graph is a linear approximation of the analogue straight line onto the discrete-valued axis. So the task duration you get is not the actual task duration, but the real value plus the task's start offset within the microsecond (which is uniformly distributed ~U[0,1]), all rounded to the closest integer value.
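One way to reduce both effects in such an experiment is to repeat each search many times and keep the minimum, so that context switches and the start-offset rounding mostly drop out (a sketch; linear_search, the fixed key and the repetition count are illustrative, and an aggressive optimizer may need a barrier to keep the repeated identical calls from being hoisted):

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    static long linear_search( const std::vector<int>& v, int key )
    {
        for ( std::size_t i = 0; i < v.size(); ++i )
            if ( v[i] == key ) return (long)i;
        return -1;
    }

    int main()
    {
        std::vector<int> v( 10000 );
        std::iota( v.begin(), v.end(), 1 );        // 1 .. 10000, sorted

        const int key = 9999, repeats = 1000;
        double best_us = 1e30;
        long   found   = -1;
        for ( int r = 0; r < repeats; ++r ) {
            auto t0 = std::chrono::steady_clock::now();
            found   = linear_search( v, key );     // the measured work
            auto t1 = std::chrono::steady_clock::now();
            best_us = std::min( best_us,
                                std::chrono::duration<double, std::micro>( t1 - t0 ).count() );
        }
        // the minimum over many repetitions largely removes context switches and
        // timer quantisation; the mean or a percentile keeps them visible
        std::printf( "index %ld, best of %d runs: %.3f us\n", found, repeats, best_us );
    }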

What are the relative cycle times for the 6 basic arithmetic operations?

When I try to optimize my code, for a very long time I've just been using a rule of thumb that addition and subtraction are worth 1, multiplication and division are worth 3, squaring is worth 3 (I rarely use the more general pow function so I have no rule of thumb for it), and square roots are worth 10. (And I assume squaring a number is just a multiplication, so worth 3.)
Here's an example from a 2D orbital simulation. To calculate and apply acceleration from gravity, first I get distance from the ship to the center of earth, then calculate the acceleration.
D = sqrt( sqr(Ship.x - Earth.x) + sqr(Ship.y - Earth.y) ); // this is worth 19
A = G*Earth.mass/sqr(D); // this is worth 9, total is 28
However, notice that in calculating D, you take a square root, but when using it in the next calculation, you square it. Therefore you can just do this:
A = G*Earth.mass/( sqr(Ship.x - Earth.x) + sqr(Ship.y - Earth.y) ); // this is worth 15
So if my rule of thumb is true, I almost cut in half the cycle time.
However, I cannot even remember where I heard that rule before. I'd like to ask what the actual cycle times are for those basic arithmetic operations.
Assumptions:
everything is a 64-bit floating number in x64 architecture.
everything is already loaded into registers, so no worrying about hits and misses from caches or memory.
no interrupts to the CPU
no if/branching logic such as look ahead prediction
Edit: I suppose what I'm really trying to do is look inside the ALU and only count the cycle time of its logic for the 6 operations. If there is still variance within that, please explain what and why.
Note: I did not see any tags for machine code, so I chose the next closest thing, assembly. To be clear, I am talking about actual machine code operations in x64 architecture. Thus it doesn't matter whether those lines of code I wrote are in C#, C, Javascript, whatever. I'm sure each high-level language will have its own varying times so I don't wanna get into an argument over that. I think it's a shame that there's no machine code tag because when talking about performance and/or operation, you really need to get down into it.
At a minimum, one must understand that an operation has at least two interesting timings: the latency and the throughput.
Latency
The latency is how long any particular operation takes, from its inputs to its output. If you had a long series of operations where the output of one operation is fed into the input of the next, the latency would determine the total time. For example, an integer multiplication on most recent x86 hardware has a latency of 3 cycles: it takes 3 cycles to complete a single multiplication operation. Integer addition has a latency of 1 cycle: the result is available the cycle after the addition executes. Latencies are generally positive integers.
Throughput
The throughput is the number of independent operations that can be performed per unit time. Since CPUs are pipelined and superscalar, this is often more than the inverse of the latency. For example, on most recent x86 chips, 4 integer addition operations can execute per cycle, even though the latency is 1 cycle. Similarly, 1 integer multiplication can execute, on average per cycle, even though any particular multiplication takes 3 cycles to complete (meaning that you must have multiple independent multiplications in progress at once to achieve this).
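Both numbers are easy to observe from C++ (my own illustrative microbenchmark, not something from the question): a single-accumulator sum is bound by the addition latency, while four independent accumulators let the adds overlap and approach the throughput limit. Compile with optimisation but without -ffast-math, so the compiler keeps the chains as written.

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main()
    {
        const std::size_t n = 4096;          // 32 KiB of doubles: fits in L1d, so memory is not the bottleneck
        std::vector<double> v( n, 1.0 );
        const int reps = 100000;

        auto t0 = std::chrono::steady_clock::now();
        double s1 = 0.0;
        for ( int r = 0; r < reps; ++r )
            for ( double x : v ) s1 += x;                  // one long dependency chain
        auto t1 = std::chrono::steady_clock::now();

        double a = 0.0, b = 0.0, c = 0.0, d = 0.0;
        for ( int r = 0; r < reps; ++r )
            for ( std::size_t i = 0; i < n; i += 4 ) {     // four independent chains
                a += v[i]; b += v[i + 1]; c += v[i + 2]; d += v[i + 3];
            }
        double s4 = ( a + b ) + ( c + d );
        auto t2 = std::chrono::steady_clock::now();

        std::printf( "1 chain : %.0f ms (sum %.0f)\n",
                     std::chrono::duration<double, std::milli>( t1 - t0 ).count(), s1 );
        std::printf( "4 chains: %.0f ms (sum %.0f)\n",
                     std::chrono::duration<double, std::milli>( t2 - t1 ).count(), s4 );
    }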
Inverse Throughput
When discussing instruction performance, it is common to give throughput numbers as "inverse throughput", which is simply 1 / throughput. This makes it easy to directly compare with latency figures without doing a division in your head. For example, the inverse throughput of addition is 0.25 cycles, versus a latency of 1 cycle, so you can immediately see that if you have sufficient independent additions, they use only something like 0.25 cycles each.
Below I'll use inverse throughput.
Variable Timings
Most simple instructions have fixed timings, at least in their reg-reg form. Some more complex mathematical operations, however, may have input-dependent timings. For example, addition, subtraction and multiplication usually have fixed timings in their integer and floating point forms, but on many platforms division has variable timings in integer, floating point or both. Agner's numbers often show a range to indicate this, but you shouldn't assume the operand space has been tested extensively, especially for floating point.
The Skylake numbers below, for example, show a small range, but it isn't clear if that's due to operand dependency (which would likely be larger) or something else.
Passing denormal inputs, or results that themselves are denormal may incur significant additional cost depending on the denormal mode. The numbers you'll see in the guides generally assume no denormals, but you might be able to find a discussion of denormal costs per operation elsewhere.
More Details
The above is necessary but often not sufficient information to fully qualify performance, since you have other factors to consider such as execution port contention, front-end bottlenecks, and so on. It's enough to start though and you are only asking for "rule of thumb" numbers if I understand it correctly.
Agner Fog
My recommended source for measured latency and inverse throughput numbers is Agner Fog's set of guides. You want the files under 4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs, which lists fairly exhaustive timings on a huge variety of AMD and Intel CPUs. You can also get the numbers for some CPUs directly from Intel's guides, but I find them less complete and more difficult to use than Agner's.
Below I'll pull out the numbers for a couple of modern CPUs, for the basic operations you are interested in.
Intel Skylake
                          Lat    Inv Tpt
  add/sub (addsd, subsd)    4      0.5
  multiply (mulsd)          4      0.5
  divide (divsd)          13-14    4
  sqrt (sqrtpd)           15-16   4-6
So a "rule of thumb" for latency would be add/sub/mul all cost 1, and division and sqrt are about 3 and 4, respectively. For throughput, the rule would be 1, 8, 8-12 respectively. Note also that the latency is much larger than the inverse throughput, especially for add, sub and mul: you'd need 8 parallel chains of operations if you wanted to hit the max throughput.
AMD Ryzen
                          Lat    Inv Tpt
  add/sub (addsd, subsd)    3      0.5
  multiply (mulsd)          4      0.5
  divide (divsd)           8-13   4-5
  sqrt (sqrtpd)           14-15   4-8
The Ryzen numbers are broadly similar to recent Intel. Addition and subtraction are slightly lower latency, multiplication is the same. Latency-wise, the rule of thumb could still generally be summarized as 1/3/4 for add,sub,mul/div/sqrt, with some loss of precision.
Here, the latency range for divide is fairly large, so I expect it is data dependent.
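Applied to the gravity example from the question, the tables suggest roughly the following latency budget along the critical path (a back-of-the-envelope sketch in the question's own notation, using the rounded Skylake numbers above and assuming the two squared differences evaluate in parallel; throughput, ports and memory are ignored):

    // Skylake latencies from the table above: add/sub 4, mul 4, div ~14, sqrt ~16
    D = sqrt( sqr(Ship.x - Earth.x) + sqr(Ship.y - Earth.y) );
    //   sub(4) -> sqr/mul(4) -> add(4) -> sqrt(~16)          ~ 28 cycles of latency
    A = G*Earth.mass/sqr(D);
    //   sqr(D)/mul(4) -> div(~14)                            ~ 18 more on the critical path
    A = G*Earth.mass/( sqr(Ship.x - Earth.x) + sqr(Ship.y - Earth.y) );
    //   sub(4) -> mul(4) -> add(4) -> div(~14)               ~ 26 cycles, sqrt and one mul gone

So the question's intuition holds qualitatively: dropping the sqrt and the extra square roughly halves the dependent latency of this snippet, even though the exact "worth" numbers differ from the old rule of thumb.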

Faster cpu wastes more time as compared to slower cpu

Suppose I have a program that has an instruction to add two numbers and that operation takes 10 nanoseconds (constant, as enforced by the gate manufacturers).
Now I have 3 different processors A, B and C (where A < B < C in terms of clock speed). A's one clock cycle has 15 nanosec, B has 10 nanosec and C has 7 nanosec.
Firstly, am I correct in my following assumptions?
1. Add operation takes 1 complete cycle of processor A (slow processor) and wastes the rest of the cycle, 5 ns.
2. Add operation takes 1 complete cycle of processor B, wasting no time.
3. Add operation takes 2 complete cycles (20 ns) of processor C (fast processor), wasting the rest, 20-14=7 ns.
If the above assumptions are correct, then isn't this a contradiction to the regular assumption that processors with higher clock speeds are faster? Here processor C, which is the fastest, actually takes 2 cycles and wastes 7 ns whereas the slower processor A takes just 1 cycle.
Processor C is fastest, no matter what. It takes 7 ns per cycle and therefore performs more cycles than A and B. It's not C's fault that the circuit is not fast enough. If you would implement the addition circuit in a way that it gives result in 1 ns, all processors will give the answer in 1 clock cycle (i.e. C will give you the answer in 7ns, B in 10ns and A in 15ns).
Firstly am I correct on my following assumptions-
1. Add operation takes 1 complete cycle of processor A(slow processor) and wastes rest of 5 ns of the cycle.
2. Add operation takes 1 complete cycle of processor B wasting no time.
3. Add operation takes 2 complete cycles(20 ns) of processor C(fast processor) wasting rest of the 20-7=13 ns.
No. It is because you are using incomplete data to express the time for an operation. Measure the time taken to finish an operation on a particular processor in clock cycles instead of nanoseconds as you are doing here. When you say ADD op takes 10 ns and you do not mention the processor on which you measured the time for the ADD op, the time measurement in ns is meaningless.
So when you say that ADD op takes 2 clock cycles on all three processors, then you have standardized the measurement. A standardized measurement can then be translated as:
Time taken by A for addition = 2 clock cycles * 15 ns per cycle = 30 ns
Time taken by B for addition = 2 clock cycles * 10 ns per cycle = 20 ns
Time taken by C for addition = 2 clock cycles * 07 ns per cycle = 14 ns
In case you haven't noticed, when you say:
A's one clock cycle has 15 nanosec, B has 10 nanosec and C has 7 nanosec.
which of the three processors is fastest?
Answer: C is fastest. Its one cycle is finished in 7 ns. It implies that it finishes 10^9/7 (~= 1.4 * 10^8) cycles in one second, compared to B, which finishes 10^9/10 (= 10^8) cycles in one second, and to A, which finishes only 10^9/15 (~= 0.6 * 10^8) cycles in one second.
What does an ADD instruction mean, does it purely mean only and only ADD (with operands available at the registers) or does it mean getting the operands, decoding the instruction and then actually adding the numbers.
Getting the operands is done by a MOV op. If you are trying to compare how fast an ADD op is done, it should be compared by the time to perform the ADD op only. If you, on the other hand, want to find out how fast the addition of two numbers is done, then it will involve more operations than a simple ADD. However, if it's helpful, the list of all Original 8086/8088 instructions is available on Wikipedia too.
Based on the above context to what add actually means, how many cycles does add take, one or more than one.
It will depend on the processor because each processor may have the adder differently implemented. There are many ways to generate addition of two numbers. Quoting Wikipedia again - A full adder can be implemented in many different ways such as with a custom transistor-level circuit or composed of other gates.
Also, there may be pipelining in the instructions which can result in parallelizing of the addition of the numbers resulting in huge time savings.
Why is clock cycle a standard since it can vary with processor to processor. Shouldn't nanosec be the standard. Atleast its fixed.
Clock cycle along with the processor speed can be the standard if you want to tell the time taken by a processor to execute an instruction. Pick any two from:
Time to execute an instruction,
Processor Speed, and
Clock cycles needed for an instruction.
The third can be derived from it.
When you say the clock cycles taken by ADD is x and you know the processor speed is y MHz, you can calculate that the time to ADD is x / y (in microseconds, since y is in MHz). Also, if you state the time to perform ADD as z ns and you know the processor speed is the same y MHz as earlier, you can calculate the cycles needed to execute ADD as y * z, taking care to keep the units consistent.
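With the units made explicit, that relationship is a one-liner in either direction (a tiny illustrative sketch; the 100 MHz / 2-cycle figures are arbitrary):

    #include <cstdio>

    int main()
    {
        // pick any two of { time per instruction, clock frequency, cycles per instruction }
        // and derive the third -- just keep the units consistent
        const double y_mhz = 100.0;                   // processor speed y [MHz]
        const double x_cyc = 2.0;                     // cycles per ADD    x

        const double z_ns = x_cyc / y_mhz * 1000.0;   // x / y gives microseconds; * 1000 -> ns
        std::printf( "time per ADD   = %.1f ns\n", z_ns );   // 20 ns

        const double back = y_mhz * z_ns / 1000.0;    // y * z (ns -> us) gives the cycle count back
        std::printf( "cycles per ADD = %.1f\n", back );      // 2 cycles
    }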
I'm no expert, BUT I'd say the regular assumption that processors with higher clock speeds are faster holds FOR THE VAST MAJORITY OF OPERATIONS.
For example, a more intelligent processor might perform an "overhead task" that takes X ns. The "overhead task" might make it faster for repetitive operations but might actually cause it to take longer for a one-off operation such as adding 2 numbers.
Now, if the same processor performed that same operation 1 million times, it should be massively faster than the slower less intelligent processor.
Hope my thinking helps. Your feedback on my thoughts welcome.
Why would a faster processor take more cycles to do the same operation than a slower one?
Even more important: modern processors use Instruction pipelining, thus executing multiple operations in one clock cycle.
Also, I don't understand what you mean by 'wasting 5ns'; the frequency determines the clock speed, and thus the time it takes to execute 1 clock cycle. Of course, CPUs can have to wait on I/O for example, but that holds for all CPUs.
Another important aspect of modern cpu's are the L1, L2 and L3 caches and the architecture of those caches in multicore systems. For example: if a register access takes 1 time unit, a L1 cache access will take around 2 while a normal memory access will take between 50 and 100 (and a harddisk access would take thousands..).
This is actually almost correct, except that on processor C taking 2 cycles means 14 ns, so with 10 ns being enough, the next cycle starts 4 ns after the result was already "stable" (though it is likely that you need some extra time if you chop it up, to latch the partial result). It's not that much of a contradiction: setting your frequency "too high" can require trade-offs like that. Another thing you might do is use a different circuit or domino logic to get the actual latency of addition down to one cycle again. More likely, you wouldn't set addition at 2 cycles to begin with. It doesn't work out so well in this case, at least not for addition. You could do it, and yes, basically you will have to "round up" the time a circuit takes to an integer number of cycles. You can also see this in bitwise operations, which take less time than addition but nevertheless take a whole cycle. On machine C you could probably still fit bitwise operations in a single cycle; for some workloads it might even be worth splitting addition like that.
FWIW, Netburst (Pentium 4) had staggered adders, which computed the lower half in one "half-cycle" and the upper half in the next (and the flags in the third half-cycle, in some sense giving the whole addition a latency of 1.5). It's not completely out of this world, though Netburst was, overall, fairly mad: it had to do a lot of weird things to get the frequency up that high. But those half-cycles aren't really halves (it wasn't, AFAIK, logic that advanced on every clock edge, it just used a clock multiplier); you could also see them as the real cycles that are just very fast, with most of the rest of the logic (except that crazy ALU) running at half speed.
Your broad point that 'a CPU will occasionally waste clock cycles' is valid. But overall in the real world, part of what makes a good CPU a good CPU is how it alleviates this problem.
Modern CPUs consist of a number of different components, none of whose operations will end up taking a constant time in practice. For example, an ADD instruction might 'burst' at 1 instruction per clock cycle if the data is immediately available to it... which in turn means something like 'if the CPU subcomponents required to fetch that data were immediately available prior to the instruction'. So depending on if e.g. another subcomponent had to wait for a cache fetch, the ADD may in practice take 2 or 3 cycles, say. A good CPU will attempt to re-order the incoming stream of instructions to maximise the availability of subcomponents at the right time.
So you could well have the situation where a particular series of instructions is 'suboptimal' on one processor compared to another. And the overall performance of a processor is certainly not just about raw clock speed: it is as much about the clever logic that goes around taking a stream of incoming instructions and working out which parts of which instructions to fire off to which subcomponents of the chip when.
But... I would posit that any modern chip contains such logic. Both a 2GHz and a 3GHz processor will regularly "waste" clock cycles because (to put it simply) a "fast" instruction executed on one subcomponent of the CPU has to wait for the result of the output from another "slower" subcomponent. But overall, you will still expect the 3GHz processor to "execute real code faster".
First, if the 10ns time to perform the addition does not include the pipeline overhead (clock skew and latch delay), then Processor B cannot complete an addition (with these overheads) in one 10ns clock cycle, but Processor A can and Processor C can still probably do it in two cycles.
Second, if the addition itself is pipelined (or other functional units are available), then a subsequent non-dependent operation can begin executing in the next cycle. (If the addition was width-pipelined/staggered (as mentioned in harold's answer) then even dependent additions, logical operations and left shifts could be started after only one cycle. However, if the exercise is constraining addition timing, it presumably also prohibits other optimizations to simplify the exercise.) If dependent operations are not especially common, then the faster clock of Processor C will result in higher performance. (E.g., if a dependence stall occurred every fourth cycle, then, ignoring other effects, Processor C can complete four instructions every five 7ns cycles (35 ns; the first three instruction overlap in execution) compared to 40ns for Processor B (assuming the add timing included pipelining overhead).) (Note: Your assumption 3 is incorrect, two cycles for Processor C would be 14ns.)
Third, the extra time in a clock cycle can be used to support more complex operations (e.g., preshifting one operand by a small immediate value and even adding three numbers — a carry-save adder has relatively little delay), to steal work from other pipeline stages (potentially reducing the number of pipeline stages, which generally reduces branch misprediction penalties), or to reduce area or power by using simpler logic. In addition, the extra time might be used to support a larger (or more associative) cache with fixed latency in cycles, reducing miss rates. Such factors can compensate for the "waste" of 5ns in Processor A.
Even for scalar (single issue per cycle) pipelines clock speed is not the single determinant of performance. Design choices become even more complex when power, manufacturing cost (related to yield, adjusted according to sellable bins, and area), time-to-market (and its variability/predictability), workload diversity, and more advanced architectural and microarchitectural techniques are considered.
The incorrect assumption that clock frequency determines performance even has a name: the Megahertz myth.
