How to determine which processor has the highest performance - performance

The following conditions exist.
Consider three diff erent processors P1, P2, and P3 executing the same
instruction set. P1 has a 3 GHz clock rate and a CPI of 1.5. P2 has
a
2.5 GHz clock rate and a CPI of 1.0. P3 has a 4.0 GHz clock rate and has a CPI of 2.2.
And the question is
Which processor has the highest performance when executing the same program?
I have learned to compare cpu execution when comparing computer performance.
However, since cpu execution time = CPI * instruction set * 1/clock rate, the size of the instruction set cannot be known only with the conditions in the above problem, I thought that the performance between processors could not be compared.
I looked for other issues similar to this one, and the issue is Which processor has the highest performance expressed in instructions per second? We compared the performance between processors under the condition of instructions per second as shown.
So what I want to know is whether it is possible to compare performance between processors without any special conditions. (Isn't the given problem wrong?) If possible, I wonder how they can be compared.

However, since cpu execution time = CPI * instruction set * 1/clock rate, the size of the instruction set cannot be known only with the conditions in the above problem, I thought that the performance between processors could not be compared.
For that formula; I'd assume that instruction set reflects how many instructions are needed to do the same work. E.g. if one instruction set can do a multiply in one instruction it might have instruction set = 1, and if another instruction set needs 20 instructions to do one multiply then it might have instruction set = 20 because you need 20 times as many instructions to get the same work done.
For your homework (3 processors that all execute the same instruction set), instruction set is irrelevant - they all take the same number of instructions to get any amount of work done. With this in mind you can just do performance = clock rate / CPI. More specifically:
P1 = 3 GHz / 1.5 = 2,000,000,000
P2 = 2.5 GHz / 1.0 = 2,500,000,000
P3 = 4 GHz / 2.2 = 1,818,181,818
Of course the homework is extremely over-simplified - stalls (time CPU spends doing nothing while waiting for things like cache misses and instruction fetch after a branch misprediction, etc) tend to have a larger impact on performance than clock speed or "theoretical maximum CPI"; and "measured CPI in practice" depends which specific instructions are used (and can never be a single number like "1.5 CPI" for all programs). In other words, in practice, it's easy to have a situation where a CPU is fastest for one program but slowest for another program.

Related

Calculating Cycles Per Instruction

From what I understand, to calculate CPI, it's the percentage of the type of instruction multiplied by the number of cycles right? Does the type of machine have any part of this calculation whatsoever?
I have a problem that asks me if a change should be recommended.
Machine 1: 40% R - 5 Cycles, 30% lw - 6 Cycles, 15% sw - 6 Cycles, 15% beq 3 - Cycles, on a 2.5 GHz machine
Machine 2: 40% R - 5 Cycles, 30% lw - 6 Cycles, 15% sw - 6 Cycles, 15% beq 4 - Cycles, on a 2.7 GHz machine
By my calculations, machine 1 has 5.15 CPI while machine 2 has 5.3 CPI. Is it okay to ignore the GHz of the machine and say that the change would not be a good idea or do I have to factor the machine in?
I think the point is to evaluate a design change that makes an instruction take more clocks, but allows you to raise the clock frequency. (i.e. leaning towards a speed-demon design like Pentium 4, instead of brainiac like Apple's A7/A8 ARM cores. http://www.lighterra.com/papers/modernmicroprocessors/)
So you need to calculate instructions per second to see which one will get more work done in the same amount of real time. i.e. (clock/sec) / (clocks/insn) = insn/sec, cancelling out the clocks from the units.
Your CPI calculation looks ok; I didn't check it, but yes a weighted average of the cycles according to the instruction mix.
These numbers are obviously super simplified; any CPU worth building at 2.5GHz would have some kind of branch prediction so the cost of a branch isn't just a 3 or 4 instruction bubble. And taking ~5 cycles per instruction on average is pathetic. (Most pipelined designs aim for at least 1 instruction per clock.)
Caches and superscalar CPUs also lead to complex interactions between instructions depending on whether they depend on earlier results or not.
But this is sort of like what you might do if considering increasing the L1d cache load-use latency by 1 cycle (for example), if that took it off the critical path and let you raise the clock frequency. Or vice versa, tightening up the latency or reducing the number of pipeline stages on something at the cost of reducing frequency.
Cycles per instruction a count of cycles. ghz doesnt matter as far as that average goes. But saying that we can see from your numbers that one instruction is more clocks but the processors are a different speed.
So while it takes more cycles to do the same job on the faster processor the speed of the processor DOES compensate for that so it seems clear this is a question about does the processor speed account for the extra clock?
5.15 cycles/instruction / 2.5 (giga) cycles/second, cycles cancels out you get
2.06 seconds/(giga) instruction or (nano) seconds/ instruction
5.30 / 2.7 = 1.96296 (nano) seconds / instruction
The faster one takes a slightly less amount of time so it will run the program faster.
Another way to see this to check the math.
For 100 clock cycles on the slower machine 15% of those are beq. So 15 of the 100 clocks, which is 5 beq instructions. The same 5 beq instructions take 20 clocks on the faster machine so 105 clocks total for the same instructions on the faster machine.
100 cycles at 2.5ghz vs 105 at 2.7ghz
we want the amount of time
hz is cycles / second we want seconds on the top
so we want
cycles / (cycles/second) to have cycles cancel out and have seconds on the top
1/2.5 = 0.400 (400 picoseconds)
1/2.7 = 0.370
0.400 * 100 = 40.00 units of time
0.370 * 105 = 38.85 units of time
So despite taking 5 more cycles the processor speed differences is fast enough to compensate.
2.7/2.5 = 1.08
105/100 = 1.05
so 2.5 * 1.05 = 2.625 so a processor 2.625ghz or faster would run that program faster.
Now what were the rules for changing computers, is less time defined as a reason to change computers? What is the definition of better? How much more power does the faster one consume it might take less time but the power consumption might not be linear so it may take more watts despite taking less time. I assume the question is not that detailed, meaning it is vague meaning it is a poorly written question on its own, so it goes to what the textbook or lecture defined as the threshold for change to the other processor.
Disclaimer, dont blame me if you miss this question on your homework/test.
Outside an academic exercise like this, the real world is full of pipelined processors (not all but most of the folks writing programs are writing programs for) and basically you cant put a number on clock cycles per instruction type in a way that you can do this calculation because of a laundry list of factors. Make sore you understand that, nice exercise, but that specific exercise is difficult and dangerous to attempt on real world processors. Dangerous in that as hard as you work you may be incorrectly measuring something and jumping to the wrong conclusions and as a result making bad recommendations. At the same time there is very much the reality that faster ghz does improve some percentage of the execution, but another percentage suffers, and is there a net gain or loss. Or a new processor design faster or slower may have features that perform better than an older processor, but not all feature will be better, there is a tradeoff and then we get into what "better" means.

How can CPU's have FLOPS much higher than their clock speeds?

For example, a modern i7-8700k can supposedly do ~60 GFLOPS (single-precision, source) while its maximum frequency is 4.7GHz. As far as I am aware, an instruction has to take at least one cycle to complete, so how is this possible?
There are multiple factors that are all multiplied together for this large effect:
SIMD, Intel 8700k and similar processors support AVX and AVX2, which includes many instructions that operate on registers that can hold 8 floats at the same time.
multiple cores, 8700k has 6 cores.
fused multiply-add, part of AVX2, has both a multiplication and addition in the same instruction.
high throughput execution. The latency (time an individual instruction takes) is not directly important to how much computation a processor can do in a unit of time. A modern CPU such as 8700k can start executing two (independent) FMAs in the same cycle (and keep in mind these are still SIMD instructions so that represents a lot of floating point operations) even through the latency of the operation is actually 4 cycles.
Multiplying all those factors together we get: 8 * 6 * 2 * 2 * 4.3 = 825 GFLOPS (matching the stats reported here). This calculation certainly does not mean that it can actually be attained. For example the processor may downclock significantly under such a workload in order to stay within its power budget, which is what Intel has been doing at least since Haswell (though the specifics have changed and it applied to server parts). Also, most real code has significant trouble feeding that many FMAs with data. Large matrix multiplications can get close though, and for example according to these stats the 8700k reached 496.7 Gflops in their SGEMM benchmark. Possibly the 8700k's max AVX2 turbo speed on 6 cores is 2.6GHz but as far as I can find it does not have an AVX offset by default (only needed when overclocked), or that GEMM is just not that close to hitting peak FLOPS.

Calculate CPU cycles and its handling bits per second

I've wondered about the amount of bits that a given CPU can handles and how should I manage to calculate this specific value.
I wanted to make sure that my thought and also my calculation are right.
Given that I have 64 bit CPU which feature 2.3 Ghz. The amount of bits that being handled per second should be the following :
(2.3 * 10^9) * 64
Is it that simple or I need consider any other variables?
Calculating throughput of a CPU is not as simple. Modern CPUs are pipelined and can execute multiple instructions concurrently if there are no data dependencies, yielding instructions per cycle (IPC) > 1. Some instructions may also have a higher latency than one cycle, meaning IPC can be < 1.
For some operations they might be constrained by memory bandwidth.
And then there are SIMD instructions which can process more data per instruction than the width of a single general purpose register.
http://www.agner.org/optimize/instruction_tables.pdf
https://en.wikipedia.org/wiki/Instruction_pipelining

Pipeline processor vs. Single-cycle processor

I have to compare the speed of execution of the following code (see picture) using DLX-pipeline and single-cycle processor.
Given:
an instruction in the single-cycle model takes 800 ps
a stage in the pipeline model takes 200 ps (based on MA)
My approach was as follows.
CPU time = CPI * CC * IC
Single-cycle:
CPU time = 1 * 800 ps * 10 instr. = 8000 ps.
Pipeline:
CPI = 21 cycles / 10 instr. = 2.1 cycles per instruction
CPU time = 2.1 * 200 ps * 10 = 4200 ps.
CPU time single-cycle / CPU time pipeline = 8000/4200 = 1.9, so the pipeline code runs 1.9 faster.
But I was said, I have to work with clock cycles and not with the time -- "It doesn't matter how much time a CC takes".
I don't see how to make a comparison otherwise. Could you please help me?
Your analysis is indeed correct, but I guess your professor is looking for an explanation like this:
Suppose the single cycle processor also has the stages that you have mentioned, namely IF, ID, EX, MA and WB and that the instruction spends roughly the same time in each stage as compared to the pipelined processor version. Now you can draw a pipeline diagram for this single cycle processor, and see that it would take 50 cycles on a single cycle processor (which can work on 1 instruction at a time) compared to the 19 cycles on a pipelined processor.
Again, I prefer the way you have analyzed it (as the single cycle processor wouldn't really have each of those stages in a different clock cycle, it would just have a very long clock cycle to cover all the stages). Also, you've not mentioned whether this is a stalling-only MIPS pipeline (for which your answer is correct) or if this is a bypassed-MIPS pipeline. If this is the latter, you can shave off a few more cycles and get it down to 15 cycles.

understanding CPI and cache access

These are previous homework problems, but I am using them as exam review. I am changing numbers around from what is actually in the problem. I just want to make sure I have a grasp on the concepts. I already have the answers, just need clarification that I understand them. This is not homework but review work.
Anyway, this focuses on aspects of CPI
The fist problem:
An application running on a 1GHz processor has 30% load-store instructions, 30% arithmetic, and 40% branch instructions. The individual CPIs are 3 for load-store, 4 for arithmetic, 5 for branch instructions. Determine the overall CPI of this program on the given processor.
My answer: The overall CPI is the sum of the sub-CPIs, multiplied by the percentages in which they occur i.e. 3*0.3 + 4*0.3 + 5*0.4 = 0.9 + 1.2 + 2 = 4.1
Now, the processor is enhanced to run at 1.6GHz. The CPIs of the branch instructions remain the same but load-store and arithmetic instruction CPIs both increase to 6 cycles. A new compiler is in use which eliminates 30% of branch instructions and 10% of load-stores. Determine the new overall CPI and the factor by which the application will be faster or slower.
My answer: Once again, the new CPI is just the sum of its parts. However, the parts have changed and this must be accounted for. Branch instructions will drop by 30% (0.4*0.7=0.28) and load-stores will drop by 10% (0.3*0.9=0.27); arithmetic instructions will now account for the rest of the instructions (1-0.28-0.27=0.45), or 45%. These will be multiplied by the new sub-CPIs to get: 6*0.45+6*0.27+5*0.28=5.72.
Now, the processor enhancement is 60% faster, and the CPI is greater by (5.72-4.1)/4.1 = 39.5%. Thus, the application will run roughly 0.6*0.395 = 23.7% faster.
Now, the second problem:
A new processor with a load/store architecture has an ideal CPI of 1.25. Typical applications on this processor are a mix of 50% arithmetic and logic, 25% conditional branching and 25% load/store. Memory is accessed via a separate data and instruction cache, with a 5% instruction cache miss rate and 10% data miss rate. The penalty of any cache miss is 100 cycles and hits don't produce any penalties.
What is the effective CPI?
My answer: The effective CPI is the ideal CPI, plus the stalled cycles per instruction due to cache access. The ideal CPI is, as given, 1.25. The stalled cycles per instruction is (0.1*100*0.25) + (0.05*100*1) = 7.5. 0.1*100*0.25 is the data miss rate multiplied by the stalled cycle penalty which is also multiplied by the load/store percentage (which is where the data accesses take place); 0.05*100*1 is the instruction miss rate, which is the instruction cache miss rate times the stalled cycle penalty, instruction access take place in 100% of the program, so this is multiplied by 1. Following from this, the effective CPI is 1.25 + 7.5 = 8.75.
What is the misses per 1000 instruction for typical applications and what is the average memory access time (in clock cycles) for typical applications?
My answers: The misses per 1000 instructions is equal to the stalled cycles per instruction due to cache access (as given above: 7.5), divided by 1000, which equals 7.5/1000 = 0.0075
When discussing the average memory access time (AMAT), we first must talk about the total number of accesses here, which is the percentage of data accesses (25%) plus the percentage of instruction accesses (100%), or 125%=1.25. The data accesses are .25/1.25 and the instruction accesses are 1/1.25.
The AMAT equals the percentage of data accesses (.25/1.25) multiplied by the sum of the hit time (1) and the data miss rate multiplied by the miss penalty (0.1*100), or (.25/1.25)(1+0.1*100) and this is added to the percentage of instruction accesses (1/1.25) multiplied by the sum of the hit time (1) and the instruction miss rate multiplied by the miss penalty (0.05*100), or (1/1.25)(1+0.05*100). Put together, the AMAT is (.25/1.25)(1+0.1*100)+(1/1.25)(1+0.05*100)=7.
Once again, sorry for the wall of text. If I am wrong, please try to help me understand how I am wrong. I tried to show all my work to make it as easy as possible to understand. Thanks in advance.
There's an error in the lat part of your question. When they ask:
What is the misses per 1000 instruction for typical applications and what is the average memory
access time (in clock cycles) for typical applications?
what's needed here is the number of misses you will get for every 1000 instructions, which in this case would be 1000*1*0.05 for instruction cache misses and 1000*0.25*0.1 for data cache misses. This equals 75 misses per 1000 instructions.
To calculate the AMAT, you use the formula AMAT = hit time + (miss rate*miss penalty)
In this case, your miss rate is 75/1000 and your miss penalty is 100 cycles. The hit time is given as 1.25 cycles (your ideal CPI!).
Hope this helps and all the best for your exam!

Resources