A question about synchronous ram read instructions in Ifetch stage of pipelined CPU - cpu

In the Ifetch stage of pipelined CPU, if instruction MEM uses synchronous RAM: the read data will not be given until the rising edge of the next clock after the address is given. But in pipelined CPU, there are pipelined registers between each stage. Then, from the given address to the instruction MEM to the if / ID register is written, you need to wait for two clock rising edges, so don't you need two clock rising edges in the IF stage? This question puzzled me for a long time. I hope someone can help me.

Related

Given the data path of the LC-3 (provided below), give a complete description of the STORE DIRECT instruction (ST SR, label), as follows

I've tried reading the book but I'm not sure exactly how to go about this. Anyone have an idea?
So in doing these problems its best to have the datapath in front of you and have the Register Transfer Language written down, you do these problems step by step, it is a little daunting, but following all of the Digital Logic you have learned it is all a matter of you being a pirate and the datapath being your treasure map.
To do this you just follow the wires in the diagram. I'm using this one which I'm sure is in the Patt/Patel textbook https://i.ytimg.com/vi/PeWbyffnkZ4/maxresdefault.jpg
Mem[PC + SEXT(IR[8:0])] = SR
Clock Cycle 1
So the first thing you need to do is SEXT(IR[8:0]) So where in the datapath is a sign extender and where is the IR. If you look at the ADDR2MUX you see it has 4 inputs each being bits from the IR and one with 0. So ADDR2MUX=IR[8:0]
Next we need to add the PC to it. So from the output of the ADDR2MUX will be the SEXT(IR[8:0]) So next we need to add the PC to that output. Well we see the output of the ADDR2MUX feeds into an adder. So ok we need to set the other adder up with the PC. The ADDR1MUX has an input from the register file and the PC. So ADDR1MUX=PC
Both of these inputs go into the adder and now the output of that adder has PC + SEXT(IR[8:0])
Next we need to store to memory, the address we want to store to is PC + SEXT(IR[8:0]), and what we want to store is SR. So how do we do that? To interface with memory we need to put the address in the MAR (Memory Address Register) and the data we want to store in the MDR. So lets do the MAR step first. So we need to put the result of the ADDER into the MAR. The only path we can take is the MARMUX. So MARMUX=ADDER. We need to Gate the MARMUX to put it out on the bus as well. So GateMARMUX.
The value of the MARMUX is now out onto the bus so we want to latch that into the MAR so LDMAR.
We need a new clock cycle now because we need to wait for the value to latch into the register which happens at the beginning of a new clock cycle.
Clock Cycle 1 Signals - ADDR2MUX=IR[8:0], ADDR1MUX=PC, MARMUX=ADDER, GateMARMUX, LDMAR
Clock Cycle 2
Now lets the source register into the MDR. Looking at the diagram we need a path from the register file to the BUS to get it into the MDR. There's actually two ways of doing this one going through ADDR1MUX and one going through the ALU.
I will take the ALU path as its slightly more correct.
First we need to make SR1 be the source register from the instruction so SR1MUX=[11:9]
The Source register from the instruction now comes out the register file from the SR1 output, this feeds into the ALU. So we must choose the operation the ALU does so, ALUK=PASSA. PASSA simply makes the output of the ALU the 'A' input.
We then need to put the ALU output on the bus so GateALU
Now the ALU output is on the bus and we want to store this in the MDR, however there is a MUX blocking that input. MIO.EN=0 to select the bus output to go into the MDR. Then we need to latch that value into the register so LDMDR
We just tried to load a value into a register so it will not be available in the output until the start of the next clock cycle so..
Clock Cycle 2 Signals - SR1MUX=[11:9], ALUK=PASSA, GateALU, MIO.EN=0, LDMDR
Clock Cycle 3
All we need to do is give memory a good ol kick to store the value in the MDR at the address in the MAR so... MIO.EN=1, R/W=W
Clock Cycle 3 signals - MIO.EN=1, R/W=W
So this concludes the ST instruction it takes 3 clock cycles and the signals to assert each clock cycle are indicated at the end of each clock cycle. All of those signals per clock cycle can be turned on at the same time.

Whether combinational circuit will have less frequency of operation than sequential circuit?

I have designed an algorithm-SHA3 algorithm in 2 ways - combinational
and sequential.
The sequential design that is with clock when synthesized giving design summary as
Minimum clock period 1.275 ns and Maximum frequency 784.129 MHz.
While the combinational one which is designed without clock and has been put between input and output registers is giving synthesis report as
Minimum clock period 1701.691 ns and Maximum frequency 0.588 MHz.
so i want to ask is it correct that combinational will have lesser frequency than sequential?
As far as theory is concerned combinational design should be faster than sequential. But the simulation results I m getting for sequential is after 30 clock cycles where as combinational there is no delay in the output as there is no clock. In this way combinational is faster as we are getting instant output but why frequency of operation of combinational one is lesser than sequential one. Why this design is slow can any one explain please?
The design has been simulated in Xilinx ISE
Now I have applied pipe-lining to the combinational logic by inserting the registers in between the 5 main blocks which are doing the computation. And these registers are controlled by clock so now this pipelined design is giving design summary as
clock period 1.575 ns and freq 634.924 MHz
Min period 1.718 ns and freq 581.937.
So now this 1.575 ns is the delay between any of the 2 registers , its not the propagation delay of entire algorithm so how can i calculate propagation delay of entire pipelined algorithm.
What you are seeing is pipelining and its performance benefits. The combinational circuit will cause each input to go through the propagation delays of the entire algorithm, which will take at up to 1701.691ns on the FPGA you are working with, because the slowest critical path in the combinational circuitry needed to calculate the result will take up to that long. Your simulator is not telling you everything, since a behavioral simulation will not show gate propagation delays. You'll just see the instant calculation of your combinational function in your simulation.
In the sequential design, you have multiple smaller steps, the slowest of which takes 1.275ns in the worst case. Each of those steps might be easier to place-and-route efficiently, meaning that you get overall better performance because of the improved routing of each step. However, you will need to wait 30 cycles for a result, simply because the steps are part of a synchronous pipeline. With the correct design, you could improve this and get one output per clock cycle, with a 30-cycle delay, by having a full pipeline and passing data through it at every clock cycle.

Faster cpu wastes more time as compared to slower cpu

Suppose I have a program that has an instruction to add two numbers and that operation takes 10 nanoseconds(constant, as enforced by the gate manufactures).
Now I have 3 different processors A, B and C(where A< B < C in terms of clock cycles). A's one clock cycle has 15 nanosec, B has 10 nanosec and C has 7 nanosec.
Firstly am I correct on my following assumptions-
1. Add operation takes 1 complete cycle of processor A(slow processor) and wastes rest of 5 ns of the cycle.
2. Add operation takes 1 complete cycle of processor B wasting no time.
3. Add operation takes 2 complete cycles(20 ns) of processor C(fast processor) wasting rest of the 20-14=7 ns.
If the above assumptions are correct then isn't this a contradiction to the regular assumption that processors with high clock cycles are faster. Here processor C which is the fastest actually takes 2 cycles and wastes 7ns whereas, the slower processor A takes just 1 cycle.
Processor C is fastest, no matter what. It takes 7 ns per cycle and therefore performs more cycles than A and B. It's not C's fault that the circuit is not fast enough. If you would implement the addition circuit in a way that it gives result in 1 ns, all processors will give the answer in 1 clock cycle (i.e. C will give you the answer in 7ns, B in 10ns and A in 15ns).
Firstly am I correct on my following assumptions-
1. Add operation takes 1 complete cycle of processor A(slow processor) and wastes rest of 5 ns of the cycle.
2. Add operation takes 1 complete cycle of processor B wasting no time.
3. Add operation takes 2 complete cycles(20 ns) of processor C(fast processor) wasting rest of the 20-7=13 ns.
No. It is because you are using incomplete data to express the time for an operation. Measure the time taken to finish an operation on a particular processor in clock cycles instead of nanoseconds as you are doing here. When you say ADD op takes 10 ns and you do not mention the processor on which you measured the time for the ADD op, the time measurement in ns is meaningless.
So when you say that ADD op takes 2 clock cycles on all three processors, then you have standardized the measurement. A standardized measurement can then be translated as:
Time taken by A for addition = 2 clock cycles * 15 ns per cycle = 30 ns
Time taken by B for addition = 2 clock cycles * 10 ns per cycle = 20 ns
Time taken by C for addition = 2 clock cycles * 07 ns per cycle = 14 ns
In case you haven't noticed, when you say:
A's one clock cycle has 15 nanosec, B has 10 nanosec and C has 7 nanosec.
which of the three processors is fastest?
Answer: C is fastest. It's one cycle is finished in 7ns. It implies that it finishes 109/7 (~= 1.4 * 108) cycles in one second, compared to B which finishes 109/10 (= 108) cycles in one second, compared to A which finishes only 109/15 (~= 0.6 * 108) cycles in one second.
What does a ADD instruction mean, does it purely mean only and only ADD(with operands available at the registers) or does it mean getting
the operands, decoding the instruction and then actually adding the
numbers.
Getting the operands is done by MOV op. If you are trying to compare how fast ADD op is being done, it should be compared by time to perform ADD op only. If you, on the other hand want to find out how fast addition of two numbers is being done, then it will involve more operations than simple ADD. However, if it's helpful, the list of all Original 8086/8088 instructions is available on Wikipedia too.
Based on the above context to what add actually means, how many cycles does add take, one or more than one.
It will depend on the processor because each processor may have the adder differently implemented. There are many ways to generate addition of two numbers. Quoting Wikipedia again - A full adder can be implemented in many different ways such as with a custom transistor-level circuit or composed of other gates.
Also, there may be pipelining in the instructions which can result in parallelizing of the addition of the numbers resulting in huge time savings.
Why is clock cycle a standard since it can vary with processor to processor. Shouldn't nanosec be the standard. Atleast its fixed.
Clock cycle along with the processor speed can be the standard if you want to tell the time taken by a processor to execute an instruction. Pick any two from:
Time to execute an instruction,
Processor Speed, and
Clock cycles needed for an instruction.
The third can be derived from it.
When you say the clock cycles taken by ADD is x and you know the processor speed is y MHz, you can calculate that the time to ADD is x / y. Also, you can mention the time to perform ADD as z ns and you know the processor speed is same y MHz as earlier, you can calculate the cycles needed to execute ADD as y * z.
I'm no expert BUT I'd say ...
the regular assumption that processors with high clock cycles are faster FOR THE VAST MAJORITY OF OPERATIONS
For example, a more intelligent processor might perform an "overhead task" that takes X ns. The "overhead task" might make it faster for repetitive operations but might actually cause it to take longer for a one-off operation such as adding 2 numbers.
Now, if the same processor performed that same operation 1 million times, it should be massively faster than the slower less intelligent processor.
Hope my thinking helps. Your feedback on my thoughts welcome.
Why would a faster processor take more cycles to do the same operation than a slower one?
Even more important: modern processors use Instruction pipelining, thus executing multiple operations in one clock cycle.
Also, I don't understand what you mean by 'wasting 5ns', the frequency determines the clock speed, thus the time it takes to execute 1 clock. Of course, cpu's can have to wait on I/O for example, but that holds for all cpu's.
Another important aspect of modern cpu's are the L1, L2 and L3 caches and the architecture of those caches in multicore systems. For example: if a register access takes 1 time unit, a L1 cache access will take around 2 while a normal memory access will take between 50 and 100 (and a harddisk access would take thousands..).
This is actually almost correct, except that on processor B taking 2 cycles means 14ns, so with 10ns being enough the next cycle starts 4ns after the result was already "stable" (though it is likely that you need some extra time if you chop it up, to latch the partial result). It's not that much of a contradiction, setting your frequency "too high" can require trade-offs like that. An other thing you might do it use more a different circuit or domino logic to get the actual latency of addition down to one cycle again. More likely, you wouldn't set addition at 2 cycles to begin with. It doesn't work out so well in this case, at least not for addition. You could do it, and yes, basically you will have to "round up" the time a circuit takes to an integer number of cycles. You can also see this in bitwise operations, which take less time than addition but nevertheless take a whole cycle. On machine C you could probably still fit bitwise operations in a single cycle, for some workloads it might even be worth splitting addition like that.
FWIW, Netburst (Pentium 4) had staggered adders, which computed the lower half in one "half-cycle" and the upper half in the next (and the flags in the third half cycle, in some sense giving the whole addition a latency of 1.5). It's not completely out of this world, though Netburst was over all, fairly mad - it had to do a lot of weird things to get the frequency up that high. But those half-cycles aren't very half (it wasn't, AFAIK, logic that advanced on every flank, it just used a clock multiplier), you could also see them as the real cycles that are just very fast, with most of the rest of the logic (except that crazy ALU) running at half speed.
Your broad point that 'a CPU will occasionally waste clock cycles' is valid. But overall in the real world, part of what makes a good CPU a good CPU is how it alleviates this problem.
Modern CPUs consist of a number of different components, none of whose operations will end up taking a constant time in practice. For example, an ADD instruction might 'burst' at 1 instruction per clock cycle if the data is immediately available to it... which in turn means something like 'if the CPU subcomponents required to fetch that data were immediately available prior to the instruction'. So depending on if e.g. another subcomponent had to wait for a cache fetch, the ADD may in practice take 2 or 3 cycles, say. A good CPU will attempt to re-order the incoming stream of instructions to maximise the availability of subcomponents at the right time.
So you could well have the situation where a particular series of instructions is 'suboptimal' on one processor compared to another. And the overall performance of a processor is certainly not just about raw clock speed: it is as much about the clever logic that goes around taking a stream of incoming instructions and working out which parts of which instructions to fire off to which subcomponents of the chip when.
But... I would posit that any modern chip contains such logic. Both a 2GHz and a 3GHz processor will regularly "waste" clock cycles because (to put it simply) a "fast" instruction executed on one subcomponent of the CPU has to wait for the result of the output from another "slower" subcomponent. But overall, you will still expect the 3GHz processor to "execute real code faster".
First, if the 10ns time to perform the addition does not include the pipeline overhead (clock skew and latch delay), then Processor B cannot complete an addition (with these overheads) in one 10ns clock cycle, but Processor A can and Processor C can still probably do it in two cycles.
Second, if the addition itself is pipelined (or other functional units are available), then a subsequent non-dependent operation can begin executing in the next cycle. (If the addition was width-pipelined/staggered (as mentioned in harold's answer) then even dependent additions, logical operations and left shifts could be started after only one cycle. However, if the exercise is constraining addition timing, it presumably also prohibits other optimizations to simplify the exercise.) If dependent operations are not especially common, then the faster clock of Processor C will result in higher performance. (E.g., if a dependence stall occurred every fourth cycle, then, ignoring other effects, Processor C can complete four instructions every five 7ns cycles (35 ns; the first three instruction overlap in execution) compared to 40ns for Processor B (assuming the add timing included pipelining overhead).) (Note: Your assumption 3 is incorrect, two cycles for Processor C would be 14ns.)
Third, the extra time in a clock cycle can be used to support more complex operations (e.g., preshifting one operand by a small immediate value and even adding three numbers — a carry-save adder has relatively little delay), to steal work from other pipeline stages (potentially reducing the number of pipeline stages, which generally reduces branch misprediction penalties), or to reduce area or power by using simpler logic. In addition, the extra time might be used to support a larger (or more associative) cache with fixed latency in cycles, reducing miss rates. Such factors can compensate for the "waste" of 5ns in Processor A.
Even for scalar (single issue per cycle) pipelines clock speed is not the single determinant of performance. Design choices become even more complex when power, manufacturing cost (related to yield, adjusted according to sellable bins, and area), time-to-market (and its variability/predictability), workload diversity, and more advanced architectural and microarchitectural techniques are considered.
The incorrect assumption that clock frequency determines performance even has a name: the Megahertz myth.

Understanding CPU pipeline stages vs. Instruction throughput

I'm missing something fundamental re. CPU pipelines: at a basic level, why do instructions take differing numbers of clock cycles to complete and how come some instructions only take 1 cycle in a multi-stage CPU?
Besides the obvious of "different instructions require a different amount of work to complete", hear me out...
Consider an i7 with an approx 14 stage pipeline. That takes 14 clock cycles to complete a run-through. AFAIK, that should mean the entire pipeline has a latency of 14 clocks. Yet this isn't the case.
An XOR completes in 1 cycle and has a latency of 1 cycle, indicating it doesn't go through all 14 stages. BSR has a latency of 3 cycles, but a throughput of 1 per cycle. AAM has a latency of 20 cycles (more that the stage count) and a throughput of 8 (on an Ivy Bridge).
Some instructions cannot be issued every clock, yet take less than 14 clocks to complete.
I know about the multiple execution units. I don't understand how the length of instructions in terms of latency and throughput relate to the number of pipline stages.
I think what's missing from the existing answers is the existence of "bypass" or "forwarding" datapaths. For simplicity, let's stick with the MIPS 5-stage pipeline. Every instruction takes 5 cycles from birth to death -- fetch, decode, execute, memory, writeback. So that's how long it takes to process a single instruction.
What you want to know is how long it takes for one instruction to hand off its result to a dependent instruction. Say you have two consecutive ADD instructions, and there's a dependency through R1:
ADD R1, R2, R3
ADD R4, R1, R5
If there were no forwarding paths, we'd have to stall the second instruction for multiple cycles (2 or 3 depending on how writeback works), so that the first one could store its result into the register file before the second one reads that as input in the decode stage.
However, there are forwarding paths that allow valid results (but ones that are not yet written back) to be picked out of the pipeline. So let's say the first ADD gets all its inputs from the register file in decode. The second one will get R5 out of the register file, but it'll get R1 out of the pipeline register following the execute stage. In other words, we're routing the output of the ALU back into its input one cycle later.
Out-of-order processors make ubiquitous use of forwarding. They will have lots of different functional units that have lots of different latencies. For instance, ADD and AND will typically take one cycle (TO DO THE MATH, putting aside all of the pipeline stages before and after), MUL will take like 4, floating point operations will take lots of cycles, memory access has variable latency (due to cache misses), etc.
By using forwarding, we can limit the critical path of an instruction to just the latencies of the execution units, while everything else (fetch, decode, retirement), it out of the critical path. Instructions get decoded and dumped into instruction queues, awaiting their inputs to be produced by other executing instructions. When an instruction's dependency is satisfied, then it can begin executing.
Let's consider this example
MUL R1,R5,R6
ADD R2,R1,R3
AND R7,R2,R8
I'm going to make an attempt at drawing a timeline that shows the flow of these instructions through the pipeline.
MUL FDIXXXXWR
ADD FDIIIIXWR
AND FDIIIIXWR
Key:
F - Fetch
D - Decode
I - Instruction queue (IQ)
X - execute
W - writeback/forward/bypass
R - retire
So, as you see, the multiply instruction has a total lifetime of 9 cycles. But there is overlap in execution of the MUL and the ADD, because the processor is pipelined. When the ADD enters the IQ, it has to wait for its input (R1), and likewise so does the AND that is dependent on the ADD's result (R2). What we care about is not how long the MUL lives in total but how long any dependent instruction has to wait. That is its EFFECTIVE latency, which is 4 cycles. As you can see, once the ADD executes, the dependent AND can execute on the next cycle, again due to forwarding.
I'm missing something fundamental re. CPU pipelines: at a basic level, why do instructions take differing numbers of clock cycles to complete and how come some instructions only take 1 cycle in a multi-stage CPU?
Because what we're interested in is in speed between instructions, not the start to end time of a single instruction.
Besides the obvious of "different instructions require a different amount of work to complete", hear me out...
Well that's the key answer to why different instructions have different latencies.
Consider an i7 with an approx 14 stage pipeline. That takes 14 clock cycles to complete a run-through. AFAIK, that should mean the entire pipeline has a latency of 14 clocks. Yet this isn't the case.
That is correct, though that's not a particularly meaningful number. For example, why do we care how long it takes before the CPU is entirely done with an instruction? That has basically no effect.
An XOR completes in 1 cycle and has a latency of 1 cycle, indicating it doesn't go through all 14 stages. BSR has a latency of 3 cycles, but a throughput of 1 per cycle. AAM has a latency of 20 cycles (more that the stage count) and a throughput of 8 (on an Ivy Bridge).
This is just a bunch of misunderstandings. An XOR introduces one cycle of latency into a dependency chain. That is, if I do 12 instructions that each modify the previous instruction's value and then add an XOR as the 13th instruction, it will take one cycle more. That's what the latency means.
Some instructions cannot be issued every clock, yet take less than 14 clocks to complete.
Right. So?
I know about the multiple execution units. I don't understand how the length of instructions in terms of latency and throughput relate to the number of pipline stages.
They don't. Why should there be any connection? Say there's 14 extra stages at the beginning of the pipeline. Why would that effect latency or throughput at all? It would just mean everything happens 14 clock cycles later, but still at the same rate. (Though likely it would impact the cost of a mispredicted branch and other things.)

how many cpu cycle takes to execute MOV A, 5 instruction

My question is calculate how many CPU cycle takes to execute MOV A, 5 instruction. Describe each.
can anyone please explain me how this works. And the 5 is a value is it? Just explain me the main points.
As far as i know,
first,
-get the instruction from memory (one clock cycle)
-update instruction pointer(one clock cycle)
-decode the instruction to see what it does(one clock cycle)
i'm stuck after this.
I'm assuming you are talking about the x86 processors. This of course depends on the processor, but it usually takes 1 clock cycle from what i remember.
This is because instructions are executed in a pipeline. This means that while the processor is computing the result of an instruction, it is decoding the next and fetching the one before that, so that each part of the processor is busy doing something.
Usually the instructions that need data from memory or do complex calculations like multiplications or divisions take a longer time to execute.
You can also get the number of cycles with RDTSC https://www.ccsl.carleton.ca/~jamuir/rdtscpm1.pdf

Resources