We are working on a pipelined processor written in VHDL, and we have some issues with timing, synchronization and registers on the simulator (the code does not need to be synthesizable, because we are going to run it only on the simulator).
Imagine we have two processor stages, A and B, with a pipeline register in the middle:
Processor stage A is combinatorial and does not depend on the clock.
The pipeline register R is a register and therefore changes its state at the clock's rising edge.
Processor stage B is a complex stage with its own state machine, and therefore changes its state and does its operations inside a VHDL process governed by the clock's rising edge.
The configuration would be as follows
   _______     ___     _______
   |     |     | |     |     |
---|  A  |-----|R|-----|  B  |---
   |_____|     |_|     |_____|
With this configuration, there is a timing problem:
t = 0: A gets data, and does its operations
t = 1: At rising edge, R updates its data with the output of A.
t = 2: At rising edge, B gets the values of R, and updates its status and gives an output.
We would like to have B changing its state and generating an output at t = 1, but we also need the register in the middle to make the pipeline work.
A solution would be to update the R register on falling edge. But then, we are assuming that all processor stages run in half a clock cycle, and the other half is a bit useless.
How is this problem usually solved in pipelines?
First of all, just saying from personal experience in this field: never develop your own cpu, unless you are a freaking genius and have another few of your kind to verify your work and port a compiler.
To your problem:
a) A cutset technique is usually used to insert pipeline stages in the design. When implemented properly, you only need to solve control hazards.
b) Model your stages not with registers in between but with 1-deep transparent FIFOs. You will get automatic stall management for free, and it is easier to reason about pipelines.
c) Bypass register R. Use the data from A to register it in R and in B.
If none of the above helps, redesign B and/or hire a hardware developer who is used to reasoning about concurrent hardware.
After talking to quite a few people, I think we found the proper solution to the problem.
Stage B, which has its own state machine, should not have a VHDL process activated on the rising edge. Instead, the state of its state machine should be a signal that is stored in register R.
In more detail, these new signals should be added:
state: current state of the state machine, output from R, input to B
state_next: next state of the state machine, input to R, output from B
This means that state is replaced by state_next at each rising edge, and B can now work without a clocked process.
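The state/state_next split can be sketched as a small cycle-level model. This is a Python sketch rather than VHDL, and the IDLE/BUSY machine, the arithmetic in each stage and the names stage_a, stage_b and simulate are made up for illustration. The point is only that B is a pure function of R's outputs, and that the rising edge does nothing except let R capture its inputs.

```python
# Cycle-level sketch: stage B is purely combinational; register R holds
# both the pipeline data and B's state-machine state.
# The IDLE/BUSY machine below is a made-up example, not the original design.

def stage_a(x):
    """Combinational stage A (hypothetical operation: add 1)."""
    return x + 1

def stage_b(data, state):
    """Combinational stage B: returns (output, state_next)."""
    if state == "IDLE":
        return data * 2, "BUSY"
    return data * 2 + 1, "IDLE"

def simulate(inputs):
    r_data, state = 0, "IDLE"          # contents of R after reset
    outputs = []
    for x in inputs:
        # During the cycle: B reads R's current outputs combinationally.
        out, state_next = stage_b(r_data, state)
        outputs.append(out)
        # Rising edge: R captures A's result and B's next state together.
        r_data, state = stage_a(x), state_next
    return outputs

print(simulate([10, 20, 30]))  # [0, 23, 42]
```

B's output for a value appears in the cycle right after R captures it, with no extra cycle spent waiting for a second clocked process.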
I wonder why the latency of the final register write (200 ps) is not added.
To be more precise, the critical path is determined by the load instruction's
latency, so why is the critical path not
I-Mem + Regs + Mux + ALU + D-Mem + Mux + Regs
but actually
I-Mem + Regs + Mux + ALU + D-Mem + Mux?
Background
Figure 4.2
In the following three problems, assume that we are starting with a
datapath from Figure 4.2, where I-Mem, Add, Mux, ALU, Regs, D-Mem,
and Control blocks have latencies of 400 ps, 100 ps, 30 ps, 120 ps,
200 ps, 350 ps, and 100 ps, respectively, and costs of 1000, 30, 10,
100, 200, 2000, and 500, respectively.
I found the following solution:
Cycle time without improvement = I-Mem + Regs + Mux + ALU + D-Mem + Mux = 400 + 200 + 30 + 120 + 350 + 30 = 1130 ps
Cycle time with improvement = 1130 + 300 = 1430 ps
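The arithmetic can be checked with a few lines of Python. The latencies come from the problem statement; the dictionary and function names are only illustrative. The second sum is the alternative the question asks about, with a trailing Regs write included.

```python
# Latencies in ps, taken from the problem statement.
LATENCY_PS = {"I-Mem": 400, "Add": 100, "Mux": 30, "ALU": 120,
              "Regs": 200, "D-Mem": 350, "Control": 100}

# Critical path of a load, as listed in the solution above.
LOAD_PATH = ["I-Mem", "Regs", "Mux", "ALU", "D-Mem", "Mux"]

def path_latency(path):
    """Sum the block latencies along a path, in ps."""
    return sum(LATENCY_PS[block] for block in path)

cycle_time = path_latency(LOAD_PATH)
print(cycle_time)                           # 1130
print(path_latency(LOAD_PATH + ["Regs"]))   # 1330, if the write were counted
```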
It is a good question as to whether it requires two Regs latencies.
The register write is a capture of the output of one cycle. It happens at the end of one clock cycle, and the start of the next — it is the clock cycle edge/transition to the next cycle that causes the capture.
In one sense, the written output of one instruction effectively happens in parallel with the early operations of the next instruction, including the register reads, with the only requirement for this overlap being that the next instruction must be able to read the output of the prior instruction instead of a stale register value. And this is possible because the written data was already available at the very top/beginning of the current cycle (before the transition, in fact).
The PC works the same: at the clock transition from one cycle's end to another cycle's start, the value for the new PC is captured and then released to the I Mem. So, the read and write effectively happen in parallel, with the only requirement then being that the read value sent to I Mem is the new one.
This is the fundamental way that cycles work: enregistered values start a cycle, then combinational logic computes new values that are captured at the end of the cycle (aka the start of the next cycle) and form the program state available for the start of the new cycle. So one cycle does state -> processing -> (new) state and then the cycle repeats.
In the case of the PC, you might ask why we need a register at all?
(For the 32 CPU registers it is obvious that they are needed to provide the machine code program with lasting values (one instruction outputs register, say, $a0 and that register may be used many instructions later, or maybe even used many times before being changed.))
But one could speculate what might happen without a PC register (and the clocking that dictates its capture), and the answer there is that we don't want the PC to change until the instruction is completed, which is dictated by the clock. Without the clock and the register, the PC could run ahead of the rest of the design, since much of the PC computation is not on the critical path (this would cause instability of the design). But as we want the PC to hold stable for the whole clock cycle, and change only when the clock says the instruction is over, a register is used (along with the clocked update of it).
I've tried reading the book but I'm not sure exactly how to go about this. Anyone have an idea?
So in doing these problems it's best to have the datapath in front of you and the Register Transfer Language written down. You do these problems step by step. It is a little daunting, but if you follow all of the digital logic you have learned, it is just a matter of you being a pirate and the datapath being your treasure map.
To do this you just follow the wires in the diagram. I'm using this one which I'm sure is in the Patt/Patel textbook https://i.ytimg.com/vi/PeWbyffnkZ4/maxresdefault.jpg
Mem[PC + SEXT(IR[8:0])] = SR
Clock Cycle 1
So the first thing you need to do is SEXT(IR[8:0]). Where in the datapath is a sign extender, and where is the IR? If you look at the ADDR2MUX, you see it has 4 inputs, each being bits from the IR except one that is 0. So ADDR2MUX=IR[8:0]
Next we need to add the PC to it. The output of the ADDR2MUX will be SEXT(IR[8:0]), and we see that this output feeds into an adder. So we need to set up the adder's other input with the PC. The ADDR1MUX has one input from the register file and one from the PC. So ADDR1MUX=PC
Both of these inputs go into the adder and now the output of that adder has PC + SEXT(IR[8:0])
Next we need to store to memory. The address we want to store to is PC + SEXT(IR[8:0]), and what we want to store is SR. So how do we do that? To interface with memory we need to put the address in the MAR (Memory Address Register) and the data we want to store in the MDR (Memory Data Register). So let's do the MAR step first. We need to put the result of the ADDER into the MAR. The only path we can take is the MARMUX. So MARMUX=ADDER. We need to gate the MARMUX to put it out on the bus as well. So GateMARMUX.
The value of the MARMUX is now out onto the bus so we want to latch that into the MAR so LDMAR.
We need a new clock cycle now because we need to wait for the value to latch into the register which happens at the beginning of a new clock cycle.
Clock Cycle 1 Signals - ADDR2MUX=IR[8:0], ADDR1MUX=PC, MARMUX=ADDER, GateMARMUX, LDMAR
Clock Cycle 2
Now let's get the source register into the MDR. Looking at the diagram, we need a path from the register file to the bus to get it into the MDR. There are actually two ways of doing this: one going through ADDR1MUX and one going through the ALU.
I will take the ALU path as it's slightly more correct.
First we need to make SR1 be the source register from the instruction so SR1MUX=[11:9]
The Source register from the instruction now comes out the register file from the SR1 output, this feeds into the ALU. So we must choose the operation the ALU does so, ALUK=PASSA. PASSA simply makes the output of the ALU the 'A' input.
We then need to put the ALU output on the bus so GateALU
Now the ALU output is on the bus and we want to store this in the MDR, however there is a MUX blocking that input. MIO.EN=0 to select the bus output to go into the MDR. Then we need to latch that value into the register so LDMDR
We just tried to load a value into a register so it will not be available in the output until the start of the next clock cycle so..
Clock Cycle 2 Signals - SR1MUX=[11:9], ALUK=PASSA, GateALU, MIO.EN=0, LDMDR
Clock Cycle 3
All we need to do is give memory a good ol kick to store the value in the MDR at the address in the MAR so... MIO.EN=1, R/W=W
Clock Cycle 3 signals - MIO.EN=1, R/W=W
So this concludes the ST instruction. It takes 3 clock cycles, and the signals to assert in each clock cycle are listed at the end of that cycle. All of the signals for a given clock cycle can be turned on at the same time.
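The three cycles above can be condensed into a register-transfer sketch. This is a Python model, not a datapath simulator: the function names, the 16-bit address wrap and the eight-entry register list are assumptions made for illustration, and pc stands for whatever value the PC register holds during these cycles. Opcode bits in ir are ignored.

```python
# Sketch of the three ST cycles as register transfers. Field positions
# follow the walkthrough: SR is IR[11:9], the offset is IR[8:0].

def sext9(value):
    """Sign-extend a 9-bit field to a Python int."""
    value &= 0x1FF
    return value - 0x200 if value & 0x100 else value

def execute_st(ir, pc, regs, memory):
    # Cycle 1: ADDR2MUX=IR[8:0], ADDR1MUX=PC, MARMUX=ADDER, GateMARMUX, LDMAR
    mar = (pc + sext9(ir)) & 0xFFFF      # assumed 16-bit address space
    # Cycle 2: SR1MUX=[11:9], ALUK=PASSA, GateALU, MIO.EN=0, LDMDR
    mdr = regs[(ir >> 9) & 0x7]
    # Cycle 3: MIO.EN=1, R/W=W
    memory[mar] = mdr
    return mar, mdr

# Example: SR = R2, offset = -1, PC = 0x3001, so the store goes to 0x3000.
memory = {}
regs = [0, 0, 0xBEEF, 0, 0, 0, 0, 0]
print(execute_st(0x35FF, 0x3001, regs, memory))  # (12288, 48879)
```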
I am currently developing a subset of the 6502 in LogiSim. One of my main resources is Hanson's Block Diagram.
I am trying to determine how and where I should build circuitry to update the Processor Status Register. In the diagram of the Processor Status Register below, there are multiple control lines going into the register, but there is no indication of where they come from.
When and where is the 6502 Processor status register updated? I would think that it is on the output of the ALU, but I want to make sure that this is the case.
Do you have Hanson's complete updated diagram? The paper is here. (Or original here.)
The inputs on the left side of P (DB0/C etc) are outputs from the bottom of the Random Control Logic block. The inputs at the top of P are from the ALU (ACR, AVR) and IR5 is bit 5 of the Instruction Register. (But from Breaknes below it seems Hanson's diagram is incomplete: "Donald missed the 0/V command on the schematic, which is used when processing the CLV instruction.")
The inputs will be latched differently for various instructions. For instance the two cycle instructions like CLC/SEC/CLD/SED/CLI/SEI/CLV have one bit (IR5) that ends up latching a hard-coded value to only one of C, I, V, or D. Other instructions will latch ALU (etc) signals to multiple flags at a later cycle. That's as much detail as I know, and as much of the logic that will fit into an answer here.
Very detailed information is available at the Russian Breaknes site. The author has reverse engineered all of the 6502 logic at the transistor level from images at Visual6502. Have a good look around in the Wiki and Info sections of the site. E.g. here is a translated link to the flag info page which has a logic diagram, unlike the wiki page on flag logic.
There was a lot of discussion in the 6502 forum when he did this work (flag logic on page 12 and page 15) and some of the content might only be linked from this thread. The original code repo has been moved to GitHub where there is emulator source code and Logisim circuit diagrams.
From the top:
The C flag is set or cleared by
any instruction that can have an unsigned overflow. These include ADC, SBC, CMP, CPX, CPY
Shift and rotate instructions ASL, ROL, ROR
explicit set and clear instructions SEC, CLC
instructions that load the whole status register PLP, RTI
Z is set or cleared by
any instruction that writes to A, X, Y e.g. arithmetic as for carry, bitwise logical operations, loads, transfers, pulls from the stack, shifts and rotates.
instructions that load the whole status register PLP, RTI
BIT
I is set or cleared by
the SEI and CLI instructions.
instructions that load the whole status register PLP, RTI
set by BRK and interrupts.
D is set or cleared only by the PLP, RTI, SED and CLD instructions.
B is interesting. It's actually completely inaccessible to the programmer and not used by the processor. In the status byte pushed on the stack, it is set for BRK and cleared for an interrupt. I guess that means that RTI and PLP would set it if it is set in the byte pulled off the stack, but it doesn't matter.
V flag is set or cleared by
ADC, SBC
BIT
instructions that load the whole status register PLP, RTI
CLV
N is set or cleared in the same circumstances as Z.
I would think that it is on the output of the ALU
That is a fair assessment for all the ALU ops, but as you can see from above there are circumstances when the status flags are set from a source other than the ALU.
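For the ALU case, here is a minimal Python sketch of how C, Z, N and V can be derived from an 8-bit add with carry. It models which values the flags take, not how the 6502's circuitry latches them, and it ignores decimal mode; the function name adc_flags is made up for illustration.

```python
def adc_flags(a, operand, carry_in):
    """8-bit binary-mode add with carry; returns (result, flags dict)."""
    total = a + operand + carry_in
    result = total & 0xFF
    flags = {
        "C": int(total > 0xFF),          # unsigned overflow out of bit 7
        "Z": int(result == 0),           # result is zero
        "N": (result >> 7) & 1,          # bit 7 of the result
        # Signed overflow: both operands differ in sign from the result.
        "V": int(((a ^ result) & (operand ^ result) & 0x80) != 0),
    }
    return result, flags

print(adc_flags(0x50, 0x50, 0))  # (160, {'C': 0, 'Z': 0, 'N': 1, 'V': 1})
print(adc_flags(0xFF, 0x01, 0))  # (0, {'C': 1, 'Z': 1, 'N': 0, 'V': 0})
```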
Reference: http://www.e-tradition.net/bytes/6502/6502_instruction_set.html
Here is algorithm 1 of the two-process solution:
turn = 0;
i = 0, j = 1;
do
{
    while (turn != i) ; // not i's turn: busy-wait until turn == i
    // critical section
    turn = j; // after i leaves the critical section, let j in
    // remainder section
} while (1); // loop again
I understand that mutual exclusion is satisfied, because when P0 is in its critical section, P1 waits until P0 leaves it. After P0 updates turn, P1 enters its critical section. I don't understand why progress is not satisfied in this algorithm.
Progress means that if no process is in the critical section, a waiting process should be able to enter it without waiting.
P0 updates turn after leaving the critical section, so P1, which waits in the while loop, should be able to enter the critical section. Can you please tell me why there is no progress then?
Forward progress is defined as follows:
If no process is executing in its CS and there exist some processes that wish to enter their CS, then the selection of the process that will enter the CS next cannot be postponed indefinitely.
The code you wrote above does not satisfy this when the threads are not balanced. Consider the following scenario:
P0 has entered the critical section, finished it, and set the turn to P1.
P1 enters the section, completes it, sets the turn back to P0.
P1 quickly completes the remainder section, and wishes to enter the critical section again. However, P0 still holds the turn.
P0 gets stalled somewhere in its remainder section indefinitely. P1 is starved.
In other words, this algorithm can't support a system where one of the processes runs much faster. It forces the critical section to be owned in equal turns by P0 -> P1 -> P0 -> P1 -> ... For forward progress we would like to allow a scenario where it's owned for example in the following manner P0 -> P1 -> P1 -> .., and continuing with P1 while P0 isn't ready for entering again. Otherwise P1 may be starved.
Peterson's algorithm fixes this by adding flags that indicate when a thread is ready to enter the critical section, on top of the turn-based fairness you already have. This guarantees that no one is stalled by the other thread's slowness, and that no one can enter multiple times in a row unless the other permits it.
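The difference between the two entry conditions can be seen in a small sequential model of the scenario above, where P0 is stalled in its remainder section and P1 keeps trying to enter. This is a Python sketch of the logic only, not a usable lock: a real implementation needs atomic accesses and memory barriers, and the function names are made up for illustration.

```python
def p1_entries_strict(attempts):
    """P1's successful entries under strict alternation while P0 is stalled."""
    turn, entries = 1, 0           # P0 has just handed the turn to P1
    for _ in range(attempts):
        if turn == 1:              # while (turn != i) ; would fall through
            entries += 1           # critical section
            turn = 0               # hand the turn to P0, which never takes it
    return entries

def p1_entries_peterson(attempts):
    """Same scenario with Peterson's entry condition."""
    flag, entries = [False, False], 0   # P0 is stalled and not interested
    for _ in range(attempts):
        flag[1], turn = True, 0    # entry protocol: announce, then yield turn
        if not flag[0] or turn == 1:
            entries += 1           # critical section
        flag[1] = False            # exit protocol
    return entries

print(p1_entries_strict(5))    # 1
print(p1_entries_peterson(5))  # 5
```

Under strict alternation P1 gets in once and is then blocked for good, even though the critical section is empty. With the flags, P1 keeps entering as long as P0 does not ask for the section.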
You cannot be sure about the order in which the code in the two processes is run. If P1 runs first and tries to enter the critical section, it is not allowed to, because it is P0's turn. So P1 cannot enter the critical section even though no other process is in it. Therefore progress is not fulfilled.
The problem here is that this totally depends on lower-level process scheduling. The OS usually takes a while to wake up a sleeping process, and this happens at a point when the process currently running on the CPU voluntarily gives up control by executing some blocking system call, or on a timer interrupt when its time quantum expires. On a full SMP system this also takes some non-trivial in-kernel synchronization and signaling.
This means that process 0 can just loop leaving and entering critical section again without process 1 ever having a chance to run.
Also, I hope you are not relying on bare integer variables for mutual exclusion. These might be cached in a register by the compiler, and if not, processor caches come into play. This is supposed to be done with special CPU instructions like test-and-set.
I'm writing a state machine which controls data flow from a chip by setting and reading read/write enables. My clock is running at 27 MHz giving a period of 37 ns. However the specification for the chip I'm communicating with requires I hold my 'read request' signal for at least 50 ns. Of course this isn't possible to do in one cycle since my period is 37 ns.
I have considered I could create an additional state which does nothing but flag the next state to be the one I actually complete the read on, hence adding another period delay (meaning I hold 'read request' for 74 ns), but this doesn't sound like good practice.
The other option is perhaps to use a counter, but I wonder if there's perhaps yet another option I haven't visited yet?
How should one implement delay in a state machine when a state should last longer than one clock period?
Thanks!
(T1 must be greater than 50 ns)
Please see here for the full datasheet.
Delays are only reliably doable using the clock - adding an extra "tick" either via an extra state or using a counter in the existing state is perfectly acceptable to my mind. The counter has the possibility of being more flexible if you re-use the same state machine with a slower external chip (or if you use a different clock frequency to feed the FPGA) - you can just change the maximum count, instead of adding multiple "wait" states to the state machine.
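The maximum count follows directly from the clock frequency and the required hold time. Here is a minimal Python sketch of that calculation and of the resulting read-request waveform. The function names are made up for illustration, and the 27 MHz and 50 ns figures come from the question.

```python
def hold_cycles(min_hold_ns, clock_hz):
    """Smallest whole number of clock periods covering min_hold_ns."""
    # Integer ceiling division avoids floating-point rounding.
    return -(-min_hold_ns * clock_hz // 1_000_000_000)

def read_request_waveform(cycles_needed, total_cycles):
    """Per-cycle read-request level for a counter-based hold state."""
    count, level = 0, []
    for _ in range(total_cycles):
        asserting = count < cycles_needed
        level.append(1 if asserting else 0)
        if asserting:
            count += 1             # counter advances once per clock tick
    return level

cycles = hold_cycles(50, 27_000_000)
print(cycles)                            # 2
print(read_request_waveform(cycles, 4))  # [1, 1, 0, 0]
```

With a slower external chip or a different clock, only the arguments to hold_cycles change, which is the flexibility the counter approach offers over hard-coded wait states. If the required time happens to be an exact multiple of the clock period and the datasheet demands strictly more, add one cycle.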