Configurable processor implemented on FPGA board - processor

For a university mid-term project I have to design a configurable processor, to write the code in VHDL and then synthesize it on a Spartan 3E FPGA board from Digilent. I'm a beginner so could you point me to some information about configurable processors, to some ideas related to the concept?

You can check out my answer for a related question. We did nearly the same, building a CPU in VHDL for a FPGA board.

This is just a mockup so please be aware i will clean it up
fetch instruct1,
fetch instruct2, fetch datas1
fetch instruct3, fetch datas2, process datas1
fetch instruct4, fetch datas3, process datas2, store1 datas1
fetch instruct5, fetch datas4, process datas3, store2 datas1
fetch instruct6, fetch datas5, process datas4, store3 datas1
fetch instruct7, fetch datas6, process datas5, store4 datas1
fetch instruct8, fetch datas7, process datas6, store5 datas1
basically these are the main components of a processor
Part 1
ALU: Arithmetic Logic Unit (this is where draeing would come in handy)
AN ALU has 2 input ports and an output port. THe 2 input ports get operated on and the result outputed. To know which instruction the ALU has to accomplish there is a control Port.
Basically this is the name of the command. So if the control port has 4bits there are 16 possible instructions.
Part 2
REGISTER UNIT: This is a set of memory cells (cache memory). The contents of this memory is often transferred to the ALU's input ports.
Part3
Control Unit: This is sort of like the orchestra master of the cpu. Its job is to
1send the data to the ALU input
2Readwhich instruction needs to happen in the Instruction Registers,send those codes to the ALU control ports
Interface. This is how the RAM and other peripherals communicate with the cpu
Everytime the intruction outputs a result it has to be stored. It can be stored in RAM so a ram write must be ready once the result is ready. At the same time, a RAM read of the next instruction's inputs can occurs.And also at the same time, the next next instruction can be fetched from RAM.
Generating 1 instruction usually requires more than 1 clock cycle. Processing an in struction is analogeous to industrial production. So chain work is done.
VLIW The programming we write is linear, meaning instructions happening one after the other. But CPUs today (not ARMs though) have multiple ALU's so multiple instructions are processed at the same time.
So you have processing units chain working multiple instructions at the same time (pipeline)
and you have alot of those units (superscalar)
It then becomes a question of what can/need to do to taylor your cpu architecture.

I did a similar project, implementing a processor with a 5-stage pipeline in VHDL.
First things first, you have to understand the architecture of how processors work. Without understand what each stange is doing and what kind of control signals you need you've got no hope of actually writing one in VHDL.
Secondly, start drawing diagrams of how instructions and data will flow through your processor (i.e. through each stage). How is each stage hooked up to each other? Where do the control signals go? Where do my inputs come from and where do my outputs go?
Once you have a solid diagram, the actual implementation in VHDL should be relatively straightforward. You can use VHDL's behaviorial modelling to essetially explain exactly what you see in the diagram.

Related

Space efficient data bus implementations [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 7 years ago.
Improve this question
I'm writing a microcontroller in VHDL and have essentially got a core for my actual microcontroller section down. I'm now getting to the point however of starting to include memory mapped peripherals. I'm using a very simple bus consisting of a single master (the CPU) and multiple slaves (the peripherals/RAM). My bus works through an acknowledge CPU->perip and acknowledge perip->CPU. The CPU also has separate input and output data buses to avoid tristates.
I've chosen this method as I wish to have the ability for peripherals to stall the CPU. A bus transaction is achieved by: The master places the data, address and read/write bit on the bus, bringing the ack(c->p) high. Once the slave has successfully received the information and has placed the response back on the data (p->c) bus, the slave sets its ack(p->c) high. The master notes the slave has successfully placed the data, takes the data for processing and releases the ack(c->p). The bus is now in idle state again, ready for further transactions.
Obviously this is a very simple bus protocol and doesn't include burst features, variable word sizes or other more complex features. My question however is what space efficient methods can be used to connect peripherals to a master CPU?
I've looked into 3 different methods as of yet. I'm currently using a single output data bus from the master to all of the peripherals, with the data outputs from all the peripherals being or'd, along with their ack(p->c) outputs. Each peripheral contains a small address mux which only allows a slave to respond if the address is within a predefined range. This reduces the logic for switching between peripherals but obviously will infer lots of logic/peripheral for the address muxes which leads me to believe that future scalability will be impacted.
Another method I though of was having a single large address mux connected from the master which decodes the address and sends it, along with the data and ack signals to each slave. The output data is then muxed back into the master. This seems a slightly more efficient method though I always seem to end up with ridiculously long data vectors and its a bit of a chore to keep track of.
A third method I thought of was to have it arranged in a ring like fashion. The master address goes to all of the slaves, with a smaller mux which merely chooses which ack signals to send out. The data output from the master then travels serially through each slave. Each slave contains a mux which can allow it to either let the data coming into it pass through unaffected OR to allow the slave to place its own data on the bus. I feel this will work best for slow systems as there is only one small mux/slave required to mux between the incoming data and that slave's data, along with a small mux that decodes the address and sends out the ack signals. The issue here I believe however is that with lots of peripherals, the propagation delay from the output of the master to the input of the master would be pretty large as it has to travel through each slave!
Could anybody give me suitable reasoning for the different methods? I'm using Quartus to synthesize and route for an Altera EP4CE10E22C8 FPGA and I'm looking for the smallest implementation with regards to FPGA LUTs. My system uses a 16bit address and data bus. I'm looking to achieve at minimum ~50MHz under ideal memory conditions (i.e no wait states) and would be looking to have around 12 slaves, each with between 8 and 16bits of addressable space.
Thanks!
I suggest that you download the AMBA specification from the ARM web site (http://www.arm.com/) and look at the AXI4-lite bus or the much older APB bus. In most bus standards with a single master there is no multiplexer on the addresses, only an address decoder that drives the peripheral selection signals. It is only the response data from the slaves that are multiplexed to the master, thanks to the "response valid" signals from the slaves. It is scalable if you pipeline it when the number of slaves increases and you cannot reach your target clock frequency any more. The hardware cost is mainly due to the read data multiplexing, that is, a N-bits P-to-one multiplexer.
This is almost your second option.
The first option is a variant of the second where read data multiplexers are replaced by or gates. I do not think it will change much the hardware cost: or gates are less complex than multiplexers but each slave will now have to zero its read data bus, which adds as many and gates. A good point is, maybe, a reduced activity and thus a lower power consumption: slaves that are not accessed by the master will keep their read data bus low. But as you synthesize all this with a logic synthesizer and place and route it with a CAD tool, I am almost sure that you will end up with the same results (area, power, frequency) as for the more classical second option.
Your third option reminds me the principles of the daisy chain or the token ring. But as you want to avoid 3-states I doubt that it will bring any benefit in terms of hardware cost. If you pipeline it correctly (each slave samples the incoming master requests and processes them or passes them to the next) you will probably reach higher clock frequencies than with the classical bus, especially with a large number of slaves, but as, in average, a complete transaction will take more clock cycles, you will not improve the performance neither.
For really small (but slow) interconnection networks you could also have a look at the Serial Peripheral Interface (SPI) protocols. This is what they are made for: drive several slaves from a single master with few wires.
Considering your target hardware (Altera Cyclone IV), your target clock frequency (50MHz) and your other specifications I would first try the classical bus. The address decoder will produce one select signal for each of your 12 slaves, based on the 8 most significant bits of your 16-bits address bus. The cost will be negligible. Apart these individual select signals, all slaves will receive all other signals (address bus, write data bus, read enable, write enable(s)). The 16-bits read data bus of your master will be the output of a 16-bits 12-to-1 multiplexer that selects one slave response among 12. This will be the part that consumes most of the resources of your interconnect. But it should be OK and run at 50 MHz without problem... if you avoid combinatorial paths between master requests and salve responses.
A good starter is the WISHBONE SoC Interconnect from OpenCores.org. The classic read and write cycles are easy to implement. Beyond that, also burst transfers are specified for high throughput and much more. The website also hosts a lot of WISHBONE compatible projects providing a wide range of I/O devices.
And last but not least, the WISHBONE standard is in the public domain.

Simultaneous reading and writing to registers

I'm planning to design a MIPS-like CPU in VHDL on a FPGA. The CPU will have a classic five stage pipeline without forwarding and hazard prevention. In the computer architecture course I learned that the first MIPS-CPUs used to read from the register file on rising clock edge and write on falling clock edge. The FPGA I'm using doesn't support using rising and falling clock edge at the same time (regarding reading and writing to registers), so I can't exactly do like the original MIPS and have to do it all on rising clock edge.
So, here comes the part where I'm having a problem. The instruction writes back to the register in the write back stage. The write back stage sends the data directly to the decode stage. Another instruction in the decode stage wants to read the same register that also the write back stage wants to write.
What happens in this case? Does the decode stage take the new value for the instruction or the old value that is still in the register file?
A register file that fits in the decode stage of the classic five stage design consists of a triple port RAM (or two dual port RAM) and two muxers and comparators. The comparators and muxers are required to bypass the data coming from the write-back stage. This is needed as the write data is written into the triple port RAM in the next cycle. Because the signals coming from the write-back stage are synchronous, this is not a problem.
The question is what do you understand by the term "register". Or more specifically, how do you would like to map the register bank to the FPGA.
The easiest but not so efficient way is to map each MIPS register to several flip-flops according to the register size. You can update these flip-flops at only clock-edge (e.g. falling edge). After that you can read the new content at any time also known as asynchronous read. This solution is not so efficient because the multiplexer to select one MIPS register from the register bank requires a lot of logic resources.
If you have an FPGA where the LUTs can be used as distributed memory, then almost all of the logic resources for the multiplexers can be saved. Distributed memory typically provides an asynchronous read too (and a synchronous write of course). Please read the vender documentation of the synthesis tool on how to describe this type of memory for synthesis.
Last but not least, you can map the full register bank to on-chip block memory. These typically provide only a synchronous read, i.e., reading starts at a clock-edge. (Of course, they also provide only a synchronous write). However, these are typically dual-ported RAMs. Thus, you can write at the falling edge at one port and read with the rising at on the other port. Please read, the documentation of your FPGA on the timing of the write. For example, on some Altera FPGAs the writing is delayed to the next opposite edge (here rising-edge) of the clock.

Bus protocol for a microcontroller in VHDL

I am designing a microcontroller in VHDL. I am at the point where I understand the role of each component (ALU/Memory...), and some ideas on how to realise them. I basically want to implement a Von Neumann architecture.
But here is what I don't get : how do the components communicate ? I don't know how to design my bus (buses?). I am therefore looking for a simple bus implementation and protocol.
My unresolved questions :
Is it simpler to have one bus for everything or to separate the different kind of data ?
How does each component knows when to "listen" and when to "write" ?
The emphasis is on the simplicity of the design (and thus of the implementation). I do not care about speed. I want to do everything from scratch (ie. no pre-made softcore).
I don't know if this is of importance at this stage, but it will not need to run "real" compiled code, is have any kind of compatibility with anything existing. Also, at which point do I begin to think about my 'assembly' instructions ? I thinks that I will load them directly in the memory.
Thank you for your help.
EDIT :
I ended up drawing (a lot of) inspiration from the Picoblaze, because it is :
simple to understand
under a BSD Licence
Specifically, I started by adding a few instructions to it.
Since your main concern seems to be learning about microcontroller design, a good approach could be taking a look into some of the earlier microprocessor models. Take for instance the Z80:
Source: http://landley.net/history/mirror/cpm/z80.html
Another good Z80 HW description: http://www.msxarchive.nl/pub/msx/mirrors/msx2.com/zaks/z80prg02.htm
To answer your first question (single vs. multiple buses), this chip uses a single bus for everything, and it has a very simple design. You could probably use something similar. To make the terminology clear, a single system bus may be composed of sub-buses (and they are also called buses). The figure shows a system bus composed of a bidirection data bus (8-bit wide) and an address bus (16-bit wide).
To answer your second question (how do components know when they are active),
in the image above you see two distinct signals, memory request and I/O request. Only one will be active at a time, and when I/O request is active, that's when a peripheral could potentially be accessed.
If you don't have many peripherals, you don't need to use all 16 address lines (some Z80's have an 8-bit I/O space). Each peripheral would be accessed through some addresses in this space. For instance, in a very simple system:
a timer peripheral could use addresses from 00h to 03h
a uart could addresses from 08h to 0Fh
In this simple example, you need to provide two circuits: one would detect when the address is within the range 00-03h, and another would do the same for 08-0Fh. If you do a logic "and" between the output of each detector and the I/O request signal, then you would have two signals indicating when each of the peripherals is being accessed. Your peripheral hardware should primarily listen to this signal.
Finally, regarding your question about instructions, the dataflow inside your microprocessor would have several stages. This is usually called a processor's datapath. It is common to divide the stages into:
FETCH: read an instruction from program memory
DECODE: check specific bits within the instructions, and decide what type of instruction it is
EXECUTE: take the actions required by the instruction (e.g., ALU operations)
MEMORY: for some instructions, you need to do a data read or write
WRITE BACK: update your CPU registers with new values affected by the instruction
Source: https://www.cs.umd.edu/class/fall2001/cmsc411/projects/DLX/proj.html
Most of your job of dealing with individual instructions would be done in the DECODE and EXECUTE stages. As for the datapath control, you will need a state machine that controls the sequence of operations through the 5 stages. This functional block is usually called a Control Unit. Here you have a few choices:
Your state machine could go throgh all stages sequentially, one at a time. An instruction would take several clock cycles to execute.
Similar as the choice above, but combining two or more stages in a single cycle if you want to make things simpler and faster.
Pipeline the execution of instructions. This can give a great speed boost, but maybe it's better left for later because things can get quite complex.
As for the implementation, I recommend keeping the functional blocks as separate entities, and make sure you write a testbench for each block. Your job will go faster if you write those testbenches.
As for the blocks, the Register File is pretty easy to code. The Instruction Decoder is also easy if you have a clear idea of your instruction layout and opcodes. And the ALU is also easy if you know the operations it needs to perform.
I would start by writing testbenches for the Instruction Decoder and the Register File. Then I would write a script that runs all the testbenches and checks their results automatically. Only then I would focus on the implementation of the functional blocks themselves.
Basically on-chip busses will use parallel busses for address and data input and output. Usually there will be some kind of arbiter which decides which component is allowed to write to the bus. So a common approach is:
The component that wants to write will set a data line connected to the arbiter to high or low to signal that it wants to access the bus.
The arbiter decides who gets access to the bus
The arbiter sets the chip select of the component that should be allowed next to access the bus.
Usually your on chip bus will use a master/slave concept, so only masters have acting access to the bus. The slaves only wait for requests from the master.
I for one like the AMBA AHB/APB design but this might be a little over the top for your application. You can have a look at this book looking for ideas on how to implement your bus

Moving data between processes in Spartan 3

I have two processes A and B, each with its own clock input.
The clock frequencies are a little different, and therefore not synchronized.
Process A samples data from an IC, this data needs to be passed to process B, which then needs to write this data to another IC.
My current solution is using some simple handshake signals between process A and B.
The memory has been declared as distributed RAM (128Bytes as an array of std_logic_vector(7 downto 0)) inside process A (not block memory).
I'm using a Spartan 3AN from Xilinx and the ISE Webpack.
But is this the right way to do it?
I read somewhere that the Spartan 3 has dual-port block memory supporting two clocks, so would this be more correct?
The reason I'm asking, is because my design behaves unpredictable, and in cases like this I just hate magic. :-)
Except for very specific exceptional cases, the only correct way to move data between two independent clock domains is to use an asynchronous FIFO (also more correctly called a multi-rate FIFO).
In almost all FPGAs (including the Xilinx parts you are using), you can use FIFOs created by the vendor -- in Xilinx's case, you do this by generating yourself a FIFO using the CoreGen tool.
You can also construct such a FIFO yourself using a dual-port RAM and appropriate handshaking logic, but like most things, this is not something you ought to go reinvent on your own unless you have a very good reason to do so.
You also might consider whether your design really needs to have multiple clock domains. Sometimes it's absolutely necessary, but that's much, MUCH less often than most people just starting out believe. For instance, even if you need logic that runs at multiple rates, you can often handle this by using a single clock and appropriately generated synchronous clock enables.
The magic you are experiencing is most likely because either you haven't constrained your design correctly in synthesis, or you haven't done your handshaking properly. You have two options:
FIFO
Use a multirate FIFO as stated by wjl, which is a very common solution, works always (when done properly) and is huge in terms of resources. The big plus of this solution is that you don't have to take care about the actual clock domain crossing issues, and you'll get maximum bandwidth between the two domains. Never try to build up an asynchronous FIFO in VHDL because that won't work; even in VHDL there are some things that you simply can't do properly; use the appropriate generators from Xilinx, I think thats CoreGen
Handshaking
Have at least two registers for your data in the two domains and build a complete request/acknowledge handshaking logic, it won't work properly if you don't include those. Make sure that the handshaking logic is properly synchronized by adding at least two registers for the handshaking signals in the receiving domain, because otherwise you will most likely have unpredictable behaviour because of metastability issues.
For getting a "valid/ack" set of flags across clock domains, you may wish to look at the Flancter and here's an application of it
But in the general case, using a dual-clock FIFO is the order-of-the-day. Writing your own will be an interesting exercise, but validating across all the potential clock timing cases is a nightmare. This is one of the few places I'll instantiate a Coregen block.

Some doubts regarding diagram of Von Neumann Arcitechture

Well I cant understand the above diagram of Von Neumann architecture [Cited from wikipedia] and not even sure whether it is correct. Some obvious doubts that I have -
How can ALU communicate with Memory? Isn't that supposed to be CU's Job?
How is accumulator a part of ALU?
And, What is exactly the job of accumulator?
Judging from the diagram of the IAS computer (which should be pretty similar to EDVAC, the computer von Neumann wrote about) the control unit provides addresses (register MAR) and controls the bus transactions with signals such as AS, R/W*.
On the other hand, ALU is connected to the data bus (register MDR): it receives the data from memory and stores back the results. The diagram also shows that ALU receives the instructions and forwards them to the CU (register IBR).
For example, suppose the control unit just fetched the instruction ADD $1234. Then the processing proceeds as follows:
CU puts $1234 onto the address bus and initiates a read cycle
the operand is received by ALU (register MDR) and added with accumulator (register AC)
the result of addition is finally stored into the accumulator.
Answers to your questions:
ALU receives the data from memory, perform the operations and stores back the result. At the time all data were stored in memory (there were no general purpose registers), hence it was logical to put MDR in ALU, which means that ALU is supposed to be connected to the data bus.
The IAS computer was designed in a way that one ALU input and the ALU output are hardwired to the accumulator. Hence it was logical to place accumulator in ALU.
The accumulator is conceived as a place to store intermediate results, because having instructions with more than one memory operand was more difficult to implement.
Finally, I believe this discussion is purely historical. There is no particular reason to prefer associating the MDR with ALU rather than CU. It was just that von Neumann happened to think that way when he was writing a paper about EDVAC. To make the story complete, Wikipedia says that EDVAC was actually designed by Eckert and Mauchly, while von Neumann only did consulting and writing.
The accumulator is register where the result of arithmetic operation is stored temporarely. It is faster than directly using the main memory. Since it stores arithmetic results it makes sense to be part of the ALU.
The control unit is like a coordinator that tells to the other components to do this and that. But it does not provide the means how to do it so this is why the ALU need to communicate direclty with the memory.
Well, the ALU changes the flags register when does something, that's why it's connected with memory (flags aren't in the CU and in the ALU neither, and since these are the only components that are shown..). And the accumulator stores data temporarily waiting the ALU to process it. It's connected directly to the ALU because this register was thought to support it with its calculations, just like the ecx register is connected with counter circuits. Of course is possible to add ecx,edx but is slower. Choosing the source and the destination register is very difficult due to the extra circuits needed to implement in a CPU and it was archived recently (relatively). That image is quite old (ssegvic is right!) because it shows that input/output are possible only using the accumulator.
In my opinion this is more clear:
The ALU is connected at the internal bus, but this doesn't mean it will communicate with everything connected to it.
One last thing: looking for better images I noticed that the ALU isn't always connected with memory, in some of them it's connected only with the CU.

Resources