I have two processes A and B, each with its own clock input.
The clock frequencies are a little different, and therefore not synchronized.
Process A samples data from an IC. This data needs to be passed to process B, which then writes it to another IC.
My current solution is using some simple handshake signals between process A and B.
The memory has been declared as distributed RAM (128 bytes, as an array of std_logic_vector(7 downto 0)) inside process A (not block memory).
I'm using a Spartan 3AN from Xilinx and the ISE Webpack.
But is this the right way to do it?
I read somewhere that the Spartan 3 has dual-port block memory supporting two clocks, so would this be more correct?
The reason I'm asking is that my design behaves unpredictably, and in cases like this I just hate magic. :-)
Except for very specific exceptional cases, the only correct way to move data between two independent clock domains is to use an asynchronous FIFO (also more correctly called a multi-rate FIFO).
In almost all FPGAs (including the Xilinx parts you are using), you can use FIFOs created by the vendor -- in Xilinx's case, you do this by generating yourself a FIFO using the CoreGen tool.
You can also construct such a FIFO yourself using a dual-port RAM and appropriate handshaking logic, but like most things, this is not something you ought to go reinvent on your own unless you have a very good reason to do so.
You also might consider whether your design really needs to have multiple clock domains. Sometimes it's absolutely necessary, but that's much, MUCH less often than most people just starting out believe. For instance, even if you need logic that runs at multiple rates, you can often handle this by using a single clock and appropriately generated synchronous clock enables.
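As a rough illustration of the clock-enable idea, here is a minimal sketch. The entity name, the DIVIDE generic and the port names are made up for this example, and it assumes the slower rate is an integer fraction of the single clock:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: one clock, with the "slow" logic gated by a
-- one-cycle enable pulse instead of running from a second clock.
entity clock_enable_example is
  generic (
    DIVIDE : positive := 4  -- assumed ratio between clk and the slow rate
  );
  port (
    clk : in  std_logic;
    rst : in  std_logic;  -- synchronous, active high
    d   : in  std_logic_vector(7 downto 0);
    q   : out std_logic_vector(7 downto 0)
  );
end entity;

architecture rtl of clock_enable_example is
  signal count : natural range 0 to DIVIDE - 1 := 0;
  signal ce    : std_logic := '0';
begin
  -- Generate an enable that is high for one clk cycle out of every DIVIDE.
  enable_gen : process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        count <= 0;
        ce    <= '0';
      elsif count = DIVIDE - 1 then
        count <= 0;
        ce    <= '1';
      else
        count <= count + 1;
        ce    <= '0';
      end if;
    end if;
  end process;

  -- "Slow" logic: still clocked by clk, but it only updates when ce = '1'.
  slow_logic : process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        q <= (others => '0');
      elsif ce = '1' then
        q <= d;
      end if;
    end if;
  end process;
end architecture;
```

Because everything is clocked by clk, there is no clock domain crossing between the two processes.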
The magic you are experiencing is most likely because either you haven't constrained your design correctly in synthesis, or you haven't done your handshaking properly. You have two options:
FIFO
Use a multirate FIFO as stated by wjl. This is a very common solution that always works (when done properly), but it is large in terms of resources. The big plus of this solution is that you don't have to take care of the actual clock domain crossing issues, and you'll get maximum bandwidth between the two domains. Never try to build an asynchronous FIFO yourself in VHDL, because that won't work; even in VHDL there are some things that you simply can't do properly. Use the appropriate generators from Xilinx; I think that's CoreGen.
Handshaking
Have at least two registers for your data in the two domains and build a complete request/acknowledge handshaking logic; it won't work properly if you don't include those. Make sure that the handshaking logic is properly synchronized by adding at least two registers for the handshaking signals in the receiving domain; otherwise you will most likely see unpredictable behaviour because of metastability issues.
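The "at least two registers in the receiving domain" part could look roughly like the sketch below. The entity and port names are invented for illustration, and it assumes the incoming signal is a single bit driven directly from a register in the other domain:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical two-flip-flop synchronizer for ONE handshake bit
-- (for example a request or an acknowledge flag).
entity sync_2ff is
  port (
    clk_dst  : in  std_logic;  -- clock of the receiving domain
    async_in : in  std_logic;  -- flag coming from the other clock domain
    sync_out : out std_logic   -- flag that is safe to use with clk_dst
  );
end entity;

architecture rtl of sync_2ff is
  signal meta   : std_logic := '0';  -- first stage, may go metastable
  signal stable : std_logic := '0';  -- second stage, used by the logic
begin
  process (clk_dst)
  begin
    if rising_edge(clk_dst) then
      meta   <= async_in;
      stable <= meta;
    end if;
  end process;

  sync_out <= stable;
end architecture;
```

This is only suitable for single-bit flags. A multi-bit data word should not be sent through parallel synchronizers like this; it is held stable in a register while the request/acknowledge flags go through synchronizers.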
For getting a "valid/ack" set of flags across clock domains, you may wish to look at the Flancter and here's an application of it
But in the general case, using a dual-clock FIFO is the order-of-the-day. Writing your own will be an interesting exercise, but validating across all the potential clock timing cases is a nightmare. This is one of the few places I'll instantiate a Coregen block.
I am designing a triple modular redundancy (TMR) processor system to synthesize on an Altera DE10-Lite FPGA board. Its purpose is to demonstrate reliability of computation in the presence of various faults. I need advice on how to connect three external crystal oscillators (instead of the on-board crystal), with the same ratings, to drive the three processors inside the FPGA. I will be using a synchronization voting scheme to sync all three signals. Can this task be done?
Clock distribution triplication
I have read the following relevant links that describe using PLLs. Is this the correct way?
https://www.altera.com/documentation/mcn1395213337540.html#mcn1395213788377
No, that's unlikely to work.
If you run each soft CPU with a separate crystal, they will drift out of synchronization due to slight variations in frequency between the crystals.
If you try to use a majority voting scheme to create a single clock signal from three input clocks, you'll end up with a very weird, irregular clock signal which will probably cause faults in the logic driven by it.
Use one clock source at a time. If you're convinced you need to resist failures of an external clock, consider implementing some way to detect a failure in the current clock and switch to another one. (Keep in mind that this logic will need to still work without a functional clock… which may be difficult.)
I am working on a project in VHDL which includes multiplying matrices. I would like to be able to load data from a PC into arrays on the FPGA using a UART. I am only taking my first bigger steps in VHDL and I am not sure if I am taking the right approach.
I wanted to declare an array of integer signals, and then implement a UART to receive data from the PC and load it into those signals. However, I can't use a for-loop for that, as it will be synthesised to load the data in parallel (which is impossible, because the values will be coming from the PC one after another over the serial port). And because the matrices may be of various sizes, in order to assign the signals one by one I would need to write lots of specific code (which looks like bad practice to me).
Is the idea to use an array of signals and load data to those signals through UART realizable? And if my approach is entirely wrong, how could I achieve that?
What you want is doable but you will probably need to design a kind of hardware monitor to act as an intermediate between your UART and your storage (your array of integer signals). This hardware monitor will interpret commands coming from the UART and perform read/write operations in your storage. It will have one interface with the storage and another with the UART. You will have to define a kind of protocol with a syntax for your commands and of sequences of operations for each command.
Example: the monitor waits for commands coming from the UART. The first received character indicates whether it is a read (0) or a write (1). The next four characters are the target address, least significant byte first. If the command is a read, the monitor reads the data at the specified address in your storage and sends it to the UART, one byte at a time, least significant byte first. If the command is a write, the address is followed by the data to write at the specified address, least significant byte first; your monitor waits until the data is received and then writes it to your storage.
Optionally, the monitor could send an exit status byte at the end of each command to indicate potential errors (protocol errors, unmapped addresses, write attempts in read-only regions...)
Of course, depending on the characteristics of your application, you will probably define a completely different protocol, simpler or more complex, but the principle will be the same.
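To make the example protocol a bit more concrete, here is one possible sketch of such a monitor as a single state machine. Everything about the interfaces is an assumption: the UART is taken to be a separate block with a byte-wide valid/ready style interface, the storage is taken to use 32-bit words, the command byte is taken to be the value 0 or 1 (only bit 0 is checked), and the optional exit status byte is left out.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch of the monitor described above. All names are hypothetical.
entity uart_monitor is
  port (
    clk       : in  std_logic;
    rst       : in  std_logic;  -- synchronous, active high
    -- Assumed receiver interface: rx_valid is high for one cycle per byte.
    rx_valid  : in  std_logic;
    rx_data   : in  std_logic_vector(7 downto 0);
    -- Assumed transmitter interface: a byte is taken when tx_valid and
    -- tx_ready are both high; tx_ready goes low while a byte is sent.
    tx_ready  : in  std_logic;
    tx_valid  : out std_logic;
    tx_data   : out std_logic_vector(7 downto 0);
    -- Assumed storage interface with 32-bit words.
    mem_addr  : out std_logic_vector(31 downto 0);
    mem_we    : out std_logic;
    mem_wdata : out std_logic_vector(31 downto 0);
    mem_rdata : in  std_logic_vector(31 downto 0)
  );
end entity;

architecture rtl of uart_monitor is
  type state_t is (ST_CMD, ST_ADDR, ST_WDATA, ST_WRITE,
                   ST_RD_WAIT, ST_RD_CAPTURE, ST_RD_SEND);
  signal state    : state_t := ST_CMD;
  signal is_write : std_logic := '0';
  signal count    : unsigned(1 downto 0) := "00";
  signal addr     : std_logic_vector(31 downto 0) := (others => '0');
  signal data     : std_logic_vector(31 downto 0) := (others => '0');
begin
  mem_addr  <= addr;
  mem_wdata <= data;
  mem_we    <= '1' when state = ST_WRITE else '0';
  tx_data   <= data(7 downto 0);
  tx_valid  <= '1' when state = ST_RD_SEND else '0';

  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        state <= ST_CMD;
        count <= "00";
      else
        case state is
          when ST_CMD =>              -- first byte: 0 = read, 1 = write
            if rx_valid = '1' then
              is_write <= rx_data(0);
              state    <= ST_ADDR;
            end if;
          when ST_ADDR =>             -- four address bytes, LSB first
            if rx_valid = '1' then
              addr  <= rx_data & addr(31 downto 8);
              count <= count + 1;     -- wraps to 0 after the fourth byte
              if count = 3 then
                if is_write = '1' then
                  state <= ST_WDATA;
                else
                  state <= ST_RD_WAIT;
                end if;
              end if;
            end if;
          when ST_WDATA =>            -- four data bytes, LSB first
            if rx_valid = '1' then
              data  <= rx_data & data(31 downto 8);
              count <= count + 1;
              if count = 3 then
                state <= ST_WRITE;
              end if;
            end if;
          when ST_WRITE =>            -- mem_we is high during this cycle
            state <= ST_CMD;
          when ST_RD_WAIT =>          -- address is presented; wait a cycle
            state <= ST_RD_CAPTURE;
          when ST_RD_CAPTURE =>
            data  <= mem_rdata;
            state <= ST_RD_SEND;
          when ST_RD_SEND =>          -- four data bytes out, LSB first
            if tx_ready = '1' then
              data  <= x"00" & data(31 downto 8);
              count <= count + 1;
              if count = 3 then
                state <= ST_CMD;
              end if;
            end if;
        end case;
      end if;
    end if;
  end process;
end architecture;
```

Note that the bytes are stored one at a time as they arrive, so no for-loop over the array is involved; the address in the command selects which element is written.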
All this is usually implemented in software and runs on a CPU that has the UART as peripheral and the storage in its memory space. But if you do not have a CPU...
Warning: this is quite complex. The UART itself is quite complex. Not sure you should start with this if you are a VHDL beginner.
Your approach is not entirely wrong, but you have a software-oriented way of expressing it, which indicates you are missing some fundamentals. People with strong software backgrounds tend to think in terms of the programming language and not in terms of the actual FPGA-specific structures they want to achieve. It is important to unlearn this if you want to be successful in designing for FPGAs.
Based on what I just wrote, you should consider in what type of FPGA structure you would like to store the data. The speed, resource and power requirements govern this choice. One suitable way to store the data would be in a single Block RAM or LUTRAM, or in an array of them. Both of these structures can be inferred by using a signal of an array type in the hardware description language, which is why I said you are not entirely off track. Consult the manual of your synthesis tool to find templates for how to infer these structures. An alternative is to use a vendor IP block or to instantiate a primitive directly, but both those methods are clumsier in my opinion.
Important parameters to consider are the total number of words you need to store, the size of a word and the number of read/write operations per clock cycle. For higher number of reads per cycle an array of memories must be used since most FPGA memories only support two reads per cycle.
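For reference, a memory described with an array-type signal typically looks something like the sketch below. The entity name, generics and port names are placeholders, and whether a given tool maps this to Block RAM or LUTRAM depends on the tool and its settings, so treat it as a starting point and compare it with the templates in your synthesis manual:

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical memory with one write port and one read port,
-- described with an array signal so the tool can infer a RAM.
entity inferred_ram is
  generic (
    ADDR_WIDTH : positive := 7;  -- 2**7 = 128 words
    DATA_WIDTH : positive := 8
  );
  port (
    clk   : in  std_logic;
    we    : in  std_logic;
    waddr : in  std_logic_vector(ADDR_WIDTH - 1 downto 0);
    wdata : in  std_logic_vector(DATA_WIDTH - 1 downto 0);
    raddr : in  std_logic_vector(ADDR_WIDTH - 1 downto 0);
    rdata : out std_logic_vector(DATA_WIDTH - 1 downto 0)
  );
end entity;

architecture rtl of inferred_ram is
  type ram_t is array (0 to 2**ADDR_WIDTH - 1)
    of std_logic_vector(DATA_WIDTH - 1 downto 0);
  signal ram : ram_t := (others => (others => '0'));
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- One write per clock cycle.
      if we = '1' then
        ram(to_integer(unsigned(waddr))) <= wdata;
      end if;
      -- Registered (synchronous) read: one read per clock cycle.
      rdata <= ram(to_integer(unsigned(raddr)));
    end if;
  end process;
end architecture;
```

The registered read is the style that Block RAM needs; an unregistered read of the array generally steers the tool towards LUTRAM instead.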
Can there be any unwanted effects by having several processes with the same sensitivity list in one architecture?
I have several processes that happen in parallel in an architecture, one process for reading input as a slave from a master which writes the input, one for writing output back to the master when the master asks for and one calculation. All the processes are clocked processes and their sensitivity list contains only the reset and clock signals.
Each process writes into its own signals, which the other processes may read from, i.e. there isn't an instance where two processes write into the same signal.
It's possible to implement everything in one single big process, but it'll be more cumbersome.
Can there be any adverse effects with such an implementation?
Are there any reasons to favor the less elegant one big process over several smaller ones?
Not at all. Most of my processes have the same sensitivity list : (reset, clock) and this is not unusual.
If a single big process really is less elegant, and an implementation using multiple smaller processes is genuinely clearer and easier to understand, then go ahead and design that way.
I tend towards fewer, larger processes in my designs, but lay them out in discrete sections to make it easier to separate functionality within a process, but that's not a concrete rule : I wouldn't implement several independent state machines in a single process for example.
What I would advise against is two styles often seen :
The 2-process (or 3-process) state machine, where one process is purely combinational, with a complex sensitivity list that's difficult to get right. A single-process state machine is simpler, shorter, and at least as easy to understand.
A huge number of tiny processes, each controlling one or two signals which are inputs to other tiny processes, so that you can spend all day tracing signals from process to process without finding what does the actual work!
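For comparison, a single-process state machine in the (reset, clock) style could look like this small sketch. The entity, the states and the start/done/busy signals are invented purely for illustration:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: state register, next-state logic and the
-- registered output all live in one clocked process.
entity single_process_fsm is
  port (
    clk   : in  std_logic;
    rst   : in  std_logic;  -- asynchronous, active high
    start : in  std_logic;
    done  : in  std_logic;
    busy  : out std_logic
  );
end entity;

architecture rtl of single_process_fsm is
  type state_t is (IDLE, RUN, FINISH);
  signal state : state_t := IDLE;
begin
  process (rst, clk)
  begin
    if rst = '1' then
      state <= IDLE;
      busy  <= '0';
    elsif rising_edge(clk) then
      case state is
        when IDLE =>
          if start = '1' then
            busy  <= '1';
            state <= RUN;
          end if;
        when RUN =>
          if done = '1' then
            state <= FINISH;
          end if;
        when FINISH =>
          busy  <= '0';
          state <= IDLE;
      end case;
    end if;
  end process;
end architecture;
```

The sensitivity list only ever contains the reset and the clock, so there is nothing to forget when the logic changes.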
If I understand your description you have something like 3 blocks : receiver, data processor, transmitter; and this sounds like a good separation of functions to me.
For example you can more easily replace the receiver or transmitter, or re-use them with a different data processor, if they are separate processes (or even separate entities).
I am designing a microcontroller in VHDL. I am at the point where I understand the role of each component (ALU/Memory...), and some ideas on how to realise them. I basically want to implement a Von Neumann architecture.
But here is what I don't get : how do the components communicate ? I don't know how to design my bus (buses?). I am therefore looking for a simple bus implementation and protocol.
My unresolved questions :
Is it simpler to have one bus for everything or to separate the different kind of data ?
How does each component knows when to "listen" and when to "write" ?
The emphasis is on the simplicity of the design (and thus of the implementation). I do not care about speed. I want to do everything from scratch (ie. no pre-made softcore).
I don't know if this is of importance at this stage, but it will not need to run "real" compiled code, or have any kind of compatibility with anything existing. Also, at which point do I begin to think about my 'assembly' instructions? I think that I will load them directly into the memory.
Thank you for your help.
EDIT :
I ended up drawing (a lot of) inspiration from the Picoblaze, because it is :
simple to understand
under a BSD Licence
Specifically, I started by adding a few instructions to it.
Since your main concern seems to be learning about microcontroller design, a good approach could be taking a look into some of the earlier microprocessor models. Take for instance the Z80:
Source: http://landley.net/history/mirror/cpm/z80.html
Another good Z80 HW description: http://www.msxarchive.nl/pub/msx/mirrors/msx2.com/zaks/z80prg02.htm
To answer your first question (single vs. multiple buses), this chip uses a single bus for everything, and it has a very simple design. You could probably use something similar. To make the terminology clear, a single system bus may be composed of sub-buses (and they are also called buses). The figure shows a system bus composed of a bidirectional data bus (8 bits wide) and an address bus (16 bits wide).
To answer your second question (how do components know when they are active),
in the image above you see two distinct signals, memory request and I/O request. Only one will be active at a time, and when I/O request is active, that's when a peripheral could potentially be accessed.
If you don't have many peripherals, you don't need to use all 16 address lines (some Z80's have an 8-bit I/O space). Each peripheral would be accessed through some addresses in this space. For instance, in a very simple system:
a timer peripheral could use addresses from 00h to 03h
a UART could use addresses from 08h to 0Fh
In this simple example, you need to provide two circuits: one would detect when the address is within the range 00-03h, and another would do the same for 08-0Fh. If you do a logic "and" between the output of each detector and the I/O request signal, then you would have two signals indicating when each of the peripherals is being accessed. Your peripheral hardware should primarily listen to this signal.
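The two detectors and the "and" with the I/O request signal can be sketched in a few lines. The entity and port names are made up, an 8-bit I/O address is assumed, and the request is treated as active high for readability (on a real Z80 the IORQ pin is active low):

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical address decoder for the two example peripherals.
entity io_decoder is
  port (
    addr     : in  std_logic_vector(7 downto 0);  -- 8-bit I/O address
    iorq     : in  std_logic;   -- I/O request, active high in this sketch
    timer_cs : out std_logic;   -- high when the timer is being accessed
    uart_cs  : out std_logic    -- high when the UART is being accessed
  );
end entity;

architecture rtl of io_decoder is
begin
  -- 00h to 03h: the upper six address bits are all zero.
  timer_cs <= '1' when iorq = '1' and addr(7 downto 2) = "000000" else '0';

  -- 08h to 0Fh: the upper five address bits are "00001".
  uart_cs  <= '1' when iorq = '1' and addr(7 downto 3) = "00001" else '0';
end architecture;
```

Choosing ranges that are aligned to a power of two keeps each detector down to a comparison of the upper address bits.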
Finally, regarding your question about instructions, the dataflow inside your microprocessor would have several stages. This is usually called a processor's datapath. It is common to divide the stages into:
FETCH: read an instruction from program memory
DECODE: check specific bits within the instructions, and decide what type of instruction it is
EXECUTE: take the actions required by the instruction (e.g., ALU operations)
MEMORY: for some instructions, you need to do a data read or write
WRITE BACK: update your CPU registers with new values affected by the instruction
Source: https://www.cs.umd.edu/class/fall2001/cmsc411/projects/DLX/proj.html
Most of your job of dealing with individual instructions would be done in the DECODE and EXECUTE stages. As for the datapath control, you will need a state machine that controls the sequence of operations through the 5 stages. This functional block is usually called a Control Unit. Here you have a few choices:
Your state machine could go through all stages sequentially, one at a time. An instruction would take several clock cycles to execute.
Similar to the choice above, but combining two or more stages in a single cycle if you want to make things simpler and faster.
Pipeline the execution of instructions. This can give a great speed boost, but maybe it's better left for later because things can get quite complex.
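The first choice (one stage per clock cycle) could be sketched as below. The entity and the *_en output names are hypothetical, and it assumes every instruction passes through all five stages with each stage finishing in one cycle:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical multi-cycle control unit: steps through the five
-- stages in order and tells the datapath which one is active.
entity control_unit is
  port (
    clk          : in  std_logic;
    rst          : in  std_logic;  -- synchronous, active high
    fetch_en     : out std_logic;
    decode_en    : out std_logic;
    execute_en   : out std_logic;
    memory_en    : out std_logic;
    writeback_en : out std_logic
  );
end entity;

architecture rtl of control_unit is
  type stage_t is (FETCH, DECODE, EXECUTE, MEMORY, WRITE_BACK);
  signal stage : stage_t := FETCH;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        stage <= FETCH;
      else
        case stage is
          when FETCH      => stage <= DECODE;
          when DECODE     => stage <= EXECUTE;
          when EXECUTE    => stage <= MEMORY;
          when MEMORY     => stage <= WRITE_BACK;
          when WRITE_BACK => stage <= FETCH;
        end case;
      end if;
    end if;
  end process;

  -- Exactly one enable is high in any clock cycle.
  fetch_en     <= '1' when stage = FETCH      else '0';
  decode_en    <= '1' when stage = DECODE     else '0';
  execute_en   <= '1' when stage = EXECUTE    else '0';
  memory_en    <= '1' when stage = MEMORY     else '0';
  writeback_en <= '1' when stage = WRITE_BACK else '0';
end architecture;
```

A real control unit would also take the decoded instruction as an input, for example to skip the MEMORY stage for instructions that do not need it.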
As for the implementation, I recommend keeping the functional blocks as separate entities, and make sure you write a testbench for each block. Your job will go faster if you write those testbenches.
As for the blocks, the Register File is pretty easy to code. The Instruction Decoder is also easy if you have a clear idea of your instruction layout and opcodes. And the ALU is also easy if you know the operations it needs to perform.
I would start by writing testbenches for the Instruction Decoder and the Register File. Then I would write a script that runs all the testbenches and checks their results automatically. Only then I would focus on the implementation of the functional blocks themselves.
Basically on-chip busses will use parallel busses for address and data input and output. Usually there will be some kind of arbiter which decides which component is allowed to write to the bus. So a common approach is:
The component that wants to write will set a data line connected to the arbiter to high or low to signal that it wants to access the bus.
The arbiter decides who gets access to the bus
The arbiter sets the chip select of the component that should be allowed next to access the bus.
Usually your on-chip bus will use a master/slave concept, so only masters have active access to the bus. The slaves only wait for requests from the master.
I for one like the AMBA AHB/APB design but this might be a little over the top for your application. You can have a look at this book looking for ideas on how to implement your bus
I have designed an asynchronous asymmetric FIFO using VHDL constructs. It is a generic FIFO with depth and prog_full as parameters. It has a 32-bit input and a 16-bit output data width.
You can find the fifo design link here.
The top-level asymmetric FIFO (fifo_wrapper.vhd) is built upon a 32-bit asynchronous FIFO (async_fifo.vhd). This internal FIFO (async_fifo) is built using the logic from the generic FIFO on OpenCores (http://opencores.org/project,generic_fifos). I have added a simple testbench to try out this FIFO design.
BUT there is some issue with this design that I am not able to figure out. The FIFO design works perfectly fine when I simulate it, but when I synthesize it and run it along with my other design on hardware, I sometimes get erroneous data. Maybe there is some corner case that I am not able to simulate, or is it something else?
That's why I would like anyone who needs this design to try it and let me know if he/she encounters any issues during simulation or after synthesis.
thanks
PS: kindly let me know if there is some other forum where I can put my design for public use. thanks
There are a number of issues to point out in relation to this asynchronous FIFO
design, based on the assumption that the write and read clocks are fully
asynchronous.
A (and probably THE) major problem is that the write side pointer (wp in
async_fifo), which is a normal binary counter, is transferred and synchronized
to the read side clock without any Gray encoding. So the different bits in
the vector may arrive at different times in the read clock domain, thus the
write pointer value seen there can (and most likely will from time to time) be
different from the write side value. The comparison with the read pointer (rp)
will therefore make no sense. Binary values that are transferred across clock
domains should be Gray encoded before transfer and decoded at arrival. Also
use synchronization with two flip-flop levels.
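The encoding and decoding can be written as two small functions. This is a generic sketch and not code from the design under review; the package and function names are made up:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper package for Gray encoding of FIFO pointers.
package gray_pkg is
  function bin_to_gray (b : std_logic_vector) return std_logic_vector;
  function gray_to_bin (g : std_logic_vector) return std_logic_vector;
end package;

package body gray_pkg is

  -- Gray code = binary value xor (binary value shifted right by one).
  function bin_to_gray (b : std_logic_vector) return std_logic_vector is
    variable bv : std_logic_vector(b'length - 1 downto 0) := b;
    variable g  : std_logic_vector(b'length - 1 downto 0);
  begin
    g := bv xor ('0' & bv(bv'high downto 1));
    return g;
  end function;

  -- Decode from the top bit down: each binary bit is the xor of the
  -- next higher binary bit and the Gray bit at the same position.
  function gray_to_bin (g : std_logic_vector) return std_logic_vector is
    variable gv : std_logic_vector(g'length - 1 downto 0) := g;
    variable b  : std_logic_vector(g'length - 1 downto 0);
  begin
    b(b'high) := gv(gv'high);
    for i in b'high - 1 downto 0 loop
      b(i) := b(i + 1) xor gv(i);
    end loop;
    return b;
  end function;

end package body;
```

Consecutive counter values differ in exactly one bit after encoding, so a sample taken during a transition yields either the old or the new pointer value. For that to hold, the Gray value should come straight from a register in the source clock domain, with no combinational logic between that register and the first synchronizer flip-flop.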
The two clocks (rd_clk and wr_clk) are assumed to be asynchronous, but there
is only a single reset (rst), so timing may be violated when reset is
deasserted, unless there are some additional requirements for clocking at the
time of reset deassert.
A similar issue applies to clear, where there is only one signal for use in
two different clock domains.
A suggestion would be to use a port naming convention where the clock domain
relationship for the port is clearly indicated in the name, like naming all
the ports in the write clock domain wr_* (e.g. wr_clk_i, wr_clk_we_i, etc.),
and all the ports in the read clock domain rd_*.
Reset is asserted low, so a naming of rst_n would be nice.
I can't access your code (firewall), so I'll just mention the general points in designing them, which might be of help to you and others.
To be completely clock safe, the write side should pass its pointer to the read side using a fully safe asynchronous handshaking method with 2 meta-stability signalling chains.
The construct for this is a double-buffered register.
The write side registers its write pointer into a buffer, and asserts a valid signal high.
A meta-stability chain reclocks the buffer valid signal to the read clock domain
On the read clock side, once a transition to high of valid is seen at the output of the meta chain, the data in the write side buffer is reregistered onto another register on the read domain. This is ok, because it is known that the data in the buffer is stable. (because of the meta chain).
The read domain asserts an ack signal high.
Another meta-stability chain reclocks the ack signal to the write clock domain.
The write side awaits a transition of the ack signal at the output of the meta chain, once seen it deasserts its valid signal.
The read side awaits a transition of the valid signal at the output of the meta chain to low, once seen it deasserts its ack signal.
The write side awaits a transition of the ack signal at the output of the meta chain to low. The cycle is now complete.
The current write pointer, which may have moved on quite a bit now, may now be transferred again.
A similar approach is taken for transferring the read pointer to the write domain.
It can be seen that although this approach leads to a latency between the write pointer on the write side and its copy on the read side, and between the read pointer on the read side and its copy on the write side, this latency can never lead to overflow. Instead it leads to a premature full on the write side, and a premature empty on the read side, which will eventually resolve once the pointers are next exchanged.
This approach is the only completely clock safe design for a fifo that doesn't depend on a-priori knowledge of the clock speeds. Gray coding is not required at all.
The other thing to note is that the logic for addressing / empty / full etc. needs to be duplicated on each clock domain.
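A compact sketch of the pointer exchange steps listed above, for one direction only, might look like this. The entity, generic and signal names are invented, there is no reset (it relies on the initial values), and the FIFO logic around it is not shown:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical valid/ack exchange of a pointer from src_clk to dst_clk.
entity ptr_handshake is
  generic (WIDTH : positive := 4);
  port (
    src_clk : in  std_logic;
    src_ptr : in  std_logic_vector(WIDTH - 1 downto 0);
    dst_clk : in  std_logic;
    dst_ptr : out std_logic_vector(WIDTH - 1 downto 0)
  );
end entity;

architecture rtl of ptr_handshake is
  -- Source domain registers
  signal buf        : std_logic_vector(WIDTH - 1 downto 0) := (others => '0');
  signal valid      : std_logic := '0';
  signal ack_meta   : std_logic := '0';
  signal ack_sync   : std_logic := '0';
  -- Destination domain registers
  signal dst_reg    : std_logic_vector(WIDTH - 1 downto 0) := (others => '0');
  signal ack        : std_logic := '0';
  signal valid_meta : std_logic := '0';
  signal valid_sync : std_logic := '0';
begin
  src_side : process (src_clk)
  begin
    if rising_edge(src_clk) then
      -- Meta-stability chain for ack
      ack_meta <= ack;
      ack_sync <= ack_meta;
      if valid = '0' and ack_sync = '0' then
        buf   <= src_ptr;  -- snapshot the current pointer
        valid <= '1';
      elsif valid = '1' and ack_sync = '1' then
        valid <= '0';      -- snapshot was taken, release the handshake
      end if;
    end if;
  end process;

  dst_side : process (dst_clk)
  begin
    if rising_edge(dst_clk) then
      -- Meta-stability chain for valid
      valid_meta <= valid;
      valid_sync <= valid_meta;
      if valid_sync = '1' and ack = '0' then
        dst_reg <= buf;    -- buf has been stable since before valid rose
        ack     <= '1';
      elsif valid_sync = '0' and ack = '1' then
        ack <= '0';        -- cycle complete
      end if;
    end if;
  end process;

  dst_ptr <= dst_reg;
end architecture;
```

A second instance with the clocks swapped would carry the read pointer to the write domain. The path from buf to dst_reg still crosses clock domains, so it will probably need a suitable timing exception or maximum delay constraint in the tool.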