Is it necessary to register both inputs and outputs of every hardware core?

I am aware of the need to synchronize all inputs to an FPGA before using those inputs in order to avoid metastability. I'm also aware of the need to synchronize signals that cross clock domains within a single FPGA. This question isn't about crossing clock domains.
My question is whether it is a good idea to routinely register all of the inputs and outputs of every internal hardware module in an FPGA design. The rationale is that we want to break up long chains of combinational logic in order to improve the clock rate so that we can meet the timing constraints for a chosen clock rate. This will add additional cycles of latency proportional to the number of modules that a signal must cross. Is this a good idea or a bad idea? Should one register only inputs and not outputs?
Answer Summary
Rule of thumb: register all outputs of internal FPGA cores; no need to register inputs. If an output already comes from a register, such as the state register of a state machine, then there is no need to register again.

It is difficult to give a hard and fast rule. It really depends on many factors.
It could:
Increase Fmax by breaking up combinatorial paths
Make place and route easier by allowing the tools to spread logic out in the part
Make partitioning your design easier, allowing for partial rebuilds.
It will not magically solve critical path timing issues. If there is a critical path inside one of your major "blocks", then it will still remain your critical path.
Additionally, you may encounter more problems, depending on how full your design is on the target part.
These things said, I lean to the side of registering outputs only.

Registering all of the inputs and outputs of every internal hardware module in an FPGA design is a bit of overkill. If an output register feeds an input register with no logic between them, then 2x the required registers are consumed. Unless, of course, you're doing logic path balancing.
Registering only inputs and not outputs of every internal hardware module in an FPGA design is a conservative design approach. If the design meets its performance and resource utilization requirements, then this is a valid approach.
If the design is not meeting its performance/utilization requirements, then you've got to do the extra timing analysis in order to reduce the registers in a given logic path within the FPGA.

Quoting the question: "My question is whether it is a good idea to routinely register all of the inputs and outputs of every internal hardware module in an FPGA design."
No, it's not a good idea to routinely introduce registers like this.
Doing both inputs and outputs is redundant. There'll be no logic between the output register and the next input register.
If my block contains a single AND gate, it's overkill. It depends on the timing and design complexity.
Register stages need to be properly thought about and designed. What happens when an output FIFO fills, or under other stall conditions? Do all signals have the right register delay so that they appear at the right stage in the right cycle? Adding registers isn't necessarily as simple as it seems.
Quoting the question: "The rationale is that we want to break up long chains of combinational logic in order to improve the clock rate so that we can meet the timing constraints for a chosen clock rate. This will add additional cycles of latency proportional to the number of modules that a signal must cross. Is this a good idea or a bad idea?"
In this case it sounds like you must introduce registers, and you shouldn't read the previous points as "don't do it". Just don't do it blindly. Think about the control logic around the registers and the (now) multi-cycle nature of the logic. You are now building a "Pipeline". Being able to stall a pipeline properly when the output can't write is a huge source of bugs.
Think of cars moving on a road. If one car applies its brakes and stops, all cars behind need to as well. If the first car's brake lights aren't working, the next car won't get the signal to brake, and it'll crash. Similarly, each stage in a pipeline needs to tell the previous stage that it's stopping for a moment.
What you can find is that instead of having long timing paths along your computation paths going from input to output, you end up with long timing paths on your enable controlling all these register stages from output to input.
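The stall behaviour described above can be sketched as a small cycle-level model. This is plain Python rather than HDL, and the names (`step`, `run`, `sink_ready`) are hypothetical, chosen only for illustration. Note how the loop walks from the output back towards the input: that is the software analogue of the enable chain that can become the long timing path.

```python
from typing import List, Optional, Tuple

def step(stages: List[Optional[int]], new_item: Optional[int],
         sink_ready: bool) -> Tuple[List[Optional[int]], Optional[int], bool]:
    """Advance a simple pipeline model by one clock cycle.

    stages[0] is the first stage, stages[-1] the last; None marks an empty slot.
    Returns (next_stages, delivered_item, input_accepted).
    """
    nxt = list(stages)
    delivered = None
    # The last stage may hand its item over only if the sink can take it.
    if nxt[-1] is not None and sink_ready:
        delivered = nxt[-1]
        nxt[-1] = None
    # Walk from output to input: a stage advances only if the slot ahead is free.
    for i in range(len(nxt) - 2, -1, -1):
        if nxt[i] is not None and nxt[i + 1] is None:
            nxt[i + 1] = nxt[i]
            nxt[i] = None
    # The input is accepted only if the first stage is free after the moves.
    accepted = False
    if new_item is not None and nxt[0] is None:
        nxt[0] = new_item
        accepted = True
    return nxt, delivered, accepted

def run(items: List[int], ready_pattern: List[bool], depth: int = 3) -> List[int]:
    """Feed items through the pipeline, one ready flag per cycle."""
    stages: List[Optional[int]] = [None] * depth
    pending = list(items)
    out = []
    for ready in ready_pattern:
        head = pending[0] if pending else None
        stages, delivered, accepted = step(stages, head, ready)
        if accepted:
            pending.pop(0)
        if delivered is not None:
            out.append(delivered)
    return out

# The sink is stalled for the first four cycles, then drains normally.
print(run([1, 2, 3, 4, 5], [False] * 4 + [True] * 10))
```

If the "slot ahead is free" check is removed, items are overwritten during the stall, which is the crash in the car analogy.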

Another option is to let the tools work for you. Add a bunch of registers at the end of your complete system (if you want to pipeline more) and enable retiming in your synthesis tool. This will (hopefully) move the registers into the logic where they are most useful.

Related

Synthesizable delayed buffer in VHDL

I am trying to generate a synthesizable buffer in VHDL for a time-to-digital project in an FPGA.
I have been looking around but cannot find any set-up out there.
I have been recommended that stackoverflow has very good answers.
Could you please give me some tips for this coursework? I would be very grateful for any approach you might come up with.
Thank you a lot in advance!
Regards

Building time-to-digital converters (TDCs) is somewhat hard right now.
Basically, it boils down to having HDL that describes multiple registers all reading the same signal. You then need to apply a keep directive, e.g. equivalent_register_removal for Xilinx. You will possibly also need a timing ignore constraint on the signal you are sampling.
You then need to carefully examine the fabric of your FPGA and make sure your flip-flops are placed in the same slice across multiple sites that can all be connected through the same kind of wire (check FPGA Editor), i.e. will have the same time delay.
You can build a minimal test design for Xilinx in FPGA editor. Once you have the routing down, you can then formulate appropriate constraints for your UCF file and build much bigger, more complex TDCs.
I'm only familiar with Altera from a few years ago. But Altera doesn't give you an interface like Xilinx's FPGA editor, so you're on your own determining the placement of your flops. I saw a presentation once about a university work group doing TDCs with Altera and ultimately it boiled down to measuring the resolution by using input stimuli to check whether the design was routed according to their wishes. If it was not, they would adjust some timing parameters out of sensible bounds, rinse and repeat.
The last step of course is to sample your signal in the synchronous part of your design (where the counter is) and read the counter plus flip flop contents when the event you wanted occurs (i.e. rising edge, falling edge). Then you have major time units in your counter and minor time units as a bitfield in the flip flop state.
If you want even spread among your flip flop delays, you will need to carefully examine the delay length of paths between the flip flops and adjust for your overall clock period.
So basically, counter * clock_period + index_of_highest_set_bit_in_flip_flop_state * path_delay is then your delay time.
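As a minimal sketch of that formula in Python: the function name `tdc_time` and the example numbers are hypothetical, and the sketch assumes at least one flip-flop bit is set and that every tap has the same `path_delay`.

```python
def tdc_time(counter: int, clock_period: float,
             flop_state: int, path_delay: float) -> float:
    """Combine the coarse counter value and the fine flip-flop bitfield.

    counter      -- number of full clock periods counted (major time units)
    clock_period -- clock period, in the same time unit as path_delay
    flop_state   -- sampled flip-flop chain as an integer bitfield
    path_delay   -- delay between adjacent flip-flop taps (assumed uniform)
    """
    if flop_state <= 0:
        # Assumption of this sketch: the event has reached at least one tap.
        raise ValueError("no flip-flop bit set, fine time is undefined")
    # int.bit_length() - 1 is the index of the highest set bit.
    highest_set_bit = flop_state.bit_length() - 1
    return counter * clock_period + highest_set_bit * path_delay

# Hypothetical numbers: 5 ns clock, 0.5 ns per tap, counter at 10, taps 0..2 set.
print(tdc_time(10, 5.0, 0b00000111, 0.5))  # 10 * 5.0 + 2 * 0.5 = 51.0
```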
You will also need to check the FPGA datasheet to know your minimal timings, i.e. the fastest toggle time the input buffer can achieve, the minimal setup and hold time of your flops etc.

How do I debug Verilog code where simulation functions as intended, but implementation doesn't?

I'm a bit stumped.
I have a fairly large Verilog module that I've tested in simulation (iSim) and it functions as I want. Now I've hooked it up in real life to another device using SPI, and some stuff works, and some stuff doesn't.
For example,
I can send a value using command A, and verify that the right value was received using command B. Works no problem.
But if I send a value using command C, I cannot verify that it was received using command D. In simulation it works fine, so I feel I can't really gain anything from simulating any more.
I have looked at the signals on a logic analyzer, and the controller device (not my design) sends the right messages. When I issue command B, I can see the return values correct from my device (I know SPI works anyways). I don't know whether C or D work correctly. D just returns 0s, so maybe C didn't work in the first place. There is no way to step through Verilog, and this module is packaged as IP for Vivado.
Here are two screenshots. First is simulation (I send 5, then 2, then I expect it to return 4 on the next send, which it does; followed by zeros).
Here is what I get in reality (the first two bytes don't matter, 5 is a left over from previously sent value):
Here is a command (B) that works in returning a correct value (it responds to the 0x01 being sent):
Does anyone have any advice for debugging this? I have literally no idea how to proceed.
I can't really reproduce this behaviour in simulation.

Since you are synthesizing to an FPGA, you have a few more options for debugging your synthesized, on-chip design. As you are using Vivado, you can use ChipScope to look at any signal in your system, allowing you to view a waveform of that signal over time just as you would in simulation (though more restricted). By including the ChipScope IPs in your synthesis, you can send waveform data back to the Vivado software, which will display a waveform of your selected signals to help you see what's going on inside the FPGA as the system runs. (Note: if you were using Altera's tools, you can use their equivalent, called SignalTap; it's pretty much the same thing.)
There are numerous tutorials online on how to incorporate and run ChipScope; here's one from the Xilinx website:
http://www.xilinx.com/support/documentation/sw_manuals/xilinx2012_4/ug936-vivado-tutorial-programming-debugging.pdf
Many others use ISE, but the steps are very similar, as both typically involve using the coregen tool (though I think you can also add ChipScope via the synthesis flow, so there are multiple options for incorporating it into your design).
Once on the FPGA, you have access to what is effectively an internal logic analyzer. Note that it does take up some LEs on the FPGA and can take up a fair amount of block RAM, depending on how many samples of your signals you want to capture.
Tim's answer provides a good description of how to deal with on-chip debugging if you are designing purely for ASIC; so see his answer if you want more information about standard, non-FPGA debugging solutions.

In cases like this you might want to think about adding additional logic which is used just for debugging. "Design for debug" is a common term for thinking about this kind of logic.
So you have one chip interface (SPI), which you don't know if it works correctly. Since it seems not to be working, you can't trust debugging over this interface, because if you get an odd result you can't determine what it means.
Since you're working on an FPGA, are there any other interfaces other than SPI which you can get working correctly? Maybe 7-segment display, LEDs, JTAG, VGA, etc?
Try to think of other creative ways to get data out of your chip that don't require the SPI interface.
If you have 4 LEDs, A through D, can you light up each LED for 1 second each time a command of that type is received?
Can you have a 7-seg display the current state of your SPI receiver's state machine, or have it indicate certain error codes if some unknown command is received?
Can you draw over VGA to a monitor a binary sequence of the incoming SPI bitstream?
Once you can start narrowing down with data what is actually happening inside your hardware, you can narrow the problem space you need to inspect for possible problems.

There are multiple reasons why code that runs OK in RTL simulation behaves differently in the FPGA, and it is important to consider all possibilities. ChipScope, suggested above, is definitely a step in the right direction, and it could give you a hint about where to look further. The reasons could be:
The FPGA implementation flow was not executed properly. Did you have the right timing constraints, and were they met during implementation, especially in the P&R phase? Are the pin placements, I/O properties, and clock properties right? Usually you can find hints by inspecting the FPGA implementation reports. This is a tedious part, but sometimes needed. An incorrect implementation flow can also result in FPGA implementations that work or don't depending on the run or on small unrelated changes (I have seen this problem many times!).
RTL/netlist discrepancies, e.g. due to incorrect usage of `ifdef within the design or during the synthesis phase, selecting the incorrect file for synthesis, or the same Verilog module being defined in multiple places. Often a hint can be found by inspecting the removed flop list or the synthesis warnings.
Discrepancy between RTL simulation and the board environment. These could be external, like the clock/data alignment on the interface, but also internal: improper CDC, or not handling clock or reset tree delays properly. Note that X-propagation and CDC are not handled properly in RTL simulation unless you code in a certain way. Problems with those can often only be seen in a netlist simulation environment.
Lastly, FPGA board problems, like a faulty clock source or power supply, or heat, can also be at fault. They are worth checking, but I'd leave those as a last resort. Some folks have a dedicated board/FPGA test design, proven to work on a good board, that would catch some of those problems.
As a final note, the biggest return comes from investing in the simulation environment. Some folks think that since an FPGA can be debugged with ChipScope and reprogrammed quickly, there is no need for a good simulation environment. It probably depends on the size of the project, but my experience is that for most modern FPGA projects a good simulation environment saves a lot of time spent in the lab looking through ChipScope and logic analyzer captures.

Driving module output from combinatorial block

Is it a good design practice to use combinatorial logic to drive the output of a module in VHDL/Verilog?
Is it okay to use the module input directly inside a combinatorial block, and use the output of that combinatorial block to drive another sequential block in the same module?

An answer to the two questions really depends on the overall design methodology
and conditions, and will be opinion based, as Morgan points out in his comment.
The questions are especially relevant for a large design with timing pushed to
the limit, and where multiple designers contribute with different modules. In
this case it is important to determine a design methodology up front which
answers the two questions, in order to ensure that modules provided by
different designers can be integrated smoothly without timing issues.
Designing with flip-flops on all outputs of each module gives the advantage
that when an output is used as input to another module, the input timing is
reasonably well defined, and only depends on the routing delay. This makes it
a Yes to question 1.
Having a reasonably well-defined input timing makes it possible to make complex
combinatorial logic directly on the inputs, since most of the clock cycle will
be available for this. So this also makes it a Yes to question 2.
With the above Yes/Yes design methodology, the available cycle time is only
used once, and that is at the input side of the module, before the flip-flops
that go on the output. The result is that multiple modules will click nicely
together like LEGO bricks, as shown in the figure below.
If a strict design methodology is not adhered to in different modules, then
some modules may place flip-flops on the input, and some on the output. A
longer cycle time, thus slower frequency, is then required, since the worst
case path goes through twice the depth of combinatorial logic. Such a design
is shown in the figure below, and should be avoided.
A third option exists, where flip-flops are placed on all inputs, and the
design will look like the figure below if two different modules use the same
output.
One disadvantage with this approach is that the number of flip-flops may be
higher, since the same output is used as input to multiple flip-flops, and the
synthesis tool may not combine these equivalent flip-flops. And even more
flip-flops than this may be required, if the module that generates the output
will also have to make a flip-flopped version for internal use, which is often
the case.
So the short answer to the questions is: Yes and Yes.
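The cycle-time argument above can be put into rough numbers. The following Python sketch uses made-up delays (4 ns of logic per module, 0.5 ns clock-to-output, 0.5 ns setup) and ignores routing delay and clock skew; the function name `min_clock_period_ns` is hypothetical.

```python
from typing import List

def min_clock_period_ns(path_logic_ns: List[float],
                        clk_to_q_ns: float = 0.5,
                        setup_ns: float = 0.5) -> float:
    """Minimum clock period for a set of register-to-register paths.

    Each entry in path_logic_ns is the combinatorial delay between two
    flip-flops; the slowest path sets the period.
    """
    return clk_to_q_ns + max(path_logic_ns) + setup_ns

MODULE_LOGIC_NS = 4.0  # assumed logic depth of one module

# Every module registers its outputs: each path crosses one module's logic.
consistent = min_clock_period_ns([MODULE_LOGIC_NS, MODULE_LOGIC_NS])

# An input-registered module feeds an output-registered one: the logic of
# both modules sits between the same pair of flip-flops.
mixed = min_clock_period_ns([MODULE_LOGIC_NS + MODULE_LOGIC_NS])

for name, period in (("consistent", consistent), ("mixed", mixed)):
    print(f"{name}: period {period} ns, about {1000.0 / period:.0f} MHz")
```

With these assumed numbers the consistent methodology needs a 5 ns period and the mixed one a 9 ns period, i.e. roughly 200 MHz versus roughly 111 MHz.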

The answer to both questions as expressed is basically yes, provided the final design meets speed targets, and the input signals are clean.
The problem with blocks designed this way is that the signal timings through them are not accurately defined, so that combining several such blocks may result in an absurdly slow design, or one in which fast input signals don't propagate cleanly through the design.
If you design such a circuit, and it meets ALL your input and output timing constraints as well as any clock speed constraints you set, it will work.
However if it fails to meet the clock constraints you will have to insert registers to "pipeline" the design, breaking up long slow chains of combinational logic. And you will have to observe the input and output timings reported by synthesis and PAR, and they can get complicated.
In practice (in an FPGA : ASICs can be different) registers are free with each logic block (Xilinx/Altera, not true for Actel/Microsemi) and placing registers on each block's inputs and/or outputs makes the timings much simpler to understand and analyse.
And because such a design is pipelined, it is normally also much faster.

Vhdl with no clk

I have a clock in my VHDL code but I don't use it. My process simply depends on handshaking: when one component finishes and produces an output, this output is in the sensitivity list of my FSM and then becomes an input to the next component, whose output is of course also in the sensitivity list of my FSM (so that it knows when that component finishes its computation), and so on.
Is this method wrong? It works in simulation and also in post-route simulation, but gives me warnings like this: warning :HOLD High VIOLATION ON I WITH RESPECT TO CLK; and
warning :HOLD Low VIOLATION ON I WITH RESPECT TO CLK;
Are these warnings unimportant, or will my code damage my FPGA because it doesn't depend on a clock?

The warnings you are getting are timing violations. You get these because the tools detect that your design does not obey the necessary timing restrictions for the internal primitives.
For instance, inputs to lookup-tables (which is one of the main building-blocks inside an FPGA) need to be held for a specific time for the output to stabilize. This is very hard to guarantee when your entire timing relies only on the latencies and delays of the components themselves, and switch on a completely asynchronous basis.
Depending on your actual design (mostly the size and complexity of it), I'll wager the guess that you'll end up with a lot of very-hard-to-debug errors once you get it inside an FPGA. You'll have a much, much, much easier time using a clock. This will allow you to have a clear idea of when signals arrive where, and it will allow you to use the internal tools to check your timing. You'll also find it much easier to interface to other devices, and your system will be less susceptible to noisy inputs.
So all in all, use a clock. You (probably) won't damage your FPGA by not doing it, but a clock will save you from tons of trouble.

Your code will most probably not damage your FPGA just because it doesn't depend on a clock. However, for synthesis you should always use registered (clocked) logic. Without a clock your design will not be controllable because of timing, delay, routing, fan-out and so on. This will make your FSM behave "mysteriously" when synthesized (even if it worked in simulation).
You'll find plenty of examples of good FSM implementation style with Google's help (search for Moore or Mealy FSM).

Definitely use a clock. And only one clock throughout the design. This is the easiest way - the tools support this design style very well. You can often get away with a single timing constraint, especially if your inputs are slow and synchronous to the same clock.
When you have gained experience designing this way, you can move outside of this, but be ready for more analysis, timing constraints and potentially build iterations while you learn the pitfalls of crossing clock-domains and asynchronous signals.

What are tsetup and thold in VHDL?

I am learning VHDL. When I tried to make a testbench I ran into these words. What do they mean? I couldn't find any simple explanation on Google.
Thanks in advance.

To my knowledge, tSetup and tHold aren't VHDL keywords; they are the minimum setup and hold times that the device being simulated needs in order to operate correctly.
tSetup - The amount of time the data/control needs to be valid before the clock edge.
tHold - The amount of time the data/control needs to be valid after the clock edge.
A simple graphic explaining this:
http://en.wikipedia.org/wiki/Flip-flop_%28electronics%29#Setup.2C_hold.2C_recovery.2C_removal_times
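To make the two definitions concrete, here is a small Python sketch (not VHDL) that flags data transitions falling inside the window around a clock edge in which the data must stay stable. The function name `setup_hold_violations` and the example times are hypothetical.

```python
from typing import List

def setup_hold_violations(clock_edge: float, data_changes: List[float],
                          t_setup: float, t_hold: float) -> List[float]:
    """Return the data transition times that violate setup or hold.

    Data must be stable from (clock_edge - t_setup) until
    (clock_edge + t_hold); any transition strictly inside that
    window is reported.
    """
    window_start = clock_edge - t_setup
    window_end = clock_edge + t_hold
    return [t for t in data_changes if window_start < t < window_end]

# Clock edge at 10 ns, tSetup = 2 ns, tHold = 1 ns: the window is 8 ns to 11 ns.
print(setup_hold_violations(10.0, [5.0, 9.0, 10.5, 12.0], t_setup=2.0, t_hold=1.0))
# prints [9.0, 10.5]: 9.0 breaks setup, 10.5 breaks hold
```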

As TOTA says, setup and hold times are digital logic design terms, not VHDL terms.
The vast majority of the time, you do not need to concern yourself with them in testbenches as you are almost always testing internal blocks within your chip and the tools will manage all the timing for you.
When you are working at the device pin level, you can set your models up to check the setup and hold times for violations. When simulating RTL, there are (usually) no delays modelled, so your timing should be fine. You can later simulate a back-annotated netlist which has all the real chip delays included and check that you are still going to meet all the timing requirements of your external devices.
