Match Simulation and Post-Synthesis Behavior in VHDL - vhdl

This question is an extension of the another shown here, VHDL Process Confusion with Sensitivity Lists
However, having less than 50 Rep points, I was unable to comment for further explanation.
So, I ran into the same problem from the link and accept the answer shown. However, now I am interested in what the recommended approach is in order to match simulation with post-synthesis behavior. The accepted answer in the link states that Level Sensitive Latches are not recommended as a solution due to them causing more problems. So my question is what is the recommended approach? Is there one?
In other words, I want to acheive what was trying to be achieved in that post, but in a way that won't cause more problems. I need my sensitivity list to not be ignored by my synthesis tools.
Also, I am new to VHDL so it may be possible that using processes is not right way to achieve the result I want. I am using a DE2-115 with Quartus Prime 16.0. Any information will be greatly appreciated.

If you are using VHDL to program your FPGA-based prototyping board you are interested in the synthesis semantics of the language. It is rather different from the simulation semantics described in the Language Reference Manual (LRM). Even worse: it is not standardized and varies between synthesis tools. Anyway, synthesis means translation from the VHDL code to digital hardware. The only recommended approach here, for a beginner who still do not clearly understands the synthesis semantics is:
Think hardware first, code next.
In other words, draw a nice block diagram of the hardware you want on a sheet of paper. And use the following 10 rules. Strictly. No exceptions. Never. And do not forget to carefully check the last one, it is as essential as the others but a bit more difficult to verify.
Surround your drawing with a large rectangle. This is the boundary of your circuit. Everything that crosses this boundary is an input or output port. The VHDL entity will describe this boundary.
Clearly separate edge-triggered registers (e.g. square blocks) from combinatorial logic (e.g. round blocks).
Do not use level-triggered latches.
Use only rising-edge triggered registers and use the same single clock for all of them. Its name is clock. It comes from the outside and is an input of all square blocks and only them. Do not even represent the clock, it is the same for all square blocks and you can leave it implicit in your diagram.
Represent the communications between blocks with named and oriented arrows. For the block an arrow comes from, the arrow is an output signal. For the block an arrow goes to, the arrow is an input signal.
Arrows have one single origin but they can have several destinations. If an arrow has several destinations, fork the arrow as many times as needed.
Some arrows come from outside the large rectangle. These are the input ports of the entity. An input arrow cannot also be the output of any of your blocks.
Some arrows go outside. These are the output ports. An output arrow has one single origin and one single destination: the outside. No forks on output arrows. So, an output arrow cannot be also the input of one of your blocks. If you want to use an output arrow as an input for some of your blocks, insert a new round block to split it in two parts: the input of the new block, with as many forks as you wish, and the output arrow that comes from the new block and goes outside. The new block will become a simple continuous assignment in VHDL. A kind of transparent renaming.
All arrows that do not come or go from/to the outside are internal signals. You will declare them all in the architecture.
Every cycle in the diagram must comprise at least one square block.
If you cannot find a way to describe the function you want with this approach, the problem is with the function you want. Not with VHDL or the synthesizer. It means that the function you want is not digital hardware. Implement it using another technology.
The VHDL coding becomes a detail:
one synchronous process per square block,
one combinatorial process per round block.
A synchronous process looks like this:
process(clock)
begin
if rising_edge(clock) then
o1 <= i1;
...
on <= in;
end if;
end process;
where i1, i2,..., in are all arrows that enter the corresponding square block of your diagram and o1, ..., om are all arrows that output the corresponding square block of your diagram. Do not change anything, except the names of the signals. Nothing. Not even a single character. OK?
A combinatorial process looks like this:
process(i1, i2,... , in)
<declarations>
begin
o1 <= <default_value_for_o1>;
...
om <= <default_value_for_om>;
<statements>
end process;
where i1, i2,..., in are all arrows that enter the corresponding round block of your diagram. all and no more. Do not forget a single arrow and do not add anything else. There are no exceptions. Never. And where o1, ..., om are all arrows that output the corresponding round block of your diagram. all and no more. Do not change anything except <declarations>, the names of the inputs, the names of the outputs, the values of the <default_value_for_oi> and <statements>. Do not forget a single default value assignment. If you had to create a new round block to split a primary output arrow, the corresponding process is just:
process(i)
begin
o <= i;
end process;
which you can simplify as:
o <= i;
without the enclosing process declaration. It is the equivalent concurrent signal assignment.
Once you will be comfortable with this coding style and only then, you will:
Skip the drawing for simple designs. But continue thinking hardware first. Draw in your head instead of on a sheet of paper but continue drawing.
Use asynchronous resets:
process(clock, reset)
begin
if reset = '1' then
o <= reset_value_for_o;
elsif rising_edge(clock) then
o <= i;
end if;
end process;
Merge several combinatorial processes in one single combinatorial process. This is trivial and is just a simple reorganization of the block diagram.
Merge some combinatorial processes with synchronous processes. But in order to do this you must go back to your block diagram and add an eleventh rule:
Group several round blocks and at least one square block by drawing an enclosure around them. Also enclose the arrows that can be. Do not let an arrow cross the boundary of the enclosure if it does not come or go from/to outside the enclosure. Once this is done, look at all the output arrows of the enclosure. If any of them comes from a round block of the enclosure or is also an input of the enclosure, you cannot merge these processes in a synchronous process.
And later on you will also start using latches, falling clock edges, multiple clocks and resynchronizers between clock domains... But we will discuss these when the time will have come.

Related

Ensuring propagation is complete in VHDL without an explicit click

I am looking to build a VHDL circuit which responds to an input as fast as possible, meaning I don't have an explicit clock to clock signals in and out if I don't absolutely need one. However, I am also looking to avoid "bouncing" where one leg of a combinatorial block of logic finishes before another.
As an example, the expression B <= A xor not not A should clearly never assign true to B. However, in a real implementation, the two not gates introduce delays which permit the output of that expression to flicker after A changes but the not gates have not propagated that change. I'd like to "debounce" that circuit.
The usual, and obvious, solution is to clock the circuit, so that one never observes a transient value. However, for this circuit, I am looking to avoid a dependence on a clock signal, and only have a network of gates.
I'd like to have something like:
x <= not A -- line 1
y <= not x -- line 2
z <= A xor y -- line 3
B <= z -- line 4
such that I guarantee that line 4 occurs after line 3.
The tricky part is that I am not doing this in one block, as the exposition above might suggest. The true behavior of the circuit is defined by two or more separate components which are using signals to communicate. Thus once the signal chain propagates into my sub-circuit, I see nothing until the output changes, or doesn't change!
In effect, the final product I'm looking for is a procedure which can be "armed" by the inputs changing, and "triggered" by the sub-circuit announcing its outputs are fully changed. I'd like the result to be snynthesizable so that it responds to the implementation technology. If it's on a FPGA, it always has access to a clock, so it can use that to manage the debouncing logic. If it's being implemented as an ASIC, the signals would have to be delayed such that any procedure which sees the "triggered" signal is 100% confident that it is seeing updated ouputs from that circuit.
I see very few synthesizable approaches to such a procedural "A happens-before B" behavior. wait seems to be the "right" tool for the job, but is typically only synthesizable for use with explicit clock signals.

VHDL concurrent selective assignment synthesis

a real junior question with hopefully a junior answer, regarding one of the main assignments of VHDL (concurrent selective assignment) can anyone explain what a VHDL compiler would synthesise the following description into?
LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;
USE IEEE.numeric_std.ALL;
ENTITY Q2 IS
PORT (a,b,c,d : IN std_logic;
EW_NS : OUT std_logic
);
END ENTITY Q2;
ARCHITECTURE hybrid OF Q2 IS
SIGNAL INPUT : std_logic_vector(3 DOWNTO 0);
SIGNAL EW_NS : std_logic;
BEGIN
INPUT <= (a & b & c & d); -- concatination
WITH (INPUT) SELECT
EW_NS <= '1' WHEN "0001"|"0010"|"0011"|"0110"|"1011",
'0' WHEN OTHERS;
END ARCHITECTURE hybrid;
Why do I ask? well I have previously gone about things the wrong way i.e. describing things on VHDL before making a block diagram of the components needed. I would envisage this been synthed as a group of and gate logic ?
Any help would be really helpful.
Thanks D
You need to look at the user guide for your target FPGA, and understand what is contained within one 'logic element' ('slice' in Xilinx terminology). In general an FPGA does not implement combinatorial logic by connecting up discrete gates like AND, OR, etc. Instead, a logic element will contain one or more 'look-up tables', with typically four (but now 6 in some newer devices) inputs. The inputs to this look up table (LUT) are the inputs to your logic function, and the output is one of the outputs of the function. The LUT is then programmed as a ROM, allowing your input signals to function as an address. There is one ROM entry for every possible combination of inputs, with the result being the intended logic function.
A function with several outputs would simply use several of these LUTs in parallel, with the same inputs, one LUT for each of the function's outputs. A function requiring more inputs than the LUT has (say, 7 inputs, where a LUT has only 4), simply combines two LUTs in parallel, using a multiplexer to choose between the output of the two LUTs. This final multiplexer uses one of the input signals as it's control, and again every possible combination of inputs is accounted for.
This may sound inefficient for creating something simple like an AND gate, but the benefit is that this simple building block (a LUT) can implement absolutely any combinatorial function. It's also worth noting that an FPGA tool chain is extremely good at optimising logic functions in order to simplify them, and to better map them into the FPGA. The LUT provides a highly generic element for these tools to target.
A logic element will also contain some dedicated resources for functions that aren't well suited to the LUT approach. These might include dedicated carry chains for adders, multiplexers for combining the output of several LUTS, registers (most designs are synchronous). LUTs can also sometimes be configured as small shift registers or RAM elements. External to the logic elements, there will be more specific blocks like large multipliers, larger memories, PLLs, etc, none of which can be as efficiently implemented using LUT resource. Again, this will all be explained in the user guide for your target FPGA.
Back in the day, your code would have been implemented as a single 74150 TTL circuit, which is a 16-to-1 mux. you have a 4-bit select (INPUT), and this selects one of 16 inputs to the chip, which is routed to a single output ('EW_NS`). The 74150 is obsolete and I can't find any datasheets, but it's easy to find diagrams of what an 8-to-1 mux looks like (here, for example). The 16->1 is identical, but everything is wider. My old TI databook shows basically exactly the diagram at this link doubled up.
But - wait. Your problem is easier, because you're not routing real inputs to the output - you're just setting fixed data values. On the '150, you do this by wiring 5 of the 16 inputs to 1, and the remaining 11 to 0. This makes the logic much easier.
The 74150 has basiscally exactly the same functionality as a 4-input look-up table (where the fixed look-up data is the same as fixed levels at the '150 inputs), so it's trivial to implement your entire circuit in a single LUT in an FPGA, as per scary_jeff's answer, rather than using a NAND-level implementation. In a proper chip, though, it would be implemented as a sum-of-products, or something similar (exactly what's in the linked diagram). In this case, draw a K-map and find a minimum solution. My 2 minutes on the back of an envelope comes up with three 3-input AND gates, driving a 3-input OR gate. I'll leave it as an exercise to you to check this :)

What is the practical difference between implementing FOR-LOOP and FOR-GENERATE? When is it better to use one over the other?

Let's suppose I have to test different bits on an std_logic_vector. would it be better to implement one single process, that for-loops for each bit or to instantiate 'n' processes using for-generate on which each process tests one bit?
FOR-LOOP
my_process: process(clk, reset) begin
if rising_edge (clk) then
if reset = '1' then
--init stuff
else
for_loop: for i in 0 to n loop
test_array_bit(i);
end loop;
end if;
end if;
end process;
FOR-GENERATE
for_generate: for i in 0 to n generate begin
my_process: process(clk, reset) begin
if rising_edge (clk) then
if reset = '1' then
--init stuff
else
test_array_bit(i);
end if;
end if;
end process;
end generate;
What would be the impact on FPGA and ASIC implementations for this cases? What is easy for the CAD tools to deal with?
EDIT:
Just adding a response I gave to one helping guy, to make my question more clear:
For instance, when I ran a piece of code using for-loops on ISE, the synthesis summary gave me a fair result, taking a long while to compute everything. when I re-coded my design, this time using for-generate, and several processes, I used a bit more area, but the tool was able to compute everything way way faster and my timing result was better as well. So, does it imply on a rule, that is always better to use for-generates with a cost of extra area and lower complexity or is it one of the cases I have to verify every single implementation possibility?
Assuming relatively simple logic in the reset and test functions (for example, no interactions between adjacent bits) I would have expected both to generate the same logic.
Understand that since the entire for loop is executed in a single clock cycle, synthesis will unroll it and generate a separate instance of test_array_bit for each input bit. Therefore it is quite possible for synthesis tools to generate identical logic for both versions - at least in this simple example.
And on that basis, I would (marginally) prefer the for ... loop version because it localises the program logic, whereas the "generate" version globalises it, placing it outside the process boilerplate. If you find the loop version slightly easier to read, then you will agree at some level.
However it doesn't pay to be dogmatic about style, and your experiment illustrates this : the loop synthesises to inferior hardware. Synthesis tools are complex and imperfect pieces of software, like highly optimising compilers, and share many of the same issues. Sometimes they miss an "obvious" optimisation, and sometimes they make a complex optimisation that (e.g. in software) runs slower because its increased size trashed the cache.
So it's preferable to write in the cleanest style where you can, but with some flexibility for working around tool limitations and occasionally real tool defects.
Different versions of the tools remove (and occasionally introduce) such defects. You may find that ISE's "use new parser" option (for pre-Spartan-6 parts) or Vivado or Synplicity get this right where ISE's older parser doesn't. (For example, passing signals out of procedures, older ISE versions had serious bugs).
It might be instructive to modify the example and see if synthesis can "get it right" (produce the same hardware) for the simplest case, and re-introduce complexity until you find which construct fails.
If you discover something concrete this way, it's worth reporting here (by answering your own question). Xilinx used to encourage reporting such defects via its Webcase system; eventually they were even fixed! They seem to have stopped that, however, in the last year or two.
The first snippet would be equivalent to the following:
my_process: process(clk, reset) begin
if rising_edge (clk) then
if reset = '1' then
--init stuff
else
test_array_bit(0);
test_array_bit(1);
............
test_array_bit(n);
end if;
end if;
end process;
While the second one will generate n+1 processes for each i, together with the reset logic and everything (which might be a problem as that logic will attempt to drive the same signals from different processes).
In general, the for loops are sequential statements, containing sequential statements (i.e. each iteration is sequenced to be executed after the previous one). The for-generate loops are concurrent statements, containing concurrent statements, and this is how you can use it to make several instances of a component, for example.

Is the use of records the solution to all latch problems in VHDL

I was recently told that the solution to all (most) problems with unintended latches during VHDL synthesis is to put whatever the problematic signal is in a record.
This seems like it's a little bit too good to be true, but I'm not that experienced with VHDL so there could be something else that I'm not considering.
Should I put all my signals in records?
No, you should not put all your signals in records. This will quickly become very confusing and you will not gain anything by using the record.
One way that a record may help you avoid latches, is if you register an entire record in a clocked process, you are really registering all of the components of the record. This takes one line of code, instead of possibly tens of lines. In the case where you have many elements which all need to be treated the same, a record can save you "silly mistakes", and possibly save you from creating a latch.
As stated by others, a record doesn't have any specific synthesis interpretation. It is simply a group of signals that you are grouping together for coding-convenience.
I don't see how this would help - a record (or even just parts of a record) can become a latch just as easily as a signal. A latch is generated if a signal keeps its state through some combinatorial process (i.e., is not assigned a value on ALL paths through the process). The same holds for constituents of a record.
Records can be useful to group related signals for readability, but synthesis-wise a record is pretty much equivalent to a bunch of individual signals.
My personal suggestion to avoid latches: avoid combinatorial processes. Make all processes clocked, and do combinatorial logic at the architecture level.
A record is just another way of grouping other types, similar to using an array
for grouping of a std_logic to std_logic_vector, so there is nothing
magical about records that make them better for avoiding latches in a design.
If you get unintended latches in your design, what I guess you think of as
"latch problems", it is because you coding style specifies latches, and you
should change the coding style, as #zennehoy also suggests.
One approach can be to define some code templates for different constructions
that you use, and then stick to these known and working templates.
The template for a flip-flop (FF) with asynchronous reset can be:
process (clk_i, rst_i) is
begin
-- Clock
if rising_edge(clk_i) then
... Control structures with Qs assign by function for Ds
... Synchronous reset is just another branch
end if;
-- Reset (asynchronous) if required
if rst_i = '1' then
... Qs assign with constant reset value for so or all Qs
end if;
end process;
Use concurrent signal assigns when possible, and more complex expressions can
be done through use of concurrent function call, where a function is used
outside a process like:
z_o <= fun(a_i, b_i);
If a process is used to create combinatorial logic, then a common pitfall and
cause for latches in VHDL is to forget a signal in the sensitivity list.
However, VHDL-2008 has a solution for this, since you can use (all) as
sensitivity list, whereby all signal used in the process are implicitly
included in the sensitivity list. So if you use VHDL-2008, then your template
for combinatorial processes can be:
process (all) is
begin
z_o <= a_i and b_i;
end process;
These template should be all you need for typical synthesizable design, and
these will keep your design latch free.

Efficient synthesis of a 4-to-1 function in Verilog

I need to implement a 4-to-1 function in Veriog. The input is 4 bits, a number from 0-15. The output is a single bit, 0 or 1. Each input gives a different output and the mapping from inputs to outputs is known, but the inputs and outputs themselves are not. I want vcs to successfully optimizing the code and also have it be as short/neat as possible. My solution so far:
wire [3:0] a;
wire b;
wire [15:0] c;
assign c = 16'b0100110010111010; //for example but could be any constant
assign b = c[a];
Having to declare c is ugly and I don't know if vcs will recognize the K-map there. Will this work as well as a case statement or an assignment in conjunctive normal form?
What you have is fine. A case statement would also work equally well. It's just a matter of how expressive you wish to be.
Your solution, indexing, works fine if the select encodings don't have any special meaning (a memory address selector for example). If the select encodings do have some special semantic meaning to you the designer (and there aren't too many of them), then go with a case statement and enums.
Synthesis wise, it doesn't matter which one you use. Any decent synthesis tool will produce the same result.
I totally agree with Dallas. Use a case statement - it makes your intent clearer. The synthesis tool will build it as a look-up table (if it's parallel) and will optimise whatever it can.
Also, I wouldn't worry so much about keeping your RTL code short. I'd shoot for clarity first. Synthesis tools are cleverer than you think...
My preference - if it makes sense for your problem - is for a case statement that makes use of enums or `defines. Anything to make code review, maintenance and verification easier.
For things like this, RTL clarity trumps all by a wide margin. SystemVerilog has special always block directives to make it clear when the block should synthesize to combinational logic, latches, or flops (and your synthesis tool should throw an error if you've written RTL that conflicts with that (e.g. not including all signals in the sensitivity list of an always block). Also be aware that the tool will probably replace whatever encoding you have with the most hardware-efficient encoding (the one that minimizes the area of your total design), unless the encoding itself propagates out to the pins of your top-level module.
This advice goes in general, as well. Make your code easy to understand by humans, and it will probably be more understandable to the synthesis tool as well, which allows it to more effectively bring literally thousands of man-years of algorithms research to bear on your RTL.
You can also code it using ternary operators if you like, but i'd prefer something like:
always_comb //or "always #*" if you don't have an SV-enabled tool flow
begin
case(a)
begin
4'b0000: b = 1'b0;
4'b0001: b = 1'b1;
...
4'b1111: b = 1'b0;
//If you don't specify a "default" clause, your synthesis tool
//Should scream at you if you didn't specify all cases,
//Which is a good thing (tm)
endcase //a
end //always
Apparently I am using a lousy synthesis tool. :-) I just synthesized both versions (just the module using a model based on fan-outs for wire delays) and the indexing version from the question gave better timing and area results than the case statements. Using Synopsys DC Z-2007.03-SP.

Resources