A real junior question with, hopefully, a junior answer, regarding one of the main assignment forms in VHDL (the concurrent selected signal assignment): can anyone explain what a VHDL synthesis tool would turn the following description into?
LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;
USE IEEE.numeric_std.ALL;

ENTITY Q2 IS
    PORT (a, b, c, d : IN  std_logic;
          EW_NS      : OUT std_logic);
END ENTITY Q2;

ARCHITECTURE hybrid OF Q2 IS
    SIGNAL INPUT : std_logic_vector(3 DOWNTO 0);
BEGIN
    INPUT <= (a & b & c & d); -- concatenation
    WITH INPUT SELECT
        EW_NS <= '1' WHEN "0001" | "0010" | "0011" | "0110" | "1011",
                 '0' WHEN OTHERS;
END ARCHITECTURE hybrid;
Why do I ask? Well, I have previously gone about things the wrong way, i.e. describing things in VHDL before making a block diagram of the components needed. I would envisage this being synthesised as a group of AND-gate logic?
Any help would be really helpful.
Thanks D
You need to look at the user guide for your target FPGA, and understand what is contained within one 'logic element' ('slice' in Xilinx terminology). In general an FPGA does not implement combinatorial logic by connecting up discrete gates like AND, OR, etc. Instead, a logic element will contain one or more 'look-up tables' (LUTs), typically with four inputs (six in some newer devices). The inputs to this look-up table are the inputs to your logic function, and the output is one of the outputs of the function. The LUT is then programmed as a ROM, allowing your input signals to act as an address. There is one ROM entry for every possible combination of inputs, each holding the intended result of the logic function.
A function with several outputs would simply use several of these LUTs in parallel, with the same inputs, one LUT for each of the function's outputs. A function requiring more inputs than the LUT has (say, 5 inputs, where a LUT has only 4) combines two LUTs in parallel, using a multiplexer to choose between the outputs of the two LUTs; larger functions repeat this recursively. The final multiplexer uses one of the input signals as its control, and again every possible combination of inputs is accounted for.
This may sound inefficient for creating something simple like an AND gate, but the benefit is that this simple building block (a LUT) can implement absolutely any combinatorial function. It's also worth noting that an FPGA tool chain is extremely good at optimising logic functions in order to simplify them, and to better map them into the FPGA. The LUT provides a highly generic element for these tools to target.
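To see what that means for the code in the question, here is a sketch (purely illustrative; the tools do this mapping automatically) of the same function rewritten as an explicit 16-entry ROM, i.e. the contents a single 4-input LUT would hold. The architecture name lut_view and the ROM constant are invented for the example:

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

-- Sketch only: the question's selected assignment as an explicit 16x1 ROM
-- lookup, which is essentially what one 4-input LUT stores.
architecture lut_view of Q2 is
    -- Bit i holds the output for INPUT = i; '1' at entries 1, 2, 3, 6 and 11,
    -- i.e. "0001", "0010", "0011", "0110" and "1011".
    constant LUT_ROM : std_logic_vector(15 downto 0) := "0000100001001110";
    signal INPUT : std_logic_vector(3 downto 0);
begin
    INPUT <= a & b & c & d;
    EW_NS <= LUT_ROM(to_integer(unsigned(INPUT)));  -- inputs act as the ROM address
end architecture lut_view;

Indexing the constant with the four concatenated inputs is exactly the ROM-addressing behaviour described above, and it is what the original WITH ... SELECT should reduce to after synthesis.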
A logic element will also contain some dedicated resources for functions that aren't well suited to the LUT approach. These might include dedicated carry chains for adders, multiplexers for combining the outputs of several LUTs, and registers (most designs are synchronous). LUTs can also sometimes be configured as small shift registers or RAM elements. External to the logic elements, there will be more specific blocks like large multipliers, larger memories, PLLs, etc., none of which could be as efficiently implemented using LUT resources. Again, this will all be explained in the user guide for your target FPGA.
Back in the day, your code would have been implemented as a single 74150 TTL circuit, which is a 16-to-1 mux. You have a 4-bit select (INPUT), and this selects one of 16 inputs to the chip, which is routed to a single output (EW_NS). The 74150 is obsolete and I can't find any datasheets, but it's easy to find diagrams of what an 8-to-1 mux looks like (here, for example). The 16-to-1 is identical, but everything is wider. My old TI databook shows basically exactly the diagram at this link, doubled up.
But - wait. Your problem is easier, because you're not routing real inputs to the output - you're just setting fixed data values. On the '150, you do this by wiring 5 of the 16 inputs to 1, and the remaining 11 to 0. This makes the logic much easier.
The 74150 has basically exactly the same functionality as a 4-input look-up table (where the fixed look-up data plays the role of the fixed levels at the '150 inputs), so it's trivial to implement your entire circuit in a single LUT in an FPGA, as per scary_jeff's answer, rather than using a NAND-level implementation. In a proper chip, though, it would be implemented as a sum-of-products, or something similar (exactly what's in the linked diagram). In this case, draw a K-map and find a minimum solution. My 2 minutes on the back of an envelope comes up with three 3-input AND gates driving a 3-input OR gate. I'll leave it as an exercise to you to check this :)
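For anyone checking that envelope: a K-map over the five '1' entries (with a as the MSB of INPUT) does yield three essential 3-input product terms. As a sketch in VHDL, equivalent to that gate-level sum-of-products:

-- One possible minimal cover: three 3-input AND terms into a 3-input OR.
-- a is the MSB, matching INPUT <= a & b & c & d in the question.
EW_NS <= ((not a) and (not b) and d)   -- covers "0001" and "0011"
      or ((not a) and c and (not d))   -- covers "0010" and "0110"
      or ((not b) and c and d);        -- covers "0011" and "1011"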
I have a VHDL BCD counter whose output is an integer value (digit).
But when I simulate the code in Xilinx ISE, the waveform shows the output as a binary value. The code works, but the output should be an integer and it's not. I have tested this code in ModelSim and there the output is correct, shown as an integer value. The problem shows up after synthesis too, and the value is binary.
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity bcdcnt is
    Port ( clk   : in  STD_LOGIC;
           digit : out INTEGER RANGE 0 TO 9);
end bcdcnt;

architecture Behavioral of bcdcnt is
begin
    count : PROCESS(clk)
        VARIABLE temp : INTEGER RANGE 0 TO 10;
    BEGIN
        IF (clk'EVENT AND clk = '1') THEN
            temp := temp + 1;
            IF (temp = 10) THEN
                temp := 0;
            END IF;
        END IF;
        digit <= temp;
    END PROCESS count;
end Behavioral;
That's what synthesis does.
That's what synthesis MUST do: it translates your high-level design to the resources in your FPGA or ASIC, which are binary. So, what's the problem here?
If you need to simulate the post-synth result, the usual approach is to create a wrapper entity that takes the correct port types, and translates between those and the post-synthesis netlist component.
Then, simulation should work with either the original entity, or this wrapper entity, both of which have integer ports.
Better still, you can re-use the same entity, and add the wrapper as a second architecture, thus guaranteeing that it uses the same interface (ports).
(You can even instantiate both the original and the post-synth netlist in its wrapper in the testbench, in parallel with a comparator on their outputs, to check that they both do the same thing. But note there will be gate-level delays between them; usually you check outputs only on clock edges, so these do not matter.)
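A minimal sketch of that second-architecture wrapper, assuming the post-synthesis netlist comes out as a component named bcdcnt_netlist with a 4-bit digit port (check the names your tools actually emit; these are assumptions):

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Second architecture of the same bcdcnt entity, wrapping the netlist.
architecture PostSynth of bcdcnt is
    -- Assumed component name and ports of the post-synthesis netlist.
    component bcdcnt_netlist is
        port ( clk   : in  std_logic;
               digit : out std_logic_vector(3 downto 0));
    end component;
    signal digit_slv : std_logic_vector(3 downto 0);
begin
    u0 : bcdcnt_netlist port map (clk => clk, digit => digit_slv);
    -- Translate the binary netlist output back to the entity's integer port.
    digit <= to_integer(unsigned(digit_slv));
end architecture PostSynth;

With both architectures compiled, a configuration (or your simulator's binding options) selects which one the unchanged, integer-typed testbench exercises.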
Another approach is to restrict port types on the top level of the design to binary types like std_logic_vector. This plays nicer with badly designed tools like ISE, where the automatically generated testbench will have binary port types (I generally edit them back to the correct ones; it's almost easier to write TBs from scratch).
But it restricts you to using an obscure and complex design style instead of higher-level abstractions like integers.
It's bad - really bad - that this approach is taught and encouraged so widely. But it is, and sometimes you'll just have to live with it. (Even in this approach, there's no reason to avoid decent abstractions internal to the FPGA, as long as the synthesis tool understands them).
A third approach - roughly, "trust, but verify" - is to trust that synthesis tools are competently written - which is usually true - and forget about post-synthesis simulation.
Just verify the design thoroughly at the behavioural level in simulation, then synthesise it, and test in live FPGA.
99% of the time (unless you're writing really weird VHDL), synth and P&R have done the right thing, and any differences you see are due to the aforementioned I/O timings (gate delays at the I/O pins). Then model these in the testbench and/or wrapper until you see the same behaviour in both (fix anything that needs fixing, and re-synthesise).
In this approach you only need to bother with the (MUCH slower) post-synth and post-PAR simulations if you need to track down a suspected synthesis tool bug.
This does happen : I've seen two in a quarter century.
Most of the time I just use the third approach.
I want to be able to have a shift register that does an XOR against another register loaded with some value. The issue is that I wish to do this with a large scale vector, something on the order of thousands of bits wide.
The obvious way to do this in VHDL would be something like
generic ( length : integer := 15 );
signal shiftreg : std_logic_vector(length downto 0);
process(clk)
begin
    if rising_edge(clk) then
        shiftreg <= shiftreg(length-1 downto 0) & input;
    end if;
end process;
However, if length here is set to some very high number, attempting to synthesize this becomes a massive undertaking. Since this is a relatively simple structure I imagine it is taking so long because the length is far beyond the number of registers in a single block.
My question is if there is some way to implement a large vector like this in a way that would be quicker to synthesize. For example, is it quicker to use something like
type bit_array is array (length downto 0) of std_logic;
or does a synthesis tool recognize those are equivalent?
Synthesis time is not typically relevant in FPGA design, although area utilization and timing usually are. If your shift register takes most of the resources that your target FPGA has, synthesis will take a long time trying to figure out a way to make it work, and likewise builds take longer as you fill up larger parts. For a ballpark: an 80% full design with tight timing in a modern midrange FPGA usually takes about 30 minutes to synthesize and 3 hours to place and route. This will not be significantly affected by coding style if you're still describing the same functionality.
If you describe a shift register (with the same functional features) in VHDL using std_logic_vector, a type you defined as an array of std_logic, or anything else, it will synthesize into the same thing.
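For instance, assuming the questioner's generic length is in scope, both declarations below describe the same collection of storage elements, and synthesis should treat the two signals identically (the type name is invented):

-- Same hardware either way: an array of (length+1) single-bit elements.
type bit_array is array (length downto 0) of std_logic;
signal shiftreg_a : std_logic_vector(length downto 0);
signal shiftreg_b : bit_array;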
In recent-ish Xilinx parts at least, a single LUT can be used for a 64-deep shift register as long as you haven't described a reset (synchronous OR asynchronous). You can likewise produce a 1000 deep shift register with just a handful of LUTs.
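As a sketch, here is a shift register coded so that SRL mapping stays available: no reset, and only the last bit leaves the entity. The entity name and depth are illustrative, and whether the tool actually infers SRLs depends on the part and settings:

library ieee;
use ieee.std_logic_1164.all;

entity deep_shift is
    generic ( depth : integer := 1024 );
    port ( clk    : in  std_logic;
           input  : in  std_logic;
           output : out std_logic );
end entity deep_shift;

architecture rtl of deep_shift is
    signal shiftreg : std_logic_vector(depth-1 downto 0);
begin
    process(clk)
    begin
        if rising_edge(clk) then
            shiftreg <= shiftreg(depth-2 downto 0) & input;
        end if;
    end process;
    -- No reset, and only the final bit is read: both conditions SRL mapping needs.
    output <= shiftreg(depth-1);
end architecture rtl;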
Now if you're looking to use the whole thousand-plus bits of this shift register to XOR against some other register, you can't use SRLs (a LUT used as a shift register), because only the final bit is accessible as an output. This forces the whole thing into ordinary registers, which may be rather large and could require more registers than your part has. The key thing here is that you have to think about the scale of the hardware you describe, and whether that's feasible in your target part.
If you want a really deep shift register, block RAMs can be used to act like shift registers at depths exceeding 100,000, but these have the same issue where you only access the final output.
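The usual way to get those depths is a circular buffer in inferred RAM. A rough sketch under the same only-the-oldest-bit-matters restriction (entity name invented; RAM inference details vary by tool, and wider words map to block RAM more readily than single bits):

library ieee;
use ieee.std_logic_1164.all;

entity bram_shift is
    generic ( depth : integer := 131072 );
    port ( clk    : in  std_logic;
           input  : in  std_logic;
           output : out std_logic );
end entity bram_shift;

architecture rtl of bram_shift is
    type ram_t is array (0 to depth-1) of std_logic;
    signal ram : ram_t;
    signal ptr : integer range 0 to depth-1 := 0;
begin
    process(clk)
    begin
        if rising_edge(clk) then
            output   <= ram(ptr);   -- read the oldest sample first...
            ram(ptr) <= input;      -- ...then overwrite it with the newest
            if ptr = depth-1 then
                ptr <= 0;
            else
                ptr <= ptr + 1;
            end if;
        end if;
    end process;
    -- Note: as written the total delay is depth + 1 cycles, because the read
    -- is registered.
end architecture rtl;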
This question is an extension of another shown here: VHDL Process Confusion with Sensitivity Lists
However, having less than 50 Rep points, I was unable to comment for further explanation.
So, I ran into the same problem from the link and accepted the answer shown. However, now I am interested in what the recommended approach is in order to match simulation with post-synthesis behavior. The accepted answer in the link states that level-sensitive latches are not recommended as a solution because they cause more problems. So my question is: what is the recommended approach? Is there one?
In other words, I want to achieve what was being attempted in that post, but in a way that won't cause more problems. I need my sensitivity list to not be ignored by my synthesis tools.
Also, I am new to VHDL, so it may be that using processes is not the right way to achieve the result I want. I am using a DE2-115 with Quartus Prime 16.0. Any information will be greatly appreciated.
If you are using VHDL to program your FPGA-based prototyping board, you are interested in the synthesis semantics of the language. It is rather different from the simulation semantics described in the Language Reference Manual (LRM). Even worse: it is not standardized and varies between synthesis tools. Anyway, synthesis means translation from VHDL code to digital hardware. The only recommended approach here, for a beginner who does not yet clearly understand the synthesis semantics, is:
Think hardware first, code next.
In other words, draw a nice block diagram of the hardware you want on a sheet of paper. And use the following 10 rules. Strictly. No exceptions. Never. And do not forget to carefully check the last one; it is as essential as the others but a bit more difficult to verify.
1. Surround your drawing with a large rectangle. This is the boundary of your circuit. Everything that crosses this boundary is an input or output port. The VHDL entity will describe this boundary.
2. Clearly separate edge-triggered registers (drawn, say, as square blocks) from combinatorial logic (say, round blocks).
3. Do not use level-triggered latches.
4. Use only rising-edge-triggered registers and use the same single clock for all of them. Its name is clock. It comes from the outside and is an input of all square blocks and only them. You do not even need to represent the clock; it is the same for all square blocks and you can leave it implicit in your diagram.
5. Represent the communications between blocks with named and oriented arrows. For the block an arrow comes from, the arrow is an output signal. For the block an arrow goes to, the arrow is an input signal.
6. Arrows have one single origin, but they can have several destinations. If an arrow has several destinations, fork the arrow as many times as needed.
7. Some arrows come from outside the large rectangle. These are the input ports of the entity. An input arrow cannot also be the output of any of your blocks.
8. Some arrows go outside. These are the output ports. An output arrow has one single origin and one single destination: the outside. No forks on output arrows. So an output arrow cannot also be the input of one of your blocks. If you want to use an output arrow as an input to some of your blocks, insert a new round block to split it into two parts: the input of the new block, with as many forks as you wish, and the output arrow that comes from the new block and goes outside. The new block will become a simple continuous assignment in VHDL; a kind of transparent renaming.
9. All arrows that do not come or go from/to the outside are internal signals. You will declare them all in the architecture.
10. Every cycle in the diagram must comprise at least one square block.
If you cannot find a way to describe the function you want with this approach, the problem is with the function you want. Not with VHDL or the synthesizer. It means that the function you want is not digital hardware. Implement it using another technology.
The VHDL coding becomes a detail:
one synchronous process per square block,
one combinatorial process per round block.
A synchronous process looks like this:
process(clock)
begin
    if rising_edge(clock) then
        o_1 <= i_1;
        ...
        o_n <= i_n;
    end if;
end process;
where i_1, i_2, ..., i_n are all the arrows that enter the corresponding square block of your diagram and o_1, ..., o_n are all the arrows that leave it. Do not change anything, except the names of the signals. Nothing. Not even a single character. OK?
A combinatorial process looks like this:
process(i_1, i_2, ..., i_n)
    <declarations>
begin
    o_1 <= <default_value_for_o_1>;
    ...
    o_m <= <default_value_for_o_m>;
    <statements>
end process;
where i_1, i_2, ..., i_n are all the arrows that enter the corresponding round block of your diagram. All of them and no more. Do not forget a single arrow and do not add anything else. There are no exceptions. Never. And o_1, ..., o_m are all the arrows that leave the corresponding round block of your diagram. All of them and no more. Do not change anything except <declarations>, the names of the inputs, the names of the outputs, the values of the <default_value_for_o_i> and <statements>. Do not forget a single default value assignment. If you had to create a new round block to split a primary output arrow, the corresponding process is just:
process(i)
begin
    o <= i;
end process;
which you can simplify as:
o <= i;
without the enclosing process declaration. It is the equivalent concurrent signal assignment.
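As a hypothetical end-to-end example of the recipe, here is one round block (an AND of two inputs) feeding one square block, coded strictly from the two templates above. All names are invented:

library ieee;
use ieee.std_logic_1164.all;

entity and_reg is
    port ( clock : in  std_logic;
           x, y  : in  std_logic;
           q     : out std_logic );
end entity and_reg;

architecture rtl of and_reg is
    signal x_and_y : std_logic;  -- the internal arrow between the two blocks
begin
    -- Round block: every entering arrow (x, y) is in the sensitivity list.
    comb : process(x, y)
    begin
        x_and_y <= x and y;
    end process comb;

    -- Square block: the clock alone in the sensitivity list, rising edge only.
    sync : process(clock)
    begin
        if rising_edge(clock) then
            q <= x_and_y;
        end if;
    end process sync;
end architecture rtl;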
Once you are comfortable with this coding style, and only then, you will:
Skip the drawing for simple designs. But continue thinking hardware first. Draw in your head instead of on a sheet of paper but continue drawing.
Use asynchronous resets:
process(clock, reset)
begin
    if reset = '1' then
        o <= reset_value_for_o;
    elsif rising_edge(clock) then
        o <= i;
    end if;
end process;
Merge several combinatorial processes into one single combinatorial process. This is trivial and is just a simple reorganization of the block diagram.
Merge some combinatorial processes with synchronous processes. But in order to do this you must go back to your block diagram and add an eleventh rule:
11. Group several round blocks and at least one square block by drawing an enclosure around them. Also enclose the arrows that can be enclosed. Do not let an arrow cross the boundary of the enclosure unless it comes or goes from/to outside the enclosure. Once this is done, look at all the output arrows of the enclosure. If any of them comes from a round block of the enclosure, or is also an input of the enclosure, you cannot merge these processes into a single synchronous process.
And later on you will also start using latches, falling clock edges, multiple clocks and resynchronizers between clock domains... But we will discuss these when the time comes.
I have a problem which is more easily solved with an HLS tool than by writing the raw VHDL/Verilog. Currently I'm using a Xilinx Virtex-7, though I think other vendors have already solved this as well.
I can use VHDL 2008.
So imagine in VHDL you have many calculations such as:
p1 <= a * b - c;
p2 <= p1 * d - e;
p3 <= p2 * f - g;
p4 <= p2 * p1 - p3;
Currently if I were to write this with IP Cores, it would be four DSP IP cores, and because of the different port widths, I'd have to generate this IP core 4 times. Anytime I make a change to some of these external signals, all the widths would change again. Keeping track of all this resizing is a pain, especially when resizing signed vectors down.
I have a lot of maths and thus a lot of DSP logic. It would be easier to write this block with a HLS tool. Ideally I would like it to handle the widths and bitshift the data accordingly.
Does such a tool exist? Which one would you recommend?
Bonus points:
Do any of these tools handle floating point maths and let you control precision?
There are lots of ways to accomplish your goal. But first to address your points.
Currently if I were to write this with IP Cores, it would be four DSP IP cores, and because of the different port widths, I'd have to generate this IP core 4 times.
Not necessarily. If your inputs a through g are all fixed point, you can use ieee.numeric_std, or in VHDL-2008 you can use ieee.fixed_pkg. These will infer DSP blocks (such as the DSP48 on Xilinx). For example:
-- Assume a, b, and c are all signed integers (or implicit fixed point)
signal a  : signed(7 downto 0);
signal b  : signed(7 downto 0);
signal c  : signed(7 downto 0);
-- a*b produces a'length + b'length bits; "- c" grows the result by one more bit
signal p1 : signed(a'length + b'length downto 0);
...
p1 <= a*b - resize(c, p1'length);
This will imply multipliers and adders.
And this can be similarly done with UFIXED or SFIXED. But you do need to track the bit widths.
Also, there is a floating point package (ieee.float_pkg), but I would NOT recommend that for hardware. You are better off, timing- and resource-wise, implementing it in fixed point.
Anytime I make a change to some of these external signals, all the widths would change again. Keeping track of all this resizing is a pain.
You can do this automatically. Look at my example above. You can easily determine widths based on the operations. Multiplications sum the number of bits. Additions add a single bit. So, if I have:
y <= a * b;
Then I can derive the length of y as simply a'length + b'length. It can be done. The issue, however, is bit growth. The chain of operations you describe will grow significantly if you keep full precision. At certain points you will need to truncate or round to reduce the number of bits. This is the hard part: how much error you can tolerate depends upon the algorithm and the expected input data.
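Extending the same bookkeeping one stage down the questioner's chain, in the style of the example above (d, e and their widths are assumptions):

-- Widths derive themselves through the chain (numeric_std, as above).
signal d  : signed(7 downto 0);
signal e  : signed(7 downto 0);
-- p1*d needs p1'length + d'length bits; "- e" grows the result by one more bit
signal p2 : signed(p1'length + d'length downto 0);
...
p2 <= p1*d - resize(e, p2'length);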
I have a lot of maths and thus a lot of DSP logic. It would be easier to write this block with a HLS tool. Ideally I would like it to handle the widths and bitshift the data accordingly.
Automatic handling is the hard part. In VHDL this will not happen (nor in Verilog, for that matter). But you can track it fairly well and have bit widths update as necessary. However, it will not automatically handle things like rounding, truncation, and managing error bounds. A DSP engineer should be handling those issues and directing the RTL developer on the appropriate widths and when to round or truncate.
Does such a tool exist? Which one would you recommend?
There are a variety of options to do this at a higher level. None of these are particularly frugal with respect to resources. Matlab has a code generation tool that will convert Matlab models (suitably constructed) into RTL. It will even analyze issues such as rounding, truncation, and determine appropriate bit widths. You can control the precision, but it is fixed point. We've played with it, and found it very far from producing efficient, high-speed code.
Alternatively, Xilinx does have an HLS suite (see Vivado). I'm not all that well versed in the methodology, but as I understand it, it allows writing C code to implement algorithms. The C code is then "synthesized" to something that executes in some sort of execution engine. You still have to interface that C code to RTL infrastructure, and that's a challenge in its own right. The reason we have so far not pursued it heavily (even though we do DSP-heavy designs) is that it is a big challenge to simulate both the HLS and RTL together as a system.
In the past I have used FloPoCo to generate arbitrary math functions in hardware. If I recall correctly, it supports many types of functions. For instance, it could generate an arithmetic core to compute something like a = 3*sin²(x + pi/3). For these calculations it allows you to specify the overall precision of the inputs/outputs (for floating point/fixed point) or the width of the inputs (integer). The clock frequency and whether or not to pipeline the function can also be specified.
Here is an old tutorial I found on how to use it: tutorial
I'm working on a circuit which performs basic operations such as addition and subtraction using logic gates.
Right now, it takes 3 inputs: two 4-bit numbers, and a 3-bit opcode which indicates what operation to perform.
It seems that a 3-to-8 decoder would be a good idea here. This is my mockup!
To give a little more context, here is what my adder circuit looks like (+). I designed it to take two 4-bit numbers X & Y:
However, what I am confused about is the fact that I have to feed 4 inputs, or 4 wires, into each of the circuits that handle their respective operations (+, -, =, etc.). It appears that only one wire connects to the circuit I need to get to, but I actually need to connect 8 wires, as I have to feed in the two 4-bit numbers.
UPDATE: I ended up using a MUX to select the output that I want.
An adder doesn't need an input to tell it to add, because that's all it does.
A 4-bit full adder should have
4 input signals for each operand, total 8
A carry-in input signal if you are also using it for subtraction
5 output signals, the high-order one may be used to generate an overflow flag
Your decoder is a separate component from all the function generators. You could put a tristate buffer on each function generator to connect them to a common data bus, and then the decoder would generate the tristate enable signals. Otherwise, you probably don't need a decoder, but you might look at a multiplexer (mux) instead.
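In VHDL terms, the mux approach the questioner settled on looks like the sketch below: every function block computes in parallel, and the opcode only selects which result reaches the output. All names and widths are invented for illustration:

library ieee;
use ieee.std_logic_1164.all;

-- Sketch: the opcode drives an output mux instead of a decoder plus tristates.
-- add_out/sub_out/cmp_out are the (invented) results of the function blocks.
entity alu_out_mux is
    port ( opcode  : in  std_logic_vector(2 downto 0);
           add_out : in  std_logic_vector(4 downto 0);
           sub_out : in  std_logic_vector(4 downto 0);
           cmp_out : in  std_logic_vector(4 downto 0);
           result  : out std_logic_vector(4 downto 0) );
end entity alu_out_mux;

architecture rtl of alu_out_mux is
begin
    with opcode select
        result <= add_out         when "000",
                  sub_out         when "001",
                  cmp_out         when "010",
                  (others => '0') when others;
end architecture rtl;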