Does the CONSTANT declaration store the values in Block RAM or in the flip-flops of an FPGA?

For example, if I want to store my filter coefficients in n-Tap FIR filters using constants, will the CONSTANT declaration store my values in Block RAMs or registers using FPGA flipflops? Also can SIGNAL be used to store the coefficients without using RAM cells?

The constants themselves aren't "stored" anywhere - their values are simply substituted into the VHDL code where you use them.
Where they're stored depends on how you use them and how the code is optimized.
If you're multiplying a signal by a constant two, for example, no elements are used at all - the data bus will be simply connected in a way that effectively shifts the value left by one bit.
Or, they may end up as hard-wired inputs to other elements like multipliers in your case.
Either way, you should look into the synthesis results to thoroughly understand the generated RTL.
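For instance, here is a minimal sketch of the multiply-by-two case mentioned above (the entity name, widths and the GAIN constant are only illustrative): the constant is folded into the expression, and the synthesizer typically reduces the multiplication to a re-wired, left-shifted bus.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity times_two is
    port (x : in  unsigned(7 downto 0);
          y : out unsigned(8 downto 0));
end entity times_two;

architecture rtl of times_two is
    constant GAIN : natural := 2;  -- substituted into the expression below, never "stored"
begin
    y <= resize(x * GAIN, y'length);  -- typically reduces to y <= x & '0': just wiring, no logic
end architecture rtl;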

[...] will the CONSTANT declaration store my values in Block RAMs or registers using FPGA flipflops?
Whether constants are stored in memory blocks or registers, or merged into the boolean equations, depends on how you implement the algorithm. Let's have a look at the following mathematical equation (not VHDL code):
y = c_1 * x_1 + c_2 * x_2 + c_3 * x_3 +... + c_N * x_N
N is the number of coefficients, x_i are the input values, and c_i are the constant coefficients.
You can implement this equation in VHDL / hardware by:
N parallel multipliers and an adder tree to sum up the products; all done combinationally within one clock cycle, or even pipelined with a throughput of one result per clock cycle.
Or N sequentially executed multiply-accumulate steps; with one multiply-accumulate per clock cycle.
You can even use a combination of both.
In case 1, the synthesizer optimizes each multiplication with a constant:
just wiring if the coefficient is a power of two,
addition if the binary representation of the coefficient contains a small number of ones (5*x = x + 4*x),
or multiplier hard macro with a constant value (VDD, GND) connected to one of its inputs.
Thus, in case 1 no memory or registers are required to store the constants.
In case 2, the synthesizer will map the multiply-accumulate step to a hardware multiplier plus an adder. This multiplier and adder will be re-used for all N steps, so the coefficients must be looked up in a memory. If you have a lot of coefficients, then memory blocks (Block-RAM) are used; the current iteration step i forms the memory address. If you have only a small number of coefficients, then they can also be stored in distributed memory (LUT-RAM) or computed via boolean equations. But even in this case, the coefficients will not be mapped to flip-flops, because their values do not change with time.
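As a hedged illustration of case 2 (the entity, port and signal names here are mine, and the sample delay line that would feed x_i is omitted): one multiplier and one accumulator are reused for all N steps, and the constant coefficient array is indexed by the step counter, so the synthesizer is free to place it in LUTs, LUTRAM or BlockRAM.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity seq_mac is
    generic (N : positive := 8);
    port (clk   : in  std_logic;
          start : in  std_logic;                 -- restarts the accumulation
          x_i   : in  signed(15 downto 0);       -- must present the i-th input sample
          y     : out signed(35 downto 0));
end entity seq_mac;

architecture rtl of seq_mac is
    type coeff_array is array (0 to N - 1) of signed(15 downto 0);
    constant COEFFS : coeff_array := (others => to_signed(1234, 16));  -- placeholder coefficients
    signal acc : signed(35 downto 0) := (others => '0');
    signal i   : natural range 0 to N - 1 := 0;
begin
    process(clk)
    begin
        if rising_edge(clk) then
            if start = '1' then
                acc <= (others => '0');
                i   <= 0;
            else
                -- one multiply-accumulate per clock; COEFFS(i) is the coefficient "memory" lookup
                acc <= acc + x_i * COEFFS(i);
                if i < N - 1 then
                    i <= i + 1;
                end if;
            end if;
        end if;
    end process;

    y <= acc;
end architecture rtl;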
Also can SIGNAL be used to store the coefficients without using RAM cells?
Yes, of course. With a proper synchronous description they will be mapped to flip-flops.

The storage element used:
registers
distributed RAM (LUTRAM)
BlockRAM
...depends on your chosen VHDL description and size.
You should use a constant instead of a signal. Moreover, it could be helpful to use synchronous read operations to infer registered outputs.
Look into the synthesis report to validate the intended description.

Building on Paebbels' answer: it depends on the synthesis tool, and the coefficients can end up in distributed ROM (LUTROM) as well. For example, Xilinx's Vivado synthesis guide (UG901) describes how to infer RAM/ROM.
For your example of a FIR filter, you might have something like:
type coeff_array is array(natural range <>) of std_logic_vector(15 downto 0);
constant coeffs : coeff_array(0 to N-1) := ( x"XXXX", x"XXXX", ..., x"XXXX" );
Now, whether this is a distributed ROM or RAM depends on the tool. A quick test with Vivado shows this construct synthesizes to a sea of gates (just LUT logic). However, it can be forced into a block RAM (aka block ROM) by:
signal coeffs : coeff_array(0 to N-1) := ( x"XXXX", x"XXXX", ..., x"XXXX" );
attribute ROM_STYLE : string;
attribute ROM_STYLE of coeffs : signal is "block";
The means to infer any specific type of structure (LUTs, LUTRAM, LUTROM, block ROM, block RAM) depends upon the tools in question. Run the test through synthesis to see what you get. And look at the synthesis guide for the synthesizer you are using to figure out how to get what you want.
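In my experience most tools also want a registered read before they will actually use a block RAM primitive, so a short fragment on top of the declaration above may help (clk, addr and coeff_q are illustrative names, not from the original answer):
signal addr    : natural range 0 to N-1;
signal coeff_q : std_logic_vector(15 downto 0);
...
process(clk)
begin
    if rising_edge(clk) then
        coeff_q <= coeffs(addr);  -- synchronous read infers the output register a block ROM needs
    end if;
end process;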

Related

Most efficient VHDL for large vector?

I want to be able to have a shift register that does an XOR against another register loaded with some value. The issue is that I wish to do this with a large scale vector, something on the order of thousands of bits wide.
The obvious way to do this in VHDL would be something like
generic (length : integer := 15);
signal shiftreg : std_logic_vector(length downto 0);

process(clk)
begin
    if rising_edge(clk) then
        shiftreg <= shiftreg(length - 1 downto 0) & input;
    end if;
end process;
However, if length here is set to some very high number, attempting to synthesize this becomes a massive undertaking. Since this is a relatively simple structure I imagine it is taking so long because the length is far beyond the number of registers in a single block.
My question is if there is some way to implement a large vector like this in a way that would be quicker to synthesize. For example, is it quicker to use something like
array(length downto 0) of std_logic;
or does a synthesis tool recognize those are equivalent?
Synthesis time is not typically relevant in FPGA design, although area utilization and timing usually are. If your shift register takes most of the resources your target FPGA has, synthesis will take a long time trying to figure out a way to make it work, and likewise builds take longer as you fill up larger parts. For some ballpark, an 80% full design with tight timing in a modern midrange FPGA usually takes about 30 minutes to synthesize and 3 hours to place & route. This will not be significantly affected by coding style if you're still describing the same functionality.
If you describe a shift register (with the same functional features) in VHDL using std_logic_vector, a type you defined as an array of std_logic, or anything else, it will synthesize into the same thing.
In recent-ish Xilinx parts at least, a single LUT can be used as a 32-deep shift register (and two LUTs in a slice combine for 64) as long as you haven't described a reset (synchronous OR asynchronous). You can likewise produce a 1000-deep shift register with just a handful of LUTs.
Now if you're looking to use the whole thousand+ bits of this shift register to xor against some other register, you can't use SRLs (LUT used as a shift register) because only the final bit is accessible as an output. This makes it put the whole thing in registers which may be rather large, and could require more registers than your part has. The key thing here is that you have to think about the scale of the hardware you describe, and whether that's feasible in your target part.
If you want a really deep shift register, block rams can be used to act like shift registers at depths exceeding 100,000 but these have the same issue where you only access the final output.
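As a minimal sketch of the SRL-friendly shape described above (the names and depth are mine): no reset, and only the last tap is read, so the tools can pack the whole thing into shift-register LUTs.
library ieee;
use ieee.std_logic_1164.all;

entity deep_shift is
    generic (DEPTH : positive := 1024);
    port (clk  : in  std_logic;
          din  : in  std_logic;
          dout : out std_logic);
end entity deep_shift;

architecture rtl of deep_shift is
    signal taps : std_logic_vector(DEPTH - 1 downto 0);
begin
    process(clk)
    begin
        if rising_edge(clk) then  -- note: no reset, which is what allows SRL packing
            taps <= taps(DEPTH - 2 downto 0) & din;
        end if;
    end process;

    dout <= taps(DEPTH - 1);  -- only the final bit is used; exposing other taps forces flip-flops
end architecture rtl;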

HLS Tool to Make Maths Simpler on FPGAs

I have a problem which is easier solved with a HLS tool than with writing down the raw VHDL / verilog. Currently I'm using a Xilinx Virtex-7 as I think this has been solved already by some other vendors.
I can use VHDL 2008.
So imagine in VHDL you have many calculations such as:
p1 <= a * b - c;
p2 <= p1 * d - e;
p3 <= p2 * f - g;
p4 <= p2 * p1 - p3;
Currently if I were to write this with IP Cores, it would be four DSP IP cores, and because of the different port widths, I'd have to generate this IP core 4 times. Anytime I make a change to some of these external signals, all the widths would change again. Keeping track of all this resizing is a pain, especially when resizing signed vectors down.
I have a lot of maths and thus a lot of DSP logic. It would be easier to write this block with a HLS tool. Ideally I would like it to handle the widths and bitshift the data accordingly.
Does such a tool exist? Which one would you recommend?
Bonus points:
Do any of these tools handle floating point maths and let you control precision?
There are lots of ways to accomplish your goal. But first to address your points.
Currently if I were to write this with IP Cores, it would be four DSP IP cores, and because of the different port widths, I'd have to generate this IP core 4 times.
Not necessarily. If your inputs a through g are all fixed point, you can use ieee.numeric_std or in VHDL-2008 you can use ieee.fixed_pkg. These will infer DSP cores (such as the DSP48 on Xilinx). For example:
-- Assume a, b, and c are all signed integers (or implicit fixed point)
signal a : signed(7 downto 0);
signal b : signed(7 downto 0);
signal c : signed(7 downto 0);
signal p1 : signed(a'length+b'length downto 0); -- a*b needs a'length + b'length bits; the extra bit gives headroom for subtracting c
...
p1 <= a*b - resize(c, p1'length);
This will imply multipliers and adders.
And this can be similarly done with UFIXED or SFIXED. But you do need to track the bit widths.
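For example, a minimal VHDL-2008 sketch with ieee.fixed_pkg (the widths and names are purely illustrative); the result range follows fixed_pkg's sizing rules, so changing an input width upstream propagates automatically:
library ieee;
use ieee.fixed_pkg.all;  -- VHDL-2008

entity mult_sub is
    port (a, b, c : in  sfixed(3 downto -4);               -- 8-bit signed fixed point, 4 fraction bits
          p1      : out sfixed(3 + 3 + 1 downto -4 - 4));  -- exact width of a*b per fixed_pkg's rules
end entity mult_sub;

architecture rtl of mult_sub is
begin
    -- a*b - c grows one more integer bit than a*b alone, so resize back into p1's range;
    -- whether to truncate or round here is exactly the precision decision discussed below
    p1 <= resize(a*b - c, p1'high, p1'low);
end architecture rtl;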
Also, there is a floating point package (ieee.float_pkg), but I would NOT recommend that for hardware. You are better off timing and resource-wise to implement it in fixed point.
Anytime I make a change to some of these external signals, all the widths would change again. Keeping track of all this resizing is a pain.
You can do this automatically. Look at my example above. You can easily determine widths based on the operations. Multiplications sum the number of bits. Additions add a single bit. So, if I have:
y <= a * b;
Then I can derive the length of y as simply a'length + b'length. It can be done. The issue, however, is bit growth. The chain of operations you describe will grow significantly if you keep full precision. At certain points you will need to truncate or round to reduce the number of bits. This is the hard part, as how much error you can tolerate depends on the algorithm and the expected input data.
I have a lot of maths and thus a lot of DSP logic. It would be easier to write this block with a HLS tool. Ideally I would like it to handle the widths and bitshift the data accordingly.
Automatic handling is the hard part. In VHDL this will not happen (nor in Verilog, for that matter). But you can track it fairly well and have bit widths update as necessary. But it will not automatically handle things like rounding, truncation, and managing error bounds. A DSP engineer should be handling those issues and directing the RTL developer on the appropriate widths and when to round or truncate.
Does such a tool exist? Which one would you recommend?
There are a variety of options to do this at a higher level. None of these are particularly frugal with respect to resources. Matlab has a code generation tool that will convert Matlab models (suitably constructed) into RTL. It will even analyze issues such as rounding, truncation, and determine appropriate bit widths. You can control the precision, but it is fixed point. We've played with it, and found it very far from producing efficient, high-speed code.
Alternatively, Xilinx does have an HLS suite (see Vivado HLS). I'm not all that well versed in the methodology, but as I understand it, it allows writing C code to implement algorithms. The C code is then "synthesized" to something that executes in some sort of execution engine. You still have to interface that C code to RTL infrastructure, and that's a challenge in its own right. The reason we have so far not pursued it heavily (even though we do DSP heavy designs) is that it is a big challenge to simulate both the HLS and RTL together as a system.
In the past I found flopoco useful for generating arbitrary math functions in hardware. If I recall correctly, it supports many types of functions. For instance, it could generate an arithmetic core to compute something like a = 3*sin²(x + pi/3). For these calculations it allows you to specify the overall precision of the inputs/outputs (for floating point / fixed point) or the width of the inputs (integer). The execution frequency and whether or not to pipeline the function can also be specified.
Here is an old tutorial I found on how to use it: tutorial

VHDL concurrent selective assignment synthesis

A real junior question with hopefully a junior answer: regarding one of the main assignment statements of VHDL (the concurrent selective assignment), can anyone explain what a VHDL compiler would synthesise the following description into?
LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;
USE IEEE.numeric_std.ALL;

ENTITY Q2 IS
    PORT (a, b, c, d : IN  std_logic;
          EW_NS      : OUT std_logic);
END ENTITY Q2;

ARCHITECTURE hybrid OF Q2 IS
    SIGNAL INPUT : std_logic_vector(3 DOWNTO 0);
BEGIN
    INPUT <= (a & b & c & d);  -- concatenation

    WITH INPUT SELECT
        EW_NS <= '1' WHEN "0001" | "0010" | "0011" | "0110" | "1011",
                 '0' WHEN OTHERS;
END ARCHITECTURE hybrid;
Why do I ask? Well, I have previously gone about things the wrong way, i.e. describing things in VHDL before making a block diagram of the components needed. I would envisage this being synthesised as a group of AND-gate logic?
Any help would be really helpful.
Thanks D
You need to look at the user guide for your target FPGA, and understand what is contained within one 'logic element' ('slice' in Xilinx terminology). In general an FPGA does not implement combinatorial logic by connecting up discrete gates like AND, OR, etc. Instead, a logic element will contain one or more 'look-up tables', with typically four (but now 6 in some newer devices) inputs. The inputs to this look up table (LUT) are the inputs to your logic function, and the output is one of the outputs of the function. The LUT is then programmed as a ROM, allowing your input signals to function as an address. There is one ROM entry for every possible combination of inputs, with the result being the intended logic function.
A function with several outputs would simply use several of these LUTs in parallel, with the same inputs, one LUT for each of the function's outputs. A function requiring more inputs than the LUT has (say, 5 inputs, where a LUT has only 4) simply combines two LUTs in parallel, using a multiplexer to choose between the outputs of the two LUTs. This final multiplexer uses one of the input signals as its control, and again every possible combination of inputs is accounted for.
This may sound inefficient for creating something simple like an AND gate, but the benefit is that this simple building block (a LUT) can implement absolutely any combinatorial function. It's also worth noting that an FPGA tool chain is extremely good at optimising logic functions in order to simplify them, and to better map them into the FPGA. The LUT provides a highly generic element for these tools to target.
A logic element will also contain some dedicated resources for functions that aren't well suited to the LUT approach. These might include dedicated carry chains for adders, multiplexers for combining the outputs of several LUTs, and registers (most designs are synchronous). LUTs can also sometimes be configured as small shift registers or RAM elements. External to the logic elements, there will be more specific blocks like large multipliers, larger memories, PLLs, etc., none of which can be as efficiently implemented using LUT resources. Again, this will all be explained in the user guide for your target FPGA.
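To make the ROM view concrete for the Q2 code above, the selected assignment is functionally the same as this 16-entry table, which is literally what a 4-input LUT stores (the type and constant names are mine, purely for illustration):
type lut16_t is array (0 to 15) of std_logic;
constant EW_NS_ROM : lut16_t :=
    ('0','1','1','1','0','0','1','0','0','0','0','1','0','0','0','0');
    -- a '1' at addresses 1, 2, 3, 6 and 11: the five selected values of INPUT
...
EW_NS <= EW_NS_ROM(to_integer(unsigned(INPUT)));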
Back in the day, your code would have been implemented as a single 74150 TTL circuit, which is a 16-to-1 mux. You have a 4-bit select (INPUT), and this selects one of 16 inputs to the chip, which is routed to a single output (EW_NS). The 74150 is obsolete and I can't find any datasheets, but it's easy to find diagrams of what an 8-to-1 mux looks like (here, for example). The 16-to-1 is identical, but everything is wider. My old TI databook shows basically exactly the diagram at this link, doubled up.
But - wait. Your problem is easier, because you're not routing real inputs to the output - you're just setting fixed data values. On the '150, you do this by wiring 5 of the 16 inputs to 1, and the remaining 11 to 0. This makes the logic much easier.
The 74150 has basically exactly the same functionality as a 4-input look-up table (where the fixed look-up data is the same as the fixed levels at the '150 inputs), so it's trivial to implement your entire circuit in a single LUT in an FPGA, as per scary_jeff's answer, rather than using a NAND-level implementation. In a proper chip, though, it would be implemented as a sum-of-products, or something similar (exactly what's in the linked diagram). In this case, draw a K-map and find a minimum solution. My 2 minutes on the back of an envelope comes up with three 3-input AND gates driving a 3-input OR gate. I'll leave it as an exercise to you to check this :)

In VHDL... how to count leading zeros of a vector?

I'm working on a VHDL project and I'm facing a problem calculating the length of a vector. I know there is a length attribute of a vector, but this is not the length I'm looking for. For example, I have a std_logic_vector
E : std_logic_vector(7 downto 0);
then
E <= "00011010";
so, len = E'length = 8, but I'm not looking for this. I want to calculate len after discarding the leftmost zeros, so len = 5.
I know that I can use a for loop, checking '0' bits from left to right and stopping when a '1' bit occurs. But that's not efficient, because I have 1024 or more bits and that will slow my circuit. So is there any method or algorithm to calculate the length in an efficient way, such as combinational logic with log(n) levels of gates (where n = number of bits)?
What you do with your "bit counting" is very similar to the logarithm (base 2).
This is commonly used in VHDL to figure out how many bits are required to represent a signal. For example if you want to store up to N elements in RAM, the number of bits required for addressing that RAM is ceil(log2(N)). For this I use:
function log2ceil(m : natural) return natural is
begin  -- note: for log(0) we return 0
    for n in 0 to integer'high loop
        if 2**n >= m then
            return n;
        end if;
    end loop;
end function log2ceil;
Usually, you want to do this at synthesis time with constants, and speed is no concern. But you can also generate FPGA logic, if that's really what you want.
As others have mentioned, a "for" loop in VHDL is just used to generate a lookup table, which may be slow due to long signal paths, but still only takes a single clock. What can happen is that your maximum clock frequency goes down. Usually this is only a problem if you operate on vectors larger than 64 bits (you mentioned 1024 bits) with clocks faster than 100 MHz. Maybe the synthesizer already told you that this is your problem; otherwise I suggest you try it first.
Then you have to split up the operation over multiple clocks, and store some intermediate result into a FF. (I would upfront forget about trying to outsmart the synthesizer by rearranging your code. A lookup-table is a table. Why should it matter how you generate the values in this table? But make sure you tell the synthesizer about "don't care" values if you have them.)
If speed is your concern, use the first clock to check all 16bit blocks in parallel (independent of each other), and then use a second clock cycle to combine the results of all 16bit blocks into a single result. If the amount of FPGA logic is your concern, implement a state machine that checks a single 16bit block at every clock cycle.
But be careful that you don't re-invent the CPU while doing that.
The problem with using a loop is that when you synthesize you might get a very long chain of logic.
Another way to look at your problem is to find the index of the most significant set bit.
To do this you can use a priority encoder. The nice thing about this is you can make a large priority encoder by using smaller priority encoders in a tree structure, so the delay is O(log N) instead of O(N).
Here is a 4 bit priority encoder:
http://en.wikibooks.org/wiki/VHDL_for_FPGA_Design/Priority_Encoder
You can make a 16 bit priority encoder using 5 of these blocks, then a 256 bit encoder from five 16 bit encoders, etc.
But since you have so many bits it is going to be fairly huge.
Well, VHDL is not software; it does not take time to perform an operation like this, it just takes resources from your FPGA.
You can divide your 1024-bit data into 32-bit sections and OR all the bits within each section; this way you check 32 bits at a time. It is not really necessary, since a for loop would work perfectly fine for what you want to do: just write the code, look for the first '1' in the array, stop the loop, and use the loop index as the pointer to the first '1' in your array. I didn't compile this code, but something like this should work for you:
FirstOne <= 1023;  -- default when E contains no '1'
for i in E'range loop  -- E'range scans from the MSB downwards for a (1023 downto 0) vector
    if E(i) = '1' then
        FirstOne <= i;
        exit;
    end if;
end loop;
It will not be such a big block inside an FPGA after all.
Most synthesizers these days support recursive functions. And indeed this will give you complexity comparable to log(N) where N is the number of bits:
Cut your vector into halves
If the top half is all zeros
The leading bit of your answer is '1', low bits depend on the bottom half vector
Otherwise
The leading bit of your answer is '0', low bits depend on the top half vector
Recurse on the half vector of interest chosen above
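A sketch of that recursion as a synthesizable function (the package and function names are mine, and it relies on the recursive-function support mentioned above):
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package clz_pkg is
    function clz(v : std_logic_vector) return natural;
end package clz_pkg;

package body clz_pkg is
    -- Count leading zeros by recursively halving the vector: if the upper half is all
    -- zeros, the answer is its size plus the leading zeros of the lower half; otherwise
    -- only the upper half matters.
    function clz(v : std_logic_vector) return natural is
        constant n    : natural := v'length;
        alias    vv   : std_logic_vector(n - 1 downto 0) is v;  -- normalize the index range
        constant half : natural := n / 2;
    begin
        if n = 0 then
            return 0;
        elsif n = 1 then
            if vv(0) = '0' then
                return 1;
            else
                return 0;
            end if;
        elsif unsigned(vv(n - 1 downto half)) = 0 then
            return (n - half) + clz(vv(half - 1 downto 0));
        else
            return clz(vv(n - 1 downto half));
        end if;
    end function clz;
end package body clz_pkg;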

VHDL - creating a variable number of signals

I'm creating a full adder with a variable number of bits. I've got a component that is a half-adder which takes in three inputs (the two bits to add, and a carry in bit) and gives 2 outputs (one bit output and a carry out bit).
I need to tie the carry out of one half-adder to the carry in of another. And I need to do this a variable number of times (if I'm adding 4 digit numbers, I'll need 4 half adders. If I'm doing 32 bit numbers, I'll need 32 half adders).
I was going to tie the carry outs of one half-adder to the carry in of another using signals, but I don't know how to create a variable number of signals.
I can instantiate a variable number of half-adders using a for-loop in a process, but since signals are defined outside of processes, I can't use a for loop for it. I don't know how I should tie the half-adders together.
The easiest way to write an adder in VHDL is not to worry about full adders and half adders, but just type:
a <= b + c;
where a, b and c are signed or unsigned
95% of the time, the synthesis tools will do a better job than you would.
I think you want variable-width signals, not a variable number of signals.
Your signals need to be std_logic_vector(31 downto 0) for example - and then you wire up the bits of those signals to your half-adders appropriately.
Of course, as those signals are numbers, don't use std_logic_vector; use signed or unsigned (and the ieee.numeric_std library).
And (as Philippe rightly points out) unless this is a learning exercise, just use the + operator.
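If it is a learning exercise, here is a minimal sketch of the generate-based wiring (the full_adder entity is my stand-in for the asker's cell, and all names are illustrative): a single (N+1)-bit signal holds every carry, and a for-generate connects stage i's carry-out to stage i+1's carry-in.
library ieee;
use ieee.std_logic_1164.all;

entity full_adder is
    port (a, b, cin : in  std_logic;
          s, cout   : out std_logic);
end entity full_adder;

architecture rtl of full_adder is
begin
    s    <= a xor b xor cin;
    cout <= (a and b) or (a and cin) or (b and cin);
end architecture rtl;

library ieee;
use ieee.std_logic_1164.all;

entity ripple_adder is
    generic (N : positive := 4);
    port (a, b  : in  std_logic_vector(N - 1 downto 0);
          sum   : out std_logic_vector(N - 1 downto 0);
          c_out : out std_logic);
end entity ripple_adder;

architecture rtl of ripple_adder is
    signal carry : std_logic_vector(N downto 0);  -- one signal of N+1 bits: the "variable number" of carry wires
begin
    carry(0) <= '0';

    gen_stages : for i in 0 to N - 1 generate
        stage : entity work.full_adder
            port map (a => a(i), b => b(i), cin => carry(i),
                      s => sum(i), cout => carry(i + 1));
    end generate gen_stages;

    c_out <= carry(N);
end architecture rtl;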
