HLS Tool to Make Maths Simpler on FPGAs - vhdl

I have a problem that is more easily solved with an HLS tool than by writing raw VHDL / Verilog. I'm currently using a Xilinx Virtex-7, though I think this has already been solved by some other vendors.
I can use VHDL 2008.
So imagine in VHDL you have many calculations such as:
p1 <= a x b - c;
p2 <= p1 x d - e;
p3 <= p2 x f - g;
p4 <= p2 x p1 - p3;
Currently if I were to write this with IP Cores, it would be four DSP IP cores, and because of the different port widths, I'd have to generate this IP core 4 times. Anytime I make a change to some of these external signals, all the widths would change again. Keeping track of all this resizing is a pain, especially when resizing signed vectors down.
I have a lot of maths and thus a lot of DSP logic. It would be easier to write this block with a HLS tool. Ideally I would like it to handle the widths and bitshift the data accordingly.
Does such a tool exist? Which one would you recommend?
Bonus points:
Do any of these tools handle floating point maths and let you control precision?

There are lots of ways to accomplish your goal. But first to address your points.
Currently if I were to write this with IP Cores, it would be four DSP IP cores, and because of the different port widths, I'd have to generate this IP core 4 times.
Not necessarily. If your inputs a through g are all fixed point, you can use ieee.numeric_std or in VHDL-2008 you can use ieee.fixed_pkg. These will infer DSP cores (such as the DSP48 on Xilinx). For example:
-- Assume a, b, and c are all signed integers (or implicit fixed point)
signal a : signed(7 downto 0);
signal b : signed(7 downto 0);
signal c : signed(7 downto 0);
signal p1 : signed(a'length+b'length downto 0); -- a*b is a'length + b'length bits wide; subtracting c adds one more bit, so p1 needs a'length + b'length + 1 bits.
...
p1 <= a*b - resize(c, p1'length);
This will imply multipliers and adders.
And this can be similarly done with UFIXED or SFIXED. But you do need to track the bit widths.
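As a minimal sketch of the fixed-point variant, assuming VHDL-2008 and a hypothetical Q4.4 input format (the entity name and the formats are illustrative only): ieee.fixed_pkg sizes the result of each operator from the operand ranges, so the output range below follows directly from the package rules.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.fixed_pkg.all;  -- VHDL-2008

-- Hypothetical example entity; the Q4.4 input format is an assumption.
entity mul_sub_fixed is
  port (
    clk : in  std_logic;
    a   : in  sfixed(3 downto -4);
    b   : in  sfixed(3 downto -4);
    c   : in  sfixed(3 downto -4);
    -- a * b       has range (3+3+1 downto -4-4)   = (7 downto -8)
    -- (a * b) - c has range (max(7,3)+1 downto -8) = (8 downto -8)
    p1  : out sfixed(8 downto -8)
  );
end entity mul_sub_fixed;

architecture rtl of mul_sub_fixed is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Full-precision result; the binary point is tracked by the index range.
      p1 <= a * b - c;
    end if;
  end process;
end architecture rtl;
```

If the declared range of p1 does not match the range produced by the expression, the tools should flag a length mismatch, which is how the width tracking shows up in practice.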
Also, there is a floating point package (ieee.float_pkg), but I would NOT recommend that for hardware. You are better off timing and resource-wise to implement it in fixed point.
Anytime I make a change to some of these external signals, all the widths would change again. Keeping track of all this resizing is a pain.
You can do this automatically. Look at my example above. You can easily determine widths based on the operations. Multiplications sum the number of bits. Additions add a single bit. So, if I have:
y <= a * b;
Then I can derive the length of y as simply a'length + b'length. It can be done. The issue, however, is bit growth. The chain of operations you describe will grow significantly if you keep full precision. At certain points you will need to truncate or round to reduce the number of bits. This is the hard part: how much error you can tolerate depends on the algorithm and the expected input data.
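To illustrate, here is a rough sketch of two chained stages where every width is derived from generics and the first result is truncated before the next multiply. The entity name, the generic names IN_W and OUT_W, and the choice to keep the top bits are all assumptions for the example, not a recommendation for any particular algorithm.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example; requires OUT_W <= 2*IN_W + 1.
entity mul_chain is
  generic (
    IN_W  : positive := 8;   -- width of every input (assumed equal here)
    OUT_W : positive := 12   -- bits kept from the first stage
  );
  port (
    clk           : in  std_logic;
    a, b, c, d, e : in  signed(IN_W - 1 downto 0);
    p2            : out signed(OUT_W + IN_W downto 0)  -- p1*d - e
  );
end entity mul_chain;

architecture rtl of mul_chain is
  -- a*b is 2*IN_W bits; the subtraction adds one bit.
  signal p1_full : signed(2 * IN_W downto 0);
  signal p1      : signed(OUT_W - 1 downto 0);
begin
  -- Truncation: keep the OUT_W most significant bits and drop the LSBs.
  -- This rescales p1 by 2**(p1_full'length - OUT_W); rounding is not done here.
  p1 <= p1_full(p1_full'high downto p1_full'high - OUT_W + 1);

  process (clk)
  begin
    if rising_edge(clk) then
      p1_full <= a * b - resize(c, p1_full'length);
      p2      <= p1 * d - resize(e, p2'length);
    end if;
  end process;
end architecture rtl;
```

Changing IN_W or OUT_W updates every dependent width, but deciding what OUT_W should be is still the error-analysis problem described above.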
I have a lot of maths and thus a lot of DSP logic. It would be easier to write this block with a HLS tool. Ideally I would like it to handle the widths and bitshift the data accordingly.
Automatic handling is the hard part. In VHDL this will not happen (nor in Verilog, for that matter). You can track widths fairly well and have them update as necessary, but the language will not automatically handle things like rounding, truncation, and managing error bounds. A DSP engineer should be handling those issues and directing the RTL developer on the appropriate widths and when to round or truncate.
Does such a tool exist? Which one would you recommend?
There are a variety of options to do this at a higher level. None of these are particularly frugal with respect to resources. Matlab has a code generation tool that will convert Matlab models (suitably constructed) into RTL. It will even analyze issues such as rounding, truncation, and determine appropriate bit widths. You can control the precision, but it is fixed point. We've played with it, and found it very far from producing efficient, high-speed code.
Alternatively, Xilinx does have an HLS suite (see Vivado). I'm not all that well versed in the methodology, but as I understand it, it allows writing C code to implement algorithms. The C code is then "synthesized" to something that executes in some sort of execution engine. You still have to interface that C code to RTL infrastructure, and that's a challenge in its own right. The reason we have so far not pursued it heavily (even though we do DSP heavy designs) is that it is a big challenge to simulate both the HLS and RTL together as a system.

In the past I found FloPoCo, which generates arbitrary math functions in hardware. If I recall correctly, it supports many types of functions. For instance, it could generate an arithmetic core to compute something like a = 3*sin²(x + pi/3). For these calculations it allows you to specify the overall precision of the inputs/outputs (for floating point/fixed point) or the width of the inputs (integer). Execution frequency and whether or not to pipeline the function can also be specified.
Here is an old tutorial I found on how to use it: tutorial

Related

Most efficient VHDL for large vector?

I want to be able to have a shift register that does an XOR against another register loaded with some value. The issue is that I wish to do this with a large scale vector, something on the order of thousands of bits wide.
The obvious way to do this in VHDL would be something like
generic( length : integer := 15);
signal shiftreg : std_logic_vector(length downto 0);
process(clk)
begin
  if rising_edge(clk) then
    shiftreg <= shiftreg(length-1 downto 0) & input;
  end if;
end process;
However, if length here is set to some very high number, attempting to synthesize this becomes a massive undertaking. Since this is a relatively simple structure I imagine it is taking so long because the length is far beyond the number of registers in a single block.
My question is if there is some way to implement a large vector like this in a way that would be quicker to synthesize. For example, is it quicker to use something like
array(length downto 0) of std_logic;
or does a synthesis tool recognize those are equivalent?
Synthesis time is not typically relevant in FPGA design, although area utilization and timing usually is. If your shift register takes most of the resources that your target FPGA has, synthesis will take a long time trying to figure out a way to make it work, and likewise builds take longer as you fill up larger parts. For some ballpark, an 80% full design with tight timing in a modern midrange FPGA usually takes about 30 minutes to synthesize and 3 hours to place&route. This will not be significantly affected by coding style if you're still describing the same functionality.
If you describe a shift register (with the same functional features) in VHDL using std_logic_vector, a type you defined as an array of std_logic, or anything else, it will synthesize into the same thing.
In recent-ish Xilinx parts at least, a single LUT can be used for a 32-deep shift register as long as you haven't described a reset (synchronous OR asynchronous). You can likewise produce a 1000 deep shift register with only a few dozen LUTs.
Now if you're looking to use the whole thousand+ bits of this shift register to xor against some other register, you can't use SRLs (LUT used as a shift register) because only the final bit is accessible as an output. This makes it put the whole thing in registers which may be rather large, and could require more registers than your part has. The key thing here is that you have to think about the scale of the hardware you describe, and whether that's feasible in your target part.
If you want a really deep shift register, block rams can be used to act like shift registers at depths exceeding 100,000 but these have the same issue where you only access the final output.
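As a sketch of the distinction, the following describes a deep shift register with no reset and only the final bit used, which is the form that synthesis can typically map onto LUT-based shift registers. The entity name and the DEPTH generic are made up for the example, and whether a given tool actually infers SRLs here is something to confirm in the synthesis report.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example entity; DEPTH is an assumed generic.
entity deep_shift is
  generic (DEPTH : positive := 1024);
  port (
    clk  : in  std_logic;
    din  : in  std_logic;
    dout : out std_logic
  );
end entity deep_shift;

architecture rtl of deep_shift is
  signal sr : std_logic_vector(DEPTH - 1 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- No reset described, so the tool stays free to use LUT shift registers.
      sr <= sr(DEPTH - 2 downto 0) & din;
    end if;
  end process;

  -- Only the last stage is read. Reading every bit of sr (for example,
  -- xor-ing the whole vector with another register) forces flip-flops.
  dout <= sr(DEPTH - 1);
end architecture rtl;
```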

Does the CONSTANT declaration store the values in Block-RAM or in flipflops of an FPGA?

For example, if I want to store my filter coefficients in n-Tap FIR filters using constants, will the CONSTANT declaration store my values in Block RAMs or registers using FPGA flipflops? Also can SIGNAL be used to store the coefficients without using RAM cells?
The constants themselves aren't "stored" anywhere - their values are simply substituted into the VHDL code where you use them.
Where they're stored depends on how you use them and how the code is optimized.
If you're multiplying a signal by a constant two, for example, no elements are used at all - the data bus will be simply connected in a way that effectively shifts the value left by one bit.
Or, they may end up as hard-wired inputs to other elements like multipliers in your case.
Either way, you should look into the synthesis results to thoroughly understand the generated RTL.
[...] will the CONSTANT declaration store my values in Block RAMs or registers using FPGA flipflops?
Whether constants are stored in memory blocks or registers, or are merged into the Boolean equations, depends on your implementation of an algorithm. Let's have a look at the following mathematical equation (not VHDL code):
y = c_1 * x_1 + c_2 * x_2 + c_3 * x_3 +... + c_N * x_N
N is the number of coefficients, x_i are the input values, and c_i are the constant coefficients.
You can implement this equation in VHDL / hardware by:
1. N parallel multipliers and an adder tree to sum up the products; all done combinationally within one clock cycle, or pipelined with a throughput of one result per clock cycle.
2. N sequentially executed multiply-accumulate steps, with one multiply-accumulate per clock cycle.
You can even use a combination of both.
In case 1, the synthesizer optimizes each multiplication with a constant:
just wiring if the coefficient is a power of two,
addition if the binary representation of the coefficient contains a small number of ones (5*x = x + 4*x),
or multiplier hard macro with a constant value (VDD, GND) connected to one of its inputs.
Thus, in case 1 no memory or registers are required to store the constants.
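A small sketch of the 5*x = x + 4*x case, written out by hand to show what the synthesizer is expected to reduce a constant multiply to. The entity name and the 8-bit input width are assumptions for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity.
entity times5 is
  port (
    x : in  unsigned(7 downto 0);
    y : out unsigned(10 downto 0)  -- 5 * 255 = 1275 fits in 11 bits
  );
end entity times5;

architecture rtl of times5 is
begin
  -- 5*x = x + 4*x; the 4*x term is a left shift by two (wiring only),
  -- so a single adder remains and no coefficient storage is needed.
  y <= resize(x, y'length) + shift_left(resize(x, y'length), 2);
end architecture rtl;
```

In practice you would normally just write the multiplication by the constant and let the tool pick this form.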
In case 2, the synthesizer will map the multiply-accumulate step to a hardware multiplier plus an adder. This multiplier and adder will be re-used for all N steps, so the coefficients must be looked up in a memory. If you have a lot of coefficients, then memory blocks (Block-RAM) are used. The current iteration step i will make up the memory address. If you have only a small number of coefficients, then they can also be stored in distributed memory (LUT-RAM) or computed via Boolean equations. But even in this case, the coefficients will not be mapped to flip-flops because their values do not change with time.
Also can SIGNAL be used to store the coefficients without using RAM cells?
Yes, of course. With a proper synchronous description they will be mapped to flip-flops.
The used storage element:
registers
distributed RAM (LUTRAM)
BlockRAM
...depends on your chosen VHDL description and size.
You should use a constant instead of a signal. Moreover, it could be helpful to use synchronous read operations to infer registered outputs.
Look into the synthesis report to validate the intended description.
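A minimal sketch of a constant coefficient table with a synchronous read, assuming four 18-bit coefficients whose values are invented for the example. Which storage element results (LUT logic, LUTROM, or block ROM) depends on the tool and the table size, so the synthesis report is still the place to check.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity; coefficient values are placeholders.
entity coeff_rom is
  port (
    clk  : in  std_logic;
    addr : in  unsigned(1 downto 0);
    q    : out signed(17 downto 0)
  );
end entity coeff_rom;

architecture rtl of coeff_rom is
  type coeff_array is array (0 to 3) of signed(17 downto 0);
  constant COEFFS : coeff_array := (
    to_signed( 100, 18),
    to_signed(-250, 18),
    to_signed(1000, 18),
    to_signed(-250, 18)
  );
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Synchronous read: the output is registered, which is the usual
      -- template synthesis tools look for when inferring ROM.
      q <= COEFFS(to_integer(addr));
    end if;
  end process;
end architecture rtl;
```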
Building on Paebbels' answer: it depends. Constants can also be implemented in distributed ROM (LUTROM), and which one you get depends on the synthesis tool. For example, Xilinx's Vivado synthesis guide (UG901) describes how to infer RAM/ROM.
For your example of a FIR filter, you might have something like:
type coeff_array is array(natural range<>) of std_logic_vector(17 downto 0);
constant coeffs : coeff_array(0 to N-1) := ( x"XXXX", x"XXXX", ..., x"XXXX" );
Now, whether this is a distributed ROM or RAM depends on the tool. A quick test with Vivado shows this construct synthesizes to a sea of gates (just LUT logic). However, it can be forced into a block RAM (aka block ROM) by:
signal coeffs : coeff_array(0 to N-1) := ( x"XXXX", x"XXXX", ..., x"XXXX" );
attribute ROM_STYLE : string;
attribute ROM_STYLE of coeffs : signal is "block";
The means to infer any specific type of structure (LUTs, LUTRAM, LUTROM, block ROM, block RAM) depends upon the tools in question. Run the test through synthesis to see what you get. And look at the synthesis guide for the synthesizer you are using to figure out how to get what you want.

VHDL concurrent selective assignment synthesis

a real junior question with hopefully a junior answer, regarding one of the main assignments of VHDL (concurrent selective assignment) can anyone explain what a VHDL compiler would synthesise the following description into?
LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;
USE IEEE.numeric_std.ALL;
ENTITY Q2 IS
PORT (a,b,c,d : IN std_logic;
EW_NS : OUT std_logic
);
END ENTITY Q2;
ARCHITECTURE hybrid OF Q2 IS
SIGNAL INPUT : std_logic_vector(3 DOWNTO 0);
BEGIN
INPUT <= (a & b & c & d); -- concatenation
WITH (INPUT) SELECT
EW_NS <= '1' WHEN "0001"|"0010"|"0011"|"0110"|"1011",
'0' WHEN OTHERS;
END ARCHITECTURE hybrid;
Why do I ask? Well, I have previously gone about things the wrong way, i.e. describing things in VHDL before making a block diagram of the components needed. I would envisage this being synthesised as a group of AND gate logic?
Any help would be really helpful.
Thanks D
You need to look at the user guide for your target FPGA, and understand what is contained within one 'logic element' ('slice' in Xilinx terminology). In general an FPGA does not implement combinatorial logic by connecting up discrete gates like AND, OR, etc. Instead, a logic element will contain one or more 'look-up tables', with typically four (but now 6 in some newer devices) inputs. The inputs to this look up table (LUT) are the inputs to your logic function, and the output is one of the outputs of the function. The LUT is then programmed as a ROM, allowing your input signals to function as an address. There is one ROM entry for every possible combination of inputs, with the result being the intended logic function.
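To make the "LUT as a ROM" idea concrete, here is a sketch that writes the question's function as a 16-entry table addressed by the four inputs. The entity name and the LUT_INIT constant are hypothetical names; the table contents are taken from the five '1' cases in the question (0001, 0010, 0011, 0110, 1011).

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity mirroring Q2 from the question.
entity q2_lut is
  port (
    a, b, c, d : in  std_logic;
    ew_ns      : out std_logic
  );
end entity q2_lut;

architecture rom of q2_lut is
  -- One entry per input combination, index 0 to 15.
  -- Entries 1, 2, 3, 6 and 11 are '1'; all others are '0'.
  constant LUT_INIT : std_logic_vector(0 to 15) := "0111001000010000";
  signal input : std_logic_vector(3 downto 0);
begin
  input <= a & b & c & d;
  -- The inputs act as the address into the table.
  ew_ns <= LUT_INIT(to_integer(unsigned(input)));
end architecture rom;
```

This should synthesize to the same single 4-input LUT as the WITH ... SELECT version; it only makes the table explicit.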
A function with several outputs would simply use several of these LUTs in parallel, with the same inputs, one LUT for each of the function's outputs. A function requiring more inputs than the LUT has (say, 5 inputs, where a LUT has only 4), simply combines two LUTs in parallel, using a multiplexer to choose between the output of the two LUTs. This final multiplexer uses one of the input signals as its control, and again every possible combination of inputs is accounted for.
This may sound inefficient for creating something simple like an AND gate, but the benefit is that this simple building block (a LUT) can implement absolutely any combinatorial function. It's also worth noting that an FPGA tool chain is extremely good at optimising logic functions in order to simplify them, and to better map them into the FPGA. The LUT provides a highly generic element for these tools to target.
A logic element will also contain some dedicated resources for functions that aren't well suited to the LUT approach. These might include dedicated carry chains for adders, multiplexers for combining the output of several LUTS, registers (most designs are synchronous). LUTs can also sometimes be configured as small shift registers or RAM elements. External to the logic elements, there will be more specific blocks like large multipliers, larger memories, PLLs, etc, none of which can be as efficiently implemented using LUT resource. Again, this will all be explained in the user guide for your target FPGA.
Back in the day, your code would have been implemented as a single 74150 TTL circuit, which is a 16-to-1 mux. You have a 4-bit select (INPUT), and this selects one of 16 inputs to the chip, which is routed to a single output (EW_NS). The 74150 is obsolete and I can't find any datasheets, but it's easy to find diagrams of what an 8-to-1 mux looks like (here, for example). The 16->1 is identical, but everything is wider. My old TI databook shows basically exactly the diagram at this link doubled up.
But - wait. Your problem is easier, because you're not routing real inputs to the output - you're just setting fixed data values. On the '150, you do this by wiring 5 of the 16 inputs to 1, and the remaining 11 to 0. This makes the logic much easier.
The 74150 has basically exactly the same functionality as a 4-input look-up table (where the fixed look-up data is the same as fixed levels at the '150 inputs), so it's trivial to implement your entire circuit in a single LUT in an FPGA, as per scary_jeff's answer, rather than using a NAND-level implementation. In a proper chip, though, it would be implemented as a sum-of-products, or something similar (exactly what's in the linked diagram). In this case, draw a K-map and find a minimum solution. My 2 minutes on the back of an envelope comes up with three 3-input AND gates, driving a 3-input OR gate. I'll leave it as an exercise to you to check this :)

ODDR2 usage found in auto-generated xilinx wrapper VHDL file

I'm using the TEMAC IP core to generate a 1gb ethernet MAC, and came across an interesting piece of code:
-- DDR logic is used for this purpose to ensure that clock routing/timing to the pin is
-- balanced as part of the clock tree
not_rx_clk_int <= not (rx_clk_int);
rx_clk_ddr : ODDR2
port map (
Q => rx_clk,
C0 => rx_clk_int,
C1 => not_rx_clk_int,
CE => '1',
D0 => '1',
D1 => '0',
R => reset,
S => '0'
);
So according to my understanding, what's happening here is that a "new" clock is being generated by two clocks that are 180 degrees out of phase by using each clock as a select line input to the mux. (See very useful diagram below taken from page 64 in this document!)
When C0 is '1' then Q <= D0 which gives rx_clk <= '1', and if C1 is '1' then Q <= D1 which gives rx_clk <= '0'. During reset both flipflops are reset giving rx_clk <= '0' while reset = '1'
So I have a few questions:
Are the two clocks (not_rx_clk_int and rx_clk_int) going to be precisely 180 degrees out of phase when generated in this way? (by this way, I mean not_rx_clk_int <= not (rx_clk_int)). I assume not due to delta time? What are the implications of this?
What is the benefit of using the ODDR2 in the first place (why isn't rx_clk <= rx_clk_int adequate)? (Which leads to...)
What does it mean for a clock to be "balanced" as part of the clock tree? (clock tree mentioned briefly on page 59 here.)
Isn't rx_clk being gated during reset? Isn't this bad?
Is this the "standard" way of using a ODDR2 and/or performing this operation? Are there better options? (and hence, should I add this to my arsenal of useful VHDL bits and pieces? )
Feel free to suggest recommended reading and/or other resources. I don't want to blindly copy/paste this code into my project without knowing exactly what's going on here.
1) Are the two clocks (not_rx_clk_int and rx_clk_int) going to be precisely 180 degrees out of phase when generated in this way? (by this way, I mean not_rx_clk_int <= not (rx_clk_int)). I assume not due to delta time? What are the implications of this?
Yes, they will be pretty well exactly phased.
Delta-delays are not at issue here. They only apply to HDL simulations, standing in place of unknown "real" delays. I would hope that Xilinx got their model correct so that both edges change in the same delta cycle! ie. they do something like:
not_rx_clk <= not (rx_clk_int);
rx_clk <= rx_clk_int;
to match the deltas.
2) What is the benefit of using the ODDR2 in the first place (why isn't rx_clk <= rx_clk_int adequate)? (Which leads to...)
It ensures that the delay is predictable relative to the other IOs that you no doubt have synchronised with this clock. If you just drive the clock signal out of a pin, it has to come off the clock distribution network, through some routing, and then to the pin (as there's no direct route for a clock net to get to the IO pin). That's a delay which is unpredictable and likely to vary from one compile to another.
3) What does it mean for a clock to be "balanced" as part of the clock tree? (clock tree mentioned briefly on page 59 here.)
As I understand it, it means that the clock tree makes sure that the clock goes the same distance (approximately) to every destination.
4) Isn't rx_clk being gated during reset? Isn't this bad?
Yes it is being turned on and off (I'd hesitate to use the word 'gated' as that means a specific thing - being fed through an AND gate - which this isn't). Only you can say if that matters - it depends on where it goes to.
5) Is this the "standard" way of using a ODDR2 and/or performing this operation? Are there better options? (and hence, should I add this to my arsenal of useful VHDL bits and pieces? )
Three questions in one, sneaky :)
Yes, it's (a) standard way of using ODDR2 (the other standard use is for actual DDR data of course).
No, I don't know of a better way to simply get a clock out.
Yes, add it to your arsenal.
Partial set of answers:
1) I'm amused by the unnecessary brackets in not (rx_clk_int); like a lot of Xilinx cores, it makes me wonder if it's auto-translated from Verilog or something; there's a lot of really bad VHDL in some of them. (So I'm easily amused.) Anyway...
Synthesis tools probably optimise out the separate "not" and use the falling edge of rx_clk_int, so you certainly can get 180 degree phase shift this way. (Whether it's guaranteed, or a more complex expression might fool synthesis, I can't say).
2) Straight assignment would take rx_clk_int off the clock tree, onto ordinary routing, through an output buffer and the total delay would be anybody's guess. This way you have precisely timed clocks directly in the IOB for more predictable timing.
3) FFs and IOBs right next to the clock gen don't see the clock before the ones in the far corner; balancing the clock tree is slowing up all the short paths to match the longest one. (You can see this on DIMM memory PCBs, a lot of zigzag lines on traces to lengthen them!)
4) I would expect it to be gated. Whether that's bad depends on what it's clocking. Perhaps an Ethernet expert can chip in here. Or chase the logic driving "reset" to this block; it may not be the main system reset, to fix this issue.
5) It's certainly a fairly well known trick on newer FPGAs (ones with DDR regs), and very useful for clocks in addition to their main purpose (DDR interfaces to memory etc). Keep it handy!

VHDL - creating a variable number of signals

I'm creating a full adder with a variable number of bits. I've got a component that is a half-adder which takes in three inputs (the two bits to add, and a carry in bit) and gives 2 outputs (one bit output and a carry out bit).
I need to tie the carry out of one half-adder to the carry in of another. And I need to do this a variable number of times (if I'm adding 4 digit numbers, I'll need 4 half adders. If I'm doing 32 bit numbers, I'll need 32 half adders).
I was going to tie the carry outs of one half-adder to the carry in of another using signals, but I don't know how to create a variable number of signals.
I can instantiate a variable number of half-adders using a for-loop in a process, but since signals are defined outside of processes, I can't use a for loop for it. I don't know how I should tie the half-adders together.
The easiest way to write an adder in VHDL is not to worry about full adders and half adders, but just type:
a <= b + c;
where a,b and c are signed or unsigned
95% of the time, the synthesis tools will do a better job than you would.
I think you want variable-width signals, not a variable number of signals.
Your signals need to be std_logic_vector(31 downto 0) for example - and then you wire up the bits of those signals to your half-adders appropriately.
Of course, as those signals are numbers, don't use std_logic_vector; use signed or unsigned (from the ieee.numeric_std package).
And (as Philippe rightly points out) unless this is a learning exercise, just use the + operator.
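If it is a learning exercise, the wiring can be sketched with a for-generate statement and a single carry vector one bit wider than the operands. Everything named here (ripple_adder, WIDTH, carry, gen_bits) is hypothetical, and the one-bit adder stage is written inline; in your design the two assignments inside the generate would be replaced by an instantiation of your own adder component.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example entity; WIDTH sets the number of adder stages.
entity ripple_adder is
  generic (WIDTH : positive := 4);
  port (
    a, b : in  std_logic_vector(WIDTH - 1 downto 0);
    cin  : in  std_logic;
    sum  : out std_logic_vector(WIDTH - 1 downto 0);
    cout : out std_logic
  );
end entity ripple_adder;

architecture structural of ripple_adder is
  -- carry(i) is the carry into bit i; carry(WIDTH) is the final carry out.
  signal carry : std_logic_vector(WIDTH downto 0);
begin
  carry(0) <= cin;

  gen_bits : for i in 0 to WIDTH - 1 generate
    -- One-bit adder stage with carry in and carry out.
    sum(i)       <= a(i) xor b(i) xor carry(i);
    carry(i + 1) <= (a(i) and b(i)) or (carry(i) and (a(i) xor b(i)));
  end generate gen_bits;

  cout <= carry(WIDTH);
end architecture structural;
```

The generate statement is concurrent, so it lives in the architecture body rather than inside a process, and the carry vector scales with WIDTH without declaring individual signals.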

Resources