How can I speed up my math operations in VHDL?

I have some calculations running on the rising edge of a 75 MHz pixel clock to output 720p video on screen. Some of the math (a few modulo operations, for example) takes too long: 20+ ns, whereas the 75 MHz clock period is only 13.3 ns, so my timing constraints are not met. I'm new to FPGAs, but I'm wondering if, for example, there is a way to run the calculations at a faster speed than the current pixel clock so that they are completed by the next tick of the 75 MHz clock. I'm using VHDL, by the way.

75 MHz is already quite slow by today's FPGA standards.
The problem is the modulo operation, which effectively involves division; and division is slow.
Think carefully about the operations you need, and whether there is any way to reorganise the computation. If you are clocking pixels, it's not as if you have 32-bit integers to deal with; restricted ranges are much easier to handle.
Martin hinted at one option: strength reduction. If you have 1280 pixels/line and need to operate on every third one, you don't need to compute 1280 mod 3! Count 0,1,2,0,... instead.
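As a rough illustration (the signal and clock names below are my own, not from the question), such a counter might look like this:
-- Hypothetical sketch: keep a 0,1,2,0,... counter in step with the pixel
-- counter instead of ever computing "mod 3" in hardware.
-- Declared in the architecture: signal mod3 : integer range 0 to 2 := 0;
process (PixClk)
begin
   if rising_edge(PixClk) then
      if line_start = '1' then   -- re-align at the start of each line
         mod3 <= 0;
      elsif mod3 = 2 then
         mod3 <= 0;              -- wrap instead of dividing
      else
         mod3 <= mod3 + 1;
      end if;
   end if;
end process;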
Another, if you need modulo-3 of an 8-bit (or 12-bit) number is to store all possible values in a lookup table, which will be fast enough.
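A minimal sketch of such a table (names are mine; it uses ieee.numeric_std and belongs in the architecture's declarative region). The mod is evaluated at elaboration time, so no divider ends up in the hardware:
type mod3_lut_t is array (0 to 255) of unsigned(1 downto 0);

function init_mod3_lut return mod3_lut_t is
   variable lut : mod3_lut_t;
begin
   for i in lut'range loop
      lut(i) := to_unsigned(i mod 3, 2);  -- computed at elaboration, not in hardware
   end loop;
   return lut;
end function init_mod3_lut;

constant MOD3_LUT : mod3_lut_t := init_mod3_lut;

-- usage: x_mod3 <= MOD3_LUT(to_integer(unsigned(x)));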
Or sometimes you can multiply by 1/3 (X"5555") instead of dividing by 3, then multiply the quotient by 3 (which is a single addition) and subtract it from the original value to get the modulo. This pipelines really well, but since X"5555" is only an approximation to 1/3 you need to verify in simulation that it delivers the correct output for every input. (For 16-bit inputs, this isn't a big simulation!) The extension to modulo 9 is easy.
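Here is a sketch of that idea as a small 3-stage pipeline (entity and signal names invented). It uses X"5556" = ceil(2**16/3) rather than X"5555", which makes the 16-bit-shifted quotient exact for inputs below 2**15 - so certainly for the 12-bit input assumed here - but, as said above, verify your own constant and widths in simulation:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mod3_pipelined is
   port (
      clk    : in  std_logic;
      x      : in  unsigned(11 downto 0);   -- e.g. a pixel coordinate
      x_mod3 : out unsigned(1 downto 0)
   );
end entity;

architecture rtl of mod3_pipelined is
   constant RECIP : unsigned(15 downto 0) := x"5556";  -- ~ (2**16)/3
   signal x_d1, x_d2 : unsigned(11 downto 0);
   signal prod       : unsigned(27 downto 0);
   signal quot       : unsigned(11 downto 0);
begin
   process (clk)
   begin
      if rising_edge(clk) then
         -- cycle 1: multiply by the scaled reciprocal
         prod <= x * RECIP;
         x_d1 <= x;
         -- cycle 2: the bits above the 16-bit point are the quotient x/3
         quot <= prod(27 downto 16);
         x_d2 <= x_d1;
         -- cycle 3: remainder = x - 3*quot; 3*quot is a shift plus one addition
         x_mod3 <= resize(x_d2 - ((quot & '0') + quot), 2);
      end if;
   end process;
end architecture;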
EDIT:
Two points from your comments: another option you have is to create a 2x clock (150 MHz) using the Spartan's clock generators, which gives you two cycles per pixel. Well-pipelined code should meet 150 MHz without much trouble.
How not to pipeline!
PROCESS(Clk)
BEGIN
   if(rising_edge(Clk)) then
      for i in 0 to 2 loop
         case i is
            when 0 => temp1  <= a*data;
            when 1 => temp2  <= temp1*b;
            when 2 => result <= temp2*c;
            when others => null;
         end case;
      end loop;
   end if;
END PROCESS;
The first thing to realise is that the loop and case statement cancel each other out, so this simplifies to
PROCESS(Clk)
BEGIN
   if rising_edge(Clk) then
      temp1  <= a*data;
      temp2  <= temp1*b;
      result <= temp2*c;
   end if;
END PROCESS;
which is buggy! The testbench, also being buggy, hides the problem.
In cycle 1, Data,a,b,c are presented, and temp1 = Data*a is computed.
In cycle 2, temp1 is multiplied by a NEW value of b instead of the correct one!
Same again in cycle 3!
Since the testbench sets the inputs and leaves them constant, it won't catch the problem!
PROCESS(Clk)
BEGIN
   if rising_edge(Clk) then
      -- cycle 1
      temp1   <= a*data;
      b_copy  <= b;
      c_copy1 <= c;
      -- cycle 2
      temp2   <= temp1*b_copy;
      c_copy2 <= c_copy1;
      -- cycle 3
      result  <= temp2*c_copy2;
   end if;
END PROCESS;
I like to comment each cycle; every term I use in a cycle must come from the immediately preceding cycle, either by calculation or from a copy.
At least this works, but it could be reduced to a depth of 2 cycles and fewer copy registers because, in this example, the four inputs are independent (and I am assuming there are no measures required to avoid overflow). So:
PROCESS(Clk)
BEGIN
   if rising_edge(Clk) then
      -- cycle 1
      temp1 <= a * data;
      temp2 <= b * c;
      -- cycle 2
      result <= temp1 * temp2;
   end if;
END PROCESS;

Here are some techniques:
Pipelining - split the logic up to operate over multiple clock cycles
multi-cycle path - if you don't need the answer every cycle, you can tell the tools that it's OK for it to take longer. Care is required not to tell the tools the wrong thing though!
Think again - for example, do you really need to do x mod 3 on very wide x, or could you use a continuously updated modulo 3 counter?
Use better tools - I've had instances where an expensive synthesizer could meet timing on a deep logic path where the vendor's synthesizer could not, for the same code.
More extreme solutions involve changing the silicon, for a faster device, or a newer device, or a newer, faster device.

Usually, complex math operations in FPGAs are pipelined. Pipelining means you divide your operation into stages. Let's say you have a multiplier that takes too long at your clock speed. You divide the multiplier into 3 stages: it now consists of three parts (each with its own clock input) chained one after another. Each of these parts is smaller than the whole, so it has a smaller delay, and you can therefore use a faster clock.
A drawback of this is latency: your pipelined system produces its output with a delay. In the multiplier example above, to get the correct output you have to wait until your input has passed through all 3 stages. But this latency is usually very small (depending on your design, of course) and can often be ignored.
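As a rough sketch of the idea (names are mine), simply wrapping a multiplier in register stages already gives the tools something to work with; most synthesizers will pull these registers into the DSP block's internal pipeline:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mult_pipe3 is
   port (
      clk : in  std_logic;
      a   : in  unsigned(17 downto 0);
      b   : in  unsigned(17 downto 0);
      p   : out unsigned(35 downto 0)   -- valid 3 clocks after a and b
   );
end entity;

architecture rtl of mult_pipe3 is
   signal a_r, b_r : unsigned(17 downto 0);
   signal p_r      : unsigned(35 downto 0);
begin
   process (clk)
   begin
      if rising_edge(clk) then
         a_r <= a;           -- stage 1: register the operands
         b_r <= b;
         p_r <= a_r * b_r;   -- stage 2: register the raw product
         p   <= p_r;         -- stage 3: register the output
      end if;
   end process;
end architecture;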
Here is a good (!) post about this: http://vhdlguru.blogspot.com/2011/01/what-is-pipelining-explanation-with.html EDIT: See Brian's post instead.
Also vendors usually ship optimized and pipelined versions of math operations as IP cores in their design software. Look for them.

Related

Handling very large vector in VHDL

Similarly to the question asked here, I face issues managing very large vectors in an FPGA, and nothing in the previous topic really helped. I have a 2^15-bit-wide sample, and I want to do something similar to a bitwise autocorrelation of this sample by shifting, xoring and adding all the bits of this vector.
One of my constraints is a fast execution time (less than 100 ms with a 12 MHz clock).
My way to do it for now is to use an FSM in which one state is responsible for processing each iteration of the bitwise autocorrelation, and a following state compares the current result to the minimum value so far (the minimum value reflecting the fundamental frequency of the sample). To do so, I use a for loop (not a big fan of this kind of structure usually...). Here is a piece of my code:
when S_BACF =>
   count_ones := (others => '0');
   for i in 0 to 32766 loop
      count_ones := count_ones + (sample(i) xor sample((index+i) mod 32767));
   end loop;
   s_nb_of_ones_current <= count_ones;
   current_state <= S_COMP;
when S_COMP =>
   if index >= 19999 then
      index <= 0;
      current_state <= S_END;
   else
      index <= index + 1;
      current_state <= S_BACF;
      if s_nb_of_ones_saved > s_nb_of_ones_current then
         s_nb_of_ones_saved <= s_nb_of_ones_current;
         s_min_rank <= std_logic_vector(to_unsigned(index,15));
      end if;
   end if;
Sorry I don't define each signal/variable, but I think it is transparent enough. My index doesn't need to go beyond 20000 (for this application).
I think this way of processing data is quite efficient in register use ("just" one vector of 2^15 bits and a 15-bit vector for the fundamental frequency).
In simulation, it works great. BUT, as expected, the synthesis fails, even for quite big targets. The for loop, though efficient in theory, cannot be synthesised with such a big depth.
I imagined splitting my data into several smaller pieces, but the shifting-xoring operation through different smaller vectors is a nightmare.
I also thought about using the internal RAM of my FPGA to save my sample, in order to reduce the register usage, but it won't be effective against the for loop synthesis issue.
So, do any of you have a good idea for making the synthesis successful?

Extreme pipelining in VHDL?

I was wondering which of the following designs is faster, i.e., can operate at a higher Fmax:
-- Pipelined
if crd_h = scan_end_h(vt)-1 then
   rst_h <= '1';
end if;
if crd_v = scan_end_v(vt) then
   rst_v <= '1';
end if;
if rst_h = '1' then
   crd_h <= 0;
   rst_h <= '0';
   if rst_v = '1' then
      crd_v <= 0;
      rst_v <= '0';
   else
      crd_v <= crd_v + 1;
   end if;
else
   crd_h <= crd_h + 1;
end if;
Here the loop ends are checked in the "previous" cycle and applied in the following one through the rst feedback signals.
Compared to the less pipelined approach:
-- NOT Pipelined
if crd_h = scan_end_h(vt) then
   crd_h <= 0;
   if crd_v = scan_end_v(vt) then
      crd_v <= 0;
   else
      crd_v <= crd_v + 1;
   end if;
else
   crd_h <= crd_h + 1;
end if;
The idea in the first implementation is to avoid coupling the arithmetic in the comparison with the arithmetic in the increment. In the second implementation, on the other hand, both operations can be done in parallel, and the result of one will MUX the other. Will that be as fast as having the MUX control bit ready from the previous cycle (as in the first implementation)?
Thanks!
To start with, 'faster' is not the best word to use here, because it could be interpreted as 'throughput', 'latency', or 'Fmax'. These three goals might require different approaches.
Ultimately, whether you need to implement more pipelining or not should be driven by your design specification and constraints. If you only need to run at 20 MHz, set up constraints for this, and see if your design passes timing. If it does, then there's not much point putting the effort into optimising the design.
Assuming your design does not meet timing, your FPGA implementation tool should be able to produce a timing report, and this should tell you which parts of your design are the limiting factor. You can then focus on optimising these sections of your design.
More generally, to understand whether a process will benefit from pipelining from an Fmax perspective, you need to understand the underlying building block, often known as a 'slice', that the FPGA tools are going to use to implement your design. In general, if a sequential function cannot fit inside one slice, it could benefit from pipelining. Whether or not the process 'fits' will largely be determined by the number of inputs it has. Note that for a process operating with n-bit data, it may be possible to describe it as n processes that each work with 1-bit data, reducing the number of inputs for the purposes of this analysis. Also note that some types of process, for example adders, can efficiently spread over several slices by making use of dedicated interconnect between the carry chains in two or more slices. Again, you need to understand in detail the building blocks available in your FPGA device.
You have not included any signal definitions, but it looks like your process has as inputs two counters, a reset, and two parameters in the form of scan_end_h and scan_end_v. I have no way to know how wide these are, but let's assume as an example that these are 12-bit values. Your process then has 4 * 12 = 48 inputs from the counters and parameters. I would not expect a function of this many inputs to fit into one slice, therefore you could probably achieve a higher Fmax using pipelining. Your idea of pipelining the counter comparisons looks like a good one; as pointed out in the comments, your best bet is to try this out, and see what the result is by looking at the implementation timing report.

Serializing code in VHDL

I'm attempting to create a (very basic) GPU on a Spartan-6 FPGA using VHDL.
The big problem I have hit upon is that my understanding of HDL is quite limited - I've been writing my code using nested for loops for ray tracing/scanline rasterization algorithms without considering that these enormous loops consume >100% of the DSP slices when the loops are unrolled during synthesis.
My question is: if I have a clock-triggered counter in place of a for loop (using the counter as the index and resetting it to 0 at its maximum), would this mean all the logic is only generated once? I can see that, taking ray tracing on a 600x800 screen with a 200 MHz clock, for example, the overall refresh rate of the entire screen would drop to 625 Hz, but that should still be quick enough in theory..?
Thanks very much!
If you implement a for loop, then the functionality in the for loop is executed at the same time for all the values that the for loop goes through. To achieve this, the synthesis tool must implement the functionality once for each value in the for loop, so you will still have the massive hardware implementation.
For example, this code will unroll to parallel hardware for the functionality (the and gate in this case), but without any overhead in hardware as a result of the for loop:
process (clk_i) is
begin
   if rising_edge(clk_i) then
      for idx_par in z_par_o'range loop
         z_par_o(idx_par) <= a_i(idx_par) and b_i(idx_par);  -- Functionality
      end loop;
   end if;
end process;
Interleaving the processing of different data values must be implemented with explicit handling in the VHDL: keep the current index in a signal, and increment and wrap this index each time the functionality has calculated the result for the current value.
This code will make serial hardware for the functionality, but with some overhead in hardware for handling the index:
process (clk_i) is
begin
   if rising_edge(clk_i) then
      if rst_i = '1' then  -- Reset
         idx_ser <= 0;
      else  -- Operation
         z_par_o(idx_ser) <= a_i(idx_ser) and b_i(idx_ser);  -- Functionality
         if idx_ser /= LEN - 1 then  -- Not at end of range
            idx_ser <= idx_ser + 1;  -- Increment
         else  -- At end of range
            idx_ser <= 0;  -- Wrap
         end if;
      end if;
   end if;
end process;
Ordinary VHDL synthesis tools are not able to unroll for loops to execute over time.

Behavioural logic sequential, code cannot work?

Basically I have the following code in my module. I want to change a number to its 2's complement negative.
E.g. 100 becomes -100, and -200 becomes 200.
A shortcut I found is to read from the LSB until you reach a '1', then flip all the bits after it. I'm trying to implement a 32-bit converter with the least performance tradeoff (I heard num <= not(num) + 1 is quite resource-heavy).
flipBit <= '0'; -- reset the flip bit
FOR i IN 0 TO 31 LOOP
   IF flipBit = '1' THEN
      tempSubtract(i) <= not Operand2(i);
   ELSE
      tempSubtract(i) <= Operand2(i);
   END IF;
   IF Operand2(i) = '1' THEN
      flipBit <= '1';
   END IF;
END LOOP;
However, all it does is NOT the entire thing. Also, when I do num <= not(num) + 1, the slow way, it gives me gibberish numbers too.
Can anyone tell me what's wrong? Thanks.
This is something the synthesis tool can probably do better than you, so I would recommend to simply use z <= -a;, where a and z are of type signed.
This will cause synthesis to optimize the negation for your target architecture, no matter what it is. For example, calculating not + 1 in an FPGA is very efficient.
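A minimal sketch of that (entity and names invented):
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity negate32 is
   port (
      a : in  signed(31 downto 0);
      z : out signed(31 downto 0)
   );
end entity;

architecture rtl of negate32 is
begin
   -- Synthesis turns this into "invert and add 1" on the carry chain.
   -- Note: negating the most negative value (-2**31) wraps back to itself,
   -- as in any 32-bit two's complement negation.
   z <= -a;
end architecture;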
You could just do signal <= signal * -1 and the tools would synthesize it for you. That might not be the most efficient, though I don't think it would take up THAT much logic. If you really need a more efficient solution, you can do this:
Is there any reason you need to do the conversion you show above in 1 clock cycle? If you took 32 clocks to do it, it would be easier and would probably use fewer resources. I would recommend that you remove your FOR loop, as it is causing most of the problems you are having.
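If you do go the multi-cycle route, here is a sketch (all names invented) of a bit-serial version that processes one bit per clock. It uses the same "copy bits up to and including the first '1', then invert the rest" shortcut, but the flip flag lives in a register between clock cycles rather than being a signal assigned inside a loop (which is what breaks the original code):
library ieee;
use ieee.std_logic_1164.all;

entity serial_negate is
   port (
      clk   : in  std_logic;
      start : in  std_logic;                      -- pulse to begin a new conversion
      din   : in  std_logic_vector(31 downto 0);  -- value to negate
      dout  : out std_logic_vector(31 downto 0);  -- two's complement of din
      done  : out std_logic                       -- pulses when dout is valid
   );
end entity;

architecture rtl of serial_negate is
   signal idx  : integer range 0 to 31 := 0;
   signal busy : std_logic := '0';
   signal flip : std_logic := '0';
   signal op   : std_logic_vector(31 downto 0);
   signal res  : std_logic_vector(31 downto 0);
begin
   process (clk)
   begin
      if rising_edge(clk) then
         done <= '0';
         if start = '1' then
            op   <= din;
            idx  <= 0;
            flip <= '0';
            busy <= '1';
         elsif busy = '1' then
            -- copy bits up to and including the first '1', invert everything above it
            if flip = '1' then
               res(idx) <= not op(idx);
            else
               res(idx) <= op(idx);
               if op(idx) = '1' then
                  flip <= '1';   -- takes effect from the next bit onwards
               end if;
            end if;
            if idx = 31 then
               busy <= '0';
               done <= '1';
            else
               idx <= idx + 1;
            end if;
         end if;
      end if;
   end process;
   dout <= res;
end architecture;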

In VHDL ..... how to count leading zeros of vector?

I'm working on a VHDL project and I'm facing a problem calculating the length of a vector. I know there is a 'length attribute, but that is not the length I'm looking for. For example, I have the std_logic_vector
E : std_logic_vector(7 downto 0);
then
E <= "00011010";
so len = E'length = 8, but I'm not looking for this. I want to calculate len after discarding the left-most zeros, so len = 5.
I know that I can use a for loop, checking '0' bits from left to right and stopping when a '1' bit occurs. But that's not efficient, because I have 1024 or more bits and that will slow my circuit. So is there any method or algorithm to calculate the length in an efficient way, such as combinational logic with log(n) levels of gates (where n = number of bits)?
What you do with your "bit counting" is very similar to the logarithm (base 2).
This is commonly used in VHDL to figure out how many bits are required to represent a signal. For example if you want to store up to N elements in RAM, the number of bits required for addressing that RAM is ceil(log2(N)). For this I use:
function log2ceil(m : natural) return natural is
begin  -- note: for log(0) we return 0
   for n in 0 to integer'high loop
      if 2**n >= m then
         return n;
      end if;
   end loop;
end function log2ceil;
Usually, you want to do this at synthesis time with constants, and speed is no concern. But you can also generate FPGA logic, if that's really what you want.
As others have mentioned, a "for" loop in VHDL is just used to generate a lookup table, which may be slow due to long signal paths, but still only takes a single clock. What can happen is that your maximum clock frequency goes down. Usually this is only a problem if you operate on vectors larger than 64 bits (you mentioned 1024 bits) and clocks faster than 100 MHz. Maybe the synthesizer has already told you that this is your problem; otherwise I suggest you try it first.
Then you have to split up the operation over multiple clocks, and store some intermediate result into a FF. (I would upfront forget about trying to outsmart the synthesizer by rearranging your code. A lookup-table is a table. Why should it matter how you generate the values in this table? But make sure you tell the synthesizer about "don't care" values if you have them.)
If speed is your concern, use the first clock to check all 16bit blocks in parallel (independent of each other), and then use a second clock cycle to combine the results of all 16bit blocks into a single result. If the amount of FPGA logic is your concern, implement a state machine that checks a single 16bit block at every clock cycle.
But be careful that you don't re-invent the CPU while doing that.
The problem with using a loop is that when you synthesize you might get a very long chain of logic.
Another way to look at your problem is to find the index of the most significant set bit.
To do this you can use a priority encoder. The nice thing about this is you can make a large priority encoder by using smaller priority encoders in a tree structure, so the delay is O(log N) instead of O(N).
Here is a 4 bit priority encoder:
http://en.wikibooks.org/wiki/VHDL_for_FPGA_Design/Priority_Encoder
You can make a 16-bit priority encoder using five of these blocks, and then build wider encoders (64-bit, 256-bit, ...) by combining 16-bit encoders in the same tree fashion.
But since you have so many bits it is going to be fairly huge.
Well, VHDL is not software; it does not take time to perform an operation like this, it just takes resources from your FPGA.
You can divide your 1024-bit data into 32-bit sections and perform an OR between all the bits; this way, you check 32 bits at a time. It is not really necessary, though, since the for loop would work perfectly fine for what you want to do: just write the code, look for the first '1' in the array, stop the loop, and use the loop index as the pointer to the first '1' in your array. I didn't compile this code, but something like this should work for you:
FirstOne <= 1023;
for i in E'reverse_range loop
   if E(i) = '1' then
      FirstOne <= i;
      exit;
   end if;
end loop;
It will not be such a big block inside an FPGA after all.
Most synthesizers these days support recursive functions. And indeed this will give you complexity comparable to log(N) where N is the number of bits:
Cut your vector into halves.
If the top half is all zeros: the leading bit of your answer is '1', and the low bits depend on the bottom half vector.
Otherwise: the leading bit of your answer is '0', and the low bits depend on the top half vector.
Recurse on the half vector of interest chosen above.
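A sketch of that recursion as a VHDL function (the function name and conventions are my own; it returns the count of leading zeros rather than assembling the answer bit by bit, assumes v'length is a power of two, uses ieee.numeric_std, and should be declared in a package or in the architecture):
function clz(v : std_logic_vector) return natural is
   constant N : natural := v'length;
   alias    a : std_logic_vector(N - 1 downto 0) is v;  -- normalise the index range
begin
   if N = 1 then
      if a(0) = '1' then
         return 0;
      else
         return 1;
      end if;
   elsif unsigned(a(N - 1 downto N / 2)) = 0 then
      -- top half all zeros: N/2 leading zeros plus whatever the bottom half adds
      return N / 2 + clz(a(N / 2 - 1 downto 0));
   else
      -- there is a '1' somewhere in the top half: only the top half matters
      return clz(a(N - 1 downto N / 2));
   end if;
end function clz;
-- usage for the original question: len <= E'length - clz(E);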
