In VHDL, how to count leading zeros of a vector? - vhdl

I'm working on a VHDL project and I'm having trouble calculating the length of a vector. I know a vector has a length attribute, but that is not the length I'm looking for. For example, I have a std_logic_vector
E : std_logic_vector(7 downto 0);
then
E <= "00011010";
So len = E'length = 8, but that is not what I'm looking for. I want to calculate len after discarding the leftmost zeros, so len = 5.
I know that I can use a for loop that checks the bits from left to right and stops when a '1' bit occurs. But that's not efficient, because I have 1024 or more bits and that will slow my circuit. Is there any method or algorithm to calculate the length efficiently, for example using combinational logic with log(n) levels of gates (where n = number of bits)?

What you are doing with your "bit counting" is very similar to the logarithm (base 2).
This is commonly used in VHDL to figure out how many bits are required to represent a signal. For example if you want to store up to N elements in RAM, the number of bits required for addressing that RAM is ceil(log2(N)). For this I use:
function log2ceil(m : natural) return natural is
begin -- note: for log(0) we return 0
  for n in 0 to integer'high loop
    if 2**n >= m then
      return n;
    end if;
  end loop;
end function log2ceil;
Usually, you want to do this at synthesis time with constants, and speed is no concern. But you can also generate FPGA logic, if that's really what you want.
As others have mentioned, a "for" loop in VHDL just generates combinational logic (effectively a lookup table), which may be slow due to long signal paths, but still only takes a single clock. What can happen is that your maximum clock frequency goes down. Usually this is only a problem if you operate on vectors larger than 64 bits (you mentioned 1024 bits) and clocks faster than 100 MHz. Maybe the synthesizer already told you that this is your problem; otherwise I suggest you try it first.
If timing does turn out to be the problem, you have to split the operation over multiple clocks and store the intermediate results in flip-flops. (I would forget upfront about trying to outsmart the synthesizer by rearranging your code. A lookup table is a table. Why should it matter how you generate the values in this table? But make sure you tell the synthesizer about "don't care" values if you have them.)
If speed is your concern, use the first clock to check all 16-bit blocks in parallel (independent of each other), and then use a second clock cycle to combine the results of all 16-bit blocks into a single result. If the amount of FPGA logic is your concern, implement a state machine that checks a single 16-bit block every clock cycle.
But be careful that you don't re-invent the CPU while doing that.
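The two-clock variant described above (check all 16-bit blocks in parallel, then combine) could look roughly like the following. This is only a sketch: the entity name, the generic BLOCKS, and the ports found and index are made up for illustration, and it has not been run through any particular synthesizer.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: index of the most significant '1' in two clock cycles.
entity msb_index_2stage is
  generic (
    BLOCKS : positive := 64  -- assumed: 64 blocks of 16 bits = 1024 bits
  );
  port (
    clk   : in  std_logic;
    e     : in  std_logic_vector(16 * BLOCKS - 1 downto 0);
    found : out std_logic;                          -- '0' if e was all zeros
    index : out natural range 0 to 16 * BLOCKS - 1  -- position of the leading '1'
  );
end entity msb_index_2stage;

architecture rtl of msb_index_2stage is
  type idx_array is array (0 to BLOCKS - 1) of natural range 0 to 15;
  signal blk_idx : idx_array := (others => 0);
  signal blk_hit : std_logic_vector(BLOCKS - 1 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Clock 1: every 16-bit block finds its own highest '1', independently.
      for b in 0 to BLOCKS - 1 loop
        blk_hit(b) <= '0';
        blk_idx(b) <= 0;
        for i in 0 to 15 loop
          if e(16 * b + i) = '1' then
            blk_hit(b) <= '1';
            blk_idx(b) <= i;  -- a later (higher) bit overrides an earlier one
          end if;
        end loop;
      end loop;

      -- Clock 2: combine the registered block results; the highest block wins.
      found <= '0';
      index <= 0;
      for b in 0 to BLOCKS - 1 loop
        if blk_hit(b) = '1' then
          found <= '1';
          index <= 16 * b + blk_idx(b);
        end if;
      end loop;
    end if;
  end process;
end architecture rtl;
```

The length asked for in the question would then be index + 1 whenever found is '1', available two clock edges after e was presented.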

The problem with using a loop is that when you synthesize you might get a very long chain of logic.
Another way to look at your problem is to find the index of the most significant set bit.
To do this you can use a priority encoder. The nice thing about this is you can make a large priority encoder by using smaller priority encoders in a tree structure, so the delay is O(log N) instead of O(N).
Here is a 4 bit priority encoder:
http://en.wikibooks.org/wiki/VHDL_for_FPGA_Design/Priority_Encoder
You can make a 16 bit priority encoder using 5 of these blocks, then a 256 bit encoder from five 16 bit encoders, etc.
But since you have so many bits it is going to be fairly huge.

Well, VHDL is not SW, it does not take time to perform an operation like this, it just takes resources from your FPGA.
You can divide your 1024 bits of data into 32-bit sections and OR all the bits of each section together; this way, you check 32 bits at a time. It is not really necessary, though, since a for loop would work perfectly fine for what you want to do: just write the code, look for the first 1 in the array, stop the loop, and use the loop index as the pointer to the first 1 in your array. I didn't compile this code, but something like this should work for you:
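FirstOne <= 1023; -- default if no '1' is found
for i in E'range loop -- E is declared "downto", so this scans from the MSB down
  if E(i) = '1' then -- VHDL compares with "=", not "=="
    FirstOne <= i;
    exit;
  end if;
end loop;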
It will not be such a big block inside an FPGA after all.

Most synthesizers these days support recursive functions. And indeed this will give you complexity comparable to log(N), where N is the number of bits:
- Cut your vector into halves.
- If the top half is all zeros, the leading bit of your answer is '1' and the low bits depend on the bottom half vector.
- Otherwise, the leading bit of your answer is '0' and the low bits depend on the top half vector.
- Recurse on the half vector of interest chosen above.
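A minimal sketch of that recursion as a VHDL function is shown below. The package and function names are invented for this example. It returns the count as a natural and adds the half width instead of concatenating answer bits, which also lets it handle lengths that are not powers of two; whether a given synthesizer accepts the recursion is something to check.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

package clz_pkg is
  -- Hypothetical helper: number of leading (leftmost) zeros in v.
  -- An all-zero vector returns v'length.
  function count_leading_zeros(v : std_logic_vector) return natural;
end package clz_pkg;

package body clz_pkg is
  function count_leading_zeros(v : std_logic_vector) return natural is
    constant N     : natural := v'length;
    constant HALF  : natural := N / 2;
    -- Normalise the index range so that x(N - 1) is always the leftmost bit.
    alias x        : std_logic_vector(N - 1 downto 0) is v;
    constant ZEROS : std_logic_vector(HALF - 1 downto 0) := (others => '0');
  begin
    if N = 0 then
      return 0;
    elsif N = 1 then
      if x(0) = '1' then
        return 0;
      else
        return 1;
      end if;
    elsif x(N - 1 downto N - HALF) = ZEROS then
      -- Top half is all zeros: count it and continue in the bottom half.
      return HALF + count_leading_zeros(x(N - HALF - 1 downto 0));
    else
      -- Top half contains a '1': the answer lies entirely in the top half.
      return count_leading_zeros(x(N - 1 downto N - HALF));
    end if;
  end function count_leading_zeros;
end package body clz_pkg;
```

With the vector from the question, the length would be E'length - count_leading_zeros(E), which gives 8 - 3 = 5 for "00011010".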

Related

Handling very large vector in VHDL

Similarly to the question asked here, I face issues managing very large vectors in an FPGA, and nothing really helped in the previous topic. I have a sample that is 2^15 bits wide, and I want to compute something similar to a bitwise autocorrelation of this sample by shifting, XORing, and adding all the bits of this vector.
One of my constraints is a fast execution time (less than 100 ms with a 12 MHz clock).
My way to do it for now is using an FSM in which one state is responsible for processing each iteration of the bitwise autocorrelation, and a following step comparing the current result to the minimum value so far (the minimum value reflecting the fundamental frequency of the sample). To do so, I use a for loop (not a big fan of this kind of structure usually...). Here is a piece of my code :
when S_BACF =>
  count_ones := (others => '0');
  for i in 0 to 32766 loop
    count_ones := count_ones + (sample(i) xor sample((index + i) mod 32767));
  end loop;
  s_nb_of_ones_current <= count_ones;
  current_state <= S_COMP;
when S_COMP =>
  if index >= 19999 then
    index <= 0;
    current_state <= S_END;
  else
    index <= index + 1;
    current_state <= S_BACF;
    if s_nb_of_ones_saved > s_nb_of_ones_current then
      s_nb_of_ones_saved <= s_nb_of_ones_current;
      s_min_rank <= std_logic_vector(to_unsigned(index, 15));
    end if;
  end if;
Sorry I don't define each signal/variable, but I think it is transparent enough. My index doesn't need to go beyond 20000 (for this application).
I think this way of processing data is quite efficient in registers use ("just" one vector of 2^15 bits and a 15-bits vector for the fundamental frequency).
In simulation, it works great. But, as expected, the synthesis fails, even for quite big targets. The for loop, though efficient in theory, cannot be synthesised with such a big depth.
I imagined splitting my data into several smaller pieces, but the shifting-xoring operation through different smaller vectors is a nightmare.
I also thought about using the internal RAM of my FPGA to save my sample, in order to reduce the register usage, but it won't be effective against the for loop synthesis issue.
So, do any of you have a good idea for making the synthesis succeed?

Does the CONSTANT declaration store the values in Block-RAM or in flip-flops of an FPGA?

For example, if I want to store my filter coefficients in n-Tap FIR filters using constants, will the CONSTANT declaration store my values in Block RAMs or registers using FPGA flipflops? Also can SIGNAL be used to store the coefficients without using RAM cells?
The constants themselves aren't "stored" anywhere - their values are simply substituted into the VHDL code where you use them.
Where they're stored depends on how you use them and how the code is optimized.
If you're multiplying a signal by a constant two, for example, no elements are used at all - the data bus will be simply connected in a way that effectively shifts the value left by one bit.
Or, they may end up as hard-wired inputs to other elements like multipliers in your case.
Either way, you should look into the synthesis results to thoroughly understand the generated RTL.
[...] will the CONSTANT declaration store my values in Block RAMs or registers using FPGA flipflops?
Whether constants are stored in memory blocks or registers, or are merged into the boolean equations, depends on your implementation of the algorithm. Let's have a look at the following mathematical equation (not VHDL code):
y = c_1 * x_1 + c_2 * x_2 + c_3 * x_3 +... + c_N * x_N
N is the number of coefficients, x_i are the input values, and c_i are the constant coefficients.
You can implement this equation in VHDL / hardware by:
1. N parallel multipliers and an adder tree to sum up the products; all done combinationally within one clock cycle, or even pipelined with a throughput of one result per clock cycle.
2. Or N sequentially executed multiply-accumulate steps, with one multiply-accumulate per clock cycle.
You can even take a combination of both.
In case 1, the synthesizer optimizes each multiplication with a constant:
- just wiring if the coefficient is a power of two,
- addition if the binary representation of the coefficient contains a small number of ones (5*x = x + 4*x),
- or a multiplier hard macro with a constant value (VDD, GND) connected to one of its inputs.
Thus, in case 1 no memory or registers are required to store the constants.
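As a small illustration of the "addition" case, the 5*x = x + 4*x rewrite can be written out by hand. The entity name and bit widths below are assumptions chosen for the example; a synthesizer would normally derive the same structure from a plain multiplication by the constant 5.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: multiply an 8-bit input by the constant 5.
entity times_five is
  port (
    x : in  unsigned(7 downto 0);
    y : out unsigned(10 downto 0)  -- 5 * 255 = 1275 fits in 11 bits
  );
end entity times_five;

architecture rtl of times_five is
begin
  -- 4*x is a shift by two positions (pure wiring), so only one adder remains.
  y <= shift_left(resize(x, 11), 2) + resize(x, 11);
end architecture rtl;
```

No memory or register holds the value 5 here; the constant only exists as the wiring pattern of the shift.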
In case 2, the synthesizer will map the multiply-accumulate step to a hardware multiplier plus an adder. This multiplier and adder will be re-used for all N steps, so the coefficients must be looked up in a memory. If you have a lot of coefficients, then memory blocks (Block-RAM) are used. The current iteration step i will make up the memory address. If you have only a small number of coefficients, then they can also be stored in distributed memory (LUT-RAM) or computed via boolean equations. But even in this case, the coefficients will not be mapped to flip-flops because their values do not change with time.
Also can SIGNAL be used to store the coefficients without using RAM cells?
Yes, of course. With a proper synchronous description they will be mapped to flip-flops.
The storage element used:
- registers
- distributed RAM (LUTRAM)
- BlockRAM
...depends on your chosen VHDL description and size.
You should use a constant instead of a signal. Moreover, it could be helpful to use synchronous read operations to infer registered outputs.
Look into the synthesis report to validate the intended description.
Building on Paebbels' answer: it depends. The constants can be implemented in distributed ROM (LUTROM) as well, depending on the synthesis tool. For example, Xilinx's Vivado synthesis guide (UG901) describes how to infer RAM/ROM.
For your example of a FIR filter, you might have something like:
type coeff_array is array(natural range<>) of std_logic_vector(17 downto 0);
constant coeffs : coeff_array(0 to N-1) := ( x"XXXX", x"XXXX", ..., x"XXXX" );
Now, whether this is a distributed ROM or RAM depends on the tool. A quick test with Vivado shows this construct synthesizes to a sea of gates (just LUT logic). However, it can be forced into a block RAM (aka block ROM) by:
signal coeffs : coeff_array(0 to N-1) := ( x"XXXX", x"XXXX", ..., x"XXXX" );
attribute ROM_STYLE : string;
attribute ROM_STYLE of coeffs : signal is "block";
The means to infer any specific type of structure (LUTs, LUTRAM, LUTROM, block ROM, block RAM) depends upon the tools in question. Run the test through synthesis to see what you get. And look at the synthesis guide for the synthesizer you are using to figure out how to get what you want.

Theoretically, is comparison between 0 and 255 faster than 0 and 1?

From the point of view of very low level programming, how is the comparison between two numbers performed?
Using one byte, unsigned numbers 0, 1 and 255 are written:
0 -----> 00000000
1 -----> 00000001
255 ---> 11111111
Now, what happens during the comparison between these numbers?
Using my vision as a human having learned basic programming, I could imagine the following algorithm about == implementation:
b = 0
while b < 8:
    if first_number[b] != second_number[b]:
        return False
    b += 1
return True
Basically this is like comparing each bit step by step, and stop before the end if two bits are different.
Thus we note that the comparison stops at the first iteration when comparing 0 and 255, while it stops at the last if 0 and 1 are compared.
The first comparison would be 8 times faster than the second.
In practice, I doubt that is the case. But is this theoretically true?
If not, how does the computer work?
A comparison between integers is typically implemented by the CPU as a subtraction, whose result sign contains information about which number is bigger.
While a naive implementation of subtraction executes one bit at a time (because every bit needs to know the carry of the preceding one), typical implementations use a carry-lookahead circuit that allows the calculation of several result bits at the same time.
So, the answer is: no, every comparison takes almost the same time for every possible input.
Hardware is fundamentally different from the dominant programming paradigms in that all logic gates (or circuits in general) always do their work independently, in parallel, at all times. There is no such thing as "do this, then do that", only "do this here, feed the result into the circuit over there". If there's a circuit on the chip with input A and output B, then the circuit always, continuously, updates B in accordance with the current values of A — regardless of whether the result is needed right now "in the big picture".
Your pseudo code algorithm doesn't even begin to map to logic gates well. Instead, a comparator looks like this in Verilog (ignoring that there's a built-in == operator):
assign not_equal = (a[0] ^ b[0]) | (a[1] ^ b[1]) | ...;
Where each XOR is a separate logic gate and hence works independently from the others. The results are "reduced" with a logical or, i.e. the output is 1 if any of the XORs produces a 1 (this too does some work in parallel, but the critical path is longer than one gate). Furthermore, all these gates exist in silicon regardless of the specific bit values, and the signal has to propagate through about (1 + log w) gates for a w-bit integer. This propagation delay is again independent of the intermediate and final results.
On some CPU families, equality comparison is implemented by subtracting the two numbers and comparing the result to zero (using a circuit as described above), but the same principle applies. An adder/subtracter doesn't get slower or faster depending on the values.
Not to mention that instructions in a CPU can't take less than one clock cycle anyway, so even if the hardware would finish more quickly, the next instruction still wouldn't start until the next tick.
Now, some hardware operations can take a variable amount of time, but that's because they are state machines, i.e. sequential logic. Technically one could implement the moral equivalent of your algorithm with a state machine, but nobody does that, it's harder to implement than the naive, un-optimized combinatorial circuit above, and less efficient to boot.
State machine circuits are circuits with memory: They store their current state and always compute the outputs (depending on the current state) and the next state (depending on current state and inputs) each clock cycle. On some inputs they may go through N states until they produce an output, and N+x on other inputs. ALU operations generally don't do that though. Pipeline stalls, branch mispredictions, and cache misses are common reasons one instruction takes longer than usual in some circumstances. Properly reasoning about these in a way that helps programmers write faster code is hard though: You have to take into account all the tricks and quirks of real hardware, and there are a lot of those. Empirical evidence, i.e. benchmarking a real black box CPU, is vital.
When it gets down to the assembly the cmp instruction is used regardless of the contents of the variables.
So there is no performance difference.

Iterating over bits in FPGA

I'm trying to figure out the best method for iterating over bits in an FPGA. I'm using a variation of the fast powering algorithm, a.k.a. exponentiation by squaring (more precisely, the double-and-add algorithm for elliptic curve mathematics). To implement it in hardware, I know I must use an FSM that does the iteration. My problem is how to properly "handle" moving from bit to bit. My first thought was to switch the order of the bytes, but when my k = 17 is 32 bits wide, I must discard the first 27 bits, so it's a rather stupid idea. Another concept was "moving" a 0001000 pattern and bitwise ANDing it with the number, but that also requires finding the first nonzero bit.
TL;DR
Take for example k = 17 (32 bits, so: 27x0 10001). I want to iterate 5 times (that means I start iterating on the first "real" bit of the number), knowing each bit I iterate over.
Language doesn't matter - I need only the algorithm, not solution in specific language. However, if it is easily done in Verilog, I wouldn't mind. :P
A dedicated combinatorial circuit to find the first nonzero bit, shift it to the first position and tell you the shift amount should be fairly light on resources.
In principle, the compiler should be able to find this solution on its own and improve on it:
- If none of the top 16 bits are set, set bit 4 of the shift amount, and shift by 16.
- If none of the top 8 bits are set, set bit 3 of the shift amount, and shift by 8.
- ...
The compiler should be able to find further optimizations on this.
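The halving steps above can be sketched in VHDL as a five-stage loop that the synthesizer unrolls. The package and function names are invented for illustration, and the 32-bit width is taken from the question.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package normalize_pkg is
  -- Hypothetical helper: left shift that brings the first set bit of x to bit 31.
  -- For x = 0 the result is 31, so zero needs a separate check if it matters.
  function leading_zero_shift(x : unsigned(31 downto 0)) return unsigned;
end package normalize_pkg;

package body normalize_pkg is
  function leading_zero_shift(x : unsigned(31 downto 0)) return unsigned is
    variable v     : unsigned(31 downto 0) := x;
    variable shamt : unsigned(4 downto 0)  := (others => '0');
  begin
    -- Stages shift by 16, 8, 4, 2 and 1; each one decides a single bit of shamt.
    for stage in 4 downto 0 loop
      if v(31 downto 32 - 2**stage) = 0 then
        shamt(stage) := '1';
        v := shift_left(v, 2**stage);
      end if;
    end loop;
    return shamt;
  end function leading_zero_shift;
end package body normalize_pkg;
```

For k = 17 the stages for 16, 8, 2 and 1 fire, giving a shift amount of 27, which leaves 32 - 27 = 5 bits to iterate over.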
I do not code for FPGAs, but still:
- rewrite the algorithm to iterate over the number x from LSB to MSB
- in each iteration, shift x right by 1 bit
- stop if x == 0.
This way you have the bit scan inside your main loop and do not need additional cycles for it.
x != 0 is easily computed by ORing all its bits together.
C++ code example:
DWORD x = ...;
for (; x != 0; x >>= 1)
{
    // here is your iteration loop stuff like:
    if (DWORD(x & 1) != 0) ...;
}
Something like:
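always @*
  casex (num)
    32'b1xxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx: k = 32;
    32'b01xx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx: k = 31;
    32'b001x_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx: k = 30;
    ...
  endcase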
Should give you the value of k.
You can have a shift register which can be parallel loaded so you can write a 1 to the kth bit, so you know when your iterations have ended.
If you loop from 0 to 31 and discard the 27 leading zeros... you aren't necessarily wasting cycles. It depends on whether you've surrounded this with a synchronous process or an asynchronous one.
One gives you a rather small clocked circuit with a 32 clock latency.
The other gives you a giant rats nest of ANDs and ORs which won't run at a very high frequency.
Depends on what you want. Remember though, that even if you do decide to loop over 32 clocks, you can PIPELINE it such that you start a new calculation every clock. It might take you 32 clocks to get an answer, but you CAN do them at high speed.

How can I speed up my math operations in VHDL?

I have some calculations going on currently at rising edge of a 75MHz pixel clock to output 720p video on screen. Some of the math (like a few modulo) take too long (20+ns whereas 75MHz is 13.3ns) so my timing constraints are not met. I'm new to FPGAs but I'm wondering if for example there is a way to run the calculations at a faster speed than the current pixel clock in order to have them completed by the next tick of the 75MHz clock. I'm using VHDL by the way.
75 MHz is already quite slow by today's FPGA standards.
The problem is the modulo operation, which effectively involves division; and division is slow.
Think carefully about the operations you need, and if there is any way to reorganise the computation. If you are clocking pixels it's not as if you have 32-bit integers to deal with; restricted values are easier to deal with.
Martin hinted at one option: strength reduction. If you have 1280 pixels/line and need to operate on every third one, you don't need to compute 1280 mod 3! Count 0,1,2,0,... instead.
Another, if you need modulo-3 of an 8-bit (or 12-bit) number is to store all possible values in a lookup table, which will be fast enough.
Or sometimes you can multiply by 1/3 (X"5555") instead of dividing by 3, then multiply by 3 (which is a single addition) and subtract to get the modulo. This pipelines really well, but since X"5555" is only an approximation to 1/3 you need to verify in simulation that it delivers the correct output for every input. (for 16-bit inputs, this isn't a big simulation!) The extension to modulo 9 is easy.
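The strength-reduction option (counting 0,1,2,0,... instead of computing a modulo) might look like the sketch below. The entity and port names, including the line_start reset, are assumptions made for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: tracks (pixel position mod 3) without any division.
entity mod3_counter is
  port (
    clk        : in  std_logic;
    line_start : in  std_logic;            -- assumed: high for one clock at the start of a line
    phase      : out natural range 0 to 2  -- current pixel position modulo 3
  );
end entity mod3_counter;

architecture rtl of mod3_counter is
  signal count : natural range 0 to 2 := 0;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if line_start = '1' or count = 2 then
        count <= 0;          -- wrap around instead of computing "mod 3"
      else
        count <= count + 1;
      end if;
    end if;
  end process;

  phase <= count;
end architecture rtl;
```

The logic per clock is a 2-bit compare and increment, which should be far from the critical path at 75 MHz.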
EDIT:
Two points from your comments: Another option you have is to create a 2x clock (150 MHz) using the Spartan's clock generators, which gives you 2 cycles per pixel. Well-pipelined code should meet 150 MHz without much trouble.
How not to pipeline!
PROCESS(Clk)
BEGIN
  if(rising_edge(Clk)) then
    for i in 0 to 2 loop
      case i is
        when 0 => temp1 <= a*data;
        when 1 => temp2 <= temp1*b;
        when 2 => result <= temp2*c;
        when others => null;
      end case;
    end loop;
  end if;
END PROCESS;
The first thing to realise is that the loop and case statement cancel each other out, so this simplifies to
PROCESS(Clk)
BEGIN
  if rising_edge(Clk) then
    temp1 <= a*data;
    temp2 <= temp1*b;
    result <= temp2*c;
  end if;
END PROCESS;
which is buggy! The testbench, also being buggy, hides the problem.
In cycle 1, Data,a,b,c are presented, and temp1 = Data*a is computed.
In cycle 2, temp1 is multiplied by a NEW value of b instead of the correct one!
Same again in cycle 3!
Since the testbench sets the inputs and leaves them constant, it won't catch the problem!
PROCESS(Clk)
BEGIN
  if rising_edge(Clk) then
    -- cycle 1
    temp1 <= a*data;
    b_copy <= b;
    c_copy1 <= c;
    -- cycle 2
    temp2 <= temp1*b_copy;
    c_copy2 <= c_copy1;
    -- cycle 3
    result <= temp2*c_copy2;
  end if;
END PROCESS;
I like to comment each cycle; every term I use in a cycle must come from the immediately preceding cycle, either by calculation or from a copy.
At least this works, but it could be reduced to 2 cycles depth and fewer copy registers because in this example, the four inputs are independent (and I am assuming there are no measures required to avoid overflow). So:
PROCESS(Clk)
BEGIN
  if rising_edge(Clk) then
    -- cycle 1
    temp1 <= a * data;
    temp2 <= b * c;
    -- cycle 2
    result <= temp1 * temp2;
  end if;
END PROCESS;
Here are some techniques:
- Pipelining: split the logic up to operate over multiple clock cycles.
- Multi-cycle path: if you don't need the answer every cycle, you can tell the tools that it's OK for it to take longer. Care is required not to tell the tools the wrong thing though!
- Think again: for example, do you really need to do x mod 3 on very wide x, or could you use a continuously updated modulo 3 counter?
- Use better tools: I've had instances where I could meet timing on a deep logic path using an expensive synthesizer, compared to not meeting timing on the same code using the vendor's synthesizer.
- More extreme solutions involve changing the silicon, for a faster device, or a newer device, or a newer, faster device.
Usually complex math operations in FPGAs are pipelined. Pipelining means you divide your operations into stages. Let's say you have a multiplier which takes too long for your clock speed. You divide your multiplier into 3 stages. Basically, your multiplier then consists of three different parts (each with its own clock input) chained one after another. Each of these three parts will be smaller than the original, so it will have a smaller delay and you can use a faster clock for them.
A drawback of this will be the 'delay'. Your pipelined system will give output with a latency. In the multiplier example above to have the correct output, you have to wait until your input passes all 3 stages. But this is usually very small (depending on your design of course) and can be ignored.
Here is a good (!) post about this: http://vhdlguru.blogspot.com/2011/01/what-is-pipelining-explanation-with.html EDIT: See Brian's post instead.
Also vendors usually ship optimized and pipelined versions of math operations as IP cores in their design software. Look for them.

Resources