Iterating over bits in FPGA - algorithm

Now I'm trying to figure out the best method for iterating over bits in an FPGA. I'm using a variation of the fast powering algorithm, a.k.a. exponentiation by squaring (more precisely, the double-and-add algorithm for elliptic curve mathematics). To implement it in hardware, I know I must use an FSM that performs the iteration. My problem is how to properly "handle" moving from bit to bit. My first thought was to reverse the bit order, but when my k = 17 is 32 bits wide, I must discard the first 27 bits, so that is a rather poor idea. Another concept was to slide a 0001000 pattern along the number and bitwise AND it with the number, but that also requires finding the first nonzero bit.
TL;DR
Say I have k = 17 (32 bits, so 27 zeros followed by 10001) and I want to iterate 5 times (that means I start the iteration at the first "real" bit of the number), knowing each bit I iterate over.
Language doesn't matter - I need only the algorithm, not a solution in a specific language. However, if it is easily done in Verilog, I wouldn't mind. :P

A dedicated combinatorial circuit to find the first nonzero bit, shift it to the first position and tell you the shift amount should be fairly light on resources.
In principle, the compiler should be able to find this solution on its own and improve on it:
if none of the top 16 bits are set, set bit 4 of the shift amount, and shift by 16.
if none of the top 8 bits are set, set bit 3 of the shift amount, and shift by 8.
...
The compiler should be able to find further optimizations on this.
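If it helps to see the idea in software form, here is a minimal C++ sketch of that binary-search normalizer (the 32-bit width, the function name and the left-shift direction are my own assumptions, not from the answer):

#include <cstdint>
#include <cstdio>

// Software model of the combinational "find the first set bit and normalize" logic:
// each step tests the top half of the remaining bits and sets one bit of the shift amount.
void normalize32(uint32_t k, uint32_t &shifted, unsigned &shift)
{
    shifted = k;
    shift = 0;
    // If none of the top 16 bits are set, set bit 4 of the shift amount and shift by 16.
    if ((shifted & 0xFFFF0000u) == 0) { shift |= 16; shifted <<= 16; }
    // The same test repeated for 8-, 4-, 2- and 1-bit-wide halves.
    if ((shifted & 0xFF000000u) == 0) { shift |= 8;  shifted <<= 8;  }
    if ((shifted & 0xF0000000u) == 0) { shift |= 4;  shifted <<= 4;  }
    if ((shifted & 0xC0000000u) == 0) { shift |= 2;  shifted <<= 2;  }
    if ((shifted & 0x80000000u) == 0) { shift |= 1;  shifted <<= 1;  }
}

int main()
{
    uint32_t shifted; unsigned shift;
    normalize32(17, shifted, shift);                          // k = 17 = ...10001
    std::printf("shift=%u shifted=%08x\n", shift, shifted);   // shift=27, first '1' now at the MSB
}

Each of the five ifs corresponds to one layer of multiplexers, so the whole thing stays purely combinational.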

I do not code for FPGAs, but still:
rewrite the algorithm to iterate over the number x from LSB to MSB
then in each iteration shift x right by 1 bit
stop if x == 0.
This way the bit scan happens inside your main loop and you do not need additional cycles for it.
The x != 0 test is done easily by ORing all its bits together.
C++ code example:
DWORD x = ...;
for (; x != 0; x >>= 1)
{
    // here is your iteration loop stuff, like:
    if (DWORD(x & 1) != 0) ...;
}

Something like:
always @*
casex (num)
    32'b1xxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx: k = 32;
    32'b01xx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx: k = 31;
    32'b001x_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx: k = 30;
    ...
Should give you the value of k.
You can have a shift register which can be parallel loaded so you can write a 1 to the kth bit, so you know when your iterations have ended.

If you loop from 0 to 31 and discard the 27 leading zeros... you aren't necessarily wasting cycles. It depends on whether you've surrounded this with a synchronous process or an asynchronous one.
One gives you a rather small clocked circuit with a 32 clock latency.
The other gives you a giant rat's nest of ANDs and ORs which won't run at a very high frequency.
Depends on what you want. Remember though, that even if you do decide to loop over 32 clocks, you can PIPELINE it such that you start a new calculation every clock. It might take you 32 clocks to get an answer, but you CAN do them at high speed.

Related

Uniform random bit from a mask

Suppose I have a 64 bit unsigned integer (u64) mask, with one or more bits set.
I want to select one of the set bits uniformly at random from mask to give a new mask x such that x & mask has one bit set. Some pseudocode that does this might be:
def uniform_random_bit_from_mask(mask):
    assert mask > 0
    set_indices = get_set_indices(mask)
    random_index = uniform_random_choice(set_indices)
    new_mask = set_bit(random_index, 0)
    return new_mask
However I need to do this as fast as possible (code similar to the above in a low-level language is slowing a hot loop). Does anyone have a more efficient scheme?
The details how to optimize this depend on several factors you did not specify – the target architecture, the expected number of set bits in the mask, the language you want to use, the requirements on the randomness and many more. Without knowing further details, it's hard to give a useful answer, but I'll give a few hints that may prove useful anyway.
Most modern architectures have an instruction to count the number of set bits in an integer, generally called "popcount", and this instruction is exposed in most low-level languages. In Rust, you can use the count_ones() method. This gives you the total number k of bits to select from.
You can then generate a random number i between 0 and k - 1 (inclusive). The next step is to select the ith set bit in mask. An efficient approach to do so is this loop (Rust code):
for _ in 0..i {
    mask &= mask - 1;
}
let new_mask = 1 << mask.trailing_zeros();
The loop clears the least significant set bit in each iteration. Since i < k, we know that mask can't be zero after the loop. The last line generates a new mask from the least significant bit of mask that is still set.
On common architectures, it is likely that the bottleneck will be the random number generator. If you are using Rust's rand crate, you can use SmallRng for improved performance, at the cost of being cryptographically insecure, which may not be relevant for your use case.
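For reference, here is a minimal C++ sketch of the same approach, assuming the GCC/Clang builtins __builtin_popcountll and __builtin_ctzll and leaving the generation of the random index to the caller (random_below below is a hypothetical helper):

#include <cstdint>

// Select the i-th set bit (0-based, counted from the LSB) of a nonzero mask
// and return a mask with only that bit set. Same idea as the Rust loop above:
// clear the lowest set bit i times, then isolate the lowest set bit that remains.
uint64_t ith_set_bit(uint64_t mask, unsigned i)
{
    for (unsigned j = 0; j < i; ++j)
        mask &= mask - 1;                        // clear the least significant set bit
    return uint64_t{1} << __builtin_ctzll(mask);
}

// Caller picks i uniformly in [0, popcount(mask)):
//   unsigned k = __builtin_popcountll(mask);
//   uint64_t new_mask = ith_set_bit(mask, random_below(k));   // random_below is hypothetical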

Set bits before index without shift or LUT

Let's say I need to set all bits before a specific bit index. Here are examples with 4 bits (shown LSB first):
index(0) = (0x0, 0000)
index(1) = (0x1, 1000)
index(2) = (0x3, 1100)
index(3) = (0x7, 1110)
How can I do this without using shifts or a LUT, but instead using minimal bitwise operations or arithmetic or something similarly efficient?
The constraints are very odd, because you want an efficient solution but rule out the only two means of doing it properly.
So you basically want to compute x = (2^bit) - 1, which is pretty easy with a bit shift:
x=(1<<bit)-1; // O(1)
It is just as easy with a LUT ... So how to attack this without those two:
x=pow(2,bit)-1; //O(?) can be O(1),O(log(n)),O(n)
Well, this is far from efficient; pow also uses bit shifts, and some implementations use a LUT as well. The only solutions left are:
approximation
you can use a polynomial, PCA, or any other method ... but you need to consider the target range ... This is also not very optimal or robust. It can be O(1), O(log(n)) or O(n), but usually with a large constant factor.
emulate bit-shift left
you can do that with a loop and addition:
int x; for (x = 1; bit; bit--) x += x; x--;
But this runs in O(n). Anyway, this is still faster than pow unless you have some pow2 implemented in HW.
[Notes]
In the complexity formulas n = bit, and all the code is in C++ except the first formula, where ^ means power.

Theoretically, is comparison between 0 and 255 faster than 0 and 1?

From the point of view of very low-level programming, how is the comparison between two numbers performed?
Using one byte, unsigned numbers 0, 1 and 255 are written:
0 -----> 00000000
1 -----> 00000001
255 ---> 11111111
Now, what happens during the comparison between these numbers?
Based on my view as a human who has learned basic programming, I could imagine the following algorithm for the implementation of ==:
b = 0
while b < 8:
    if first_number[b] != second_number[b]:
        return False
    b += 1
return True
Basically this is like comparing each bit step by step, and stopping before the end if two bits are different.
Thus we note that the comparison stops at the first iteration when comparing 0 and 255, while it stops at the last when comparing 0 and 1.
The first comparison would be 8 times faster than the second.
In practice, I doubt that is the case. But is this theoretically true?
If not, how does the computer work?
A comparison between integers is typically implemented by the CPU as a subtraction, whose result sign contains information about which number is bigger.
While a naive implementation of subtraction executes one bit at a time (because every bit needs to know the carry of the preceding one), typical implementations use a carry-lookahead circuit that allows several result bits to be calculated at the same time.
So, the answer is: no, every comparison takes almost the same time for every possible input.
Hardware is fundamentally different from the dominant programming paradigms in that all logic gates (or circuits in general) always do their work independently, in parallel, at all times. There is no such thing as "do this, then do that", only "do this here, feed the result into the circuit over there". If there's a circuit on the chip with input A and output B, then the circuit always, continuously, updates B in accordance with the current values of A — regardless of whether the result is needed right now "in the big picture".
Your pseudo code algorithm doesn't even begin to map to logic gates well. Instead, a comparator looks like this in Verilog (ignoring that there's a built-in == operator):
assign not_equal = (a[0] ^ b[0]) | (a[1] ^ b[1]) | ...;
Where each XOR is a separate logic gate and hence works independently from the others. The results are "reduced" with a logical or, i.e. the output is 1 if any of the XORs produces a 1 (this too does some work in parallel, but the critical path is longer than one gate). Furthermore, all these gates exist in silicon regardless of the specific bit values, and the signal has to propagate through about (1 + log w) gates for a w-bit integer. This propagation delay is again independent of the intermediate and final results.
On some CPU families, equality comparison is implemented by subtracting the two numbers and comparing the result to zero (using a circuit as described above), but the same principle applies. An adder/subtracter doesn't get slower or faster depending on the values.
Not to mention that instructions in a CPU can't take less than one clock cycle anyway, so even if the hardware would finish more quickly, the next instruction still wouldn't start until the next tick.
Now, some hardware operations can take a variable amount of time, but that's because they are state machines, i.e. sequential logic. Technically one could implement the moral equivalent of your algorithm with a state machine, but nobody does that, it's harder to implement than the naive, un-optimized combinatorial circuit above, and less efficient to boot.
State machine circuits are circuits with memory: They store their current state and always compute the outputs (depending on the current state) and the next state (depending on current state and inputs) each clock cycle. On some inputs they may go through N states until they produce an output, and N+x on other inputs. ALU operations generally don't do that though. Pipeline stalls, branch mispredictions, and cache misses are common reasons one instruction takes longer than usual in some circumstances. Properly reasoning about these in a way that helps programmers write faster code is hard though: You have to take into account all the tricks and quirks of real hardware, and there's a lot of those. Empirical evidence, i.e. benchmarking a real black box CPU, is vital.
When it gets down to the assembly, the cmp instruction is used regardless of the contents of the variables.
So there is no performance difference.
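To make that concrete, a trivial comparison like the one below typically compiles (on x86-64 at -O2, in my experience) to a fixed cmp/sete/ret sequence, with no dependence on the byte values:

// On x86-64 at -O2 this usually becomes a single cmp followed by sete and ret,
// i.e. the same fixed sequence of instructions whether the inputs are 0, 1 or 255.
bool equal(unsigned char a, unsigned char b)
{
    return a == b;
}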

In VHDL ..... how to count leading zeros of vector?

I'm working on a VHDL project and I'm facing a problem calculating the length of a vector. I know there is a length attribute of a vector, but that's not the length I'm looking for. For example, I have a std_logic_vector
E : std_logic_vector(7 downto 0);
then
E <= "00011010";
so len = E'length = 8, but I'm not looking for this. I want to calculate len after discarding the leftmost zeros, so len = 5.
I know that I can use a for loop, checking "0" bits from left to right and stopping when a "1" bit occurs. But that's not efficient, because I have 1024 or more bits and that will slow my circuit. So is there any method or algorithm to calculate the length in an efficient way, such as using combinational logic with log(n) levels of gates (where n = number of bits)?
What you do with your "bit counting" is very similar to the logarithm (base 2).
This is commonly used in VHDL to figure out how many bits are required to represent a signal. For example if you want to store up to N elements in RAM, the number of bits required for addressing that RAM is ceil(log2(N)). For this I use:
function log2ceil(m : natural) return natural is
begin  -- note: for log(0) we return 0
    for n in 0 to integer'high loop
        if 2**n >= m then
            return n;
        end if;
    end loop;
end function log2ceil;
Usually, you want to do this at synthesis time with constants, and speed is no concern. But you can also generate FPGA logic, if that's really what you want.
As others have mentioned, a "for" loop in VHDL is just used to generate a lookup table, which may be slow due to long signal paths, but still only takes a single clock. What can happen is that your maximum clock frequency goes down. Usually this is only a problem if you operate on vectors larger than 64bit (you mentioned 1024 bits) and clocks faster than 100MHz. Maybe the synthesizer already told you that this is your problem, otherwise I suggest you try first.
Then you have to split up the operation over multiple clocks, and store some intermediate result into a FF. (I would upfront forget about trying to outsmart the synthesizer by rearranging your code. A lookup-table is a table. Why should it matter how you generate the values in this table? But make sure you tell the synthesizer about "don't care" values if you have them.)
If speed is your concern, use the first clock to check all 16bit blocks in parallel (independent of each other), and then use a second clock cycle to combine the results of all 16bit blocks into a single result. If the amount of FPGA logic is your concern, implement a state machine that checks a single 16bit block at every clock cycle.
But be careful that you don't re-invent the CPU while doing that.
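As a purely software model of that two-stage idea (the 16-bit block width matches the answer, but the C++ form, the storage layout and the names are my own assumptions):

#include <array>
#include <cstdint>

// Model of the two-clock approach for a 1024-bit vector stored as 64 x 16-bit
// blocks, with block 63 holding the most significant bits.
// Stage 1 (all blocks evaluated in parallel in hardware): per-block zero flag
// and per-block leading-zero count. Stage 2: pick the first nonzero block.
unsigned leading_zeros_1024(const std::array<uint16_t, 64> &blocks)
{
    unsigned block_lz[64];
    bool     block_zero[64];
    for (int b = 0; b < 64; ++b) {                // stage 1
        uint16_t v = blocks[b];
        block_zero[b] = (v == 0);
        unsigned lz = 0;
        for (int bit = 15; bit >= 0 && !((v >> bit) & 1); --bit)
            ++lz;
        block_lz[b] = lz;
    }
    unsigned total = 0;
    for (int b = 63; b >= 0; --b) {               // stage 2: scan from the MSB block
        if (!block_zero[b])
            return total + block_lz[b];
        total += 16;
    }
    return 1024;                                  // the whole vector is zero
}

In hardware, stage 1 is 64 small independent circuits and stage 2 is a single priority pick over 64 flags, which keeps every individual path short.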
The problem with using a loop is that when you synthesize you might get a very long chain of logic.
Another way to look at your problem is to find the index of the most significant set bit.
To do this you can use a priority encoder. The nice thing about this is you can make a large priority encoder by using smaller priority encoders in a tree structure, so the delay is O(log N) instead of O(N).
Here is a 4 bit priority encoder:
http://en.wikibooks.org/wiki/VHDL_for_FPGA_Design/Priority_Encoder
You can make a 16 bit priority encoder using 5 of these blocks, then a 256 bit encoder from five 16 bit encoders, etc.
But since you have so many bits it is going to be fairly huge.
Well, VHDL is not SW, it does not take time to perform an operation like this, it just takes resources from your FPGA.
You can divide your 1024-bit data into 32-bit sections and perform an OR between all the bits; this way you check 32 bits at a time. It is not really necessary, since the for loop would work perfectly fine for what you want to do: just write the code, look for the first 1 in the array, stop the loop, and use the loop index as the pointer to the first 1 in your array. I didn't compile this code, but something like this should work for you:
FirstOne <= 1023;
for i in E'range loop
    if E(i) = '1' then
        FirstOne <= i;
        exit;
    end if;
end loop;
It will not be such a big block inside an FPGA after all.
Most synthesizers these days support recursive functions, and indeed this will give you complexity comparable to log(N), where N is the number of bits (a software sketch of the recursion follows the steps below):
Cut your vector into halves
If the top half are all zeros
The leading bit of your answer is '1', low bits depend on the bottom half vector
Otherwise
The leading bit of your answer is '0', low bits depend on the top half vector
Recurse on the half vector of interest chosen above
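In software terms (C++ rather than VHDL, assuming a power-of-two width; the function name is mine), the recursion looks like this:

#include <cstdint>

// Recursive leading-zero count over a power-of-two number of bits, mirroring
// the recursion described above: test whether the top half is all zeros,
// emit one bit of the answer, and recurse into the half of interest.
unsigned count_leading_zeros(uint64_t v, unsigned width)
{
    if (width == 1)
        return v ? 0 : 1;                             // base case: a single bit
    unsigned half = width / 2;
    uint64_t top  = v >> half;                        // top half of the vector
    if (top == 0)                                     // top half all zeros: answer bit is '1'
        return half + count_leading_zeros(v & ((uint64_t{1} << half) - 1), half);
    return count_leading_zeros(top, half);            // otherwise the answer bit is '0'
}

// e.g. count_leading_zeros(0b00011010, 8) == 3, so len = 8 - 3 = 5 for the E above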

Sum reduction of binary sequence

Consider a binary sequence:
11000111
I have to find the sum of this sequence (actually in parallel):
Sum = 1+1+0+0+0+1+1+1 = 5
This is a waste of resources: why invest time in adding 0s?
Is there any clever way to sum this sequence so I can avoid unnecessary additions?
Operate at the byte level rather than the bit level. Use a small LUT to convert a byte to a population count. That way you're only doing one lookup and one add per 8 bits. Unless your data is likely to be very sparse this should be quite efficient.
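A minimal C++ sketch of that byte-wide LUT approach (the table construction and the names are my own; the answer only prescribes one lookup and one add per byte):

#include <array>
#include <cstddef>
#include <cstdint>

// 256-entry population-count table, built once: t[i] = t[i/2] + (i & 1).
static const std::array<uint8_t, 256> popcount_lut = []() -> std::array<uint8_t, 256> {
    std::array<uint8_t, 256> t{};
    for (int i = 1; i < 256; ++i)
        t[i] = static_cast<uint8_t>(t[i / 2] + (i & 1));
    return t;
}();

// One lookup and one add per byte of input.
unsigned count_set_bits(const uint8_t *data, std::size_t len)
{
    unsigned sum = 0;
    for (std::size_t i = 0; i < len; ++i)
        sum += popcount_lut[data[i]];
    return sum;
}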
Well it depends on how you store your bitset.
If it's an array, then you can't do better than a plain for loop. If you want to do this in parallel, just split the array into chunks and process them concurrently.
If we are talking about a bitset (storing the bits in a native (32/64-bit) integer type), then the simplest way to count bits would be this one:
int bitset;
int s = 0;
for (; bitset; s++)
    bitset &= bitset - 1;
This clears the lowest set bit at every step, so you have O(s).
Of course, you can combine these two methods if you need more than 32/64 bits.
I don't know why people are answering without even looking at the link in the first comment on the question. You can easily get it below O(size_of_bitset), at least when it comes to the constant factor.
You could use this method (found in the link posted by J.F. Sebastian):
inline int count_bits(int num){
    int sum = 0;
    for (; num; sum++) num &= num - 1;
    return sum;
}

int main(void){
    int array[N];
    int total_sum = 0;
    #pragma omp parallel for reduction(+:total_sum)
    for (size_t i = 0; i < N; i++){
        total_sum += count_bits(array[i]);
    }
}
This will count the number of set bits in the array's memory range in parallel. The inline is important to avoid unnecessary call overhead, and the compiler should optimize it much better that way.
You can swap count_bits for anything better that counts the bits in an integer, if you find something faster. This version has complexity O(bits_set) (not the size of the bit set!).
Invoking the parallel construct introduces quite a lot of overhead compared to a single summation, so the input needs to be quite large to compensate.
The parallelism is done via OpenMP. The partial sum of each thread is added up at the end of the parallel loop and stored in total_sum. Note that total_sum will be private inside the loop for each thread due to the reduction clause.
You could alter the code to count the bits set in an arbitrary memory region, but it is quite important for the region to be memory aligned when you perform operations at such a low level.
As far as I can see, it would be wasteful to try to handle the zeros specially. As #bdares said, addition is really cheap. At a minimum, you'll need to execute N instructions to sum up an N-bit sequence, and that would be if you unconditionally sum every bit. If you add a test to see whether the bit is a 0 or 1, that's another instruction that needs to be executed for each bit. Even if there's no branch penalty, you're executing a minimum of 1 instruction for every bit (the conditional test), and then you're also executing the original instruction (the add) for any bits that are equal to 1. So even without a branch penalty, this takes more time to execute.
#bdares mentions that the compiler will optimize out the branches, but that's only if the value of each bit is known at compile time, and if you know the values of the bits at compile time, you should just add them up yourself in advance.
There might be some cute things you can do with bit twiddling. For instance, if you take the bits two at a time you're adding up values of 0, 1, 2, or 3, and only have half as many additions to do. There may be something you can then do with the result to convert it into the value you want, but I haven't actually thought about how to do that.
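The "two bits at a time" idea is essentially the first step of the classic bit-twiddling population count; a hedged 32-bit C++ version of the full trick:

#include <cstdint>

// Classic SWAR population count: add adjacent bit pairs in parallel, then
// adjacent 2-bit fields, then 4-bit fields, and finally sum the byte counts.
unsigned popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                  // bit pairs -> 2-bit sums
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  // 2-bit sums -> 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  // 4-bit sums -> per-byte sums
    return (x * 0x01010101u) >> 24;                    // add the four byte sums
}

// popcount32(0b11000111) == 5, the sum asked about above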
