Fast division algorithm for binary numbers

I'm currently building a 16-bit ALU using Logisim (i.e. logic gates only), and am stuck on the division process. I am currently just using the simple, standard "division algorithm loop" (as shown below):
1. Read input values;
2. Compare input values; wait until the comparison process has finished;
3. If A < B, go to step 10. If A ≥ B, go to the next step;
4. Subtract B from A;
5. Wait until the subtraction process has finished;
6. Add one to count;
7. Wait until the counting process has finished;
8. Write the value from the subtraction process back to the input;
9. Go to step 1;
10. The answer is count, with remainder A.
This, however, takes a very long time for divisions with large quotients (repeating a 300-tick cycle 65,000 times isn't fun).
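(In software terms the loop above is roughly the following sketch; the body runs once per unit of the quotient, which is exactly the problem.)

/* Illustrative software form of the loop above (assumes b > 0):
   count goes up by one per pass, so large quotients mean many passes. */
unsigned slow_divide(unsigned a, unsigned b, unsigned *remainder)
{
    unsigned count = 0;
    while (a >= b) {        /* steps 2-3: compare A and B            */
        a -= b;             /* steps 4-5: subtract B from A          */
        count++;            /* steps 6-7: add one to count           */
    }
    *remainder = a;         /* step 10: answer is count, remainder A */
    return count;
}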
I'm just wondering if there are similar but quicker algorithms (using exclusively addition, subtraction, multiplication and boolean logic) that could be implemented using logic gates.
Any help or ideas would be greatly appreciated!
Fraser

Use long division. In binary there is no multiplication step, since the quotient digit at each bit position can only be 1 or 0. So it can be implemented as a conditional subtract (subtract if the result is non-negative) and a shift.
That's just a crude outline, of course.

A typical approach for a 32/16:16+16 division would be to have the dividend stored in a pair of 16-bit registers (which get updated during operation) and the divisor in its own register (which doesn't). Sixteen times, subtract the divisor from the upper 17 bits of the dividend; if a borrow results, discard the result and shift the dividend left one place, putting a 0 into the lsb. If no borrow results, store the result into the dividend while shifting it left, but put a 1 into the lsb. After sixteen such steps, the lower 16 bits of the dividend register will hold the quotient, and the upper 16 bits will hold the remainder. Note that this operation will only work if the quotient is representable in 16 bits. Note also that on a processor which implements 32/16:16+16 division in this fashion, one may conveniently divide arbitrarily-large numbers by a 16-bit quantity, since the upper 16 bits of the dividend for each step should be the remainder from the previous step.
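A minimal C model of that register behaviour, assuming the quotient fits in 16 bits (this does the shift first and the conditional subtract second, which amounts to the same per-step transformation; the function name is just for illustration):

#include <stdint.h>

/* Minimal C model of the 32/16:16+16 register behaviour described above.
   Only valid when the quotient fits in 16 bits. */
uint16_t div32by16(uint32_t dividend, uint16_t divisor, uint16_t *remainder)
{
    uint64_t acc = dividend;                 /* the 16+16 bit dividend pair (with headroom) */
    for (int i = 0; i < 16; i++) {
        acc <<= 1;                           /* shift the pair left one place           */
        if ((acc >> 16) >= divisor) {        /* would subtracting the divisor borrow?   */
            acc -= (uint64_t)divisor << 16;  /* no borrow: keep the difference          */
            acc |= 1;                        /* ...and put a 1 into the lsb             */
        }                                    /* borrow: keep the shifted value, lsb = 0 */
    }
    *remainder = (uint16_t)(acc >> 16);      /* upper 16 bits: remainder */
    return (uint16_t)acc;                    /* lower 16 bits: quotient  */
}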

Related

QR code generation algorithm implementation case analysis

I'm implementing a QR code generation algorithm as explained on thonky.com and I'm trying to understand one of the cases:
as stated in this page and this page, I can deduce that if the code is protected with the M error correction level, and the chosen mask is No. 0, the first 5 bits of the format string (non-XORed) are '00000', and because of this the whole 15-bit string is zeros.
The next step is to remove all leading zeros, which are, again, all of them. It means that there's nothing to XOR the generator polynomial string (10100110111) with, thus giving us a final string of 15 zeros, which means that the final (XORed) string will simply be the mask string (101010000010010).
I'm seeking confirmation that my logic is right.
Thank you all very much in advance for the help.
Your logic is correct.
remove all leading zeroes
The actual process could be described as appending 10 zero bits to the 5 bits of data and treating the 15 bits as 15 single bit coefficients of a polynomial, then dividing that polynomial by the 11 bit generator polynomial resulting in a 10 bit remainder polynomial, which is then subtracted from the 5 data bits + 10 zero bits polynomial. Since this is binary math, add and subtract are both xor operations, and since the 10 appended bits are zero bits, the process can just append the 10 remainder bits to the 5 data bits.
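For illustration, a small C sketch of that computation, assuming the 5 format bits are held in an integer (generator 10100110111 = 0x537, mask string 101010000010010 = 0x5412); for level M with mask 0 the input is 0 and the result is exactly the mask string:

/* Sketch of the BCH(15,5) format-string computation described above.
   data5 holds the 2 EC-level bits and 3 mask bits (all zero for M, mask 0). */
unsigned qr_format_bits(unsigned data5)
{
    unsigned rem = data5 << 10;              /* append 10 zero bits                   */
    for (int i = 14; i >= 10; i--)           /* polynomial division over GF(2)        */
        if (rem & (1u << i))
            rem ^= 0x537u << (i - 10);       /* generator 10100110111                 */
    return ((data5 << 10) | rem) ^ 0x5412u;  /* append remainder, XOR the mask string */
}
/* qr_format_bits(0) == 0x5412, i.e. 101010000010010, as deduced above. */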
As commented above, rather than actually implementing a BCH encode function, since there are only 32 possible format strings, you can just do a table lookup.
https://www.thonky.com/qr-code-tutorial/format-version-tables

Applications of a circular shift

I would like to know some examples of applications of circular shifts. For example, a right shift on an unsigned integer results in a division by two; conversely, a left shift results in a multiplication by 2. Are there any famous/interesting properties of a circular shift on binary numbers?
Note: The example about the right/left shift is to illustrate an application of that particular operator. I am asking for similar examples for the circular shift operator/function.
One surprising place where circular shifts show up is in the Josephus survivor problem. In this slightly morbid problem, n people stand in a circle. The first person kills the second, then the third kills the fourth, etc. This process repeats where one person kills the next until only one person is left. The question is given n people, which person survives?
Amazingly, the answer is given by doing a left circular shift on n by one bit (within n's own bit length). There's a great proof of this in Graham, Knuth, and Patashnik's book Concrete Mathematics.
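A small C sketch of that (the function name is illustrative): find the most significant bit of n, drop it, shift left and append a 1, which is the one-bit left rotation within n's own bit length:

/* Sketch: survivor of the n-person circle via a one-bit left rotation of n
   within its own bit length (equivalently 2L + 1 where n = 2^m + L). */
unsigned josephus(unsigned n)
{
    unsigned msb = 1;
    while (msb <= n / 2)            /* find the highest set bit of n        */
        msb *= 2;
    return ((n - msb) << 1) | 1;    /* drop the MSB, shift left, append a 1 */
}
/* josephus(41) == 19: 41 = 101001b rotates left to 010011b = 19. */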
Hope this helps!
Convert a 16-bit word between big-endian and little-endian representation: right or left circular shift by 8.
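In C, assuming a uint16_t x, that rotate is just:
uint16_t swapped = (uint16_t)((x << 8) | (x >> 8));   /* rotate by 8 = swap the two bytes */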
Generate random bitset with even number of bits set: t = rand(); result = t XOR cshift(t,1).
In-place, stable, and in linear time: move all elements of an array at even positions to the beginning and all elements at odd positions to the end. One possible algorithm is described in this paper: "In-Situ, Stable Merging by way of the Perfect Shuffle" (section 7). It generates all possible binary necklaces and uses them as starting points of a cycle-leader algorithm, where each next position is computed from the previous one by a circular shift. This application is closely related to the multiplication by 2 (mod (2^N - 1)) mentioned in Henrik's answer.
Micro-optimization. Suppose you need to unpack four 2-bit words from a single byte. You could do this by shifting each sub-word to rightmost position, then applying AND operation with proper mask. (No need to shift the first sub-word or mask the last one). All this needs 6 CPU instructions. If you circularly shift the byte by 4, two middle sub-words become the first and the last one, and also need only one instruction each. So using circular shift decreases number of needed instructions to 5.
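A sketch of that in C (the helper name is made up; on a CPU with a rotate instruction, the (b << 4) | (b >> 4) expression compiles to a single rotate):

#include <stdint.h>

/* Hypothetical sketch: unpack four 2-bit fields d[0..3] (d[0] = least
   significant) from a byte, using one rotate so the two middle fields
   also need only a single mask or shift each. */
void unpack4x2(uint8_t b, uint8_t d[4])
{
    uint8_t r = (uint8_t)((b << 4) | (b >> 4)); /* rotate by 4                      */
    d[0] = b & 3;        /* lowest field: mask only                                 */
    d[3] = b >> 6;       /* highest field: shift only                               */
    d[2] = r & 3;        /* 2nd field, now in the lowest position: mask only        */
    d[1] = r >> 6;       /* 3rd field, now in the highest position: shift only      */
}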
Cryptography applications receive a significant speed-up when the machine instruction set includes rotate instructions. For example, the Twofish cipher uses circular shifts extensively.
A regular left shift is a multiplication by 2 (mod 2^N), where N is the number of bits in your integer type.
A circular left shift is a multiplication by 2 (mod (2^N - 1)). So this can be handy when doing arithmetic mod (2^N-1).
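A quick sanity check in C with N = 16 (just a sketch; note that 0xFFFF rotates to itself, which is still congruent to 0 mod 2^16 - 1, so the check is on congruence):

#include <assert.h>
#include <stdint.h>

/* Sketch: a 1-bit left rotate of a 16-bit value equals 2*x mod (2^16 - 1),
   up to the usual all-ones representative of zero. */
void check_rotl_mod(uint16_t x)
{
    uint16_t rot = (uint16_t)((x << 1) | (x >> 15));
    assert(rot % 0xFFFF == ((uint32_t)x * 2) % 0xFFFF);
}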

Operations on bits when doing binary long division

This is from the number theory chapter in CLRS.
We are asked to prove that binary "paper and pencil" long division a/b with quotient q and remainder r does O((1 + lg q) lg b) operations on bits.
The way I see it, we do one subtraction of b for each bit in q. So assuming that subtracting b takes lg b operations (one for each bit in b), we have a total of O(lg b lg q) operations, which is not what is requested.
If you take into account that the first subtraction operation you do might result in a 0 bit (for example, dividing 100b by 11b), then, OK, you can add 1 to lg q to compensate for this subtraction. But... the same could be said of the subtraction itself: it can take lg b operations or lg(b) + 1 operations depending on the numbers (in the 100b and 11b example, the second subtraction will be 100b - 11b, which takes 3 operations to complete).
So if we're factoring in these cases, the number of operations should be O((1 + lg b)(1 + lg q)).
So the question is: how can you show that the division takes O((1 + lg q) lg b) operations?
When you subtract 100b-11b, you can actually ignore the leading bit in the first number, because you already know that the corresponding bit in the result will be 0. Had it been 1, you would have done a subtraction instead of a shift in the previous step. So the subtraction always considers exactly lg b bits.

What division algorithm should be used for dividing small integers in hardware?

I need to multiply an integer ranging from 0-1023 by 1023 and divide the result by a number ranging from 1-1023 in hardware (Verilog/FPGA implementation). The multiplication is straightforward, since I can probably get away with just shifting 10 bits (and if needed I'll subtract an extra 1023). The division is a little interesting though. Area/power aren't really critical to me (I'm in an FPGA so the resources are already there). Latency (within reason) isn't a big deal so long as I can pipeline the design. There are obviously several choices with different trade-offs, but I'm wondering if there's an "obvious" or "no brainer" algorithm for a situation like this. Given the limited range of operands and the abundance of resources that I have (BRAM, etc.), I'm wondering if there isn't something obvious to do.
If you can work with fixed-point precision rather than integers, it may be possible to change:
divide the result by a number ranging from 1-1023
to multiplication by a number ranging from 1 to 1/1023, i.e. pre-compute the division and store that as the coefficient for the multiply.
If you can pre-compute everything, and you've got a spare 20x20 multiplier, and some way to store your pre-computed number, then go for Morgan's suggestion. You need to precompute a 20-bit multiplicand (10b quotient, 10b remainder), and multiply by your first 10b number, and take the bottom 30b of the 40b result.
Otherwise, the no-brainer is non-restoring division, since you say that latency isn't important (there's lots of stuff on the web, most of it incomprehensible). You have a 20-bit numerator (the result of your ×1023 multiplication) and a 10-bit denominator. This gives a 20b quotient and a 10b remainder (i.e. 20 bits for the integer part of the answer, and 10 bits for the fractional part, giving a 30b answer).
The actual hardware is pretty trivial: an 11b adder/subtractor, a 31b shift register, and a 10b or 11b register to store the divisor. You also need a small FSM to control it (2b). You have to do a compare, add or subtract, and shift in every clock cycle, and you get the answer out in 21 cycles. I think. :)
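For reference, a plain C model of non-restoring division at those widths (a behavioural sketch only, not RTL; the function name is made up). Each step does one shift and one add-or-subtract, matching the per-cycle work described above, with a single correction at the end if the remainder ends up negative:

#include <stdint.h>

/* Behavioural sketch (not RTL) of non-restoring division for a 20-bit
   numerator and a 10-bit denominator (1..1023): one shift and one
   add-or-subtract per step, 20 steps, plus a final correction. */
uint32_t nr_divide(uint32_t num, uint32_t den, uint32_t *rem_out)
{
    int32_t  rem = 0;                        /* signed partial remainder (~11 bits)       */
    uint32_t quo = 0;
    for (int i = 19; i >= 0; i--) {
        rem = rem * 2 + (int32_t)((num >> i) & 1);  /* shift in the next numerator bit    */
        if (rem >= 0)
            rem -= (int32_t)den;             /* non-negative: subtract the divisor        */
        else
            rem += (int32_t)den;             /* negative: add it back instead             */
        if (rem >= 0)
            quo |= 1u << i;                  /* quotient bit is 1 when the result stayed >= 0 */
    }
    if (rem < 0)
        rem += (int32_t)den;                 /* final restore of the remainder            */
    *rem_out = (uint32_t)rem;
    return quo;
}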

Fastest method for adding/summing the individual digit components of a number

I saw a question on a math forum a while back where a person was discussing adding up the digits of a number over and over again until a single digit is achieved. (i.e. "362" would become "3+6+2", which would become "11"... then "11" would become "1+1", which would become "2", therefore "362" would return 2.) I wrote some nice code to get an answer to this and posted it, only to be outdone by a user who suggested that any number modulo 9 is equal to this "infinite digit sum". I checked and he was right... well, almost right: if zero was returned you had to switch it out with a "9", but that was a very quick fix...
362 = 3+6+2 = 11 = 1+1 = 2
or...
362%9 = 2
Anyway, the mod 9 method works fantastically for repeatedly adding the digits until you are left with just a single digit... but what about only doing it once (i.e. 362 would just return "11")? Can anyone think of fast algorithms?
There's a cool trick for summing the 1 digits of a fixed-width binary integer. At each iteration, you separate the digits into two values (using masks), bit-shift one value down, then add. The first iteration separates every other digit; the second, pairs of digits, and so on.
Given that 27 is 00011011 as 8-bit binary, the process is...
00010001 + 00000101 = 00010110 <- every other digit step
00010010 + 00000001 = 00010011 <- pairs of digits
00000011 + 00000001 = 00000100 <- quads, giving final result 4
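In C, those three steps are the classic parallel bit count; for an 8-bit value:

#include <stdint.h>

/* The three steps shown above, for an 8-bit value: popcount8(27) == 4. */
uint8_t popcount8(uint8_t v)
{
    v = (uint8_t)((v & 0x55) + ((v >> 1) & 0x55));  /* every other bit */
    v = (uint8_t)((v & 0x33) + ((v >> 2) & 0x33));  /* pairs of bits   */
    v = (uint8_t)((v & 0x0F) + ((v >> 4) & 0x0F));  /* nibbles         */
    return v;
}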
You could do a similar trick with decimal, but it would be less efficient than a simple loop unless you had a direct representation of decimal numbers with fast operations to zero out selected digits and to do digit-shifting. So for 12345678 you get...
02040608 + 01030507 = 03071115 <- every other digit
00070015 + 00030011 = 00100026 <- pairs
00000026 + 00000010 = 00000036 <- quads, final result
So 1+2+3+4+5+6+7+8 = 36, which is correct, but you can only do this efficiently if your number representation is fixed-width decimal. It always takes lg(n) iterations (where n is the number of digits, lg is the base-two logarithm, and you round upwards).
To expand on this a little (based on in-comments discussions), let's pretend this was sane, for a bit...
If you count single-digit additions, there's actually more work than a simple loop here. The idea, as with the bitwise trick for counting bits, is to re-order those additions (using associativity) and then to compute as many as possible in parallel, using a single full-width addition to implement two half-width additions, four quarter-width additions etc. There's significant overhead for the digit-clearing and digit-shifting operations, and even more if you implement this as a loop (calculating or looking up the digit-masking and shift-distance values for each step). The "loop" should probably be fully unrolled and those masks and shift-distances be included as constants in the code to avoid that.
A processor with support for Binary Coded Decimal (BCD) could handle this. Digit masking and digit shifting would be implemented using bit masking and bit shifting, as each decimal digit would be encoded in 4 (or more) bits, independent of the encoding of other digits.
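For illustration, assuming the value is already held as 8 packed-BCD digits in an ordinary 32-bit register, the worked decimal example above becomes the sketch below (the masks widen at each step because the partial sums outgrow a single digit; 0x0B and 0x0F are the 11 and 15 shown earlier):

#include <stdint.h>

/* Sketch: digit sum of 8 packed-BCD digits (0x12345678 represents 12345678),
   mirroring the every-other-digit / pairs / quads steps above. */
uint32_t bcd_digit_sum8(uint32_t v)
{
    v = (v & 0x0F0F0F0F) + ((v >> 4) & 0x0F0F0F0F);  /* every other digit  */
    v = (v & 0x00FF00FF) + ((v >> 8) & 0x00FF00FF);  /* pairs of digits    */
    v = (v & 0x0000FFFF) + (v >> 16);                /* quads: final total */
    return v;                        /* bcd_digit_sum8(0x12345678) == 36   */
}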
One issue is that BCD support is quite rare these days. It used to be fairly common in the 8 bit and 16 bit days, but as far as I'm aware, processors that still support it now do so mainly for backward compatibility. Reasons include...
Very early processors didn't include hardware multiplication and division. Hardware support for these operations means it's easier and more efficient to convert binary to decimal now. Binary is used for almost everything now, and BCD is mostly forgotten.
There are decimal number representations around in libraries, but few if any high-level languages ever provided portable support for hardware BCD, so once assembler stopped being a real-world option for most developers, BCD support simply stopped being used.
As numbers get larger, even packed BCD is quite inefficiently packed. Number representations base 10^x have the most important properties of base 10, and are easily decoded as decimal. Base 1000 only needs 10 bits per three digits, not 12, because 2^10 is 1024. That's enough to show you get an extra decimal digit for 32 bits - 9 digits instead of 8 - and you've still got 2 bits left over, e.g. for a sign bit.
The thing is, for this digit-totalling algorithm to be worthwhile at all, you need to be working with fixed-width decimal of probably at least 32 bits (8 digits). That gives 12 operations (6 masks, 3 shifts, 3 additions) rather than 15 additions for the (fully unrolled) simple loop. That's a borderline gain, though - and other issues in the code could easily mean it's actually slower.
The efficiency gain is clearer at 64 bits (16 decimal digits) as there's still only 16 operations (8 masks, 4 shifts, 4 additions) rather than 31, but the odds of finding a processor that supports 64-bit BCD operations seems slim. And even if you did, how often do you need this anyway? It seems unlikely that it could be worth the effort and loss of portability.
Here's something in Haskell:
sumDigits n =
  if n == 0
    then 0
    else let a = mod n 10
         in a + sumDigits (div n 10)
Oh, but I just read you're doing that already...
(then there's also the obvious:
sumDigits n = sum $ map (read . (:[])) . show $ n
)
For short code, try this:
int digit_sum(int n) {
    if (n < 10) return n;
    return n % 10 + digit_sum(n / 10);
}
Or, in words,
- If the number is less than ten, then the digit sum is the number itself.
- Otherwise, the digit sum is the current last digit (a.k.a. n mod 10, or n % 10), plus the digit sum of everything to the left of that digit (n divided by 10, using integer division).
- This algorithm can also be generalized for any base by substituting that base for 10.
int digit_sum(int n) {
    int sum = 0;
    do {
        sum += n % 10;   /* take the last digit */
        n /= 10;         /* and drop it         */
    } while (n > 0);
    return sum;
}
