I am implementing a VHDL 8-bit fixed-point multiplication module that returns an 8-bit truncated number, but I have a problem when I do multiplications by hand in order to test it. The problem arises when I want to multiply two negative numbers.
I tried multiplying two positive values: 1.67 * 0.625 ~ 1.04 (0.906 in the binary multiplication).
001.10101 -> 1.67
000.10100 -> 0.625
------------
000000.1110111010 = 000.11101 (truncated to 8bits = 0.906)
I tried multiplying a negative and a positive number (-0.875 * 3 ~ -2.62):
111.00100 -> -0.875
011.00000 -> 3
----------
010101.0110000000 = 101.01100 (truncated to 8bits = -2.625)
So far everything is working properly. The problem comes when I try to multiply two negative numbers. According to what I know (unless I'm mistaken):
- multiplying two numbers will give a result with twice the resolution (multiply two 8-bit numbers and you get a 16-bit number)
- the fixed point gets relocated as well. In this example there are 3 bits before the point and 5 bits after it. This means that the resulting number will have 6 bits before the point and 10 bits after the point.
Assuming this, the above calculations worked properly. But then I tried to multiply two negative values (-0.875 * -1.91 ~ 1.67):
110.00010 -> -1.91 (-1.9375)
111.00100 -> -0.875
------------
101011.0011001000 = 011.00110(truncated to 8 bits = 3.1875)
Naturally, I tried another negative multiplication (-2.64 * -0.875 = 2.31)
101.01011 -> -2.64
111.00100 -> -0.875
----------
100110.0001001100 = 110.00010 (truncated to 8bits = -1.9375)
Clearly I'm doing something wrong, but I just can't see what I'm doing wrong.
PS: I haven't implemented it yet. I figured out how I was going to do it and then tried to test it by hand with some simple examples. I thought the working cases might just have been luck, but apparently not: I tried a few more multiplications and they also worked. So maybe I'm doing something wrong only when multiplying two negative numbers; maybe I'm truncating it wrong? Probably.
EDIT:
OK, I found a Xilinx document that explains how multiplication is done when the two operands are negative, here is the link. According to this document, this can only be done when doing extended multiplication, and the last partial sum of the multiplication must be inverted and have 1 added to it, which results in the correct number.
To do the multiplications I used Windows' calculator in programmer mode, which means that to multiply the 8-bit numbers I put them in the calculator, got the result, and truncated it. Since this worked for the other cases, the Windows calculator must be doing a direct multiplication (adding all the partial sums as they are, instead of inverting the last partial sum). So, to obtain the real result I should subtract the first operand from the final result and then add the first operand inverted + 1:
110.00010 -> -1.91 (1.9375)
111.00100 -> -0.875
------------
101011.0011001000
Which gave me the result: 000010.0111001000 = 010.01110 (truncated to 8 bits = 2.43)
And with the other one I came up with a result of 1.875. Those outputs aren't exactly great, but at least they are closer to what I expected. Is there an easier way to do this?
Your intermediate results are wrong, so the truncation did not work as expected. Moreover, in your format the truncation is only possible without overflow if the four top-most bits of the intermediate result are equal.
You should use signed data-types to do the multiplication right.
Even your second example is wrong. The intermediate binary result 010101.0110000000 represents the decimal number 21.375 which is not the product of -0.875 and 3. So, let's do the multiplication by hand:
a * b = -0.875 * 3 = -2.625
111.00100 * 011.00000
---------------------
. 00000000 // further lines containing only zeros have been omitted
+ .01100000
+ 011.00000
+ 0110.0000
+ 110100.000 // add -(2^2) * b !
= 111101.0110000000 = -2.625 (intermediate result)
= 101.01100 = -2.625 after truncation
You have to add the two's complement of b in the last partial sum because the '1' in the top-most bit of a represents the value -(2^2) = -4. Truncation without overflow is possible here because the 4 top-most bits of the intermediate result are equal.
And now the third example
a * b = -1.9375 * -0.875 = 1.6953125
110.00010 * 111.00100
---------------------
. 00000000 // further lines containing only zeros have been omitted
+ 111111.111100100 // sign-extended partial-sum
+ 111110.0100 // sign-extended partial-sum
+ 000011.100 // add -4 * b
= 000001.101100100 = 1.6953125 (intermediate result)
~ 001.10110 = 1.6875 after truncation
As b is a signed number, one always has to sign-extend the partial sums to the width of the intermediate result. Of course, this has also been done in the calculation of the second example, but there it does not make a difference.
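If you want to sanity-check this in software before writing the VHDL, the same result falls out of an ordinary signed multiply, because signed types give you the sign extension for free. Below is a minimal C sketch (my own illustration, not the poster's module; the Q3.5 interpretation and the name q35_mul are assumptions taken from the examples above):

#include <stdint.h>
#include <stdio.h>

/* Multiply two Q3.5 fixed-point values (8-bit signed, 5 fractional bits)
 * and truncate the Q6.10 intermediate back to Q3.5. Only valid when the
 * four top-most bits of the 16-bit product are equal, as explained above. */
static int8_t q35_mul(int8_t a, int8_t b)
{
    int16_t product = (int16_t)a * (int16_t)b;  /* exact Q6.10 intermediate */
    return (int8_t)(product >> 5);  /* drop 5 fractional bits (arithmetic shift on typical compilers) */
}

int main(void)
{
    int8_t a = -62;  /* 110.00010 = -1.9375 */
    int8_t b = -28;  /* 111.00100 = -0.875  */
    printf("%f\n", q35_mul(a, b) / 32.0);  /* prints 1.687500, matching the hand calculation */
    return 0;
}

The point is the same as in the hand calculation: because the operands are signed, the multiply takes care of the sign extension of the partial sums, and VHDL's numeric_std signed type behaves the same way.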
Both numbers are in two's complement form
1101*1100
This is my work, but I am getting an answer that is way off. I'm not sure if it's the adding part or the shifting, but I know the criteria for when to shift and when not to shift, so I guess it is the adding; I'm just not sure what's going wrong.
Your error is in the third step when you do A-M then shift. In your calculations you are doing A+M, not A-M. Regardless of whether M is positive or negative, when you subtract, you take the 2's complement and then add. In essence, you need to do A + (-M). Taking the 2's complement of M gives 0101, so:
If A = 0000, M=1011:
A+M : 0000(0) + (1011)(-5) = 1011 (-5)
A-M : 0000(0) - (1011)(-5) = 0000(0) + 0101(5) = 0101 (5)
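If it helps, the A + (-M) step is easy to check with a few lines of C (just an illustration, working in 4 bits and masking with 0xF to stay in range):

#include <stdio.h>

int main(void)
{
    unsigned A = 0x0;                  /* 0000 */
    unsigned M = 0xB;                  /* 1011 = -5 in 4-bit two's complement */
    unsigned negM = (~M + 1u) & 0xF;   /* two's complement of M: 0101 = +5 */
    unsigned diff = (A + negM) & 0xF;  /* A - M computed as A + (-M) */
    printf("A - M = %X\n", diff);      /* prints 5 */
    return 0;
}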
This is my approach.
If you are considering (-9)*(-4),
then just reverse the 2's complement,
i.e.
first subtract 1,
then
complement the result again.
Say your result is 11101100.
Now subtract 1:
Step 1: 11101100 - 1 = 11101011
Step 2: invert (11101011) and the result is 00010100,
which is 36 in binary.
The above is the reverse of the 2's complement.
Try it on different problems and it works.
Actually I think it is the right method; I believe I have read it in some book, though I don't remember which.
Hope it works.
I have a number X, and I want to check how many powers of 2 it contains.
For example:
N=7: the answer is 2, i.e. 2*2
N=20: the answer is 4, i.e. 2*2*2*2
Similarly, I want to find the next power of 2.
For example:
N=14, Ans=16
Is there any bit hack for this without using for loops?
Like the one-line solution to check if a number is a power of 2, X&(X-1)==0; something similar to that?
GCC has a built-in function called __builtin_clz() that returns the number of leading zeros in an integer. So for example, assuming a 32-bit int, the expression p = 32 - __builtin_clz(n) will tell you how many bits are needed to store the integer n, and 1 << p will give you the next highest power of 2 (provided p<32, of course).
There are also equivalent functions that work with long and long long integers.
Alternatively, math.h defines a function called frexp() that returns the base-2 exponent of a double-precision number. This is likely to be less efficient because your integer will have to be converted to a double-precision value before it is passed to this function.
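As a rough sketch of the __builtin_clz() approach (GCC/Clang only; it assumes a 32-bit unsigned int, 0 < n <= 2^31, and the name next_pow2 is just illustrative):

#include <stdio.h>

/* Round n up to the next power of two using the leading-zero count. */
static unsigned next_pow2(unsigned n)
{
    if ((n & (n - 1)) == 0)  /* already a power of two */
        return n;
    return 1u << (32 - __builtin_clz(n));  /* one bit above the current MSB */
}

int main(void)
{
    printf("%u %u %u\n", next_pow2(7), next_pow2(14), next_pow2(20));  /* 8 16 32 */
    return 0;
}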
A number is a power of two if it has only a single '1' in its binary value. For example, 2 = 00000010, 4 = 00000100, 8 = 00001000, and so on. So you can check it by counting the number of 1's in its bit value. If the count is 1 then the number is a power of 2, and vice versa.
You can take help from here and here to avoid for loops for counting set bits.
If the count is not 1 (meaning the value is not a power of 2), then take the position of its first set bit from the MSB; the next power of 2 for this number is the value having its only set bit at position + 1. For example, the number 3 = 00000011. Its first set bit from the MSB is the 2nd bit. Therefore the next power of 2 is the value having only a set bit at the 3rd position, i.e. 00000100 = 4.
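A small sketch of that idea using GCC built-ins (my own example, assuming a 32-bit unsigned int and n > 0):

#include <stdio.h>

int main(void)
{
    unsigned n = 3;                              /* 00000011 */
    int is_pow2 = (__builtin_popcount(n) == 1);  /* exactly one set bit? */
    int msb_pos = 31 - __builtin_clz(n);         /* position of the first set bit from the MSB */
    unsigned next = is_pow2 ? n : 1u << (msb_pos + 1);
    printf("is_pow2=%d next=%u\n", is_pow2, next);  /* prints is_pow2=0 next=4 */
    return 0;
}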
I don't know if this is a bug in Lua itself or if I am doing something wrong; I couldn't find anything about it anywhere. I am using Lua for Windows (Lua 5.1.4):
>return math.random(0, 1000000000)
1251258
This returns a random integer between 0 and 1000000000, as expected. This seems to work for all other values. But if I add a single 0:
>return math.random(0, 10000000000)
stdin:1: bad argument #2 to 'random' (interval is empty)
Any number higher than that does the same thing.
I tried to figure out exactly how high a number has to be to cause this and found something even weirder:
>return math.random(0, 2147483647)
-75617745
If the value is 2147483647 then it gives me negative numbers. Any higher than that and it throws an error. Any lower than that and it works fine.
That's 0b1111111111111111111111111111111 in binary, 31 binary digits exactly. I am not sure what that means though.
This unexpected behavior (bug?) is due to how math.random treats the input arguments passed in Lua 5.1. From lmathlib.c:
case 2: {  /* lower and upper limits */
  int l = luaL_checkint(L, 1);
  int u = luaL_checkint(L, 2);
  luaL_argcheck(L, l<=u, 2, "interval is empty");
  lua_pushnumber(L, floor(r*(u-l+1))+l);  /* int between `l' and `u' */
  break;
}
As you may know, in C a standard int can represent values from -2,147,483,648 to 2,147,483,647. Adding +1 to 2,147,483,647, as in your use case, will overflow and wrap around, giving -2,147,483,648. The end result is negative since you're multiplying a positive with a negative number.
Furthermore, anything above 2,147,483,647 will fail the luaL_argcheck due to overflow wraparound.
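You can see what the C code is trying to compute by redoing the arithmetic in a wider type (a standalone illustration, not the actual Lua source):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    int l = 0;
    int u = 2147483647;                      /* INT_MAX, the failing upper bound */
    long long range = (long long)u - l + 1;  /* what lmathlib.c tries to compute in an int */
    printf("needed range: %lld\n", range);   /* 2147483648 */
    printf("INT_MAX:      %d\n", INT_MAX);   /* 2147483647, so u - l + 1 wraps around in an int */
    return 0;
}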
There are a few ways to address this problem:
Upgrade to Lua 5.2. That one has since fixed this issue by treating the input arguments as lua_Number instead.
Switch to LuaJIT which does not have this integer overflow issue.
Patch the Lua 5.1 source yourself with the fix and recompile.
Modify your random range so it does not overflow.
If you need a range that is larger than what the random function supports (32-bit signed integers, or 2^31 due to the sign bit, because math.random is at the C level), but smaller than the range of the Lua "number" type (based on What is the maximum value of a number in Lua?, 2^52, or maybe even 2^53), you could try generating two random numbers: scale the first to the desired range and add the second to "fill the gap". For example, say you want a range of 0 to 2^36. The largest from math.random is 2^31. So you could do:
-- 2^36 = 2^31 * 2^5 so
scale = 2^5
baseRand = scale * math.random(0, 2^31)
-- baseRand is now between 0 and 2^36 but there are gaps of 2^5 in the set
-- of possible values; fill the gaps with second random number:
fillGap = math.random(0, 2^5)
randNum = baseRand + fillGap
This will work as long as the desired range is less than the Lua interpreter's maximum for Lua numbers, which is a configurable compile-time parameter, but if you use a stock build it is 2^52, a very large number (although not as large as the largest long integer, 2^63).
Note also that the largest positive N-bit integer is 2^N - 1 (not 2^N), but the above technique can be applied to any range; you could have, for instance, scale = 10^6 and then randNum = 10^6 * math.random(0, 10^8) + math.random(0, 10^6).
I want to multiply two numbers, and detect if there was an overflow. What is the simplest way to do that?
Multiplying two 32-bit numbers results in a 64-bit answer, two 8-bit numbers give a 16-bit result, etc. Binary multiplication is simply shifting and adding. So if you had, say, two 32-bit operands, with bit 17 set in operand A and any of the bits above 15 or 16 set in operand B, you will overflow a 32-bit result: bit 17 shifted left by 16 is bit 33, which does not fit in 32 bits.
So the question again is: what are the sizes of your inputs and of your result? If the result is the same size, then you have to find the most significant 1 of both operands and add those bit positions; if that sum is bigger than your result space, you will overflow.
EDIT
Yes, multiplying two 3-bit numbers will result in either a 5-bit number, or a 6-bit number if there is a carry in the add. Likewise a 2-bit and a 5-bit number can result in 6 or 7 bits, etc. If the reason for the poster's question is to see whether you have space in your result variable for an answer, then this solution will work and is relatively fast for most languages on most processors. It can be significantly faster on some and significantly slower on others. It is generically fast (depending on how it is implemented, of course) to just look at the number of bits in the operands. Doubling the size of the largest operand is a safe bet if you can do it within your language or processor. Divides are downright expensive (slow), and most processors don't have one, much less at an arbitrary doubling of operand sizes. The fastest, of course, is to drop to assembler, do the multiply, and look at the overflow bit (or compare one of the result registers with zero). If your processor can't do the multiply in hardware then it is going to be slow no matter what you do. I am guessing that asm is not the right answer to this post despite being by far the fastest and having the most accurate overflow status.
Binary makes multiplication trivial compared to decimal. For example, take the binary numbers
0b100 *
0b100
Just like decimal math in school, you (can) start with the least significant bit of the lower operand and multiply it against all the positions in the upper operand, except that with binary there are only two choices: you multiply by zero, meaning you don't have to add to the result, or you multiply by one, which means you just shift and add; no actual multiplication is necessary like you would have in decimal.
000 : 0 * 100
000 : 0 * 100
100 : 1 * 100
Add up the columns and the answer is 0b10000
Same as in decimal math, a 1 in the hundreds column means copy the top number and add two zeros; it works the same in any other base as well. So for 0b100 times 0b110: a one in the second column means copy and add a zero (0b1000), and a one in the third column means copy and add two zeros (0b10000), giving 0b1000 + 0b10000 = 0b11000.
This leads to looking at the most significant bits in both numbers. 0b1xx * 0b1xx guarantees a 1xxxx is added to the answer, and that is the largest bit location in the add; no other single input to the final add has that column or a more significant column populated. From there you need only one more bit in case the other bits being added up cause a carry.
Which happens with the worst case all ones times all ones, 0b111 * 0b111
0b00111 +
0b01110 +
0b11100
This causes a carry bit in the addition, resulting in 0b110001: 6 bits. A 3-bit operand times a 3-bit operand, 3+3=6, 6 bits worst case.
So the size of the operands, measured from the most significant set bit (not the size of the registers holding the values), determines the worst-case storage requirement.
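For unsigned operands that rule is easy to express with a leading-zero count. A sketch (GCC/Clang built-ins, 32-bit unsigned ints, non-zero operands assumed; the names are mine):

#include <stdio.h>

/* Number of significant bits in a non-zero 32-bit value. */
static int sig_bits(unsigned x) { return 32 - __builtin_clz(x); }

/* Conservative test based on the rule above: a sum of 32 bits or fewer always
 * fits in an unsigned 32-bit result; more than 32 may overflow (and does in
 * the worst case). */
static int may_overflow_u32(unsigned a, unsigned b)
{
    return sig_bits(a) + sig_bits(b) > 32;
}

int main(void)
{
    printf("%d\n", may_overflow_u32(7, 7));                /* 0: 3+3 bits, always fits */
    printf("%d\n", may_overflow_u32(1u << 16, 1u << 16));  /* 1: 17+17 bits, and 2^16 * 2^16 does overflow */
    return 0;
}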
Well, that is true assuming positive operands. If you consider some of these numbers to be negative it changes things but not by much.
Minus 4 times 5, 0b1111...111100 * 0b0000....000101 = -20 or 0b1111..11101100
It takes 4 bits to represent a minus 4 and 4 bits to represent a positive 5 (don't forget your sign bit). Our result required 6 bits if you strip off all the sign bits.
Let's look at the 4-bit corner cases:
-8 * 7 = -56
0b1000 * 0b0111 = 0b1001000
-1 * 7 = -7 = 0b1001
-8 * -8 = 64 = 0b01000000
-1 * -1 = 1 = 0b01
-1 * -8 = 8 = 0b01000
7 * 7 = 49 = 0b0110001
Let's say we count the bits of positive numbers as the position of the most significant 1 plus one, and of negative numbers as the position of the most significant 0 plus one.
-8 * 7 is 4+4=8 bits actual 7
-1 * 7 is 1+4=5 bits, actual 4 bits
-8 * -8 is 4+4=8 bits, actual 8 bits
-1 * -1 is 1+1=2 bits, actual 2 bits
-1 * -8 is 1+4=5 bits, actual 5 bits
7 * 7 is 4+4=8 bits, actual 7 bits.
So this rule works. You can see that I counted a minus one as one bit; for negative numbers the count is the position of the most significant 0 plus one. Anyway, I argue that if this were a 4-bit * 4-bit machine as defined, you would have at least 4 bits of result, and I interpret the question as how many more than 4 bits do I need to safely store the answer. So this rule serves to answer that question for 2's complement math.
If your question was to accurately determine overflow, with speed secondary, then, well, it is going to be really slow for some systems, for every multiply you do. If this is the question you are asking, to get some of the speed back you need to tune it a little better for the language and/or processor. Double up the biggest operand, if you can, and check for non-zero bits above the result size, or use a divide and compare. If you can't double the operand sizes, divide and compare. Check for zero before the divide.
Actually, your question doesn't specify what size of overflow you are talking about either. Good old 8086: 16 bit times 16 bit gives a 32-bit result (in hardware); it can never overflow. What about some of the ARMs that have a multiply: 32 bit times 32 bit, 32-bit result, easy to overflow. What is the size of your operands for this question: are they the same size, or are they double the input size? Are you willing to perform multiplies that the hardware cannot do (without overflowing)? Are you writing a compiler library and trying to determine if you can feed the operands to the hardware for speed, or if you have to perform the math without a hardware multiply? That is the kind of thing you get if you cast up the operands: the compiler library will try to cast the operands back down before doing the multiply, depending on the compiler and its library of course, and it will use the count-the-bits trick to determine whether to use a hardware multiply or a software one.
My goal here was to show how binary multiply works in a digestible form so you can see how much maximum storage you need by finding the location of a single bit in each operand. Now, how fast you can find that bit in each operand is the trick. If you were looking for minimum storage requirements rather than maximum, that is a different story, because it involves every single one of the significant bits in both operands, not just one bit per operand; you have to do the multiply to determine minimum storage. If you don't care about maximum or minimum storage, you just have to do the multiply and look for non-zeros above your defined overflow limit, or use a divide if you have the time or hardware.
Your tags imply you are not interested in floating point. Floating point is a completely different beast; you cannot apply any of these fixed-point rules to floating point, they DO NOT work.
Check if one is less than a maximum value divided by the other. (All values are taken as absolute).
2's complementness hardly has anything to do with it, since the multiplication overflows if x*(2^n - x) > 2^M, which is equal to (x*2^n - x^2) > 2^M, or x^2 < (x*2^n - 2^M), so you'll have to compare overflowing numbers anyway (x^2 may overflow, while the result may not).
If your number are not from the largest integral data type, then you might just cast them up, multiply and compare with the maximum of the number's original type. E.g. in Java, when multiplying two int, you can cast them to long and compare the result to Integer.MAX_VALUE or Integer.MIN_VALUE (depending on sign combination), before casting the result down to int.
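The same cast-up idea in C might look like this (a sketch; it assumes long long is wider than int, which holds on common platforms):

#include <limits.h>

/* Returns 1 if a * b would overflow an int, by doing the multiply in a wider type. */
static int mul_overflows_int(int a, int b)
{
    long long p = (long long)a * b;
    return p > INT_MAX || p < INT_MIN;
}

For the widest type available you are back to the divide-and-compare logic described next.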
If the type already is the largest, then check if one is less than the maximum value divided by the other. But do not take the absolute value! Instead you need separate comparison logic for each of the sign combinations neg*neg, pos*pos and pos*neg (neg*pos can obviously be reduced to pos*neg, and pos*pos might be reduced to neg*neg). First test for 0 arguments to allow safe divisions.
For actual code, see the Java source of the MathUtils class of commons-math 2, or of ArithmeticUtils of commons-math 3. Look for public static long mulAndCheck(long a, long b). The case of positive a and b is:
// check for positive overflow with positive a, positive b
if (a <= Long.MAX_VALUE / b) {
    ret = a * b;
} else {
    throw new ArithmeticException(msg);
}
I want to multiply two (2's complement) numbers, and detect if there was an overflow. What is the simplest way to do that?
Various languages do not define any valid way to check for overflow after it occurs, so the tests have to be done beforehand.
With some types, a wider integer type may not exist, so a general solution should limit itself to a single type.
The below (Ref) only requires compares and known limits to the integer range. It returns 1 if a product overflow will occur, else 0.
#include <limits.h>  /* for INT_MAX, INT_MIN */

int is_undefined_mult1(int a, int b) {
  if (a > 0) {
    if (b > 0) {
      return a > INT_MAX / b;        // a positive, b positive
    }
    return b < INT_MIN / a;          // a positive, b not positive
  }
  if (b > 0) {
    return a < INT_MIN / b;          // a not positive, b positive
  }
  return a != 0 && b < INT_MAX / a;  // a not positive, b not positive
}
Is this the simplest way?
Perhaps, yet it is complete and handles all cases known to me, including rare non-2's-complement machines.
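For illustration, a few quick checks of the corner cases could look like this (a hypothetical test snippet that relies on the function above, not part of the referenced code):

#include <assert.h>
#include <limits.h>

int main(void)
{
    assert(is_undefined_mult1(2, 3) == 0);         /* small product: fine */
    assert(is_undefined_mult1(INT_MAX, 2) == 1);   /* positive overflow */
    assert(is_undefined_mult1(INT_MIN, -1) == 1);  /* the classic corner case */
    assert(is_undefined_mult1(0, INT_MIN) == 0);   /* zero never overflows */
    return 0;
}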
Alternatives to Pavel Shved's solution ...
If your language of choice is assembler, then you should be able to check the overflow flag. If not, you could write a custom assembler routine that sets a variable if the overflow flag was set.
If this is not acceptable, you can find the most significant set bit of both values (as absolutes). If the sum of their positions exceeds the number of bits in the integer (or unsigned), then you will have an overflow if they are multiplied together.
Hope this helps.
In C, here's some maturely optimized code that handles the full range of corner cases:
int
would_mul_exceed_int(int a, int b) {
  int product_bits;

  if (a == 0 || b == 0 || a == 1 || b == 1) return (0); /* always okay */
  if (a == INT_MIN || b == INT_MIN) return (1); /* always underflow */

  a = ABS(a);
  b = ABS(b);
  product_bits  = significant_bits_uint((unsigned)a);
  product_bits += significant_bits_uint((unsigned)b);
  if (product_bits == BITS(int)) { /* cases where the more expensive test is required */
    return (a > INT_MAX / b); /* remember that IDIV and similar are very slow (dozens - hundreds of cycles) compared to bit shifts, adds */
  }
  return (product_bits > BITS(int));
}
Full example with test cases here
The benefit of the above approach is it doesn't require casting up to a larger type, so the approach could work on larger integer types.
The main question: How many digits?
Let me explain. I have a number in the binary system, 11000000, which in decimal is 192.
After converting to decimal, how many digits will it have (in decimal)? In my example, it's 3 digits. But that isn't the problem. I've searched over the internet and found one algorithm for the integral part and one for the fractional part. I don't quite understand them, but (I think) they work.
When converting from binary to octal, it's easier: every 3 bits give you 1 digit in octal. Same for hex: every 4 bits = 1 hex digit.
But I'm very curious: what do I do if I have a number in the P numeral system and want to convert it to the Q numeral system? I know how to do the conversion (I think I know :)), but first of all I want to know how many digits it will take in the Q system (you know, I must preallocate space).
Writing n in base b takes floor(log base b (n)) + 1 digits (equivalently, ceiling(log base b (n + 1))).
The ratio you noticed (binary digits to octal digits) is log base 2 (n) / log base 8 (n) = 3.
(From memory, will it stick?)
There was an error in my previous answer: look at the comment by Ben Schwehn.
Sorry for the confusion; I found and explain below the error I made in my previous answer.
Please use the answer provided by Paul Tomblin (rewritten here to use P, Q and n):
Y = ln(P^n) / ln(Q)
Y = n * ln(P) / ln(Q)
So Y (rounded up) is the number of characters you need in system Q to express the highest number you can encode in n characters in system P.
I have no answer (that wouldn't already convert the number and take up that much space in a temporary variable) for getting the bare minimum for a given number: 1000 (bin) = 8 (dec) needs only one decimal digit, while you would reserve 2 decimal positions using this formula.
If a temporary memory usage isn't a problem, you might cheat and use (Python):
len(str(int(otherBaseStr,P)))
This will give you the number of decimal digits needed for a number in base P, given as a string (otherBaseStr), when converted to decimal.
Old WRONG answer:
If you have a number in P numeral system of length n
Then you can calculate the highest number that is possible in n characters:
P^(n-1)
To express this highest number in number system Q you need to use logarithms (because they are the inverse to exponentiation):
log(P^(n-1)) / log(Q)
(n-1)*log(P) / log(Q)
For example
11000000 in binary is 8 characters.
To get it in Decimal you would need:
(8-1)*log(2) / log(10) = 2.1 digits (round up to 3)
Reason it was wrong:
The highest number that is possible in n characters is
(P^n) - 1
not
P^(n-1)
If you have a number that's X digits long in base B, then the maximum value that can be represented is B^X - 1. So if you want to know how many digits it might take in base C, you have to find the number Y such that C^Y - 1 is at least as big as B^X - 1. The way to do that is to take the logarithm in base C of B^X - 1. And since the logarithm (log) of a number in base C is the same as the natural log (ln) of that number divided by the natural log of C, that becomes:
Y = ln((B^X)-1) / ln(C) + 1
and since ln(B^X) is X * ln(B), and that's probably faster to calculate than ln(B^X-1) and close enough to the right answer, rewrite that as
Y = X * ln(B) / ln(C) + 1
Convert that to your favourite language. Because we dropped the "-1", we might end up with one digit more than you need in some cases. But even better, you can pre-calculate ln(B)/ln(C) and just multiply it by new "X"s as the length of the number you are trying to convert changes.
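In C, that pre-calculation might look like this (a sketch; the function name is mine, and the result can be one digit generous because the "-1" was dropped):

#include <math.h>
#include <stdio.h>

/* Upper bound on the number of base-C digits needed for an X-digit base-B number,
 * using Y = X * ln(B) / ln(C) + 1 from above. */
static int digits_needed(int x_digits, int base_b, int base_c)
{
    double factor = log((double)base_b) / log((double)base_c);  /* pre-computable per (B, C) pair */
    return (int)(x_digits * factor) + 1;
}

int main(void)
{
    printf("%d\n", digits_needed(8, 2, 10));  /* 8 binary digits -> 3 decimal digits */
    return 0;
}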
Calculating the number of digit can be done using the formulas given by the other answers, however, it might actually be faster to allocate a buffer of maximum size first and then return the relevant part of that buffer instead of calculating a logarithm.
Note that the worst case for the buffer size happens when you convert to binary, which gives you a buffer size of 32 characters for 32-bit integers.
Converting a number to an arbitrary base could be done using the C# function below (The code would look very similar in other languages like C or Java):
public static string IntToString(int value, char[] baseChars)
{
    // 32 is the worst case buffer size for base 2 and int.MaxValue
    // (assumes value >= 0; negative values are not handled here)
    int i = 32;
    char[] buffer = new char[i];
    int targetBase = baseChars.Length;

    do
    {
        buffer[--i] = baseChars[value % targetBase];
        value = value / targetBase;
    }
    while (value > 0);

    char[] result = new char[32 - i];
    Array.Copy(buffer, i, result, 0, 32 - i);

    return new string(result);
}
The keyword here is "logarithm", here are some suggestive links:
http://www.adug.org.au/MathsCorner/MathsCornerLogs2.htm
http://staff.spd.dcu.ie/johnbcos/download/Fermat%20material/Fermat_Record_Number/HOW_MANY.html
Look at the logarithms base P and base Q. Round down to the nearest integer.
The logarithm base P can be computed using your favorite base (10 or e): log_P(x) = log_10(x)/log_10(P)
You need to compute the length of the fractional part separately.
For binary to decimal, there are as many decimal digits as there are bits. For example, binary 0.11001101001001 is decimal 0.80133056640625, both 14 digits after the radix point.
For decimal to binary, there are two cases. If the decimal fraction is dyadic, then there are as many bits as decimal digits (same as for binary to decimal above). If the fraction is not dyadic, then the number of bits is infinite.
(You can use my decimal/binary converter to experiment with this.)