long long division with 32-bit memory - algorithm

I am currently working on a framework which transforms C to VHDL, and I am getting stuck on the implementation of the long long division. My framework is only able to work on 32-bit variables, so parsing a C long long variable results in two VHDL variables, one containing the most significant part and one containing the least significant part. So to sum up, from this:
long long a = 1LL;
The generated VHDL will be something like:
var30 <= "00000000000000000000000000000000";
var31 <= "00000000000000000000000000000001";
Now my problem is: how can I divide two long long parameters (in VHDL), since they are split into two variables? I had no problem with addition/subtraction, since I can work on the most (resp. least) significant part independently (just a carry to propagate), but I really don't see how I could perform a division, since with this kind of operation the least and the most significant parts are really bound together... If someone has an idea, it would be much appreciated.
PS: I have the same problem for multiplication.
EDIT: I work on both signed and unsigned variables, and the result should be a 64-bit variable.

For both the multiplication and the division problem you can break things down like this: consider that each 64-bit value x can be expressed as k*x.hi + x.lo, where x.hi is the upper 32 bits, x.lo is the lower 32 bits, and k = 2^32. So for multiplication:
a*b = (a.hi*k+a.lo)*(b.hi*k+b.lo)
= a.hi*b.hi*k*k + (a.hi*b.lo + a.lo*b.hi)*k + a.lo*b.lo
If you just want a 64 bit result then the first term disappears and you get:
a*b = (a.hi*b.lo + a.lo*b.hi)*k + a.lo*b.lo
Remember that in general multiplication doubles the number of bits, so each 32-bit x 32-bit multiply in the above expressions will generate a 64-bit term. In some cases you only want the low 32 bits (the first two terms in the above expression), but for the last term you need both the low and high 32 bits.
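As a rough illustration, here is a short C++ sketch of that decomposition for a truncated 64-bit result built from 32-bit halves (the function and parameter names are mine, not from the question's framework; unsigned is shown, but for a truncated 64-bit result two's-complement signed inputs produce the same low bits):

#include <cstdint>

// Multiply two 64-bit values given as 32-bit halves; produce the low 64 bits
// of the product, also split into halves (k = 2^32).
void mul64(uint32_t a_hi, uint32_t a_lo,
           uint32_t b_hi, uint32_t b_lo,
           uint32_t &r_hi, uint32_t &r_lo)
{
    uint64_t lo = (uint64_t)a_lo * b_lo;         // full 64-bit a.lo*b.lo term
    uint32_t cross = a_hi * b_lo + a_lo * b_hi;  // only the low 32 bits survive the *k
    r_lo = (uint32_t)lo;
    r_hi = (uint32_t)(lo >> 32) + cross;         // a.hi*b.hi*k*k falls outside 64 bits
}

Each of the three multiplies here is a 32x32 operation, which matches the 32-bit constraint described in the question.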

Related

Implementation of strtof(), floating-point multiplication and mantissa rounding issues

This question is not so much about C as about the algorithm. I need to implement the strtof() function, which should behave exactly the same as the GCC one - and do it from scratch (no GNU MPL etc.).
Let's skip checks, consider only correct inputs and positive numbers, e.g. 345.6e7. My basic algorithm is:
Split the number into fraction and integer exponent, so for 345.6e7 fraction is 3.456e2 and exponent is 7.
Create a floating-point exponent. To do this, I use these tables:
static const float powersOf10[] = {
1.0e1f,
1.0e2f,
1.0e4f,
1.0e8f,
1.0e16f,
1.0e32f
};
static const float minuspowersOf10[] = {
1.0e-1f,
1.0e-2f,
1.0e-4f,
1.0e-8f,
1.0e-16f,
1.0e-32f
};
and get the float exponent as the product of the table entries selected by the set bits of the integer exponent, e.g. 7 = 1+2+4 => float_exponent = 1.0e1f * 1.0e2f * 1.0e4f (see the sketch after these steps).
Multiply the fraction by the floating-point exponent and return the result.
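For concreteness, a small sketch of the table walk described in steps 2-3 (buildExponent and its exact shape are illustrative, not the original code; the tables are the ones above):

float buildExponent(int exp)                  // e.g. exp = 7 -> 1.0e1f * 1.0e2f * 1.0e4f
{
    const float *table = (exp < 0) ? minuspowersOf10 : powersOf10;
    unsigned e = (unsigned)((exp < 0) ? -exp : exp);  // magnitude fits the 6-entry tables for float range
    float result = 1.0f;
    for (int bit = 0; e != 0; ++bit, e >>= 1)
        if (e & 1u)
            result *= table[bit];             // each multiply rounds, which is the error source described below
    return result;
}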
And here comes the first problem: since we do a lot of multiplications, we get a somewhat big error because of rounding the multiplication result each time. So, I decided to dive into the floating-point multiplication algorithm and implement it myself: a function takes a number of floats (in my case - up to 7) and multiplies them on the bit level. Consider that I have a uint256_t type to fit the mantissas' product.
Now, the second problem: rounding the mantissa product to 23 bits. I've tried several rounding methods (round-to-even, Von Neumann rounding - a small article about them), but none of them can give the correct result for all the test numbers. And some of them really confuse me, like this one:
7038531e-32. GCC's strtof() returns 0x15ae43fd, so correct unbiased mantissa is 2e43fd. I go for multiplication of 7.038531e6 (biased mantissa d6cc86) and 1e-32 (b.m. cfb11f). The resulting unbiased mantissa in binary form is
( 47)0001 ( 43)0111 ( 39)0010 ( 35)0001
( 31)1111 ( 27)1110 ( 23)1110 ( 19)0010
( 15)1011 ( 11)0101 ( 7)0001 ( 3)1101
which I have to round to 23 bits. However, by all rounding methods I have to round it up, and I'll get 2e43fe as a result - wrong! So, for this number the only way to get the correct mantissa is just to chop it - but chopping does not work for other numbers.
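For reference, a short sketch of the round-to-nearest-even rule referred to above (the helper name and the drop parameter are illustrative, not the original code; shown on uint64_t for brevity, the same rule applies to the wider uint256_t product):

#include <cstdint>

// Keep the high bits of a wide mantissa product, discarding `drop` low bits
// (1 <= drop < 64), rounding to nearest with ties to even.
// Mantissa overflow after the increment is left to the caller.
uint64_t roundNearestEven(uint64_t value, unsigned drop)
{
    uint64_t kept   = value >> drop;
    uint64_t guard  = (value >> (drop - 1)) & 1u;                 // highest discarded bit
    uint64_t sticky = value & (((uint64_t)1 << (drop - 1)) - 1);  // any lower discarded bit set?
    if (guard && (sticky || (kept & 1u)))                         // above half, or tie with odd kept
        ++kept;
    return kept;
}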
Having worked on this for countless nights, my questions are:
Is this approach to strtof() correct? (I know that GCC uses GNU MPL for it, and I tried to look into it. However, trying to copy MPL's implementation would require porting the entire library, and this is definitely not what I want.) Maybe this split-then-multiply algorithm is inevitably prone to errors? I did some other small tricks (e.g. creating exponent tables for all integer exponents in the float range), but they led to even more failed conversions.
If so, did I miss something while rounding? I thought so for a long time, but this 7038531e-32 number completely confused me.
If I want to be as precise as I can I usually do stuff like this (however I usually do the reverse operation float -> text):
Use only integers (no floats whatsoever)
As you know, a float is an integer mantissa bit-shifted by an integer exponent, so there is no need for floats.
For constructing the final float datatype you can use a simple union with a float and a 32-bit unsigned integer in it... or pointers to such types pointing to the same address (a small sketch follows below).
This avoids rounding errors for numbers that fit completely and considerably shrinks the error for those that don't.
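For example, one possible rendering of that union trick for IEEE-754 single precision (the function name is mine; memcpy would be the strictly portable alternative to this kind of type punning):

#include <cstdint>

// Assemble an IEEE-754 single from integer sign/exponent/mantissa fields.
float assembleFloat(uint32_t sign, uint32_t biasedExp, uint32_t mantissa23)
{
    union { float f; uint32_t u; } bits;
    bits.u = (sign << 31) | ((biasedExp & 0xFFu) << 23) | (mantissa23 & 0x7FFFFFu);
    return bits.f;
}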
Use hex numbers
You can convert the text of the decimal number on the fly into its hex counterpart (still as text); from there, creating the mantissa and exponent integers is simple.
Here:
How to convert a gi-normous integer (in string format) to hex format? (C#)
is a C++ implementation example of dec2hex and hex2dec number conversions done on text.
Use more bits for the mantissa while converting
For a task like this with single-precision floats I usually use 2 or 3 32-bit DWORDs for the 24-bit mantissa, to still hold some precision after the multiplications. If you want to be precise, you have to deal with 128+24 bits for both the integer and fractional parts of the number, so 5x 32-bit numbers in sequence.
For more info and inspiration see (reverse operation):
my best attempt to print 32 bit floats with least rounding errors (integer math only)
Your code will be just the inverse of that (so many parts will be similar).
Since I posted that I have made an even more advanced version that recognizes formatting just like printf, supports many more datatypes and more, without using any libs (however it's ~22.5 KByte of code). I needed it for MCUs, as the GCC implementations of print functions are not very good there...

How is this bitshifting working in this example?

I was going through the Go tutorial on golang.org and I came across an example that I only partially understand...
MaxInt uint64 = 1<<64 - 1
Now I understand this to be shifting the bit 64 places to the left, which would make it a 1 followed by 64 zeros.
My question is why is this the max integer that can be achieved in a 64 bit number. Wouldn't the max integer be 111111111....(until the 64th 1) instead of 100000...(until the 64th one)?
What happens here, step by step:
Take 1.
Shift it to the left 64 bits. This is tricky. The result actually needs 65 bits for representation - namely a 1 followed by 64 zeroes. Since we are calculating a 64-bit value here, why does this even compile instead of overflowing to 0 or 1 or producing a compile error?
It works because the arithmetic used to calculate constants in Go is a bit magic (https://blog.golang.org/constants) in that it has nothing to do whatsoever with the type of the named constant being calculated. You can say foo uint8 = 1<<415 / 1<<414 and foo is now 2.
Subtract 1. This brings us back into 64-bit numbers, as it's actually 11....1 (64 times), which is indeed the maximum value of uint64. Without this step, the compiler would complain about us trying to cram a 65-bit value into a uint64.
Name the constant MaxInt and give it type uint64. Success!
The magic arithmetic used to calculate constants still has limitations (obviously). Shifts greater than 500 or so produce funnily named "stupid shift" errors.

Arithmetic Operations using only 32 bit integers

How would you compute the multiplication of two 1024 bit numbers on a microprocessor that is only capable of multiplying 32 bit numbers?
The starting point is to realize that you already know how to do this: in elementary school you were taught how to do arithmetic on single digit numbers, and then given data structures to represent larger numbers (e.g. decimals) and algorithms to compute arithmetic operations (e.g. long division).
If you have a way to multiply two 32-bit numbers to give a 64-bit result (note that unsigned long long is guaranteed to be at least 64 bits), then you can use those same algorithms to do arithmetic in base 2^32.
You'll also need, e.g., an add with carry operation. You can determine the carry when adding two unsigned numbers of the same type by detecting overflow, e.g. as follows:
uint32_t x, y; // set to some value
uint32_t sum = x + y;
uint32_t carry = (sum < x);
(technically, this sort of operation requires that you do unsigned arithmetic: overflow in signed arithmetic is undefined behavior, and optimizers will do surprising things to your code when you least expect it)
(modern processors usually give a way to multiply two 64-bit numbers to give a 128-bit result, but to access it you will have to use compiler extensions like 128-bit types, or you'll have to write inline assembly code. modern processors also have specialized add-with-carry instructions)
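For example, a short sketch (my own illustration, not code from GMP) that extends this carry detection to multi-word addition in base 2^32, with limbs stored least-significant first:

#include <cstdint>
#include <vector>

// Add two equal-length multi-word numbers; the result has one extra limb
// for the final carry out.
std::vector<uint32_t> bigadd(const std::vector<uint32_t>& a,
                             const std::vector<uint32_t>& b)   // assumes a.size() == b.size()
{
    std::vector<uint32_t> r(a.size() + 1, 0);
    uint32_t carry = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        uint32_t sum = a[i] + b[i];
        uint32_t c1  = (sum < a[i]);    // overflow from the limb addition
        sum += carry;
        uint32_t c2  = (sum < carry);   // overflow from adding the old carry
        r[i]  = sum;
        carry = c1 | c2;                // at most one of the two can be set
    }
    r[a.size()] = carry;
    return r;
}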
Now, to do arithmetic efficiently is an immense project; I found it quite instructive to browse through the documentation and source code to gmp, the GNU multiple precision arithmetic library.
Look at any implementation of bigint operations
Here are a few of my approaches in C++ for fast bignum squaring
Some are solely for sqr, but others are usable for multiplication too...
Use 32-bit arithmetic as a module for 64/128/256/... bit arithmetic
See my 32-bit ALU in x86 C++
Use long multiplication with digit base 2^32 (a sketch follows this list)
You can also use Karatsuba this way
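A short sketch of that long multiplication with digit base 2^32 (my own illustration; limbs are little-endian, and the 32x32 -> 64 widening multiply is the only primitive needed):

#include <cstdint>
#include <vector>

// Schoolbook multiplication in base 2^32; the result has a.size() + b.size() limbs.
std::vector<uint32_t> bigmul(const std::vector<uint32_t>& a,
                             const std::vector<uint32_t>& b)
{
    std::vector<uint32_t> r(a.size() + b.size(), 0);
    for (size_t i = 0; i < a.size(); ++i) {
        uint64_t carry = 0;
        for (size_t j = 0; j < b.size(); ++j) {
            // 32x32 -> 64 product, plus the previous partial result and carry;
            // the sum cannot overflow 64 bits.
            uint64_t cur = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)cur;
            carry    = cur >> 32;
        }
        r[i + b.size()] += (uint32_t)carry;
    }
    return r;
}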

gcc precision bug?

I can only assume this is a bug. The first assert passes while the second fails:
double sum_1 = 4.0 + 6.3;
assert(sum_1 == 4.0 + 6.3);
double t1 = 4.0, t2 = 6.3;
double sum_2 = t1 + t2;
assert(sum_2 == t1 + t2);
If not a bug, why?
This is something that has bitten me, too.
Yes, floating point numbers should never be compared for equality because of rounding error, and you probably knew that.
But in this case, you're computing t1+t2, then computing it again. Surely that has to produce an identical result?
Here's what's probably going on. I'll bet you're running this on an x86 CPU, correct? The x86 FPU uses 80 bits for its internal registers, but values in memory are stored as 64-bit doubles.
So t1+t2 is first computed with 80 bits of precision, then -- I presume -- stored out to memory in sum_2 with 64 bits of precision -- and some rounding occurs. For the assert, it's loaded back into a floating point register, and t1+t2 is computed again, again with 80 bits of precision. So now you're comparing sum_2, which was previously rounded to a 64-bit floating point value, with t1+t2, which was computed with higher precision (80 bits) -- and that's why the values aren't exactly identical.
Edit So why does the first test pass? In this case, the compiler probably evaluates 4.0+6.3 at compile time and stores it as a 64-bit quantity -- both for the assignment and for the assert. So identical values are being compared, and the assert passes.
Second Edit Here's the assembly code generated for the second part of the code (gcc, x86), with comments -- pretty much follows the scenario outlined above:
// t1 = 4.0
fldl LC3
fstpl -16(%ebp)
// t2 = 6.3
fldl LC4
fstpl -24(%ebp)
// sum_2 = t1+t2
fldl -16(%ebp)
faddl -24(%ebp)
fstpl -32(%ebp)
// Compute t1+t2 again
fldl -16(%ebp)
faddl -24(%ebp)
// Load sum_2 from memory and compare
fldl -32(%ebp)
fxch %st(1)
fucompp
Interesting side note: This was compiled without optimization. When it's compiled with -O3, the compiler optimizes all of the code away.
You are comparing floating-point numbers. Don't do that: floating-point numbers have inherent precision error in some circumstances. Instead, take the absolute value of the difference of the two values and assert that the value is less than some small number (epsilon).
#include <cassert>
#include <cmath>   // fabs -- plain abs() would truncate the difference to an integer

void CompareFloats( double d1, double d2, double epsilon )
{
    assert( fabs( d1 - d2 ) < epsilon );
}
This has nothing to do with the compiler and everything to do with the way floating-point numbers are implemented. Here is the IEEE spec:
http://www.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF
I've duplicated your problem on my Intel Core 2 Duo, and I looked at the assembly code. Here's what's happening: when your compiler evaluates t1 + t2, it does
load t1 into an 80-bit register
load t2 into an 80-bit register
compute the 80-bit sum
When it stores into sum_2 it does
round the 80-bit sum to a 64-bit number and store it
Then the == comparison compares the 80-bit sum to a 64-bit sum, and they're different, primarily because the fractional part 0.3 cannot be represented exactly using a binary floating-point number, so you are comparing a 'repeating decimal' (actually repeating binary) that has been truncated to two different lengths.
What's really irritating is that if you compile with gcc -O1 or gcc -O2, gcc does the wrong arithmetic at compile time, and the problem goes away. Maybe this is OK according to the standard, but it's just one more reason that gcc is not my favorite compiler.
P.S. When I say that == compares an 80-bit sum with a 64-bit sum, of course I really mean it compares the extended version of the 64-bit sum. You might do well to think
sum_2 == t1 + t2
resolves to
extend(sum_2) == extend(t1) + extend(t2)
and
sum_2 = t1 + t2
resolves to
sum_2 = round(extend(t1) + extend(t2))
Welcome to the wonderful world of floating point!
When comparing floating point numbers for closeness you usually want to measure their relative difference, which is defined as
if (abs(x) != 0 || abs(y) != 0)
rel_diff(x, y) = abs((x - y) / max(abs(x), abs(y)))
else
rel_diff(x, y) = max(abs(x), abs(y))
For example,
rel_diff(1.12345, 1.12367) = 0.000195787019
rel_diff(112345.0, 112367.0) = 0.000195787019
rel_diff(112345E100, 112367E100) = 0.000195787019
The idea is to measure the number of leading significant digits the numbers have in common; if you take the -log10 of 0.000195787019 you get 3.70821611, which is about the number of leading base 10 digits all the examples have in common.
If you need to determine if two floating point numbers are equal you should do something like
if (rel_diff(x,y) < error_factor * machine_epsilon()) then
print "equal\n";
where machine epsilon is the smallest number that can be held in the mantissa of the floating point hardware being used. Most computer languages have a function call to get this value. error_factor should be based on the number of significant digits you think will be consumed by rounding errors (and others) in the calculations of the numbers x and y. For example, if I knew that x and y were the result of about 1000 summations and did not know any bounds on the numbers being summed, I would set error_factor to about 100.
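A short C++ rendering of the pseudocode above (function names follow the pseudocode; DBL_EPSILON stands in for machine_epsilon() for doubles):

#include <algorithm>
#include <cfloat>
#include <cmath>

// Relative difference as defined above: 0 only when both inputs are 0.
double rel_diff(double x, double y)
{
    double m = std::max(std::fabs(x), std::fabs(y));
    return (m != 0.0) ? std::fabs((x - y) / m) : m;
}

// "Equal enough" test with an error factor chosen by the caller.
bool nearly_equal(double x, double y, double error_factor)
{
    return rel_diff(x, y) < error_factor * DBL_EPSILON;
}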
Tried to add these as links but couldn't since this is my first post:
en.wikipedia.org/wiki/Relative_difference
en.wikipedia.org/wiki/Machine_epsilon
en.wikipedia.org/wiki/Significand (mantissa)
en.wikipedia.org/wiki/Rounding_error
It may be that in one of the cases, you end up comparing a 64-bit double to an 80-bit internal register. It may be enlightening to look at the assembly instructions GCC emits for the two cases...
Comparisons of double precision numbers are inherently inaccurate. For instance, you can often find 0.1 + 0.2 == 0.3 returning false, even though both sides look equal. This is due to the way the FPU stores and tracks numbers.
Wikipedia says:
Testing for equality is problematic. Two computational sequences that are mathematically equal may well produce different floating-point values.
You will need to use a delta to give a tolerance for your comparisons, rather than an exact value.
This "problem" can be "fixed" by using these options:
-msse2 -mfpmath=sse
as explained on this page:
http://www.network-theory.co.uk/docs/gccintro/gccintro_70.html
Once I used these options, both asserts passed.

What's better: multiplication by 2 or adding the number to itself? (bignums)

I need some help deciding what is better performance-wise.
I'm working with bigints (more than 5 million digits) and most of the computation (if not all) is in the part of doubling the current bigint. So I wanted to know: is it better to multiply every cell (part of the bigint) by 2, then mod it, and you know the rest? Or is it better to just add the bigint to itself?
I'm thinking a bit about the ease of implementation too (addition of 2 bigints is more complicated than multiplication by 2), but I'm more concerned about performance rather than the size of code or ease of implementation.
Other info:
I'll code it in C++, and I'm fairly familiar with bigints (I just never came across this problem).
I'm not in need of any source code or similar; I just need a nice opinion and an explanation/proof of it, since I need to make a good decision from the start. The project will be fairly large and mostly built around this part, so it depends heavily on what I choose now.
Thanks.
Try bit-shifting each chunk. That is probably the fastest method. When you bit-shift an integer to the left, you double it (multiply by 2). If you have several long integers in a chain, then you need to store the most significant bit, because after shifting it will be gone, and you need to use it as the least significant bit on the next long integer.
This doesn't actually matter a whole lot. Modern 64-bit computers can add two integers in the same time it takes to bit-shift them (1 clock cycle), so it will take just as long. I suggest you try the different methods and then report back if there are any major time differences. All three methods should be easy to implement, and generating a 5-million-digit number should also be easy using a random number generator.
To store a 5 million digit integer, you'll need quite a few bits -- 5 million if you were referring to binary digits, or ~17 million bits if those were decimal digits. Let's assume the numbers are stored in a binary representation, and your arithmetic happens in chunks of some size, e.g. 32 bits or 64 bits.
If adding the number to itself, each chunk is added to itself and to the carry from the addition of the previous chunk. Any carry out is kept for the next chunk. That's a couple of addition operations, plus some bookkeeping for tracking the carry.
If multiplying by two by left-shifting, that's one left-shift operation for the multiplication, and one right-shift operation plus an AND with 1 to obtain the carry. Carry bookkeeping is a little simpler.
Superficially, the shift version appears slightly faster. The overall cost of doubling the number, however, is highly influenced by the size of the number. A 17-million-bit number exceeds the CPU's L1 cache, and processing time is likely dominated by memory fetch operations. On modern PC hardware, memory fetches are orders of magnitude slower than additions and shifts.
With that, you might want to pick the one that's simpler for you to implement. I'm leaning towards the left-shift version; a sketch of it follows below.
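A short sketch of the left-shift version (my own illustration; 32-bit chunks stored least-significant first, with the carried-out top bit fed into the next chunk as its least significant bit):

#include <cstdint>
#include <vector>

// Double a bignum in place by shifting each chunk left by one and carrying
// the top bit into the next chunk.
void double_in_place(std::vector<uint32_t>& n)   // little-endian chunks
{
    uint32_t carry = 0;
    for (uint32_t& chunk : n) {
        uint32_t top = chunk >> 31;        // bit that would be shifted out
        chunk = (chunk << 1) | carry;      // shift left, bring in previous carry
        carry = top;
    }
    if (carry)
        n.push_back(carry);                // the number grew by one bit
}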
Did you try shifting the bits?
<< multiplies by 2
>> divides by 2
Left bit-shifting by one is the same as a multiplication by two!
This link explains the mechanism and gives examples.
int A = 10; //...01010 = 10
int B = A<<1; //..010100 = 20
If it really matters, you need to write all three methods (including bit-shift!) and profile them on various inputs. (Use small numbers, large numbers, and random numbers, to avoid biasing the results.)
Sorry for the "Do it yourself" answer, but that's really the best way. No one cares about this result more than you, which just makes you the best person to figure it out.
Well-implemented multiplication of bignums is O(N log(N) log(log(N))). Addition is O(N). Therefore, adding the number to itself should be faster than multiplying by two. However, that's only true if you're multiplying two arbitrary bignums; if your library knows you're multiplying a bignum by a small integer, it may be able to optimize to O(N).
As others have noted, bit-shifting is also an option. It should be O(N) as well, but with a smaller constant factor. However, that will only work if your bignum library supports bit shifting.
most of the computation (if not all) is in the part of doubling the current bigint
If all your computation is in doubling the number, why don't you just keep a distinct (base-2) scale field? Then just add one to scale, which can just be a plain-old int. This will surely be faster than any manipulation of some-odd million bits.
IOW, use a bigfloat.
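A tiny sketch of that idea (the names are mine): the bignum part is left untouched, and doubling becomes an increment of the scale.

#include <cstdint>
#include <vector>

struct ScaledBig {
    std::vector<uint32_t> mantissa;   // the multi-million-digit part, rarely touched
    long scale = 0;                   // value = mantissa * 2^scale
    void double_value() { ++scale; }  // O(1) doubling instead of touching every chunk
};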
random benchmark
use Math::GMP;
use Time::HiRes qw(clock_gettime CLOCK_REALTIME CLOCK_PROCESS_CPUTIME_ID);
my $n = Math::GMP->new(2);
$n = $n ** 1_000_000;
my $m = Math::GMP->new(2);
$m = $m ** 10_000;
my $str;
for ($bits = 1_000_000; $bits <= 2_000_000; $bits += 10_000) {
my $start = clock_gettime(CLOCK_PROCESS_CPUTIME_ID);
$str = "$n" for (1..3);
my $stop = clock_gettime(CLOCK_PROCESS_CPUTIME_ID);
print "$bits,#{[($stop-$start)/3]}\n";
$n = $n * $m;
}
It seems to show that somehow GMP is doing its conversion in O(n) time (where n is the number of bits in the binary number). This may be due to the special case of having a 1 followed by a million (or two) zeros; the GNU MP docs say it should be slower (but still better than O(N^2)).
http://img197.imageshack.us/img197/6527/chartp.png
