Is there an efficient way to calculate ceiling of log_b(a)? - algorithm

I need to accurately calculate ceil(log_b(a)) where a and b
are both integers. If I simply use the typical change-of-base formula with floating-point math functions I wind up with errors due to rounding.

You can use this identity:
b^(log_b(a)) = a
So binary search for x = log_b(a): find the biggest x for which b^x is still less than a, and afterwards just increment the result.
Here is a small C++ example for 32 bits:
//---------------------------------------------------------------------------
DWORD u32_pow(DWORD a,DWORD b) // = a^b
{
int i,bits=32;
DWORD d=1;
for (i=0;i<bits;i++) // scan exponent bits of b from MSB to LSB
{
d*=d; // square
if (DWORD(b&0x80000000)) d*=a; // multiply when the current exponent bit is set
b<<=1;
}
return d;
}
//---------------------------------------------------------------------------
DWORD u32_log2(DWORD a) // = bit length of a (equals ceil(log2(a)) except at exact powers of two)
{
DWORD x;
for (x=32;((a&0x80000000)==0)&&(x>1);x--,a<<=1);
return x;
}
//---------------------------------------------------------------------------
DWORD u32_log(DWORD b,DWORD a) // = ceil(logb(a))
{
DWORD x,m,bx;
// edge cases
if (b< 2) return 0;
if (a< 2) return 0;
if (a<=b) return 1;
m=1<<(u32_log2(a)-1); // max limit for b=2, all other bases lead to smaller exponents anyway
for (x=0;m;m>>=1)
{
x|=m;
bx=u32_pow(b,x);
if (bx>=a) x^=m;
}
return x+1;
}
//---------------------------------------------------------------------------
Where DWORD is any unsigned 32-bit int type... For more info about pow, log, exp and binary search see:
Power by squaring for negative exponents
Note that u32_log2 is not really needed (unless you want bigints); you can use a constant bit width instead. Also some CPUs like x86 have a single asm instruction (bsr) returning the same result much faster than the for loop...
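For example, with a GCC/Clang builtin or C++20 <bit> you could replace the whole loop (a minimal sketch of my own, not part of the original answer; u32_log2_fast is a hypothetical name):

#include <bit> // C++20
typedef unsigned int DWORD;

DWORD u32_log2_fast(DWORD a) // same value as the loop above for a>0 (bit length of a)
{
    return (DWORD)std::bit_width(a); // or on GCC/Clang: 32u-__builtin_clz(a)
}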
Now the next step is to exploit the fact that the bit loop inside u32_pow has the same structure as the u32_log binary search, so we can merge the two and get rid of the nested loop entirely, improving complexity considerably, like this (u32_pow and u32_log2 stay the same as above):
//---------------------------------------------------------------------------
DWORD u32_log(DWORD b,DWORD a) // = ceil(logb(a))
{
const int _bits=32; // DWORD bitwidth
DWORD bb[_bits]; // squares of b LUT for speed up b^x
DWORD x,m,bx,bx0,bit,bits;
// edge cases
if (b< 2) return 0;
if (a< 2) return 0;
if (a<=b) return 1;
// max limit for x where b=2, all other bases lead to smaller x
bits=u32_log2(a);
// compute bb LUT
bb[0]=b;
for (bit=1;bit< bits;bit++) bb[bit]=bb[bit-1]*bb[bit-1];
for ( ;bit<_bits;bit++) bb[bit]=1;
// bin search x and b^x at the same time
for (bx=1,x=0,bit=bits-1,m=1<<bit;m;m>>=1,bit--)
{
x|=m; bx0=bx; bx*=bb[bit];
if (bx>=a){ x^=m; bx=bx0; }
}
return x+1;
}
//---------------------------------------------------------------------------
The only drawback is that we need a LUT holding the squares of b: b, b^2, b^4, b^8, ... up to bits entries.
Beware that squaring doubles the number of bits, so the intermediate b^x values overflow 32-bit arithmetic very quickly; you must handle overflow if b or a are not small, otherwise an overflowed (wrapped) b^x can compare as smaller than a and the search keeps a wrong bit ...
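One minimal way to handle that overflow, as a sketch of my own (not part of the original answer): saturate the multiplications to 0xFFFFFFFF, so an overflowed b^x always compares as >= a and the binary search simply undoes that bit.

DWORD u32_mul_sat(DWORD x,DWORD y) // x*y clamped to 0xFFFFFFFF on overflow (DWORD as above)
{
    unsigned long long p=(unsigned long long)x*y; // 64-bit intermediate
    return (p>0xFFFFFFFFull)?0xFFFFFFFF:(DWORD)p;
}
// then build the LUT with   bb[bit]=u32_mul_sat(bb[bit-1],bb[bit-1]);
// and update in the search loop with   bx=u32_mul_sat(bx,bb[bit]);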
[Edit2] more optimization
As a benchmark on normal ints revealed (on bigints the binary search is much, much faster), the binary-search version runs at about the same speed as the naive version, because of all the other operations besides the multiplications:
DWORD u32_log_naive(DWORD b,DWORD a) // = ceil(logb(a))
{
int x,bx;
if (b< 2) return 0;
if (a< 2) return 0;
if (a<=b) return 1;
for (x=2,bx=b;bx*=b;x++)
if (bx>=a) break;
return x;
}
We can optimize further:
We can comment out the computation of the unused squares:
//for ( ;bit<_bits;bit++) bb[bit]=1;
With this the binary search becomes faster on plain ints too, but not by much.
We can also use a faster log2 instead of the naive one,
see: Fastest implementation of log2(int) and log2(float)
Putting it all together (x86 CPUs):
DWORD u32_log(DWORD b,DWORD a) // = ceil(logb(a))
{
const int _bits=32; // DWORD bitwidth
DWORD bb[_bits]; // squares of b LUT for speed up b^x
DWORD x,m,bx,bx0,bit,bits;
// edge cases
if (b< 2) return 0;
if (a< 2) return 0;
if (a<=b) return 1;
// max limit for x where b=2, all other bases lead to smaller x
asm {
bsr eax,a; // bits=u32_log2(a);
mov bits,eax;
}
// compute bb LUT
bb[0]=b;
for (bit=1;bit< bits;bit++) bb[bit]=bb[bit-1]*bb[bit-1];
// for ( ;bit<_bits;bit++) bb[bit]=1;
// bin search x and b^x at the same time
for (bx=1,x=0,bit=bits-1,m=1<<bit;m;m>>=1,bit--)
{
x|=m; bx0=bx; bx*=bb[bit];
if (bx>=a){ x^=m; bx=bx0; }
}
return x+1;
}
However the speedup is only slight, for example naive 137 ms vs. binary search 133 ms ... Note that the faster log2 made almost no difference, but that is because of how my compiler handles inline asm (not sure why BDS2006 and BCC32 are so slow at switching between asm and C++, but they are, which is why in older C++ Builders inline-asm functions were not a good choice for speed optimization unless a major speedup was expected) ...

Related

Efficient and Exact Floating-Point Binary Search

Consider the following binary search for a value greater than lo, but less than or equal to hi:
find(lo: number, hi: number, isTooLow: (testVal: number) => boolean) {
for(;;) {
const testVal = between(lo, hi);
if (testVal <= lo || testVal >= hi) {
break;
}
if (isTooLow(testVal)) {
lo = testVal;
} else {
hi = testVal;
}
}
return hi;
}
Note that a number here is a 64-bit float.
The search will always terminate, and if the between function is very carefully written to choose the median available 64-bit float between lo and hi, if it exists, then also:
The search will terminate within 64 iterations; and
It will exactly find the smallest value hi such that isTooLow(hi) == false
But such a between function is tricky and complicated, and it depends on the fine details of the floating point representation.
Can anyone suggest an implementation for between that is simpler and that does not depend on any specifics of the floating point representation, except that there is a fixed-width mantissa, a fixed-width exponent, and a sign?
It will need to be implemented in Javascript, and it only needs to be almost as good, such that:
The search will always terminate within 200 iterations or so; and
It will very nearly (within 3 or 4 possible values) find the smallest value hi such that isTooLow(hi) == false
Extra points for avoiding transcendental functions and sqrt.
RESOLUTION
In the end, I really liked David's stateful guesser, but I hoisted the state up into the call stack, and the result essentially does a search for the exponent first, without any knowledge of the representation.
I haven't tested/debugged this yet:
function find(lo: number, hi: number, isTooLow: (testVal: number) => boolean) {
[lo, hi] = getLinearRange(lo, hi, isTooLow);
for (; ;) {
const testVal = lo + (hi - lo) * 0.5;
if (testVal <= lo || testVal >= hi) {
break;
}
if (isTooLow(testVal)) {
lo = testVal;
} else {
hi = testVal;
}
}
return hi;
}
/**
* Reduce a floating-point range to a size where a conventional binary
* search is appropriate.
* @returns [newlow, newhigh]
*/
function getLinearRange(
low: number, high: number,
isTooLow: (n: number) => boolean): [number, number] {
let negRange: [number, number] | undefined;
if (low < 0) {
if (high > 0) {
if (isTooLow(0)) {
return scaleRange(0, high, 0.25, isTooLow);
} else {
const isTooHigh = (n: number) => !isTooLow(n);
negRange = scaleRange(0, -low, 0.25, isTooHigh);
}
} else {
const isTooHigh = (n: number) => !isTooLow(n);
negRange = scaleRange(-high, -low, 0.25, isTooHigh);
}
} else {
return scaleRange(low, high, 0.25, isTooLow);
}
// we have to negate the range
low = -negRange[1];
negRange[1] = -negRange[0];
negRange[0] = low;
return negRange;
}
/**
* Reduce a positive range until low/high >= minScale
* @returns [newlow, newhigh]
*/
function scaleRange(
low: number, high: number, minScale: number,
isTooLow: (n: number) => boolean): [number, number] {
if (!(minScale > 0 && low < high * minScale)) {
return [low, high];
}
const range = scaleRange(low, high, minScale * minScale, isTooLow);
[low, high] = range;
const test = high * minScale;
if (test > low && test < high) {
if (isTooLow(test)) {
range[0] = test;
} else {
range[1] = test;
}
}
return range;
}
We could also do some precomputation (code below works for nonnegative finite ranges, add branches ad lib to handle the other cases). We approximate the smallest useful fraction, increase it by square roots to effect binary search on the exponent, and then finish off with good old arithmetic mean to nail down the significand. I think the worst case is 65 queries, certainly not much more, though many inputs will take longer than the bit-munging algorithm.
const fractions = [];
const Guesser = {
fractions: null,
between(low, high) {
if (this.fractions === null) {
this.fractions = [];
let f = 0.25;
while (low + f * (high - low) > low) {
this.fractions.push(f);
f *= f;
}
}
return low + (this.fractions.pop() || 0.5) * (high - low);
},
};
for (let i = 0; i <= 101; ++i) {
let n = 0;
let g = Object.create(Guesser);
let low = 0;
let high = 1.7976931348623157e308;
for (;;) {
++n;
let mid = g.between(low, high);
if (mid <= low || high <= mid) break;
if (100 * Math.random() < i) low = mid;
else high = mid;
}
console.log(n);
}
Here’s an idea that I think meets your specs on IEEE doubles using primitive operations only (but seems probably worse than using square root assuming that it’s hardware-accelerated). Find the sign (if needed), find the top 7 bits of the 11-bit exponent using linear search (≈ 128 queries, plus 4 for subnormals, ugh), then switch to the arithmetic mean (≈ 53 + 2^(11−7) = 69 queries), for a total of about 200 queries if I’m counting right. Untested JavaScript below.
const multiplicative_stride = 1 / 2 ** (2 ** (11 - 7));
function between(low, high) {
if (high <= 0) return -between(-high, -low);
if (low < 0) return 0;
const mid = multiplicative_stride * high;
return mid <= low ? low + 0.5 * (high - low) : mid;
}
Using 64 floating-point iterations is not a good idea because you forgot that 64-bit floats (double) are represented as 3 separate fields:
1 bit sign
11 bit exponent
53 bit mantissa
and if you do not know the real range of the solution (where you can start), then you might be off by quite a few iterations, as the range of such numbers is much bigger than 2^64 ...
However if you do this on the binary bits directly then it's OK (but you have to handle special cases like NaN, Inf and maybe also denormalized numbers).
So instead of using *0.5 you use binary operations on the individual bits, like x<<=1, x|=1, x^=1 ...
Here is a simple C++ example:
double f64_sqrt(double x)
{
// IEEE 754 double MSW masks
const DWORD _f64_sig =0x80000000; // sign
const DWORD _f64_exp =0x7FF00000; // exponent
const DWORD _f64_exp_sig=0x40000000; // exponent sign
const DWORD _f64_exp_bia=0x3FF00000; // exponent bias
const DWORD _f64_exp_lsb=0x00100000; // exponent LSB
const DWORD _f64_man =0x000FFFFF; // mantissa
const DWORD _f64_man_msb=0x00080000; // mantissa MSB
const int h=1; // may be platform dependent MSB/LSB order
const int l=0;
DWORD b; // bit mask
union // semi result
{
double f; // 64bit floating point
DWORD u[2]; // 2x32 bit uint
} y;
// fabs
y.f=x; y.u[h]&=_f64_exp|_f64_man; x=y.f;
// set safe exponent (~ abs half) destroys mantissa,sign
b=(y.u[h]&_f64_exp)-_f64_exp_bia;
y.u[h]=((b>>1)|(b&_f64_exp_sig))+_f64_exp_bia;
// sign=`+` mantissa=0
y.u[h]&=_f64_exp;
// correct exponent if needed
if (y.f*y.f>x) y.u[h]=(y.u[h]-_f64_exp_lsb)&_f64_exp;
// binary search
for (b=_f64_man_msb;b;b>>=1) { y.u[h]|=b; if (y.f*y.f>x) y.u[h]^=b; }
for (b=0x80000000 ;b;b>>=1) { y.u[l]|=b; if (y.f*y.f>x) y.u[l]^=b; }
return y.f;
}
using "estimation" of solution exponent (exploiting the fact that result has ~ half of integer bits for sqrt) and binary access binary search of mantissa only (53 iterations)
In case you do not know the exponent range you have to bin search it too (starting from highest bit or one after if sign is known) ...
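For the general case, here is a minimal sketch of my own (not from the original answer) of such a bit-level search: for non-negative doubles the raw IEEE 754 bit pattern is ordered the same way as the values, so greedily setting exponent and mantissa bits from the top finds the boundary of a monotone predicate in at most 63 iterations (the predicate must tolerate Inf/NaN test values or they must be excluded):

#include <cstdint>
#include <cstring>

// returns the largest non-negative double for which isTooLow() is still true;
// the next representable value is then the smallest one for which it is false
double f64_bisect_bits(bool (*isTooLow)(double))
{
    uint64_t x=0;
    for (uint64_t b=uint64_t(1)<<62;b;b>>=1) // exponent+mantissa bits, sign stays 0
    {
        uint64_t t=x|b; double f;
        std::memcpy(&f,&t,sizeof f); // reinterpret the bit pattern as a double
        if (isTooLow(f)) x=t; // keep the bit while still below the target
    }
    double f; std::memcpy(&f,&x,sizeof f);
    return f;
}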

Compute the exact inverse of this "simple" floating-point function

I have the following function:
float int_to_qty(unsigned x) {
const float MAX = 8.5f;
const float MIN = .001f;
return ((MAX-MIN) / (float)(1<<24)) * x + MIN;
}
This compiles (with reasonable options, on x86) to the following:
.LCPI0_0:
.long 0x3507fbe7 # float 5.06579852E-7
.LCPI0_1:
.long 0x3a83126f # float 0.00100000005
int_to_qty: # #int_to_qty
mov eax, edi
cvtsi2ss xmm0, rax
mulss xmm0, dword ptr [rip + .LCPI0_0]
addss xmm0, dword ptr [rip + .LCPI0_1]
ret
I consider the assembly to be the "canonical" version of the function: Convert the int to a float, multiply by a constant at 32-bit precision, add another constant at 32-bit precision, that's the result.
I want to find the exact inverse of this function. Specifically, a function
unsigned qty_to_int(float qty) that will pass the following test:
int test() {
for (unsigned i = 0; i < (1 << 24); ++i) {
float qty = int_to_qty(i);
if (int_to_qty(qty_to_int(qty)) != qty) {
return 0;
}
}
return 1;
}
Notes:
In the range 4 ≤ int_to_qty(x) < 8, the returned values primarily differ by 1 ulp, which is what makes this challenging.
In the range 8 ≤ int_to_qty(x) < 8.5, the function stops being one-to-one. In this case either answer is fine for the inverse, it doesn't have to be consistently the lowest or the highest.
After wrestling for a long time, I finally came up with a solution that passes the tests. (In Rust, but the translation to C is straightforward.)
pub fn qty_to_int(qty: f64) -> u32 {
const MAX: f32 = 8.5;
const MIN: f32 = 0.001;
let size_inv = f64::from(1<<24) / f64::from(MAX - MIN);
// We explicitly shrink the precision to f32 and then pop back to f64, because we *need* to
// perform that rounding step to properly reverse the addition at the end of int_to_qty.
// We could do the whole thing at f32 precision, except that our input is f64 so the
// subtraction needs to be done at f64 precision.
let fsqueezed: f32 = (qty - f64::from(MIN)) as f32;
// The squeezed subtraction is a one-to-one operation across most of our range. *However*,
// in the border areas where our input is just above an exponent breakpoint, but
// subtraction will bring it below, we hit an issue: The addition in int_to_qty() has
// irreversibly lost a bit in the lowest place! This causes issues when we invert the
// multiply, since we are counting on the error to be centered when we round at the end.
//
// To solve this, we need to re-bias the error by subtracting 0.25 ulp for these cases.
// Technically this should be applied for ranges like [2.0,2.001) as well, but they
// don't need it since they already round correctly (due to having more headroom).
let adj = if qty > 4.0 && qty < 4.001 {
0.5 - 2.0 / 8.499
} else if qty > 8.0 && qty < 8.001 {
0.5 - 4.0 / 8.499
} else {
0.5
};
// Multiply and round, taking into account the possible adjustments.
let fresult = f64::from(fsqueezed) * size_inv + adj;
unsafe { fresult.to_int_unchecked() }
}
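Since the answer notes the translation to C is straightforward, this is roughly what that direct translation could look like (my sketch, not verified against the test above):

unsigned qty_to_int(float qty_f) {
    const float MAX = 8.5f;
    const float MIN = 0.001f;
    double qty = qty_f; /* widen exactly; the Rust version takes f64 */
    double size_inv = (double)(1 << 24) / (double)(MAX - MIN);
    float fsqueezed = (float)(qty - (double)MIN); /* deliberate rounding step to f32 */
    double adj = 0.5;
    if (qty > 4.0 && qty < 4.001) adj = 0.5 - 2.0 / 8.499;
    else if (qty > 8.0 && qty < 8.001) adj = 0.5 - 4.0 / 8.499;
    double fresult = (double)fsqueezed * size_inv + adj;
    return (unsigned)fresult; /* truncate toward zero, like to_int_unchecked */
}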

Factorial of integer mod m fast calculation [duplicate]

This question already has answers here:
Fast way to calculate n! mod m where m is prime?
(8 answers)
Closed 8 years ago.
Is it possible to calculate Factorial(x) mod m without looping through the whole expression chain
((1 % m) * (2 %m) * (3 % m) * ... (x % m)) % m?
To be more precise, m can be in the range 1 <= m <= 10^7 and x in 1 <= x < m.
There are a few fast algorithms for factorial out there,
so the answer is: yes, you can compute the factorial without looping through all the values.
All the ones I have seen use prime decompositions (including my algorithm),
so from there it is just a matter of using modular multiplication instead of normal multiplication.
Look here: Fast exact bigint factorial is my fast algorithm,
and the other answer also contains a link to the swinging-primes algorithm ...
[Notes]
For N! you will need a list of primes up to N,
but the rest of the code can work on arithmetic capable of holding N and m,
so there is no need for huge numbers ...
[edit1] my 32-bit C++ implementations
//---------------------------------------------------------------------------
DWORD modmul(DWORD a,DWORD b,DWORD n)
{
DWORD _a,_b,_n;
_a=a;
_b=b;
_n=n;
asm {
mov eax,_a
mov ebx,_b
mul ebx // H(edx),L(eax) = eax * ebx
mov ebx,_n
div ebx // eax = H(edx),L(eax) / ebx
mov _a,edx // edx = H(edx),L(eax) % ebx
}
return _a;
}
//---------------------------------------------------------------------------
DWORD modfact0(DWORD n,DWORD m) // (n!) mod m (naive approach)
{
DWORD i,f;
for (f=1,i=2;i<=n;i++) f=modmul(f,i,m);
return f;
}
//---------------------------------------------------------------------------
DWORD modfact1(DWORD n,DWORD m) // (n!) mod m (mine fast approach)
{
if (n<=4)
{
if (n==4) return 24;
if (n==3) return 6;
if (n==2) return 2;
if (n==1) return 1;
if (n==0) return 1;
}
int N4,N2,p,i,j,e; DWORD c,pp;
N4=(n>>2)<<2;
N2=N4>>1;
c=modfact1(N2,m); c=modmul(c,c,m); // c=((N4/2)!)^2
for (i=0;;i++) // c*= T2
{
p=primes_i32.dat[i];
if (!p) break;
if (p>N4) break;
for (e=0,j=N4;j;e+=j&1,j/=p);
if (e) // c*=p^e
{
if (p==2) c<<=e; // OK while c<<e fits in 32 bits; otherwise use modmul like below
else for (pp=p;;)
{
if (int(e&1)) c=modmul(c,pp,m);
e>>=1; if (!e) break;
pp=modmul(pp,pp,m);
}
}
}
for (i=N4+1;i<=n;i++) c=modmul(c,i,m);
return c;
}
//---------------------------------------------------------------------------
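For compilers without Borland-style inline asm, a portable modmul doing the same job could look like this (my sketch, not part of the original answer):

DWORD modmul_portable(DWORD a,DWORD b,DWORD n) // (a*b) mod n
{
    return (DWORD)(((unsigned long long)a*b)%n); // 64-bit intermediate avoids overflow
}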
primes:
DWORD primes_i32.dat[] is a precomputed, ascending-sorted list of all primes up to n
Here are the results:
[ 18.529 ms] slow modfact0(1000000,1299721) = 195641
[ 2.995 ms] fast modfact1(1000000,1299721) = 195641
[ 96.242 ms] slow modfact0(5000000,9999991) = 2812527
[ 13.305 ms] fast modfact1(5000000,9999991) = 2812527
1299721 is the first prime close to 1000000 that I found
If m is not prime and the partial result hits zero, then you can skip the rest of the multiplications for a massive speedup...
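A minimal sketch of that early exit on top of the naive loop (my addition, reusing modmul from above):

DWORD modfact0_early(DWORD n,DWORD m) // (n!) mod m, naive loop with early exit
{
    DWORD i,f;
    for (f=1,i=2;i<=n;i++)
    {
        f=modmul(f,i,m);
        if (!f) return 0; // once the running product is 0 mod m it stays 0
    }
    return f;
}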
Hope the results are OK, I have nothing to compare them against ...

Algorithm to match sets with overlapping members

Looking for an efficient algorithm to match sets within a group of sets, ordered by the number of overlapping members. Two identical sets, for example, are the best match, while sets with no overlapping members are the worst.
So, the algorithm takes a list of sets as input and returns matching set pairs ordered by the sets with the most overlapping members.
Would be interested in ideas to do this efficiently. The brute-force approach is to try all combinations and sort, which obviously is not very performant when the number of sets is very large.
Edit: Use case - Assume a large number of sets already exist. When a new set arrives, the algorithm is run and the output includes matching sets (with at least one element overlap) sorted by the most matching to least (doesn't matter how many items are in the new/incoming set). Hope that clarifies my question.
If you can afford an approximation algorithm with a chance of error, then you should probably consider MinHash.
This algorithm allows estimating the similarity between 2 sets in constant time. For any constructed set, a fixed-size signature is computed, and then only the signatures are compared when estimating the similarities. The similarity measure being used is the Jaccard index, which ranges from 0 (disjoint sets) to 1 (identical sets). It is defined as the intersection-to-union ratio of two given sets.
With this approach, any new set has to be compared against all existing ones (in linear time), and then the results can be merged into the top list (you can use a bounded search tree/heap for this purpose).
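To make the idea concrete, here is a minimal MinHash sketch of my own (not from the original answer); the signature size K and the per-slot mixing hash are arbitrary choices:

#include <cstdint>
#include <vector>
#include <algorithm>
#include <limits>

struct MinHash {
    static const int K = 64; // signature size: accuracy vs. cost trade-off
    uint64_t sig[K];
    explicit MinHash(const std::vector<uint32_t>& s) {
        for (int i = 0; i < K; i++) {
            uint64_t best = std::numeric_limits<uint64_t>::max();
            for (uint32_t x : s) {
                // cheap per-slot mixing hash; any decent hash family works here
                uint64_t h = x + 0x9E3779B97F4A7C15ull * (uint64_t)(i + 1);
                h ^= h >> 33; h *= 0xFF51AFD7ED558CCDull; h ^= h >> 33;
                best = std::min(best, h);
            }
            sig[i] = best; // minimum hash value over the whole set
        }
    }
    double similarity(const MinHash& o) const { // estimates the Jaccard index in O(K)
        int same = 0;
        for (int i = 0; i < K; i++) if (sig[i] == o.sig[i]) same++;
        return (double)same / K;
    }
};

A new set is hashed once into a signature and then compared against every stored signature in O(K) each, which is exactly the linear pass over the existing sets described above.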
Since the number of possible different values is not very large, you get fairly efficient hashing if you simply set the nth bit in a "large integer" when the nth number is present in your set. You can then look for overlap between sets with a simple bitwise AND followed by a "count set bits" operation. On a 64-bit architecture, that means you can compare two sets (out of 1000 possible values) in about 16 cycles, regardless of the number of values in each cluster. As the clusters get more sparse, this becomes a less efficient algorithm.
Still - I implemented some of the basic functions you might need in some code that I attach here - not documented but reasonably understandable, I think. In this example I made the numbers small so I can check the result by hand - you might want to change some of the #defines to get larger ranges of values, and obviously you will want some dynamic lists etc to keep up with the growing catalog.
#include <stdio.h>
#include <stdlib.h> // rand()
// biggest number you will come across: want this to be much bigger
#define MAXINT 25
// use the biggest type you have - not int
#define BITSPER (8*sizeof(int))
#define NWORDS (MAXINT/BITSPER + 1)
// max number in a cluster
#define CSIZE 5
typedef struct{
unsigned int num[NWORDS]; // want to use longest type but not for demo
int newmatch;
int rank;
} hmap;
// convert number to binary sequence:
void hashIt(int* t, int n, hmap* h) {
int ii;
for(ii=0;ii<n;ii++) {
int a, b;
a = t[ii]%BITSPER;
b = t[ii]/BITSPER;
h->num[b]|=1<<a;
}
}
// print binary number:
void printBinary(int n) {
unsigned int jj;
jj = 1<<31;
while(jj!=0) {
printf("%c",((n&jj)!=0)?'1':'0');
jj>>=1;
}
printf(" ");
}
// print the array of binary numbers:
void printHash(hmap* h) {
unsigned int ii, jj;
for(ii=0; ii<NWORDS; ii++) {
jj = 1<<31;
printf("0x%08x: ", h->num[ii]);
printBinary(h->num[ii]);
}
//printf("\n");
}
int countBits(unsigned int b); // defined below; prototype needed before first use
// find the maximum overlap for set m among the n sets
int maxOverlap(hmap* h, int m, int n) {
int ii, jj;
int overlap, maxOverlap = -1;
for(ii = 0; ii<n; ii++) {
if(ii == m) continue; // don't compare with yourself
else {
overlap = 0;
for(jj = 0; jj< NWORDS; jj++) {
// just to see what's going on: take these print statements out
printBinary(h[ii].num[jj]);
printBinary(h[m].num[jj]);
int bc = countBits(h[ii].num[jj] & h[m].num[jj]); // AND word jj of set ii with word jj of set m
printBinary(h[ii].num[jj] & h[m].num[jj]);
printf("%d bits overlap\n", bc);
overlap += bc;
}
if(overlap > maxOverlap) maxOverlap = overlap;
}
}
return maxOverlap;
}
int countBits (unsigned int b) {
int count;
for (count = 0; b != 0; count++) {
b &= b - 1; // this clears the LSB-most set bit
}
return count;
}
int main(void) {
int cluster[20][CSIZE];
int temp[CSIZE];
int ii,jj;
static hmap H[20]; // make them all 0 initially
for(jj=0; jj<20; jj++){
for(ii=0; ii<CSIZE; ii++) {
temp[ii] = rand()%MAXINT;
}
hashIt(temp, CSIZE, &H[jj]);
}
for(ii=0;ii<20;ii++) {
printHash(&H[ii]);
printf("max overlap: %d\n", maxOverlap(H, ii, 20));
}
}
See if this helps at all...

Floating Point Divider Hardware Implementation Details

I am trying to implement a 32-bit floating point hardware divider in hardware and I am wondering if I can get any suggestions as to some tradeoffs between different algorithms?
My floating point unit currently supports multiplication and addition/subtraction, but I am not going to switch it to a fused multiply-add (FMA) floating point architecture since this is an embedded platform where I am trying to minimize area usage.
A very long time ago I came across this neat and easy-to-implement float/fixed-point division algorithm, used in military FPUs of that time period:
the input must be unsigned and shifted so that x < y and both are in the range <0.5, 1>
don't forget to store the difference of the shifts sh = shx - shy and the original signs
find f (by iterating) so that y*f -> 1; after that x*f -> x/y, which is the division result
shift x*f back by sh and restore the result sign (sig = sigx*sigy)
the x*f can be computed easily like this:
z = 1 - y
(x*f) = (x/y) = x*(1+z)*(1+z^2)*(1+z^4)*(1+z^8)*(1+z^16)...(1+z^(2^n))
where
n = log2(number of fractional bits for fixed point, or mantissa bit size for floating point)
You can also stop as soon as z^(2^n) becomes zero on fixed-bit-width data types.
[Edit2] Had a bit of time and mood for this, so here is a 32-bit IEEE 754 C++ implementation.
I removed the old (bignum) examples to avoid confusion for future readers (they are still accessible in the edit history if needed).
//---------------------------------------------------------------------------
// IEEE 754 single masks
const DWORD _f32_sig =0x80000000; // sign
const DWORD _f32_exp =0x7F800000; // exponent
const DWORD _f32_exp_sig=0x40000000; // exponent sign
const DWORD _f32_exp_bia=0x3F800000; // exponent bias
const DWORD _f32_exp_lsb=0x00800000; // exponent LSB
const DWORD _f32_exp_pos= 23; // exponent LSB bit position
const DWORD _f32_man =0x007FFFFF; // mantissa
const DWORD _f32_man_msb=0x00400000; // mantissa MSB
const DWORD _f32_man_bits= 23; // mantissa bits
//---------------------------------------------------------------------------
float f32_div(float x,float y)
{
union _f32 // float bits access
{
float f; // 32bit floating point
DWORD u; // 32 bit uint
};
_f32 xx,yy,zz; int sh; DWORD zsig; float z;
// result signum abs value
xx.f=x; zsig =xx.u&_f32_sig; xx.u&=(0xFFFFFFFF^_f32_sig);
yy.f=y; zsig^=yy.u&_f32_sig; yy.u&=(0xFFFFFFFF^_f32_sig);
// initial exponent difference sh and normalize exponents to speed up shift in range
sh =0;
sh-=((xx.u&_f32_exp)>>_f32_exp_pos)-(_f32_exp_bia>>_f32_exp_pos); xx.u&=(0xFFFFFFFF^_f32_exp); xx.u|=_f32_exp_bia;
sh+=((yy.u&_f32_exp)>>_f32_exp_pos)-(_f32_exp_bia>>_f32_exp_pos); yy.u&=(0xFFFFFFFF^_f32_exp); yy.u|=_f32_exp_bia;
// shift input in range
while (xx.f> 1.0f) { xx.f*=0.5f; sh--; }
while (xx.f< 0.5f) { xx.f*=2.0f; sh++; }
while (yy.f> 1.0f) { yy.f*=0.5f; sh++; }
while (yy.f< 0.5f) { yy.f*=2.0f; sh--; }
while (xx.f<=yy.f) { yy.f*=0.5f; sh++; }
// divider block
z=(1.0f-yy.f);
zz.f=xx.f*(1.0f+z);
for (;;)
{
z*=z; if (z==0.0f) break;
zz.f*=(1.0f+z);
}
// shift result back
for (;sh>0;) { sh--; zz.f*=0.5f; }
for (;sh<0;) { sh++; zz.f*=2.0f; }
// set signum
zz.u&=(0xFFFFFFFF^_f32_sig);
zz.u|=zsig;
return zz.f;
}
//---------------------------------------------------------------------------
I wanted to keep it simple, so it is not optimized yet. You can for example replace all the *=0.5 and *=2.0 by exponent increments/decrements... If you compare with FPU results of the float operator /, this will be a bit less precise, because most FPUs compute on an 80-bit internal format while this implementation uses only 32 bits.
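A minimal sketch of that exponent trick (my addition, not part of the original code), assuming the _f32_* masks above, a normalized finite value and no overflow/underflow handling:

float f32_scale_pow2(float f,int sh) // returns f*2^(-sh), replaces the two *=0.5f / *=2.0f loops
{
    union { float f; DWORD u; } v;
    v.f=f;
    v.u-=DWORD(sh)<<_f32_exp_pos; // exponent -= sh (wrap-around also handles negative sh)
    return v.f;
}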
As you can see, the only FPU operations I use are +, -, *. The code can be sped up further by using fast squaring algorithms like
Fast bignum square computation
especially if you want to use big bit widths ...
Do not forget to implement normalization and/or overflow/underflow correction.

Resources