Why is float division faster than integer division in C++?

Consider the following code snippets in C++ (Visual Studio 2015):
First Block
const int size = 500000000;
int sum =0;
int *num1 = new int[size];//initialized between 1-250
int *num2 = new int[size];//initialized between 1-250
for (int i = 0; i < size; i++)
{
sum +=(num1[i] / num2[i]);
}
Second Block
const int size = 500000000;
int sum =0;
float *num1 = new float [size]; //initialized between 1-250
float *num2 = new float [size]; //initialized between 1-250
for (int i = 0; i < size; i++)
{
sum +=(num1[i] / num2[i]);
}
I expected the first block to run faster because it uses integer operations, but the second block is considerably faster even though it uses floating point operations. Here are the results of my benchmark:
Division:
Type Time
uint8 879.5ms
uint16 885.284ms
int 982.195ms
float 654.654ms
Floating point multiplication is also faster than integer multiplication. Here are the results of my benchmark:
Multiplication:
Type Time
uint8 166.339ms
uint16 524.045ms
int 432.041ms
float 402.109ms
My system specs: CPU Core i7-7700, RAM 64 GB, Visual Studio 2015.

Floating point division is faster than integer division because of the exponent part in the floating point representation: to divide one exponent by another, a plain subtraction is used.
int32_t division requires a fast division of 31-bit numbers, whereas float division requires a fast division of 24-bit mantissas (the leading one of the mantissa is implied and not stored in a floating point number) plus a fast subtraction of 8-bit exponents.
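As a small illustration of that decomposition (a sketch of the idea only, not how the hardware actually performs it), you can split two floats into mantissa and exponent with std::frexp, divide only the narrow mantissas, and recombine with an exponent subtraction; for ordinary values this reproduces the built-in division:
#include <cassert>
#include <cmath>
int main()
{
    float x = 150.0f, y = 7.0f;
    int ex, ey;
    float mx = std::frexp(x, &ex);          // x = mx * 2^ex, mx in [0.5, 1)
    float my = std::frexp(y, &ey);          // y = my * 2^ey, my in [0.5, 1)
    // divide the small mantissas, then just subtract the exponents
    float q = std::ldexp(mx / my, ex - ey);
    assert(q == x / y);                     // same result for ordinary (non-over/underflowing) inputs
    return 0;
}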
See an excellent detailed explanation of how division is performed in a CPU.
It may be worth mentioning that SSE and AVX instructions only provide floating point division, but no integer division. SSE instructions/intrinsics can be used to quadruple the speed of your float calculation easily.
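For example (a minimal sketch of the idea, not the benchmark code from the question, and assuming size is a multiple of 4): _mm_div_ps performs four float divisions per instruction, so the float loop could be vectorized roughly like this:
#include <xmmintrin.h> // SSE
float sum_of_quotients(const float *num1, const float *num2, int size)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < size; i += 4)       // 4 divisions per DIVPS
    {
        __m128 a = _mm_loadu_ps(num1 + i);
        __m128 b = _mm_loadu_ps(num2 + i);
        acc = _mm_add_ps(acc, _mm_div_ps(a, b));
    }
    float partial[4];
    _mm_storeu_ps(partial, acc);            // horizontal sum of the 4 partial sums
    return partial[0] + partial[1] + partial[2] + partial[3];
}
(The question accumulates truncated quotients into an int; keeping the accumulation in float here is a deliberate simplification.)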
If you look into Agner Fog's instruction tables, for example, for Skylake, the latency of the 32-bit integer division is 26 CPU cycles, whereas the latency of the SSE scalar float division is 11 CPU cycles (and, surprisingly, it takes the same time to divide four packed floats).
Also note that in C and C++ there is no division on numbers shorter than int, so uint8_t and uint16_t are first promoted to int and then a division of ints happens. uint8_t division looks faster than int because its operands have fewer bits set when converted to int, which lets the division complete faster.
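You can verify that promotion at compile time; this small check is my own illustration, not part of the benchmark:
#include <cstdint>
#include <type_traits>
// uint8_t / uint8_t is actually performed on int after the usual integral promotions
static_assert(std::is_same<decltype(std::uint8_t{} / std::uint8_t{}), int>::value,
              "uint8_t operands are promoted to int before division");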

Related

Metal Compute function causes GPU timeout error

I am trying to compute the Collatz conjecture for a range of numbers to see how much I can benefit from using the GPU. For some reason, the function seems to fail for integers above one hundred million. I use 64-bit unsigned long for the calculations, so it can't be integer overflow; the largest number reached in the calculations for any integer is well below the maximum representable value for this datatype.
The application is basically Apple's Performing Calculations on a GPU, where the array buffers and fragment function are the only things changed. The basic idea for the function is to pass an array of integers (say from 1 to 1000), where each integer serves as a starting point for a while-loop performing the Collatz calculations for every number until the thread reaches a top set limit, for example, a billion.
kernel void compute_collatz(device const unsigned int *array [[buffer(0)]],
device unsigned int *result [[buffer(1)]],
uint index [[thread_position_in_grid]])
{
const unsigned long arrayLength = (unsigned long)1000;
unsigned long arrayNumber = (unsigned long)array[index];
unsigned long maxNumber = (unsigned long)1000000000 - arrayLength;
while (arrayNumber <= maxNumber) {
unsigned long curentStep = arrayNumber;
while (curentStep != 1) {
if (curentStep % 2 == 0) {curentStep = curentStep / 2;}
else {curentStep = (curentStep * 3) + 1;}
}
arrayNumber += arrayLength;
}
result[index] = (int)arrayNumber;
}
When the function reaches the maximum limit, it stores that value in a different array. This works well when the maximum value is set to one hundred million or less, but only about one third of the array is changed when I try higher values. The program fails with the following error: Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (IOAF code 2). I had a similar problem when I tried multithreading on the CPU for the first time, but then the problem was related to pointers. I can't see that this is a problem here.
I am using Xcode 12.5.1 on macOS 11.5.2 on a MacBook Pro 16 (2016). Any help is much appreciated!

When to use dsgesv versus dgesv to solve a system of linear equations

The LAPACK documentation says of DSGESV (and ZCGESV for complex numbers):
The dsgesv and zcgesv are mixed precision iterative refinement
subroutines for exploiting fast single precision hardware. They first
attempt to factorize the matrix in single precision (dsgesv) or single
complex precision (zcgesv) and use this factorization within an
iterative refinement procedure to produce a solution with double
precision (dsgesv) / double complex precision (zcgesv) normwise
backward error quality (see below). If the approach fails, the method
switches to a double precision or double complex precision
factorization respectively and computes the solution.
The iterative refinement is not going to be a winning strategy if the
ratio single precision performance over double precision performance
is too small. A reasonable strategy should take the number of
right-hand sides and the size of the matrix into account. This might
be done with a call to ilaenv in the future. At present, iterative
refinement is implemented.
But how can I know what the ratio of single precision performance over double precision performance is? There is the suggestion to take into account the size of the matrix, but I don't see how exactly the size of the matrix leads to an estimate of this performance ratio.
Would anyone be able to clarify these things?
My guess is that the best way to go is to test both dgesv() and dsgesv()...
Looking at the source code of the LAPACK function dsgesv(), here is what dsgesv() tries to perform:
Cast the matrix A to a float matrix As
Call sgetrf(): LU factorization, in single precision
Solve the system As.x=b using the LU factorization, by calling sgetrs()
Compute the double precision residual r=b-Ax, solve As.x'=r using sgetrs() again, and update x=x+x'.
The last step is repeated until double precision is achieved (30 iterations max). The criterion defining success is
||r|| < sqrt(n) * ||x|| * ||A|| * eps
where eps is the precision of double precision floating point numbers (about 1e-16) and n is the size of the matrix. If it fails, dsgesv() falls back to dgesv(): it calls dgetrf() (factorization) and then dgetrs(). Hence dsgesv() is a mixed precision algorithm. See this article for instance.
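For intuition, here is a minimal, self-contained C++ sketch of the same mixed-precision refinement idea. It uses a naive Gaussian elimination in place of sgetrf()/sgetrs() and a crude stopping test, so it illustrates the scheme rather than what dsgesv() literally does:
#include <cmath>
#include <cstdio>
#include <vector>
// Solve As * x = b in single precision with naive Gaussian elimination (no pivoting).
static std::vector<float> solve_single(std::vector<float> A, std::vector<float> b)
{
    const int n = (int)b.size();
    for (int k = 0; k < n; ++k)
        for (int i = k + 1; i < n; ++i)
        {
            float f = A[i*n+k] / A[k*n+k];
            for (int j = k; j < n; ++j) A[i*n+j] -= f * A[k*n+j];
            b[i] -= f * b[k];
        }
    std::vector<float> x(n);
    for (int i = n - 1; i >= 0; --i)
    {
        float s = b[i];
        for (int j = i + 1; j < n; ++j) s -= A[i*n+j] * x[j];
        x[i] = s / A[i*n+i];
    }
    return x;
}
int main()
{
    const int n = 3;
    std::vector<double> A = {4,1,0, 1,3,1, 0,1,2};
    std::vector<double> b = {1,2,3};
    std::vector<float> As(A.begin(), A.end());                       // step 1: cast A to float
    std::vector<float> xs = solve_single(As, std::vector<float>(b.begin(), b.end()));
    std::vector<double> x(xs.begin(), xs.end());                     // steps 2-3: single precision solve
    for (int iter = 0; iter < 30; ++iter)                            // step 4: refine in double precision
    {
        std::vector<float> r(n);
        double rnrm = 0.0;
        for (int i = 0; i < n; ++i)
        {
            double s = b[i];
            for (int j = 0; j < n; ++j) s -= A[i*n+j] * x[j];        // residual computed in double
            r[i] = (float)s;
            rnrm = std::fmax(rnrm, std::fabs(s));
        }
        if (rnrm < 1e-14) break;                                     // crude stopping test
        std::vector<float> dx = solve_single(As, r);                 // correction solved in single precision
        for (int i = 0; i < n; ++i) x[i] += dx[i];
    }
    std::printf("x = %.15g %.15g %.15g\n", x[0], x[1], x[2]);
    return 0;
}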
Lastly, dsgesv() is expected to outperform dgesv() for a small number of right-hand sides and large matrices, that is, when the cost of the factorization sgetrf()/dgetrf() is much higher than the cost of the substitutions sgetrs()/dgetrs(). Since the maximum number of iterations set in dsgesv() is 30, an approximate limit can be obtained by comparing the O(n^3) cost of the factorization with the O(30*nrhs*n^2) cost of the extra substitutions.
Moreover, sgetrf() must prove significantly faster than dgetrf(). sgetrf() can be faster due to limited memory bandwidth or vector processing (look for SIMD; an example from SSE is the ADDPS instruction).
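As a hedged illustration of the vector-processing point (my own sketch, unrelated to the LAPACK sources): with SSE, one ADDPS adds four packed floats while one ADDPD adds only two packed doubles, so a purely vectorized single precision kernel processes twice as many elements per instruction and moves half the bytes:
#include <immintrin.h>
// assumes n is a multiple of 4 and the arrays do not alias
void add_floats(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4)      // 4 elements per ADDPS
        _mm_storeu_ps(c + i, _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
}
void add_doubles(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i += 2)      // 2 elements per ADDPD
        _mm_storeu_pd(c + i, _mm_add_pd(_mm_loadu_pd(a + i), _mm_loadu_pd(b + i)));
}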
The argument iter of dsgesv() can be tested to check whether the iterative refinement was useful: if it is negative, iterative refinement failed and using dsgesv() was just a waste of time!
Here is a C code to compare and time dgesv(), sgesv() and dsgesv(). It can be compiled with gcc main.c -o main -llapacke -llapack -lblas. Feel free to test your own matrix!
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <time.h>
#include <lapacke.h>
int main(void){
srand (time(NULL));
//size of the matrix
int n=2000;
// number of right-hand size
int nb=3;
int nbrun=1000*100*100/n/n;
//memory initialization
double *aaa=malloc(n*n*sizeof(double));
if(aaa==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
double *aa=malloc(n*n*sizeof(double));
if(aa==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
double *bbb=malloc(n*nb*sizeof(double));
if(bbb==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
double *x=malloc(n*nb*sizeof(double));
if(x==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
double *bb=malloc(n*nb*sizeof(double));
if(bb==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
float *aaas=malloc(n*n*sizeof(float));
if(aaas==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
float *aas=malloc(n*n*sizeof(float));
if(aas==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
float *bbbs=malloc(n*n*sizeof(float));
if(bbbs==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
float *bbs=malloc(n*nb*sizeof(float));
if(bbs==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
int *ipiv=malloc(n*nb*sizeof(int));
if(ipiv==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
int i,j;
//matrix initialization
double cond=1e3;
for(i=0;i<n;i++){
for(j=0;j<n;j++){
if(j==i){
aaa[i*n+j]=pow(cond,(i+1)/(double)n);
}else{
aaa[i*n+j]=1.9*(rand()/(double)RAND_MAX-0.5)*pow(cond,(i+1)/(double)n)/(double)n;
//aaa[i*n+j]=(rand()/(double)RAND_MAX-0.5)/(double)n;
//aaa[i*n+j]=0;
}
}
bbb[i]=i;
}
for(i=0;i<n;i++){
for(j=0;j<n;j++){
aaas[i*n+j]=aaa[i*n+j];
}
bbbs[i]=bbb[i];
}
int k=0;
int ierr;
//estimating the condition number of the matrix
memcpy(aa,aaa,n*n*sizeof(double));
double anorm;
double rcond;
//anorm=LAPACKE_dlange( LAPACK_ROW_MAJOR, 'i', n,n, aa, n);
double work[n];
anorm=LAPACKE_dlange_work(LAPACK_ROW_MAJOR, 'i', n, n, aa, n, work );
ierr=LAPACKE_dgetrf( LAPACK_ROW_MAJOR, n, n,aa, n,ipiv );
if(ierr<0){LAPACKE_xerbla( "LAPACKE_dgetrf", ierr );}
ierr=LAPACKE_dgecon(LAPACK_ROW_MAJOR, 'i', n,aa, n,anorm,&rcond );
if(ierr<0){LAPACKE_xerbla( "LAPACKE_dgecon", ierr );}
printf("condition number is %g\n",anorm,1./rcond);
//testing dgesv()
clock_t t;
t = clock();
for(k=0;k<nbrun;k++){
memcpy(bb,bbb,n*nb*sizeof(double));
memcpy(aa,aaa,n*n*sizeof(double));
ierr=LAPACKE_dgesv(LAPACK_ROW_MAJOR,n,nb,aa,n,ipiv,bb,nb);
if(ierr<0){LAPACKE_xerbla( "LAPACKE_dgesv", ierr );}
}
//testing sgesv()
t = clock() - t;
printf ("dgesv()x%d took me %d clicks (%f seconds).\n",nbrun,t,((float)t)/CLOCKS_PER_SEC);
t = clock();
for(k=0;k<nbrun;k++){
memcpy(bbs,bbbs,n*nb*sizeof(float));
memcpy(aas,aaas,n*n*sizeof(float));
ierr=LAPACKE_sgesv(LAPACK_ROW_MAJOR,n,nb,aas,n,ipiv,bbs,nb);
if(ierr<0){LAPACKE_xerbla( "LAPACKE_sgesv", ierr );}
}
//testing dsgesv()
t = clock() - t;
printf ("sgesv()x%d took me %d clicks (%f seconds).\n",nbrun,t,((float)t)/CLOCKS_PER_SEC);
int iter;
t = clock();
for(k=0;k<nbrun;k++){
memcpy(bb,bbb,n*nb*sizeof(double));
memcpy(aa,aaa,n*n*sizeof(double));
ierr=LAPACKE_dsgesv(LAPACK_ROW_MAJOR,n,nb,aa,n,ipiv,bb,nb,x,nb,&iter);
if(ierr<0){LAPACKE_xerbla( "LAPACKE_dsgesv", ierr );}
}
t = clock() - t;
printf ("dsgesv()x%d took me %d clicks (%f seconds).\n",nbrun,t,((float)t)/CLOCKS_PER_SEC);
if(iter>0){
printf("iterative refinement has succeded, %d iterations\n");
}else{
printf("iterative refinement has failed due to");
if(iter==-1){
printf(" implementation- or machine-specific reasons\n");
}
if(iter==-2){
printf(" overflow in iterations\n");
}
if(iter==-3){
printf(" failure of single precision factorization sgetrf() (ill-conditionned?)\n");
}
if(iter==-31){
printf(" max number of iterations\n");
}
}
free(aaa);
free(aa);
free(bbb);
free(bb);
free(x);
free(aaas);
free(aas);
free(bbbs);
free(bbs);
free(ipiv);
return 0;
}
Output for n=2000:
condition number is 1475.26
dgesv()x2 took me 5260000 clicks (5.260000 seconds).
sgesv()x2 took me 3560000 clicks (3.560000 seconds).
dsgesv()x2 took me 3790000 clicks (3.790000 seconds).
iterative refinement has succeeded, 11 iterations

long double subnormals/denormals get truncated to 0 [-Woverflow]

In the IEEE 754 standard, the minimum strictly positive (subnormal) value is 2^-16494 ≈ 10^-4965 using the quadruple-precision floating-point format. Why does GCC reject anything lower than 10^-4949? I'm looking for an explanation of the different things that could be going on underneath which determine the limit to be 10^-4949 rather than 10^-4965.
#include <stdio.h>
void prt_ldbl(long double decker) {
unsigned char * desmond = (unsigned char *) & decker;
int i;
for (i = 0; i < sizeof (decker); i++) {
printf ("%02X ", desmond[i]);
}
printf ("\n");
}
int main()
{
long double x = 1e-4955L;
prt_ldbl(x);
}
I'm using GNU GCC version 4.8.1 online - not sure which architecture it's running on (which I realize may be the culprit). Please feel free to post your findings from different architectures.
Your long double type may not be(*) quadruple-precision. It may simply be the 387 80-bit extended-double format. This format has the same number of bits for the exponent as quad-precision, but many fewer significand bits, so the minimum value that would be representable in it sounds about right (2^-16445).
(*) Your long double is likely not to be quad-precision, because no processor implements quad-precision in hardware. The compiler can always implement quad-precision in software, but it is much more likely to map long double to double-precision, to extended-double or to double-double.
The smallest 80-bit long double is around 2^(-16382 - 63) ≈ 10^-4951, not 2^-16494. So the compiler is entirely correct; your number is smaller than the smallest subnormal.
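A quick way to check what long double actually is on a given platform (a small sketch; the values in the comments are what you would expect for the x87 80-bit extended format, not guaranteed on every target):
#include <cfloat>
#include <cstdio>
#include <limits>
int main()
{
    std::printf("significand digits: %d\n", LDBL_MANT_DIG);   // 64 for x87 extended, 113 for quad
    std::printf("minimum exponent:   %d\n", LDBL_MIN_EXP);    // -16381 for both formats
    std::printf("smallest subnormal: %Lg\n",
                std::numeric_limits<long double>::denorm_min()); // ~3.6e-4951 for x87 extended
    return 0;
}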

How can I use mach_absolute_time without overflowing?

On Darwin, the POSIX standard clock_gettime(CLOCK_MONOTONIC) timer is not available. Instead, the highest resolution monotonic timer is obtained through the mach_absolute_time function from mach/mach_time.h.
The result returned may be an unadjusted tick count from the processor, in which case the time units could be a strange multiple. For example, on a CPU with a 33MHz tick count, Darwin returns 1000000000/33333335 as the exact units of the returned result (ie, multiply the mach_absolute_time by that fraction to obtain a nanosecond value).
We usually wish to convert from exact ticks to "standard" (decimal) units, but unfortunately, naively multiplying the absolute time by the fraction will overflow even in 64-bit arithmetic. This is an error that Apple's sole piece of documentation on mach_absolute_time falls into (Technical Q&A QA1398).
How should I write a function that correctly uses mach_absolute_time?
Note that this is not a theoretical problem: the sample code in QA1398 completely fails to work on PowerPC-based Macs. On Intel Macs, mach_timebase_info always returns 1/1 as the scaling factor because the CPU's raw tick count is unreliable (dynamic speed-stepping), so the API does the scaling for you. On PowerPC Macs, mach_timebase_info returns either 1000000000/33333335 or 1000000000/25000000, so Apple's provided code definitely overflows every few minutes. Oops.
Most-precise (best) answer
Perform the arithmetic at 128-bit precision to avoid the overflow!
#include <mach/mach_time.h> // mach_absolute_time, mach_timebase_info
#include <cassert>
#include <cstdint>
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
}
uint64_t scale(uint64_t i) {
return scaleHighPrecision(i - bias, tb.numer, tb.denom);
}
static uint64_t scaleHighPrecision(uint64_t i, uint32_t numer,
uint32_t denom) {
U64 high = (i >> 32) * numer;
U64 low = (i & 0xffffffffull) * numer / denom;
U64 highRem = ((high % denom) << 32) / denom;
high /= denom;
return (high << 32) + highRem + low;
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return data.scale(now);
}
A simple low-resolution answer
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process. The clock may run up to 0.1% faster or slower
// than the "exact" tick count.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
if (tb.denom > 1024) {
double frac = (double)tb.numer/tb.denom;
tb.denom = 1024;
tb.numer = tb.denom * frac + 0.5;
assert(tb.numer > 0);
}
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return (now - data.bias) * data.tb.numer / data.tb.denom;
}
A fiddly solution using low-precision arithmetic but using continued fractions to avoid loss of accuracy
// This function returns the rational number inside the given interval with
// the smallest denominator (and smallest numerator breaks ties; correctness
// proof neglects floating-point errors).
static mach_timebase_info_data_t bestFrac(double a, double b) {
if (floor(a) < floor(b))
{ mach_timebase_info_data_t rv = {(int)ceil(a), 1}; return rv; }
double m = floor(a);
mach_timebase_info_data_t next = bestFrac(1/(b-m), 1/(a-m));
mach_timebase_info_data_t rv = {(int)m*next.numer + next.denom, next.numer};
return rv;
}
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process. The clock may run up to 0.1% faster or slower
// than the "exact" tick count. However, although the bound on the error is
// the same as for the pragmatic answer, the error is actually minimized over
// the given accuracy bound.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
double frac = (double)tb.numer/tb.denom;
uint64_t spanTarget = 315360000000000000llu; // 10 years
if (getExpressibleSpan(tb.numer, tb.denom) >= spanTarget)
return;
for (double errorTarget = 1/1024.0; errorTarget > 0.000001;) {
mach_timebase_info_data_t newFrac =
bestFrac((1-errorTarget)*frac, (1+errorTarget)*frac);
if (getExpressibleSpan(newFrac.numer, newFrac.denom) < spanTarget)
break;
tb = newFrac;
errorTarget = fabs((double)tb.numer/tb.denom - frac) / frac / 8;
}
assert(getExpressibleSpan(tb.numer, tb.denom) >= spanTarget);
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return (now - data.bias) * data.tb.numer / data.tb.denom;
}
The derivation
We aim to reduce the fraction returned by mach_timebase_info to one that is essentially the same, but with a small denominator. The size of the timespan that we can handle is limited only by the size of the denominator, not the numerator of the fraction we shall multiply by:
uint64_t getExpressibleSpan(uint32_t numer, uint32_t denom) {
// This is just less than the smallest thing we can multiply numer by without
// overflowing. ceilLog2(numer) = 64 - number of leading zeros of numer
uint64_t maxDiffWithoutOverflow = ((uint64_t)1 << (64 - ceilLog2(numer))) - 1;
return maxDiffWithoutOverflow * numer / denom;
}
If denom=33333335 as returned by mach_timebase_info, we can handle differences of up to 18 seconds only before the multiplication by numer overflows. As getExpressibleSpan shows, by calculating a rough lower bound for this, the size of numer doesn't matter: halving numer doubles maxDiffWithoutOverflow. The only goal therefore is to produce a fraction close to numer/denom that has a smaller denominator. The simplest method to do this is using continued fractions.
The continued fractions method is rather handy. bestFrac clearly works correctly if the provided interval contains an integer: it returns the least integer in the interval, over a denominator of 1. Otherwise, it calls itself recursively with a strictly larger interval and returns m+1/next. The final result is a continued fraction that can be shown by induction to have the correct property: it is optimal, the fraction inside the given interval with the least denominator.
Finally, we reduce the fraction Darwin passes us to a smaller one to use when rescaling the mach_absolute_time to nanoseconds. We may introduce an error here because we can't reduce the fraction in general without losing accuracy. We set ourselves the target of 0.1% error, and check that we've reduced the fraction enough for common timespans (up to ten years) to be handled correctly.
Arguably the method is over-complicated for what it does, but it handles correctly anything the API can throw at it, and the resulting code is still short and extremely fast (bestFrac typically recurses only three or four iterations deep before returning a denominator less than 1000 for random intervals [a,a*1.002]).
You're worrying about overflow when multiplying/dividing with values from the mach_timebase_info struct, which is used for conversion to nanoseconds. So, while it may not fit your exact needs, there are easier ways to get a count in nanoseconds or seconds.
All solutions below are using mach_absolute_time internally (and NOT the wall clock).
Use double instead of uint64_t
(supported in Objective-C and Swift)
double tbInSeconds = 0;
mach_timebase_info_data_t tb;
kern_return_t kError = mach_timebase_info(&tb);
if (kError == 0) {
tbInSeconds = 1e-9 * (double)tb.numer / (double)tb.denom;
}
(remove the 1e-9 if you want nanoseconds)
Usage:
uint64_t start = mach_absolute_time();
// do something
uint64_t stop = mach_absolute_time();
double durationInSeconds = tbInSeconds * (stop - start);
Use ProcessInfo.processInfo.systemUptime
(supported in Objective-C and Swift)
It does the job in double seconds directly:
CFTimeInterval start = NSProcessInfo.processInfo.systemUptime;
// do something
CFTimeInterval stop = NSProcessInfo.processInfo.systemUptime;
NSTimeInterval durationInSeconds = stop - start;
For reference, the source code of systemUptime
just does something similar to the previous solution:
struct mach_timebase_info info;
mach_timebase_info(&info);
__CFTSRRate = (1.0E9 / (double)info.numer) * (double)info.denom;
__CF1_TSRRate = 1.0 / __CFTSRRate;
uint64_t tsr = mach_absolute_time();
return (CFTimeInterval)((double)tsr * __CF1_TSRRate);
Use QuartzCore.CACurrentMediaTime()
(supported in Objective-C and Swift)
Same as systemUptime, but without being open source.
Use Dispatch.DispatchTime.now()
(supported in Swift only)
Another wrapper around mach_absolute_time(). Base precision is nanoseconds, backed with UInt64.
let start = DispatchTime.now()
// do something
let stop = DispatchTime.now()
let durationInSeconds = Double(stop.uptimeNanoseconds - start.uptimeNanoseconds) / 1_000_000_000
For reference, the source code of DispatchTime.now() shows that it basically just returns DispatchTime(rawValue: mach_absolute_time()). And the calculation for uptimeNanoseconds is:
(result, overflow) = result.multipliedReportingOverflow(by: UInt64(DispatchTime.timebaseInfo.numer))
result = overflow ? UInt64.max : result / UInt64(DispatchTime.timebaseInfo.denom)
So it just discards the result if the multiplication cannot be stored in a UInt64.
If mach_absolute_time() ever wraps back to 0, reset your time calculations whenever the new value is less than the last one you checked.
That's the problem: Apple doesn't document what happens when the uint64 reaches all ones (binary).
Read it: https://developer.apple.com/documentation/kernel/1462446-mach_absolute_time

Floating Point Divider Hardware Implementation Details

I am trying to implement a 32-bit floating point hardware divider in hardware and I am wondering if I can get any suggestions as to some tradeoffs between different algorithms?
My floating point unit currently supports multiplication and addition/subtraction, but I am not going to switch it to a fused multiply-add (FMA) floating point architecture, since this is an embedded platform where I am trying to minimize area usage.
A very long time ago I came across this neat and easy-to-implement float/fixed-point division algorithm, used in military FPUs of that time period:
the input must be unsigned and shifted so that x < y and both are in the range <0.5 ; 1>
don't forget to store the difference of shifts sh = shx - shy and the original signs
find f (by iterating) so that y*f -> 1 .... after that x*f -> x/y, which is the division result
shift x*f back by sh and restore the result sign (sig = sigx*sigy)
the x*f can be computed easily like this:
z=1-y
(x*f)=(x/y)=x*(1+z)*(1+z^2)*(1+z^4)*(1+z^8)*(1+z^16)...(1+z^(2^n))
where
n = log2(number of fractional bits for fixed point, or mantissa bit size for floating point)
You can also stop as soon as z^(2^n) becomes zero on fixed-bit-width data types.
[Edit2] I had a bit of time & mood for this, so here is a 32-bit IEEE 754 C++ implementation.
I removed the old (bignum) examples to avoid confusion for future readers (they are still accessible in the edit history if needed).
//---------------------------------------------------------------------------
typedef unsigned int DWORD; // 32-bit unsigned integer (so the snippet is self-contained)
// IEEE 754 single masks
const DWORD _f32_sig =0x80000000; // sign
const DWORD _f32_exp =0x7F800000; // exponent
const DWORD _f32_exp_sig=0x40000000; // exponent sign
const DWORD _f32_exp_bia=0x3F800000; // exponent bias
const DWORD _f32_exp_lsb=0x00800000; // exponent LSB
const DWORD _f32_exp_pos= 23; // exponent LSB bit position
const DWORD _f32_man =0x007FFFFF; // mantissa
const DWORD _f32_man_msb=0x00400000; // mantissa MSB
const DWORD _f32_man_bits= 23; // mantissa bits
//---------------------------------------------------------------------------
float f32_div(float x,float y)
{
union _f32 // float bits access
{
float f; // 32bit floating point
DWORD u; // 32 bit uint
};
_f32 xx,yy,zz; int sh; DWORD zsig; float z;
// result signum abs value
xx.f=x; zsig =xx.u&_f32_sig; xx.u&=(0xFFFFFFFF^_f32_sig);
yy.f=y; zsig^=yy.u&_f32_sig; yy.u&=(0xFFFFFFFF^_f32_sig);
// initial exponent difference sh and normalize exponents to speed up shift in range
sh =0;
sh-=((xx.u&_f32_exp)>>_f32_exp_pos)-(_f32_exp_bia>>_f32_exp_pos); xx.u&=(0xFFFFFFFF^_f32_exp); xx.u|=_f32_exp_bia;
sh+=((yy.u&_f32_exp)>>_f32_exp_pos)-(_f32_exp_bia>>_f32_exp_pos); yy.u&=(0xFFFFFFFF^_f32_exp); yy.u|=_f32_exp_bia;
// shift input in range
while (xx.f> 1.0f) { xx.f*=0.5f; sh--; }
while (xx.f< 0.5f) { xx.f*=2.0f; sh++; }
while (yy.f> 1.0f) { yy.f*=0.5f; sh++; }
while (yy.f< 0.5f) { yy.f*=2.0f; sh--; }
while (xx.f<=yy.f) { yy.f*=0.5f; sh++; }
// divider block
z=(1.0f-yy.f);
zz.f=xx.f*(1.0f+z);
for (;;)
{
z*=z; if (z==0.0f) break;
zz.f*=(1.0f+z);
}
// shift result back
for (;sh>0;) { sh--; zz.f*=0.5f; }
for (;sh<0;) { sh++; zz.f*=2.0f; }
// set signum
zz.u&=(0xFFFFFFFF^_f32_sig);
zz.u|=zsig;
return zz.f;
}
//---------------------------------------------------------------------------
I wanted to keep it simple, so it is not optimized yet. You can, for example, replace all the *=0.5 and *=2.0 with exponent increments/decrements (see the sketch below) ... If you compare with FPU results of the float operator /, this will be a bit less precise, because most FPUs compute in an 80-bit internal format and this implementation uses only 32 bits.
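A hedged sketch of that optimization (my own illustration, not part of the code above): multiplying a normalized, non-over/underflowing float by 2^k is just an addition on the biased exponent field, so the two shift-back loops could become a single call like zz.f = mul_pow2(zz.f, -sh);
#include <cstdint>
#include <cstring>
static float mul_pow2(float v, int k) // v * 2^k; v normalized, no overflow/underflow handling
{
    std::uint32_t u;
    std::memcpy(&u, &v, sizeof u);                        // bit access without a union
    int e = (int)((u >> 23) & 0xFF) + k;                  // biased exponent plus k
    u = (u & ~(0xFFu << 23)) | ((std::uint32_t)e << 23);
    std::memcpy(&v, &u, sizeof v);
    return v;
}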
As you can see, I am using only +, -, * from the FPU. The code can be sped up by using fast sqr algorithms like
Fast bignum square computation
especially if you want to use big bit widths ...
Do not forget to implement normalization and/or overflow/underflow correction.
