How to calculate modulus of 64-bit unsigned integer?

Note: This question is different from Fastest way to calculate a 128-bit integer modulo a 64-bit integer.
Here's a C# fiddle:
https://dotnetfiddle.net/QbLowb
Given the pseudocode:
UInt64 a = 9228496132430806238;
UInt32 d = 585741;
How do I calculate
UInt32 r = a % d?
The catch, of course, is that I am not using a compiler that supports the UInt64 data type.1 But I do have access to the Windows ULARGE_INTEGER union:
typedef union _ULARGE_INTEGER {
    struct {
        DWORD LowPart;
        DWORD HighPart;
    };
    ULONGLONG QuadPart;
} ULARGE_INTEGER;
Which really means that I can turn my code above into:
//9228496132430806238 = 0x80123456789ABCDE
UInt32 a = 0x80123456; //high part
UInt32 b = 0x789ABCDE; //low part
UInt32 d = 585741; //divisor
How to do it
But now comes the question of how to do the actual calculation. I can start with the pencil-and-paper long division:
________________________
585741 ) 0x80123456 0x789ABCDE
To make it simpler, we can work in variables:
Now we are working entirely with 32-bit unsigned types, which my compiler does support.
u1 = a / d; //integer truncation math
v1 = a % d; //modulus
But now I've brought myself to a standstill, because now I have to calculate:
v1||b / d
where || denotes concatenating the two 32-bit values into a 64-bit one. In other words, I have to perform division of a 64-bit value, which is what I was unable to perform in the first place!
This must be a solved problem already. But the only questions I can find on Stack Overflow are people trying to calculate:
a^b mod n
or other cryptographically large multi-precision operations, or approximate floating point.
Bonus Reading
Microsoft Research: Division and Modulus for Computer Scientists
https://stackoverflow.com/questions/36684771/calculating-large-mods-by-hand
Fastest way to calculate a 128-bit integer modulo a 64-bit integer (unrelated question)
1But it does support Int64, though I don't think that helps me
Working with Int64 support
I was hoping for a generic solution for performing modulus on a ULARGE_INTEGER (and even a LARGE_INTEGER) in a compiler without native 64-bit support. That would be the correct, good, perfect, and ideal answer, which other people would be able to use when they need it.
But there is also the reality of the problem I have, and it can lead to an answer that is generally not useful to anyone else:
cheating by calling one of the Win32 large integer functions (although there is none for modulus)
cheating by using 64-bit support for signed integers
I can check if a is positive. If it is, I know my compiler's built-in support for Int64 will handle:
UInt32 r = a % d; //for a >= 0
Then there's how to handle the other case: a is negative
UInt32 ModU64(ULARGE_INTEGER a, UInt32 d)
{
//Hack: Our compiler does support Int64, just not UInt64.
//Use that Int64 support if the high bit in a isn't set.
Int64 sa = (Int64)a.QuadPart;
if (sa >= 0)
return (sa % d);
//sa is negative. What to do...what to do.
//If we want to continue to work with 64-bit integers,
//we could now treat our number as two 64-bit signed values:
// a == (aHigh + aLow)
// aHigh = 0x8000000000000000
// aLow = a & 0x7fffffffffffffff (the low 63 bits)
//
// a mod d = (aHigh + aLow) % d
// = ((aHigh % d) + (aLow % d)) % d //<--Is this even true!?
Int64 aLow = sa & 0x7fffffffffffffff;
Int64 aHigh = 0x8000000000000000;
UInt32 rLow = aLow % d; //remainder from low portion
UInt32 rHigh = aHigh % d; //this doesn't work, because it's "-1 mod d"
Int64 r = (rHigh + rLow) % d;
return (UInt32)r;
}
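For what it's worth, that line of thinking can be completed; a minimal sketch, assuming the compiler's signed Int64 division and modulus behave as in C (an illustration of the hack, separate from the answer below):
UInt32 ModU64_signsplit(ULARGE_INTEGER a, UInt32 d)
{
    //If the top bit is clear, the built-in signed 64-bit modulus just works.
    Int64 sa = (Int64)a.QuadPart;
    if (sa >= 0)
        return (UInt32)(sa % d);
    //Otherwise a == 2^63 + aLow with aLow = a & 0x7FFFFFFFFFFFFFFF, and
    //  a mod d = ((2^63 mod d) + (aLow mod d)) mod d
    //2^63 itself does not fit in an Int64, but 2^63 - 1 does, so
    //  2^63 mod d = (((2^63 - 1) mod d) + 1) mod d
    Int64 aLow  = sa & 0x7FFFFFFFFFFFFFFF;
    Int64 rHigh = (0x7FFFFFFFFFFFFFFF % (Int64)d + 1) % (Int64)d;
    Int64 rLow  = aLow % (Int64)d;
    return (UInt32)((rHigh + rLow) % (Int64)d); //sum < 2d, so no overflow
}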
Answer
It took a while, but I finally got an answer. I would post it as an answer, but people mistakenly decided that my unique question was an exact duplicate.
UInt32 ModU64(ULARGE_INTEGER a, UInt32 d)
{
    //a = Ah*2^32 + Al, so a % d = ((Ah % d)*(2^32 % d) + (Al % d)) % d.
    //2^32 % d is computed as ((0xFFFFFFFF % d) + 1) % d to stay within 32 bits.
    //The product below can exceed 32 bits, so it is done with the compiler's
    //signed Int64 support; for d = 585741 it fits with plenty of room to spare.
    UInt32 Al = a.LowPart;
    UInt32 Ah = a.HighPart;
    UInt32 b32 = (0xFFFFFFFF % d + 1) % d;              //2^32 mod d
    Int64 remainder = ((Int64)(Ah % d) * b32 + (Al % d)) % d;
    return (UInt32)remainder;
}
Fiddle
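The split used there is easy to sanity-check on a compiler that does have native 64-bit support; a hypothetical self-test (the helper name mod_by_halves is made up for this check):
#include <cassert>
#include <cstdint>

static uint32_t mod_by_halves(uint64_t a, uint32_t d)
{
    uint32_t Al = (uint32_t)(a & 0xFFFFFFFFu);
    uint32_t Ah = (uint32_t)(a >> 32);
    uint32_t b32 = (0xFFFFFFFFu % d + 1) % d;           // 2^32 mod d
    uint64_t t = (uint64_t)(Ah % d) * b32 + (Al % d);   // < d^2, fits in 64 bits
    return (uint32_t)(t % d);
}

int main()
{
    uint64_t a = 9228496132430806238ULL;                // 0x80123456789ABCDE
    uint32_t d = 585741;
    assert(mod_by_halves(a, d) == a % d);               // compare against native %
    return 0;
}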

I just updated my ALU32 class code in this related QA:
Cant make value propagate through carry
CPU-assembly-independent code for mul/div was requested there. The divider solves all your problems; however, it uses binary long division, so it is a bit slower than stacking up 32-bit mul/mod/div operations. Here is the relevant part of the code:
void ALU32::div(DWORD &c,DWORD &d,DWORD ah,DWORD al,DWORD b)
{
DWORD ch,cl,bh,bl,h,l,mh,ml;
int e;
// edge cases
if (!b ){ c=0xFFFFFFFF; d=0xFFFFFFFF; cy=1; return; }
if (!ah){ c=al/b; d=al%b; cy=0; return; }
// align a,b for binary long division m is the shifted mask of b lsb
for (bl=b,bh=0,mh=0,ml=1;bh<0x80000000;)
{
e=0; if (ah>bh) e=+1; // e = cmp a,b {-1,0,+1}
else if (ah<bh) e=-1;
else if (al>bl) e=+1;
else if (al<bl) e=-1;
if (e<=0) break; // a<=b ?
shl(bl); rcl(bh); // b<<=1
shl(ml); rcl(mh); // m<<=1
}
// binary long division
for (ch=0,cl=0;;)
{
sub(l,al,bl); // a-b
sbc(h,ah,bh);
if (cy) // a<b ?
{
if (ml==1) break;
shr(mh); rcr(ml); // m>>=1
shr(bh); rcr(bl); // b>>=1
continue;
}
al=l; ah=h; // a>=b ?
add(cl,cl,ml); // c+=m
adc(ch,ch,mh);
}
cy=0; c=cl; d=al;
if ((ch)||(ah)) cy=1; // overflow
}
See the linked QA for a description of the class and the subfunctions used. The idea behind a/b is simple:
definition
let's assume we have 64/64-bit division (the modulus falls out as a partial result) and want to use 32-bit arithmetic, so:
(ah,al) / (bh,bl) = (ch,cl)
each 64-bit QWORD is represented as high and low 32-bit DWORDs.
align a,b
exactly as when computing division on paper, we must align b with a, so find sh such that:
(bh,bl)<<sh <= (ah,al)
(bh,bl)<<(sh+1) > (ah,al)
and compute m so
(mh,ml) = 1<<sh
beware that once bh>=0x80000000 you must stop the shifting or we would overflow ...
divide
set the result c = 0 and then simply subtract b from a while a>=b. For each subtraction, add m to c. Once a<b, shift both b,m right to align again. Stop once m==0 (or a==0).
result
c will hold the 64-bit result of the division, so use cl; similarly a holds the remainder, so use al as your modulus result. You can check whether ch,ah are zero; if not, an overflow occurred (the result is bigger than 32 bits). The same goes for edge cases like division by zero...
Now, as you want 64bit/32bit, simply set bh=0 ... To do this I needed 64-bit operations (+,-,<<,>>), which I built by stacking up 32-bit operations with carry (that is why my ALU32 class was created in the first place); for more info see the link above.
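For the original 64 % 32 problem this reduces to a single call; a hypothetical usage sketch (assuming the ALU32 class from the linked QA, with the members named as in the snippet above):
ALU32 alu;
DWORD q, rem;                      // 32-bit quotient and remainder outputs
DWORD ah = 0x80123456;             // high part of 0x80123456789ABCDE
DWORD al = 0x789ABCDE;             // low part
alu.div(q, rem, ah, al, 585741);   // q = low 32 bits of the quotient, rem = a % 585741
// alu.cy will be set here because the full quotient does not fit in 32 bits,
// but the remainder rem is still exactly what the question asks for.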

Related

Bit twiddle help: Expanding bits to follow a given bitmask

I'm interested in a fast method for "expanding bits," which can be defined as the following:
Let B be a binary number with n bits, i.e. B ∈ {0,1}^n
Let P be the positions of all 1/true bits in B, i.e. ((B >> p[i]) & 1) == 1, with |P| = k
For another given number A ∈ {0,1}^k, let Ap be the bit-expanded form of A given B, such that bit j of A is placed at bit position p[j] of Ap.
The result of the "bit expansion" is Ap.
A couple examples:
Given B: 0010 1110, A: 0110, then Ap should be 0000 1100
Given B: 1001 1001, A: 1101, then Ap should be 1001 0001
Following is a straightforward algorithm, but I can't shake the feeling that there's a faster/easier way to do this.
unsigned int expand_bits(unsigned int A, unsigned int B, int n) {
int k = popcount(B); // e.g. __popc() in CUDA; there are good portable methods for this too
unsigned int Ap = 0;
int j = k-1;
// Starting at the most significant bit,
for (int i = n - 1; i >= 0; --i) {
Ap <<= 1;
// if B is 1, add the value at A[j] to Ap, decrement j.
if (B & (1 << i)) {
Ap += (A >> j--) & 1;
}
}
return Ap;
}
The question appears to be asking for a CUDA emulation of the BMI2 instruction PDEP, which takes a source operand a, and deposits its bits based on the positions of the 1-bits of a mask b. There is no hardware support for an identical, or a similar, operation on currently shipping GPUs; that is, up to and including the Maxwell architecture.
I am assuming, based on the two examples given, that the mask b in general is sparse, and that we can minimize work by only iterating over the 1-bits of b. This could cause divergent branches on the GPU, but the exact trade-off in performance is unknown without knowledge of a specific use case. For now, I am assuming that the exploitation of sparsity in the mask b has a stronger positive influence on performance compared to the negative impact of divergence.
In the emulation code below, I have reduced the use of potentially "expensive" shift operations, instead relying mostly on simple ALU instructions. On various GPUs, shift instructions are executed with lower throughput than simple integer arithmetic. I have retained a single shift, off the critical path through the code, to avoid becoming execution limited by the arithmetic units. If desired, the expression 1U << i can be replaced by addition: introduce a variable m that is initialized to 1 before the loop and doubled each time through the loop.
The basic idea is to isolate each 1-bit of mask b in turn (starting at the least significant end), AND it with the value of the i-th bit of a, and incorporate the result into the expanded destination. After a 1-bit from b has been used, we remove it from the mask, and iterate until the mask becomes zero.
In order to avoid shifting the i-th bit of a into place, we simply isolate it and then replicate its value to all more significant bits by simple negation, taking advantage of the two's complement representation of integers.
/* Emulate PDEP: deposit the bits of 'a' (starting with the least significant
bit) at the positions indicated by the set bits of the mask stored in 'b'.
*/
__device__ unsigned int my_pdep (unsigned int a, unsigned int b)
{
unsigned int l, s, r = 0;
int i;
for (i = 0; b; i++) { // iterate over 1-bits in mask, until mask becomes 0
l = b & (0 - b); // extract mask's least significant 1-bit
b = b ^ l; // clear mask's least significant 1-bit
s = 0 - (a & (1U << i)); // spread i-th bit of 'a' to more signif. bits
r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
}
return r;
}
The variant without any shift operations alluded to above looks as follows:
/* Emulate PDEP: deposit the bits of 'a' (starting with the least significant
bit) at the positions indicated by the set bits of the mask stored in 'b'.
*/
__device__ unsigned int my_pdep (unsigned int a, unsigned int b)
{
unsigned int l, s, r = 0, m = 1;
while (b) { // iterate over 1-bits in mask, until mask becomes 0
l = b & (0 - b); // extract mask's least significant 1-bit
b = b ^ l; // clear mask's least significant 1-bit
s = 0 - (a & m); // spread i-th bit of 'a' to more significant bits
r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
m = m + m; // mask for next bit of 'a'
}
return r;
}
In comments below, @Evgeny Kluev pointed to a shift-free PDEP emulation at the chessprogramming website that looks potentially faster than either of my two implementations above; it seems worth a try.
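Either variant is easy to spot-check on the host against the two examples from the question; a hypothetical test (the shift-free version copied without the __device__ qualifier):
#include <cassert>

static unsigned int pdep_host(unsigned int a, unsigned int b)
{
    unsigned int l, s, r = 0, m = 1;
    while (b) {
        l = b & (0 - b);   // extract mask's least significant 1-bit
        b = b ^ l;         // clear it
        s = 0 - (a & m);   // spread the current bit of 'a' to all higher bits
        r = r | (l & s);   // deposit it at the mask bit's position
        m = m + m;
    }
    return r;
}

int main()
{
    assert(pdep_host(0x6, 0x2E) == 0x0C);   // B: 0010 1110, A: 0110 -> 0000 1100
    assert(pdep_host(0xD, 0x99) == 0x91);   // B: 1001 1001, A: 1101 -> 1001 0001
    return 0;
}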

How can I use mach_absolute_time without overflowing?

On Darwin, the POSIX standard clock_gettime(CLOCK_MONOTONIC) timer is not available. Instead, the highest resolution monotonic timer is obtained through the mach_absolute_time function from mach/mach_time.h.
The result returned may be an unadjusted tick count from the processor, in which case the time units could be a strange multiple. For example, on a CPU with a 33MHz tick count, Darwin returns 1000000000/33333335 as the exact units of the returned result (ie, multiply the mach_absolute_time by that fraction to obtain a nanosecond value).
We usually wish to convert from exact ticks to "standard" (decimal) units, but unfortunately, naively multiplying the absolute time by the fraction will overflow even in 64-bit arithmetic. This is an error that Apple's sole piece of documentation on mach_absolute_time falls into (Technical Q&A QA1398).1
How should I write a function that correctly uses mach_absolute_time?
Note that this is not a theoretical problem: the sample code in QA1398 completely fails to work on PowerPC-based Macs. On Intel Macs, mach_timebase_info always returns 1/1 as the scaling factor because the CPU's raw tick count is unreliable (dynamic speed-stepping), so the API does the scaling for you. On PowerPC Macs, mach_timebase_info returns either 1000000000/33333335 or 1000000000/25000000, so Apple's provided code definitely overflows every few minutes. Oops.
Most-precise (best) answer
Perform the arithmetic at 128-bit precision to avoid the overflow!
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
}
uint64_t scale(uint64_t i) {
return scaleHighPrecision(i - bias, tb.numer, tb.denom);
}
static uint64_t scaleHighPrecision(uint64_t i, uint32_t numer,
uint32_t denom) {
uint64_t high = (i >> 32) * numer;                   // upper half times numer
uint64_t low = (i & 0xffffffffull) * numer / denom;  // lower half fully scaled
uint64_t highRem = ((high % denom) << 32) / denom;   // remainder of the upper half, rescaled
high /= denom;
return (high << 32) + highRem + low;
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return data.scale(now);
}
A simple low-resolution answer
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process. The clock may run up to 0.1% faster or slower
// than the "exact" tick count.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
if (tb.denom > 1024) {
double frac = (double)tb.numer/tb.denom;
tb.denom = 1024;
tb.numer = tb.denom * frac + 0.5;
assert(tb.numer > 0);
}
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return (now - data.bias) * data.tb.numer / data.tb.denom;
}
A fiddly solution using low-precision arithmetic but using continued fractions to avoid loss of accuracy
// This function returns the rational number inside the given interval with
// the smallest denominator (and smallest numerator breaks ties; correctness
// proof neglects floating-point errors).
static mach_timebase_info_data_t bestFrac(double a, double b) {
if (floor(a) < floor(b))
{ mach_timebase_info_data_t rv = {(int)ceil(a), 1}; return rv; }
double m = floor(a);
mach_timebase_info_data_t next = bestFrac(1/(b-m), 1/(a-m));
mach_timebase_info_data_t rv = {(int)m*next.numer + next.denom, next.numer};
return rv;
}
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process. The clock may run up to 0.1% faster or slower
// than the "exact" tick count. However, although the bound on the error is
// the same as for the pragmatic answer, the error is actually minimized over
// the given accuracy bound.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
double frac = (double)tb.numer/tb.denom;
uint64_t spanTarget = 315360000000000000llu; // 10 years
if (getExpressibleSpan(tb.numer, tb.denom) >= spanTarget)
return;
for (double errorTarget = 1/1024.0; errorTarget > 0.000001;) {
mach_timebase_info_data_t newFrac =
bestFrac((1-errorTarget)*frac, (1+errorTarget)*frac);
if (getExpressibleSpan(newFrac.numer, newFrac.denom) < spanTarget)
break;
tb = newFrac;
errorTarget = fabs((double)tb.numer/tb.denom - frac) / frac / 8;
}
assert(getExpressibleSpan(tb.numer, tb.denom) >= spanTarget);
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return (now - data.bias) * data.tb.numer / data.tb.denom;
}
The derivation
We aim to reduce the fraction returned by mach_timebase_info to one that is essentially the same, but with a small denominator. The size of the timespan that we can handle is limited only by the size of the denominator, not the numerator of the fraction we shall multiply by:
uint64_t getExpressibleSpan(uint32_t numer, uint32_t denom) {
// This is just less than the smallest thing we can multiply numer by without
// overflowing. ceilLog2(numer) = 64 - number of leading zeros of numer
uint64_t maxDiffWithoutOverflow = ((uint64_t)1 << (64 - ceilLog2(numer))) - 1;
return maxDiffWithoutOverflow * numer / denom;
}
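getExpressibleSpan relies on a ceilLog2 helper that the answer doesn't spell out; one possible definition, matching the comment above (a sketch using the GCC/Clang intrinsic, valid for nonzero arguments):
#include <cstdint>

// 64 minus the number of leading zero bits, per the comment in getExpressibleSpan.
// For exact powers of two this is one larger than the true ceiling, which only
// makes the overflow bound more conservative.
static int ceilLog2(uint64_t x) {
    return 64 - __builtin_clzll(x);
}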
If denom=33333335 as returned by mach_timebase_info, we can handle differences of only around nine minutes before the multiplication by numer overflows. As getExpressibleSpan shows, by calculating a rough lower bound for this, the size of numer doesn't matter: halving numer doubles maxDiffWithoutOverflow. The only goal therefore is to produce a fraction close to numer/denom that has a smaller denominator. The simplest method to do this is using continued fractions.
The continued fractions method is rather handy. bestFrac clearly works correctly if the provided interval contains an integer: it returns the least integer in the interval, over a denominator of 1. Otherwise, it calls itself recursively with a strictly larger interval and returns m+1/next. The final result is a continued fraction that can be shown by induction to have the correct property: it's optimal, the fraction inside the given interval with the least denominator.
Finally, we reduce the fraction Darwin passes us to a smaller one to use when rescaling the mach_absolute_time to nanoseconds. We may introduce an error here because we can't reduce the fraction in general without losing accuracy. We set ourselves the target of 0.1% error, and check that we've reduced the fraction enough for common timespans (up to ten years) to be handled correctly. (For example, the PowerPC fraction 1000000000/33333335 is within 0.1% of 30, so it should reduce to 30/1 on the first pass, which can express spans of centuries.)
Arguably the method is over-complicated for what it does, but it handles correctly anything the API can throw at it, and the resulting code is still short and extremely fast (bestFrac typically recurses only three or four iterations deep before returning a denominator less than 1000 for random intervals [a,a*1.002]).
You're worrying about overflow when multiplying/dividing with values from the mach_timebase_info struct, which is used for conversion to nanoseconds. So, while it may not fit your exact needs, there are easier ways to get a count in nanoseconds or seconds.
All solutions below are using mach_absolute_time internally (and NOT the wall clock).
Use double instead of uint64_t
(supported in Objective-C and Swift)
double tbInSeconds = 0;
mach_timebase_info_data_t tb;
kern_return_t kError = mach_timebase_info(&tb);
if (kError == 0) {
tbInSeconds = 1e-9 * (double)tb.numer / (double)tb.denom;
}
(remove the 1e-9 if you want nanoseconds)
Usage:
uint64_t start = mach_absolute_time();
// do something
uint64_t stop = mach_absolute_time();
double durationInSeconds = tbInSeconds * (stop - start);
Use ProcessInfo.processInfo.systemUptime
(supported in Objective-C and Swift)
It does the job in double seconds directly:
CFTimeInterval start = NSProcessInfo.processInfo.systemUptime;
// do something
CFTimeInterval stop = NSProcessInfo.processInfo.systemUptime;
NSTimeInterval durationInSeconds = stop - start;
For reference, the source code of systemUptime just does something similar to the previous solution:
struct mach_timebase_info info;
mach_timebase_info(&info);
__CFTSRRate = (1.0E9 / (double)info.numer) * (double)info.denom;
__CF1_TSRRate = 1.0 / __CFTSRRate;
uint64_t tsr = mach_absolute_time();
return (CFTimeInterval)((double)tsr * __CF1_TSRRate);
Use QuartzCore.CACurrentMediaTime()
(supported in Objective-C and Swift)
Same as systemUptime, but without being open source.
Use Dispatch.DispatchTime.now()
(supported in Swift only)
Another wrapper around mach_absolute_time(). Base precision is nanoseconds, backed with UInt64.
let start = DispatchTime.now()
// do something
let stop = DispatchTime.now()
let durationInSeconds = Double(stop.uptimeNanoseconds - start.uptimeNanoseconds) / 1_000_000_000
For reference, the source code of DispatchTime.now() shows that it basically just returns DispatchTime(rawValue: mach_absolute_time()). And the calculation for uptimeNanoseconds is:
(result, overflow) = result.multipliedReportingOverflow(by: UInt64(DispatchTime.timebaseInfo.numer))
result = overflow ? UInt64.max : result / UInt64(DispatchTime.timebaseInfo.denom)
So it just saturates to UInt64.max if the multiplication can't be stored in a UInt64.
If mach_absolute_time() ever wraps the uint64 back to 0, then reset your time calculations whenever a new reading is less than the last one. That's the problem: they don't document what happens when the uint64 reaches all ones (binary).
Read it: https://developer.apple.com/documentation/kernel/1462446-mach_absolute_time

IEEE 754 total order in standard C++11

According to the IEEE floating point wikipage (on IEEE 754), there is a total order on double-precision floating points (i.e. on C++11 implementations having IEEE-754 floats, like gcc 4.8 on Linux / x86-64).
Of course, operator < on double often provides a total order, but NaNs are a known exception (it is well-known folklore that x != x is a way of testing whether x, declared as double x;, is a NaN).
The reason I am asking is that I want to have e.g. std::set<double> (actually, a set of JSON-like -or Python-like- values) and I would like the set to have some canonical representation (my practical concern is to emit portable JSON: same data, ordered in the same order, both on Linux/x86-64 and e.g. on Linux/ARM, even in weird cases like NaN).
I cannot find any simple way to get that total order. I coded
// a totally ordering function,
// return -1 for less-than, 0 for equal, +1 for greater
int mydoublecompare(double x, double y) {
if (x==y) return 0;
else if (x<y) return -1;
else if (x>y) return 1;
int kx = std::fpclassify(x);
int ky = std::fpclassify(y);
if (kx == FP_INFINITE) return (x>0)?1:-1;
if (ky == FP_INFINITE) return (y>0)?-1:1;
if (kx == FP_NAN && ky == FP_NAN) return 0;
return (kx==ky)?0:(kx<ky)?-1:1;
}
Actually, I do know that it is not really (mathematically speaking) a total order
(since e.g. bit-wise different NaNs all compare equal), but I am hoping it has the same
(or a very close) behavior on several common architectures.
Any comments or suggestion?
(perhaps I should not care that much; and I deliberately don't care about signaling NaNs)
The overall motivation is that I am coding some dynamically typed interpreter which persists its entire memory state in JSON notation, and I want to be sure that the persistent state is stable between architectures, in other words if I load the JSON state and dump it, it stays idempotent for several architectures (notably all of x86-64, ia-32, ARM 32 bits...).
I would use:
int totalcompare(double x, double y) {
int64_t rx, ry;
memcpy(&rx, &x, sizeof rx);
memcpy(&ry, &y, sizeof ry);
if (rx == ry) return 0;
if (rx < 0) rx ^= INT64_MAX;
if (ry < 0) ry ^= INT64_MAX;
if (rx < ry) return -1; else return 1;
}
This makes 0.0 and -0.0 compare unequal, whereas if (x==y) return 0; in your version makes them compare equal, meaning that your version is only a preorder. NaN values are above the rest and different NaNs compare different. All values comparable for <= should be in the same order for the above relation.
Note: the above function is C. I do not know C++.
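If you do want to call this from C++, one way is to wrap the same bit trick in a comparator for std::set (a sketch; the TotalLess name is made up here, not part of the answer):
#include <cstdint>
#include <cstring>
#include <set>

struct TotalLess {
    bool operator()(double x, double y) const {
        int64_t rx, ry;
        std::memcpy(&rx, &x, sizeof rx);
        std::memcpy(&ry, &y, sizeof ry);
        if (rx < 0) rx ^= INT64_MAX;   // flip the payload of negatives so more-negative sorts lower
        if (ry < 0) ry ^= INT64_MAX;
        return rx < ry;
    }
};

std::set<double, TotalLess> values;    // NaN, -0.0 and 0.0 now have stable, portable positions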

Floating Point Divider Hardware Implementation Details

I am trying to implement a 32-bit floating-point divider in hardware, and I am wondering if I can get any suggestions as to the tradeoffs between different algorithms?
My floating-point unit currently supports multiplication and addition/subtraction, but I am not going to switch it to a fused multiply-add (FMA) architecture, since this is an embedded platform where I am trying to minimize area usage.
A very long time ago I came across this neat and easy-to-implement float/fixed-point division algorithm, used in military FPUs of that time period:
the input must be unsigned and shifted so that x < y and both are in the range <0.5 ; 1>
don't forget to store the difference of shifts sh = shx - shy and the original signs
find f (by iterating) so that y*f -> 1 .... after that x*f -> x/y, which is the division result
shift x*f back by sh and restore the result sign (sig = sigx*sigy)
the x*f can be computed easily like this:
z = 1-y
(x*f) = (x/y) = x*(1+z)*(1+z^2)*(1+z^4)*(1+z^8)*(1+z^16)...(1+z^(2^n))
This works because (1-z)*(1+z)*(1+z^2)*...*(1+z^(2^n)) = 1-z^(2^(n+1)) and y = 1-z, so the product converges to 1/y as z^(2^(n+1)) goes to zero. Here
n = log2(number of fractional bits for fixed point, or mantissa bit size for floating point)
You can also stop as soon as z^(2^n) becomes zero on fixed-bit-width data types.
[Edit2] I had a bit of time and mood for this, so here is a 32-bit IEEE 754 C++ implementation.
I removed the old (bignum) examples to avoid confusion for future readers (they are still accessible in the edit history if needed).
//---------------------------------------------------------------------------
// IEEE 754 single masks
const DWORD _f32_sig =0x80000000; // sign
const DWORD _f32_exp =0x7F800000; // exponent
const DWORD _f32_exp_sig=0x40000000; // exponent sign
const DWORD _f32_exp_bia=0x3F800000; // exponent bias
const DWORD _f32_exp_lsb=0x00800000; // exponent LSB
const DWORD _f32_exp_pos= 23; // exponent LSB bit position
const DWORD _f32_man =0x007FFFFF; // mantisa
const DWORD _f32_man_msb=0x00400000; // mantisa MSB
const DWORD _f32_man_bits= 23; // mantisa bits
//---------------------------------------------------------------------------
float f32_div(float x,float y)
{
union _f32 // float bits access
{
float f; // 32bit floating point
DWORD u; // 32 bit uint
};
_f32 xx,yy,zz; int sh; DWORD zsig; float z;
// result signum abs value
xx.f=x; zsig =xx.u&_f32_sig; xx.u&=(0xFFFFFFFF^_f32_sig);
yy.f=y; zsig^=yy.u&_f32_sig; yy.u&=(0xFFFFFFFF^_f32_sig);
// initial exponent difference sh and normalize exponents to speed up shift in range
sh =0;
sh-=((xx.u&_f32_exp)>>_f32_exp_pos)-(_f32_exp_bia>>_f32_exp_pos); xx.u&=(0xFFFFFFFF^_f32_exp); xx.u|=_f32_exp_bia;
sh+=((yy.u&_f32_exp)>>_f32_exp_pos)-(_f32_exp_bia>>_f32_exp_pos); yy.u&=(0xFFFFFFFF^_f32_exp); yy.u|=_f32_exp_bia;
// shift input in range
while (xx.f> 1.0f) { xx.f*=0.5f; sh--; }
while (xx.f< 0.5f) { xx.f*=2.0f; sh++; }
while (yy.f> 1.0f) { yy.f*=0.5f; sh++; }
while (yy.f< 0.5f) { yy.f*=2.0f; sh--; }
while (xx.f<=yy.f) { yy.f*=0.5f; sh++; }
// divider block
z=(1.0f-yy.f);
zz.f=xx.f*(1.0f+z);
for (;;)
{
z*=z; if (z==0.0f) break;
zz.f*=(1.0f+z);
}
// shift result back
for (;sh>0;) { sh--; zz.f*=0.5f; }
for (;sh<0;) { sh++; zz.f*=2.0f; }
// set signum
zz.u&=(0xFFFFFFFF^_f32_sig);
zz.u|=zsig;
return zz.f;
}
//---------------------------------------------------------------------------
I wanted to keep it simple, so it is not optimized yet. You can, for example, replace all *=0.5 and *=2.0 with exponent increments/decrements... If you compare against the FPU's float operator /, this will be a bit less precise, because most FPUs compute in an 80-bit internal format and this implementation works on only 32 bits.
As you can see, I am using only +,-,* from the FPU. The computation can be sped up by using fast sqr algorithms like
Fast bignum square computation
especially if you want to use big bit widths ...
Do not forget to implement normalization and/or overflow/underflow correction.
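A quick way to eyeball the accuracy is to compare against the native division; a hypothetical smoke test (it assumes DWORD has already been typedef'd to a 32-bit unsigned integer so that f32_div compiles):
#include <cstdio>

int main()
{
    float x = 355.0f, y = 113.0f;           // 355/113 ~ pi, an easy reference value
    printf("f32_div : %.9g\n", f32_div(x, y));
    printf("native  : %.9g\n", x / y);
    return 0;
}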

How to wrap a number using mod operator

Not sure if this is possible, but is there a way, using mod or something similar, to automatically correct bad input values? For example:
If r>255, then set r=255 and
if r<0, then set r=0
So basically what I'm asking is: what's a clever mathematical way to do this rather than using
if(r>255)
r=255;
if(r<0)
r=0;
How about:
r = std::max(0, std::min(r, 255));
The following function will output what you are looking for:
f(x) = (510*(1 + Sign[-255 + x]) + x*(1 + Sign[255 - x])*(1 + Sign[x]))/4
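Rendered as C-style code the same formula looks like this (a sketch; sign() is the usual -1/0/+1 function, which you must provide yourself):
static int sign(int v) { return (v > 0) - (v < 0); }

// Clamps x to [0, 255] using only the sign trick from the formula above.
static int clamp255(int x) {
    return (510 * (1 + sign(x - 255))
            + x * (1 + sign(255 - x)) * (1 + sign(x))) / 4;
}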
Could you do something like --
R = MIN(r, 255);
R = MAX(R, 0);
Depending on how your hardware and possibly how your interpreter deal with ints, you can do this:
Assuming that an unsigned int is 16 bits (to keep the mask short):
r = r & 0x00FF; /* 0000 0000 1111 1111 in binary */
If an int is 32 bits, the binary mask simply gains 16 more leading zeros; in hex it is still 0xFF.
After that bitwise AND, the maximum value r can have is 255. Note, however, that this wraps rather than clamps: a value above 255 keeps only its low 8 bits, and a negative value generally does not come out as 0, so you may want r = max(r, 0); first.
I had a similar problem when dealing with images. For some special bounds (like these, 0 and 255) you can use this nonportable method:
static inline int trim_8bit(unsigned i){
return 0xff & ((i | -!!(i & ~0xff))) + (i >> 31);
// where "0xff &" can be omitted if you return unsigned char
};
In real cases the clamping has to be performed only rarely, so you could write
static inline unsigned char trim_8bit_v2(unsigned i){
if (__builtin_expect(i & ~0xFF, 0)) // it's for gcc, use __assume for MSVC
return (i >> 31) - 1;
return i;
};
And to be sure which is fastest, measure.
