I have the following function:
float int_to_qty(unsigned x) {
const float MAX = 8.5f;
const float MIN = .001f;
return ((MAX-MIN) / (float)(1<<24)) * x + MIN;
}
This compiles (with reasonable options, on x86) to the following:
.LCPI0_0:
.long 0x3507fbe7 # float 5.06579852E-7
.LCPI0_1:
.long 0x3a83126f # float 0.00100000005
int_to_qty: # @int_to_qty
mov eax, edi
cvtsi2ss xmm0, rax
mulss xmm0, dword ptr [rip + .LCPI0_0]
addss xmm0, dword ptr [rip + .LCPI0_1]
ret
I consider the assembly to be the "canonical" version of the function: Convert the int to a float, multiply by a constant at 32-bit precision, add another constant at 32-bit precision, that's the result.
I want to find the exact inverse of this function. Specifically, a function
unsigned qty_to_int(float qty) that will pass the following test:
int test() {
for (unsigned i = 0; i < (1 << 24); ++i) {
float qty = int_to_qty(i);
if (int_to_qty(qty_to_int(qty)) != qty) {
return 0;
}
}
return 1;
}
Notes:
In the range 4 ≤ int_to_qty(x) < 8, the returned values primarily differ by 1 ulp, which is what makes this challenging.
In the range 8 ≤ int_to_qty(x) < 8.5, the function stops being one-to-one. In this case either answer is fine for the inverse, it doesn't have to be consistently the lowest or the highest.
After wrestling for a long time, I finally came up with a solution that passes the tests. (In Rust, but the translation to C is straightforward.)
pub fn qty_to_int(qty: f64) -> u32 {
const MAX: f32 = 8.5;
const MIN: f32 = 0.001;
let size_inv = f64::from(1<<24) / f64::from(MAX - MIN);
// We explicitly shrink the precision to f32 and then pop back to f64, because we *need* to
// perform that rounding step to properly reverse the addition at the end of int_to_qty.
// We could do the whole thing at f32 precision, except that our input is f64 so the
// subtraction needs to be done at f64 precision.
let fsqueezed: f32 = (qty - f64::from(MIN)) as f32;
// The squeezed subtraction is a one-to-one operation across most of our range. *However*,
// in the border areas where our input is just above an exponent breakpoint, but
// subtraction will bring it below, we hit an issue: The addition in int_to_qty() has
// irreversibly lost a bit in the lowest place! This causes issues when we invert the
// multiply, since we are counting on the error to be centered when we round at the end.
//
// To solve this, we need to re-bias the error by subtracting 0.25 ulp for these cases.
// Technically this should be applied for ranges like [2.0,2.001) as well, but they
// don't need it since they already round correctly (due to having more headroom).
let adj = if qty > 4.0 && qty < 4.001 {
0.5 - 2.0 / 8.499
} else if qty > 8.0 && qty < 8.001 {
0.5 - 4.0 / 8.499
} else {
0.5
};
// Multiply and round, taking into account the possible adjustments.
let fresult = f64::from(fsqueezed) * size_inv + adj;
unsafe { fresult.to_int_unchecked() }
}
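For anyone who wants it in C directly, here is a minimal, untested translation of the same routine (same constants, same f32 squeeze; names mirror the Rust version):
unsigned qty_to_int(double qty) {
    const float MAX = 8.5f;
    const float MIN = 0.001f;
    const double size_inv = (double)(1 << 24) / (double)(MAX - MIN);
    /* Squeeze to f32 and back to reverse the final addition in int_to_qty. */
    float fsqueezed = (float)(qty - (double)MIN);
    double adj = 0.5;
    if (qty > 4.0 && qty < 4.001)      adj = 0.5 - 2.0 / 8.499;
    else if (qty > 8.0 && qty < 8.001) adj = 0.5 - 4.0 / 8.499;
    double fresult = (double)fsqueezed * size_inv + adj;
    return (unsigned)fresult; /* truncation, matching to_int_unchecked() */
}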
Consider the following binary search for a value greater than lo, but less than or equal to hi:
function find(lo: number, hi: number, isTooLow: (testVal: number) => boolean) {
for(;;) {
const testVal = between(lo, hi);
if (testVal <= lo || testVal >= hi) {
break;
}
if (isTooLow(testVal)) {
lo = testVal;
} else {
hi = testVal;
}
}
return hi;
}
Note that a number here is a 64-bit float.
The search will always terminate, and if the between function is very carefully written to choose the median available 64-bit float between lo and hi, if it exists, then also:
The search will terminate within 64 iterations; and
It will exactly find the smallest value hi such that isTooLow(hi) == false
But such a between function is tricky and complicated, and it depends on the fine details of the floating point representation.
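For reference, the kind of representation-dependent between being alluded to here, sketched in C++ rather than JavaScript (an assumption-laden sketch: non-negative finite doubles only, relying on IEEE-754 bit patterns being monotonically ordered in that range):
#include <bit>      // std::bit_cast, C++20
#include <cstdint>
// Median representable double between lo and hi, for 0 <= lo < hi, both finite.
double between(double lo, double hi) {
    uint64_t a = std::bit_cast<uint64_t>(lo);
    uint64_t b = std::bit_cast<uint64_t>(hi);
    return std::bit_cast<double>(a + (b - a) / 2); // midpoint in ULP space
}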
Can anyone suggest an implementation for between that is simpler and that does not depend on any specifics of the floating point representation, except that there is a fixed-width mantissa, a fixed-width exponent, and a sign?
It will need to be implemented in Javascript, and it only needs to be almost as good, such that:
The search will always terminate within 200 iterations or so; and
It will very nearly (within 3 or 4 possible values) find the smallest value hi such that isTooLow(hi) == false
Extra points for avoiding transcendental functions and sqrt.
RESOLUTION
In the end, I really liked David's stateful guesser, but I hoisted the state up into the call stack, and the result essentially does a search for the exponent first, without any knowledge of the representation.
I haven't tested/debugged this yet:
function find(lo: number, hi: number, isTooLow: (testVal: number) => boolean) {
[lo, hi] = getLinearRange(lo, hi, isTooLow);
for (;;) {
const testVal = lo + (hi - lo) * 0.5;
if (testVal <= lo || testVal >= hi) {
break;
}
if (isTooLow(testVal)) {
lo = testVal;
} else {
hi = testVal;
}
}
return hi;
}
/**
* Reduce a floating-point range to a size where a conventional binary
* search is appropriate.
* @returns [newlow, newhigh]
*/
function getLinearRange(
low: number, high: number,
isTooLow: (n: number) => boolean): [number, number] {
let negRange: [number, number] | undefined;
if (low < 0) {
if (high > 0) {
if (isTooLow(0)) {
return scaleRange(0, high, 0.25, isTooLow);
} else {
const isTooHigh = (n: number) => !isTooLow(n);
negRange = scaleRange(0, -low, 0.25, isTooHigh);
}
} else {
const isTooHigh = (n: number) => !isTooLow(n);
negRange = scaleRange(-high, -low, 0.25, isTooHigh);
}
} else {
return scaleRange(low, high, 0.25, isTooLow);
}
// we have to negate the range
low = -negRange[1];
negRange[1] = -negRange[0];
negRange[0] = low;
return negRange;
}
/**
* Reduce a positive range until low/high >= minScale
* @returns [newlow, newhigh]
*/
function scaleRange(
low: number, high: number, minScale: number,
isTooLow: (n: number) => boolean): [number, number] {
if (!(minScale > 0 && low < high * minScale)) {
return [low, high];
}
const range = scaleRange(low, high, minScale * minScale, isTooLow);
[low, high] = range;
const test = high * minScale;
if (test > low && test < high) {
if (isTooLow(test)) {
range[0] = test;
} else {
range[1] = test;
}
}
return range;
}
We could also do some precomputation (code below works for nonnegative finite ranges, add branches ad lib to handle the other cases). We approximate the smallest useful fraction, increase it by square roots to effect binary search on the exponent, and then finish off with good old arithmetic mean to nail down the significand. I think the worst case is 65 queries, certainly not much more, though many inputs will take longer than the bit-munging algorithm.
const Guesser = {
fractions: null,
between(low, high) {
if (this.fractions === null) {
this.fractions = [];
let f = 0.25;
while (low + f * (high - low) > low) {
this.fractions.push(f);
f *= f;
}
}
return low + (this.fractions.pop() || 0.5) * (high - low);
},
};
for (let i = 0; i <= 101; ++i) {
let n = 0;
let g = Object.create(Guesser);
let low = 0;
let high = 1.7976931348623157e308;
for (;;) {
++n;
let mid = g.between(low, high);
if (mid <= low || high <= mid) break;
if (100 * Math.random() < i) low = mid;
else high = mid;
}
console.log(n);
}
Here’s an idea that I think meets your specs on IEEE doubles using primitive operations only (but seems probably worse than using square root, assuming that it’s hardware-accelerated). Find the sign (if needed), find the top 7 bits of the 11-bit exponent using linear search (≈ 128 queries, plus 4 for subnormals, ugh), then switch to the arithmetic mean (≈ 53 + 2^(11−7) = 69 queries), for a total of about 200 queries if I’m counting right. Untested JavaScript below.
const multiplicative_stride = 1 / 2 ** (2 ** (11 - 7));
function between(low, high) {
if (high <= 0) return -between(-high, -low);
if (low < 0) return 0;
const mid = multiplicative_stride * high;
return mid <= low ? low + 0.5 * (high - low) : mid;
}
Using 64 floating-point iterations is not a good idea, because you forgot that 64-bit floats (double) are represented as 3 separate things:
1 bit sign
11 bit exponent
52 bit mantissa (53 bits of precision with the implicit leading bit)
and if you do not know the real range (where you can start) of the solution, then you might be off by quite a few iterations, as the range of such numbers is much more than 2^64 ...
However if you do this on the binary bits directly then it's OK (but you have to handle special cases like NaN, Inf and maybe also denormalized numbers).
So instead of using *0.5 you use binary operations on individual bits, like x<<=1, x|=1, x^=1 ...
Here is a simple C++ example (DWORD is any unsigned 32-bit integer type):
#include <cstdint>
typedef std::uint32_t DWORD; // any unsigned 32-bit int type
double f64_sqrt(double x)
{
// IEEE 754 double MSW masks
const DWORD _f64_sig =0x80000000; // sign
const DWORD _f64_exp =0x7FF00000; // exponent
const DWORD _f64_exp_sig=0x40000000; // exponent sign
const DWORD _f64_exp_bia=0x3FF00000; // exponent bias
const DWORD _f64_exp_lsb=0x00100000; // exponent LSB
const DWORD _f64_man =0x000FFFFF; // mantissa
const DWORD _f64_man_msb=0x00080000; // mantissa MSB
const int h=1; // may be platform dependent MSB/LSB order
const int l=0;
DWORD b; // bit mask
union // semi result
{
double f; // 64bit floating point
DWORD u[2]; // 2x32 bit uint
} y;
// fabs
y.f=x; y.u[h]&=_f64_exp|_f64_man; x=y.f;
// set safe exponent (~ abs half) destroys mantissa, sign
b=(y.u[h]&_f64_exp)-_f64_exp_bia;
y.u[h]=((b>>1)|(b&_f64_exp_sig))+_f64_exp_bia;
// sign=`+` mantissa=0
y.u[h]&=_f64_exp;
// correct exponent if needed
if (y.f*y.f>x) y.u[h]=(y.u[h]-_f64_exp_lsb)&_f64_exp;
// binary search
for (b=_f64_man_msb;b;b>>=1) { y.u[h]|=b; if (y.f*y.f>x) y.u[h]^=b; }
for (b=0x80000000 ;b;b>>=1) { y.u[l]|=b; if (y.f*y.f>x) y.u[l]^=b; }
return y.f;
}
using "estimation" of solution exponent (exploiting the fact that result has ~ half of integer bits for sqrt) and binary access binary search of mantissa only (53 iterations)
In case you do not know the exponent range you have to bin search it too (starting from highest bit or one after if sign is known) ...
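A quick sanity check of the routine above against the standard library (a hypothetical test driver, assuming the typedef shown before the function):
#include <cmath>
#include <cstdio>
int main() {
    // compare bit-searched sqrt against std::sqrt across several magnitudes
    for (double x = 0.25; x < 1.0e6; x *= 3.7)
        printf("f64_sqrt(%g) = %.17g vs sqrt = %.17g\n", x, f64_sqrt(x), std::sqrt(x));
    return 0;
}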
I need to accurately calculate ceil(logb(a)) where a and b
are both integers. If I simply use the typical change-of-base formula with floating-point math functions, I wind up with errors due to rounding.
You can use this identity:
b^logb(a) = a
So binary search for x = logb(a): find the biggest x whose b^x is still less than a, and afterwards just increment the final result.
Here is a small C++ example for 32 bits:
//---------------------------------------------------------------------------
DWORD u32_pow(DWORD a,DWORD b) // = a^b
{
int i,bits=32;
DWORD d=1;
for (i=0;i<bits;i++)
{
d*=d;
if (DWORD(b&0x80000000)) d*=a;
b<<=1;
}
return d;
}
//---------------------------------------------------------------------------
DWORD u32_log2(DWORD a) // = ceil(log2(a))
{
DWORD x;
for (x=32;((a&0x80000000)==0)&&(x>1);x--,a<<=1);
return x;
}
//---------------------------------------------------------------------------
DWORD u32_log(DWORD b,DWORD a) // = ceil(logb(a))
{
DWORD x,m,bx;
// edge cases
if (b< 2) return 0;
if (a< 2) return 0;
if (a<=b) return 1;
m=1<<(u32_log2(a)-1); // max limit for b=2, all other bases lead to smaller exponents anyway
for (x=0;m;m>>=1)
{
x|=m;
bx=u32_pow(b,x);
if (bx>=a) x^=m;
}
return x+1;
}
//---------------------------------------------------------------------------
Where DWORD is any unsigned 32-bit int type... For more info about pow, log, exp and binary search see:
Power by squaring for negative exponents
Note that u32_log2 is not really needed (unless you want bigints); you can use a constant bit width instead. Also, some CPUs like x86 have a single asm instruction (BSR) that returns the same result much faster than the for loop...
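For example, on GCC/Clang the loop can be replaced by a builtin that compiles to BSR/LZCNT (a sketch, not part of the original answer; the builtin is undefined for a=0):
DWORD u32_log2_fast(DWORD a) // same result as u32_log2 above, for a > 0
{
    return 32 - __builtin_clz(a); // clz = count leading zeros, so this is floor(log2(a))+1
}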
Now the next step is to exploit the fact that the u32_pow binary search is the same as the u32_log binary search, so we can merge the two functions and get rid of one nested for loop completely, improving the complexity considerably, like this (u32_pow and u32_log2 stay as above):
//---------------------------------------------------------------------------
DWORD u32_log(DWORD b,DWORD a) // = ceil(logb(a))
{
const int _bits=32; // DWORD bitwidth
DWORD bb[_bits]; // squares of b LUT for speed up b^x
DWORD x,m,bx,bx0,bit,bits;
// edge cases
if (b< 2) return 0;
if (a< 2) return 0;
if (a<=b) return 1;
// max limit for x where b=2, all other bases lead to smaller x
bits=u32_log2(a);
// compute bb LUT
bb[0]=b;
for (bit=1;bit< bits;bit++) bb[bit]=bb[bit-1]*bb[bit-1];
for ( ;bit<_bits;bit++) bb[bit]=1;
// bin search x and b^x at the same time
for (bx=1,x=0,bit=bits-1,m=1<<bit;m;m>>=1,bit--)
{
x|=m; bx0=bx; bx*=bb[bit];
if (bx>=a){ x^=m; bx=bx0; }
}
return x+1;
}
//---------------------------------------------------------------------------
The only drawback is that we need a LUT of the squares of b: b, b^2, b^4, b^8, ... up to bits entries.
Beware that squaring doubles the number of bits, so you should also handle overflow if b or a are too big ...
[Edit2] more optimization
A benchmark on normal ints (on bigints the bin search is much, much faster) revealed that the bin-search version is the same speed as the naive version (because of the many supporting operations besides the multiplications):
DWORD u32_log_naive(DWORD b,DWORD a) // = ceil(logb(a))
{
DWORD x,bx; // unsigned, so the wraparound in the loop condition below is well defined
if (b< 2) return 0;
if (a< 2) return 0;
if (a<=b) return 1;
for (x=2,bx=b;bx*=b;x++)
if (bx>=a) break;
return x;
}
We can optimize more:
we can comment out the computation of the unused squares:
//for ( ;bit<_bits;bit++) bb[bit]=1;
with this the bin search becomes faster also on ints, but not by much
we can use a faster log2 instead of the naive one
see: Fastest implementation of log2(int) and log2(float)
Putting it all together (x86 CPUs):
DWORD u32_log(DWORD b,DWORD a) // = ceil(logb(a))
{
const int _bits=32; // DWORD bitwidth
DWORD bb[_bits]; // squares of b LUT for speed up b^x
DWORD x,m,bx,bx0,bit,bits;
// edge cases
if (b< 2) return 0;
if (a< 2) return 0;
if (a<=b) return 1;
// max limit for x where b=2, all other bases lead to smaller x
asm {
bsr eax,a; // bits=u32_log2(a);
mov bits,eax;
}
// compute bb LUT
bb[0]=b;
for (bit=1;bit< bits;bit++) bb[bit]=bb[bit-1]*bb[bit-1];
// for ( ;bit<_bits;bit++) bb[bit]=1;
// bin search x and b^x at the same time
for (bx=1,x=0,bit=bits-1,m=1<<bit;m;m>>=1,bit--)
{
x|=m; bx0=bx; bx*=bb[bit];
if (bx>=a){ x^=m; bx=bx0; }
}
return x+1;
}
However the speedup is only slight, for example naive 137 ms vs. bin search 133 ms ... Note that the faster log2 made almost no difference, but that is because of how my compiler handles inline asm (not sure why BDS2006 and BCC32 are so slow at switching between asm and C++, but they are; that is why in older C++ Builders inline asm functions were not a good choice for speed optimization unless a major speedup was expected) ...
On Darwin, the POSIX standard clock_gettime(CLOCK_MONOTONIC) timer is not available. Instead, the highest resolution monotonic timer is obtained through the mach_absolute_time function from mach/mach_time.h.
The result returned may be an unadjusted tick count from the processor, in which case the time units could be a strange multiple. For example, on a CPU with a 33MHz tick count, Darwin returns 1000000000/33333335 as the exact units of the returned result (ie, multiply the mach_absolute_time by that fraction to obtain a nanosecond value).
We usually wish to convert from exact ticks to "standard" (decimal) units, but unfortunately, naively multiplying the absolute time by the fraction will overflow even in 64-bit arithmetic. This is an error that Apple's sole piece of documentation on mach_absolute_time falls into (Technical Q&A QA1398).
How should I write a function that correctly uses mach_absolute_time?
Note that this is not a theoretical problem: the sample code in QA1398 completely fails to work on PowerPC-based Macs. On Intel Macs, mach_timebase_info always returns 1/1 as the scaling factor because the CPU's raw tick count is unreliable (dynamic speed-stepping), so the API does the scaling for you. On PowerPC Macs, mach_timebase_info returns either 1000000000/33333335 or 1000000000/25000000, so Apple's provided code definitely overflows every few minutes. Oops.
Most-precise (best) answer
Perform the arithmetic at 128-bit precision to avoid the overflow!
#include <mach/mach_time.h>
#include <cassert>
#include <cstdint>
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
}
uint64_t scale(uint64_t i) {
return scaleHighPrecision(i - bias, tb.numer, tb.denom);
}
static uint64_t scaleHighPrecision(uint64_t i, uint32_t numer,
uint32_t denom) {
uint64_t high = (i >> 32) * numer;
uint64_t low = (i & 0xffffffffull) * numer / denom;
uint64_t highRem = ((high % denom) << 32) / denom;
high /= denom;
return (high << 32) + highRem + low;
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return data.scale(now);
}
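On compilers with a 128-bit integer type (GCC/Clang), the same scaling can be written far more directly; a sketch under that assumption, not part of the original answer:
uint64_t scale128(uint64_t i, uint32_t numer, uint32_t denom) {
    // A 64-bit times 32-bit product always fits in 128 bits, so it cannot overflow.
    return (uint64_t)((unsigned __int128)i * numer / denom);
}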
A simple low-resolution answer
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process. The clock may run up to 0.1% faster or slower
// than the "exact" tick count.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
if (tb.denom > 1024) {
double frac = (double)tb.numer/tb.denom;
tb.denom = 1024;
tb.numer = tb.denom * frac + 0.5;
assert(tb.numer > 0);
}
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return (now - data.bias) * data.tb.numer / data.tb.denom;
}
A fiddly solution using low-precision arithmetic but using continued fractions to avoid loss of accuracy
// This function returns the rational number inside the given interval with
// the smallest denominator (and smallest numerator breaks ties; correctness
// proof neglects floating-point errors).
static mach_timebase_info_data_t bestFrac(double a, double b) {
if (floor(a) < floor(b))
{ mach_timebase_info_data_t rv = {(int)ceil(a), 1}; return rv; }
double m = floor(a);
mach_timebase_info_data_t next = bestFrac(1/(b-m), 1/(a-m));
mach_timebase_info_data_t rv = {(int)m*next.numer + next.denom, next.numer};
return rv;
}
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process. The clock may run up to 0.1% faster or slower
// than the "exact" tick count. However, although the bound on the error is
// the same as for the pragmatic answer, the error is actually minimized over
// the given accuracy bound.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
double frac = (double)tb.numer/tb.denom;
uint64_t spanTarget = 315360000000000000llu; // 10 years
if (getExpressibleSpan(tb.numer, tb.denom) >= spanTarget)
return;
for (double errorTarget = 1/1024.0; errorTarget > 0.000001;) {
mach_timebase_info_data_t newFrac =
bestFrac((1-errorTarget)*frac, (1+errorTarget)*frac);
if (getExpressibleSpan(newFrac.numer, newFrac.denom) < spanTarget)
break;
tb = newFrac;
errorTarget = fabs((double)tb.numer/tb.denom - frac) / frac / 8;
}
assert(getExpressibleSpan(tb.numer, tb.denom) >= spanTarget);
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return (now - data.bias) * data.tb.numer / data.tb.denom;
}
The derivation
We aim to reduce the fraction returned by mach_timebase_info to one that is essentially the same, but with a small denominator. The size of the timespan that we can handle is limited only by the size of the denominator, not the numerator of the fraction we shall multiply by:
static int ceilLog2(uint64_t x) {
// 64 - number of leading zeros of x (valid for x > 0)
int n = 0;
while (x) { ++n; x >>= 1; }
return n;
}
uint64_t getExpressibleSpan(uint32_t numer, uint32_t denom) {
// This is just less than the smallest thing we can multiply numer by without
// overflowing. ceilLog2(numer) = 64 - number of leading zeros of numer
uint64_t maxDiffWithoutOverflow = ((uint64_t)1 << (64 - ceilLog2(numer))) - 1;
return maxDiffWithoutOverflow * numer / denom;
}
If denom=33333335 as returned by mach_timebase_info, we can handle differences of only about nine minutes (≈2^34 ticks at ~33 MHz) before the multiplication by numer overflows. As getExpressibleSpan shows, by calculating a rough lower bound for this, the size of numer doesn't matter: halving numer doubles maxDiffWithoutOverflow. The only goal therefore is to produce a fraction close to numer/denom that has a smaller denominator. The simplest method to do this is using continued fractions.
The continued fractions method is rather handy. bestFrac clearly works correctly if the provided interval contains an integer: it returns the least integer in the interval, as a fraction over 1. Otherwise, it calls itself recursively with a strictly larger interval and returns m + 1/next. The final result is a continued fraction that can be shown by induction to have the correct property: it's optimal, the fraction inside the given interval with the least denominator.
Finally, we reduce the fraction Darwin passes us to a smaller one to use when rescaling the mach_absolute_time to nanoseconds. We may introduce an error here because we can't reduce the fraction in general without losing accuracy. We set ourselves the target of 0.1% error, and check that we've reduced the fraction enough for common timespans (up to ten years) to be handled correctly.
Arguably the method is over-complicated for what it does, but it handles correctly anything the API can throw at it, and the resulting code is still short and extremely fast (bestFrac typically recurses only three or four iterations deep before returning a denominator less than 1000 for random intervals [a,a*1.002]).
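For intuition, a hypothetical driver feeding the PowerPC fraction through the reduction (macOS, reusing bestFrac from above):
#include <cstdio>
int main() {
    double frac = 1000000000.0 / 33333335.0;       // ~30.0000015 ns per tick
    mach_timebase_info_data_t r = bestFrac(0.999 * frac, 1.001 * frac);
    printf("%u/%u\n", r.numer, r.denom);           // prints 30/1
}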
You're worrying about overflow when multiplying/dividing with values from the mach_timebase_info struct, which is used for conversion to nanoseconds. So, while it may not fit your exact needs, there are easier ways to get a count in nanoseconds or seconds.
All solutions below are using mach_absolute_time internally (and NOT the wall clock).
Use double instead of uint64_t
(supported in Objective-C and Swift)
double tbInSeconds = 0;
mach_timebase_info_data_t tb;
kern_return_t kError = mach_timebase_info(&tb);
if (kError == 0) {
tbInSeconds = 1e-9 * (double)tb.numer / (double)tb.denom;
}
(remove the 1e-9 if you want nanoseconds)
Usage:
uint64_t start = mach_absolute_time();
// do something
uint64_t stop = mach_absolute_time();
double durationInSeconds = tbInSeconds * (stop - start);
Use ProcessInfo.processInfo.systemUptime
(supported in Objective-C and Swift)
It does the job in double seconds directly:
CFTimeInterval start = NSProcessInfo.processInfo.systemUptime;
// do something
CFTimeInterval stop = NSProcessInfo.processInfo.systemUptime;
NSTimeInterval durationInSeconds = stop - start;
For reference, the source code of systemUptime just does something similar to the previous solution:
struct mach_timebase_info info;
mach_timebase_info(&info);
__CFTSRRate = (1.0E9 / (double)info.numer) * (double)info.denom;
__CF1_TSRRate = 1.0 / __CFTSRRate;
uint64_t tsr = mach_absolute_time();
return (CFTimeInterval)((double)tsr * __CF1_TSRRate);
Use QuartzCore.CACurrentMediaTime()
(supported in Objective-C and Swift)
Same as systemUptime, but without being open source.
Use Dispatch.DispatchTime.now()
(supported in Swift only)
Another wrapper around mach_absolute_time(). Base precision is nanoseconds, backed with UInt64.
let start = DispatchTime.now()
// do something
let stop = DispatchTime.now()
let durationInSeconds = Double(stop.uptimeNanoseconds - start.uptimeNanoseconds) / 1_000_000_000
For reference, the source code of DispatchTime.now() shows that it basically just returns struct DispatchTime(rawValue: mach_absolute_time()). And the calculation for uptimeNanoseconds is:
(result, overflow) = result.multipliedReportingOverflow(by: UInt64(DispatchTime.timebaseInfo.numer))
result = overflow ? UInt64.max : result / UInt64(DispatchTime.timebaseInfo.denom)
So it just discards results if the multiplication can't be stored in an UInt64.
If mach_absolute_time() wraps the uint64 back to 0, then reset the time calculations whenever the value is less than the last check.
That's the problem: they don't document what happens when the uint64 reaches all ones (binary).
Read it: https://developer.apple.com/documentation/kernel/1462446-mach_absolute_time
I am trying to implement a 32-bit floating-point divider in hardware, and I am wondering if I can get any suggestions as to the tradeoffs between different algorithms?
My floating point unit currently supports multiplication and addition/subtraction, but I am not going to switch it to a fused multiply-add (FMA) architecture, since this is an embedded platform where I am trying to minimize area usage.
Once upon a very long time ago I came across this neat and easy-to-implement float/fixed-point division algorithm, used in the military FPUs of that time period:
the input must be unsigned and shifted so that x < y and both are in the range [0.5, 1]
don't forget to store the difference of the shifts sh = shx - shy and the original signs
find f (by iterating) so that y*f -> 1 .... after that x*f -> x/y, which is the division result
shift x*f back by sh and restore the result sign (sig = sigx*sigy)
x*f can be computed easily like this:
z = 1 - y
(x*f) = (x/y) = x*(1+z)*(1+z^2)*(1+z^4)*(1+z^8)*(1+z^16)*...*(1+z^(2^n))
where
n = log2(number of fractional bits for fixed point, or mantissa bit size for floating point)
You can also stop as soon as z^(2^n) becomes zero on fixed-bit-width data types.
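As a quick sanity check of the identity alone (plain float, ignoring the range shifting and sign handling; this mirrors the divider block of the full code below):
float recip_series(float y)   // y in [0.5, 1]: returns ~1/y
{
    float z = 1.0f - y;
    float r = 1.0f + z;
    for (;;)
    {
        z *= z;                // z^2, z^4, z^8, ...
        if (z == 0.0f) break;  // stop once z^(2^n) underflows to zero
        r *= 1.0f + z;
    }
    return r;                  // then x / y ~ x * recip_series(y)
}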
[Edit2] I had a bit of time & mood for this, so here is a 32-bit IEEE 754 C++ implementation.
I removed the old (bignum) examples to avoid confusion for future readers (they are still accessible in the edit history if needed).
//---------------------------------------------------------------------------
#include <cstdint>
typedef std::uint32_t DWORD; // any unsigned 32-bit int type
// IEEE 754 single masks
const DWORD _f32_sig =0x80000000; // sign
const DWORD _f32_exp =0x7F800000; // exponent
const DWORD _f32_exp_sig=0x40000000; // exponent sign
const DWORD _f32_exp_bia=0x3F800000; // exponent bias
const DWORD _f32_exp_lsb=0x00800000; // exponent LSB
const DWORD _f32_exp_pos= 23; // exponent LSB bit position
const DWORD _f32_man =0x007FFFFF; // mantissa
const DWORD _f32_man_msb=0x00400000; // mantissa MSB
const DWORD _f32_man_bits= 23; // mantissa bits
//---------------------------------------------------------------------------
float f32_div(float x,float y)
{
union _f32 // float bits access
{
float f; // 32bit floating point
DWORD u; // 32 bit uint
};
_f32 xx,yy,zz; int sh; DWORD zsig; float z;
// result signum abs value
xx.f=x; zsig =xx.u&_f32_sig; xx.u&=(0xFFFFFFFF^_f32_sig);
yy.f=y; zsig^=yy.u&_f32_sig; yy.u&=(0xFFFFFFFF^_f32_sig);
// initial exponent difference sh and normalize exponents to speed up shift in range
sh =0;
sh-=((xx.u&_f32_exp)>>_f32_exp_pos)-(_f32_exp_bia>>_f32_exp_pos); xx.u&=(0xFFFFFFFF^_f32_exp); xx.u|=_f32_exp_bia;
sh+=((yy.u&_f32_exp)>>_f32_exp_pos)-(_f32_exp_bia>>_f32_exp_pos); yy.u&=(0xFFFFFFFF^_f32_exp); yy.u|=_f32_exp_bia;
// shift input in range
while (xx.f> 1.0f) { xx.f*=0.5f; sh--; }
while (xx.f< 0.5f) { xx.f*=2.0f; sh++; }
while (yy.f> 1.0f) { yy.f*=0.5f; sh++; }
while (yy.f< 0.5f) { yy.f*=2.0f; sh--; }
while (xx.f<=yy.f) { yy.f*=0.5f; sh++; }
// divider block
z=(1.0f-yy.f);
zz.f=xx.f*(1.0f+z);
for (;;)
{
z*=z; if (z==0.0f) break;
zz.f*=(1.0f+z);
}
// shift result back
for (;sh>0;) { sh--; zz.f*=0.5f; }
for (;sh<0;) { sh++; zz.f*=2.0f; }
// set signum
zz.u&=(0xFFFFFFFF^_f32_sig);
zz.u|=zsig;
return zz.f;
}
//---------------------------------------------------------------------------
I wanted to keep it simple, so it is not optimized yet. You can, for example, replace all the *=0.5 and *=2.0 with exponent increments/decrements ... If you compare with the FPU result of float operator / this will be a bit less precise, because most FPUs compute on an 80-bit internal format while this implementation works only on 32 bits.
As you can see, I am using only +, -, * from the FPU. The stuff can be sped up by using fast sqr algorithms like
Fast bignum square computation
especially if you want to use big bit widths ...
Do not forget to implement normalization and/or overflow/underflow correction.
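The exponent increment/decrement mentioned above could look like this hypothetical helper (a sketch that reuses the constants defined before f32_div and ignores exponent overflow/underflow, which real code must handle):
float f32_mul_pow2(float f, int e)   // f * 2^e by adjusting the biased exponent
{
    union { float f; DWORD u; } z;
    z.f = f;
    z.u += ((DWORD)e) << _f32_exp_pos; // unsigned wraparound also handles e < 0
    return z.f;
}
// e.g. the two shift-back loops could become: zz.f = f32_mul_pow2(zz.f, -sh);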
Here's the sample C code that I am trying to accelerate using SSE. The two arrays are 3072 elements long, holding doubles; I may drop down to float if I don't need the precision of doubles.
double sum = 0.0;
for(k = 0; k < 3072; k++) {
sum += fabs(sima[k] - simb[k]);
}
double fp = (1.0 - (sum / (255.0 * 1024.0 * 3.0)));
Anyway, my current problem is how to do the fabs step in an SSE register for doubles or floats, so that I can keep the whole calculation in SSE registers, keep it fast, and parallelize all of the steps by partly unrolling this loop.
Here are some resources I've found: fabs() asm, or possibly this: flipping the sign - SO; however, the weakness of the second one is that it needs a conditional check.
I suggest using a bitwise AND with a mask. Positive and negative values have the same representation except for the most significant bit: it is 0 for positive values and 1 for negative values; see the double-precision number format. You can use one of these:
#include <emmintrin.h> // SSE2 intrinsics
inline __m128 abs_ps(__m128 x) {
static const __m128 sign_mask = _mm_set1_ps(-0.f); // -0.f = 1 << 31
return _mm_andnot_ps(sign_mask, x);
}
inline __m128d abs_pd(__m128d x) {
static const __m128d sign_mask = _mm_set1_pd(-0.); // -0. = 1 << 63
return _mm_andnot_pd(sign_mask, x); // !sign_mask & x
}
Also, it might be a good idea to unroll the loop to break the loop-carried dependency chain. Since this is a sum of nonnegative values, the order of summation is not important:
double norm(const double* sima, const double* simb) {
const __m128d* sima_pd = (const __m128d*) sima;
const __m128d* simb_pd = (const __m128d*) simb;
__m128d sum1 = _mm_setzero_pd();
__m128d sum2 = _mm_setzero_pd();
for(int k = 0; k < 3072/2; k+=2) { // 1536 vectors, two per iteration
sum1 = _mm_add_pd(sum1, abs_pd(_mm_sub_pd(sima_pd[k], simb_pd[k])));
sum2 = _mm_add_pd(sum2, abs_pd(_mm_sub_pd(sima_pd[k+1], simb_pd[k+1])));
}
__m128d sum = _mm_add_pd(sum1, sum2);
__m128d hsum = _mm_hadd_pd(sum, sum); // SSE3, needs <pmmintrin.h>
return _mm_cvtsd_f64(hsum);
}
By unrolling and breaking the dependency (sum1 and sum2 are now independent), you let the processor execute the additions out of order. Since the instruction is pipelined on a modern CPU, the CPU can start working on a new addition before the previous one is finished. Also, bitwise operations are executed on a separate execution unit, so the CPU can actually perform one in the same cycle as an addition/subtraction. I suggest Agner Fog's optimization manuals.
Finally, I don't recommend using OpenMP. The loop is too small, and the overhead of distributing the job among multiple threads might be bigger than any potential benefit.
The maximum of -x and x should be abs(x). Here it is in code:
x = _mm_max_ps(_mm_sub_ps(_mm_setzero_ps(), x), x)
Probably the easiest way is as follows:
__m128d vsum = _mm_set1_pd(0.0); // init partial sums
for (k = 0; k < 3072; k += 2)
{
__m128d va = _mm_load_pd(&sima[k]); // load 2 doubles from sima, simb
__m128d vb = _mm_load_pd(&simb[k]);
__m128d vdiff = _mm_sub_pd(va, vb); // calc diff = sima - simb
__m128d vnegdiff = _mm_sub_pd(_mm_set1_pd(0.0), vdiff); // calc neg diff = 0.0 - diff
__m128d vabsdiff = _mm_max_pd(vdiff, vnegdiff); // calc abs diff = max(diff, - diff)
vsum = _mm_add_pd(vsum, vabsdiff); // accumulate two partial sums
}
Note that this may not be any faster than scalar code on modern x86 CPUs, which typically have two FPUs anyway. However if you can drop down to single precision then you may well get a 2x throughput improvement.
Note also that you will need to combine the two partial sums in vsum into a scalar value after the loop, but this is fairly trivial to do and is not performance-critical.
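For completeness, a minimal SSE2-only way to do that final combine (no SSE3 _mm_hadd_pd needed; _mm_unpackhi_pd and _mm_cvtsd_f64 are standard SSE2 intrinsics):
#include <emmintrin.h>
// Horizontal sum of the two lanes of a __m128d accumulator:
double hsum_pd(__m128d v) {
    __m128d hi = _mm_unpackhi_pd(v, v);      // broadcast the upper lane
    return _mm_cvtsd_f64(_mm_add_sd(v, hi)); // low lane now holds v0 + v1
}
// e.g. double sum = hsum_pd(vsum);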