Algorithm for square root calculation

I have been implementing control software in C, and one of the control algorithms requires a square root calculation. I have been looking for a suitable square root algorithm with constant execution time irrespective of the radicand value. This requirement rules out the sqrt function from the standard library.
As far as my platform is concerned, I am working with a 32-bit floating point ARM Cortex-A9 based machine. As for the radicand range, the algorithms work in physical units, so I expect values in the range <0, 400>. As for the required error, I think about 1 % should be sufficient. Can anybody recommend a square root calculation algorithm suitable for my purposes?

My initial approach would be to use the Taylor series for the square root with precalculated coefficients at a number of fixed points. This reduces the calculation to a subtraction and a number of multiplications.
The look-up table would be a 2D array like:
point | C0 | C1 | C2 | C3 | C4 | ...
-----------------------------------------
0.5 | f00 | f01 | f02 | f03 | f04 |
-----------------------------------------
1.0 | f10 | f11 | f12 | f13 | f14 |
-----------------------------------------
1.5 | f20 | f21 | f22 | f23 | f24 |
-----------------------------------------
....
So when calculating sqrt(x) use the table row with the point closest to x.
Example:
sqrt(1.1) (i.e. use the point 1.0 coefficients)
f10 +
f11 * (1.1 - 1.0) +
f12 * (1.1 - 1.0) ^ 2 +
f13 * (1.1 - 1.0) ^ 3 +
f14 * (1.1 - 1.0) ^ 4
The table above suggests a fixed distance between the points at which you precalculate coefficients (i.e. 0.5 between each point). However, due to the nature of the square root you may find that the distance between points should differ for different ranges of x. For instance x in [0 - 1] -> distance 0.1, x in [1 - 2] -> distance 0.25, x in [2 - 10] -> distance 0.5, and so on.
Another thing is the number of terms needed to get the desired precision. Here you may also find that different ranges of x may require a different number of coefficients.
All this is easy to precalculate on a normal computer (e.g. using Excel).
Note: For values very close to zero this method isn't good. Maybe Newton's method would be a better choice there.
Taylor series: https://en.wikipedia.org/wiki/Taylor_series
Newtons method: https://en.wikipedia.org/wiki/Newton%27s_method
Also relevant: https://math.stackexchange.com/questions/291168/algorithms-for-approximating-sqrt2
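To make the table-driven idea above concrete, here is a minimal C sketch. The uniform grid, the three-term polynomial and the init-time generation of the coefficients with the library sqrtf are illustrative choices rather than part of the original suggestion; in practice the table would be precomputed offline (e.g. in Excel, as noted above), and the region near zero needs a finer grid or a different method to hold the 1 % target.
#include <math.h>

#define STEP  1.0f            /* distance between expansion points          */
#define NROWS 401             /* covers radicands up to 400                 */

static float coef[NROWS][3];  /* c0 + c1*d + c2*d*d, with d = x - x0        */

static void sqrt_table_init(void)
{
    for (int i = 0; i < NROWS; ++i) {
        float x0 = (i + 0.5f) * STEP;     /* expansion point of row i       */
        float r  = sqrtf(x0);
        coef[i][0] = r;                   /* sqrt(x0)                       */
        coef[i][1] = 0.5f / r;            /* first derivative of sqrt at x0 */
        coef[i][2] = -0.125f / (x0 * r);  /* half of the second derivative  */
    }
}

static float sqrt_table(float x)          /* intended for 0 < x <= 400      */
{
    int i = (int)(x / STEP);              /* row whose point is closest     */
    if (i >= NROWS) i = NROWS - 1;
    float d = x - (i + 0.5f) * STEP;
    return coef[i][0] + (coef[i][1] + coef[i][2] * d) * d;
}
The work per call is the same for every radicand: one table lookup, two multiplications and two additions.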

The Arm v7 instruction set provides fast reciprocal square root estimate instructions: vrsqrte_f32 for two simultaneous approximations and vrsqrteq_f32 for four. (The scalar variant vrsqrtes_f32 is only available on Arm64 v8.2.)
Then the result can be simply calculated by x * vrsqrte_f32(x);, which has better than 0.33% relative accuracy over the whole range of positive values x. See https://www.mdpi.com/2079-3197/9/2/21/pdf
ARM NEON instruction FRSQRTE gives 8.25 correct bits of the result.
At x==0 vrsqrtes_f32(x) == Inf, so x*vrsqrtes_f32(x) would be NaN.
If the value of x==0 is unavoidable, the optimal two instruction sequence needs a bit more adjustment:
float sqrtest(float a) {
    // "transfer" or "convert" the scalar input to a vector of two lanes
    // - optimally we would not need an instruction for that, but would
    //   just let the processor run the estimate on all lanes of the register
    float32x2_t a2 = vdup_n_f32(a);
    // next we create a mask that is all ones for the legal
    // domain of 1/sqrt(x), i.e. x > 0
    uint32x2_t is_legal = vcgt_f32(a2, vdup_n_f32(0.0f));
    // calculate two reciprocal square root estimates in parallel
    float32x2_t a2est = vrsqrte_f32(a2);
    // we need to mask the result, so that effectively
    // all non-legal values of a2est are zeroed
    a2est = vreinterpret_f32_u32(vand_u32(is_legal, vreinterpret_u32_f32(a2est)));
    // x * 1/sqrt(x) == sqrt(x)
    a2 = vmul_f32(a2, a2est);
    // finally we get only lane zero of the result,
    // discarding the other half
    return vget_lane_f32(a2, 0);
}
Surely this method will have almost twice the throughput with
void sqrtest2(float &a, float &b) {
    float32x2_t a2 = vset_lane_f32(b, vdup_n_f32(a), 1);
    uint32x2_t is_legal = vcgt_f32(a2, vdup_n_f32(0.0f));
    float32x2_t a2est = vrsqrte_f32(a2);
    a2est = vreinterpret_f32_u32(vand_u32(is_legal, vreinterpret_u32_f32(a2est)));
    a2 = vmul_f32(a2, a2est);
    a = vget_lane_f32(a2, 0);
    b = vget_lane_f32(a2, 1);
}
And even better, if you can work directly with float32x2_t or float32x4_t inputs and outputs.
float32x2_t sqrtest2(float32x2_t a2) {
    uint32x2_t is_legal = vcgt_f32(a2, vdup_n_f32(0.0f));
    float32x2_t a2est = vrsqrte_f32(a2);
    a2est = vreinterpret_f32_u32(vand_u32(is_legal, vreinterpret_u32_f32(a2est)));
    return vmul_f32(a2, a2est);
}
This implementation gives sqrtest2(1) == 0.998 and sqrtest2(400) == 19.97 (tested on a MacBook M1 with arm64). Being branchless and LUT-free, this likely has constant execution time, assuming that all the instructions execute in a constant number of cycles.
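If the raw ~0.33% estimate is not accurate enough, NEON also provides vrsqrts_f32, which computes the Newton-Raphson correction term (3 - a*b)/2, so one refinement step costs two extra instructions. A hedged sketch along the same lines as above (zero handling omitted, see the masking trick earlier; the function name is illustrative):
#include <arm_neon.h>

float32x2_t sqrt2_refined(float32x2_t x)
{
    float32x2_t e = vrsqrte_f32(x);                   // rough 1/sqrt(x) estimate
    e = vmul_f32(e, vrsqrts_f32(vmul_f32(x, e), e));  // one NR step: e *= (3 - x*e*e)/2
    return vmul_f32(x, e);                            // x * 1/sqrt(x) == sqrt(x)
}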

I have decided to use the following approach. I chose the Newton method and then experimentally set a fixed number of iterations so that the error over the whole radicand range, i.e. <0, 400>, does not exceed the prescribed value. I ended up with six iterations. For a radicand of value 0 I decided to return 0 without any calculation.
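For reference, a minimal C sketch of this fixed-iteration Newton (Babylonian) approach. The seed below is an assumption, not something stated in the decision above; with it, six iterations stay well under 1 % relative error for radicands down to roughly 0.001, but the very bottom of the range would need a better seed or more iterations, so the numbers must be re-validated for the actual application.
float sqrt_newton_fixed(float x)
{
    if (x == 0.0f)                   /* radicand 0: return 0 without calculation */
        return 0.0f;
    float y = 0.5f * (x + 1.0f);     /* assumed seed                             */
    for (int i = 0; i < 6; ++i)      /* fixed six iterations                     */
        y = 0.5f * (y + x / y);      /* Newton/Babylonian step                   */
    return y;
}
Because the iteration count is fixed and there are no data-dependent branches apart from the zero check, the execution time is essentially constant.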

Related

How to fix skew trapezoidal distribution sampling output sample size

I am trying to generate a skewed trapezoidal distribution using inverse transform sampling.
The inputs are the values where the ramps start and end (a, b, c, d) and the sample size.
a=-3;b=-1;c=1;d=8;
SampleSize=10e4;
h=2/(d+c-a-b);
Then I calculate the ratio of the length of ramps and flat components to get sample size for each:
firstramp=round(((b-a)/(d-a)),3);
flat=round((c-b)/(d-a),3);
secondramp=round((d-c)/(d-a),3);
n1=firstramp*SampleSize; %sample size for first ramp
n3=secondramp*SampleSize; %sample size for second ramp
n2=flat*SampleSize;
And then finally I get the histogram from the following code:
quartile1=h/2*(b-a);
quartile2=1-h/2*(d-c);
y1=linspace(0,quartile1,n1);
y2=linspace(quartile1,quartile2,n2);
y3=linspace(quartile2,1,n3);
%inverse cumulative distribution functions
invcdf1=a+sqrt(2*(b-a)/h)*sqrt(y1);
invcdf2=(a+b)/2+y2/h;
invcdf3=d-sqrt(2*(d-c)/h)*sqrt(1-y3);
distr=[invcdf1 invcdf2 invcdf3];
histogram(distr,100)
However the sampling of the ramp and flat components does not come out equal; the resulting histogram is not reproduced here.
I fixed this by trial and error, by reducing the sample size of the ramps by half:
n1=0.5*firstramp*SampleSize; %sample size for first ramp
n3=0.5*secondramp*SampleSize; %sample size for second ramp
n2=flat*SampleSize;
This made the distribution look as intended (histogram not reproduced here).
However this makes the output sample size less than what is given in the input.
I've also tried different combinations of changing the sample sizes of ramps and flat.
This also works:
n1=0.75*firstramp*SampleSize; %sample size for first ramp
n3=0.75*secondramp*SampleSize; %sample size for second ramp
n2=1.5*flat*SampleSize;
It increases the output samples, but it's still not close.
Any help will be appreciated.
Full code:
a=-3;b=-1;c=1;d=8;
SampleSize=10e4;%*1.33333333333333;
h=2/(d+c-a-b);
firstramp=round(((b-a)/(d-a)),3);
flat=round((c-b)/(d-a),3);
secondramp=round((d-c)/(d-a),3);
n1=firstramp*SampleSize; %sample size for first ramp
n3=secondramp*SampleSize; %sample size for second ramp
n2=flat*SampleSize;
quartile1=h/2*(b-a);
quartile2=1-h/2*(d-c);
y1=linspace(0,quartile1,.75*n1);
y2=linspace(quartile1,quartile2,1.5*n2);
y3=linspace(quartile2,1,.75*n3);
%inverse cumulative distribution functions
invcdf1=a+sqrt(2*(b-a)/h)*sqrt(y1);
invcdf2=(a+b)/2+y2/h;
invcdf3=d-sqrt(2*(d-c)/h)*sqrt(1-y3);
distr=[invcdf1 invcdf2 invcdf3];
histogram(distr,100)
%end
I don't know Matlab so I was hoping somebody else would jump in on this, but since nobody did here goes.
If I'm reading your code correctly, what you did is not an inversion. Inversion is 1-1, i.e., one uniform input produces one outcome. You seem to be using a technique known as the "composition method". In composition the overall distribution is comprised of component pieces, each of which is straightforward to generate. You choose which component to generate from based on their proportions/probabilities relative to the whole.
For density functions, probability is found as the area under the density curve, so your first mistake was in sampling the components relative to the width of each component rather than using their areas. The correct sampling proportions are 2/13, 4/13, and 7/13 for what you designated the firstramp, flat, and secondramp components, respectively.
A second mistake (which is relatively minor) was to assign exact sample sizes to each of the components. Having probability 2/13 does not mean that exactly 2*SampleSize/13 of your samples will be from the firstramp, it means that's the expected sample size for that component. The expected value of a random variate is not necessarily (or even likely to be) the outcome you will actually get.
In pseudocode, the composition approach would be
generate U ~ Uniform(0,1)
if U <= 2/13:
    generate and return a value from firstramp
else if U <= 6/13:
    generate and return a value from flat
else:
    generate and return a value from secondramp
Note that since each of the generate options will use one or more uniforms, and choosing between the options requires a uniform U, this is not an inversion.
If you want an actual inversion, you need to quantify your density, integrate it to get the cumulative distribution function, then apply the inversion technique by setting F(X) = U and solving for X. Since your distribution is made of distinct components, both the density and cumulative density will be piecewise functions.
After deriving the height based on the requirement that the areas of the two triangles and the flat section must add up to 1, I came up with the following for your density:
| (x + 3) / 13 -3 <= x <= -1
|
f(x) = | 2 / 13 -1 <= x <= 1
|
| 2 * (8 - x) / 91 1 <= x <= 8
Integrating this and collecting terms produces the CDF:
       | (x + 3)**2 / 26                  -3 <= x <= -1
       |
F(x) = | (2 + x) * 2 / 13                 -1 <= x <= 1
       |
       | 6 / 13 + [49 - (x - 8)**2] / 91   1 <= x <= 8
Finally, determining the values of F(x) at the break points between the segments and applying inversion yields the following pseudocode algorithm:
generate U ~ Uniform(0,1)
if U <= 2 / 13:
    return 2 * sqrt( (13 * U) / 2 ) - 3
else if U <= 6 / 13:
    return (13 * U) / 2 - 2
else:
    return 8 - sqrt( 91 * (1 - U) )
Note that this is a true inversion. The outcome is determined by generating a single U, and transforming it in different ways depending on which range it falls in.
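A direct C transcription of that inversion pseudocode, for reference; trapezoid_sample is a hypothetical helper name, and u is expected to be one Uniform(0,1) draw from whatever generator you use:
#include <math.h>

double trapezoid_sample(double u)                  /* a = -3, b = -1, c = 1, d = 8 */
{
    if (u <= 2.0 / 13.0)
        return 2.0 * sqrt(13.0 * u / 2.0) - 3.0;   /* left ramp    */
    else if (u <= 6.0 / 13.0)
        return 13.0 * u / 2.0 - 2.0;               /* flat section */
    else
        return 8.0 - sqrt(91.0 * (1.0 - u));       /* right ramp   */
}
Feeding SampleSize independent uniforms through this function yields exactly SampleSize samples, which removes the mismatch between the requested and produced sample counts.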

How to get a square root for 32 bit input in one clock cycle only?

I want to design a synthesizable module in Verilog which will take only one cycle in calculating square root of given input of 32 bit.
[Edit1] repaired code
I recently found that the results were off even though the tests said all was OK, so I dug deeper and found out that I had a silly bug in my equation; due to name conflicts with my programming environment the tests gave false positives, so I overlooked it before. Now it works in all cases as it should.
The best thing I can think of (except approximation or a large LUT) is binary search without multiplication, here as C++ code:
//---------------------------------------------------------------------------
WORD u32_sqrt(DWORD xx) // 16 T
{
    DWORD x, m, a0, a1, i;
    const DWORD lut[16] =
    {
        // m*m
        0x40000000,
        0x10000000,
        0x04000000,
        0x01000000,
        0x00400000,
        0x00100000,
        0x00040000,
        0x00010000,
        0x00004000,
        0x00001000,
        0x00000400,
        0x00000100,
        0x00000040,
        0x00000010,
        0x00000004,
        0x00000001,
    };
    for (x = 0, a0 = 0, m = 0x8000, i = 0; m; m >>= 1, i++)
    {
        a1 = a0 + lut[i] + (x << (16 - i));
        if (a1 <= xx) { a0 = a1; x |= m; }
    }
    return x;
}
//---------------------------------------------------------------------------
Standard binary search for sqrt(xx) sets the bits of x from MSB to LSB so that the result satisfies x*x <= xx. Luckily we can avoid the multiplication by simply rewriting the thing incrementally... in each iteration the previous x*x result can be reused like this:
x1 = x0+m
x1*x1 = (x0+m)*(x0+m) = (x0*x0) + (2*m*x0) + (m*m)
Where x0 is the value of x from the last iteration and x1 is the current value. m is the weight of the bit currently being processed. The (2*m) and (m*m) terms are constants, so they can be realised as a LUT and a bit shift; no multiplication is needed, only addition. Sadly the iteration is bound to sequential computation, which forbids parallelisation, so the result is 16T at best.
In the code a0 represents the previous x*x and a1 the currently iterated x*x.
As you can see the sqrt is done in 16 x (BitShiftLeft, BitShiftRight, OR, Plus, Compare), where the bit shifts and the LUT can be hardwired.
If you have super fast gates for this in comparison to the rest, you can multiply the input clock by 16 and use that as internal timing for the SQRT module. Something similar to the old days when there was an MC clock derived by division of the source CPU clock in old Intel CPUs/MCUs... This way you can get 1T timing (or a multiple of it, depending on the multiplication ratio).
There is also the approach of converting to a logarithm, halving, and converting back.
For an idea how to implement "combinatorial" log and antilog, see Michael Dunn's EDN article showing priority encoder, barrel shifter & lookup table, with three log variants in System Verilog for down-load.
(Priority encoder, barrel shifter & lookup table look promising for "one-step-Babylonian/Heron/Newton/-Raphson. But that would probably still need a 128K by 9 bits lookup table.)
While not featuring "verilog",
Tole Sutikno: "An Optimized Square Root Algorithm for Implementation in FPGA Hardware" shows a combinatorial implementation of a modified (binary) digit-by-digit algorithm.
In 2018, T. Bagala, A. Fibich, M. Hagara,
P. Kubinec, O. Ondráček, V. Štofanik and R. Stojanović authored Single Clock Square Root Algorithm Based on Binomial Series and its FPGA Implementation.
Local oscillator runs at 50MHz [… For 16 bit input mantissa,] Values from [the hardware] experiment were the same as values from simulation […] Obtained delay averages were 892ps and 906ps respectively.
(No explanation is given for the discrepancy between 50 MHz and ~0.9 ns, or for the quoted ps resolution versus the use of a 10 Gsps scope. If it was about 18 cycles (due to pipelining rather than looping?) per ~900 ns, the interpretation of "Single Clock Square Root…" remains open - it may be one result per cycle.)
The paper discloses next to no details about the evaluation of the binomial series.
While the equations are presented in a general form, too, my guess is that the amount of hardware needed for a greater number of bits gets prohibitive quickly.
I got the code
here it is
module sqrt(
    input  [31:0] a,
    output [15:0] out
);
    reg [31:0] temp;
    reg [14:0] x;
    always @(a)
    begin
        if (a < 257) x = 4;
        if (a > 256 && a < 65537) x = 80;
        if (a > 65536 && a < 16777217) x = 1000;
        if (a > 16777216 && a <= 4294967295) x = 20000;
        temp = (x + (a / x)) / 2;
        temp = (temp + (a / temp)) / 2;
        temp = (temp + (a / temp)) / 2;
        temp = (temp + (a / temp)) / 2;
        temp = (temp + (a / temp)) / 2;
        temp = (temp + (a / temp)) / 2;
        temp = (temp + (a / temp)) / 2;
    end
    assign out = temp;
endmodule
The usual means of doing this in hardware is using a CORDIC. A general implementation allows the calculation of a variety of transcendental functions (cos/sin/tan) and... square roots depending on how you initialize and operate the CORDIC.
It's an iterative algorithm so to do it in a single cycle you'd unroll the loop into as many iterations as you require for your desired precision and chain the instances together.
Specifically, if you operate the CORDIC in vectoring mode, initialize it with [x, 0] and rotate to 45 degrees; the final [x', y'] output will be a multiplicative constant away, i.e. sqrt(x) = x' * sqrt(2) * K.
My version of Spektre's code with a variable input bit count, so it can be faster on short inputs.
const unsigned int isqrt_lut[16] =
{
    // m*m
    0x40000000,
    0x10000000,
    0x04000000,
    0x01000000,
    0x00400000,
    0x00100000,
    0x00040000,
    0x00010000,
    0x00004000,
    0x00001000,
    0x00000400,
    0x00000100,
    0x00000040,
    0x00000010,
    0x00000004,
    0x00000001,
};
/// Our largest golf ball image is about 74 pixels, so let's round up to a power of 2 and we get 128.
/// 128 squared is 16384 so our largest sqrt has to handle 16383 or 14 bits. Only positive values.
/// ** maxBitsIn is 2 to 32, always an even number **
/// Input value must always be less than (2^maxBitsIn) - 1
unsigned int isqrt(unsigned int xx, int maxBitsIn) {
    unsigned int x, m, a0, a1;
    int i;
    for (x = 0, a0 = 0, m = 0x01 << (maxBitsIn / 2 - 1), i = 16 - maxBitsIn / 2; m; m >>= 1, i++)
    {
        a1 = a0 + isqrt_lut[i] + (x << (16 - i));
        if (a1 <= xx) {
            a0 = a1;
            x |= m;
        }
    }
    return x;
}

Seeding the Newton iteration for cube root efficiently

How can I find the cube root of a number in an efficient way?
I think Newton-Raphson method can be used, but I don't know how to guess the initial solution programmatically to minimize the number of iterations.
This is a deceptively complex question. Here is a nice survey of some possible approaches.
In view of the "link rot" that overtook the Accepted Answer, I'll give a more self-contained answer focusing on the topic of quickly obtaining an initial guess suitable for superlinear iteration.
The "survey" by metamerist (Wayback link) provided some timing comparisons for various starting value/iteration combinations (both Newton and Halley methods are included). Its references are to works by W. Kahan, "Computing a Real Cube Root", and by K. Turkowski, "Computing the Cube Root".
metamerist updates the DEC-VAX era bit-fiddling technique of W. Kahan with this snippet, which "assumes 32-bit integers" and relies on IEEE 754 format for doubles "to generate initial estimates with 5 bits of precision":
inline double cbrt_5d(double d)
{
    const unsigned int B1 = 715094163;
    double t = 0.0;
    unsigned int* pt = (unsigned int*) &t;
    unsigned int* px = (unsigned int*) &d;
    pt[1] = px[1] / 3 + B1;
    return t;
}
The code by K. Turkowski provides slightly more precision ("approximately 6 bits") by a conventional powers-of-two scaling on float fr, followed by a quadratic approximation to its cube root over interval [0.125,1.0):
/* Compute seed with a quadratic qpproximation */
fr = (-0.46946116F * fr + 1.072302F) * fr + 0.3812513F;/* 0.5<=fr<1 */
and a subsequent restoration of the exponent of two (adjusted to one-third). The exponent/mantissa extraction and restoration make use of math library calls to frexp and ldexp.
Comparison with other cube root "seed" approximations
To appreciate those cube root approximations we need to compare them with other possible forms. First the criteria for judging: we consider the approximation on the interval [1/8,1], and we use best (minimizing the maximum) relative error.
That is, if f(x) is a proposed approximation to x^{1/3}, we find its relative error:
error_rel = max | f(x)/x^(1/3) - 1 | on [1/8,1]
The simplest approximation would of course be to use a single constant on the interval, and the best relative error in that case is achieved by picking f_0(x) = sqrt(2)/2, the geometric mean of the values at the endpoints. This gives 1.27 bits of relative accuracy, a quick but dirty starting point for a Newton iteration.
A better approximation would be the best first-degree polynomial:
f_1(x) = 0.6042181313*x + 0.4531635984
This gives 4.12 bits of relative accuracy, a big improvement but short of the 5-6 bits of relative accuracy promised by the respective methods of Kahan and Turkowski. But it's in the ballpark and uses only one multiplication (and one addition).
Finally, what if we allow ourselves a division instead of a multiplication? It turns out that with one division and two "additions" we can have the best linear-fractional function:
f_M(x) = 1.4774329094 - 0.8414323527/(x+0.7387320679)
which gives 7.265 bits of relative accuracy.
At a glance this seems like an attractive approach, but an old rule of thumb was to treat the cost of a FP division like three FP multiplications (and to mostly ignore the additions and subtractions). However with current FPU designs this is not realistic. While the relative cost of multiplications to adds/subtracts has come down, in most cases to a factor of two or even equality, the cost of division has not fallen but often gone up to 7-10 times the cost of multiplication. Therefore we must be miserly with our division operations.
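To see how such a seed is used in practice, here is a small C sketch that range-reduces with frexp/ldexp, seeds with the best first-degree polynomial quoted above, and then applies two Newton steps. The range-reduction bookkeeping and the choice of two iterations are this example's own, not taken from Kahan or Turkowski; positive inputs only, and note that each step here spends one of the divisions the preceding paragraph warns about.
#include <math.h>

double cbrt_seeded(double x)                 /* x > 0 assumed                      */
{
    int e;
    double fr = frexp(x, &e);                /* x = fr * 2^e, fr in [0.5, 1)       */
    int r = ((e % 3) + 3) % 3;               /* split e = 3k + r with r in {0,1,2} */
    int k = (e - r) / 3;
    fr = ldexp(fr, r);                       /* fr now in [0.5, 4)                 */
    if (fr >= 1.0) { fr *= 0.125; k += 1; }  /* bring fr into [1/8, 1)             */

    double y = 0.6042181313 * fr + 0.4531635984;  /* ~4-bit linear seed            */
    y = (2.0 * y + fr / (y * y)) / 3.0;           /* Newton step                   */
    y = (2.0 * y + fr / (y * y)) / 3.0;           /* Newton step                   */

    return ldexp(y, k);                      /* cbrt(x) = cbrt(fr) * 2^k           */
}
With a roughly 4-bit seed, two quadratically converging steps give on the order of 16 correct bits; a third step (or a Halley step) would be needed to approach full double precision.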
static double cubeRoot(double num) {
    double x = num;
    if (num >= 0) {
        for (int i = 0; i < 10; i++) {
            x = ((2 * x * x * x) + num) / (3 * x * x);
        }
    }
    return x;
}
It seems like the optimization question has already been addressed, but I'd like to add an improvement to the cubeRoot() function posted here, for other people stumbling on this page looking for a quick cube root algorithm.
The existing algorithm works well, but outside the range of 0-100 it gives incorrect results.
Here's a revised version that works with numbers between -/+1 quadrillion (1E15). If you need to work with larger numbers, just use more iterations.
static double cubeRoot( double num ){
    boolean neg = ( num < 0 );
    double abs = Math.abs( num );
    double x = abs;
    for( int i = 0, iterations = 60; i < iterations; i++ ){
        // iterate on the absolute value; the sign is restored at the end
        x = ( ( 2 * x * x * x ) + abs ) / ( 3 * x * x );
    }
    if( neg ){ return 0 - x; }
    return x;
}
Regarding optimization, I'm guessing the original poster was asking how to predict the minimum number of iterations for an accurate result, given an arbitrary input size. But it seems like for most general cases the gain from optimization isn't worth the added complexity. Even with the function above, 100 iterations takes less than 0.2 ms on average consumer hardware. If speed was of utmost importance, I'd consider using pre-computed lookup tables. But this is coming from a desktop developer, not an embedded systems engineer.

John Carmack's Unusual Fast Inverse Square Root (Quake III)

John Carmack has a special function in the Quake III source code which calculates the inverse square root of a float, 4x faster than regular (float)(1.0/sqrt(x)), including a strange 0x5f3759df constant. See the code below. Can someone explain line by line what exactly is going on here and why this works so much faster than the regular implementation?
float Q_rsqrt( float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;
    i  = 0x5f3759df - ( i >> 1 );
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );

#ifndef Q3_VM
#ifdef __linux__
    assert( !isnan(y) );
#endif
#endif
    return y;
}
FYI. Carmack didn't write it. Terje Mathisen and Gary Tarolli both take partial (and very modest) credit for it, as well as crediting some other sources.
How the mythical constant was derived is something of a mystery.
To quote Gary Tarolli:
Which actually is doing a floating point computation in integer - it took a long time to figure out how and why this works, and I can't remember the details anymore.
A slightly better constant, developed by an expert mathematician (Chris Lomont) trying to work out how the original algorithm worked is:
float InvSqrt(float x)
{
    float xhalf = 0.5f * x;
    int i = *(int*)&x;              // get bits for floating value
    i = 0x5f375a86 - (i >> 1);      // gives initial guess y0
    x = *(float*)&i;                // convert bits back to float
    x = x * (1.5f - xhalf * x * x); // Newton step, repeating increases accuracy
    return x;
}
In spite of this, his initial attempt at a mathematically 'superior' version of id's sqrt (which came to almost the same constant) proved inferior to the one initially developed by Gary, despite being mathematically much 'purer'. He couldn't explain why id's was so excellent, iirc.
Of course these days, it turns out to be much slower than just using an FPU's sqrt (especially on 360/PS3), because swapping between float and int registers induces a load-hit-store, while the floating point unit can do reciprocal square root in hardware.
It just shows how optimizations have to evolve as the nature of underlying hardware changes.
Greg Hewgill and IllidanS4 gave a link with excellent mathematical explanation.
I'll try to sum it up here for ones who don't want to go too much into details.
Any mathematical function, with some exceptions, can be represented by a polynomial sum:
y = f(x)
can be exactly transformed into:
y = a0 + a1*x + a2*(x^2) + a3*(x^3) + a4*(x^4) + ...
Where a0, a1, a2,... are constants. The problem is that for many functions, like the square root, the exact value requires an infinite number of terms; the sum does not end at some x^n. But if we stop at some x^n we still get a result up to some precision.
So, if we have:
y = 1/sqrt(x)
In this particular case they decided to discard all polynomial terms beyond the first two, probably because of calculation speed:
y = a0 + a1*x + [...discarded...]
And the task now comes down to calculating a0 and a1 so that y has the least difference from the exact value. They calculated that the most appropriate values are:
a0 = 0x5f375a86
a1 = -0.5
So when you put this into equation you get:
y = 0x5f375a86 - 0.5*x
Which is the same as the line you see in the code:
i = 0x5f375a86 - (i >> 1);
Edit: actually here y = 0x5f375a86 - 0.5*x is not the same as i = 0x5f375a86 - (i >> 1); since shifting the float's bits as an integer not only divides by two but also divides the exponent by two and causes some other artifacts, but it still comes down to calculating some coefficients a0, a1, a2... .
At this point they found out that this result's precision was not enough for the purpose. So they additionally did just one step of Newton's iteration to improve the result's accuracy:
x = x * (1.5f - xhalf * x * x)
They could have done some more iterations in a loop, each one improving the result, until the required accuracy was met. This is exactly how it works in a CPU/FPU! But it seems that only one iteration was enough, which was also a blessing for the speed. The CPU/FPU does as many iterations as needed to reach the accuracy of the floating point type in which the result is stored, and it has a more general algorithm which works for all cases.
So in short, what they did is:
Use (almost) the same algorithm as the CPU/FPU, exploit the improved initial conditions for the special case of 1/sqrt(x), and don't calculate all the way to the precision the CPU/FPU would go to, but stop earlier, thus gaining calculation speed.
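For completeness, here is the same trick restated in modern C, with memcpy doing the float/int reinterpretation instead of the pointer casts (which are undefined behaviour under strict aliasing); the constant and the single Newton step are exactly the ones discussed above:
#include <stdint.h>
#include <string.h>

float rsqrt_fast(float x)
{
    uint32_t i;
    float y;
    memcpy(&i, &x, sizeof i);           /* read the float's bit pattern  */
    i = 0x5f3759df - (i >> 1);          /* magic-constant initial guess  */
    memcpy(&y, &i, sizeof y);           /* back to float                 */
    y = y * (1.5f - 0.5f * x * y * y);  /* one Newton-Raphson refinement */
    return y;
}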
I was curious to see what the constant was as a float so I simply wrote this bit of code and googled the integer that popped out.
long i = 0x5F3759DF;
float* fp = (float*)&i;
printf("(2^127)^(1/2) = %f\n", *fp);
//Output
//(2^127)^(1/2) = 13211836172961054720.000000
It looks like the constant is "An integer approximation to the square root of 2^127 better known by the hexadecimal form of its floating-point representation, 0x5f3759df" https://mrob.com/pub/math/numbers-18.html
On the same site it explains the whole thing. https://mrob.com/pub/math/numbers-16.html#le009_16
According to this nice article written a while back...
The magic of the code, even if you can't follow it, stands out as the i = 0x5f3759df - (i>>1); line. Simplified, Newton-Raphson is an approximation that starts off with a guess and refines it with iteration. Taking advantage of the nature of 32-bit x86 processors, i, an integer, is initially set to the value of the floating point number you want to take the inverse square of, using an integer cast. i is then set to 0x5f3759df, minus itself shifted one bit to the right. The right shift drops the least significant bit of i, essentially halving it.
It's a really good read. This is only a tiny piece of it.
The code consists of two major parts. Part one calculates an approximation for 1/sqrt(y), and part two takes that number and runs one iteration of Newton's method to get a better approximation.
Calculating an approximation for 1/sqrt(y)
i = * ( long * ) &y;
i = 0x5f3759df - ( i >> 1 );
y = * ( float * ) &i;
Line 1 takes the floating point representation of y and treats it as an integer i. Line 2 shifts i over one bit and subtracts it from a mysterious constant. Line 3 takes the resulting number and converts it back to a standard float32. Now why does this work?
Let g be a function that maps a floating point number to its floating point representation, read as an integer. Line 1 above is setting i = g(y).
The following good approximation of g exists(*):
g(y) ≈ C * log_2(y) + D for some constants C and D. An intuition for why such a good approximation exists is that the floating point representation of y is roughly linear in the exponent.
The purpose of line 2 is to map from g(y) to g(1/sqrt(y)), after which line 3 can use g^-1 to map that number to 1/sqrt(y). Using the approximation above, we have g(1/sqrt(y)) ≈ C * log_2(1/sqrt(y)) + D = -C/2 * log_2(y) + D. We can use these formulas to calculate the map from g(y) to g(1/sqrt(y)), which is g(1/sqrt(y)) ≈ 3D/2 - 1/2 * g(y). In line 2, we have 0x5f3759df ≈ 3D/2, and i >> 1 ≈ 1/2 * g(y).
The constant 0x5f3759df is slightly smaller than the constant that gives the best possible approximation for g(1/sqrt(y)). That is because this step is not done in isolation. Due to the direction that Newton's method tends to miss in, using a slightly smaller constant tends to yield better results. The exact optimal constant to use in this setting depends on your input distribution of y, but 0x5f3759df is one such constant that gives good results over a fairly broad range.
A more detailed description of this process can be found on Wikipedia: https://en.wikipedia.org/wiki/Fast_inverse_square_root#Algorithm
(*) More explicitly, let y = 2^e*(1+f). Taking the log of both sides, we get log_2 y = e + log_2(1+f), which can be approximated as log_2 y ≈ e + f + σ for a small constant sigma. Separately, the float32 encoding of y expressed as an integer is g(y) ≈ 2^23 * (e+127) + f * 2^23. Combining the two equations, we get g(y) ≈ 2^23 * log_2 y + 2^23 * (127 - σ).
Using Newton's method
y = y * ( threehalfs - ( x2 * y * y ) );
Consider the function f(y) = 1/y^2 - num. The positive zero of f is y = 1/sqrt(num), which is what we are interested in calculating.
Newton's method is an iterative algorithm for taking an approximation y_n for the zero of a function f, and calculating a better approximation y_n+1, using the following equation: y_n+1 = y_n - f(y_n)/f'(y_n).
Calculating what that looks like for our function f gives the following equation: y_n+1 = y_n - (-y_n+y_n^3*num)/2 = y_n * (3/2 - num/2 * y_n * y_n). This is exactly what the line of code above is doing.
You can learn more about the details of Newton's method here: https://en.wikipedia.org/wiki/Newton%27s_method

Representing continuous probability distributions

I have a problem involving a collection of continuous probability distribution functions, most of which are determined empirically (e.g. departure times, transit times). What I need is some way of taking two of these PDFs and doing arithmetic on them. E.g. if I have two values x taken from PDF X, and y taken from PDF Y, I need to get the PDF for (x+y), or any other operation f(x,y).
An analytical solution is not possible, so what I'm looking for is some representation of PDFs that allows such things. An obvious (but computationally expensive) solution is monte-carlo: generate lots of values of x and y, and then just measure f(x, y). But that takes too much CPU time.
I did think about representing the PDF as a list of ranges where each range has a roughly equal probability, effectively representing the PDF as the union of a list of uniform distributions. But I can't see how to combine them.
Does anyone have any good solutions to this problem?
Edit: The goal is to create a mini-language (aka Domain Specific Language) for manipulating PDFs. But first I need to sort out the underlying representation and algorithms.
Edit 2: dmckee suggests a histogram implementation. That is what I was getting at with my list of uniform distributions. But I don't see how to combine them to create new distributions. Ultimately I need to find things like P(x < y) in cases where this may be quite small.
Edit 3: I have a bunch of histograms. They are not evenly distributed because I'm generating them from occurrence data, so basically if I have 100 samples and I want ten points in the histogram then I allocate 10 samples to each bar, and make the bars variable width but constant area.
I've figured out that to add PDFs you convolve them, and I've boned up on the maths for that. When you convolve two uniform distributions you get a new distribution with three sections: the wider uniform distribution is still there, but with a triangle stuck on each side the width of the narrower one. So if I convolve each element of X and Y I'll get a bunch of these, all overlapping. Now I'm trying to figure out how to sum them all and then get a histogram that is the best approximation to it.
I'm beginning to wonder if Monte-Carlo wasn't such a bad idea after all.
Edit 4: This paper discusses convolutions of uniform distributions in some detail. In general you get a "trapezoid" distribution. Since each "column" in the histograms is a uniform distribution, I had hoped that the problem could be solved by convolving these columns and summing the results.
However the result is considerably more complex than the inputs, and also includes triangles. Edit 5: [Wrong stuff removed]. But if these trapezoids are approximated to rectangles with the same area then you get the Right Answer, and reducing the number of rectangles in the result looks pretty straightforward too. This might be the solution I've been trying to find.
Edit 6: Solved! Here is the final Haskell code for this problem:
import Data.Function (on)
import Data.List (groupBy, sortBy)
import Data.Ord (comparing)

-- | Continuous distributions of scalars are represented as a
-- | histogram where each bar has approximately constant area but
-- | variable width and height.  A histogram with N bars is stored as
-- | a list of N+1 values.
data Continuous = C {
      cN :: Int,
      -- ^ Number of bars in the histogram.
      cAreas :: [Double],
      -- ^ Areas of the bars.  #length cAreas == cN#
      cBars :: [Double]
      -- ^ Boundaries of the bars.  #length cBars == cN + 1#
   } deriving (Show, Read)
{- | Add distributions.  If two random variables #vX# and #vY# are
taken from distributions #x# and #y# respectively then the
distribution of #(vX + vY)# will be #(x .+. y)#.
This is implemented as the convolution of distributions x and y.
Each is a histogram, which is to say the sum of a collection of
uniform distributions (the "bars"). Therefore the convolution can be
computed as the sum of the convolutions of the cross product of the
components of x and y.
When you convolve two uniform distributions of unequal size you get a
trapezoidal distribution. Let p = p2-p1, q = q2-q1. Then we get:
> | |
> | ______ |
> | | | with | _____________
> | | | | | |
> +-----+----+------- +--+-----------+-
> p1 p2 q1 q2
>
> gives h|....... _______________
> | /: :\
> | / : : \ 1
> | / : : \ where h = -
> | / : : \ q
> | / : : \
> +--+-----+-------------+-----+-----
> p1+q1 p2+q1 p1+q2 p2+q2
However we cannot keep the trapezoid in the final result because our
representation is restricted to uniform distributions. So instead we
store a uniform approximation to the trapezoid with the same area:
> h|......___________________
> | | / \ |
> | |/ \|
> | | |
> | /| |\
> | / | | \
> +-----+-------------------+--------
> p1+q1+p/2 p2+q2-p/2
-}
(.+.) :: Continuous -> Continuous -> Continuous
c .+. d = C {cN = length bars - 1,
             cBars = map fst bars,
             cAreas = zipWith barArea bars (tail bars)}
   where
      -- The convolve function returns a list of two (x, deltaY) pairs.
      -- These can be sorted by x and then sequentially summed to get
      -- the new histogram.  The "b" parameter is the product of the
      -- height of the input bars, which was omitted from the diagrams
      -- above.
      convolve b c1 c2 d1 d2 =
         if (c2-c1) < (d2-d1) then convolve1 b c1 c2 d1 d2
                              else convolve1 b d1 d2 c1 c2
      convolve1 b p1 p2 q1 q2 = [(p1+q1+halfP, h), (p2+q2-halfP, (-h))]
         where
            halfP = (p2-p1)/2
            h = b / (q2-q1)
      outline = map sumGroup $ groupBy ((==) `on` fst) $ sortBy (comparing fst)
                $ concat
                   [convolve (areaC*areaD) c1 c2 d1 d2 |
                      (c1, c2, areaC) <- zip3 (cBars c) (tail $ cBars c) (cAreas c),
                      (d1, d2, areaD) <- zip3 (cBars d) (tail $ cBars d) (cAreas d)]
      sumGroup pairs = (fst $ head pairs, sum $ map snd pairs)
      bars = tail $ scanl (\(_,y) (x2,dy) -> (x2, y+dy)) (0, 0) outline
      barArea (x1, h) (x2, _) = (x2 - x1) * h
Other operators are left as an exercise for the reader.
No need for histograms or symbolic computation: everything can be done at the language level in closed form, if the right point of view is taken.
[I shall use the term "measure" and "distribution" interchangeably. Also, my Haskell is rusty and I ask you to forgive me for being imprecise in this area.]
Probability distributions are really codata.
Let mu be a probability measure. The only thing you can do with a measure is integrate it against a test function (this is one possible mathematical definition of "measure"). Note that this is what you will eventually do: for instance integrating against identity is taking the mean:
mean :: Measure -> Double
mean mu = mu id
another example:
variance :: Measure -> Double
variance mu = (mu $ \x -> x ^ 2) - (mean mu) ^ 2
another example, which computes P(mu < x):
cdf :: Measure -> Double -> Double
cdf mu x = mu $ \z -> if z < x then 1 else 0
This suggests an approach by duality.
The type Measure shall therefore denote the type (Double -> Double) -> Double. This allows you to model results of MC simulation, numerical/symbolic quadrature against a PDF, etc. For instance, the function
empirical :: [Double] -> Measure
empirical xs f = sum (map f xs) / fromIntegral (length xs)
returns the integral of f against an empirical measure obtained by eg. MC sampling. Also
from_pdf :: (Double -> Double) -> Measure
from_pdf rho f = my_favorite_quadrature_method rho f
construct measures from (regular) densities.
Now, the good news. If mu and nu are two measures, the convolution mu ** nu is given by:
(mu ** nu) f = nu $ \y -> (mu $ \x -> f $ x + y)
So, given two measures, you can integrate against their convolution.
Also, given a random variable X of law mu, the law of a * X is given by:
rescale :: Double -> Measure -> Measure
rescale a mu f = mu $ \x -> f(a * x)
Also, the distribution of phi(X) is given by the image measure phi_* X, in our framework:
apply :: (Double -> Double) -> Measure -> Measure
apply phi mu f = mu $ f . phi
So now you can easily work out an embedded language for measures. There are much more things to do here, particularly with respect to sample spaces other than the real line, dependencies between random variables, conditionning, but I hope you get the point.
In particular, the pushforward is functorial:
newtype Measure a = Measure ((a -> Double) -> Double)

instance Functor Measure where
   fmap phi (Measure mu) = Measure (\f -> mu (f . phi))   -- i.e. 'apply' from above
It is a monad too (exercise -- hint: this very much looks like the continuation monad. What is return ? What is the analog of call/cc ?).
Also, combined with a differential geometry framework, this can probably be turned into something which compute Bayesian posterior distributions automatically.
At the end of the day, you can write stuff like
m = mean $ apply cos ((from_pdf gauss) ** (empirical samples))
to compute the mean of cos(X + Y) where X has pdf gauss and Y has been sampled by a MC method whose results are in samples.
Probability distributions form a monad; see eg the work of Claire Jones and also the LICS 1989 paper, but the ideas go back to a 1982 paper by Giry (DOI 10.1007/BFb0092872) and to a 1962 note by Lawvere that I cannot track down (http://permalink.gmane.org/gmane.science.mathematics.categories/6541).
But I don't see the comonad: there's no way to get an "a" out of an "(a->Double)->Double". Perhaps if you make it polymorphic - (a->r)->r for all r? (That's the continuation monad.)
Is there anything that stops you from employing a mini-language for this?
By that I mean, define a language that lets you write f = x + y and evaluates f for you just as written. And similarly for g = x * z, h = y(x), etc. ad nauseam. (The semantics I'm suggesting call for the evaluator to select a random number on each innermost PDF appearing on the RHS at evaluation time, and not to try to understand the composed form of the resulting PDFs. This may not be fast enough...)
Assuming that you understand the precision limits you need, you can represent a PDF fairly simply with a histogram or spline (the former being a degenerate case of the later). If you need to mix analytically defined PDFs with experimentally determined ones, you'll have to add a type mechanism.
A histogram is just an array, the contents of which represent the incidence in a particular region of the input range. You haven't said if you have a language preference, so I'll assume something C-like. You need to know the bin structure (uniform sizes are easy, but not always best) including the high and low limits and possibly the normalization:
struct histogram_struct {
    int bins;               /* Assumed to be uniform */
    double low;
    double high;
    /* double normalization; */
    /* double *errors; */   /* if using, initialize with enough space,
                             * and store _squared_ errors
                             */
    double contents[];
};
This kind of thing is very common in scientific analysis software, and you might want to use an existing implementation.
I worked on similar problems for my dissertation.
One way to compute approximate convolutions is to take the Fourier transform of the density functions (histograms in this case), multiply them, then take the inverse Fourier transform to get the convolution.
Look at Appendix C of my dissertation for formulas for various special cases of operations on probability distributions. You can find the dissertation at: http://riso.sourceforge.net
I wrote Java code to carry out those operations. You can find the code at: https://sourceforge.net/projects/riso
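As a reference point, the FFT route computes the same result as a direct discrete convolution of the binned densities, which for two equally-binned histograms is just a double loop. A sketch in C (the helper name and the common bin width dx are assumptions of this example):
void convolve_hist(const double *p, int np,
                   const double *q, int nq,
                   double dx, double *out)      /* out has np + nq - 1 bins */
{
    for (int k = 0; k < np + nq - 1; ++k)
        out[k] = 0.0;
    for (int i = 0; i < np; ++i)
        for (int j = 0; j < nq; ++j)
            out[i + j] += p[i] * q[j] * dx;     /* mass of the sum lands in bin i+j */
}
The FFT version pays off once np and nq get large, since the direct double loop is O(np*nq).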
Autonomous mobile robotics deals with a similar issue in localization and navigation, in particular Markov localization and the Kalman filter (sensor fusion). See "An experimental comparison of localization methods continued", for example.
Another approach you could borrow from mobile robots is path planning using potential fields.
A couple of responses:
1) If you have empirically determined PDFs then either you have histograms or you have an approximation to a parametric PDF. A PDF is a continuous function and you don't have infinite data...
2) Let's assume that the variables are independent. Then if you make the PDF discrete, P(f(x,y)) = f(x,y)p(x,y) = f(x,y)p(x)p(y), summed over all the combinations of x and y such that f(x,y) meets your target.
If you are going to fit the empirical PDFs to standard PDFs, e.g. the normal distribution, then you can use already-determined functions to figure out the sum, etc.
If the variables are not independent, then you have more trouble on your hands and I think you have to use copulas.
I think that defining your own mini-language, etc., is overkill. you can do this with arrays...
Some initial thoughts:
First, Mathematica has a nice facility for doing this with exact distributions.
Second, representation as histograms (ie, empirical PDFs) is problematic since you have to make choices about bin size. That can be avoided by storing a cumulative distribution instead, ie, an empirical CDF. (In fact, you then retain the ability to recreate the full data set of samples that the empirical distribution is based on.)
Here's some ugly Mathematica code to take a list of samples and return an empirical CDF, namely a list of value-probability pairs. Run the output of this through ListPlot to see a plot of the empirical CDF.
empiricalCDF[t_] :=
  Flatten[{{#[[2,1]],#[[1,2]]},#[[2]]}&/@Partition[Prepend[Transpose[{#[[1]],
    Rest[FoldList[Plus,0,#[[2]]]]/Length[t]}&[Transpose[{First[#],Length[#]}&/@
    Split[Sort[t]]]]],{Null,0}],2,1],1]
Finally, here's some information on combining discrete probability distributions:
http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/Chapter7.pdf
I think the histograms or the list of 1/N area regions is a good idea. For the sake of argument, I'll assume that you'll have a fixed N for all distributions.
Use the paper you linked edit 4 to generate the new distribution. Then, approximate it with a new N-element distribution.
If you don't want N to be fixed, it's even easier. Take each convex polygon (trapezoid or triangle) in the new generated distribution and approximate it with a uniform distribution.
Another suggestion is to use kernel densities. Especially if you use Gaussian kernels, then they can be relatively easy to work with... except that the distributions quickly explode in size without care. Depending on the application, there are additional approximation techniques like importance sampling that can be used.
If you want some fun, try representing them symbolically like Maple or Mathematica would do. Maple uses directed acyclic graphs, while Mathematica uses a list/lisp-like approach (I believe, but it's been a loooong time since I even thought about this).
Do all your manipulations symbolically, then at the end push through numerical values. (Or just find a way to launch off in a shell and do the computations).
Paul.
