How to fix this floating point square root algorithm - algorithm

I am trying to compute the IEEE-754 32-bit floating-point square root of various inputs, but for one particular input the algorithm below, based on the Newton-Raphson method, won't converge. What can I do to fix the problem? The platform I am designing for has a 32-bit floating-point adder/subtracter, multiplier, and divider.
For input 0x7F7FFFFF (3.4028234663852886E38), the algorithm won't converge to the correct answer of 18446743523953729536.000000. Instead it gives 18446743523953737728.000000.
I am using MATLAB to implement my code before I implement this in hardware. I can only use single precision floating point values, (SO NO DOUBLES).
clc; clear; close all;
% Input
R = typecast(uint32(hex2dec(num2str(dec2hex(((hex2dec('7F7FFFFF'))))))),'single')
% Initial estimate
OneOverRoot2 = single(1/sqrt(2));
Root2 = single(sqrt(2));
% Get low and high bits of input R
hexdata_high = bitand(bitshift(hex2dec(num2hex(single(R))),-16),hex2dec('ffff'));
hexdata_low = bitand(hex2dec(num2hex(single(R))),hex2dec('ffff'));
% Change exponent of input to -1 to get Mantissa
temp = bitand(hexdata_high,hex2dec('807F'));
Expo = bitshift(bitand(hexdata_high,hex2dec('7F80')),-7);
hexdata_high = bitor(temp,hex2dec('3F00'));
b = typecast(uint32(hex2dec(num2str(dec2hex(((bitshift(hexdata_high,16)+ hexdata_low)))))),'single');
% If exponent is odd ...
if (bitand(Expo,1))
    % Pretend the mantissa [0.5 ... 1.0) is multiplied by 2 as Expo is odd,
    % so it now has the value [1.0 ... 2.0)
    % Estimate the sqrt(mantissa) as [1.0 ... sqrt(2))
    % IOW: linearly map (0.5 ... 1.0) to (1.0 ... sqrt(2))
    Mantissa = (Root2 - 1.0)/(1.0 - 0.5)*(b - 0.5) + 1.0;
else
    % The mantissa is in range [0.5 ... 1.0)
    % Estimate the sqrt(mantissa) as [1/sqrt(2) ... 1.0)
    % IOW: linearly map (0.5 ... 1.0) to (1/sqrt(2) ... 1.0)
    Mantissa = (1.0 - OneOverRoot2)/(1.0 - 0.5)*(b - 0.5) + OneOverRoot2;
end
newS = Mantissa*2^(bitshift(Expo-127,-1));
S = newS; % starting value for the iteration below
% S = (S + R/S)/2 method
for j = 1:6
    fprintf('S %u %f %f\n', j, S, (S-sqrt(R)));
    S = single((single(S) + single(single(R)/single(S))))/2;
    S = single(S);
end
goodaccuracy = (abs((single(S)-single(sqrt(single(R)))))) < 2^-23
difference = (abs((single(S)-single(sqrt(single(R))))))
% Get hexadecimal output
hexdata_high = (bitand(bitshift(hex2dec(num2hex(single(S))),-16),hex2dec('ffff')));
hexdata_low = (bitand(hex2dec(num2hex(single(S))),hex2dec('ffff')));
fprintf('FLOAT: T Input: %e\t\tCorrect: %e\t\tMy answer: %e\n', R, sqrt(R), S);
fprintf('output hex = 0x%04X%04X\n',hexdata_high,hexdata_low);
out = hex2dec(num2hex(single(S)));

I took a whack at this. Here's what I came up with:
#include <math.h>

float mysqrtf(float f) {
    if (f < 0) return 0.0f/0.0f;
    if (f == 1.0f / 0.0f) return f;
    if (f != f) return f;

    // half-ass an initial guess of 1.0.
    int expo;
    float foo = frexpf(f, &expo);
    float s = 1.0;
    if (expo & 1) foo *= 2, expo--;

    // this is the only case for which what's below fails.
    if (foo == 0x0.ffffffp+0) return ldexpf(0x0.ffffffp+0, expo/2);

    // do four newton iterations.
    for (int i = 0; i < 4; i++) {
        float diff = s*s-foo;
        diff /= s;
        s -= diff/2;
    }

    // do one last newton iteration, computing s*s-foo exactly.
    float scal = s >= 1 ? 4096 : 2048;
    float shi = (s + scal) - scal; // high 12 bits of significand
    float slo = s - shi; // rest of significand
    float diff = shi * shi - foo; // subtraction exact by sterbenz's theorem
    diff += 2 * shi * slo; // opposite signs; exact by sterbenz's theorem
    diff += slo * slo;
    diff /= s; // diff == fma(s, s, -foo) / s.
    s -= diff/2;
    return ldexpf(s, expo/2);
}
The first thing to analyse is the formula (s*s-foo)/s in floating-point arithmetic. If s is a sufficiently good approximation to sqrt(foo), Sterbenz's theorem tells us that the numerator is within an ulp(foo) of the right answer --- all of that error is approximation error from computing s*s. Then we divide by s; this gives us at worst another half-ulp of approximation error. So, even without a fused multiply-add, diff is within 1.5 ulp of what it should be. And we divide it by two.
Notice that the initial guess doesn't in and of itself matter as long as you follow it up with enough Newton iterations.
Measure the error of an approximation s to sqrt(foo) by abs(s - foo/s). The error of my initial guess of 1 is at most 1. A Newton iteration in exact arithmetic squares the error and divides it by 4. A Newton iteration in floating-point arithmetic --- the kind I do four times --- squares the error, divides it by 4, and kicks in another 0.75 ulp of error. You do this four times and you find you have a relative error at most 0x0.000000C4018384, which is about 0.77 ulp. This means that four Newton iterations yield a faithfully-rounded result.
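The exact-arithmetic part of that claim can be checked with rational numbers. The sketch below is only an illustration: Python's `fractions` module stands in for exact arithmetic, and the names `err` and `newton_step` are made up here. It checks the identity s_new^2 - foo = (s - foo/s)^2 / 4, which means the new error is the old error squared, divided by 4*s_new.

```python
from fractions import Fraction

def err(s, foo):
    """Error measure used above: s - foo/s (zero exactly when s*s == foo)."""
    return s - foo / s

def newton_step(s, foo):
    """One exact Newton step for sqrt(foo): s - (s*s - foo) / (2*s)."""
    return s - (s * s - foo) / (2 * s)

foo = Fraction(3, 2)          # an example value in [1/2, 2)
s0 = Fraction(1)              # the initial guess used in the answer
s1 = newton_step(s0, foo)

e0 = err(s0, foo)
e1 = err(s1, foo)

# Exact identities: s1^2 - foo == e0^2 / 4, hence e1 == e0^2 / (4 * s1).
assert s1 * s1 - foo == e0 * e0 / 4
assert e1 == e0 * e0 / (4 * s1)
```

The floating-point version adds the rounding terms discussed above on top of this identity.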
I do a fifth Newton step to get a correctly-rounded square root. The reason why it works is a little more intricate.
shi holds the "top half" of s while slo holds the "bottom half." The last 12 bits in each significand will be zero. This means, in particular, that shi * shi and shi * slo and slo * slo are exactly representable as floats.
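To see the split in action without a C compiler, single precision can be emulated in Python by rounding every intermediate result through `struct`. This is a sketch under that assumption; `f32` and `split` are helper names invented for the illustration, and the test value is arbitrary.

```python
import struct

def f32(x):
    """Round a Python float (a double) to the nearest IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def split(s):
    """Split a single-precision s in [0.5, 2) into shi + slo as in the C code."""
    scal = 4096.0 if s >= 1.0 else 2048.0
    shi = f32(f32(s + scal) - scal)   # high 12 bits of the significand
    slo = f32(s - shi)                # remaining low bits
    return shi, slo

s = f32(1.2345678)
shi, slo = split(s)

# The split is exact, and each partial product fits in 24 bits,
# so rounding it to single precision changes nothing.
assert shi + slo == s
for product in (shi * shi, shi * slo, slo * slo):
    assert f32(product) == product
```

The products are formed in double precision first, where they are exact, so comparing against their single-precision rounding is a fair test of representability.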
s*s is within two ulps of foo. shi*shi is within 2047 ulps of s*s. Thus shi * shi - foo is within 2049 ulps of zero; in particular, it's exactly representable and less than 2^-10.
You can check that you can add 2 * shi * slo and get an exactly-representable result that's within 2^-22 of zero and then add slo*slo and get an exactly representable result --- s*s-foo computed exactly.
When you divide by s, you kick in an additional half-ulp of error, which is at most 2^-48 here since our error was already so small.
Now we do a Newton step. We've computed the current error correctly to within 2^-46. Adding half of it to s gives us the square root to within 3*2^-48.
To turn this into a guarantee of correct rounding, we need to prove that there are no floats between 1/2 and 2, other than the one I special-cased, whose square roots are within 3*2^-48 of a midpoint between two consecutive floats. You can do some error analysis, get a Diophantine equation, find all of the solutions of that Diophantine equation, find which inputs they correspond to, and work out what the algorithm does on those. (If you do this, there is one "physical" solution and a bunch of "unphysical" solutions. The one real solution is the only thing I special-cased.) There may be a cleaner way, however.
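For the one special-cased input, 0x0.ffffffp+0, it is easy to confirm with exact rational arithmetic that its square root really does sit that close to a midpoint. The following is a small sketch using Python's `fractions` module; the variable names are chosen for the illustration only.

```python
from fractions import Fraction

ulp = Fraction(1, 2**24)          # spacing of singles in [0.5, 1)
foo = 1 - ulp                     # 0x0.ffffffp+0, the special-cased input
mid = 1 - ulp / 2                 # midpoint between the singles foo and 1.0

# mid^2 exceeds foo by exactly 2^-50, so sqrt(foo) lies just below mid
# and the correctly rounded square root of foo is foo itself.
gap_sq = mid * mid - foo
assert gap_sq == Fraction(1, 2**50)

# (mid - sqrt(foo)) * (mid + sqrt(foo)) == gap_sq and sqrt(foo) > foo,
# so mid - sqrt(foo) < gap_sq / (mid + foo).
bound = gap_sq / (mid + foo)
assert bound < 3 * Fraction(1, 2**48)
```

This matches the special case in the code, which returns the input's significand unchanged as its own square root.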


random increasing sequence with O(1) access to any element?

I have an interesting math/CS problem. I need to sample from a possibly infinite random sequence of increasing values, X, with X(i) > X(i-1), with some distribution between them. You could think of this as the sum of a different sequence D of uniform random numbers in [0,d). This is easy to do if you start from the first one and go from there; you just add a random amount to the sum each time. But the catch is, I want to be able to get any element of the sequence in faster than O(n) time, ideally O(1), without storing the whole list. To be concrete, let's say I pick d=1, so one possibility for D (given a particular seed) and its associated X is:
D={.1, .5, .2, .9, .3, .3, .6 ...} // standard random sequence, elements in [0,1)
X={.1, .6, .8, 1.7, 2.0, 2.3, 2.9, ...} // increasing random values; partial sum of D
(I don't really care about D, I'm just showing one conceptual way to construct X, my sequence of interest.) Now I want to be able to compute the value of X[1] or X[1000] or X[1000000] equally fast, without storing all the values of X or D. Can anyone point me to some clever algorithm or a way to think about this?
(Yes, what I'm looking for is random access into a random sequence -- with two different meanings of random. Makes it hard to google for!)
Since D is pseudorandom, there’s a space-time tradeoff possible:
O(sqrt(n))-time retrievals using O(sqrt(n)) storage locations (or,
in general, O(n**alpha)-time retrievals using O(n**(1-alpha))
storage locations). Assume zero-based indexing and that
X[n] = D[0] + D[1] + ... + D[n-1]. Compute and store
Y[s] = X[s**2]
for all s**2 <= n in the range of interest. To look up X[n], let
s = floor(sqrt(n)) and return
Y[s] + D[s**2] + D[s**2+1] + ... + D[n-1].
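A minimal sketch of this checkpoint scheme follows. It assumes D[i] can be produced as a pure function of a seed and the index; here that is done by hashing, which is one choice among many. The names `d`, `build_checkpoints` and `x` are made up for the example.

```python
import hashlib
import math

def d(seed, i):
    """Pseudorandom increment D[i] in [0, 1), a pure function of (seed, i)."""
    digest = hashlib.sha256(f"{seed}:{i}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53

def build_checkpoints(seed, n_max):
    """Store Y[s] = X[s**2] for every s with s**2 <= n_max."""
    checkpoints = []
    total, i, s = 0.0, 0, 0
    while s * s <= n_max:
        while i < s * s:
            total += d(seed, i)
            i += 1
        checkpoints.append(total)
        s += 1
    return checkpoints

def x(seed, checkpoints, n):
    """X[n] = D[0] + ... + D[n-1], adding at most about 2*sqrt(n) fresh terms."""
    s = math.isqrt(n)
    total = checkpoints[s]
    for i in range(s * s, n):
        total += d(seed, i)
    return total

checkpoints = build_checkpoints(seed=42, n_max=10_000)
print(x(42, checkpoints, 1000))
```

Building the checkpoints still costs O(n) once; only the lookups afterwards are O(sqrt(n)).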
EDIT: here's the start of an approach based on the following idea.
Let Dist(1) be the uniform distribution on [0, d) and let Dist(k) for k > 1 be the distribution of the sum of k independent samples from Dist(1). We need fast, deterministic methods to (i) pseudorandomly sample Dist(2**p) and (ii) given that X and Y are distributed as Dist(2**p), pseudorandomly sample X conditioned on the outcome of X + Y.
Now imagine that the D array constitutes the leaves of a complete binary tree of size 2**q. The values at interior nodes are the sums of the values at their two children. The naive way is to fill the D array directly, but then it takes a long time to compute the root entry. The way I'm proposing is to sample the root from Dist(2**q). Then, sample one child according to Dist(2**(q-1)) given the root's value. This determines the value of the other, since the sum is fixed. Work recursively down the tree. In this way, we look up tree values in time O(q).
Here's an implementation for Gaussian D. I'm not sure it's working properly.
import hashlib, math

def random_oracle(seed):
    h = hashlib.sha512()
    h.update(str(seed).encode())  # hash the seed so that each seed gets its own value
    x = 0.0
    for b in h.digest():
        x = ((x + b) / 256.0)
    return x

def sample_gaussian(variance, seed):
    u0 = random_oracle((2 * seed))
    u1 = random_oracle(((2 * seed) + 1))
    return (math.sqrt((((- 2.0) * variance) * math.log((1.0 - u0)))) * math.cos(((2.0 * math.pi) * u1)))

def sample_children(sum_outcome, sum_variance, seed):
    difference_outcome = sample_gaussian(sum_variance, seed)
    return (((sum_outcome + difference_outcome) / 2.0), ((sum_outcome - difference_outcome) / 2.0))

def sample_X(height, i):
    assert (0 <= i <= (2 ** height))
    total = 0.0
    z = sample_gaussian((2 ** height), 0)
    seed = 1
    for j in range(height, 0, (- 1)):
        (x, y) = sample_children(z, (2 ** j), seed)
        assert (abs(((x + y) - z)) <= 1e-09)
        seed *= 2
        if (i >= (2 ** (j - 1))):
            # The target lies in the right half: add the left subtree, descend right.
            i -= (2 ** (j - 1))
            total += x
            z = y
            seed += 1
        else:
            z = x
    return total

def test(height):
    X = [sample_X(height, i) for i in range(((2 ** height) + 1))]
    D = [(X[(i + 1)] - X[i]) for i in range((2 ** height))]
    mean = (sum(D) / len(D))
    variance = (sum((((d - mean) ** 2) for d in D)) / (len(D) - 1))
    print(mean, math.sqrt(variance))
    with open('data', 'w') as f:
        for d in D:
            print(d, file=f)

if (__name__ == '__main__'):
    test(10)  # the height used here is an arbitrary example value
If you do not record the values in X that you have previously generated, there is no way to guarantee that the elements of X you generate on the fly will be in increasing order. Furthermore, there seems to be no way to avoid O(n) worst-case time per query unless you know how to quickly generate the CDF for the sum of the first m random variables in D for any choice of m.
If you want the ith value X(i) from a particular realization, I can't see how you could do this without generating the sequence up to i. Perhaps somebody else can come up with something clever.
Would you be willing to accept a value which is plausible in the sense that it has the same distribution as the X(i)'s you would observe across multiple realizations of the X process? If so, it should be pretty easy. X(i) will be asymptotically normally distributed with mean i/2 (since it's the sum of the Dk's for k=1,...,i, the D's are Uniform(0,1), and the expected value of a D is 1/2) and variance i/12 (since the variance of a D is 1/12 and the variance of a sum of independent random variables is the sum of their variances).
Because of the asymptotic aspect, I'd pick some threshold value for i to switch over from direct summing to using the normal. For example, if you use i = 12 as your threshold you would use actual summing of uniforms for values of i from 1 to 11, and generate a Normal(i/2, sqrt(i/12)) value for i >= 12. That's an O(1) algorithm since the total work is bounded by your threshold, and the results produced will be distributionally representative of what you would see if you actually went through the summing.
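As a rough sketch of that switch-over, assuming Uniform(0,1) increments and the threshold of 12 from the example (the function name `plausible_x` is made up here):

```python
import math
import random

THRESHOLD = 12  # switch-over point suggested in the answer

def plausible_x(i, rng):
    """Draw a value distributed (approximately) like X(i) for Uniform(0,1) steps."""
    if i < THRESHOLD:
        return sum(rng.random() for _ in range(i))   # exact: sum of i uniforms
    return rng.gauss(i / 2, math.sqrt(i / 12))       # normal approximation

rng = random.Random(2024)
print(plausible_x(5, rng), plausible_x(1_000_000, rng))
```

Note that separate calls are independent draws, so values for different i do not belong to one realization and are not guaranteed to be increasing.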

Matthews Correlation Coefficient yielding values outside of [-1,1]

I'm using the formula found on Wikipedia for calculating Matthew's Correlation Coefficient. It works fairly well, most of the time, but I'm running into problems in my tool's implementation, and I'm not seeing the problem.
MCC = ((TP*TN)-(FP*FN))/sqrt(((TP + FP)( TP + FN )( TN + FP )( TN + FN )))
Where TP, TN, FP, and FN are the non-negative, integer counts of the appropriate fields.
Which should only return values in [-1,1].
My implementation is as follows:
double ret;
if ((TruePositives + FalsePositives) == 0 || (TruePositives + FalseNegatives) == 0 ||
    (TrueNegatives + FalsePositives) == 0 || (TrueNegatives + FalseNegatives) == 0)
{
    //To avoid dividing by zero
    ret = (double)(TruePositives * TrueNegatives -
                   FalsePositives * FalseNegatives);
}
else
{
    double num = (double)(TruePositives * TrueNegatives -
                          FalsePositives * FalseNegatives);
    double denom = (TruePositives + FalsePositives) *
                   (TruePositives + FalseNegatives) *
                   (TrueNegatives + FalsePositives) *
                   (TrueNegatives + FalseNegatives);
    denom = Math.Sqrt(denom);
    ret = num / denom;
}
return ret;
When I use this, as I said it works properly most of the time, but for instance if TP=280, TN = 273, FP = 67, and FN = 20, then we get:
MCC = ((280*273)-(67*20))/sqrt(347*300*340*293) = 75100/42196.06 = (approx) 1.78
Is this normal behavior of Matthews Correlation Coefficient? I'm a programmer by trade, so statistics aren't a part of my formal training. Also, I've looked at questions with answers, and none of them discuss this behavior. Is it a bug in my code or in the formula itself?
The code is clear and looks correct. (But one's eyes can always deceive.)
One concern is whether the output is guaranteed to lie between -1 and 1. Assuming all inputs are nonnegative, though, we can round the numerator up and the denominator down, thereby overestimating the result, by zeroing out all the "False*" terms, producing
TP*TN / Sqrt(TP*TN*TP*TN) = 1.
The lower limit is obtained similarly by zeroing out all the "True*" terms. Therefore, working code cannot produce a value larger than 1 in size unless it is presented with invalid input.
I therefore recommend placing a guard (such as an Assert statement) to assure the inputs are nonnegative. (Clearly it matters not in the preceding argument whether they are integral.) Place another assertion to check that the output is in the interval [-1,1]. Together, these will detect either or both of (a) invalid inputs or (b) an error in the calculation.
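One calculation error worth ruling out is integer overflow in the denominator. The question does not show how the counts are declared, so the following is an assumption: if they are 32-bit ints, the product 347*300*340*293 = 10370442000 does not fit and wraps around to 1780507408, whose square root is exactly the 42196.06 quoted in the question. The Python sketch below (Python integers do not overflow, so the wrap-around is emulated by the made-up helper `wrap_int32`) reproduces both numbers.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when a marginal sum is zero."""
    denom_sq = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denom_sq == 0:
        return 0.0   # the numerator is also zero in this case
    return (tp * tn - fp * fn) / math.sqrt(denom_sq)

def wrap_int32(value):
    """Reduce an integer to the signed 32-bit range, as unchecked int math would."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value

tp, tn, fp, fn = 280, 273, 67, 20
exact = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)    # 10370442000
wrapped = wrap_int32(exact)                              # 1780507408

print(mcc(tp, tn, fp, fn))                               # about 0.737
print((tp * tn - fp * fn) / math.sqrt(wrapped))          # about 1.78
```

If this is what happens in the original tool, converting each factor to double (or a 64-bit integer) before multiplying would avoid the wrap-around.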

Fastest way to sort vectors by angle without actually computing that angle

Many algorithms (e.g. Graham scan) require points or vectors to be sorted by their angle (perhaps as seen from some other point, i.e. using difference vectors). This order is inherently cyclic, and where this cycle is broken to compute linear values often doesn't matter that much. But the real angle value doesn't matter much either, as long as cyclic order is maintained. So doing an atan2 call for every point might be wasteful. What faster methods are there to compute a value which is strictly monotonic in the angle, the way atan2 is? Such functions apparently have been called “pseudoangle” by some.
I started to play around with this and realised that the spec is kind of incomplete. atan2 has a discontinuity, because as dx and dy are varied, there's a point where atan2 will jump between -pi and +pi. The graph below shows the two formulas suggested by @MvG, and in fact they both have the discontinuity in a different place compared to atan2. (NB: I added 3 to the first formula and 4 to the alternative so that the lines don't overlap on the graph). If I added atan2 to that graph then it would be the straight line y=x. So it seems to me that there could be various answers, depending on where one wants to put the discontinuity. If one really wants to replicate atan2, the answer (in this genre) would be
# Input: dx, dy: coordinates of a (difference) vector.
# Output: a number from the range [-2 .. 2] which is monotonic
# in the angle this vector makes against the x axis.
# and with the same discontinuity as atan2
def pseudoangle(dx, dy):
    p = dx/(abs(dx)+abs(dy)) # -1 .. 1 increasing with x
    if dy < 0: return p - 1  # -2 .. 0 increasing with x
    else:      return 1 - p  #  0 .. 2 decreasing with x
This means that if the language that you're using has a sign function, you could avoid branching by returning sign(dy)*(1-p), which has the effect of putting an answer of 0 at the discontinuity between returning -2 and +2. And the same trick would work with @MvG's original methodology, one could return sign(dx)*(p-1).
Update: In a comment below, @MvG suggests a one-line C implementation of this, namely
pseudoangle = copysign(1. - dx/(fabs(dx)+fabs(dy)),dy)
@MvG says it works well, and it looks good to me :-).
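A quick way to convince yourself that the atan2-compatible variant orders directions exactly like atan2 is to sort a set of vectors both ways and compare. The sketch below restates the function from this answer and uses small integer vectors with coprime coordinates (an arbitrary test set chosen here) so that no two vectors share a direction.

```python
import math

def pseudoangle(dx, dy):
    """atan2-compatible pseudoangle in [-2, 2], as defined in this answer."""
    p = dx / (abs(dx) + abs(dy))
    return p - 1 if dy < 0 else 1 - p

# Primitive integer vectors: every direction appears exactly once, so there
# are no ties and the two orderings can be compared directly.
vectors = [(dx, dy)
           for dx in range(-5, 6) for dy in range(-5, 6)
           if math.gcd(dx, dy) == 1]

by_pseudo = sorted(vectors, key=lambda v: pseudoangle(*v))
by_atan2 = sorted(vectors, key=lambda v: math.atan2(v[1], v[0]))
assert by_pseudo == by_atan2
```

Both orderings start just past the negative x axis and end on it, which is where atan2 has its jump.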
I know one possible such function, which I will describe here.
# Input: dx, dy: coordinates of a (difference) vector.
# Output: a number from the range [-1 .. 3] (or [0 .. 4] with the comment enabled)
# which is monotonic in the angle this vector makes against the x axis.
def pseudoangle(dx, dy):
    ax = abs(dx)
    ay = abs(dy)
    p = dy/(ax+ay)
    if dx < 0: p = 2 - p
    # elif dy < 0: p = 4 + p
    return p
So why does this work? One thing to note is that scaling all input lengths will not affect the output. So the length of the vector (dx, dy) is irrelevant, only its direction matters. Concentrating on the first quadrant, we may for the moment assume dx == 1. Then dy/(1+dy) grows monotonically from zero for dy == 0 to one for infinite dy (i.e. for dx == 0). Now the other quadrants have to be handled as well. If dy is negative, then so is the initial p. So for positive dx we already have a range -1 <= p <= 1 monotonic in the angle. For dx < 0 we change the sign and add two. That gives a range 1 <= p <= 3 for dx < 0, and a range of -1 <= p <= 3 on the whole. If negative numbers are for some reason undesirable, the elif comment line can be included, which will shift the 4th quadrant from -1…0 to 3…4.
I don't know whether the above function has an established name, or who might have published it first. I got it quite a while ago and have copied it from one project to the next. I have however found occurrences of it on the web, so I'd consider this snippet public enough for re-use.
There is a way to obtain the range [0 … 4] (for real angles [0 … 2π]) without introducing a further case distinction:
# Input: dx, dy: coordinates of a (difference) vector.
# Output: a number from the range [0 .. 4] which is monotonic
# in the angle this vector makes against the x axis.
def pseudoangle(dx, dy):
    p = dx/(abs(dx)+abs(dy)) # -1 .. 1 increasing with x
    if dy < 0: return 3 + p  #  2 .. 4 increasing with x
    else:      return 1 - p  #  0 .. 2 decreasing with x
I kinda like trigonometry, so I know the best way of mapping an angle to some values we usually have is a tangent. Of course, if we want a finite number in order to not have the hassle of comparing {sign(x),y/x}, it gets a bit more confusing.
But there is a function that maps [1,+inf[ to [1,0[ known as inverse, that will allow us to have a finite range to which we will map angles. The inverse of the tangent is the well known cotangent, thus x/y (yes, it's as simple as that).
A little illustration, showing the values of tangent and cotangent on a unit circle :
You see the values are the same when |x| = |y|, and you see also that if we color the parts that output a value between [-1,1] on both circles, we manage to color a full circle. To make this mapping of values continuous and monotonic, we can do three things:
use the opposite of the cotangent to have the same monotonicity as the tangent
add 2 to -cotan, to have the values coincide where tan=1
add 4 to one half of the circle (say, below the x=-y diagonal) to have the values match at one of the discontinuities.
That gives the following piecewise function, which is a continuous and monotonic function of the angle, with only one discontinuity (which is the minimum):
double pseudoangle(double dx, double dy)
{
    // 1 for above, 0 for below the diagonal/anti-diagonal
    int diag = dx > dy;
    int adiag = dx > -dy;
    double r = !adiag ? 4 : 0;
    if (dy == 0)
        return r;
    if (diag ^ adiag)
        r += 2 - dx / dy;
    else
        r += dy / dx;
    return r;
}
Note that this is very close to Fowler angles, with the same properties. Formally, (pseudoangle(dx,dy) + 1) % 8 == Fowler(dx,dy)
To talk performance, it's much less branchy than Fowler's code (and generally less complicated imo). Compiled with -O3 on gcc 6.1.1, the above function generates an assembly code with 4 branches, where two of them come from dy == 0 (one checking if the both operands are "unordered", thus if dy was NaN, and the other checking if they are equal).
I would argue this version is more precise than the others, since it only uses mantissa-preserving operations until shifting the result to the right interval. This should be especially visible when |x| << |y| or |x| >> |y|, since then the operation |x| + |y| loses quite some precision.
As you can see on the graph the angle-pseudoangle relation is also nicely close to linear.
Looking where branches come from, we can make the following remarks:
My code doesn't rely on abs nor copysign, which makes it look more self-contained. However playing with sign bits on floating point values is actually rather trivial, since it's just flipping a separate bit (no branch!), so this is more of a disadvantage.
Furthermore other solutions proposed here do not check whether abs(dx) + abs(dy) == 0 before dividing by it, but this version would fail as soon as only one component (dy) is 0 -- so that throws in a branch (or 2 in my case).
If we choose to get roughly the same result (up to rounding errors) but without branches, we could abuse copysign and write:
double pseudoangle(double dx, double dy)
{
    double s = dx + dy;
    double d = dx - dy;
    double r = 2 * (1.0 - copysign(1.0, s));
    double xor_sign = copysign(1.0, d) * copysign(1.0, s);
    r += (1.0 - xor_sign);
    r += (s - xor_sign * d) / (d + xor_sign * s);
    return r;
}
Bigger errors may happen than with the previous implementation, due to cancellation in either d or s if dx and dy are close in absolute value. There is no check for division by zero to be comparable with the other implementations presented, and because this only happens when both dx and dy are 0.
If you can feed the original vectors instead of angles into a comparison function when sorting, you can make it work with:
Just a single branch.
Only floating point comparisons and multiplications.
Avoiding addition and subtraction makes it numerically much more robust. A double can actually always exactly represent the product of two floats, but not necessarily their sum. This means for single precision input you can guarantee a perfect flawless result with little effort.
This is basically Cimbali's solution repeated for both vectors, with branches eliminated and divisions multiplied away. It returns an integer, with sign matching the comparison result (positive, negative or zero):
signed int compare(double x1, double y1, double x2, double y2) {
    unsigned int d1 = x1 > y1;
    unsigned int d2 = x2 > y2;
    unsigned int a1 = x1 > -y1;
    unsigned int a2 = x2 > -y2;

    // Quotients of both angles.
    unsigned int qa = d1 * 2 + a1;
    unsigned int qb = d2 * 2 + a2;

    if(qa != qb) return((0x6c >> qa * 2 & 6) - (0x6c >> qb * 2 & 6));

    d1 ^= a1;

    double p = x1 * y2;
    double q = x2 * y1;

    // Numerator of each remainder, multiplied by denominator of the other.
    double na = q * (1 - d1) - p * d1;
    double nb = p * (1 - d1) - q * d1;

    // Return signum(na - nb)
    return((na > nb) - (na < nb));
}
The simplest thing I came up with is making normalized copies of the points and splitting the circle around them in half along the x or y axis. Then use the opposite axis as a linear value between the beginning and end of the top or bottom buffer (one buffer will need to be in reverse linear order when putting it in.) Then you can read the first then second buffer linearly and it will be clockwise, or second and first in reverse for counter clockwise.
That might not be a good explanation so I put some code up on GitHub that uses this method to sort points with an epsilon value to size the arrays.
This might not be good for your use case because it's built for performance in graphics effects rendering, but it's fast and simple (O(N) complexity). If you're working with really small changes in points or very large (hundreds of thousands) data sets then this won't work because the memory usage might outweigh the performance benefits.
Nice. Here is a variant that returns -Pi , Pi like many arctan2 functions.
Edit note: changed my pseudocode to proper Python. The argument order changed for compatibility with Python's math module atan2(). Edit 2: added more code to catch the case dx=0.
def cmp(a, b):
    return (a > b) - (a < b) # stand-in for the cmp built-in that Python 3 removed

def pseudoangle( dy , dx ):
    """ returns approximation to math.atan2(dy,dx)*2/pi"""
    if dx == 0 :
        s = cmp(dy,0)
    else:
        s = cmp(dx*dy,0) # cmp == "sign" in many other languages.
    if s == 0 : return 0 # doesnt hurt performance much.but can omit if 0,0 never happens
    p = dy/(dx+s*dy)
    if dx < 0: return p-2*s
    return p
In this form the max error is only ~0.07 radian for all angles.
(Of course, leave out the Pi/2 if you don't care about the magnitude.)
Now for the bad news -- on my system using python math.atan2 is about 25% faster.
Obviously, replacing it with simple interpreted code doesn't beat a compiled intrinsic.
If angles are not needed by themselves, but only for sorting, then @jjrv's approach is the best one. Here is a comparison in Julia:
using StableRNGs
using BenchmarkTools

# Definitions
struct V{T}
    x::T
    y::T
end

function pseudoangle(v)
    copysign(1. - v.x/(abs(v.x)+abs(v.y)), v.y)
end

function isangleless(v1, v2)
    a1 = abs(v1.x) + abs(v1.y)
    a2 = abs(v2.x) + abs(v2.y)
    a2*copysign(a1 - v1.x, v1.y) < a1*copysign(a2 - v2.x, v2.y)
end

# Data
rng = StableRNG(2021)
vectors = map(x -> V(x...), zip(rand(rng, 1000), rand(rng, 1000)))

# Comparison
res1 = sort(vectors, by = x -> pseudoangle(x));
res2 = sort(vectors, lt = (x, y) -> isangleless(x, y));
@assert res1 == res2
@btime sort($vectors, by = x -> pseudoangle(x));
# 110.437 μs (3 allocations: 23.70 KiB)
@btime sort($vectors, lt = (x, y) -> isangleless(x, y));
# 65.703 μs (3 allocations: 23.70 KiB)
So, by avoiding division, time is almost halved without losing result quality. Of course, for more precise calculations, isangleless should be equipped with bigfloat from time to time, but the same can be said about pseudoangle.
Just use a cross-product function. The direction you rotate one segment relative to the other will give either a positive or negative number. No trig functions and no division. Fast and simple. Just Google it.
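A cross product alone only tells you which way one vector turns relative to another, so for a full 360 degree sort it is usually combined with a half-plane test. The following Python sketch shows one common way to do that; it is an illustration rather than code from any answer here, and the names `half` and `compare` are invented for it.

```python
import math
from functools import cmp_to_key

def half(v):
    """0 for angles in [0, pi), 1 for angles in [pi, 2*pi)."""
    dx, dy = v
    return 1 if (dy < 0 or (dy == 0 and dx < 0)) else 0

def compare(a, b):
    """Negative if a comes before b counterclockwise from the +x axis."""
    ha, hb = half(a), half(b)
    if ha != hb:
        return ha - hb
    cross = a[0] * b[1] - a[1] * b[0]    # > 0 when b is counterclockwise of a
    return (cross < 0) - (cross > 0)

vectors = [(1, 0), (-2, -1), (0, 3), (2, 2), (-1, 0), (1, -4), (-3, 1)]
ordered = sorted(vectors, key=cmp_to_key(compare))
print(ordered)
# [(1, 0), (2, 2), (0, 3), (-3, 1), (-1, 0), (-2, -1), (1, -4)]
```

With integer coordinates the comparison is exact. With floating-point input the cross product itself can round, so nearly parallel vectors may still compare inconsistently.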

Determining Floating Point Square Root

How do I determine the square root of a floating point number? Is the Newton-Raphson method a good way? I have no hardware square root either. I also have no hardware divide (but I have implemented floating point divide).
If possible, I would prefer to reduce the number of divides as much as possible since they are so expensive.
Also, what should be the initial guess to reduce the total number of iterations???
Thank you so much!
When you use Newton-Raphson to compute a square-root, you actually want to use the iteration to find the reciprocal square root (after which you can simply multiply by the input--with some care for rounding--to produce the square root).
More precisely: we use the function f(x) = x^-2 - n. Clearly, if f(x) = 0, then x = 1/sqrt(n). This gives rise to the newton iteration:
x_(i+1) = x_i - f(x_i)/f'(x_i)
= x_i - (x_i^-2 - n)/(-2x_i^-3)
= x_i + (x_i - nx_i^3)/2
= x_i*(3/2 - 1/2 nx_i^2)
Note that (unlike the iteration for the square root), this iteration for the reciprocal square root involves no divisions, so it is generally much more efficient.
I mentioned in your question on divide that you should look at existing soft-float libraries, rather than re-inventing the wheel. That advice applies here as well. This function has already been implemented in existing soft-float libraries.
Edit: the questioner seems to still be confused, so let's work an example: sqrt(612). 612 is 1.1953125 x 2^9 (or b1.0011001 x 2^9, if you prefer binary). Pull out the even portion of the exponent (9) to write the input as f * 2^(2m), where m is an integer and f is in the range [1,4). Then we will have:
sqrt(n) = sqrt(f * 2^2m) = sqrt(f)*2^m
applying this reduction to our example gives f = 1.1953125 * 2 = 2.390625 (b10.011001) and m = 4. Now do a newton-raphson iteration to find x = 1/sqrt(f), using a starting guess of 0.5 (as I noted in a comment, this guess converges for all f, but you can do significantly better using a linear approximation as an initial guess):
x_0 = 0.5
x_1 = x_0*(3/2 - 1/2 * 2.390625 * x_0^2)
= 0.6005859...
x_2 = x_1*(3/2 - 1/2 * 2.390625 * x_1^2)
= 0.6419342...
x_3 = 0.6467077...
x_4 = 0.6467616...
So even with a (relatively bad) initial guess, we get rapid convergence to the true value of 1/sqrt(f) = 0.6467616600226026.
Now we simply assemble the final result:
sqrt(f) = x_n * f = 1.5461646...
sqrt(n) = sqrt(f) * 2^m = 24.738633...
And check: sqrt(612) = 24.738633...
Obviously, if you want correct rounding, careful analysis is needed to ensure that you carry sufficient precision at each stage of the computation. This requires careful bookkeeping, but it isn't rocket science. You simply keep careful error bounds and propagate them through the algorithm.
If you want correct rounding without explicitly checking a residual, you need to compute sqrt(f) to a precision of 2p + 2 bits (where p is the precision of the source and destination type). However, you can also take the strategy of computing sqrt(f) to a little more than p bits, squaring that value, and adjusting the trailing bit by one if necessary (which is often cheaper).
sqrt is nice in that it is a unary function, which makes exhaustive testing for single-precision feasible on commodity hardware.
You can find the OS X soft-float sqrtf function on, which uses the algorithm described above (I wrote it, as it happens). It is licensed under the APSL, which may or may not be suitable for your needs.
Probably (still) the fastest implementation for finding the inverse square root and the 10 lines of code that I adore the most.
It's based on Newton Approximation, but with a few quirks. There's even a great story around this.
Easiest to implement (you can even implement this in a calculator):
def sqrt(x, TOL=0.000001):
    y = 1.0  # initial guess
    while( abs(x/y -y) > TOL ):
        y= (y+x/y)/2.0
    return y
This is exactly equal to newton raphson:
y(new) = y - f(y)/f'(y)
f(y) = y^2-x and f'(y) = 2y
Substituting these values:
y(new) = y - (y^2-x)/2y = (y^2+x)/2y = (y+x/y)/2
If division is expensive you should consider: .
Shifting algorithms:
Let us assume you have two numbers a and b such that the least significant set bit of a is larger than b, and b has only one bit equal to 1 (e.g. a=1000 and b=10). Let s(b) = log_2(b) (which is just the location of the bit valued 1 in b).
Assume we already know the value of a^2. Now (a+b)^2 = a^2 + 2ab + b^2. a^2 is already known, 2ab: shift a by s(b)+1, b^2: shift b by s(b).
Initialize a such that a has only one bit equal to one and a^2<= n < (2*a)^2.
Let q=s(a).
sqra = a*a
For i = q-1 to -10 (or whatever significance you want):
    b = 2^i
    sqrab = sqra + 2ab + b^2
    if sqrab > n:
        continue (a and sqra stay unchanged)
    sqra = sqrab
    a = a + b
Example with n = 612:
a=10000 (16)
sqra = 256
Iteration 1:
b=01000 (8)
sqrab = (a+b)^2 = 24^2 = 576
sqrab < n => a=a+b = 24
Iteration 2:
b = 4
sqrab = (a+b)^2 = 28^2 = 784
sqrab > n => a=a
Iteration 3:
b = 2
sqrab = (a+b)^2 = 26^2 = 676
sqrab > n => a=a
Iteration 4:
b = 1
sqrab = (a+b)^2 = 25^2 = 625
sqrab > n => a=a
Iteration 5:
b = 0.5
sqrab = (a+b)^2 = 24.5^2 = 600.25
sqrab < n => a=a+b = 24.5
Iteration 6:
b = 0.25
sqrab = (a+b)^2 = 24.75^2 = 612.5625
sqrab > n => a=a
Iteration 7:
b = 0.125
sqrab = (a+b)^2 = 24.625^2 = 606.390625
sqrab < n => a=a+b = 24.625
and so on.
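The iterations above can be reproduced with a short Python sketch. It uses floats and ordinary multiplication by powers of two where hardware would use shifts, and it assumes n >= 1; the name `shift_sqrt` and the `lowest_bit` parameter are made up for the illustration.

```python
def shift_sqrt(n, lowest_bit=-10):
    """Bit-by-bit square root of n >= 1, accurate to 2**lowest_bit."""
    # Start with a single set bit such that a*a <= n < (2*a)*(2*a).
    q = 0
    while 4.0 ** (q + 1) <= n:
        q += 1
    a = 2.0 ** q
    sqra = a * a
    for i in range(q - 1, lowest_bit - 1, -1):
        b = 2.0 ** i
        # (a + b)^2 = a^2 + 2ab + b^2; in hardware both extra terms are shifts.
        sqrab = sqra + 2.0 * a * b + b * b
        if sqrab <= n:                 # keep the bit only if we stay at or below n
            a += b
            sqra = sqrab
    return a

print(shift_sqrt(612))                 # 24.73828125, sqrt(612) truncated to 2**-10
```

Each pass decides one more bit of the result, so the cost grows linearly with the number of bits wanted.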
A good approximation to square root on the range [1,4) is
def sqrt(x):
    y = x*-0.000267
    y = x*(0.004686+y)
    y = x*(-0.034810+y)
    y = x*(0.144780+y)
    y = x*(-0.387893+y)
    y = x*(0.958108+y)
    return y+0.315413
Normalise your floating point number so the mantissa is in the range [1,4), use the above algorithm on it, and then divide the exponent by 2. No floating point divisions anywhere.
With the same CPU time budget you can probably do much better, but that seems like a good starting point.
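A rough sketch of the normalisation step, assuming Python's `math.frexp` and `math.ldexp` as stand-ins for picking apart and reassembling the exponent field. The names `poly_sqrt_1_4` and `approx_sqrt` are invented for this illustration; the coefficients are the ones from the answer.

```python
import math


def poly_sqrt_1_4(x):
    """Polynomial approximation of sqrt(x) for x in [1, 4) (coefficients from the answer)."""
    y = x * -0.000267
    y = x * (0.004686 + y)
    y = x * (-0.034810 + y)
    y = x * (0.144780 + y)
    y = x * (-0.387893 + y)
    y = x * (0.958108 + y)
    return y + 0.315413


def approx_sqrt(v):
    """Approximate sqrt(v) for v > 0 without any floating point division."""
    if v <= 0:
        raise ValueError("this sketch only handles v > 0")
    m, e = math.frexp(v)        # v = m * 2**e with 0.5 <= m < 1
    if e % 2 == 0:
        m, e = m * 4, e - 2     # mantissa now in [2, 4), exponent still even
    else:
        m, e = m * 2, e - 1     # mantissa now in [1, 2), exponent now even
    # sqrt(m * 2**e) = sqrt(m) * 2**(e/2); e is even, so e // 2 is exact.
    return math.ldexp(poly_sqrt_1_4(m), e // 2)


print(approx_sqrt(2.0), math.sqrt(2.0))
```

The result is only good to a few digits, so in practice it would more likely serve as a starting estimate than as a final answer.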

Rolling variance algorithm

I'm trying to find an efficient, numerically stable algorithm to calculate a rolling variance (for instance, a variance over a 20-period rolling window). I'm aware of the Welford algorithm that efficiently computes the running variance for a stream of numbers (it requires only one pass), but am not sure if this can be adapted for a rolling window. I would also like the solution to avoid the accuracy problems discussed at the top of this article by John D. Cook. A solution in any language is fine.
I've run across this problem as well. There are some great posts out there on computing the running cumulative variance, such as John D. Cook's Accurately computing running variance post and the post from Digital explorations, Python code for computing sample and population variances, covariance and correlation coefficient. I just could not find any that were adapted to a rolling window.
The Running Standard Deviations post by Subluminal Messages was critical in getting the rolling window formula to work. Jim takes the power sum of the squared values, versus Welford’s approach of using the sum of the squared differences from the mean. Formula as follows:
PSA today = PSA yesterday + ((x today * x today) - PSA yesterday) / n
x = value in your time series
n = number of values you've analyzed so far.
But to convert the Power Sum Average formula to a windowed variety you need to tweak the formula to the following:
PSA today = PSA yesterday + ((x today * x today) - (x[today - n] * x[today - n])) / n
x[today - n] = the value from n periods ago, i.e. the one dropping out of the window
n = period used for your rolling window.
You'll also need the Rolling Simple Moving Average formula:
SMA today = SMA yesterday + (x today - x[today - n]) / n
x = value in your time series
n = period used for your rolling window.
From there you can compute the Rolling Population Variance:
Population Var today = (PSA today * n - n * SMA today * SMA today) / n
Or the Rolling Sample Variance:
Sample Var today = (PSA today * n - n * SMA today * SMA today) / (n - 1)
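As a rough illustration of how the windowed Power Sum Average and Simple Moving Average fit together, here is a minimal Python sketch. The function name `rolling_population_variance` and the warm-up handling (plain running averages until the window is full) are assumptions made for this example, not part of the original post.

```python
from collections import deque


def rolling_population_variance(values, n):
    """Yield the population variance of each full window of length n."""
    window = deque()
    sma = 0.0   # simple moving average of x
    psa = 0.0   # power sum average, i.e. moving average of x*x
    for x in values:
        window.append(x)
        if len(window) > n:
            # Window is full: swap the oldest value for the newest one.
            old = window.popleft()
            sma += (x - old) / n
            psa += (x * x - old * old) / n
        else:
            # Warm-up: ordinary running averages over the first k values.
            k = len(window)
            sma += (x - sma) / k
            psa += (x * x - psa) / k
        if len(window) == n:
            yield (psa * n - n * sma * sma) / n


print(list(rolling_population_variance([1, 2, 3, 4, 5, 6], 3)))
```

Because the last step subtracts two quantities that can be nearly equal, this form can lose precision when the mean is large compared with the spread of the data.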
I've covered this topic along with sample Python code in a blog post a few years back, Running Variance.
Hope this helps.
Please note: I provided links to all the blog posts and math formulas
in Latex (images) for this answer. But, due to my low reputation (<
10); I'm limited to only 2 hyperlinks and absolutely no images. Sorry
about this. Hope this doesn't take away from the content.
I have been dealing with the same issue.
Mean is simple to compute iteratively, but you need to keep the complete history of values in a circular buffer.
next_index = (index + 1) % window_size; // oldest x value is at next_index, wrapping if necessary.
new_mean = mean + (x_new - xs[next_index])/window_size;
I have adapted Welford's algorithm and it works for all the values that I have tested with.
varSum = varSum + (x_new - mean) * (x_new - new_mean) - (xs[next_index] - mean) * (xs[next_index] - new_mean);
xs[next_index] = x_new;
index = next_index;
mean = new_mean;
To get the current variance just divide varSum by the window size: variance = varSum / window_size;
If you prefer code over words (heavily based on DanS' post):
using System.Collections.Generic;

public IEnumerable<double> RollingSampleVariance(IEnumerable<double> data, int sampleSize)
{
    double mean = 0;
    double accVar = 0;
    int n = 0;
    var queue = new Queue<double>(sampleSize);
    foreach (var observation in data)
    {
        queue.Enqueue(observation);
        if (n < sampleSize)
        {
            // Calculating first variance
            n++;
            double delta = observation - mean;
            mean += delta / n;
            accVar += delta * (observation - mean);
        }
        else
        {
            // Adjusting variance
            double then = queue.Dequeue();
            double prevMean = mean;
            mean += (observation - then) / sampleSize;
            accVar += (observation - prevMean) * (observation - mean) - (then - prevMean) * (then - mean);
        }
        if (n == sampleSize)
            yield return accVar / (sampleSize - 1);
    }
}
Actually, Welford's algorithm can AFAICT easily be adapted to compute weighted variance.
And by setting weights to -1, you should be able to effectively cancel out elements. I haven't checked whether the math allows negative weights, but at first look it should!
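To make the idea concrete, here is a small Python sketch of a weighted Welford update in which a weight of -1 removes a previously added value. The class `WeightedMeanVariance` and its methods are hypothetical names for this illustration and are not ELKI's API.

```python
class WeightedMeanVariance:
    """Weighted running mean and variance (illustrative sketch only)."""

    def __init__(self):
        self.weight_sum = 0.0
        self.mean = 0.0
        self.m2 = 0.0   # weighted sum of squared deviations from the mean

    def put(self, x, w=1.0):
        new_weight = self.weight_sum + w
        if new_weight == 0.0:
            # Everything has been removed again: reset to the empty state.
            self.weight_sum = self.mean = self.m2 = 0.0
            return
        delta = x - self.mean
        self.mean += (w / new_weight) * delta
        self.m2 += w * delta * (x - self.mean)
        self.weight_sum = new_weight

    def sample_variance(self):
        return self.m2 / (self.weight_sum - 1.0)


mv = WeightedMeanVariance()
for x in [1.0, 2.0, 3.0, 4.0, 5.0]:
    mv.put(x)
mv.put(1.0, -1.0)   # remove the oldest value
mv.put(6.0)         # add the newest value
print(mv.mean, mv.sample_variance())
```

After the removal and the insertion the statistics describe the window 2, 3, 4, 5, 6.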
I did perform a small experiment using ELKI:
void testSlidingWindowVariance() {
  MeanVariance mv = new MeanVariance(); // ELKI implementation of weighted Welford!
  MeanVariance mc = new MeanVariance(); // Control.
  Random r = new Random();
  double[] data = new double[1000];
  for (int i = 0; i < data.length; i++) {
    data[i] = r.nextDouble();
  }
  // Pre-roll:
  for (int i = 0; i < 10; i++) {
    mv.put(data[i]);
  }
  // Compare to window approach
  for (int i = 10; i < data.length; i++) {
    mv.put(data[i-10], -1.); // Remove
    mv.put(data[i]); // Add
    mc.reset(); // Reset statistics
    for (int j = i - 9; j <= i; j++) {
      mc.put(data[j]);
    }
    assertEquals("Variance does not agree.", mv.getSampleVariance(),
      mc.getSampleVariance(), 1e-14);
  }
}
I get around ~14 digits of precision compared to the exact two-pass algorithm; this is about as much as can be expected from doubles. Note that Welford does come at some computational cost because of the extra divisions - it takes about twice as long as the exact two-pass algorithm. If your window size is small, it may be much more sensible to actually recompute the mean and then in a second pass the variance every time.
I have added this experiment as a unit test to ELKI, where you can see the full source; it also compares to the exact two-pass variance.
However, on skewed data sets, the behaviour might be different. This data set is obviously uniformly distributed, but I've also tried a sorted array and it worked.
Update: we published a paper with details on different weighting schemes for (co-)variance:
Schubert, Erich, and Michael Gertz. "Numerically stable parallel computation of (co-) variance." Proceedings of the 30th International Conference on Scientific and Statistical Database Management. ACM, 2018. (Won the SSDBM best-paper award.)
This also discusses how weighting can be used to parallelize the computation, e.g., with AVX, GPUs, or on clusters.
Here's a divide and conquer approach that has O(log k)-time updates, where k is the number of samples. It should be relatively stable for the same reasons that pairwise summation and FFTs are stable, but it's a bit complicated and the constant isn't great.
Suppose we have a sequence A of length m with mean E(A) and variance V(A), and a sequence B of length n with mean E(B) and variance V(B). Let C be the concatenation of A and B. We have
p = m / (m + n)
q = n / (m + n)
E(C) = p * E(A) + q * E(B)
V(C) = p * (V(A) + (E(A) + E(C)) * (E(A) - E(C))) + q * (V(B) + (E(B) + E(C)) * (E(B) - E(C)))
Now, stuff the elements in a red-black tree, where each node is decorated with mean and variance of the subtree rooted at that node. Insert on the right; delete on the left. (Since we're only accessing the ends, a splay tree might be O(1) amortized, but I'm guessing amortized is a problem for your application.) If k is known at compile-time, you could probably unroll the inner loop FFTW-style.
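The combination step above is the part each tree node would have to perform. A minimal Python sketch of just that step, with `combine` as a made-up name and the population variance assumed for V; the tree itself is not shown:

```python
def combine(m, mean_a, var_a, n, mean_b, var_b):
    """Merge (length, mean, population variance) of A and B into those of C = A + B."""
    p = m / (m + n)
    q = n / (m + n)
    mean_c = p * mean_a + q * mean_b
    var_c = (p * (var_a + (mean_a + mean_c) * (mean_a - mean_c))
             + q * (var_b + (mean_b + mean_c) * (mean_b - mean_c)))
    return m + n, mean_c, var_c


# A = [1, 2, 3] has mean 2 and variance 2/3; B = [10, 20] has mean 15 and variance 25.
print(combine(3, 2.0, 2.0 / 3.0, 2, 15.0, 25.0))
```

Each node in the tree would store such a triple for its subtree, so an insertion or deletion only needs to redo `combine` along one root-to-leaf path.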
I know this question is old, but in case someone else is interested, here is the Python code. It is inspired by the johndcook blog post, @Joachim's and @DanS's code, and @Jaime's comments. The code below still gives small imprecisions for small data window sizes. Enjoy.
from __future__ import division
import collections
import math

class RunningStats:
    def __init__(self, WIN_SIZE=20):
        self.n = 0
        self.mean = 0
        self.run_var = 0
        self.WIN_SIZE = WIN_SIZE
        # Values currently inside the window, oldest first
        self.windows = collections.deque(maxlen=WIN_SIZE)

    def clear(self):
        self.n = 0
        self.mean = 0
        self.run_var = 0
        self.windows.clear()

    def push(self, x):
        if self.n < self.WIN_SIZE:
            # Calculating first variance
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.run_var += delta * (x - self.mean)
        else:
            # Adjusting variance: take out the oldest value before adding x
            x_removed = self.windows.popleft()
            old_m = self.mean
            self.mean += (x - x_removed) / self.WIN_SIZE
            self.run_var += (x + x_removed - old_m - self.mean) * (x - x_removed)
        self.windows.append(x)

    def get_mean(self):
        return self.mean if self.n else 0.0

    def get_var(self):
        # n equals WIN_SIZE once the window is full
        return self.run_var / (self.n - 1) if self.n > 1 else 0.0

    def get_std(self):
        return math.sqrt(self.get_var())

    def get_all(self):
        return list(self.windows)

    def __str__(self):
        return "Current window values: {}".format(list(self.windows))
I look forward to be proven wrong on this but I don't think this can be done "quickly." That said, a large part of the calculation is keeping track of the EV over the window which can be done easily.
I'll leave with the question: are you sure you need a windowed function? Unless you are working with very large windows it is probably better to just use a well known predefined algorithm.
I guess keeping track of your 20 samples, Sum(X^2 from 1..20), and Sum(X from 1..20) and then successively recomputing the two sums at each iteration isn't efficient enough? It's possible to recompute the new variance without adding up, squaring, etc., all of the samples each time.
As in:
Sum(X^2 from 2..21) = Sum(X^2 from 1..20) - X_1^2 + X_21^2
Sum(X from 2..21) = Sum(X from 1..20) - X_1 + X_21
Here's another O(log k) solution: first square the original sequence, then sum pairs, then quadruples, etc. (You'll need a bit of a buffer to be able to find all of these efficiently.) Then add up the values that you need to get your answer. For example:
||||||||||||||||||||||||| // Squares
| | | | | | | | | | | | | // Sum of squares for pairs
| | | | | | | // Pairs of pairs
| | | | // (etc.)
| |
^------------------^ // Want these 20, which you can get with
| | // one...
| | | | // two, three...
| | // four...
|| // five stored values.
Now you use your standard E(x^2)-E(x)^2 formula and you're done. (Not if you need good stability for small sets of numbers; this was assuming that it was only accumulation of rolling error that was causing issues.)
That said, summing 20 squared numbers is very fast these days on most architectures. If you were doing more--say, a couple hundred--a more efficient method would clearly be better. But I'm not sure that brute force isn't the way to go here.
For only 20 values, it's trivial to adapt the method exposed here (I didn't say fast, though).
You can simply pick up an array of 20 of these RunningStat classes.
The first 20 elements of the stream are somewhat special; once they are in, it becomes much simpler:
when a new element arrives, clear the current RunningStat instance, add the element to all 20 instances, and increment the "counter" (modulo 20) which identifies the new "full" RunningStat instance
at any given moment, you can consult the current "full" instance to get your running variance.
You will obviously note that this approach isn't really scalable...
You can also note that there is some redundancy in the numbers we keep (if you go with the full RunningStat class). An obvious improvement would be to keep the last 20 Mk and Sk directly.
I cannot think of a better formula using this particular algorithm, I am afraid that its recursive formulation somewhat ties our hands.
This is just a minor addition to the excellent answer provided by DanS. The following equations are for removing the oldest sample from the window and updating the mean and variance. This is useful, for example, if you want to take smaller windows near the right edge of your input data stream (i.e. just remove the oldest window sample without adding a new sample).
window_size -= 1; % decrease window size by 1 sample
new_mean = prev_mean + (prev_mean - x_old) / window_size
varSum = varSum - (prev_mean - x_old) * (new_mean - x_old)
Here, x_old is the oldest sample in the window you wish to remove.
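For illustration, the three removal equations can be wrapped in a small Python function. The name `remove_oldest` is made up for this sketch, and it assumes the window holds at least two samples before the removal.

```python
def remove_oldest(window_size, prev_mean, var_sum, x_old):
    """Drop x_old from the window; return the updated (window_size, mean, var_sum)."""
    window_size -= 1   # decrease window size by 1 sample
    new_mean = prev_mean + (prev_mean - x_old) / window_size
    var_sum -= (prev_mean - x_old) * (new_mean - x_old)
    return window_size, new_mean, var_sum


# Window [1, 2, 3, 4]: mean 2.5, sum of squared deviations 5.0; remove the 1.
print(remove_oldest(4, 2.5, 5.0, 1.0))
```

The remaining window 2, 3, 4 has mean 3 and a sum of squared deviations of 2, which is what the update produces.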
For those coming here now, here's a reference containing the full derivation, with proofs, of DanS's answer and Jaime's related comment.
DanS and Jaime's response in concise C.
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t n, i;
    float *samples, mean, var;
} rolling_var_t;

void rolling_var_init(rolling_var_t *c, size_t window_size) {
    size_t ss;
    memset(c, 0, sizeof(*c));
    c->n = window_size;
    c->samples = (float *) malloc(ss = sizeof(float)*window_size);
    memset(c->samples, 0, ss); // window starts out filled with zeros
}

void rolling_var_add(rolling_var_t *c, float x) {
    float nmean; // new mean
    float xold; // oldest x
    float dx;
    c->i = (c->i + 1) % c->n; // slot of the oldest sample
    xold = c->samples[c->i];
    dx = x - xold;
    nmean = c->mean + dx / (float) c->n; // walk mean
    //c->var += ((x - c->mean)*(x - nmean) - (xold - c->mean) * (xold - nmean)) / (float) c->n;
    c->var += ((x + xold - c->mean - nmean) * dx) / (float) c->n;
    c->mean = nmean;
    c->samples[c->i] = x;
}
