Online algorithm for calculating absolute deviation

I'm trying to calculate the absolute deviation of a vector online, that is, as each item is received, without storing the entire vector. The absolute deviation is the sum of the absolute differences between each item and the mean:

abs_dev = sum(abs(x_i - mean(x)))

I know that the variance of a vector can be calculated in such a manner. Variance is similar to absolute deviation, but each difference is squared:

var = sum((x_i - mean(x))^2) / n

The online algorithm for variance is as follows:
n = 0
mean = 0
M2 = 0

def calculate_online_variance(x):
    global n, mean, M2
    n = n + 1
    delta = x - mean
    mean = mean + delta/n
    M2 = M2 + delta*(x - mean)  # This expression uses the new value of mean
    variance_n = M2/n
    return variance_n
Is there such an algorithm for calculating the absolute deviation? I cannot formulate a recursive definition myself, but wiser heads may prevail!

As the absolute deviation between x and the mean can be written as the square root of the squared difference, the adaptation is trivial if you are happy with a consistent but biased estimate (meaning the limit as n goes to infinity is the expected value):
from math import sqrt

n = 0
mean = 0
M2 = 0

def calculate_online_avg_abs_dev(x):
    global n, mean, M2
    n = n + 1
    delta = x - mean
    mean = mean + delta/n
    M2 = M2 + sqrt(delta*(x - mean))
    avg_abs_dev_n = M2/n
    return avg_abs_dev_n
This is for the case of the average absolute deviation. Normally the MAD (median absolute deviation) is used, which is impossible to compute recursively, but the average absolute deviation is as useful in most cases. When we're talking about hundreds of values from close-to-normal distributions, both values are very close.
If you just want the sum of the absolute deviations, life is even simpler: just return M2.
Be aware of the fact that BOTH the algorithm you gave and the trivial adaptation for the absolute deviation are slightly biased.
A simulation in R to show that the algorithm works. In the resulting plot, the red line is the true value and the black line is the progressive value following the algorithm outlined above.

Code:
calculate_online_abs_dev <- function(x, n) {
  M2 <- 0
  mean <- 0
  out <- numeric(n)
  for (i in 1:n) {
    delta <- x[i] - mean
    mean <- mean + delta/i
    M2 <- M2 + sqrt(delta*(x[i] - mean))
    out[i] <- M2/i
  }
  return(out)
}
set.seed(2010)
x <- rnorm(100)
Abs_Dev <- calculate_online_abs_dev(x, length(x))
True_Val <- sapply(1:length(x), function(i) sum(abs(x[1:i] - mean(x[1:i])))/i)
plot(1:length(x), Abs_Dev, type="l", xlab="number of values", lwd=2)
lines(1:length(x), True_Val, col="red", lty=2, lwd=2)
legend("bottomright", lty=c(1,2), col=c("black","red"),
       legend=c("Online Calc","True Value"))

I don't think it's possible.
In the formula for variance it is possible to separate the x and x^2 terms, so that it suffices to keep track of those two sums (and n). In the formula for absolute deviation this is not possible.
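For contrast, a minimal sketch of that separable bookkeeping for variance (the function name is mine; as noted elsewhere on this page, this running-sums form is numerically unstable):

n = 0
s = 0
s2 = 0

def add_and_get_variance(x):
    # Running-sums variance: only n, sum(x) and sum(x^2) are retained.
    global n, s, s2
    n += 1
    s += x
    s2 += x*x
    return s2/n - (s/n)**2   # population variance; unstable when mean >> stddev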
I think the best one can do (apart from keeping the whole vector and calculating the absolute deviation on demand) is to keep a sorted list of elements. Inserting a new element is O(log(n)), and recalculating the absolute deviation after an insertion is also O(log(n)). This may or may not be worthwhile, depending on your application.
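To illustrate the sorted-list idea, here is a minimal Python sketch (the class and names are mine, not from the original post). bisect.insort makes each insertion O(n), so a balanced BST or a Fenwick tree over value ranks would be needed to actually reach the O(log(n)) bounds; the point here is the query arithmetic, which splits the sum at the mean:

import bisect

class AbsDevTracker:
    def __init__(self):
        self.sorted_vals = []   # values kept in sorted order
        self.total = 0.0        # running sum of all values

    def add(self, x):
        bisect.insort(self.sorted_vals, x)   # O(n) here; O(log n) with a tree
        self.total += x

    def abs_deviation(self):
        n = len(self.sorted_vals)
        mean = self.total / n
        k = bisect.bisect_right(self.sorted_vals, mean)  # values at or below the mean
        left_sum = sum(self.sorted_vals[:k])  # O(n) here; a Fenwick tree gives O(log n)
        right_sum = self.total - left_sum
        # values below the mean contribute mean - v; values above contribute v - mean
        return (mean * k - left_sum) + (right_sum - mean * (n - k))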

The formula for variance that you give is ONE of the many that are possible (I can think of three distinct ways to do that computation) although I have not verified that yours is correct. It looks reasonably close to what I recall.
The problem is that the absolute value is actually more "nonlinear" in a sense than the sum of squares of the deviations. This prevents you from doing the calculation recursively in a loop, at least without retaining all of the previous values of x. You must compute the overall mean in advance for that sum.
Edit: I see that beta agrees with me. IF you saved all of the previous data points, in a sorted list, you could then efficiently compute the updated desired deviation. But this is counter to the spirit of your request.

Related

Is this an accurate average or an exponential moving average formula?

I'm trying to calculate the average of a value that is changing, and I would like to do so without storing all the previous values in an array and iterating over them.
I found this formula
avg = avg + (value - avg) / n
where n is the number of changes to value.
TL;DR
My question is if this formula is identical to the normal way of
calculating an average (which it seems to be when I compare them), or
if it might give different results under certain circumstances?
I'm not sure what the correct name of this formula is - I've seen "running average", "rolling average", "moving average", etc. The results of it seem to be exactly the same as storing each historical value, summing them up and dividing by n - i.e. a "normal average".
What's confusing is that people sometimes call this formula a "moving average", which in my mind sounds more like you're using a subset of the historical values to calculate an average. Others say it's an exponential moving average (see comment by Julia on OP).
Is this formula identical to the normal way of calculating an average?
With infinite precision, this formula does indeed compute the average of the first n samples if avg is set equal to 0 at the start.
It is clearly true when n=1 because the average of 1 sample works out as:
avg' = avg + (value - avg) / n
     = 0 + (value - 0) / 1
     = value
For larger values of n, assume it is true for n-1 (i.e. avg = (x[1]+...+x[n-1])/(n-1)).
Then:
avg' = avg + (x[n] - avg) / n
     = (n-1)*avg/n + x[n]/n
     = (x[1]+...+x[n-1])/n + x[n]/n
     = (x[1]+...+x[n])/n
So the new value of avg is also equal to the average of the first n samples.
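As a quick numerical check of the induction above (my own sketch; the tolerance is arbitrary):

import random

def check_cumulative_average(samples):
    avg = 0.0
    for n, value in enumerate(samples, start=1):
        avg = avg + (value - avg) / n   # the formula in question
        naive = sum(samples[:n]) / n    # recompute from scratch
        assert abs(avg - naive) < 1e-9
    return avg

check_cumulative_average([random.gauss(0, 1) for _ in range(1000)])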
Is this a moving average?
Normally by "moving average" people are referring to a simple moving average, which averages a fixed-size window of the most recent values.
This formula is actually known as a cumulative moving average.

Calculation of variance given mean

I'm currently utilizing an online variance algorithm to calculate the variance for a given sequence. This works nicely, and also gives good numerical stability and overflow resistance, at the cost of some speed, which is fine. My question is, does an algorithm exist that will be faster than this if the sample mean is already known, while having similar stability and resistance to overflow (hence not something like a naive variance calculation)?
The current online variance calculation is a single-pass algorithm with both divisions and multiplications in the main loop (which is what impacts speed). From Wikipedia:
def online_variance(data):
    n = 0
    mean = 0
    M2 = 0
    for x in data:
        n = n + 1
        delta = x - mean
        mean = mean + delta/n
        M2 = M2 + delta*(x - mean)
    variance = M2/(n - 1)
    return variance
The thing that causes a naive variance calculation to go unstable is the fact that you separately sum the x values (to get mean(x)) and the x^2 values, and then take the difference
var = mean(x^2) - (mean(x))^2
But since the definition of variance is
var = mean((x - mean(x))^2)
You can just evaluate that and it will be as fast as it can be. When you don't know the mean, you have to compute it first for stability, or use the "naive" formulation that goes through the data only once at the expense of numerical stability.
EDIT
Now that you have given the "original" code, it's easy to be better (faster). As you correctly point out, the division in the inner loop is slowing you down. Try this one for comparison:
def newVariance(data, mean):
    n = 0
    M2 = 0
    for x in data:
        n = n + 1
        delta = x - mean
        M2 = M2 + delta * delta
    variance = M2 / (n - 1)
    return variance
Note - this looks a lot like the two_pass_variance algorithm from Wikipedia, except that you don't need the first pass to compute the mean since you say it is already known.
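As a quick comparison (my own sketch): with the mean known, newVariance should agree closely with online_variance while doing less work per element:

import random

data = [random.gauss(5, 2) for _ in range(100000)]
known_mean = sum(data) / len(data)
# both should be close to the true variance of 4
print(online_variance(data), newVariance(data, known_mean))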

Efficient way to take determinant of an n! x n! matrix in Maple

I have a large matrix, n! x n!, for which I need to take the determinant. For each permutation of n, I associate:
- a vector of length 2n (this is easy computationally)
- a polynomial in 2n variables (a product of linear factors computed recursively on n)
The matrix is the evaluation matrix for the polynomials at the vectors (thought of as points). So the sigma,tau entry of the matrix (indexed by permutations) is the polynomial for sigma evaluated at the vector for tau.
Example: For n=3, if the ith polynomial is (x1 - 4)(x3 - 5)(x4 - 4)(x6 - 1) and the jth point is (2,2,1,3,5,2), then the (i,j)th entry of the matrix will be (2 - 4)(1 - 5)(3 - 4)(2 - 1) = -8. Here n=3, so the points are in R^(3!) = R^6 and the polynomials have 3!=6 variables.
My goal is to determine whether or not the matrix is nonsingular.
My approach right now is this:
- the function point takes a permutation and outputs a vector
- the function poly takes a permutation and outputs a polynomial
- the function nextPerm gives the next permutation in lexicographic order
The abridged pseudocode version of my code is this:
B := [];
P := [];
w := [1,2,...,n];
while w <> NULL do
    B := B append poly(w);
    P := P append point(w);
    w := nextPerm(w);
od;

// BUILD A MATRIX IN MAPLE
M := Matrix(n!, (i,j) -> eval(B[i], P[j]));

// COMPUTE DETERMINANT IN MAPLE
det := LinearAlgebra[Determinant](M);

// TELL ME IF IT'S NONSINGULAR
if det = 0 then return false;
else return true; fi;
I'm working in Maple using the built in function LinearAlgebra[Determinant], but everything else is a custom built function that uses low level Maple functions (e.g. seq, convert and cat).
My problem is that this takes too long, meaning I can go up to n=7 with patience, but getting n=8 takes days. Ideally, I want to be able to get to n=10.
Does anyone have an idea for how I could improve the time? I'm open to working in a different language, e.g. Matlab or C, but would prefer to find a way to speed this up within Maple.
I realize this might be hard to answer without all the gory details, but the code for each function, e.g. point and poly, is already optimized, so the real question here is if there is a faster way to take a determinant by building the matrix on the fly, or something like that.
UPDATE: Here are two ideas that I've toyed with that don't work:
1. I can store the polynomials (since they take a while to compute, I don't want to redo that if I can help it) in a vector of length n!, compute the points on the fly, and plug these values into the permutation (Leibniz) formula for the determinant:

det(M) = sum over permutations s of {1,...,N} of sgn(s) * M[1,s(1)] * ... * M[N,s(N)]

The problem here is that this is O(N!) in the size N of the matrix, so for my case this will be O((n!)!). When n=10, (n!)! = 3,628,800!, which is way too big to even consider.
2. Compute the determinant using the LU decomposition. Luckily, the main diagonal of my matrix is nonzero, so this is feasible. Since this is O(N^3) in the size of the matrix, that becomes O((n!)^3), which is much closer to doable. The problem, though, is that it requires me to store the whole matrix, which puts serious strain on memory, never mind the run time. So this doesn't work either, at least not without a bit more cleverness. Any ideas?
It isn't clear to me whether your problem is space or time; obviously the two trade back and forth. If you only wish to know whether the determinant is zero or not, then you should definitely go with LU decomposition. The reason is that if A = LU with L lower triangular and U upper triangular, then
det(A) = det(L) det(U) = l_11 * ... * l_nn * u_11 * ... * u_nn
so you only need to determine if any of the main diagonal entries of L or U is 0.
To simplify further, use Doolittle's algorithm, where l_ii = 1. If at any point the algorithm breaks down, the matrix is singular so you can stop. Here's the gist:
for k := 1, 2, ..., n do {
    for j := k, k+1, ..., n do {
        u_kj := a_kj - sum_{s=1...k-1} l_ks u_sj;
    }
    for i := k+1, k+2, ..., n do {
        l_ik := (a_ik - sum_{s=1...k-1} l_is u_sk)/u_kk;
    }
}
The key is that you can compute the kth row of U and the kth column of L at the same time, needing only entries computed in earlier steps to move forward. Since you can compute the entries a_ij as needed, you never have to hold the full matrix A; you do, however, have to keep the rows of U and columns of L computed so far, since the inner sums refer back to them. The full decomposition takes O(n^3) time. You might be able to find a few more tricks, but that depends on your space/time trade-off.
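To make this concrete, here is a minimal Python/NumPy sketch of Doolittle with entries generated on demand (my own illustration; entry(i, j) is a hypothetical stand-in for eval(B[i], P[j]), and the pivot tolerance is arbitrary). Note that without pivoting, a vanishing pivot does not strictly prove singularity, so a production version would add partial pivoting:

import numpy as np

def is_nonsingular(entry, n, tol=1e-12):
    # Doolittle LU (l_kk = 1); A itself is never materialized, but L and U
    # are kept in full because the inner sums refer back to earlier rows/columns.
    L = np.eye(n)
    U = np.zeros((n, n))
    for k in range(n):
        for j in range(k, n):
            U[k, j] = entry(k, j) - L[k, :k] @ U[:k, j]
        if abs(U[k, k]) < tol:
            return False            # zero pivot: stop early
        for i in range(k + 1, n):
            L[i, k] = (entry(i, k) - L[i, :k] @ U[:k, k]) / U[k, k]
    return True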
Not sure if I've followed your problem; is it (or does it reduce to) the following?
You have two vectors of n numbers, call them x and c, then the matrix element is product over k of (x_k+c_k), with each row/column corresponding to distinct orderings of x and c?
If so, then I believe the matrix will be singular whenever there are repeated values in either x or c, since the matrix will then have repeated rows/columns. Try a bunch of Monte Carlo runs at a smaller n with distinct values of x and c to see whether that case is in general nonsingular; if it's true for n=6, it's quite likely to be true for n=10.
As far as brute force goes, your method:

1. Is a non-starter.
2. Will work much more quickly (should be a few seconds for n=7), though instead of LU you might want to try SVD, which will do a much better job of letting you know how well behaved your matrix is.

Find minimal functions

OK, here's the deal. I have a bunch of linear functions, a*x + b.
My goal is to answer the following question/query: What is the minimal function at x = q?
E.g.: if I have the functions f(x) = 3*x + 2, g(x) = 5*x - 6 and h(x) = 2*x + 1, the answers would be:
for x = 4, function h
for x = 2, function g
for x = 1, function g
My idea goes like this:
1. Sort the functions by the coefficient of x, in decreasing order.
2. Sort the queries in increasing order.
3. Get rid of the parallel functions, keeping only the one with the smallest constant term (e.g. if I have f(x) = 2*x + 4 and g(x) = 2*x + 2, f(x) will never be smaller than g(x), so I don't need f(x)).
4. Start on the interval from -inf to some real number, call it w1; I know that on this interval the function with the highest linear coefficient is the smallest.
5. Find w1 by finding the smallest x1 s.t. f(x1) = g(x1), where f is my current function and g iterates through all other functions with a smaller linear coefficient; then w1 = x1.
6. As long as my query is in the interval (-inf, w1), output the current function, then proceed to the next query.
7. If I still have queries to be answered, let the current function be the one that intersects my current function at x = w1, replace -inf with w1, and repeat the same steps.
However, my implementation or idea is not fast enough. Is there anything that I didn't notice that may speed up my program?
Thank you in advance.
Could you not just solve for their intersections, and store the minimal function for each interval in the domain?
Edit: to elaborate, if you were to solve any pair of functions for x, then x represents the value where one of those two functions becomes greater than the other. There are going to be definable intervals where the minimal function is the same for all the values in the interval.
Here's a plot of the three example functions (image omitted).
The intervals (with the corresponding minimal function) of this graph would be:
(-∞, 7/3] => 5x - 6
(7/3, ∞) => 2x + 1
Now, at runtime, instead of "What is the minimal function at x = q" you simply do "What interval does q belong to".
And, if I'm not mistaken, if you have N linear functions, you would have at most N intervals (so at most N-1 breakpoints) to store. And there are specialized data structures that you can use to store and search intervals if you really have a lot of functions to analyze.
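As a small illustration of the query step (my own sketch, hard-coding the intervals from the example above), locating q's interval is a binary search over the stored breakpoints:

import bisect

# From the example: (-inf, 7/3] -> 5x - 6, (7/3, inf) -> 2x + 1
breakpoints = [7/3]
minimal_funcs = [lambda x: 5*x - 6, lambda x: 2*x + 1]

def minimal_at(q):
    return minimal_funcs[bisect.bisect_left(breakpoints, q)](q)

print(minimal_at(4), minimal_at(2), minimal_at(1))  # 9 4 -1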
If I understood correctly, your solution is to do some pre-processing of all your functions so that the domain of x is split into ranges, and in every such range you know which function is minimal.
There are actually two phases: the "preparation" and the "querying" (where, given a specific x, you give the result).
What's your bottleneck?
Naturally, for the "querying" phase to be fast you should organize your ranges in a sorted array, so that you can find the range enclosing a given x by binary search in logarithmic time. If this is what you did and it still isn't fast enough, consider code-level optimizations, because from the algorithmic point of view this seems to be the optimal solution.
If your bottleneck is the "preparation" phase, there are opportunities for optimization. As I understand it, you find the intersections of all pairs of your functions (after getting rid of parallel ones), and this is not really necessary.
Consider the following. First sort all your functions by their coefficient (higher coefficients at the beginning) and get rid of parallel functions. Next, build the array of ranges while iterating through your functions.
Since the current function has the lowest coefficient among those already analyzed, it will be the smallest one as x goes to infinity, so its range should run from some x0 to infinity. To find that x0, take the last range in the array (belonging to the previously processed function) and compute the intersection of that function with the current one. The former range shrinks to end at x0. If that range becomes invalid (its start is greater than x0), that function is totally obscured; in such a case, remove the range and repeat the procedure.
To make things clearer, I'll write some pseudo-code:
rangeArr is an array of pairs (F, X), where F is the function description and X is the start of the function's range. The end of a function's range is the start of the next range, and the end of the last function's range is +infinity.
for each F sorted by coefficient
{
    double x0;
    while (true)
    {
        if (rangeArr is empty)
        {
            x0 = -inf;
            break;
        }
        FPrev = rangeArr.back().F;
        xPrev = rangeArr.back().X;
        x0 = IntersectionOf(F, FPrev);
        if (x0 > xPrev)
            break;
        rangeArr.DeleteLastRange();
    }
    rangeArr.InsertRange(F, x0);
}
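For reference, here is a direct Python translation of this pseudo-code plus a query function (my own sketch; equal slopes are handled by the de-duplication step, and floating-point edge cases are not handled):

import bisect

def build_lower_envelope(funcs):
    # funcs: list of (a, b) pairs representing a*x + b.
    # Returns parallel lists: range start points and the function active there.
    funcs = sorted(funcs, key=lambda f: (-f[0], f[1]))  # slope descending
    dedup = []
    for f in funcs:
        if not dedup or dedup[-1][0] != f[0]:   # keep smallest intercept per slope
            dedup.append(f)
    starts, active = [], []
    for a, b in dedup:
        while True:
            if not active:
                x0 = float('-inf')
                break
            ap, bp = active[-1]
            x0 = (b - bp) / (ap - a)    # intersection with previous function
            if x0 > starts[-1]:
                break
            starts.pop(); active.pop()  # previous function is fully obscured
        starts.append(x0); active.append((a, b))
    return starts, active

starts, active = build_lower_envelope([(3, 2), (5, -6), (2, 1)])

def query(q):
    a, b = active[bisect.bisect_right(starts, q) - 1]
    return a*q + b

print(query(4), query(2), query(1))  # 9 4 -1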

Quick way to calculate uniformity or discrepancy of number set

Hello.
Assume I have a set of numbers. I want a quick way to calculate some measure of its uniformity.
I know the variance is the most obvious answer, but I am afraid the complexity of the naive algorithm is too high.
Does anyone have any suggestions?
"Intuitive" algorithms for calculating variance usually suffer one or both of the following:
Use two loops (one for calculating the mean, the other for the variance)
Are not numerically stable
A good algorithm, with only one loop and numerically stable is due to D. Knuth (as always).
From Wikipedia:
n = 0
mean = 0
M2 = 0

def calculate_online_variance(x):
    global n, mean, M2
    n = n + 1
    delta = x - mean
    mean = mean + delta/n
    M2 = M2 + delta*(x - mean)  # This expression uses the new value of mean
    variance_n = M2/n           # population variance
    # sample variance is undefined on the first pass (n=1); return Inf as noted
    variance = M2/(n - 1) if n > 1 else float('inf')
    return variance
You should invoke calculate_online_variance(x) for each point, and it returns the variance calculated so far.
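For illustration, streaming usage might look like this (my own sketch, with random N(0,1) data, for which the sample variance should settle near 1):

import random

n, mean, M2 = 0, 0, 0   # reset the accumulators before a new stream
for x in (random.gauss(0, 1) for _ in range(10000)):
    variance = calculate_online_variance(x)
print(variance)   # roughly 1.0 for standard normal data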
