Related
In the question "What's the numerically best way to calculate the average" it was suggested, that calculating a rolling mean, i.e.
mean = a[n]/n + (n-1)/n * mean
might be numerically more stable than calculating the sum and then dividing by the total number of elements. This was questioned by a commenter. I can not tell which one is true - can someone else? The advantage of the rolling mean is, that you keep the mean small (i.e. at roughly the same size of all vector entries). Intuitively this should keep the error small. But the commenter claims:
Part of the issue is that 1/n introduces errors in the least significant bits, so n/n != 1, at least when it is performed as a three step operation (divide-store-multiply). This is minimized if the division is only performed once, but you'd be doing it over GB of data.
So I have multiple questions:
Is the rolling mean more precise than summing and then dividing?
Does that depend on the question whether 1/n is calculated first and then multiplied?
If so, do computers implement a one step division? (I thought so, but I am unsure now)
If yes, is it more precise than Kahan summation and then dividing?
If compareable - which one is faster? In both cases we have additional calculations.
If more precise, could you use this for precise summation?
In many circumstances, yes. Consider a sequence of all positive terms, all on the same order of magnitude. Adding them all generates a large intermediate sum, to which we add small terms, which might round precisely to the intermediate sum. Using the rolling sum, you get terms on the same order of magnitude, and in addition, the sum is much harder to overflow. However, this is not open and shut: Adding the terms and then dividing allows us to use AVX instructions, which are significantly faster than the subtract/divide/add instructions of the rolling loop. In addition, there are distributions which cause one or the other to be more accurate. This has been examined in:
Robert F Ling. Comparison of several algorithms for computing sample means and variances. Journal of the American Statistical Association, 69(348): 859–866, 1974
Kahan summation is an orthogonal issue. You can apply Kahan summation to the sequence x[n] = (x[n-1]-mu)/n; this is very accurate.
I took this as a reference for online computing the variance and mean from a variable-length array of data: http://www.johndcook.com/standard_deviation.html.
The data is a set from 16-bit unsigned values, which may have any number of samples (actually, the minimum would be about 20 samples, and the maximum about 2e32 samples.
As the dataset may be too big to store, I already implemented this using the above-mentioned online algorithm in C and verified it's computing correctly.
The trouble begins with the following requirement for the application: besides computing the variance and mean for the whole set, I also need to compute a separated result (both mean and variance) for a population comprised of the middle 50% of the values, i.e. disregarding the first 25% and the latter 25% of the samples. The number of samples is not known beforehand, so I must compute the additional set online.
I understand that I can both add and subtract a subset by computing it separately and them using something like the operator+ implementation described here: http://www.johndcook.com/skewness_kurtosis.html (minus the skewness & kurtosis specifics, for which I have no use). The subtraction could be derived from this.
The problem is: how do I maintain these subsets? Or should I try another technique?
If space is an issue, and you'd be happy to accept an approximation, I'd start with the algorithm from the following paper:
M Greenwald, S Khanna, Space-Efficient Online Computation of Quantile Summaries
You can use the algorithm to compute the running estimates of the 25th and 75th percentiles of the observations seen to far. You can then feed those observations that fall between the two percentiles into the Welford algorithm covered in John D Cook's article to compute the running mean and variance.
I'm observing a sinusoidally-varying source, i.e. f(x) = a sin (bx + d) + c, and want to determine the amplitude a, offset c and period/frequency b - the shift d is unimportant. Measurements are sparse, with each source measured typically between 6 and 12 times, and observations are at (effectively) random times, with intervals between observations roughly between a quarter and ten times the period (just to stress, the spacing of observations is not constant for each source). In each source the offset c is typically quite large compared to the measurement error, while amplitudes vary - at one extreme they are only on the order of the measurement error, while at the other extreme they are about twenty times the error. Hopefully that fully outlines the problem, if not, please ask and i'll clarify.
Thinking naively about the problem, the average of the measurements will be a good estimate of the offset c, while half the range between the minimum and maximum value of the measured f(x) will be a reasonable estimate of the amplitude, especially as the number of measurements increase so that the prospects of having observed the maximum offset from the mean improve. However, if the amplitude is small then it seems to me that there is little chance of accurately determining b, while the prospects should be better for large-amplitude sources even if they are only observed the minimum number of times.
Anyway, I wrote some code to do a least-squares fit to the data for the range of periods, and it identifies best-fit values of a, b and d quite effectively for the larger-amplitude sources. However, I see it finding a number of possible periods, and while one is the 'best' (in as much as it gives the minimum error-weighted residual) in the majority of cases the difference in the residuals for different candidate periods is not large. So what I would like to do now is quantify the possibility that the derived period is a 'false positive' (or, to put it slightly differently, what confidence I can have that the derived period is correct).
Does anybody have any suggestions on how best to proceed? One thought I had was to use a Monte-Carlo algorithm to construct a large number of sources with known values for a, b and c, construct samples that correspond to my measurement times, fit the resultant sample with my fitting code, and see what percentage of the time I recover the correct period. But that seems quite heavyweight, and i'm not sure that it's particularly useful other than giving a general feel for the false-positive rate.
And any advice for frameworks that might help? I have a feeling this is something that can likely be done in a line or two in Mathematica, but (a) I don't know it, an (b) don't have access to it. I'm fluent in Java, competent in IDL and can probably figure out other things...
This looks tailor-made for working in the frequency domain. Apply a Fourier transform and identify the frequency based on where the power is located, which should be clear for a sinusoidal source.
ADDENDUM To get an idea of how accurate is your estimate, I'd try a resampling approach such as cross-validation. I think this is the direction that you're heading with the Monte Carlo idea; lots of work is out there, so hopefully that's a wheel you won't need to re-invent.
The trick here is to do what might seem at first to make the problem more difficult. Rewrite f in the similar form:
f(x) = a1*sin(b*x) + a2*cos(b*x) + c
This is based on the identity for the sin(u+v).
Recognize that if b is known, then the problem of estimating {a1, a2, c} is a simple LINEAR regression problem. So all you need to do is use a 1-variable minimization tool, working on the value of b, to minimize the sum of squares of the residuals from that linear regression model. There are many such univariate optimizers to be found.
Once you have those parameters, it is easy to find the parameter a in your original model, since that is all you care about.
a = sqrt(a1^2 + a2^2)
The scheme I have described is called a partitioned least squares.
If you have a reasonable estimate of the size and the nature of your noise (e.g. white Gaussian with SD sigma), you can
(a) invert the Hessian matrix to get an estimate of the error in your position and
(b) should be able to easily derive a significance statistic for your fit residues.
For (a), compare http://www.physics.utah.edu/~detar/phys6720/handouts/curve_fit/curve_fit/node6.html
For (b), assume that your measurement errors are independent and thus the variance of their sum is the sum of their variances.
If I say to you:
"I am thinking of a number between 0 and n, and I will tell you if your guess is high or low", then you will immediately reach for binary search.
What if I remove the upper bound? i.e. I am thinking of a positive integer, and you need to guess it.
One possible method would be for you to guess 2, 4, 8, ..., until you guess 2**k for some k and I say "lower". Then you can apply binary search.
Is there a quicker method?
EDIT:
Clearly, any solution is going to take time proportional to the size of the target number. If I chuck Graham's number through the Ackermann function, we'll be waiting a while whatever strategy you pursue.
I could offer this algorithm too: Guess each integer in turn, starting from 1.
It's guaranteed to finish in a finite amount of time, but yet it's clearly much worse than my "powers of 2" strategy. If I can find a worse algorithm (and know that it is worse), then maybe I could find a better one?
For example, instead of powers of 2, maybe I can use powers of 10. Then I find the upper bound in log_10(n) steps, instead of log_2(n) steps. But I have to then search a bigger space. Say k = ceil(log_10(n)). Then I need log_2(10**k - 10**(k-1)) steps for my binary search, which I guess is about 10+log_2(k). For powers of 2, I have roughly log_2(log_2(n)) steps for my search phase. Which wins?
What if I search upwards using n**n? Or some other sequence? Does the prize go to whoever can find the sequence that grows the fastest? Is this a problem with an answer?
Thank you for your thoughts. And my apologies to those of you suggesting I start at MAX_INT or 2**32-1, since I'm clearly drifting away from the bounds of practicality here.
FINAL EDIT:
Hi all,
Thank you for your responses. I accepted the answer by Norman Ramsey (and commenter onebyone) for what I understood to be the following argument: for a target number n, any strategy must be capable of distinguishing between (at least) the numbers from 0..n, which means you need (at least) O(log(n)) comparisons.
However seveal of you also pointed out that the problem is not well-defined in the first place, because it's not possible to pick a "random positive integer" under the uniform probability distribution (or, rather, a uniform probability distribution cannot exist over an infinite set). And once I give you a nonuniform distribution, you can split it in half and apply binary search as normal.
This is a problem that I've often pondered as I walk around, so I'm pleased to have two conclusive answers for it.
If there truly is no upper bound, and all numbers all the way to infinity are equally likely, then there is no optimum way to do this. For any finite guess G, the probability that the number is lower than G is zero and the probability that it is higher is 1 - so there is no finite guess that has an expectation of being higher than the number.
RESPONSE TO JOHN'S EDIT:
By the same reasoning that powers of 10 are expected to be better than powers of 2 (there's only a finite number of possible Ns for which powers of 2 are better, and an infinite number where powers of 10 are better), powers of 20 can be shown to be better than powers of 10.
So basically, yes, the prize goes to fastest-growing sequence (and for the same sequence, the highest starting point) - for any given sequence, it can be shown that a faster growing sequence will win in infinitely more cases. And since for any sequence you name, I can name one that grows faster, and for any integer you name, I can name one higher, there's no answer that can't be bettered. (And every algorithm that will eventually give the correct answer has an expected number of guesses that is infinite, anyway).
People (who have never studied probability) tend to think that "pick a number from 1 to N" means "with equal probability of each", and they act according to their intuitive understanding of probability.
Then when you say "pick any positive integer", they still think it means "with equal probability of each".
This is of course impossible - there exists no discrete probability distribution with domain the positive integers, where p(n) == p(m) for all n, m.
So, the person picking the number must have used some other probability distribution. If you know anything at all about that distribution, then you must base your guessing scheme on that knowledge in order to have the "fastest" solution.
The only way to calculate how "fast" a given guessing scheme is, is to calculate its expected number of guesses to find the answer. You can only do this by assuming a probability distribution for the target number. For example, if they have picked n with probability (1/2) ^ n, then I think your best guessing scheme is "1", "2", "3",... (average 2 guesses). I haven't proved it, though, maybe it's some other sequence of guesses. Certainly the guesses should start small and grow slowly. If they have picked 4 with probability 1 and all other numbers with probability 0, then your best guessing scheme is "4" (average 1 guess). If they have picked a number from 1 to a trillion with uniform distribution, then you should binary search (average about 40 guesses).
I say the only way to define "fast" - you could look at worst case. You have to assume a bound on the target, to prevent all schemes having the exact same speed, namely "no bound on the worst case". But you don't have to assume a distribution, and the answer for the "fastest" algorithm under this definition is obvious - binary search starting at the bound you selected. So I'm not sure this definition is terribly interesting...
In practice, you don't know the distribution, but can make a few educated guesses based on the fact that the picker is a human being, and what numbers humans are capable of conceiving. As someone says, if the number they picked is the Ackermann function for Graham's number, then you're probably in trouble. But if you know that they are capable of representing their chosen number in digits, then that actually puts an upper limit on the number they could have chosen. But it still depends what techniques they might have used to generate and record the number, and hence what your best knowledge is of the probability of the number being of each particular magnitude.
Worst case, you can find it in time logarithmic in the size of the answer using exactly the methods you describe. You might use Ackermann's function to find an upper bound faster than logarithmic time, but then the binary search between the number guessed and the previous guess will require time logarithmic in the size of the interval, which (if guesses grow very quickly) is close to logarithmic in the size of the answer.
It would be interesting to try to prove that there is no faster algorithm (e.g., O(log log n)), but I have no idea how to do it.
Mathematically speaking:
You cannot ever correctly find this integer. In fact, strictly speaking, the statement "pick any positive integer" is meaningless as it cannot be done: although you as a person may believe you can do it, you are actually picking from a bounded set - you are merely unconscious of the bounds.
Computationally speaking:
Computationally, we never deal with infinites, as we would have no way of storing or checking against any number larger than, say, the theoretical maximum number of electrons in the universe. As such, if you can estimate a maximum based on the number of bits used in a register on the device in question, you can carry out a binary search.
Binary search can be generalized: each time set of possible choices should be divided into to subsets of probability 0.5. In this case it's still applicable to infinite sets, but still requires knowledge about distribution (for finite sets this requirement is forgotten quite often)...
My main refinement is that I'd start with a higher first guess instead of 2, around the average of what I'd expect them to choose. Starting with 64 would save 5 guesses vs starting with 2 when the number's over 64, at the cost of 1-5 more when it's less. 2 makes sense if you expect the answer to be around 1 or 2 half the time. You could even keep a memory of past answers to decide the best first guess. Another improvement could be to try negatives when they say "lower" on 0.
If this is guessing the upper bound of a number being generated by a computer, I'd start with 2**[number of bits/2], then scale up or down by powers of two. This, at least, gets you the closest to the possible values in the least number of jumps.
However, if this is a purely mathematical number, you can start with any value, since you have an infinite range of values, so your approach would be fine.
Since you do not specify any probability distribution of the numbers (as others have correctly mentioned, there is no uniform distribution over all the positive integers), the No Free Lunch Theorem give the answer: any method (that does not repeat the same number twice) is as good as any other.
Once you start making assumptions about the distribution (f.x. it is a human being or binary computer etc. that chooses the number) this of course changes, but as the problem is stated any algorithm is as good as any other when averaged over all possible distributions.
Use binary search starting with MAX_INT/2, where MAX_INT is the biggest number your platform can handle.
No point in pretending we can actually have infinite possibilities.
UPDATE: Given that you insist on entering the realms of infinity, I'll just vote to close your question as not programming related :-)
The standard default assumption of a uniform distribution for all positive integers doesn't lead to a solution, so you should start by defining the probability distribution of the numbers to guess.
I'd probably start my guessing with Graham's Number.
The practical answer within a computing context would be to start with whatever is the highest number that can (realistically) be represented by the type you are using. In case of some BigInt type you'd probably want to make a judgement call about what is realistic... obviously ultimately the bound in that case is the available memory... but performance-wise something smaller may be more realistic.
Your starting point should be the largest number you can think of plus 1.
There is no 'efficient search' for a number in an infinite range.
EDIT: Just to clarify, for any number you can think of there are still infinitely more numbers that are 'greater' than your number, compared to a finite collection of numbers that are 'less' than your number. Therefore, assuming the chosen number is randomly selected from all positive numbers, you have zero | (approaching zero) chance of being 'above' the chosen number.
I gave an answer to a similar question "Optimal algorithm to guess any random integer without limits?"
Actually, provided there algorithm not just searches for the conceived number, but it estimates a median of the distribution of the number that you may re-conceive at each step! And also the number could be even from the real domain ;)
Is there an algorithm to estimate the median, mode, skewness, and/or kurtosis of set of values, but that does NOT require storing all the values in memory at once?
I'd like to calculate the basic statistics:
mean: arithmetic average
variance: average of squared deviations from the mean
standard deviation: square root of the variance
median: value that separates larger half of the numbers from the smaller half
mode: most frequent value found in the set
skewness: tl; dr
kurtosis: tl; dr
The basic formulas for calculating any of these is grade-school arithmetic, and I do know them. There are many stats libraries that implement them, as well.
My problem is the large number (billions) of values in the sets I'm handling: Working in Python, I can't just make a list or hash with billions of elements. Even if I wrote this in C, billion-element arrays aren't too practical.
The data is not sorted. It's produced randomly, on-the-fly, by other processes. The size of each set is highly variable, and the sizes will not be known in advance.
I've already figured out how to handle the mean and variance pretty well, iterating through each value in the set in any order. (Actually, in my case, I take them in the order in which they're generated.) Here's the algorithm I'm using, courtesy http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm:
Initialize three variables: count, sum, and sum_of_squares
For each value:
Increment count.
Add the value to sum.
Add the square of the value to sum_of_squares.
Divide sum by count, storing as the variable mean.
Divide sum_of_squares by count, storing as the variable mean_of_squares.
Square mean, storing as square_of_mean.
Subtract square_of_mean from mean_of_squares, storing as variance.
Output mean and variance.
This "on-line" algorithm has weaknesses (e.g., accuracy problems as sum_of_squares quickly grows larger than integer range or float precision), but it basically gives me what I need, without having to store every value in each set.
But I don't know whether similar techniques exist for estimating the additional statistics (median, mode, skewness, kurtosis). I could live with a biased estimator, or even a method that compromises accuracy to a certain degree, as long as the memory required to process N values is substantially less than O(N).
Pointing me to an existing stats library will help, too, if the library has functions to calculate one or more of these operations "on-line".
I use these incremental/recursive mean and median estimators, which both use constant storage:
mean += eta * (sample - mean)
median += eta * sgn(sample - median)
where eta is a small learning rate parameter (e.g. 0.001), and sgn() is the signum function which returns one of {-1, 0, 1}. (Use a constant eta if the data is non-stationary and you want to track changes over time; otherwise, for stationary sources you can use something like eta=1/n for the mean estimator, where n is the number of samples seen so far... unfortunately, this does not appear to work for the median estimator.)
This type of incremental mean estimator seems to be used all over the place, e.g. in unsupervised neural network learning rules, but the median version seems much less common, despite its benefits (robustness to outliers). It seems that the median version could be used as a replacement for the mean estimator in many applications.
I would love to see an incremental mode estimator of a similar form...
UPDATE (2011-09-19)
I just modified the incremental median estimator to estimate arbitrary quantiles. In general, a quantile function tells you the value that divides the data into two fractions: p and 1-p. The following estimates this value incrementally:
quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)
The value p should be within [0,1]. This essentially shifts the sgn() function's symmetrical output {-1,0,1} to lean toward one side, partitioning the data samples into two unequally-sized bins (fractions p and 1-p of the data are less than/greater than the quantile estimate, respectively). Note that for p=0.5, this reduces to the median estimator.
UPDATE (2021-11-19)
For further details about the median estimator described here, I'd like to highlight this paper linked in the comments below: Bylander & Rosen, 1997, A Perceptron-Like Online Algorithm for Tracking the Median. Here is a postscript version from the author's website.
Skewness and Kurtosis
For the on-line algorithms for Skewness and Kurtosis (along the lines of the variance), see in the same wiki page here the parallel algorithms for higher-moment statistics.
Median
Median is tough without sorted data. If you know, how many data points you have, in theory you only have to partially sort, e.g. by using a selection algorithm. However, that doesn't help too much with billions of values. I would suggest using frequency counts, see the next section.
Median and Mode with Frequency Counts
If it is integers, I would count
frequencies, probably cutting off the highest and lowest values beyond some value where I am sure that it is no longer relevant. For floats (or too many integers), I would probably create buckets / intervals, and then use the same approach as for integers. (Approximate) mode and median calculation than gets easy, based on the frequencies table.
Normally Distributed Random Variables
If it is normally distributed, I would use the population sample mean, variance, skewness, and kurtosis as maximum likelihood estimators for a small subset. The (on-line) algorithms to calculate those, you already now. E.g. read in a couple of hundred thousand or million datapoints, until your estimation error gets small enough. Just make sure that you pick randomly from your set (e.g. that you don't introduce a bias by picking the first 100'000 values). The same approach can also be used for estimating mode and median for the normal case (for both the sample mean is an estimator).
Further comments
All the algorithms above can be run in parallel (including many sorting and selection algorithm, e.g. QuickSort and QuickSelect), if this helps.
I have always assumed (with the exception of the section on the normal distribution) that we talk about sample moments, median, and mode, not estimators for theoretical moments given a known distribution.
In general, sampling the data (i.e. only looking at a sub-set) should be pretty successful given the amount of data, as long as all observations are realizations of the same random variable (have the same distributions) and the moments, mode and median actually exist for this distribution. The last caveat is not innocuous. For example, the mean (and all higher moments) for the Cauchy Distribution do not exist. In this case, the sample mean of a "small" sub-set might be massively off from the sample mean of the whole sample.
I implemented the P-Square Algorithm for Dynamic Calculation of Quantiles and Histograms without Storing Observations in a neat Python module I wrote called LiveStats. It should solve your problem quite effectively. The library supports every statistic that you mention except for mode. I have not yet found a satisfactory solution for mode estimation.
Ryan, I'm afraid you are not doing the mean and variance right... This came up a few weeks ago here. And one of the strong points of the online version (which actually goes by the name of Welford's method) is the fact that it is specially accurate and stable, see the discussion here. One of the strong points is the fact that you do not need to store the total sum or total sum of squares...
I can't think of any on-line approach to the mode and median, which seem to require considering the whole list at once. But it may very well be that a similar approach than the one for the variance and mean will work also for the skewness and kurtosis...
The Wikipedia article quoted in the question contains the formulas for calcualting skewness and kurtosis on-line.
For mode - I believe - there is no way doing this on-line. Why? Assume that all values of your input are different besides the last one that duplicates a previous one. In this case you have to remember all values allready seen in the input to detect that the last value duplicates a value seen befor and makes it the most frequent one.
For median it is almost the same - up to the last input you don't know what value will become the median if all input values are different because it could be before or after the current median. If you know the length of the input, you can find the median without storing all values in memory, but you will still have to store many of them (I guess around the half) because a bad input sequence could shift the median heavily in the second half possibly making any value from the first half the median.
(Note that I am refering to exact calculation only.)
If you have billions of data points, then it's not likely that you need exact answers, as opposed to close answers. Generally, if you have billions of data points the underlying process which generates them will likely obey some kind of statistical stationarity / ergodicity / mixing property. Also it may matter whether you expect the distributions to be reasonably continuous or not.
In these circumstances, there exist algorithms for on-line, low memory, estimation of quantiles (the median is a special case of 0.5 quantile), as well as modes, if you don't need exact answers. This is an active field of statistics.
quantile estimation example: http://www.computer.org/portal/web/csdl/doi/10.1109/WSC.2006.323014
mode estimation example: Bickel DR. Robust estimators of the mode and skewness of continuous data. Computational Statistics and Data Analysis. 2002;39:153–163. doi: 10.1016/S0167-9473(01)00057-3.
These are active fields of computational statistics. You are getting into the fields where there isn't any single best exact algorithm, but a diversity of them (statistical estimators, in truth), which have different properties, assumptions and performance. It's experimental mathematics. There are probably hundreds to thousands of papers on the subject.
The final question is whether you really need skewness and kurtosis by themselves, or more likely some other parameters which may be more reliable at characterizing the probability distribution (assuming you have a probability distribution!). Are you expecting a Gaussian?
Do you have ways of cleaning/preprocessing the data to make it mostly Gaussianish? (for instance, financial transaction amounts are often somewhat Gaussian after taking logarithms). Do you expect finite standard deviations? Do you expect fat tails? Are the quantities you care about in the tails or in the bulk?
Everyone keeps saying that you can't do the mode in an online manner but that is simply not true. Here is an article describing an algorithm to do just this very problem invented in 1982 by Michael E. Fischer and Steven L. Salzberg of Yale University. From the article:
The majority-finding algorithm uses one of its registers for temporary
storage of a single item from the stream; this item is the current
candidate for majority element. The second register is a counter
initialized to 0. For each element of the stream, we ask the algorithm
to perform the following routine. If the counter reads 0, install the
current stream element as the new majority candidate (displacing any
other element that might already be in the register). Then, if the
current element matches the majority candidate, increment the counter;
otherwise, decrement the counter. At this point in the cycle, if the
part of the stream seen so far has a majority element, that element is
in the candidate register, and the counter holds a value greater than
0. What if there is no majority element? Without making a second pass through the data—which isn't possible in a stream environment—the
algorithm cannot always give an unambiguous answer in this
circumstance. It merely promises to correctly identify the majority
element if there is one.
It can also be extended to find the top N with more memory but this should solve it for the mode.
Ultimately if you have no a priori parametric knowledge of the distribution I think you have to store all the values.
That said unless you are dealing with some sort of pathological situation, the remedian (Rousseuw and Bassett 1990) may well be good enough for your purposes.
Very simply it involves calculating the median of batches of medians.
median and mode can't be calculated online using only constant space available. However, because median and mode are anyway more "descriptive" than "quantitative", you can estimate them e.g. by sampling the data set.
If the data is normal distributed in the long run, then you could just use your mean to estimate the median.
You can also estimate median using the following technique: establish a median estimation M[i] for every, say, 1,000,000 entries in the data stream so that M[0] is the median of the first one million entries, M[1] the median of the second one million entries etc. Then use the median of M[0]...M[k] as the median estimator. This of course saves space, and you can control how much you want to use space by "tuning" the parameter 1,000,000. This can be also generalized recursively.
I would tend to use buckets, which could be adaptive. The bucket size should be the accuracy you need. Then as each data point comes in you add one to the relevant bucket's count.
These should give you simple approximations to median and kurtosis, by counting each bucket as its value weighted by its count.
The one problem could be loss of resolution in floating point after billions of operations, i.e. adding one does not change the value any more! To get round this, if the maximum bucket size exceeds some limit you could take a large number off all the counts.
OK dude try these:
for c++:
double skew(double* v, unsigned long n){
double sigma = pow(svar(v, n), 0.5);
double mu = avg(v, n);
double* t;
t = new double[n];
for(unsigned long i = 0; i < n; ++i){
t[i] = pow((v[i] - mu)/sigma, 3);
}
double ret = avg(t, n);
delete [] t;
return ret;
}
double kurt(double* v, double n){
double sigma = pow(svar(v, n), 0.5);
double mu = avg(v, n);
double* t;
t = new double[n];
for(unsigned long i = 0; i < n; ++i){
t[i] = pow( ((v[i] - mu[i]) / sigma) , 4) - 3;
}
double ret = avg(t, n);
delete [] t;
return ret;
}
where you say you can already calculate sample variance (svar) and average (avg)
you point those to your functions for doin that.
Also, have a look at Pearson's approximation thing. on such a large dataset it would be pretty similar.
3 (mean − median) / standard deviation
you have median as max - min/2
for floats mode has no meaning. one would typically stick them in bins of a sginificant size (like 1/100 * (max - min)).
This problem was solved by Pebay et al:
https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2008/086212.pdf
Median
Two recent percentile approximation algorithms and their python implementations can be found here:
t-Digests
https://arxiv.org/abs/1902.04023
https://github.com/CamDavidsonPilon/tdigest
DDSketch
https://arxiv.org/abs/1908.10693
https://github.com/DataDog/sketches-py
Both algorithms bucket data. As T-Digest uses smaller bins near the tails the
accuracy is better at the extremes (and weaker close to the median). DDSketch additionally provides relative error guarantees.
for j in range (1,M):
y=np.zeros(M) # build the vector y
y[0]=y0
#generate the white noise
eps=npr.randn(M-1)*np.sqrt(var)
#increment the y vector
for k in range(1,T):
y[k]=corr*y[k-1]+eps[k-1]
yy[j]=y
list.append(y)