Parametric Scoring Function or Algorithm - algorithm

I'm trying to come up with a way to arrive at a "score" based on an integer number of "points" that is adjustable using a small number (3-5?) of parameters. Preferably it would be simple enough to reasonably enter as a function/calculation in a spreadsheet for tuning the parameters by the "designer" (not a programmer or mathematician). The first point has the most value and eventually additional points have a fixed or nearly fixed value. The transition from the initial slope of point value to final slope would be smooth. See example shapes below.
Points values are always positive integers (0 pts = 0 score)
At some point, curve is linear (or nearly), all additional points have fixed value
Preferably, parameters are understandable to a lay person, e.g.: "smoothness of the curve", "value of first point", "place where the additional value of points is fixed", etc
For parameters, an example of something ideal would be:
Value of first point: 10
Value of point #: 3 is: 5
Minimum value of additional points: 0.75
Exact shape of curve not too important as long as the corner can be more smooth or more sharp.
This is not for a game but more of a rating system in which multiple components (several of which might use this kind of scale) will be combined.
This seems like a non-traditional kind of question for SO/SE. I've done mostly financial software in my career, and I'm hoping there is some domain wisdom for this kind of thing I can tap into.
Implementation of Prune's Solution:
Google Sheet

Parameters:
Initial value (a)
Second value (b)
Minimum value (z)
Your decay ratio is b/a. It's simple from here: iterate through your values, applying the decay at each step, until you "peg" at the minimum:
x[n] = max( z, a * (b/a)^n )
// Take the larger of the computed "decayed" value,
// and the specified minimum.
The sequence x is your values list.
You can also truncate intermediate results if you want integers up to a certain point. Just apply the floor function to each computed value, but still allow z to override that if it gets too small.
Is that good enough? I know there's a discontinuity in the derivative function, which will be noticeable if the minimum and decay aren't pleasantly aligned. You can adjust this with a relative decay, translating the exponential decay curve from y = 0 to z.
base = z
diff = a-z
ratio = (b-z) / diff
x[n] = z + diff * ratio^n
In this case, you don't need the max function: the decaying term diff * ratio^n has a natural asymptote of 0, so x[n] approaches z on its own.
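Both formulas can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names are made up here, and the total score is taken to be the running sum of the per-point values x[n], which is one reading of the question rather than something the answer states.

```python
def pegged_values(a, b, z, count):
    """Per-point values a * (b/a)^n, never dropping below the minimum z."""
    ratio = b / a
    return [max(z, a * ratio ** n) for n in range(count)]


def smooth_values(a, b, z, count):
    """Per-point values decaying smoothly from a towards the asymptote z."""
    diff = a - z
    ratio = (b - z) / diff
    return [z + diff * ratio ** n for n in range(count)]


def score(values, points):
    """Assumed scoring rule: the sum of the first `points` per-point values."""
    return sum(values[:points])


if __name__ == "__main__":
    for name, fn in (("pegged", pegged_values), ("smooth", smooth_values)):
        values = fn(10, 5, 0.75, 8)
        print(name, [round(v, 3) for v in values], round(score(values, 8), 3))
```

In a spreadsheet the same thing is one formula per row for x[n] and a running SUM column for the score.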

Related

Numerical instability?

I am working in a program that concerns the optimization of some objective function obj over the scalar beta. The true global minimum beta0 is set at beta0=1.
In the MWE below, obj is constructed as the sum of the 100-R (here I use R=3) smallest eigenvalues of the 100x100 symmetric matrix u'*u. Around the true global minimum obj "looks good", but when I plot the objective function at much larger values of beta it becomes very unstable: running the MWE, you can see multiple local minima (and maxima) appear, some with values of obj(beta) smaller than the value at the true global minimum.
My guess is that there is some sort of "numerical instability" going on, but I am unable to find the source.
%Matrix dimensions
N=100;
T=100;
%Reproducibility
rng('default');
%True global minimum
beta0=1;
%Generating data
l=1+randn(N,2);
s=randn(T+1,2);
la=1+randn(N,2);
X(1,:,:)=1+(3*l+la)*(3*s(1:T,:)+s(2:T+1,:))';
s=s(1:T,:);
a=(randn(N,T));
Y=beta0*squeeze(X(1,:,:))+l*s'+a;
%Give "beta" a large value
beta=1e6;
%Compute objective function
u=Y-beta*squeeze(X(1,:,:));
ev=sort(eig(u'*u)); % sort eigenvalues
obj=sum(ev(1:100-3))/(N*T); % "obj" is sum of 97 smallest eigenvalues
This evaluates the objective function at obj(beta=1e6). I have noticed that some of the eigenvalues from eig(u'*u) are negative (see object ev), when by construction the matrix u'*u is positive semidefinite.
I am guessing this may have to do with floating point arithmetic issues and may (partly) be the answer to the instability of my function, but I am not sure.
Finally, this is what the objective function obj evaluated at a wide range of values for beta looks like:
% Now plot "obj" for a wide range of values of "beta"
clear obj
betaGrid=-5e5:100:5e5;
for i=1:length(betaGrid)
u=Y-betaGrid(i)*squeeze(X(1,:,:));
ev=sort(eig(u'*u));
obj(i)=sum(ev(1:100-3))/(N*T);
end
plot(betaGrid,obj,"*")
xlabel('\beta')
ylabel('obj')
This gives this figure, which shows how unstable it becomes for extreme values for beta.
The key here is noticing that computing eigenvalues can be a hard problem.
Actually the condition number for this problem is K = norm(A) * norm(inv(A)) (don't compute it this way, use cond()). This means that a (relative) perturbation in the input (i.e. the matrix entries) gets amplified by the condition number when computing the output. I modified your code a little bit to compute and plot the condition number in each step. It turns out that for a large part of the range you are interested in it is greater than 10^17, which is abysmal. (Note that double-precision floating point numbers are accurate to not quite 16 significant (decimal) digits. This means even the representation error of double floating point numbers will here produce errors that make every digit "insignificant".) This already explains the bad behaviour. You should note that usually we can compute the largest eigenvalues quite accurately, while the errors in the smaller (in magnitude) ones usually increase.
If the condition number was better (closer to 1) I would have suggested computing the singular values, as they happen to be the eigenvalues (due to the symmetry). The svd is numerically more stable, but with this really bad condition even this will not help. In the following modification of the final snippet I added a graph that plots the condition number.
The only case where anything is salvageable is for R=0, then we actually want to compute the sum of all eigenvalues, which happens to be the trace of our matrix, which can easily be computed by just summing the diagonal entries.
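Written out in the notation of the question (a sketch, with \lambda_i denoting the eigenvalues of u'u and u_{ij} the entries of the N-by-T matrix u), the R=0 identity is:

```latex
\mathrm{obj}\big|_{R=0}
  = \frac{1}{NT}\sum_{i=1}^{T}\lambda_i\!\left(u^{\top}u\right)
  = \frac{1}{NT}\operatorname{tr}\!\left(u^{\top}u\right)
  = \frac{1}{NT}\sum_{i=1}^{N}\sum_{j=1}^{T} u_{ij}^{2}
```

The last form is a plain sum of squares, so no eigenvalue computation is involved and the result cannot come out negative.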
To summarize: This problem seems to have an inherent bad condition, so it doesn't really matter how you compute it. If you have a completely different formulation for the same problem that might help.
% Now plot "obj" for a wide range of values of "beta"
clear obj
L = 5e5; % decrease to 5e-1 to see that the condition number is still >1e9 around the optimum
betaGrid=linspace(-L,L,1000);
condition = nan(size(betaGrid));
for i=1:length(betaGrid)
disp(i/length(betaGrid))
u=Y-betaGrid(i)*squeeze(X(1,:,:));
A = u'*u;
ev=sort(eig(A));
condition(i) = cond(A);
obj(i)=sum(ev(1:100-3))/(N*T); % for R=0 use trace(A)/(N*T);
end
subplot(1,2,1);
plot(betaGrid,obj,"*")
xlabel('\beta')
ylabel('obj')
subplot(1,2,2);
semilogy(betaGrid, condition);
title('condition number');

Select a number not present in a list

Is there an elegant method to create a number that does not exist in a given list of floating point numbers? It would be nice if this number were not close to the existing values in the array.
For example, in the list [-1.5, 1e+38, -1e38, 1e-12] it might be nice to pick a number like 20 that's "far" away from the existing numbers as opposed to 0.0 which is not in the list, but very close to 1e-12.
The only algorithm I've been able to come up with involves creating a random number and testing to see if it is not in the array. If so, regenerate. Is there a better deterministic approach?
Here's a way to select a random number not in the list, where the probability is higher the further away from an existing point you get.
Create a probability distribution function f as follows:
f(x) = <the absolute distance to the point closest to x>
Such a function gives a higher probability the further away from any given point you are. (Note that it should be normalized so that the area below the function is 1.)
Create the primitive function F of f (i.e. the accumulated area below f up to a given point).
Generate a uniformly random number, x, between 0 and 1 (that's easy! :)
Get the final result by applying the inverse of F to that value: F^-1(x).
Here's a picture describing a situation with 1.5, 2.2 and 2.9 given as existing numbers:
Here's the intuition of why it works:
The higher probability you have (the higher the blue line is) the steeper the red line is.
The steeper the red line is, the more probable it is that x hits the red line at that point.
For example: at the given points, the blue line is 0, thus the red line is horizontal. If the red line is horizontal, the probability that x hits that point is zero.
(If you want the full range of doubles, you could set min / max to -Double.MAX_VALUE and Double.MAX_VALUE respectively.)
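A minimal Python sketch of sampling from that distance-to-nearest-point density on a bounded interval follows. Note the swapped-in technique: instead of inverting F in closed form, it picks one linear piece of f with probability proportional to its area and then draws from a triangular distribution on that piece, which should give the same distribution. The function name and the lo/hi bounds are assumptions made for the illustration.

```python
import random


def sample_far_from(points, lo, hi, rng=random):
    """Draw x in [lo, hi] with density proportional to the distance to the nearest point."""
    if lo >= hi:
        raise ValueError("need lo < hi")
    pts = sorted(p for p in set(points) if lo <= p <= hi)
    if not pts:
        return rng.uniform(lo, hi)
    # Each piece is (left, right, mode, area under f on that piece).
    pieces = []
    if pts[0] > lo:
        d = pts[0] - lo
        pieces.append((lo, pts[0], lo, d * d / 2))  # f rises linearly towards lo
    for a, b in zip(pts, pts[1:]):
        w = b - a
        pieces.append((a, b, (a + b) / 2, w * w / 4))  # tent peaking at the midpoint
    if pts[-1] < hi:
        d = hi - pts[-1]
        pieces.append((pts[-1], hi, hi, d * d / 2))  # f rises linearly towards hi
    left, right, mode, _ = rng.choices(pieces, weights=[p[3] for p in pieces])[0]
    return rng.triangular(left, right, mode)


if __name__ == "__main__":
    print(sample_far_from([1.5, 2.2, 2.9], 0.0, 5.0))
```

With very large bounds the squared widths can overflow a double, so this sketch is only meant for moderate ranges.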
If you have the constraint, that the new value must be somewhere in between [min, max] then you could sort your values and insert the mean value of the two adjacent values with the largest absolute difference.
In your sample case [-1e38, -1.5, 1e-12, 1e+38] is the ordered list. As you calculate the absolute differences, you'll find the maximum difference for the values (1e-12, 1e+38) so you calculate the new value to be ((n[i+1] - n[i]) / 2) + n[i] (simple mean value calculation).
Update:
Additionally you could also check if the FLOAT_MAX or FLOAT_MIN values will give good candidates. Simply check their distance to min and max and if the result values are larger than the maximum difference for two adjacent values, pick them.
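The sort-and-largest-gap idea is deterministic and short enough to sketch in Python. The function name is hypothetical, and the midpoint formula is the one given above, n[i] + (n[i+1] - n[i]) / 2.

```python
def farthest_midpoint(values):
    """Return the midpoint of the widest gap between adjacent sorted values."""
    ordered = sorted(set(values))
    if len(ordered) < 2:
        raise ValueError("need at least two distinct values")
    # Pick the adjacent pair with the largest difference.
    lo, hi = max(zip(ordered, ordered[1:]), key=lambda pair: pair[1] - pair[0])
    return lo + (hi - lo) / 2


if __name__ == "__main__":
    print(farthest_midpoint([2.9, 1.5, 10.0, 2.2]))
```

When two gaps tie, max() keeps the first one in sorted order, so the result is still reproducible.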
If there is no upper bound, just sum up the absolute value of all the numbers, or subtract them all.
Another possible solution would be to get the smallest number and the greatest number in the list, and choose something outside their bounds (maybe double the greatest number).
Or probably the best way would be to compute the average, the smallest and the biggest number, as well as the standard deviation. Then, with all this data, you know how the numbers are structured, and can choose accordingly (all clustered around a given negative value? Choose a positive one. All small numbers? Choose a big one. etc.)
Something along the lines of
number := 1
multiplier := random(1000)+1
if avg>0
number:= -number
if min < 1 and max > 1
multiplier:= 1 / (random(1000)+1)
if stdDev > 1000
number := avg+random(500)-250
multiplier:= multiplier / (random(1000)+1)
(just an example from the top of my head)
Or another possibility would be to XOR all the numbers together. Should yield a good result.

GPS Data time to distance base transformation

I am developing an application that logs a GPS trace over time.
After the trace is complete, I need to convert the time based data to distance based data, that is to say, where the original trace had a lon/lat record every second, I need to convert that into having a lon/lat record every 20 meters.
Smoothing the original data seems to be a well understood problem and I suppose I need something like a smoothing algorithm, but I'm struggling to think how to convert from a time based data set to a distance based data set.
This is an excellent question, and what makes it so interesting is that the data points should be assumed random. This means you cannot expect a beginning-to-end data graph that represents a well-behaved function (like a sine or cosine wave). So you will have to work in small increments such that values on your x-axis (so to speak) do not oscillate, meaning Xn cannot be less than Xn-1. The next consideration would be the case of overlap or near overlap of data points. Imagine I’m recording my GPS coordinates and we have stopped to chat or rest and I walk randomly within a twenty five foot circle for the next five minutes. So the question would be how to ignore this type of “data noise”?
For simplicity let’s consider linear calculations where there is no approximation between two points; it’s a straight line. This will probably be more than sufficient for your calculations. Now given the comment above regarding random data points, you will want to traverse your data from your start point to the end point sequentially. Sequential termination occurs when you exceed the last data point or you have exceeded the overall distance to produce coordinates (like a subset). Let’s assume your plot precision is X. This would be your 20 meters. As you traverse there will be three conditions:
The distance between the two points is greater than your precision. Therefore save the start point plus the precision X. This will also become your new start point.
The distance between the two points is equal to your precision. Therefore save the start point plus the precision X (or save the end point). This will also become your new start point.
The distance between the two points is less than your precision. Therefore the precision is reduced by the distance between the two points, and the end point becomes your new start point.
Here is pseudo-code that might help get you started. Note: point y minus point x = the distance between them, and point x plus value = a new point on the line between point x and point y, at distance value from point x.
recordedPoints = received from trace;
newPlotPoints = empty list of coordinates;
plotPrecision = 20;
immedPrecision = plotPrecision;
startPoint = recordedPoints[0];
for (int i = 1; i < recordedPoints.Length; i++)
{
    // Distance from the current start point to the next recorded point
    Delta = recordedPoints[i] - startPoint;
    if (immedPrecision < Delta)
    {
        // Emit a point on the segment and continue from it
        newPlotPoints.Add(startPoint + immedPrecision);
        startPoint = startPoint + immedPrecision;
        immedPrecision = plotPrecision;
        i--; // re-examine the same recorded point
    }
    else if (immedPrecision == Delta)
    {
        newPlotPoints.Add(startPoint + immedPrecision);
        startPoint = startPoint + immedPrecision;
        immedPrecision = plotPrecision;
    }
    else if (immedPrecision > Delta)
    {
        // Store last data point regardless
        if (i == recordedPoints.Length - 1)
        {
            newPlotPoints.Add(startPoint + Delta);
        }
        startPoint = recordedPoints[i];
        // Carry the remaining distance over to the next segment
        immedPrecision = immedPrecision - Delta;
    }
}
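The traversal above can also be sketched as runnable Python. This is an illustration under assumptions: points are (lat, lon) tuples in degrees, distances use the haversine formula with a mean earth radius of 6,371 km, and new points are placed by interpolating latitude and longitude linearly along each segment, which is only reasonable for short segments. The function names are made up for the example.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean earth radius, an approximation


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = math.radians(p[0]), math.radians(p[1])
    lat2, lon2 = math.radians(q[0]), math.radians(q[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def resample_by_distance(points, step_m=20.0):
    """Return the first point plus one point every step_m metres along the trace."""
    if step_m <= 0:
        raise ValueError("step_m must be positive")
    if not points:
        return []
    out = [points[0]]
    start = points[0]
    remaining = step_m  # distance still to travel before the next output point
    i = 1
    while i < len(points):
        seg = haversine_m(start, points[i])
        if seg > 0 and seg >= remaining:
            # Place a point `remaining` metres along the segment and restart from it.
            t = remaining / seg
            start = (start[0] + t * (points[i][0] - start[0]),
                     start[1] + t * (points[i][1] - start[1]))
            out.append(start)
            remaining = step_m
        else:
            # Segment is too short: use it up and move on to the next recorded point.
            remaining -= seg
            start = points[i]
            i += 1
    return out


if __name__ == "__main__":
    trace = [(0.0, 0.0), (0.0, 0.0005), (0.0, 0.001)]
    for lat, lon in resample_by_distance(trace, 20.0):
        print(f"{lat:.6f}, {lon:.6f}")
```

Unlike the pseudo-code, this sketch does not append the final recorded point when it falls short of a full step; add it if the end of the trace must be kept.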
Previously I mentioned "data noise". You can wrap the "if" and "else if" branches in another "if" that scrubs this noise. The easiest way is to ignore a data point if it has not moved a given distance. Keep in mind this magic number must be small enough that sequentially ignored data points don't sum to something large and valuable, so putting a limit on the number of consecutive ignored data points might be a benefit.
With all this said, there are many ways to accurately perform this operation. One suggestion to take this subject to the next level is Interpolation. For .NET there is an open source library at http://www.mathdotnet.com. You can use their Numerics library which contains Interpolation at http://numerics.mathdotnet.com/interpolation/. If you choose such a route your next major hurdle will be deciding the appropriate Interpolation technique. If you are not a math guru here is a bit of information to get you started http://en.wikipedia.org/wiki/Interpolation. Frankly, Polynomial Interpolation using two adjacent points would be more than sufficient for your approximations provided you make sure that Xn is not < Xn-1, otherwise your approximation will be skewed.
The last item to note: these calculations are two-dimensional and do not consider altitude or the curvature of the earth. Here is some additional information in that regard: Calculate distance between two latitude-longitude points? (Haversine formula).
Nevertheless, hopefully this will point you in the correct direction. Without doubt this is not a trivial problem, so keeping the data point range as small as possible while still being accurate will be to your benefit.
One other consideration might be to approximate using the actual recorded data points, using the precision to discard excess points. That way you are not essentially saving two lists of coordinates.
Cheers,
Jeff

Which algorithm will be required to do this?

I have data of this form:
for x=1, y is one of {1,4,6,7,9,18,16,19}
for x=2, y is one of {1,5,7,4}
for x=3, y is one of {2,6,4,8,2}
....
for x=100, y is one of {2,7,89,4,5}
Only one of the values in each set is the correct value, the rest is random noise.
I know that the correct values describe a sinusoid function whose parameters are unknown. How can I find the correct combination of values, one from each set?
I am looking for something like a "travelling salesman" combinatorial optimization algorithm.
You're trying to do curve fitting, for which there are several algorithms depending on the type of curve you want to fit your data to (linear, polynomial, etc.). I have no idea whether there is a specific algorithm for sinusoidal curves (Fourier approximations), but my first idea would be to use a polynomial fitting algorithm with a polynomial approximation of the sine.
I wonder whether you need to do this in the course of another larger program, or whether you are trying to do this task on its own. If so, then you'd be much better off using a statistical package, my preferred one being R. It allows you to import your data and fit curves and draw graphs in just a few lines, and you could also use R in batch-mode to call it from a script or even a program (this is what I tend to do).
It depends on what you mean by "exactly", and what you know beforehand. If you know the frequency w, and that the sinusoid is unbiased, you have an equation
a cos(w * x) + b sin(w * x)
with two (x,y) points at different x values you can find a and b, and then check the generated curve against all the other points. Choose the two x values with the smallest number of y observations and try it for all the y's. If there is a bias, i.e. your equation is
a cos(w * x) + b sin(w * x) + c
You need to look at three x values.
If you do not know the frequency, you can try the same technique, unfortunately the solutions may not be unique, there may be more than one w that fits.
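For the known-frequency, unbiased case, the fit-two-points-then-check procedure might look like the Python sketch below. The function names, the dictionary layout of the candidates, and the matching tolerance are all assumptions made for the illustration.

```python
import math
from itertools import combinations, product


def solve_ab(w, x1, y1, x2, y2):
    """Solve a*cos(w*x) + b*sin(w*x) = y at two points; None if the system is singular."""
    c1, s1 = math.cos(w * x1), math.sin(w * x1)
    c2, s2 = math.cos(w * x2), math.sin(w * x2)
    det = c1 * s2 - s1 * c2
    if abs(det) < 1e-9:
        return None
    return (y1 * s2 - s1 * y2) / det, (c1 * y2 - y1 * c2) / det


def recover_sinusoid(w, candidates, tol=1e-6):
    """candidates maps x -> list of possible y values; returns (a, b, chosen) or None."""
    # Start with the x values that have the fewest candidate y values.
    xs = sorted(candidates, key=lambda x: len(candidates[x]))
    for x1, x2 in combinations(xs, 2):
        trials = [solve_ab(w, x1, y1, x2, y2)
                  for y1, y2 in product(candidates[x1], candidates[x2])]
        if None in trials:
            continue  # this pair of x values cannot determine a and b
        best = None
        for a, b in trials:
            chosen = {}
            for x, ys in candidates.items():
                target = a * math.cos(w * x) + b * math.sin(w * x)
                y = min(ys, key=lambda v: abs(v - target))
                if abs(y - target) <= tol:
                    chosen[x] = y
            # Keep the curve that explains the most x values.
            if best is None or len(chosen) > len(best[2]):
                best = (a, b, chosen)
        return best
    return None


if __name__ == "__main__":
    w = 0.3
    truth = {x: 3.0 * math.cos(w * x) - 2.0 * math.sin(w * x) for x in range(1, 11)}
    noisy = {x: [100.0 + x, truth[x], -50.0 - 2 * x] for x in truth}
    a, b, chosen = recover_sinusoid(w, noisy)
    print(round(a, 6), round(b, 6), len(chosen))
```

With measured data the tolerance would need to reflect the measurement error, and a biased sinusoid needs three x values and a 3x3 solve instead.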
Edit As I understand your problem, you have a real y value for each x and a bunch of incorrect ones. You want to find the real values. The best way to do this is to fit curves through a small number of points and check to see if the curve fits some y value in the other sets.
If not all the x values have valid y values then the same technique applies, but you need to look at a much larger set of pairs, triples or quadruples (essentially every pair, triple, or quad of points with different y values)
If your problem is something else, and I suspect it is, please specify it.
1. Define sinusoid. Most people take that to mean a function of the form a cos(w * x) + b sin(w * x) + c. If you mean something different, specify it.
2. Specify exactly what success looks like. An example with say 10 points instead of 100 would be nice.
It is extremely unclear what this has to do with combinatorial optimization.
Sinusoidal functions are so general that any random choice of y values can be fitted by some sinusoid unless you impose conditions, e.g. frequency < 100 or all parameters are integers. Without such conditions it is theoretically not possible to differentiate noise from data, so work on finding such conditions from your data source/experiment first.
By sinusoidal, do you mean a function that is increasing for n steps, then decreasing for n steps, etc.? If so, you can model your data as a sequence of nodes connected by up-links and down-links. For each node (possible value of y), record the length and end-value of chains of only ascending or descending links (there will be multiple chains per node). Then you scan for consecutive runs of equal length and opposite direction, modulo some initial offset.

How to calculate or approximate the median of a list without storing the list

I'm trying to calculate the median of a set of values, but I don't want to store all the values as that could blow memory requirements. Is there a way of calculating or approximating the median without storing and sorting all the individual values?
Ideally I would like to write my code a bit like the following
var medianCalculator = new MedianCalculator();
foreach (var value in SourceData)
{
medianCalculator.Add(value);
}
Console.WriteLine("The median is: {0}", medianCalculator.Median);
All I need is the actual MedianCalculator code!
Update: Some people have asked if the values I'm trying to calculate the median for have known properties. The answer is yes. One value is in 0.5 increments from about -25 to -0.5. The other is also in 0.5 increments from -120 to -60. I guess this means I can use some form of histogram for each value.
Thanks
Nick
If the values are discrete and the number of distinct values isn't too high, you could just accumulate the number of times each value occurs in a histogram, then find the median from the histogram counts (just add up counts from the top and bottom of the histogram until you reach the middle). Or if they're continuous values, you could distribute them into bins - that wouldn't tell you the exact median but it would give you a range, and if you need to know more precisely you could iterate over the list again, examining only the elements in the central bin.
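For the discrete case described in the question (values in 0.5 increments), a counting histogram is enough to get the exact median. A minimal Python sketch, with a hypothetical class name, could be:

```python
from collections import Counter


class HistogramMedian:
    """Exact median for a stream of values drawn from a small set of distinct values."""

    def __init__(self):
        self.counts = Counter()
        self.n = 0

    def add(self, value):
        self.counts[value] += 1
        self.n += 1

    def median(self):
        if self.n == 0:
            raise ValueError("no data")
        lo_rank = (self.n - 1) // 2  # 0-based rank of the lower middle element
        hi_rank = self.n // 2        # 0-based rank of the upper middle element
        seen = 0
        lo_val = None
        for value in sorted(self.counts):
            seen += self.counts[value]
            if lo_val is None and seen > lo_rank:
                lo_val = value
            if seen > hi_rank:
                return (lo_val + value) / 2


if __name__ == "__main__":
    calc = HistogramMedian()
    for v in [-25.0, -0.5, -3.5, -3.5, -10.0]:
        calc.add(v)
    print(calc.median())
```

Memory use depends on the number of distinct values, not on the length of the stream.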
There is the 'remedian' statistic. It works by first setting up k arrays, each of length b. Data values are fed in to the first array and, when this is full, the median is calculated and stored in the first pos of the next array, after which the first array is re-used. When the second array is full the median of its values is stored in the first pos of the third array, etc. etc. You get the idea :)
It's simple and pretty robust. The reference is here...
http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/Remedian.pdf
Hope this helps
Michael
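A rough Python sketch of the buffer scheme described above follows. The class name and defaults are assumptions; the estimate for a partially filled structure simply takes the median of the highest non-empty buffer, which is a simplification made here and not necessarily what the referenced paper recommends.

```python
import statistics


class Remedian:
    """k buffers of length b; each full buffer passes its median up one level."""

    def __init__(self, b=11, k=4):
        self.b = b
        self.buffers = [[] for _ in range(k)]
        self.capacity = b ** k
        self.count = 0

    def add(self, value):
        if self.count >= self.capacity:
            raise OverflowError("remedian buffers are full")
        self.count += 1
        level = 0
        self.buffers[0].append(value)
        # When a buffer fills, store its median in the next buffer and reuse it.
        while len(self.buffers[level]) == self.b and level + 1 < len(self.buffers):
            m = statistics.median(self.buffers[level])
            self.buffers[level].clear()
            level += 1
            self.buffers[level].append(m)

    def estimate(self):
        # Simplification: median of the highest buffer that holds anything.
        for buf in reversed(self.buffers):
            if buf:
                return statistics.median(buf)
        raise ValueError("no data")


if __name__ == "__main__":
    r = Remedian(b=3, k=2)
    for v in range(1, 10):
        r.add(v)
    print(r.estimate())
```

Storage is b * k values while up to b^k inputs can be processed, so b=11 and k=4 cover 14,641 samples with 44 slots.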
I use these incremental/recursive mean and median estimators, which both use constant storage:
mean += eta * (sample - mean)
median += eta * sgn(sample - median)
where eta is a small learning rate parameter (e.g. 0.001), and sgn() is the signum function which returns one of {-1, 0, 1}. (Use a constant eta if the data is non-stationary and you want to track changes over time; otherwise, for stationary sources you can use something like eta=1/n for the mean estimator, where n is the number of samples seen so far... unfortunately, this does not appear to work for the median estimator.)
This type of incremental mean estimator seems to be used all over the place, e.g. in unsupervised neural network learning rules, but the median version seems much less common, despite its benefits (robustness to outliers). It seems that the median version could be used as a replacement for the mean estimator in many applications.
Also, I modified the incremental median estimator to estimate arbitrary quantiles. In general, a quantile function tells you the value that divides the data into two fractions: p and 1-p. The following estimates this value incrementally:
quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)
The value p should be within [0,1]. This essentially shifts the sgn() function's symmetrical output {-1,0,1} to lean toward one side, partitioning the data samples into two unequally-sized bins (fractions p and 1-p of the data are less than/greater than the quantile estimate, respectively). Note that for p=0.5, this reduces to the median estimator.
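The update rule can be wrapped in a small Python class to experiment with. This is a sketch: the class name, the starting value, and the fixed eta are illustrative choices, and as noted above the result is an estimate that keeps fluctuating at a scale set by eta.

```python
import random


def sgn(v):
    """Signum: -1, 0 or 1."""
    return (v > 0) - (v < 0)


class StreamingQuantile:
    """Constant-memory quantile estimate; p=0.5 gives the median estimator."""

    def __init__(self, p=0.5, eta=0.001, start=0.0):
        self.p = p
        self.eta = eta
        self.value = start

    def add(self, sample):
        self.value += self.eta * (sgn(sample - self.value) + 2.0 * self.p - 1.0)


if __name__ == "__main__":
    rng = random.Random(1)
    median = StreamingQuantile(p=0.5)
    upper = StreamingQuantile(p=0.9)
    for _ in range(50000):
        s = rng.random()
        median.add(s)
        upper.add(s)
    print(round(median.value, 3), round(upper.value, 3))
```

The starting value and eta should be on the scale of the data: each sample moves the estimate by at most 2 * eta, so a start far from the true quantile takes many samples to recover.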
I would love to see an incremental mode estimator of a similar form...
(Note: I also posted this to a similar topic here: "On-line" (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis?)
Here is a crazy approach that you might try. This is a classical problem in streaming algorithms. The rules are
You have limited memory, say O(log n) where n is the number of items you want
You can look at each item once and make a decision then and there what to do with it, if you store it, it costs memory, if you throw it away it is gone forever.
The idea for finding a median is simple. Sample O(1 / a^2 * log(1 / p)) * log(n) elements from the list at random; you can do this via reservoir sampling (see a previous question). Now simply return the median of your sampled elements, using a classical method.
The guarantee is that the rank of the returned item, as a fraction of the sorted list, will be within (1 +/- a) / 2 with probability at least 1-p. So there is a probability p of failing, which you can reduce by sampling more elements. It won't return the exact median or guarantee that the value of the returned item is anywhere close to the median, just that when you sort the list the returned item will be close to the middle of the list.
This algorithm uses O(log n) additional space and runs in Linear time.
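A short Python sketch of the sample-then-take-the-median idea, using the standard reservoir sampling scheme (often called Algorithm R), is shown below. The function names and the default sample size are arbitrary choices for the illustration and are not derived from the bound above.

```python
import random
import statistics


def reservoir_sample(stream, k, rng=random):
    """Uniform random sample of up to k items from a stream, in one pass."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = item
    return sample


def approximate_median(stream, k=1001, rng=random):
    """Median of a reservoir sample; only k items are ever stored."""
    sample = reservoir_sample(stream, k, rng)
    if not sample:
        raise ValueError("empty stream")
    return statistics.median(sample)


if __name__ == "__main__":
    print(approximate_median(iter(range(100001)), k=1001, rng=random.Random(42)))
```

Because the sample is uniform regardless of input order, an already sorted stream is handled the same way as a shuffled one.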
This is tricky to get right in general, especially to handle degenerate series that are already sorted, or have a bunch of values at the "start" of the list but the end of the list has values in a different range.
The basic idea of making a histogram is most promising. This lets you accumulate distribution information and answer queries (like median) from it. The median will be approximate since you obviously don't store all values. The storage space is fixed so it will work with whatever length sequence you have.
But you can't just build a histogram from say the first 100 values and use that histogram continually.. the changing data may make that histogram invalid. So you need a dynamic histogram that can change its range and bins on the fly.
Make a structure which has N bins. You'll store the X value of each slot transition (N+1 values total) as well as the population of the bin.
Stream in your data. Record the first N+1 values. If the stream ends before this, great, you have all the values loaded and you can find the exact median and return it. Else use the values to define your first histogram. Just sort the values and use those as bin definitions, each bin having a population of 1. It's OK to have dupes (0 width bins).
Now stream in new values. For each one, binary search to find the bin it belongs to.
In the common case, you just increment the population of that bin and continue.
If your sample is beyond the histogram's edges (highest or lowest), just extend the end bin's range to include it.
When your stream is done, you find the median sample value by finding the bin which has equal population on both sides of it, and linearly interpolating the remaining bin-width.
But that's not enough.. you still need to ADAPT the histogram to the data as it's being streamed in. When a bin gets over-full, you're losing information about that bin's sub distribution.
You can fix this by adapting based on some heuristic... The easiest and most robust one is if a bin reaches some certain threshold population (something like 10*v/N where v=# of values seen so far in the stream, and N is the number of bins), you SPLIT that overfull bin. Add a new value at the midpoint of the bin, give each side half of the original bin's population. But now you have too many bins, so you need to DELETE a bin. A good heuristic for that is to find the bin with the smallest product of population and width. Delete it and merge it with its left or right neighbor (whichever one of the neighbors itself has the smallest product of width and population.). Done!
Note that merging or splitting bins loses information, but that's unavoidable.. you only have fixed storage.
This algorithm is nice in that it will deal with all types of input streams and give good results. If you have the luxury of choosing sample order, a random sample is best, since that minimizes splits and merges.
The algorithm also allows you to query any percentile, not just median, since you have a complete distribution estimate.
I use this method in my own code in many places, mostly for debugging logs.. where some stats that you're recording have unknown distribution. With this algorithm you don't need to guess ahead of time.
The downside is that the unequal bin widths mean you have to do a binary search for each sample, so your net algorithm is O(NlogN).
David's suggestion seems like the most sensible approach for approximating the median.
A running mean for the same problem is much easier to calculate:
Mn = Mn-1 + ((Vn - Mn-1) / n)
Where Mn is the mean of n values, Mn-1 is the previous mean, and Vn is the new value.
In other words, the new mean is the existing mean plus the difference between the new value and the mean, divided by the number of values.
In code this would look something like:
new_mean = prev_mean + ((value - prev_mean) / count)
though obviously you may want to consider language-specific stuff like floating-point rounding errors etc.
I don't think it is possible to do without having the list in memory. You can obviously approximate with
average if you know that the data is symmetrically distributed
or calculate a proper median of a small subset of data (that fits in memory) - if you know that your data has the same distribution across the sample (e.g. that the first item has the same distribution as the last one)
Find the Min and Max of the list containing N items through linear search and name them LowValue and HighValue
Let MedianIndex = (N+1)/2
1st Order Binary Search:
Repeat the following 4 steps while LowValue < HighValue.
Get MedianValue approximately = ( HighValue + LowValue ) / 2
Get NumberOfItemsWhichAreLessThanorEqualToMedianValue = K
Is K = MedianIndex? Then return MedianValue
Is K > MedianIndex? Then HighValue = MedianValue, else LowValue = MedianValue
It will be faster without consuming memory
2nd Order Binary Search:
LowIndex=1
HighIndex=N
Repeat the following 5 steps while (LowIndex < HighIndex)
Get Approximate DistributionPerUnit=(HighValue-LowValue)/(HighIndex-LowIndex)
Get Approximate MedianValue = LowValue + (MedianIndex-LowIndex) * DistributionPerUnit
Get NumberOfItemsWhichAreLessThanorEqualToMedianValue = K
is (K=MedianIndex) ? return MedianValue
is (K > MedianIndex) ? then HighIndex=K and HighValue=MedianValue Else LowIndex=K and LowValue=MedianValue
It will be faster than 1st order without consuming memory
We can also think of fitting HighValue, LowValue and MedianValue with HighIndex, LowIndex and MedianIndex to a Parabola, and can get ThirdOrder Binary Search which will be faster than 2nd order without consuming memory and so on...
Usually if the input is within a certain range, say 1 to 1 million, it's easy to create an array of counts: read the code for "quantile" and "ibucket" here: http://code.google.com/p/ea-utils/source/browse/trunk/clipper/sam-stats.cpp
This solution can be generalized as an approximation by coercing the input into an integer within some range using a function that you then reverse on the way out: IE: foo.push((int) input/1000000) and quantile(foo)*1000000.
If your input is an arbitrary double precision number, then you've got to autoscale your histogram as values come in that are out of range (see above).
Or you can use the median-triplets method described in this paper: http://web.cs.wpi.edu/~hofri/medsel.pdf
I picked up the idea of iterative quantile calculation. It is important to have a good value for starting point and eta, these may come from mean and sigma. So I programmed this:
Function QuantileIterative(Var x : Array of Double; n : Integer; p, mean, sigma : Double) : Double;
Var eta, quantile,q1, dq : Double;
i : Integer;
Begin
quantile:= mean + 1.25*sigma*(p-0.5);
q1:=quantile;
eta:=0.2*sigma/xy(1+n,0.75); // should not be too large! sets accuracy
For i:=1 to n Do
quantile := quantile + eta * (signum_smooth(x[i] - quantile,eta) + 2*p - 1);
dq:=abs(q1-quantile);
If dq>eta
then Begin
If dq<3*eta then eta:=eta/4;
For i:=1 to n Do
quantile := quantile + eta * (signum_smooth(x[i] - quantile,eta) + 2*p - 1);
end;
QuantileIterative:=quantile
end;
As the median for two elements would be the mean, I used a smoothed signum function, and xy() is x^y. Are there ideas to make it better? Of course if we have some more a-priori knowledge we can add code using min and max of the array, skew, etc. For big data you would not use an array perhaps, but for testing it is easier.
On a homogeneous, randomly ordered and big enough list, this pseudo code can work:
# find min on the fly
if minDataPoint > dataPoint:
minDataPoint = dataPoint
# find max on the fly
if maxDataPoint < dataPoint:
maxDataPoint = dataPoint
# estimate median based on the current data
estimate_mid = (maxDataPoint + minDataPoint) / 2
# if the **new** dataPoint is closer to the mid, store it
if abs(midDataPoint - estimate_mid) > abs(dataPoint - estimate_mid):
midDataPoint = dataPoint
Inspired by #lakshmanaraj

Resources