Mean max subset of array values - algorithm

I am working on an algorithm to compute multiple mean max values of an array. The array contains time/value pairs, such as HR data recorded on a Garmin device over a 5 hour run. The data arrives roughly once a second for an unknown period, but with no guaranteed frequency. An example would be the 10 minute mean maximum, which is the maximum average value over any 10 minute window. Assume "mean" is just the average value for this discussion. The desired mean maximal duration is arbitrary: 1 min, 5 min, 60 min. And I'm likely going to need many of them, at least 30, but ideally any duration on demand if it isn't a lengthy request.
Right now I have a straightforward algorithm to compute one value:
1) Start at the beginning of the array and "walk" forward until the subset spans the desired duration (or is one element past it). Stop if the end of the array is reached.
2) Find the average of the subset's values. Store it as the max average if it is larger than the current max.
3) Shift a single value off the left side of the array.
4) Repeat from step 1 until the end of the array is reached.
It basically computes every possible consecutive average and returns the max.
It does this for each duration. And it recomputes a real average for every window instead of sliding it somehow by removing the left point and adding the right, like one could do for a simple moving average series. It takes about 3-10 seconds per mean max value, depending on the total array size.
I'm wondering how to optimize this. For instance, the series of all mean max values will be a decreasing, roughly exponential curve, with the 1 s value highest and falling until the average over the entire recording is reached. Can this curve, and all its values, be interpolated from a certain number of points? Or is there some other optimization for the heavy computation above that still maintains accuracy?

"And it computes a real avg computation continuously instead of sliding it somehow by removing the left point and adding the right, like one could do for a Simple-moving-average series."
Why don't you just slide it (i.e. keep a running sum and divide by the number of elements in that sum)?
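For illustration, here is a minimal sketch of that sliding window in TypeScript. It assumes the recording has already been parsed into parallel arrays of timestamps (in seconds) and values; meanMax and windowSeconds are names I made up, not anything from the question:

// Mean max for one duration using a running sum and two forward-only pointers.
function meanMax(times: number[], values: number[], windowSeconds: number): number {
    if (values.length === 0) return NaN;
    let best = -Infinity;
    let sum = values[0];
    let right = 0;                       // window is values[left..right], inclusive
    for (let left = 0; left < values.length; left++) {
        // step 1: walk the right edge forward until the window spans the duration
        // (or the end of the data is reached)
        while (right + 1 < values.length && times[right] - times[left] < windowSeconds) {
            right++;
            sum += values[right];
        }
        // step 2: real average of the window, tracked as the max so far
        // (near the end of the data, shorter windows are still evaluated, matching steps 1-2 above)
        best = Math.max(best, sum / (right - left + 1));
        // step 3: drop the left point from the running sum before advancing
        sum -= values[left];
    }
    return best;
}

Each duration then costs one O(n) pass instead of recomputing every window's average from scratch; if floating-point drift in the running sum is a concern, recompute the sum occasionally or use compensated summation (see the Kahan note in the related question below).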

Related

Does a moving average preserve resolution over sum / count quotient in a binary system?

There are multiple methods for finding the average of a set of numbers.
First, the sum / count quotient. Add all values and divide them by the number of values.
Second, the moving average. The function I found in another Stack answer is:
New average = old average * (n-1)/n + new value /n
This works as long as each value is added to the average one value at a time.
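For concreteness, here is a sketch of both methods side by side (the function names are mine; JavaScript/TypeScript numbers are 64-bit doubles, so this only illustrates the structure of the 32-bit concern described below):

// Method 1: sum / count quotient. In 32-bit float arithmetic a large running sum
// loses low-order bits as its magnitude grows.
function averageBySum(values: number[]): number {
    let sum = 0;
    for (const v of values) sum += v;
    return sum / values.length;
}

// Method 2: incremental update, new_average = old_average * (n-1)/n + new_value/n.
// The intermediate value stays near the magnitude of the data instead of growing.
function averageIncremental(values: number[]): number {
    let avg = 0;
    let n = 0;
    for (const v of values) {
        n++;
        avg = avg * (n - 1) / n + v / n;
    }
    return avg;
}

Mathematically both compute the same cumulative mean; they differ only in how rounding error accumulates.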
My concern is that the second method is more computationally expensive for my processor to execute, but I also fear that the first method will result in a loss of resolution for data sets that produce large sums. In a 32 bit system, for example, the resolution of a stored float value automatically decreases as the magnitude of the number grows.
Does a moving average preserve resolution?
"moving average" does not calculate average over large interval.
It smooth data in such way that newer measurements give larger impact, and older measurements weight becomes smaller and smaller.
If you bother about large sums and want to preserve all possible data "bits", consider special methods like Kahan summation algorithm
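For reference, a minimal sketch of Kahan (compensated) summation:

// Carries the low-order bits lost by each addition in a separate compensation term.
function kahanSum(values: number[]): number {
    let sum = 0;
    let compensation = 0;              // running error term
    for (const v of values) {
        const y = v - compensation;    // remove the error accumulated so far
        const t = sum + y;             // low-order bits of y may be lost here
        compensation = (t - sum) - y;  // recover what was lost
        sum = t;
    }
    return sum;
}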

Calculating total over timespan with arbitrary datapoints

If I am given the total number of occurrences of an event over the last hour, and I can get this data at arbitrary times (but at least once an hour), how can I work out the total number of occurrences over a 24 hour period?
Obviously, you can't. For example, if the first two observations overlap, then it is impossible to determine the number of occurrences during the overlap. If there is a time gap between the first two observations, then there is no way to determine what happened during the gap. You could try to set up a system of equations, but the resulting system will be underdetermined (it could, however, give you both a min and a max, which might be relevant).
Why not adopt a statistical approach? Let X = the number of occurrences over a 1 hour period. This is a random variable. Estimate its expected value by sampling it at randomly chosen times and multiply your estimate by 24.
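A minimal sketch of that estimator, where samples is assumed to hold the 1-hour counts read at randomly chosen times:

// Estimate E[X] from the sampled hourly counts, then scale to a 24 hour period.
function estimateDailyTotal(samples: number[]): number {
    const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
    return mean * 24;
}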

How to detect partitions (clusters) of sparse data in linear time, and (hopefully) sublinear space?

Let's say I have m integers from n disjoint integer intervals, which are in some sense "far" apart.
n is not known beforehand, but it is known to be small (see assumptions below).
For example, for n = 3, I might have been given randomly distributed integers from the intervals 105-2400, 58030-571290, 1000000-1000100.
Finding the minimum (105) and maximum (1000100) is clearly trivial.
But is there any way to efficiently (in O(m) time and hopefully o(m) space) find the intervals' boundary points, so that I can quickly partition the data for separate processing?
If there is no efficient way to do this exactly, is there an efficient way to approximate the partitions, to within a small constant factor (like 2)?
(For example, 4000 would be an acceptable approximation of the upper bound of the smaller interval, and 30000 would be an acceptable approximation of the lower bound of the middle interval.)
Assumptions:
Everything is nonnegative
n is very small (say, < 10)
The max value is comparatively large (say, on the order of 2^26)
The intervals are dense (i.e. there exists an integer in the array for most values inside that interval)
Two clusters are far apart if their closest boundaries are at least a constant factor c apart.
(Edit: It makes sense for c to be relative to the cluster size rather than relative to the bound. So a cluster of 1 element at 1000000 should not be approximated as originating from the interval 500000-2000000.)
The integers are not sorted, and this is crucial. In fact sorting them in O(m) time is impossible without radix sort, but radix sort could have O(max value) complexity, and there is no guarantee the max value is anywhere close to m.
Again, speed is the most important factor here; inaccuracy is tolerated as long as it's within a reasonable factor.
I say move to a logarithmic scale with factor c to search for the intervals, since you know they are at least a factor of c apart. Then make an array of counters, where counter X counts the numbers whose log-base-c value falls in the interval [0.5*X, 0.5*X + 0.5).
Say c is 2 and you know the maximum upper bound is 2^26, so you create 52 counters, then calculate floor(2*log2(i)), where i is the current integer, and increment that counter. After you parse all m integers, walk through that array; each run of zeroes in there means that the corresponding logarithmic interval is empty.
So the output of this will be the sequence of occupied intervals, logarithmically aligned to half-powers of c, i.e. 128, ~181, 256, ~362, 512, etc. This satisfies your requirements on precision for the intervals' boundaries.
Update: You can also store the lowest and highest number among those that hit each interval. Once you do, the intervals' boundaries are calculated as follows:
Find the first nonzero counter from the current position in the counters array. Take its lowest number as the lower bound.
Progress through the counters array until you find a zero counter or hit the array's end. Take the last nonzero counter's highest number; this will be your upper bound for the current interval.
Proceed until the array has been fully traversed.
Return the set of intervals found. The boundaries will be exact.
An example (JavaScript-style code):
// helper: logarithm with base c (not built into JavaScript)
function logByBase(base, x) { return Math.log(x) / Math.log(base); }

counters = [];   // counters[n] = how many inputs fell into log-bucket n
lowest = [];     // lowest[n]   = smallest value seen in bucket n
highest = [];    // highest[n]  = largest value seen in bucket n
for (i = 0; i < m; i++) {
  x = getNextInteger();
  n = Math.floor(2.0 * logByBase(c, x));    // bucket index: log_c(x) in [n/2, (n+1)/2)
  counters[n] = (counters[n] || 0) + 1;     // buckets start out undefined, so initialize on first hit
  if (counters[n] == 1) {
    lowest[n] = x;
    highest[n] = x;
  } else {
    if (lowest[n] > x) lowest[n] = x;
    if (highest[n] < x) highest[n] = x;
  }
}

zeroflag = true;   // are we in mode of finding a zero or a nonzero
intervals = [];
currentLow = 0;
currentHigh = 0;
for (i = 0; i < counters.length; i++) {
  if (zeroflag) {
    // we search for a nonzero counter (start of an occupied run)
    if (counters[i] > 0) {
      currentLow = lowest[i];   // there's a value
      zeroflag = false;
    } // else skip
  } else {
    // we search for a zero (or never-touched) counter, which ends the run
    if (!(counters[i] > 0)) {
      currentHigh = highest[i - 1];                // previous was nonzero, get its highest
      intervals.push([currentLow, currentHigh]);   // store interval
      zeroflag = true;
    }
  }
}
if (!zeroflag) {   // unfinished interval that runs to the end of the counters array
  currentHigh = highest[counters.length - 1];
  intervals.push([currentLow, currentHigh]);       // store last interval
}
You may want to look at approximate median finding.
These methods can often be generalized to finding arbitrary quantiles with reasonable precision, and quantiles are good for distributing your work load.
Here's my take, using two passes over the data set.
1) Sample 10000 objects from your data set.
2) Solve your problem for the sample objects.
3) Re-scan your data set, assigning each object to the nearest interval from your sample, and tracking the minimum and maximum of each interval.
If your gaps are prominent enough, they should still be visible in the sample. The second pass is only to refine the interval boundaries.
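A rough sketch of that two-pass idea, where the sample is clustered by sorting it and splitting at jumps of at least the gap factor c; all names are mine and the split rule is one of several reasonable choices:

// Pass 1: cluster a random sample. Pass 2: refine the boundaries against the full data.
function twoPassClusters(data: number[], sampleSize: number, c: number): [number, number][] {
    if (data.length === 0) return [];
    const sample: number[] = [];
    const k = Math.min(sampleSize, data.length);
    for (let i = 0; i < k; i++) {
        sample.push(data[Math.floor(Math.random() * data.length)]);
    }
    sample.sort((a, b) => a - b);
    const rough: [number, number][] = [];
    let low = sample[0];
    for (let i = 1; i < sample.length; i++) {
        if (sample[i] > c * Math.max(sample[i - 1], 1)) {   // "far apart" in the factor-c sense
            rough.push([low, sample[i - 1]]);
            low = sample[i];
        }
    }
    rough.push([low, sample[sample.length - 1]]);

    // second pass: assign every point to the nearest rough interval, track exact min/max
    const refined = rough.map(() => [Infinity, -Infinity] as [number, number]);
    for (const x of data) {
        let best = 0;
        let bestDist = Infinity;
        for (let j = 0; j < rough.length; j++) {
            const [lo, hi] = rough[j];
            const d = x < lo ? lo - x : x > hi ? x - hi : 0;
            if (d < bestDist) { bestDist = d; best = j; }
        }
        refined[best][0] = Math.min(refined[best][0], x);
        refined[best][1] = Math.max(refined[best][1], x);
    }
    return refined;
}

The first pass is O(sampleSize log sampleSize); the second is O(m * n), which is effectively linear for small n.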
Split the total range into buckets. The bucket boundaries X_i should be spaced in an appropriate way, e.g. linearly, X_i = 16*i. Other options would be quadratic spacing, X_i = 4*i*i, or logarithmic, X_i = 2^(i/16); there the total number of buckets would be smaller, but finding the right bucket for a given number would take more effort. Each bucket is either empty or non-empty, so one bit per bucket is sufficient.
You iterate over the set of numbers, and for each number you mark its bucket as non-empty. The gaps between your intervals are then represented by runs of empty buckets, so you find all sufficiently long runs of empty buckets and you have the interval gaps. The accuracy of an interval boundary is determined by the bucket size: assuming a bucket size of 16, a boundary is off by at most 15. If the max number is 2^26, the buckets have size 16, and you use one bit per bucket, you need 2^19 bytes, or 512 kB, of memory.
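A minimal sketch of the linear-bucket variant with one bit per bucket; bucketSize and minGapBuckets are parameters I introduced for illustration:

// Mark fixed-width buckets as occupied in a bit array, then report runs of empty
// buckets that are long enough to count as gaps between clusters.
function findGaps(data: number[], maxValue: number, bucketSize: number, minGapBuckets: number): [number, number][] {
    const nBuckets = Math.floor(maxValue / bucketSize) + 1;
    const bits = new Uint8Array(Math.ceil(nBuckets / 8));   // one bit per bucket
    for (const x of data) {
        const b = Math.floor(x / bucketSize);
        bits[b >> 3] |= 1 << (b & 7);
    }
    const gaps: [number, number][] = [];
    let runStart = -1;
    for (let b = 0; b < nBuckets; b++) {
        const empty = (bits[b >> 3] & (1 << (b & 7))) === 0;
        if (empty && runStart < 0) runStart = b;
        if (!empty && runStart >= 0) {
            if (b - runStart >= minGapBuckets) {
                gaps.push([runStart * bucketSize, b * bucketSize - 1]);   // approximate gap range
            }
            runStart = -1;
        }
    }
    return gaps;
}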
Actually I think I was wrong when I posted the question; the answer does seem to be radix sort.
The number of buckets is arbitrary, it doesn't have to correlate with the sizes of the intervals.
It might even be 2, if I go bit-by-bit.
Thus radix sort could help me sort the data in O(m log(max value)) ≈ O(m) time (since log(max value) is essentially a constant factor of 26, given the assumptions), at which point the problem becomes trivial.
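For reference, a sketch of a standard byte-wise LSD radix sort for nonnegative integers (not taken from the question):

// LSD radix sort on bytes: a few O(m) counting passes (4 for 32-bit values).
function radixSort(data: number[]): number[] {
    let src = data.slice();
    let dst = new Array<number>(data.length);
    for (let shift = 0; shift < 32; shift += 8) {
        const counts = new Array<number>(257).fill(0);
        for (const x of src) counts[((x >>> shift) & 0xff) + 1]++;
        for (let i = 1; i < 257; i++) counts[i] += counts[i - 1];   // prefix sums -> start offsets
        for (const x of src) dst[counts[(x >>> shift) & 0xff]++] = x;
        [src, dst] = [dst, src];                                    // swap buffers for the next pass
    }
    return src;
}

Once the data is sorted, the cluster boundaries are just the places where consecutive values jump by more than the allowed gap factor.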

Data structure/algorithm to efficiently save weighted moving average

I'd like to maintain moving averages for a number of different categories when storing log records. Imagine a service that saves web server log entries one at a time. Let's further imagine we don't have access to the logged records later on: we see each record once and then it is gone.
For different pages, I'd like to know
the total number of hits (easy)
a "recent" average (like one month or so)
a "long term" average (over a year)
Is there any clever algorithm/data model that allows saving such moving averages without having to recalculate them by summing up huge quantities of data?
I don't need an exact average (exactly 30 days or so), just trend indicators. So some fuzziness is not a problem at all; it should just make sure that newer entries are weighted more heavily than older ones.
One solution probably would be to auto-create statistics records for each month. However, I don't even need past month statistics, so this seems like overkill. And it wouldn't give me a moving average but rather swap to new values from month to month.
An easy solution would be to keep an exponentially decaying total.
It can be calculated using the following formula:
newX = oldX * (p ^ (newT - oldT)) + delta
where oldX is the old value of your total (at time oldT), newX is the new value of your total (at time newT); delta is the contribution of new events to the total (for example, the number of hits today); p is less than or equal to 1 and is the decay factor. If we take p = 1, then we have the total number of hits. By decreasing p, we effectively decrease the interval our total describes.
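A minimal sketch of that decaying total; the class and parameter names are mine, and time is assumed to be measured in days:

// Exponentially decaying total: newX = oldX * p^(newT - oldT) + delta.
class DecayingTotal {
    private total = 0;
    private lastT: number | null = null;
    constructor(private p: number) {}

    add(t: number, delta: number): void {
        if (this.lastT !== null) {
            this.total *= Math.pow(this.p, t - this.lastT);   // decay for the elapsed time
        }
        this.total += delta;
        this.lastT = t;
    }

    value(): number {
        return this.total;   // total as of the last update time
    }
}

With time in days, p = 0.97 gives a time constant of roughly a month (1 / -ln(0.97) ≈ 33 days), while a p much closer to 1, e.g. 0.997, tracks a roughly year-long trend.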
If all you really want is a smoothed value with a given time constant, then the easiest thing is to use a single-pole recursive IIR filter (aka an AR, or auto-regressive, filter in time series analysis). This takes the form:
X_new = (1 - k) * X_old + k * x
where X_old is the previous smoothed value, X_new is the new smoothed value, x is the current data point, and k is a factor which determines the time constant (usually a small value, < 0.1; the smaller k is, the longer the time constant). You may need to determine the two k values (one value for "recent" and a smaller value for "long term") empirically, based on your sample rate, which ideally should be reasonably constant, e.g. one update per day.
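A sketch of that filter with two smoothing factors; the k values below are ballpark guesses for roughly a month and roughly a year at one update per day (using the usual k ≈ 2/(N+1) rule of thumb for an N-sample window), to be tuned empirically as described:

// Single-pole IIR smoothing: X_new = (1 - k) * X_old + k * x. Smaller k -> longer time constant.
class SmoothedHitCounter {
    private recent = 0;      // "recent" trend, larger k
    private longTerm = 0;    // "long term" trend, smaller k

    constructor(private kRecent = 0.065, private kLong = 0.0055) {}

    addDailyCount(x: number): void {
        this.recent = (1 - this.kRecent) * this.recent + this.kRecent * x;
        this.longTerm = (1 - this.kLong) * this.longTerm + this.kLong * x;
    }

    trends(): { recent: number; longTerm: number } {
        return { recent: this.recent, longTerm: this.longTerm };
    }
}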
This may be a solution for you.
You can aggregate data into intermediate storage, grouped by hour or day. Then the grouping function will work very fast, because you only need to group a small number of records, and inserts will be fast as well. The precision trade-off is up to you.
It can be better than the exponential/auto-regressive algorithms above because it is easier to understand what you are calculating, and it doesn't require math at every step.
For the most recent data you can use capped collections with a limited number of records. They are supported natively by some DBs, for example MongoDB.
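A rough sketch of that aggregation idea with in-memory hourly buckets; the class and method names are mine, not from the answer, and a real system would persist the buckets instead:

// Aggregate hits into per-hour buckets; averages are then computed over a handful
// of bucket rows instead of the raw log entries.
class HourlyAggregator {
    private buckets = new Map<number, number>();   // hourIndex -> hit count

    recordHit(timestampMs: number): void {
        const hour = Math.floor(timestampMs / 3_600_000);
        this.buckets.set(hour, (this.buckets.get(hour) ?? 0) + 1);
    }

    // Average hits per hour over the last `hours` buckets.
    averageOverLast(hours: number, nowMs: number): number {
        const nowHour = Math.floor(nowMs / 3_600_000);
        let sum = 0;
        for (let h = nowHour - hours + 1; h <= nowHour; h++) {
            sum += this.buckets.get(h) ?? 0;
        }
        return sum / hours;
    }
}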

Incremental median computation with max memory efficiency

I have a process that generates values and that I observe. When the process terminates, I want to compute the median of those values.
If I had to compute the mean, I could just store the sum and the number of generated values and thus have O(1) memory requirement. How about the median? Is there a way to save on the obvious O(n) coming from storing all the values?
Edit: Interested in 2 cases: 1) the stream length is known, 2) it's not.
You are going to need to store at least ceil(n/2) points, because any one of the first n/2 points could be the median. It is probably simplest to just store all the points and find the median. If saving ceil(n/2) points is of value, then read the first n/2 points into a sorted list (a binary tree is probably best), then as new points are added throw out the low or high points and keep track of the number of points thrown out on either end.
Edit:
If the stream length is unknown, then obviously, as Stephen observed in the comments, we have no choice but to remember everything. If duplicate items are likely, we could possibly save a bit of memory using Dolphin's idea of storing values and counts.
I had the same problem and found an approach that has not been posted here. Hopefully my answer can help someone in the future.
If you know your value range and don't care much about the precision of the median, you can incrementally build a histogram of quantized values using constant memory. Then it is easy to find the median, or the value at any rank, to within your quantization error.
For example, suppose your data stream is image pixel values and you know these values are integers falling within 0~255. To build the histogram incrementally, just create 256 counters (bins) initialized to zero and increment the bin corresponding to each pixel value while scanning through the input. Once the histogram is built, find the first cumulative count that is larger than half of the data size to get the median.
For data that are real numbers, you can still compute a histogram with quantized bins (e.g. bins of width 10, 1, or 0.1, etc.), depending on your expected value range and the precision you want.
If you don't know the value range of the entire data sample, you can still estimate the possible range of the median and compute the histogram within that range. This drops outliers by nature, but that is exactly what we want when computing the median.
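A minimal sketch of the histogram median for the 0~255 pixel example:

// Constant-memory approximate median via a 256-bin histogram (exact for 8-bit integer values).
function histogramMedian(pixels: Iterable<number>): number {
    const bins = new Uint32Array(256);
    let count = 0;
    for (const p of pixels) {
        bins[p]++;                         // assumes integer values in 0..255
        count++;
    }
    const half = count / 2;
    let cumulative = 0;
    for (let v = 0; v < 256; v++) {
        cumulative += bins[v];
        if (cumulative > half) return v;   // first value whose cumulative count passes half
    }
    return 255;
}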
You can
Use statistics, if that's acceptable - for example, you could use sampling.
Use knowledge about your number stream
using a counting-sort-like approach: k distinct values means storing O(k) memory
or toss out known outliers and keep a (high,low) counter.
If you know you have no duplicates, you could use a bitmap... but that's just a smaller constant for O(n).
If you have discrete values and lots of repetition you could store the values and counts, which would save a bit of space.
Possibly at stages through the computation you could discard the top 'n' and bottom 'n' values, as long as you are sure that the median is not in that top or bottom range.
e.g. Let's say you are expecting 100,000 values. Every time your stored number gets to (say) 12,000 you could discard the highest 1000 and lowest 1000, dropping storage back to 10,000.
If the distribution of values is fairly consistent, this would work well. However, if there is a possibility that you will receive a large number of very high or very low values near the end, that might distort your computation. Basically, if you discard a "high" value that is less than the (eventual) median, or a "low" value that is equal to or greater than the (eventual) median, then your calculation is off.
Update
Bit of an example
Let's say that the data set is the numbers 1,2,3,4,5,6,7,8,9.
By inspection the median is 5.
Let's say that the first 5 numbers you get are 1,3,5,7,9.
To save space we discard the highest and lowest, leaving 3,5,7
Now get two more, 2,6 so our storage is 2,3,5,6,7
Discard the highest and lowest, leaving 3,5,6
Get the last two 4,8 and we have 3,4,5,6,8
Median is still 5 and the world is a good place.
However, let's say that the first five numbers we get are 1,2,3,4,5
Discard top and bottom leaving 2,3,4
Get two more 6,7 and we have 2,3,4,6,7
Discard top and bottom leaving 3,4,6
Get last two 8,9 and we have 3,4,6,8,9
With a median of 6 which is incorrect.
If our numbers are well distributed, we can keep trimming the extremities. If they might be bunched in lots of large or lots of small numbers, then discarding is risky.
