If I am given the total number of occurrences of an event over the last hour, and I can get this data at arbitrary times (but at least once an hour), how can I work out the total number of occurrences over a 24 hour period?
Obviously, you can't. For example, if the first two observations overlap, then it is impossible to determine the number of occurrences during the overlap. If there is a time gap between the first two observations, there is no way to determine what happened during the gap. You could try to set up a system of equations, but the resulting system will be underdetermined (though it could give you a minimum and a maximum, which might be relevant).
Why not adopt a statistical approach? Let X = the number of occurrences over a 1 hour period. This is a random variable. Estimate its expected value by sampling it at randomly chosen times, then multiply your estimate by 24.
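For example, a minimal sketch in Python (get_last_hour_count is a hypothetical stand-in for however you actually query the running one-hour figure):

import random

def estimate_daily_total(get_last_hour_count, n_samples=24):
    # Sample the "occurrences in the last hour" figure at randomly chosen
    # times of day (hours since midnight), average the samples, and scale
    # the average up to 24 hours.
    samples = [get_last_hour_count(random.uniform(0, 24)) for _ in range(n_samples)]
    return 24 * sum(samples) / n_samples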
You stand in an office by a door, with a measuring tape. Every time a person walks in you measure him or her and only keep a tally of the "record" tallest. If the new person is taller than everyone who came before, you count a record. If later another person is taller still, you have another record, etc.
1000 people pass through the door. How many records do you expect to have?
(Assume independence of height/arrival. Also note that the answer does not depend on any assumption about the probability distribution other than independence.)
PS - I'm able to come up with the answer (~7.5) with a brute force approach (running this scenario 1,000,000 times and taking the average). But here I'm looking for a theoretical approach.
Consider x_1 to x_1000 as the heights in order of arrival, and let max(i) be the maximum of the sequence up to person i. The question reduces to finding the expected number of times max(i) changes.
for i = 0 to 999:
if x_(i+1) > max(i), then max(i) changes
Since each of the first i+1 people is equally likely to be the tallest among them, P(x_(i+1) > max(i)) = 1/(i+1).
Answer: the sum of 1/(i+1) for i from 0 to 999 (the harmonic number H_1000), which is approximately 7.49.
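If you want to check the closed form against the brute-force estimate, a quick Python computation of the harmonic number gives the same value:

from math import fsum

# The expected number of records among n exchangeable arrivals is the
# harmonic number H_n = 1/1 + 1/2 + ... + 1/n.
n = 1000
expected_records = fsum(1.0 / k for k in range(1, n + 1))
print(expected_records)   # about 7.485, matching the ~7.5 simulation estimate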
I am having difficulty grasping a probability assumption in a problem I am reviewing.
Given:
Each record in a dataset has a unique transaction ID number (TXNID)
The incremental change between TXNID is predictable based on the transaction time (the specific method is irrelevant to the problem)
Because the incremental change is predictable, we can identify if a record is missing between two sequential TXNIDs. Specifically, if the difference between two sequential TXNIDs is greater than the predicted incremental change, then at least one record is missing
The increment between two TXNIDs is always a whole number between 1 and 20 (inclusive)
Each increment from 1 to 20 is equally likely to occur
Where such a gap is identified, we wish to estimate the number of missing records.
For example:
Previous TXNID: 100 (given)
Current TXNID: 125 (given)
Predicted increment: 5 (given)
Actual increment: 25 (current - previous)
The actual increment is greater than the predicted increment so we know that at least one record is missing.
We also know that one missing record has a TXNID equal to the current TXNID - 5. Estimating the records within the remaining gap is the focus of the problem.
Remaining gap: 20 (actual increment - predicted increment)
What we wish to estimate is the number of missing records within the remaining gap. In this example, the missing records may consist of a single record having an increment of 20, 20 records each having an increment of 1, or any combination between these extremes.
20 = 20 x 1 (twenty records, each with an increment of 1)
...
20 = 1 x 20 (one record with an increment of 20)
The author proposes that because each TXNID increment between 1 and 20 is equally probable, 5% (1/20) of the remaining gap is a realistic estimate for the number of missing records.
Having tested this in a very limited fashion, the assumption appears to work; however, I am struggling to understand the logic that each scenario has an equal probability.
I agree that a single record has a 1/20 (5%) chance of having an increment of 20 (the 1 x 20 scenario). But for the reverse scenario (20 x 1), shouldn't the probability compound? Here, I not only require that the increment of a single record be 1 (5% probability), but also that the next 19 increments be 1 as well. Therefore, it seems that the probability of 20 missing records existing within the remaining gap is significantly lower (0.05 ^ 20 versus 0.05).
Am I overthinking this? Have I missed a point? Does applying 5% to the remaining gap make sense as a means to estimate the number of missing records?
Thanks
Andrew
Frankly, I would approach the problem from a different perspective. I would assume the records come from a Poisson stream, so that the number of records falling into a gap of a given size follows a Poisson distribution (and the differences between consecutive records are roughly exponentially distributed).
If this is true, you could estimate the Poisson rate \lambda and use it to estimate how many records, on average, should fall within any given distance between records.
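A rough sketch of that idea, assuming the observed inter-record increments behave like inter-arrival distances of a Poisson stream, so the rate \lambda is approximately the reciprocal of the mean observed increment (the sample numbers below are made up):

import numpy as np

def expected_records_in_gap(observed_increments, gap):
    # Rate of records per unit of TXNID distance, estimated from the
    # increments between the records that were actually observed.
    lam = 1.0 / np.mean(observed_increments)
    # Under a Poisson stream, the expected number of records falling
    # inside a gap of the given size is simply lam * gap.
    return lam * gap

print(expected_records_in_gap([5, 12, 3, 8, 20, 7], 20))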
I am trying to find a function in Matlab, or at least the name of an algorithm that does the following:
Let's say that I am analyzing a time series in real time. I initially start with thresholds of 10 and -10, so that when the time series goes above 10 or below -10, it's considered a 'HIT'. Let's say it initially takes the time series 5 minutes to produce a 'HIT', but I want to adjust the threshold so that, on average, it takes only 1 minute for a 'HIT' to be produced. I know it would look something like: start with 10 and -10; if it takes too long, drop it to 5 and -5; then increase the threshold if it's too quick, and so on.
I know there's a specific name for this type of algorithm, and there are probably built-in functions for it, but the name is eluding me. Can somebody help?
I don't know what the time resolution of your time series is, or whether it's constant, so I'll leave that to you. However, here is what you can do in MATLAB if you have a constant time resolution. First take the absolute value of the values in your time series. Then sort these values in descending order using the sort() command. Then choose the value whose index in the sorted array gives you the average hit interval that you desire. For example, if your time series has size N and the time resolution is 0.1 seconds, and you want an alert on average every 1 second, then after sorting you would choose the threshold at (descending-order) sorted position N/10.
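A sketch of the same idea, written in Python rather than MATLAB and assuming a constant time resolution (the function and parameter names are just placeholders):

import numpy as np

def pick_threshold(series, dt, target_interval):
    # How many hits we want over the whole recording of len(series) samples
    # spaced dt seconds apart.
    hits_wanted = min(len(series), max(1, int(round(len(series) * dt / target_interval))))
    # Sort absolute values in descending order and take the value at the
    # position that yields roughly one hit per target_interval on average.
    ranked = np.sort(np.abs(series))[::-1]
    return ranked[hits_wanted - 1]

# e.g. 0.1 s resolution, aiming for a HIT about once per second on average
threshold = pick_threshold(np.random.randn(10000), dt=0.1, target_interval=1.0)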
I am working on an algorithm to compute multiple mean max values of an array. The array contains time/value pairs, such as HR data recorded on a Garmin device over a 5 hour run. The data arrives approximately once a second for an unknown period, but has no guaranteed frequency. An example would be the 10 minute mean maximum, which is the maximum average value over any 10 minute duration. Assume "mean" is just the average value for this discussion. The desired mean maximal value's duration is arbitrary: 1 min, 5 min, 60 min. And I'm likely going to need many of them, at least 30, but ideally any on demand if it isn't a lengthy request.
Right now I have a straightforward algorithm to compute one value:
1) Start at the beginning of the array and "walk" forward until the subset is equal to, or 1 element past, the desired duration. Stop if the end of the array is reached.
2) Find the average of those subset values. Store it as the max average if it is larger than the current max.
3) Shift a single value off the left side of the array.
4) Repeat from 1 until the end of the array is met.
It basically computes every possible consecutive average and returns the max.
It does this for each duration. And it recomputes a full average every time, instead of sliding the window by removing the left point and adding the right, like one could do for a simple-moving-average series. It takes about 3-10 seconds per mean max value, depending on the total array size.
I'm wondering how to optimize this. For instance, the series of all mean max values will be an exponentially decaying curve, with the 1 s value highest, decreasing until the average of the entire series is reached. Can this curve, and all its values, be interpolated from a certain number of points? Or is there some other optimization to the above heavy computation that still maintains accuracy?
"And it computes a real avg computation continuously instead of sliding it somehow by removing the left point and adding the right, like one could do for a Simple-moving-average series."
Why don't you just slide it (i.e. keep a running sum and divide by the number of elements in that sum)?
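For example, a minimal sliding-sum sketch, assuming the samples are close enough to evenly spaced (roughly one per second) that a duration can be expressed as a fixed number of samples:

def mean_max(values, window):
    # Maximum average over any `window` consecutive samples, maintained
    # with a running sum: add the incoming point, drop the outgoing one.
    if window <= 0 or window > len(values):
        return None
    running = sum(values[:window])
    best = running
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        if running > best:
            best = running
    return best / window

This makes each duration a single pass with one addition and one subtraction per step, instead of re-averaging every subset.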
I'd like to keep running moving averages for a number of different categories when storing log records. Imagine a service that saves web server logs one entry at a time. Let's further imagine we don't have access to the logged records afterwards: we see each one once but can't go back to it later.
For different pages, I'd like to know
the total number of hits (easy)
a "recent" average (like one month or so)
a "long term" average (over a year)
Is there any clever algorithm/data model that allows saving such moving averages without having to recalculate them by summing up huge quantities of data?
I don't need an exact average (exactly 30 days or so) but just trend indicators. So some fuzziness is not a problem at all. It should just make sure that newer entries are weighted higher than older ones.
One solution probably would be to auto-create statistics records for each month. However, I don't even need past month statistics, so this seems like overkill. And it wouldn't give me a moving average but rather swap to new values from month to month.
An easy solution would be to keep an exponentially decaying total.
It can be calculated using the following formula:
newX = oldX * (p ^ (newT - oldT)) + delta
where oldX is the old value of your total (at time oldT), newX is the new value of your total (at time newT); delta is the contribution of new events to the total (for example the number of hits today); p is less than or equal to 1 and is the decay factor. If we take p = 1, then we have the total number of hits. By decreasing p, we effectively decrease the interval our total describes.
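A small sketch of such a counter in Python, assuming time is measured in days and p is derived from a chosen half-life (both choices are arbitrary here):

class DecayingTotal:
    # Exponentially decaying hit counter implementing
    # newX = oldX * p**(newT - oldT) + delta.
    def __init__(self, half_life_days):
        self.p = 0.5 ** (1.0 / half_life_days)   # decay factor per day
        self.x = 0.0
        self.t = None

    def add(self, t, delta=1.0):
        # t is the current time in days (any consistent clock will do).
        if self.t is not None:
            self.x *= self.p ** (t - self.t)
        self.x += delta
        self.t = t
        return self.x

recent = DecayingTotal(half_life_days=30)      # "recent" trend
long_term = DecayingTotal(half_life_days=365)  # "long term" trend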
If all you really want is a smoothed value with a given time constant then the easiest thing is to use a single pole recursive IIR filter (aka AR or auto-regressive filter in time series analysis). This takes the form:
X_new = (1 - k) * X_old + k * x
where X_old is the previous smoothed value, X_new is the new smoothed value, x is the current data point, and k is a factor which determines the time constant (usually a small value, < 0.1). You may need to determine the two k values (one value for "recent" and a smaller value for "long term") empirically, based on your sample rate, which ideally should be reasonably constant, e.g. one update per day.
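A short sketch of that filter in Python, with two made-up k values standing in for the empirically tuned ones:

def smooth(samples, k):
    # Single-pole recursive (IIR / AR) filter: X_new = (1 - k) * X_old + k * x.
    # Smaller k -> longer time constant, so the "long term" average gets a
    # smaller k than the "recent" one.
    x_old = samples[0]
    out = []
    for x in samples:
        x_old = (1 - k) * x_old + k * x
        out.append(x_old)
    return out

daily_hits = [120, 95, 143, 110, 87, 130, 155]   # hypothetical one-per-day updates
recent = smooth(daily_hits, k=0.1)
long_term = smooth(daily_hits, k=0.01)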
This may be a solution for you.
You can aggregate the data into intermediate storage, grouped by hour or day. Then the grouping function will work very fast, because you only need to group a small number of records, and inserts will be fast as well. The precision trade-off is up to you.
This can be better than the auto-regressive exponential algorithms because it is easier to understand what you are calculating, and it doesn't require doing math at every step.
For the most recent data you can use capped collections with a limited number of records. They are supported natively by some DBs, for example MongoDB.
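A rough sketch of the bucketed-aggregation idea in plain Python (the hourly granularity and key format are just assumptions; in practice the buckets would live in whatever datastore you use):

from collections import defaultdict
from datetime import datetime

hourly_hits = defaultdict(int)   # (page, "YYYY-MM-DD HH") -> count

def record_hit(page, timestamp):
    # Aggregate as entries stream in; the raw log record is never needed again.
    hourly_hits[(page, timestamp.strftime("%Y-%m-%d %H"))] += 1

def average_hits_per_hour(page, bucket_keys):
    # Average over whichever hourly buckets cover the window you care about.
    counts = [hourly_hits.get((page, b), 0) for b in bucket_keys]
    return sum(counts) / len(counts) if counts else 0.0

record_hit("/index.html", datetime(2013, 7, 4, 15, 30))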