Probable number of missing records - probability

I am having difficulty grasping a probability assumption in a problem I am reviewing.
Given:
Each record in a dataset has a unique transaction id number (TXNID)
The incremental change between TXNID is predictable based on the transaction time (the specific method is irrelevant to the problem)
Because the incremental change is predictable, we can identify if a record is missing between two sequential TXNID. Specifically, if the difference between two sequential TXNID is greater than the predicted incremental change, then at least one record is missing
The increment between two TXNID is always a whole number between 1 and 20 (inclusive)
Equal probability exists that any increment of 1 to 20 will occur
Where such a gap is identified, we wish to estimate the number of missing records.
For example:
Previous TXNID: 100 (given)
Current TXNID: 125 (given)
Predicted increment: 5 (given)
Actual increment: 25 (current - previous)
The actual increment is greater than the predicted increment so we know that at least one record is missing.
We also know that one missing record has a TXNID equal to the current TXNID - 5. Estimating the number of records within the remaining gap is the focus of the problem.
Remaining gap: 20 (actual increment - predicted increment)
What we wish to estimate is the number of missing records within the remaining gap. In this example, the missing records may consist of a single record with an increment of 20, 20 records each with an increment of 1, or any relevant combination between these extremes.
20 = 20 x 1
...
20 = 1 x 20
The author proposes that, because each possible increment from 1 to 20 is equally probable, 5% (1/20) of the remaining gap is a realistic estimate for the number of missing records.
Having tested this in a very limited fashion, the assumption appears to work; however, I am struggling to understand the logic that each scenario has an equal probability.
I agree that a single record has a 1/20 (5%) chance of having an increment of 20 (the 1 x 20 scenario). But for the reverse scenario (20 x 1), shouldn't the probability compound? Here I require not only that the increment of a single record be 1 (5% probability), but also that the next 19 records each have an increment of 1. Therefore, it seems that the probability of 20 missing records existing within the remaining gap is significantly lower (0.05 ^ 20 versus 0.05).
Am I overthinking this? Have I missed a point? Does applying 5% to the remaining gap make sense as a means to estimate the number of missing records?
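For concreteness, one way to test this empirically might look something like the following sketch (the function name, variable names, and trial count are just illustrative): draw uniform increments of 1-20, keep only the sequences whose increments sum to exactly the remaining gap, and record how many missing records each such sequence implies.

import random

def simulate_missing_counts(gap=20, trials=100000):
    # Draw increments of 1-20 until the running total reaches the gap; keep
    # only the runs that land exactly on the gap, since the missing records'
    # increments must sum to the remaining gap exactly.
    counts = []
    for _ in range(trials):
        total, n = 0, 0
        while total < gap:
            total += random.randint(1, 20)
            n += 1
        if total == gap:
            counts.append(n)
    return sum(counts) / len(counts)

# Average number of missing records implied for a remaining gap of 20,
# to compare against the proposed 5% x 20 = 1 record.
print(simulate_missing_counts())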
Thanks
Andrew

Frankly, I would approach the problem from a different perspective. I would assume the records arrive as a Poisson stream; the number of records falling within any given gap would then follow a Poisson distribution.
If this is true, you could estimate the Poisson parameter \lambda and obtain an estimate of how many records, on average, should lie within any given distance between records.
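A minimal sketch of this estimation, assuming observed_diffs holds the TXNID differences between consecutively received records (the numbers below are made up for illustration):

# Differences between consecutive received TXNIDs (illustrative values only).
observed_diffs = [5, 12, 3, 8, 20, 7, 15, 4, 9, 11]

# For a Poisson stream, the record rate per unit of TXNID can be estimated as
# the reciprocal of the mean difference between consecutive records.
rate = 1.0 / (sum(observed_diffs) / len(observed_diffs))

def expected_records_in_gap(gap_length, lam=rate):
    # Expected number of records falling inside a gap of the given length.
    return lam * gap_length

print(expected_records_in_gap(20))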

Related

Probability - expectation Puzzle: 1000 persons and a door

You stand in an office by a door, with a measuring tape. Every time a person walks in, you measure him or her and keep a tally only of the “record” tallest. If the new person is taller than everyone who came before, you count a record. If later another person is taller still, you have another record, and so on.
1000 persons pass through the door. How many records do you expect to have?
(Assume independence of height/arrival. Also note that the answer does not depend on any assumption about the probability distribution other than independence.)
PS - I'm able to come up with the answer (~7.5) with a brute-force approach (running this scenario over 1,000,000 times and taking the average), but here I'm looking for a theoretical approach.
Consider x_1 to x_1000 as the heights, and max(i) as the maximum of the sequence up to person i. The question reduces to finding the expected number of times max(i) changes.
for i = 0 to 999:
if x_(i+1) > max(i), then max(i) changes
Also, P(x_(i+1) > max(i)) = 1/(i+1), since each of the first i+1 persons is equally likely to be the tallest.
Answer: the sum of 1/(i+1) for i from 0 to 999, which is approximately 7.49.
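For reference, both the theoretical sum and the brute-force simulation mentioned in the question can be checked with a short script (a minimal sketch):

import random

# Theoretical answer: the 1000th harmonic number, sum of 1/(i+1) for i = 0..999.
print(sum(1.0 / (i + 1) for i in range(1000)))  # ~7.485

def count_records(n=1000):
    # Count how many times a new maximum ("record") appears among n heights.
    records, tallest = 0, float("-inf")
    for _ in range(n):
        height = random.random()  # any continuous distribution works
        if height > tallest:
            records += 1
            tallest = height
    return records

# Brute-force estimate, averaged over many trials.
print(sum(count_records() for _ in range(10000)) / 10000)  # also ~7.5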

Maximum Collection

Mark has a collection of N postage stamps. Each stamp belongs to some type, and types are enumerated as positive integers. More valuable stamps have a higher enumerated type.
On any particular day, E-bay lists several offers, each of which is represented as an unordered pair {A, B}, allowing its users to exchange stamps of type A for an equal number of stamps of type B. Mark can use such an offer to put up any number of stamps of type A on the website and get the same number of stamps of type B in return, or vice versa. Assume that any number of stamps Mark wants are always available on the site's exchange market. Each offer is open during only one day: Mark can't use it after this day, but he can use it several times during that day. If there are several offers active during a given day, Mark can use them in any order.
Find the maximum possible value of his collection after going through (accepting or declining) all the offers. The value of Mark's collection is equal to the sum of the type enumerations of all stamps in the collection.
How does dynamic programming lead to a solution for this problem? (Mark knows what offers will come in the future.)
I would maintain a table that gives, for each type, the maximum value that you can get for a member of that type using only the last N swaps.
To compute this for N = 0, just put down the value of each type without any swaps.
To compute the table for N = i+1, look at the (i+1)-th swap counting backwards from the last one, together with the table for N = i. That swap names two types, whose table entries probably have different values. Because you can use that swap, you can set the lower of the two entries equal to the higher.
When you have a table taking into account all the swaps, you can sum up the values for the types of the stamps Mark starts with to get the answer.
Example tables for the swaps {4, 5}, {5, 3}, {3, 1}, {1, 20}:
 1  2  3  4  5 .. 20
20  2  3  4  5 .. 20
20  2 20  4  5 .. 20
20  2 20  4 20 .. 20
20  2 20 20 20 .. 20
Example for swaps {1, 5} and then {1, 20}:
 1  2  3  4  5 .. 20
20  2  3  4  5 .. 20
20  2  3  4 20 .. 20
Note that i = 1 means taking account of the last possible swap, so we are working backwards as far as swaps are concerned. The final table reflects the fact that 5 can be swapped for 1 before 1 is swapped for 20. You can work out a schedule of which swaps to perform when by looking at which swap is available at time i and which table entries change at that time.
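For illustration, a minimal sketch of the backward table-updating approach described above (the function and variable names are my own; offers are given in chronological order and processed one at a time in reverse, mirroring the tables):

def max_collection_value(stamps, offers, max_type):
    # best[t] = best value obtainable for one stamp of type t using only the
    # swaps not yet processed (i.e. the later offers).
    # With no swaps left, a stamp of type t is simply worth t.
    best = {t: t for t in range(1, max_type + 1)}
    # Process the offers from the last one back to the first.
    for a, b in reversed(offers):
        # Holding either type, you can keep it or swap it, so both entries
        # become the larger of the two.
        top = max(best[a], best[b])
        best[a] = best[b] = top
    # Sum the best achievable value over the stamps Mark starts with.
    return sum(best[t] for t in stamps)

# Second example above: Mark holds a type-5 stamp, swaps {1, 5} then {1, 20}.
print(max_collection_value(stamps=[5], offers=[(1, 5), (1, 20)], max_type=20))  # 20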
Dynamic programming means breaking a problem down into smaller subproblems. Your problem is well defined as a value-ordered collection of stamps of different types, so Value(T1) < Value(T2) < ... < Value(Tn).
Finding the maximum value of the collection will be determined by the opportunities to swap pairs of types. Of course, we only want to swap pairs when it will increase the total value of the collection.
Therefore, we define a simple swap operation where we swap whenever the collection contains stamps of the lower-valued type in the swap opportunity.
If sufficient opportunities of differing types are offered, then the collection could ultimately contain all stamps at the highest value.
My suggestion is to create a collection data structure, a simple conditioned swap function and perhaps an event queue which responds to swap events.
Dynamic Table
Take a look at this diagram which shows how I would set up my data. The key is to start from the last row and work backwards computing the best deals, then moving forward and taking the best going forward.

Calculating total over timespan with arbitrary datapoints

If I am given the total number of occurrences of an event over the last hour, and I can get this data at arbitrary times (but at least once an hour), how can I work out the total number of occurrences over a 24-hour period?
Obviously, you can't. For example, if the first two observations overlap, then it is impossible to determine the number of occurrences during the overlap. If there is a time gap between the first two observations, then there is no way to determine what happened during the gap. You could try to set up a system of equations, but the resulting system will be underdetermined (though it could give you both a min and a max, which might be relevant).
Why not adopt a statistical approach? Let X = the number of occurrences over a 1-hour period. This is a random variable. Estimate its expected value by sampling it at randomly chosen times, then multiply your estimate by 24.
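As a minimal sketch of this sampling approach, where sample_hourly_count is a hypothetical stand-in for whatever call actually returns the rolling one-hour count at a given time:

import random

def estimate_daily_total(sample_hourly_count, n_samples=48):
    # Sample the rolling one-hour count at randomly chosen times of day
    # (in hours, chosen so the trailing hour lies within the day),
    # estimate E[X], and scale up to 24 hours.
    times = [random.uniform(1.0, 24.0) for _ in range(n_samples)]
    samples = [sample_hourly_count(t) for t in times]
    mean_hourly = sum(samples) / len(samples)
    return 24 * mean_hourly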

Calculate new probability based on past probability per value

I want to calculate a percentage probability based on a list of past occurrences.
The data looks similar to this simplified table; for instance, when the first value has been 8 in the past, there has been a 72% chance of the event occurring.
1 76%
2 64%
4 80%
6 85%
7 83%
8 72%
11 70%
The full table ranges from 0 to 1030 and has 377 rows, but it changes daily. I want to pass the function a value such as 3 and be returned a percentage probability of the event occurring. I don't need exact code, but I would appreciate being pointed in the right direction.
Thanks
Based on your answers in the comments of the question, I would suggest interpolation; linear interpolation is the simplest answer. It doesn't look like a probabilistic model would be appropriate based on the series in the spreadsheet (there doesn't appear to be a clear relationship between column 1 and column 3).
To give an example of how this would work: imagine you want the probability for some point p, which is unobserved in the data. The largest observed value less than p is p_low (with corresponding probability f(p_low)), and the smallest observed value greater than p is p_high (with probability f(p_high)). Your estimate for p is:
interval = p_high - p_low
f_p_hat = ((p_high - p)/interval * f_p_low) + ((p - p_low)/interval * f_p_high)
This makes your estimate for p a weighted average of f(p_low) and f(p_high), with each endpoint weighted by how close p is to it. E.g. if p is equidistant between p_low and p_high, f_p_hat (your estimate for f(p)) is just the mean of f(p_low) and f(p_high).
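As an illustration, a minimal sketch of this interpolation over the simplified table from the question (the function name is my own; the table is assumed to be sorted by value):

import bisect

# (value, probability in percent) pairs from the simplified table.
table = [(1, 76), (2, 64), (4, 80), (6, 85), (7, 83), (8, 72), (11, 70)]
values = [v for v, _ in table]
probs = [q for _, q in table]

def estimate_probability(p):
    # Clamp to the table's range at the ends.
    if p <= values[0]:
        return probs[0]
    if p >= values[-1]:
        return probs[-1]
    i = bisect.bisect_left(values, p)
    if values[i] == p:  # exact match in the table
        return probs[i]
    p_low, p_high = values[i - 1], values[i]
    f_low, f_high = probs[i - 1], probs[i]
    interval = p_high - p_low
    # Weighted average: the nearer endpoint gets the larger weight.
    return (p_high - p) / interval * f_low + (p - p_low) / interval * f_high

print(estimate_probability(3))  # halfway between 2 (64%) and 4 (80%) -> 72.0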
Now, linear interpolation may not work if you have reason to suspect that the estimates at the endpoints are inaccurate (possibly due to small sample sizes). If so, it would be possible to do a (possibly weighted) least squares fit to a neighbourhood of points around p, and use that as a prediction. If this is the case I can go into a bit more detail.

Mean max subset of array values

I am working on an algorithm to compute multiple mean max values of an array. The array contains time/value pairs, such as HR data recorded on a Garmin device over a 5-hour run. The data arrives approximately once a second for an unknown period, but has no guaranteed frequency. An example would be a 10-minute mean maximum, which is the maximum average value over any 10-minute duration. Assume "mean" is just the average value for this discussion. The desired mean maximal value's duration is arbitrary: 1 min, 5 min, 60 min. And I'm likely going to need many of them: at least 30, but ideally any on demand if it isn't a lengthy request.
Right now I have a straightforward algorithm to compute one value:
1) Start at the beginning of the array and "walk" forward until the subset is equal to or 1 element past the desired duration. Stop if the end of the array is reached.
2) Find the average of the values in that subset. Store it as the max average if it is larger than the current max.
3) Shift a single value off the left side of the array.
4) Repeat from step 1 until the end of the array is met.
It basically computes every possible consecutive average and returns the max.
It does this for each duration. And it computes a real avg computation continuously instead of sliding it somehow by removing the left point and adding the right, like one could do for a Simple-moving-average series. It takes about 3-10 secs per mean max value depending on the total array size.
I'm wondering how to optimize this. For instance, the series of all mean max values will be an exponentially decaying curve, with the 1-second value highest and decreasing until it reaches the average of the entire series. Can this curve, and all its values, be interpolated from a certain number of points? Or is there some other optimization to the above heavy computation that still maintains accuracy?
"And it computes a real avg computation continuously instead of sliding it somehow by removing the left point and adding the right, like one could do for a Simple-moving-average series."
Why don't you just slide it (i.e. keep a running sum and divide by the number of elements in that sum)?
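For illustration, a minimal two-pointer sketch of that suggestion (names are my own; samples is assumed to be a time-sorted list of (timestamp_seconds, value) pairs, the window is taken as the samples falling within the duration, and the mean is a plain average of those samples rather than a time-weighted one):

def mean_max(samples, duration_seconds):
    # Slide a window over the samples, maintaining a running sum instead of
    # re-averaging the whole window at every step.
    best = float("-inf")
    window_sum = 0.0
    left = 0
    for right, (t_right, value) in enumerate(samples):
        window_sum += value
        # Shrink from the left until the window spans at most the duration.
        while t_right - samples[left][0] > duration_seconds:
            window_sum -= samples[left][1]
            left += 1
        best = max(best, window_sum / (right - left + 1))
    return best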
