What is the expected probability of a basket to be frequent on average? - probability

I was reading about frequent pattern mining algorithms and came up with the following question. Let a company have 10,000 different products and 1,000,000,000 transactions, each containing exactly 10 different products. If the products in each basket are chosen uniformly at random, what is the probability that a given fixed basket of size 10 is chosen 1,000 times among the 1,000,000,000 transactions?
This is self-study; the problem is stated on Slide 9 here.

I am not an expert in probability theory, but I think that the chance is practically 0. To see why, imagine you have a box with all possible baskets. Let B be the number of possible baskets, so the probability that one specific basket is drawn from the box is p = 1/B; with 10,000 products and 10 items per basket, B is roughly 10000^10 = 10^40 (counting ordered baskets; the exact count does not change the conclusion), so approximately p = 10^(-40). Imagine you draw N times from this box with replacement. Then you would expect this specific basket to be drawn m = N*p = N/B times. This is the expected frequency of the experiment.
The standard deviation of this sampling process (N draws with probability of success p) is σ = sqrt(N*p*(1-p)). If you do the math with N = 10^9 and p = 10^(-40), you find σ = sqrt(10^(-31)) ≈ 3.2*10^(-16).
Suppose now that the observed frequency of the experiment is f = 10^3. Since the expected frequency is m = N/B = 10^9/10^40 = 10^(-31), it follows that the z-score of this experiment is
z = (f - m)/σ = sqrt(10)*10^18 ≈ 3.2*10^18
The chance of observing at least f instances of the specific basket is given by the normal approximation as the area under the standard normal curve between z and infinity. This area is practically zero.
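If you want to check the arithmetic, here is a small Python sketch (my own, using the same rough figure of about 10^40 possible baskets as above):
import math

N = 10**9                            # transactions
p = 1e-40                            # chance one transaction equals the fixed basket
f = 1_000                            # frequency we are asking about

m = N * p                            # expected frequency: 1e-31
sigma = math.sqrt(N * p * (1 - p))   # standard deviation: ~3.2e-16
z = (f - m) / sigma                  # ~3.2e+18 standard deviations
print(m, sigma, z)

# The exact binomial tail P(X >= 1000) underflows double precision anyway;
# however you compute it, the probability is practically zero.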

Related

Select n records at random from a set of N

I need to select n records at random from a set of N (where 0 < n < N).
A possible algorithm is:
Iterate through the list and, for each element, make the probability of selection = (number needed) / (number left).
So if you had 40 items and needed 5, the first would have a 5/40 chance of being selected.
If it is selected, the next has a 4/39 chance; otherwise it has a 5/39 chance. By the time you get to the end
you will have your 5 items, and often you'll have all of them before that.
Assuming a good pseudo-random number generator, is this algorithm correct?
NOTE
There are many questions of this kind on Stack Overflow (a lot of them are marked as duplicates of Select N random elements from a List<T> in C#).
The above algorithm is often proposed (e.g. Kyle Cronin's answer) and
it's always questioned (e.g. see
here, here, here, here...).
Can I have a final word about the matter?
The algorithm is absolutely correct.
It's not the sudden invention of a good poster; it's a well-known technique called selection sampling / Algorithm S (discovered by Fan, Muller and Rezucha (1) and independently by Jones (2) in 1962), well described in TAOCP - Volume 2 - Seminumerical Algorithms - § 3.4.2.
As Knuth says:
This algorithm may appear to be unreliable at first glance and, in fact, to be incorrect. But a careful analysis shows that it is completely trustworthy.
The algorithm samples n elements from a set of size N: the (t + 1)st element is chosen with probability (n - m) / (N - t) when m elements have already been chosen.
It's easy to see that we never run off the end of the set before choosing n items (as the probability will be 1 when we have k elements to choose from the remaining k elements).
Also we never pick too many elements (the probability will be 0 as soon as m == n).
It's a bit harder to demonstrate that the sample is completely unbiased, but it's
... true in spite of the fact that we are not selecting the t + 1st item with probability n / N. This has caused some confusion in the published literature
(so not just on Stackoverflow!)
The fact is we should not confuse conditional and unconditional probabilities:
For example, consider the second element; if the first element was selected in the sample (this happens with probability n / N), the second element is selected with probability (n - 1) / (N - 1); if the first element was not selected, the second element is selected with probability n / (N - 1).
The overall probability of selecting the second element is (n / N) ((n - 1) / (N - 1)) + (1 - n/N)(n / (N - 1)) = n/N.
TAOCP - Vol 2 - Section 3.4.2 exercise 3
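If you want to see it run, here is a minimal Python sketch of the same scheme (my own illustration, not Knuth's code; it uses integer draws, so no floating-point ratio is involved):
import random

def selection_sample(population, n, rng=random.SystemRandom()):
    # Scan once, keeping each element with probability (still needed) / (still left).
    sample, needed, left = [], n, len(population)
    for item in population:
        # rng.randrange(left) < needed is true with probability needed / left.
        if rng.randrange(left) < needed:
            sample.append(item)
            needed -= 1
            if needed == 0:
                break
        left -= 1
    return sample

# e.g. selection_sample(range(40), 5) returns 5 of the 40 values, every
# 5-element subset being equally likely (and in order of appearance).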
Apart from theoretical considerations, Algorithm S (and algorithm R / reservoir sampling) is used in many well known libraries (e.g. SGI's original STL implementation, std::experimental::sample,
random.sample in Python...).
Of course algorithm S is not always the best answer:
it's O(N) (even if we will usually not have to pass over all N records: the average number of records considered when n=2 is about 2/3 N; the general formulas are given in
TAOCP - Vol 2 - § 3.4.2 - ex 5/6);
it cannot be used when the value of N isn't known in advance.
Anyway it works!
(1) C. T. Fan, M. E. Muller and I. Rezucha, J. Amer. Stat. Assoc. 57 (1962), pp. 387-402
(2) T. G. Jones, CACM 5 (1962), p. 343
EDIT
how do you randomly select this item, with a probability of 7/22
[CUT]
In rare cases, you might even pick 4 or 6 elements when you wanted 5
This is from N3925 (small modifications to avoid the common interface / tag dispatch):
template<class PopIter, class SampleIter, class Size, class URNG>
SampleIter sample(PopIter first, PopIter last, SampleIter out, Size n, URNG &&g)
{
    using dist_t = uniform_int_distribution<Size>;
    using param_t = typename dist_t::param_type;
    dist_t d{};
    Size unsampled_sz = distance(first, last);
    // Selection sampling: each element is kept with probability
    // n / (number of elements not yet examined).
    for (n = min(n, unsampled_sz); n != 0; ++first)
    {
        param_t const p{0, --unsampled_sz};
        if (d(g, p) < n) { *out++ = *first; --n; }
    }
    return out;
}
There are no floats.
If you need 5 elements you get 5 elements;
if uniform_int_distribution "works as advertised" there is no bias.
Although the algorithm described is technically correct, it depends on having an algorithm that returns a bool with an arbitrary probability determined by the ratio of two ints. For example, how do you select this item with a probability of 7/22? For the sake of discussion, let's call it the bool RandomSelect(int x, int y) method, or just the RS(x,y) method, designed to return true with probability x/y.

If you're not very concerned about accuracy, the oft-given answer is to use return Random.NextDouble() < (double)x/(double)y; which is inaccurate because Random.NextDouble() is imprecise and not perfectly uniform, and the division (double)x/(double)y is also imprecise. The choice of < or <= should be irrelevant (but it's not), because in theory it's impossible to randomly pick the infinite-precision random number exactly equal to the specified probability.

While I'm sure an algorithm can be created or found to implement the RS(x,y) method precisely, which would then allow you to implement the described algorithm correctly, I think that simply answering this question with "yes, the algorithm is correct" would be misleading - as it has misled so many people before into calculating and choosing elements using double, unaware of the bias they're introducing.
Don't misunderstand me - I'm not saying everyone should avoid using the described algorithm - I'm only saying that unless you find a more precise way to implement the RS(x,y) method, your selections will be subtly biased in favor of some elements over others.
If you care about fairness (equal probability of all possible outcomes) I think it is better, and easier to understand, to use a different algorithm instead, as I've described below:
If you take it as given that the only source of randomness you have available is random bits, you have to define a technique of random selection that assures equal probability given binary random data. This means that if you want to pick a random number in a range that happens to be a power of 2, you just pick random bits and return them. But if you want a random number in a range that's not a power of 2, you have to get more random bits and discard outcomes that could not map to fair outcomes (throw away the random number and try again). I blogged about it with pictorial representations and C# example code here: https://nedharvey.com/blog/?p=284
Repeat the random selection from your collection until you have n unique items.
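To make that concrete, here is a minimal Python sketch of an exact RS(x, y) built on the same reject-and-retry idea (my own illustration, not the code from the blog post; it assumes ints with 0 <= x <= y and y >= 1):
import random

def rs(x, y, rng=random.SystemRandom()):
    # Return True with probability exactly x/y using only random bits:
    # draw k fair bits with 2^k >= y, throw away any value >= y (it cannot
    # be mapped fairly), and let exactly x of the remaining y values mean True.
    k = max(1, (y - 1).bit_length())
    while True:
        r = rng.getrandbits(k)
        if r < y:
            return r < x

# rs(7, 22) is True with probability exactly 7/22. In CPython,
# rng.randrange(y) < x has the same effect, because randrange is itself
# built on getrandbits plus rejection.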

EM algorithm: comprehension and example

I'm studying pattern recognition and I found an interesting algorithm that I'd like to explore further, the Expectation-Maximization algorithm. I don't have great knowledge of probability and statistics, and I've read some articles on how the algorithm works on normal or Gaussian distributions, but I would like to start with a simple example to understand it better. I hope the example is suitable.
Assume we have a jar with balls of 3 colors: red, green, and blue. The corresponding probabilities of drawing each colored ball are pr, pg, pb. Now, let's assume that we have the following parametrized model for the probabilities of drawing the different colours:
pr = 1/4
pg = 1/4 + p/4
pb = 1/2 - p/4
with p an unknown parameter. Now assume that the man doing the experiment is actually colour-blind and cannot tell the red balls from the green ones. He draws N balls, but only sees
m1 = nR + nG red/green balls and m2 = nB blue balls.
The question is: can the man still estimate the parameter p and, with that in hand, calculate his best guess for the number of red and green balls (obviously, he knows the number of blue balls)? I think that obviously he can, but what about EM? What do I have to consider?
Well, the general outline of the EM algorithm is that if you know the values of some of the parameters, then computing the MLE for the other parameters is very simple. The commonly-given example is mixture density estimation. If you know the mixture weights, then estimating the parameters for the individual densities is easy (M step). Then you go back a step: if you know the individual densities then you can estimate the mixture weights (E step). There isn't necessarily an EM algorithm for every problem, and even if there is one, it's not necessarily the most efficient algorithm. It is, however, usually simpler and therefore more convenient.
In the problem you stated, you can pretend that you know the numbers of red and green balls and then you can carry out ML estimation for p (M step). Then with the value of p you go back and estimate the numbers of red and green balls (E step). Without thinking about it too much, my guess is that you could reverse the roles of the parameters and still work it as an EM algorithm: you could pretend that you know p and carry out ML estimation for the numbers of balls, then go back and estimate p.
If you are still following, we can work out formulas for all this stuff.
When "p" is not known, you can go for maximum likihood or MLE.
First, from your descriptions, "p" has to be in [-1, 2] or the probabilities will not make sense.
You have two certain observations: nG + nR = m and nB = N - m (m = m1, N = m1 + m2)
The chance of this happening is N! / (m! (N - m)!) * (1 - pb)^m * pb^(N - m).
Ignoring the constant N-choose-m factor, we will maximize the rest:
p* = argmax over p of (1 - pb)^m * pb^(N - m)
The easy solution is that p* should make pb = (N - m) / N = 1 - m / N.
So 0.5 - 0.25 p* = 1 - m / N ==> p* = max(-1, -2 + 4 * m / N)
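And if you want to see the EM recipe itself in action on this example, here is a minimal Python sketch (my own illustration; the closed-form M step comes from setting the derivative of the complete-data log-likelihood to zero, which is not spelled out above):
def em_colourblind(m1, m2, p=0.0, iters=100):
    # m1 = draws seen as red-or-green, m2 = draws seen as blue.
    for _ in range(iters):
        pr, pg = 0.25, 0.25 + p / 4            # current model probabilities
        # E step: split the indistinguishable m1 draws into an expected
        # green count, proportionally to pr and pg.
        nG = m1 * pg / (pr + pg)
        # M step: maximise (1/4)^nR * (1/4 + p/4)^nG * (1/2 - p/4)^m2 over p,
        # which gives the closed form below; then clip to the valid range.
        p = max(-1.0, min(2.0, (2 * nG - m2) / (nG + m2)))
    return p

# e.g. 100 draws, 60 seen as red/green and 40 blue:
# em_colourblind(60, 40) converges to p = 0.4, the same value the direct
# MLE above gives (p* = -2 + 4 * 60/100).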

Distance Calculation for massive number of devices/nodes

I have N mobile devices/nodes (say 100K) and I periodically obtain their location (latitude, longitude) values.
Some of the devices are "logically connected" to roughly M other devices (say 10).
My program periodically compares the distance between each device and its logically connected devices and determines if the distance is within a threshold (say 100 meters).
I need a robust algorithm to calculate these distances to the logically connected devices.
The complexity of the brute-force approach would be N*M (or Θ(N^2) if every pair were compared).
The program does this every 3 seconds (all devices are mobile), so that is 100K*10 = 1M distance calculations every 3 seconds, which is not good.
Any good/classical algorithms for this operation ?
(To simplify my explanation, I have omitted the detail about each device only being logically connected to M ~= 10 other devices.)
Spatially partition your device locations. If you are only interested in pairs of devices less than 100 meters apart, consider the following two algorithms.
For i = 1..N, j = 1..N, i != j, compute distance between devices i and j.
For i = 1..N, compute which grid cell the latitude and longitude for device i lies in, where grid cells are 100 meters square. Now for all nonempty grid cells, compare devices in that cell only with devices in the same cell or one of the eight adjacent cells.
The data structure for this approach would basically be a map M from grid cell index (s,t) to list of devices in that grid cell.
The first approach is naive and will cost Θ(N^2). The second approach will, assuming there is some "constant maximum density of devices," be closer to Θ(N) in practice. A 100 meter radius is fairly small.
The second approach would look something like the following, sketched in Python (compute_grid_cell, distance and report stand in for your own helpers).
from collections import defaultdict

M = defaultdict(list)                  # grid cell (s, t) -> devices in that cell
for i in range(N):
    s, t = compute_grid_cell(i)        # the 100 m x 100 m cell containing device i
    M[(s, t)].append(i)

for i in range(N):
    s, t = compute_grid_cell(i)
    # Only this cell and its eight neighbours can contain devices
    # within 100 meters of device i.
    for s2 in (s - 1, s, s + 1):
        for t2 in (t - 1, t, t + 1):
            for j in M.get((s2, t2), ()):
                if i != j and distance(i, j) < 100:
                    report(i, j)       # a pair of devices that are "close"
You can use a quadtree or a Morton curve. It reduces the dimensions and makes the solution easier.
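For example, a Morton (Z-order) key just interleaves the bits of the two grid coordinates, so nearby cells tend to end up with nearby one-dimensional keys that you can sort or index; a small sketch of the encoding (my own illustration):
def morton_key(x, y, bits=16):
    # Interleave bit b of x into position 2*b and bit b of y into position 2*b + 1.
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)
        key |= ((y >> b) & 1) << (2 * b + 1)
    return key

# morton_key(3, 5) == 0b100111 == 39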

Randomly choose one cell from the grid and mark it until some condition happens

I have an NxN boolean matrix, all elements of which are initially false
bool[][] matrix = GetMatrix(N);
In each step of the loop, I want to choose one cell (row i, column j) uniformly at random among all false cells, and set it to true, until some condition happens.
Which method should I use? I have these two ways in mind.
Create an array of the N*N values 0...(N*N - 1), shuffle it using a uniform shuffling algorithm, then take its elements one by one; for each element i, set matrix[i/N][i%N].
This uses O(N^2) additional memory, and initialization takes O(N^2) time.
And the second:
Generate random i from 0...(N^2-1) and if (i/N, i%N) is set in matrix, repeat random generation until finding an unset element.
This way doesn't use any additional memory, but I have difficulty estimating its performance... could there be a case where all elements except one are set, and the random generation repeats many times looking for the free cell? Am I right that, since the generator is theoretically uniform, this case should not happen very often?
I'll try to reply to your questions with a worst-case analysis of the scenario that, as you have pointed out, happens when all cells but one are taken.
Let's start by noting that the probability of hitting the single free cell on one draw is p = 1/N^2. From this, the probability that the first success comes on toss k is P(Y = k) = p * (1 - p)^(k - 1). This means that, for N = 10, you will need 69 random numbers to have a probability greater than 50% of getting yours, and 459 to have a probability greater than 99%.
The general formula for the smallest number k of tosses needed to have a probability greater than alpha of getting your value is:
k > log(1 - alpha) / log(1 - p)
where p is defined as above, equal to 1/N^2
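As a quick check of those numbers (a small sketch, nothing more):
import math

def draws_needed(p, alpha):
    # Smallest integer k with 1 - (1 - p)^k > alpha,
    # i.e. k > log(1 - alpha) / log(1 - p).
    return math.floor(math.log(1 - alpha) / math.log(1 - p)) + 1

p = 1 / 10**2                    # N = 10, a single free cell left
print(draws_needed(p, 0.50))     # 69
print(draws_needed(p, 0.99))     # 459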
That could get much worse as N gets bigger. You could think about keeping a list of the indices still free and picking one from it at random.
Generate random i from 0...(N^2-1) and if (i/N, i%N) is set in matrix,
repeat random generation until finding an unset element.
The analysis of this algorithm is the same as the coupon collector's problem. With N^2 cells to collect, the running time is Θ(N^2 log N).
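For completeness, the first option from the question (shuffle all the cell indices up front) avoids the retries entirely; a minimal Python sketch:
import random

def random_cell_order(N, rng=random.Random()):
    # Pre-shuffle all N*N cell indices once (rng.shuffle is a Fisher-Yates
    # shuffle), then hand them out one by one: O(N^2) memory, no retries.
    cells = list(range(N * N))
    rng.shuffle(cells)
    for idx in cells:
        yield divmod(idx, N)      # (row, column) = (idx // N, idx % N)

# usage:
#     for i, j in random_cell_order(N):
#         matrix[i][j] = True
#         if condition_happened():   # whatever your stopping condition is
#             break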

Algorithm for making two histograms proportional, minimizing units removed

Imagine you have two histograms with an equal number of bins. N observations are distributed among the bins. Each bin now has between 0 and N observations.
What algorithm would be appropriate for determining the minimum number of observations to remove from both histograms in order to make them proportional? They do not need to be equal in absolute number, only proportional to each other. That is, there must be a common factor by which all the bins in one histogram can be multiplied in order to make it equal to the other histogram.
For example, imagine the following two histograms, where the item i in each histogram refers to the number of observations in bin i for the respective histogram.
Histogram 1: 4, 7, 4, 9
Histogram 2: 2, 0, 2, 1
For these histograms, the solution would be to remove from histogram 1 all 7 observations in bin 2 and another 7 observations from bin 4, such that (histogram 2)*2 = histogram 1.
But what general algorithm could be used to find the subsets of the two histograms that maximized the number of total observations between them while making them proportional? You can drop observations from both histograms or just one.
Thanks!
Seems to me that the problem is equivalent (if you consider each histogram as a vector with one component per bin) to minimizing the Manhattan length |R|, where R = xA - B, A and B are your 'vectors', and x is your proportional scale.
|R| has a single minimum (not necessarily an integer) so you can find it fairly rapidly using a simple bisection algorithm (or something akin to Newton's method).
Then, assuming you want a solution where the proportion is an integer, test the two cases ceil(x), and floor(x), to find which has the smallest Manhattan length (and that is the number of observations you need to remove).
Proof that the problem is not NP-hard:
Consider an inefficient 'solution' whereby you removed all N observations from all the bins. Now both A and B are equal to the 'zero' histogram 0 = (0,0,0,...). The two histograms are equal and thus proportional as 0 = s * 0 for all proportional values s, so a hard maximum for the number of observations to remove is N.
Now assume a more efficient solution exists with additions/removals < N and a proportional scale s > 2*N (i.e. after removal of some observations A = s * B or B = s * A). If both A = 0 and B = 0, we have the previous solution with N removals (which contradicts the assumption that there are fewer than N removals). If A = 0 and B ≠ 0 then there is no s ≠ 0 such that 0 = s * B and no s such that s * 0 = B (with a similar argument for B = 0 and A ≠ 0). So it must be the case that both A ≠ 0 and B ≠ 0.
Assume for a moment that A is the histogram to be scaled (so A * s = B). A must have at least one non-zero entry A[i] with minimum value 1 (after removal of extra observations), so when scaled it will have minimum value ≥ s > 2*N. Therefore the corresponding entry B[i] must also have more than 2*N observations. But the total number of observations was initially N, so we would have needed to add at least N observations to B[i], which contradicts the assumption that the improved solution had fewer than N additions/removals. So no 'efficient' solution requires a proportional scale greater than 2*N.
So finding an efficient solution requires, at worst, testing the 'best fit' solution for scaling factors in the range 0 to 2*N (still only O(N) values).
The 'best fit' solution for scaling factor s in A = s * B, where A and B have M bins each requires
Sum(i=1 to M) of { Abs(A[i]- s * B[i]) mod s + Abs(A[i]- s * B[i]) div s } additions/removals.
This is an order-M operation, so testing each scaling factor in that range gives an algorithm of order O(M*N).
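Here is a brute-force sketch of that scan in Python (my own per-bin accounting rather than the exact expression above: for each bin it keeps as many observations as the scale allows and removes the rest, and it checks both scaling directions; it reproduces the worked example, 14 removals at s = 2):
def min_removals(A, B):
    # Try every integer scale s (the integer-proportion assumption above)
    # and make one histogram exactly s times the other using removals only.
    def cost(big, small, s):
        total = 0
        for a, b in zip(big, small):
            keep = min(b, a // s)                  # units kept in `small`
            total += (a - s * keep) + (b - keep)   # removals in this bin
        return total

    bound = max(max(A), max(B), 1)   # the scale never usefully exceeds this
    return min(min(cost(A, B, s), cost(B, A, s)) for s in range(1, bound + 1))

# min_removals([4, 7, 4, 9], [2, 0, 2, 1]) == 14   (the example from the question)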
I am fairly certain (but haven't got a formal proof) that the scale factor cannot exceed the number of observations in the most filled bin. In practice it is typically very much smaller. For two histograms with two hundred bins and randomly chosen 30-300 observations per bin: if there were Na > Nb total observations in all the bins of A and B respectively, the scaling factor was almost always found in the range Na/Nb - 4 < s < Na/Nb + 4 (or s = 0 if Na >> Nb).
