Sampling from a distribution given by frequency counts?

I have a data set with objects x_1,...,x_N and the object x_i appears c_i times in the data. I would like to sample efficiently from the distribution, so that object x_i has probability c_i/c of getting selected, where c = c_1 + ... + c_N.
This must be a well-known problem, but I wasn't able to find a good algorithm for this. What is the most efficient way of accomplishing this, when N is of the order of a few million?

Related

Is this optimization algorithm a linear programming problem?

I am trying to solve a variant of the multidimensional multiple knapsack problem which tries to optimize the values in each knapsack so that a percentage of each of them can be “taken” and added to create a “final knapsack” with the ideal values. See this question below.
https://cs.stackexchange.com/questions/14163/linear-programming-algorithm-to-check-if-ratios-can-be-combined-with-n-bottles
The problem I linked to says "given n bottles find a solution where you can take a ratio of every bottle and add them to equal the predetermined values (A, B, C)." The problem I have is "given a set of values organize them in the n bottles in such a way that there is a solution where you can take a ratio of every bottle and add them to equal the predetermined values (A, B, C)."
Essentially I would like to create an algorithm that organizes incoming values into knapsacks (or bottles, as they are called in the question above) so that they can be combined in a way that guarantees the desired result.
The main difference between the multiple multidimensional knapsack problem and what I am trying to do is the quantity being maximized. Instead of maximizing the total value of all the knapsacks, I want to first multiply each knapsack by a variable Lambda[i] and add them together so that they equal the item F, which is a vector of constants (A, B, C, D). I want the knapsacks optimized in such a way that the combination gives back the largest possible amount of item F.
Here are the variables
I: number of bottles
J: set of items
v[j] =[a(j), b(j), c(j), d(j)]: value of item j
p[i] = [a(i), b(i), c(i), d(i)]: value of bottle after items added to it
n[i] = weight of each bottle after items are added to it
m[i]: capacity of bottle i
Lambda[i]: lambda variable to multiply each bottle by
F =[A, B, C, D]: Optimal bottle value
N: Weight of final blend
M: Weight capacity of final bottle
The objective function I am trying to maximize is N = Σ n[i] * Lambda[i]
Some constraints I have are:
n[i] <= m[i]
Σ a(i) * Lambda[i] = A
Σ b(i) * Lambda[i] = B
Σ c(i) * Lambda[i] = C
Σ d(i) * Lambda[i] = D
0 <= Lambda[i] <= 1
I have tried implementing this solution in Gurobi and OR-Tools but the problem I'm having is that the weight of each bin is only found after optimizing so there is no way for me to maximize the function I need.
Ultimately I would like to solve an online version of this problem, where the algorithm wouldn't be able to reject any incoming item, but I figured starting with the offline version and a dataset would be easier.
Does this mean this is not a linear programming problem, or am I just missing a step? If it isn't a linear program, is there any other method, such as machine learning, that could help me solve it?
Any help would be greatly appreciated.

It looks to me like it is probably a linear programming problem, and the trouble you are having is that the objective is nonlinear. There are ways to reformulate some of these nonlinear terms to make them solvable. See for example Erwin Kalvelagen's excellent mathematical programming examples, such as http://yetanothermathprogrammingconsultant.blogspot.com/2017/05/linearizing-average.html, which is probably not quite what you need but may give you enough ideas to help.

Algorithm for finding all combinations of (x,y,z,j) that satisfy x+y = z+j, where x,y,z,j are integers between -N...N inclusive

I'm working on a problem that requires an array (dA[j], j=-N..N) to be calculated from the values of another array (A[i], i=-N..N) based on a conservation of momentum rule (x+y=z+j). This means that, for a given index j, I calculate A[x]A[y]A[z] for every valid combination of (x,y,z). dA[j] is equal to the sum of these values.
I'm currently precomputing the valid indices for each dA[j] by looping over x=-N...+N and y=-N...+N, calculating z=x+y-j, and storing the indices if abs(z) <= N.
Is there a more efficient method of computing this?
The reason I ask is that in the future I'd also like to be able to efficiently find, for each dA[j], all the terms that contain a specific A[i]. Essentially, I want to be able to compute the Jacobian of dA[j] with respect to A[i].
Update
For the sake of completeness, I figured out a way of doing this without any if statements: if you parametrize the equation x+y=z+j, with j held constant, you get the equation of a plane. The constraint that x,y,z must be integers between -N..N creates boundaries on this plane. The points that define this boundary are functions of N and j. So all you have to do is loop over your parametrized variables (s,t) within these boundaries, and you'll generate all the valid points by using the vectors that define the plane (s*u + t*v + j*[0,0,1]).
For example, if you choose u=[1,0,-1] and v=[0,1,1] all the valid solutions for every value of j are bounded by a 6 sided polygon with points (-N,-N),(-N,-j),(j,N),(N,N),(N,-j), and (j,-N).
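A minimal Python sketch of the bounded-loop idea, for illustration only. It does not use the (s,t) parametrization above; instead it derives the allowed range of y for each x directly from abs(x+y-j) <= N, which also removes the if statement. The names valid_triples and compute_dA are made up here, and A is assumed to be a dict keyed by the indices -N..N.

```python
def valid_triples(N, j):
    """Yield every (x, y, z) with x + y = z + j and x, y, z all in -N..N."""
    for x in range(-N, N + 1):
        # |z| <= N with z = x + y - j  <=>  j - N - x <= y <= j + N - x
        y_lo = max(-N, j - N - x)
        y_hi = min(N, j + N - x)
        for y in range(y_lo, y_hi + 1):
            yield x, y, x + y - j


def compute_dA(A, N):
    """dA[j] = sum of A[x]*A[y]*A[z] over all valid triples for j."""
    return {
        j: sum(A[x] * A[y] * A[z] for x, y, z in valid_triples(N, j))
        for j in range(-N, N + 1)
    }


# Hypothetical input: A[i] = 1 for every index, so dA[j] counts the triples.
N = 2
A = {i: 1.0 for i in range(-N, N + 1)}
print(compute_dA(A, N))
```

The total work is still proportional to the number of valid triples, so this only removes the rejected iterations and the branch; it does not change the O(N^2) cost per j.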

So for each j, you go through all (2N+1)^2 combinations of x and y to find those with x+y = z+j; the running time of your application (per j) is O(N^2). I don't think your current idea is bad (and after playing with some pseudocode for this, I couldn't improve it significantly). I would like to note that once you've picked a j and a z, there are at most 2N+1 choices for the pair (x, y), so there are on the order of N^2 valid triples to visit for each j. So overall, the best algorithm would still take O(N^2) per j.
But consider the following improvement by a factor of 2 (for the overall program, not per j): if z+j = x+y, then (-z)+(-j) = (-x)+(-y) also, so the index triples for -j are just the negated triples for j.

Sample with given probability

I stumbled upon a basic discrete math/probability question and I wanted to get some ideas for improvements over my solution.
Assume you are given a collection (an alphabet, the natural numbers, etc.). How do you ensure that you draw a certain value X from this collection with a given probability P?
I'll explain my naïve solution with an example:
Collection = {A, B}
X = A, P = 1/4
We build an array v = [A, B, B, B] and we use a rand function to uniformly sample the indices of the array, i.e., {0, 1, 2, 3}
This approach works, but it isn't efficient: the smaller P is, the more memory v needs. Hence, I was wondering what ideas the Stack Overflow community might have for improving this.
Thanks!

Partition the interval [0,1] into disjoint intervals whose union is [0,1]. Make the length of each interval equal to the probability of selecting the corresponding event. Then simply sample uniformly from [0,1], determine which interval the result lies in, and look up the selection that corresponds to that interval. In your example, this gives the 2 intervals [0,1/4) and [1/4,1]. Generate a uniform random value from [0,1]: if your sample lies in the first interval then your selection is X = A, and if it lies in the other interval then X = B.
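Here is a small Python sketch of the interval idea, under the assumption that the weights are non-negative numbers (raw frequency counts work as well as probabilities, since the uniform draw is scaled by their total). The helper names build_cumulative and sample_index are invented for this example. The interval boundaries are stored as running totals, and a binary search finds the interval containing the draw.

```python
import bisect
import itertools
import random


def build_cumulative(weights):
    """Running totals: entry i is the right end of interval i."""
    return list(itertools.accumulate(weights))


def sample_index(cumulative, rng=random):
    """Draw index i with probability weights[i] / sum(weights)."""
    total = cumulative[-1]
    u = rng.random() * total  # uniform in [0, total)
    # Index of the first running total strictly greater than u.
    return bisect.bisect_right(cumulative, u)


items = ["A", "B"]
cumulative = build_cumulative([1, 3])  # P(A) = 1/4, P(B) = 3/4
draws = [items[sample_index(cumulative)] for _ in range(10_000)]
print(draws.count("A") / len(draws))
```

Building the running totals takes one pass over the weights, and each draw then costs a binary search, so memory grows with the number of distinct items rather than with 1/P. The standard library's random.choices also accepts a cum_weights argument if you would rather not write the search yourself.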

Your proposed solution is indeed not great, and the most general and efficient way to solve it is as mathematician1975 states (this is known as the inverse CDF method). For your specific problem, which is multinomial sampling, you can also use a series of draws from binomial distributions to sample from your collection. This is often more intuitive if you're not familiar with sampling methods.
If the first item in the collection has probability p_1, sample uniformly in the interval [0,1]. If the sample is less than p_1, return item 1. Otherwise, renormalise the remaining outcomes by dividing them by 1-p_1 and repeat the process with the next possible outcome. After each unsuccessful sampling, renormalise the remaining outcomes by dividing them by one minus the total probability of the rejected outcomes, so that the remaining probabilities sum to 1. If you get to the last outcome, return it with probability 1. The result of the process will be random samples distributed according to your original vector.
This method is using the fact that individual components of a multinomial are binomially distributed, and any sub vector of the multinomial is also multinomial with parameters given by the renormalisation I describe above.
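The sequential procedure can be sketched in Python as follows. This is only an illustration, assuming probs is a list of positive probabilities that sums to 1; the function name sample_sequential is made up. Instead of rescaling the whole vector after each rejection, it keeps the remaining probability mass in one variable and divides by it, which is the same renormalisation.

```python
import random


def sample_sequential(probs, rng=random):
    """Accept outcome i with its probability conditional on 0..i-1 being rejected."""
    remaining = 1.0  # probability mass of the outcomes not yet rejected
    for i, p in enumerate(probs[:-1]):
        # p / remaining is the renormalised probability of outcome i.
        if rng.random() < p / remaining:
            return i
        remaining -= p
    return len(probs) - 1  # the last outcome is returned with probability 1


probs = [0.2, 0.3, 0.5]
counts = [0] * len(probs)
for _ in range(20_000):
    counts[sample_sequential(probs)] += 1
print([c / 20_000 for c in counts])
```

Each draw may need up to len(probs) - 1 uniform samples, so for large collections the interval method with a binary search is usually the cheaper choice.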

Optimal Bucket Size and No. of Buckets

Sorry, this post is not about coding but rather about data structures and algorithms.
I have a large amount of data in which each value occurs with a different frequency. A rough plot of the frequencies looks like a bell curve. I now want to display the data in ranges (buckets) that describe the frequencies as precisely as possible.
For example, a single bucket covering the entire range of the data contains the total of all frequencies, but such a bucket is not precise and can be refined. (If some data is more concentrated in a particular frequency zone, we could build a narrower bucket there that holds more closely related frequencies.)
Is there an algorithm that can help with this?
I thought of an algorithm related to binary search.
Any ideas, folks?

Not sure I am following, but it seems you are looking for k bins such that, for any two bins, the probability of a data point falling in one is the same as the probability of it falling in the other.
From your description, your data seems to be normally distributed, or t-distributed.
One can estimate the mean and standard deviation of the data; let the estimated S.D. be s and the mean be u.
The standard formulas for estimating the mean and S.D. from the sample are (1):
u = (x1 + x2 + ... + xn) / n (simple average)
s^2 = Sigma((xi - u)^2)/(n-1)
Given this information, you can model the distribution of your data as N(u,s^2), that is, you can define a random variable X~N(u,s^2) (2).
Now all is left is finding the a,b,... as follows (assuming 10 buckets, this can obviously be modified as you wish):
P(X<a) = 0.1
P(X<b) = 0.2
P(X<c) = 0.3
...
After finding a,b,c,... you have your bins: (-infinity,a], (a,b], (b,c], ...
(1) evaluating variance: http://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance
(2) The real distribution for this variable is actually a t-distribution, since the variance is unknown and estimated from the data. However, for large enough n the t-distribution converges to the normal distribution.
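A short Python sketch of the steps above, assuming the data really is close to normal (that assumption comes from the answer, not from anything checked here). It uses statistics.NormalDist from the standard library (Python 3.8+); the function name equal_probability_edges and the sample data are made up for the example.

```python
import random
from statistics import NormalDist, mean, stdev


def equal_probability_edges(data, k=10):
    """Return k-1 cut points a, b, c, ... with P(X < a) = 1/k, P(X < b) = 2/k, ..."""
    u = mean(data)
    s = stdev(data)  # sample standard deviation (n - 1 denominator)
    dist = NormalDist(u, s)
    return [dist.inv_cdf(i / k) for i in range(1, k)]


# Hypothetical data: 5000 draws from a normal distribution.
rng = random.Random(0)
sample = [rng.gauss(50.0, 8.0) for _ in range(5000)]
edges = equal_probability_edges(sample, k=10)
print([round(e, 2) for e in edges])
```

The first bin is (-infinity, edges[0]], the last is (edges[-1], +infinity), and the bins in between are narrowest near the mean, where the data is most concentrated.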

First count all the indexes, then subtract the repeated values; this will give you the optimal number of buckets, but only at a small scale.

Constraint Satisfaction: Choosing real numbers with certain characteristics

I have a set of n real numbers. I also have a set of functions,
f_1, f_2, ..., f_m.
Each of these functions takes a list of numbers as its argument. I also have a set of m ranges,
[l_1, u_1], [l_2, u_2], ..., [l_m, u_m].
I want to repeatedly choose a subset {r_1, r_2, ..., r_k} of k elements such that
l_i <= f_i({r_1, r_2, ..., r_k}) <= u_i for 1 <= i <= m.
Note that the functions are smooth. Changing one element in {r_1, r_2, ..., r_k} will not change f_i({r_1, r_2, ..., r_k}) by much. Average and variance are two commonly used f_i.
These are the m constraints that I need to satisfy.
Moreover I want to do this so that the set of subsets I choose is uniformly distributed over the set of all subsets of size k that satisfy these m constraints. Not only that, but I want to do this in an efficient manner. How quickly it runs will depend on the density of solutions within the space of all possible solutions (if this is 0.0, then the algorithm can run forever). (Assume that f_i (for any i) can be computed in a constant amount of time.)
Note that n is large enough that I cannot brute-force the problem. That is, I cannot just iterate through all k-element subsets and find which ones satisfy the m constraints.
Is there a way to do this?
What sorts of techniques are commonly used for a CSP like this? Can someone point me in the direction of good books or articles that talk about problems like this (not just CSPs in general, but CSPs involving continuous, as opposed to discrete values)?

Assuming you're looking to write your own application and use existing libraries to do this, there are choices in many languages, like Python-constraint, or Cream or Choco for Java, or CSP for C++. The way you've described the problem, it sounds like you're looking for a general-purpose CSP solver. Are there any properties of your functions that may help reduce the complexity, such as being monotonic?

Given the problem as you've described it, you can pick from each range r_i uniformly and throw away any m-dimensional point that fails to meet the criterion. It will be uniformly distributed because the original is uniformly distributed and the set of subsets is a binary mask over the original.
Without knowing more about the shape of f, you can't make any guarantees about whether time is polynomial or not (or even have any idea of how to hit a spot that meets the constraint). After all, if f_1 = (x^2 + y^2 - 1) and f_2 = (1 - x^2 - y^2) and the constraints are f_1 < 0 and f_2 < 0, you can't satisfy this at all (and without access to the analytic form of the functions, you could never know for sure).
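One reading of the rejection idea, sketched in Python: draw k-element subsets uniformly from the n numbers and discard any that violate a constraint. The accepted subsets are then uniform over the feasible ones. The function name sample_subset, the (f, lo, hi) tuple layout, and the max_tries cutoff are all assumptions made for this example.

```python
import random
from statistics import mean


def sample_subset(numbers, k, constraints, max_tries=100_000, rng=random):
    """Draw uniform k-subsets until one satisfies every (f, lo, hi) constraint."""
    for _ in range(max_tries):
        subset = rng.sample(numbers, k)  # uniform over k-element subsets
        if all(lo <= f(subset) <= hi for f, lo, hi in constraints):
            return subset
    return None  # gave up: the feasible region may be tiny or empty


numbers = list(range(1, 101))
constraints = [(mean, 10, 50)]  # one constraint: 10 <= average <= 50
print(sample_subset(numbers, 5, constraints))
```

The expected number of tries is one over the fraction of subsets that are feasible, which is why the cutoff is needed: with a feasible fraction near zero the loop would otherwise run indefinitely, as the question already anticipates.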

Given the information in your message, I'm not sure it can be done at all...
Consider:
numbers = {1....100}
m = 1 (keep it simple)
F1 = Average
L1 = 10
U1 = 50
Now, how many subsets of {1...100} can you come up with that produce an average between 10 and 50?

This looks like a very hard problem. For the simplest case with linear functions you could take a look at linear programming.
