How to fix skewed trapezoidal distribution sampling output sample size - algorithm

I am trying to generate a skewed trapezoidal distribution using inverse transform sampling.
The inputs are the values where the ramps start and end (a, b, c, d) and the sample size.
a=-3;b=-1;c=1;d=8;
SampleSize=10e4;
h=2/(d+c-a-b);
Then I calculate the ratio of the lengths of the ramps and the flat component to the total, to get a sample size for each:
firstramp=round(((b-a)/(d-a)),3);
flat=round((c-b)/(d-a),3);
secondramp=round((d-c)/(d-a),3);
n1=firstramp*SampleSize; %sample size for first ramp
n3=secondramp*SampleSize; %sample size for second ramp
n2=flat*SampleSize;
And then finally I get the histogram from the following code:
quartile1=h/2*(b-a);
quartile2=1-h/2*(d-c);
y1=linspace(0,quartile1,n1);
y2=linspace(quartile1,quartile2,n2);
y3=linspace(quartile2,1,n3);
%inverse cumulative distribution functions
invcdf1=a+sqrt(2*(b-a)/h)*sqrt(y1);
invcdf2=(a+b)/2+y2/h;
invcdf3=d-sqrt(2*(d-c)/h)*sqrt(1-y3);
distr=[invcdf1 invcdf2 invcdf3];
histogram(distr,100)
However, the sampling of the ramp and flat components is not proportional; the histogram looks like this:
I fixed this by trial and error, by reducing the sample size of the ramps by half:
n1=0.5*firstramp*SampleSize; %sample size for first ramp
n3=0.5*secondramp*SampleSize; %sample size for second ramp
n2=flat*SampleSize;
This made the distribution look like this:
However, this makes the output sample size smaller than what is given in the input.
I've also tried different combinations of changing the sample sizes of ramps and flat.
This also works:
n1=0.75*firstramp*SampleSize; %sample size for first ramp
n3=0.75*secondramp*SampleSize; %sample size for second ramp
n2=1.5*flat*SampleSize;
It increases the number of output samples, but the total is still not close to the input sample size.
Any help will be appreciated.
Full code:
a=-3;b=-1;c=1;d=8;
SampleSize=10e4;%*1.33333333333333;
h=2/(d+c-a-b);
firstramp=round(((b-a)/(d-a)),3);
flat=round((c-b)/(d-a),3);
secondramp=round((d-c)/(d-a),3);
n1=firstramp*SampleSize; %sample size for first ramp
n3=secondramp*SampleSize; %sample size for second ramp
n2=flat*SampleSize;
quartile1=h/2*(b-a);
quartile2=1-h/2*(d-c);
y1=linspace(0,quartile1,.75*n1);
y2=linspace(quartile1,quartile2,1.5*n2);
y3=linspace(quartile2,1,.75*n3);
%inverse cumulative distribution functions
invcdf1=a+sqrt(2*(b-a)/h)*sqrt(y1);
invcdf2=(a+b)/2+y2/h;
invcdf3=d-sqrt(2*(d-c)/h)*sqrt(1-y3);
distr=[invcdf1 invcdf2 invcdf3];
histogram(distr,100)
%end

I don't know Matlab, so I was hoping somebody else would jump in on this, but since nobody did, here goes.
If I'm reading your code correctly, what you did is not an inversion. Inversion is 1-1, i.e., one uniform input produces one outcome. You seem to be using a technique known as the "composition method". In composition the overall distribution is comprised of component pieces, each of which is straightforward to generate. You choose which component to generate from based on their proportions/probabilities relative to the whole. For density functions, probability is found as the area under the density curve, so your first mistake was in sampling the components relative to the width of each component rather than using their areas. The correct sampling proportions are 2/13, 4/13, and 7/13 for what you designated the firstramp, flat, and secondramp components, respectively.

A second mistake (which is relatively minor) was to assign exact sample sizes to each of the components. Having probability 2/13 does not mean that exactly 2*SampleSize/13 of your samples will be from the firstramp; it means that's the expected sample size for that component. The expected value of a random variate is not necessarily (or even likely to be) the outcome you will actually get.
In pseudocode, the composition approach would be:
generate U ~ Uniform(0,1)
if U <= 2/13:
    generate and return a value from firstramp
else if U <= 6/13:
    generate and return a value from flat
else:
    generate and return a value from secondramp
Note that since each of the generate options will use one or more uniforms, and choosing between the options requires a uniform U, this is not an inversion.
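To make that concrete, here is a sketch of the composition method in Python/NumPy (my translation, assuming NumPy is acceptable; each branch inverts that component's own CDF, and the 2/13, 4/13, 7/13 area weights do the choosing):

import numpy as np

rng = np.random.default_rng()
a, b, c, d = -3, -1, 1, 8
n = 100_000

h = 2 / (d + c - a - b)                         # peak height, as in the question
p = np.array([h*(b-a)/2, h*(c-b), h*(d-c)/2])   # component areas: 2/13, 4/13, 7/13
p /= p.sum()                                    # guard against floating-point drift

comp = rng.choice(3, size=n, p=p)               # pick a component by its area
u = rng.random(n)                               # one more uniform per draw

x = np.empty(n)
x[comp == 0] = a + (b - a) * np.sqrt(u[comp == 0])       # rising ramp
x[comp == 1] = b + (c - b) * u[comp == 1]                # flat section
x[comp == 2] = d - (d - c) * np.sqrt(1 - u[comp == 2])   # falling ramp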
If you want an actual inversion, you need to quantify your density, integrate it to get the cumulative distribution function, then apply the inversion technique by setting F(X) = U and solving for X. Since your distribution is made of distinct components, both the density and cumulative density will be piecewise functions.
After deriving the height based on the requirement that the areas of the two triangles and the flat section must add up to 1, I came up with the following for your density:
       | (x + 3) / 13         -3 <= x <= -1
       |
f(x) = | 2 / 13               -1 <= x <= 1
       |
       | 2 * (8 - x) / 91      1 <= x <= 8
Integrating this and collecting terms produces the CDF:
       | (x + 3)**2 / 26                   -3 <= x <= -1
       |
F(x) = | (2 + x) * 2 / 13                  -1 <= x <= 1
       |
       | 6 / 13 + [49 - (x - 8)**2] / 91    1 <= x <= 8
Finally, determining the values of F(x) at the break points between the segments and applying inversion yields the following pseudocode algorithm:
generate U ~ Uniform(0,1)
if U <= 2 / 13:
    return 2 * sqrt( (13 * U) / 2 ) - 3
else if U <= 6 / 13:
    return (13 * U) / 2 - 2
else:
    return 8 - sqrt( 91 * (1 - U) )
Note that this is a true inversion. The outcome is determined by generating a single U, and transforming it in different ways depending on which range it falls in.
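For completeness, a vectorized NumPy translation of that inversion (the formulas are exactly the ones above; the NumPy usage is my assumption):

import numpy as np

rng = np.random.default_rng()
u = rng.random(100_000)                            # one uniform per sample, nothing else

x = np.where(u <= 2/13, 2*np.sqrt(13*u/2) - 3,     # first ramp
    np.where(u <= 6/13, 13*u/2 - 2,                # flat section
                        8 - np.sqrt(91*(1 - u))))  # second ramp

A histogram of x reproduces the trapezoid with exactly the requested sample size.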

Related

Can an ANN of 2 neurons solve XOR?

I know that an artificial neural network (ANN) of 3 neurons in 2 layers can solve XOR:
Input1----Neuron1\
      \   /       \
       \ /         +------->Neuron3
       / \         /
      /   \       /
Input2----Neuron2/
But to minify this ANN, can just 2 neurons (Neuron1 takes 2 inputs, Neuron2 takes only 1 input) solve XOR?
Input1\
       \
        Neuron1------->Neuron2
       /
Input2/
The artificial neuron receives one or more inputs...
https://en.wikipedia.org/wiki/Artificial_neuron
Bias input '1' is assumed to be always there in both diagrams.
Side notes:
A single neuron can solve XOR, but only with an additional input such as x1*x2 or x1+x2:
https://www.quora.com/Why-cant-the-XOR-problem-be-solved-by-a-one-layer-perceptron/answer/Razvan-Popovici/log
Could the ANN form in the second diagram solve XOR with an additional input like the above fed into Neuron1 or Neuron2?
No, that's not possible, unless (maybe) you start using some rather strange, unusual activation functions.
Let's first ignore neuron 2, and pretend that neuron 1 is the output node. Let x0 denote the bias value (always x0 = 1), let x1 and x2 denote the input values of an example, let y denote the desired output, and let w0, w1, w2 denote the weights from the x's to neuron 1. With the XOR problem, we have the following four examples:
x0 = 1, x1 = 0, x2 = 0, y = 0
x0 = 1, x1 = 1, x2 = 0, y = 1
x0 = 1, x1 = 0, x2 = 1, y = 1
x0 = 1, x1 = 1, x2 = 1, y = 0
Let f(.) denote the activation function of neuron 1. Then, assuming we can somehow train our weights to solve the XOR problem, we have the following four equations:
f(w0 + x1*w1 + x2*w2) = f(w0) = 0
f(w0 + x1*w1 + x2*w2) = f(w0 + w1) = 1
f(w0 + x1*w1 + x2*w2) = f(w0 + w2) = 1
f(w0 + x1*w1 + x2*w2) = f(w0 + w1 + w2) = 0
Now, the main problem is that the activation functions that are typically used (ReLUs, sigmoid, tanh, the identity function... maybe others) are nondecreasing. That means that if you give it a larger input, you also get a larger output: f(a + b) >= f(a) if b >= 0. If you look at the above four equations, you'll see this is a problem. Comparing the second and third equations to the first tells us that w1 and w2 need to be positive, because they need to increase the output in comparison to f(w0). But then the fourth equation won't work out, because it will give an even greater output instead of 0.
I think (but didn't actually try to verify, maybe I'm missing something) that it would be possible if you use an activation function that goes up first and then down again. Think of something like f(x) = -(x^2) with some extra term to shift it away from the origin. I don't think such activation functions are commonly used in neural networks. I suspect they'll behave less nicely when training, and they are not plausible from a biological point of view (remember that neural networks are at least inspired by biology).
Now, in your question you also added an extra link from neuron 1 to neuron 2, which I ignored in the discussion above. The problem here is still the same though. The activation level in neuron 1 in the fourth case is always going to be higher than (or at least as high as) in the second and third cases. Neuron 2 would typically again have a nondecreasing activation function, so it would not be able to change this (unless you put a negative weight between the hidden neuron 1 and output neuron 2, in which case you flip the problem around and will predict too high a value for the first case).
EDIT: Note that this is related to Aaron's answer, which is essentially also about the problem of nondecreasing activation functions, just using more formal language. Give him an upvote too!
It's not possible.
Firstly, you need as many inputs as XOR has; the smallest ANN capable of modelling any binary operation will contain two inputs. The second diagram only shows one input and one output.
Secondly, and this is probably the most direct refutation, the XOR function's output is not an additive or multiplicative relationship, but it can be modelled using a combination of them. A neuron is generally modelled using functions like sigmoids or lines, which have no stationary points, so one layer of neurons can roughly approximate an additive or multiplicative relationship.
What this means is that a minimum of two layers of processing is required to produce a XOR operation.
This question brings up an interesting topic of ANNs. They are well-suited to identifying fuzzy relationships, but tend to require at least as much network complexity as any mathematical process which would solve the problem with no fuzzy margin for error. Use ANNs where you need to identify something which looks mostly like what you are identifying, and use math where you need to know precisely whether something matches a set of concrete traits.
Understanding the distinction between ANN and mathematics opens up the possibility of combining the two in more powerful calculation pipelines, such as identifying possible circles in an image using ANN, using mathematics to pin down their precise origins, and using a second ANN to compare those origins to the configurations on known objects.
It is absolutely possible to solve the XOR problem with only two neurons.
Take a look at the model below.
This model solves the problem easily.
The first represents logical AND and the other logical OR. The value of +1.5 for the threshold of the hidden neuron ensures that it will be turned on only when both input units are on. The value of +0.5 for the output neuron ensures that it will turn on only when it receives a net positive input greater than +0.5. The weight of -2 from the hidden neuron to the output one ensures that the output neuron will not come on when both input neurons are on (ref. 2).
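As a quick check, here is a minimal Python sketch with step activations and exactly the weights described above:

def step(v):                                  # threshold activation
    return 1 if v > 0 else 0

def xor(x1, x2):
    hidden = step(x1 + x2 - 1.5)              # fires only when both inputs are on (AND)
    return step(x1 + x2 - 2 * hidden - 0.5)   # OR-like output, vetoed by the hidden unit

print([xor(x1, x2) for x1 in (0, 1) for x2 in (0, 1)])   # [0, 1, 1, 0]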
ref. 1: Hazem M El-Bakry, Modular neural networks for solving high complexity problems (link)
ref. 2: D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning representation by error backpropagation, Parallel distributed processing: Explorations in the Microstructures of Cognition, Vol. 1, Cambridge, MA: MIT Press, pp. 318-362, 1986.
Of course it is possible. But before solving the XOR problem with two neurons I want to discuss linear separability. A problem is linearly separable if a single hyperplane can form the decision boundary. (A hyperplane is just a plane drawn to differentiate the classes. For an N-dimensional problem, i.e., a problem having N features as inputs, the hyperplane will be an (N-1)-dimensional plane.) So for a 2-input XOR problem the hyperplane will be a one-dimensional plane, that is, a "line".
Now coming to the question: XOR is not linearly separable. Hence we cannot directly solve the XOR problem with two neurons. The following images show that no matter how many ways we draw a line in 2D space, we cannot differentiate one side's output from the other. For example, for the first one, the inputs (0,1) and (1,0) both make XOR give 1. But for the input (1,1) the output is 0, and we cannot separate it; unfortunately they fall on the same side.
So here we have two options to solve it:
Using a hidden layer. But this will increase the number of neurons to more than two.
Another option is to increase the dimensions (see the single-neuron sketch after the illustration below).
Let's have an illustration of how increasing dimensions can solve this problem while keeping the number of neurons at 2.
For an analogy, we can think of XOR as subtracting AND from OR, like below:
If you notice the upper figure, the first neuron mimics logical AND after passing v = (-1.5) + (x1 * 1) + (x2 * 1) to some activation function, and its output is taken as 0 or 1 depending on whether v is negative or positive, respectively (I am not getting into the details... hope you got the point). In the same way, the next neuron mimics logical OR.
For the first three cases of the truth table the AND neuron remains turned off. But for the last one (actually where OR differs from XOR) the AND neuron turns on, providing a big negative value to the OR neuron, which overwhelms the total summation to negative, as it is big enough to make the summation a negative number. So finally the activation function of the second neuron interprets it as 0.
In this way we can make XOR with 2 neurons.
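For comparison, the increased-dimension trick from the side note (feeding x1*x2 as an extra input) gets XOR down to a single neuron; the weights below are my own illustrative choice, not taken from the figures:

def step(v):
    return 1 if v > 0 else 0

def xor(x1, x2):
    # one neuron over the augmented input (x1, x2, x1*x2)
    return step(x1 + x2 - 2 * (x1 * x2) - 0.5)

print([xor(x1, x2) for x1 in (0, 1) for x2 in (0, 1)])   # [0, 1, 1, 0]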
The following two figures, which I have collected, are also solutions to your question:
The problem can be split in two parts.
Part one
a b c
-------
0 0 0
0 1 1
1 0 0
1 1 0
Part two
a b d
-------
0 0 0
0 1 0
1 0 1
1 1 0
Part one can be solved with one neuron.
Part two can also be solved with one neuron.
Part one and part two added together make the XOR.
c = sigmoid(a * -6.5906 + b * 5.9016 + -3.1123)
d = sigmoid(a * 6.0178 + b * -6.6000 + -2.9996)
----------------------------------------------------------
sigmoid(0.0 * 6.0178 + 0 * -6.6000 + -2.9996) + sigmoid(0.0 * -6.5906 + 0 * 5.9016 + -3.1123) = 0.0900
sigmoid(1.0 * 6.0178 + 0 * -6.6000 + -2.9996) + sigmoid(1.0 * -6.5906 + 0 * 5.9016 + -3.1123) = 0.9534
sigmoid(0.0 * 6.0178 + 1 * -6.6000 + -2.9996) + sigmoid(0.0 * -6.5906 + 1 * 5.9016 + -3.1123) = 0.9422
sigmoid(1.0 * 6.0178 + 1 * -6.6000 + -2.9996) + sigmoid(1.0 * -6.5906 + 1 * 5.9016 + -3.1123) = 0.0489
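The same decomposition as a runnable Python check (my transcription of the weights above):

import math

def sigmoid(v):
    return 1 / (1 + math.exp(-v))

def xor(a, b):
    c = sigmoid(a * -6.5906 + b * 5.9016 - 3.1123)   # part one: (not a) and b
    d = sigmoid(a * 6.0178 + b * -6.6000 - 2.9996)   # part two: a and (not b)
    return c + d

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(xor(a, b), 4))   # 0.09, 0.9422, 0.9534, 0.0489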

Random Algorithm with adjustable probability

I'm searching for an algorithm (in any programming language, maybe pseudocode?) that returns a random number with different probabilities.
For example:
A random generator which simulates a die where the chance for a '6'
is 50% and for each of the other 5 numbers it's 10%.
The algorithm should be scalable, because this is my exact problem:
I have an array (or database) of elements from which I want to
select one random element. But each element should have a different
probability of being selected. So my idea is that every element gets a
number, and this number divided by the sum of all numbers gives the
chance for the element to be randomly selected.
Does anybody know a good programming language (or library) for this problem?
The best solution would be a good SQL query which delivers one random entry.
But I would also be happy with any hint or attempt in another programming language.
A simple algorithm to achieve it is:
Create an auxiliary array where sum[i] = p1 + p2 + ... + pi. This is done only once.
When you draw a number, draw a number r with uniform distribution over [0, sum[n]), and binary search for the first entry of sum higher than r. This can be done efficiently.
It is easy to see that the probability for r to lie in a certain range [sum[i-1], sum[i]) is indeed sum[i] - sum[i-1] = pi.
(In the above, we regard sum[-1]=0, for completeness)
For your dice example:
You have:
p1=p2=....=p5 = 0.1
p6 = 0.5
First, calculate sum array:
sum[1] = 0.1
sum[2] = 0.2
sum[3] = 0.3
sum[4] = 0.4
sum[5] = 0.5
sum[6] = 1
Then, each time you need to draw a number: draw a random number r in [0,1), and choose the element whose cumulative range contains it, i.e. the first i with sum[i] > r. For example:
r1 = 0.45 -> element = 5
r2 = 0.8 -> element = 6
r3 = 0.1 -> element = 2
r4 = 0.09 -> element = 1
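A minimal sketch of this in Python (assuming Python is acceptable), with bisect doing the binary search:

import bisect
import random
from itertools import accumulate

p = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]          # p1..p6 from the dice example
cum = list(accumulate(p))                    # [0.1, 0.2, 0.3, 0.4, 0.5, 1.0]

def draw():
    r = random.random()                      # uniform over [0, 1)
    return bisect.bisect_right(cum, r) + 1   # first index with cum > r, as 1-based element

draw() returns 5 for r = 0.45 and 6 for r = 0.8, matching the worked values above.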
An alternative answer. Your example was in percentages, so set up an array with 100 slots. A 6 is 50%, so put 6 in 50 of the slots. 1 to 5 are at 10% each, so put 1 in 10 slots, 2 in 10 slots etc. until you have filled all 100 slots in the array. Now pick one of the slots at random using a uniform distribution in [0, 99] or [1, 100] depending on the language you are using.
The contents of the selected array slot will give you the distribution you want.
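In Python (my choice of language here), the slot array is a one-liner:

import random

slots = [6] * 50 + [1] * 10 + [2] * 10 + [3] * 10 + [4] * 10 + [5] * 10
roll = random.choice(slots)   # a uniform pick from 100 slots gives the skewed die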
ETA: On second thoughts, you don't actually need the array, just use cumulative probabilities to emulate the array:
r = rand(100) // In range 0 -> 99 inclusive.
if (r < 50) return 6; // Up to 50% returns a 6.
if (r < 60) return 1; // Between 50% and 60% returns a 1.
if (r < 70) return 2; // Between 60% and 70% returns a 2.
if (r < 80) return 3; // Between 70% and 80% returns a 3.
if (r < 90) return 4; // Between 80% and 90% returns a 4.
return 5;             // The remaining 10% returns a 5.
You already know what numbers are in what slots, so just use cumulative probabilities to pick a virtual slot: 50; 50 + 10; 50 + 10 + 10; ...
Be careful of edge cases and whether your RNG is 0 -> 99 or 1 -> 100.

least square line fitting in 4D space

I have a set of points like:
(x , y , z , t)
(1 , 3 , 6 , 0.5)
(1.5 , 4 , 6.5 , 1)
(3.5 , 7 , 8 , 1.5)
(4 , 7.25 , 9 , 2)
I am looking to find the best linear fit on these points, let's say a function like:
f(t) = a * x +b * y +c * z
This is a linear regression problem. The "best fit" depends on the metric you define for what counts as better.
One simple example is the least squares metric, which aims to minimize the sum of squares (f((x_i, y_i, z_i)) - w_i)^2, where w_i is the measured value for the sample.
So, in least squares you are trying to minimize SUM{ (a*x_i + b*y_i + c*z_i - w_i)^2 | per each i }. This function has a single global minimum at:
(a,b,c) = (X^T * X)^-1 * X^T * w
Where:
X is an m x 3 matrix whose rows are your samples (m is the number of samples you have)
X^T - is the transposed of this matrix
w - is the measured results: `(w_1,w_2,...,w_m)`
The * operator represents matrix multiplication
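As a sketch in NumPy (my choice; the four sample points are from the question, with t treated as the measured w):

import numpy as np

X = np.array([[1.0, 3.00, 6.0],
              [1.5, 4.00, 6.5],
              [3.5, 7.00, 8.0],
              [4.0, 7.25, 9.0]])        # m x 3, one row per sample
w = np.array([0.5, 1.0, 1.5, 2.0])      # measured values

# the closed form above; np.linalg.lstsq(X, w) is the numerically safer route
a, b, c = np.linalg.solve(X.T @ X, X.T @ w)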
There are other, more complex methods that use a different distance metric; one example is the famous SVR with a linear kernel.
It seems that you are looking for the major axis of a point cloud.
You can work this out by finding the eigenvector associated with the largest eigenvalue of the covariance matrix. This could be an opportunity to use the power method (starting the iterations with the point farthest from the centroid, for example).
Can also be addressed by Singular Value Decomposition, preferably using methods that compute the largest values only.
If your data set contains outliers, then RANSAC could be a better choice: take two points at random and compute the sum of distances to the line they define. Repeat a number of times and keep the best fit.
Using the squared distances will answer your request for least-squares, but non-squared distances will be more robust.
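A sketch of the SVD route in NumPy (my wording of it; the points are from the question, treated as a 4D point cloud):

import numpy as np

P = np.array([[1.0, 3.00, 6.0, 0.5],
              [1.5, 4.00, 6.5, 1.0],
              [3.5, 7.00, 8.0, 1.5],
              [4.0, 7.25, 9.0, 2.0]])    # one point (x, y, z, t) per row
centroid = P.mean(axis=0)
_, _, Vt = np.linalg.svd(P - centroid)   # SVD of the centered cloud
direction = Vt[0]                        # major axis = top right-singular vector
# best-fit line: centroid + s * direction, for a scalar parameter s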
You have a linear problem.
For example, my equation will be Y = a*x1 + b*x2 + c*x3.
In MATLAB do it:
B = [x1(:) x2(:) x3(:)] \ Y;
Y_fit = [x1(:) x2(:) x3(:)] * B;
In PYTHON do it:
import numpy as np
X = np.column_stack((x1.ravel(), x2.ravel(), x3.ravel()))
B, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
Y_fit = X @ B

Compare two arrays of points [closed]

I'm trying to find a way to find similarities between two arrays of points. I drew circles around points that have similar patterns, and I would like to do some kind of automatic comparison in intervals of, let's say, 100 points, and tell what the coefficient of similarity is for each interval. As you can see, the signals might also not be perfectly aligned, so point-to-point comparison would not be a good solution (I suppose). Patterns that are slightly misaligned could also mean that they match the pattern (but obviously with a smaller coefficient).
What similarity could mean (1 coefficient is a perfect match, 0 or less - is not a match at all):
Points 640 to 660 - Very similar (coefficient is ~0.8)
Points 670 to 690 - Quite similar (coefficient is ~0.5-~0.6)
Points 720 to 780 - Let's say quite similar (coefficient is ~0.5-~0.6)
Points 790 to 810 - Perfectly similar (coefficient is 1)
The coefficient is just my idea of how the final calculated result of the comparing function could look for the given data.
I read many posts on SO but it didn't seem to solve my problem. I would appreciate your help a lot. Thank you
P.S. Perfect answer would be the one that provides pseudo code for function which could accept two data arrays as arguments (intervals of data) and return coefficient of similarity.
I also think High Performance Mark has basically given you the answer (cross-correlation). In my opinion, most of the other answers are only giving you half of what you need (i.e., dot product plus compare against some threshold). However, this won't consider a signal to be similar to a shifted version of itself. You'll want to compute this dot product N + M - 1 times, where N, M are the sizes of the arrays. For each iteration, compute the dot product between array 1 and a shifted version of array 2. The amount you shift array 2 increases by one each iteration. You can think of array 2 as a window you are passing over array 1. You'll want to start the loop with the last element of array 2 only overlapping the first element in array 1.
This loop will generate numbers for different amounts of shift, and what you do with that number is up to you. Maybe you compare it (or the absolute value of it) against a threshold that you define to consider two signals "similar".
Lastly, in many contexts, a signal is considered similar to a scaled (in the amplitude sense, not time-scaling) version of itself, so there must be a normalization step prior to computing the cross-correlation. This is usually done by scaling the elements of the array so that the dot product with itself equals 1. Just be careful to ensure this makes sense for your application numerically, i.e., integers don't scale very well to values between 0 and 1 :-)
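A minimal sketch of that recipe in NumPy (my assumption of language; normalize, then slide one array over the other):

import numpy as np

def norm_xcorr(a, b):
    a = a / np.linalg.norm(a)               # scale so the dot product a.a equals 1
    b = b / np.linalg.norm(b)
    return np.correlate(a, b, mode='full')  # N + M - 1 dot products, one per shift

a = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
b = 2 * a                                   # scaled copy of the same shape
print(norm_xcorr(a, b).max())               # ~1.0 -> similar up to amplitude scaling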
I think High Performance Mark's suggestion is the standard way of doing the job.
A computationally lightweight alternative measure might be a dot product:
Split both arrays into the same predefined index intervals.
Consider the array elements in each interval as vector coordinates in high-dimensional space.
Compute the dot product of both vectors.
For non-negative data the dot product will not be negative. If the two vectors are perpendicular in their vector space, the dot product will be 0 (in fact that's how 'perpendicular' is usually defined in higher dimensions), and it will attain its maximum for identical vectors.
If you accept the geometric notion of perpendicularity as a (dis)similarity measure, here you go.
Caveat: this is an ad hoc heuristic chosen for computational efficiency. I cannot tell you about the mathematical/statistical properties of the process and its separation properties; if you need rigorous analysis, you'll probably fare better with correlation theory anyway, and should perhaps forward your question to math.stackexchange.com.
My Attempt:
Total_sum = 0
1. For each index i in the range (m, n)
2.     k = Array1[i] * Array2[i]; t1 = magnitude(Array1[i]); t2 = magnitude(Array2[i])
3.     k = k / (t1 * t2)
4.     Total_sum = Total_sum + k
Coefficient = Total_sum / (n - m)
If all values are equal, then each k is 1 and Total_sum comes to (n - m) * 1. Hence, when this is divided by (n - m), we get the value 1. If the graphs are exact opposites we get -1, and for other variations a value between -1 and 1 is returned.
This is not so efficient when the y range or the x range is huge. But, I just wanted to give you an idea.
Another option would be to perform an extensive xnor.
sum = 1
1. For each index i in the range (m, n)
2.     k = Array1[i] xnor Array2[i]
3.     k = k / (pow(2, number_of_bits) - 1) // This will scale k down to a value between 0 and 1
4.     sum = (sum + k) / 2
Coefficient = sum
Is this helpful?
You can define a distance metric for two vectors A and B of length N containing numbers in the interval [-1, 1], e.g. as
sum = 0
for i in 0 to N - 1:
    d = (A[i] - B[i])^2 // this is in range 0 .. 4
    sum = sum + d
sum = (sum / 4) / N // now in range 0 .. 1
This now returns distance 1 for vectors that are completely opposite (one is all 1, another all -1), and 0 for identical vectors.
You can translate this into your coefficient by
coeff = 1 - sum
However, this is a crude approach because it does not take into account the fact that there could be horizontal distortion or shift between the signals you want to compare, so let's look at some approaches for coping with that.
You can sort both your arrays (e.g. in ascending order) and then calculate the distance / coefficient. This returns more similarity than the original metric, and is agnostic towards permutations / shifts of the signal.
You can also calculate the differentials and calculate distance / coefficient for those, and then you can do that sorted also. Using differentials has the benefit that it eliminates vertical shifts. Sorted differentials eliminate horizontal shift but still recognize different shapes better than sorted original data points.
You can then, e.g., average the different coefficients. Here is more complete code. The routine below calculates the coefficient for arrays A and B of a given size, taking d differentials (recursively) first. If sorted is true, the final (differentiated) arrays are sorted.
procedure calc(A, B, size, d, sorted):
    if (d > 0):
        A' = new array[size - 1]
        B' = new array[size - 1]
        for i in 0 to size - 2:
            A'[i] = (A[i + 1] - A[i]) / 2 // keep in range -1..1 by dividing by 2
            B'[i] = (B[i + 1] - B[i]) / 2
        return calc(A', B', size - 1, d - 1, sorted)
    else:
        if (sorted):
            A = sort(A)
            B = sort(B)
        sum = 0
        for i in 0 to size - 1:
            sum = sum + (A[i] - B[i]) * (A[i] - B[i])
        sum = (sum / 4) / size
        return 1 - sum // return the coefficient

procedure similarity(A, B, size):
    a = 0
    a = a + calc(A, B, size, 0, false)
    a = a + calc(A, B, size, 0, true)
    a = a + calc(A, B, size, 1, false)
    a = a + calc(A, B, size, 1, true)
    return a / 4 // take average
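A compact Python/NumPy version of that pseudocode (my translation; np.diff does the differentials):

import numpy as np

def calc(A, B, d, sorted_):
    A, B = np.asarray(A, float), np.asarray(B, float)
    for _ in range(d):                       # take d differentials, halved to stay in -1..1
        A, B = np.diff(A) / 2, np.diff(B) / 2
    if sorted_:
        A, B = np.sort(A), np.sort(B)
    return 1 - np.sum((A - B) ** 2) / 4 / A.size

def similarity(A, B):
    return np.mean([calc(A, B, d, s) for d in (0, 1) for s in (False, True)])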
For something completely different, you could also run a Fourier transform using FFT and then take a distance metric on the returned spectra.

Split a number into three buckets with constraints

Is there a good algorithm to split a randomly generated number into three buckets, each with constraints as to how much of the total it may contain?
For example, say my randomly generated number is 1,000 and I need to split it into buckets a, b, and c.
These ranges are only an example. See my edit for possible ranges.
Bucket a may only be between 10% - 70% of the number (100 - 700)
Bucket b may only be between 10% - 50% of the number (100 - 500)
Bucket c may only be between 5% - 25% of the number (50 - 250)
a + b + c must equal the randomly generated number
You want the amounts assigned to be completely random, so there is just as much chance of bucket a hitting its max as bucket c, and just as much chance of all three buckets landing around their percentage means.
EDIT: The following will most likely always be true: the low ends of a, b, and c sum to less than 100%, and the high ends sum to more than 100%. These percentages only indicate acceptable values of a, b, and c. In a case where a is 10% while b and c are at their max (50% and 25% respectively), the numbers would have to be reassigned, since the total would not equal 100%. This is the exact case I'm trying to avoid by finding a way to assign these numbers in one pass.
I'd like to find a way to pick these numbers randomly within their ranges in one pass.
The problem is equivalent to selecting a random point in an N-dimensional object (in your example N=3), the object being defined by the equations (in your example):
0.1 <= x <= 0.7
0.1 <= y <= 0.5
0.05 <= z <= 0.25
x + y + z = 1 (*)
Clearly because of the last equation (*) one of the coordinates is redundant, i.e. picking values for x and y dictates z.
Eliminating (*) and one of the other equations leaves us with an (N-1)-dimensional box, e.g.
0.1 <= x <= 0.7
0.1 <= y <= 0.5
that is cut by the inequality
0.05 <= (1 - x - y) <= 0.25 (**)
that derives from (*) and the equation for z. This is basically a diagonal stripe through the box.
In order for the results to be uniform, I would just repeatedly sample the (N-1)-dimensional box, and accept the first sampled point that fulfills (**). Single-pass solutions might end up having biased distributions.
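A minimal Python sketch of this rejection approach (the bucket ranges are hard-coded from the example; Python is my assumption):

import random

def split(total):
    # sample the (x, y) box and accept only when z = 1 - x - y is feasible
    while True:
        x = random.uniform(0.10, 0.70)
        y = random.uniform(0.10, 0.50)
        z = 1.0 - x - y
        if 0.05 <= z <= 0.25:
            a = round(total * x)
            b = round(total * y)
            return a, b, total - a - b   # c absorbs the rounding error

print(split(1000))   # e.g. (473, 312, 215)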
Update: Yes, you're right, the result is not uniformly distributed.
Let's say your percent values are natural numbers (if this assumption is wrong, you don't have to read further :) In that case I don't have a solution).
Let's define an event e as a tuple of 3 values (the percentage of each bucket): e = (pa, pb, pc). Next, create all possible events e_n. What you have here is a tuple space consisting of a discrete number of events. All of the possible events should have the same probability of occurring.
Let's say we have a function f(n) => e_n. Then all we have to do is take a random number n and return e_n in a single pass.
Now, the problem remains to create such a function f :)
In pseudo code, a very slow method (just for illustration):
function f(n) {
    int c = 0
    for i in [10..70] {
        for j in [10..50] {
            for k in [5..25] {
                if (i + j + k == 100) {
                    if (n == c) {
                        return (i, j, k) // found event!
                    } else {
                        c = c + 1
                    }
                }
            }
        }
    }
}
What you have now is a single-pass solution, but the problem has only moved: the function f is very slow. But you can do better: I think you can calculate everything a bit faster if you set your ranges correctly and calculate offsets instead of iterating through your ranges.
Is this clear enough?
First of all, you probably have to adjust your ranges: a = 10% is not possible, since then the condition a + b + c = number cannot hold.
Concerning your question: (1) Pick a random number for bucket a inside its range, then (2) update the range for bucket b with the minimum and maximum percentage (you should only narrow the range), then (3) pick a random number for bucket b. In the end, (4) c is calculated so that your condition holds.
Example:
n = 1000
(1) a = 40%
(2) range b [35,50], because 40+35+25 = 100%
(3) b = 45%
(4) c = 100-40-45 = 15%
Or:
n = 1000
(1) a = 70%
(2) range b [10,25], because 70+25+5 = 100%
(3) b = 20%
(4) c = 100-70-20 = 10%
It remains to check whether all the events are uniformly distributed. If that turns out to be a problem, you might want to randomize the range update in step 2.
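A sketch of steps (1)-(4) in Python (percent granularity, as in the examples; per the note above, a's own range must first be tightened so b and c can fit):

import random

def split(n):
    # a's range must be adjusted first (see above): a >= 100 - 50 - 25 = 25
    a = random.randint(max(10, 100 - 50 - 25), min(70, 100 - 10 - 5))
    # (2) narrow b's range so that c = 100 - a - b can stay within [5, 25]
    b = random.randint(max(10, 100 - a - 25), min(50, 100 - a - 5))
    c = 100 - a - b                        # (4) forced by a + b + c = 100
    pa, pb = n * a // 100, n * b // 100
    return pa, pb, n - pa - pb             # integer remainder goes to bucket c

print(split(1000))   # e.g. (400, 450, 150)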
