Even distribution of random points in 2D - random

I'm trying to do a simple 'crowd' model and need to distribute random points within a 2D area. This semi-pseudo code is my best attempt, but I can see big issues even before I run it: for dense crowds, the chance of a new point being too close could get very high very quickly, making it very inefficient and prone to fail unless the values are fine-tuned. There are probably issues with signed values too, but I'm leaving that out for simplicity.
int numPoints = 100;
int x[numPoints];
int y[numPoints];
int testX, testY;
tooCloseRadius = 20;
maxPointChecks = 100;
pointCheckCount = 0;
for (int newPoint = 0; newPoint < numPoints; newPoint++ ){
//Keep checking random points until one is found with no other points in close proximity, or maxPointChecks reached.
while (pointCheckCount < maxPointChecks){
tooClose = false;
// Make a new random point and check against all previous points
testX = random(1000);
testY = random(1000);
for ( testPoint = 0; testPoint < newPoint; testPoint++ ){
if (isTooClose(x[testPoint], y[testPoint], testX, testY, tooCloseRadius)) {
tooClose = true;
break; // (exit for loop)
}
}
if (tooClose == false){
// Yay found a point with some space!
x[newPoint] = testX;
y[newPoint] = testY;
break; // (exit do loop)
}
//Too close to one of the points, start over.
pointCheckCount++;
}
if (tooClose){
// maxPointChecks reached without finding a point that has some space.
// FAILURE DEPARTMENT
} else {
// SUCCESS
}
}
// Simple distance check to see if a point lies within a circle.
bool isTooClose(centerX, centerY, testX, testY, testRadius){
dx = testX - centerX;
dy = testY - centerY;
return (dx*dx + dy*dy) < testRadius*testRadius;
}
After googling the subject, I believe what I've done is called Rejection Sampling (?), and that Adaptive Rejection Sampling could be a better approach, but the math is far too complex.
Are there any elegant methods for achieving this that don't require a degree in statistics?

For the problem you are proposing, the best way to generate random samples is to use Poisson Disk Sampling.
https://www.jasondavies.com/poisson-disc
Now, if you want to sample random points in a rectangle the simple way, sample two values per point from 0 to the length of the largest dimension. If the value representing the smaller dimension is larger than that dimension, throw the pair away and try again.
Pseudo code:
while (need more points)
begin
range = max (rect_width, rect_height);
x = uniform_random(0,range);
y = uniform_random(0,range);
if (x > rect_width) or (y > rect_height)
continue;
else
insert point(x,y) into point_list;
end
The reason you sample up to the larger of the two lengths is to make the uniform selection criteria equivalent when the lengths are different.
For example, assume one side is of length K and the other side is of length 10K, and assume the numeric type used has a resolution of 1/1000 of K. Then for the shorter side there are only 1000 possible values, whereas for the longer side there are 10000 possible values to choose from. A probability of 1/1000 is not the same as 1/10000. Simply put, each coordinate value on the short side will have a 10x greater probability of occurring than each value on the longer side, which means that the sampling is not truly uniform.
Pseudo code for the scenario where you want to ensure that the point generated is not closer than some distance to any already generated point:
while (need more points)
begin
range = max (rect_width, rect_height)
x = uniform_random(0,range);
y = uniform_random(0,range);
if (x > rect_width) or (y > rect_height)
continue;
new_point = point(x,y);
too_close = false;
for (p : all points)
begin
if (distance(p, new_point) < minimum_distance)
begin
too_close = true;
break;
end
end
if (too_close)
continue;
insert point(x,y) into point_list;
end
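The second pseudo code block can be sketched in Python roughly as follows. The function name, the `max_attempts` guard and the `rng` parameter are my own additions (assumptions, not part of the answer above); the guard is there so that a crowd that is too dense for the area makes the loop give up instead of spinning forever.

```python
import math
import random


def sample_spaced_points(width, height, count, min_dist, max_attempts=10_000, rng=None):
    """Place up to `count` points in a width x height rectangle, each at
    least `min_dist` away from every earlier point (plain rejection sampling)."""
    rng = rng or random.Random()
    side = max(width, height)  # sample both axes up to the larger dimension
    points = []
    attempts = 0
    while len(points) < count and attempts < max_attempts:
        attempts += 1
        x, y = rng.uniform(0, side), rng.uniform(0, side)
        if x > width or y > height:
            continue  # outside the rectangle, try again
        # Accept only if no existing point is closer than min_dist.
        if all(math.dist((x, y), p) >= min_dist for p in points):
            points.append((x, y))
    return points  # may hold fewer than `count` points if attempts ran out
```

Something like `sample_spaced_points(1000, 1000, 100, 20)` would correspond to the numbers in the question. Every candidate is checked against all accepted points, so this is quadratic in the number of points; that is fine for a hundred points, less so for many thousands.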

While the Poisson Disk solution is usually fine and dandy, I would like to point out an alternative using quasi-random numbers. For quasi-random Sobol sequences there is a result which says that there is a minimum positive distance between points, which amounts to 0.5*sqrt(d)/N, where d is the dimension of the problem (2 in your case) and N is the number of points sampled in the hypercube. Paper from the man himself: http://www.sciencedirect.com/science/article/pii/S0378475406002382.
Why did I think it should be Python? Sorry, my bad. For C-like languages it is best to call GSL; the generator is named gsl_qrng_sobol. An example of using it at d=2 is linked here

Related

Algorithm for downsampling array of intervals

I have a sorted array of N intervals of different length. I am plotting these intervals with alternating colors blue/green.
I am trying to find a method or algorithm to "downsample" the array of intervals to produce a visually similar plot, but with fewer elements.
Ideally I could write some function where I can pass the target number of output intervals as an argument. The output length only has to come close to the target.
input = [
[0, 5, "blue"],
[5, 6, "green"],
[6, 10, "blue"],
// ...etc
]
output = downsample(input, 25)
// [[0, 10, "blue"], ... ]
Below is a picture of what I am trying to accomplish. In this example the input has about 250 intervals, and the output about 25 intervals. The input length can vary a lot.
Update 1:
Below is my original post which I initially deleted, because there were issues with displaying the equations and also I wasn't very confident if it really makes sense. But later, I figured that the optimisation problem that I described can be actually solved efficiently with DP (Dynamic programming).
So I did a sample C++ implementation. Here are some results:
Here is a live demo that you can play with in your browser (make sure your browser supports WebGL2, as Chrome and Firefox do). It takes a bit to load the page.
Here is the C++ implementation: link
Update 2:
Turns out the proposed solution has the following nice property - we can easily control the importance of the two parts F1 and F2 of the cost function. Simply change the cost function to F(α)=F1 + αF2, where α >= 1.0 is a free parameter. The DP algorithm remains the same.
Here are some results for different α values using the same number of intervals N:
Live demo (WebGL2 required)
As can be seen, higher α means it is more important to cover the original input intervals even if this means covering more of the background in-between.
Original post
Even though some good algorithms have already been proposed, I would like to propose a slightly unusual approach: interpreting the task as an optimisation problem. Although I don't know how to efficiently solve the optimisation problem (or even if it can be solved in reasonable time at all), it might be useful to someone purely as a concept.
First, without loss of generality, let's declare the blue color to be the background. We will be painting N green intervals on top of it (N is the number provided to the downsample() function in OP's description). The ith interval is defined by its starting coordinate 0 <= xi < xmax and width wi >= 0 (xmax is the maximum coordinate from the input).
Let's also define the array G(x) to be the number of green cells in the interval [0, x) in the input data. This array can easily be pre-calculated. We will use it to quickly calculate the number of green cells in an arbitrary interval [x, y), namely G(y) - G(x).
We can now introduce the first part of the cost function for our optimisation problem:
The smaller F1 is, the better our generated intervals cover the input intervals, so we will be searching for xi, wi that minimise it. Ideally we want F1=0, which would mean that the intervals do not cover any of the background (which of course is not possible because N is less than the number of input intervals).
However, this function is not enough to describe the problem, because obviously we can minimise it by taking empty intervals: F1(x, 0)=0. Instead, we want to cover as much as possible of the input intervals. Let's introduce the second part of the cost function, which corresponds to this requirement:
The smaller F2 is, the more input intervals are covered. Ideally we want F2=0 which would mean that we covered all of the input rectangles. However, minimising F2 competes with minimising F1.
Finally, we can state our optimisation problem: find xi, wi that minimize F=F1 + F2
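To make the two terms concrete, here is a small sketch of one reading that is consistent with the properties described above: F1 counts the background cells that the chosen intervals paint over, and F2 counts the input green cells they leave uncovered. The exact formulas are my assumption, as are the function names, the unit-cell representation and the requirement that the chosen intervals do not overlap.

```python
from itertools import accumulate


def green_prefix(cells):
    """G[x] = number of green cells in [0, x); `cells` holds 1 for green, 0 for background."""
    return [0] + list(accumulate(cells))


def cost(intervals, G, alpha=1.0):
    """Return (F1, F2, F1 + alpha * F2) for non-overlapping (x, w) intervals."""
    covered_green = sum(G[x + w] - G[x] for x, w in intervals)
    covered_total = sum(w for _, w in intervals)
    f1 = covered_total - covered_green  # background cells painted green
    f2 = G[-1] - covered_green          # input green cells left uncovered
    return f1, f2, f1 + alpha * f2
```

With `alpha=1.0` the last value is F = F1 + F2; larger values give the weighted form F(α) from Update 2. Under this reading, empty intervals give F1 = 0 but the largest possible F2, which matches the trade-off described in the text.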
How to solve this problem? Not sure. Maybe use some metaheuristic approach for global optimisation such as Simulated annealing or Differential evolution. These are typically easy to implement, especially for this simple cost function.
The best case would be for some kind of DP algorithm to exist that solves it efficiently, but that seems unlikely.
I would advise you to use the Haar wavelet. It is a very simple algorithm that was often used to provide progressive loading of big images on websites.
Here you can see how it works with a 2D function. That is what you can use. Alas, the document is in Ukrainian, but the code is in C++, so it is readable :)
This document provides an example with a 3D object:
Pseudocode for compressing with the Haar wavelet can be found in Wavelets for Computer Graphics: A Primer, Part 1.
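For a feel of what the decomposition does, here is a minimal sketch of the unnormalised 1D Haar transform (repeated pairwise averaging and differencing). The function name is hypothetical, and it assumes the input length is a power of two; for the interval problem you would first have to rasterise the strip into cells, e.g. 1 for green and 0 for blue, which is my assumption about how to apply it.

```python
def haar_decompose(values):
    """Unnormalised 1D Haar decomposition of a list whose length is a power of two.

    The result starts with the overall average, followed by detail
    coefficients from the coarsest level to the finest."""
    coeffs = list(values)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        averages = [(coeffs[2 * i] + coeffs[2 * i + 1]) / 2 for i in range(half)]
        details = [(coeffs[2 * i] - coeffs[2 * i + 1]) / 2 for i in range(half)]
        coeffs[:n] = averages + details
        n = half
    return coeffs
```

Dropping the small detail coefficients and reconstructing gives a coarser version of the signal, which is the "progressive loading" idea mentioned above.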
You could do the following:
1. Write out the points that divide the whole strip into intervals as the array [a[0], a[1], a[2], ..., a[n-1]]. In your example, the array would be [0, 5, 6, 10, ... ].
2. Calculate the double-interval lengths a[2]-a[0], a[3]-a[1], a[4]-a[2], ..., a[n-1]-a[n-3] and find the least of them. Let it be a[k+2]-a[k]. If there are two or more equal lengths having the lowest value, choose one of them randomly. In your example, you should get the array [6, 5, ... ] and search for the minimum value in it.
3. Swap the intervals (a[k], a[k+1]) and (a[k+1], a[k+2]). Basically, you need to assign a[k+1]=a[k]+a[k+2]-a[k+1] to keep the lengths, and then remove the points a[k] and a[k+2] from the array, because two pairs of intervals of the same color are now merged into two larger intervals. Thus, the numbers of blue and green intervals each decrease by one after this step.
4. If you're satisfied with the current number of intervals, end the process; otherwise go to step 1.
Step 2 is performed in order to minimise the "color shift": at step 3, the left interval is moved a[k+2]-a[k+1] to the right and the right interval is moved a[k+1]-a[k] to the left. The sum of these distances, a[k+2]-a[k], can be considered a measure of the change you're introducing into the whole picture.
Main advantages of this approach:
It is simple.
It doesn't give a preference to either of the two colors. You don't need to assign one of the colors to be the background and the other to be the painting color. The picture can be considered both as "green-on-blue" and "blue-on-green". This reflects a quite common use case where two colors just describe two opposite states (like the bit 0/1, a "yes/no" answer) of some process extended in time or in space.
It always keeps the balance between colors, i.e. the sum of intervals of each color remains the same during the reduction process. Thus the total brightness of the picture doesn't change. This is important because the total brightness can be considered an "indicator of completeness" in some cases.
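A minimal sketch of the swap-and-merge procedure described above, working on the array of boundary points. Two things here are my assumptions rather than part of the description: ties are broken by taking the first minimum instead of a random one, and k is restricted so that the swapped pair has a neighbour on both sides, which keeps the two end points of the strip in place.

```python
def downsample_boundaries(a, target_intervals):
    """Reduce the boundary array `a` until at most `target_intervals` intervals remain.

    Each pass removes two intervals, so the result may overshoot the target by one."""
    a = list(a)
    while len(a) - 1 > target_intervals and len(a) >= 5:
        # Pick the pair of adjacent intervals with the smallest combined length.
        k = min(range(1, len(a) - 3), key=lambda i: a[i + 2] - a[i])
        # Swap the two intervals (new middle boundary), then drop a[k] and a[k+2]
        # so each swapped interval merges with its same-coloured neighbour.
        a[k:k + 3] = [a[k] + a[k + 2] - a[k + 1]]
    return a
```

Since the first boundary is never removed and colours alternate, the colour of the first interval is unchanged and the output colours can be reassigned by alternation.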
Here's another attempt at dynamic programming that's slightly different than Georgi Gerganov's, although the idea to try and formulate a dynamic program may have been inspired by his answer. Neither the implementation nor the concept is guaranteed to be sound but I did include a code sketch with a visual example :)
The search space in this case is not reliant on the total unit width but rather on the number of intervals. It's O(N * n^2) time and O(N * n) space, where N and n are the target and given number of (green) intervals, respectively, because we assume that any newly chosen green interval must be bound by two green intervals (rather than extend arbitrarily into the background).
The idea also utilises the prefix sum idea used to calculate runs with a majority element. We add 1 when we see the target element (in this case green) and subtract 1 for others (that algorithm is also amenable to multiple elements with parallel prefix sum tracking). (I'm not sure that restricting candidate intervals to sections with a majority of the target colour is always warranted but it may be a useful heuristic depending on the desired outcome. It's also adjustable -- we can easily adjust it to check for a different part than 1/2.)
Where Georgi Gerganov's program seeks to minimise, this dynamic program seeks to maximise two ratios. Let h(i, k) represent the best sequence of green intervals up to the ith given interval, utilising k intervals, where each is allowed to stretch back to the left edge of some previous green interval. We speculate that
h(i, k) = max(r + C*r1 + h(i-l, k-1))
where, in the current candidate interval, r is the ratio of green to the length of the stretch, and r1 is the ratio of green to the total given green. r1 is multiplied by an adjustable constant to give more weight to the volume of green covered. l is the length of the stretch.
JavaScript code (for debugging, it includes some extra variables and log lines):
function rnd(n, d=2){
let m = Math.pow(10,d)
return Math.round(m*n) / m;
}
function f(A, N, C){
let ps = [[0,0]];
let psBG = [0];
let totalG = 0;
A.unshift([0,0]);
for (let i=1; i<A.length; i++){
let [l,r,c] = A[i];
if (c == 'g'){
totalG += r - l;
let prevI = ps[ps.length-1][1];
let d = l - A[prevI][1];
let prevS = ps[ps.length-1][0];
ps.push(
[prevS - d, i, 'l'],
[prevS - d + r - l, i, 'r']
);
psBG[i] = psBG[i-1];
} else {
psBG[i] = psBG[i-1] + r - l;
}
}
//console.log(JSON.stringify(A));
//console.log('');
//console.log(JSON.stringify(ps));
//console.log('');
//console.log(JSON.stringify(psBG));
let m = new Array(N + 1);
m[0] = new Array((ps.length >> 1) + 1);
for (let i=0; i<m[0].length; i++)
m[0][i] = [0,0];
// for each in N
for (let i=1; i<=N; i++){
m[i] = new Array((ps.length >> 1) + 1);
for (let ii=0; ii<m[0].length; ii++)
m[i][ii] = [0,0];
// for each interval
for (let j=i; j<m[0].length; j++){
m[i][j] = m[i][j-1];
for (let k=j; k>i-1; k--){
// our anchors are the right
// side of each interval, k's are the left
let jj = 2*j;
let kk = 2*k - 1;
// positive means green
// is a majority
if (ps[jj][0] - ps[kk][0] > 0){
let bg = psBG[ps[jj][1]] - psBG[ps[kk][1]];
let s = A[ps[jj][1]][1] - A[ps[kk][1]][0] - bg;
let r = s / (bg + s);
let r1 = C * s / totalG;
let candidate = r + r1 + m[i-1][j-1][0];
if (candidate > m[i][j][0]){
m[i][j] = [
candidate,
ps[kk][1] + ',' + ps[jj][1],
bg, s, r, r1,k,m[i-1][j-1][0]
];
}
}
}
}
}
/*
for (row of m)
console.log(JSON.stringify(
row.map(l => l.map(x => typeof x != 'number' ? x : rnd(x)))));
*/
let result = new Array(N);
let j = m[0].length - 1;
for (let i=N; i>0; i--){
let [_,idxs,w,x,y,z,k] = m[i][j];
let [l,r] = idxs.split(',');
result[i-1] = [A[l][0], A[r][1], 'g'];
j = k - 1;
}
return result;
}
function show(A, last){
if (last[1] != A[A.length-1])
A.push(last);
let s = '';
let j;
for (let i=A.length-1; i>=0; i--){
let [l, r, c] = A[i];
let cc = c == 'g' ? 'X' : '.';
for (let j=r-1; j>=l; j--)
s = cc + s;
if (i > 0)
for (let j=l-1; j>=A[i-1][1]; j--)
s = '.' + s
}
for (let j=A[0][0]-1; j>=0; j--)
s = '.' + s
console.log(s);
return s;
}
function g(A, N, C){
const ts = f(A, N, C);
//console.log(JSON.stringify(ts));
show(A, A[A.length-1]);
show(ts, A[A.length-1]);
}
var a = [
[0,5,'b'],
[5,9,'g'],
[9,10,'b'],
[10,15,'g'],
[15,40,'b'],
[40,41,'g'],
[41,43,'b'],
[43,44,'g'],
[44,45,'b'],
[45,46,'g'],
[46,55,'b'],
[55,65,'g'],
[65,100,'b']
];
// (input, N, C)
g(a, 2, 2);
console.log('');
g(a, 3, 2);
console.log('');
g(a, 4, 2);
console.log('');
g(a, 4, 5);
I would suggest using K-means. It is an algorithm used to group data (a more detailed explanation is here: https://en.wikipedia.org/wiki/K-means_clustering and here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).
This is a brief sketch of how the function could look; I hope it is helpful.
from sklearn.cluster import KMeans
import numpy as np
def downsample(intervals, cluster=25):
    # group your intervals in a NumPy array of [left, right] rows
    X = np.array([[left, right] for left, right, *_ in intervals], dtype=float)
    # n_clusters will be the same as the desired output length
    # (it cannot exceed the number of input intervals)
    kmeans = KMeans(n_clusters=cluster, random_state=0).fit(X)
    # then iterate through the labels that were assigned to every entry of the input,
    # in our case the intervals
    kmeans_list = [[] for _ in range(cluster)]
    for i in range(X.shape[0]):
        kmeans_list[kmeans.labels_[i]].append(X[i])
    # after that you basically have a list of lists, and every inner list contains
    # all intervals that correspond to a specific label
    ret = []  # return list
    for label_list in kmeans_list:
        if not label_list:
            continue
        left = min(entry[0] for entry in label_list)
        right = max(entry[1] for entry in label_list)
        ret.append([left, right])
    return sorted(ret)

Changing dead cells to alive with rand

void inaditrArea(Area* a, unsigned int n)
{
unsigned int living_cells, max_living_cells, y, x;
living_cells = 0;
max_living_cells = n;
srand(time(NULL));
while (living_cells <= max_living_cells)
{
x = (rand() % (a->xsize));
y = (rand() % (a->ysize));
a->cells[y][x] = ALIVE;
living_cells++;
}
}
I'm trying to make some of my dead cells alive with rand(), but when I have to make, for example, 50 cells alive, this code always gives a little bit less. Why?
Your problem
Your code selects a random cell at each iteration. However, you don't check whether this cell is already alive. So from time to time you make a cell alive that was already alive, and the counter is still incremented.
Solution
You should only create a new cell if there is no living cell at the chosen position, like this:
if (a->cells[y][x] != ALIVE)
{
a->cells[y][x] = ALIVE;
living_cells++;
}
As HolyBlackCow points out, you can write to a cell more than once because rand may return the same random value more than once. Change your loop to the following (note the < instead of <=, so that the loop stops at exactly max_living_cells):
while(living_cells < max_living_cells){
x = (rand() %(a->xsize));
y = (rand() %(a->ysize));
if (a->cells[y][x] != ALIVE) {
a->cells[y][x] = ALIVE;
living_cells++;
}
}
Simply doing this would solve the issue to some extent, but it is not an ideal solution performance-wise (because it will loop until it gets the desired number of cells alive):
if(a->cells[y][x] != ALIVE){
living_cells++;
a->cells[y][x] = ALIVE;
}
This would make sure that you will increment the counter only when a new position is made alive.
What is the better solution? For a 5x5 matrix you can take a single array holding the indices 0..24 and run a Fisher-Yates shuffle on it. That gives you a random order, and you then take the first indices from the array and make those cells alive. (Yes, it requires more space than this one; for higher values of N you can look for a solution that considers only the locations of dead cells.) Each index maps back to a cell: for example, 12 becomes row 12 / 5 = 2 and column 12 % 5 = 2, counting from zero.
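The shuffle idea can be sketched like this (in Python rather than C, to keep it short). The grid is a plain list of lists with 0 for dead and 1 for alive, and the function name and `rng` parameter are hypothetical; it also assumes every cell starts out dead. Only the first n positions of the index array are shuffled, which is all that is needed.

```python
import random


def revive_cells(grid, n, rng=None):
    """Set exactly n distinct cells of a 2D list `grid` to 1 (alive)."""
    rng = rng or random.Random()
    rows, cols = len(grid), len(grid[0])
    indices = list(range(rows * cols))
    if n > len(indices):
        raise ValueError("more living cells requested than cells available")
    # Partial Fisher-Yates shuffle: fix one random, not yet used index per step.
    for i in range(n):
        j = rng.randrange(i, len(indices))
        indices[i], indices[j] = indices[j], indices[i]
        row, col = divmod(indices[i], cols)  # flat index back to (row, column)
        grid[row][col] = 1
    return grid
```

Unlike the retry loop, this always finishes after exactly n steps, no matter how full the grid gets.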

Generating Random Numbers for RPG games

I'm wondering if there is an algorithm to generate random numbers that will most likely be low in a range from min to max. For instance, if you generate a random number between 1 and 100 it should most of the time be below 30 if you call the function with f(min: 1, max: 100, avg: 30), but if you call it with f(min: 1, max: 200, avg: 10) the average should be 10. A lot of games do this, but I simply can't find a way to do this with a formula. Most of the examples I have seen use a "drop table" or something like that.
I have come up with a fairly simple way to weight the outcome of a roll, but it is not very efficient and you don't have a lot of control over it
var pseudoRand = function(min, max, n) {
if (n > 0) {
return pseudoRand(min, Math.random() * (max - min) + min, n - 1)
}
return max;
}
rands = []
for (var i = 0; i < 20000; i++) {
rands.push(pseudoRand(0, 100, 1))
}
avg = rands.reduce(function(x, y) { return x + y } ) / rands.length
console.log(avg); // ~50
The function simply picks a random number between min and max N times, updating the max with the last roll on every iteration. So if you call it with N = 2 and max = 100, it must roll 100 two times in a row in order to return 100.
I have looked at some distributions on wikipedia, but I don't quite understand them enough to know how I can control the min and max outputs etc.
Any help is very much welcomed
A simple way to generate a random number with a given distribution is to pick a random number from a list where the numbers that should occur more often are repeated according to the desired distribution.
For example if you create a list [1,1,1,2,2,2,3,3,3,4] and pick a random index from 0 to 9 to select an element from that list you will get a number <4 with 90% probability.
Alternatively, using the distribution from the example above, generate an array [2,5,8,9] and pick a random integer from 0 to 9, if it's ≤2 (this will occur with 30% probability) then return 1, if it's >2 and ≤5 (this will also occur with 30% probability) return 2, etc.
Explained here: https://softwareengineering.stackexchange.com/a/150618
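The threshold variant can be written compactly with a binary search over the cumulative bounds. This is only a sketch using the [2,5,8,9] example from above; the names `THRESHOLDS`, `VALUES`, `pick` and `weighted_roll` are made up for illustration.

```python
import bisect
import random

THRESHOLDS = [2, 5, 8, 9]  # cumulative upper bounds from the example above
VALUES = [1, 2, 3, 4]      # value returned for each band


def pick(roll):
    """Map a roll in 0..9 to its value via the cumulative thresholds."""
    return VALUES[bisect.bisect_left(THRESHOLDS, roll)]


def weighted_roll(rng=random):
    """Draw one value: 1, 2 and 3 with 30% each, 4 with 10%."""
    return pick(rng.randint(0, 9))
```

Mapping every possible roll shows the distribution directly: `[pick(r) for r in range(10)]` reproduces the list [1,1,1,2,2,2,3,3,3,4] from the first variant without storing it.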
A probability density function (PDF) is a function that, for a value X, returns how likely that value is: the probability of X in the discrete case, or the probability density at X in the continuous case. A cumulative distribution function (CDF) is the probability of getting a number less than or equal to X. A CDF is the integral of a PDF. A CDF is almost always a one-to-one function, so it almost always has an inverse.
To generate a PDF, plot the value on the x-axis and the probability on the y-axis. The sum (discrete) or integral (continuous) of all the probabilities should add up to 1. Find some function that models that shape correctly. To do this, you may have to look up some PDFs.
Basic Algorithm
https://en.wikipedia.org/wiki/Inverse_transform_sampling
This algorithm is based off of Inverse Transform Sampling. The idea behind ITS is that you are randomly picking a value on the y-axis of the CDF and finding the x-value it corresponds to. This makes sense because the more likely a value is to be randomly selected, the more "space" it will take up on the y-axis of the CDF.
1. Come up with some probability distribution formula. For instance, if you want it so that as the numbers get higher the odds of them being chosen increases, you could use something like f(x)=x or f(x)=x^2. If you want something that bulges in the middle, you could use the Gaussian Distribution or 1/(1+x^2). If you want a bounded formula, you can use the Beta Distribution or the Kumaraswamy Distribution.
2. Integrate the PDF to get the Cumulative Distribution Function.
3. Find the inverse of the CDF.
4. Generate a random number and plug it into the inverse of the CDF.
5. Multiply that result by (max-min) and then add min.
6. Round the result to the nearest integer.
Steps 1 to 3 are things you have to hard code into the game. The only way around it, for any PDF, is to solve for the shape parameters that correspond to the desired mean while respecting the constraints you want on those parameters. If you want to use the Kumaraswamy Distribution, you will set it so that the shape parameters a and b are always greater than one.
I would suggest using the Kumaraswamy Distribution because it is bounded and it has a very nice closed form and closed form inverse. It only has two parameters, a and b, and it is extremely flexible, as it can model many different scenarios, including polynomial behavior, bell curve behavior, and a basin-like behavior that has a peak at both edges. Also, modeling isn't too hard with this function. The higher the shape parameter b is, the more tilted it will be to the left, and the higher the shape parameter a is, the more tilted it will be to the right. If a and b are both less than one, the distribution will look like a trough or basin. If a or b is equal to one, the distribution will be a polynomial that does not change concavity from 0 to 1. If both a and b equal one, the distribution is a straight line. If a and b are greater than one, then the function will look like a bell curve. The best thing you can do to learn this is to actually graph these functions or just run the Inverse Transform Sampling algorithm.
https://en.wikipedia.org/wiki/Kumaraswamy_distribution
For instance, if I want to have a probability distribution shaped like this with a=2 and b=5 going from 0 to 100:
https://www.wolframalpha.com/input/?i=2*5*x%5E(2-1)*(1-x%5E2)%5E(5-1)+from+x%3D0+to+x%3D1
Its CDF would be:
CDF(x)=1-(1-x^2)^5
Its inverse would be:
CDF^-1(x)=(1-(1-x)^(1/5))^(1/2)
The General Inverse of the Kumaraswamy Distribution is:
CDF^-1(x)=(1-(1-x)^(1/b))^(1/a)
I would then generate a number from 0 to 1, put it into the CDF^-1(x), and multiply the result by 100.
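Put together, the Kumaraswamy version of the algorithm might look like the sketch below. The two formulas are the CDF and general inverse given above; the function names and the `rng` parameter are my own, and the final rounding is optional if you want floats.

```python
import random


def kumaraswamy_cdf(x, a, b):
    """CDF(x) = 1 - (1 - x^a)^b for x in [0, 1]."""
    return 1 - (1 - x ** a) ** b


def kumaraswamy_inverse_cdf(u, a, b):
    """CDF^-1(u) = (1 - (1 - u)^(1/b))^(1/a) for u in [0, 1]."""
    return (1 - (1 - u) ** (1 / b)) ** (1 / a)


def kumaraswamy_roll(low, high, a, b, rng=random):
    """Inverse transform sampling: uniform u -> [0, 1] -> scaled to [low, high]."""
    u = rng.random()
    return round(low + (high - low) * kumaraswamy_inverse_cdf(u, a, b))
```

For the a=2, b=5 example going from 0 to 100, that would be `kumaraswamy_roll(0, 100, 2, 5)`.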
Pros
Very accurate
Continuous, not discrete
Uses one formula and very little space
Gives you a lot of control over exactly how the randomness is spread out
Many of these formulas have CDFs with inverses of some sort
There are ways to bound the functions on both ends. For instance, the Kumaraswamy Distribution is bounded from 0 to 1, so you just input a float between zero and one, then multiply the result by (max-min) and add min. The Beta Distribution is bounded differently based on what values you pass into it. For something like PDF(x)=x, the CDF(x)=(x^2)/2, so you can generate a random value from CDF(0) to CDF(max-min).
Cons
You need to come up with the exact distributions and their shapes you plan on using
Every single general formula you plan on using needs to be hard coded into the game. In other words, you can program the general Kumaraswamy Distribution into the game and have a function that generates random numbers based on the distribution and its parameters, a and b, but not a function that generates a distribution for you based on the average. If you wanted to use Distribution x, you would have to find out what values of a and b best fit the data you want to see and hard code those values into the game.
I would use a simple mathematical function for that. From what you describe, you need a power curve like y = x^2. At the midpoint (x = 0.5, since rand gives you a number from 0 to 1) you would get 0.25. If you want a lower number, you can use a higher exponent like y = x^3, which would result in y = 0.125 at x = 0.5. Strictly speaking, this pins down the median of the output rather than its mean; the mean of x^k for uniform x is 1/(k+1).
Example:
http://www.meta-calculator.com/online/?panel-102-graph&data-bounds-xMin=-2&data-bounds-xMax=2&data-bounds-yMin=-2&data-bounds-yMax=2&data-equations-0=%22y%3Dx%5E2%22&data-rand=undefined&data-hideGrid=false
PS: I adjusted the function to calculate the needed exponent to get the average result.
Code example:
function expRand (min, max, exponent) {
return Math.round( Math.pow( Math.random(), exponent) * (max - min) + min);
}
function averageRand (min, max, average) {
var exponent = Math.log(((average - min) / (max - min))) / Math.log(0.5);
return expRand(min, max, exponent);
}
alert(averageRand(1, 100, 10));
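The exponent in averageRand maps the midpoint x = 0.5 onto the target, which fixes the median of the output. Since the mean of u^k for a uniform u in [0, 1] is 1/(k+1), a variant that targets the mean instead can be sketched as follows. The function names are hypothetical, the result is left unrounded, and it assumes min < average < max.

```python
import random


def mean_exponent(low, high, average):
    """Exponent k such that low + (high - low) * u**k has mean `average`.

    Follows from E[u**k] = 1 / (k + 1) for uniform u in [0, 1]."""
    return (high - low) / (average - low) - 1


def mean_rand(low, high, average, rng=random):
    """Draw one value in [low, high] whose long-run mean is `average`."""
    k = mean_exponent(low, high, average)
    return low + (high - low) * rng.random() ** k
```

For example, `mean_exponent(0, 100, 25)` gives 3, i.e. y = x^3 scaled to 0..100 averages 25.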
You may combine 2 random processes. For example:
first rand R1 = f(min: 1, max: 20, avg: 10);
second rand R2 = f(min:1, max : 10, avg : 1);
and then multiply R1*R2 to have a result between [1-200] and average around 10 (the average will be shifted a bit)
Another option is to find the inverse of the random function you want to use. This option has to be initialized when your program starts but doesn't need to be recomputed. The math used here can be found in a lot of Math libraries. I will explain point by point by taking the example of an unknown random function where only four points are known:
1. First, fit the four point curve with a polynomial function of order 3 or higher.
2. You should then have a parametrized function of type: ax+bx^2+cx^3+d.
3. Find the indefinite integral of the function (the form of the integral is of type a/2x^2+b/3x^3+c/4x^4+dx, which we will call quarticEq).
4. Compute the integral of the polynomial from your min to your max.
5. Take a uniform random number between 0 and 1, then multiply it by the value of the integral computed in step 4 (we name the result "R").
6. Now solve the equation R = quarticEq for x.
Hopefully the last part is well known, and you should be able to find a library that can do this computation (see wiki). If the inverse of the integrated function does not have a closed form solution (like in any general polynomial with degree five or higher), you can use a root finding method such as Newton's Method.
This kind of computation may be used to create any kind of random distribution.
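As a rough illustration of the last steps, here is a sketch that inverts the integrated polynomial numerically. Note that it uses plain bisection instead of a closed form quartic solution or Newton's Method, simply because bisection is short and cannot diverge. The name `make_sampler` and the coefficient order are my own choices, the curve fitting step is left out, and the polynomial is assumed to be non-negative between min and max.

```python
def make_sampler(coeffs, low, high, tol=1e-9):
    """coeffs = (d, a, b, c) for the density shape d + a*x + b*x^2 + c*x^3,
    assumed non-negative on [low, high]. Returns a function mapping u in [0, 1] to x."""
    d, a, b, c = coeffs

    def integral(x):
        # Antiderivative of the polynomial (the "quarticEq" of the text).
        return d * x + a * x**2 / 2 + b * x**3 / 3 + c * x**4 / 4

    total = integral(high) - integral(low)  # integral from min to max

    def sample(u):
        target = u * total  # the value called R in the text
        lo, hi = low, high
        while hi - lo > tol:  # bisection on the increasing integral
            mid = (lo + hi) / 2
            if integral(mid) - integral(low) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    return sample
```

The sampler is built once at start-up; each roll is then `sample(random.random())`.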
Edit :
You may find the Inverse Transform Sampling described above in wikipedia and I found this implementation (I haven't tried it.)
You can keep a running average of what you have returned from the function so far and, based on that, draw random numbers in a while loop until one keeps the running average on target; then adjust the running average and return the number.
Using a drop table permits a very fast roll, which matters in a real-time game. In fact it is only one random generation of a number from a range, followed by an if statement with multiple branches chosen according to a table of probabilities (a Gauss distribution for that range). Something like this:
import random
num = random.randint(1, 100)
if num < 10:
    pass  # case 1
elif num < 20:
    pass  # case 2
...
It is not very clean but when you have a finite number of choices it can be very fast.
There are lots of ways to do so, all of which basically boil down to generating from a right-skewed (a.k.a. positive-skewed) distribution. You didn't make it clear whether you want integer or floating point outcomes, but there are both discrete and continuous distributions that fit the bill.
One of the simplest choices would be a discrete or continuous right-triangular distribution, but while that will give you the tapering off you desire for larger values, it won't give you independent control of the mean.
Another choice would be a truncated exponential (for continuous) or geometric (for discrete) distribution. You'd need to truncate because the raw exponential or geometric distribution has a range from zero to infinity, so you'd have to lop off the upper tail. That would in turn require you to do some calculus to find a rate λ which yields the desired mean after truncation.
A third choice would be to use a mixture of distributions, for instance choose a number uniformly in a lower range with some probability p, and in an upper range with probability (1-p). The overall mean is then p times the mean of the lower range + (1-p) times the mean of the upper range, and you can dial in the desired overall mean by adjusting the ranges and the value of p. This approach will also work if you use non-uniform distribution choices for the sub-ranges. It all boils down to how much work you're willing to put into deriving the appropriate parameter choices.
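For the mixture idea, the probability p follows directly from the mean formula: solving p*m_low + (1-p)*m_high = target for p gives p = (m_high - target) / (m_high - m_low). A small sketch with uniform integer sub-ranges; the function names and the example ranges are assumptions, and the target must lie between the two sub-range means.

```python
import random


def mixture_probability(low_range, high_range, target_mean):
    """Probability p of using low_range so that the mixture mean equals target_mean."""
    mean_low = sum(low_range) / 2    # mean of a uniform range is its midpoint
    mean_high = sum(high_range) / 2
    return (mean_high - target_mean) / (mean_high - mean_low)


def mixture_roll(low_range, high_range, target_mean, rng=random):
    """Pick a sub-range with probability p / (1 - p), then roll uniformly inside it."""
    p = mixture_probability(low_range, high_range, target_mean)
    lo, hi = low_range if rng.random() < p else high_range
    return rng.randint(lo, hi)
```

With the ranges (1, 20) and (21, 100) and a target mean of 30, the sub-range means are 10.5 and 60.5, so p = 30.5 / 50 = 0.61.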
This method is not the most precise, but it could be considered "good enough" depending on your needs.
The algorithm would be to pick a number between a min and a sliding max. There would be a guaranteed max g_max and a potential max p_max. Your true max would slide depending on the result of another random call. This will give you the skewed distribution you are looking for. Below is the solution in Python.
import random
def get_roll(min_val, g_max, p_max):
    # slide the true max somewhere between the guaranteed and the potential max
    max_val = g_max + (random.random() * (p_max - g_max))
    return random.randint(min_val, int(max_val))
get_roll(1, 10, 20)
Below is a histogram of the function run 100,000 times with (1, 10, 20).
private int roll(int minRoll, int avgRoll, int maxRoll) {
// Generating random number #1
int firstRoll = ThreadLocalRandom.current().nextInt(minRoll, maxRoll + 1);
// Iterating 3 times will result in the roll being relatively close to
// the average roll.
if (firstRoll > avgRoll) {
// If the first roll is higher than the (set) average roll:
for (int i = 0; i < 3; i++) {
int verificationRoll = ThreadLocalRandom.current().nextInt(minRoll, maxRoll + 1);
if (firstRoll > verificationRoll && verificationRoll >= avgRoll) {
// The verification roll lies between the average roll and the first
// roll, i.e. it is closer to the average: take it.
firstRoll = verificationRoll;
}
}
} else if (firstRoll < avgRoll) {
// If the first roll is lower than the (set) average roll:
for (int i = 0; i < 3; i++) {
int verificationRoll = ThreadLocalRandom.current().nextInt(minRoll, maxRoll + 1);
if (firstRoll < verificationRoll && verificationRoll <= avgRoll) {
// The verification roll lies between the first roll and the average
// roll, i.e. it is closer to the average: take it.
firstRoll = verificationRoll;
}
}
}
return firstRoll;
}
Explanation:
roll
check if the roll is above, below or exactly 30
if above, reroll 3 times & set the roll according to the new roll, if lower but >= 30
if below, reroll 3 times & set the roll according to the new roll, if higher but <= 30
if exactly 30, don't set the roll anew
return the roll
Pros:
simple
effective
performs well
Cons:
You'll naturally have more results in the range of 30-40 than in the range of 20-30, simply due to the 30-70 split (30 values below the average, 70 above).
Testing:
You can test this by using the following method in conjunction with the roll() method. The data is saved in a HashMap (mapping each number to its number of occurrences).
public void rollTheD100() {
int maxNr = 100;
int minNr = 1;
int avgNr = 30;
Map<Integer, Integer> numberOccurenceMap = new HashMap<>();
// "Initialization" of the map (please don't hit me for calling it initialization)
for (int i = 1; i <= 100; i++) {
numberOccurenceMap.put(i, 0);
}
// Rolling (100k times)
for (int i = 0; i < 100000; i++) {
int dummy = roll(minNr, avgNr, maxNr);
numberOccurenceMap.put(dummy, numberOccurenceMap.get(dummy) + 1);
}
int numberPack = 0;
for (int i = 1; i <= 100; i++) {
numberPack = numberPack + numberOccurenceMap.get(i);
if (i % 10 == 0) {
System.out.println("<" + i + ": " + numberPack);
numberPack = 0;
}
}
}
The results (100.000 rolls):
These were as expected. Note that you can always fine-tune the results simply by modifying the iteration count in the roll() method: the closer to 30 the average should be, the more iterations should be included (though this could hurt performance to a certain degree). Also note that 30 was, as expected, the number with the highest number of occurrences, by far.
<10: 4994
<20: 9425
<30: 18184
<40: 29640
<50: 18283
<60: 10426
<70: 5396
<80: 2532
<90: 897
<100: 223
Try this,
generate a random number for the range of numbers below the average and generate a second random number for the range of numbers above the average.
Then randomly select one of those, each range will be selected 50% of the time.
var psuedoRand = function(min, max, avg) {
    // Math.floor replaces the (int) cast, which is not valid JavaScript.
    var upperRand = Math.floor(Math.random() * (max - avg) + avg);
    var lowerRand = Math.floor(Math.random() * (avg - min) + min);
    if (Math.random() < 0.5)
        return lowerRand;
    else
        return upperRand;
};
Having seen many good explanations and some good ideas, I still think this could help you:
You can take any distribution function f centered around 0 and map its interval of interest onto your desired interval [1,100]: f -> f'.
Then feed the C++ discrete_distribution with the values of f'.
I've got an example with the normal distribution below, but I can't get my result into this function :-S
#include <iostream>
#include <random>
#include <chrono>
#include <cmath>
#include <string>
using namespace std;
double p1(double x, double mean, double sigma); // p(x|x_avg,sigma)
double p2(int x, int x_min, int x_max, double x_avg, double z_min, double z_max); // transform ("stretch") it to the interval
int plot_ps(int x_avg, int x_min, int x_max, double sigma);
int main()
{
int x_min = 1;
int x_max = 20;
int x_avg = 6;
double sigma = 5;
/*
int p[]={2,1,3,1,2,5,1,1,1,1};
default_random_engine generator (chrono::system_clock::now().time_since_epoch().count());
discrete_distribution<int> distribution(begin(p), end(p));
for (int i=0; i< 10; i++)
cout << i << "\t" << distribution(generator) << endl;
*/
plot_ps(x_avg, x_min, x_max, sigma);
return 0; //*/
}
// Normal distribution function
double p1(double x, double mean, double sigma)
{
return 1/(sigma*sqrt(2*M_PI))
* exp(-(x-mean)*(x-mean) / (2*sigma*sigma));
}
// Transforms intervals to your wishes ;)
// z_min and z_max are the desired values f'(x_min) and f'(x_max)
double p2(int x, int x_min, int x_max, double x_avg, double z_min, double z_max)
{
double y;
double sigma = 1.0;
double y_min = -sigma*sqrt(-2*log(z_min));
double y_max = sigma*sqrt(-2*log(z_max));
if(x < x_avg)
y = -(x-x_avg)/(x_avg-x_min)*y_min;
else
y = -(x-x_avg)/(x_avg-x_max)*y_max;
return p1(y, 0.0, sigma);
}
//plots both distribution functions
int plot_ps(int x_avg, int x_min, int x_max, double sigma)
{
double z = (1.0+x_max-x_min);
// plot p1
for (int i=1; i<=20; i++)
{
cout << i << "\t" <<
string(int(p1(i, x_avg, sigma)*(sigma*sqrt(2*M_PI)*20.0)+0.5), '*')
<< endl;
}
cout << endl;
// plot p2
for (int i=1; i<=20; i++)
{
cout << i << "\t" <<
string(int(p2(i, x_min, x_max, x_avg, 1.0/z, 1.0/z)*(20.0*sqrt(2*M_PI))+0.5), '*')
<< endl;
}
return 0;
}
With the following result if I let them plot:
1 ************
2 ***************
3 *****************
4 ******************
5 ********************
6 ********************
7 ********************
8 ******************
9 *****************
10 ***************
11 ************
12 **********
13 ********
14 ******
15 ****
16 ***
17 **
18 *
19 *
20
1 *
2 ***
3 *******
4 ************
5 ******************
6 ********************
7 ********************
8 *******************
9 *****************
10 ****************
11 **************
12 ************
13 *********
14 ********
15 ******
16 ****
17 ***
18 **
19 **
20 *
So - if you could give this result to the discrete_distribution<int> distribution {}, you got everything you want...
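For what it's worth, the "feed the weights to a discrete sampler" step can be sketched in Python, where random.choices accepts a list of weights directly. In C++ the corresponding step would presumably be constructing the discrete_distribution from an iterator range over a container of weights. The sketch uses the plain normal curve (p1 above) as the weight function; the stretched p2 could be swapped in the same way. Names and the seed are my own.

```python
import math
import random

def normal_pdf(x, mean, sigma):
    """Density of the normal distribution, used here only as a weight."""
    return math.exp(-(x - mean) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def make_weights(x_min, x_max, x_avg, sigma):
    """One unnormalized weight per integer in [x_min, x_max]."""
    return [normal_pdf(x, x_avg, sigma) for x in range(x_min, x_max + 1)]

x_min, x_max, x_avg, sigma = 1, 20, 6, 5.0
values = list(range(x_min, x_max + 1))
weights = make_weights(x_min, x_max, x_avg, sigma)

rng = random.Random(0)
draws = rng.choices(values, weights=weights, k=10_000)  # weights need not sum to 1
print(sum(draws) / len(draws))
```

Note that cutting the curve off at x_min and x_max shifts the resulting mean away from x_avg, so the weights would still need tuning if the mean has to be exact.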
Well, from what I can see of your problem, I would want for the solution to meet these criteria:
a) Belong to a single distribution: If we need to "roll" (call Math.random) more than once per function call and then aggregate or discard some results, it stops being truly distributed according to the given function.
b) Not be computationally intensive: Some of the solutions use integrals (Gamma distribution, Gaussian distribution), and those are computationally intensive. In your description, you mention that you want to be able to "calculate it with a formula", which fits this description (basically, you want an O(1) function).
c) Be relatively "well distributed", e.g. not have peaks and valleys, but instead have most results cluster around the mean, and have nice predictable slopes downwards towards the ends, and yet have the probability of the min and the max to be not zero.
d) Not to require to store a large array in memory, as in drop tables.
I think this function meets the requirements:
var pseudoRand = function(min, max, avg )
{
var randomFraction = Math.random();
var head = (avg - min);
var tail = (max - avg);
var skewdness = tail / (head + tail);
if (randomFraction < skewdness)
return min + (randomFraction / skewdness) * head;
else
return avg + (1 - randomFraction) / (1 - skewdness) * tail;
}
This will return floats, but you can easily turn them into ints by calling
Math.round(pseudoRand(...))
It returned the correct average in all of my tests, and it is also nicely distributed towards the ends. Hope this helps. Good luck.
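A short check of why the average comes out right, as I read the function: write h for head, t for tail and s for skewdness. The first branch is taken with probability s and is uniform on [min, avg]; the second is taken with probability 1 - s and is uniform on [avg, max]. The symbols h, t, s are just shorthand introduced here.

```latex
\begin{aligned}
h &= \mathrm{avg} - \mathrm{min}, \qquad t = \mathrm{max} - \mathrm{avg}, \qquad s = \frac{t}{h + t},\\
\mathbb{E}[X] &= s\left(\mathrm{avg} - \frac{h}{2}\right) + (1 - s)\left(\mathrm{avg} + \frac{t}{2}\right)\\
&= \mathrm{avg} - \frac{t}{h + t}\cdot\frac{h}{2} + \frac{h}{h + t}\cdot\frac{t}{2} = \mathrm{avg}.
\end{aligned}
```

So the shorter side is picked more often by exactly the amount needed to balance the longer side.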

How to select a uniformly distributed subset of a partially dense dataset?

P is an n*d matrix holding n d-dimensional samples. Some areas of P are several times denser than others. I want to select a subset of P in which the distance between any pair of samples is more than d0, and I need it to be spread over the whole area. All samples have the same priority and there's no need to optimize anything (e.g. covered area or sum of pairwise distances).
Here is sample code that does so, but it's really slow. I need more efficient code since I have to call it several times.
%% generating sample data
n_4 = 1000; n_2 = n_4*2;n = n_4*4;
x1=[ randn(n_4, 1)*10+30; randn(n_4, 1)*3 + 60];
y1=[ randn(n_4, 1)*5 + 35; randn(n_4, 1)*20 + 80 ];
x2 = rand(n_2, 1)*(max(x1)-min(x1)) + min(x1);
y2 = rand(n_2, 1)*(max(y1)-min(y1)) + min(y1);
P = [x1,y1;x2, y2];
%% eliminating close ones
tic
d0 = 1.5;
D = pdist2(P, P);D(1:n+1:end) = inf;
E = zeros(n, 1); % eliminated ones
for i=1:n-1
if ~E(i)
CloseOnes = (D(i,:)<d0) & ((1:n)>i) & (~E');
E(CloseOnes) = 1;
end
end
P2 = P(~E, :);
toc
%% plotting samples
subplot(121); scatter(P(:, 1), P(:, 2)); axis equal;
subplot(122); scatter(P2(:, 1), P2(:, 2)); axis equal;
Edit: How big the subset should be?
As j_random_hacker pointed out in the comments, one can say that P(1, :) is the fastest answer if we don’t define a constraint on the number of selected samples. It neatly exposes the incoherence of the title! But I think the current title better describes the purpose. So let’s define a constraint: “Try to select m samples if possible”. Now, with the implicit assumption of m=n, we get the biggest possible subset. As I mentioned before, a faster method is preferable to one that finds the optimum answer.
Finding closest points over and over suggests a different data structure that is optimized for spatial searches. I suggest a Delaunay triangulation.
The solution below is "approximate" in the sense that it will likely remove more points than strictly necessary. I'm batching all the computations and, in each iteration, removing every point that contributes to an edge that is too short; in many cases removing one point would already remove an edge that appears later in the same iteration. If this matters, the edge list can be further processed to avoid duplicates, or even to find points to remove that will impact the greatest number of distances.
This is fast.
dt = delaunayTriangulation(P(:,1), P(:,2));
d0 = 1.5;
while 1
edge = edges(dt); % vertex ids in pairs
% Lookup the actual locations of each point and reorganize
pwise = reshape(dt.Points(edge.', :), 2, size(edge,1), 2);
% Compute length of each edge
difference = pwise(1,:,:) - pwise(2,:,:);
edge_lengths = sqrt(difference(1,:,1).^2 + difference(1,:,2).^2);
% Find edges less than minimum length
idx = find(edge_lengths < d0);
if(isempty(idx))
break;
end
% pick first vertex of each too-short edge for deletion
% This could be smarter to avoid overdeleting
points_to_delete = unique(edge(idx, 1));
% remove them. triangulation auto-updates
dt.Points(points_to_delete, :) = [];
% repeat until no edge is too short
end
P2 = dt.Points;
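The same "use a spatial structure" idea can also be sketched with a different structure than the triangulation above: a uniform grid with cell size d0, so each candidate is compared only with already kept points in its own and the 8 neighbouring cells. This is a greedy first-come pass in plain Python, not a port of the MATLAB code, and the sample data is made up.

```python
import math
import random

def thin_points(points, d0):
    """Greedily keep points so that every kept pair is at least d0 apart."""
    grid = {}   # (cell_x, cell_y) -> kept points that fall in that cell
    kept = []
    for x, y in points:
        cx, cy = math.floor(x / d0), math.floor(y / d0)
        # With cell size d0, any kept point closer than d0 must sit in this
        # cell or in one of its 8 neighbours.
        neighbours = (
            q
            for nx in (cx - 1, cx, cx + 1)
            for ny in (cy - 1, cy, cy + 1)
            for q in grid.get((nx, ny), ())
        )
        if all((x - qx) ** 2 + (y - qy) ** 2 >= d0 * d0 for qx, qy in neighbours):
            kept.append((x, y))
            grid.setdefault((cx, cy), []).append((x, y))
    return kept

# Made-up test data: a dense cluster on top of a uniform background.
rng = random.Random(0)
points = [(rng.gauss(30, 3), rng.gauss(35, 3)) for _ in range(500)]
points += [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(500)]
kept = thin_points(points, d0=1.5)
print(len(points), "->", len(kept))
```

Each point is looked at once and only against a handful of neighbours, so the cost grows roughly linearly with the number of points instead of quadratically.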
You don't specify how many points you want to select. This is crucial to the problem.
I don't readily see a way to optimise your method.
Assuming that Euclidean distance is acceptable as a distance measure, the following implementation is much faster when selecting only a small number of points, and faster even when trying to find the subset with 'all' valid points (note that finding the maximum possible number of points is hard).
%%
figure;
subplot(121); scatter(P(:, 1), P(:, 2)); axis equal;
d0 = 1.5;
m_range = linspace(1, 2000, 100);
m_time = NaN(size(m_range));
for m_i = 1:length(m_range);
m = m_range(m_i)
a = tic;
% Test points in random order.
r = randperm(n);
r_i = 1;
S = false(n, 1); % selected ones
for i=1:m
found = false;
while ~found
j = r(r_i);
r_i = r_i + 1;
if r_i > n
% We have tried all points. Nothing else can be valid.
break;
end
if sum(S) == 0
% This is the first point.
found = true;
else
% Get the points already selected
P_selected = P(S, :);
% Exclude points >= d0 along either axis - they cannot have
% a Euclidean distance less than d0.
P_valid = (abs(P_selected(:, 1) - P(j, 1)) < d0) & (abs(P_selected(:, 2) - P(j, 2)) < d0);
if sum(P_valid) == 0
% There are no points that can be < d0.
found = true;
else
% Implement Euclidean distance explicitly rather than
% using pdist - this makes a large difference to
% timing.
found = min(sqrt(sum((P_selected(P_valid, :) - repmat(P(j, :), sum(P_valid), 1)) .^ 2, 2))) >= d0;
end
end
end
if found
% We found a valid point - select it.
S(j) = true;
else
% Nothing found, so we must have exhausted all points.
break;
end
end
P2 = P(S, :);
m_time(m_i) = toc(a);
subplot(122); scatter(P2(:, 1), P2(:, 2)); axis equal;
drawnow;
end
%%
figure
plot(m_range, m_time);
hold on;
plot(m_range([1 end]), ones(2, 1) * original_time);
hold off;
where original_time is the time taken by your method. This gives the following timings, where the red line is your method, and the blue is mine, with the number of points selected along the x axis. Note that the line flattens when 'all' points meeting the criteria have been selected.
As you say in your comment, performance is highly dependent on the value of d0. In fact, as d0 is reduced, the method above appears to have even greater improvement in performance (this is for d0=0.1):
Note however that this is also dependent on other factors such as the distribution of your data. This method exploits specific properties of your data set, and reduces the number of expensive calculations by filtering out points where calculating the Euclidean distance is pointless. This works particularly well for selecting fewer points, and it is actually faster for smaller d0 because there are fewer points in the data set that match the criteria (so there are fewer computations of the Euclidean distance required). The optimal solution for a problem like this will usually be specific to the exact data set used.
Also note that in my code above, manually calculating the Euclidean distance is much faster than calling pdist. The flexibility and generality of the Matlab built-ins is often detrimental to performance in simple cases.

What's a good way to add a large number of small floats together?

Say you have 100000000 32-bit floating point values in an array, and each of these floats has a value between 0.0 and 1.0. If you tried to sum them all up like this
result = 0.0;
for (i = 0; i < 100000000; i++) {
result += array[i];
}
you'd run into problems as result gets much larger than 1.0.
So what are some of the ways to more accurately perform the summation?
Sounds like you want to use Kahan Summation.
According to Wikipedia,
The Kahan summation algorithm (also known as compensated summation) significantly reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers, compared to the obvious approach. This is done by keeping a separate running compensation (a variable to accumulate small errors).
In pseudocode, the algorithm is:
function kahanSum(input)
var sum = input[1]
var c = 0.0 //A running compensation for lost low-order bits.
for i = 2 to input.length
y = input[i] - c //So far, so good: c is zero.
t = sum + y //Alas, sum is big, y small, so low-order digits of y are lost.
c = (t - sum) - y //(t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y)
sum = t //Algebraically, c should always be zero. Beware eagerly optimising compilers!
next i //Next time around, the lost low part will be added to y in a fresh attempt.
return sum
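A runnable version of the pseudocode, as a minimal sketch. Note that Python floats are 64-bit, not the 32-bit floats from the question, so this only illustrates the mechanism. The naive loop is written out by hand because the built-in sum may itself use a compensated algorithm in recent Python versions; math.fsum serves as the accurate reference.

```python
import math

def naive_sum(values):
    """Plain left-to-right accumulation, for comparison."""
    total = 0.0
    for value in values:
        total += value
    return total

def kahan_sum(values):
    """Compensated (Kahan) summation of an iterable of floats."""
    total = 0.0
    compensation = 0.0   # running estimate of the low-order bits lost so far
    for value in values:
        y = value - compensation        # re-inject what was lost last time
        t = total + y                   # low-order digits of y may be lost here
        compensation = (t - total) - y  # recovers the negative of the lost part
        total = t
    return total

values = [0.1] * 1_000_000
reference = math.fsum(values)
print("naive error:", abs(naive_sum(values) - reference))
print("kahan error:", abs(kahan_sum(values) - reference))
```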
Make result a double, assuming C or C++.
If you can tolerate a little extra space (in Java):
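// Assumes array holds 1,000,000,000 floats.
float[] temp = new float[1000000];   // sums of 1000 consecutive inputs
float[] temp2 = new float[1000];     // sums of 1000 consecutive temp entries
float sum = 0.0f;
for (int i = 0; i < 1000000000; i++) temp[i / 1000] += array[i];
for (int i = 0; i < 1000000; i++) temp2[i / 1000] += temp[i];
for (int i = 0; i < 1000; i++) sum += temp2[i];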
Standard divide-and-conquer algorithm, basically. This only works if the numbers are randomly scattered; it won't work if the first half billion numbers are 1e-12 and the second half billion are much larger.
But before doing any of that, one might just accumulate the result in a double. That'll help a lot.
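The bucketed loops above are a fixed three-level version of the divide-and-conquer idea. A more general form is recursive pairwise summation, sketched here in Python (64-bit floats, so it only illustrates the structure). The block size of 128 is an arbitrary choice of mine.

```python
import math

def pairwise_sum(values, lo=0, hi=None, block=128):
    """Sum values[lo:hi] by splitting the range in half recursively."""
    if hi is None:
        hi = len(values)
    if hi - lo <= block:
        # Small ranges are summed directly; the partial sums stay small,
        # so little precision is lost here.
        total = 0.0
        for i in range(lo, hi):
            total += values[i]
        return total
    mid = (lo + hi) // 2
    # Both halves have similar magnitude when the data is evenly scattered,
    # which is what keeps the rounding error down.
    return pairwise_sum(values, lo, mid, block) + pairwise_sum(values, mid, hi, block)

values = [0.1] * 1_000_000
print(abs(pairwise_sum(values) - math.fsum(values)))
```

The same caveat applies as for the bucketed version: if the magnitudes are very unevenly spread across the array, the halves stop being comparable and the benefit shrinks.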
If you're in .NET, you can use the LINQ .Sum() extension method that exists on IEnumerable. Then it would just be:
var result = array.Sum();
The absolutely optimal way is to use a priority queue, in the following way:
PriorityQueue<Float> q = new PriorityQueue<Float>();
for (float x : list) q.add(x);
// poll() removes and returns the smallest element (PriorityQueue has no pop()).
while (q.size() > 1) q.add(q.poll() + q.poll());
return q.poll();
(this code assumes the numbers are positive; generally the queue should be ordered by absolute value)
Explanation: given a list of numbers, to add them up as precisely as possible you should strive to keep the numbers close in magnitude, i.e. eliminate the difference between small and big ones. That's why you want to add up the two smallest numbers, thus increasing the minimal value of the list, decreasing the difference between the minimum and maximum in the list and reducing the problem size by 1.
Unfortunately I have no idea about how this can be vectorized, considering that you're using OpenCL. But I am almost sure that it can be. You might take a look at the book on vector algorithms, it is surprising how powerful they actually are: Vector Models for Data-Parallel Computing
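For reference, the smallest-first idea can be sketched in Python with heapq, ordering by absolute value so it also handles negative numbers. This is a sequential illustration only, with no attempt at vectorizing; the function name is mine.

```python
import heapq

def smallest_first_sum(values):
    """Repeatedly add the two smallest-magnitude numbers until one is left."""
    heap = [(abs(v), v) for v in values]   # ordered by absolute value
    heapq.heapify(heap)
    if not heap:
        return 0.0
    while len(heap) > 1:
        _, a = heapq.heappop(heap)
        _, b = heapq.heappop(heap)
        partial = a + b
        heapq.heappush(heap, (abs(partial), partial))
    return heap[0][1]

# The two 1.0 values are combined first, so they survive next to 1e16;
# adding them to 1e16 one at a time would round each of them away.
print(smallest_first_sum([1e16, 1.0, 1.0]))
print((1e16 + 1.0) + 1.0)
```

The heap makes this O(n log n), which is noticeably more work per element than a compensated running sum.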

Resources