Get X random points in a fixed grid without repetition - random

I'm looking for a way of getting X points in a fixed sized grid of let's say M by N, where the points are not returned multiple times and all points have a similar chance of getting chosen and the amount of points returned is always X.
I had the idea of looping over all the grid points and giving each point a random chance of X/(N*M) yet I felt like that it would give more priority to the first points in the grid. Also this didn't meet the requirement of always returning X amount of points.
Also I could go with a way of using increments with a prime number to get kind of a shuffle without repeat functionality, but I'd rather have it behave more random than that.

Essentially, you need to keep track of the points you already chose, and make use of a random number generator to get a pseudo-uniformly distributed answer. Each "choice" should be independent of the previous one.
With your first idea, you're right, the first ones would have more chance of getting picked. Consider a one-dimensional array with two elements. With the strategy you mention, the chance of getting the first one is:
P[x=0] = 1/2 = 0.5
The chance of getting the second one is the chance of NOT getting the first one 0.5, times 1/2:
P[x=1] = 1/2 * 1/2 = 0.25
You don't mention which programming language you're using, so I'll assume you have at your disposal random number generator rand() which results in a random float in the range [0, 1), a Hashmap (or similar) data structure, and a Point data structure. I'll further assume that a point in the grid can be any floating point x,y, where 0 <= x < M and 0 <= y < N. (If this is a NxM array, then the same applies, but in integers, and up to (M-1,N-1)).
Hashmap points = new Hashmap();
Point p;
while (items.size() < X) {
p = new Point(rand()*M, rand()*N);
if (!points.containsKey(p)) {
items.add(p, 1);
}
}
Note: Two Point objects of equal x and y should be themselves considered equal and generate equal hash codes, etc.

Related

What is a greedy algorithm for this problem that is minimally optimal + proof?

The details are a bit cringe, fair warning lol:
I want to set up meters on the floor of my building to catch someone; assume my floor is a number line from 0 to length L. The specific type of meter I am designing has a radius of detection that is 4.7 meters in the -x and +x direction (diameter of 9.4 meters of detection). I want to set them up in such a way that if the person I am trying to find steps foot anywhere in the floor, I will know. However, I can't just setup a meter anywhere (it may annoy other residents); therefore, there are only n valid locations that I can setup a meter. Additionally, these meters are expensive and time consuming to make, so I would like to use as few as possible.
For simplicity, you can assume the meter has 0 width, and that each valid location is just a point on the number line aformentioned. What is a greedy algorithm that places as few meters as possible, while being able to detect the entire hallway of length L like I want it to, or, if detecting the entire hallway is not possible, will output false for the set of n locations I have (and, if it isn't able to detect the whole hallway, still uses as few meters as possible while attempting to do so)?
Edit: some clarification on being able to detect the entire hallway or not
Given:
L (hallway length)
a list of N valid positions to place a meter (p_0 ... p_N-1) of radius 4.7
You can determine in O(N) either a valid and minimal ("good") covering of the whole hallway or a proof that no such covering exists given the constraints as follows (pseudo-code):
// total = total length;
// start = current starting position, initially 0
// possible = list of possible meter positions
// placed = list of (optimal) meter placements, initially empty
boolean solve(float total, float start, List<Float> possible, List<Float> placed):
if (total-start <= 0):
return true; // problem solved with no additional meters - woo!
else:
Float next = extractFurthestWithinRange(start, possible, 4.7);
if (next == null):
return false; // no way to cover end of hall: report failure
else:
placed.add(next); // placement decided
return solve(total, next + 4.7, possible, placed);
Where extractFurthestWithinRange(float start, List<Float> candidates, float range) returns null if there are no candidates within range of start, or returns the last position p in candidates such that p <= start + range -- and also removes p, and all candidates c such that p >= c.
The key here is that, by always choosing to place a meter in the next position that a) leaves no gaps and b) is furthest from the previously-placed position we are simultaneously creating a valid covering (= no gaps) and an optimal covering (= no possible valid covering could have used less meters - because our gaps are already as wide as possible). At each iteration, we either completely solve the problem, or take a greedy bite to reduce it to a (guaranteed) smaller problem.
Note that there can be other optimal coverings with different meter positions, but they will use the exact same number of meters as those returned from this pseudo-code. For example, if you adapt the code to start from the end of the hallway instead of from the start, the covering would still be good, but the gaps could be rearranged. Indeed, if you need the lexicographically minimal optimal covering, you should use the adapted algorithm that places meters starting from the end:
// remaining = length (starts at hallway length)
// possible = positions to place meters at, starting by closest to end of hallway
// placed = positions where meters have been placed
boolean solve(float remaining, List<Float> possible, Queue<Float> placed):
if (remaining <= 0):
return true; // problem solved with no additional meters - woo!
else:
// extracts points p up to and including p such that p >= remaining - range
Float next = extractFurthestWithinRange2(remaining, possible, 4.7);
if (next == null):
return false; // no way to cover start of hall: report failure
else:
placed.add(next); // placement decided
return solve(next - 4.7, possible, placed);
To prove that your solution is optimal if it is found, you merely have to prove that it finds the lexicographically last optimal solution.
And you do that by induction on the size of the lexicographically last optimal solution. The case of a zero length floor and no monitor is trivial. Otherwise you demonstrate that you found the first element of the lexicographically last solution. And covering the rest of the line with the remaining elements is your induction step.
Technical note, for this to work you have to be allowed to place monitoring stations outside of the line.

Is there an elegant and efficient way to implement weighted random choices in golang? Details on current implementation and issues inside

tl;dr: I'm looking for methods to implement a weighted random choice based on the relative magnitude of values (or functions of values) in an array in golang. Are there standard algorithms or recommendable packages for this? Is so how do they scale?
Goals
I'm trying to write 2D and 3D markov process programs in golang. A simple 2D example of such is the following: Imagine one has a lattice, and on each site labeled by index (i,j) there are n(i,j) particles. At each time step, the program chooses a site and moves one particle from this site to a random adjacent site. The probability that a site is chosen is proportional to its population n(i,j) at that time.
Current Implementation
My current algorithm, e.g. for the 2D case on an L x L lattice, is the following:
Convert the starting array into a slice of length L^2 by concatenating rows in order, e.g. cdfpop[i L +j]=initialpopulation[i][j].
Convert the 1D slice into a cdf by running a for loop over cdfpop[i]+=cdfpop[i-1].
Generate a two random numbers, Rsite whose range is from 1 to the largest value in the cdf (this is just the last value, cdfpop[L^2-1]), and Rhop whose range is between 1 and 4. The first random number chooses a weighted random site, and the second number a random direction to hop in
Use a binary search to find the leftmost index indexhop of cdfpop that is greater than Rsite. The index being hopped to is either indexhop +-1 for x direction hops or indexhop +- L for y direction hops.
Finally, directly change the values of cdfpop to reflect the hop process. This means subtracting one from (adding one to) all values in cdfpop between the index being hopped from (to) and the index being hopped to (from) depending on order.
Rinse and repeat in for loop. At the end reverse the cdf to determine the final population.
Edit: Requested Pseudocode looks like:
main(){
//import population LxL array
population:= import(population array)
//turn array into slice
for i number of rows{
cdf[ith slice of length L] = population[ith row]
}
//compute cumulant array
for i number of total sites{
cdf[i] = cdf[i-1]+cdf[i]
}
for i timesteps{
site = Randomhopsite(cdf)
cdf = Dohop(cdf, site)
}
Convertcdftoarrayandsave(cdf)
}
Randomhopsite(cdf) site{
//Choose random number in range of the cummulant
randomnumber=RandomNumber(Range 1 to Max(cdf))
site = binarysearch(cdf) // finds leftmost index such that
// cdf[i] > random number
return site
}
Dohop(cdf,site) cdf{
//choose random hop direction and calculate coordinate
randomnumber=RandomNumber(Range 1 to 4)
case{
randomnumber=1 { finalsite= site +1}
randomnumber=2 { finalsite= site -1}
randomnumber=3 { finalsite= site + L}
randomnumber=4 { finalsite= site - L}
}
//change the value of the cumulant distribution to reflect change
if finalsite > site{
for i between site and finalsite{
cdf[i]--
}
elseif finalsite < site{
for i between finalsite and site{
cdf[i]++
}
else {error: something failed}
return cdf
}
This process works really well for simple problems. For this particular problem, I can run about 1 trillion steps on a 1000x 1000 lattice in about 2 minutes on average with my current set up, and I can compile population data to gifs every 10000 or so steps by spinning a go routine without a huge slowdown.
Where efficiency breaks down
The trouble comes when I want to add different processes, with real-valued coefficients, whose rates are not proportional to site population. So say I now have a hopping rate at k_hop *n(i,j) and a death rate (where I simply remove a particle) at k_death *(n(i,j))^2. There are two slow-downs in this case:
My cdf will be double the size (not that big of a deal). It will be real valued and created by cdfpop[i*L+j]= 4 *k_hop * pop[i][j] for i*L+j<L*L and cdfpop[i*L+j]= k_death*math. Power(pop[i][j],2) for L*L<=i*L+j<2*L*L, followed by cdfpop[i]+=cdfpop[i-1]. I would then select a random real in the range of the cdf.
Because of the squared n, I will have to dynamically recalculate the part of the cdf associated with the death process weights at each step. This is a MAJOR slow down, as expected. Timing for this is about 3 microseconds compared with the original algorithm which took less than a nanosecond.
This problem only gets worse if I have rates calculated as a function of populations on neighboring sites -- e.g. spontaneous particle creation depends on the product of populations on neighboring sites. While I hope to work out a way to just modify the cdf without recalculation by thinking really hard, as I try to simulate problems of increasing complexity, I can't help but wonder if there is a universal solution with reasonable efficiency I'm missing that doesn't require specialized code for each random process.
Thanks for reading!

Get N samples given iterator

Given are an iterator it over data points, the number of data points we have n, and the maximum number of samples we want to use to do some calculations (maxSamples).
Imagine a function calculateStatistics(Iterator it, int n, int maxSamples). This function should use the iterator to retrieve the data and do some (heavy) calculations on the data element retrieved.
if n <= maxSamples we will of course use each element we get from the iterator
if n > maxSamples we will have to choose which elements to look at and which to skip
I've been spending quite some time on this. The problem is of course how to choose when to skip an element and when to keep it. My approaches so far:
I don't want to take the first maxSamples coming from the iterator, because the values might not be evenly distributed.
Another idea was to use a random number generator and let me create maxSamples (distinct) random numbers between 0 and n and take the elements at these positions. But if e.g. n = 101 and maxSamples = 100 it gets more and more difficult to find a new distinct number not yet in the list, loosing lot of time just in the random number generation
My last idea was to do the contrary: to generate n - maxSamples random numbers and exclude the data elements at these positions elements. But this also doesn't seem to be a very good solution.
Do you have a good idea for this problem? Are there maybe standard known algorithms for this?
To provide some answer, a good way to collect a set of random numbers given collection size > elements needed, is the following. (in C++ ish pseudo code).
EDIT: you may need to iterate over and create the "someElements" vector first. If your elements are large they can be "pointers" to these elements to save space.
vector randomCollectionFromVector(someElements, numElementsToGrab) {
while(numElementsToGrab--) {
randPosition = rand() % someElements.size();
resultVector.push(someElements.get(randPosition))
someElements.remove(randPosition);
}
return resultVector;
}
If you don't care about changing your vector of elements, you could also remove random elements from someElements, as you mentioned. The algorithm would look very similar, and again, this is conceptually the same idea, you just pass someElements by reference, and manipulate it.
Something worth noting, is the quality of psuedo random distributions as far as how random they are, grows as the size of the distribution you used increases. So, you may tend to get better results if you pick which method you use based on which method results in the use of more random numbers. Example: if you have 100 values, and need 99, you should probably pick 99 values, as this will result in you using 99 pseudo random numbers, instead of just 1. Conversely, if you have 1000 values, and need 99, you should probably prefer the version where you remove 901 values, because you use more numbers from the psuedo random distribution. If what you want is a solid random distribution, this is a very simple optimization, that will greatly increase the quality of "fake randomness" that you see. Alternatively, if performance matters more than distribution, you would take the alternative or even just grab the first 99 values approach.
interval = n/(n-maxSamples) //an euclidian division of course
offset = random(0..(n-1)) //a random number between 0 and n-1
totalSkip = 0
indexSample = 0;
FOR it IN samples DO
indexSample++ // goes from 1 to n
IF totalSkip < (n-maxSamples) AND indexSample+offset % interval == 0 THEN
//do nothing with this sample
totalSkip++
ELSE
//work with this sample
ENDIF
ENDFOR
ASSERT(totalSkip == n-maxSamples) //to be sure
interval represents the distance between two samples to skip.
offset is not mandatory but it allows to have a very little diversity.
Based on the discussion, and greater understanding of your problem, I suggest the following. You can take advantage of a property of prime numbers that I think will net you a very good solution, that will appear to grab pseudo random numbers. It is illustrated in the following code.
#include <iostream>
using namespace std;
int main() {
const int SOME_LARGE_PRIME = 577; //This prime should be larger than the size of your data set.
const int NUM_ELEMENTS = 100;
int lastValue = 0;
for(int i = 0; i < NUM_ELEMENTS; i++) {
lastValue += SOME_LARGE_PRIME;
cout << lastValue % NUM_ELEMENTS << endl;
}
}
Using the logic presented here, you can create a table of all values from 1 to "NUM_ELEMENTS". Because of the properties of prime numbers, you will not get any duplicates until you rotate all the way around back to the size of your data set. If you then take the first "NUM_SAMPLES" of these, and sort them, you can iterate through your data structure, and grab a pseudo random distribution of numbers(not very good random, but more random than a pre-determined interval), without extra space and only one pass over your data. Better yet, you can change the layout of the distribution by grabbing a random prime number each time, again must be larger than your data set, or the following example breaks.
PRIME = 3, data set size = 99. Won't work.
Of course, ultimately this is very similar to the pre-determined interval, but it inserts a level of randomness that you do not get by simply grabbing every "size/num_samples"th element.
This is called the Reservoir sampling

matlab: optimum amount of points for linear fit

I want to make a linear fit to few data points, as shown on the image. Since I know the intercept (in this case say 0.05), I want to fit only points which are in the linear region with this particular intercept. In this case it will be lets say points 5:22 (but not 22:30).
I'm looking for the simple algorithm to determine this optimal amount of points, based on... hmm, that's the question... R^2? Any Ideas how to do it?
I was thinking about probing R^2 for fits using points 1 to 2:30, 2 to 3:30, and so on, but I don't really know how to enclose it into clear and simple function. For fits with fixed intercept I'm using polyfit0 (http://www.mathworks.com/matlabcentral/fileexchange/272-polyfit0-m) . Thanks for any suggestions!
EDIT:
sample data:
intercept = 0.043;
x = 0.01:0.01:0.3;
y = [0.0530642513911393,0.0600786706929529,0.0673485248329648,0.0794662409166333,0.0895915873196170,0.103837395346484,0.107224784565365,0.120300492775786,0.126318699218730,0.141508831492330,0.147135757370947,0.161734674733680,0.170982455701681,0.191799936622712,0.192312642057298,0.204771365716483,0.222689541632988,0.242582251060963,0.252582727297656,0.267390860166283,0.282890010610515,0.292381165948577,0.307990544720676,0.314264952297699,0.332344368808024,0.355781519885611,0.373277721489254,0.387722683944356,0.413648156978284,0.446500064130389;];
What you have here is a rather difficult problem to find a general solution of.
One approach would be to compute all the slopes/intersects between all consecutive pairs of points, and then do cluster analysis on the intersepts:
slopes = diff(y)./diff(x);
intersepts = y(1:end-1) - slopes.*x(1:end-1);
idx = kmeans(intersepts, 3);
x([idx; 3] == 2) % the points with the intersepts closest to the linear one.
This requires the statistics toolbox (for kmeans). This is the best of all methods I tried, although the range of points found this way might have a few small holes in it; e.g., when the slopes of two points in the start and end range lie close to the slope of the line, these points will be detected as belonging to the line. This (and other factors) will require a bit more post-processing of the solution found this way.
Another approach (which I failed to construct successfully) is to do a linear fit in a loop, each time increasing the range of points from some point in the middle towards both of the endpoints, and see if the sum of the squared error remains small. This I gave up very quickly, because defining what "small" is is very subjective and must be done in some heuristic way.
I tried a more systematic and robust approach of the above:
function test
%% example data
slope = 2;
intercept = 1.5;
x = linspace(0.1, 5, 100).';
y = slope*x + intercept;
y(1:12) = log(x(1:12)) + y(12)-log(x(12));
y(74:100) = y(74:100) + (x(74:100)-x(74)).^8;
y = y + 0.2*randn(size(y));
%% simple algorithm
[X,fn] = fminsearch(#(ii)P(ii, x,y,intercept), [0.5 0.5])
[~,inds] = P(X, y,x,intercept)
end
function [C, inds] = P(ii, x,y,intercept)
% ii represents fraction of range from center to end,
% So ii lies between 0 and 1.
N = numel(x);
n = round(N/2);
ii = round(ii*n);
inds = min(max(1, n+(-ii(1):ii(2))), N);
% Solve linear system with fixed intercept
A = x(inds);
b = y(inds) - intercept;
% and return the sum of squared errors, divided by
% the number of points included in the set. This
% last step is required to prevent fminsearch from
% reducing the set to 1 point (= minimum possible
% squared error).
C = sum(((A\b)*A - b).^2)/numel(inds);
end
which only finds a rough approximation to the desired indices (12 and 74 in this example).
When fminsearch is run a few dozen times with random starting values (really just rand(1,2)), it gets more reliable, but I still wouln't bet my life on it.
If you have the statistics toolbox, use the kmeans option.
Depending on the number of data values, I would split the data into a relative small number of overlapping segments, and for each segment calculate the linear fit, or rather the 1-st order coefficient, (remember you know the intercept, which will be same for all segments).
Then, for each coefficient calculate the MSE between this hypothetical line and entire dataset, choosing the coefficient which yields the smallest MSE.

Simple Weighted Random Walk with Hysteresis

I've already written a solution for this, but it doesn't feel "right", so I'd like some input from others.
The rules are:
Movement is on a 2D grid (Directions arbitrarily labelled N, NE, E, SE, S, SW, W, NW)
Probabilities of moving in a given direction are relative to the direction of travel (i.e. 40% represents ahead), and weighted:
[14%][40%][14%]
[ 8%][ 4%][ 8%]
[ 4%][ 4%][ 4%]
This means with overwhelming probability, travel will continue along its current trajectory. The middle value represents stopping. As an example, if the last move was NW, then the absolute probabilities would read:
[40%][14%][ 8%]
[14%][ 4%][ 4%]
[ 8%][ 4%][ 4%]
The probabilities are approximate - one thing I toyed with was making stopped a static 5% chance outside of the main calculation, which would have altered the probability of any other operation ever so slightly.
My current algorithm is as follows (in simplified pseudocode):
int[] probabilities = [4,40,14,8,4,4,4,8,14]
if move.previous == null:
move.previous = STOPPED
if move.previous != STOPPED:
// Cycle probabilities[1:8] array until indexof(move.previous) = 40%
r = Random % 99
if r < probabilities.sum[0:0]:
move.current = STOPPED
elif r < probabilities.sum[0:1]:
move.current = NW
elif r < probabilities.sum[0:2]:
move.current = NW
...
Reasons why I really dislike this method:
* It forces me to assign specific roles to array indices: [0] = stopped, [1] = North...
* It forces me to operate on a subset of the array when cycling (i.e. STOPPED always remains in place)
* It's very iterative, and therefore, slow. It has to check every value in turn until it gets to the right one. Cycling the array requires up to 4 operations.
* A 9-case if-block (most languages do not allow dynamic switches).
* Stopped has to be special cased in everything.
Things I have considered:
* Circular linked list: Simplifies the cycling (make the pivot always equal north) but requires maintaining a set of pointers, and still involves assigning roles to specific indices.
* Vectors: Really not sure how I'd go about weighting this, plus I'd need to worry about magnitude.
* Matrices: Rotating matrices does not work like that :)
* Use a well-known random walk algorithm: Overkill? Though recommendations are considered.
* Trees: Just thought of this, so no real thought given to it...
So. Does anyone have any bright ideas?
8You have 8 directions and when you hit some direction you have to "rotate this matrix"
But this is just modulo over finite field.
Since you have only 100 integers to pick probability from, you can just putt all integers in list and value from each integers points to index of your direction.
This direction you rotate (modulo addition) in way that it points to move that you have to make.
And than you have one array that have difference that you have to apply to your move.
somethihing like that.
40 numbers 14 numbers 8 numbers
int[100] probab={0,0,0,0,0,0,....,1,1,1,.......,2,2,2,...};
and then
N NE E SE STOP
int[9] next_move={{0,1},{ 1,1},{1,1},{1,-1}...,{0,0}}; //in circle
So you pick
move=probab[randint(100)]
if(move != 8)//if 8 you got stop
{
move=(prevous_move+move)%8;
}
move_x=next_move[move][0];
move_y=next_move[move][1];
Use a more direct representation of direction in your algorithms, something like a (dx, dy) pair, for example.
This allows you to move by just having x += dx; y += dy;
(You can still use the "direction ENUM" + a lookup table if you wish...)
Your next problem is finding a good representation of the "probability table". Since r only ranges from 1 to 99 it might be feasible to just do a dumb array and use prob_table[r] directly.
Then, compute a 3x3 matrix of these probability tables using the method of your choice. It doesn't matter if it is slow because you only do it once.
To get the next direction simply
prob_table = dir_table[curr_dx][curr_dy];
(curr_dx, curr_dy) = get_next_dir(prob_table, random_number());

Resources