How to determine if a pattern of distribution is different from a random/uniform distribution - probability

Here is my case:
Let's say we have 50 polygons and a point set distributed within these 50 polygons, so that for each polygon there is an associated point density. What I want to test is whether the distribution pattern of this data set (for example, the fluctuations in density across the 50 polygons) is a realization of complete spatial randomness.
The method I use is this: in the uniformly random case, the number of points in each polygon follows a binomial distribution, i.e. X ~ B(n, p), where n is the total number of points and p is the probability of a point falling inside a particular polygon (p = Area_polygon / Area_semicircle). So for each polygon I can calculate the expected number of points, and from that the expected density. I can then apply a one-way ANOVA to compare two groups: the actual densities and the theoretical densities.
However, I found a problem: when calculating the density I divide the expected number by the polygon area. But the expected number is
E = N * Area_polygon / Area_total,
so the expected density is
D = E / Area_polygon = N / Area_total,
which means that the expected density is the same number for every polygon.
So in that case, is it still suitable to use a one-way ANOVA to compare my actual density group to a group in which every number is the same?
What if I use counts rather than densities? Or is there another, more suitable test?

You may want to look up a method called "quadrat test". It is explained in the online help for the function quadrat.test in the R package spatstat and more extensively in the spatstat book. (Disclaimer: I'm a coauthor.)
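For illustration, here is a minimal sketch of the chi-squared goodness-of-fit test that underlies a quadrat test, written in Python rather than with spatstat. The counts and areas are made-up toy numbers; under complete spatial randomness the expected count for each polygon is n * Area_polygon / Area_total, as in the question.

```python
# A toy sketch (not the spatstat implementation): chi-squared goodness-of-fit
# test of observed polygon/quadrat counts against the counts expected under
# complete spatial randomness. The numbers below are made up.
import numpy as np
from scipy.stats import chisquare

observed = np.array([12, 7, 15, 9, 11])        # observed points per polygon
areas    = np.array([2.0, 1.5, 3.0, 1.0, 2.5]) # polygon areas

n = observed.sum()
expected = n * areas / areas.sum()             # E_i = n * Area_i / Area_total

stat, p_value = chisquare(observed, f_exp=expected)
print(f"chi-squared = {stat:.2f}, p = {p_value:.4f}")  # small p => departs from CSR
```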

Related

How do I implement a genetic algorithm for placing 2 or more kinds of elements with different (repeating) distances in a grid?

Please forgive me if I do not explain my question clearly in the title.
Here are two pictures as my example:
My question is as follows: I have 2 or more different objects (in the pictures, two objects: circle and cross), and each one is placed repeatedly into a grid with a fixed row/column distance (in the pictures, the circle has a distance of 4 and the cross has a distance of 2).
In the first picture, each of the two objects is repeated correctly without any interruptions (here an interruption means one object occupying another one's position), but the arrangement in the first picture is non-uniformly distributed; on the contrary, in the second picture the two objects may have interruptions (the circle object occupies cross objects' positions), but the arrangement is uniformly distributed.
My target is to get the placement as uniform as possible (the objects are still placed with fixed distances, but some occupations may be allowed). Is there a potential algorithm for this problem? Or are there any similar problems?
I have some preliminary thoughts on this problem: 1. occupation may relate to the least common multiple; 2. how do I define "uniformly distributed" mathematically? Maybe there is no genetic solution, but is there a solution for some special cases (for example, 3 objects with distances that are multiples of 2, or multiples of 3)?
Uniformity can be measured as the sum of squared inverse distances (or of squared deviations from the equilibrium distances). Because of the squared relation, any single piece that approaches the others incurs a big fitness penalty, so the system will not tolerate pieces that are too close and will prefer a better distribution.
If you use plain distances rather than squared (or higher-order) ones, the system starts tolerating even overlapping pieces.
If you want to compute uniformity manually, compute the standard deviation of the distances. You would call it perfect when all distances are equal (zero deviation), but a small enough deviation is also acceptable.
I tested this only on a problem of fitting 106 circles into a square that is 10x the size of the circle.
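As a concrete illustration, here is a minimal sketch of such a fitness term (the function name and the epsilon guard are my own); a genetic algorithm would minimize this value:

```python
# A minimal sketch of the penalty described above: the sum of squared inverse
# pairwise distances. Close pairs dominate the sum, so minimizing it pushes
# the pieces apart and favors a uniform layout.
import itertools

def uniformity_penalty(points, eps=1e-9):
    """points: list of (x, y) tuples. Lower values mean a more uniform layout."""
    penalty = 0.0
    for (x1, y1), (x2, y2) in itertools.combinations(points, 2):
        d2 = (x1 - x2) ** 2 + (y1 - y2) ** 2
        penalty += 1.0 / (d2 + eps)   # (1/d)^2; eps guards against division by zero
    return penalty

# A clumped layout scores worse (higher) than a spread-out one:
print(uniformity_penalty([(0, 0), (0.1, 0), (5, 5)]))   # large penalty
print(uniformity_penalty([(0, 0), (5, 0), (0, 5)]))     # small penalty
```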

Algorithm for distributing points evenly but randomly in a rectangle

I want to place some points in a rectangle randomly.
Generating random x, y coordinates is not a good idea, because it often happens that the points clump together in one area instead of covering the whole rectangle.
I don't need an incredibly fast algorithm or the best possible coverage, just something that could run in a simple game and generate random (x, y) positions covering almost the whole rectangle.
In my particular case I'm trying to generate a simple sky, so the idea is to place about 40-50 stars in the sky rectangle.
Could someone point me to a common algorithm for doing that?
There are a number of algorithms to pseudo-randomly fill a 2D plane. One of them is Poisson disk sampling, which places the samples randomly but then checks that no two are too close; the result is a random-looking yet evenly spread set of points.
There are articles describing this algorithm, and implementations are available.
The problem, though, is that the resulting distribution looks nothing like the actual stars in the sky. Still, it is a good tool to start with: by controlling the Poisson radius we can create very natural-looking patterns. For example, in one article Perlin noise is used to control the radius of the Poisson disk sampling.
You would also want to vary the brightness of the stars; you can experiment with uniform random values or Perlin noise there as well.
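For reference, here is a minimal dart-throwing sketch of the Poisson-disk idea (the simple rejection variant, not Bridson's faster grid-based algorithm); the dimensions, radius and star count are illustrative:

```python
# Draw candidates uniformly and keep one only if it is at least `radius`
# away from every point placed so far. Good enough for a few dozen stars.
import random

def poisson_disk_naive(width, height, radius, n_points, max_tries=10000):
    points = []
    tries = 0
    while len(points) < n_points and tries < max_tries:
        tries += 1
        x, y = random.uniform(0, width), random.uniform(0, height)
        if all((x - px) ** 2 + (y - py) ** 2 >= radius ** 2 for px, py in points):
            points.append((x, y))
    return points

stars = poisson_disk_naive(800, 600, radius=60, n_points=45)
print(len(stars), "stars placed")
```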
Once, for a game, I used a completely different approach. I took the real positions of the stars in a Cartesian coordinate system from the HYG database by David Nash and transformed them to my viewpoint. With this approach you can even create the exact view that can be seen from where you are on Earth.
I once showed this database to the girl I wanted to date, saying "I want to show you the stars… in a Cartesian coordinate system".
Update: it's been over seven years now and we are still together.
Here are some ideas which might make your cover appear "more uniform". These approaches don't necessarily provide an efficient way to generate a truly uniform cover, but they might be good enough and are worth looking at in your case.
First, you can divide the original rectangle into 4 (or 10, or 100, as long as performance allows) subrectangles and cover each subrectangle separately with random points. By doing so you make sure that no subrectangle is left uncovered. You can generate the same number of points for each subrectangle, but you can also vary the number from one subrectangle to another: for each subrectangle, first generate a random number num_points_in_subrectangle (which can come from a uniform distribution on some interval [lower, upper]) and then randomly fill the subrectangle with that many points. All subrectangles will then contain a random number of points and will probably look less "programmatically generated". A sketch of this idea follows below.
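Here is a minimal sketch of that subdivision idea (the grid size and the [lower, upper] bounds are illustrative):

```python
# Split the rectangle into an nx-by-ny grid of cells and drop a random
# number of uniformly placed points into each cell, so no region is empty.
import random

def subdivided_random_points(width, height, nx, ny, lower, upper):
    points = []
    cell_w, cell_h = width / nx, height / ny
    for i in range(nx):
        for j in range(ny):
            count = random.randint(lower, upper)     # points in this cell
            for _ in range(count):
                x = i * cell_w + random.uniform(0, cell_w)
                y = j * cell_h + random.uniform(0, cell_h)
                points.append((x, y))
    return points

stars = subdivided_random_points(800, 600, nx=4, ny=3, lower=3, upper=5)
print(len(stars), "points placed")
```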
Another thing you can try is to generate random points inside the original rectangle and, for each generated point, check whether there already exists a point within some radius R. If there is such a point, reject the candidate and generate a new one. Again, you can vary the radius from one point to another by making R a random variable.
Finally, you can combine several approaches. Generate some random number n of points you want in total. First, divide the original rectangle into subrectangles and cover them in such a way that there are n / 3 points in total. Then generate the next n / 3 points by selecting random points inside the original rectangle without any restrictions. After this, generate the last n / 3 points randomly with checks for neighbors within the radius.
Using uniform draws of X and Y, if you draw 40 points the probability of having them all land in a given half is about one in a trillion ((1/2)^40 ≈ 9 × 10^-13).

To make a distance matrix or to repeatedly calculate distance

I'm working on a K-medoids algorithm implementation. It is a clustering algorithm, and one of its steps includes finding the most representative point in a cluster.
So, here's the thing
I have a certain number of clusters
Each cluster contains a certain number of points
I need to find the point in each cluster that results in the least error if it is picked as the cluster representative
The distance from each point to all the others in the cluster needs to be calculated
This distance calculation can be as simple as Euclidean or more complex, like DTW (Dynamic Time Warping) between two signals
There are two approaches: one is to calculate a distance matrix that stores the values between all points in the dataset, and the other is to calculate distances during clustering, which means that distances between some points will be calculated repeatedly.
On one hand, to build the distance matrix you must calculate distances between all points in the whole dataset, and some of the calculated values will never be used.
On the other hand, if you don't build the distance matrix, some calculations will be repeated over a number of iterations.
Which is the better approach?
I'm also considering a MapReduce implementation, so opinions from that angle are also welcome.
Thanks
A third approach could be a combination of both: lazily evaluate the distance matrix. Initialize a matrix with default values (unrealistic ones, such as negative numbers), and when you need the distance between two points, if the value is already present in the matrix, just take it from there.
Otherwise, calculate it and store it in the matrix.
This approach trades calculations (and it is optimal in that it performs the lowest possible number of pair calculations) for more branches in the code and a few more instructions. However, thanks to branch predictors, I assume this overhead will not be that dramatic.
I expect it to perform better when the distance calculation is relatively expensive.
Another optimization could be to switch dynamically to a plain matrix implementation (and calculate the remaining part of the matrix) when the number of already calculated entries exceeds a certain threshold. This can be done quite nicely in OOP languages by switching the implementation of the interface once the threshold is met.
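For illustration, here is a minimal sketch of the lazy (memoized) distance matrix; the class and method names are my own, and plain Euclidean distance stands in for whatever metric (e.g. DTW) is actually used:

```python
# Distances are computed on demand and cached, so no pair is computed twice
# and pairs that are never requested are never computed at all.
import math

class LazyDistanceMatrix:
    def __init__(self, points, metric):
        self.points = points
        self.metric = metric
        self.cache = {}

    def dist(self, i, j):
        key = (i, j) if i <= j else (j, i)      # the matrix is symmetric
        if key not in self.cache:
            self.cache[key] = self.metric(self.points[i], self.points[j])
        return self.cache[key]

dm = LazyDistanceMatrix([(0, 0), (3, 4), (6, 8)], metric=math.dist)
print(dm.dist(0, 1), dm.dist(1, 0))             # 5.0 both times, computed once
```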
Which implementation is actually better will depend heavily on the cost of the distance function and on the data you are clustering, as some data sets will need the same pairs calculated more often than others.
I suggest doing a benchmark, and using statistical tools to evaluate which method is actually better.

How does one decide the final clusters when using the mean shift algorithm?

I am reading a bit about the mean shift clustering algorithm (http://en.wikipedia.org/wiki/Mean_shift) and this is what I have got so far. For each point in your data set: select all points within a certain distance of it (including the original point), calculate the mean of all these points, and repeat until these means stabilize.
What I'm confused about is how one goes from here to deciding what the final clusters are, and under what conditions these means merge. Also, does the distance used to select the points change through the iterations or does it remain constant?
Thanks in advance
Mean shift cluster finding is a simple iterative process which is actually guaranteed to converge. The iteration starts from a starting point x, and the iteration steps are (note that x may have several components, as the algorithm works in higher dimensions as well):
calculate the weighted mean position x' of all points around x; perhaps the simplest form is to average the positions of all points within distance d of x, but a Gaussian weighting is also commonly used and is mathematically beneficial
set x <- x'
repeat until the difference between x and x' is very small
This can be used in cluster analysis by starting with different values of x. The final values will end up at different cluster centers. The number of clusters cannot be known in advance (other than that it is <= the number of points).
The upper level algorithm is:
go through a selection of starting values
for each value, calculate the convergence value as shown above
if the value is not already in the list of convergence values, add it to the list (allowing some reasonable tolerance for numerical imprecision)
Then you have the list of clusters. The only difficult thing is finding a reasonable selection of starting values. It is easy with one or two dimensions, but with higher dimensionality an exhaustive search is not really feasible.
All starting points which end up at the same mode (point of convergence) belong to the same cluster.
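A minimal sketch of this procedure with a flat kernel, grouping starting points by the mode they converge to; the parameter names and values (d, tol, merge_tol) are illustrative:

```python
# For each starting point, shift it to the mean of its neighbours within
# distance d until it stops moving; starting points whose shifts converge to
# (roughly) the same mode are grouped into one cluster.
import numpy as np

def mean_shift(data, d=1.0, tol=1e-4, merge_tol=1e-2, max_iter=300):
    modes, labels = [], []
    for x in data:
        x = x.copy()
        for _ in range(max_iter):
            neighbours = data[np.linalg.norm(data - x, axis=1) <= d]
            x_new = neighbours.mean(axis=0)
            if np.linalg.norm(x_new - x) < tol:
                break
            x = x_new
        # group this starting point by the mode it converged to
        for k, m in enumerate(modes):
            if np.linalg.norm(m - x) < merge_tol:
                labels.append(k)
                break
        else:
            modes.append(x)
            labels.append(len(modes) - 1)
    return np.array(modes), labels

data = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
modes, labels = mean_shift(data, d=1.0)
print(len(modes), "clusters:", labels)
```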
It may be of interest that if you are doing this on a 2D image, it should be sufficient to calculate the gradient (i.e. the first iteration) for each pixel. This is a fast operation with common convolution techniques, and then it is relatively easy to group the pixels into clusters.

Algorithm: 2D transformation, find outlying pairs of points and omit

I am looking for the following type of algorithm:
There are n matched pairs of points in 2D. How can I identify outlying pairs of points with respect to an affine / Helmert transformation and omit them from the transformation key? We do not know the exact number of such outlying pairs.
I cannot use the Trimmed Least Squares method because it makes the basic assumption that a known percentage k of the pairs is correct. But we do not have any information about the sample and do not know k; in a given sample all of the pairs could be correct, or none of them.
Which types of algorithms are suitable for this problem?
Use RANSAC:
Repeat the following steps a fixed number of times:
Randomly select as many pairs as are necessary to compute the transformation parameters.
Compute the parameters.
Compute the subset of pairs that have small projection error (the 'consensus set').
If the consensus set is large enough, compute a projection for it (e.g. with Least Squares).
Compute the consensus set's projection error.
Remember the model if it is the best you found so far.
You have to experiment to find good values for
"a fixed number of times"
"small projection error"
"consensus set is large enough".
The simplest approach is to compute your transformation based on all points, compute the residual for each point, and remove the points with high residuals until you reach an acceptable transformation or hit the minimum acceptable number of input points. The residual for any given point is the distance between the forward-transformed point and the intended target point. (A sketch of this trimming loop follows the note below.)
Note that the residuals of an affine transformation and of a Helmert (conformal) transformation will be very different, as these transformations do different things. The non-uniform scale of the affine transformation gives it more 'stretch' and hence leads to smaller residuals.
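Here is a minimal sketch of that iterative trimming loop; the residual threshold and minimum pair count are illustrative, and the helper fits an affine transform by least squares:

```python
# Fit on all pairs, drop the single worst-fitting pair, and repeat until the
# largest residual is acceptable or too few pairs remain.
import numpy as np

def lstsq_affine(src, dst):
    # dst ~ [src, 1] @ M, with M a 3x2 matrix (linear part plus translation)
    X = np.hstack([src, np.ones((len(src), 1))])
    M, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return M

def trim_fit(src, dst, max_residual=1.0, min_pairs=4):
    idx = np.arange(len(src))
    while True:
        M = lstsq_affine(src[idx], dst[idx])
        pred = np.hstack([src[idx], np.ones((len(idx), 1))]) @ M
        residuals = np.linalg.norm(pred - dst[idx], axis=1)
        if residuals.max() <= max_residual or len(idx) <= min_pairs:
            return M, idx                            # transform and surviving pair indices
        idx = np.delete(idx, residuals.argmax())     # drop the worst-fitting pair
```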
