Algorithm: 2D transformation, find outlying pairs of points and omit them

I am looking for the following type of algorithm:
There are n matched pairs of points in 2D. How can I identify outlying pairs of points according to Affine / Helmert transformation and omit them from the transformation key? We do not know the exact number of such outlying pairs.
I cannot use the Trimmed Least Squares method because it rests on the basic assumption that a known percentage k of the pairs is correct. But we do not have any information about the sample and do not know k. In such a sample, all pairs could be correct, or the opposite.
Which types of algorithms are suitable for this problem?

Repeat the following steps a fixed number of times:
Randomly select as many pairs as are necessary to compute the transformation parameters.
Compute the parameters.
Compute the subset of pairs that have small projection error (the 'consensus set').
If the consensus set is large enough, compute a projection for it (e.g. with Least Squares).
Compute the consensus set's projection error.
Remember the model if it is the best you found so far.
You have to experiment to find good values for
"a fixed number of times"
"small projection error"
"consensus set is large enough".
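The steps above can be sketched in Python as follows. This is only a minimal sketch under assumptions that are not in the original: it uses the Helmert (similarity) model, represents each 2D point as a complex number so that the transformation reads dst = a*src + b, and ranks models by consensus-set size first and RMS error second. The function names and the default values of the three tuning parameters are placeholders to experiment with.

```python
import random

def fit_helmert(src, dst):
    """Least-squares Helmert fit dst ~ a*src + b (points as complex numbers)."""
    n = len(src)
    ms, md = sum(src) / n, sum(dst) / n
    den = sum(abs(s - ms) ** 2 for s in src)
    a = sum((s - ms).conjugate() * (d - md) for s, d in zip(src, dst)) / den
    return a, md - a * ms

def ransac_helmert(src, dst, iterations=200, tol=0.5, min_consensus=4, seed=0):
    """Return (rms, a, b, inlier_indices) of the best model found, or None."""
    rng = random.Random(seed)
    idx = range(len(src))
    best = None
    for _ in range(iterations):
        # Two pairs are the minimum needed for a Helmert transformation.
        i, j = rng.sample(idx, 2)
        if src[i] == src[j]:
            continue  # degenerate sample, cannot determine scale/rotation
        a, b = fit_helmert([src[i], src[j]], [dst[i], dst[j]])
        # Consensus set: pairs whose residual is small under this model.
        inliers = [k for k in idx if abs(a * src[k] + b - dst[k]) <= tol]
        if len(inliers) < min_consensus:
            continue
        # Refit on the whole consensus set with least squares.
        a, b = fit_helmert([src[k] for k in inliers], [dst[k] for k in inliers])
        rms = (sum(abs(a * src[k] + b - dst[k]) ** 2 for k in inliers)
               / len(inliers)) ** 0.5
        # Assumed ranking: larger consensus set wins, ties broken by lower RMS.
        if best is None or (len(inliers), -rms) > (len(best[3]), -best[0]):
            best = (rms, a, b, inliers)
    return best
```

The pairs that are not in the returned inlier list are the ones to omit from the transformation key. For an affine model the same loop applies, but the minimal sample is three pairs instead of two.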

The simplest approach is to compute your transformation based on all points, compute the residual for each point, and remove the points with high residuals until you reach an acceptable transformation or hit the minimum number of acceptable input points. The residual for any given point is the distance between the forward-transformed position of the point and its intended target point.
Note that the residuals from an affine transformation and from a Helmert (conformal) transformation will be very different, as these transformations do different things. The affine transformation has six parameters, including non-uniform scale, against four for the Helmert, so it has more 'stretch' and will hence lead to smaller residuals.
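A minimal Python sketch of this residual-trimming loop is given below. It makes several assumptions that the answer does not state: the model is a Helmert transformation fitted by least squares, points are stored as complex numbers, only the single worst pair is dropped per pass before refitting, and `max_residual` and `min_points` are placeholder thresholds.

```python
def fit_helmert(src, dst):
    """Least-squares Helmert fit dst ~ a*src + b (points as complex numbers)."""
    n = len(src)
    ms, md = sum(src) / n, sum(dst) / n
    den = sum(abs(s - ms) ** 2 for s in src)
    a = sum((s - ms).conjugate() * (d - md) for s, d in zip(src, dst)) / den
    return a, md - a * ms

def trim_outliers(src, dst, max_residual=0.5, min_points=3):
    """Drop the worst pair and refit until all residuals are acceptable."""
    keep = list(range(len(src)))
    while True:
        a, b = fit_helmert([src[k] for k in keep], [dst[k] for k in keep])
        # Residual: distance between the transformed source and the target.
        residuals = {k: abs(a * src[k] + b - dst[k]) for k in keep}
        worst = max(residuals, key=residuals.get)
        if residuals[worst] <= max_residual or len(keep) <= min_points:
            return a, b, keep
        keep.remove(worst)
```

Because the first fit uses every pair, gross outliers distort it; this approach tends to work when outliers are few and can fail when they are many, which is where the random-sampling scheme is more robust.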


How to determine if a pattern of distribution is different from a random/uniform distribution

Here is my case:
Let's say we have 50 polygons and a point set distributed within these 50 polygons, so that each polygon has an associated point density. What I want to test is whether the distribution pattern of this data set (for example, the fluctuations in density across the 50 polygons) is a realization of spatial randomness.
The method I use is: in the uniform random case, the number of points in each ring follows a binomial distribution, i.e. X ~ B(n, p), where n is the total number of points and p is the probability of a point falling inside a particular polygon (p = Area_polygon/Area_semicircle). So for each polygon I can calculate the expected number of points, and from that the expected density. Then I can apply a one-way ANOVA to compare two groups: the actual density group and the theoretical density group.
However, I found a problem: to calculate the expected density, I divide the expected number by the polygon's area. But the expected number is
E = N * Area_polygon / Area_total,
where N is the total number of points, so the expected density is
D = E / Area_polygon = N / Area_total,
which means that the expected density is the same number for every polygon.
So in that case, is it still suitable to use a one-way ANOVA to compare my actual density group to a group in which all numbers are the same?
What if I use counts rather than densities? Or are there any other, more suitable tests?
You may want to look up a method called "quadrat test". It is explained in the online help for the function quadrat.test in the R package spatstat and more extensively in the spatstat book. (Disclaimer: I'm a coauthor.)
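To illustrate the idea behind such a test (this is not the spatstat implementation), here is a hedged Python sketch. It compares the observed counts per polygon with the expected counts N * Area_polygon / Area_total using a Pearson chi-square statistic, and estimates a p-value by simulating uniform random placement with the total number of points held fixed. The function names, the Monte Carlo approach and the number of simulations are assumptions made for this example.

```python
import random

def chi2_stat(counts, areas):
    """Pearson statistic of observed counts against area-proportional counts."""
    n, total = sum(counts), sum(areas)
    expected = [n * a / total for a in areas]
    return sum((o - e) ** 2 / e for o, e in zip(counts, expected))

def monte_carlo_pvalue(counts, areas, nsim=199, seed=0):
    """Return (statistic, p-value) under uniform randomness with n fixed."""
    rng = random.Random(seed)
    n = sum(counts)
    observed = chi2_stat(counts, areas)
    cells = range(len(areas))
    hits = 0
    for _ in range(nsim):
        # Drop n points at random; each lands in a polygon with
        # probability proportional to that polygon's area.
        sim = [0] * len(areas)
        for c in rng.choices(cells, weights=areas, k=n):
            sim[c] += 1
        if chi2_stat(sim, areas) >= observed:
            hits += 1
    return observed, (hits + 1) / (nsim + 1)
```

Working with counts avoids the problem in the question: the expected counts differ between polygons of different area even though the expected density is constant.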

Procrustes analysis with unequal number of points

As far as I understand Procrustes analysis takes into account the one-to-one ordering of the points across shapes. Therefore, you cannot run the algorithm if you have an unequal number of "anchor" or "landmark" points.
Is there another algorithm for shape alignment that works with unequal number of points across shapes? Say, minimizes the RMSE of the distance of points in one shape to the closest points in the other shape.
Procrustes analysis can be seen as the final part of "point set registration", since you assume that you already know the correspondences and want to align the points using a rigid transformation.
However, if your correspondences are unknown (or noisy), as in the case of two 3D scanned shapes, then you need to do a complete registration, using for instance ICP (iterative closest point).
There are more sophisticated algorithms as well. Take into account that Point Set Registration is a special case of Shape Registration.
Unless the problem is constrained, in the early stages of point set matching you have little clue on the pose.
Global strategies include
choosing a few random correspondences, computing the corresponding transform and using it to find more correspondences; from there, estimate a goodness-of-fit score; repeat several times and keep the best score. [This is the RANSAC principle.]
instead of choosing randomly, detect "feature points" that exhibit special properties, such as forming "corners" (in case of curve-like clouds), or dense concentrations...; then the number of correspondences to be tried is much lessened.
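For the ICP suggestion above, a minimal 2D sketch in Python might look like the following. It is an illustration under assumptions, not a reference implementation: points are complex numbers, the transformation is rigid (rotation plus translation, no scaling), the nearest-neighbour search is brute force, and the iteration count is a placeholder. The two point sets may have different sizes, since each moving point is simply paired with its closest fixed point.

```python
def fit_rigid(src, dst):
    """Best rotation a (|a| = 1) and translation b with dst ~ a*src + b."""
    ms, md = sum(src) / len(src), sum(dst) / len(dst)
    s = sum((p - ms).conjugate() * (q - md) for p, q in zip(src, dst))
    a = s / abs(s) if s else 1 + 0j
    return a, md - a * ms

def icp(moving, fixed, iterations=30):
    """Align `moving` to `fixed`; return (a, b, rmse to the closest points)."""
    a, b = 1 + 0j, 0j
    for _ in range(iterations):
        # Pair every moving point with its closest fixed point.
        nearest = [min(fixed, key=lambda q: abs(q - (a * p + b)))
                   for p in moving]
        # Solve the Procrustes step for the current correspondences.
        a, b = fit_rigid(moving, nearest)
    sq = [min(abs(q - (a * p + b)) for q in fixed) ** 2 for p in moving]
    return a, b, (sum(sq) / len(sq)) ** 0.5
```

ICP only converges to the right answer from a reasonably good starting pose, which is why the global strategies listed above are typically used to initialise it.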

To make a distance matrix or to repeatedly calculate distance

I'm working on an implementation of the K-medoids algorithm. It is a clustering algorithm, and one of its steps involves finding the most representative point in a cluster.
So, here's the thing:
I have a certain number of clusters
Each cluster contains a certain number of points
I need to find the point in each cluster that results in the least error if it is picked as the cluster representative
Distance from each point to all the other in the cluster needs to be calculated
This distance calculation could be simple as Euclidean or more complex like DTW (Dynamic Time Warping) between two signals
There are two approaches. One is to calculate a distance matrix that stores the distances between all pairs of points in the dataset. The other is to calculate distances during clustering, which means that the distances between some points will be calculated repeatedly.
On one hand, to build the distance matrix you must calculate the distances between all points in the whole dataset, and some of the calculated values will never be used.
On the other hand, if you don't build the distance matrix, you will repeat some calculations over a number of iterations.
Which is the better approach?
I'm also considering MapReduce implementation, so opinions from that angle are also welcome.
A third approach could be a combination of both: evaluate the distance matrix lazily. Initialize a matrix with default values (unrealistic values, such as negative ones), and when you need the distance between two points, check whether the value is already present in the matrix. If it is, just take it from there.
Otherwise, calculate it and store it in the matrix.
This approach minimizes the number of distance calculations (each pair is computed at most once, and only if it is needed) at the cost of more branches in the code and a few more instructions. However, thanks to branch predictors, I assume this overhead will not be that dramatic.
I predict it will have better performance when the distance calculation is relatively expensive.
Another optimization could be to switch dynamically to a plain matrix implementation (and calculate the remaining part of the matrix) when the number of already calculated entries exceeds a certain threshold. This can be achieved pretty nicely in OOP languages by switching the implementation of the interface when the threshold is met.
Which implementation is actually better will depend heavily on the cost of the distance function and on the data you are clustering, as some data sets will need the same distances more often than others.
I suggest doing a benchmark, and using statistical tools to evaluate which method is actually better.
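A small Python sketch of the lazy variant is shown below. It departs from the description above in one respect: instead of a full matrix pre-filled with sentinel values, it uses a dictionary keyed by index pairs, which avoids allocating entries that are never used. The class and function names are hypothetical, and the distance function is assumed to be symmetric.

```python
import math

class LazyDistanceMatrix:
    """Computes each pairwise distance at most once, on first request."""

    def __init__(self, points, distance):
        self.points = points
        self.distance = distance
        self.cache = {}
        self.evaluations = 0  # how many times `distance` was really called

    def __call__(self, i, j):
        if i == j:
            return 0.0
        key = (i, j) if i < j else (j, i)  # symmetric: store one triangle
        if key not in self.cache:
            self.cache[key] = self.distance(self.points[i], self.points[j])
            self.evaluations += 1
        return self.cache[key]

def medoid(cluster, dist):
    """Index in `cluster` with the smallest total distance to the others."""
    return min(cluster, key=lambda i: sum(dist(i, j) for j in cluster))

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
          (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
dist = LazyDistanceMatrix(points, math.dist)
medoids = [medoid([0, 1, 2], dist), medoid([3, 4, 5], dist)]
```

The `evaluations` counter makes it easy to benchmark this against a precomputed matrix on your own data, for example with an expensive distance such as DTW.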

Good algorithm for finding subsets of point sets

I'm trying to find suitable algorithms for searching subsets of 2D points in larger set.
Any ideas on how one could achieve this? Note that the transformations are just rotation and scaling.
It seems that the most closely related problem is Point set registration [1].
I was experimenting with implementations of CPD and other rigid and non-rigid algorithms, but they don't seem to perform too well at finding small subsets in larger sets of points.
Another approach could be to use star tracking algorithms like the Angle method mentioned in [2], or more robust methods like [3]. But again, they all seem to be meant for large input sets and target sets. I'm looking for something less reliable but more minimalistic...
Thanks for any ideas!
Here are some papers probably related to your question:
Geometric Pattern Matching under Euclidean Motion (1993) by L. Paul Chew, Michael T. Goodrich, Daniel P. Huttenlocher, Klara Kedem, Jon M. Kleinberg, Dina Kravets.
A fast expected time algorithm for the 2-D point pattern (2004) by Wamelen, Iyengar.
Simple algorithms for partial point set pattern matching under rigid motion (2006) by Bishnu, Das, Nandy, Bhattacharya.
Exact and approximate Geometric Pattern Matching for point sets in the plane under similarity transformations (2007) by Aiger and Kedem.
And by the way, your last reference reminded me of:
An Application of Point Pattern Matching in Astronautics (1994) by G. Weber, L. Knipping and H. Alt.
I think you should start with a subset of the input points and determine the required transformation to match a subset of the large set. For example:
choose any two points of the input, say A and B.
map A and B to a pair of points of the large set. This determines the scale and two candidate rotation angles (one for each order in which A and B can be assigned to the pair)
apply the same scaling and rotation to a third input point C and check the large set to see if a point exists there. You'll have to check two positions, one for each rotation angle. If the point C exists where it should be in the large set, you can check the rest of the points.
repeat for each pair of points in the large set
I think you could also try to match a subset of 3 input points, knowing that the angles of a triangle will be invariant under scaling and rotations.
Those are my ideas, I hope they help solve your problem.
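The pair-matching idea above can be sketched in Python as follows. This is a brute-force illustration under assumptions of my own: points are complex numbers, the transformation is rotation plus uniform scaling plus translation (z -> a*z + b), `tol` is a placeholder tolerance, and the function name is hypothetical. Iterating over ordered pairs of the large set covers both rotation angles mentioned above.

```python
from itertools import permutations

def find_pattern(pattern, points, tol=1e-6):
    """Return (a, b, matched_indices) for the first placement found, or None."""
    A, B = pattern[0], pattern[1]
    for i, j in permutations(range(len(points)), 2):
        # Mapping A -> points[i] and B -> points[j] fixes scale and rotation.
        a = (points[j] - points[i]) / (B - A)
        b = points[i] - a * A
        match = []
        for p in pattern:
            target = a * p + b
            k = min(range(len(points)), key=lambda m: abs(points[m] - target))
            if abs(points[k] - target) > tol:
                break  # this pattern point has no partner: reject the pair
            match.append(k)
        else:
            return a, b, match
    return None
```

The cost is on the order of len(points)^2 candidate pairs, each checked against the pattern with a linear scan; a spatial index for the nearest-point lookup would be the obvious refinement for larger sets.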
I would try the Iterative Closest Point algorithm. A simple version like the one you need should be easy to implement.
Take a look at geometric hashing. It allows finding geometric patterns under different transformations. If you use only rotation and scale, it will be quite simple.
The main idea is to encode the pattern in "native" coordinates, which are invariant under the transformations.
You can try a geohash. Convert the point coordinates to binary and interleave the bits. Measure the distance and compare it with the original. You can also try to rotate the geohash, i.e. the Z-order (Morton) curve.

Measuring distance between vectors

I have a set of 300,000 or so vectors which I would like to compare in some way, and given one vector I want to be able to find the closest vector. I have thought of the following methods:
Simple Euclidean distance
Cosine similarity
Use a kernel (for instance Gaussian) to calculate the Gram matrix.
Treat the vector as a discrete probability distribution (which makes sense to do) and calculate some divergence measure.
I do not really understand when it is useful to do one rather than the other. My data has a lot of zero elements. With that in mind, is there some general rule of thumb as to which of these methods is the best?
Sorry for the weak question, but I had to start somewhere...
Thank you!
Your question is not quite clear. Are you looking for a distance metric between vectors, or an algorithm to efficiently find the nearest neighbour?
If your vectors just contain a numeric type such as doubles or integers, you can find a nearest neighbour efficiently using a structure such as the kd-tree (since you are just looking at points in d-dimensional space). See, for other methods.
Otherwise, choosing a distance metric and algorithm is very much dependent on the content of the vectors.
If your vectors are very sparse in nature and if they are binary, you can use Hamming or Hellinger distance. When your vector dimensions are large, avoid using Euclidean (refer
Please refer to for a survey of distance/similarity measures, although the paper limits it to pair of probability distributions.
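To make the candidate measures concrete, here is a hedged Python sketch for sparse vectors. The representation (a dict mapping index to non-zero value) and the function names are assumptions chosen for this example; the Hellinger distance treats each non-negative vector as a discrete probability distribution after normalising it to sum to 1. The nearest-neighbour search is a plain linear scan, not an efficient index.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two sparse vectors."""
    keys = u.keys() | v.keys()
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero sparse vectors."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def hellinger(u, v):
    """Hellinger distance (0 to 1) after normalising each vector to sum 1."""
    su, sv = sum(u.values()), sum(v.values())
    keys = u.keys() | v.keys()
    total = sum((math.sqrt(u.get(k, 0.0) / su)
                 - math.sqrt(v.get(k, 0.0) / sv)) ** 2 for k in keys)
    return math.sqrt(0.5 * total)

def nearest(query, vectors, distance):
    """Index of the vector closest to `query` (brute-force linear scan)."""
    return min(range(len(vectors)), key=lambda i: distance(query, vectors[i]))
```

Note how the measures disagree: two vectors that differ only by a scale factor have cosine similarity 1 and Hellinger distance 0 but a non-zero Euclidean distance, so the choice depends on whether magnitude carries meaning in your data.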
