Suppose you are given a hand-drawn constellation (a 2D distribution of points) and a map of all stars, and are asked to find an actual star distribution sufficiently similar to the hand-drawn one (with Kolmogorov–Smirnov distance <= some_threshold), if one exists.
Is there a classical algorithm for this kind of approximate distribution search?
Otherwise, do others have insights on how to reduce the complexity of this problem? I continue to be stumped by the fact that the user's hand-drawn constellation has no notion of scale or rotation...
This question is not directly related to a particular programming language but is an algorithmic question.
What I have is a lot of samples of a 2D function. The samples are at random locations, they are not uniformly distributed over the domain, the sample values contain noise and each sample has a confidence-weight assigned to it.
What I'm looking for is an algorithm to reconstruct the original 2D function from the samples, i.e. a function y' = G(x0, x1) that approximates the original well and smoothly interpolates areas where samples are sparse.
It goes into the direction of what scipy.interpolate.griddata is doing, but with the added difficulty that:
the sample values contain noise - meaning that samples should not just be interpolated; nearby samples should also be averaged in some way to smooth out the sampling noise.
the samples are weighted, so samples with higher weight should contribute more strongly to the reconstruction than those with lower weight.
scipy.interpolate.griddata seems to do a Delaunay triangulation and then use the barycentric coordinates within the triangles to interpolate values. This doesn't seem compatible with my requirements of weighting samples and averaging out noise, though.
Can someone point me in the right direction on how to solve this?
Based on the comments, the function is defined on a sphere. That simplifies life because your region is both well-studied and nicely bounded!
First, decide how many Spherical Harmonic functions you will use in your approximation. The fewer you use, the more you smooth out noise. The more you use, the more accurate it will be. But if you use any of a particular degree, you should use all of them.
And now you just impose the condition that the sum of the squares of the weighted errors should be minimized. That will lead to a system of linear equations, which you then solve to get the coefficients of each harmonic function.
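A sketch of that procedure in Python (the function name and the test function are my own; I'm assuming SciPy's sph_harm angle convention, and degree_max = 3 is an arbitrary choice):

```python
import numpy as np
from scipy.special import sph_harm

def fit_weighted_sph_harm(theta, phi, values, weights, degree_max):
    """Weighted least-squares fit of real spherical harmonics up to degree_max.
    theta = azimuth in [0, 2*pi), phi = polar angle in [0, pi] (SciPy convention)."""
    cols = []
    for l in range(degree_max + 1):        # use ALL orders m of each degree l
        for m in range(-l, l + 1):
            Y = sph_harm(abs(m), l, theta, phi)
            if m < 0:
                cols.append(np.sqrt(2) * Y.imag)
            elif m == 0:
                cols.append(Y.real)
            else:
                cols.append(np.sqrt(2) * Y.real)
    A = np.column_stack(cols)
    w = np.sqrt(weights)                   # minimise the sum of weighted squared errors
    coef, *_ = np.linalg.lstsq(A * w[:, None], values * w, rcond=None)
    return coef, A

# usage: noisy, weighted samples of a smooth function on the sphere
rng = np.random.default_rng(0)
n = 500
theta = rng.uniform(0.0, 2.0 * np.pi, n)
phi = np.arccos(rng.uniform(-1.0, 1.0, n))   # uniformly distributed on the sphere
truth = 1.0 + np.cos(phi)
values = truth + rng.normal(0.0, 0.05, n)
weights = rng.uniform(0.5, 2.0, n)
coef, A = fit_weighted_sph_harm(theta, phi, values, weights, degree_max=3)
recon = A @ coef
```

Scaling each row of the design matrix (and the target) by the square root of its weight is the standard trick that turns a weighted least-squares problem into an ordinary one.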
As far as I understand, Procrustes analysis takes into account the one-to-one correspondence of points across shapes. Therefore, you cannot run the algorithm if the shapes have unequal numbers of "anchor" or "landmark" points.
Is there another algorithm for shape alignment that works with unequal number of points across shapes? Say, minimizes the RMSE of the distance of points in one shape to the closest points in the other shape.
Thanks.
Procrustes analysis can be seen as the final part of "point set registration", since you assume that you already know the correspondences and want to align the shapes using a rigid transformation:
https://en.m.wikipedia.org/wiki/Point_set_registration
However, if your correspondences are unknown (or noisy), as in the case of two 3D-scanned shapes, then you need to do a complete registration using, for instance, ICP (iterative closest point):
https://en.m.wikipedia.org/wiki/Iterative_closest_point
There are more sophisticated algorithms as well. Take into account that Point Set Registration is a special case of Shape Registration.
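A minimal point-to-point ICP sketch (a toy 2D implementation of my own, with no outlier handling; it also assumes the initial pose is already roughly correct, which is ICP's usual caveat):

```python
import numpy as np
from scipy.spatial import cKDTree

def icp(source, target, iters=30):
    """Point-to-point ICP in 2D: returns (R, t) so that source @ R.T + t ~ target."""
    src = source.copy()
    tree = cKDTree(target)
    R_total, t_total = np.eye(2), np.zeros(2)
    for _ in range(iters):
        _, idx = tree.query(src)               # closest-point correspondences
        matched = target[idx]
        mu_s, mu_t = src.mean(axis=0), matched.mean(axis=0)
        H = (src - mu_s).T @ (matched - mu_t)  # cross-covariance (Kabsch step)
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T)) # guard against reflections
        R = Vt.T @ np.diag([1.0, d]) @ U.T
        t = mu_t - R @ mu_s
        src = src @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total

# usage: align a cloud with a slightly rotated, shifted copy of itself
rng = np.random.default_rng(1)
A = rng.uniform(0.0, 1.0, (100, 2))
ang = 0.1
R_true = np.array([[np.cos(ang), -np.sin(ang)], [np.sin(ang), np.cos(ang)]])
B = A @ R_true.T + np.array([0.05, -0.02])
R, t = icp(A, B)
```

Each iteration alternates between picking nearest-neighbour correspondences and solving the best rigid transform for them in closed form, so the alignment error never increases.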
Unless the problem is constrained, in the early stages of point-set matching you have little clue about the pose.
Global strategies include
choosing a few random correspondences, computing the corresponding transform and using it to find more correspondences; from there, estimate a goodness-of-fit score; repeat several times and keep the best score. [This is the RANSAC principle.]
instead of choosing randomly, detect "feature points" that exhibit special properties, such as forming "corners" (in the case of curve-like clouds), or dense concentrations...; then the number of correspondences to be tried is much reduced.
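The first strategy, the RANSAC principle, can be sketched for 2D rigid alignment like this (a toy of my own; in 2D, two correspondences suffice to fix the rotation and translation, and tol/trials are arbitrary choices):

```python
import numpy as np

def ransac_rigid_2d(A, B, trials=3000, tol=0.05, seed=0):
    """Pick two random correspondences, derive the rigid transform they imply,
    and keep the hypothesis that explains the most points of A."""
    rng = np.random.default_rng(seed)
    best = (0, np.eye(2), np.zeros(2))
    for _ in range(trials):
        i, j = rng.choice(len(A), 2, replace=False)
        k, l = rng.choice(len(B), 2, replace=False)
        va, vb = A[j] - A[i], B[l] - B[k]
        ang = np.arctan2(vb[1], vb[0]) - np.arctan2(va[1], va[0])
        c, s = np.cos(ang), np.sin(ang)
        R = np.array([[c, -s], [s, c]])
        t = B[k] - R @ A[i]
        moved = A @ R.T + t
        # goodness-of-fit score: how many points of A land near some point of B
        d = np.linalg.norm(moved[:, None] - B[None], axis=2).min(axis=1)
        score = int((d < tol).sum())
        if score > best[0]:
            best = (score, R, t)
    return best

# usage: B is a rotated, shifted copy of A; RANSAC should explain every point
rng = np.random.default_rng(4)
A = rng.uniform(0.0, 1.0, (8, 2))
th = 0.7
R_true = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
B = A @ R_true.T + np.array([0.3, -0.1])
score, R, t = ransac_rigid_2d(A, B)
```

Unlike ICP, this needs no initial pose estimate; the price is the random search over candidate correspondences.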
I need to implement a "geometric median"-type algorithm that would apply to rigid bodies, meaning it would not only find a point minimizing the distance from a set of points, but would also take into account the orientation of the body. I haven't found a solution for this type of problem anywhere, while for the geometric median (or Weber or Fermat-Torricelli problem, or facilities location problem), there is a lot of information available, including the Weiszfeld algorithm (and modern improvements). I'm hoping someone will have references to possible solutions. I would have thought this to be a relatively common problem in registration, but maybe I just haven't found the right words to search for...
My problem could be formulated as follows: Say I have a "reference" rigid body with 3 non-colinear points (a triangle), and I measure the coordinates of the 3 points a bunch of times (with some error, or the object was moving a bit). I want to find a good "central location", that would minimize the sum of distances (not square distances) between each measured point and its corresponding centrally-located-object point. This is equivalent to the "multi-facility location problem" but with the extra constraints of fixed distances between the "facilities" and with each point pre-assigned to a facility (not necessarily the closest one).
Actually, I'm thinking instead of minimizing the sum for all the points, I'd only keep the max distance out of the 3 points for each measurement. (is that what's called "minimax"?) But I don't think that would make a big difference in the type of algorithm I'd have to use.
A possible difficulty compared to the geometric median could be that with the added freedom of rotations, the quantity to minimize is no longer convex (not 100% sure, but I think). I'm hoping I can still use a similar algorithm as Weiszfeld's (which is a subgradient method), and hopefully this has been investigated previously. Thanks for any help!
P.S. I'll be doing this in Matlab.
I can't find any research on this subject. The first thing I would do is to use Weiszfeld's algorithm without rigidity constraints to find the geometric medians of the individual points, define Lagrange multipliers corresponding to deviations of the edges of the object from their expected values, and use gradient descent to find a constrained local minimum. I can't prove that it will always work, but, intuitively, it should as long as the deviations are sufficiently small.
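For reference, the unconstrained Weiszfeld iteration that would serve as the starting point is only a few lines (sketched in Python rather than Matlab; the eps guard against landing exactly on a sample point is my own addition):

```python
import numpy as np

def weiszfeld(points, iters=200, eps=1e-12):
    """Weiszfeld iteration for the (unconstrained) geometric median."""
    x = points.mean(axis=0)                   # centroid as the starting guess
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(points - x, axis=1), eps)
        w = 1.0 / d                           # inverse-distance weights
        x = (points * w[:, None]).sum(axis=0) / w.sum()
    return x

# usage: for a symmetric configuration the median is the centre
pts = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
```

Each step is just a weighted average of the samples with weights 1/distance, which is why the method is classified as a (sub)gradient scheme.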
I'm writing an app that looks up points in two-dimensional space using a k-d tree. It would be nice, during development, to be able to "see" the nearest-neighbor zones surrounding each point.
In the attached image, the red points are points in the k-d tree, and the blue lines surrounding each point bound the zone where a nearest neighbor search will return the contained point.
The image was created thusly:
# tree = scipy.spatial.cKDTree of the red points
for x in range(width):
    for y in range(height):
        (da, db), _ = tree.query((x, y), k=2)   # nearest and second-nearest distances
        if abs(da - db) < 4:
            image.putpixel((x, y), BLUE)
This algorithm has two problems:
(more important) It's slow on my (reasonably fast Core i7) computer.
(less important) It's sloppy, as you can see by the varying widths of the blue lines.
What is this "visualization" of a set of points called?
What are some good algorithms to create such a visualization?
This is called a Voronoi diagram and there are many excellent algorithms for generating them efficiently. The one I've heard about most is Fortune's algorithm, which runs in time O(n log n), though other algorithms exist for this problem.
Hope this helps!
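If you're working in Python, you may not need to implement anything yourself: SciPy exposes a Voronoi builder (backed by Qhull). A small sketch with an arbitrary five-point layout:

```python
import numpy as np
from scipy.spatial import Voronoi

# five sites: the four corners of a unit square plus its centre
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
vor = Voronoi(points)
print(vor.vertices)        # corners of the cell boundaries (the "blue lines")
print(vor.ridge_vertices)  # vertex index pairs; -1 marks an edge going to infinity
```

For this layout the four Voronoi vertices sit at the midpoints of the square's edges, each equidistant from three sites.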
Jacob,
hey, you found an interesting way of generating this Voronoi diagram, even though it is not so efficient.
The less important issue first: the varying-thickness boundaries that you get, those butterfly shapes, are in fact the area between the two branches of a hyperbola, precisely the hyperbola given by the equation |da - db| = 4. To get a line of constant thickness instead, you have to modify this criterion and replace it by the distance to the bisector of the two nearest neighbors, say A and B; using vector calculus: | AP·AB/||AB|| - ||AB||/2 | < 4.
The more important issue: there are two well-known efficient solutions for constructing the Voronoi diagram of a set of points: Fortune's sweepline algorithm (as mentioned by templatetypedef) and Preparata & Shamos' divide-and-conquer solution. Both run in optimal time O(N log N) for N points, but aren't so easy to implement.
These algorithms construct the Voronoi diagram as a set of line segments and half-lines. Check http://en.wikipedia.org/wiki/Voronoi_diagram.
The paper "Primitives for the Manipulation of General Subdivisions and the Computation of Voronoi Diagrams" describes both algorithms in a somewhat high-level framework, taking care of all implementation details; the article is difficult, but the algorithms are implementable.
You may also have a look at "A straightforward iterative algorithm for the planar Voronoi diagram", which I never tried.
A totally different approach is to build the distance map from the given points directly, for example by means of Dijkstra's algorithm: starting from the given points, you grow the boundary of the area within a given distance from every point, and you stop growing when two boundaries meet. [More explanations required.] See http://1.bp.blogspot.com/-O6rXggLa9fE/TnAwz4f9hXI/AAAAAAAAAPk/0vrqEKRPVIw/s1600/distmap-20-seed4-fin.jpg
Another good starting point (for efficiently computing the distance map) can be "A general algorithm for computing distance transforms in linear time".
From personal experience: Fortune's algorithm is a pain to implement. The divide and conquer algorithm presented by Guibas and Stolfi isn't too bad; they give detailed pseudocode that's easy to transcribe into a procedural programming language. Both will blow up if you have nearly degenerate inputs and use floating point, but since the primitives are quadratic, if you can represent coordinates as 32-bit integers, then you can use 64 bits to carry out the determinant computations.
Once you get it working, you might consider replacing your kd-tree algorithms, which have a Theta(√n) worst case, with algorithms that work on planar subdivisions.
You can find a great implementation for it at D3.js library: http://mbostock.github.com/d3/ex/voronoi.html
I'm asking this question out of curiosity, since my quick-and-dirty implementation seems to be good enough. However, I'm curious what a better implementation would be.
I have a graph of real-world data. There are no duplicate X values and the X value increments at a consistent rate across the graph, but the Y data is based on real-world output. I want to find the nearest point on the graph to an arbitrary given point P programmatically. I'm trying to find an efficient (i.e. fast) algorithm for doing this. I don't need the exact closest point; I can settle for a point that is 'nearly' the closest.
The obvious lazy solution is to increment through every single point in the graph, calculate the distance, and then find the minimum of the distance. This however could theoretically be slow for large graphs; too slow for what I want.
Since I only need an approximate closest point I imagine the ideal fastest equation would involve generating a best fit line and using that line to calculate where the point should be in real time; but that sounds like a potential mathematical headache I'm not about to take on.
My solution is a hack that works only because I assume my point P isn't arbitrary; namely, I assume that P will usually be close to my graph line, and when that happens I can rule the distant X values out of consideration. I calculate how close P is to the point on the line that shares P's X coordinate, and use that distance to work out the largest/smallest X value that could possibly be a closer point.
I can't help but feel there should be a faster algorithm than my solution (which is only useful because I assume that 99% of the time my point P will be close to the line already). I tried googling for better algorithms, but found so many that didn't quite fit that it was hard to find what I was looking for amongst all the clutter of inappropriate algorithms. So, does anyone here have a suggested algorithm that would be more efficient? Keep in mind I don't need a full algorithm, since what I have works for my needs; I'm just curious what the proper solution would have been.
If you store the [x,y] points in a quadtree you'll be able to find the closest one quickly (something like O(log n)). I think that's the best you can do without making assumptions about where the point is going to be. Rather than repeat the algorithm here have a look at this link.
Your solution is pretty good; by examining how the points vary in y, couldn't you calculate a bound for the number of points along the x axis you need to examine, instead of using an arbitrary one?
Let's say your point is P = (x, y) and your real-world data is a function y = f(x).
Step 1: Calculate r = |f(x) - y|.
Step 2: Find the points in the interval I = (x - r, x + r).
Step 3: Find the closest point in I to P.
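Those three steps can be sketched as follows (my own variable names; it assumes xs is sorted, and I use the full distance to the candidate sample as the radius r, since P's x may fall between samples):

```python
import numpy as np

def nearest_point(xs, ys, px, py):
    """xs must be sorted; returns the index of the sample closest to (px, py)."""
    # Step 1: the distance to the sample whose x is nearest px gives the radius r
    i = min(np.searchsorted(xs, px), len(xs) - 1)
    r = np.hypot(xs[i] - px, ys[i] - py)
    # Step 2: only samples with x in [px - r, px + r] can possibly be closer
    lo = np.searchsorted(xs, px - r)
    hi = np.searchsorted(xs, px + r, side='right')
    # Step 3: brute-force search restricted to that window
    d2 = (xs[lo:hi] - px) ** 2 + (ys[lo:hi] - py) ** 2
    return lo + int(np.argmin(d2))

# usage on a sine-shaped "graph"
xs = np.linspace(0.0, 10.0, 101)
ys = np.sin(xs)
idx = nearest_point(xs, ys, 3.0, 0.2)
```

When P is close to the line, r is small and only a handful of samples survive the window, which is exactly the speed-up the questioner describes.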
If you can use a data structure, some common data structures for spatial searching (including nearest neighbour) are...
quad-tree (and octree etc).
kd-tree
bsp tree (only practical for a static set of points).
r-tree
The r-tree comes in a number of variants. It's very closely related to the B+ tree, but with (depending on the variant) different orderings on the items (points) in the leaf nodes.
The Hilbert R tree uses a strict ordering of points based on the Hilbert curve. The Hilbert curve (or rather a generalization of it) is very good at ordering multi-dimensional data so that nearby points in space are usually nearby in the linear ordering.
In principle, the Hilbert ordering could be applied by sorting a simple array of points. The natural clustering in this would mean that a search would usually only need to search a few fairly-short spans in the array - with the complication being that you need to work out which spans they are.
I used to have a link for a good paper on doing the Hilbert curve ordering calculations, but I've lost it. An ordering based on Gray codes would be simpler, but not quite as efficient at clustering. In fact, there's a deep connection between Gray codes and Hilbert curves - that paper I've lost uses Gray code related functions quite a bit.
EDIT - I found that link - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.133.7490
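To give a feel for the sort-by-curve-key idea, here is the even simpler Morton (Z-order) key - plain bit interleaving, with worse clustering than the Hilbert or Gray-code orderings discussed above, but only a few lines (a sketch of my own):

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of non-negative integer coordinates (Z-order)."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)       # x bits go to even positions
        key |= ((y >> b) & 1) << (2 * b + 1)   # y bits go to odd positions
    return key

# sorting a plain array of points by the curve key
points = [(3, 5), (10, 2), (4, 4), (1, 1)]
points.sort(key=lambda p: morton_key(*p))
```

Points adjacent in the sorted array tend to be nearby in the plane, so a spatial query usually only has to scan a few short spans, as described above.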