I have a set of N objects, and I'd like to compute an NxN distance matrix. Sometimes my set of N objects is very large, and I'd like to compute an approximation to the NxN distance matrix by only computing a subset of the distance comparisons.
Can anyone point me in the direction of something that calculates approximations to a full distance matrix? I have some ideas in mind, but I'd like to avoid re-inventing the wheel.
Edit: An example of the type of algorithm would take advantage of the fact that if there is a very small distance between object A and object B, and there is a very small distance between object B and object C, there has to be a somewhat short distance between objects A and C.
I had this same question and ended up writing Python code for it:
https://github.com/jpeterbaker/lazyDistance
README.md explains how the triangle inequality can be used to update upper and lower bounds for each distance.
Just run the Python file as a script for an example in 2-dimensional space. The plotted lines are the only distances that were actually calculated.
In my version, the time savings aren't about having a large number of objects. As I've written it, it's an O(n^4) algorithm, so it's actually worse than just calculating all distances if the number of objects is large. But my method will save time when you have a modest number of objects and the distance function is very expensive to calculate. It assumes that it is faster to do several O(n^2) operations than to perform a single distance measurement.
If n is large, you could look for cheaper methods to decide which distance to calculate next (methods that don't involve arithmetic over the n^2 entries of the distance-bounds matrices). You also may not need to update all 2*n^2 bounds on every step the way this code does.
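For reference, here is a minimal sketch of maintaining triangle-inequality bounds while measuring only a budgeted number of true distances. It is not the lazyDistance code itself; the function name is illustrative and the pair-selection rule is a deliberately naive placeholder.

```python
import itertools
import numpy as np

def approx_distance_bounds(points, dist, budget):
    """Illustrative sketch: return (lower, upper) bound matrices on all
    pairwise distances after evaluating only `budget` true distances.
    Pair selection is deliberately naive (first `budget` pairs)."""
    n = len(points)
    upper = np.full((n, n), np.inf)
    lower = np.zeros((n, n))
    np.fill_diagonal(upper, 0.0)

    for i, j in list(itertools.combinations(range(n), 2))[:budget]:
        d = dist(points[i], points[j])
        upper[i, j] = upper[j, i] = d
        lower[i, j] = lower[j, i] = d
        # Triangle inequality: d(a,b) <= d(a,i) + d(i,j) + d(j,b)
        #                  and d(a,b) >= d(i,j) - d(a,i) - d(j,b)
        U = upper.copy()
        upper = np.minimum(U, np.minimum(U[:, [i]] + d + U[[j], :],
                                         U[:, [j]] + d + U[[i], :]))
        lower = np.maximum(lower, np.maximum(d - U[:, [i]] - U[[j], :],
                                             d - U[:, [j]] - U[[i], :]))
    return lower, upper

# Example: 50 random 3-D points, Euclidean distance, 200 measured pairs.
lo, up = approx_distance_bounds(np.random.rand(50, 3),
                                lambda a, b: np.linalg.norm(a - b), 200)
```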
Honestly, I think it depends how close you want your approximation to be and how big your subset is. If you just want some overall feel of what the matrix will look like, you can do simple linear interpolation on a random subset (including the maximal and minimal nodes) and get pretty accurate(tm) results.
I think the real trick here is figuring out the heuristic (linear, quadratic, etc interpolation) and the subset size. You could also figure out the distance matrices of various subsets and then interpolate those matrices with some method (linear, spherical linear, cubic).
Depending on your initial sample, it's pretty much heuristic trial and error until you go "oh, that's good enough for what I need".
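As a very rough illustration of the subset idea (the data, sample size, and seed below are arbitrary placeholders), you could compute the exact distance matrix on a random sample and treat it as a thumbnail of the full matrix:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
objects = rng.normal(size=(10_000, 8))          # placeholder data set

subset = rng.choice(len(objects), size=500, replace=False)
subset_dm = squareform(pdist(objects[subset]))  # exact 500 x 500 block
# `subset_dm` can then be interpolated/extrapolated as described above.
```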
Are your "objects" on a network? If the objects are in a network, you can use this or this that yields the all-pairs shortest paths. If not, you're pretty much stuck with calculated all the n x n distances, I think.
The solution you require is similar to what we commonly see in a graph: you can use an all-pairs shortest path algorithm for finding the distances, and you can also look at Johnson's algorithm.
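If the objects do live on a (sparse) weighted graph, a sketch with scipy's csgraph module might look like this (the small adjacency matrix is made up; method='J' selects Johnson's algorithm):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

# Made-up weighted adjacency matrix; zero entries mean "no direct edge".
adjacency = csr_matrix(np.array([[0, 3, 0, 7],
                                 [3, 0, 1, 0],
                                 [0, 1, 0, 2],
                                 [7, 0, 2, 0]], dtype=float))

# All-pairs shortest path distances; method='J' uses Johnson's algorithm.
dist_matrix = shortest_path(adjacency, method='J', directed=False)
```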
In hill climbing for 1 dimension, I try two neighbors - a small delta to the left and one to the right of my current point, and then keep the one that gives a higher value of the objective function. How do I extend it to an n-dimensional space? How does one define a neighbor for an n-dimensional space? Do I have to try 2^n neighbors (a delta applied to each of the dimension)?
You don't need to compare each pair of neighbors; you need to compute a set of neighbors, e.g. on a circle (sphere/hypersphere in higher dimensions) with a radius of delta, and then take the one with the highest value to "climb up" (a sketch of this appears below). In any case you will discretize the neighborhood of your current solution and compute the score function for each neighbor. When you can differentiate your function, then gradient ascent/descent based algorithms may solve your problem:
1) Compute the gradient (direction of steepest ascent)
2) Go a small step into the direction of the gradient
3) Stop if solution does not change
A common problem with those algorithms is that you often only find local maxima/minima. You can find a great overview of gradient descent/ascent algorithms here: http://sebastianruder.com/optimizing-gradient-descent/
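Here is a minimal sketch of the hypersphere-neighbour idea (the objective function, radius delta, and number of sampled neighbours are all placeholders):

```python
import numpy as np

def hill_climb(f, x0, delta=0.1, n_neighbors=20, max_iter=1000, seed=0):
    """Sketch: at each step, sample neighbours on a hypersphere of radius
    `delta` around the current point and move to the best one (maximization)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(max_iter):
        dirs = rng.normal(size=(n_neighbors, x.size))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit directions
        candidates = x + delta * dirs
        values = np.array([f(c) for c in candidates])
        best = int(values.argmax())
        if values[best] <= fx:            # no sampled neighbour improves: stop
            break
        x, fx = candidates[best], values[best]
    return x, fx

# Example: climb a simple concave objective in 5 dimensions.
x_best, f_best = hill_climb(lambda v: -np.sum((v - 1.0) ** 2), np.zeros(5))
```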
If you are using IEEE-754 floating point numbers then the obvious answer is something like (2^52*(log_2(delta)+1023))^(n-1)+1 if delta>=2^(-1022) (more or less depending on your search space...) as that is the only way you can be certain that there are no more neighboring solutions with a distance of delta.
Even assuming you instead take a random fixed-size sample of all points within a given distance delta, let's say delta=.1, you would still have the problem that if the distance from the local optimum were .0001, the probability of finding an improvement in just 1 dimension would be less than .0001/.1/2=0.05%, so you would need to take more and more random samples as you get closer to the local optimum (whose value you don't know...).
Obviously hill climbing is not intended for the real number space or theoretical graph spaces with infinite degree. You should instead be using a global search algorithm.
One example of a multidimensional search algorithm which needs only O(n) neighbours instead of O(2^n) neighbours is the Torczon simplex method described in Multidirectional search: A direct search algorithm for parallel machines (1989). I chose this over the more widely known Nelder-Mead method because the Torczon simplex method has a convergence proof (convergence to a local optimum given some reasonable conditions).
Given a very large set of GPS coordinates, is there a time/computationally efficient way to determine whether an input GPS coordinate is within a given radius of any point in the set? Pre-computation is acceptable. The best I could think of is an O(N) implementation but just wondering if there is a better way to approach this problem.
You should look into range trees. Preprocess the points to create the range tree and then use it to search for all the points in a specific range. Space complexity is O(n log n). Time complexity is O(n log n + k), where k is the number of points in the search radius.
The given coordinate can be tested with time complexity slightly above O(log4(N)), assuming a radius that is small compared to the space over which the points are spread.
(log4 is the logarithm with base 4.)
A quadtree works extremely well for this task.
An alternative is the R-Tree.
If you are looking for a Java implementation, you may try one of the KD-Trees I have posted in full here. With that you are able to find the nearest point to your new GPS input. Then you just need to check the real distance, to see whether it is inside the radius you are interested in or not.
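In Python, a similar nearest-then-check approach can be sketched with scipy's cKDTree (the coordinates below are placeholders, and plain Euclidean distance is assumed, so project your GPS coordinates appropriately first):

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(1_000_000, 2)      # placeholder coordinates
tree = cKDTree(points)                     # pre-computation

def any_point_within(query_point, radius):
    """True if at least one stored point lies within `radius` of the query."""
    nearest_dist, _ = tree.query(query_point)
    return nearest_dist <= radius

print(any_point_within((0.5, 0.5), 0.01))
```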
I'm looking for an algorithm to find the best fit between a cloud of points and a sphere.
That is, I want to minimise

E = sum_i ( |P_i - C| - r )^2

where C is the centre of the sphere, r its radius, and each P_i a point in my set of n points. The variables are obviously Cx, Cy, Cz, and r. In my case, I can obtain a known r beforehand, leaving only the components of C as variables.
I really don't want to have to use any kind of iterative minimisation (e.g. Newton's method, Levenberg-Marquardt, etc) - I'd prefer a set of linear equations or a solution explicitly using SVD.
There are no matrix equations forthcoming. Your choice of E is badly behaved; its partial derivatives are not even continuous, let alone linear. Even with a different objective, this optimization problem seems fundamentally non-convex; with one point P and a nonzero radius r, the set of optimal solutions is the sphere about P.
You should probably reask on an exchange with more optimization knowledge.
You might find the following reference interesting, but I would warn you that you will need some familiarity with geometric algebra - particularly conformal geometric algebra - to understand the mathematics. However, the algorithm is straightforward to implement with standard linear algebra techniques and is not iterative.
One caveat: the algorithm, at least as presented, fits both the center and the radius; you may be able to work out a way to constrain the fit so that the radius is fixed.
Total Least Squares Fitting of k-Spheres in n-D Euclidean Space Using an (n+2)-D Isometric Representation. L. Dorst, Journal of Mathematical Imaging and Vision, 2014, pp. 1-21.
You can pull a copy from Leo Dorst's ResearchGate page.
One last thing, I have no connection to the author.
A short description of how to build the matrix equation can be found here.
I've seen that the WildMagic library uses an iterative method (at least in version 4).
You may be interested in the best-fit d-dimensional sphere, i.e. minimizing the variance of the population of squared distances to the center; it has a simple analytical solution (matrix calculus): see the appendix of the open-access paper of Cerisier et al. in J. Comput. Biol. 24(11), 1134-1137 (2017), https://doi.org/10.1089/cmb.2017.0061
It works when the data points are weighted (it works even for continuous distributions; as a by-product, when d=1, a well-known inequality is retrieved: the kurtosis is always greater than the squared skewness plus 1).
Difficult to do this without iteration.
I would proceed as follows:
1) find the overall midpoint, by averaging the (X,Y,Z) coords of all points
2) with that result, find the average distance Ravg to the midpoint, and decide whether that is good enough or you should proceed
3) remove points from your set whose distance is too far from the Ravg found in step 2
4) go back to step 1 (average the points again, yielding a better midpoint)
Of course, this will require some conditions for (2) and (4) that depend on the quality of your point cloud!
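A rough sketch of that loop (the convergence tolerance and the outlier cut-off are arbitrary choices of mine, not part of the answer above):

```python
import numpy as np

def trimmed_sphere_fit(points, cutoff=0.2, tol=1e-6, max_iter=50):
    """Sketch. points: (m, 3) array.  Average the points, measure the mean
    distance Ravg to that midpoint, drop points deviating from Ravg by more
    than cutoff * Ravg, and repeat until the midpoint stops moving."""
    pts = np.asarray(points, dtype=float)
    center = pts.mean(axis=0)
    r_avg = 0.0
    for _ in range(max_iter):
        d = np.linalg.norm(pts - center, axis=1)
        r_avg = d.mean()
        keep = np.abs(d - r_avg) <= cutoff * r_avg
        new_center = pts[keep].mean(axis=0)
        if np.linalg.norm(new_center - center) < tol:
            break
        center = new_center
    return center, r_avg
```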
Ian Coope has an interesting algorithm in which he linearized the problem using a change of variable. The fit is quite robust, and although it slightly redefines the condition of optimality, I've found it to be generally visually better, especially for noisy data.
A preprint of Coope's paper is available here: https://ir.canterbury.ac.nz/bitstream/handle/10092/11104/coope_report_no69_1992.pdf.
I found the algorithm to be very useful, so I implemented it in scikit-guess as skg.nsphere_fit. Let's say you have an (m, n) array p, consisting of m points of dimension n (here n=3):
r, c = skg.nsphere_fit(p)
The radius, r, is a scalar and c is an n-vector containing the center.
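If you would rather not pull in a dependency, here is a minimal NumPy sketch of the same linearization idea (a generic algebraic least-squares fit, not a line-by-line port of Coope's method): expanding |p - c|^2 = r^2 gives a system that is linear in c and in k = r^2 - |c|^2.

```python
import numpy as np

def algebraic_sphere_fit(points):
    """Sketch: algebraic least-squares sphere fit.  |p - c|^2 = r^2 is
    rewritten as 2 p.c + k = |p|^2 with k = r^2 - |c|^2, linear in (c, k)."""
    p = np.asarray(points, dtype=float)              # shape (m, n)
    A = np.hstack([2.0 * p, np.ones((len(p), 1))])
    b = np.sum(p ** 2, axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    c, k = sol[:-1], sol[-1]
    return c, np.sqrt(k + c @ c)                     # centre, radius
```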
There are two classes, let's call them X and O. A number of elements belonging to these classes are spread out in the xy-plane. Here is an example where the two classes are not linearly separable. It is not possible to draw a straight line that perfectly divides the Xs and the Os on each side of the line.
How does one determine, in general, whether the two classes are linearly separable? I am interested in an algorithm where no assumptions are made regarding the number of elements or their distribution. An algorithm of the lowest possible computational complexity is of course preferred.
If you found the convex hull for both the X points and the O points separately (i.e. you have two separate convex hulls at this stage) you would then just need to check whether any segments of the hulls intersected or whether either hull was enclosed by the other.
If the two hulls were found to be totally disjoint the two data-sets would be geometrically separable.
Since the hulls are convex by definition, any separator would be a straight line.
There are efficient algorithms that can be used both to find the convex hull (the qhull algorithm is based on an O(n log n) quickhull approach, I think), and to perform line-line intersection tests for a set of segments (sweepline at O(n log n)), so overall it seems that an efficient O(n log n) algorithm should be possible.
This type of approach should also generalise to general k-way separation tests (where you have k groups of objects) by forming the convex hull and performing the intersection tests for each group.
It should also work in higher dimensions, although the intersection tests would start to become more challenging...
Hope this helps.
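One way to sketch this in 2-D, assuming the shapely package is available (the helper name is mine, not part of the answer above): the sets are linearly separable exactly when the two hulls neither intersect nor contain one another.

```python
from shapely.geometry import MultiPoint

def linearly_separable_2d(xs, os):
    """Sketch. xs, os: iterables of (x, y) tuples.  Separable iff the two
    convex hulls neither intersect nor contain one another."""
    hull_x = MultiPoint(list(xs)).convex_hull
    hull_o = MultiPoint(list(os)).convex_hull
    return not hull_x.intersects(hull_o)

print(linearly_separable_2d([(0, 0), (0, 1)], [(2, 0), (2, 1)]))  # True
```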
Computationally, the most effective way to decide whether two sets of points are linearly separable is by applying linear programming. GLPK is perfect for that purpose, and pretty much every high-level language offers an interface for it - R, Python, Octave, Julia, etc.
Let's say you have a set of points A and a set of points B. Then you have to find a weight vector w and an offset t such that

w . a + t <= -1 for every point a in A
w . b + t >= +1 for every point b in B

while "minimizing 0" as the objective. (In the standard form min c'x subject to Ax <= b, the A here is the constraint matrix built by stacking the points, not the set of points from above.) "Minimizing 0" effectively means that you don't need to actually optimize an objective function, because that is not necessary to find out whether the sets are linearly separable.
In the end, the solution (w, t) defines the separating plane w . x + t = 0.
In case you are interested in a working example in R or the math details, then check this out.
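A corresponding feasibility check with scipy.optimize.linprog might look like this (the example points at the end are made up):

```python
import numpy as np
from scipy.optimize import linprog

def linearly_separable(A_pts, B_pts):
    """Sketch: feasibility LP.  Find (w, t) with w.a + t <= -1 for all a in A
    and w.b + t >= +1 for all b in B.  The objective is identically zero."""
    A_pts, B_pts = np.asarray(A_pts, float), np.asarray(B_pts, float)
    d = A_pts.shape[1]
    # Stack both constraint families into A_ub @ x <= b_ub with x = (w, t).
    A_ub = np.vstack([np.hstack([A_pts, np.ones((len(A_pts), 1))]),
                      np.hstack([-B_pts, -np.ones((len(B_pts), 1))])])
    b_ub = -np.ones(len(A_pts) + len(B_pts))
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.success

print(linearly_separable([[0, 0], [0, 1]], [[2, 0], [2, 1]]))  # True
```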
Here is a naïve algorithm that I'm quite sure will work (and, if so, shows that the problem is not NP-complete, as another post claims), but I wouldn't be surprised if it can be done more efficiently: if a separating line exists, it will be possible to move and rotate it until it hits two of the X's, or one X and one O. Therefore, we can simply look at all the possible lines that pass through two X's or through one X and one O, and see if any of them is a dividing line. So, for each of the O(n^2) pairs, iterate over all the n-2 other elements to see if all the X's are on one side and all the O's on the other. Total time complexity: O(n^3).
Linear perceptron is guaranteed to find such separation if one exists.
See: http://en.wikipedia.org/wiki/Perceptron .
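A bare-bones sketch of that check (the epoch cap is an arbitrary safeguard of mine; on separable data the updates stop on their own, while on inseparable data the perceptron never converges):

```python
import numpy as np

def perceptron_separable(X, y, max_epochs=1000):
    """Sketch. X: (m, d) points, y: labels in {-1, +1}.  Returns
    (separable, w), where the last component of w is the bias term."""
    Xb = np.hstack([np.asarray(X, float), np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:        # misclassified (or on the boundary)
                w += yi * xi
                mistakes += 1
        if mistakes == 0:                  # a full clean pass: separation found
            return True, w
    return False, w                        # gave up; likely not separable
```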
You can probably apply linear programming to this problem. I'm not sure of its computational complexity in formal terms, but the technique is successfully applied to very large problems covering a wide range of domains.
Computing a linear SVM and then determining on which side of the computed maximum-margin plane each point lies will tell you if the points are linearly separable.
This is overkill, but if you need a quick one-off solution, there are many existing SVM libraries that will do this for you.
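For example, with scikit-learn (the very large C value is just an assumption to approximate a hard margin, and the data is a toy example):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [2, 0], [2, 1]], dtype=float)   # toy data
y = np.array([0, 0, 1, 1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # very large C ~ hard margin
print(clf.score(X, y) == 1.0)                 # True -> linearly separable
```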
As mentioned by ElKamina, Linear Perceptron is guaranteed to find a solution if one exists. This approach is not efficient for large dimensions. Computationally the most effective way to decide whether two sets of points are linearly separable is by applying linear programming.
Code with an example solving this with a Perceptron in Matlab is here.
In general that problem is NP-hard but there are good approximate solutions like K-means clustering.
Well, both the Perceptron and the SVM (Support Vector Machine) can tell you whether two data sets are linearly separable, but the SVM can also find the optimal separating hyperplane. Besides, it works with n-dimensional vectors, not only 2-D points.
It is used in applications such as face recognition. I recommend going deeper into this topic.
I'm asking this question out of curiosity, since my quick and dirty implementation seems to be good enough. However, I'm curious what a better implementation would be.
I have a graph of real-world data. There are no duplicate X values, and the X value increments at a consistent rate across the graph, but the Y data is based on real-world output. I want to find the nearest point on the graph to an arbitrary given point P programmatically. I'm trying to find an efficient (i.e. fast) algorithm for doing this. I don't need the exact closest point; I can settle for a point that is 'nearly' the closest.
The obvious lazy solution is to increment through every single point in the graph, calculate the distance, and then find the minimum of the distance. This however could theoretically be slow for large graphs; too slow for what I want.
Since I only need an approximate closest point, I imagine the ideal fastest approach would involve generating a best-fit line and using that line to calculate where the point should be in real time; but that sounds like a potential mathematical headache I'm not about to take on.
My solution is a hack which works only because I assume my point P isn't arbitrary; namely, I assume that P will usually be close to my graph line, and when that happens I can rule out the distant X values from consideration. I calculate how close P is to the point on the line that shares its X coordinate, and use the distance between that point and P to calculate the largest/smallest X value that could possibly be a closer point.
I can't help but feel there should be a faster algorithm than my solution (which is only useful because I assume that 99% of the time my point P will be close to the line already). I tried googling for better algorithms, but found so many that didn't quite fit that it was hard to find what I was looking for amongst all the clutter of inappropriate algorithms. So, does anyone here have a suggested algorithm that would be more efficient? Keep in mind I don't need a full algorithm, since what I have works for my needs; I'm just curious what the proper solution would have been.
If you store the [x,y] points in a quadtree you'll be able to find the closest one quickly (something like O(log n)). I think that's the best you can do without making assumptions about where the point is going to be. Rather than repeat the algorithm here have a look at this link.
Your solution is pretty good. By examining how the points vary in y, couldn't you calculate a bound for the number of points along the x axis you need to examine, instead of using an arbitrary one?
Let's say your point is P=(x,y) and your real-world data is a function y=f(x).
Step 1: Calculate r=|f(x)-y|.
Step 2: Find points in the interval I=(x-r,x+r)
Step 3: Find the closest point in I to P.
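In code, assuming the data is stored as sorted NumPy arrays xs and ys (these names are placeholders, and using the full distance to a nearby sample as the bound, rather than just |f(x)-y|, is my own variation), the three steps look roughly like:

```python
import numpy as np

def nearest_point_index(xs, ys, px, py):
    """Sketch. xs sorted with uniform spacing, ys = f(xs).  Return the index
    of (approximately) the closest sample to the query point (px, py)."""
    i0 = min(np.searchsorted(xs, px), len(xs) - 1)      # a sample near px
    r = np.hypot(xs[i0] - px, ys[i0] - py)              # step 1: distance bound
    lo = np.searchsorted(xs, px - r)                    # step 2: x-interval
    hi = np.searchsorted(xs, px + r, side='right')
    d2 = (xs[lo:hi] - px) ** 2 + (ys[lo:hi] - py) ** 2
    return lo + int(np.argmin(d2))                      # step 3: scan interval
```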
If you can use a data structure, some common data structures for spatial searching (including nearest neighbour) are...
quad-tree (and octree etc).
kd-tree
bsp tree (only practical for a static set of points).
r-tree
The r-tree comes in a number of variants. It's very closely related to the B+ tree, but with (depending on the variant) different orderings on the items (points) in the leaf nodes.
The Hilbert R tree uses a strict ordering of points based on the Hilbert curve. The Hilbert curve (or rather a generalization of it) is very good at ordering multi-dimensional data so that nearby points in space are usually nearby in the linear ordering.
In principle, the Hilbert ordering could be applied by sorting a simple array of points. The natural clustering in this would mean that a search would usually only need to search a few fairly-short spans in the array - with the complication being that you need to work out which spans they are.
I used to have a link for a good paper on doing the Hilbert curve ordering calculations, but I've lost it. An ordering based on Gray codes would be simpler, but not quite as efficient at clustering. In fact, there's a deep connection between Gray codes and Hilbert curves - that paper I've lost uses Gray code related functions quite a bit.
EDIT - I found that link - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.133.7490
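As a stand-in for the Hilbert ordering described above, here is a sketch using a Morton (Z-order) key instead, which is simpler to compute at the cost of somewhat weaker clustering; the 16-bit quantisation and the assumption that coordinates lie in [0, 1) are arbitrary choices of mine:

```python
def morton_key(xi, yi, bits=16):
    """Interleave the bits of two `bits`-bit integers (Z-order key)."""
    key = 0
    for b in range(bits):
        key |= ((xi >> b) & 1) << (2 * b)
        key |= ((yi >> b) & 1) << (2 * b + 1)
    return key

def z_order_sort(points, bits=16):
    """Sketch. points: list of (x, y) with coordinates in [0, 1).  Sort by
    Z-order key so that points close in space tend to be close in the array."""
    scale = (1 << bits) - 1
    return sorted(points, key=lambda p: morton_key(int(p[0] * scale),
                                                   int(p[1] * scale)))
```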