Find a bijection that best preserves distances - algorithm

I have two spaces (not necessarily equal in dimension) with N points.
I am trying to find a bijection (pairing) of the points, such that the distances are preserved as well as possible.
I can't seem to find a discussion of possible solutions or algorithms to this question online. Can anyone suggest keywords that I could search for? Does this problem have a name, or does it come up in any domain?

I believe you are looking for a Multidimensional Scaling algorithm where you are minimizing the total change in distance. Unfortunately, I have very little experience in this area and can't be of much more help.

I haven't heard of the exact same problem. There are two similar types of problems:
Non-linear dimensionality reduction: you're given N high-dimensional points and you want to find N low-dimensional points that preserve distances as well as possible. MDS, mentioned by Michael Koval, is one such method.
This might be more promising: algorithms for the assignment problem, for example Kuhn-Munkres (the Hungarian algorithm). You're given an NxN matrix that encodes the cost of matching p_i with q_j, and you want to find the minimum-cost bijection. There are many generalizations of this problem, for example b-matching (Kuhn-Munkres solves 1-matching).
Depending on how you define "preserves distances as well as possible", I think you want either (2) or a generalization of (2) in which the cost depends not only on the two points being matched but on the assignment of all the other points.
Finally, Kuhn-Munkres comes up everywhere in operations research.
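To make the distinction concrete, here is a hedged, brute-force sketch of the version where the cost depends on the whole assignment (a quadratic-assignment-style objective). The point sets and the sum-of-absolute-differences cost are illustrative assumptions, and exhaustive search is only feasible for very small N:

```python
import itertools
import math

# Illustrative sketch (not a named standard algorithm): search all
# bijections for the one minimizing the total distortion of pairwise
# distances. Distances are computed within each space, so the two
# spaces may have different dimensions.
def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def best_bijection(A, B):
    n = len(A)
    dA = [[dist(A[i], A[j]) for j in range(n)] for i in range(n)]
    dB = [[dist(B[i], B[j]) for j in range(n)] for i in range(n)]
    best, best_cost = None, float("inf")
    for perm in itertools.permutations(range(n)):  # n! candidates
        cost = sum(abs(dA[i][j] - dB[perm[i]][perm[j]])
                   for i in range(n) for j in range(i + 1, n))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost

# A is a triangle in 2-D, B the "same" triangle embedded in 3-D
A = [(0, 0), (1, 0), (0, 1)]
B = [(0, 0, 0), (0, 0, 1), (1, 0, 0)]
print(best_bijection(A, B))  # ((0, 1, 2), 0.0): a perfect match
```

Note that Kuhn-Munkres cannot be applied to this objective directly: it handles costs that depend only on each matched pair in isolation, whereas here the cost of matching one pair depends on where every other point goes.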

What's the difference between Levenshtein distance and the Wagner-Fischer algorithm

The Levenshtein distance is a string metric for measuring the difference between two sequences.
The Wagner–Fischer algorithm is a dynamic programming algorithm that computes the edit distance between two strings of characters.
Both use a matrix, and I don't see the difference.
Is the difference the backtracking, or is it simply that one is the mathematical concept from the literature and the other is the algorithm that computes it?
Also, I am writing a thesis and I am not sure how to structure it: should I first explain the Levenshtein distance and afterwards the Wagner-Fischer algorithm, or cover both in one section? I got a bit confused here.
You actually answer the question yourself in the first paragraph.
In the second paragraph you mix them up a bit.
The Levenshtein distance is an edit distance metric named after Vladimir Levenshtein, who considered this distance in 1965; it has nothing to do with the dynamic programming "matrix". The Wagner–Fischer algorithm, on the other hand, is a dynamic programming algorithm that computes the edit distance between two strings of characters.
However, the Levenshtein distance is normally computed using dynamic programming when what you need is a general-purpose computation, that is, calculating the edit distance between two arbitrary input strings. But the Levenshtein distance can also be used in a spell checker, where you compare one string against an entire dictionary. In cases like this it is normally too slow to use a general-purpose computation, and something like a Levenshtein automaton can produce all spelling suggestions in linear time. By the way, this is also used in the fuzzy search in Lucene since version 4.
About your thesis, well, I think it depends. If it is about the actual Levenshtein metric, then I think that is where you should start; if it is about dynamic programming, you should start with Wagner-Fischer. Anyway, that is my two cents, and good luck with your thesis.
Indeed, they are closely related, but they are not the same thing. The Levenshtein distance is a concept defined by a mathematical formula; however, computing it by implementing the recursive formula directly would be horrendously slow. The Wagner-Fischer algorithm is a dynamic programming algorithm that computes it efficiently.
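A minimal sketch of that relationship: the Wagner–Fischer dynamic programming matrix below computes the Levenshtein distance between two strings in O(mn) time.

```python
# Wagner-Fischer: d[i][j] is the Levenshtein distance between the
# first i characters of s and the first j characters of t.
def levenshtein(s, t):
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(levenshtein("kitten", "sitting"))  # 3
```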

How to choose the space of optimal substructures for dynamic programming algorithms?

I am reading the dynamic programming chapter of Introduction to Algorithms by Cormen et al. I am trying to understand how to characterize the space of subproblems. They give two examples of dynamic programming, both of which have an input of size n:
Rod cutting problem (cut a rod of size n optimally)
Matrix parenthesization problem (parenthesize the matrix product A1 . A2 . A3 ... An optimally to get the least number of scalar multiplications)
For the first problem, they choose subproblems of the form where, after making a cut of length k, the left piece resulting from the cut cannot be cut any further while the right piece can, giving a single subproblem of size (n-k).
But for the second problem, they choose subproblems of the type Ai...Aj where 1<=i<=j<=n. Why did they choose to keep both ends open for this problem? Why not close one end and consider only subproblems of size (n-k)? Why are both i and j needed here instead of a single split k?
It is an art. There are many types of dynamic programming problems, and it is not easy to define one way to work out what dimensions of space we want to solve sub-problems for.
It depends on how the sub-problems interact, and very much on the size of each dimension of space.
Dynamic programming is a general term describing the caching or memoization of sub-problems to solve larger problems more efficiently. But there are so many different problems that can be solved by dynamic programming in so many different ways, that I cannot explain it all, unless you have a specific dynamic programming problem that you need to solve.
All I can suggest to try when solving a problem is:
if you know how to solve one problem, you can use similar techniques for similar problems;
try different approaches, and estimate the order of complexity (in time and memory) in terms of input size for each dimension; then, given the size of each dimension, see if it executes fast enough and within memory limits.
Some algorithms that can be described as dynamic programming, include:
shortest path algorithms (Dijkstra, Floyd-Warshall, ...)
string algorithms (longest common subsequence, Levenshtein distance, ...)
and much more...
Vazirani's technical note on dynamic programming (http://www.cs.berkeley.edu/~vazirani/algorithms/chap6.pdf) lists some useful ways to create subproblems given an input. I have added some other ways to the list below:
Input is x_1, x_2, ..., x_n. Subproblem is x_1, ..., x_i.
Input is x_1, x_2, ..., x_n. Subproblem is x_i, ..., x_j.
Input is x_1, x_2, ..., x_n and y_1, y_2, ..., y_m. Subproblem is x_1, ..., x_i and y_1, ..., y_j.
Input is a rooted tree. Subproblem is a rooted subtree.
Input is a matrix. Subproblems are submatrices of different sizes that share a corner with the original matrix.
Input is a matrix. Subproblems are all possible submatrices.
Which subproblems to use usually depends on the problem. Try out these known variations and see which one suits your needs best.
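As a concrete instance of the x_i, ..., x_j subproblem shape, here is a hedged sketch of matrix-chain parenthesization; the dimension list is a made-up example (A1 is 10x30, A2 is 30x5, A3 is 5x60):

```python
from functools import lru_cache

# Interval DP: a split at k leaves BOTH A_i..A_k and A_(k+1)..A_j
# as subchains, so subproblems need two open ends (i, j) -- unlike
# rod cutting, where a single remaining piece of size n-k suffices.
def matrix_chain(dims):
    n = len(dims) - 1  # number of matrices; A_i is dims[i-1] x dims[i]

    @lru_cache(maxsize=None)
    def cost(i, j):
        # fewest scalar multiplications needed to compute A_i .. A_j
        if i == j:
            return 0
        return min(cost(i, k) + cost(k + 1, j)
                   + dims[i - 1] * dims[k] * dims[j]
                   for k in range(i, j))

    return cost(1, n)

print(matrix_chain([10, 30, 5, 60]))  # 4500, via (A1 A2) A3
```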

In regards to genetic algorithms

Currently, I'm studying genetic algorithms (personally, not as a requirement) and I've come across some topics I'm unfamiliar or only basically familiar with, namely:
Search Space
The "extreme" of a Function
I understand that one's search space is the collection of all possible solutions, but I also wish to know how one would decide the range of their search space. Furthermore, I would like to know what an extreme is in relation to functions and how it is calculated.
I know I should probably understand what these are, but so far I've only taken Algebra 2 and Geometry. I have ventured into physics, matrix/vector math, and data structures on my own, though, so please excuse me if I seem naive.
Generally, all algorithms which look for a specific item in a collection of items are called search algorithms. When the collection of items is defined by a mathematical function (as opposed to existing in a database), it is called a search space.
One of the most famous problems of this kind is the travelling salesman problem, where an algorithm is sought which, given a list of cities and their distances, finds the shortest route visiting each city exactly once. For this problem, the exact solution can be found only by examining all possible routes (the entire search space) and keeping the shortest one (the route with the minimum distance, which is the extreme value in the search space). Such an exhaustive search takes time that grows factorially in the number of cities, and the best known exact algorithms are still exponential (although a better algorithm may yet be discovered), meaning that the worst-case running time increases exponentially as the number of cities increases.
This is where genetic algorithms come into play. Similar to other heuristic algorithms, genetic algorithms try to get close to the optimal solution by improving a candidate solution iteratively, with no guarantee that an optimal solution will actually be found.
This iterative approach has the problem that the algorithm can easily get "stuck" in a local extreme while trying to improve a solution, not knowing that there is a potentially better solution somewhere further away.
Picture a curve with a local minimum separated from the global minimum by a large maximum: in order to reach the actual optimal solution (the global minimum), an algorithm currently examining solutions around the local minimum needs to "jump over" that maximum in the search space. A genetic algorithm will rapidly locate such local optima, but it will usually fail to sacrifice this short-term gain to reach a potentially better solution.
So, a summary would be:
exhaustive search
examines the entire search space (long time)
finds global extremes
heuristic (e.g. genetic algorithms)
examines a part of the search space (short time)
finds local extremes
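The exhaustive-search half of this summary can be sketched for a tiny, made-up TSP instance; the city coordinates are illustrative, and the (n-1)! routes enumerated below are the entire search space:

```python
import itertools
import math

# Exhaustive TSP search: examine every possible route and keep the
# shortest. Feasible only for a handful of cities, since the search
# space has (n-1)! routes with the starting city fixed.
cities = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 2)]  # made-up coordinates

def tour_length(order):
    return sum(math.dist(cities[order[i]], cities[order[(i + 1) % len(order)]])
               for i in range(len(order)))

# Fix city 0 as the start and permute the rest
best = min(itertools.permutations(range(1, len(cities))),
           key=lambda rest: tour_length((0,) + rest))
print((0,) + best, tour_length((0,) + best))
```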
Genetic algorithms are not good at fine-tuning towards a local optimum. If you want to find the global optimum, you should at least be able to approach, or have a strategy for approaching, the local optima. Some improvements have recently been developed to better find local optima:
"Genetic algorithm for informative basis function selection from the wavelet packet decomposition with application to corrosion identification using acoustic emission"
http://gbiomed.kuleuven.be/english/research/50000666/50000669/50488669/neuro_research/neuro_research_mvanhulle/comp_pdf/Chemometrics.pdf
In general, "search space" means what type of answers you are looking for. For example, if you are writing a genetic algorithm which builds bridges, tests them out, and then builds more, the answers you are looking for are bridge models (in some form).
As another example, if you're trying to find a function which agrees with a set of sample inputs on some number of points, you might try to find a polynomial with this property. In this instance your search space might be polynomials. You might make this simpler by putting a bound on the number of terms, the maximum degree of the polynomial, etc., so you could specify that you want to search for polynomials with integer exponents in the range [-4, 4].
In genetic algorithms, the search space is the set of possible solutions you could generate, and you need to limit it carefully so you avoid answers which are completely dumb. At my former university, a physics student wrote a GA to calculate the configuration of atoms in a molecule with the lowest-energy properties, and it found a great solution having almost no energy. Unfortunately, the solution put all the atoms at the exact center of the molecule, which is physically impossible :-). GAs really home in on good solutions to your fitness function, so it's important to choose your search space so that it doesn't produce solutions with good fitness that are in reality "impossible answers".
As for the "extreme" of a function: this is simply the point at which the function takes its maximum (or minimum) value. With respect to genetic algorithms, you want the best solution to the problem you're trying to solve. If you're building a bridge, you're looking for the best bridge. In this scenario, you have a fitness function that can tell you "this bridge can take 80 pounds of weight" and "that bridge can take 120 pounds of weight", and you look for solutions with higher fitness values than others.
Some functions have simple extremes: you can find the extreme of a polynomial using simple high-school calculus. Other functions don't have a simple way to calculate their extremes; notably, highly nonlinear functions have extremes which can be difficult to find. Genetic algorithms excel at finding such solutions using a clever search technique which looks around for high points and then finds others. It's worth noting that there are other algorithms that do this as well, hill climbers in particular. The thing that makes GAs different is that other types of algorithms can get "stuck" at a local maximum, blinded by a locally good solution, so that they never see a possibly much better solution farther away in the search space. There are also other ways to adapt hill climbers to this, simulated annealing for one.
Picking the range usually requires some intuitive understanding of the problem you're trying to solve, that is, some expertise in the domain of the problem. There's really no guaranteed method to pick the range.
The extremes are just the minimum and maximum values of the function.
So, for instance, if you're coding up a GA just for practice, to find the minimum of, say, f(x) = x^2, you know pretty well that your range should be +/- something, because you already know that you're going to find the answer at x = 0. But then of course you wouldn't use a GA for that, because you already have the answer, and even if you didn't, you could use calculus to find it.
One of the tricks in genetic algorithms is to take some real-world problem (often an engineering or scientific problem) and translate it, so to speak, into some mathematical function that can be minimized or maximized. But if you're doing that, you probably already have some basic notion where the solutions might lie, so it's not as hopeless as it sounds.
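For the f(x) = x^2 practice example above, a toy GA sketch might look like the following; the population size, selection scheme, and mutation scale are arbitrary illustrative choices, not recommendations:

```python
import random

# Toy genetic algorithm minimizing f(x) = x^2 over the search
# range [-10, 10] -- purely for practice, as the text notes.
def f(x):
    return x * x

def evolve(generations=100, pop_size=20, lo=-10.0, hi=10.0, seed=0):
    rng = random.Random(seed)
    pop = [rng.uniform(lo, hi) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=f)                   # lower f(x) = fitter
        survivors = pop[:pop_size // 2]   # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            child = (a + b) / 2           # crossover: averaging
            child += rng.gauss(0, 0.1)    # mutation
            children.append(min(hi, max(lo, child)))  # clamp to range
        pop = survivors + children
    return min(pop, key=f)

best = evolve()
print(best, f(best))  # a value very close to x = 0
```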
The term "search space" is not restricted to genetic algorithms. It actually just means the set of candidate solutions to your optimization problem. An "extremum" is a solution that minimizes or maximizes the target function with respect to the search space.
The search space, simply put, is the space of all possible solutions. If you're looking for a shortest tour, the search space consists of all possible tours that can be formed. However, beware that it is not necessarily the space of all feasible solutions! It depends only on your encoding. If your encoding is, e.g., a permutation, then the search space is that of all permutations, which is n! (factorial) in size. If you're minimizing a function with real-valued inputs, the search space is bounded by the hypercube of those inputs; it is basically infinite, but of course limited by the precision of the computer.
If you're interested in genetic algorithms, maybe you're interested in experimenting with our software. We're using it to teach heuristic optimization in classes. It's GUI-driven and Windows-based, so you can start right away. We have included a number of problems such as real-valued test functions, the traveling salesman problem, vehicle routing, etc. This allows you, for example, to look at how the best solution of a certain TSP improves over the generations. It also exposes the problem of parameterizing metaheuristics and lets you find better parameters that will solve the problems more effectively. You can get it at http://dev.heuristiclab.com.

Maximum two-dimensional subset-sum

I'm given a task to write an algorithm that computes the maximum two-dimensional subset sum of a matrix of integers. However, I'm not interested in help with such an algorithm; I'm more interested in knowing the best worst-case complexity with which this can possibly be solved.
Our current algorithm is O(n^3).
I've been considering something like divide and conquer: splitting the matrix into a number of sub-matrices and simply adding up the elements within them, thereby limiting the number of matrices one has to consider in order to find an approximate solution.
Worst case (exhaustive search) is definitely no worse than O(n^3). There are several descriptions of this on the web.
Best case can be far better: O(1). If all of the elements are non-negative, then the answer is the matrix itself. If the elements are non-positive, the answer is the element that has its value closest to zero.
Likewise if there are entire rows/columns on the edges of your matrix that are nothing but non-positive integers, you can chop these off in your search.
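For reference, the standard O(n^3) approach referred to above can be sketched as follows: fix a pair of rows, collapse the strip between them into per-column sums, and run one-dimensional Kadane on the result. The example matrix is made up:

```python
# Maximum-sum submatrix in O(rows^2 * cols), i.e. O(n^3) for an
# n x n matrix: for each pair (top, bottom) of rows, reduce the
# strip to column sums and run 1-D Kadane over them.
def max_submatrix_sum(M):
    rows, cols = len(M), len(M[0])
    best = M[0][0]
    for top in range(rows):
        col_sums = [0] * cols
        for bottom in range(top, rows):
            for c in range(cols):
                col_sums[c] += M[bottom][c]
            # 1-D Kadane over col_sums
            cur = col_sums[0]
            best = max(best, cur)
            for c in range(1, cols):
                cur = max(col_sums[c], cur + col_sums[c])
                best = max(best, cur)
    return best

M = [[ 1, -2,  3],
     [-4,  5, -6],
     [ 7, -8,  9]]
print(max_submatrix_sum(M))  # 9 (the single bottom-right cell)
```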
I've concluded that there isn't a better way to do it, at least none known yet.
And I'm going to stick with the solution I have, mainly because it's simple.

How to efficiently find k-nearest neighbours in high-dimensional data?

So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using Euclidean distance; currently k=2, if that makes it easier).
My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimensions grows. In my sample implementation, it is only slightly faster than exhaustive search.
My next idea would be to use PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: is there some clever algorithm or data structure to solve this exactly in reasonable time?
The Wikipedia article for kd-trees has a link to the ANN library:
ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions.
Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)
As far as algorithm/data structures are concerned:
The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.
I'd try it directly first, and if that doesn't produce satisfactory results I'd use it on the data set after applying PCA/ICA (since it's quite unlikely you're going to end up with few enough dimensions for a kd-tree to handle).
use a kd-tree
Unfortunately, in high dimensions this data structure suffers severely from the curse of dimensionality, which causes its search time to be comparable to the brute force search.
reduce the number of dimensions
Dimensionality reduction is a good approach, which offers a fair trade-off between accuracy and speed. You lose some information when you reduce your dimensions, but gain some speed.
By accuracy I mean finding the exact Nearest Neighbor (NN).
Principal Component Analysis (PCA) is a good idea when you want to reduce the dimensionality of the space your data live in.
Is there some clever algorithm or data structure to solve this exactly in reasonable time?
Approximate nearest neighbour search (ANNS): you are satisfied with finding a point that might not be the exact nearest neighbour, but a good approximation of it (for example, the 4th NN to your query when you are looking for the 1st NN).
This approach costs you accuracy, but it increases performance significantly. Moreover, the probability of finding a good NN (close enough to the query) is relatively high.
You can read more about ANNS in the introduction of our kd-GeRaF paper.
A good idea is to combine ANNS with dimensionality reduction.
Locality Sensitive Hashing (LSH) is a modern approach to the nearest neighbour problem in high dimensions. The key idea is that points that lie close to each other are hashed to the same bucket. So when a query arrives, it is hashed to a bucket, and that bucket (and usually its neighboring ones) contains good NN candidates.
FALCONN is a good C++ implementation, which focuses on cosine similarity. Another good implementation is our DOLPHINN, which is a more general library.
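A minimal sketch of the random-hyperplane flavour of LSH (the variant behind cosine-similarity implementations such as FALCONN); the data sizes and the number of hyperplanes are illustrative, not tuned:

```python
from collections import defaultdict
import numpy as np

# Random-hyperplane LSH: hash each point to the bit pattern of
# sign(x . h) over a few random hyperplanes h, so points with
# similar directions tend to land in the same bucket.
rng = np.random.default_rng(42)
X = rng.standard_normal((1000, 75))     # 1000 points in 75 dimensions
planes = rng.standard_normal((75, 12))  # 12 random hyperplanes

codes = X @ planes > 0  # one 12-bit boolean hash code per point

buckets = defaultdict(list)
for i, code in enumerate(codes):
    buckets[code.tobytes()].append(i)

# Candidate neighbours for point 0: the members of its own bucket
candidates = buckets[codes[0].tobytes()]
print(len(candidates))
```

In a real index you would use several independent hash tables and also probe neighbouring buckets to raise the probability of catching the true nearest neighbour.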
You could conceivably use Morton Codes, but with 75 dimensions they're going to be huge. And if all you have is 16,000 data points, exhaustive search shouldn't take too long.
There is no reason to believe this is NP-complete. You're not really optimizing anything, and I'd have a hard time figuring out how to convert this to another NP-complete problem (I have Garey and Johnson on my shelf and can't find anything similar). Really, I'd just pursue more efficient methods of searching and sorting. If you have n observations, you have to calculate n x n distances right up front. Then for every observation, you need to pick out the top k nearest neighbours. That's n squared for the distance calculation and n log(n) for each sort, but you have to do the sort n times (once for every observation). Messy, but still polynomial time to get your answers.
A BK-tree isn't such a bad thought. Take a look at Nick's blog on Levenshtein automata. While his focus is strings, it should give you a springboard for other approaches. The other thing I can think of is R-trees; however, I don't know if they've been generalized to high dimensions. I can't say more than that, since I have neither used them directly nor implemented them myself.
One very common implementation would be to sort the nearest-neighbours array that you have computed for each data point.
As sorting the entire array can be very expensive, you can use a method like indirect partial sorting, e.g. numpy.argpartition in the NumPy library, to sort only the closest K values you are interested in. There is no need to sort the entire array.
Grembo's answer above can be sped up significantly this way: you only need the K nearest values, so there is no need to sort all the distances from each point.
If you just need K neighbours, this method reduces your computational cost and time complexity very well.
If you need the K neighbours sorted, sort the output again.
See the documentation for argpartition.
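A hedged sketch of the argpartition approach for the setting described above, with sizes scaled down for illustration:

```python
import numpy as np

# Brute-force k-NN with np.argpartition: compute all pairwise
# distances, then select the k closest per point WITHOUT fully
# sorting each row.
rng = np.random.default_rng(0)
X = rng.random((100, 75))   # 100 points in 75 dimensions
k = 2

# Squared Euclidean distances (monotone in the true distance,
# so the neighbour ranking is unchanged)
sq = np.sum(X ** 2, axis=1)
D = sq[:, None] + sq[None, :] - 2 * X @ X.T
np.fill_diagonal(D, np.inf)  # a point is not its own neighbour

# argpartition places the k smallest entries of each row in the
# first k slots (in arbitrary order), avoiding a full sort
nn = np.argpartition(D, k, axis=1)[:, :k]
print(nn.shape)  # (100, 2)
```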
