Create matching of row and column entries such that values are maximized - algorithm

Currently I am facing the following optimization problem and I can't seem to find the right applicable algorithm for it. It seems related to combinatorial optimization problems such as the knapsack problem, but my mathematical knowledge only goes so far.
Assume we have a list of the following words: ["apple", "banana", "cookie", "donut", "ear", "force"]. Further, assume we have a dataset of texts which, among others, include these words. At some point I compute a cofrequency matrix, that is, for each pair of words, the frequency with which the two words occur together across all of the files, e.g. cofreq("apple", "banana") = (number of files which contain both apple and banana) / (total number of files). Therefore cofreq(apple, banana) = cofreq(banana, apple), and we ignore cofreq(apple, apple).
Assume we have the following computed matrix (posted as an image, since adding tables seems to be impossible): [cofrequency table]
The goal now is to create unique word pairs such that the sum of the cofrequencies is maximized and every word ends up with exactly one "partner" (we assume we have an even number of words). In this example it would be:
(apple, force) 0.4
(cookie, donut) 0.5
(banana, ear) 0.05
------------------+-----
Total               0.95
In this case I did it by hand, but I know that there is a good algorithm for it; I just can't seem to find it. I was hoping someone could point me in the right direction, for example in the form of a research paper.

You need to use a maximum weight matching algorithm to compute this maximal sum pairing.
The table you have as input can be seen as the adjacency matrix of a graph, where the values in the table correspond to the graph's edge weights. You can do this since the cofreq value is symmetric (meaning cofreq(apple, banana) == cofreq(banana, apple)).
The matching algorithm you can use here is called the blossom algorithm. It is not trivial, but it is very elegant. If you have some experience implementing complex algorithms, you can implement it yourself; otherwise, implementations exist in the graph libraries of most common languages.
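For instance, a minimal sketch in Python with networkx (only the three pair values quoted in the question are filled in; the rest would come from the full cofrequency table):

    import networkx as nx

    # cofrequency values quoted in the question; the remaining pairs would be
    # filled in from the full table
    cofreq = {
        ("apple", "force"): 0.4,
        ("cookie", "donut"): 0.5,
        ("banana", "ear"): 0.05,
    }

    G = nx.Graph()
    for (a, b), w in cofreq.items():
        G.add_edge(a, b, weight=w)

    # maxcardinality=True forces every word into a pair when possible
    matching = nx.max_weight_matching(G, maxcardinality=True)
    total = sum(G[a][b]["weight"] for a, b in matching)
    print(matching, total)   # total should be 0.95 for these values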

Related

Puzzle: for a boolean matrix, find a permutation of rows and columns that allows decomposition into a minimal set of covering rectangles

Suppose I know an algorithm that partitions a boolean matrix into a minimal set of disjoint rectangles that cover all "ones" ("trues").
The task is to find a permutation of the rows and columns of the matrix such that, after shuffling the columns and rows according to these permutations, the matrix can be partitioned into a minimal set of rectangles.
For illustration, one can think about the problem this way:
Suppose I have a set of objects and a set of properties. Each object can have any number of (distinct) properties. The task is to summarize (report) this mapping using the least amount of sentences. Each sentence has a form "<list of objects> have properties <list of properties>".
I know I can brute-force the solution by applying the permutations and running the algorithm on each try. But the time complexity explodes exponentially, making this approach impractical for matrices bigger than 15×15.
I know I can simplify the matrices before running the algorithm by removing duplicated rows and columns.
This problem feels like it is NP-hard, and there might be no fast (polynomial in time) solutions. If that is so, I'd be interested to learn about some approximate solutions.
This is isomorphic to reducing logic circuits, given the full set of inputs (features) and the required truth table (which rows have which feature). You can solve the problem with classic Boolean algebra. The process is called logic optimization.
When I was in school, we drew Karnaugh maps on the board and drew colored boundaries to form our rectangles. However, it sounds as if you have something larger than one would handle on the board; try the Quine-McCluskey (QM) algorithm and the cited heuristics for a "good enough" solution for many applications.
My solution so far:
First let us acknowledge, that the problem is symmetric with respect to swapping rows with columns (features with objects).
Let us represent the problem with the binary matrix, where rows are objects and columns are features and ones in the matrix represent matched pairs (object, feature).
My idea so far is to run two steps in sequence until there are no 1s left in the matrix:
Heuristically find a good unshuffling permutation of rows and columns on which I can run the 2D maximal rectangle algorithm.
Find the maximal rectangle, save it to the answer list and zero all 1s belonging to it.
Maximal rectangle problem
This can simply be any of the implementations of the maximal rectangle problem found on the net, for instance https://www.geeksforgeeks.org/maximum-size-rectangle-binary-sub-matrix-1s/
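For illustration, here is a sketch of that maximal-rectangle idea in Python (my own, not the linked code): each row is treated as the base of a histogram of consecutive 1s, and the largest rectangle in each histogram is found with a monotonic stack. For the partitioning step above, a real implementation would also track the rectangle's position so its 1s can be zeroed.

    def largest_in_histogram(heights):
        # largest rectangle under a histogram, via a monotonic stack
        stack, best = [], 0
        for i, h in enumerate(heights + [0]):      # trailing 0 flushes the stack
            while stack and heights[stack[-1]] >= h:
                height = heights[stack.pop()]
                width = i if not stack else i - stack[-1] - 1
                best = max(best, height * width)
            stack.append(i)
        return best

    def max_rectangle_area(matrix):
        # area of the largest all-1s rectangle in a binary matrix
        if not matrix:
            return 0
        heights = [0] * len(matrix[0])
        best = 0
        for row in matrix:
            for j, v in enumerate(row):
                heights[j] = heights[j] + 1 if v else 0
            best = max(best, largest_in_histogram(heights))
        return best

    print(max_rectangle_area([[0, 1, 1, 0],
                              [1, 1, 1, 1],
                              [1, 1, 1, 1],
                              [1, 1, 0, 0]]))     # 8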
Unshuffling the rows (and columns)
Unshuffling the rows is independent of unshuffling the columns, and both tasks can be run separately (concurrently). Let us assume I am looking for the unshuffling permutation of columns.
It is also worth noting that unshuffling a matrix should yield the same results if we swap ones with zeroes.
Build a distance matrix of the columns. The distance between two columns is defined as the Manhattan distance between the columns represented numerically (i.e. 0 for the absence of a relationship between object and feature, 1 for its presence).
Run hierarchical clustering using the distance matrix. The complexity is O(n^2), as I believe single linkage should be good enough.
The order of objects returned from the hierarchical clustering is the unshuffling permutation.
The algorithm works well enough for my use cases. The implementation in R can be found at https://github.com/adamryczkowski/rectpartitions
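For reference, the unshuffling step could be sketched in Python with SciPy's hierarchical clustering (my own illustration, not the linked R implementation):

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage, leaves_list

    def unshuffle_order(matrix):
        # Manhattan (cityblock) distances between the rows of `matrix`
        d = pdist(matrix, metric="cityblock")
        # single-linkage hierarchical clustering; the leaf order is the permutation
        return leaves_list(linkage(d, method="single"))

    # hypothetical binary object x feature matrix
    M = np.array([[1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1]])

    rows = unshuffle_order(M)        # permutation of rows (objects)
    cols = unshuffle_order(M.T)      # permutation of columns (features)
    print(M[np.ix_(rows, cols)])     # reordered matrix, ready for rectangle extraction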

Approximated closest pair algorithm

I have been thinking about a variation of the closest pair problem in which the only available information is the set of distances already calculated (we are not allowed to sort points according to their x-coordinates).
Consider 4 points (A, B, C, D), and the following distances:
dist(A,B) = 0.5
dist(A,C) = 5
dist(C,D) = 2
In this example, I don't need to evaluate dist(B,C) or dist(A,D), because the triangle inequality guarantees that these distances are greater than the current known minimum: dist(B,C) >= dist(A,C) - dist(A,B) = 4.5 and dist(A,D) >= dist(A,C) - dist(C,D) = 3.
Is it possible to use this kind of information to reduce the O(n²) cost to something like O(n log n)?
Is it possible to reduce the cost to something close to O(n log n) if I accept a kind of approximate solution? In this case, I am thinking about some technique based on reinforcement learning that only converges to the real solution as the number of reinforcements goes to infinity, but provides a good approximation for small n.
Processing time (measured by big-O notation) is not the only issue; keeping a very large number of previously calculated distances can also be a problem.
Imagine this problem for a set with 10⁸ points.
What kind of solution should I look for? Was this kind of problem solved before?
This is not a classroom problem or something related. I have been just thinking about this problem.
I suggest using ideas that are derived from quickly solving k-nearest-neighbor searches.
The M-Tree data structure (see http://en.wikipedia.org/wiki/M-tree and http://www.vldb.org/conf/1997/P426.PDF) is designed to reduce the number of distance comparisons that need to be performed to find "nearest neighbors".
Personally, I could not find an implementation of an M-Tree online that I was satisfied with (see my closed thread Looking for a mature M-Tree implementation), so I rolled my own.
My implementation is here: https://github.com/jon1van/MTreeMapRepo
Basically, this is a binary tree in which each leaf node contains a HashMap of Keys that are "close" in some metric space you define.
I suggest using my code (or the idea behind it) to implement a solution in which you:
Search each leaf node's HashMap and find the closest pair of Keys within that small subset.
Return the closest pair of Keys when considering only the "winner" of each HashMap.
This style of solution would be a "divide and conquer" approach that returns an approximate solution.
You should know this code has an adjustable parameter that governs the maximum number of Keys that can be placed in an individual HashMap. Reducing this parameter will increase the speed of your search, but it will increase the probability that the correct solution won't be found because one Key is in HashMap A while the second Key is in HashMap B.
Also, each HashMap is associated with a "radius". Depending on how accurate you want your result, you may be able to just search the HashMap with the largest hashMap.size()/radius (because this HashMap contains the highest density of points and is thus a good search candidate).
Good Luck
If you only have sample distances, not original point locations in a plane you can operate on, then I suspect you are bounded below by O(E).
Specifically, it would seem from your description that any valid solution would need to inspect every edge in order to rule out its having something interesting to say; meanwhile, inspecting every edge and taking the smallest solves the problem.
Planar versions bypass O(V^2) by using planar distances to deduce limitations on sets of edges, allowing us to avoid looking at most of the edge weights.
Use the same idea as in space partitioning. Recursively split the given set of points by choosing two points and dividing the set into two parts: the points that are closer to the first point and the points that are closer to the second point. That is the same as splitting the points by a line passing between the two chosen points.
That produces a (binary) space partitioning, on which standard nearest neighbour search algorithms can be used.
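A minimal sketch of that idea in Python, using only a caller-supplied distance function (my own illustration; the result is approximate because the true closest pair may end up on opposite sides of a split):

    import random

    def approx_closest_pair(points, dist, leaf_size=32):
        if len(points) <= leaf_size:
            # brute force inside a small leaf
            best = (float("inf"), None)
            for i in range(len(points)):
                for j in range(i + 1, len(points)):
                    d = dist(points[i], points[j])
                    if d < best[0]:
                        best = (d, (points[i], points[j]))
            return best
        # choose two pivots and split by which pivot each point is closer to
        a, b = random.sample(points, 2)
        left = [p for p in points if dist(p, a) <= dist(p, b)]
        right = [p for p in points if dist(p, a) > dist(p, b)]
        if not left or not right:
            # degenerate split: just brute-force this subset
            return approx_closest_pair(points, dist, leaf_size=len(points))
        return min(approx_closest_pair(left, dist, leaf_size),
                   approx_closest_pair(right, dist, leaf_size),
                   key=lambda t: t[0])

    pts = [(random.random(), random.random()) for _ in range(100000)]
    euclid = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    print(approx_closest_pair(pts, euclid))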

Optimal placement of objects wrt pairwise similarity weights

OK, this is an abstract algorithmic challenge, and it will remain abstract, since where I am going to use it is top secret.
Suppose we have a set of objects O = {o_1, ..., o_N} and a symmetric similarity matrix S where s_ij is the pairwise correlation of objects o_i and o_j.
Assume also that we have a one-dimensional space with discrete positions where objects may be put (like having N boxes in a row, or chairs for people).
Given a certain placement, we measure the cost of moving from the position of one object to that of another object as the number of boxes we need to pass by until we reach our target, multiplied by their pairwise object similarity. Moving from a position to the box right after or before that position has zero cost.
Imagine an example where for three objects we have the following similarity matrix:
    | 1.0  0.5  0.8 |
S = | 0.5  1.0  0.1 |
    | 0.8  0.1  1.0 |
Then, the best ordering of objects in the three boxes is obviously:
[o_3] [o_1] [o_2]
The cost of this ordering is the sum of costs (counting boxes) for moving from each object to all others. Here the only non-zero cost is the one between o_2 and o_3, equal to 1 box * 0.1 sim = 0.1; the reversed ordering [o_2] [o_1] [o_3] has the same cost.
On the other hand:
[o_1] [o_2] [o_3]
would have cost = cost(o_1-->o_3) = 1box * 0.8sim = 0.8.
The target is to determine a placement of the N objects in the available positions in a way that we minimize the above mentioned overall cost for all possible pairs of objects!
An analogue is to imagine that we have a table and chairs side by side in one row only (like the boxes) and we need to seat N people on the chairs. Now those people have some relations, that is, let's say, how probable it is that one of them wants to speak to another, which means standing up, passing by a number of chairs, and speaking to the person sitting there. When two people sit on successive chairs they don't need to move in order to talk to each other.
So how can we seat those people so that every distance-cost between two people is minimized? This means that over the course of the night the overall number of chairs walked past by the guests is close to minimum.
Greedy search is... ok forget it!
I am interested in hearing if there is a standard formulation of such problem for which I could find some literature, and also different searching approaches (e.g. dynamic programming, tabu search, simulated annealing etc from combinatorial optimization field).
Looking forward to hearing your ideas.
PS: My question has something in common with this thread, Algorithm for ordering a list of Objects, but I think here it is better posed as a problem and probably slightly different.
That sounds like an instance of the Quadratic Assignment Problem. The speciality is due to the fact that the locations are placed on one line only, but I don't think this will make it easier to solve. The QAP in general is NP hard. Unless I misinterpreted your problem you can't find an optimal algorithm that solves the problem in polynomial time without proving P=NP at the same time.
If the instances are small you can use exact methods such as branch and bound. You can also use tabu search or other metaheuristics if the problem is more difficult. We have an implementation of the QAP and some metaheuristics in HeuristicLab. You can configure the problem in the GUI; just paste the similarity and the distance matrix into the appropriate parameters. Try starting with the Robust Taboo Search. It's an older algorithm, but it still works quite well. Taillard also has the C code for it on his website if you want to implement it yourself. Our implementation is based on that code.
There have been a lot of publications on the QAP. More modern algorithms combine genetic search with local search heuristics (e.g. Genetic Local Search from Stützle, IIRC).
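Since the locations lie on a line, a permutation's cost is cheap to evaluate, so metaheuristics are easy to prototype. Here is a sketch in Python of a plain swap-based local search (my own illustration, not HeuristicLab's or Taillard's code), using the 3x3 similarity matrix from the question:

    import random

    S = [[1.0, 0.5, 0.8],
         [0.5, 1.0, 0.1],
         [0.8, 0.1, 1.0]]

    def cost(order, sim):
        # adjacent boxes pass 0 boxes; positions i < j pass (j - i - 1) boxes
        total = 0.0
        for i in range(len(order)):
            for j in range(i + 1, len(order)):
                total += (j - i - 1) * sim[order[i]][order[j]]
        return total

    def swap_local_search(sim, iterations=10000, seed=0):
        rng = random.Random(seed)
        order = list(range(len(sim)))
        rng.shuffle(order)
        best = cost(order, sim)
        for _ in range(iterations):
            i, j = rng.sample(range(len(order)), 2)
            order[i], order[j] = order[j], order[i]
            c = cost(order, sim)
            if c < best:
                best = c
            else:
                order[i], order[j] = order[j], order[i]   # revert the swap
        return order, best

    print(swap_local_search(S))   # for this 3x3 example the cost ends at 0.1, e.g. order [2, 0, 1]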
Here's a variation of the already posted method. I don't think this one is optimal, but it may be a start.
Create a list of all the pairs in descending cost order.
While the list is not empty:
    Pop the head item from the list.
    If neither element is in an existing group, create a new group containing the pair.
    If one element is in an existing group, add the other element to whichever end puts it closer to the group member.
    If both elements are in existing groups, combine the groups so as to minimize the distance between the pair.
Group combining may require reversal of the order within a group, and the data structure should be designed to support that.
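A rough Python rendering of those steps, based on my reading of them (not necessarily exactly what was intended); indices 0..2 stand for o_1..o_3 and S is the similarity matrix from the question:

    S = [[1.0, 0.5, 0.8],
         [0.5, 1.0, 0.1],
         [0.8, 0.1, 1.0]]

    def greedy_order(sim):
        n = len(sim)
        # all pairs, highest similarity first
        pairs = sorted(((sim[i][j], i, j) for i in range(n) for j in range(i + 1, n)),
                       reverse=True)
        group_of = {}                      # element index -> its group (a list)
        groups = []
        for _, i, j in pairs:
            gi, gj = group_of.get(i), group_of.get(j)
            if gi is None and gj is None:
                g = [i, j]                 # start a new group with the pair
                groups.append(g)
                group_of[i] = group_of[j] = g
            elif gi is None or gj is None:
                g = gi if gi is not None else gj
                new, member = (i, j) if gi is None else (j, i)
                # attach at whichever end is closer to the element already in the group
                if g.index(member) < len(g) - 1 - g.index(member):
                    g.insert(0, new)
                else:
                    g.append(new)
                group_of[new] = g
            elif gi is not gj:
                # merge: try both orientations of each group, keep i and j closest
                candidates = []
                for a in (gi, gi[::-1]):
                    for b in (gj, gj[::-1]):
                        merged = a + b
                        candidates.append((abs(merged.index(i) - merged.index(j)), merged))
                merged = min(candidates)[1]
                groups = [g for g in groups if g is not gi and g is not gj]
                groups.append(merged)
                for x in merged:
                    group_of[x] = merged
        order = [x for g in groups for x in g]
        order += [x for x in range(n) if x not in group_of]   # isolated elements, if any
        return order

    print(greedy_order(S))   # [1, 0, 2], i.e. [o_2] [o_1] [o_3], cost 0.1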
Let me help the thread (of my own) with a simplistic ordering approach.
1. Order the upper half of the similarity matrix.
2. Start with the pair of objects having the highest similarity weight and place them in the two center positions.
3. The next object may be put on the left or the right side of them. So each time select the object that, when put to the left or right, has the highest cost with respect to the pre-placed objects, and repeat this step until all objects are placed.
The selection in Step 3 is because if you leave this object and place it later, its cost will again be the greatest of the remaining ones, and even larger (it will be farther from the pre-placed objects). So the costly placements should be done as early as possible.
This is too simple and of course does not discover a good solution.
Another approach is to
1. start with a complete ordering generated somehow (random or from another algorithm)
2. try to improve it using "swaps" of object pairs.
I believe local minima would be a huge deterrent.

Select some from many binary sequences so that the result of OR-ing them together is 1111111111...111

I have N binary sequences of length L, where N and L may be very large, and the sequences may be very sparse, i.e. have many more 0s than 1s.
I want to select M sequences from them, namely b_1, b_2, b_3..., such that
b_1 | b_2 | b_3 ... | b_M = 1111...11 (L 1s)
Is there an algorithm to achieve it?
My idea is:
STEP 1: for each position from 1 to L, count the total number of sequences which have a 1 at that position. Call this the 'owning number'.
STEP 2: consider the position having the minimum owning number, and among the sequences owning that position choose the one having the maximum number of 1s.
STEP 3: ignore the chosen sequence, update the owning numbers and go back to STEP 2.
I believe that my method cannot generate the best answer.
Does anyone have a better idea?
This is the well known set cover problem. It is NP-hard — in fact, its decision version is one of the canonical NP-complete problems and was among the 21 problems included in Karp's 1972 paper — and so no efficient algorithm is known for solving it.
The algorithm you describe in your question is known as the "greedy algorithm" and (unless your problem has some special features that you are not telling us) it's essentially the best known approach. It finds a collection of sets that is no more than O(log L) times the size of the smallest such collection.
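A minimal sketch of that greedy heuristic in Python, representing each sequence as the set of positions where it holds a 1 (the sequences below are made up for illustration):

    def greedy_cover(sequences, L):
        # sequences: list of sets of positions holding a 1; L: sequence length
        uncovered = set(range(L))
        chosen = []
        while uncovered:
            # pick the sequence covering the most still-uncovered positions
            best = max(range(len(sequences)), key=lambda i: len(sequences[i] & uncovered))
            if not sequences[best] & uncovered:
                raise ValueError("no selection of sequences can cover every position")
            chosen.append(best)
            uncovered -= sequences[best]
        return chosen

    seqs = [{0, 1, 2}, {2, 3}, {3, 4, 5}, {1, 4}]   # made-up example with L = 6
    print(greedy_cover(seqs, 6))                    # e.g. [0, 2]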
Sounds like a typical backtrack task.
Yes, your algorithm sounds reasonable if you want to get a good answer quickly. If you want the combination with the smallest possible number of sequences, you can't do better than trying all combinations.
Depending on the exact structure of the problem, there is an other technique that often works well (and actually gives an optimal result):
Let x[j] be a boolean variable representing the choice whether to include the j'th binary sequence in the result. A zero-suppressed binary decision diagram can now represent (maybe succinctly - depending on the characteristics of the problem) the family of sets such that the OR of the binary sequences corresponding to a variable x[j] included in the set is all ones. Finding the smallest such set (thus minimizing the number of sequences included) is relatively easy if the ZDD was succinct. Details can be found in The Art of Computer Programming chapter 7.1.4 (volume 4A).
It's also easy to adapt to an exact cover, by taking the family of sets such that there is exactly one 1 for every position.

How to apply the Levenshtein distance to a set of target strings?

Let TARGET be a set of strings that I expect to be spoken.
Let SOURCE be the set of strings returned by a speech recognizer (that is, the possible sentences that it has heard).
I need a way to choose a string from TARGET. I read about the Levenshtein distance and the Damerau-Levenshtein distance, which basically return the distance between a source string and a target string, that is, the number of changes needed to transform the source string into the target string.
But, how can I apply this algorithm to a set of target strings?
I thought I'd use the following method:
For each string that belongs to TARGET, I calculate the distance from each string in SOURCE. In this way we obtain an m-by-n matrix, where m is the cardinality of TARGET and n is the cardinality of SOURCE. We could say that the i-th row represents the similarity of the sentences detected by the speech recognizer with respect to the i-th target.
Calculating the average of the values on each row, you can obtain the average distance between the i-th target and the output of the speech recognizer. Let's call it average_on_row(i), where i is the row index.
Finally, for each row I calculate the standard deviation of all the values in the row, and for each row I sum these standard deviations. The result is a column vector in which each element (let's call it standard_deviation_sum(i)) refers to a string of TARGET.
The string associated with the smallest standard_deviation_sum could be the sentence pronounced by the user. Can the method I used be considered correct, or are there other methods?
Obviously, values that are too high indicate that the sentence pronounced by the user probably does not belong to TARGET.
I'm not an expert but your proposal does not make sense. First of all, in practice I'd expect the cardinality of TARGET to be very large if not infinite. Second, I don't believe the Levensthein distance or some similar similarity metric will be useful.
If:
you could really define the SOURCE and TARGET sets,
all strings in SOURCE were equally probable,
all strings in TARGET were equally probable,
the strings in SOURCE and TARGET consisted not of characters but of phonemes,
then I believe your best bet would be to find the pair p in SOURCE, q in TARGET such that distance(p, q) is minimal. Since you especially cannot guarantee the equal-probability part, I think you should think about the problem from scratch, do some research and come up with a completely different design. The usual methodology for speech recognition is to use Hidden Markov Models. I would start from there.
Answer to your comment: Choose whichever is more probable. If you don't consider probabilities, it is hopeless.
[Suppose the following example is on phonemes, not characters]
Suppose the recognized word is "chees" and the target set is {"cheese", "chess"}. You must calculate P(cheese|chees) and P(chess|chees). What I'm trying to say is that not every substitution is equiprobable. If you model probabilities as distances between strings, then at the very least you must allow that, for example, d("c","s") < d("c","q") (it is common to confuse the letters c and s, but not common to confuse c and q). Adapting the distance calculation algorithm is easy; coming up with good values for all pairs is difficult.
Also, you must somehow estimate P(cheese|context) and P(chess|context). If we are talking about board games, chess is more probable; if we are talking about dairy products, cheese is more probable. This is why you'll need large amounts of data to come up with such estimates. This is also why Hidden Markov Models are good for this kind of problem.
You need to calculate these probabilities first: probability of insertion, deletion and substitution. Then use log of these probabilities as penalties for each operation.
In a "context independent" situation, if pi is probability of insertion, pd is probability of deletion and ps probability of substitution, the probability of observing the same symbol is pp=1-ps-pd.
In this case use log(pi/pp/k), log(pd/pp) and log(ps/pp/(k-1)) as penalties for insertion, deletion and substitution respectively, where k is the number of symbols in the system.
Essentially if you use this distance measure between a source and target you get log probability of observing that target given the source. If you have a bunch of training data (i.e. source-target pairs) choose some initial estimates for these probabilities, align source-target pairs and re-estimate these probabilities (AKA EM strategy).
You can start with one set of probabilities and assume context independence. Later you can assume some kind of clustering among the contexts (eg. assume there are k different sets of letters whose substitution rate is different...).
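For illustration, a sketch of such a weighted edit distance in Python (my own; the probability values are made up, and I use negative log probabilities as penalties so that minimizing the distance corresponds to maximizing the probability of the target given the source):

    import math

    def weighted_edit_distance(source, target, ins_cost, del_cost, sub_cost):
        # dynamic programming table; d[i][j] = cost of turning source[:i] into target[:j]
        m, n = len(source), len(target)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = d[i - 1][0] + del_cost
        for j in range(1, n + 1):
            d[0][j] = d[0][j - 1] + ins_cost
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                same = 0.0 if source[i - 1] == target[j - 1] else sub_cost
                d[i][j] = min(d[i - 1][j] + del_cost,        # deletion
                              d[i][j - 1] + ins_cost,        # insertion
                              d[i - 1][j - 1] + same)        # substitution / match
        return d[m][n]

    # made-up probabilities for illustration; k is the number of symbols
    k, pi, pd, ps = 26, 0.05, 0.05, 0.10
    pp = 1 - ps - pd
    ins_cost = -math.log(pi / pp / k)
    del_cost = -math.log(pd / pp)
    sub_cost = -math.log(ps / pp / (k - 1))

    targets = ["cheese", "chess"]
    best = min(targets,
               key=lambda t: weighted_edit_distance("chees", t, ins_cost, del_cost, sub_cost))
    print(best)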

Resources