How to optimize the order of elements based on similarity coefficients? - algorithm

I have to reorder a sequence of elements based on their pairwise similarity (expressed by a coefficient), so that each element is as similar as possible to each of its neighbors. I am looking for an algorithm rather than code.
Example with 10 elements and the similarity coefficients calculated for each pair of elements below:
The Excel file can be found here: https://1drv.ms/x/s!AtmZN4-kjgrPms99fqgaDwAS_F4uYw
What I have tried:
1. Find the pair with the highest coefficient. In the example: 0.98 for T3 (left end) and T5 (right end).
2. Find the maximum coefficient between the left end and the remaining elements.
3. Find the maximum coefficient between the right end and the remaining elements.
4. Take the larger of 2. and 3.
5. If the maximum comes from 2., add the corresponding element on the left; otherwise, add the corresponding element on the right.
6. Repeat steps 2-5 until no elements are left.
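A minimal Python sketch of these steps, assuming the similarities are stored as a symmetric dict-of-dicts sim[a][b] (a hypothetical structure, not from the original post):

def greedy_order(elements, sim):
    # 1. Start from the most similar pair.
    left, right = max(
        ((a, b) for a in elements for b in elements if a != b),
        key=lambda p: sim[p[0]][p[1]],
    )
    order = [left, right]
    remaining = set(elements) - {left, right}
    while remaining:
        # 2./3. Best candidate for each end.
        best_l = max(remaining, key=lambda e: sim[order[0]][e])
        best_r = max(remaining, key=lambda e: sim[order[-1]][e])
        # 4./5. Extend whichever end has the higher coefficient.
        if sim[order[0]][best_l] >= sim[order[-1]][best_r]:
            order.insert(0, best_l)
            remaining.remove(best_l)
        else:
            order.append(best_r)
            remaining.remove(best_r)
    return order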
Here is the result:
The result isn't bad. One of the disadvantages I see is that a margin of 0.99 > 0.98 is treated the same way as 0.99 > 0.01.
The second option I thought about was maximizing the sum of coefficients over all neighboring pairs, but I don't really know where to start, especially if there are significantly more than 10 elements. Moreover, it could result in a "flatter" ordering: better similarities overall, but some extremely similar elements could end up placed far from each other.
Being really new to this kind of problem, I am pretty sure this is a fairly standard issue with existing solutions. Could you please point me to them?
Thank you!

After researching, I have found that my problem can be seen as the "Travelling Salesman Problem" (TSP). More here: https://en.wikipedia.org/wiki/Travelling_salesman_problem
To apply it, treat the "elements" in my example as the "cities" in TSP and (1 - similarity coefficient) as the "distances".
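For small instances this path version of TSP can even be solved exactly by brute force over all orderings (a sketch, using the same hypothetical sim structure as above; only feasible up to roughly 10 elements):

from itertools import permutations

def best_order_exact(elements, sim):
    # Path cost: total dissimilarity (1 - similarity) between neighbors.
    def cost(order):
        return sum(1 - sim[a][b] for a, b in zip(order, order[1:]))
    return min(permutations(elements), key=cost)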

Related

Minimize the max distance, 1D array

Problem:
Given a sorted array A of n numbers, where each number is the location of a house along a 1D line "city".
Given a number k <= n, you need to place k "supermarkets" along the 1D city.
For every house a in A, its distance is defined as the minimum distance to a supermarket c: |a - c|.
The cost of a city is defined as the maximum of all these minimum distances.
You need to find the minimum (optimal) cost for a given A of length n and k <= n.
I can't find a solution for this problem. The solution should use dynamic programming. I'm trying to write the recursive formula, and I think I already have the base cases:
if k = n, then obviously the result will be 0, since you can place a supermarket at each house;
if k = 1, I think the solution should be (A[n] - A[1]) / 2.
But I can't come up with the actual recurrence (and the whole dynamic program). Also, I can't seem to find a name for this problem; I didn't find any other example of it online.
To minimize the maximum distance with k supermarkets, divide the houses into k consecutive groups so as to minimize the maximum distance between the first and last house of each group, then put a supermarket in the middle of each group.
Solving the problem this way makes it much easier for dynamic programming, since it removes the continuous variable of the supermarket position.
I came up with this recursive function for the problem:
if there are at least as many supermarkets as houses, the answer is 0;
if there is only one supermarket, we place it in the middle between the edge houses, so the cost of a segment from house i to house j is (A[j] - A[i]) / 2;
otherwise, for every possible end j of the first group starting at i, we take the maximum of that group's cost and the recursive cost of the remaining houses with one fewer supermarket, and return the minimum over all choices of j.
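A memoized sketch of that recurrence in Python, assuming A is sorted; f(i, k) is the optimal cost for the houses A[i:] served by k supermarkets:

from functools import lru_cache

def min_max_distance(A, k):
    n = len(A)

    @lru_cache(maxsize=None)
    def f(i, k):
        if n - i <= k:      # at least one supermarket per remaining house
            return 0.0
        if k == 1:          # one supermarket: middle of the whole segment
            return (A[n - 1] - A[i]) / 2
        # First group is A[i..j]; recurse on the rest with k - 1 supermarkets.
        return min(
            max((A[j] - A[i]) / 2, f(j + 1, k - 1))
            for j in range(i, n - k + 1)
        )

    return f(0, k)

For example, min_max_distance([1, 3, 6, 7], 2) returns 1.0 (groups {1, 3} and {6, 7}).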

Sack with different weights, what has to be the diff for it to work

Question: I have a sack which can carry some weight, and a number of items with weights, and I want to put as much weight as possible into the sack. After some thought I came to a conclusion: I take the heaviest item that still fits every time and put it into the sack. Intuitively, this will work if each given weight is at least double the previous one, e.g. 2 4 8 16 32 64...
Can anyone help me prove whether I am right or wrong about that? I also have an intuition about it; I would love to hear yours.
Note: I thought about arguing that the sum of the previous numbers won't be bigger than the current number.
Yes, the described greedy algorithm will work for powers of two.
Note that the partial sum of the geometric sequence 1, 2, 4, 8, 16, ..., 2^(k-1) is 2^k - 1. That is why you should always choose the largest item that fits: it is bigger than any sum of smaller items.
In a mathematical sense, the set of powers of two forms a matroid.
But the greedy approach fails in the general case (example: weights 3, 3, 4 and capacity 6). You can use dynamic programming to solve this problem with integer weights. It is similar to the knapsack problem with each item's value equal to its weight (subset sum).
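A sketch of both approaches for integer weights; the greedy version is correct when each weight is at least the sum of all smaller ones (as with powers of two), while the DP handles the general case:

def greedy_fill(weights, capacity):
    # Take the largest item that still fits, repeatedly.
    total = 0
    for w in sorted(weights, reverse=True):
        if total + w <= capacity:
            total += w
    return total

def best_fill(weights, capacity):
    # Subset-sum DP: reachable[s] is True if some subset sums exactly to s.
    reachable = [True] + [False] * capacity
    for w in weights:
        for s in range(capacity, w - 1, -1):
            if reachable[s - w]:
                reachable[s] = True
    return max(s for s in range(capacity + 1) if reachable[s])

On the counterexample above, greedy_fill([3, 3, 4], 6) returns 4 while best_fill([3, 3, 4], 6) returns 6.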

Highest possible sum across 2D array

What is the best way to find the highest possible sum across a 2D integer array? You can't repeat columns and rows. Eg.
1 3 6
4 5 2
3 1 3
Max sum: 3+5+6=14
I know there is a method called the Hungarian algorithm, but that seems to be more suitable for finding the minimum sum.
Yes, you can use the Hungarian algorithm.
You need to modify the search criteria to look for the largest sum instead of the smallest one. You also need to run Bellman-Ford instead of Dijkstra for the search component (because Dijkstra can't handle the negative edge weights that maximization introduces).
You can't run into an endlessly increasing loop, because the selected nodes are already paired using their maximum value, so any change would yield a lower total sum. The algorithm will choose to rearrange the connections only if the loss from the already connected nodes is less than the gain from the newly connected one. You don't need to worry about it.
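Equivalently, you can negate the matrix and run the usual minimizing version. A sketch using SciPy's assignment solver, which supports maximization directly:

import numpy as np
from scipy.optimize import linear_sum_assignment

matrix = np.array([[1, 3, 6],
                   [4, 5, 2],
                   [3, 1, 3]])

rows, cols = linear_sum_assignment(matrix, maximize=True)
print(matrix[rows, cols].sum())  # 14, from picking 6, 5 and 3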

Combinatorial best match

Say I have a Group data structure which contains a list of Element objects, such that each group has a unique set of elements:
public class Group
{
public List<Element> Elements;
}
and say I have a list of populations who require certain elements, in such a way that each population has a unique set of required elements:
public class Population
{
public List<Element> RequiredElements;
}
I have an unlimited quantity of each defined Group, i.e. they are not consumed by populations.
Say I am looking at a particular Population. I want to find the best possible match of groups such that there is minimum excess elements, and no unmatched elements.
For example: I have a population which needs wood, steel, grain, and coal. The only groups available are {wood, herbs}, {steel, coal, oil}, {grain, steel}, and {herbs, meat}.
The last group, {herbs, meat}, isn't required at all by my population, so it isn't used. All the others are needed, but herbs and oil are not required, so they are wasted. Furthermore, steel appears twice across the chosen groups, so one lot of steel is also wasted. The best match in this example has a wastage of 3.
So for a few hundred Population objects, I need to find the minimum wastage best match and compute how many elements are wasted.
How do I even begin to solve this? Once I have found a match, counting the wastage is trivial. Finding the match in the first place is hard. I could enumerate all possibilities but with a few thousand populations and many hundreds of groups, it's quite a task. Especially considering this whole thing sits inside each iteration of a simulated annealing algorithm.
I'm wondering whether I can formulate the whole thing as a mixed-integer program and call a solver like GLPK at each iteration.
I hope I have explained the problem correctly. I can clarify anything that's unclear.
Here's my binary program, for those of you interested...
x is the decision vector, an element of {0,1}, which says that the population in question does/doesn't receive from group i. There is an entry for each group.
b is the column vector, an element of {0,1}, which says which resources the population in question does/doesn't need. There is an entry for each resource.
A is a matrix, an element of {0,1}, which says what resources are in what groups.
The program is:
Minimise: ((Ax - b)' * 1-vector) + (x' * 1-vector);
Subject to: Ax >= b;
The constraint just says that all required resources must be satisfied. The objective is to minimise all excess and the total number of groups used. (i.e. 0 excess with 1 group used is better than 0 excess with 5 groups used).
You can formulate an integer program for each population P as follows. Use a binary variable x_j to denote whether group j is chosen or not. Let A be a binary matrix, such that A_ij is 1 if and only if item i is present in group j. Then the integer program is:
min Σ_{i,j} A_ij x_j
s.t. Σ_j A_ij x_j >= 1 for all i in P,
x_j ∈ {0, 1} for all j.
Note that you can obtain the minimum wastage by subtracting |P| from the optimal value of the above IP.
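A sketch of this IP with the PuLP modelling library (its default CBC solver is used here; PuLP can also drive GLPK). groups is assumed to be a list of sets and required a set of needed elements; both names are hypothetical:

from pulp import LpProblem, LpMinimize, LpVariable, lpSum, PULP_CBC_CMD

def min_wastage(groups, required):
    prob = LpProblem("best_match", LpMinimize)
    x = [LpVariable(f"x{j}", cat="Binary") for j in range(len(groups))]
    # Objective: total elements delivered; wastage is this minus |required|.
    prob += lpSum(len(g) * x[j] for j, g in enumerate(groups))
    # Every required element must come from at least one chosen group.
    for item in required:
        prob += lpSum(x[j] for j, g in enumerate(groups) if item in g) >= 1
    prob.solve(PULP_CBC_CMD(msg=False))
    chosen = [j for j in range(len(groups)) if x[j].value() == 1]
    delivered = sum(len(groups[j]) for j in chosen)
    return chosen, delivered - len(required)

On the wood/steel/grain/coal example above, this selects the first three groups and reports a wastage of 3.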
Do you mean the maximum matching problem?
You would build a bipartite graph, where one side is your populations and the other is the groups, with an edge between a population and a group if the group contains an element the population requires.
To find a maximum matching you can use Kuhn's algorithm, which is described well on TopCoder.
But if you want to find a minimum edge dominating set (the minimum set of edges covering all the vertices), the problem becomes NP-hard and can't be solved in polynomial time.
Take a look at the weighted set cover problem; I think this is exactly what you described above. A basic description of the (unweighted) problem can be found here.
Finding the minimal waste as you defined it above is equivalent to finding a set cover such that the sum of the cardinalities of the covering sets is minimal. Hence, the weight of each set (= a group of elements) has to be defined as equal to its cardinality.
Since even the unweighted set cover problem is NP-complete, it is not likely that an efficient algorithm for your problem instances exists. Maybe a good greedy approximation algorithm will be sufficient for your purpose? Googling "weighted set cover" provides several promising results, e.g. this script.
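A sketch of the standard greedy approximation with weight = cardinality, as suggested above; it repeatedly takes the group with the lowest weight per newly covered element (assuming every required element appears in some group):

def greedy_set_cover(groups, required):
    # groups: list of sets; required: set of elements to cover.
    uncovered = set(required)
    chosen = []
    while uncovered:
        j = min(
            (j for j, g in enumerate(groups) if g & uncovered),
            key=lambda j: len(groups[j]) / len(groups[j] & uncovered),
        )
        chosen.append(j)
        uncovered -= groups[j]
    return chosen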

Find the "largest" dense sub matrix in a large sparse matrix

Given a large sparse matrix (say 10k+ by 1M+), I need to find a subset, not necessarily contiguous, of the rows and columns that form a dense matrix (all non-zero elements). I want this submatrix to be as large as possible (not the largest sum, but the largest number of elements) within some aspect-ratio constraints.
Are there any known exact or approximate solutions to this problem?
A quick scan on Google seems to give a lot of close-but-not-exact results. What terms should I be looking for?
Edit: just to clarify, the submatrix need not be contiguous. In fact, the row and column order is completely arbitrary, so adjacency is completely irrelevant.
A thought based on Chad Okere's idea:
1. Order the rows from largest count to smallest count (not necessary, but might help performance).
2. Select two rows that have a "large" overlap.
3. Add all other rows that won't reduce the overlap.
4. Record that set.
5. Add whatever row reduces the overlap by the least.
6. Repeat at #3 until the result gets too small.
7. Start over at #2 with a different starting pair.
8. Continue until you decide the result is good enough.
I assume you want something like this. You have a matrix like
1100101
1110101
0100101
You want columns 1,2,5,7 and rows 1 and 2, right? That submatrix would be 4x2 with 8 elements. Or you could go with columns 2,5,7 and rows 1,2,3, which would be a 3x3 matrix.
If you want an approximate method, you could start with a single non-zero element, then go on to find another non-zero element and add its row and column to your list. At some point you'll run into a non-zero element whose row and column, if added to your collection, would mean the collection is no longer entirely non-zero.
So for the above matrix, if you added (1,1) and (2,2), you would have rows 1,2 and columns 1,2 in your collection. If you tried to add (3,7), it would cause a problem because (3,1) is zero, so you couldn't add it. You could add (2,5) and (2,7) though, creating the 4x2 submatrix.
You would basically iterate until you can't find any more rows and columns to add. That gets you to a local maximum. You could store the result and start again with another starting point (perhaps one that didn't fit into your current solution).
Then just stop when you can't find any more after a while.
That would obviously take a long time, but I don't know if you'll be able to do it any more quickly.
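A rough sketch of that growth procedure, assuming M is a NumPy boolean matrix (True = non-zero) and start is a non-zero seed cell:

import numpy as np

def grow_dense(M, start):
    rows, cols = {start[0]}, {start[1]}
    changed = True
    while changed:
        changed = False
        for r in range(M.shape[0]):
            # A row may join only if it is non-zero on every selected column.
            if r not in rows and all(M[r, c] for c in cols):
                rows.add(r); changed = True
        for c in range(M.shape[1]):
            # A column may join only if it is non-zero on every selected row.
            if c not in cols and all(M[r, c] for r in rows):
                cols.add(c); changed = True
    return sorted(rows), sorted(cols)

Starting from cell (0, 0) of the example matrix above, this recovers the 2x4 (eight-element) submatrix.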
I know you aren't working on this anymore, but I thought someone might have the same question as me in the future.
So, after realizing this is an NP-hard problem (by reduction from MAX-CLIQUE), I decided to come up with a heuristic that has worked well for me so far:
Given an N x M binary/boolean matrix, find a large dense submatrix:
Part I: Generate reasonable candidate submatrices
1. Consider each of the N rows to be an M-dimensional binary vector, v_i, where i = 1 to N.
2. Compute a distance matrix for the N vectors using the Hamming distance.
3. Use the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) algorithm to cluster the vectors.
Initially, each of the v_i vectors is a singleton cluster. Step 3 above (clustering) gives the order that the vectors should be combined into submatrices. So each internal node in the hierarchical clustering tree is a candidate submatrix.
Part II: Score and rank candidate submatrices
For each candidate submatrix, eliminate any column with one or more zeros, and calculate D, the number of elements in the resulting dense submatrix.
Select the submatrix that maximizes D.
I also had some considerations regarding the minimum number of rows that needed to be preserved from the initial full matrix, and I would discard any candidate submatrix that did not meet this criterion before selecting the one with the maximum D value.
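A compact sketch of Part I plus the Part II score using SciPy (method='average' in linkage is UPGMA). For brevity it scores a fixed number of flat clusters rather than every internal node of the tree, which the answer above describes; M is assumed to be a binary NumPy array:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def best_dense_submatrix(M, num_clusters=10):
    # Part I: cluster the rows by Hamming distance with UPGMA linkage.
    Z = linkage(pdist(M, metric="hamming"), method="average")
    labels = fcluster(Z, t=num_clusters, criterion="maxclust")
    best = (0, None)
    for c in np.unique(labels):
        rows = np.where(labels == c)[0]
        # Part II: drop every column containing a zero in these rows.
        cols = np.where(M[rows].all(axis=0))[0]
        D = len(rows) * len(cols)   # elements in the dense submatrix
        if D > best[0]:
            best = (D, (rows, cols))
    return best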
Is this a Netflix problem?
MATLAB or some other sparse matrix libraries might have ways to handle it.
Is your intent to write your own?
Maybe a 1D approach for each row would help you. The algorithm might look like this:
Loop over each row.
Find the index of the first non-zero element.
Find the index of the last non-zero element, and store the span between the two.
Sort the rows from largest to smallest span between non-zero columns.
At this point I start getting fuzzy (sorry, not an algorithm designer). I'd try looping over each row, lining up the indexes of the starting points, and looking for the maximum non-zero run of column indexes that I could.
You don't specify whether or not the dense matrix has to be square. I'll assume not.
I don't know how efficient this is or what its Big-O behavior would be. But it's a brute force method to start with.
EDIT: This is NOT the same as the problem below. My bad...
But based on the last comment below, it might be equivalent to the following:
Find the furthest vertically separated pair of zero points that have no zero point between them.
Find the furthest horizontally separated pair of zero points that have no zeros between them.
Then the region you're looking for is the rectangle that fits between these two pairs of points?
This exact problem is discussed in a gem of a book called "Programming Pearls" by Jon Bentley, and, as I recall, although there is a solution in one dimension, there is no easy answer for the 2-d or higher dimensional variants ...
The 1-D problem is, effectively: find the largest sum of a contiguous subset of a set of numbers.
Iterate through the elements, keeping track of a running total from a specific previous element, and of the maximum subtotal seen so far (along with the start and end elements that generated it). At each element, if the running subtotal is greater than the maximum seen so far, the maximum and its end element are updated. If the running total goes below zero, the start element is reset to the current element and the running total is reset to zero.
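That is Kadane's algorithm; a minimal version that returns the best sum together with its index range:

def max_subarray(a):
    best, best_range = a[0], (0, 0)
    running, start = 0, 0
    for i, x in enumerate(a):
        running += x
        if running > best:
            best, best_range = running, (start, i)
        if running < 0:      # a negative prefix never helps; restart here
            running, start = 0, i + 1
    return best, best_range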
The 2-D problem came from an attempt to build a visual image-processing algorithm: within a stream of brightness values representing the pixels of a 2-color image, find the "brightest" rectangular area, i.e. the contiguous 2-D submatrix with the highest sum of brightness values, where "brightness" was measured by the difference between a pixel's brightness value and the overall average brightness of the entire image (so many elements had negative values).
EDIT: To look up the 1-D solution I dredged up my copy of the 2nd edition of this book, and in it, Jon Bentley says "The 2-D version remains unsolved as this edition goes to print..." which was in 1999.
