What is the difference between clustering and matching?

What is the difference between clustering and matching?
For example: there is a pool of four elements, and in one scenario I want to generate pairs. I measure the distance from each element to every other element, which yields a 4x4 distance matrix. The matching algorithm then finds the two pairs with the lowest (or highest) total weight.
What does a clustering algorithm do instead? If I ask for two clusters, is the result the same or not?

Specifying the number of elements in a cluster (pairs for example) doesn't make much sense. If you have been looking at k-means (k-medoids), the k actually indicates how many clusters will be created in total. So, if you have 4 elements and use k = 2, you can get one cluster with 1 element and another cluster with 3 elements, depending on the data you have. Anyway, clustering on 4 elements doesn't make sense.
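To make the difference concrete, here is a minimal sketch on four hypothetical 1-D points (the data, `best_perfect_matching` and `two_means` are illustrative assumptions, not part of the question). Matching must use every element in exactly one pair, while a 2-cluster k-means is free to produce clusters of unequal size.

```python
# Hypothetical 1-D data: three points close together and one far away.
points = [0.0, 1.0, 1.2, 10.0]

def dist(i, j):
    return abs(points[i] - points[j])

def pairings(items):
    """Yield every way to split an even-sized list into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail

def best_perfect_matching(n):
    """Brute-force minimum-weight perfect matching on indices 0..n-1."""
    return min(pairings(list(range(n))),
               key=lambda m: sum(dist(i, j) for i, j in m))

def two_means(values, iters=20):
    """Tiny 1-D k-means with k = 2, seeded with the min and max values."""
    centers = [min(values), max(values)]
    clusters = [[], []]
    for _ in range(iters):
        clusters = [[], []]
        for v in values:
            idx = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[idx].append(v)
        centers = [sum(c) / len(c) for c in clusters]
    return clusters

matching = best_perfect_matching(len(points))  # always two pairs of size 2
clusters = two_means(points)                   # sizes depend on the data
```

On this data the matching is forced to pair the outlier with one of the nearby points, whereas the clustering isolates it, giving one cluster of 3 elements and one of 1.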

Related

Highest possible sum across 2D array

What is the best way to find the highest possible sum across a 2D integer array, where you can't reuse a row or a column? E.g.
1 3 6
4 5 2
3 1 3
Max sum: 3+5+6=14
I know there is a method called the Hungarian algorithm, but that seems to be more suitable for finding minimum sum.
Yes, you can use the Hungarian algorithm.
You need to modify the search criterion to look for the largest sum instead of the smallest one. You also need to run Bellman-Ford instead of Dijkstra for the search component (because Dijkstra can't compute a maximum-sum path).
You can't run into a constantly increasing loop, because the selected nodes are already paired using their maximum value, so any change would yield a lower total sum. The algorithm will choose to rearrange the connections only if the loss from the already connected nodes is less than the gain from the newly connected one, so you don't need to worry about it.
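For a matrix this small the claimed optimum can be checked by brute force. The sketch below is not the Hungarian algorithm; it simply tries every row-to-column permutation (the helper names are illustrative). It also shows a common alternative to modifying the algorithm: negate the entries and run an unmodified minimum-sum solver.

```python
from itertools import permutations

matrix = [
    [1, 3, 6],
    [4, 5, 2],
    [3, 1, 3],
]

def assignment_sum(m, cols):
    """Sum of m[row][cols[row]] for a given column permutation."""
    return sum(m[r][c] for r, c in enumerate(cols))

def max_assignment(m):
    """Try all permutations; only feasible for small n (n! candidates)."""
    best = max(permutations(range(len(m))), key=lambda cols: assignment_sum(m, cols))
    return assignment_sum(m, best), best

def min_assignment(m):
    best = min(permutations(range(len(m))), key=lambda cols: assignment_sum(m, cols))
    return assignment_sum(m, best), best

max_sum, max_cols = max_assignment(matrix)

# Maximising over m is the same as minimising over -m.
negated = [[-x for x in row] for row in matrix]
neg_sum, neg_cols = min_assignment(negated)
```

Here `max_cols` gives, for each row, the chosen column; the minimum over the negated matrix selects the same cells.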

Is there an algorithm to extract the minimum number of Cartesian products from a set of formulas?

For example, we have a set of formulas as below:
B*2*j
B*3*i
B*3*j
C*2*j
C*3*i
C*3*j
D*2*i
D*2*j
D*3*i
D*3*j
And we could have three Cartesian products to represent the formulas above:
D*(2+3)*(i+j)
(B+C)*3*(i+j)
(B+C)*2*j
So the total number is 3. And we could also have:
3*(B+C+D)*(i+j)
2*(B+C)*j
2*D*(i+j)
which is also 3.
Is there an algorithm that determines the minimum number of Cartesian products for a set of formulas, and that also produces these products?
First, I'll write a set of formulas as terms separated by +, since the transformation you're looking for makes sense algebraically (apart from the fact that you don't want to combine numbers like 2+3 into 5).
The basic operation that you have available is factorising: combining two terms like ABC+ABD into AB(C+D). Based on your comment, you can only generate new factors that consist of a sum of single-factor terms, like C+D in the previous example; you're not allowed to factorise e.g. ABCD+ABDE into AB(CD+DE).
You can factorise 2 k-factor terms if and only if they share exactly k-1 factors. (E.g. k=3 in my ABC+ABD example.) Every such factorisation reduces the number of terms in the set by 1: 2 are removed and 1 is added back in.
Doing this multiple times works when combining 3 or more terms: ABC+ABD+ABE can first be factorised into AB(C+D)+ABE and then those 2 terms factorised again into AB(C+D+E). Notice that it doesn't matter in which order we list terms in a sum or factors in a product, and nor does it matter in which order we perform factorisation steps when building a factor containing 3 or more terms.
We can then frame the problem as a search problem in a graph, in which the start vertex corresponds to the original formula (B*2*j + B*3*i + ... + D*3*j in your example) and from each vertex v there emanate arcs to its child vertices, which each correspond to the result of performing some factorisation on v. v will have a child vertex for each possible factorisation that could be performed on it; if there are m terms in v, then this means it could have up to m(m-1)/2 children in the worst case, because it could be that all m terms share a full complement of k-1 factors, meaning that any pair of them could be combined.
If a vertex has no pair of terms that can be combined via factorisation then it is a "leaf" -- it has no children, and can't be processed further. What we want to find is a leaf vertex that has the fewest number of terms. Since every factorisation, corresponding to an arc in the graph, reduces the number of terms by 1, this is equivalent to searching for a deepest-possible vertex. This can be done using DFS or BFS. Note however that the same expression (vertex) can be generated many times over using this approach, so it will be crucial for performance to maintain a hashtable seen that records all expressions that have already been processed; then if we visit a vertex, try to generate a child for it, and see that this child is already in seen, we avoid visiting this child a second time.
To mitigate against the phenomenon of the same expression being generated via multiple different orderings of the same set of factorisations, you can add a rule: order v's child factorisations somehow, so that if there are n children they correspond to factorisations 1, 2, ..., n in this ordering, and record in a separate "already skipped" field in each child vertex the set of earlier (in the ordering) factorisations that were skipped over to generate this child. Then, when visiting a vertex, avoid generating any of its "already skipped" factorisations as children, since doing so would create a vertex that is identical to some other existing vertex (by performing the same pair of operations in reverse order).
There are probably other speedups available that will reduce the number of duplicate vertices that are generated in the first place, but this should be enough to get results for small problems.
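The search described above can be sketched as follows. This is a minimal illustration under an added assumption: every term has exactly one factor per position (letter, number, index), so a term is stored as a tuple of sets and two terms can be factorised when they differ in exactly one position. The names `merge`, `min_products` and `expand` are hypothetical, and only the `seen` table is implemented, not the "already skipped" refinement.

```python
from itertools import combinations, product

def merge(t1, t2):
    """Factorise two products that differ in exactly one position, else None."""
    diff = [i for i, (a, b) in enumerate(zip(t1, t2)) if a != b]
    if len(diff) != 1:
        return None
    i = diff[0]
    return t1[:i] + (t1[i] | t2[i],) + t1[i + 1:]

def min_products(terms):
    """DFS over all reachable factorisations, remembering visited vertices."""
    start = frozenset(tuple(frozenset([f]) for f in term) for term in terms)
    seen = {start}
    stack = [start]
    best = start
    while stack:
        state = stack.pop()
        if len(state) < len(best):
            best = state
        for t1, t2 in combinations(state, 2):
            merged = merge(t1, t2)
            if merged is None:
                continue
            child = (state - {t1, t2}) | {merged}
            if child not in seen:  # skip expressions already generated
                seen.add(child)
                stack.append(child)
    return best

def expand(state):
    """Multiply the products back out into the set of single terms."""
    covered = set()
    for term in state:
        covered |= set(product(*term))
    return covered

terms = [
    ("B", "2", "j"), ("B", "3", "i"), ("B", "3", "j"),
    ("C", "2", "j"), ("C", "3", "i"), ("C", "3", "j"),
    ("D", "2", "i"), ("D", "2", "j"), ("D", "3", "i"), ("D", "3", "j"),
]
best = min_products(terms)
```

Each element of `best` is one Cartesian product, for example `({"D"}, {"2", "3"}, {"i", "j"})` for D*(2+3)*(i+j). The search is exhaustive, so it is only practical for small inputs.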
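Write down your sum in matrix form. Then what you are asking for is the rank of that matrix, and a corresponding decomposition into dyadic products. This decomposition is far from unique. Treating 2 and 3 as numbers, so that coefficients add up (e.g. B*2*j + B*3*j = 5*B*j), the sum is
$$
\begin{bmatrix} B & C & D \end{bmatrix}
\begin{bmatrix} 3 & 5 \\ 3 & 5 \\ 5 & 5 \end{bmatrix}
\begin{bmatrix} i \\ j \end{bmatrix}
$$
As one can see, the matrix in the middle has full rank 2.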
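As a worked example of one such decomposition (one choice among many, and assuming as above that 2 and 3 are ordinary numbers): the first two rows of the coefficient matrix are identical, so the matrix splits into two dyadic products, which gives a two-term factorisation of the sum.

$$
\begin{bmatrix} 3 & 5 \\ 3 & 5 \\ 5 & 5 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}
\begin{bmatrix} 3 & 5 \end{bmatrix}
+
\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}
\begin{bmatrix} 5 & 5 \end{bmatrix}
\quad\Longrightarrow\quad
(B + C)(3i + 5j) + D(5i + 5j)
$$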
If you intend to use 2 and 3 also as variables, then you are asking to decompose a tensor of order 3 into a minimum number of terms that factorize, i.e., that are tensor products of vectors.

Algorithm design to assign nodes to graphs

I have a graph-theoretic (which is also related to combinatorics) problem that is illustrated below, and wonder what is the best approach to design an algorithm to solve it.
Given 4 different graphs of 6 nodes (by different, I mean different structures, e.g. STAR, LINE, COMPLETE, etc), and 24 unique objects, design an algorithm to assign these objects to these 4 graphs 4 times, so that the number of repeating neighbors on the graphs over the 4 assignments is minimized. For example, if object A and B are neighbors on 1 of the 4 graphs in one assignment, then in the best case, A and B will not be neighbors again in the other 3 assignments.
Obviously, the degree to which such minimization can go is dependent on the specific graph structures given. But I am more interested in a general solution here so that given any 4 graph structures, such minimization is guaranteed as the result of the algorithm.
Any suggestion/idea of solving this problem is welcome, and some pseudo-code may well be sufficient to illustrate the design. Thank you.
Representation:
You have 24 elements; I will name them A to X (the first 24 letters).
Each of these elements occupies one node in one of the 4 graphs. I number the 24 nodes of the 4 graphs from 1 to 24.
I identify the position of A by a 24-tuple (xa1, xa2, ..., xa24). For example, to assign A to node 8, I write (xa1, xa2, ..., xa24) = (0,0,0,0,0,0,0,1,0,0,...,0), where the 1 is in position 8.
We can say that A = (xa1, ..., xa24).
e1, ..., e24 are the unit vectors (1,0,...,0) to (0,0,...,1).
Note on the operator '.' (dot product):
A.e1 = xa1
...
X.e24 = xx24
With this notation there are some constraints on A, ..., X:
every component (xa1, ..., xx24) is in {0,1}
and
Sum(xai) = 1 ... Sum(xxi) = 1
Sum(xa1, xb1, ..., xx1) = 1 ... Sum(xa24, xb24, ..., xx24) = 1
since each element is assigned to exactly one node and each node holds exactly one element.
I define a graph by the neighbor relation of each node; let's say node 8 has neighbors node 7 and node 10.
To check that A sits on node 8 and B is one of its neighbors, for example, I need A.e8 = 1 and (B.e7 = 1 or B.e10 = 1), so I just need A.e8*(B.e7+B.e10) == 1.
In the function isNeighborInGraphs(A,B) I test this for every node and get one or zero, depending on whether A and B are neighbors.
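The neighbor test can be sketched in a few lines. This is only an illustration of the one-hot encoding described above; the `neighbors` dictionary covers just the example around node 8 and is an assumption, as are the helper names.

```python
NUM_NODES = 24

def one_hot(node):
    """24-tuple position vector with a 1 at the given node (1-based)."""
    return [1 if i == node else 0 for i in range(1, NUM_NODES + 1)]

# Hypothetical neighbor relation: node 8 is adjacent to nodes 7 and 10.
neighbors = {7: [8], 8: [7, 10], 10: [8]}

def is_neighbor_in_graphs(a, b, neighbors):
    """Sum over nodes n of a.e_n * (sum of b.e_m for m adjacent to n)."""
    return sum(
        a[n - 1] * sum(b[m - 1] for m in adjacent)
        for n, adjacent in neighbors.items()
    )

A = one_hot(8)
B = one_hot(7)
C = one_hot(9)
```

With these positions, `is_neighbor_in_graphs(A, B, neighbors)` evaluates the product A.e8*(B.e7+B.e10), and the result is 1 for neighbors and 0 otherwise.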
Notations:
4 graphs of 6 nodes, the position of each element is defined by an integer from 1 to 24.
(1 to 6 for first graph, etc...)
e1... e24 are the unit vectors (1,0,0...0) to (0,0...1)
Let A, B ...X be the N elements.
A=(0,0...,1,...,0)=(xa1,xa2...xa24)
B=...
...
X=(0,0...,1,...,0)
Graph descriptions:
isNeighborInGraphs(A,B) = A.e1*B.e2 + ...
// for example, if nodes 1 and 2 are neighbors in one graph
State of the system:
L(A)=[B,B,C,E,G...] // list of neighbors of A so far (entries can repeat)
update(L(A)):
for Element in [B, ..., X]
if isNeighborInGraphs(A,Element)
L(A).append(Element)
endIf
endfor
Objective functions
N(A)=len(L(A))+Sum(isNeighborInGraphs(A,i),i in L(A))
...
N(X)= ...
Description of the algorithm
1. Start with an initial position: A=e1, ..., X=e24.
2. Update L(A), L(B), ..., L(X).
3. Solve this with a solver (AMPL, for example, should work, I guess, since it's a nonlinear optimization problem):
Objective function:
min(Sum(N(Z), Z=A to X))
Constraints:
Sum(xai)=1 ... Sum(xxi)=1
Sum(xa1,xb1,...,xx1)=1 ... Sum(xa24,xb24,...,xx24)=1
You get the best solution.
4. Repeat steps 2 and 3 three more times.
If all four graphs are K_6, then the best you can do is choose 4 set partitions of your 24 objects into 4 sets each of cardinality 6 so that the pairwise intersection of any two sets has cardinality at most 2. You can do this by choosing set partitions that are maximally far apart in the Hasse diagram of set partitions with partial order given by refinement. The general case is much harder, but perhaps you can still begin with this crude approximation of a solution and then be clever with which vertex is assigned which object in the four assignments.
Assuming you don't want to enumerate all combinations, compute the sum each time, and choose the lowest, you can formulate a minimization problem with constraints on your 24 variables that depend on the shape of your graphs. Depending on the constraints, it can be solved with a linear programming solver (e.g. a simplex engine) or with a non-linear solver, which is much more expensive in terms of time. You can also use software like LINGO/LINDO to quickly build a decision-theory model and test its correctness (you need some decision-theory background, though).
If this has anything to do with the real world, then it's unlikely that you absolutely must have a solution that is the true minimum. Close to the minimum should be good enough, right? If so, you could repeatedly randomly make the 4 assignments and check the results until you either run out of time or have a good-enough solution or appear to have stopped improving your best solution.
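The random-restart idea can be sketched like this. Everything here is an illustrative assumption: nodes are numbered 0 to 23, objects are the integers 0 to 23, the four example graphs are a star, a line, a complete graph and a cycle, and the cost is the number of times an object pair is adjacent beyond its first occurrence.

```python
import random
from collections import Counter

def example_edges():
    """Four hypothetical 6-node graphs on nodes 0..23."""
    star = [(0, i) for i in range(1, 6)]
    line = [(i, i + 1) for i in range(6, 11)]
    complete = [(i, j) for i in range(12, 18) for j in range(i + 1, 18)]
    cycle = [(18 + i, 18 + (i + 1) % 6) for i in range(6)]
    return star + line + complete + cycle

def repeated_pairs(assignments, edges):
    """Count adjacencies of the same object pair beyond the first one."""
    counts = Counter()
    for perm in assignments:  # perm[node] = object placed on that node
        for u, v in edges:
            counts[frozenset((perm[u], perm[v]))] += 1
    return sum(c - 1 for c in counts.values() if c > 1)

def random_search(edges, num_nodes=24, rounds=4, tries=200, seed=0):
    """Keep the best of `tries` random sets of `rounds` assignments."""
    rng = random.Random(seed)
    best, best_cost = None, None
    for _ in range(tries):
        candidate = [rng.sample(range(num_nodes), num_nodes) for _ in range(rounds)]
        cost = repeated_pairs(candidate, edges)
        if best_cost is None or cost < best_cost:
            best, best_cost = candidate, cost
        if best_cost == 0:  # cannot do better than no repeats
            break
    return best, best_cost

edges = example_edges()
best, best_cost = random_search(edges)
```

Stopping criteria such as a time budget or "no improvement for a while" can replace the fixed `tries` count; a local search that swaps two objects at a time would be a natural next step.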

Creating a cluster centroid prone to noise

I'm working on a clustering algorithm to group similar ranges of real numbers. After I group them, I have to create one range for that cluster, i.e., a cluster centroid. For example, if one cluster contains the values <1,6>, <0,7> and <0,6>, then this cluster stands for all values in <0,7>. The question is how to create such a resulting range. I was thinking of taking the min and max over all values in the cluster, but that would make the algorithm very sensitive to noise. I should weight it somehow, but I'm not sure how. Any hints? Thanks.
Perhaps you can convert all ranges to their midpoints before running your clustering algorithm. That way you convert your problem into clustering points on a line. Previously, the centroid range could 'grow' and in the next iteration consume more ranges that perhaps should belong to another cluster.
# map each range to its midpoint (a hash keyed by range, not an array)
midpoints = {}
for range in ranges
  midpoints[range] = range.min + (range.max - range.min) / 2.0
end
After the algorithm is finished you can do as you previously suggested and take the min and max values of all the ranges in the cluster to create the range for that centroid.
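A small sketch of both steps, using the three ranges from the question as one finished cluster. Ranges are modelled as `(low, high)` tuples, which is an assumption. The `median_range` function is not part of the answer above: it is one possible noise-tolerant alternative to the min/max envelope, shown only for comparison.

```python
from statistics import median

def midpoint(r):
    """Midpoint of a (low, high) range, used as the point to cluster on."""
    low, high = r
    return low + (high - low) / 2

def envelope(ranges):
    """Min/max range over the cluster, as suggested in the answer."""
    return (min(low for low, _ in ranges), max(high for _, high in ranges))

def median_range(ranges):
    """Alternative (assumption): median endpoints, less sensitive to outliers."""
    return (median(low for low, _ in ranges), median(high for _, high in ranges))

cluster = [(1, 6), (0, 7), (0, 6)]
midpoints = [midpoint(r) for r in cluster]
full = envelope(cluster)
robust = median_range(cluster)
```

For this cluster the envelope is `(0, 7)`, matching the question, while the median-based range is `(0, 6)` because a single wide range no longer stretches the result.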

Choosing number of clusters in k means

I want to cluster a large sample of data, and for this I am using the kmeans function in MATLAB. The problem is that it returns the data partitioned into however many clusters I specify.
How can I know which number of clusters is optimal?
I thought that getting an equal number of elements in each cluster would be optimal, but this never happens. Instead, it will cluster the data for any number I put in.
Please help...
From what I have read, one possible answer is this: k-means partitions the data around the cluster means, so in theory the best clustering would be one in which every partition holds roughly the same number of points.
I used kmeans++, which is better than plain k-means because it chooses the initial centers more carefully instead of picking them purely at random, and then iterated over the number of partitions until the partition sizes were almost equal. This is only approximate: for k = 3 I got 2180, 729, 1219, and for k = 4 I got 30, 2422, 1556, 120, so I chose 3 as my final answer.
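The selection rule used above can be written down in a few lines. This sketch only encodes the answer's own heuristic (prefer the k whose cluster sizes are most balanced), using the sizes reported there; the `imbalance` measure (largest cluster divided by smallest) is an assumption about how "almost equal" might be quantified.

```python
def imbalance(sizes):
    """Ratio of the largest to the smallest cluster; 1.0 means perfectly balanced."""
    return max(sizes) / min(sizes)

# Cluster sizes reported in the answer for k = 3 and k = 4.
sizes_by_k = {
    3: [2180, 729, 1219],
    4: [30, 2422, 1556, 120],
}

best_k = min(sizes_by_k, key=lambda k: imbalance(sizes_by_k[k]))
```

With these numbers the ratio is about 3 for k = 3 and about 81 for k = 4, so the rule picks k = 3, in line with the answer.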
