I am writing a program to implement k-means clustering.
Consider a simple input with 4 vertices a, b, c, and d, with the following edge costs:
[vertex1] [vertex2] [edge cost]
a b 1
a c 2
a d 3
b d 4
c d 5
Now I need to make the program run until I get 2 clusters.
My doubt is: in the first step, when I calculate the minimum distance, it is a->b (edge cost 1), so I should consider ab as a single cluster. If that is the case, what will be the distance of ab from c and d?
The K-means algorithm works as follows:
1. Choose k points as initial centroids (hence the K).
2. Calculate the distance from every vertex to each of the k chosen centroids.
3. Assign each vertex to the closest centroid.
4. Recalculate each centroid as the mean of all the vertices assigned to it (hence k-means: one mean calculation for each of the k centroids).
5. Go back to step 2, and stop when, in step 3, no vertex gets assigned to a different centroid -- or when your error condition is satisfied.
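As an illustration only, here is a minimal sketch of that loop in Python. It assumes each vertex has already been given coordinates (as suggested in the next paragraph); the function name and the random choice of initial centroids are my own, not part of the question.

    import random

    def kmeans(points, k, max_iter=100):
        # step 1: choose k of the points as initial centroids
        centroids = random.sample(points, k)
        for _ in range(max_iter):
            # steps 2-3: assign every point to its closest centroid
            clusters = [[] for _ in range(k)]
            for p in points:
                d = [sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centroids]
                clusters[d.index(min(d))].append(p)
            # step 4: move each centroid to the mean of its cluster
            new_centroids = []
            for c, members in zip(centroids, clusters):
                if members:
                    new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
                else:
                    new_centroids.append(c)   # keep the old centroid if its cluster is empty
            # step 5: stop once the centroids no longer move
            if new_centroids == centroids:
                break
            centroids = new_centroids
        return clusters

With k = 2 and 2-D coordinates for a, b, c, d, this returns the two clusters you are after.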
In your case, as you have an undirected graph, it'd be better to generate coordinates for each vertex from the edge distances, and then apply the algorithm.
If you don't want to do this initial step, you may calculate the distance from a vertex to all other reachable vertices, but you'd have to do this at every iteration -- which is quite an unnecessary overhead.
For your undirected graph:
[vertex1] [vertex2] [edge cost]
a b 1
a c 2
a d 3
b d 4
c d 5
The table of distances would be something like:

    a   b   c   d
a   0   1   2   3
b   1   0   3*  4
c   2   3*  0   5
d   3   4   5   0

*  b to c = (b to a) + (a to c) = 1 + 2 = 3
If this should be your table, simply apply Dijkstra's algorithm to your graph, once for each vertex, and take the resulting table as your table of distances.
The table will then contain the minimal distances; if you have some other policy for computing them, it's entirely up to you to define how they are calculated.
Notice also that if your graph were directed, the matrix would not be symmetric, as it is in this case.
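As a concrete illustration (my own sketch, not part of the original answer), running Dijkstra's algorithm from every vertex over the edge list above produces exactly this table, using only the standard library:

    import heapq
    from collections import defaultdict

    edges = [('a', 'b', 1), ('a', 'c', 2), ('a', 'd', 3), ('b', 'd', 4), ('c', 'd', 5)]

    graph = defaultdict(list)
    for u, v, w in edges:              # undirected graph: add both directions
        graph[u].append((v, w))
        graph[v].append((u, w))

    def dijkstra(source):
        dist = {source: 0}
        heap = [(0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float('inf')):
                continue               # stale heap entry
            for v, w in graph[u]:
                if d + w < dist.get(v, float('inf')):
                    dist[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        return dist

    table = {u: dijkstra(u) for u in graph}
    print(table['b']['c'])             # 3, i.e. the path b -> a -> c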
First of all, this is an assignment, and I am not looking for direct answers but instead for the complexity of the best solution, as you might be thinking of it.
This is the well-known problem of finding the shortest path between two points in a matrix (Start and End) while having obstacles in the way. The acceptable moves are up, down, left, and right. Let's say that while moving I carry something, and the cost of each movement is 2. There are points in the matrix (let's call them B points) where I can leave this item at one B point and pick it up at another B point. The cost of dropping the item at a B point is 1, and the cost of picking it up from a B point is 1 again. Whenever I move without the item, my cost of moving is 1.
What I think of as a solution is to transform the matrix into a tree and apply BFS. However, that works without the B points.
Whenever I take the B points into account, the complexity comes to a worst-case N^2.
Here is an example:
S - - -
- - - -
B - - B
- - O E
S = start, E = end, B = a B point where the item can be dropped, O = obstacle
So I start at S, move down twice to the B point (2*2 = 4 points), leave the item at the B point (1 point), move right three times (3*1 = 3 points), pick it up (1 point), and move down (2 points) = a total of 11 points.
What I thought was to build a graph with a node for every B point; however, this would create a very dense cyclic graph of almost (V-1)*(V-1) edges, which takes the algorithm into N^2 territory just to create the graph.
That is the worst-case scenario, as above:
S b b b
b b b b
b b b b
b b b E
Another option I thought of was first calculating the shortest path without the B points.
Then iterate, where at each iteration:
First run BFS between S and the closest B.
Run BFS between E and the closest B.
Then see whether there is a path between the B closest to S and the B closest to E.
If there is, I would check whether that path is shorter than the regular shortest path with obstacles.
If it is longer, then there is no shorter path (no greedy test).
If there is no path between the two B points, try the second closest to S and try again.
If there is no path again, try the second closest to E with the closest to S.
However, I am not able to calculate the complexity of this in the worst-case scenario, plus there is no greedy test to validate it.
Any help on calculating the complexity, or even pointing out the best-complexity solution (not the solution itself, just the complexity), would be greatly appreciated.
Your matrix is a representation of a graph. Without the cheat paths it is quite easy to implement a nice BFS. Implementing the cheat paths is not a big deal. Just add the same matrix as another 'layer' on top of the first one. The bottom layer is 'carry', the top layer is 'no carry'. You can move to the other layer only at B points, for the given cost. This is the same BFS with a third dimension.
You have n^2 nodes and roughly 2n(n-1) edges per layer, and additionally at most n^2 edges connecting the layers. That's O(n^2).
You can just build a new graph with nodes labeled (N, w), where N is a node in the original graph (i.e. a position in your matrix) and w = 0 or 1 indicates whether you're carrying the weight. It's then quite easy to add all the possible edges to this graph.
This new graph is of size 2*V, not V^2 (and the number of edges is around 4*V+number(B)).
Then you can use a shortest path algorithm, for instance Dijkstra's algorithm: complexity O(E + V log(V)) which is O(V log(V)) in your case.
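A rough sketch of that search (my own illustration, assuming the grid is given as a list of strings like the example in the question; Dijkstra's algorithm is used because the step costs are 1 and 2):

    import heapq

    grid = ["S...",     # the example from the question:
            "....",     # S = start, E = end, B = drop point, O = obstacle
            "B..B",
            "..OE"]
    rows, cols = len(grid), len(grid[0])
    find = lambda ch: next((r, c) for r in range(rows)
                           for c in range(cols) if grid[r][c] == ch)
    start, end = find('S'), find('E')

    # a state is (row, col, carrying); we must reach E while carrying the item
    dist = {(start[0], start[1], True): 0}
    heap = [(0, start[0], start[1], True)]
    while heap:
        d, r, c, carrying = heapq.heappop(heap)
        if d > dist.get((r, c, carrying), float('inf')):
            continue
        if (r, c) == end and carrying:
            print(d)                            # 11 for the example grid
            break
        steps = []
        if grid[r][c] == 'B':                   # drop or pick up at a B point, cost 1
            steps.append((r, c, not carrying, 1))
        move_cost = 2 if carrying else 1
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != 'O':
                steps.append((nr, nc, carrying, move_cost))
        for nr, nc, ncarry, cost in steps:
            if d + cost < dist.get((nr, nc, ncarry), float('inf')):
                dist[(nr, nc, ncarry)] = d + cost
                heapq.heappush(heap, (d + cost, nr, nc, ncarry))

This is the same layered idea as in the previous answer: the boolean carrying flag is the extra layer / third dimension.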
I'm having trouble finding a counterexample for the following variation of the TSP problem.
Input: G = (V, E), an undirected complete graph that satisfies the triangle inequality, a weight function w: E -> R+, and a source vertex s.
Output: A simple Hamiltonian cycle that starts and ends at s, with minimum weight.
Algorithm:
1. S = empty set.
2. B = E sorted by weight.
3. Initialize an array M of size |V|, where each cell holds a degree counter
   (initialized to 0) and a list of pointers to all the edges of that vertex (in B).
4. While |S| != |V| - 1:
   a. e = (u, v) = removeHead(B).
   b. If e does not close a cycle in S then:
      i.   S = S union {e}.
      ii.  Increase the degree counters of u and v.
      iii. If M[u].deg = 2 then remove every e' = (u, x) from B.
      iv.  If M[v].deg = 2 then remove every e' = (v, x) from B.
5. S = S union removeHead(B).
The cycle check will be done as in Kruskal's algorithm (using a union-find data structure).
Steps 4.b.iii and 4.b.iv will be done using the lists of pointers.
I highly doubt that this algorithm is correct, so I immediately turned to finding out why it is wrong. Any help would be appreciated.
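For concreteness, here is my own Python transcription of the procedure above (the union-find cycle check and the lazy edge removal are implementation choices, not part of the original pseudocode):

    def greedy_tour(vertices, w):
        # w maps frozenset({u, v}) -> edge weight, for a complete graph
        parent = {v: v for v in vertices}            # union-find for the cycle check
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        degree = {v: 0 for v in vertices}
        edges = sorted(w, key=w.get)                 # step 2: sort E by weight
        chosen, i = [], 0
        while len(chosen) < len(vertices) - 1:       # step 4
            e = edges[i]; i += 1
            u, v = tuple(e)
            if degree[u] == 2 or degree[v] == 2:     # steps 4.b.iii/iv, done lazily
                continue
            if find(u) == find(v):                   # e would close a premature cycle
                continue
            chosen.append(e)                         # step 4.b.i
            degree[u] += 1; degree[v] += 1           # step 4.b.ii
            parent[find(u)] = find(v)
        # step 5: close the tour with the next edge whose endpoints are still free
        chosen.append(next(e for e in edges[i:] if all(degree[x] < 2 for x in e)))
        return chosen

Running this on the weight sets given in the answers below reproduces the tours they describe (cost 34 for the first example and 5 + M for the second).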
Let's say we have a graph with 4 vertices (a, b, c, d) with edge weights as follows:
w_ab = 5
w_bc = 6
w_bd = 7
w_ac = 8
w_da = 11
w_dc = 12
              7
       |--------------|
  5    |   6      12   |
a ---- b ---- c ----- d
|_____________|       |
|       8             |
|______________________|
          11
The triangle inequality holds for each triangle in this graph.
Your algorithm will choose the cycle a-b-c-d-a (cost 34), whereas a better cycle is a-b-d-c-a (cost 32).
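A tiny brute force over all tours starting at a (my own addition) confirms both costs:

    from itertools import permutations

    w = {('a', 'b'): 5, ('b', 'c'): 6, ('b', 'd'): 7,
         ('a', 'c'): 8, ('a', 'd'): 11, ('c', 'd'): 12}
    cost = lambda u, v: w.get((u, v), w.get((v, u)))
    tour_cost = lambda t: sum(cost(t[i], t[(i + 1) % 4]) for i in range(4))

    print(tour_cost(('a', 'b', 'c', 'd')))                            # 34, the algorithm's cycle
    print(min(tour_cost(('a',) + p) for p in permutations('bcd')))    # 32, the optimum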
Your procedure may not terminate. Consider a graph with nodes { 1, 2, 3, 4 } and edges { (1,2), (1,3), (2,3), (2,4), (3,4) }. The only Hamiltonian cycle in this graph is { (1,2), (1,3), (2,4), (3,4) }. Suppose the lowest weighted edge is (2,3). Then your procedure will pick (2,3), pick one of { (1,2), (1,3) } and eliminate the other, pick one of { (2,4), (3,4) } and eliminate the other, then loop forever.
Nuances like this are what make the Travelling Salesman problem so difficult.
Consider the complete graph on 4 vertices, where {a,b,c,d} are the nodes, imagined as the clockwise arranged corners of a square. Let the edge weights be as follows.
w({a,b}) := 2, // "edges"
w({b,c}) := 2,
w({c,d}) := 2,
w({d,a}) := 2,
w({a,c}) := 1, // "diagonals"
w({b,d}) := M
where M is an integer larger than 2. On one hand, the Hamiltonian cycle consisting of the "edges" has weight 8. On the other hand, any Hamiltonian cycle containing {a,c}, which is the lightest edge, must also contain {b,d} and has total weight
1 + M + 2 + 2 = 5 + M > 8
which is larger than the minimum possible weight. In total, this means that in general a Hamiltonian cycle of minimum weight does not necessarily contain the lightest edge, which is the edge chosen first by the algorithm in the original question. Furthermore, as M tends to infinity, the algorithm performs arbitrarily badly in terms of the approximation ratio, as
(5 + M) / 8
grows arbitrarily large.
I have an array of related items, { A, B, C, D }.
C is dependent on A.
D is dependent on B and C.
So, I calculate the total distance between items in this permutation as the sum of distances between:
C and A (2),
D and B (2),
D and C (1).
So, we have a total of 5 in this permutation.
However, the most optimal solution would be {A, C, D, B}, which has a total distance of 3.
I have a (much more complicated) list of about 200 items, which I want to optimise as well as I can, and I'm not aware of any sorting algorithms that sort in this way; can anyone point me in the direction of an existing algorithm?
From Comments:
A plot of the data would look like the table below (apologies for the formatting!):
#Dependencies #Items
0 9
1 27
2 57
3 55
4 11
5 3
6 1
I believe what you're looking for is Topological Sorting.
This algorithm is used on directed graphs. Here the letters form the nodes of the graph and the dependencies form the directed edges.
This algorithm is an application of depth first search and is used to order jobs.
This is a pretty neat explanation.
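As a rough sketch of that idea (the dependency data is taken from the question; everything else is my own illustration), a DFS-based topological sort in Python:

    # each item maps to the items it depends on (C needs A; D needs B and C)
    deps = {'A': [], 'B': [], 'C': ['A'], 'D': ['B', 'C']}

    def topo_sort(deps):
        order, visited = [], set()
        def visit(item):
            if item in visited:
                return
            visited.add(item)
            for d in deps.get(item, []):   # emit dependencies first
                visit(d)
            order.append(item)
        for item in deps:
            visit(item)
        return order

    print(topo_sort(deps))   # e.g. ['A', 'B', 'C', 'D'] -- every item after its dependencies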
We are given an N x M rectangular area with K lines in it. Every line has beginning and end coordinates (x0, y0) - (x1, y1). Are there any well-known algorithms or resources that can help me write a program to find how many separate areas those lines form in the rectangular area?
If this is the original rectangular area : http://prntscr.com/6p9m2c
Then there are 4 separate areas: http://prntscr.com/6p9mo5
All the segments, together with their intersections, form a planar graph. You have to carefully count the vertices and edges of this graph, then apply Euler's formula
F = E - V + 2
where
V is the vertex count - the number of intersections (plus corners and free segment ends),
E is the edge count - the number of segments formed after the intersections,
F is the number of faces.
The region count you need is
R = F - 1
because F also counts the outer (unbounded) face.
For your example - there are 16 vertices, 10 horizontal edges and 9 vertical, so
R = 10 + 9 - 16 + 2 - 1 = 4
Note that you can either count the vertices of degree 1 or 2 (corners and free ends) or ignore them together with one neighbouring segment (i.e. simplify the graph) - this doesn't influence the result.
Imagine that the N x M grid is a graph where each '.' is a vertex, and two vertices are connected by an edge if they are adjacent (above, below, or to the side). Now each separate area corresponds to a connected component of the graph, and you can count the number of connected components in O(N*M) using BFS or DFS.
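A minimal flood-fill sketch of that idea (my own illustration), assuming the lines have already been rasterised into a grid where '#' marks a cell covered by a line and '.' an empty cell:

    from collections import deque

    def count_areas(grid):
        rows, cols = len(grid), len(grid[0])
        seen = [[False] * cols for _ in range(rows)]
        areas = 0
        for r in range(rows):
            for c in range(cols):
                if grid[r][c] == '.' and not seen[r][c]:
                    areas += 1                       # new region found: flood it with BFS
                    seen[r][c] = True
                    queue = deque([(r, c)])
                    while queue:
                        y, x = queue.popleft()
                        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and grid[ny][nx] == '.' and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
        return areas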
Is there an algorithm for counting unique pairs (pairs of vertices) in an undirected graph, with no vertex repeated? I think it could be a variation of a bipartite graph problem, but if there is a better way to find out, please comment.
[I think the problem belongs to perfect matching algorithms.]
Problem Statement:
I have an undirected graph consisting of n vertices and m edges. I can delete edges from the graph. Now I'm interested in one question: is it possible to delete edges so that the degree of each vertex in the graph equals 1? There can be multiple edges in the graph, but no loops.
Example: n = #vertices, m = #edges
n = 4, m = 6
1 2
1 3
1 4
2 3
2 4
3 4
The unique pairings would be (1 2, 3 4), (1 4, 2 3), (1 3, 2 4).
A set of edges that doesn't use the same vertex more than once is called a matching or independent edge set; one that covers every vertex is a perfect matching (see Wikipedia).
That article also mentions that the number of distinct matchings in a graph (which is the number you are after) is called the Hosoya index; see this Wikipedia article.
Algorithms to compute this number are not trivial, and Stack Overflow wouldn't be the right place to try to explain them, but at least I hope you have enough pointers to investigate further.
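For small graphs, though, the count can be obtained with a straightforward recursion over the edge list (exponential in the worst case; this sketch is purely my own illustration, not a production algorithm):

    def count_matchings(edges, used=frozenset()):
        # Hosoya index: counts every matching, including the empty one
        if not edges:
            return 1
        (u, v), rest = edges[0], edges[1:]
        total = count_matchings(rest, used)                  # matchings without this edge
        if u not in used and v not in used:
            total += count_matchings(rest, used | {u, v})    # matchings that use it
        return total

    k4 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]    # the graph from the question
    print(count_matchings(k4))   # 10: the empty matching, 6 single edges, 3 perfect matchings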
Here is pseudocode; it should run in O(|E|) time, i.e. linear in the number of edges:
Suppose G = (V, E) is your initial graph, with E the initial set of all edges.
count = 0;
while (E is not empty) {
    // 1. pick up any edge e = (n1, n2) from E
    // 2. remove e from G
    E = E - e;
    // 3. calculate the number of edges remaining in G if nodes n1 and n2 were removed
    //    -> these are the edges that make a pair with e
    edges_not_connected_to_e = |E| - deg(n1) - deg(n2);
    //    where deg(n1) is the degree of n1 in the updated G (already without edge e)
    // 4. update the count
    count += edges_not_connected_to_e;
}
return count;
Let me know if you need more clarification. And someone could probably fix my graph notation, in case it is incorrect.
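For what it's worth, here is my own Python transcription of that pseudocode (it assumes a simple graph and keeps a running degree count so that the loop stays linear in the number of edges):

    from collections import Counter

    def count_disjoint_edge_pairs(edges):
        degree = Counter()
        for u, v in edges:
            degree[u] += 1
            degree[v] += 1
        remaining, count = len(edges), 0
        for u, v in edges:          # 1. pick up any edge e = (n1, n2)
            remaining -= 1          # 2. remove e from the graph
            degree[u] -= 1
            degree[v] -= 1
            # 3. edges left that touch neither endpoint of e pair up with e
            count += remaining - degree[u] - degree[v]
        return count

    k4 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]    # the example from the question
    print(count_disjoint_edge_pairs(k4))   # 3: (1 2, 3 4), (1 3, 2 4), (1 4, 2 3)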