How to partition the nodes of an undirected graph into k sets - algorithm

I have an undirected graph G=(V,E) where each vertex represents a router in a large network. Each edge represents a network hop from one router to another, so all edges have the same weight. I wish to partition this network of routers into 3 (or, in general, k) different sets, clustered by hop count.
Motivation:
The idea is to replicate some data in the routers contained in each of these 3 sets. Whenever a node (or client, or whatever) in the network graph requests a certain data item, I can search for it in the 3 sets and get a responsible node (one that has cached that particular data) from each set. Then I'd select the node with the minimum hop count from the requesting node and continue with my algorithms and tests.
The cache distribution and request response are not in the scope of this question. I just need a way to partition the network into 3 sets so that I can perform the above operations on it.
Which clustering algorithm could be used in such a scenario? I have almost 9000 nodes in the graph and I wish to get 3 sets of ~3000 nodes each.

In the graph case, a clustering method based on minimum spanning trees can be used.
The regular algorithm is the following:
Find the minimum spanning tree of the graph.
Remove the k - 1 longest edges in the spanning tree, where k is the desired number of clusters.
However, this works only if the edges differ in length (or weight). When all edges have equal length, every spanning tree is a minimum one, so this would not work well. After giving it some thought, though, a different algorithm came to mind, which uses BFS.
The algorithm:
1. for i = 1..k do // for each cluster
2. choose the number of nodes N in cluster i
3. choose an arbitrary node n
4. run breadth-first search (BFS) from n until N nodes have been reached
5. assign the first N nodes (incl. n) reached by the BFS to the i-th cluster
6. remove these nodes (and their incident edges) from the graph
7. done
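The steps above could be sketched in Python roughly as follows. This is only a minimal sketch: the adjacency-dict input, the function name bfs_partition and the pick_root hook are assumptions made for illustration, and restarting from a fresh root when a BFS runs out of reachable nodes is an addition the pseudocode does not cover.

```python
from collections import deque

def bfs_partition(adj, sizes, pick_root=None):
    """Split the nodes of an undirected graph into len(sizes) clusters.

    adj       : dict mapping node -> iterable of neighbours
    sizes     : target number of nodes for each cluster (step 2)
    pick_root : hypothetical hook that picks a BFS root from the set of
                still-unassigned nodes (step 3); defaults to the smallest
    """
    unassigned = set(adj)  # "removing" nodes = dropping them from this set
    clusters = []
    for target in sizes:
        cluster = []
        # Assumption: if a BFS exhausts its component before reaching
        # `target` nodes, continue from a fresh root.
        while len(cluster) < target and unassigned:
            root = pick_root(unassigned) if pick_root else min(unassigned)
            unassigned.discard(root)
            cluster.append(root)
            queue = deque([root])
            while queue and len(cluster) < target:
                node = queue.popleft()
                for nb in adj[node]:
                    if nb in unassigned and len(cluster) < target:
                        unassigned.discard(nb)
                        cluster.append(nb)
                        queue.append(nb)
        clusters.append(cluster)
    return clusters
```

For the 9000-router case this would be called with sizes=[3000, 3000, 3000]; swapping in a smarter pick_root is where the quality of the clusters is decided.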
The results of this algorithm depend hugely on how step 3, i.e. choosing the "root" node of a cluster, is implemented. For the sake of simplicity I choose an arbitrary node, but it could be more sophisticated. The best nodes are those that are at the "end" of the graph. You could find a center of the graph (a node that has the lowest sum of lengths of paths to all other nodes) and then use the nodes that are the furthest from this center.
The real issue is that your edges are equal (if I understood your problem statement well) and you have no information about the nodes (i.e. their coordinates - then you could use e.g. k-means).

Related

Minimum time to reach every node in the graph having connected components

Consider a graph without cycles. The graph has K distinct pairs of people in contact with each other. We want to send a letter to every person, and sending a letter takes one unit of time. We want to speed up the process, so what is the minimum time for the letter to reach every person (node of the graph)? We can hand the letter over to any one person in each of the connected components.
The key point is that the graph has no cycles. Thus each component of your graph is a tree. See Wikipedia for more information: https://en.wikipedia.org/wiki/Tree_(graph_theory)
In the following, let us assume that your graph has only one component and n nodes. If your graph has multiple components, just take the largest one and set n to the number of nodes of this component.
The worst case is that the delivery of a letter goes from a leaf node (at the bottom) up to the root node (at the top) and then down to another leaf node. This path has length at most n-1 (exactly n-1 when the tree is a single chain). Thus this delivery takes at most n-1 time units.
In other words: the longest path in a tree with n nodes has length at most n-1.
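If the actual longest path of a given tree is wanted rather than the n-1 bound, one common way (not part of the answer above, so treat it as an assumption about what is needed) is to run BFS twice: once from any node to find a farthest node, then again from that node. A minimal sketch, with the adjacency-dict input and the function names chosen for illustration:

```python
from collections import deque

def farthest(adj, start):
    """Return (node, distance) for a node farthest from `start` in hops."""
    dist = {start: 0}
    queue = deque([start])
    last = start
    while queue:
        node = queue.popleft()
        last = node  # BFS pops nodes in non-decreasing distance order
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return last, dist[last]

def tree_longest_path(adj, any_node):
    """Length (in edges) of the longest path in a tree, via two BFS runs."""
    end, _ = farthest(adj, any_node)
    _, length = farthest(adj, end)
    return length
```

On a chain of n nodes this gives n-1, matching the worst case described above; on bushier trees it is smaller.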
Use dynamic programming to solve these kinds of problem statements.

Add maximum possible edges to the graph with nodes capacity

Problem: given N nodes, each with a limit on its own degree (for example, the degree of node (1) cannot be higher than 10, but can of course be less; the degree of node (2) cannot be higher than 3; etc.), build a graph on these nodes with the maximum possible number of edges.
Would be happy to see any hints/recommendations.
EDIT: Graph should be simple :)
If there's no other constraint on which vertices can be connected, a greedy algorithm should work here: Connect whichever two unconnected vertices have the highest remaining degree, until no such pair exists. This can be done efficiently with an array of vertices dynamically sorted by remaining degree.
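A rough sketch of that greedy idea in Python is below. The function name and the list-of-capacities input are assumptions for illustration, and it simply re-sorts on every round instead of keeping a dynamically sorted array, so it is meant to show the logic rather than to be efficient.

```python
def greedy_max_edges(caps):
    """Greedily add edges between the not-yet-adjacent vertices that have
    the highest remaining degree allowance.

    caps : list where caps[v] is the maximum allowed degree of vertex v
    Returns a set of edges (u, v) with u < v; the graph stays simple.
    """
    remaining = list(caps)
    edges = set()
    while True:
        # Vertices that can still take an edge, highest allowance first.
        order = sorted((v for v in range(len(caps)) if remaining[v] > 0),
                       key=lambda v: -remaining[v])
        # First pair in that order that is not already connected.
        pair = next(((u, v)
                     for i, u in enumerate(order)
                     for v in order[i + 1:]
                     if (min(u, v), max(u, v)) not in edges), None)
        if pair is None:
            return edges
        u, v = pair
        edges.add((min(u, v), max(u, v)))
        remaining[u] -= 1
        remaining[v] -= 1
```

For example, greedy_max_edges([2, 2, 2]) builds a triangle, and with capacities [3, 1, 1, 1] it connects vertex 0 to each of the others.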
If the graph doesn't have to be simple (the question doesn't specify) then just add duplicate self loops to exhaust all but at most one available endpoint at each node. Then, pair off nodes. You will be left with at most one unused endpoint; the number of edges is trivially the sum of endpoint allowances, divided by two, rounded down.

How to calculate maximal parallelism in a DAG?

Given a DAG (directed acyclic graph), how does one calculate the maximal parallelism?
Instantaneous parallelism is the maximum number of processors that can be kept busy at each point in execution of algorithm; the maximal parallelism is the highest instantaneous parallelism.
Put another way, given a DAG representing a dependency graph of tasks, what is the minimum number of processors/threads such that no task is ever blocked?
The closest approach I found here is:
apply a topological sort on the DAG
traverse over the nodes by the topological order, calculate the minimum level:
no parents: 0
otherwise: minimum parent level + 1
return the max level width (max num of nodes assigned the same level)
This algorithm worked for me on several samples, but it doesn't work on a tree. E.g.:
      o 0
     / \
  o 1   o 1
       / \
    o 2   o 2
         / \
      o 3   o 3
According to the algorithm above, the max width is 2, but clearly the max parallelism in a tree is the number of leaves, 4 in the example above.
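For reference, the level-based procedure described above can be sketched like this. It is a minimal sketch: the function name, the node/edge-list input and the use of Kahn's algorithm for the topological order are assumptions made for illustration. It reproduces the problem on the tree, returning 2 even though there are 4 leaves.

```python
from collections import defaultdict, deque

def max_level_width(nodes, edges):
    """Assign each node the level 'minimum parent level + 1' (0 for
    nodes without parents) and return the size of the largest level.

    nodes : iterable of hashable node ids (non-empty)
    edges : iterable of (parent, child) pairs of a DAG
    """
    parents = defaultdict(list)
    children = defaultdict(list)
    indeg = {v: 0 for v in nodes}
    for u, v in edges:
        children[u].append(v)
        parents[v].append(u)
        indeg[v] += 1
    level = {}
    # Kahn's algorithm: process nodes in a topological order.
    queue = deque(v for v in indeg if indeg[v] == 0)
    while queue:
        v = queue.popleft()
        level[v] = 1 + min(level[p] for p in parents[v]) if parents[v] else 0
        for c in children[v]:
            indeg[c] -= 1
            if indeg[c] == 0:
                queue.append(c)
    widths = defaultdict(int)
    for lv in level.values():
        widths[lv] += 1
    return max(widths.values())

# The tree from the question: one leaf and one inner node per level.
tree_nodes = ["r", "a1", "b1", "a2", "b2", "a3", "b3"]
tree_edges = [("r", "a1"), ("r", "b1"), ("b1", "a2"),
              ("b1", "b2"), ("b2", "a3"), ("b2", "b3")]
print(max_level_width(tree_nodes, tree_edges))  # prints 2
```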
A similar approach is partially described here (see slide titled Computing critical path etc., which describes how to calculate earliest start times of nodes and that "maximal...parallelism can easily be computed from this").
Edit 1:
#AliSoltani's solution, which uses BFS to find the length of the critical path and takes that as the maximal parallelism, is incorrect: it only applies to a subset of examples, mainly trees in which the number of leaves is equal to the length of the longest path. Here's an illustration of a case where this wouldn't work:
Edit 2:
#AliSultani's 2nd solution, which uses BFS to find the level with the maximum number of nodes and takes that count as the maximal parallelism, is also incorrect, as it doesn't take into account cases where nodes from different levels may run concurrently. See this counterexample:
This problem is reducible to the Maximum Directed Cut problem.
Let's build an auxiliary DAG from the original one.
For every vertex u[i] of the original graph, add vertices v[i] and w[i] to the new graph, and connect them using an edge (v[i], w[i]) with cost 1.
For every edge (u[i], u[j]) of the original graph add an edge (w[i], v[j]) with a cost 0 to the new graph.
Now the problem is equivalent to finding the maximum directed cut in the auxiliary graph.
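The construction of the auxiliary graph (only the construction, not the cut computation itself) might look like this in Python. The function name and the tuple-based vertex labels ("v", i) and ("w", i) are assumptions chosen for illustration.

```python
def build_auxiliary_graph(n, edges):
    """Build the auxiliary DAG as a list of (source, target, cost) triples.

    n     : number of vertices u[0..n-1] in the original DAG
    edges : list of (i, j) pairs, one per original edge (u[i], u[j])
    """
    aux = []
    # Each original vertex u[i] becomes an edge v[i] -> w[i] of cost 1.
    for i in range(n):
        aux.append((("v", i), ("w", i), 1))
    # Each original edge (u[i], u[j]) becomes w[i] -> v[j] of cost 0.
    for i, j in edges:
        aux.append((("w", i), ("v", j), 0))
    return aux
```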
Example:
You should find the critical path length in the DAG. A critical path is a directed path that has the maximum execution requirement among all paths in the DAG. If the critical path of the DAG has n nodes, then the maximal parallelism is n.
The critical path is the longest path from root to leaf (in the DAG), and to find it you can use the BFS (Breadth-First Search) algorithm.
Example 1
BFS on this tree runs in O(|V|+|E|) time. This is an optimal solution for this problem.
Edit: Find maximum degree of concurrency by BFS
You can determine the maximum degree of concurrency by running the breadth-first search algorithm too:
The algorithm starts from the root node and proceeds towards the leaves level-wise.
Before inspecting nodes located on the next level, it explores all of the nodes belonging to the same level.
Count the number of nodes on each level and update a variable holding the maximum number of nodes per level.
Example 2 (Step by step)
So in this example maximum degree of concurrency is 4.
Final Edit
With the last explanations you gave, Maximal independent set of tasks is what you are looking for. To solve this problem see this article.
I have not tested the algorithm, but my proposal would be the following:
1. Start from the origin node.
2. Select each connected edge. The current concurrency is the number of selected edges. Remember that.
3. Sort the nodes reached by the selected edges by their number of outgoing edges. Ignore all nodes that have incoming edges which haven't been selected yet.
4. Start going down the edge to the node with the most outgoing edges.
5. If not at an end node: repeat from 2.
6. Take the maximum of the current concurrency over all iterations.
Here is an implementation in Python using networkx. The document you have linked does something different: it calculates the number of concurrent tasks when the graph is executed with the timings attached to the nodes (1 for each node in that case). This is an easy task and probably the one the author of the document refers to. My algorithm, however, calculates the theoretical maximum and does not take the running time of each task into account.

Why are there two listed time complexities for breadth-first search?

The Wikipedia article on breadth-first search lists two time complexities for breadth-first search over a graph: O(|E|) and O(b^d). Later on the page, though, it only lists O(|E|).
Why are there two different runtimes? Which one is correct?
Both time complexities are correct, but are used in different circumstances to measure the time complexity relative to two different quantities. This has to do with the fact that the breadth-first search algorithm typically refers to two different related algorithms used in different contexts.
In one context, BFS is an algorithm that, given a graph and a start node in the graph, visits every node reachable from the start node by first visiting all nodes at distance 0, then distance 1, then distance 2, etc. until all nodes are visited. This will visit every reachable node in the graph and, in the process, explore each node once and each edge at most once (in the directed case) or twice (in the undirected case). By using queues to keep track of which nodes to explore next and using appropriate bookkeeping, the runtime will be O(|E| + |V|) (and with further optimizations, it will be O(|E|)).
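A minimal sketch of this whole-graph variant, assuming the graph is given as an adjacency dict (the function name and the edge-inspection counter are added purely to illustrate the bookkeeping):

```python
from collections import deque

def bfs_explore(adj, start):
    """Visit every node reachable from `start` in order of distance.

    adj : dict mapping node -> list of neighbours
    Returns (visit order, number of edge inspections performed).
    """
    visited = {start}
    order = []
    edge_checks = 0
    queue = deque([start])
    while queue:
        node = queue.popleft()  # each node is dequeued exactly once
        order.append(node)
        for nb in adj[node]:    # each adjacency entry is looked at once
            edge_checks += 1
            if nb not in visited:
                visited.add(nb)
                queue.append(nb)
    return order, edge_checks
```

On an undirected graph stored this way every edge appears in two adjacency lists, so a full traversal of a connected graph inspects exactly 2|E| entries, which is where the O(|E| + |V|) bound comes from.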
In a different context, BFS is a search algorithm used to find the shortest path from some start node in a graph to a destination node in the graph. In this case, the algorithm stops running as soon as it discovers the destination node. This means that the runtime depends on how far away the destination node is from the source node. That distance in turn depends on the structure of the graph. If the graph is densely connected, the node can't be that far away, and if the graph is sparse, the node might be extremely distant. In this context, it's common to add in a parameter called the "branching factor" b, which describes the average or maximum number of edges adjacent to any node. This means that
There is one node at distance 0 from the start node.
There are at most b nodes at distance 1 from the start node.
There are at most b^2 nodes at distance 2 from the start node.
...
There are at most b^k nodes at distance k from the start node.
If we assume that the destination node is at distance d from the start node, then BFS will visit at most b^0 + b^1 + ... + b^d = O(b^d) nodes during its search, spending O(b) time on each of them. Accordingly, the total runtime will be O(b^d).
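To see why the sum of the level sizes is dominated by its last term, here is the geometric-series step written out, assuming a branching factor b of at least 2:

```latex
\sum_{i=0}^{d} b^{i}
  \;=\; \frac{b^{d+1} - 1}{b - 1}
  \;<\; \frac{b}{b - 1}\, b^{d}
  \;\le\; 2\, b^{d}
  \qquad (b \ge 2),
\qquad\text{hence}\qquad
\sum_{i=0}^{d} b^{i} = O\!\left(b^{d}\right).
```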
In summary:
The runtime of O(|E|) is typically used when discussing the algorithm as a way to explore the entire graph.
The runtime of O(b^d) is typically used when discussing the algorithm as a way to find a specific node in the graph.
Hope this helps!

Smallest path in graph theory (social network analysis)

This is the scenario:
There is an undirected graph with n nodes and e edges, all nodes are connected.
The question in the scenario:
Every node can be considered a person in a social network who shares or reads content. This means that if A is connected to B, C and D and A shares content with the network, it will directly reach B, C and D. So to reach all the nodes in the network, it is only necessary that each of them is adjacent to a node which shared the content.
Q1: is there a way to find the best starting point to reach the entire network?
Q2: is there a way to find a smallest path from that point?
I've already looked at the travelling salesman problem and Prim's algorithm.
Thanks!
The Wikipedia page on Centrality describes several different forms of centrality in a graph, and has links to algorithms for some of them.
Raising the adjacency matrix of the network to the nth power gives you the number of walks of length n between two vertices i and j (given by the ij-th element of the matrix). The first power n at which x(i,j) becomes non-zero tells you how far apart they are with respect to walks. If you're looking for the best node to reach the whole network, then you could just look for the first instance of a row (or column) of the matrix which has all non-zero values whilst increasing n.
Obviously this isn't practical with huge networks...
Otherwise you could apply Dijkstra's algorithm.
Closeness Centrality is a ranking of each individual node and can be thought of as a measure of how "close a node is to the center of a network". So a node with a high closeness centrality value is positioned in the network such that it takes this node fewer hops (on average) to reach all other nodes in the network. So for Q1 above, the node(s) with the highest closeness could be interpreted to be in the best position to reach all other nodes with a minimum number of hops between nodes on the way. For Q2, the "smallest path" can be considered the smallest average path to all nodes in the network.
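As a rough illustration, closeness can be computed with one BFS per node on an unweighted graph. This is a minimal sketch assuming a connected graph given as an adjacency dict, using the common definition (n - 1) divided by the sum of hop distances; the function names are made up for the example.

```python
from collections import deque

def hop_distances(adj, source):
    """Hop count from `source` to every reachable node (plain BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def closeness(adj):
    """Closeness centrality of every node; assumes a connected graph."""
    n = len(adj)
    scores = {}
    for v in adj:
        total = sum(hop_distances(adj, v).values())
        scores[v] = (n - 1) / total if total else 0.0
    return scores

def best_start(adj):
    """Node with the highest closeness, i.e. fewest hops on average."""
    scores = closeness(adj)
    return max(scores, key=scores.get)
```

This costs one BFS per node, so it is fine for moderately sized networks but becomes expensive on very large ones.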

Resources