What is the simplest, easiest algorithm for finding EMST of a complete graph of order 10^5 - algorithm

I just want to be clear that EMST stands for Euclidean Minimum Spanning Tree.
Essentially, I have been given a file with 100k 4D vertices (one vertex on each line). The goal is to visit every vertex in the file while minimizing the total distance traveled. The distance from a point to another point is simply the Euclidean Distance (Distance if you draw a Straight Line between two points".
I already know that this is pretty much the Traveling Salesman Problem, which is NP Complete, so I am looking to approximate the solution.
The first approximation algorithm that came to my mind is by finding the MST from a graph constructed from the file... But that would take O(N^2) to even just construct all the edges from the file given the fact that it's a complete graph ( I can go from any point to another ). And given that my input is N = 10^5, my algorithm will have a huge running time, which is too slow...
Any ideas on how I can plan on approximating the solution? Thank you very much..

I know it's quadratic-time, but I think you should consider Prim with an implicit graph. The structure of the algorithm is
for each vertex v
mindist[v] := infinity
visited[v] := false
choose a root vertex r
mindist[r] := 0
repeat |V| times
let w be the minimizer of d[w] such that not visited[w]
visited[w] := true
for each vertex v
if not visited[v] and distance(w, v) < mindist[v]:
mindist[v] := distance(w, v)
parent[v] := w
Since the storage used is linear, it will likely stay resident in cache, and there are no fancy data structures, so this algorithm should run pretty fast.

I am going to assume that you actually want a EMST as your title suggests, and the TSP is just a means to that end, and not the actual goal itself. The two have very different restrictions (the TSP being far more restrictive), and thus very different optimal solutions.
Overview
The idea is that we want to run a modified kruskal's algorithm, which will make use of a k-d tree to find the closest pairs without evaluating every potential edge. We can find the shortest edge to each vertex in a connected component, take the shortest overall, and connect our connected components via that edge. As you'll see, this connects at least half of our connected components each iteration, so it takes at most logn iterations to complete.
Nearest Neighbor Search
For constructing an EMST, you'll want to use a data structure for querying for nearest neighbors in 4D space. You could extend octrees to work in a higher dimension, but I'd personally go with a k-d tree. You can construct a k-d tree in O(nlogn) time using the median of medians algorithm to find the median at each level, and you can insert / remove from a balanced k-d tree in O(logn) time.
Once you've built a k-d tree, you'll want to query for the nearest neighbor to each point. We'll then construct the edge between these two vertices. Many of these edges will be duplicated, as for some vertices A and B, A's nearest neighbor may be B, and B's nearest neighbor may be A. We'll handle this by storing which connected component each vertex belongs to, and after two vertices are joined by an edge, the duplicate edge will clearly connect two vertices of the same connected component, and so we'll discard it. To accomplish this, we'll use a disjoint-set (just like in many implementations of kruskal's algorithm) to assign a connected component to each vertex. This will also prevent us from creating cycles in our graph, which would introduce unnecessary edges in the MST.
Merging
However, as we construct each edge, we'll want to insert it into a min-heap priority queue before checking which edges to keep and which edges connect already-connected vertices. This will not affect the outcome of this first iteration, but later on we will need to handle edges by increasing distance. Then dequeue all the edges, check for unnecessary / redundant edges via the disjoint-set, insert valid edges into the MST, and merge the respective disjoint-sets. All of this of course introduces a nlogn factor for constructing and dequeuing elements from the min-heap (we could also just sort them in a plain array, if we wished).
After this first iteration of adding edges, we'll have connected at least half of the MST, maybe more. This is because for each vertex we added one edge, and we can have at most one duplicate per edge, so we've added a few as vertices / 2 edges, but as many as vertices - 1. Now at least 1/2 of our MST has been built. We'll continue the process as described in the following paragraphs, until we've added vertices - 1 edges in total.
Generalizing NN-Search
To continue, we'll want to construct lists of the vertices in each connected component, so that we can iterate over them by groups. This can be done in nearly linear time, as searching (also merging) a disjoint-set takes O(α(n)) time (α being the inverse ackermann function) and we repeat exactly n times. Once we have our lists of vertices per connected component, the rest is fairly straightforward. We'll take our existing k-d tree, and remove all the vertices in our current connected component. We'll then query for the nearest neighbor to each vertex to each vertex in our connected component, and add these edges to our min-heap. We'll then add these vertices back into the k-d tree, and repeat on the next connected component. Since we insert/remove a total of n elements, this amounts to an average case O(nlogn) time complexity.
Now that we have a queue of the shortest potential edges connecting our connected components, we'll dequeue these in order, and just as before insert valid edges and merge the disjoint sets. For the same reasons as before, this is guaranteed to connect at least half of our components, maybe even all of them. We'll repeat this process until we have connected all vertices into a single connected component, which will be our MST. Note that because we halve the number of disconnected components each iteration, it'll take at most O(logn) iterations to connect every vertex in our MST (most likely far less).
Remarks
Overall, this will take O(nlog^2(n)) time. There will likely be far less than log(n) iterations however, so expect a speedup there in practice. Also note that R-tree might be a good alternative to the k-d tree- I don't know how they compare in practice however.

Related

Finding size of largest connected component of a graph

Consider we have a random undirected graph G = (V,E) with n vertices, now suppose for any two vertices u and v ∈ V, the probability that the edge between u and v ∈ E is 1/n. We need to figure out the size of the largest connected component in the undirected graph C(n).
C(n) should be equal to Θ(n**a), we need to run some experiments to give an estimate of a.
I am a bit confused on how to link the probability 1/n to the largest connected component, is there any way I can do so?
The process you're simulating here is called the Erdős–Rényi model. You have a collection of n nodes, and each pair of nodes has probability p of being linked. The (expected) shape of the resulting graph depends heavily on the choice of p, and there are a lot of famous results about this.
As for how to do this: one option would be to create a collection of n nodes, iterate over all pairs of nodes, and link them with probability 1/n. You can then run an algorithm like BFS or DFS over the graph to find and size the connected components.
Another would be to use the above approach, except that instead of doing a BFS or DFS to use a disjoint-set forest to perform the links and find the largest connected component.
Alternatively, because each edge is absent or present with equal probability and independently of every other edge, the number of edges you have is binomially distributed and pretty tightly packed around n total edges. You could therefore generate n random edges, add them into the graph, then use the above techniques. (This will be much faster, as this does O(n) work rather than O(n2) work to process the edges.)
Once you've gotten this worked out, you can vary n over a large range and run some sort of polynomial regression on it to find the best-first curve. That's something you could either code up yourself, or which you could do by importing your data into Excel and using its regression tools.
As a spoiler, when you're done you'll find that the number of nodes in the largest connected component is Θ(n2/3). If you search for "Erdős–Rényi critical case," you can find online proofs of this result. It's not a trivial result to prove (and definitely isn't obvious!), but it'll drop out of your empirical analysis.

Algorithm: Minimal path alternating colors

Let G be a directed weighted graph with nodes colored black or white, and all weights non-negative. No other information is specified--no start or terminal vertex.
I need to find a path (not necessarily simple) of minimal weight which alternates colors at least n times. My first thought is to run Kosaraju's algorithm to get the component graph, then find a minimal path between the components. Then you could select nodes with in-degree equal to zero since those will have at least as many color alternations as paths which start at components with in-degree positive. However, that also means that you may have an unnecessarily long path.
I've thought about maybe trying to modify the graph somehow, by perhaps making copies of the graph that black-to-white edges or white-to-black edges point into, or copying or deleting edges, but nothing that I'm brain-storming seems to work.
The comments mention using Dijkstra's algorithm, and in fact there is a way to make this work. If we create an new "root" vertex in the graph, and connect every other vertex to it with a directed edge, we can run a modified Dijkstra's algorithm from the root outwards, terminating when a given path's inversions exceeds n. It is important to note that we must allow revisiting each vertex in the implementation, so the key of each vertex in our priority queue will not be merely node_id, but a tuple (node_id, inversion_count), representing that vertex on its ith visit. In doing so, we implicitly make n copies of each vertex, one per potential visit. Visually, we are effectively making n copies of our graph, and translating the edges between each (black_vertex, white_vertex) pair to connect between the i and i+1th inversion graphs. We run the algorithm until we reach a path with n inversions. Alternatively, we can connect each vertex on the nth inversion graph to a "sink" vertex, and run any conventional path finding algorithm on this graph, unmodified. This will run in O(n(E + Vlog(nV))) time. You could optimize this quite heavily, and also consider using A* instead, with the smallest_inversion_weight * (n - inversion_count) as a heuristic.
Furthermore, another idea hit me regarding using knowledge of the inversion requirement to speedup the search, but I was unable to find a way to implement it without exceeding O(V^2) time. The idea is that you can use an addition-chain (like binary exponentiation) to decompose the shortest n-inversion path into two smaller paths, and rinse and repeat in a divide and conquer fashion. The issue is you would need to construct tables for the shortest i-inversion path from any two vertices, which would be O(V^2) entries per i, and O(V^2logn) overall. To construct each table, for every entry in the preceding table you'd need to append V other paths, so it'd be O(V^3logn) time overall. Maybe someone else will see a way to merge these two ideas into a O((logn)(E + Vlog(Vlogn))) time algorithm or something.

Polygons from a list of edges

Given N points in a map of edges Map<Point, List<Edge>>, it's possible to get the polygons formed by these edges in O(N log N)?
What I know is that you have to walk all the vertices and get the edges containing that vertex as a starting point. These are edges of a voronoi diagram, and each vertex has, at most, 3 artists containing it. So, in the map, the key is a vertex, and the value is a list where the vertex is the start node.
For example:
Points: a,b,c,d,e,f,g
Edges: [a,b]; [a,c]; [a,d], [b,c], [d,e], [e,g], [g,f]
My idea is to iterate the map counterclockwise until I get the initial vertex. That is a polygon, then I put it in a list of polygons and keep looking for others. The problem is I do not want to overcome the complexity O(N log N)
Thanks!
You can loop through the edges and compute the distance from midpoint of the edge to all sites. Then sort the distances in ascending order and for inner voronoi polygons pick the first and the second. For outer polygons pick the first. Basically an edge separate/divide 2 polygons.
It's something O(m log n).
If I did find a polynomial solution to this problem I would not post it here because I am fairly certain this is at least NP-Hard. I think your best bet is to do a DFS. You might find this link useful Finding all cycles in undirected graphs.
You might be able to use the below solution if you can formulate your graph as a directed graph. There are 2^E directed graphs (because each edge can be represented in 2 directions). You could pick a random directed graph and use the below solution to find all of the cycles in this graph. You could do this multiple times for different random directed graphs keeping track of all the cycles and until you've reached a satisfactory error bounds.
You can efficiently create a directed graph with a little bit of state (Maybe store a + or - with an edge to note the direction?) And once you do this in O(n) the first time you can randomly flip x << E directions to get a new graph in what will essentially be constant time.
Since you can create subsequent directed graphs in constant time you need to choose the number of times to run the cycle finding algorithm to have it still be polynomial and efficient.
UPDATE - The below only works for directed graphs
Off the top of my head it seems like it's a better idea to think of this as a graph problem. Your map of vertices to edges is a graph representation. Your problem reduces to finding all of the loops in the graph because each cycle will be a polygon. I think "Tarjan's strongly connected components algorithm" will be of use here as it can do this in O(v+e).
You can find more information on the algorithm here https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm

Finding reachable vertices for every vertex in a directed graph

I know that brute force approach to do this is perform DFS on all the vertices of the graph.So for this algorithm the complexity would be O(V|V+E|). But is there more efficient way to do this?
I get the impression from papers like http://research.microsoft.com/pubs/144985/todsfinal.pdf that there is no algorithm that does better than O(VE) or O(V^3) in the general case. For sparse graphs and other special graphs there are faster algorithms. It seems, however, that you can still make improvements by separating "index construction" from "query", if you have some idea of the number of queries that will be made on the data. If there are going to be a lot of queries, O(1) is possible for queries if all the data is pre-computed (DFS or Floyd-Warshall, etc.) and stored in O(n^2) space. On the other hand, if there are going to be relatively few queries, space and/or index construction time can be reduced at the expense of query time.
I really suspect that there isn't a known better algorithm for general graphs. All the papers I found on the subject [1] [2] describe algorithms that run in O(|V| * |E|) time. That isn't better than your naïve attempt in the worst case.
Even the wikipedia page [3] says the fastest algorithms reduce the problem to matrix multiplication, which the fastest algorithms are only marginally better than your baseline.
[1] http://ion.uwinnipeg.ca/~ychen2/conferencePapers/tranRelationCopy.pdf
[2] http://www.vldb.org/conf/1988/P382.PDF
[3] http://en.wikipedia.org/wiki/Transitive_closure#Algorithms
[EDIT: As pointed out by kraskevich, the final query step can be worse than I had originally claimed: up to O(|V|^2) even for an output of size O(|V|), which is no better than ordinary DFS without any preprocessing.].
In the worst case, O(|V|^2) space would be needed to store all this information explicitly -- i.e., to store the complete list of reachable vertices for each vertex (think of a graph in which every vertex has an edge to every other vertex). But it's possible to represent it in such a way that only O(|V|) space is needed, and this representation can be built in O(|V|+|E|) time, and a query on it will only take time proportional to the size of the answer (number of reachable vertices).
The basic idea is: Every vertex in a strongly connected component (SCC) can reach every other vertex in the same SCC (this is the definition of SCC), and can reach all vertices in SCCs that it can reach, and no other vertices.
Find all SCCs; this can be done in O(|V|+|E|) time. Build a table SCC, so that SCC(u) = i if the SCC of u is i (both vertices in G and SCCs can be represented as integers). Afterwards make another pass through this table to build a dual table, Verts, so that Verts(i) contains a list of all vertices in the ith SCC.
Build a new graph G' whose vertices are the SCCs of G. G' will necessarily be acyclic.
So, given a vertex u in G, look up its SCC, SCC(u). Call this i. Perform a DFS through G' starting at vertex i: For each vertex (of G') j encountered during this DFS, output every vertex (of G) in Verts(j).

Time complexity of creating a minimal spanning tree if the number of edges is known

Suppose that the number of edges of a connected graph is known and the weight of each edge is distinct, would it possible to create a minimal spanning tree in linear time?
To do this we must look at each edge; and during this loop there can contain no searches otherwise it would result in at least n log n time. I'm not sure how to do this without searching in the loop. It would mean that, somehow we must only look at each edge once, and decide rather to include it or not based on some "static" previous values that does not involve a growing data structure.
So.. let's say we keep the endpoints of the node in question, then look at the next node, if the next node has the same vertices as prev, then compare the weight of prev and current node and keep the lower one. If the current node's endpoints are not equal to prev, then it is in a different component .. now I am stuck because we cannot create a hash or array to keep track of the component nodes that are already added while look through each edge in linear time.
Another approach I thought of is to find the edge with the minimal weight; since the edge weights are distinct this edge will be part of any MST. Then.. I am stuck. Since we cannot do this for n - 1 edges in linear time.
Any hints?
EDIT
What if we know the number of nodes, the number of edges and also that each edge weight is distinct? Say, for example, there are n nodes, n + 6 edges?
Then we would only have to find and remove the correct 7 edges correct?
To the best of my knowledge there is no way to compute an MST faster by knowing how many edges there are in the graph and that they are distinct. In the worst case, you would have to look at every edge in the graph before finding the minimum-cost edge (which must be in the MST), which takes Ω(m) time in the worst case. Therefore, I'll claim that any MST algorithm must take Ω(m) time in the worst case.
However, if we're already doing Ω(m) work in the worst-case, we could do the following preprocessing step on any MST algorithm:
Scan over the edges and count up how many there are.
Add an epsilon value to each edge weight to ensure the edges are unique.
This can be done in time Ω(m) as well. Consequently, if there were a way to speed up MST computation knowing the number of edges and that the edge costs are distinct, we would just do this preprocessing step on any current MST algorithm to try to get faster performance. Since to the best of my knowledge no MST algorithm actually tries to do this for performance reasons, I would suspect that there isn't a (known) way to get a faster MST algorithm based on this extra knowledge.
Hope this helps!
There's a famous randomised linear-time algorithm for minimum spanning trees whose complexity is linear in the number of edges. See "A randomized linear-time algorithm to find minimum spanning trees" by Karger, Klein, and Tarjan.
The key result in the paper is their "sampling lemma" -- that, if you independently randomly select a subset of the edges with probability p and find the minimum spanning tree of this subgraph, then there are only |V|/p edges that are better than the worst edge in the tree path connecting its ends.
As templatetypedef noted, you can't beat linear-time. That all edge weights are distinct is a common assumption that simplifies analysis; if anything, it makes MST algorithms run a little slower.
The fact that a number of edges (N) is known does not influence the complexity in any way. N is still a finite but unbounded variable, and each graph will have different N. If you place a upper bound on N, say, 1 million, then the complexity is O(1 million log 1 million) = O(1).
The fact that each edge has distinct weight does not influence the program either, because it does not say anything about the graph's structure. Therefore knowledge about current case cannot influence further processing, as we cannot predict how the graph's structure will look like in the next step.
If the number of edges is close to n, like in this case n-6 (after edit), we know that we only need to remove 7 edges as every spanning tree has only n-1 edges.
The Cycle Property shows that the most expensive edge in a cycle does not belong to any Minimum Spanning tree(assuming all edges are distinct) and thus, should be removed.
Now you can simply apply BFS or DFS to identify a cycle and remove the most expensive edge. So, overall, we need to run BFS 7 times. This takes 7*n time and gives us a time complexity of O(n). Again, this is only true if the number of edges is close to the number of nodes.

Resources