Efficiently checking which of a large collection of nodes are close together?

I'm currently interested in generating random geometric graphs. For my particular problem, we randomly place node v in the unit square, and add an edge from v to node u if they have Euclidean distance <= D, where D=D(u,n) varies with u and the number of nodes n in the graph.
Important points:
A1. It is costly to compute D, so I'd like to minimize the number of calls to this function.
A2. The vast majority of the time, when v is added, edges uv will be added to only a small number of nodes u (usually 0 or 1).
Question: What is an efficient method for checking which vertices u are "close enough" to v?
The brute force algorithm is to compute and compare dist(v,u) and D(u,n) for all extant nodes u. This requires O(n^2) calls to D.
I feel we should be able to do much better than this. Perhaps some kind of binning would work. We could divide the space up into bins, then for each vertex u, store a list of bins in which a newly placed vertex v could result in the edge uv. If v ends up placed outside of u's list of bins (which should happen most of the time), then it's too far away, and we don't need to compute D. This is somewhat of an off-the-top-of-my-head suggestion, and I don't know if it would work well (e.g., the overhead of computing sufficiently close bins might be too costly), so I'm after feedback.

Based on your description of the problem, I would choose an R-tree as your data structure.
It allows for very fast searching by drastically narrowing the set of vertices you need to run D against. However, worst-case insertion requires O(n) time. Thankfully, you're quite unlikely to hit the worst case with a typical data set.
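For illustration, here is a minimal sketch of that approach in Python using the rtree package (a wrapper around libspatialindex). D stands for the question's radius function, and r_max is an assumed global upper bound on D(u,n) used to size the query window; both names are stand-ins, not part of the original post:

    from rtree import index          # pip install Rtree
    import math

    idx = index.Index()
    points = []                      # points[u] = (x, y)

    def add_node(x, y, D, r_max):
        """Insert (x, y); return ids u with dist(v, u) <= D(u, n).
        r_max is an assumed upper bound on D(u, n) over all u."""
        n = len(points) + 1
        window = (x - r_max, y - r_max, x + r_max, y + r_max)
        neighbors = [u for u in idx.intersection(window)
                     if math.dist(points[u], (x, y)) <= D(u, n)]
        u_new = len(points)
        points.append((x, y))
        idx.insert(u_new, (x, y, x, y))    # points stored as degenerate boxes
        return neighbors

Only the few candidates inside the query window ever trigger a call to D, which matches the goal of minimizing those calls.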

I would probably just use a binning approach.
Say we cut the unit square into m x m subsquares (each having side length 1/m, of course). Since you place your vertices uniformly at random (or so I assume), every square will contain n / m^2 vertices on average.
Depending on A1, A2, m and n, you can probably determine the maximum radius you need to check. Say it's less than 1/m (one square's side). Then, after inserting v, you would need to check the square in which it landed, plus all adjacent squares. In any case, this is a constant number of squares, so for every insertion you'll need to check O(n / m^2) other vertices on average.
I don't know the best value for m (as said, that depends on A1 and A2), but if it were sqrt(n), then your entire algorithm could run in O(n) expected time.
EDIT
A small addition: you could keep track of vertices with many neighbors (that is, with a high radius, which extends over multiple squares) and check them for every inserted vertex. There should be only a few, so that's no problem.
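Here is a minimal sketch of this grid in Python, under the assumption above that the maximum radius fits within one cell (r_max <= 1/M); D is a stand-in for the question's radius function:

    import math
    from collections import defaultdict

    M = 100                           # grid resolution; tune to n and the radii
    grid = defaultdict(list)          # (i, j) -> ids of vertices in that cell
    points = []

    def cell(x, y):
        return (min(int(x * M), M - 1), min(int(y * M), M - 1))

    def add_node(x, y, D):
        ci, cj = cell(x, y)
        n = len(points) + 1
        neighbors = []
        for di in (-1, 0, 1):         # own cell plus the 8 adjacent cells
            for dj in (-1, 0, 1):
                for u in grid.get((ci + di, cj + dj), ()):
                    if math.dist(points[u], (x, y)) <= D(u, n):
                        neighbors.append(u)
        points.append((x, y))
        grid[(ci, cj)].append(len(points) - 1)
        return neighbors

Vertices whose radius spans more than one cell would go in the separate high-radius list mentioned above and be checked on every insertion.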

Related

Efficient approximate algorithm to determine the presence of k-sized cycle in graph

I have a very large sparse graph G (about 100 million nodes, about 50 million edges) and I would like to find an efficient algorithm (hopefully O(1) or sub-linear in the number of nodes + edges) that predicts with some probability the presence of a cycle of length k in this graph. For practical use, k will be very small (between 30 and 90) relative to the size of G. It is also guaranteed that k will always be even. G is also a random graph, so I don't expect any consistent clustering.
The algorithm doesn't need to enumerate the actual nodes contained in the cycle; it just needs to eliminate G if it most likely doesn't have any cycles of length k.
I found a close solution with the answer presented here, where the trace and rank of L (where L is the Laplacian of G) could be compared to determine whether G had any cycles at all. However, I couldn't find a relatively efficient way to compute the rank for G. Another problem is that it doesn't take k into account, which might allow a more efficient approach.
Getting connected components is a possibility, but it is linear in the number of nodes + edges, which is not optimal for a graph of this size.
If it's an Erdős–Rényi random graph, then since having such a cycle is a monotone property of a graph, there's a zero-one law (https://www.ams.org/journals/proc/1996-124-10/S0002-9939-96-03732-X/S0002-9939-96-03732-X.pdf), which implies that you can make a reasonably good guess by setting the right threshold. (Which threshold? I don't know offhand, but probably you can extrapolate from smaller graphs.)

What is the simplest, easiest algorithm for finding EMST of a complete graph of order 10^5

I just want to be clear that EMST stands for Euclidean Minimum Spanning Tree.
Essentially, I have been given a file with 100k 4D vertices (one vertex on each line). The goal is to visit every vertex in the file while minimizing the total distance traveled. The distance between two points is simply the Euclidean distance (the distance if you draw a straight line between the two points).
I already know that this is pretty much the Traveling Salesman Problem, which is NP-complete, so I am looking to approximate the solution.
The first approximation algorithm that came to my mind is finding the MST of a graph constructed from the file... But that would take O(N^2) just to construct all the edges, given that it's a complete graph (I can go from any point to any other). And given that my input is N = 10^5, my algorithm will have a huge running time, which is too slow...
Any ideas on how I can plan on approximating the solution? Thank you very much..
I know it's quadratic-time, but I think you should consider Prim with an implicit graph. The structure of the algorithm, as runnable Python:

    import math

    def emst_prim(points):
        n = len(points)
        mindist = [math.inf] * n      # distance from each vertex to the tree
        parent = [None] * n
        visited = [False] * n
        mindist[0] = 0                # choose vertex 0 as the root
        for _ in range(n):
            # let w be the unvisited vertex minimizing mindist[w]
            w = min((v for v in range(n) if not visited[v]),
                    key=mindist.__getitem__)
            visited[w] = True
            for v in range(n):
                if not visited[v]:
                    d = math.dist(points[w], points[v])
                    if d < mindist[v]:
                        mindist[v] = d
                        parent[v] = w
        return parent                 # parent[v] is v's neighbour in the MST
Since the storage used is linear, it will likely stay resident in cache, and there are no fancy data structures, so this algorithm should run pretty fast.
I am going to assume that you actually want an EMST as your title suggests, and the TSP is just a means to that end, not the actual goal itself. The two have very different restrictions (the TSP being far more restrictive), and thus very different optimal solutions.
Overview
The idea is that we want to run a modified Kruskal's algorithm, which will make use of a k-d tree to find the closest pairs without evaluating every potential edge. We can find the shortest edge to each vertex in a connected component, take the shortest overall, and connect our connected components via that edge. As you'll see, this connects at least half of our connected components each iteration, so it takes at most log n iterations to complete.
Nearest Neighbor Search
For constructing an EMST, you'll want to use a data structure for querying for nearest neighbors in 4D space. You could extend octrees to work in a higher dimension, but I'd personally go with a k-d tree. You can construct a k-d tree in O(n log n) time using the median-of-medians algorithm to find the median at each level, and you can insert / remove from a balanced k-d tree in O(log n) time.
Once you've built a k-d tree, you'll want to query for the nearest neighbor to each point. We'll then construct the edge between these two vertices. Many of these edges will be duplicated, as for some vertices A and B, A's nearest neighbor may be B, and B's nearest neighbor may be A. We'll handle this by storing which connected component each vertex belongs to, and after two vertices are joined by an edge, the duplicate edge will clearly connect two vertices of the same connected component, and so we'll discard it. To accomplish this, we'll use a disjoint-set (just like in many implementations of Kruskal's algorithm) to assign a connected component to each vertex. This will also prevent us from creating cycles in our graph, which would introduce unnecessary edges in the MST.
Merging
However, as we construct each edge, we'll want to insert it into a min-heap priority queue before checking which edges to keep and which edges connect already-connected vertices. This will not affect the outcome of this first iteration, but later on we will need to handle edges in order of increasing distance. Then dequeue all the edges, check for unnecessary / redundant edges via the disjoint-set, insert valid edges into the MST, and merge the respective disjoint-sets. All of this of course introduces an O(n log n) factor for constructing and dequeuing elements from the min-heap (we could also just sort them in a plain array, if we wished).
After this first iteration of adding edges, we'll have connected at least half of the MST, maybe more. This is because for each vertex we added one edge, and we can have at most one duplicate per edge, so we've added as few as vertices/2 edges, but as many as vertices - 1. Now at least 1/2 of our MST has been built. We'll continue the process as described in the following paragraphs, until we've added vertices - 1 edges in total.
Generalizing NN-Search
To continue, we'll want to construct lists of the vertices in each connected component, so that we can iterate over them by groups. This can be done in nearly linear time, as searching (also merging) a disjoint-set takes O(α(n)) time (α being the inverse Ackermann function) and we repeat exactly n times. Once we have our lists of vertices per connected component, the rest is fairly straightforward. We'll take our existing k-d tree and remove all the vertices in our current connected component. We'll then query for the nearest neighbor to each vertex in our connected component, and add these edges to our min-heap. We'll then add these vertices back into the k-d tree, and repeat on the next connected component. Since we insert/remove a total of n elements, this amounts to an average-case O(n log n) time complexity.
Now that we have a queue of the shortest potential edges connecting our connected components, we'll dequeue these in order, and just as before insert valid edges and merge the disjoint sets. For the same reasons as before, this is guaranteed to connect at least half of our components, maybe even all of them. We'll repeat this process until we have connected all vertices into a single connected component, which will be our MST. Note that because we halve the number of disconnected components each iteration, it'll take at most O(log n) iterations to connect every vertex in our MST (most likely far fewer).
Remarks
Overall, this will take O(n log^2 n) time. There will likely be far fewer than log n iterations, however, so expect a speedup there in practice. Also note that an R-tree might be a good alternative to the k-d tree; I don't know how they compare in practice, however.
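For concreteness, here is a condensed sketch of these rounds in Python, assuming scipy is available. scipy's cKDTree is static (no deletion), so the sketch rebuilds a tree over the points outside each component instead of removing and re-inserting, and each component contributes only its single shortest outgoing edge, which is all the halving argument needs:

    import numpy as np
    from scipy.spatial import cKDTree

    def find(parent, x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x

    def emst(points):
        """points: (n, d) numpy array; returns MST edges (u, v, weight)."""
        n = len(points)
        parent = list(range(n))
        mst = []
        while len(mst) < n - 1:
            comps = {}                        # component root -> members
            for v in range(n):
                comps.setdefault(find(parent, v), []).append(v)
            candidates = []
            for root, members in comps.items():
                outside = [v for v in range(n) if find(parent, v) != root]
                tree = cKDTree(points[outside])
                d, i = tree.query(points[members])   # NN outside the component
                best = int(np.argmin(d))
                candidates.append((d[best], members[best], outside[i[best]]))
            for w, u, v in sorted(candidates):       # shortest edges first
                ru, rv = find(parent, u), find(parent, v)
                if ru != rv:                  # drop edges made redundant
                    parent[ru] = rv
                    mst.append((u, v, w))
        return mst

Rebuilding a tree per component is more expensive than deleting and re-inserting in a dynamic k-d tree as described above, but it keeps the sketch short without changing its structure.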

Find a subset of k points most distant from each other

I have a set of N points (in particular, these points are binary strings), and for each pair of them I have a discrete metric (the Hamming distance), so that given two points i and j, Dij is the distance between the i-th and the j-th point.
I want to find a subset of k elements (with k < N, of course) such that the distance between these k points is as large as possible.
In other words, what I want is to find a sort of "border points" that cover the maximum area in the space of the points.
If k = 2 the answer is trivial, because I can search for the two most distant elements in the distance matrix, but how can I generalize this question when k > 2?
Any suggestions? Is it an NP-hard problem?
Thanks for the answer
One generalisation would be "find k points such that the minimum distance between any two of these k points is as large as possible".
Unfortunately, I think this is hard, because I think if you could do this efficiently you could find cliques efficiently. Suppose somebody gives you a matrix of distances and asks you to find a k-clique. Create another matrix with entries 1 where the original matrix had infinity, and entries 1000000 where the original matrix had any finite distance. Now a set of k points in the new matrix where the minimum distance between any two points in that set is 1000000 corresponds to a set of k points in the original matrix which were all connected to each other - a clique.
This construction does not take account of the fact that the points correspond to bit-vectors and the distance between them is the Hamming distance, but I think it can be extended to cope with this. To show that a program capable of solving the original problem can be used to find cliques I need to show that, given an adjacency matrix, I can construct a bit-vector for each point so that pairs of points connected in the graph, and so with 1 in the adjacency matrix, are at distance roughly A from each other, and pairs of points not connected in the graph are at distance B from each other, where A > B. Note that A could be quite close to B. In fact, the triangle inequality will force this to be the case. Once I have shown this, k points all at distance A from each other (and so with minimum distance A, and a sum of distances of k(k-1)A/2) will correspond to a clique, so a program finding such points will find cliques.
To do this I will use bit-vectors of length kn(n-1)/2, where k will grow with n, so the length of the bit-vectors could be as much as O(n^3). I can get away with this because this is still only polynomial in n. I will divide each bit-vector into n(n-1)/2 fields each of length k, where each field is responsible for representing the connection or lack of connection between two points. I claim that there is a set of bit-vectors of length k so that all of the distances between these k-long bit-vectors are roughly the same, except that two of them are closer together than the others. I also claim that there is a set of bit-vectors of length k so that all of the distances between them are roughly the same, except that two of them are further apart than the others. By choosing between these two different sets, and by allocating the nearer or further pair to the two points owning the current bit-field of the n(n-1)/2 fields within the bit-vector I can create a set of bit-vectors with the required pattern of distances.
I think these exist because I think there is a construction that creates such patterns with high probability. Create n random bit-vectors of length k. Any two such bit-vectors have an expected Hamming distance of k/2 with a variance of k/4, so a standard deviation of sqrt(k)/2. For large k we expect the different distances to be reasonably similar. To create within this set two points that are very close together, make one a copy of the other. To create two points that are very far apart, make one the complement of the other (0s in one where the other has 1s and vice versa).
Given any two points their expected distance from each other will be (n(n-1)/2 - 1)k/2 + k (if they are supposed to be far apart) and (n(n-1)/2 -1)k/2 (if they are supposed to be close together) and I claim without proof that by making k large enough the expected difference will triumph over the random variability and I will get distances that are pretty much A and pretty much B as I require.
@mcdowella, I think I probably didn't explain my problem very well.
In my problem I have binary strings, and for each of them I can compute the distance to the others using the Hamming distance.
In this way I have a distance matrix D that has a finite value in each element D(i,j).
I can view this distance matrix as a graph: in fact, each row is a vertex in the graph, and the entry in column j is the weight of the arc that connects vertex Vi to vertex Vj.
This graph, for the reasons I explained, is complete, and so it is a clique of itself.
For this reason, if I pick k vertices at random from the original graph, I obtain a subgraph that is also complete.
From all the possible subgraphs of order k, I want to choose the best one.
What is the best one? It is a graph such that the distances between the vertices are as large as possible, but also as uniform as possible.
Suppose that I have two vertices v1 and v2 in my subgraph and that their distance is 25, and I have three other vertices v3, v4, v5 such that
d(v1, v3) = 24, d(v1, v4) = 7, d(v2, v3) = 5, d(v2, v4) = 22, d(v1, v5) = 14, d(v2, v5) = 14
With these distances, v3 is far from v1 but very near to v2, and the opposite holds for v4, which is far from v2 but near to v1.
Instead, I would prefer to add vertex v5 to my subgraph, because it is distant from the other two in a more uniform way.
I hope that now my problem is clear.
Do you think that your formulation is still correct?
I have claimed that the problem of finding k points such that the minimum distance between these points, or the sum of the distances between these points, is as large as possible is NP-complete, so there is presumably no polynomial-time exact answer. This suggests that we should look for some sort of heuristic solution, so here is one, based on an idea for clustering. I will describe it for maximising the total distance. I think it can be made to work for maximising the minimum distance as well, and perhaps for other goals.
Pick k arbitrary points and note down, for each point, the sum of the distances to the other points. For each other point in the data, look at the sum of the distances to the k chosen points and see if replacing any of the chosen points with that point would increase the sum. If so, replace whichever point increases the sum most and continue. Keep trying until none of the points can be used to increase the sum. This is only a local optimum, so repeat with another set of k arbitrary/random points in the hope of finding a better one until you get fed up.
This inherits from its clustering forebear the following property, which might at least be useful for testing: if the points can be divided into k classes such that the distance between any two points in the same class is always less than the distance between any two points in different classes then, when you have found k points where no local improvement is possible, these k points should all be from different classes (because if not, swapping out one of a pair of points from the same class would increase the sum of distances between them).
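Here is a sketch of this heuristic in Python, assuming the points are bit-vectors encoded as Python ints. It takes the first improving swap rather than the best one, which keeps the code short but is otherwise the same idea:

    import random

    def hamming(a, b):                 # a, b are bit-vectors encoded as ints
        return bin(a ^ b).count("1")

    def total_distance(points, chosen):
        return sum(hamming(points[u], points[v])
                   for u in chosen for v in chosen if u < v)

    def improve_once(points, chosen):
        """Perform one swap that increases the sum, if any exists."""
        for cand in [p for p in range(len(points)) if p not in chosen]:
            for old in list(chosen):
                gain = sum(hamming(points[cand], points[v]) -
                           hamming(points[old], points[v])
                           for v in chosen if v != old)
                if gain > 0:
                    chosen.remove(old)
                    chosen.add(cand)
                    return True
        return False

    def max_sum_subset(points, k, restarts=20):
        best, best_score = None, -1
        for _ in range(restarts):
            chosen = set(random.sample(range(len(points)), k))
            while improve_once(points, chosen):
                pass
            score = total_distance(points, chosen)
            if score > best_score:
                best, best_score = set(chosen), score
        return best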
This problem is known as the MaxMin Diversity Problem (MMDP). It is known to be NP-hard. However, there are algorithms for giving good approximate solutions in reasonable time, such as this one.
I'm answering this question years after it was asked because I was looking for algorithms to solve the same problem, and had trouble even finding out what to call it.

Making a cost matrix in graph

Problem:
Form a network; that is, all the bases should be reachable from every base.
One base is reachable from another base if there is a path of tunnels connecting them.
Bases are placed on a 2-D plane and have integer coordinates.
The cost of building a tunnel between two bases at coordinates (x1,y1) and (x2,y2) is min{ |x1-x2|, |y1-y2| }.
What is the minimum cost such that a network is formed?
1 ≤ N ≤ 100000 // Number of bases
-10^9 ≤ xi,yi ≤ 10^9
This calls for a typical Kruskal's minimum spanning tree implementation, but you cannot store (10^5)^2 edges.
So how should I make my cost matrix? How do I make a graph so I can apply Kruskal's algorithm?
You should not store the whole graph, as you don't actually need it. In fact, I think Prim's algorithm is more suitable in this case. You will not need all the edges at any single time; instead, on each iteration you will update a min-dist array of size N. Of course the complexity will still be on the order of N^2, but at least memory will not be an issue. You can also use the specific way the distance is computed to improve the complexity (using some ordered structure to store the points).
I believe the only edges that will ever be used (due to your cost function) will be from each base to at most 4 neighbours. The neighbours to use are the closest point with greater (or equal) x value, the closest point with smaller (or equal) x value, the closest point with greater (or equal) y value, the closest point with smaller (or equal) y value.
You can compute these neighbours efficiently by sorting the points according to each axis and then linking each point with the point ahead and behind it in sorted order.
It does not matter if there is more than one point at a particular value of coordinate.
There will therefore be only about 4n = O(n) edges for you to consider with Kruskal's algorithm.
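A sketch of this construction in Python, followed by standard Kruskal with a disjoint-set; it returns the total cost, which is what the problem asks for:

    def min_network_cost(bases):
        """bases: list of (x, y) with tunnel cost min(|dx|, |dy|)."""
        n = len(bases)
        edges = []
        for axis in (0, 1):                   # link neighbours along x, then y
            order = sorted(range(n), key=lambda i: bases[i][axis])
            for a, b in zip(order, order[1:]):
                (x1, y1), (x2, y2) = bases[a], bases[b]
                edges.append((min(abs(x1 - x2), abs(y1 - y2)), a, b))
        edges.sort()

        parent = list(range(n))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        total = 0
        for w, a, b in edges:                 # standard Kruskal
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb
                total += w
        return total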

Algorithm that finds the connectivity distance of a graph on uniform points on the unit square

Situation
Suppose we are given n points on the unit square [0, 1]x[0, 1] and a positive real number r. We define the graph G(point 1, point 2, ..., point n, r) as the graph on vertices {1, 2, ..., n} such that there is an edge connecting two given vertices if and only if the distance between the corresponding points is less than or equal to r. (You can think of the points as transmitters, which can communicate with each other as long as they are within range r.)
Given n points on the unit square [0, 1]x[0, 1], we define the connectivity distance as the smallest possible r for which G(point 1, point 2, ..., point n, r) is connected.
Problem 1) find an algorithm that determines if G(point 1, point 2, ..., point n, r) is connected
Problem 2) find an algorithm that finds the connectivity distance for any n given points
My partial solution
I have an algorithm (Algorithm 1) in mind for problem 1. I haven't implemented it yet, but I'm convinced it works. (Roughly, the idea is to start from vertex 1, and try to reach all other vertices through the edges. I think it would be somewhat similar to this.)
All that remains is problem 2. I also have an algorithm in mind for this one. However, I think it is not efficient time wise. I'll try to explain how it works:
You must first convince yourself that the connectivity distance rmin is necessarily the distance between two of the given points, say p and q. Hence, there are at most n(n-1)/2 possible values for rmin.
So, first, my algorithm would measure all n(n-1)/2 distances and store them (in an array in C, for instance) in increasing order. Then it would use Algorithm 1 to test each stored value (in increasing order) to see if the graph is connected with that range. The first value that does the job is the answer, rmin.
My question is: is there a better (time wise) algorithm for problem 2?
Remarks: the points will be randomly generated (something like 10000 of them), so that's the type of thing the algorithm is supposed to solve "quickly". Furthermore, I'll implement this in C. (If that makes any difference.)
Here is an algorithm which requires O(n^2) time and O(n) space.
It's based on the observation that if you partition the points into two sets, then the connectivity distance cannot be less than the distance of the closest pair of points one from each set in the partition. In other words, if we build up the connected graph by always adding the closest point, then the largest distance we add will be the connectivity distance.
Create two sets, A and B. Put a random point into A and all the remaining points into B.
Initialize r (the connectivity distance) to 0.
Initialize a map M with the distance from the single point in A to every point in B.
While there are still points in B:
    Select the point b in B whose distance M[b] is the smallest.
    If M[b] is greater than r, set r to M[b].
    Remove b from B and add it to A.
    For each point p in M:
        If p is b, remove it from M.
        Otherwise, if the distance from b to p is less than M[p], set M[p] to that distance.
When all the points are in A, r will be the connectivity distance.
Each iteration of the while loop takes O(|B|) time: first, to find the minimum value in M (whose size is equal to the size of B); second, to update the values in M. Since a point is moved from B to A in each iteration, there will be exactly n - 1 iterations, and thus the total execution time is O(n^2).
The algorithm presented above is an improvement on a previous algorithm, which used an (unspecified) solution to the bichromatic closest pair (BCP) problem to recompute the closest neighbour to A in every cycle. Since there is an O(n log n) solution to BCP, this implied a solution to the original problem in O(n^2 log n). However, maintaining and updating the list of closest points is actually much simpler, and only requires O(n). Thanks to @LajosArpad for a question which triggered this line of thought.
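Here is a sketch of the above in Python; it is essentially Prim's algorithm in which we record the largest distance ever added, i.e., the bottleneck edge of the minimum spanning tree:

    import math

    def connectivity_distance(points):
        n = len(points)
        in_a = [False] * n
        in_a[0] = True                          # the starting point in A
        m = [math.dist(points[0], p) for p in points]   # distance to set A
        r = 0.0
        for _ in range(n - 1):
            b = min((i for i in range(n) if not in_a[i]), key=m.__getitem__)
            r = max(r, m[b])                    # largest distance added so far
            in_a[b] = True
            for p in range(n):
                if not in_a[p]:
                    m[p] = min(m[p], math.dist(points[b], points[p]))
        return r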
I think your ideas are reasonably good, however, I have an improvement for you.
In fact you build up an array of measured distances and sort it. Very nice, at least when there are not too many points.
The number n(n-1)/2 is a logical consequence of your pairing requirement. So, for 10000 points, you will have 49995000 distances. You will need to increase the speed significantly! Also, this many elements would eat a lot of your memory.
How can you achieve greater speed?
First of all, don't build new arrays; you already have your points. Secondly, you can solve your problem by a single traversal. Let's suppose you have a function which determines whether a given distance is enough to connect all the nodes; let's call this function "valid". That alone is not enough, because you need to find the minimal such value. So, if you don't have more information about the nodes prior to the execution of the algorithm, then my suggestion is this solution (in Python):
    import math

    def minimal_connecting_distance(elements, valid):
        # valid(d) returns True iff distance d is enough to connect all nodes
        lower_bound = 0.0
        upper_bound = math.inf
        n = len(elements)
        for i in range(n):
            for j in range(i + 1, n):
                distance = math.dist(elements[i], elements[j])
                if lower_bound < distance < upper_bound:
                    if valid(distance):
                        upper_bound = distance
                    else:
                        lower_bound = distance
        return upper_bound
After traversing all the elements, upper_bound will hold the smallest distance which still connects the network. You didn't store all the distances (there would be far too many), and you have solved your problem in a single pass. I hope you find my answer helpful.
If some distance makes the graph connected, any larger distance would make it connected too. To find the minimal connecting distance, just sort all distances and use binary search.
Time complexity is O(n^2*log n), space complexity is O(n^2).
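A sketch of this in Python: the connectivity check is a simple O(n^2) depth-first search, and the binary search finds the smallest stored distance for which it succeeds:

    import math

    def is_connected(points, r):
        """DFS from point 0 using only edges of length <= r."""
        n = len(points)
        seen = [False] * n
        seen[0] = True
        stack = [0]
        while stack:
            u = stack.pop()
            for v in range(n):
                if not seen[v] and math.dist(points[u], points[v]) <= r:
                    seen[v] = True
                    stack.append(v)
        return all(seen)

    def connectivity_distance_bsearch(points):
        n = len(points)
        dists = sorted(math.dist(points[i], points[j])
                       for i in range(n) for j in range(i + 1, n))
        lo, hi = 0, len(dists) - 1
        while lo < hi:          # smallest index whose distance connects G
            mid = (lo + hi) // 2
            if is_connected(points, dists[mid]):
                hi = mid
            else:
                lo = mid + 1
        return dists[lo]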
You can start with some small distance d and then check for connectivity. If the graph is connected, you're done; if not, increase d by a small step and check again.
You also need a clever algorithm to avoid O(N^2) in case N is big.
