Shortest-path algorithms which use a space-time tradeoff? - algorithm

Problem: finding shortest paths in an unweighted, undirected graph.
Breadth-first search can find the shortest path between two nodes, but this can take up to O(|V| + |E|) time. A precomputed lookup table would allow requests to be answered in O(1) time, but at the cost of O(|V|^2) space.
What I'm wondering: Is there an algorithm which offers a space-time tradeoff that's more fine-grained? In other words, is there an algorithm which:
Finds shortest paths in more time than O(1), but is faster than a bidirectional breadth-first search
Uses precomputed data which takes up less space than O(|V|^2)?
On the practical side: the graph has 800,000 nodes and is believed to be a small-world network. The all-pairs shortest-paths table would be on the order of gigabytes -- that's not outrageous these days, but it doesn't suit our requirements.
However, I am asking my question out of curiosity. What's keeping me up at night is not "how can I reduce cache misses for an all-pairs lookup table?", but "Is there a completely different algorithm out there that I've never heard of?"
The answer may be no, and that's okay.

You should start by looking at Dijkstra's algorithm for finding the shortest path. The A* algorithm is a variant that uses a heuristic (such as the Euclidean distance to the goal) to reduce the time taken to find an optimal route between the start and goal nodes. You can tune this heuristic to trade accuracy for speed.
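For concreteness, here is a minimal A* sketch in Python, assuming the graph is an adjacency-list dict and the caller supplies an admissible heuristic (with heuristic = lambda *_: 0 it degenerates to a plain uniform-cost search); the names here are mine, not from the question:

    import heapq
    from itertools import count

    def a_star(graph, start, goal, heuristic):
        # graph: {node: iterable of neighbours}; every edge costs 1 (unweighted).
        # heuristic(n, goal) must never overestimate the true remaining distance.
        tie = count()                                     # heap tie-breaker
        frontier = [(heuristic(start, goal), next(tie), 0, start)]
        best = {start: 0}                                 # cheapest known cost per node
        while frontier:
            _, _, cost, node = heapq.heappop(frontier)
            if node == goal:
                return cost
            if cost > best.get(node, float("inf")):
                continue                                  # stale queue entry
            for nbr in graph[node]:
                new_cost = cost + 1
                if new_cost < best.get(nbr, float("inf")):
                    best[nbr] = new_cost
                    heapq.heappush(frontier,
                                   (new_cost + heuristic(nbr, goal), next(tie), new_cost, nbr))
        return None                                       # goal unreachable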

It seems as if your input set must be very large, if a lookup table would be too large to store on disk. I assume that the data will not fit in RAM either, which means that whatever algorithm you use should be tuned to minimize the number of reads and writes. Whenever disks are involved, space == time, because writing to disk is so slow.
The exact algorithm you should use depends on what kind of graph you have. This research paper might be of interest to you. Full disclosure: I have not read it myself, but it seems like it might be what you are looking for.
Edit:
If the graph is (almost) connected, as a small-world network is, a lookup table can't be smaller than V^2 entries. This means that all lookups will require disk access. If the edges fit in main memory, it might be faster to just compute the path every time. Otherwise, you might answer queries from a table containing the lengths of all shortest paths, and reconstruct the actual path from that table.
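For what it's worth, here is a rough sketch of that reconstruction for an unweighted graph, assuming an adjacency-list dict and a distance table where dist[u][t] is the shortest-path length from u to t (the names are my own, not anything from the question):

    def reconstruct_path(adj, dist, s, t):
        # Walk from s toward t, always stepping to a neighbour that is
        # exactly one hop closer to t according to the distance table.
        if dist[s][t] == float("inf"):
            return None                    # t is unreachable from s
        path, u = [s], s
        while u != t:
            u = next(w for w in adj[u] if dist[w][t] == dist[u][t] - 1)
            path.append(u)
        return path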
The key is to make sure that entries which are close to each other in either index are also close to each other on disk. This storage pattern accomplishes that:
 1  2  5  6
 3  4  7  8
 9 10 13 14
11 12 15 16
It will also work well with the cache hierarchy.
In order to compute the table you might use a modified Floyd-Warshall, where you process the data in blocks. This would let you perform the computation in a reasonable amount of time, especially if you parallelize it.
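A rough NumPy sketch of that idea, assuming the full distance matrix (np.inf for missing edges, 0 on the diagonal) fits in memory and is processed one tile at a time; the tile size and names are my own choices, not anything prescribed:

    import numpy as np

    def blocked_floyd_warshall(dist, block=256):
        # In-place blocked Floyd-Warshall on a dense n x n matrix.
        n = dist.shape[0]
        starts = range(0, n, block)

        def relax(ib, jb, kb):
            # Relax tile (ib, jb) through the intermediate vertices of tile kb.
            i0, i1 = ib, min(ib + block, n)
            j0, j1 = jb, min(jb + block, n)
            for k in range(kb, min(kb + block, n)):
                np.minimum(dist[i0:i1, j0:j1],
                           dist[i0:i1, k:k + 1] + dist[k:k + 1, j0:j1],
                           out=dist[i0:i1, j0:j1])

        for kb in starts:
            relax(kb, kb, kb)                       # phase 1: the pivot tile itself
            for jb in starts:                       # phase 2: pivot row and column
                if jb != kb:
                    relax(kb, jb, kb)
                    relax(jb, kb, kb)
            for ib in starts:                       # phase 3: all remaining tiles
                for jb in starts:
                    if ib != kb and jb != kb:
                        relax(ib, jb, kb)
        return dist

The phase-3 tiles depend only on the phase-1 and phase-2 output for the current pivot, so they can be processed independently, which is where the parallelism and the disk-friendly layout come in.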

Related

How to count the nodes in a binary tree without using any extra memory

I recently had an interview for a position dealing with extremely large distributed systems, and one of the questions I was asked was to write a function that could count the nodes in a binary tree entirely in place, meaning no recursion and no queue or stack for an iterative approach.
I don't think I have ever seen a solution that does not use at least one of the above, either when I was in school or after.
I mentioned that having a "parent" pointer would trivialize the problem somewhat, but adding even a single extra field to each node in a tree with a million nodes is not trivial in terms of memory cost.
How can this be done?
If an exact solution is required, then the prerequisite of being a binary tree may be a red herring: each node in the cluster may simply count allocations in its backing collection, which takes either constant or linear time depending on whether that count has been tracked.
If no exact solution was asked for, but the given tree is balanced, then a simple deep probe to determine the tree height, in combination with the placement rules, allows you to estimate an upper and a lower bound for the total node count. Be aware that the probe may have hit a leaf at depth log2(n) or at depth log2(n) - 1, so your estimate can be off by up to a factor of 2 in either direction. Constant space, O(log(n)) time.
If the placement rules dictate special properties about the bottommost layer (e.g. filled from left to right, unlike, say, a red-black tree), then you may perform log(n) probes in a binary search pattern to find the exact count, in constant space and O(log(n)^2) time.
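A sketch of that last case in Python, assuming a plain node class with left/right pointers and a bottom level that fills from left to right (the class name and field names are just for illustration):

    class Node:
        def __init__(self, left=None, right=None):
            self.left, self.right = left, right

    def count_complete(root):
        # O(log(n)^2) time, constant extra space.
        if root is None:
            return 0
        depth, node = 0, root
        while node.left is not None:        # depth of the bottom level
            node, depth = node.left, depth + 1
        if depth == 0:
            return 1

        def leaf_exists(idx):
            # Follow the bits of idx (most significant first) down from the root.
            node = root
            for bit in range(depth - 1, -1, -1):
                node = node.right if (idx >> bit) & 1 else node.left
                if node is None:
                    return False
            return True

        lo, hi = 0, (1 << depth) - 1        # binary search for the last leaf present
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if leaf_exists(mid):
                lo = mid
            else:
                hi = mid - 1
        return (1 << depth) - 1 + lo + 1    # full upper levels + bottom-level leaves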

Finding a partition of a graph with no edges crossing partition

I have a graph with a guarantee that it can be divided into two equal-sized partitions (one side may be 1 larger than the other), with no edges across this partition. I initially thought this was NP-hard, but I suspect it might not be. Is there any (efficient) way of solving this?
It's possible to solve your problem in time O(n^2) by combining two well-known algorithms.
When I first saw your problem, I initially thought it was going to relate to something like finding a maximum or minimum cut in a graph. However, since you're specifically looking for a way of splitting the nodes into two groups where there are no edges at all running between those groups, I think what you're looking for is much closer to finding connected components of a graph. After all, if you break the graph apart into connected components, there will be no edges running between those components. Therefore, the question boils down to the following:
Find all the connected components of the graph, making a note of how many nodes are in each component.
Partition the connected components into two groups of roughly equal size - specifically, the size of the group on one side should be at most one more than the size of the group on the other.
Step (1) is something that you can do using a breadth-first or depth-first search to identify all the connected components. That will take you time O(m + n), where m is the number of edges and n is the number of nodes.
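For reference, a minimal sketch of step (1), assuming the graph is given as an adjacency-list dict (the representation is my assumption, not something from the question):

    from collections import deque

    def component_sizes(adj):
        # BFS over every unvisited node; O(n + m) time.
        seen, sizes = set(), []
        for start in adj:
            if start in seen:
                continue
            seen.add(start)
            queue, size = deque([start]), 0
            while queue:
                node = queue.popleft()
                size += 1
                for nbr in adj[node]:
                    if nbr not in seen:
                        seen.add(nbr)
                        queue.append(nbr)
            sizes.append(size)
        return sizes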
Step (2), initially, seems like it might be pretty hard. It's reminiscent of the partition problem, which is known to be NP-hard. The partition problem works like this: you're given as input a list of numbers, and you want to determine whether there's a split of those numbers into two groups whose totals are equal to one another. (It's possible to adapt this problem so that you can tolerate a split that's off by plus or minus one without changing the complexity). That problem happens to be NP-complete, which suggests that your problem might be hard.
However, there's a small nuance that actually makes the apparent NP-hardness of the partition problem not an issue. The partition problem is NP-hard in the case where the numbers you're given are written out in binary. On the other hand, if the numbers are written out in unary, then the partition problem has a polynomial-time solution. More specifically, there's an algorithm for the partition problem that runs in time O(kU), where k is the number of numbers and U is the sum of all those numbers. In the case of the problem you're describing, you know that the sum of the sizes of the connected components in your graph must be n, the number of nodes in the graph, and you know that the number of connected components is also upper-bounded by n. This means that the runtime of O(kU), plugging in k = O(n) and U = O(n), works out to O(n^2), firmly something that can be done in polynomial time.
(Another way to see this - there's a pseudopolynomial time algorithm for the partition problem, but since in your case the maximum possible sum is bounded by an actual polynomial in the size of the input, the overall runtime is a polynomial.)
The algorithm I'm alluding to above is a standard dynamic programming exercise. You pick some ordering of the numbers - not necessarily in sorted order - and then fill in a 2D table where each entry corresponds to an answer to the question "is there a subset of the first i numbers that adds up to exactly j?" If you're not familiar with this algorithm, I'll leave it up to you to work out the details, as it's a really beautiful problem to solve that has a fairly simple and elegant solution.
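If you'd rather not work it out yourself, here is one way the DP can look (a sketch using a 1-D reachability table rather than the full 2-D one, so treat it as a minor spoiler for the exercise above):

    def balanced_split_exists(sizes):
        # sizes: the connected-component sizes; their sum is n.
        # Runs in O(k * U) = O(n^2) time and O(n) space.
        total = sum(sizes)
        target = total // 2                      # the smaller half of a legal split
        reachable = [True] + [False] * target    # reachable[j]: some subset sums to j
        for s in sizes:
            for j in range(target, s - 1, -1):   # go downward so each size is used once
                if reachable[j - s]:
                    reachable[j] = True
        # total - target differs from target by at most one, so this answers the question.
        return reachable[target]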
Overall, this algorithm will run in time O(n^2), since you'll do O(m + n) = O(n^2) work to find connected components, followed by time O(n^2) to run the partition problem DP to determine whether the split exists.
Hope this helps!

All pairs shortest path - warm restart?

Is it possible to warm-start any of the well-known algorithms (Dijkstra, Floyd-Warshall, etc.) for the APSP problem so as to reduce the time complexity, and potentially the computation time?
Let's say the graph is represented by an NxN matrix. I am only considering changes in one or more matrix entries (<< N), i.e. the distance between the corresponding vertices, between any two calls to the algorithm. Can we use the solution from the first call, plus just the incremental changes to the matrix, to speed up the calculation on the second call? I am primarily looking at dense matrices, but if there are known methods for sparse matrices, please feel free to share. Thanks.
I'm not aware of an incremental algorithm for APSP. However, there is an incremental version of A* for solving SSSP called Lifelong Planning A* (aka 'LPA*,' rarely also called 'Incremental A*'), which seems to be what you're asking about in the second paragraph.
Here is a link to the original paper. You can find more information about it in this post about A* variations.
An interesting study paper is: Experimental Analysis of Dynamic All Pairs Shortest Path Algorithms [Demetrescu, Emiliozzi, Italiano]:
We present the results of an extensive computational study on dynamic algorithms for all pairs shortest path problems. We describe our implementations of the recent dynamic algorithms of King [18] and of Demetrescu and Italiano [7], and compare them to the dynamic algorithm of Ramalingam and Reps [25] and to static algorithms on random, real-world and hard instances. Our experimental data suggest that some of the dynamic algorithms and their algorithmic techniques can be really of practical value in many situations.
Another interesting distributed algorithm is Engineering a New Algorithm for Distributed Shortest Paths on Dynamic Networks [Cicerone, D’Angelo, Di Stefano, Frigioni, Maurizio]:
We study the problem of dynamically updating all-pairs shortest paths in a distributed network while edge update operations occur to the network. We consider the practical case of a dynamic network in which an edge update can occur while one or more other edge updates are under processing.
You can find more resources searching for All Pairs Shortest Paths on Dynamic Networks.

BFS with Constant Amount of Memory

Is it possible to do a breadth-first search using only (size of graph) + a constant amount of memory -- in other words, without recording which nodes have already been visited?
No. You always need to remember which nodes you have visited, so in the worst case you need to record the visited state of every node. That said, the branching factor and depth of the graph are the main factors: if the graph doesn't branch much, you won't need anywhere near that much memory, but if it branches heavily, you tend toward the worst case.

How to efficiently find k-nearest neighbours in high-dimensional data?

So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using Euclidean distance; currently k=2, if this makes it easier).
My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimensions grows. In my sample implementation, it's only slightly faster than exhaustive search.
My next idea would be using PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: Is there some clever algorithm or data structure to solve this exactly in reasonable time?
The Wikipedia article for kd-trees has a link to the ANN library:
ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions. Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)
As far as algorithm/data structures are concerned:
The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.
I'd try it first directly and, if that doesn't produce satisfactory results, I'd use it with the data set after applying PCA/ICA (since it's quite unlikely you're going to end up with few enough dimensions for a kd-tree to handle).
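As a rough illustration of that plan in Python, using SciPy's cKDTree and scikit-learn's PCA as stand-ins for ANN (the library choice and names here are mine, not the answer's):

    import numpy as np
    from scipy.spatial import cKDTree
    from sklearn.decomposition import PCA

    def knn(points, k=2, n_components=None):
        # points: (n, d) array.  Optionally project with PCA first,
        # trading exactness for a dimensionality the kd-tree can handle.
        data = points
        if n_components is not None:
            data = PCA(n_components=n_components).fit_transform(points)
        tree = cKDTree(data)
        # k + 1 because each point's nearest neighbour is the point itself.
        dist, idx = tree.query(data, k=k + 1)
        return dist[:, 1:], idx[:, 1:]

    # e.g. knn(np.random.rand(16000, 75), k=2, n_components=20)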
use a kd-tree
Unfortunately, in high dimensions this data structure suffers severely from the curse of dimensionality, which causes its search time to be comparable to the brute force search.
reduce the number of dimensions
Dimensionality reduction is a good approach, which offers a fair trade-off between accuracy and speed. You lose some information when you reduce your dimensions, but gain some speed.
By accuracy I mean finding the exact Nearest Neighbor (NN).
Principal Component Analysis (PCA) is a good idea when you want to reduce the dimensionality of the space your data live in.
Is there some clever algorithm or data structure to solve this exactly in reasonable time?
Approximate nearest neighbor search (ANNS), where you are satisfied with finding a point that might not be the exact Nearest Neighbor, but rather a good approximation of it (for example, the 4th NN to your query when you were looking for the 1st NN).
That approach costs you accuracy, but increases performance significantly. Moreover, the probability of finding a good NN (one close enough to the query) is relatively high.
You can read more about ANNS in the introduction of our kd-GeRaF paper.
A good idea is to combine ANNS with dimensionality reduction.
Locality Sensitive Hashing (LSH) is a modern approach to the Nearest Neighbor problem in high dimensions. The key idea is that points which lie close to each other are hashed to the same bucket. So when a query arrives, it is hashed to a bucket, and that bucket (and usually its neighboring ones) contains good NN candidates.
FALCONN is a good C++ implementation, which focuses on cosine similarity. Another good implementation is our DOLPHINN, which is a more general library.
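To make the bucketing idea concrete, here is a tiny random-hyperplane LSH sketch for cosine similarity in Python; it illustrates the principle only and is not how FALCONN or DOLPHINN are actually called:

    import numpy as np
    from collections import defaultdict

    def build_lsh_index(points, n_bits=16, seed=0):
        # Points whose dot-product signs agree on every random hyperplane
        # share a bucket; in practice you would build several such tables.
        rng = np.random.default_rng(seed)
        planes = rng.standard_normal((points.shape[1], n_bits))
        codes = points @ planes > 0                       # (n, n_bits) signatures
        keys = codes @ (1 << np.arange(n_bits))           # one integer key per point
        buckets = defaultdict(list)
        for i, key in enumerate(keys):
            buckets[int(key)].append(i)
        return planes, buckets

    def candidates(query, planes, buckets):
        # NN candidates = everything that hashed to the query's bucket.
        key = int((query @ planes > 0) @ (1 << np.arange(planes.shape[1])))
        return buckets.get(key, [])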
You could conceivably use Morton Codes, but with 75 dimensions they're going to be huge. And if all you have is 16,000 data points, exhaustive search shouldn't take too long.
There's no reason to believe this is NP-complete. You're not really optimizing anything, and I'd have a hard time figuring out how to convert this to another NP-complete problem (I have Garey and Johnson on my shelf and can't find anything similar). Really, I'd just pursue more efficient methods of searching and sorting. If you have n observations, you have to calculate n x n distances right up front. Then for every observation, you need to pick out the top k nearest neighbours. That's n squared for the distance calculation and n log(n) for the sort, but you have to do the sort n times (a different sort for EVERY observation). Messy, but still polynomial time to get your answers.
A BK-tree isn't such a bad thought. Take a look at Nick's Blog on Levenshtein Automata. While his focus is strings, it should give you a springboard for other approaches. The other thing I can think of is R-trees; however, I don't know if they've been generalized for large dimensions. I can't say more than that, since I have neither used them directly nor implemented them myself.
A very common optimization is to avoid fully sorting the nearest-neighbour distances you have computed for each data point.
Since sorting the entire array can be very expensive, you can use partial-selection methods such as numpy.argpartition in the Python NumPy library to pick out only the closest K values you are interested in; there is no need to sort the entire array.
The cost of @Grembo's answer above can be reduced significantly this way, since you only need the K nearest values and there is no need to sort all of the distances from each point.
If you just need the K neighbours, this method works very well and reduces both your computational cost and time complexity.
If you need the K neighbours sorted, sort that small output afterwards.
See the documentation for argpartition.
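A small sketch of that approach with NumPy, assuming the 16,000 points fit in a single (n, n) distance matrix (about 2 GB of float64 here; otherwise process the rows in chunks):

    import numpy as np

    def k_nearest(points, k=2):
        # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b, shape (n, n).
        sq = np.einsum("ij,ij->i", points, points)
        dists = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
        np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
        # Unordered k smallest per row in O(n) per row; sort these k later if needed.
        return np.argpartition(dists, k, axis=1)[:, :k]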
