I'm studying the communication complexity of a parallel implementation of Quicksort in MPI and I've found something like this in a book:
"A single process gathers p regular samples from each of the other p-1 processes. Since relatively few values are being passed, message latency is likely to be the dominant term of this step. Hence the communication complexity of the gather is O(log p)" (O is actually a theta and p is the number of processors).
The same claim is made for the broadcast message.
Why are these group communications complexity O(log p)? Is it because the communication is done using some kind of tree-based hierarchy?
What if latency is not the dominant term and there's a lot of data being sent? Would the complexity be O(n log (p)) where n would be the size of the data being sent divided by the available bandwidth?
And, what about the communication complexity of an MPI_Send() and an MPI_Recv()?
Thanks in advance!
Yes. Depending on the particular MPI implementation, gather and scatter are realized using, for instance, binomial trees, hypercubes, linear arrays, or 2D square meshes. An all-gather operation may be implemented using a hypercube, and so on.
For a gather or scatter, let lambda be the latency and beta the bandwidth. Then ceil(log p) steps are required. Suppose you are sending n integers, each represented using 4 bytes. The time to send them is

    lambda * ceil(log p) + 4n / beta

This is O(log p) when n = O(1) and O(log p + n) otherwise.
For a broadcast, the time required is

    (lambda + 4n / beta) * ceil(log p)

which is O(log p) when n = O(1) and O(n log p) otherwise.
Finally, for point-to-point communications like MPI_Send(), if you are sending n integers the communication complexity is O(n). When n = O(1) then the complexity is obviously O(1).
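To make this concrete, here is a minimal C sketch of the collectives under discussion; the buffer sizes and contents are made up for illustration:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Each process contributes n ints to a gather on rank 0, then rank 0
       broadcasts n ints back to everyone. Most MPI implementations realize
       both collectives as tree-shaped exchanges, hence the ceil(log p)
       latency terms above. */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        const int n = 4;                         /* samples per process */
        int local[4] = {rank, rank, rank, rank};
        int *all = NULL;
        if (rank == 0) all = malloc(p * n * sizeof(int));

        /* Gather: Theta(log p) latency plus a bandwidth term at the root. */
        MPI_Gather(local, n, MPI_INT, all, n, MPI_INT, 0, MPI_COMM_WORLD);

        /* Broadcast: each of the ~log p tree levels forwards all n ints,
           hence Theta(n log p) when the data term dominates. */
        MPI_Bcast(local, n, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0) free(all);
        MPI_Finalize();
        return 0;
    }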
Related
What does it mean when we say that an algorithm X is asymptotically more efficient than Y?
we consider the growth of the algorithm in terms of input size. I am not getting the concept properly.
The growth of an algorithm shows up when we use containers such as arrays, stacks, queues, and other data structures. If an array's size N is taken from the user, the array takes O(N) space.
In terms of time complexity, if there is a loop in the program that runs n times, it takes O(N) time.
These are the two main attributes to consider when judging the growth of any algorithm.
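For instance, a minimal C sketch of both points (the numbers are arbitrary):

    #include <stdio.h>
    #include <stdlib.h>

    /* Toy illustration: space and time both grow linearly with the input size n. */
    int main(void) {
        int n = 1000;
        int *a = malloc(n * sizeof(int));   /* O(n) space: one slot per element */
        long sum = 0;
        for (int i = 0; i < n; i++) {       /* O(n) time: the loop runs n times */
            a[i] = i;
            sum += a[i];
        }
        printf("sum = %ld\n", sum);         /* prints 499500 */
        free(a);
        return 0;
    }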
Let's assume you have written an algorithm in Spark and you can evaluate its performance using 1 .. X cores on data sets of size N running in local mode. How would you approach questions like these:
What is the runtime running on a cluster with Y nodes and data size M >> N?
What is the minimum possible runtime for a data set of size M >> N using an arbitrary number of nodes?
Clearly, this is influenced by countless factors, and giving a precise estimate is almost impossible. But how would you come up with an educated guess? Running in local mode mainly lets you measure CPU usage. Is there a rule of thumb to account for disk + network load in shuffles as well? Are there even ways to simulate performance on a cluster?
The data load can be estimated as O(n).
The algorithm can be estimated stage by stage; the whole algorithm is the accumulation of all stages. Note that each stage works on a different amount of data, which is related to the size of the initial input.
If the whole algorithm has O(n), then it's O(n).
If the whole algorithm has O(n log n), then it's O(n log n).
If the whole algorithm has O(n^2), then the algorithm needs to be improved to fit M >> N.
Assume:
There is no huge shuffle, or the network is fast enough.
Each node has the same configuration.
The total time spent is T for data size N on a single node.
The number of nodes is X.
Then, if the algorithm is O(n), the estimated time is T * M / N / X.
If the algorithm is O(n log n), the estimated time is T * M / N / X * log(M/N).
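As a back-of-the-envelope sketch of the two rules above (function names and sample figures are made up for illustration):

    #include <math.h>
    #include <stdio.h>

    /* Hypothetical estimator for the rules of thumb above:
       t_single = measured time T for data size n_small on one node,
       m = target data size, x = number of nodes. */
    double estimate_linear(double t_single, double n_small, double m, double x) {
        return t_single * (m / n_small) / x;                 /* O(n) rule */
    }

    double estimate_nlogn(double t_single, double n_small, double m, double x) {
        return estimate_linear(t_single, n_small, m, x) * log2(m / n_small);
    }

    int main(void) {
        /* e.g. 60 s for 1 GB locally; what about 100 GB on 8 nodes? */
        printf("O(n):       %.1f s\n", estimate_linear(60, 1, 100, 8));
        printf("O(n log n): %.1f s\n", estimate_nlogn(60, 1, 100, 8));
        return 0;
    }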
Edit
If there is a big shuffle, then it is O(n) with respect to bandwidth: the extra time added is dataSize(M) / bandwidth.
If there are many big shuffles, then consider improving the algorithm.
Are there any famous algorithms with this complexity?
I was thinking maybe a skip list where the levels of the nodes are not determined by the number of tails in a sequence of coin tosses, but instead use a number generated uniformly at random from the range (1, log(n)) to determine the level of the node. Such a data structure would have a find(x) operation with a complexity of O(n / log(n)) (I think, at least). I was curious whether there was anything else.
It's common to see algorithms whose runtime is of the form O(n^k / log n) or O(log n / log log n) when using the method of Four Russians to speed up an existing algorithm. The classic Four Russians speedup reduces the cost of doing a matrix/vector product on Boolean matrices from O(n^2) to O(n^2 / log n). The standard dynamic programming algorithm for sequence alignment on two length-n strings runs in time O(n^2), which can be decreased to O(n^2 / log n) by using a similar trick.
Similarly, the prefix parity problem - in which you need to maintain a sequence of Boolean values while supporting "flip" and "parity of a prefix of the sequence" operations - can be solved in time O(log n / log log n) by using a Four Russians speedup. (Notice that if you express the runtime as a function of k = log n, this is O(k / log k).)
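The full table-lookup version of Four Russians is too long to show here, but the same "process Theta(log n) bits at once" idea can be sketched with word packing, treating a 64-bit machine word as the Theta(log n)-sized block (a toy C example with made-up data):

    #include <stdint.h>
    #include <stdio.h>

    /* Word-packed Boolean matrix-vector product y = A x over (OR, AND).
       Packing 64 columns per uint64_t lets one AND process a whole block,
       so the n^2 bit operations drop to roughly n^2 / 64. */

    #define N 128                  /* matrix dimension (a multiple of 64 here) */
    #define W (N / 64)             /* words per packed row */

    int main(void) {
        static uint64_t A[N][W];   /* A[i][w] holds columns 64w..64w+63 of row i */
        uint64_t x[W] = {0};
        uint8_t y[N];

        /* Toy data: A is the identity matrix, x has only bit 5 set. */
        for (int i = 0; i < N; i++) A[i][i / 64] = 1ULL << (i % 64);
        x[0] = 1ULL << 5;

        for (int i = 0; i < N; i++) {
            uint64_t acc = 0;
            for (int w = 0; w < W; w++)    /* N/64 word operations per row */
                acc |= A[i][w] & x[w];
            y[i] = (acc != 0);             /* OR-semiring dot product */
        }

        printf("y[5] = %d, y[6] = %d\n", y[5], y[6]);   /* expect 1 and 0 */
        return 0;
    }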
So I'm teaching myself some graph algorithms, now on Kruskal's, and understand that it's recommended to use union-find so checking whether adding an edge creates a cycle only takes O(log V) time. For practical purposes, I see why you'd want to, but strictly looking through Big O notation, does doing so actually affect the worst-case complexity?
My reasoning: If instead of union-find, we did a DFS to check for cycles, the runtime for that would be O(E + V), and you have to perform that V times for a runtime of O(V^2 + VE). It's more than with union-find, which would be O(V log V), but the bulk of the complexity of Kruskal's comes from deleting the minimum element of the priority queue E times, which is O(E log E), the Big O answer. I don't really see a space advantage either, since the union-find takes O(V) space and so too do the data structures you need to maintain to find a cycle using DFS.
So a probably overly long explanation for a simple question: Does using union-find in Kruskal's algorithm actually affect worst-case runtime?
and understand that it's recommended to use union-find so checking whether adding an edge creates a cycle only takes O(log V) time
This isn't right. Using union-find, a sequence of m operations takes O(alpha(n) * m) time, where alpha(n) is the inverse of the Ackermann function, which for all intents and purposes can be considered constant. So it's much faster than logarithmic:
Since alpha(n) is the inverse of this function, alpha(n) is less than 5 for all remotely practical values of n. Thus, the amortized running time per operation is effectively a small constant.
but the bulk of the complexity of Kruskal's comes from deleting the minimum element of the priority queue E times
This is also wrong. Kruskal's algorithm does not involve using any priority queues; it involves sorting the edges by cost at the beginning. The complexity of that step is still the one you mention, but in practice sorting may be faster than a priority queue (using a priority queue is, at best, equivalent to a heap sort, which is not the fastest sorting algorithm).
Bottom line, if m is the number of edges and n the number of nodes:
Sorting the edges: O(m log m).
For each edge, calling union-find: O(m * alpha(n)), or basically just O(m).
Total complexity: O(m log m + m * alpha(n)).
If you don't use union-find, total complexity will be O(m log m + m * (n + m)), if we use your O(n + m) cycle finding algorithm. Although O(n + m) for this step is probably an understatement, since you must also update your structure somehow (insert an edge). The naive disjoint-set algorithm is actually O(n log n), so even worse.
Note: in this case, you can write log n instead of log m if you prefer, because m = O(n^2) and log(n^2) = 2log n.
In conclusion: yes, union-find helps a lot.
Even if you use the O(log n) variant of union-find, which would lead to O(m log m + m log n) total complexity, which you could simplify to O(m log m), in practice you'd rather keep the second part faster if you can. Since union-find is very easy to implement, there's really no reason not to.
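For reference, a minimal C sketch of Kruskal's with union-find (path compression plus union by rank); the graph and edge costs are made up for illustration:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int u, v, w; } Edge;

    static int parent[1024], rank_[1024];

    static void uf_init(int n) {
        for (int i = 0; i < n; i++) { parent[i] = i; rank_[i] = 0; }
    }

    static int uf_find(int x) {              /* path compression */
        if (parent[x] != x) parent[x] = uf_find(parent[x]);
        return parent[x];
    }

    static int uf_union(int a, int b) {      /* union by rank; 0 if already joined */
        int ra = uf_find(a), rb = uf_find(b);
        if (ra == rb) return 0;
        if (rank_[ra] < rank_[rb]) { int t = ra; ra = rb; rb = t; }
        parent[rb] = ra;
        if (rank_[ra] == rank_[rb]) rank_[ra]++;
        return 1;
    }

    static int cmp_edge(const void *a, const void *b) {
        return ((const Edge *)a)->w - ((const Edge *)b)->w;
    }

    int main(void) {
        Edge edges[] = {{0,1,4},{0,2,1},{1,2,2},{1,3,5},{2,3,8}};
        int m = 5, n = 4, total = 0;

        qsort(edges, m, sizeof(Edge), cmp_edge);   /* O(m log m) sort step */
        uf_init(n);
        for (int i = 0; i < m; i++)                /* m near-constant-time unions */
            if (uf_union(edges[i].u, edges[i].v))
                total += edges[i].w;

        printf("MST weight: %d\n", total);         /* prints 8 (edges 1, 2, 5) */
        return 0;
    }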
I am undertaking the algorithms course on Coursera, and there is a section where the author mentions the following:
the running time of weighted quick union with path compression is going to be linear in the real world and actually could be improved to even a more interesting function called the Ackermann function, which is even more slowly growing than lg. And another point about this is it seems that this is so close to being linear, that is, time proportional to N instead of time proportional to N times the slowly growing function in N. Is there a simple algorithm that is linear? And people looked for a long time for that, and actually it works out to be the case that we can prove that there is no such algorithm. (emphasis added)
(You can find the entire transcript here)
In all other sources, including Wikipedia, "linear" is used when time increases proportionally with the input size, and in weighted quick-union with path compression this is certainly not the case.
What exactly is meant by "linear in the real world" here?
The runtime of m operations on a union-find data structure with path compression and union-by-rank is O(m α(m)), where α(m) is the inverse Ackermann function. This function is so slowly-growing that you cannot express an input to it for which the output is 6 in scientific notation. In other words, for any possible value of m that fits into the universe (or even that has size around 2^(number of atoms in the universe)), we have that α(m) ≤ 5. Therefore, for any "reasonable" input the cost of m operations will be O(m · 6) = O(m), which is linear.
Of course, the runtime isn't linear because α(m) does indeed grow, just very, very slowly. However, it's usually fine to approximate the runtime as O(m) because there's no possible way you'd ever notice the runtime of the function deviating from a simple linear function.
Hope this helps!
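If you want a feel for just how slowly these functions grow, here is a small C sketch computing lg* (the iterated logarithm, a close cousin from the Hopcroft-Ullman bound quoted below; α grows even more slowly):

    #include <stdio.h>
    #include <math.h>

    /* lg*(n): how many times lg must be applied before the value drops to <= 1. */
    int lg_star(double n) {
        int k = 0;
        while (n > 1.0) { n = log2(n); k++; }
        return k;
    }

    int main(void) {
        double vals[] = {2, 16, 65536, 1e18};
        for (int i = 0; i < 4; i++)
            printf("lg*(%.0f) = %d\n", vals[i], lg_star(vals[i]));
        /* Prints 1, 3, 4, 5: even 10^18 operations cost only a factor of 5. */
        return 0;
    }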
Here are some chunks from the transcript:
And what was proved by Hopcroft, Ullman, and Tarjan was that if you have N objects, any sequence of M union and find operations will touch the array at most c(N + M lg* N) times. And now, lg* N is kind of a funny function....
And another point about this is it seems that this is so close to being linear, that is, time proportional to N instead of time proportional to N times the slowly growing function in N.
(end quote)
You are pointing out that the cost of an individual operation grows very slowly with the number of objects, but they are looking at how the total cost of a sequence of operations grows with the number of objects involved: N times a per-operation cost that grows only very slowly with N is still just over linear in N.