Cost of a round-based distributed algorithm

I have a round-based distributed algorithm for a network on n nodes.
I know that the cost (in terms of resource usage) of a round is O(n). However, I do not know the number of rounds; in principle they can be repeated forever.
So what will be the cost of the algorithm? Can we say that it is O(n)?

You're not clear on what you're trying to evaluate the big O of. If you mean "Distribute n tasks to a cluster of size n in a round robin fashion" then that would be O(n). If you mean "Distribute the next task in a round robin fashion" then you only need to find the next node, which can be done in O(1).
If you have some other algorithm with which you intend to use a round robin distribution as part of it, you can't determine the big O for that algorithm by looking at this portion alone.
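As a rough sketch of that second case (the names here are just illustrative), picking the next node in a round robin is a single modular increment, i.e. O(1) per task and O(n) to hand out n tasks:

    class RoundRobinDispatcher {
        private final int clusterSize;
        private int next = 0;

        RoundRobinDispatcher(int clusterSize) {
            this.clusterSize = clusterSize;
        }

        // O(1) per task: one increment and one modulo, regardless of cluster size.
        int nextNode() {
            int node = next;
            next = (next + 1) % clusterSize;
            return node;
        }
    }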

Related

How is pre-computation handled by complexity notation?

Suppose I have an algorithm that runs in O(n) for every input of size n, but only after a pre-computation step of O(n^2) for that given size n. Is the algorithm considered O(n) still, with O(n^2) amortized? Or does big O only consider one "run" of the algorithm at size n, and so the pre-computation step is included in the notation, making the true notation O(n+n^2) or O(n^2)?
It's not uncommon to see this accounted for by explicitly separating out the costs into two different pieces. For example, in the range minimum query problem, it's common to see people talk about things like an ⟨O(n^2), O(1)⟩-time solution to the problem, where the O(n^2) denotes the precomputation cost and the O(1) denotes the lookup cost. You also see this with string algorithms sometimes: a suffix tree provides an O(m)-preprocessing-time, O(n+z)-query-time solution to string searching, while Aho-Corasick string matching offers an O(n)-preprocessing-time, O(m+z)-query-time solution.
The reason for doing so is that the tradeoffs involved here really depend on the use case. It lets you quantitatively measure how many queries you're going to have to make before the preprocessing time starts to be worth it.
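As an illustrative sketch of that ⟨O(n^2), O(1)⟩ split for range minimum query (the class name is just made up for this example): precompute the minimum of every range once, then answer each query with a single table lookup.

    class NaiveRMQ {
        private final int[][] min;               // min[i][j] = minimum of a[i..j]

        NaiveRMQ(int[] a) {                      // O(n^2) precomputation
            int n = a.length;
            min = new int[n][n];
            for (int i = 0; i < n; i++) {
                min[i][i] = a[i];
                for (int j = i + 1; j < n; j++) {
                    min[i][j] = Math.min(min[i][j - 1], a[j]);
                }
            }
        }

        int query(int i, int j) {                // O(1) per lookup
            return min[i][j];
        }
    }

Whether that tradeoff is worth it depends on how many queries you expect to make against the same array.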
People usually care about the total time to get things done when they talk about complexity.
Thus, if getting to the result R requires you to perform steps A and B, then complexity(R) = complexity(A) + complexity(B). This works out to be O(n^2) in your particular example.
You have already noted that for O analysis, the fastest growing term dominates the overall complexity (or in other words, in a pipeline, the slowest module defines the throughput).
However, complexity analysis of A and B will typically be performed in isolation if they are disjoint.
In summary, it's the amount of time taken to get the results that counts, but you can (and usually do) reason about the individual steps independent of one another.
There are cases where you cannot specify only the slowest part of the pipeline. A simple example is BFS, with complexity O(V + E). Since E = O(V^2), it may be tempting to write the complexity of BFS as O(E), since E can dominate V. However, that would be incorrect, since there can be a graph with no edges! In that case, you still need to iterate over all the vertices.
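To make that concrete, here is a rough BFS sketch over an adjacency list (illustrative names only): the outer loop costs O(V) even if there are no edges at all, and the edge scan costs O(E) in total, which is exactly where O(V + E) comes from.

    import java.util.*;

    class BfsComplexityDemo {
        // Labels every vertex with the id of its connected component.
        static int[] componentIds(List<List<Integer>> adj) {
            int n = adj.size();
            int[] comp = new int[n];
            Arrays.fill(comp, -1);
            int id = 0;
            Deque<Integer> queue = new ArrayDeque<>();
            for (int s = 0; s < n; s++) {          // O(V): every vertex is considered once
                if (comp[s] != -1) continue;
                comp[s] = id;
                queue.add(s);
                while (!queue.isEmpty()) {
                    int u = queue.poll();
                    for (int v : adj.get(u)) {     // O(E): each edge is scanned a constant number of times
                        if (comp[v] == -1) {
                            comp[v] = id;
                            queue.add(v);
                        }
                    }
                }
                id++;
            }
            return comp;
        }
    }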
The point of O(...) notation is not to measure how fast the algorithm is working, because in many specific cases O(n) can be significantly slower than, say, O(n^3). (Imagine the algorithm which runs in 10^100 * n steps vs. the one which runs in n^3 / 2 steps.) If I tell you that my algorithm runs in O(n^2) time, it tells you nothing about how long it will take for n = 1000.
The point of O(...) is to specify how the algorithm behaves when the input size grows. If I tell you that my algorithm runs in O(n^2) time, and it takes 1 second to run for n = 500, then you should expect roughly 4 seconds for n = 1000, not 1.5 and not 40.
So, to answer your question -- no, the algorithm will not be O(n), it will be O(n^2), because if I double the input size the time will be multiplied by 4, not by 2.
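Put differently, if the running time is roughly c*n^2 for some constant c, then

    time(2n) / time(n) = c*(2n)^2 / (c*n^2) = 4

so the constant c cancels out and only the growth rate is visible.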

Which complexity is better?

Assume that a graph has N nodes and M edges, and the total number of iterations is k.
(k is a constant integer, larger than 1, independent of N and M)
Let D=M/N be the average degree of the graph.
I have two graph-based iterative search algorithms.
The first algorithm has a time complexity of O(D^{2k}).
The second algorithm has a time complexity of O(k*D*N).
Based on their Big O time complexity, which one is better?
Some told me that the first one is better because the number of nodes N in a graph is usually much larger than D in the real world.
Others said that the second one is better because the first is exponential in k, while the second is only linear in k.
Summary
Neither of your two O's dominates the other, so the right approach is to choose the algorithm based on the inputs.
O Domination
The first is better when D < 1 (sparse graphs) or similarly small.
The second is better when D is relatively large.
Algorithm Selection
The important parameter is not just the O but the actual constant in front of it.
E.g., an O(n) algorithm which is actually 100000*n is worse than O(n^2) which is just n^2 when n<100000.
So, given the graph and the desired iteration count k, you need to estimate the expected performance of each algorithm and choose the best one.
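As a rough sketch of that selection step (the constants c1 and c2 are hypothetical and would come from measuring your own implementations), you could compare the two estimated costs directly:

    class AlgorithmChooser {
        // true if the first algorithm's estimated cost c1 * D^(2k)
        // is at most the second algorithm's estimated cost c2 * k * D * N.
        static boolean preferFirst(double c1, double c2, double D, double N, int k) {
            double costFirst = c1 * Math.pow(D, 2.0 * k);
            double costSecond = c2 * k * D * N;
            return costFirst <= costSecond;
        }
    }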
Big-O notation describes how a function grows as its arguments grow. So if you want to estimate the growth of the algorithm's time consumption, you should first estimate how D and N will grow. That requires some additional information from your domain.
Let's assume that N is going to grow anyway. For D, you have several choices:
D remains constant - the first algorithm is definitely better
D grows proportionally to N - the second algorithm is better
More generally: if D grows more slowly than N^(1/(2k-1)), you should select the first algorithm; otherwise, the second one.
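To see where that threshold comes from, compare the two costs directly (remember that k is a constant here):

    D^(2k) < k*D*N   <=>   D^(2k-1) < k*N   <=>   D < (k*N)^(1/(2k-1))

so the first algorithm wins exactly when D stays below roughly N^(1/(2k-1)).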
For every fixed D, D^(2k) is a constant, so the first algorithm will beat the second if M is large enough. However, what is large enough depends on D. If D isn't constant or limited, the two complexities cannot be compared.
In practice, you would implement both algorithms, find a good approximation for their actual speed, and depending on your values pick the one that will be faster.

Time complexity of one algorithm cascaded into another?

I am working with random forest for a supervised classification problem, and I am using the k-means clustering algorithm to split the data at each node. I am trying to calculate the time complexity for the algorithm. From what I understand, the time complexity for k-means is
O(n · K · I · d )
where
n is the number of points,
K is the number of clusters,
I is the number of iterations, and
d is the number of attributes.
K, I and d are constants or have an upper bound, and n is much larger than these three, so I suppose the complexity is just O(n).
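To make those factors concrete, here is a rough sketch of the assignment step (illustrative names only; the centroid update is omitted since it does not change the bound):

    // I iterations x n points x K clusters x d attributes per distance computation
    // gives the O(n * K * I * d) cost quoted above.
    static int[] kMeansAssign(double[][] points, double[][] centroids, int iterations) {
        int n = points.length, K = centroids.length, d = points[0].length;
        int[] label = new int[n];
        for (int it = 0; it < iterations; it++) {        // I
            for (int p = 0; p < n; p++) {                // n
                double best = Double.MAX_VALUE;
                for (int c = 0; c < K; c++) {            // K
                    double dist = 0;
                    for (int j = 0; j < d; j++) {        // d
                        double diff = points[p][j] - centroids[c][j];
                        dist += diff * diff;
                    }
                    if (dist < best) { best = dist; label[p] = c; }
                }
            }
        }
        return label;
    }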
The random forest, on the other hand, is a divide-and-conquer approach, so for n instances the complexity is O(n · log n), though I am not sure about this; correct me if I am wrong.
To get the complexity of the algorithm, do I just add these two things?
In this case, you don't add the values together. If you have a divide-and-conquer algorithm, the runtime is determined by a combination of
The number of subproblems made per call,
The sizes of those subproblems, and
The amount of work done per problem.
Changing any one of these parameters can wildly impact the overall runtime of the function. If you increase the number of subproblems made per call by even a small amount, you exponentially increase the total number of subproblems, which can have a large impact overall. Similarly, if you increase the work done per level, then because there are so many subproblems, the runtime can swing wildly. Check out the Master Theorem as an example of how to determine the runtime based on these quantities.
In your case, you are beginning with a divide-and-conquer algorithm where all you know is that the runtime is O(n log n) and are adding in a step that does O(n) work per level. Just knowing this, I don't believe it's possible to determine what the runtime will be. If, on the other hand, you make the assumption that
The algorithm always splits the input into two smaller pieces,
The algorithm recursively processes those two pieces independently, and
The algorithm uses your O(n) algorithm to determine which split to make
Then you can conclude that the runtime is O(n log n), since this is the solution to the recurrence given by the Master Theorem.
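As a rough sketch of that structure (illustrative names; the linear scan stands in for your O(n) split step), the recurrence is T(n) = 2*T(n/2) + O(n), which the Master Theorem solves to O(n log n):

    class SplitDemo {
        static void build(double[] data, int lo, int hi) {
            if (hi - lo <= 1) return;                      // base case: a single element
            double sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];  // O(n) work per call, standing in for the split step
            int mid = lo + (hi - lo) / 2;                  // always two subproblems of half the size
            build(data, lo, mid);                          // processed independently...
            build(data, mid, hi);                          // ...so T(n) = 2*T(n/2) + O(n) = O(n log n)
        }
    }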
Without more information about the internal workings of the algorithm, though, I can't say for certain.
Hope this helps!

Analyze span - two parallel for

If I have an algorithm with two parallel for loops and I want to analyze its span, what do I have to do?
For example
parallel for a=2 to n
parallel for b=1 to a-1
My guess is the span is Theta(lg(n) * lg(n)), but I'm not sure. :) Can someone help or give a hint?
I am assuming you want the time complexity of this algorithm. Since time complexity is NOT how much time the algorithm actually takes, but rather how many operations are needed for it (a quote supporting this claim follows), the time complexity of this algorithm is O(n^2), just as it would be if it were not parallel.
From the Wikipedia page: "Time complexity is commonly estimated by counting the number of elementary operations performed by the algorithm, where an elementary operation takes a fixed amount of time to perform."
Why don't we care about the fact that the algorithm is parallel?
Usually, our cluster size is fixed and does not depend on the input size n. Let the cluster size be k, meaning we can perform k operations simultaneously, and assume for simplicity that the algorithm performs exactly n^2 operations.
If we have an input of size 100, it will 'take' (100^2)/k time; if it is of size 1,000, it will take (1000^2)/k; and for n elements, (n^2)/k. As you can see, k is a constant, so the fact that the program is parallel does not change the complexity. Being able to do k operations at once is no better (and can even be worse, but that's for another thread) than a computer that is k times faster.
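In other words, doubling the input size gives

    ((2n)^2 / k) / (n^2 / k) = 4

independent of k, so parallelism changes the constant factor but not the growth rate.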

How to know when Big O is Logarithmic?

My question arises from the post "Plain English Explanation of Big O". I don't know the exact meaning of logarithmic complexity. I know that I can make a regression between the time and the number of operations and calculate the X-squared value, and so determine the complexity empirically. However, I want to know a method to determine it quickly on paper.
How do you determine logarithmic complexity? Are there some good benchmarks?
Not rigorous, but if you have an algorithm that is essentially dividing the work needed to be done in half on each iteration, then you have logarithmic complexity. The classic example is binary search.
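For reference, a minimal binary search sketch: the candidate range is halved on every iteration, so the loop runs O(log n) times.

    // Returns the index of target in the sorted array a, or -1 if it is absent.
    static int binarySearch(int[] a, int target) {
        int lo = 0, hi = a.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;          // midpoint without integer overflow
            if (a[mid] == target) return mid;
            if (a[mid] < target) lo = mid + 1;     // discard the lower half
            else hi = mid - 1;                     // discard the upper half
        }
        return -1;
    }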
Not sure if this is what you mean, but... logarithmic complexity usually arises when you're working with a spread-out data structure like a balanced binary tree, which contains 1 node at the root, 2 children, 4 grandchildren, 8 great-grandchildren, etc. Basically at each level the number of nodes gets multiplied by some factor (2) but still only one of those is involved in the iteration. Or as another example, a loop in which the index doubles at each step:
for (int i = 1; i < N; i *= 2) { ... }
Things like that are the signatures of logarithmic complexity.
Master theorem usually works.
If you just want to know about logarithmic Big Oh, be on the lookout for when your data is cut in half each step of the recurrence.
This is because if each step processes data half as big as the step before it, the sizes form a geometric series, and it takes only about log2(n) halvings to get down to a constant size.
Here is another way of saying it.
Suppose your algorithm is linear in the number of digits in the size of the problem. So, perhaps you have a new algorithm to factor a large number, that you can show to be linear in the number of digits. A 20 digit number thereby takes twice as long to factor as a 10 digit number using your algorithm. This would have log complexity. (And it would be worth something for the inventor.)
Bisection has the same behavior. It takes roughly 10 bisection steps to cut the interval length by a factor of 1024 = 2^10, but only 20 steps will cut the interval by a factor of 2^20.
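As a sketch (assuming f changes sign on [lo, hi]): each iteration halves the bracketing interval, so reaching a width of eps takes about log2((hi - lo) / eps) steps.

    static double bisect(java.util.function.DoubleUnaryOperator f,
                         double lo, double hi, double eps) {
        while (hi - lo > eps) {
            double mid = (lo + hi) / 2;
            if (f.applyAsDouble(lo) * f.applyAsDouble(mid) <= 0) {
                hi = mid;                          // sign change in [lo, mid]: keep the left half
            } else {
                lo = mid;                          // otherwise the root is in [mid, hi]
            }
        }
        return (lo + hi) / 2;
    }

For example, bisect(x -> x * x - 2, 0, 2, 1e-9) homes in on sqrt(2) in about 31 halvings, since log2(2 / 1e-9) is roughly 31.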
Log complexity does not always mean an algorithm is fast on all problems. The constant factor in front of the O(log(n)) may be large. So your algorithm may be terrible on small problems, not becoming useful until the problem size is appreciably large, at which point other algorithms die an exponential (or polynomial) death.
