Time complexity of one algorithm cascaded into another? - algorithm

I am working with random forest for a supervised classification problem, and I am using the k-means clustering algorithm to split the data at each node. I am trying to calculate the time complexity of the algorithm. From what I understand, the time complexity for k-means is
O(n · K · I · d)
where
n is the number of points,
K is the number of clusters,
I is the number of iterations, and
d is the number of attributes.
K, I and d are constants or have an upper bound, and n is much larger than these three, so I suppose the complexity is just O(n).
The random forest, on the other hand, is a divide-and-conquer approach, so for n instances the complexity should be O(n · log n), though I am not sure about this; correct me if I am wrong.
To get the complexity of the overall algorithm, do I just add these two things together?

In this case, you don't add the values together. If you have a divide-and-conquer algorithm, the runtime is determined by a combination of
The number of subproblems made per call,
The sizes of those subproblems, and
The amount of work done per problem.
Changing any one of these parameters can wildly impact the overall runtime of the function. If you increase the number of subproblems made per call by even a small amount, the total number of subproblems grows exponentially, which can have a large impact overall. Similarly, if you increase the work done per level, the runtime can swing wildly, because there are so many subproblems. Check out the Master Theorem as an example of how to determine the runtime based on these quantities.
In your case, you are beginning with a divide-and-conquer algorithm where all you know is that the runtime is O(n log n) and are adding in a step that does O(n) work per level. Just knowing this, I don't believe it's possible to determine what the runtime will be. If, on the other hand, you make the assumption that
The algorithm always splits the input into two smaller pieces,
The algorithm recursively processes those two pieces independently, and
The algorithm uses your O(n) algorithm to determine which split to make
Then you can conclude that the runtime is O(n log n), since this is the solution to the recurrence given by the Master Theorem.
Without more information about the internal workings of the algorithm, though, I can't say for certain.
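If you do make those three assumptions, though, here is a rough sketch of what the recurrence looks like in code. The split_in_two helper below is hypothetical (a stand-in for one bounded-iteration 2-means pass, not your actual splitter); the point is only that charging O(n) work per node of a balanced binary recursion adds up to roughly n log n total work.

```python
import math

def split_in_two(points):
    # Hypothetical stand-in for an O(n) splitting step (e.g. one
    # bounded-iteration 2-means pass). A median split is enough to
    # model "linear work per call, two child subproblems".
    points = sorted(points)
    mid = len(points) // 2
    return points[:mid], points[mid:]

def build_tree(points, work_counter):
    # Charge O(n) split work at this node, then recurse on both halves.
    n = len(points)
    work_counter[0] += n
    if n <= 1:
        return
    left, right = split_in_two(points)
    build_tree(left, work_counter)
    build_tree(right, work_counter)

for n in (1_000, 2_000, 4_000):
    counter = [0]
    build_tree(list(range(n)), counter)
    # Total charged work divided by n*log2(n) stays roughly constant,
    # which is the O(n log n) behaviour predicted by the Master Theorem.
    print(n, counter[0], round(counter[0] / (n * math.log2(n)), 2))
```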
Hope this helps!

Related

Importance of number of Swaps in analysis of the sorting algorithm - Three-Way Partitioning

I'm developing a three-way partitioning algorithm for sorting data. I have observed that, for a large data set with few distinct elements, the number of comparisons made by the algorithm is lower than in the traditional version of quicksort.
However, the number of swaps is higher than in the normal version of quicksort.
In order to analyse the algorithm, I need to understand the impact of the number of swaps and comparisons on overall performance.
When analysing the efficiency of an algorithm by static analysis - that is, without actually running the code and measuring the time it takes - we are usually concerned only with the asymptotic complexity of the algorithm, which can be described using big O notation and related asymptotic notations.
One important fact about big O notation is that O(f + g) is either O(f) or O(g), whichever is larger. So if f measures how many comparisons your algorithm makes, and g measures how many swaps, then whichever is larger will be the important one. Both have an impact on the algorithm's actual running time, but only the larger one has an impact on the asymptotic running time.
Most sorting algorithms do more comparisons than swaps, so normally the number of comparisons is what matters. But if your algorithm does more swaps than comparisons, then the number of swaps is what matters for your algorithm.
Of course, if there are other operations your algorithm does, like addition, multiplication, or reading and allocating memory, then you should consider these too. Whichever operation your algorithm does most often is what determines its asymptotic running time.
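If you want a rough empirical feel for which count dominates on your own data, you can instrument a generic three-way (Dutch national flag) quicksort and tally both operations. This is only a sketch, not your implementation, and the comparison counter charges one comparison per element inspected, which is a slight simplification.

```python
import random

def three_way_quicksort(a, lo=0, hi=None, stats=None):
    """Three-way quicksort that tallies comparisons and swaps in `stats`."""
    if stats is None:
        stats = {"comparisons": 0, "swaps": 0}
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return stats
    pivot = a[random.randint(lo, hi)]
    lt, i, gt = lo, lo, hi
    while i <= gt:
        stats["comparisons"] += 1            # one a[i]-vs-pivot comparison (approx.)
        if a[i] < pivot:
            a[lt], a[i] = a[i], a[lt]        # swap into the < region
            stats["swaps"] += 1
            lt += 1
            i += 1
        elif a[i] > pivot:
            a[i], a[gt] = a[gt], a[i]        # swap into the > region
            stats["swaps"] += 1
            gt -= 1
        else:                                # equal keys stay in the middle
            i += 1
    three_way_quicksort(a, lo, lt - 1, stats)
    three_way_quicksort(a, gt + 1, hi, stats)
    return stats

data = [random.randint(0, 5) for _ in range(10_000)]  # few distinct keys
print(three_way_quicksort(data))
```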

How is pre-computation handled by complexity notation?

Suppose I have an algorithm that runs in O(n) for every input of size n, but only after a pre-computation step of O(n^2) for that given size n. Is the algorithm considered O(n) still, with O(n^2) amortized? Or does big O only consider one "run" of the algorithm at size n, and so the pre-computation step is included in the notation, making the true notation O(n+n^2) or O(n^2)?
It's not uncommon to see this accounted for by explicitly separating out the costs into two different pieces. For example, in the range minimum query problem, it's common to see people talk about things like an ⟨O(n²), O(1)⟩-time solution to the problem, where the O(n²) denotes the precomputation cost and the O(1) denotes the lookup cost. You also see this with string algorithms sometimes: a suffix tree provides an O(m)-preprocessing-time, O(n+z)-query-time solution to string searching, while Aho-Corasick string matching offers an O(n)-preprocessing-time, O(m+z)-query-time solution.
The reason for doing so is that the tradeoffs involved here really depend on the use case. It lets you quantitatively measure how many queries you're going to have to make before the preprocessing time starts to be worth it.
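As a minimal illustration of the ⟨O(n²), O(1)⟩ framing for range minimum queries, the sketch below precomputes every range answer up front so that each query is a plain table lookup (illustrative only, not production code):

```python
class PrecomputedRMQ:
    """<O(n^2), O(1)> range minimum query: precompute all answers,
    then answer each query with a single dictionary lookup."""

    def __init__(self, values):
        n = len(values)
        self.best = {}
        for i in range(n):                      # O(n^2) precomputation
            running_min = values[i]
            for j in range(i, n):
                running_min = min(running_min, values[j])
                self.best[(i, j)] = running_min

    def query(self, i, j):
        """Minimum of values[i..j], inclusive, in O(1)."""
        return self.best[(i, j)]

rmq = PrecomputedRMQ([3, 1, 4, 1, 5, 9, 2, 6])
print(rmq.query(2, 6))   # -> 1
```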
People usually care about the total time to get things done when they are talking about complexity etc.
Thus, if getting to the result R requires you to perform steps A and B, then complexity(R) = complexity(A) + complexity(B). This works out to be O(n^2) in your particular example.
You have already noted that for O analysis, the fastest growing term dominates the overall complexity (or in other words, in a pipeline, the slowest module defines the throughput).
However, the complexity analysis of A and B will typically be performed in isolation if they are disjoint.
In summary, it's the amount of time taken to get the results that counts, but you can (and usually do) reason about the individual steps independent of one another.
There are cases, though, where you cannot just point to the slowest part of the pipeline. A simple example is BFS, whose complexity is O(V + E). Since E = O(V^2), it may be tempting to write the complexity of BFS as just O(E) (on the assumption that E dominates V). However, that would be incorrect, since there can be a graph with no edges at all! In that case, you still need to iterate over all the vertices.
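Here is a standard adjacency-list BFS sketch that makes the O(V + E) bound visible: the outer loop over vertices costs O(V) even when the graph has no edges, and the neighbour loops add up to O(E) over the whole run.

```python
from collections import deque

def bfs_all_components(adj):
    """BFS over every component of a graph given as {vertex: [neighbors]}.
    Touches each vertex once and each edge a constant number of times: O(V + E)."""
    visited = set()
    order = []
    for start in adj:                      # O(V): runs even if there are no edges
        if start in visited:
            continue
        queue = deque([start])
        visited.add(start)
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:               # O(E) in total across the whole run
                if w not in visited:
                    visited.add(w)
                    queue.append(w)
    return order

# A graph with vertices but no edges still costs O(V) to traverse.
print(bfs_all_components({0: [], 1: [], 2: []}))        # -> [0, 1, 2]
print(bfs_all_components({0: [1], 1: [0, 2], 2: [1]}))  # -> [0, 1, 2]
```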
The point of O(...) notation is not to measure how fast the algorithm is, because in many specific cases an O(n) algorithm can be significantly slower than, say, an O(n^3) one. (Imagine an algorithm which runs in 10^100 · n steps vs. one which runs in n^3 / 2 steps.) If I tell you that my algorithm runs in O(n^2) time, that tells you nothing about how long it will take for n = 1000.
The point of O(...) is to specify how the algorithm behaves as the input size grows. If I tell you that my algorithm runs in O(n^2) time, and it takes 1 second to run for n = 500, then you'd expect it to take roughly 4 seconds for n = 1000, not 1.5 and not 40.
So, to answer your question -- no, the algorithm will not be O(n), it will be O(n^2), because if I double the input size the time will be multiplied by 4, not by 2.
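You can see that doubling behaviour directly by counting the basic steps of a toy quadratic loop; the numbers below are operation counts, not real timings, and the loop stands in for any O(n^2) routine.

```python
def quadratic_steps(n):
    """Count the inner-loop steps of a toy all-pairs computation."""
    steps = 0
    for i in range(n):
        for j in range(n):
            steps += 1
    return steps

s500, s1000 = quadratic_steps(500), quadratic_steps(1000)
print(s500, s1000, s1000 / s500)   # ratio is 4.0: doubling n quadruples the work
```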

Can O(k * n) be considered as linear complexity (O(n))?

When talking about complexity in general, things like O(3n) tend to be simplified to O(n) and so on. This is theory, though; how does complexity work out in practice? Can O(3n) also be treated as O(n)?
For example, if a task requires the solution to be O(n), and in our code we have two linear searches of an array, which is O(n) + O(n), would that solution be considered linear complexity in practice, or not fast enough?
Note that this question is asking about real implementations, not theory. I'm already aware that O(n) + O(n) is simplified to O(n).
Bear in mind that O(f(n)) does not give you the amount of real-world time that something takes: only the rate of growth as n grows. O(n) only indicates that if n doubles, the runtime doubles as well, which lumps functions together that take one second per iteration or one millennium per iteration.
For this reason, O(n) + O(n) and O(2n) are both equivalent to O(n), which is the set of functions of linear complexity, and which should be sufficient for your purposes.
Though an algorithm that must handle arbitrarily large inputs will usually want the best achievable growth rate O(f(n)), an algorithm that grows faster (e.g. O(n²)) may still be faster in practice, especially when the data set size n is limited or fixed. Still, learning to reason about O(f(n)) representations can help you compose algorithms with a predictable upper bound that is optimal for your use case.
Yes, as long as k is a constant, you can write O(kn) = O(n).
The intuition is that the constant k doesn't increase with the size of the input and is eventually dwarfed by n, so it doesn't have much influence on the overall growth rate.
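For example, a fixed number of linear searches over the same array is still O(n). The contains_all helper below is purely illustrative, not from your code; it just makes the point explicit.

```python
def contains_all(haystack, needles):
    """k = len(needles) independent linear searches over the same list.
    For a fixed k this is O(k * n) = O(n)."""
    return all(any(x == needle for x in haystack) for needle in needles)

data = list(range(1_000_000))
print(contains_all(data, [3, 999_999]))   # two O(n) scans -> still O(n) overall
```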
Yes - as long as the number k of array searches is not affected by the input size, even for inputs that are too big to be possible in practice, O(kn) = O(n). The main idea of the O notation is to emphasize how the computation time increases with the size of the input, and so constant factors that stay the same no matter how big the input is aren't of interest.
An example of an incorrect way to apply this is to say that you can perform selection sort in linear time because you can only fit about one billion numbers in memory, and so selection sort is merely one billion array searches. However, with an ideal computer with infinite memory, your algorithm would not be able to handle more than one billion numbers, and so it is not a correct sorting algorithm (algorithms must be able to handle arbitrarily large inputs unless you specify a limit as a part of the problem statement); it is merely a correct algorithm for sorting up to one billion numbers.
(As a matter of fact, once you put a limit on the input size, most algorithms will become constant-time because for all inputs within your limit, the algorithm will solve it using at most the amount of time that is required for the biggest / most difficult input.)

Compare the complexity of two algorithms given steps

Assume you had a data set of size n and two algorithms that processed that data
set in the same way. Algorithm A took 10 steps to process each item in the data set. Algorithm B processed each item in 100 steps. What would the complexity
be of these two algorithms?
I have drawn from the question that algorithm A completes the processing of each item with 1/10th the complexity of algorithm B, and using the graph provided in the accepted answer to the question "What is a plain English explanation of 'Big O' notation?", I am concluding that algorithm B has a complexity of O(n^2) and algorithm A a complexity of O(n), but I am struggling to make conclusions beyond that without the implementation.
You need more than one data point before you can start making any conclusions about time complexity. The difference of 10 steps and 100 steps between Algorithm A and Algorithm B could be for many different reasons:
Additive Constant difference: Algorithm A is always 90 steps faster than Algorithm B no matter the input. In this case, both algorithms would have the same time complexity.
Scalar Multiplicative difference: Algorithm A is always 10 times faster than Algorithm B no matter the input. In this case, both algorithms would have the same time complexity (a short sketch of this case follows the list).
The case that you brought up, where B is O(n^2) and A is O(n).
Many, many other possibilities.
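To illustrate the scalar-multiplicative case above: both step counts below grow linearly with n, just with different slopes, so both algorithms are O(n). The 10 and 100 are taken from the question; the code is purely illustrative.

```python
def total_steps(n, steps_per_item):
    """Total work when each of n items costs a fixed number of steps."""
    return n * steps_per_item

for n in (10, 100, 1_000):
    # Algorithm A: 10 steps per item; Algorithm B: 100 steps per item.
    print(n, total_steps(n, 10), total_steps(n, 100))
# Both columns scale linearly in n: same O(n) complexity, different constants.
```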

Which complexity is better?

Assume that a graph has N nodes and M edges, and the total number of iterations is k.
(k is a constant integer, larger than 1, independent of N and M)
Let D=M/N be the average degree of the graph.
I have two graph-based iterative search algorithms.
The first algorithm runs in O(D^{2k}) time.
The second algorithm runs in O(k*D*N) time.
Based on their Big O time complexity, which one is better?
Some told me that the first one is better because the number of nodes N in a graph is usually much larger than the average degree D in the real world.
Others said that the second one is better because the first is exponential in k, whereas the second is only linear in k.
Summary
Neither of your two O's dominates the other, so the right approach is to choose the algorithm based on the inputs.
O Domination
The first is better when D is small, e.g. sparse graphs with D < 1 or thereabouts.
The second is better when D is relatively large.
Algorithm Selection
The important parameter is not just the O but the actual constant in front of it.
E.g., an O(n) algorithm which is actually 100000*n is worse than O(n^2) which is just n^2 when n<100000.
So, given the graph and the desired iteration count k, you need to estimate the expected performance of each algorithm and choose the better one.
Big-O notation describes how a function grows as its arguments grow. So if you want to estimate the growth of the algorithm's running time, you should first estimate how D and N will grow. That requires some additional information from your domain.
Let's assume that N is going to grow anyway. For D you have several choices:
D remains constant - the first algorithm is definitely better
D grows proportionally to N - the second algorithm is better
More generally: if D grows more slowly than N^(1/(2k-1)), you should select the first algorithm; otherwise, the second one.
For every fixed D, D^(2k) is a constant, so the first algorithm will beat the second if M is large enough. However, what is large enough depends on D. If D isn't constant or limited, the two complexities cannot be compared.
In practice, you would implement both algorithms, find a good approximation for their actual speed, and depending on your values pick the one that will be faster.
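One practical way to act on that advice is to estimate both cost formulas for your actual graph before committing. The function below is purely illustrative and treats the big-O expressions as rough operation counts, ignoring the hidden constant factors.

```python
def cheaper_algorithm(num_nodes, num_edges, k):
    """Compare the rough cost estimates D**(2k) vs. k*D*N and report
    which algorithm looks cheaper, ignoring hidden constant factors."""
    D = num_edges / num_nodes            # average degree D = M / N
    cost_first = D ** (2 * k)
    cost_second = k * D * num_nodes
    return ("first" if cost_first < cost_second else "second",
            cost_first, cost_second)

# Sparse graph: small D favours the first algorithm.
print(cheaper_algorithm(num_nodes=1_000_000, num_edges=2_000_000, k=3))
# Denser graph: larger D favours the second algorithm.
print(cheaper_algorithm(num_nodes=10_000, num_edges=5_000_000, k=3))
```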
