Predicting Spark performance/scalability on cluster? - performance

Let's assume you have written an algorithm in Spark and you can evaluate its performance using 1 .. X cores on data sets of size N running in local mode. How would you approach questions like these:
What is the runtime running on a cluster with Y nodes and data size M >> N?
What is the minimum possible runtime for a data set of size M >> N using an arbitrary number of nodes?
Clearly, this is influenced by countless factors, and giving a precise estimate is almost impossible. But how would you come up with an educated guess? Running in local mode mainly allows you to measure CPU usage. Is there a rule of thumb to account for disk and network load in shuffles as well? Are there even ways to simulate cluster performance?

The data load can be estimated as O(n).
The algorithm can be estimated stage by stage; the whole algorithm is the accumulation of all its stages. Note that each stage processes a different amount of data, which is related to the size of the original input.
If the whole algorithm is O(n), then it's O(n).
If the whole algorithm is O(n log n), then it's O(n log n).
If the whole algorithm is O(n^2), then it needs to be improved before it can handle M >> N.
Assume:
There is no huge shuffle, or the network is fast enough.
Each node has the same configuration.
The total time spent is T for data size N on a single node.
The number of nodes is X.
Then, if the algorithm is O(n), the estimated time is T * M / N / X.
If the algorithm is O(n log n), the estimated time is T * M / N / X * log(M/N).
Edit
If there is one big shuffle, it is O(n) with respect to bandwidth; the extra time added is roughly dataSize(M) / bandwidth.
If there are many big shuffles, consider improving the algorithm instead.
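As a rough coding sketch of the estimates above (the function name, its parameters, and the example numbers are my own assumptions, not measurements):

    import math

    def estimate_cluster_runtime(t_local, n_local, m_target, x_nodes,
                                 complexity="n", shuffle_bytes=0, bandwidth=None):
        """Extrapolate a local runtime t_local (seconds), measured on data of size
        n_local, to data of size m_target spread over x_nodes identical nodes."""
        scale = m_target / n_local
        if complexity == "n":
            t = t_local * scale / x_nodes
        elif complexity == "n log n":
            t = t_local * scale / x_nodes * math.log(scale)
        else:
            raise ValueError("only O(n) and O(n log n) are covered by this estimate")
        # One big shuffle adds roughly dataSize / bandwidth on top of the compute time.
        if shuffle_bytes and bandwidth:
            t += shuffle_bytes / bandwidth
        return t

    # e.g. 120 s locally on 10 GB, extrapolated to 1 TB on 20 nodes,
    # with one 1 TB shuffle over a 1 GB/s network:
    print(estimate_cluster_runtime(120, 10e9, 1e12, 20,
                                   complexity="n log n",
                                   shuffle_bytes=1e12, bandwidth=1e9))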

Related

Is anything less than n considered log n?

Suppose there are two solutions to a problem:
One executes n/2 times, i.e. if n = 100 then it executes 50 times.
The other executes sqrt(n) times, i.e. if n = 100 then it executes 10 times.
Can both of these solutions be called O(log N)?
If so, there is still a huge difference between sqrt(N) and N/2.
If we can't call them O(log N), can we say they are O(N)?
But the problem is the difference in growth rate between the two. Which complexity class do these solutions fall under?
Please help me on this.
Consider the three cases.
Executes n/2 times. That means each time we increase n by a factor of 100, the execution time increases by a factor of 100.
Executes sqrt(n) times. That means each time we increase n by a factor of 100, the execution time increases by a factor of 10.
Executes log(n) times. That means each time we increase n by a factor of 100, the execution time increases by a constant amount.
No, these three things aren't even close to the same. The first is much worse than the second. The third is much better than the second.
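A quick numeric check of those three growth patterns (my own throwaway sketch, not part of the original answer):

    import math

    # When n grows by a factor of 100: n/2 grows 100x, sqrt(n) grows 10x,
    # and log2(n) only grows by a constant amount.
    for n in (100, 10_000, 1_000_000):
        print(n, n // 2, int(math.sqrt(n)), round(math.log2(n), 1))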
Neither of them is O(log n).
Here is an example of O(log n): the binary search algorithm.
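For reference, a minimal binary search sketch (my own illustration of the O(log n) behaviour mentioned above):

    def binary_search(sorted_list, target):
        """Classic O(log n) search: halve the candidate range on every step."""
        lo, hi = 0, len(sorted_list) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if sorted_list[mid] == target:
                return mid
            elif sorted_list[mid] < target:
                lo = mid + 1
            else:
                hi = mid - 1
        return -1  # not found

    print(binary_search([1, 3, 5, 7, 9, 11], 7))  # prints 3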
The best algorithm is the best algorithm for the data that you have. If you don't know what data you have, consider massively large amounts of data, say n = 1 billion. Would you choose the solution that does about 31,623 operations, or the one that does about 500,000,000? Graph the comparison and find where your data size lands.
If your dataset were n = 4, the two would be identical; and if you get into the details, the sqrt(n) solution may actually take longer due to the operations it performs.
You can have O(1) which is the fastest. One such example is looking up in a hash map, but your memory size may suffer. So you should consider space constraints as well as time constraints.
You are also misunderstanding and overanalyzing complexity classification. O(n) algorithms are not algorithms that execute with exactly n operations; any constant multiplier does not affect the order of the classification. What matters is the growth of the number of operations as the problem grows. Consider two search algorithms.
A) Scan a sorted list sequentially from index 0 to (n-1) to find the number.
B) Scan a sorted list from index 0 to (n-1), skipping by 2 and backtracking if necessary.
Clearly A takes at most n operations, and B takes about n/2 + 1 operations, yet they are both O(n). You can say algorithm B is faster, but I might run algorithm A on a machine that is twice as fast. Complexity is a general classification; one isn't supposed to be overly finicky about the details of the operations.
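A rough sketch of the two scans described above (the function names are mine; the point is only that both are linear):

    def scan_a(sorted_list, target):
        """Algorithm A: check every index 0..n-1 -- at most n comparisons, O(n)."""
        for i, value in enumerate(sorted_list):
            if value == target:
                return i
        return -1

    def scan_b(sorted_list, target):
        """Algorithm B: check every other index and backtrack once if we overshoot
        -- roughly n/2 + 1 comparisons, but still O(n)."""
        n = len(sorted_list)
        for i in range(0, n, 2):
            if sorted_list[i] == target:
                return i
            if sorted_list[i] > target:
                # overshot: the target, if present, is at the skipped index
                return i - 1 if i > 0 and sorted_list[i - 1] == target else -1
        if n % 2 == 0 and n > 0 and sorted_list[n - 1] == target:
            return n - 1
        return -1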
If you were trying to develop a better algorithm, it would be much more useful to search for one with a better complexity class, than one with slightly fewer operations.

Finding the constant part of an algorithm's running time

I have an implementation of an algorithm that runs in O(n log n), for n=10^7 the algorithm takes 570 ms. Does anyone know how to find the constant part (C) of my algorithms running time? I would like to have this so I can calculate how long the algorithm 'should' take for an arbitrary input size.
I don't think you can calculate it exactly, but if you know for sure that the complexity is O(n log n), then I would recommend a simple proportion as an estimate of your run time:
10^10 log 10^10     unknown run time
---------------  =  ----------------
 10^7 log 10^7           570 ms
In this case, that should be about 1428.6 * 570 ms =~ 814 sec.
It's not exactly mathematically correct, but if you don't have multiple data points to try to fit to a curve to figure out the various constants, it's not an unreasonable place to start.
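For example, the same proportion in a few lines of Python (the base of the logarithm cancels in the ratio, so any base gives the same result):

    import math

    t_measured_ms = 570                      # measured runtime for n = 10**7
    n1, n2 = 10**7, 10**10
    ratio = (n2 * math.log(n2)) / (n1 * math.log(n1))
    print(ratio, ratio * t_measured_ms / 1000)   # ~1428.6, ~814 seconds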
If you know that the asymptotic complexity of an algorithm is O(n log n), then from a single data point you can't exactly determine the runtime on future inputs. Imagine, for example, that you have an algorithm that you know runs in time O(n), and on an input of size N the runtime is T. You can't exactly predict the runtime on an input of size 2N, because it's unclear how much of T is explained by the slope of the linear function and how much by the intercept.
If you assume that N is "large enough" that most of the runtime T comes from the slope, then you can make a reasonable estimate of the runtime of the algorithm on a future input. Specifically, since the function grows linearly, you can assume that if you multiply the size of the input by some constant k, then the runtime ought to be Tk. In your case, the function n log n grows mostly linearly. Since log grows very slowly, for large enough n its growth is extremely flat. Consequently, if you think that N is "large enough," you can estimate the runtime on an input of size kN by just scaling the runtime on size N by a factor of k.
To be much more accurate, you could also try gathering more data points about the runtime and doing a regression. In the linear case, if you know two accurate data points, you can recover the actual linear function and then extrapolate to get very accurate runtime predictions. With something of the form n log n, it's probably good to assume the runtime has the form c0·n·log n + c1·n + c2. If you gather enough data points, you could plug this into Excel (or any least-squares fitter) and recover the coefficients, from which you could extrapolate very accurately.
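Such a fit could look like the following sketch (the timing numbers are made up for illustration; only the n = 10^7 / 570 ms point comes from the question):

    import numpy as np

    # Hypothetical measurements: (input size, runtime in ms); only the last point
    # is from the question, the rest are invented for the sake of the example.
    sizes = np.array([1e6, 2e6, 5e6, 1e7])
    times = np.array([48.0, 102.0, 270.0, 570.0])

    # Least-squares fit of the model  t(n) = c0*n*log(n) + c1*n + c2
    A = np.column_stack([sizes * np.log(sizes), sizes, np.ones_like(sizes)])
    (c0, c1, c2), *_ = np.linalg.lstsq(A, times, rcond=None)

    def predict(n):
        """Extrapolate the fitted model to a new input size."""
        return c0 * n * np.log(n) + c1 * n + c2

    print(predict(1e10))  # estimated runtime in ms for n = 10**10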
Hope this helps!

Time complexity of one algorithm cascaded into another?

I am working with random forest for a supervised classification problem, and I am using the k-means clustering algorithm to split the data at each node. I am trying to calculate the time complexity of the algorithm. From what I understand, the time complexity of k-means is
O(n · K · I · d )
where
n is the number of points,
K is the number of clusters,
I is the number of iterations, and
d is the number of attributes.
K, I, and d are constants or have an upper bound, and n is much larger than these three, so I suppose the complexity is just O(n).
The random forest, on the other hand, is a divide-and-conquer approach, so for n instances the complexity is O(n log n), though I am not sure about this; correct me if I am wrong.
To get the complexity of the whole algorithm, do I just add these two together?
In this case, you don't add the values together. If you have a divide-and-conquer algorithm, the runtime is determined by a combination of
The number of subproblems made per call,
The sizes of those subproblems, and
The amount of work done per problem.
Changing any one of these parameters can wildly impact the overall runtime of the function. If you increase the number of subproblems made per call by even a small amount, you increase exponentially the number of total subproblems, which can have a large impact overall. Similarly, if you increase the work done per level, since there are so many subproblems the runtime can swing wildly. Check out the Master Theorem as an example of how to determine the runtime based on these quantities.
In your case, you are beginning with a divide-and-conquer algorithm where all you know is that the runtime is O(n log n) and are adding in a step that does O(n) work per level. Just knowing this, I don't believe it's possible to determine what the runtime will be. If, on the other hand, you make the assumption that
The algorithm always splits the input into two smaller pieces,
The algorithm recursively processes those two pieces independently, and
The algorithm uses your O(n) algorithm to determine which split to make
Then you can conclude that the runtime is O(n log n), since this is the solution to the recurrence given by the Master Theorem.
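Under those three assumptions, here is a tiny numeric sanity check of the recurrence T(n) = 2·T(n/2) + O(n) (a sketch I added, not the asker's forest/k-means code):

    import math

    def T(n):
        """Recurrence T(n) = 2*T(n/2) + n, with the n term standing in for the
        O(n) split step performed at each node."""
        if n <= 1:
            return 1
        return 2 * T(n // 2) + n

    for n in (2**10, 2**14, 2**18):
        t = T(n)
        print(n, t, round(t / (n * math.log2(n)), 2))   # the ratio settles near 1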
Without more information about the internal workings of the algorithm, though, I can't say for certain.
Hope this helps!

Analysis of algorithms

I am reading up on algorithm analysis. Here is a text snippet from the book:
When n doubles, the running time goes up by a factor of 2 for linear
programs, 4 for quadratic programs, and 8 for cubic programs.
Programs that run in logarithmic time take only an additive constant
longer when n doubles, and programs that run in O(n log n) take
slightly more than twice as long to run under the same circumstances.
These increases can be hard to spot if the lower-order terms have
relatively large coefficients and n is not large enough.
My question is: what does the author mean by lower-order terms having relatively large coefficients? Can anyone explain with an example?
Thanks!
Suppose your algorithm actually executes n^2 + 1000 n computations when run on n elements. Now for n = 1 you need 1001 computations, and for n = 2 you need 2004. The difference from linear growth is tiny, and you can hardly spot the quadratic contribution!
Asymptotically, however, your algorithm takes O(n^2) steps, so asymptotically (as n gets large) doubling the input size quadruples your runtime. But for our small value, doubling from 1 to 2 did not quadruple the runtime! The lower-order term is n, and its coefficient (1000) is large compared to the coefficient of the leading-order term n^2 (which is 1).
This shows how the asymptotic complexity does not say anything about particular, especially small values. It's merely a limiting statement about the behaviour as n gets large.
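A quick way to see this numerically (my own throwaway check of the example above):

    def cost(n):
        # operation count from the example above: n^2 + 1000*n
        return n**2 + 1000 * n

    for n in (1, 1000, 100_000):
        # ratio of the cost after doubling n; it creeps from ~2 up towards 4
        print(n, cost(n), round(cost(2 * n) / cost(n), 2))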
When using O notation, you specify the largest term of the function that bounds the performance. For example, if the performance was always bounded by f = c3·n^3 + c2·n^2 + c1·n + c0, you would say that it is O(n^3). The author is saying that when n is small, the coefficients may have a larger impact than n on the performance; for example, if c2 were very large and c3 very small, the performance may appear to be O(n^2) if you only go by the relative performance for specific small instances of n, until the size of n comes to dominate the coefficients.
Asymptotic notation refers to the bounds of the runtime as n->infinity. So, a function that is O(n log n) may have an actual runtime of .1*n log n + 100000*n.
In this case, the 100000*n term is the "lower-order term". As n->infinity, this term is overpowered by the .1*n log n term.
However, as you can see, for small n, the 100000*n term will dominate the runtime.
For instance, if you have an O(n) algorithm, at small scales you could have T(n) = 490239·n + (insert ridiculous constant here), which means the performance would look bad, but as the scale increases you see that the growth is always linear.
A real-world example is merge sort, which is O(n log n). The problem is that recursion has a computational cost (overhead) proportional to n, which is a lower-order term than n log n and therefore gets discarded in the Big-O; but that factor can be quite large and still affects performance.
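For concreteness, a plain merge sort sketch (my own illustration, not code from the answer); the per-call overhead it incurs is the kind of hidden cost being described:

    def merge_sort(xs):
        """O(n log n) comparisons overall, but the ~2n recursive calls each add
        their own constant overhead -- the roughly linear hidden cost described above."""
        if len(xs) <= 1:
            return xs
        mid = len(xs) // 2
        left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        return merged + left[i:] + right[j:]

    print(merge_sort([5, 2, 9, 1, 7]))  # [1, 2, 5, 7, 9]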

Analyze span - two parallel for

If I have an algorithm with two parallel for and I want to analyze the span of the algorithm, what do I have to do?
For example
parallel for a = 2 to n
    parallel for b = 1 to a-1
My guess is the span is theta(lg(n)*lg(n)) but I'm not sure. :) Someone who can help or give a hint?
I am assuming you want the time complexity of this algorithm. Since time complexity is NOT how much time the algorithm actually takes, but rather how many operations it needs [a quote supporting this claim follows], the time complexity of this algorithm is O(n^2), just as it would be if it were not parallel.
from the wiki page: Time complexity is commonly estimated by counting the number of elementary operations performed by the algorithm, where an elementary operation takes a fixed amount of time to perform
Why don't we care about the fact that the algorithm is parallel?
Usually, our cluster size is fixed and does not depend on the input size n. Let the cluster size be k [meaning we can perform k operations simultaneously], and let the algorithm be O(n^2) [for simplicity, assume exactly n^2 operations].
Assume we have an input of size 100; then it will 'take' (100^2)/k time. If it were of size 1,000, it would take (1000^2)/k, and for n elements: (n^2)/k. As you can see, k is a constant, and the fact that the program is parallel does not change the complexity. Being able to do k operations at once is no better [and possibly even worse, but that's for another thread] than a computer that is k times faster.
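A small check of that operation count for the nested loop in the question (a sketch, assuming a constant-time loop body):

    def work(n):
        """Count the body executions of: for a = 2..n: for b = 1..a-1."""
        ops = 0
        for a in range(2, n + 1):
            for b in range(1, a):
                ops += 1
        return ops   # equals n*(n-1)/2, i.e. Theta(n^2)

    print(work(10), 10 * 9 // 2)   # both print 45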
