How to calculate parallel speedup between two algorithms - parallel-processing

Suppose I have algorithms 1 and 2, whose sequential execution times are ts1 and ts2 and whose parallel execution times are tp1 and tp2.
Now, when calculating the speedup for both algorithms, which of the following is true?
min(ts1,ts2)/tp1 for algorithm 1
min(ts1,ts2)/tp2 for algorithm 2
or
ts1/tp1 for algorithm 1
ts2/tp2 for algorithm 2
In other words, for the numerator, should I use the best sequential time or each algorithm's own sequential time?

Short Version:
None of the above
Fig.1:
a SPEEDUP
BETWEEN
a BLACK-BOX <PROCESS_2>
[START] and
+-----------------------------------------+ a BLACK-BOX <PROCESS_1>
| |
[T0] [T0+ts1] [T0+ts1+tp1]
| | |
| | |
v v v
|________________|R.0: ____.____.____.____| ~~ <PAR.1:1> == [SEQ]
| |R.1? ____.____| :
| |R.2? ____| : :
| |R.3? ____| : :
| | : : :
|<SEQ.1>>>>>>>>>>| : : :
| |<PAR.1:N>: : :
| : : :
: : :
: : [FINISH] using 1 PAR-RESOURCE
: [FINISH] if using 2 PAR-RESOURCEs
[FINISH] if using 4 PAR-RESOURCEs
( Execution time flows from left to right, from [T0] to [T0 + ts1 + tp1]. The sketched order of the [SEQ] and [PAR] sections was chosen for illustrative purposes only and can be opposite, as the ordering of the process-flow sections' durations is commutative in principle. )
A TL;DR Version:
A slightly more formal simplification of the [SEQ]+[PAR] process-flows above may help both to answer the question and to understand why.
Needless to tell any HPC planner that Amdahl's Law rules here ( the better if the extended form of Amdahl's Law, the overhead- and atomicity-aware formulation, is used ).
We see that the more resources R.i are used in the [PAR]-section of PROCESS_1, the shorter tp1 may get. This is the power of [PAR]-processing.
Given just the pair of tuples ( ts1, tp1 ) and ( ts2, tp2 ), no one can assume any potential, Amdahl's Law resources-driven speedup ( as demonstrated in Fig.1 ). But if one just strives to compare the two postulated implementations, having potentially different internal processing, the possible speedup S can be formulated as:
        max( [ ts1 + tp1 ], [ ts2 + tp2 ] )
S   =   ___________________________________
        min( [ ts1 + tp1 ], [ ts2 + tp2 ] )
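As a quick numeric illustration of this end-to-end comparison, here is a minimal Python sketch ( the ts/tp values are made-up placeholders, not taken from the question ):

# Minimal sketch: end-to-end comparison of two implementations.
# The ts*/tp* values below are hypothetical placeholders.
def relative_speedup(ts1, tp1, ts2, tp2):
    """Speedup of the faster implementation over the slower one,
    comparing whole end-to-end durations ( [SEQ] + [PAR] )."""
    e2e_1 = ts1 + tp1
    e2e_2 = ts2 + tp2
    return max(e2e_1, e2e_2) / min(e2e_1, e2e_2)

print(relative_speedup(ts1=2.0, tp1=8.0, ts2=3.0, tp2=3.0))   # ~1.67x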

There is a fundamental issue with your question, and that is why you are stuck: speedup is defined for processors, not algorithms.
In computer architecture, speedup is a measure of the relative performance of two systems processing the same problem. More technically, it is the improvement in speed of execution of a task executed on two similar architectures with different resources.
Definition taken from Wikipedia.

Related

Equal performance across training and test set. Is it normal?

I run a simple logit model, but I always get the same performance no matter the size of the training set. I started with 90% training and then kept trying with smaller sizes. I tried even with a training size of 15%, but nothing changes at all: performance on the test set is the same as on the training set, and this holds for every kind of metric: accuracy, sensitivity, specificity, etc.
The dataset has been preprocessed by removing outliers (or log-transforming some monetary features such as income) and missing values.
At first glance I thought this might happen because the two-class proportion is the same across train and test no matter the splitting size, that is 80-20.
As a matter of fact, even with a 15% training size the proportion is the same across training and test. So I tried running the model on a very small training sample (i.e. fewer than 1000 instances) and I got different performance across training and test set.
However, I'm here to ask if there's some other kind of explanation for that.
library(caret)  # for confusionMatrix()

d5  # d5 is the dataset
n <- nrow(d5)
index <- sample(1:n, n * 0.80, replace = F)
training <- d5[index, ]
wgths <- ifelse(training$status == 0, 0.21, 0.79)  # set weights because the classes are imbalanced
test <- d5[-index, ]
logit <- glm(training$status ~ ., data = training[, -c(1, 4, 6, 11, 14)],  # some columns excluded because they are useless
             family = binomial('logit'), weights = wgths)
score <- ifelse(logit$fitted.values > 0.5, 1, 0)  # classification on the training set
prb <- predict.glm(logit, newdata = test, type = 'response')
score1 <- ifelse(prb > 0.5, 1, 0)  # classification on the test set
train_mat <- confusionMatrix(as.factor(score), as.factor(training$status), '1')
test_mat <- confusionMatrix(as.factor(score1), as.factor(test$status), '1')
Each comparison of the training and test confusion matrices looks something like the two shown below.
Confusion Matrix and Statistics (Training)
           Reference
Prediction      0      1
         0 482321  66330
         1 271584 130485
Accuracy : 0.6446
95% CI : (0.6436, 0.6455)
No Information Rate : 0.793
P-Value [Acc > NIR] : 1
Kappa : 0.2185
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.6630
Specificity : 0.6398
Pos Pred Value : 0.3245
Neg Pred Value : 0.8791
Prevalence : 0.2070
Detection Rate : 0.1372
Detection Prevalence : 0.4229
Balanced Accuracy : 0.6514
'Positive' Class : 1
Confusion Matrix and Statistics (Test)
           Reference
Prediction      0      1
         0 120775  16682
         1  67544  32679
Accuracy : 0.6456
95% CI : (0.6437, 0.6476)
No Information Rate : 0.7923
P-Value [Acc > NIR] : 1
Kappa : 0.2198
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.6620
Specificity : 0.6413
Pos Pred Value : 0.3261
Neg Pred Value : 0.8786
Prevalence : 0.2077
Detection Rate : 0.1375
Detection Prevalence : 0.4217
Balanced Accuracy : 0.6517
'Positive' Class : 1

Performance measure on data sizes and identical resources

I have systems with a large number of cores, as well as a cluster. For a particular task for which no serial implementation is available, I can only benchmark with respect to the time taken by the task on different input sizes. I see that even when the data size is increased by a factor of 10, the completion time is less than 10 times longer while using identical resources. I would like to know how to measure this performance, as it does not appear to fall under the typical definitions of strong/weak scaling. It appears to be related to efficiency, but I am not certain. From what I could gather about the three:
Strong scaling (Amdahl's law): speedup = 1 / ( s + p / N ) = T( 1 ) / T( N )
Weak scaling (Gustafson’s law): scaled speedup = s + p × N
Efficiency: speedup / N
As I don't have a speedup, due to the lack of a serial implementation, and N is a constant, I can only think of finding ratios of efficiencies using strong scaling. Is such a parameter used in CS?
Apache Spark on workloads of 250-500 GB of data. Benchmarking was done with the 100% and 10% data sets. Jobs run between 250-3000 s depending on the type and size. I can force the number of executors to be 1, with 1 executor core, but that would be wrong, as in theory only an optimally written serial job should serve as the baseline.
– Quiescent ( URL added )
Thanks for this note. It gives the problem the ground needed to answer it:
Q :... "Is such a parameter used in CS ?"
The answer to the questions about the observations on the problem depicted above has nothing to do with the DATA-size per se. The DATA-sizing is important, yet the core of the understanding is related to the internal functioning of distributed computing, where overheads matter:
SMALL RDD-DATA
+-------------------E-2-E ( RDD/DAG Spark-wide distribution
|s+------+o | & recollection
|e| | v s| Turn-Around-Time )
|t| DATA | e d |
|u|1x | r a |
|p+------+ h e |
+-------------------+
| |
| |
|123456789.123456789|
Whereas :
LARGER RDD-DATA
+--------:------:------:------:-------------------E-2-E ( RDD/DAG Spark-wide TAT )
|s+------:------:------:------:------+o + |
|e| : : : : | v s v|
|t| DATA : DATA : DATA : DATA : DATA | e d a|
|u|1x :2x :3x :4x :5x | r a r|
|p+------:------:------:------:------+ h e .|
+--------:------:------:------:-------------------+
| |
| | |
|123456789.123456789| |
| |
|123456789.123456789.123456789.123456789.123456789|
( The larger case is not a simple 5x multiple of the E-2-E originally observed for the "small" DATA ( Spark-wide TAT ): the Setup & Termination overheads stay about the same ~ const., and only the DATA-size-variable part need not, yet may, grow. The larger case thus shows an E-2-E of about ~ 50 TimeUNITs for 5-times more DATA, which is, for obvious reasons, not 5-times the ~ 20 TimeUNITs seen during the E-2-E TAT of the "small"-DATA use-case, as not all system-wide overheads scale with the DATA size. )
For further reading on Amdahl's argument & Gustafson/Barsis promoted scaling, feel free to continue here.
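A tiny sketch of this reasoning ( the overhead and per-unit durations below are illustrative assumptions, chosen only to match the ~ 20 vs ~ 50 TimeUNITs annotation above, not measurements ):

# Sketch: E-2-E time = constant setup/termination overheads + DATA-size-proportional part.
SETUP_AND_TERMINATION = 12.5   # [TimeUNITs] ~ constant, independent of DATA size
PER_UNIT_OF_DATA      = 7.5    # [TimeUNITs] per 1x DATA

def e2e_time(data_multiple):
    return SETUP_AND_TERMINATION + PER_UNIT_OF_DATA * data_multiple

small, large = e2e_time(1), e2e_time(5)
print(small, large, large / small)   # 20.0 50.0 2.5  -> 5x more DATA, only ~2.5x longer E-2-E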

Negative speed up in Amdahl's law?

Amdahl's law states that the speedup of the entire system is
an_old_time / a_new_time
where a_new_time can be represented as ( 1 - f ) + f / s', where f is the fraction of the system that is enhanced by some modification, and s' is the amount by which that fraction of the system is enhanced. However, after solving this equation for s', it seems there are many cases in which s' is negative, which makes no physical sense.
Taking the case where s = 2 (a 100% increase in the speed of the entire system) and f = 0.1 (10% of the system is impacted by some speed enhancement s'), we solve for s' by setting an_old_time = 1, which gives s' = f / ( f + 1 / s - 1 ).
Plugging in the values for f and s, we find that: s' = 0.1 / ( 0.1 + 0.5 - 1 ) = 0.1 / -0.4, which means that the s' value is negative.
How can this be possible, and what is the physical meaning of this? Also, how can I avoid negative s’ values when answering questions like these?
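A quick numeric check of the question's own algebra ( a sketch that only evaluates the formula quoted above ): the denominator f + 1/s - 1 turns negative exactly when the requested overall speedup s exceeds 1 / ( 1 - f ), the most that enhancing only a fraction f can ever deliver.

# Sketch: evaluate s' = f / ( f + 1/s - 1 ) from the question above.
def enhancement_needed(f, s):
    """Per-fraction speedup s' required to reach an overall speedup s
    when only a fraction f of the run time is enhanced."""
    return f / (f + 1.0 / s - 1.0)

f = 0.1
print(1.0 / (1.0 - f))              # 1.111... = maximum overall speedup reachable with f = 0.1
print(enhancement_needed(f, 1.05))  # positive: a 1.05x overall speedup is reachable
print(enhancement_needed(f, 2.0))   # negative: a 2x overall speedup is impossible with f = 0.1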
Amdahl's Law, also known as Amdahl's argument, is used to find the maximum expected improvement to an overall process when only a part of the process is improved.
                 1                  | where S is the maximum theoretical Speedup achievable
S = _______________________________;| s is the pure-[SERIAL]-section fraction
             ( 1 - s )              | ( 1 - s ) is the true-[PARALLEL]-section fraction
     s   +   _________              | N is the number of processes doing the [PAR.]-part
                 N                  |
By simple algebra, s + ( 1 - s ) == 1, with s being anything from < 0.0 .. 1.0 >, so there is no chance to get negative values here.
The full context of Amdahl's argument & the contemporary criticism, adding all the principal add-on overhead factors & a better handling of the atomicity-of-work, follows below.
It is often applied in the field of parallel-computing to predict the theoretical maximum speedup achievable by using multiple processors. The law is named after Dr. Gene M. AMDAHL ( IBM Corporation ) and was presented at the AFIPS Spring Joint Computer Conference in 1967.
His paper extended prior work, cited by Amdahl himself as "... one of the most thorough analyses of relative computer capabilities currently published ...", published in September 1966 by prof. Kenneth E. KNIGHT, Stanford School of Business Administration.
The paper keeps a general view on process improvement.
Fig.1:
a SPEEDUP
BETWEEN
a <PROCESS_B>-[SEQ.B]-[PAR.B:N]
[START] and
[T0] [T0+tsA] a <PROCESS_A>-[SEQ.A]-ONLY
| |
v v
| |
PROCESS:<SEQ.A>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>|
| |
+-----------------------------------------+
| |
[T0] [T0+tsB] [T0+tsB+tpB]
| | |
v v v
|________________|R.0: ____.____.____.____|
| |R.1? ____.____| :
| |R.2? ____| : :
| |R.3? ____| : :
| |R.4? : : :
| |R.5? : : :
| |R.6? : : :
| |R.7? : : :
| | : : :
PROCESS:<SEQ.B>>>>>>>>>>|<PAR.B:4>: : :
| |<PAR.B:2>:>>>>: :
|<PAR.B:1>:>>>>:>>>>>>>>>: ~~ <PAR.B:1> == [SEQ]
: : :
: : [FINISH] using 1 PAR-RESOURCE
: [FINISH] if using 2 PAR-RESOURCEs
[FINISH] if using 4 PAR-RESOURCEs
( Execution time flows from left to right, from [T0] to [T0 + tsB + tpB]. The sketched order of the [SEQ] and [PAR] sections was chosen for illustrative purposes only and can be opposite, as the ordering of the process-flow sections' durations is commutative in principle. )
The speedup of a { program | process }, coming from using multiple processors in parallel computing, was derived to be ( perhaps to the surprise of the audience ) principally limited by the very fraction of time consumed by the non-improved part of the processing, typically the sequential fraction of the program, still executed in a pure [SERIAL] process-scheduling manner ( be it because it was not parallelised per se, or because it is non-parallelisable by nature ).
For example, if a program needs 20 hours using a single processor core, and a particular portion of the program which takes one hour to execute cannot be parallelised ( having to be processed in a pure-[SERIAL] process-scheduling manner ), while the remaining 19 hours (95%) of execution time can be parallelised ( using a true-[PARALLEL], not a "just"-[CONCURRENT], process-scheduling ), then without question the minimum achievable execution time cannot be less than that ( first ) critical one hour, regardless of how many processors are devoted to the parallelised execution of the rest of this program.
Hence the achievable Speedup is principally limited to at most 20x, even if an infinite number of processors were used for the [PARALLEL]-fraction of the process.
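A minimal sketch of this 20-hour example ( plain Amdahl's Law, overheads ignored ), just to make the 20x ceiling concrete:

# Sketch: plain Amdahl's Law applied to the 20-hour example above.
def amdahl_speedup(serial_fraction, n_processors):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

s = 1.0 / 20.0                       # 1 hour serial out of 20 hours total
for n in (1, 2, 4, 1024, 10**9):
    print(n, amdahl_speedup(s, n))   # approaches, but never exceeds, 1/s == 20x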
See also:
CRI UNICOS has a useful command amlaw(1) which does simple
number crunching on Amdahl's Law.
------------
On a CRI system type: man amlaw.
              1            1
S = lim   ----------  =   ---
    P->oo      1-s          s
          s + ---
               P
S = speedup which can be achieved with P processors
s (small sigma) = proportion of a calculation which is serial
1-s = parallelizable portion
Speedup_overall
= 1 / ( ( 1 - Fraction_enhanced ) + ( Fraction_enhanced / Speedup_enhanced ) )
Articles to parallel#ctc.com (Administrative: bigrigg#ctc.com)
Archive: http://www.hensa.ac.uk/parallel/internet/usenet/comp.parallel
Criticism:
While Amdahl formulated a process-oriented speedup comparison, many educators keep repeating the formula as if it were postulated for multiprocessing process rearrangement, without taking into account the following cardinal issues:
atomicity of processing ( some parts of the processing are not further divisible, even if more processing resources are available and free to the process-scheduler -- ref. the resources-bound, further indivisible, atomic processing-section in Fig. 1 above )
add-on overheads, which are principally present and associated with any new process creation, scheduler re-distribution thereof, inter-process communication, re-collection of processing results, and remote-process resources' release and termination ( their proportional dependence on N is not widely confirmed; ref. Dr. J. L. Gustafson, Jack Dongarra, et al., who claimed approaches with better than linear scaling in N )
Both of these groups of factors have to be incorporated in the overhead-strict, resources-aware re-formulation of Amdahl's Law, if it ought to serve well to compare apples to apples in contemporary parallel-computing realms. Any use of an overhead-naive formula yields only a dogmatic result, which was by far not what Dr. Gene M. Amdahl formulated in his paper ( ref. above ), and comparing apples to oranges has never brought anything positive to any scientific discourse in any rigorous domain.
Overhead-strict re-formulation of the Amdahl's Law speedup S:
                     1
S = ________________________________ ;   where s, ( 1 - s ), N were defined above
                ( 1 - s )                pSO:= [PAR]-Setup-Overhead     add-on
    s + pSO +  ___________ + pTO         pTO:= [PAR]-Terminate-Overhead add-on
                    N
Overhead-strict and resources-aware re-formulation:
                             1                            where s, ( 1 - s ), N
S = ______________________________________________ ;           pSO, pTO
                  /  ( 1 - s )            \               were defined above
    s + pSO + max|  ___________ , atomicP  | + pTO        atomicP:= a further indivisible duration of an atomic-process-block
                  \       N               /
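A small sketch of this overhead-strict, resources-aware re-formulation ( the overhead and atomicity values used below are illustrative assumptions ):

# Sketch: overhead-strict, resources-aware Amdahl speedup, as re-formulated above.
def overhead_strict_speedup(s, N, pSO=0.0, pTO=0.0, atomicP=0.0):
    """s       : pure-[SERIAL] fraction
       N       : number of processes doing the [PAR]-part
       pSO/pTO : [PAR]-Setup / [PAR]-Terminate overhead add-ons
       atomicP : further indivisible duration of an atomic-process-block"""
    return 1.0 / (s + pSO + max((1.0 - s) / N, atomicP) + pTO)

# Illustrative values: overheads and atomicity flatten the naive 1/s ceiling.
for n in (2, 8, 64, 1024):
    print(n, overhead_strict_speedup(s=0.05, N=n, pSO=0.01, pTO=0.01, atomicP=0.02))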
Interactive Tool for a maximum effective speedup :
Due to the reasons described above, one picture might be worth a million words here. Try this, where a fully interactive tool for using the overhead-strict Amdahl's Law is cross-linked.

Finding parameters next to vectors to get a desired vector

What is the simplest algorithm I can use to find values of m1, m2, m3, ..., mn such that the following equation is satisfied (to a certain accuracy threshold, of course):
m1*v1 + m2*v2 + ... + mn*vn = vd
where v1, v2, ..., vn and vd are given vectors of 3-10 dimensions? The parameters m1, ..., mn should be positive real numbers.
I need an algorithm that's reliable and quick to code. Problem sizes will be small (no larger than n=100), so speed isn't a very important issue, especially since the accuracy requirements will be rather liberal.
What you are describing is a system of linear equations. You can write it as the following matrix equation:
A * x = b
Where, if k is the dimension of the vectors:
    / v1[1] v2[1] ... vn[1] \
    | v1[2] v2[2] ... vn[2] |
A = | ..................... |
    | ..................... |
    \ v1[k] v2[k] ... vn[k] /

    / m1 \
    | m2 |
x = | .. |
    | .. |
    \ mn /

    / vd[1] \
    | vd[2] |
b = | ..... |
    | ..... |
    \ vd[k] /
There are several ways to solve these. If n is equal to k and the problem has a solution (which it may or may not have), then you can solve it by inverting the coefficient matrix A and computing inverse(A) * b, by using Cramer's rule or, most commonly, by Gaussian elimination. If n is not equal to k, several things can happen; you can learn about them by googling a little bit.
By the way, you said that m1 ... mn must be positive numbers (non-zero?). In this case, you may want to approach your problem via linear programming, adding restrictions like m1 > 0, m2 > 0, etc. and using the simplex algorithm to solve it.
Whatever you use, it is really not advisable to program the algorithm yourself. There are plenty of libraries in every language that deal with this kind of problem.
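For instance, the least-squares form of A * x = b with a non-negativity constraint on x can be handed to an off-the-shelf solver; here is a sketch using SciPy's non-negative least squares ( the example vectors are made up, and nnls enforces m_i >= 0 rather than strictly > 0 ):

# Sketch: solve m1*v1 + ... + mn*vn ~ vd with m_i >= 0,
# using non-negative least squares on the matrix form A * x = b.
import numpy as np
from scipy.optimize import nnls

# Made-up example: n = 3 vectors of dimension k = 4.
v1 = np.array([1.0, 0.0, 2.0, 1.0])
v2 = np.array([0.0, 1.0, 1.0, 3.0])
v3 = np.array([2.0, 1.0, 0.0, 1.0])
vd = np.array([3.0, 2.0, 3.0, 5.0])

A = np.column_stack([v1, v2, v3])   # columns are v1 .. vn
m, residual = nnls(A, vd)           # coefficients constrained to be >= 0

print(m, residual)                  # ~[1. 1. 1.] and a residual of ~0 for this made-up example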

Sorted list difference

I have the following problem.
I have a set of elements that I can sort with a certain algorithm A. The sorting is good, but very expensive.
There is also an algorithm B that can approximate the result of A. It is much faster, but the ordering will not be exactly the same.
Taking the output of A as the 'gold standard', I need a meaningful estimate of the error resulting from the use of B on the same data.
Could anyone please suggest any resource I could look at to solve my problem?
Thanks in advance!
EDIT :
As requested : adding an example to illustrate the case :
if the data are the first 10 letters of the alphabet,
A outputs : a,b,c,d,e,f,g,h,i,j
B outputs : a,b,d,c,e,g,h,f,j,i
What are the possible measures of the resulting error, that would allow me to tune the internal parameters of algorithm B to get result closer to the output of A?
Spearman's rho
I think what you want is Spearman's rank correlation coefficient. Using the rank vectors of the two sortings (the exact A and the approximate B), you calculate the rank correlation rho, ranging from -1 (completely different) to 1 (exactly the same):
rho = 1 - ( 6 * sum( d(i)^2 ) ) / ( n * ( n^2 - 1 ) ), where d(i) is the difference in ranks of each element between A and B, and n is the number of elements
You can define your measure of error as a distance D := (1 - rho) / 2.
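If SciPy is available, the same coefficient can be computed directly; a sketch on the example orderings from the question:

# Sketch: Spearman's rho on the question's example orderings.
from scipy.stats import spearmanr

A = list("abcdefghij")   # gold-standard order from algorithm A
B = list("abdceghfji")   # approximate order from algorithm B

rank_in_A = {element: i for i, element in enumerate(A)}
rho, _ = spearmanr(range(len(B)), [rank_in_A[element] for element in B])

print(rho)               # ~0.94 for this example
print((1 - rho) / 2)     # the distance D suggested above, ~0.03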
I would determine the largest correctly ordered subset.
+-------------> I
| +--------->
| |
A -> B -> D -----> E -> G -> H --|--> J
| ^ | | ^
| | | | |
+------> C ---+ +-----------> F ---+
In your example that is 7 out of 10, so the algorithm scores 0.7. The other such sets have length 6. A correct ordering scores 1.0, a reverse ordering 1/n.
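That largest correctly ordered subset is just the longest increasing subsequence of B's elements after mapping them to their ranks in A; a minimal O(n^2) sketch:

# Sketch: score B by its largest subset that is already in A's order (longest increasing subsequence).
def largest_ordered_subset_score(A, B):
    rank = {element: i for i, element in enumerate(A)}
    ranks = [rank[element] for element in B]
    best = [1] * len(ranks)              # best[i] = length of the longest ordered run ending at i
    for i in range(len(ranks)):
        for j in range(i):
            if ranks[j] < ranks[i]:
                best[i] = max(best[i], best[j] + 1)
    return max(best) / len(ranks)

print(largest_ordered_subset_score(list("abcdefghij"), list("abdceghfji")))   # 0.7, as above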
I assume that this is related to the number of inversions. x + y indicates x <= y (correct order) and x - y indicates x > y (wrong order).
A + B + D - C + E + G + H - F + J - I
We obtain almost the same result - 6 of the 9 adjacent pairs are correct, scoring 0.667. Again a correct ordering scores 1.0 and a reverse ordering 0.0, and this might be much easier to calculate.
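For completeness, a naive O(n^2) sketch that scores all element pairs ( not just adjacent ones ) against A's order:

# Sketch: fraction of all element pairs that B keeps in A's order.
def pairwise_order_score(A, B):
    rank = {element: i for i, element in enumerate(A)}
    ranks = [rank[element] for element in B]
    n = len(ranks)
    in_order = sum(1 for i in range(n) for j in range(i + 1, n) if ranks[i] < ranks[j])
    return in_order / (n * (n - 1) / 2)

print(pairwise_order_score(list("abcdefghij"), list("abdceghfji")))   # ~0.91 (41 of 45 pairs in order)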
Are you looking for an algorithm that calculates the difference, taking the array sorted with A and the array sorted with B as inputs? Or are you looking for a generic method of determining, on average, how far off an array would be when sorted with B?
If the first, then I suggest something as simple as the distance each item is from where it should be (an average would do better than a sum, to remove the length of the array as an issue).
If the second, then I think I'd need to see more about these algorithms.
It's tough to give a good generic answer, because the right solution for you will depend on your application.
One of my favorite options is just the number of in-order element pairs, divided by the total number of pairs. This is a nice, simple, easy-to-compute metric that just tells you how many mistakes there are. But it doesn't make any attempt to quantify the magnitude of those mistakes.
// Fraction of adjacent pairs that are already in non-decreasing order.
double sortQuality = 1;
if (array.length > 1) {
    int inOrderPairCount = 0;
    for (int i = 1; i < array.length; i++) {
        if (array[i] >= array[i - 1]) ++inOrderPairCount;
    }
    sortQuality = (double) inOrderPairCount / (array.length - 1);
}
Calculating the RMS error may be one of the many possible methods. Here is a small Python implementation.
def calc_error(out_A, out_B):
    # out_A <= output of algorithm A
    # out_B <= output of algorithm B
    rms_error = 0
    for i in range(len(out_A)):
        # take the square of each difference and add it up
        rms_error += (out_A[i] - out_B[i])**2
    return rms_error**0.5  # take the square root
>>> calc_error([1,2,3,4,5,6],[1,2,3,4,5,6])
0.0
>>> calc_error([1,2,3,4,5,6],[1,2,4,3,5,6]) # 4,3 swapped
1.414
>>> calc_error([1,2,3,4,5,6],[1,2,4,6,3,5]) # 3,4,5,6 randomized
3.162
NOTE:
Taking the square root is not necessary, but taking the squares is, as plain differences may sum to zero. I think the calc_error function gives an approximate count of wrongly placed pairs, but I don't have any programming tools handy, so :(.
Take a look at this question.
You could try something involving the Hamming distance.
If anyone is using the R language, I've implemented a function that computes the "Spearman rank correlation coefficient" using the method described above by #bubake :
get_spearman_coef <- function(objectA, objectB) {
#getting the spearman rho rank test
spearman_data <- data.frame(listA = objectA, listB = objectB)
spearman_data$rankA <- 1:nrow(spearman_data)
rankB <- c()
for (index_valueA in 1:nrow(spearman_data)) {
for (index_valueB in 1:nrow(spearman_data)) {
if (spearman_data$listA[index_valueA] == spearman_data$listB[index_valueB]) {
rankB <- append(rankB, index_valueB)
}
}
}
spearman_data$rankB <- rankB
spearman_data$distance <-(spearman_data$rankA - spearman_data$rankB)**2
spearman <- 1 - ( (6 * sum(spearman_data$distance)) / (nrow(spearman_data) * ( nrow(spearman_data)**2 -1) ) )
print(paste("spearman's rank correlation coefficient"))
return( spearman)
}
results :
get_spearman_coef(c("a","b","c","d","e"), c("a","b","c","d","e"))
spearman's rank correlation coefficient: 1
get_spearman_coef(c("a","b","c","d","e"), c("b","a","d","c","e"))
spearman's rank correlation coefficient: 0.8
