I am trying to understand the following slide.
The definition is kind of unclear to me. Sources like Wikipedia say that Amdahl's law measures the speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. To me, speedup is basically how much faster one task runs than another. Speedup seems to be used in a different way here. Can you clarify, in an easier way, what Amdahl's law measures and what speedup really is?
The definition of speedup here is:
Speedup = Baseline Running Time / New Running Time
This means that if the running time is BRT and the parallelizable portion is P, then:
BRT = (1 - P) * BRT + P * BRT
Now if a speedup of S was obtained on the P portion of the running time, then the new improved running time (IRT) is:
IRT = (1 - P) * BRT + P * (BRT / S)
= (1 - P) * BRT + (P / S) * BRT
= ((1 - P) + (P / S)) * BRT
Therefore:
BRT / IRT = 1 / ((1 - P) + (P / S))
This is the overall speedup. This is Amdahl's law.
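A minimal numeric sketch of this formula in a couple of lines of Julia (the example values P = 0.75 and S = 4 are purely illustrative):
overall_speedup(P, S) = 1 / ((1 - P) + P / S)   # Amdahl's law: BRT / IRT
overall_speedup(0.75, 4)                        # ≈ 2.29 for a 4x speedup on 75% of the running time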
To me, speedup is basically how much faster one task runs than another.
Yes, speedup can be defined in different ways. This can be a little confusing.
Amdahl's Law gives the theoretical maximum speedup, which is almost never achieved in practice. The formula is easy to understand once you know what the different parts mean.
The formula is: Speedup = 1 / ((1 - f) + f/p), where
1 stands for the whole program (its normalized running time),
1 - f is the fraction of the code that is serial (cannot be parallelized),
f is the fraction of the code that can be parallelized,
p is the number of processors.
So, if we say there are 10 processors and 40% of the code can be parallelized,
the formula gives Speedup = 1 / ((1 - 0.4) + 0.4/10) = 1 / 0.64 ≈ 1.56.
I'm not a professional and you might want to check this, but if I remember correctly this is how it should work :)
No system is truly parallel. Work might start in parallel, then execute serially, and then run in parallel again within each workflow. In general, we have to take locks, coordinate threads, and synchronize the code, so there will be serial portions within a parallel process. During those serial portions, the multiple threads/processes that are executing end up queuing. Amdahl's law tells you how much the serial portion affects the performance (throughput) curve. As you see in the image:
If it were a perfectly parallel system, the throughput curve would be perfectly linear. If there is a serial portion within a process, it does not matter whether it is 5 percent or 10 percent: the curve will flatten out after a given point. Amdahl's law calculates how soon the curve is going to flatten out. Once it has flattened, adding more resources no longer increases throughput.
The formula on the slide is saying that the amount of speedup a program will see by using more parallel cores is based on how much of the program is serial.
Related
This is how I solved the following question; I want to be sure my solution is correct.
A multiprocessor consists of 100 processors, each capable of a peak execution rate of 2 Gflops. What is the performance of the system as measured in Gflops when 2% of the code is sequential and 98% is parallelizable?
Solution:
I think I'm right in thinking that 2% of the program runs at 2 GFLOPs, 98% runs at 200 GFLOPs, and that I can average these speeds to find the performance of the multiprocessor in GFLOPs:
(2/100)*2 + (98/100)*200 = 196.04 Gflops
Can you confirm whether my solution is correct?
From my understanding, it is 2% of the program that is sequential, not 2% of the execution time. This means the sequential code takes a significant portion of the time, since with so many processors the parallel part is drastically accelerated.
With your method, a program with 50% sequential code and 1000 processors would run at (50/100)*2 + (50/100)*2000 = 1001 Gflops. That would mean all processors are used at ~50% of their maximum capacity on average over the whole execution, which is too good to be possible. Indeed, the parallel part of the program should be so fast that it takes only a tiny fraction of the execution time (<5%), while the sequential part takes almost all of it (>95%). Since the largest part of the execution time runs at 2 Gflops, the processors cannot be used at ~50% of their capacity!
Based on Amdahl's law, you can compute the actual speedup of this code:
Slat = 1 / ((1-p) + p/s), where Slat is the speedup of the whole program, p the portion of parallel code (0.98) and s the number of processors (100). This gives Slat ≈ 33.6. Since one processor runs at 2 Gflops and the program is 33.6 times faster overall using many processors, the overall program runs at about 33.6 * 2 ≈ 67 Gflops.
What Amdahl's law shows is that even a tiny fraction of the execution time being sequential strongly impacts the scalability, and thus the performance, of parallel programs.
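A minimal sketch of that calculation in a couple of lines of Julia (the variable names are purely illustrative):
p, s = 0.98, 100                 # parallel fraction and number of processors
Slat = 1 / ((1 - p) + p / s)     # ≈ 33.6, overall speedup from Amdahl's law
Slat * 2                         # ≈ 67 Gflops, since one processor peaks at 2 Gflops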
Forgive me for starting light and anecdotally, citing a meme from my beloved math professor; later we will see why and how well it helps us here:
2 + 7 = 15 . . . , particularly so for higher values of 7
It is best if I start off by stating some definitions:
a) GFLOPS is a unit that measures how many operations in FLO-ating point arithmetic, of no particular kind (see remark 1), were performed P-er S-econd ~ FLOPS, here expressed for convenience in multiples of a billion (G-iga), i.e. the said GFLOPS.
b) A processor or multi-processor is a device (or some kind of composition of multiple such devices, expressed as multi-p.) used to perform some kind of useful work - a processing.
This pair of definitions is necessary for judging the question we were asked to solve.
The term (a) is a property of (b), irrespective of all other factors, if we assume such a "device" not to be some kind of polymorphic, self-modifying FPGA or an evolutionary, reflective, self-evolving amoeboid, which both processors and multi-processors prefer not to be, at least in our part of the Universe as we know it in 2022-Q2.
Once manufactured, each kind of processor (b) (be it a monolithic or a multi-processor device) has certain observable, repeatably measurable qualities of processing (doing work).
"A multiprocessor consists of 100 processors, each capable of a peak execution rate of 2Gflops. What is the performance of the system as measured in Gflops when 2% of the code is sequential and 98% is parallelizable"
A multiprocessor . . . (device)
consists . . . has a property of being composed of
100 . . . (quantitative factor) ~ 100
processors,. . . (device)
each . . . declaration of equality
capable . . . having a property of
of a peak . . . peak (not having any higher)
execution . . . execution of work (process/code)
rate . . . being measured in time [1/s]
of 2Gflops . . . (quantitative factor) ~ 2E+9 FLOPS
What is . . . Questioning
the PERFORMANCE . . . (property) a term (not defined yet)
of the SYSTEM . . . (system) a term (not defined yet)
as measured in . . . using some measure to evaluate a property of (system) in
Gflops . . . (units of measure) to express such property in
when . . . (proposition)
2% . . . (quantitative factor) ~ 0.02 fraction of
of the code . . . (subject-being-processed)
is . . . has a property of being
sequential . . . sequential, i.e. steps follow one-after-another
and
98% . . . (quantitative factor) ~ 0.98 fraction of (code)
( the same code)
is . . . has a property of being
parallelizable . . . possible to re-factor into some other form, from a (sequential) original form
( emphasis added )
Fact #1 )
The processor (b) (a (device)), from which the introduced multiprocessor (a macro-(device)) is internally composed, has a declared (granted) property of not being able to process more FLOPS than the said 2 GFLOPS.
This property does not say how many actual { INTOPS | FLOPS } it will perform at any particular moment in time.
This property does say that any device which was measured and labeled as having indeed X {M|G|P|E}FLOPS has the very same "glass ceiling" of not being able to perform a single instruction more per second - even when it is doing nothing at all (chopping NOP-s), or even when it is switched off and powered down.
This property is a static supremum, an artificial (in relation to real-world workloads' instruction mixes), temperature-dependent constant (and it often degrades in vivo, not only due to thermal throttling but for many other reasons in real-world { processor + !processor }-composed SYSTEM ecosystems).
Fact #2 )
The problem, as presented to us here, has no particular definition of what is or is not part of the said "SYSTEM". Is it just the (multi)processor? If so, why introduce a new, not yet defined term, SYSTEM, when it would be a pure identity with the already defined and used term (multi)processor per se? Is it both the (multi)processor and memory or other peripherals? If so, why do we know literally nothing about such an important neighbourhood (a complement) of the said (multi)processor, without which a SYSTEM would not be The SYSTEM, but a mere part of it - the (multi)processor - which is NOT a SYSTEM without its (SYSTEM-defining and completing) neighbourhood?
Fact #3 )
The original Amdahl's Law, often dubbed The Law of Diminishing Returns (of extending a system with more and more resources), speaks about a SYSTEM and its re-organised forms. It compares the same amount and composition of work, as performed in the original SYSTEM (with a pure-[SERIAL] flow of operations, one step after another after another), against an improved SYSTEM' (created by re-organising and extending the original SYSTEM with more resources of some kind, so that the new SYSTEM' operates more parts of the original work-to-be-done in an improved organisation of work, where more resources can and do perform parts of the work independently of one another ~ in a concurrent, and some parts even in a truly parallel, fashion, using all the degrees of parallelism the SYSTEM' resources can provide and sustain).
Given that no particular piece of information was present about a SYSTEM, let alone a SYSTEM', we have no right to use The Law of Diminishing Returns to address the problem as defined above. Having no facts does not give us the right to guesstimate, let alone to turn to feelings-based evidencing, if we strive to remain serious with ourselves, don't we?
Given (a) and (b) above, the only fair claim we can make, and which indeed holds true, is to say:
"From what has been defined so far, we know that such a multiprocessor will never work on more than 100 x 2 GFLOP per second of time."
There is zero other knowledge to claim a single bit more (and yet we still have to silently assume that the claimed peak FLOP-s have no side effects and remain sustainable for at least one whole second (see remark 2)) -- otherwise even this claim becomes skewed.
An extended, stronger version:
"No matter what kind of code is actually being run, for this, above-specified multiprocessor, we cannot say more than that such a multiprocessor will never work on more than 100 x 2 GFLOPS at any moment of time."
Remarks:
1) See how often this is misused in the promotion of "Exaflops performance" by marketing people, when FMUL f8,f8 is claimed and "sold" to the public as if it "looked" equal to FMUL f512,f512, which it by far is not when the same yardstick is used to measure, is it?
2) A similar skewed argument (if not straight misinformation) has been repeated countless times in the (false) claim that the world's "largest" femtosecond LASER was capable of emitting a light pulse carrying more power than XY Suns (a WOW moment!), without adding how long it took to pump up the energy for a single such femtosecond-long ( 1 [fs] ~ 1E-15 [s] ) "packet-of-a-few-photons"... Careful readers have already rejected the "WOW"-moment artificial stupidity, it not being possible to carry such an astronomical amount of energy, the energy of XY Suns, on a tiny, energy-poor planet, let alone to carry it "over" a wire towards that "superpower" LASER.
If 2% is the run-time percentage for the serial part, then you cannot surpass a 50x speedup. This means you cannot surpass 50x the Gflops of the serial version.
If the unoptimized program ran at 2 Gflops fully serially, then the optimized version with perfect scaling compresses 98% of the runtime down to 0.98%.
2% plus 0.98% gives roughly 3% as the new total run time. This means the program spends about 2/3 of its time in the serial part and only 1/3 in the parallelized part. If the parallel part runs at 200 Gflops, then you have to average it over the whole 3/3 of the time: 200 Gflops for 1 microsecond and 2 Gflops for 2 microseconds.
This is roughly equal to 67 Gflops. If there is a single-core turbo to boost the serial part, then a 20% turbo boost over 2/3 of the time shaves roughly 13% off the total run time, hence a 15%-20% higher Gflops average. Turbo core frequency is important even if it boosts only a single core.
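A minimal sketch of that time-weighted average (without the turbo boost), assuming perfect scaling of the parallel part across the 100 processors:
t_serial   = 0.02          # serial part of the original (normalized) run time
t_parallel = 0.98 / 100    # parallel part compressed by the 100 processors
t_total    = t_serial + t_parallel               # ≈ 0.0298 of the original time
(t_serial * 2 + t_parallel * 200) / t_total      # ≈ 67 Gflops on average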
I am interested in understanding what percentage of a workload can almost never be put into a hardware accelerator. While more and more tasks are becoming amenable to domain-specific accelerators, I wonder whether it is possible to have tasks that will not benefit from an accelerator. Put simply, what are the tasks that are less likely to be accelerator-compatible?
I would love to have pointers to resources that speak to this question.
So you have the following question(s) in your original post:
Question:
I wonder whether it is possible to have tasks that will not benefit from an accelerator. Put simply, what are the tasks that are less likely to be accelerator-compatible?
Answer:
Of course it's possible. First and foremost, a workload that needs to be accelerated on a hardware accelerator should not involve the following:
dynamic polymorphism and dynamic memory allocation
runtime type information (RTTI)
system calls
........... (some more depending on the hardware accelerator)
Although explaining each of the above-mentioned points would make the post too lengthy, I can explain a few. There is no support for dynamic memory allocation because hardware accelerators have a fixed set of resources on silicon, and the dynamic creation and freeing of memory resources is not supported. Similarly, dynamic polymorphism is only supported if the pointer object can be determined at compile time. And there should be no system calls because these are actions that relate to performing some task on the operating system; therefore OS operations, such as file reads/writes or OS queries like time and date, are not supported.
Having said that, the workloads that are less likely to be accelerator-compatible are mostly communication-intensive kernels. Such kernels often incur a serious data-transfer overhead compared to CPU execution, which can usually be detected by measuring the CPU-FPGA or CPU-GPU communication time.
For better understanding, let's take the following example:
Communication Intensive Breadth-First Search (BFS):
procedure BFS(G, root) is
    let Q be a queue
    label root as explored
    Q.enqueue(root)
    while Q is not empty do
        v := Q.dequeue()
        if v is the goal then
            return v
        for all edges from v to w in G.adjacentEdges(v) do
            if w is not labeled as explored then
                label w as explored
                Q.enqueue(w)
The above pseudocode is the famous breadth-first search (BFS). Why is it not a good candidate for acceleration? Because it traverses all the nodes in a graph without doing any significant computation; it is immensely communication-intensive rather than compute-intensive. Furthermore, for a data-driven algorithm like BFS, the shape and structure of the input can actually dictate runtime characteristics like locality and branch behaviour, making it not such a good candidate for hardware acceleration.
Now the question arises: why have I focused on compute-intensive vs. communication-intensive?
As you have tagged FPGA in your post, I can explain this concept with respect to FPGAs. For instance, in a given system that uses a PCIe connection between the CPU and the FPGA, we calculate the PCIe transfer time as the elapsed time of data movement from the host memory to the device memory through PCIe-based direct memory access (DMA).
The PCIe transfer time is a significant factor for filtering out FPGA acceleration of communication-bound workloads. Therefore, the above-mentioned BFS can show severe PCIe transfer overheads and is hence not acceleration-compatible.
On the other hand, consider the family of object-recognition algorithms implemented as deep neural networks. If you go through these algorithms you will find that a significant amount of time (perhaps more than 90%) is spent in the convolution function. The input data is relatively small, and the convolutions are embarrassingly parallel. This makes them an ideal workload to move to a hardware accelerator.
Let's take another example showing a perfect workload for hardware acceleration:
Compute Intensive General Matrix Multiply (GEMM):
/* The element type and sizes below are example assumptions; the original snippet leaves them to the build setup. */
#define row_size   64
#define block_size 8
#define N          (row_size * row_size)
typedef double TYPE;

void gemm(TYPE m1[N], TYPE m2[N], TYPE prod[N]){
    int i, k, j, jj, kk;
    int i_row, k_row;
    TYPE temp_x, mul;
    loopjj: for (jj = 0; jj < row_size; jj += block_size){
        loopkk: for (kk = 0; kk < row_size; kk += block_size){
            loopi: for (i = 0; i < row_size; ++i){
                loopk: for (k = 0; k < block_size; ++k){
                    i_row = i * row_size;
                    k_row = (k + kk) * row_size;
                    temp_x = m1[i_row + k + kk];
                    loopj: for (j = 0; j < block_size; ++j){
                        mul = temp_x * m2[k_row + j + jj];
                        prod[i_row + j + jj] += mul;
                    }
                }
            }
        }
    }
}
The above code example is General Matrix Multiply (GEMM). It is a common algorithm in linear algebra, machine learning, statistics, and many other domains. The matrix multiplication in this code is computed using a blocked loop structure, as is common. Commuting the arithmetic to reuse all of the elements in one block before moving on to the next dramatically improves memory locality. Hence it is extremely compute-intensive and a perfect candidate for acceleration.
Hence, to name only a few, we can conclude that the following are the deciding factors for hardware acceleration:
the compute load of your workload,
the data your workload accesses,
how parallel your workload is,
the underlying silicon available for acceleration,
the bandwidth and latency of the communication channels.
Do not forget Amdahl's Law:
Even if you have found the right workload that is an ideal candidate for hardware acceleration, the struggle does not end there. Why? Because the famous Amdahl's law comes into play. That is, you might be able to significantly speed up a workload, but if it accounts for only 2% of the application's runtime, then even if you speed it up infinitely (take its run time to 0) you will only speed up the overall application by about 2% at the system level. Hence, your ideal workload should not only be ideal algorithmically; it should also contribute significantly to the overall runtime of your system.
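A minimal sketch of that point in a couple of lines of Julia (the function name is just illustrative):
amdahl(p, s) = 1 / ((1 - p) + p / s)   # p = accelerated fraction of runtime, s = its speedup
amdahl(0.02, Inf)                      # ≈ 1.02, i.e. only about a 2% overall speedup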
I ran a set of experiments on a parallel package, say superlu-dist, with different processor numbers e.g.: 4, 16, 32, 64
I got the wall clock time for each experiment, say: 53.17s, 32.65s, 24.30s, 16.03s
The formula for speedup is:
Speedup = serial time / parallel time
But there is no information about the serial fraction.
Can I simply take the reciprocal of the wall clock time?
Can I simply take the reciprocal of the wall clock time ?
No, true speedup figures require comparing apples to apples:
This means that an original, pure-[SERIAL] process-scheduling ought to be compared with any other scenario, where parts may get modified so as to use some sort of parallelism (the parallel fraction may get re-organised so as to run on N CPUs / computing resources, whereas the serial fraction is left as it was).
This obviously means that the original [SERIAL] code was extended, both in code (#pragma decorators, OpenCL modifications, CUDA { host_to_dev | dev_to_host } tooling, etc.) and in time (to execute these added functionalities, which were not present in the original [SERIAL] code we benchmark against), so as to add some new sections where the (possibly [PARALLEL]) other part of the processing will take place.
This comes at a cost -- add-on overhead costs (to set up, to terminate, and to communicate data from the [SERIAL] part to the [PARALLEL] part and back) -- all of which adds additional [SERIAL] workload (and execution time + latency).
For more details, feel free to read section Criticism in article on re-formulated Amdahl's Law.
The [PARALLEL] portion seems interesting, yet the principal ceiling on speedup is the duration of the [SERIAL] portion ( s = 1 - p ) in the original,
to which the add-on durations and added latency costs have to be added, accumulated alongside the re-organisation of the work from the original, pure-[SERIAL] form into the wished-for [PARALLEL] code-execution process scheduling, if a realistic evaluation is to be achieved.
Running the test on a single processor and setting that as the serial time, ..., as @VictorSong has proposed, sounds easy, but it benchmarks an incoherent system (not the pure-[SERIAL] original) and records a skewed yardstick to compare against.
This is the reason why fair methods ought to be engineered. The pure-[SERIAL] original code-execution can be time-stamped, so as to show the real durations of the unchanged parts, but the add-on overhead times have to be incorporated into the add-on extensions of the serial part of the now-parallelised tests.
The re-articulated Amdahl's Law of Diminishing Returns explains this, together with the impacts of the add-on overheads and also of the atomicity of processing, which will not allow further fictions of speedup growth once more computing resources are added but the parallel fraction of the processing does not permit a further split of the task workloads, due to some form of internal atomicity of processing that cannot be divided further in spite of free processors being available.
The simpler of the two re-formulated expressions reads:
S = 1 / ( s + pSO + ( 1 - s ) / N + pTO )
where s, ( 1 - s ) and N were defined above, pSO is the [PAR]-Setup-Overhead add-on and pTO is the [PAR]-Terminate-Overhead add-on.
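A minimal Julia sketch of this overhead-strict formula; the example numbers are assumptions chosen only to illustrate the effect:
overhead_speedup(s, N, pSO, pTO) = 1 / (s + pSO + (1 - s) / N + pTO)
overhead_speedup(0.05, 16, 0.0, 0.0)     # ≈ 9.1, the classical Amdahl ceiling
overhead_speedup(0.05, 16, 0.01, 0.01)   # ≈ 7.7, once setup/terminate overheads are paid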
Some interactive GUI tools for further visualisation of these add-on overhead costs are available for interactive parametric simulations - just move the p-slider towards the actual value of ( 1 - s ), i.e. keeping a non-zero fraction for the very [SERIAL] part of the original code.
What do you mean when you say "serial fraction"? According to a Google search apparently superlu-dist is C, so I guess you could just use ctime or chrono and take the time the usual way, it works for me with both manual std::threads and omp.
I'd just run the test on a single processor and set that as the serial time, then do the test again with more processors (just like you said).
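For instance, a quick sketch in Julia that treats the 4-process run as the baseline (not a true serial baseline, which is exactly the caveat raised in the other answer):
procs = [4, 16, 32, 64]
times = [53.17, 32.65, 24.30, 16.03]   # measured wall-clock times in seconds
rel_speedup = times[1] ./ times        # speedup relative to the 4-process run
# ≈ [1.0, 1.63, 2.19, 3.32]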
I have some performance problems with parallel computing in Julia. I am new to both Julia and parallel calculations.
In order to learn, I parallelized a code that should benefit from parallelization, but it does not.
The program estimates the mean of the means of the components of arrays whose elements were chosen randomly from a uniform distribution.
Serial version
tic()
function mean_estimate(N::Int)
iter = 100000*2
p = 5000
vec_mean = zeros(iter)
for i = 1:iter
vec_mean[i] = mean( rand(p) )
end
return mean(vec_mean)
end
a = mean_estimate(0)
toc()
println("The mean is: ", a)
Parallelized version
addprocs(CPU_CORES - 1)
println("CPU cores ", CPU_CORES)
tic()
@everywhere function mean_estimate(N::Int)
iter = 100000
p = 5000
vec_mean = zeros(iter)
for i = 1:iter
vec_mean[i] = mean( rand(p) )
end
return mean(vec_mean)
end
the_mean = mean(vcat(pmap(mean_estimate,[1,2])...))
toc()
println("The mean is: ", the_mean)
Notes:
The factor 2 in the fourth line of the serial code is because I tried the code in a PC with two cores.
I checked the usage of the two cores with htop, and it seems to be ok.
The outputs I get are:
me@pentium-ws:~/average$ time julia serial.jl
elapsed time: 2.68671022 seconds
The mean is: 0.49999736055814215
real 0m2.961s
user 0m2.928s
sys 0m0.116s
and
me@pentium-ws:~/average$ time julia -p 2 parallel.jl
CPU cores 2
elapsed time: 2.890163089 seconds
The mean is: 0.5000104221069994
real 0m7.576s
user 0m11.744s
sys 0m0.308s
I've noticed that the serial version is slightly faster than the parallelized one for the timed part of the code. Also, there is a large difference in the total execution time.
Questions
Why is the parallelized version slower? (what I am doing wrong?)
Which is the right way to parallelize this program?
Note: I use pmap with vcat because I wish to try with the median too.
Thanks for your help
EDIT
I measured times as @HighPerformanceMark suggested. The tic()/toc() times are the following. The iteration number is 2E6 for every case.
Array size      Single thread    Parallel    Ratio
5 000                2.69            2.89     1.07
100 000            488.77          346.00     0.71
1 000 000         4776.58         4438.09     0.93
I am puzzled about why there is no clear trend with array size.
You should pay prime attention to suggestions in the comments.
As @ChrisRackauckas points out, type instability is a common stumbling block for performant Julia code. If you want highly performant code, then make sure that your functions are type-stable. Consider annotating the return type of the function wrapping pmap and/or vcat, e.g. f(pids::Vector{Int}) = mean(vcat(pmap(mean_estimate, pids))) :: Float64 or something similar, since pmap does not strongly type its output. Another strategy is to roll your own parallel scheduler. You can use the pmap source code as a springboard (see code here).
Furthermore, as @AlexMorley commented, you are confounding your performance measurements by including compilation times. Normally, the performance of a function f() is measured in Julia by running it twice and timing only the second run. On the first run, the JIT compiler compiles f() before running it, while the second run uses the already compiled function. Compilation incurs an (unwanted) performance cost, so timing the second run avoids measuring the compilation.
If possible, preallocate all outputs. In your code, you have set each worker to allocate its own zeros(iter) and its own rand(p). This can have dramatic performance consequences. A sketch of your code:
# code mean_estimate as two functions
f(p::Int) = mean(rand(p))
function g(iter::Int, p::Int)
vec_mean = zeros(iter)
for i in eachindex(vec_mean)
vec_mean[i] = f(p)
end
return mean(vec_mean)
end
# run twice, time on second run to get compute time
g(200000, 5000)
@time g(200000, 5000)
### output on my machine
# 2.792953 seconds (600.01 k allocations: 7.470 GB, 24.65% gc time)
# 0.4999951853035917
The @time macro is alerting you that the garbage collector is cleaning up a lot of allocated memory during execution, several gigabytes in fact. This kills performance. Memory allocations may be overshadowing any distinction between your serial and parallel compute times.
Lastly, remember that parallel computing incurs overhead from scheduling and managing individual workers. Your workers are computing the mean of the means of many random vectors of length 5000. But you could succinctly compute the mean (or median) of, say, 5M entries with
x = rand(5_000_000)
mean(x)
@time mean(x) # 0.002854 seconds (5 allocations: 176 bytes)
so it is unclear how your parallel computing scheme improves upon serial performance. Parallel computing generally provides the best help when your arrays are truly beefy or your calculations are arithmetically intense, and vector means probably do not fall in that domain.
One last note: you may want to peek at SharedArrays, which distribute arrays over several workers with a common memory pool, or the experimental multithreading facilities in Julia. You may find those parallel frameworks more intuitive than pmap.
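As an example, here is a minimal threaded sketch of the same computation; it assumes a recent Julia started with several threads (e.g. JULIA_NUM_THREADS=2) and uses modern syntax rather than the pmap approach above:
using Statistics   # provides mean in current Julia versions

function mean_estimate_threaded(iter::Int, p::Int)
    vec_mean = zeros(iter)
    Threads.@threads for i in 1:iter     # each iteration writes its own slot, so no data race
        vec_mean[i] = mean(rand(p))
    end
    return mean(vec_mean)
end

mean_estimate_threaded(200_000, 5000)    # run once to compile, then time a second call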
Hi, I have a question regarding inherent parallelism.
Let's say we have a sequential program which takes 20 seconds to complete execution. Suppose the execution time consists of 2 seconds of setup time at the beginning and 2 seconds of finalization time at the end, and the remaining work can be parallelized. How do we calculate the inherent parallelism of this program?
How do you define "inherent parallelism"? I've not heard the term. We can, however, talk about the "possible speedup".
The OP said "the remaining work can be parallelized"... but to what degree?
Can it run with infinite parallelism? If that were possible (it isn't practical), then the total runtime would be 4 seconds, for a speedup of 20/4 = 5.
If the remaining work can be run on N processors perfectly in parallel,
then the total runtime would be 4 + 16/N. The ratio of 20 seconds to that is 20/(4 + 16/N), which can give pretty much any speedup from 1 (no speedup, N = 1) up to 5 (the limiting case) depending on the value of N.
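A minimal sketch of that formula in Julia (the 2 s + 2 s serial parts and the 16 s parallelizable part come straight from the question):
speedup(N) = 20 / (4 + 16 / N)
[speedup(N) for N in (1, 2, 4, 16, Inf)]   # ≈ [1.0, 1.67, 2.5, 4.0, 5.0]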