How do you figure out whether it's worth parallelizing a particular code block based on its code size? Is the following calculation correct?
Assume:
Thread pool consisting of one thread per CPU.
CPU-bound code block with execution time of X milliseconds.
Y = min(number of CPUs, number of concurrent requests)
Therefore:
Cost: code complexity, potential bugs
Benefit: (X * Y) milliseconds
My conclusion is that it isn't worth parallelizing for small values of X or Y, where "small" depends on how responsive your requests must be.
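For concreteness (the numbers are purely illustrative): with X = 100 ms, 8 CPUs, and 4 concurrent requests, Y = min(8, 4) = 4, so under this model the benefit is 100 * 4 = 400 ms, weighed against the fixed cost of added complexity and potential bugs.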
One thing that will help you figure that out is Amdahl's Law:
The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. For example, if a program needs 20 hours using a single processor core, and a particular 1-hour portion cannot be parallelized while the remaining 19 hours (95%) can be, then regardless of how many processors we devote to a parallelized execution of this program, the minimum execution time cannot be less than that critical 1 hour. Hence the speedup is limited to at most 20x.
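In symbols (a standard way to write the law; S for the serial fraction of the single-processor run time and N for the number of processors are my notation, not from the quote above):

Speedup(N) = 1 / (S + (1 - S)/N)

For the example above, S = 1 hour / 20 hours = 0.05, so even with unlimited processors:

Maximum speedup = 1 / S = 1 / 0.05 = 20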
Figure out what speedup you want to achieve and how much parallelism you can actually achieve, then see if it's worth it.
It depends on many factors, such as how difficult the code is to parallelize, the speedup obtained from it (there are overhead costs in dividing the problem and joining the results), and the fraction of total time the program spends in that code (Amdahl's Law).
Well, the benefit is really more:
(X * (Y-1)) * Tc * Pf
Where Tc is a scaling factor for the threading framework you are using. No threading framework scales perfectly, so using 2x threads will likely give, at best, 1.9x speed.
Pf is some factor for parallelization that depends completely on the algorithm (i.e., whether or not you'll need to lock, which will slow the process down).
Also, it's Y-1, since the single-threaded case is basically Y == 1.
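A minimal sketch of that back-of-the-envelope estimate; all names and the sample numbers below are illustrative assumptions, not part of the formula's original statement:

#include <stdio.h>

int main(void) {
    double X  = 50.0;   /* CPU-bound block, milliseconds */
    int    Y  = 4;      /* min(number of CPUs, concurrent requests) */
    double Tc = 0.95;   /* threading-framework scaling factor (never 1.0) */
    double Pf = 0.9;    /* algorithm-specific factor (locking, etc.) */

    double benefit_ms = (X * (Y - 1)) * Tc * Pf;
    printf("Estimated benefit: %.1f ms\n", benefit_ms);
    return 0;
}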
As for deciding, it's also a matter of user frustration/expectation: if the user is annoyed at waiting for something, speeding it up has a greater benefit than speeding up a task the user doesn't really mind waiting for. That's not always just about wait times; it's partly about expectations.
Related
I am trying to find out how long it takes to execute 10,000 RISC instructions, each 4 bytes, on a 2 GHz processor and on a 4 GHz processor. I only need the very basics of a formula.
I have tried 10,000 x 4 = 40,000, then 40,000 / (2x10^9) and 40,000 / (4x10^9).
There isn't a correct way to calculate this. There are a number of dependencies and complexities:
What types of instructions are included? Instruction cycle counts can vary from 1 cycle to 20-30 cycles per instruction. How many of these instructions can be dispatched at once?
What is the memory access pattern and how is the CPU memory access designed? How effective will caching/pre-fetching be (and does the CPU support it)?
Are there many branches? How predictable are those branches, and how many are within the critical portion of the code? What is the cost of a mispredict?
and more.
Fundamentally the question you are asking isn't easily solvable and absolutely depends on the code to be run.
Generally speaking, code execution does not scale linearly, so it is unlikely, for anything non-trivial, that a 4 GHz processor will be twice as fast as a 2 GHz processor.
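For a very rough, idealized sketch that ignores all of the effects above (and assumes a made-up CPI of 1 cycle per instruction), the usual back-of-the-envelope formula uses instruction count and cycles per instruction (CPI), not bytes:

time ≈ (instruction count x CPI) / clock rate

10,000 x 1 / (2x10^9) = 5 microseconds
10,000 x 1 / (4x10^9) = 2.5 microseconds

Note that the 4 bytes per instruction do not enter this formula directly.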
Is it possible to calculate the computing time of a process based on the number of operations that it performs and the speed of the CPU in GHz?
For example, I have a for loop that performs a total of 5*10^14 cycles. If it runs on a 2.4 GHz processor, will the computing time in seconds be 5*10^14 / (2.4*10^9) ≈ 208,333 s?
If the process runs on 4 cores in parallel, will the time be reduced by a factor of four?
Thanks for your help.
No, it is not possible to calculate the computing time based just on the number of operations. First of all, based on your question, it sounds like you are talking about the number of lines of code in some higher-level programming language since you mention a for loop. So depending on the optimization level of your compiler, you could see varying results in computation time depending on what kinds of optimizations are done.
But even if you are talking about assembly language operations, it is still not possible to calculate the computation time based on the number of instructions and CPU speed alone. Some instructions take multiple CPU cycles. If you have a lot of memory access, you will likely have cache misses and have to load data from main memory (or, if it has been paged out, even from disk), which is unpredictable.
Also, if the time that you are concerned about is the actual amount of time that passes between the moment the program begins executing and the time it finishes, you have the additional confounding variable of other processes running on the computer and taking up CPU time. The operating system should be pretty good about context switching during disk reads and other slow operations so that the program isn't stopped in the middle of computation, but you can't count on never losing some computation time because of this.
As far as running on four cores in parallel, a program can't just do that by itself. You need to actually write the program as a parallel program. A for loop is a sequential operation on its own. In order to run four processes on four separate cores, you will need to use the fork system call and have some way of dividing up the work between the four processes. If you divide the work into four processes, the maximum speedup you can have is 4x, but in most cases it is impossible to achieve the theoretical maximum. How close you get depends on how well you are able to balance the work between the four processes and how much overhead is necessary to make sure the parallel processes successfully work together to generate a correct result.
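The answer above describes dividing the work across four processes with fork. As an alternative sketch only (not what the answer prescribes), the same division of a loop across 4 cores is often expressed with OpenMP threads, which matches the OpenMP snippets further down this page. The loop bound N and the work() function are made-up placeholders:

#include <omp.h>
#include <stdio.h>

/* Hypothetical per-iteration work; stands in for the real loop body. */
static double work(long i) {
    return (double)i * 0.5;
}

int main(void) {
    const long N = 100000000L;
    double sum = 0.0;

    /* Split the loop iterations across 4 threads; each thread keeps its own
       partial sum, and the partial sums are combined at the end. */
    #pragma omp parallel for num_threads(4) reduction(+:sum)
    for (long i = 0; i < N; i++) {
        sum += work(i);
    }

    printf("sum = %f\n", sum);
    return 0;
}

As noted above, the best case is a 4x speedup; in practice the result depends on how well the work is balanced and on the overhead of combining the results.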
As far as I understand, a sampling profiler works as follows: it interrupts the program execution at regular intervals and reads out the call stack. It notes which part of the program is currently executing and increments a counter that represents that part of the program. In a post-processing step, for each function of the program, the fraction of the whole execution time for which that function is responsible is computed, by looking at the counter C for that specific function and the total number of samples N:
ratio of the function = C / N
Finding the hotspots is then easy, as these are the parts of the program with a high ratio.
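For illustration only, here is a toy version of the post-processing step described above; the function names and sample counts are made up:

#include <stdio.h>

int main(void) {
    /* C: number of samples in which each function was executing. */
    const char *func[]  = { "parse_input", "simulate", "write_output" };
    long        count[] = { 120, 750, 130 };
    int n = sizeof(count) / sizeof(count[0]);

    long total = 0;                       /* N: total number of samples */
    for (int i = 0; i < n; i++)
        total += count[i];

    for (int i = 0; i < n; i++)           /* ratio of the function = C / N */
        printf("%-14s %5.1f%%\n", func[i], 100.0 * count[i] / total);

    return 0;
}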
But how can this be done for a parallel program running on parallel hardware? As far as I know, when the program execution is interrupted, the currently executing parts of the program on ALL processors are determined. Because of that, a function which is executed in parallel gets counted multiple times, so the number of samples C for this function can no longer be used to compute its share of the whole execution time.
Is my thinking correct? Are there other ways how the hotspots of a parallel program can be identified - or is this just not possible using sampling?
You're on the right track.
Whether you need to sample all the threads depends on whether they are doing the same thing or different things.
It is not essential to sample them all at the same time.
You need to look at the threads that are actually working, not just idling.
Some points:
Sampling should be on wall-clock time, not CPU time, unless you want to be blind to needless I/O and other blocking calls.
You're not just interested in which functions are on the stack, but which lines of code, because they convey the purpose of the time being spent. It is more useful to look for a "hot purpose" than a "hot spot".
The cost of a function or line of code is just the fraction of samples it appears on. To appreciate that, suppose samples are taken every 10ms for a total of N samples. If the function or line of code could be made to disappear, then all the samples in which it is on the stack would also disappear, reducing N by that fraction. That's what speedup is.
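A made-up numeric instance: if a line of code appears on 30 out of 100 samples, removing it would remove roughly those 30 samples, leaving about 70% of the original run time, i.e. a speedup of 100/70 ≈ 1.43x.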
In spite of the last point, in sampling, quality beats quantity. When the goal is to understand what opportunities you have for speedup, you get farther faster by manually scrutinizing 10-20 samples to understand the full reason why each moment in time is being spent. That's why I take samples manually. Knowing the amount of time with statistical precision is really far less important.
I can't emphasize enough the importance of finding and fixing more than one problem. Speed problems come in severals, and each one you fix has a multiplier effect on those done already. The ones you don't find end up being the limiting factor.
Programs that involve a lot of asynchronous inter-thread message-passing are more difficult, because it becomes harder to discern the full reason why a moment in time is being spent.
More on that.
I have two scenarios of measuring metrics like computation time and parallel speedup (sequential_time/parallel_time).
Scenario 1:
Sequential time measurement:
startTime = omp_get_wtime();
for (i = 0; i < N; i++) {
    computation;
}
endTime = omp_get_wtime();
seq_time = endTime - startTime;
Parallel time measurement:
startTime = omp_get_wtime();
#pragma omp parallel for reduction(+:pi) private(i)
for (i = 0; i < N; i++) {
    computation;
}
endTime = omp_get_wtime();
paralleltime = endTime - startTime;
speedup = seq_time / paralleltime;
Scenario 2:
Sequential time measurement:
for (i = 0; i < N; i++) {
    startTime = omp_get_wtime();
    computation;
    endTime = omp_get_wtime();
    seq_time += endTime - startTime;
}
Parallel time measurement:
#pragma omp parallel for reduction(+:pi, paralleltime) private(i, startTime, endTime)
for (i = 0; i < N; i++) {
    startTime = omp_get_wtime();
    computation;
    endTime = omp_get_wtime();
    paralleltime = endTime - startTime;
}
speedup = seq_time / paralleltime;
I know that Scenario 2 is NOT the best production code, but I think that it measures the actual theoretical performance by OVERLOOKING the overhead involved in OpenMP spawning and managing several threads (thread context switching), so it will give us a linear speedup. Scenario 1, on the other hand, includes the overhead involved in spawning and managing threads.
My doubt is this:
With Scenario 1, I am getting a speedup which starts out linear but tapers off as we move to a higher number of iterations. With Scenario 2, I am getting a fully linear speedup irrespective of the number of iterations. I was told that in reality Scenario 1 will give me a linear speedup irrespective of the number of iterations, but I think it will not, because of the high overhead due to thread management. Can someone please explain to me why I am wrong?
Thanks! And sorry about the rather long post.
There are many situations where Scenario 2 won't give you linear speedup either -- false sharing between threads (or, for that matter, true sharing of shared variables which get modified), memory bandwidth contention, etc. The sub-linear speedup is generally real, not a measurement artifact.
More generally, once you get to the point where you are putting timers inside for loops, you are considering more fine-grained timing information than is really appropriate to measure using timers like this. You might well want to be able to disentangle the thread-management overhead from the actual work being done, for a variety of reasons, but here you are trying to do that by inserting N extra function calls to omp_get_wtime(), as well as the arithmetic and the reduction operation, all of which will have non-negligible overhead of their own.
If you really want accurate timing of how much time is being spent in the computation line, you really want to use something like sampling rather than manual instrumentation (we talk a little bit about the distinction here). Using gprof or scalasca or openspeedshop (all free software) or Intel's VTune (a commercial package) will give you the information about how much time is being spent on that line -- often even by thread -- with much lower overhead.
First of all, by the definition of speedup, you should use Scenario 1, which includes the parallel overhead.
In Scenario 2, you have the wrong code for the measurement of paralleltime. To satisfy your goal in Scenario 2, you need a per-thread paralleltime, obtained by allocating something like double paralleltime[NUM_THREADS] and indexing it with omp_get_thread_num() (note that this code will have false sharing, so you'd better allocate a 64-byte padded struct per thread). Then measure the per-thread computation time and finally take the longest one to calculate a different kind of speedup (I'd call it a sort of parallelism instead).
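A rough sketch of that per-thread measurement (NUM_THREADS, N, i, pi, and computation are placeholders carried over from the question's snippets; the 64-byte padding assumes a 64-byte cache line):

#define NUM_THREADS 4

struct padded_time {
    double t;
    char pad[64 - sizeof(double)];   /* pad to one cache line to avoid false sharing */
};
struct padded_time thread_time[NUM_THREADS] = {{0}};

#pragma omp parallel num_threads(NUM_THREADS)
{
    int tid = omp_get_thread_num();
    double t0 = omp_get_wtime();
    #pragma omp for reduction(+:pi) nowait   /* nowait: leave the loop-end barrier out of the timing */
    for (i = 0; i < N; i++) {
        computation;
    }
    thread_time[tid].t = omp_get_wtime() - t0;
}

/* The longest per-thread time plays the role of paralleltime. */
paralleltime = 0.0;
for (int t = 0; t < NUM_THREADS; t++)
    if (thread_time[t].t > paralleltime)
        paralleltime = thread_time[t].t;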
No, you may see sub-linear speedup even for Scenario 2, or even super-linear speedup in some cases. The potential reasons (excluding parallel overhead) are:
Load imbalance: the amount of work in computation differs from iteration to iteration. This would be the most common reason for low speedup (but you're saying load imbalance is not the case here).
Synchronization cost: if there is any kind of synchronization (e.g., mutexes, events, barriers), you may have waiting/blocking time.
Caches and memory cost: when the computation requires large bandwidth and a large working set, parallel code may suffer from bandwidth limitations (though this is rare in reality) and cache conflicts. Also, false sharing can be a significant reason, but it is easy to avoid. A super-linear effect can also be observed, because using multiple cores gives you more total cache (i.e., more private L1/L2 caches).
Scenario 1 will additionally include the overhead of the parallel library:
Forking/joining threads: although most parallel library implementations won't do physical thread creation/termination on every parallel construct.
Dispatching/joining logical tasks: even if physical threads are already created, you need to dispatch logical tasks to the threads (in general, M tasks to N threads), and also do a sort of join operation at the end (e.g., an implicit barrier).
Scheduling overhead: for static scheduling (as in your code, which uses OpenMP's static scheduling), the overhead is minimal. You can safely ignore it when the workload is sufficiently large (say, 0.1 second). However, dynamic scheduling (such as work-stealing in TBB) has some overhead, though it is not significant once your workload is sufficient.
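For illustration (not taken from the question's code), the two kinds of scheduling look like this in OpenMP, with N and computation as placeholders:

#pragma omp parallel for schedule(static)        /* iterations divided up front; minimal overhead */
for (i = 0; i < N; i++) { computation; }

#pragma omp parallel for schedule(dynamic, 64)   /* chunks of 64 iterations handed out at run time;
                                                    a little more overhead, better load balance */
for (i = 0; i < N; i++) { computation; }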
I don't think your code (a one-level, statically scheduled parallel loop) has high parallel overhead due to thread management, unless it is called millions of times per second. So the cause may be one of the other reasons mentioned above.
Keep in mind that there are many factors that determine speedup, from the inherent parallelism of the algorithm (load imbalance and synchronization) to the overhead of the parallel library (e.g., scheduling overhead).
What do you want to measure exactly? The overhead due to parallelism is part of the real execution time, hence IMHO scenario 1 is better.
Besides, according to your OpenMP directives, you're doing a reduction on a variable (pi). In Scenario 1, the cost of that reduction is included in your timing; in Scenario 2, it is not. So basically, you're measuring less than in Scenario 1. This probably has some impact on your measurements too.
Otherwise, Jonathan Dursi's answer is excellent.
There are several options in OpenMP for how the work is distributed among the threads, and this can impact the linearity of your measurement method 1. Your measurement method 2 doesn't seem useful. What were you trying to get at with that? If you want to know single-thread performance, then run a single thread. If you want parallel performance, then you need to include the overhead.
A theoretical question, maybe it is obvious:
Is it possible that an algorithm, after being implemented in a parallel way with N threads, will be executed more than N times faster than the original, single-threaded algorithm? In other words, can the gain be better than linear in the number of threads?
It's not common, but it most assuredly is possible.
Consider, for example, building a software pipeline where each step in the pipeline does a fairly small amount of calculation, but requires enough static data to approximately fill the entire data cache -- but each step uses different static data.
In a case like this, serial calculation on a single processor will normally be limited primarily by the bandwidth to main memory. Assuming you have (at least) as many processors/cores (each with its own data cache) as pipeline steps, you can load each data cache once, and process one packet of data after another, retaining the same static data for all of them. Now your calculation can proceed at the processor's speed instead of being limited by the bandwidth to main memory, so the speed improvement could easily be 10 times greater than the number of threads.
Theoretically, you could accomplish the same with a single processor that just had a really huge cache. From a practical viewpoint, however, the selection of processors and cache sizes is fairly limited, so if you want to use more cache you need to use more processors -- and the way most systems provide to accomplish this is with multiple threads.
Yes.
I saw an algorithm for moving a robot arm through complicated maneuvers that was basically to divide into N threads, and have each thread move more or less randomly through the solution space. (It wasn't a practical algorithm.) The statistics clearly showed a superlinear speedup over one thread. Apparently the probability of hitting a solution over time rose fairly fast and then leveled out some, so the advantage was in having a lot of initial attempts.
Amdahl's law tells us this is not possible for the general case. At best we can perfectly divide the work by N. The reason for this is that, given no serial portion, Amdahl's formula for speedup becomes:
Speedup = 1/(1/N)
where N is the number of processors. This of course reduces to just N.
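For reference, the general form of the formula, with S as the serial fraction, is:

Speedup = 1 / (S + (1 - S)/N)

Setting S = 0 (no serial portion at all) gives the expression above, 1/(1/N) = N, i.e. at best a linear gain.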