When/how to benefit from parallel processing of scripts? - performance

I have a sequence of scripts to run on a computer with 1 physical & logical core.
I have tried running them in sequence, and also forking them with something like the bash script below. The background processes, running in parallel, actually took longer than running them in sequence.
My question is: Under what circumstances should a workload like this be run in parallel? That is, must I have particular hardware or can I do this more efficiently with the single-processor computer that I have?
#!/bin/bash
# Run them in sequence...
T1=$(date +%s)
python proc_test1.py
python proc_test2.py
python proc_test3.py
T2=$(date +%s)
T=$((T2-T1))
echo "Scripts took ${T} seconds."
# Now fork them...
T1=$(date +%s)
python proc_test1.py &
python proc_test2.py &
python proc_test3.py &
wait
T2=$(date +%s)
T=$((T2-T1))
echo "Scripts took ${T} seconds."
exit 0

Parallel processing usually becomes beneficial when enough hardware is available to actually run the jobs simultaneously. For example, if you had three logical CPUs, then three computations could occur at the same time. Thus, ideally, if you ran proc_test1.py three times with forks on three available processors, all three would finish in the same time it would take to run just one instance of proc_test1.py.
In other words, given sufficient hardware, running three proc_test1.py's in serial will take three times as long as running them with forks.
Now, given that you have just one hardware CPU, it makes sense that the parallel jobs would run more slowly than the serial ones, as the Python programs are competing with each other for CPU time, and the CPU stopping one job and resuming another costs CPU time itself.
For example, say you had 6 oranges and two hands, and you had to hold each orange for 5 seconds. Say it takes you one second to pick up or swap out oranges. You could do this task in serial, picking up two oranges at a time and holding them for five seconds before swapping in a new pair. This would take you
1 + 5 + 1 + 5 + 1 + 5
= 3 * 5 + 3 = 18
seconds to complete.
Now consider the parallel analogy: all 6 oranges are begging to be picked up, and holding one does not mean you won't immediately drop it for an alternative. There isn't necessarily an upper bound on how long the task will take as we have defined it, so suppose that you have to hold the oranges for at least 2.5 seconds before you swap them out, and you only swap in pairs. Then, it will take you
1 + 2.5 + 1 + 2.5 + 1 + 2.5 + 1 + 2.5 + 1 + 2.5 + 1 + 2.5 + 1
= 3 * 5 + 7 = 22
seconds to complete. Note that by "forking", it takes you 22% longer to hold the 6 oranges for 5 seconds each. Since you have two hands, the actual holding still takes 15 seconds either way, but there is a variable overhead in switching time depending on your strategy. Note that if you had 6 hands, it would take only 7 seconds to complete the task.
Thus when you have more processors, fork processes, otherwise you're just juggling jobs on limited hardware.
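If you want to see this on your own machine, here is a minimal Python sketch (burn_cpu is a hypothetical stand-in, not one of the proc_test scripts): it times the same CPU-bound function run serially and then via fork-based multiprocessing. On a single logical core the forked version should be no faster, and usually a little slower; on a multi-core machine it should approach the ideal described above.
#!/usr/bin/env python
# Minimal sketch: serial vs. forked wall time for CPU-bound work.
# burn_cpu() stands in for proc_test1.py and friends.
import time
from multiprocessing import Pool

def burn_cpu(n=5_000_000):
    total = 0
    for i in range(n):
        total += i * i          # pure CPU work, no I/O
    return total

if __name__ == "__main__":
    jobs = 3

    t0 = time.time()
    for _ in range(jobs):                    # one after another
        burn_cpu()
    print(f"serial:   {time.time() - t0:.2f} s")

    t0 = time.time()
    with Pool(processes=jobs) as pool:       # three workers at once
        pool.map(burn_cpu, [5_000_000] * jobs)
    print(f"parallel: {time.time() - t0:.2f} s")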

Your simple question is in practice extremely hard to answer in general. In real life I would always measure to see if reality agrees with my theory.
An example where reality did not agree with my theory was on my Intel Core i7. It has 4 cores and hyperthreading, which would suggest that running 8 threads in parallel would be optimal: you would be using the 4 processing units and 4 additional threads to keep the pipelines filled.
However, the Core i7 has 6 MB of shared cache, and it just so happened that the working set of my data fit inside those 6 MB. So I saw an extreme speedup by running 1 thread instead of 2 or even 8: running more than 1 would simply flush the cache all the time. This would not have been true if the cache had not been shared, but it just shows that it is not simple to say when parallelizing will be faster.

Related

Approximating Processing Power from CPU-Time

In a particular scenario I found that a piece of code took 20 CPU Years and 4 real Months of time. My goal is to approximate the amount of processing power utilized, considering the fact that all the processors were at 100% usage all the time. So, my approach is as follows:
20 CPU Years = 20 * 365 * 24 CPU Hours = 175,200 CPU Hours.
Now, 1 CPU Hour means a 1 GFLOP machine working for 1 real Hour. Which means, in this case, the work done is equivalent to a 1 GFLOP machine working for 175,200 real Hours. But in reality it took 4 * 30 * 24 = 2,880 real hours. So, 175,200 / 2,880 = approximately a 61 GFLOP machine.
My question is: am I doing the approximation correctly, or am I misunderstanding some particular term in the calculations given above? Or am I mixing GFLOPS and GFLOP together?
Definitions
My question is: am I doing the approximation correctly, or am I misunderstanding some particular term in the calculations given above?
"100% usage" may mean the CPU spent 20% of its time doing nothing while waiting for data to be transferred to/from RAM (and/or on branch mispredictions or other stalls), 10% of its time running faster than normal because other CPUs were actually doing nothing, and 15% of its time running slower than normal for power/temperature management reasons; and (depending on where you got that "100% usage" statistic) "100% usage" may be significantly more confusing (e.g. http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html ).
Depending on context; GFLOPS is either "theoretical maximum under perfect conditions that will never occur in practice" (worthless marketing hype); or a direct measurement of a specific case that ignores most of the work a CPU did (everything involving integers, all control flow, all data transfer, all memory management, ...)
In a particular scenario I found that a piece of code took 20 CPU Years and 4 real Months of time. My goal is to approximate the amount of processing power utilized.
From this; you might (or might not) be able to say "most of the work that CPUs did was discarded due to lockless algorithm retries and/or transactions that couldn't be committed; and (partly because the bottleneck was RAM bandwidth and partly because of the way SMT works on this system) it would have been 4 times as fast if half as many CPUs were used."
TL;DR: Approximating processor power is just an inconvenient way to obfuscate the (more useful) information that you started with (e.g. that a specific piece of code running on a specific piece of hardware that was working on a specific piece of data happened to take 4 months of real time).
Your Calculation:
Yes; you're mixing GFLOP and GFLOPS (e.g. GFLOPS = GFLOP per second; and a "1 GFLOP machine" is a computer that can do a billion floating point operations in an infinite amount of time, which is every computer), and the web page you linked to is making the same mistake (e.g. saying "a 1 GFLOP reference machine" when it should be saying "a 1 GFLOPS reference machine").
Note that there's no need to care about GFLOPS or GFLOP for the calculation you're doing: If something was supposed to take 20 "reference CPU years" and actually took 4 months (or 4/12 years); then you'd say that your hardware is equivalent to "20 / (4/12) = 60 reference CPUs". Of course this is horribly silly and it'd make more sense to say that your hardware happened to achieve 60 GFLOPS without bothering with the misleading "reference CPU" nonsense.
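For completeness, the arithmetic from the question and the answer can be checked directly in a few lines of Python (all numbers are taken from the text above):
# Checking the arithmetic above.
cpu_hours  = 20 * 365 * 24      # 20 CPU Years -> 175,200 CPU Hours
real_hours = 4 * 30 * 24        # 4 real Months -> 2,880 real hours
print(cpu_hours, real_hours)    # 175200 2880
print(cpu_hours / real_hours)   # ~60.8 "reference CPUs"
print(20 / (4 / 12))            # 60.0, the simpler years-based ratio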

Why does GNU parallel affect script speed?

I have some Fortran script. I compile with gfortran and then run as time ./a.out.
My script completes, and outputs the runtime as,
real 0m36.037s
user 0m36.028s
sys 0m0.004s
i.e. ~36 seconds
Now suppose I want to run this script multiple times, in parallel. For this I am using GNU Parallel.
Using the lscpu command tells me that I have 8 CPUs, with 2 threads per core and 4 cores per socket.
I create some file example.txt of the form,
time ./a.out
time ./a.out
time ./a.out
time ./a.out
...
which goes on for 8 lines.
I can then run these in parallel on 8 cores as,
parallel -j 8 :::: example.txt
In this case I would expect the runtime for each script to still be 36 seconds, and the total runtime to be ~36 seconds. However, in actuality what happens is the run time for each script roughly doubles.
If I instead run on 4 cores (-j 4), the problem disappears, and each script reverts to taking 36 seconds to run.
What is the cause of this? I have heard talk in the past on 'overheads' but I am not sure exactly what is meant by this.
What is happening is that you have only one socket with 4 physical cores in it.
Those are the real cores of your machine.
The total number of CPUs you see as output of lscpu is calculated using the following formula: #sockets * #cores_per_socket * #threads_per_core.
In your case it is 1*4*2=8.
Threads per core are a sort of virtual CPU, and they do not always perform like real CPUs, especially for compute-intensive processing (this feature is called hyper-threading).
Hence when you try to squeeze two compute-heavy threads onto each core, they end up being executed almost serially.
Take a look at this article for more info.
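If you would rather check this from code than read lscpu output, here is a small Python sketch; it assumes the third-party psutil package is available for the physical-core count (os.cpu_count() only reports logical CPUs):
# Logical vs. physical CPU counts (psutil is optional).
import os

print("logical CPUs:", os.cpu_count())       # what lscpu calls "CPU(s)": 8 in the question above

try:
    import psutil                            # third-party package, if installed
    print("physical cores:", psutil.cpu_count(logical=False))   # 4 in the question above
except ImportError:
    print("psutil not installed; see lscpu's 'Core(s) per socket' instead")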

inherent parallelism for a program

Hi I have a question regarding inherent parallelism.
Let's say we have a sequential program which takes 20 seconds to complete execution. Suppose the execution time consists of 2 seconds of setup time at the beginning and 2 seconds of finalization time at the end of the execution, and the remaining work can be parallelized. How do we calculate the inherent parallelism of this program?
How do you define "inherent parallelism"? I've not heard the term. We can talk about "possible speedup".
OP said "remaining work can be parallelized"... to what degree?
Can it run with infinite parallelism? If this were possible (it isn't practical), then the total runtime would be 4 seconds with a speedup of 20/4 --> 5.
If the remaining work can be run on N processors perfectly in parallel,
then the total runtime would be 4 + 16/N seconds. The ratio of that to 20 seconds is 20/(4 + 16/N), which can give pretty much any speedup from 1 (no speedup, at N = 1) up to 5 (the limiting case as N grows), depending on the value of N.
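Evaluating that expression for a few values of N makes the range concrete (this is just the formula above in Python):
# Speedup for a 20 s job with 4 s of serial setup/finalization and 16 s of parallelizable work.
def speedup(n):
    return 20 / (4 + 16 / n)

for n in (1, 2, 4, 8, 16, 1_000_000):
    print(n, round(speedup(n), 2))
# 1 -> 1.0, 2 -> 1.67, 4 -> 2.5, 8 -> 3.33, 16 -> 4.0, very large N -> close to 5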

Parallel-ForkManager, DBI. Faster than before forking, but still too slow

I have a very simple task of updating a database.
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(15);    # at most 15 children at a time
for my $line (@lines) {
    my $pid  = $pm->start and next;         # parent: launch child, move on
    my $dbh2 = $dbh->clone();               # each child needs its own handle
    my $sth2 = $dbh2->prepare("update db1 set field1=? where field2 =?");
    my ($field1, $field2) = very_slow_subroutine();
    $sth2->execute($field1, $field2);
    $pm->finish;
}
$pm->wait_all_children;
I could just use $dbh2->do, but I doubt that is the reason for the slowness.
What is interesting is that it starts these 15 processes (or however many I specify) very quickly, but right after that it slows down drastically. It is still noticeably faster than without forking, but I would expect more...
Edit:
very_slow_subroutine is a sub which gets an answer from a web service. The service can take anywhere from a fraction of a second to several seconds (on timeout) to answer, and I have to ask it dozens of thousands of times... that is the reason I would like to fork.
And in case it matters, I am on Linux.
Parallel::ForkManager doesn't magically make things faster, it just lets you run your code multiple times, and at the same time. In order to get the benefit out of it, you have to design your code for parallelism.
Think of it this way. It takes you 10 minutes to get to the store, shop, load your car, come back, and unload it. You need to get 5 loads. You alone can do it in 50 minutes. That is working in serial. 10 minutes * 5 trips one after the other = 50 minutes.
Let's say you get four friends to help. You all start off for the store at the same time. There's still 5 trips, and they still take 10 minutes, but because you did it in parallel the total time is only 10 minutes.
But it will never take less than 10 minutes, no matter how many trips you have to make or how many friends you get to help. That is why the process starts up fast, everybody gets into their cars and drives off to the store, but then nothing happens for a while because it still takes 10 minutes for everyone to do their job.
Same thing here. Your loop body takes X time to run. If you iterate through it Y times, it will take X * Y real world human time to run. If you run it in parallel Y times, ideally it will take just X time to run. Each parallel worker must still execute the full body of the loop taking X time.
In order to speed things up further, you have to break up the big bottleneck of very_slow_subroutine and make that work in parallel. Your SQL is so simple that is where you should focus your efforts at optimization and parallelism.
Let's say the store is really close; it's only a 1 minute drive (this is your SQL UPDATE), but shopping, loading and unloading takes 9 minutes (this is very_slow_subroutine). What if instead you had 5 cars and 15 friends? You load 3 people into each car. Driving to and from the store will take the same time, but now three people are working together to do the shopping, loading and unloading, taking only 4 minutes. Now each trip takes 5 minutes instead of 10.
This represents redesigning very_slow_subroutine to do its work in parallel. If it's just a big loop, you can put more workers on that loop. If it's a series of slow operations, you will have to redesign it to take advantage of parallel execution.
If you use too many workers you can clog up the system; it depends on what the bottleneck is. If it's CPU bound and you have 2 CPU cores, you'll probably see performance gains up to 3 to 5 workers ((cores * 2) + 1 is a good rule of thumb), and after that performance will drop off as the CPU spends more time switching between processes than doing work. If the bottleneck is IO, or an external service as is often the case with database and network calls, you can see great efficiencies by throwing many workers at the problem. While one process is waiting around for a disk or network operation, the others can be using your CPU.
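Since the bottleneck here is an external web service (IO-bound), here is a minimal Python sketch of the same idea; it is not the poster's Perl, and fake_web_call is a made-up stand-in for very_slow_subroutine. Because each worker spends almost all of its time waiting, far more workers than cores still pays off.
# Sketch: overlapping slow IO-bound calls with a thread pool.
import time
from concurrent.futures import ThreadPoolExecutor

def fake_web_call(item):
    time.sleep(1)                # pretend we are waiting on the web service
    return item, "answer"

items = list(range(30))

t0 = time.time()
with ThreadPoolExecutor(max_workers=15) as ex:    # far more workers than cores
    results = list(ex.map(fake_web_call, items))
print(f"{len(results)} calls in {time.time() - t0:.1f} s")   # ~2 s instead of ~30 s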
Whether parallelism can help depends on where your bottleneck is. If your CPU with 4 cores is the bottleneck, forking 4 processes might cause things to complete in about 1/4th the time under the best case scenario, but spawning 15 processes is not going to improve things much more.
If, more likely, your bottleneck is in I/O, starting 15 processes that compete for the same I/O is not going to help much, although in cases where you have tons of memory to use as file cache, some improvement might be possible.
To explore the limits on your system, consider the following program:
#!/usr/bin/env perl
use strict;
use warnings;
use Parallel::ForkManager;

run(@ARGV);

sub run {
    my $count = @_ ? $_[0] : 2;                   # max parallel workers
    my $pm = Parallel::ForkManager->new($count);
    for (1 .. 20) {                               # 20 jobs of one second each
        $pm->start and next;
        sleep 1;
        $pm->finish;
    }
    $pm->wait_all_children;
}
My ancient laptop has a single CPU with 2 cores. Let's see what I get:
TimeThis : Command Line : perl sleeper.pl 1
TimeThis : Elapsed Time : 00:00:20.735
TimeThis : Command Line : perl sleeper.pl 2
TimeThis : Elapsed Time : 00:00:06.578
TimeThis : Command Line : perl sleeper.pl 4
TimeThis : Elapsed Time : 00:00:04.578
TimeThis : Command Line : perl sleeper.pl 8
TimeThis : Elapsed Time : 00:00:03.546
TimeThis : Command Line : perl sleeper.pl 16
TimeThis : Elapsed Time : 00:00:02.562
TimeThis : Command Line : perl sleeper.pl 20
TimeThis : Elapsed Time : 00:00:02.563
So, running with a maximum of 20 processes gives me a total run time of just over 2.5 seconds for sleeping one second 20 times.
On the other hand, with just one process, sleeping one second 20 times took just over 20 seconds. That is a huge improvement, but it also indicates a management overhead of more than 150% when you have 20 processes each sleeping for one second.
This is in the nature of parallel programming. There are a lot of formal treatments out there on what you can expect, but Amdahl's Law is required reading.

FPGA timing question

I am new to FPGA programming and I have a question regarding the performance in terms of overall execution time.
I have read that latency is calculated in terms of cycle-time. Hence, overall execution time = latency * cycle time.
Since I want to optimize the time needed to process the data, I would be measuring the overall execution time.
Let's say I have a calculation a = b * c * d.
If I make it calculate in two cycles, (result1 = b * c) and (a = result1 * d), the overall execution time would be a latency of 2 * cycle time (which is determined by the delay of one multiplication, say value X) = 2X.
If I make the calculation in one cycle (a = b * c * d), the overall execution time would be a latency of 1 * cycle time (say value 2X, since the critical path now contains two multiplications instead of one) = 2X.
So, it seems that for optimizing the performance in terms of execution time, if I focus only on decreasing the latency, the cycle time would increase and vice versa. Is there a case where both latency and the cycle time could be decreased, causing the execution time to decrease? When should I focus on optimizing the latency and when should I focus on cycle-time?
Also, when I am programming in C++, it seems that when I want to optimize the code, I would like to optimize the latency (the cycles needed for the execution). However, it seems that for FPGA programming, optimizing the latency is not adequate, as the cycle time would increase. Hence, I should focus on optimizing the execution time (latency * cycle time). Am I correct in this, if I would like to increase the speed of the program?
Hope that someone would help me with this. Thanks in advance.
I tend to think of latency as the time from the first input to the first output. As there is usually a series of data, it is useful to look at the time taken to process multiple inputs, one after another.
With your example, to process 10 items doing a = b x c x d in one cycle (one cycle = 2t) would take 20t. However doing it in two 1t cycles, to process 10 items would take 11t.
Hope that helps.
Edit: add timing.
Calculation in one 2t cycle. 10 calculations.
Time     0  2  2  2  2  2  2  2  2  2  2   = 20t
Input    1  2  3  4  5  6  7  8  9 10
Output      1  2  3  4  5  6  7  8  9 10
Calculation in two 1t cycles, pipelined, 10 calculations
Time     0  1  1  1  1  1  1  1  1  1  1  1   = 11t
Input    1  2  3  4  5  6  7  8  9 10
Stage1      1  2  3  4  5  6  7  8  9 10
Output         1  2  3  4  5  6  7  8  9 10
Latency for both solutions is 2t: one 2t cycle for the first one, and two 1t cycles for the second one. However, the throughput of the second solution is twice as fast: once the latency is accounted for, you get a new answer every 1t cycle.
So if you had a complex calculation that required, say, five 1t cycles, then the latency would be 5t, but the throughput would still be one result per 1t cycle.
You need another word in addition to latency and cycle-time, which is throughput. Even if it takes 2 cycles to get an answer, if you can put new data in every cycle and get it out every cycle, your throughput can be increased by 2x over the "do it all in one cycle".
Say your calculation takes 40 ns in one cycle, so a throughput of 25 million data items/sec.
If you pipeline it (which is the technical term for splitting up the calculation into multiple cycles) you can do it in 2 lots of 20 ns + a bit (you lose a bit in the extra registers that have to go in). Let's say that bit is 10 ns (which is a lot, but makes the sums easy). So now it takes 2x20+10 = 50 ns => 20M items/sec. Worse!
But, if you can make the 2 stages independent of each other (in your case, not sharing the multiplier) you can push new data into the pipeline every 25 + a bit ns. This "a bit" will be smaller than the previous one, but even if it's the whole 10 ns, you can push data in every 35 ns, or nearly 30M items/sec, which is better than you started with.
In real life the 10 ns will be much less, often 100s of ps, so the gains are much larger.
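The throughput figures in that example are easy to reproduce (numbers copied from the paragraphs above, with the register overhead assumed to be the full 10 ns):
# Throughput for the example above (all times in nanoseconds).
single_cycle   = 40             # one 40 ns cycle
naive_pipeline = 2 * 20 + 10    # two 20 ns stages plus 10 ns of register overhead, in series
true_pipeline  = 25 + 10        # new data accepted every 25 ns plus the overhead

for ns in (single_cycle, naive_pipeline, true_pipeline):
    print(ns, "ns/item ->", round(1e9 / ns / 1e6, 1), "M items/sec")
# 40 ns -> 25.0 M, 50 ns -> 20.0 M, 35 ns -> 28.6 M items/sec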
George accurately described the meaning of latency (which does not necessarily relate to computation time). It seems you want to optimize your design for speed. This is very complex and requires much experience. The total runtime is
execution_time = (latency + (N * computation_cycles) ) * cycle_time
Where N is the number of calculations you want to perform. If you develop for acceleration you should only compute on large data sets, i.e. N is big. Usually you then don't have requirements on latency (which could be different in real-time applications). The determining factors are then the cycle_time and the computation_cycles. And here it is really hard to optimize, because there is a relation: the cycle_time is determined by the critical path of your design, and that path gets longer the fewer registers you have on it; the longer it gets, the bigger the cycle_time. But the more registers you have, the higher your computation_cycles (each register increases the number of required cycles by one).
Maybe I should add that the latency is usually the number of computation_cycles (it's the first computation that determines the latency), but in theory this can be different.
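To make the trade-off concrete, here is the formula evaluated for a few invented design points (none of these numbers come from the question; they only illustrate that the product of cycle count and cycle time is what matters): adding registers pays off only if the shorter critical path outweighs the extra cycles.
# execution_time = (latency + N * computation_cycles) * cycle_time
# Invented design points for N = 1,000,000 data items (times in ns).
def execution_time_ns(latency, computation_cycles, cycle_time_ns, n):
    return (latency + n * computation_cycles) * cycle_time_ns

N = 1_000_000
designs = [
    (1, 1, 8.0),   # few registers: 1 cycle per result, long 8.0 ns critical path
    (2, 2, 3.5),   # more registers: 2 cycles per result, 3.5 ns path -> faster overall
    (2, 2, 4.5),   # more registers, but the path is not short enough -> slower overall
]
for latency, cycles, cycle_ns in designs:
    t_ms = execution_time_ns(latency, cycles, cycle_ns, N) / 1e6
    print(f"latency={latency} cycles={cycles} cycle_time={cycle_ns} ns -> {t_ms:.1f} ms")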
