How to calculate execution time (speedup) - performance

I got stuck trying to calculate the speedup. The question given was:
Question 1
If 50% of a program is enhanced by 2 times and the rest 50% is enhanced by 4 times then what is the overall speedup due to the enhancements? Hints: Consider that the execution time of program in the machine before enhancement (without enhancement) is T. Then find the total execution time after the enhancements, T'. The speedup is T/T'.
The only thing I know is speedup = execution time before enhancement/execution time after enhancement. So can I assume the answer is:
Speedup = T/((50/100x1/2) + (50/100x1/4))
Total execution time after the enhancement = T + speedup
(50/100x1/2) because 50% was enhanced by 2 times and same goes to the 4 times.
Question 2
Let us hypothetically imagine that the execution of (2/3)rd of a program could be made to run infinitely fast by some kind of improvement/enhancement in the design of a processor. Then how many times the enhanced processor will run faster compared with the un-enhanced (original) machine?
Can I assume that it is 150 times faster as 100/(2/3) = 150
Any ideas? Thanks in advance.

Let's start with question 1.
The total time is the sum of the times for the two halves:
T = T1 + T2
Then, T1 is enhanced by a factor of two. T2 is improved by a factor of 4:
T' = T1' + T2'
= T1 / 2 + T2 / 4
We know that both T1 and T2 are 50% of T. So:
T' = 0.5 * T / 2 + 0.5 * T / 4
= 1/4 * T + 1/8 * T
= 3/8 * T
The speed-up is
T / T' = T / (3/8 T) = 8/3
Question two can be solved similarly:
T' = T1' + T2'
T1' is reduced to 0. T2 is the remaining 1/3 of T.
T' = 1/3 T
The speed-up is
T / T' = 3
Hence, the program is three times as fast as before (or two times faster).
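Both calculations follow the same pattern, so they can be checked with a few lines of Python. This is only a sketch: the function name overall_speedup and the (fraction, factor) input format are made up for illustration, and an infinitely fast part is modelled with math.inf.

```python
import math

def overall_speedup(parts):
    """parts: list of (fraction_of_original_time, speedup_factor) pairs.

    The fractions must sum to 1. A factor of math.inf means that part
    takes no time at all after the enhancement.
    """
    total_fraction = sum(fraction for fraction, _ in parts)
    if not math.isclose(total_fraction, 1.0):
        raise ValueError("fractions must sum to 1")
    # T' / T: each part's share of the original time divided by its speedup.
    new_time_ratio = sum(fraction / factor for fraction, factor in parts)
    return 1.0 / new_time_ratio

# Question 1: 50% enhanced 2x, 50% enhanced 4x.
q1 = overall_speedup([(0.5, 2), (0.5, 4)])
# Question 2: 2/3 runs infinitely fast, the remaining 1/3 is unchanged.
q2 = overall_speedup([(2 / 3, math.inf), (1 / 3, 1)])
print(q1, q2)
```

Up to floating-point rounding, the two results are 8/3 and 3, matching the derivation above.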

Related

PRAM CREW algorithm for counting odd numbers

So I try to solve the following task:
Develop a CREW PRAM algorithm for counting the odd numbers of a sequence of integers x_1,x_2,...x_n.
n is the number of processors - the complexity should be O(log n) and log_2 n is a natural number
My solution so far:
Input: A:={x_1,x_2,...,x_n} Output:=oddCount
begin
1. global_read(A(n),a)
2. if(a mod 2 != 0) then
oddCount += 1
The problem is that, due to CREW, I am not allowed to have multiple write instructions on the same cell at the same time. oddCount += 1 reads oddCount and then writes oddCount + 1, so there would be multiple concurrent writes.
Do I have to do something like this
Input: A:={x_1,x_2,...,x_n} Output:=oddCount
begin
1. global_read(A(n),a)
2. if(a mod 2 != 0) then
global_write(1, B(n))
3. if(n = A.length - 1) then
for i = 0 to B.length do
oddCount += B(i)
So first each process determines whether it has an odd or even number, and the last process calculates the sum? But how would this affect the complexity, and is there a better solution?
Thanks to libik I came to this solution: (n starts with 0)
Input: A:={x_1,x_2,...,x_n} Output:=A(0):=number of odd numbers
begin
1. if(A(n) mod 2 != 0) then
A(n) = 1
else
A(n) = 0
2. for i = 1 to log_2(n) do
if (n*(2^i)+2^(i-1) < A.length)
A(n*(2^i)) += A(n*(2^i) + (2^(i-1)))
end
i = 1 --> A(n * 2): 0 2 4 6 8 10 ... A(n*2 + 2^0): 1 3 5 7 ...
i = 2 --> A(n * 4): 0 4 8 12 16 ... A(n*4 + 2^1): 2 6 10 14 18 ...
i = 3 --> A(n * 8): 0 8 16 24 32 ... A(n*8 + 2^2): 4 12 20 28 36 ...
So the first if is the 1st Step and the for is representing log_2(n)-1 steps so over all there are log_2(n) steps. Solution should be in A(0).
Your solution is O(n), as there is a for loop that has to go through all the numbers (which means you don't utilize multiple processors at all).
CREW means that processors cannot write into the same cell at the same time (in your example cell = processor memory), but they can write into different cells at once.
So how to do it as fast as possible?
At initialization, every processor starts with 1 or 0 (depending on whether its number is odd or not).
In the first round, just sum the neighbours: x_2 with x_1, then x_4 with x_3, etc.
This is done in O(1), as every second processor p_x looks at processor p_x+1 in parallel and adds its 0 or 1 (whether there is an odd number or not).
Then processors p1, p3, p5, p7, ... each hold part of the solution. Do this again, but now p1 looks at p3, p5 looks at p7, and in general p_x looks at p_x+2.
Then only processors p1, p5, p9, etc. hold part of the solution.
Repeat the process. With every step the number of active processors halves, so you need log_2(n) steps.
In a real-life setting, the cost of synchronization is often taken into account as well. After each step, all processors have to synchronize so that they know they can start the next step: the same code runs on every processor, and a processor may only add the number from p_x after p_x has finished its own work.
You need either some kind of "clock" or synchronization.
In this example, the final complexity would be log(n)*k, where k is the complexity of synchronization.
The cost depends on the machine, or on the definition. One way to notify processors that you have finished is basically the same as the one described here for counting the odd numbers. That would also cost k = log(n), which would result in log^2(n).
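The pairwise summation described above can be simulated sequentially to check the result and the number of rounds. This is only a sketch, not a real PRAM program: the function name count_odd_crew is made up, cells are indexed from 0 instead of p1, and each round is modelled by computing all reads before doing any write, so no two "processors" ever write the same cell. It assumes the input length is a power of two, as in the question.

```python
def count_odd_crew(values):
    """Simulate the pairwise (tree) reduction; returns (odd_count, rounds)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    # Initialization: every processor writes 1 or 0 into its own cell.
    cells = [v % 2 for v in values]
    rounds = 0
    stride = 1
    while stride < n:
        # One synchronous round: all reads happen before any write.
        updates = {
            i: cells[i] + cells[i + stride]
            for i in range(0, n, 2 * stride)
        }
        for i, total in updates.items():
            cells[i] = total  # each active processor writes only its own cell
        stride *= 2
        rounds += 1
    return cells[0], rounds

count, rounds = count_odd_crew([1, 2, 3, 4, 5, 6, 7, 9])
print(count, rounds)
```

For the eight-element example the count ends up in cell 0 after log_2(8) = 3 rounds.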

Optimizing a program and calculating % of total execution time improved

So I was told to ask this on here instead of StackExchange:
If I have a program P, which runs on a 2GHz machine M in 30 seconds and is optimized by replacing all instances of 'raise to the power 4' with 3 instructions of multiplying x by. This optimized program will be P'. The CPI of multiplication is 2 and the CPI of power is 12. If there are 10^9 such operations optimized, what is the percent of total execution time improved?
Here is what I've deduced so far.
For P, we have:
time (30s)
CPI: 12
Frequency (2GHz)
For P', we have:
CPI (6) [2*3]
Frequency (2GHz)
So I need to figure out how to calculate the time of P' in order to compare the times. But I have no idea how to achieve this. Could someone please help me out?
Program P, which runs on a 2GHz machine M in 30 seconds and is optimized by replacing all instances of 'raise to the power 4' with 3 instructions of multiplying x by. This optimized program will be P'. The CPI of multiplication is 2 and CPI of power is 12. If there are 10^9 such operations optimized,
From this information we can compute the time needed to execute all POWER4 ("raise to the power 4") instructions. We have the total count of such instructions (all POWER4 instructions were replaced, so the count is 10^9, or 1 G). Every POWER4 instruction needs 12 clock cycles (CPI = cycles per instruction), so all POWER4 instructions were executed in 1G * 12 = 12G cycles.
A 2GHz machine has 2G cycles per second, and there are 30 seconds of execution, so the total execution of program P takes 2G*30 = 60 G cycles (60 * 10^9). We can conclude that P contains some other instructions. We don't know which instructions they are or how many times they execute, and there is no information about their mean CPI. But we know that the time needed to execute the other instructions is 60 G - 12 G = 48 G cycles (total program running time minus POWER4 running time; this is true for simple processors). There are some X executed instructions with mean CPI Y, so X*Y = 48 G.
So, total cycles executed for the program P is
Freq * seconds = POWER4_count * POWER4_CPI + OTHER_count * OTHER_mean_CPI
2G * 30 = 1G * 12 + X*Y
Or total running time for P:
30s = (1G * 12 + X*Y) / 2GHz
what is the percent of total execution time improved?
After replacing 1G POWER4 operations with three times as many MUL ("multiply by") instructions, we have 3G MUL operations. The cycles needed for them are CPI * count, where the MUL CPI is 2: 2*3G = 6G cycles. The X*Y part of P' is unchanged, so we can solve the problem.
P' time in seconds = ( MUL_count * MUL_CPI + OTHER_count * OTHER_mean_CPI ) / Frequency
P' time = (3G*2 + X*Y) / 2GHz = (6G + 48G) / 2GHz = 27 s
So P' runs in 27 s instead of 30 s, an improvement of 3 s / 30 s = 10% of the total execution time. The improvement is not as big as might be expected, because the POWER4 instructions in P take only part of the running time: 12G/60G. The optimization converted those 12G cycles to 6G without changing the remaining 48 G cycles. Halving only part of the time does not halve the total time.
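The cycle accounting above can be written out as a short Python sketch. The function name optimized_time and its parameter names are made up for illustration. Like the answer, it assumes a simple processor where total cycles are just the sum of count * CPI over all instructions.

```python
def optimized_time(freq_hz, old_time_s, op_count, old_cpi, new_cpi, new_ops_per_old):
    """Return (new_time_s, improvement_percent) after replacing one kind of operation."""
    total_cycles = freq_hz * old_time_s           # 2G * 30 = 60G
    old_op_cycles = op_count * old_cpi            # 1G * 12 = 12G
    other_cycles = total_cycles - old_op_cycles   # X*Y = 48G, unchanged in P'
    new_op_cycles = op_count * new_ops_per_old * new_cpi  # 3G * 2 = 6G
    new_time_s = (new_op_cycles + other_cycles) / freq_hz
    improvement_pct = 100 * (old_time_s - new_time_s) / old_time_s
    return new_time_s, improvement_pct

new_time, pct = optimized_time(
    freq_hz=2 * 10**9, old_time_s=30, op_count=10**9,
    old_cpi=12, new_cpi=2, new_ops_per_old=3,
)
print(new_time, pct)
```

With the numbers from the question this gives 27 seconds for P', which is 10% less than the original 30 seconds.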

Hungarian algorithm with multiple assignments

Let's say we're given N jobs and K workers to do those jobs. But for some jobs we need 2 employees, while for some we need just one. Also the employees can't do all jobs. For example worker 1 can do jobs 1,2 and 5, while not jobs 3 and 4. Also if we hire worker 1 to do job 1, then we want him to do jobs 2 and 5, since we've already paid him.
So for example let's say we have 5 jobs and 6 workers. For jobs 1,2 and 4 we need 2 men, while for jobs 3 and 5 we need just one. And here's the list of the jobs every worker can do and the wage he requires.
Worker 1 can do jobs 1,3,5 and he requires 1000 dollars.
Worker 2 can do jobs 1,5 and he requires 2000 dollars.
Worker 3 can do jobs 1,2 and he requires 1500 dollars.
Worker 4 can do jobs 2,4 and he requires 2500 dollars.
Worker 5 can do jobs 4,5 and he requires 1500 dollars.
Worker 6 can do jobs 3,5 and he requires 1000 dollars.
After a little calculation and logical thinking we can conclude that we have to hire workers 1, 3, 4 and 5, which means that the minimum total wage we need to pay is 1000+1500+2500+1500=6500 dollars.
But how can we find an efficient algorithm that will output that amount? This somehow reminds me of the Hungarian algorithm, but all those additional constraints make it impossible for me to apply it.
We can represent the state of all jobs as a number in the ternary system (2: two people remaining, 1: one person remaining, 0: the job is already covered). Now we can compute f(mask, k) = the smallest cost of hiring some workers among the first k in such a way that the state of the remaining jobs is mask. The transitions are as follows: we either go to (mask, k + 1) (not hiring the current worker) or to (new_mask, k + 1) (in this case we pay this worker his salary and let him do all the jobs he can). The answer is f(0, K).
The time complexity is O(3^N * K * N).
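A minimal Python sketch of this DP follows. It makes two simplifications for readability: a state is a tuple of remaining head counts per job rather than a ternary-encoded integer, and only reachable states are stored in a dictionary. The function name min_hiring_cost and the input format are made up for illustration.

```python
def min_hiring_cost(needs, workers):
    """needs[j]: workers still required for job j (0, 1 or 2).
    workers: list of (wage, set_of_job_indices), jobs indexed from 0.
    Returns the minimum total wage, or None if the jobs cannot all be staffed.
    """
    best = {tuple(needs): 0}  # f(mask, k) for the workers processed so far
    for wage, jobs in workers:
        nxt = dict(best)  # transition 1: do not hire this worker
        for state, cost in best.items():
            # transition 2: hire the worker and let them do every job they can
            new_state = tuple(
                max(0, left - 1) if j in jobs else left
                for j, left in enumerate(state)
            )
            if cost + wage < nxt.get(new_state, float("inf")):
                nxt[new_state] = cost + wage
        best = nxt
    return best.get(tuple(0 for _ in needs))

# The example from the question: jobs 1, 2 and 4 need two workers, jobs 3 and 5 need one.
needs = [2, 2, 1, 2, 1]
workers = [
    (1000, {0, 2, 4}),  # worker 1: jobs 1, 3, 5
    (2000, {0, 4}),     # worker 2: jobs 1, 5
    (1500, {0, 1}),     # worker 3: jobs 1, 2
    (2500, {1, 3}),     # worker 4: jobs 2, 4
    (1500, {3, 4}),     # worker 5: jobs 4, 5
    (1000, {2, 4}),     # worker 6: jobs 3, 5
]
print(min_hiring_cost(needs, workers))
```

For this example the sketch returns 6500, which corresponds to hiring workers 1, 3, 4 and 5 (1000 + 1500 + 2500 + 1500).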
Here is an idea for optimizing it further (and getting rid of the N factor). Let's assume that the current mask is mask and the man can do the jobs from another mask'. We could actually simply add mask to mask', but there is one problem: the positions where there was a 2 in mask and a 1 in mask' will get broken. But we can fix this: for each mask, let's precompute a binary mask allowed_mask that contains all positions where the digit is not 2. For each man and for each allowed_mask we can precompute that mask' value. Now each transition is just one addition:
for i = 0 ... k - 1
    for mask = 0 ... 3^n - 1
        allowed_mask = precomputed_allowed_mask[mask]
        // make a transition to (i + 1, mask + add_for_allowed_mask[i][allowed_mask])
        // make a transition to (i + 1, mask)
Note that there are only 2^n allowed masks. So the time complexity of this solution is O(3^N * N + T * 2^N * K * N + T * 3^N * K) (the first term is for precomputing allowed_mask for every ternary mask, the second one is for precomputing mask' for all allowed_masks and people, and the last is for the dp itself).

Using dynamic programming to find minimum cost of a pipeline

I am learning dynamic programming and I have this problem that I can't understand.
We have a pipeline that is connected by depots (pumps). The pipeline is linear and each depot is assigned a value. For example, for 4----5----1----2 (a pipeline, with the numbers representing pumps/depots), the total value of the pipeline is computed as 4*5 + 4*1 + 4*2 + 5*1 + 5*2 + 1*2 = 49.
We are now given the problem of cutting the water supply at such a point that we are left with the minimum value of the pipeline. For example, cutting between 5 and 1 gives us 22 (4---5--/--1---2 gives 4*5 + 1*2 = 22),
while cutting between 4 and 5 gives 17 (4--/--5--1---2 gives 5*1 + 5*2 + 1*2 = 17). We are allowed a limited number of cuts, and we are to devise a dynamic programming algorithm that gives us the least pipeline cost.
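One possible way to set this up, sketched in Python under assumptions the question does not state: depot values are non-negative integers, "limited number of cuts" means at most max_cuts cuts, and a straightforward O(max_cuts * n^2) DP is acceptable. The function name min_pipeline_value is made up. The value of an uncut segment is computed as (sum^2 - sum of squares) / 2, which equals the sum of products over all pairs.

```python
def min_pipeline_value(values, max_cuts):
    """Minimum total pipeline value using at most max_cuts cuts."""
    n = len(values)
    # Prefix sums of values and of squares give each segment's value in O(1).
    pre = [0] * (n + 1)
    pre_sq = [0] * (n + 1)
    for i, v in enumerate(values):
        pre[i + 1] = pre[i] + v
        pre_sq[i + 1] = pre_sq[i] + v * v

    def cost(i, j):
        """Value of the uncut segment values[i:j] (sum of pairwise products)."""
        s = pre[j] - pre[i]
        return (s * s - (pre_sq[j] - pre_sq[i])) // 2

    # dp[i] = minimum value of the first i depots with the cuts allowed so far.
    dp = [cost(0, i) for i in range(n + 1)]  # zero cuts
    for _ in range(max_cuts):
        new_dp = dp[:]  # using fewer cuts is always allowed
        for i in range(1, n + 1):
            for j in range(1, i):
                # Place the last cut between depot j and depot j + 1.
                new_dp[i] = min(new_dp[i], dp[j] + cost(j, i))
        dp = new_dp
    return dp[n]

print(min_pipeline_value([4, 5, 1, 2], 0))
print(min_pipeline_value([4, 5, 1, 2], 1))
```

For the example pipeline this reproduces 49 with no cuts and 17 with one cut (between 4 and 5).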

Scheduling: advance deadline for implicit-deadline rate monotonic algorithm

Given a set of tasks:
T1(20,100) T2(30,250) T3(100,400) (execution time, deadline=period)
Now I want to constrict the deadlines as Di = f * Pi where Di is new deadline for ith task, Pi is the original period for ith task and f is the factor I want to figure out. What is the smallest value of f that the tasks will continue to meet their deadlines using rate monotonic scheduler?
This schema will repeat (synchronize) every 2000 time units. During this period
T1 must run 20 times, requiring 400 time units.
T2 must run 8 times, requiring 240 time units.
T3 must run 5 times, requiring 500 time units.
Total is 1140 time units per 2000 time unit interval.
f = 1140 / 2000 = 0.57
This assumes long-running tasks can be interrupted and resumed, to allow shorter-running tasks to run in between. Otherwise there will be no way for T1 to meet its deadline once T3 has started.
The updated deadlines are:
T1(20,57)
T2(30,142.5)
T3(100,228)
These will repeat every lcm(57, 142.5, 228) = 1140 time units, and require the same time (1140 time units) to complete.
A small simplification: When calculating factor, the period-time cancels out. This means you don't really need to calculate the period to get the factor:
Period = 2000
Required time = (Period / 100) * 20 + (Period / 250) * 30 + (Period / 400) * 100
f = Required time / Period = 20 / 100 + 30 / 250 + 100 / 400 = 0.57
f = Sum(Duration[i] / Period[i])
To calculate the period, you could do this:
Period(T1,T2) = lcm(100, 250) = 500
Period(T1,T2,T3) = lcm(500, 400) = 2000
where lcm(x,y) is the Least Common Multiple.
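The arithmetic in this answer can be reproduced with a short Python sketch. It only recomputes the factor as the sum of Duration[i] / Period[i] and the repeat period as a least common multiple. It does not check schedulability on its own. The helper names lcm, factor and hyperperiod are made up for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)

def factor(tasks):
    """f = Sum(Duration[i] / Period[i]) for tasks given as (duration, period)."""
    return sum(Fraction(duration, period) for duration, period in tasks)

def hyperperiod(tasks):
    """Interval after which the whole schedule repeats."""
    return reduce(lcm, (period for _, period in tasks))

tasks = [(20, 100), (30, 250), (100, 400)]
f = factor(tasks)
period = hyperperiod(tasks)
# Total execution time demanded within one hyperperiod.
required = sum(period // p * d for d, p in tasks)
print(f, period, required)
```

For the three tasks this gives f = 57/100, a period of 2000 and a required time of 1140, matching the numbers above.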

Resources