I have 6 processes as follows:
-- P0 --
arrival time = 0
burst time = 10
-- P1 --
arrival time = 110
burst time = 210
-- P2 --
arrival time = 130
burst time = 70
-- P3 --
arrival time = 130
burst time = 70
-- P4 --
arrival time = 130
burst time = 90
-- P5 --
arrival time = 130
burst time = 50
How can I calculate the waiting time and turnaround time for each process? The system should be non-preemptive (the process gets the CPU until it's done). Also: there are 4 logical processors in this system.
Assume systemTime is the current system's uptime, and arrivalTime is relative to that. I.e., an arrivalTime of 0 means the process starts when the system does; an arrivalTime of 130 means the process is started 130 units after the system starts.
Is this correct: waitingTime = (systemTime - arrivalTime) ?
My reasoning is that systemTime - arrivalTime is the time the process has been waiting in the FCFS queue to use the CPU (or is this wrong?).
And for turnaround time, I was thinking something like: turnaroundTime = burstTime + waitingTime, since the waiting time and the burst time should be the total time to complete the process. Though once again I don't know if my intuition is correct.
Any and all readings would be greatly appreciated!
For a non-preemptive system:
waitingTime = startTime - arrivalTime
turnaroundTime = burstTime + waitingTime = finishTime - arrivalTime
startTime = Time at which the process started executing
finishTime = Time at which the process finished executing
You can keep track of the current time elapsed in the system (timeElapsed). Assign each processor a process in the beginning, and execute until the shortest of the running processes is done. Then assign the processor that has just become free to the next process in the queue. Do this until the queue is empty and all processes are done executing. Also, whenever a process starts executing, record its startTime, and when it finishes, record its finishTime (both equal to timeElapsed at that moment). That way you can calculate what you need.
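For illustration, here is a minimal Python sketch of that simulation, using the processes from the question and the 4 logical processors (the tie-breaking order for processes arriving at the same time is an assumption, not something stated in the question):

import heapq

# (arrivalTime, burstTime) for P0..P5, taken from the question.
processes = [(0, 10), (110, 210), (130, 70), (130, 70), (130, 90), (130, 50)]
NUM_CPUS = 4

# Min-heap of the times at which each logical processor becomes free.
free_at = [0] * NUM_CPUS
heapq.heapify(free_at)

print("PID  start  finish  waiting  turnaround")
for pid, (arrival, burst) in enumerate(processes):
    earliest_free = heapq.heappop(free_at)
    start = max(arrival, earliest_free)   # wait in the FCFS queue until a CPU frees up
    finish = start + burst                # non-preemptive: runs to completion
    heapq.heappush(free_at, finish)
    waiting = start - arrival             # waitingTime = startTime - arrivalTime
    turnaround = finish - arrival         # turnaroundTime = burstTime + waitingTime
    print(f"P{pid}  {start:5}  {finish:6}  {waiting:7}  {turnaround:10}")

With your numbers, only P5 ever has to wait (all four processors are busy when it arrives), so its waiting time is nonzero while the others are 0.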
wt = tt - cpu time
tt = cpu time + wt
where wt is the waiting time and tt is the turnaround time. CPU time is also called burst time.
Given:
Operation time required by:
memory units: 200 ps
ALU and adders: 100 ps
Register file: 50 ps
other units and wires: no delay
Instruction mix and operation time in ps:
25% loads (600 ps)
10% stores (550 ps)
45% ALU instructions (400 ps)
15% branches (350 ps)
5% jumps (200 ps)
Every instruction executes in 1 clock cycle
Two implementations: fixed length and variable length
Which implementation would be faster and by how much?
Solution
Reference Table
Rule: CPU execution time = IC * CPI * CCT
Since CPI = 1...
CPU execution time = IC * CCT
My questions are:
What does it mean when an implementation has variable / fixed length?
How were the values for CPU execution time(single clock) calculated?
What does it mean when an implementation has variable / fixed length?
Fixed length clock means that each clock cycle has the same period, irrespective of the instruction being executed. Variable length clock means that different clock cycles may have different periods, depending on the instruction being executed.
So in a fixed clock design, the clock cycle has to be at least 600 ps, which is the longest time any instruction would take to execute (the load instruction). In a variable clock design, we can calculate the average clock cycle as follows:
Average CPU clock cycle = 600*25% + 550*10% + 400*45% + 350*15% + 200*5% = 447.5 ps
How were the values for CPU execution time(single clock) calculated?
To determine which implementation is faster, you need to measure speedup, which is defined as:
Speedup = CPU execution time(single) / CPU execution time(variable)
Using the definition of CPU execution time we get (note that the number of instructions is the same):
Speedup = CPU execution time(single) / CPU execution time(variable)
= (Instruction count * Clock cycle time(single)) / (Instruction count * Clock cycle time(variable))
= Clock cycle time(single) / Clock cycle time(variable)
= 600 / 447.5 = 1.34
So the variable clock design is 1.34 times faster.
Regarding CPU execution time(variable)
CPU execution time(variable) is technically equal to the sum of the individual clock cycle times of each executed instruction. But we used the average clock cycle time instead to calculate speedup. Will we get the same result either way? Let's find out!
Assume there are N executed instructions and let C1, C2, ..., CN denote the cycle times of each of them, respectively. Hence:
CPU execution time(variable) = C1 + C2 + ... + CN
= 600*25%*N + 550*10%*N + 400*45%*N + 350*15%*N + 200*5%*N
= N * average CPU clock cycle
So they are the same.
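If you want to sanity-check the arithmetic, here is a short Python sketch (the instruction mix and operation times are from the question; the variable names are mine):

# (fraction of instructions, operation time in ps) for each instruction class.
mix = {"load": (0.25, 600), "store": (0.10, 550), "alu": (0.45, 400),
       "branch": (0.15, 350), "jump": (0.05, 200)}

fixed_cct = max(t for _, t in mix.values())         # 600 ps: the slowest instruction sets the clock
variable_cct = sum(f * t for f, t in mix.values())  # 447.5 ps: the weighted average cycle time

speedup = fixed_cct / variable_cct                  # ~1.34, independent of the instruction count
print(fixed_cct, variable_cct, round(speedup, 2))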
I think I'm misunderstanding something about Lamport timestamps. It appears that the algorithm expects messages to take the same time to travel between distributed endpoints.
Let's say that process p1 sends messages m1 and m2 sequentially to process p2. Following the pseudocode in the Algorithm section of the article, we have:
# we start at 0
time(p1) = 0
# we send m1
time(p1) = time(p1) + 1 = 1
send(m1, 1)
# we send m2
time(p1) = time(p1) + 1 = 2
send(m2, 2)
If m1 reaches p2 before m2 everything is fine. But if m2 comes first, we get:
# we start at 0
time(p2) = 0
# we receive m2 first
time(p2) = max(2, time(p2)) + 1 = max(2, 0) + 1 = 3
# we receive m1 second
time(p2) = max(1, time(p2)) + 1 = max(1, 3) + 1 = 4
So in p2's local time (time(p2)), m2 has a time of 3 and m1 has a time of 4. That is the opposite of the order in which the messages were originally sent.
Am I missing something fundamental or do Lamport timestamps require consistent travel times to work?
The time of a message is the time contained in the message, not the time of the process receiving the message.
The timer is logically shared between the two processes, so it needs to count events (sends) by both processes, which is why the receiver adds one to the time when it receives a message.
The algorithm attempts to maintain the two processes' timers in sync, even if messages are lost or delayed, which is why the receiver takes the maximum of its view of the time and the sender's view of the time found in the message. If these are not the same, some message has been lost or delayed. That will result in the two processes having different views of the time, but the clocks will be resynchronised when the next message is sent and received (in either direction).
If the two processes simultaneously send a message, the two messages will contain the same time, which is why it is not a total order. But the clocks will still eventually be resynchronised.
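Here is a minimal Python sketch of those rules (not production code), replaying the scenario from the question:

class LamportClock:
    def __init__(self):
        self.time = 0

    def send(self):
        # Count the send event, then attach the new timestamp to the message.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Resynchronise with the sender's view, then count the receive event.
        self.time = max(self.time, msg_time) + 1
        return self.time

p1, p2 = LamportClock(), LamportClock()
m1 = p1.send()   # m1 carries timestamp 1
m2 = p1.send()   # m2 carries timestamp 2
p2.receive(m2)   # p2's clock becomes 3
p2.receive(m1)   # p2's clock becomes 4
print(m1, m2)    # 1 2 -- the messages' own timestamps still reflect the send order

The key point from above: you order messages by the timestamps they carry (1 and 2), not by the receiver's clock value at the moment of receipt.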
I ran a MapReduce job on a Hadoop cluster. The job's running times I saw in the browser at master:8088 and master:19888 (the job history server web UI) are shown below:
master:8088
master:19888
I have two questions:
Why are the elapsed times from two pictures different?
Why sometimes the Average Reduce Time is a negative number?
It looks like the Average Reduce Time is based on the times the previous tasks (shuffle/merge) took to finish and not necessarily the amount of time the reduce actually took to run.
Looking at this source code you can see the relevant calculations occurring around line 300.
if (attempt.getState() == TaskAttemptState.SUCCEEDED) {
numReduces++;
avgShuffleTime += (attempt.getShuffleFinishTime() - attempt.getLaunchTime());
avgMergeTime += attempt.getSortFinishTime() - attempt.getShuffleFinishTime();
avgReduceTime += (attempt.getFinishTime() - attempt.getSortFinishTime());
}
Followed by:
if (numReduces > 0) {
avgReduceTime = avgReduceTime / numReduces;
avgShuffleTime = avgShuffleTime / numReduces;
avgMergeTime = avgMergeTime / numReduces;
}
Looking at your numbers, they seem to be generally in line with this approach to calculating the run times (everything converted to seconds):
Total Pre-reduce time = Map Run Time + Ave Shuffle + Ave Merge
143 = 43 + 83 + 17
Ave Reduce Time = Elapsed Time - Total Pre-reduce
-10 = 133 - 143
So looking at how long the Map, Shuffle and Merge took compared with the Elapsed time, we end up with a negative number close to your -8.
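As a quick sanity check of that arithmetic in Python (values in seconds, read off your screenshots; this just mirrors the calculation above):

map_run_time = 43
avg_shuffle  = 83
avg_merge    = 17
elapsed_time = 133

total_pre_reduce = map_run_time + avg_shuffle + avg_merge   # 143
avg_reduce_est = elapsed_time - total_pre_reduce            # -10, close to the -8 the UI shows
print(total_pre_reduce, avg_reduce_est)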
This is a partial answer, only for question 1!
I see a difference in "Submitted" and "Started" of 8 seconds in the second picture, while the time "Started" in the first picture is equal to the "Submitted" time of the second. I guess this covers the 8-second difference that you see as "Elapsed" time.
I am very curious about the second question as well, but it may not be a coincidence that it is also 8 seconds.
The following two code snippets perform the same task (generating M samples uniformly from an N-dimensional sphere). I was wondering why the second one takes much more time than the first.
%% MATLAB R2014a
M = 30;
N = 10000;
#1
tic
S = zeros(M, N);
for k = 1:M
P = ones(1, N);
for i = 1:N - 1
t = rand*2*pi;
P(1:i) = P(1:i)*sin(t);
P(i+1) = P(i+1)*cos(t);
end
S(k,:) = P;
end
toc
#2
tic
S = ones(M, N);
for k = 1:M
for i = 1:N - 1
t = rand*2*pi;
S(k, 1:i) = S(k, 1:i)*sin(t);
S(k, i+1) = S(k, i+1)*cos(t);
end
end
toc
The output is:
Elapsed time is 15.007667 seconds.
Elapsed time is 59.745311 seconds.
And I also tried M = 1,
Elapsed time is 0.463370 seconds.
Elapsed time is 1.566913 seconds.
#2 is nearly 4 times slower than #1. Is the frequent 2-D element access in #2 making it so time-consuming?
The time difference is due to memory access patterns, and how well they map onto the cache. And also possibly to MATLAB's exploitation of your hardware vector unit (SSE/AVX). MATLAB stores matrices "column-major", meaning S(2,1) is next to S(1,1).
In #1, you process each sample using the vector P, which lives in contiguous memory. These 80,000 bytes fit easily in L2 cache for the fast repeated access you need to perform. They're also neighbors, and trivially vectorized (I'm not certain if MATLAB performs this optimization, but I'd hope so...)
In #2, you access a row of S at a time, which is not contiguous but rather strided by M elements. So each row is spread across 30*80,000 bytes, which does not fit in L2 cache. It'll have to be read back in for each repeated access, even though you're ignoring 29/30 of the values in that data.
Here's the test. All I'm doing is transposing S so that you can process a column at a time instead, then transposing it back at the end just to get the same result:
#3
tic
S = ones(N, M);
for k = 1:M
for i = 1:N - 1
t = rand*2*pi;
S(1:i, k) = S(1:i, k)*sin(t);
S(i+1, k) = S(i+1, k)*cos(t);
end
end
S = S.';
toc
Results:
Elapsed time is 11.254212 seconds.
Elapsed time is 45.847750 seconds.
Elapsed time is 11.501580 seconds.
Yep, transposing S gets us the same contiguous access and performance as the separate vector approach. By the way, an L3 access costs about 4x more clock cycles than an L2 access...
Let's see if we can find any breakpoints related to cache size. Here's N = 1000, where everything should fit in L2:
Elapsed time is 0.240184 seconds.
Elapsed time is 0.373448 seconds.
Elapsed time is 0.258566 seconds.
Much lower difference, though now we're probably into L1 effects.
Finally, here's a completely different way to solve your problem. It relies on the fact that multivariate normal RVs have the correct symmetry: the standard multivariate normal distribution is rotationally invariant, so normalizing a Gaussian vector to unit length gives a uniformly distributed point on the sphere.
#4
tic
S = randn(M, N);
S = bsxfun(@rdivide, S, sqrt(sum(S.*S, 2)));
toc
Elapsed time is 10.714104 seconds.
Elapsed time is 45.351277 seconds.
Elapsed time is 11.031061 seconds.
Elapsed time is 0.015068 seconds.
I suspect the advantage comes from using a hard-coded 1 as the first index in the array access. If you try M=1 you will still see a significant speed-up for the sin(t) line. My guess is that the assembly under the hood can use immediate operands instead of reloading the variable k into a register.
I'm trying to calculate the execution time of an application. Assume the only stall penalty occurs on memory-access instructions (the penalty being 100 cycles).
How am I supposed to find out execution time in seconds with this info?
CPI (CPUCycles?) = 1.0
ClockRate = 1GHZ
TotalInstructions = 59880
MemoryAccessInstructions = 8467
CacheMissRate = 62% (0.62) (5290/8467)
CacheHits = 3117
CacheMisses = 5290
CacheMissPenalty = 100 (cycles)
Assuming no other penalties.
totalCycles = TotalInstructions + CacheMisses * CacheMissPenalty ?
I assume that cache hits cost the same as other opcodes, so those are included in TotalInstructions.
That's then 588880 cycles, and 1 GHz is 1000000000 cycles per second.
So that code will take 0.58888 ms to execute (5.8888e-4 seconds).
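As a quick check of that arithmetic in Python (numbers taken from the question):

total_instructions = 59880
cache_misses = 5290
miss_penalty = 100             # stall cycles per cache miss
clock_rate_hz = 1_000_000_000  # 1 GHz

total_cycles = total_instructions + cache_misses * miss_penalty  # 588880 cycles
exec_time_s = total_cycles / clock_rate_hz                       # 0.00058888 s = 0.58888 ms
print(total_cycles, exec_time_s)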
This value is of course a purely theoretical estimate, as a modern CPU doesn't work like that (1 instruction = 1 cycle). If you are interested in real-world values, just profile it.