Performance of a single-cycle processor

Given:
Operation time required by:
memory units: 200 ps
ALU and adders: 100 ps
Register file: 50 ps
other units and wires: no delay
Instruction mix and operation time in ps:
25% loads (600 ps)
10% stores (550 ps)
45% ALU instructions (400 ps)
15% branches (350 ps)
5% jumps (200 ps)
Every instruction executes in 1 clock cycle
Two implementations: fixed length and variable length
Which implementation would be faster and by how much?
Solution
Reference table: instruction mix and per-class operation times as listed above.
Rule: CPU execution time = IC * CPI * CCT
Since CPI = 1:
CPU execution time = IC * CCT
My questions are:
What does it mean when an implementation has variable / fixed length?
How were the values for CPU execution time(single clock) calculated?

What does it mean when an implementation has variable / fixed length?
A fixed-length clock means that each clock cycle has the same period, irrespective of the instruction being executed. A variable-length clock means that different clock cycles may have different periods, depending on the instruction being executed.
So in a fixed clock design, the clock cycle has to be at least 600 ps, which is the longest time any instruction would take to execute (the load instruction). In a variable clock design, we can calculate the average clock cycle as follows:
Average CPU clock cycle = 600*25% + 550*10% + 400*45% + 350*15% + 200*5% = 447.5 ps
How were the values for CPU execution time(single clock) calculated?
To determine which implementation is faster, you need to measure speedup, which is defined as:
Speedup = CPU execution time(single) / CPU execution time(variable)
Using the definition of CPU execution time we get (note that the number of instructions is the same):
Speedup = CPU execution time(single) / CPU execution time(variable)
= (Instruction count * Clock cycle time(single)) / (Instruction count * Clock cycle time(variable))
= Clock cycle time(single) / Clock cycle time(variable)
= 600 / 447.5 = 1.34
So the variable clock design is 1.34 times faster.
Regarding CPU execution time(variable)
CPU execution time(variable) is technically equal to the sum of the individual clock cycle times of each executed instruction. But we used the average clock cycle time instead to calculate the speedup. Will we get the same result either way? Let's find out!
Assume there are N executed instructions and let C1, C2, ..., CN denote the cycle times of each of them, respectively. Hence:
CPU execution time(variable) = C1 + C2 + ... + CN
= 600*25%*N + 550*10%*N + 400*45%*N + 350*15%*N + 200*5%*N
= N * average CPU clock cycle
So they are the same.
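As a quick numerical check, here is a minimal Python sketch (my addition, not part of the original answer) that reproduces the average cycle time and the speedup:

# Instruction mix: (fraction of instructions, per-instruction time in ps).
mix = {
    "load":   (0.25, 600),
    "store":  (0.10, 550),
    "alu":    (0.45, 400),
    "branch": (0.15, 350),
    "jump":   (0.05, 200),
}

avg_cycle   = sum(frac * t for frac, t in mix.values())  # 447.5 ps (variable clock)
fixed_cycle = max(t for _, t in mix.values())            # 600 ps (slowest instruction)
speedup     = fixed_cycle / avg_cycle                    # ~1.34

print(round(avg_cycle, 1), fixed_cycle, round(speedup, 2))  # 447.5 600 1.34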

Related

How to compute the achieved FLOPS of an MPI program which calls cuBLAS functions

I am accelerating an MPI program using cuBLAS functions. To evaluate the application's efficiency, I want to know the FLOPS, memory usage, and other GPU statistics after the program has run, especially the FLOPS.
I have read the relevant question: How to calculate Gflops of a kernel. I think the answers give two ways to calculate the FLOPS of a program:
The model count of an operation divided by the cost time of the operation
Using NVIDIA's profiling tools
The first solution doesn't depend on any tools. But I'm not sure about the meaning of model count. Is it O(f(N))? For example, is the model count of GEMM O(N^3)? And if I multiply two matrices of 4 x 5 and 5 x 6 and the elapsed time is 0.5 s, is the model count 4 x 5 x 6 = 120, so the FLOPS is 120 / 0.5 = 240?
The second solution uses nvprof, which is now deprecated and replaced by Nsight Systems and Nsight Compute. But those two tools only work on a CUDA program, not on an MPI program that launches CUDA functions. So I am wondering whether there is a tool to profile a program that launches CUDA functions.
I have been searching for an answer to this question for two days but still can't find an acceptable solution.
But I'm not sure about the meaning of model count. Is it O(f(N))? For example, is the model count of GEMM O(N^3)? And if I multiply two matrices of 4 x 5 and 5 x 6 and the elapsed time is 0.5 s, is the model count 4 x 5 x 6 = 120, so the FLOPS is 120 / 0.5 = 240?
The standard BLAS GEMM operation is C <- alpha * (A dot B) + beta * C. For A (m by k), B (k by n), and C (m by n), each inner product of a row of A and a column of B, multiplied by alpha, costs 2 * k + 1 flop; there are m * n such inner products in A dot B, and adding beta * C to that product costs another 2 * m * n flop. So the total model FLOP count is (2 * k + 3) * (m * n) when alpha and beta are both non-zero.
For your example, assuming alpha = 1 and beta = 0, and assuming the implementation is smart enough to skip the extra operations (and most are), the GEMM flop count is (2 * 5) * (4 * 6) = 240. If the execution time is 0.5 seconds, the model arithmetic throughput is 240 / 0.5 = 480 flop/s.
I would recommend using that approach if you really need to calculate the performance of GEMM (or other BLAS/LAPACK operations). This is the way most of the computational linear algebra literature and benchmarking has worked since the 1970s, and it is how most reported results you will find are calculated, including the HPC LINPACK benchmark.
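A small Python sketch of that counting convention (my addition; the function name and the alpha = 1, beta = 0 shortcut are illustrative only):

# Model FLOP count for C <- alpha*(A dot B) + beta*C, following the convention above.
def gemm_model_flops(m, k, n, alpha=1.0, beta=0.0):
    if alpha != 0 and beta != 0:
        return (2 * k + 3) * m * n   # full update with non-zero alpha and beta
    return 2 * k * m * n             # alpha = 1, beta = 0: plain matrix product

flops   = gemm_model_flops(4, 5, 6)  # 240 for the 4 x 5 times 5 x 6 example
elapsed = 0.5                        # seconds, from the question
print(flops, flops / elapsed)        # 240 480.0 flop/s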
The Nsight Systems documentation section Using the CLI to Analyze MPI Codes states clearly how to use nsys to collect runtime information for an MPI program.
And the GitLab project Roofline Model on NVIDIA GPUs uses ncu to collect the measured FLOPS and memory usage of a program. The methodology for computing these metrics is:
Time:
  sm__cycles_elapsed.avg / sm__cycles_elapsed.avg.per_second
FLOPs:
  DP: sm__sass_thread_inst_executed_op_dadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_dfma_pred_on.sum + sm__sass_thread_inst_executed_op_dmul_pred_on.sum
  SP: sm__sass_thread_inst_executed_op_fadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_ffma_pred_on.sum + sm__sass_thread_inst_executed_op_fmul_pred_on.sum
  HP: sm__sass_thread_inst_executed_op_hadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_hfma_pred_on.sum + sm__sass_thread_inst_executed_op_hmul_pred_on.sum
  Tensor Core: 512 x sm__inst_executed_pipe_tensor.sum
Bytes:
  DRAM: dram__bytes.sum
  L2: lts__t_bytes.sum
  L1: l1tex__t_bytes.sum
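For illustration, here is a short Python sketch (my addition) that combines the double-precision counters into FLOP/s using the formulas above; the metric values are made-up placeholders standing in for numbers reported by ncu:

# Placeholder counter values; real ones come from Nsight Compute (ncu) metric output.
metrics = {
    "sm__cycles_elapsed.avg": 1.0e7,
    "sm__cycles_elapsed.avg.per_second": 1.4e9,
    "sm__sass_thread_inst_executed_op_dadd_pred_on.sum": 1.0e8,
    "sm__sass_thread_inst_executed_op_dfma_pred_on.sum": 5.0e8,
    "sm__sass_thread_inst_executed_op_dmul_pred_on.sum": 1.0e8,
}

time_s = metrics["sm__cycles_elapsed.avg"] / metrics["sm__cycles_elapsed.avg.per_second"]
dp_flop = (metrics["sm__sass_thread_inst_executed_op_dadd_pred_on.sum"]
           + 2 * metrics["sm__sass_thread_inst_executed_op_dfma_pred_on.sum"]
           + metrics["sm__sass_thread_inst_executed_op_dmul_pred_on.sum"])
print(dp_flop / time_s, "DP flop/s")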

Multilevel cache and maximum miss rate

The first-level cache (L1) has a hit time of 600 psec, a miss rate of 10%, and a miss penalty of 80 nsec. I add a second-level cache (L2) with a hit time of 5 nsec. I am trying to find the maximum miss rate for the second level, given that the combination of the caches (L1 + L2) has twice the efficiency of the one-level cache L1 alone.
I am using these forms
Average memory access time = Hit time (L1) + Miss rate (L1) x Miss penalty (L1)
Miss penalty (L1) = Hit time (L2) + Miss rate (L2) x Miss penalty (L2)
The solution I get is 40%, but the correct answer is 9.25%.
Can anyone help? Thanks in advance.
avg = 0.6 + 0.1*80 = 8.6 ns
1/2 * avg = 4.3 ns = 0.6 + 0.1*(5 + x*80)
=> 3.2 = x*8
=> x = 0.4
So it seems that your answer is correct, under the assumptions that:
- "Average memory access time" does not include any other time for various secondary effects;
- "Double efficiency" means that it takes half the time on average.

Calculate execution time of a program based on CPI, instructions, etc

I'm trying to calculate the execution time of an application, assuming the only stall penalty occurs on memory-access instructions (the penalty being 100 cycles).
How am I supposed to find out execution time in seconds with this info?
CPI (CPUCycles?) = 1.0
ClockRate = 1GHZ
TotalInstructions = 59880
MemoryAccessInstructions = 8467
CacheMissRate = 62% (0.62) (5290/8467)
CacheHits = 3117
CacheMisses = 5290
CacheMissPenalty = 100 (cycles)
Assuming no other penalties.
totalCycles = TotalInstructions + CacheMisses * CacheMissPenalty ?
I assume that cache hits cost the same as other opcodes, so those are already included in TotalInstructions.
That's then 59880 + 5290 * 100 = 588880 cycles, and 1 GHz is 1,000,000,000 cycles per second.
So that code will take 0.58888 ms to execute (5.8888e-4 seconds).
This value is of course a purely theoretical estimate, as a modern CPU doesn't work like that (1 instruction = 1 cycle). If you are interested in real-world values, just profile it.
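The same back-of-the-envelope estimate as a Python sketch (my addition):

total_instructions = 59880
cache_misses       = 5290
miss_penalty       = 100      # cycles per miss
clock_rate         = 1e9      # 1 GHz

total_cycles = total_instructions + cache_misses * miss_penalty  # 588880 cycles
exec_time_s  = total_cycles / clock_rate                          # 5.8888e-4 s = 0.58888 ms
print(total_cycles, exec_time_s)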

A simple increment-clock procedure is at the top of the sorted-by-inclusive-time list! Optimize NetLogo code

In my simulation, the clock value should increment by one every tick. After resolving some of my other performance problems, I noticed this procedure is at the top of the profiler's sorted-by-inclusive-time list (its inclusive time is 960205 ms):
Calls Incl T(ms) Excl T(ms) Excl/calls
40001 960205.451 3586.591 0.090
to increment-clock
  set clock (clock + 1)
  if clock = 48
    [ set clock 0 ]
end
According to http://ccl.northwestern.edu/netlogo/5.0/docs/profiler.html:
"Inclusive time is the time from when the procedure was entered, until it finishes."
I thought this one should be straightforward and easy, so what am I doing wrong here?
I remember I used ticks before to increment the clock; I don't remember why I changed it to the version above, but still, it should not take this much time! The profiler inclusive-time value for the following version is 599841.851 ms, which indicates it is faster than the one above:
to increment-clock
  set clock (clock + 1)
  if ticks mod 48 = 0 [ set clock 0 ]
end
Calls Incl T(ms) Excl T(ms) Excl/calls
40001 599841.851 2943.394 0.074
Thanks.
Marzy
That has to be a bug in the profiler extension. At least, I can't think of another explanation. (Well, are you calling profiler:reset between runs...?)
I tried this just now:
extensions [profiler]

globals [clock]

to increment-clock
  set clock (clock + 1)
  if clock = 48
    [ set clock 0 ]
end

to test
  profiler:reset
  profiler:start
  repeat 1000000 [ increment-clock ]
  profiler:stop
  print profiler:report
end
and I get:
observer> test
BEGIN PROFILING DUMP
Sorted by Exclusive Time
Name Calls Incl T(ms) Excl T(ms) Excl/calls
INCREMENT-CLOCK 1000000 1176.245 1176.245 0.001
Sorted by Inclusive Time
INCREMENT-CLOCK 1000000 1176.245 1176.245 0.001
Sorted by Number of Calls
INCREMENT-CLOCK 1000000 1176.245 1176.245 0.001
END PROFILING DUMP
which seems much more reasonable.
You can report the bug at https://github.com/NetLogo/NetLogo/issues or bugs@ccl.northwestern.edu.

Calculating Waiting Time and Turnaround Time in (non-preemptive) FCFS queue

I have 6 processes as follows:
-- P0 --
arrival time = 0
burst time = 10
-- P1 --
arrival time = 110
burst time = 210
-- P2 --
arrival time = 130
burst time = 70
-- P3 --
arrival time = 130
burst time = 70
-- P4 --
arrival time = 130
burst time = 90
-- P5 --
arrival time = 130
burst time = 50
How can I calculate the waiting time and turnaround time for each process? The system should be non-preemptive (the process gets the CPU until it's done). Also: there are 4 logical processors in this system.
Assume systemTime is the current system uptime, and arrivalTime is relative to that, i.e., an arrivalTime of 0 means the process starts when the system does, and an arrivalTime of 130 means the process is started 130 units after the system starts.
Is this correct: waitingTime = systemTime - arrivalTime?
My reasoning is that systemTime - arrivalTime is the time the process has been waiting in the FCFS queue to use the CPU (or is this wrong?).
And for turnaround time, I was thinking something like turnaroundTime = burstTime + waitingTime, since the waiting time plus the burst time should be the total time to complete the process. Though once again, I don't know if my intuition is correct.
Any and all readings would be greatly appreciated!
For a non-preemptive system:
waitingTime = startTime - arrivalTime
turnaroundTime = burstTime + waitingTime = finishTime - arrivalTime
startTime = the time at which the process started executing
finishTime = the time at which the process finished executing
You can keep track of the elapsed time in the system (timeElapsed). Assign all processors to processes at the beginning, and execute until the shortest process is done. Then assign the processor that becomes free to the next process in the queue. Do this until the queue is empty and all processes have finished executing. Also, whenever a process starts executing, record its startTime, and when it finishes, record its finishTime (both taken from timeElapsed at that moment). That way you can calculate what you need.
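Here is a minimal Python sketch of that procedure (my own illustration, not from the original answer), using a heap of processor free times for the 4 logical processors:

# Non-preemptive FCFS on 4 processors, using the formulas above.
import heapq

processes = [  # (name, arrival, burst) from the question
    ("P0", 0, 10), ("P1", 110, 210), ("P2", 130, 70),
    ("P3", 130, 70), ("P4", 130, 90), ("P5", 130, 50),
]

free_at = [0, 0, 0, 0]          # next-free time of each of the 4 logical processors
heapq.heapify(free_at)

for name, arrival, burst in sorted(processes, key=lambda p: p[1]):  # FCFS order: by arrival time
    earliest = heapq.heappop(free_at)        # processor that frees up first
    start = max(arrival, earliest)
    finish = start + burst
    heapq.heappush(free_at, finish)
    waiting = start - arrival
    turnaround = finish - arrival            # equals burst + waiting
    print(f"{name}: start={start} finish={finish} waiting={waiting} turnaround={turnaround}")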
waitingTime = turnaroundTime - cpuTime, and turnaroundTime = cpuTime + waitingTime, where cpuTime (also called burst time) is the time the process spends executing on the CPU.
