Performance Analysis of Multiple Kernels (CUDA C) - performance

I have CUDA program with multiple kernels run on series (in the same stream- the default one). I want to make performance analysis for the program as a whole specifically the GPU portion. I'm doing the analysis using some metrics such as achieved_occupancy, inst_per_warp, gld_efficiency and so on using nvprof tool.
But the profiler gives metrics values separately for each kernel while I want to compute that for them all to see the total usage of the GPU for the program.
Should I take the (average or largest value or total) of all kernels for each metric??

One possible approach would be to use a weighted average method.
Suppose we had 3 non-overlapping kernels in our timeline. Let's say kernel 1 runs for 10 milliseconds, kernel 2 runs for 20 millisconds, and kernel 3 runs for 30 milliseconds. Collectively, all 3 kernels are occupying 60 milliseconds in our overall application timeline.
Let's also suppose that the profiler reports the gld_efficiency metric as follows:
kernel duration gld_efficiency
1 10ms 88%
2 20ms 76%
3 30ms 50%
You could compute the weighted average as follows:
88*10 76*20 50*30
"overall" global load efficiency = ----- + ----- + ----- = 65%
60 60 60
I'm sure there may be other approaches that make sense also. For example, a better approach might be to have the profiler report the total number of global load transaction for each kernel, and do your weighting based on that, rather than kernel duration:
kernel gld_transactions gld_efficiency
1 1000 88%
2 2000 76%
3 3000 50%
88*1000 76*2000 50*3000
"overall" global load efficiency = ------- + ------- + ------- = 65%
6000 6000 6000

Related

H2O GBM slows down when adding more cores

Training the following GBM model on 2 cores vs 96 cores (on EC2 c5.large and c5.metal) results in faster training times when using less cores. I checked the water meter to verify all cores were running.
Training times:
c5.large (2 cores): ~1min
c5.metal (96 cores): ~2min
Training details:
training set size 6840 rows x 95 cols
seed 1
ntrees 1000
max_depth 50
min_rows 10
learn_rate 0.005
sample_rate 0.5
col_sample_rate 0.5
stopping_rounds 2
stopping_metric "MSE"
stopping_tolerance 1.0E-5
score_tree_interval 500
histogram_type "UniformAdaptive"
nbins 800
nbins_top_level 1024
Any thoughts on why this is happening?
I think the reason is that the parallel speed is composed of two main components:
computing time on every single core
communicating time to communicate and collecting results
If you have small data and a lot of cores, the algorithm could slow down due to huge communication. Try for example 4, 6, 10 cores instead of 96 to speed up.

How to interpret CPU time vs CPU percentage

When I check azure monitoring tool, CPU usages are shown in CPU time
min: 4.69s
max: 2008.08 s
avg : 207.63 s
I am familiar with CPU% which makes sense as in application requiring cpu cycles.
how does the above time correspond to percentage?
What would be the max in seconds which corresponds to 70 or 100% cpu usage?
note : cpu is 4 cores
On a different instance, I noticed in a 60 second window
min: 0
max : 133.83
avg : 19.61
Based on below answers (see Nachiket's explanation in comments as well)
133.83 is a product of cpu time multiplied by cores ( in my case 4 cores)
Cpu utilization in this case is 133.83/(60*4) = 54.1%
Some cloud monitoring tools give resource usage in standard time measures. (seconds, hours, days etc.)
If you have usage in seconds like,
min: 4.69s
max: 2008.08 s
avg : 207.63 s
Then you can find out usage in % from above using definition of %.
% utilization = (resource used time / total resource availability time)
ex: if cpu was available for 100 seconds and out of that 80 seconds it was used then
% utilization = 80/100 = 80% CPU utilization
From your given time, total available time is missing. Find that out and use above formula.
% utilization = avg. usage/total availability
no. of cores shouldn't matter as that is present in both cases.
% utilization = ( (no. of cores * avg util)/(no. of core * total availability))
I am not sure about azure cloud monitoring but if it is providing same then you can use it.

What is the performance of 10 processors capable of 200 MFLOPs running code which is 10% sequential and 90% parallelelizable?

simple problem from Wilkinson and Allen's Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Working through the exercises at the end of the first chapter and want to make sure that I'm on the right track. The full question is:
1-11 A multiprocessor consists of 10 processors, each capable of a peak execution rate of 200 MFLOPs (millions of floating point operations per second). What is the performance of the system as measured in MFLOPs when 10% of the code is sequential and 90% is parallelizable?
I assume the question wants me to find the number of operations per second of a serial processor which would take the same amount of time to run the program as the multiprocessor.
I think I'm right in thinking that 10% of the program is run at 200 MFLOPs, and 90% is run at 2,000 MFLOPs, and that I can average these speeds to find the performance of the multiprocessor in MFLOPs:
1/10 * 200 + 9/10 * 2000 = 1820 MFLOPs
So when running a program which is 10% serial and 90% parallelizable the performance of the multiprocessor is 1820 MFLOPs.
Is my approach correct?
ps: I understand that this isn't exactly how this would work in reality because it's far more complex, but I would like to know if I'm grasping the concepts.
Your calculation would be fine if 90% of the time, all 10 processors were fully utilized, and 10% of the time, just 1 processor was in use. However, I don't think that is a reasonable interpretation of the problem. I think it is more reasonable to assume that if a single processor were used, 10% of its computations would be on the sequential part, and 90% of its computations would be on the parallelizable part.
One possibility is that the sequential part and parallelizable parts can be run in parallel. Then one processor could run the sequential part, and the other 9 processors could do the parallelizable part. All processors would be fully used, and the result would be 2000 MFLOPS.
Another possibility is that the sequential part needs to be run first, and then the parallelizable part. If a single processor needed 1 hour to do the first part, and 9 hours to do the second, then it would take 10 processors 1 + 0.9 = 1.9 hours total, for an average of about (1*200 + 0.9*2000)/1.9 ~ 1053 MFLOPS.

Calculating CPU Performance in MIPS

i was taking an exam earlier and i memorized the questions that i didnt know how to answer but somehow got it correct(since the online exam using electronic classrom(eclass) was done through the use of multiple choice.. The exam was coded so each of us was given random questions at random numbers and random answers on random choices, so yea)
anyways, back to my questions..
1.)
There is a CPU with a clock frequency of 1 GHz. When the instructions consist of two
types as shown in the table below, what is the performance in MIPS of the CPU?
-Execution time(clocks)- Frequency of Appearance(%)
Instruction 1 10 60
Instruction 2 15 40
Answer: 125
2.)
There is a hard disk drive with specifications shown below. When a record of 15
Kbytes is processed, which of the following is the average access time in milliseconds?
Here, the record is stored in one track.
[Specifications]
Capacity: 25 Kbytes/track
Rotation speed: 2,400 revolutions/minute
Average seek time: 10 milliseconds
Answer: 37.5
3.)
Assume a magnetic disk has a rotational speed of 5,000 rpm, and an average seek time of 20 ms. The recording capacity of one track on this disk is 15,000 bytes. What is the average access time (in milliseconds) required in order to transfer one 4,000-byte block of data?
Answer: 29.2
4.)
When a color image is stored in video memory at a tonal resolution of 24 bits per pixel,
approximately how many megabytes (MB) are required to display the image on the
screen with a resolution of 1024 x768 pixels? Here, 1 MB is 106 bytes.
Answer:18.9
5.)
When a microprocessor works at a clock speed of 200 MHz and the average CPI
(“cycles per instruction” or “clocks per instruction”) is 4, how long does it take to
execute one instruction on average?
Answer: 20 nanoseconds
I dont expect someone to answer everything, although they are indeed already answered but i am just wondering and wanting to know how it arrived at those answers. Its not enough for me knowing the answer, ive tried solving it myself trial and error style to arrive at those numbers but it seems taking mins to hours so i need some professional help....
1.)
n = 1/f = 1 / 1 GHz = 1 ns.
n*10 * 0.6 + n*15 * 0.4 = 12 ns (=average instruction time) = 83.3 MIPS.
2.)3.)
I don't get these, honestly.
4.)
Here, 1 MB is 10^6 bytes.
3 Bytes * 1024 * 768 = 2359296 Bytes = 2.36 MB
But often these 24 bits are packed into 32 bits b/c of the memory layout (word width), so often it will be 4 Bytes*1024*768 = 3145728 Bytes = 3.15 MB.
5)
CPI / f = 4 / 200 MHz = 20 ns.

how to tell if a program has been successfully parallelized?

A program run on a parallel machine is measured to have the following efficiency values for increasing numbers of processors, P.
P 1 2 3 4 5 6 7
E 100 90 85 80 70 60 50
Using the above results, plot the speedup graph.
Use the graph to explain whether or not the program has been successfully parallelized.
P E Speedup
1 100% 1
2 90% 1.8
3 85% 2.55
4 80% 3.2
5 70% 3.5
6 60% 3.6
7 50% 3.5
This is a past year exam question, and I know how to calculate the speedup & plot the graph. However I don't know how to tell a program is successfully parallelized.
Amdahl's law
I think the idea here is that not all portion can be parallelized.
For example, if a program needs 20 hours using a single processor core, and a particular portion of 1 hour cannot be parallelized, while the remaining promising portion of 19 hours (95%) can be parallelized, then regardless of how many processors we devote to a parallelized execution of this program, the minimum execution time cannot be less than that critical 1 hour. Hence the speedup is limited up to 20×
In this example, the speedup reached maximum 3.6 with 6 processors. So the parallel portion is about 1-1/3.6 is about 72.2%.

Resources