Imagine I have developed a CUDA kernel and tuned the block size and grid size for optimal performance on my machine. But if I give my application to a customer with a different GPU, they might need different grid and block size settings to get optimal performance.
How do I change the grid size and block size at runtime so that my kernel runs optimally on different GPUs?
When you change grid size, you are changing the total number of threads. Focusing on total threads, then, the principal target is the maximum in-flight thread-carrying capacity of the GPU you are running on.
A GPU code that seeks to maximize the utilization of the GPU it is running on should attempt to have at least that many threads. Fewer can be bad; more is not likely to make a big difference.
This target is easy to calculate. For most GPUs it is 2048 times the number of SMs in your GPU. (Turing GPUs have reduced the maximum thread load per SM from 2048 to 1024).
You can find out the number of SMs in your GPU at runtime using a call to cudaGetDeviceProperties() (study the deviceQuery sample code).
Once you know the number of SMs, multiply it by 2048. That is the number of threads to launch in your grid. At this level of tuning/approximation, there should be no need to change the tuned number of threads per block.
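A minimal host-side sketch of that calculation (the empty kernel and the block size of 256 are placeholders for your own tuned kernel and block size):

#include <cuda_runtime.h>

__global__ void myKernel() { /* your tuned kernel (placeholder) */ }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int threadsPerBlock = 256;                              // keep your tuned block size
    int maxThreadsPerSM = prop.maxThreadsPerMultiProcessor; // 2048 on most GPUs, 1024 on Turing
    int numBlocks = prop.multiProcessorCount * maxThreadsPerSM / threadsPerBlock;
    myKernel<<<numBlocks, threadsPerBlock>>>();
    cudaDeviceSynchronize();
    return 0;
}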
It's true that your specific code may not be able to actually achieve 2048 threads on each SM (this is related to discussions of occupancy). However, for a simplistic target, this won't hurt anything. If you already know the actual occupancy capability of your code, or have used the occupancy API to determine it, then you can scale down your target from 2048 threads per SM to some lower number. But this scaling down probably won't improve the performance of your code by much, if at all.
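If you do want to fold occupancy in, the occupancy API can do the scaling for you. A sketch, reusing the names from the snippet above (so still a placeholder kernel):

// Sketch: let the occupancy API report how many blocks of myKernel fit per SM,
// then launch exactly that many blocks across the whole device.
int blocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, threadsPerBlock, 0);
int occBlocks = prop.multiProcessorCount * blocksPerSM;
myKernel<<<occBlocks, threadsPerBlock>>>();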
Related
I am new to GPGPU and CUDA. From my reading, on current-generation CUDA GPUs, threads get bundled into warps of 32 threads. All threads in a warp execute the same instructions, so if there is divergence in branches, all threads essentially take the time corresponding to executing all the incurred branches. However, it seems that different warps executing simultaneously on the GPU can have divergent branches without this cost since the different warps are executed by separate computational resources. So my question is: how many concurrent warps can be executed this way, where divergence doesn't cause this penalty... i.e. what number is it that I should look for in the spec sheet? Is it the number of "shader processors" or the number of "Streaming multiprocessors" that is relevant here?
Also, the same question for AMD Radeon: Here the relevant terms might be "unified shaders" and "compute units".
Finally, suppose I have a workload that is highly divergent across threads, so I essentially just want one thread per warp, using the GPU as an ordinary multi-core CPU. Is that possible, and how should I lay out the threads and thread blocks for this to happen? Can I avoid allocating memory etc. for the 31 redundant threads in the warp? I realize this might not be the ideal workload for GPGPU, but it would be usable for running an activity in the background without blocking the host CPU.
I am new to GPGPU and am instead learning OpenCL. But this question has remained unanswered for months, so I'll have a stab at it (and hopefully an expert will correct me if I'm wrong).
However, it seems that different warps executing simultaneously on the GPU can have divergent branches without this cost since the different warps are executed by separate computational resources
Not necessarily. On AMD hardware, only 64 work-items (called threads in CUDA) are worked on at any given time (technically, each VALU on AMD hardware operates on 16 items at once, but every instruction is repeated four times, so it comes to 64 items per AMD "wavefront"). On NVIDIA hardware, 32 threads are executed at a time per warp.
Of course, the block size is likely far larger than 64. If you were working on 32x32-pixel blocks, you'd need 1024 work-items per work-group (OpenCL) or 1024 threads per thread block (CUDA).
These 1024 threads CAN diverge from one another without penalty on NVIDIA Pascal, as long as the divergence is between different sets of 32 (warps), because each warp executes independently.
So if you have a work-group / thread-block size of 1024, corresponding to a 32x32 block of pixels, the first two rows (64 work-items) form one wavefront on an AMD GCN VALU, while each row of 32 threads forms one warp on an NVIDIA Pascal SM. As long as ALL of the threads in a warp (or work-items in a wavefront) take the same branches, you won't have any penalties.
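To make the NVIDIA side concrete, here is an illustrative CUDA kernel (not from the question's code): in the first branch, even and odd lanes within every warp diverge and both paths are serialized; in the second, all 32 lanes of a warp take the same path, so there is no divergence cost.

__global__ void divergenceDemo(float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid % 2 == 0)                   // even/odd lanes diverge inside every warp
        out[tid] = 1.0f;
    else
        out[tid] = 2.0f;

    if ((threadIdx.x / 32) % 2 == 0)    // whole warps branch together: no serialization
        out[tid] += 10.0f;
    else
        out[tid] += 20.0f;
}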
Finally, suppose I have a workload that is highly divergent across threads, so I essentially just want one thread per warp, using the GPU as an ordinary multi-core CPU. Is that possible, and how should I lay out the threads and thread blocks for this to happen? Can I avoid allocating memory etc. for the 31 redundant threads in the warp? I realize this might not be the ideal workload for GPGPU, but it would be usable for running an activity in the background without blocking the host CPU.
You could simply branch so that only one thread does the real work:
if (threadid > 0) {
    // the remaining threads do nothing
} else {
    dostuff();   // only thread 0 does the real work
}
Honestly, I think it's best if you just let the code diverge and hope for the best. All of those cores have resources of their own (registers and so on).
Looking at this fact (that a GPU's register file is larger than its L1/L2 caches), I've started wondering how registers work in a GPU. Before knowing this, I thought that as you move up the memory hierarchy, the size keeps decreasing (which is intuitive: latency decreases, size decreases). What is the purpose of registers in GPUs, and why is their total size greater than that of the L2/L1 caches?
Thanks.
In CPUs caches serve two basic purposes:
They enable temporal and spatial reuse of data already fetched from DRAM. This reduces the required bandwidth of the DRAM.
CPU caches provide a huge reduction of latency, which is extremely important for single threaded performance.
GPUs are not focused on single-thread performance, but on throughput instead. Most of the time they also deal with working sets that are too big to fit into any reasonably sized cache. Small caches help in some situations, but overall caches are not nearly as important for GPUs as they are for CPUs.
Now to the second part of the question: why huge register files? GPUs reach their performance by exploiting thread-level parallelism. Many threads need to be active at the same time to reach high performance levels. But every thread needs to store its own set of registers. In Maxwell GPUs, and likely in GP104/GTX 1080, every SM can host up to 2048 threads. Every SM has a 256 KB register file, so if all threads are used, 32 × 32-bit registers are available per thread.
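Those per-SM numbers can be read back at runtime with cudaGetDeviceProperties(); a minimal sketch (assuming device 0):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // e.g. 65536 registers / 2048 threads = 32 32-bit registers per thread at full occupancy
    printf("registers per SM: %d, max threads per SM: %d, registers per thread: %d\n",
           prop.regsPerMultiprocessor, prop.maxThreadsPerMultiProcessor,
           prop.regsPerMultiprocessor / prop.maxThreadsPerMultiProcessor);
    return 0;
}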
I mentioned earlier that CPUs use caches to reduce memory latency, but GPUs must also somehow deal with memory latency. They simply switch to a different thread while one thread is waiting for an answer from memory. Latency, throughput and the number of threads are connected by Little's law:
(data in flight per thread) × (number of threads) = latency × throughput
The memory latency is likely a few hundred ns to a thousand ns (let's use 1000 ns). The throughput here would be the memory bandwidth (320 GB/s). To fully utilize the available memory bandwidth we need (320 GB/s × 1000 ns =) 320 KB in flight. A GTX 1080 should have 20 SMs, so each SM would need to have 16 KB in flight to fully use the memory bandwidth. Even if all 2048 threads are issuing memory accesses all the time, every thread would still need to have 8 bytes in outstanding memory requests. If some threads are busy with calculations and cannot send out new memory requests, even more outstanding requests are required from the remaining threads. And if threads use more than 32 registers each, fewer threads fit on an SM, so even more outstanding memory requests per thread are needed.
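The same arithmetic as a tiny stand-alone program (the 1000 ns latency, 320 GB/s bandwidth, 20 SMs and 2048 threads per SM are simply the assumptions from the paragraph above):

#include <cstdio>

int main() {
    double latency_s = 1000e-9;                 // assumed memory latency: 1000 ns
    double bandwidth = 320e9;                   // assumed memory bandwidth: 320 GB/s
    double inFlight  = latency_s * bandwidth;   // bytes that must be in flight (Little's law)
    printf("total in flight: %.0f KB\n", inFlight / 1e3);               // ~320 KB
    printf("per SM (20 SMs): %.0f KB\n", inFlight / 20 / 1e3);          // ~16 KB
    printf("per thread (2048 per SM): %.1f bytes\n", inFlight / 20 / 2048); // ~8 bytes
    return 0;
}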
If GPUs used smaller register files, they could not use the full bandwidth of their memory: they would send out some work to the memory interface, and then all threads would be waiting for answers from the memory interface with no new work left to submit. The huge register files are required to keep enough threads available. Careful coding is still required to really get the maximum performance out of the GPU.
A GPU is built for 3D rendering and computation, so vendors dedicate more die area to cores. More cores need more data to feed them, and that in turn needs more die area for the scheduling machinery that keeps occupancy as high as possible.
There are many cores, many 3D pipeline units such as TMUs and ROPs, a lot of scheduling logic, and wide memory controllers to feed those cores.
Die area is simply not enough for everything, and caches seem to be the least important consumer of it. Even texture memory is more important than that, and it is faster too.
Making the die bigger means lower production yield, and that means less profit. Since GPU vendors are not charities, they choose maximum profit, optimum performance and (lately) power savings. Cache is expensive.
A compute unit in a GPU can provide on the order of a kilobyte of registers per thread, so data that is used repeatedly does not have to travel long distances (such as between cache and the cores), which also helps energy efficiency.
You can also hide the latency of some parts by keeping occupancy high for large-enough workloads; local/shared memory (per compute unit) and registers (per thread) play the more important role in achieving that.
While the memory controller, L1 and L2 might handle only around 100 GB/s, 200 GB/s and 300 GB/s, local/shared memory and registers can reach roughly 5 TB/s and 15 TB/s of bandwidth on a GPU.
CL_DEVICE_NAME = GeForce GT 630
CL_DEVICE_TYPE = CL_DEVICE_TYPE_GPU
CL_PLATFORM_NAME : NVIDIA CUDA
size_t global_item_size = 8;
size_t local_item_size = 1;
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);
Here, printing from inside the kernel is not allowed. Hence, how can I ensure that all my 8 cores are running in parallel?
Extra info (regarding my question): to the kernel I am passing input and output arrays of size 8x8 as buffers. According to the work-item number, I solve that row and save the result in the output buffer, and after that I read the result back.
If I run on the AMD platform SDK, I can add a print statement in the kernel with
#pragma OPENCL EXTENSION cl_amd_printf : enable
and then I can clearly see that on a 4-core machine my first 4 work-items run in parallel and then the rest run, which shows it is solving at most 4 in parallel.
But how can I see the same for my CL_DEVICE_TYPE_GPU?
Any help/pointers/suggestions will be appreciated.
Using printf is not at all a reliable method of determining if your code is actually executing in parallel. You could have 4 threads running concurrently on a single core for example, and would still have your printf statements output in a non-deterministic order as the CPU time-slices between them. In fact, section 6.12.13.1 of the OpenCL 1.2 specification ("printf output synchronization") explicitly states that there are no guarantees about the order in which the output is written.
It sounds like what you are really after is a metric that will tell you how well your device is being utilised, which is different than determining if certain work-items are actually executing in parallel. The best way to do this would be to use a profiler, which would usually contain such a metric. Unfortunately NVIDIA's NVVP no longer works with OpenCL, so this doesn't really help you.
On NVIDIA hardware, work-items within a work-group are batched up into groups of 32, known as a warp. Each warp executes in a SIMD fashion, so the 32 work-items in the warp execute in lockstep. You will typically have many warps resident on each compute unit, potentially from multiple work-groups. The compute unit will transparently context switch between these warps as necessary to keep the processing elements busy when warps stall.
Your brief code snippet indicates that you are asking for 8 work-items with a work-group size of 1. I don't know if this is just an example, but if it isn't then this will almost certainly deliver fairly poor performance on the GPU. As per the above, you really want the work-group size to be a multiple of 32, so that the GPU can fill each warp. Additionally, you'll want hundreds of work-items in your global size (NDRange) in order to properly fill the GPU. Running such a small problem size isn't going to be very indicative of how well your GPU can perform.
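As a sketch of what that sizing could look like (the work-group size of 64 and the problem size n are illustrative assumptions; command_queue and kernel are the ones from your snippet, and the kernel would have to ignore work-items with get_global_id(0) >= n):

size_t n = 100000;               // however much work the problem really has (illustrative)
size_t local_item_size  = 64;    // a multiple of the 32-wide warp
size_t global_item_size = ((n + local_item_size - 1) / local_item_size) * local_item_size;
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);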
If you are enqueueing enough work-items (at least 32, but ideally thousands) then your "work-items are running in parallel".
You can see details of how your kernel is executing by using a profiling tool, for example Parallel Nsight on NVIDIA hardware or CodeXL on AMD hardware. It will tell you things about hardware occupancy and execution speed. You'll also be able to see memory transfers.
I basically need some help to explain/confirm some experimental results.
Basic Theory
A common idea expressed in papers on DVFS is that execution time has an on-chip and an off-chip component. The on-chip component of execution time scales inversely with CPU frequency (linearly with the clock period), whereas the off-chip component remains unaffected.
Therefore, for CPU-bound applications, there is a linear relationship between CPU frequency and instruction-retirement rate. On the other hand, for a memory bound application where the caches are often missed and DRAM has to be accessed frequently, the relationship should be affine (one is not just a multiple of the other, you also have to add a constant).
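One compact way to write that model (my notation, not from any particular paper): execution time T(f) = C_onchip / f + T_offchip, where C_onchip is the number of on-chip cycles and T_offchip is the frequency-independent off-chip (DRAM) time. For a CPU-bound run, T_offchip is roughly zero, so the instruction-retirement rate I / T(f) is proportional to f; for a memory-bound run, the constant T_offchip term keeps the rate from scaling one-for-one with frequency.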
Experiment
I was doing experiments looking at how CPU frequency affects instruction-retirement rate and execution time under different levels of memory-boundedness.
I wrote a test application in C that traverses a linked list. I effectively create a linked list whose individual nodes have sizes equal to the size of a cache-line (64 bytes). I allocated a large amount of memory that is a multiple of the cache-line size.
The linked list is circular such that the last element links to the first element. Also, this linked list randomly traverses through the cache-line sized blocks in the allocated memory. Every cache-line sized block in the allocated memory is accessed, and no block is accessed more than once.
Because of the random traversal, I assumed it should not be possible for the hardware to use any prefetching. Basically, by traversing the list, you have a sequence of memory accesses with no stride pattern, no temporal locality, and no spatial locality. Also, because this is a linked list, one memory access cannot begin until the previous one completes. Therefore, the memory accesses should not be parallelizable.
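A sketch of how such a list could be built and traversed (a reconstruction from the description above, not the original benchmark code; sizes and names are mine):

#include <stdio.h>
#include <stdlib.h>

struct node {
    struct node *next;
    char pad[64 - sizeof(struct node *)];      // one node per 64-byte cache line
};

int main(void) {
    long n = 1L << 20;                         // number of cache-line-sized nodes (assumed)
    struct node *nodes = (struct node *)malloc(n * sizeof(struct node));
    long *order = (long *)malloc(n * sizeof(long));

    for (long i = 0; i < n; i++) order[i] = i;
    for (long i = n - 1; i > 0; i--) {         // Fisher-Yates shuffle -> random visiting order
        long j = rand() % (i + 1);
        long t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (long i = 0; i < n; i++)               // circular list: every block linked exactly once
        nodes[order[i]].next = &nodes[order[(i + 1) % n]];

    struct node *p = &nodes[order[0]];
    for (long i = 0; i < 10 * n; i++)          // each load depends on the previous one
        p = p->next;

    printf("%p\n", (void *)p);                 // keep the traversal from being optimized away
    free(order);
    free(nodes);
    return 0;
}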
When the amount of allocated memory is small enough, you should have no cache misses beyond initial warm up. In this case, the workload is effectively CPU bound and the instruction-retirement rate scales very cleanly with CPU frequency.
When the amount of allocated memory is large enough (bigger than the LLC), you should be missing the caches. The workload is memory bound and the instruction-retirement rate should not scale as well with CPU frequency.
The basic experimental setup is similar to the one described here:
"Actual CPU Frequency vs CPU Frequency Reported by the Linux "cpufreq" Subsystem".
The above application is run repeatedly for some duration. At the start and end of the duration, the hardware performance counter is sampled to determine the number of instructions retired over the duration. The length of the duration is measured as well. The average instruction-retirement rate is measured as the ratio between these two values.
This experiment is repeated across all the possible CPU frequency settings using the "userspace" CPU-frequency governor in Linux. Also, the experiment is repeated for the CPU-bound case and the memory-bound case as described above.
Results
The two following plots show results for the CPU-bound case and memory-bound case respectively. On the x-axis, the CPU clock frequency is specified in GHz. On the y-axis, the instruction-retirement rate is specified in (1/ns).
A marker is placed for each repetition of the experiment described above. The line shows what the result would be if the instruction-retirement rate increased at the same rate as the CPU frequency and passed through the lowest-frequency marker.
Results for the CPU-bound case.
Results for the memory-bound case.
The results make sense for the CPU-bound case, but not as much for the memory-bound case. All the markers for the memory-bound case fall below the line, which is expected because the instruction-retirement rate should not increase at the same rate as CPU frequency for a memory-bound application. The markers appear to fall on straight lines, which is also expected.
However, there appear to be step changes in the instruction-retirement rate with changes in CPU frequency.
Question
What is causing the step changes in the instruction-retirement rate? The only explanation I could think of is that the memory controller is somehow changing the speed and power-consumption of memory with changes in the rate of memory requests. (As instruction-retirement rate increases, the rate of memory requests should increase as well.) Is this a correct explanation?
You seem to have exactly the results you expected: a roughly linear trend for the CPU-bound program, and a shallow(er) affine one for the memory-bound case (which is less affected by the CPU). You will need a lot more data to determine whether they are consistent steps or whether they are, as I suspect, mostly random jitter depending on how 'good' the list is.
The CPU clock will affect bus clocks, which will affect timings and so on; synchronisation between differently clocked buses is always challenging for hardware designers. The spacing of your steps is interestingly 400 MHz, but I wouldn't draw too much from this; generally, this kind of stuff is way too complex and hardware-specific to be properly analysed without 'inside' knowledge of the memory controller used, etc.
(please draw nicer lines of best fit)
I have been asked to measure how "efficiently" my code uses the GPU, i.e. what % of peak performance my algorithms are achieving. I am not sure how to do this comparison. Until now I have basically put timers in my code and measured the execution time. How can I compare this to optimal performance and find what the bottlenecks might be? (I did hear about the Visual Profiler but couldn't get it to work; it keeps giving me a "cannot load output" error.)
Each card has a maximum memory bandwidth and processing speed. For example, the GTX 480 bandwidth is 177.4 GB/s. You will need to know the specs for your card.
The first thing to decide is whether your code is memory bound or computation bound. If it is clearly one or the other, that will help you focus on the correct "efficiency" to measure. If your program is memory bound, then you need to compare your bandwidth with the card's maximum bandwidth.
You can calculate memory bandwidth by computing the amount of memory you read/write and dividing by the run time (I use CUDA events for timing). A good example of calculating bandwidth efficiency, and of using it to help validate a kernel, is the whitepaper for the parallel reduction sample.
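A hedged sketch of that measurement, with a trivial copy kernel standing in for your real one:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void copyKernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                 // one read + one write per element
}

int main() {
    int n = 1 << 24;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copyKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double bytes = 2.0 * n * sizeof(float);    // bytes read + bytes written
    printf("effective bandwidth: %.1f GB/s\n", bytes / (ms * 1.0e6));
    // Compare this number against your card's peak (e.g. 177.4 GB/s for a GTX 480).
    return 0;
}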
I don't know very much about determining the efficiency if instead you are ALU bound. You can probably count (or profile) the number of instructions, but what is the card's maximum?
I'm also not sure what to do in the likely case that your kernel is something in between memory bound and ALU bound.
Anyone...?
Generally "efficiently" would probably be a measure of how much memory and GPU cycles (average, min, max) of your program is using. Then the efficiency measure would be avg(mem)/total memory for the time period and so on with AVG(GPU cycles)/Max GPU cycles.
Then I'd compare these metrics to metrics from some GPU benchmark suites (which you can assume to be pretty efficient at using most of the GPU). Or you could measure against some random GPU intensive programs of your choice. That'd be how I'd do it but I've never thought to try so good luck!
As for bottlenecks and "optimal" performance. These are probably NP-Complete problems that no one can help you with. Get out the old profiler and debuggers and start working your way through your code.
Can't help with the profiler and micro-optimisation, but there is the CUDA Occupancy Calculator (http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls), which tries to estimate how your CUDA code uses the hardware resources, based on these values:
Threads Per Block
Registers Per Thread
Shared Memory Per Block (bytes)
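If it helps, the first two of those values can be obtained for your own kernel: nvcc prints them at compile time with --ptxas-options=-v, and they can also be queried at runtime with cudaFuncGetAttributes(); a minimal sketch (the kernel here is only a placeholder):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder for your real kernel */ }

int main() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);
    printf("registers per thread: %d\n", attr.numRegs);
    printf("static shared memory per block: %zu bytes\n", attr.sharedSizeBytes);
    printf("max threads per block for this kernel: %d\n", attr.maxThreadsPerBlock);
    return 0;
}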