Relation Between CUDA Blocks and Threads and SMPs - parallel-processing

I recently read this CUDA tutorial and one thing was unclear. When we sum two vectors, we divide the task into several blocks and threads to do the work in parallel. My question is: why doesn't the number of blocks (and maybe threads) depend on the physical properties of the GPU, such as the number of physical SMPs and threads?
For example, let's say the GPU has 16 SMPs and each of them can run 128 threads. Will it be faster to split the problem into 16 blocks of 128 threads or, as in the article, into 4000 blocks of 256 threads?

It does not, because the total number of threads depends mainly on your problem size, while the maximum block size is set by your GPU architecture. For example, if your GPU has 3000 cores and allows at most 512 threads per block, and your code processes a matrix with 2 billion elements, you have to choose "number of blocks X number of threads per block" (with no more than 512 threads per block) so that the product is EQUAL to or GREATER than 2 billion. CUDA then distributes your blocks of threads across the 3000 CUDA cores of your GPU, scheduling new blocks as earlier ones finish, until all of the threads specified by "numBlocks X numThreadsPerBlock" have run.
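The "equal or greater" rule above is usually written as a ceiling division. Here is a minimal sketch in Python of just the arithmetic, assuming a one-dimensional launch; the name `blocks_needed` is only illustrative and is not part of CUDA.

```python
def blocks_needed(num_elements, threads_per_block):
    """Smallest block count with blocks * threads_per_block >= num_elements."""
    if num_elements < 0 or threads_per_block <= 0:
        raise ValueError("need num_elements >= 0 and threads_per_block > 0")
    # Integer ceiling division: round up so the last partial block is included.
    return (num_elements + threads_per_block - 1) // threads_per_block


n = 2_000_000_000          # problem size from the example above
threads_per_block = 512    # block size limit from the example above
blocks = blocks_needed(n, threads_per_block)
print(blocks)                              # 3906250
print(blocks * threads_per_block >= n)     # True
```

When the problem size is not a multiple of the block size, the last block contains threads with no element to process, so the kernel has to check its global index against the problem size before touching memory. On devices of compute capability below 3.0 the x-dimension of a grid is limited to 65535 blocks, so a launch this large would need a multi-dimensional grid or a loop inside the kernel.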


How to calculate speedup and efficiency in a hybrid CPU and GPU algorithm?

I have an algorithm that I have run in parallel using only the CPU, and I achieved a speedup of 30x. That is an efficiency of about 0.94 (efficiency = speedup/cores, i.e. 30/32 = 0.9375).
Later I added 2 GPUs (Tesla C2075, with 448 cores each) alongside the 32 CPU cores.
To calculate the efficiency including CPUs and GPUs, should I add the number of GPU cores to the CPU cores? That is, I would calculate the efficiency using 928 cores (32 + 448 + 448 = 928). Or should it be calculated differently?
Speedup and efficiency have been calculated based on what has been said here:
GPUs have bigger "core complex" units called "SM" or "CU", with tens of pipelines each. Although they are not "very" similar to the "SIMD" units of a CPU, they can issue instructions in parallel to these pipelines from "single-threaded" kernel code.
You have counted CPU "cores" and not SIMD pipelines (of which there are 4 to 16 times as many as cores), so it wouldn't be wrong to count the SM units of Nvidia, the CUs of AMD, the slice subsets of Intel, etc.
Tesla C2075 has 14 SM units, so you could add 14 for each GPU (32 + 14 + 14).
If you have also used SIMDified code on the CPU, then it wouldn't be wrong to count each pipeline of a GPU, of which there are 32 to 192 times as many as SMs/CUs (448 per GPU in your case): (32*SIMD_WIDTH + 448 + 448).
At least this is how I would compute "core efficiency" and "pipeline efficiency". If data transfer to/from GPU is not a bottleneck, efficiency should not drop much after GPUs are added.
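The "core efficiency" counting described above can be sketched in a few lines of Python. The CPU-only figures come from the question; the hybrid speedup of 45x is a made-up placeholder used only to show the calculation, not a measured result.

```python
def efficiency(speedup, units):
    """Parallel efficiency = speedup / number of processing units counted."""
    return speedup / units


cpu_cores = 32
sm_per_gpu = 14      # Tesla C2075 has 14 SM units
gpus = 2

# Count CPU cores plus SM units, as suggested above: 32 + 14 + 14.
hybrid_units = cpu_cores + gpus * sm_per_gpu
print(hybrid_units)                       # 60

# CPU-only run from the question.
print(efficiency(30.0, cpu_cores))        # 0.9375

# HYPOTHETICAL hybrid speedup, for illustration only.
hybrid_speedup = 45.0
print(efficiency(hybrid_speedup, hybrid_units))   # 0.75
```

Counting pipelines instead only changes the denominator, but the CPU side then has to be counted as cores times SIMD width as well, so that both sides use the same unit.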

How does OpenCL distribute work items?

I'm testing and comparing GPU speed up with different numbers of work-items (no work-groups). The kernel I'm using is a very simple but long operation. When I test with multiple work-items, I use a barrier function and split the work in smaller chunks to get the same result as with just one work-item. I measure the kernel execution time using cl_event and the results are the following:
1 work-item: 35735 ms
2 work-items: 11822 ms (3 times faster than with 1 work-item)
10 work-items: 2380 ms (5 times faster than with 2 work-items)
100 work-items: 239 ms (10 times faster than with 10 work-items)
200 work-items: 122 ms (2 times faster than with 100 work-items)
CPU takes about 580 ms on average to do the same operation.
The only result I don't understand and can't explain is the one with 2 work items. I would expect the speed up to be about 2 times faster compared to the result with just one work item, so why is it 3?
I'm trying to make sense of these numbers by looking at how these work-items were distributed on processing elements. I'm assuming if I have just one kernel, only one compute unit (or multiprocessor) will be activated and the work items distributed on all processing elements (or CUDA cores) of that compute unit. What I'm also not sure about is whether a processing element can process multiple work-items at the same time, or is it just one work-item per processing element?
CL_DEVICE_MAX_WORK_ITEM_SIZES are 1024 / 1024 / 64 and CL_DEVICE_MAX_WORK_GROUP_SIZE 1024. Since I'm using just one dimension, does that mean I can have 1024 work-items running at the same time per processing element or per compute unit? When I tried with 1000 work-items, the result was a smaller number so I figured not all of them got executed, but why would that be?
My GPU info: Nvidia GeForce GT 525M, 96 CUDA cores (2 compute units, 48 CUDA cores per unit)
The only result I don't understand and can't explain is the one with 2 work items. I would expect the speed up to be about 2 times faster compared to the result with just one work item, so why is it 3?
The exact reasons will probably be hard to pin down, but here are a few suggestions:
GPUs aren't optimised at all for small numbers of work items. Benchmarking that end of the scale isn't especially useful.
35 seconds is a very long time for a GPU. Your GPU probably has other things to do, so your work-item is probably being interrupted many times, with its context saved and resumed every time.
It will depend very much on your algorithm. For example, if your kernel uses local memory, or a work-size dependent amount of private memory, it might "spill" to global memory, which will slow things down.
Depending on your kernel's memory access patterns, you might be running into the effects of read/write coalescing. With more work items, accesses to adjacent addresses can be combined into fewer memory transactions.
What I'm also not sure about is whether a processing element can process multiple work-items at the same time, or is it just one work-item per processing element?
Most GPU hardware supports a form of SMT to hide memory access latency. So a compute core will have up to some fixed number of work items in-flight at a time, and if one of them is blocked waiting for a memory access or barrier, the core will continue executing commands on another work item. Note that the maximum number of simultaneous threads can be further limited if your kernel uses a lot of local memory or private registers, because those are a finite resource shared by all cores in a compute unit.
Work-groups will normally run on only one compute unit at a time, because local memory and barriers don't work across units. So you don't want to make your groups too large.
One final note: compute hardware tends to be grouped in powers of 2, so it's usually a good idea to make your work group sizes a multiple of e.g. 16 or 64. 1000 is neither, which usually means some cores will be doing nothing.
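The sizing advice above amounts to rounding the global work size up to a multiple of the work-group size. This is a minimal sketch in Python of that arithmetic only, assuming a work-group size of 64; the helper name `round_up` is illustrative.

```python
def round_up(n, multiple):
    """Round n up to the next multiple of `multiple`."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return ((n + multiple - 1) // multiple) * multiple


work_items = 1000      # the count tried in the question
group_size = 64        # assumed work-group size

global_size = round_up(work_items, group_size)
print(global_size)                  # 1024
print(global_size // group_size)    # 16 work-groups
print(global_size - work_items)     # 24 padding work-items
```

The padding work-items still run, so the kernel needs a guard such as comparing `get_global_id(0)` against the real element count before reading or writing.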
When I tried with 1000 work-items, the result was a smaller number so I figured not all of them got executed, but why would that be?
Please be more precise in this question; it's not clear what you're asking.

How are CUDA threads executed inside a single block?

I have several questions regarding CUDA. The following is a figure taken from a book on parallel programming. It shows how threads are allocated on the device for a multiplication of two vectors, each of length 8192.
1) in threadblock 0 there are 15 SIMD threads. Are these 15 threads executed in parallel or just one thread at a specific time?
2) each block contains 512 elements in this example. is this number dependent on the hardware or is it a decision of the programmer?
In this particular example, each thread seems to be assigned to 32 elements in the vector. Code that is executed by a single thread is executed sequentially.
The size of the thread blocks is up to the programmer. However, there are restrictions on the number and size of the thread blocks given the hardware the code is executed on. For more information on this, see this elaborate answer:
Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)
From your illustration, it seems that:
The grid is composed of 16 thread blocks, numbered from 0 to 15.
Each block is composed of 16 "SIMD threads", numbered from 0 to 15.
Each "SIMD thread" computes the product of 32 vector elements.
It is not necessarily obvious from the illustration whether "SIMD thread" means, in the CUDA (OpenCL) parlance:
A warp (wavefront) of 32 threads (work-items)
A thread (work-item) working on 32 elements
I will assume the former ("SIMD thread" = warp/wavefront), since it is the more reasonable assumption performance-wise. The latter isn't technically incorrect; it's simply a suboptimal design (on current hardware, at least).
1) in threadblock 0 there are 15 SIMD threads. Are these 15 threads executed in parallel or just one thread at a specific time?
As stated above, there are 16 warps (numbered from 0 to 15, which makes 16) in thread block 0, each of them made of 32 threads. These threads execute in lockstep, simultaneously, in parallel. The warps are executed independently of one another, sequentially or in parallel, depending on the capabilities of the underlying hardware. For example, the hardware may be capable of scheduling a number of warps for simultaneous execution.
2) each block contains 512 elements in this example. is this number dependent on the hardware or is it a decision of the programmer?
In this case, it is simply a decision of the programmer, but in some cases there are also hardware limitations that could force the programmer into changing the design. For example, there is a maximum number of threads a block can handle, and there is a maximum number of blocks a grid can handle.
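The numbers in the illustration can be checked with a little arithmetic. The Python sketch below assumes the "SIMD thread = warp" reading and a contiguous mapping where vector element i is handled by global thread i; that mapping and the helper `locate` are assumptions for illustration, not something the figure states.

```python
WARP_SIZE = 32          # threads per warp
WARPS_PER_BLOCK = 16    # "SIMD threads" 0..15 in the illustration
NUM_BLOCKS = 16         # thread blocks 0..15 in the illustration

THREADS_PER_BLOCK = WARPS_PER_BLOCK * WARP_SIZE   # 512 elements per block
TOTAL_THREADS = NUM_BLOCKS * THREADS_PER_BLOCK    # 8192 elements overall


def locate(element):
    """Return (block, warp, lane) for a vector element.

    ASSUMPTION: element i is processed by global thread i.
    """
    if not 0 <= element < TOTAL_THREADS:
        raise IndexError("element outside the vector")
    block, offset = divmod(element, THREADS_PER_BLOCK)
    warp, lane = divmod(offset, WARP_SIZE)
    return block, warp, lane


print(THREADS_PER_BLOCK, TOTAL_THREADS)   # 512 8192
print(locate(0))                          # (0, 0, 0)
print(locate(8191))                       # (15, 15, 31)
```

Choosing 8 warps per block instead would give 32 blocks of 256 threads covering the same 8192 elements, which is why the block size is a programmer decision within the hardware limits.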

optimal number of CUDA parallel blocks

Can there be any performance advantage to launch a grid of blocks simultaneously over launching blocks one at a time if the number of threads in each block is already larger than the number of CUDA cores?
I think there is. A thread block is assigned to a Streaming Multiprocessor (SM), and the SM further divides the threads of each block into warps of 32 threads (the warp size has been 32 on every NVIDIA architecture so far) that are scheduled to be executed more or less sequentially. Considering this, it will be faster to break the computation into blocks so that they occupy as many SMs as possible. It is also meaningful to build blocks whose size is a multiple of the warp size the card uses (a block of 32 or 64 threads rather than 40 threads, for SMs that use 32-thread warps).
Launch Latency
Launch latency of a grid (from the API call until work starts on the GPU) ranges from 3-8 µs on Linux to 30-80 µs on Windows Vista/Win7.
Distributing a block to an SM takes tens to hundreds of nanoseconds.
Launching a warp in a block (32 threads) takes a few cycles and happens in parallel on each SM.
Resource Limitations
Concurrent Kernels
- Tesla N/A only 1 grid at a time
- Fermi 16 grids at a time
- Kepler 16 grids (Kepler2 32 grids)
Maximum Blocks (not considering occupancy limitations)
- Tesla SmCount * 8 (gtx280 = 30 * 8 = 240)
- Fermi SmCount * 8 (gf100 = 16 * 8 = 128)
- Kepler SmCount * 16 (gk104 = 8 * 16 = 128)
See occupancy calculator for limitations on threads per block, threads per SM, registers per SM, registers per thread, ...
Warps Scheduling and CUDA Cores
CUDA cores are floating point/ALU units. Each SM has other types of execution units including load/store, special function, branch, etc. A CUDA core is roughly equivalent to one lane of a SIMD unit in an x86 processor. It is not equivalent to an x86 core.
Occupancy is the ratio of resident warps per SM to the maximum number of warps per SM. The more warps per SM, the higher the chance that the warp scheduler has an eligible warp to schedule. However, the higher the occupancy, the fewer resources are available per thread. As a basic goal you want to target more than
25% or 8 warps on Tesla
50% or 24 warps on Fermi
50% or 32 warps on Kepler (generally higher)
You'll notice there is no real relationship to CUDA cores in these calculations.
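The occupancy targets above follow directly from the definition. In the Python sketch below, the per-SM warp limits (32, 48 and 64) are the ones implied by the percentages in the list; they are assumptions here and should be checked against the documentation for the exact device.

```python
# Maximum resident warps per SM, as implied by the targets above
# (ASSUMPTION: Tesla here means compute capability 1.2/1.3).
MAX_WARPS_PER_SM = {"Tesla": 32, "Fermi": 48, "Kepler": 64}


def occupancy(resident_warps, arch):
    """Occupancy = resident warps per SM / maximum warps per SM."""
    limit = MAX_WARPS_PER_SM[arch]
    if not 0 <= resident_warps <= limit:
        raise ValueError("resident_warps outside the range for this SM")
    return resident_warps / limit


print(occupancy(8, "Tesla"))     # 0.25
print(occupancy(24, "Fermi"))    # 0.5
print(occupancy(32, "Kepler"))   # 0.5
```

The number of CUDA cores never enters the calculation, which matches the remark above.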
To understand this better, read the Fermi whitepaper. If you can use the Nsight Visual Studio Edition CUDA Profiler, look at the Issue Efficiency experiment (not yet available in the CUDA Profiler or Visual Profiler) to understand how well your kernel is hiding execution and memory latency.

CUDA - Blocks and Threads

I have implemented a string matching algorithm on the GPU. The search time of the parallel version is considerably lower than that of the sequential version of the algorithm, but with different numbers of blocks and threads I get different results.
How can I determine the number of blocks and threads to get the best results?
I think this question is hard, if not impossible, to answer, because it really depends on the algorithm and how it operates. Since I can't see your implementation, I can only give you some leads:
Minimise global memory traffic and check how you can max out the use of shared memory. Generally, get a good feel for how threads access memory, how data is retrieved, etc.
Understand how your warps operate. Sometimes threads in a warp may wait for other threads to finish if you have a 1 to 1 mapping between threads and data. Instead of this 1 to 1 mapping, you can map each thread to multiple data items so that the threads are kept busy.
Since blocks consist of threads that are grouped into warps of 32 threads, it is best if the number of threads in a block is a multiple of 32, so that you don't get warps consisting of 3 threads etc.
Avoid diverging paths in warps.
I hope it helps a bit.
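The point about block sizes that are multiples of 32 can be made concrete with a small Python sketch. It only counts warps and unused lanes for a given block size; the helper `warp_usage` is illustrative and assumes a warp size of 32.

```python
WARP_SIZE = 32


def warp_usage(threads_per_block):
    """Return (warps, idle_lanes) for one block of the given size."""
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")
    warps = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    idle_lanes = warps * WARP_SIZE - threads_per_block
    return warps, idle_lanes


print(warp_usage(64))   # (2, 0)   every lane does useful work
print(warp_usage(40))   # (2, 24)  second warp has only 8 active threads
print(warp_usage(35))   # (2, 29)  second warp has only 3 active threads
```

A partially filled warp still occupies a full warp slot on the SM, so the idle lanes are capacity that cannot be used by anything else.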
@Chris's points are very important too, but depend more on the algorithm itself.
Check the CUDA manual on thread alignment with respect to memory lookups. Shared memory arrays should also have a size that is a multiple of 16.
Use coalesced global memory reads. Algorithm design often gives you this already, and using shared memory helps.
Don't use atomic operations in global memory, or at all if possible. They are very slow. Some algorithms using atomic operations can be rewritten using different techniques.
Without seeing the code, no one can tell you what is best or why performance changes.
The number of threads per block of your kernel is the most important value.
Important values to calculate that value are:
Maximum number of resident threads per multiprocessor
Maximum number of resident blocks per multiprocessor
Maximum number of threads per block
Number of 32-bit registers per multiprocessor
Your algorithms should be scalable across all GPUs, reaching 100% occupancy. For this I wrote a helper class which automatically detects the best thread count for the GPU in use and passes it to the kernel as a DEFINE.
/**
 * Number of Threads in a Block
 * Maximum number of resident blocks per multiprocessor : 8
 * ///////////////////
 * Compute capability:
 * ///////////////////
 * Cuda [1.0 - 1.1] =
 *     Maximum number of resident threads per multiprocessor 768
 *     Optimal Usage: 768 / 8 = 96
 * Cuda [1.2 - 1.3] =
 *     Maximum number of resident threads per multiprocessor 1024
 *     Optimal Usage: 1024 / 8 = 128
 * Cuda [2.x] =
 *     Maximum number of resident threads per multiprocessor 1536
 *     Optimal Usage: 1536 / 8 = 192
 */
public static int BLOCK_SIZE_DEF = 96;
Example for Cuda 1.1, reaching 768 resident threads per SM:
8 Blocks * 96 Threads per Block = 768 threads
3 Blocks * 256 Threads per Block = 768 threads
1 Block * 512 Threads per Block = 512 threads <- 33% of GPU will be idle
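The three cases above can be reproduced with a short Python sketch. It uses the Cuda 1.1 limits quoted earlier (768 resident threads and 8 resident blocks per SM) and deliberately ignores register and shared memory limits; the function name `resident_threads` is illustrative.

```python
def resident_threads(block_size, max_threads_per_sm=768, max_blocks_per_sm=8):
    """Threads resident on one SM, limited by block count and thread count.

    Simplification: register and shared memory limits are ignored.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = min(max_blocks_per_sm, max_threads_per_sm // block_size)
    return blocks * block_size


for size in (96, 256, 512):
    threads = resident_threads(size)
    idle = 1 - threads / 768
    print(size, threads, f"{idle:.0%} idle")
# 96 768 0% idle
# 256 768 0% idle
# 512 512 33% idle
```

Block sizes below 96 also fall short on this hardware, because the limit of 8 resident blocks is reached before the thread limit: 8 blocks of 64 threads give only 512 resident threads.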
This is also mentioned in the book:
Programming Massively Parallel Processors: A Hands-on Approach (Applications of GPU Computing Series)
Good programming advice:
Analyse your kernel code and write down the maximal number of threads it can handle or how many "units" it can process.
Also output your register usage and try to lower it for the targeted CUDA version, because if you use too many registers in your kernel, fewer blocks will be resident, resulting in lower occupancy and performance.
Example: Using Cuda 1.1 with the optimal number of 768 resident threads per SM, you have 8192 registers to use. This leads to 8192 / 768 = 10 registers per thread (rounded down) at most. If you use 11, the GPU will run 1 block fewer per SM, resulting in decreased performance.
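The register budget in the example above can be checked in Python. The sketch assumes 96-thread blocks (the `BLOCK_SIZE_DEF` value used earlier) and ignores the allocation granularity that real hardware applies to registers, so it is an approximation rather than a replacement for the occupancy calculator.

```python
def blocks_per_sm(regs_per_thread, block_size=96, regs_per_sm=8192,
                  max_blocks=8, max_threads=768):
    """Resident blocks per SM under register, thread and block limits.

    Simplification: register allocation granularity is ignored.
    """
    if regs_per_thread <= 0 or block_size <= 0:
        raise ValueError("regs_per_thread and block_size must be positive")
    by_registers = regs_per_sm // (regs_per_thread * block_size)
    by_threads = max_threads // block_size
    return min(max_blocks, by_threads, by_registers)


print(8192 // 768)          # 10 registers per thread at full occupancy
print(blocks_per_sm(10))    # 8 blocks -> 768 resident threads
print(blocks_per_sm(11))    # 7 blocks -> 672 resident threads
```

One extra register per thread is enough to lose a whole block per SM in this configuration, which is why capping register usage at compile time can pay off.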
Example: A matrix independent row vector normalizing kernel of mine.
/*
 * ////////////////////////
 * // Compute capability //
 * ////////////////////////
 * Used 12 registers, 540+16 bytes smem, 36 bytes cmem[1]
 * Used 10 registers, 540+16 bytes smem, 36 bytes cmem[1] <-- with -maxrregcount 10 Limit for Cuda 1.1
 * I: Maximum number of Rows = max(x-dim)^max(dimGrid)
 * II: Maximum number of Columns = unlimited, since they are loaded in a tile loop
 * Cuda [1.0 - 1.3]:
 *     I: 65535^2 = 4.294.836.225
 * Cuda [2.0]:
 *     I: 65535^3 = 281.462.092.005.375
 */
