CUDA Parallel Cross Product - parallel-processing

Disclaimer: I am fairly new to CUDA and parallel programming - so if you're not going to bother to answer my question, just ignore this, or at least point me to the right resources so I can find the answer myself.
Here's the particular problem I'm looking to solve using parallel programming. I have some 1D arrays that store 3D vectors in this format -> [v0x, v0y, v0z, ... vnx, vny, vnz], where n is the vector, and x, y, z are the respective components.
Suppose I want to find the cross product between vectors [v0, v1, ... vn] in one array and their corresponding vectors [v0, v1, ... vn] in another array.
The calculation is pretty straightforward without parallelization:
result[x] = vec1[y]*vec2[z] - vec1[z]*vec2[y];
result[y] = vec1[z]*vec2[x] - vec1[x]*vec2[z];
result[z] = vec1[x]*vec2[y] - vec1[y]*vec2[x];
The problem I'm having is understanding how to implement CUDA parallelization for the arrays I currently have. Since each value in the result vector is a separate calculation, I can effectively run the above calculation for each vector in parallel. Since each component of the resulting cross product is a separate calculation, those too could run in parallel. How would I go about setting up the blocks and threads, and how should I think about structuring the threads for such a problem?

The top 2 optimization priorities for any CUDA programmer are to use memory efficiently, and expose enough parallelism to hide latency. We'll use those to guide our algorithmic choices.
A very simple thread strategy (the thread strategy answers the question, "what will each thread do or be responsible for?") in any transformation (as opposed to reduction) type problem is to have each thread be responsible for 1 output value. Your problem fits the description of transformation - the output data set size is on the order of the input data set size(s).
I'll assume that you intended to have two equal length vectors containing your 3D vectors, and that you want to take the cross product of the first 3D vectors in each and the 2nd 3D vectors in each, and so on.
If we choose a thread strategy of 1 output point per thread (i.e. result[x] or result[y] or result[z], all together would be 3 output points), then we will need 3 threads to compute the output of each vector cross product. If we have enough vectors to multiply, then we will have enough threads to keep our machine "busy" and do a good job of hiding latency. As a rule of thumb, your problem will start to become interesting on GPUs if the number of threads is 10000 or more, so this means we would want your 1D vectors to consist of about 3000 3D vectors or more. Let's assume that is the case.
In order to tackle the memory efficiency objective, our first task is to load your vector data from global memory. We will want this ideally to be coalesced, which roughly means adjacent threads access adjacent elements in memory. We'll want the output stores to be coalesced also, and our thread strategy of choosing one output point/one vector component per thread will work nicely to support that.
For efficient memory usage, we'd like to ideally load each item from global memory only once. Your algorithm naturally involves a small amount of data reuse. The data reuse is evident since the computation of result[y] depends on vec2[z] and the computation of result[x] also depends on vec2[z] to pick just one example. Therefore a typical strategy when there is data reuse is to load the data first into CUDA shared memory, and then allow the threads to perform their computations based on the data in shared memory. As we will see, this makes it fairly easy/convenient for us to arrange for coalesced loads from global memory, since the global data load arrangement is no longer tightly coupled to the threads or the usage of the data for computation.
The last challenge is to figure out an indexing pattern so that each thread will select the proper elements from shared memory to multiply together. If we look at your calculation pattern that you have depicted in your question, we see that the first load from vec1 follows an offset pattern of +1(modulo 3) from the index that the result is being computed for. So x->y, y->z, and z -> x. Likewise we see a +2(modulo 3) for the next load from vec2, another +2(modulo 3) pattern for the next load from vec1 and another +1(modulo 3) pattern for the final load from vec2.
If we combine all these ideas, we can then write a kernel that should have generally efficient characteristics:
$ cat t1003.cu
#include <stdio.h>
#define TV1 1
#define TV2 2
const size_t N = 4096; // number of 3D vectors
const int blksize = 192; // choose as multiple of 3 and 32, and less than 1024
typedef float mytype;
//pairwise vector cross product
template <typename T>
__global__ void vcp(const T * __restrict__ vec1, const T * __restrict__ vec2, T * __restrict__ res, const size_t n){
  __shared__ T sv1[blksize];
  __shared__ T sv2[blksize];
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  while (idx < 3*n){ // grid-stride loop
    // load shared memory using coalesced pattern from global memory
    sv1[threadIdx.x] = vec1[idx];
    sv2[threadIdx.x] = vec2[idx];
    // compute modulo/offset indexing for thread loads of shared data from vec1, vec2
    int my_mod = threadIdx.x%3; // costly, but possibly hidden by global load latency
    int off1 = my_mod+1;
    if (off1 > 2) off1 -= 3;
    int off2 = my_mod+2;
    if (off2 > 2) off2 -= 3;
    __syncthreads();
    // each thread loads its computation elements from shared memory
    T t1 = sv1[threadIdx.x-my_mod+off1];
    T t2 = sv2[threadIdx.x-my_mod+off2];
    T t3 = sv1[threadIdx.x-my_mod+off2];
    T t4 = sv2[threadIdx.x-my_mod+off1];
    // compute result, and store using coalesced pattern, to global memory
    res[idx] = t1*t2-t3*t4;
    idx += gridDim.x*blockDim.x;} // for grid-stride loop
}
int main(){
  mytype *h_v1, *h_v2, *d_v1, *d_v2, *h_res, *d_res;
  h_v1 = (mytype *)malloc(N*3*sizeof(mytype));
  h_v2 = (mytype *)malloc(N*3*sizeof(mytype));
  h_res = (mytype *)malloc(N*3*sizeof(mytype));
  cudaMalloc(&d_v1, N*3*sizeof(mytype));
  cudaMalloc(&d_v2, N*3*sizeof(mytype));
  cudaMalloc(&d_res, N*3*sizeof(mytype));
  for (int i = 0; i<N; i++){
    h_v1[3*i] = TV1;
    h_v1[3*i+1] = 0;
    h_v1[3*i+2] = 0;
    h_v2[3*i] = 0;
    h_v2[3*i+1] = TV2;
    h_v2[3*i+2] = 0;
    h_res[3*i] = 0;
    h_res[3*i+1] = 0;
    h_res[3*i+2] = 0;}
  cudaMemcpy(d_v1, h_v1, N*3*sizeof(mytype), cudaMemcpyHostToDevice);
  cudaMemcpy(d_v2, h_v2, N*3*sizeof(mytype), cudaMemcpyHostToDevice);
  vcp<<<(N*3+blksize-1)/blksize, blksize>>>(d_v1, d_v2, d_res, N);
  cudaMemcpy(h_res, d_res, N*3*sizeof(mytype), cudaMemcpyDeviceToHost);
  // verification
  for (int i = 0; i < N; i++) if ((h_res[3*i] != 0) || (h_res[3*i+1] != 0) || (h_res[3*i+2] != TV1*TV2)) { printf("mismatch at %d, was: %f, %f, %f, should be: %f, %f, %f\n", i, h_res[3*i], h_res[3*i+1], h_res[3*i+2], (float)0, (float)0, (float)(TV1*TV2)); return -1;}
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
  return 0;
}
$ nvcc t1003.cu -o t1003
$ cuda-memcheck ./t1003
========= CUDA-MEMCHECK
no error
========= ERROR SUMMARY: 0 errors
$
Note that I've chosen to write the kernel using a grid-stride loop. This isn't terribly important to this discussion, and not that relevant for this problem, because I've chosen a grid size equal to the problem size (4096*3). However for much larger problem sizes, you might choose a smaller grid size than the overall problem size, for some possible small efficiency gain.
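For illustration only, here is a hedged sketch of how you might size the grid to the device rather than to the data for a much larger problem; the grid-stride loop in vcp already covers all 3*N elements regardless of grid size. The 32-blocks-per-SM factor is an arbitrary assumption, not something taken from the code above:
int dev = 0, numSMs = 0;
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, dev); // how many SMs this GPU has
int gridSize = 32 * numSMs; // enough blocks to fill the machine many times over (arbitrary factor)
vcp<<<gridSize, blksize>>>(d_v1, d_v2, d_res, N); // the grid-stride loop picks up the remaining elements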
For a problem as simple as this, it's fairly easy to define "optimality": the optimal kernel duration is however long it takes to load the input data (just once) and write the output data. If we consider a larger version of the test code above, changing N to 40960 (and making no other changes), then the total data read and written would be 40960*3*4*3 bytes. If we profile that code and then compare to bandwidthTest as a proxy for peak achievable memory bandwidth, we observe:
$ CUDA_VISIBLE_DEVICES="1" nvprof ./t1003
==27861== NVPROF is profiling process 27861, command: ./t1003
no error
==27861== Profiling application: ./t1003
==27861== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 65.97% 162.22us 2 81.109us 77.733us 84.485us [CUDA memcpy HtoD]
30.04% 73.860us 1 73.860us 73.860us 73.860us [CUDA memcpy DtoH]
4.00% 9.8240us 1 9.8240us 9.8240us 9.8240us void vcp<float>(float const *, float const *, float*, unsigned long)
API calls: 99.10% 249.79ms 3 83.263ms 6.8890us 249.52ms cudaMalloc
0.46% 1.1518ms 96 11.998us 374ns 454.09us cuDeviceGetAttribute
0.25% 640.18us 3 213.39us 186.99us 229.86us cudaMemcpy
0.10% 255.00us 1 255.00us 255.00us 255.00us cuDeviceTotalMem
0.05% 133.16us 1 133.16us 133.16us 133.16us cuDeviceGetName
0.03% 71.903us 1 71.903us 71.903us 71.903us cudaLaunchKernel
0.01% 15.156us 1 15.156us 15.156us 15.156us cuDeviceGetPCIBusId
0.00% 7.0920us 3 2.3640us 711ns 4.6520us cuDeviceGetCount
0.00% 2.7780us 2 1.3890us 612ns 2.1660us cuDeviceGet
0.00% 1.9670us 1 1.9670us 1.9670us 1.9670us cudaGetLastError
0.00% 361ns 1 361ns 361ns 361ns cudaGetErrorString
$ CUDA_VISIBLE_DEVICES="1" /usr/local/cuda/samples/bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: Tesla K20Xm
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6375.8
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6554.3
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 171220.3
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
$
The kernel takes 9.8240us to execute, and in that time loads or stores a total of 40960*3*4*3 bytes of data. Therefore the achieved memory bandwidth by the kernel is 40960*3*4*3/0.000009824 or 150 GB/s. The proxy measurement for peak achievable on this GPU is 171 GB/s, so this kernel achieves 88% of the optimal throughput. With more careful benchmarking to run the kernel twice in a row, the 2nd execution requires only 8.99us to execute. This brings the achieved bandwidth in this case up to 96% of peak achievable throughput.
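As a rough sketch of what "more careful benchmarking" could look like, here is one way to time a warmed-up launch with CUDA events; it assumes the same buffers and launch configuration as t1003.cu above, and the variable names are illustrative:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
vcp<<<(N*3+blksize-1)/blksize, blksize>>>(d_v1, d_v2, d_res, N); // warm-up launch (discarded)
cudaEventRecord(start);
vcp<<<(N*3+blksize-1)/blksize, blksize>>>(d_v1, d_v2, d_res, N); // timed launch
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);      // elapsed time in milliseconds
double bytes = 3.0 * N * 3 * sizeof(mytype); // vec1 + vec2 read, res written
printf("kernel: %.3f us, %.1f GB/s\n", ms*1000.0, bytes/(ms*1.0e6));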

Related

Strange under-performance of Titan X in memory-bound kernels (e.g. elementwise addition)

I've been noticing super-slow execution times for essentially all of my CUDA kernels on one machine (Fedora 24, GeForce Titan X Maxwell), but not on others. Edit: I previously gave the CUDA vectorAdd sample as an MCVE, but doubts were raised about whether it should really be memory-bottlenecked given the low workload per thread, so here's a hand-unrolled version of that kernel:
enum { serialization_factor = 8 };
__global__ void vectorAdd(
    const float* __restrict__ lhs,
    const float* __restrict__ rhs,
    float* __restrict__ result,
    int length)
{
    int pos = threadIdx.x + blockIdx.x * blockDim.x * serialization_factor;
    if (length - pos >= blockDim.x * serialization_factor) {
        #pragma unroll
        for(int i = 0; i < serialization_factor; i++) {
            result[pos] = lhs[pos] + rhs[pos];
            pos += blockDim.x;
        }
    }
    else {
        for(; pos < length; pos += blockDim.x) {
            result[pos] = lhs[pos] + rhs[pos];
        }
    }
}
... and suppose we run this for 5,000,000 elements; and launch the kernel twice, ignoring the first run.
Well, with my home GPU, a Geforce GTX 650 Ti Boost, I get 527 usec. This is a bit strange - I was expecting something like 555 usec, by bandwidth calculations: 3004 MHz clock * 192-bit bus = 72096 MB = 72 GB/sec , and 2 * 4 bytes per float * 5M of data. But it's pretty close so let's ignore the difference. The profiler tells me the "Global Load Throughput" is 72.355 GB/sec.
Now, on the Maxwell Titan X at work, I get 232 usec. That's about twice as fast - but the GPU's bandwidth is 5 times as high as my home GPU: ~336 GB/sec. I should be seeing something like 120 usec. And - the profiler tells me the "Global Load Throughput" is 343.271 GB/sec (!)
How could this be happening?
Notes:
If you think I've gotten something wrong with the kernel, please comment about that rather than writing an answer.
The Titan doesn't have ECC on.
Your bandwidth calculations are not fully accurate. The specified theoretical peak memory bandwidth of the GTX 650 Ti BOOST is twice as high (144.2 GB/s) as you calculated, because of double data rate transfer (transfer of separate words on both the rising and the falling edge of the clock signal). The achieved bandwidth in the vector add example is 50% higher than you calculated, because writing the results back to memory also needs to be taken into account. This means your GTX 650 Ti BOOST measurements achieved ~79% of its theoretical peak bandwidth.
The Titan X's specified peak memory bandwidth is 336.5 GB/s, so your test achieved ~77% of theoretical peak memory bandwidth.
This is about as good as it gets. The remaining discrepancy is due to overhead like memory refresh, the time needed to switch the transfer direction etc.
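For reference, a small standalone sketch that reproduces the corrected arithmetic from the answer above (all figures are taken from the question and the answer; nothing new is measured here):
#include <stdio.h>
int main(){
    // GTX 650 Ti BOOST: 3004 MHz memory clock, 192-bit bus, double data rate
    double peak_650 = 3004e6 * 2.0 * (192.0 / 8.0) / 1e9;   // ~144.2 GB/s
    // vectorAdd moves 2 loads + 1 store of 4-byte floats for 5M elements in 527 us
    double bytes_moved = 3.0 * 4.0 * 5e6;
    double ach_650 = bytes_moved / 527e-6 / 1e9;             // ~113.9 GB/s
    printf("GTX 650 Ti BOOST: %.1f of %.1f GB/s (~%.0f%%)\n",
           ach_650, peak_650, 100.0 * ach_650 / peak_650);
    // Titan X: 336.5 GB/s specified peak, 232 us measured
    double ach_tx = bytes_moved / 232e-6 / 1e9;              // ~258.6 GB/s
    printf("Titan X: %.1f of 336.5 GB/s (~%.0f%%)\n", ach_tx, 100.0 * ach_tx / 336.5);
    return 0;
}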
Adding some to tera's answer, your algorithm has a warm-up and cool-down phase: when having many requests in flight, latency gets hidden indeed, but at the cost of a warm-up, and a cool-down for the last iterations.
If your scheduling is good, you will have a work chunk of 2048 (max resident threads per sm) x 24 (number of sm on the GTX Titan X). Each of which will operate on 8 values. Hence, your work chunk is 393,216 entries.
For your 5,000,000 size sample, it results in 12.7 iterations (13 with the last being incomplete). The warm-up/cool-down cost is 1 iteration.
Depending on scheduling of threads (and this is not necessarily predictable), you may run 14 iterations total; for which you could have 5,111,808 entries at approximately same cost (still one warm-up/cool-down). That size would provide you the best performance I believe.
As a result, the incomplete iteration plus warm-up/cool-down could cost about 10% of performance, the achieved bandwidth being closer to 85% of peak if not more.
The minimum run time of a kernel should also be looked at, as it might account for a few microseconds as well. Running on various data sizes should mitigate this point.
Last but not least, the memory frequency might be modifiable with nvidia-smi, as explained here.

How to multiply 100 Integers with another 100 integers in parallel on a GPU using PyOpenCL?

There are many PyOpenCL examples of doing arithmetic operations on vectors of size 4. If I have to multiply 100 integers with another 100 integers all at once using an AMD GPU on a Mac through PyOpenCL, can someone provide and explain the code please? Since the max vector size can be 16, I would like to know how I can ask the GPU to do this operation, which needs to process more than 16 integers in parallel.
I have a AMD D500 firepro GPU.
Does every work item (thread) perform a task independently? If yes, there are 24 compute units, and each compute unit has 255 work items for a single dimension and [255,255,255] for three dimensions. Does that mean my GPU has 6120 independent work items?
I made a short example for an entry-wise multiplication of two one-dimensional integer arrays. Note that if you plan to multiply only 100 values it will not be faster than doing this on the CPU, since you have a lot of overhead with copying the data and so on.
import pyopencl as cl
import numpy as np
#this is compiled by the GPU driver and will be executed on the GPU
kernelsource = """
__kernel void multInt( __global int* res,
                       __global int* a,
                       __global int* b){
    int i = get_global_id(0);
    int N = get_global_size(0); //this is the dimension given as second argument in the kernel execution
    res[i] = a[i] * b[i];
}
"""
device = cl.get_platforms()[0].get_devices()[0]
context = cl.Context([device])
program = cl.Program(context, kernelsource).build()
queue = cl.CommandQueue(context)
#preparing input data in numpy arrays in local memory (i.e. accessible by the CPU)
N = 100
a_local = np.array(range(N)).astype(np.int32)
b_local = (np.ones(N)*10).astype(np.int32)
#preparing result buffer in local memory
res_local = np.zeros(N).astype(np.int32)
#copy input data to GPU-memory
a_buf = cl.Buffer(context, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=a_local)
b_buf = cl.Buffer(context, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=b_local)
#prepare result buffer in GPU-memory
res_buf = cl.Buffer(context, cl.mem_flags.WRITE_ONLY, res_local.nbytes)
#execute previously compiled kernel on GPU
program.multInt(queue,(N,), None, res_buf, a_buf, b_buf)
#copy the result from GPU-memory to CPU-memory
cl.enqueue_copy(queue, res_local, res_buf)
print("result: {}".format(res_local))
Regarding the documentation of PyOpenCL: once you have understood the working principles of GPGPU programming and the programming concepts of OpenCL, PyOpenCL is pretty straightforward.

GPU sorting vs CPU sorting

I made a very naive implementation of the mergesort algorithm, which I ported to CUDA with very minimal implementation changes; the algorithm code follows:
//Merge for mergesort
__device__ void merge(int* aux,int* data,int l,int m,int r)
{
    int i,j,k;
    for(i=m+1;i>l;i--){
        aux[i-1]=data[i-1];
    }
    //Copy in reverse order the second subarray
    for(j=m;j<r;j++){
        aux[r+m-j]=data[j+1];
    }
    //Merge
    for(k=l;k<=r;k++){
        if(aux[j]<aux[i] || i==(m+1))
            data[k]=aux[j--];
        else
            data[k]=aux[i++];
    }
}
//What this code does is perform a local merge
//of the array
__global__
void basic_merge(int* aux, int* data,int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int tn = n / (blockDim.x*gridDim.x);
    int l = i * tn;
    int r = l + tn;
    //printf("Thread %d: %d,%d: \n",i,l,r);
    for(int i{1};i<=(tn/2)+1;i*=2)
        for(int j{l+i};j<(r+1);j+=2*i)
        {
            merge(aux,data,j-i,j-1,j+i-1);
        }
    __syncthreads();
    if(i==0){
        //Complete the merge
        do{
            for(int i{tn};i<(n+1);i+=2*tn)
                merge(aux,data,i-tn,i-1,i+tn-1);
            tn*=2;
        }while(tn<(n/2)+1);
    }
}
The problem is that no matter how many threads I launch on my GTX 760, the sorting performance is always much, much worse than the same code on a CPU running 8 threads (my CPU has hardware support for up to 8 concurrent threads).
For example, sorting 150 million elements on the CPU takes a few hundred milliseconds, but on the GPU it takes up to 10 minutes (even with 1024 threads per block)! Clearly I'm missing some important point here; can you please provide me with some comments? I strongly suspect the problem is in the final merge operation performed by the first thread: at that point we have a certain number of subarrays (the exact number depends on the number of threads) which are sorted and need to be merged, and this is completed by just one thread (one tiny GPU thread).
I think I should use some kind of reduction here, so that each thread performs further merges in parallel, and the "Complete the merge" step just merges the last two sorted subarrays.
I'm very new to CUDA.
EDIT (ADDENDUM):
Thanks for the link, I must admit I still need some time to learn CUDA better before taking full advantage of that material. Anyway, I was able to rewrite the sorting function to take advantage of multiple threads as much as possible; my first implementation had a bottleneck in the last phase of the merge procedure, which was performed by only one multiprocessor.
Now, after the first merge, I use up to (1/2)*(n/b) threads each time, where n is the amount of data to sort and b is the size of the chunk of data sorted by each thread.
The improvement in performance is surprising: using only 1024 threads it takes about ~10 seconds to sort 30 million elements. Well, this is still a poor result unfortunately! The problem is in the thread synchronization, but first things first, let's see the code:
__global__
void basic_merge(int* aux, int* data,int n)
{
    int k = blockIdx.x*blockDim.x + threadIdx.x;
    int b = log2( ceil( (double)n / (blockDim.x*gridDim.x)) ) + 1;
    b = pow( (float)2, b);
    int l=k*b;
    int r=min(l+b-1,n-1);
    __syncthreads();
    for(int m{1};m<=(r-l);m=2*m)
    {
        for(int i{l};i<=r;i+=2*m)
        {
            merge(aux,data,i,min(r,i+m-1),min(r,i+2*m-1));
        }
    }
    __syncthreads();
    do{
        if(k<=(n/b)*.5)
        {
            l=2*k*b;
            r=min(l+2*b-1,n-1);
            merge(aux,data,l,min(r,l+b-1),r);
        }else break;
        __syncthreads();
        b*=2;
    }while((r+1)<n);
}
The function 'merge' is the same as before. Now the problem is that I'm using only 1024 threads instead of the 65000 or more I can run on my CUDA device, because __syncthreads does not work as a sync primitive at grid level, but only at block level!
So I can synchronize up to 1024 threads, that is, the number of threads supported per block. Without proper synchronization each thread messes up the data of the others, and the merging procedure does not work.
In order to boost the performance I need some kind of synchronization between all the threads in the grid; it seems that no API exists for this purpose, but I read about a solution which involves multiple kernel launches from the host code, using the host as a barrier for all the threads.
I have a plan for how to implement this technique in my mergesort function, and I will provide the code in the near future. Do you have any suggestions of your own?
Thanks
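(Not the asker's final code, but a minimal sketch of the multi-kernel-launch idea described above: each merge pass is its own kernel launch, and the boundary between launches acts as the grid-wide barrier. The merge_pass kernel, its parameters, and the 256-thread block size are illustrative assumptions; it reuses the __device__ merge function from the question.)
// One pass: each thread merges one pair of adjacent sorted runs of length 'width'.
__global__ void merge_pass(int *aux, int *data, int n, int width)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    int l = 2 * k * width;            // start of this thread's pair of runs
    int m = l + width - 1;            // last index of the left run
    if (m >= n - 1) return;           // no right run to merge with
    int r = min(l + 2 * width - 1, n - 1);
    merge(aux, data, l, m, r);        // the asker's __device__ merge
}

// Host side: one launch per pass. Launches on the same stream execute in order,
// so every merge of one pass finishes before the next, wider pass starts.
void merge_sorted_chunks(int *d_aux, int *d_data, int n, int chunk)
{
    for (int width = chunk; width < n; width *= 2) {
        int pairs  = (n + 2 * width - 1) / (2 * width);
        int blocks = (pairs + 255) / 256;
        merge_pass<<<blocks, 256>>>(d_aux, d_data, n, width);
        cudaDeviceSynchronize();      // explicit grid-wide barrier on the host
    }
}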
It looks like all the work is being done in __global__ memory. Each read and each write there takes a long time, which makes the function slow. I think it would help to first copy your data to __shared__ memory, do the work there, and then, when the sorting is completed (for that block), copy the results back to global memory.
Global memory takes about 400 clock cycles (or about 100 if the data happens to be in L2 cache). Shared memory on the other hand only takes 1-3 clock cycles to write and read.
The above would help with performance a lot. Some other super minor things you can try are..
(1) remove the first __syncthreads(); It is not really doing anything because no data is being passed between warps at that point.
(2) Move the "int b = log2( ceil( (double)n / (blockDim.x*gridDim.x)) ) + 1; b = pow( (float)2, b);" outside the kernel and just pass in b instead. This is being calculated over and over when it really only needs to be calculated once.
I tried to follow along on your algorithm but was not able to. The variable names were hard to follow...or... your code is above my head and I cannot follow. =) Hope the above helps.
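(To make the shared-memory suggestion above concrete, here is a minimal sketch of the staging pattern: each block copies a tile into shared memory, sorts it there, and writes the sorted tile back, leaving one sorted run per block in global memory. The TILE size and the odd-even transposition sort are my illustrative choices, not the answerer's code; it assumes a device that allows 1024 threads per block, which the asker's GTX 760 does.)
#include <limits.h>
#define TILE 1024                     // one element per thread; equals the block size

__global__ void block_sort(int *data, int n)
{
    __shared__ int tile[TILE];
    int base = blockIdx.x * TILE;
    int tid  = threadIdx.x;
    int idx  = base + tid;
    tile[tid] = (idx < n) ? data[idx] : INT_MAX;   // pad the last tile with sentinels
    __syncthreads();
    // odd-even transposition sort entirely in shared memory
    for (int phase = 0; phase < TILE; ++phase) {
        int i = 2 * tid + (phase & 1);             // alternate even/odd comparator pairs
        if (i + 1 < TILE && tile[i] > tile[i + 1]) {
            int t = tile[i]; tile[i] = tile[i + 1]; tile[i + 1] = t;
        }
        __syncthreads();
    }
    if (idx < n) data[idx] = tile[tid];            // write the sorted tile back
}
// launch (sketch): block_sort<<<(n + TILE - 1) / TILE, TILE>>>(d_data, n);
The sorted runs produced this way can then be combined with global merge passes, for example the multi-launch sketch earlier in this question.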

Some basic CUDA enquiries

I am new to CUDA development and I decided to start writing small examples in order to understand how it works. I am sharing the kernel function that I made, which computes the squared Euclidean distance between the corresponding rows of two equal-sized matrices.
__global__ void cudaEuclid( float* A, float* B, float* C, int rows, int cols )
{
    int i, squareEuclDist = 0;
    int r = blockDim.x * blockIdx.x + threadIdx.x; // rows
    //int c = blockDim.y * blockIdx.y + threadIdx.y; // cols
    if( r < rows ){ // take each row with var r (thread)
        for ( i = 0; i < cols; i++ ) //compute squared Euclid dist of each row
            squareEuclDist += ( A[r + rows*i] - B[r + rows*i] ) * ( A[r + rows*i] - B[r + rows*i] );
        C[r] = squareEuclDist;
        squareEuclDist = 0;
    }
}
The kernel initialization is done by
int threadsPerBlock = 256;
int blocksPerGrid = ceil( (double) numElements / threadsPerBlock);
// numElements = 1500x200 (matrix size) ==> 1172 blocks/grid
and is called as
cudaEuclid<<<blocksPerGrid, threadsPerBlock>>>( d_A, d_B, d_C, rows, cols );
The d_A and d_B are the inserted matrices, in this example of size 1500 x 200.
Question 1: I have read the basic theory of choosing the threads per block and the blocks per grid, but something is still missing. For this simple kernel, I am trying to understand what the optimal kernel launch parameters are, and I am asking for a little help to start thinking the CUDA way.
Question 2: Another thing I would like to ask is whether there are any suggestions for improving the code's efficiency. Can we use int c = blockDim.y * blockIdx.y + threadIdx.y to make things more parallel? Is shared memory applicable here?
Below, my GPU info is attached.
Device 0: "GeForce 9600 GT"
CUDA Driver Version / Runtime Version 5.5 / 5.0
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 512 MBytes (536870912 bytes)
( 8) Multiprocessors x ( 8) CUDA Cores/MP: 64 CUDA Cores
GPU Clock rate: 1680 MHz (1.68 GHz)
Memory Clock rate: 700 Mhz
Memory Bus Width: 256-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Concurrent kernel execution: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 1 / 0
Question 3: Can we relate the amount of global memory to that of shared memory and the other types of memory that the GPU has? Does the number of threads have anything to do with that?
Question 4: If the maximum number of threads per block is 512, how is it possible for the maximum sizes of each dimension of a block to be 512 x 512 x 64 (= 16,777,216 threads)? What is the correlation with the maximum sizes of each dimension of a grid?
Question 5: Using the memory clock rate, can we say how many threads are processed each second?
UPDATE:
The for loop has been replaced with column threads:
__global__ void cudaEuclid( float* A, float* B, float* C, int rows, int cols ){
    int r = blockDim.x * blockIdx.x + threadIdx.x; // rows
    int c = blockDim.y * blockIdx.y + threadIdx.y; // cols
    float x=0;
    if(c < cols && r < rows){
        x = ( A[c + r*cols] - B[c + r*cols] ) * ( A[c + r*cols] - B[c + r*cols] );
    }
    C[r] = x;
}
Called with:
int threadsPerBlock = 256;
int blocksPerGrid = ceil( (double) numElements / threadsPerBlock);
cudaEuclid<<<blocksPerGrid, threadsPerBlock>>>( d_A, d_B, d_C, rows, cols );
A1. Optimizing the threads per block is basically a matter of heuristics. You could try
for(int threadsPerBlock=32; threadsPerBlock<=512;threadsPerBlock+=32){...}
A2. Currently you use one thread per row and sum the elements into squareEuclDist linearly. You could consider using one thread block per row. Within the block, each thread computes the squared difference of one element, and you could use a parallel reduction to sum them together. Please refer to the following link for parallel reduction.
http://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf
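(As a hedged sketch of that suggestion: one block per row, with a shared-memory tree reduction. The kernel name cudaEuclidReduce and ROW_THREADS are illustrative, and the indexing keeps the same column-style layout as the question, i.e. element (r, i) lives at A[r + rows*i].)
#define ROW_THREADS 256               // threads per block; must be a power of two

__global__ void cudaEuclidReduce(const float *A, const float *B, float *C,
                                 int rows, int cols)
{
    __shared__ float partial[ROW_THREADS];
    int r = blockIdx.x;                              // one block per row
    float sum = 0.0f;
    for (int i = threadIdx.x; i < cols; i += blockDim.x) {
        float d = A[r + rows*i] - B[r + rows*i];
        sum += d * d;                                // per-thread partial sum
    }
    partial[threadIdx.x] = sum;
    __syncthreads();
    // tree reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) C[r] = partial[0];         // thread 0 writes the row result
}
// launch (sketch): cudaEuclidReduce<<<rows, ROW_THREADS>>>(d_A, d_B, d_C, rows, cols);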
A3. The list you show is the total amount of global/shared memory. Multiple threads will share these hardware resources. You can find this tool in your CUDA installation dir to help you calculate exactly how much of those hardware resources each thread can use in a particular kernel:
$CUDA_HOME/tools/CUDA_Occupancy_Calculator.xls
A4. The maximum size of each dimension does not mean all dimensions can reach their maximum at the same time. However, there is no limit on the total number of blocks per grid, so 65535 x 65535 x 1 blocks in a grid is possible.
A5. The memory clock has nothing to do with the number of threads. You can read the programming model section in the CUDA docs for more info.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#scalable-programming-model
OK, so there are a few things related to a kernel: one is the number of multiprocessors (associated with blocks) and the number of cores (associated with threads). Blocks are scheduled to run on a multiprocessor (you have 8 of them), and threads are scheduled to run on the multiple cores of a single multiprocessor. Ideally you would like enough blocks and threads that all your multiprocessors, and all cores in each multiprocessor, are occupied. It is advisable to have more blocks and threads than multiprocessors and cores, so the scheduler always has spare work to switch to.
Multiple dimensions make programming easier (e.g. for 2D/3D images, you could divide the image into sub-parts, give them to different blocks, and then process those sub-images on multiple threads); it is more intuitive to use multiple dimensions (x, y, z) for accessing blocks and threads. In some cases it helps to have more dimensions if there is a restriction on the maximum number of blocks in one dimension (for example, with a large image you may hit the limit on the maximum number of blocks if you use just one dimension).
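For example, a minimal sketch of the 2D indexing pattern just described; the kernel and its parameters are illustrative, not from the question:
// Each thread handles one pixel; 2D blocks tile the 2D image.
__global__ void invert(unsigned char *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];
}
// launch (sketch):
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   invert<<<grid, block>>>(d_img, width, height);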
I am not sure I understand what you mean in your third question, but I can tell you a bit about shared memory. Shared memory is present on a single multiprocessor and is shared by the cores on that processor. For you, the amount of shared memory is 16KB; most modern GPUs have 64KB of shared memory on a processor, and you can choose how much of it you want for your application: 16KB of the 64KB is generally reserved for cache, and you can use the remaining 48KB yourself, or increase the cache size and lower your shared memory size. Shared memory is much faster than global memory, so in case you have some data which will be accessed frequently, it would be wise to transfer it to shared memory. The number of threads is not at all related to shared memory. Also, global memory and shared memory are separate.
As you can see, each block dimension is at most 512, and you cannot have more than 512 threads per block in total (the limit has been raised to 1024 on newer architectures and CUDA versions). Up to Fermi, each multiprocessor had 32 or 48 cores, so it didn't make much sense to have more than 512 threads. The newer Kepler architecture has 192 cores per multiprocessor.
Threads are executed in warps, which are groups of 32 threads clubbed together and executed on the cores of a multiprocessor simultaneously. If you assume that there is always a miss in shared memory, then depending on the number of cores you have per multiprocessor and the memory clock rate, you can calculate how many threads would be processed each second (you would also need to take into account the number of instructions processed per thread, and there would also be some time spent on register operations etc.).
I hope that answers your questions to some extent.

How is a CUDA kernel launched?

I have created a simple CUDA application to add two matrices. It is compiling fine. I want to know how the kernel will be launched by all the threads and what will the flow be inside CUDA? I mean, in what fashion every thread will execute each element of the matrices.
I know this is a very basic concept, but I don't know this. I am confused regarding the flow.
You launch a grid of blocks.
Blocks are indivisibly assigned to multiprocessors (where the number of blocks on the multiprocessor determine the amount of available shared memory).
Blocks are further split into warps. For a Fermi GPU that is 32 threads that either execute the same instruction or are inactive (because they branched away, e.g. by exiting from a loop earlier than their neighbors within the same warp, or by not taking an if-branch that their neighbors did). On a Fermi GPU, at most two warps issue instructions on one multiprocessor at any one time.
Whenever there is latency (that is, execution stalls waiting for memory accesses or data dependencies to complete), another warp is run (the number of warps that fit onto one multiprocessor - of the same or different blocks - is determined by the number of registers used by each thread and the amount of shared memory used by a/the block(s)).
This scheduling happens transparently. That is, you do not have to think about it too much.
However, you might want to use the predefined integer vectors threadIdx (where is my thread within the block?), blockDim (how large is one block?), blockIdx (where is my block in the grid?) and gridDim (how large is the grid?) to split up work (read: input and output) among the threads. You might also want to read up how to effectively access the different types of memory (so multiple threads can be serviced within a single transaction) - but that's leading off topic.
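For instance, a minimal sketch of splitting a 1D workload with those four variables; the kernel itself is illustrative, not from the question:
// Each thread computes its global index and then strides by the total thread
// count, so any grid/block configuration covers all n elements.
__global__ void scale(float *data, float factor, int n)
{
    int i      = blockIdx.x * blockDim.x + threadIdx.x;  // where am I in the grid?
    int stride = gridDim.x * blockDim.x;                 // total number of threads
    for (; i < n; i += stride)
        data[i] *= factor;
}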
NSight provides a graphical debugger that gives you a good idea of what's happening on the device once you got through the jargon jungle. Same goes for its profiler regarding those things you won't see in the debugger (e.g. stall reasons or memory pressure).
You can synchronize all threads within the grid (all there are) by another kernel launch.
For non-overlapping, sequential kernel execution no further synchronization is needed.
The threads within one grid (or one kernel run - however you want to call it) can communicate via global memory using atomic operations (for arithmetic) or appropriate memory fences (for load or store access).
You can synchronize all threads within one block with the intrinsic instruction __syncthreads() (all threads will be active afterwards - although, as always, at most two warps issue instructions at a time on a Fermi multiprocessor). The threads within one block can communicate via shared or global memory using atomic operations (for arithmetic) or appropriate memory fences (for load or store access).
As mentioned earlier, all threads within a warp are always "synchronized", although some might be inactive. They can communicate through shared or global memory (or "lane swapping" on upcoming hardware with compute capability 3). You can use atomic operations (for arithmetic) and volatile-qualified shared or global variables (load or store access happening sequentially within the same warp). The volatile qualifier tells the compiler to always access memory and never registers whose state cannot be seen by other threads.
Further, there are warp-wide vote functions that can help you make branch decisions or compute integer (prefix) sums.
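A small sketch of a warp-wide vote driving a per-warp prefix count: __ballot() and __popc() are standard intrinsics on compute capability 2.0 and later, but this particular helper is my illustration, not part of the answer (on CUDA 9+ the __ballot_sync variant replaces __ballot).
// Within one warp: count how many lower-numbered lanes have 'pred' set,
// e.g. to compact flagged elements without a shared-memory scan.
__device__ int warp_prefix_count(int pred)
{
    unsigned int lane = threadIdx.x & 31;            // lane index within the warp
    unsigned int mask = __ballot(pred);              // bit i set if lane i's pred != 0
    return __popc(mask & ((1u << lane) - 1u));       // count set bits below my lane
}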
OK, that's basically it. Hope that helps. Had a good flow writing :-).
Let's take an example of the addition of 4*4 matrices. You have two matrices A and B, each of dimension 4*4.
#include <stdio.h>
#include <stdlib.h>

__global__ void add_matrices(float *ad, float *bd, float *cd, int N); // forward declaration (kernel is defined below main)

int main()
{
    float *a, *b, *c;    //To store your matrices A & B in RAM. Result will be stored in matrix C
    float *ad, *bd, *cd; // To store matrices into GPU's RAM.
    int i, j;
    int N = 4;           //No of rows and columns.
    size_t size = sizeof(float) * N * N;
    a = (float*)malloc(size); //Allocate space of RAM for matrix A
    b = (float*)malloc(size); //Allocate space of RAM for matrix B
    c = (float*)malloc(size); //Allocate space of RAM for matrix C
    //allocate memory on device
    cudaMalloc(&ad, size);
    cudaMalloc(&bd, size);
    cudaMalloc(&cd, size);
    //initialize host memory with its own indices
    for(i = 0; i < N; i++)
    {
        for(j = 0; j < N; j++)
        {
            a[i * N + j] = (float)(i * N + j);
            b[i * N + j] = -(float)(i * N + j);
        }
    }
    //copy data from host memory to device memory
    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);
    //calculate execution configuration
    dim3 grid (1, 1, 1);
    dim3 block (16, 1, 1);
    //each block contains N * N threads, each thread calculates 1 data element
    add_matrices<<<grid, block>>>(ad, bd, cd, N);
    cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);
    printf("Matrix A was---\n");
    for(i = 0; i < N; i++)
    {
        for(j = 0; j < N; j++)
            printf("%f ", a[i*N+j]);
        printf("\n");
    }
    printf("\nMatrix B was---\n");
    for(i = 0; i < N; i++)
    {
        for(j = 0; j < N; j++)
            printf("%f ", b[i*N+j]);
        printf("\n");
    }
    printf("\nAddition of A and B gives C----\n");
    for(i = 0; i < N; i++)
    {
        for(j = 0; j < N; j++)
            printf("%f ", c[i*N+j]); //if correctly evaluated, all values will be 0
        printf("\n");
    }
    //deallocate host and device memories
    cudaFree(ad);
    cudaFree(bd);
    cudaFree(cd);
    free(a);
    free(b);
    free(c);
    return 0;
}
/////Kernel Part
__global__ void add_matrices(float *ad, float *bd, float *cd, int N)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    cd[index] = ad[index] + bd[index];
}
Let's take the example of adding two 4*4 matrices (16 elements each).
You have two matrices A and B, each of dimension 4*4.
First of all you have to decide your thread configuration.
You are suppose to launch a kernel function, which will perform the parallel computation of you matrix addition, which will get executed on your GPU device.
Now,, one grid is launched with one kernel function..
A grid can have at most 65,535 blocks in each dimension, and the blocks can be arranged in up to three dimensions (65535 * 65535 * 65535).
Every block in a grid can have at most 1024 threads, and those threads can also be arranged in up to three dimensions (1024 * 1024 * 64), with the total per block not exceeding 1024.
Now our problem is the addition of 4 * 4 matrices (16 elements each).
A | 1 2 3 4 | B | 1 2 3 4 | C| 1 2 3 4 |
| 5 6 7 8 | + | 5 6 7 8 | = | 5 6 7 8 |
| 9 10 11 12 | | 9 10 11 12 | | 9 10 11 12 |
| 13 14 15 16| | 13 14 15 16| | 13 14 15 16|
We need 16 threads to perform the computation.
i.e. A(1,1) + B (1,1) = C(1,1)
A(1,2) + B (1,2) = C(1,2)
. . .
. . .
A(4,4) + B (4,4) = C(4,4)
All these threads will get executed simultaneously.
So we need a block with 16 threads.
For convenience we will arrange the threads as (16 * 1 * 1) in a block.
Since the number of threads is 16, we need only one block to hold them.
So the grid configuration will be dim3 grid(1,1,1), i.e. the grid will have only one block,
and the block configuration will be dim3 block(16,1,1), i.e. the block will have 16 threads arranged along one dimension.
Following program will give you the clear idea about its execution..
Understanding the indexing part (i.e. threadIdx, blockDim, blockIdx) is the important part. You need to go through the CUDA literature. Once you have a clear idea about indexing, you will have won half the battle! So spend some time with CUDA books, different algorithms, and paper and pencil, of course!
Try 'Cuda-gdb', which is the CUDA debugger.
