How is a CUDA kernel launched? - parallel-processing

I have created a simple CUDA application to add two matrices. It compiles fine. I want to know how the kernel will be launched by all the threads and what the flow will be inside CUDA. I mean, in what fashion will every thread execute each element of the matrices?
I know this is a very basic concept, but I don't know this. I am confused regarding the flow.

You launch a grid of blocks.
Blocks are indivisibly assigned to multiprocessors (where the number of blocks on the multiprocessor determines the amount of available shared memory).
Blocks are further split into warps. For a Fermi GPU that is 32 threads that either execute the same instruction or are inactive (because they branched away, e.g. by exiting a loop earlier than their neighbors within the same warp, or by not taking an if branch that their neighbors did). On a Fermi GPU at most two warps run on one multiprocessor at a time.
Whenever there is latency (that is, execution stalls waiting for a memory access or a data dependency to complete) another warp is run (the number of warps that fit onto one multiprocessor - of the same or different blocks - is determined by the number of registers used by each thread and the amount of shared memory used by a/the block(s)).
This scheduling happens transparently. That is, you do not have to think about it too much.
However, you might want to use the predefined integer vectors threadIdx (where is my thread within the block?), blockDim (how large is one block?), blockIdx (where is my block in the grid?) and gridDim (how large is the grid?) to split up the work (read: input and output) among the threads. You might also want to read up on how to effectively access the different types of memory (so multiple threads can be serviced within a single transaction) - but that's leading off topic.
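For instance, here is a minimal sketch of how these variables are typically combined into a global index (assuming a 1D grid of 1D blocks over an array of n elements):

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // unique global thread index
    if (i < n)                                     // guard the tail when n is not a multiple of blockDim.x
        data[i] = factor * data[i];
}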
Nsight provides a graphical debugger that gives you a good idea of what's happening on the device once you get through the jargon jungle. The same goes for its profiler regarding the things you won't see in the debugger (e.g. stall reasons or memory pressure).
You can synchronize all threads within the grid (all there are) only by another kernel launch.
For non-overlapping, sequential kernel execution no further synchronization is needed.
The threads within one grid (or one kernel run - whatever you want to call it) can communicate via global memory using atomic operations (for arithmetic) or appropriate memory fences (for load or store access).
You can synchronize all threads within one block with the intrinsic instruction __syncthreads() (all threads will be active afterwards - although, as always, at most two warps can run on a Fermi GPU). The threads within one block can communicate via shared or global memory using atomic operations (for arithmetic) or appropriate memory fences (for load or store access).
As mentioned earlier, all threads within a warp are always "synchronized", although some might be inactive. They can communicate through shared or global memory (or "lane swapping" on upcoming hardware with compute capability 3). You can use atomic operations (for arithmetic) and volatile-qualified shared or global variables (load or store access happening sequentially within the same warp). The volatile qualifier tells the compiler to always access memory and never registers whose state cannot be seen by other threads.
Further, there are warp-wide vote functions that can help you make branch decisions or compute integer (prefix) sums.
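For example, a minimal sketch of a warp-wide vote (written with the __ballot_sync/__popc form from newer toolkits; the Fermi-era equivalents are __ballot/__any/__all without the mask argument):

__device__ int warp_count_positive(float x)
{
    unsigned mask = __ballot_sync(0xffffffffu, x > 0.0f); // bit i is set if lane i's predicate holds
    return __popc(mask);                                  // number of lanes with a true predicate
}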
OK, that's basically it. Hope that helps. Had a good flow writing :-).

Let's take the example of adding two 4*4 matrices: you have two matrices A and B, each of dimension 4*4.
#include <stdio.h>
#include <stdlib.h>

__global__ void add_matrices(float *ad, float *bd, float *cd, int N);

int main()
{
    float *a, *b, *c;    // To store matrices A & B in RAM. The result will be stored in matrix C.
    float *ad, *bd, *cd; // To store the matrices in the GPU's RAM.
    int i, j;
    int N = 4;           // Number of rows and columns.
    size_t size = sizeof(float) * N * N;

    a = (float*)malloc(size); // Allocate space in RAM for matrix A
    b = (float*)malloc(size); // Allocate space in RAM for matrix B
    c = (float*)malloc(size); // Allocate space in RAM for the result C

    // Allocate memory on the device
    cudaMalloc(&ad, size);
    cudaMalloc(&bd, size);
    cudaMalloc(&cd, size);

    // Initialize host memory with its own indices
    for(i = 0; i < N; i++)
    {
        for(j = 0; j < N; j++)
        {
            a[i * N + j] =  (float)(i * N + j);
            b[i * N + j] = -(float)(i * N + j);
        }
    }

    // Copy data from host memory to device memory
    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

    // Calculate the execution configuration
    dim3 grid (1, 1, 1);
    dim3 block(16, 1, 1);

    // Each block contains N * N = 16 threads; each thread calculates 1 data element
    add_matrices<<<grid, block>>>(ad, bd, cd, N);

    cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);

    printf("Matrix A was---\n");
    for(i = 0; i < N; i++)
    {
        for(j = 0; j < N; j++)
            printf("%f ", a[i * N + j]);
        printf("\n");
    }
    printf("\nMatrix B was---\n");
    for(i = 0; i < N; i++)
    {
        for(j = 0; j < N; j++)
            printf("%f ", b[i * N + j]);
        printf("\n");
    }
    printf("\nAddition of A and B gives C----\n");
    for(i = 0; i < N; i++)
    {
        for(j = 0; j < N; j++)
            printf("%f ", c[i * N + j]); // if correctly evaluated, all values will be 0
        printf("\n");
    }

    // Deallocate device and host memory
    cudaFree(ad);
    cudaFree(bd);
    cudaFree(cd);
    free(a);
    free(b);
    free(c);
    getchar();
    return 0;
}

///// Kernel part
__global__ void add_matrices(float *ad, float *bd, float *cd, int N)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    cd[index] = ad[index] + bd[index];
}
Let's walk through this 4*4 example in more detail: you have two matrices A and B, each of dimension 4*4 (16 elements).
First of all you have to decide your thread configuration.
You are supposed to launch a kernel function, which performs the parallel computation of your matrix addition and gets executed on your GPU device.
Now, one grid is launched per kernel launch.
A grid can have at most 65,535 blocks per dimension, and those blocks can be arranged in up to three dimensions (65535 * 65535 * 65535).
Every block in the grid can have at most 1024 threads in total, and those threads can also be arranged in up to three dimensions (with per-dimension limits of 1024 * 1024 * 64).
Now our problem is the addition of two 4 * 4 matrices:
A | 1 2 3 4 | B | 1 2 3 4 | C| 1 2 3 4 |
| 5 6 7 8 | + | 5 6 7 8 | = | 5 6 7 8 |
| 9 10 11 12 | | 9 10 11 12 | | 9 10 11 12 |
| 13 14 15 16| | 13 14 15 16| | 13 14 15 16|
We need 16 threads to perform the computation.
i.e. A(1,1) + B (1,1) = C(1,1)
A(1,2) + B (1,2) = C(1,2)
. . .
. . .
A(4,4) + B (4,4) = C(4,4)
All these threads will get executed simultaneously.
So we need a block with 16 threads.
For our convenience we will arrange the threads as (16 * 1 * 1) within the block.
As the number of threads is 16, we need only one block to hold those 16 threads.
So the grid configuration will be dim3 grid(1,1,1), i.e. the grid will have only one block,
and the block configuration will be dim3 block(16,1,1), i.e. the block will have 16 threads arranged along x.
The program above should give you a clear idea of the execution.
Understanding the indexing (i.e. threadIdx, blockDim, blockIdx) is the important part. You need to go through the CUDA literature. Once you have a clear idea about indexing, you will have won half the battle! So spend some time with CUDA books, different algorithms and paper and pencil, of course!
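As a follow-on sketch (not part of the original example): for matrices larger than a single 16-thread block can cover, the same idea extends to a 2D configuration. The names below are illustrative.

__global__ void add_matrices_2d(const float *ad, const float *bd, float *cd, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x; // x covers columns
    int row = blockIdx.y * blockDim.y + threadIdx.y; // y covers rows
    if (row < N && col < N)                          // guard for N not a multiple of the block size
        cd[row * N + col] = ad[row * N + col] + bd[row * N + col];
}

// launch, e.g. for N = 1024:
//   dim3 block(16, 16);
//   dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
//   add_matrices_2d<<<grid, block>>>(ad, bd, cd, N);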

Try cuda-gdb, the CUDA debugger.


CUDA Parallel Cross Product

Disclaimer: I am fairly new to CUDA and parallel programming - so if you're not going to bother to answer my question, just ignore this, or at least point me to the right resources so I can find the answer myself.
Here's the particular problem I'm looking to solve using parallel programming. I have some 1D arrays that store 3D vectors in this format -> [v0x, v0y, v0z, ... vnx, vny, vnz], where n is the vector, and x, y, z are the respective components.
Suppose I want to find the cross product between vectors [v0, v1, ... vn] in one array and their corresponding vectors [v0, v1, ... vn] in another array.
The calculation is pretty straightforward without parallelization:
result[x] = vec1[y]*vec2[z] - vec1[z]*vec2[y];
result[y] = vec1[z]*vec2[x] - vec1[x]*vec2[z];
result[z] = vec1[x]*vec2[y] - vec1[y]*vec2[x];
The problem I'm having is understanding how to implement CUDA parallelization for the arrays I currently have. Since each value in the result vector is a separate calculation, I can effectively run the above calculation for each vector in parallel. Since each component of the resulting cross product is a separate calculation, those too could run in parallel. How would I go about setting up the blocks and threads / thinking about the thread layout for such a problem?
The top 2 optimization priorities for any CUDA programmer are to use memory efficiently, and expose enough parallelism to hide latency. We'll use those to guide our algorithmic choices.
A very simple thread strategy (the thread strategy answers the question, "what will each thread do or be responsible for?") in any transformation (as opposed to reduction) type problem is to have each thread be responsible for 1 output value. Your problem fits the description of transformation - the output data set size is on the order of the input data set size(s).
I'll assume that you intended to have two equal length vectors containing your 3D vectors, and that you want to take the cross product of the first 3D vectors in each and the 2nd 3D vectors in each, and so on.
If we choose a thread strategy of 1 output point per thread (i.e. result[x] or result[y] or result[z], all together would be 3 output points), then we will need 3 threads to compute the output of each vector cross product. If we have enough vectors to multiply, then we will have enough threads to keep our machine "busy" and do a good job of hiding latency. As a rule of thumb, your problem will start to become interesting on GPUs if the number of threads is 10000 or more, so this means we would want your 1D vectors to consist of about 3000 3D vectors or more. Let's assume that is the case.
In order to tackle the memory efficiency objective, our first task is to load your vector data from global memory. We will want this ideally to be coalesced, which roughly means adjacent threads access adjacent elements in memory. We'll want the output stores to be coalesced also, and our thread strategy of choosing one output point/one vector component per thread will work nicely to support that.
For efficient memory usage, we'd like to ideally load each item from global memory only once. Your algorithm naturally involves a small amount of data reuse. The data reuse is evident since the computation of result[y] depends on vec2[z] and the computation of result[x] also depends on vec2[z] to pick just one example. Therefore a typical strategy when there is data reuse is to load the data first into CUDA shared memory, and then allow the threads to perform their computations based on the data in shared memory. As we will see, this makes it fairly easy/convenient for us to arrange for coalesced loads from global memory, since the global data load arrangement is no longer tightly coupled to the threads or the usage of the data for computation.
The last challenge is to figure out an indexing pattern so that each thread will select the proper elements from shared memory to multiply together. If we look at your calculation pattern that you have depicted in your question, we see that the first load from vec1 follows an offset pattern of +1(modulo 3) from the index that the result is being computed for. So x->y, y->z, and z -> x. Likewise we see a +2(modulo 3) for the next load from vec2, another +2(modulo 3) pattern for the next load from vec1 and another +1(modulo 3) pattern for the final load from vec2.
If we combine all these ideas, we can then write a kernel that should have generally efficient characteristics:
$ cat t1003.cu
#include <stdio.h>
#define TV1 1
#define TV2 2
const size_t N = 4096;   // number of 3D vectors
const int blksize = 192; // choose as multiple of 3 and 32, and less than 1024
typedef float mytype;

//pairwise vector cross product
template <typename T>
__global__ void vcp(const T * __restrict__ vec1, const T * __restrict__ vec2, T * __restrict__ res, const size_t n){

  __shared__ T sv1[blksize];
  __shared__ T sv2[blksize];
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  while (idx < 3*n){ // grid-stride loop
    // load shared memory using coalesced pattern to global memory
    sv1[threadIdx.x] = vec1[idx];
    sv2[threadIdx.x] = vec2[idx];
    // compute modulo/offset indexing for thread loads of shared data from vec1, vec2
    int my_mod = threadIdx.x%3; // costly, but possibly hidden by global load latency
    int off1 = my_mod+1;
    if (off1 > 2) off1 -= 3;
    int off2 = my_mod+2;
    if (off2 > 2) off2 -= 3;
    __syncthreads();
    // each thread loads its computation elements from shared memory
    T t1 = sv1[threadIdx.x-my_mod+off1];
    T t2 = sv2[threadIdx.x-my_mod+off2];
    T t3 = sv1[threadIdx.x-my_mod+off2];
    T t4 = sv2[threadIdx.x-my_mod+off1];
    // compute result, and store using coalesced pattern, to global memory
    res[idx] = t1*t2-t3*t4;
    idx += gridDim.x*blockDim.x;} // for grid-stride loop
}

int main(){

  mytype *h_v1, *h_v2, *d_v1, *d_v2, *h_res, *d_res;
  h_v1  = (mytype *)malloc(N*3*sizeof(mytype));
  h_v2  = (mytype *)malloc(N*3*sizeof(mytype));
  h_res = (mytype *)malloc(N*3*sizeof(mytype));
  cudaMalloc(&d_v1,  N*3*sizeof(mytype));
  cudaMalloc(&d_v2,  N*3*sizeof(mytype));
  cudaMalloc(&d_res, N*3*sizeof(mytype));
  for (int i = 0; i < N; i++){
    h_v1[3*i]    = TV1;
    h_v1[3*i+1]  = 0;
    h_v1[3*i+2]  = 0;
    h_v2[3*i]    = 0;
    h_v2[3*i+1]  = TV2;
    h_v2[3*i+2]  = 0;
    h_res[3*i]   = 0;
    h_res[3*i+1] = 0;
    h_res[3*i+2] = 0;}
  cudaMemcpy(d_v1, h_v1, N*3*sizeof(mytype), cudaMemcpyHostToDevice);
  cudaMemcpy(d_v2, h_v2, N*3*sizeof(mytype), cudaMemcpyHostToDevice);
  vcp<<<(N*3+blksize-1)/blksize, blksize>>>(d_v1, d_v2, d_res, N);
  cudaMemcpy(h_res, d_res, N*3*sizeof(mytype), cudaMemcpyDeviceToHost);
  // verification
  for (int i = 0; i < N; i++)
    if ((h_res[3*i] != 0) || (h_res[3*i+1] != 0) || (h_res[3*i+2] != TV1*TV2)) {
      printf("mismatch at %d, was: %f, %f, %f, should be: %f, %f, %f\n", i, h_res[3*i], h_res[3*i+1], h_res[3*i+2], (float)0, (float)0, (float)(TV1*TV2));
      return -1;}
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
  return 0;
}
$ nvcc t1003.cu -o t1003
$ cuda-memcheck ./t1003
========= CUDA-MEMCHECK
no error
========= ERROR SUMMARY: 0 errors
$
Note that I've chosen to write the kernel using a grid-stride loop. This isn't terribly important to this discussion, and not that relevant for this problem, because I've chosen a grid size equal to the problem size (4096*3). However for much larger problem sizes, you might choose a smaller grid size than the overall problem size, for some possible small efficiency gain.
For such a simple problem as this, it's fairly easy to define "optimality". The optimal scenario would be however long it takes to load the input data (just once) and write the output data. If we consider a larger version of the test code above, changing N to 40960 (and making no other changes), then the total data read and written would be 40960*3*4*3 bytes. If we profile that code and then compare to bandwidthTest as a proxy for peak achievable memory bandwidth, we observe:
$ CUDA_VISIBLE_DEVICES="1" nvprof ./t1003
==27861== NVPROF is profiling process 27861, command: ./t1003
no error
==27861== Profiling application: ./t1003
==27861== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 65.97% 162.22us 2 81.109us 77.733us 84.485us [CUDA memcpy HtoD]
30.04% 73.860us 1 73.860us 73.860us 73.860us [CUDA memcpy DtoH]
4.00% 9.8240us 1 9.8240us 9.8240us 9.8240us void vcp<float>(float const *, float const *, float*, unsigned long)
API calls: 99.10% 249.79ms 3 83.263ms 6.8890us 249.52ms cudaMalloc
0.46% 1.1518ms 96 11.998us 374ns 454.09us cuDeviceGetAttribute
0.25% 640.18us 3 213.39us 186.99us 229.86us cudaMemcpy
0.10% 255.00us 1 255.00us 255.00us 255.00us cuDeviceTotalMem
0.05% 133.16us 1 133.16us 133.16us 133.16us cuDeviceGetName
0.03% 71.903us 1 71.903us 71.903us 71.903us cudaLaunchKernel
0.01% 15.156us 1 15.156us 15.156us 15.156us cuDeviceGetPCIBusId
0.00% 7.0920us 3 2.3640us 711ns 4.6520us cuDeviceGetCount
0.00% 2.7780us 2 1.3890us 612ns 2.1660us cuDeviceGet
0.00% 1.9670us 1 1.9670us 1.9670us 1.9670us cudaGetLastError
0.00% 361ns 1 361ns 361ns 361ns cudaGetErrorString
$ CUDA_VISIBLE_DEVICES="1" /usr/local/cuda/samples/bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: Tesla K20Xm
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6375.8
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6554.3
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 171220.3
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
$
The kernel takes 9.8240us to execute, and in that time loads or stores a total of 40960*3*4*3 bytes of data. Therefore the achieved memory bandwidth by the kernel is 40960*3*4*3/0.000009824 or 150 GB/s. The proxy measurement for peak achievable on this GPU is 171 GB/s, so this kernel achieves 88% of the optimal throughput. With more careful benchmarking to run the kernel twice in a row, the 2nd execution requires only 8.99us to execute. This brings the achieved bandwidth in this case up to 96% of peak achievable throughput.

Cache friendly offline random read

Consider this function in C++:
void foo(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
    while (b1 != b2) {
        // assert(0 <= *b1 && *b1 < a2 - a1)
        *o++ = a1[*b1++];
    }
}
Its purpose should be clear enough. Unfortunately, b1 contains random data and trashes the cache, making foo the bottleneck of my program. Is there any way I can optimize it?
This is an SSCCE that should resemble my actual code:
#include <iostream>
#include <chrono>
#include <algorithm>
#include <numeric>
#include <cstdint>

namespace {
    void foo(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
        while (b1 != b2) {
            // assert(0 <= *b1 && *b1 < a2 - a1)
            *o++ = a1[*b1++];
        }
    }

    constexpr unsigned max_n = 1 << 24, max_q = 1 << 24;
    uint32_t data[max_n], index[max_q], result[max_q];
}

int main() {
    uint32_t seed = 0;
    auto rng = [&seed]() { return seed = seed * 9301 + 49297; };
    std::generate_n(data, max_n, rng);
    std::generate_n(index, max_q, [rng]() { return rng() % max_n; });
    auto t1 = std::chrono::high_resolution_clock::now();
    foo(data, data + max_n, index, index + max_q, result);
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration<double>(t2 - t1).count() << std::endl;
    uint32_t hash = 0;
    for (unsigned i = 0; i < max_q; i++)
        hash += result[i] ^ (i << 8) ^ i;
    std::cout << hash << std::endl;
}
This is not a duplicate of "Cache-friendly copying of an array with readjustment by known index, gather, scatter", which asks about random writes and assumes b is a permutation.
First, let's take a look at the actual performance of the code above:
$ sudo perf stat ./offline-read
0.123023
1451229184
Performance counter stats for './offline-read':
184.661547 task-clock (msec) # 0.997 CPUs utilized
3 context-switches # 0.016 K/sec
0 cpu-migrations # 0.000 K/sec
717 page-faults # 0.004 M/sec
623,638,834 cycles # 3.377 GHz
419,309,952 instructions # 0.67 insn per cycle
70,803,672 branches # 383.424 M/sec
16,895 branch-misses # 0.02% of all branches
0.185129552 seconds time elapsed
We are getting a low IPC of 0.67, probably caused almost entirely by load misses to DRAM [5]. Let's confirm:
sudo ../pmu-tools/ocperf.py stat -e cycles,LLC-load-misses,cycle_activity.stalls_l3_miss ./offline-read
perf stat -e cycles,LLC-load-misses,cpu/event=0xa3,umask=0x6,cmask=6,name=cycle_activity_stalls_l3_miss/ ./offline-read
0.123979
1451229184
Performance counter stats for './offline-read':
622,661,371 cycles
16,114,063 LLC-load-misses
368,395,404 cycle_activity_stalls_l3_miss
0.184045411 seconds time elapsed
So ~370M cycles out of ~620M are straight-up stalled on outstanding misses. In fact, the portion of cycles stalled this way in foo() is much higher, close to 90%, since perf is also measuring the init and accumulate code, which takes about a third of the runtime (but doesn't have significant L3 misses).
This is nothing unexpected, since we knew the random-read pattern a1[*b1++] was going to have essentially zero locality. In fact, the number of LLC-load-misses is 16 million [1], corresponding almost exactly to the 16 million random reads of a1 [2].
If we just assume 100% of foo() is spent waiting on memory access, we can get an idea of the total cost of each miss: 0.123 sec / 16,114,063 misses == 7.63 ns/miss. On my box, the memory latency is around 60 ns in the best case, so less than 8 ns per miss means we are already extracting a lot of memory-level parallelism (MLP): about 8 misses would have to be overlapped and in-flight on average to achieve this (even totally ignoring the additional traffic from the streaming load of b1 and streaming write of o).
So I don't think there are many tweaks you can apply to the simple loop to do much better. Still, two possibilities are:
Non-temporal stores for the writes to o, if your platform supports them. This would cut out the reads implied by RFO for normal stores. It should be a straight win since o is never read again (inside the timed portion!); a sketch follows below.
Software prefetching. Carefully tuned prefetching of a1 or b1 could potentially help a bit. The impact is going to be fairly limited, however, since we are already approaching the limits of MLP as described above. Also, we expect the linear reads of b1 to be almost perfectly prefetched by the hardware prefetchers. The random reads of a1 seem like they could be amenable to prefetching, but in practice the ILP in the loop leads to enough MLP through out-of-order processing (at least on big OoO processors like recent x86).
In the comments user harold already mentioned that he tried prefetching with only a small effect.
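For the first possibility, here is a minimal sketch (my own illustration) of what the non-temporal-store variant could look like on x86 with SSE2; _mm_stream_si32 maps to movnti:

#include <immintrin.h>
#include <cstdint>

void foo_nt(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
    while (b1 != b2) {
        // streaming store: bypasses the cache and avoids the RFO read of o's line
        _mm_stream_si32(reinterpret_cast<int *>(o++), static_cast<int>(a1[*b1++]));
    }
    _mm_sfence(); // order the streaming stores before any later reads of o
}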
So since the simple tweaks aren't likely to bear much fruit, you are left with transforming the loop. One "obvious" transformation is to sort the indexes b1 (along with the index element's original position) and then do the reads from a1 in sorted order. This transforms the reads of a1 from completely random, to almost [3] linear, but now the writes are all random, which is no better.
Sort and then unsort
The key problem is that the reads of a1 under control of b1 are random, and a1 is large, so you get a miss to DRAM for essentially every read. We can fix that by sorting b1, and then reading a1 in order to get a permuted result. Now you need to "un-permute" that result to get it into the final order, which is simply another sort, this time on the "output index".
Here's a worked example with the given input array a, index array b and output array o, and i which is the (implicit) position of each element:
i = 0 1 2 3
a = [00, 10, 20, 30]
b = [ 3, 1, 0, 1]
o = [30, 10, 00, 10] (desired result)
First, sort array b, with the original array position i as secondary data (alternately you may see this as sorting tuples (b[0], 0), (b[1], 1), ...), this gives you the sorted b array b' and the sorted index list i' as shown:
i' = [ 2, 1, 3, 0]
b' = [ 0, 1, 1, 3]
Now you can read the permuted result array o' from a under the control of b'. This read is strictly increasing in order, and should be able to operate at close to memcpy speeds. In fact you may be able to take advantage of wide contiguous SIMD reads and some shuffles to do several reads at once and move the 4-byte elements into the right place (duplicating some elements and skipping others):
a = [00, 10, 20, 30]
b' = [ 0, 1, 1, 3]
o' = [00, 10, 10, 30]
Finally, you de-permute o' to get o, conceptually simply by sorting o' on the permuted indexes i':
i' = [ 2, 1, 3, 0]
o' = [00, 10, 10, 30]
i = [ 0, 1, 2, 3]
o = [30, 10, 00, 10]
Finished!
Now this is the simplest idea of the technique and isn't particularly cache-friendly (each pass conceptually iterates over one or more 2^26-byte arrays), but it at least fully uses every cache line it reads (unlike the original loop which only reads a single element from a cache line, which is why you have 16 million misses even though the data only occupies 1 million cache lines!). All of the reads are more or less linear, so hardware prefetching will help a lot.
How much speedup you get probably depends largely on how you implement the sorts: they need to be fast and cache sensitive. Almost certainly some type of cache-aware radix sort will work best.
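To make the data flow concrete, here is a rough sketch of the three passes using std::sort (my own illustration; a real implementation would replace both sorting steps with a cache-aware radix sort):

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// sort -> linear gather -> unsort
void foo_sorted(const uint32_t *a, const uint32_t *b, uint32_t *o, size_t q) {
    // pair each index with its original position i
    std::vector<std::pair<uint32_t, uint32_t>> bi(q);
    for (size_t i = 0; i < q; i++) bi[i] = { b[i], static_cast<uint32_t>(i) };
    // pass 1: sort by index value, so the reads of a become (almost) linear
    std::sort(bi.begin(), bi.end());
    // pass 2: gather from a in increasing order, producing the permuted result o'
    std::vector<uint32_t> op(q);
    for (size_t i = 0; i < q; i++) op[i] = a[bi[i].first];
    // pass 3: "unsort" -- scatter each o' element back to its original position
    for (size_t i = 0; i < q; i++) o[bi[i].second] = op[i];
}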
Here are some notes on ways to improve this further:
Optimize the amount of sorting
You don't actually need to fully sort b. You just want to sort it "enough" such that the subsequent reads of a under the control of b' are more or less linear. For example, 16 elements fit in a cache line, so you don't need to sort based on the last 4 bits at all: the same linear sequence of cache lines will be read anyways. You could also sort on even fewer bits: e.g., if you ignored the 5 least-significant bits, you'd read cache lines in an "almost linear" way, sometimes swapping two cache lines from the perfectly linear pattern like: 0, 1, 3, 2, 5, 4, 6, 7. Here, you'll still get the full benefit of the L1 cache (subsequent reads to a cache line will always hit), and I suspect such a pattern would still be prefetched well and if not you can always help it with software prefetching.
You can test on your system what the optimal number of ignored bits is. Ignoring bits has two benefits:
Less work to do in the radix sort, either from fewer passes needed or from needing fewer buckets in one or more passes (which helps caching).
Potentially less work to do to "undo" the permutation in the last step: if you undo it by examining the original index array b, ignoring bits means that you get the same savings when undoing the sort.
Cache block the work
The above description lays out everything in several sequential, disjoint passes that each work on the entire data set. In practice, you'd probably want to interleave them to get better caching behavior. For example, assuming you use an MSD radix-256 sort, you might do the first pass, sorting the data into 256 buckets of approximately 256K elements each.
Then rather than doing the full second pass, you might finish sorting only the first (or first few) buckets, and proceed to do the read of a based on the resulting block of b'. You are guaranteed that this block corresponds to a contiguous run of the final sorted sequence, so you don't give up any locality in the read, and your reads will generally be cached. You may also do the first pass of de-permuting o' since the block of o' is also hot in the cache (and perhaps you can combine the latter two phases into one loop).
Smart De-permutation
One area for optimization is how exactly the de-permutation of o' is implemented. In the description above, we assume some index array i initially with values [0, 1, 2, ..., max_q] which is sorted along with b. That's conceptually how it works, but you may not need to actually materialize i right away and sort it as auxiliary data. In the first pass of the radix sort, for example, the value of i is implicitly known (since you are iterating through the data), so it could be calculated for free [4] and written out during the first pass without ever having appeared in sorted order.
There may also be more efficient ways to do the "unsort" operation than maintaining the full index. For example, the original unsorted b array conceptually has all the information needed to do the unsort, but it is not clear to me how to use it to unsort efficiently.
Will it be faster?
So will this actually be faster than the naive approach? It depends largely on implementation details, especially including the efficiency of the implemented sort. On my hardware, the naive approach is processing about ~140 million elements per second. Online descriptions of cache-aware radix sorts seem to vary from perhaps 200 to 600 million elements/s, and since you need two of those, the opportunity for a big speedup would seem limited if you believe those numbers. On the other hand, those numbers are from older hardware, and for slightly more general sorts (e.g., for all 32 bits of the key, while we may be able to use as few as 16 bits).
Only a careful implementation will determine if it is feasible, and feasibility also depends on the hardware. For example, on hardware that can't sustain as much MLP, the sorting-unsorting approach becomes relatively more favorable.
The best approach also depends on the relative values of max_n and max_q. For example, if max_n >> max_q, then the reads will be "sparse" even with optimal sorting, so the naive approach would be better. On the other hand if max_n << max_q, then the same index will usually be read many times, so the sorting approach will have good read locality, the sorting steps will themselves have better locality, and further optimizations which handle duplicate reads explicitly may be possible.
Multiple Cores
It isn't clear from the question whether you are interested in parallelizing this. The naive solution for foo() already does admit a "straightforward" parallelization where you simply partition the a and b arrays into equal sized chunks, one for each thread, which would seem to provide a perfect speedup. Unfortunately, you'll probably find that you get much worse than linear scaling, because you'll be running into resource contention in the memory controller and associated uncore/offcore resources which are shared between all cores on a socket. So it isn't clear how much more throughput you'll get for a purely parallel random read load to memory as you add more cores [6].
For the radix-sort version, most of the bottlenecks (store throughput, total instruction throughput) are in the core, so I expect it to scale reasonably with additional cores. As Peter mentioned in the comment, if you are using hyperthreading, the sort may have the additional benefit of good locality in the core local L1 and L2 caches, effectively letting each sibling thread use the entire cache, rather than cutting the effective capacity in half. Of course, that involves carefully managing your thread affinity so that sibling threads actually use nearby data, and not just letting the scheduler do whatever it does.
[1] You might ask why the LLC-load-misses isn't, say, 32 or 48 million, given that we also have to read all 16 million elements of b1 and then the accumulate() call reads all of result. The answer is that LLC-load-misses only counts demand misses that actually miss in the L3. The other mentioned read patterns are totally linear, so the prefetchers will always be bringing the line into the L3 before it is needed. These don't count as "LLC misses" by the definition perf uses.
[2] You might want to know how I know that the load misses all come from the reads of a1 in foo: I simply used perf record and perf mem to confirm that the misses were coming from the expected assembly instruction.
[3] Almost linear because b1 is not a permutation of all indexes, so in principle there can be skipped and duplicate indexes. At the cache-line level, however, it is highly likely that every cache line will be read in-order since each element has a ~63% chance of being included, and a cache line has 16 4-byte elements, so there's only about a 1 in 10 million chance that any given cache line has zero elements. So prefetching, which works at the cache line level, will work fine.
[4] Here I mean that the calculation of the value comes for free or nearly so, but of course the write still costs. This is still much better than the "up-front materialization" approach, however, which first creates the i array [0, 1, 2, ...] needing max_q writes and then again needs another max_q writes to sort it in the first radix sort pass. The implicit materialization only incurs the second write.
[5] In fact, the IPC of the actual timed section foo() is much lower: about 0.15 based on my calculations. The reported IPC of the entire process is an average of the IPC of the timed section and the initialization and accumulation code before and after which has a much higher IPC.
[6] Notably, this is different from how a dependent-load, latency-bound workload scales: a load stream that does random reads but can only have one load in flight at a time, because each load depends on the result of the last, scales very well to multiple cores because the serial nature of the loads doesn't use many downstream resources (but such loads can conceptually also be sped up even on a single core by changing the core loop to handle more than one dependent load stream in parallel).
You can partition indices into buckets where higher bits of indices are the same. Beware that if indices are not random the buckets will overflow.
#include <iostream>
#include <chrono>
#include <cassert>
#include <algorithm>
#include <numeric>
#include <vector>
#include <cstdint>

namespace {
    constexpr unsigned max_n = 1 << 24, max_q = 1 << 24;

    void foo(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
        while (b1 != b2) {
            // assert(0 <= *b1 && *b1 < a2 - a1)
            *o++ = a1[*b1++];
        }
    }

    uint32_t* foo_fx(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, const uint32_t b_offset, uint32_t *o) {
        while (b1 != b2) {
            // assert(0 <= *b1 && *b1 < a2 - a1)
            *o++ = a1[b_offset+(*b1++)];
        }
        return o;
    }

    uint32_t data[max_n], index[max_q], result[max_q];
    std::pair<uint32_t, uint32_t[max_q / 8]> index_fx[16];
}

int main() {
    uint32_t seed = 0;
    auto rng = [&seed]() { return seed = seed * 9301 + 49297; };
    std::generate_n(data, max_n, rng);
    //std::generate_n(index, max_q, [rng]() { return rng() % max_n; });
    for (size_t i = 0; i < max_q; ++i) {
        const uint32_t idx = rng() % max_n;
        const uint32_t bucket = idx >> 20;
        assert(bucket < 16);
        index_fx[bucket].second[index_fx[bucket].first] = idx % (1 << 20);
        index_fx[bucket].first++;
        assert((1 << 20)*bucket + index_fx[bucket].second[index_fx[bucket].first - 1] == idx);
    }
    auto t1 = std::chrono::high_resolution_clock::now();
    //foo(data, data + max_n, index, index + max_q, result);
    uint32_t* result_begin = result;
    for (int i = 0; i < 16; ++i) {
        result_begin = foo_fx(data, data + max_n, index_fx[i].second, index_fx[i].second + index_fx[i].first, (1<<20)*i, result_begin);
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration<double>(t2 - t1).count() << std::endl;
    std::cout << std::accumulate(result, result + max_q, 0ull) << std::endl;
}

Strange under-performance of Titan X in memory-bound kernels (e.g. elementwise addition)

I've been noticing super-slow execution times for essentially all of my CUDA kernels on one machine (Fedora 24, GeForce Titan X Maxwell), but not on others. Edit: I previously gave the CUDA vectorAdd sample as an MCVE, but there were doubts about whether that should really be memory-bottlenecked, given the low workload per thread; so, here's a hand-unrolling of that kernel:
enum { serialization_factor = 8 };

__global__ void vectorAdd(
    const float* __restrict__ lhs,
    const float* __restrict__ rhs,
    float* __restrict__ result,
    int length)
{
    int pos = threadIdx.x + blockIdx.x * blockDim.x * serialization_factor;
    if (length - pos >= blockDim.x * serialization_factor) {
        #pragma unroll
        for(int i = 0; i < serialization_factor; i++) {
            result[pos] = lhs[pos] + rhs[pos];
            pos += blockDim.x;
        }
    }
    else {
        for(; pos < length; pos += blockDim.x) {
            result[pos] = lhs[pos] + rhs[pos];
        }
    }
}
... and suppose we run this for 5,000,000 elements; and launch the kernel twice, ignoring the first run.
Well, with my home GPU, a GeForce GTX 650 Ti Boost, I get 527 usec. This is a bit strange - I was expecting something like 555 usec, by bandwidth calculations: 3004 MHz clock * 192-bit bus = 72096 MB/sec = 72 GB/sec, and 2 * 4 bytes per float * 5M of data. But it's pretty close, so let's ignore the difference. The profiler tells me the "Global Load Throughput" is 72.355 GB/sec.
Now, on the Maxwell Titan X at work, I get 232 usec. That's about twice as fast - but the GPU's bandwidth is 5 times as high as my home GPU: ~336 GB/sec. I should be seeing something like 120 usec. And - the profiler tells me the "Global Load Throughput" is 343.271 GB/sec (!)
How could this be happening?
Notes:
If you think I've gotten something wrong with the kernel, please comment about that rather than writing an answer.
The Titan doesn't have ECC on.
Your bandwidth calculations are not fully accurate. The specified theoretical peak memory bandwidth of the GTX 650 Ti BOOST is twice as high (144.2 GB/s) as you calculated, because of double data rate transfer (transfer of separate words on both the rising and the falling edge of the clock signal). The achieved bandwidth in the vector add example is 50% higher than you calculated, because writing the results back to memory also needs to be taken into account. This means your GTX 650 Ti BOOST measurements achieved ~79% of its theoretical peak bandwidth.
The Titan X's specified peak memory bandwidth is 336.5 GB/s, so your test achieved ~77% of theoretical peak memory bandwidth.
This is about as good as it gets. The remaining discrepancy is due to overhead like memory refresh, the time needed to switch the transfer direction etc.
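For concreteness, the arithmetic behind those percentages (using the timings quoted in the question, with two loads and one store per element):

bytes moved = 5,000,000 elements * 4 bytes * 3 accesses (2 loads + 1 store) = 60 MB
GTX 650 Ti BOOST: 60e6 B / 527e-6 s ≈ 114 GB/s -> 114 / 144.2 ≈ 79% of peak
Titan X:          60e6 B / 232e-6 s ≈ 259 GB/s -> 259 / 336.5 ≈ 77% of peak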
Adding to tera's answer: your algorithm has a warm-up and a cool-down phase - when many requests are in flight, latency does indeed get hidden, but at the cost of a warm-up for the first iterations and a cool-down for the last ones.
If your scheduling is good, you will have a work chunk of 2048 (max resident threads per SM) x 24 (number of SMs on the GTX Titan X), each thread of which operates on 8 values. Hence, your work chunk is 393,216 entries.
For your 5,000,000 size sample, it results in 12.7 iterations (13 with the last being incomplete). The warm-up/cool-down cost is 1 iteration.
Depending on the scheduling of threads (and this is not necessarily predictable), you may run 14 iterations in total; for that you could have 5,111,808 entries at approximately the same cost (still one warm-up/cool-down). That size would give you the best performance, I believe.
As a result, the incomplete iteration plus warm-up/cool-down could cost about 10% of performance, the achieved bandwidth being closer to 85% of peak if not more.
The minimal run time of a kernel should also be looked at, as it might account for a few microseconds as well. Running on various data sizes should mitigate this point.
Last but not least, the memory frequency might be modifiable with nvidia-smi, as explained here.

Generating threads based on a variable length array in cuda?

First, let me explain what I am implementing. The goal of my program is to generate all possible, non-distinct combinations of a given character set on a CUDA enabled GPU. In order to parallelize the work, I am initializing each thread to a starting character.
For instance, consider the character set abcdefghijklmnopqrstuvwxyz. In this case, there will ideally be 26 threads: characterSet[threadIdx.x] = a for example (in practice, there would obviously be an offset to span the entire grid so that each thread has a unique identifier).
Here is my code thus far:
//Used to calculate grid dimensions
int* threads;
int* blocks;
int* tpb;
int charSetSize;

void calculate_grid_parameters(int length, int size, int* threads, int* blocks, int* tpb){
    //Validate input
    if(!threads || !blocks || !tpb){
        cout << "An error has occured: Null pointer passed to function...\nPress enter to exit...";
        getchar();
        exit(1);
    }
    //Declarations
    const int maxBlocks = 65535; //Does not change
    int maxThreads = 512;        //Limit in order to provide more portability
    int dev = 0;
    int maxCombinations;
    cudaDeviceProp deviceProp;
    //Query device
    //cudaGetDeviceProperties(&deviceProp, dev);
    //maxThreads = deviceProp.maxThreadsPerBlock;
    //Determine total threads to spawn
    //Length of password * size of character set
    //Each thread will handle part of the total number of the combinations
    if(length > 3) length = 3; //Max length is 3
    maxCombinations = length * size;
    assert(maxCombinations < (maxThreads * maxBlocks));
}
It is fairly basic.
I've limited length to 3 for a specific reason. The full character set, abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 !\"#$&'()*+-.:;<>=?#[]^_{}~| is, I believe, 92 characters. This means for a length of 3, there are 778,688 possible non-distinct combinations. If it were length 4, then it would be roughly 71 million, and the maximum number of threads for my GPU is about 69 million (in one dimension). Furthermore, these combinations have already been generated in a file that will be read into an array and then delegated to a specific initializing thread.
This leads me to my problem.
The maximum number of blocks on a cuda GPU (for 1-d) is 65,535. Each of those blocks (on my gpu) can run 1024 threads in one dimension. I've limited it to 512 in my code for portability purposes (this may be unnecessary). Ideally, each block should run 32 threads or a multiple of 32 threads in order to be efficient. The issue I have is how many threads I need. Like I said above, if I am using a full character set of length 3 for the starting values, this necessitates 778,688 threads. This happens to be divisible by 32, yielding 24,334 blocks assuming each block runs 32 threads. However, if I run the same character set with length two, I am left with 264.5 blocks each running 32 threads.
Basically, my character set is variable and the length of the initializing combinations is variable from 1-3.
If I round up to the nearest whole number, my offset index, tid = threadIdx.x + .... will be accessing parts of the array that simply do not exist.
How can I handle this problem in such a way that it will still run efficiently and not spawn unnecessary threads that could potentially cause memory problems?
Any constructive input is appreciated.
The code you've posted doesn't seem to do anything significant and includes no cuda code.
Your question appears to be this:
How can I handle this problem in such a way that it will still run efficiently and not spawn unnecessary threads that could potentially cause memory problems?
It's common practice when launching a kernel to "round up" to the nearest increment of threads, perhaps 32, perhaps some multiple of 32, so that an integral number of blocks can be launched. In this case, it's common practice to include a thread check in the kernel code, such as:
__global__ void mykernel(.... int size){
    int idx = threadIdx.x + blockDim.x*blockIdx.x;
    if (idx < size){
        //main body of kernel code here
    }
}
In this case, size is your overall problem size (the number of threads that you actually want). The overhead of the additional threads that are doing nothing is normally not a significant performance issue.
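A minimal sketch of the matching launch code under this convention (the names are illustrative; the 778,688 figure is taken from the question):

int size = 778688;                             // the number of threads you actually want
int threadsPerBlock = 256;                     // a multiple of 32
int blocks = (size + threadsPerBlock - 1) / threadsPerBlock;   // round up to a whole number of blocks
mykernel<<<blocks, threadsPerBlock>>>(/* other args, */ size); // surplus threads fail the idx < size test and exit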

Some basic CUDA enquiries

I am new to CUDA development and I decided to start scripting small examples in order to understand how it works. I decided to share a kernel function that I made; it computes the squared Euclidean distance between the corresponding rows of two equal-sized matrices.
__global__ void cudaEuclid( float* A, float* B, float* C, int rows, int cols )
{
    int i;
    float squareEuclDist = 0;
    int r = blockDim.x * blockIdx.x + threadIdx.x; // rows
    //int c = blockDim.y * blockIdx.y + threadIdx.y; // cols

    if( r < rows ){ // take each row with var r (thread)
        for ( i = 0; i < cols; i++ ) // compute squared Euclid dist of each row
            squareEuclDist += ( A[r + rows*i] - B[r + rows*i] ) * ( A[r + rows*i] - B[r + rows*i] );
        C[r] = squareEuclDist;
        squareEuclDist = 0;
    }
}
The kernel initialization is done by
int threadsPerBlock = 256;
int blocksPerGrid = ceil( (double) numElements / threadsPerBlock);
// numElements = 1500x200 (matrix size) ==> 1172 blocks/grid
and is called as
cudaEuclid<<<blocksPerGrid, threadsPerBlock>>>( d_A, d_B, d_C, rows, cols );
The d_A and d_B are the inserted matrices, in this example of size 1500 x 200.
Question 1: I have read the basic theory of choosing the threads per block and the blocks per grid, but something is still missing. I am trying to understand, for this simple kernel, what the optimum launch configuration is, and I am asking for a little help to start thinking in the CUDA way.
Question 2: Another thing I would like to ask is whether there are any suggestions for improving the code's efficiency. Can we use int c = blockDim.y * blockIdx.y + threadIdx.y to make things more parallel? Is shared memory applicable here?
Below, my GPU info is attached.
Device 0: "GeForce 9600 GT"
CUDA Driver Version / Runtime Version 5.5 / 5.0
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 512 MBytes (536870912 bytes)
( 8) Multiprocessors x ( 8) CUDA Cores/MP: 64 CUDA Cores
GPU Clock rate: 1680 MHz (1.68 GHz)
Memory Clock rate: 700 Mhz
Memory Bus Width: 256-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Concurrent kernel execution: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 1 / 0
Question 3: Can we relate the amount of global memory to the amounts of shared memory and the other types of memory that the GPU has? Does the number of threads have anything to do with that?
Question 4: If the maximum number of threads per block is 512, how is it possible that the maximum sizes of each dimension of a block are 512 x 512 x 64 (= 16,777,216 threads)? What is the correlation with the maximum sizes of each dimension of a grid?
Question 5: Using the memory clock rate, can we say how many threads are processed each second?
UPDATE:
The for loop was replaced with column threads:
__global__ void cudaEuclid( float* A, float* B, float* C, int rows, int cols ){
    int r = blockDim.x * blockIdx.x + threadIdx.x; // rows
    int c = blockDim.y * blockIdx.y + threadIdx.y; // cols
    float x = 0;

    if( c < cols && r < rows ){
        x = ( A[c + r*cols] - B[c + r*cols] ) * ( A[c + r*cols] - B[c + r*cols] );
    }
    C[r] = x;
}
Called with:
int threadsPerBlock = 256;
int blocksPerGrid = ceil( (double) numElements / threadsPerBlock);
cudaEuclid<<<blocksPerGrid, threadsPerBlock>>>( d_A, d_B, d_C, rows, cols );
A1. Optimizing the threads per block is basically a matter of heuristics. You could try
for(int threadsPerBlock=32; threadsPerBlock<=512;threadsPerBlock+=32){...}
A2. Currently you use one thread per row and sum the elements into squareEuclDist linearly. You could consider using one thread block per row: within the block, each thread computes the squared difference of one element, and you can use a parallel reduction to sum them together (see the sketch after the link below). Please refer to the following link for parallel reduction.
http://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf
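Here is a minimal sketch of that one-block-per-row strategy (my own illustration, assuming blockDim.x is a power of two and keeping the column-major indexing A[r + rows*i] of the question's kernel):

__global__ void cudaEuclidRow( const float* A, const float* B, float* C, int rows, int cols )
{
    extern __shared__ float partial[];   // one float per thread, sized at launch
    int row = blockIdx.x;                // one block per row
    float sum = 0.0f;
    for ( int i = threadIdx.x; i < cols; i += blockDim.x ) {
        float d = A[row + rows*i] - B[row + rows*i];
        sum += d * d;                    // each thread accumulates part of the row
    }
    partial[threadIdx.x] = sum;
    __syncthreads();
    for ( int s = blockDim.x / 2; s > 0; s >>= 1 ) { // shared-memory tree reduction
        if ( threadIdx.x < s )
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if ( threadIdx.x == 0 )
        C[row] = partial[0];
}
// launch: cudaEuclidRow<<< rows, 256, 256 * sizeof(float) >>>( d_A, d_B, d_C, rows, cols );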
A3. The list you show is the total amount of global/shared memory. Multiple threads will share these hardware resources. You can find the tool below in your CUDA installation dir; it helps you calculate exactly how much of those hardware resources you can use per thread in a particular kernel.
$CUDA_HOME/tools/CUDA_Occupancy_Calculator.xls
A4. The maximum size of each dimension does not mean all dimensions can reach their maximum at the same time. However, there is no additional limit on the total number of blocks per grid beyond the per-dimension limits, so 65535 x 65535 x 1 blocks in a grid is possible.
A5. The memory clock has nothing to do with the thread number. You can read the programming model section in the CUDA docs for more info.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#scalable-programming-model
OK, so there are a few things related to a kernel: one is the number of multiprocessors (associated with blocks) and the number of cores (associated with threads). Blocks are scheduled to run on a multiprocessor (you have 8 of them), and threads are scheduled to run on the multiple cores of a single multiprocessor. Ideally you would like to have enough blocks and threads that all your multiprocessors and all the cores in each multiprocessor are occupied. It is advisable to have more blocks and threads than multiprocessors and cores, so that the scheduler can overlap them and hide latency.
Multiple dimensions make programming easier (e.g. for 2D/3D images you can divide the image into sub-parts, give them to different blocks and then process those sub-images with multiple threads), and it is more intuitive to use multiple dimensions (x, y, z) for accessing blocks and threads. In some cases it also helps to have more dimensions if there is a restriction on the maximum number of blocks in one dimension (for example, with a large image you might hit the limit on the maximum number of blocks if you only use one dimension).
I am not sure if I understand what you mean in your third question, but I can tell you a bit about shared memory. Shared memory is present on a single multiprocessor and is shared by the cores of that multiprocessor. For you, the amount of shared memory is 16 KB; most modern GPUs have 64 KB of shared memory/L1 per multiprocessor, and you can choose how much of it you want for your application: 16 KB of the 64 KB is generally reserved for cache and you can use the remaining 48 KB, or you can increase the cache size and lower your shared memory size. Shared memory is much faster than global memory, so in case you have some data which will be accessed frequently, it is wise to move it to shared memory. The number of threads is not at all related to shared memory. Also, global memory and shared memory are separate.
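If you do want to change that split for a particular kernel, the runtime call is cudaFuncSetCacheConfig (only meaningful on devices with a configurable L1/shared split such as Fermi/Kepler, not on the compute 1.1 card above):

// prefer 48 KB shared memory / 16 KB L1 for this kernel (use cudaFuncCachePreferL1 for the reverse)
cudaFuncSetCacheConfig(cudaEuclid, cudaFuncCachePreferShared);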
As you can see, each block dimension is at most 512, and you cannot have more than 512 threads per block (the limit has been raised to 1024 in newer CUDA versions on newer architectures). Up to and including Fermi, each multiprocessor had at most 32 or 48 cores, so it didn't make much sense to have more than 512 threads. The newer Kepler architecture has 192 cores per multiprocessor.
Threads are executed in a warp, which is 32 threads clubbed together and executed on the cores of a multiprocessor simultaneously (older hardware issues them in half-warps of 16). If you assume that there is always a miss in the shared memory, then depending on the number of cores you have per multiprocessor and the memory clock rate, you can calculate how many threads would be processed each second (you would also need to take into account the number of instructions processed per thread; there would also be some time involved for processing operations on registers etc.).
I hope that answers your questions to some extent.
