Some basic CUDA enquiries

I am new to CUDA development and decided to start writing small examples in order to understand how it works. I would like to share a kernel function I wrote that computes the squared Euclidean distance between the corresponding rows of two equally sized matrices.
__global__ void cudaEuclid( float* A, float* B, float* C, int rows, int cols )
{
    int i, squareEuclDist = 0;
    int r = blockDim.x * blockIdx.x + threadIdx.x; // rows
    //int c = blockDim.y * blockIdx.y + threadIdx.y; // cols

    if( r < rows ){ // take each row with var r (thread)
        for ( i = 0; i < cols; i++ ) // compute squared Euclid dist of each row
            squareEuclDist += ( A[r + rows*i] - B[r + rows*i] ) * ( A[r + rows*i] - B[r + rows*i] );
        C[r] = squareEuclDist;
        squareEuclDist = 0;
    }
}
The kernel initialization is done by
int threadsPerBlock = 256;
int blocksPerGrid = ceil( (double) numElements / threadsPerBlock);
// numElements = 1500x200 (matrix size) ==> 1172 blocks/grid
and is called as
cudaEuclid<<<blocksPerGrid, threadsPerBlock>>>( d_A, d_B, d_C, rows, cols );
d_A and d_B are the input matrices, in this example of size 1500 x 200.
Question 1: I have read the basic theory of choosing the threads per block and the blocks per grid, but something is still missing. For this simple kernel, what would be an optimal launch configuration? I am asking for a little help to start thinking the CUDA way.
Question 2: Another thing I would like to ask is whether there are any suggestions for improving the code's efficiency. Can we use int c = blockDim.y * blockIdx.y + threadIdx.y to make things more parallel? Is shared memory applicable here?
Below, my GPU info is attached.
Device 0: "GeForce 9600 GT"
CUDA Driver Version / Runtime Version 5.5 / 5.0
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 512 MBytes (536870912 bytes)
( 8) Multiprocessors x ( 8) CUDA Cores/MP: 64 CUDA Cores
GPU Clock rate: 1680 MHz (1.68 GHz)
Memory Clock rate: 700 Mhz
Memory Bus Width: 256-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Concurrent kernel execution: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 1 / 0
Question 3: Can we relate the amount of global memory to that of shared memory and the other types of memory the GPU has? Does the number of threads have anything to do with that?
Question 4: If the maximum number of threads per block is 512, how is it possible for the maximum sizes of each dimension of a block to be 512 x 512 x 64 (= 16,777,216 threads)? What is the correlation with the maximum sizes of each dimension of a grid?
Question 5: Using the memory clock rate, can we say how many threads are processed each second?
UPDATE:
The for loop has been replaced with column threads:
__global__ void cudaEuclid( float* A, float* B, float* C, int rows, int cols ){
    int r = blockDim.x * blockIdx.x + threadIdx.x; // rows
    int c = blockDim.y * blockIdx.y + threadIdx.y; // cols
    float x = 0;
    if( c < cols && r < rows ){
        x = ( A[c + r*cols] - B[c + r*cols] ) * ( A[c + r*cols] - B[c + r*cols] );
    }
    C[r] = x;
}
Called with:
int threadsPerBlock = 256;
int blocksPerGrid = ceil( (double) numElements / threadsPerBlock);
cudaEuclid<<<blocksPerGrid, threadsPerBlock>>>( d_A, d_B, d_C, rows, cols );

A1. Optimizing the threads per block is basically a heuristic. You could try
for(int threadsPerBlock=32; threadsPerBlock<=512;threadsPerBlock+=32){...}
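For example, a sketch of such a sweep using CUDA events for timing (this assumes the cudaEuclid kernel and the d_A/d_B/d_C buffers from the question; the grid is sized to one thread per row, since the r < rows check makes any extra threads harmless):
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
for (int threadsPerBlock = 32; threadsPerBlock <= 512; threadsPerBlock += 32) {
    int blocksPerGrid = (rows + threadsPerBlock - 1) / threadsPerBlock;
    cudaEventRecord(start);
    cudaEuclid<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, rows, cols);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("threadsPerBlock = %3d: %.3f ms\n", threadsPerBlock, ms);
}
cudaEventDestroy(start);
cudaEventDestroy(stop);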
A2. Currently you use one thread per row and sum the elements into squareEuclDist sequentially. You could instead consider using one thread block per row. Within the block, each thread computes the squared difference of one element, and you can use a parallel reduction to sum them together. Please refer to the following link for parallel reduction.
http://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf
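A minimal sketch of that one-block-per-row idea (assuming float accumulation and the same column-major indexing, element (r, c) at r + rows*c, as the original kernel; the kernel name and BLOCK size are illustrative):
#define BLOCK 256   // must be a power of two for this simple reduction

__global__ void cudaEuclidRowReduce(const float* A, const float* B, float* C,
                                    int rows, int cols)
{
    __shared__ float partial[BLOCK];
    int r = blockIdx.x;                                      // one block per row
    float sum = 0.0f;
    for (int c = threadIdx.x; c < cols; c += blockDim.x) {   // strided pass over the columns
        float d = A[r + rows * c] - B[r + rows * c];
        sum += d * d;
    }
    partial[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {           // tree reduction in shared memory
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        C[r] = partial[0];
}
// launched as: cudaEuclidRowReduce<<<rows, BLOCK>>>(d_A, d_B, d_C, rows, cols);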
A3. The list you show is the total amount of global/shared memory. Multiple threads share these hardware resources. You can find the following tool in your CUDA installation directory; it helps you calculate exactly how much of those hardware resources each thread can use in a particular kernel.
$CUDA_HOME/tools/CUDA_Occupancy_Calculator.xls
A4. The maximum size of each dimension of a block does not mean all dimensions can reach their maximum at the same time; the total number of threads per block is still capped at 512. There is, however, no separate cap on the total number of blocks in a grid, so a full 65535 x 65535 x 1 grid is possible on your device.
A5. The memory clock has nothing to do with the number of threads. You could read the programming model section in the CUDA documentation for more info.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#scalable-programming-model

Ok, so there are a few things related to a kernel: one is the number of multiprocessors (associated with blocks) and the number of cores (associated with threads). Blocks are scheduled to run on a multiprocessor (you have 8 of them), and threads are scheduled to run on the multiple cores of a single multiprocessor. Ideally you would like to have enough blocks and threads so that all your multiprocessors, and all the cores in each multiprocessor, are occupied. It is advisable to have more blocks and threads than multiprocessors and cores, since the scheduler can then switch between them and hide latency.
Multiple dimensions make programming easier (for example, with 2D/3D images you can divide the image into sub-parts, give them to different blocks, and then process those sub-images with multiple threads); it is more intuitive to use multiple dimensions (x, y, z) for addressing blocks and threads. In some cases it also helps to have more dimensions if there is a restriction on the maximum number of blocks in one dimension (for example, with a very large image you might hit the limit on the maximum number of blocks if you used just one dimension).
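As an illustration (a sketch only; the kernel name, d_image, and the image dimensions are hypothetical), a 2D launch for an image might look like:
dim3 block(16, 16);                                  // 256 threads per block
dim3 grid((imageWidth  + block.x - 1) / block.x,     // enough blocks to cover the image
          (imageHeight + block.y - 1) / block.y);
processImage<<<grid, block>>>(d_image, imageWidth, imageHeight);
// inside the kernel each thread recovers its pixel coordinates:
//   int x = blockIdx.x * blockDim.x + threadIdx.x;
//   int y = blockIdx.y * blockDim.y + threadIdx.y;
//   if (x < imageWidth && y < imageHeight) { ... }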
I am not sure I understand your third question, but I can say a bit about shared memory. Shared memory lives on a single multiprocessor and is shared by the cores on that multiprocessor. For you, the amount of shared memory is 16 KB; most modern GPUs have 64 KB of on-chip memory per multiprocessor, and you can choose how it is split for your application: generally 16 KB of the 64 KB is reserved for L1 cache and you can use the remaining 48 KB as shared memory, or you can increase the cache size and lower your shared memory size. Shared memory is much faster than global memory, so if you have data that will be accessed frequently, it is wise to stage it in shared memory. The number of threads is not directly related to the amount of shared memory. Also, global memory and shared memory are separate.
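For reference, a minimal sketch of how shared memory is declared and used (the 48 KB / 16 KB split mentioned above is selected per kernel on Fermi-class and later parts; it does not apply to the compute-capability 1.1 device in the question):
__global__ void useShared(const float* in, float* out)
{
    // assumes blockDim.x <= 256 and gridDim.x * blockDim.x elements in in/out
    __shared__ float tile[256];        // on-chip shared memory, one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];         // stage data from global memory
    __syncthreads();                   // make the tile visible to all threads in the block
    out[i] = tile[threadIdx.x];        // ...replace with real work on the staged data
}
// On Fermi and later: cudaFuncSetCacheConfig(useShared, cudaFuncCachePreferShared); // 48 KB shared / 16 KB L1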
As you can see, each block dimension is limited, and you cannot have more than 512 threads per block (the limit has been raised to 1024 in newer CUDA versions on newer architectures). Up to Fermi, each multiprocessor had 32 or 48 cores, so it didn't make much sense to have many more than 512 threads. The newer Kepler architecture has 192 cores per multiprocessor.
Threads are executed in warps of 32 threads, which are clubbed together and executed on the cores of a multiprocessor. If you assume that there is always a miss in shared memory, then depending on the number of cores you have per multiprocessor and the memory clock rate, you can estimate how many threads would be processed each second (you would also need to take into account the number of instructions processed per thread, and there would be some time spent on register operations etc.).
I hope that answers your questions to some extent.


Why is accessing an array of int8_t not faster than int32_t, due to cache?

I have read that when accessing with a stride
for (int i = 0; i < aSize; i++) a[i] *= 3;
for (int i = 0; i < aSize; i += 16) a[i] *= 3;
both loops should perform similarly, because the memory accesses cost far more than the multiplication.
I'm playing around with google benchmark and while testing similar cache behaviour, I'm getting results that I don't understand.
template <class IntegerType>
void BM_FillArray(benchmark::State& state) {
    for (auto _ : state)
    {
        IntegerType a[15360 * 1024 * 2]; // Reserve array that doesn't fit in L3
        for (size_t i = 0; i < sizeof(a) / sizeof(IntegerType); ++i)
            benchmark::DoNotOptimize(a[i] = 0); // I have compiler optimizations disabled anyway
    }
}
BENCHMARK_TEMPLATE(BM_FillArray, int32_t);
BENCHMARK_TEMPLATE(BM_FillArray, int8_t);
Run on (12 X 3592 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 15360 KiB (x1)
---------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------
BM_FillArray<int32_t> 196577075 ns 156250000 ns 4
BM_FillArray<int8_t> 205476725 ns 160156250 ns 4
I would expect accessing the array of bytes to be faster than the array of ints as more elements fit in a cache line, but this is not the case.
Here are the results with optimizations enabled:
BM_FillArray<int32_t> 47279657 ns 47991071 ns 14
BM_FillArray<int8_t> 49374830 ns 50000000 ns 10
Can anyone please clarify this? Thanks :)
UPDATE 1:
I have read the old article "What Every Programmer Should Know About Memory" and everything is clearer now. However, I've tried the following benchmark:
template <int32_t CacheLineSize>
void BM_ReadArraySeqCacheLine(benchmark::State& state) {
    struct CacheLine
    {
        int8_t a[CacheLineSize];
    };
    vector<CacheLine> cl;
    int32_t workingSetSize = state.range(0);
    int32_t arraySize = workingSetSize / sizeof(CacheLine);
    cl.resize(arraySize);
    const int32_t iterations = 1536 * 1024;
    for (auto _ : state)
    {
        srand(time(NULL));
        int8_t res = 0;
        int32_t i = 0;
        while (i++ < iterations)
        {
            //size_t idx = i % arraySize;
            int idx = (rand() / float(RAND_MAX)) * arraySize;
            benchmark::DoNotOptimize(res += cl[idx].a[0]);
        }
    }
}
BENCHMARK_TEMPLATE(BM_ReadArraySeqCacheLine, 1)
->Arg(32 * 1024) // L1 Data 32 KiB(x6)
->Arg(256 * 1024) // L2 Unified 256 KiB(x6)
->Arg(15360 * 1024);// L3 Unified 15360 KiB(x1)
BENCHMARK_TEMPLATE(BM_ReadArraySeqCacheLine, 64)
->Arg(32 * 1024) // L1 Data 32 KiB(x6)
->Arg(256 * 1024) // L2 Unified 256 KiB(x6)
->Arg(15360 * 1024);// L3 Unified 15360 KiB(x1)
BENCHMARK_TEMPLATE(BM_ReadArraySeqCacheLine, 128)
->Arg(32 * 1024) // L1 Data 32 KiB(x6)
->Arg(256 * 1024) // L2 Unified 256 KiB(x6)
->Arg(15360 * 1024);// L3 Unified 15360 KiB(x1)
I would expect random accesses to perform much worse when the working set doesn't fit in the caches. However, these are the results:
BM_ReadArraySeqCacheLine<1>/32768 39936129 ns 38690476 ns 21
BM_ReadArraySeqCacheLine<1>/262144 40822781 ns 39062500 ns 16
BM_ReadArraySeqCacheLine<1>/15728640 58144300 ns 57812500 ns 10
BM_ReadArraySeqCacheLine<64>/32768 32786576 ns 33088235 ns 17
BM_ReadArraySeqCacheLine<64>/262144 32066729 ns 31994048 ns 21
BM_ReadArraySeqCacheLine<64>/15728640 50734420 ns 50000000 ns 10
BM_ReadArraySeqCacheLine<128>/32768 29122832 ns 28782895 ns 19
BM_ReadArraySeqCacheLine<128>/262144 31991964 ns 31875000 ns 25
BM_ReadArraySeqCacheLine<128>/15728640 68437327 ns 68181818 ns 11
What am I missing?
UPDATE 2:
I'm now using what you suggested (linear_congruential_engine) to generate the random numbers, and I'm using only static arrays, but the results are now even more confusing to me.
Here is the updated code:
template <int32_t WorkingSetSize, int32_t ElementSize>
void BM_ReadArrayRndCacheLine(benchmark::State& state) {
    struct Element
    {
        int8_t data[ElementSize];
    };
    constexpr int32_t ArraySize = WorkingSetSize / sizeof(ElementSize);
    Element a[ArraySize];
    constexpr int32_t iterations = 1536 * 1024;
    linear_congruential_engine<size_t, ArraySize/10, ArraySize/10, ArraySize> lcg; // I've tried with many params...
    for (auto _ : state)
    {
        int8_t res = 0;
        int32_t i = 0;
        while (i++ < iterations)
        {
            size_t idx = lcg();
            benchmark::DoNotOptimize(res += a[idx].data[0]);
        }
    }
}
// L1 Data 32 KiB(x6)
// L2 Unified 256 KiB(x6)
// L3 Unified 15360 KiB(x1)
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 32 * 1024, 1);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 32 * 1024, 64);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 32 * 1024, 128);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 256 * 1024, 1);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 256 * 1024, 64);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 256 * 1024, 128);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024, 1);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024, 64);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024, 128);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024 * 4, 1);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024 * 4, 64);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024 * 4, 128);
Here are the results (optimizations enabled):
// First template parameter is working set size.
// Second template parameter is array element size.
BM_ReadArrayRndCacheLine<32 * 1024, 1> 2833786 ns 2823795 ns 249
BM_ReadArrayRndCacheLine<32 * 1024, 64> 2960200 ns 2979343 ns 236
BM_ReadArrayRndCacheLine<32 * 1024, 128> 2896079 ns 2910539 ns 204
BM_ReadArrayRndCacheLine<256 * 1024, 1> 3114670 ns 3111758 ns 236
BM_ReadArrayRndCacheLine<256 * 1024, 64> 3629689 ns 3643135 ns 193
BM_ReadArrayRndCacheLine<256 * 1024, 128> 3213500 ns 3187189 ns 201
BM_ReadArrayRndCacheLine<15360 * 1024, 1> 5782703 ns 5729167 ns 90
BM_ReadArrayRndCacheLine<15360 * 1024, 64> 5958600 ns 6009615 ns 130
BM_ReadArrayRndCacheLine<15360 * 1024, 128> 5958221 ns 5998884 ns 112
BM_ReadArrayRndCacheLine<15360 * 1024 * 4, 1> 6143701 ns 6076389 ns 90
BM_ReadArrayRndCacheLine<15360 * 1024 * 4, 64> 5800649 ns 5902778 ns 90
BM_ReadArrayRndCacheLine<15360 * 1024 * 4, 128> 5826414 ns 5729167 ns 90
How is it possible that the results for (L1d < workingSet < L2) do not differ much from those for (workingSet < L1d)? L2's latency is higher and its throughput lower than L1d's, and with the random accesses I'm trying to prevent prefetching and force cache misses... so why am I not noticing even a minimal increase?
Even when trying to fetch from main memory (workingSet > L3) I'm not getting a massive performance drop. You mention that the latest architectures can sustain bandwidths of up to ~8 bytes per clock, but I understand that they must copy a whole cache line, and that without prefetching of a predictable linear pattern, latency should be much more noticeable in my tests... why is that not the case?
I suspect that page faults and the TLB may have something to do with it too.
(I've downloaded the VTune analyzer to try to understand all this better, but it's hanging on my machine and I'm waiting for support.)
I REALLY appreciate your help Peter Cordes :)
I'm just a GAME programmer trying to show my teammates whether using certain integer types in our code might (or not) have implications on our game performance. For example, whether we should worry about using fast types (eg. int_fast16_t) or using the least possible bytes in our variables for better packing (eg. int8_t).
Re: the ultimate question: int_fast16_t is garbage for arrays because glibc on x86-64 unfortunately defines it as a 64-bit type (not 32-bit), so it wastes huge amounts of cache footprint. The question is "fast for what purpose", and glibc answered "fast for use as array indices / loop counters", apparently, even though it's slower to divide, or to multiply on some older CPUs (which were current when the choice was made). IMO this was a bad design decision.
Cpp uint32_fast_t resolves to uint64_t but is slower for nearly all operations than a uint32_t (x86_64). Why does it resolve to uint64_t?
How should the [u]int_fastN_t types be defined for x86_64, with or without the x32 ABI?
Why are the fast integer types faster than the other integer types?
Generally using small integer types for arrays is good; usually cache misses are a problem so reducing your footprint is nice even if it means using a movzx or movsx load instead of a memory source operand to use it with an int or unsigned 32-bit local. If SIMD is ever possible, having more elements per fixed-width vector means you get more work done per instruction.
But unfortunately int_fast16_t isn't going to help you achieve that with some libraries, but short will, or int_least16_t.
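If you want to double-check what your own toolchain does, a quick sizeof printout (a minimal standalone sketch) will show it:
#include <cstdint>
#include <cstdio>

int main() {
    // On x86-64 glibc this typically prints 8 2 2: int_fast16_t costs 4x the
    // cache footprint of short / int_least16_t when used in arrays.
    std::printf("%zu %zu %zu\n",
                sizeof(int_fast16_t), sizeof(short), sizeof(int_least16_t));
}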
See my comments under the question for answers to the early part: 200 stall cycles is latency, not throughput. HW prefetch and memory-level parallelism hide that. Modern Microprocessors - A 90 Minute Guide! is excellent, and has a section on memory. See also What Every Programmer Should Know About Memory? which is still highly relevant in 2021. (Except for some stuff about prefetch threads.)
Your Update 2 with a faster PRNG
Re: why L2 isn't slower than L1: out-of-order exec is sufficient to hide L2 latency, and even your LCG is too slow to stress L2 throughput. It's hard to generate random numbers fast enough to give the available memory-level parallelism much trouble.
Your Skylake-derived CPU has an out-of-order scheduler (RS) of 97 uops, and a ROB size of 224 uops (like https://realworldtech.com/haswell-cpu/3 but larger), and 12 LFBs to track cache lines it's waiting for. As long as the CPU can keep track of enough in-flight loads (latency * bandwidth), having to go to L2 is not a big deal. Ability to hide cache misses is one way to measure out-of-order window size in the first place: https://blog.stuffedcow.net/2013/05/measuring-rob-capacity
Latency for an L2 hit is 12 cycles (https://www.7-cpu.com/cpu/Skylake.html). Skylake can do 2 loads per clock from L1d cache, but not from L2. (It can't sustain 1 cache line per clock IIRC, but 1 per 2 clocks or even somewhat better is doable).
Your LCG RNG bottlenecks your loop on its latency: 5 cycles for power-of-2 array sizes, or more like 13 cycles for non-power-of-2 sizes like your "L3" test attempts (see note 1). So that's about 1/10th the access rate that L1d can handle, and even if every access misses L1d but hits in L2, you're not even keeping more than one load in flight from L2. OoO exec + load buffers aren't even going to break a sweat. So L1d and L2 will be the same speed because they both use power-of-2 array sizes.
note 1: imul(3c) + add(1c) for x = a * x + c, then remainder = x - (x/m * m) using a multiplicative inverse, probably mul(4 cycles for high half of size_t?) + shr(1) + imul(3c) + sub(1c). Or with a power-of-2 size, modulo is just AND with a constant like (1UL<<n) - 1.
Clearly my estimates aren't quite right because your non-power-of-2 arrays are less than twice the times of L1d / L2, not 13/5 which my estimate would predict even if L3 latency/bandwidth wasn't a factor.
Running multiple independent LCGs in an unrolled loop could make a difference. (With different seeds.) But a non-power-of-2 m for an LCG still means quite a few instructions so you would bottleneck on CPU front-end throughput (and back-end execution ports, specifically the multiplier).
An LCG with multiplier (a) = ArraySize/10 is probably just barely a large enough stride for the hardware prefetcher to not benefit much from locking on to it. But normally IIRC you want a large odd number or something (been a while since I looked at the math of LCG param choices), otherwise you risk only touching a limited number of array elements, not eventually covering them all. (You could test that by storing a 1 to every array element in a random loop, then count how many array elements got touched, i.e. by summing the array, if other elements are 0.)
a and c should definitely not both be factors of m, otherwise you're accessing the same 10 cache lines every time to the exclusion of everything else.
As I said earlier, it doesn't take much randomness to defeat HW prefetch. An LCG with c=0, a= an odd number, maybe prime, and m=UINT_MAX might be good, literally just an imul. You can modulo to your array size on each LCG result separately, taking that operation off the critical path. At this point you might as well keep the standard library out of it and literally just unsigned rng = 1; to start, and rng *= 1234567; as your update step. Then use arr[rng % arraysize].
That's cheaper than anything you could do with xorshift+ or xorshift*.
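Putting the last two paragraphs together, a google-benchmark-style sketch of that loop might look like the following (the fixture name is hypothetical, and the 1234567 multiplier is just the illustrative value from above):
template <int32_t WorkingSetSize, int32_t ElementSize>
void BM_ReadArrayCheapRng(benchmark::State& state) {
    struct Element { int8_t data[ElementSize]; };
    constexpr int32_t ArraySize = WorkingSetSize / sizeof(Element);
    static Element a[ArraySize];               // static so the array isn't on the stack
    for (auto _ : state) {
        unsigned rng = 1;                      // trivial multiplicative generator: one imul per step
        int8_t res = 0;
        for (int32_t i = 0; i < 1536 * 1024; ++i) {
            rng *= 1234567u;                   // wraps mod 2^32; the modulo below is off the RNG's critical path
            benchmark::DoNotOptimize(res += a[rng % ArraySize].data[0]);
        }
    }
}
BENCHMARK_TEMPLATE(BM_ReadArrayCheapRng, 15360 * 1024, 64);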
Benchmarking cache latency:
You could generate an array of random uint16_t or uint32_t indices once (e.g. in a static initializer or constructor) and loop over that repeatedly, accessing another array at those positions. That would interleave sequential and random access, and make code that could probably do 2 loads per clock with L1d hits, especially if you use gcc -O3 -funroll-loops. (With -march=native it might auto-vectorize with AVX2 gather instructions, but only for 32-bit or wider elements, so use -fno-tree-vectorize if you want to rule out that confounding factor that only comes from taking indices from an array.)
To test cache / memory latency, the usual technique is to make linked lists with a random distribution around an array. Walking the list, the next load can start as soon as (but not before) the previous load completes. Because one depends on the other. This is called "load-use latency". See also Is there a penalty when base+offset is in a different page than the base? for a trick Intel CPUs use to optimistically speed up workloads like that (the 4-cycle L1d latency case, instead of the usual 5 cycle). Semi-related: PyPy 17x faster than Python. Can Python be sped up? is another test that's dependent on pointer-chasing latency.
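A minimal sketch of that pointer-chasing idea, using an index chain instead of real pointers (the function names here are illustrative, not from any library):
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Build a single random cycle: next[i] holds the index of the node that follows i.
std::vector<size_t> make_chain(size_t n) {
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
    std::vector<size_t> next(n);
    for (size_t i = 0; i < n; ++i)
        next[order[i]] = order[(i + 1) % n];   // one cycle visiting every slot
    return next;
}

// Each load depends on the previous one, so this measures load-use latency.
size_t chase(const std::vector<size_t>& next, size_t steps) {
    size_t idx = 0;
    for (size_t s = 0; s < steps; ++s)
        idx = next[idx];                       // serially dependent loads
    return idx;                                // returned so the loop can't be optimized away
}
Timing chase() with n chosen so the chain fits in L1d, L2, L3, or only in RAM gives a per-level load-use latency, because no load can start before the previous one has delivered its data.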

CUDA Parallel Cross Product

Disclaimer: I am fairly new to CUDA and parallel programming - so if you're not going to bother to answer my question, just ignore this, or at least point me to the right resources so I can find the answer myself.
Here's the particular problem I'm looking to solve using parallel programming. I have some 1D arrays that store 3D vectors in this format -> [v0x, v0y, v0z, ... vnx, vny, vnz], where n is the vector, and x, y, z are the respective components.
Suppose I want to find the cross product between vectors [v0, v1, ... vn] in one array and their corresponding vectors [v0, v1, ... vn] in another array.
The calculation is pretty straightforward without parallelization:
result[x] = vec1[y]*vec2[z] - vec1[z]*vec2[y];
result[y] = vec1[z]*vec2[x] - vec1[x]*vec2[z];
result[z] = vec1[x]*vec2[y] - vec1[y]*vec2[x];
The problem I'm having is understanding how to implement CUDA parallelization for the arrays I currently have. Since each value in the result vector is a separate calculation, I can effectively run the above calculation for each vector in parallel. Since each component of the resulting cross product is a separate calculation, those too could run in parallel. How would I go about setting up the blocks and threads, and thinking about the thread layout, for such a problem?
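(For reference, the most naive arrangement is one thread per 3D vector, applying the three formulas above directly; a sketch with illustrative names follows. The answer below describes a layout with better memory behaviour.)
// One thread per 3D vector; layout [v0x, v0y, v0z, v1x, ...] as described in the question.
__global__ void crossNaive(const float* vec1, const float* vec2, float* result, int n)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;   // which 3D vector this thread handles
    if (v < n) {
        int x = 3 * v, y = 3 * v + 1, z = 3 * v + 2;
        result[x] = vec1[y] * vec2[z] - vec1[z] * vec2[y];
        result[y] = vec1[z] * vec2[x] - vec1[x] * vec2[z];
        result[z] = vec1[x] * vec2[y] - vec1[y] * vec2[x];
    }
}
// e.g. crossNaive<<<(n + 255) / 256, 256>>>(d_vec1, d_vec2, d_result, n);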
The top 2 optimization priorities for any CUDA programmer are to use memory efficiently, and expose enough parallelism to hide latency. We'll use those to guide our algorithmic choices.
A very simple thread strategy (the thread strategy answers the question, "what will each thread do or be responsible for?") in any transformation (as opposed to reduction) type problem is to have each thread be responsible for 1 output value. Your problem fits the description of transformation - the output data set size is on the order of the input data set size(s).
I'll assume that you intended to have two equal length vectors containing your 3D vectors, and that you want to take the cross product of the first 3D vectors in each and the 2nd 3D vectors in each, and so on.
If we choose a thread strategy of 1 output point per thread (i.e. result[x] or result[y] or result[z], all together would be 3 output points), then we will need 3 threads to compute the output of each vector cross product. If we have enough vectors to multiply, then we will have enough threads to keep our machine "busy" and do a good job of hiding latency. As a rule of thumb, your problem will start to become interesting on GPUs if the number of threads is 10000 or more, so this means we would want your 1D vectors to consist of about 3000 3D vectors or more. Let's assume that is the case.
In order to tackle the memory efficiency objective, our first task is to load your vector data from global memory. We will want this ideally to be coalesced, which roughly means adjacent threads access adjacent elements in memory. We'll want the output stores to be coalesced also, and our thread strategy of choosing one output point/one vector component per thread will work nicely to support that.
For efficient memory usage, we'd like to ideally load each item from global memory only once. Your algorithm naturally involves a small amount of data reuse. The data reuse is evident since the computation of result[y] depends on vec2[z] and the computation of result[x] also depends on vec2[z] to pick just one example. Therefore a typical strategy when there is data reuse is to load the data first into CUDA shared memory, and then allow the threads to perform their computations based on the data in shared memory. As we will see, this makes it fairly easy/convenient for us to arrange for coalesced loads from global memory, since the global data load arrangement is no longer tightly coupled to the threads or the usage of the data for computation.
The last challenge is to figure out an indexing pattern so that each thread will select the proper elements from shared memory to multiply together. If we look at your calculation pattern that you have depicted in your question, we see that the first load from vec1 follows an offset pattern of +1(modulo 3) from the index that the result is being computed for. So x->y, y->z, and z -> x. Likewise we see a +2(modulo 3) for the next load from vec2, another +2(modulo 3) pattern for the next load from vec1 and another +1(modulo 3) pattern for the final load from vec2.
If we combine all these ideas, we can then write a kernel that should have generally efficient characteristics:
$ cat t1003.cu
#include <stdio.h>
#define TV1 1
#define TV2 2
const size_t N = 4096; // number of 3D vectors
const int blksize = 192; // choose as multiple of 3 and 32, and less than 1024
typedef float mytype;
//pairwise vector cross product
template <typename T>
__global__ void vcp(const T * __restrict__ vec1, const T * __restrict__ vec2, T * __restrict__ res, const size_t n){
    __shared__ T sv1[blksize];
    __shared__ T sv2[blksize];
    size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
    while (idx < 3*n){ // grid-stride loop
        // load shared memory using coalesced pattern to global memory
        sv1[threadIdx.x] = vec1[idx];
        sv2[threadIdx.x] = vec2[idx];
        // compute modulo/offset indexing for thread loads of shared data from vec1, vec2
        int my_mod = threadIdx.x%3; // costly, but possibly hidden by global load latency
        int off1 = my_mod+1;
        if (off1 > 2) off1 -= 3;
        int off2 = my_mod+2;
        if (off2 > 2) off2 -= 3;
        __syncthreads();
        // each thread loads its computation elements from shared memory
        T t1 = sv1[threadIdx.x-my_mod+off1];
        T t2 = sv2[threadIdx.x-my_mod+off2];
        T t3 = sv1[threadIdx.x-my_mod+off2];
        T t4 = sv2[threadIdx.x-my_mod+off1];
        // compute result, and store using coalesced pattern, to global memory
        res[idx] = t1*t2-t3*t4;
        idx += gridDim.x*blockDim.x;} // for grid-stride loop
}
int main(){
    mytype *h_v1, *h_v2, *d_v1, *d_v2, *h_res, *d_res;
    h_v1 = (mytype *)malloc(N*3*sizeof(mytype));
    h_v2 = (mytype *)malloc(N*3*sizeof(mytype));
    h_res = (mytype *)malloc(N*3*sizeof(mytype));
    cudaMalloc(&d_v1, N*3*sizeof(mytype));
    cudaMalloc(&d_v2, N*3*sizeof(mytype));
    cudaMalloc(&d_res, N*3*sizeof(mytype));
    for (int i = 0; i < N; i++){
        h_v1[3*i]   = TV1;
        h_v1[3*i+1] = 0;
        h_v1[3*i+2] = 0;
        h_v2[3*i]   = 0;
        h_v2[3*i+1] = TV2;
        h_v2[3*i+2] = 0;
        h_res[3*i]   = 0;
        h_res[3*i+1] = 0;
        h_res[3*i+2] = 0;}
    cudaMemcpy(d_v1, h_v1, N*3*sizeof(mytype), cudaMemcpyHostToDevice);
    cudaMemcpy(d_v2, h_v2, N*3*sizeof(mytype), cudaMemcpyHostToDevice);
    vcp<<<(N*3+blksize-1)/blksize, blksize>>>(d_v1, d_v2, d_res, N);
    cudaMemcpy(h_res, d_res, N*3*sizeof(mytype), cudaMemcpyDeviceToHost);
    // verification
    for (int i = 0; i < N; i++)
        if ((h_res[3*i] != 0) || (h_res[3*i+1] != 0) || (h_res[3*i+2] != TV1*TV2)) {
            printf("mismatch at %d, was: %f, %f, %f, should be: %f, %f, %f\n", i, h_res[3*i], h_res[3*i+1], h_res[3*i+2], (float)0, (float)0, (float)(TV1*TV2));
            return -1;}
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
$ nvcc t1003.cu -o t1003
$ cuda-memcheck ./t1003
========= CUDA-MEMCHECK
no error
========= ERROR SUMMARY: 0 errors
$
Note that I've chosen to write the kernel using a grid-stride loop. This isn't terribly important to this discussion, and not that relevant for this problem, because I've chosen a grid size equal to the problem size (4096*3). However for much larger problem sizes, you might choose a smaller grid size than the overall problem size, for some possible small efficiency gain.
For such a simple problem as this, it's fairly easy to define "optimality". The optimal scenario would be however long it takes to load the input data (just once) and write the output data. If we consider a larger version of the test code above, changing N to 40960 (and making no other changes), then the total data read and written would be 40960*3*4*3 bytes. If we profile that code and then compare to bandwidthTest as a proxy for peak achievable memory bandwidth, we observe:
$ CUDA_VISIBLE_DEVICES="1" nvprof ./t1003
==27861== NVPROF is profiling process 27861, command: ./t1003
no error
==27861== Profiling application: ./t1003
==27861== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 65.97% 162.22us 2 81.109us 77.733us 84.485us [CUDA memcpy HtoD]
30.04% 73.860us 1 73.860us 73.860us 73.860us [CUDA memcpy DtoH]
4.00% 9.8240us 1 9.8240us 9.8240us 9.8240us void vcp<float>(float const *, float const *, float*, unsigned long)
API calls: 99.10% 249.79ms 3 83.263ms 6.8890us 249.52ms cudaMalloc
0.46% 1.1518ms 96 11.998us 374ns 454.09us cuDeviceGetAttribute
0.25% 640.18us 3 213.39us 186.99us 229.86us cudaMemcpy
0.10% 255.00us 1 255.00us 255.00us 255.00us cuDeviceTotalMem
0.05% 133.16us 1 133.16us 133.16us 133.16us cuDeviceGetName
0.03% 71.903us 1 71.903us 71.903us 71.903us cudaLaunchKernel
0.01% 15.156us 1 15.156us 15.156us 15.156us cuDeviceGetPCIBusId
0.00% 7.0920us 3 2.3640us 711ns 4.6520us cuDeviceGetCount
0.00% 2.7780us 2 1.3890us 612ns 2.1660us cuDeviceGet
0.00% 1.9670us 1 1.9670us 1.9670us 1.9670us cudaGetLastError
0.00% 361ns 1 361ns 361ns 361ns cudaGetErrorString
$ CUDA_VISIBLE_DEVICES="1" /usr/local/cuda/samples/bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: Tesla K20Xm
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6375.8
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6554.3
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 171220.3
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
$
The kernel takes 9.8240us to execute, and in that time loads or stores a total of 40960*3*4*3 bytes of data. Therefore the achieved memory bandwidth by the kernel is 40960*3*4*3/0.000009824 or 150 GB/s. The proxy measurement for peak achievable on this GPU is 171 GB/s, so this kernel achieves 88% of the optimal throughput. With more careful benchmarking to run the kernel twice in a row, the 2nd execution requires only 8.99us to execute. This brings the achieved bandwidth in this case up to 96% of peak achievable throughput.

Strange under-performance of Titan X in memory-bound kernels (e.g. elementwise addition)

I've been noticing super-slow execution times for essentially all of my CUDA kernels on one machine (Fedora 24, GeForce Titan X Maxwell), but not on others. Edit: I previously gave the CUDA vectorAdd sample as an MCVE, but there were doubts about whether that kernel should really be memory-bottlenecked, given its low workload per thread; so here's a hand-unrolled version of it:
enum { serialization_factor = 8 };

__global__ void vectorAdd(
    const float* __restrict__ lhs,
    const float* __restrict__ rhs,
    float*       __restrict__ result,
    int length)
{
    int pos = threadIdx.x + blockIdx.x * blockDim.x * serialization_factor;
    if (length - pos >= blockDim.x * serialization_factor) {
        #pragma unroll
        for(int i = 0; i < serialization_factor; i++) {
            result[pos] = lhs[pos] + rhs[pos];
            pos += blockDim.x;
        }
    }
    else {
        for(; pos < length; pos += blockDim.x) {
            result[pos] = lhs[pos] + rhs[pos];
        }
    }
}
... and suppose we run this for 5,000,000 elements; and launch the kernel twice, ignoring the first run.
Well, with my home GPU, a GeForce GTX 650 Ti Boost, I get 527 usec. This is a bit strange - I was expecting something like 555 usec by bandwidth calculations: 3004 MHz clock * 192-bit bus = 72096 MB/s = 72 GB/s, and 2 * 4 bytes per float * 5M of data. But it's pretty close, so let's ignore the difference. The profiler tells me the "Global Load Throughput" is 72.355 GB/s.
Now, on the Maxwell Titan X at work, I get 232 usec. That's about twice as fast - but the GPU's bandwidth is 5 times as high as my home GPU: ~336 GB/sec. I should be seeing something like 120 usec. And - the profiler tells me the "Global Load Throughput" is 343.271 GB/sec (!)
How could this be happening?
Notes:
If you think I've gotten something wrong with the kernel, please comment about that rather than writing an answer.
The Titan doesn't have ECC on.
Your bandwidth calculations are not fully accurate. The specified theoretical peak memory bandwidth of the GTX 650 Ti BOOST is twice as high (144.2 GB/s) as you calculated, because of double data rate transfer (separate words are transferred on both the rising and the falling edge of the clock signal). The bandwidth achieved in the vector add example is also 50% higher than you calculated, because writing the results back to memory needs to be taken into account as well. This means your GTX 650 Ti BOOST measurement achieved ~79% of its theoretical peak bandwidth.
The Titan X's specified peak memory bandwidth is 336.5 GB/s, so your test achieved ~77% of theoretical peak memory bandwidth.
This is about as good as it gets. The remaining discrepancy is due to overhead like memory refresh, the time needed to switch the transfer direction etc.
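For reference, plugging the timings quoted in the question into that accounting (two reads plus one write of 5M floats, i.e. 3 * 4 bytes * 5,000,000 = 60 MB moved per launch):
GTX 650 Ti Boost: 60 MB / 527 us ≈ 113.9 GB/s ≈ 79% of 144.2 GB/s
Titan X: 60 MB / 232 us ≈ 258.6 GB/s ≈ 77% of 336.5 GB/s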
Adding some to tera's answer: your algorithm has a warm-up and a cool-down phase. With many requests in flight, latency does get hidden, but at the cost of a warm-up at the start and a cool-down for the last iterations.
If your scheduling is good, you will have a work chunk of 2048 (max resident threads per sm) x 24 (number of sm on the GTX Titan X). Each of which will operate on 8 values. Hence, your work chunk is 393,216 entries.
For your 5,000,000 size sample, it results in 12.7 iterations (13 with the last being incomplete). The warm-up/cool-down cost is 1 iteration.
Depending on scheduling of threads (and this is not necessarily predictable), you may run 14 iterations total; for which you could have 5,111,808 entries at approximately same cost (still one warm-up/cool-down). That size would provide you the best performance I believe.
As a result, the incomplete iteration plus warm-up/cool-down could cost about 10% of performance, the achieved bandwidth being closer to 85% of peak if not more.
The minimal run time of a kernel should also be looked at, as it might account for a few microseconds as well. Running on various data sizes should mitigate this point.
Last but not least, the memory frequency might be modifiable with nvidia-smi, as explained here.

Generating threads based on a variable length array in cuda?

First, let me explain what I am implementing. The goal of my program is to generate all possible, non-distinct combinations of a given character set on a CUDA-enabled GPU. In order to parallelize the work, I am initializing each thread with a starting character.
For instance, consider the character set abcdefghijklmnopqrstuvwxyz. In this case, there will ideally be 26 threads: characterSet[threadIdx.x] = a for example (in practice, there would obviously be an offset to span the entire grid so that each thread has a unique identifier).
Here is my code thus far:
//Used to calculate grid dimensions
int* threads;
int* blocks;
int* tpb;
int charSetSize;

void calculate_grid_parameters(int length, int size, int* threads, int* blocks, int* tpb){
    //Validate input
    if(!threads || !blocks || !tpb){
        cout << "An error has occured: Null pointer passed to function...\nPress enter to exit...";
        getchar();
        exit(1);
    }
    //Declarations
    const int maxBlocks = 65535; //Does not change
    int maxThreads = 512;        //Limit in order to provide more portability
    int dev = 0;
    int maxCombinations;
    cudaDeviceProp deviceProp;
    //Query device
    //cudaGetDeviceProperties(&deviceProp, dev);
    //maxThreads = deviceProp.maxThreadsPerBlock;
    //Determine total threads to spawn
    //Length of password * size of character set
    //Each thread will handle part of the total number of the combinations
    if(length > 3) length = 3; //Max length is 3
    maxCombinations = length * size;
    assert(maxCombinations < (maxThreads * maxBlocks));
}
It is fairly basic.
I've limited length to 3 for a specific reason. The full character set, abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 !\"#$&'()*+-.:;<>=?#[]^_{}~|, is, I believe, 92 characters. This means that for a length of 3 there are 778,688 possible non-distinct combinations. If it were length 4, then it would be roughly 71 million, and the maximum number of threads for my GPU is about 69 million (in one dimension). Furthermore, these combinations have already been generated in a file that will be read into an array and then delegated to a specific initializing thread.
This leads me to my problem.
The maximum number of blocks on a CUDA GPU (for 1-D) is 65,535. Each of those blocks (on my GPU) can run 1024 threads in one dimension. I've limited it to 512 in my code for portability purposes (this may be unnecessary). Ideally, each block should run 32 threads, or a multiple of 32 threads, in order to be efficient. The issue I have is how many threads I need. Like I said above, if I am using the full character set with length 3 for the starting values, this necessitates 778,688 threads. This happens to be divisible by 32, yielding 24,334 blocks assuming each block runs 32 threads. However, if I run the same character set with length 2, I am left with 264.5 blocks, each running 32 threads.
Basically, my character set is variable and the length of the initializing combinations is variable from 1-3.
If I round up to the nearest whole number, my offset index, tid = threadIdx.x + .... will be accessing parts of the array that simply do not exist.
How can I handle this problem in such a way that it will still run efficiently and not spawn unnecessary threads that could potentially cause memory problems?
Any constructive input is appreciated.
The code you've posted doesn't seem to do anything significant and includes no cuda code.
Your question appears to be this:
How can I handle this problem in such a way that it will still run efficiently and not spawn unnecessary threads that could potentially cause memory problems?
It's common practice when launching a kernel to "round up" to the nearest increment of threads, perhaps 32, perhaps some multiple of 32, so that an integral number of blocks can be launched. In this case, it's common practice to include a thread check in the kernel code, such as:
__global__ void mykernel(.... int size){
    int idx = threadIdx.x + blockDim.x*blockIdx.x;
    if (idx < size){
        //main body of kernel code here
    }
}
In this case, size is your overall problem size (the number of threads that you actually want). The overhead of the additional threads that are doing nothing is normally not a significant performance issue.
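Concretely, the round-up launch configuration described above might look like this (a sketch; the 778,688-combination size is taken from the question and the other names are illustrative):
int size = 778688;                            // e.g. the 92^3 starting combinations
int threadsPerBlock = 256;                    // a multiple of the 32-thread warp size
int blocks = (size + threadsPerBlock - 1) / threadsPerBlock;   // integer round-up
mykernel<<<blocks, threadsPerBlock>>>(/* other kernel arguments, */ size);
// threads with idx >= size fail the check inside the kernel and simply do nothing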

How is a CUDA kernel launched?

I have created a simple CUDA application to add two matrices. It compiles fine. I want to know how the kernel will be launched by all the threads and what the flow will be inside CUDA. I mean, in what fashion will every thread execute each element of the matrices?
I know this is a very basic concept, but I don't know this. I am confused regarding the flow.
You launch a grid of blocks.
Blocks are indivisibly assigned to multiprocessors (where the number of blocks on the multiprocessor determine the amount of available shared memory).
Blocks are further split into warps. For a Fermi GPU a warp is 32 threads that either execute the same instruction or are inactive (because they branched away, e.g. by exiting a loop earlier than their neighbors within the same warp, or by not taking an if that their neighbors did take). On a Fermi GPU, at most two warps are issued instructions on one multiprocessor at a time.
Whenever there is latency (that is, execution stalls waiting for a memory access or a data dependency to complete), another warp is run. The number of warps that fit onto one multiprocessor - from the same or different blocks - is determined by the number of registers used by each thread and the amount of shared memory used by each block.
This scheduling happens transparently. That is, you do not have to think about it too much.
However, you might want to use the predefined integer vectors threadIdx (where is my thread within the block?), blockDim (how large is one block?), blockIdx (where is my block in the grid?) and gridDim (how large is the grid?) to split up work (read: input and output) among the threads. You might also want to read up how to effectively access the different types of memory (so multiple threads can be serviced within a single transaction) - but that's leading off topic.
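For example, a minimal sketch of that usual pattern (kernel name and arguments are illustrative), combining the global index with an optional grid-stride loop so any grid size can cover any problem size:
__global__ void scale(float* data, int n, float factor)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;   // where am I in the whole grid?
         i < n;
         i += gridDim.x * blockDim.x)                     // hop one whole grid ahead
        data[i] *= factor;
}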
NSight provides a graphical debugger that gives you a good idea of what's happening on the device once you got through the jargon jungle. Same goes for its profiler regarding those things you won't see in the debugger (e.g. stall reasons or memory pressure).
You can synchronize all threads within the grid (all there are) by another kernel launch.
For non-overlapping, sequential kernel execution no further synchronization is needed.
The threads within one grid (or one kernel run - however you want to call it) can communicate via global memory using atomic operations (for arithmetic) or appropriate memory fences (for load or store access).
You can synchronize all threads within one block with the intrinsic instruction __syncthreads(); all threads of the block will be active afterwards (although, as always, at most two warps are issued at a time on a Fermi multiprocessor). The threads within one block can communicate via shared or global memory using atomic operations (for arithmetic) or appropriate memory fences (for load or store access).
As mentioned earlier, all threads within a warp are always "synchronized", although some might be inactive. They can communicate through shared or global memory (or "lane swapping" on upcoming hardware with compute capability 3). You can use atomic operations (for arithmetic) and volatile-qualified shared or global variables (load or store access happening sequentially within the same warp). The volatile qualifier tells the compiler to always access memory and never registers whose state cannot be seen by other threads.
Further, there are warp-wide vote functions that can help you make branch decisions or compute integer (prefix) sums.
OK, that's basically it. Hope that helps. Had a good flow writing :-).
Let's take the example of adding two 4*4 matrices: you have two matrices A and B, each of dimension 4*4.
#include <stdio.h>
#include <stdlib.h>

__global__ void add_matrices(float *ad, float *bd, float *cd, int N);

int main()
{
    float *a, *b, *c;    // To store matrices A & B in host RAM. Result will be stored in matrix C
    float *ad, *bd, *cd; // To store matrices in the GPU's RAM.
    int i, j;
    int N = 4;           // No of rows and columns.
    size_t size = sizeof(float) * N * N;
    a = (float*)malloc(size); // Allocate host space for matrix A
    b = (float*)malloc(size); // Allocate host space for matrix B
    c = (float*)malloc(size); // Allocate host space for the result C
    // allocate memory on device
    cudaMalloc(&ad, size);
    cudaMalloc(&bd, size);
    cudaMalloc(&cd, size);
    // initialize host memory with its own indices
    for(i = 0; i < N; i++)
    {
        for(j = 0; j < N; j++)
        {
            a[i * N + j] = (float)(i * N + j);
            b[i * N + j] = -(float)(i * N + j);
        }
    }
    // copy data from host memory to device memory
    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);
    // calculate execution configuration
    dim3 grid(1, 1, 1);
    dim3 block(16, 1, 1);
    // the block contains N * N threads, each thread calculates 1 data element
    add_matrices<<<grid, block>>>(ad, bd, cd, N);
    cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);
    printf("Matrix A was---\n");
    for(i = 0; i < N; i++)
    {
        for(j = 0; j < N; j++)
            printf("%f ", a[i * N + j]);
        printf("\n");
    }
    printf("\nMatrix B was---\n");
    for(i = 0; i < N; i++)
    {
        for(j = 0; j < N; j++)
            printf("%f ", b[i * N + j]);
        printf("\n");
    }
    printf("\nAddition of A and B gives C----\n");
    for(i = 0; i < N; i++)
    {
        for(j = 0; j < N; j++)
            printf("%f ", c[i * N + j]); // if correctly evaluated, all values will be 0
        printf("\n");
    }
    // deallocate host and device memories
    cudaFree(ad);
    cudaFree(bd);
    cudaFree(cd);
    free(a);
    free(b);
    free(c);
    getchar();
    return 0;
}
///// Kernel Part
__global__ void add_matrices(float *ad, float *bd, float *cd, int N)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    cd[index] = ad[index] + bd[index];
}
To walk through how this works, consider again the addition of the two 4*4 matrices A and B.
First of all you have to decide your thread configuration.
You are supposed to launch a kernel function that performs the parallel computation of your matrix addition; it gets executed on your GPU device.
Now, one grid is launched per kernel call.
A grid can have at most 65,535 blocks in each dimension, and blocks can be arranged in up to three dimensions (65535 * 65535 * 65535).
Every block in the grid can have at most 1024 threads, and those threads can also be arranged in up to three dimensions (1024 * 1024 * 64).
Now our problem is the addition of the 4 * 4 matrices:
A | 1 2 3 4 | B | 1 2 3 4 | C| 1 2 3 4 |
| 5 6 7 8 | + | 5 6 7 8 | = | 5 6 7 8 |
| 9 10 11 12 | | 9 10 11 12 | | 9 10 11 12 |
| 13 14 15 16| | 13 14 15 16| | 13 14 15 16|
We need 16 threads to perform the computation.
i.e. A(1,1) + B (1,1) = C(1,1)
A(1,2) + B (1,2) = C(1,2)
. . .
. . .
A(4,4) + B (4,4) = C(4,4)
All these threads will get executed simultaneously.
So we need a block with 16 threads.
For our convenience we will arrange threads in (16 * 1 * 1) way in a block
As the number of threads is 16, we need only one block to hold those 16 threads.
So the grid configuration will be dim3 Grid(1,1,1), i.e. the grid will have only one block,
and the block configuration will be dim3 block(16,1,1), i.e. the block will have 16 threads arranged column-wise.
The program above should give you a clear idea of its execution.
Understanding the indexing (i.e. threadIdx, blockIdx, blockDim) is the important part. You need to go through the CUDA literature. Once you have a clear idea about indexing, you will have won half the battle! So spend some time with CUDA books, different algorithms and paper and pencil, of course!
Try 'Cuda-gdb', which is the CUDA debugger.
