Which way to order a shared 2D/3D array for parallel reduction over 1 dimension in CUDA/OpenCL? - algorithm

Overall goal
I have several reductions to make on a bipartite graph, represented by two dense arrays for vertices and a dense array specifying whether an edge is present between the two. Say the two arrays are a0[] and a1[], and all edges go like e[i0][i1] (that is, from elements in a0 to elements in a1).
There are ~100+100 vertices, and ~100*100 edges, so each thread is responsible for one edge.
Task 1 : max reduction
For each vertex in a0 I want to find the maximum of all vertices (in a1) connected to it, and then the same in reverse: having assigned the result to an array b0, for each vertex in a1, I want to find the maximum b0[i0] of the connected vertices.
To do this, I:
1) load into shared memory
#define DC_NUM_FROM_SHARED 16
#define DC_NUM_TO_SHARED 16
__global__ void max_reduce_down(
Value* value1
, Value* max_value_in_connected
, int r0_size, int r1_size
, bool** connected
)
{
int id_from, id_to;
id_from = blockIdx.x * blockDim.x + threadIdx.x;
id_to = blockIdx.y * blockDim.y + threadIdx.y;
bool within_bounds = (id_from < r0_size) && (id_to < r1_size);
//load into shared memory
__shared__ Value value[DC_NUM_TO_SHARED][DC_NUM_FROM_SHARED]; //FROM is the inner (consecutive) dimension
if(within_bounds)
value[threadIdx.y][threadIdx.x] = connected[id_to][id_from]? value1[id_to] : 0;
else
value[threadIdx.y][threadIdx.x] = 0;
__syncthreads();
// No early return for out-of-bounds threads: every thread in the block must reach the __syncthreads() calls below.
2) reduce
for(int stride = DC_NUM_TO_SHARED/2; stride > 0; stride >>= 1)
{
if(threadIdx.y < stride)
value[threadIdx.y][threadIdx.x] = max(value[threadIdx.y][threadIdx.x], value[threadIdx.y + stride][threadIdx.x]);
__syncthreads();
}
3) write back
if(within_bounds)
max_value_in_connected[id_from] = value[0][threadIdx.x];
Task 2 : best k
A similar problem, but the reduction is only for vertices in a0: for each of them, I need to find the k best candidates among the connected vertices in a1 (k is ~5).
1) I initialize the shared array with zero elements except for the 1st place
int id_from, id_to;
id_from = blockIdx.x * blockDim.x + threadIdx.x;
id_to = blockIdx.y * blockDim.y + threadIdx.y;
__shared__ Value values[MAX_CHAMPS * CHAMPS_NUM_FROM_SHARED * CHAMPS_NUM_TO_SHARED]; //champion overlaps
__shared__ int champs[MAX_CHAMPS * CHAMPS_NUM_FROM_SHARED * CHAMPS_NUM_TO_SHARED]; // overlap champions
bool within_bounds = (id_from < r0_size) && (id_to < r1_size);
int i = threadIdx.y * CHAMPS_NUM_FROM_SHARED + threadIdx.x;
if(within_bounds)
{
values[i] = connected[id_to][id_from] * values1[id_to];
champs[i] = connected[id_to][id_from] ? id_to : -1;
}
else
{
values[i] = 0;
champs[i] = -1;
}
for(int place = 1; place < CHAMP_COUNT; place++)
{
i = (place * CHAMPS_NUM_TO_SHARED + threadIdx.y) * CHAMPS_NUM_FROM_SHARED + threadIdx.x;
values[i] = 0;
champs[i] = -1;
}
__syncthreads();
// No early return for out-of-bounds threads: they must still reach every __syncthreads() below.
2) reduce it
for(int stride = CHAMPS_NUM_TO_SHARED/2; stride > 0; stride >>= 1)
{
if(within_bounds && threadIdx.y < stride)
merge_2_champs(values, champs, CHAMP_COUNT, id_from, id_to, id_to + stride);
__syncthreads();
}
3) write the results back
if(within_bounds)
for(int place = 0; place < LOCAL_DESIRED_ACTIVITY; place++)
champs0[place][id_from] = champs[place * CHAMPS_NUM_TO_SHARED * CHAMPS_NUM_FROM_SHARED + threadIdx.x];
Issue
How do I order (transpose) the elements in the shared array, so that memory access uses the cache better?
Does it matter at this point, or is there much more I can gain from other optimizations?
Would it be better to transpose the edge matrix if I needed to optimize for Task 2? (as far as I understood, there is a symmetry in Task 1, so it doesn't matter).
P.S.
I have delayed unrolling loops and doing the first reduction iteration while loading, since I thought it is too complicated to do before I have explored simpler ways.
For Task 2, it would be nice to not load zero elements, since the array would never need to grow, and only start shrinking once log k steps have been made. This would make it k times more compact in shared memory! But I dread the resulting index math.
Syntax and Correctness
The unusual types are just typedef'ed ints/chars/etc - AFAIK, in GPUs, it makes sense to compactify those as much as possible. I have not run the code yet, no need to check for indexing errors.
Also, I am using CUDA, but I am interested in an OpenCL perspective as well, since I think the best solution should be the same, and I will be using OpenCL in the future anyway.

OK, I think I figured this out.
The two alternatives that I am considering are to have reductions work on the y dimension, and independent on the x dimension, or vice versa (x dimension being the contiguous one). In any case, the scheduler is able to assemble threads into warps along the x dimension, so some coherence is guaranteed. However, having coherence extend beyond a warp would be great. Also, due to the 2D/3D nature of the shared arrays, one would have to limit the dimensions to 16 or even 8.
To ensure coalescence within a warp, the scheduler has to assemble warps along the x dimension.
If reducing over x dimension, after each iteration, the number of active threads in a warp will halve. However, if reducing over y dimension, it is the number of active warps that will halve.
So, I need to reduce over y.
Unless the transpose (load) is the slowest, which is an abnormal case.
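A minimal sketch of the layout chosen above, assuming a 16x16 block, non-negative int values (so 0 can serve as the identity, as in the question), and an edge matrix flattened to one unsigned char per edge with the a0 index contiguous (connected[i1 * n0 + i0]). The names TILE, max_reduce_over_y, partial and run_max_reduce are made up for illustration. Each block writes one partial maximum per column, and the partial rows are combined on the host here for brevity.

```cuda
#include <cassert>
#include <vector>
#include <algorithm>
#include <cuda_runtime.h>

#define TILE 16  // assumed block size: TILE x TILE threads

// Each block reduces a TILE x TILE tile over y (the a1 index).
// threadIdx.x (the a0 index) is the contiguous one, so each halving of the
// stride retires whole warps until only the last two rows remain.
__global__ void max_reduce_over_y(const int* value1, const unsigned char* connected,
                                  int n0, int n1, int* partial)
{
    __shared__ int tile[TILE][TILE];
    int i0 = blockIdx.x * TILE + threadIdx.x;
    int i1 = blockIdx.y * TILE + threadIdx.y;
    bool ok = (i0 < n0) && (i1 < n1);
    tile[threadIdx.y][threadIdx.x] = (ok && connected[i1 * n0 + i0]) ? value1[i1] : 0;
    __syncthreads();

    for (int stride = TILE / 2; stride > 0; stride >>= 1) {
        if (threadIdx.y < stride)
            tile[threadIdx.y][threadIdx.x] = max(tile[threadIdx.y][threadIdx.x],
                                                 tile[threadIdx.y + stride][threadIdx.x]);
        __syncthreads();  // reached by every thread of the block
    }
    if (threadIdx.y == 0 && i0 < n0)
        partial[blockIdx.y * n0 + i0] = tile[0][threadIdx.x];
}

// Host helper: for each a0 vertex, the max value among its connected a1 vertices.
std::vector<int> run_max_reduce(const std::vector<int>& value1,
                                const std::vector<unsigned char>& connected,
                                int n0, int n1)
{
    assert((int)value1.size() == n1 && (int)connected.size() == n0 * n1);
    dim3 block(TILE, TILE);
    dim3 grid((n0 + TILE - 1) / TILE, (n1 + TILE - 1) / TILE);
    int *d_v, *d_p;
    unsigned char* d_c;
    cudaMalloc(&d_v, n1 * sizeof(int));
    cudaMalloc(&d_c, n0 * n1);
    cudaMalloc(&d_p, grid.y * n0 * sizeof(int));
    cudaMemcpy(d_v, value1.data(), n1 * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_c, connected.data(), n0 * n1, cudaMemcpyHostToDevice);
    max_reduce_over_y<<<grid, block>>>(d_v, d_c, n0, n1, d_p);
    std::vector<int> partial(grid.y * n0), result(n0, 0);
    cudaMemcpy(partial.data(), d_p, partial.size() * sizeof(int), cudaMemcpyDeviceToHost);
    for (unsigned by = 0; by < grid.y; ++by)       // combine per-block partial maxima
        for (int i = 0; i < n0; ++i)
            result[i] = std::max(result[i], partial[by * n0 + i]);
    cudaFree(d_v); cudaFree(d_c); cudaFree(d_p);
    return result;
}
```

With ~100 vertices on each side the grid is about 7x7 blocks, so the final combine step is tiny either way; a second kernel could replace the host loop if the data has to stay on the device.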

Coalesced buffer reads really matter; kernels can be 32x slower if you don't do them. It can be worth doing a re-arrangement pass if it means being able to do them (of course, the re-arrangement pass needs to be coalesced as well, but you can often leverage shared local memory to do this).
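As an illustration of such a re-arrangement pass, here is a hedged sketch of a tiled transpose that stages data in shared memory. The tile size TDIM and the names transpose_tiled and transpose_on_device are assumptions for this example, not anything from the question.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define TDIM 32  // assumed tile size; launch with TDIM x TDIM threads

// Transposes a rows x cols row-major matrix. Both the global read and the
// global write walk consecutive addresses along threadIdx.x; the actual
// re-ordering happens in shared memory.
__global__ void transpose_tiled(const float* in, float* out, int rows, int cols)
{
    __shared__ float tile[TDIM][TDIM + 1];    // +1 column avoids bank conflicts
    int x = blockIdx.x * TDIM + threadIdx.x;  // column in `in`
    int y = blockIdx.y * TDIM + threadIdx.y;  // row in `in`
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();

    x = blockIdx.y * TDIM + threadIdx.x;      // column in `out` (= row in `in`)
    y = blockIdx.x * TDIM + threadIdx.y;      // row in `out` (= column in `in`)
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
}

// Host helper: returns the cols x rows transpose of `in`.
std::vector<float> transpose_on_device(const std::vector<float>& in, int rows, int cols)
{
    assert((int)in.size() == rows * cols);
    size_t bytes = in.size() * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(TDIM, TDIM);
    dim3 grid((cols + TDIM - 1) / TDIM, (rows + TDIM - 1) / TDIM);
    transpose_tiled<<<grid, block>>>(d_in, d_out, rows, cols);
    std::vector<float> out(in.size());
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out);
    return out;
}
```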

Related

Grid size in phase #4 of Harris' reduction optimization

I am learning about unrolling loops to optimize kernel computation.
This is a code snippet from the book Professional CUDA C Programming:
if (idx + 4 * blockDim.x <= n)
{
int a1 = g_idata[idx];
int a2 = g_idata[idx + blockDim.x];
int a3 = g_idata[idx + 2 * blockDim.x];
int a4 = g_idata[idx + 3 * blockDim.x];
tmpSum = a1 + a2 + a3 + a4;
}
In my understanding, each thread works on 4 data blocks and processes a single element from each data block.
So, when we launch kernel, compared with kernel w/o unrolling grid.x, the configuration is changed to
reduceSmemUnroll<<<grid.x / 4, block>>>.
Then I have a question about the code snippet from Mark Harris's presentation on parallel reduction on page 32:
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n) {
sdata[tid] += g_idata[i] + g_idata[i+blockSize];
i += gridSize;
}
__syncthreads();
My question is about how to determine the size of grid when launching the kernel? Should it be grid.x/2 compared to configuration w/o multiple load?
Yes, it should be half the number of blocks; it says so on the slide with the first occurrence of the code snippet you quoted from in Mark's presentation - already on slide 18:
Halve the number of blocks, and replace single load:
[code snippet]
with two loads and [the] first add of the reduction
Of course, you need to be careful about the sizes. The presentation assumes, for simplicity, that your overall length is a power of 2, so you can always safely divide by 2 while there are multiple elements left. In real life that is not the case, so you may need to allow for slack (e.g. "half the grid size plus one if it was odd").
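A sketch of that slack handling, under a few assumptions: the block count is computed by ceiling division, the second load is guarded so that lengths that are not a multiple of 2*blockDim.x stay in bounds, and the per-block results are added up on the host. The names blocks_needed, reduce_two_loads and sum_on_device are hypothetical, and blockSize from the slides is replaced by blockDim.x.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Number of blocks when each thread performs `loads_per_thread` initial loads
// (ceiling division, so a partial last block is still launched).
inline int blocks_needed(int n, int block_size, int loads_per_thread)
{
    int per_block = block_size * loads_per_thread;
    return (n + per_block - 1) / per_block;
}

__global__ void reduce_two_loads(const int* g_idata, int* g_odata, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    unsigned int gridSize = blockDim.x * 2 * gridDim.x;
    int sum = 0;
    while (i < n) {
        sum += g_idata[i];
        if (i + blockDim.x < n)            // guard the second load near the end
            sum += g_idata[i + blockDim.x];
        i += gridSize;
    }
    sdata[tid] = sum;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Host helper: sums `h` using half as many blocks as a one-load-per-thread launch.
int sum_on_device(const std::vector<int>& h, int block_size = 256)
{
    int n = (int)h.size();
    assert(n > 0 && (block_size & (block_size - 1)) == 0);  // power-of-two block
    int blocks = blocks_needed(n, block_size, 2);
    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    reduce_two_loads<<<blocks, block_size, block_size * sizeof(int)>>>(d_in, d_out, n);
    std::vector<int> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out);
    int total = 0;
    for (int p : partial) total += p;      // finish the per-block sums on the host
    return total;
}
```

Because of the while loop, launching fewer blocks than blocks_needed returns is also valid; each thread then simply iterates more than once.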

Selection Sort in Cuda

So, I'm trying to implement selection sort in Cuda, but so far I haven't been as successful.
__device__ void selection_sort( int *data, int left, int right ){
for( int i = left ; i <= right ; ++i ){
int min_val = data[i];
int min_idx = i;
// Find the smallest value in the range [left, right].
for( int j = i+1 ; j <= right ; ++j ){
int val_j = data[j];
if( val_j < min_val ){
min_idx = j;
min_val = val_j;
}
}
// Swap the values.
if( i != min_idx ){
data[min_idx] = data[i];
data[i] = min_val;
}
}
}
My main attempt here is to find the minimum and parallelize the solution. Now, I realize the code looks very C++ 'ish but I'm nowhere qualified as skilled in Cuda.
Is there a way to parallelize the solution? Are there any more additions to be made?
Selection sort algorithm for N numbers can be roughly described as:
for i from N-1 down to 0
find the maximum element among data[0] ~ data[i]
swap that maximum element with data[i] within the data array
The first part (finding the maximum element) falls into a widely known and well documented class of problems called reduction. However, to perform the second part (swapping), you must track the index of the maximum element while comparing the values, and it is not so natural to do that while performing a reduction. This is one of the reasons why selection sort does not port well to parallel architectures.
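To make the index tracking concrete, here is a rough sketch for a single warp that shuffles the value and its position together. It assumes CUDA 9 or later (for __shfl_down_sync) and exactly 32 elements; warp_arg_max, arg_max32 and arg_max_of_32 are names invented for this example.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Warp-wide max that carries the index along with the value.
// After the loop, lane 0 holds the maximum and the position it came from.
__device__ void warp_arg_max(int& val, int& idx)
{
    for (int offset = 16; offset > 0; offset /= 2) {
        int other_val = __shfl_down_sync(0xffffffff, val, offset);
        int other_idx = __shfl_down_sync(0xffffffff, idx, offset);
        if (other_val > val) { val = other_val; idx = other_idx; }
    }
}

__global__ void arg_max32(const int* data, int* out_val, int* out_idx)
{
    int val = data[threadIdx.x];
    int idx = threadIdx.x;
    warp_arg_max(val, idx);
    if (threadIdx.x == 0) { *out_val = val; *out_idx = idx; }
}

// Host helper for exactly 32 elements (one warp); returns the index of the maximum.
int arg_max_of_32(const int* h_data)
{
    assert(h_data != nullptr);
    int *d_data, *d_val, *d_idx, idx = -1;
    cudaMalloc(&d_data, 32 * sizeof(int));
    cudaMalloc(&d_val, sizeof(int));
    cudaMalloc(&d_idx, sizeof(int));
    cudaMemcpy(d_data, h_data, 32 * sizeof(int), cudaMemcpyHostToDevice);
    arg_max32<<<1, 32>>>(d_data, d_val, d_idx);
    cudaMemcpy(&idx, d_idx, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data); cudaFree(d_val); cudaFree(d_idx);
    return idx;
}
```

Every shuffle now moves two registers instead of one, which is part of the extra cost the paragraph above alludes to.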
Also, the problem size shrinks by one on each iteration, which is another aspect of the selection sort algorithm that does not map well to parallel architectures. In CUDA, 32 threads form a warp and execute together. Although you can have an arbitrary number of threads active within a warp, it is generally not recommended because the idle lanes are wasted computing power.
I've tried to build a CUDA version of selection sort myself, but I stopped doing it because it seems there are better algorithms well suited for CUDA. But I'll just show you what I've done so far to illustrate why selection sort is not good for CUDA.
Firstly, start from a small and simple problem: sorting 32 elements. Since 32 threads form a warp, you can use shuffle instructions to find maximum value. (Full code)
// Finds the maximum element within a warp and gives the maximum element to
// thread with lane id 0. Note that other elements do not get lost but their
// positions are shuffled.
__inline__ __device__ int warpMax(int data, unsigned int threadId)
{
for (int mask = 16; mask > 0; mask /= 2) {
int dual_data = __shfl_xor_sync(0xffffffff, data, mask, 32); // __shfl_xor is deprecated since CUDA 9
if (threadId & mask)
data = min(data, dual_data);
else
data = max(data, dual_data);
}
return data;
}
__global__ void selection32(int* d_data, int* d_data_sorted)
{
unsigned int threadId = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int laneId = threadIdx.x % 32;
int n = N;
while(n-- > 0) {
// get the maximum element among d_data and put it in d_data_sorted[n]
int data = d_data[threadId];
data = warpMax(data, threadId);
d_data[threadId] = data;
// now maximum element is in d_data[0]
if (laneId == 0) {
d_data_sorted[n] = d_data[0];
d_data[0] = INT_MIN; // this element is ignored from now on
}
}
}
int main()
{
// ... build data and transfer to d_data ...
selection32<<<1, 32>>>(d_data, d_data_sorted);
// ... get the sorted array stored at d_data_sorted ...
}
(Some may argue that this is not exactly a selection sort since 1) the array elements of the unsorted area keep shuffling, and 2) it is not an in-place sort. Please note that I'm just trying to show that selection sort does not fit in for CUDA. Also, note that warpMax has highly divergent branches, making it less optimal for CUDA.)
The case with only 1 warp of elements may look parallel-ish, but things get worse when the problem size increases to multiple warps. Let's see the case for 1024 elements. (I've chosen 1024 because it is the maximum number of threads in a block.) Now there are 32 warps, and after calling warpMax for each warp, we must compare the maximum elements of each warp to get the maximum among the 1024 elements. Comparing the 32 warp maxima cannot be done with warpMax, because we need to track which warp the maximum value came from in order to swap it with the last element in the data array. One way I can think of is to use a single thread to compare the warp maxima. This is not a good implementation for CUDA because the other 1023 threads in the block become idle.
Furthermore, if the problem size grows larger than a block can cover, we need to compare the maximum values of each block, which means launching separate kernels since we need to synchronize between blocks. Needless to say, we also need to keep track of which block the maximum value came from. All of this shows that implementing selection sort in CUDA is not a good idea.

warp shuffling to reduction of arrays with any length

I am working on a Cuda kernel which performs vector dot product (A x B). I assumed that the length of each vector is multiple of 32 (32,64, ...) and defined the block size to be equal to the length of the array. Each thread in the block multiplies one element of A to the corresponding element of B (thread i ==>psum = A[i]xB[i]). After multiplication, I used the following functions which used warp shuffling technique to perform reduction and calculate the sum all multiplications.
__inline__ __device__
float warpReduceSum(float val) {
int warpSize =32;
for (int offset = warpSize/2; offset > 0; offset /= 2)
val += __shfl_down_sync(0xffffffff, val, offset); // __shfl_down is deprecated since CUDA 9
return val;
}
__inline__ __device__
float blockReduceSum(float val) {
static __shared__ float shared[32]; // Shared mem for 32 partial sums (float, to match val)
int lane = threadIdx.x % warpSize;
int wid = threadIdx.x / warpSize;
val = warpReduceSum(val); // Each warp performs partial reduction
if (lane==0)
shared[wid]=val; // Write reduced value to shared memory
__syncthreads(); // Wait for all partial reductions
//read from shared memory only if that warp existed
val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0;
if (wid==0)
val = warpReduceSum(val); // Final reduce within first warp
return val;
}
I simply call blockReduceSum(psum) which psum is the multiplication of two elements by a thread.
This approach doesn't work when the length of the array is not multiple of 32, so my question is, can we change this code so that it also works for any length? or is it impossible because if the length of the array is not multiple of 32, some warps have elements belonging more than one array?
First of all, depending on the GPU you are using, performing dot product with just 1 block will probably not be very efficient (as long as you are not batching several dot products in 1 kernel, each done by a single block).
To answer your question: you can reuse the code you have written by launching your kernel with the number of threads set to the closest multiple of 32 that is not smaller than N (the length of the array), and adding an if statement before the call to blockReduceSum that would look like this:
__global__ void kernel(float * A, float * B, int N) {
float psum = 0;
if(threadIdx.x < N) // threadIdx.x because you are using a single block; change it to a more general id once you move to multiple blocks
psum = A[threadIdx.x] * B[threadIdx.x];
psum = blockReduceSum(psum); // keep the result; it is valid in thread 0
//The rest of computation
}
That way, threads that do not have array element associated with them, but that need to be there due to use of __shfl, will contribute 0 to the sum.
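Put together, a self-contained version of this idea might look like the sketch below. It assumes CUDA 9 or later (hence __shfl_down_sync instead of __shfl_down), a single block, and N <= 1024; the names warp_sum, dot_one_block and dot_on_device are made up for the example.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

__inline__ __device__ float warp_sum(float val)
{
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// Single-block dot product; blockDim.x must be a multiple of 32 (at most 1024).
__global__ void dot_one_block(const float* A, const float* B, int N, float* result)
{
    __shared__ float shared[32];             // one partial sum per warp
    // Threads past the end of the arrays contribute 0 to the sum.
    float psum = (threadIdx.x < N) ? A[threadIdx.x] * B[threadIdx.x] : 0.0f;
    int lane = threadIdx.x % 32;
    int wid = threadIdx.x / 32;
    psum = warp_sum(psum);
    if (lane == 0) shared[wid] = psum;
    __syncthreads();
    psum = (threadIdx.x < blockDim.x / 32) ? shared[lane] : 0.0f;
    if (wid == 0) psum = warp_sum(psum);     // final reduce within the first warp
    if (threadIdx.x == 0) *result = psum;
}

// Host helper: pads the launch to the next multiple of 32 threads.
float dot_on_device(const std::vector<float>& a, const std::vector<float>& b)
{
    int N = (int)a.size();
    assert(N > 0 && N <= 1024 && b.size() == a.size());
    int threads = ((N + 31) / 32) * 32;      // closest multiple of 32 >= N
    float *d_a, *d_b, *d_r, r = 0.0f;
    cudaMalloc(&d_a, N * sizeof(float));
    cudaMalloc(&d_b, N * sizeof(float));
    cudaMalloc(&d_r, sizeof(float));
    cudaMemcpy(d_a, a.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    dot_one_block<<<1, threads>>>(d_a, d_b, N, d_r);
    cudaMemcpy(&r, d_r, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_r);
    return r;
}
```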

Numerical Integration; CUDA development

I need advice on how to proceed and utilize the compute power of CUDA device for numerical integration of a function. Some information about my device is below (irrelevant)
Hardware
Geforce GTX470; Compute Capability 2.0
Problem Description
I have a function like
g(x) = x * f(x, a, b, c)
That I need to integrate as given equation
Now I already have written an integration function, which simply takes g(x), breaks the interval into N sub intervals, computes the result for individual sub interval, and then I sum it up on CPU. For completion purposes I provide below a code example.
__device__ float function(float x, float a, float b, float c) {
// do some complex calculation
return result;
}
__global__ void kernel(float *d_arr, float a, float b, float c, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float x = (float)idx / (float)N;
if (idx < N) {
d_arr[idx] = x * function(x, a, b, c);
}
}
The code above is only for demonstration purposes; I actually use the Romberg method to integrate my g(x), but the idea is the same. My real problem is that I don't have just one set of values (a, b, c); I have many such sets.
I have a 2D array in device memory, precisely (3, 1024) 3 rows, 1024 columns. Each column represent a single set on which an integration function needs to be performed.
The problem arrives when I have to decide whether I shall execute a block of threads such as 1024, keeping in mind that one thread is equivalent to one integration function. In this case the function I wrote above is of no use. Because I want to perform parallel integration for all sets of values, I have to write an integration function, which can do integration sequentially. As an example:
__global__ void kernel(float *d_arr, float a, float b, float c, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float sum = 0;
for (int i = 0; i < N; i++) {
float x = (float)i / (float) N;
sum += x * function(x, a, b, c);
}
d_arr[idx] = sum;
}
So you see my point? Option A seems to be better, but I cannot use it because I don't know how to do multiple integrals and also distribute each integral over N threads.
How would you do it? Can you suggest how I can compute multiple integrals while each integral is still distributed over N threads? Is there a better way to do it?
Looking forward for your advice.
If I understand your problem correctly, you want to do numerical integration with multiple (1024) sets of inputs (a,b,c), and for each integral you need N sub-intervals. Let's call the number of sets of inputs M.
If N is large enough (let's say > 10000) the first kernel sample you pasted could be good enough (invoking it M times for different set of inputs). Whether or not it utilizes all available device throughput depends on how complex your function is.
I didn't get what exactly you do with the d_arr[] array? Normally for numerical integration you would want to sum it. Right? Are you summing up the results on CPU? Consider using atomicAdd (esp. if you are going to run your kernel on compute cap 3.0 and above gpus) or a parallel scan if you find atomicAdd not fast enough.
If N is small, it's better to launch N*M threads in a single kernel.
In your case, as M=1024, you can have every thread in a block process one set of inputs (i.e., set blockSize = 1024) while each block handles one sub-interval, and pass the (a,b,c) inputs as arrays to the kernel - something like this:
__global__ void kernel(float *d_arr, float *a_array, float *b_array, float *c_array, int totalThreads, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float x = (float) blockIdx.x / (float) N;
float a = a_array[threadIdx.x];
float b = b_array[threadIdx.x];
float c = c_array[threadIdx.x];
if (idx < totalThreads) {
// what happen to this array?
d_arr[idx] = x * function(x, a, b, c);
}
}
Again, you would later need to extract elements from d_arr from appropriate positions and sum them up (for each integral).
If your function is not very complex and the above kernel becomes memory bound, you can try the other way round, i.e., having every thread in a block process one sub-interval, with different thread blocks working on different sets of inputs. The kernel would look something like this:
(this example assumes that N <= 1024, but it's possible to break up your kernel to take advantage of this approach even if it's not)
__global__ void kernel(float *d_arr, float *a_array, float *b_array, float *c_array, int totalThreads) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float x = (float)threadIdx.x / (float) blockDim.x; // N = blockDim.x
float a = a_array[blockIdx.x]; // every thread in block accesses same memory location
float b = b_array[blockIdx.x];
float c = c_array[blockIdx.x];
// d_arr has 'M' elements containing the integral for each input set.
if (idx < totalThreads)
{
atomicAdd(&d_arr[blockIdx.x], x * function(x, a, b, c));
}
}
For the above kernel, allocate a_array, b_array and c_array in constant memory. This will be faster because every thread in a block accesses the same location.
As an example, I have also replaced your d_arr writes with an atomicAdd.
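If the atomicAdd turns out to be a bottleneck, one alternative is to let each block sum its own integral in shared memory. The following is only a sketch under several assumptions: the integrand f is a hypothetical polynomial stand-in for the real function, the interval is [0, 1], the midpoint rule replaces Romberg, and THREADS, integrate_sets and integrate_on_device are illustrative names.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define THREADS 256  // assumed block size (power of two)

// Hypothetical stand-in for the real integrand f(x, a, b, c).
__device__ float f(float x, float a, float b, float c) { return a * x * x + b * x + c; }

// One block per parameter set; threads stride over the N sub-intervals of
// [0, 1] (midpoint rule) and the block sums its partial results in shared memory.
__global__ void integrate_sets(const float* a, const float* b, const float* c,
                               int N, float* out)
{
    __shared__ float partial[THREADS];
    float pa = a[blockIdx.x], pb = b[blockIdx.x], pc = c[blockIdx.x];
    float dx = 1.0f / N, sum = 0.0f;
    for (int i = threadIdx.x; i < N; i += blockDim.x) {
        float x = (i + 0.5f) * dx;
        sum += x * f(x, pa, pb, pc) * dx;    // g(x) = x * f(x, a, b, c)
    }
    partial[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = partial[0];
}

// Host helper: returns one integral per (a, b, c) set.
std::vector<float> integrate_on_device(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       const std::vector<float>& c, int N)
{
    int M = (int)a.size();
    assert(M > 0 && N > 0 && b.size() == a.size() && c.size() == a.size());
    size_t bytes = M * sizeof(float);
    float *d_a, *d_b, *d_c, *d_out;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes); cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_c, c.data(), bytes, cudaMemcpyHostToDevice);
    integrate_sets<<<M, THREADS>>>(d_a, d_b, d_c, N, d_out);
    std::vector<float> out(M);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); cudaFree(d_out);
    return out;
}
```

Unlike the atomicAdd version, this does not require d_arr to be zeroed beforehand, and N may exceed the block size because each thread strides over several sub-intervals.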

How to make sum calculations without using atomic in CUDA

In the below code, how can I calculate sum_array value without using atomicAdd.
Kernel method
__global__ void calculate_sum( int width,
int height,
int *pntrs,
int2 *sum_array )
{
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if ( row >= height || col >= width ) return;
int idx = pntrs[ row * width + col ];
//atomicAdd( &sum_array[ idx ].x, col );
//atomicAdd( &sum_array[ idx ].y, row );
sum_array[ idx ].x += col;
sum_array[ idx ].y += row;
}
Launch Kernel
dim3 dimBlock( 16, 16 );
dim3 dimGrid( ( width + ( dimBlock.x - 1 ) ) / dimBlock.x,
( height + ( dimBlock.y - 1 ) ) / dimBlock.y );
Reduction is a general name for this kind of problems. Look at the presentation for further explanation or use Google for other examples.
The general way to solve this is to compute parallel sums of global memory segments inside the thread blocks and store the partial results in global memory. Afterwards, copy the partial results to CPU memory, sum them on the CPU, and copy the result back to GPU memory. You can avoid the memory copies by running another parallel sum over the partial results.
Another approach is to use highly optimized libraries for CUDA such as Thrust or CUDPP which contain functions doing the stuff.
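For the per-label sums in the question, a Thrust-based version could look roughly like the sketch below: sort the values by label, then reduce each run of equal labels. It handles one component (for example the column sums); the row sums would be a second call. The function name sum_by_label and the assumption that labels lie in [0, num_labels) are mine, not the question's.

```cuda
#include <cassert>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// Sums `values` per label without atomics: sort by label, then reduce runs
// of equal labels. Returns the sums indexed by label.
std::vector<int> sum_by_label(const std::vector<int>& labels,
                              const std::vector<int>& values, int num_labels)
{
    assert(labels.size() == values.size());
    thrust::device_vector<int> keys(labels.begin(), labels.end());
    thrust::device_vector<int> vals(values.begin(), values.end());
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

    thrust::device_vector<int> out_keys(keys.size()), out_sums(keys.size());
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                      out_keys.begin(), out_sums.begin());
    int runs = (int)(ends.first - out_keys.begin());

    std::vector<int> result(num_labels, 0);
    for (int r = 0; r < runs; ++r) {
        int k = out_keys[r];   // each element access copies device -> host
        int s = out_sums[r];
        result[k] = s;
    }
    return result;
}
```

In the question's terms, labels would hold pntrs[row * width + col] and values would hold col (or row) for every pixel. The sort adds work compared with atomics, so whether this is faster depends on the data and is worth measuring.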
My Cuda is very very rusty, but this is roughly how you do it (courtesy of "Cuda by Example", which I would strongly suggest you to read):
https://developer.nvidia.com/content/cuda-example-introduction-general-purpose-gpu-programming-0
Do a better partitioning of the array you need to sum: threads in CUDA are lightweight, but not so much that you can spawn one for just two sums and hope to get any performance benefit in return.
At this point each thread will be tasked to sum over a slice of your data: create an array of shared int as big as the number of your threads, where each thread will save the partial sum it computed.
Synchronize the threads and reduce the shared memory array:
(please, take it as pseudocode)
// Code to sum over a slice, essentially a loop over each thread subset
// and accumulate over "localsum" (a local variable)
...
// Save the result in the shared memory
partial[threadidx] = localsum;
// Synchronize the threads:
__syncthreads();
// From now on partial is filled with the result of all computations: you can reduce partial
// we'll do it the illiterate way, using a single thread (it can be easily parallelized)
if(threadidx == 0) {
for(i = 1; i < nthreads; ++i) {
partial[0] += partial[i];
}
}
and off you go: partial[0] will hold your sum (or computation).
See the dot product example in "CUDA by example" for a more rigorous discussion of the topic and a reduction algorithm that runs in about O(log(n)).
Hope this helps
