OpenCL/CUDA: Two-stage reduction algorithm

Reduction of large arrays can be done by calling __reduce(); multiple times. The following code, however, uses only two stages and is documented here:
However, I am unable to understand the algorithm for this two-stage reduction. Can someone give a simpler explanation?
__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {

  int global_index = get_global_id(0);
  float accumulator = INFINITY;

  // Loop sequentially over chunks of input vector
  while (global_index < length) {
    float element = buffer[global_index];
    accumulator = (accumulator < element) ? accumulator : element;
    global_index += get_global_size(0);
  }

  // Perform parallel reduction
  int local_index = get_local_id(0);
  scratch[local_index] = accumulator;
  barrier(CLK_LOCAL_MEM_FENCE);
  for (int offset = get_local_size(0) / 2; offset > 0; offset = offset / 2) {
    if (local_index < offset) {
      float other = scratch[local_index + offset];
      float mine = scratch[local_index];
      scratch[local_index] = (mine < other) ? mine : other;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  if (local_index == 0) {
    result[get_group_id(0)] = scratch[0];
  }
}
The same approach can also be implemented in CUDA.

You create N threads. The first thread looks at values at positions 0, N, 2*N, ... The second thread looks at values 1, N+1, 2*N+1, ... That's the first loop. It reduces length values into N values.
Then each thread saves its smallest value in shared/local memory. Then you have a synchronization instruction (barrier(CLK_LOCAL_MEM_FENCE)). Then you have a standard reduction in shared/local memory. When you're done, the thread with local id 0 saves its result in the output array.
All in all, you have a reduction from length to N/get_local_size(0) values. You'd need to do one last pass after this code is done executing. However, this gets most of the job done, for example, you might have length ~ 10^8, N = 2^16, get_local_size(0) = 256 = 2^8, and this code reduces 10^8 elements into 256 elements.
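For completeness, a minimal sketch of that last pass done on the host (illustrative only, not part of the original code; it assumes queue is the existing cl_command_queue, result_buf is the cl_mem bound to the kernel's result argument, and num_groups is the number of work-groups launched):
#include <CL/cl.h>
#include <cmath>

// Hypothetical final pass: read back the per-work-group minima and finish on the CPU.
float final_min_pass(cl_command_queue queue, cl_mem result_buf, size_t num_groups)
{
    float *partial = new float[num_groups];
    clEnqueueReadBuffer(queue, result_buf, CL_TRUE, 0,
                        num_groups * sizeof(float), partial, 0, NULL, NULL);
    float global_min = INFINITY;
    for (size_t i = 0; i < num_groups; ++i)
        if (partial[i] < global_min) global_min = partial[i];
    delete[] partial;
    return global_min;
}
Alternatively, the same reduce kernel can simply be enqueued a second time over the partial results with a single work-group.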
Which parts do you not understand?

Warp shuffling to reduce arrays of any length

I am working on a CUDA kernel which performs a vector dot product (A x B). I assumed that the length of each vector is a multiple of 32 (32, 64, ...) and defined the block size to be equal to the length of the array. Each thread in the block multiplies one element of A by the corresponding element of B (thread i ==> psum = A[i] x B[i]). After the multiplication, I used the following functions, which use the warp-shuffling technique to perform the reduction and calculate the sum of all the multiplications.
__inline__ __device__
float warpReduceSum(float val) {
  int warpSize = 32;
  for (int offset = warpSize / 2; offset > 0; offset /= 2)
    val += __shfl_down(val, offset);
  return val;
}

__inline__ __device__
float blockReduceSum(float val) {
  static __shared__ float shared[32]; // Shared mem for 32 partial sums (float, to match val)
  int lane = threadIdx.x % warpSize;
  int wid  = threadIdx.x / warpSize;

  val = warpReduceSum(val);     // Each warp performs partial reduction

  if (lane == 0)
    shared[wid] = val;          // Write reduced value to shared memory

  __syncthreads();              // Wait for all partial reductions

  // Read from shared memory only if that warp existed
  val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0;

  if (wid == 0)
    val = warpReduceSum(val);   // Final reduce within first warp

  return val;
}
I simply call blockReduceSum(psum), where psum is the product of two elements computed by a thread.
This approach doesn't work when the length of the array is not a multiple of 32, so my question is: can we change this code so that it also works for any length? Or is it impossible because, if the length of the array is not a multiple of 32, some warps would have elements belonging to more than one array?
First of all, depending on the GPU you are using, performing a dot product with just one block will probably not be very efficient (as long as you are not batching several dot products in one kernel, each done by a single block).
To answer your question: you can reuse the code you have written by calling your kernel with the number of threads being the closest multiple of 32 higher than N (the length of the array) and introducing an if statement before the call to blockReduceSum, like this:
__global__ void kernel(float * A, float * B, int N) {
    float psum = 0;
    if (threadIdx.x < N) // threadIdx.x because you are using a single block; change it to a more general id once you move to multiple blocks
        psum = A[threadIdx.x] * B[threadIdx.x];
    blockReduceSum(psum);
    // The rest of the computation
}
That way, threads that do not have an array element associated with them, but that need to be there due to the use of __shfl, will contribute 0 to the sum.
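For illustration, a minimal launch sketch under the same single-block assumption (d_A and d_B are device pointers and the rounding expression is my illustration, not code from the question):
// Round the thread count up to the next multiple of 32 so every warp is full;
// threads with threadIdx.x >= N contribute 0, as described above.
// Assumes N <= 1024 so a single block suffices.
int threads = ((N + 31) / 32) * 32;
kernel<<<1, threads>>>(d_A, d_B, N);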

Improving the Efficiency of Compact/Scatter in CUDA

Summary:
Any ideas about how to further improve upon the basic scatter operation in CUDA, especially if one knows it will only be used to compact a larger array into a smaller one? Or why the methods below, vectorizing memory ops and using shared memory, didn't work? I feel like there may be something fundamental I am missing, and any help would be appreciated.
EDIT 03/09/15: So I found this Parallel For All Blog post "Optimized Filtering with Warp-Aggregated Atomics". I had assumed atomics would be intrinsically slower for this purpose, however I was wrong - especially since I don't think I care about maintaining element order in the array during my simulation. I'll have to think about it some more and then implement it to see what happens!
EDIT 01/04/16: I realized I never wrote about my results. Unfortunately, in that Parallel For All blog post they compared the global atomic method for compaction to the Thrust prefix-sum compaction method, which is actually quite slow. CUB's Device::IF is much faster than Thrust's - as is the prefix-sum version I wrote using CUB's Device::Scan + custom code. The warp-aggregate global atomic method is still faster by about 5-10%, but nowhere near the 3-4x speedup I had been hoping for based on the results in the blog. I'm still using the prefix-sum method: while maintaining element order is not necessary, I prefer the consistency of the prefix-sum results, and the advantage from the atomics is not very big. I still try various methods to improve compaction, but so far only marginal improvements (2%) at best for dramatically increased code complexity.
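For readers who haven't seen it, a minimal sketch of the warp-aggregated atomic approach in the spirit of that blog post (my illustration, using the pre-CUDA-9 intrinsics that the rest of this code uses; the predicate is a placeholder and element order is not preserved):
// One atomic per warp instead of one per surviving element.
__device__ int atomicAggInc(int *counter) {              // *counter must be zeroed before launch
    unsigned int mask = __ballot(1);                      // which lanes of this warp got here
    int lane = threadIdx.x & 31;
    int leader = __ffs(mask) - 1;                         // lowest active lane does the atomic
    int base;
    if (lane == leader)
        base = atomicAdd(counter, __popc(mask));          // reserve space for the whole warp
    base = __shfl(base, leader);                          // broadcast the base offset
    return base + __popc(mask & ((1u << lane) - 1));      // rank of this lane among active lanes
}

__global__ void compact_atomic(float *out, const float *in, int n, int *counter) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    for (; i < n; i += blockDim.x*gridDim.x)
        if (in[i] > 0.0f)                                  // placeholder predicate
            out[atomicAggInc(counter)] = in[i];
}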
Details:
I am writing a simulation in CUDA where I compact out elements I am no longer interested in simulating every 40-60 time steps. From profiling, it seems that the scatter op takes the most time when compacting - more so than the filter kernel or the prefix sum. Right now I use a pretty basic scatter function:
__global__ void scatter_arrays(float * new_freq, const float * const freq, const int * const flag, const int * const scan_Index, const int freq_Index){
    int myID = blockIdx.x*blockDim.x + threadIdx.x;
    for(int id = myID; id < freq_Index; id += blockDim.x*gridDim.x){
        if(flag[id]){
            new_freq[scan_Index[id]] = freq[id];
        }
    }
}
freq_Index is the number of elements in the old array. The flag array is the result from the filter. scan_Index is the result of the prefix sum over the flag array.
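For context, a minimal sketch of the kind of filter kernel that produces the flag array (illustrative only; keep() stands in for the simulation's real predicate):
// Illustrative filter: mark elements that should survive compaction.
__device__ bool keep(float x) { return x > 0.0f; } // placeholder predicate

__global__ void filter_arrays(int * flag, const float * const freq, const int freq_Index){
    int myID = blockIdx.x*blockDim.x + threadIdx.x;
    for(int id = myID; id < freq_Index; id += blockDim.x*gridDim.x){
        flag[id] = keep(freq[id]) ? 1 : 0;
    }
}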
Attempts I've made to improve it are to read the flagged frequencies into shared memory first and then write from shared memory to global memory - the idea being that the writes to global memory would be more coalesced amongst the warps (e.g. instead of thread 0 writing to position 0 and thread 128 writing to position 1, thread 0 would write to 0 and thread 1 would write to 1). I also tried vectorizing the reads and the writes - instead of reading and writing floats/ints I read/wrote float4/int4 from the global arrays when possible, so four numbers at a time. This I thought might speed up the scatter by having fewer memory ops transferring larger amounts of memory. The "kitchen sink" code with both vectorized memory loads/stores and shared memory is below:
const int compact_threads = 256;

__global__ void scatter_arrays2(float * new_freq, const float * const freq, const int * const flag, const int * const scan_Index, const int freq_Index){
    int gID = blockIdx.x*blockDim.x + threadIdx.x; //global ID
    int tID = threadIdx.x; //thread ID within block
    __shared__ float row[4*compact_threads];
    __shared__ int start_index[1];
    __shared__ int end_index[1];
    float4 myResult;
    int st_index;
    int4 myFlag;
    int4 index;

    for(int id = gID; id < freq_Index/4; id += blockDim.x*gridDim.x){
        if(tID == 0){
            index = reinterpret_cast<const int4*>(scan_Index)[id];
            myFlag = reinterpret_cast<const int4*>(flag)[id];
            start_index[0] = index.x;
            st_index = index.x;
            myResult = reinterpret_cast<const float4*>(freq)[id];
            if(myFlag.x){ row[0] = myResult.x; }
            if(myFlag.y){ row[index.y-st_index] = myResult.y; }
            if(myFlag.z){ row[index.z-st_index] = myResult.z; }
            if(myFlag.w){ row[index.w-st_index] = myResult.w; }
        }
        __syncthreads();
        if(tID > 0){
            myFlag = reinterpret_cast<const int4*>(flag)[id];
            st_index = start_index[0];
            index = reinterpret_cast<const int4*>(scan_Index)[id];
            myResult = reinterpret_cast<const float4*>(freq)[id];
            if(myFlag.x){ row[index.x-st_index] = myResult.x; }
            if(myFlag.y){ row[index.y-st_index] = myResult.y; }
            if(myFlag.z){ row[index.z-st_index] = myResult.z; }
            if(myFlag.w){ row[index.w-st_index] = myResult.w; }
            if(tID == blockDim.x - 1 || gID == freq_Index/4 - 1){ end_index[0] = index.w + myFlag.w; }
        }
        __syncthreads();

        int count = end_index[0] - st_index;
        int rem = st_index & 0x3; //equivalent to modulo 4
        int offset = 0;
        if(rem){ offset = 4 - rem; }
        if(tID < offset && tID < count){
            new_freq[st_index+tID] = row[tID];
        }
        int tempID = 4*tID+offset;
        if((tempID+3) < count){
            reinterpret_cast<float4*>(new_freq)[tID] = make_float4(row[tempID],row[tempID+1],row[tempID+2],row[tempID+3]);
        }
        tempID = tID + offset + (count-offset)/4*4;
        if(tempID < count){ new_freq[st_index+tempID] = row[tempID]; }
    }

    int id = gID + freq_Index/4 * 4;
    if(id < freq_Index){
        if(flag[id]){
            new_freq[scan_Index[id]] = freq[id];
        }
    }
}
Obviously it gets a bit more complicated. :) While the above kernel seems stable when there are hundreds of thousands of elements in the array, I've noticed a race condition when the array runs into the tens of millions of elements. I'm still trying to track the bug down.
But regardless, neither method (shared memory or vectorization), together or alone, improved performance. I was especially surprised by the lack of benefit from vectorizing the memory ops. It had helped in other functions I had written, though now I am wondering if it helped there because it increased instruction-level parallelism in the calculation steps of those other functions rather than because of the fewer memory ops.
I found the algorithm mentioned in this poster (a similar algorithm is also discussed in this paper) works pretty well, especially for compacting large arrays. It uses less memory and is slightly faster than my previous method (5-10%). I put in a few tweaks to the poster's algorithm: 1) eliminating the final warp-shuffle reduction in phase 1, since the elements can simply be summed as they are calculated, 2) giving the function the ability to work over arrays that are not sized as a multiple of 1024 and adding grid-strided loops, and 3) allowing each thread to load its registers simultaneously in phase 3 instead of one at a time. I also use CUB instead of Thrust for the inclusive sum, for faster scans. There may be more tweaks I can make, but for now this is good.
//kernel phase 1
int myID = blockIdx.x*blockDim.x + threadIdx.x;
//padded_length is nearest multiple of 1024 > true_length
for(int id = myID; id < (padded_length >> 5); id += blockDim.x*gridDim.x){
    int lnID = threadIdx.x % warp_size;
    int warpID = id >> 5;

    unsigned int mask;
    unsigned int cnt = 0;

    for(int j = 0; j < 32; j++){
        int index = (warpID<<10)+(j<<5)+lnID;
        bool pred;
        if(index >= true_length) pred = false;
        else pred = predicate(input[index]);
        mask = __ballot(pred);

        if(lnID == 0) {
            flag[(warpID<<5)+j] = mask;
            cnt += __popc(mask);
        }
    }

    if(lnID == 0) counter[warpID] = cnt; //store sum
}
//kernel phase 2 -> CUB Inclusive sum transforms counter array to scan_Index array
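// Illustrative phase 2 on the host (my sketch, not the original code): CUB's usual
// two-call pattern turns counter into scan_Index. There is one counter entry per
// 1024-element chunk, so num_items = padded_length >> 10.
//     void  *d_temp_storage = NULL;  size_t temp_storage_bytes = 0;
//     cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, counter, scan_Index, padded_length >> 10);
//     cudaMalloc(&d_temp_storage, temp_storage_bytes);
//     cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, counter, scan_Index, padded_length >> 10);
//     cudaFree(d_temp_storage);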
//kernel phase 3
int myID = blockIdx.x*blockDim.x + threadIdx.x;
for(int id = myID; id < (padded_length >> 5); id += blockDim.x*gridDim.x){
    int lnID = threadIdx.x % warp_size;
    int warpID = id >> 5;

    unsigned int predmask;
    unsigned int cnt;

    predmask = flag[(warpID<<5)+lnID];
    cnt = __popc(predmask);

    //parallel prefix sum
    #pragma unroll
    for(int offset = 1; offset < 32; offset<<=1){
        unsigned int n = __shfl_up(cnt, offset);
        if(lnID >= offset) cnt += n;
    }

    unsigned int global_index = 0;
    if(warpID > 0) global_index = scan_Index[warpID - 1];

    for(int i = 0; i < 32; i++){
        unsigned int mask = __shfl(predmask, i); //broadcast from thread i
        unsigned int sub_group_index = 0;
        if(i > 0) sub_group_index = __shfl(cnt, i-1);
        if(mask & (1 << lnID)){
            compacted_array[global_index + sub_group_index + __popc(mask & ((1 << lnID) - 1))] = input[(warpID<<10)+(i<<5)+lnID];
        }
    }
}
EDIT: There is a newer article by a subset of the poster's authors in which they examine a faster variation of compaction than what is written above. However, their new version is not order-preserving, so it is not useful for me, and I haven't implemented it to test it out. That said, if your project doesn't rely on object order, their newer compaction version can probably speed up your algorithm.

Optimising Matrix Multiplication OpenCL - Purpose: learn how to manage memory

I'm new to OpenCL and trying to understand how to optimise matrix multiplication to become familiar with the various paradigms. Here's the current code.
If I'm multiplying matrices A and B, I allocate a row of A in private memory to start with (because each work item uses it), and a column of B in local memory (because each work group uses it).
1) The code is currently incorrect; unfortunately, I'm struggling with how to use local work ids correctly and I can't find my mistake. I'm basing myself on http://www.cs.bris.ac.uk/home/simonm/workshops/OpenCL_lecture3.pdf (slide 27), but it seems that this is wrong as well, since they don't make use of loc_size in their internal loop.
2) Are there any other optimisations you would suggest with this code?
__kernel void mmul(
    __global int* C,
    __global int* A,
    __global int* B,
    const int rA,
    const int rB,
    const int cC,
    __local char* local_mem)
{
    int k, ty;
    int tx = get_global_id(0);
    int loctx = get_local_id(0);
    int loc_size = get_local_size(0);
    int value = 0;
    int tmp_array[1000];

    for(k = 0; k < rB; k++) {
        tmp_array[k] = A[tx * rB + k];
    }

    for(ty = 0; ty < cC; ty++) {
        for(k = loctx; k < rB; k += loc_size) {
            local_mem[k] = B[ty + k * cC];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        value = 0;
        for(k = 0; k < rB; k += 1) {
            int i = loctx + k*loc_size;
            value += tmp_array[k] * local_mem[i];
        }
        C[ty + (tx * cC)] = value;
    }
}
where I set the global and local work items as follows
const size_t globalWorkItems[1] = {result_row};
const size_t localWorkItems[1] = {(size_t)local_wi_size};
local_wi_size is result_row/number of compute units (such that result_row % compute units == 0)
Your code is pretty close, but the indexing into the local memory array is actually simpler than you think. You have a row in private memory and a column in local memory, and you need to compute the dot product of these two vectors. You just need to sum row[k]*col[k], for k = 0 up to N-1:
for(k = 0; k < rB; k += 1) {
    value += tmp_array[k] * local_mem[k];
}
There's actually a second, more subtle bug that is also present in the example solution given on the slides you are using. Since you are reading and writing local memory inside a loop, you actually need two barriers, in order to make sure that work-items writing to local memory on iteration i don't overwrite values that are being read by other work-items executing iteration i-1.
Therefore, the full code for your kernel (tested and working), should look something like this:
__kernel void mmul(
    __global int* C,
    __global int* A,
    __global int* B,
    const int rA,
    const int rB,
    const int cC,
    __local char* local_mem)
{
    int k, ty;
    int tx = get_global_id(0);
    int loctx = get_local_id(0);
    int loc_size = get_local_size(0);
    int value = 0;
    int tmp_array[1000];

    for(k = 0; k < rB; k++) {
        tmp_array[k] = A[tx * rB + k];
    }

    for(ty = 0; ty < cC; ty++) {
        for(k = loctx; k < rB; k += loc_size) {
            local_mem[k] = B[ty + k * cC];
        }
        barrier(CLK_LOCAL_MEM_FENCE); // First barrier to ensure writes have finished

        value = 0;
        for(k = 0; k < rB; k += 1) {
            value += tmp_array[k] * local_mem[k];
        }
        C[ty + (tx * cC)] = value;

        barrier(CLK_LOCAL_MEM_FENCE); // Second barrier to ensure reads have finished
    }
}
You can find the full set of exercises and solutions that go with the slides you are looking at on the HandsOnOpenCL GitHub page. There's also a more complete set of slides from the same tutorial available here, which go on to show a much more optimised matrix multiply example that uses a blocking approach to better exploit temporal and spatial locality. The aforementioned missing barrier bug has been fixed in the example solution code, but not on the slides (yet).

TERCOM algorithm - Changing from single thread to multiple threads in CUDA

I'm currently working on porting a TERCOM algorithm from using only one thread to using multiple threads. Briefly explained, the TERCOM algorithm receives 5 measurements and the heading, and compares these measurements to a prestored map. The algorithm chooses the best match, i.e. the lowest Mean Absolute Difference (MAD), and returns the position.
The code works perfectly with one thread and for-loops, but when I try to use multiple threads and blocks it returns the wrong answer. It seems like the multithreaded version doesn't "run through" the calculation in the same way as the single-threaded version. Does anyone know what I am doing wrong?
Here's the code using for-loops
__global__ void kernel (int m, int n, int h, int N, float *f, float heading, float *measurements)
{
    //Without threads
    float pos[2]={0};
    float theta=heading*(PI/180);
    float MAD=0;

    // Calculate how much to move in x and y direction
    float offset_x = h*cos(theta);
    float offset_y = -h*sin(theta);

    float min=100000; //Some High value

    //Calculate Mean Absolute Difference
    for(float row=0;row<m;row++)
    {
        for(float col=0;col<n;col++)
        {
            for(float g=0; g<N; g++)
            {
                f[(int)g] = tex2D (tex, col+(g-2)*offset_x+0.5f, row+(g-2)*offset_y+0.5f);
                MAD += abs(measurements[(int)g]-f[(int)g]);
            }
            if(MAD<min)
            {
                min=MAD;
                pos[0]=col;
                pos[1]=row;
            }
            MAD=0; //Reset MAD
        }
    }

    f[0]=min;
    f[1]=pos[0];
    f[2]=pos[1];
}
This is my attempt to use multiple threads
__global__ void kernel (int m, int n, int h, int N, float *f, float heading, float *measurements)
{
    // With threads
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;

    float pos[2]={0};
    float theta=heading*(PI/180);
    float MAD=0;

    // Calculate how much to move in x and y direction
    float offset_x = h*cos(theta);
    float offset_y = -h*sin(theta);

    float min=100000; //Some High value

    if(idx < n && idy < m)
    {
        for(float g=0; g<N; g++)
        {
            f[(int)g] = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
            MAD += abs(measurements[(int)g]-f[(int)g]);
        }
        if(MAD<min)
        {
            min=MAD;
            pos[0]=idx;
            pos[1]=idy;
        }
        MAD=0; //Reset MAD
    }

    f[0]=min;
    f[1]=pos[0];
    f[2]=pos[1];
}
To launch the kernel
dim3 dimBlock( 16,16 );
dim3 dimGrid;
dimGrid.x = (n + dimBlock.x - 1)/dimBlock.x;
dimGrid.y = (m + dimBlock.y - 1)/dimBlock.y;
kernel <<< dimGrid,dimBlock >>> (m, n, h, N, dev_results, heading, dev_measurements);
The basic problem here is that you have a memory race in the code, centered around the use of f as both some sort of thread local scratch space and an output variable. Every concurrent thread will be trying to write values into the same locations in f simultaneously, which will produce undefined behaviour.
As best as I can tell, the use of f as scratch space isn't even necessary at all and the main computational section of the kernel could be written as something like:
if(idx < n && idy < m)
{
    for(float g=0; g<N; g++)
    {
        float fval = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
        MAD += abs(measurements[(int)g]-fval);
    }
    min=MAD;
    pos[0]=idx;
    pos[1]=idy;
}
[disclaimer: written in browser, use at own risk]
At the end of that calculation, each thread has its own values of min and pos. At a minimum, these must be stored in unique global memory (i.e. the output must have enough space for each thread's result). You will then need to perform some sort of reduction operation to obtain the global minimum from the set of thread-local values. That could be done on the host, in device code, or some combination of the two. There is a lot of code already available for CUDA parallel reductions which you should be able to find by searching and/or looking at the examples supplied with the CUDA toolkit. It should be trivial to adapt them to your specific case, where you need to retain the position along with the minimum value.
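A minimal sketch of that first step, writing per-thread results to unique locations (the array names thread_MAD, thread_pos_x, thread_pos_y and their n*m sizing are mine, not from the question):
// Each thread writes into its own slot, so there is no memory race.
if(idx < n && idy < m)
{
    int tid = idy * n + idx;        // unique output slot for this thread
    thread_MAD[tid]   = MAD;        // this thread's mean absolute difference
    thread_pos_x[tid] = idx;        // remember which map position it came from
    thread_pos_y[tid] = idy;
}
A standard min-reduction over thread_MAD (on the host, or with one of the toolkit reduction examples adapted to also carry the index) then yields the global best match.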

How to absolute 2 double or 4 floats using SSE instruction set? (Up to SSE4)

Here's the sample C code that I am trying to accelerate using SSE. The two arrays are 3072 elements long with doubles; I may drop it down to float if I don't need the precision of doubles.
double sum = 0.0;
for(k = 0; k < 3072; k++) {
    sum += fabs(sima[k] - simb[k]);
}
double fp = (1.0 - (sum / (255.0 * 1024.0 * 3.0)));
Anyway, my current problem is how to do the fabs step in an SSE register for doubles or floats, so that I can keep the whole calculation in SSE registers, keep it fast, and parallelize all of the steps by partly unrolling this loop.
Here are some resources I've found: fabs() asm, or possibly this flipping-the-sign approach (SO); however, the weakness of the second one is that it would need a conditional check.
I suggest using a bitwise AND with a mask. A value and its negation have the same representation except for the most significant (sign) bit, which is 0 for positive values and 1 for negative values; see the double-precision number format. You can use one of these:
inline __m128 abs_ps(__m128 x) {
    static const __m128 sign_mask = _mm_set1_ps(-0.f); // -0.f = 1 << 31
    return _mm_andnot_ps(sign_mask, x);
}

inline __m128d abs_pd(__m128d x) {
    static const __m128d sign_mask = _mm_set1_pd(-0.); // -0. = 1 << 63
    return _mm_andnot_pd(sign_mask, x); // !sign_mask & x
}
Also, it might be a good idea to unroll the loop to break the loop-carried dependency chain. Since this is a sum of nonnegative values, the order of summation is not important:
double norm(const double* sima, const double* simb) {
    __m128d* sima_pd = (__m128d*) sima;
    __m128d* simb_pd = (__m128d*) simb;

    __m128d sum1 = _mm_setzero_pd();
    __m128d sum2 = _mm_setzero_pd();
    for(int k = 0; k < 3072/2; k += 2) {
        sum1 = _mm_add_pd(sum1, abs_pd(_mm_sub_pd(sima_pd[k],   simb_pd[k])));
        sum2 = _mm_add_pd(sum2, abs_pd(_mm_sub_pd(sima_pd[k+1], simb_pd[k+1])));
    }

    __m128d sum = _mm_add_pd(sum1, sum2);
    __m128d hsum = _mm_hadd_pd(sum, sum);
    return _mm_cvtsd_f64(hsum);
}
By unrolling and breaking the dependency (sum1 and sum2 are now independent), you let the processor execute the additions out of order. Since addition is pipelined on a modern CPU, the CPU can start working on a new addition before the previous one has finished. Also, the bitwise operations are executed on a separate execution unit, so the CPU can actually perform them in the same cycle as an addition/subtraction. I suggest reading Agner Fog's optimization manuals.
Finally, I don't recommend using OpenMP. The loop is too small, and the overhead of distributing the job among multiple threads might be bigger than any potential benefit.
The maximum of -x and x should be abs(x). Here it is in code:
x = _mm_max_ps(_mm_sub_ps(_mm_setzero_ps(), x), x)
Probably the easiest way is as follows:
__m128d vsum = _mm_set1_pd(0.0);                             // init partial sums
for (k = 0; k < 3072; k += 2)
{
    __m128d va = _mm_load_pd(&sima[k]);                      // load 2 doubles from sima, simb
    __m128d vb = _mm_load_pd(&simb[k]);
    __m128d vdiff = _mm_sub_pd(va, vb);                      // calc diff = sima - simb
    __m128d vnegdiff = _mm_sub_pd(_mm_set1_pd(0.0), vdiff);  // calc neg diff = 0.0 - diff
    __m128d vabsdiff = _mm_max_pd(vdiff, vnegdiff);          // calc abs diff = max(diff, -diff)
    vsum = _mm_add_pd(vsum, vabsdiff);                       // accumulate two partial sums
}
Note that this may not be any faster than scalar code on modern x86 CPUs, which typically have two FPUs anyway. However if you can drop down to single precision then you may well get a 2x throughput improvement.
Note also that you will need to combine the two partial sums in vsum into a scalar value after the loop, but this is fairly trivial to do and is not performance-critical.
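For example, one way to do that final combine (a sketch; _mm_hadd_pd requires SSE3, otherwise an unpack/shuffle plus _mm_add_pd works as well):
__m128d vtmp = _mm_hadd_pd(vsum, vsum);   // low lane = sum of vsum's two lanes
double sum = _mm_cvtsd_f64(vtmp);         // extract the low lane as a scalar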
