replacing blockId with a counter - parallel-processing

Initially I used blockIdx.x in my code but I want to remove that and instead have a global value and use that in my block instead of blockidx.x. Since my code is too large and it hangs when I run it with large input sizes I thought think can help. I increment the counter atomically but when I run the code it hangs. Can anyone take a look at my code and see if I'm doing something wrong?
__device__ int counter = 0;
__global__ void kernel(int * ginput, int * goutput)
{
const int tid = threadIdx.x;
const int id = threadIdx.x + blockIdx.x * blockDim.x;
in myval = ginput[id];
if (tid == 0) {
atomicAdd(&counter, 1);
}
__syncthreads();
if (counter == 0) {
goutput[tid] = ...;
}
if (counter > 0) {
...
}
}
If I use blockIdx.x instead of counter in my code it works but I just want to replace it with the counter

If you want the counter to replace your usage of blockIdx.x (i.e. you want each block to have a unique value that it reads from counter), then something like this should work:
__device__ int counter = 0;
__global__ void kernel(int * ginput, int * goutput)
{
const int tid = threadIdx.x;
const int id = threadIdx.x + blockIdx.x * blockDim.x;
__shared__ int my_block_id;
if (tid == 0) {
my_block_id = atomicAdd(&counter, 1);
}
__syncthreads();
if (my_block_id == 0) {
goutput[tid] = ...;
}
if (my_block_id > 0) {
...
}
}
Your method is going to be troublesome, because if you do something like this:
if (counter > 5) ....
you are potentially reading a new updated value of counter from global memory, and any number of blocks may have updated that value, so the behavior will be unpredictable.

Related

CUDA Initialize Array on Device

I am very new to CUDA and I am trying to initialize an array on the device and return the result back to the host to print out to show if it was correctly initialized. I am doing this because the end goal is a dot product solution in which I multiply two arrays together, storing the results in another array and then summing up the entire thing so that I only need to return the host one value.
In the code I am working on all I am only trying to see if I am initializing the array correctly. I am trying to create an array of size N following the patterns of 1,2,3,4,5,6,7,8,1,2,3....
This is the code that I've written and it compiles without issue but when I run it the terminal is hanging and I have no clue why. Could someone help me out here? I'm so incredibly confused :\
#include <stdio.h>
#include <stdlib.h>
#include <chrono>
#define ARRAY_SIZE 100
#define BLOCK_SIZE 32
__global__ void cu_kernel (int *a_d,int *b_d,int *c_d, int size)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
__shared__ int temp;
if(temp != 8){
a_d[x] = temp;
temp++;
} else {
a_d[x] = temp;
temp = 1;
}
}
int main (int argc, char *argv[])
{
//declare pointers for arrays
int *a_d, *b_d, *c_d, *sum_h, *sum_d,a_h[ARRAY_SIZE];
//set space for device variables
cudaMalloc((void**) &a_d, sizeof(int) * ARRAY_SIZE);
cudaMalloc((void**) &b_d, sizeof(int) * ARRAY_SIZE);
cudaMalloc((void**) &c_d, sizeof(int) * ARRAY_SIZE);
cudaMalloc((void**) &sum_d, sizeof(int));
// set execution configuration
dim3 dimblock (BLOCK_SIZE);
dim3 dimgrid (ARRAY_SIZE/BLOCK_SIZE);
// actual computation: call the kernel
cu_kernel <<<dimgrid, dimblock>>> (a_d,b_d,c_d,ARRAY_SIZE);
cudaError_t result;
// transfer results back to host
result = cudaMemcpy (a_h, a_d, sizeof(int) * ARRAY_SIZE, cudaMemcpyDeviceToHost);
if (result != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed.");
exit(1);
}
// print reversed array
printf ("Final state of the array:\n");
for (int i =0; i < ARRAY_SIZE; i++) {
printf ("%d ", a_h[i]);
}
printf ("\n");
}
There are at least 3 issues with your kernel code.
you are using shared memory variable temp without initializing it.
you are not resolving the order in which threads access a shared variable as discussed here.
you are imagining (perhaps) a particular order of thread execution, and CUDA provides no guarantees in that area
The first item seems self-evident, however naive methods to initialize it in a multi-threaded environment like CUDA are not going to work. Firstly we have the multi-threaded access pattern, again, Furthermore, in a multi-block scenario, shared memory in one block is logically distinct from shared memory in another block.
Rather than wrestle with mechanisms unsuited to create the pattern you desire, (informed by notions carried over from a serial processing environment), I would simply do something trivial like this to create the pattern you desire:
__global__ void cu_kernel (int *a_d,int *b_d,int *c_d, int size)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
if (x < size) a_d[x] = (x&7) + 1;
}
Are there other ways to do it? certainly.
__global__ void cu_kernel (int *a_d,int *b_d,int *c_d, int size)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
__shared__ int temp;
if (!threadIdx.x) temp = blockIdx.x*blockDim.x;
__syncthreads();
if (x < size) a_d[x] = ((temp+threadIdx.x) & 7) + 1;
}
You can get as fancy as you like.
These changes will still leave a few values at zero at the end of the array, which would require changes to your grid sizing. There are many questions about this already, or study a sample code like vectorAdd.

CUDA string search in large file, wrong result

I am working on simple naive string search in CUDA.
I am new in CUDA. It works fine fol smaller files ( aprox. ~1MB ). After I make these files bigger ( ctrl+a ctrl+c several times in notepad++ ), my program's results are higher ( about +1% ) than a
grep -o text file_name | wc -l
It is very simple function, so I don't know what could cause this. I need it to work with larger files ( ~500MB ).
Kernel code ( gpuCount is a __device__ int global variable ):
__global__ void stringSearchGpu(char *data, int dataLength, char *input, int inputLength){
int id = blockDim.x*blockIdx.x + threadIdx.x;
if (id < dataLength)
{
int fMatch = 1;
for (int j = 0; j < inputLength; j++)
{
if (data[id + j] != input[j]) fMatch = 0;
}
if (fMatch)
{
atomicAdd(&gpuCount, 1);
}
}
}
This is calling the kernel in main function:
int blocks = 1, threads = fileSize;
if (fileSize > 1024)
{
blocks = (fileSize / 1024) + 1;
threads = 1024;
}
clock_t cpu_start = clock();
// kernel call
stringSearchGpu<<<blocks, threads>>>(cudaBuffer, strlen(buffer), cudaInput, strlen(input));
cudaDeviceSynchronize();
After this I just copy the result to Host and print it.
Can anyone please help me with this?
First of all, you should always check return values of CUDA functions to check for errors. Best way to do so would be the following:
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
Wrap your CUDA calls, such as:
gpuErrchk(cudaDeviceSynchronize());
Second, your kernel accesses out of bounds memory. Suppose, dataLength=100, inputLength=7 and id=98. In your kernel code:
if (id < dataLength) // 98 is less than 100, so condition true
{
int fMatch = 1;
for (int j = 0; j < inputLength; j++) // j runs from [0 - 6]
{
// if j>1 then id+j>=100, which is out of bounds, illegal operation
if (data[id + j] != input[j]) fMatch = 0;
}
Change the condition to something like:
if (id < dataLength - inputLength)

Improving the Efficiency of Compact/Scatter in CUDA

Summary:
Any ideas about how to further improve upon the basic scatter operation in CUDA? Especially if one knows it will only be used to compact a larger array into a smaller one? or why the below methods of vectorizing memory ops and shared memory didn't work? I feel like there may be something fundamental I am missing and any help would be appreciated.
EDIT 03/09/15: So I found this Parallel For All Blog post "Optimized Filtering with Warp-Aggregated Atomics". I had assumed atomics would be intrinsically slower for this purpose, however I was wrong - especially since I don't think I care about maintaining element order in the array during my simulation. I'll have to think about it some more and then implement it to see what happens!
EDIT 01/04/16: I realized I never wrote about my results. Unfortunately in that Parallel for All Blog post they compared the global atomic method for compact to the Thrust prefix-sum compact method, which is actually quite slow. CUB's Device::IF is much faster than Thrust's - as is the prefix-sum version I wrote using CUB's Device::Scan + custom code. The warp-aggregrate global atomic method is still faster by about 5-10%, but nowhere near the 3-4x faster I had been hoping for based on the results in the blog. I'm still using the prefix-sum method as while maintaining element order is not necessary, I prefer the consistency of the prefix-sum results and the advantage from the atomics is not very big. I still try various methods to improve compact, but so far only marginal improvements (2%) at best for dramatically increased code complexity.
Details:
I am writing a simulation in CUDA where I compact out elements I am no longer interested in simulating every 40-60 time steps. From profiling it seems that the scatter op takes up the most amount of time when compacting - more so than the filter kernel or the prefix sum. Right now I use a pretty basic scatter function:
__global__ void scatter_arrays(float * new_freq, const float * const freq, const int * const flag, const int * const scan_Index, const int freq_Index){
int myID = blockIdx.x*blockDim.x + threadIdx.x;
for(int id = myID; id < freq_Index; id+= blockDim.x*gridDim.x){
if(flag[id]){
new_freq[scan_Index[id]] = freq[id];
}
}
}
freq_Index is the number of elements in the old array. The flag array is the result from the filter. Scan_ID is the result from the prefix sum on the flag array.
Attempts I've made to improve it are to read the flagged frequencies into shared memory first and then write from shared memory to global memory - the idea being that the writes to global memory would be more coalesced amongst the warps (e.g. instead of thread 0 writing to position 0 and thread 128 writing to position 1, thread 0 would write to 0 and thread 1 would write to 1). I also tried vectorizing the reads and the writes - instead of reading and writing floats/ints I read/wrote float4/int4 from the global arrays when possible, so four numbers at a time. This I thought might speed up the scatter by having fewer memory ops transferring larger amounts of memory. The "kitchen sink" code with both vectorized memory loads/stores and shared memory is below:
const int compact_threads = 256;
__global__ void scatter_arrays2(float * new_freq, const float * const freq, const int * const flag, const int * const scan_Index, const int freq_Index){
int gID = blockIdx.x*blockDim.x + threadIdx.x; //global ID
int tID = threadIdx.x; //thread ID within block
__shared__ float row[4*compact_threads];
__shared__ int start_index[1];
__shared__ int end_index[1];
float4 myResult;
int st_index;
int4 myFlag;
int4 index;
for(int id = gID; id < freq_Index/4; id+= blockDim.x*gridDim.x){
if(tID == 0){
index = reinterpret_cast<const int4*>(scan_Index)[id];
myFlag = reinterpret_cast<const int4*>(flag)[id];
start_index[0] = index.x;
st_index = index.x;
myResult = reinterpret_cast<const float4*>(freq)[id];
if(myFlag.x){ row[0] = myResult.x; }
if(myFlag.y){ row[index.y-st_index] = myResult.y; }
if(myFlag.z){ row[index.z-st_index] = myResult.z; }
if(myFlag.w){ row[index.w-st_index] = myResult.w; }
}
__syncthreads();
if(tID > 0){
myFlag = reinterpret_cast<const int4*>(flag)[id];
st_index = start_index[0];
index = reinterpret_cast<const int4*>(scan_Index)[id];
myResult = reinterpret_cast<const float4*>(freq)[id];
if(myFlag.x){ row[index.x-st_index] = myResult.x; }
if(myFlag.y){ row[index.y-st_index] = myResult.y; }
if(myFlag.z){ row[index.z-st_index] = myResult.z; }
if(myFlag.w){ row[index.w-st_index] = myResult.w; }
if(tID == blockDim.x -1 || gID == mutations_Index/4 - 1){ end_index[0] = index.w + myFlag.w; }
}
__syncthreads();
int count = end_index[0] - st_index;
int rem = st_index & 0x3; //equivalent to modulo 4
int offset = 0;
if(rem){ offset = 4 - rem; }
if(tID < offset && tID < count){
new_mutations_freq[population*new_array_Length+st_index+tID] = row[tID];
}
int tempID = 4*tID+offset;
if((tempID+3) < count){
reinterpret_cast<float4*>(new_freq)[tID] = make_float4(row[tempID],row[tempID+1],row[tempID+2],row[tempID+3]);
}
tempID = tID + offset + (count-offset)/4*4;
if(tempID < count){ new_freq[st_index+tempID] = row[tempID]; }
}
int id = gID + freq_Index/4 * 4;
if(id < freq_Index){
if(flag[id]){
new_freq[scan_Index[id]] = freq[id];
}
}
}
Obviously it gets a bit more complicated. :) While the above kernel seems stable when there are hundreds of thousands of elements in the array, I've noticed a race condition when the array numbers in the tens of millions. I'm still trying to track the bug down.
But regardless, neither method (shared memory or vectorization) together or alone improved performance. I was especially surprised by the lack of benefit from vectorizing the memory ops. It had helped in other functions I had written, though now I am wondering if maybe it helped because it increased Instruction-Level-Parallelism in the calculation steps of those other functions rather than the fewer memory ops.
I found the algorithm mentioned in this poster (similar algorithm also discussed in this paper) works pretty well, especially for compacting large arrays. It uses less memory to do it and is slightly faster than my previous method (5-10%). I put in a few tweaks to the poster's algorithm: 1) eliminating the final warp shuffle reduction in phase 1, can simply sum the elements as they are calculated, 2) giving the function the ability to work over more than just arrays sized as a multiple of 1024 + adding grid-strided loops, and 3) allowing each thread to load their registers simultaneously in phase 3 instead of one at a time. I also use CUB instead of Thrust for Inclusive sum for faster scans. There may be more tweaks I can make, but for now this is good.
//kernel phase 1
int myID = blockIdx.x*blockDim.x + threadIdx.x;
//padded_length is nearest multiple of 1024 > true_length
for(int id = myID; id < (padded_length >> 5); id+= blockDim.x*gridDim.x){
int lnID = threadIdx.x % warp_size;
int warpID = id >> 5;
unsigned int mask;
unsigned int cnt=0;//;//
for(int j = 0; j < 32; j++){
int index = (warpID<<10)+(j<<5)+lnID;
bool pred;
if(index > true_length) pred = false;
else pred = predicate(input[index]);
mask = __ballot(pred);
if(lnID == 0) {
flag[(warpID<<5)+j] = mask;
cnt += __popc(mask);
}
}
if(lnID == 0) counter[warpID] = cnt; //store sum
}
//kernel phase 2 -> CUB Inclusive sum transforms counter array to scan_Index array
//kernel phase 3
int myID = blockIdx.x*blockDim.x + threadIdx.x;
for(int id = myID; id < (padded_length >> 5); id+= blockDim.x*gridDim.x){
int lnID = threadIdx.x % warp_size;
int warpID = id >> 5;
unsigned int predmask;
unsigned int cnt;
predmask = flag[(warpID<<5)+lnID];
cnt = __popc(predmask);
//parallel prefix sum
#pragma unroll
for(int offset = 1; offset < 32; offset<<=1){
unsigned int n = __shfl_up(cnt, offset);
if(lnID >= offset) cnt += n;
}
unsigned int global_index = 0;
if(warpID > 0) global_index = scan_Index[warpID - 1];
for(int i = 0; i < 32; i++){
unsigned int mask = __shfl(predmask, i); //broadcast from thread i
unsigned int sub_group_index = 0;
if(i > 0) sub_group_index = __shfl(cnt, i-1);
if(mask & (1 << lnID)){
compacted_array[global_index + sub_group_index + __popc(mask & ((1 << lnID) - 1))] = input[(warpID<<10)+(i<<5)+lnID];
}
}
}
}
EDIT: There is a newer article by a subset of the poster authors where they examine a faster variation of compact than what is written above. However, their new version is not order preserving, so not useful for myself and I haven't implemented it to test it out. That said, if your project doesn't rely on object order, their newer compact version can probably speed up your algorithm.

Julia Set - Cuda , improve the performance failed

Recently I am learning the examples in the book CUDA by JASON SANDERS.
the example of Juila Set makes a bad performance of 7032ms.
Here is the program:
#include <cuda.h>
#include <cuda_runtime.h>
#include <cpu_bitmap.h>
#include <book.h>
#define DIM 1024
struct cuComplex{
float r;
float i;
__device__ cuComplex(float a, float b) : r(a),i(b){
}
__device__ float magnitude2(void){
return r*r+i*i;
}
__device__ cuComplex operator *(const cuComplex& a){
return cuComplex(r*a.r-i*a.i, i*a.r+r*a.i);
}
__device__ cuComplex operator +(const cuComplex& a){
return cuComplex(r+a.r,i+a.i);
}
};
__device__ int julia(int x,int y){
const float scale = 1.5;
float jx = scale * (float)(DIM/2 - x)/(DIM/2);
float jy = scale * (float)(DIM/2 - y)/(DIM/2);
cuComplex c(-0.8,0.156);
cuComplex a(jx,jy);
int i = 0;
for(i = 0; i<200; i++){
a = a*a + c;
if(a.magnitude2() > 1000){
return 0;
}
}
return 1;
}
__global__ void kernel(unsigned char *ptr){
int x = blockIdx.x;
int y = blockIdx.y;
int offset = x + y*gridDim.x;
int juliaValue = julia(x,y);
ptr[offset*4 + 0] = 255*juliaValue;
ptr[offset*4 + 1] = 0;
ptr[offset*4 + 2] = 1;
ptr[offset*4 + 3] = 255;
}
int main(void){
CPUBitmap bitmap(DIM,DIM);
unsigned char * dev_bitmap;
dim3 grid(DIM,DIM);
dim3 blocks(DIM/16,DIM/16);
dim3 threads(16,16);
dim3 thread(DIM,DIM);
cudaEvent_t start,stop;
cudaEvent_t bitmapCpy_start,bitmapCpy_stop;
HANDLE_ERROR(cudaEventCreate(&start));
HANDLE_ERROR(cudaEventCreate(&stop));
HANDLE_ERROR(cudaEventCreate(&bitmapCpy_start));
HANDLE_ERROR(cudaEventCreate(&bitmapCpy_stop));
HANDLE_ERROR(cudaMalloc((void **)&dev_bitmap,bitmap.image_size()));
HANDLE_ERROR(cudaEventRecord(start,0));
kernel<<<grid,1>>>(dev_bitmap);
HANDLE_ERROR(cudaMemcpy(bitmap.get_ptr(),dev_bitmap,bitmap.image_size(),cudaMemcpyDeviceToHost));
//HANDLE_ERROR(cudaEventRecord(bitmapCpy_stop,0));
//HANDLE_ERROR(cudaEventSynchronize(bitmapCpy_stop));
// float copyTime;
// HANDLE_ERROR(cudaEventElapsedTime(&copyTime,bitmapCpy_start,bitmapCpy_stop));
HANDLE_ERROR(cudaEventRecord(stop,0));
HANDLE_ERROR(cudaEventSynchronize(stop));
float elapsedTime;
HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime,start,stop));
//printf("Total time is %3.1f ms, time for copying is %3.1f ms \n",elapsedTime,copyTime);
printf("Total time is %3.1f ms\n",elapsedTime);
bitmap.display_and_exit();
HANDLE_ERROR(cudaEventDestroy(start));
HANDLE_ERROR(cudaEventDestroy(stop));
HANDLE_ERROR(cudaEventDestroy(bitmapCpy_start));
HANDLE_ERROR(cudaEventDestroy(bitmapCpy_stop));
HANDLE_ERROR(cudaFree(dev_bitmap));
}
I think the main factor that influences the performance is that the program above just run 1 thread in every block:
kernel<<<grid,1>>>(dev_bitmap);
so I change the kernel like the following:
__global__ void kernel(unsigned char *ptr){
int x = threadIdx.x + blockIdx.x*blockDim.x;
int y = threadIdx.y + blockIdx.y*blockDim.y;
int offset = x + y*gridDim.x*blockIdx.x;
int juliaValue = julia(x,y);
ptr[offset*4 + 0] = 255*juliaValue;
ptr[offset*4 + 1] = 0;
ptr[offset*4 + 2] = 1;
ptr[offset*4 + 3] = 255;
}
and call kernel:
dim3 blocks(DIM/16,DIM/16);
dim3 threads(16,16);
kernel<<<blocks,threads>>>(dev_bitmap);
I think this change is not a big deal, but when I ran it, it acted like that it ran into some endless loops, no image appeared and I couldn't do anything with my screen, just blocked there.
toolkit: cuda 5.5
system: ubuntu 12.04
When I run the original code you have posted here, I get a correct display and a time of ~340ms.
When I make your kernel change, I get an "unspecified launch error" on the kernel launch.
In your modified kernel, you have the following which is an incorrect computation:
int offset = x + y*gridDim.x*blockIdx.x;
When I change it to:
int offset = x + y*gridDim.x*blockDim.x;
I get normal execution and results, and an indicated time of ~10ms.

Cuda thrust global memory writing very slow

I am currently writing a code, that calculates a integral Histogram on the GPU using the Nvidia thrust library.
Therefore I allocate a continuous Block of device memory which I update with a custom functor all the time.
The problem is, that the write to the device memory is veeery slow, but the reads are actually ok.
The basic setup is the following:
struct HistogramCreation
{
HistogramCreation(
...
// pointer to memory
...
){}
/// The actual summation operator
__device__ void operator()(int index){
.. do the calculations ..
for(int j=0;j<30;j++){
(1) *_memoryPointer = values (also using reads to such locations) ;
}
}
}
void foo(){
cudaMalloc(_pointer,size);
HistogramCreation initialCreation( ... _pointer ...);
thrust::for_each(
thrust::make_counting_iterator(0),
thrust::make_counting_iterator(_imageSize),
initialCreation);
}
if I change the writing in (1) to the following>
unsigned int val = values;
The performance is much better. THis is the only global memory write I have.
Using the memory write I get about 2s for HD Footage.
using the local variable it takes about 50 ms so about a factor of 40 less.
Why is this so slow? how could I improve it?
Just as #OlegTitov said, frequent load/store with global
memory should be avoided as much as possible. When there's a
situation where it's inevitable, then coalesced memory
access can help the execution process not to get too slow;
however in most cases, histogram calculation is pretty tough
to realize the coalesced access.
While most of the above is basically just restating
#OlegTitov's answer, i'd just like to share about an
investigation i did about finding summation with NVIDIA
CUDA. Actually the result is pretty interesting and i hope
it'll be a helpful information for other xcuda developers.
The experiment was basically to run a speed test of finding
summation with various memory access patterns: using global
memory (1 thread), L2 cache (atomic ops - 128 threads), and
L1 cache (shared mem - 128 threads)
This experiment used:
Kepler GTX 680,
1546 cores # 1.06GHz
GDDR5 256-bit # 3GHz
Here are the kernels:
__global__
void glob(float *h) {
float* hist = h;
uint sd = SEEDRND;
uint random;
for (int i = 0; i < NUMLOOP; i++) {
if (i%NTHREADS==0) random = rnd(sd);
int rind = random % NBIN;
float randval = (float)(random % 10)*1.0f ;
hist[rind] += randval;
}
}
__global__
void atom(float *h) {
float* hist = h;
uint sd = SEEDRND;
for (int i = threadIdx.x; i < NUMLOOP; i+=NTHREADS) {
uint random = rnd(sd);
int rind = random % NBIN;
float randval = (float)(random % 10)*1.0f ;
atomicAdd(&hist[rind], randval);
}
}
__global__
void shm(float *h) {
int lid = threadIdx.x;
uint sd = SEEDRND;
__shared__ float shm[NTHREADS][NBIN];
for (int i = 0; i < NBIN; i++) shm[lid][i] = h[i];
for (int i = lid; i < NUMLOOP; i+=NTHREADS) {
uint random = rnd(sd);
int rind = random % NBIN;
float randval = (float)(random % 10)*1.0f ;
shm[lid][rind] += randval;
}
/* reduction here */
for (int i = 0; i < NBIN; i++) {
__syncthreads();
if (threadIdx.x < 64) {
shm[threadIdx.x][i] += shm[threadIdx.x+64][i];
}
__syncthreads();
if (threadIdx.x < 32) {
shm[threadIdx.x][i] += shm[threadIdx.x+32][i];
}
__syncthreads();
if (threadIdx.x < 16) {
shm[threadIdx.x][i] += shm[threadIdx.x+16][i];
}
__syncthreads();
if (threadIdx.x < 8) {
shm[threadIdx.x][i] += shm[threadIdx.x+8][i];
}
__syncthreads();
if (threadIdx.x < 4) {
shm[threadIdx.x][i] += shm[threadIdx.x+4][i];
}
__syncthreads();
if (threadIdx.x < 2) {
shm[threadIdx.x][i] += shm[threadIdx.x+2][i];
}
__syncthreads();
if (threadIdx.x == 0) {
shm[0][i] += shm[1][i];
}
}
for (int i = 0; i < NBIN; i++) h[i] = shm[0][i];
}
OUTPUT
atom: 102656.00 shm: 102656.00 glob: 102656.00
atom: 122240.00 shm: 122240.00 glob: 122240.00
... blah blah blah ...
One Thread: 126.3919 msec
Atomic: 7.5459 msec
Sh_mem: 2.2207 msec
The ratio between these kernels is 57:17:1. Many things can
be analyzed here, and it truly does not mean that using
L1 or L2 memory spaces will always give you more than 10
times speedup of the whole program.
And here's the main and other funcs:
#include <iostream>
#include <cstdlib>
#include <cstdio>
using namespace std;
#define NUMLOOP 1000000
#define NBIN 36
#define SEEDRND 1
#define NTHREADS 128
#define NBLOCKS 1
__device__ uint rnd(uint & seed) {
#if LONG_MAX > (16807*2147483647)
int const a = 16807;
int const m = 2147483647;
seed = (long(seed * a))%m;
return seed;
#else
double const a = 16807;
double const m = 2147483647;
double temp = seed * a;
seed = (int) (temp - m * floor(temp/m));
return seed;
#endif
}
... the above kernels ...
int main()
{
float *h_hist, *h_hist2, *h_hist3, *d_hist, *d_hist2,
*d_hist3;
h_hist = (float*)malloc(NBIN * sizeof(float));
h_hist2 = (float*)malloc(NBIN * sizeof(float));
h_hist3 = (float*)malloc(NBIN * sizeof(float));
cudaMalloc((void**)&d_hist, NBIN * sizeof(float));
cudaMalloc((void**)&d_hist2, NBIN * sizeof(float));
cudaMalloc((void**)&d_hist3, NBIN * sizeof(float));
for (int i = 0; i < NBIN; i++) h_hist[i] = 0.0f;
cudaMemcpy(d_hist, h_hist, NBIN * sizeof(float),
cudaMemcpyHostToDevice);
cudaMemcpy(d_hist2, h_hist, NBIN * sizeof(float),
cudaMemcpyHostToDevice);
cudaMemcpy(d_hist3, h_hist, NBIN * sizeof(float),
cudaMemcpyHostToDevice);
cudaEvent_t start, end;
float elapsed = 0, elapsed2 = 0, elapsed3;
cudaEventCreate(&start);
cudaEventCreate(&end);
cudaEventRecord(start, 0);
atom<<<NBLOCKS, NTHREADS>>>(d_hist);
cudaThreadSynchronize();
cudaEventRecord(end, 0);
cudaEventSynchronize(start);
cudaEventSynchronize(end);
cudaEventElapsedTime(&elapsed, start, end);
cudaEventRecord(start, 0);
shm<<<NBLOCKS, NTHREADS>>>(d_hist2);
cudaThreadSynchronize();
cudaEventRecord(end, 0);
cudaEventSynchronize(start);
cudaEventSynchronize(end);
cudaEventElapsedTime(&elapsed2, start, end);
cudaEventRecord(start, 0);
glob<<<1, 1>>>(d_hist3);
cudaThreadSynchronize();
cudaEventRecord(end, 0);
cudaEventSynchronize(start);
cudaEventSynchronize(end);
cudaEventElapsedTime(&elapsed3, start, end);
cudaMemcpy(h_hist, d_hist, NBIN * sizeof(float),
cudaMemcpyDeviceToHost);
cudaMemcpy(h_hist2, d_hist2, NBIN * sizeof(float),
cudaMemcpyDeviceToHost);
cudaMemcpy(h_hist3, d_hist3, NBIN * sizeof(float),
cudaMemcpyDeviceToHost);
/* print output */
for (int i = 0; i < NBIN; i++) {
printf("atom: %10.2f shm: %10.2f glob:
%10.2f¥n",h_hist[i],h_hist2[i],h_hist3[i]);
}
printf("%12s: %8.4f msec¥n", "One Thread", elapsed3);
printf("%12s: %8.4f msec¥n", "Atomic", elapsed);
printf("%12s: %8.4f msec¥n", "Sh_mem", elapsed2);
return 0;
}
When writing GPU code you should avoid reading and writing to/from global memory. Global memory is very slow on GPU. That's the hardware feature. The only thing you can do is to make neighboring treads read/write in neighboring adresses in global memory. This will cause coalescing and speed up the process. But in general read your data once, process it and write it out once.
Note that NVCC might optimize out a lot of your code after you make the modification - it detects that no write to global memory is made and just removes the "unneeded" code. So this speedup may not be coming out of the global writer per ce.
I would recommend using profiler on your actual code (the one with global write) to see if there's anything like unaligned access or other perf problem.

Resources