Why cublas copy algorithm is so fast in cuda? - algorithm

I wrote kernel that copied input vector to output vector.
But the performance is not enough compared to cublascopy API.
The cublasScopy is almost 100 times faster than my kernel in case of 1M elements.
Anyone knows about algorithm of cublascopy?
__global__ void copy_kernel(const float *rv1, int inc1, float *rvo, int inco, int n)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
while (tid < n)
{
rvo[tid*inco] = rv1[tid*inc1];
tid += (blockDim.x * gridDim.x);
}
}

Thanks for Robert's help.
I found that measurement code had a bug.
I have to add cudaDeviceSynchronize() only for measuring performance.
And then above my kernel is a little bit slower than cublasScopy.
I think it is reasonable.
// cuBLAS Algorithm
timer.onTimer(4);
cublasScopy(handle, num_data, d_i_vals, inc1, d_o_vals, inco);
cudaDeviceSynchronize(); // this dummy line is needed only for measurement purpose
timer.offTimer(4);
timer.onTimer(5);
checkCudaErrors(cudaMemcpy(o_vals, d_o_vals, sizeof(float) * o_buf_size,
cudaMemcpyDeviceToHost));
cudaDeviceSynchronize();
timer.offTimer(5);

Related

Why is this code ten times slower on the GPU than CPU?

I have a problem that boils down to performing some arithmetic on each element of a set of matrices. I thought this sounded like the kind of computation that could benefit greatly from being shifted onto the GPU. However, I've only succeeded in slowing down the computation by a factor of 10!
Here are the specifics of my test system:
OS: Windows 10
CPU: Core i7-4700MQ # 2.40 GHz
GPU: GeForce GT 750M (compute capability 3.0)
CUDA SDK: v7.5
The code below performs equivalent calcs to my production code, on the CPU and on the GPU. The latter is consistently ten times slower on my machine (CPU approx. 650ms; GPU approx. 7s).
I've tried changing the grid and block sizes; I've increased and decreased the size of the array passed to the GPU; I've run it through the visual profiler; I've tried integer data rather than doubles, but whatever I do, the GPU version is always significantly slower than the CPU equivalent.
So why is the GPU version so much slower and what changes, that I've not mentioned above, could I try to improve its performance?
Here's my command line: nvcc source.cu -o CPUSpeedTest.exe -arch=sm_30
And here's the contents of source.cu:
#include <iostream>
#include <windows.h>
#include <cuda_runtime_api.h>
void AdjustArrayOnCPU(double factor1, double factor2, double factor3, double denominator, double* array, int arrayLength, double* curve, int curveLength)
{
for (size_t i = 0; i < arrayLength; i++)
{
double adjustmentFactor = factor1 * factor2 * factor3 * (curve[i] / denominator);
array[i] = array[i] * adjustmentFactor;
}
}
__global__ void CudaKernel(double factor1, double factor2, double factor3, double denominator, double* array, int arrayLength, double* curve, int curveLength)
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < arrayLength)
{
double adjustmentFactor = factor1 * factor2 * factor3 * (curve[idx] / denominator);
array[idx] = array[idx] * adjustmentFactor;
}
}
void AdjustArrayOnGPU(double array[], int arrayLength, double factor1, double factor2, double factor3, double denominator, double curve[], int curveLength)
{
double *dev_row, *dev_curve;
cudaMalloc((void**)&dev_row, sizeof(double) * arrayLength);
cudaMalloc((void**)&dev_curve, sizeof(double) * curveLength);
cudaMemcpy(dev_row, array, sizeof(double) * arrayLength, cudaMemcpyHostToDevice);
cudaMemcpy(dev_curve, curve, sizeof(double) * curveLength, cudaMemcpyHostToDevice);
CudaKernel<<<100, 1000>>>(factor1, factor2, factor3, denominator, dev_row, arrayLength, dev_curve, curveLength);
cudaMemcpy(array, dev_row, sizeof(double) * arrayLength, cudaMemcpyDeviceToHost);
cudaFree(dev_curve);
cudaFree(dev_row);
}
void FillArray(int length, double row[])
{
for (size_t i = 0; i < length; i++) row[i] = 0.1 + i;
}
int main(void)
{
const int arrayLength = 10000;
double arrayForCPU[arrayLength], curve1[arrayLength], arrayForGPU[arrayLength], curve2[arrayLength];;
FillArray(arrayLength, curve1);
FillArray(arrayLength, curve2);
///////////////////////////////////// CPU Version ////////////////////////////////////////
LARGE_INTEGER StartingTime, EndingTime, ElapsedMilliseconds, Frequency;
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
for (size_t iterations = 0; iterations < 10000; iterations++)
{
FillArray(arrayLength, arrayForCPU);
AdjustArrayOnCPU(1.0, 1.0, 1.0, 1.0, arrayForCPU, 10000, curve1, 10000);
}
QueryPerformanceCounter(&EndingTime);
ElapsedMilliseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedMilliseconds.QuadPart *= 1000;
ElapsedMilliseconds.QuadPart /= Frequency.QuadPart;
std::cout << "Elapsed Milliseconds: " << ElapsedMilliseconds.QuadPart << std::endl;
///////////////////////////////////// GPU Version ////////////////////////////////////////
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
for (size_t iterations = 0; iterations < 10000; iterations++)
{
FillArray(arrayLength, arrayForGPU);
AdjustArrayOnGPU(arrayForGPU, 10000, 1.0, 1.0, 1.0, 1.0, curve2, 10000);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
std::cout << "CUDA Elapsed Milliseconds: " << elapsedTime << std::endl;
cudaEventDestroy(start);
cudaEventDestroy(stop);
return 0;
}
And here is an example of the output of CUDASpeedTest.exe
Elapsed Milliseconds: 565
CUDA Elapsed Milliseconds: 7156.76
What follows is likely to be embarrassingly obvious to most developers working with CUDA, but may be of value to others - like myself - who are new to the technology.
The GPU code is ten times slower than the CPU equivalent because the GPU code exhibits a perfect storm of performance-wrecking characteristics.
The GPU code spends most of its time allocating memory on the GPU, copying data to the device, performing a very, very simple calculation (that is supremely fast irrespective of the type of processor it's running on) and then copying data back from the device to the host.
As noted in the comments, if an upper bound exists on the size of the data structures being processed, then a buffer on the GPU can be allocated exactly once and reused. In the code above, this takes the GPU to CPU runtime down from 10:1 to 4:1.
The remaining performance disparity is down to the fact that the CPU is able to perform the required calculations, in serial, millions of times in a very short time span due to its simplicity. In the code above, the calculation involves reading a value from an array, some multiplication, and finally an assignment
to an array element. Something this simple must be performed millions of times
before the benefits of doing so in parallel outweigh the necessary time penalty of transferring the data to the GPU and back. On my test system, a million array elements is the break even point, where GPU and CPU perform in (approximately) the same amount of time.

Generate random number within a function with cuRAND without preallocation

Is it possible to generate random numbers within a device function without preallocate all the states? I would like to generate and use them in "realtime". I need to use them for Monte Carlo simulations what are the most suitable for this purpose? The number generated below are single precision is it possible to have them in double precision?
#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand_kernel.h>
__global__ void cudaRand(float *d_out, unsigned long seed)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
curandState state;
curand_init(seed, i, 0, &state);
d_out[i] = curand_uniform(&state);
}
int main(int argc, char** argv)
{
size_t N = 1 << 4;
float *v = new float[N];
float *d_out;
cudaMalloc((void**)&d_out, N * sizeof(float));
// generate random numbers
cudaRand << < 1, N >> > (d_out, time(NULL));
cudaMemcpy(v, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
for (size_t i = 0; i < N; i++)
{
printf("out: %f\n", v[i]);
}
cudaFree(d_out);
delete[] v;
return 0;
}
UPDATE
#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand_kernel.h>
#include <ctime>
__global__ void cudaRand(double *d_out)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
curandState state;
curand_init((unsigned long long)clock() + i, 0, 0, &state);
d_out[i] = curand_uniform_double(&state);
}
int main(int argc, char** argv)
{
size_t N = 1 << 4;
double *h_v = new double[N];
double *d_out;
cudaMalloc((void**)&d_out, N * sizeof(double));
// generate random numbers
cudaRand << < 1, N >> > (d_out);
cudaMemcpy(h_v, d_out, N * sizeof(double), cudaMemcpyDeviceToHost);
for (size_t i = 0; i < N; i++)
printf("out: %f\n", h_v[i]);
cudaFree(d_out);
delete[] h_v;
return 0;
}
How I was dealing with the similar situation in the past, within __device__/__global__ function:
int tId = threadIdx.x + (blockIdx.x * blockDim.x);
curandState state;
curand_init((unsigned long long)clock() + tId, 0, 0, &state);
double rand1 = curand_uniform_double(&state);
double rand2 = curand_uniform_double(&state);
So just use curand_uniform_double for generating random doubles and also I believe you don't want the same seed for all of the threads, thats what I am trying to achieve by using clock() + tId instead. This way the odds of having the same rand1/rand2 in any of the two threads are close to nil.
EDIT:
However, based on below comments, proposed approach may perhaps lead to biased result:
JackOLantern pointed me to this part of curand documentation:
Sequences generated with different seeds usually do not have statistically correlated values, but some choices of seeds may give statistically correlated sequences.
Also there is a devtalk thread devoted to how to improve performance of curand_init in which the proposed solution to speed up the curand initialization is:
One thing you can do is use different seeds for each thread and a fixed subsequence of 0 and offset of 0.
But the same poster is later stating:
The downside is that you lose some of the nice mathematical properties between threads. It is possible that there is a bad interaction between the hash function that initializes the generator state from the seed and the periodicity of the generators. If that happens, you might get two threads with highly correlated outputs for some seeds. I don't know of any problems like this, and even if they do exist they will most likely be rare.
So it is basically up to you whether you want better performance (as I did) or 1000% unbiased results. If that is what you desire, then solution proposed by JackOLantern is the correct one, i.e. initialize curand as:
curand_init((unsigned long long)clock(), tId, 0, &state)
Using not 0 value for offset and subsequence parameters is, however, decreasing performance. For more info on these parameters you may review this SO thread and also curand documentation.
I see that JackOLantern stated in comment that:
I would say it is not recommandable to call curand_init and curand_uniform_double from within the same kernel from two reasons ........ Second, curand_init initializes the pseudorandom number generator and sets all of its parameters, so I'm afraid your approach will be somewhat slow.
I was dealing with this in my thesis on several pages, tried various approaches to get different random numbers in each thread and creating curandState in each of the threads turned out to be the most viable solution for me. I needed to generate ~10 random numbers in each thread and among others I tried:
developing my own simple random number generator (Linear Congruential Generator) whose intialization was basically for free, however, the performance suffered greatly when generating numbers, so in the end having curandState in each thread turned out to be superior,
pre-allocating curandStates and reusing them - this was memory heavy and when I decreased number of preallocated states then I had to use non zero values for offset/subsequence parameters of curand_uniform_double in order to get rid of bias which led to decreased performance when generating numbers.
So after making thorough analysis I decided to indeed call curand_init and curand_uniform_double in each thread. The only problem was with the amount of registry that these states were occupying so I had to be careful with the block sizes not to exceed the max number of registry available to each block.
Thats what I have to say about provided solution which I was finally able to test and it is working just fine on my machine/GPU. I run the code from UPDATE section in the above question and 16 different random numbers were displayed in the console correctly. Therefore I advise you to properly perform error checking after executing kernel to see what went wrong inside. This topic is very well covered in this SO thread.

CUDA performance of atomic operation on different address in warp

To my knowledge, if atomic operations are performed on same memory address location in a warp, the performance of the warp could be 32 times slower.
But what if atomic operations of threads in a warp are on 32 different memory locations? Is there any performance penalty at all? Or it will be as fast as normal operation?
My use case is that I have 32 different positions, each thread in a warp needs one of these position but which position is data dependent. So each thread could use atomicCAS to scan if the location desired is empty or not. If it is not empty, scan the next position.
If I am lucky, 32 threads could atomicCAS to 32 different memory locations, is there any performance penalty is this case?
I assume Kepler architecture is used
In the code below, I'm adding a constant value to the elements of an array (dev_input). I'm comparing two kernels, one using atomicAdd and one using regular addition. This is an example taken to the extreme in which atomicAdd operates on completely different addresses, so there will be no need for serialization of the operations.
#include <stdio.h>
#define BLOCK_SIZE 1024
int iDivUp(int a, int b) { return ((a % b) != 0) ? (a / b + 1) : (a / b); }
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__global__ void regular_addition(float *dev_input, float val, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) dev_input[i] = dev_input[i] + val;
}
__global__ void atomic_operations(float *dev_input, float val, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) atomicAdd(&dev_input[i],val);
}
int main(){
int N = 8192*32;
float* output = (float*)malloc(N*sizeof(float));
float* dev_input; gpuErrchk(cudaMalloc((void**)&dev_input, N*sizeof(float)));
gpuErrchk(cudaMemset(dev_input, 0, N*sizeof(float)));
int NumBlocks = iDivUp(N,BLOCK_SIZE);
float time, timing1 = 0.f, timing2 = 0.f;
cudaEvent_t start, stop;
int niter = 32;
for (int i=0; i<niter; i++) {
gpuErrchk(cudaEventCreate(&start));
gpuErrchk(cudaEventCreate(&stop));
gpuErrchk(cudaEventRecord(start,0));
atomic_operations<<<NumBlocks,BLOCK_SIZE>>>(dev_input,3,N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaEventRecord(stop,0));
gpuErrchk(cudaEventSynchronize(stop));
gpuErrchk(cudaEventElapsedTime(&time, start, stop));
timing1 = timing1 + time;
}
printf("Time for atomic operations: %3.5f ms \n", timing1/(float)niter);
for (int i=0; i<niter; i++) {
gpuErrchk(cudaEventCreate(&start));
gpuErrchk(cudaEventCreate(&stop));
gpuErrchk(cudaEventRecord(start,0));
regular_addition<<<NumBlocks,BLOCK_SIZE>>>(dev_input,3,N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaEventRecord(stop,0));
gpuErrchk(cudaEventSynchronize(stop));
gpuErrchk(cudaEventElapsedTime(&time, start, stop));
timing2 = timing2 + time;
}
printf("Time for regular addition: %3.5f ms \n", timing2/(float)niter);
}
Testing this code on my NVIDIA GeForce GT540M, CUDA 5.5, Windows 7, I obtain approximately the same results for the two kernels, i.e., about 0.7ms.
Now change the instruction
if (i < N) atomicAdd(&dev_input[i],val);
to
if (i < N) atomicAdd(&dev_input[i%32],val);
which is closer to the case of your interest, namely, each atomicAdd operates on different addresses within a warp. The result I obtain is that no performance penalty is observed.
Finally, change the above instruction to
if (i < N) atomicAdd(&dev_input[0],val);
This is the other extreme in which atomicAdd always operates on the same address. In this case, the execution time raises to 5.1ms.
The above tests have been performed on a Fermi architecture. You can try to run the above code on your Kepler card.

TERCOM algorithm - Changing from single thread to multiple threads in CUDA

I'm currently working on porting a TERCOM algorithm from using only 1 thread to use multiple threads. Briefly explained , the TERCOM algorithm receives 5 measurements and the heading, and compare this measurements to a prestored map. The algorithm will choose the best match, i.e. lowest Mean Absolute Difference (MAD), and return the position.
The code is working perfectly with one thread and for-loops, but when I try to use multiple threads and blocks it returns the wrong answer. It seems like the multithread version doesn't "run through" the calculation in the same way as the singlethread versjon. Does anyone know what I am doing wrong?
Here's the code using for-loops
__global__ void kernel (int m, int n, int h, int N, float *f, float heading, float *measurements)
{
//Without threads
float pos[2]={0};
float theta=heading*(PI/180);
float MAD=0;
// Calculate how much to move in x and y direction
float offset_x = h*cos(theta);
float offset_y = -h*sin(theta);
float min=100000; //Some High value
//Calculate Mean Absolute Difference
for(float row=0;row<m;row++)
{
for(float col=0;col<n;col++)
{
for(float g=0; g<N; g++)
{
f[(int)g] = tex2D (tex, col+(g-2)*offset_x+0.5f, row+(g-2)*offset_y+0.5f);
MAD += abs(measurements[(int)g]-f[(int)g]);
}
if(MAD<min)
{
min=MAD;
pos[0]=col;
pos[1]=row;
}
MAD=0; //Reset MAD
}
}
f[0]=min;
f[1]=pos[0];
f[2]=pos[1];
}
This is my attempt to use multiple threads
__global__ void kernel (int m, int n, int h, int N, float *f, float heading, float *measurements)
{
// With threads
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;
float pos[2]={0};
float theta=heading*(PI/180);
float MAD=0;
// Calculate how much to move in x and y direction
float offset_x = h*cos(theta);
float offset_y = -h*sin(theta);
float min=100000; //Some High value
if(idx < n && idy < m)
{
for(float g=0; g<N; g++)
{
f[(int)g] = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
MAD += abs(measurements[(int)g]-f[(int)g]);
}
if(MAD<min)
{
min=MAD;
pos[0]=idx;
pos[1]=idy;
}
MAD=0; //Reset MAD
}
f[0]=min;
f[1]=pos[0];
f[2]=pos[1];
}
To launch the kernel
dim3 dimBlock( 16,16 );
dim3 dimGrid;
dimGrid.x = (n + dimBlock.x - 1)/dimBlock.x;
dimGrid.y = (m + dimBlock.y - 1)/dimBlock.y;
kernel <<< dimGrid,dimBlock >>> (m, n, h, N, dev_results, heading, dev_measurements);
The basic problem here is that you have a memory race in the code, centered around the use of f as both some sort of thread local scratch space and an output variable. Every concurrent thread will be trying to write values into the same locations in f simultaneously, which will produce undefined behaviour.
As best as I can tell, the use of f as scratch space isn't even necessary at all and the main computational section of the kernel could be written as something like:
if(idx < n && idy < m)
{
for(float g=0; g<N; g++)
{
float fval = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
MAD += abs(measurements[(int)g]-fval);
}
min=MAD;
pos[0]=idx;
pos[1]=idy;
}
[disclaimer: written in browser, use at own risk]
At the end of that calculation, each thread has its own values of min and pos. At a minimum these must be stored in unique global memory (ie. the output must have enough space for each thread result). You will then need to perform some sort of reduction operation to obtain the global minimum from the set of thread local values. That could be in the host, or in the device code, or some combination of the two. There is a lot of code already available for CUDA parallel reductions which you should be able to find by searching and/or looking in the examples supplied with the CUDA toolkit. It should be trivial to adapt them to your specify case where you need to retain the position along with the minimum value.

CUDA performance doubts

Since i didnt got a response from the CUDA forum, ill try it here:
After doing a few programs in CUDA ive now started to obtain their effective bandwidth. However i have some strange results, for example in the following code, where i can sum all the elements in a vector(regardless of dimension), the bandwidth with the Unroll Code and the "normal" code seems to have the same median result(around 3000 Gb/s)
I dont know if im doing something wrong(AFAIK the program works fine) but from what ive read so far, the Unroll code should have a higher bandwidth.
#include <stdio.h>
#include <limits.h>
#include <stdlib.h>
#include <math.h>
#define elements 1000
#define blocksize 16
__global__ void vecsumkernel(float*input, float*output,int nelements){
__shared__ float psum[blocksize];
int tid=threadIdx.x;
if(tid + blockDim.x * blockIdx.x < nelements)
psum[tid]=input[tid+blockDim.x*blockIdx.x];
else
psum[tid]=0.0f;
__syncthreads();
//WITHOUT UNROLL
int stride;
for(stride=blockDim.x/2;stride>0;stride>>=1){
if(tid<stride)
psum[tid]+=psum[tid+stride];
__syncthreads();
}
if(tid==0)
output[blockIdx.x]=psum[0];
//WITH UNROLL
/*
if(blocksize>=512 && tid<256) psum[tid]+=psum[tid+256];__syncthreads();
if(blocksize>=256 && tid<128) psum[tid]+=psum[tid+128];__syncthreads();
if(blocksize>=128 && tid<64) psum[tid]+=psum[tid+64];__syncthreads();
if (tid < 32) {
if (blocksize >= 64) psum[tid] += psum[tid + 32];
if (blocksize >= 32) psum[tid] += psum[tid + 16];
if (blocksize >= 16) psum[tid] += psum[tid + 8];
if (blocksize >= 8) psum[tid] += psum[tid + 4];
if (blocksize >= 4) psum[tid] += psum[tid + 2];
if (blocksize >= 2) psum[tid] += psum[tid + 1];
}*/
if(tid==0)
output[blockIdx.x]=psum[0];
}
void vecsumv2(float*input, float*output, int nelements){
dim3 dimBlock(blocksize,1,1);
int i;
for(i=((int)ceil((double)(nelements)/(double)blocksize))*blocksize;i>1;i(int)ceil((double)i/(double)blocksize)){
dim3 dimGrid((int)ceil((double)i/(double)blocksize),1,1);
printf("\ni=%d\ndimgrid=%u\n ",i,dimGrid.x);
vecsumkernel<<<dimGrid,dimBlock>>>(i==((int)ceil((double)(nelements)/(double)blocksize))*blocksize ?input:output,output,i==((int)ceil((double)(nelements)/(double)blocksize))*blocksize ? elements:i);
}
}
void printVec(float*vec,int dim){
printf("\n{");
for(int i=0;i<dim;i++)
printf("%f ",vec[i]);
printf("}\n");
}
int main(){
cudaEvent_t evstart, evstop;
cudaEventCreate(&evstart);
cudaEventCreate(&evstop);
float*input=(float*)malloc(sizeof(float)*(elements));
for(int i=0;i<elements;i++)
input[i]=(float) i;
float*output=(float*)malloc(sizeof(float)*elements);
float *input_d,*output_d;
cudaMalloc((void**)&input_d,elements*sizeof(float));
cudaMalloc((void**)&output_d,elements*sizeof(float));
cudaMemcpy(input_d,input,elements*sizeof(float),cudaMemcpyHostToDevice);
cudaEventRecord(evstart,0);
vecsumv2(input_d,output_d,elements);
cudaEventRecord(evstop,0);
cudaEventSynchronize(evstop);
float time;
cudaEventElapsedTime(&time,evstart,evstop);
printf("\ntempo gasto:%f\n",time);
float Bandwidth=((1000*4*2)/10^9)/time;
printf("\n Bandwidth:%f Gb/s\n",Bandwidth);
cudaMemcpy(output,output_d,elements*sizeof(float),cudaMemcpyDeviceToHost);
cudaFree(input_d);
cudaFree(output_d);
printf("soma do vector");
printVec(output,4);
}
Your unrolled code has a lot of branching in it. I count ten additional branches. Typically branching within a warp on a GPU is expensive because all threads in the warp end up waiting on the branch (divergence).
See here for more info on warp divergence:
http://forums.nvidia.com/index.php?showtopic=74842
Have you tried using a profiler to see what's going on?
3000 Gb/s Does not make sense. The max bus speed of PCIe is 8Gb/s on each direction.
Take a look at this paper Parallel Prefix Sum to gain insight on how to speed up your implementation.
Also consider that the thrust library have this already implemented in the Reductions module
your not-unrolled code is invalid. For stride<32 some threads of the same warp enter the for-loop, while the others do not. Therefore, some (but not all) threads of the warp hit the __syncthreads(). CUDA specification says that when that happens, the behaviour is undefined.
It can happen that warp gets out of sync and some threads already begin loading next chunk of data, halting on next instances of __syncthreads() while previous threads are still stuck in your previous loop.
I am not sure though if that is what you are going to face in this particular case.
I see you're doing Reduction Sum in kernel. Here's a good presentation by NVIDIA for optimizing reduction on GPUs. You'll notice that the same code that was giving a throughput of 2 GB/s is optimized to 63 GB/s in this guide.

Resources