I'm encountering a very strange problem: my 9800GT doesn't seem to calculate anything at all.
I've tried all the hello-world programs I could find on the internet; here's one of them.
This program creates a 100-element array on the host, sends it to the device, squares each value, returns the result to the host, and prints it.
#include "stdafx.h"
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx<N) a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host
int main(void)
{
    float *a_h, *a_d; // Pointer to host & device arrays
    const int N = 100; // Number of elements in arrays
    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size); // Allocate array on host
    cudaMalloc((void **) &a_d, size); // Allocate array on device
    // Initialize host array and copy it to CUDA device
    for (int i=0; i<N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    // Do calculation on device:
    int block_size = 4;
    int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
    square_array <<< n_blocks, block_size >>> (a_d, N);
    // Retrieve result from device and store it in host array
    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
    // Print results
    for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
    // Cleanup
    free(a_h); cudaFree(a_d);
}
So the output is expected to be:
1 1.000
2 4.000
3 9.000
4 16.000
I swear that back in 2009 it worked perfectly (Vista 32-bit, deviceemu).
Now I get this output:
1 1.000
2 2.000
3 3.000
4 4.000
So my card doesn't do anything. What could the problem be?
Configuration is:
Visual Studio 2010 32-bit
CUDA Toolkit 3.2 64-bit
Compilation settings: CUDA 3.2 toolkit, 32-bit target platform; with or without deviceemu, the results are the same.
I also tried it on my VMware XP (32-bit) machine with Visual Studio 2008. The result is the same.
Please help me. I barely got the program to compile; now I need it to work.
You can also view my project with everything it needs in my post at the NVIDIA forums (2.7 KB).
Thanks, Ilya

Your code produces the intended results on my Linux system, so I would suggest checking the error codes returned by cudaMalloc and cudaMemcpy to make sure there are no silent driver/runtime errors. For example:
cudaError_t error = cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
printf("error status: %s\n", cudaGetErrorString(error));
should print
error status: no error
if the call is successful.
Also, I believe device emulation was deprecated in CUDA 3.0 and removed entirely in CUDA 3.1. I don't know if that's related to your problem though.
To compile several files you'd just do something like this
$nvcc -c
$nvcc -c
$nvcc -o foobar foo.o bar.o
Alternatively, you can do the linking in the last step with g++, like so:
$g++ -o foobar foo.o bar.o -L/usr/local/cuda/lib64 -lcudart


Large page-locked memory copy gets wrong result in CUDA

I found an issue with large page-locked memory allocations in CUDA. Here are the source code and the makefile. The code allocates 10 GB of page-locked memory and copies data from device memory into it; the data in device memory is set to 1.0 before the copy.
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <assert.h>
#include <cuda_runtime.h>
#include "helper_cuda.h"

// Sets every element of x to 1.0
__global__ void test_k(double* x, size_t n)
{
    int gid = blockIdx.x*blockDim.x + threadIdx.x;
    if(gid<n) x[gid] = 1.0 ;
}

int main(int argc, char* argv[])
{
    size_t n = size_t(10)*1024*1024*1024/sizeof(double);
    printf("\n n: %zu, page-locked memory size: %zu MB\n", n, n*sizeof(double)/1024/1024);
    double* x_h = NULL, *x_d = NULL;
    int gpuid = 0;
    if(argc>1 ) gpuid = atoi(argv[1]);
    printf("select gpu %d\n", gpuid);
    checkCudaErrors(cudaMallocHost(&x_h, sizeof(double)*n));
    checkCudaErrors(cudaMalloc(&x_d, sizeof(double)*n));
    for(int i = 0; i < n; ++i) x_h[i]=0.0;
    int nthd = 256;
    int nblk = (n+nthd-1) / nthd;
    test_k<<<nblk, nthd, 0, 0>>>(x_d, n);
    checkCudaErrors(cudaMemcpy(x_h, x_d, sizeof(double)*n, cudaMemcpyDeviceToHost));
    int errCount = 0;
    for(size_t i = 0; i < n; ++i){
        if(x_h[i] == 0.0) errCount++;
    }
    printf("%s errCount: %d, which should be 0\n", errCount?"Error:":"Correct", errCount);
    return 0;
}
CUDA_PATH = /depot/cuda/cuda-11.2/
CUDA_INC = -I$(CUDA_PATH)/include -I$(CUDA_PATH)/samples/common/inc
NVCC = $(CUDA_PATH)/bin/nvcc
NVCCXXFLAGS = -std=c++11 -O3 -w -m64 -Xptxas -dlcm=cg -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80 $(CUDA_INC)
all: testLargePin
$(NVCC) $^ $(NVCCXXFLAGS) -o $@
rm testLargePin -f
I ran the binary on three different GPU servers (all with A100-SXM4-40GB). On machine 1, the result is correct. On machine 2, it reports:
CUDA error at code=719(cudaErrorLaunchFailure) "cudaMemcpy(x_h, x_d, sizeof(double)*n, cudaMemcpyDeviceToHost)"
On machine 3, the copy is wrong: there are lots of zeros in the page-locked array.
n: 1342177280, page-locked memory size: 10240 MB
select gpu 0
Error: errCount: 1024, which should be 0
Does anyone know the reason and how to fix the issue? For example, is there an API to check the maximum page-locked memory size on a given machine? Thanks in advance.
Error 719 (cudaErrorLaunchFailure) means an exception occurred on the device while a kernel was executing. Common causes include dereferencing an invalid device pointer and accessing shared memory out of bounds; it can also be a system-specific problem.
In my experience, synchronization has helped with memory errors and inconsistent results. Did you try adding cudaDeviceSynchronize(); after checkCudaErrors(cudaMemcpy(x_h, x_d, sizeof(double)*n, cudaMemcpyDeviceToHost));?
As for page-locked memory, there's no limit in CUDA. I think you have to check this on the host side.
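One host-side check that is easy to script is whether the requested pinned size is even plausible for the machine's RAM. The sketch below is a Linux-only illustration using sysconf; the helper names and the idea of comparing against a fraction of RAM are assumptions made for this example, not a CUDA API. Passing this check does not guarantee that cudaMallocHost will succeed or that the copy will be correct.

```cpp
#include <cstddef>
#include <unistd.h>

// Total physical RAM in bytes as reported by the OS, or 0 if unavailable.
// _SC_PHYS_PAGES is a Linux/glibc extension, not part of POSIX.
std::size_t physical_memory_bytes() {
    long pages = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGESIZE);
    if (pages <= 0 || page_size <= 0) return 0;
    return static_cast<std::size_t>(pages) * static_cast<std::size_t>(page_size);
}

// Hypothetical helper: true when `request` bytes is at most `fraction`
// of the physical RAM (for example 0.5 for "no more than half").
bool fits_in_physical_memory(std::size_t request, double fraction) {
    std::size_t total = physical_memory_bytes();
    if (total == 0) return false;  // unknown RAM size: be conservative
    return static_cast<double>(request) <= fraction * static_cast<double>(total);
}
```

Before the cudaMallocHost call, something like fits_in_physical_memory(sizeof(double)*n, 0.5) could be used to print a warning on machines where a 10 GB pinned buffer would take up most of the RAM.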

Is there any way to reduce sum 100M float elements of an array in CUDA?

I'm new to CUDA, so please bear with me if the solution is trivial.
I am trying to find the sum of the 100M float elements of an array. As the following code shows, I've used both a reduction kernel and thrust. I expect the kernel to store the sum in g_odata[0]. Since all the elements of g_idata are the same, the result should be n*g_idata[1]. But as you can clearly see, the results are incorrect for both of them.
What am I getting wrong? How can I achieve my goal?
Every reduction kernel I found is for an integer datatype, e.g. the highly recommended Optimizing Parallel Reduction in CUDA. Is there any specific reason for that?
Here is my code:
#include <iostream>
#include <math.h>
#include <stdlib.h>
#include <iomanip>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
using namespace std;

__global__ void reduce(float *g_idata, float *g_odata) {
    __shared__ float sdata[256];
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = g_idata[i];
    __syncthreads();
    for (int s=1; s < blockDim.x; s *=2)
    {
        int index = 2 * s * threadIdx.x;
        if (index < blockDim.x)
        {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }
    // Thread 0 adds this block's partial sum to the global result
    if (threadIdx.x == 0)
        atomicAdd(g_odata, sdata[0]);
}

int main(void){
    unsigned int n=pow(10,8);
    float *g_idata, *g_odata;
    cudaMallocManaged(&g_idata, n*sizeof(float));
    cudaMallocManaged(&g_odata, n*sizeof(float));
    int blockSize = 32;
    int numBlocks = (n + blockSize - 1) / blockSize;
    for(int i=0;i<n;i++){g_idata[i]=6.1;g_odata[i]=0;}
    reduce<<<numBlocks, blockSize>>>(g_idata, g_odata);
    cudaDeviceSynchronize();
    cout << g_odata[0] << "\t" << (float)n*g_idata[1] << "\t"<< (float)n*g_idata[1]-g_odata[0]<<endl;
    g_odata[0]=thrust::reduce(thrust::device, g_idata, g_idata+n);
    cout << g_odata[0] << "\t" << (float)n*g_idata[1] << "\t"<< (float)n*g_idata[1]-g_odata[0]<<endl;
}
6.0129e+08 6.1e+08 8.7097e+06
6.09986e+08 6.1e+08 13824
I am using CUDA 10. nvcc --version :
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
Details of my GPU DeviceQuery:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 750"
CUDA Driver Version / Runtime Version 10.0 / 10.0
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 1999 MBytes (2096168960 bytes)
( 4) Multiprocessors, (128) CUDA Cores/MP: 512 CUDA Cores
GPU Max Clock rate: 1110 MHz (1.11 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS
Thanks in advance.
I think the reason you are confused about the results here is a lack of understanding of floating point arithmetic. This whitepaper covers the topic pretty well. As a simple concept to grasp, if I have numbers represented as float quantities, and I attempt to do this:
100000000 + 1
the result will be: 100000000 (write some code and try it yourself)
This isn't unique to GPUs, CPU code will behave the same way (try it).
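For anyone who wants to try the 100000000 + 1 experiment, here is a minimal host-only sketch. It assumes IEEE-754 single precision with float expressions evaluated in float (the usual case on x86-64 and ARM64); the function names are made up for illustration.

```cpp
#include <cstdio>

// Adds two values in single precision and returns the rounded float result.
float add_as_float(float big, float small) {
    float sum = big + small;
    return sum;
}

// Returns true when adding `small` leaves `big` unchanged after rounding.
bool is_absorbed(float big, float small) {
    return add_as_float(big, small) == big;
}

// Prints the float sum of 100000000 and 1 with one decimal place,
// so it is easy to see whether the 1 survived the addition.
void show_absorption() {
    std::printf("%.1f\n", add_as_float(100000000.0f, 1.0f));
}
```

Under those assumptions, is_absorbed(100000000.0f, 1.0f) should return true, because adjacent float values near 1e8 are 8 apart, whereas is_absorbed(1000.0f, 1.0f) should return false.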
So for very large reductions, we get to the point (often) where we are adding very large numbers to much much smaller numbers, and the results aren't accurate from a "pure math" point of view.
That is fundamentally the problem here. In your CPU code, when you decide that the correct result should be 6.1*n, that kind of multiplication problem is not subject to the limits of adding large numbers to small ones that I just described, so you get an "accurate" result from that.
One of the ways to prove this or work around it, is to use double representation instead of float. This doesn't really completely eliminate the problem, but it pushes the resolution to the point where it can do a much better job of representing the range of numbers here.
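To see the effect of the accumulator type in isolation, the following host-only sketch adds the same float value repeatedly, once into a float and once into a double. It is not the GPU reduction from the question: a tree-shaped reduction adds in a different order and so accumulates a different error. The function names are illustrative, and the sketch assumes the compiler evaluates float arithmetic in single precision.

```cpp
#include <cmath>
#include <cstddef>

// Sequentially adds `value` to a float accumulator `count` times.
float sum_in_float(float value, std::size_t count) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < count; ++i) acc += value;
    return acc;
}

// Same loop, but the running total is kept in double precision.
double sum_in_double(float value, std::size_t count) {
    double acc = 0.0;
    for (std::size_t i = 0; i < count; ++i) acc += value;
    return acc;
}

// Absolute difference between a computed total and a reference value.
double abs_error(double got, double expected) {
    return std::fabs(got - expected);
}
```

With value = 6.1f and count = 100000000, the double total should agree with n*6.1f to well within 1, while the sequential float total should stop growing once the running sum is so large that 6.1 is less than half the spacing between adjacent floats (around 2^27, about 1.34e8).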
The following code primarily has that change. You can change the typedef to compare the behavior between float and double.
There are a few other changes in the code. None of them are the cause of the discrepancy you witnessed.
$ cat
#include <iostream>
#include <math.h>
#include <stdlib.h>
#include <iomanip>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#define BLOCK_SIZE 32
typedef double ft;
using namespace std;

// Double-precision atomic add built on atomicCAS
__device__ double my_atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull =
        (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                            __longlong_as_double(assumed)));
    // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);
    return __longlong_as_double(old);
}

__device__ float my_atomicAdd(float* addr, float val){
    return atomicAdd(addr, val);
}

__global__ void reduce(ft *g_idata, ft *g_odata, int n) {
    __shared__ ft sdata[BLOCK_SIZE];
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = (i < n)?g_idata[i]:0;
    __syncthreads();
    for (int s=1; s < blockDim.x; s *=2)
    {
        int index = 2 * s * threadIdx.x;
        if ((index +s) < blockDim.x)
        {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }
    // Thread 0 adds this block's partial sum to the global result
    if (threadIdx.x == 0)
        my_atomicAdd(g_odata, sdata[0]);
}

int main(void){
    unsigned int n=pow(10,8);
    ft *g_idata, *g_odata;
    cudaMallocManaged(&g_idata, n*sizeof(ft));
    cudaMallocManaged(&g_odata, sizeof(ft));
    cout << "n = " << n << endl;
    int blockSize = BLOCK_SIZE;
    int numBlocks = (n + blockSize - 1) / blockSize;
    g_odata[0] = 0;
    for(int i=0;i<n;i++){g_idata[i]=6.1;}
    reduce<<<numBlocks, blockSize>>>(g_idata, g_odata, n);
    cudaDeviceSynchronize();
    cout << g_odata[0] << "\t" << (float)n*g_idata[1] << "\t"<< (float)n*g_idata[1]-g_odata[0]<<endl;
    g_odata[0]=thrust::reduce(thrust::device, g_idata, g_idata+n);
    cout << g_odata[0] << "\t" << (float)n*g_idata[1] << "\t"<< (float)n*g_idata[1]-g_odata[0]<<endl;
}
$ nvcc -o t18
$ cuda-memcheck ./t18
n = 100000000
6.1e+08 6.1e+08 0.00527966
6.1e+08 6.1e+08 5.13792e-05
========= ERROR SUMMARY: 0 errors

How do I include sm_11_atomic_function.h?

I'm having an issue with my class.
Calling nvcc -v -o kernel.o, I get this error: error: identifier "atomicAdd" is undefined
My code:
#include "dot.h"
#include <cuda.h>
#include "device_functions.h" //might call atomicAdd

__global__ void dot (int *a, int *b, int *c){
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    // Wait until every thread in the block has written its product
    __syncthreads();
    if( 0 == threadIdx.x ){
        int sum = 0;
        for( int i = 0; i<THREADS_PER_BLOCK; i++)
            sum += temp[i];
        atomicAdd(c, sum);
    }
}
Any suggestions?
You need to specify an architecture to nvcc which supports atomic memory operations (the default architecture is 1.0 which does not support atomics). Try:
nvcc -arch=sm_11 -v -o kernel.o
and see what happens.
EDIT in 2015 to note that the default architecture in CUDA 7.0 is now 2.0, which supports atomic memory operations, so this should not be a problem in newer toolkit versions.
Today, with the latest CUDA SDK and toolkit, this solution will not work.
People also say that adding:
compute_11,sm_11; OR compute_12,sm_12; OR compute_13,sm_13;
to CUDA in the Project Properties in Visual Studio 2010 will work. It doesn't.
You have to specify this for the .cu file itself, in its own properties (under the C++/CUDA->Device->Code Generation tab).

OpenCL in Xcode/OSX - Can't assign zero in kernel loop

I'm developing an accelerated component in OpenCL, using Xcode 4.5.1 and Grand Central Dispatch, guided by this tutorial.
The full kernel kept failing on the GPU, giving signal SIGABRT. I couldn't make much progress interpreting the error beyond that.
But I broke out aspects of the kernel to test, and I found something very peculiar involving assigning certain values to positions in an array within a loop.
Test scenario: give each thread a fixed range of array indices to initialize.
kernel void zero(size_t num_buckets, size_t positions_per_bucket, global int* array) {
    size_t bucket_index = get_global_id(0);
    if (bucket_index >= num_buckets) return;
    for (size_t i = 0; i < positions_per_bucket; i++)
        array[bucket_index * positions_per_bucket + i] = 0;
}
The above kernel fails. However, when I assign 1 instead of 0, the kernel succeeds (and my host code prints out the array of 1's). Based on a handful of tests on various integer values, I've only had problems with 0 and -1.
I've tried to outsmart the compiler with 1-1, (int) 0, etc, with no success. Passing zero in as a kernel argument worked though.
The assignment to zero does work outside of the context of a for loop:
array[bucket_index * positions_per_bucket] = 0;
The findings above were confirmed on two machines with different configurations. (OSX 10.7 + GeForce, OSX 10.8 + Radeon.) Furthermore, the kernel had no trouble when running on CL_DEVICE_TYPE_CPU -- it's just on the GPU.
Clearly, something ridiculous is happening, and it must be on my end, because "zero" can't be broken. Hopefully it's something simple. Thank you for your help.
Host code:
#include <stdio.h>
#include <OpenCL/OpenCL.h>
#include ""
int main(int argc, const char* argv[]) {
    dispatch_queue_t queue = gcl_create_dispatch_queue(CL_DEVICE_TYPE_GPU, NULL);
    size_t num_buckets = 64;
    size_t positions_per_bucket = 4;
    cl_int* h_array = malloc(sizeof(cl_int) * num_buckets * positions_per_bucket);
    cl_int* d_array = gcl_malloc(sizeof(cl_int) * num_buckets * positions_per_bucket, NULL, CL_MEM_WRITE_ONLY);
    dispatch_sync(queue, ^{
        cl_ndrange range = { 1, { 0 }, { num_buckets }, { 0 } };
        zero_kernel(&range, num_buckets, positions_per_bucket, d_array);
        gcl_memcpy(h_array, d_array, sizeof(cl_int) * num_buckets * positions_per_bucket);
    });
    for (size_t i = 0; i < num_buckets * positions_per_bucket; i++)
        printf("%d ", h_array[i]);
}
Refer to the OpenCL standard, section 6, paragraph 8 "Restrictions", bullet point k (emphasis mine):
6.8 k. Arguments to kernel functions in a program cannot be declared with the built-in scalar types bool, half, size_t, ptrdiff_t, intptr_t, and uintptr_t. [...]
The fact that your compiler even let you build the kernel at all indicates it is somewhat broken.
So you might want to fix that, but if that doesn't help, then it looks like a compiler bug, plain and simple (in CLC, that is, the OpenCL compiler, not your host code). There is no reason this kernel should work with every constant except 0 and -1. Did you try updating your OpenCL driver? What about trying a different operating system (though I suppose this code is OS X only)?

Breakpoints inside CUDA kernel __global__ not hitting

Using Visual Studio 2010, Windows 7, Nsight 2.1.
#include "cuda.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

void incrementArrayOnHost(float *a, int N)
{
    int i;
    for (i=0; i < N; i++) a[i] = a[i]+1.f;
}

__global__ void incrementArrayOnDevice(float *a, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    int j = idx;
    int i = 2;
    i = i+j; //->breakpoint here
    if (idx<N) a[idx] = a[idx]+1.f; //->breakpoint here
}

int main(void)
{
    float *a_h, *b_h; // pointers to host memory
    float *a_d; // pointer to device memory
    int i, N = 10;
    size_t size = N*sizeof(float);
    // allocate arrays on host
    a_h = (float *)malloc(size);
    b_h = (float *)malloc(size);
    // allocate array on device
    cudaMalloc((void **) &a_d, size);
    // initialization of host data
    for (i=0; i<N; i++) a_h[i] = (float)i;
    // copy data from host to device
    cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice);
    // do calculation on host
    incrementArrayOnHost(a_h, N);
    // do calculation on device:
    // Part 1 of 2. Compute execution configuration
    int blockSize = 4;
    int nBlocks = N/blockSize + (N%blockSize == 0?0:1);
    // Part 2 of 2. Call incrementArrayOnDevice kernel
    incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, N);
    // Retrieve result from device and store in b_h
    cudaMemcpy(b_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
    // check results
    for (i=0; i<N; i++) assert(a_h[i] == b_h[i]);
    // cleanup
    free(a_h); free(b_h); cudaFree(a_d);
    return 0;
}
I've tried inserting breakpoints as marked above inside my __global__ void incrementArrayOnDevice(float *a, int N), but they're not being hit.
When I run the debugger (F5) in Visual Studio and try to step into incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, N);, it skips the entire kernel code section.
I tried to add a watch on the variables i and j, but got the error "CXX0017: Error: symbol "i" not found."
Is this normal? Can someone please try this on their PC and let me know whether they can hit the breakpoints? If you can, what could be wrong with my setup? Please help! :(
Nsight debugging is different from VS debugging. You need to use Nsight debugging to hit the kernel breakpoints. However, for this you need 2 GPU cards. Do you have 2 cards in the first place? Please check.
You can debug on a single GPU, but only under the following conditions:
You have to be using the 5.0 toolkit
You have to be programming on a GPU that supports the 303.xx NForceWare driver or higher
