Why is sizeof giving wrong answer - c++11

OK, I've come across a weirdness; maybe someone can explain it.
The source code (C++11) is:

    #include <stdio.h>

    struct xyz_ {
        float xyz[3];
        float &x = xyz[0];
        float &y = xyz[1];
        float &z = xyz[2];
    };

    int main(int argc, char *argv[])
    {
        xyz_ xyz;
        xyz.x = 0;
        xyz.y = 1;
        xyz.z = 2;
        xyz.xyz[1] = 1;
        printf("as array %f %f %f\n",xyz.xyz[0],xyz.xyz[1],xyz.xyz[2]);
        printf("as elements %f %f %f\n",xyz.x,xyz.y,xyz.z);
        int sizexyz = sizeof(xyz);
        int sizefloat = sizeof(float);
        printf("float is %d big, but xyz is %d big\n",sizefloat,sizexyz);
        return 0;
    }
The output is:
as array 0.000000 1.000000 2.000000
as elements 0.000000 1.000000 2.000000
float is 4 big, but xyz is 24 big
So the structure works as I would expect, but the size is twice as large as it should be. Using chars instead of float in the structure gives a segfault when run.
I wanted to use struct xyz_ as either an array of floats or individual float elements.

It is unspecified whether references require storage. In this case your output suggests that your compiler has decided to use storage to implement the references x, y and z.
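If you want to see what your own compiler does, a minimal sketch like the following compares the struct from the question with an array-only version. The names WithRefs, ArrayOnly and reference_overhead are made up for illustration, and the actual numbers depend on the compiler and target.

```cpp
#include <cstddef>

// The layout from the question: an array plus three reference members.
struct WithRefs {
    float xyz[3];
    float &x = xyz[0];
    float &y = xyz[1];
    float &z = xyz[2];
};

// The same data without the references.
struct ArrayOnly {
    float xyz[3];
};

// Bytes this compiler spends on the three references (including any padding).
constexpr std::size_t reference_overhead = sizeof(WithRefs) - sizeof(ArrayOnly);
```

If the references are implemented as 4-byte pointers, reference_overhead would come out as 12, which matches the 24 bytes reported in the question.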

Suppose you add another constructor:
    struct xyz_ {
        float xyz[3];
        float &x = xyz[0];
        float &y = xyz[1];
        float &z = xyz[2];

        xyz_(float& a, float& b, float& c)
            : x(a), y(b), z(c)
        {}
    };
It should be clear that now the three x, y and z members may be bound to the array elements or may be bound to something else.
Looks like what you are looking for is
    union P3d {
        float xyz[3];
        struct {
            float x, y, z;
        };
    };
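Unfortunately, for some strange reasons (apparently mostly political), this is not supported by the C++ standard, even though compilers do actually support it: anonymous structs are a compiler extension in C++, and reading a union member other than the one last written is undefined behavior.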

How about this:
    struct xyz_
    {
        float xyz[3];
        float &x() {return xyz[0];}
        float &y() {return xyz[1];}
        float &z() {return xyz[2];}
    };
Not as beautiful or elegant, but it brings the size down to that of the array alone. Non-virtual member functions add nothing to sizeof, and the this pointer is passed as a hidden argument to each call rather than stored in the object.
Of course you would have to use x(), y() and z().
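A slightly fuller sketch of the accessor approach, in case it helps. Vec3 and make_vec3 are illustrative names, not from the question, and the const overloads are my addition so that read-only objects can be queried too.

```cpp
#include <cassert>

struct Vec3 {
    float xyz[3];

    // Mutable access: each accessor returns a reference into the array.
    float &x() { return xyz[0]; }
    float &y() { return xyz[1]; }
    float &z() { return xyz[2]; }

    // Read-only access for const objects.
    const float &x() const { return xyz[0]; }
    const float &y() const { return xyz[1]; }
    const float &z() const { return xyz[2]; }
};

// Hypothetical helper that fills a Vec3 through the accessors.
inline Vec3 make_vec3(float a, float b, float c)
{
    Vec3 v;
    v.x() = a;
    v.y() = b;
    v.z() = c;
    assert(&v.x() == &v.xyz[0]);  // the accessor and the array name the same float
    return v;
}
```

On mainstream compilers sizeof(Vec3) should equal 3 * sizeof(float), since the member functions take no space in the object.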

What would be the size of xyz_ if it's declared like this?
    struct xyz_ {
        float xyz[3];
        float *x = &xyz[0];
        float *y = &xyz[1];
        float *z = &xyz[2];
    };
In practice a reference member, like a pointer, also needs its own space to store the location it refers to.

In C you could do the following, but it's not legal in C++11.
    union xyz_ {
        float xyz[3];
        struct { float x, y, z; };
    };


Cannot convert from float to int Processing/Java

I have some code here:
    int mutate(float x){
      if (random(1) < .1){
        float offset = randomGaussian()/2;
        float newx = x + offset;
        return newx;
      } else {
        return x;
      }
    }
This code gives an error on both return statements, saying "Type mismatch: Cannot convert from float to int." What is wrong with my code?
Thanks in advance.
You need to change the return type to float in order to return decimal values (if that's what you are interested in):
    float mutate(float x){
      if (random(1) < .1){
        float offset = randomGaussian()/2;
        float newx = x + offset;
        return newx;
      } else {
        return x;
      }
    }
First off, remember what int and float are:
int can only hold whole numbers without decimal places, like 1, 42, and -54321.
float can hold numbers with decimal places, like 0.25, 1.9999, and -543.21.
So, you need to figure out what you meant to return from your function: should it be an int or a float value? If it's supposed to be a float value, then you can simply change the return type of the function to float. If you want it to return an int value, then you'll have to rethink the logic inside your function so it's using int values instead.
Note that you can convert from a float to an int using the int() function. More info can be found in the reference.

Cuda matrix addition

I have written the following code to sum two 4x4 matrices in cuda.
    __global__ void Matrix_add(double* a, double* b, double* c,int n)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        int col = blockIdx.y * blockDim.y + threadIdx.y;
        int index = row * n + col;
        if(col<n && row <n)
            c[index] = a[index] + b[index];
    }

    int main()
    {
        int n=4;
        double **h_a;
        double **h_b;
        double **h_c;
        double *d_a, *d_b, *d_c;
        int size = n*n*sizeof(double);
        h_a = (double **) malloc(n*sizeof(double*));
        h_b = (double **) malloc(n*sizeof(double*));
        h_c = (double **) malloc(n*sizeof(double*));
        int t=0;
        for (t=0;t<n;t++)
        {
            h_a[t]= (double *)malloc(n*sizeof(double));
            h_b[t]= (double *)malloc(n*sizeof(double));
            h_c[t]= (double *)malloc(n*sizeof(double));
        }
        int i=0,j=0;
        // ...
        dim3 dimBlock(4,4);
        dim3 dimGrid(1,1);
        Matrix_add<<<dimGrid, dimBlock>>>(d_a,d_b,d_c,n);
        for( j=0;j<n;j++)
        {
            // ...
        }
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }
The result of this addition should be a 4x4 all-ones matrix, but all the elements of the result matrix are 0. I also get this message after the result is printed:
Segmentation fault (core dumped)
Can anyone please help me to find out the problem.
Thank you
Your host arrays (h_a, h_b, h_c) are not contiguous in memory, so your initial cudaMemcpy() calls will read garbage into GPU memory (apparently zeros in your case).
The reason is that your host arrays are not actually flat, but are instead represented as arrays of pointers, each pointing to a separately allocated row. I guess that is to fake two-dimensional arrays in C? In any case, you either need to be more careful with your cudaMemcpy() calls and copy the host arrays row by row, or use a flat representation on the host.
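As a rough sketch of the flat representation (host side only, plain C++; FlatMatrix and add are hypothetical names, not part of the question's code), the whole matrix lives in one contiguous block and is indexed with the same row * n + col formula the kernel uses:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// An n x n matrix stored in one contiguous, row-major block.
struct FlatMatrix {
    std::size_t n;
    std::vector<double> data;  // n * n elements, no per-row pointers

    explicit FlatMatrix(std::size_t n_, double fill = 0.0)
        : n(n_), data(n_ * n_, fill) {}

    double &at(std::size_t row, std::size_t col)
    {
        assert(row < n && col < n);
        return data[row * n + col];  // same index formula as the kernel
    }

    // Number of bytes a single device copy of the matrix would transfer.
    std::size_t bytes() const { return data.size() * sizeof(double); }
};

// CPU stand-in for the kernel: element-wise sum over the flat storage.
inline FlatMatrix add(const FlatMatrix &a, const FlatMatrix &b)
{
    assert(a.n == b.n);
    FlatMatrix c(a.n);
    for (std::size_t i = 0; i < c.data.size(); ++i)
        c.data[i] = a.data[i] + b.data[i];
    return c;
}
```

With this layout, one call such as cudaMemcpy(d_a, a.data.data(), a.bytes(), cudaMemcpyHostToDevice) would copy the entire matrix, because all n * n values sit next to each other in memory.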

Atomic max for floats in OpenCL

I need an atomic max function for floats in OpenCL. This is my current naive code using atomic_xchg
float value = data[index];
if ( value > *max_value )
atomic_xchg(max_value, value);
This code gives the correct result when using an Intel CPU, but not for a Nvidia GPU. Is this code correct, or can anyone help me?
You can do it like this:
    //Function to perform the atomic max
    inline void AtomicMax(volatile __global float *source, const float operand) {
        union {
            unsigned int intVal;
            float floatVal;
        } newVal;
        union {
            unsigned int intVal;
            float floatVal;
        } prevVal;
        do {
            prevVal.floatVal = *source;
            newVal.floatVal = max(prevVal.floatVal,operand);
        } while (atomic_cmpxchg((volatile __global unsigned int *)source, prevVal.intVal, newVal.intVal) != prevVal.intVal);
    }

    __kernel void mykern(__global float *data, __global float *max_value){
        unsigned int index = get_global_id(0);
        float value = data[index];
        AtomicMax(max_value, value);
    }
As stated in LINK.
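What it does is create a union of float and unsigned int. The math is performed on the float, but the integer bit patterns are compared in the atomic compare-and-exchange (atomic_cmpxchg). The loop retries until the value at source has not changed between the read and the exchange; once the integers match, the operation is complete.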
However, the speed decrease due to the use of these methods is very high. Use them carefully.
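For comparison, the same retry-loop idea can be sketched in standard C++ with std::atomic<float>. This is not OpenCL code, and atomic_max and max_of are hypothetical names used only for illustration.

```cpp
#include <atomic>
#include <cassert>

// Raise `target` to at least `operand` using a compare-and-exchange retry loop.
inline void atomic_max(std::atomic<float> &target, float operand)
{
    float prev = target.load();
    // On failure compare_exchange_weak writes the current value into `prev`,
    // so every retry re-checks against what another thread just stored.
    while (prev < operand && !target.compare_exchange_weak(prev, operand)) {
    }
}

// Single-threaded stand-in for the kernel: fold an array into one maximum.
inline float max_of(const float *values, int count)
{
    assert(count > 0);
    std::atomic<float> result(values[0]);
    for (int i = 1; i < count; ++i)
        atomic_max(result, values[i]);
    return result.load();
}
```

The exchange compares the stored bits rather than doing a floating-point comparison, which plays the same role as comparing intVal in the union above.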

Optimizing CUDA interpolation

I have developed the following interpolation with CUDA and I am looking for a way of improving it. For some reasons, I don't want to use CUDA textures.
I have also noticed that, for reasons I don't understand, the interpolation is not performed on the whole vector when the size of the vector is greater than the number of threads (for example, with a vector of size 1000 and 512 threads): each thread does its first job and that's all. I would also like to optimize the singleInterp function.
Here is my code:
    __device__ float singleInterp(float* data, float x, int lx_data) {
        float res = 0;
        int i1=0;
        int j=lx_data;
        int imid;
        // binary search for the interval that contains x
        while (j>i1+1)
        {
            imid = (int)(i1+j+1)/2;
            if (data[imid]<x)
                i1=imid;
            else
                j=imid;
        }
        if (i1==j)
            res = data[i1+lx_data];
        else
            res =__fmaf_rn( __fdividef(data[j+lx_data]-data[i1+lx_data],(data[j]-data[i1])),x-data[i1], data[i1+lx_data]);
        return res;
    }

    __global__ void linearInterpolation(float* data, float* x_in, int lx_data) {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int index = i;
        if (index < lx_data)
            x_in[index] = singleInterp(data, x_in[index], lx_data);
    }
It seems that you are interested in 1D linear interpolation. I have already faced the problem of optimizing this kind of interpolation and ended up with the following code:
    __global__ void linear_interpolation_kernel_function_GPU(double* __restrict__ result_d, const double* __restrict__ data_d, const double* __restrict__ x_out_d, const int M, const int N)
    {
        int j = threadIdx.x + blockDim.x * blockIdx.x;
        double reg_x_out = x_out_d[j/2]+M/2;
        int k = floor(reg_x_out);
        double a = (reg_x_out)-floor(reg_x_out);
        double dk = data_d[2*k+(j&1)];
        double dkp1 = data_d[2*k+2+(j&1)];
        result_d[j] = a * dkp1 + (-dk * a + dk);
    }
The data are assumed to be sampled at integer nodes between -M/2 and M/2.
The code is "equivalent" to 1D texture interpolation, as explained at the following web-page. For the 1D linear texture interpolation, see Fig. 13 of the CUDA Programming Guide. For comparisons between different solutions, please see the following thread.
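To make the arithmetic easier to follow, here is a host-side C++ sketch of the same weighting formula. It is a simplification, not the kernel above: the kernel appears to read interleaved pairs of values (the 2*k+(j&1) indexing), whereas this version assumes plain real samples, and the clamping at the ends is my own assumption. The function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Samples data[0..M] are taken at the integer nodes -M/2, ..., M/2 (M even).
inline double lerp_at_integer_nodes(const std::vector<double> &data, double x)
{
    assert(data.size() >= 2);
    const double last = static_cast<double>(data.size() - 1);  // equals M
    // Shift x so that node -M/2 maps to index 0, then clamp to the sampled range.
    const double pos = std::min(std::max(x + last / 2.0, 0.0), last);
    // Left neighbour, capped so that k + 1 is always a valid index.
    const std::size_t k =
        std::min(static_cast<std::size_t>(std::floor(pos)), data.size() - 2);
    const double a = pos - static_cast<double>(k);  // weight in [0, 1]
    const double dk = data[k];
    const double dkp1 = data[k + 1];
    return a * dkp1 + (-dk * a + dk);  // same expression as in the kernel
}
```

For example, with five samples at the nodes -2, -1, 0, 1, 2, a query at x = -1.5 lands halfway between the first two samples.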

PyCUDA - passing a matrix by reference from python to C++ CUDA code

I have to write a PyCUDA function that takes two matrices, Nx3 and Mx3, and returns an NxM matrix, but I can't figure out how to pass a matrix by reference without knowing the number of columns.
My code is basically something like this:
    #kernel declaration
    mod = SourceModule("""
    __global__ void distance(int N, int M, float d1[][3], float d2[][3], float res[][M])
    {
        int i = threadIdx.x;
        int j = threadIdx.y;
        float x, y, z;
        x = d2[j][0]-d1[i][0];
        y = d2[j][1]-d1[i][1];
        z = d2[j][2]-d1[i][2];
        res[i][j] = x*x + y*y + z*z;
    }
    """)

    #load data
    data1 = numpy.loadtxt("data1.txt").astype(numpy.float32) # Nx3 matrix
    data2 = numpy.loadtxt("data2.txt").astype(numpy.float32) # Mx3 matrix
    res = numpy.zeros([N,M]).astype(numpy.float32) # NxM matrix

    #invoke kernel
    dist_gpu = mod.get_function("distance")
    dist_gpu(cuda.In(numpy.int32(N)), cuda.In(numpy.int32(M)), cuda.In(data1), cuda.In(data2), cuda.Out(res), block=(N,M,1))

    #save data
    numpy.savetxt("results.txt", res)
Compiling this I receive an error:
kernel.cu(3): error: a parameter is not allowed
That is, I cannot use M as the number of columns for res[][] in the declaration of the function. I cannot leave the number of columns undeclared either...
I need an NxM matrix as output, but I can't figure out how to do this. Can you help me?
You should use pitched linear memory access inside the kernel. That is how ndarray and gpuarray store data internally, and PyCUDA will pass a pointer to the data in GPU memory allocated for a gpuarray when it is supplied as an argument to a PyCUDA kernel. So (if I understand what you are trying to do) your kernel should be written as something like:
    __device__ unsigned int idx2d(int i, int j, int lda)
    {
        return j + i*lda;
    }

    __global__ void distance(int N, int M, float *d1, float *d2, float *res)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int j = threadIdx.y + blockDim.y * blockIdx.y;
        float x, y, z;
        x = d2[idx2d(j,0,3)]-d1[idx2d(i,0,3)];
        y = d2[idx2d(j,1,3)]-d1[idx2d(i,1,3)];
        z = d2[idx2d(j,2,3)]-d1[idx2d(i,2,3)];
        // res is NxM in row-major order, so its row length is M
        res[idx2d(i,j,M)] = x*x + y*y + z*z;
    }
Here I have assumed the numpy default row major ordering in defining the idx2d helper function. There are still problems with the Python side of the code you posted, but I guess you know that already.
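To illustrate the row-major indexing on the CPU, here is a small C++ sketch. pairwise_sq_dist is a hypothetical reference function, not part of PyCUDA. Note that the third argument of idx2d is the row length of the array being indexed: 3 for the Nx3 and Mx3 inputs, and M for the NxM result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major flat index: lda is the number of columns in one row.
inline std::size_t idx2d(std::size_t i, std::size_t j, std::size_t lda)
{
    return j + i * lda;
}

// d1 is Nx3, d2 is Mx3 and the result is NxM, all stored flat in row-major order.
inline std::vector<float> pairwise_sq_dist(const std::vector<float> &d1,
                                           const std::vector<float> &d2)
{
    assert(d1.size() % 3 == 0 && d2.size() % 3 == 0);
    const std::size_t N = d1.size() / 3;
    const std::size_t M = d2.size() / 3;
    std::vector<float> res(N * M);
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j < M; ++j) {
            const float x = d2[idx2d(j, 0, 3)] - d1[idx2d(i, 0, 3)];
            const float y = d2[idx2d(j, 1, 3)] - d1[idx2d(i, 1, 3)];
            const float z = d2[idx2d(j, 2, 3)] - d1[idx2d(i, 2, 3)];
            res[idx2d(i, j, M)] = x * x + y * y + z * z;  // row length of res is M
        }
    }
    return res;
}
```

A reference like this can also be handy for checking the GPU output on small inputs.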
EDIT: Here is a complete working repro case based on the code posted in your question. Note that it only uses a single block (like the original), so be mindful of block and grid dimensions when trying to run it on anything other than trivially small cases.
    import numpy as np
    from pycuda import compiler, driver
    from pycuda import autoinit

    #kernel declaration
    mod = compiler.SourceModule("""
    __device__ unsigned int idx2d(int i, int j, int lda)
    {
        return j + i*lda;
    }

    __global__ void distance(int N, int M, float *d1, float *d2, float *res)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int j = threadIdx.y + blockDim.y * blockIdx.y;
        float x, y, z;
        x = d2[idx2d(j,0,3)]-d1[idx2d(i,0,3)];
        y = d2[idx2d(j,1,3)]-d1[idx2d(i,1,3)];
        z = d2[idx2d(j,2,3)]-d1[idx2d(i,2,3)];
        // res is NxM in row-major order, so its row length is M
        res[idx2d(i,j,M)] = x*x + y*y + z*z;
    }
    """)

    #make data
    data1 = np.random.uniform(size=18).astype(np.float32).reshape(-1,3)
    data2 = np.random.uniform(size=12).astype(np.float32).reshape(-1,3)
    N, M = data1.shape[0], data2.shape[0]  # row counts of the two inputs
    res = np.zeros([N,M]).astype(np.float32) # NxM matrix

    #invoke kernel
    dist_gpu = mod.get_function("distance")
    dist_gpu(np.int32(N), np.int32(M), driver.In(data1), driver.In(data2), \
             driver.Out(res), block=(N,M,1), grid=(1,1))
    print(res)
