PyCUDA - passing a matrix by reference from python to C++ CUDA code - matrix

I have to write a PyCUDA function that takes two matrices, Nx3 and Mx3, and returns an NxM matrix, but I can't figure out how to pass a matrix by reference without knowing the number of columns.
My code is basically something like this:
#kernel declaration
mod = SourceModule("""
__global__ void distance(int N, int M, float d1[][3], float d2[][3], float res[][M])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    float x, y, z;
    x = d2[j][0]-d1[i][0];
    y = d2[j][1]-d1[i][1];
    z = d2[j][2]-d1[i][2];
    res[i][j] = x*x + y*y + z*z;
}
""")
#load data
data1 = numpy.loadtxt("data1.txt").astype(numpy.float32) # Nx3 matrix
data2 = numpy.loadtxt("data2.txt").astype(numpy.float32) # Mx3 matrix
res = numpy.zeros([N,M]).astype(numpy.float32) # NxM matrix
#invoke kernel
dist_gpu = mod.get_function("distance")
dist_gpu(cuda.In(numpy.int32(N)), cuda.In(numpy.int32(M)), cuda.In(data1), cuda.In(data2), cuda.Out(res), block=(N,M,1))
#save data
numpy.savetxt("results.txt", res)
When I compile this, I get the error: error: a parameter is not allowed
That is, I cannot use M as the number of columns of res[][] in the function declaration. I cannot leave the number of columns undeclared either...
I need an NxM matrix as output, but I can't figure out how to do this. Can you help me?

You should use pitched linear memory access inside the kernel. That is how ndarray and gpuarray store data internally, and PyCUDA passes a pointer to the GPU memory allocated for a gpuarray when it is supplied as an argument to a PyCUDA kernel. So (if I understand what you are trying to do) your kernel should be written something like:
__device__ unsigned int idx2d(int i, int j, int lda)
{
    return j + i*lda;
}
__global__ void distance(int N, int M, float *d1, float *d2, float *res)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    float x, y, z;
    x = d2[idx2d(j,0,3)]-d1[idx2d(i,0,3)];
    y = d2[idx2d(j,1,3)]-d1[idx2d(i,1,3)];
    z = d2[idx2d(j,2,3)]-d1[idx2d(i,2,3)];
    // res is NxM in row major order, so its leading dimension is M
    res[idx2d(i,j,M)] = x*x + y*y + z*z;
}
Here I have assumed the numpy default row major ordering in defining the idx2d helper function. There are still problems with the Python side of the code you posted, but I guess you know that already.
EDIT: Here is a complete working repro case based on the code posted in your question. Note that it only uses a single block (like the original), so be mindful of block and grid dimensions when trying to run it on anything other than trivially small cases.
import numpy as np
from pycuda import compiler, driver
from pycuda import autoinit
#kernel declaration
mod = compiler.SourceModule("""
__device__ unsigned int idx2d(int i, int j, int lda)
{
    return j + i*lda;
}
__global__ void distance(int N, int M, float *d1, float *d2, float *res)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    float x, y, z;
    x = d2[idx2d(j,0,3)]-d1[idx2d(i,0,3)];
    y = d2[idx2d(j,1,3)]-d1[idx2d(i,1,3)];
    z = d2[idx2d(j,2,3)]-d1[idx2d(i,2,3)];
    // res is NxM in row major order, so its leading dimension is M
    res[idx2d(i,j,M)] = x*x + y*y + z*z;
}
""")
#make data
data1 = np.random.uniform(size=18).astype(np.float32).reshape(-1,3)
data2 = np.random.uniform(size=12).astype(np.float32).reshape(-1,3)
# take N and M from the data so that the block shape matches the arrays
N, M = data1.shape[0], data2.shape[0]
res = np.zeros([N,M]).astype(np.float32) # NxM matrix
#invoke kernel
dist_gpu = mod.get_function("distance")
dist_gpu(np.int32(N), np.int32(M), driver.In(data1), driver.In(data2), \
driver.Out(res), block=(N,M,1), grid=(1,1))
print(res)
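A small CPU reference can help check the kernel output on tiny inputs. The sketch below is plain Python (not PyCUDA) and reuses the row-major convention of the idx2d helper, where the leading dimension is the number of columns of the array being indexed. The name distance_reference and the sample points are illustrative assumptions, not part of the original code.

```python
def idx2d(i, j, lda):
    """Flat offset of element (i, j) in a row-major array with lda columns."""
    return j + i * lda


def distance_reference(d1, d2, n, m):
    """Squared distances between n points in d1 and m points in d2.

    d1 and d2 are flat lists holding 3 floats per point; the result is a
    flat n*m list in row-major order (leading dimension m).
    """
    res = [0.0] * (n * m)
    for i in range(n):
        for j in range(m):
            res[idx2d(i, j, m)] = sum(
                (d2[idx2d(j, c, 3)] - d1[idx2d(i, c, 3)]) ** 2 for c in range(3)
            )
    return res


d1 = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]                 # 2 points
d2 = [1.0, 0.0, 0.0, 0.0, 2.0, 0.0, 1.0, 1.0, 3.0]  # 3 points
res = distance_reference(d1, d2, 2, 3)
```

Comparing this list against the flattened GPU result (for example with a small tolerance) is one way to catch an indexing mistake such as a wrong leading dimension.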


Eigen JacobiSVD cuda compile error

I get an error when calling JacobiSVD in my CUDA kernel.
This is the part of the code that causes the error.
Eigen::JacobiSVD<Eigen::Matrix3d> svd( cov_e, Eigen::ComputeThinU | Eigen::ComputeThinV);
And this is the error message: error: calling a __host__ function("Eigen::JacobiSVD , (int)2> ::JacobiSVD") from a __global__ function("kernel") is not allowed
I compile with nvcc -ptx (see the comment at the top of the code).
I'm using CUDA 8.0 with Eigen 3 on Ubuntu 16.04.
Other functions, such as the eigenvalue decomposition, seem to give the same error.
Does anyone know a solution? My code is below.
//nvcc -ptx
#include </usr/include/eigen3/Eigen/Core>
#include </usr/include/eigen3/Eigen/SVD>
#include </usr/include/eigen3/Eigen/Sparse>
#include </usr/include/eigen3/Eigen/Dense>
#include </usr/include/eigen3/Eigen/Eigenvalues>
__global__ void kernel(double *p, double *breaks,double *ind, double *mu, double *cov, double *e,double *v, int *n, char *isgood, int minpts, int maxgpu){
bool debuginfo = false;
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if(debuginfo)printf("Thread %d got pointer\n",idx);
if( idx < maxgpu){
int s_ind = breaks[idx];
int e_ind = breaks[idx+1];
int diff = e_ind-s_ind;
if(diff >minpts){
int cnt = 0;
Eigen::MatrixXd local_p(3,diff) ;
for(int k = s_ind;k<e_ind;k++){
int temp_ind=ind[k];
//Eigen::Matrix<double, 3, diff> local_p;
local_p(1,cnt) = p[temp_ind*3];
local_p(2,cnt) = p[temp_ind*3+1];
local_p(3,cnt) = p[temp_ind*3+2];
Eigen::Matrix3d centered = local_p.rowwise() - local_p.colwise().mean();
Eigen::Matrix3d cov_e = (centered.adjoint() * centered) / double(local_p.rows() - 1);
Eigen::JacobiSVD<Eigen::Matrix3d> svd( cov_e, Eigen::ComputeThinU | Eigen::ComputeThinV);
/* Eigen::Matrix3d Cp = svd.matrixU() * svd.singularValues().asDiagonal() * svd.matrixV().transpose();
n[idx] = diff;
isgood[idx] = 1;
for(int x = 0; x < 3; x++)
for(int y = 0; y < 3; y++)
v[x+ 3*y +idx*9]=svd.matrixV()(x, y);
cov[x+ 3*y +idx*9]=cov_e(x, y);
//if(debuginfo)printf("%f ",R[x+ 3*y +i*9]);
if(debuginfo)printf("%f ",Rm(x, y));
} else {
n[idx] = 0;
isgood[idx] = 0;
for(int x = 0; x < 3; x++)
for(int y = 0; y < 3; y++)
v[x+ 3*y +idx*9]=0;
cov[x+ 3*y +idx*9]=0;
First of all, Ubuntu 16.04 provides Eigen 3.3-beta1, which is not really recommended for use. I would suggest upgrading to a more recent version. Furthermore, to include Eigen, write (e.g.):
#include <Eigen/Eigenvalues>
and compile with -I /usr/include/eigen3 (if you use the version provided by the OS), or better -I /path/to/local/eigen-version.
Then, as talonmies noted, you can't call host functions from kernels (I'm not sure at the moment why JacobiSVD is not marked as a device function), but in your case it would make much more sense to use Eigen::SelfAdjointEigenSolver anyway. Since the matrix you are decomposing is a fixed-size 3x3, you should actually use the optimized computeDirect method:
Eigen::SelfAdjointEigenSolver<Eigen::Matrix3d> eig; // default constructor
eig.computeDirect(cov_e); // works for 2x2 and 3x3 matrices, does not require loops
It seems the computeDirect even works on the beta version provided by Ubuntu (I'd still recommend to update).
Some unrelated notes:
The following is wrong, since you should start with index 0:
local_p(1,cnt) = p[temp_ind*3];
local_p(2,cnt) = p[temp_ind*3+1];
local_p(3,cnt) = p[temp_ind*3+2];
Also, you can write this in one line:
local_p.col(cnt) = Eigen::Vector3d::Map(p+temp_ind*3);
This line will not fit (unless diff==3):
Eigen::Matrix3d centered = local_p.rowwise() - local_p.colwise().mean();
What you probably mean is (local_p is actually 3xn not nx3)
Eigen::Matrix<double, 3, Eigen::Dynamic> centered = local_p.colwise() - local_p.rowwise().mean();
And when computing cov_e you need to .adjoint() the second factor, not the first.
You can avoid both 'big' matrices local_p and centered, by directly accumulating Eigen::Matrix3d sum2 and Eigen::Vector3d sum with sum2 += v*v.adjoint() and sum +=v and computing
Eigen::Vector3d mu = sum / diff;
Eigen::Matrix3d cov_e = (sum2 - mu*mu.adjoint()*diff)/(diff-1);
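To make the accumulation idea concrete, here is a minimal plain-Python sketch (not Eigen) that builds sum and sum2 in a single pass and checks the result against the usual two-pass "center, then multiply" computation. The function names and sample points are illustrative assumptions. Note that the one-pass formula subtracts two similar quantities, so it can lose precision when the mean is large compared with the spread of the data.

```python
def covariance_one_pass(points):
    """Mean and sample covariance from running sums of v and v*v^T."""
    n = len(points)
    s = [0.0] * 3                           # sum  += v
    s2 = [[0.0] * 3 for _ in range(3)]      # sum2 += v * v^T
    for v in points:
        for a in range(3):
            s[a] += v[a]
            for b in range(3):
                s2[a][b] += v[a] * v[b]
    mu = [x / n for x in s]
    cov = [[(s2[a][b] - mu[a] * mu[b] * n) / (n - 1) for b in range(3)]
           for a in range(3)]
    return mu, cov


def covariance_two_pass(points):
    """Reference: subtract the mean first, then accumulate outer products."""
    n = len(points)
    mu = [sum(v[a] for v in points) / n for a in range(3)]
    cov = [[sum((v[a] - mu[a]) * (v[b] - mu[b]) for v in points) / (n - 1)
            for b in range(3)] for a in range(3)]
    return mu, cov


pts = [(1.0, 2.0, 3.0), (2.0, 4.0, 7.0), (3.0, 6.0, 5.0), (4.0, 8.0, 9.0)]
mu1, cov1 = covariance_one_pass(pts)
mu2, cov2 = covariance_two_pass(pts)
```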

Optimising Matrix Multiplication OpenCL - Purpose: learn how to manage memory

I'm new to OpenCL and trying to understand how to optimise matrix multiplication in order to become familiar with the various paradigms. Here's the current code.
To multiply matrices A and B, I first copy a row of A into private memory (because each work item uses it) and a column of B into local memory (because each work group uses it).
1) The code is currently incorrect. I'm struggling with how to use the local work IDs correctly and I can't find my mistake. I'm basing my code on a set of slides (slide 27), but they seem to be wrong, as they don't make use of loc_size in their inner loop.
2) Are there any other optimisations you would suggest for this code?
__kernel void mmul(
__global int* C,
__global int* A,
__global int* B,
const int rA,
const int rB,
const int cC,
__local char* local_mem)
{
    int k,ty;
    int tx = get_global_id(0);
    int loctx = get_local_id(0);
    int loc_size = get_local_size(0);
    int value = 0 ;
    int tmp_array[1000];
    for(k=0;k<rB;k++) {
        tmp_array[k] = A[tx * cA + k] ;
    }
    for (ty=0 ; ty < cC ; ty++) {
        for (k = loctx ; k < rB ; k+=loc_size) {
            local_mem[k] = B[ty + k * cC] ;
        }
        value = 0 ;
        for(k=0;k<rB;k+=1) {
            int i = loctx + k*loc_size;
            value += tmp_array[k] * local_mem[i];
        }
        C[ty + (tx * cC)] = value;
    }
}
where I set the global and local work items as follows
const size_t globalWorkItems[1] = {result_row};
const size_t localWorkItems[1] = {(size_t)local_wi_size};
local_wi_size is result_row/number of compute units (such that result_row % compute units == 0)
Your code is pretty close, but the indexing into the local memory array is actually simpler than you think. You have a row in private memory and a column in local memory, and you need to compute the dot product of these two vectors. You just need to sum row[k]*col[k], for k = 0 up to N-1:
for(k=0;k<rB;k+=1) {
    value += tmp_array[k] * local_mem[k];
}
There's actually a second, more subtle bug that is also present in the example solution given on the slides you are using. Since you are reading and writing local memory inside a loop, you actually need two barriers, in order to make sure that work-items writing to local memory on iteration i don't overwrite values that are being read by other work-items executing iteration i-1.
Therefore, the full code for your kernel (tested and working), should look something like this:
__kernel void mmul(
__global int* C,
__global int* A,
__global int* B,
const int rA,
const int rB,
const int cC,
__local char* local_mem)
{
    int k,ty;
    int tx = get_global_id(0);
    int loctx = get_local_id(0);
    int loc_size = get_local_size(0);
    int value = 0;
    int tmp_array[1000];
    // Copy row tx of A into private memory (A has rB columns, since cols(A) == rows(B))
    for(k=0;k<rB;k++) {
        tmp_array[k] = A[tx * rB + k] ;
    }
    for (ty=0 ; ty < cC ; ty++) {
        // Work-items of the group cooperatively load column ty of B
        for (k = loctx ; k < rB ; k+=loc_size) {
            local_mem[k] = B[ty + k * cC];
        }
        barrier(CLK_LOCAL_MEM_FENCE); // First barrier to ensure writes have finished
        value = 0;
        for(k=0;k<rB;k+=1) {
            value += tmp_array[k] * local_mem[k];
        }
        C[ty + (tx * cC)] = value;
        barrier(CLK_LOCAL_MEM_FENCE); // Second barrier to ensure reads have finished
    }
}
You can find the full set of exercises and solutions that go with the slides you are looking at on the HandsOnOpenCL GitHub page. There's also a more complete set of slides from the same tutorial available here, which go on to show a much more optimised matrix multiply example that uses a blocking approach to better exploit temporal and spatial locality. The aforementioned missing barrier bug has been fixed in the example solution code, but not on the slides (yet).
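The two-barrier pattern described above can be emulated on the CPU with ordinary threads, which is sometimes handy for reasoning about the synchronisation. The following is a rough Python sketch under simplifying assumptions: a single work-group with one thread per row of A, a shared list standing in for __local memory, and threading.Barrier standing in for barrier(CLK_LOCAL_MEM_FENCE). The function name mmul_workgroup and the sample matrices are illustrative only.

```python
import threading


def mmul_workgroup(A, B, rA, rB, cC):
    """Emulate one work-group: one thread per row of A, columns of B staged in shared memory."""
    C = [0] * (rA * cC)
    local_mem = [0] * rB               # stands in for __local memory
    barrier = threading.Barrier(rA)    # stands in for barrier(CLK_LOCAL_MEM_FENCE)

    def work_item(tx):
        row = A[tx * rB:(tx + 1) * rB]       # private copy of row tx of A
        for ty in range(cC):
            for k in range(tx, rB, rA):      # strided cooperative load of column ty
                local_mem[k] = B[ty + k * cC]
            barrier.wait()                   # all writes to local_mem have finished
            C[ty + tx * cC] = sum(row[k] * local_mem[k] for k in range(rB))
            barrier.wait()                   # all reads finished before the next overwrite

    threads = [threading.Thread(target=work_item, args=(tx,)) for tx in range(rA)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C


A = [1, 2, 3, 4, 5, 6]        # 2x3, row major
B = [7, 8, 9, 10, 11, 12]     # 3x2, row major
C = mmul_workgroup(A, B, 2, 3, 2)
```

Removing the second barrier.wait() in this sketch allows a fast thread to start overwriting local_mem for the next column while a slower thread is still reading the current one, which is the race the second barrier prevents.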

TERCOM algorithm - Changing from single thread to multiple threads in CUDA

I'm currently working on porting a TERCOM algorithm from a single thread to multiple threads. Briefly explained, the TERCOM algorithm receives 5 measurements and the heading, and compares these measurements to a prestored map. The algorithm chooses the best match, i.e. the lowest Mean Absolute Difference (MAD), and returns the position.
The code works perfectly with one thread and for-loops, but when I try to use multiple threads and blocks it returns the wrong answer. It seems like the multithreaded version doesn't "run through" the calculation in the same way as the single-threaded version. Does anyone know what I am doing wrong?
Here's the code using for-loops
__global__ void kernel (int m, int n, int h, int N, float *f, float heading, float *measurements)
{
    //Without threads
    float pos[2]={0};
    float theta=heading*(PI/180);
    float MAD=0;
    // Calculate how much to move in x and y direction
    float offset_x = h*cos(theta);
    float offset_y = -h*sin(theta);
    float min=100000; //Some High value
    //Calculate Mean Absolute Difference
    for(float row=0;row<m;row++)
    {
        for(float col=0;col<n;col++)
        {
            for(float g=0; g<N; g++)
            {
                f[(int)g] = tex2D (tex, col+(g-2)*offset_x+0.5f, row+(g-2)*offset_y+0.5f);
                MAD += abs(measurements[(int)g]-f[(int)g]);
            }
            MAD=0; //Reset MAD
        }
    }
}
This is my attempt to use multiple threads
__global__ void kernel (int m, int n, int h, int N, float *f, float heading, float *measurements)
{
    // With threads
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    float pos[2]={0};
    float theta=heading*(PI/180);
    float MAD=0;
    // Calculate how much to move in x and y direction
    float offset_x = h*cos(theta);
    float offset_y = -h*sin(theta);
    float min=100000; //Some High value
    if(idx < n && idy < m)
    {
        for(float g=0; g<N; g++)
        {
            f[(int)g] = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
            MAD += abs(measurements[(int)g]-f[(int)g]);
        }
        MAD=0; //Reset MAD
    }
}
To launch the kernel
dim3 dimBlock( 16,16 );
dim3 dimGrid;
dimGrid.x = (n + dimBlock.x - 1)/dimBlock.x;
dimGrid.y = (m + dimBlock.y - 1)/dimBlock.y;
kernel <<< dimGrid,dimBlock >>> (m, n, h, N, dev_results, heading, dev_measurements);
The basic problem here is that you have a memory race in the code, centered around the use of f as both some sort of thread local scratch space and an output variable. Every concurrent thread will be trying to write values into the same locations in f simultaneously, which will produce undefined behaviour.
As best as I can tell, the use of f as scratch space isn't even necessary at all and the main computational section of the kernel could be written as something like:
if(idx < n && idy < m)
{
    for(float g=0; g<N; g++)
    {
        float fval = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
        MAD += abs(measurements[(int)g]-fval);
    }
}
[disclaimer: written in browser, use at own risk]
At the end of that calculation, each thread has its own values of min and pos. At a minimum these must be stored in unique global memory (i.e. the output must have enough space for each thread's result). You will then need to perform some sort of reduction operation to obtain the global minimum from the set of thread local values. That could be done on the host, in device code, or in some combination of the two. There is a lot of code already available for CUDA parallel reductions, which you should be able to find by searching and/or looking in the examples supplied with the CUDA toolkit. It should be trivial to adapt them to your specific case, where you need to retain the position along with the minimum value.
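As a rough illustration of a reduction that keeps the position together with the minimum, here is a plain-Python sketch of a pairwise (tree) reduction over (value, index) pairs. It only mirrors the combining pattern of a typical shared-memory reduction; the function name argmin_reduce, the sample MAD values, and the map width n = 2 are assumptions made for the example.

```python
def argmin_reduce(values):
    """Pairwise (tree) reduction that carries the index alongside the minimum."""
    pairs = [(v, i) for i, v in enumerate(values)]
    while len(pairs) > 1:
        half = (len(pairs) + 1) // 2
        # Slot t is combined with slot t + half, as in one step of a parallel reduction
        pairs = [
            min(pairs[t], pairs[t + half]) if t + half < len(pairs) else pairs[t]
            for t in range(half)
        ]
    return pairs[0]


mads = [4.0, 2.5, 7.0, 0.5, 3.0]      # one MAD value per thread (illustrative)
best_mad, best_index = argmin_reduce(mads)

n = 2                                  # assumed map width, to recover (row, col)
best_row, best_col = divmod(best_index, n)
```

Because tuples compare element by element, ties in the MAD value resolve to the lower index; a device implementation would need to choose its own tie-breaking rule.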

Optimizing CUDA interpolation

I have developed the following interpolation with CUDA and I am looking for a way to improve it. For various reasons, I don't want to use CUDA textures.
The other thing I have noticed is that, for some unknown reason, the interpolation is not performed on the whole vector when the size of the vector is greater than the number of threads (for example, with a vector of size 1000 and 512 threads). A thread does its first job and that’s all. I would also like to optimize the singleInterp function.
Here is my code:
__device__ float singleInterp(float* data, float x, int lx_data) {
float res = 0;
int i1=0;
int j=lx_data;
int imid;
while (j>i1+1)
imid = (int)(i1+j+1)/2;
if (data[imid]<x)
if (i1==j)
res = data[i1+lx_data];
res =__fmaf_rn( __fdividef(data[j+lx_data]-data[i1+lx_data],(data[j]-data[i1])),x-data[i1], data[i1+lx_data]);
return res;
__global__ void linearInterpolation(float* data, float* x_in, int lx_data) {
int i = threadIdx.x + blockDim.x * blockIdx.x;
int index = i;
if (index < lx_data)
x_in[index] = singleInterp(data, x_in[index], lx_data);
It seems that you are interested in 1D linear interpolation. I have already had to optimize this kind of interpolation, and I ended up with the following code:
__global__ void linear_interpolation_kernel_function_GPU(double* __restrict__ result_d, const double* __restrict__ data_d, const double* __restrict__ x_out_d, const int M, const int N)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;
    double reg_x_out = x_out_d[j/2]+M/2;
    int k = floor(reg_x_out);
    double a = (reg_x_out)-floor(reg_x_out);
    double dk = data_d[2*k+(j&1)];
    double dkp1 = data_d[2*k+2+(j&1)];
    result_d[j] = a * dkp1 + (-dk * a + dk);
}
The data are assumed to be sampled at integer nodes between -M/2 and M/2.
The code is "equivalent" to 1D texture interpolation, as explained at the following web-page. For the 1D linear texture interpolation, see Fig. 13 of the CUDA-Programming-Guide. For comparisons between different solutions, please see the following thread.
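The arithmetic of the kernel can be checked on the CPU with a few lines of plain Python. This is a simplified sketch that assumes real-valued samples (the kernel indexes data_d with 2*k+(j&1), which suggests interleaved pairs of values) and query points that lie strictly inside the sampled range. The function name lerp_uniform and the sample data are illustrative.

```python
import math


def lerp_uniform(data, x, M):
    """Linearly interpolate samples taken at integer nodes -M//2, -M//2 + 1, ..."""
    shifted = x + M // 2               # move the first node to index 0
    k = math.floor(shifted)            # index of the node to the left of x
    a = shifted - k                    # fractional distance past node k
    dk, dkp1 = data[k], data[k + 1]
    return a * dkp1 + (-dk * a + dk)   # same form as the kernel: dk + a*(dkp1 - dk)


data = [0.0, 10.0, 20.0, 40.0]         # samples at nodes -2, -1, 0, 1
y1 = lerp_uniform(data, -1.5, 4)       # halfway between nodes -2 and -1
y2 = lerp_uniform(data, 0.25, 4)       # a quarter of the way from node 0 to node 1
```

Since the nodes are uniformly spaced, the left neighbour is found with a single floor instead of the binary search used in singleInterp.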

Data structures and algorithms for adaptive "uniform" mesh?

I need a data structure for storing float values on a uniformly sampled 3D mesh:
x = x0 + ix*dx where 0 <= ix < nx
y = y0 + iy*dy where 0 <= iy < ny
z = z0 + iz*dz where 0 <= iz < nz
Up to now I have used my Array class:
Array3D<float> A(nx, ny,nz);
A(0,0,0) = 0.0f; // ix = iy = iz = 0
Internally it stores the float values as a 1D array with nx * ny * nz elements.
However, I now need to represent a mesh with more values than I have RAM,
e.g. nx = ny = nz = 2000.
I think many neighbouring nodes in such a mesh may have similar values, so I was wondering whether there is some simple way to "coarsen" the mesh adaptively.
For instance, if the 8 (ix,iy,iz) nodes of a cell in this mesh have values that are less than 5% apart, they are "removed" and replaced by just one value: the mean of the 8 values.
How could I implement such a data structure in a simple and efficient way?
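The coarsening rule described above can be written down as a small test on one cell. The sketch below is plain Python and makes one assumption the text leaves open: "less than 5% apart" is interpreted as the spread (max minus min) being at most 5% of the largest absolute value in the cell. The function name coarsen_cell and the sample cells are illustrative.

```python
def coarsen_cell(values, tol=0.05):
    """Return the mean of the cell if its values are within tol of each other, else None.

    Assumption: "within tol" means (max - min) <= tol * max(|min|, |max|).
    """
    lo, hi = min(values), max(values)
    scale = max(abs(lo), abs(hi))
    if scale == 0.0 or (hi - lo) <= tol * scale:
        return sum(values) / len(values)
    return None        # keep all 8 node values for this cell


smooth_cell = [1.0, 1.01, 1.02, 1.0, 1.03, 1.01, 1.02, 1.03]
rough_cell = [1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
merged = coarsen_cell(smooth_cell)     # collapses to a single mean value
kept = coarsen_cell(rough_cell)        # None: the cell stays at full resolution
```

Applying the same test recursively to groups of already merged cells leads naturally to an octree-style layout.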
Thanks, Ante, for suggesting lossy compression. I think this could work in the following way:
#define BLOCK_SIZE 64
// compress() and decompress() stand for the (not yet chosen) lossy codec
struct CompressedArray3D {
    CompressedArray3D(int ni, int nj, int nk) {
        this->ni = ni;
        this->nj = nj;
        this->nk = nk;
        NI = ni/BLOCK_SIZE + 1;
        NJ = nj/BLOCK_SIZE + 1;
        NK = nk/BLOCK_SIZE + 1;
        blocks = new float*[NI*NJ*NK];
        compressedSize = new unsigned int[NI*NJ*NK];
    }
    void setBlock(int I, int J, int K, float values[BLOCK_SIZE][BLOCK_SIZE][BLOCK_SIZE]) {
        unsigned int csize;
        blocks[I*NJ*NK + J*NK + K] = compress(values, csize);
        compressedSize[I*NJ*NK + J*NK + K] = csize;
    }
    float getValue(int i, int j, int k) {
        // block indices
        int I = i/BLOCK_SIZE;
        int J = j/BLOCK_SIZE;
        int K = k/BLOCK_SIZE;
        // offsets inside the block
        int ii = i - I*BLOCK_SIZE;
        int jj = j - J*BLOCK_SIZE;
        int kk = k - K*BLOCK_SIZE;
        float *compressedBlock = blocks[I*NJ*NK + J*NK + K];
        unsigned int csize = compressedSize[I*NJ*NK + J*NK + K];
        // scratch buffer for the decompressed block; static keeps 1 MiB off the stack (not thread-safe)
        static float values[BLOCK_SIZE][BLOCK_SIZE][BLOCK_SIZE];
        decompress(compressedBlock, csize, values);
        return values[ii][jj][kk];
    }
    // number of blocks:
    int NI, NJ, NK;
    // number of samples:
    int ni, nj, nk;
    float** blocks;
    unsigned int* compressedSize;
};
For this to be useful I need a lossy compression that is:
extremely fast, even on small datasets (e.g. 64x64x64)
compresses quite hard (> 3x); never mind if it loses quite a bit of information.
Any good candidates?
It sounds like you're looking for a LOD (level of detail) adaptive mesh. It's a recurring theme in video games and terrain simulation.
For terrain, see here: -- look for the ROAM video which is IIRC not only adaptive by distance, but also by view direction.
For non-terrain entities, there is a huge body of work (here's one example: Generic Adaptive Mesh Refinement).
I would suggest using OctoMap to handle large 3D data,
and extending it as shown here to handle geometrical properties.
