Rcpp Parallel or openmp for matrixvector product - openmp

I am trying to program the naive parallel version of Conjugate gradient, so I started with the simple Wikipedia algorithm, and I want to change the dot-products and MatrixVector products by their appropriate parallel version, The Rcppparallel documentation has the code for the dot-product using parallelReduce; I think I'm gonna use that version for my code, but I'm trying to make the MatrixVector multiplication, but I haven't achieved good results compared to R base (no parallel)
Some versions of parallel matrix multiplication: using OpenMP, Rcppparallel, serial version, a serial version with Armadillo, and the benchmark
// [[Rcpp::depends(RcppParallel)]]
#include <Rcpp.h>
#include <RcppParallel.h>
#include <numeric>
// #include <cstddef>
// #include <cstdio>
#include <iostream>
using namespace RcppParallel;
using namespace Rcpp;
struct InnerProduct : public Worker
{
// source vectors
const RVector<double> x;
const RVector<double> y;
// product that I have accumulated
double product;
// constructors
InnerProduct(const NumericVector x, const NumericVector y)
: x(x), y(y), product(0) {}
InnerProduct(const InnerProduct& innerProduct, Split)
: x(innerProduct.x), y(innerProduct.y), product(0) {}
// process just the elements of the range I've been asked to
void operator()(std::size_t begin, std::size_t end) {
product += std::inner_product(x.begin() + begin,
x.begin() + end,
y.begin() + begin,
0.0);
}
// join my value with that of another InnerProduct
void join(const InnerProduct& rhs) {
product += rhs.product;
}
};
struct MatrixMultiplication : public Worker
{
// source matrix
const RMatrix<double> A;
//source vector
const RVector<double> x;
// destination matrix
RMatrix<double> out;
// initialize with source and destination
MatrixMultiplication(const NumericMatrix A, const NumericVector x, NumericMatrix out)
: A(A), x(x), out(out) {}
// take the square root of the range of elements requested
void operator()(std::size_t begin, std::size_t end) {
for (std::size_t i = begin; i < end; i++) {
// rows we will operate on
//RMatrix<double>::Row rowi = A.row(i);
RMatrix<double>::Row rowi = A.row(i);
//double res = std::inner_product(rowi.begin(), rowi.end(), x.begin(), 0.0);
//Rcout << "res" << res << std::endl;
out(i,1) = std::inner_product(rowi.begin(), rowi.end(), x.begin(), 0.0);
//Rcout << "res" << out(i,1) << std::endl;
}
}
};
// [[Rcpp::export]]
double parallelInnerProduct(NumericVector x, NumericVector y) {
// declare the InnerProduct instance that takes a pointer to the vector data
InnerProduct innerProduct(x, y);
// call paralleReduce to start the work
parallelReduce(0, x.length(), innerProduct);
// return the computed product
return innerProduct.product;
}
//librar(Rbenchmark)
// [[Rcpp::export]]
NumericVector matrixXvectorRcppParallel(NumericMatrix A, NumericVector x) {
// // declare the InnerProduct instance that takes a pointer to the vector data
// InnerProduct innerProduct(x, y);
int nrows = A.nrow();
NumericVector out(nrows);
for(int i = 0; i< nrows;i++ )
{
out(i) = parallelInnerProduct(A(i,_),x);
}
// return the computed product
return out;
}
// [[Rcpp::export]]
arma::rowvec matrixXvectorParallel(arma::mat A, arma::colvec x){
arma::rowvec y = A.row(0)*0;
int filas = A.n_rows;
int columnas = A.n_cols;
#pragma omp parallel for
for(int j=0;j<columnas;j++)
{
//y(j) = A.row(j)*x(j))
y(j) = dotproduct(A.row(j),x);
}
return y;
}
arma::mat matrixXvector2(arma::mat A, arma::mat x){
//arma::rowvec y = A.row(0)*0;
//y=A*x;
return A*x;
}
arma::rowvec matrixXvectorParallel2(arma::mat A, arma::colvec x){
arma::rowvec y = A.row(0)*0;
int filas = A.n_rows;
int columnas = A.n_cols;
#pragma omp parallel for
for(int j = 0; j < columnas ; j++){
double result = 0;
for(int i = 0; i < filas; i++){
result += x(i)*A(j,i);
}
y(j) = result;
}
return y;
}
Benchmark
test replications elapsed relative user.self sys.self user.child sys.child
1 M %*% a 20 0.026 1.000 0.140 0.060 0 0
2 matrixXvector2(M, as.matrix(a)) 20 0.040 1.538 0.101 0.217 0 0
4 matrixXvectorParallel2(M, a) 20 0.063 2.423 0.481 0.000 0 0
3 matrixXvectorParallel(M, a) 20 0.146 5.615 0.745 0.398 0 0
5 matrixXvectorRcppParallel(M, a) 20 0.335 12.885 2.305 0.079 0 0
My last trial at the moment was using parallefor with Rcppparallel, but I'm getting memory errors and I dont have idea where the problem is
// [[Rcpp::export]]
NumericVector matrixXvectorRcppParallel2(NumericMatrix A, NumericVector x) {
// // declare the InnerProduct instance that takes a pointer to the vector data
int nrows = A.nrow();
NumericMatrix out(nrows,1); //allocar mempria de vector de salida
//crear worker
MatrixMultiplication matrixMultiplication(A, x, out);
parallelFor(0,A.nrow(),matrixMultiplication);
// return the computed product
return out;
}
What I notice is that when I check in my terminal using htop how the processors are working, I see in htop when I apply the conventional Matrix vector multiplication using R-base, that is using all the processors, so Does the matrix multiplication perform parallel by default? because in theory, only one processor should be working if is the serial version.
If someone knows which is the better path, OpenMP or Rcppparallel, or another way, that gives me better performance than the apparently serial version of R-base.
The serial code for conjugte gradient at the moment
// [[Rcpp::export]]
arma::colvec ConjugateGradient(arma::mat A, arma::colvec xini, arma::colvec b, int num_iteraciones){
//arma::colvec xnew = xini*0 //inicializar en 0's
arma::colvec x= xini; //inicializar en 0's
arma::colvec rkold = b - A*xini;
arma::colvec rknew = b*0;
arma::colvec pk = rkold;
int k=0;
double alpha_k=0;
double betak=0;
double normak = 0.0;
for(k=0; k<num_iteraciones;k++){
Rcout << "iteracion numero " << k << std::endl;
alpha_k = sum(rkold.t() * rkold) / sum(pk.t()*A*pk); //sum de un elemento para realizar casting
(pk.t()*A*pk);
x = x+ alpha_k * pk;
rknew = rkold - alpha_k*A*pk;
normak = sum(rknew.t()*rknew);
if( normak < 0.000001){
break;
}
betak = sum(rknew.t()*rknew) / sum( rkold.t() * rkold );
//actualizar valores para siguiente iteracion
pk = rknew + betak*pk;
rkold = rknew;
}
return x;
}
I wasn't aware of the use of BLAS in R, thanks Hong Ooi and tim18, so the new benchmark using option(matprod="internal") and option(matprod="blas")
options(matprod = "internal")
res<-benchmark(M%*%a,matrixXvector2(M,as.matrix(a)),matrixXvectorParallel(M,a),matrixXvectorParallel2(M,a),matrixXvectorRcppParallel(M,a),order="relative",replications = 20)
res
test replications elapsed relative user.self sys.self user.child sys.child
2 matrixXvector2(M, as.matrix(a)) 20 0.043 1.000 0.107 0.228 0 0
4 matrixXvectorParallel2(M, a) 20 0.069 1.605 0.530 0.000 0 0
1 M %*% a 20 0.072 1.674 0.071 0.000 0 0
3 matrixXvectorParallel(M, a) 20 0.140 3.256 0.746 0.346 0 0
5 matrixXvectorRcppParallel(M, a) 20 0.343 7.977 2.272 0.175 0 0
options(matprod="blas")
options(matprod = "blas")
res<-benchmark(M%*%a,matrixXvector2(M,as.matrix(a)),matrixXvectorParallel(M,a),matrixXvectorParallel2(M,a),matrixXvectorRcppParallel(M,a),order="relative",replications = 20)
res
test replications elapsed relative user.self sys.self user.child sys.child
1 M %*% a 20 0.021 1.000 0.093 0.054 0 0
2 matrixXvector2(M, as.matrix(a)) 20 0.092 4.381 0.177 0.464 0 0
5 matrixXvectorRcppParallel(M, a) 20 0.328 15.619 2.143 0.109 0 0
4 matrixXvectorParallel2(M, a) 20 0.438 20.857 3.036 0.000 0 0
3 matrixXvectorParallel(M, a) 20 0.546 26.000 3.667 0.127 0 0

As you already found out, the base R matrix multiplication can be multi-threaded, if a multi-threaded BLAS implementation is used. This is the case for the rocker/* docker images, which typically use OpenBLAS.
In addition, (Rcpp)Armadillo already uses the BLAS library used by R (in this case multi-threaded OpenBLAS) as well as OpenMP. So your "serial" version is actually multi-threaded. You can verify this in htop with a large enough matrix as input.
BTW, what you are trying to do looks like premature optimization to me.

Related

Dealing with matrices in CUDA: understanding basic concepts

I'm building a CUDA kernel to compute the numerical N*N jacobian of a function, using finite differences; in the example I provided, it is the square function (each entry of the vector is squared). The host coded allocates in linear memory, while I'm using a 2-dimensional indexing in the kernel.
My issue is that I haven't found a way to sum on the diagonal of the matrices cudaMalloc'ed. My attempt has been to use the statement threadIdx.x == blockIdx.x as a condition for the diagonal, but instead it evaluates to true only for them both at 0.
Here is the kernel and EDIT: I posted the whole code as an answer, based on the suggestions in the comments (the main() is basically the same, while the kernel is not)
template <typename T>
__global__ void jacobian_kernel (
T * J,
const T t0,
const T tn,
const T h,
const T * u0,
const T * un,
const T * un_old)
{
T cgamma = 2 - sqrtf(2);
const unsigned int t = threadIdx.x;
const unsigned int b = blockIdx.x;
const unsigned int tid = t + b * blockDim.x;
/*__shared__*/ T temp_sx[BLOCK_SIZE][BLOCK_SIZE];
/*__shared__*/ T temp_dx[BLOCK_SIZE][BLOCK_SIZE];
__shared__ T sm_temp_du[BLOCK_SIZE];
T* temp_du = &sm_temp_du[0];
if (tid < N )
{
temp_sx[b][t] = un[t];
temp_dx[b][t] = un[t];
if ( t == b )
{
if ( tn == t0 )
{
temp_du[t] = u0[t]*0.001;
temp_sx[b][t] += temp_du[t]; //(*)
temp_dx[b][t] -= temp_du[t];
temp_sx[b][t] += ( abs( temp_sx[b][t] ) < 10e-6 ? 0.1 : 0 );
temp_dx[b][t] += ( abs( temp_dx[b][t] ) < 10e-6 ? 0.1 : 0 );
temp_sx[b][t] = ( temp_sx[b][t] == 0 ? 0.1 : temp_sx[b][t] );
temp_dx[b][t] = ( temp_dx[b][t] == 0 ? 0.1 : temp_dx[b][t] );
}
else
{
temp_du[t] = MAX( un[t] - un_old[t], 10e-6 );
temp_sx[b][t] += temp_du[t];
temp_dx[b][t] -= temp_du[t];
}
}
__syncthreads();
//J = f(tn, un + du)
d_func(tn, (temp_sx[b]), (temp_sx[b]), 1.f);
d_func(tn, (temp_dx[b]), (temp_dx[b]), 1.f);
__syncthreads();
J[tid] = (temp_sx[b][t] - temp_dx[b][t]) * powf((2 * temp_du[t]), -1);
//J[tid]*= - h*cgamma/2;
//J[tid]+= ( t == b ? 1 : 0);
//J[tid] = temp_J[tid];
}
}
The general procedure for computing the jacobian is
Copy un into every row of temp_sx and temp_dx
Compute du as a 0.01 magnitude from u0
Sum du to the diagonal of temp_sx, subtract du from the diagonal of temp_dx
Compute the square function on each entry of temp_sx and temp_dx
Subtract them and divide every entry by 2*du
This procedure can be summarized with (f(un + du*e_i) - f(un - du*e_i))/2*du.
My problem is to sum du to the diagonal of the matrices of temp_sx and temp_dx like I tried in (*). How can I achieve that?
EDIT: Now calling 1D blocks and threads; in fact, .y axis wasn't used at all in the kernel. I'm calling the kernel with a fixed amount of shared memory
Note that in int main() I'm calling the kernel with
#define REAL sizeof(float)
#define N 32
#define BLOCK_SIZE 16
#define NUM_BLOCKS ((N*N + BLOCK_SIZE - 1)/ BLOCK_SIZE)
...
dim3 dimGrid(NUM_BLOCKS,);
dim3 dimBlock(BLOCK_SIZE);
size_t shm_size = N*N*REAL;
jacobian_kernel <<< dimGrid, dimBlock, size_t shm_size >>> (...);
So that I attempt to deal with block-splitting the function calls. In the kernel to sum on the diagonal I used if(threadIdx.x == blockIdx.x){...}. Why isn't this correct? I'm asking it because while debugging and making the code print the statement, It only evaluates true if they both are 0. Thus du[0] is the only numerical value and the matrix becomes nan. Note that this approach worked with the first code I built, where instead I called the kernel with
jacobian_kernel <<< N, N >>> (...)
So that when threadIdx.x == blockIdx.x the element is on the diagonal. This approach doesn't fit anymore though, since now I need to deal with larger N (possibly larger than 1024, which is the maximum number of threads per block).
What statement should I put there that works even if the matrices are split into blocks and threads?
Let me know if I should share some other info.
Here is how I managed to solve my problem, based on the suggestion in the comments on the answer. The example is compilable, provided you put helper_cuda.h and helper_string.h in the same directory or you add -I directive to the CUDA examples include path, installed along with the CUDA toolkit. The relevant changes are only in the kernel; there's a minor change in the main() though, since I was calling double the resources to execute the kernel, but the .y axis of the grid of thread blocks wasn't even used at all, so it didn't generate any error.
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <math.h>
#include <assert.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include "helper_cuda.h"
#include "helper_string.h"
#include <fstream>
#ifndef MAX
#define MAX(a,b) ((a > b) ? a : b)
#endif
#define REAL sizeof(float)
#define N 128
#define BLOCK_SIZE 128
#define NUM_BLOCKS ((N*N + BLOCK_SIZE - 1)/ BLOCK_SIZE)
template <typename T>
inline void printmatrix( T mat, int rows, int cols);
template <typename T>
__global__ void jacobian_kernel ( const T * A, T * J, const T t0, const T tn, const T h, const T * u0, const T * un, const T * un_old);
template<typename T>
__device__ void d_func(const T t, const T u[], T res[], const T h = 1);
template<typename T>
int main ()
{
float t0 = 0.; //float tn = 0.;
float h = 0.1;
float* u0 = (float*)malloc(REAL*N); for(int i = 0; i < N; ++i){u0[i] = i+1;}
float* un = (float*)malloc(REAL*N); memcpy(un, u0, REAL*N);
float* un_old = (float*)malloc(REAL*N); memcpy(un_old, u0, REAL*N);
float* J = (float*)malloc(REAL*N*N);
float* A = (float*)malloc(REAL*N*N); host_heat_matrix(A);
float *d_u0;
float *d_un;
float *d_un_old;
float *d_J;
float *d_A;
checkCudaErrors(cudaMalloc((void**)&d_u0, REAL*N)); //printf("1: %p\n", d_u0);
checkCudaErrors(cudaMalloc((void**)&d_un, REAL*N)); //printf("2: %p\n", d_un);
checkCudaErrors(cudaMalloc((void**)&d_un_old, REAL*N)); //printf("3: %p\n", d_un_old);
checkCudaErrors(cudaMalloc((void**)&d_J, REAL*N*N)); //printf("4: %p\n", d_J);
checkCudaErrors(cudaMalloc((void**)&d_A, REAL*N*N)); //printf("4: %p\n", d_J);
checkCudaErrors(cudaMemcpy(d_u0, u0, REAL*N, cudaMemcpyHostToDevice)); assert(d_u0 != NULL);
checkCudaErrors(cudaMemcpy(d_un, un, REAL*N, cudaMemcpyHostToDevice)); assert(d_un != NULL);
checkCudaErrors(cudaMemcpy(d_un_old, un_old, REAL*N, cudaMemcpyHostToDevice)); assert(d_un_old != NULL);
checkCudaErrors(cudaMemcpy(d_J, J, REAL*N*N, cudaMemcpyHostToDevice)); assert(d_J != NULL);
checkCudaErrors(cudaMemcpy(d_A, A, REAL*N*N, cudaMemcpyHostToDevice)); assert(d_A != NULL);
dim3 dimGrid(NUM_BLOCKS); std::cout << "NUM_BLOCKS \t" << dimGrid.x << "\n";
dim3 dimBlock(BLOCK_SIZE); std::cout << "BLOCK_SIZE \t" << dimBlock.x << "\n";
size_t shm_size = N*REAL; //std::cout << shm_size << "\n";
//HERE IS A RELEVANT CHANGE OF THE MAIN, SINCE I WAS CALLING
//THE KERNEL WITH A 2D GRID BUT WITHOUT USING THE .y AXIS,
//WHILE NOW THE GRID IS 1D
jacobian_kernel <<< dimGrid, dimBlock, shm_size >>> (d_A, d_J, t0, t0, h, d_u0, d_un, d_un_old);
checkCudaErrors(cudaMemcpy(J, d_J, REAL*N*N, cudaMemcpyDeviceToHost)); //printf("4: %p\n", d_J);
printmatrix( J, N, N);
checkCudaErrors(cudaDeviceReset());
free(u0);
free(un);
free(un_old);
free(J);
}
template <typename T>
__global__ void jacobian_kernel (
const T * A,
T * J,
const T t0,
const T tn,
const T h,
const T * u0,
const T * un,
const T * un_old)
{
T cgamma = 2 - sqrtf(2);
const unsigned int t = threadIdx.x;
const unsigned int b = blockIdx.x;
const unsigned int tid = t + b * blockDim.x;
/*__shared__*/ T temp_sx[BLOCK_SIZE][BLOCK_SIZE];
/*__shared__*/ T temp_dx[BLOCK_SIZE][BLOCK_SIZE];
__shared__ T sm_temp_du;
T* temp_du = &sm_temp_du;
//HERE IS A RELEVANT CHANGE (*)
if ( t < BLOCK_SIZE && b < NUM_BLOCKS )
{
temp_sx[b][t] = un[t]; //printf("temp_sx[%d] = %f\n", t,(temp_sx[b][t]));
temp_dx[b][t] = un[t];
//printf("t = %d, b = %d, t + b * blockDim.x = %d \n",t, b, tid);
//HERE IS A NOTE (**)
if ( t == b )
{
//printf("t = %d, b = %d \n",t, b);
if ( tn == t0 )
{
*temp_du = u0[t]*0.001;
temp_sx[b][t] += *temp_du;
temp_dx[b][t] -= *temp_du;
temp_sx[b][t] += ( abs( temp_sx[b][t] ) < 10e-6 ? 0.1 : 0 );
temp_dx[b][t] += ( abs( temp_dx[b][t] ) < 10e-6 ? 0.1 : 0 );
temp_sx[b][t] = ( temp_sx[b][t] == 0 ? 0.1 : temp_sx[b][t] );
temp_dx[b][t] = ( temp_dx[b][t] == 0 ? 0.1 : temp_dx[b][t] );
}
else
{
*temp_du = MAX( un[t] - un_old[t], 10e-6 );
temp_sx[b][t] += *temp_du;
temp_dx[b][t] -= *temp_du;
}
;
}
//printf("du[%d] %f\n", tid, (*temp_du));
__syncthreads();
//printf("temp_sx[%d][%d] = %f\n", b, t, temp_sx[b][t]);
//printf("temp_dx[%d][%d] = %f\n", b, t, temp_dx[b][t]);
//d_func(tn, (temp_sx[b]), (temp_sx[b]), 1.f);
//d_func(tn, (temp_dx[b]), (temp_dx[b]), 1.f);
matvec_dev( tn, A, (temp_sx[b]), (temp_sx[b]), N, N, 1.f );
matvec_dev( tn, A, (temp_dx[b]), (temp_dx[b]), N, N, 1.f );
__syncthreads();
//printf("temp_sx_later[%d][%d] = %f\n", b, t, (temp_sx[b][t]));
//printf("temp_sx_later[%d][%d] - temp_dx_later[%d][%d] = %f\n", b,t,b,t, (temp_sx[b][t] - temp_dx[b][t]) / 2 * *temp_du);
//if (t == b ) printf( "2du[%d]^-1 = %f\n",t, powf((2 * *temp_du), -1));
J[tid] = (temp_sx[b][t] - temp_dx[b][t]) / (2 * *temp_du);
}
}
template<typename T>
__device__ void d_func(const T t, const T u[], T res[], const T h )
{
__shared__ float temp_u;
temp_u = u[threadIdx.x];
res[threadIdx.x] = h*powf( (temp_u), 2);
}
template <typename T>
inline void printmatrix( T mat, int rows, int cols)
{
std::ofstream matrix_out;
matrix_out.open( "heat_matrix.txt", std::ofstream::out);
for( int i = 0; i < rows; i++)
{
for( int j = 0; j <cols; j++)
{
double next = mat[i + N*j];
matrix_out << ( (next >= 0) ? " " : "") << next << " ";
}
matrix_out << "\n";
}
}
The relevant change is on (*). Before I used if (tid < N) which has two downsides:
First, it is wrong, since it should be tid < N*N, as my data is 2D, while tid is a global index which tracks all the data.
Even if I wrote tid < N*N, since I'm splitting the function calls into blocks, the t < BLOCK_SIZE && b < NUM_BLOCKS seems clearer to me in how the indexing is arranged in the code.
Moreover, the statement t == b in (**) is actually the right one to operate on the diagonal elements of the matrix. The fact that it was evaluated true only on 0 was because of my error right above.
Thanks for the suggestions!

improper mandelbrot set output plotting

i am trying to write a code to display Mandelbrot set for the numbers between
(-3,-3) to (2,2) on my terminal.
The main function generates & feeds a complex number to analyze function.
The analyze function returns character "*" for the complex number Z within the set and "." for the numbers which lie outside the set.
The code:
#define MAX_A 2 // upperbound on real
#define MAX_B 2 // upper bound on imaginary
#define MIN_A -3 // lowerbnd on real
#define MIN_B -3 // lower bound on imaginary
#define NX 300 // no. of points along x
#define NY 200 // no. of points along y
#define max_its 50
int analyze(double real,double imag);
void main()
{
double a,b;
int x,x_arr,y,y_arr;
int array[NX][NY];
int res;
for(y=NY-1,x_arr=0;y>=0;y--,x_arr++)
{
for(x=0,y_arr++;x<=NX-1;x++,y_arr++)
{
a= MIN_A+ ( x/( (double)NX-1)*(MAX_A-MIN_A) );
b= MIN_B+ ( y/( (double)NY-1 )*(MAX_B-MIN_B) );
//printf("%f+i%f ",a,b);
res=analyze(a,b);
if(res>49)
array[x][y]=42;
else
array[x][y]=46;
}
// printf("\n");
}
for(y=0;y<NY;y++)
{
for(x=0;x<NX;x++)
printf("%2c",array[x][y]);
printf("\n");
}
}
The analyze function accepts the coordinate on imaginary plane ;
and computes (Z^2)+Z 50 times ; and while computing if the complex number explodes, then function returns immidiately else the function returns after finishing 50 iterations;
int analyze(double real,double imag)
{
int iter=0;
double r=4.0;
while(iter<50)
{
if ( r < ( (real*real) + (imag*imag) ) )
{
return iter;
}
real= ( (real*real) - (imag*imag) + real);
imag= ( (2*real*imag)+ imag);
iter++;
}
return iter;
}
So, i am analyzing 60000 (NX * NY) numbers & displaying it on the terminal
considering 3:2 ratio (300,200) , i even tried 4:3 (NX:NY) , but the output remains same and the generated shape is not even close to the mandlebrot set :
hence, the output appears inverted ,
i browsed & came across lines like:
(x - 400) / ZOOM;
(y - 300) / ZOOM;
on many mandelbrot codes , but i am unable to understand how this line may rectify my output.
i guess i am having trouble in mapping output to the terminal!
(LB_Real,UB_Imag) --- (UB_Real,UB_Imag)
| |
(LB_Real,LB_Imag) --- (UB_Real,LB_Imag)
Any Hint/help will be very useful
The Mandelbrot recurrence is zn+1 = zn2 + c.
Here's your implementation:
real= ( (real*real) - (imag*imag) + real);
imag= ( (2*real*imag)+ imag);
Problem 1. You're updating real to its next value before you've used the old value to compute the new imag.
Problem 2. Assuming you fix problem 1, you're computing zn+1 = zn2 + zn.
Here's how I'd do it using double:
int analyze(double cr, double ci) {
double zr = 0, zi = 0;
int r;
for (r = 0; (r < 50) && (zr*zr + zi*zi < 4.0); ++r) {
double zr1 = zr*zr - zi*zi + cr;
double zi1 = 2 * zr * zi + ci;
zr = zr1;
zi = zi1;
}
return r;
}
But it's easier to understand if you use the standard C99 support for complex numbers:
#include <complex.h>
int analyze(double cr, double ci) {
double complex c = cr + ci * I;
double complex z = 0;
int r;
for (r = 0; (r < 50) && (cabs(z) < 2); ++r) {
z = z * z + c;
}
return r;
}

Julia Set - Cuda , improve the performance failed

Recently I am learning the examples in the book CUDA by JASON SANDERS.
the example of Juila Set makes a bad performance of 7032ms.
Here is the program:
#include <cuda.h>
#include <cuda_runtime.h>
#include <cpu_bitmap.h>
#include <book.h>
#define DIM 1024
struct cuComplex{
float r;
float i;
__device__ cuComplex(float a, float b) : r(a),i(b){
}
__device__ float magnitude2(void){
return r*r+i*i;
}
__device__ cuComplex operator *(const cuComplex& a){
return cuComplex(r*a.r-i*a.i, i*a.r+r*a.i);
}
__device__ cuComplex operator +(const cuComplex& a){
return cuComplex(r+a.r,i+a.i);
}
};
__device__ int julia(int x,int y){
const float scale = 1.5;
float jx = scale * (float)(DIM/2 - x)/(DIM/2);
float jy = scale * (float)(DIM/2 - y)/(DIM/2);
cuComplex c(-0.8,0.156);
cuComplex a(jx,jy);
int i = 0;
for(i = 0; i<200; i++){
a = a*a + c;
if(a.magnitude2() > 1000){
return 0;
}
}
return 1;
}
__global__ void kernel(unsigned char *ptr){
int x = blockIdx.x;
int y = blockIdx.y;
int offset = x + y*gridDim.x;
int juliaValue = julia(x,y);
ptr[offset*4 + 0] = 255*juliaValue;
ptr[offset*4 + 1] = 0;
ptr[offset*4 + 2] = 1;
ptr[offset*4 + 3] = 255;
}
int main(void){
CPUBitmap bitmap(DIM,DIM);
unsigned char * dev_bitmap;
dim3 grid(DIM,DIM);
dim3 blocks(DIM/16,DIM/16);
dim3 threads(16,16);
dim3 thread(DIM,DIM);
cudaEvent_t start,stop;
cudaEvent_t bitmapCpy_start,bitmapCpy_stop;
HANDLE_ERROR(cudaEventCreate(&start));
HANDLE_ERROR(cudaEventCreate(&stop));
HANDLE_ERROR(cudaEventCreate(&bitmapCpy_start));
HANDLE_ERROR(cudaEventCreate(&bitmapCpy_stop));
HANDLE_ERROR(cudaMalloc((void **)&dev_bitmap,bitmap.image_size()));
HANDLE_ERROR(cudaEventRecord(start,0));
kernel<<<grid,1>>>(dev_bitmap);
HANDLE_ERROR(cudaMemcpy(bitmap.get_ptr(),dev_bitmap,bitmap.image_size(),cudaMemcpyDeviceToHost));
//HANDLE_ERROR(cudaEventRecord(bitmapCpy_stop,0));
//HANDLE_ERROR(cudaEventSynchronize(bitmapCpy_stop));
// float copyTime;
// HANDLE_ERROR(cudaEventElapsedTime(&copyTime,bitmapCpy_start,bitmapCpy_stop));
HANDLE_ERROR(cudaEventRecord(stop,0));
HANDLE_ERROR(cudaEventSynchronize(stop));
float elapsedTime;
HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime,start,stop));
//printf("Total time is %3.1f ms, time for copying is %3.1f ms \n",elapsedTime,copyTime);
printf("Total time is %3.1f ms\n",elapsedTime);
bitmap.display_and_exit();
HANDLE_ERROR(cudaEventDestroy(start));
HANDLE_ERROR(cudaEventDestroy(stop));
HANDLE_ERROR(cudaEventDestroy(bitmapCpy_start));
HANDLE_ERROR(cudaEventDestroy(bitmapCpy_stop));
HANDLE_ERROR(cudaFree(dev_bitmap));
}
I think the main factor that influences the performance is that the program above just run 1 thread in every block:
kernel<<<grid,1>>>(dev_bitmap);
so I change the kernel like the following:
__global__ void kernel(unsigned char *ptr){
int x = threadIdx.x + blockIdx.x*blockDim.x;
int y = threadIdx.y + blockIdx.y*blockDim.y;
int offset = x + y*gridDim.x*blockIdx.x;
int juliaValue = julia(x,y);
ptr[offset*4 + 0] = 255*juliaValue;
ptr[offset*4 + 1] = 0;
ptr[offset*4 + 2] = 1;
ptr[offset*4 + 3] = 255;
}
and call kernel:
dim3 blocks(DIM/16,DIM/16);
dim3 threads(16,16);
kernel<<<blocks,threads>>>(dev_bitmap);
I think this change is not a big deal, but when I ran it, it acted like that it ran into some endless loops, no image appeared and I couldn't do anything with my screen, just blocked there.
toolkit: cuda 5.5
system: ubuntu 12.04
When I run the original code you have posted here, I get a correct display and a time of ~340ms.
When I make your kernel change, I get an "unspecified launch error" on the kernel launch.
In your modified kernel, you have the following which is an incorrect computation:
int offset = x + y*gridDim.x*blockIdx.x;
When I change it to:
int offset = x + y*gridDim.x*blockDim.x;
I get normal execution and results, and an indicated time of ~10ms.

Piecemeal processing of a matrix - CUDA

OK, so lets say I have an ( N x N ) matrix that I would like to process. This matrix is quite large for my computer, and if I try to send it to the device all at once I get a 'out of memory error.'
So is there a way to send sections of the matrix to the device? One way I can see to do it is copy portions of the matrix on the host, and then send these manageable copied portions from the host to the device, and then put them back together at the end.
Here is something I have tried, but the cudaMemcpy in the for loop returns error code 11, 'invalid argument.'
int h_N = 10000;
size_t h_size_m = h_N*sizeof(float);
h_A = (float*)malloc(h_size_m*h_size_m);
int d_N = 2500;
size_t d_size_m = d_N*sizeof(float);
InitializeMatrices(h_N);
int i;
int iterations = (h_N*h_N)/(d_N*d_N);
for( i = 0; i < iterations; i++ )
{
float* h_array_ref = h_A+(i*d_N*d_N);
cudasafe( cudaMemcpy(d_A, h_array_ref, d_size_m*d_size_m, cudaMemcpyHostToDevice), "cudaMemcpy");
cudasafe( cudaFree(d_A), "cudaFree(d_A)" );
}
What I'm trying to accomplish with the above code is this: instead of send the entire matrix to the device, I simply send a pointer to a place within that matrix and reserve enough space on the device to do the work, and then with the next iteration of the loop move the pointer forward within the matrix, etc. etc.
Not only can you do this (assuming your problem is easily decomposed this way into sub-arrays), it can be a very useful thing to do for performance; once you get the basic approach you've described working, you can start using asynchronous memory copies and double-buffering to overlap some of the memory transfer time with the time spent computing what is already on-card.
But first one gets the simple thing working. Below is a 1d example (multiplying a vector by a scalar and adding another scalar) but using a linearized 2d array would be the same; the key part is
CHK_CUDA( cudaMalloc(&xd, batchsize*sizeof(float)) );
CHK_CUDA( cudaMalloc(&yd, batchsize*sizeof(float)) );
tick(&gputimer);
int nbatches = 0;
for (int nstart=0; nstart < n; nstart+=batchsize) {
int size=batchsize;
if ((nstart + batchsize) > n) size = n - nstart;
CHK_CUDA( cudaMemcpy(xd, &(x[nstart]), size*sizeof(float), cudaMemcpyHostToDevice) );
blocksize = (size+nblocks-1)/nblocks;
cuda_saxpb<<<nblocks, blocksize>>>(xd, a, b, yd, size);
CHK_CUDA( cudaMemcpy(&(ycuda[nstart]), yd, size*sizeof(float), cudaMemcpyDeviceToHost) );
nbatches++;
}
gputime = tock(&gputimer);
CHK_CUDA( cudaFree(xd) );
CHK_CUDA( cudaFree(yd) );
You allocate the buffers at the start, and then loop through until you're done, each time doing the copy, starting the kernel, and then copying back. You free at the end.
The full code is
#include <stdio.h>
#include <stdlib.h>
#include <getopt.h>
#include <cuda.h>
#include <sys/time.h>
#include <math.h>
#define CHK_CUDA(e) {if (e != cudaSuccess) {fprintf(stderr,"Error: %s\n", cudaGetErrorString(e)); exit(-1);}}
__global__ void cuda_saxpb(const float *xd, const float a, const float b,
float *yd, const int n) {
int i = threadIdx.x + blockIdx.x*blockDim.x;
if (i<n) {
yd[i] = a*xd[i]+b;
}
return;
}
void cpu_saxpb(const float *x, float a, float b, float *y, int n) {
int i;
for (i=0;i<n;i++) {
y[i] = a*x[i]+b;
}
return;
}
int get_options(int argc, char **argv, int *n, int *s, int *nb, float *a, float *b);
void tick(struct timeval *timer);
double tock(struct timeval *timer);
int main(int argc, char **argv) {
int n=1000;
int nblocks=10;
int batchsize=100;
float a = 5.;
float b = -1.;
int err;
float *x, *y, *ycuda;
float *xd, *yd;
double abserr;
int blocksize;
int i;
struct timeval cputimer;
struct timeval gputimer;
double cputime, gputime;
err = get_options(argc, argv, &n, &batchsize, &nblocks, &a, &b);
if (batchsize > n) {
fprintf(stderr, "Resetting batchsize to size of vector, %d\n", n);
batchsize = n;
}
if (err) return 0;
x = (float *)malloc(n*sizeof(float));
if (!x) return 1;
y = (float *)malloc(n*sizeof(float));
if (!y) {free(x); return 1;}
ycuda = (float *)malloc(n*sizeof(float));
if (!ycuda) {free(y); free(x); return 1;}
/* run CPU code */
tick(&cputimer);
cpu_saxpb(x, a, b, y, n);
cputime = tock(&cputimer);
/* run GPU code */
/* only have to allocate once */
CHK_CUDA( cudaMalloc(&xd, batchsize*sizeof(float)) );
CHK_CUDA( cudaMalloc(&yd, batchsize*sizeof(float)) );
tick(&gputimer);
int nbatches = 0;
for (int nstart=0; nstart < n; nstart+=batchsize) {
int size=batchsize;
if ((nstart + batchsize) > n) size = n - nstart;
CHK_CUDA( cudaMemcpy(xd, &(x[nstart]), size*sizeof(float), cudaMemcpyHostToDevice) );
blocksize = (size+nblocks-1)/nblocks;
cuda_saxpb<<<nblocks, blocksize>>>(xd, a, b, yd, size);
CHK_CUDA( cudaMemcpy(&(ycuda[nstart]), yd, size*sizeof(float), cudaMemcpyDeviceToHost) );
nbatches++;
}
gputime = tock(&gputimer);
CHK_CUDA( cudaFree(xd) );
CHK_CUDA( cudaFree(yd) );
abserr = 0.;
for (i=0;i<n;i++) {
abserr += fabs(ycuda[i] - y[i]);
}
printf("Y = a*X + b, problemsize = %d\n", n);
printf("CPU time = %lg millisec.\n", cputime*1000.);
printf("GPU time = %lg millisec (done with %d batches of %d).\n",
gputime*1000., nbatches, batchsize);
printf("CUDA and CPU results differ by %lf\n", abserr);
free(x);
free(y);
free(ycuda);
return 0;
}
int get_options(int argc, char **argv, int *n, int *s, int *nb, float *a, float *b) {
const struct option long_options[] = {
{"nvals" , required_argument, 0, 'n'},
{"nblocks" , required_argument, 0, 'B'},
{"batchsize" , required_argument, 0, 's'},
{"a", required_argument, 0, 'a'},
{"b", required_argument, 0, 'b'},
{"help", no_argument, 0, 'h'},
{0, 0, 0, 0}};
char c;
int option_index;
int tempint;
while (1) {
c = getopt_long(argc, argv, "n:B:a:b:s:h", long_options, &option_index);
if (c == -1) break;
switch(c) {
case 'n': tempint = atoi(optarg);
if (tempint < 1 || tempint > 500000) {
fprintf(stderr,"%s: Cannot use number of points %s;\n Using %d\n", argv[0], optarg, *n);
} else {
*n = tempint;
}
break;
case 's': tempint = atoi(optarg);
if (tempint < 1 || tempint > 50000) {
fprintf(stderr,"%s: Cannot use number of points %s;\n Using %d\n", argv[0], optarg, *s);
} else {
*s = tempint;
}
break;
case 'B': tempint = atoi(optarg);
if (tempint < 1 || tempint > 1000 || tempint > *n) {
fprintf(stderr,"%s: Cannot use number of blocks %s;\n Using %d\n", argv[0], optarg, *nb);
} else {
*nb = tempint;
}
break;
case 'a': *a = atof(optarg);
break;
case 'b': *b = atof(optarg);
break;
case 'h':
puts("Calculates y[i] = a*x[i] + b on the GPU.");
puts("Options: ");
puts(" --nvals=N (-n N): Set the number of values in y,x.");
puts(" --batchsize=N (-s N): Set the number of values to transfer at a time.");
puts(" --nblocks=N (-B N): Set the number of blocks used.");
puts(" --a=X (-a X): Set the parameter a.");
puts(" --b=X (-b X): Set the parameter b.");
puts(" --niters=N (-I X): Set number of iterations to calculate.");
puts("");
return +1;
}
}
return 0;
}
void tick(struct timeval *timer) {
gettimeofday(timer, NULL);
}
double tock(struct timeval *timer) {
struct timeval now;
gettimeofday(&now, NULL);
return (now.tv_usec-timer->tv_usec)/1.0e6 + (now.tv_sec - timer->tv_sec);
}
Running this one gets:
$ ./batched-saxpb --nvals=10240 --batchsize=10240 --nblocks=20
Y = a*X + b, problemsize = 10240
CPU time = 0.072 millisec.
GPU time = 0.117 millisec (done with 1 batches of 10240).
CUDA and CPU results differ by 0.000000
$ ./batched-saxpb --nvals=10240 --batchsize=5120 --nblocks=20
Y = a*X + b, problemsize = 10240
CPU time = 0.066 millisec.
GPU time = 0.133 millisec (done with 2 batches of 5120).
CUDA and CPU results differ by 0.000000
$ ./batched-saxpb --nvals=10240 --batchsize=2560 --nblocks=20
Y = a*X + b, problemsize = 10240
CPU time = 0.067 millisec.
GPU time = 0.167 millisec (done with 4 batches of 2560).
CUDA and CPU results differ by 0.000000
The GPU time goes up in this case (we're doing more memory copies) but the answers stay the same.
Edited: The original version of this code had an option for running multiple iterations of the kernel for timing purposes, but that's unnecessarily confusing in this context so it's removed.

Examples of strict aliasing of pointers in GCC C99, no performance differences

I'm trying to understand the impact of strict aliasing on performance in C99. My goal is to optimize a vector dot product, which takes up a large amount of time in my program (profiled it!). I thought that aliasing could be the problem, but the following code doesn't show any substantial difference between the standard approach and the strict aliasing version, even with vectors of size 100 million. I've also tried to use local variables to avoid aliasing, with similar results.
What's happening?
I'm using gcc-4.7 on OSX 10.7.4. Results are in microseconds.
$ /usr/local/bin/gcc-4.7 -fstrict-aliasing -Wall -std=c99 -O3 -o restrict restrict.c
$ ./restrict
sum: 100000000 69542
sum2: 100000000 70432
sum3: 100000000 70372
sum4: 100000000 69891
$ /usr/local/bin/gcc-4.7 -Wall -std=c99 -O0 -fno-strict-aliasing -o restrict restrict.c
$ ./restrict
sum: 100000000 258487
sum2: 100000000 261349
sum3: 100000000 258829
sum4: 100000000 258129
restrict.c (note this code will need several hundred MB RAM):
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <unistd.h>
/* original */
long sum(int *x, int *y, int n)
{
long i, s = 0;
for(i = 0 ; i < n ; i++)
s += x[i] * y[i];
return s;
}
/* restrict */
long sum2(int *restrict x, int *restrict y, int n)
{
long i, s = 0;
for(i = 0 ; i < n ; i++)
s += x[i] * y[i];
return s;
}
/* local restrict */
long sum3(int *x, int *y, int n)
{
int *restrict xr = x;
int *restrict yr = y;
long i, s = 0;
for(i = 0 ; i < n ; i++)
s += xr[i] * yr[i];
return s;
}
/* use local variables */
long sum4(int *x, int *y, int n)
{
int xr, yr;
long i, s = 0;
for(i = 0 ; i < n ; i++)
{
xr = x[i];
yr = y[i];
s += xr * yr;
}
return s;
}
int main(void)
{
struct timeval tp1, tp2;
struct timezone tzp;
long i, n = 1e8L, s;
int *x = malloc(sizeof(int) * n);
int *y = malloc(sizeof(int) * n);
long elapsed1;
for(i = 0 ; i < n ; i++)
x[i] = y[i] = 1;
gettimeofday(&tp1, &tzp);
s = sum(x, y, n);
gettimeofday(&tp2, &tzp);
elapsed1 = (tp2.tv_sec - tp1.tv_sec) * 1e6
+ (tp2.tv_usec - tp1.tv_usec);
printf("sum:\t%ld\t%ld\n", s, elapsed1);
gettimeofday(&tp1, &tzp);
s = sum2(x, y, n);
gettimeofday(&tp2, &tzp);
elapsed1 = (tp2.tv_sec - tp1.tv_sec) * 1e6
+ (tp2.tv_usec - tp1.tv_usec);
printf("sum2:\t%ld\t%ld\n", s, elapsed1);
gettimeofday(&tp1, &tzp);
s = sum3(x, y, n);
gettimeofday(&tp2, &tzp);
elapsed1 = (tp2.tv_sec - tp1.tv_sec) * 1e6
+ (tp2.tv_usec - tp1.tv_usec);
printf("sum3:\t%ld\t%ld\n", s, elapsed1);
gettimeofday(&tp1, &tzp);
s = sum3(x, y, n);
gettimeofday(&tp2, &tzp);
elapsed1 = (tp2.tv_sec - tp1.tv_sec) * 1e6
+ (tp2.tv_usec - tp1.tv_usec);
printf("sum4:\t%ld\t%ld\n", s, elapsed1);
return EXIT_SUCCESS;
}
Off the cuff:
with no strict aliasing rules, the compiler might simply generate optimized code that does subtly different things than intended.
It is not a given that disabling strict aliasing rules leads to faster code.
If it does, it's also not a given that the optimized code actually show different results. This depends a lot on the actual data access patterns, and often even the processor/cache architecture.
Regarding your example code, I'd say that aliasing is irrelevant (for emitted code, at least) since there is never any write access to the array elements inside the sumXXX functions.
(You might get slightly better performance (or opposite) if you pass the same vector twice. There might be a boon from hot cache and smaller cache footprint. There may be a penalty from redundant Loads putting the prefetch predictor off-track. As always: use a profiler)

Resources