Accessing submatrices using cuBLAS

I have read the following post:
Accessing submatrices using LAPACK
I would like to do something similar, calling cuBLAS routines from Fortran.
Basically I have a large matrix partitioned into 3 x 3 blocks with the partitioning changing in each step of a loop. At the moment, I allocate/free pointers for each individual sub-block and copy the relevant parts of the matrix to and from the device at each step. That creates a lot of overhead which I am hoping to eliminate. Is that feasible?

You can do device pointer arithmetic in host code in just the same way as you would with host pointers. For example, if you had an MxN matrix stored on the GPU:
float *A_d;
cudaMalloc((void **)&A_d, size_t(M*N)*sizeof(float));
and you wanted to operate on a submatrix starting at (x1,y1), then you would pass A_d + x1 + M*y1 to any cuBLAS function which expects a matrix as an argument.
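For instance, here is a minimal sketch (names and dimensions are assumed for illustration, not taken from the question) of a matrix-vector product y = sub(A) * x on a p x q sub-block starting at (x1,y1). The two essential points are the offset pointer and passing the leading dimension of the full matrix, M, so that cuBLAS steps correctly from one column of the sub-block to the next:
#include <cublas_v2.h>

// Sketch: y_d (length p) = sub(A) (p x q block of A_d starting at (x1,y1)) * x_d (length q).
// A_d is the full M x N column-major matrix on the device; error checking omitted.
void submatrix_gemv(cublasHandle_t handle, const float *A_d, int M,
                    int x1, int y1, int p, int q,
                    const float *x_d, float *y_d)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_N, p, q, &alpha,
                A_d + x1 + (size_t)M * y1, M,   // offset pointer, lda = M (full matrix)
                x_d, 1, &beta, y_d, 1);
}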

talonmies has already satisfactorily answered this question. To support his answer, and to be possibly useful to other users, I'm providing here a full example of how to use cublas<t>gemm to perform multiplications between submatrices of full matrices A and B, and how to assign the result to a submatrix of a full matrix C.
Although the question regards Fortran, the code below is given in C/C++, since I'm not using Fortran in connection with CUDA and since many users use CUDA in connection with C/C++.
The code makes use of
pointer arithmetic to access submatrices;
the concept of the leading dimension and of submatrix dimensions.
The code below considers three matrices:
A - 10 x 9;
B - 15 x 13;
C - 10 x 12.
Matrix C is initialized to all 10s. The code performs the following submatrix multiplication, written in Matlab notation:
C(1+x3:5+x3,1+y3:3+y3) = A(1+x1:5+x1,1+y1:4+y1) * B(1+x2:4+x2,1+y2:3+y2);
The Utilities.cu and Utilities.cuh files are maintained elsewhere and are omitted here.
#include <thrust/device_vector.h>
#include <thrust/random.h>
#include <cublas_v2.h>
#include "Utilities.cuh"
/********/
/* MAIN */
/********/
int main()
{
/**************************/
/* SETTING UP THE PROBLEM */
/**************************/
//const int Nrows1 = 10; // --- Number of rows of matrix 1
//const int Ncols1 = 10; // --- Number of columns of matrix 1
//const int Nrows2 = 15; // --- Number of rows of matrix 2
//const int Ncols2 = 15; // --- Number of columns of matrix 2
//const int Nrows3 = 12; // --- Number of rows of matrix 3
//const int Ncols3 = 12; // --- Number of columns of matrix 3
const int Nrows1 = 10; // --- Number of rows of matrix 1
const int Ncols1 = 9; // --- Number of columns of matrix 1
const int Nrows2 = 15; // --- Number of rows of matrix 2
const int Ncols2 = 13; // --- Number of columns of matrix 2
const int Nrows3 = 10; // --- Number of rows of matrix 3
const int Ncols3 = 12; // --- Number of columns of matrix 3
const int Nrows = 5; // --- Number of rows of the submatrix of matrix 3 = number of rows of the submatrix of matrix 1
const int Ncols = 3; // --- Number of columns of the submatrix of matrix 3 = number of columns of the submatrix of matrix 2
const int Nrowscols = 4; // --- Number of columns of the submatrix of matrix 1 = number of rows of the submatrix of matrix 2
const int x1 = 3; // --- Offset for submatrix multiplication along the rows
const int y1 = 2; // --- Offset for submatrix multiplication along the columns
const int x2 = 6; // --- Offset for submatrix multiplication along the rows
const int y2 = 4; // --- Offset for submatrix multiplication along the columns
const int x3 = 3; // --- Offset for submatrix multiplication along the rows
const int y3 = 5; // --- Offset for submatrix multiplication along the columns
// --- Random uniform integer distribution between 0 and 20
thrust::default_random_engine rng;
thrust::uniform_int_distribution<int> dist(0, 20);
// --- Matrix allocation and initialization
thrust::device_vector<float> d_matrix1(Nrows1 * Ncols1);
thrust::device_vector<float> d_matrix2(Nrows2 * Ncols2);
for (size_t i = 0; i < d_matrix1.size(); i++) d_matrix1[i] = (float)dist(rng);
for (size_t i = 0; i < d_matrix2.size(); i++) d_matrix2[i] = (float)dist(rng);
printf("\n\nOriginal full size matrix A\n");
for(int i = 0; i < Nrows1; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols1; j++)
std::cout << d_matrix1[j * Nrows1 + i] << " ";
std::cout << "]\n";
}
printf("\n\nOriginal full size matrix B\n");
for(int i = 0; i < Nrows2; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols2; j++)
std::cout << d_matrix2[j * Nrows2 + i] << " ";
std::cout << "]\n";
}
/*************************/
/* MATRIX MULTIPLICATION */
/*************************/
cublasHandle_t handle;
cublasSafeCall(cublasCreate(&handle));
thrust::device_vector<float> d_matrix3(Nrows3 * Ncols3, 10.f);
float alpha = 1.f;
float beta = 0.f;
cublasSafeCall(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, Nrows, Ncols, Nrowscols, &alpha,
thrust::raw_pointer_cast(d_matrix1.data())+x1+Nrows1*y1, Nrows1, thrust::raw_pointer_cast(d_matrix2.data())+x2+Nrows2*y2, Nrows2,
&beta, thrust::raw_pointer_cast(d_matrix3.data())+x3+Nrows3*y3, Nrows3));
printf("\n\nResult full size matrix C\n");
for(int i = 0; i < Nrows3; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols3; j++)
std::cout << d_matrix3[j * Nrows3 + i] << " ";
std::cout << "]\n";
}
return 0;
}


Given N equal circles (possibly overlapping) and M points on a plane, find the circle which contains the maximum number of points

The picture (omitted here) shows a simple case: circle 1 is the winner, because it contains points [1, 2, 5], more than any other circle.
A naive implementation which checks every point against every circle exceeds the time limit.
"Use a hash," they say. But where?
#include <iostream>
#include <vector>
using namespace std;
struct Point
{
int x;
int y;
};
int64_t dist(Point p1, Point p2)
{
int64_t dx = p1.x - p2.x;
int64_t dy = p1.y - p2.y;
return dx*dx + dy*dy;
}
int main()
{
int circle_num;
cin >> circle_num;
vector<Point> circles(circle_num);
vector<int64_t> count (circle_num);
for (Point& p : circles)
cin >> p.x >> p.y;
int points_num;
cin >> points_num;
while (points_num--)
{
Point p;
cin >> p.x >> p.y;
for (int i = 0; i != circle_num; ++i)
{
if (dist(p, circles[i]) <= 400)
++count[i];
}
}
int index = 0;
int64_t max_count = 0;
for (int i = 0; i != circle_num; ++i)
{
if (count[i] > max_count)
{
max_count = count[i];
index = i;
}
}
cout << (index + 1) << endl;
}
Possible input:
3 // number of circles
-1 0 // circle 1 center
1 0 // circle 2 center
2 5 // circle 3 center
3 // number of points
10 0
20 0
22 5
Output: 3 -- circle 3 contains the most number of points
Since the circles are all the same size, a practical approach is to divide the plane into a grid whose square cells are at least one circle radius on a side (with the check dist <= 400 above, i.e. squared distance, the radius is 20), and use a hash from (cell_x, cell_y) -> list to collect the points in each cell.
Then, for each circle, just check the points in the up to 9 cells that its radius can reach.
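A minimal sketch of that idea (my own, not from the original answer; it assumes the 400 in the check is a squared radius, so the cell side is chosen as one radius, 20, and the key packing is an illustrative choice):
#include <iostream>
#include <unordered_map>
#include <vector>
#include <cstdint>
using namespace std;

struct Point { int x; int y; };

int64_t dist(Point p1, Point p2)
{
    int64_t dx = p1.x - p2.x;
    int64_t dy = p1.y - p2.y;
    return dx*dx + dy*dy;
}

int main()
{
    const int64_t R2 = 400;      // squared radius, as in the original check
    const int64_t CELL = 20;     // cell side = one radius
    // floor division, so that negative coordinates fall into the correct cell
    auto cell = [&](int64_t v) { return (v >= 0) ? v / CELL : -((-v + CELL - 1) / CELL); };
    // pack the two cell coordinates into one key (collision-free while cell coordinates stay small)
    auto key = [](int64_t cx, int64_t cy) { return cx * 1000000007LL + cy; };

    int circle_num;
    cin >> circle_num;
    vector<Point> circles(circle_num);
    for (Point& c : circles) cin >> c.x >> c.y;

    int points_num;
    cin >> points_num;
    vector<Point> points(points_num);
    unordered_map<int64_t, vector<int>> grid;    // cell -> indices of the points inside it
    for (int i = 0; i < points_num; ++i) {
        cin >> points[i].x >> points[i].y;
        grid[key(cell(points[i].x), cell(points[i].y))].push_back(i);
    }

    int best = 0;
    int64_t best_count = -1;
    for (int i = 0; i != circle_num; ++i) {
        int64_t cnt = 0;
        int64_t cx = cell(circles[i].x), cy = cell(circles[i].y);
        // a circle of radius CELL can only reach the 3 x 3 block of cells around its centre
        for (int64_t gx = cx - 1; gx <= cx + 1; ++gx)
            for (int64_t gy = cy - 1; gy <= cy + 1; ++gy) {
                auto it = grid.find(key(gx, gy));
                if (it == grid.end()) continue;
                for (int p : it->second)
                    if (dist(points[p], circles[i]) <= R2) ++cnt;
            }
        if (cnt > best_count) { best_count = cnt; best = i; }
    }
    cout << (best + 1) << endl;
}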

Some CUDA computations fail with larger block dimension (< 1024)

I am learning CUDA with a GTX 960 4GB. I wrote a program which performs an element-wise matrix multiplication. When I increase the block dimensions for x and y to, say, (32 x 32) in combination with a large matrix (say 15000 x 15000 elements), some but not all of the multiplication results are wrong (value 0 instead of 6).
When I then decrease the block dimensions to e.g. (8 x 8), all results are right again. When I decrease the matrix size, the results are right again, too.
So in the case of this example, there seem to be combinations of total threads and threads per block which do not work.
I am surprised I can't find any threads regarding this topic. All I can find is about increasing performance and occupancy, but nothing about some, but not all, of the calculations silently not being performed.
The grid dimensions are calculated as follows:
dim3 blocks(ceil<int>(COLS / threads.x), ceil<int>(ROWS / threads.y));
Why do some multiplications fail while others are successful?
Some Examples
Block dim : (8, 8)
Matrix shape : (15000, 15000)
Verification : 0 elements have failed, total length 225000000, shape: (15000, 15000)
Block dim : (16, 16)
Matrix shape : (15000, 15000)
Verification : 239936 elements have failed, total length 225000000, shape: (15000, 15000)
Block dim : (32, 32)
Matrix shape : (15000, 15000)
Verification : 719424 elements have failed, total length 225000000, shape: (15000, 15000).
Block dim : (32, 32)
Matrix shape : (10000, 10000)
Verification : 0 elements have failed, total length 100000000, shape: (10000, 10000).
Driver Version
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 470.82.00 Thu Oct 14 10:24:40 UTC 2021
Complete Code
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <cuda.h>
#include <cuda_runtime.h>
#define ROWS 10000
#define COLS 10000
#define MAX_ERR 1e-6
typedef struct {
int width;
int height;
float* elements;
} Matrix;
size_t ij(int i, int j){
return j * ROWS + i;
}
__global__ void matrix_multi_elemwise(const Matrix OUT, const Matrix A, const Matrix B) {
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
if (col < A.width && row < A.height) {
int index = row * A.height + col; // linearisation of index
OUT.elements[index] = A.elements[index] * B.elements[index];
}
}
int main(){
Matrix A, B, OUT;
Matrix dev_A, dev_B, dev_OUT;
size_t SIZE = ROWS * COLS * sizeof(float);
// Allocate host memory
A.elements = (float*) malloc(SIZE);
B.elements = (float*) malloc(SIZE);
OUT.elements = (float*) malloc(SIZE);
// Initialize host matrices
A.height = ROWS; A.width = COLS;
B.height = ROWS; B.width = COLS;
OUT.height = ROWS; OUT.width = COLS;
for (int j = 0; j < ROWS; j++) {
for(int i = 0; i < COLS; i++){
A.elements[ij(i, j)] = 2.0f;
B.elements[ij(i, j)] = 3.0f;
}
}
// Allocate device memory
cudaMalloc((void**) &dev_A.elements, SIZE);
cudaMalloc((void**) &dev_B.elements, SIZE);
cudaMalloc((void**) &dev_OUT.elements, SIZE);
dev_A.height = A.height; dev_A.width = A.width;
dev_B.height = A.height; dev_B.width = B.width;
dev_OUT.height = A.height; dev_OUT.width = OUT.width;
// Transfer data from host to device memory
cudaMemcpy(dev_A.elements, A.elements, SIZE, cudaMemcpyHostToDevice);
cudaMemcpy(dev_B.elements, B.elements, SIZE, cudaMemcpyHostToDevice);
// Executing kernel
dim3 threads(16, 16);
dim3 blocks(ceil<int>(COLS / threads.x), ceil<int>(ROWS / threads.y));
matrix_multi_elemwise<<<blocks, threads>>>(dev_OUT, dev_A, dev_B);
cudaError_t err = cudaGetLastError();
if(err != cudaSuccess) {
printf("CUDA Runtime API Error reported : %s in file %s on line.\n", cudaGetErrorString(err), __FILE__);
}
// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();
// Transfer data back to host memory
cudaMemcpy(OUT.elements, dev_OUT.elements, SIZE, cudaMemcpyDeviceToHost);
// Verification
int count = 0, length = 0, i = 0, j = 0;
for (j = 0; j < ROWS; j++) {
for(i = 0; i < COLS; i++){
//assert(fabs(OUT.elements[ij(i, j)] / A.elements[ij(i, j)] - B.elements[ij(i, j)]) < MAX_ERR);
if (fabs(OUT.elements[ij(i, j)] / A.elements[ij(i, j)] - B.elements[ij(i, j)]) > MAX_ERR) {
count++;
}
length++;
}
}
printf("Verification: %i elements have failed, total length %i, shape: (%i, %i).\n", count, length, i, j);
// Deallocate device memory
cudaFree(dev_A.elements);
cudaFree(dev_B.elements);
cudaFree(dev_OUT.elements);
// Deallocate host memory
free(A.elements);
free(B.elements);
free(OUT.elements);
}
The number of blocks is wrong. Indeed, COLS and threads.x are both integers, so the division is truncated before ceil<int> is applied; ceil<int> cannot round up a result that has already been truncated. This causes some blocks not to be computed: 15000 is divisible by 8 but not by 16. You need to either cast COLS to a floating-point number or compute the ceiling manually (the safer option). Here is an example:
dim3 blocks((COLS + threads.x - 1) / threads.x, (ROWS + threads.y - 1) / threads.y);
As pointed out in the comments, note that row * A.height + col is wrong: it should be row * A.width + col instead. This causes issues for non-square matrices (here it happens to work only because ROWS == COLS).
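Putting the two fixes together, a minimal sketch of the corrected kernel and launch configuration (same identifiers as in the question, everything else unchanged):
__global__ void matrix_multi_elemwise(const Matrix OUT, const Matrix A, const Matrix B) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < A.width && row < A.height) {
        int index = row * A.width + col;   // the row stride is the number of columns
        OUT.elements[index] = A.elements[index] * B.elements[index];
    }
}

// Round the block counts up so that the partial tiles at the right/bottom edges are covered.
dim3 threads(16, 16);
dim3 blocks((COLS + threads.x - 1) / threads.x, (ROWS + threads.y - 1) / threads.y);
matrix_multi_elemwise<<<blocks, threads>>>(dev_OUT, dev_A, dev_B);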

How to expand the product of a sequence of binomials efficiently?

The product of a sequence of binomials reads
P(x) = (a_1 + b_1*x) * (a_2 + b_2*x) * ... * (a_d + b_d*x),
where {a_i} and {b_i} are the coefficients of the binomials.
I need to expand it into a polynomial
P(x) = c_0 + c_1*x + ... + c_d*x^d
and use all the coefficients {c_k} of the polynomial afterwards.
How can it be expanded efficiently? Speed has priority over memory occupation because the expansion will be used many times.
What I tried
At present I have just come up with an update scheme, which expands the partial polynomial right after absorbing each binomial.
This scheme needs two arrays: one for the coefficients after the first i-1 binomials, the other for the coefficients after the first i.
Here is the C++ code for my naive scheme, but I think the question is independent of the language used.
#include <iostream>
#include <vector>
int main()
{
using namespace std;
// just an example, the coefficients are actually real numbers in [0,1]
unsigned d = 3;
vector<double> a;
vector<double> b;
a.resize(d, 1); b.resize(d, 1);
// given two arrays, a[] and b[], of length d
vector< vector<double> > coefficients(2);
coefficients[0].resize(d + 1);
coefficients[1].resize(d + 1);
if (d > 0) {
auto &coeff = coefficients[0]; // i = 0
coeff[0] = a[0];
coeff[1] = b[0];
for (unsigned i = 1; i < d; ++i) {// i : [1, d-1]
const auto ai = a[i];
const auto bi = b[i];
const auto &oldCoeff = coefficients[(i-1)%2];
auto &coeff = coefficients[i%2];
coeff[0] = oldCoeff[0] * ai; // j = 0
for (unsigned j = 1; j <= i; ++j) { // j : [1, i]
coeff[j] = oldCoeff[j] * ai + oldCoeff[j-1] * bi;
}
coeff[i+1] = oldCoeff[i] * bi; // j = i+1
}
}
const auto &coeff = coefficients[(d-1)%2];
for (unsigned i = 0; i < d; ++i) {
cout << coeff[i] << "\t";
}
cout << coeff[d] << '\n';
}
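A possible simplification (my own sketch, not part of the original post): since each new coefficient only reads values from the previous step, the same recurrence c_j = c_j*a_i + c_(j-1)*b_i can be applied in place with a single array by updating j from high to low, so that old values are consumed before they are overwritten.
#include <iostream>
#include <vector>

int main()
{
    using namespace std;
    // same toy example as above: d binomials with all coefficients equal to 1
    unsigned d = 3;
    vector<double> a(d, 1.0), b(d, 1.0);
    vector<double> c(d + 1, 0.0);
    c[0] = 1.0;                                 // the empty product is the constant 1
    for (unsigned i = 0; i < d; ++i) {
        // multiply the current polynomial by (a[i] + b[i]*x), highest degree first
        for (unsigned j = i + 1; j > 0; --j)
            c[j] = c[j] * a[i] + c[j - 1] * b[i];
        c[0] *= a[i];
    }
    for (unsigned j = 0; j < d; ++j) cout << c[j] << "\t";
    cout << c[d] << '\n';
}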

Sparse plus dense matrix operation using cuSPARSE

Is it possible to add a sparse matrix and a dense matrix using cuSPARSE? In cuBLAS, I'd just treat the matrices as vectors and use axpy. cuSPARSE does have axpy for sparse/dense vectors, but it cannot be used for matrices because sparse vectors and matrices have different memory structure.
cusparse has dense-to-sparse and sparse-to-dense conversion routines. You could:
convert the sparse matrix to dense (e.g. with cusparse<t>csr2dense), then add the two with cublas<t>geam, producing a dense matrix result
convert the dense matrix to sparse (e.g. with cusparse<t>dense2csr), then use cusparse<t>csrgeam to produce a sparse result
Note that using cusparse<t>csrgeam is a little bit more involved than just a single function call, but the usage methodology is given in the documentation. Also, when using cusparse<t>dense2csr, you will likely want to use cusparse<t>nnz to help with the storage allocations needed.
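For reference, here is a minimal sketch of the first option using the legacy cuSPARSE/cuBLAS API (the same API generation as the example below); the names are assumed for illustration and error checking is omitted. It expands A from CSR to dense with cusparse<t>csr2dense and then forms C = A + B with cublas<t>geam:
#include <cublas_v2.h>
#include <cusparse.h>

// d_csrVal / d_csrRowPtr / d_csrColInd: CSR data of the M x N sparse matrix A
// d_A_dense: workspace for the densified A; d_B, d_C: dense column-major M x N matrices
void sparsePlusDenseViaGeam(cusparseHandle_t cusparseH, cublasHandle_t cublasH,
                            cusparseMatDescr_t descrA,
                            const double *d_csrVal, const int *d_csrRowPtr, const int *d_csrColInd,
                            double *d_A_dense, const double *d_B, double *d_C, int M, int N)
{
    // 1. Expand A from CSR to a dense column-major matrix with leading dimension M
    cusparseDcsr2dense(cusparseH, M, N, descrA, d_csrVal, d_csrRowPtr, d_csrColInd, d_A_dense, M);

    // 2. C = 1.0 * A + 1.0 * B, the dense matrix-matrix addition provided by cuBLAS
    const double alpha = 1.0, beta = 1.0;
    cublasDgeam(cublasH, CUBLAS_OP_N, CUBLAS_OP_N, M, N,
                &alpha, d_A_dense, M,
                &beta,  d_B, M,
                d_C, M);
}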
Here is a fully worked example with a custom kernel that sums a sparse matrix A, stored in CSR format, with a dense matrix B, producing a dense matrix C. The custom kernel explicitly deals with the mapping between the CSR and dense indices.
#include <stdio.h>
#include <assert.h>
#include <cusparse.h>
#define BLOCKSIZEX 16
#define BLOCKSIZEY 16
/*******************/
/* iDivUp FUNCTION */
/*******************/
int iDivUp(int a, int b){ return ((a % b) != 0) ? (a / b + 1) : (a / b); }
/********************/
/* CUDA ERROR CHECK */
/********************/
// --- Credit to http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
if (code != cudaSuccess)
{
fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) { exit(code); }
}
}
void gpuErrchk(cudaError_t ans) { gpuAssert((ans), __FILE__, __LINE__); }
/***************************/
/* CUSPARSE ERROR CHECKING */
/***************************/
static const char *_cusparseGetErrorEnum(cusparseStatus_t error)
{
switch (error)
{
case CUSPARSE_STATUS_SUCCESS:
return "CUSPARSE_STATUS_SUCCESS";
case CUSPARSE_STATUS_NOT_INITIALIZED:
return "CUSPARSE_STATUS_NOT_INITIALIZED";
case CUSPARSE_STATUS_ALLOC_FAILED:
return "CUSPARSE_STATUS_ALLOC_FAILED";
case CUSPARSE_STATUS_INVALID_VALUE:
return "CUSPARSE_STATUS_INVALID_VALUE";
case CUSPARSE_STATUS_ARCH_MISMATCH:
return "CUSPARSE_STATUS_ARCH_MISMATCH";
case CUSPARSE_STATUS_MAPPING_ERROR:
return "CUSPARSE_STATUS_MAPPING_ERROR";
case CUSPARSE_STATUS_EXECUTION_FAILED:
return "CUSPARSE_STATUS_EXECUTION_FAILED";
case CUSPARSE_STATUS_INTERNAL_ERROR:
return "CUSPARSE_STATUS_INTERNAL_ERROR";
case CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
return "CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED";
case CUSPARSE_STATUS_ZERO_PIVOT:
return "CUSPARSE_STATUS_ZERO_PIVOT";
}
return "<unknown>";
}
inline void __cusparseSafeCall(cusparseStatus_t err, const char *file, const int line)
{
if (CUSPARSE_STATUS_SUCCESS != err) {
fprintf(stderr, "CUSPARSE error in file '%s', line %d, error %s\nterminating!\n", __FILE__, __LINE__, \
_cusparseGetErrorEnum(err)); \
assert(0); \
}
}
extern "C" void cusparseSafeCall(cusparseStatus_t err) { __cusparseSafeCall(err, __FILE__, __LINE__); }
/*****************************/
/* SETUP DESCRIPTOR FUNCTION */
/*****************************/
void setUpDescriptor(cusparseMatDescr_t &descrA, cusparseMatrixType_t matrixType, cusparseIndexBase_t indexBase) {
cusparseSafeCall(cusparseCreateMatDescr(&descrA));
cusparseSafeCall(cusparseSetMatType(descrA, matrixType));
cusparseSafeCall(cusparseSetMatIndexBase(descrA, indexBase));
}
/********************************************************/
/* DENSE TO SPARSE CONVERSION FOR REAL DOUBLE PRECISION */
/********************************************************/
void dense2SparseD(const double * __restrict__ d_A_dense, int **d_nnzPerVector, double **d_A,
int **d_A_RowIndices, int **d_A_ColIndices, int &nnz, cusparseMatDescr_t descrA,
const cusparseHandle_t handle, const int M, const int N) {
const int lda = M; // --- Leading dimension of dense matrix
gpuErrchk(cudaMalloc(&d_nnzPerVector[0], M * sizeof(int)));
// --- Compute the number of nonzero elements per row and the total number of nonzero elements in the dense d_A_dense
cusparseSafeCall(cusparseDnnz(handle, CUSPARSE_DIRECTION_ROW, M, N, descrA, d_A_dense,
lda, d_nnzPerVector[0], &nnz));
// --- Device side sparse matrix
gpuErrchk(cudaMalloc(&d_A[0], nnz * sizeof(double)));
gpuErrchk(cudaMalloc(&d_A_RowIndices[0], (M + 1) * sizeof(int)));
gpuErrchk(cudaMalloc(&d_A_ColIndices[0], nnz * sizeof(int)));
cusparseSafeCall(cusparseDdense2csr(handle, M, N, descrA, d_A_dense, lda, d_nnzPerVector[0],
d_A[0], d_A_RowIndices[0], d_A_ColIndices[0]));
}
/********************************/
/* SPARSE + DENSE CUSTOM KERNEL */
/********************************/
__global__ void sparsePlusDense(const double * __restrict__ d_A, const int * __restrict__ d_A_RowIndices,
const int * __restrict__ d_A_ColIndices, const double * __restrict__ d_B,
double * __restrict__ d_C, const int M, const int N) {
const int tidx = threadIdx.x + blockIdx.x * blockDim.x;
const int tidy = threadIdx.y + blockIdx.y * blockDim.y;
if ((tidx >= N) || (tidy >= M)) return;
const int row = tidy;
const int nnzRow = d_A_RowIndices[tidy + 1] - d_A_RowIndices[tidy];
if (tidx >= nnzRow) return;
const int col = d_A_ColIndices[d_A_RowIndices[tidy] + tidx];
d_C[row * N + col] = d_C[row * N + col] + d_A[d_A_RowIndices[tidy] + tidx];
}
/********/
/* MAIN */
/********/
int main() {
cusparseHandle_t handle;
// --- Initialize cuSPARSE
cusparseSafeCall(cusparseCreate(&handle));
// --- Initialize matrix descriptors
cusparseMatDescr_t descrA;
setUpDescriptor(descrA, CUSPARSE_MATRIX_TYPE_GENERAL, CUSPARSE_INDEX_BASE_ZERO);
/**************************/
/* SETTING UP THE PROBLEM */
/**************************/
const int M = 5; // --- Number of rows
const int N = 4; // --- Number of columns
// --- Host side dense matrix
double *h_A_dense = (double*)malloc(M * N * sizeof(*h_A_dense));
// --- Column-major storage
h_A_dense[0] = 0.4612; h_A_dense[5] = 0.0; h_A_dense[10] = 1.3; h_A_dense[15] = 0.0;
h_A_dense[1] = 0.0; h_A_dense[6] = 1.443; h_A_dense[11] = 0.0; h_A_dense[16] = 0.0;
h_A_dense[2] = -0.0006; h_A_dense[7] = 0.4640; h_A_dense[12] = 0.0723; h_A_dense[17] = 0.0;
h_A_dense[3] = 0.3566; h_A_dense[8] = 0.0; h_A_dense[13] = 0.7543; h_A_dense[18] = 0.0;
h_A_dense[4] = 0.; h_A_dense[9] = 0.0; h_A_dense[14] = 0.0; h_A_dense[19] = 0.1;
// --- Create device array and copy host array to it
double *d_A_dense; gpuErrchk(cudaMalloc(&d_A_dense, M * N * sizeof(double)));
gpuErrchk(cudaMemcpy(d_A_dense, h_A_dense, M * N * sizeof(*d_A_dense), cudaMemcpyHostToDevice));
/*******************************/
/* FROM DENSE TO SPARSE MATRIX */
/*******************************/
int nnz = 0; // --- Number of nonzero elements in dense matrix
int *d_nnzPerVector; // --- Device side number of nonzero elements per row
double *d_A; // --- Sparse matrix values - array of size nnz
int *d_A_RowIndices; // --- CSR row pointers - array of size M + 1
int *d_A_ColIndices; // --- CSR column indices - array of size nnz
dense2SparseD(d_A_dense, &d_nnzPerVector, &d_A, &d_A_RowIndices, &d_A_ColIndices, nnz, descrA,
handle, M, N);
/*************************/
/* DENSE MATRIX OPERANDS */
/*************************/
// --- Host side dense matrix
double *h_B_dense = (double*)malloc(M * N * sizeof(*h_B_dense));
// --- Column-major storage
h_B_dense[0] = 1.5; h_B_dense[5] = -0.2; h_B_dense[10] = -0.9; h_B_dense[15] = 1.1;
h_B_dense[1] = 2.1; h_B_dense[6] = 2.0; h_B_dense[11] = 1.1; h_B_dense[16] = -0.009;
h_B_dense[2] = -2; h_B_dense[7] = -0.82; h_B_dense[12] = 1.2; h_B_dense[17] = 1.21;
h_B_dense[3] = -0.001; h_B_dense[8] = -1.1; h_B_dense[13] = 0.887; h_B_dense[18] = 1.1143;
h_B_dense[4] = 1.1; h_B_dense[9] = 2.1; h_B_dense[14] = -1.1213; h_B_dense[19] = 5.4334;
// --- Create device array and copy host array to it
double *d_B_dense; gpuErrchk(cudaMalloc(&d_B_dense, M * N * sizeof(double)));
gpuErrchk(cudaMemcpy(d_B_dense, h_B_dense, M * N * sizeof(*d_B_dense), cudaMemcpyHostToDevice));
// --- Allocate space for the result and initialize it
double *d_C_dense; gpuErrchk(cudaMalloc(&d_C_dense, M * N * sizeof(double)));
gpuErrchk(cudaMemcpy(d_C_dense, d_B_dense, M * N * sizeof(double), cudaMemcpyDeviceToDevice));
/*********************************/
/* RUN THE SPARSE-DENSE ADDITION */
/*********************************/
dim3 GridDim(iDivUp(N, BLOCKSIZEX), iDivUp(M, BLOCKSIZEY));
dim3 BlockDim(BLOCKSIZEX, BLOCKSIZEY);
sparsePlusDense<<<GridDim, BlockDim>>>(d_A, d_A_RowIndices, d_A_ColIndices, d_B_dense, d_C_dense, M, N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
/*******************************************************/
/* CHECKING THE RESULTS OF THE SPARSE + DENSE ADDITION */
/*******************************************************/
double *h_C_dense = (double *)malloc(M * N * sizeof(double));
gpuErrchk(cudaMemcpy(h_C_dense, d_C_dense, M * N * sizeof(double), cudaMemcpyDeviceToHost));
printf("\nFirst dense operand matrix (column-major storage) \n");
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++)
printf("%f\t", h_A_dense[n * M + m]);
printf("\n");
}
printf("\nSecond dense operand matrix (row-major storage) \n");
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++)
printf("%f\t", h_B_dense[n + m * N]);
printf("\n");
}
printf("\nReference dense matrix (the first has column-major storage, the second row-major\n");
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++)
printf("%f\t", h_A_dense[n * M + m] + h_B_dense[n + m * N]);
printf("\n");
}
printf("\nSecond dense operand matrix (row-major storage) \n");
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++)
printf("%f\t", h_C_dense[n + m * N]);
printf("\n");
}
return 0;
}

Repeated Squaring - Matrix Multiplication using NEWMAT

I'm trying to use the repeated squaring algorithm (using recursion) to perform matrix exponentiation. I've included header files from the NEWMAT library instead of using arrays. The original matrix has elements in the range (-5,5), all numbers being of type float.
# include "C:\User\newmat10\newmat.h"
# include "C:\User\newmat10\newmatio.h"
# include "C:\User\newmat10\newmatap.h"
# include <iostream>
# include <time.h>
# include <ctime>
# include <cstdlib>
# include <iomanip>
using namespace std;
Matrix repeated_squaring(Matrix A, int exponent, int n) //Recursive function
{
A(n,n);
IdentityMatrix I(n);
if (exponent == 0) //Matrix raised to zero returns an Identity Matrix
return I;
else
{
if ( exponent%2 == 1 ) // if exponent is odd
return (A * repeated_squaring (A*A, (exponent-1)/2, n));
else //if exponent is even
return (A * repeated_squaring( A*A, exponent/2, n));
}
}
Matrix direct_squaring(Matrix B, int k, int no) //Brute Force Multiplication
{
B(no,no);
Matrix C = B;
for (int i = 1; i <= k; i++)
C = B*C;
return C;
}
//----Creating a matrix with elements b/w (-5,5)----
float unifRandom()
{
int a = -5;
int b = 5;
float temp = (float)((b-a)*( rand()/RAND_MAX) + a);
return temp;
}
Matrix initialize_mat(Matrix H, int ord)
{
H(ord,ord);
for (int y = 1; y <= ord; y++)
for(int z = 1; z<= ord; z++)
H(y,z) = unifRandom();
return(H);
}
//---------------------------------------------------
void main()
{
int exponent, dimension;
cout<<"Insert exponent:"<<endl;
cin>>exponent;
cout<< "Insert dimension:"<<endl;
cin>>dimension;
cout<<"The number of rows/columns in the square matrix is: "<<dimension<<endl;
cout<<"The exponent is: "<<exponent<<endl;
Matrix A(dimension,dimension),B(dimension,dimension);
Matrix C(dimension,dimension),D(dimension,dimension);
B= initialize_mat(A,dimension);
cout<<"Initial Matrix: "<<endl;
cout<<setw(5)<<setprecision(2)<<B<<endl;
//-----------------------------------------------------------------------------
cout<<"Repeated Squaring Result: "<<endl;
clock_t time_before1 = clock();
C = repeated_squaring (B, exponent , dimension);
cout<< setw(5) <<setprecision(2) <<C;
clock_t time_after1 = clock();
float diff1 = ((float) time_after1 - (float) time_before1);
cout << "It took " << diff1/CLOCKS_PER_SEC << " seconds to complete" << endl<<endl;
//---------------------------------------------------------------------------------
cout<<"Direct Squaring Result:"<<endl;
clock_t time_before2 = clock();
D = direct_squaring (B, exponent , dimension);
cout<<setw(5)<<setprecision(2)<<D;
clock_t time_after2 = clock();
float diff2 = ((float) time_after2 - (float) time_before2);
cout << "It took " << diff2/CLOCKS_PER_SEC << " seconds to complete" << endl<<endl;
}
I face the following problems:
The random number generator returns only "-5" for each element in the output.
The matrix multiplication yields different results with brute-force multiplication and with the repeated squaring algorithm.
I'm timing the execution of my code to compare the times taken by brute-force multiplication and by repeated squaring.
Could someone please find out what's wrong with the recursion and with the matrix initialization?
NOTE: While compiling this program, make sure you've imported the NEWMAT library.
Thanks in advance!
rand() returns an int, so rand()/RAND_MAX is integer division and truncates to 0 (except when rand() happens to equal RAND_MAX); every element therefore evaluates to (b-a)*0 + a = -5. Try your repeated-squaring algorithm by hand with exponents 1, 2 and 3 and you'll find a surplus A * (for exponent 2, for example, the even branch returns A * A^2 = A^3) and a gross inefficiency.
The final working code has the following improvements:
Matrix repeated_squaring(Matrix A, int exponent, int n) //Recursive function
{
A(n,n);
IdentityMatrix I(n);
if (exponent == 0) //Matrix raised to zero returns an Identity Matrix
return I;
if (exponent == 1)
return A;
if (exponent % 2 == 1) // if exponent is odd
return (A*repeated_squaring (A*A, (exponent-1)/2, n));
else //if exponent is even
return (repeated_squaring(A*A, exponent/2, n));
}
Matrix direct_squaring(Matrix B, int k, int no) //Brute Force Multiplication
{
B(no,no);
Matrix C(no,no);
C=B;
for (int i = 0; i < k-1; i++)
C = B*C;
return C;
}
//----Creating a matrix with elements b/w (-5,5)----
float unifRandom()
{
int a = -5;
int b = 5;
float temp = (float) ((b-a)*((float) rand()/RAND_MAX) + a);
return temp;
}
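A quick way to check the fixes (a hypothetical snippet of my own, assuming the same NEWMAT headers as in the question, the question's initialize_mat, and the corrected functions above, with this main replacing the original one): both versions should now print the same matrix for any exponent >= 1.
int main()
{
    srand((unsigned)time(NULL));             // seed, so each run uses a different random matrix
    int n = 3, k = 5;
    Matrix A(n, n);
    A = initialize_mat(A, n);
    Matrix P = repeated_squaring(A, k, n);   // A^k by repeated squaring
    Matrix Q = direct_squaring(A, k, n);     // A^k by k-1 plain multiplications
    cout << setw(10) << setprecision(4) << P << endl;
    cout << setw(10) << setprecision(4) << Q << endl;   // should print the same values
    return 0;
}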
