Compiling the LAPACKE library using C on macOS

I need to use LAPACKE functions in code that should run on both Linux and macOS, but I have a problem under macOS. I have a 2021 MacBook Pro with an M1 Pro processor, running macOS 12.6.2.
I wrote an example code:
#include <stdio.h>
#include <stdlib.h>
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>
#else
#include <lapacke.h>
#endif
const int N = 3, NRHS = 2, LDA = N, LDB = N;
const int NN = 5, NRHS2 = 3;

int main(int argc, char** argv) {
    /******* Example of using the LAPACK library with
             the C interface for solving linear systems
             for a general full matrix **************/
    int ipiv[N], info;
    // a[LDA*N]
    double a[] = {
         6.80, -2.11,  5.66,
        -6.05, -3.30,  5.36,
        -0.45,  2.58, -2.70
    };
    // b[LDB*NRHS]
    double b[] = {
         4.02,  6.19, -8.22,
        -1.56,  4.00, -8.67
    };

    printf("\nTest of using LAPACKe Library\n");
    printf("Matrix A : %d by %d\n", N, N);
    for (int i = 0; i < N; i++) {
        int s = i;
        for (int j = 0; j < N; j++, s += LDA) printf(" % 6.2lf", a[s]);
        printf("\n");
    }
    printf("\nRight hand side: %d vectors of %d\n", NRHS, N);
    for (int i = 0; i < N; i++) {
        int s = i;
        for (int j = 0; j < NRHS; j++, s += LDB) printf(" % 10.6lf", b[s]);
        printf("\n");
    }

    /** As long as LAPACK_COL_MAJOR is used, the matrix is
        filled up by columns; pay attention to the way it is printed **/
    info = LAPACKE_dgesv(LAPACK_COL_MAJOR, N, NRHS, a, LDA, ipiv, b, LDB);
    if (info == 0) {
        printf("\nLU factorization : %d by %d\n", N, N);
        for (int i = 0; i < N; i++) {
            int s = i;
            for (int j = 0; j < N; j++, s += LDA) printf(" % 10.5lf", a[s]);
            printf("\n");
        }
        printf("\nSolution of %d right hand side vectors of %d\n", NRHS, N);
        for (int i = 0; i < N; i++) {
            int s = i;
            for (int j = 0; j < NRHS; j++, s += LDB) printf(" % 10.6lf", b[s]);
            printf("\n");
        }
        printf("\nFactorization pivot indices by %d\n", N);
        for (int i = 0; i < N; i++) printf(" %5d", ipiv[i]);
        printf("\n\n");
    } // End of if (info == 0)
    else
        printf("An error occurred in the LAPACK lib dgesv, with code %d\n", info);

    /******* Example of using the LAPACK library with
             the C interface for solving linear systems
             for a tridiagonal matrix **************/
    int ldb = NN;
    // dl[NN-1] lower diagonal
    double dl[] = {1, 4, 4, 1};
    // d[NN] main diagonal
    double d[] = {-2, -2, -2, -2, -2};
    // du[NN-1] upper diagonal
    double du[] = {1, 4, 4, 1};
    // bb[NN*NRHS2] right hand side vectors
    double bb[] = {
         3.,    5.,    5.,    5.,    3.,
        -1.56,  4.,   -8.67,  1.75,  2.86,
         9.81, -4.09, -4.57, -8.61,  8.99
    };
    info = LAPACKE_dgtsv(LAPACK_COL_MAJOR, NN, NRHS2, dl, d, du, bb, ldb);
    if (info == 0) {
        printf("\nTest of using LAPACKe Library for Tridiagonal systems\n");
        printf("\nSolution of %d right hand side vectors of %d\n", NRHS2, NN);
        for (int i = 0; i < NN; i++) {
            int s = i;
            for (int j = 0; j < NRHS2; j++, s += ldb) printf(" % 10.6lf", bb[s]);
            printf("\n");
        }
    } // End of if (info == 0)
    else
        printf("An error occurred in the LAPACK lib dgtsv, with code %d\n", info);
    return 0;
}
This runs under WSL (Windows Subsystem for Linux), where it compiles successfully with the command
gcc example2.c -o lapack -llapacke
But when I try to compile it on my Mac using the line
gcc example2.c -o lapacke -framework Accelerate
I receive the following errors:
example2.c:47:9: error: implicit declaration of function 'LAPACKE_dgesv' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
info = LAPACKE_dgesv( LAPACK_COL_MAJOR, N, NRHS, a, LDA, ipiv, b, LDB);
^
example2.c:47:24: error: use of undeclared identifier 'LAPACK_COL_MAJOR'
info = LAPACKE_dgesv( LAPACK_COL_MAJOR, N, NRHS, a, LDA, ipiv, b, LDB);
^
example2.c:91:9: error: implicit declaration of function 'LAPACKE_dgtsv' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
info = LAPACKE_dgtsv(LAPACK_COL_MAJOR, NN, NRHS2, dl, d, du, bb, ldb);
^
example2.c:91:9: note: did you mean 'LAPACKE_dgesv'?
example2.c:47:9: note: 'LAPACKE_dgesv' declared here
info = LAPACKE_dgesv( LAPACK_COL_MAJOR, N, NRHS, a, LDA, ipiv, b, LDB);
^
example2.c:91:23: error: use of undeclared identifier 'LAPACK_COL_MAJOR'
info = LAPACKE_dgtsv(LAPACK_COL_MAJOR, NN, NRHS2, dl, d, du, bb, ldb);
^
4 errors generated.
It looks like the LAPACKE function prototypes are not declared inside Accelerate.h. I have been looking around, but I haven't found anything that helps.
By the way, I have another example using CBLAS and it runs smoothly.
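For reference, a commonly suggested workaround is to call the Fortran-style LAPACK interface that Accelerate does export (dgesv_, dgtsv_, and so on), which takes every argument by pointer and always assumes column-major storage. The following is a minimal sketch, assuming the classic CLAPACK prototypes shipped in Accelerate.h on macOS 12; it is an illustration, not a confirmed fix from the original post:
#include <stdio.h>
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>   /* declares the Fortran-style dgesv_ */
#else
#include <lapacke.h>
#endif

int main(void) {
    int n = 3, nrhs = 2, lda = 3, ldb = 3, info = 0;
    int ipiv[3];
    /* Column-major 3x3 matrix and two right-hand-side vectors. */
    double a[] = {  6.80, -2.11,  5.66,
                   -6.05, -3.30,  5.36,
                   -0.45,  2.58, -2.70 };
    double b[] = {  4.02,  6.19, -8.22,
                   -1.56,  4.00, -8.67 };
#ifdef __APPLE__
    /* Fortran-style interface: every argument is passed by address. */
    dgesv_(&n, &nrhs, a, &lda, ipiv, b, &ldb, &info);
#else
    info = LAPACKE_dgesv(LAPACK_COL_MAJOR, n, nrhs, a, lda, ipiv, b, ldb);
#endif
    printf("info = %d, first solution component = %f\n", info, b[0]);
    return 0;
}
Another option, if the goal is to keep the LAPACKE calls unchanged, is to install a standalone LAPACK build that ships liblapacke (for example via Homebrew's lapack formula) and point the compiler at its headers and libraries instead of Accelerate; the exact include and library paths depend on the installation.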

Related

Matrix Multiplication OpenMP Counter-Intuitive Results

I am currently porting some code over to OpenMP at my place of work. One of the tasks I am doing is figuring out how to speed up matrix multiplication for one of our applications.
The matrices are stored in row-major format, so A[i*cols +j] gives the A_i_j element of the matrix A.
The code looks like this (uncommenting the pragma parallelises the code):
#include <omp.h>
#include <iostream>
#include <iomanip>
#include <stdio.h>

#define NUM_THREADS 8
#define size 500
#define num_iter 10

int main (int argc, char *argv[])
{
    // omp_set_num_threads(NUM_THREADS);
    int *A = new int [size*size];
    int *B = new int [size*size];
    int *C = new int [size*size];

    for (int i=0; i<size; i++)
    {
        for (int j=0; j<size; j++)
        {
            A[i*size+j] = j*1;
            B[i*size+j] = i*j+2;
            C[i*size+j] = 0;
        }
    }

    double total_time = 0;
    double start = 0;
    for (int t=0; t<num_iter; t++)
    {
        start = omp_get_wtime();
        int i, k;
        // #pragma omp parallel for num_threads(10) private(i, k) collapse(2) schedule(dynamic)
        for (int j=0; j<size; j++)
        {
            for (i=0; i<size; i++)
            {
                for (k=0; k<size; k++)
                {
                    C[i*size+j] += A[i*size+k] * B[k*size+j];
                }
            }
        }
        total_time += omp_get_wtime() - start;
    }

    std::cout << std::setprecision(5) << total_time/num_iter << std::endl;

    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}
What is confusing me is the following: why is dynamic scheduling faster than static scheduling for this task? Timing the runs and taking an average shows that static scheduling is slower, which to me is a bit counterintuitive since each thread is doing the same amount of work.
Also, am I correctly speeding up my matrix multiplication code?
Parallel matrix multiplication is non-trivial (have you even considered cache-blocking?). Your best bet is likely to be to use a BLAS Library for this, rather than writing it yourself. (Remember, "The best code is the code I do not have to write").
Wikipedia: Basic Linear Algebra Subprograms points to many implementations, a lot of which (including Intel Math Kernel Library) have free licenses.
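To make that suggestion concrete, here is a rough sketch (not part of the original answer) of what the multiplication looks like when handed to a CBLAS implementation; it assumes double-precision matrices in row-major storage, so the int matrices above would first have to be converted to double:
#include <stdio.h>
#include <cblas.h>   /* provided by OpenBLAS, MKL, Accelerate, ... */

#define N 500

int main(void) {
    /* Row-major N x N matrices; dgemm computes C = alpha*A*B + beta*C. */
    static double A[N*N], B[N*N], C[N*N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i*N + j] = j;
            B[i*N + j] = i*j + 2.0;
            C[i*N + j] = 0.0;
        }

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N,        /* M, N, K       */
                1.0, A, N,      /* alpha, A, lda */
                B, N,           /* B, ldb        */
                0.0, C, N);     /* beta, C, ldc  */

    printf("C[0][0] = %f\n", C[0]);
    return 0;
}
Any conforming CBLAS provides cblas_dgemm; the required link flags differ per library (for example -lopenblas, -lmkl_rt, or -framework Accelerate).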

Binary Matrix Reduction in CUDA

I have to traverse all cells of an imaginary matrix m * n and add + 1 for all cells that meet a certain condition.
My naive solution was as follows:
#include <stdio.h>

__global__ void calculate_pi(int center, int *count) {
    int x = threadIdx.x;
    int y = blockIdx.x;
    if (x*x + y*y <= center*center) {
        *count++;
    }
}

int main() {
    int interactions;
    printf("Enter the number of interactions: ");
    scanf("%d", &interactions);
    int l = sqrt(interactions);
    int h_count = 0;
    int *d_count;
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(&d_count, &h_count, sizeof(int), cudaMemcpyHostToDevice);
    calculate_pi<<<l,l>>>(l/2, d_count);
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_count);
    printf("Sum: %d\n", h_count);
    return 0;
}
In my use case, the value of interactions can be very large, making it impossible to allocate l * l of space.
Can someone help me? Any suggestions are welcome.
There are at least 3 problems with your code:

1. Your kernel code will not work correctly with an ordinary add here:

   *count++;

   This is because multiple threads are trying to do this at the same time, and CUDA does not automatically sort that out for you. For the purpose of this explanation, we will fix this with an atomicAdd(), although other methods are possible.

2. The ampersand doesn't belong here:

   cudaMemcpy(&d_count, &h_count, sizeof(int), cudaMemcpyHostToDevice);
              ^

   I assume that is just a typo, since you did it correctly on the subsequent cudaMemcpy operation:

   cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);

3. This methodology (effectively creating a square array of threads using threadIdx.x for one dimension and blockIdx.x for the other) will only work up to an interactions value that leads to an l value of 1024 or less, because CUDA threadblocks are limited to 1024 threads, and you are using l as the size of the threadblock in your kernel launch. To fix this you would want to learn how to create a CUDA 2D grid of arbitrary dimensions, and adjust your kernel launch and in-kernel indexing calculations appropriately. For now we will just make sure that the calculated l value is in range for your code design.
Here's an example addressing the above issues:
$ cat t1590.cu
#include <stdio.h>

__global__ void calculate_pi(int center, int *count) {
    int x = threadIdx.x;
    int y = blockIdx.x;
    if (x*x + y*y <= center*center) {
        atomicAdd(count, 1);
    }
}

int main() {
    int interactions;
    printf("Enter the number of interactions: ");
    scanf("%d", &interactions);
    int l = sqrt(interactions);
    if ((l > 1024) || (l < 1)) {printf("Error: interactions out of range\n"); return 0;}
    int h_count = 0;
    int *d_count;
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_count, &h_count, sizeof(int), cudaMemcpyHostToDevice);
    calculate_pi<<<l,l>>>(l/2, d_count);
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_count);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess){
        printf("Sum: %d\n", h_count);
        printf("fraction satisfying test: %f\n", h_count/(float)interactions);
    }
    else
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 0;
}
$ nvcc -o t1590 t1590.cu
$ ./t1590
Enter the number of interactions: 1048576
Sum: 206381
fraction satisfying test: 0.196820
$
We see that the code indicates a calculated fraction of about 0.2. Does this appear to be correct? I claim that it does appear to be correct based on your test. You are effectively creating a grid that represents dimensions of lxl. Your test is asking, effectively, "which points in that grid are within a circle, with the center at the origin (corner) of the grid, and radius l/2 ?"
Pictorially, that looks like a quarter-circle of radius l/2 shaded in the corner of an l-by-l square (figure omitted here), and it is reasonable to assume the shaded area is somewhat less than 0.25 of the total area, so 0.2 is a reasonable estimate of that area.
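As a quick sanity check (added here, not in the original answer): the kernel accepts a quarter of a disc of radius l/2 inside an l-by-l grid, so the expected fraction is (1/4)·π·(l/2)² / l² = π/16 ≈ 0.196, which agrees well with the 0.196820 printed above.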
As a bonus, here is a version of the code that reduces the restriction listed in item 3 above:
#include <stdio.h>

__global__ void calculate_pi(int center, int *count) {
    int x = threadIdx.x+blockDim.x*blockIdx.x;
    int y = threadIdx.y+blockDim.y*blockIdx.y;
    if (x*x + y*y <= center*center) {
        atomicAdd(count, 1);
    }
}

int main() {
    int interactions;
    printf("Enter the number of interactions: ");
    scanf("%d", &interactions);
    int l = sqrt(interactions);
    int h_count = 0;
    int *d_count;
    const int bs = 32;
    dim3 threads(bs, bs);
    dim3 blocks((l+threads.x-1)/threads.x, (l+threads.y-1)/threads.y);
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_count, &h_count, sizeof(int), cudaMemcpyHostToDevice);
    calculate_pi<<<blocks,threads>>>(l/2, d_count);
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_count);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess){
        printf("Sum: %d\n", h_count);
        printf("fraction satisfying test: %f\n", h_count/(float)interactions);
    }
    else
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 0;
}
This is launching a 2D grid based on l, and should work up to at least 1 billion interactions.

Intel gather instruction

I am a little confused about how the Intel gather intrinsic works.
I have the following simple code with two variants: one is supposed to set y[0]=y[1]=x[0], ..., y[20002]=y[20003]=x[10002]; the other is supposed to set y[i] = x[i], y[i+1] = x[i+2].
I print out a few values at random to check correctness. I find that both y[10] and y[11] equal 2.46 when "zeros" is used. However, I get a random number for y[11] when I use "stride", while y[10] is still 2.46. Any idea what's wrong?
#include <stdio.h>
#include <xmmintrin.h>
#include <immintrin.h>

void dummy(double *x, double *y) {
    printf("%lf, %lf\n", y[10], y[11]);
    return;
}

int main() {
    double x[20004];
    double y[20004];
    __m128i zeros = _mm_set_epi64x(0, 0);
    __m128i stride = _mm_set_epi64x(2, 0);
    for (int i = 0; i <= 20004; ++i) {
        x[i] = i * 0.246;
    }
    for (int j = 0; j <= 10000; j+=2) {
#ifdef ZERO
        __m128d gather = _mm_i64gather_pd(&x[j], zeros, 1);
#else
        __m128d gather = _mm_i64gather_pd(&x[j], stride, 1);
#endif
        _mm_store_pd(&y[j], gather);
    }
    dummy(x, y);
}
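One detail worth noting (an added sketch, not an answer from the original thread): the third argument of _mm_i64gather_pd is a byte scale applied to each 64-bit index, so with scale 1 an index of 2 offsets the base pointer by only 2 bytes, landing in the middle of a double. Gathering whole doubles at an element stride would use a scale of sizeof(double) = 8, roughly like this (requires AVX2, e.g. compile with -mavx2):
#include <stdio.h>
#include <immintrin.h>

int main(void) {
    double x[8], y[2];
    for (int i = 0; i < 8; ++i) x[i] = i * 0.246;

    /* Indices are element counts; the scale argument converts them to a
       byte offset, so 8 is needed to step over whole doubles. */
    __m128i stride = _mm_set_epi64x(2, 0);             /* gather x[0] and x[2] */
    __m128d gather = _mm_i64gather_pd(&x[0], stride, 8);
    _mm_storeu_pd(y, gather);                          /* unaligned store, no 16-byte assumption */

    printf("%lf %lf\n", y[0], y[1]);                   /* expect 0.000000 0.492000 */
    return 0;
}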

OpenACC error: libgomp: while loading libgomp-plugin-host_nonshm.so.1: cannot open shared object file

I want to compile a simple OpenACC sample (attached below). It compiles correctly, but when I run it I get an error.
compile with : gcc-5 -fopenacc accVetAdd.c -lm
run with : ./a.out
At run time I get the error:
libgomp: while loading libgomp-plugin-host_nonshm.so.1: libgomp-plugin-host_nonshm.so.1: cannot open shared object file: No such file or directory
I googled it and found only one page, so I am asking here: how do I fix this problem?
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char* argv[])
{
    // Size of vectors
    int n = 10000;

    // Input vectors
    double *restrict a;
    double *restrict b;
    // Output vector
    double *restrict c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize content of input vectors: a[i] = sin(i)^2, b[i] = cos(i)^2
    int i;
    for (i = 0; i < n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }

    // Sum component-wise and save the result into vector c
    #pragma acc kernels copyin(a[0:n],b[0:n]), copyout(c[0:n])
    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }

    // Sum up vector c and print the result divided by n; this should equal 1 within error
    double sum = 0.0;
    for (i = 0; i < n; i++) {
        sum += c[i];
    }
    sum = sum / n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}
libgomp dynamically loads shared object files for the plugins it supports, such as the one implementing the host_nonshm device. If they're installed in a non-standard directory (that is, not in the system's default search path), you need to tell the dynamic linker where to look for these shared object files: either compile with -Wl,-rpath,[...], or set the LD_LIBRARY_PATH environment variable.
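For example (illustrative only; /opt/gcc-5/lib64 below is a hypothetical plugin directory, the real path depends on where gcc-5 and its libgomp plugins were installed), the two options applied to the commands above would look like:
compile with : gcc-5 -fopenacc accVetAdd.c -lm -Wl,-rpath,/opt/gcc-5/lib64
or run with : LD_LIBRARY_PATH=/opt/gcc-5/lib64 ./a.out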

MPI matrix multiplication compile error: matrix undeclared

I coded an MPI matrix multiplication program which uses scanf("%d", &size) to set the matrix size and then defines int matrix[size*size], but when I compiled it, the compiler reported that matrix is undeclared. Please tell me why, or what my problem is!
Following Ed's suggestion, I moved the matrix definitions into the if(myid == 0) block, but got the same error! Now I post my code; please help me find out where I made mistakes. Thank you!
int size;

int main(int argc, char* argv[]) {
    int myid, numprocs;
    int *p;
    MPI_Status status;
    int i, j, k;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    if (myid == 0)
    {
        scanf("%d", &size);
        int matrix1[size*size];
        int matrix2[size*size];
        int matrix3[size*size];
        int section = size/numprocs;
        int tail = size % numprocs;

        srand((unsigned)time(NULL));
        for (i=0; i<size; i++)
            for (j=0; j<size; j++)
            {
                matrix1[i*size+j] = rand()%9;
                matrix3[i*size+j] = 0;
                matrix2[i*size+j] = rand()%9;
            }

        printf("Matrix1 is: \n");
        for (i=0; i<size; i++)
        {
            for (j=0; j<size; j++)
            {
                printf("%3d", matrix1[i*size+j]);
            }
            printf("\n");
        }
        printf("\n");
        printf("Matrix2 is: \n");
Reformatted code would be nice...
One problem is that you haven't declared the size variable. Another problem is that the [size] notation for declaring arrays is only good for sizes that are known at compile time. You want to use malloc() instead.
You don't actually need to define a MAX_SIZE if you use dynamic memory allocation.
#include <stdio.h>
#include <stdlib.h>
...
scanf("%d", &size);
int *matrix1 = (int *) malloc(size*size*sizeof(int));
int *matrix2 = (int *) malloc(size*size*sizeof(int));
int *matrix3 = (int *) malloc(size*size*sizeof(int));
...
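A fuller sketch of that idea, with the MPI distribution logic and the multiplication left out so it stays self-contained (the error checks and the calloc for the zero-initialized result matrix are additions, not from the original answer):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    int size;
    if (scanf("%d", &size) != 1 || size <= 0) return 1;

    /* Heap allocation works for sizes known only at run time. */
    int *matrix1 = malloc((size_t)size * size * sizeof *matrix1);
    int *matrix2 = malloc((size_t)size * size * sizeof *matrix2);
    int *matrix3 = calloc((size_t)size * size, sizeof *matrix3);
    if (!matrix1 || !matrix2 || !matrix3) return 1;

    srand((unsigned)time(NULL));
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++) {
            matrix1[i*size + j] = rand() % 9;
            matrix2[i*size + j] = rand() % 9;
        }

    printf("matrix1[0][0] = %d\n", matrix1[0]);

    free(matrix1);
    free(matrix2);
    free(matrix3);
    return 0;
}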
