Is it possible to use CUBLAS with OpenACC? - cublas

I have to implement a function that I already have in CUDA-C using OpenACC directives (I have to do a comparison). The original code contains a cublasSgemv call. Is there a way to use the cublas library under OpenACC?

Yes, you can use the host_data construct to do this. Here's an example of how to call cublasSaxpy from OpenACC:
#pragma acc data create(x[0:n]) copyout(y[0:n])
{
#pragma acc kernels
{
for( i = 0; i < n; i++)
{
x[i] = 1.0f;
y[i] = 0.0f;
}
}
#pragma acc host_data use_device(x,y)
{
cublasSaxpy(n, 2.0, x, 1, y, 1);
}
}
I have other examples in an article I wrote about OpenACC interoperability a few months ago. You can find it at http://www.pgroup.com/lit/articles/insider/v5n2a2.htm .

Related

-ta=tesla:deepcopy flag and #pragma acc shape

I just found out about the deepcopy flag. Until this moment I've always used -ta=tesla:managed to handle deep copy and I would like to explore the alternative.
I read this article: https://www.pgroup.com/blogs/posts/deep-copy-beta.htm which is well written but I think it does not cover my case. I have a structure of this type:
typedef struct Data_{
double ****Vc;
double ****Uc;
} Data;
The shape of these two arrays is not defined by an element of the struct itself but by elements of another structure, which are themselves defined only during the execution of the program.
How can I use the #pragma acc shape(Vc, Uc) in this case?
Without this pragma and copying the structure as follows:
int main(){
Data data;
Initialize(&data);
}
int Initialize(Data *data){
data->Uc = ARRAY_4D(ntot[KDIR], ntot[JDIR], ntot[IDIR], NVAR, double);
data->Vc = ARRAY_4D(NVAR, ntot[KDIR], ntot[JDIR], ntot[IDIR], double);
#pragma acc enter data copyin(data)
PrimToCons3D(data->Vc, data->Uc, grid, NULL);
}
void PrimToCons3D(double ****V, double ****U, Grid *grid, RBox *box){
#pragma acc parallel loop collapse(3) present(V[:NVAR][:nx3_tot][:nx2_tot][:nx1_tot])
for (k = kbeg; k <= kend; k++){
for (j = jbeg; j <= jend; j++){
for (i = ibeg; i <= iend; i++){
double v[NVAR];
#pragma acc loop
for (nv = 0; nv < NVAR; nv++) v[nv] = V[nv][k][j][i];
}
I get
FATAL ERROR: data in PRESENT clause was not found on device 1: name=V host:0x1fd2b80
file:/home/Prova/Src/mappers3D.c PrimToCons3D line:140
Btw, this same code works fine with -ta=tesla:managed.
Since you don't provide a full reproducing example, I wasn't able to test this, but it would look something like:
typedef struct Data_{
int i,j,k,l;
double ****Vc;
double ****Uc;
#pragma acc shape(Vc[0:l][0:k][0:j][0:i])
#pragma acc shape(Uc[0:k][0:j][0:i][0:l])
} Data;
int Initialize(Data *data){
/* the extents are members of the struct itself, so shape() can refer to them */
data->i = ntot[IDIR];
data->j = ntot[JDIR];
data->k = ntot[KDIR];
data->l = NVAR;
data->Uc = ARRAY_4D(ntot[KDIR], ntot[JDIR], ntot[IDIR], NVAR, double);
data->Vc = ARRAY_4D(NVAR, ntot[KDIR], ntot[JDIR], ntot[IDIR], double);
#pragma acc enter data copyin(data)
PrimToCons3D(data->Vc, data->Uc, grid, NULL);
}
void PrimToCons3D(double ****V, double ****U, Grid *grid, RBox *box){
int kbeg, jbeg, ibeg, kend, jend, iend;
#pragma acc parallel loop collapse(3) present(V, U)
for (int k = kbeg; k <= kend; k++){
for (int j = jbeg; j <= jend; j++){
for (int i = ibeg; i <= iend; i++){
Keep in mind, though, that the "shape" and "policy" directives were not adopted by the OpenACC standard, and we (the NVHPC compiler team) only produced a Beta version, which we have not maintained.
It is probably better to do a manual deep copy, which is standard compliant. I can help with that if you can provide a reproducer that includes how you're doing the array allocation, i.e. "ARRAY_4D".
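For reference, a manual deep copy usually follows the pattern sketched below: copy the struct first, then copy the member array so the runtime can attach the device pointer, and undo both in reverse order. This is only a minimal sketch under assumptions: `Field`, `field_alloc`, `field_to_device`, `field_from_device` and `scale` are hypothetical names, and a single flat buffer stands in for the `double ****` arrays because the real `ARRAY_4D` allocation is not shown in the question. Without an OpenACC compiler the pragmas are ignored and the code simply runs on the host.

```c
#include <stdlib.h>

/* Hypothetical simplified struct: one flat buffer instead of double ****. */
typedef struct {
    int n;       /* number of elements in v */
    double *v;   /* host allocation */
} Field;

/* Allocate the host buffer; returns 0 on success, -1 on failure. */
int field_alloc(Field *f, int n)
{
    f->n = n;
    f->v = calloc((size_t)n, sizeof *f->v);
    return f->v ? 0 : -1;
}

/* Manual deep copy: the struct goes first, then the array it points to. */
void field_to_device(Field *f)
{
    #pragma acc enter data copyin(f[0:1])
    #pragma acc enter data copyin(f->v[0:f->n])
}

/* Reverse order on the way back: member array first, then the struct. */
void field_from_device(Field *f)
{
    #pragma acc exit data copyout(f->v[0:f->n])
    #pragma acc exit data delete(f[0:1])
}

/* The kernel receives the member pointer, as PrimToCons3D does in the question. */
void scale(double *v, int n, double factor)
{
    #pragma acc parallel loop present(v[0:n])
    for (int i = 0; i < n; i++)
        v[i] *= factor;
}
```

A caller would fill `f.v` on the host, call `field_to_device(&f)`, run `scale(f.v, f.n, 2.0)`, and then call `field_from_device(&f)` before reading the results.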

Difference between mutual exclusion like atomic and reduction in OpenMP

I am following Tim Mattson's video lectures on OpenMP, and one exercise was to find the errors in provided code that computes the area of the Mandelbrot set. Here is the solution that was provided:
#define NPOINTS 1000
#define MAXITER 1000
void testpoint(struct d_complex);
struct d_complex{
double r;
double i;
};
struct d_complex c;
int numoutside = 0;
int main(){
int i,j;
double area, error, eps = 1.0e-5;
#pragma omp parallel for default(shared) private(c,j) firstprivate(eps)
for(i = 0; i<NPOINTS; i++){
for(j=0; j < NPOINTS; j++){
c.r = -2.0+2.5*(double)(i)/(double)(NPOINTS)+eps;
c.i = 1.125*(double)(j)/(double)(NPOINTS)+eps;
testpoint(c);
}
}
area=2.0*2.5*1.125*(double)(NPOINTS*NPOINTS-numoutside)/(double)(NPOINTS*NPOINTS);
error=area/(double)NPOINTS;
printf("Area of Mandlebrot set = %12.8f +/- %12.8f\n",area,error);
printf("Correct answer should be around 1.510659\n");
}
void testpoint(struct d_complex c){
// Does the iteration z=z*z+c, until |z| > 2 when point is known to be outside set
// If loop count reaches MAXITER, point is considered to be inside the set
struct d_complex z;
int iter;
double temp;
z=c;
for (iter=0; iter<MAXITER; iter++){
temp = (z.r*z.r)-(z.i*z.i)+c.r;
z.i = z.r*z.i*2+c.i;
z.r = temp;
if ((z.r*z.r+z.i*z.i)>4.0) {
#pragma omp atomic
numoutside++;
break;
}
}
}
My question is: could we use a reduction on the variable numoutside in the #pragma omp parallel for directive, like this:
#pragma omp parallel for default(shared) private(c,j) firstprivate(eps) reduction(+:numoutside)
without the atomic construct in the testpoint function?
I tested the function without atomic, and the result was different from the one I got in the first place. Why does that happen? And while I understand the concept of mutual exclusion and why it is needed because of race conditions, isn't reduction just another way of solving that problem with private variables?
Thank You in advance.
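One likely explanation: a reduction clause only privatizes numoutside for code written inside the loop body itself. testpoint is a separate function that refers to the global variable, so its increments still hit the shared original and race once the atomic is removed. A minimal sketch of a reduction-friendly variant is below, assuming the test function is changed to return a flag instead of updating the global; `is_outside` and `count_outside` are hypothetical names, not part of the lecture code.

```c
#define MAXITER 1000

struct d_complex { double r, i; };

/* Returns 1 if c escapes (|z| > 2) within MAXITER iterations, else 0. */
int is_outside(struct d_complex c)
{
    struct d_complex z = c;
    for (int iter = 0; iter < MAXITER; iter++) {
        double temp = z.r * z.r - z.i * z.i + c.r;
        z.i = 2.0 * z.r * z.i + c.i;
        z.r = temp;
        if (z.r * z.r + z.i * z.i > 4.0)
            return 1;
    }
    return 0;
}

/* Counts escaping grid points; the accumulation happens in the loop body,
   so each thread updates its own private copy of numoutside. */
int count_outside(int npoints, double eps)
{
    int numoutside = 0;
    #pragma omp parallel for reduction(+:numoutside)
    for (int i = 0; i < npoints; i++) {
        for (int j = 0; j < npoints; j++) {
            struct d_complex c;   /* declared in the loop, hence private */
            c.r = -2.0 + 2.5 * (double)i / (double)npoints + eps;
            c.i = 1.125 * (double)j / (double)npoints + eps;
            numoutside += is_outside(c);
        }
    }
    return numoutside;
}
```

With this structure no atomic is needed, because the private partial counts are combined once at the end of the loop.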

OpenMP and (Rcpp)Eigen

I am wondering how to write code that at times makes use of the OpenMP parallelization built into the Eigen library while at other times using parallelization that I specify. Hopefully, the code snippet below provides some background on my problem.
I am asking this question at the design stage of my library (sorry I don't have a working / broken code example).
#ifdef _OPENMP
#include <omp.h>
#endif
#include <RcppEigen.h>
void fxn(..., int ncores=-1){
if (ncores > 0) omp_set_num_threads(ncores);
/*
* Code with matrix products
* where I would like to use Eigen's
* OpenMP parallelization
*/
#pragma omp parallel for
for (int i=0; i < iter; i++){
/*
* Code I would like to parallelize "myself"
* even though it involves matrix products
*/
}
}
What is the best practice for controlling the balance between Eigen's own OpenMP parallelization and my own?
UPDATE:
I wrote a simple example and tested ggael's suggestion. In short, I am skeptical that it solves the problem I was posing (or I am doing something else wrong; apologies if it's the latter). Notice that with explicit parallelization of the for loop there is no change in run-time (not even a slow
#ifdef _OPENMP
#include <omp.h>
#endif
#include <RcppEigen.h>
using namespace Rcpp;
// [[Rcpp::plugins(openmp)]]
// [[Rcpp::export]]
Eigen::MatrixXd testing(Eigen::MatrixXd A, Eigen::MatrixXd B, int n_threads=1){
Eigen::setNbThreads(n_threads);
Eigen::MatrixXd C = A*B;
Eigen::setNbThreads(1);
for (int i=0; i < A.cols(); i++){
A.col(i).array() = A.col(i).array()*B.col(i).array();
}
return A;
}
// [[Rcpp::export]]
Eigen::MatrixXd testing_omp(Eigen::MatrixXd A, Eigen::MatrixXd B, int n_threads=1){
Eigen::setNbThreads(n_threads);
Eigen::MatrixXd C = A*B;
Eigen::setNbThreads(1);
#pragma omp parallel for num_threads(n_threads)
for (int i=0; i < A.cols(); i++){
A.col(i).array() = A.col(i).array()*B.col(i).array();
}
return A;
}
/*** R
A <- matrix(rnorm(1000*1000), 1000, 1000)
B <- matrix(rnorm(1000*1000), 1000, 1000)
microbenchmark::microbenchmark(testing(A,B, n_threads=1),
testing_omp(A,B, n_threads=1),
testing(A,B, n_threads=8),
testing_omp(A,B, n_threads=8),
times=10)
*/
Unit: milliseconds
                             expr       min        lq      mean    median        uq       max neval cld
     testing(A, B, n_threads = 1) 169.74272 183.94500 212.83868 218.15756 236.97049 264.52183    10   b
 testing_omp(A, B, n_threads = 1) 166.53132 178.48162 210.54195 227.65258 234.16727 238.03961    10   b
     testing(A, B, n_threads = 8)  56.03258  61.16001  65.15763  62.67563  67.37089  83.43565    10  a
 testing_omp(A, B, n_threads = 8)  54.18672  57.78558  73.70466  65.36586  67.24229 167.90310    10  a
The easiest is probably to disable/enable Eigen's multi-threading at runtime:
Eigen::setNbThreads(1); // single thread mode
#pragma omp parallel for
for (int i=0; i < iter; i++){
// Code I would like to parallelize "myself"
// even though it involves matrix products
}
Eigen::setNbThreads(0); // restore default

parallelizing in openMP

I have the following code that I want to parallelize using OpenMP:
for(m=0; m<r_c; m++)
{
for(n=0; n<c_c; n++)
{
double value = 0.0;
for(j=0; j<r_b; j++)
for(k=0; k<c_b; k++)
{
double a;
if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
a = 0.0;
else
a = h_a[((m-j)*c_a) + (n-k)];
//printf("%lf\t", a);
value += h_b[(j*c_b) + k] * a;
}
h_c[m*c_c + n] = value;
//printf("%lf\t", h_c[m*c_c + n]);
}
//cout<<"row "<<m<<" completed"<<endl;
}
Here I want every thread to execute the "for j" and "for k" loops simultaneously.
I tried putting pragma omp parallel for before the "for m" loop, but I am not getting the correct result.
How can I do this in an optimized manner? Thanks in advance.
Depending on exactly which loop you want to parallelize, you have three options:
#pragma omp parallel
{
#pragma omp for // Option #1
for(m=0; m<r_c; m++)
{
for(n=0; n<c_c; n++)
{
double value = 0.0;
#pragma omp for // Option #2
for(j=0; j<r_b; j++)
for(k=0; k<c_b; k++)
{
double a;
if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
a = 0.0;
else
a = h_a[((m-j)*c_a) + (n-k)];
//printf("%lf\t", a);
value += h_b[(j*c_b) + k] * a;
}
h_c[m*c_c + n] = value;
//printf("%lf\t", h_c[m*c_c + n]);
}
//cout<<"row "<<m<<" completed"<<endl;
}
}
//////////////////////////////////////////////////////////////////////////
// Option #3
for(m=0; m<r_c; m++)
{
for(n=0; n<c_c; n++)
{
#pragma omp parallel
{
double value = 0.0;
#pragma omp for
for(j=0; j<r_b; j++)
for(k=0; k<c_b; k++)
{
double a;
if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
a = 0.0;
else
a = h_a[((m-j)*c_a) + (n-k)];
//printf("%lf\t", a);
value += h_b[(j*c_b) + k] * a;
}
h_c[m*c_c + n] = value;
//printf("%lf\t", h_c[m*c_c + n]);
}
}
//cout<<"row "<<m<<" completed"<<endl;
}
Test and profile each. You might find that option #1 is fastest if there isn't a lot of work for each thread, or you may find that with optimizations on, there is no difference (or even a slowdown) when enabling OMP.
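As a rough, self-contained sketch of Option #1: every (m, n) output element is independent, so sharing the outer loop among threads is safe as long as the loop counters and the accumulator are private. Counters declared outside the parallel region are shared by default, which is a common source of wrong results. Splitting the "for j" loop instead would additionally need a reduction(+:value) clause. The function name `convolve2d` is hypothetical, and the bounds check uses `>=` on the assumption that h_a holds exactly r_a * c_a elements (the code in the question uses `>`).

```c
/* 2-D convolution of h_a (r_a x c_a) with h_b (r_b x c_b) into h_c (r_c x c_c).
   All matrices are stored row-major in flat arrays. */
void convolve2d(const double *h_a, int r_a, int c_a,
                const double *h_b, int r_b, int c_b,
                double *h_c, int r_c, int c_c)
{
    /* Rows of the output are distributed across threads. */
    #pragma omp parallel for
    for (int m = 0; m < r_c; m++) {
        for (int n = 0; n < c_c; n++) {
            double value = 0.0;   /* private: declared inside the loop */
            for (int j = 0; j < r_b; j++) {
                for (int k = 0; k < c_b; k++) {
                    int row = m - j;
                    int col = n - k;
                    /* Terms that fall outside h_a contribute zero. */
                    if (row >= 0 && col >= 0 && row < r_a && col < c_a)
                        value += h_b[j * c_b + k] * h_a[row * c_a + col];
                }
            }
            h_c[m * c_c + n] = value;   /* each thread writes distinct elements */
        }
    }
}
```

Because all counters are declared in the for statements, the same code works whether or not OpenMP is enabled at compile time.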
Edit
I've adapted the MCVE supplied in the comments as follows:
#include <iostream>
#include <chrono>
#include <omp.h>
#include <algorithm>
#include <cstdlib>
#include <vector>
#define W_OMP
int main(int argc, char *argv[])
{
std::vector<double> h_a(9);
std::generate(h_a.begin(), h_a.end(), std::rand);
int r_b = 500;
int c_b = r_b;
std::vector<double> h_b(r_b * c_b);
std::generate(h_b.begin(), h_b.end(), std::rand);
int r_c = 500;
int c_c = r_c;
int r_a = 3, c_a = 3;
std::vector<double> h_c(r_c * c_c);
auto start = std::chrono::system_clock::now();
#ifdef W_OMP
#pragma omp parallel
{
#endif
int m,n,j,k;
#ifdef W_OMP
#pragma omp for
#endif
for(m=0; m<r_c; m++)
{
for(n=0; n<c_c; n++)
{
double value = 0.0,a;
for(j=0; j<r_b; j++)
{
for(k=0; k<c_b; k++)
{
if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
a = 0.0;
else a = h_a[((m-j)*c_a) + (n-k)];
value += h_b[(j*c_b) + k] * a;
}
}
h_c[m*c_c + n] = value;
}
}
#ifdef W_OMP
}
#endif
auto end = std::chrono::system_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << elapsed.count() << "ms"
#ifdef W_OMP
"\t with OMP"
#else
"\t without OMP"
#endif
"\n";
return 0;
}
As a reference, I'm using VS2012 (OMP 2.0, grrr). The collapse clause was only introduced in OpenMP 3.0, so it is not available here. Optimizations were /O2 and compiled in Release x64.
Benchmarks
Using the original loop sizes (7,7,5,5) and therefore arrays, the results were 0ms without OMP and 1ms with. Verdict: optimizations were better, and the added overhead wasn't worth it. Also, the measurements are not reliable (too short).
Using slightly larger loop sizes (100, 100, 100, 100) and therefore arrays, the results were about equal at about 108ms. Verdict: still not worth the naive effort, tweaking OMP parameters might tip the scale. Definitely not the x4 speedup I would hope for.
Using even larger loop sizes (500, 500, 500, 500) and therefore arrays, OMP started to pull ahead. Without OMP 74.3s, with 15s. Verdict: Worth it. Weird. I got a x5 speedup with four threads and four cores on an i5. I'm not going to try and figure out how that happened.
Summary
As has been stated in countless answers here on SO, it's not always a good idea to parallelize every for loop you come across. Things that can screw up your desired xN speedup:
Not enough work per thread to justify the overhead of creating the additional threads
The work itself is memory bound. This means that the CPU can be running at 1petaHz and you still won't see a speedup.
Memory access patterns. I'm not going to go there. Feel free to edit in the relevant info if you want it.
OMP parameters. The best choice of parameters will often be a result of this entire list (not including this item, to avoid recursion issues).
SIMD operations. Depending on what and how you're doing, the compiler may vectorize your operations. I have no idea if OMP will usurp the SIMD operations, but it is possible. Check your assembly (foreign language to me) to confirm.

Parallel Bellman-Ford implementation

Can anyone point me to a good pseudocode of a simple parallel shortest path algorithm? Or any language, it doesn't matter. I'm having trouble finding good examples =[
I eventually implemented it myself for a bitcoin bot using OpenMP:
/*defines the chunk size as 1 contiguous iteration*/
#define CHUNKSIZE 1
/*forks off the threads*/
#pragma omp parallel
{
/*each thread needs its own iterator, so it is declared inside the parallel region*/
list<list_node>::iterator i;
/*Starts the work sharing construct; the for loop must directly follow the pragma*/
#pragma omp for schedule(dynamic, CHUNKSIZE)
for (int u = 0; u < V - 1; u++) {
if (dist[u] != INT_MAX) {
for (i = adj[u].begin(); i != adj[u].end(); ++i) {
if (dist[i->get_vertex()] > dist[u] + i->get_weight()) {
dist[i->get_vertex()] = dist[u] + i->get_weight();
pre[i->get_vertex()] = u;
}
}
}
}
}
If you want to look at my full implementation, you can view it as a Gist on my GitHub.
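For anyone who wants a version without concurrent writes to dist, here is a minimal sketch of a pull-style variant in C. It is not the implementation from the Gist: the `Edge` type and the `bellman_ford` signature are hypothetical. Each round reads only the previous distances and writes only the entry for its own vertex, so the parallel loop needs no locking. It assumes no negative cycles and weights small enough that the sums do not overflow, and it scans the whole edge list per vertex for brevity.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical edge type for this sketch. */
typedef struct { int src, dst, weight; } Edge;

/* Fills dist[] and pre[] for paths from source.
   Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int bellman_ford(int nv, int ne, const Edge *edges, int source,
                 int *dist, int *pre)
{
    int *next = malloc((size_t)nv * sizeof *next);
    if (!next)
        return -1;

    for (int v = 0; v < nv; v++) {
        dist[v] = INT_MAX;   /* INT_MAX marks "unreached" */
        pre[v] = -1;
    }
    dist[source] = 0;

    /* At most nv - 1 rounds are needed when there are no negative cycles. */
    for (int round = 0; round < nv - 1; round++) {
        int changed = 0;
        #pragma omp parallel for schedule(dynamic, 1) reduction(|:changed)
        for (int v = 0; v < nv; v++) {
            int best = dist[v];
            int best_pre = pre[v];
            /* Pull the best distance over all edges that end in v. */
            for (int e = 0; e < ne; e++) {
                int u = edges[e].src;
                if (edges[e].dst != v || dist[u] == INT_MAX)
                    continue;
                if (dist[u] + edges[e].weight < best) {
                    best = dist[u] + edges[e].weight;
                    best_pre = u;
                }
            }
            if (best < dist[v])
                changed |= 1;
            next[v] = best;      /* only the owner of v writes these entries */
            pre[v] = best_pre;
        }
        memcpy(dist, next, (size_t)nv * sizeof *dist);
        if (!changed)
            break;               /* fixed point reached early */
    }

    free(next);
    return 0;
}
```

The trade-off is an extra buffer and a copy per round in exchange for a loop body that is free of data races.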

Resources