The following code runs faster when I don't use OpenMP. Why is that?
I am running it on a dual-core machine.
The run takes about 0.05 s with OpenMP and only 0.03 s without it.
#include <omp.h>
#include <iostream>
#include <time.h>
using namespace std;

int main()
{
    clock_t start = clock();
    int i, j;
    float a[1000][1000];
    float b[1000][1000];
    #pragma omp parallel
    {
        #pragma omp for private(i)
        for (j = 0; j < 1000; j++)
        {
            for (i = 0; i < 1000; i++)
            {
                a[i][j] = (i*j*i*j)/(i+j+1)/(i*j*i+8*i+1);
                b[i][j] = (i*j*i*j)/(i+j+1)/(i*j*i+8*i+1);
            }
        }
    }
    clock_t end = clock();
    cout << "Time to run is " << (double)(end - start)/CLOCKS_PER_SEC << endl;
}
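One thing worth noting about the measurement itself: clock() reports CPU time summed over all threads, so a parallel region can show a larger value even when the wall-clock time actually drops. A minimal sketch of the same timing done with omp_get_wtime() instead (only the timing calls change; the loops are the ones above):

double start = omp_get_wtime();   // wall-clock seconds, not summed per-thread CPU time
// ... the parallel loops from above ...
double end = omp_get_wtime();
cout << "Wall time: " << (end - start) << " s" << endl;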
I am currently porting some code over to OpenMP at my place of work. One of the tasks I am doing is figuring out how to speed up matrix multiplication for one of our applications.
The matrices are stored in row-major format, so A[i*cols +j] gives the A_i_j element of the matrix A.
The code looks like this (uncommenting the pragma parallelises the code):
#include <omp.h>
#include <iostream>
#include <iomanip>
#include <stdio.h>
#define NUM_THREADS 8
#define size 500
#define num_iter 10
int main (int argc, char *argv[])
{
    // omp_set_num_threads(NUM_THREADS);
    int *A = new int [size*size];
    int *B = new int [size*size];
    int *C = new int [size*size];
    for (int i=0; i<size; i++)
    {
        for (int j=0; j<size; j++)
        {
            A[i*size+j] = j*1;
            B[i*size+j] = i*j+2;
            C[i*size+j] = 0;
        }
    }

    double total_time = 0;
    double start = 0;
    for (int t=0; t<num_iter; t++)
    {
        start = omp_get_wtime();
        int i, k;
        // #pragma omp parallel for num_threads(10) private(i, k) collapse(2) schedule(dynamic)
        for (int j=0; j<size; j++)
        {
            for (i=0; i<size; i++)
            {
                for (k=0; k<size; k++)
                {
                    C[i*size+j] += A[i*size+k] * B[k*size+j];
                }
            }
        }
        total_time += omp_get_wtime() - start;
    }

    // setprecision must be streamed into cout to take effect
    std::cout << std::setprecision(5) << total_time/num_iter << std::endl;

    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}
What is confusing me is the following: why is dynamic scheduling faster than static scheduling for this task? Timing the runs and taking an average shows that static scheduling is slower, which to me is a bit counterintuitive since each thread is doing the same amount of work.
Also, am I correctly speeding up my matrix multiplication code?
Parallel matrix multiplication is non-trivial (have you even considered cache-blocking?). Your best bet is likely to use a BLAS library for this rather than writing it yourself. (Remember, "The best code is the code I do not have to write".)
Wikipedia: Basic Linear Algebra Subprograms points to many implementations, a lot of which (including Intel Math Kernel Library) have free licenses.
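That said, if you keep the hand-written kernel, one hedged observation: with row-major storage, the j-outermost loop order in your code makes both the C and B accesses strided. A minimal sketch of the i-k-j ordering (same variable names as your code), which keeps the inner loop unit-stride and parallelizes over rows of C:

// Sketch: i-k-j order for row-major C = A*B; each thread owns whole rows
// of C, so there are no write conflicts, and the inner loop is unit-stride.
#pragma omp parallel for
for (int i = 0; i < size; i++)
{
    for (int k = 0; k < size; k++)
    {
        int aik = A[i*size + k];              // reused across one row of B
        for (int j = 0; j < size; j++)
        {
            C[i*size + j] += aik * B[k*size + j];
        }
    }
}

This is still no substitute for a tuned BLAS, but it removes the worst of the cache misses from the original loop order.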
I'm quite new to OpenMP and I have this assignment for school. I think the problem is in bitonicMerge. I have been trying a lot of variations and possibilities, and the "best solution" I found is the following:
void sort() {
    #pragma omp parallel
    {
        #pragma omp single
        recBitonicSort(0, N, ASCENDING);
    }
}

void recBitonicSort(int lo, int cnt, int dir) {
    if (cnt > 1) {
        int k = cnt / 2;
        #pragma omp task if(cnt > 1024)   // elements vary from 2^12 to 2^24
        recBitonicSort(lo, k, ASCENDING);
        #pragma omp task if(cnt > 1024)
        recBitonicSort(lo + k, k, DESCENDING);
        #pragma omp taskwait
        bitonicMerge(lo, cnt, dir);
    }
}

void bitonicMerge(int lo, int cnt, int dir) {
    if (cnt > 1) {
        int k = cnt / 2;
        int i;
        #pragma omp parallel num_threads(p)
        {
            #pragma omp for schedule(static) nowait
            for (i = lo; i < lo + k; i++)
            {
                //printf("Num of threads: %d\n", omp_get_num_threads());
                compare(i, i + k, dir);
            }
            #pragma omp single
            {
                #pragma omp task if(cnt > 1024)
                bitonicMerge(lo, k, dir);
                #pragma omp task if(cnt > 1024)
                bitonicMerge(lo + k, k, dir);
            }
        }
    }
}
The code works but has a time cost (the imperative bitonic sort takes 0.5 s, while this recursive version takes 7-8 s with elements = 2^20 and maxthreads = 8). I am aware that the printf reports only 1 thread, probably because recBitonicSort assigns a single thread to bitonicMerge, but I can't find a better solution.
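One hedged observation on that: the #pragma omp parallel inside bitonicMerge is created from inside a task, and a nested parallel region typically gets a team of just one thread unless nested parallelism is enabled, which matches the printf output. A sketch of a task-only merge, assuming the same compare() helper and sort() wrapper as above and a compiler supporting OpenMP 4.5+ for taskloop:

/* Sketch: replace the nested parallel region with taskloop, so the
   compare loop is split into tasks executed by the existing team.
   taskloop's implicit taskgroup ensures all compares finish before
   the two recursive sub-merges are spawned. */
void bitonicMerge(int lo, int cnt, int dir) {
    if (cnt > 1) {
        int k = cnt / 2;
        #pragma omp taskloop if(cnt > 1024)
        for (int i = lo; i < lo + k; i++)
            compare(i, i + k, dir);
        #pragma omp task if(cnt > 1024)
        bitonicMerge(lo, k, dir);
        #pragma omp task if(cnt > 1024)
        bitonicMerge(lo + k, k, dir);
        #pragma omp taskwait
    }
}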
I am using Rcpp for some heavy computation. However, my code doesn't run as fast as expected. I know I must have done something wrong because the same computation is faster in MATLAB. I have a hard time figuring out how I could improve my code, so I'm genuinely asking for help here.
I have a 12 by 5000^2/2 matrix called mX, and a 5000^2/2 by 1 matrix called w.
I would like to calculate A=mX * diag(w) * t(mX).
I have written the following four functions to do it.
#include <RcppArmadillo.h>
#include <RcppEigen.h>
#include <omp.h>
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::depends("RcppEigen")]]
// [[Rcpp::plugins(openmp)]]
using namespace Rcpp;
using Eigen::MatrixXd;
// [[Rcpp::export]]
arma::mat MultiplyArma1(arma::mat mX, arma::colvec w){
    return mX * arma::diagmat(w) * arma::trans(mX);
}

// [[Rcpp::export]]
arma::mat MultiplyArma2(arma::mat mX, arma::colvec w){
    return mX * (arma::repmat(w, 1, 12) % arma::trans(mX));
}

// [[Rcpp::export]]
arma::mat MultiplyCpp(arma::mat mX, arma::colvec w){
    omp_set_num_threads(4);
    int N = mX.n_rows;
    int K = mX.n_cols;
    arma::mat tmX      = arma::zeros<arma::mat>(K, N);
    arma::mat ProductF = arma::zeros<arma::mat>(N, N);
    arma::mat ProductM = arma::zeros<arma::mat>(K, N);

    // Explicit transpose: tmX = t(mX)
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < K; ++i){
        for (int j = 0; j < N; ++j){
            tmX(i, j) = mX(j, i);
        }
    }

    // Row-scale by w: ProductM = diag(w) * t(mX)
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < N; ++j){
        for (int i = 0; i < K; ++i){
            ProductM(i, j) = tmX(i, j) * w(i);
        }
    }

    // Final product: ProductF = mX * ProductM
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < N; ++j){
        for (int i = 0; i < N; ++i){
            for (int k = 0; k < K; ++k){
                ProductF(i, j) += tmX(k, i) * ProductM(k, j);  // note: ProductM(k, j), not (k, i)
            }
        }
    }
    return ProductF;
}

// [[Rcpp::export]]
MatrixXd MultiplyEigen(const MatrixXd mX, const MatrixXd Dw){
    return mX * ( Dw.cwiseProduct( mX.transpose() ) );
}
The 1st and 2nd both use RcppArmadillo and differ only in the representation of diag(w). The 3rd makes use of OpenMP, and I coded everything up element-wise. The 4th uses RcppEigen. I don't know how to generate diag(w) in RcppEigen, so I have the input be Dw=repmat(w,1,12).
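As an aside, Eigen can apply a diagonal scaling without materializing the diagonal matrix, via asDiagonal() on a vector. A minimal sketch (the function name is illustrative):

// Sketch: w.asDiagonal() wraps the vector as a diagonal operator, so no
// dense K-by-K diagonal matrix is ever allocated.
// [[Rcpp::export]]
MatrixXd MultiplyEigenDiag(const MatrixXd mX, const Eigen::VectorXd w){
    return mX * w.asDiagonal() * mX.transpose();
}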
mX=matrix(0.7, 12, 5000^2/2)
w=matrix(0.2, 5000^2/2,1)
Dw=repmat(w, 1,12)
tic()
a1=MultiplyArma1(mX,w)
toc()
tic()
a2=MultiplyArma2(mX,w)
toc()
tic()
a3=MultiplyCpp(mX,w)
toc()
tic()
a4=MultiplyEigen(mX, Dw)
toc()
Then I time them: they all take more than 3 seconds. However, when I do the same calculation in MATLAB, A=mX*(repmat(w,1,12).*mX'), it takes only 1.5 seconds.
It makes me feel that there's still room for improving my Rcpp function, but I honestly don't know how. I would very much appreciate your help!
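One hedged note for context: MATLAB ships with a multithreaded BLAS (Intel MKL), while R is often linked against a single-threaded reference BLAS, so part of the gap may come from the BLAS backing Armadillo and Eigen rather than from the Rcpp code itself. Linking R against OpenBLAS or MKL is worth trying before further hand-tuning.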
I've created a simple C program using GSL (the GNU Scientific Library) and OpenMP. In this simple program, I want to compare execution times for the sequential and parallel parts. Here is the program, main.c:
#include "omp.h"
#include <stdio.h>
#include <gsl/gsl_matrix.h>
#include <time.h>
int main()
{
omp_set_num_threads(4);
int n1=10000, n2=10000;
gsl_matrix *A = gsl_matrix_alloc(n1, n2);
int i,j;
struct timeval tv1, tv2, tv3, tv4;
gettimeofday(&tv1, 0);
for(i=0; i<n1; i++)
{
for(j=0; j<n2; j++)
{
gsl_matrix_set(A, i, j, i*j*1000000);
}
}
gettimeofday(&tv2, 0);
long elapsed = (tv2.tv_sec-tv1.tv_sec)*1000000 + tv2.tv_usec-tv1.tv_usec;
printf("Sequential Duration:%ldms\n", elapsed);
gettimeofday(&tv3, 0);
#pragma omp parallel for private(i,j)
for(i=0; i<n1; i++)
{
for(j=0; j<n2; j++)
{
gsl_matrix_set(A, i, j, i*j*1000000);
}
}
gettimeofday(&tv4, 0);
elapsed = (tv4.tv_sec-tv3.tv_sec)*1000000 + tv4.tv_usec-tv3.tv_usec;
printf(" Parallel Duration:%ldms\n", elapsed);
return 0;
}
Then I compiled the above code, using this command:
gcc -fopenmp main.c -o test -lgsl -lgslcblas -lm
Here is the program's result:
Sequential Duration: 11980106 us
  Parallel Duration: 20624043 us
Why is the parallel part slower than the sequential part? How can I optimize this code? Thanks.
As you have written it, the j variable is shared between all threads, so the threads constantly overwrite each other's state and re-iterate values that have already been covered.
You should always minimize the scope of variables when trying to parallelize with OpenMP. Either move the declaration of j into the loop or mark it as private explicitly:
#pragma omp parallel for private(j)
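A sketch of the loop-scoped alternative (same body as the question):

/* Sketch: loop variables declared inside the for statements are
   automatically private to each thread, so no private() clause is needed. */
#pragma omp parallel for
for (int i = 0; i < n1; i++)
    for (int j = 0; j < n2; j++)
        gsl_matrix_set(A, i, j, (double)i * j * 1000000);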
Also, clock() counts processor time, not real time; you probably want to use gettimeofday() instead.
Your matrix may also be too small to benefit much from parallelization, since the threading overhead will dominate. Increase it to around 10000x10000 to start seeing something.
The problem here is that you do not know what the procedure gsl_matrix_set does with A. You do not know if it is thread safe. To change one element in that matrix, you supply the whole matrix to the routine instead of only the indices of the element. This smells of false sharing (see e.g. this answer).
I would try direct element access instead, using the matrix's raw data pointer and row stride (tda):
A->data[i * A->tda + j] = i * j * 1000000;
If that does not work, and all you are interested in is the time difference between serial and parallel, I would just use a plain array:
A[i][j] = i * j * 1000000;
In the thread part, try this:
#pragma omp parallel for private(i,j)
for (i = 0; i < n1; i++)
{
    for (j = 0; j < n2; j++)
    {
        gsl_matrix_set(A, i, j, i*j*1000000);
    }
}
or, with the loop variables scoped to the loops:
#pragma omp parallel for
for (int i = 0; i < n1; i++)
{
    for (int j = 0; j < n2; j++)
    {
        gsl_matrix_set(A, i, j, i*j*1000000);
    }
}
#include <stdio.h>
#include <omp.h>
int main()
{
    int i, key = 85, tid;
    int a[100] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,
                  21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,
                  41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,
                  61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,
                  81,82,83,84,85,86,87,88,89,90,91,92,93,94,95};
    #pragma omp parallel num_threads(2) private(i, tid)  /* tid must be private, or one thread's id overwrites the other's */
    {
        tid = omp_get_thread_num();
        #pragma omp for
        for (i = 0; i < 100; i++)
            if (a[i] == key)
            {
                printf("Key found. Position = %d by thread %d \n", i + 1, tid);
            }
    }
    return 0;
}
Here is my parallel program. I'm using GCC on Fedora and the system is dual-core.
Actually, I need to compare sequential and parallel versions of linear search and show that the parallel one is better than the sequential one.
Do I need to add user and sys time to calculate the execution time for both the sequential and parallel versions (as the parallel one uses two cores)?
Please help me out. Thanks in advance.
It costs some time to set up the parallel environment. Try a much larger array; you should certainly see a speed-up.
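On the timing question: do not add user and sys time; that sums CPU time over both cores and will make the parallel version look worse than it is. Compare wall-clock time instead. A minimal sketch, wrapping whichever search you are timing:

/* Sketch: omp_get_wtime() returns wall-clock seconds, so the same
   measurement is fair for both the sequential and parallel versions. */
double t0 = omp_get_wtime();
/* ... run the sequential or parallel search here ... */
double t1 = omp_get_wtime();
printf("Elapsed wall time: %f s\n", t1 - t0);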