Why does OpenMP TASK postorder binary tree traversal use a single thread? - openmp

I'm new to OpenMP and am trying to write a postorder tree traversal using OpenMP tasks.
The tree is simple:
Node 2 (root) is the parent of 3, 4
Node 3 is the parent of 5, 6
Node 4 is the parent of 7, 8
The output shows that all the work is done by thread 0, but every PDF online says it should run in parallel. Is this some sort of thread-safety issue?
The code is:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct Node Node;
struct Node {
    int id;
    Node* left;
    Node* right;
};

void process(int n) {
    printf("I'm now on node %d working\n", n);
    for (int i = 0; i < 50000000; i++) {int j = 1; j++;}
    printf("I'm now on node %d working finished\n", n);
}

void traverseTree(Node * t) {
    #pragma omp task
    {
        if (t->left) {
            printf("Thread %d now go to node %d\n", omp_get_thread_num(), t->left->id);
            traverseTree(t->left);
        }
    }
    #pragma omp task
    {
        if (t->right) {
            printf("Thread %d now go to node %d\n", omp_get_thread_num(), t->right->id);
            traverseTree(t->right);
        }
    }
    #pragma omp taskwait
    printf("I'm thread %d, now on node %d working\n", omp_get_thread_num(), t->id);
    process(t->id);
}

int main() {
    // build the tree
    int n_nodes = 9;
    Node nodes[n_nodes];
    for (int i = 2; i < n_nodes; i++) {
        nodes[i].id = i;
        if (nodes[i].id * 2 < n_nodes) {
            nodes[i].left = &nodes[nodes[i].id*2-1];
            nodes[i].right = &nodes[nodes[i].id*2];
        } else {
            nodes[i].left = NULL;
            nodes[i].right = NULL;
        }
    }
    #pragma omp prallel num_threads(24)
    {
        #pragma omp single
        {
            traverseTree(&nodes[2]);
        }
    }
}
Can anyone explain this?

Well, it is due to ... a typing mistake. You should write:
#pragma omp parallel num_threads(24)
// ~~~~~~~~
By the way, optimizing compilers will optimize out such a loop (i.e. remove it):
for (int i = 0; i < 50000000; i++) {int j = 1; j++;}
If it takes time, that means optimizations are disabled. That is fine for debugging a program, but not for profiling it. Please consider using the -O3 flag with GCC/Clang/ICC or /O2 with MSVC.
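If you want the per-node busy-work to survive -O3 so the parallelism stays observable, one option is to make the loop's result visible to the compiler. A minimal sketch of such a process variant, assuming a volatile accumulator is acceptable for this kind of timing experiment:
void process(int n) {
    printf("I'm now on node %d working\n", n);
    volatile long sink = 0;         // volatile: the stores cannot be optimized out
    for (long i = 0; i < 50000000; i++)
        sink += i;
    printf("I'm now on node %d working finished\n", n);
}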

Related

Matrix Multiplication OpenMP Counter-Intuitive Results

I am currently porting some code over to OpenMP at my place of work. One of the tasks I am doing is figuring out how to speed up matrix multiplication for one of our applications.
The matrices are stored in row-major format, so A[i*cols +j] gives the A_i_j element of the matrix A.
The code looks like this (uncommenting the pragma parallelises the code):
#include <omp.h>
#include <iostream>
#include <iomanip>
#include <stdio.h>

#define NUM_THREADS 8
#define size 500
#define num_iter 10

int main (int argc, char *argv[])
{
    // omp_set_num_threads(NUM_THREADS);
    int *A = new int [size*size];
    int *B = new int [size*size];
    int *C = new int [size*size];
    for (int i=0; i<size; i++)
    {
        for (int j=0; j<size; j++)
        {
            A[i*size+j] = j*1;
            B[i*size+j] = i*j+2;
            C[i*size+j] = 0;
        }
    }

    double total_time = 0;
    double start = 0;
    for (int t=0; t<num_iter; t++)
    {
        start = omp_get_wtime();
        int i, k;
        // #pragma omp parallel for num_threads(10) private(i, k) collapse(2) schedule(dynamic)
        for (int j=0; j<size; j++)
        {
            for (i=0; i<size; i++)
            {
                for (k=0; k<size; k++)
                {
                    C[i*size+j] += A[i*size+k] * B[k*size+j];
                }
            }
        }
        total_time += omp_get_wtime() - start;
    }

    std::setprecision(5);
    std::cout << total_time/num_iter << std::endl;

    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}
What is confusing me is the following: why is dynamic scheduling faster than static scheduling for this task? Timing the runs and taking an average shows that static scheduling is slower, which to me is a bit counterintuitive since each thread is doing the same amount of work.
Also, am I correctly speeding up my matrix multiplication code?
Parallel matrix multiplication is non-trivial (have you even considered cache-blocking?). Your best bet is likely to use a BLAS library for this rather than writing it yourself. (Remember: "The best code is the code I do not have to write.")
Wikipedia: Basic Linear Algebra Subprograms points to many implementations, a lot of which (including the Intel Math Kernel Library) have free licenses.
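For illustration, a minimal sketch of handing the multiplication to a CBLAS implementation (OpenBLAS, MKL, ATLAS, ...). The cblas.h header and link flag depend on which library you install (e.g. -lopenblas), and note that BLAS works on floating-point matrices, so the int arrays above would need to become double:
#include <cblas.h>  // provided by OpenBLAS/MKL/ATLAS

// C = A * B for row-major n x n matrices of double.
void matmul_blas(const double* A, const double* B, double* C, int n)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,     // M, N, K
                1.0, A, n,   // alpha, A, leading dimension of A
                B, n,        // B, leading dimension of B
                0.0, C, n);  // beta, C, leading dimension of C
}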

How to: Bitonic Sort with OpenMP

I'm quite new to OpenMP and I have this assignment for school. I think the problem is in bitonicMerge. I have been trying a lot of variations and possibilities, and the "best solution" I found is the following:
void sort() {
    #pragma omp parallel
    {
        #pragma omp single
        recBitonicSort(0, N, ASCENDING);
    }
}

void recBitonicSort(int lo, int cnt, int dir) {
    if (cnt>1) {
        int k=cnt/2;
        #pragma omp task if(cnt>1024) // elements vary from 2^12 to 2^24
        recBitonicSort(lo, k, ASCENDING);
        #pragma omp task if(cnt>1024)
        recBitonicSort(lo+k, k, DESCENDING);
        #pragma omp taskwait
        bitonicMerge(lo, cnt, dir);
    }
}

void bitonicMerge(int lo, int cnt, int dir) {
    if (cnt>1) {
        int k=cnt/2;
        int i;
        #pragma omp parallel num_threads(p)
        {
            #pragma omp for schedule(static) nowait
            for (i=lo; i<lo+k; i++)
            {
                //printf("Num of threads: %d\n", omp_get_num_threads());
                compare(i, i+k, dir);
            }
            #pragma omp single
            {
                #pragma omp task if(cnt>1024)
                bitonicMerge(lo, k, dir);
                #pragma omp task if(cnt>1024)
                bitonicMerge(lo+k, k, dir);
            }
        }
    }
}
The code works but has a time cost (the imperative bitonic sort takes 0.5s, the recursive one takes 7-8s with elements=2^20 and maxthreads=8). I am aware that the printf reports only 1 thread, probably because recBitonicSort assigns a single thread to bitonicMerge, but I can't find a better solution.
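No answer is recorded here, but one thing stands out: bitonicMerge opens a new #pragma omp parallel region while already running inside the region created by sort(). Unless nested parallelism is enabled, such an inner region gets a team of just one thread, which would explain the single-thread printf. A minimal sketch (not from the original thread, assuming an OpenMP 4.5 compiler for taskloop) of keeping the merge inside the existing team by using tasks only:
void bitonicMerge(int lo, int cnt, int dir) {
    if (cnt > 1) {
        int k = cnt / 2;
        // Run the compare pass as tasks inside the enclosing parallel
        // region; taskloop waits for its own chunks before continuing.
        #pragma omp taskloop if(cnt > 1024)
        for (int i = lo; i < lo + k; i++)
            compare(i, i + k, dir);
        // The two halves are independent once the compare pass is done.
        #pragma omp task if(cnt > 1024)
        bitonicMerge(lo, k, dir);
        #pragma omp task if(cnt > 1024)
        bitonicMerge(lo + k, k, dir);
        #pragma omp taskwait
    }
}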

generating random variables with openmp in c++

How can I generate random variables in parallel (is it efficient? is it even possible?) with my linear congruential generator below:
double* uniform(long N)
{
    long i,j;
    long a=16807;
    long long m=(((long long)1)<<31)-1;
    long I[N];
    double *U;
    #pragma omp parallel for firstprivate(i)
    for (j = 0; j < N; j++)
    {
        if (i==0)
        {
            int y= omp_get_thread_num(); // undefined ref error here
            I[y];
            i++;
        }
        else
        {
            I[j] = (a*I[j-1])%m;
        }
    }
    #pragma omp parallel for
    for (i=0; i<N; i++)
        U[i] = (double)I[i]/(m+1.0);
    return U;
}
My goal is to generate 2 variables to use in another function (the Box-Muller method):
double* gauss(long int N)
{
    double *X, *Y, *U;
    X = generator(N/2);
    Y = generator(N/2);
    #pragma omp parallel for
    for (i=0; i<N/2; i++)
    {
        U[2*i]=sqrt(-2 * log(X[i]))*sin(Y[i]*2*3.14);
        U[2*i+1]=sqrt(-2 * log(X[i]))*cos(Y[i]*2*3.14);
    }
    return U;
}
I want to know how I can get different seeds when generating uniform variables with the function uniform.
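No answer is recorded here, but a common pattern is to give each thread its own generator state, seeded from its thread number, so no thread depends on another thread's previous value. A minimal sketch under those assumptions; the name uniform_parallel and the seed formula are illustrative only, and a per-thread LCG like this gives no statistical guarantee that the streams don't overlap:
#include <omp.h>
#include <stdlib.h>

// Illustrative only: one independent LCG stream per thread.
double* uniform_parallel(long N)
{
    const long long a = 16807;
    const long long m = (((long long)1) << 31) - 1;
    double *U = (double*)malloc(N * sizeof(double));
    if (!U) return NULL;
    #pragma omp parallel
    {
        // per-thread seed: distinct and nonzero for every thread
        long long x = 12345 + 7919 * (long long)omp_get_thread_num();
        #pragma omp for
        for (long j = 0; j < N; j++)
        {
            x = (a * x) % m;           // advance this thread's own state
            U[j] = (double)x / (m + 1.0);
        }
    }
    return U;
}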

"Warning : Non-POD class type passed through ellipsis" for simple thrust program

In spite of reading many answers to this kind of question on SO, I am not able to figure out the solution in my case. I have written the following code to implement a Thrust program. The program performs a simple copy and display operation.
#include <stdio.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

int main(void)
{
    // H has storage for 4 integers
    thrust::host_vector<int> H(4);
    H[0] = 14;
    H[1] = 20;
    H[2] = 38;
    H[3] = 46;
    // H.size() returns the size of vector H
    printf("\nSize of vector : %d",H.size());
    printf("\nVector Contents : ");
    for (int i = 0; i < H.size(); ++i) {
        printf("\t%d",H[i]);
    }
    thrust::device_vector<int> D = H;
    printf("\nDevice Vector Contents : ");
    for (int i = 0; i < D.size(); i++) {
        printf("%d",D[i]); // This is where I get the warning.
    }
    return 0;
}
Thrust implements certain operations to facilitate using elements of a device_vector in host code, but this apparently isn't one of them.
There are many approaches to addressing this issue. The following code demonstrates 3 possible approaches:
1. Explicitly copy D[i] to a host variable; thrust has an appropriate method defined for that.
2. Copy the thrust device_vector back to a host_vector before print-out.
3. Use thrust::copy to directly copy the elements of the device_vector to a stream.
Code:
#include <stdio.h>
#include <iostream>
#include <iterator>  // std::ostream_iterator, used by method 3
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>

int main(void)
{
    // H has storage for 4 integers
    thrust::host_vector<int> H(4);
    H[0] = 14;
    H[1] = 20;
    H[2] = 38;
    H[3] = 46;
    // H.size() returns the size of vector H
    printf("\nSize of vector : %d",H.size());
    printf("\nVector Contents : ");
    for (int i = 0; i < H.size(); ++i) {
        printf("\t%d",H[i]);
    }
    thrust::device_vector<int> D = H;
    printf("\nDevice Vector Contents : ");
    // method 1
    for (int i = 0; i < D.size(); i++) {
        int q = D[i];
        printf("\t%d",q);
    }
    printf("\n");
    // method 2
    thrust::host_vector<int> Hnew = D;
    for (int i = 0; i < Hnew.size(); i++) {
        printf("\t%d",Hnew[i]);
    }
    printf("\n");
    // method 3
    thrust::copy(D.begin(), D.end(), std::ostream_iterator<int>(std::cout, ","));
    std::cout << std::endl;
    return 0;
}
Note that for methods like these, thrust generates various kinds of device->host copy operations to facilitate the use of device_vector in host code. This has performance implications, so for large vectors you might want to stick to the explicitly defined copy operations.

Optimizing N-queen with openmp

I am learning OpenMP and wrote the following code to solve the N-queens problem.
//Full Code: https://github.com/Shafaet/Codes/blob/master/OPENMP/Parallel%20N-Queen%20problem.cpp
int n;

int call(int col,int rowmask,int dia1,int dia2)
{
    if(col==n)
    {
        return 1;
    }
    int row,ans=0;
    for(row=0;row<n;row++)
    {
        if(!(rowmask & (1<<row)) & !(dia1 & (1<<(row+col))) & !(dia2 & (1<<((row+n-1)-col))))
        {
            ans+=call(col+1,rowmask|1<<row,dia1|(1<<(row+col)), dia2|(1<<((row+n-1)-col)));
        }
    }
    return ans;
}

double parallel()
{
    double st=omp_get_wtime();
    int ans=0;
    int i;
    int rowmask=0,dia1=0,dia2=0;
    #pragma omp parallel for reduction(+:ans) shared(i,rowmask)
    for(i=0;i<n;i++)
    {
        rowmask=0;
        dia1=0,dia2=0;
        int col=0,row=i;
        ans+=call(1,rowmask|1<<row,dia1|(1<<(row+col)), dia2|(1<<((row+n-1)-col)));
    }
    printf("Found %d configuration for n=%d\n",ans,n);
    double en=omp_get_wtime();
    printf("Time taken using openmp %lf\n",en-st);
    return en-st;
}

double serial()
{
    double st=omp_get_wtime();
    int ans=0;
    int i;
    int rowmask=0,dia1=0,dia2=0;
    for(i=0;i<n;i++)
    {
        rowmask=0;
        dia1=0,dia2=0;
        int col=0,row=i;
        ans+=call(1,rowmask|1<<row,dia1|(1<<(row+col)), dia2|(1<<((row+n-1)-col)));
    }
    printf("Found %d configuration for n=%d\n",ans,n);
    double en=omp_get_wtime();
    printf("Time taken without openmp %lf\n",en-st);
    return en-st;
}

int main()
{
    double average=0;
    int count=0;
    for(int i=2;i<=13;i++)
    {
        count++;
        n=i;
        double stime=serial();
        double ptime=parallel();
        printf("OpenMP is %lf times faster for n=%d\n",stime/ptime,n);
        average+=stime/ptime;
        puts("===============");
    }
    printf("On average OpenMP is %lf times faster\n",average/count);
    return 0;
}
The parallel code is already faster than the serial one, but I wonder how I can optimize it further using OpenMP pragmas. I want to know what I should do for better performance and what I should not do.
Thanks in advance.
(Please don't suggest any optimizations that are unrelated to parallel programming.)
Your code seems to use the classic backtracking N-Queens recursive algorithm, which is not the fastest possible approach to N-Queens solving but (due to its simplicity) is the most vivid one for practicing parallelism basics.
That being said: it is very simple, so you shouldn't expect it to naturally demonstrate many advanced OpenMP features beyond a basic parallel for and a reduction.
But since you're looking to learn parallelism, and probably want more clarity and a better learning curve, there is one more (out of many possible) implementations available, which uses the same algorithm but tends to be more readable and vivid from an educational perspective:
void setQueen(int queens[], int row, int col) {
    // check all previously placed rows for attacks
    for(int i=0; i<row; i++) {
        // vertical attacks
        if (queens[i]==col) {
            return;
        }
        // diagonal attacks
        if (abs(queens[i]-col) == (row-i) ) {
            return;
        }
    }
    // column is ok, set the queen
    queens[row]=col;
    if(row==size-1) {
        #pragma omp atomic
        nrOfSolutions++; // placed final queen, found a solution
    }
    else {
        // try to fill next row
        for(int i=0; i<size; i++) {
            setQueen(queens, row+1, i);
        }
    }
}

// Function to find all solutions for the nQueens problem on a size x size chessboard.
void solve() {
    #pragma omp parallel for
    for(int i=0; i<size; i++) {
        // try all positions in first row
        int * queens = new int[size]; // array representing queens placed on a chess board; index is row position, value is column
        setQueen(queens, 0, i);
        delete[](queens);
    }
}
This code is one of the Intel Advisor XE samples (available for both C++ and Fortran); the parallelization aspects of this sample are discussed in great detail in Chapter 10 of the given Parallel Programming Book (in fact, that chapter uses N-Queens to demonstrate how to use tools to parallelize serial code in general).
The Advisor n-queens sample uses essentially the same algorithm as yours, but it replaces the explicit reduction with a combination of a simple parallel for plus an atomic. This code is expected to be less efficient, but more "procedural-style" and more "educational", since it demonstrates a "hidden" data race. If you download the given sample code, you will actually find 4 equivalent N-Queens parallel implementations using TBB, Cilk Plus, and OpenMP (OpenMP for both C++ and Fortran).
I know I am a little late to the party, but you can use task queueing for further optimization (about 7-10% faster results); no idea why. Here's the code that I am using:
#include <iostream>  // std::cout, cin, cerr ...
#include <iomanip>   // modify std::out
#include <omp.h>

using namespace std;

int nrOfSolutions=0;
int size=0;

void print(int queens[]) {
    cerr << "Solution " << nrOfSolutions << endl;
    for(int row=0; row<size; row++) {
        for(int col=0; col<size; col++) {
            if(queens[row]==col) {
                cout << "Q";
            }
            else {
                cout << "-";
            }
        }
        cout << endl;
    }
}
void setQueen(int queens[], int row, int col, int id) {
    for(int i=0; i<row; i++) {
        // vertical attacks
        if (queens[i]==col) {
            return;
        }
        // diagonal attacks
        if (abs(queens[i]-col) == (row-i) ) {
            return;
        }
    }
    // column is ok, set the queen
    queens[row]=col;
    if(row==size-1) {
        // only one thread should be allowed to print at a time
        {
            // increasing the solution counter is not atomic
            #pragma omp critical
            nrOfSolutions++;
            #ifdef _DEBUG
            #pragma omp critical
            print(queens);
            #endif
        }
    }
    else {
        // try to fill next row
        for(int i=0; i<size; i++) {
            setQueen(queens, row+1, i, id);
        }
    }
}
void solve() {
    int myid=0 ;
    #pragma omp parallel
    #pragma omp single
    {
        for(int i=0; i<size; i++) {
            /*
            #ifdef _OMP //(???)
            myid = omp_get_thread_num();
            #endif
            #ifdef _DEBUG
            cout << "ThreadNum: " << myid << endl ;
            #endif
            */
            // try all positions in first row;
            // create a separate array for each recursion started here
            #pragma omp task
            {
                int* queens = new int[size];
                setQueen(queens, 0, i, myid);
                delete[] queens; // each task frees its own board
            }
        }
    }
}
int main(int argc, char* argv[]) {
    if(argc !=2) {
        cerr << "Usage: nq-openmp-taskq boardSize.\n";
        return 0;
    }
    size = atoi(argv[1]);
    cout << "Starting OpenMP Task Queue solver for size " << size << "...\n";
    double st=omp_get_wtime();
    solve();
    double en=omp_get_wtime();
    printf("Time taken using openmp %lf\n",en-st);
    cout << "Number of solutions: " << nrOfSolutions << endl;
    return 0;
}
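For reference, one possible way to build and run this with GCC (the file name is illustrative):
g++ -std=c++11 -O2 -fopenmp nq-openmp-taskq.cpp -o nq-openmp-taskq
./nq-openmp-taskq 12
Note that under C++17 the global variable size becomes ambiguous with std::size because of using namespace std, so an older standard level (or renaming the variable) is the safer choice.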
