Thread-safe counter using OpenMP

What are the solutions to the hazards of doing this?
#include <iostream>
#include <unistd.h>
#include <cstdlib>
#include <ctime>
int main(){
    int k = 0;
    #pragma omp parallel for
    for(int i = 0; i < 100000; i++){
        if (i % 2){ /** Conditional based on i **/
            #pragma omp atomic
            k++;
            usleep(1000 * ((float)std::rand() / RAND_MAX));
            #pragma omp task
            std::cout << k << std::endl; /** Some sort of task **/
        }
    }
    return 0;
}
I need all the values of k to be unique. What would be a better way of doing this?
Edit
Notice that this question refers to an aggregate.
In particular, I want to spawn tasks based on a shared variable, so I run the risk of a race condition.
Consider: thread 2 completes an iteration, evaluates the conditional as true, and increments k before thread 1 has spawned all of its tasks.
Edit edit
I tried to force a race condition. It wasn't obvious without the sleep, but there are in fact problems. How can I overcome this?
Here's a quick solution:
...
#pragma omp atomic
k++;
int c = k;
...
but I'd like a guarantee: the read of k into c is a separate operation from the atomic increment, so two threads can both increment k before either one reads it, and they would then store the same value in c.
Tangential. Why doesn't this implementation work?
...
int c;
#pragma omp crtical
{
    k++;
    c = k;
}
...
At the end of the function, std::cout << k; consistently prints a value less than the expected 50000 (output proof).

I hate to answer my own question so quickly, but I found a solution for this particular instance.
As of OpenMP 3.1 there is the atomic capture construct, whose use case is problems just like this. The resulting code:
#include <iostream>
#include <unistd.h>
#include <cstdlib>
#include <ctime>
int main(){
    int k = 0;
    #pragma omp parallel for
    for(int i = 0; i < 100000; i++){
        if (i % 2){ /** Conditional based on i **/
            int c;
            #pragma omp atomic capture
            {
                c = k;
                k++;
            }
            usleep(1000 * ((float)std::rand() / RAND_MAX));
            #pragma omp task
            std::cout << c << std::endl; /** Some sort of task **/
        }
    }
    std::cout << k << std::endl; /** Print the final count **/
    std::cout.flush();
    return 0;
}
I will leave this question open in case someone would like to contribute ideas or code-architecture suggestions for avoiding these problems, or reasons the #pragma omp crtical version didn't work.
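One likely explanation for the critical version, for what it's worth: the directive is spelled "crtical", and compilers typically ignore unrecognized pragmas (emitting at most a warning), so that block runs without any mutual exclusion. With the correct spelling, a critical section should also guarantee unique values of c; a minimal sketch (same variables as above):

int c;
#pragma omp critical
{
    // Increment and read happen under one lock, so each iteration
    // observes a distinct value of k.
    k++;
    c = k;
}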

Related

OpenMP: task_reduction=reduction? What is 'in_reduction'?

Is 'task_reduction' the same as 'reduction([reduction-modifier,] reduction-identifier : list)' with the task reduction-modifier?
If it is the same, then why do we need 'task_reduction'?
What is 'in_reduction' doing?
The text says 'The in_reduction clause specifies that a task participates in a reduction'.
But what does that mean? 'in_reduction' takes the same form as 'reduction':
in_reduction(identifier : list)
But if we can put reduction variables in 'list', then what does that have to do with 'a task participates in a reduction'...?
I can imagine how reduction works, but I can't picture what 'in_reduction' does.
Why do we need it?
======================================
I made an example. This code should give the sum of the numbers at even indices.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char* argv[]){
    int Array [10]= {0,1,2,3,4,5,6,7,8,9};
    int Array_length = 10;
    int counter = 0;
    int result = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskgroup task_reduction(+:result)
        //#pragma omp parallel reduction(task,+:result) this can work same as taskgroup task_reduction?
        {
            while (Array_length!=counter){
                if (counter%2==0){
                    #pragma omp task in_reduction(+:result)
                    {
                        result+=Array[counter];
                    }
                } else {
                    result+=Array[counter];
                }
                counter=counter+1;
            }
        }
    }
    printf("The sum of all array elements is equal to %d.\n", result);
}
I also made an illustration so I could visualize my understanding.
So… task_reduction creates a reduction, so that local results can be contributed, and only tasks with in_reduction participate in contributing to the reduction?
If I understand correctly, then this code should give 20 as the result. However, my code gives 45, which is the sum of 0 to 9.
Where did I make a mistake?
By the way, what happens if I don't write 'in_reduction' at all? Is the result 0 then?
The way task reduction works is that the task needs to know where to contribute its local result. So, what you have to do is have a taskgroup that "creates" the reduction and then have tasks contribute to it:
void example() {
    int result = 0;
    #pragma omp parallel  // create parallel team
    #pragma omp single    // have only one task creator
    {
        #pragma omp taskgroup task_reduction(+:result)
        {
            while(have_to_create_tasks()) {
                #pragma omp task in_reduction(+:result)
                {   // this task contributes to the reduction
                    result = do_something();
                }
                #pragma omp task firstprivate(result)
                {   // this task does not contribute to the reduction
                    result = do_something_else();
                }
            }
        }
    }
}
So, the in_reduction is needed for a task to contribute to a reduction that has been created by a task_reduction clause of the enclosing taskgroup region.
The reduction clause cannot be used with the task construct; it is only available on worksharing and loop constructs.
The only tasking construct that has a reduction clause is the taskloop construct, which uses it as a shortcut: the generated tasks are enclosed in an implicit taskgroup with a task_reduction clause, and each generated task carries an implicit in_reduction clause.
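A minimal sketch of that taskloop shortcut (an illustration only, assuming a compiler that supports OpenMP 5.0, where taskloop gained the reduction clause; the array and variable names here are invented):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int a[100], sum = 0;
    for (int i = 0; i < 100; i++)
        a[i] = i;

    #pragma omp parallel
    #pragma omp single
    {
        // Behaves like an enclosing taskgroup with task_reduction(+:sum)
        // where every generated task has an in_reduction(+:sum) clause.
        #pragma omp taskloop reduction(+:sum)
        for (int i = 0; i < 100; i++)
            sum += a[i];
    }

    printf("sum = %d\n", sum);  // expected: 4950
    return 0;
}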
UPDATE (to cover the edits by the original poster):
The problem with the updated code is that there are two things happening (see the inline comments in your updated code):
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char* argv[]){
    int Array [10]= {0,1,2,3,4,5,6,7,8,9};
    int Array_length = 10;
    int counter = 0;
    int result = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskgroup task_reduction(+:result)
        {
            // "result" is a shared variable in this parallel region
            while (Array_length!=counter) {
                if (counter%2==0){
                    #pragma omp task in_reduction(+:result)
                    {
                        // This task will contribute to the reduction result
                        // as you would expect.
                        result+=Array[counter];
                    }
                } else {
                    // This addition to "result" is performed by the "single"
                    // thread and thus hits the shared variable. You can see
                    // this when you print the address of "result" here
                    // and before the parallel region.
                    result+=Array[counter];
                }
                counter=counter+1;
            }
        } // Here the "single" thread waits for the taskgroup to complete
          // and the reduction to happen. So, here the shared variable
          // "result" is added to the value of "result" coming from the
          // task reduction. So, result = 25 from the "single" thread and
          // result = 20 are added up to result = 45.
    }
    printf("The sum of all array elements is equal to %d.\n", result);
}
The addition at the end of the taskgroup is also effectively a race condition, since the updates coming from the "single" thread and the reduction performed at the end of the taskgroup are not synchronized. I suspect the race does not show up because the code runs too quickly to expose it clearly.
To fix the code, you'd have to also have a task construct around the update for odd numbers, like so:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char* argv[]){
    int Array [10]= {0,1,2,3,4,5,6,7,8,9};
    int Array_length = 10;
    int counter = 0;
    int result = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskgroup task_reduction(+:result)
        {
            // "result" is a shared variable in this parallel region
            while (Array_length!=counter) {
                if (counter%2==0){
                    #pragma omp task in_reduction(+:result)
                    {
                        // This task will contribute to the reduction result
                        // as you would expect.
                        result+=Array[counter];
                    }
                } else {
                    #pragma omp task firstprivate(result)
                    {
                        // "result" is now a task-local variable that is not
                        // shared. If you remove the firstprivate, then the
                        // race condition on the shared variable "result" is
                        // back.
                        result+=Array[counter];
                    }
                }
                counter=counter+1;
            }
        } // Here the "single" thread waits for the taskgroup to complete
          // and the reduction to happen. Since the odd-index additions now go
          // into task-local copies of "result" that are discarded, only the
          // task reduction contributes, and the program prints result = 20.
    }
    printf("The sum of all array elements is equal to %d.\n", result);
}
In my first answer, I failed to add a proper firstprivate or private clause to the task. I'm sorry about that.

Matrix Multiplication OpenMP Counter-Intuitive Results

I am currently porting some code over to OpenMP at my place of work. One of the tasks I am doing is figuring out how to speed up matrix multiplication for one of our applications.
The matrices are stored in row-major format, so A[i*cols +j] gives the A_i_j element of the matrix A.
The code looks like this (uncommenting the pragma parallelises the code):
#include <omp.h>
#include <iostream>
#include <iomanip>
#include <stdio.h>
#define NUM_THREADS 8
#define size 500
#define num_iter 10
int main (int argc, char *argv[])
{
    // omp_set_num_threads(NUM_THREADS);
    int *A = new int [size*size];
    int *B = new int [size*size];
    int *C = new int [size*size];
    for (int i=0; i<size; i++)
    {
        for (int j=0; j<size; j++)
        {
            A[i*size+j] = j*1;
            B[i*size+j] = i*j+2;
            C[i*size+j] = 0;
        }
    }

    double total_time = 0;
    double start = 0;
    for (int t=0; t<num_iter; t++)
    {
        start = omp_get_wtime();
        int i, k;
        // #pragma omp parallel for num_threads(10) private(i, k) collapse(2) schedule(dynamic)
        for (int j=0; j<size; j++)
        {
            for (i=0; i<size; i++)
            {
                for (k=0; k<size; k++)
                {
                    C[i*size+j] += A[i*size+k] * B[k*size+j];
                }
            }
        }
        total_time += omp_get_wtime() - start;
    }

    // setprecision must be inserted into the stream to take effect
    std::cout << std::setprecision(5) << total_time/num_iter << std::endl;

    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}
What is confusing me is the following: why is dynamic scheduling faster than static scheduling for this task? Timing the runs and taking an average shows that static scheduling is slower, which to me is a bit counterintuitive since each thread is doing the same amount of work.
Also, am I correctly speeding up my matrix multiplication code?
Parallel matrix multiplication is non-trivial (have you even considered cache-blocking?). Your best bet is likely to use a BLAS library for this, rather than writing it yourself. (Remember, "The best code is the code I do not have to write.")
Wikipedia: Basic Linear Algebra Subprograms points to many implementations, a lot of which (including Intel Math Kernel Library) have free licenses.
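For instance, a minimal sketch of the BLAS route (assuming a CBLAS implementation such as OpenBLAS or MKL is linked in, and that the int matrices from the question are converted to double; matmul_blas is just an illustrative name):

#include <cblas.h>   // link against a CBLAS implementation, e.g. -lopenblas

// Computes C = A * B for size x size row-major matrices of doubles,
// matching the A[i*size + j] layout used in the question.
void matmul_blas(const double *A, const double *B, double *C, int size)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                size, size, size,   // M, N, K
                1.0, A, size,       // alpha, A, lda
                B, size,            // B, ldb
                0.0, C, size);      // beta, C, ldc
}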

OpenMP and (Rcpp)Eigen

I am wondering how to write code that at times makes use of the OpenMP parallelization built into the Eigen library, while at other times uses parallelization that I specify myself. Hopefully, the code snippet below provides some background on my problem.
I am asking this question at the design stage of my library (sorry I don't have a working / broken code example).
#ifdef _OPENMP
#include <omp.h>
#endif
#include <RcppEigen.h>
void fxn(..., int ncores=-1){
    if (ncores > 0) omp_set_num_threads(ncores);

    /*
     * Code with matrix products
     * where I would like to use Eigen's
     * OpenMP parallelization
     */

    #pragma omp parallel for
    for (int i=0; i < iter; i++){
        /*
         * Code I would like to parallelize "myself"
         * even though it involves matrix products
         */
    }
}
What is the best practice for controlling the balance between Eigen's own OpenMP parallelization and my own?
UPDATE:
I wrote a simple example and tested ggael's suggestion. In short, I am skeptical that it solves the problem I was posing (or I am doing something else wrong; apologies if it's the latter). Notice that with explicit parallelization of the for loop there is no change in run-time (not even a slowdown).
#ifdef _OPENMP
#include <omp.h>
#endif
#include <RcppEigen.h>
using namespace Rcpp;
// [[Rcpp::plugins(openmp)]]
// [[Rcpp::export]]
Eigen::MatrixXd testing(Eigen::MatrixXd A, Eigen::MatrixXd B, int n_threads=1){
    Eigen::setNbThreads(n_threads);
    Eigen::MatrixXd C = A*B;
    Eigen::setNbThreads(1);
    for (int i=0; i < A.cols(); i++){
        A.col(i).array() = A.col(i).array()*B.col(i).array();
    }
    return A;
}

// [[Rcpp::export]]
Eigen::MatrixXd testing_omp(Eigen::MatrixXd A, Eigen::MatrixXd B, int n_threads=1){
    Eigen::setNbThreads(n_threads);
    Eigen::MatrixXd C = A*B;
    Eigen::setNbThreads(1);
    #pragma omp parallel for num_threads(n_threads)
    for (int i=0; i < A.cols(); i++){
        A.col(i).array() = A.col(i).array()*B.col(i).array();
    }
    return A;
}
/*** R
A <- matrix(rnorm(1000*1000), 1000, 1000)
B <- matrix(rnorm(1000*1000), 1000, 1000)
microbenchmark::microbenchmark(testing(A,B, n_threads=1),
testing_omp(A,B, n_threads=1),
testing(A,B, n_threads=8),
testing_omp(A,B, n_threads=8),
times=10)
*/
Unit: milliseconds
expr min lq mean median uq max neval cld
testing(A, B, n_threads = 1) 169.74272 183.94500 212.83868 218.15756 236.97049 264.52183 10 b
testing_omp(A, B, n_threads = 1) 166.53132 178.48162 210.54195 227.65258 234.16727 238.03961 10 b
testing(A, B, n_threads = 8) 56.03258 61.16001 65.15763 62.67563 67.37089 83.43565 10 a
testing_omp(A, B, n_threads = 8) 54.18672 57.78558 73.70466 65.36586 67.24229 167.90310 10 a
The easiest is probably to disable/enable Eigen's multi-threading at runtime:
Eigen::setNbThreads(1); // single thread mode
#pragma omp parallel for
for (int i=0; i < iter; i++){
    // Code I would like to parallelize "myself"
    // even though it involves matrix products
}
Eigen::setNbThreads(0); // restore default
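A small variation on the same idea, as a sketch (my_parallel_section and iter are placeholder names; Eigen::nbThreads() queries the current setting, so the previous value can be restored instead of hard-coding 0):

#include <Eigen/Dense>
#ifdef _OPENMP
#include <omp.h>
#endif

void my_parallel_section(int iter) {
    const int saved = Eigen::nbThreads();  // remember Eigen's current setting
    Eigen::setNbThreads(1);                // force single-threaded Eigen kernels
    #pragma omp parallel for
    for (int i = 0; i < iter; i++) {
        // user-managed parallel work, possibly involving Eigen expressions
    }
    Eigen::setNbThreads(saved);            // restore the previous setting
}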

Why does my OpenMP 2.0 critical directive not flush?

I am currently attempting to parallelize a maximum-value search using OpenMP 2.0 and Visual Studio 2012. I feel like this problem is so simple it could be used as a textbook example. However, I run into a race condition I do not understand.
The code passage in question is:
double globalMaxVal = std::numeric_limits<double>::min();
#pragma omp parallel for
for(int i = 0; i < numberOfLoops; i++)
{
    {/* ... */} // In this section I determine maxVal
    // Besides reading out values from two std::vector via the [] operator,
    // I do not access or manipulate any global variables.

    #pragma omp flush(globalMaxVal) // IF I COMMENT OUT THIS LINE I RUN INTO A RACE CONDITION
    #pragma omp critical
    if(maxVal > globalMaxVal)
    {
        globalMaxVal = maxVal;
    }
}
I do not grasp why it is necessary to flush globalMaxVal. The OpenMP 2.0 documentation states: "A flush directive without a variable-list is implied for the following directives: [...] At entry to and exit from critical [...]" Yet I get results diverging from the non-parallelized implementation if I leave out the flush directive.
I realize that the code above might not be the prettiest or most efficient way to solve my problem, but at the moment I want to understand why I am seeing this race condition.
Any help would be greatly appreciated!
EDIT:
Below I've added a minimal, complete and verifiable example requiring only OpenMP and the standard library. I've been able to reproduce the problem described above with this code.
For me, some runs yield globalMaxVal != 99 if I omit the flush directive. With the directive, it works just fine.
#include <algorithm>
#include <iostream>
#include <random>
#include <Windows.h>
#include <omp.h>
int main()
{
    // Repeat parallelized code 20 times
    for(int r = 0; r < 20; r++)
    {
        int globalMaxVal = 0;
        #pragma omp parallel for
        for(int i = 0; i < 100; i++)
        {
            int maxVal = i;

            // Some dummy calculations to use computation time
            std::random_device rd;
            std::mt19937 generator(rd());
            std::uniform_real_distribution<double> particleDistribution(-1.0, 1.0);
            for(int j = 0; j < 1000000; j++)
                particleDistribution(generator);

            // The actual code bit again
            #pragma omp flush(globalMaxVal) // IF I COMMENT OUT THIS LINE I RUN INTO A RACE CONDITION
            #pragma omp critical
            if(maxVal > globalMaxVal)
            {
                globalMaxVal = maxVal;
            }
        }
        // Report outcome - expected to be 99
        std::cout << "Run: " << r << ", globalMaxVal: " << globalMaxVal << std::endl;
    }
    system("pause");
    return 0;
}
EDIT 2:
After further testing, we found that compiling the code in Visual Studio without optimization (/Od), or compiling on Linux, gives correct results, whereas the bug surfaces in Visual Studio 2012 (Microsoft C/C++ compiler version 17.00.61030) with optimization enabled (/O2).
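As an aside, with a compiler that supports OpenMP 3.1 or later (which rules out the Visual Studio 2012 compiler used above, as it only implements OpenMP 2.0), the critical/flush pattern can be replaced by a max reduction; a minimal sketch of that alternative, with the per-iteration work stubbed out:

#include <iostream>

int main()
{
    int globalMaxVal = 0;
    // reduction(max:...) requires OpenMP 3.1+. Each thread keeps a private
    // maximum, and the runtime combines them at the end of the loop.
    #pragma omp parallel for reduction(max:globalMaxVal)
    for(int i = 0; i < 100; i++)
    {
        int maxVal = i; // stand-in for the real per-iteration computation
        if(maxVal > globalMaxVal)
            globalMaxVal = maxVal;
    }
    std::cout << "globalMaxVal: " << globalMaxVal << std::endl; // expected: 99
    return 0;
}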

Parallelizing a series of independent sequential lines of code

What is the best way to execute multiple lines of code in parallel if they are not dependent on each other? (I'm using OpenMP.)
Pseudo code:
database->connect()
openfile("stuff.txt")
ping("stackoverflow.com")
x = 2;
y = a + b;
The only way I can come up with is:
#pragma omp parallel for
for(i = 0; i < 5; i++)
    switch (i) {
        case 0: database->connect(); break;
        ...
I haven't tried it, but I also remember that you're not supposed to use break while using OpenMP.
So I'm assuming that the individual things you listed as independent tasks were just examples. If they really are things like y = a + b, then as @chrisaycock and @ejd have said, they're too small for this sort of parallelism (i.e. thread-based, as opposed to ILP or similar) to take advantage of the concurrency, because of the overheads. But if they are bigger operations, the way to do task-based parallelism in OpenMP is with the task directive, e.g.:
#include <stdio.h>
#include <omp.h>
#include <unistd.h>
void work(int *v) {
    *v = omp_get_thread_num();
    sleep(1);
}

int main(int argc, char **argv)
{
    int a, b, c;
    #pragma omp parallel
    {
        #pragma omp single
        {
            #pragma omp task shared(a) default(none)
            work(&a);
            #pragma omp task shared(b) default(none)
            work(&b);
            #pragma omp task shared(c) default(none)
            work(&c);
        }
    }
    printf("a,b,c = %d,%d,%d\n", a, b, c);
    return 0;
}
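For completeness, a sketch of an alternative not shown above: the sections construct expresses the same idea without tasks and works even with OpenMP 2.x compilers. It reuses the same work() helper from the example above:

#include <stdio.h>
#include <unistd.h>
#include <omp.h>

void work(int *v) {
    *v = omp_get_thread_num();
    sleep(1);
}

int main(void)
{
    int a, b, c;
    // Each section is executed exactly once, by some thread of the team;
    // an implicit barrier at the end makes a, b, c safe to read afterwards.
    #pragma omp parallel sections
    {
        #pragma omp section
        work(&a);
        #pragma omp section
        work(&b);
        #pragma omp section
        work(&c);
    }
    printf("a,b,c = %d,%d,%d\n", a, b, c);
    return 0;
}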

Resources