What is parallel for usage in different for loops?

I have some code fragments here where what I think is correct is not given as the answer. I need some help to clarify this.
dotp = 0;
for (i = 0; i < n; i++) {
    dotp += a[i] + b[i];
}
The given answer for parallelizing this code is:
dotp = 0;
#pragma omp parallel for reduction(+:dotp)
for (i = 0; i < n; i++) {
    dotp += a[i] + b[i];
}
I think dotp needs to be added as firstprivate to be visible inside the for loop:
dotp = 0;
#pragma omp parallel for reduction(+:dotp) firstprivate(dotp)
for (i = 0; i < n; i++) {
    dotp += a[i] + b[i];
}
If this is not correct, why do we not have to use firstprivate?

The reduction clause gives each thread its own private copy of dotp and performs a final summation.
The initial value of each private copy of dotp within the for loop is 0, the identity value of the + operator. The final step of the summation adds the combined per-thread partial sums to the previous (original) value of dotp; in your case this is 0, but it could be any value.
You do not (and should not) need to mark it firstprivate to enforce the initial zeroing of dotp. In fact, a variable may not appear in both a reduction clause and a firstprivate clause on the same construct.
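To see why this works, here is a rough hand-written equivalent of what reduction(+:dotp) does (an illustrative sketch only, assuming dotp is a double; the compiler-generated code differs in detail):

dotp = 0;
#pragma omp parallel
{
    double local = 0;                /* private partial sum, identity of + */
    #pragma omp for
    for (i = 0; i < n; i++)
        local += a[i] + b[i];
    #pragma omp critical             /* combine into the shared original */
    dotp += local;
}

Each thread accumulates into its own local copy, and only the final combination touches the shared dotp, which is why no firstprivate copy of dotp is needed inside the loop.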

Rcpp causes segfault, RcppArmadillo does not

I'm currently trying to parallelize an existing hierarchical MCMC sampling scheme. The majority of my (so far sequential) source code is written in RcppArmadillo, so I'd like to stick with this framework for the parallelization, too.
Before starting to parallelize my code, I read a couple of blog posts on Rcpp/OpenMP. In the majority of these blog posts (e.g. Drew Schmidt, wrathematics) the authors warn about the issue of thread safety with R/Rcpp data structures and OpenMP. The bottom line of all the posts I have read so far is: R and Rcpp are not thread-safe, so don't call them from within an omp parallel pragma.
Because of this, the following Rcpp example causes a segfault when called from R:
#include <Rcpp.h>
#include <omp.h>

using namespace Rcpp;

double rcpp_rootsum_j(Rcpp::NumericVector x)
{
    Rcpp::NumericVector ret = sqrt(x);
    return sum(ret);
}

// [[Rcpp::export]]
Rcpp::NumericVector rcpp_rootsum(Rcpp::NumericMatrix x, int cores = 2)
{
    omp_set_num_threads(cores);
    const int nr = x.nrow();
    const int nc = x.ncol();
    Rcpp::NumericVector ret(nc);

    #pragma omp parallel for shared(x, ret)
    for (int j = 0; j < nc; j++)
        ret[j] = rcpp_rootsum_j(x.column(j));

    return ret;
}
As Drew explains in his blog post, the segfault happens due to a "hidden" copy that Rcpp makes in the call ret[j] = rcpp_rootsum_j(x.column(j));.
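One way to sidestep the hidden copy (my own sketch, not from Drew's post; the function name and layout assumptions are mine) is to do all Rcpp access serially and work on plain C++ storage inside the parallel region:

// [[Rcpp::export]]
Rcpp::NumericVector rcpp_rootsum_copy_first(Rcpp::NumericMatrix x, int cores = 2)
{
    const int nr = x.nrow();
    const int nc = x.ncol();
    std::vector<double> xv(x.begin(), x.end());  // one serial copy of the data
    std::vector<double> out(nc);

    #pragma omp parallel for num_threads(cores)
    for (int j = 0; j < nc; j++) {
        double s = 0.0;
        for (int i = 0; i < nr; i++)
            s += std::sqrt(xv[j * nr + i]);      // column-major layout, as in R
        out[j] = s;
    }

    return Rcpp::wrap(out);                      // back to R memory, serially
}

This needs #include <vector> and #include <cmath> in addition to the headers above.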
Since I'm interested in the behavior of RcppArmadillo under parallelization, I converted Drew's example:
//[[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <omp.h>

double rcpp_rootsum_j_arma(arma::vec x)
{
    arma::vec ret = arma::sqrt(x);
    return arma::accu(ret);
}

// [[Rcpp::export]]
arma::vec rcpp_rootsum_arma(arma::mat x, int cores = 2)
{
    omp_set_num_threads(cores);
    const int nr = x.n_rows;
    const int nc = x.n_cols;
    arma::vec ret(nc);

    #pragma omp parallel for shared(x, ret)
    for (int j = 0; j < nc; j++)
        ret(j) = rcpp_rootsum_j_arma(x.col(j));

    return ret;
}
Interestingly, the semantically equivalent code does not cause a segfault.
The second thing I noticed during my research is that the aforementioned statement (R and Rcpp are not thread-safe, don't call them from within an omp parallel pragma) does not always seem to hold true. For example, the call in the next example does not cause a segfault, although we're reading and writing to Rcpp data structures.
#include <Rcpp.h>
#include <omp.h>

// [[Rcpp::export]]
Rcpp::NumericMatrix rcpp_sweep_(Rcpp::NumericMatrix x, Rcpp::NumericVector vec)
{
    Rcpp::NumericMatrix ret(x.nrow(), x.ncol());

    #pragma omp parallel for default(shared)
    for (int j = 0; j < x.ncol(); j++)
    {
        #pragma omp simd
        for (int i = 0; i < x.nrow(); i++)
            ret(i, j) = x(i, j) - vec(i);
    }

    return ret;
}
My Questions
Why does the code from the first example cause a segfault, while the code from examples two and three does not?
How do I know if it is safe to call a method (arma::mat.col(i)) or unsafe to call a method (Rcpp::NumericMatrix.column(i))? Do I have to read the framework's source code every time?
Any suggestions on how to avoid "opaque" situations like the one in example one?
It might be pure coincidence that my RcppArmadillo example does not fail. See Dirk's comments below.
EDIT 1
In his answer and in both of his comments, Dirk strongly recommends studying the examples in the Rcpp Gallery more closely.
Here are my initial assumptions:
Extracting rows, columns, etc. within an OpenMP pragma is generally not thread-safe, since it might call back into R to allocate new space in memory for a hidden copy.
Because RcppArmadillo relies on the same lightweight/proxy model for data structures as Rcpp, my first assumption also applies to RcppArmadillo.
Data structures from the std namespace should be somewhat safer, since they don't use the same lightweight/proxy scheme.
Primitive data types shouldn't cause problems either, as they live on the stack and don't require R to allocate and manage memory.
Yet here are three snippets from Rcpp Gallery articles that run inside OpenMP regions:
From "Optimizing Code vs...":
    arma::mat temp_row_sub = temp_mat.rows(x-2, x+2);
From "Hierarchical Risk Parity...":
    interMatrix(_, i) = MAT_COV(_, index_asset); // 3rd code example, 3rd method
From "Using RcppProgress...":
    thread_sum += R::dlnorm(i+j, 0.0, 1.0, 0); // subsection "OpenMP support"
In my opinion, the first and second snippets clearly contradict the assumptions I made in points one and two. The third also gives me headaches, since to me it looks like a call into R...
My updated questions
What is the difference between examples one/two and my first code example?
Where did my assumptions go wrong?
Any recommendations, besides the Rcpp Gallery and GitHub, on how to get a better idea of the interaction of Rcpp and OpenMP?
Before starting to parallelize my code, I read a couple of blog posts on Rcpp/OpenMP. In the majority of these blog posts (e.g. Drew Schmidt, wrathematics) the authors warn about the issue of thread safety with R/Rcpp data structures and OpenMP. The bottom line of all the posts I have read so far is: R and Rcpp are not thread-safe, so don't call them from within an omp parallel pragma.
That is a well-known limitation of R itself not being thread-safe. It means you cannot call back into R or trigger R events, which may happen with Rcpp unless you are careful. To be plain: the constraint has nothing to do with Rcpp itself; it simply means you cannot blindly drop into OpenMP via Rcpp. But you can if you're careful.
We have countless examples of success with OpenMP and related tools both in numerous packages on CRAN, on the Rcpp Gallery and via extension packages like RcppParallel.
You appear to have been very selective in what you chose to read on this topic, and you ended up with something somewhere between wrong and misleading. I suggest you turn to the several examples on the Rcpp Gallery that deal with OpenMP / RcppParallel, as they address this very problem. Or, if you're in a hurry: look up RVector and RMatrix in the RcppParallel documentation.
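For a concrete picture of the RVector/RMatrix approach, here is a sketch of my own (assuming RcppParallel is installed; these wrappers are documented as thread-safe, non-allocating views onto R memory):

// [[Rcpp::depends(RcppParallel)]]
#include <Rcpp.h>
#include <RcppParallel.h>
#include <omp.h>
#include <cmath>

// [[Rcpp::export]]
Rcpp::NumericVector rootsum_rmatrix(Rcpp::NumericMatrix x, int cores = 2)
{
    Rcpp::NumericVector out(x.ncol());
    RcppParallel::RMatrix<double> xm(x);   // plain view, no R API calls inside
    RcppParallel::RVector<double> ov(out);
    const int nr = x.nrow(), nc = x.ncol();

    #pragma omp parallel for num_threads(cores)
    for (int j = 0; j < nc; j++) {
        double s = 0.0;
        for (int i = 0; i < nr; i++)
            s += std::sqrt(xm(i, j));      // element access only, never allocates
        ov[j] = s;
    }
    return out;
}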
Resources:
Six posts on OpenMP at RcppGallery
RcppParallel site
RcppParallel CRAN package
and your largest resource may be some targeted search at GitHub for code involving R, C++ and OpenMP. It will lead you to numerous working examples.

Difference between num_threads vs. omp_set_num_threads vs OMP_NUM_THREADS

I am quite confused about the ways to specify the number of threads for the parallel parts of a program.
I know I can use:
the environment variable OMP_NUM_THREADS
the function omp_set_num_threads(int)
the num_threads(int) clause, as in #pragma omp parallel for num_threads(NB_OF_THREADS)
From what I have gathered so far, the first two are equivalent. But what about the third one?
Can someone provide a more detailed exposition of the difference? I could not find any information on the internet regarding the difference between 1/2 and 3.
OMP_NUM_THREADS and omp_set_num_threads() are not equivalent. The environment variable is only used to set the initial value of the nthreads-var ICV (internal control variable), which controls the maximum number of threads in a team. omp_set_num_threads() can be used to change the value of nthreads-var at any time (outside of any parallel regions, of course) and affects all subsequent parallel regions. Therefore, setting OMP_NUM_THREADS to some value n is equivalent to calling omp_set_num_threads(n) before the very first parallel region is encountered.
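A quick way to watch nthreads-var change (a minimal sketch; the printed numbers depend on your environment):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* omp_get_max_threads() reports the current value of nthreads-var,
       i.e. how many threads the next parallel region would request */
    printf("initial nthreads-var: %d\n", omp_get_max_threads());

    omp_set_num_threads(3);   /* overrides the OMP_NUM_THREADS setting */
    printf("after omp_set_num_threads(3): %d\n", omp_get_max_threads());

    return 0;
}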
The algorithm that determines the number of threads in a parallel region is described very clearly in the OpenMP specification, which is freely available on the OpenMP website:

if a num_threads clause exists
    then let ThreadsRequested be the value of the num_threads clause expression;
    else let ThreadsRequested = value of the first element of nthreads-var;
The priority of the different ways to set nthreads-var is listed in the ICV Override Relationships part of the specification:
The num_threads clause and omp_set_num_threads() override the value of the OMP_NUM_THREADS environment variable and the initial value of the first element of the nthreads-var ICV.
Translated into human language, that is:
OMP_NUM_THREADS (if present) specifies initially the number of threads;
calls to omp_set_num_threads() override the value of OMP_NUM_THREADS;
the presence of the num_threads clause overrides both other values.
The actual number of threads used is also affected by whether dynamic team sizes are enabled (dyn-var ICV settable via OMP_DYNAMIC and/or omp_set_dynamic()), by whether a thread limit is imposed by thread-limit-var (settable via OMP_THREAD_LIMIT), as well as by whether nested parallelism (OMP_NESTED / omp_set_nested()) is enabled or not.
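For example, with dynamic adjustment enabled the runtime is allowed to deliver fewer threads than requested (a sketch; the actual number is implementation-dependent):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_dynamic(1);        /* sets dyn-var: allow dynamic team sizes */

    #pragma omp parallel num_threads(16)
    {
        #pragma omp single
        printf("got %d threads\n", omp_get_num_threads());  /* may be < 16 */
    }

    return 0;
}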
Think of it like scope. Option 3 (num_threads) sets the number of threads for the current team of threads only. The other options are global/state settings. I generally don't set the number of threads and instead I just use the defaults. When I do change the number of threads it's usually only in special cases so I use option three so that the next time I use a parallel team it goes back to the global (default) setting. See the code below. After I use option 3 the next team of threads goes back to the last global setting.
#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        #pragma omp single
        {
            printf("%d\n", omp_get_num_threads());
        }
    }

    omp_set_num_threads(8);
    #pragma omp parallel
    {
        #pragma omp single
        {
            printf("%d\n", omp_get_num_threads());
        }
    }

    #pragma omp parallel num_threads(2)
    {
        #pragma omp single
        {
            printf("%d\n", omp_get_num_threads());
        }
    }

    #pragma omp parallel
    {
        #pragma omp single
        {
            printf("%d\n", omp_get_num_threads());
        }
    }
}
On a machine where the default team size is four, this prints:
4
8
2
8

How are firstprivate and lastprivate different from private clauses in OpenMP?

I've looked at the official definitions, but I'm still quite confused.
firstprivate: Specifies that each thread should have its own instance of a variable, and that the variable should be initialized with the value of the variable as it exists before the parallel construct.
To me, that sounds a lot like private. I've looked for examples, but I don't seem to understand how it's special or how it can be used.
lastprivate: Specifies that the enclosing context's version of the variable is set equal to the private version of whichever thread executes the final iteration (for-loop construct) or last section (#pragma sections).
I feel like I understand this one a bit better because of the following example:
#pragma omp parallel
{
    #pragma omp for lastprivate(i)
    for (i = 0; i < n - 1; i++)
        a[i] = b[i] + b[i + 1];
}
a[i] = b[i];
So, in this example, I understand that lastprivate allows i to keep, outside the loop, the value it had in the last iteration.
I just started learning OpenMP today.
private variables are not initialised, i.e. they start with random values like any other local automatic variable (and they are often implemented using automatic variables on the stack of each thread). Take this simple program as an example:
#include <stdio.h>
#include <omp.h>

int main (void)
{
    int i = 10;

    #pragma omp parallel private(i)
    {
        printf("thread %d: i = %d\n", omp_get_thread_num(), i);
        i = 1000 + omp_get_thread_num();
    }

    printf("i = %d\n", i);
    return 0;
}
With four threads it outputs something like:
thread 0: i = 0
thread 3: i = 32717
thread 1: i = 32717
thread 2: i = 1
i = 10
(another run of the same program)
thread 2: i = 1
thread 1: i = 1
thread 0: i = 0
thread 3: i = 32657
i = 10
This clearly demonstrates that the value of i is random (not initialised) inside the parallel region and that any modifications to it are not visible after the parallel region (i.e. the variable keeps its value from before entering the region).
If i is made firstprivate, then it is initialised with the value that it has before the parallel region:
thread 2: i = 10
thread 0: i = 10
thread 3: i = 10
thread 1: i = 10
i = 10
Still, modifications to the value of i inside the parallel region are not visible after it.
You already know about lastprivate (and it is not applicable to the simple demonstration program as it lacks worksharing constructs).
So yes, firstprivate and lastprivate are just special cases of private. The first one results in bringing in values from the outside context into the parallel region while the second one transfers values from the parallel region to the outside context. The rationale behind these data-sharing classes is that inside the parallel region all private variables shadow the ones from the outside context, i.e. it is not possible to use an assignment operation to modify the outside value of i from inside the parallel region.
Note that reading an uninitialised local variable, as the private(i) demonstration above deliberately does, is undefined behaviour in C and C++; a compiler may warn about it, and the printed values are indeterminate.
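To round off the comparison, here is a combined demonstration of my own (not from the answers above; the b = 49 result assumes the loop runs to completion):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int a = 10, b = 0;

    #pragma omp parallel firstprivate(a)
    {
        /* every thread starts with its own copy of a, initialised to 10 */
        printf("thread %d sees a = %d\n", omp_get_thread_num(), a);
        a += omp_get_thread_num();   /* modifies the private copy only */
    }
    printf("after the region, a is still %d\n", a);   /* prints 10 */

    #pragma omp parallel for lastprivate(b)
    for (int j = 0; j < 8; j++)
        b = j * j;                   /* each iteration overwrites its copy */
    /* lastprivate copies the value from the sequentially last iteration */
    printf("after the loop, b = %d\n", b);            /* prints 49 */

    return 0;
}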

Is it possible to inject values in the frama-c value analyzer?

I'm experimenting with the Frama-C value analyzer to evaluate C code which is actually threaded.
I want to ignore any threading problems that might occur and just inspect the possible values for a single thread. So far this works by setting the entry point to where the thread starts.
Now to my problem: inside one thread I read values that are written by another thread. Because Frama-C does not (and should not?) currently consider threading, it assumes my variable is in some broad range, but I know that the range is in fact much smaller.
Is it possible to tell the value analyzer the value range of this variable?
Example:
volatile int x = 0;

void f() {
    while (x == 0)
        sleep(100);
    ...
}
Here frama-c detects that x is volatile and thus has range [--..--], but I know what the other thread will write into x, and I want to tell the analyzer that x can only be 0 or 1.
Is this possible with Frama-C, especially in the GUI?
Thanks in advance
Christian
This is currently not possible automatically. The value analysis considers that volatile variables always contain the full range of values of their underlying type. There exists, however, a proprietary plug-in that transforms accesses to volatile variables into calls to a user-supplied function. In your case, your code would be transformed into essentially this:
int x = 0;

void f() {
    while (1) {
        x = f_volatile_x();
        if (x == 0)
            sleep(100);
        ...
    }
}
By specifying f_volatile_x correctly, you can ensure it returns values between 0 and 1 only.
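A sketch of such a stub (the name f_volatile_x comes from the transformed code above; Frama_C_interval is defined in Frama-C's share/frama-c/builtin.c, which must be added to the analyzed files):

/* called by the transformed code on every read of x */
int f_volatile_x(void)
{
    return Frama_C_interval(0, 1);   /* the other thread only ever writes 0 or 1 */
}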
If the variable x is not modified in the thread you are studying, you could also initialize it at the beginning of the main function with:
x = Frama_C_interval(0, 1);
This is a function defined by Frama-C in ...../share/frama-c/builtin.c, so you have to add this file to your inputs when you use it.

C/OpenMP - issue with threadprivate and vectors of pointers

I'm new to the world of parallel programming and OpenMP, so this may be a naive question, but I can't really come up with a good explanation for what I'm experiencing, so I hope someone will be able to shed some light on the matter.
What I am trying to achieve is to have a private copy of a dynamically allocated matrix (of integers) for every thread that will handle the following parallel section, but as soon as the flow of execution enters said region, the reference to the supposedly private matrix holds a null value.
Is there any limitation of this directive I'm not aware of? Everything seems to work just fine with one-dimensional dynamic arrays.
A snippet of the code follows:
#define n 10000

int **matrix;
#pragma omp threadprivate(matrix)

int main()
{
    int i;

    matrix = (int**) calloc(n, sizeof(int*));
    for (i = 0; i < n; i++)
        matrix[i] = (int*) calloc(n, sizeof(int));
    AdjacencyMatrix(n, matrix);
    ...

    /* Explicitly turn off dynamic threads */
    omp_set_dynamic(0);

    #pragma omp parallel
    {
        // From now on, matrix is NULL...
        executor_p(matrix, n);
    }
    ....
Look at the OpenMP documentation regarding what happens with the threadprivate clause:
On first entry to a parallel region, data in THREADPRIVATE variables and common blocks should be assumed undefined, unless a COPYIN clause is specified in the PARALLEL directive
There's no guarantee of what value is going to be stored in the matrix variable in the parallel region.
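If sharing the data between threads is acceptable, a copyin clause makes the region work (a sketch; note that only the pointer is copied, so all threads still operate on the same underlying rows):

#pragma omp parallel copyin(matrix)
{
    /* each thread's threadprivate matrix now holds the master's
       pointer value instead of NULL */
    executor_p(matrix, n);
}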
OpenMP can privatise only variables with a known storage size. That is, you can have a private copy of an array if it was defined like double matrix[N][M]. In your case not only is the storage size unknown (a pointer doesn't store the number of elements it points to), but your matrix is also not a contiguous area in memory; rather, it is a pointer to a list of dynamically allocated rows.
What you would end up with is having a private copy of the top-level pointer, not a private copy of the matrix data itself.
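If each thread really does need its own matrix, one option (my sketch, with a smaller n than the original to keep the memory footprint sane) is to let every thread allocate its own storage inside the parallel region:

#include <stdlib.h>
#include <omp.h>

#define n 1000   /* per-thread n x n ints; the original 10000 would need ~400 MB per thread */

int **matrix;
#pragma omp threadprivate(matrix)

int main(void)
{
    omp_set_dynamic(0);   /* required for threadprivate data to persist */

    #pragma omp parallel
    {
        /* each thread allocates and owns its own matrix */
        matrix = (int**) calloc(n, sizeof *matrix);
        for (int i = 0; i < n; i++)
            matrix[i] = (int*) calloc(n, sizeof **matrix);

        /* ... fill and use the per-thread matrix here ... */
    }

    /* the per-thread copies persist into later parallel regions of the
       same size, per the threadprivate persistence rules */
    return 0;
}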
