I'm currently trying to parallelize an existing hierarchical MCMC sampling scheme. The majority of my (so far sequential) source code is written in RcppArmadillo, so I'd like to stick with this framework for the parallelization, too.
Before starting to parallelize my code I read a couple of blog posts on Rcpp/OpenMP. In the majority of these blog posts (e.g. Drew Schmidt, wrathematics) the authors warn about the thread safety of R/Rcpp data structures under OpenMP. The bottom line of all posts I have read so far is: R and Rcpp are not thread safe, so don't call them from within an omp parallel pragma.
Because of this, the following Rcpp example causes a segfault when called from R:
#include <Rcpp.h>
#include <omp.h>

using namespace Rcpp;

double rcpp_rootsum_j(Rcpp::NumericVector x)
{
  Rcpp::NumericVector ret = sqrt(x);
  return sum(ret);
}

// [[Rcpp::export]]
Rcpp::NumericVector rcpp_rootsum(Rcpp::NumericMatrix x, int cores = 2)
{
  omp_set_num_threads(cores);
  const int nr = x.nrow();
  const int nc = x.ncol();
  Rcpp::NumericVector ret(nc);
  #pragma omp parallel for shared(x, ret)
  for (int j = 0; j < nc; j++)
    ret[j] = rcpp_rootsum_j(x.column(j));  // Rcpp objects are created on worker threads here
  return ret;
}
As Drew explains in his blog post, the segfault happens due to a "hidden" copy that Rcpp makes in the call ret[j] = rcpp_rootsum_j(x.column(j));.
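For what it's worth, here is one way to sidestep the hidden copy (a minimal sketch of my own, not from Drew's post; the function name rcpp_rootsum_ptr is made up): take the matrix's raw data pointer before the parallel region and touch nothing but plain C++ inside it.

#include <Rcpp.h>
#include <omp.h>
#include <cmath>

// [[Rcpp::export]]
Rcpp::NumericVector rcpp_rootsum_ptr(Rcpp::NumericMatrix x, int cores = 2)
{
  omp_set_num_threads(cores);
  const int nr = x.nrow();
  const int nc = x.ncol();
  Rcpp::NumericVector ret(nc);      // allocated once, before any threads start
  const double* xp = x.begin();     // raw pointer to the column-major data
  double* rp = ret.begin();
  #pragma omp parallel for
  for (int j = 0; j < nc; j++) {
    double s = 0.0;
    for (int i = 0; i < nr; i++)
      s += std::sqrt(xp[j * nr + i]);  // plain doubles only, no Rcpp API calls
    rp[j] = s;
  }
  return ret;
}

Since no Rcpp object is created or resized inside the pragma, nothing can call back into R's memory manager from a worker thread.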
Since I'm interested in the behavior of RcppArmadillo in case of parallelization, I have converted Drew's example:
//[[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <omp.h>

double rcpp_rootsum_j_arma(arma::vec x)
{
  arma::vec ret = arma::sqrt(x);
  return arma::accu(ret);
}

// [[Rcpp::export]]
arma::vec rcpp_rootsum_arma(arma::mat x, int cores = 2)
{
  omp_set_num_threads(cores);
  const int nr = x.n_rows;
  const int nc = x.n_cols;
  arma::vec ret(nc);
  #pragma omp parallel for shared(x, ret)
  for (int j = 0; j < nc; j++)
    ret(j) = rcpp_rootsum_j_arma(x.col(j));  // x.col(j) is copied into an arma::vec by Armadillo's own allocator, not R's
  return ret;
}
Interestingly, the semantically equivalent code does not cause a segfault.
The second thing I noticed during my research is that the aforementioned statement (R and Rcpp are not thread safe, don't call them from within an omp parallel pragma) does not always seem to hold. For example, the call in the next example does not cause a segfault, although we're reading from and writing to Rcpp data structures.
#include <Rcpp.h>
#include <omp.h>

// [[Rcpp::export]]
Rcpp::NumericMatrix rcpp_sweep_(Rcpp::NumericMatrix x, Rcpp::NumericVector vec)
{
  Rcpp::NumericMatrix ret(x.nrow(), x.ncol());
  #pragma omp parallel for default(shared)
  for (int j = 0; j < x.ncol(); j++)
  {
    #pragma omp simd
    for (int i = 0; i < x.nrow(); i++)
      ret(i, j) = x(i, j) - vec(i);  // element access only; no new Rcpp objects are created
  }
  return ret;
}
My Questions
Why does the code from the first example cause a segfault, while the code from examples two and three does not?
How do I know whether a method is safe to call (e.g. arma::mat::col(i)) or unsafe (e.g. Rcpp::NumericMatrix::column(i))? Do I have to read the framework's source code every time?
Any suggestions on how to avoid "opaque" situations like the one in example one?
It might be pure coincidence that my RcppArmadillo example does not fail. See Dirk's comments below.
EDIT 1
In his answer and in both of his comments Dirk strongly recommends studying the examples in the Rcpp Gallery more closely.
Here are my initial assumptions:
Extracting rows, columns, etc. within an OpenMP pragma is generally not thread safe, since it might call back into R to allocate new memory for a hidden copy.
Because RcppArmadillo relies on the same lightweight/proxy model for data structures as Rcpp, my first assumption also applies to RcppArmadillo.
Data structures from the std namespace should be safer, since they don't use the same lightweight/proxy scheme.
Primitive data types also shouldn't cause problems, as they live on the stack and don't require R to allocate and manage memory. (A sketch of these last two assumptions in practice follows this list.)
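Here is a minimal sketch of what assumptions three and four would look like in practice (my own illustration, not taken from the cited posts; rcpp_rootsum_std is a made-up name): copy the R data into std:: containers before the parallel region, work purely on those, and wrap the result afterwards.

#include <Rcpp.h>
#include <omp.h>
#include <vector>
#include <cmath>

// [[Rcpp::export]]
Rcpp::NumericVector rcpp_rootsum_std(Rcpp::NumericMatrix x, int cores = 2)
{
  omp_set_num_threads(cores);
  const int nr = x.nrow();
  const int nc = x.ncol();
  // Copy into std::vector up front: no R allocations can happen later.
  std::vector<double> xs(x.begin(), x.end());
  std::vector<double> out(nc);
  #pragma omp parallel for
  for (int j = 0; j < nc; j++) {
    double s = 0.0;                    // primitive, lives on the thread's stack
    for (int i = 0; i < nr; i++)
      s += std::sqrt(xs[j * nr + i]);
    out[j] = s;
  }
  return Rcpp::wrap(out);              // back on the master thread, safe to wrap
}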
Optimizing Code vs...
arma::mat temp_row_sub = temp_mat.rows(x-2, x+2);
Hierarchical Risk Parity...
interMatrix(_, i) = MAT_COV(_, index_asset); // 3rd code example 3rd method
Using RcppProgress...
thread_sum += R::dlnorm(i+j, 0.0, 1.0, 0); // subsection OpenMP support
In my opinion the first and second example clearly contradict the assumptions I made in points one and two. Example three also gives me headaches, since to me it looks like a call into R...
My updated questions
Where is the difference between the first/second gallery example and my first code example?
Where did my assumptions go wrong?
Any recommendations, besides RcppGallery and GitHub, on how to get a better idea of the interaction of Rcpp and OpenMP?
Before starting to parallelize my code I read a couple of blog posts on Rcpp/OpenMP. In the majority of these blog posts (e.g. Drew Schmidt, wrathematics) the authors warn about the thread safety of R/Rcpp data structures under OpenMP. The bottom line of all posts I have read so far is: R and Rcpp are not thread safe, so don't call them from within an omp parallel pragma.
That is a well-known limitation of R itself, which is not thread-safe. That means you cannot call back into R, or trigger R events, which may happen with Rcpp unless you are careful. To be plain: the constraint has nothing to do with Rcpp; it simply means you cannot blindly drop into OpenMP via Rcpp. But you can if you're careful.
We have countless examples of success with OpenMP and related tools both in numerous packages on CRAN, on the Rcpp Gallery and via extension packages like RcppParallel.
You appear to have been very selective in what you chose to read on this topic, and you ended up with something somewhere between wrong and misleading. I suggest you turn to the several examples on the Rcpp Gallery which deal with OpenMP / RcppParallel as they deal with the very problem. Or if you're in a hurry: look up RVector and RMatrix in the RcppParallel documentation.
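For illustration, here is roughly what the first example from the question looks like when ported to RcppParallel's RVector/RMatrix accessors (a sketch of mine, not from the answer above; the worker and function names are made up):

// [[Rcpp::depends(RcppParallel)]]
#include <Rcpp.h>
#include <RcppParallel.h>
#include <cmath>

struct RootSum : public RcppParallel::Worker
{
  RcppParallel::RMatrix<double> input;   // thread-safe, non-allocating view
  RcppParallel::RVector<double> output;

  RootSum(Rcpp::NumericMatrix x, Rcpp::NumericVector ret)
    : input(x), output(ret) {}

  void operator()(std::size_t begin, std::size_t end) {
    for (std::size_t j = begin; j < end; j++) {
      double s = 0.0;
      for (std::size_t i = 0; i < input.nrow(); i++)
        s += std::sqrt(input(i, j));
      output[j] = s;
    }
  }
};

// [[Rcpp::export]]
Rcpp::NumericVector rcpp_rootsum_rp(Rcpp::NumericMatrix x)
{
  Rcpp::NumericVector ret(x.ncol());
  RootSum worker(x, ret);
  RcppParallel::parallelFor(0, x.ncol(), worker);
  return ret;
}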
Resources:
Six posts on OpenMP at RcppGallery
RcppParallel site
RcppParallel CRAN package
and your largest resource may be a targeted search on GitHub for code involving R, C++ and OpenMP. It will lead you to numerous working examples.
Related
I want to implement a simple image-processing routine quite similar to Auto Levels, so I need to precalculate thresholds, build a LUT and then apply it for histogram stretching/normalization.
But my question is not about the algorithm side; it is about using externally defined functions, because I need a couple of while loops for the LUT calculation and I think an extern function is a good fit for that.
I tried following examples from the Halide sources and checked this question too.
I use AOT compilation, currently testing on PC (win x64) and aiming for ARM in the future, and have the following generator code:
Var x("x"), y("y");
Func make_a_root{ "make_a_root" };
Buffer<bitType> Lut{256, "lut"};
make_a_root(x, y) = inputY(x, y);
ExternFuncArgument arg = make_a_root;
Func g;
g.define_extern("generateAutoLevelsLut", { arg }, UInt(8), 2, Halide::NameMangling::CPlusPlus);
g.compute_root();
inputY is declared as Input<Buffer<uint8_t>> inputY{ "input_y", 2 };
First I just want to get the call to run, so the function body does nothing but print (can I define the function in the same cpp file as the generator?):
int generateAutoLevelsLut(halide_buffer_t * input, halide_buffer_t * out)
{
    printf("\nextern call\n");
    return 0;
}
I tried default mangling with extern "C" too.
I never succeeded in getting the print message though, so my question is: why is this happening? Is it just a misunderstanding of some syntax, or is there a problem with calling an extern function from generator code?
EDIT:
Added a usage of the extern like out(x, y) = g(x, y) (the lvalue actually has to be used!), and it started to make the call. Now struggling with host == NULL. Digging into the bounds inference stuff.
EDIT 2:
I added basic bounds inference checks and now it does not crash. The next problem I have is: is it possible to call an external function without it directly influencing the output result?
Let me make concrete what I mean.
The generator code looks like following:
Buffer<bitType> lut{256, "lut"};
args[0] = inputY;
args[1] = lut;
g.define_extern("generateAutoLevelsLut", args, { UInt(8) }, 2, Halide::NameMangling::C);
outputY(x, y) = g(x, y); // Call line
g.compute_root();
outputY.compute_root();
The extern function code fills the second input lut with a dummy LUT:
Halide::Runtime::Buffer<uint16_t> im2Buffer(*input2);
Mat im2Mat(Size(im2Buffer.width(), im2Buffer.height()), CV_8U, im2Buffer.data(), im2Buffer.stride(1));
for (int i = 0; i < 256; i++)
    im2Mat.at<uchar>(i) = i;
And if I comment out the 'Call line' in the generator, the call to the extern is optimized away entirely.
I want to make something like:
Func lutRoot;
lutRoot(x) = lut(x); // to convert from Buffer
outputY(x, y) = autoLevelsPrecalcLut(inputY, lutRoot)(x, y);
Here lut is implicitly passed into the extern and filled there. But it doesn't work, and neither do the other variants that ignore the modification of 'output'... or maybe this whole approach is wrong?
Any suggestions? Thanks
EDIT 3:
Solved the task by avoiding extern calls, replacing the while loops with an argmin and RDom combo, but the original question about extern still remains.
That should work (or fail with a linker error if it wasn't going to). It's possible the Halide pipeline doesn't think it needs to call your extern function. E.g. does something use the result?
Alternatively, try stderr instead, just in case it's an output-stream buffering issue. That extern function definition is likely to cause Halide to error out (because it doesn't reply to the bounds inference query), and erroring out calls abort by default, which would swallow anything printed to stdout.
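For reference, a minimal sketch of an extern function that answers the bounds inference query before doing real work (my own illustration, assuming C name mangling as in the asker's EDIT 2; Halide signals a bounds query by handing the function buffers whose host pointer is null):

extern "C" int generateAutoLevelsLut(halide_buffer_t *input, halide_buffer_t *out)
{
    if (input->host == nullptr || out->host == nullptr) {
        // Bounds inference pass: declare which region of the input is
        // needed to produce the requested region of the output.
        if (input->host == nullptr) {
            for (int d = 0; d < input->dimensions; d++) {
                input->dim[d].min = out->dim[d].min;
                input->dim[d].extent = out->dim[d].extent;
            }
        }
        return 0;
    }
    // Actual evaluation pass.
    printf("\nextern call\n");
    return 0;
}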
I have the code, which looks like:
...
const int N = 10000;
std::array<std::pair<int,int>, N> nnt;

bool compar(std::pair<int,int> i, std::pair<int,int> j)
{
    return i.second > j.second;
}
...
int main(int argc, char **argv)
{
    #pragma acc data create(..., nnt)
    {
        #pragma acc parallel loop
        {...}
        // the nnt array is filled here

        // here I need to sort nnt, allocated on the GPU,
        // using the comparator compar()
    }
}
So I need to sort an array of pairs allocated on the GPU, by means of CUDA or OpenACC.
As far as I understand, it is unlikely that I will be able to sort a std::array of std::pairs on the GPU.
Actually, I need to sort one array allocated on the GPU by another one allocated on the GPU, i.e. if there are
int a[N];
int b[N];
which are allocated on or copied to the GPU by means of CUDA or OpenACC, I need to sort the array a by the values of the array b, and I need this sort to be done on the GPU. Maybe there are some CUDA functions that will help, or the CUDA Thrust sort functions could be used (like thrust::stable_sort), I don't know. Is there a way to do it?
Is there a way to do it?
Yes, one possible method would be to use thrust::sort_by_key, which allows you to sort device data using a device pointer.
This blog post explains the method to interface between Thrust and OpenACC, including how to pass a deviceptr between routines.
This example code may be of interest. Specifically, the hash example gives a fully-worked example of calling thrust::sort_by_key from OpenACC.
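A minimal sketch of the pattern those links describe (my own, with made-up function names): expose the OpenACC device pointers to a CUDA translation unit via host_data use_device and hand them to Thrust.

// sort_pairs.cu, compiled with nvcc
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

extern "C" void sort_a_by_b(int *a, int *b, int n)
{
    thrust::device_ptr<int> keys(b);   // b supplies the sort keys
    thrust::device_ptr<int> vals(a);   // a is permuted alongside b
    thrust::sort_by_key(keys, keys + n, vals);
}

// OpenACC side: a and b already live on the device here
#pragma acc data copy(a[0:N], b[0:N])
{
    // ... fill a and b on the GPU ...
    #pragma acc host_data use_device(a, b)
    sort_a_by_b(a, b, N);
}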
Heluuu,
I have a rather large program that I'm attempting to thread. So far this has been successful, and the basics are all working as intended.
I now want to do some fancy work with cascading threads in nested mode. Essentially, I want the main parallel region to use any free threads in lower parallel regions.
To detail the current system: the main parallel region starts 10 threads. I have 12 cores, so I can use 2 more threads. There is a second parallel region where some heavy computing happens, and I want the first two threads that reach this point to each start a new team there with 2 threads. Every entry into the lower parallel region after that will continue serially.
So, this should look like the following.
Main region: 10 threads started.
Lower region: 2 new threads started.
Thread 1: 2 threads in lower region.
Thread 2: 2 threads in lower region.
Thread 3-10: 1 thread in lower region.
Please keep in mind that these numbers are for the sake of clarity in providing a concrete description of my situation, and not the absolute and only case in which the program operates.
The code:
int main() {
    ...
    omp_set_num_threads(n);
    omp_set_dynamic(x);

    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < iterations; i++) {
            ...
            Compute();
            ...
        }
    }
}
And in Compute
bool Compute() {
    ...
    float nThreads = omp_get_thread_limit() - omp_get_num_threads();
    nThreads = ceil(nThreads / omp_get_num_threads());
    omp_set_num_threads((int)nThreads);

    #pragma omp parallel
    {
        ...
        #pragma omp for
        for (int i = 0; i < nReductSize; i++) {
            ...
        }
    }
}
Now, my problem is that setting the uppermost limit for the whole program (i.e. OMP_THREAD_LIMIT) only works from outside the program. Using
export OMP_THREAD_LIMIT=5
from the bash command line works great. But I want to do it internally. So far, I've tried
putenv("OMP_THREAD_LIMIT=12");
setenv("OMP_THREAD_LIMIT", "12", 1);
but when I call omp_get_thread_limit() or getenv("OMP_THREAD_LIMIT") I get wacky return values. Even when I set the variable with export, calling getenv("OMP_THREAD_LIMIT") returns 0.
So, I would ask for your help in this: How do I properly set OMP_THREAD_LIMIT at runtime?
This is the main function where I set the thread defaults. It is executed well before any threading occurs:
#ifdef _OPENMP
const char *name = "OMP_THREAD_LIMIT";
const char *value = "5";
int overwrite = 1;
int success = setenv(name, value, overwrite);
cout << "Var set (0 is success): " << success << endl;
#endif
Oh, and setenv reports success in setting the variable.
Compiler says
gcc44 (GCC) 4.4.7 20120313 (Red Hat 4.4.7-1)
Flags
CCFLAGS = -c -O0 -fopenmp -g -msse -msse2 -msse3 -mfpmath=sse -std=c++0x
OpenMP version is 3.0.
This is a correct implementation of OpenMP; the implementation ignores changes to the environment made from inside the program. As stated in the OpenMP 3.1 standard, page 159:
Modifications to the environment variables after the program has started, even if
modified by the program itself, are ignored by the OpenMP implementation.
You are doing exactly what is said in this paragraph.
OpenMP allows changing such parameters only via the omp_set_* functions, but there is no such function for the thread-limit-var ICV:
However, the settings of some of the ICVs can be modified during the execution of the OpenMP
program by the use of the appropriate directive clauses or OpenMP API routines.
I think you may use the num_threads clause of #pragma omp parallel to achieve what you want.
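A minimal sketch of that approach with the asker's numbers (illustrative only; nested parallelism also has to be enabled, which OpenMP 3.0 does via omp_set_nested):

#include <omp.h>

int main() {
    omp_set_nested(1);                    // allow nested parallel regions

    #pragma omp parallel num_threads(10)  // outer team: 10 threads
    {
        // ... outer work ...

        // The inner team size is fixed here by the clause itself,
        // so no environment variable is needed at runtime.
        #pragma omp parallel num_threads(2)
        {
            // ... heavy computation, 2 threads per outer thread ...
        }
    }
    return 0;
}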
Changing the behavior of OpenMP using OMP_THREAD_LIMIT (or any other OMP_* environment variable) is not possible after the program has started; these are intended for use by the user. You could have the user invoke your program through a script that sets OMP_THREAD_LIMIT and then calls your program, but that's probably not what you need to do in this case.
OMP_NUM_THREADS, omp_set_num_threads, and the num_threads clause are usually used to set the number of threads operating in a region.
It might be off-topic, but you may want to try OpenMP's collapse clause instead of handcrafting the nesting here.
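Sketched out, the collapse suggestion would look like this (assuming the outer loop and the reduction loop could be fused into one rectangular iteration space, which is not a given for the asker's program):

// Instead of nesting parallel regions, merge two counted loops into a
// single parallel iteration space and let OpenMP distribute all of it.
#pragma omp parallel for collapse(2)
for (int i = 0; i < iterations; i++) {
    for (int j = 0; j < nReductSize; j++) {
        // work(i, j);   // hypothetical fused loop body
    }
}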
Consider the below function,
public static int foo(int x){
    return x + 5;
}
Now, let us call it,
int in = /*Input taken from the user*/;
int x = foo(10); // ... (1)
int y = foo(in); // ... (2)
Here, can the compiler change
int x = foo(10); // ... (1)
to
int x = 15; // ... (1)
by evaluating the function call during compile time since the input to the function is available during compile time ?
I understand this is not possible during the call marked (2) because the input is available only during run time.
I do not want to know a way of doing it in any specific language. I would like to know why this can or can not be a feature of a compiler itself.
C++ does have a mechanism for this: read up on the constexpr keyword introduced in C++11, which allows compile-time evaluation of functions.
In C++11 there is a limitation: the function body must consist of a single return statement, though it may call other constexpr functions (C++14 lifts this restriction, AFAIK).
static constexpr int foo(int x){
    return x + 5;
}
EDIT:
Why a compiler might not evaluate a function (just my guess):
It might not be appropriate to remove a function by evaluating it without being told.
The function could be used in different compilation units, and with static/dynamic inputs: thus evaluating it in some circumstances and adding a call in other places.
This use would provide inconsistent execution times (especially on a deterministic platform like AVR) where timing may be important, or at least need to be predictable.
Also interrupts (and how the compiler interacts with them) may come into play here.
EDIT:
constexpr is actually stronger: in a context that requires a constant expression (such as an array bound), it requires the compiler to evaluate the call at compile time. The compiler is free to fold away functions without constexpr, but the programmer can't rely on it doing so.
Can you give an example in the case where the user would have benefited from this but the compiler chose not to do it ?
inline functions may or may not resolve to constant expressions that could be optimized into the end result.
However, constexpr guarantees it. An inline function cannot be used as a compile-time constant, whereas constexpr lets you formulate compile-time functions and, more so, objects.
A basic example where constexpr makes a guarantee that inline cannot.
constexpr int foo( int a, int b, int c ){
    return a+b+c;
}
int array[ foo(1, 2, 3) ];
And the same as a simple object.
struct Foo{
    constexpr Foo( int a, int b, int c ) : val(a+b+c){}
    int val;
};
constexpr Foo foo( 1,2,4 );
int array[ foo.val ];
Unless foo.val is a compile time constant, the code above will not compile.
Even as just a function, an inline function has no guarantee. And the linker can also do inlining over multiple compilation units, after the syntax has been compiled (array bounds checked for integer constants).
This is kind of like meta-programming, but without the templates. Of course these examples do not do the topic justice, however very complex solutions would benefit from the ability to use objects and functional programming to achieve a result.
Yes, evaluation can happen during compile time. This comes under the heading of constant folding and function inlining, both of which are common optimizations for optimizing compilers.
Many languages do not have strong distinction between "compile time" and "run time", but the general rule is that the language defines an "execution model" which defines the behavior of any particular program with any particular input (or specifies that it is undefined). The compiler must produce an executable that can read any input and produce the corresponding output as defined by the execution model. What happens inside the executable doesn't matter -- as long as the externally viewed behavior is correct.
Here "input", "output" and "behavior" includes all possible interactions with the environment that are defined in the execution model, including timing effects.
I'm new to the world of parallel programming and OpenMP, so this may be a naive question, but I can't really come up with a good answer to what I'm experiencing, so I hope someone will be able to shed some light on the matter.
What I am trying to achieve is to have a private copy of a dynamically allocated matrix (of integers) for every thread that will handle the following parallel section, but as soon as the flow of execution enters said region the reference to the supposedly private matrix holds a null value.
Is there any limitation of this directive I'm not aware of? Everything seems to work just fine with one-dimensional dynamic arrays.
A snippet of the code is the following one...
#define n 10000

int **matrix;
#pragma omp threadprivate(matrix)

int main()
{
    int i;
    matrix = (int**) calloc(n, sizeof(int*));
    for (i = 0; i < n; i++)
        matrix[i] = (int*) calloc(n, sizeof(int));
    AdjacencyMatrix(n, matrix);
    ...

    /* Explicitly turn off dynamic threads */
    omp_set_dynamic(0);

    #pragma omp parallel
    {
        // From now on, matrix is NULL...
        executor_p(matrix, n);
    }
    ...
Look at the OpenMP documentation regarding what happens with the threadprivate clause:
On first entry to a parallel region, data in THREADPRIVATE variables and common blocks should be assumed undefined, unless a COPYIN clause is specified in the PARALLEL directive
There's no guarantee of what value is going to be stored in the matrix variable in the parallel region.
OpenMP can privatise only variables with known storage size. That is, you can have a private copy of an array if it was defined like double matrix[N][M]. In your case not only is the storage size unknown (a pointer doesn't store the number of elements it points to), but your matrix also isn't a contiguous area of memory: it's a pointer to a list of dynamically allocated rows.
What you would end up with is having a private copy of the top-level pointer, not a private copy of the matrix data itself.
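A sketch of one workaround (my own illustration, not part of the answer): keep the matrix in an ordinary shared variable and let every thread build its own deep copy inside the parallel region, where plain malloc/memcpy are thread safe.

int **shared_matrix = /* allocated and filled as before, but NOT threadprivate */;

#pragma omp parallel shared(shared_matrix)
{
    // Each thread deep-copies the matrix; the copy is private by construction.
    int **priv = (int**) malloc(n * sizeof(int*));
    for (int i = 0; i < n; i++) {
        priv[i] = (int*) malloc(n * sizeof(int));
        memcpy(priv[i], shared_matrix[i], n * sizeof(int));
    }

    executor_p(priv, n);

    for (int i = 0; i < n; i++)
        free(priv[i]);
    free(priv);
}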