I want to use gtest inside OpenMP.
I want to do some calculation in a parallel block, check the result held in a thread-private variable, and then do more calculation.
Here is an example.
#include "gmock.h"
#include <iostream>
using namespace testing;
TEST(SimpleGtest, OpenMP)
{
#pragma omp parallel
{
// some thread private variable
int thread_index = omp_get_thread_num();
int z;
// some calculation
// ...
// check result of thread private variable
ASSERT_THAT(z, Eq(13));
// other calculation
// ...
}
}
int main(int argc, char **argv)
{
testing::InitGoogleTest(&argc, argv);
return RUN_ALL_TESTS();
}
but when I compile the code, the compiler complains:
error: "return" branches to or from an OpenMP structured block are illegal
ASSERT_THAT(z, Eq(13));
^
The code above is just a simple example.
I know we could avoid the problem by using a shared variable instead of a thread-private one, so the assert could be done outside the parallel block.
But is there any way to use gtest to check a thread-private variable's result inside an OpenMP parallel block?
Thanks in advance!
Do not use ASSERT_THAT inside the parallel block; EXPECT_EQ can help. The ASSERT_* macros return from the enclosing function on failure, and that return is exactly the illegal branch out of an OpenMP structured block the compiler complains about, whereas the EXPECT_* macros only record the failure and continue.
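A minimal sketch of that suggestion, assuming the same per-thread setup as in the question (the value 13 and the calculation are placeholders). Nothing branches out of the structured block here; note that Google Test's assertions are only guaranteed thread-safe on platforms where it uses pthreads.

#include "gtest/gtest.h"
#include <omp.h>

TEST(SimpleGtest, OpenMP)
{
#pragma omp parallel
    {
        // thread-private variables live inside the structured block
        int thread_index = omp_get_thread_num();
        int z = 13; // placeholder for the real per-thread calculation
        // EXPECT_EQ records a failure without returning from the function,
        // so it is legal inside the OpenMP structured block
        EXPECT_EQ(z, 13) << "failure in thread " << thread_index;
        // other calculation can continue here even if the expectation failed
    }
}

int main(int argc, char **argv)
{
    testing::InitGoogleTest(&argc, argv);
    return RUN_ALL_TESTS();
}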
I'm studying SYCL at university and I have a question about the performance of some code.
In particular, I have a piece of C/C++ code that I need to translate into a SYCL kernel with parallelization, and I did it like this:
#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>
using namespace sycl;

constexpr int size = 131072; // 2^17

int main(int argc, char** argv) {
    // Create a vector with size elements and initialize them to 1
    std::vector<float> dA(size, 1.0f);
    try {
        queue gpuQueue{ gpu_selector{} };
        buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
        gpuQueue.submit([&](handler& cgh) {
            accessor inA{ bufA, cgh };
            cgh.parallel_for(range<1>(size),
                [=](id<1> i) { inA[i] = inA[i] + 2; }
            );
        });
        gpuQueue.wait_and_throw();
    }
    catch (std::exception& e) { throw e; }
    return 0;
}
So my question is about the value c: in this case I use the value 2 directly, but will this impact performance when I run the code? Do I need to create a variable, or is it correct this way and the performance is good?
Thanks in advance for the help!
Interesting question. In this case the value 2 will be a literal in the instruction in your SYCL kernel - this is as efficient as it gets, I think! There's the slight complication that you have an implicit cast from int to float. My guess is that you'll probably end up with a float literal 2.0 in your device assembly. Your SYCL device won't have to fetch that 2 from memory or cast at runtime or anything like that, it just lives in the instruction.
Equally, if you had:
constexpr int c = 2;
// the rest of your code
[=](id<1> i) { inA[i] = inA[i] + c; }
// etc
The compiler is almost certainly smart enough to propagate the constant value of c into the kernel code. So, again, the 2.0 literal ends up in the instruction.
I compiled your example with DPC++ and extracted the LLVM IR, and found the following lines:
%5 = load float, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17
%add.i = fadd float %5, 2.000000e+00
store float %add.i, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17
This shows a float load and store to/from the same address, with an 'add 2.0' instruction in between. If I modify the code to use the variable c as demonstrated above, I get the same LLVM IR.
Conclusion: you've already achieved maximum efficiency, and compilers are smart!
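As a small aside (my addition, not something the answer above requires): if you prefer to avoid the implicit int-to-float conversion in the source, a constexpr float behaves identically, since the literal is still folded straight into the instruction. A self-contained sketch of that variant, using the same setup as the question:

#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>
using namespace sycl;

constexpr int size = 131072;  // 2^17
constexpr float c = 2.0f;     // explicitly a float, so no int-to-float conversion in the source

int main() {
    std::vector<float> dA(size, 1.0f);
    try {
        queue gpuQueue{ gpu_selector{} };
        {
            buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
            gpuQueue.submit([&](handler& cgh) {
                accessor inA{ bufA, cgh };
                cgh.parallel_for(range<1>(size),
                    [=](id<1> i) { inA[i] = inA[i] + c; });
            });
            gpuQueue.wait_and_throw();
        } // the buffer goes out of scope here and writes the results back to dA
    }
    catch (std::exception& e) { std::cerr << e.what() << '\n'; return 1; }
    std::cout << dA[0] << '\n'; // expect 3
    return 0;
}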
I am having a really hard time getting OpenMP code to work on my Mac, building in Terminal with the icc compiler, and I get the following error. Please help me correct it.
The code is pasted below. It never works, neither with an OpenMP for nor with a reduction; the pragma is simply not being recognised. I would appreciate it if you tried the code yourself to help.
#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel for
    {
        for (int i = 0; i < 3; i++)
        {
            printf("Hello");
        }
    }
    return 0;
}
To add to my comment, the correct version of the code is
#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel for
    for (int i = 0; i < 3; i++)
    {
        printf("Hello");
    }
    return 0;
}
The proper compiler command line is icc -fopenmp ... -o bla.exe bla.c (assuming that the file is named bla.c). Please replace ... with the other command line options that you will need for your code to compile.
UPDATE: The proper compiler command line for the new OpenMP compilers from Intel is to use -fiopenmp (needs -fopenmp-targets=spir64 for GPUs).
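If it is still unclear whether the OpenMP switch is taking effect, here is a small sanity check of my own (not part of the answer above): the _OPENMP macro is only defined when the OpenMP flag is actually passed to the compiler, and the thread number only varies when the pragma is honoured.

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{
#ifdef _OPENMP
    printf("OpenMP enabled (_OPENMP = %d)\n", _OPENMP);
#else
    printf("OpenMP NOT enabled - check the compiler flag\n");
#endif

    #pragma omp parallel for
    for (int i = 0; i < 3; i++)
    {
#ifdef _OPENMP
        printf("iteration %d on thread %d\n", i, omp_get_thread_num());
#else
        printf("iteration %d\n", i);
#endif
    }
    return 0;
}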
The following (reduced) code is handled very badly by a series of GCC versions:
#include <vector>
#include <cilk/cilk.h>

void walk(std::vector<int> v, int bnd, unsigned size) {
    if (v.size() < size)
        for (int i = 0; i < bnd; i++) {
            std::vector<int> vnew(v);
            vnew.push_back(i);
            cilk_spawn walk(vnew, bnd, size);
        }
}

int main(int argc, char **argv) {
    std::vector<int> v{};
    walk(v, 5, 5);
}
Specifically:
G++ 5.3.1 crashes:
red.cpp: In function ‘<built-in>’:
red.cpp:20:39: internal compiler error: in lower_stmt, at gimple-low.c:397
cilk_spawn walk(vnew, bnd, size);
G++ 6.3.1 creates code which works perfectly well when executed on one core, but sometimes segfaults and other times signals a double free when using more cores. A student who has g++ 7 on Arch Linux reported a similar result.
My question: is there something wrong with that code? Am I invoking some undefined behavior, or is it simply a bug I should report?
Answering my own question:
According to https://gcc.gnu.org/ml/gcc-help/2017-03/msg00078.html it is indeed a bug in GCC. The temporary is destroyed in the parent and not in the child of a cilk_spawn, so if the thread fork really occurs, it might be destroyed too early.
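This does not fix the Cilk bug itself, but if the goal is simply the parallel recursive walk, the same pattern can be written with OpenMP tasks instead (my alternative sketch, assuming switching runtimes is acceptable). Each task gets its own firstprivate copy of the vector, so there is no temporary shared with the parent that could be destroyed too early.

#include <vector>
#include <omp.h>

void walk(std::vector<int> v, int bnd, unsigned size) {
    if (v.size() < size)
        for (int i = 0; i < bnd; i++) {
            std::vector<int> vnew(v);
            vnew.push_back(i);
            // the task receives its own copy of vnew via firstprivate
            #pragma omp task firstprivate(vnew)
            walk(vnew, bnd, size);
        }
    #pragma omp taskwait // wait for the child tasks before returning
}

int main(int argc, char **argv) {
    std::vector<int> v{};
    #pragma omp parallel
    #pragma omp single
    walk(v, 5, 5);
}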
I would like to use PAPI to get the overall counters of all C++11 std::thread threads in a program.
PAPI documentation on Threads says that:
Thread support in the PAPI library can be initialized by calling the following low-level function in C: int PAPI_thread_init(unsigned long(*handle)(void));
where the handle is a
Pointer to a routine that returns the current thread ID as an unsigned long.
For example, for pthreads the handle is pthread_self.
But, I have no idea what it should be with C++11 std::thread.
Nor if it makes more sense to use something different from PAPI.
C++11's threading support has the std::this_thread::get_id() function, which returns a std::thread::id instance that can be serialized to a stream. You could then try to read an unsigned long back from the stream and return it. Something like this:
#include <thread>
#include <iostream>
#include <sstream>

unsigned long current_thread_id()
{
    std::stringstream id_stream;
    id_stream << std::this_thread::get_id();
    unsigned long id;
    id_stream >> id;
    return id;
}

int main(int argc, char** argv)
{
    std::cout << current_thread_id();
    return 0;
}
So in this snippet the current_thread_id function is what you are looking for, but you should add proper error handling (the thread id may not always be a number; in that case you will not be able to read a number from the stream and should handle that accordingly).
That being said, maybe just use the platform's native call directly (pthread_self on Linux, GetCurrentThreadId on Windows), since the pthread_self example from the PAPI documentation is already platform-specific anyway.
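Another option (my addition, not part of the original answer) is std::hash<std::thread::id>, which C++11 guarantees to provide; it avoids parsing the stream representation, though the value is a hash rather than the OS thread id.

#include <thread>
#include <functional>
#include <iostream>

unsigned long current_thread_id()
{
    // std::hash<std::thread::id> yields a size_t that is stable for the
    // lifetime of the thread; collisions are theoretically possible but
    // distinct threads are extremely unlikely to share a value.
    return static_cast<unsigned long>(
        std::hash<std::thread::id>{}(std::this_thread::get_id()));
}

int main()
{
    std::cout << current_thread_id() << '\n';
    return 0;
}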
I need to fill a 2D array (tmp[Ny][Nx]) where each cell gets the integral of some function of free parameters. Since I deal with very large arrays (here I have simplified my case), I need to use OpenMP parallelism to speed up my calculations. Here I use the simple #pragma omp parallel for directive.
Without #pragma omp parallel for the code executes perfectly, but adding the parallel directive produces race conditions in the output.
I tried to cure it by making (i, j, par) private, but it did not help.
P.S. I use VS2008 Professional with OpenMP 2.0 under Windows 7.
Here is my code (a short sample):
double testfunc(const double* var, const double* par)
{
    // here is some simple function to be integrated over
    // var[0] and var[1] with two free parameters par[0] and par[1]
    return ....
}

#define Nx 10000
#define Ny 10000

static double tmp[Ny][Nx];

int main()
{
    double par[2];          // parameters
    double xmin[] = {0, 0}; // limits of 2D integration
    double xmax[] = {1, 1}; // limits of 2D integration
    double val, err, Tol = 1e-7, AbsTol = 1e-7;
    int i, j, NDim = 2, NEval = 1e5;

    #pragma omp parallel for private(i,j,par,val)
    for (i = 0; i < Nx; i++)
    {
        for (j = 0; j < Ny; j++)
        {
            par[0] = i;
            par[1] = j * j;
            adapt_integrate(testfunc, par, NDim, xmin, xmax,
                            NEval, Tol, AbsTol, &val, &err);
            // adapt_integrate - receives my integrand, performs the
            // integration, and returns the result through "val"
            tmp[i][j] = val;
        }
    }
}
It produces race conditions in the output. I tried to avoid them by making all internal variables (i, j, par and val) private, but it doesn't help.
P.S. The serial version (#threads = 1) of this code runs properly.
(Answered in the question. Converted to a community wiki answer. See Question with no answers, but issue solved in the comments (or extended in chat).)
The OP wrote:
The problem is solved!
I defined the parameters of integration as globals and used the #pragma omp threadprivate(parGlob) directive for them. Now it works like a charm. I had been thinking that private() and threadprivate() have the same meaning and are just different ways of implementing it, but they do not.
So playing with these directives may give the correct answer. Another thing: defining the iterator i inside the first for loop gives an additional 20%-30% speed-up. So the fastest version of the code now looks like this:
double testfunc(const double* var, const double* par)
{
    .......
}

#define Nx 10000
#define Ny 10000

static double tmp[Ny][Nx];

double parGlob[2];                 //<- Here they are!!!
#pragma omp threadprivate(parGlob) // <- Magic directive!!!!

int main()
{
    // Not here!!!! -> double par[2]; // parameters
    double xmin[] = {0, 0}; // limits of 2D integration
    double xmax[] = {1, 1}; // limits of 2D integration
    double val, err, Tol = 1e-7, AbsTol = 1e-7;
    int j, NDim = 2, NEval = 1e5;

    #pragma omp parallel for private(j,val,err) // no `i` inside the `private` clause
    for (int i = 0; i < Nx; i++)
    {
        for (j = 0; j < Ny; j++)
        {
            parGlob[0] = i;
            parGlob[1] = j * j;
            adapt_integrate(testfunc, parGlob, NDim, xmin, xmax,
                            NEval, Tol, AbsTol, &val, &err);
            tmp[i][j] = val;
        }
    }
}
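For completeness, here is an alternative sketch of my own that avoids the threadprivate global entirely: anything declared inside the loop body is automatically private to the executing thread. It assumes the same adapt_integrate interface and the same surrounding declarations (tmp, xmin, xmax, NDim, NEval, Tol, AbsTol) as the code above.

    #pragma omp parallel for
    for (int i = 0; i < Nx; i++)
    {
        for (int j = 0; j < Ny; j++)
        {
            // declared inside the loop body, so each thread has its own copy
            double par[2];
            double val, err;
            par[0] = i;
            par[1] = (double)j * j;
            adapt_integrate(testfunc, par, NDim, xmin, xmax,
                            NEval, Tol, AbsTol, &val, &err);
            tmp[i][j] = val;
        }
    }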