Race conditions with OpenMP

I need to fill a 2D array (tmp[Ny][Nx]) where each cell gets an integral (of some function) as a function of free parameters. Since I deal with very large arrays (here I have simplified my case), I need to use OpenMP parallelism to speed up my calculations. Here I use the simple #pragma omp parallel for directive.
Without #pragma omp parallel for, the code executes perfectly. But adding the parallel directive produces race conditions in the output.
I tried to cure it by making private(i,j,par); it did not help.
P.S. I use VS2008 Professional with OpenMP 2.0 under Windows 7.
Here is my code (a short sample):
double testfunc(const double* var, const double* par)
{
// here is some simple function to be integrated over
// var[0] and var[1] and two free parameters par[0] and par[1]
return ....
}
#define Nx 10000
#define Ny 10000
static double tmp[Ny][Nx];
int main()
{
double par[2]; // parameters
double xmin[]={0,0}; // limits of 2D integration
double xmax[]={1,1};// limits of 2D integration
double val,err,Tol=1e-7,AbsTol=1e-7;
int i,j,NDim=2,NEval=100000;
#pragma omp parallel for private(i,j,par,val)
for (i=0;i<Nx;i++)
{
for (j=0;j<Ny;j++)
{
par[0]=i;
par[1]=j*j;
adapt_integrate(testfunc,par, NDim, xmin, xmax,
NEval, Tol, AbsTol, &val, &err);
// adapt_integrate - receives my integrand, performs
// integration and returns a result through "val"
tmp[i][j] = val;
}
}
}
It produces race conditions in the output. I tried to avoid them by making all internal variables (i, j, par and val) private, but it doesn't help.
P.S. The serial version (#threads=1) of this code runs properly.

(Answered in the question. Converted to a community wiki answer.)
The OP wrote:
The problem is solved!
I defined the parameters of integration as global and used the #pragma omp threadprivate(parGlob) directive for them. Now it works like a charm. I had been thinking that private() and threadprivate() have the same meaning, just different ways of implementation, but they do not.
So, playing with these directives may give a correct answer. Another thing: defining the iterator i inside the first for loop gives an additional 20%-30% speedup in performance. So the fastest version of the code now looks like this:
double testfunc(const double* var, const double* par)
{
.......
}
#define Nx 10000
#define Ny 10000
static double tmp[Ny][Nx];
double parGlob[2]; //<- Here are they!!!
#pragma omp threadprivate(parGlob) // <-Magic directive!!!!
int main()
{
// Not here !!!! -> double par[2]; // parameters
double xmin[]={0,0}; // limits of 2D integration
double xmax[]={1,1};// limits of 2D integration
double val,err,Tol=1e-7,AbsTol=1e-7;
int j,NDim=2,NEval=100000;
#pragma omp parallel for private(j,val) // no `i` inside `private` clause
for (int i=0;i<Nx;i++)
{
for (j=0;j<Ny;j++)
{
parGlob[0]=i;
parGlob[1]=j*j;
adapt_integrate(testfunc,parGlob, NDim, xmin, xmax,
NEval, Tol, AbsTol, &val, &err);
tmp[i][j] = val;
}
}
}
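For reference, here is a minimal sketch of the difference the OP ran into (standard OpenMP semantics, not code from the original post): private gives each thread a fresh, uninitialized copy that exists only inside one parallel region, while threadprivate gives each thread its own persistent copy of a global variable that survives from one parallel region to the next (provided the thread count stays the same and dynamic threads are off).
#include <omp.h>
#include <stdio.h>
int g; // each thread keeps its own copy across parallel regions
#pragma omp threadprivate(g)
int main() {
    int p = 42;
    #pragma omp parallel private(p) num_threads(2)
    {
        // p is a brand-new, uninitialized copy here; the 42 from
        // outside is not visible inside this region
        p = omp_get_thread_num();
        g = p; // each thread writes its own threadprivate g
    }
    #pragma omp parallel num_threads(2)
    {
        // g still holds what this thread stored in the previous region
        printf("thread %d: g = %d\n", omp_get_thread_num(), g);
    }
    return 0;
}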

Related

How to optimize SYCL kernel

I'm studying SYCL at university and I have a question about the performance of some code.
In particular, I have a simple serial C/C++ loop that adds the constant 2 to every element of a float array, and I need to translate it into a SYCL kernel with parallelization. I do this:
#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>
using namespace sycl;
constexpr int size = 131072; // 2^17
int main(int argc, char** argv) {
// Create a vector with size elements and initialize them to 1
std::vector<float> dA(size, 1.0f);
try {
queue gpuQueue{ gpu_selector{} };
buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
gpuQueue.submit([&](handler& cgh) {
accessor inA{ bufA,cgh };
cgh.parallel_for(range<1>(size),
[=](id<1> i) { inA[i] = inA[i] + 2; }
);
});
gpuQueue.wait_and_throw();
}
catch (std::exception&) { throw; } // rethrow; 'throw e;' would copy and slice the exception
return 0;
}
So my question is about the value c: in this case I use the value 2 directly, but will this impact performance when I run the code? Do I need to create a variable, or is it correct this way and the performance is good?
Thanks in advance for the help!
Interesting question. In this case the value 2 will be a literal in the instruction in your SYCL kernel; this is as efficient as it gets, I think! There's the slight complication that you have an implicit cast from int to float. My guess is that you'll probably end up with a float literal 2.0 in your device assembly. Your SYCL device won't have to fetch that 2 from memory or cast at runtime or anything like that; it just lives in the instruction.
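If you wanted to rule the conversion out at the source level, you could write the literal as a float yourself; a trivial variant of your kernel body (same behaviour, just an explicit float literal):
cgh.parallel_for(range<1>(size),
    [=](id<1> i) { inA[i] = inA[i] + 2.0f; } // float literal, no int-to-float conversion
);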
Equally, if you had:
constexpr int c = 2;
// the rest of your code
[=](id<1> i) { inA[i] = inA[i] + c; }
// etc
The compiler is almost certainly smart enough to propagate the constant value of c into the kernel code. So, again, the 2.0 literal ends up in the instruction.
I compiled your example with DPC++ and extracted the LLVM IR, and found the following lines:
%5 = load float, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17
%add.i = fadd float %5, 2.000000e+00
store float %add.i, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17
This shows a float load and store to/from the same address, with an 'add 2.0' instruction in between. If I modify the code to use the variable c as demonstrated, I get the same LLVM IR.
Conclusion: you've already achieved maximum efficiency, and compilers are smart!

How to map data with OpenMP target for use inside a function?

I would like to know how I can map data for future use inside of a function.
I wrote some code like the following:
struct s {
int *a;
int *b;
// other members...
};
void func1(struct s* _s){
int *a = _s->a;
int *b = _s->b;
// do something with _s
#pragma omp target
{
// do something with a and b;
}
}
int main(){
struct s* _s;
// alloc _s, a and b
int *a = _s->a;
int *b = _s->b;
#pragma omp target data map(to: a, b)
{
func1(_s);
// call another funcs with device use of mapped data...
}
// free data
}
The code compiles, but on execution Kernel execution error at <address> is spammed in the verbose output, followed by many Device kernel launch failed! and CUDA error is: an illegal memory access was encountered.
Your map directive looks like it's probably mapping the value of the pointers a and b to the device rather than the arrays they're pointing to. I think you want to shape them so that the runtime maps the data and not just the pointers. Personally, I would also put the map clause on your target region too, since that gives the compiler more information to work with and the present check will find the data already on the device from the outer data region and not perform any further data movement.
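For illustration, a sketch of the shaped mapping (N is a placeholder for the real element count of a and b, which the question elides):
int *a = _s->a;
int *b = _s->b;
#pragma omp target data map(to: a[0:N], b[0:N]) // maps the arrays, not just the pointer values
{
    func1(_s);
}
and inside func1, repeating the shape on the target construct lets the present check find the arrays already on the device:
#pragma omp target map(to: a[0:N], b[0:N])
{
    // do something with a and b;
}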

C++: assign a lambda function to a class member for computational efficiency [duplicate]

This question already has answers here:
C++ lambda with captures as a function pointer
(9 answers)
Closed 7 years ago.
UPDATED (rephrased): I'm looking to boost the computational efficiency of my code by making a run-time assignment of a class member function to one of many functions, conditional on other class members.
One recommended solution uses #include <functional> and function<void()>, as shown in this simple test example:
#include <iostream>
#include <functional>
using namespace std;
struct Number {
int n;
function<void()> doIt;
Number(int i):n(i) {};
void makeFunc() {
auto _odd = [this]() { /* op specific to odd */ };
auto _even = [this]() { /* op specific to even */ };
// compiles but produces bloated code, not computationally efficient
if (n%2) doIt = _odd;
else doIt = _even;
};
};
int main() {
int i;
cin >> i;
Number e(i);
e.makeFunc();
e.doIt();
};
I'm finding that the compiled code (i.e. the debug assembly) is grotesquely complicated and presumably NOT computationally efficient (the desired goal).
Does someone have an alternative construct that would achieve the end goal of a computationally efficient means of conditionally defining, at run-time, a class member function?
A capturing lambda expression cannot be assigned to a regular function pointer like you have.
I suggest using
std::function<void()> doIt;
instead of
void (*doIt)();
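Put together, a minimal self-contained sketch of that suggestion (the lambda bodies are placeholder prints, not the OP's real operations):
#include <functional>
#include <iostream>
struct Number {
    int n;
    std::function<void()> doIt; // can hold a capturing lambda
    Number(int i) : n(i) {}
    void makeFunc() {
        auto _odd  = [this]() { std::cout << n << " is odd\n"; };
        auto _even = [this]() { std::cout << n << " is even\n"; };
        if (n % 2) doIt = _odd;
        else       doIt = _even;
    }
};
int main() {
    int i;
    std::cin >> i;
    Number e(i);
    e.makeFunc();
    e.doIt();
}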

How can I synchronize data between different cores on a Xeon (Linux: how to use memory barriers)?

I wrote a simple program to test memory synchronization. It uses a global queue shared by two processes, binding the two processes to different cores. My code is below.
#define _GNU_SOURCE // must come before the includes for CPU_SET etc.
#include <stdio.h>
#include <sched.h>
#include <unistd.h>
void bindcpu(int pid) {
int cpuid;
cpu_set_t mask;
cpu_set_t get;
CPU_ZERO(&mask);
if (pid > 0) {
cpuid = 1;
} else {
cpuid = 5;
}
CPU_SET(cpuid, &mask);
if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
printf("warning: could not set CPU affinity, continuing...\n");
}
}
#define Q_LENGTH 512
int g_queue[Q_LENGTH];
struct point {
int volatile w;
int volatile r;
};
volatile struct point g_p;
void iwrite(int x) {
while (g_p.r == g_p.w);
usleep(100000); // sleep(0.1) would truncate to sleep(0)
g_queue[g_p.w] = x;
g_p.w = (g_p.w + 1) % Q_LENGTH;
printf("#%d!%d", g_p.w, g_p.r);
}
void iread(int *x) {
while (((g_p.r + 1) % Q_LENGTH) == g_p.w);
*x = g_queue[g_p.r];
g_p.r = (g_p.r + 1) % Q_LENGTH;
printf("-%d*%d", g_p.r, g_p.w);
}
int main(int argc, char * argv[]) {
//int num = sysconf(_SC_NPROCESSORS_CONF);
int pid;
pid = fork();
g_p.r = Q_LENGTH;
bindcpu(pid);
int i = 0, j = 0;
if (pid > 0) {
printf("call iwrite \0");
while (1) {
iread(&j);
}
} else {
printf("call iread\0");
while (1) {
iwrite(i);
i++;
}
}
}
The data between the two processes, pinned to different cores of an Intel(R) Xeon(R) CPU E3-1230, didn't synchronize.
CPU: Intel(R) Xeon(R) CPU E3-1230
OS: 3.8.0-35-generic #50~precise1-Ubuntu SMP
I want to know, beyond IPC, how I can synchronize the data between the different cores in user space.
If you want your application to manipulate the CPU's shared cache in order to accomplish IPC, I don't believe you will be able to do that.
Chapter 9 of "Linux Kernel Development, Second Edition" has information on synchronizing multi-threaded applications (including atomic operations, semaphores, barriers, etc.):
http://www.makelinux.net/books/lkd2/ch09
so you may get some ideas on what you are looking for there.
Here is a decent write-up on Intel® Smart Cache, "Software Techniques for Shared-Cache Multi-Core Systems": http://archive.is/hm0y
And here are some Stack Overflow questions/answers that may help you find the information you are looking for:
Storing C/C++ variables in processor cache instead of system memory
C++: Working with the CPU cache
Understanding how the CPU decides what gets loaded into cache memory
Sorry for bombarding you with links but this is the best I can do without a clearer understanding of what you are looking to accomplish.
I suggest reading "Volatile: Almost Useless for Multi-Threaded Programming" for why volatile should be removed from the example code. Instead, use C11 or C++11 atomic operations. See also the Fenced Data Transfer example in of the TBB Design Patterns Manual.
Below I show the parts of the question example that I changed to use C++11 atomics. I compiled it with g++ 4.7.2.
#include <atomic>
...
struct point {
std::atomic<int> w;
std::atomic<int> r;
};
struct point g_p;
void iwrite(int x) {
int w = g_p.w.load(std::memory_order_relaxed);
int r;
while ((r=g_p.r.load(std::memory_order_acquire)) == w);
usleep(100000); // sleep(0.1) would truncate to sleep(0)
g_queue[w] = x;
w = (w+1)%Q_LENGTH;
g_p.w.store( w, std::memory_order_release);
printf("#%d!%d", w, r);
}
void iread(int *x) {
int r = g_p.r.load(std::memory_order_relaxed);
int w;
while (((r + 1) % Q_LENGTH) == (w=g_p.w.load(std::memory_order_acquire)));
*x = g_queue[r];
g_p.r.store( (r + 1) % Q_LENGTH, std::memory_order_release );
printf("-%d*%d", r, w);
}
The key changes are:
I removed "volatile" everywhere.
The members of struct point are declared as std::atomic
Some loads and stores of g_p.r and g_p.w are fenced. Others are hoisted.
When loading a variable modified by another thread, the code "snapshots" it into a local variable.
The code uses "relaxed load" (no fence) where a thread loads a variable that no other thread modifies. I hoisted those loads out of the spin loops since there is no point in repeating them.
The code uses "acquiring load" where a thread loads a "message is ready" indicator that is set by another thread, and uses a "releasing store" where it is storing a "message is ready" indicator" to be read by another thread. The release is necessary to ensure that the "message" (queue data) is written before the "ready" indicator (member of g_p) is written. The acquire is likewise necessary to ensure that the "message" is read after the "ready" indicator is seen.
The snapshots are used so that the printf reports the value that the thread actually used, as opposed to some new value that appeared later. In general I like to use the snapshot style for two reasons. First, touching shared memory can be expensive because it often requires cache-line transfers. Second, the style gives me a stable value to use locally without having to worry that a reread might return a different value.

How to use arrays in program (global) scope in OpenCL

AMD OpenCL Programming Guide, Section 6.3 Constant Memory Optimization:
Globally scoped constant arrays. These arrays are initialized, globally scoped, and in the constant address space (as specified in section 6.5.3 of the OpenCL specification). If the size of an array is below 64 kB, it is placed in hardware constant buffers; otherwise, it uses global memory. An example of this is a lookup table for math functions.
I want to use this "globally scoped constant array". I have such code in pure C
#define SIZE 101
int *reciprocal_table;
int reciprocal(int number){
return reciprocal_table[number];
}
void kernel(int *output)
{
for(int i=0; i < SIZE; i++)
output[i] = reciprocal(i);
}
I want to port it to OpenCL:
int reciprocal(int number); // forward declaration; note that `kernel` is a reserved word in OpenCL C, so the kernel needs another name

__kernel void compute(__global int *output){
int gid = get_global_id(0);
output[gid] = reciprocal(gid);
}
int reciprocal(int number){
return reciprocal_table[number];
}
What should I do with the global variable reciprocal_table? If I try to add __global or __constant to it, I get an error:
global variable must be declared in addrSpace constant
I don't want to pass __constant int *reciprocal_table from the kernel down to reciprocal. Is it possible to initialize a global variable somehow? I know that I can write it directly into the code, but does another way exist?
P.S. I'm using AMD OpenCL.
UPD: The above code is just an example. My real code is much more complex, with a lot of functions, so I want to make an array in program scope to use it in all functions.
UPD2: Changed the example code and added a citation from the Programming Guide.
#define SIZE 2
int constant array[SIZE] = {0, 1};
kernel void
foo (global int* input,
global int* output)
{
const uint id = get_global_id (0);
output[id] = input[id] + array[id];
}
I can get the above to compile with Intel as well as AMD. It also works without the initialization of the array, but then you would not know what's in the array, and since it's in the constant address space you could not assign any values.
Program-global variables have to be in the __constant address space, as stated by section 6.5.3 of the standard.
UPDATE Now that I fully understand the question:
One thing that worked for me is to define the array in the constant address space and then override it by passing a kernel parameter constant int* foo.
That produced correct results only on the GPU device. The AMD CPU device and the Intel CPU device did not overwrite the array's address. It is also probably not compliant with the standard.
Here's how it looks:
#define SIZE 2
int constant foo[SIZE] = {100, 100};
int
baz (int i)
{
return foo[i];
}
kernel void
bar (global int* input,
global int* output,
constant int* foo)
{
const uint id = get_global_id (0);
output[id] = input[id] + baz (id);
}
For input = {2, 3} and foo = {0, 1} this produces {2, 4} on my HD 7850 device (Ubuntu 12.10, Catalyst 9.0.2). But on the CPU I get {102, 103} with either OCL implementation (AMD, Intel). So I cannot stress enough how much I personally would NOT do this, because it's only a matter of time before it breaks.
Another way to achieve this would be to generate .h files on the host at runtime with the definition of the array (or predefine them) and pass them to the kernel upon compilation via a compiler option. This, of course, requires recompilation of the clProgram/clKernel for every different LUT.
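For example, the compiler-option route might look like this (a sketch; program and device are assumed to already exist, and only the table size is injected here):
// Inject compile-time definitions when building the program.
const char *options = "-D SIZE=101";
cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);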
I struggled to get this to work in my own program some time ago.
I did not find any way to initialize a constant or global scope array from the host via clEnqueueWriteBuffer or the like. The only way is to write it explicitly in your .cl source file.
So my trick to initialize it from the host is to use the fact that you are actually compiling your source from the host, which also means you can alter your src.cl file before compiling it.
First my src.cl file reads:
__constant double lookup[SIZE] = { LOOKUP }; // precomputed table (in constant memory).
double func(int idx) {
return lookup[idx];
}
__kernel void ker1(__global double *in, __global double *out)
{
... do something ...
double t = func(i);
...
}
Notice the lookup table is initialized with the LOOKUP placeholder.
Then, in the host program, before compiling your OpenCL code:
compute the values of my lookup table in host_values[]
on your host, run something like:
char *buf = (char*) malloc( 10000 );
int count = sprintf(buf, "#define LOOKUP "); // actual source generation !
for (int i=0;i<SIZE;i++) count += sprintf(buf+count, "%g, ",host_values[i]);
count += sprintf(buf+count,"\n");
then read the content of your source file src.cl and place it right at buf+count (see the sketch after this list).
you now have a source file with an explicitly defined lookup table that you just computed from the host.
compile your buffer with something like clCreateProgramWithSource(context, 1, (const char **) &buf, &src_sz, &err);
voilà !
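The read-and-append step above might look like this (a sketch; error handling is omitted and buf is assumed to be large enough):
FILE *f = fopen("src.cl", "rb");
fseek(f, 0, SEEK_END);
long sz = ftell(f);
rewind(f);
fread(buf + count, 1, sz, f); // append the .cl source right after the generated #define
fclose(f);
size_t src_sz = count + sz;
buf[src_sz] = '\0';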
It looks like "array" is a look-up table of sorts. You'll need to clCreateBuffer and clEnqueueWriteBuffer so the GPU has a copy of it to use.
