OpenMP parallel section dependency

OpenMP parallel section dependency - parallel-processing

I am running into a very odd OpenMP problem that I can't figure out. This is what it looks like.
I have (let's say)four functions
function0(Data *data) { ..body.. }
function1(Data *data) { ..body.. }
function2(Data *data) { ..body.. }
function3(Data *data) { ..body.. }
These function may or may not modify what is pointed by data
Here is how they are called sequentially
// Version 1
void test(Data *data)
{
function0(data);
function1(data);
function2(data);
function3(data);
}
I can rearrange the calls in anyway I want and it still works perfect. So i am assuming they are somehow(?) independent.
Now when parallelizing
// Version 2
void test(Data *data)
{
int th_id;
#pragma omp parallel private(th_id) default(shared)
{
th_id = omp_get_thread_num();
if(th_id==0) {
function0(data);
}
if(th_id==1) {
function1(data);
}
if(th_id==2){
function2(data);
}
if(th_id==3){
function3(data);
}
}
}
It DOESN'T work (version 2).
However if I synchronize the threads after each call it works
// Version 3
void test(Data *data)
{
int th_id;
#pragma omp parallel private(th_id) default(shared)
{
th_id = omp_get_thread_num();
if(th_id==0) {
function0(data);
}
#pragma omp barrier
if(th_id==1) {
function1(data);
}
#pragma omp barrier
if(th_id==2){
function2(data);
}
#pragma omp barrier
if(th_id==3){
function3(data);
}
}
}
I am thinking there is some data racing problem regarding what is pointed by data
But why would it work (in the sequential version 1) when I rearrange the calls then?

Suppose you had two function like this
function0(int *data)
{
*data = *data + 1;
}
function1(int *data)
{
*data = *data + 2;
}
Clearly you can run those two operations in either order sequentially and at the end the value will have been incremented by 3.
However, if you run the two functions in parallel you have a data race, and it's entirely posisble that one of the additions will be lost, so you could get any the initial value incremented by 1,2, or 3.
Just because the functions appear to be commutative sequentially that doesn't mean that they can safely be run in parallel.

Related

std::atomic on struct bit-fields

I'm modifying some existing open source library and there is a struct (say named as Node) containing bit-fields, e.g.
struct Node {
std::atomic<uint32_t> size:30;
std::atomic<uint32_t> isnull:1;
};
To fit my needs, these fields need to be atomic so I was expecting to use std::atomic for this and faced compile time error:
bit-field 'size' has non-integral type 'std::atomic<uint32_t>'
According to documentation, there is a restricted set of types which can be used for std::atomic
Can anyone advise/have idea on how to get functionality of atomic fields with the minimum impact to the existing source code?
Thanks in advance!

I used an unsigned short as an example below.
This is less ideal, but you could sacrifice 8 bits and insert a std::atomic_flag in the bit field with a union. Unfortunately, std::atomic_flag type is a std::atomic_bool type.
This structure can be spin locked manually every time you access it. However, the code should have minimal performance degradation (unlike creating, locking, unlocking, destroying with a std::mutex and std::unique_lock).
This code may waste about 10-30 clock cycles to enable low cost multi-threading.
PS. Make sure the reserved 8 bits below are not messed up by the endian structure of the processor. You may have to define at the end for big-endian processors. I only tested this code on an Intel CPU (always little-endian).
#include <iostream>
#include <atomic>
#include <thread>
union Data
{
std::atomic_flag access = ATOMIC_FLAG_INIT; // one byte
struct
{
typedef unsigned short ushort;
ushort reserved : 8;
ushort count : 4;
ushort ready : 1;
ushort unused : 3;
} bits;
};
class SpinLock
{
public:
inline SpinLock(std::atomic_flag &access, bool locked=true)
: mAccess(access)
{
if(locked) lock();
}
inline ~SpinLock()
{
unlock();
}
inline void lock()
{
while (mAccess.test_and_set(std::memory_order_acquire))
{
}
}
// each attempt will take about 10-30 clock cycles
inline bool try_lock(unsigned int attempts=0)
{
while(mAccess.test_and_set(std::memory_order_acquire))
{
if (! attempts) return false;
-- attempts;
}
return true;
}
inline void unlock()
{
mAccess.clear(std::memory_order_release);
}
private:
std::atomic_flag &mAccess;
};
void aFn(int &i, Data &d)
{
SpinLock lock(d.access, false);
// manually locking/unlocking can be tighter
lock.lock();
if (d.bits.ready)
{
++d.bits.count;
}
d.bits.ready ^= true; // alternate each time
lock.unlock();
}
int main(void)
{
Data f;
f.bits.count = 0;
f.bits.ready = true;
std::thread *p[8];
for (int i = 0; i < 8; ++ i)
{
p[i] = new std::thread([&f] (int i) { aFn(i, f); }, i);
}
for (int i = 0; i < 8; ++i)
{
p[i]->join();
delete p[i];
}
std::cout << "size: " << sizeof(f) << std::endl;
std::cout << "count: " << f.bits.count << std::endl;
}
The result is as expected...
size: 2
count: 4

Dependent sections in openmp

I'm wondering if there is a way to include a dependency between sections in OpenMP? I know that there is a possibility to do that when using tasks but is there a way using sections? Say that I have the following case:
#pragma omp parallell sections
{
#pragma single
{
if (....) {
#pragma section
{
a = A(); // <- Takes much more time than B and will be parallellized further within A
}
#pragma section
{
b = B();
}
}
...
#pragma section (dependent on b?)
for (...) {
c = C(b);
}
}
}
What would be the best way to make sure that the last section is executed after 'b' is available?

C++ simple mutex using atomic_flag (code not working)

This is an exercise of using atomic_flag with acquire/release memory model to implement a very simple mutex.
There are THREADS number of threads, and each thread increment cou LOOP number of times. The threads are synchronized with this simple mutex. However, the code throws exception in thread.join() function. Could someone please enlighten me why this does not work? Thank you in advance!
#include <atomic>
#include <thread>
#include <assert.h>
#include <vector>
using namespace std;
class mutex_simplified {
private:
atomic_flag flag;
public:
void lock() {
while (flag.test_and_set(memory_order_acquire));
}
void unlock() {
flag.clear(memory_order_release);
}
};
mutex_simplified m_s;
int cou(0);
const int LOOP = 10000;
const int THREADS = 1000;
void increment() {
for (unsigned i = 0; i < LOOP; i++) {
m_s.lock();
cou++;
m_s.unlock();
}
}
int main() {
thread a(increment);
thread b(increment);
vector<thread> threads;
for (int i = 0; i < THREADS; i++)
threads.push_back(thread(increment));
for (auto & t : threads) {
t.join();
}
assert(cou == THREADS*LOOP);
}

You are not joining threads a and b. As the result, they might be still running while your program is finishing its execution.
You should either add a.join() and b.join() somewhere, or probably just remove them as the assertion in your main function will fail if you keep them.
Another issue is that you need to explicitly initialize atomic_flag instance in your mutex constructor. It might not cause issues in your example because global variables are zero-initialized, but this might cause issues later.

Stored lambda function calls are very slow - fix or workaround?

In an attempt to make a more usable version of the code I wrote for an answer to another question, I used a lambda function to process an individual unit. This is a work in progress. I've got the "client" syntax looking pretty nice:
// for loop split into 4 threads, calling doThing for each index
parloop(4, 0, 100000000, [](int i) { doThing(i); });
However, I have an issue. Whenever I call the saved lambda, it takes up a ton of CPU time. doThing itself is an empty stub. If I just comment out the internal call to the lambda, then the speed returns to normal (4 times speedup for 4 threads). I'm using std::function to save the reference to the lambda.
My question is - Is there some better way that the stl library internally manages lambdas for large sets of data, that I haven't come across?
struct parloop
{
public:
std::vector<std::thread> myThreads;
int numThreads, rangeStart, rangeEnd;
std::function<void (int)> lambda;
parloop(int _numThreads, int _rangeStart, int _rangeEnd, std::function<void(int)> _lambda) //
: numThreads(_numThreads), rangeStart(_rangeStart), rangeEnd(_rangeEnd), lambda(_lambda) //
{
init();
exit();
}
void init()
{
myThreads.resize(numThreads);
for (int i = 0; i < numThreads; ++i)
{
myThreads[i] = std::thread(myThreadFunction, this, chunkStart(i), chunkEnd(i));
}
}
void exit()
{
for (int i = 0; i < numThreads; ++i)
{
myThreads[i].join();
}
}
int rangeJump()
{
return ceil(float(rangeEnd - rangeStart) / float(numThreads));
}
int chunkStart(int i)
{
return rangeJump() * i;
}
int chunkEnd(int i)
{
return std::min(rangeJump() * (i + 1) - 1, rangeEnd);
}
static void myThreadFunction(parloop *self, int start, int end) //
{
std::function<void(int)> lambda = self->lambda;
// we're just going to loop through the numbers and print them out
for (int i = start; i <= end; ++i)
{
lambda(i); // commenting this out speeds things up back to normal
}
}
};
void doThing(int i) // "payload" of the lambda function
{
}
int main()
{
auto start = timer.now();
auto stop = timer.now();
// run 4 trials of each number of threads
for (int x = 1; x <= 4; ++x)
{
// test between 1-8 threads
for (int numThreads = 1; numThreads <= 8; ++numThreads)
{
start = timer.now();
// this is the line of code which calls doThing in the loop
parloop(numThreads, 0, 100000000, [](int i) { doThing(i); });
stop = timer.now();
cout << numThreads << " Time = " << std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count() / 1000000.0f << " ms\n";
//cout << "\t\tsimple list, time was " << deltaTime2 / 1000000.0f << " ms\n";
}
}
cin.ignore();
cin.get();
return 0;
}

I'm using std::function to save the reference to the lambda.
That's one possible problem, as std::function is not a zero-runtime-cost abstraction. It is a type-erased wrapper that has a virtual-call like cost when invoking operator() and could also potentially heap-allocate (which could mean a cache-miss per call).
If you want to store your lambda in such a way that does not introduce additional overhead and that allows the compiler to inline it, you should use a template parameter. This is not always possible, but might fit your use case. Example:
template <typename TFunction>
struct parloop
{
public:
std::thread **myThreads;
int numThreads, rangeStart, rangeEnd;
TFunction lambda;
parloop(TFunction&& _lambda,
int _numThreads, int _rangeStart, int _rangeEnd)
: lambda(std::move(_lambda)),
numThreads(_numThreads), rangeStart(_rangeStart),
rangeEnd(_rangeEnd)
{
init();
exit();
}
// ...
To deduce the type of the lambda, you can use an helper function:
template <typename TF, typename... TArgs>
auto make_parloop(TF&& lambda, TArgs&&... xs)
{
return parloop<std::decay_t<TF>>(
std::forward<TF>(lambda), std::forward<TArgs>(xs)...);
}
Usage:
auto p = make_parloop([](int i) { doThing(i); },
numThreads, 0, 100000000);
I wrote an article that's related to the subject:
"Passing functions to functions"
It contains some benchmarks that show how much assembly is generated for std::function compared to a template parameter and other solutions.

Is code motion permitted in presence of atomic memory accesses?

Looking at the following example,
#include <atomic>
extern std::atomic<int> TheVal;
extern std::atomic<bool> Ready;
int Unrelated;
void test() {
for (int i=0; i<100; ++i) {
Unrelated = 42; // A loop-invariant store
if (Ready.load(std::memory_order_acquire)) continue;
TheVal.store(i, std::memory_order_release);
Ready.store(1, std::memory_order_release);
}
}
Is the compiler allowed to move the store to 'Unrelated' out of the loop such that test() would be similar to the following?
void test() {
Unrelated = 42; // Move the loop-invariant store here
for (int i=0; i<100; ++i) {
if (Ready.load(std::memory_order_acquire)) continue;
TheVal.store(i, std::memory_order_release);
Ready.store(1, std::memory_order_release);
}
}
If the variables 'Ready' and 'TheVal' were not atomic then this optimization would be surely safe, but does their atomicity prevent this optimization?

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

OpenMP parallel section dependency - parallel-processing

Related

std::atomic on struct bit-fields

Dependent sections in openmp

C++ simple mutex using atomic_flag (code not working)

Stored lambda function calls are very slow - fix or workaround?

Is code motion permitted in presence of atomic memory accesses?

Categories

Resources