Application exits prematurely with OpenMP with the error code: Fatal User Error 1002: Not all work-sharing constructs executed by all threads


I added OpenMP code to some serial code in a simulator application. When I run a program that uses this application, the program exits unexpectedly with the output "The thread 'Win32 Thread' (0x1828) has exited with code 1 (0x1)". This happens in the parallel region where I added the OpenMP code.
Here's a code sample:
#pragma omp parallel for private (curr_proc_info, current_writer, method_h) shared (exceptionOccured) schedule(dynamic, 1)
for (i = 0 ; i < method_process_num ; i++)
{
current_writer = 0;
// we need to add protection before we can dequeue a method from the methods queue,
#pragma omp critical(dequeueMethod)
method_h = pop_runnable_method(curr_proc_info, current_writer);
if(method_h !=0 && exceptionOccured == false){
try {
method_h->semantics();
}
catch( const sc_report& ex ) {
::std::cout << "\n" << ex.what() << ::std::endl;
m_error = true;
exceptionOccured = true; // we cannot jump outside the loop, so instead of return we use a flag and return somewhere else
}
}
}
The scheduling was static before I made it dynamic. After I switched to dynamic with a chunk size of 1, the application proceeded a little further before it exited. Can this be an indication of what is happening inside the parallel region?
Thanks

As I read it (and I'm more of a Fortran programmer than C/C++), your private variable curr_proc_info is not given a value inside the parallel region before it first appears in the call to pop_runnable_method. But private variables are undefined on entry to the parallel region.
I also think your sharing of exceptionOccured is a little fishy, since it means that an exception on any one thread is noticed by every thread, not just the thread in which it was raised. Of course, that may be your intent.
Cheers
Mark
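To make the point about private variables concrete, here is a minimal sketch, not the simulator's actual code. The names run_all, base and fail_at are hypothetical, and the loop body is only a stand-in for method_h->semantics(). It assumes the value a thread needs is known before the region, so it uses firstprivate (each thread's copy starts with that value) instead of private, and it reads and writes the shared error flag through OpenMP atomic constructs. If the compiler is run without OpenMP support, the pragmas are ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the simulator loop. Returns true if any
// iteration threw; out[i] receives base + i for iterations that completed.
bool run_all(int n, int fail_at, std::vector<int>& out) {
    assert(n >= 0);
    out.assign(n, 0);
    int base = 100;  // given a value BEFORE the parallel region
    int failed = 0;  // shared flag, only accessed through atomic constructs

    // firstprivate copies the initial value of base into every thread.
    // With private(base), each thread's copy would start out undefined.
    #pragma omp parallel for firstprivate(base) schedule(dynamic, 1)
    for (int i = 0; i < n; ++i) {
        int seen;
        #pragma omp atomic read
        seen = failed;
        if (seen) continue;  // skip remaining work once an error is flagged

        try {
            if (i == fail_at) throw std::runtime_error("simulated failure");
            out[i] = base + i;
        } catch (const std::exception&) {
            // An exception must not leave the loop body, so record it instead.
            #pragma omp atomic write
            failed = 1;
        }
    }
    return failed != 0;
}
```

Calling run_all(8, -1, out) runs every iteration and returns false; passing a valid index as fail_at makes that iteration throw, and the function returns true.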

Related

C++11 std::condition_variable - notify_one() not behaving as expected?

I don't see this program having any practical use, but while experimenting with C++11 concurrency and condition variables I stumbled across something I don't fully understand.
At first I assumed that using notify_one() would allow the program below to work. However, the program just froze after printing one. When I switched to notify_all(), the program did what I wanted it to do (print all natural numbers in order). I am sure this question has been asked in various forms already, but my specific question is which part of the documentation I misread.
I assume notify_one() should work because of the following statement.
If any threads are waiting on *this, calling notify_one unblocks one of the waiting threads.
Looking below only one of the threads will be blocked at a given time, correct?
class natural_number_printer
{
public:
void run()
{
m_odd_thread = std::thread(
std::bind(&natural_number_printer::print_odd_natural_numbers, this));
m_even_thread = std::thread(
std::bind(&natural_number_printer::print_even_natural_numbers, this));
m_odd_thread.join();
m_even_thread.join();
}
private:
std::mutex m_mutex;
std::condition_variable m_condition;
std::thread m_even_thread;
std::thread m_odd_thread;
private:
void print_odd_natural_numbers()
{
for (unsigned int i = 1; i < 100; ++i) {
if (i % 2 == 1) {
std::cout << i << " ";
m_condition.notify_all();
} else {
std::unique_lock<std::mutex> lock(m_mutex);
m_condition.wait(lock);
}
}
}
void print_even_natural_numbers()
{
for (unsigned int i = 1; i < 100; ++i) {
if (i % 2 == 0) {
std::cout << i << " ";
m_condition.notify_all();
} else {
std::unique_lock<std::mutex> lock(m_mutex);
m_condition.wait(lock);
}
}
}
};

The provided code "works" correctly and gets stuck by design. The cause is described in the documentation:
The effects of notify_one()/notify_all() and
wait()/wait_for()/wait_until() take place in a single total order, so
it's impossible for notify_one() to, for example, be delayed and
unblock a thread that started waiting just after the call to
notify_one() was made.
The step-by-step logic is:
1. The print_odd_natural_numbers thread is started.
2. The print_even_natural_numbers thread is started as well.
3. The m_condition.notify_all(); line of print_odd_natural_numbers (for i = 1) is executed before the print_even_natural_numbers thread reaches its m_condition.wait(lock); line, so the notification is lost.
4. The m_condition.wait(lock); line of print_even_natural_numbers is executed and the thread gets stuck.
5. The m_condition.wait(lock); line of print_odd_natural_numbers (for i = 2) is executed and that thread gets stuck as well.
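One common way to avoid losing a notification is to wait on a predicate over shared state instead of on the notification alone. The following is a minimal sketch under that assumption, not a drop-in fix for the class above: alternate_numbers and the shared counter next are hypothetical names, and the numbers are collected into a vector rather than printed so the result can be checked.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Two threads take turns appending 1..limit. The turn is decided by the
// shared counter `next`, so a notification sent "too early" cannot be lost:
// a thread that arrives late sees the predicate is already true and skips
// the wait.
std::vector<int> alternate_numbers(int limit) {
    assert(limit >= 0);
    std::mutex m;
    std::condition_variable cv;
    int next = 1;          // next number to emit, protected by m
    std::vector<int> out;  // protected by m

    auto worker = [&](int parity) {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            // wait(lock, pred) re-checks pred under the lock, which also
            // covers spurious wakeups.
            cv.wait(lock, [&] { return next > limit || next % 2 == parity; });
            if (next > limit) return;
            out.push_back(next);
            ++next;
            cv.notify_one();  // at most one other thread can be waiting
        }
    };

    std::thread odd(worker, 1);
    std::thread even(worker, 0);
    odd.join();
    even.join();
    return out;
}
```

Because the state lives in next rather than in the notification itself, notify_one() is enough here: with only two threads, the single possible waiter is always the one whose turn it is.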

Can you choose a thread from a thread pool to execute (boost)

Here is some code I have at the moment.
int main()
{
boost::thread_group threads; // Thread Pool
// Here we create threads and kick them off by passing
// the address of the function to call
for (int i = 0; i < num_threads; i++)
threads.create_thread(&SendDataToFile);
threads.join_all();
system("PAUSE");
}
void SendDataToFile()
{
// The lock guard will make sure only one thread (client)
// will access this application at once
boost::lock_guard<boost::mutex> lock(io_mutex);
for (int i = 0; i < 5; i++)
cout << "Writing" << boost::this_thread::get_id() << endl;
}
At the moment I'm just using cout instead of writing to a file.
Is it possible to choose which thread carries out an operation before another thread? Say I have a file I want to write to, and 4 threads want to access that file at the same time. Is it possible for me to say "OK, thread 2, you go first" in Boost?
Can an fstream be used like cout? When I wrote to a file without a mutex the output was not messy, but when I print to the console without a mutex it is messy, as you would expect.

There are a number of equivalent ways you could do this using some combination of global variables protected by atomic updates, a mutex, semaphore, condition variable, etc. The way that seems to me to most directly communicate what you're trying to do is to have your threads wait on a ticket lock where instead of their ticket number representing the order that they arrived at the lock, it's chosen to be the order in which the threads were created. You could combine that idea with the Boost spinlock example for a simple and probably performant implementation.
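As a rough illustration of the ticket idea, here is a sketch that uses the C++11 standard library (std::mutex and std::condition_variable) rather than a Boost spinlock, purely to keep it self-contained. TurnGate, wait_for_turn, finish_turn and write_in_creation_order are hypothetical names, and a vector stands in for the shared file.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: lets threads through in ticket order 0, 1, 2, ...
class TurnGate {
public:
    void wait_for_turn(int ticket) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return now_serving_ == ticket; });
    }
    void finish_turn() {
        {
            std::lock_guard<std::mutex> lock(m_);
            ++now_serving_;
        }
        // Several threads may be waiting for different tickets, so wake all
        // of them and let each re-check its own predicate.
        cv_.notify_all();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int now_serving_ = 0;
};

// Each thread's ticket is its creation index, so the "writes" happen in
// creation order no matter how the threads are scheduled.
std::vector<int> write_in_creation_order(int n) {
    assert(n >= 0);
    TurnGate gate;
    std::vector<int> order;  // stands in for the shared file
    std::vector<std::thread> threads;
    for (int ticket = 0; ticket < n; ++ticket) {
        threads.emplace_back([&gate, &order, ticket] {
            gate.wait_for_turn(ticket);
            order.push_back(ticket);  // only the current ticket holder is here
            gate.finish_turn();
        });
    }
    for (auto& t : threads) t.join();
    return order;
}
```

To let a particular thread go first, hand it ticket 0 instead of its creation index. The trade-off is that this fully serializes the guarded section in a fixed order, which is what the question asks for but gives up any parallelism there.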

Callback passed to boost::asio::async_read_some never invoked in usage where boost::asio::read_some returns data

I have been working on implementing a half duplex serial driver by learning from a basic serial terminal example using boost::asio::basic_serial_port:
http://lists.boost.org/boost-users/att-41140/minicom.cpp
I need to read asynchronously but still detect in the main thread when the handler has finished, so I pass async_read_some a callback with several additional reference parameters bound with boost::bind. The handler never gets invoked, but if I replace the async_read_some function with the read_some function it returns data without an issue.
I believe I'm satisfying all of the necessary requirements for this function to invoke the handler, because they are the same as for the read_some function, which does return data:
The buffer stays in scope
One or more bytes is received by the serial device
The io service is running
The port is open and running at the correct baud rate
Does anyone know if I'm missing another assumption unique to the asynchronous read or if I'm not setting up the io_service correctly?
Here is an example of how I'm using the code with async_read_some (http://www.boost.org/doc/libs/1_56_0/doc/html/boost_asio/reference/basic_serial_port/async_read_some.html):
void readCallback(const boost::system::error_code& error, size_t bytes_transfered, bool & finished_reading, boost::system::error_code& error_report, size_t & bytes_read)
{
std::cout << "READ CALLBACK\n";
std::cout.flush();
error_report = error;
bytes_read = bytes_transfered;
finished_reading = true;
return;
}
int main()
{
int baud_rate = 115200;
std::string port_name = "/dev/ttyUSB0";
boost::asio::io_service io_service_;
boost::asio::serial_port serial_port_(io_service_,port_name);
serial_port_.set_option(boost::asio::serial_port_base::baud_rate(baud_rate));
boost::thread service_thread_;
service_thread_ = boost::thread(boost::bind(&boost::asio::io_service::run, &io_service_));
std::cout << "Starting byte read\n";
boost::system::error_code ec;
bool finished_reading = false;
size_t bytes_read;
int max_response_size = 8;
uint8_t read_buffer[max_response_size];
serial_port_.async_read_some(boost::asio::buffer(read_buffer, max_response_size),
boost::bind(readCallback,
boost::asio::placeholders::error,
boost::asio::placeholders::bytes_transferred,
finished_reading, ec, bytes_read));
std::cout << "Waiting for read to finish\n";
while (!finished_reading)
{
boost::this_thread::sleep(boost::posix_time::milliseconds(1));
}
std::cout << "Finished byte read: " << bytes_read << "\n";
for (int i = 0; i < bytes_read; ++i)
{
printf("0x%x ",read_buffer[i]);
}
}
The result is that the callback does not print out anything and the while !finished loop never finishes.
Here is how I use the blocking read_some function (boost.org/doc/libs/1_56_0/doc/html/boost_asio/reference/basic_serial_port/read_some.html):
int main()
{
int baud_rate = 115200;
std::string port_name = "/dev/ttyUSB0";
boost::asio::io_service io_service_;
boost::asio::serial_port serial_port_(io_service_,port_name);
serial_port_.set_option(boost::asio::serial_port_base::baud_rate(baud_rate));
boost::thread service_thread_;
service_thread_ = boost::thread(boost::bind(&boost::asio::io_service::run, &io_service_));
std::cout << "Starting byte read\n";
boost::system::error_code ec;
int max_response_size = 8;
uint8_t read_buffer[max_response_size];
int bytes_read = serial_port_.read_some(boost::asio::buffer(read_buffer, max_response_size),ec);
std::cout << "Finished byte read: " << bytes_read << "\n";
for (int i = 0; i < bytes_read; ++i)
{
printf("0x%x ",read_buffer[i]);
}
}
This version prints from 1 up to 8 characters that I send, blocking until at least one is sent.

The code does not guarantee that the io_service is running. io_service::run() will return when either:
All work has finished and there are no more handlers to be dispatched
The io_service has been stopped.
In this case, it is possible for the service_thread_ to be created and invoke io_service::run() before the serial_port::async_read_some() operation is initiated, adding work to the io_service. Thus, the service_thread_ could immediately return from io_service::run(). To resolve this, either:
Invoke io_service::run() after the asynchronous operation has been initiated.
Create an io_service::work object before starting the service_thread_. A work object prevents the io_service from running out of work.
This answer may provide some more insight into the behavior of io_service::run().
A few other things to note and to expand upon Igor's answer:
If a thread is not progressing in a meaningful way while waiting for an asynchronous operation to complete (i.e. spinning in a loop sleeping), then it may be worth examining if mixing synchronous behavior with asynchronous operations is the correct solution.
boost::bind() copies its arguments by value. To pass an argument by reference, wrap it with boost::ref() or boost::cref():
boost::bind(..., boost::ref(finished_reading), boost::ref(ec),
boost::ref(bytes_read));
Synchronization needs to be added to guarantee memory visibility of finished_reading in the main thread. For asynchronous operations, Boost.Asio will guarantee the appropriate memory barriers to ensure correct memory visibility (see this answer for more details). In this case, a memory barrier is required within the main thread to guarantee the main thread observes changes to finished_reading by other threads. Consider using either a Boost.Thread synchronization mechanism like boost::mutex, or Boost.Atomic's atomic objects or thread and signal fences.

Note that boost::bind copies its arguments. If you want to pass an argument by reference, wrap it with boost::ref (or std::ref):
boost::bind(readCallback, boost::asio::placeholders::error, boost::asio::placeholders::bytes_transferred, boost::ref(finished_reading), boost::ref(ec), boost::ref(bytes_read))
(However, strictly speaking, there's a race condition on the bool variable you pass to another thread. A better solution would be to use std::atomic_bool.)
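The copy-versus-reference behaviour can be seen in isolation with the standard library equivalents, std::bind and std::ref, which treat bound arguments the same way in this respect. This is a single-threaded sketch with hypothetical names (mark_finished and the two helper functions); it deliberately leaves out Asio and the threading concerns mentioned above.

```cpp
#include <cassert>
#include <functional>

// Same shape as the trailing reference parameters of readCallback.
void mark_finished(bool& finished) { finished = true; }

bool flag_after_bind_by_value() {
    bool finished = false;
    auto handler = std::bind(mark_finished, finished);  // stores a COPY
    handler();        // sets the copy held inside the bind object
    return finished;  // the caller's variable is untouched
}

bool flag_after_bind_by_ref() {
    bool finished = false;
    auto handler = std::bind(mark_finished, std::ref(finished));
    handler();        // sets the caller's variable through the reference
    return finished;
}

void demo() {
    assert(!flag_after_bind_by_value());
    assert(flag_after_bind_by_ref());
}
```

In the question's code the by-value case is what happens: the handler would update copies, so the while (!finished_reading) loop could never observe the change even once the handler runs.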

Multithreaded synchronization primitive

I have the following scenario:
I have multiple worker threads running that all go through a certain section of code, and they're allowed to do so simultaneously. No critical section surrounds this piece of code right now as it's not required for these threads.
I have a main thread that occasionally also wants to enter that section of code, but when it does, none of the worker threads should be using that section of code.
Naive solution: surround the section of code with a critical section. But that would kill a lot of parallelism between the worker threads, which is important in my case.
Is there a better solution?

Use RW locks. RW locks allow multiple readers and only a single writer. Your workers would call read-lock at the start of the critical section and the main thread would write-lock.
By definition, when calling read-lock, the calling thread will wait for any writing thread to finish. When calling write-lock, the calling thread will wait for any reading or writing threads to finish.
Example using POSIX threads:
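#include <pthread.h>
#include <unistd.h>

pthread_rwlock_t lock;

/* worker threads: may hold the read lock simultaneously */
void *do_work(void *args) {
    for (int i = 0; i < 100; ++i) {
        pthread_rwlock_rdlock(&lock);
        // do some work...
        pthread_rwlock_unlock(&lock);
        sleep(1);
    }
    return NULL;
}

/* main thread: takes the write lock, excluding all workers */
int main(void) {
    pthread_t workers[4];
    pthread_rwlock_init(&lock, NULL);
    int i;
    // spawn workers...
    for (i = 0; i < 4; ++i) {
        pthread_create(&workers[i], NULL, do_work, NULL);
    }
    for (i = 0; i < 100; ++i) {
        pthread_rwlock_wrlock(&lock);
        // do some work...
        pthread_rwlock_unlock(&lock);
        sleep(1);
    }
    // wait for the workers before destroying the lock
    for (i = 0; i < 4; ++i) {
        pthread_join(workers[i], NULL);
    }
    pthread_rwlock_destroy(&lock);
    return 0;
}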

As far as I understand it, your worker threads are started asynchronously. So when the main thread wants to run this code section, you have to ensure that no worker thread is executing it. Therefore you have to stop all worker threads before the main thread can enter that code section, and allow them to enter it again afterwards.
This could be done - using Grand Central Dispatch - if your worker threads would be assigned to a dispatch group, see https://developer.apple.com/library/mac/#documentation/Performance/Reference/GCD_libdispatch_Ref/Reference/reference.html.
The main thread could then call dispatch_group_wait on this dispatch group, wait for all worker threads to leave this code section, execute it, and then requeue the worker threads.

pthread condition variables vs win32 events (linux vs windows-ce)

I am doing a performance evaluation between Windows CE and Linux on an arm imx27 board. The code has already been written for CE and measures the time it takes to do different kernel calls, such as using OS primitives (mutexes and semaphores), opening and closing files, and networking.
During my porting of this application to Linux (pthreads) I stumbled upon a problem which I cannot explain. Almost all tests showed a performance increase from 5 to 10 times but not my version of win32 events (SetEvent and WaitForSingleObject), CE actually "won" this test.
To emulate the behaviour I was using pthreads condition variables (I know that my implementation doesn't fully emulate the CE version but it's enough for the evaluation).
The test code uses two threads that "ping-pong" each other using events.
Windows code:
Thread 1: (the thread I measure)
HANDLE hEvt1, hEvt2;
hEvt1 = CreateEvent(NULL, FALSE, FALSE, TEXT("MyLocEvt1"));
hEvt2 = CreateEvent(NULL, FALSE, FALSE, TEXT("MyLocEvt2"));
ResetEvent(hEvt1);
ResetEvent(hEvt2);
for (i = 0; i < 10000; i++)
{
SetEvent (hEvt1);
WaitForSingleObject(hEvt2, INFINITE);
}
Thread 2: (just "responding")
while (1)
{
WaitForSingleObject(hEvt1, INFINITE);
SetEvent(hEvt2);
}
Linux code:
Thread 1: (the thread I measure)
struct event_flag *event1, *event2;
event1 = eventflag_create();
event2 = eventflag_create();
for (i = 0; i < 10000; i++)
{
eventflag_set(event1);
eventflag_wait(event2);
}
Thread 2: (just "responding")
while (1)
{
eventflag_wait(event1);
eventflag_set(event2);
}
My implementation of eventflag_*:
struct event_flag* eventflag_create()
{
struct event_flag* ev;
ev = (struct event_flag*) malloc(sizeof(struct event_flag));
pthread_mutex_init(&ev->mutex, NULL);
pthread_cond_init(&ev->condition, NULL);
ev->flag = 0;
return ev;
}
void eventflag_wait(struct event_flag* ev)
{
pthread_mutex_lock(&ev->mutex);
while (!ev->flag)
pthread_cond_wait(&ev->condition, &ev->mutex);
ev->flag = 0;
pthread_mutex_unlock(&ev->mutex);
}
void eventflag_set(struct event_flag* ev)
{
pthread_mutex_lock(&ev->mutex);
ev->flag = 1;
pthread_cond_signal(&ev->condition);
pthread_mutex_unlock(&ev->mutex);
}
And the struct:
struct event_flag
{
pthread_mutex_t mutex;
pthread_cond_t condition;
unsigned int flag;
};
Questions:
Why don't I see the performance boost here?
What can be done to improve performance (e.g. are there faster ways to implement CE's behaviour)?
I'm not used to coding with pthreads. Are there bugs in my implementation that might result in performance loss?
Are there any alternative libraries for this?

Note that you don't need to be holding the mutex when calling pthread_cond_signal(), so you might be able to increase the performance of your condition variable 'event' implementation by releasing the mutex before signaling the condition:
void eventflag_set(struct event_flag* ev)
{
pthread_mutex_lock(&ev->mutex);
ev->flag = 1;
pthread_mutex_unlock(&ev->mutex);
pthread_cond_signal(&ev->condition);
}
This might prevent the awakened thread from immediately blocking on the mutex.

This type of implementation only works if you can afford to miss an event. I just tested it and ran into many deadlocks. The main reason for this is that the condition variables only wake up a thread that is already waiting. Signals issued before are lost.
No counter is associated with a condition that allows a waiting thread to simply continue if the condition has already been signalled. Windows Events support this type of use.
I can think of no better solution than taking a semaphore (the POSIX version is very easy to use) that is initialized to zero, using sem_post() for set() and sem_wait() for wait(). You can surely think of a way to have the semaphore count to a maximum of 1 using sem_getvalue().
That said I have no idea whether the POSIX semaphores are just a neat interface to the Linux semaphores or what the performance penalties are.
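The semaphore suggestion could look roughly like the sketch below, written in C++ around the POSIX unnamed-semaphore calls. SemEvent and ping_pong are hypothetical names, it assumes a platform where sem_init is supported (Linux is; some other systems are not), and it does not attempt the cap at 1 mentioned above: in a strict ping-pong the count never exceeds 1 anyway. It makes no claim about performance relative to the condition-variable version.

```cpp
#include <cassert>
#include <cerrno>
#include <semaphore.h>
#include <thread>

// Hypothetical event wrapper: set() posts, wait() consumes one post.
// A set() that happens before wait() is remembered in the semaphore count.
class SemEvent {
public:
    SemEvent() {
        int rc = sem_init(&sem_, 0, 0);  // not shared across processes, count 0
        assert(rc == 0);
        (void)rc;
    }
    ~SemEvent() { sem_destroy(&sem_); }
    SemEvent(const SemEvent&) = delete;
    SemEvent& operator=(const SemEvent&) = delete;

    void set() { sem_post(&sem_); }
    void wait() {
        // sem_wait can be interrupted by a signal handler; retry in that case.
        while (sem_wait(&sem_) == -1 && errno == EINTR) {
        }
    }
private:
    sem_t sem_;
};

// Same ping-pong shape as the benchmark; returns how many times the
// responder thread answered.
int ping_pong(int rounds) {
    SemEvent ping, pong;
    int responses = 0;
    std::thread responder([&] {
        for (int i = 0; i < rounds; ++i) {
            ping.wait();
            ++responses;
            pong.set();
        }
    });
    for (int i = 0; i < rounds; ++i) {
        ping.set();
        pong.wait();
    }
    responder.join();
    return responses;
}
```

Unlike the benchmark's while (1) responder, the responder here runs a fixed number of rounds so that it can be joined cleanly.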
