Fork/ execlp in #pragma omp parallel for stops arbitrarily - openmp

I am programming a Kalman-Filter based realtime program in C using a large number of realizations. To generate the realization output I have to execute an external program (groundwater simulation software) about 100 times. I am therefore using OpenMP with fork and exceclp for parallelization of this block:
#pragma omp parallel for private(temp)
for(i=1;i<=n_real;i++){
/* Start ensemble model run */
int = fork();
int ffork = 0;
int status;
if (pid == 0) {
//Child process
log_info("Start Dadia for Ens %d on thread %d",i,omp_get_thread_num());
sprintf(temp,"ens_%d",i);
chdir(temp);
execlp("../SPRING/dadia","dadia","-return","-replace_kenn","73","ini.csv","CSV",NULL);
log_info("Could not execute function dadia - exit anyway");
ffork = 1;
_exit(1);
}
else{
//Parent process
wait(NULL);
if (ffork == 0){
log_info("DADIA run ens_%d successfully finished",i);
}
}
}
In general the code runs smoothly for small number of realizations (with 6 threads). However sometimes the code hangs in the last cycle of parallel iterations. The occurrence only happens if number iterations >> number threads. I tried scheduling the for loop with different options, but it didn't solve the problem. I know that fork is not the best solution to use with OpenMP. But I am wondring why it's sometimes hangs at arbitrary points.
Thank's a lot for any kind if feedback.
Different Ubuntu versions tried (including difffrent comiler versions)

This version finally seems to run stable. Thank's for your hints:
#pragma omp parallel for private(temp) schedule(dynamic)
for(i=1;i<=n_real;i++) {
/* Start ensemble model run */
/* Calling bash-commands inside OpenMP enviroment requires using fork().
Else the parallel process is terminated.
Note exec automatically ends the child process after termination.
If Sitra fails, the child process needs to be terminated manually */
pid_t pid = 0;
int status;
pid = fork();
if (pid == 0) {
log_info("Start SITRA for Ens %d on thread %d",i,omp_get_thread_num());
sprintf(temp,"ens_%d",i);
chdir(temp);
execl("../SPRING/sitra","sitra","-return",(char *) 0);
perror("In exec(): ");
}
if (pid > 0) {
pid = wait(&status);
log_info("End of sitra for ensemble %d",i);
if (WIFEXITED(status)) {
printf("Sitra ended with exit(%d).\n", WEXITSTATUS(status));
}
if (WIFSIGNALED(status)) {
printf("Sitra ended with kill -%d.\n", WTERMSIG(status));
}
}
if (pid < 0){
perror("In fork():");
}
}

Related

IO Completion ports: separate thread pool to process the dequeued packets?

NOTE: I have added the C++ tag to this because a) the code is C++ and b) people using C++ may well have used IO completion ports. So please don't shout.
I am playing with IO completion ports, and have eventually fully understood (and tested, to prove) - both with help from RbMm - the meaning of the NumberOfConcurrentThreads parameter within CreateIoCompletionPort().
I have the following small program which creates 10 threads all waiting on the completion port. I tell my completion port to only allow 4 threads to be runnable at once (I have four CPUs). I then enqueue 8 packets to the port. My thread function outputs a message if it dequeues a packet with an ID > 4; in order for this message to be output, I have to stop at least one of the four currently running threads, which happens when I enter '1' at the console.
Now this is all fairly simple code. I have one big concern however, and that is that if all of the threads that are processing a completion packet get bogged down, it will mean no more packets can be dequeued and processed. That is what I am simulating with my infinite loop - the fact that no more packets are dequeued until I enter '1' at the console highlights this potential problem!
Would a better solution not be to have my four threads dequeuing packets (or as many threads as CPUs), then when one is dequeued, farm the processing of that packet off to a worker thread from a separate pool, thereby removing the risk of all threads in the IOCP being bogged down thus no more packets being dequeued?
I ask this as all the examples of IO completion port code I have seen use a method similar to what I show below, not using a separate thread pool which I propose. This is what makes me think that I am missing something because I am outnumbered!
Note: this is a somewhat contrived example, because Windows will allow an additional packet to be dequeued if one of the runnable threads enters a wait state; I show this in my code with a commented out cout call:
The system also allows a thread waiting in GetQueuedCompletionStatus
to process a completion packet if another running thread associated
with the same I/O completion port enters a wait state for other
reasons, for example the SuspendThread function. When the thread in
the wait state begins running again, there may be a brief period when
the number of active threads exceeds the concurrency value. However,
the system quickly reduces this number by not allowing any new active
threads until the number of active threads falls below the concurrency
value.
But I won't be calling SuspendThread in my thread functions, and I don't know which functions other than cout will cause the thread to enter a wait state, thus I can't predict if one or more of my threads will ever get bogged down! Hence my idea of a thread pool; at least context switching would mean that other packets get a chance to be dequeued!
#define _CRT_SECURE_NO_WARNINGS
#include <windows.h>
#include <thread>
#include <vector>
#include <algorithm>
#include <atomic>
#include <ctime>
#include <iostream>
using namespace std;
int main()
{
HANDLE hCompletionPort1;
if ((hCompletionPort1 = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 4)) == NULL)
{
return -1;
}
vector<thread> vecAllThreads;
atomic_bool bStop(false);
// Fill our vector with 10 threads, each of which waits on our IOCP.
generate_n(back_inserter(vecAllThreads), 10, [hCompletionPort1, &bStop] {
thread t([hCompletionPort1, &bStop]()
{
// Thread body
while (true)
{
DWORD dwBytes = 0;
LPOVERLAPPED pOverlapped = 0;
ULONG_PTR uKey;
if (::GetQueuedCompletionStatus(hCompletionPort1, &dwBytes, &uKey, &pOverlapped, INFINITE) == 1)
{
if (dwBytes == 0 && uKey == 0 && pOverlapped == 0)
break; // Special completion packet; end processing.
//cout << uKey; // EVEN THIS WILL CAUSE A "wait" which causes MORE THAN 4 THREADS TO ENTER!
if (uKey >4)
cout << "Started processing packet ID > 4!" << endl;
while (!bStop)
; // INFINITE LOOP
}
}
});
return move(t);
}
);
// Queue 8 completion packets to our IOCP...only four will be processed until we set our bool
for (int i = 1; i <= 8; ++i)
{
PostQueuedCompletionStatus(hCompletionPort1, 0, i, new OVERLAPPED);
}
while (!bStop)
{
int nVal;
cout << "Enter 1 to cause current processing threads to end: ";
cin >> nVal;
bStop = (nVal == 1);
}
for (int i = 0; i < 10; ++i) // Tell all 10 threads to stop processing on the IOCP
{
PostQueuedCompletionStatus(hCompletionPort1, 0, 0, 0); // Special packet marking end of IOCP usage
}
for_each(begin(vecAllThreads), end(vecAllThreads), mem_fn(&thread::join));
return 0;
}
EDIT #1
What I mean by "separate thread pool" is something like the following:
class myThread {
public:
void SetTask(LPOVERLAPPED pO) { /* start processing pO*/ }
private:
thread m_thread; // Actual thread object
};
// The threads in this thread pool are not associated with the IOCP in any way whatsoever; they exist
// purely to be handed a completion packet which they then process!
class ThreadPool
{
public:
void Initialise() { /* create 100 worker threads and add them to some internal storage*/}
myThread& GetNextFreeThread() { /* return one of the 100 worker thread we created*/}
} g_threadPool;
The code that each of my four threads associated with the IOCP then change to
if (::GetQueuedCompletionStatus(hCompletionPort1, &dwBytes, &uKey, &pOverlapped, INFINITE) == 1)
{
if (dwBytes == 0 && uKey == 0 && pOverlapped == 0)
break; // Special completion packet; end processing.
// Pick a new thread from a pool of pre-created threads and assign it the packet to process
myThread& thr = g_threadPool.GetNextFreeThread();
thr.SetTask(pOverlapped);
// Now, this thread can immediately return to the IOCP; it doesn't matter if the
// packet we dequeued would take forever to process; that is happening in the
// separate thread thr *that will not intefere with packets being dequeued from IOCP!*
}
This way, there is no possible way that I can end up in the situation where no more packets are being dequeued!
It seems there is conflicting opinion on whether a separate thread pool should be used. Clearly, as the sample code I have posted shows, there is potential for packets to stop being dequeued from the IOCP if the processing of the packets does not enter a wait state; given, the infinite loop is perhaps unrealistic but it does demonstrate the point.

Following xv6/Linux forking and waitpid processes

int main()
{
int count = 0;
int pid;
if ( !(pid=fork()))
{
while (((count<2) && (pid=fork()) ) {
count++;
printf("%d",count)
}
if(count>0)
{
printf("%d", count);
}
}
if (pid) {
waitpid(pid,NULL,0);
count = count<<1;
printf("%d", count)
}
}
I'm having trouble reading this piece of code. I can't figure out how many system calls are made or even remotely trace them. This is leading me to not be able to understand how the output works. I am assuming the scheduler can do anything. What are all possible outputs and how many system calls are made?
Launch with strace -f ./myprogram to trace all system calls.
From looking at the code, about 3 additional processes will be created.

C++11 std::condition_variable - notify_one() not behaving as expected?

I don't see this program having any practical usage, but while experimenting with c++ 11 concurrency and conditional_variables I stumbled across something I don't fully understand.
At first I assumed that using notify_one() would allow the program below to work. However, in actuality the program just froze after printing one. When I switched over to using notify_all() the program did what I wanted it to do (print all natural numbers in order). I am sure this question has been asked in various forms already. But my specific question is where in the doc did I read wrong.
I assume notify_one() should work because of the following statement.
If any threads are waiting on *this, calling notify_one unblocks one of the waiting threads.
Looking below only one of the threads will be blocked at a given time, correct?
class natural_number_printer
{
public:
void run()
{
m_odd_thread = std::thread(
std::bind(&natural_number_printer::print_odd_natural_numbers, this));
m_even_thread = std::thread(
std::bind(&natural_number_printer::print_even_natural_numbers, this));
m_odd_thread.join();
m_even_thread.join();
}
private:
std::mutex m_mutex;
std::condition_variable m_condition;
std::thread m_even_thread;
std::thread m_odd_thread;
private:
void print_odd_natural_numbers()
{
for (unsigned int i = 1; i < 100; ++i) {
if (i % 2 == 1) {
std::cout << i << " ";
m_condition.notify_all();
} else {
std::unique_lock<std::mutex> lock(m_mutex);
m_condition.wait(lock);
}
}
}
void print_even_natural_numbers()
{
for (unsigned int i = 1; i < 100; ++i) {
if (i % 2 == 0) {
std::cout << i << " ";
m_condition.notify_all();
} else {
std::unique_lock<std::mutex> lock(m_mutex);
m_condition.wait(lock);
}
}
}
};
The provided code "works" correctly and gets stuck by design. The cause is described in the documentation
The effects of notify_one()/notify_all() and
wait()/wait_for()/wait_until() take place in a single total order, so
it's impossible for notify_one() to, for example, be delayed and
unblock a thread that started waiting just after the call to
notify_one() was made.
The step-by-step logic is
The print_odd_natural_numbers thread is started
The print_even_natural_numbers thread is started also.
The m_condition.notify_all(); line of print_even_natural_numbers is executed before than the print_odd_natural_numbers thread reaches the m_condition.wait(lock); line.
The m_condition.wait(lock); line of print_odd_natural_numbers is executed and the thread gets stuck.
The m_condition.wait(lock); line of print_even_natural_numbers is executed and the thread gets stuck also.

Prevent terminal prompt from printing on exec() call

SO,
There are many similar questions, however none that I have been able to use. My code snippet is as follows:
for(int j=0; j<N; j++) {
pid_t pid = fork();
if (pid == -1) {
exit(-1); //err
} else if (pid == 0) {//kid
stringstream ss;
ss<<j;
execlp("./sub","sub",ss.str().c_str(),NULL);
exit(0);
} else {
/* parent */
}
}
my executing code in sub(.cpp) is:
int main( int argc, char **argv )
{
cout<<argv[i]<<endl;
exit(0);
}
my output is as such:
[terminal prompt '$'] 4
2
3
etc.
Is there a way I could prevent the prompt from displaying on the exec call? and why is it ONLY displaying on the first exec call, and not on every one?
What you see is the normal prompt of your shell, because the parent process terminates very quickly. It is not the output of the exec call. The forked processes print their output after the parent process has terminated.
You can use waitpid() in the parent process to "wait" until all forked process have terminated.

application exits prematurely with OpenMp with the error code : Fatal User Error 1002: Not all work-sharing constructs executed by all threads

I added openMp code to some serial code in a simulator applicaton, when I run a program that uses this application the program exits unexpectedly with the output "The thread 'Win32 Thread' (0x1828) has exited with code 1 (0x1)", this happens in the parallel region where I added the OpenMp code,
here's a code sample:
#pragma omp parallel for private (curr_proc_info, current_writer, method_h) shared (exceptionOccured) schedule(dynamic, 1)
for (i = 0 ; i < method_process_num ; i++)
{
current_writer = 0;
// we need to add protection before we can dequeue a method from the methods queue,
#pragma omp critical(dequeueMethod)
method_h = pop_runnable_method(curr_proc_info, current_writer);
if(method_h !=0 && exceptionOccured == false){
try {
method_h->semantics();
}
catch( const sc_report& ex ) {
::std::cout << "\n" << ex.what() << ::std::endl;
m_error = true;
exceptionOccured = true; // we cannot jump outside the loop, so instead of return we use a flag and return somewhere else
}
}
}
The scheduling was static before I made it dynamic, after I added dynamic with a chunk size of 1 the application proceeded a little further before it exited, can this be an indication of what is happening inside the parallel region?
thanks
As I read it, and I'm more of a Fortran programmer than C/C++, your private variable curr_proc_info is not declared (or defined ?) before it first appears in the call to pop_runnable_method. But private variables are undefined on entry to the parallel region.
I also think your sharing of exception_occurred is a little fishy since it suggests that an exception on any thread should be noticed by any thread, not just the thread in which it is noticed. Of course, that may be your intent.
Cheers
Mark

Resources