Multithreaded synchronization primitive

I have the following scenario:
I have multiple worker threads running that all pass through a certain section of code, and they're allowed to do so simultaneously. No critical section surrounds this piece of code right now, as none is required for the worker threads.
I have a main thread that also, occasionally, wants to enter that section of code, but when it does, none of the other worker threads may be using it.
Naive solution: surround the section of code with a critical section. But that would kill a lot of parallelism between the worker threads, which is important in my case.
Is there a better solution?

Use RW locks. RW locks allow multiple readers but only a single writer. Your workers would acquire the read lock at the start of the section, and the main thread would acquire the write lock.
By definition, a thread acquiring the read lock waits for any writer to finish; a thread acquiring the write lock waits for all current readers and writers to finish.
Example using POSIX threads:
#include <pthread.h>
#include <unistd.h>

pthread_rwlock_t lock;

/* worker threads: many may hold the read lock at once */
void *do_work(void *args)
{
    for (int i = 0; i < 100; ++i) {
        pthread_rwlock_rdlock(&lock);
        // do some work...
        pthread_rwlock_unlock(&lock);
        sleep(1);
    }
    pthread_exit(0);
}

/* main thread: the write lock excludes all readers */
int main(void)
{
    pthread_t workers[4];
    pthread_rwlock_init(&lock, NULL);
    int i;

    // spawn workers...
    for (i = 0; i < 4; ++i) {
        pthread_create(&workers[i], NULL, do_work, NULL);
    }

    for (i = 0; i < 100; ++i) {
        pthread_rwlock_wrlock(&lock);
        // do some work...
        pthread_rwlock_unlock(&lock);
        sleep(1);
    }

    for (i = 0; i < 4; ++i) {
        pthread_join(workers[i], NULL);
    }
    pthread_rwlock_destroy(&lock);
    return 0;
}

As far as I understand it, your worker threads are started asynchronously. So when the main thread wants to run this code section, you have to ensure that no worker thread is executing it: you have to stop all worker threads before the main thread can enter that code section, and allow them to enter it again afterwards.
This could be done with Grand Central Dispatch if your worker threads were assigned to a dispatch group; see https://developer.apple.com/library/mac/#documentation/Performance/Reference/GCD_libdispatch_Ref/Reference/reference.html.
The main thread could then call dispatch_group_wait on this dispatch group, wait for all worker threads to leave this code section, execute it, and then requeue the worker threads.
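For illustration, here is a minimal sketch of that approach. It uses the plain-C dispatch_group_async_f variant (so no blocks extension is needed); the task and name choices are placeholders, not taken from the question:

#include <dispatch/dispatch.h>
#include <stdint.h>
#include <stdio.h>

static void worker_task(void *ctx)
{
    // the shared section of code, run by many workers at once
    printf("worker %d in code section\n", (int)(intptr_t)ctx);
}

int main(void)
{
    dispatch_group_t group = dispatch_group_create();
    dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

    // submit a batch of worker tasks as one group
    for (intptr_t i = 0; i < 8; ++i)
        dispatch_group_async_f(group, queue, (void *)i, worker_task);

    // main thread: block until every queued worker has left the section...
    dispatch_group_wait(group, DISPATCH_TIME_FOREVER);

    // ...now it has the section to itself; afterwards it would submit
    // a fresh batch of worker tasks to the group ("requeue" them).
    puts("main thread runs the section exclusively");

    dispatch_release(group);
    return 0;
}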

IO Completion ports: separate thread pool to process the dequeued packets?

NOTE: I have added the C++ tag to this because a) the code is C++ and b) people using C++ may well have used IO completion ports. So please don't shout.
I am playing with IO completion ports, and, with help from RbMm, have eventually fully understood (and tested, to prove it) the meaning of the NumberOfConcurrentThreads parameter of CreateIoCompletionPort().
I have the following small program which creates 10 threads all waiting on the completion port. I tell my completion port to only allow 4 threads to be runnable at once (I have four CPUs). I then enqueue 8 packets to the port. My thread function outputs a message if it dequeues a packet with an ID > 4; in order for this message to be output, I have to stop at least one of the four currently running threads, which happens when I enter '1' at the console.
Now this is all fairly simple code. I have one big concern, however: if all of the threads that are processing a completion packet get bogged down, no more packets can be dequeued and processed. That is what I am simulating with my infinite loop; the fact that no more packets are dequeued until I enter '1' at the console highlights this potential problem!
Would a better solution not be to have my four threads dequeue packets (or as many threads as CPUs), then, when one is dequeued, farm the processing of that packet off to a worker thread from a separate pool, thereby removing the risk of all IOCP threads being bogged down and no more packets being dequeued?
I ask this as all the examples of IO completion port code I have seen use a method similar to what I show below, not using a separate thread pool which I propose. This is what makes me think that I am missing something because I am outnumbered!
Note: this is a somewhat contrived example, because Windows will allow an additional packet to be dequeued if one of the runnable threads enters a wait state; I show this in my code with a commented-out cout call:
The system also allows a thread waiting in GetQueuedCompletionStatus to process a completion packet if another running thread associated with the same I/O completion port enters a wait state for other reasons, for example the SuspendThread function. When the thread in the wait state begins running again, there may be a brief period when the number of active threads exceeds the concurrency value. However, the system quickly reduces this number by not allowing any new active threads until the number of active threads falls below the concurrency value.
But I won't be calling SuspendThread in my thread functions, and I don't know which functions other than cout will cause the thread to enter a wait state, thus I can't predict if one or more of my threads will ever get bogged down! Hence my idea of a thread pool; at least context switching would mean that other packets get a chance to be dequeued!
#define _CRT_SECURE_NO_WARNINGS
#include <windows.h>
#include <thread>
#include <vector>
#include <algorithm>
#include <atomic>
#include <ctime>
#include <iostream>
using namespace std;
int main()
{
    HANDLE hCompletionPort1;
    if ((hCompletionPort1 = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 4)) == NULL)
    {
        return -1;
    }
    vector<thread> vecAllThreads;
    atomic_bool bStop(false);

    // Fill our vector with 10 threads, each of which waits on our IOCP.
    generate_n(back_inserter(vecAllThreads), 10, [hCompletionPort1, &bStop] {
        return thread([hCompletionPort1, &bStop]()
        {
            // Thread body
            while (true)
            {
                DWORD dwBytes = 0;
                LPOVERLAPPED pOverlapped = 0;
                ULONG_PTR uKey;
                if (::GetQueuedCompletionStatus(hCompletionPort1, &dwBytes, &uKey, &pOverlapped, INFINITE) == 1)
                {
                    if (dwBytes == 0 && uKey == 0 && pOverlapped == 0)
                        break; // Special completion packet; end processing.

                    //cout << uKey; // EVEN THIS WILL CAUSE A "wait" which causes MORE THAN 4 THREADS TO ENTER!

                    if (uKey > 4)
                        cout << "Started processing packet ID > 4!" << endl;
                    while (!bStop)
                        ; // INFINITE LOOP
                }
            }
        });
    });

    // Queue 8 completion packets to our IOCP...only four will be processed until we set our bool
    for (int i = 1; i <= 8; ++i)
    {
        PostQueuedCompletionStatus(hCompletionPort1, 0, i, new OVERLAPPED);
    }

    while (!bStop)
    {
        int nVal;
        cout << "Enter 1 to cause current processing threads to end: ";
        cin >> nVal;
        bStop = (nVal == 1);
    }

    for (int i = 0; i < 10; ++i) // Tell all 10 threads to stop processing on the IOCP
    {
        PostQueuedCompletionStatus(hCompletionPort1, 0, 0, 0); // Special packet marking end of IOCP usage
    }

    for_each(begin(vecAllThreads), end(vecAllThreads), mem_fn(&thread::join));
    return 0;
}
EDIT #1
What I mean by "separate thread pool" is something like the following:
class myThread {
public:
    void SetTask(LPOVERLAPPED pO) { /* start processing pO */ }
private:
    thread m_thread; // Actual thread object
};

// The threads in this thread pool are not associated with the IOCP in any way whatsoever; they exist
// purely to be handed a completion packet, which they then process!
class ThreadPool
{
public:
    void Initialise() { /* create 100 worker threads and add them to some internal storage */ }
    myThread& GetNextFreeThread() { /* return one of the 100 worker threads we created */ }
} g_threadPool;
The code for each of my four threads associated with the IOCP then changes to:
if (::GetQueuedCompletionStatus(hCompletionPort1, &dwBytes, &uKey, &pOverlapped, INFINITE) == 1)
{
    if (dwBytes == 0 && uKey == 0 && pOverlapped == 0)
        break; // Special completion packet; end processing.

    // Pick a new thread from a pool of pre-created threads and assign it the packet to process
    myThread& thr = g_threadPool.GetNextFreeThread();
    thr.SetTask(pOverlapped);

    // Now, this thread can immediately return to the IOCP; it doesn't matter if the
    // packet we dequeued would take forever to process; that is happening in the
    // separate thread thr *that will not interfere with packets being dequeued from the IOCP!*
}
This way, I can never end up in the situation where no more packets are being dequeued!
It seems there is conflicting opinion on whether a separate thread pool should be used. Clearly, as the sample code I have posted shows, packets can stop being dequeued from the IOCP if the processing of the packets does not enter a wait state; granted, the infinite loop is perhaps unrealistic, but it does demonstrate the point.
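For what it's worth, here is a minimal sketch of the handoff proposed in EDIT #1, with a hand-rolled queue (std::mutex plus std::condition_variable) standing in for the thread-pool internals; all names are illustrative:

#include <windows.h>
#include <condition_variable>
#include <deque>
#include <mutex>

// A minimal handoff queue: IOCP threads push dequeued packets, pool threads pop.
class PacketQueue
{
public:
    void Push(LPOVERLAPPED p)
    {
        { std::lock_guard<std::mutex> lk(m_); q_.push_back(p); }
        cv_.notify_one();
    }
    LPOVERLAPPED Pop() // blocks until a packet arrives
    {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        LPOVERLAPPED p = q_.front();
        q_.pop_front();
        return p;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<LPOVERLAPPED> q_;
};

// IOCP thread body: dequeue and immediately hand off, never blocking on the work itself.
void IocpThread(HANDLE hPort, PacketQueue& pool)
{
    while (true)
    {
        DWORD dwBytes = 0;
        ULONG_PTR uKey = 0;
        LPOVERLAPPED pOverlapped = 0;
        if (::GetQueuedCompletionStatus(hPort, &dwBytes, &uKey, &pOverlapped, INFINITE) == 1)
        {
            if (dwBytes == 0 && uKey == 0 && pOverlapped == 0)
                break;              // special completion packet; end processing
            pool.Push(pOverlapped); // hand off; back to GetQueuedCompletionStatus at once
        }
    }
}

Pool threads would simply loop on Pop() and process each packet; however long that takes, the IOCP threads stay free to keep dequeuing.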

Can you choose a thread from a thread pool to execute (boost)

Here is some code I have at the moment.
#include <boost/thread.hpp>
#include <cstdlib>
#include <iostream>
using namespace std;

boost::mutex io_mutex;     // guards the console (or file)
const int num_threads = 4;

void SendDataToFile()
{
    // The lock guard will make sure only one thread (client)
    // will access this resource at once
    boost::lock_guard<boost::mutex> lock(io_mutex);
    for (int i = 0; i < 5; i++)
        cout << "Writing " << boost::this_thread::get_id() << endl;
}

int main()
{
    boost::thread_group threads; // Thread pool
    // Here we create threads and kick them off by passing
    // the address of the function to call
    for (int i = 0; i < num_threads; i++)
        threads.create_thread(&SendDataToFile);
    threads.join_all();
    system("PAUSE");
}
At the moment I'm just using cout instead of writing to a file.
Is it possible to choose which thread carries out an operation before another? Say I have a file I want to write to and 4 threads that want to access it at the same time; can I say "OK, thread 2, you go first"? In Boost?
Also, can fstream be used like cout? When I wrote to a file the output was not messy (even without a mutex), but when I print to the console without a mutex it is messy, as you would expect.
There are a number of equivalent ways you could do this using some combination of global variables protected by atomic updates, a mutex, a semaphore, a condition variable, etc. The way that seems to me to most directly communicate what you're trying to do is to have your threads wait on a ticket lock where, instead of the ticket numbers representing the order in which threads arrived at the lock, they are chosen to be the order in which the threads were created. You could combine that idea with the Boost spinlock example for a simple and probably performant implementation.
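A minimal sketch of that ticket-lock idea, using std::thread and std::atomic rather than Boost, and assuming the creation order is the order you want (thread 0 writes first, then thread 1, and so on):

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<int> now_serving(0); // ticket currently allowed to run

void SendDataToFile(int ticket)
{
    while (now_serving.load(std::memory_order_acquire) != ticket)
        std::this_thread::yield(); // spin until it is our turn
    for (int i = 0; i < 5; i++)
        std::cout << "Writing " << ticket << std::endl;
    now_serving.store(ticket + 1, std::memory_order_release); // pass the baton
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; i++)
        threads.emplace_back(SendDataToFile, i); // the ticket is the creation index
    for (auto& t : threads)
        t.join();
}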

pthread condition variables vs win32 events (linux vs windows-ce)

I am doing a performance evaluation between Windows CE and Linux on an ARM i.MX27 board. The code was originally written for CE and measures the time it takes to perform different kernel calls: using OS primitives such as mutexes and semaphores, opening and closing files, and networking.
During my porting of this application to Linux (pthreads) I stumbled upon a problem I cannot explain. Almost all tests showed a performance increase of 5 to 10 times, but not my version of Win32 events (SetEvent and WaitForSingleObject); CE actually "won" this test.
To emulate the behaviour I used pthreads condition variables (I know that my implementation doesn't fully emulate the CE version, but it's enough for the evaluation).
The test code uses two threads that "ping-pong" each other using events.
Windows code:
Thread 1: (the thread I measure)
HANDLE hEvt1, hEvt2;
hEvt1 = CreateEvent(NULL, FALSE, FALSE, TEXT("MyLocEvt1"));
hEvt2 = CreateEvent(NULL, FALSE, FALSE, TEXT("MyLocEvt2"));
ResetEvent(hEvt1);
ResetEvent(hEvt2);
for (i = 0; i < 10000; i++)
{
    SetEvent(hEvt1);
    WaitForSingleObject(hEvt2, INFINITE);
}
Thread 2: (just "responding")
while (1)
{
    WaitForSingleObject(hEvt1, INFINITE);
    SetEvent(hEvt2);
}
Linux code:
Thread 1: (the thread I measure)
struct event_flag *event1, *event2;
event1 = eventflag_create();
event2 = eventflag_create();
for (i = 0; i < 10000; i++)
{
    eventflag_set(event1);
    eventflag_wait(event2);
}
Thread 2: (just "responding")
while (1)
{
    eventflag_wait(event1);
    eventflag_set(event2);
}
My implementation of eventflag_*:
struct event_flag* eventflag_create()
{
    struct event_flag* ev;

    ev = (struct event_flag*) malloc(sizeof(struct event_flag));
    pthread_mutex_init(&ev->mutex, NULL);
    pthread_cond_init(&ev->condition, NULL);
    ev->flag = 0;

    return ev;
}

void eventflag_wait(struct event_flag* ev)
{
    pthread_mutex_lock(&ev->mutex);

    while (!ev->flag)
        pthread_cond_wait(&ev->condition, &ev->mutex);

    ev->flag = 0;
    pthread_mutex_unlock(&ev->mutex);
}

void eventflag_set(struct event_flag* ev)
{
    pthread_mutex_lock(&ev->mutex);

    ev->flag = 1;
    pthread_cond_signal(&ev->condition);

    pthread_mutex_unlock(&ev->mutex);
}
And the struct:
struct event_flag
{
    pthread_mutex_t mutex;
    pthread_cond_t condition;
    unsigned int flag;
};
Questions:
Why don't I see the performance boost here?
What can be done to improve performance (e.g. are there faster ways to implement CE's behaviour)?
I'm not used to coding with pthreads; are there bugs in my implementation that might result in a performance loss?
Are there any alternative libraries for this?
Note that you don't need to be holding the mutex when calling pthread_cond_signal(), so you might be able to increase the performance of your condition variable 'event' implementation by releasing the mutex before signaling the condition:
void eventflag_set(struct event_flag* ev)
{
    pthread_mutex_lock(&ev->mutex);
    ev->flag = 1;
    pthread_mutex_unlock(&ev->mutex);
    pthread_cond_signal(&ev->condition);
}
This might prevent the awakened thread from immediately blocking on the mutex.
This type of implementation only works if you can afford to miss an event. I just tested it and ran into many deadlocks. The main reason is that condition variables only wake up threads that are already waiting; signals issued before that are lost.
No counter is associated with a condition variable that would allow a waiting thread to simply continue if the condition has already been signalled. Windows events do support this type of use.
I can think of no better solution than using a semaphore (the POSIX version is very easy to use) initialized to zero, with sem_post() for set() and sem_wait() for wait(). You can surely think of a way to cap the semaphore count at 1 using sem_getvalue().
That said, I have no idea whether POSIX semaphores are just a neat interface to the Linux semaphores, or what the performance penalties are.
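A minimal sketch of that semaphore-based event (the cap-at-1 check via sem_getvalue() is not atomic with the post, so treat it as an approximation of an auto-reset event):

#include <semaphore.h>

struct event_flag
{
    sem_t sem;
};

void eventflag_init(struct event_flag* ev)
{
    sem_init(&ev->sem, 0, 0); // counter starts at 0: "not signalled"
}

void eventflag_wait(struct event_flag* ev)
{
    sem_wait(&ev->sem); // a set() issued before this wait is not lost
}

void eventflag_set(struct event_flag* ev)
{
    int value = 0;
    sem_getvalue(&ev->sem, &value);
    if (value < 1)          // cap the count at 1, like an auto-reset event
        sem_post(&ev->sem); // note: check-then-post is not atomic
}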

inter-process condition variables in Windows

I know that I can use a condition variable to synchronize work between threads, but is there a similar class for synchronizing work between processes? Thanks in advance.
Use a pair of named Semaphore objects, one to signal and one as a lock. Named sync objects on Windows are automatically inter-process, which takes care of that part of the job for you.
A class like this would do the trick.
#include <windows.h>
#include <limits>
#include <stdexcept>
#include <string>

class InterprocessCondVar {
private:
    HANDLE mSem;   // Used to signal waiters
    HANDLE mLock;  // Semaphore used as inter-process lock
    int mWaiters;  // # current waiters
public:
    InterprocessCondVar(std::string name)
        : mSem(NULL), mLock(NULL), mWaiters(0)
    {
        // NOTE: You'll need a real "security attributes" pointer
        // for child processes to see the semaphore!
        // "CreateSemaphoreA" will do nothing but give you the handle if
        // the semaphore already exists.
        // (max() is parenthesized to dodge the windows.h max() macro.)
        mSem = CreateSemaphoreA(NULL, 0, (std::numeric_limits<LONG>::max)(), name.c_str());
        std::string lockName = name + "_Lock";
        // Initial count 1 so the lock starts out free.
        mLock = CreateSemaphoreA(NULL, 1, 1, lockName.c_str());
        if (!mSem || !mLock) {
            throw std::runtime_error("Semaphore create failed");
        }
    }
    virtual ~InterprocessCondVar() {
        CloseHandle(mSem);
        CloseHandle(mLock);
    }
    bool Signal();
    bool Broadcast();
    bool Wait(unsigned int waitTimeMs = INFINITE);
};
A genuine condition variable offers 3 calls:
1) "Signal()": Wake up ONE waiting thread
bool InterprocessCondVar::Signal() {
    WaitForSingleObject(mLock, INFINITE);               // Lock
    bool result = true;
    if (mWaiters > 0) {                                 // Only signal when someone is waiting
        mWaiters--;                                     // Lower wait count
        result = ReleaseSemaphore(mSem, 1, NULL) != 0;  // Signal 1 waiter
    }
    ReleaseSemaphore(mLock, 1, NULL);                   // Unlock
    return result;
}
2) "Broadcast()": Wake up ALL threads
bool InterprocessCondVar::Broadcast() {
    WaitForSingleObject(mLock, INFINITE);                      // Lock
    bool result = true;
    if (mWaiters > 0) {
        result = ReleaseSemaphore(mSem, mWaiters, NULL) != 0;  // Signal all waiters
        mWaiters = 0;                                          // All waiters clear
    }
    ReleaseSemaphore(mLock, 1, NULL);                          // Unlock
    return result;
}
3) "Wait()": Wait for the signal
bool InterprocessCondVar::Wait(unsigned int waitTimeMs) {
    WaitForSingleObject(mLock, INFINITE);  // Lock
    mWaiters++;                            // Add to wait count
    ReleaseSemaphore(mLock, 1, NULL);      // Unlock

    // This must be outside the lock
    return (WaitForSingleObject(mSem, waitTimeMs) == WAIT_OBJECT_0);
}
This should ensure that Broadcast() ONLY wakes up threads & processes that are already waiting, not all future ones too. This is also a VERY heavyweight object. For CondVars that don't need to exist across processes I would create a different class w/ the same API, and use unnamed objects.
You could use a named semaphore or a named mutex. You could also exchange data between processes through shared memory.
For a project I'm working on I needed a condition variable and mutex implementation that can handle dead processes and won't cause other processes to end up in a deadlock in such a case. I implemented the mutex with the native named mutexes provided by the Win32 API, because they can indicate that a dead process owns the lock by returning WAIT_ABANDONED. The next issue was that I also needed a condition variable I could use across processes together with these mutexes. I started off with the suggestion from user3726672 but soon discovered that there are several issues in which the state of the counter variable and the state of the semaphore end up being invalid.
After doing some research, I found a paper by Microsoft Research which explains exactly this scenario: Implementing Condition Variables with Semaphores. It uses a separate semaphore for every single thread to solve the mentioned issues.
My final implementation uses a portion of shared memory in which I store a ring buffer of thread IDs (the IDs of the waiting threads). The processes then create their own handle for every named semaphore/thread ID they have not encountered yet, and cache it. The signal/broadcast/wait functions are then quite straightforward and follow the idea of the solution proposed in the paper. Just remember to remove your thread ID from the ring buffer if your wait operation fails or times out.
For the Win32 implementation I recommend reading the following documents:
Semaphore Objects and Using Mutex Objects, as those describe the functions you'll need for the implementation.
Alternatives: boost::interprocess has some robust mutex emulation support, but it is based on spin locks and caused a very high CPU load on our embedded system, which was the final reason we looked into our own implementation.
@user3726672: Could you update your post to point to this post or to the referenced paper?
Best Regards,
Michael
Update:
I also had a look at an implementation for Linux/POSIX. It turns out pthreads already provides everything you'll need. Just put a pthread_cond_t and a pthread_mutex_t in some shared memory to share them with the other process, and initialize both with PTHREAD_PROCESS_SHARED. Also set PTHREAD_MUTEX_ROBUST on the mutex.
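A minimal sketch of that setup (error handling omitted; the shared-memory name "/my_condvar_shm" is illustrative):

#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

struct shared_sync
{
    pthread_mutex_t mutex;
    pthread_cond_t condition;
};

struct shared_sync* create_shared_sync(void)
{
    // Place the mutex and condition variable in named shared memory.
    int fd = shm_open("/my_condvar_shm", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(struct shared_sync));
    struct shared_sync* s = (struct shared_sync*) mmap(
        NULL, sizeof(struct shared_sync), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    pthread_mutexattr_t ma;
    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&ma, PTHREAD_MUTEX_ROBUST); // survive dead owners
    pthread_mutex_init(&s->mutex, &ma);

    pthread_condattr_t ca;
    pthread_condattr_init(&ca);
    pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
    pthread_cond_init(&s->condition, &ca);

    return s; // the other process calls shm_open + mmap with the same name
}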
Yes. You can use a (named) Mutex for that. Use CreateMutex to create one. You then wait for it (with functions like WaitForSingleObject), and release it when you're done with ReleaseMutex.
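A minimal sketch (the mutex name is arbitrary; both processes just have to use the same one):

#include <windows.h>

int main(void)
{
    // Named kernel objects are system-wide: a second process calling
    // CreateMutexA with the same name gets a handle to the same mutex.
    HANDLE hMutex = CreateMutexA(NULL, FALSE, "MyApp_SharedMutex");
    if (hMutex == NULL)
        return 1;

    WaitForSingleObject(hMutex, INFINITE); // acquire, across processes
    // ... touch the shared resource ...
    ReleaseMutex(hMutex);

    CloseHandle(hMutex);
    return 0;
}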
For reference, Boost.Interprocess (documentation for version 1.59) has condition variables and much more. Please note, however, that as of this writing, "Win32 synchronization is too basic".
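As a rough sketch of what that looks like (names are illustrative; a second process would run the same open_or_create/find_or_construct calls, then set the flag and notify under the same mutex):

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/sync/interprocess_condition.hpp>
#include <boost/interprocess/sync/interprocess_mutex.hpp>
#include <boost/interprocess/sync/scoped_lock.hpp>

using namespace boost::interprocess;

// One sync block that both processes locate by name in shared memory.
struct SharedSync
{
    interprocess_mutex mutex;
    interprocess_condition condition;
    bool ready = false;
};

int main()
{
    managed_shared_memory shm(open_or_create, "MySharedMemory", 4096);
    SharedSync* s = shm.find_or_construct<SharedSync>("SyncBlock")();

    scoped_lock<interprocess_mutex> lock(s->mutex);
    while (!s->ready)
        s->condition.wait(lock); // waits across process boundaries

    return 0;
}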

application exits prematurely with OpenMP, with the error: Fatal User Error 1002: Not all work-sharing constructs executed by all threads

I added OpenMP code to some serial code in a simulator application. When I run a program that uses this application, it exits unexpectedly with the output "The thread 'Win32 Thread' (0x1828) has exited with code 1 (0x1)". This happens in the parallel region where I added the OpenMP code.
Here's a code sample:
#pragma omp parallel for private(curr_proc_info, current_writer, method_h) shared(exceptionOccured) schedule(dynamic, 1)
for (i = 0; i < method_process_num; i++)
{
    current_writer = 0;

    // we need to add protection before we can dequeue a method from the methods queue
    #pragma omp critical(dequeueMethod)
    method_h = pop_runnable_method(curr_proc_info, current_writer);

    if (method_h != 0 && exceptionOccured == false) {
        try {
            method_h->semantics();
        }
        catch (const sc_report& ex) {
            ::std::cout << "\n" << ex.what() << ::std::endl;
            m_error = true;
            // we cannot jump outside the loop, so instead of returning we set a flag and return somewhere else
            exceptionOccured = true;
        }
    }
}
The scheduling was static before I made it dynamic. After I switched to dynamic with a chunk size of 1, the application proceeded a little further before it exited; can this be an indication of what is happening inside the parallel region?
thanks
As I read it (and I'm more of a Fortran programmer than a C/C++ one), your private variable curr_proc_info is not assigned before it first appears in the call to pop_runnable_method, but private variables are undefined on entry to the parallel region.
I also think your sharing of exceptionOccured is a little fishy, since it suggests that an exception on any thread should be noticed by every thread, not just the thread in which it occurred. Of course, that may be your intent.
Cheers
Mark
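To illustrate Mark's point about private variables, a small hypothetical example:

void example(int n)
{
    int curr = 42; // initialized before the parallel region

    // private: every thread gets its own UNINITIALIZED copy of curr;
    // reading it before assigning to it inside the loop is undefined.
    #pragma omp parallel for private(curr)
    for (int i = 0; i < n; i++)
        curr = i; // must assign before any use

    // firstprivate: every thread's copy starts as a copy of the value
    // the variable had before the region (42 here).
    #pragma omp parallel for firstprivate(curr)
    for (int i = 0; i < n; i++)
    {
        int x = curr + i; // safe: curr was copied in
        (void)x;
    }
}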
