pthread condition variables vs win32 events (linux vs windows-ce) - windows

I am comparing the performance of Windows CE and Linux on an ARM i.MX27 board. The code has already been written for CE and measures the time taken by various kernel calls: OS primitives such as mutexes and semaphores, opening and closing files, and networking.
While porting this application to Linux (pthreads) I stumbled upon a problem which I cannot explain. Almost all tests ran 5 to 10 times faster on Linux, but not my version of win32 events (SetEvent and WaitForSingleObject); CE actually "won" this test.
To emulate the behaviour I was using pthreads condition variables (I know that my implementation doesn't fully emulate the CE version but it's enough for the evaluation).
The test code uses two threads that "ping-pong" each other using events.
Windows code:
Thread 1: (the thread I measure)
HANDLE hEvt1, hEvt2;
hEvt1 = CreateEvent(NULL, FALSE, FALSE, TEXT("MyLocEvt1"));
hEvt2 = CreateEvent(NULL, FALSE, FALSE, TEXT("MyLocEvt2"));
ResetEvent(hEvt1);
ResetEvent(hEvt2);
for (i = 0; i < 10000; i++)
{
SetEvent (hEvt1);
WaitForSingleObject(hEvt2, INFINITE);
}
Thread 2: (just "responding")
while (1)
{
WaitForSingleObject(hEvt1, INFINITE);
SetEvent(hEvt2);
}
Linux code:
Thread 1: (the thread I measure)
struct event_flag *event1, *event2;
event1 = eventflag_create();
event2 = eventflag_create();
for (i = 0; i < 10000; i++)
{
eventflag_set(event1);
eventflag_wait(event2);
}
Thread 2: (just "responding")
while (1)
{
eventflag_wait(event1);
eventflag_set(event2);
}
My implementation of eventflag_*:
struct event_flag* eventflag_create()
{
struct event_flag* ev;
ev = (struct event_flag*) malloc(sizeof(struct event_flag));
pthread_mutex_init(&ev->mutex, NULL);
pthread_cond_init(&ev->condition, NULL);
ev->flag = 0;
return ev;
}
void eventflag_wait(struct event_flag* ev)
{
pthread_mutex_lock(&ev->mutex);
while (!ev->flag)
pthread_cond_wait(&ev->condition, &ev->mutex);
ev->flag = 0;
pthread_mutex_unlock(&ev->mutex);
}
void eventflag_set(struct event_flag* ev)
{
pthread_mutex_lock(&ev->mutex);
ev->flag = 1;
pthread_cond_signal(&ev->condition);
pthread_mutex_unlock(&ev->mutex);
}
And the struct:
struct event_flag
{
pthread_mutex_t mutex;
pthread_cond_t condition;
unsigned int flag;
};
Questions:
Why don't I see the performance boost here?
What can be done to improve performance (e.g. are there faster ways to implement CE's behaviour)?
I'm not used to coding pthreads, are there bugs in my implementation maybe resulting in performance loss?
Are there any alternative libraries for this?

Note that you don't need to be holding the mutex when calling pthread_cond_signal(), so you might be able to increase the performance of your condition variable 'event' implementation by releasing the mutex before signaling the condition:
void eventflag_set(struct event_flag* ev)
{
pthread_mutex_lock(&ev->mutex);
ev->flag = 1;
pthread_mutex_unlock(&ev->mutex);
pthread_cond_signal(&ev->condition);
}
This might prevent the awakened thread from immediately blocking on the mutex.

This type of implementation only works if you can afford to miss an event. I just tested it and ran into many deadlocks. The main reason for this is that the condition variables only wake up a thread that is already waiting. Signals issued before are lost.
No counter is associated with a condition that allows a waiting thread to simply continue if the condition has already been signalled. Windows Events support this type of use.
I can think of no better solution than taking a semaphore (the POSIX version is very easy to use) that is initialized to zero, using sem_post() for set() and sem_wait() for wait(). You can surely think of a way to have the semaphore count to a maximum of 1 using sem_getvalue()
That said I have no idea whether the POSIX semaphores are just a neat interface to the Linux semaphores or what the performance penalties are.
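The semaphore idea above could be sketched roughly as follows. This is a minimal illustration, assuming Linux and unnamed POSIX semaphores; the names sem_event, sem_event_init, sem_event_set, sem_event_wait and sem_event_trywait are hypothetical and not from the original posts. The sem_getvalue() check and the sem_post() are two separate calls, so several concurrent setters could still push the count above 1; with a single setter per event, as in the ping-pong test, the count stays at 1 at most.

```c
#include <errno.h>
#include <semaphore.h>

/* Hypothetical auto-reset event built on an unnamed POSIX semaphore. */
struct sem_event {
    sem_t sem;
};

/* Start in the non-signalled state (count 0). Returns 0 on success. */
int sem_event_init(struct sem_event *ev)
{
    return sem_init(&ev->sem, 0, 0);
}

/* Signal the event, keeping the count at 1 at most.
 * Assumption: one setter per event; the check and the post are not atomic. */
void sem_event_set(struct sem_event *ev)
{
    int value = 0;
    if (sem_getvalue(&ev->sem, &value) == 0 && value < 1)
        sem_post(&ev->sem);
}

/* Block until signalled; retry if a signal handler interrupts the wait. */
void sem_event_wait(struct sem_event *ev)
{
    while (sem_wait(&ev->sem) == -1 && errno == EINTR)
        ;
}

/* Non-blocking variant: returns 1 if the event was consumed, 0 otherwise. */
int sem_event_trywait(struct sem_event *ev)
{
    return sem_trywait(&ev->sem) == 0;
}
```

Unlike a bare condition variable, a set that happens before the wait is remembered in the semaphore count, so the waiter returns immediately instead of blocking.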

Related

Can you choose a thread from a thread pool to execute (boost)

Here is some code I have at the moment.
int main()
{
boost::thread_group threads; // Thread Pool
// Here we create threads and kick them off by passing
// the address of the function to call
for (int i = 0; i < num_threads; i++)
threads.create_thread(&SendDataToFile);
threads.join_all();
system("PAUSE");
}
void SendDataToFile()
{
// The lock guard will make sure only one thread (client)
// will access this application at once
boost::lock_guard<boost::mutex> lock(io_mutex);
for (int i = 0; i < 5; i++)
cout << "Writing" << boost::this_thread::get_id() << endl;
}
At the moment I'm just using cout instead of writing to a file.
Is it possible to choose which thread carries out an operation before another thread? Say I have a file I want to write to and 4 threads want to access it at the same time: can I say, using Boost, "OK thread 2, you go first"?
Can an fstream be used like cout? When I wrote to a file without a mutex the output was not garbled, but when I print to the console without a mutex it is garbled, as you would expect.
There are a number of equivalent ways you could do this using some combination of global variables protected by atomic updates, a mutex, semaphore, condition variable, etc. The way that seems to me to most directly communicate what you're trying to do is to have your threads wait on a ticket lock where instead of their ticket number representing the order that they arrived at the lock, it's chosen to be the order in which the threads were created. You could combine that idea with the Boost spinlock example for a simple and probably performant implementation.
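As a rough illustration of the ticket idea, here is a minimal sketch in C with pthreads rather than Boost, using a mutex and a condition variable instead of a spinlock. All names (turnstile, turnstile_enter, turnstile_leave, run_workers) are hypothetical. Each thread gets a ticket equal to its creation index and may only proceed when now_serving reaches that ticket.

```c
#include <pthread.h>

/* Hypothetical "turnstile": threads pass in the order of their ticket. */
struct turnstile {
    pthread_mutex_t mutex;
    pthread_cond_t changed;
    unsigned now_serving;            /* ticket currently allowed to proceed */
};

void turnstile_init(struct turnstile *t)
{
    pthread_mutex_init(&t->mutex, NULL);
    pthread_cond_init(&t->changed, NULL);
    t->now_serving = 0;
}

/* Block until it is this ticket's turn; returns with the mutex held. */
void turnstile_enter(struct turnstile *t, unsigned ticket)
{
    pthread_mutex_lock(&t->mutex);
    while (t->now_serving != ticket)
        pthread_cond_wait(&t->changed, &t->mutex);
}

/* Hand the turn to the next ticket and wake all waiters to re-check. */
void turnstile_leave(struct turnstile *t)
{
    t->now_serving++;
    pthread_cond_broadcast(&t->changed);
    pthread_mutex_unlock(&t->mutex);
}

#define NUM_WORKERS 4

static struct turnstile gate;
static unsigned tickets[NUM_WORKERS];
static unsigned order[NUM_WORKERS];  /* order[i] = ticket that ran i-th */
static unsigned order_len;

static void *worker(void *arg)
{
    unsigned ticket = *(unsigned *)arg;
    turnstile_enter(&gate, ticket);
    order[order_len++] = ticket;     /* stands in for the file write */
    turnstile_leave(&gate);
    return NULL;
}

/* Create the workers; the ticket is the creation index. Returns 0 on success. */
int run_workers(void)
{
    pthread_t threads[NUM_WORKERS];
    turnstile_init(&gate);
    order_len = 0;
    for (unsigned i = 0; i < NUM_WORKERS; i++) {
        tickets[i] = i;
        if (pthread_create(&threads[i], NULL, worker, &tickets[i]) != 0)
            return -1;
    }
    for (unsigned i = 0; i < NUM_WORKERS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```

Whatever order the scheduler runs the threads in, the protected step executes in ticket order. The cost is that the threads are fully serialized around that step.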

What's the correct method for CoreAudio realtime thread to communicate with UI thread?

I need to pass data between CoreAudio's realtime thread and the UI thread (one way, RT->UI). I know I can't use any Cocoa/Objective C methods like performSelectorOnMainThread or NSNotification and I can't use anything that will allocate memory as this will potentially block the RT thread.
What is the correct method for communicating between threads? Can I use GCD message queues or is there a more basic system to use?
Edit:
Thinking about this a bit more, I suppose I could use a lock free ring buffer, which the RT thread puts a message into, and the UI thread checks for messages to pull out. Is this the best way and if so is there a system already to do this in CoreAudio or available elsewhere or do I need to code it up myself?
It turns out this was a lot simpler than I expected and the solution I came up with was just to use the Portaudio ring buffer. I needed to add pa_ringbuffer.[ch] and pa_memorybarrier.h to my project and then define a MessageData structure to store in the ring buffer.
typedef struct MessageData {
MessageType type;
union {
struct {
NSUInteger position;
} position;
} data;
} MessageData;
Then I allocated some space to store 32 messages and created the ring buffer.
_playbackData->RTToMainBuffer = malloc(sizeof(MessageData) * 32);
PaUtil_InitializeRingBuffer(&_playbackData->RTToMainRB, sizeof(MessageData),
32, _playbackData->RTToMainBuffer);
Finally I started an NSTimer for every 20ms to pull data from the ring buffer
while (PaUtil_GetRingBufferReadAvailable(&_playbackData->RTToMainRB)) {
MessageData *dataPtr1, *dataPtr2;
ring_buffer_size_t sizePtr1, sizePtr2;
// Should we read more than one at a time?
if (PaUtil_GetRingBufferReadRegions(&_playbackData->RTToMainRB, 1,
(void *)&dataPtr1, &sizePtr1,
(void *)&dataPtr2, &sizePtr2) != 1) {
continue;
}
// Parse message
switch (dataPtr1->type) {
case MessageTypeEOS:
break;
case MessageTypePosition:
break;
default:
break;
}
PaUtil_AdvanceRingBufferReadIndex(&_playbackData->RTToMainRB, 1);
}
Then in the realtime thread, pushing a message to the ringbuffer was simply
MessageData *dataPtr1, *dataPtr2;
ring_buffer_size_t sizePtr1, sizePtr2;
if (PaUtil_GetRingBufferWriteRegions(&data->RTToMainRB, 1,
(void *)&dataPtr1, &sizePtr1,
(void *)&dataPtr2, &sizePtr2)) {
dataPtr1->type = MessageTypePosition;
dataPtr1->data.position.position = currentPosition;
PaUtil_AdvanceRingBufferWriteIndex(&data->RTToMainRB, 1);
}
A ring buffer is a good solution. Use two if you need to communicate both ways, i.e. inbox/outbox message passing.
This is a good implementation for iOS/Mac if you don't want to use Portaudio.
https://github.com/michaeltyson/TPCircularBuffer

How to make a fast context switch from one process to another?

I need to run unsafe native code in a sandbox process and I need to reduce the bottleneck of the process switch. Both processes (controller and sandbox) share two auto-reset events and a coherent view of a mapped file (shared memory) that is used for communication.
To make this article smaller, I removed initializations from sample code, but the events are created by the controller, duplicated using DuplicateHandle, and then sent to sandbox process prior to work.
Controller source:
void inSandbox(HANDLE hNewRequest, HANDLE hAnswer, volatile int *shared) {
int before = *shared;
for (int i = 0; i < 100000; ++i) {
// Notify sandbox of a new request and wait for answer.
SignalObjectAndWait(hNewRequest, hAnswer, INFINITE, FALSE);
}
assert(*shared == before + 100000);
}
void inProcess(volatile int *shared) {
int before = *shared;
for (int i = 0; i < 100000; ++i) {
newRequest(shared);
}
assert(*shared == before + 100000);
}
void newRequest(volatile int *shared) {
// In this test, the request only increments an int.
(*shared)++;
}
Sandbox source:
void sandboxLoop(HANDLE hNewRequest, HANDLE hAnswer, volatile int *shared) {
// Wait for the first request from controller.
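// Keep the wait outside assert(): NDEBUG builds drop the whole expression.
DWORD rc = WaitForSingleObject(hNewRequest, INFINITE);
assert(rc == WAIT_OBJECT_0);
(void)rc;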
for(;;) {
// Perform request.
newRequest(shared);
// Notify controller and wait for next request.
SignalObjectAndWait(hAnswer, hNewRequest, INFINITE, FALSE);
}
}
void newRequest(volatile int *shared) {
// In this test, the request only increments an int.
(*shared)++;
}
Measurements:
inSandbox() - 550ms, ~350k context switches, 42% CPU (25% kernel, 17% user).
inProcess() - 20ms, ~2k context switches, 55% CPU (2% kernel, 53% user).
The machine is Windows 7 Pro, Core 2 Duo P9700 with 8gb of memory.
An interesting fact is that sandbox solution uses 42% of CPU vs 55% of in-process solution. Another noteworthy fact is that sandbox solution contains 350k context switches, which is much more than the 200k context switches that we can infer from source code.
I need to know if there's a way to reduce the overhead of transferring control to another process. I already tried to use pipes instead of events, and it was much worse. I also tried to use no event at all, by making the sandbox call SuspendThread(GetCurrentThread()) and making the controller call ResumeThread(hSandboxThread) on every request, but the performance was similar to using events.
If you have a solution that uses assembly (like performing a manual context switch) or Windows Driver Kit, please let me know as well. I don't mind having to install a driver to make this faster.
I heard that Google Native Client does something similar, but I only found this documentation. If you have more information, please let me know.
The first thing to try is raising the priority of the waiting thread. This should reduce the number of extraneous context switches.
Alternatively, since you're on a 2-core system, using spinlocks instead of events would make your code much much faster, at the cost of system performance and power consumption:
void inSandbox(volatile int *lock, volatile int *shared)
{
int i, before = *shared;
for (i = 0; i < 100000; ++i) {
*lock = 1;
while (*lock != 0) { }
}
assert(*shared == before + 100000);
}
void newRequest(volatile int *shared) {
// In this test, the request only increments an int.
(*shared)++;
}
void sandboxLoop(volatile int *lock, volatile int * shared)
{
for(;;) {
while (*lock != 1) { }
newRequest(shared);
*lock = 0;
}
}
In this scenario, you should probably set thread affinity masks and/or lower the priority of the spinning thread so that it doesn't compete with the busy thread for CPU time.
Ideally, you'd use a hybrid approach. When one side is going to be busy for a while, let the other side wait on an event so that other processes can get some CPU time. You could trigger the event a little ahead of time (using the spinlock to retain synchronization) so that the other thread will be ready when you are.

implementing a scheduler class in Windows

I want to implement a scheduler class which any object can use to schedule timeouts and cancel them if necessary. When a timeout expires, the timeout setter/owner is notified asynchronously at that time.
So, for this purpose, I have 2 fundamental classes WindowsTimeout and WindowsScheduler.
class WindowsTimeout
{
bool mCancelled;
int mTimerID; // Windows handle to identify the actual timer set.
ITimeoutReceiver* mSetter;
int cancel()
{
mCancelled = true;
if ( timeKillEvent(mTimerID) == SUCCESS) // Line under question # 1
{
delete this; // Timeout instance is self-destroyed.
return 0; // ok. OS Timer resource given back.
}
return 1; // fail. OS Timer resource not given back.
}
WindowsTimeout(ITimeoutReceiver* setter, int timerID)
{
mSetter = setter;
mTimerID = timerID;
}
};
class WindowsScheduler
{
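static void CALLBACK timerFunction(UINT uID, UINT uMsg, DWORD_PTR dwUser, DWORD_PTR dw1, DWORD_PTR dw2)
{
// uMsg is reserved; this assumes the WindowsTimeout pointer was passed
// to timeSetEvent as its dwUser argument.
WindowsTimeout* timeout = (WindowsTimeout*) dwUser;
if (timeout->mCancelled)
delete timeout;
else
timeout->mSetter->GEN(evTimeout(timeout));
}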
WindowsTimeout* schedule(ITimeoutReceiver* setter, TimeUnit t)
{
int timerID = timeSetEvent(...);
if (timerID == SUCCESS)
{
return WindowsTimeout(setter, timerID);
}
return 0;
}
};
My questions are:
Q.1. In which context is WindowsScheduler::timerFunction() called? It is simply a callback function, so I assume it runs in an OS context, right? If so, does the call pre-empt any other tasks that are already running? In other words, do callbacks have higher priority than any other user task?
Q.2. When a timeout setter wants to cancel its timeout, it calls WindowsTimeout::cancel().
However, there is always a possibility that the OS invokes the static timerFunction callback and pre-empts the cancel operation, for example just after the mCancelled = true statement. In that case, the timeout instance will be deleted by the callback function.
When the pre-empted cancel() function resumes after the callback has finished, it will try to access an attribute of the deleted instance (mTimerID), as you can see on the line marked "Line under question # 1" in the code.
How can I avoid such a case?
Please note that this question is an improved version of a previous one of mine:
Windows multimedia timer with callback argument
Q1 - I believe it gets called within a thread allocated by the timer API. I'm not sure, but I wouldn't be surprised if the thread ran at a very high priority. (In Windows, that doesn't necessarily mean it will completely preempt other threads, it just means it will get more cycles than other threads).
Q2 - I started to sketch out a solution for this, but then realized it was a bit harder than I thought. Personally, I would maintain a table that maps timerIDs to your WindowsTimeout object instances. The table could be a simple std::map instance that's guarded by a critical section. When the timer callback occurs, it enters the critical section and tries to obtain the WindowsTimeout instance pointer, flags the WindowsTimeout instance as having been executed, exits the critical section, and then actually executes the callback. If the table doesn't contain the WindowsTimeout instance, the caller has already removed it. Be very careful here.
One subtle bug in your own code above:
WindowsTimeout* schedule(ITimeoutReceiver* setter, TimeUnit t)
{
int timerID = timeSetEvent(...);
if (timerID == SUCCESS)
{
return WindowsTimeout(setter, timerID);
}
return 0;
}
};
In your schedule method, it's entirely possible that the callback scheduled by timeSetEvent will fire BEFORE you can create an instance of WindowsTimeout.

inter-process condition variables in Windows

I know that I can use a condition variable to synchronize work between threads, but is there a similar class to synchronize work between processes? Thanks in advance.
Use a pair of named Semaphore objects, one to signal and one as a lock. Named sync objects on Windows are automatically inter-process, which takes care of that part of the job for you.
A class like this would do the trick.
class InterprocessCondVar {
private:
HANDLE mSem; // Used to signal waiters
HANDLE mLock; // Semaphore used as inter-process lock
int mWaiters; // # current waiters
protected:
public:
InterprocessCondVar(std::string name)
: mSem(NULL), mLock(NULL), mWaiters(0)
{
// NOTE: You'll need a real "security attributes" pointer
// for child processes to see the semaphore!
// "CreateSemaphore" will do nothing but give you the handle if
// the semaphore already exists.
mSem = CreateSemaphore( NULL, 0, std::numeric_limits<LONG>::max(), name.c_str());
std::string lockName = name + "_Lock";
// Initial count 1 so the lock starts out unlocked.
mLock = CreateSemaphore( NULL, 1, 1, lockName.c_str());
if(!mSem || !mLock) {
throw std::runtime_error("Semaphore create failed");
}
}
virtual ~InterprocessCondVar() {
CloseHandle( mSem);
CloseHandle( mLock);
}
bool Signal();
bool Broadcast();
bool Wait(unsigned int waitTimeMs = INFINITE);
};
A genuine condition variable offers 3 calls:
1) "Signal()": Wake up ONE waiting thread
bool InterprocessCondVar::Signal() {
WaitForSingleObject( mLock, INFINITE); // Lock
bool result = true;
if (mWaiters > 0) { // Only signal if someone is waiting
mWaiters--; // Lower wait count
result = ReleaseSemaphore( mSem, 1, NULL) != 0; // Signal 1 waiter
}
ReleaseSemaphore( mLock, 1, NULL); // Unlock
return result;
}
2) "Broadcast()": Wake up ALL threads
bool InterprocessCondVar::Broadcast() {
WaitForSingleObject( mLock, INFINITE); // Lock
bool result = true;
if (mWaiters > 0) { // ReleaseSemaphore fails for a count of 0
result = ReleaseSemaphore( mSem, mWaiters, NULL) != 0; // Signal all
mWaiters = 0; // All waiters clear;
}
ReleaseSemaphore( mLock, 1, NULL); // Unlock
return result;
}
3) "Wait()": Wait for the signal
bool InterprocessCondVar::Wait(unsigned int waitTimeMs) {
WaitForSingleObject( mLock, INFINITE); // Lock
mWaiters++; // Add to wait count
ReleaseSemaphore( mLock, 1, NULL); // Unlock
// This must be outside the lock
return (WaitForSingleObject( mSem, waitTimeMs) == WAIT_OBJECT_0);
}
This should ensure that Broadcast() ONLY wakes up threads & processes that are already waiting, not all future ones too. This is also a VERY heavyweight object. For CondVars that don't need to exist across processes I would create a different class w/ the same API, and use unnamed objects.
You could use a named semaphore or a named mutex. You could also exchange data between processes through shared memory.
For a project I'm working on I needed a condition variable and mutex implementation which can handle dead processes and won't leave other processes deadlocked in such a case. I implemented the mutex with the native named mutexes provided by the Win32 API because they can indicate whether a dead process owns the lock by returning WAIT_ABANDONED. The next issue was that I also needed a condition variable I could use across processes together with these mutexes. I started off with the suggestion from user3726672 but soon discovered several cases in which the state of the counter variable and the state of the semaphore end up being invalid.
After doing some research, I found a paper by Microsoft Research which explains exactly this scenario: Implementing Condition Variables with Semaphores. It uses a separate semaphore for every single thread to solve the mentioned issues.
My final implementation uses a portion of shared memory in which I store a ring buffer of thread IDs (the IDs of the waiting threads). The processes then create their own handle for every named semaphore/thread ID which they have not encountered yet and cache it. The signal/broadcast/wait functions are then quite straightforward and follow the idea of the solution proposed in the paper. Just remember to remove your thread ID from the ring buffer if your wait operation fails or times out.
For the Win32 implementation I recommend reading the following documents:
Semaphore Objects and Using Mutex Objects as those describe the functions you'll need for the implementation.
Alternatives: boost::interprocess has some robust mutex emulation support but it is based on spin locks and caused a very high cpu load on our embedded system which was the final reason why we were looking into our own implementation.
@user3726672: Could you update your post to point to this post or to the referenced paper?
Best Regards,
Michael
Update:
I also had a look at an implementation for linux/posix. Turns out pthread already provides everything you'll need. Just put pthread_cond_t and pthread_mutex_t in some shared memory to share it with the other process and initialize both with PTHREAD_PROCESS_SHARED. Also set PTHREAD_MUTEX_ROBUST on the mutex.
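The setup described in this update could look roughly like the sketch below. It assumes Linux with glibc and uses an anonymous shared mapping, which is only visible to processes created by fork(); unrelated processes would need a named shared memory object instead. The names shared_sync, shared_sync_create and shared_sync_lock are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical layout for the shared region. */
struct shared_sync {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    int ready;                       /* predicate protected by mutex */
};

/* Map an anonymous shared region (inherited across fork()) and
 * initialise the primitives in it. Returns NULL on failure. */
struct shared_sync *shared_sync_create(void)
{
    struct shared_sync *s = mmap(NULL, sizeof *s, PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (s == MAP_FAILED)
        return NULL;

    pthread_mutexattr_t ma;
    pthread_condattr_t ca;
    int err = pthread_mutexattr_init(&ma);
    if (err == 0) {
        err = pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
        if (err == 0)
            err = pthread_mutexattr_setrobust(&ma, PTHREAD_MUTEX_ROBUST);
        if (err == 0)
            err = pthread_mutex_init(&s->mutex, &ma);
        pthread_mutexattr_destroy(&ma);
    }
    if (err == 0 && (err = pthread_condattr_init(&ca)) == 0) {
        err = pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
        if (err == 0)
            err = pthread_cond_init(&s->cond, &ca);
        pthread_condattr_destroy(&ca);
    }
    if (err != 0) {
        munmap(s, sizeof *s);
        return NULL;
    }
    s->ready = 0;
    return s;
}

/* Lock the mutex; if the previous owner died while holding it,
 * mark the state consistent so the mutex stays usable. */
int shared_sync_lock(struct shared_sync *s)
{
    int err = pthread_mutex_lock(&s->mutex);
    if (err == EOWNERDEAD)
        err = pthread_mutex_consistent(&s->mutex);
    return err;
}
```

A waiter would call shared_sync_lock(), loop on pthread_cond_wait() until ready is set, and unlock. A real implementation would also have to repair the protected data when it sees EOWNERDEAD, not just mark the mutex consistent.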
Yes. You can use a (named) Mutex for that. Use CreateMutex to create one. You then wait for it (with functions like WaitForSingleObject), and release it when you're done with ReleaseMutex.
For reference, Boost.Interprocess (documentation for version 1.59) has condition variables and much more. Please note, however, that as of this writing, "Win32 synchronization is too basic".

Resources