Producer-consumer problem taken from Wikipedia:
semaphore mutex = 1
semaphore fillCount = 0
semaphore emptyCount = BUFFER_SIZE

procedure producer() {
    while (true) {
        item = produceItem()
        down(emptyCount)
        down(mutex)
        putItemIntoBuffer(item)
        up(mutex)
        up(fillCount)
    }
    up(fillCount) // the consumer may not finish before the producer.
}

procedure consumer() {
    while (true) {
        down(fillCount)
        down(mutex)
        item = removeItemFromBuffer()
        up(mutex)
        up(emptyCount)
        consumeItem(item)
    }
}
My question: why does the producer have the line up(fillCount) // the consumer may not finish before the producer after the while loop? When will the program ever reach that line, and why is it needed?
I think the code doesn't make sense this way. The loop never ends, so the line in question can never be reached.
The code didn't originally contain that line; it was added by an anonymous editor in March 2009. I have now removed it.
In general, code on Wikipedia is often edited by many people over a long period of time, so it's quite easy to introduce bugs into it.
Related
The code from Wikipedia for a producer-consumer queue with a single producer and a single consumer is:
semaphore fillCount = 0;            // items produced
semaphore emptyCount = BUFFER_SIZE; // remaining space

procedure producer()
{
    while (true)
    {
        item = produceItem();
        down(emptyCount);
        putItemIntoBuffer(item);
        up(fillCount);
    }
}

procedure consumer()
{
    while (true)
    {
        down(fillCount);
        item = removeItemFromBuffer();
        up(emptyCount);
        consumeItem(item);
    }
}
It is stated there that:
The solution above works fine when there is only one producer and consumer.
When there are more producers/consumers, the pseudocode is the same, with a mutex guarding the putItemIntoBuffer(item); and removeItemFromBuffer(); sections:
mutex buffer_mutex; // similar to "semaphore buffer_mutex = 1", but different (see notes below)
semaphore fillCount = 0;
semaphore emptyCount = BUFFER_SIZE;

procedure producer()
{
    while (true)
    {
        item = produceItem();
        down(emptyCount);
        down(buffer_mutex);
        putItemIntoBuffer(item);
        up(buffer_mutex);
        up(fillCount);
    }
}

procedure consumer()
{
    while (true)
    {
        down(fillCount);
        down(buffer_mutex);
        item = removeItemFromBuffer();
        up(buffer_mutex);
        up(emptyCount);
        consumeItem(item);
    }
}
My question is: why isn't the mutex required in the single-producer, single-consumer case?
Consider the following:
1. There are 5 items in a queue that allows 10 items.
2. The producer produces an item, decrements the empty semaphore (and succeeds), then starts putting the item into the buffer, and is not finished.
3. The consumer decrements the fill semaphore, then starts to remove an item from the buffer.
4. Unexpected: we are trying to remove an item from the buffer (3) while an item is still being put into the buffer (2).
Why does what I described not happen?
Because such a queue will usually be implemented as a circular queue. The producer writes to the tail of the queue, while the consumer reads from the head. They never access the same memory at the same time.
The idea here is that both consumer and producer can track the position of the tail/head independently.
Consider the following pseudo-code:
T data[BUFFER_SIZE];
int producerPtr = 0, consumerPtr = 0;

void putItemIntoBuffer(Item item)
{
    data[producerPtr] = item;
    producerPtr = (producerPtr + 1) % BUFFER_SIZE;
}

Item removeItemFromBuffer(void)
{
    Item item = data[consumerPtr];
    consumerPtr = (consumerPtr + 1) % BUFFER_SIZE;
    return item;
}
Both consumerPtr and producerPtr can be equal only when the queue is either full or empty, in which case these functions will not be called, because the executing process will remain blocked on a semaphore.
You can think of the semaphores as a message-passing mechanism that tells the other side when it is safe to advance its pointer; that is what keeps the two sides synchronized.
Now if you have multiple processes on one side, that side needs to perform the pointer increment and the data copying atomically, therefore a mutex is needed, but only for the side that has multiple processes. For example, a multiple-producer, multiple-consumer queue can use two separate mutexes (one per side) to decrease contention.
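As a minimal, self-contained sketch of that idea (all names, and the use of C++20 std::counting_semaphore and std::mutex in place of the pseudocode's down/up semaphores, are assumptions for illustration only):

#include <array>
#include <mutex>
#include <semaphore>

constexpr int BUFFER_SIZE = 10;
using Item = int;

std::array<Item, BUFFER_SIZE> data;
int producerPtr = 0, consumerPtr = 0;

std::counting_semaphore<BUFFER_SIZE> emptyCount(BUFFER_SIZE); // remaining space
std::counting_semaphore<BUFFER_SIZE> fillCount(0);            // items produced
std::mutex producer_mutex; // serializes producers only
std::mutex consumer_mutex; // serializes consumers only

void produce(Item item)
{
    emptyCount.acquire();                                 // down(emptyCount)
    {
        std::lock_guard<std::mutex> lock(producer_mutex); // only producers contend here
        data[producerPtr] = item;                         // putItemIntoBuffer(item)
        producerPtr = (producerPtr + 1) % BUFFER_SIZE;
    }
    fillCount.release();                                  // up(fillCount)
}

Item consume()
{
    fillCount.acquire();                                  // down(fillCount)
    Item item;
    {
        std::lock_guard<std::mutex> lock(consumer_mutex); // only consumers contend here
        item = data[consumerPtr];                         // removeItemFromBuffer()
        consumerPtr = (consumerPtr + 1) % BUFFER_SIZE;
    }
    emptyCount.release();                                 // up(emptyCount)
    return item;
}

Each side locks only its own pointer update and copy, so producers never contend with consumers; the semaphores alone are still what keep the two sides off the same slot.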
I have implemented an interprocess message queue in shared memory for one producer and one consumer on Windows.
I am using one named semaphore to count empty slots, one named semaphore to count full slots and one named mutex to protect the data structure in shared memory.
Consider, for example, the consumer side. The producer side is similar.
First it waits on the full semaphore, then (1) it takes a message from the queue under the mutex, and then it signals the empty semaphore (2).
The problem:
If the consumer process crashes between (1) and (2) then effectively the number of slots in the queue that can be used by the process is reduced by one.
Assume that while the consumer is down, the producer can handle the queue getting filled up. (It can either specify a timeout when waiting on the empty semaphore, or even specify 0 for no wait.)
When the consumer restarts it can continue to read data from the queue. Data will not have been overrun but even after it empties all full slots, the producer will have one less empty slot to use.
After multiple such restarts the queue will have no slots that can be used and no messages can be sent.
Question:
How can this situation be avoided or recovered from?
Here's an outline of one simple approach, using events rather than semaphores:
DWORD increment_offset(DWORD offset)
{
    offset++;
    if (offset == QUEUE_LENGTH*2) offset = 0;
    return offset;
}

void consumer(void)
{
    for (;;)
    {
        DWORD current_write_offset = InterlockedCompareExchange(write_offset, 0, 0);

        if ((current_write_offset != *read_offset + QUEUE_LENGTH) &&
            (current_write_offset + QUEUE_LENGTH != *read_offset))
        {
            // Queue is not full, make sure producer is awake
            SetEvent(signal_producer_event);
        }

        if (*read_offset == current_write_offset)
        {
            // Queue is empty, wait for producer to add a message
            WaitForSingleObject(signal_consumer_event, INFINITE);
            continue;
        }

        MemoryBarrier();
        _ReadWriteBarrier();

        consume((*read_offset) % QUEUE_LENGTH);

        InterlockedExchange(read_offset, increment_offset(*read_offset));
    }
}

void producer(void)
{
    for (;;)
    {
        DWORD current_read_offset = InterlockedCompareExchange(read_offset, 0, 0);

        if (current_read_offset != *write_offset)
        {
            // Queue is not empty, make sure consumer is awake
            SetEvent(signal_consumer_event);
        }

        if ((*write_offset == current_read_offset + QUEUE_LENGTH) ||
            (*write_offset + QUEUE_LENGTH == current_read_offset))
        {
            // Queue is full, wait for consumer to remove a message
            WaitForSingleObject(signal_producer_event, INFINITE);
            continue;
        }

        produce((*write_offset) % QUEUE_LENGTH);

        MemoryBarrier();
        _ReadWriteBarrier();

        InterlockedExchange(write_offset, increment_offset(*write_offset));
    }
}
Notes:
The code as posted compiles (given the appropriate declarations) but I have not otherwise tested it.
read_offset is a pointer to a DWORD in shared memory, indicating which slot should be read from next. Similarly, write_offset points to a DWORD in shared memory indicating which slot should be written to next.
An offset of QUEUE_LENGTH + x refers to the same slot as an offset of x so as to disambiguate between a full queue and an empty queue. That's why the increment_offset() function checks for QUEUE_LENGTH*2 rather than just QUEUE_LENGTH and why we take the modulo when calling the consume() and produce() functions. (One alternative to this approach would be to modify the producer to never use the last available slot, but that wastes a slot.)
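For clarity, the full/empty test used above can be factored into helpers like these (hypothetical names; the posted code performs the same checks inline):

// Offsets run from 0 to QUEUE_LENGTH*2 - 1, so equal offsets mean the queue is
// empty, and offsets exactly QUEUE_LENGTH apart mean it is full.
bool queue_is_empty(DWORD read_offset, DWORD write_offset)
{
    return read_offset == write_offset;
}

bool queue_is_full(DWORD read_offset, DWORD write_offset)
{
    return (write_offset == read_offset + QUEUE_LENGTH) ||
           (write_offset + QUEUE_LENGTH == read_offset);
}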
signal_consumer_event and signal_producer_event must be auto-reset events. Note that setting an event that is already set is a no-op.
The consumer only waits on its event if the queue is actually empty, and the producer only waits on its event if the queue is actually full.
When either process is woken, it must recheck the state of the queue, because there is a race condition that can lead to a spurious wakeup.
Because I use interlocked operations, and because only one process at a time is using any particular slot, there is no need for a mutex. I've included memory barriers to ensure that the changes the producer writes to a slot will be seen by the consumer. If you're not comfortable with lock-free code, you'll find that it is trivial to convert the algorithm shown to use a mutex instead.
Note that InterlockedCompareExchange(pointer, 0, 0); looks a bit complicated but is just a thread-safe equivalent to *pointer, i.e., it reads the value at the pointer. Similarly, InterlockedExchange(pointer, value); is the same as *pointer = value; but thread-safe. Depending on the compiler and target architecture, interlocked operations may not be strictly necessary, but the performance impact is negligible so I recommend programming defensively.
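In other words, the two calls behave like these hypothetical wrappers (assuming the offsets are declared as volatile LONG *, which is what the interlocked functions expect):

#include <windows.h>

// Atomic read: exchanging 0 for 0 only writes when the value is already 0,
// which changes nothing, so the call just returns the current value atomically.
static DWORD atomic_read(volatile LONG *p)
{
    return (DWORD)InterlockedCompareExchange(p, 0, 0);
}

// Atomic write: store value into *p as a single atomic operation.
static void atomic_write(volatile LONG *p, DWORD value)
{
    InterlockedExchange(p, (LONG)value);
}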
Consider the case when the consumer crashes during (or before) the call to the consume() function. When the consumer is restarted, it will pick up the same message again and process it as normal. As far as the producer is concerned, nothing unusual has happened, except that the message took longer than usual to be processed. An analogous situation occurs if the producer crashes while creating a message; when restarted, the first message generated will overwrite the incomplete one, and the consumer won't be affected.
Obviously, if the crash occurs after the call to InterlockedExchange but before the call to SetEvent in either the producer or consumer, and if the queue was previously empty or full respectively, then the other process will not be woken up at that point. However, it will be woken up as soon as the crashed process is restarted. You cannot lose slots in the queue, and the processes cannot deadlock.
I think the simple multiple-producer single-consumer case would look something like this:
void producer(void)
{
    for (;;)
    {
        DWORD current_read_offset = InterlockedCompareExchange(read_offset, 0, 0);

        if (current_read_offset != *write_offset)
        {
            // Queue is not empty, make sure consumer is awake
            SetEvent(signal_consumer_event);
        }

        produce_in_local_cache();

        claim_mutex();

        // read offset may have changed, re-read it
        current_read_offset = InterlockedCompareExchange(read_offset, 0, 0);

        if ((*write_offset == current_read_offset + QUEUE_LENGTH) ||
            (*write_offset + QUEUE_LENGTH == current_read_offset))
        {
            // Queue is full; release the mutex so other producers aren't blocked,
            // then wait for the consumer to remove a message and try again
            release_mutex();
            WaitForSingleObject(signal_producer_event, INFINITE);
            continue;
        }

        copy_from_local_cache_to_shared_memory((*write_offset) % QUEUE_LENGTH);

        MemoryBarrier();
        _ReadWriteBarrier();

        InterlockedExchange(write_offset, increment_offset(*write_offset));

        release_mutex();
    }
}
If the active producer crashes, the mutex will be detected as abandoned; you can treat this case as if the mutex were properly released. If the crashed process got as far as incrementing the write offset, the entry it added will be processed as usual; if not, it will be overwritten by whichever producer next claims the mutex. In neither case is any special action needed.
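For reference, here is roughly what detecting an abandoned mutex looks like (claim_mutex and producer_mutex are assumed names; WAIT_ABANDONED is the standard return value WaitForSingleObject uses for this case):

#include <windows.h>

static HANDLE producer_mutex; // assumed: a named mutex created with CreateMutex and shared by the producers

void claim_mutex(void)
{
    DWORD result = WaitForSingleObject(producer_mutex, INFINITE);
    if (result == WAIT_ABANDONED)
    {
        // The previous owner crashed while holding the mutex; we now own it.
        // The queue itself is still consistent, because write_offset is only
        // advanced after a slot has been completely written, so no special
        // recovery is needed here.
    }
    // result == WAIT_OBJECT_0 is the normal case; either way we own the mutex.
}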
HWM does not seem to work in clrzmq 2.2.5.
Here's my code:
private static ulong hwm = 50;

static void testMQ()
{
    var _Context = new Context(1);
    var pubSock = _Context.Socket(SocketType.PUB);
    pubSock.HWM = hwm;
    pubSock.Bind("tcp://*:9999");
    new Thread(testSub).Start();
    Thread.Sleep(1000); // client connect
    int i = 0;
    while (true)
    {
        pubSock.Send(i.ToString(), Encoding.ASCII);
        Debug.WriteLine(pubSock.Backlog + "/" + i++);
    }
}

static void testSub()
{
    var _ZmqCtx = new Context(1);
    var subSock = _ZmqCtx.Socket(SocketType.SUB);
    subSock.HWM = 500;
    subSock.Identity = new ASCIIEncoding().GetBytes("bla");
    subSock.Connect("tcp://127.0.0.1:9999");
    Debug.WriteLine("connected");
    subSock.Subscribe("", Encoding.ASCII);
    while (true)
    {
        Debug.WriteLine("r:" + subSock.Recv(Encoding.ASCII));
        Thread.Sleep(10);
    }
}
Output:
'quickies.vshost.exe' (Managed (v4.0.30319)):
Loaded 'B:\sdev\MSenseWS\GoogleImporter\bin\Debug\clrzmq.dll', Symbols loaded.
connected
r:0
100/0
100/1
100/2
[...]
100/13
r:1
100/14
[...]
100/2988
100/2989
100/2990
100/2991
100/2992
100/2993
100/2994
100/2995
100/2996
r:179
100/2997
100/2998
Expected behavior: pubSock.Send blocks after 500 messages are queued.
Experienced behavior: pubSock.Send does not block; it sends forever until an out-of-memory exception from native code (clrzmq.dll) is thrown.
Also: Why is backlog always 100?
Thanks for your insights,
Armin
Edit: push/pull sockets achieve the same result
Resolution:
- The error was on my side: I was expecting HWM to be the number of outstanding messages that the client(s) have not yet committed (received), while in fact HWM is the number of messages that are buffered and queued for sending over the network.
In my case I had a client that could not process messages fast enough, so buffer space kept being allocated until memory ran out.
To solve this I found that setting HWM and SWAP on the client socket fixes the problem, as messages are queued to a large swap file by zmq and are processed by the application one after another.
Ah, so I'm guessing you have the subscriber thread sleep, but that doesn't mean the underlying ZMQ socket threads also sleep. Therefore the subscriber will continue to take messages off the publisher queue. In other words, using Thread.Sleep() is probably not a good enough way to simulate limited network connectivity or other issues you expect to cause running into the HWM.
I want to implement a scheduler class, which any object can use to schedule timeouts and cancel them if necessary. When a timeout expires, this information will be sent to the timeout setter/owner asynchronously at that time.
So, for this purpose, I have 2 fundamental classes WindowsTimeout and WindowsScheduler.
class WindowsTimeout
{
    bool mCancelled;
    int mTimerID;              // Windows handle to identify the actual timer set.
    ITimeoutReceiver* mSetter;

    int cancel()
    {
        mCancelled = true;
        if (timeKillEvent(mTimerID) == SUCCESS) // Line under question # 1
        {
            delete this;       // Timeout instance is self-destroyed.
            return 0;          // ok. OS Timer resource given back.
        }
        return 1;              // fail. OS Timer resource not given back.
    }

    WindowsTimeout(ITimeoutReceiver* setter, int timerID)
    {
        mSetter = setter;
        mTimerID = timerID;
    }
};

class WindowsScheduler
{
    static void CALLBACK timerFunction(UINT uID, UINT uMsg, DWORD dwUser, DWORD dw1, DWORD dw2)
    {
        WindowsTimeout* timeout = (WindowsTimeout*) uMsg;
        if (timeout->mCancelled)
            delete timeout;
        else
            timeout->mSetter->GEN(evTimeout(timeout));
    }

    WindowsTimeout* schedule(ITimeoutReceiver* setter, TimeUnit t)
    {
        int timerID = timeSetEvent(...);
        if (timerID == SUCCESS)
        {
            return new WindowsTimeout(setter, timerID);
        }
        return 0;
    }
};
My questions are:
Q.1. When a WindowsScheduler::timerFunction() call is made, in which context is the call performed? It is simply a callback function, and I think it is performed in the OS context, right? If so, does this call pre-empt any other tasks that are already running? I mean, do callbacks have higher priority than any user task?
Q.2. When a timeout setter wants to cancel its timeout, it calls WindowsTimeout::cancel().
However, there is always a possibility that the OS invokes the static timerFunction callback, pre-empting the cancel operation, for example just after the mCancelled = true statement. In such a case, the timeout instance will be deleted by the callback function.
When the pre-empted cancel() function resumes after the callback completes execution, it will try to access an attribute of the deleted instance (mTimerID), as you can see on the line marked "Line under question # 1" in the code.
How can I avoid such a case ?
Please note that this question is an improved version of a previous one of mine:
Windows multimedia timer with callback argument
Q1 - I believe it gets called within a thread allocated by the timer API. I'm not sure, but I wouldn't be surprised if the thread ran at a very high priority. (In Windows, that doesn't necessarily mean it will completely preempt other threads, it just means it will get more cycles than other threads).
Q2 - I started to sketch out a solution for this, but then realized it was a bit harder than I thought. Personally, I would maintain a hash table that maps timerIDs to your WindowsTimeout object instances. The hash table could be a simple std::map instance that's guarded by a critical section. When the timer callback occurs, it enters the critical section and tries to obtain the WindowsTimer instance pointer, and then flags the WindowsTimer instance as having been executed, exits the critical section, and then actually executes the callback. In the event that the hash table doesn't contain the WindowsTimer instance, it means the caller has already removed it. Be very careful here.
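A rough sketch of one variant of that bookkeeping (all names here are illustrative, not from the question; in this variant the entry is removed from the table rather than flagged, so whichever side reaches it first wins):

#include <map>
#include <windows.h>

// Maps timer IDs to their WindowsTimeout instances. Guarded by gTimerLock,
// which is assumed to have been initialised with InitializeCriticalSection().
static std::map<UINT, WindowsTimeout*> gActiveTimers;
static CRITICAL_SECTION gTimerLock;

// Called from the timer callback: look the timer up and take ownership of it.
// Returns 0 if the owner has already cancelled and removed it.
WindowsTimeout* acquireTimeoutForCallback(UINT timerID)
{
    WindowsTimeout* timeout = 0;
    EnterCriticalSection(&gTimerLock);
    std::map<UINT, WindowsTimeout*>::iterator it = gActiveTimers.find(timerID);
    if (it != gActiveTimers.end())
    {
        timeout = it->second;
        gActiveTimers.erase(it); // the callback now owns the instance
    }
    LeaveCriticalSection(&gTimerLock);
    return timeout;              // 0 means "already cancelled, do nothing"
}

// Called from cancel(): remove the entry so the callback can no longer touch it.
// Returns true if this call removed it (safe to delete), false if the callback got there first.
bool removeTimeoutForCancel(UINT timerID)
{
    EnterCriticalSection(&gTimerLock);
    bool removed = gActiveTimers.erase(timerID) != 0;
    LeaveCriticalSection(&gTimerLock);
    return removed;
}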
One subtle bug in your own code above:
WindowsTimeout* schedule(ITimeoutReceiver* setter, TimeUnit t)
{
    int timerID = timeSetEvent(...);
    if (timerID == SUCCESS)
    {
        return new WindowsTimeout(setter, timerID);
    }
    return 0;
}
In your schedule method, it's entirely possible that the callback scheduled by timeSetEvent will fire BEFORE you can create an instance of WindowsTimeout.
Sample code below. I'm a little curious why MyActor is faster than MyActor2. MyActor recursively calls process/react and keeps state in the function parameters whereas MyActor2 keeps state in vars. MyActor even has the extra overhead of tupling the state but still runs faster. I'm wondering if there is a good explanation for this or if maybe I'm doing something "wrong".
I realize the performance difference is not significant but the fact that it is there and consistent makes me curious what's going on here.
Ignoring the first two runs as warmup, I get:
MyActor:
559
511
544
529
vs.
MyActor2:
647
613
654
610
import scala.actors._

object Const {
  val NUM = 100000
  val NM1 = NUM - 1
}

trait Send[MessageType] {
  def send(msg: MessageType)
}

// Test 1 using recursive calls to maintain state
abstract class StatefulTypedActor[MessageType, StateType](val initialState: StateType) extends Actor with Send[MessageType] {
  def process(state: StateType, message: MessageType): StateType

  def act = proc(initialState)

  def send(message: MessageType) = {
    this ! message
  }

  private def proc(state: StateType) {
    react {
      case msg: MessageType => proc(process(state, msg))
    }
  }
}

object MyActor extends StatefulTypedActor[Int, (Int, Long)]((0, 0)) {
  override def process(state: (Int, Long), input: Int) = input match {
    case 0 =>
      (1, System.currentTimeMillis())
    case input: Int =>
      state match {
        case (Const.NM1, start) =>
          println((System.currentTimeMillis() - start))
          (Const.NUM, start)
        case (s, start) =>
          (s + 1, start)
      }
  }
}

// Test 2 using vars to maintain state
object MyActor2 extends Actor with Send[Int] {
  private var state = 0
  private var strt = 0: Long

  def send(message: Int) = {
    this ! message
  }

  def act =
    loop {
      react {
        case 0 =>
          state = 1
          strt = System.currentTimeMillis()
        case input: Int =>
          state match {
            case Const.NM1 =>
              println((System.currentTimeMillis() - strt))
              state += 1
            case s =>
              state += 1
          }
      }
    }
}

// main: Run testing
object TestActors {
  def main(args: Array[String]): Unit = {
    val a = MyActor
    // val a = MyActor2
    a.start()
    testIt(a)
  }

  def testIt(a: Send[Int]) {
    for (_ <- 0 to 5) {
      for (i <- 0 to Const.NUM) {
        a send i
      }
    }
  }
}
EDIT: Based on Vasil's response, I removed the loop and tried it again. MyActor2, based on vars, then leapfrogged and now might be around 10% or so faster. So... the lesson is: if you are confident that you won't end up with a stack-overflowing backlog of messages, and you want to squeeze out every little bit of performance, don't use loop; just call the act() method recursively.
Change for MyActor2:
override def act() =
  react {
    case 0 =>
      state = 1
      strt = System.currentTimeMillis()
      act()
    case input: Int =>
      state match {
        case Const.NM1 =>
          println((System.currentTimeMillis() - strt))
          state += 1
        case s =>
          state += 1
      }
      act()
  }
Such results are caused by the specifics of your benchmark (a lot of small messages that fill the actor's mailbox quicker than it can handle them).
Generally, the workflow of react is as follows:
The actor scans the mailbox;
If it finds a message, it schedules the execution;
When the scheduling completes, or when there are no messages in the mailbox, the actor suspends (Actor.suspendException is thrown);
In the first case, when the handler finishes processing the message, execution proceeds straight to the react method, and, as long as there are lots of messages in the mailbox, the actor immediately schedules the next message to execute, and only then suspends.
In the second case, loop schedules the execution of react in order to prevent a stack overflow (which might be your case with Actor #1, because tail recursion in process is not optimized), and thus, execution doesn't proceed to react immediately, as in the first case. That's where the millis are lost.
UPDATE (taken from here):
Using loop instead of recursive react effectively doubles the number of tasks that the thread pool has to execute in order to accomplish the same amount of work, which in turn makes it so any overhead in the scheduler is far more pronounced when using loop.
Just a wild stab in the dark: it might be due to the exception thrown by react in order to escape the loop. Exception creation is quite heavy. However, I don't know how often it does that; it should be possible to check with a catch and a counter.
The overhead on your test depends heavily on the number of threads that are present (try using only one thread with scala -Dactors.corePoolSize=1!). I'm finding it difficult to figure out exactly where the difference arises; the only real difference is that in one case you use loop and in the other you do not. loop does do a fair bit of work, since it repeatedly creates function objects using "andThen" rather than iterating. I'm not sure whether this is enough to explain the difference, especially in light of the heavy usage by scala.actors.Scheduler$.impl and ExceptionBlob.