I am using a standard LRU queue as defined by the ZeroMQ guide, figure 41, and I am wondering how to add protection so that I don't send messages to endpoints that have disappeared (server crash, OOM killer, anything along those lines).
From the documentation I read that XREP will just drop the message if it is going to a non-existent endpoint, and there is no way I get notified about that. Is there a way to get such a notification? Should I just send out a "ping" first, and if I don't get a response then that "worker" is dead meat to me? How will I know that the message I get back is from the same client I just sent the ping to?
Or is my use case not a good one for ZeroMQ? I just want to make sure that a message has been received; I don't want it dropped on the floor without my knowledge...
Pinging a worker to know if it is alive will cause a race condition: the worker might well answer the ping just before it dies.
However, if you assume that a worker will not die during a request processing (you can do little in this case), you can reverse the flow of communication between the workers and the central queue. Let the worker fetch a request from the queue (using a REQ/REP connection) and have it send the answer along with the original envelope when the processing is done (using the same socket as above, or even better through a separate PUSH/PULL connection).
With this scenario, you know that a dead worker will not be sent requests, as it will be unable to fetch them (being dead…). Moreover, your central queue can even ensure that it receives an answer to every request in a given time. If it does not, it can put the request back in the queue so that a new worker will fetch it shortly after. This way, even if a worker dies while processing a request, the request will eventually be served.
(As a side note: be careful if the worker crashes because of a particular request - you do not want to kill your workers one by one, so you might want to set a maximum number of tries per request.)
Edit: I wrote some code implementing the other direction to explain what I mean.
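To make this concrete, here is a minimal sketch (not the code referenced above) of the worker side of the reversed flow, using libzmq's C API. The endpoint addresses are placeholders, and the client envelope is assumed to be packed into the job frame by the queue; a real implementation would forward it as separate multipart frames.

    // Worker in the reversed flow: it *asks* for work, so a dead worker
    // simply never gets assigned a request.
    #include <zmq.h>
    #include <string>

    int main() {
        void *ctx = zmq_ctx_new();

        // REQ socket: the worker fetches a request from the central queue.
        void *fetch = zmq_socket(ctx, ZMQ_REQ);
        zmq_connect(fetch, "tcp://localhost:5555");   // hypothetical "job" endpoint

        // PUSH socket: answers flow back to the queue on a separate connection.
        void *result = zmq_socket(ctx, ZMQ_PUSH);
        zmq_connect(result, "tcp://localhost:5556");  // hypothetical "result" endpoint

        while (true) {
            zmq_send(fetch, "READY", 5, 0);           // ask for a job

            char job[1024];
            int n = zmq_recv(fetch, job, sizeof(job) - 1, 0);
            if (n < 0) break;
            if (n > static_cast<int>(sizeof(job)) - 1) n = sizeof(job) - 1;
            job[n] = '\0';

            // ... process the request; here we simply echo it ...
            std::string reply = std::string("done:") + job;

            // The reply carries whatever envelope the queue packed into the
            // job frame, so the queue can route it back to the original client.
            zmq_send(result, reply.data(), reply.size(), 0);
        }

        zmq_close(result);
        zmq_close(fetch);
        zmq_ctx_term(ctx);
        return 0;
    }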
The first time I skimmed the zeromq docs, I assumed that the sender high watermark was there to ensure that the sender did not get too far ahead of the receiver. Now that I'm looking at it more carefully, it seems that this can't possibly be true, since the wire protocol doesn't have any concept of ACKs so the sender can't know whether the receiver is keeping up or is way behind. After staring at jeromq code in the debugger for way too long, it seems that the watermark is actually a purely "within-same-process" mechanism to ensure that the application thread that's writing to the ZMQ socket does not get too far ahead of the background thread that's responsible for taking messages off the ZMQ socket and writing bytes into the OS's TCP socket.
It seems like a rather fringe thing to worry about, relative to how much attention it's given in the docs. It doesn't even seem like a great way to control memory usage, because if you have a high water mark of 10, then 15 messages of 2kb each is not allowed, but 5 messages of 100 megs each is allowed, so things are still pretty unpredictable.
Am I understanding all this correctly, or am I hopelessly confused?
I think another thing that suggests it's not there to prevent a sender getting too far ahead of the receiver is that if one sets the HWM to 0, that's taken as infinity, not actually zero. For 0 to mean zero, it'd have to have some to-ing and fro-ing with the receiver to know whether the socket was actually empty throughout the whole connection.
I wish that 0 did mean zero, because then ZeroMQ could implement both Actor Model and Communicating Sequential Processes architectures. But it doesn't, so it can't.
Possible Uses
Nonetheless, a potentially useful aspect is related to the fact that ZeroMQ follows the Actor Model. Suppose one were sending messages, and it kind of mattered whether or not those messages got through. In the situation where the link has collapsed (something that ZeroMQ's heartbeat can tell you, pretty quickly), messages already sent are potentially lost forever. However, if the HWM is being used to throttle the rate of messages being sent by the application, then the number of lost messages when the link breaks is minimised.
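As a rough illustration of that throttling use (the endpoint and the numbers are made up): a small send HWM on a PUSH socket makes zmq_send() block once the queue fills, so the application cannot race far ahead of the actual transfer. A PUB socket would silently drop instead of blocking.

    #include <zmq.h>

    int main() {
        void *ctx = zmq_ctx_new();
        void *out = zmq_socket(ctx, ZMQ_PUSH);

        int hwm = 10;                                        // keep at most ~10 queued messages
        zmq_setsockopt(out, ZMQ_SNDHWM, &hwm, sizeof(hwm));  // set before connecting
        zmq_connect(out, "tcp://localhost:6000");

        for (int i = 0; i < 1000; ++i)
            zmq_send(out, "payload", 7, 0);                  // blocks whenever the HWM is hit

        zmq_close(out);
        zmq_ctx_term(ctx);
        return 0;
    }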
Obviously with CSP - the perfect architecture so far as I'm concerned! - you lose no messages (because the acts of sending and receiving are an execution rendezvous; the send won't complete until the receive has also completed).
What I have done in the past is to queue up messages for transmission in the sending application, sending them as and when the socket / connection can ingest them. Having the outbound message queue under the sending application's control (instead of ZeroMQ's) means that the sender's state can get ahead of the actual transfer of messages while still recovering easily from a network connection fault.
I have written systems where a sender has a choice of two pathways to send messages through - prime and spare - and if the link to prime has collapsed the sender continues to send to spare instead. Having queued the messages inside the application and not in the socket allows the sender's state to get ahead of the actual transfer of messages, knowing that if a link goes down it still has all the unsent outbound messages that have been generated in the meantime. These can then be directed at spare instead, without having to rewind the sender's internal state (which could be really tricky) to the last known successful transfer.
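A very rough sketch of that idea follows; the Link type and the link_is_up / try_send helpers are stand-ins for whatever real transport and heartbeat mechanism is in use (for example zmq_send() with ZMQ_DONTWAIT plus heartbeat monitoring).

    #include <deque>
    #include <string>

    // Stand-in transport; a real system would wrap actual sockets here.
    struct Link { bool up = true; };
    bool link_is_up(const Link& l) { return l.up; }
    bool try_send(Link&, const std::string&) { return true; }

    struct Sender {
        std::deque<std::string> pending;   // generated but not yet transferred
        Link prime, spare;

        void queue(std::string msg) { pending.push_back(std::move(msg)); }

        // Drain as much as possible onto whichever path is currently alive.
        // A message leaves the queue only once a link has accepted it, so
        // nothing generated in the meantime is lost if prime goes down.
        void pump() {
            Link& path = link_is_up(prime) ? prime : spare;
            while (!pending.empty() && try_send(path, pending.front()))
                pending.pop_front();
        }
    };

    int main() {
        Sender s;
        s.queue("hello");
        s.pump();   // goes to prime while it is up, to spare otherwise
    }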
Something like that, anyway.
"Why not send to both prime and spare anyway?" is a valid question. Well, sometimes things can be complicated...
I'm attempting to write an RFC 2812 compliant C++ IRC library.
I am having some trouble with the design of the client itself.
From what I have read IRC communication tends to be asynchronous.
I am using boost::asio::async_read and boost::asio::async_write.
From reading the documentation I have gathered that you cannot perform multiple async_write requests before one is completed. You therefore end up with rather nested callbacks. Doesn't this defeat the purpose of doing async calls? Wouldn't it just be better to use synchronous calls to prevent the nesting? If not, why?
Secondly, if I am not mistaken, each boost::asio::async_write should be followed up by a boost::asio::async_read to receive the server's response to the commands sent. My client's functions, therefore, would need to take a callback parameter so a user of the class may do something after the client receives a response (e.g. send another message...).
If I were to continue implementing this with async, should I keep a std::deque<std::tuple<message, callback>>, and each time a boost::asio::async_write finishes and there is a tuple in the queue, dequeue it, send the message, and then raise the callback? Would this be the optimal way to implement this system?
I'm thinking since messages are sent all the time I'm going to have to implement some kind of listener loop that queues up responses, but how would you associate these responses with the specific command that triggered them? Or in the case that the response is just a message to the channel from another user?
The IRC protocol is a full-duplex protocol. As such, one should always be listening to the server connection expecting commands to process. It could be argued that one should primarily use the messages received from the server to update state, rather than correlating requests and responses, as the server may not respond to a command or may respond much later than expected. For example, one may issue a WHOIS command, but receive multiple PRIVMSG commands before receiving a response to WHOIS. For a chat client, a user would likely expect to be able to receive chat messages while waiting for a response to WHOIS. Hence, having an async_write()-to-async_read() call chain may not be ideal for handling the protocol.
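For example, the "always be listening" side can be a single read loop that keeps one composed read outstanding and dispatches each complete line as it arrives (a sketch only; the class and function names are invented and error handling is trimmed).

    #include <boost/asio.hpp>
    #include <string>

    class IrcReader {
    public:
        explicit IrcReader(boost::asio::ip::tcp::socket socket)
            : socket_(std::move(socket)) {}

        void start() { read_line(); }

    private:
        void read_line() {
            boost::asio::async_read_until(
                socket_, buffer_, "\r\n",
                [this](const boost::system::error_code& ec, std::size_t n) {
                    if (ec) return;                  // real code: report / reconnect
                    std::string line(
                        boost::asio::buffers_begin(buffer_.data()),
                        boost::asio::buffers_begin(buffer_.data()) + n);
                    buffer_.consume(n);
                    handle_line(line);               // PRIVMSG, PING, a WHOIS reply, ...
                    read_line();                     // immediately wait for the next line
                });
        }

        void handle_line(const std::string& /*line*/) { /* update client state here */ }

        boost::asio::ip::tcp::socket socket_;
        boost::asio::streambuf buffer_;
    };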
For a given socket, the Asio documentation does recommend not initiating additional read operations if there is an outstanding composed read operation and not initiating additional write operations if there is an outstanding composed write operation. Queuing up messages and having an asynchronous call chain process them from the queue is a great way to fulfill this recommendation. Consider reading this answer for a nice solution using a queue and an asynchronous call chain.
Also, be aware that the server may send a PING command even on an active connection. When the client is responding with a PONG command, it may be necessary to insert the PONG command near the front of the outbound queue so that it gets sent out as soon as possible.
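In the same spirit, here is a sketch of the queued-write side: only one async_write is ever outstanding, and a PONG can jump ahead of everything that has not started transmitting yet. The names are invented, and a single-threaded io_context (or a strand) is assumed.

    #include <boost/asio.hpp>
    #include <deque>
    #include <string>

    class IrcWriter {
    public:
        explicit IrcWriter(boost::asio::ip::tcp::socket socket)
            : socket_(std::move(socket)) {}

        // Lines are assumed to already end in "\r\n". Urgent messages
        // (such as PONG) go to the front of the queue.
        void send(std::string line, bool urgent = false) {
            if (urgent) outbox_.push_front(std::move(line));
            else        outbox_.push_back(std::move(line));
            if (!writing_) { writing_ = true; write_next(); }
        }

    private:
        void write_next() {
            current_ = std::move(outbox_.front());   // keep the in-flight buffer stable
            outbox_.pop_front();
            boost::asio::async_write(
                socket_, boost::asio::buffer(current_),
                [this](const boost::system::error_code& ec, std::size_t) {
                    if (ec) { writing_ = false; return; }   // real code: report / reconnect
                    if (outbox_.empty()) writing_ = false;
                    else                 write_next();
                });
        }

        boost::asio::ip::tcp::socket socket_;
        std::deque<std::string> outbox_;
        std::string current_;
        bool writing_ = false;
    };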
Doesn't this defeat the purpose of doing async calls?
The usual solution is to use strands:
Why do I need strand per connection when using boost::asio?
You are free to queue multiple asynchronous operations on the same io objects using an (implicit) strand.
Using a strand ensures that the completion handlers are invoked on that same logical thread.
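A minimal illustration of the strand idea (assuming a reasonably recent Boost; the timers stand in for any pair of asynchronous operations): handlers bound to the same strand are never run concurrently, even when io_context::run() is called from several threads.

    #include <boost/asio.hpp>
    #include <chrono>

    int main() {
        boost::asio::io_context io;
        auto strand = boost::asio::make_strand(io);

        boost::asio::steady_timer t1(io, std::chrono::milliseconds(10));
        boost::asio::steady_timer t2(io, std::chrono::milliseconds(10));

        // Both completion handlers go through the same strand, so they are
        // serialised even with multiple threads calling io.run().
        t1.async_wait(boost::asio::bind_executor(
            strand, [](const boost::system::error_code&) { /* handler A */ }));
        t2.async_wait(boost::asio::bind_executor(
            strand, [](const boost::system::error_code&) { /* handler B */ }));

        io.run();
        return 0;
    }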
On the Protocol
You could indeed keep a queue of commands and await responses for each command before sending the next.
You might be a little bit smarter about this if you can spot the correlation due to the different types of reply, but then you'd need to keep queues per type of command. I'd consider that premature optimization.
I'm about to create an application that will spawn tasks of about 100,000 requests, each expecting a response. I'm wondering whether to use a static reply queue or temporary queues. There is only one client requesting and only one server replying. The use case for the client will be to spawn a task about once a day.
I'm thinking I want to use temporary queues for the responses, but I'm wondering if there is a reasonable limit to the number of temporary queues, or how long I would want to keep them open.
Some replies may take days to come back, or never come back, so I would time out the temporary queues after about 3 days.
My immediate thoughts are that 3 days stretches the definition of temporary. In that time you want to survive both requester (producer, who also consumes the response) and broker outages. Temporary queues are a contract between the subscriber and the broker - if one of them goes down, the temporary queue disappears and the responder will get an error when they attempt to reply on that queue.
I would use static queues in this instance - you will need to implement a layer for correlating responses back to requests in your requester, but you would need to do that anyway if you want to survive the outage of that process (probably by storing additional state in a database).
I am trying to implement a master-worker program.
My master has jobs that the workers are going to do. Every time a worker completes a job, he asks for a new job from the master, and the master sends it to him. The workers are calculating minimal paths. When a worker finds a minimum that is better than the global minimum he got, he sends it to everyone including the master.
I plan for the workers and the master to send data using MPI_Isend. Also, I think that the receive should be blocking. The master has nothing to do when no one has asked for work or has updated the best result, so he should block waiting for a receive. Also, each worker should, after he has done his work, wait on a receive to get a new one.
Nevertheless, I'm not sure of the impact of mixing non-blocking sends with blocking receives.
An alternative, I think, is to use MPI_Iprobe, but I'm not sure that this will gain me anything.
Please help me understand whether what I'm doing is right. Is this the right solution?
You can match blocking sends with nonblocking receives and vice versa, that won't cause any problems. However, if the master really has nothing to do while the workers work, and the workers should block after completing their work unit, then there's no reason for non-blocking communication on that front. The master can post a blocking receive with MPI_ANY_SOURCE, and the workers can just use a blocking send to post back their results, since the matching receive at the master will already have been posted.
So, I'd have Send-Recv for exchanging work units between master and worker, and Isend-Irecv for broadcasting the new global minima.
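As a sketch of the master side of that Send-Recv exchange (tags, the message layout, and termination handling are all simplified for illustration):

    #include <mpi.h>

    const int TAG_RESULT = 1;   // worker -> master: finished result / request for work
    const int TAG_WORK   = 2;   // master -> worker: next job index (-1 = no more work)

    void master_loop(int num_jobs, int num_workers) {
        int next_job = 0;
        int finished_workers = 0;

        while (finished_workers < num_workers) {
            double result;
            MPI_Status status;

            // Blocking receive from any worker: the master has nothing else
            // to do, so there is no benefit in a nonblocking receive here.
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_RESULT,
                     MPI_COMM_WORLD, &status);

            // Reply to exactly the worker that asked, using its rank.
            int job = (next_job < num_jobs) ? next_job++ : -1;
            MPI_Send(&job, 1, MPI_INT, status.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);

            if (job == -1)
                ++finished_workers;
        }
    }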
I am writing a Message Handler for an ebXML message passing application. The messages follow the Request-Response pattern. The process is straightforward: the Sender sends a message, the Receiver receives the message and sends back a response. So far so good.
On receipt of a message, the Receiver has a set Time To Respond (TTR) to the message. This could be anywhere from seconds to hours/days.
My question is this: how should the Sender deal with the TTR? I need this to be an async process, as the TTR could be quite long (several days). How can I count down the timer without tying up system resources for long periods of time? There could be large volumes of messages.
My initial idea is to have a "Waiting" Collection, to which the message Id is added, along with its TTR expiry time. I would then poll the collection on a regular basis. When the timer expires, the message Id would be moved to an "Expired" Collection and the message transaction would be terminated.
When the Sender receives a response, it can check the "Waiting" collection for its matching sent message, and confirm the response was received in time. The message would then be removed from the collection for the next stage of processing.
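For concreteness, here is a bare-bones sketch of those two collections (in C++ purely for illustration, since the language hardly matters here; all names and the choice of containers are invented):

    #include <chrono>
    #include <string>
    #include <unordered_map>
    #include <unordered_set>

    using Clock = std::chrono::steady_clock;

    struct PendingResponses {
        std::unordered_map<std::string, Clock::time_point> waiting;  // messageId -> TTR deadline
        std::unordered_set<std::string> expired;

        void track(const std::string& messageId, std::chrono::seconds ttr) {
            waiting[messageId] = Clock::now() + ttr;
        }

        // Called on a timer (the regular polling mentioned above): move anything
        // past its deadline to the expired set so its transaction can be terminated.
        void poll() {
            const auto now = Clock::now();
            for (auto it = waiting.begin(); it != waiting.end(); ) {
                if (it->second <= now) {
                    expired.insert(it->first);
                    it = waiting.erase(it);
                } else {
                    ++it;
                }
            }
        }

        // Called when a response arrives: true means the request was still waiting.
        bool onResponse(const std::string& messageId) {
            return waiting.erase(messageId) > 0;
        }
    };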
Does this sound like a robust solution? I am sure this is a solved problem, but there is precious little information about this type of algorithm. I plan to implement it in C#, but the implementation language is kind of irrelevant at this stage, I think.
Thanks for your input
Depending on the number of clients, you can use persistent JMS queues: one queue per client ID. The message will stay in the queue until the client connects and retrieves it.
I'm not understanding the purpose of the TTR. Is it more of a client-side measure, meaning that if the response cannot be returned within a certain time then just don't bother sending it? Or is it to be used on the server to schedule the work: do what's required now and push the requests with a later response time to be handled later?
It's a broad question...