I'm working on an app that notifies multiple Workers when a reboot is about to happen, then waits for ALL the Workers to perform some tasks and send an ACK before rebooting. The number of Workers can change, so my app will somehow need to know how many Workers are currently subscribed so that it knows every Worker has sent an ACK.
Is a pub/sub approach the best way for doing this? Does it provide a way to figure out how many subscribers are currently connected? Should my app use a REP socket to listen for ACK from the Workers? Is there a more elegant way of designing this?
Thanks
Is a pub/sub approach the best way for doing this?
Using pub/sub from the server to broadcast a "server reboot" message is fine for the workers that get the message, but it's not foolproof. Slow-joiner syndrome may prevent a worker (or workers) from receiving the message. To address that, the server, once it publishes a reboot message, should continue publishing that message until all workers respond with an ACK, but that creates a new problem: how does the server keep track of all workers to ensure it receives all the necessary ACKs?
Does it provide a way to figure out how many subscribers are
currently connected?
No. Exposing that information would break ZeroMQ's abstraction model, which hides the physical details of the connection and the connected peers. You can send heartbeat messages periodically from server to workers over pub/sub; workers respond with a logical node id (WorkerNode1, etc.), and the server keeps track of each worker in a hashtable along with a future expiration time. When a worker responds to a heartbeat, the server simply resets that worker's expiration; the server should also periodically check the hashtable and remove expired workers.
That's the best you can do for keeping track of workers. The shorter the expiration, the more accurately the worker list reflects which workers are actually present.
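That expiration bookkeeping can be sketched in a few lines of pure Python (the class name, node ids, and timeout value here are made up for illustration; the ZeroMQ plumbing is omitted):

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds a worker may stay silent before expiring (example value)

class WorkerRegistry:
    """Tracks logical worker ids along with a future expiration time."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT):
        self.timeout = timeout
        self.expirations = {}  # worker id -> absolute expiry timestamp

    def heartbeat(self, worker_id, now=None):
        """Register a worker, or reset its expiry if it already exists."""
        now = time.time() if now is None else now
        self.expirations[worker_id] = now + self.timeout

    def reap(self, now=None):
        """Remove and return the ids of workers whose expiry has passed."""
        now = time.time() if now is None else now
        dead = [wid for wid, expiry in self.expirations.items() if expiry <= now]
        for wid in dead:
            del self.expirations[wid]
        return dead

    def alive(self):
        """The current best guess at the set of live workers."""
        return set(self.expirations)
```

The server calls heartbeat() for every heartbeat reply or ACK it receives, and runs reap() on a timer; alive() is then the set of workers it should wait on for ACKs.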
Should my app use a REP socket to listen for ACK from the Workers? Is
there a more elegant way of designing this?
REQ/REP sockets have limited uses. I'd use PUB on the server for sending reboot and heartbeat messages, and ROUTER to receive ACKs. The workers should use DEALER for sending ACKs (and anything else), and SUB for receiving heartbeats/reboots. ROUTER and DEALER are bi-directional, fully asynchronous, and the most versatile socket types; you can't go wrong with them.
Hope it helps!
Related
I've been reading the ZMQ documentation on heartbeats and read that one should use the ping-pong approach instead of the one used for the Paranoid Pirate pattern:
For Paranoid Pirate, we chose the second approach. It might not have
been the simplest option: if designing this today, I'd probably try a
ping-pong approach instead.
However, I find little to no documentation about the ping-pong pattern anywhere (and why is it preferred, anyway?). The only code examples I could find are ping.py and pong.py in the pyzmq examples.
Are these adequate examples that demonstrate a two-way heartbeat? If so, how does "pong" detect that "ping" is no longer alive? There's also this claim about no payload, but isn't the ping message itself also a payload?
One peer sends a ping command to the other, which replies with a pong
command. Neither command has any payload.
Again, these examples may not constitute a full implementation of this approach. If anyone can share some experience, descriptions or code examples, I'd appreciate it.
My aim is to add heartbeat functionality to a broker and worker (router-dealer). Both worker and broker should detect that the partner isn't available any more and (a) deregister the worker (in case the broker detects the worker has gone), or (b) try to reconnect later (in case the worker lost its connection to the broker). Heartbeating isn't required while the worker is busy, because a busy worker wouldn't be in the broker's idle-workers queue for new jobs anyway.
ZeroMQ does not provide any mechanism to help you find out whether the socket on the other side is alive or not.
Therefore, the standard scenario of the heartbeat pattern (the most convenient one, I think) is a heartbeat with a timeout.
You need sockets on the client and the server, working in separate threads, and also a poller.
Poller example:
import zmq

p = zmq.Poller()
p.register(socket, zmq.POLLIN)  # `socket` is your already-created zmq socket
Client sends a message to the server and polls the socket with a timeout. Choose a timeout value that suits your application and clearly indicates that the server is unavailable.
Polling example:
msg = dict(p.poll(timeout))
if socket in msg and msg[socket] == zmq.POLLIN:
    pass  # we got a heartbeat from the server
else:
    pass  # timeout - the server is unavailable
Server does the same.
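Putting the pieces together, here is a minimal ping-pong sketch. For simplicity it runs both peers in one process over an inproc PAIR socket, with the "pong" side in a thread; the endpoint name, timeout value, and message bodies are invented for the example (real code would typically use ROUTER/DEALER over TCP):

```python
import threading
import zmq

TIMEOUT_MS = 500  # example timeout; tune to your network

ctx = zmq.Context.instance()

# "pong" side binds; created before the thread starts so the endpoint exists.
pong = ctx.socket(zmq.PAIR)
pong.bind("inproc://heartbeat")

def pong_loop(sock, n_pings):
    """Reply 'pong' to each of n_pings pings, then go silent (simulating death)."""
    for _ in range(n_pings):
        sock.recv()
        sock.send(b"pong")

ping = ctx.socket(zmq.PAIR)
ping.connect("inproc://heartbeat")
poller = zmq.Poller()
poller.register(ping, zmq.POLLIN)

def ping_once():
    """Send a ping; True if a pong arrives within TIMEOUT_MS, else peer is dead."""
    ping.send(b"ping")
    if dict(poller.poll(TIMEOUT_MS)).get(ping) == zmq.POLLIN:
        return ping.recv() == b"pong"
    return False  # timeout: peer considered unavailable

t = threading.Thread(target=pong_loop, args=(pong, 1))
t.start()
first = ping_once()   # peer answers, so this is True
t.join()
second = ping_once()  # peer has gone silent, so this times out: False
ping.close(linger=0)
pong.close(linger=0)
```

The "pong" side detects a dead "ping" symmetrically: it tracks when the last ping arrived and declares the peer gone once that age exceeds its own timeout.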
I think this could help.
I'm using a pub-sub pattern over TCP. When one of my subscribers dies (kill -9, for example) and is restarted with the same IDENTITY, it does not get the previous messages.
What are the solutions so that when it restarts it gets the messages that were sent? (I understand 0mq does not handle that.)
run publisher
run sub0 (subscribe to socket)
run sub1 (subscribe to socket)
pkill -9 sub0 (simulate daemon dying)
publisher sends a message
run sub0 again (same ZMQ_IDENTITY)
sub0 does not receive the lost message.
This is entirely the responsibility of your application. Take a look at The Guide... particularly Chapter 5 on advanced pub/sub patterns, and even more specifically Getting an out of band snapshot.
The upshot is that your publishing server actually has two sockets: one for publishing, and one for other system-level communication. Any time it publishes a new message, it also adds that message to a local cache; it never forgets the messages it sends. Any time your subscribing client re-connects to the server, its 2nd socket sends a request to the server for all the messages it missed (or, as in the case of the linked example, the entire current state of the data), which are sent back over that 2nd socket pair. In this way the subscriber is up to date with all messages by the time it starts to get new ones over the normal subscriber channel.
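The server-side cache can be as simple as a sequence-numbered log. A pure-Python sketch of the idea (the class and method names are made up, and the ZeroMQ plumbing around it is omitted):

```python
class PublisherCache:
    """Remembers every published message so late (re)joiners can catch up."""

    def __init__(self):
        self.log = []  # index in this list == message sequence number

    def publish(self, msg):
        """Record the message; the caller also sends it on the PUB socket."""
        self.log.append(msg)
        return len(self.log) - 1  # sequence number of this message

    def replay_from(self, last_seen):
        """Messages a reconnecting subscriber missed, to send over the 2nd socket."""
        return self.log[last_seen + 1:]
```

The subscriber includes the last sequence number it saw in its catch-up request, and the server answers with replay_from(last_seen) on the system-level socket.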
I have a client/server setup in which clients send a single request message to the server and gets a bunch of data messages back.
The server is implemented using a ROUTER socket and the clients using a DEALER. The communication is asynchronous.
The clients are typically iPads/iPhones and they connect over wifi so the connection is not 100% reliable.
The issue I'm concerned about is this: the client connects to the server and sends a request for data, but the connection goes down (e.g. out of wifi coverage) before the response messages are delivered back.
In this case the messages will be queued up on the server side waiting for the client to reconnect. That is fine for a short time but eventually I would like to drop the messages and the connection to release resources.
By checking activity/timeouts it would be possible in the server and the client applications to identify that the connection is gone. The client can shutdown the socket and in this way free resources but how can it be done in the server?
Per the ZMQ FAQ:
How can I flush all messages that are in the ZeroMQ socket queue?
There is no explicit command for flushing a specific message or all messages from the message queue. You may set ZMQ_LINGER to 0 and close the socket to discard any unsent messages.
Per this mailing list discussion from 2013:
There is no option to drop old messages [from an outgoing message queue].
Your best bet is to implement heartbeating and, when one client stops responding without explicitly disconnecting, restart your ROUTER socket. Messy, I know; this is really something that should have a companion option to HWM. Pieter Hintjens (ZMQ's creator) is clearly on board, but that was from 2011, so it looks like nothing ever came of it.
This is a bit late, but setting TCP keepalive to a reasonable value will cause dead sockets to close after the timeouts have expired.
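In pyzmq that looks something like the following; the socket type and the specific interval values here are arbitrary examples, not recommendations:

```python
import zmq

ctx = zmq.Context.instance()
sock = ctx.socket(zmq.ROUTER)

# Enable TCP-level keepalive probes on all connections of this socket.
sock.setsockopt(zmq.TCP_KEEPALIVE, 1)
sock.setsockopt(zmq.TCP_KEEPALIVE_IDLE, 60)   # seconds of idle before probing
sock.setsockopt(zmq.TCP_KEEPALIVE_INTVL, 10)  # seconds between probes
sock.setsockopt(zmq.TCP_KEEPALIVE_CNT, 3)     # failed probes before the OS drops the connection
```

With those example numbers, a peer that vanishes without a FIN/RST would be detected by the OS after roughly 60 + 3 x 10 seconds, and the dead connection's resources released.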
Heartbeating is necessary for either side to determine the other side is still responding.
The only thing I'm not sure about is how to go about heartbeating many thousands of clients without spending all available cpu just on dealing with the heartbeats.
I have a single publisher application (PUB) which has N number of subscribers (SUB)
These subscribers need to be able to catch up if they are restarted, or fall down and miss messages.
We have implemented a simple event store that the publisher writes to.
We have implemented a CatchupService which can query the event store and send missed messages to the subscriber.
We have implemented in the subscriber a PUSH socket which sends a request for missed messages.
The subscriber also has a PULL socket which listens for missed messages on a separate port.
The subscriber will:
Detect a gap
Send a request for missed messages to our CatchupService, the request also contains the address on which to send the results to.
The catchup service has a PULL socket on which it listens for requests
When the CatchupService receives a request it starts a worker thread which:
Gets the missed messages
Opens a PUSH socket connecting to the subscriber's PULL socket
Sends the missed messages to the subscriber.
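A stripped-down sketch of that request/response flow, run in one process over inproc so it fits in a few lines (the endpoint names, message format, and the in-memory stand-in for the event store are all invented for the example; the real services would use TCP and separate processes):

```python
import zmq

ctx = zmq.Context.instance()

# CatchupService listens for requests on a PULL socket.
request_pull = ctx.socket(zmq.PULL)
request_pull.bind("inproc://catchup-requests")

# Subscriber: PULL socket for missed messages, PUSH socket for the request.
missed_pull = ctx.socket(zmq.PULL)
missed_pull.bind("inproc://subscriber-1")

request_push = ctx.socket(zmq.PUSH)
request_push.connect("inproc://catchup-requests")
# The request carries the detected gap and the address to send the results to.
request_push.send_json({"from_seq": 2, "reply_to": "inproc://subscriber-1"})

# CatchupService worker: read the request, query the (stand-in) event store,
# open a PUSH socket back to the subscriber, and send the missed messages.
event_store = {0: b"m0", 1: b"m1", 2: b"m2", 3: b"m3"}
req = request_pull.recv_json()
worker_push = ctx.socket(zmq.PUSH)
worker_push.connect(req["reply_to"])
for seq in sorted(s for s in event_store if s >= req["from_seq"]):
    worker_push.send(event_store[seq])

# Subscriber receives the two missed messages on its PULL socket.
missed = [missed_pull.recv() for _ in range(2)]
```

The PUSH/PULL pairing fits because each leg is strictly one-directional and needs no reply on the same socket.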
This seems to work quite well; however, we are unsure whether we are using the right socket types for this sort of application. Are these correct, or should we be using a different pattern?
Sounds okay. Apart from that, 0MQ is able to recover from message loss when peers go offline for a short time. Take a look at the Socket Options, specifically the ZMQ_SNDHWM option.
I don't know just how guaranteed the 0MQ recovery mechanisms are so maybe you're best to stay with what you've got, but it is something to be aware of.
I was working with different patterns in ZeroMQ in my project, and right now I am using REQ/REP (later I will shift to DEALER/ROUTER) and PUB/SUB. The client sends messages to the server, and the server publishes this information to other clients who have subscribed.
To use multiple sockets I followed the suggestions in this thread,
Combining pub/sub with req/rep in zeromq, and used zmq_poll. My server polls on the REQ socket and the PUB socket.
While writing the code, and while reading the above post, I guessed that my PUB socket would never get polled in, and that's what I am observing now when I run the program. Only my request is polled in, and publishing is not happening at all.
If I don't use polling it works fine, i.e. as soon as the server gets the message I publish it.
So I am unclear on how polling can be useful in this pattern, and how I can use it?
You probably don't need to poll the pub socket. You certainly don't need to poll for input (POLLIN) on it, because that can never be triggered: PUB sockets are send-only.
The polling pattern might be useful in the case where you want to poll for "ready to send" on the req and the pub socket, allowing you to multiplex those channels. This will be particularly useful if/when you move to using a dealer/router.
The reason is that replacing REQ with a DEALER (for example) allows you to send multiple messages before receiving responses. Polling for inbound and outbound messages lets you take maximum advantage of that.
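As a sketch of that multiplexing (the inproc endpoint, message bodies, and the echoing ROUTER standing in for a real peer are invented for the example), a single poller loop can watch a DEALER for both readability and writability:

```python
import zmq

ctx = zmq.Context.instance()

router = ctx.socket(zmq.ROUTER)
router.bind("inproc://backend")
dealer = ctx.socket(zmq.DEALER)
dealer.connect("inproc://backend")

poller = zmq.Poller()
# POLLOUT fires when the socket can accept another outgoing message,
# POLLIN when a reply is waiting - so one loop multiplexes both directions.
poller.register(dealer, zmq.POLLIN | zmq.POLLOUT)

replies = []
sent = 0
while len(replies) < 3:
    events = dict(poller.poll(1000))
    state = events.get(dealer, 0)
    if state & zmq.POLLOUT and sent < 3:
        dealer.send(b"request %d" % sent)  # send without waiting for a reply
        sent += 1
    if state & zmq.POLLIN:
        replies.append(dealer.recv())
    # Echo server stands in for the real peer: bounce requests straight back.
    while router.poll(0):
        identity, payload = router.recv_multipart()
        router.send_multipart([identity, payload])
```

Notice that all three requests go out before all three replies have come back, which a plain REQ socket's strict send/recv lockstep would forbid.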