Ping-pong heartbeat in ZMQ

I've been reading the ZMQ documentation on heartbeats, and it says one should use the ping-pong approach instead of the one used for the Paranoid Pirate pattern:
For Paranoid Pirate, we chose the second approach. It might not have
been the simplest option: if designing this today, I'd probably try a
ping-pong approach instead.
However, I can find little to no documentation about the ping-pong pattern anywhere (and why is it preferred, anyway?). The only code examples I can find are ping.py and pong.py in the pyzmq examples.
Are these adequate examples that demonstrate a two-way heartbeat? If so, how is "pong" detecting that "ping" is not alive any more? There's also this claim about no payload, but isn't the ping message also considered a payload?
One peer sends a ping command to the other, which replies with a pong
command. Neither command has any payload.
Again, these examples may not constitute a full implementation of this approach. If anyone can share some experience, descriptions or code examples, I'd appreciate it.
My aim is to add heartbeat functionality to a broker and worker (ROUTER-DEALER). Both worker and broker should detect that their partner is no longer available and either (a) deregister the worker (when the broker detects the worker has gone) or (b) try to reconnect later (when the worker has lost its connection to the broker). Heartbeats aren't required while the worker is busy, because a busy worker wouldn't be in the broker's idle-workers queue for new jobs anyway.

ZeroMQ does not provide any mechanism to help you find out whether the socket on the other side is alive or not.
Therefore, the standard (and, I think, most convenient) form of the heartbeat pattern is a heartbeat with a timeout.
You need sockets on the client and the server, running in separate threads, and a poller on each side.
Poller example:
p = zmq.Poller()
p.register(socket, zmq.POLLIN)  # `socket` is the connection to your peer
The client sends a message to the server and polls the socket with a timeout. Choose a timeout value that suits you and clearly indicates that the server is unavailable.
Polling example:
msg = dict(p.poll(timeout))     # timeout is in milliseconds
if socket in msg and msg[socket] == zmq.POLLIN:
    socket.recv()  # we got a heartbeat from the server
else:
    pass           # timeout: the server is unavailable
The server does the same.
I think this could help.
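For completeness, here is a minimal sketch of what the worker side of such a ping-pong heartbeat could look like with pyzmq, assuming a broker that answers every b"PING" with a b"PONG" (the endpoint, command names, and timeouts are made up for illustration):

import time
import zmq

PING_INTERVAL = 1.0    # seconds between pings (tune to taste)
PONG_TIMEOUT = 3000    # ms to wait for a pong before declaring the peer dead

ctx = zmq.Context()
worker = ctx.socket(zmq.DEALER)
worker.connect("tcp://localhost:5555")   # hypothetical broker endpoint

poller = zmq.Poller()
poller.register(worker, zmq.POLLIN)

while True:
    worker.send(b"PING")                 # a bare command, no payload
    events = dict(poller.poll(PONG_TIMEOUT))
    if worker in events:
        worker.recv()                    # expected: b"PONG"
    else:
        # No pong within the timeout: treat the broker as gone,
        # close without lingering, and reconnect later.
        worker.close(linger=0)
        break
    time.sleep(PING_INTERVAL)

The broker side would mirror this: record the last ping time for each worker identity and deregister any worker that stays silent past the timeout.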

Related

How to drop inactive/disconnected peers in ZMQ

I have a client/server setup in which clients send a single request message to the server and get a bunch of data messages back.
The server is implemented using a ROUTER socket and the clients using a DEALER. The communication is asynchronous.
The clients are typically iPads/iPhones and they connect over wifi so the connection is not 100% reliable.
The issue I'm concerned about is if the client connects to the server and sends a request for data, but the communication goes down (e.g. out of wifi coverage) before the response messages are delivered back.
In this case the messages will be queued up on the server side waiting for the client to reconnect. That is fine for a short time but eventually I would like to drop the messages and the connection to release resources.
By checking activity/timeouts, both the server and the client applications could identify that the connection is gone. The client can shut down its socket and free resources that way, but how can this be done on the server?
Per the ZMQ FAQ:
How can I flush all messages that are in the ZeroMQ socket queue?
There is no explicit command for flushing a specific message or all messages from the message queue. You may set ZMQ_LINGER to 0 and close the socket to discard any unsent messages.
Per this mailing list discussion from 2013:
There is no option to drop old messages [from an outgoing message queue].
Your best bet is to implement heartbeating and, when one client stops responding without explicitly disconnecting, restart your ROUTER socket. Messy, I know; this is really something that should have a companion option to HWM. Pieter Hintjens (who created ZMQ) was clearly on board, but that was in 2011, so it looks like nothing ever came of it.
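That "restart the ROUTER socket" workaround might look roughly like this in pyzmq (the helper function and endpoint are illustrative, not a standard API):

import zmq

def restart_router(ctx, old_socket, endpoint):
    # LINGER=0 makes close() drop any messages still queued for
    # dead peers instead of trying to deliver them.
    old_socket.setsockopt(zmq.LINGER, 0)
    old_socket.close()
    # Re-create and re-bind; live clients will transparently
    # reconnect, and the old outgoing queues are gone.
    new_socket = ctx.socket(zmq.ROUTER)
    new_socket.bind(endpoint)
    return new_socket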
This is a bit late, but setting TCP keepalive to a reasonable value will cause dead sockets to close after the timeouts expire.
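For reference, libzmq (3.x and later) exposes TCP keepalive as socket options; a sketch, with placeholder values you would tune to your network:

import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.ROUTER)
# Set keepalive options before bind/connect so they apply to new connections.
sock.setsockopt(zmq.TCP_KEEPALIVE, 1)         # turn keepalive on
sock.setsockopt(zmq.TCP_KEEPALIVE_IDLE, 60)   # idle seconds before first probe
sock.setsockopt(zmq.TCP_KEEPALIVE_INTVL, 10)  # seconds between probes
sock.setsockopt(zmq.TCP_KEEPALIVE_CNT, 3)     # failed probes before dropping
sock.bind("tcp://*:5555")                     # hypothetical endpoint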
Heartbeating is necessary for either side to determine the other side is still responding.
The only thing I'm not sure about is how to heartbeat many thousands of clients without spending all available CPU just on dealing with the heartbeats.

ZeroMQ REQ/REP server error handling

I am trying to use ZeroMQ REQ/REP and cannot figure out how to handle server-side errors. Look at the code from here:
socket.bind("tcp://*:%s" % port)
while True:
# Wait for next request from client
message = socket.recv()
print "Received request: ", message
time.sleep (1)
socket.send("World from %s" % port)
My problem is what happens if the client calls socket.send() and then hangs or crashes. Wouldn't the server just get stuck on socket.send() or socket.recv() forever?
Note that it is not a problem with TCP sockets. With TCP sockets I can simply break the connection. With ZMQ, the connections are implicitly managed for me and I don't know if it is possible to break a 'session' or 'connection' and start over.
You can terminate ZMQ sockets much the same way you terminate TCP sockets.
socket.close()
If you need to wait on a message for only a finite amount of time, set the zmq.RCVTIMEO socket option; socket.recv() will then raise zmq.Again once the timeout expires, and you can handle that error case the same way you would a TCP socket timing out or disconnecting. If you need to manage several sockets, any of which may be in an error state, the Poller class will let you accomplish this.
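A minimal sketch of that timeout handling (the endpoint and the 1000 ms value are arbitrary):

import zmq

ctx = zmq.Context()
socket = ctx.socket(zmq.REP)
socket.setsockopt(zmq.RCVTIMEO, 1000)   # recv() gives up after 1000 ms
socket.bind("tcp://*:5556")             # hypothetical endpoint

try:
    message = socket.recv()
except zmq.Again:
    # Timed out: the client hung or crashed. Handle it like a
    # TCP timeout/disconnect instead of blocking forever.
    pass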
The ZMQ Z-guide offers lots of good hints on how to structure your services to handle different scenarios.
I think chapter 4 can be of interest to you, especially the Lazy Pirate Pattern.
Check out the examples of Lazy Pirate Server and Lazy Pirate Client.
In general,
Make sure you setsockopt() on the socket such that send and recv will not block forever. (Temporary blocking – be it client or server – is OK, but infinite blocking is bad because your application cannot do anything else)
In the event that any of the I/O got an error,
If you are the client, close() the current socket and re-create a new one to establish a new connection
If you are the server, there's nothing else to do; you are simply waiting for a new connection. You will want to explore the Poller class.
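Putting those points together, a client-side sketch in the spirit of the Lazy Pirate pattern could look like this (endpoint, timeout, and retry count are assumptions):

import zmq

ENDPOINT = "tcp://localhost:5556"   # hypothetical server endpoint
REQUEST_TIMEOUT = 2500              # ms to wait for a reply
REQUEST_RETRIES = 3

ctx = zmq.Context()

def fresh_socket():
    s = ctx.socket(zmq.REQ)
    s.setsockopt(zmq.RCVTIMEO, REQUEST_TIMEOUT)
    s.setsockopt(zmq.LINGER, 0)     # don't block on close with unsent data
    s.connect(ENDPOINT)
    return s

client = fresh_socket()
for attempt in range(REQUEST_RETRIES):
    client.send(b"Hello")
    try:
        reply = client.recv()
        break                       # got an answer
    except zmq.Again:
        # A REQ socket cannot send again after a failed recv,
        # so close the stuck socket and re-create the connection.
        client.close()
        client = fresh_socket()
else:
    print("Server seems to be offline, giving up")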

ZeroMQ: Publish to multiple Workers and wait for ACK

I'm working on an app that notifies multiple Workers when a reboot is about to happen and then waits for ALL the Workers to perform some tasks and send an ACK before rebooting. The number of Workers can change, so my app will somehow need to know how many Workers are currently subscribed, so that it knows when every Worker has sent an ACK.
Is a pub/sub approach the best way for doing this? Does it provide a way to figure out how many subscribers are currently connected? Should my app use a REP socket to listen for ACK from the Workers? Is there a more elegant way of designing this?
Thanks
Is a pub/sub approach the best way for doing this?
Using pub/sub from the server to broadcast a "server reboot" message is fine for the workers who get the message, but it's not foolproof. Slow-joiner syndrome may prevent a worker (or workers) from receiving the message. To address that, the server, once it publishes a reboot message, should continue publishing that message until all workers respond with an ACK, but that creates a new problem: how does the server keep track of all workers to ensure it receives all the necessary ACKs?
Does it provide a way to figure out how many subscribers are
currently connected?
No. Exposing that information would break ZeroMQ's abstraction model, which hides the physical details of the connection and the connected peers. You can send heartbeat messages periodically from server to workers over pub/sub; workers respond with a logical node id (WorkerNode1, etc.), and the server keeps track of each worker in a hashtable along with a future expiration time. When a worker responds to a heartbeat, the server simply resets that worker's expiration; the server should periodically check the hashtable and remove expired workers.
That's the best you can do for keeping track of workers. The shorter the expiration, the more accurately the worker list reflects reality.
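That bookkeeping is small; a sketch, with an arbitrary liveness window:

import time

HEARTBEAT_LIVENESS = 3.0    # seconds a worker may stay silent (assumption)

workers = {}                # logical node id -> expiration time

def on_heartbeat_ack(worker_id):
    # Any response from a worker pushes its expiration into the future.
    workers[worker_id] = time.time() + HEARTBEAT_LIVENESS

def purge_expired():
    now = time.time()
    for worker_id in [w for w, expires in workers.items() if expires < now]:
        del workers[worker_id]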
Should my app use a REP socket to listen for ACK from the Workers? Is
there a more elegant way of designing this?
REQ/REP sockets have limited uses. I'd use PUB on the server for sending reboot and heartbeat messages, and ROUTER to receive ACKs. The workers should use DEALER for sending ACKs (and anything else) and SUB for receiving heartbeats/reboots. ROUTER and DEALER are bi-directional, fully asynchronous, and the most versatile sockets; you can't go wrong.
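A sketch of that socket layout (the endpoints and node id are made up, and in a real deployment the server and worker halves would run in separate processes):

import time
import zmq

ctx = zmq.Context()

# Server side: PUB broadcasts reboots/heartbeats, ROUTER collects ACKs.
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5557")            # hypothetical endpoints
acks = ctx.socket(zmq.ROUTER)
acks.bind("tcp://*:5558")

# Worker side: SUB listens for broadcasts, DEALER sends ACKs.
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://localhost:5557")
sub.setsockopt(zmq.SUBSCRIBE, b"")  # receive every broadcast
dealer = ctx.socket(zmq.DEALER)
dealer.connect("tcp://localhost:5558")

time.sleep(0.2)                     # crude guard against the slow joiner
pub.send(b"REBOOT")

# Worker: receive the broadcast and ACK with a logical node id.
msg = sub.recv()
dealer.send_multipart([b"WorkerNode1", msg])

# Server: ROUTER sees [dealer identity, node id, original message].
frames = acks.recv_multipart()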
Hope it helps!

Combining pub/sub with req/rep in zeromq

How can a client both subscribe and listen to replies with zeromq?
That is, on the client side I'd like to run a loop which only receives messages and selectively sends requests, and on the server side I'd like to publish most of the time, but to sometimes receive requests as well.
It looks like I'll have to have two different sockets - one for each mode of communication. Is it possible to avoid that and on the server side receive "request notifications" from the socket on a zeromq callback thread while pushing messages to the socket in my own thread?
I am awfully new to ZeroMQ, so I'm not sure if what you want is considered best-practice or not. However, a solution using multiple sockets is pretty simple using zmq_poll.
The basic idea would be to have both client and server:
open a socket for pub/sub
open a socket for req/rep
multiplex sends and receives between the two sockets using zmq_poll in an infinite loop
process req/rep and pub/sub events within the loop as they occur
Using zmq_poll in this manner with multiple sockets is nice because it avoids threads altogether. The 0MQ guide has a good example here. Note that in that example, they use a timeout of -1 in zmq_poll, which causes it to block until at least one event occurs on any of the multiplexed sockets, but it's pretty common to use a timeout of x milliseconds or something if your loop needs to do some other work as well.
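A client-side sketch of that multiplexing loop, assuming a SUB socket for the published stream and a REQ socket for occasional requests (the endpoints and the need_request() trigger are hypothetical):

import zmq

ctx = zmq.Context()

sub = ctx.socket(zmq.SUB)
sub.connect("tcp://localhost:5559")   # hypothetical publisher endpoint
sub.setsockopt(zmq.SUBSCRIBE, b"")

req = ctx.socket(zmq.REQ)
req.connect("tcp://localhost:5560")   # hypothetical request endpoint

poller = zmq.Poller()
poller.register(sub, zmq.POLLIN)
poller.register(req, zmq.POLLIN)

waiting_for_reply = False
while True:
    # A finite timeout lets the loop do other work between events.
    events = dict(poller.poll(100))
    if sub in events:
        update = sub.recv()           # published message
    if req in events:
        reply = req.recv()            # reply to an earlier request
        waiting_for_reply = False
    # need_request() stands in for whatever application logic decides to
    # selectively send a request (REQ must strictly alternate send/recv).
    if not waiting_for_reply and need_request():
        req.send(b"request")
        waiting_for_reply = True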
You can use 2 threads to handle the different sockets. The challenge is that if you need to share data between threads, you need to synchronize it in a safe way.
The alternative is to use the ZeroMQ Poller to select the sockets that have new data on them. The process would then use a single loop in the way bjlaub explained.
This could be accomplished using a variation/subset of the Majordomo Protocol. Here's the idea:
Your server will be a router socket, and your clients will be dealer sockets. Upon connecting to the server, the client needs to send some kind of subscription or "hello" message (of your design). The server receives that packet, but (being a router socket) also receives the ID of that client. When the server needs to send something to that client (through your design), it sends it to that ID. The client can send and receive at will, since it is a dealer socket.
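A bare-bones sketch of that idea (the endpoint and message contents are made up):

import zmq

ctx = zmq.Context()

# Server: a ROUTER socket learns each client's identity from its "hello".
server = ctx.socket(zmq.ROUTER)
server.bind("tcp://*:5561")                # hypothetical endpoint

# Client: a DEALER socket can send and receive at will.
client = ctx.socket(zmq.DEALER)
client.connect("tcp://localhost:5561")
client.send(b"HELLO")                      # the subscription/hello message

# ROUTER prepends the sender's ID to every incoming message...
client_id, msg = server.recv_multipart()
# ...and addressing a reply to that ID routes it back to that client.
server.send_multipart([client_id, b"update just for you"])
print(client.recv())                       # b"update just for you"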

WinSock best accept() practices

Imagine you have a server which can handle only one client at a time. The server uses WSAAsyncSelect to be notified of new connections. In this case, what is the best way of handling FD_ACCEPT messages:
A > Accept the connection attempt right away but queue the client until its turn?
B > Do not accept the next connection attempt until we are done serving the currently connected client?
What do you guys think is the most efficient?
Here I describe the cons I'm aware of for both options. Hopefully this might help you decide.
A)
Upon a new client connection, the client could send tons of data, making your receive buffer fill up, which causes unnecessary packets to be transmitted (see this). If you don't plan to receive any data from the client, shut down receiving on that socket; then, if the client sends any data after that, the connection is reset. Moreover, if your protocol has strict rules, disconnect the client.
If the connection stays idle for too long, the system might disconnect it. To solve this, use setsockopt to set SO_KEEPALIVE on each client socket.
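Although the question is about WinSock in C, the equivalent calls are easy to sketch with Python's socket module (the peer address is a placeholder):

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Enable keepalive probes so an idle-but-alive connection isn't dropped.
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.connect(("example.com", 80))   # placeholder peer
# If you never expect data from the peer, half-close the read side;
# further data from the peer then resets the connection.
s.shutdown(socket.SHUT_RD)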
B)
If you don't accept the connection within a certain period (I guess the default is 60 seconds), it will time out. In a normal (or most common) situation this indicates the server is overloaded and thus unable to answer in time. However, if the client is also designed by you, make the socket non-blocking, try to connect, and then manage the timeout as you wish.
Ask yourself: what do you want the user experience to be at the other end? Do you want them to be stuck? Do you want them to time out? Do you want them to get a polite message?
