How to connect queues to a ZeroMQ PUB/SUB - zeromq

Consider the following:
a set of 3 logical services: S1, S2 and S3
two instances of each service are running, so we have the following processes: S1P1, S1P2, S2P1, S2P2, S3P1, S3P2
a ZeroMQ broker running in a single process and reachable by all service processes
A logical service, let's say S1, publishes a message M1 that is of interest to logical services S2 and S3. Only one process of each logical service must receive M1, so let's say S2P1 and S3P2.
I have tried the following, but without success:
broker thread 1 is running a XSUB/XPUB proxy
broker thread 2 is running a ROUTER/DEALER proxy with the ROUTER connected to the XPUB socket and subscribed to everything (for logical S1)
broker thread 3 is running a ROUTER/DEALER proxy with the ROUTER connected to the XPUB socket and subscribed to everything (for logical S2)
broker thread 4 is running a ROUTER/DEALER proxy with the ROUTER connected to the XPUB socket and subscribed to everything (for logical S3)
each logical service process is running a REP socket thread connected to the broker DEALER socket
I figured that the XSUB/XPUB proxy would give me publish/subscribe semantics and that the ROUTER/DEALER proxies would introduce a competition between the REP sockets for the messages sent by the XSUB/XPUB proxy.
How can I combine ZeroMQ sockets to accomplish this?
Update1
I know "without success" isn't helpful, I've tried different configurations and got different errors. The latest configuration I tried is the following:
(XSUB proxy=> XPUB) => (SUB copyLoop=> REQ) => (ROUTER proxy=> DEALER) => REP
The copyLoop goes like this:
public void start() {
    context = ZMQ.context(1);

    subSocket = context.socket(ZMQ.SUB);
    subSocket.connect(subSocketUrl);
    subSocket.subscribe("".getBytes());

    reqSocket = context.socket(ZMQ.REQ);
    reqSocket.connect(reqSocketUrl);

    while (!Thread.currentThread().isInterrupted()) {
        final Message msg = receiveNextMessage();
        resendMessage(msg);
    }
}

private Message receiveNextMessage() {
    final String header = subSocket.recvStr();
    final String entity = subSocket.recvStr();
    return new Message(header, entity);
}

private void resendMessage(Message msg) {
    reqSocket.sendMore(msg.getKey());
    reqSocket.send(msg.getData(), 0);
}
The exception I get is the following:
java.lang.IllegalStateException: Cannot send another request
at zmq.Req.xsend(Req.java:51) ~[jeromq-0.3.4.jar:na]
at zmq.SocketBase.send(SocketBase.java:613) ~[jeromq-0.3.4.jar:na]
at org.zeromq.ZMQ$Socket.send(ZMQ.java:1206) ~[jeromq-0.3.4.jar:na]
at org.zeromq.ZMQ$Socket.sendMore(ZMQ.java:1189) ~[jeromq-0.3.4.jar:na]
at com.xyz.messaging.zeromq.SubReqProxyConnector.resendMessage(SubReqProxyConnector.java:47) ~[classes/:na]
at com.xyz.messaging.zeromq.SubReqProxyConnector.start(SubReqProxyConnector.java:35) ~[classes/:na]
I'm running JeroMQ 0.3.4, Oracle Java 8 JVM and Windows 7.

You seem to be adding in some complexity with your ROUTER connection - you should be able to do everything connected directly to your publisher.
The error you're currently running into is that REQ sockets have a strict message ordering pattern - you are not allowed to send() twice in a row, you must send/receive/send/receive/etc (likewise, REP sockets must receive/send/receive/send/etc). From what it looks like, you're just doing send/send/send/etc on your REQ socket without ever receiving a response. If you don't care about a response from your peer, then you must receive and discard it or use DEALER (or ROUTER, but DEALER makes more sense in your current diagram).
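As a minimal JeroMQ illustration of that lockstep (the endpoint and payloads below are made up, not taken from the question):

import org.zeromq.ZMQ;

// Minimal illustration of the REQ lockstep: a second send() without an
// intervening recv() is exactly what raises "Cannot send another request".
public class ReqLockstepSketch {
    public static void main(String[] args) {
        ZMQ.Context context = ZMQ.context(1);
        ZMQ.Socket req = context.socket(ZMQ.REQ);
        req.connect("tcp://localhost:6000");   // made-up endpoint

        req.send("first request");
        System.out.println(req.recvStr());     // must read the reply before sending again
        req.send("second request");
        System.out.println(req.recvStr());

        req.close();
        context.term();
    }
}

A DEALER socket has no such restriction, which is why it fits a fire-and-forget forwarding loop better.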
I've created a diagram of how I would accomplish this architecture below - using your basic process structure.
Broker T1    (PUB*)
                |   (broker threads T2, T3 and T4 each connect a *SUB socket to T1's PUB)
                v
Broker T2    (*SUB)[--](DEALER*) <----(REP*) S1P1    <----(REP*) S1P2
Broker T3    (*SUB)[--](DEALER*) <----(REP*) S2P1    <----(REP*) S2P2
Broker T4    (*SUB)[--](DEALER*) <----(REP*) S3P1    <----(REP*) S3P2
So, the main difference is that I've ditched your (SUB copyLoop=> REQ) step. Whether you choose XPUB/XSUB vs PUB/SUB is up to you, but I would tend to start simpler unless you currently want to make use of the extra features of XPUB/XSUB.
Obviously this diagram doesn't deal with how information enters your broker, where you currently show an XSUB socket - that's out of scope for the information you've provided thus far, presumably you're able to receive information into your broker successfully already so I won't deal with that.
I assume your broker threads that are dedicated to each service are making intelligent choices on whether to send the message to their service or not? If so, then your choice of having them subscribed to everything should work fine, otherwise more intelligent subscription setups might be necessary.
If you're using a REP socket on your service processes, then the service process must take that message and deal with it asynchronously, never communicating back any details about that message to the broker. It must then respond to each message with an acknowledgement (like "RECEIVED") so that it follows the strict receive/send/receive/send pattern for REP sockets.
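A minimal sketch of such a service-process loop in JeroMQ (the endpoint is an assumption):

import org.zeromq.ZMQ;

// Sketch of a service-process REP loop: acknowledge every message immediately
// so the strict recv/send pattern is kept, and hand the real work off elsewhere.
public class ServiceRepWorkerSketch {
    public static void main(String[] args) {
        ZMQ.Context context = ZMQ.context(1);
        ZMQ.Socket rep = context.socket(ZMQ.REP);
        rep.connect("tcp://localhost:6100");   // the broker thread's DEALER endpoint (assumed)

        while (!Thread.currentThread().isInterrupted()) {
            String job = rep.recvStr();        // message forwarded by the broker
            // queue `job` for asynchronous processing here...
            rep.send("RECEIVED");              // acknowledgement only; no result flows back this way
        }

        rep.close();
        context.term();
    }
}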
If you want any other kind of feedback about how the service handled a message sent back to the broker, REP is no longer the appropriate socket type for your service processes, and DEALER may no longer be the correct socket type for your broker. If you want some form of load balancing so that you send to the next open service process, you'll need to use ROUTER/REQ and have each service indicate its availability, with the broker holding on to each message until the next service process says it's available by sending back its results. If you want some other type of message handling, you'll have to indicate what that is so a suitable architecture can be proposed.
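If that load-balancing variant is what you end up needing, here is a rough JeroMQ sketch of one broker thread (endpoints and framing here are assumptions):

import org.zeromq.ZMQ;

// Rough sketch of availability-based load balancing: each service process uses
// a REQ socket that sends "READY" (or its previous result), and the broker's
// ROUTER hands the next published message to whichever worker spoke last.
public class AvailabilityBrokerSketch {
    public static void main(String[] args) {
        ZMQ.Context context = ZMQ.context(1);

        ZMQ.Socket messages = context.socket(ZMQ.SUB);   // messages for this logical service
        messages.connect("tcp://localhost:6001");
        messages.subscribe("".getBytes());

        ZMQ.Socket workers = context.socket(ZMQ.ROUTER); // service processes connect REQ sockets here
        workers.bind("tcp://*:6100");

        while (!Thread.currentThread().isInterrupted()) {
            // A REQ message arrives at the ROUTER as [identity][empty][body].
            byte[] identity = workers.recv(0);
            workers.recv(0);                             // empty delimiter frame
            workers.recv(0);                             // "READY" or the previous result (ignored here)

            String payload = messages.recvStr();         // wait for the next published message

            workers.sendMore(identity);                  // reply envelope routes it to that worker
            workers.sendMore("");
            workers.send(payload);
        }
    }
}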

Clearly I got mixed up with a few elements:
Sockets have the same API whether you use them as a client-side socket (Socket.connect) or a server-side socket (Socket.bind)
Sockets have the same API regardless of their type, even though some calls only apply to some types (e.g. Socket.subscribe should not be called on a PUSH socket)
Some socket types require a strict send/receive loop (e.g. REQ/REP)
Some nuances in the communication patterns (PUSH/PULL vs ROUTER/DEALER)
The difficulty (impossibility?) of debugging a ZeroMQ setup
So a big thanks to Jason for his incredibly detailed answer (and awesome diagram!) that pointed me to the right direction.
I ended up with the following design:
broker thread 1 is running a fan-out XSUB/XPUB proxy on bind(localhost:6000) and bind(localhost:6001)
broker thread 2 is running a queuing SUB/PUSH proxy on connect(localhost:6001) and bind(localhost:6002); broker threads 3 and 4 use a similar design with different bind port numbers
message producers connect to the broker using a PUB socket on connect(localhost:6000)
message consumers connect to the broker queuing proxy using a PULL socket on connect(localhost:6002)
On top of this service-specific queuing mechanism, I was able to add a similar service-specific fan-out mechanism rather simply:
broker thread runs a SUB/PUB proxy on connect(localhost:6001) and bind(localhost:6003)
message producers still connect to the broker using a PUB socket on connect(localhost:6000)
message consumers connect to the broker fan-out proxy using a SUB socket on connect(localhost:6003)
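A minimal JeroMQ sketch of the broker side of this design (ports as above, per-service topic filtering omitted; each proxy call blocks, so each runs in its own thread):

import org.zeromq.ZMQ;

// Sketch of the broker described above: one XSUB/XPUB fan-out proxy plus, per
// logical service, a SUB/PUSH queuing proxy (and optionally a SUB/PUB fan-out proxy).
public class BrokerSketch {

    // Broker thread 1: fan-out XSUB/XPUB proxy on 6000/6001.
    static void fanOutProxy(ZMQ.Context context) {
        ZMQ.Socket xsub = context.socket(ZMQ.XSUB);
        xsub.bind("tcp://localhost:6000");         // producers connect a PUB socket here
        ZMQ.Socket xpub = context.socket(ZMQ.XPUB);
        xpub.bind("tcp://localhost:6001");         // the per-service proxies connect SUB sockets here
        ZMQ.proxy(xsub, xpub, null);               // blocks for the life of the broker
    }

    // Broker threads 2-4: queuing SUB/PUSH proxy, one per logical service.
    static void queuingProxy(ZMQ.Context context, int pushPort) {
        ZMQ.Socket sub = context.socket(ZMQ.SUB);
        sub.connect("tcp://localhost:6001");
        sub.subscribe("".getBytes());              // a real broker would subscribe per service
        ZMQ.Socket push = context.socket(ZMQ.PUSH);
        push.bind("tcp://localhost:" + pushPort);  // consumers connect a PULL socket here
        ZMQ.proxy(sub, push, null);                // PUSH round-robins across connected PULL peers
    }
}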
This has been an interesting ride.

Related

How many clients can connect to the server socket in MQL?

I want to connect more than 500 hundred clients to the MQL (MetaTrader) server socket.
There is no description of this in the documentation: https://www.mql5.com/en/docs/network/socketcreate
How many clients can connect to the server and operate without problems?
Q :" I want to connect more than 500 hundred client to the MQL (Meta Trader) server ... How many client can connect to the sever and deal with no problem? "
A :Not an easy task, indeed.
As you may already know, all the MetaTrader 4/5 ecosystems are built as a distributed-system, having a Terminal-side ( on your, clients' side(s) ) and a Server-side ( a multi-host platform, located at the Broker DataCenter, who registers users, authenticates & feeds, besides many further noted things, a latency-sensitive, high-volume ( markets Volume-wise times number of active clients-wise ) stream of { CFD | FX | DeFi | * }-Market QUOTE messages (having easily cadence of hundreds ToB-events / messages per millisecond at FX-market) to all auth'ed active { MT4 | MT5 }-Terminal computers & accepts and executes XTO-instructions from auth'ed clients & reports results ( state-changes preformed & client's-funds accounting operations ) from XTO-s back to the respective trader's terminals ). That amount of work is, on the Broker side, split among several MetaTrader 4/5 Server server-infrastructure computers. The web-socket handling gets served by one part of such Broker-side infrastructure.
Closer to your reach goes the MetaTrader 4/5 Terminal, that you can program & control. Even here the amounts of resources are limited, as you can read from your linked, Terminal-side, not Server-side documentation of programming tools available :
You can create a maximum of 128 sockets from one MQL5 program. If the limit is exceeded, the error 5271 (ERR_NETSOCKET_TOO_MANY_OPENED) is written to _LastError.
So the Server side is controlled by the Broker (who owns the licence to use the MetaQuotes, Inc. product and has it configured for the expected performance envelope); being ready to handle an additional 50,000 web-socket connections for NTO-s might not be the Broker's core business priority, as they collect their fees from XTO-s.
"(...) The question is, do we create new socket for each client to connect? As I know, we create the server socket just one time on the Oninit function, then on a timer or chart event handler, do accepting incoming client connection request. So, there is just one socket and many client connect to this socket. Am I right #user3666197 ? – Behzad 23 hours ago"
-&-
"I think my question is not clear. I have done this project. I bought a VPS then install a MT5 on it with the EA that has played the server role. The sever EA could accept 500 client without any problem. It can send and receive messages as well as one connection. For clients, on my pc create a loop to connect 500 connection to the server. One socket on the server EA. – Behzad 4 hours ago"
Given that you call the MT5-Client-Terminal a "server" in a sense (just a VPS-hosted MT5-Client-Terminal running a user-defined MQL5 ExpertAdviser code), there seems to be some magic:
(A) you claim to be able to "(...) accept 500 client without any problem.", which directly contradicts the officially documented MQL5 limit of no more than 128 sockets ever opened from an MQL5-{ EA | Script } code
(B) the official MQL5 documentation does not present a way for an MT5-Client-Terminal running MQL5-{ EA | Script } code to receive connections arriving asynchronously from remote clients (the MQL5-language functions, as of 2022-Q1, practically prevent such a thing from happening)
(C) the official MQL5 documentation confirms that one can SocketConnect() from inside MT5-Client-Terminal MQL5-{ EA | Script } code to a known TCP/IP:PORT address:
string KNOWN_ADDRESS  = "some.known.FQDN";
int    KNOWN_PORT     = 80,
       TimeoutMILLIS  = 1000;
bool   FLAG_ExtTLS    = false;
//+------------------------------------------------------------------+
...
int MyOUTGOINGsocket = SocketCreate();              //--- check the handle
if ( MyOUTGOINGsocket != INVALID_HANDLE )
{
    if ( SocketConnect( MyOUTGOINGsocket,           //--- from MT5-Terminal
                        KNOWN_ADDRESS,              //     to   <_address_>
                        KNOWN_PORT,                 //     on   <_port_>
                        TimeoutMILLIS               //     try  <_millis_>
                        )                           //     else FAIL
         )
    {
        Print( "INF: Established connection to ",
                KNOWN_ADDRESS, ":",
                KNOWN_PORT
                );
        ...
    }
    else
    {
        Print( "ERR: Connection to ",
                KNOWN_ADDRESS, ":",
                KNOWN_PORT,
               " failed, error ",
                GetLastError()
                );
        ...
    }
    SocketClose( MyOUTGOINGsocket );                //--- close a socket to release RAM/resources
}
else
{
    Print( "ERR: Failed to even create a socket, error was ",
            GetLastError()
            );
    ...
}
...
...
//+------------------------------------------------------------------+
One may, for sure, use some other, DLL-#import-ed tools for similar tasks, yet as no MCVE-formulated problem description has been presented so far, it is hard to say anything more beyond the facts already described above.
You can use the WebRequest() function with an API from the MQL client.
There is an article explaining how to create a server on MT5:
Working with sockets in MQL, or How to become a signal provider
https://www.mql5.com/en/articles/2599

Overlapping MQ Cluster and reply message delivery with IBM MQ

I have two clusters CLUSMES and CLUSHUBS. Each cluster has two queue managers.
Cluster CLUSMES has QMGRS: QMGR1A and QMGR1B
Cluster CLUSHUBS has QMGRS: QMGR3A and QMGR3B
There is a Gateway QMGR: QMGR2, which forms the overlap and is a partial repository in each MQ cluster.
Request messages are sent from either QMGR1A/B to either QMGR3A/B via QMGR2, which acts as a cluster load balancer to QMGR3A/B (this works fine), and a reply is expected back at the sending QMGR.
All channel connectivity is in place and fully functional. The problem is how to return the message to where it came from. The replying application connects to QMGR3A/B and issues a put; I get either a REMOTE_QMGR not found (MQRC 2087) or an MQ object not found (MQRC 2085), depending on how I have it configured. The MQMD of the message correctly contains the ReplyToQueue and ReplyToQMgr. I would like the replying application to just issue a put and have the reply delivered to the proper queue in CLUSMES, but this is proving to be extremely difficult. I have played with a remote-QMGR alias and a QAlias on the gateway queue manager QMGR2, but with no luck. There has got to be a simple trick to this, and there are plenty of examples, but I have not been able to implement one successfully. A clear-cut example of what my return path should be would be most helpful. Keep in mind that the ReplyToQMgr is in the MQMD and resolution needs to occur from that; it needs to happen at the QMGR2 level, where both clusters are known. Concrete, full suggestions appreciated.
MQ Definitions on the QMGR1A/B, where the REPLY is expected:
DEFINE QLOCAL('SERVER.REPLYQ') CLUSTER('CLUSMES')
On QMGR2 (the gateway for message hopping):
DEFINE NAMELIST(CLUSTEST) NAMES(CLUSMES,CLUSHUBS)
DEFINE QALIAS(SERVER.ALIAS.REPLYQ) TARGQ(SERVER.REPLYQ) CLUSTER(CLUSTEST) DEFBIND(NOTFIXED)
DEFINE QREMOTE(QMGR1A) RNAME(' ') RQMNAME(QMGR1A) XMITQ('') CLUSTER(CLUSHUBS)
DEFINE QREMOTE(QMGR1B) RNAME(' ') RQMNAME(QMGR1B) XMITQ('') CLUSTER(CLUSHUBS)
On QMGR3A/B there is the QALIAS(SERVER.ALIAS.REPLYQ) cluster queue. The gateway QMGR could not resolve the base queue: MQRC_UNKNOWN_ALIAS_BASE_Q (2082).
This was the configuration when trying to resolve it using the cluster.
When the request message is sent by the application, it would specify a ReplyToQMgr of either QMGR1A or QMGR1B and a ReplyToQueue with the name of a queue that is present on QMGR1A and QMGR1B; the reply queue need not be clustered.
On gateway queue manager QMGR2 you would define the following objects:
DEFINE QREMOTE(QMGR1A) RNAME('') RQMNAME(QMGR1A) XMITQ('') CLUSTER(CLUSHUBS)
DEFINE QREMOTE(QMGR1B) RNAME('') RQMNAME(QMGR1B) XMITQ('') CLUSTER(CLUSHUBS)
This allows any queue manager in the cluster CLUSHUBS to route reply messages back to QMGR1A and QMGR1B via the gateway queue manager QMGR2.
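For the replying application connected to QMGR3A/B, the put then only needs to name whatever the request's MQMD carries; something along these lines with the IBM MQ classes for Java (the API calls are standard, the surrounding structure is illustrative):

import com.ibm.mq.MQMessage;
import com.ibm.mq.MQPutMessageOptions;
import com.ibm.mq.MQQueue;
import com.ibm.mq.MQQueueManager;
import com.ibm.mq.constants.CMQC;

// Illustrative reply path: open whatever queue/queue manager the request's MQMD
// names and put the reply there. With the clustered QREMOTE(QMGR1A/B) definitions
// on QMGR2, the queue manager name resolves via the gateway even though QMGR3A/B
// is not a member of CLUSMES.
public class ReplySketch {
    static void reply(MQQueueManager qmgr, MQMessage request, String body) throws Exception {
        MQQueue replyQ = qmgr.accessQueue(request.replyToQueueName,
                                          CMQC.MQOO_OUTPUT,
                                          request.replyToQueueManagerName,
                                          null, null);
        MQMessage reply = new MQMessage();
        reply.correlationId = request.messageId;   // let the requester match the reply
        reply.writeString(body);
        replyQ.put(reply, new MQPutMessageOptions());
        replyQ.close();
    }
}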
If you want to limit which queues on QMGR1A and QMGR1B the queue managers in the CLUSHUBS cluster can put to, you would need to take a different approach. Let me know if that is what you need and I will update my answer with some suggestions.

Is this example tcp socket programming sequence of events safe?

I plan on having two services.
HTTP REST service written in Ruby
JSON RPC service written in Go
The Ruby service will open a TCP socket connection to a Go JSON RPC service. It'll do this for each incoming HTTP request it receives. It will send some data over the socket to the Go service and that service will subsequently send back the corresponding data back down the socket.
Go code
The Go service go would look something like this (simplified):
srv := new(service.App) // this would expose a Process method
rpc.Register(srv)

listener, err := net.Listen("tcp", ":8080")
if err != nil {
    // handle error
}

for {
    conn, err := listener.Accept()
    if err != nil {
        // handle error
    }
    go jsonrpc.ServeConn(conn)
}
Notice we serve the incoming connection using a goroutine, so we can handle requests concurrently.
Ruby code
Below is a simple snippet of Ruby code that demonstrates (in theory) the way I would send data to the Go service:
require "socket"
require "json"
socket = TCPSocket.new "localhost", "8080"
b = {
  :method => "App.Process",
  :params => [{ :Config => JSON.generate({ :foo => :bar }) }],
  :id     => "0"
}

socket.write(JSON.dump(b))

response = JSON.load socket.readline
My concern is: will this be a safe sequence of events?
I'm not asking if this will be 'thread safe', because i'm not worried about manipulating shared memory across the go routines. I'm more concerned around whether my Ruby HTTP service will get back the data it's expecting?
If I have two parallel requests coming into my HTTP service (or maybe the Ruby app is hosted behind a load balancer, so different instances of the HTTP service are handling multiple requests), then I could have instance A send the message Foo to the Go service while instance B sends the message Bar.
The business logic inside the Go service will return different responses depending on its input so I want to be sure that Ruby instance A gets back the correct response for Foo, and B gets back the correct response for Bar.
I assume a socket connection is more like a queue, in that if instance A makes a request to the Go service first and then B does, but B gets its response more quickly for whatever reason, then the Go service will write the response for B to the socket and instance A of the Ruby app will end up reading the wrong socket data (this is obviously just one possible scenario, considering that I could get lucky and have instance B read the socket data before instance A does).
Solutions?
I'm not sure if there is simple solution to this problem. Unless I don't use a TCP socket or RPC and instead rely on standard HTTP in the Go service. But I wanted the performance and less overhead of TCP.
I'm worried the design could get more complicated by maybe having to implement an external queue as a way of synchronising the responses with the Ruby service.
It maybe because the nature of my Ruby service is fundamentally synchronous (HTTP response/request) that I have no option but to switch to HTTP for the Go service.
But wanted to double check with the community first just in case I'm missing something obvious.
Yes this is safe if you create a new connection every time.
That said there are latent issues with your approach:
TCP connections are rather expensive to establish, so you probably want to re-use connections with a connection pool
If you make too many simultaneous requests you will exhaust ports/open file descriptors which will cause your program to crash
You don't have any timeouts in place, so it's possible to end up with orphaned TCP connections which never complete (either because of something bad on the Go side, or network problems)
I think you'd be better off using HTTP (despite the overhead) since libraries are already written to cope with these problems. HTTP is also much more debuggable since you can just curl an endpoint to test it.
Personally I'd probably go with gRPC.

How does ZeroMQ connect and bind work internally

I am experimenting with ZeroMQ. And I found it really interesting that in ZeroMQ it does not matter whether connect or bind happens first. I tried looking into the source code of ZeroMQ, but it was too big for me to find anything.
The code is as follows.
# client side
import zmq
ctx = zmq.Context()
socket = ctx.socket(zmq.PAIR)
socket.connect('tcp://*:2345') # line [1]
# make it wait here
# server side
import zmq
ctx = zmq.Context()
socket = ctx.socket(zmq.PAIR)
socket.bind('tcp://localhost:2345')
# make it wait here
If I start client side first, the server has not been started yet, but magically the code is not blocked at line [1]. At this point, I checked with ss and made sure that the client is not listening on any port. Nor does it have any open connection. Then I start the server. Now the server is listening on port 2345, and magically the client is connected to it. My question is how does the client know the server is now online?
The best place to ask your question is the ZMQ mailing list, as many of the developers (and founders!) of the library are active there and can answer your question directly, but I'll give it a try. I'll admit that I'm not a C developer so my understanding of the source is limited, but here's what I gather, mostly from src/tcp_connector.cpp (other transports are covered in their respective files and may behave differently).
Line 214 starts the open() method, and here looks to be the meat of what's going on.
To answer your question about why the code is not blocked at Line [1], see line 258. It's specifically calling a method to make the socket behave asynchronously (for specifics on how unblock_socket() works you'll have to talk to someone more versed in C, it's defined here).
On line 278, it attempts to make the connection to the remote peer. If it's successful immediately, you're good, the bound socket was there and we've connected. If it wasn't, on line 294 it sets the error code to EINPROGRESS and fails.
To see what happens then, we go back to the start_connecting() method on line 161. This is where the open() method is called from, and where the EINPROGRESS error is used. My best understanding of what's happening here is that if at first it does not succeed, it tries again, asynchronously, until it finds its peer.
I think the best answer is in the ZeroMQ wiki:
When should I use bind and when connect?
As a very general advice: use bind on the most stable points in your architecture and connect from the more volatile endpoints. For request/reply the service provider might be point where you bind and the client uses connect. Like plain old TCP.
If you can't figure out which parts are more stable (i.e. peer-to-peer) think about a stable device in the middle, where boths sides can connect to.
The question of bind or connect is often overemphasized. It's really just a matter of what the endpoints do and if they live long — or not. And this depends on your architecture. So build your architecture to fit your problem, not to fit the tool.
And
Why do I see different behavior when I bind a socket versus connect a socket?
ZeroMQ creates queues per underlying connection, e.g. if your socket is connected to 3 peer sockets there are 3 message queues.
With bind, you allow peers to connect to you, thus you don't know how many peers there will be in the future and you cannot create the queues in advance. Instead, queues are created as individual peers connect to the bound socket.
With connect, ZeroMQ knows that there's going to be at least a single peer and thus it can create a single queue immediately. This applies to all socket types except ROUTER, where queues are only created after the peer we connect to has acknowledged our connection.
Consequently, when sending a message to a bound socket with no peers, or a ROUTER with no live connections, there's no queue to store the message to.
When you call socket.connect('tcp://*:2345') or socket.bind('tcp://localhost:2345') you are not calling these methods directly on an underlying TCP socket. All of ZMQ's IO - including connecting/binding underlying TCP sockets - happens in threads that are abstracted away from the user.
When these methods are called on a ZMQ socket it essentially queues these events within the IO threads. Once the IO threads begin to process them they will not return an error unless the event is truly impossible, otherwise they will continually attempt to connect/reconnect.
This means that a ZMQ socket may return without an error even if socket.connect has not yet succeeded. In your example it would likely fail silently at first, then quickly reattempt and succeed once you run the server side of the script.
It may also allow you to send messages while in this state (depending on the state of the queue in this situation, rather than the state of the network) and will then attempt to transmit the queued messages once the IO threads are able to connect successfully. This also applies if a working TCP connection is lost later: the queues may continue to accept messages for the disconnected socket while the IO threads attempt to re-establish the lost connection in the background. If the endpoint takes a while to come back online, it should still receive its messages.
To better explain here's another example
<?php
$pid = pcntl_fork();
if ($pid)
{
    $context = new ZMQContext();
    $client  = new ZMQSocket($context, ZMQ::SOCKET_REQ);
    try
    {
        $client->connect("tcp://0.0.0.0:9000");
    }
    catch (ZMQSocketException $e)
    {
        var_dump($e);
    }
    $client->send("request");
    $msg = $client->recv();
    var_dump($msg);
}
else
{
    // in spawned process
    echo "waiting 2 seconds\n";
    sleep(2);
    $context = new ZMQContext();
    $server  = new ZMQSocket($context, ZMQ::SOCKET_REP);
    try
    {
        $server->bind("tcp://0.0.0.0:9000");
    }
    catch (ZMQSocketException $e)
    {
        var_dump($e);
    }
    $msg = $server->recv();
    $server->send("response");
    var_dump($msg);
}
The binding process does not begin until 2 seconds after the connecting process. But once the child process wakes and successfully binds, the REQ/REP transaction takes place without error.
jason#jason-VirtualBox:~/php-dev$ php play.php
waiting 2 seconds
string(7) "request"
string(8) "response"
If I were to replace tcp://0.0.0.0:9000 on the binding socket with tcp://0.0.0.0:2345, it would hang because the client is still trying to connect to tcp://0.0.0.0:9000, yet still without error.
But if I replace both with tcp://localhost:2345, I get an error on my system because it can't bind on localhost, making the call truly impossible.
object(ZMQSocketException)#3 (7) {
["message":protected]=>
string(38) "Failed to bind the ZMQ: No such device"
["string":"Exception":private]=>
string(0) ""
["code":protected]=>
int(19)
["file":protected]=>
string(28) "/home/jason/php-dev/play.php"
["line":protected]=>
int(40)
["trace":"Exception":private]=>
array(1) {
[0]=>
array(6) {
["file"]=>
string(28) "/home/jason/php-dev/play.php"
["line"]=>
int(40)
["function"]=>
string(4) "bind"
["class"]=>
string(9) "ZMQSocket"
["type"]=>
string(2) "->"
["args"]=>
array(1) {
[0]=>
string(20) "tcp://localhost:2345"
}
}
}
["previous":"Exception":private]=>
NULL
}
If you need real-time information about the state of the underlying sockets, you should look into socket monitors. Using socket monitors along with ZMQ poll allows you to poll for both socket events and queue events.
Keep in mind that polling a monitor socket using ZMQ poll is not the same as polling a ZMQ_FD resource via select, epoll, etc. The ZMQ_FD is edge-triggered and therefore doesn't behave the way you would expect when polling network resources, whereas a monitor socket within ZMQ poll is level-triggered. Also, monitor sockets are very lightweight, and the latency between the system event and the resulting monitor event is typically sub-microsecond.
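As a rough illustration of a socket monitor, assuming JeroMQ's Socket.monitor and ZMQ.Event helpers (endpoint names here are made up):

import org.zeromq.ZMQ;

// Attach a monitor to a socket and read connection events from the paired
// inproc PAIR socket. This is a sketch, not production code.
public class MonitorSketch {
    public static void main(String[] args) {
        ZMQ.Context context = ZMQ.context(1);

        ZMQ.Socket client = context.socket(ZMQ.PAIR);
        client.monitor("inproc://client.monitor", ZMQ.EVENT_ALL);

        ZMQ.Socket events = context.socket(ZMQ.PAIR);
        events.connect("inproc://client.monitor");

        client.connect("tcp://localhost:2345");     // returns immediately...

        ZMQ.Event event = ZMQ.Event.recv(events);   // ...the monitor reports what actually happens
        System.out.println("event " + event.getEvent() + " on " + event.getAddress());
    }
}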

Publisher finishes before subscriber and messages are lost - why?

Fairly new to zeromq and trying to get a basic pub/sub to work. When I run the following (sub starting before pub), the publisher finishes but the subscriber hangs, having not received all the messages - why?
I think the socket is being closed, but have the messages actually been sent? Is there a way of ensuring all messages are received?
Publisher:
import zmq
import random
import time
import tnetstring
context=zmq.Context()
socket=context.socket(zmq.PUB)
socket.bind("tcp://*:5556")
y=0
for x in xrange(5000):
    st = random.randrange(1,10)
    data = []
    data.append(random.randrange(1,100000))
    data.append(int(time.time()))
    data.append(random.uniform(1.0,10.0))
    s = tnetstring.dumps(data)
    print 'Sending ...%d %s' % (st,s)
    socket.send("%d %s" % (st,s))
    print "Messages sent: %d" % x
    y+=1
print '*** SERVER FINISHED. # MESSAGES SENT = ' + str(y)
Subscriber :-
import sys
import zmq
import tnetstring
# Socket to talk to server
context = zmq.Context()
socket = context.socket(zmq.SUB)
socket.connect("tcp://localhost:5556")
filter = "" # get all messages
socket.setsockopt(zmq.SUBSCRIBE, filter)
x=0
while True:
    topic,data = socket.recv().split()
    print "Topic: %s, Data = %s. Total # Messages = %d" % (topic,data,x)
    x+=1
In ZeroMQ, clients and servers always try to reconnect; they won't go down if the other side disconnects (because in many cases you'd want them to resume talking if the other side comes up again). So in your test code, the client will just wait until the server starts sending messages again, unless you stop recv()ing messages at some point.
In your specific instance, you may want to investigate using socket.close() and context.term(); term() will block until all the messages have been sent. You also have the slow-joiner problem: you can add a sleep after the bind, but before you start publishing. This works in a test case, but you will want to really understand what the proper solution is versus a band-aid.
You need to think of the PUB/SUB pattern like a radio. The sender and receiver are both asynchronous. The Publisher will continue to send even if no one is listening. The subscriber will only receive data if it is listening. If the network goes down in the middle, the data will be lost.
You need to understand this in order to design your messages. For example, if you design your messages to be "idempotent", it doesn't matter if you lose data. An example of this would be a status-type message: it doesn't matter if you missed any of the previous statuses, the latest one is correct, so message loss doesn't matter. The benefit of this approach is that you end up with a more robust and performant system. The downside is when you can't design your messages this way.
Your example includes a type of message that requires no loss. Another type of message would be transactional: for example, if you only sent the deltas of what changed in your system, you could not afford to lose messages. Database replication is often managed this way, which is why it is often so fragile. To provide guarantees, you need to do a couple of things. One is to add a persistent cache: each message sent is logged in the cache and assigned a unique id (preferably a sequence) so that the clients can determine whether they are missing a message. A second socket (ROUTER/REQ) is added so a client can request the missing messages individually. Alternatively, you could just use the secondary socket to request re-sending over the PUB/SUB; the clients would then all receive the messages again (which works for the multicast version) and ignore the messages they had already seen. NOTE: this follows the MAJORDOMO pattern found in the ZeroMQ guide.
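For the sequence-numbering part, a minimal JeroMQ publisher sketch (framing, port and the sleep are assumptions; the recovery socket described above is not shown):

import org.zeromq.ZMQ;

// Stamp every published message with a sequence number so subscribers can
// detect gaps and ask a recovery socket (not shown) for what they missed.
public class SequencedPubSketch {
    public static void main(String[] args) throws InterruptedException {
        ZMQ.Context context = ZMQ.context(1);
        ZMQ.Socket pub = context.socket(ZMQ.PUB);
        pub.bind("tcp://*:5556");
        Thread.sleep(500);                       // crude slow-joiner band-aid

        long seq = 0;
        for (int i = 0; i < 5000; i++) {
            pub.sendMore(Long.toString(seq++));  // frame 1: sequence number
            pub.send("payload " + i);            // frame 2: application data
        }

        pub.close();
        context.term();                          // with default linger, waits for queued messages
    }
}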
An alternative approach is to create your own broker using ROUTER/DEALER sockets. When the ROUTER socket saw each DEALER connect, it would store its ID. When the ROUTER needed to send data, it would iterate over all client IDs and publish the message to each. Each message should contain a sequence number so that the clients know which missing messages to request. NOTE: this is a sort of reimplementation of Kafka from LinkedIn.
