How to work around zeromq late-joiner problem in a proxy (xpub/xsub)?

I have read all the salient posts and the zeromq guide on the topic of the pub/sub late joiner, and hopefully I simply missed the answer, but just in case I haven't, I have two questions about the zeromq proxy in the context of the late joiner:
Does the zeromq proxy with the suggested XSUB/XPUB sockets also suffer from the late-joiner problem, i.e. are the first few pub messages of a new publisher dropped?
If so, what is the accepted solution to ensure that even the first published message is received by subscribers with a matching topic? (My latest information is to sleep...)
I don't believe it is pertinent to the question, but just in case, here are:
My proxy's run method, which runs in its own thread; it starts a capture thread if DEBUG is true to log all messages; if no matching subscription exists, nothing is captured.
The proxy works fine, including the capture socket. However, I am now adding functionality where a new publisher is started in a thread and will immediately start to publish messages. Hence my question: if a message is published straight away, will it be dropped?
def run(self):
    msg = debug_log(self, f"{self.name} thread running...")
    debug_log(self, msg + "binding sockets...")
    xpub = self.zmq_ctx.socket(zmq.XPUB)
    xpub.bind(sys_conf.system__zmq_broker_xpub_addr)
    xsub = self.zmq_ctx.socket(zmq.XSUB)
    xsub.bind(sys_conf.system__zmq_broker_xsub_addr)
    if self.loglevel == DEBUG and sys_conf.system__zmq_broker_capt_addr:
        debug_log(self, msg + " done, now starting broker with message "
                              "capture and logging.")
        capt = self.zmq_ctx.socket(zmq.PAIR)
        capt.bind(sys_conf.system__zmq_broker_capt_addr)
        capt_th = Thread(
            name=self.name + "_capture_thread", target=self.capture_run,
            args=(self.zmq_ctx,), daemon=True
        )
        capt_th.start()
        capt.recv()  # wait for peer thread to start and check in
        debug_log(self, msg + "capture peer synchronised, proceeding.")
        zmq.proxy(xpub, xsub, capt)  # blocks until thread terminates
    else:
        debug_log(self, msg + " starting broker.")
        zmq.proxy(xpub, xsub)  # blocks until thread terminates
def capture_run(self, ctx):
    """ capture socket's thread's run method. """
    debug_log(self, f"{self.name} capture thread running.")
    sock = ctx.socket(zmq.PAIR)
    sock.connect(sys_conf.system__zmq_broker_capt_addr)
    sock.send(b"")  # ack message to calling thread
    while True:
        try:  # assume we're getting topic string followed by python object
            topic = sock.recv_string()
            obj = sock.recv_pyobj()
        except Exception:  # if not, simply log message in bytes
            topic = None
            obj = sock.recv()
        debug_log(self, f"{self.name} capture_run received topic {topic}, "
                        f"obj {obj}.")
My users of the proxy (they are all both subscribers and publishers) run in different threads and/or processes from the proxy:
...
# establish zmq subscriber socket and connect to broker
self._evm_subsock = self._zmq_ctx.socket(zmq.SUB)
self.subscribe_topics()
self._evm_subsock.connect(sys_conf.system__zmq_broker_xpub_addr)
# establish pub socket and connect to broker
self._evm_pub_sock = self._zmq_ctx.socket(zmq.PUB)
self._evm_pub_sock.connect(sys_conf.system__zmq_broker_xsub_addr)
debug_log(self, msg + " Connected to pub/sub broker.")
...
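For what it's worth, a commonly suggested alternative to sleeping is to give a late-joining publisher an XPUB socket instead of PUB: the broker's XSUB side forwards the subscriptions it already knows about to a newly connected publisher, so the publisher can wait until its topic shows up before sending. A rough sketch (not from the original post), reusing the sys_conf address above and a hypothetical topic b"my_topic":

import zmq

ctx = zmq.Context.instance()
pub = ctx.socket(zmq.XPUB)  # XPUB so that forwarded subscription messages are visible
pub.connect(sys_conf.system__zmq_broker_xsub_addr)

# Wait for the broker to forward a subscription for our topic:
# subscribe messages arrive as b'\x01' + topic, unsubscribes as b'\x00' + topic.
while True:
    msg = pub.recv()
    if msg[:1] == b"\x01" and msg[1:].startswith(b"my_topic"):
        break

pub.send_multipart([b"my_topic", b"first message"])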

Related

pynng: how to setup, and keep using, multiple Contexts on a REP0 socket

I'm working on a "server" thread, which takes care of some IO calls for a bunch of "clients".
The communication is done using pynng v0.5.0, the server has its own asyncio loop.
Each client "registers" by sending a first request, and then loops receiving the results and sending back READY messages.
On the server, the goal is to treat the first message of each client as a registration request, and to create a dedicated worker task which will loop doing IO stuff, sending the result and waiting for the READY message of that particular client.
To implement this, I'm trying to leverage the Context feature of REP0 sockets.
Side notes
I would have liked to tag this question with nng and pynng, but I don't have enough reputation.
Although I'm an avid consumer of this site, it's my first question :)
I do know about the PUB/SUB pattern, let's just say that for self-instructional purposes, I chose not to use it for this service.
Problem:
After a few iterations, some READY messages are intercepted by the registration coroutine of the server, instead of being routed to the proper worker task.
Since I can't share the code, I wrote a reproducer for my issue and included it below.
Worse, as you can see in the output, some result messages are sent to the wrong client (ERROR:root:<Worker 1>: worker/client mismatch, exiting.).
It looks like a bug, but I'm not entirely sure I understand how to use the contexts correctly, so any help would be appreciated.
Environment:
winpython-3.8.2
pynng v0.5.0+dev (46fbbcb2), with nng v1.3.0 (ff99ee51)
Code:
import asyncio
import logging
import pynng
import threading

NNG_DURATION_INFINITE = -1
ENDPOINT = 'inproc://example_endpoint'


class Server(threading.Thread):
    def __init__(self):
        super(Server, self).__init__()
        self._client_tasks = dict()

    @staticmethod
    async def _worker(ctx, client_id):
        while True:
            # Remember, the first 'receive' has already been done by self._new_client_handler()
            logging.debug(f"<Worker {client_id}>: doing some IO")
            await asyncio.sleep(1)
            logging.debug(f"<Worker {client_id}>: sending the result")
            # I already tried sending synchronously here instead, just in case the issue was related to that
            # (but it's not)
            await ctx.asend(f"result data for client {client_id}".encode())
            logging.debug(f"<Worker {client_id}>: waiting for client READY msg")
            data = await ctx.arecv()
            logging.debug(f"<Worker {client_id}>: received '{data}'")
            if data != bytes([client_id]):
                logging.error(f"<Worker {client_id}>: worker/client mismatch, exiting.")
                return

    async def _new_client_handler(self):
        with pynng.Rep0(listen=ENDPOINT) as socket:
            max_workers = 3 + 1  # Try setting it to 3 instead, to stop creating new contexts => now it works fine
            while await asyncio.sleep(0, result=True) and len(self._client_tasks) < max_workers:
                # The issue is here: at some point, the existing client READY messages get
                # intercepted here, instead of being routed to the proper worker context.
                # The intent here was to open a new context only for each *new* client, I was
                # assuming that a 'recv' on older worker contexts would take precedence.
                ctx = socket.new_context()
                data = await ctx.arecv()
                client_id = data[0]
                if client_id in self._client_tasks:
                    logging.error(f"<Server>: We already have a task for client {client_id}")
                    continue  # just let the client block on its 'recv' for now
                logging.debug(f"<Server>: New client : {client_id}")
                self._client_tasks[client_id] = asyncio.create_task(self._worker(ctx, client_id))
            await asyncio.gather(*list(self._client_tasks.values()))

    def run(self) -> None:
        # The "server" thread has its own asyncio loop
        asyncio.run(self._new_client_handler(), debug=True)


class Client(threading.Thread):
    def __init__(self, client_id: int):
        super(Client, self).__init__()
        self._id = client_id

    def __repr__(self):
        return f'<Client {self._id}>'

    def run(self):
        with pynng.Req0(dial=ENDPOINT, resend_time=NNG_DURATION_INFINITE) as socket:
            while True:
                logging.debug(f"{self}: READY")
                socket.send(bytes([self._id]))
                data_str = socket.recv().decode()
                logging.debug(f"{self}: received '{data_str}'")
                if data_str != f"result data for client {self._id}":
                    logging.error(f"{self}: client/worker mismatch, exiting.")
                    return


def main():
    logging.basicConfig(level=logging.DEBUG)
    threads = [Server(),
               *[Client(i) for i in range(3)]]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == '__main__':
    main()
Output:
DEBUG:asyncio:Using proactor: IocpProactor
DEBUG:root:<Client 1>: READY
DEBUG:root:<Client 0>: READY
DEBUG:root:<Client 2>: READY
DEBUG:root:<Server>: New client : 1
DEBUG:root:<Worker 1>: doing some IO
DEBUG:root:<Server>: New client : 0
DEBUG:root:<Worker 0>: doing some IO
DEBUG:root:<Server>: New client : 2
DEBUG:root:<Worker 2>: doing some IO
DEBUG:root:<Worker 1>: sending the result
DEBUG:root:<Client 1>: received 'result data for client 1'
DEBUG:root:<Client 1>: READY
ERROR:root:<Server>: We already have a task for client 1
DEBUG:root:<Worker 1>: waiting for client READY msg
DEBUG:root:<Worker 0>: sending the result
DEBUG:root:<Client 0>: received 'result data for client 0'
DEBUG:root:<Client 0>: READY
DEBUG:root:<Worker 0>: waiting for client READY msg
DEBUG:root:<Worker 1>: received 'b'\x00''
ERROR:root:<Worker 1>: worker/client mismatch, exiting.
DEBUG:root:<Worker 2>: sending the result
DEBUG:root:<Client 2>: received 'result data for client 2'
DEBUG:root:<Client 2>: READY
DEBUG:root:<Worker 2>: waiting for client READY msg
ERROR:root:<Server>: We already have a task for client 2
Edit (2020-04-10): updated both pynng and the underlying nng.lib to their latest version (master branches), still the same issue.
After digging into the sources of both nng and pynng, and confirming my understanding with the maintainers, I can now answer my own question.
When using a context on a REP0 socket, there are a few things to be aware of.
As advertised, send/asend() is guaranteed to be routed to the same peer you last received from.
The data from the next recv/arecv() on this same context, however, is NOT guaranteed to be coming from the same peer.
Actually, the underlying nng call to rep0_ctx_recv() merely reads the next socket pipe with available data, so there's no guarantee that said data is coming from the same peer as the last recv/send pair.
In the reproducer above, I was concurrently calling arecv() both on a new context (in the Server._new_client_handler() coroutine), and on each worker context (in the Server._worker() coroutine).
So what I had previously described as the next request being "intercepted" by the main coroutine was merely a race condition.
One solution would be to only receive from the Server._new_client_handler() coroutine, and have the workers only handle one request. Note that in this case, the workers are no longer dedicated to a particular peer. If this behavior is needed, the routing of incoming requests must be handled at application level.
import random  # needed for the randomised sleep below

class Server(threading.Thread):
    @staticmethod
    async def _worker(ctx, data: bytes):
        client_id = int.from_bytes(data, byteorder='big', signed=False)
        logging.debug(f"<Worker {client_id}>: doing some IO")
        await asyncio.sleep(1 + 10 * random.random())
        logging.debug(f"<Worker {client_id}>: sending the result")
        await ctx.asend(f"result data for client {client_id}".encode())

    async def _new_client_handler(self):
        with pynng.Rep0(listen=ENDPOINT) as socket:
            while await asyncio.sleep(0, result=True):
                ctx = socket.new_context()
                data = await ctx.arecv()
                asyncio.create_task(self._worker(ctx, data))

    def run(self) -> None:
        # The "server" thread has its own asyncio loop
        asyncio.run(self._new_client_handler(), debug=False)
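If dedicated per-client workers are still wanted, the application-level routing mentioned above could be sketched roughly as follows (illustrative only, names invented here): a single coroutine owns every arecv() and dispatches each (context, request) pair to a per-client asyncio.Queue, so each worker still replies on the exact context its request arrived on.

import asyncio
import pynng

ENDPOINT = 'inproc://example_endpoint'

async def dispatcher():
    queues = {}  # client_id -> asyncio.Queue of (ctx, data) pairs
    with pynng.Rep0(listen=ENDPOINT) as socket:
        while True:
            ctx = socket.new_context()
            data = await ctx.arecv()            # the only place that ever receives
            client_id = data[0]
            if client_id not in queues:
                queues[client_id] = asyncio.Queue()
                asyncio.create_task(client_worker(client_id, queues[client_id]))
            await queues[client_id].put((ctx, data))  # route to the dedicated worker

async def client_worker(client_id, queue):
    while True:
        ctx, data = await queue.get()
        # ... do the per-client IO here ...
        await ctx.asend(f"result data for client {client_id}".encode())
        # (contexts are left open for brevity; close them once the exchange is done)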

NATS does not raise an exception when disconnecting

I use an almost standard example of using NATS with Python asyncio. I want to receive a message, process it, and send the result
back to the queue, but when NATS disconnects (e.g. gnatsd is rebooted), no exception is raised. I even added await asyncio.sleep(1, loop=loop) to switch context
so that the disconnect -> reconnect exception would be thrown, but this does not happen. What am I doing wrong? Maybe it's a bug?
import asyncio
from nats.aio.client import Client as NATS
import time


async def run(loop):
    nc = NATS()
    await nc.connect(io_loop=loop)

    async def message_handler(msg):
        subject = msg.subject
        reply = msg.reply
        data = msg.data.decode()
        print("Received a message on '{subject} {reply}': {data}".format(
            subject=subject, reply=reply, data=data))
        # Working
        time.sleep(10)
        # If NATS disconnects at this point, no exception is raised
        # and an attempt to send a message via nc.publish is still made
        await asyncio.sleep(2, loop=loop)
        print("UNSLEEP")
        await nc.publish("test", "test payload".encode())
        print("PUBLISHED")

    # Simple publisher and async subscriber via coroutine.
    await nc.subscribe("foo", cb=message_handler)

    while True:
        await asyncio.sleep(1, loop=loop)

    await nc.close()


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(run(loop))
    loop.close()
NATS is built on top of TCP.
TCP has no reliable disconnection signal by definition.
To work around this, any messaging system should use some kind of ping message and drop the connection if a timeout occurs.
Strictly speaking, you will get a disconnection event eventually, but it may take up to two hours (depending on your OS settings).
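For example, with the asyncio NATS client you can shorten the ping interval and register connection callbacks instead of waiting for an exception inside the handler. A rough sketch (parameter values are illustrative):

import asyncio
from nats.aio.client import Client as NATS

async def run(loop):
    nc = NATS()

    async def disconnected_cb():
        print("NATS disconnected")

    async def reconnected_cb():
        print("NATS reconnected")

    await nc.connect(
        io_loop=loop,
        ping_interval=5,           # send a PING every 5 seconds instead of the default
        max_outstanding_pings=2,   # drop the connection after 2 missed PONGs
        disconnected_cb=disconnected_cb,
        reconnected_cb=reconnected_cb,
    )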

Bunny::AccessRefused message when trying to read messages

I'm trying to read messages from a queue using Bunny. I only have read permissions on the RabbitMQ server but it seems the code I'm using tries to create the queue - though I can see the queue already exists with queue_exists?().
There must be a way in Bunny whereby one can simply read messages off an existing queue? Here's the code I'm using:
require 'bunny'

class ExampleConsumer < Bunny::Consumer
  def cancelled?
    @cancelled
  end

  def handle_cancellation(_)
    @cancelled = true
  end
end

conn = Bunny.new("amqp://xxx:xxx@xxx", automatic_recovery: false)
conn.start
ch = conn.channel
q = ch.queue("a_queue")
consumer = ExampleConsumer.new(ch, q)
When I execute the above I receive:
/Users/jamessmith/.rvm/gems/ruby-1.9.3-p392/gems/bunny-1.7.1/lib/bunny/channel.rb:1915:in `raise_if_continuation_resulted_in_a_channel_error!': ACCESS_REFUSED - access to queue 'a_queue' in vhost '/' refused for user 'xxx' (Bunny::AccessRefused)
In most RabbitMQ configurations I've seen, the consumer will have permission to create the queue that it needs.
If you must have your permissions set up so that you can't create the queue from your consumer, I'd suggest opening an issue with the Bunny gem; it doesn't look like that is supported.

Ruby + AMQP: processing queue in parallel

Since most of my tasks depend on the network, I want to process my queue in parallel, not just one message at a time.
So, I'm using the following code:
#!/usr/bin/env ruby
# encoding: utf-8

require "rubygems"
require 'amqp'

EventMachine.run do
  connection = AMQP.connect(:host => '127.0.0.1')
  channel = AMQP::Channel.new(connection)
  channel.prefetch 5

  queue = channel.queue("pending_checks", :durable => true)
  exchange = channel.direct('', :durable => true)

  queue.subscribe(:ack => true) do |metadata, payload|
    time = rand(3..9)
    puts 'waiting ' + time.to_s + ' for message ' + payload
    sleep(time)
    puts 'done with ' + payload
    metadata.ack
  end
end
Why is it not using my prefetch setting? I assumed it would get 5 messages and process them in parallel, no?
Prefetch is the maximum number of messages that may be sent to you in advance before you ack.
In other words, the prefetch size does not limit the transfer of single messages to a client, only the sending in advance of more messages while the client still has one or more unacknowledged messages. (From the AMQP docs.)
QoS Prefetching Messages
RabbitMQ AMQP Reference
EventMachine is single threaded and event based. For parallel jobs on different threads or processes, see EM::Deferrable, then Thread or spawn.
Also see Hot Bunnies, a fast DSL on top of the RabbitMQ Java client:
https://github.com/ruby-amqp/hot_bunnies
(Thanks to Michael Klishin on Google Groups and stoyan on Blogger for the info.)

ZeroMQ High Water Mark Doesn't Work

When I read "Durable Subscribers and High-Water Marks" in the zmq guide, it said "The HWM causes ØMQ to drop messages it can't put onto the queue", but no messages were lost when I ran the example. I hit ctrl+c to terminate durasub.py and then resumed it.
The example is the Python version from the zmq guide; other languages are the same.
durasub.py
import zmq
import time

context = zmq.Context()

subscriber = context.socket(zmq.SUB)
subscriber.setsockopt(zmq.IDENTITY, "Hello")
subscriber.setsockopt(zmq.SUBSCRIBE, "")
subscriber.connect("tcp://localhost:5565")

sync = context.socket(zmq.PUSH)
sync.connect("tcp://localhost:5564")
sync.send("")

while True:
    data = subscriber.recv()
    print data
    if data == "END":
        break
durapub.py
import zmq
import time

context = zmq.Context()

sync = context.socket(zmq.PULL)
sync.bind("tcp://*:5564")

publisher = context.socket(zmq.PUB)
publisher.bind("tcp://*:5565")
publisher.setsockopt(zmq.HWM, 2)

sync_request = sync.recv()

for n in xrange(10):
    msg = "Update %d" % n
    publisher.send(msg)
    time.sleep(1)

publisher.send("END")
The suggestion above is valid, but doesn't properly address the problem in this particular code.
The real problem here is that in durapub.py you call publisher.setsockopt(zmq.HWM, 2) AFTER calling publisher.bind. You should call setsockopt BEFORE bind or connect.
Please refer to the 0MQ API documentation for setsockopt:
Caution: All options, with the exception of ZMQ_SUBSCRIBE, ZMQ_UNSUBSCRIBE and ZMQ_LINGER, only take effect for subsequent socket bind/connects.
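For completeness, the corrected ordering in durapub.py would look like this (note that on zeromq 3.x and later the single zmq.HWM option is split into zmq.SNDHWM and zmq.RCVHWM):

publisher = context.socket(zmq.PUB)
publisher.setsockopt(zmq.HWM, 2)  # set the HWM before bind/connect so it takes effect
publisher.bind("tcp://*:5565")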
