What would be the right ZMQ Pattern? - go

I am trying to build a ZeroMQ pattern where,
There can be many clients connecting to a single server endpoint
Server will distribute incoming client tasks to available workers (will be mapped to the number of cores on the server)
These tasks are long running (in hours) and need to perform a lot of local I/O
During each task execution (iteration) there will be data/messages (potentially in order of [GB]s) sent back and forth between the client and the server worker
Client and server workers need to know if there are failures/errors on the peer side, so that they can recover (retry) or shutdown gracefully and try later
Based on the above, I presume that the ROUTER/DEALER pattern would be useful. PUB/SUB is discarded as I need to know if the peer fails.
I tried using various combinations of the ROUTER/DEALER pattern but I am unable to ensure that multiple messages from a client reach the same worker within an iteration. I understand that I need to implement a broker/forwarder/device that routes the incoming messages to the right recipient/handler/worker. But I am unable to map the frontend and backend sockets in the broker. I am looking at MajorDomo pattern, but I guess there has to be a simpler broker model that could just route the messages to the assigned worker. (not really get into services)
I am looking for some examples, if there are any or any guidance on what I may be missing. I am trying to build this in Golang.

Q : "What would be the right ZMQ Pattern?"
Based on the complex composition of all the requirements posted under items 1 - 5, I dare to say, The Right would be NOT to use a single one of the standard, built-in, ZeroMQ trivial primitive Communication Archetype Patterns, but to rather create a multi-layered application-specific composition of a ( M + N + 1 hot-standby robust-enough?) (self-resilient?) Signalling-Messaging infrastructure, that covers all your current ( and possibly extensible for any future one ) application-level requirements, like depicted here for a way simpler distributed-computing use-case, where but a trivial remote-SigKILL was implemented.
Yes, the best would be to create ( and maintain ) your own formalised signalling, that the application level can handle and interact across -- like the heart-beating for detecting dead-worker(s) + permitting to re-instate such failed jobs right on-detected failures (most probably re-located and/or re-scheduled to take place & respective resources not statically pre-mapped, but where physically most feasible at the re-instating moment of time - so even more telemetry signalling will help you decide about the re-instating of the such failed micro-jobs).
ZeroMQ is a fabulous framework right for such complex signalling and messaging hierarchies, so your System Architect's imagination is the only ceiling in this concept.
ZeroMQ will take the rest and do all the hard work nice and easily.

Related

Parallel Req/Rep via Pub/Sub

I have multiple servers, at any point, one and only one will be the leader whcih can respond to a request, all others just drop the request. The issue is that the client does not know which server is the leader.
I have tried using a pub socket on the client for the parallel request out, however I can't work out the right semantics for the response. In terms of how to get the server to respond to that specific client.
A hacky solution which I have tried is to have a sub socket on the client to pub sockets on all the servers, with the leader responding by publishing a message with a filter such that it only goes to the client.
However I am unable to receive any responses this way, the server believes that it sent the message and the client believes it subscribed to "" but then doesn't receive anything...
So I am wondering whether there is a more proper way of doing this? I have thought that potentially a dealer/router with sending to a specific client would work, however I am unsure how to do that.
Essentially I am trying to do a standard Req/Rep however doing the req in parallel to all the nodes, rather than round robin.
UPDATE: By sending the routing id of the dealer in the pub request, making the remote call idempotent (just returning pre-computed results on repeated attempts), and then sending the result back via a router, with message filtering on the receiving side, it now works.
Q : " is (there) a more proper way of doing this? "
Yes.
Start to apply the Maslow's Hammer rule:
“When the only tool you have is a hammer, every problem begins to resemble a nail.”
In other words, do not try use (one) hammer for solving every problem. PUB/SUB-archetype was designed to serve those-and-only-those multi-party Formal-Communications-Pattern archetypes, where many SUB-scribe to .recv() some PUB-lisher(s) .send()-broadcast messages, but nothing other.
Similarly, REQ/REP-archetype was defined and implemented so as to serve one-and-only-one multi-party distributed Formal-Communications-Pattern ( and will obviously not meet any use-case, which has any single other or even a slightly different requirement ).
Users often require some special, non-trivial features, that obviously were not a part of the said trivial Formal-Communications-Pattern archetype primitives ( those ready-made blocks, made available in the ZeroMQ toolbox ).
It is architecs' / designers' role to define, analyse and implement any more complex user-specific distributed-behaviour definition ( a protocol ) and to implement it, most often using a layered combination of the ready-made ZeroMQ primitives.
If in doubts, take a sheet of paper and pencil, draw a small crowd of kids on playground and sketch their "shouts", their "listening", their "silence", "waiting" and "doubts", their many or few "replies", their "voting" and "anger" of not being voted for by friends, their fight for a place on the Sun and their "persistence" not to let others take theirs turn and let 'em sit on the "swing" after releasing the so far pleasurable swinging oneselves.
All this is the part of finding the right mix of ( protocol-orchestrated ) levels of control and levels of freedom to act.
There we get the new, distributed-behaviour, tailor-made for your specific use-case.
Probability to find a ready-made primitive tool to match and fulfill any user-specific use case is limitlessly close to Zero ( sure, unless one's own, user-specific use-case requirements match all those of the primitive archetype, but that is not a user-specific use-case, but a re-use of an already implemented archetype for the very same situation, that was foreseen by the ZeroMQ fathers, wasn't it? )
Again, welcome to the art of Zen-of-Zero.
Maylike to readthis and this and this

Is there any way to have an asynchronous client and server in ZeroMQ?

Is there any way to have an asynchronous client and server in ZeroMQ, using the same TCP-port and many sockets?
I already tried the ROUTER/ROUTER pattern, but with no luck.
The plan is to have an asymmetric connection with send and receive patterns among processors. So a Processor-entity will be a Client and also be a Server at the same time.
Is there any way to have an asynchronous client and server in ZeroMQ, using the same TCP-port and many sockets ?
Yes, there is.
As a preventive step, in other words, before getting into troubles, best review the main conceptual differences in [ ZeroMQ hierarchy in less than a five seconds ] or other posts and discussions here.
The Yes above means, one-.bind()-many-.connect()-s, which composition still uses a just one <transport-class>://<a-class-specific-address>, that for a tcp:// transport-class on IPv4 means that one tcp://A.B.C.D:port# occupied for this whole 1:MANY-sockets-composition.
For obvious reasons, more complex compositions, like many-.bind()-s-many-.connect()-s, are possible, where feasible, given both the ZeroMQ infrastructure topology options and also socket-"in-band"-message-routing features are thus setup and used for smart-decisions on actual message-flow mechanics.

How to get data a ZMQ_PUB service?

Can I publisher service receive data from an external source and send them to the subscribers?
In the wuserver.cpp example, the data are generated from the same script.
Can I write a ZMQ_PUBLISHER entity, which receives data from external data source / application ... ?
In this affirmation:
There is one more important thing to know about PUB-SUB sockets: you do not know precisely when a subscriber starts to get messages. Even if you start a subscriber, wait a while, and then start the publisher, the subscriber will always miss the first messages that the publisher sends. This is because as the subscriber connects to the publisher (something that takes a small but non-zero time), the publisher may already be sending messages out.
Does this mean, that a PUB-SUB ZeroMQ pattern is performed to a best effort - UDP style?
Q1: Can I write a ZMQ_PUBLISHER entity, which receives data from external data source/application?
A1: Oh sure, this is why ZeroMQ is so helping us in designing smart distributed-systems. Just imagine the PUB-side process to also have other { .bind() | .connect() }-calls, so as to establish such other links to data-feeder(s), and you are done to operate the wished to have scheme. In distributed-systems this gives you a new freedom to smart integrate heterogeneous systems to talk to each other in a very efficient way.
Q2:Does this mean, that a PUB-SUB ZeroMQ pattern is performed to a best effort - UDP style?
A2: No, it has another meaning. The newly declared subscriber entities at some uncertain moment start to negotiate their respective subscription-topic filtering and such a ( distributed ) process takes some a-priori unknown time. Unless until the new / changed topic-filter policy was established, there is nothing to go into the SUB-side exgress interface to meet a .recv()-call, so no one can indeed tell, when that will get happened, can he?
On a higher level, there is another well known dichotomy of ZeroMQ -- Zero-Warranty Principle -- expect to either get delivered a complete message or none at all, which prevents the framework users from a need to handle any kind of damaged / inconsistent message-payloads. Either OK, or None. That's a great warranty. The more for distributed-systems.

Fault tolerant redundancy

This might result in biased and opinion based answers, if so I'll close the question but...
I have a rather basic requirement of improving our up-time and speed. As part of this I'm looking at the two main competing approaches, traditional pub/sub and akka.net. We don't have any issues currently or expect to have any need for concurrency control.
What we have is several basic workflows which are data analysis, manipulation and persistence of the result:
Step 1) Capture work to be done (IE what objects need to do some work)
Step 2) Execute that work load and produce a result
Step 3) Save result
Using traditional pub/sub This seems rather easy. Have micro services for each step, push a message at the end of each step with the data required (or more to the point data that might be useful) for the next step. Using any off the self message queue/topic/subscription software this provides a nice ability to:
1) geographically spread the loads around the world to where the source data is located
2) increase the number of "workers" that subscribe to increase through put
3) push to something central that can support the idea of connecting "workers" with a minimal learning curve
4) any component (or set of workers for a component) further down the workflow has/have a queue where the messages queue and wait for said component to come back online (even if the whole component disconnects)
5) adding new components that do something new and different, is as easy as registering a new subscription to a topic.
It's all pretty much out of the box easy joy... assuming sensible aggregate and bounded context patterns are adhered to here. I'm not seeking advise of how to write good distributed code, I'm looking for how deploy it, support it, debug rouge/missing/corrupt messages etc. Which is why I want to know what Akka.net offers.
I've seen there's Akka.net clustering . It may or may not be production ready yet, but best I understand what it can/could do for us.
So the main questions I have are:
1) Where are messages stored prior to arriving? So long as a publisher has access to the messaging bus/software endpoint, any such software will store and hold messages waiting for a subscriber to connect and pick up it's messages (obvious assumptions about the subscription having already been registered so the messages queue for it). How does Akka.net cluster handle all of this?
2) What tooling exists for operational support of these queues and mailboxes in Akka.net cluster? What tools give an operator insight into what is in a mailbox received but waiting to be processed and what tools exist for viewing what has been "published" and not yet "received"? Most competing Pub/Sub software has operational tools so I'm looking for some comparison here.
3) How do you debug rouge, missing or corrupt messages. We all know we should trust our software but a bad message can cause a system to spiral out of control, so how would I eject a bad message from the system? How can I modify a message so it's going to behave differently because the business needs something fixed at 3:30 am? How can I answer "where is my message" with "it IS in the system and it IS waiting to be received" or "it has been received and just in the mailbox"?
4) If a component goes down HARD (recycle, hardware failure what ever) what will restore the mailboxes, queues etc? Any message that's actually being processed has an acceptable lost tolerance, but 1000 messages in a mailbox getting lost isn't so tolerable, what persistence and tolerance is there?
5) The light review I've done appears to advocate for a supervisor pattern to be built into your software to marshal messages around (I'm guessing to manage and release concurrency locks?). Given concurrency isn't an issue here, what out of the box pub/sub mechanism do you support that isn't basic message remoting between two (or x internally defined in code) components? Again with subscriptions and topics in most pub/sub software, your first object pushes a message (it's central so it's a potential single point of failure) but that component (and neither doesn't any other code) have to be aware of what will consume that message. It's expansion nirvana compared the old school way where we manually pushed a message from one object to the next (and to the next), rebuilding or recompiling for each new class that same message had to go to. I'm keen to not have to build our own message router.
6) When all instances of a particular component go offline (say step 3 above) what remembers that there's actually something there that needs to queue and remember those messages (say the ones pushed blindly from step 2 above)? In other software, until you delete the subscription the messages keep queuing up based on what ever rules are defined for TTL etc. What is provided for this?

How can I limit total concurrent subscriber connections to a ZeroMQ publisher endpoint?

When building a pub-sub service using ZeroMQ on a Linux system, is there any way to enforce concurrent subscriber limits?
For example, I might want to create a ZeroMQ publisher service on a resource-limited system, and want to prevent overloading the system by setting a limit of, say, 100 concurrent connections to the tcp publisher endpoint. After that limit is reached, all subsequent connection attempts from ZeroMQ subscribers would fail.
I understand ZeroMQ doesn't provide notifications about connect/disconnect, but I've been looking for socket options that might allow such limits -- so far, no luck.
Or is this something that should be handled at some other level, perhaps within the protocol?
Yes, ZeroMQ is a Can-Do messaging framework:
Besides the trivial Formal Communication Pattern Framework elements ( the library primitives ), the strongest powers behind the ZeroMQ is the ability to develop one's own messaging system(s).
In your case, it is enough to enrich the scene with a few additional things ... a SUB-process -> PUB-process message-flow-channel, so as to allow PUB-side process to count a number of SUB-process instances concurrently connected and to allow for a disconnect ( a step delegated rather "back" to a SUB-process side suicside move, as the classical PUB-process, intentionally, has no instrumentation to manage subscriptions ) once a limit is dynamically achieved.
Plus add some dynamics for the inter-node signalling to start re-counting and/or to equip the SUB-process side(s) with a self-advertising mechanism to push-keepAliveSIG-s to the PUB-side and expect this signalling to be a weak and informative-only indication as there are many real-world collisions, where decentralised node simply fail to deliver a "guaranteed-delivery" message(s) and a well designed, distributed, low-latency, high-performance system has to cope well with this reality and have the self-healing state-recovery policies designed and in-built into own behaviour.
( Fig. courtesy imatix/ZeroMQ )
The ZeroMQ library can be thought of as a very powerful LEGO-tool-box for designing cool distributed systems, than a ready-made / batteries-included, stiff, quasi-solution-for-just-a-few-academic-cases ( well, it might be considered such, but just for some no-brainer's life, while our lives are much more colourful & teasing, aren't they ? )
So, "How to?"
Worth, definitely worth a few days to read the both of Pieter Hintjens' books & a few weeks for shifting one's mind to start designing with the ZeroMQ full-powers on one's side.
With just a few Python add-on habits ( a zmq.Context() early-setup, and not forgetting a finally: aContext.term() )
There's no way that I'm aware of to configure ZMQ to limit connections automatically... however, you have other options to accomplish what you're looking for. Perhaps the "traditional" way to accomplish this is with a second set of "network communication" sockets... perhaps REQ/REP from subscriber to publisher, asking for permission to connect.
You also have the option, depending on your version of ZMQ (and I've never used it and I can't find it in 5 minutes of searching, so I don't know how recent your version must be) to use XPUB/XSUB sockets, which can accomplish bi-directional communication. You can connect with XSUB, send a subscribe request, then receive a positive or negative response (you might have to play with your subscriber topics to communicate directly with just the single subscriber, I'm not sure), and react accordingly.
Either way, you'll be allowing a connection of some sort between the two systems and then either allowing it or terminating it depending on the situation. This could be less than completely ideal since you'll have to carve out a little overhead to handle connections that you'll be refusing... let's say you're saturated at 100 clients and all of a sudden get 100 new subscribe requests... you may or may not be able to cope with that sort of burst traffic.
You can test out the overhead in alternative communication mediums... like you could publish a webservice that indicates subscriber status that a client could check first, but that may not be any better to have clients connecting that way.
If you're absolutely at the limit of your resources, you'll have to set up a second server to handle subscriber status:
Server 1 is your publisher. You could set it up with a PUB socket and a REP socket.
Server 2 is your status server. It has a REQ socket. Have it subscribe to something like "system-status" or some such thing as that. It will also have your mechanism for communicating with new subscribers, be that a ZMQ socket or a web service or whatever else.
A client will request status from your status server. The status server will send a request to your publisher, which will increment it's subscriber count and reply with success, or keep its subscriber count and reply with failure. This success or failure will be communicated back to the subscriber, which will use that information to connect or not.
Disconnections will have to be communicated in a similar way... and you'll have to use some sort of heartbeating round-robin to confirm clients weren't a victim of catastrophic failure.
This will allow your publisher to make intelligent choices about whether it has resources or not. If you just want to set a static number, you don't even need the connection between the status server and the publisher, you can just keep count on the status server... but just to ensure the overall health of the network then it's probably best not to go that simplistic route.
Anyway, those are just some ideas to accomplish what you're looking for. ZMQ gives you options with which to craft your solutions moreso than actual solutions.

Resources