Multiple Paho MQTT clients or only one that routes - spring

I have a Spring Boot service that consumes messages from different device types via an MQTT server; each device type has its own message format. At the moment there are five types, each handled by a separate component in the application. In total there are approximately 15 000 messages per second. The messages aren't spread evenly over the topics: one topic alone carries 10 000 of them. I've tried to find some best practices and performance info around the Paho client but there isn't much out there.
How is the Paho client performing during high loads?
Should I stay with one Paho client that subscribes to all messages and then internally route them to correct consumers, or should I let each consumer component create its own client?
Right now I'm leaning towards the second option. More threads will be created but less code (no need to create routing etc) and better separation. But on the other hand there are multiple clients running.

"Right now I'm leaning towards the second option. More threads will be created but less code (no need to create routing etc) and better separation."
I agree, the second approach is the best. Always go with the KISS (Keep It Simple) principle.
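For comparison, here is a rough idea of the routing code the single-client option would need. This is only a sketch in Go rather than the Java/Paho API: the `Router` type, the topic prefixes and the handler signature are all hypothetical, and it assumes the prefixes of the device types do not overlap.

```go
package main

import (
	"fmt"
	"strings"
)

// Handler processes the payload of one device type.
type Handler func(topic string, payload []byte)

// Router maps a topic prefix (hypothetical, e.g. "devices/typeA/") to the
// component that understands that device type's message format.
type Router struct {
	routes map[string]Handler
}

func NewRouter() *Router { return &Router{routes: make(map[string]Handler)} }

// Handle registers h for every topic that starts with prefix.
func (r *Router) Handle(prefix string, h Handler) { r.routes[prefix] = h }

// Dispatch calls the handler whose prefix matches the topic and reports
// whether one was found. Prefixes are assumed not to overlap.
func (r *Router) Dispatch(topic string, payload []byte) bool {
	for prefix, h := range r.routes {
		if strings.HasPrefix(topic, prefix) {
			h(topic, payload)
			return true
		}
	}
	return false
}

func main() {
	r := NewRouter()
	r.Handle("devices/typeA/", func(t string, p []byte) { fmt.Println("A:", t, string(p)) })
	r.Handle("devices/typeB/", func(t string, p []byte) { fmt.Println("B:", t, string(p)) })

	// In the single-client design this would be called from the one
	// message callback that receives every subscribed topic.
	r.Dispatch("devices/typeA/42", []byte("hello"))
	fmt.Println("routed unknown topic:", r.Dispatch("other/1", nil))
}
```

With one client per component none of this is needed, which is the "less code" argument from the question.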


Performance of Nats Jetstream

I'm trying to understand how Nats Jetstream scales and have a couple of questions.
How efficient is subscribing by subject to historic messages? For example, let's say I have a stream foo that consists of 100 million messages with one subject and then a single message with the subject foo.baz. If I then make a subscription to foo.baz from the start of the stream, will something on the server have to perform a linear scan of all messages in foo, or will it be able to immediately seek to the foo.baz message?
How well does the system horizontally scale? I ask because I'm having issues getting Jetstream to scale much above a few thousand messages per second, regardless of how many machines I throw at it. Test parameters are as follows:
Nats Server 2.6.3 running on 4 core 8GB nodes
Single Stream replicated 3 times (disk or in-memory appears to make no difference)
500 byte message payloads
n publishers each publishing 1k messages per second
The bottleneck appears to be on the publishing side as I can retrieve messages at least as fast as I can publish them.
Publishing in NATS JetStream is slightly different than publishing in Core NATS.
Yes, you can publish a Core NATS message to a subject that is recorded by a stream and that message will indeed be captured in the stream, but in the case of the Core NATS publication, the publishing application does not expect an acknowledgement back from the nats-server, while in the case of the JetStream publish call, there is an acknowledgement sent back to the client from the nats-server that indicates that the message was indeed successfully persisted and replicated (or not).
So when you do js.Publish() you are actually making a synchronous relatively high latency request-reply (especially if your replication is 3 or 5, and more so if your stream is persisted to file, and depending on the network latency between the client application and the nats-server), which means that your throughput is going to be limited if you are just doing those synchronous publish calls back to back.
If you want high throughput when publishing messages to a stream, you should use the asynchronous version of the JetStream publish call instead (i.e. in the Go client you should use js.PublishAsync(), which returns a PubAckFuture).
However, in that case you must also remember to introduce some amount of flow control by limiting the number of 'in-flight' asynchronous publications you want to have at any given time (this is because you can always publish asynchronously much, much faster than the nats-server(s) can replicate and persist messages).
If you were to continuously publish asynchronously as fast as you can (e.g. when publishing the result of some kind of batch process) then you would eventually overwhelm your servers, which is something you really want to avoid.
You have two options to flow-control your JetStream async publications:
Specify a max number of in-flight asynchronous publication requests as an option when obtaining your JetStream context, i.e. js, err := nc.JetStream(nats.PublishAsyncMaxPending(100))
Do a simple batch mechanism to check for the publications' PubAcks every so many asynchronous publications, like nats bench does.
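The flow-control idea behind both options can be sketched without a NATS server. The following is a standard-library-only simulation, not the nats.go API: `publishWindowed` and its `send` callback are hypothetical names, and `send` stands in for "publish and wait for the PubAck". With the real client, `nats.PublishAsyncMaxPending` and waiting on `js.PublishAsyncComplete()` play a similar role.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// publishWindowed sends n messages through send, never allowing more than
// maxPending unacknowledged sends at once. It returns the highest number of
// sends that were in flight at the same time.
func publishWindowed(n, maxPending int, send func(i int)) int {
	window := make(chan struct{}, maxPending) // one slot per in-flight publish
	var wg sync.WaitGroup
	var inFlight, peak int64

	for i := 0; i < n; i++ {
		window <- struct{}{} // blocks while maxPending publishes are outstanding
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			cur := atomic.AddInt64(&inFlight, 1)
			for { // record the largest in-flight count seen so far
				p := atomic.LoadInt64(&peak)
				if cur <= p || atomic.CompareAndSwapInt64(&peak, p, cur) {
					break
				}
			}
			send(i) // stand-in for publish + wait for the ack
			atomic.AddInt64(&inFlight, -1)
			<-window // ack received: free the slot for the next publish
		}(i)
	}
	wg.Wait()
	return int(atomic.LoadInt64(&peak))
}

func main() {
	// Assumed 1 ms per acknowledged publish, purely for illustration.
	peak := publishWindowed(1000, 100, func(int) { time.Sleep(time.Millisecond) })
	fmt.Println("peak in-flight publishes:", peak)
}
```

The publisher never runs ahead of the server by more than the window size, so it cannot overwhelm it, yet it is not limited to one round trip per message either.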
About the expected performance: using async publications allows you to really get the throughput that NATS and JetStream are capable of. A simple way to validate or measure performance is to use the nats CLI tool to run benchmarks.
For example you can start with a simple test: nats bench foo --js --pub 4 --msgs 1000000 --replicas 3 (in-memory stream with 3 replicas, 4 goroutines each with its own connection, publishing 128-byte messages in batches of 100) and you should get a lot more than a few thousand messages per second.
For more information and examples of how to use the nats bench command you can take a look at this video:
Would be good to get an opinion on this. I have a similar behaviour and the only way to achieve higher throughput for publishers is to lower replication (from 3 to 1) but that won't be an acceptable solution.
I have tried adding more resources (cpu/ram) with no success on increasing the publishing rate.
Also, scaling horizontally did not make any difference.
In my situation, I am using the bench tool to publish to JetStream.
For an R3 filestore you can expect ~250k small msgs per second. If you utilize synchronous publish that will be dominated by RTT from the application to the system, and from the stream leader to the closest follower. You can use windowed intelligent async publish to get better performance.
You can get higher numbers with memory stores, but again will be dominated by RTT throughout the system.
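To make the RTT argument concrete, here is a back-of-the-envelope calculation. It is only an upper bound under simple assumptions: each publish is acknowledged after one fixed round trip, the 2 ms figure is made up for illustration, and `maxRate` is a hypothetical helper, not part of any NATS tooling.

```go
package main

import (
	"fmt"
	"time"
)

// maxRate is the upper bound on messages per second for one publisher that
// keeps at most window publishes outstanding, each acknowledged after rtt.
func maxRate(window int, rtt time.Duration) float64 {
	return float64(window) / rtt.Seconds()
}

func main() {
	rtt := 2 * time.Millisecond // assumed round trip including replication

	// A synchronous publisher has a window of 1: it waits for every ack.
	fmt.Printf("sync  (window=1):   %.0f msgs/s\n", maxRate(1, rtt))

	// An async publisher with 100 publishes in flight amortises the round trip.
	fmt.Printf("async (window=100): %.0f msgs/s\n", maxRate(100, rtt))
}
```

Under these assumptions a synchronous publisher tops out at a few hundred messages per second regardless of how much CPU or RAM the servers have, which matches the symptom described in the question.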
If you give me a sense of how large your messages are, we can show you some results from nats bench against the demo servers (R1) and NGS (R1 & R3).
For the original question regarding filtered consumers, server versions >= 2.8.x will not do a linear scan to retrieve foo.baz. We could also show an example of this if it would help.
Feel free to join the Slack channel, which is a pretty active community. Even feel free to DM me directly, happy to help.

What would be the right ZMQ Pattern?

I am trying to build a ZeroMQ pattern where,
1. There can be many clients connecting to a single server endpoint
2. Server will distribute incoming client tasks to available workers (will be mapped to the number of cores on the server)
3. These tasks are long running (in hours) and need to perform a lot of local I/O
4. During each task execution (iteration) there will be data/messages (potentially in order of [GB]s) sent back and forth between the client and the server worker
5. Client and server workers need to know if there are failures/errors on the peer side, so that they can recover (retry) or shutdown gracefully and try later
Based on the above, I presume that the ROUTER/DEALER pattern would be useful. PUB/SUB is discarded as I need to know if the peer fails.
I tried using various combinations of the ROUTER/DEALER pattern but I am unable to ensure that multiple messages from a client reach the same worker within an iteration. I understand that I need to implement a broker/forwarder/device that routes the incoming messages to the right recipient/handler/worker. But I am unable to map the frontend and backend sockets in the broker. I am looking at MajorDomo pattern, but I guess there has to be a simpler broker model that could just route the messages to the assigned worker. (not really get into services)
I am looking for some examples, if there are any or any guidance on what I may be missing. I am trying to build this in Golang.
Q : "What would be the right ZMQ Pattern?"
Based on the complex composition of all the requirements posted under items 1 - 5, I dare to say, The Right would be NOT to use a single one of the standard, built-in, ZeroMQ trivial primitive Communication Archetype Patterns, but to rather create a multi-layered application-specific composition of a ( M + N + 1 hot-standby robust-enough?) (self-resilient?) Signalling-Messaging infrastructure, that covers all your current ( and possibly extensible for any future one ) application-level requirements, like depicted here for a way simpler distributed-computing use-case, where but a trivial remote-SigKILL was implemented.
Yes, the best would be to create ( and maintain ) your own formalised signalling, that the application level can handle and interact across -- like the heart-beating for detecting dead-worker(s) + permitting to re-instate such failed jobs right on-detected failures (most probably re-located and/or re-scheduled to take place & respective resources not statically pre-mapped, but where physically most feasible at the re-instating moment of time - so even more telemetry signalling will help you decide about the re-instating of the such failed micro-jobs).
ZeroMQ is a fabulous framework, well suited to such complex signalling and messaging hierarchies, so your System Architect's imagination is the only ceiling in this concept.
ZeroMQ will take care of the rest and do all the hard work nicely and easily.
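As a starting point for the "same client, same worker within an iteration" requirement from the question, the broker mostly needs a small assignment table. The following Go sketch shows only that bookkeeping, with no ZeroMQ calls at all; the `Broker` type and its methods are hypothetical, identities are plain strings where a ROUTER socket would use identity frames, and heartbeating and failure handling are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoWorker is returned when every worker is already assigned.
var ErrNoWorker = errors.New("no idle worker")

// Broker keeps a client -> worker assignment so that every message of one
// iteration reaches the same worker.
type Broker struct {
	idle     []string          // workers waiting for a task
	assigned map[string]string // client identity -> worker identity
}

func NewBroker(workers ...string) *Broker {
	return &Broker{
		idle:     append([]string(nil), workers...),
		assigned: make(map[string]string),
	}
}

// Route returns the worker for client, assigning an idle one on first contact.
func (b *Broker) Route(client string) (string, error) {
	if w, ok := b.assigned[client]; ok {
		return w, nil // sticky: same worker for the whole iteration
	}
	if len(b.idle) == 0 {
		return "", ErrNoWorker
	}
	w := b.idle[0]
	b.idle = b.idle[1:]
	b.assigned[client] = w
	return w, nil
}

// Release ends the client's iteration and returns its worker to the idle pool.
// A heartbeat monitor could call it as well when a peer is declared dead.
func (b *Broker) Release(client string) {
	if w, ok := b.assigned[client]; ok {
		delete(b.assigned, client)
		b.idle = append(b.idle, w)
	}
}

func main() {
	b := NewBroker("worker-1", "worker-2")
	first, _ := b.Route("client-A")
	again, _ := b.Route("client-A")
	fmt.Println(first, again) // both messages go to the same worker
	b.Release("client-A")
}
```

In a real broker the frontend socket would look up `Route(clientIdentity)` for each incoming message and forward it on the backend socket with the worker identity as the first frame.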

Which of these is the best practice for web sockets in terms of performance?

This is more of a hypothetical question, so I can't really show any code examples. Imagine if a site like Twitter wanted to live-update stats on a Tweet via web sockets. In terms of performance, which of these would be the best approach?
1. Each action (like, retweet, reply) sends a message to the server, which then gets emitted to all clients, and the client is responsible for updating the appropriate tweet.
2. Each tweet the client loads is connected to a different room so that it only emits and receives messages relevant to itself.
Or perhaps it's dependent on the scale of the application? Maybe 1 is better if you had a Twitter clone with only a few users, whereas I would think 2 is better in Twitter's case because it's a matter of hundreds of "rooms" vs millions of signals/second? And if that's the case, at what point is one approach preferred over the other?
At scale, you do not want to be sending messages to clients that they did not ask for and do not have any use for. Imagine a twitter client that was receiving every single tweet being sent in real time. That could overwhelm that client and it would mean the server would be delivering every single tweet to every single connected client. That obviously doesn't scale on either the server side or the client side.
So option 1 is out.
The appropriate solution has the server send to the client only the messages that it has a particular interest in seeing. This works just fine at any scale. I can't tell whether your option 2 is that or not, since rooms are just a tool for making groups of connections that you can send the same message to. They don't really decide who gets what message; that logic must be baked into your server code.
For a twitter-like service, it seems you're going to have to have a system where your server can easily tell which users have an interest in this particular new message. That can presumably be for a number of reasons such as they are following the author, they are following a hashtag present in the message, they are mentioned in the message, etc... That is server-side logic, not just simple rooms.
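A minimal sketch of "send only what the client asked for", assuming the interest is simply "this client currently displays tweet X". The `Hub` type is hypothetical and a buffered channel stands in for a web socket connection; a real service would derive interest from follows, hashtags, mentions and so on, as described above.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub tracks which client connections asked for updates on which tweet.
type Hub struct {
	mu   sync.Mutex
	subs map[string]map[chan string]struct{} // tweet ID -> interested clients
}

func NewHub() *Hub {
	return &Hub{subs: make(map[string]map[chan string]struct{})}
}

// Subscribe registers client c as interested in tweetID.
func (h *Hub) Subscribe(tweetID string, c chan string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.subs[tweetID] == nil {
		h.subs[tweetID] = make(map[chan string]struct{})
	}
	h.subs[tweetID][c] = struct{}{}
}

// Unsubscribe removes the interest, e.g. when the tweet scrolls out of view.
func (h *Hub) Unsubscribe(tweetID string, c chan string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	delete(h.subs[tweetID], c)
	if len(h.subs[tweetID]) == 0 {
		delete(h.subs, tweetID)
	}
}

// Publish delivers update only to clients subscribed to tweetID and returns
// how many received it. Clients whose buffer is full are skipped.
func (h *Hub) Publish(tweetID, update string) int {
	h.mu.Lock()
	defer h.mu.Unlock()
	sent := 0
	for c := range h.subs[tweetID] {
		select {
		case c <- update:
			sent++
		default: // slow client: drop rather than block everyone else
		}
	}
	return sent
}

func main() {
	hub := NewHub()
	alice := make(chan string, 1)
	hub.Subscribe("tweet-1", alice)

	n := hub.Publish("tweet-1", "likes=42")
	fmt.Println("delivered to", n, "client:", <-alice)
	fmt.Println("delivered to", hub.Publish("tweet-2", "likes=7"), "clients")
}
```

The work per update is proportional to the number of interested clients, not to the number of connected clients, which is what lets this approach scale.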

Is the mux in this golang example necessary?

In an app that I'm making, a user is always part of a 'game'. I'd like to set up a server to communicate with users in a game. I'm planning to use a library that defines the newSocketIO function to create a new socketio instance.
Instead of creating one socketio instance, I thought it might be possible to create a map that maps game IDs to instances, and configure them so that they listen on a URL that represents the game ID.
This way, I can use methods such as broadcast and broadcastExcept to broadcast to all players within a single game. However, I'd have to start a new goroutine for every game, and I don't know enough about their performance characteristics to know if this is scalable, since the request rate for a single socketio instance will be very low, about 1/second at peak times, but the connection might be idle for tens of seconds at other times (except for heartbeat, and possibly other communication specified by the protocol).
Would I be better off creating 1 instance, and tracking which connections belong to which games?
"I'd have to start a new goroutine for every game, and I don't know enough about their performance characteristics to know if this is scalable"
Fire away, the Go scheduler is built to efficiently handle thousands and even millions of goroutines.
The default net/http server in the Go standard library spawns a goroutine for every client for instance.
Just remember to return from your goroutines once they're done working. Else you'll end up with a lot of stale ones.
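A small sketch of the "return when done" advice, assuming one goroutine per game. The `startGame` function, the `stop` channel and the string events are hypothetical and unrelated to any socket.io library; the point is only that each goroutine has a clear exit path and signals when it has returned.

```go
package main

import "fmt"

// startGame launches one goroutine for a game. The goroutine handles events
// until stop is closed or events is closed, then returns and closes the
// done channel that startGame hands back to the caller.
func startGame(stop <-chan struct{}, events <-chan string, handle func(string)) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done) // signals that this goroutine has really returned
		for {
			select {
			case <-stop:
				return
			case ev, ok := <-events:
				if !ok {
					return // game over: the event source was closed
				}
				handle(ev)
			}
		}
	}()
	return done
}

func main() {
	const games = 10000
	stop := make(chan struct{})
	dones := make([]<-chan struct{}, 0, games)

	// Mostly idle games: a goroutine blocked in select costs little more
	// than its small stack.
	for i := 0; i < games; i++ {
		events := make(chan string)
		dones = append(dones, startGame(stop, events, func(string) {}))
	}

	close(stop) // shut everything down
	for _, d := range dones {
		<-d
	}
	fmt.Println("game goroutines returned:", games)
}
```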
"Would I be better off creating 1 instance, and tracking which connections belong to which games?"
I'm not involved in the project but if it follows Go's "get sh*t done" philosophy, then it shouldn't matter. You can find out what works better by profiling both approaches though.

When multiple MessageConsumers connect to the same queue (WebSphere MQ), how to load balance the message consumers?

I am using WebSphere MQ 7, and I have two clients connected to the same QMgr and consuming messages from the same queue, with code like the following:
    while (true) {
        // Wait up to one second for the next message
        TextMessage message = (TextMessage) consumer.receive(1000);
        if (message != null) {
            System.out.println("*********************" + message.getText());
        }
    }
I found that only one client ever retrieves messages. Is there any way to load balance message consumption across the two clients? Are there any config options on the MQ server side?
When managing queue handles, it is MUCH faster for WMQ to put them in a stack rather than a FIFO queue. So if the messages arrive on the queue more slowly than they are processed, it is possible that an instance will process the message and perform another GET, which WMQ pushes back onto the top of the stack, before the next message arrives. The result is that only one instance will see messages in a low-volume use case.
In larger environments where there are many instances waiting on messages, it is possible that activity will round-robin amongst a portion of those instances while the other instances starve for messages. For example, with 10 GETters on the queue you may see three processing messages and 7 idle.
Although this is considerably faster for MQ, it is confusing to customers who are not aware of how it works internally and so they open PMRs asking this exact question. IBM had to choose among several alternatives:
Adding several code paths to manage by stack for performance when fully loaded, versus manage by FIFO for apparent balancing when lightly loaded. This bloats the code, adds many new decision points to introduce errors and solves a problem that was one of perception rather than reliability or performance.
Educate the customers as to how it works. Of course, once you document it, then you can't change it. The way I found out about this was attending the "WMQ Internals" presentation at IMPACT. It's not in the Infocenter so IBM can change it, but it is available for customers.
Do nothing. Although this is the best result from the code design point of view, the behavior is counter-intuitive. Users need to understand why things do not behave as expected and will waste time trying to find the configuration that results in the desired behavior, or open a PMR.
I don't know for sure that it still works this way but I expect that it does. The way I used to test it was to put many messages on the queue at once and then see how they were distributed. If you drop about 50 messages on the queue in one unit of work, you should see a better distribution between the two instances.
How do you drop 50 messages on the queue at once? First generate them with the applications turned off, or generate them to a spare queue. If you generated them in the target queue, use the Q program to move them to the spare queue. Now start the apps and make sure the queue's IPPROCS count equals however many instances of the app you started. Using Q again, copy all of the messages to the original queue in a single unit of work. Since they all become available on the queue at once, your two app instances should both immediately be passed a message. If you used copy instead of move, you can repeat this as often as required.
Your client is not doing much, so one instance can probably handle the full load. Try implementing a more realistic workload, or, simpler yet, put a Thread.sleep in the client.
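The stack behaviour described above can be illustrated with a toy model. This is not how WMQ is implemented, just a simulation of the explanation under simple assumptions: waiting getters sit on a stack, messages arrive in bursts, and every getter finishes its message and re-issues its GET before the next burst. The `deliver` function is a hypothetical name.

```go
package main

import "fmt"

// deliver simulates a queue manager that keeps waiting getters on a stack.
// Each entry of bursts is the number of messages that become available at
// once. It returns the number of messages handled by each getter.
func deliver(getters int, bursts []int) []int {
	counts := make([]int, getters)
	if getters <= 0 {
		return counts
	}
	stack := make([]int, getters) // waiting getters, top is the last element
	for i := range stack {
		stack[i] = i
	}
	for _, size := range bursts {
		var busy []int // getters currently processing a message
		for m := 0; m < size; m++ {
			if len(stack) == 0 {
				// Everyone is busy: they finish and re-issue their GETs.
				stack, busy = busy, nil
			}
			top := stack[len(stack)-1] // most recent waiter gets the message
			stack = stack[:len(stack)-1]
			counts[top]++
			busy = append(busy, top)
		}
		// Processing is faster than arrival: all getters are waiting again
		// before the next burst shows up.
		stack = append(stack, busy...)
	}
	return counts
}

func main() {
	// Five messages trickling in one at a time: one getter takes them all.
	fmt.Println(deliver(2, []int{1, 1, 1, 1, 1})) // prints [0 5]

	// Fifty messages made available in a single unit of work.
	fmt.Println(deliver(2, []int{50})) // prints [25 25]
}
```

In this model the trickle case always pops the same getter off the top of the stack, while a large burst forces the manager to reach further down, which is the same effect the 50-message test above is designed to show.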
