Big difference in throughput when using MSMQ as subscription storage for NServiceBus as opposed to RavenDB - performance

Recently we noticed that our NServiceBus subscribers were not able to handle the increasing load. We have a fairly constant input stream of events (measurement data from embedded devices), so it is very important that the throughput keeps up with the input.
After some profiling, we concluded that it was not the handling of the events that was taking a lot of time, but rather the NServiceBus process of retrieving and publishing events. To try to get a better idea of what goes on, I recreated the Pub/Sub sample (http://particular.net/articles/nservicebus-step-by-step-publish-subscribe-communication-code-first).
On my laptop, using all the NServiceBus defaults, the maximum throughput of the Ordering.Server is about 10 events/second. The only thing it does is
class PlaceOrderHandler : IHandleMessages<PlaceOrder>
{
    public IBus Bus { get; set; }

    public void Handle(PlaceOrder message)
    {
        Bus.Publish<OrderPlaced>(
            e => { e.Id = message.Id; e.Product = message.Product; });
    }
}
I then started to play around with configuration settings. None seemed to have any impact on this (very low) throughput, until I switched the subscription storage from the default (RavenDB) to MSMQ:

Configure.With()
    .DefaultBuilder()
    .UseTransport<Msmq>()
    .MsmqSubscriptionStorage();
With this configuration, the throughput instantly went up to 60 messages/sec.
I have two questions:
When using MSMQ as subscription storage, the performance is much better than with RavenDB. Why does something as trivial as the storage for subscription data have such an impact?
I would have expected much higher performance. Are there any other configuration settings I should use to get at least an order of magnitude better than this? On our servers, the maximum throughput when running this sample is about 200 msg/s. That is far from spectacular for a system that doesn't even do anything useful yet.

MSMQ doesn't have native pub/sub capabilities, so NServiceBus adds support for it by storing the list of subscribers and then looping over that list, sending a copy of the event to each subscriber. This translates to X message queuing operations, where X is the number of subscribers. It also explains why a broker with native pub/sub, such as RabbitMQ, is faster: there you only need one operation against the broker.
The reason the MSMQ-queue-based storage is faster is that it is local storage (it can't be used if you need to scale out the endpoint), which means we can cache the data, since no other endpoint instance can be updating the storage. In short, we get away with an in-memory lookup, which as you can see is the fastest option.
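The difference can be sketched like this (the names are illustrative and the real NServiceBus internals differ; this only shows the shape of the fan-out with a locally cached subscriber list):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Sketch of why MSMQ subscription storage is fast: the subscriber list lives
// in a local in-memory map, so publishing an event costs one map lookup plus
// one queue operation per subscriber -- no storage round trip per event.
class CachedSubscriptionStore {
    private final Map<String, List<String>> subscribersByEventType = new ConcurrentHashMap<>();

    void subscribe(String eventType, String subscriberQueue) {
        subscribersByEventType
                .computeIfAbsent(eventType, t -> new CopyOnWriteArrayList<>())
                .add(subscriberQueue);
    }

    // Publish = iterate the cached list, sending one copy per subscriber.
    // Returns the number of queue operations performed (X in the text above).
    int publish(String eventType, Consumer<String> sendToQueue) {
        List<String> subs = subscribersByEventType.getOrDefault(eventType, List.of());
        subs.forEach(sendToQueue);
        return subs.size();
    }
}
```

With a remote subscription store (like RavenDB without caching), the lookup inside publish would be a network round trip per event, which is exactly the cost the question is measuring.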
There are plans to add native caching across all storages:
https://github.com/Particular/NServiceBus/issues/1320
200 msg/s sounds quite low. What number do you get if you skip the Bus.Publish? (Just to get a baseline.)

Possibility 1: distributed transactions
Distributed transactions are created when processing messages because the message queue and the database are two separate transactional resources.
Try measuring without transactional handling of the messages. How does that compare?
Possibility 2: MSMQ might not be the best queueing system for your needs
Have you ever considered switching to RabbitMQ for transport? I have very good experiences with RabbitMQ in combination with MassTransit; the numbers far exceed those you mention in your question.

Related

Rate-Limiting / Throttling SQS Consumer in conjunction with Step-Functions

Given the following architecture:
The issue with it is that we hit throttling due to the maximum number of concurrent Lambda executions (1K per account).
How can this be addressed or circumvented?
We want to have full control of the rate limiting.
1) Request concurrency increase.
This would probably be the easiest solution, but it would just increase the potential workload considerably. It doesn't resolve the root cause, nor does it give us any flexibility or room for custom rate limiting.
2) Rate Limiting API
This would only address one component, as the API is not the only trigger of the step functions. Besides, it would impact the clients, as they would receive 4xx responses.
3) Adding SQS in front of SFN
This will be one of our choices regardless, as it is always good to have a queue in front of such a volume of events. However, a simple queue in front does not by itself provide rate limiting.
As SQS can't be configured to execute an SFN directly, a Lambda in between would be required, which then starts the SFN programmatically. Without any further logic this would not solve the concurrency issue.
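A minimal sketch of the custom rate limiting such a bridge Lambda would need, written as a plain token bucket (the names and numbers here are made up for illustration, not an AWS API):

```java
// Token bucket: allows at most `ratePerSecond` SFN starts per second on
// average, with bursts up to `capacity`. A bridge lambda would call
// tryAcquire() before starting an execution and, on false, return the
// message to the queue (e.g. by letting the visibility timeout expire).
class TokenBucket {
    private final double capacity;
    private final double refillPerNano;
    private double tokens;
    private long lastRefill;

    TokenBucket(double ratePerSecond, double capacity) {
        this.capacity = capacity;
        this.refillPerNano = ratePerSecond / 1_000_000_000.0;
        this.tokens = capacity;          // start full: allow an initial burst
        this.lastRefill = System.nanoTime();
    }

    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        // Refill proportionally to the elapsed time, capped at capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

Note that in a real deployment the bucket state would have to be shared across Lambda invocations (e.g. kept in a small single-instance service or an external counter), since per-invocation state does not survive.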
4) FIFO-SQS in front of SFN
Something along the lines of what this blog post explains.
Summary: by using virtual message groups we can cap the number of items being processed concurrently. While this solution works quite well for their use case, I am not convinced it would be a good approach for ours, because the SQS consumer is not the indicator of the workload; it only triggers the step functions.
Due to the uneven workload this is not optimal: it would be better to have the concurrency distributed by actual workload rather than by chance.
5) Kinesis Data Stream
By using Kinesis Data Streams with predefined shards and batch sizes we can implement the rate-limiting logic. However, this leaves us with the exact same issues described in (3).
6) Provisioned Concurrency
Assuming we have an SQS queue in front of the SFN, the SQS consumer can be configured with a fixed concurrency. The value could be calculated from the account's maximum allowed concurrency in conjunction with the number of parallel tasks of the step functions. It looks like we could find a proper value here.
But once the quota is reached, SQS will still retry sending messages, and once the maximum receive count is reached the messages will end up in the DLQ. This blog post explains it quite well.
7) EventSourceMapping toggled by CloudWatch metrics (a sort of circuit breaker)
Assuming we have an SQS queue in front of the SFN and a consumer Lambda.
We could create CloudWatch metrics and trigger a Lambda once a metric threshold is hit. That event Lambda could then temporarily disable the event source mapping between the SQS queue and the consumer Lambda. Once the workload of the system eases, another event could be sent to re-enable the mapping.
However, I wasn't able to determine proper metrics to react on before the throttling kicks in. Additionally, CloudWatch metrics work in 1-minute windows, so the reaction might already come too late.
8) ???
The question itself is a nice overview of all the major options. Well done.
You could implement throttling directly with API Gateway. This is the easiest option if you can afford to reject a client every once in a while.
If you need stream and buffer control, go for Kinesis. You can even put all your events in an S3 bucket and trigger Lambdas or a Step Function when a new event has been stored (more here). Yes, you will ingest events differently and you will need a bridge Lambda function to trigger the Step Function based on Kinesis events, but this is a relatively low implementation effort.

What's the right way to do high-performance lookup on filterable sets of keys in a cache?

I would like to operate a service that anticipates having subscribers who are interested in various kinds of products. A product is a bag of dozens of attributes:
{
    "product_name": "...",
    "product_category": "...",
    "manufacturer_id": "...",
    [...]
}
A subscriber can express an interest in any subset of these attributes. For example, this subscription:
{ [...]
    "subscription": {
        "manufacturer_id": 1234,
        "product_category": 427
    }
}
will receive events that match both product_category: 427 and manufacturer_id: 1234. Conversely, this event:
{ [...]
    "event": {
        "manufacturer_id": 1234,
        "product_category": 427
    }
}
will deliver messages to any subscribers who care about:
that manufacturer_id, or
that product_category, or
both that manufacturer_id and that product_category
It is vital that these notifications be delivered as expeditiously as possible, because subscribers may have only a few hundred milliseconds, or a second at most, to take downstream actions. The cache lookup should therefore be fast.
Question: If one wants to cache subscriptions this way for highly efficient lookup on one or more filterable attributes, what sort of approaches or architectures would allow one to do this well?
The answer depends on some factors you have not described in your scenario. For example, what is the extent of the data? How many products/categories/users, and what are the estimated data sizes: megabytes, gigabytes, terabytes? Also, what is the expected throughput of changes to products/subscriptions and of events?
So my answer will be for a medium-size scenario in the gigabytes range, where you can likely fit your subscription dataset into memory on a single machine.
In this case the straightforward approach would be to have your events appear on an event bus, implemented for example with Kafka or Pulsar. You would then have a service that consumes the events as they come in and queries an in-memory data store for the subscription matches. (The in-memory store has to be built/copied on startup and kept up to date, potentially from a different event source.)
This in-memory store could be a database like MongoDB, which comes with a pure in-memory mode that gives you more predictable performance. To ensure predictable, high-performance lookups within the store you need to specify your indexes correctly: any property that is relevant to the lookup needs to be indexed. Also consider that key-value stores can use compound indexes to speed up lookups of property combinations. Other in-memory key-value stores you may want to consider as alternatives are Redis and Memcached. If performance is a critical requirement, I would recommend doing trials with different systems: ingest your dataset, build the indexes, and try out the queries you need, comparing lookup times.
So the service can now quickly determine the set of users to notify. From here you have two choices: you could have the same service send out notifications directly, or (what I would probably do) you could separate concerns and have a second service whose responsibility is performing the actual notifications. The communication between those services could again be via a topic on the event bus.
This kind of setup should easily handle thousands of events per second with single service instances. Should the number of events scale to massive sizes, you can run multiple instances of your services to improve throughput; for that you'd have to look into organizing consumer groups correctly for multiple consumers.
The technologies for implementing the services are probably not critical, but if I knew the system had strict performance requirements I would go with a language that allows manual memory management, for example Rust or C++. Other alternatives could be languages like Go or Java, but you'd have to pay attention to how garbage collection is performed and make sure it doesn't interfere with your performance requirements.
In terms of infrastructure: for a medium- or large-size system you would typically run your services containerized on a cluster of machines, for example using Kubernetes.
If it happens that your system is on the smaller side, you may not need a distributed setup and can instead deploy the described components/services on a single machine.
With such a setup, the expected round-trip time from a local client, from the moment an event comes in to the moment a notification goes out, should reliably be in the single-digit milliseconds.
The way I would do it is to have a key/value table that holds an array of subscriber ids per attribute name = value pair, like this (where a, b, c, d, y, z are the subscriber ids):
{ [...]
    "manufacturer_id=1234": [a, b, c, d],
    "product_category=427": [a, b, y, z],
    [...]
}
In your example the event has "manufacturer_id" = 1234 and "product_category" = 427, so just search for the subscribers where the key is manufacturer_id=1234 or product_category=427 and you'll get the arrays of all the subscribers you want. Then "merge distinct" those arrays and you'll have every subscriber id you need.
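That lookup can be sketched with plain in-process maps and sets standing in for whatever store is actually used (the class and method names here are just for illustration):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// "attribute=value" -> subscriber ids, exactly as in the table above.
class SubscriptionIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    void add(String attribute, Object value, String subscriberId) {
        index.computeIfAbsent(attribute + "=" + value, k -> new HashSet<>())
             .add(subscriberId);
    }

    // "Merge distinct": the union of the per-attribute sets for one event.
    Set<String> match(Map<String, Object> event) {
        Set<String> result = new HashSet<>();
        for (Map.Entry<String, Object> e : event.entrySet()) {
            result.addAll(index.getOrDefault(e.getKey() + "=" + e.getValue(), Set.of()));
        }
        return result;
    }
}
```

Using a Set for the merge gives the "distinct" part for free; with the example data above, an event with manufacturer_id 1234 and product_category 427 matches exactly {a, b, c, d, y, z}.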
Or, depending on how complex/smart the database you are using is, you can normalize it, like this:
{ [...]
    "manufacturer_id": {
        "1234": [a, b, c, d],
        "5678": [e, f, g, h],
        [...]
    },
    "product_category": {
        "427": [a, b, g, h],
        "555": [c],
        [...]
    },
    [...]
}
I would propose sharding as an architecture pattern.
Every shard will listen for all events for all products from the source of the events.
For the best latency I would propose two layers of sharding. The first layer is geographical (country or city, depending on customer distribution); it is connected to the source with a low-latency connection and sits in the same data center as the second-level shards for that location. The second level shards on userId; each shard needs to receive all product events, but handles subscriptions only for its region.
The first layer has the responsibility of fanning out the events to the second layer based on the geographical position of the subscriptions. This is more or less a single microservice. It could be done with general-purpose event brokers, but since it is going to be relatively simple, we can implement it in Go or C++ and optimize it for latency.
In the second layer, every shard is responsible for a number of users from the location, and every shard receives all the events for all products. A shard is made up of one microservice for subscription caching and notification logic, plus one or more notification-delivery microservices.
The subscriptions microservice stores an in-memory cache of the subscriptions and checks every event for subscribed users based on maps; for example, it stores a map from product field to subscribed userIds. For this microservice latency matters most, so a custom implementation in Go or C++ should deliver the best latency. The subscriptions microservice should not have its own DB or any external cache, as network latency is just a drag in this case.
The notification-delivery microservices depend on where you want to send the notifications, but again Go or C++ can deliver some of the lowest latencies.
The system's data is its subscriptions; it can be sharded per location and userId the same way as the rest of the architecture, so we can have a single DB per second-level shard.
For storage of the product fields, depending on how often they change, they can live in the code (assuming they change very rarely or never) or in the DBs, with a synchronization mechanism between the DBs if they are expected to change more often.

Performance of Nats Jetstream

I'm trying to understand how Nats Jetstream scales and have a couple of questions.
How efficient is subscribing by subject to historic messages? For example, let's say we have a stream foo that consists of 100 million messages with subject foo.bar and then a single message with subject foo.baz. If I then subscribe to foo.baz from the start of the stream, will the server have to perform a linear scan of all messages in foo, or will it be able to seek immediately to the foo.baz message?
How well does the system scale horizontally? I ask because I'm having trouble getting JetStream to scale much above a few thousand messages per second, regardless of how many machines I throw at it. Test parameters are as follows:
Nats Server 2.6.3 running on 4 core 8GB nodes
Single Stream replicated 3 times (disk or in-memory appears to make no difference)
500 byte message payloads
n publishers each publishing 1k messages per second
The bottleneck appears to be on the publishing side as I can retrieve messages at least as fast as I can publish them.
Publishing in NATS JetStream is slightly different than publishing in Core NATS.
Yes, you can publish a Core NATS message to a subject that is recorded by a stream, and that message will indeed be captured in the stream. But in the case of a Core NATS publication the publishing application does not expect an acknowledgement back from the nats-server, while in the case of the JetStream publish call an acknowledgement is sent back to the client indicating that the message was (or was not) successfully persisted and replicated.
So when you do js.Publish() you are actually making a synchronous, relatively high-latency request-reply (especially if your replication factor is 3 or 5, more so if your stream is persisted to file, and depending on the network latency between the client application and the nats-server), which means your throughput is going to be limited if you just do those synchronous publish calls back to back.
If you want publishing throughput to a stream, you should use the asynchronous version of the JetStream publish call instead (i.e. js.PublishAsync(), which returns a PubAckFuture).
However, in that case you must also remember to introduce some amount of flow control by limiting the number of in-flight asynchronous publications you allow at any given time (because you can always publish asynchronously much, much faster than the nats-server(s) can replicate and persist messages).
If you were to continuously publish asynchronously as fast as you can (e.g. when publishing the result of some kind of batch process), you would eventually overwhelm your servers, which is something you really want to avoid.
You have two options to flow-control your JetStream async publications:
Specify a maximum number of in-flight asynchronous publication requests as an option when obtaining your JetStream context, i.e. js = nc.JetStream(nats.PublishAsyncMaxPending(100))
Use a simple batch mechanism: check the publications' PubAcks every so many asynchronous publishes, like nats bench does: https://github.com/nats-io/natscli/blob/e6b2b478dbc432a639fbf92c5c89570438c31ee7/cli/bench_command.go#L476
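The second option, the windowed batch, can be sketched broker-agnostically. Here a hypothetical asyncPublish function returning a CompletableFuture stands in for the JetStream publish call; this is a pattern illustration, not the NATS client API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Windowed flow control for async publishing: fire publishes without
// waiting, but every `window` messages block until all outstanding acks
// arrive, so the publisher never outruns the servers by more than one window.
class WindowedPublisher {
    private final Function<byte[], CompletableFuture<Void>> asyncPublish;
    private final int window;
    private final List<CompletableFuture<Void>> inFlight = new ArrayList<>();

    WindowedPublisher(Function<byte[], CompletableFuture<Void>> asyncPublish, int window) {
        this.asyncPublish = asyncPublish;
        this.window = window;
    }

    void publish(byte[] msg) {
        inFlight.add(asyncPublish.apply(msg));
        if (inFlight.size() >= window) {
            // Wait for the whole window of acks before continuing.
            CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
            inFlight.clear();
        }
    }
}
```

The trade-off is window size: larger windows mean higher throughput but more memory for unacknowledged messages and a larger redelivery window on failure.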
About the expected performance: using async publications allows you to really get the throughput that NATS and JetStream are capable of. A simple way to validate or measure performance is to use the nats CLI tool (https://github.com/nats-io/natscli) to run benchmarks.
For example, you can start with a simple test: nats bench foo --js --pub 4 --msgs 1000000 --replicas 3 (an in-memory stream with 3 replicas, 4 goroutines each with its own connection, publishing 128-byte messages in batches of 100), and you should get a lot more than a few thousand messages per second.
For more information and examples of how to use the nats bench command you can take a look at this video: https://youtu.be/HwwvFeUHAyo
It would be good to get an opinion on this. I see similar behaviour, and the only way to achieve higher throughput for publishers is to lower replication (from 3 to 1), but that is not an acceptable solution.
I have tried adding more resources (CPU/RAM) with no success in increasing the publishing rate.
Also, scaling horizontally did not make any difference.
In my situation I am using the bench tool to publish to JetStream.
For an R3 file store you can expect ~250k small messages per second. If you use synchronous publishes, throughput will be dominated by the RTT from the application to the system, and from the stream leader to the closest follower. You can use windowed intelligent async publishing to get better performance.
You can get higher numbers with memory stores, but again it will be dominated by RTT throughout the system.
If you give me a sense of how large your messages are, we can show you some results from nats bench against the demo servers (R1) and NGS (R1 & R3).
For the original question regarding filtered consumers: >= 2.8.x will not do a linear scan to retrieve foo.baz. We could show an example of this as well if it would help.
Feel free to join the slack channel (slack.nats.io) which is a pretty active community. Even feel free to DM me directly, happy to help.

Performance and limitations of temporary queues

I want several hundred client apps to create and use temporary queues on one instance of the middleware.
Are there any performance drawbacks to using temporary queues? Are there limitations, for example on how many temporary queues can be created per HornetQ instance?
On a recent project we switched from temporary queues to static queues on SonicMQ. We had implemented synchronous service calls over JMS where the response of each call would be delivered on a dedicated temporary queue created by the consumer. During stress testing we noticed that the overhead of temporary-queue creation and the allocated resources played a bigger and bigger part as we pushed towards the maximum throughput of the solution.
We changed the solution to use static queues between consumer and provider, with a selector correlating on the JMSCorrelationID. This resulted in better throughput in our case. If you are planning to (re)create the temporary queues each time your client applications use them, this could start to hurt performance when higher throughput rates are needed.
Note that selector performance can also start to matter as the number of messages in a queue increases. In our case the solution was designed to hand off messages as soon as possible and not act as a (storage) buffer between consumer and provider, so the number of messages in a queue was always low.
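The correlation the selector performs can be sketched broker-agnostically: a pending-replies map keyed by correlation id plays the role the JMS selector plays on the shared static reply queue (the class and method names below are illustrative, not SonicMQ or JMS API):

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Request-reply over one shared static reply queue: every reply carries the
// correlation id of its request, and a single listener routes each reply to
// the caller that is waiting on that id.
class ReplyCorrelator {
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    // Called by the requester before sending; the future completes when the
    // matching reply arrives on the shared reply queue.
    CompletableFuture<String> expectReply(String correlationId) {
        CompletableFuture<String> f = new CompletableFuture<>();
        pending.put(correlationId, f);
        return f;
    }

    // Called by the reply-queue listener for every incoming reply, routing it
    // to the right waiting request by correlation id (cf. JMSCorrelationID).
    void onReply(String correlationId, String body) {
        CompletableFuture<String> f = pending.remove(correlationId);
        if (f != null) f.complete(body);   // unmatched replies are dropped
    }
}
```

This is why the static-queue design scales better than per-call temporary queues: queue creation happens once, and per-call cost is just a map entry.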

When multi MessageConsumer connect to same queue(Websphere MQ),how to load balance message-consumer?

I am using WebSphere MQ 7, and I have two clients connected to the same queue manager, consuming messages from the same queue, with code like the following:
while (true) {
    TextMessage message = (TextMessage) consumer.receive(1000);
    if (message != null) {
        System.out.println("*********************" + message.getText());
    }
}
I found that only one client ever retrieves messages. Is there a way to load-balance message consumption across the two clients? Are there any config options on the MQ server side?
When managing queue handles, it is MUCH faster for WMQ to keep them in a stack (LIFO) rather than a FIFO queue. So if messages arrive on the queue more slowly than they can be processed, it is possible that an instance will process a message and issue another GET, which WMQ pushes back onto the top of the stack. The result is that only one instance sees messages in a low-volume use case.
In larger environments where there are many instances waiting on messages, it is possible that activity will round-robin amongst a portion of those instances while the other instances starve for messages. For example, with 10 GETters on the queue you may see three processing messages and 7 idle.
Although this is considerably faster for MQ, it is confusing to customers who are not aware of how it works internally, and so they open PMRs asking this exact question. IBM had to choose among several alternatives:
Add code paths to manage the handles as a stack for performance when fully loaded, versus as a FIFO for apparent balancing when lightly loaded. This bloats the code, adds many new decision points that could introduce errors, and solves a problem that is one of perception rather than reliability or performance.
Educate customers as to how it works. Of course, once you document it, you can't change it. The way I found out about this was attending the "WMQ Internals" presentation at IMPACT. It's not in the Infocenter, so IBM can change it, but it is available to customers.
Do nothing. Although this is the best result from a code-design point of view, the behavior is counter-intuitive. Users need to understand why things do not behave as expected, and will waste time trying to find a configuration that results in the desired behavior, or open a PMR.
I don't know for sure that it still works this way but I expect that it does. The way I used to test it was to put many messages on the queue at once and then see how they were distributed. If you drop about 50 messages on the queue in one unit of work, you should see a better distribution between the two instances.
How do you drop 50 messages on the queue at once? First generate them with the applications turned off, or into a spare queue. If you generated them in the target queue, use the Q program to move them to the spare queue. Now start the apps and make sure the queue's IPPROCS count equals the number of app instances you started. Using Q again, copy all the messages to the original queue in a single unit of work. Since they all become available on the queue at once, your two app instances should both immediately be passed a message. If you used copy instead of move, you can repeat this as often as required.
Your client is not doing much, so one instance can probably handle the full load. Try implementing a more realistic workload, or, simpler yet, put a Thread.sleep in the client.
