Azure Event Hubs load distribution - performance

I have an Event Hubs solution where a lot of publishers publish data to the hub. Currently we are not using partitions. I would like a solution where multiple listeners/subscribers can listen to these events in parallel. E.g.,
If there is an eventA and an eventB, can I have only one listener receive eventA and another listener receive eventB so that the load is distributed?
I have to do some compute on each event, so I want the compute distributed and not duplicated.

Yes, that's what partitions are for. For a given consumer group, there can be multiple readers splitting the work among them, but the maximum number of readers is limited by the partition count.
Each consumer locks one or more partitions and will be the only one working on events from those partitions.
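For illustration, here is a minimal sketch using the Azure Event Hubs Java SDK's EventProcessorClient, which balances partition ownership across running consumer instances through a shared checkpoint store. The connection strings, hub name, consumer group, and blob container below are placeholders, not values from the original question.

```java
import com.azure.messaging.eventhubs.EventProcessorClient;
import com.azure.messaging.eventhubs.EventProcessorClientBuilder;
import com.azure.messaging.eventhubs.checkpointstore.blob.BlobCheckpointStore;
import com.azure.storage.blob.BlobContainerClientBuilder;

public class PartitionedConsumer {

    public static void main(String[] args) {
        // Shared checkpoint store: this is how multiple instances coordinate
        // which instance currently owns which partitions.
        BlobCheckpointStore checkpointStore = new BlobCheckpointStore(
                new BlobContainerClientBuilder()
                        .connectionString("<storage-connection-string>")   // placeholder
                        .containerName("eventhub-checkpoints")             // placeholder
                        .buildAsyncClient());

        EventProcessorClient processor = new EventProcessorClientBuilder()
                .connectionString("<event-hub-connection-string>", "<event-hub-name>") // placeholders
                .consumerGroup("$Default")
                .checkpointStore(checkpointStore)
                .processEvent(ctx -> {
                    // Only this instance receives events from the partitions it owns,
                    // so the compute below is distributed, not duplicated.
                    System.out.printf("partition %s: %s%n",
                            ctx.getPartitionContext().getPartitionId(),
                            ctx.getEventData().getBodyAsString());
                    ctx.updateCheckpoint();
                })
                .processError(err -> System.err.println("Error: " + err.getThrowable()))
                .buildEventProcessorClient();

        processor.start();   // run one of these per consumer instance; keep the process alive
    }
}
```

Running several instances of this process against the same consumer group spreads the partitions among them, up to one instance per partition.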

Related

How to deal with concurrent events in an event-driven architecture

Suppose I have an eCommerce application designed in an event-driven architecture. I would publish events like ProductCreated and ProductPriceUpdated. Typically both events are published on separate channels.
Now a consumer of those events comes into play and reacts to them, for example to generate a price chart for specific products.
In fact, this consumer has the requirement to consume the ProductCreated event first, to create a Product entity with the necessary information in its own bounded context. Only once a product has been created can price points be added to the chart. Depending on the consumer's performance it can easily happen that those events arrive "out of order".
What are the possible strategies to fulfill this requirement?
The following came to my mind:
1. Publish both events onto the same channel with ordering guarantees. For example, in Kafka both events would be published to the same partition. However, this would mean that a topic/partition would grow with its events, I would have to deal with different schemas, and the documentation would grow.
2. Use documents over events. Simply publish every state change of the product entity as a single ProductUpdated event or similar. This way I would lose the semantics of the message and need to figure out on the consumer side what exactly changed.
3. Defer event consumption. If my consumer consumes a ProductPriceUpdated event and no such product has been created yet, I postpone consumption by storing the event in a database and coming back to it later, or use retry topics in Kafka terms.
4. Create a minimal entity. Once I receive a ProductPriceUpdated event I would probably have a correlation id or something to identify the entity, so I simply create an entity with just this id and fill in the missing information once the ProductCreated event arrives.
Just thought of giving you some inline comments on options #1, #3 and #4, based on my understanding of your requirements.
Publish both events onto the same channel with ordering guarantees. For example, in Kafka both events would be published to the same partition. However, this would mean that a topic/partition would grow with its events, I would have to deal with different schemas, and the documentation would grow.
[Chris]: Apache Kafka preserves the order of messages within a partition. However, the mapping of keys to partitions is consistent only as long as the number of partitions in a topic does not change. So as long as the number of partitions is constant, you can be sure the order is guaranteed. When partitioning by key is important, the easiest solution is to create topics with sufficient partitions and never add partitions.
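If you did go with option #1, the ordering guarantee comes from keying both event types by the same product id so that they land on the same partition. A minimal sketch (topic name, broker address, and payloads are assumptions):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProductEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");          // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key => same partition => per-product ordering is preserved.
            String productId = "product-42";
            producer.send(new ProducerRecord<>("product-events", productId, "{\"type\":\"ProductCreated\"}"));
            producer.send(new ProducerRecord<>("product-events", productId, "{\"type\":\"ProductPriceUpdated\"}"));
        }
    }
}
```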
Defer event consumption. If my consumer consumes a ProductPriceUpdated event and no such product has been created yet, I postpone consumption by storing the event in a database and coming back to it later, or use retry topics in Kafka terms.
[Chris]: If latency is not a concern, and if you are okay with the additional operational overhead of adding a new component to your solution, such as a storage layer, this pattern looks fine.
Create a minimal entity. Once I receive a ProductPriceUpdated event I would probably have a correlation id or something to identify the entity, so I simply create an entity with just this id and fill in the missing information once the ProductCreated event arrives.
[Chris]: This is a fairly common integration pattern (Messaging Layer -> Backend REST API) that works over a unique identifier, in this case a correlation id.
This can easily be achieved if you have separate topics and consumers per event and the order of messages from the producer is guaranteed. Thus, option #1 becomes obsolete.
From my perspective, options #3 and #4 look much the same, and #4 would be ideal.
On another note, if you are thinking of bringing Kafka Streams/KTables into your solution, just go for it, as there is a strong relationship between streams and tables known as the stream-table duality.
The duality of streams and tables allows your application to support more elastic, fault-tolerant stateful processing and to run interactive queries. KSQL adds more flavour on top, because this use case is essentially data enrichment at the integration layer.
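As an illustration of that duality, here is a minimal Kafka Streams sketch (topic names, key/value types, application id, and broker address are assumptions, not from the original posts): ProductCreated events are materialized as a KTable keyed by product id, and ProductPriceUpdated events are joined against it, so output is only produced once the corresponding product exists.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class PriceChartTopology {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // ProductCreated events materialized as a table, keyed by product id.
        KTable<String, String> products =
                builder.table("product-created", Consumed.with(Serdes.String(), Serdes.String()));

        // ProductPriceUpdated events as a stream, keyed by the same product id.
        KStream<String, String> priceUpdates =
                builder.stream("product-price-updated", Consumed.with(Serdes.String(), Serdes.String()));

        // Enrich each price update with its product. Note: with an inner join,
        // a price update arriving before its ProductCreated event is dropped,
        // so a deferral/retry strategy (option #3) may still be needed on top.
        priceUpdates
                .join(products, (price, product) -> product + " -> " + price)
                .to("price-chart-points", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "price-chart-builder");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}
```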

How can we check that there are no more events or messages left on the topic to consume?

Is there a way to check that there are no more events or messages left on the topic to consume in Spring Boot Kafka? In my scenario, I receive data from two sources: one is a Kafka topic, and for the other I can get a complete dump of data by connecting to some other DB. So after consuming all the messages from the Kafka topic, I need to compare the count of data received from the topic with the count of data obtained via the DB connection.
Is it possible to do so? I know how to write the code in Spring Boot to start consuming events from a Kafka topic and how to set up DB connectivity to read data from one DB table and insert it into another DB table.
See the documentation about detecting idle listener containers.
While efficient, one problem with asynchronous consumers is detecting when they are idle. You might want to take some action if no messages arrive for some period of time.
You can configure the listener container to publish a ListenerContainerIdleEvent when some time passes with no message delivery. While the container is idle, an event is published every idleEventInterval milliseconds.
...
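As a rough sketch of that approach with Spring Boot and spring-kafka (the listener id, topic name, and interval are illustrative, not from the original question): configure an idle event interval and react to ListenerContainerIdleEvent to decide when the topic is drained enough to run the count comparison.

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.context.event.EventListener;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.event.ListenerContainerIdleEvent;

@Configuration
public class IdleDetectionConfig {

    // Assumes application.properties contains:
    // spring.kafka.listener.idle-event-interval=30000

    @KafkaListener(id = "countingListener", topics = "source-topic") // topic name is an assumption
    public void consume(String record) {
        // count / process each consumed record here
    }

    @EventListener(condition = "event.listenerId.startsWith('countingListener')")
    public void onIdle(ListenerContainerIdleEvent event) {
        // No records were delivered for idleEventInterval ms, so the topic is
        // (currently) drained; compare the consumed count with the DB count here.
    }
}
```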

Correct Number of Partitions/Replicas for @RetryableTopic Retry Topics

Hello Stack Overflow community and anyone familiar with spring-kafka!
I am currently working on a project which leverages the @RetryableTopic feature from spring-kafka in order to reattempt delivery of failed messages. The listener annotated with @RetryableTopic consumes from a topic that has 50 partitions and 3 replicas. When the app receives a lot of traffic, it can be autoscaled up to 50 instances of the app (consumers) reading from those partitions. I read in the spring-kafka documentation that, by default, the retry topics that @RetryableTopic auto-creates are created with one partition and one replica, but you can change these values with autoCreateTopicsWith() in the configuration. From this, I have a few questions:
With the autoscaling in mind, is it recommended to just create the retry topics with the same number of partitions and replicas (50 & 3) as the original topic?
Is there some benefit to having differing numbers of partitions/replicas for the retry topics considering their default values are just one?
The retry topics should have at least as many partitions as the original (by default, records are sent to the same partition); otherwise you have to customize the destination resolution to avoid the warning log. See Destination resolver returned non-existent partition
50 partitions might be overkill unless you get a lot of retried records.
It's up to you how many replicas you want, but in general, yes, I would use the same number of replicas as the original.
Only you can decide what are the "correct" numbers.
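If you do want the retry topics auto-created with matching counts, a sketch along these lines should work (assuming a recent spring-kafka version; the bean name and key/value types are illustrative):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.retrytopic.RetryTopicConfiguration;
import org.springframework.kafka.retrytopic.RetryTopicConfigurationBuilder;

@Configuration
public class RetryTopicConfig {

    @Bean
    public RetryTopicConfiguration retryTopicConfiguration(KafkaTemplate<String, String> template) {
        return RetryTopicConfigurationBuilder
                .newInstance()
                .autoCreateTopicsWith(50, (short) 3) // match the main topic: 50 partitions, 3 replicas
                .create(template);
    }
}
```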

A SpringBatch job to produce events for a PubSub preserving source order

I'm considering creating a Spring Batch job that uses rows from a table to create events and push them to a Pub/Sub implementation. The problem is that the order of the events should be the same as the order of the rows in the table used as the source for the event creation process.
It seems to me that Spring Batch is not designed for such order preservation, as batches are processed and then written in parallel. The only ugly but probably working solution would be to do all the work in the reader (so the reader does reading + processing + writing to Pub/Sub), which could help keep the order inside paginated batches, but even that doesn't seem to guarantee the order of the batches themselves, according to the doc.
Any ideas how the transition from ordered rows to ordered events could be implemented using Spring Batch or, at least, Spring Boot? Thank you in advance!
It seems to me that Spring Batch is not designed for such order preservation, as batches are processed and then written in parallel.
This is true only for a multi-threaded or a partitioned step. The default (single-threaded) chunk-oriented step implementation processes items in the same order returned by the item reader. So if you make your database reader return items in the order you want, those items will be written to your Pub/Sub broker in the same order.
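A minimal sketch of that single-threaded setup (assuming Spring Batch 5; the table, column names, row type, and the PubSubPublisher abstraction are illustrative placeholders):

```java
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.database.JdbcPagingItemReader;
import org.springframework.batch.item.database.Order;
import org.springframework.batch.item.database.builder.JdbcPagingItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class OrderedPublishJobConfig {

    // Hypothetical row type and publisher abstraction, just for illustration.
    record EventRow(long id, String payload) {}
    interface PubSubPublisher { void publish(EventRow row); }

    @Bean
    JdbcPagingItemReader<EventRow> orderedReader(DataSource dataSource) {
        return new JdbcPagingItemReaderBuilder<EventRow>()
                .name("orderedReader")
                .dataSource(dataSource)
                .selectClause("select id, payload")
                .fromClause("from source_events")               // table name is an assumption
                .sortKeys(Map.of("id", Order.ASCENDING))        // deterministic order across pages
                .rowMapper((rs, i) -> new EventRow(rs.getLong("id"), rs.getString("payload")))
                .pageSize(100)
                .build();
    }

    @Bean
    Step publishStep(JobRepository jobRepository, PlatformTransactionManager txManager,
                     JdbcPagingItemReader<EventRow> reader, PubSubPublisher publisher) {
        // No taskExecutor and no partitioning: the default step is single-threaded,
        // so items are written in the same order the reader returned them.
        return new StepBuilder("publishStep", jobRepository)
                .<EventRow, EventRow>chunk(100, txManager)
                .reader(reader)
                .writer(chunk -> chunk.getItems().forEach(publisher::publish))
                .build();
    }
}
```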

DocumentDB unique concurrent insert?

I have a horizontally scaled, event-source-driven application that runs using an Azure Service Bus Topic and a Service Bus Queue. Some events for building up my domain model's state are received through the topic by all my servers, while the ones on the queue (received much more often and not mutating domain model state) are distributed among the servers in order to spread the load.
Now, every time one of my servers receives an event through the queue or topic, it stores it in a DocumentDB which it uses as an event store.
Here's the problem: how can I be sure that the same document is not inserted twice? Let's say 3 servers receive the same event and all try to store it. How can I make the insert fail for 2 of the servers if they all attempt it at the same time? Is there any form of unique constraint I can set in DocumentDB, or some kind of transaction scope, to prevent the document from being inserted twice?
The id property for each document has a uniqueness constraint. You can use this constraint to ensure that duplicate documents are not written to a collection.
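A minimal sketch of relying on that constraint with the current azure-cosmos Java SDK (DocumentDB is Cosmos DB today; the endpoint, key, and database/container names are placeholders): set the document id to the event's unique id, insert with createItem, and treat a 409 Conflict as "another server already stored it". Keep in mind that in partitioned containers the id is unique per logical partition, so the partition key should also be derived from the event.

```java
import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.CosmosException;

public class EventStoreWriter {

    // Hypothetical event document: id is set to the event's unique id, so a
    // second insert of the same event violates the id uniqueness constraint.
    public static class EventDocument {
        public String id;        // must be named "id" for Cosmos DB / DocumentDB
        public String payload;
        public EventDocument() {}
        public EventDocument(String id, String payload) { this.id = id; this.payload = payload; }
    }

    public static void main(String[] args) {
        CosmosClient client = new CosmosClientBuilder()
                .endpoint("https://<account>.documents.azure.com:443/")   // placeholder
                .key("<key>")                                             // placeholder
                .buildClient();
        CosmosContainer container = client.getDatabase("eventstore").getContainer("events");

        EventDocument event = new EventDocument("event-123", "{...}");
        try {
            container.createItem(event);            // first server to insert wins
        } catch (CosmosException e) {
            if (e.getStatusCode() == 409) {
                // Conflict: another server already stored this event, safe to ignore.
            } else {
                throw e;
            }
        }
    }
}
```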
