Duplicate events read by a Kafka consumer

We observed that one of our consumers tries to pick up the same events multiple times from a Kafka topic. We have the following settings on the consumer application side:
spring.kafka.consumer.enable-auto-commit=false and spring.kafka.consumer.auto-offset-reset=earliest.
How can the consumer application avoid these duplicates?
Do we need to fine-tune the above configuration settings so that the consumer does not pick up events multiple times from the Kafka topic?

Since you've disabled auto commits, you need to control when you actually commit a record's offset. If the consumer stops or rebalances after processing a record but before committing its offset, that record is delivered again, which gives you at-least-once processing.
You could also read the examples of the exactly-once processing capabilities using transactions and idempotent producers.
The auto.offset.reset setting only applies if your consumer group has no committed offsets, for example because the group was removed or never existed (you're not committing anything). In that case, with earliest, you're always going to read from the beginning of the topic.
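To make the at-least-once point concrete, here is a minimal sketch that uses no Kafka client at all: an in-memory list stands in for the partition and an integer stands in for the group's committed offset. The class name, the crash parameter, and the idea of deduplicating on an event ID are all assumptions made for illustration, not part of the original setup.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// In-memory stand-in for one partition plus a consumer group's committed offset.
// No Kafka API is used; every name here is hypothetical.
class CommitAfterProcessingSketch {
    private final List<String> log;            // the record at index i has offset i
    private int committedOffset = 0;           // where the group resumes after a restart
    private final Set<String> seen = new HashSet<>();
    final List<String> sideEffects = new ArrayList<>();

    CommitAfterProcessingSketch(List<String> log) {
        this.log = log;
    }

    // Reads from the committed offset, processes each record, and commits only
    // after processing. If crashBeforeCommitAt equals a record's offset, the run
    // stops after the side effect but before the commit. Pass -1 for "no crash".
    void run(int crashBeforeCommitAt, boolean deduplicate) {
        for (int offset = committedOffset; offset < log.size(); offset++) {
            String eventId = log.get(offset);
            if (!deduplicate || seen.add(eventId)) {
                sideEffects.add(eventId);      // the business logic
            }
            if (offset == crashBeforeCommitAt) {
                return;                        // simulated crash: nothing is committed
            }
            committedOffset = offset + 1;      // commit after successful processing
        }
    }

    public static void main(String[] args) {
        CommitAfterProcessingSketch plain =
                new CommitAfterProcessingSketch(List.of("e0", "e1", "e2"));
        plain.run(1, false);   // crashes after processing e1, before committing it
        plain.run(-1, false);  // restart: resumes at the last committed offset
        System.out.println(plain.sideEffects);
    }
}
```

Without deduplication, e1 is processed twice because its offset was never committed. Passing true for deduplicate skips the second delivery, which is the usual way to make a consumer tolerate at-least-once delivery when transactions are not an option.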

Related

ActiveMQ - Competing Consumers with Selector - messages starve in the queue

ActiveMQ 5.15.13
Context: I have a single queue with multiple Consumers. I want to stop some consumers from processing certain messages. This has to be dynamic, I don't want to create separate queues for this. This works without any problems. e.g. Consumer1 ignores Stocks -> Consumer1 can process all invoices and Consumer2 can process all Stocks
But if there is a large number of messages already in the Queue (of one type, e.g. stocks) and I send a message of another type (e.g. invoices), Consumer1 won't process the message of type invoices. It will instead be idle until Consumer2 has processed all Stocks messages. It does not happen every time, but quite often.
Is there any option to change the order of the new messages coming into the queue, such that an idle consumer with matching selector picks up the new message?
Things I've already tried:
using a PendingMessageLimitStrategy -> it seems like it does not work for queues
increasing the maxPageSize and maxBrowsePageSize in the hope that once all Messages are in RAM, the Consumers will search for their messages.
Exclusive Consumers aren't an option since I want to be able to use more than one Consumer per message type.
I'm pretty sure that there is some configuration which allows this type of usage. I'm aware that there are better solutions for this issue, but sadly I can't use them easily due to other constraints.
Thanks a lot in advance!
EDIT: I noticed that when I'm refreshing on the localhost queue browser, the stuck messages get executed immediately. It seems like this action performs some sort of queue refresh where the messages get filtered based on their selector again. So I just need this action whenever a new message enters the queue...
This is a 'window' problem where the next set of 'stocks' data needs to be processed before the 'invoicing' data can be processed.
The gotcha with window problems like this is that you need to account for the fact that some messages may never come through, or a consumer may never come back online either. Also, eventually you will be asked 'how many invoices or stocks are left to be processed'-- aka observability.
ActiveMQ has you covered-- check out wild-card destinations and consumers.
Produce 'stocks' to:
queue://data.stocks.input
Produce 'invoices' to:
queue://data.invoices.input
You then set up consumers to connect to:
queue://data.*.input
Note the wildcard '*'.
ActiveMQ will match queues based on the wildcard pattern, and then process data accordingly. As a bonus, you can still use a selector.
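As a rough illustration of which queue names the pattern above would cover, here is a small matcher written against plain strings. It is not ActiveMQ's implementation and does not touch the broker; it only assumes that '.' separates the parts of a destination name and that '*' stands for exactly one part. ActiveMQ's '>' wildcard is left out, and the class and method names are made up for this sketch.

```java
// Plain-string sketch of single-element wildcard matching on destination names.
// Not ActiveMQ code; the names here are hypothetical.
class WildcardDestinationSketch {

    // Returns true when every '.'-separated element of the pattern is either
    // "*" or equal to the element at the same position in the destination.
    static boolean matches(String pattern, String destination) {
        String[] patternParts = pattern.split("\\.");
        String[] nameParts = destination.split("\\.");
        if (patternParts.length != nameParts.length) {
            return false;                      // "*" covers one element, never several
        }
        for (int i = 0; i < patternParts.length; i++) {
            if (!patternParts[i].equals("*") && !patternParts[i].equals(nameParts[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matches("data.*.input", "data.stocks.input"));
        System.out.println(matches("data.*.input", "data.invoices.input"));
        System.out.println(matches("data.*.input", "data.stocks.output"));
    }
}
```

Under these assumptions both data.stocks.input and data.invoices.input match, while data.stocks.output does not, so a consumer on the wildcard sees both input queues and a consumer on one concrete queue sees only its own type.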

Does EventStoreDB provide message ordering by an event-key on the consumer side?

I have been exploring EventStoreDB and trying to understand more about the ordering of messages on the consumer side. I have read about persistent subscriptions and the Pinned consumer strategy.
I have a scenario wherein inventory updates get pushed to eventstore and different streams get created by the different unique inventoryIds in the inventory event.
We have multiple consumers with the same consumerGroup name to read these inventory events. We are using Pinned Persistent Subscription with ResolveLinkTos enabled.
My question:
Will every message from a particular stream always go to the same consumer instance of the consumerGroup?
If the answer to the above question is yes, will every message from that particular stream reach the particular consumer instance in the same order as the events were ingested?
The documentation has a warning that ordered message processing using persistent subscriptions is not guaranteed. Any strategy delivers messages with the best-effort level of ordering guarantees, if applicable.
There are a few reasons for this, some of those are:
Spreading out messages across consumer groups leads to a non-linearised checkpoint commit. It means that some messages can be processed before other messages.
Persistent subscriptions attempt to buffer messages, but when a timeout happens on the client side, the whole buffer is redelivered, which can eventually break the processing order.
Built-in retry policies can break the message order at any time.
Most event log-based brokers, if not all, don't even attempt to guarantee ordered message delivery across multiple consumers. I often hear "but Kafka does it", ignoring the fact that Kafka delivers messages from one partition to at most one consumer in a group. There's no load balancing of one partition between multiple consumers due to exactly the same issue. That being said, EventStoreDB is still not a broker, but a database for events.
So, here are the answers:
Will every message from a particular stream always go to the same consumer instance of the consumer group?
No. It might work most of the time, but it will eventually break.
will every message from that particular stream reach the particular consumer instance in the same order as the events were ingested?
Most of the time, yes, but again, if a message is being retried, you might get the next message before the previous one is Acked.
Overall, load-balancing ordered processing of messages that aren't pre-partitioned on the server is not an easy task. At most, you get messages re-delivered if the checkpoint fails to persist at some point, and the consumers restart.
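Since the order is only best effort, one option is for the consumer itself to notice when an event arrives out of sequence. The sketch below is not an EventStoreDB client API; it is a hypothetical in-memory guard that assumes each event carries its stream name and a per-stream event number starting at 0. What to do on a duplicate or a gap (skip, park, retry) is left to the application.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical consumer-side guard: remembers the last event number handled
// per stream and classifies each incoming event. State is in memory only, so
// it is lost on restart and is not shared between consumer instances.
class StreamOrderGuardSketch {
    enum Verdict { IN_ORDER, DUPLICATE, GAP }

    private final Map<String, Long> lastHandled = new HashMap<>();

    Verdict check(String streamId, long eventNumber) {
        long expected = lastHandled.getOrDefault(streamId, -1L) + 1;
        if (eventNumber < expected) {
            return Verdict.DUPLICATE;          // redelivery of something already handled
        }
        if (eventNumber > expected) {
            return Verdict.GAP;                // a later event arrived before an earlier one
        }
        lastHandled.put(streamId, eventNumber);
        return Verdict.IN_ORDER;
    }

    public static void main(String[] args) {
        StreamOrderGuardSketch guard = new StreamOrderGuardSketch();
        System.out.println(guard.check("inventory-42", 0));
        System.out.println(guard.check("inventory-42", 2));
    }
}
```

Because the state lives in one process, a stream that moves to another consumer instance would look like a gap there; a real implementation would need to keep the last handled number alongside the data it updates.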

One partition multiple consumers same group, consumer IDs

We have one topic with one partition because of message-ordering requirements. We have two consumers running on different servers with the same configuration, i.e. the same groupId, consumerId, and consumerGroup:
1 Topic -> 1 Partition -> 2 Consumers
When we deploy, the same code goes to both servers. We noticed that when a message arrives, both consumers consume it rather than only one of them. The reason for running consumers on two separate servers is that if one server crashes, the other can continue processing messages. But it looks like when both are up, both consume messages. The Kafka docs say that if there are more consumers than partitions, some consumers stay idle, but we don't see that happening. Is there anything we are missing on the configuration side apart from consumerId and groupId? Thanks
As @Gary Russel said, as long as the two consumer instances each have their own consumer group, both will consume every event that is written to the topic. Just put them into the same consumer group. You can provide the consumer group ID via group.id in consumer.properties.
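The difference between "same group" and "separate groups" can be sketched without a broker. The code below is a simplified round-robin assignment written for illustration only; it is not Kafka's assignor, and the consumer and group names are made up. The only rule it models is that each partition goes to exactly one member of every group.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of partition assignment: within each group, every partition
// is handed to exactly one member. Hypothetical names, not Kafka's assignor.
class GroupAssignmentSketch {

    // consumerToGroup maps consumer name -> group id (iteration order is kept).
    // Returns consumer name -> list of partitions assigned to that consumer.
    static Map<String, List<Integer>> assign(Map<String, String> consumerToGroup, int partitions) {
        Map<String, List<String>> membersByGroup = new LinkedHashMap<>();
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : consumerToGroup.entrySet()) {
            membersByGroup.computeIfAbsent(e.getValue(), g -> new ArrayList<>()).add(e.getKey());
            assignment.put(e.getKey(), new ArrayList<>());
        }
        for (List<String> members : membersByGroup.values()) {
            for (int p = 0; p < partitions; p++) {
                assignment.get(members.get(p % members.size())).add(p);
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        Map<String, String> sameGroup = new LinkedHashMap<>();
        sameGroup.put("server-1", "orders");
        sameGroup.put("server-2", "orders");
        System.out.println(assign(sameGroup, 1));

        Map<String, String> separateGroups = new LinkedHashMap<>();
        separateGroups.put("server-1", "orders-a");
        separateGroups.put("server-2", "orders-b");
        System.out.println(assign(separateGroups, 1));
    }
}
```

With one partition and a shared group, only one server gets the partition and the other stays idle as a standby. With two groups, both servers are assigned partition 0, which matches the behaviour described in the question.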

What will happen if my kafka consumer group is changed after each restart

Let’s say for instance, my kafka consumer (in Consumer Group 1) is reading messages from Kafka Topic A.
Now suppose that consumer consumes 12 messages before failing.
When the consumer starts up again, it has a different consumer group (i.e. consumer group 2).
Question 1 -> On restart, will it continue from where it left off, because that offset is stored by Kafka and/or ZooKeeper, or will it start consuming from the first message?
Question 2 -> Is there a way to ensure that on restart (when the consumer has a different consumer group), it still starts consuming from where it left off before restarting?
Just to give you the context, I am trying to update in-memory caches in each node/server on receiving a message on a Kafka topic. In order to do that, I am using a different consumer group for each node/server so that each message is consumed by all the nodes/servers to update the in-memory cache. Please let me know if there are better ways to do this. Thanks!
Consumer offsets are maintained per consumer group and hence if you have a different consumer group on each restart you can make use of the auto.offset.reset property
The auto.offset.reset property specifies
What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
Given the current approach, I believe you should revisit the design: it would be better to keep a different consumer group per node, but make sure each node keeps the same consumer group name even after a restart, so that its committed offsets are still found. This is a suggestion based on the information provided; there could be better solutions once the design and implementation are examined in detail.
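The rule quoted above can be written down as a small decision function. This is only a sketch of the documented behaviour, not code from the Kafka client: the class, method, and parameter names are hypothetical, and the log start and end offsets are passed in as plain numbers.

```java
import java.util.OptionalLong;

// Sketch of how the starting position follows from the committed offset and
// auto.offset.reset. Hypothetical names; no Kafka client is involved.
class OffsetResetSketch {

    static long startingOffset(OptionalLong committed, String autoOffsetReset,
                               long logStartOffset, long logEndOffset) {
        // A committed offset that still exists on the server always wins.
        if (committed.isPresent()
                && committed.getAsLong() >= logStartOffset
                && committed.getAsLong() <= logEndOffset) {
            return committed.getAsLong();
        }
        // Otherwise the reset policy decides.
        switch (autoOffsetReset) {
            case "earliest":
                return logStartOffset;
            case "latest":
                return logEndOffset;
            default:
                throw new IllegalStateException(
                        "no valid committed offset and auto.offset.reset=" + autoOffsetReset);
        }
    }

    public static void main(String[] args) {
        // Consumer group 1 committed offset 12 before failing; the topic holds 20 records.
        System.out.println(startingOffset(OptionalLong.of(12), "earliest", 0, 20));
        // Consumer group 2 is new, so it has no committed offset.
        System.out.println(startingOffset(OptionalLong.empty(), "earliest", 0, 20));
        System.out.println(startingOffset(OptionalLong.empty(), "latest", 0, 20));
    }
}
```

Assuming group 1 had committed after its 12 messages, restarting under the same group resumes at 12, while a brand-new group starts at 0 with earliest or at the end of the log with latest. That is why keeping a stable group name per node lets each node continue from where it left off.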

How can I reset Kafka state to "start of universe"?

I'm still working on a Kafka Streams application that I described in
Why isn't Kafka consumer producing results?. In that posting, I asked why setting
kstreams_props.put( ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
doesn't appear to reset the state of Kafka to "start of the universe" before any data are pushed to any topic. I am now encountering a variant of that issue:
My application consists of a producer program that pushes data to a Kafka stream and a consumer program that groups the data, aggregates the groups, and then converts the resulting KTable back into a stream, which I print out.
The aggregation step is essentially adding up all the values, then putting those sums into the output stream as new data. What I observe, though, is that every time I run the program, the resulting aggregated values get bigger and bigger, almost as if Kafka is somehow retaining the previous results and including those in the aggregation.
In order to try fixing this, I deleted all my topics (except for __consumer_offsets, which Kafka would not allow), then re-ran my application, but the aggregated values continue to grow, as if Kafka were retaining the result of previous computations even though I thought that deleting the intermediate topics would fix things. I even tried stopping and restarting the Kafka server, to no avail.
What's going on here and, more to the point, how can I fix this? I've tried various suggestions about setting AUTO_OFFSET_RESET_CONFIG, also with no effect. I should mention that one aspect of my application is that my original producer creates its own Kafka timestamps in the Producer.send call, although disabling that also seemed to have no effect.
Thanks in advance, -- Mark
AUTO_OFFSET_RESET_CONFIG only takes effect if there are no committed offsets: when an application starts, it first looks for committed offsets and applies the reset policy only if there are no valid offsets.
Furthermore, for a Kafka Streams application, resetting offsets would not be sufficient and you should use the reset tool bin/kafka-streams-application-reset.sh -- this blog post explains the tool in detail: https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/

Resources