My Spring Kafka listener is receiving duplicate messages. I can see the messages are being polled from the same partition and offset, with the same timestamp. In my code, I keep track of every incoming message and identify duplicates, but in this case I cannot even reject the duplicate from processing: both messages, original and duplicate, are received at almost the same time, and the first record is not yet committed in the database tracking table.
1. Please suggest how I should avoid polling duplicate messages; I don't understand why a message is polled twice, and only under load.
2. How can I handle this in the tracking table? If message 1's metadata is being processed but not yet committed in the tracking table, message 2 arrives, fails to find that record in the tracking table, and proceeds with processing the duplicate message again.
Config of the listener, based on my use case:
config.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);    // offsets are committed manually/by the container, not auto-committed
config.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000); // up to 5 min between polls; exceeding this triggers a rebalance, a common cause of redelivery
config.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50);         // at most 50 records per poll
config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");  // start from the latest offset when none is committed
config.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 15000);    // broker considers the consumer dead after 15 s without heartbeats
The two consumers need to have the same group.id property so that the partitions are distributed across them.
If they are in different consumer groups, they will both get all the records.
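For example, in the same style as the config above (the group name "my-service-group" is illustrative):
config.put(ConsumerConfig.GROUP_ID_CONFIG, "my-service-group"); // same value on both consumers, so partitions are split between them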
Requirement:
Hold the data during DB downtime and process it at a 5-minute interval by keeping it in a dead-letter queue.
I have tried the approaches below:
A Kafka retry topic, but there are limitations: I have no control over the listener to configure the interval, and the @KafkaListener picks the message up as soon as we push it (see the sketch after this list).
Picking the message up in the Kafka listener and storing it in a HashSet, with a scheduler that scans the set on a 5-minute delay and wipes it out (this approach is not practical since the set is in memory).
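For context, a minimal sketch of the retry-topic approach with the 5-minute interval configured through the backoff delay, assuming spring-kafka 2.7+ where @RetryableTopic accepts a backoff; the topic, group, and method names are illustrative:

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.annotation.RetryableTopic;
import org.springframework.retry.annotation.Backoff;
import org.springframework.stereotype.Component;

@Component
public class DbWriterListener {

    // Each failed attempt is re-delivered via an auto-created retry topic
    // after a 5-minute (300000 ms) backoff, so records are effectively
    // "held" in Kafka while the DB is down.
    @RetryableTopic(attempts = "5", backoff = @Backoff(delay = 300000))
    @KafkaListener(topics = "incoming-data", groupId = "db-writer")
    public void listen(String message) {
        saveToDb(message); // throwing here while the DB is down triggers the retry flow
    }

    private void saveToDb(String message) {
        // illustrative placeholder: persist the record
    }
}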
After watching this awesome talk by Martin Kleppmann about how Kafka can be used to stream events so that we can get rid of two-phase commits, I have a couple of questions related to updating a cache only when the database is updated properly.
Problem Statement
Let's say you have a Redis cache which stores the user's profile picture and a Postgres database which is used for all user-related operations (creation, update, deletion, etc.).
I want to update my Redis cache if and only if a new user has been successfully added to my database.
How can I do that using Kafka?
If I am to take the example given in the video, then the workflow would look something like this:
User registers
Request is handled by the User Registration Microservice
User Registration Microservice inserts a new entry into the Users table.
Then it generates a User Creation Event in the user_created topic.
Cache population microservice consumes the newly created User Creation Event
Cache population microservice updates the Redis cache.
The problem starts with what would happen if the User Registration Microservice crashed just after writing to the database, but before it could send the event to Kafka.
What would be the correct way of handling this?
Does the User Registration Microservice maintain the last event it published? How can it reliably do that? Does it write to a DB? Then the problem starts all over again: what if it published the event to Kafka but failed before it could update its last known offset?
There are three broad approaches one can take for this:
There's the transactional outbox pattern, wherein, in the same transaction as inserting the new entry into the user table, a corresponding user creation event is inserted into an outbox table. Some process then eventually queries that outbox table, publishes the events in that table to Kafka, and deletes the events in the table. Since the inserts are in the same transaction, they either both occur or neither occurs; barring a bug in the process which publishes the outbox to Kafka, this guarantees that every user insert eventually has an associated event published (at least once) to Kafka.
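A minimal sketch of the outbox write path, assuming Spring's JdbcTemplate and hypothetical users and outbox tables:

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class UserRegistrationService {

    private final JdbcTemplate jdbc;

    public UserRegistrationService(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    @Transactional
    public void register(String userId, String name) {
        // Both inserts share one transaction: either both commit or neither
        // does, so a crash cannot leave a user row without its outbox event.
        jdbc.update("INSERT INTO users (id, name) VALUES (?, ?)", userId, name);
        jdbc.update("INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
                userId, "user_created", "{\"id\":\"" + userId + "\"}");
        // A separate relay process polls the outbox table, publishes each row
        // to Kafka, and deletes it after a successful send.
    }
}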
There's a more event-sourcingish pattern, where you publish the user creation event to Kafka and then some consuming process inserts into the user table based on the event. Since this happens with a delay, this strongly suggests that the user registration service needs to keep state of which users it has published creation events for (with the combination of Kafka and Postgres being the source of truth for this). Since Kafka allows a message to be consumed by arbitrarily many consumers, a different consumer can then update Redis.
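A sketch of the consuming side under this pattern: two listeners in different consumer groups receive the same user_created event independently, one materializing it into Postgres and one into Redis (topic, group, and table names are illustrative):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class UserCreatedConsumers {

    private final JdbcTemplate jdbc;
    private final StringRedisTemplate redis;

    public UserCreatedConsumers(JdbcTemplate jdbc, StringRedisTemplate redis) {
        this.jdbc = jdbc;
        this.redis = redis;
    }

    // Group 1: materializes the event into the users table.
    @KafkaListener(topics = "user_created", groupId = "user-table-writer")
    public void writeToDb(ConsumerRecord<String, String> record) {
        jdbc.update("INSERT INTO users (id, payload) VALUES (?, ?)", record.key(), record.value());
    }

    // Group 2: consumes the same event independently and updates the cache.
    @KafkaListener(topics = "user_created", groupId = "cache-populator")
    public void writeToCache(ConsumerRecord<String, String> record) {
        redis.opsForValue().set("user:" + record.key(), record.value());
    }
}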
Change data capture (e.g. Debezium) can be used to tie into Postgres' write-ahead log (as Postgres actually event sources under the hood...) and publish an event that essentially says "this row was inserted into the user table" to Kafka. A consumer of that event can then translate that into a user created event.
CDC in some sense moves the transactional outbox into the infrastructure, at the cost of requiring that the context it inherently throws away be reconstructed later (which is not always possible).
That said, I'd strongly advise against having ____ creation be a microservice and I'd likewise strongly advise against a RInK store like Redis. Both of these smell like attempts to paper over architectural deficiencies by adding microservices and caches.
The one-foot-on-the-way-to-event-sourcing approach isn't one I'd recommend, but if one starts there, the requirement to make the registration service stateful suddenly opens up possibilities which may remove the need for Redis, limit the need for a Kafka-like thing, and allow you to treat the existence of a DB as an implementation detail.
Is there a way to check that there are no more events or messages left to consume on a topic in Spring Boot Kafka? In my scenario, I receive data from two sources: one is a Kafka topic, and for the other I can get a complete dump of the data by connecting to some other DB. So after consuming all the messages from the Kafka topic, I need to compare the count of data received from the topic with the count of data I get via the DB connection.
Is it possible to do so? I know how to write the code in Spring Boot to start consuming events from a Kafka topic, and how to make a DB connection to get data from one DB table and insert it into another.
See the documentation about detecting idle listener containers.
While efficient, one problem with asynchronous consumers is detecting when they are idle. You might want to take some action if no messages arrive for some period of time.
You can configure the listener container to publish a ListenerContainerIdleEvent when some time passes with no message delivery. While the container is idle, an event is published every idleEventInterval milliseconds.
...
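A minimal sketch of consuming that event, assuming the container factory's idleEventInterval has been set; the listener body and log line are illustrative:

import org.springframework.context.event.EventListener;
import org.springframework.kafka.event.ListenerContainerIdleEvent;
import org.springframework.stereotype.Component;

@Component
public class IdleDetector {

    // Fired every idleEventInterval ms while no records are delivered, e.g.
    // after factory.getContainerProperties().setIdleEventInterval(60000L);
    @EventListener
    public void onIdle(ListenerContainerIdleEvent event) {
        // A safe point to compare the count consumed from the topic with the
        // count obtained from the DB dump.
        System.out.println("Container " + event.getListenerId() + " is idle");
    }
}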
Hello Stack Overflow community and anyone familiar with spring-kafka!
I am currently working on a project which leverages the @RetryableTopic feature from spring-kafka in order to reattempt the delivery of failed messages. The listener annotated with @RetryableTopic is consuming from a topic that has 50 partitions and 3 replicas. When the app is receiving a lot of traffic, it could possibly be autoscaled up to 50 instances of the app (consumers) grabbing from those partitions. I read in the spring-kafka documentation that, by default, the retry topics that @RetryableTopic auto-creates are created with one partition and one replica, but you can change these values with autoCreateTopicsWith() in the configuration. From this, I have a few questions:
With the autoscaling in mind, is it recommended to just create the retry topics with the same number of partitions and replicas (50 & 3) as the original topic?
Is there some benefit to having differing numbers of partitions/replicas for the retry topics considering their default values are just one?
The retry topics should have at least as many partitions as the original (by default, records are sent to the same partition); otherwise you have to customize the destination resolution to avoid the warning log. See Destination resolver returned non-existent partition
50 partitions might be overkill unless you get a lot of retried records.
It's up to you how many replicas you want, but in general, yes, I would use the same number of replicas as the original.
Only you can decide what are the "correct" numbers.
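For reference, a sketch of setting those values through a RetryTopicConfiguration bean (the bean name and key/value types are illustrative):

import org.springframework.context.annotation.Bean;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.retrytopic.RetryTopicConfiguration;
import org.springframework.kafka.retrytopic.RetryTopicConfigurationBuilder;

@Bean
public RetryTopicConfiguration retryTopicConfiguration(KafkaTemplate<String, String> template) {
    return RetryTopicConfigurationBuilder
            .newInstance()
            // size the retry topics like the original: 50 partitions, replication factor 3
            .autoCreateTopicsWith(50, (short) 3)
            .create(template);
}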
Good day, everyone!
I found that for every queue that has multiple subscribers, Oracle creates a supplementary view AQ$<queue_table_name> where it keeps a history of propagated messages. The only smart way to purge propagated messages from the source queue table seems to be:
Find the corresponding row in AQ$<queue_table_name> and make sure that msg_state is PROCESSED for each subscriber.
Purge that row from the source queue table.
Please correct me if I am wrong in thinking of this as the only smart way.
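To make the procedure concrete, a hedged JDBC sketch of the purge step using DBMS_AQADM.PURGE_QUEUE_TABLE with a purge_condition over the AQ$ view's msg_state column; the connection details and queue table name are illustrative:

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class PurgeProcessedMessages {

    public static void main(String[] args) throws Exception {
        // Purges only rows whose msg_state is PROCESSED; "qtview" is the
        // documented alias for columns of the AQ$<queue_table_name> view.
        String plsql =
            "DECLARE po DBMS_AQADM.AQ$_PURGE_OPTIONS_T; " +
            "BEGIN DBMS_AQADM.PURGE_QUEUE_TABLE(" +
            "  queue_table     => 'MY_QUEUE_TABLE', " +
            "  purge_condition => 'qtview.msg_state = ''PROCESSED''', " +
            "  purge_options   => po); " +
            "END;";
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//localhost:1521/ORCL", "aq_admin", "secret");
             CallableStatement cs = conn.prepareCall(plsql)) {
            cs.execute();
        }
    }
}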