Duplicate consumption of messages with Spring Cloud Stream Kafka binder

We have several micro-services using Spring Boot and Spring Cloud Stream Kafka binder to communicate between them.
Occasionally, we observe bursts of duplicate messages received by a consumer - often several days after they were first consumed and processed successfully.
While I understand that Kafka does not guarantee exactly-once delivery, this still looks very strange, given that there were no rebalancing events or any 'suspicious' activity in the logs of either the brokers or the services. Since the consumer interacts with external APIs, it is a bit difficult to make it idempotent.
Any hints on what might be the cause of the duplication? What should I be looking for to figure this out?
We are using Kafka broker 1.0.0, and this particular consumer uses Spring Cloud Stream Binder Kafka 2.0.0, which is based on kafka-client 1.0.2 (version of the other services might be a bit different).

You should show your configuration when asking questions like this.
Best guess is the broker's offsets.retention.minutes. If a group's committed offsets are older than the retention window (for example because the topic has been idle), the broker purges them, and the next restart or rebalance makes the consumer fall back to its auto.offset.reset policy and re-consume old records.
With modern broker versions (since 2.0), the retention defaults to 1 week; with older versions it was only one day.
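A sketch of the two settings to inspect, assuming a plain broker installation and Spring Cloud Stream's pass-through Kafka client configuration; the values shown are illustrative, not recommendations:

# server.properties on the broker: how long committed offsets are kept
offsets.retention.minutes=10080

# consumer side: where to start when no valid committed offset is found;
# 'earliest' re-reads the log from the beginning, which produces exactly
# the kind of duplicate burst described above
spring.cloud.stream.kafka.binder.configuration.auto.offset.reset=earliest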

Related

Spring Batch and Kafka

I am a junior programmer in banking. I want to build a microservice system that gets data from Kafka, processes it, saves the result to a database, and sends the final data to a client app. What technology can I use? I plan to use Spring Batch and Kafka. Can these technologies be used in my project, or is there a better alternative?
To process data from a Kafka topic, I recommend using the Kafka Streams API, in particular through Spring's Kafka Streams support.
Kafka Streams and Spring
And to store the data in a database, you should use a Kafka Sink Connector.
Kafka Connect
This approach is very common and easy if your company has a Kafka ecosystem.
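For illustration, a minimal sketch of a Spring-managed Kafka Streams topology. It assumes Spring Boot auto-configuration (e.g. spring.kafka.streams.application-id is set) and illustrative topic names; the transformation is a placeholder for the real processing:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.annotation.EnableKafkaStreams;

@Configuration
@EnableKafkaStreams
public class ProcessingTopology {

    @Bean
    public KStream<String, String> process(StreamsBuilder builder) {
        // read raw records, transform them, and write to an output topic;
        // a Kafka Connect sink can then persist the output topic to the database
        KStream<String, String> input = builder.stream("raw-data");
        input.mapValues(String::trim) // placeholder for the real business logic
             .to("processed-data");
        return input;
    }
}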
In terms of alternatives, here you will find an interesting comparison:
https://scramjet.org/blog/welcome-to-the-family
Scramjet takes a slightly different approach: three platforms in one. Both the free product for installation on your own server (https://hub.scramjet.org/) and the cloud platform are available, the latter currently also free in its beta version (https://scramjet.org/#join-beta).

Kafka Connect with EventStoreDB

I'm working on a small academic project - event sourcing with EventStoreDB and Apache Kafka as a broker. The idea is to get events from EventStoreDB and push them to Kafka for further distribution. I saw that Apache Kafka has connectors for different DB systems, but I didn't find any connector for EventStoreDB.
How can I create (or use an existing) Kafka connector for EventStoreDB, so that these two systems can transfer events in both directions, from Kafka to EventStoreDB and from EventStoreDB to Kafka?
There is no official Kafka Connect connector between Kafka and EventStoreDB, and I haven't heard of any unofficial one so far. Still, there is a tool called Replicator that enables replicating data from EventStoreDB to Kafka (https://replicator.eventstore.org/docs/features/sinks/kafka/). It's open-sourced, so you can either use it or check the implementation.
For the EventStoreDB-to-Kafka direction, I recommend using the subscriptions mechanism: catch-up if you need an ordering guarantee, persistent if ordering is not critical: https://developers.eventstore.com/clients/grpc/subscriptions.html. The crucial part here is to define how to map EventStoreDB streams to Kafka topics and partitions. Typically you'd expect at least an ordering guarantee at the stream level, so a single stream's events should land in the same partition.
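As a hedged sketch of that mapping, using spring-kafka's KafkaTemplate: keying each record by the originating EventStoreDB stream id lets Kafka's default partitioner keep one stream's events in one partition (the topic name, forwarding method, and JSON payload handling are illustrative assumptions):

import org.springframework.kafka.core.KafkaTemplate;

public class EventForwarder {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public EventForwarder(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    // called from the subscription handler for each received EventStoreDB event
    public void forward(String streamId, String eventJson) {
        // the record key determines the partition, so all events of one
        // EventStoreDB stream keep their order within a single Kafka partition
        kafkaTemplate.send("eventstore-events", streamId, eventJson);
    }
}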
For the Kafka-to-EventStoreDB integration, you could either write your own pass-through service or try the HTTP sink connector (e.g. https://docs.confluent.io/kafka-connect-http/current/overview.html). EventStoreDB exposes an HTTP API (https://developers.eventstore.com/clients/http-api/v5/introduction/). Side note: this API (AtomPub-based) may be replaced with another HTTP API in the future, so the structure may change.
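A rough sketch of what such a sink configuration could look like, with the caveat that the exact property names, the target URL, and any headers EventStoreDB expects must be checked against the two docs linked above; everything below is an assumption to show the shape, not a tested setup:

# hypothetical kafka-connect-http sink configuration
name=eventstore-http-sink
connector.class=io.confluent.connect.http.HttpSinkConnector
topics=events-to-eventstore
http.api.url=http://eventstore:2113/streams/ingested
request.method=post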
You can use Event Store Replicator, which has a Kafka sink.
Keep in mind that it doesn't do anything with regard to event schemas, so things like Kafka Streams and KSQL might not work properly.
The sink was created solely for the purpose of pushing events to Kafka when Kafka is used as a plain message broker.

Avoid multiple listens to ActiveMQ topic with Spring Boot microservice instances

We have configured our ActiveMQ message broker as a Spring Boot project, and there's another Spring Boot application (let's call it service-A) that has a listener configured to listen to some topics using the @JmsListener annotation. It's a Spring Cloud microservice application.
The problem:
It is possible that service-A can have multiple instances running.
If we have 2 instances running, then any message arriving on the topic gets consumed twice.
How can we avoid every instance listening to the topic?
We want to make sure that each message on the topic is consumed only once, no matter the number of service-A instances.
Is it possible to run the microservice in a cluster mode or something similar? I also checked out ActiveMQ virtual destinations, but I'm not sure whether that's the solution to the problem.
We have also thought of an approach where we can decide who's the leader node from the multiple instances, but that's the last resort and we are looking for a cleaner approach.
Any useful pointers, references are welcome.
What you really want is a shared topic subscription, which was added in JMS 2. Unfortunately ActiveMQ 5.x doesn't support JMS 2. However, ActiveMQ Artemis does.
ActiveMQ Artemis is the next generation broker from ActiveMQ. It supports most of the same features as ActiveMQ 5.x (including full support for OpenWire clients) as well as many other features that 5.x doesn't support (e.g. JMS 2, shared-nothing high-availability using replication, last-value queues, ring queues, metrics plugins for integration with tools like Prometheus, duplicate message detection, etc.). Furthermore, ActiveMQ Artemis is built on a high-performance, non-blocking core which means scalability is much better as well.
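For illustration, a minimal sketch of a shared (durable) topic subscription with Spring JMS against a JMS 2-capable broker such as ActiveMQ Artemis; the broker then delivers each message to only one consumer within the subscription, even across application instances. The factory bean, topic, and subscription names are illustrative:

import javax.jms.ConnectionFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jms.annotation.JmsListener;
import org.springframework.jms.config.DefaultJmsListenerContainerFactory;
import org.springframework.stereotype.Component;

@Configuration
public class SharedTopicConfig {

    @Bean
    public DefaultJmsListenerContainerFactory topicFactory(ConnectionFactory connectionFactory) {
        DefaultJmsListenerContainerFactory factory = new DefaultJmsListenerContainerFactory();
        factory.setConnectionFactory(connectionFactory);
        factory.setPubSubDomain(true);        // listen to a topic, not a queue
        factory.setSubscriptionShared(true);  // JMS 2 shared subscription
        factory.setSubscriptionDurable(true); // survives consumer restarts
        return factory;
    }
}

@Component
class ServiceATopicListener {

    @JmsListener(destination = "service.topic",
                 containerFactory = "topicFactory",
                 subscription = "service-A-subscription")
    public void onMessage(String payload) {
        // with a shared subscription, exactly one service-A instance handles each message
    }
}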

For a Spring enterprise web application with multiple instances, what is the way to retrieve the offset value from Kafka and store it?

I'm working on an enterprise web application that has a requirement to read from a Kafka system and then trigger events. Can anyone suggest a way to get the offset, and also an ideal way to store it (the approach should be able to handle access by multiple instances of the application)?
Note:
I'm using spring-kafka and am open to any further suggestions.
Thanks in advance.
With recent versions of Kafka, the offset is stored in a Kafka topic: the broker tracks each consumer group's offset for every partition in a compacted internal topic named __consumer_offsets. In other words, Kafka itself keeps track of the offset for each consumer group, so you don't need to store it yourself.
With Spring for Apache Kafka, several options are provided for when the offset is committed.
In earlier versions of Kafka, offsets were often stored externally; it's now a lot simpler. There may still be use cases for external storage, but such scenarios are all supported by Spring Kafka, especially with the upcoming 2.0 release.
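For illustration, a minimal sketch that lets Kafka store the offsets while the application controls the commit point, using spring-kafka's manual acknowledgment mode. It assumes the container's ack mode is set to MANUAL (e.g. spring.kafka.listener.ack-mode=manual in Spring Boot); topic, group, and handler names are illustrative:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;

@Component
public class EventTriggeringListener {

    @KafkaListener(topics = "events", groupId = "event-processor")
    public void listen(ConsumerRecord<String, String> record, Acknowledgment ack) {
        triggerEvent(record.value()); // hypothetical business logic
        ack.acknowledge();            // commit the offset only after successful processing
    }

    private void triggerEvent(String payload) {
        // placeholder for the application's event-triggering logic
    }
}

Because the offsets live in __consumer_offsets and Kafka distributes partitions across all instances in the same consumer group, multiple application instances coordinate automatically without any shared external store.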

Messages published to all consumers with the same consumer group in a spring-cloud-stream project

I have ZooKeeper and 3 Kafka brokers running locally.
I started one producer and one consumer, and I can see the consumer consuming messages.
I then started three consumers with the same consumer group name (on different ports, since it's a Spring Boot project). But what I found is that all the consumers are now consuming (receiving) every message. I expected the messages to be load-balanced across the consumers, i.e. each message delivered to only one of them, not repeated. I don't know what the problem is.
Here is my property file
spring.cloud.stream.bindings.input.destination=timerTopicLocal
spring.cloud.stream.kafka.binder.zkNodes=localhost
spring.cloud.stream.kafka.binder.brokers=localhost
spring.cloud.stream.bindings.input.group=timerGroup
Here the group is timerGroup.
consumer code : https://github.com/codecentric/edmp-sample-stream-sink
producer code : https://github.com/codecentric/edmp-sample-stream-source
Can you please update the dependencies to Camden.RELEASE (and start using Kafka 0.9+)? In Brixton.RELEASE, Kafka consumers were 0.8-based and required passing instanceIndex/instanceCount as properties in order to distribute partitions correctly.
In Camden.RELEASE we are using the Kafka 0.9+ consumer client, which does load balancing in the way you are expecting (we also support static partition allocation via instanceIndex/instanceCount, but I suspect this is not what you want). I can go into more detail on how to configure this with Brixton, but I guess an upgrade would be a much easier path.
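For completeness, a hedged sketch of the Brixton-era static allocation mentioned above: every instance sets the same instanceCount plus its own unique instanceIndex (the values are illustrative; on Camden with Kafka 0.9+ the group property alone gives you the expected load balancing):

spring.cloud.stream.bindings.input.group=timerGroup
# total number of running instances, identical on every instance
spring.cloud.stream.instanceCount=3
# this instance's index: 0, 1 or 2, unique per instance
spring.cloud.stream.instanceIndex=0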
