intermittent issue with kafka (aws msk) consumer - spring-boot

We are facing a strange issue in only one of our environments (the consumer app is the same everywhere).
Lag suddenly starts to build up on only one of the topics on the Kafka broker (it has multiple topics), which is consumed by 10 members of a single consumer group.
Multiple restarts, adding another pod of the consumer application, and changing default configuration properties (max poll records, session timeout) have NOT helped much so far.
We are looking for suggestions on how to debug the issue. We tried enabling Apache logs, CloudWatch, etc., but so far we only saw that regular/periodic rebalancing is happening, even at a very low load of 7k messages waiting for processing.
Below are env details:
App - Spring boot app on version 2.7.2 Platform
AWS Kafka - MSK
Kafka Broker - 3 brokers (version 2.8.x)
Consumer Group - 1 with 15 members (partition 8, Topic 1)
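
Two quick sanity checks follow from the figures listed above. A Kafka partition is consumed by at most one member of a group, so members beyond the partition count sit idle. A member that does not call poll() again within max.poll.interval.ms is also removed from the group, which triggers a rebalance. The Java sketch below does only the arithmetic. The class and method names are hypothetical, the 700 ms per-record processing time is a made-up value, and 500 and 300000 are the Kafka client defaults for max.poll.records and max.poll.interval.ms.

```java
// Back-of-the-envelope checks for a consumer group; all inputs are illustrative.
final class ConsumerGroupMath {

    // A partition is assigned to at most one member of a group,
    // so members beyond the partition count receive no assignment.
    static int idleMembers(int partitions, int members) {
        return Math.max(0, members - partitions);
    }

    // Worst-case time between two poll() calls if every record
    // in a full batch takes perRecordMillis to process.
    static long worstCaseBatchMillis(int maxPollRecords, long perRecordMillis) {
        return (long) maxPollRecords * perRecordMillis;
    }

    // True when a full batch cannot be processed within max.poll.interval.ms;
    // the member is then removed from the group and a rebalance follows.
    static boolean exceedsPollInterval(int maxPollRecords, long perRecordMillis,
                                       long maxPollIntervalMs) {
        return worstCaseBatchMillis(maxPollRecords, perRecordMillis) > maxPollIntervalMs;
    }

    public static void main(String[] args) {
        // 8 partitions and 15 members, as listed in the environment details.
        System.out.println("idle members: " + idleMembers(8, 15));
        // 500 records at an assumed 700 ms each versus the 300000 ms default.
        System.out.println("rebalance risk: " + exceedsPollInterval(500, 700, 300_000));
    }
}
```

If the second check comes out true for realistic processing times, lowering max.poll.records or raising max.poll.interval.ms is the usual direction to try.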

Related

A Topic lost a subscription node during run time

Spring boot Version : 2.5.0
Spring Cloud Version : 2020.0.3
I used spring-cloud-stream-binder-kafka and spring-cloud-stream-binder-kafka-streams for Kafka production and consumption in the project.
In one project, I subscribed to N topics.
Two nodes of the service were started behind a load balancer.
During run time, it was suddenly discovered that one of the topics had no subscription nodes.
This results in messages being backlogged and lost.
I have to restart these service nodes before I can subscribe to this Topic again.
What is the cause of this, or is there any way to find some clues?
And is there a way to check at run time so that topics that have lost subscriptions can be re-subscribed?

Spring Cloud Stream Kafka Streams Binder processor application stuck

I have the following Spring Cloud Stream Kafka Streams Binder 3.x application:
When I run X messages through this application by publishing them to topic1 from an integration test using @SpringBootTest and @EmbeddedKafka, the counts of messages at points 1 and 2 are equal, as I expect.
When I do the same using live application connected to the Kafka broker, the counts at point 1 and point 2 remain significantly different: Count1 >> Count2.
Kafka Tool shows a big Lag of the Processor2 consumer on the topic2 and that lag remains constant (doesn't change after I stop publishing messages)
The Processor2 consists of
flatTransform stateful transformer
aggregator
other downstream steps
What could be the reason for the different behaviour in test and live mode, and for the lag not going down in live mode?
I have thoroughly compared all application property values active in test and in live application, they are exactly equivalent.
There is only 1 partition in all topics in both cases.
In my case the reason was the default 7-day retention setting of the topics that were automatically created by the Spring Cloud Stream application.
The messages in my input stream span 8 years, and I am using a custom TimestampExtractor.
After I manually configured the topics with a large retention time, the issue was solved:
/usr/bin/kafka-configs --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name topic2 --add-config retention.ms=315360000000
Or set the log.retention.hours for the entire Kafka broker.
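
To illustrate why retention matters here: time-based retention is evaluated against record timestamps, so records stamped years in the past are already older than a 7-day limit when they are written. The Java sketch below is a simplified per-record model (the broker actually works on whole log segments). The class and method names are hypothetical and the dates are made up. It also converts 87600 hours into the milliseconds that the topic-level retention.ms setting expects.

```java
import java.time.Duration;
import java.time.Instant;

// Simplified model of time-based retention; the real broker evaluates log segments.
final class RetentionCheck {

    // Topic-level retention.ms is expressed in milliseconds.
    static long hoursToRetentionMs(long hours) {
        return Duration.ofHours(hours).toMillis();
    }

    // True when the record timestamp lies further in the past than the retention window.
    static boolean olderThanRetention(Instant recordTimestamp, Instant now, long retentionMs) {
        return Duration.between(recordTimestamp, now).toMillis() > retentionMs;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-01-01T00:00:00Z");       // illustrative "now"
        Instant oldRecord = now.minus(Duration.ofDays(8 * 365));   // roughly 8 years old

        long sevenDaysMs = hoursToRetentionMs(7 * 24);
        long tenYearsMs = hoursToRetentionMs(87600);

        System.out.println("retention.ms for 87600 h: " + tenYearsMs);
        System.out.println("deleted under 7 days: " + olderThanRetention(oldRecord, now, sevenDaysMs));
        System.out.println("deleted under 87600 h: " + olderThanRetention(oldRecord, now, tenYearsMs));
    }
}
```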

RabbitMQ on Kubernetes Unacked messages in queue

We are having an issue with RabbitMQ that happens when we deploy the application to production; we are not able to reproduce it in our development environment.
We have a microservices architecture with multiple Spring Boot applications deployed on Kubernetes, with an autoscaler that scales on usage. We notice that after some time Unacked messages appear in the queue, their number increases over time, and eventually RabbitMQ seems to stop working.
Is there something we can check in order to identify the problem?

Spring Kafka consumer removed from consumer group when topic idle

Versions
Spring Boot 1.5.x,
Spring Boot 2.4.x,
Apache Kafka 0.10.2
The Situation
We have two service instances hosted on different servers. Each instance initializes multiple Kafka consumers. All consumers are listening to the same topic and are part of the same consumer group.
We are not relying on Spring Boot/Spring Kafka to configure the ConcurrentKafkaListenerContainerFactory and its DefaultKafkaConsumerFactory. All the consumer configuration properties are set to the default Apache Kafka consumer property values except for max.poll.records, session.timeout.ms, and heartbeat.interval.ms. Acknowledgement mode is set to record.
We are using the @KafkaListener annotation, setting its containerFactory property to the bean name of the initialized ConcurrentKafkaListenerContainerFactory and setting its topics property.
The Problem
When a topic does not get any messages published to it for a day or two, all consumers are removed from the consumer group.
I can’t find any reason for this to happen. From my understanding of reading both the Apache Kafka and Spring Kafka documentation if poll is called within max.poll.interval.ms, the consumer is considered alive. And if heartbeats are continuously sent by the consumer within the session.timeout.ms, the consumer is considered alive. According to the documentation, poll is called continuously and heartbeats are sent at the interval set by heartbeat.interval.ms.
The Questions
Is there a setting or property Spring Boot/Spring Kafka is setting that causes a consumer that hasn’t consumed any records from an idle topic for a day or two to be removed from the consumer group?
If yes, can this be turned off and what are the downsides?
If no, is there a way to rejoin the consumer group without having to restart the service and what are the downsides?
That Kafka version is very, very old.
Older versions removed the consumer offsets after no activity for 24 hours, even if the consumer is still connected. In 2.0, this was increased to 7 days. With newer brokers (since 2.1), consumer offsets are only removed if the consumers are not actually connected for 7 days.
See https://kafka.apache.org/documentation/#upgrade_200_notable
You can increase the broker's offsets.retention.minutes with older brokers.
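
The three behaviours described above can be written down as a small decision function. This is only a restatement of the answer in Java, not broker code. The class, enum, and method names are hypothetical, and the retention is passed in as a parameter so that an increased offsets.retention.minutes can be tried out.

```java
// Restates the offset-expiry rules from the answer above; not actual broker logic.
final class OffsetExpiry {

    enum Broker { BEFORE_2_0, V2_0, V2_1_OR_LATER }

    // Default offsets.retention.minutes: 24 hours before 2.0, 7 days from 2.0 on.
    static long defaultRetentionMinutes(Broker broker) {
        return broker == Broker.BEFORE_2_0 ? 24 * 60 : 7 * 24 * 60;
    }

    static boolean offsetsExpired(Broker broker, long retentionMinutes,
                                  long minutesSinceLastCommit,
                                  boolean groupHasMembers, long minutesSinceGroupEmpty) {
        if (broker == Broker.V2_1_OR_LATER) {
            // Since 2.1 the clock only runs while no consumer is connected.
            return !groupHasMembers && minutesSinceGroupEmpty > retentionMinutes;
        }
        // Before 2.1 the clock runs from the last commit, even for connected consumers.
        return minutesSinceLastCommit > retentionMinutes;
    }

    public static void main(String[] args) {
        long twoDays = 2 * 24 * 60; // topic idle for two days, consumers still connected
        System.out.println("old broker: " + offsetsExpired(Broker.BEFORE_2_0,
                defaultRetentionMinutes(Broker.BEFORE_2_0), twoDays, true, 0));
        System.out.println("2.1+ broker: " + offsetsExpired(Broker.V2_1_OR_LATER,
                defaultRetentionMinutes(Broker.V2_1_OR_LATER), twoDays, true, 0));
    }
}
```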

messages published to all consumers with same consumer-group in spring-data-stream project

I have ZooKeeper and 3 Kafka brokers running locally.
I started one producer and one consumer, and I can see that the consumer is consuming messages.
I then started three consumers with the same consumer group name (on different ports, since it is a Spring Boot project), but I found that all of the consumers now receive every message. I expected the messages to be load-balanced so that no message is delivered to more than one consumer. I don't know what the problem is.
Here is my property file
spring.cloud.stream.bindings.input.destination=timerTopicLocal
spring.cloud.stream.kafka.binder.zkNodes=localhost
spring.cloud.stream.kafka.binder.brokers=localhost
spring.cloud.stream.bindings.input.group=timerGroup
Here the group is timerGroup.
consumer code : https://github.com/codecentric/edmp-sample-stream-sink
producer code : https://github.com/codecentric/edmp-sample-stream-source
Can you please update dependencies to Camden.RELEASE (and start using Kafka 0.9+)? In Brixton.RELEASE, Kafka consumers were 0.8-based and required passing instanceIndex/instanceCount as properties in order to distribute partitions correctly.
In Camden.RELEASE we are using the Kafka 0.9+ consumer client, which does load-balancing in the way you are expecting (we also support static partition allocation via instanceIndex/instanceCount, but I suspect this is not what you want). I can enter into more details on how to configure this with Brixton, but I guess an upgrade should be a much easier path.
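
A rough illustration of why missing instanceIndex/instanceCount values produce the symptom in the question. It assumes that static allocation gives partition p to the instance for which p % instanceCount == instanceIndex; that rule is my understanding of the binder's behaviour and should be treated as an assumption. The class and method names are hypothetical, and a 3-partition topic is assumed. With the defaults (instanceIndex 0, instanceCount 1) every instance selects every partition, so every instance sees every message.

```java
import java.util.ArrayList;
import java.util.List;

// Models static partition allocation; the modulo rule is an assumption, see the text above.
final class StaticAllocation {

    static List<Integer> partitionsFor(int instanceIndex, int instanceCount, int partitionCount) {
        List<Integer> assigned = new ArrayList<>();
        for (int p = 0; p < partitionCount; p++) {
            if (p % instanceCount == instanceIndex) {
                assigned.add(p);
            }
        }
        return assigned;
    }

    public static void main(String[] args) {
        int partitions = 3; // assumed partition count for the illustration

        // Defaults on all three instances: each one claims all partitions.
        for (int i = 0; i < 3; i++) {
            System.out.println("default config, instance " + i + ": "
                    + partitionsFor(0, 1, partitions));
        }
        // Distinct indexes with instanceCount=3: each partition has exactly one owner.
        for (int i = 0; i < 3; i++) {
            System.out.println("configured, instance " + i + ": "
                    + partitionsFor(i, 3, partitions));
        }
    }
}
```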

Resources