We are having a Kafka consumer, which all of a sudden(without any activity) went into a rebalancing state and got stuck. This caused the CPU of the k8 pod to shoot and GC time was also nearly 70-80%. The node didn't recover from that state. Upon deleting all the topics, it recovered after almost 4-5 hours.
Kafka version - 2.1.1
No of topics - 520( with 10 partitions each)
Consumer groups - 1
partition assignment strategy - sticky
Attaching some of the info logs here at that time.
2021-10-12 10:41:21
2021-10-12 05:11:21.160 INFO 6 CID: UID: RID: --- [cation_consumer] s.consumer.internals.AbstractCoordinator : [Consumer clientId=consumer-staging_notification_consumer-10, groupId=staging_notification_consumer] Member consumer-staging_notification_consumer-10-8a7eb399-c4e0-4443-b330-bfac42ca89ae sending LeaveGroup request to coordinator kafka5:9092 (id: 2147483642 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
2021-10-12 05:10:27.340 INFO 6 CID:4967 UID: RID:17c72e2d517-4d5e --- [ntainer#9-0-C-1] s.consumer.internals.ConsumerCoordinator : [Consumer clientId=consumer-staging_notification_consumer-9, groupId=staging_notification_consumer] Giving away all assigned partitions as lost since generation has been reset,indicating that consumer is no longer part of the group
2021-10-12 10:40:27
2021-10-12 05:10:27.340 INFO 6 CID:4967 UID: RID:17c72e2d517-4d5e --- [ntainer#9-0-C-1] s.consumer.internals.ConsumerCoordinator : [Consumer clientId=consumer-staging_notification_consumer-9, groupId=staging_notification_consumer] Lost previously assigned partitions staging_notification_topic_app_low_mobi_3119-4, staging_notification_topic_app_low_mobi_2023-4, staging_notification_topic_app_low_mobi_5170-9, staging_notification_topic_app_low_mobi_3540-9, staging_notification_topic_app_low_mobi_5722-9,
2021-10-12 10:40:37
2021-10-12 05:10:37.247 INFO 6 CID: UID: RID: --- [cation_consumer] s.consumer.internals.AbstractCoordinator : [Consumer clientId=consumer-staging_notification_consumer-12, groupId=staging_notification_consumer] Attempt to heartbeat failed since group is rebalancing
2021-10-12 10:40:37
2021-10-12 05:10:37.183 INFO 6 CID: UID: RID: --- [cation_consumer] s.consumer.internals.AbstractCoordinator : [Consumer clientId=consumer-staging_notification_consumer-10, groupId=staging_notification_consumer] Attempt to heartbeat failed since group is rebalancing
2021-10-12 10:40:57
2021-10-12 05:10:57.435 INFO 6 CID: UID: RID: --- [cation_consumer] s.consumer.internals.AbstractCoordinator : [Consumer clientId=consumer-staging_notification_consumer-10, groupId=staging_notification_consumer] Attempt to heartbeat failed since group is rebalancing
2021-10-12 10:40:57
2021-10-12 05:10:57.435 INFO 6 CID: UID: RID: --- [cation_consumer] s.consumer.internals.AbstractCoordinator : [Consumer clientId=consumer-staging_notification_consumer-12, groupId=staging_notification_consumer] Attempt to heartbeat failed since group is rebalancing
2021-10-12 10:41:56
2021-10-12 05:11:56.922 INFO 6 CID:4967 UID: RID:17c72e2579c-22e6 --- [ntainer#9-3-C-1] s.consumer.internals.ConsumerCoordinator : [Consumer clientId=consumer-staging_notification_consumer-12, groupId=staging_notification_consumer] Failing OffsetCommit request since the consumer is not part of an active group
2021-10-12 10:41:56
2021-10-12 05:11:56.923 ERROR 6 CID:4967 UID: RID:17c72e2579c-22e6 --- [ntainer#9-3-C-1] essageListenerContainer$ListenerConsumer : Consumer exception
2021-10-12 10:41:56
java.lang.IllegalStateException: This error handler cannot process 'org.apache.kafka.clients.consumer.CommitFailedException's; no record information is available
2021-10-12 10:41:56
at org.springframework.kafka.listener.SeekUtils.seekOrRecover(SeekUtils.java:151)
2021-10-12 10:41:56
at org.springframework.kafka.listener.SeekToCurrentErrorHandler.handle(SeekToCurrentErrorHandler.java:113)
2021-10-12 10:41:56
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.handleConsumerException(KafkaMessageListenerContainer.java:1427)
2021-10-12 10:41:56
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.run(KafkaMessageListenerContainer.java:1124)
2021-10-12 10:41:56
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
2021-10-12 10:41:56
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2021-10-12 10:41:56
at java.lang.Thread.run(Thread.java:748)
2021-10-12 10:41:56
Caused by: org.apache.kafka.clients.consumer.CommitFailedException: Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group.
2021-10-12 10:41:56
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.sendOffsetCommitRequest(ConsumerCoordinator.java:1134)
2021-10-12 10:41:56
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsSync(ConsumerCoordinator.java:999)
2021-10-12 10:41:56
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1504)
2021-10-12 10:41:56
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.doCommitSync(KafkaMessageListenerContainer.java:2396)
2021-10-12 10:41:56
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.commitSync(KafkaMessageListenerContainer.java:2391)
2021-10-12 10:41:56
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.commitIfNecessary(KafkaMessageListenerContainer.java:2377)
2021-10-12 10:41:56
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.processCommits(KafkaMessageListenerContainer.java:2191)
2021-10-12 10:41:56
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.pollAndInvoke(KafkaMessageListenerContainer.java:1149)
2021-10-12 10:41:56
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.run(KafkaMessageListenerContainer.java:1075)
2021-10-12 10:41:56
... 3 common frames omitted
2021-10-12 10:41:58
2021-10-12 05:11:58.442 INFO 6 CID:4967 UID: RID:17c72e2579c-22e6 --- [ntainer#9-3-C-1] s.consumer.internals.ConsumerCoordinator : [Consumer clientId=consumer-staging_notification_consumer-12, groupId=staging_notification_consumer] Giving away all assigned partitions as lost since generation has been reset,indicating that consumer is no longer part of the group
021-10-12 10:58:36
2021-10-12 05:28:36.039 INFO 6 CID:4412 UID: RID:17c72ebadab-64d7 --- [ntainer#3-1-C-1] s.consumer.internals.AbstractCoordinator : [Consumer clientId=consumer-staging_notification_consumer-17, groupId=staging_notification_consumer] Join group failed with org.apache.kafka.common.errors.RebalanceInProgressException: The group is rebalancing, so a rejoin is needed.
2021-10-12 10:58:36
2021-10-12 05:28:36.044 INFO 6 CID:4412 UID: RID:17c72ebadab-64d7 --- [ntainer#3-1-C-1] s.consumer.internals.AbstractCoordinator : [Consumer clientId=consumer-staging_notification_consumer-17, groupId=staging_notification_consumer] (Re-)joining group
021-10-12 10:58:36
2021-10-12 05:28:36.039 INFO 6 CID:4412 UID: RID:17c72ebadab-64d7 --- [ntainer#3-1-C-1] s.consumer.internals.AbstractCoordinator : [Consumer clientId=consumer-staging_notification_consumer-17, groupId=staging_notification_consumer] Join group failed with org.apache.kafka.common.errors.RebalanceInProgressException: The group is rebalancing, so a rejoin is needed.
2021-10-12 10:58:36
2021-10-12 05:28:36.044 INFO 6 CID:4412 UID: RID:17c72ebadab-64d7 --- [ntainer#3-1-C-1] s.consumer.internals.AbstractCoordinator : [Consumer clientId=consumer-staging_notification_consumer-17, groupId=staging_notification_consumer] (Re-)joining group
2021-10-12 11:05:36
2021-10-12 05:35:36.955 INFO 6 CID:3435 UID: RID:17c72f29913-886c --- [ntainer#7-3-C-1] s.consumer.internals.AbstractCoordinator : [Consumer clientId=consumer-staging_notification_consumer-5, groupId=staging_notification_consumer] Join group failed with org.apache.kafka.common.errors.DisconnectException
This was resolved when I deleted the topics.
Similar behaviour is observed when I increase the number of threads in concurrent listener factory, the consumer fails to come up with similar logs of constant rebalancing
The first log indicates that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms (five minutes default).
When this happens, the consumer client will actively initiate a LeaveGroup request to the coordinator to trigger rebalance.
You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
Of course, a better way is to check the reason for the slow processing of the program and optimize it.
Related
I hope someone of you can help me.
I'm using spring boot 2.3.4 with spring kafka 2.5.6. I recently had to reset an offset and saw some strange behavior. We consumed the messages, but after every X (variating) messages we had a timeout of 10 seconds before the consumption continued.
This is my configuration:
spring:
kafka:
bootstrap-servers: localhost:9092
consumer:
enable-auto-commit: false
auto-offset-reset: earliest
heartbeat-interval: 1000
max-poll-records: 50
group-id: kafka-fetch-demo
fetch-max-wait: 10000
listener:
type: single
concurrency: 1
poll-timeout: 1000
no-poll-threshold: 2
monitor-interval: 10
ack-mode: manual
producer:
acks: all
batch-size: 0
retries: 0
This is an examle listener code:
#KafkaListener(id = LISTENER_ID, idIsGroup = false, topicPattern = "#{demoProperties.getTopicPattern()}")
public void onEvent(Acknowledgment acknowledgment, ConsumerRecord<byte[], String> record) {
log.info("Received record on topic {}, partition {} and offset {}",
record.topic(),
record.partition(),
record.offset());
acknowledgment.acknowledge();
}
Analysis
I figured out that the 10 second timeout came from the fetch.max.wait.ms property. However I'm not able to figure out why this property applies.
As far as I understand the fetch-max-wait property only determines the maximum time the broker waits before providing the consumer with new records even if the fetch.min.bytes is not exceeded. (Which in my case is set to the default 1 and should always be fullfilled)
Furthermore I analyzed that this problem only applies when using topic patterns and "larger" messages.
Reproduction
I uploaded an demo application on Github to reproduce the issue: https://github.com/kraennix/kafka-fetch-demo.
How I did reproduce it:
I put a thousand messages with 17,1 KB per message on a kafka topic.
I start my consuming application that listens per topic pattern to this topic. Then you can see this stopping behaviour.
Note: If I do the same with "small" messages (89 Bytes) it works as expected.
Logs
In the logs you can see the successful commit, but then the it says Skipping fetch
2021-01-16 15:04:40.773 DEBUG 19244 --- [_LISTENER-0-C-1] essageListenerContainer$ListenerConsumer : Commit list: {publish.LargeTopic.2.test-0=OffsetAndMetadata{offset=488, leaderEpoch=null, metadata=''}}
2021-01-16 15:04:40.773 DEBUG 19244 --- [_LISTENER-0-C-1] essageListenerContainer$ListenerConsumer : Committing: {publish.LargeTopic.2.test-0=OffsetAndMetadata{offset=488, leaderEpoch=null, metadata=''}}
2021-01-16 15:04:40.773 TRACE 19244 --- [_LISTENER-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Sending OffsetCommit request with {publish.LargeTopic.2.test-0=OffsetAndMetadata{offset=488, leaderEpoch=null, metadata=''}} to coordinator localhost:9092 (id: 2147483647 rack: null)
2021-01-16 15:04:40.773 DEBUG 19244 --- [_LISTENER-0-C-1] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Using older server API v7 to send OFFSET_COMMIT {group_id=kafka-fetch-demo,generation_id=4,member_id=consumer-kafka-fetch-demo-1-cf8e747f-531d-457a-aca8-18960c518ef9,group_instance_id=null,topics=[{name=publish.LargeTopic.2.test,partitions=[{partition_index=0,committed_offset=488,committed_leader_epoch=-1,committed_metadata=}]}]} with correlation id 62 to node 2147483647
2021-01-16 15:04:40.778 TRACE 19244 --- [_LISTENER-0-C-1] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Completed receive from node 2147483647 for OFFSET_COMMIT with correlation id 62, received {throttle_time_ms=0,topics=[{name=publish.LargeTopic.2.test,partitions=[{partition_index=0,error_code=0}]}]}
2021-01-16 15:04:40.779 DEBUG 19244 --- [_LISTENER-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Committed offset 488 for partition publish.LargeTopic.2.test-0
2021-01-16 15:04:40.779 TRACE 19244 --- [_LISTENER-0-C-1] o.a.k.c.consumer.internals.Fetcher : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Skipping fetch for partition publish.LargeTopic.1.test-0 because previous request to localhost:9092 (id: 0 rack: null) has not been processed
2021-01-16 15:04:40.779 TRACE 19244 --- [_LISTENER-0-C-1] o.a.k.c.consumer.internals.Fetcher : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Skipping fetch for partition publish.LargeTopic.2.test-0 because previous request to localhost:9092 (id: 0 rack: null) has not been processed
2021-01-16 15:04:40.779 TRACE 19244 --- [_LISTENER-0-C-1] o.a.k.c.consumer.internals.Fetcher : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Skipping fetch for partition publish.LargeTopic.1.test-0 because previous request to localhost:9092 (id: 0 rack: null) has not been processed
2021-01-16 15:04:40.779 TRACE 19244 --- [_LISTENER-0-C-1] o.a.k.c.consumer.internals.Fetcher : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Skipping fetch for partition publish.LargeTopic.2.test-0 because previous request to localhost:9092 (id: 0 rack: null) has not been processed
When there is a change in the Size of the message, you might need to change below 2 Props
heartbeat-interval: 1000
max-poll-records: 50
Your heart beat interval is 1sec and Max poll wait is 10secs. If the size of the message is high and you are processing the consumed messages in the same thread, then Heartbeat check will fail by the time the next Pull triggered. Make sure to process messages by an Executor using Callable.
Increase the Heart Beat Interval to 5 to 10 secs and Reduce Max Poll records to 15 when the messages size is high. Hope, this can help
I got the exception 'MongoWaitQueueFullException' and I realize the number of connections that my application is using. I use the default configuration of Spring boot (2.2.7.RELEASE) with reactive MongoDB (4.2.8). Transactions are used.
Even when running an integration test that basically creates a bit more than 200 elements then groups them (200 groups). 10 connections are used. When this algorithm is executed over a real data-set, this exception is thrown. The default limit of the waiting queue (500) was reached. This does not make the application scalable.
My question is: is there a way to design a reactive application that helps to reduce the number of connections?
This is the output of my test. Basically, it scans all translations of bundle files and them group them per translation key. An element is persisted per translation key.
return Flux
.fromIterable(bundleFile.getFiles())
.map(ScannedBundleFileEntry::getLocale)
.flatMap(locale ->
handler
.scanTranslations(bundleFileEntity.toLocation(), locale, context)
.index()
.map(indexedTranslation ->
createTranslation(
workspaceEntity,
bundleFileEntity,
locale.getId(),
indexedTranslation.getT1(), // index
indexedTranslation.getT2().getKey(), // bundle key
indexedTranslation.getT2().getValue() // translation
)
)
.flatMap(bundleKeyTemporaryRepository::save)
)
.thenMany(groupIntoBundleKeys(bundleFileEntity))
.then(bundleKeyTemporaryRepository.deleteByBundleFile(bundleFileEntity.getId()))
.then(Mono.just(bundleFileEntity));
The grouping function:
private Flux<BundleKeyEntity> groupIntoBundleKeys(BundleFileEntity bundleFile) {
return this
.findBundleKeys(bundleFile)
.groupBy(BundleKeyGroupKey::new)
.flatMap(bundleKeyGroup ->
bundleKeyGroup
.collectList()
.map(bundleKeys -> {
final BundleKeyGroupKey key = bundleKeyGroup.key();
final BundleKeyEntity entity = new BundleKeyEntity(key.getWorkspace(), key.getBundleFile(), key.getKey());
bundleKeys.forEach(entity::mergeInto);
return entity;
})
)
.flatMap(bundleKeyEntityRepository::save);
}
The test output:
560 [main] INFO o.s.b.t.c.SpringBootTestContextBootstrapper - Neither #ContextConfiguration nor #ContextHierarchy found for test class [be.sgerard.i18n.controller.TranslationControllerTest], using SpringBootContextLoader
569 [main] INFO o.s.t.c.s.AbstractContextLoader - Could not detect default resource locations for test class [be.sgerard.i18n.controller.TranslationControllerTest]: no resource found for suffixes {-context.xml, Context.groovy}.
870 [main] INFO o.s.b.t.c.SpringBootTestContextBootstrapper - Loaded default TestExecutionListener class names from location [META-INF/spring.factories]: [org.springframework.boot.test.mock.mockito.MockitoTestExecutionListener, org.springframework.boot.test.mock.mockito.ResetMocksTestExecutionListener, org.springframework.boot.test.autoconfigure.restdocs.RestDocsTestExecutionListener, org.springframework.boot.test.autoconfigure.web.client.MockRestServiceServerResetTestExecutionListener, org.springframework.boot.test.autoconfigure.web.servlet.MockMvcPrintOnlyOnFailureTestExecutionListener, org.springframework.boot.test.autoconfigure.web.servlet.WebDriverTestExecutionListener, org.springframework.test.context.web.ServletTestExecutionListener, org.springframework.test.context.support.DirtiesContextBeforeModesTestExecutionListener, org.springframework.test.context.support.DependencyInjectionTestExecutionListener, org.springframework.test.context.support.DirtiesContextTestExecutionListener, org.springframework.test.context.transaction.TransactionalTestExecutionListener, org.springframework.test.context.jdbc.SqlScriptsTestExecutionListener, org.springframework.test.context.event.EventPublishingTestExecutionListener, org.springframework.security.test.context.support.WithSecurityContextTestExecutionListener, org.springframework.security.test.context.support.ReactorContextTestExecutionListener]
897 [main] INFO o.s.b.t.c.SpringBootTestContextBootstrapper - Using TestExecutionListeners: [org.springframework.test.context.support.DirtiesContextBeforeModesTestExecutionListener#4372b9b6, org.springframework.boot.test.mock.mockito.MockitoTestExecutionListener#232a7d73, org.springframework.boot.test.autoconfigure.SpringBootDependencyInjectionTestExecutionListener#4b41e4dd, org.springframework.test.context.support.DirtiesContextTestExecutionListener#22ffa91a, org.springframework.test.context.transaction.TransactionalTestExecutionListener#74960bfa, org.springframework.test.context.jdbc.SqlScriptsTestExecutionListener#42721fe, org.springframework.test.context.event.EventPublishingTestExecutionListener#40844aab, org.springframework.security.test.context.support.WithSecurityContextTestExecutionListener#1f6c9cd8, org.springframework.security.test.context.support.ReactorContextTestExecutionListener#5b619d14, org.springframework.boot.test.mock.mockito.ResetMocksTestExecutionListener#66746f57, org.springframework.boot.test.autoconfigure.restdocs.RestDocsTestExecutionListener#447a020, org.springframework.boot.test.autoconfigure.web.client.MockRestServiceServerResetTestExecutionListener#7f36662c, org.springframework.boot.test.autoconfigure.web.servlet.MockMvcPrintOnlyOnFailureTestExecutionListener#28e8dde3, org.springframework.boot.test.autoconfigure.web.servlet.WebDriverTestExecutionListener#6d23017e]
1551 [background-preinit] INFO o.h.v.i.x.c.ValidationBootstrapParameters - HV000006: Using org.hibernate.validator.HibernateValidator as validation provider.
1677 [main] INFO b.s.i.c.TranslationControllerTest - Starting TranslationControllerTest on sgerard with PID 538 (started by sgerard in /home/sgerard/sandboxes/github-oauth/server)
1678 [main] INFO b.s.i.c.TranslationControllerTest - The following profiles are active: test
3250 [main] INFO o.s.d.r.c.RepositoryConfigurationDelegate - Bootstrapping Spring Data Reactive MongoDB repositories in DEFAULT mode.
3747 [main] INFO o.s.d.r.c.RepositoryConfigurationDelegate - Finished Spring Data repository scanning in 493ms. Found 9 Reactive MongoDB repository interfaces.
5143 [main] INFO o.s.c.s.PostProcessorRegistrationDelegate$BeanPostProcessorChecker - Bean 'org.springframework.security.config.annotation.method.configuration.ReactiveMethodSecurityConfiguration' of type [org.springframework.security.config.annotation.method.configuration.ReactiveMethodSecurityConfiguration] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
5719 [main] INFO org.mongodb.driver.cluster - Cluster created with settings {hosts=[localhost:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
5996 [cluster-ClusterId{value='5f42490f1c60f43aff9d7d46', description='null'}-localhost:27017] INFO org.mongodb.driver.connection - Opened connection [connectionId{localValue:1, serverValue:4337}] to localhost:27017
6010 [cluster-ClusterId{value='5f42490f1c60f43aff9d7d46', description='null'}-localhost:27017] INFO org.mongodb.driver.cluster - Monitor thread successfully connected to server with description ServerDescription{address=localhost:27017, type=REPLICA_SET_PRIMARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 2, 8]}, minWireVersion=0, maxWireVersion=8, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=12207332, setName='rs0', canonicalAddress=4802c4aff450:27017, hosts=[4802c4aff450:27017], passives=[], arbiters=[], primary='4802c4aff450:27017', tagSet=TagSet{[]}, electionId=7fffffff0000000000000013, setVersion=1, lastWriteDate=Sun Aug 23 12:46:30 CEST 2020, lastUpdateTimeNanos=384505436362981}
6019 [main] INFO org.mongodb.driver.cluster - Cluster created with settings {hosts=[localhost:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
6040 [cluster-ClusterId{value='5f42490f1c60f43aff9d7d47', description='null'}-localhost:27017] INFO org.mongodb.driver.connection - Opened connection [connectionId{localValue:2, serverValue:4338}] to localhost:27017
6042 [cluster-ClusterId{value='5f42490f1c60f43aff9d7d47', description='null'}-localhost:27017] INFO org.mongodb.driver.cluster - Monitor thread successfully connected to server with description ServerDescription{address=localhost:27017, type=REPLICA_SET_PRIMARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 2, 8]}, minWireVersion=0, maxWireVersion=8, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=1727974, setName='rs0', canonicalAddress=4802c4aff450:27017, hosts=[4802c4aff450:27017], passives=[], arbiters=[], primary='4802c4aff450:27017', tagSet=TagSet{[]}, electionId=7fffffff0000000000000013, setVersion=1, lastWriteDate=Sun Aug 23 12:46:30 CEST 2020, lastUpdateTimeNanos=384505468960066}
7102 [nioEventLoopGroup-2-2] INFO org.mongodb.driver.connection - Opened connection [connectionId{localValue:3, serverValue:4339}] to localhost:27017
11078 [main] INFO o.s.b.a.e.web.EndpointLinksResolver - Exposing 1 endpoint(s) beneath base path ''
11158 [main] INFO o.h.v.i.x.c.ValidationBootstrapParameters - HV000006: Using org.hibernate.validator.HibernateValidator as validation provider.
11720 [main] INFO org.mongodb.driver.connection - Opened connection [connectionId{localValue:4, serverValue:4340}] to localhost:27017
12084 [main] INFO o.s.s.c.ThreadPoolTaskScheduler - Initializing ExecutorService 'taskScheduler'
12161 [main] INFO b.s.i.c.TranslationControllerTest - Started TranslationControllerTest in 11.157 seconds (JVM running for 13.532)
20381 [nioEventLoopGroup-2-3] INFO org.mongodb.driver.connection - Opened connection [connectionId{localValue:5, serverValue:4341}] to localhost:27017
20408 [nioEventLoopGroup-2-2] INFO b.s.i.s.w.WorkspaceManagerImpl - Synchronize, there is no workspace for the branch [master], let's create it.
20416 [nioEventLoopGroup-2-3] INFO b.s.i.s.w.WorkspaceManagerImpl - The workspace [master] alias [e3cea374-0d37-4c57-bdbf-8bd14d279c12] has been created.
20421 [nioEventLoopGroup-2-3] INFO b.s.i.s.w.WorkspaceManagerImpl - Initializing workspace [master] alias [e3cea374-0d37-4c57-bdbf-8bd14d279c12].
20525 [nioEventLoopGroup-2-2] INFO b.s.i.s.i18n.TranslationManagerImpl - A bundle file has been found located in [server/src/main/resources/i18n] named [exception] with 2 file(s).
20812 [nioEventLoopGroup-2-4] INFO org.mongodb.driver.connection - Opened connection [connectionId{localValue:6, serverValue:4342}] to localhost:27017
21167 [nioEventLoopGroup-2-8] INFO org.mongodb.driver.connection - Opened connection [connectionId{localValue:10, serverValue:4345}] to localhost:27017
21167 [nioEventLoopGroup-2-6] INFO org.mongodb.driver.connection - Opened connection [connectionId{localValue:8, serverValue:4344}] to localhost:27017
21393 [nioEventLoopGroup-2-5] INFO org.mongodb.driver.connection - Opened connection [connectionId{localValue:7, serverValue:4343}] to localhost:27017
21398 [nioEventLoopGroup-2-7] INFO org.mongodb.driver.connection - Opened connection [connectionId{localValue:9, serverValue:4346}] to localhost:27017
21442 [nioEventLoopGroup-2-2] INFO b.s.i.s.i18n.TranslationManagerImpl - A bundle file has been found located in [server/src/main/resources/i18n] named [validation] with 2 file(s).
21503 [nioEventLoopGroup-2-2] INFO b.s.i.s.i18n.TranslationManagerImpl - A bundle file has been found located in [server/src/test/resources/be/sgerard/i18n/service/i18n/file] named [file] with 2 file(s).
21621 [nioEventLoopGroup-2-2] INFO b.s.i.s.i18n.TranslationManagerImpl - A bundle file has been found located in [front/src/main/web/src/assets/i18n] named [i18n] with 2 file(s).
22745 [SpringContextShutdownHook] INFO o.s.s.c.ThreadPoolTaskScheduler - Shutting down ExecutorService 'taskScheduler'
22763 [SpringContextShutdownHook] INFO org.mongodb.driver.connection - Closed connection [connectionId{localValue:4, serverValue:4340}] to localhost:27017 because the pool has been closed.
22766 [SpringContextShutdownHook] INFO org.mongodb.driver.connection - Closed connection [connectionId{localValue:9, serverValue:4346}] to localhost:27017 because the pool has been closed.
22767 [SpringContextShutdownHook] INFO org.mongodb.driver.connection - Closed connection [connectionId{localValue:6, serverValue:4342}] to localhost:27017 because the pool has been closed.
22768 [SpringContextShutdownHook] INFO org.mongodb.driver.connection - Closed connection [connectionId{localValue:8, serverValue:4344}] to localhost:27017 because the pool has been closed.
22768 [SpringContextShutdownHook] INFO org.mongodb.driver.connection - Closed connection [connectionId{localValue:5, serverValue:4341}] to localhost:27017 because the pool has been closed.
22769 [SpringContextShutdownHook] INFO org.mongodb.driver.connection - Closed connection [connectionId{localValue:10, serverValue:4345}] to localhost:27017 because the pool has been closed.
22770 [SpringContextShutdownHook] INFO org.mongodb.driver.connection - Closed connection [connectionId{localValue:7, serverValue:4343}] to localhost:27017 because the pool has been closed.
22776 [SpringContextShutdownHook] INFO org.mongodb.driver.connection - Closed connection [connectionId{localValue:3, serverValue:4339}] to localhost:27017 because the pool has been closed.
Process finished with exit code 0
Spring Reactive is asynchronous. Imagine you have 3 items in your dataset. It opens a connection for the save of the first item. But it won't wait for it to finish and use for the second save. Instead, it opens a second connection as soon as possible. Thus you'll end up overloading all the possible connections in the pool.
I have deployed microservice on docker-desktop for windows and feign is not able to make a call to another service.
person ms calling organization ms through feign.
I can see in the logs of person pod
2019-11-10 12:58:34.000 INFO [personservice,13631e6ef2efe358,15c75b9a4006485a,true] 6 --- [ionThreadPool-1] c.p.service.OrganizationServiceData : Get the value from the organization ms hystrix-organizationThreadPool-1
2019-11-10 12:58:34.293 INFO [personservice,13631e6ef2efe358,e0bbcecf5f7c349f,true] 6 --- [zationservice-1] c.netflix.config.ChainedDynamicProperty : Flipping property: organizationservice.ribbon.ActiveConnectionsLimit to use NEXT property: niws.loadbalancer.availabilityFilteringRule.activeConnectionsLimit = 2147483647
2019-11-10 12:58:34.319 INFO [personservice,13631e6ef2efe358,e0bbcecf5f7c349f,true] 6 --- [zationservice-1] c.netflix.loadbalancer.BaseLoadBalancer : Client: organizationservice instantiated a LoadBalancer: DynamicServerListLoadBalancer:{NFLoadBalancer:name=organizationservice,current list of Servers=[],Load balancer stats=Zone stats: {},Server stats: []}ServerList:null
2019-11-10 12:58:34.332 INFO [personservice,13631e6ef2efe358,e0bbcecf5f7c349f,true] 6 --- [zationservice-1] c.n.l.DynamicServerListLoadBalancer : Using serverListUpdater PollingServerListUpdater
2019-11-10 12:58:34.535 INFO [personservice,13631e6ef2efe358,e0bbcecf5f7c349f,true] 6 --- [zationservice-1] c.netflix.config.ChainedDynamicProperty : Flipping property: organizationservice.ribbon.ActiveConnectionsLimit to use NEXT property: niws.loadbalancer.availabilityFilteringRule.activeConnectionsLimit = 2147483647
2019-11-10 12:58:34.556 INFO [personservice,13631e6ef2efe358,e0bbcecf5f7c349f,true] 6 --- [zationservice-1] c.n.l.DynamicServerListLoadBalancer : DynamicServerListLoadBalancer for client organizationservice initialized: DynamicServerListLoadBalancer:{NFLoadBalancer:name=organizationservice,current list of Servers=[10.1.0.190:8085],Load balancer stats=Zone stats: {unknown=[Zone:unknown; Instance count:1; Active connections count: 0; Circuit breaker tripped count: 0; Active connections per server: 0.0;]
},Server stats: [[Server:10.1.0.190:8085; Zone:UNKNOWN; Total Requests:0; Successive connection failure:0; Total blackout seconds:0; Last connection made:Thu Jan 01 00:00:00 GMT 1970;
First connection made: Thu Jan 01 00:00:00 GMT 1970; Active Connections:0; total failure count in last (1000) msecs:0; average resp time:0.0; 90 percentile resp time:0.0; 95 percentile resp time:0.0; min resp time:0.0; max resp time:0.0; stddev resp time:0.0]
]}ServerList:org.springframework.cloud.kubernetes.ribbon.KubernetesServerList#5bae3a6b
2019-11-10 12:58:35.345 INFO [personservice,,,] 6 --- [erListUpdater-0] c.netflix.config.ChainedDynamicProperty : Flipping property: organizationservice.ribbon.ActiveConnectionsLimit to use NEXT property: niws.loadbalancer.availabilityFilteringRule.activeConnectionsLimit = 2147483647
2019-11-10 12:58:35.502 INFO [personservice,13631e6ef2efe358,15c75b9a4006485a,true] 6 --- [ionThreadPool-1] c.p.service.OrganizationServiceData : calling fallback method to get the organization data for id 1
The ribbon client gets the IP-address of the pod running organization service. current list of Servers=[10.1.0.190:8085]
Here is my person service application.yml
spring:
cloud:
kubernetes:
ribbon:
mode: SERVICE
organizationservice:
ribbon:
MaxAutoRetries: 2
MaxAutoRetriesNextServer: 0
OkToRetryOnAllOperations: true
ServerListRefreshInterval: 2000
ConnectTimeout: 10000
ReadTimeout: 1000
person ms dependecy
compile "org.springframework.cloud:spring-cloud-starter-kubernetes-all"
compile('org.springframework.cloud:spring-cloud-starter-netflix-ribbon')
upon checking the logs for organization pod. No calls have been made to it.
Edit 1:
Upon changing the log level for feign in person service I found that JWT token is begin passed by default. I solved the issue using filters but the same application was working with out filters when I was using not using spring cloud kubernetes.
I have spent days on this simple issue , I am giving up and finally posting this issue which I am facing locally. I am trying to set up a microservices flow in my local for my hand itching learning purpose. This is no brainer. I have Eureka , Zuul Gateway , Simple Microservice. When I try to reach to the underlying service with the "url route" its working. But when I try to do serviceId look up its not working. Guys help me fixing it.
Git hub link is Git hub source code link
I have also raised an issue Git hut Issue link
Eureka Screenshot
Zuul Gateway logs
2019-10-06 11:11:24.611 INFO 26980 --- [nio-2020-exec-4] o.a.c.c.C.[Tomcat].[localhost].[/] : Initializing Spring DispatcherServlet 'dispatcherServlet'
2019-10-06 11:11:24.611 INFO 26980 --- [nio-2020-exec-4] o.s.web.servlet.DispatcherServlet : Initializing Servlet 'dispatcherServlet'
2019-10-06 11:11:24.633 INFO 26980 --- [nio-2020-exec-4] o.s.web.servlet.DispatcherServlet : Completed initialization in 22 ms
2019-10-06 11:11:25.103 INFO 26980 --- [nio-2020-exec-4] c.netflix.config.ChainedDynamicProperty : Flipping property: CHECKOUT-SERVICE.ribbon.ActiveConnectionsLimit to use NEXT property: niws.loadbalancer.availabilityFilteringRule.activeConnectionsLimit = 2147483647
2019-10-06 11:11:25.157 INFO 26980 --- [nio-2020-exec-4] c.n.u.concurrent.ShutdownEnabledTimer : Shutdown hook installed for: NFLoadBalancer-PingTimer-CHECKOUT-SERVICE
2019-10-06 11:11:25.157 INFO 26980 --- [nio-2020-exec-4] c.netflix.loadbalancer.BaseLoadBalancer : Client: CHECKOUT-SERVICE instantiated a LoadBalancer: DynamicServerListLoadBalancer:{NFLoadBalancer:name=CHECKOUT-SERVICE,current list of Servers=[],Load balancer stats=Zone stats: {},Server stats: []}ServerList:null
2019-10-06 11:11:25.167 INFO 26980 --- [nio-2020-exec-4] c.n.l.DynamicServerListLoadBalancer : Using serverListUpdater PollingServerListUpdater
2019-10-06 11:11:25.215 INFO 26980 --- [nio-2020-exec-4] c.netflix.config.ChainedDynamicProperty : Flipping property: CHECKOUT-SERVICE.ribbon.ActiveConnectionsLimit to use NEXT property: niws.loadbalancer.availabilityFilteringRule.activeConnectionsLimit = 2147483647
2019-10-06 11:11:25.218 INFO 26980 --- [nio-2020-exec-4] c.n.l.DynamicServerListLoadBalancer : DynamicServerListLoadBalancer for client CHECKOUT-SERVICE initialized: DynamicServerListLoadBalancer:{NFLoadBalancer:name=CHECKOUT-SERVICE,current list of Servers=[192.168.0.6:8098],Load balancer stats=Zone stats: {defaultzone=[Zone:defaultzone; Instance count:1; Active connections count: 0; Circuit breaker tripped count: 0; Active connections per server: 0.0;]
},Server stats: [[Server:192.168.0.6:8098; Zone:defaultZone; Total Requests:0; Successive connection failure:0; Total blackout seconds:0; Last connection made:Wed Dec 31 19:00:00 EST 1969; First connection made: Wed Dec 31 19:00:00 EST 1969; Active Connections:0; total failure count in last (1000) msecs:0; average resp time:0.0; 90 percentile resp time:0.0; 95 percentile resp time:0.0; min resp time:0.0; max resp time:0.0; stddev resp time:0.0]
]}ServerList:org.springframework.cloud.netflix.ribbon.eureka.DomainExtractingServerList#6f7f7ca0
2019-10-06 11:11:26.177 INFO 26980 --- [erListUpdater-0] c.netflix.config.ChainedDynamicProperty : Flipping property: CHECKOUT-SERVICE.ribbon.ActiveConnectionsLimit to use NEXT property: niws.loadbalancer.availabilityFilteringRule.activeConnectionsLimit = 2147483647
Never mind guys it was a mistake from my side in resolving the API path
there is a Time-consuming operation (about 10 min) ,but kafka aways rebalance after 5 min,even i pause the consumer. the consumer method :
#KafkaListener(topics = {TopicAppoint.EXECUTE_SCHOOL_DATA_STATICS_TASK})
public void receiveMessage(#Payload String payload, Consumer<String, String> consumer) {
Set<TopicPartition> assignment = consumer.assignment();
consumer.pause(assignment);
if (StringUtils.isNotEmpty(payload)) {
SchoolStatisticsTaskDTO staticsTaskDTO = JSONObject.parseObject(payload, SchoolStatisticsTaskDTO.class);
Optional<SchoolStatisticsTaskDO> taskOptional = schoolStatisticsTaskRepository.findById(staticsTaskDTO.getTrackId());
taskOptional.ifPresent(schoolStaticsTaskDO -> {
// handler
});
}
consumer.resume(assignment);
}
this is my config:
kafka:
bootstrap-servers: 192.168.0.230:9092
producer:
key-serializer: org.apache.kafka.common.serialization.StringSerializer
value-serializer: org.apache.kafka.common.serialization.StringSerializer
retries: 3
properties:
max.request.size: 12582912
consumer:
key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
value-deserializer: org.apache.kafka.common.serialization.StringDeserializer
group-id: dc-fitness-data-consumer-group
properties:
max.partition.fetch.bytes: 12582912
#enable-auto-commit: false
listener:
ack-mode: record
concurrency: 6
LOGS
13:09:20.219 [org.springframework.kafka.KafkaListenerEndpointContainer#2-0-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-25, groupId=dc-fitness-data-consumer-group] Attempt to heartbeat failed since group is rebalancing
13:09:20.219 [org.springframework.kafka.KafkaListenerEndpointContainer#2-0-C-1] INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-25, groupId=dc-fitness-data-consumer-group] Revoking previously assigned partitions [ft.es.records.incremental.update.task-4, ft.es.records.incremental.update.task-5, ft.es.records.incremental.update.task-2, ft.es.records.incremental.update.task-3, ft.es.records.incremental.update.task-0, ft.es.records.incremental.update.task-1]
13:09:20.219 [org.springframework.kafka.KafkaListenerEndpointContainer#2-0-C-1] INFO o.s.k.l.KafkaMessageListenerContainer - partitions revoked: [ft.es.records.incremental.update.task-4, ft.es.records.incremental.update.task-5, ft.es.records.incremental.update.task-2, ft.es.records.incremental.update.task-3, ft.es.records.incremental.update.task-0, ft.es.records.incremental.update.task-1]
13:09:20.219 [org.springframework.kafka.KafkaListenerEndpointContainer#2-0-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-25, groupId=dc-fitness-data-consumer-group] (Re-)joining group
13:09:20.220 [org.springframework.kafka.KafkaListenerEndpointContainer#9-2-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-21, groupId=dc-fitness-data-consumer-group] Attempt to heartbeat failed since group is rebalancing
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#9-2-C-1] INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-21, groupId=dc-fitness-data-consumer-group] Revoking previously assigned partitions [ft.student.batch.upload.task-17, ft.student.batch.upload.task-14, ft.student.batch.upload.task-13, ft.student.batch.upload.task-16, ft.student.batch.upload.task-15, ft.student.batch.upload.task-12]
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#9-2-C-1] INFO o.s.k.l.KafkaMessageListenerContainer - partitions revoked: [ft.student.batch.upload.task-17, ft.student.batch.upload.task-14, ft.student.batch.upload.task-13, ft.student.batch.upload.task-16, ft.student.batch.upload.task-15, ft.student.batch.upload.task-12]
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#9-2-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-21, groupId=dc-fitness-data-consumer-group] (Re-)joining group
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#6-3-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-4, groupId=dc-fitness-data-consumer-group] Attempt to heartbeat failed since group is rebalancing
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#8-5-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-18, groupId=dc-fitness-data-consumer-group] Attempt to heartbeat failed since group is rebalancing
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#8-5-C-1] INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-18, groupId=dc-fitness-data-consumer-group] Revoking previously assigned partitions [ft.record.batch.upload.task-32, ft.record.batch.upload.task-31, ft.record.batch.upload.task-34, ft.record.batch.upload.task-33, ft.record.batch.upload.task-30, ft.record.batch.upload.task-35]
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#8-5-C-1] INFO o.s.k.l.KafkaMessageListenerContainer - partitions revoked: [ft.record.batch.upload.task-32, ft.record.batch.upload.task-31, ft.record.batch.upload.task-34, ft.record.batch.upload.task-33, ft.record.batch.upload.task-30, ft.record.batch.upload.task-35]
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#8-5-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-18, groupId=dc-fitness-data-consumer-group] (Re-)joining group
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#6-3-C-1] INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-4, groupId=dc-fitness-data-consumer-group] Revoking previously assigned partitions [ft.class.batch.upload.task-18, ft.class.batch.upload.task-19, ft.class.batch.upload.task-20, ft.class.batch.upload.task-21, ft.class.batch.upload.task-22, ft.class.batch.upload.task-23]
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#6-4-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-5, groupId=dc-fitness-data-consumer-group] Attempt to heartbeat failed since group is rebalancing
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#0-4-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-53, groupId=dc-fitness-data-consumer-group] Attempt to heartbeat failed since group is rebalancing
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#0-3-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-52, groupId=dc-fitness-data-consumer-group] Attempt to heartbeat failed since group is rebalancing
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#6-4-C-1] INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-5, groupId=dc-fitness-data-consumer-group] Revoking previously assigned partitions [ft.class.batch.upload.task-26, ft.class.batch.upload.task-27, ft.class.batch.upload.task-28, ft.class.batch.upload.task-29, ft.class.batch.upload.task-24, ft.class.batch.upload.task-25]
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#6-3-C-1] INFO o.s.k.l.KafkaMessageListenerContainer - partitions revoked: [ft.class.batch.upload.task-18, ft.class.batch.upload.task-19, ft.class.batch.upload.task-20, ft.class.batch.upload.task-21, ft.class.batch.upload.task-22, ft.class.batch.upload.task-23]
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#0-4-C-1] INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-53, groupId=dc-fitness-data-consumer-group] Revoking previously assigned partitions [ft.es.class.incremental.update.task-26, ft.es.class.incremental.update.task-27, ft.es.class.incremental.update.task-28, ft.es.class.incremental.update.task-29, ft.es.class.incremental.update.task-24, ft.es.class.incremental.update.task-25]
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#6-4-C-1] INFO o.s.k.l.KafkaMessageListenerContainer - partitions revoked: [ft.class.batch.upload.task-26, ft.class.batch.upload.task-27, ft.class.batch.upload.task-28, ft.class.batch.upload.task-29, ft.class.batch.upload.task-24, ft.class.batch.upload.task-25]
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#0-4-C-1] INFO o.s.k.l.KafkaMessageListenerContainer - partitions revoked: [ft.es.class.incremental.update.task-26, ft.es.class.incremental.update.task-27, ft.es.class.incremental.update.task-28, ft.es.class.incremental.update.task-29, ft.es.class.incremental.update.task-24, ft.es.class.incremental.update.task-25]
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#6-4-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-5, groupId=dc-fitness-data-consumer-group] (Re-)joining group
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#6-3-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-4, groupId=dc-fitness-data-consumer-group] (Re-)joining group
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#0-4-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-53, groupId=dc-fitness-data-consumer-group] (Re-)joining group
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#0-3-C-1] INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-52, groupId=dc-fitness-data-consumer-group] Revoking previously assigned partitions [ft.es.class.incremental.update.task-22, ft.es.class.incremental.update.task-23, ft.es.class.incremental.update.task-18, ft.es.class.incremental.update.task-19, ft.es.class.incremental.update.task-20, ft.es.class.incremental.update.task-21]
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#0-3-C-1] INFO o.s.k.l.KafkaMessageListenerContainer - partitions revoked: [ft.es.class.incremental.update.task-22, ft.es.class.incremental.update.task-23, ft.es.class.incremental.update.task-18, ft.es.class.incremental.update.task-19, ft.es.class.incremental.update.task-20, ft.es.class.incremental.update.task-21]
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#5-4-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-47, groupId=dc-fitness-data-consumer-group] Attempt to heartbeat failed since group is rebalancing
13:09:20.225 [org.springframework.kafka.KafkaListenerEndpointContainer#0-3-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-52, groupId=dc-fitness-data-consumer-group] (Re-)joining group
Lets look at the documentation for a while, begin with pause() method
public void pause(Collection partitions) here
Suspend fetching from the requested partitions. Future calls to poll(long) will not return any records from these partitions until they have been resumed using resume(Collection). Note that this method does not affect partition subscription. In particular, it does not cause a group rebalance when automatic assignment is used.
So from the above description pause() method will suspend the partition from fetching the messages but it will not pause the consumer thread, which means consumer thread will do the subsequent poll() request to paused partitions but will not fetch any records
In your application : partitions are paused but consumer thread is busy in executing Time consuming operations for more that 10 minutes without doing any poll request.
Detecting Consumer Failures : so when poll stops heartbeat will not sent to the cluster
After subscribing to a set of topics, the consumer will automatically join the group when poll(long) is invoked. The poll API is designed to ensure consumer liveness. As long as you continue to call poll, the consumer will stay in the group and continue to receive messages from the partitions it was assigned. Underneath the covers, the poll API sends periodic heartbeats to the server; when you stop calling poll (perhaps because an exception was thrown), then no heartbeats will be sent. If a period of the configured session timeout elapses before the server has received a heartbeat, then the consumer will be kicked out of the group and its partitions will be reassigned.
Solution : increase the timeouts for below properties since default time is 5 minutes you are seeing rebalancing for every 5 minutes (personally will not support increasing timeouts) here
The new Java Consumer now supports heartbeating from a background thread. There is a new configuration max.poll.interval.ms which controls the maximum time between poll invocations before the consumer will proactively leave the group (5 minutes by default). The value of the configuration request.timeout.ms must always be larger than max.poll.interval.ms because this is the maximum time that a JoinGroup request can block on the server while the consumer is rebalancing, so we have changed its default value to just above 5 minutes. Finally, the default value of session.timeout.ms has been adjusted down to 10 seconds, and the default value of max.poll.records has been changed to 500.
Solution 2: if you need to process huge data with less amount of time increase more partition and consume each partition with each thread (which means more concurrency), less data that can be processed in 5 minutes