Kafka Consumer Hangs Indefinitely after Rebalancing - spring

I am trying to use a Kafka consumer library that was prewritten in my organization. It takes JSON data from a Kafka topic and stores it in a Mongo database. While I cannot post the code, it has a very simple architecture: it uses Apache Camel routes, then stores the consumed messages in Mongo using the Spring Boot Mongo dependency.
When I deploy to OpenShift and scale up to more than one pod, I receive the exception below, and the application then hangs without any further input or processing. I believe the failure happens in the logic inside the Kafka client library (or libraries).
I have tried running two instances of the application locally on different ports, and that works perfectly without error. I have tried setting the heartbeat interval, session timeout, batch size, max fetch bytes, number of concurrent consumers, SEDA mode on/off, and request timeout. Whether I raise, lower, enable, disable, or leave these Kafka settings undefined, the issue remains.
2019-05-23 16:15:51 [Camel (camel-1) thread #1 - KafkaConsumer[mytopic]] ERROR o.a.k.c.c.i.ConsumerCoordinator - Error UNKNOWN_MEMBER_ID occurred while committing offsets for group mytopic-status
2019-05-23 16:15:51 [Camel (camel-1) thread #7 - KafkaConsumer[mytopic]] ERROR o.a.k.c.c.i.ConsumerCoordinator - Error UNKNOWN_MEMBER_ID occurred while committing offsets for group mytopic-status
2019-05-23 16:15:51 [Camel (camel-1) thread #7 - KafkaConsumer[mytopic]] WARN o.a.c.component.kafka.KafkaConsumer - Error consuming mytopic-Thread 0 from kafka topic. Caused by: [org.apache.kafka.clients.consumer.CommitFailedException - Commit cannot be completed due to group rebalance]
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed due to group rebalance
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:552)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:493)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:665)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:644)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:167)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:133)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:107)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.onComplete(ConsumerNetworkClient.java:380)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:274)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:320)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:213)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:193)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:163)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsSync(ConsumerCoordinator.java:358)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:968)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:936)
at org.apache.camel.component.kafka.KafkaConsumer$KafkaFetchRecords.run(KafkaConsumer.java:132)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
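For reference, the heartbeat interval, session timeout, and request timeout mentioned in the question correspond to the standard consumer property names heartbeat.interval.ms, session.timeout.ms, and request.timeout.ms. The following is a minimal sketch, using only java.util.Properties, of assembling them with one sanity check. The class name ConsumerTimeouts and any values passed in are hypothetical, and this is not presented as a fix for the hang. The check reflects the rule in the Kafka documentation that the heartbeat interval must be lower than the session timeout (and is typically set to no more than a third of it).

```java
import java.util.Properties;

// Hypothetical helper, not part of Kafka or Camel: it only assembles property values.
final class ConsumerTimeouts {

    /** Builds consumer timeout properties, rejecting a heartbeat that is not below the session timeout. */
    static Properties build(int sessionTimeoutMs, int heartbeatIntervalMs, int requestTimeoutMs) {
        // Kafka requires heartbeat.interval.ms < session.timeout.ms;
        // the documentation suggests at most one third of the session timeout.
        if (heartbeatIntervalMs >= sessionTimeoutMs) {
            throw new IllegalArgumentException(
                "heartbeat.interval.ms must be lower than session.timeout.ms");
        }
        Properties props = new Properties();
        // Group name taken from the log lines in the question.
        props.setProperty("group.id", "mytopic-status");
        props.setProperty("session.timeout.ms", Integer.toString(sessionTimeoutMs));
        props.setProperty("heartbeat.interval.ms", Integer.toString(heartbeatIntervalMs));
        props.setProperty("request.timeout.ms", Integer.toString(requestTimeoutMs));
        return props;
    }
}
```

For example, ConsumerTimeouts.build(30000, 10000, 40000) yields a Properties object that could be handed to whatever component creates the consumer, while build(10000, 10000, 40000) is rejected.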

Related

Hikari CP (Spring Boot) Connection Recovery Problem After DB Failure

We have several microservices built on Spring Boot (2.2.4) and Hikari CP (3.4.2) with PostgreSQL.
Recently we experienced a DB failure that lasted around 30 seconds. After the connections were lost, some of the containers failed to recover their connections, while others with exactly the same configuration and application were fine. Unfortunately, we don't have logs showing the pool sizes (idle, active, waiting) at the time of the error.
We received some broken pipe and connection lost errors on all containers when the connections were lost. After the DB recovered, we got the following exception only on the containers (2 of 18) that failed to recover.
StackTrace:
org.springframework.orm.jpa.JpaTransactionManager.doBegin(JpaTransactionManager.java:402)
... 20 more
Caused by: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:689)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:196)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:161)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:128)
at org.hibernate.engine.jdbc.connections.internal.DatasourceConnectionProviderImpl.getConnection(DatasourceConnectionProviderImpl.java:122)
at org.hibernate.internal.NonContextualJdbcConnectionAccess.obtainConnection(NonContextualJdbcConnectionAccess.java:38)
at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.acquireConnectionIfNeeded(LogicalConnectionManagedImpl.java:104)
... 30 more
Caused by: org.postgresql.util.PSQLException: This connection has been closed.
at org.postgresql.jdbc.PgConnection.checkClosed(PgConnection.java:857)
at org.postgresql.jdbc.PgConnection.setNetworkTimeout(PgConnection.java:1639)
at com.zaxxer.hikari.pool.PoolBase.setNetworkTimeout(PoolBase.java:556)
at com.zaxxer.hikari.pool.PoolBase.isConnectionAlive(PoolBase.java:169)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:185)
... 35 more
We have seen similar situations and tests on the same system where the DB fails over and Hikari restores the connections without any problem. But in this case, one of the containers recovered by itself after 1 hour and the others only after a restart.
As far as we know, Hikari does not return broken connections to the pool and evicts them once they are marked as broken or closed. Any ideas what might have happened to those containers while the others (exactly the same image and configuration) were fine?
PS: we cannot reproduce the problem.
Hikari configuration:
allowPoolSuspension.............false
connectionInitSql...............none
connectionTestQuery.............none
connectionTimeout...............30000
idleTimeout.....................600000
initializationFailTimeout.......1
isolateInternalQueries..........false
leakDetectionThreshold..........0
maxLifetime.....................1800000
maximumPoolSize.................15
minimumIdle.....................15
validationTimeout...............5000
You can configure something like:
connectionTestQuery=select 1
This way Hikari tests that the connection is still alive before handing it over to Hibernate.
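To illustrate the idea of testing a connection before handing it out, here is a minimal sketch of validate-on-borrow written against plain JDBC. It is not Hikari's implementation: the ValidatingPool class and its fake() helper are hypothetical, and the liveness check uses the standard java.sql.Connection.isValid call rather than a test query.

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical toy pool that validates each idle connection before lending it out.
final class ValidatingPool {
    private final Deque<Connection> idle = new ArrayDeque<>();
    private final int validationTimeoutSeconds;

    ValidatingPool(int validationTimeoutSeconds) {
        this.validationTimeoutSeconds = validationTimeoutSeconds;
    }

    void add(Connection connection) {
        idle.addLast(connection);
    }

    /** Returns the first idle connection that passes validation, or null if none is left. */
    Connection borrow() {
        Connection candidate;
        while ((candidate = idle.pollFirst()) != null) {
            try {
                if (candidate.isValid(validationTimeoutSeconds)) {
                    return candidate;
                }
            } catch (SQLException e) {
                // A failing validation call is treated the same as a dead connection.
            }
            // Evict the dead connection instead of returning it to the caller.
            try {
                candidate.close();
            } catch (SQLException ignored) {
                // Nothing more to do for a connection that is already broken.
            }
        }
        return null;
    }

    /** Illustration helper: a stand-in Connection whose isValid() returns the given flag. */
    static Connection fake(boolean alive) {
        return (Connection) Proxy.newProxyInstance(
            ValidatingPool.class.getClassLoader(),
            new Class<?>[] { Connection.class },
            (proxy, method, args) -> "isValid".equals(method.getName()) ? (Object) alive : null);
    }
}
```

With one dead and one live connection added in that order, borrow() skips and closes the dead one and returns the live one.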

Tracing memory leak in Spring Azure qPID JMS code

I'm trying to trace and identify the root cause of a memory leak in our very small and simple Spring Boot application.
It uses the following:
- Spring Boot 2.2.4
- azure-servicebus-jms-spring-boot-starter 2.2.1
- MSSQL
Function:
The app only reads from an Azure ServiceBus queue, stores the data, and sends it to another destination.
It is a small app, so it starts easily with 64 MB of memory, although I give it up to 256 MB via the Xmx option. An important note: the queue is read using Spring's default transacted mode with a dedicated JmsTransactionManager, which is actually an inner TM of a ChainedTransactionManager, along with the DB TM and an additional outbound JMS TM. Both JMS ConnectionFactory objects are created as CachingConnectionFactory.
Behavior:
Once the app is started it seems OK. There is no traffic, so I can see in the log that it opens and closes transactions when checking the queue (jms:message-driven-channel-adapter).
However, after some time, while there is still no traffic and not a single message has been consumed, the memory starts climbing, as monitored via JVVM.
There is an error thrown:
--2020-04-24 11:17:01.443 - WARN 39892 --- [er.container-10] o.s.j.l.DefaultMessageListenerContainer : Setup of JMS message listener invoker failed for destination 'MY QUEUE NAME HERE' - trying to recover. Cause: Heuristic completion: outcome state is rolled back; nested exception is org.springframework.transaction.TransactionSystemException: Could not commit JMS transaction; nested exception is javax.jms.IllegalStateException: The Session was closed due to an unrecoverable error.
... and after several minutes it reaches the heap maximum, and from that moment on it fails with an OutOfMemoryError in the thread that opens JMS connections.
--2020-04-24 11:20:04.564 - WARN 39892 --- [windows.net:-1]] i.n.u.concurrent.AbstractEventExecutor : A task raised an exception. Task: org.apache.qpid.jms.provider.amqp.AmqpProvider$$Lambda$871/0x000000080199f840#1ed8f2b9
-
java.lang.OutOfMemoryError: Java heap space
at java.base/java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:61)
at java.base/java.nio.ByteBuffer.allocate(ByteBuffer.java:348)
at org.apache.qpid.proton.engine.impl.ByteBufferUtils.newWriteableBuffer(ByteBufferUtils.java:99)
at org.apache.qpid.proton.engine.impl.TransportOutputAdaptor.init_buffers(TransportOutputAdaptor.java:108)
at org.apache.qpid.proton.engine.impl.TransportOutputAdaptor.pending(TransportOutputAdaptor.java:56)
at org.apache.qpid.proton.engine.impl.SaslImpl$SwitchingSaslTransportWrapper.pending(SaslImpl.java:842)
at org.apache.qpid.proton.engine.impl.HandshakeSniffingTransportWrapper.pending(HandshakeSniffingTransportWrapper.java:138)
at org.apache.qpid.proton.engine.impl.TransportImpl.pending(TransportImpl.java:1577)
at org.apache.qpid.proton.engine.impl.TransportImpl.getOutputBuffer(TransportImpl.java:1526)
at org.apache.qpid.jms.provider.amqp.AmqpProvider.pumpToProtonTransport(AmqpProvider.java:994)
at org.apache.qpid.jms.provider.amqp.AmqpProvider.pumpToProtonTransport(AmqpProvider.java:985)
at org.apache.qpid.jms.provider.amqp.AmqpProvider.lambda$close$3(AmqpProvider.java:351)
at org.apache.qpid.jms.provider.amqp.AmqpProvider$$Lambda$871/0x000000080199f840.run(Unknown Source)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518)
at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.base/java.lang.Thread.run(Thread.java:835)
HeapDumps:
I took a couple of heap snapshots during this whole process and looked at what increased.
I can see a suspicious number of ConcurrentHashMap/String/Byte[] objects.
Does anyone have a clue or hint about what could be wrong in this setup and these libraries (Spring Boot, Apache qPid used under the hood of the Azure JMS dependency, etc.)? Many thanks.
Update #1
I have clear evidence that the problem is in either Spring or the Azure Service Bus starter library, and not necessarily in the qPid client it uses. My guess is that the library has the bug rather than Spring. This is what the failing setup looks like:
1. There are two JMS destinations and one DB, each with its own transaction manager.
2. A ChainedTransactionManager wraps the three TMs above.
3. A Spring Integration app connects to the Azure ServiceBus queue via jms:message-driven-channel-adapter, with the transaction manager created in point 2 set on this component.
4. Start the app; no traffic on the queue is needed. After 10 minutes the app crashes with an OutOfMemoryError. During those 10 minutes I watched the log at debug level, and the only thing happening was the opening and closing of transactions using the ChainedTransactionManager. As written in the comments, another important condition is the third JMS TransactionManager: with 2 TMs it works and is stable, with 3 it crashes.
Additional research and the steps taken identified the most likely root cause: Spring's CachingConnectionFactory class. Once I removed it and used only native types, the problem went away, and the memory consumption profile is now very different and healthy.
I have to say I created the CachingConnectionFactory using the standard constructor and didn't configure its behavior further. However, in my experience these Spring defaults clearly lead to a memory leak.
In the past I had a memory leak with ActiveMQ that had to be resolved by using CachingConnectionFactory, and now I have a memory leak with Azure ServiceBus when using CachingConnectionFactory. Strange :) In both cases I see these as bugs, because memory management should be correct regardless of whether caching is involved.
Marking this as my answer.
Tested case: The problem occurs when receiving and sending messages, each with its own TM, and both JMS connection factories are of type CachingConnectionFactory. In the end I tested the app with the inbound connection factory of type CachingConnectionFactory and the outbound one a native type, and there was no memory leak either.

Kafka Cannot Configure Topics on Application Startup, but Later Can Communicate

We have a Spring Boot application using spring-kafka (2.2.5.RELEASE) that always gets this error when starting up:
Could not configure topics
org.springframework.kafka.KafkaException: Timed out waiting to get existing topics; nested exception is java.util.concurrent.TimeoutException
However, the application continues to start up:
org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1]
INFO o.s.k.l.KafkaMessageListenerContainer - partitions revoked: []
INFO o.s.k.l.KafkaMessageListenerContainer - partitions assigned: [my-reply-topic-1]
INFO o.s.k.l.KafkaMessageListenerContainer - partitions assigned: [my-request-topic-0]
INFO o.s.b.w.e.tomcat.TomcatWebServer -
Tomcat started on port(s): 8080 (http) with context path ''
At this point, the application interacts with Kafka as expected.
We like to keep our logs clean, so we would like to understand why this exception is thrown. It is also a bit confusing, because when we move to a different environment where networking has not been established between the application and the Kafka broker(s), we get the same error, but the application does not function. Having the same exception occur both when there is truly a problem and when it can be ignored is irksome when troubleshooting connectivity issues.
Is there a way, on application startup, to determine whether connectivity has been established with Kafka rather than just waiting for a timeout message (which may be a red herring anyway)?
If the topic(s) exist already, remove any NewTopic beans from the application context and the KafkaAdmin won't try to connect to the broker at all.
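On the question of checking connectivity at startup, one rough option is a plain TCP probe of the broker address before the Kafka components start. The sketch below uses only the JDK; the BrokerProbe class, host, port, and timeout are hypothetical, and a successful TCP connection shows only that the port is reachable, not that the Kafka protocol handshake would succeed.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical startup check: is the broker's host and port reachable over TCP?
final class BrokerProbe {

    /** Returns true if a TCP connection to host:port can be opened within timeoutMs. */
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable: treat all of these as "not reachable".
            return false;
        }
    }
}
```

A call such as BrokerProbe.isReachable("my-broker-host", 9092, 2000) early in startup (host name assumed here) would let the application log a clear "broker unreachable" message instead of relying on the topic-configuration timeout alone.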

Spring AMQP, catch listener container stopped event

Using Spring Framework 4.3.6, I have configured AMQP (RabbitMQ) to retry the connection to the message broker every 5 seconds, up to 5 times. When all attempts to connect to the broker fail, Spring fails with a warning:
WARN [x.x.x.x.SimpleMessageListenerContainer] stopping container - restart recovery attempts exhausted
I would like to listen for the listener container stop event and notify the systems team that the message bus is unavailable.
I tried creating an ApplicationListener for ListenerContainerConsumerFailedEvent and also for DeclarationExceptionEvent, but neither worked. So I was wondering whether there is any other way to catch the container-stopped event when the container fails to start successfully.
If it helps, the error message I get after each connection attempt is:
ERROR [x.x.x.x.SimpleMessageListenerContainer] Failed to check/redeclare auto-delete queue(s).

Webapp hangs when Active MQ broker is running

I have a strange problem with my Spring webapp (running on a local Jetty), which connects to a locally running ActiveMQ broker for JMS functionality.
As soon as I start the broker, the application becomes incredibly slow. For example, starting the ApplicationContext with an active broker takes forever (more than 10 minutes; I have not yet waited long enough for it to complete). If I start the broker after the webapp (i.e. after the ApplicationContext has loaded), it runs, but very slowly (requests that usually take <1s take >30s). All operations take longer, even those that don't involve JMS. When I run the application without an ActiveMQ broker, everything runs smoothly (except the JMS-related stuff, of course ;-) ).
Here's what I tried so far:
- Updated ActiveMQ to version 5.10.1
- Used a standalone ActiveMQ instead of the Maven plugin
- Moved the broker from a separate JVM (via the ActiveMQ Maven plugin, connected via JNDI lookup in the Jetty config) into the same JVM (started via Spring config, without JNDI)
- Changed the ActiveMQ transport from tcp to vm
- Tried several ActiveMQ settings (alwaysSyncSend, alwaysSessionAsync, producerWindowSize)
- Used CachingConnectionFactory and PooledConnectionFactory
When analyzing a thread dump (jstack), I see many ActiveMQ threads sleeping on a monitor, which look like this:
"ActiveMQ VMTransport: vm://localhost#0-3" daemon prio=6 tid=0x000000000b1a3000 nid=0x1840 waiting on condition [0x00000000177df000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000f786d670> (a java.util.concurrent.SynchronousQueue$TransferStack)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:424)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:323)
at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:874)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:955)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:917)
at java.lang.Thread.run(Thread.java:662)
Any help is greatly appreciated !
I found the cause of the issue and was able to fix it:
We were passing a transaction manager to the AbstractMessageListenerContainer. While production uses an XA transaction manager, the local Jetty environment uses only a JpaTransactionManager. Apparently JMS waits forever for an XA transaction to be committed, which never happens in the local environment.
By overriding the bean definition of the AbstractMessageListenerContainer for the local environment, without setting a transaction manager but using sessionTransacted="true" instead, everything works fine.
I got the idea that it might be related to transaction handling from enabling the ActiveMQ logging. With this I saw that something was wrong with the transaction (transactionContext.getTransactionId() returned null).
