Kafka consumer health check and recreate - spring-boot

We use the Spring Kafka client on a project. Recently we noticed that if a Kafka consumer dies due to an OutOfMemoryError, the service continues operating normally and no new consumers get created. The only way to fix this is to monitor the logs for the OOM and restart the service manually.
We are looking for a way to make consumer recreation automatic, e.g.:
Force Spring (somehow) to detect dead consumers and create new ones at runtime (a rough sketch of one way to do this is shown below).
In case of an OOM in a consumer thread, kill the entire service, so that the AWS auto-scaling group can create a new instance of the service.
Any suggestions or ideas are appreciated.
Thank you!
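For option 1, one possible approach (just a sketch; the bean name and timing below are hypothetical, not something Spring provides out of the box for this exact failure) is a scheduled watchdog that walks the KafkaListenerEndpointRegistry and restarts any listener container that is no longer running:

import org.springframework.kafka.config.KafkaListenerEndpointRegistry;
import org.springframework.kafka.listener.MessageListenerContainer;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Hypothetical watchdog; requires @EnableScheduling somewhere in the configuration.
// Whether a container actually reports isRunning() == false after its consumer thread
// dies with an Error depends on the Spring Kafka version and the failure mode, so
// verify that behaviour before relying on this.
@Component
public class KafkaConsumerWatchdog {
    private final KafkaListenerEndpointRegistry registry;

    public KafkaConsumerWatchdog(KafkaListenerEndpointRegistry registry) {
        this.registry = registry;
    }

    @Scheduled(fixedDelay = 30_000)
    public void restartDeadContainers() {
        for (MessageListenerContainer container : registry.getListenerContainers()) {
            if (!container.isRunning()) {
                container.stop();   // clean up any leftover state first
                container.start();  // spawns a fresh consumer thread
            }
        }
    }
}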

I have found a solution related to option 2 mentioned above.
Starting with Java version 1.8.0_92, two JVM options were added that allow killing the entire JVM in case of an OOME (see the release notes):
1. -XX:+ExitOnOutOfMemoryError
2. -XX:+CrashOnOutOfMemoryError
ExitOnOutOfMemoryError: When you enable this option, the JVM exits on the first occurrence of an out-of-memory error. It can be used if you prefer restarting an instance of the JVM rather than handling out-of-memory errors.
CrashOnOutOfMemoryError: If this option is enabled, when an out-of-memory error occurs, the JVM crashes and produces text and binary crash files.
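For completeness, the flag only needs to be added to whatever launches the JVM; for a Spring Boot jar that could look like the line below (the jar name is a placeholder). Once the process exits, the instance stops serving its health check and the AWS auto-scaling group replaces it, which is what option 2 above relies on.

java -XX:+ExitOnOutOfMemoryError -jar my-service.jar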

Related

Kafka version 3.0.1 - Kafka admin clients created repeatedly - memory leak

We have a Spring Boot app that consumes from a single topic and produces records to multiple topics.
We recently upgraded this app to Spring Boot 2.6.7 and updated the other dependencies accordingly in the Gradle project.
The app is able to consume and produce correctly, but the issue is that it seems to create Kafka admin clients repeatedly (thousands of them) and appears to be leaking memory (potentially because of this?), ultimately leading to the instance crashing and not being able to keep up with the lag.
Some Kafka-related dependencies in the external libraries:
org.apache.kafka:kafka-clients:3.0.1
org.springframework.cloud:spring-cloud-stream:3.2.3
org.springframework.cloud:spring-cloud-stream-binder-kafka:3.2.3
org.springframework.cloud:spring-cloud-stream-binder-kafka-core:3.2.3
org.springframework.integration:spring-integration-kafka:5.5.11
org.springframework.kafka:spring-kafka:2.8.5
Is there a reason for this? Is some configuration missing?
So the AdminClient was not the issue. The problem came from the default size of 10 of the hashmap that stores output channels.
I have set spring.cloud.stream.dynamic-destination-cache-size=30, since we actually already have 17 output destinations in the app.
With the default size of 10 for this hashmap, StreamBridge.channelCache keeps removing and re-adding values repeatedly ("Once this size is reached, new destinations will trigger the removal of old destinations"), triggering GC every now and then.
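For context, the churn comes from publishing to dynamic destinations through StreamBridge; something along these lines (the class and destination names are made up) will cycle entries in StreamBridge.channelCache once the number of distinct destination names exceeds the cache size, whereas raising spring.cloud.stream.dynamic-destination-cache-size above that number stops the eviction:

import org.springframework.cloud.stream.function.StreamBridge;
import org.springframework.stereotype.Component;

// Hypothetical router: sends each record to a destination derived from its event type.
// With 17 distinct destinations and the default cache size of 10, cached output
// bindings are evicted and recreated over and over, which is the churn described above.
@Component
public class OutputRouter {
    private final StreamBridge streamBridge;

    public OutputRouter(StreamBridge streamBridge) {
        this.streamBridge = streamBridge;
    }

    public void route(String eventType, Object payload) {
        streamBridge.send("topic-" + eventType, payload);
    }
}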

Launch Mongock faster so that when a changelog fails the application crashes before a health check can pass

We recently added Mongock to our Spring 5 app (using the Spring runner), but we are having some issues during our deploys. The final step in our deploy process is a health check where the deployment server checks a health page every 5 s for 5 minutes. Once it gets the correct response, the deployment is considered successful and finishes.
The issue is that Mongock seems to start the migration only around 30 s after the application context loads, resulting in the health check passing and the migration possibly failing after the service was "successfully" launched.
Using a standalone runner might solve this, but we really like having other beans available during the changelogs. So is there a way to force the changelogs to be processed as part of loading the application context? Or where is this delay coming from, and how can we reduce it?
You don't provide much information, but you are saying that Mongock starts 30 seconds after the application context is loaded. That could be happening for two reasons:
1. The most likely possibility is that you are using runner-type ApplicationRunner (the default). This means that Spring decides when to run it after the entire context is loaded. From what you are saying, runner-type InitializingBean is a better fit for you.
Please try this:
mongock:
  runner-type: InitializingBean
2. You have multiple instances fighting for the lock. There is nothing we can do about that; this process is optimised (although we are improving it even more). However, as said, I believe the issue is related to the runner-type.

How to crash JBoss based on some condition

I am using JBoss 7.x and have the following use case.
I am going to do load testing of messaging queues with JBoss. The queues are external to JBoss.
I will push a lot of messages into the queue, around 1,000. When around 100+ messages have been pushed, I want to crash JBoss. Later I want to restart JBoss to verify the message processing.
I had earlier made use of Byteman to crash the JVM using the following:
JAVA_OPTS="-javaagent:/BYTEMAN_HOME/lib/byteman.jar=script:/QUICKSTART_HOME/jta-crash-rec/src/main/scripts/xa.btm ${JAVA_OPTS}"
Details are here: https://github.com/Naresh-Chaurasia/jboss-eap-quickstarts/tree/7.3.x/jta-crash-rec
In the above case, whenever an XA transaction happens, the JVM is crashed using Byteman, but in my case I want to crash the JVM/JBoss only after, let's say, 100+ messages, i.e. not for each transaction but after processing some number of messages.
I have also tried a few examples from here to get ideas of how to achieve it, but did not succeed: https://developer.jboss.org/docs/DOC-17213#top
Question: How can I crash JBoss / the running JVM using Byteman or some other way?
See the Programmer's Guide that comes bundled with the distribution.
The sections headed "CountDowns" and "Aborting Execution" provide what's necessary. These are built-in features of the Rule Language.
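As a rough sketch of those two features combined (the class, method, and counter names below are made up and need to be adapted to whatever code actually processes a message), a Byteman script could create a CountDown of 100 and call the built-in killJVM() once it is exhausted:

# Create the countdown the first time a message is processed; createCountDown does
# nothing if a countdown with this identifier already exists.
RULE create message countdown
CLASS com.example.MessageProcessor
METHOD onMessage
AT ENTRY
IF TRUE
DO createCountDown("messages", 100)
ENDRULE

# countDown returns true once the counter is exhausted, at which point the JVM is killed.
RULE crash JVM after 100 messages
CLASS com.example.MessageProcessor
METHOD onMessage
AT EXIT
IF countDown("messages")
DO killJVM()
ENDRULE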

Apache Ignite: Possible too long JVM pause: 714 milliseconds

I have a setup with an Apache Ignite server and a Spring Boot application as a client in a Kubernetes cluster.
During performance testing, I started to notice the log line below showing up frequently in the Spring Boot application:
org.apache.ignite.internal.IgniteKernal: Possible too long JVM pause: 714 milliseconds
According to this post, this is because the "JVM is experiencing long garbage collection pauses", but the infrastructure team has confirmed to me that we have included -XX:+UseG1GC and -XX:+DisableExplicitGC in the server JVM options, and this log line only shows up in the Spring Boot application.
Please help with the following questions:
Is the GC happening in the client (Spring Boot application) or the server node?
What will be the impact of a long GC pause?
What should I do to prevent the impact?
Do I have to configure the JVM options in the Spring Boot application as well?
Is the GC happening in the client (Spring Boot application) or the server node?
The GC warning is logged by the node that is suffering the problem.
What will be the impact of a long GC pause?
Such pauses decrease overall performance. Also, if a pause is longer than failureDetectionTimeout, the node will be disconnected from the cluster.
What should I do to prevent the impact?
General advice is collected here: https://apacheignite.readme.io/docs/jvm-and-system-tuning. You can also enable GC logs to get a full picture of what happens.
Do I have to configure the JVM options in the Spring Boot application as well?
It looks like you should, because the problem is on the client node.
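As a concrete step for the last two points, GC logging can be turned on for the Spring Boot client JVM as well; for a JDK 8 JVM the options might look like the following (the log path is just an example), while on JDK 9+ the single option -Xlog:gc*:file=<path> replaces them:

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/app/gc.log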

JBoss failover testing

I have a peculiar situation here. I have installed JBoss 5.1.0 as a service on a Wintel box.
The service will restart itself if the JBoss instance fails.
However, I could not find a way to test this scenario. I killed the JVM that was running JBoss, but it did not restart the service. I need to make the JBoss service end abnormally so that I can verify that it restarts again.
In a nutshell, I need a way to make JBoss end abnormally.
Please help.
Write a JSP that calls System.exit(1)? That might fall foul of the Security Manager, though, and JBoss might not permit it.
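A minimal version of such a JSP, deployed to a test environment only and assuming no Security Manager blocks it, could look like this:

<%-- crash.jsp: hypothetical test-only page that terminates the JVM abruptly --%>
<%
    // System.exit(1) still runs shutdown hooks; Runtime.getRuntime().halt(1)
    // skips them and is closer to a genuine crash.
    System.exit(1);
%>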
In my experience, JBoss nodes (and app servers in general) tend not to crash in such a way that the process exits. Instead, they're more likely to consume increasing resources (e.g. memory) until they stop responding, and need explicit restarting. That's certainly easier to reproduce, but it's harder to handle automatically.
