Spring Boot actuator health issue with Consul - consul

We are running Consul in an OpenShift cluster. All services were developed with Spring Boot/Cloud and have registered successfully in Consul. A health endpoint is exposed using Spring Boot Actuator. The health endpoint itself works fine when we hit it with curl, but sometimes we get only an HTTP 200 status code and no response body. This appears to be what makes Consul log the warning below frequently, which in turn causes problems discovering the service.
Any suggestions would be a great help.
2016/08/05 05:57:15 [WARN] agent: http request failed 'http://10.1.0.18:9080/health': Get http://10.1.0.18:9080/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

I discovered this after a long time. My solution was to increase the timeouts for the probes. Not sure if this helps after two years, but it's worth a shot.
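
If the probes in question are the HTTP checks that Spring Cloud Consul registers with the Consul agent, the timeout and interval can be raised in the application's properties. This is a minimal sketch, assuming Spring Cloud Consul discovery is in use; the 10s and 30s values are illustrative assumptions, not values taken from the answer:

```properties
# Assumption: the service registers its check through Spring Cloud Consul discovery.
# How long the Consul agent waits for the health endpoint before failing the check (illustrative value).
spring.cloud.consul.discovery.health-check-timeout=10s
# How often the Consul agent calls the health endpoint (illustrative value).
spring.cloud.consul.discovery.health-check-interval=30s
# Path of the health endpoint, matching the one in the warning above.
spring.cloud.consul.discovery.health-check-path=/health
```

If the probes are instead OpenShift liveness/readiness probes, the equivalent setting would be the probe's own timeout in the pod spec.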

Related

Playtika's OSS Feign Client: org.springframework.web.reactive.function.client.WebClientRequestException: Connection prematurely closed BEFORE response

The issue
I've stumbled upon the issue:
Error message: org.springframework.web.reactive.function.client.WebClientRequestException: Connection prematurely closed BEFORE response; nested exception is reactor.netty.http.client.PrematureCloseException: Connection prematurely closed BEFORE response
General info about the issue
It's a Spring Boot app (2.4.5) running on Reactive (WebFlux) stack.
The app also uses Playtika OSS reactive Feign Client (starter 3.0.3) for synchronous REST API communication.
Underlying web client is Netty.
There are no special Feign or WebClient configs in the app.
All the other microservice parties are running on embedded Tomcat with default Spring Boot autoconfigurations.
All apps are running in Kubernetes cluster.
The error log is observed from time to time (not every day).
Guesses
After some investigation, my best guess is that some long-lived connections are being dropped from the pool under certain conditions, and this causes the error log.
This guess is based on Instana, which links the error log to a span that covers a lot of sub-calls.
Also, no data loss or other inconsistencies have been noticed so far xD
Questions
Does Feign have a connection pool by default?
How can I tell whether the connections being closed are live or idle connections from the pool?
How can the connection pool be configured or dropped to avoid long-running connections?
Is it possible that Kubernetes can somehow close these connections?
What else can close connections?

Spring Cloud Kubernetes Service registry - Feign client - java.net.NoRouteToHostException

There are two microservices (microservice-A and microservice-B), written in Spring Boot, that talk to each other through the Feign Client. The services are registered in the Kubernetes native service registry.
Whenever a fresh deployment of microservice-A happens, microservice-B, which is already running in Kubernetes, fails to make HTTP calls to the freshly deployed microservice-A and throws the exception below:
feign.RetryableException: No route to host (Host unreachable) executing GET http://microserice-a/api/v1/myresources
The issue is resolved immediately after restarting microservice-B.
When we searched for solutions, we found https://github.com/spring-cloud/spring-cloud-netflix/issues/769, where a user had left the comment below:
I suspect the root cause is that FeignClient keeps an old list of service providers and the Ribbon cannot move correctly to the next node if one node has been destroyed
We are not sure whether that is the correct root cause. Please comment if you have faced a similar issue and solved it.

CAS Actuator Health Endpoints Return 403 Intermittently

I recently upgraded CAS to 6.4.6.x and noticed that the liveness/readiness probes will intermittently throw 403 error codes. It appears to be a threading issue in the Spring Security Filter Chain. I have validated with the barebone CAS images that this does not happen in the 6.3.x version but can repeat it rather easily with the 6.4.x version. My configuration has not changed after the upgrade and I'm following the documentation.
Endpoint Configuration:
# allow all by default
cas.monitor.endpoints.endpoint.defaults.access[0]=PERMIT
# enable the health endpoint
management.endpoints.enabled-by-default=true
management.endpoints.web.base-path=/actuator
management.endpoints.web.exposure.include=health
management.endpoint.health.enabled=true
When I run load tests against the instance and send 1 request at a time, I get 200 responses. If I bump the concurrency up to 2 or more, I'm able to reproduce the threading issue, and some of the responses come back with a 403 after getting picked up by the Spring Default Error Controller.
Setting a breakpoint on the Error Controller, I'm able to see the same thread in the logs essentially jump to two different points in the code path.
I've gone through the pull requests from 6.3.x to 6.4.x and nothing jumped out at me that might be causing this issue. I haven't seen any issues raised in Spring Boot about the Actuator health endpoints failing. I've bumped Spring and Tomcat up to the latest patch versions. Any thoughts on what could be causing this, or other things I could try to determine how to fix it?

Webflux: CancelledServerWebExchangeException appears in metrics for seemingly no reason

After upgrading to spring-boot 2.5, CancelledServerWebExchangeException started to appear in the prometheus http_server_requests_seconds metrics quite frequently (up to 10% of server responses end up with it, according to the graphs). It appears in my own API metrics as well as in the actuator endpoint metrics (health, info, prometheus).
Example:
http_server_requests_seconds_count{exception="CancelledServerWebExchangeException",method="GET",outcome="UNKNOWN",status="200",uri="/actuator/health"} 137.0
Kind of strange combination of outcome="UNKNOWN" & status="200"
The problem is: all these requests have successful responses.
Questions: what is this exception for and why may it occur so often?
How to reproduce: start application locally and put some load on it (I used 50 threads in jmeter accessing actuator endpoints)

How to restart kubernetes pod when issue because of Rabbit MQ connectivity in logs

I have a Spring Boot 2 standalone application (not a REST service) that connects to RabbitMQ and processes messages. The application is deployed in Kubernetes. It works great, but when RabbitMQ stays down for a little longer, I see a 60-second heartbeat exception in the logs, and eventually the connection gets dropped, even if RabbitMQ comes back up after some time:
Automatic retry connection to broker by spring-rabbitmq
https://www.rabbitmq.com/heartbeats.html
I tried to work around the issue above by increasing the number of retries: https://stackoverflow.com/questions/45385119/how-configure-timeouts-retries-or-max-attempts-in-differents-queues-with-spring
but once the retries are exhausted, the issue still occurs.
How can I reboot or delete and recreate the pod from Kubernetes when I see this issue in the logs?
The easiest way is to use Spring Boot Actuator, which has an /actuator/health endpoint. (Note that recent versions also add /actuator/health/liveness and /actuator/health/readiness.)
You can assign the endpoint to the livenessProbe property in Kubernetes. The pod will then be restarted automatically when necessary. If necessary, you can also parameterize when your app counts as down.
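
On the Spring side, the liveness and readiness endpoints mentioned above can be switched on in the application properties. This is a minimal sketch, assuming Spring Boot 2.3 or later with the actuator and a web server on the classpath (a standalone application without one exposes no HTTP endpoint for the probe to call). Including the `rabbit` indicator in the liveness group is an assumption made here so that an unreachable broker fails the probe; it can also cause restarts during a broker outage, so treat it as a trade-off rather than a recommendation:

```properties
# Expose /actuator/health/liveness and /actuator/health/readiness (Spring Boot 2.3+).
management.endpoint.health.probes.enabled=true
# Assumption: make liveness depend on the RabbitMQ health indicator in addition to the default livenessState.
management.endpoint.health.group.liveness.include=livenessState,rabbit
```

The livenessProbe in the pod spec would then point at /actuator/health/liveness on the application's HTTP port.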
See the docs:
Kubernetes liveness probe
Spring actuator health
