CAS Actuator Health Endpoints Return 403 Intermittently - spring-boot

I recently upgraded CAS to 6.4.6.x and noticed that the liveness/readiness probes intermittently return 403 status codes. It appears to be a threading issue in the Spring Security filter chain. I have verified with the bare-bones CAS images that this does not happen in the 6.3.x version, but I can reproduce it rather easily with the 6.4.x version. My configuration has not changed since the upgrade and I'm following the documentation.
Endpoint Configuration:
# allow all by default
cas.monitor.endpoints.endpoint.defaults.access[0]=PERMIT
# enable the health endpoint
management.endpoints.enabled-by-default=true
management.endpoints.web.base-path=/actuator
management.endpoints.web.exposure.include=health
management.endpoint.health.enabled=true
When I run load tests against the instance and send 1 request at a time, I get 200 responses. If I raise the concurrency to 2 or more, I can reproduce the threading issue: some of the responses come back as 403 after being picked up by the Spring default error controller.
Setting a breakpoint on the Error Controller, I'm able to see the same thread in the logs essentially jump to two different points in the code path.
I've gone through the pull requests from 6.3.x to 6.4.x and nothing jumped out at me that might be causing this issue. I haven't seen any issues raised in Spring Boot about the Actuator health endpoints failing. I've bumped Spring and Tomcat up to the latest patch versions. Any thoughts on what could be causing this, or other things I could try to determine how to fix it?

Related

Spring Cloud Kubernetes Service registry - Feign client - java.net.NoRouteToHostException

Two microservices (microservice-A and microservice-B), written in Spring Boot, talk to each other through a Feign client, and both services are registered in the Kubernetes native service registry.
Whenever microservice-A is freshly deployed, microservice-B, which is already running in Kubernetes, fails to make HTTP calls to the freshly deployed microservice-A and throws the exception below:
feign.RetryableException: No route to host (Host unreachable) executing GET http://microserice-a/api/v1/myresources
The issue is resolved immediately after restarting microservice-B.
When we searched for solutions, we found https://github.com/spring-cloud/spring-cloud-netflix/issues/769, where a user had left the comment below:
I suspect the root cause is that FeignClient keeps an old list of service providers and the Ribbon cannot move correctly to the next node if one node has been destroyed
I'm not sure whether that is the correct root cause. Has anyone faced a similar issue and solved it?

Webflux: CancelledServerWebExchangeException appears in metrics for seemingly no reason

After upgrading to spring-boot 2.5, CancelledServerWebExchangeException started to appear in the Prometheus http_server_requests_seconds metrics quite frequently (up to 10% of server responses end up with it, according to the graphs). It appears in my own API metrics as well as in the actuator endpoint metrics (health, info, prometheus).
Example:
http_server_requests_seconds_count{exception="CancelledServerWebExchangeException",method="GET",outcome="UNKNOWN",status="200",uri="/actuator/health"} 137.0
Kind of strange combination of outcome="UNKNOWN" & status="200"
The problem is: all these requests have successful responses.
Questions: what is this exception for and why may it occur so often?
How to reproduce: start application locally and put some load on it (I used 50 threads in jmeter accessing actuator endpoints)

Is OpenTracing enabled for Reactive Routes in Quarkus?

I have recently changed my Quarkus application from RestEasy to Reactive Routes to implement my HTTP endpoints.
My Quarkus app had OpenTracing enabled and it was working fine. After changing the HTTP resource layer I can not see any trace in Jaeger.
After setting the log level to DEBUG I can see that my application is registered in Jaeger, but I don't see any traceId or spanId in the logs, nor any traces in Jaeger:
15:44:36 DEBUG traceId=, spanId=, sampled= [io.qu.ja.ru.JaegerDeploymentRecorder] (main) Registering tracer to GlobalTracer JaegerTracer(version=Java-0.34.3, serviceName=employee, reporter=RemoteReporter(sender=HttpSender(), closeEnqueueTimeout=1000), sampler=ConstSampler(decision=true, tags={sampler.type=const, sampler.param=true}), tags={hostname=employee-8569585469-tg8wg, jaeger.version=Java-0.34.3, ip=10.244.0.21}, zipkinSharedRpcSpan=false, expandExceptionLogs=false, useTraceId128Bit=false)
15:45:03 INFO traceId=, spanId=, sampled= [or.se.po.re.EmployeeResource] (vert.x-eventloop-thread-0) getEmployees
I'm using the latest version of Quarkus which is 1.9.2.Final.
Is OpenTracing enabled when I'm using Reactive Routes?
Tracing is enabled by default for JAX-RS endpoints only, not for reactive routes at the moment. You can activate tracing by annotating your route with @org.eclipse.microprofile.opentracing.Traced.
Yes, adding @Traced activates tracing on reactive routes.
Unfortunately, using both JAX-RS reactive and reactive routes breaks tracing on the event-loop threads used by the JAX-RS reactive endpoints once they get executed.
I only started with Quarkus 2 days ago, so I don't really know the reason for this behavior (or whether it's normal or a bug), but switching between the two clearly messes up the tracing.
Here is an example to easily reproduce it:
Create a RESTEasy Reactive endpoint returning an empty Multi
Create a custom reactive route
Set the number of IO threads to 2 (makes it easier to reproduce quickly)
Run the application, and request the two endpoints alternately
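For the IO threads step, a minimal sketch in application.properties might look like the following. The post does not say which property was used, so quarkus.http.io-threads is an assumption here.

```properties
# Assumption: cap the HTTP IO (event-loop) threads at 2 so that both endpoints
# end up sharing a thread quickly and the trace mix-up shows up sooner.
quarkus.http.io-threads=2
```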
Here is a screenshot that shows the issue
As you can see, as soon as the JAX-RS resource is hit and executed on one of the two available threads, it "corrupts" that thread, messing up the trace_id reported in the logs for the next calls of the reactive route (I don't know whether it's the generation or the reporting in the logs that is broken).
This does not happen on the JAX-RS resource, as you can notice on the screenshot as well. So it seems to be related to reactive routes only.
Another point here is the fact that JAX-RS Reactive resources are incorrectly reported on Jaeger. (with a mention to a missing root span) Not sure if it's related to the issue but that's also another annoying point.
I'm thinking of completely removing the JAX-RS Reactive endpoints and replacing them with plain reactive routes to eliminate this bug.
I would appreciate it if someone with more experience than me could verify this or tell me what I did wrong :)
EDIT 1: I added a route filter with priority 500 to clear the MDC and the bug is still there, so definitely not coming from MDC.
EDIT 2: I opened a bug report on Quarkus
EDIT 3: It seems related to how the two implementations work (thread locals versus context propagation in an actor-based context)
So, unless JAX-RS reactive resources are marked @Blocking (and get executed in a separate thread pool), JAX-RS reactive and Vert.x reactive routes are incompatible when it comes to tracing (and probably also for MDC-related information, since MDC is thread-bound too)

Make spring-boot 2.2.0 report status = UP, even when the DB is down?

Up to spring-boot 2.1.9, I used to set management.health.defaults.enabled = false to decouple the /health endpoint overall status from the database status.
As of 2.2.0, that specific setting no longer works that way (see: SpringBoot 2.1.9 -> 2.2.0 - health endpoint no longer works).
Is there a way to configure spring-boot to decouple the overall status of the /health endpoint from whether or not the datasource is up?
I'm inclined to just make my own endpoint hardcoded to return a status of 200.
I don't really understand what you're trying to do and how disabling all defaults achieved what you've described.
What would be the point of having an endpoint that returns 200 unconditionally? That's seriously misleading IMO.
If you do not want the datasource health indicator, then you can disable that (and only that) using management.health.db.enabled=false.
If you want the datasource health check but want to be able to ignore it, create a group that excludes the db health check and use that for monitoring. See the documentation for more details.
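As a rough sketch of the group approach, assuming Spring Boot 2.2 or later: the group name "monitoring" below is hypothetical, and any name could be used.

```properties
# Hypothetical group "monitoring": all health indicators except the datasource one.
management.endpoint.health.group.monitoring.include=*
management.endpoint.health.group.monitoring.exclude=db
```

With the default base path, such a group would be served at /actuator/health/monitoring, while /actuator/health would still reflect the database status.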

Spring Boot actuator health issue with Consul

We are running Consul in an OpenShift cluster. All services were developed with the Spring Boot/Cloud APIs and have registered successfully in Consul. A health endpoint is exposed using Spring Boot Actuator. The health endpoint itself works fine when we hit it with curl, but sometimes we only get an HTTP 200 status code and do not see any response body. This causes Consul to log the errors below frequently, which causes issues in discovering the service.
Any suggestions would be a great help.
2016/08/05 05:57:15 [WARN] agent: http request failed 'http://10.1.0.18:9080/health': Get http://10.1.0.18:9080/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I discovered this after a long time. My solution was to increase the timeouts for the probes. Not sure if this helps after 2 years, but it's worth a shot.
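The answer does not say which probes were tuned. If the check in question is the one Spring Cloud Consul registers with the agent, a sketch of raising its timeout could look like this. The values are illustrative assumptions, not settings taken from the post.

```properties
# Assumed values: give the Consul agent longer to wait for /health before it
# reports "Client.Timeout exceeded", and space the checks out a little.
spring.cloud.consul.discovery.health-check-timeout=10s
spring.cloud.consul.discovery.health-check-interval=15s
```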
