Health check timeout - Spring

I have a simple Spring Boot app with a few endpoints deployed to the PCF cloud. When I run load tests against them, response times on my endpoints increase and eventually the /health call times out, which causes my instance to get restarted (the health check timeout is 1 sec and AFAIK there is no way to adjust it). As I understand it, this happens because Tomcat can't handle that many requests with the resources provided. Ideally I would like requests to the health endpoint to be handled with high priority so that my instance doesn't fail. How can I achieve this?
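One possible direction (a sketch only, not a confirmed fix for the fixed 1-second PCF timeout): serve the actuator endpoints from their own Tomcat connector so /health never competes with application traffic for the same worker pool; setting management.server.port in the application properties achieves much the same thing. The port, protocol class, and the assumption that the platform health check can be pointed at the extra port are all illustrative.

// Sketch: register a second Tomcat connector dedicated to management/health traffic.
// By default each connector gets its own acceptor and worker threads, so /health
// on port 8081 is not starved by a saturated pool on the main connector.
import org.apache.catalina.connector.Connector;
import org.springframework.boot.web.embedded.tomcat.TomcatServletWebServerFactory;
import org.springframework.boot.web.server.WebServerFactoryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ManagementConnectorConfig {

    @Bean
    public WebServerFactoryCustomizer<TomcatServletWebServerFactory> managementConnector() {
        return factory -> {
            Connector connector = new Connector("org.apache.coyote.http11.Http11NioProtocol");
            connector.setPort(8081); // illustrative port for health/actuator traffic
            factory.addAdditionalTomcatConnectors(connector);
        };
    }
}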

Related

Vert.x web and micro services - Health check being starved

We use a very custom framework built on Vert.x to build our k8s micro services. This framework does a lot of the heavy lifting for teams, such as setting up all the endpoints and creating the health check endpoint, among other things.
One issue we see is that some of the micro services will start to "starve out" the health check when they get overloaded. So, as an app takes on heavy traffic, the health check, which runs every 20s, will time out. For these micro services I have checked that the teams are properly setting "blocking" on endpoints that do any sort of blocking calls, like DB reads/writes or downstream API calls.
The health check endpoint, being composed of quick checks, is not marked as blocking. My understanding is that blocking handlers get pushed off to the work queue and non-blocking ones stay on the event loop, so my theory is that under heavy strain the event loop is filling up, and by the time it gets to the queued health check it's already past the timeout. I say this because we see the timeout on the Kubernetes side, but our total processing time on the health check, which starts once the handler is called, is quick.
I attempted to alleviate this by pushing the health check into its own verticle, not quite understanding that you can't have multiple verticles on the same port (that was a misunderstanding on my part in reading the documentation).
So, my question is: What is the correct way to prioritize the health check? Is there a way to push these health checks to the front of the queue, or should we be looking more to some sort of "tuning"?
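One way to approach this (a sketch against the plain Vert.x 4 API rather than the custom framework): deploy the health check in its own verticle on a separate port, so it always has a dedicated event loop that application traffic never touches, and point the Kubernetes probe at that port. The port and path below are assumptions.

// Sketch: health check served by its own verticle/HTTP server on a dedicated port.
import io.vertx.core.AbstractVerticle;
import io.vertx.core.Promise;
import io.vertx.core.Vertx;
import io.vertx.ext.web.Router;

public class HealthVerticle extends AbstractVerticle {

    @Override
    public void start(Promise<Void> startPromise) {
        Router router = Router.router(vertx);
        // keep this handler strictly non-blocking so the dedicated event loop stays free
        router.get("/health").handler(ctx -> ctx.response().setStatusCode(200).end("OK"));

        vertx.createHttpServer()
             .requestHandler(router)
             .listen(8081) // separate port for the Kubernetes probe
             .onSuccess(server -> startPromise.complete())
             .onFailure(startPromise::fail);
    }

    public static void main(String[] args) {
        // each verticle gets its own event loop, so health checks are not queued
        // behind the application's handlers
        Vertx vertx = Vertx.vertx();
        vertx.deployVerticle(new HealthVerticle());
    }
}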

Disallow queuing of requests in gRPC microservices

SetUp:
We have gRPC pods running in a k8s cluster. The service mesh we use is linkerd. Our gRPC microservices are written in Python (asyncio gRPC as the concurrency mechanism), with the exception of the entry point. That microservice is written in golang (using the gin framework). We have an AWS API GW that talks to an NLB in front of the golang service. The golang service communicates with the backend via NodePort services.
Requests on our gRPC Python microservices can take a while to complete. The average is 8s, up to 25s at the 99th percentile. In order to handle the load from clients, we've horizontally scaled and spawned more pods to handle concurrent requests.
Problem:
When we send multiple requests to the system, even sequentially, we sometimes notice that requests go to the same pod as an ongoing request. What can happen is that this new request ends up getting "queued" on the server side (not fully "queued"; some progress gets made when context switches happen). The issue with queueing like this is that:
The earlier requests can start getting starved, and eventually time out (we have a hard 30s cap from the API GW).
The newer requests may also not get handled in time, and as a result get starved.
The symptom we're noticing is 504s, which are expected given our hard 30s cap.
What's strange is that we have other pods available, but for some reason the load balancer isn't routing requests to those pods smartly. It's possible that linkerd's smarter load balancing doesn't work well for our high-latency situation (we need to look into this further, however that will require a big overhaul of our system).
One thing I wanted to try doing is to stop this queuing up of requests. I want the service to immediately reject the request if one is already in progress, and have the client (meaning the golang service) retry. The client retry will hopefully hit a different pod (do let me know if that won’t happen). In order to do this, I set "maximum_concurrent_rpcs" to 1 on the server side (Python server). When I sent multiple requests in parallel to the system, I didn't see any RESOURCE_EXHAUSTED exceptions (even under the condition where there is only 1 server pod). What I do notice is that the requests are no longer happening in parallel on the server; they happen sequentially (I think that’s a step in the right direction, since the first request doesn’t get starved). That being said, I’m not seeing the RESOURCE_EXHAUSTED error in golang. I do see a delay between the entry time in the golang client and the entry time in the Python service. My guess is that the queuing is now happening client-side (or potentially still server-side, but it’s not visible to me)?
I then saw online that it may be possible for requests to get queued up on the client side as a default behavior in http/2. I tried to test this out in a custom Python client that mimics the golang one with:
channel = grpc.insecure_channel(
    "<some address>",
    options=[("grpc.max_concurrent_streams", 1)]
)
# create stub to server with channel…
However, I'm not seeing any change here either. (Note, this is a test dummy client - eventually I'll need to make this run in golang. Any help there would be appreciated as well).
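For clarity, the behaviour I'm after looks roughly like the sketch below, written as a grpc-java server interceptor purely for illustration (our real server is Python; the class name, the limit of one in-flight RPC, and the status message are placeholders):

// Sketch: admit at most one in-flight RPC and fail the rest fast with
// RESOURCE_EXHAUSTED so the caller can retry, ideally against another pod.
import io.grpc.ForwardingServerCallListener.SimpleForwardingServerCallListener;
import io.grpc.Metadata;
import io.grpc.ServerCall;
import io.grpc.ServerCallHandler;
import io.grpc.ServerInterceptor;
import io.grpc.Status;
import java.util.concurrent.Semaphore;

public class SingleFlightInterceptor implements ServerInterceptor {

    private final Semaphore permits = new Semaphore(1); // one concurrent RPC per process

    @Override
    public <ReqT, RespT> ServerCall.Listener<ReqT> interceptCall(
            ServerCall<ReqT, RespT> call, Metadata headers,
            ServerCallHandler<ReqT, RespT> next) {

        if (!permits.tryAcquire()) {
            // reject immediately instead of letting the request queue up behind the current one
            call.close(Status.RESOURCE_EXHAUSTED.withDescription("busy, retry elsewhere"),
                    new Metadata());
            return new ServerCall.Listener<ReqT>() {};
        }

        ServerCall.Listener<ReqT> delegate = next.startCall(call, headers);
        return new SimpleForwardingServerCallListener<ReqT>(delegate) {
            @Override
            public void onComplete() {
                permits.release(); // free the slot when the RPC finishes normally
                super.onComplete();
            }

            @Override
            public void onCancel() {
                permits.release(); // also free it if the client gives up
                super.onCancel();
            }
        };
    }
}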
Questions:
How can I get the desired effect here? Meaning: the server sends RESOURCE_EXHAUSTED if it is already handling a request, the golang client retries, and the retry hits a different pod?
Any other advice on how to fix this issue? I'm grasping at straws here.
Thank you!

Terminate job after certain time of inactivity with Nomad/Consul

Does Nomad in combination with Consul support the termination of services after a certain period of inactivity? Inactivity is defined here as no requests having been passed to the service. What would be the best approach to handle that outside the service itself?
One possibility would be to use the Nomad Autoscaler to scale your application up and down based on metrics. In your case that metric would be the number of requests going into the service.
If you set minimum count to 0, it should then scale down to 0 when there are no requests going to the service.
I have a demo that shows scaling on Prometheus metrics. By changing the scaling query to use requests per second to the application instead (like here), you should get your desired result.
If you are already using Consul Connect to link up the services, this demo should contain everything you need to do it.

Spring Boot with undertow becomes unresponsive when worker thread pool grows too large

We are running Spring Boot microservices on k8s on Amazon EC2, using Undertow as our embedded web server.
Whenever - for whatever reason - our downstream services are overwhelmed by incoming requests, and the downstream pods' worker queue grows too large (I've seen this issue happen at around 400), Spring Boot stops processing queued requests completely and the app goes silent.
Monitoring the queue size via JMX we can see that the queue size continues to grow as more requests are queued by the IO worker - but by this point no queued requests are ever processed by any worker threads.
We can't see any log output or anything to indicate why this might be happening.
This issue cascades upstream, whereby the paralyzed downstream pods cause the upstream pods to experience the same issue, and they too become unresponsive - even when we turn off all incoming traffic through the API gateway.
To resolve the issue we have to stop incoming traffic upstream, and then kill all of the affected pods, before bringing them back up in greater numbers and turning the traffic back on.
Does anyone have any ideas about this?
Is it expected behaviour?
If so, how can we make undertow refuse connections before the queue size grows too large and kills the service?
If not, what is causing this behaviour?
Many thanks.
Aaron.
I am not entirely sure whether tweaking the Spring Boot version / embedded web server will fix this, but below is how you can scale this up using Kubernetes / Istio.
livenessProbe
If livenessProbe is configured correctly then Kubernetes restarts pods if they aren't alive. https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-liveness-http-request
Horizontal Pod Autoscaler
Increases/decreases the number of pod replicas based on CPU utilization or custom metrics. https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
Vertical Pod Autoscaler
Increases/decreases the CPU / RAM of the pod based on load. https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler
Cluster Autoscaler
Increases/decreases the number of nodes in the cluster based on load. https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
Istio Rate limiting & Retry mechanism
Limit the number of requests that the service will receive and have a retry mechanism for the requests that couldn't be executed.
https://istio.io/docs/tasks/traffic-management/request-timeouts/
https://istio.io/docs/concepts/traffic-management/#network-resilience-and-testing
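Separately, on the "make Undertow refuse requests before the queue grows too large" part: Undertow ships a RequestLimitingHandler that caps concurrent requests and the queue behind them, rejecting anything beyond that instead of queueing it. Below is a hedged sketch wiring it into Spring Boot's Undertow factory; the limits are illustrative.

// Sketch: cap in-flight requests and the backlog; excess requests are rejected
// (by default with a 503) rather than queued indefinitely.
import io.undertow.server.handlers.RequestLimitingHandler;
import org.springframework.boot.web.embedded.undertow.UndertowServletWebServerFactory;
import org.springframework.boot.web.server.WebServerFactoryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class UndertowLoadSheddingConfig {

    @Bean
    public WebServerFactoryCustomizer<UndertowServletWebServerFactory> requestLimiter() {
        return factory -> factory.addDeploymentInfoCustomizers(deploymentInfo ->
                deploymentInfo.addInitialHandlerChainWrapper(next ->
                        // at most 200 concurrent requests, at most 50 queued behind them
                        new RequestLimitingHandler(200, 50, next)));
    }
}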

How to manage microservice failure?

Let's say I have several microservices (REST APIs). The problem is: if one service (let's call it service "A") is not accessible, the data that was being sent to service "A" will be saved in a temporary database. And after the service is working again, the data will be sent again.
Question:
1. Should I create a service which pings service "A" every 10 seconds to know whether it works or not? Or is it possible to do this with a task queue? Any suggestions?
Polling is a waste of bandwidth. You want to use a transactional queue.
Throw all your outbound messages in the queue, and have some other process to handle the messages.
How this will work is - after your process reads from the queue, and tries to send to the REST service:
If it works, commit the transaction (for the queue)
If it doesn't work, don't commit. Start a delay (minutes, seconds - you know best) until you read from the queue again.
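A minimal sketch of that consume-send-commit loop, assuming plain JMS (the connection factory, queue name, back-off, and REST call are placeholders):

// Sketch: transactional queue consumer. The message is only removed from the
// queue when the REST call to service "A" succeeds; otherwise it stays put.
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.Session;

public class OutboundRelay {

    private final ConnectionFactory connectionFactory; // e.g. ActiveMQ, Artemis, ...
    private final RestClient restClient;               // hypothetical wrapper around your HTTP call

    public OutboundRelay(ConnectionFactory connectionFactory, RestClient restClient) {
        this.connectionFactory = connectionFactory;
        this.restClient = restClient;
    }

    public void run() throws Exception {
        try (Connection connection = connectionFactory.createConnection()) {
            connection.start();
            Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
            Queue queue = session.createQueue("service-a-outbound");
            MessageConsumer consumer = session.createConsumer(queue);

            while (true) {
                Message message = consumer.receive();   // blocks until a message arrives
                try {
                    restClient.send(message);           // try the REST call to service "A"
                    session.commit();                   // it worked: commit, message is gone
                } catch (Exception e) {
                    session.rollback();                 // it didn't: message stays on the queue
                    Thread.sleep(30_000);               // delay before reading again
                }
            }
        }
    }

    /** Hypothetical interface standing in for the actual REST call. */
    public interface RestClient {
        void send(Message message) throws Exception;
    }
}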
You can use the Circuit Breaker pattern, e.g. the Hystrix circuit breaker from Netflix.
It is possible to open the circuit breaker based on a timeout, or when the service call fails or the service is inaccessible.
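For example, a Hystrix command wrapping the call to service "A" (a sketch; the group key, fallback behaviour and the HTTP call itself are placeholders):

// Sketch: failures and timeouts in run() feed Hystrix's circuit breaker; once the
// circuit is open, getFallback() is returned immediately without calling service "A".
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class ServiceACommand extends HystrixCommand<String> {

    private final String payload;

    public ServiceACommand(String payload) {
        super(HystrixCommandGroupKey.Factory.asKey("ServiceA"));
        this.payload = payload;
    }

    @Override
    protected String run() throws Exception {
        // the real REST call to service "A" goes here
        return callServiceA(payload);
    }

    @Override
    protected String getFallback() {
        // circuit open or call failed: e.g. persist the payload for a later retry
        return "QUEUED_FOR_RETRY";
    }

    private String callServiceA(String payload) throws Exception {
        throw new UnsupportedOperationException("replace with your HTTP client call");
    }
}

// usage: String result = new ServiceACommand(payload).execute();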
There are multiple dimensions to your question. First, you want to consider using an infrastructure that provides resilience and self-healing. Meaning you want to deploy a cluster of containers, all containing your Service A. Now you use a load balancer or API gateway in front of your service to distribute calls/load. It will also periodically check the health of your service. When it detects that a container does not respond correctly, it can kill the container and start another one. This can be provided by a container infrastructure such as Kubernetes, Docker Swarm, etc.
Now this does not protect you from losing any requests. In the event that a container malfunctions, there will still be a short time between the failure and the next health check where requests may not be served. In many applications this is acceptable and the client side will just re-request and hit another (healthy) container. If your application requires absolutely not losing requests, you will have to cache the request in, for example, an API gateway and make sure it is kept until a service has completed it (also called Circuit Breaker). An example technology would be Netflix Zuul with Hystrix. Using such a gatekeeper with built-in fault tolerance can increase the resiliency even further. As a side note - using an API gateway can also solve issues with central authentication/authorization, routing and monitoring.
Another approach to add resilience / decouple is to use a fast streaming / message queue, such as Apache Kafka, for recording all incoming messages and have a message processor process them whenever ready. The trick then is to only mark the messages as processed when your request has been served fully. This can also help in scenarios where faults occur due to a large number of requests that cannot be handled in real time by the service (Asynchronous Decoupling with Cache).
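A sketch of that "only mark as processed when fully served" idea with a Kafka consumer using manual offset commits (the broker address, topic and group id are placeholders):

// Sketch: auto-commit is disabled; offsets are committed only after the batch has
// been handled, so unprocessed requests are re-delivered after a failure/restart.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class IncomingRequestProcessor {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "service-a-processor");
        props.put("enable.auto.commit", "false"); // we decide when a message counts as processed
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("incoming-requests"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    handle(record.value()); // serve the request; throwing here skips the commit
                }
                if (!records.isEmpty()) {
                    consumer.commitSync(); // only now are these messages marked as processed
                }
            }
        }
    }

    private static void handle(String request) {
        // placeholder for the actual request handling
    }
}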
Service "A" should fire a "ready" event when it becomes available. Just listen to that and resend your request.
