Vert.x web and microservices - Health check being starved

We use a very custom framework built on Vert.x to build our k8s microservices. This framework does a lot of the heavy lifting for teams, such as setting up all the endpoints and creating the health check endpoint, among other things.
One issue we see is that some of the microservices will start to "starve out" the health check when they get overloaded. So, as an app takes on heavy traffic, the health check, which runs every 20s, will time out. For these microservices I have checked that the teams are properly marking handlers as blocking on endpoints that make any sort of blocking calls, like DB reads/writes or downstream API calls.
The health check endpoint, being composed of quick checks, is not marked blocking. My understanding is that blocking handlers get pushed off to the worker pool while non-blocking handlers stay on the event loop, so my theory is that under heavy strain the event loop queue is filling up, and by the time it gets to the queued health check it's already past the timeout. I say this because we see the timeout on the Kubernetes side, but our total processing time on the health check, which starts once the handler is called, is quick.
I attempted to alleviate this by pushing the health check into its own verticle, not quite understanding that you can't have multiple verticles on the same port (that was a misunderstanding on my part in reading the documentation).
So, my question is: what is the correct way to prioritize the health check? Is there a way to push these health checks to the front of the queue, or should we be looking more at some sort of "tuning"?
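For context, the sort of thing I was attempting looks roughly like the sketch below, here with the health check served from its own verticle on a dedicated port (which sidesteps the same-port issue, since the k8s probe can target any container port). Port 8081 and the /health path are placeholders, not our real values:

import io.vertx.core.AbstractVerticle;
import io.vertx.core.Vertx;
import io.vertx.ext.web.Router;

// Rough sketch only: a dedicated verticle that serves nothing but the health check on
// its own port, so probe traffic does not queue behind application requests on the
// main server's event loop.
public class HealthCheckVerticle extends AbstractVerticle {
    @Override
    public void start() {
        Router router = Router.router(vertx);
        router.get("/health").handler(ctx ->
            ctx.response().setStatusCode(200).end("OK"));
        vertx.createHttpServer()
             .requestHandler(router)
             .listen(8081); // point the k8s liveness/readiness probe at this port
    }
}

// deployed alongside the main application verticle(s):
// Vertx vertx = Vertx.vertx();
// vertx.deployVerticle(new HealthCheckVerticle());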


Disallow queuing of requests in gRPC microservices

Setup:
We have gRPC pods running in a k8s cluster. The service mesh we use is Linkerd. Our gRPC microservices are written in Python (asyncio gRPC as the concurrency mechanism), with the exception of the entry point. That microservice is written in golang (using the Gin framework). We have an AWS API Gateway that talks to an NLB in front of the golang service. The golang service communicates with the backend via NodePort services.
Requests to our gRPC Python microservices can take a while to complete. The average is 8s, up to 25s at the 99th percentile. In order to handle the load from clients, we've scaled horizontally and spawned more pods to handle concurrent requests.
Problem:
When we send multiple requests to the system, even sequentially, we sometimes notice that requests go to the same pod as an ongoing request. What can happen is that this new request ends up getting "queued" on the server side (not fully "queued"; some progress gets made when context switches happen). The issue with queueing like this is that:
The earlier requests can start getting starved, and eventually time out (we have a hard 30s cap from the API Gateway).
The newer requests may also not get handled in time, and as a result get starved.
The symptom we're noticing is 504s, which are expected given our hard 30s cap.
What's strange is that we have other pods available, but for some reason the load balancer isn't routing requests to those pods smartly. It's possible that Linkerd's smarter load balancing doesn't work well for our high-latency situation (we need to look into this further; however, that will require a big overhaul of our system).
One thing I wanted to try is to stop this queuing up of requests. I want the service to immediately reject a request if one is already in progress, and have the client (meaning the golang service) retry. The client retry will hopefully hit a different pod (do let me know if that won't happen). In order to do this, I set "maximum_concurrent_rpcs" to 1 on the server side (Python server). When I sent multiple requests in parallel to the system, I didn't see any RESOURCE_EXHAUSTED exceptions (even under the condition where there is only 1 server pod). What I do notice is that the requests are no longer happening in parallel on the server; they happen sequentially (I think that's a step in the right direction: the first request doesn't get starved). That being said, I'm not seeing the RESOURCE_EXHAUSTED error in golang. I do see a delay between the entry time in the golang client and the entry time in the Python service. My guess is that the queuing is now happening client-side (or potentially still server-side, but it's not visible to me)?
I then saw online that it may be possible for requests to get queued up on the client side as a default behavior in HTTP/2. I tried to test this out in a custom Python client that mimics the golang one:
import grpc

channel = grpc.insecure_channel(
    "<some address>",
    options=[("grpc.max_concurrent_streams", 1)],  # channel argument meant to cap concurrent HTTP/2 streams
)
# create stub to server with channel…
However, I'm not seeing any change here either. (Note: this is a test dummy client; eventually I'll need to make this work in golang. Any help there would be appreciated as well.)
Questions:
How can I get the desired effect here? Meaning: the server returns RESOURCE_EXHAUSTED if it is already handling a request, the golang client retries, and the retry hits a different pod.
Any other advice on how to fix this issue? I'm grasping at straws here.
Thank you!

Microservices: how to track services that have gone down?

Problem:
Suppose there are two services, A and B. Service A makes an API call to service B.
After a while, service A goes down or becomes unreachable due to network errors.
How will other services figure out that an outbound call from service A was lost / never happened? I need some other concurrent app that will automatically react (run emergency code) if service A's outbound call is lost.
What cutting-edge solutions exist?
My thoughts, for example (rough sketch after the list):
Service A registers a call event in some middleware (event info, "running" status, timestamp, etc.).
If this call is not completed after N seconds, some "call timeout" event in the middleware automatically starts the emergency code.
If the call is completed in time, service A marks the call status as "completed" in the same middleware and the emergency code will not be run.
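A minimal in-process illustration of that idea (names are hypothetical; the "middleware" here is just a JDK scheduler, whereas a real implementation would have to live outside service A, e.g. in an external store or broker, so that it survives service A going down):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class CallWatchdog {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // register the call: if it is not marked completed within timeoutSeconds,
    // the emergency code runs automatically
    public ScheduledFuture<?> registerCall(Runnable emergencyCode, long timeoutSeconds) {
        return scheduler.schedule(emergencyCode, timeoutSeconds, TimeUnit.SECONDS);
    }

    // service A reports success in time: cancel the pending emergency code
    public void markCompleted(ScheduledFuture<?> pendingTimeout) {
        pendingTimeout.cancel(false);
    }
}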
P.S. I'm on Java stack.
Thanks!
I recommend looking into patterns such as Retry, Timeout, Circuit Breaker, Fallback and Healthcheck. You can also look into the Bulkhead pattern if concurrent calls and fault isolation are your concern.
There are many resources where these well-known patterns are explained, for instance:
https://www.infoworld.com/article/3310946/how-to-build-resilient-microservices.html
https://blog.codecentric.de/en/2019/06/resilience-design-patterns-retry-fallback-timeout-circuit-breaker/
I don't know which technology stack you are on, but usually there is already some functionality provided for these concerns that you can incorporate into your solution. There are libraries that already take care of this resilience functionality, and you can, for instance, set them up so that your custom code is executed when events such as failed retries, timeouts, activated circuit breakers, etc. occur.
E.g. for the Java stack Hystrix is widely used; for .NET you can look into Polly to make use of retry, timeout, circuit breaker, bulkhead or fallback functionality.
Concerning health checks, you can look into Spring Boot Actuator on the Java side, and .NET Core already provides health check middleware that more or less gives you that functionality out of the box.
But before using any libraries, I suggest first getting familiar with the purpose and concepts of the listed patterns so you can choose and integrate those that best fit your use cases and major concerns.
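To make the Circuit Breaker/Fallback idea concrete for the Java stack, here is a hedged sketch using Hystrix (the command name and the call to service B are hypothetical; note that Hystrix is in maintenance mode these days, and Resilience4j offers the same patterns):

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Wraps the remote call to service B; on failure, timeout or an open circuit,
// getFallback() runs instead (your "emergency code").
public class CallServiceBCommand extends HystrixCommand<String> {

    public CallServiceBCommand() {
        super(HystrixCommandGroupKey.Factory.asKey("ServiceB"));
    }

    @Override
    protected String run() throws Exception {
        // hypothetical helper performing the actual HTTP/gRPC call to service B
        return callServiceB();
    }

    @Override
    protected String getFallback() {
        // executed when run() throws, times out, or the circuit breaker is open
        return "service B unavailable - fallback response";
    }

    private String callServiceB() throws Exception {
        throw new UnsupportedOperationException("replace with the real call to service B");
    }
}

// usage: String result = new CallServiceBCommand().execute();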
Update
We have to differentiate between two well-known problems here:
1.) How can service A robustly handle temporary outages of service B (or the network connection between service A and B which comes down to the same problem)?
To address the related problems, the above-mentioned patterns will help.
2.) How to make sure that the request that should be sent to service B will not get lost if service A itself goes down?
To address this kind of problem there are different options at hand.
2a.) The component that performed the request to service A (which then triggers service B) also applies the resilience patterns mentioned above and retries its request until service A successfully answers that it has performed its tasks (which also includes the successful request to service B).
There can also be several instances of each service and some kind of load balancer in front of these instances which will distribute and direct the requests to an available instance (based on regularly performed healthchecks) of the specific service. Or you can use a service registry (see https://microservices.io/patterns/service-registry.html).
You can of course chain several API calls one after another, but this can lead to cascading failures. So I would rather go with an asynchronous communication approach as described in the next option.
2b.) Let's consider that it is of utmost importance that some instance of service A will reliably perform the request to service B.
You can use message queues in this case as follows:
Let's say you have a queue where jobs to be performed by service A are collected.
Then you have several instances of service A running (see horizontal scaling) where each instance will consume the same queue.
You will use the message-locking features of the message queue service, which make sure that as soon as one instance of service A reads a message from the queue the other instances won't see it. If service A was able to complete its job (i.e. call service B, save some state in service A's persistence and whatever other tasks you need included for successful processing) it will delete the message from the queue afterwards, so no other instance of service A will also process the same message.
If service A goes down during the processing, the queue service will automatically unlock the message for you and another instance of service A (or the same instance after it has restarted) will try to read the message (i.e. the job) from the queue and try to perform all the tasks (call service B, etc.).
You can combine several queues, e.g. also to send a message to service B asynchronously instead of directly performing some kind of API call to it.
The catch is that the queue service must be a highly available and redundant service which already makes sure that no message gets lost once it is published to a queue.
Of course you could also keep track of jobs to be performed in service A's own database, but consider that when service A receives a request there is always a chance that it goes down before it can save the status of the job to its persistent storage for later processing. Queue services already address that problem for you if chosen thoughtfully and used correctly.
For instance, if you choose Kafka as the messaging service, you can look into this Stack Overflow answer, which relates to solving the problem with that specific technology: https://stackoverflow.com/a/44589842/7730554
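As a rough illustration of option 2b, assuming Kafka is the chosen queue (the broker address, topic name and callServiceB() below are placeholders): with auto-commit disabled, the offset is only committed after the job has fully succeeded, so if this instance of service A dies mid-processing, another instance in the same consumer group picks the job up again.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ServiceAJobConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "service-a");                 // all service A instances share this group
        props.put("enable.auto.commit", "false");           // commit only after successful processing
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("service-a-jobs")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    callServiceB(record.value()); // perform the job: call service B, persist state, ...
                }
                // acknowledge only now; if the instance dies before this line, the jobs are
                // re-delivered to another instance (at-least-once, so keep the work idempotent)
                consumer.commitSync();
            }
        }
    }

    private static void callServiceB(String job) {
        // placeholder for the actual call to service B
    }
}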
There are many ways to solve your problem.
I guess you are talking about two topics: design patterns for microservices and the Circuit Breaker.
https://dzone.com/articles/design-patterns-for-microservices
To solve your problem, I normally put a message queue between services and use service discovery to detect which services are live; if a service dies or gets overloaded, I use Circuit Breaker methods.

Microservice State Synchronization

We are working on an application that has a WebSocket connection to every client. For high availability and load balancing purposes, we would like to scale the receiving microservice. As the WebSocket connection is used to propagate the state of a client to every other client, it is important to synchronize the current state of a client with all other instances of the receiving microservice. It is also important that the state is reset when a client disconnects.
To give you some specs:
We are using docker swarm
It's a NodeJS backend and an Angular 9 frontend
We have looked into multiple ideas, for example:
Redis Cache (The state would not be deleted if the instance fails.)
Queues/Topics (This would mean every instance has to keep track of the current state of all clients.)
WebSockets between instances (This looks promising but is not really scalable.)
What is the best practice to sync the state of a micro service between multiple instances while making sure that there are no inconsistencies? How are you solving this issue? Are we missing something obvious? Any tips and tricks?
We appreciate any suggestions.
This might not be 100% what you want to hear, but generally people advise that all microservices should be stateless.
An overall application, of course, has state, and databases, persistent event streams or key-value caches (e.g. Redis) are excellent ways of persisting this. Ideally this is bounded per service though, otherwise you risk ending up writing a distributed monolith.
Hard to say in your particular case, but perhaps rethink how state is stored conceptually and make that more explicit: determine what is cache (for performance) and what is genuine state that should be persisted externally (e.g. to Redis and/or a database) where many service instances can use it instantly, thus making sure they are truly disposable processes.

How to manage microservice failure?

Let's say I have several microservices (REST APIs). The problem is: if one service is not accessible (let's call it service "A"), the data that was being sent to service "A" is saved in a temporary database, and once the service is working again, the data is sent again.
Question:
1. Should I create a service which pings service "A" every 10 seconds to know whether it works or not? Or is it possible to do this with a task queue? Any suggestions?
Polling is a waste of bandwidth. You want to use a transactional queue.
Throw all your outbound messages in the queue, and have some other process to handle the messages.
How this works: after your process reads from the queue and tries to send to the REST service:
If it works, commit the transaction (for the queue)
If it doesn't work, don't commit. Start a delay (minutes, seconds - you know best) until you read from the queue again.
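A hedged sketch of that loop using a plain JMS transacted session (the queue name "service-a-outbox", the 30-second delay and sendToServiceA() are placeholders; any transactional broker works the same way):

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import javax.jms.TextMessage;

public class OutboxProcessor {

    private final ConnectionFactory factory; // e.g. an ActiveMQ/Artemis factory, injected elsewhere

    public OutboxProcessor(ConnectionFactory factory) {
        this.factory = factory;
    }

    public void run() throws Exception {
        try (Connection connection = factory.createConnection()) {
            connection.start();
            // transacted session: a received message is only removed from the queue on commit()
            Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
            MessageConsumer consumer = session.createConsumer(session.createQueue("service-a-outbox"));

            while (true) {
                TextMessage message = (TextMessage) consumer.receive();
                try {
                    sendToServiceA(message.getText()); // the actual REST call to service "A"
                    session.commit();                  // it worked: message is removed from the queue
                } catch (Exception e) {
                    session.rollback();                // it failed: message stays on the queue
                    Thread.sleep(30_000);              // back off before reading from the queue again
                }
            }
        }
    }

    private void sendToServiceA(String payload) {
        // placeholder for the HTTP call to the temporarily unavailable REST service
    }
}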
You can use the Circuit Breaker pattern, e.g. the Hystrix circuit breaker from Netflix.
It is possible to open the circuit breaker based on a timeout, or when the service call fails or the service is inaccessible.
There are multiple dimensions to your question. First you want to consider using an infrastructure that provides resilience and self healing. Meaning you want to deploy a cluster of containers, all containing your Service A. Now you use a load balancer or API gateway in front of your service to distribute calls/load. It will also periodically check for the health of your service. When it detects a container does not respond correctly it can kill the container and start another one. This can be provided by a container infrastructure such as kubernetes / docker swarm etc.
Now this does not protect you from losing any requests. In the event that a container malfunctions there will still be a short time between the failure and the next health check where requests may not be served. In many applications this is acceptable and the client side will just re-request and hit another (healthy) container. If your application requires absolutely not losing requests, you will have to cache the request in, for example, an API gateway and make sure it is kept until a service has completed it (also called Circuit Breaker). An example technology would be Netflix Zuul with Hystrix. Using such a gatekeeper with built-in fault tolerance can increase the resiliency even further. As a side note - using an API gateway can also solve issues with central authentication/authorization, routing and monitoring.
Another approach to add resilience / decoupling is to use a fast streaming / message queue, such as Apache Kafka, for recording all incoming messages and have a message processor process them whenever ready. The trick then is to only mark the messages as processed when your request has been served fully. This can also help in scenarios where faults occur due to a large number of requests that cannot be handled in real time by the service (asynchronous decoupling with a cache).
Service "A" should fire a "ready" event when it becomes available. Just listen to that and resend your request.

Send a message from one microservice to another in Azure Service Fabric (APIs)

What is the best architecture, using Service Fabric, to guarantee that the message I need to send from Service 1 (mostly API) to Service 2 (mostly API) does not get ever lost (black arrow)?
Ideas:
1
1.a. Make service 1 and 2 stateful services. Is it a bad call to have a stateful Web API?
1.b. Use Reliable Collections to send the message from API code to Service 2.
2
2.a. Make Service 1 and 2 stateless services
2.b. Add a third service
2.c. Send the message over a queuing system (e.g. Service Bus) from service 1
2.d. To be picked up by the third service. Notice: this third service would also have access to the DB that service 2 (API) has access to. Not an ideal solution for a microservice architecture, right?
3
3.a. Any other ideas?
Keep in mind that the goal is to never lose the message, not even when service 2 is completely down or temporary removed… so no direct calls.
Thanks
I'd introduce a third (Stateful) service that holds a queue, 'service 3'.
Service 1 would enqueue the message. Service 3 would run an infinite loop, trying to deliver the message to service 2.
You could use the pub/sub package for this. Service 1 is the publisher, Service 2 is the subscriber.
(If you rely on an external queue system like Service Bus, you'll lower the overall availability of the system. Service Bus downtime would lead to messages being undeliverable.)
I think that there is never any solution that is 100% sure to never lose a message between two parties. Even if you had a service bus between the two services, for instance, there is always the chance (possibly very small, but never zero) that the service bus goes down, or that the communication to the service bus goes down. With that being said, there are of course models that are far less likely to lose a message, but you can't completely get around the fact that you still have to handle errors in the client.
In fact, Service Fabric fault handling is mainly designed around clients retrying communication, rather than having the service or an intermediary do that. There are many reasons for this (I guess) but one is the nature of distributed, replicated, reliable services. If a service primary goes down, a replica picks up the responsibility, but it won't know what the primary was doing right at the moment it died (unless it had replicated over its state, but it might have died even before that). The only one that really knows what it wants to do in this scenario is the client. The client knows what it is doing and can react to different fault scenarios in the service. In Fabric Transport, most known exceptions that could "naturally" occur, such as the service dying or the network cable being cut by the janitor, are actually retried automatically. This includes re-resolving the address, just in case the service primary was replaced by a secondary.
The same actually goes for a scenario where you introduce a third service or a service bus. What if the network goes down before the message has completely reached the service? In this case only the client knows that something went wrong and what it intended to send. What if it goes down after the message reached the service but before the response was sent? In this case the client has to assume the message never arrived and try to resend it. This is also why service methods are recommended to be idempotent - the same call can be made a number of times by the same client.
Even if you were to introduce a secondary part, like the service bus, there is still the same risk that the service bus goes down, or, more likely, that the network connecting to the service bus goes down. So, the client needs to retry, and when it has retried a number of times, all it can do is put the message in a queue of failed messages, simply log it, or throw an exception back to the original caller (in your scenario, the browser).
OK, that was me being pessimistic. But all of the things above could happen; it's just that some are not very likely to. Still, they might.
On to your questions:
1) The problem with making a stateless service stateful is that you now have to handle partitions in your caller. You can put up HTTP listeners for stateful services, but you have to include the partition and replica information in the URI, and that won't work with the load balancer, so in this case the browser has to select the partition when calling the API. Not an ideal solution.
2) Yes, you could do this, i.e. introduce something else in between that queues messages for you. There is nothing that says a Service Bus or a database is more reliable than a stateful service with a reliable queue; it's just up to you to go for what you are most comfortable with. I would go for a stateful service, just so I can easily keep everything within my SF application. But again, this is not 100% protection from the disgruntled janitor with scissors; for that you still need clients that can handle faults.
3) Make sure you have a way of handling the errors (retry) and logging or storing the messages that fail (after retries) with the client (Service 1).
3.a) One way would be to store them locally on the node the client is running on and periodically (in RunAsync, for instance) try to re-run those failed messages. This might be dangerous in the scenario where that node is completely nuked and loses its data, though; that data won't be replicated.
3.b) Another would be to use semantic logging with ETW and include enough data in the events to be able to re-create the message from the log, and build some feature, a manual UI perhaps, where you can re-run it from the logged information. Much like you would retry a failed message on an error queue in a service bus.
3.c) Store the failed messages in anything else (database, service bus, queue) that doesn't fail for the same reasons your communication with Service 2 does.
My main point here (and I could maybe have started with that) is that there are plenty of scenarios where only the client knows enough to handle the situation. So, make sure you have a strategy for handling faults in your clients.
