Disallow queuing of requests in gRPC microservices - python-asyncio

We have gRPC pods running in a k8s cluster. The service mesh we use is linkerd. Our gRPC microservices are written in Python (using asyncio gRPC as the concurrency mechanism), with the exception of the entry point. That microservice is written in golang (using the gin framework). We have an AWS API GW that talks to an NLB in front of the golang service. The golang service communicates with the backend via NodePort services.
Requests on our gRPC Python microservices can take a while to complete. Average is 8s, up to 25s in the 99th %ile. In order to handle the load from clients, we've horizontally scaled, and spawned more pods to handle concurrent requests.
When we send multiple requests to the system, even sequentially, we sometimes notice that requests go to the same pod as an ongoing request. What can happen is that this new request ends up getting "queued" in the server-side (not fully "queued", some progress gets made when context switches happen). The issue with queueing like this is that:
The earlier requests can start getting starved, and eventually timeout (we have a hard 30s cap from API GW).
The newer requests may also not get handled on time, and as a result get starved.
The symptom we're noticing is 504s which are expected from our hard 30s cap.
What's strange is that we have other pods available, but for some reason the loadbalancer isn't routing it to those pods smartly. It's possible that linkerd's smarter load balancing doesn't work well for our high latency situation (we need to look into this further, however that will require a big overhaul to our system).
One thing I wanted to try doing is to stop this queuing up of requests. I want the service to immediately reject the request if one is already in progress, and have the client (meaning the golang service) retry. The client retry will hopefully hit a different pod (do let me know if that won't happen). In order to do this, I set "maximum_concurrent_rpcs" to 1 on the server side (Python server). When I sent multiple requests in parallel to the system, I didn't see any RESOURCE_EXHAUSTED exceptions (even when there is only 1 server pod). What I do notice is that the requests are no longer happening in parallel on the server; they happen sequentially (I think that's a step in the right direction, since the first request doesn't get starved). That being said, I'm not seeing the RESOURCE_EXHAUSTED error in golang. I do see a delay between the entry time in the golang client and the entry time in the Python service. My guess is that the queuing is now happening client-side (or potentially still server-side, but it's not visible to me)?
I then saw online that requests may get queued up on the client side as a default behavior in HTTP/2. I tried to test this out in a custom Python client that mimics the golang one with:
import grpc

channel = grpc.insecure_channel(
    "<some address>",
    options=[("grpc.max_concurrent_streams", 1)],
)
# create stub to server with channel…
However, I'm not seeing any change here either. (Note, this is a test dummy client - eventually I'll need to make this run in golang. Any help there would be appreciated as well.)
How can I get the desired effect here? Meaning server sends resource exhausted if already handling a request, golang client retries, and it hits a different pod?
Any other advice on how to fix this issue? I'm grasping at straws here.
Thank you!
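The behaviour asked for above (reject immediately while a request is in flight, then let the caller retry elsewhere) can be illustrated with a minimal asyncio-only simulation. This is a sketch under stated assumptions, not gRPC code: `Pod`, `Busy` and `call_with_retry` are hypothetical names, `Busy` merely stands in for a RESOURCE_EXHAUSTED status, and the retry walks a fixed list of pods, which a real load balancer does not guarantee.

```python
import asyncio
from typing import List


class Busy(Exception):
    """Hypothetical stand-in for a RESOURCE_EXHAUSTED status."""


class Pod:
    """Simulated server that rejects a second request instead of queuing it."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.in_flight = False

    async def handle(self, payload: str, work_seconds: float = 0.05) -> str:
        if self.in_flight:
            # No await happens before this check, so two requests cannot both pass it.
            raise Busy(self.name)
        self.in_flight = True
        try:
            await asyncio.sleep(work_seconds)  # placeholder for the slow request body
            return f"{payload} handled by {self.name}"
        finally:
            self.in_flight = False


async def call_with_retry(pods: List[Pod], payload: str) -> str:
    """Try each pod in turn and move on as soon as one reports Busy."""
    for pod in pods:
        try:
            return await pod.handle(payload)
        except Busy:
            continue
    raise Busy("all pods busy")


async def demo() -> List[str]:
    pods = [Pod("pod-a"), Pod("pod-b")]
    # Two requests issued at the same time, both preferring pod-a.
    return await asyncio.gather(
        call_with_retry(pods, "req-1"),
        call_with_retry(pods, "req-2"),
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this simulation the second request is rejected by pod-a at once and completes on pod-b, rather than sharing pod-a's event loop with the first request. Whether a real retry reaches a different pod depends on the load balancer in between, which this sketch does not model.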


Vert.x web and micro services - Health check being starved

We use a very custom framework built on Vert.x to build our k8s micro services. This framework does a lot of the heavy lifting for teams, such as setting up all the endpoints and creating the health check endpoint, among other things.
One issue we see is that some of the micro services will start to "starve out" the health check when they get overloaded. So, as an app takes on heavy traffic, the health check, which runs every 20s, will time out. For these micro services I have checked that the teams are properly setting "blocking" on endpoints that do any sort of blocking calls, like DB reads/writes or downstream API calls.
The health check endpoint, being comprised of quick checks, does not. My understanding is that blocking handlers get pushed off to the work queue and non-blocking handlers stay in the event loop, so my theory is that under heavy strain the event loop is filling up, and by the time it gets to the queued health check it's already past the timeout. I say this because we see the timeout on the Kubernetes side, but our total processing time on the health check, which starts once the handler is called, is quick.
I attempted to alleviate this by pushing the health check into its own Verticle, not quite understanding that you can't have multiple verticles on the same port (that was a misunderstanding on my part in reading the documentation).
So, my question is: What is the correct way to prioritize the health check? Is there a way to push these health checks to the front of the queue, or should we be looking more to some sort of "tuning"?
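The starvation theory above can be reproduced in miniature. The sketch below uses Python's asyncio as an analogy, not Vert.x itself: it measures how long a trivial health check waits behind a handler that either blocks the event loop directly or offloads its blocking call to a thread pool. The names `slow_db_call` and `health_check_latency` are hypothetical.

```python
import asyncio
import time


def slow_db_call(seconds: float) -> None:
    """Hypothetical blocking call (DB read, downstream API, ...)."""
    time.sleep(seconds)


async def health_check_latency(offload: bool, block_seconds: float = 0.3) -> float:
    """Return how long a health check sat in the queue before it ran."""
    loop = asyncio.get_running_loop()

    async def busy_handler() -> None:
        if offload:
            # Blocking work runs on a worker thread; the event loop stays free.
            await loop.run_in_executor(None, slow_db_call, block_seconds)
        else:
            # Blocking work runs on the event loop and stalls everything behind it.
            slow_db_call(block_seconds)

    async def health_check(enqueued_at: float) -> float:
        return time.monotonic() - enqueued_at

    handler_task = asyncio.create_task(busy_handler())
    health_task = asyncio.create_task(health_check(time.monotonic()))
    waited = await health_task
    await handler_task
    return waited


if __name__ == "__main__":
    print("blocking on the loop:", asyncio.run(health_check_latency(offload=False)))
    print("offloaded to a thread:", asyncio.run(health_check_latency(offload=True)))
```

The health check body itself is instant in both runs; only the time spent waiting for the loop differs, which matches the observation that the handler's own processing time looks quick while the caller still times out.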

How to manage microservice failure?

Let's say I have several microservices (REST APIs). The problem is this: if one service is not accessible (let's call it service "A"), the data that was being sent to service "A" should be saved in a temporary database, and once the service works again, the data should be sent again.
1. Should I create a service that pings service "A" every 10 seconds to know whether it works or not? Or is it possible to do it with a task queue? Any suggestions?
Polling is a waste of bandwidth. You want to use a transactional queue.
Throw all your outbound messages in the queue, and have some other process to handle the messages.
How this will work is - after your process reads from the queue, and tries to send to the REST service:
If it works, commit the transaction (for the queue)
If it doesn't work, don't commit. Start a delay (minutes, seconds - you know best) until you read from the queue again.
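The commit-or-leave-it steps above can be sketched with a small SQLite-backed outbox. This is a minimal illustration under assumptions: the table name `outbox`, the helper names and the `send` callback (returning True on success) are all hypothetical, and a real deployment would use a durable database or a queue product rather than an in-memory table.

```python
import sqlite3
from typing import Callable


def open_outbox() -> sqlite3.Connection:
    """Create an in-memory outbox table (use a file path for durability)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, body: str) -> None:
    """Throw an outbound message in the queue."""
    conn.execute("INSERT INTO outbox (body) VALUES (?)", (body,))
    conn.commit()


def deliver_next(conn: sqlite3.Connection, send: Callable[[str], bool]) -> bool:
    """Read the oldest message and remove it only if send() succeeds."""
    row = conn.execute("SELECT id, body FROM outbox ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return False
    msg_id, body = row
    # The DELETE opens a transaction that is kept only on success.
    conn.execute("DELETE FROM outbox WHERE id = ?", (msg_id,))
    if send(body):
        conn.commit()      # it worked: the message leaves the queue
        return True
    conn.rollback()        # it did not work: the message stays for a later attempt
    return False


def pending(conn: sqlite3.Connection) -> int:
    """Number of messages still waiting to be delivered."""
    return conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
```

A background loop would call `deliver_next` repeatedly and sleep for the chosen delay whenever it returns False.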
You can use the Circuit Breaker pattern, e.g. the Hystrix circuit breaker from Netflix.
It is possible to open the circuit breaker based on a timeout, or when a service call fails or the service is inaccessible.
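To make the pattern concrete, here is a minimal hand-rolled circuit breaker. It is a sketch of the general idea, not Hystrix's API: the class, its parameters (`max_failures`, `reset_after`) and the injectable `clock` are assumptions chosen for illustration.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[..., Any], *args: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Half-open: allow one trial call; a single failure reopens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

While the circuit is open, callers fail fast instead of waiting on a service that is known to be down, which is the moment to park the data in the temporary store.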
There are multiple dimensions to your question. First you want to consider using an infrastructure that provides resilience and self healing. Meaning you want to deploy a cluster of containers, all containing your Service A. Now you use a load balancer or API gateway in front of your service to distribute calls/load. It will also periodically check for the health of your service. When it detects a container does not respond correctly it can kill the container and start another one. This can be provided by a container infrastructure such as kubernetes / docker swarm etc.
Now this does not protect you from losing any requests. In the event that a container malfunctions there will still be a short time between the failure and the next health check where requests may not be served. In many applications this is acceptable and the client side will just re-request and hit another (healthy) container. If your application requires absolutely not losing requests you will have to cache the request in, for example, an API gateway and make sure it is kept until a service has completed it (also called Circuit Breaker). An example technology would be Netflix Zuul with Hystrix. Using such a gatekeeper with built-in fault tolerance can increase the resiliency even further. As a side note, using an API gateway can also solve issues with central authentication/authorization, routing and monitoring.
Another approach to add resilience / decouple is to use a fast streaming / message queue, such as Apache Kafka, for recording all incoming messages and have a message processor process them whenever ready. The trick then is to only mark the messages as processed when your request was served fully. This can also help in scenarios where faults can occur due to large number of requests that cannot be handled in real time by the Service (Asynchronous Decoupling with Cache).
Service "A" should fire a "ready" event when it becomes available. Just listen to that and resend your request.

Send a message from one microservice to another in Azure Service Fabric (APIs)

What is the best architecture, using Service Fabric, to guarantee that the message I need to send from Service 1 (mostly API) to Service 2 (mostly API) never gets lost (black arrow)?
1.a. Make service 1 and 2 stateful services. Is it a bad call to have a stateful Web API?
1.b. Use Reliable Collections to send the message from API code to Service 2.
2.a. Make Service 1 and 2 stateless services
2.b. Add a third service
2.c. Send the message over a queuing system (i.e.: Service Bus) from service 1
2.d. To be picked up by the third service. Notice: this third service would also have access to the DB that service 2 (API) has access to. Not an ideal solution for a microservice architecture, right?
3.a. Any other ideas?
Keep in mind that the goal is to never lose the message, not even when service 2 is completely down or temporarily removed… so no direct calls.
I'd introduce a third (Stateful) service that holds a queue, 'service 3'.
Service 1 would enqueue the message. Service 3 would run an infinite loop, trying to deliver the message to service 2.
You could use the pub/sub package for this. Service 1 is the publisher, Service 2 is the subscriber.
(If you rely on an external queue system like Service Bus, you'll lower the overall availability of the system. Service Bus downtime would lead to messages being undeliverable.)
I think that there is never a solution that is 100% sure to never lose a message between two parties. Even if you had a service bus, for instance, in between two services, there is always the chance (possibly very small, but never zero) that the service bus goes down, or that the communication to the service bus goes down. With that being said, there are of course models that only very seldom lose a message, but you can't completely get around the fact that you still have to handle errors in the client.
In fact, Service Fabric fault handling is mainly designed around clients retrying communication, rather than having the service or an intermediary do that. There are many reasons for this (I guess) but one is the nature of distributed, replicated, reliable services. If a service primary goes down, a replica picks up the responsibility, but it won't know what the primary was doing right at the moment it died (unless it replicated over its state, but it might have died even before that). The only one that really knows what it wants to do in this scenario is the client. The client knows what it is doing and can react to different fault scenarios in the service. In Fabric Transport, most known exceptions that could "naturally" occur, such as the service dying or the network cable being cut off by the janitor, are actually retried automatically. This includes re-resolving the address just in case the service primary was replaced with a secondary.
The same actually goes for a scenario where you introduce a third service or a service bus. What if the network goes down before the message has completely reached the service? In this case only the client knows that something went wrong and what it intended to send. What if it goes down after it reached the service but before the response was sent? In this case the client has to assume the message never reached and try to resend it. This is also why service methods are recommended to be idempotent - the same call can be made a number of times by the same client.
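One common way to make a service method idempotent is to have the client attach a request id and let the service remember the result per id. The sketch below is a minimal in-memory illustration, not Service Fabric code: `AccountService`, `deposit` and the `request_id` parameter are hypothetical, and a real stateful service would keep the id-to-result map in replicated storage.

```python
from typing import Dict


class AccountService:
    """Hypothetical service whose deposit() is safe to retry."""

    def __init__(self) -> None:
        self.balance = 0
        self._results: Dict[str, int] = {}  # request id -> balance returned earlier

    def deposit(self, request_id: str, amount: int) -> int:
        # A repeated request id means the client is retrying: replay the old answer.
        if request_id in self._results:
            return self._results[request_id]
        self.balance += amount
        self._results[request_id] = self.balance
        return self.balance


if __name__ == "__main__":
    service = AccountService()
    service.deposit("req-1", 10)
    service.deposit("req-1", 10)  # retry after a lost response; applied only once
    print(service.balance)
```

With this in place the client can resend after any ambiguous failure without caring whether the first attempt actually reached the service.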
Even if you were to introduce a secondary part, like the service bus, there is still the same risk that the service bus goes down, or more likely, the network connecting to the service bus goes down. So, client needs to retry, and when it has retried a number of times, all it can do is put the message in a queue of failed messages or simply just log it, or throw an exception back to the original caller (in your scenario, the browser).
OK, that was me being pessimistic. But it could happen. All of the things above could; it's just that some are not very likely to happen. But they might happen.
On to your questions:
1) the problem with making a stateless service stateful is that you now have to handle partitions in your caller. You can put up Http listeners for stateful services, but you have to include the partition and replica information in the Uri, and that won't work with the load balancer, so in this case the browser has to select partition when calling the API. Not an ideal solution.
2) yes, you could do this, i.e. introduce something else in between that queues messages for you. There is nothing that says that a Service Bus or a Database is more reliable than a Stateful service with a reliable queue there, it's just up to you to go for what you are most comfortable with. I would go for a Stateful service, just so I can easily keep everything within my SF application. But again, this is not 100% protection from disgruntled janitor with scissors, for that you still need clients that can handle faults.
3) make sure you have a way of handling the errors (retry) and logging or storing the messages that fail (after retries) with the client (Service 1).
3.a) One way would be to have it store them locally on the node it is running on and periodically (in RunAsync for instance) try to re-run those failed messages. This might be dangerous in the scenario where the node it is running on is completely nuked and loses its data though; that data won't be replicated.
3.b) Another would be to use semantic logging with ETW and include enough data in the events to be able to re-create the message from the logged data, and build some feature, a manual UI perhaps, where you can re-run it from the logged information. Much like you would retry a failed message on an error queue in a service bus.
3.c) Store the failed messages in anything else (database, service bus, queue) that doesn't fail for the same reasons as your communication with Service 2.
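Point 3 and its variants share one shape: retry a few times, then park the message somewhere for later. A minimal sketch of that shape follows; `send_with_retry`, its parameters and the plain list used as the failed-message store are assumptions for illustration, and the store would in practice be one of the options listed above.

```python
import time
from typing import Callable, List


def send_with_retry(
    send: Callable[[str], None],
    message: str,
    failed_store: List[str],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to send; back off between attempts; park the message if all fail."""
    for attempt in range(attempts):
        try:
            send(message)
            return True
        except Exception:
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** attempt)
    # Retries exhausted: keep the message so it can be re-run later.
    failed_store.append(message)
    return False
```

The `sleep` parameter is injectable only so the backoff can be exercised without real waiting.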
My main point here (and I could maybe have started with that) is that there are plenty of scenarios where only the client knows enough to handle the situation. So, make sure you have a strategy for handling faults in your clients.

Web server and ZeroMQ patterns

I am running an Apache server that receives HTTP requests and connects to a daemon script over ZeroMQ. The script implements the Multithreaded Server pattern (http://zguide.zeromq.org/page:all#header-73), it successfully receives the request and dispatches it to one of its worker threads, performs the action, responds back to the server, and the server responds back to the client. Everything is done synchronously as the client needs to receive a success or failure response to its request.
As the number of users is growing into a few thousand, I am looking into potentially improving this. The first thing I looked at is the different patterns of ZeroMQ, and whether what I am using is optimal for my scenario. I've read the guide but I find it challenging to understand all the details and differences across patterns. I was looking for example at the Load Balancing Message Broker pattern (http://zguide.zeromq.org/page:all#header-73). It seems quite a bit more complicated to implement than what I am currently using, and if I understand things correctly, its advantages are:
Actual load balancing vs the round-robin task distribution that I currently have
Asynchronous requests/replies
Is that everything? Am I missing something? Given the description of my problem, and the synchronous requirement of it, what would you say is the best pattern to use? Lastly, how would the answer change, if I want to make my setup distributed (i.e. having the Apache server load balance the requests across different machines). I was thinking of doing that by simply creating yet another layer, based on the Multithreaded Server pattern, and have that layer bridge the communication between the web server and my workers.
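The difference between round-robin distribution and handing work to whichever worker is ready can be shown with a small stdlib-only simulation. It does not use ZeroMQ; the function names and the task durations are made up for illustration, and it only compares when the last task finishes under each policy.

```python
import heapq
from typing import List


def round_robin_finish_time(durations: List[float], workers: int) -> float:
    """Tasks are dealt out in turn, regardless of how busy each worker is."""
    busy_until = [0.0] * workers
    for i, duration in enumerate(durations):
        busy_until[i % workers] += duration
    return max(busy_until)


def ready_worker_finish_time(durations: List[float], workers: int) -> float:
    """Each task goes to the worker that becomes free first."""
    free_at = [0.0] * workers  # min-heap of the times at which workers become free
    for duration in durations:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + duration)
    return max(free_at)


if __name__ == "__main__":
    tasks = [8.0, 1.0, 8.0, 1.0, 8.0, 1.0]  # alternating slow and fast requests
    print(round_robin_finish_time(tasks, workers=2))   # 24.0
    print(ready_worker_finish_time(tasks, workers=2))  # 17.0
```

With uneven task durations, round-robin can pile all the slow tasks onto one worker, while the ready-worker policy keeps both workers occupied.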
Some thoughts about the subject...
Keep it simple
I would try to keep things simple and "plain" ZeroMQ as long as possible. To increase performance, I would simply change your backend script to send requests out from a dealer socket and move the request handling code to its own program. Then you could just run multiple worker servers on different machines to get more requests handled.
I assume this was the approach you took:
I was thinking of doing that by simply creating yet another layer, based on the Multithreaded Server pattern, and have that layer bridge the communication between the web server and my workers.
The only problem here is that there is no request retry in the backend. If a worker fails to handle a given task, it is lost forever. However, one could write the worker servers so that they handle all the requests they have received before shutting down. With this kind of setup it is possible to update backend workers without clients noticing any outages. This will not save requests that get lost if the server crashes.
I have the feeling that in common scenarios this kind of approach would be more than enough.
Mongrel2 seems to handle quite a few of the things you have already implemented. It might be worthwhile to check it out. It probably does not completely solve your problems, but it provides tested infrastructure to distribute the workload. This could be used to deliver the requests to multithreaded servers running on different machines.
One solution to increase the robustness of the setup is a broker. In this scenario the broker's main role would be to provide robustness by implementing a queue for the requests. I understood that all the requests the workers handle are basically of the same type. If requests had different types, the broker could also do lookups to find the correct server for each request.
Using the queue provides a way to ensure that every request is eventually handled even if worker servers crash. This does not come without a price. The broker is itself a single point of failure. If it crashes or is restarted, all messages could be lost.
These problems can be avoided, but it requires quite a lot of work: the requests could be persisted to disk, and servers could be clustered. The need has to be weighed against the payoff. Does one want to spend time writing a message broker or the actual system?
If a message broker seems like a good idea, the time required to implement one can be reduced by using an existing product (like RabbitMQ). The downside is that there could be a lot of unwanted features, and adding new things is not as straightforward as with a self-made broker.
Writing your own broker could amount to reinventing the wheel. Many brokers provide similar things: security, logging, a management interface and so on. It seems likely that these will eventually be needed in a home-made solution too. But if not, then a single home-made broker which does a single thing and does it well can be a good choice.
Even if a broker product is chosen, I think it is a good idea to hide the broker behind a ZeroMQ proxy, a dedicated piece of code that sends/receives messages from the broker. Then no other part of the system has to know anything about the broker and it can be easily replaced.
Using a broker is somewhat heavy on developer time. You either need time to implement the broker or time to get used to some product. I would avoid this route until it is clearly needed.
Some links
Comparison between broker and brokerless

Progress notifications from HTTP/REST service

I'm working on a web application that submits tasks to a master/worker system that farms out the tasks to any of a series of worker instances. The work queue master runs as a separate process (on a separate machine altogether) and tasks are submitted to the master via HTTP/REST requests. Once tasks are submitted to the work queue, client applications can submit another HTTP request to get status information about tasks.
For my web application, I'd like it to provide some sort of progress bar view that gives the user some indication of how far along task processing has come. The obvious way to implement this would be an AJAX progress meter widget that periodically polls the work queue for status on the tasks that have been submitted. My question is, is there a better way to accomplish this without the frequent polling?
I've considered having the client web application open up a server socket on which it could listen for notifications from the work master. Another similar thought I've had is to use XMPP or a similar protocol for the status notifications. (Of course, the master/worker system would need to be updated to provide notifications either way but I own the code for that so can make any necessary updates myself.)
Any thoughts on the best way to set up a notification system like this? Is the extra effort involved worth it, or is the simple polling solution the way to go?
The client keeps polling the server to get the status of the response.
Being really RESTful means cacheable and scaleable.
Not the best responsiveness if you do not want to poll your server too much.
Persistent connection
The server does not close its HTTP connection with the client until the response is complete. The server can send intermediate status through this connection using HTTP multiparts.
Comet is the best-known technique for implementing this behaviour.
Best responsiveness, almost real-time notifications from the server.
The number of connections a web server can hold is limited; keeping a connection open for too long might at best load your server, and at worst open it to Denial of Service attacks.
Client as a server
Make the server post status updates and the response to the client as if it were another RESTful application.
Best of all worlds: no resources are wasted waiting for the response, either on the server or on the client side.
You need a full HTTP server and web application stack on the client
Firewalls and routers with their default "no incoming connections at all" will get in the way.
Feel free to edit to add your thoughts or a new method!
I guess it depends on a few factors
How accurate the feedback can be (1 percent, 5 percent, 50 percent) Accurate feedback makes it worth pursuing some kind of progress bar and comet style push. If you can only say "Busy... hold on... almost there... done" then a simple ajax "are we there yet" poll is certainly easier to code.
How timely the Done message has to be seen by the client
How long each task takes (1 second, 10 seconds, 10 minutes)
1 second makes it a bit moot. 10 seconds makes it worth it. 10 minutes means you're better off suggesting the user goes for a coffee break :-)
How many concurrent requests there will be
Unless you've got a "special" server, live push style systems tend to eat connections and you'll be maxed out pretty quickly. Having to throw more webservers in for a fancy progress bar might hurt the budget.
I've got some sample code on 871184 that shows a hand rolled "forever frame" which seems to work out well. The project I developed that for isn't hammered all that hard though, the operations take a few seconds and we can give pretty accurate percent. The code uses asp.net and jquery, but the general techniques will work with any server and javascript framework.
edit As John points out, status reporting probably isn't the job of the RESTful service. But there's nothing that says you can't open an iframe on the client that hooks to a page on the server that polls the service. Theory says the server and the service will at least be closer to one another :-)
Look into Comet. You make a single request to the server and the server blocks and holds the connection open until an update in status occurs. Once that happens the response is sent and committed. The browser receives this response, handles it and immediately re-requests the same URL. The effect is that of events being pushed to the browser. There are pros and cons and it might not be appropriate for all use cases but would provide the most timely status updates.
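The hold-then-re-request loop described above can be sketched with asyncio alone, leaving out the HTTP layer entirely. `TaskStatus`, `wait_for_change` and the version counter are hypothetical names; in a real server `wait_for_change` would sit behind the long-poll endpoint and the client loop would run in the browser.

```python
import asyncio
from typing import List, Tuple


class TaskStatus:
    """Progress of one task, with a version counter that bumps on every update."""

    def __init__(self) -> None:
        self.version = 0
        self.progress = 0
        self._changed = asyncio.Condition()

    async def update(self, progress: int) -> None:
        async with self._changed:
            self.progress = progress
            self.version += 1
            self._changed.notify_all()

    async def wait_for_change(self, last_seen: int, timeout: float = 30.0) -> Tuple[int, int]:
        """Block until the status is newer than last_seen, or until the timeout."""

        async def _wait() -> Tuple[int, int]:
            async with self._changed:
                await self._changed.wait_for(lambda: self.version > last_seen)
                return self.version, self.progress

        try:
            return await asyncio.wait_for(_wait(), timeout)
        except asyncio.TimeoutError:
            return self.version, self.progress  # nothing new; the client just asks again


async def demo() -> List[int]:
    status = TaskStatus()
    seen: List[int] = []

    async def client() -> None:
        version = 0
        while True:
            # One "request": returns only when there is something new to show.
            version, progress = await status.wait_for_change(version)
            seen.append(progress)
            if progress >= 100:
                return

    async def worker() -> None:
        for progress in (25, 50, 100):
            await asyncio.sleep(0.01)  # placeholder for real work
            await status.update(progress)

    await asyncio.gather(client(), worker())
    return seen


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The client never polls on a timer; each call simply parks until the worker reports progress, which is what gives the push-like effect.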
My opinion is to stick with the polling solution, but you might be interested in this Wikipedia article on HTTP Push technologies.
REST depends on HTTP, which is a request/response protocol. I don't think you're going to get a pure HTTP server calling the client back with status.
Besides, status reporting isn't the job of the service. It's up to the client to decide when, or if, it wants status reported.
One approach I have used is:
When the job is posted to the server, the server responds with a PubNub channel ID (one could alternatively use a similar service such as Google's Pub/Sub).
The client on browser subscribes to that channel and starts listening for messages.
The worker/task server publishes status on that pubnub channel to update the progress.
On receiving messages on the subscribed pubnub-channel, the client updates the web UI.
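The four steps above can be sketched with an in-memory stand-in for the hosted pub/sub service. This is not the PubNub API: `ChannelHub`, `submit_job` and the message shape are assumptions, and in the real setup the subscriber would be browser JavaScript talking to the hosted service.

```python
import uuid
from collections import defaultdict
from typing import Callable, DefaultDict, List


class ChannelHub:
    """In-memory stand-in for a hosted pub/sub service."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, channel: str, on_message: Callable[[dict], None]) -> None:
        self._subscribers[channel].append(on_message)

    def publish(self, channel: str, message: dict) -> int:
        """Deliver to every subscriber of the channel; return how many received it."""
        listeners = self._subscribers.get(channel, [])
        for on_message in listeners:
            on_message(message)
        return len(listeners)


def submit_job() -> str:
    """Step 1: the server accepts the job and answers with a fresh channel id."""
    return f"job-{uuid.uuid4().hex}"


if __name__ == "__main__":
    hub = ChannelHub()
    channel = submit_job()                                   # step 1
    hub.subscribe(channel, lambda m: print("UI update:", m)) # steps 2 and 4
    for pct in (10, 60, 100):                                # step 3: worker publishes
        hub.publish(channel, {"progress": pct})
```

Using one channel per job keeps each browser tab listening only to the task it submitted.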
You could also use a self-refreshing iframe, but an AJAX call is much better. I don't think there is any other way.
PS: If you would open a socket from client, that wouldn't change much - PHP browser would show the page as still "loading", which is not very user-friendly. (assuming you would push or flush buffer to have other things displayed before)
