Background:
We're trying to migrate our API Gateway from REST to gRPC. The API Gateway will be consumed by the Backend Team via REST, and the communication from the API Gateway to the microservices will use gRPC. Our API Gateway is built with the Tornado Python framework and Gunicorn, and uses tornado.curl_httpclient.CurlAsyncHTTPClient to enable async / Future behaviour for each endpoint. Each endpoint calls the microservices using a unary RPC, and the gRPC stub returns a Future.
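For illustration, the call pattern described above looks roughly like the sketch below. This is not the actual project code; the service, message, and handler names (PayloadService, GetPayload, etc.) are hypothetical, and the bridge helper is just one way to await a grpc.Future from Tornado's asyncio loop.

import asyncio

import grpc
import tornado.web

import payload_pb2        # hypothetical generated module
import payload_pb2_grpc   # hypothetical generated module

channel = grpc.insecure_channel("payload-service:50051")
stub = payload_pb2_grpc.PayloadServiceStub(channel)

def grpc_future_to_asyncio(grpc_future):
    # Resolve an asyncio future when the grpc.Future completes; the gRPC
    # callback fires on a gRPC-owned thread, hence call_soon_threadsafe.
    loop = asyncio.get_event_loop()
    aio_future = loop.create_future()

    def _done(f):
        try:
            result = f.result()
        except Exception as exc:
            loop.call_soon_threadsafe(aio_future.set_exception, exc)
        else:
            loop.call_soon_threadsafe(aio_future.set_result, result)

    grpc_future.add_done_callback(_done)
    return aio_future

class PayloadHandler(tornado.web.RequestHandler):
    async def get(self, index):
        # Unary RPC via the stub's future API, awaited on the IOLoop.
        rpc = stub.GetPayload.future(payload_pb2.PayloadRequest(index=int(index)))
        response = await grpc_future_to_asyncio(rpc)
        self.write(response.data)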
So before fully migrating to gRPC, we're trying to compare gRPC vs. REST performance. Here are the details you might need to know:
We have 3 endpoints to test: /0, /1, and /2, each with a single string payload. The payload sizes are 100 KB, 1 MB, and 4 MB. These messages are created when the instance starts, so the endpoints only need to retrieve them.
Concurrency = 1, 4, 10 for each endpoint.
gRPC thread pool max workers = 1 and Gunicorn workers = 16.
We're using APIB for load testing.
All load tests were run on a GCP VM instance. The machine spec is:
Intel Broadwell, n1-standard-1 (1 vCPU, 3.75 GB memory), OS: Debian 9
The code for both versions shares a similar structure and the same business logic.
Here are the results:
The conclusion is: the higher the concurrency and payload size, the slower gRPC becomes, until it is eventually slower than REST.
Question:
Is gRPC, using unary calls, incapable of handling large payloads and high concurrency compared to REST?
Is there any way to make gRPC faster than REST?
Is there anything fundamental that I missed?
Here are a few things I have tried:
GZIP compression from grpcio. Result: it became slower than before.
Using the GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS and GRPC_ARG_KEEPALIVE_TIMEOUT_MS options on the stub and server config. Result: no change in performance.
Changing the gRPC server max workers to 10000. Result: no change in performance.
Changing the Gunicorn workers to 1. Result: no change in performance.
What I haven't tried:
Using Stream RPC
Any help is appreciated. Thank you.
Is gRPC, using unary calls, incapable of handling large payloads and high concurrency compared to REST?
No. The gRPC Python bindings are used in production to serve requests as large as several gigabytes at speed.
Is there any way to make gRPC faster than REST?
I believe your issue is likely this:
Each endpoint calls the microservices using a unary RPC, and the gRPC stub returns a Future.
Each time you use the future API in the Python bindings, a new thread is created to service that request. As you know, Python has a global interpreter lock, so while a process may have many threads, only one of them may access Python objects at any one time. Furthermore, the more threads contending for the GIL, the more slowdown you get from synchronization.
To avoid this, you can either use only the synchronous parts of the gRPC Python API, or you can switch over to the AsyncIO bindings, which were designed to solve exactly this problem.
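For reference, here is a minimal sketch of what the AsyncIO bindings look like for this case. The service, message, and field names are again hypothetical stand-ins for whatever the real .proto defines; only the grpc.aio calls themselves are the point.

import asyncio

import grpc

import payload_pb2        # hypothetical generated module
import payload_pb2_grpc   # hypothetical generated module

async def fetch_payload(index: int):
    # One coroutine per in-flight RPC: no extra thread is spawned and
    # nothing contends for the GIL while the call waits on the network.
    async with grpc.aio.insecure_channel("payload-service:50051") as channel:
        stub = payload_pb2_grpc.PayloadServiceStub(channel)
        request = payload_pb2.PayloadRequest(index=index)
        response = await stub.GetPayload(request)
        return response.data

if __name__ == "__main__":
    print(len(asyncio.run(fetch_payload(0))))

In practice you would create the channel once at startup and reuse it rather than opening one per call, and since Tornado 5+ runs on the asyncio event loop, coroutines like this can be awaited directly from your existing handlers.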
Related
I was working on a project that needs to support a cluster of 30k nodes, and all of those nodes periodically call the API to get data.
I want to achieve the maximum number of concurrent GET operations per second, and since these are GET operations, they must be handled synchronously.
My local PC has 32 GB RAM and 8 cores, the Spring Boot version is 2.6.6, and the configuration is:
server.tomcat.max-connections=10000
server.tomcat.threads.max=800
I use JMeter for the concurrency test; the throughput is around 1k/s and the average response time is 2 seconds.
Is there any way to make it support more requests per second?
Hard to say without details on the web service: what it actually does and where the bottleneck actually is (threads, connections, CPU, memory, or something else). As a general recommendation, using non-blocking APIs would help, but it would then need to be fully non-blocking to make a real difference.
I mean that just adding WebFlux while keeping a blocking DB would not improve things much.
Furthermore, any improvement in execution time would help, so check whether you can improve the code, and maybe have a look at going native (which will come "built in" in Boot 3.x, by the way).
I have a simple REST endpoint that executes a Postgres procedure.
This procedure returns the current state of a device.
For example, with 20 devices:
a client app connects to the API and makes 20 requests to that endpoint every second.
For x clients there are x*20 requests; for 2 clients, 40 requests.
This causes a big CPU load on the Postgres server, but only when there are many clients and/or many devices.
I didn't create it, but I need to redesign it.
How can I limit concurrent queries to the DB just for this endpoint? That would be a hotfix.
My second idea is to create a background worker that executes the queries only one at a time; the endpoint would then fetch the data from memory.
I would try the simple way first. Try to reduce
the number of database connections in the pool, OR
the number of worker threads in the built-in Tomcat.
A more flexible option would be to put the logic behind a thread pool that limits the number of worker threads. This is not trivial if the Spring context and the database are used inside a worker. Take a look at the Spring annotation @Async.
Off-topic: the solution we are discussing here looks like a workaround. On its own it will most probably increase throughput only by a factor of 2, maybe 3. It is not JEE-conformant and will most probably not be very stable. It would be better to refactor the application to avoid this problem altogether. Another option would be to buy a new database server.
Update: a JEE-compliant solution would be to implement some sort of bulkhead pattern. It limits the number of concurrently running requests and rejects new ones once some critical number is reached. The server application answers with "503 Service Unavailable"; the client application catches this status and retries a little later (see "exponential backoff"). A rough sketch of the idea follows below.
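This thread is about Spring Boot, but the bulkhead idea itself is framework-agnostic. Purely as an illustration of the pattern (all names invented, sketched in Python rather than Java), it boils down to capping the number of requests allowed in and shedding the rest with a 503:

import asyncio

import tornado.web

MAX_CONCURRENT = 50                        # the "critical number"
slots = asyncio.Semaphore(MAX_CONCURRENT)

async def query_postgres():
    await asyncio.sleep(0.1)               # stand-in for the expensive procedure call
    return {"state": "ok"}

class DeviceStateHandler(tornado.web.RequestHandler):
    async def get(self):
        if slots.locked():                 # every slot taken -> shed load
            self.set_status(503)
            self.set_header("Retry-After", "1")
            self.finish("Service Unavailable")
            return
        async with slots:                  # occupy one slot for the duration
            state = await query_postgres()
            self.write(state)

def make_app():
    return tornado.web.Application([(r"/devices", DeviceStateHandler)])

The same shape can be expressed in Spring with a servlet filter or a Resilience4j-style bulkhead; the essential parts are the fixed-size slot count, the immediate 503 when it is exhausted, and the client-side retry with backoff.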
I wrote a sample WebFlux application which just reads some data from Redis and does a small CPU job (an MD5 calculation), repeated several times.
It uses Spring Data Redis Reactive:
'org.springframework.boot:spring-boot-starter-data-redis-reactive'
The code that connects to Redis looks like this:
reactiveStringRedisTemplate.opsForValue().get(keyname)
The Redis server runs on localhost:
redis-server &
You can find the full code here:
https://github.com/mouse500/redperf
It's simple code.
For the test, I call the API (/wredis) with JMeter to load test it.
The problem, as I see it, is that this application doesn't reach maximum TPS.
It reaches around 40% CPU on my local PC.
Even though it has more CPU headroom, it doesn't work any harder.
Why doesn't it use the resources fully?
With another method of connecting to Redis (I put a connection proxy, written in Node.js, in between), it showed much, much higher CPU usage and got much more TPS.
So I don't think this is about Redis server performance.
The problem seems to be in "calling Redis from a WebFlux application with Lettuce".
How can I make this sample application reach maximum TPS (CPU at 100%)?
What options can I try?
I have a Spring Boot WebFlux application which uses Netty by default.
One of the business requirements that we have mandates that requests should time out within 2 seconds.
When very few requests are sent to the app everything is fine, but when the request load is increased (like over 40 or 50 concurrent requests per second from JMeter), sometimes all of them time out because each takes longer than the 2-second threshold.
I have spent a long time reading things online and looking into what could be causing this issue, but with no success. When requests are sent concurrently, most end up taking a long time, and the problematic part is where an external HTTP request is made to another microservice. All my tests are local and I have tested the microservices; they seem fast enough to handle a big load, so the microservices themselves are not the issue.
I know that netty uses event loop and does not create a thread per request.
I believe there are likely synchronous tasks that are blocking those few Netty threads. For this reason I have done massive refactoring and now have ".publishOn(Schedulers.boundedElastic())" or ".subscribeOn(Schedulers.boundedElastic())" in the Mono reactive chains. After the refactoring, most of the operations seem to be running on elastic threads and not on "reactor-http-nio-x" (according to the logs), but doing so has not helped the main issue and the problem still exists.
It would be a huge help if someone could point me to what I should be doing. At this point I have no more improvements to make, and I think I might have been looking at this the wrong way and my approach has not been correct.
I have not attached any code since the application is big and I still do not know where the actual problem lies.
I've encountered the same problem. I didn't find the root cause, but when I switched from WebClient to RestTemplate with a dedicated thread pool per client (external service), the problem was solved. I ran BlockHound to find out whether I was blocking somewhere in the stream, but it didn't find anything. I also tried deploying my application with an increased number of NIO worker threads (by default it's equal to the number of cores) and there was some improvement, but in the end RestTemplate yielded the best performance. So I'm still on the WebFlux stack, but I don't use WebClient anymore and the performance under high load is fine.
Currently we are running a Node.js web app using Serverless. The API Gateway uses a single API endpoint for the entire application and routing is handled internally, so it's basically a single HTTP {Any+} endpoint for the entire application.
My questions are:
1. What's the disadvantage of this method? (I know Lambda is built for FaaS, but right now we are handling it as a monolithic function.)
2. How many instances can Lambda run at a time if we follow this method? Can it handle a million+ requests at a single time?
Any help would be appreciated. Thanks!
The disadvantage is, as you say, that it's monolithic, so you haven't modularised your code at all. The idea is that adjusting one function shouldn't affect the rest, but in this case it can.
You can run as many as you like concurrently; you can set limits though (and there are some limits initially for safety which can be removed).
If you are running the function regularly it should also 'warm start' i.e. have a shorter boot time after the first time.