What can cause a Cloud Run instance to not be reused despite continuous load?

What can cause a Cloud Run instance to not be reused despite continuous load? - spring-boot

Context:
My Spring-Boot app runs as expected on Cloud Run when I deploy it with max-instances set to 1: It receives a constant stream of pubsub messages via push, and makes anywhere from 0 to 5 writes to an associated CloudSQL instance, depending on the message payload. Typically it handles between 20 and 40 messages per second. Latency/response-time varies between 50ms and 60sec, probably due to some resource contention.
In order to increase throughput/ decrease resource contention, I'm looking to experiment with the connection pool size per app-instance, as well as the concurrency and max-instances parameters for my cloud run app.
I understand that due to Spring-Boot, my app has a relatively high cold-start time of about 30-40 seconds. This is acceptable for how this service is used.
Problem:
I'm experiencing problems when deploying a spring-boot app to cloud run with max-instances set to a value greater than 1:
Instances start, handle a single request successfully, and then produce no more logs.
This happens a few times per minute, leading me to believe that instances get started (cold-start), handle a single request, die, and then get started again. They are not being reused as described in the docs, and as is happening when I set max-instances to 1. Official docs on concurrency
Instead, I expect 3 container instances to be started, which then each requests according to max-concurrency setting.
Billable container time at max-instances=3:
As shown in the graph, the number of instances is fluctuating wildly, once the new revision with max-instances=3 is deployed.
The graphs for CPU- and memory-usage also look like this.
There are no error logs. As before at max-instaces=1, there are warnings indicating that there are not enough instances available to handle requests (HTTP 429).
Connection Limit of CloudSQL instance has not been exceeded
Requests are handled at less than 10/s
Finally, this is the command used to deploy:
gcloud beta run deploy my-service --project=[...] --image=[...] --add-cloudsql-instances=[...] --region=[...] --platform=managed --memory=1Gi --max-instances=3 --concurrency=3 --no-allow-unauthenticated
What could cause this behavior?

Some month ago, in private Alpha, I performed tests and I observed the same behavior. After discussion with Google team, I understood that instances are over provisioned "in case of": an instances crashes, an instances is preempted, the traffic suddenly increase,...
The trade-off of this is that you will have more cold start that your max instances values. Worse, you will be charged for this over provisioned cold start -> this is not an issue because Cloud Run has a huge free tier that covers this kind of glitches.
Going deeper in the logs (you can do it by creating a sink of Cloud Run logs into BigQuery and then by requesting them), even if there is more instances up than your max instances, only your max instances are active in the same time. I'm not sure to be clear. With your parameters, that means, if you have 5 instances up in the same time, only 3 serve the traffic at the same point of time
This part is not documented because it evolves constantly for find the best balance between over-provisioning and lack of ressources (and 429 errors).
#Steren #AhmetB can you confirm or correct me?

When Cloud Run receives and processes requests rapidly, it predicts how many instances it needs, and will try to scale to the amount. If a sudden burst of requests occur, Cloud Run will instantiate a larger number of instances as a response. This is done in order to adapt to a possible higher number of network requests beyond what it is currently serving, with attempts to take into consideration the length of time it will take for the existing instance to complete loading the request. Per the documentation, it is possible that the amount of container instances can go above the max instance value when it spikes.
You mentioned with max-instances set to 1 it was running fine, but later you mentioned it was in fact producing 429s with it set to 1 as well. Seeing behavior of 429s as well as the instances spiking could indicate that the amount of traffic is not being handled fluidly.
It is also worth noting, because of the cold start time you mention, when instances are serving the first request(s), by design, the number of concurrent requests is actually hard set to 1. Once things are fully ready,only then the concurrency setting you have chosen is applied.
Was there some specific reason you chose 3 and 3 for Max Instance settings and concurrency? Also how was the concurrency set when you had max instance set to 1? Perhaps you could try tinkering up further the concurrency (max 80) and /or Max instances (high limit up to 1000) and see if that removes the 429s.

Related

Concurrency within a single lambda#edge instance?

My understanding is that a single lambda#edge instance can only handle one request at a time, and AWS will spin up new instances if all existing instances are serving a request.
My lambda has a heavy instance startup cost (~2 seconds) but a very light execution cost. It triggers on viewer requests, which always come in batches of ~20 (loading a single-page application). This means one user loading the app, on a cold start, will start ~20 lambda instances and take ~2 seconds.
But due to the very light execution cost, a single lambda instance could handle all 20 requests and it would still take only ~2 seconds.
An extra advantage is, since each instance connects to a 3rd party service on startup, there would be only 1 open connection instead of 20.
Is this possible?

Lambda#edge doesn’t support reserved nor provisioned concurrency.
Here is the link to the documentation for reference: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/edge-functions-restrictions.html#lambda-at-edge-function-restrictions
That being said, with 2s cold start, you might consider using standard lambda.
Also, can’t you reduce that cold start somehow?

Validate newly created server support the same load

We are creating a new hosted server for one of our APIs on managed containers (Kubernetes) and we're trying to validate that it can handle at least the same amount of traffic load requests.
We've started with one of the APIs, where we would need to handle at least 140k requests per minute, all endpoints combined.
To verify this, I created a simple JMeter test as follows:
-Test Plan
---Thread Group Endpoint1
-----HTTP Request -> a GET request with query params for /path1
---Thread Group Endpoint2
-----HTTP Request -> a GET request with query params for /path2
For a local test, I used the following setup:
Thread Groups Endpoint1 and Endpoint2 are set to 200 threads (users), ramp-up period of 1s, loop count = forever and duration 60s.
Using a Summary Report listener when running the test gets me a total of ~9300 # Samples.
Using this approach, is it safe to just increase the number of threads (users) for the Thread Groups until I reach the desired 140k requests per minute?
Note: I only used JMeter a little before, so I'm aware that the entire approach may be wrong, therefore any suggestions and steering to the right path are more than welcomed.

Your approach is viable as long as it represents real-life application usage. If it has 2 endpoints with equally/evenly distributed load - your setup is just fine. If there are more endpoints and some of them are used more than the others - consider defining the workload correspondingly either using different Thread Groups or other distribution mechanism such as Throughput Controller
Increasing the number of threads is also fine, however consider increasing the load gradually, to wit increase ramp-up time so your test could have:
Arrivals phase
Time to hold the load
Ramp-down phase
This way you will be able to correlate various metrics like increasing response time, throughput, number of errors, etc. with the increasing load. Also you will be able to state what was the number of threads/requests per second when the system reached saturation point/breaking point and does it recover when the load gets back.
Also make sure you're following JMeter Best Practices as 2300/2500 requests per second is not something JMeter can support out of the box and you will need to do some tuning, at least increase JVM Heap size allocated to JMeter.

You may not be able to achieve the desired 140k requests per minute using a single Jmeter Machine, in that case you'll need Distributed Load Testing approach here.
refer: http://jmeter.apache.org/usermanual/jmeter_distributed_testing_step_by_step.html
Also keeping the ramp-up period of 1 second will lead to spike and unrealistic load in the system which will not give proper result unless you've pre-warmed your server, you should gradually increase the load as per real/estimated traffic pattern.

How to keep webserver responsive while executing many asynchronous background tasks

I am working on a web application that provides its users to optionally execute long-running processes 'in background'. An example would be some long-running report generation, or deleting thousands of objects simultaneously.
I've implemented this using an ExecutorService defined as FixedThreadPool using a ThreadFactory. The ThreadFactory is built like this:
ThreadFactoryBuilder()
.setNameFormat(clientId + "-BackgroundTask-%d")
.setDaemon(true)
.setPriority(Thread.MIN_PRIORITY)
.build()
I execute the task like this:
Future<TaskStatus> future = clientExecutors.get(clientId).submit(
backgroundTask::execute);
taskFutures.put(backgroundTask.getTaskId(), future);
How can I enforce my webserver to always priorize handling new incoming requests (as fast as possible) over executing background tasks?
In other words: It should never ever happen, that a user has to wait long time while browsing the site, just because there are a lot of background-tasks executing. As you can see from above, I tried to do this by setting .setPriority(Thread.MIN_PRIORITY). However that does not seem to be sufficient.
Furthermore, as for now, I've set some arbitrary value for the FixedThreadPool size (10) and use it globally for the entire background-handling of the application (and all its customers).
Instead I would like to define a threadpool for each customer, to make sure each customer has the same privilege to run a certain amount of tasks in the background. Say, each customer has a FixedThreadPool of size 5, and on the server I'll have a max. of 50 different customers. That would add up to 250 running background tasks at the same time.
The most important requirement here is: it does not matter, how long these background-tasks need to execute (say 2 minutes, or 20 minutes). What is important, is that each customer has the ability to send 5 tasks to be executed in background, and each of those are worked on equally.
I've tested running 30 cpu-intensive background tasks and it turns out that while these are running and cpu is near 100%, new incoming requests take a very long time to be handled.
So obviously, I am doing it wrong.
Update 12.09.2017
I've read about microservices and while it sounds great I see a great challenge in splitting the necessary parts from our monolithic application. Mostly because nearly every operation might turn into a long running process given a big enough data selection.
Furthermore, wouldn't I run into the same problem with my microservice, i.e. the server running the microservice would suffer the same performance degradation. Well the only good thing would, that the rest of the web app would not suffer from it anymore.
I've read some posts about introducing Thread.sleep(1) or Thread.sleep in general into CPU-heavy operations to reduce the amount of CPU used in these operations. I've also read about someone who introduced this as an aspect so that he can even change the amount of time waited dynamically in order to have some control about how much cpu would be used.
However, my gut tells me that ain't right either. What do you think about introducing Thread.sleep to lower the amount of CPU used for a task? Is this common practice? If not, what would be the right approach?

I would highly consider changing your system architecture to offload these long-running requests to a separate instance instead of running them in-process with the general request-service application. In general I think it is an anti-pattern to handle both batch / online (or long / short running) processing in the same application instance.
Ideally you'd build a standalone microservice to handle these requests, but you could also simply just deploy X instances of your existing application, and configure your load balancer to route requests to the long running invocation paths (e.g. POST /myapp/longrunningjob) only to the instances dedicated to running these long-running processes.

Occasional slow requests on Heroku

We are seeing inconsistent performance on Heroku that is unrelated to the recent unicorn/intelligent routing issue.
This is an example of a request which normally takes ~150ms (and 19 out of 20 times that is how long it takes). You can see that on this request it took about 4 seconds, or between 1 and 2 orders of magnitude longer.
Some things to note:
the database was not the bottleneck, and it spent only 25ms doing db queries
we have more than sufficient dynos, so I don't think this was the bottleneck (20 double dynos running unicorn with 5 workers each, we get only 1000 requests per minute, avg response time of 150ms, which means we should be able to serve (60 / 0.150) * 20 * 5 = 40,000 requests per minute. In other words we had 40x the capacity on dynos when this measurement was taken.
So I'm wondering what could cause these occasional slow requests. As I mentioned, anecdotally it seems to happen in about 1 in 20 requests. The only thing I can think of is there is a noisy neighbor problem on the boxes, or the routing layer has inconsistent performance. If anyone has additional info or ideas I would be curious. Thank you.

I have been chasing a similar problem myself, with not much luck so far.
I suppose the first order of business would to be to recommend NewRelic. It may have some more info for you on these cases.
Second, I suggest you look at queue times: how long your request was queued. Look at NewRelic for this, or do it yourself with the "start time" HTTP header that Heroku adds to your incoming request (just print now() minus "start time" as your queue time).
When those failed me in my case, I tried coming up with things that could go wrong, and here's a (unorthodox? weird?) list:
1) DNS -- are you making any DNS calls in your view? These can take a while. Even DNS requests for resolving DB host names, Redis host names, external service providers, etc.
2) Log performance -- Heroku collects all your stdout using their "Logplex", which it then drains to your own defined logdrains, services such as Papertrail, etc. There is no documentation on the performance of this, and writes to stdout from your process could block, theoretically, for periods while Heroku is flushing any buffers it might have there.
3) Getting a DB connection -- not sure which framework you are using, but maybe you have a connection pool that you are getting DB connections from, and that took time? It won't show up as query time, it'll be blocking time for your process.
4) Dyno performance -- Heroku has an add-on feature that will print, every few seconds, some server metrics (load avg, memory) to stdout. I used Graphite to graph those and look for correlation between the metrics and times where I saw increased instances of "sporadic slow requests". It didn't help me, but might help you :)
Do let us know what you come up with.

does testing a website through JMeter actually overload the main server

I am using to test my web server https://buyandbrag.in .
I have tested it for 100 users. But the main server is not showing like it is crowded or not.
I want to know whether it is really pressuring the main server(a cloud server I am using).Or just use the client resourse where the tool is installed.

Yes as mentioned you should be monitoring both servers to see how they handle the load. The simplest way to do this is with TOP (if your server OS is *NIX) also you should be watching the network activity i.e. Bandwidth, connection status (time wait, close wait and so on).
Also if your using apache keep an eye on the logs you should see the requests being logged there
Good luck with the tests

I want to know "how many users my website can handele ?",when I tested with 50 threads ,the cpu usage of my server increased but not the connections log(It showed just 2 connections).also the bandwidth usage is not that much
Firstly what connections are you referring to? Apache, DB etc?
Secondly if you want to see how many users your current setup can hand you need to create a profile or traffic model of what an average user will do on your site.
For example:
Say 90% of the time they will search for something
5% of the time they will purchase x
5% of the time they login.
Once you have your "Traffic Model" defined, implement it in jMeter then start increasing your load in increments i.e. running your load test for 10mins with x users, after 10mins increment that number and so on until you find your breaking point.
If you graph your responses you should see two main things:
1) The optimum response time / number of users before the service degrades
2) The tipping point i.e. at what point you start returning 503's etc
Now you'll have enough data to scale your site or to start making performance improvements from a code point of view.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio