My understanding is that a single Lambda@Edge instance can only handle one request at a time, and AWS will spin up new instances if all existing instances are serving a request.
My lambda has a heavy instance startup cost (~2 seconds) but a very light execution cost. It triggers on viewer requests, which always come in batches of ~20 (loading a single-page application). This means one user loading the app, on a cold start, will start ~20 lambda instances and take ~2 seconds.
But due to the very light execution cost, a single lambda instance could handle all 20 requests and it would still take only ~2 seconds.
An extra advantage is, since each instance connects to a 3rd party service on startup, there would be only 1 open connection instead of 20.
Is this possible?
Lambda@Edge doesn't support reserved or provisioned concurrency.
Here is the link to the documentation for reference: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/edge-functions-restrictions.html#lambda-at-edge-function-restrictions
That being said, with a ~2-second cold start, you might consider using standard Lambda instead.
Also, can’t you reduce that cold start somehow?
Related
Using serverless Functions as a Service (AWS Lambda, GCP Functions), what is the best way to run a timer or interval for some time in the future?
I do not want to keep the instance running idle whilst the timer counts down. The timer will be less than 24 hrs and needs to change dynamically at runtime (it isn't a single, fixed cron schedule).
Google has Cloud Scheduler, but that mimics cron and will not let me set a timer for an arbitrary number of seconds starting from now.
If you're looking for a product that's similar to Cloud Scheduler but lets you schedule a single function invocation for an arbitrary amount of time in the future, you should look at Cloud Tasks. You create a queue, then add a task with an HTTP target and a schedule time at which it should run.
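As an illustration only, here is a minimal sketch using the Cloud Tasks Java client; the project, location, queue name, target URL, payload, and delay are placeholder values you would replace with your own:

import com.google.cloud.tasks.v2.CloudTasksClient;
import com.google.cloud.tasks.v2.HttpMethod;
import com.google.cloud.tasks.v2.HttpRequest;
import com.google.cloud.tasks.v2.QueueName;
import com.google.cloud.tasks.v2.Task;
import com.google.protobuf.ByteString;
import com.google.protobuf.Timestamp;

import java.time.Instant;

public class DelayedInvocation {
    public static void main(String[] args) throws Exception {
        // Placeholder values -- replace with your own project, region, queue, and endpoint.
        String project = "my-project";
        String location = "us-central1";
        String queue = "delayed-invocations";
        String targetUrl = "https://example.com/run-job";
        long delaySeconds = 90; // chosen dynamically at runtime

        try (CloudTasksClient client = CloudTasksClient.create()) {
            Instant runAt = Instant.now().plusSeconds(delaySeconds);

            Task task = Task.newBuilder()
                    .setHttpRequest(HttpRequest.newBuilder()
                            .setUrl(targetUrl)
                            .setHttpMethod(HttpMethod.POST)
                            .setBody(ByteString.copyFromUtf8("{\"jobId\":42}"))
                            .build())
                    // The queue delivers the task at (roughly) this time.
                    .setScheduleTime(Timestamp.newBuilder()
                            .setSeconds(runAt.getEpochSecond())
                            .build())
                    .build();

            Task created = client.createTask(QueueName.of(project, location, queue), task);
            System.out.println("Scheduled: " + created.getName());
        }
    }
}

The queue itself is created once (for example with gcloud tasks queues create), and the HTTP target can be a Cloud Function, Cloud Run service, or any reachable endpoint.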
Context:
My Spring Boot app runs as expected on Cloud Run when I deploy it with max-instances set to 1: it receives a constant stream of Pub/Sub messages via push and makes anywhere from 0 to 5 writes to an associated Cloud SQL instance, depending on the message payload. Typically it handles between 20 and 40 messages per second. Latency/response time varies between 50 ms and 60 s, probably due to some resource contention.
In order to increase throughput/ decrease resource contention, I'm looking to experiment with the connection pool size per app-instance, as well as the concurrency and max-instances parameters for my cloud run app.
I understand that due to Spring-Boot, my app has a relatively high cold-start time of about 30-40 seconds. This is acceptable for how this service is used.
Problem:
I'm experiencing problems when deploying a Spring Boot app to Cloud Run with max-instances set to a value greater than 1:
Instances start, handle a single request successfully, and then produce no more logs.
This happens a few times per minute, leading me to believe that instances get started (cold start), handle a single request, die, and then get started again. They are not being reused as described in the official docs on concurrency, and as happens when I set max-instances to 1.
Instead, I expect 3 container instances to be started, each of which then handles requests according to the concurrency setting.
Billable container time at max-instances=3:
As shown in the graph, the number of instances fluctuates wildly once the new revision with max-instances=3 is deployed.
The graphs for CPU- and memory-usage also look like this.
There are no error logs. As before at max-instances=1, there are warnings indicating that not enough instances are available to handle requests (HTTP 429).
The connection limit of the Cloud SQL instance has not been exceeded.
Requests are handled at less than 10/s.
Finally, this is the command used to deploy:
gcloud beta run deploy my-service --project=[...] --image=[...] --add-cloudsql-instances=[...] --region=[...] --platform=managed --memory=1Gi --max-instances=3 --concurrency=3 --no-allow-unauthenticated
What could cause this behavior?
Some months ago, during the private alpha, I ran tests and observed the same behavior. After discussing it with the Google team, I understood that instances are over-provisioned "just in case": an instance crashes, an instance is preempted, the traffic suddenly increases, and so on.
The trade-off is that you will see more cold starts than your max-instances value would suggest. Worse, you are billed for these over-provisioned cold starts; in practice this is not much of an issue, because Cloud Run's large free tier covers this kind of glitch.
Digging deeper into the logs (you can do this by creating a sink of the Cloud Run logs into BigQuery and then querying them), even if more instances are up than your max-instances value, only max-instances of them are active at the same time. To be clear: with your parameters, if you have 5 instances up at the same time, only 3 serve traffic at any given moment.
This part is not documented because it evolves constantly to find the best balance between over-provisioning and a lack of resources (and 429 errors).
@Steren @AhmetB can you confirm or correct me?
When Cloud Run receives and processes requests rapidly, it predicts how many instances it needs and tries to scale to that amount. If a sudden burst of requests occurs, Cloud Run instantiates a larger number of instances in response. This is done to absorb a possible higher number of requests beyond what it is currently serving, while taking into account how long the existing instances will take to finish loading the requests they are handling. Per the documentation, the number of container instances can exceed the max-instances value during such spikes.
You mentioned that with max-instances set to 1 it was running fine, but later noted that it was in fact producing 429s at that setting as well. Seeing 429s together with spiking instance counts could indicate that the traffic is not being handled smoothly.
It is also worth noting that, because of the cold-start time you mention, while an instance is serving its first request(s) the number of concurrent requests is by design hard-set to 1. Only once the instance is fully ready is the concurrency setting you have chosen applied.
Was there a specific reason you chose 3 for both max-instances and concurrency? Also, how was the concurrency set when you had max-instances set to 1? Perhaps you could try raising the concurrency further (maximum 80) and/or max-instances (upper limit 1000) and see whether that removes the 429s.
On my project there is a REST API implemented on AWS API Gateway and AWS Lambda. Since AWS Lambda functions are serverless and stateless, when we make a call, AWS starts a container with the Lambda function's code to process that call. According to the AWS documentation, after the Lambda function finishes executing, AWS does not stop the container, and the next call can be processed in that same container. This approach improves the performance of the service: only the first call pays the time to start the container (the cold start of the Lambda function), and all subsequent calls execute faster because they reuse the same container (warm starts).
As a next step to improve performance, we created a cron job that periodically calls our Lambda function (we use CloudWatch rules for that). This keeps the Lambda function "warm", avoiding the stopping and restarting of containers, so that when a real user calls our REST API, Lambda does not spend time starting a new container.
But we ran into an issue: this approach keeps only one container of the Lambda function warm, while the actual number of parallel calls from different users can be much larger (in our case hundreds and sometimes even thousands of users). Is there any way to implement warm-up functionality for a Lambda function that warms not just a single container, but some desired number of them?
I understand that such an approach can affect the cost of using the Lambda function, and that it might turn out better to use a good old application server, but comparing these approaches and their costs will be a later step; for now I would just like to find a way to warm a desired number of Lambda function containers.
This may be long, but bear with me, as it should give you a workaround and perhaps a better understanding of how Lambda works.
Alternatively, you can skip to "The Workaround" at the bottom if you are not interested in the background.
For folks who are not familiar with cold starts, please read this blog post to understand them better. In short:
Cold Starts
When a function is executed for the first time, or after its code or resource configuration has been updated, a container is spun up to execute it. All the code and libraries are loaded into the container so it can execute. The code then runs, starting with the initialisation code, which is the code written outside the handler and is only run when the container is created for the first time. Finally, the Lambda handler is executed. This set-up process is what is considered a cold start.
For performance, Lambda can re-use containers created by previous invocations, which avoids initialising a new container and loading the code; only the handler code is executed. However, you cannot depend on a container from a previous invocation being reused. If you haven't changed the code and not too much time has gone by, Lambda may reuse the previous container. If you change the code or the resource configuration, or if some time has passed since the previous invocation, a new container will be initialized and you will experience a cold start.
Now consider these scenarios for a better understanding:
Consider that the Lambda function in the example is invoked for the first time. Lambda will create a container, load the code into it, and run the initialisation code; the function handler will then be executed. This invocation will have experienced a cold start. As mentioned in the comments, the function takes 15 seconds to complete. After a minute, the function is invoked again. Lambda will most likely re-use the container from the previous invocation, so this invocation will not experience a cold start.
Now consider a second scenario, where the second invocation arrives 5 seconds after the first. Since the previous invocation takes 15 seconds and has not yet finished executing, the new invocation needs a new container, and will therefore experience a cold start.
Now to the first part of the problem, which you have already solved:
Preventing cold starts is possible, but not guaranteed, and the common workaround keeps only one container of the Lambda function warm. To do this, you run a CloudWatch Events rule on a schedule (cron expression) that invokes your Lambda function every couple of minutes to keep it warm.
The Workaround:
For your use case, your Lambda function will be invoked very frequently with a very high concurrency rate. To avoid as many cold starts as possible, you will need to keep warm as many containers as you expect your highest concurrency to reach. To do this, you will need to invoke the function with a delay, to allow the concurrency of this function to build up and reach the desired number of concurrent executions. This forces Lambda to spin up the number of containers you desire. As a result, it can increase costs, and it still does not guarantee that cold starts are avoided.
That being said, here is a breakdown of how you can keep multiple containers of your function warm at the same time (a code sketch follows the considerations below):
You should have a CloudWatch Events rule that is triggered on a schedule. This schedule can be a fixed rate or a cron expression. For example, you can set this rule to trigger every 5 minutes. You then specify a Lambda function (the controller function) as the target of this rule.
Your controller Lambda function will then invoke the target Lambda function (the one you want to keep warm) once for each concurrent container you want running.
There are a few things to consider here:
You will have to build concurrency, because if the first invocation finishes before another invocation starts, that invocation may reuse the previous invocation's container instead of creating a new one. To prevent this, you need to add some sort of delay in the Lambda function when it is invoked by the controller function. This can be done by passing a specific payload with these invocations; the Lambda function you want to keep warm then checks whether this payload exists. If it does, the function waits (to build up concurrent invocations); if it does not, the function executes as expected.
You will also need to ensure you are not getting throttled on the Invoke Lambda API call if you are calling it repeatedly. Your Lambda function should be written to handle this throttling if it occurs, and consider adding a delay between API calls to avoid throttling.
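As a rough illustration of the controller side, here is a minimal sketch using the AWS SDK for Java v2; the function name, desired container count, and warm-up payload marker are assumptions you would adapt to your own setup:

import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.lambda.LambdaClient;
import software.amazon.awssdk.services.lambda.model.InvocationType;
import software.amazon.awssdk.services.lambda.model.InvokeRequest;

public class WarmUpController {
    // Placeholder values: the function to keep warm and how many containers you want warm.
    private static final String TARGET_FUNCTION = "my-api-function";
    private static final int DESIRED_WARM_CONTAINERS = 20;

    // Invoked by the scheduled CloudWatch Events rule.
    public void handleRequest() {
        try (LambdaClient lambda = LambdaClient.create()) {
            for (int i = 0; i < DESIRED_WARM_CONTAINERS; i++) {
                InvokeRequest request = InvokeRequest.builder()
                        .functionName(TARGET_FUNCTION)
                        // "Event" (asynchronous) invocations overlap, forcing
                        // Lambda to spin up separate containers.
                        .invocationType(InvocationType.EVENT)
                        // Marker payload: the target function checks for this flag
                        // and simply sleeps for a few seconds so the warm-up
                        // invocations stay concurrent, then returns.
                        .payload(SdkBytes.fromUtf8String("{\"warmup\": true}"))
                        .build();
                lambda.invoke(request);
            }
        }
    }
}

On its side, the target function would check for the assumed "warmup" flag at the top of its handler and return early after a short sleep, so real requests are unaffected.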
In the end, this solution can reduce cold starts, but it will increase costs and will not guarantee that cold starts are avoided, as they are inevitable when working with Lambda. If your application needs faster response times than a Lambda cold start allows, I would recommend looking into running your server on an EC2 instance.
We are using Java (Spring Boot) Lambdas and arrived at pretty much the same solution as Kush Vyas's answer above, which works very well.
We did find during load testing, however, that a legitimate user request would often occur during the period that the "Controller function" was executing, again causing the inevitable cold start...
So now, in our "controller function", we issue our regular X concurrent warm-up requests, but on every 5th execution of the function we call our target Lambda an additional 2 times. The theory is that we end up with X+2 Lambdas staying warm, while for 4 out of 5 warm-up runs there are still 2 spare Lambdas that can service user requests.
It reduced our number of cold starts even further (though obviously not completely), and we are still experimenting with combinations of concurrency, warm-up frequency, and sleep time to find the optimum for us; these values will likely always depend on the load requirements of a specific situation.
AWS just announced this:
https://aws.amazon.com/about-aws/whats-new/2019/12/aws-lambda-announces-provisioned-concurrency/
Be aware, though, that it is not free: for our simple use case of keeping 10 Lambda instances warm, it seems our daily cost would increase from $0.06 to $4.
If you use the serverless framework with AWS Lambda, you can use this plugin to keep all your lambdas warm with a certain level of concurrency.
I'd like to share a small but useful tip that we use to reduce the user-observed delay caused by cold starts. In our case the Lambda function handles HTTP requests from the front end via AWS API Gateway; in particular, it executes search functionality when the user types something into an input field. Usually the user starts typing with some delay after the UI is rendered, so we have some time to make a ping call to our Lambda function to warm it up. By the time the user makes real requests to the back end, the Lambda will most likely be ready to work.
This approach does nothing to actually fix cold starts on the back-end side, and you will still need to look at other options for that, but it can improve the user experience without much effort (something like a hotfix).
One thing to remember: if your service is public and you care about your Google Insights score, you should be careful implementing such an approach.
I am working on a web application that lets its users optionally execute long-running processes 'in the background'. An example would be a long-running report generation, or deleting thousands of objects at once.
I've implemented this using an ExecutorService defined as a FixedThreadPool with a custom ThreadFactory. The ThreadFactory is built like this:
// Guava's com.google.common.util.concurrent.ThreadFactoryBuilder
new ThreadFactoryBuilder()
        .setNameFormat(clientId + "-BackgroundTask-%d")
        .setDaemon(true)
        .setPriority(Thread.MIN_PRIORITY)
        .build()
I execute the task like this:
Future<TaskStatus> future = clientExecutors.get(clientId)
        .submit(backgroundTask::execute);
taskFutures.put(backgroundTask.getTaskId(), future);
How can I make my web server always prioritize handling new incoming requests (as fast as possible) over executing background tasks?
In other words: it should never happen that a user has to wait a long time while browsing the site just because a lot of background tasks are executing. As you can see above, I tried to achieve this by setting .setPriority(Thread.MIN_PRIORITY), but that does not seem to be sufficient.
Furthermore, for now I've set an arbitrary FixedThreadPool size (10) and use it globally for all background handling of the application (and all its customers).
Instead, I would like to define a thread pool for each customer, to make sure every customer has the same ability to run a certain number of tasks in the background. Say each customer gets a FixedThreadPool of size 5, and the server has at most 50 different customers; that would add up to 250 background tasks running at the same time (see the sketch below).
The most important requirement here is: it does not matter how long these background tasks take to execute (say 2 minutes, or 20 minutes). What matters is that each customer can submit 5 tasks to be executed in the background, and that all of them are worked on equally.
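For reference, a per-customer pool along those lines could look like the following minimal sketch; it assumes Guava's ThreadFactoryBuilder (as above) and a ConcurrentHashMap keyed by clientId, with the pool size as a hypothetical constant:

import com.google.common.util.concurrent.ThreadFactoryBuilder;

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ClientExecutors {
    // Hypothetical per-customer limit: 5 concurrent background tasks each.
    private static final int TASKS_PER_CLIENT = 5;

    private final ConcurrentMap<String, ExecutorService> clientExecutors =
            new ConcurrentHashMap<>();

    // Lazily creates one fixed-size, low-priority, daemon-thread pool per client.
    public ExecutorService executorFor(String clientId) {
        return clientExecutors.computeIfAbsent(clientId, id ->
                Executors.newFixedThreadPool(TASKS_PER_CLIENT, new ThreadFactoryBuilder()
                        .setNameFormat(id + "-BackgroundTask-%d")
                        .setDaemon(true)
                        .setPriority(Thread.MIN_PRIORITY)
                        .build()));
    }
}

Note that this only caps the number of threads per customer; it does not by itself prevent many busy background threads from saturating the CPU of the host.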
I've tested running 30 CPU-intensive background tasks, and it turns out that while these are running and CPU usage is near 100%, new incoming requests take a very long time to be handled.
So obviously, I am doing it wrong.
Update 12.09.2017
I've read about microservices, and while it sounds great, I see a big challenge in splitting the necessary parts out of our monolithic application, mostly because nearly every operation might turn into a long-running process given a big enough data selection.
Furthermore, wouldn't I run into the same problem with my microservice, i.e. the server running the microservice would suffer the same performance degradation? The only upside would be that the rest of the web app no longer suffers from it.
I've read some posts about introducing Thread.sleep(1), or Thread.sleep in general, into CPU-heavy operations to reduce the amount of CPU they use. I've also read about someone who introduced this as an aspect so that he could even change the wait time dynamically, to have some control over how much CPU is used.
However, my gut tells me that ain't right either. What do you think about introducing Thread.sleep to lower the amount of CPU used for a task? Is this common practice? If not, what would be the right approach?
I would strongly consider changing your system architecture to offload these long-running requests to a separate instance instead of running them in-process with the general request-serving application. In general, I think it is an anti-pattern to handle both batch/online (or long/short running) processing in the same application instance.
Ideally you'd build a standalone microservice to handle these requests, but you could also simply deploy X instances of your existing application and configure your load balancer to route requests on the long-running invocation paths (e.g. POST /myapp/longrunningjob) only to the instances dedicated to running these long-running processes.
We are seeing inconsistent performance on Heroku that is unrelated to the recent unicorn/intelligent routing issue.
This is an example of a request which normally takes ~150ms (and 19 out of 20 times that is how long it takes). You can see that on this request it took about 4 seconds, or between 1 and 2 orders of magnitude longer.
Some things to note:
the database was not the bottleneck, and it spent only 25ms doing db queries
we have more than sufficient dynos, so I don't think this was the bottleneck (20 double dynos running unicorn with 5 workers each; we get only 1000 requests per minute with an average response time of 150ms, which means we should be able to serve (60 / 0.150) * 20 * 5 = 40,000 requests per minute). In other words, we had 40x the needed dyno capacity when this measurement was taken.
So I'm wondering what could cause these occasional slow requests. As I mentioned, anecdotally it seems to happen in about 1 in 20 requests. The only things I can think of are a noisy-neighbor problem on the boxes, or inconsistent performance in the routing layer. If anyone has additional info or ideas I would be curious. Thank you.
I have been chasing a similar problem myself, with not much luck so far.
I suppose the first order of business would be to recommend New Relic. It may have some more info for you on these cases.
Second, I suggest you look at queue times: how long your request was queued. Look at New Relic for this, or do it yourself with the "start time" HTTP header that Heroku adds to your incoming request (just print now() minus "start time" as your queue time), as sketched below.
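For illustration only (and in Java, to match the other snippets in this thread), here is a minimal sketch of that arithmetic; it assumes the header in question is X-Request-Start and that it carries the router's receive time in milliseconds since the epoch, so double-check the exact header name and unit for your stack:

import javax.servlet.http.HttpServletRequest;

public final class QueueTime {
    // Returns the approximate router queue time in milliseconds, or -1 if the header is absent.
    public static long queueTimeMillis(HttpServletRequest request) {
        // Assumed header name and unit: X-Request-Start, milliseconds since the epoch.
        String start = request.getHeader("X-Request-Start");
        if (start == null) {
            return -1L;
        }
        return System.currentTimeMillis() - Long.parseLong(start.trim()); // now() minus "start time"
    }
}

Logging this value per request and graphing it should make router queueing show up clearly if it is the culprit.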
When those failed me, I tried coming up with things that could go wrong, and here's an (unorthodox? weird?) list:
1) DNS -- are you making any DNS calls in your view? These can take a while. Even DNS requests for resolving DB host names, Redis host names, external service providers, etc.
2) Log performance -- Heroku collects all your stdout using their "Logplex", which it then drains to your own defined logdrains, services such as Papertrail, etc. There is no documentation on the performance of this, and writes to stdout from your process could block, theoretically, for periods while Heroku is flushing any buffers it might have there.
3) Getting a DB connection -- not sure which framework you are using, but maybe you have a connection pool that you are getting DB connections from, and that took time? It won't show up as query time, it'll be blocking time for your process.
4) Dyno performance -- Heroku has an add-on feature that will print, every few seconds, some server metrics (load avg, memory) to stdout. I used Graphite to graph those and look for correlation between the metrics and times where I saw increased instances of "sporadic slow requests". It didn't help me, but might help you :)
Do let us know what you come up with.