We have an AWS Lambda written in Java that usually completes in about 200 ms. Occasionally, it times out after 5 seconds (our configured timeout value).
I understand that there is occasional added latency due to container setup (though I'm not clear whether that counts against our execution time). I added some debug logging, and it seems like the code just runs slowly.
For example, a particularly noticeable log entry shows a call to HttpClients.createDefault usually takes less than 200 ms (based on the fact that the Lambda executes in less than 200 ms), but when the timeout happens, it takes around 2-3 seconds.
2017-09-14 16:31:28 DEBUG Helper:Creating HTTP Client
2017-09-14 16:31:31 DEBUG Helper:Executing request
Unless I'm misunderstanding something, it seems like any latency due to container initialization would have already happened. Am I wrong in assuming that code execution should not have dramatic differences in speed from one execution to the next? Or is this just something we should expect?
Setting up new containers or replacing cold containers takes some time, and both count against your time. The time you see in the console is the time you are billed for.
I assume that Amazon doesn't charge for provisioning the container itself, but they will certainly start the timer as soon as your runtime is started. You are likely to pay for the time during which the SDK/JDK gets initialized and loads its classes. They are certainly not charging us for starting the operating system that hosts the containers.
Running a simple Java Lambda twice shows the different times for new and reused instances: the first invocation takes 374.58 ms and the second 0.89 ms, with billed durations of 400 ms and 100 ms respectively. For the second invocation the container got reused. While you can try to keep your containers warm, as already pointed out by @dashmug, AWS will occasionally recycle containers and spawn new ones as load increases or decreases. The blog posts How long does AWS Lambda keep your idle functions around before a cold start? and How does language, memory and package size affect cold starts of AWS Lambda? are worth a look as well. If you include external libraries, your times will increase; the second post shows that Java cold starts with smaller memory allocations can regularly take 2-4 seconds or more.
Looking at these times, you should probably increase your timeout. Also, don't just look at the log lines produced by your application; check the START, END and REPORT entries for an actual timeout event as well. Each running Lambda container instance seems to create its own log stream. Consider keeping your Lambdas warm if they aren't called that often.
05:57:20 START RequestId: bc2e7237-99da-11e7-919d-0bd21baa5a3d Version: $LATEST
05:57:20 Hello from Lambda com.udoheld.aws.lambda.HelloLogSimple.
05:57:20 END RequestId: bc2e7237-99da-11e7-919d-0bd21baa5a3d
05:57:20 REPORT RequestId: bc2e7237-99da-11e7-919d-0bd21baa5a3d Duration: 374.58 ms Billed Duration: 400 ms Memory Size: 128 MB Max Memory Used: 44 MB
05:58:01 START RequestId: d534155b-99da-11e7-8898-2dcaeed855d3 Version: $LATEST
05:58:01 Hello from Lambda com.udoheld.aws.lambda.HelloLogSimple.
05:58:01 END RequestId: d534155b-99da-11e7-8898-2dcaeed855d3
05:58:01 REPORT RequestId: d534155b-99da-11e7-8898-2dcaeed855d3 Duration: 0.89 ms Billed Duration: 100 ms Memory Size: 128 MB Max Memory Used: 44 MB
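One practical consequence of the above: since the JVM only loads a class like HttpClients the first time it is used, it helps to move that work into static initialization so the cost is paid once per container, during the cold start, rather than showing up on the invocation path. A minimal sketch, not your actual handler, assuming the Apache HttpClient 4.x that appears in your log:
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class ExampleHandler implements RequestHandler<Object, String> {

    // Created once per container during initialization, so the cost of
    // HttpClients.createDefault() is not paid on every invocation.
    private static final CloseableHttpClient HTTP_CLIENT = HttpClients.createDefault();

    @Override
    public String handleRequest(Object input, Context context) {
        // The URL is a placeholder; reusing HTTP_CLIENT also keeps connections pooled.
        try (CloseableHttpResponse response = HTTP_CLIENT.execute(new HttpGet("https://example.com/"))) {
            return String.valueOf(response.getStatusLine().getStatusCode());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}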
Try to keep your function always warm and see if it would make a difference.
If the timeout is really due to container warmup, then keeping it warm will greatly help reduce the frequency of these timeouts. You'd still get cold starts when you deploy changes but at least that's predictable.
https://read.acloud.guru/how-to-keep-your-lambda-functions-warm-9d7e1aa6e2f0
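If you go the keep-warm route, the scheduled ping usually just needs to be something the handler can recognize and return from immediately. A rough sketch, where the "warmup" payload field is an assumption (use whatever your scheduled rule actually sends):
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import java.util.Map;

public class WarmableHandler implements RequestHandler<Map<String, Object>, String> {

    @Override
    public String handleRequest(Map<String, Object> input, Context context) {
        // A scheduled rule can invoke the function with a payload such as {"warmup": true};
        // returning early keeps the ping cheap while keeping the container alive.
        if (input != null && Boolean.TRUE.equals(input.get("warmup"))) {
            return "warmed";
        }
        // ... real work goes here ...
        return "ok";
    }
}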
For Java-based applications the warm-up period is longer because, as you know, it's the JVM. Node.js or Python are better in this respect because their warm-up period is shorter. If you are not in a position to switch the tech stack, simply keep the container warm by triggering it periodically, or increase the memory allocation; that reduces execution time because Lambda allocates more CPU at larger memory sizes.
Related
My understanding is that a single Lambda@Edge instance can only handle one request at a time, and AWS will spin up new instances if all existing instances are serving a request.
My lambda has a heavy instance startup cost (~2 seconds) but a very light execution cost. It triggers on viewer requests, which always come in batches of ~20 (loading a single-page application). This means one user loading the app, on a cold start, will start ~20 lambda instances and take ~2 seconds.
But due to the very light execution cost, a single lambda instance could handle all 20 requests and it would still take only ~2 seconds.
An extra advantage is, since each instance connects to a 3rd party service on startup, there would be only 1 open connection instead of 20.
Is this possible?
Lambda@Edge supports neither reserved nor provisioned concurrency.
Here is the link to the documentation for reference: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/edge-functions-restrictions.html#lambda-at-edge-function-restrictions
That being said, with a 2 s cold start, you might consider using a standard Lambda instead.
Also, can’t you reduce that cold start somehow?
We are using a serverless architecture with AWS Lambda and API Gateway. The execution time of the Lambda is in the range of a few milliseconds, yet the final response at the client's end arrives in seconds (far more than the Lambda's execution time, even when the init duration is taken into account for cold-start scenarios).
While debugging this with the API Gateway logs, I found integration latency of several seconds, which makes the end-to-end response considerably slow. To remove the init duration (cold start), I added CloudWatch rules that call the Lambdas periodically to keep them warm.
The init duration is removed completely, and this helped reduce the integration latency as well. Some Lambdas cannot be scheduled this way because calling them requires authentication; for those I added a provisioned concurrency of 5.
This Lambda still shows an init duration in the logs. Provisioned concurrency is another option for getting rid of cold starts, but it is not having any impact on the time at which the Lambda's response is available at API Gateway.
I followed the links below to assign provisioned concurrency to the Lambdas:
Provisioned Concurrency: What it is and how to use it with the Serverless Framework
AWS News Blog – Provisioned Concurrency for Lambda Functions
CloudWatch logs of the Lambda to which I added provisioned concurrency:
Duration: 1331.38 ms Billed Duration: 1332 ms Memory Size: 256 MB Max Memory Used: 130 MB Init Duration: 1174.18 ms
One thing I noticed in the API Gateway and Lambda logs is that the request was sent from API Gateway at 2021-02-15T11:51:36.621+05:30 but received at the Lambda at 2021-02-15T11:51:38.535+05:30. That is a delay of about 2 seconds before the request reaches the Lambda.
AWS X-RAY TRACING
I have enabled AWS X-Ray tracing for both API Gateway and Lambda, and this is what the traces show: the request took 595 ms in total, but in Postman the response was received after 1558 ms. Where is the extra delay of approximately 1 second in receiving the response from API Gateway coming from?
I believe the reason is that the provisioned concurrency of 5 is not enough and you still run into cold starts of your Lambda function. This means that when the external service calls your API endpoint (i.e. your Lambda function behind API Gateway), your Lambda function is warm with 5 instances. If we assume each instance can handle 2 requests per second (500 ms per invocation), then those 5 instances can roughly handle 10 requests per second. If the external service is making 20 requests per second, AWS Lambda tries to spin up new instances because the existing ones are busy handling requests. The consequence is that the external service experiences high response times because of cold starts of your function.
Also, consider that the instances of your Lambda function do not live "forever" but are cleaned up at some point. If your traffic is spiky, the extra instances may live for something like 15 minutes after one spike; AWS Lambda then shuts them down to keep only the 5 provisioned ones, and when the next spike comes you'll see the same problem as before.
Please note: this is a very simplified explanation of what's happening behind the scenes, and more of a good guess based on your description. It would help if you provided some example numbers (e.g. init duration, execution duration, response time) and maybe some example code showing what you're doing in your Lambda function. Also: which runtime are you using, and what does your traffic pattern look like?
Potential solutions
Reduce the cold start time of your Lambda functions -> always a good idea for Lambda functions that are behind an API Gateway
Provision more instances -> only possible up to a certain (soft) limit
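To sanity-check whether 5 provisioned instances can even cover your traffic, here is a rough back-of-the-envelope estimate (Little's law), using the example numbers above, which are only assumptions about your workload:
public class ConcurrencyEstimate {
    public static void main(String[] args) {
        double requestsPerSecond = 20.0;   // assumed rate from the external service
        double avgDurationSeconds = 0.5;   // assumed 500 ms per invocation
        // Concurrent instances needed ~= arrival rate x average duration (Little's law).
        int needed = (int) Math.ceil(requestsPerSecond * avgDurationSeconds);
        System.out.println("Roughly " + needed + " concurrent instances needed"); // prints 10
        // Anything above your provisioned concurrency of 5 spills over to on-demand
        // instances, and those are the invocations that hit cold starts.
    }
}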
===== Flow-of-Services =====
API GW ⮕ Lambda function(Provisioned)
===== Query =====
You want to understand why there is latency while processing the request from API GW to Lambda function.
===== Time-stamp-of-CW =====
2021-02-15T11:51:36.621+05:30
2021-02-15T11:51:38.535+05:30
Lambda duration - Duration: 1331.38 ms Billed Duration: 1332 ms Memory Size: 256 MB Max Memory Used: 130 MB Init Duration: 1174.18 ms
===== Follow-up questions =====
While the request was processed from API Gateway to the Lambda function, the execution environment took 1174.18 ms (~1.2 s) to become active and executed your code in the remaining ~0.3 s, which makes a total of roughly 1.4 seconds.
Q. What is the type of processor you are using?
Q. Type of API & Endpoint type?
===== K.C =====
You should read the AWS documentation on optimizing your Lambda function code, in particular the section on optimizing static initialization.
Lambda won't charge you for the time it takes to initialize your code (e.g. importing dependencies) as long as it finishes in about X seconds.
===== Replication Observation =====
Without provisioned concurrency:
API GW execution time - 286 ms
Initialization - 195 ms
Invocation - 11 ms
Overhead - 0 ms
With provisioned concurrency:
API GW execution time - 1.103 ms
Initialization - 97 ms
Invocation - 1 ms
Overhead - 0 ms
I'm in the us-west-2 Region and sending the request from 12,575 km away from the Region. I have a REST API configured with the 'Regional' endpoint type. The Lambda function is running on x86_64 (64-bit x86 architecture, for x86-based processors).
-- Check whether you have optimized your Lambda function code.
-- For lower latency, you can use an edge-optimized REST API. An edge-optimized API endpoint is best for geographically distributed clients; API requests are routed to the nearest CloudFront Point of Presence (POP).
-- Always choose the Region closest to where most of your traffic comes from.
References:
new-provisioned-concurrency-for-lambda-functions
provisioned-concurrency.html#optimizing-latency
Context:
My Spring-Boot app runs as expected on Cloud Run when I deploy it with max-instances set to 1: It receives a constant stream of pubsub messages via push, and makes anywhere from 0 to 5 writes to an associated CloudSQL instance, depending on the message payload. Typically it handles between 20 and 40 messages per second. Latency/response-time varies between 50ms and 60sec, probably due to some resource contention.
In order to increase throughput/ decrease resource contention, I'm looking to experiment with the connection pool size per app-instance, as well as the concurrency and max-instances parameters for my cloud run app.
I understand that due to Spring-Boot, my app has a relatively high cold-start time of about 30-40 seconds. This is acceptable for how this service is used.
Problem:
I'm experiencing problems when deploying a spring-boot app to cloud run with max-instances set to a value greater than 1:
Instances start, handle a single request successfully, and then produce no more logs.
This happens a few times per minute, leading me to believe that instances get started (cold start), handle a single request, die, and then get started again. They are not being reused as described in the docs (official docs on concurrency), and as happens when I set max-instances to 1.
Instead, I expect 3 container instances to be started, each of which then serves requests according to the concurrency setting.
Billable container time at max-instances=3:
As shown in the graph, the number of instances is fluctuating wildly, once the new revision with max-instances=3 is deployed.
The graphs for CPU- and memory-usage also look like this.
There are no error logs. As before at max-instances=1, there are warnings indicating that there are not enough instances available to handle requests (HTTP 429).
Connection Limit of CloudSQL instance has not been exceeded
Requests are handled at less than 10/s
Finally, this is the command used to deploy:
gcloud beta run deploy my-service --project=[...] --image=[...] --add-cloudsql-instances=[...] --region=[...] --platform=managed --memory=1Gi --max-instances=3 --concurrency=3 --no-allow-unauthenticated
What could cause this behavior?
Some months ago, in the private alpha, I performed tests and observed the same behavior. After discussing it with the Google team, I understood that instances are over-provisioned "just in case": an instance crashes, an instance is preempted, the traffic suddenly increases, ...
The trade-off is that you will have more cold starts than your max-instances value. Worse, you will be charged for these over-provisioned cold starts -> in practice this is not an issue because Cloud Run has a generous free tier that covers this kind of glitch.
Going deeper into the logs (you can do this by creating a sink of the Cloud Run logs into BigQuery and then querying them), even if there are more instances up than your max-instances, only max-instances of them are active at the same time. To be clear: with your parameters, that means if you have 5 instances up at the same time, only 3 of them serve traffic at any given point in time.
This part is not documented because it evolves constantly to find the best balance between over-provisioning and lack of resources (and 429 errors).
@Steren @AhmetB can you confirm or correct me?
When Cloud Run receives and processes requests rapidly, it predicts how many instances it needs and will try to scale to that amount. If a sudden burst of requests occurs, Cloud Run instantiates a larger number of instances in response, in order to adapt to a possibly higher number of incoming requests beyond what it is currently serving, while taking into consideration how long the existing instances will take to finish the requests they are handling. Per the documentation, the number of container instances can go above the max-instances value during such spikes.
You mentioned with max-instances set to 1 it was running fine, but later you mentioned it was in fact producing 429s with it set to 1 as well. Seeing behavior of 429s as well as the instances spiking could indicate that the amount of traffic is not being handled fluidly.
It is also worth noting that, because of the cold start time you mention, while an instance is serving its first request(s) the number of concurrent requests is by design hard-set to 1. Only once things are fully ready is the concurrency setting you have chosen applied.
Was there some specific reason you chose 3 and 3 for the max-instances and concurrency settings? Also, how was concurrency set when you had max-instances set to 1? Perhaps you could try raising the concurrency (max 80) and/or max-instances (limit up to 1000) further and see if that removes the 429s.
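Since the 30-40 s Spring Boot cold start is what makes the extra instances hurt, it may also be worth trying to shrink it. One knob, sketched below under the assumption that you are on Spring Boot 2.2+ and that your beans tolerate it, is lazy initialization:
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class Application {
    public static void main(String[] args) {
        SpringApplication app = new SpringApplication(Application.class);
        // Beans are created on first use instead of at startup, which can shorten
        // the cold start; equivalent to spring.main.lazy-initialization=true.
        app.setLazyInitialization(true);
        app.run(args);
    }
}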
Sidekiq documentation says:
Don't set the concurrency higher than 50. I've seen stability issues
with concurrency of 100, for example
Well, my low memory consumption enables me to use a concurrency of 350 threads on a single 512 MB 1X Heroku dyno, and I would like to use ~300 because all jobs are IO-intensive (HTTP requests).
I wonder what issues I might run into.
I monitored the logs under overload with a concurrency of 80 and saw no issues.
What issues should I expect when setting a concurrency of 300 threads? Will I risk jobs getting terminated without being moved to the "dead" queue, or just a termination of workers that I will be able to watch?
Is it safe to set a concurrency of 300 or 100?
The owner of sidekiq doesn't know the answer and here is the issue I opened.
UPDATE:
Under high load, when I increased from 80 to 100 threads I started getting 'can't create Thread: Resource temporarily unavailable' errors here and there; in extreme cases of 180 threads it would sometimes terminate the entire Sidekiq process.
The memory consumption was always between 140 MB and 240 MB according to Heroku metrics.
I used the TTIN signal as described here
and found that most threads are waiting on these lines of code:
app[worker.1]: 3 TID-ow5z46exw WARN: /app/vendor/ruby-2.3.0/lib/ruby/2.3.0/monitor.rb:187:in `lock'
app[worker.1]: 3 TID-os9ulw8ps WARN: /app/vendor/ruby-2.3.0/lib/ruby/2.3.0/net/http.rb:880:in `initialize'
app[worker.1]: 3 TID-os9ulw8ps WARN: /app/vendor/ruby-2.3.0/lib/ruby/2.3.0/timeout.rb:95:in `join'
app[worker.1]: 3 TID-osjnd6zac WARN: /app/vendor/ruby-2.3.0/lib/ruby/2.3.0/net/protocol.rb:158:in `wait_readable'
Everything is documented in the github issue
The owner of Sidekiq says that the traces look fine, so no luck spotting the root cause of the stability issue, but there is now input on how many threads trigger it and what the symptoms are.
The DB pool size has to be increased based on the concurrency:
concurrency (threads) + 2 = DB connection pool size
300 + 2 = 302 DB connections
The real concurrency of a single Sidekiq process depends on the number of processor cores and other parameters, so using more concurrency means most of the time is spent on thread context switching instead of on real IO/computation.
512 MB 1X Heroku dyno
A normal Rails app needs at least 200 MB of memory at startup, and if each thread takes approximately 1 MB of memory, then the total memory consumption will be
200 + (300 * 1) = 500 MB
If some threads need more memory during processing (e.g. fetching more ActiveRecord objects, reading a large file, ...), then the whole Sidekiq process can crash.
When I ran Sidekiq at the machine's full potential, garbage collection didn't happen immediately, which caused memory to grow and crashed Sidekiq frequently.
Also, with more threads, each job takes longer to complete than usual. Analyze this case in your environment.
Well, the Sidekiq stability issues at high concurrency are as follows.
When you set a concurrency higher than 80 (or 50) you may run into the error "can't create Thread: Resource temporarily unavailable".
Some jobs will be returned to the queue; sometimes the entire process will be terminated and jobs will be lost, unless you use the Sidekiq Pro reliability feature.
It seems that we are hitting Heroku's 256-thread limit even though Sidekiq is configured to use 80 threads. Using multiple Sidekiq processes inside a single Heroku dyno doesn't help; when I did that, I still ran into this limit.
It seems like a thread leak, and this is the next thing to investigate.
The above also happens when memory consumption stays low (< 240 MB in my example).
Everything is updated in the github issue
We are using infinispan hot rod in our application.
Sometimes retrieval from the cache takes more time. This does not happen consistently: most of the time it takes 6 ms, but occasionally it takes much longer (200 ms).
The size of the object retrieved from the cache is around 200 bytes.
We tested with both Infinispan 5.2.1 and JDG 6.3.2.
Did anybody face this issue ?
Thanks
Lives
Remember that you're running Java, which means the garbage collector can fire at any time; that will cost you 200 ms if you're very lucky, several seconds if you're not, and up to minutes if you have a large heap and poorly tuned GC settings.
Since retrieval from a distributed cache requires an RPC to another node, where that RPC is then handled, thread scheduling also plays a vital role, and in a busy system it's no surprise to find a thread waiting.
From the Infinispan perspective, there's nothing the retrieval should wait for. The request gets translated into an RPC to the remote node, where it's handled by the same thread that received the message. The request does not wait for any locks.
In JGroups, there may be some delay involved: a message can get lost on the network or discarded by the receiver if it cannot handle the load, and is then resent. Also, the UFC protocol makes sure that the sender's speed matches what the receiver can handle.
Anything built on top of non-realtime Java works with best effort, and sometimes sh!t happens. 200 ms is still a good response time.
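If you want to pin down whether GC (or scheduling) is behind the outliers, one approach is to time the gets on the client and correlate the slow ones with GC logging (e.g. -verbose:gc -XX:+PrintGCDetails on the client and server JVMs). A minimal sketch using the Hot Rod client; the 50 ms threshold and the key are placeholders:
import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.RemoteCacheManager;

public class SlowGetProbe {
    public static void main(String[] args) {
        // Reads hotrod-client.properties from the classpath for the server addresses.
        RemoteCacheManager manager = new RemoteCacheManager();
        RemoteCache<String, byte[]> cache = manager.getCache();
        try {
            long start = System.nanoTime();
            byte[] value = cache.get("some-key");
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMs > 50) {
                // Compare these timestamps against the GC log to see whether a pause overlaps.
                System.out.println("Slow get (" + elapsedMs + " ms), value present: " + (value != null));
            }
        } finally {
            manager.stop();
        }
    }
}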