I have a specific Lambda function invoked by SNS events that repeatedly times out in about half of its instances, and those instances don't seem to run any of the handler code.
What's peculiar is that I have a number of log statements at the very start of the function handler that should be getting triggered, yet nothing appears in the logs.
I've tried increasing the timeout to 120 seconds, but this doesn't fix anything. I've also looked at the Lambda init logic (the code outside the main handler method), but it's just simple imports and class initialisation, no database connections or HTTP requests that might be causing a timeout.
The handler logic does include database connections and network requests, but if those were timing out then I'd expect to also see some logs prior to the timeouts.
When I view the Lambda logs by stream, around half of them look like the above and just time out, whereas the other half run as expected. Are streams specific to individual Lambda containers? If so, it looks as if there are a number of "dead" containers.
Has anyone experienced an issue like this in the past or has any idea what is going on?
This issue was fixed after realising that the Lambda was attached to two different subnets, one of which didn't have a NAT gateway. After moving the Lambda to a single subnet with a NAT gateway, the timeouts have stopped.
I have a Lambda function bound to CodeBuild notifications; a Lambda instance writes details of the notification that triggered it to a DynamoDB table (BillingMode PAY_PER_REQUEST).
Each CodeBuild notification spawns an independent Lambda instance. A CodeBuild build can spawn 7-8 separate notifications/Lambda instances, many of which often happen simultaneously.
The Lambda function uses DynamoDB:PutItem to write details of the notification to DynamoDB. What I find is that out of 7-8 notifications in a 30-second period, sometimes all 7-8 get written to DynamoDB, but sometimes it can be as low as 0-1; many calls to DynamoDB:PutItem simply seem to be "ignored".
Why is this happening?
My guess is that DynamoDB simply shouldn't be accessed by multiple Lambda instances in this way; that best practice is to push the updates to a SQS queue bound to a separate Lambda, and have that separate Lambda write many updates to DynamoDB as part of a transaction.
Is that right? Why might parallel independent calls to DynamoDB:PutItem fail silently?
TIA.
DynamoDB is accessed through a web endpoint and for that reason can handle any number of concurrent connections, so the issue is not how many Lambdas are writing.
I typically see this happen when users do not allow the Lambda to wait until the API requests are complete and the container gets shut down prematurely. I would first check your code and ensure that your Lambda stays alive until all items have been processed; you can do this by adding some simple logging to your code.
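As a minimal sketch of that kind of logging (Python with boto3 is assumed here, along with a hypothetical table name and an SNS-wrapped event shape; boto3 calls are synchronous, so each logged response confirms the write completed before the handler returned):

    import json
    import logging
    import boto3

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    dynamodb = boto3.client("dynamodb")

    def handler(event, context):
        # Assumes the CodeBuild notification arrives wrapped in an SNS record.
        for record in event["Records"]:
            detail = record["Sns"]["Message"]
            resp = dynamodb.put_item(
                TableName="build-notifications",  # hypothetical table name
                Item={"id": {"S": context.aws_request_id}, "detail": {"S": detail}},
            )
            # Logging each response makes "silently ignored" writes visible.
            logger.info("PutItem HTTP %s", resp["ResponseMetadata"]["HTTPStatusCode"])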
What you are describing is a good use case for Step Functions.
As much as Lambda functions are great glue between services, they have their overheads and their limitations. With Step Functions, you can call DynamoDB:PutItem directly, and you can handle various scenarios and flows, such as async calls. These flows are possible to implement in a Lambda function, but with less visibility and less traceability.
BTW, you can also call a Lambda function from Step Functions; however, I recommend you try the direct service call to maximize the benefits of the Step Functions service.
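For illustration, here is a minimal sketch of a state machine definition using the direct DynamoDB PutItem service integration (written as a Python dict for convenience; the table and attribute names are assumptions):

    import json

    state_machine = {
        "StartAt": "PutNotification",
        "States": {
            "PutNotification": {
                "Type": "Task",
                # Direct service integration: no Lambda needed for the write.
                "Resource": "arn:aws:states:::dynamodb:putItem",
                "Parameters": {
                    "TableName": "build-notifications",  # hypothetical table name
                    "Item": {
                        "id": {"S.$": "$.id"},
                        "detail": {"S.$": "$.detail"},
                    },
                },
                "End": True,
            }
        },
    }
    print(json.dumps(state_machine, indent=2))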
My mistake, I had a separate issue which was messing up some of the range keys and causing updates to "fail" silently. But thanks for the tip regarding timeouts.
I just inherited someone else's code that uses a serverless Lambda function to process records from DynamoDB. The original developer is using DynamoDB much like RabbitMQ: as a temporary staging area with some level of fault tolerance, plus a Lambda function that will process the records at a later date.
We currently have a way to delay message publication in RabbitMQ at my company, but this feature is missing on the AWS side of the fence.
I wrote some code in my serverless Lambda function so that it checks a special property called ProcessAfter (a UTC DateTime) and effectively skips processing any given DynamoDB record if the current UTC date/time is less than that specified by ProcessAfter. However, DynamoDB never sends me that record again. It appears that DynamoDB only ever allows a single attempt at processing a record (excluding the built-in retries on exceptions), so I'm stuck with my attempted solution to implementing a delay capability.
Is there any way to replicate the delay functionality in DynamoDB, or in my Lambda function, so that messages are skipped and then re-processed as often as necessary until the delay is over and the record is successfully processed?
Looks like you are listening to DynamoDB Streams. They work like this: when a configured event (insert, update, etc.) happens to a record, the event is sent to a listener for processing.
For your specific scenario, you need an SQS queue in place if you want to process a record later rather than as soon as you receive it.
The better architecture I would advise is to add an extra SQS queue and Lambda. The Lambda will listen to the DynamoDB stream event, compare ProcessAfter with the current time to compute the delay, set that delay as the message's DelaySeconds, and send the message to SQS, as sketched below.
Finally, a Lambda listener on the queue will receive and process the message after the specified delay (or zero delay, as required).
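A minimal sketch of that middle Lambda (Python with boto3 is assumed, as are an ISO-8601 UTC string in ProcessAfter and a QUEUE_URL environment variable; note that SQS caps DelaySeconds at 900 seconds, so longer delays need re-queueing):

    import json
    import os
    from datetime import datetime, timezone

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = os.environ["QUEUE_URL"]  # hypothetical delay queue

    def handler(event, context):
        for record in event["Records"]:
            if record["eventName"] != "INSERT":
                continue
            item = record["dynamodb"]["NewImage"]
            # Assumes ProcessAfter is stored as an ISO-8601 UTC string.
            process_after = datetime.fromisoformat(item["ProcessAfter"]["S"])
            delay = (process_after - datetime.now(timezone.utc)).total_seconds()
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=json.dumps(item),
                # SQS allows at most 900 seconds; clamp and re-queue if longer.
                DelaySeconds=max(0, min(int(delay), 900)),
            )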
We are using an Fn function for OCI API Gateway authorisation (https://docs.cloud.oracle.com/en-us/iaas/Content/APIGateway/Tasks/apigatewayusingauthorizerfunction.htm). We're finding that there is a slight delay in the auth process when it hasn't been triggered for a while, as an instance of the function container spins up, which is expected. As the Oracle documentation states:
When the function has finished executing and after a period being idle, the Docker container is removed. If Oracle Functions receives another call to the same function before the container is removed, the second request is routed to the same running container. If Oracle Functions receives a call to a function that is currently executing inside a running container, Oracle Functions scales horizontally to serve both incoming requests and a second Docker container is started. (https://docs.cloud.oracle.com/en-us/iaas/Content/Functions/Concepts/functionshowitworks.htm)
We would like to minimise or ideally eradicate this initial delay, for instance by keeping one instance of the Function running all the time. What would be the best approach?
I doubt you could keep the Fn container hot without repeatedly invoking it in the first place. One of the daft options would be to keep calling it after every "sleep" interval, but this has to be traded off against the associated Fn invocation cost per month.
Other options could be based on how long the actual operation runs for. For instance, it could be split into two operations represented by two Fns. An Fn can call another Fn, so you should be able to sequence invoking them one by one if that is achievable for your intended task.
That is known in serverless as a "cold start", and reducing the initial startup time is something that is being worked on. Until then, a health check can be used to periodically ping the function.
Essentially, create a case in the function where the URL ends in something like /status or /healthcheck. In that case:
    return response.Response(ctx, response_data=json.dumps({"status": "OK"}),
                             headers={"Content-Type": "application/json"})
In API Gateway, create a route for /status (or /healthcheck) that invokes the function, making sure to enable anonymous access.
Then set up a health check to periodically invoke the API at the /status or /healthcheck endpoint. This both keeps the function active and monitors its health. Your case could perform any needed validation rather than just returning an OK response.
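Putting it together, a minimal handler sketch might look like this (using the Python FDK; whether your function sees the request path via ctx.RequestURL() is an assumption to verify against the FDK version you use):

    import io
    import json

    from fdk import response

    def handler(ctx, data: io.BytesIO = None):
        # Assumption: the FDK context exposes the invoked path via RequestURL().
        if ctx.RequestURL().endswith(("/status", "/healthcheck")):
            return response.Response(ctx, response_data=json.dumps({"status": "OK"}),
                                     headers={"Content-Type": "application/json"})
        # ...normal authorisation logic continues here...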
Another thing to keep in mind is that API Gateway will cache responses, so depending on your chosen TTL you may need to adjust your health check timing accordingly.
This 'hot start' requirement is now covered by Oracle Cloud's "Provisioned Concurrency" feature for Functions:
https://docs.oracle.com/en-us/iaas/Content/Functions/Tasks/functionsusingprovisionedconcurrency.htm
From the documentation:
Provisioned concurrency is the ability of OCI Functions to always have available the execution infrastructure for at least a certain minimum number of concurrent function invocations.
How come I find so few examples of the KCL being used with AWS Lambda?
https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-kcl.html
It does provide a fine implementation for keeping track of your position on the stream (checkpointing).
I want to use the KCL as a consumer. My set-up is a stream with multiple shards, with a Lambda consuming from each shard. I want to use the KCL in the Lambdas to track the position of the iterator on the shard.
Why can't I find anyone who uses the KCL with Lambda?
What is the issue here?
Since you can directly consume from Kinesis in your Lambdas (using Kinesis as an event source), it doesn't make any sense to use the KCL within Lambda. The event source framework that AWS has built must be using something like the KCL to bring Lambdas up in response to Kinesis events.
It would be super weird to bring up a Lambda, initialize the KCL in the handler, and wait for events during the Lambda runtime. The Lambda will go down in 5 minutes and you'll do the same thing all over again. Doing this from an EC2 instance makes sense, but then you're reimplementing the Lambda-Kinesis integration yourself. That is what Lambda is, behind the scenes.
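For comparison, wiring Kinesis directly to a Lambda is a one-call setup; a minimal sketch with boto3 (the stream ARN and function name below are hypothetical):

    import boto3

    lambda_client = boto3.client("lambda")

    # Lambda's event source framework handles polling, batching and
    # checkpointing for you; no KCL required.
    lambda_client.create_event_source_mapping(
        EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
        FunctionName="my-consumer-function",
        StartingPosition="LATEST",
        BatchSize=100,
    )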
I do not work for AWS, so obviously I do not know the exact reason why there is no documentation, but here are my thoughts.
First of all, to run the KCL, you need to have the JVM running. This means you can only do this in a Lambda using Java, because (to my knowledge at this point) there is no way to pull other SDKs, runtimes, etc. into a Lambda; you choose one runtime at setup. So they would only be creating documentation for Java Lambdas.
Now for the more technical reason. You need to think about what a Lambda is doing, and then what the KCL is doing.
Let's start with the Lambda. Lambdas are, by design, ephemeral. They can (and will) spin up and go down constantly throughout the day. Of course, you could set up a warming scheme so the Lambdas stay up, but they will still be ephemeral by nature, and this is completely out of your control. In other words, AWS controls when and whether a Lambda stays active, and the exact method for this is not published. So you can only try to keep things warm.
What does the KCL do?
Connects to the stream
Enumerates the shards
Coordinates shard associations with other workers (if any)
Instantiates a record processor for every shard it manages
Pulls data records from the stream
Pushes the records to the corresponding record processor
Checkpoints processed records
Balances shard-worker associations when the worker instance count changes
Balances shard-worker associations when shards are split or merged
After reading through this list, let's now go back to the ephemeral nature of Lambdas. Every single time a Lambda comes up or goes down, all of this work needs to happen: a complete rebalance between shards and workers, pulling data records from the streams, setting checkpoints, and so on. You would also need to make sure that you never have more Lambdas spun up than the number of shards, as the extras would be worthless (never used in the best case, or registered as workers in the worst case, potentially causing lost messages; think about what would happen in this scenario during a rebalance).
OK, technically, could you pull it off? If you used Java and did everything in your power to keep your Lambdas warm, it could technically be possible. But back to your question: why is there no documentation? I never want to say "never", but generally speaking, Lambdas, with their ephemeral nature, are just not a good use case for the KCL. And if you don't go deep into the weeds on how the KCL works, you'll probably miss something, causing rebalancing issues and potentially causing messages to get lost.
If there is anything inaccurate here please let me know so I can update. Thanks and I hope this helps somebody.
I am looking at the documentation on Lambda limits, which says:
Number of file descriptors: 1,024
I am wondering whether this is per Lambda invocation or total across all Lambdas.
I am processing a very large number of items from a Kinesis stream and calling a web endpoint, and I seem to be hitting a bottleneck of about 1,024 concurrent connections to the API, but I'm not sure where the bottleneck is. I'm investigating limits on my load balancer and instances, but I'm also wondering if Lambda itself simply cannot create more than 1,024 concurrent outbound connections across all Lambdas.
This question is old, but a suitable answer may help others in the future. The limit, as correctly noted in the question, is 1,024 outbound connections per Lambda function. However, this limit only applies for the life cycle of the container. There are currently no public documents stating the length of the life cycle, but my own testing produced the following:
A new container is created after 5 minutes of idle time for the Lambda function
A new container is created after 60 minutes of frequent use of the Lambda function
A new container is created on any update to the code or configuration of the Lambda
A final note on new containers: when a new container is created, it runs all of your code from the start, whereas invoking a warm container just invokes the handler, skipping the loading of libraries and so on. Because of this, it is a best practice to implement connection pooling and declare the connection outside of the handler so that it can be reused in subsequent invocations; examples of this can be found in the AWS docs.
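A minimal sketch of that pattern, assuming Python with boto3 and a hypothetical table name:

    import boto3

    # Created once per container during the init phase; every warm invoke
    # of this container reuses the same connection.
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("example-table")  # hypothetical table name

    def handler(event, context):
        # Only this function body runs on warm invokes.
        table.put_item(Item={"id": event["id"]})
        return {"status": "ok"}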