How to load big file model in a lambda function - aws-lambda

Say, I want a lambda function to predict incoming message category with a trained model. However, the model is over-sized (~ 1GB).
With current architecture, I should upload the trained model to AWS S3 and then load it every time the lambda is triggered. This is not desirable since most of time is loading the model.
Some solution in mind:
Don't use lambda. Have a dedicated ec2 instance to work
Keep in warm by periodically sending dummy request
Or, I suspect AWS will cache the file, so the next loading time could be shorter?

I think reading about container reuse in lambda could be helpful here.
https://aws.amazon.com/blogs/compute/container-reuse-in-lambda/
You can add the model as a global cached variable by declaring and initialising it outside the handler function. And if Lambda is reusing the same container for subsequent requests the file won't be re-downloaded.
But it's entirely up to Lambda whether to reuse the container or start a new one. Since this is Lambda's prerogative you can't depend on this behaviour.
If you want to minimise the number of downloads from S3 maybe using an external managed caching solution (Elasticache, Redis) in the same AZ as your function is a possible alternative you can look at.

Related

AWS Lambda caching layer using Extensions

I have a lambda function that uses SSM ParameterStore values. It queries the ParameterStore and stores the fetched values in Lambda env variables so that next time it can use them from env variables instead of making calls to the ParameterStore which is working fine if the lambda is in a hot-state, but during the cold start still, my lambda is making many calls to ParameterStore during peak traffic and getting throttling exceptions.
I'm looking to reduce the num of calls to the parameter store by having a caching layer, I found this article online, but I'm new to Lambda extensions. I just wanted to check if this caching works between the cold starts or not before I create a POC. please advise.
Thanks in advance!

DynamoDB:PutItem calls silently ignored

I have a Lambda function bound to CodeBuild notifications; a Lambda instance writes details of the notification that triggered it to a DynamoDB table (BillingMode PAY_PER_REQUEST)
Each CodeBuild notification spawns an independent Lambda instance. A CodeBuild build can spawn 7-8 separate notifications/Lambda instances, many of which often happen simultaneously.
The Lambda function uses DynamoDB:PutItem to put details of the notification to DynamoDB. What I find is that out of 7-8 notifications in a 30 second period, sometimes all 7-8 get written to DynamoDB, but sometimes it can be as low as 0-1; many calls to DynamoDB:PutItem simply seem to be "ignored".
Why is this happening?
My guess is that DynamoDB simply shouldn't be accessed by multiple Lambda instances in this way; that best practice is to push the updates to a SQS queue bound to a separate Lambda, and have that separate Lambda write many updates to DynamoDB as part of a transaction.
Is that right? Why might parallel independent calls to DynamoDB:PutItem fail silently?
TIA.
DynamoDB uses a web endpoint and for that reason it can handle any number of concurrent connections, so the issue is not with how many Lambdas are writing.
I typically see this happen when users do not allow the Lambda to wait until the API requests are complete and the container gets shut down prematurely. I would first check your code and ensure that your lambda is staying alive for all items to be processed, you can do this by adding some simple logging in your code.
What you are describing is a good use case for Step Functions.
As much as Lambda functions are great to glue between services, they have their overheads and their limitations. With Step Functions, you can call directly to DynamoDB:PutItem, and you can handle various scenarios and flows, such as Async calls. These flows are possible to implement in a Lambda function, however with less visibility and with less traceability.
BTW, you can also call a Lambda function from Step Functions, however, I recommend you to try and use the direct service call to maximize the benefits of the Step Functions service.
My mistake, I had a separate issue which was messing up some of the range keys and causing updates to "fail" silently. But thx for the tip regarding timeouts

Best method to persist data from an AWS Lambda invocation?

I use AWS Simple Email Services (SES) for email. I've configured SES to save incoming email to an S3 bucket, which triggers an AWS Lambda function. This function reads the new object and forwards the object contents to an alternate email address.
I'd like to log some basic info. from my AWS Lambda function during invocation -- who the email is from, to whom it was sent, if it contained any links, etc.
Ideally I'd save this info. to a database, but since AWS Lambda functions are costly (relatively so to other AWS ops.), I'd like to do this as efficiently as possible.
I was thinking I could issue an HTTPS GET request to a private endpoint with a query-string containing the info. I want logged. Since I could fire my request async. at the outset and continue processing, I thought this might be a cheap and efficient approach.
Is this a good method? Are there any alternatives?
My Lambda function fires irregularly so despite Lambda functions being kept alive for 10 minutes or so post-firing, it seems a database connection is likely slow and costly since AWS charges per 100ms of usage.
Since I could conceivable get thousands of emails/month, ensuring my Lambda function is efficient is paramount to cost. I maintain 100s of domain names so my numbers aren't exaggerated. Thanks in advance.
I do not think that thousands per emails per month should be a problem, these cloud services have been developed with scalability in mind and can go way beyond the numbers you are suggesting.
In terms of persisting, I cannot really understand - lack of logs, metrics - why your db connection would be slow. From the moment you use AWS, it will use its own internal infrastructure so speeds will be high and not something you should be worrying about.
I am not an expert on billing but from what you are describing, it seems like using lambdas + S3 + dynamoDB is highly optimised for your use case.
From the type of data you are describing (email data) it doesn't seem that you would have neither a memory issue (lambdas have mem constraints which can be a pain) or an IO bottleneck. If you can share more details on your memory used during invocation and the time taken that would be great. Also how much data you store on each lambda invocation.
I think you could store jsonified strings of your email data in dynamodb easily, it should be pretty seamless and not that costly.
Have not used (SES) but you could put a trigger on DynamoDB whenever you store a record, in case you want to follow up with another lambda.
You could combine S3 + dynamoDB. When you store a record, simply upload a file containing the record to a new S3 key and update the row in DynamoDB with a pointer to the new S3 object
DynamoDB + S3
You can now persist data using AWS EFS.

How can I trigger one AWS Lambda function from another, guaranteeing the second only runs once?

I've built a bit of a pipeline of AWS Lambda functions using the Serverless framework. There are currently five steps/functions, and I need them to run in order and each run exactly once. Roughly, the functions are:
Trigger function by an HTTP request, respond with an ID.
Access and API to get the URL of a resource to download.
Download that resource and upload a copy to S3.
Alter that resource and upload the altered copy to S3.
Submit the altered resource to a different API.
The specifics aren't important, but the question is: What's the best event/trigger to use to move along down this line of functions? The first one is triggered by an HTTP call, but the first one needs to trigger the second somehow, then the second triggers the third, and so on.
I wrote all the code using AWS SNS, but now that I've deployed it to staging I see that SNS often triggers more than once. I could add a bunch of code to detect this, but I'd rather not. And the problem is also compounding -- if the second function gets triggered twice, it sends two SNS notifications to trigger step three. If either of those notifications gets doubled... it's not unreasonable that the last function could be called ten times instead of once.
So what's my best option here? Trigger the chain through HTTP? Kinesis maybe? I have never worked with a trigger other than HTTP or SNS, so I'm not really sure what my options are, and which options are guaranteed to only trigger the function once.
AWS Step Functions seems pretty well targeted at this use-case of tying together separate AWS operations into a coherent workflow with well-defined error handling.
Not sure if the pricing will work for you (can be pricey for millions+ operations) but it may be worth looking at.
Also not sure about performance overhead or other limitations, so YMMV.
You can simply trigger the next lambda asynchronously in your lambda function after you complete the required processing in that step.
So, the first lambda is triggered by an HTTP call and in that lambda execution, after you finish processing this step, just launch the next lambda function asynchronously instead of sending the trigger through SNS or Kinesis. Repeat this process in each of your steps. This would guarantee single time execution of all the steps by lambda.
Eventful Lambda triggers (SNS, S3, CloudWatch, ...) generally guarantee at-least-once invocation, not exactly-once. As you noted you'd have to handle deduplication manually by, for example, keeping track of event IDs in DynamoDB (using strongly consistent reads!), or by implementing idempotent Lambdas, meaning functions that have no additional effects even when invoked several times with the same input. In your example step 4 is essentially idempotent providing that the function doesn't have any side effects apart from storing the altered copy, and that the new copy overwrites any previously stored copies with the same event ID.
One service that does guarantee exactly-once delivery out of the box is SQS FIFO. This service unfortunately cannot be used to trigger Lambdas directly so you'd have to set up a scheduled Lambda to poll the FIFO queue periodically (as per this answer). In your case you could handle step 5 with this arrangement, since I'm assuming you don't want to submit the same resource to the target API several times.
So in summary here's how I'd go about it:
Lambda A, invoked via HTTP, responds with ID and proceeds to asynchronously fetch resource from the API and store it to S3
Lambda B, invoked by S3 upload event, downloads the uploaded resource, alters it, stores the altered copy to S3 and finally pushes a message into the FIFO SQS queue using the altered resource's filename as the distinct deduplication ID
Lambda C, invoked by CloudWatch scheduler, polls the FIFO SQS queue and upon a new message fetches the specified altered resource from S3 and submits it to the other API
With this arrangement even if Lambda B is occasionally executed twice or more by the same S3 upload event there's no harm done since the FIFO SQS queue handles deduplication for you before the flow reaches Lambda C.
AWS Step function is meant for you: https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html
You will execute the steps you want based on previous executions outputs.
Each task/step just need to output a json correctly in the wanted "state".
https://docs.aws.amazon.com/step-functions/latest/dg/concepts-states.html
Based on the state, your workflow will move on. You can create your workflow easily and trigger lambdas, or ECS tasks.
ECS tasks are your own "lambda" environment, running without the constraints of the AWS Lambda environment.
With ECS tasks you can run on Bare metal, on your own EC2 machine, or in ECS Docker containers on ECS and thus have unlimited resources extensible limits.
As compared to Lambda where the limits are pretty strict: 500Mb of disk, execution limited in time, etc.

Amazon Web Services: Spark Streaming or Lambda

I am looking for some high level guidance on an architecture. I have a provider writing "transactions" to a Kinesis pipe (about 1MM/day). I need to pull those transactions off, one at a time, validating data, hitting other SOAP or Rest services for additional information, applying some business logic, and writing the results to S3.
One approach that has been proposed is use Spark job that runs forever, pulling data and processing it within the Spark environment. The benefits were enumerated as shareable cached data, availability of SQL, and in-house knowledge of Spark.
My thought was to have a series of Lambda functions that would process the data. As I understand it, I can have a Lambda watching the Kinesis pipe for new data. I want to run the pulled data through a bunch of small steps (lambdas), each one doing a single step in the process. This seems like an ideal use of Step Functions. With regards to caches, if any are needed, I thought that Redis on ElastiCache could be used.
Can this be done using a combination of Lambda and Step Functions (using lambdas)? If it can be done, is it the best approach? What other alternatives should I consider?
This can be achieved using a combination of Lambda and Step Functions. As you described, the lambda would monitor the stream and kick off a new execution of a state machine, passing the transaction data to it as an input. You can see more documentation around kinesis with lambda here: http://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html.
The state machine would then pass the data from one Lambda function to the next where the data will be processed and written to S3. You need to contact AWS for an increase on the default 2 per second StartExecution API limit to support 1MM/day.
Hope this helps!

Resources