Running a lambda or writing a script? - bash

I have a Lambda attached to an S3 bucket which will contain historical data (20-25 MB). This data is organised into folders by month, and each month has over 400k records in a txt file. The Lambda triggers on every S3EventNotification, parses the file line by line, and saves the records to a DynamoDB table. I need to load the historical data into the DynamoDB table before launching to prod. Is it better to write a script than to run the Lambda?
From the research I've done, since the file size may be large, the Lambda can time out. Also, the Lambda's /tmp storage is restricted to 512 MB.

Your question is a little unclear, but I think you are asking how to run the Lambda function again over all objects that are already in the S3 bucket.
The easiest method is to follow Using AWS Lambda with Amazon S3 batch operations - AWS Lambda:
You can use Amazon S3 batch operations to invoke a Lambda function on a large set of Amazon S3 objects. Amazon S3 tracks the progress of batch operations, sends notifications, and stores a completion report that shows the status of each action.
This way, the Lambda function can be triggered again as if the objects had been freshly uploaded. Note that multiple objects might be passed to a single Lambda function invocation, so ensure that the function is looping through the event['Records'] list, rather than merely processing event['Records'][0].
If you fear that the Lambda function might time out, you can increase the timeout to a maximum of 15 minutes. Allocating more memory to a function also allocates more CPU, which might make it run faster (but costs also increase). After processing a file, be sure to delete it from /tmp/ to avoid hitting the limit.
However, if objects are bigger than 512 MB or take longer than 15 minutes to process, then using an AWS Lambda function is not appropriate.
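As a rough sketch of such a handler (the table name and the line format are assumptions for illustration, not from the question), looping over every record and streaming each object line by line, which also sidesteps the 512 MB /tmp limit for text files:

import urllib.parse
import boto3

s3 = boto3.client('s3')
table = boto3.resource('dynamodb').Table('history-table')  # placeholder table name

def parse_line(line):
    # Hypothetical record layout "id,month,value" -- adjust to the real format
    record_id, month, value = line.split(',', 2)
    return {'id': record_id, 'month': month, 'value': value}

def handler(event, context):
    # Loop over ALL records, not just event['Records'][0]
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        # Stream the object instead of staging it in /tmp
        body = s3.get_object(Bucket=bucket, Key=key)['Body']
        # batch_writer batches and retries the DynamoDB writes automatically
        with table.batch_writer() as batch:
            for raw_line in body.iter_lines():
                line = raw_line.decode('utf-8').strip()
                if line:
                    batch.put_item(Item=parse_line(line))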

Related

Limit AWS SQS messages visible per second or AWS Lambda invocations per second

I am implementing a solution that involves SQS triggering a Lambda function, which uses a 3rd-party API to perform some operations.
That 3rd-party API has a limit of requests per second, so I would like to limit the number of SQS messages processed by my Lambda function to a similar rate.
Is there any way to limit the number of messages visible per second on the SQS queue, or the number of invocations per second of a Lambda function?
[edited]
After some insights given in the comments about AWS Kinesis:
There is no lean solution based on tuning the Kinesis parameters Batch Window, Batch Size, and payload size, because Kinesis triggers the Lambda execution as soon as ANY of those thresholds is reached:
* Let N = the max number of requests per second I can execute against the 3rd-party API.
* Configuring a Batch Window of 1 second and a Batch Size of N, back pressure could still trigger the execution with more than N requests.
* Configuring a Batch Window of 1 second and a Batch Size of MAX_ALLOWED_VALUE will underperform and also does not guarantee executing fewer than N requests per second.
The simplest solution I have found is creating a Lambda with a fixed execution rate of 1 second that reads a fixed number of messages N from SQS/Kinesis and writes them to another SQS/Kinesis, which has another Lambda as its endpoint.
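A minimal sketch of that "pump" Lambda, assuming plain SQS on both sides (the queue URLs and N are placeholder assumptions). Note that scheduled rules fire at most once per minute, so the once-per-second cadence has to live inside the invocation:

import time
import boto3

sqs = boto3.client('sqs')

SOURCE_QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/source'  # placeholder
TARGET_QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/target'  # placeholder
N = 10  # assumed max requests per second the 3rd-party API tolerates

def handler(event, context):
    # Invoked once per minute by a scheduled rule; pumps at most N
    # messages per second for ~55 seconds, then exits
    for _ in range(55):
        messages = sqs.receive_message(
            QueueUrl=SOURCE_QUEUE_URL,
            MaxNumberOfMessages=min(N, 10),  # receive_message caps at 10
        ).get('Messages', [])
        for message in messages:
            sqs.send_message(QueueUrl=TARGET_QUEUE_URL,
                             MessageBody=message['Body'])
            sqs.delete_message(QueueUrl=SOURCE_QUEUE_URL,
                               ReceiptHandle=message['ReceiptHandle'])
        time.sleep(1)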
This is a difficult situation.
Amazon SQS can trigger multiple AWS Lambda functions in parallel, so there is no central oversight of how fast requests are made to the 3rd-party API.
From Managing concurrency for a Lambda function - AWS Lambda:
To ensure that a function can always reach a certain level of concurrency, you can configure the function with reserved concurrency. When a function has reserved concurrency, no other function can use that concurrency. Reserved concurrency also limits the maximum concurrency for the function, and applies to the function as a whole, including versions and aliases.
Therefore, concurrency can be used to limit the number of Lambda functions executing simultaneously, but this does not necessarily map to "x API calls per second". That depends on how long the Lambda function takes to execute (eg 2 seconds) and how many API calls it makes in that time (eg 2 API calls). For example, a concurrency of 10 with 2-second executions making 2 API calls each sustains about 10 API calls per second.
It might be necessary to introduce delays either within the Lambda function (not great because you are still paying for the function to run while waiting), or outside the Lambda function (by triggering the Lambda functions in a different way, or even doing the processing outside of Lambda).
The easiest (but not efficient) method might be:
Set a concurrency of 1
Have the Lambda function retry the API call if it is rejected
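Setting that reserved concurrency is a one-off call; a sketch with boto3 (the function name is a placeholder):

import boto3

lambda_client = boto3.client('lambda')

# Cap the function at a single concurrent execution so calls to the
# 3rd-party API are serialised
lambda_client.put_function_concurrency(
    FunctionName='third-party-api-worker',  # placeholder name
    ReservedConcurrentExecutions=1,
)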
Thanks to @John Rotenstein, who gave a comprehensive and detailed answer about the SQS part.
If your design is limited to a single consumer, then you may replace SQS with Kinesis streams. By doing so, you can use the Batch Window option of Kinesis to limit the requests made by the consumer. The Batch Window option is used to reduce the number of invocations:
Lambda reads records from a stream at a fixed cadence (e.g. once per second for Kinesis data streams) and invokes a function with a batch of records. Batch Window allows you to wait as long as 300s to build a batch before invoking a function. Now, a function is invoked when one of the following conditions is met: the payload size reaches 6MB, the Batch Window reaches its maximum value, or the Batch Size reaches its maximum value. With Batch Window, you can increase the average number of records passed to the function with each invocation. This is helpful when you want to reduce the number of invocations and optimize cost.
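Wiring that up on the event source mapping might look like this sketch (the stream ARN, function name, and values are placeholder assumptions):

import boto3

lambda_client = boto3.client('lambda')

# Lambda invokes the consumer when the batch size fills, the payload
# reaches 6 MB, or the batch window elapses -- whichever comes first
lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:kinesis:us-east-1:123456789012:stream/example',  # placeholder
    FunctionName='consumer-function',  # placeholder
    StartingPosition='LATEST',
    BatchSize=100,
    MaximumBatchingWindowInSeconds=10,
)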

How to keep desired amount of AWS Lambda function containers warm

On my project there is a REST API implemented on AWS API Gateway and AWS Lambda. Since AWS Lambda functions are serverless and stateless, when we make a call to one, AWS starts a container with the Lambda function's code to process our call. According to the AWS documentation, after the Lambda function finishes executing, AWS does not stop the container, and we are able to process the next call in that container. This approach improves the performance of the service: only on the first call does AWS spend time starting a container (a cold start of the Lambda function); all subsequent calls execute faster because they reuse the same container (warm starts).
As a next step to improve performance, we created a cron job which periodically calls our Lambda function (we use CloudWatch rules for that). This approach keeps the Lambda function "warm", avoiding the stopping and restarting of containers, i.e. when a real user calls our REST API, the Lambda will not spend time starting a new container.
But we have hit an issue: this approach keeps only one container of the Lambda function warm, while the actual number of parallel calls from different users can be much larger (in our case hundreds and sometimes even thousands of users). Is there any way to implement warm-up functionality for a Lambda function that keeps not just a single container warm, but some desired number of them?
I understand that this approach can affect the cost of using Lambda, and that it might turn out better overall to use a good old application server, but comparing those approaches and their costs will be a later step, I think; at the moment I would just like to find a way to keep a desired number of Lambda function containers warm.
This can be long, but bear with me, as it will probably give you a workaround and may help you better understand how Lambda works.
Alternatively, you can skip to "The Workaround" at the bottom if you are not interested in reading.
For folks who are not aware of cold starts, please read this blog post to better understand them. To describe it in short:
Cold Starts
When a function is executed for the first time, or after having the function's code or resource configuration updated, a container will be spun up to execute this function. All the code and libraries will be loaded into the container for it to be able to execute. The code will then run, starting with the initialisation code. The initialisation code is the code written outside the handler. This code is only run when the container is created for the first time. Finally, the Lambda handler is executed. This set-up process is what is considered a cold start.
For performance, Lambda has the ability to re-use containers created by previous invocations. This will avoid the initialisation of a new container and loading of code. Only the handler code will be executed. However, you cannot depend on a container from a previous invocation being reused. If you haven't changed the code and not too much time has gone by, Lambda may reuse the previous container. If you change the code or resource configuration, or some time has passed since the previous invocation, a new container will be initialized and you will experience a cold start.
Now consider these scenarios for better understanding:
Consider the Lambda function, in the example, is invoked for the first time. Lambda will create a container, load the code into the container and run the initialisation code. The function handler will then be executed. This invocation will have experienced a cold start. As mentioned in the comments, the function takes 15 seconds to complete. After a minute, the function is invoked again. Lambda will most likely re-use the container from the previous invocation. This invocation will not experience a cold start.
Now consider the second scenario, where the second invocation is executed 5 seconds after the first invocation. Since the previous function takes 15 seconds to complete and has not finished executing, the new invocation will have to create a new container for this function to execute. Therefore this invocation will experience a cold start.
Now to come to the first part of the problem, which you have already solved:
Regarding preventing cold starts: this is possible, however it is not guaranteed, and the common workaround will keep only one container of the Lambda function warm. To do this, you would create a CloudWatch Events rule using a schedule (cron expression) that invokes your Lambda function every couple of minutes to keep it warm.
The Workaround:
For your use case, your Lambda function will be invoked very frequently with a very high concurrency rate. To avoid as many cold starts as possible, you will need to keep as many containers warm as you expect your highest concurrency to reach. To do this, you will need to invoke the functions with a delay, to allow the concurrency of the function to build up and reach the desired number of concurrent executions. This forces Lambda to spin up the number of containers you desire. This, as a result, can increase costs, and it will not guarantee that cold starts are avoided.
That being said, here is a breakdown of how you can keep multiple containers for your function warm at one time:
You should have a CloudWatch Events rule that is triggered on a schedule. This schedule can be a fixed rate or a cron expression. For example, you can set this rule to trigger every 5 minutes. You will then specify a Lambda function (the Controller function) as the target of this rule.
Your Controller Lambda function will then invoke the Lambda function that you want to keep warm, once for each concurrent container you desire.
There are a few things to consider here:
You will have to build up concurrency, because if the first invocation finishes before another invocation starts, that invocation may reuse the previous invocation's container rather than creating a new one. To do this you will need to add some sort of delay in the Lambda function when it is invoked by the controller function. This can be done by passing a specific payload to the function with these invocations. The Lambda function that you want to keep warm then checks whether this payload exists: if it does, the function waits (to build up concurrent invocations); if it does not, the function executes as expected. (See the sketch after this list.)
You will also need to ensure you are not getting throttled on the Invoke Lambda API call if you are calling it repeatedly. Your Lambda function should be written to handle this throttling if it occurs, and consider adding a delay between API calls to avoid throttling.
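Here is the sketch referred to above: a minimal controller plus warm-up check (the function name, container count, and the 'warmup' payload key are assumptions for illustration):

import json
import time
import boto3

lambda_client = boto3.client('lambda')

TARGET_FUNCTION = 'function-to-keep-warm'  # placeholder name
DESIRED_CONTAINERS = 10  # your expected peak concurrency

def controller_handler(event, context):
    # Fire async invocations; the warm-up payload makes each target
    # sleep so the invocations overlap, forcing separate containers
    for _ in range(DESIRED_CONTAINERS):
        lambda_client.invoke(
            FunctionName=TARGET_FUNCTION,
            InvocationType='Event',  # asynchronous invoke
            Payload=json.dumps({'warmup': True, 'sleep_seconds': 5}),
        )

def target_handler(event, context):
    # Warm-up invocations wait briefly and exit without doing real work
    if isinstance(event, dict) and event.get('warmup'):
        time.sleep(event.get('sleep_seconds', 5))
        return {'warmed': True}
    # ... normal request handling goes here ...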
In the end, this solution can reduce cold starts, but it will increase costs and cannot guarantee that cold starts won't occur, as they are inevitable when working with Lambda. If your application needs faster response times than what occurs with a Lambda cold start, I would recommend looking into having your server on an EC2 instance.
We are using Java (Spring Boot) Lambdas and have come to pretty much an identical solution to Kush Vyas's answer above, which works very well.
We did find during load testing, however, that a legitimate user request would often arrive while the "Controller function" was executing, again causing the inevitable cold start...
So, now in our "Controller function", we have our regular number of X concurrent warm-up requests; however, on every 5th execution of the function we call our target Lambda an additional 2 times. The theory is that we will end up with X+2 Lambdas staying warm, but for 4 out of 5 warm-up calls there will still be 2 redundant Lambdas that can service user requests.
It reduced our number of cold starts even further (but obviously still not completely), and we are still playing with combinations of concurrency, warm-up frequency, and sleep time to find the optimum solution for us; these values will always likely depend on the load requirements of a specific situation.
AWS just announced this:
https://aws.amazon.com/about-aws/whats-new/2019/12/aws-lambda-announces-provisioned-concurrency/
Be aware though that it is not free; for our simple use case of keeping 10 Lambda instances warm, it seems our daily cost would increase from $0.06 to $4.
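For completeness, a sketch of enabling it with boto3 (the function name and alias are placeholders; provisioned concurrency must target a published version or alias, not $LATEST):

import boto3

lambda_client = boto3.client('lambda')

# Keep 10 execution environments initialised and ready to serve
lambda_client.put_provisioned_concurrency_config(
    FunctionName='my-api-function',  # placeholder name
    Qualifier='live',                # placeholder alias
    ProvisionedConcurrentExecutions=10,
)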
If you use the Serverless Framework with AWS Lambda, you can use this plugin to keep all your Lambdas warm with a certain level of concurrency.
I'd like to share a small but useful tip which we use to reduce the cold-start delay observed by users. In our case the Lambda function handles HTTP requests from the front-end via AWS API Gateway; in particular, it executes search functionality when the user types something into an input field. Usually the user starts typing with some delay after the UI is rendered, so we have some time to execute a ping call to our Lambda function to warm it up. By the time the user makes requests to the back-end, the Lambda will most likely be ready for work.
Admittedly, this approach does nothing to fix the issue of cold starts on the back-end side, and you will need to look at other options to fix that, but it can be a user-experience improvement for little effort (something like a hotfix).
One thing you should remember: if your service is public and you care about your Google Insights score, you should be careful implementing such an approach.

Filename List S3 with many users

I have an Android and iOS app that uploads images (about 15,000 per minute) to an AWS S3 bucket. Everything is all right, but I need to process those images in a web app that is used by 2 to 50 different users called 'Monitores'. When this kind of user logs in and begins to process the images, the app scans the S3 bucket for the filenames, something like:
$recibidos = Storage::disk('s3recibidos');
$total_archivos = $recibidos->allFiles();
This generates an array with the files stored at the moment the route is invoked. If one user runs the process there is no problem, because the process runs only once, but what if 2 or more users trigger this process? Each of them retrieves a slightly different list, and I think many of the unprocessed files will be processed twice.
Processing a filename means storing it in a database and moving the file to a subdirectory.
For example:
I have 1,000 files in the AWS S3 bucket and user1 invokes the process, so the array will have 1,000 filenames to process. Right now the time to process those files is about 3 minutes, so before the process finishes, 1,000 new files are added to the AWS S3 bucket; these files are not in user1's array. Then user2 logs in and begins to process, so now the S3 bucket has both new and old files, and user2's array will include some old filenames (the ones not yet processed). In fact, by the time user2 processes those files, some of them are no longer available, because user1's process has already done the job.
I need help with these two things:
1.- How to deal with the process.
2.- How can I use wildcards? One of the final steps changes the filenames of the files in S3, so the filename list that I need to process has a specific format.
Thanks for any advice.
I'm a little confused about your process, but let's assume:
You have a large number of incoming images
You need to perform some operation on each of those images
There are two recommended approaches to do this:
Option 1: Serverless
Configure the Amazon S3 bucket to trigger an AWS Lambda function whenever a new object is created in the bucket
Create an AWS Lambda function as a worker -- it receives information about each file, then processes the file
AWS Lambda will automatically scale to run multiple Lambda functions in parallel. The default is up to 1000 concurrent Lambda functions, but this can be increased upon request.
Option 2: Traditional
Create an Amazon SQS queue to store details of images to process
Configure the Amazon S3 bucket to send an event to the SQS queue whenever a new object is created in the bucket
Use Amazon EC2 instance(s) to run multiple workers
Each worker reads the file information from the queue, processes the image, then deletes the message from the queue. It then repeats, pulling the next message from the queue.
Scale the number of EC2 instances and/or workers as necessary
Both of these approaches have workers operating on one image file at a time, so you do not have the problem of maintaining lists while images are continually being added. They are also highly scalable with no code changes.
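As a rough sketch of a worker for Option 2 (the queue URL and process_image are placeholder assumptions), each image is claimed, processed, and deleted one at a time:

import json
import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/image-events'  # placeholder

def process_image(bucket, key):
    # Placeholder for the real work: record in database, move to subdirectory
    print(f'processing s3://{bucket}/{key}')

def worker_loop():
    while True:
        # Long polling keeps idle workers from hammering the SQS API
        messages = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,
        ).get('Messages', [])
        for message in messages:
            event = json.loads(message['Body'])
            for record in event.get('Records', []):
                process_image(record['s3']['bucket']['name'],
                              record['s3']['object']['key'])
            # Delete only after successful processing, so a crashed
            # worker's message becomes visible again for another worker
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=message['ReceiptHandle'])

Because a message stays invisible to other workers while it is being processed, two 'Monitores' can never claim the same file, which removes the duplicate-list problem.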

AWS Kinesis stream sending data to Lambda at slower rate

I needed to implement a stream solution using AWS Kinesis streams & Lambda.
Lambda function 1 -
It adds data to the stream and is invoked every 10 seconds, adding 100 data records (each of 1 KB) per invocation. I am running two instances of the script that invokes this Lambda function.
Lambda function 2 -
This Lambda uses the above stream as a trigger. With a small volume of data / a short interval, the second Lambda gets the data almost immediately. But at the rates above, data arrives slower than usual (10 minutes behind after 1+ hours of streaming).
I checked the logic of both Lambda functions and verified that the first Lambda does not add latency before pushing data to the stream. I also verified this from the stream records in the second Lambda, where the gap between approximateArrivalTimestamp and the current time is clearly increasing.
Kinesis itself did not show any issues / throttling in its analytics (I am using 1 shard).
Are there any architectural changes I need to make to have this run more smoothly? I need to scale up at least 10 times in later benchmarks, e.g. 20 invocations of the first Lambda with 200 records each and timeouts of 1-10 seconds.
I am using 100 as the batch size. Would increasing or decreasing it help?
UPDATE: As I explored more online, I found suggestions to put an async, front-facing Lambda on the Kinesis trigger, which in turn invokes the actual Lambda asynchronously, so that Lambda processing time does not become the bottleneck. However, this approach also failed: I have the same latency issue. I checked the execution times; the front-facing Lambda finishes in 1 second. But I still get a big gap between approximateArrivalTimestamp and the current time in both Lambdas.
Please help!
For one shard, there will only be one instance of the 2nd Lambda.
So it works like this for the 2nd Lambda: the Lambda reads the configured batch size of records from the stream and processes them. It won't read further records until the previous records have been successfully processed.
By adding a second shard, you would have 2 Lambdas processing the records. Thus the way to scale this architecture, as I see it, is to increase the number of shards; however, make sure data is evenly distributed across the shards.
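If you go that route, resharding can be done in place; a sketch (the stream name and counts are placeholders), with the caveat that an even spread also depends on high-cardinality partition keys on the producer side:

import uuid
import boto3

kinesis = boto3.client('kinesis')

# Double the shard count so two Lambda instances can read in parallel
kinesis.update_shard_count(
    StreamName='my-stream',  # placeholder name
    TargetShardCount=2,
    ScalingType='UNIFORM_SCALING',
)

# On the producer side, a random partition key spreads records
# evenly across the shards
kinesis.put_record(
    StreamName='my-stream',
    Data=b'{"example": "payload"}',
    PartitionKey=str(uuid.uuid4()),
)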

Lambda takes more time when multiple Lambdas run at the same time

I have a problem with AWS Lambda execution time.
I have 8 Lambdas; each of my Lambda functions gets the same data from S3, draws images, then uploads them to S3. So I use SNS to distribute the trigger event to each Lambda.
When testing, I ran just one of them (call it lambda_1), and it took only about 200 s to execute. But when I add all the Lambdas to the SNS topic so they run at the same time, lambda_1 takes more than 200 s and even times out (over 300 s).
I use the same data and the same configuration for both runs; the only difference is that the 8 Lambdas execute at the same time (triggered by the SNS event).
Is this expected AWS Lambda behavior? I have no idea about this.
Any documentation or suggestions are much appreciated!
Thanks
