Best method to persist data from an AWS Lambda invocation? - aws-lambda

I use AWS Simple Email Services (SES) for email. I've configured SES to save incoming email to an S3 bucket, which triggers an AWS Lambda function. This function reads the new object and forwards the object contents to an alternate email address.
I'd like to log some basic info. from my AWS Lambda function during invocation -- who the email is from, to whom it was sent, if it contained any links, etc.
Ideally I'd save this info. to a database, but since AWS Lambda functions are costly (relatively so to other AWS ops.), I'd like to do this as efficiently as possible.
I was thinking I could issue an HTTPS GET request to a private endpoint with a query-string containing the info. I want logged. Since I could fire my request async. at the outset and continue processing, I thought this might be a cheap and efficient approach.
Is this a good method? Are there any alternatives?
My Lambda function fires irregularly so despite Lambda functions being kept alive for 10 minutes or so post-firing, it seems a database connection is likely slow and costly since AWS charges per 100ms of usage.
Since I could conceivable get thousands of emails/month, ensuring my Lambda function is efficient is paramount to cost. I maintain 100s of domain names so my numbers aren't exaggerated. Thanks in advance.

I do not think that thousands per emails per month should be a problem, these cloud services have been developed with scalability in mind and can go way beyond the numbers you are suggesting.
In terms of persisting, I cannot really understand - lack of logs, metrics - why your db connection would be slow. From the moment you use AWS, it will use its own internal infrastructure so speeds will be high and not something you should be worrying about.
I am not an expert on billing but from what you are describing, it seems like using lambdas + S3 + dynamoDB is highly optimised for your use case.
From the type of data you are describing (email data) it doesn't seem that you would have neither a memory issue (lambdas have mem constraints which can be a pain) or an IO bottleneck. If you can share more details on your memory used during invocation and the time taken that would be great. Also how much data you store on each lambda invocation.
I think you could store jsonified strings of your email data in dynamodb easily, it should be pretty seamless and not that costly.
Have not used (SES) but you could put a trigger on DynamoDB whenever you store a record, in case you want to follow up with another lambda.
You could combine S3 + dynamoDB. When you store a record, simply upload a file containing the record to a new S3 key and update the row in DynamoDB with a pointer to the new S3 object
DynamoDB + S3

You can now persist data using AWS EFS.

Related

Invoking 1 AWS Lambda with API Gateway sequentially

I know there's a question with the same title but my question is a little different: I got a Lambda API - saveInputAPI() to save the value into a specified field. Users can invoke this API with different parameter, for example:
saveInput({"adressType",1}); //adressType is a DB field.
or
saveInput({"name","test"}) //name is a DB field.
And of course, this hosts on AWS so I'm also using API Gateway as well. But the problem is sometimes, an error like this happened:
As you can see. API No. 19 was invoked first but ended up finishing later
(10:10:16:828) -> (10:10:18:060)
While API No.18 was invoked later but finished sooner...
(10:10:17:611) -> (10:10:17:861)
This leads to a lot of problems in my project. And sometimes, the delay between 2 API was up to 10 seconds. The front project acts independently so users don't know what happens behind. They think they have set addressType to 1 but in reality, the addressType is still 2. Since this project is large and I cannot change this kind of [using only 1 API to update DB value] design. Is there any way for me to fix this problem ?? Really appreciate any idea. Thanks
If updates to Database can't be skipped if last updated timestamp is more recent than the source event timestamp, we need to decouple Api Gateway and Lambda.
Api Gateway writes to SQS FIFO Queue.
Lambda to consume SQS and process the request.
This will ensure older event is processed first.
Amazon Lambda is asynchronous by design. That means that trying to make it synchronous and predictable is kind of waste.
If your concern is avoiding "old" data (in a sense of scheduling) overwrite "fresh" data, then you might consider timestamping each data and then applying constraints like "if you want to overwrite target data, then your source timestamp have to be in the future compared to timestamp of the targeted data"

How can I trigger one AWS Lambda function from another, guaranteeing the second only runs once?

I've built a bit of a pipeline of AWS Lambda functions using the Serverless framework. There are currently five steps/functions, and I need them to run in order and each run exactly once. Roughly, the functions are:
Trigger function by an HTTP request, respond with an ID.
Access and API to get the URL of a resource to download.
Download that resource and upload a copy to S3.
Alter that resource and upload the altered copy to S3.
Submit the altered resource to a different API.
The specifics aren't important, but the question is: What's the best event/trigger to use to move along down this line of functions? The first one is triggered by an HTTP call, but the first one needs to trigger the second somehow, then the second triggers the third, and so on.
I wrote all the code using AWS SNS, but now that I've deployed it to staging I see that SNS often triggers more than once. I could add a bunch of code to detect this, but I'd rather not. And the problem is also compounding -- if the second function gets triggered twice, it sends two SNS notifications to trigger step three. If either of those notifications gets doubled... it's not unreasonable that the last function could be called ten times instead of once.
So what's my best option here? Trigger the chain through HTTP? Kinesis maybe? I have never worked with a trigger other than HTTP or SNS, so I'm not really sure what my options are, and which options are guaranteed to only trigger the function once.
AWS Step Functions seems pretty well targeted at this use-case of tying together separate AWS operations into a coherent workflow with well-defined error handling.
Not sure if the pricing will work for you (can be pricey for millions+ operations) but it may be worth looking at.
Also not sure about performance overhead or other limitations, so YMMV.
You can simply trigger the next lambda asynchronously in your lambda function after you complete the required processing in that step.
So, the first lambda is triggered by an HTTP call and in that lambda execution, after you finish processing this step, just launch the next lambda function asynchronously instead of sending the trigger through SNS or Kinesis. Repeat this process in each of your steps. This would guarantee single time execution of all the steps by lambda.
Eventful Lambda triggers (SNS, S3, CloudWatch, ...) generally guarantee at-least-once invocation, not exactly-once. As you noted you'd have to handle deduplication manually by, for example, keeping track of event IDs in DynamoDB (using strongly consistent reads!), or by implementing idempotent Lambdas, meaning functions that have no additional effects even when invoked several times with the same input. In your example step 4 is essentially idempotent providing that the function doesn't have any side effects apart from storing the altered copy, and that the new copy overwrites any previously stored copies with the same event ID.
One service that does guarantee exactly-once delivery out of the box is SQS FIFO. This service unfortunately cannot be used to trigger Lambdas directly so you'd have to set up a scheduled Lambda to poll the FIFO queue periodically (as per this answer). In your case you could handle step 5 with this arrangement, since I'm assuming you don't want to submit the same resource to the target API several times.
So in summary here's how I'd go about it:
Lambda A, invoked via HTTP, responds with ID and proceeds to asynchronously fetch resource from the API and store it to S3
Lambda B, invoked by S3 upload event, downloads the uploaded resource, alters it, stores the altered copy to S3 and finally pushes a message into the FIFO SQS queue using the altered resource's filename as the distinct deduplication ID
Lambda C, invoked by CloudWatch scheduler, polls the FIFO SQS queue and upon a new message fetches the specified altered resource from S3 and submits it to the other API
With this arrangement even if Lambda B is occasionally executed twice or more by the same S3 upload event there's no harm done since the FIFO SQS queue handles deduplication for you before the flow reaches Lambda C.
AWS Step function is meant for you: https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html
You will execute the steps you want based on previous executions outputs.
Each task/step just need to output a json correctly in the wanted "state".
https://docs.aws.amazon.com/step-functions/latest/dg/concepts-states.html
Based on the state, your workflow will move on. You can create your workflow easily and trigger lambdas, or ECS tasks.
ECS tasks are your own "lambda" environment, running without the constraints of the AWS Lambda environment.
With ECS tasks you can run on Bare metal, on your own EC2 machine, or in ECS Docker containers on ECS and thus have unlimited resources extensible limits.
As compared to Lambda where the limits are pretty strict: 500Mb of disk, execution limited in time, etc.

Amazon Web Services: Spark Streaming or Lambda

I am looking for some high level guidance on an architecture. I have a provider writing "transactions" to a Kinesis pipe (about 1MM/day). I need to pull those transactions off, one at a time, validating data, hitting other SOAP or Rest services for additional information, applying some business logic, and writing the results to S3.
One approach that has been proposed is use Spark job that runs forever, pulling data and processing it within the Spark environment. The benefits were enumerated as shareable cached data, availability of SQL, and in-house knowledge of Spark.
My thought was to have a series of Lambda functions that would process the data. As I understand it, I can have a Lambda watching the Kinesis pipe for new data. I want to run the pulled data through a bunch of small steps (lambdas), each one doing a single step in the process. This seems like an ideal use of Step Functions. With regards to caches, if any are needed, I thought that Redis on ElastiCache could be used.
Can this be done using a combination of Lambda and Step Functions (using lambdas)? If it can be done, is it the best approach? What other alternatives should I consider?
This can be achieved using a combination of Lambda and Step Functions. As you described, the lambda would monitor the stream and kick off a new execution of a state machine, passing the transaction data to it as an input. You can see more documentation around kinesis with lambda here: http://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html.
The state machine would then pass the data from one Lambda function to the next where the data will be processed and written to S3. You need to contact AWS for an increase on the default 2 per second StartExecution API limit to support 1MM/day.
Hope this helps!

How to load big file model in a lambda function

Say, I want a lambda function to predict incoming message category with a trained model. However, the model is over-sized (~ 1GB).
With current architecture, I should upload the trained model to AWS S3 and then load it every time the lambda is triggered. This is not desirable since most of time is loading the model.
Some solution in mind:
Don't use lambda. Have a dedicated ec2 instance to work
Keep in warm by periodically sending dummy request
Or, I suspect AWS will cache the file, so the next loading time could be shorter?
I think reading about container reuse in lambda could be helpful here.
https://aws.amazon.com/blogs/compute/container-reuse-in-lambda/
You can add the model as a global cached variable by declaring and initialising it outside the handler function. And if Lambda is reusing the same container for subsequent requests the file won't be re-downloaded.
But it's entirely up to Lambda whether to reuse the container or start a new one. Since this is Lambda's prerogative you can't depend on this behaviour.
If you want to minimise the number of downloads from S3 maybe using an external managed caching solution (Elasticache, Redis) in the same AZ as your function is a possible alternative you can look at.

Check if S3 file has been modified

How can I use a shell script check if an Amazon S3 file ( small .xml file) has been modified. I'm currently using curl to check every 10 seconds, but it's making many GET requests.
curl "s3.aws.amazon.com/bucket/file.xml"
if cmp "file.xml" "current.xml"
then
echo "no change"
else
echo "file changed"
cp "file.xml" "current.xml"
fi
sleep(10s)
Is there a better way to check every 10 seconds that reduces the number of GET requests? (This is built on top of a rails app so i could possibly build a handler in rails?)
Let me start by first telling you some facts about S3. You might know this, but in case you don't, you might see that your current code could have some "unexpected" behavior.
S3 and "Eventual Consistency"
S3 provides "eventual consistency" for overwritten objects. From the S3 FAQ, you have:
Q: What data consistency model does Amazon S3 employ?
Amazon S3 buckets in all Regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES.
Eventual consistency for overwrites means that, whenever an object is updated (ie, whenever your small XML file is overwritten), clients retrieving the file MAY see the new version, or they MAY see the old version. For how long? For an unspecified amount of time. It typically achieves consistency in much less than 10 seconds, but you have to assume that it will, eventually, take more than 10 seconds to achieve consistency. More interestingly (sadly?), even after a successful retrieval of the new version, clients MAY still receive the older version later.
One thing that you can be assured of is: if a client starts download a version of the file, it will download that entire version (in other words, there's no chance that you would receive for example, the first half of the XML file as the old version and the second half as the new version).
With that in mind, notice that your script could fail to identify the change within your 10-second timeframe: you could make multiple requests, even after a change, until your script downloads a changed version. And even then, after you detect the change, it is (unfortunately) entirely possible the the next request would download the previous (!) version, and trigger yet another "change" in your code, then the next would give the current version, and trigger yet another "change" in your code!
If you are OK with the fact that S3 provides eventual consistency, there's a way you could possibly improve your system.
Idea 1: S3 event notifications + SNS
You mentioned that you thought about using SNS. That could definitely be an interesting approach: you could enable S3 event notifications and then get a notification through SNS whenever the file is updated.
How do you get the notification? You would need to create a subscription, and here you have a few options.
Idea 1.1: S3 event notifications + SNS + a "web app"
If you have a "web application", ie, anything running in a publicly accessible HTTP endpoint, you could create an HTTP subscriber, so SNS will call your server with the notification whenever it happens. This might or might not be possible or desirable in your scenario
Idea 2: S3 event notifications + SQS
You could create a message queue in SQS and have S3 deliver the notifications directly to the queue. This would also be possible as S3 event notifications + SNS + SQS, since you can add a queue as a subscriber to an SNS topic (the advantage being that, in case you need to add functionality later, you could add more queues and subscribe them to the same topic, therefore getting "multiple copies" of the notification).
To retrieve the notification you'd make a call to SQS. You'd still have to poll - ie, have a loop and call GET on SQS (which cost about the same, or maybe a tiny bit more depending on the region, than S3 GETs). The slight difference is that you could reduce a bit the number of total requests -- SQS supports long-polling requests of up to 20 seconds: you make the GET call on SQS and, if there are no messages, SQS holds the request for up to 20 seconds, returning immediately if a message arrives, or returning an empty response if no messages are available within those 20 seconds. So, you would send only 1 GET every 20 seconds, to get faster notifications than you currently have. You could potentially halve the number of GETs you make (once every 10s to S3 vs once every 20s to SQS).
Also - you could chose to use one single SQS queue to aggregate all changes to all XML files, or multiple SQS queues, one per XML file. With a single queue, you would greatly reduce the overall number of GET requests. With one queue per XML file, that's when you could potentially "halve" the number of GET request as compared to what you have now.
Idea 3: S3 event notifications + AWS Lambda
You can also use a Lambda function for this. This could require some more changes in your environment - you wouldn't use a Shell Script to poll, but S3 can be configured to call a Lambda Function for you as a response to an event, such as an update on your XML file. You could write your code in Java, Javascript or Python (some people devised some "hacks" to use other languages as well, including Bash).
The beauty of this is that there's no more polling, and you don't have to maintain a web server (as in "idea 1.1"). Your code "simply runs", whenever there's a change.
Notice that, no matter which one of these ideas you use, you still have to deal with eventual consistency. In other words, you'd know that a PUT/POST has happened, but once your code sends a GET, you could still receive the older version...
Idea 4: Use DynamoDB instead
If you have the ability to make a more structural change on the system, you could consider using DynamoDB for this task.
The reason I suggest this is because DynamoDB supports strong consistency, even for updates. Notice that it's not the default - by default, DynamoDB operates in eventual consistency mode, but the "retrieval" operations (GetItem, for example), support fully consistent reads.
Also, DynamoDB has what we call "DynamoDB Streams", which is a mechanism that allows you to get a stream of changes made to any (or all) items on your table. These notifications can be polled, or they can even be used in conjunction with a Lambda function, that would be called automatically whenever a change happens! This, plus the fact that DynamoDB can be used with strong consistency, could possibly help you solve your problem.
In DynamoDB, it's usually a good practice to keep the records small. You mentioned in your comments that your XML files are about 2kB - I'd say that could be considered "small enough" so that it would be a good fit for DynamoDB! (the reasoning: DynamoDB reads are typically calculated as multiples of 4kB; so to fully read 1 of your XML files, you'd consume just 1 read; also, depending on how you do it, for example using a Query operation instead of a GetItem operation, you could possibly be able to read 2 XML files from DynamoDB consuming just 1 read operation).
Some references:
http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
http://docs.aws.amazon.com/lambda/latest/dg/with-ddb.html
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_ReceiveMessage.html
I can think of another way by using S3 Versioning; this would require the least amount of changes to your code.
Versioning is a means of keeping multiple variants of an object in the same bucket.
This would mean that every time a new file.xml is uploaded, S3 will create a new version.
In your script, instead of getting the object and comparing it, get the HEAD of the object which contains the VersionId field. Match this version with the previous version to find out if the file has changed.
If the file has indeed changed, get the new file, and also get the new version of that file and save it locally so that next time you can use this version to check if a newer-newer version has been uploaded.
Note 1: You will still be making lots of calls to S3, but instead of fetching the entire file every time, you are only fetching the metadata of the file which is much faster and smaller in size.
Note 2: However, if your aim was to reduce the number of calls, the easiest solution I can think of is using lambdas. You can trigger a lambda function every time a file is uploaded that then calls the REST endpoint of your service to notify you of the file change.
You can use --exact-timestamps
see AWS discussion
https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
Instead of using versioning, you can simply compare the E-Tag of the file, which is available in the header, and is similar to the MD-5 hash of the file (and is exactly the MD-5 hash if the file is small, i.e. less than 4 MB, or sometimes even larger. Otherwise, it is the MD-5 hash of a list of binary hashes of blocks.)
With that said, I would suggest you look at your application again and ask if there is a way you can avoid this critical path.

Resources