Intermittent success with AssumeRole from Lambda function

I am running the STS AssumeRole operation from inside a Lambda function and experiencing weird behaviour. My Lambda function runs as a dedicated role, call it LambdaRole, and I'm trying to assume a second role (call it S3Role) in order to get credentials for S3 access that I can pass to another system. That other system doesn't have an IAM role attached, and I'd rather not generate static keys for it.
The operation sometimes succeeds upon first deploying my Lambda function and continues to work for a while, but eventually stops working. The 'stopped working' state is simply a timeout: the service call never returns. Sometimes a fresh deployment of my Lambda function doesn't succeed even for the 'first' call.
I've looked into rate limits and similar constraints for STS but don't see any that are relevant. I can call AssumeRole from the CLI as many times as I want, and it's fast and responsive.
My Lambda function runs inside a VPC, and I've tried with and without an endpoint to STS (apparently you do not need an STS endpoint inside your VPC, which makes some sense).
So in summary: is there any extra intelligence happening during the AssumeRole operation which is causing this problem? Is something special or different happening in the Lambda container that causes this to break? Any debugging ideas?
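Roughly, the call looks like this (a minimal sketch, assuming a Node.js runtime and the v2 aws-sdk; the account ID, role ARN and session name are placeholders). Setting short socket timeouts on the client is one way to make a blocked call fail fast while debugging, instead of hanging until the Lambda timeout:

const AWS = require('aws-sdk');

// Short socket timeouts make a blocked STS call surface as a quick error
// rather than a silent hang; the values are illustrative.
const sts = new AWS.STS({ httpOptions: { connectTimeout: 3000, timeout: 5000 } });

exports.handler = async () => {
  // Role ARN and session name are placeholders.
  const { Credentials } = await sts.assumeRole({
    RoleArn: 'arn:aws:iam::123456789012:role/S3Role',
    RoleSessionName: 's3-passthrough',
  }).promise();
  // Credentials contains AccessKeyId, SecretAccessKey and SessionToken,
  // which can be handed to the external system.
  return Credentials;
};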

Related

Best way to monitor health of a Spring app set up as a Lambda function

I have a Spring Batch application that is hosted on AWS and runs as a Lambda function. This Lambda function is triggered when a file is dropped in the corresponding S3 bucket.
My question is: what would be the best way to perform health checks in this scenario? If this were a regular service running on EC2 (i.e. constantly running), I'd just schedule a health check to run at a fixed interval, but since this Lambda only runs for a couple of minutes at most, I'm not sure how I should proceed. I was thinking of simply setting the health check status based on the individual reader and writer steps somehow. For instance, if the job was able to read successfully, return status UP; else, return some other status.
I also want to note that the health of this app will need to be documented in Splunk via logs.
Please let me know if there is a better solution. I'm new to health checks, so my implementation might be incorrect.

Custom resource not running properly on deployment

For over two days, I've been trying to deploy a CloudFormation stack using the Serverless Framework. As part of the stack, I have an RDS cluster as well as a custom resource which relies on a Lambda function (written in Python) for initializing some database tables.
The details of this custom resource in the serverless.yml file are the following:
rdsMigration:
  Type: Custom::DatabaseMigration
  DependsOn: rdsCluster
  Properties:
    ServiceToken: !GetAtt MigrateDatabaseLambdaFunction.Arn
    Version: 1.0
When deploying using sls deploy, the cluster and the lambda functions are created correctly, but the process is stuck on creating the rdsMigration resource.
In the Lambda code, I've been careful to generate the response in all possible scenarios, including exceptions. However, that does not seem to be the problem.
Apparently, the function is not being invoked... kind of, because even the CloudWatch charts look weird.
You can see how there are no invocations, but there is a red dot in "Error count and success rate" at about 5:15 PM, which is the time at which the resource creation started. Also, there are no green dots, and there is a warning down in the legend which claims that "One or more data-points have been dropped due to non-numeric values (NaN, -Infinite, +Infinite)". How is this possible? I assume it is not standard behavior, since other Lambda functions (which are called through an API Gateway endpoint) do not show this strange chart.
Also, there are no log streams in CloudWatch. It is completely empty, as if the function was never invoked (which seems to be the case, except for the strange "red dot" at the moment of resource creation).
Finally, if I run a test case using the "AWS CloudFormation Create Request" template, the function runs properly: it creates the initial tables I expected for the DB (not always, but that is a different matter) and returns the response.
Do you have any idea of what is going on here? The worst part is that I need to wait two hours between tests, since the CFN stack gets stuck during the creation and destruction steps until the timeout occurs.
Thanks!
The issue is with your Lambda function. You have to send a SUCCESS or FAILED signal back to CloudFormation. Since your Lambda function is not sending any signal, CloudFormation waits for the timeout (2 hours) and then the stack fails.
1. The custom resource provider processes the AWS CloudFormation request and returns a response of SUCCESS or FAILED to the pre-signed URL. AWS CloudFormation waits and listens for a response in the pre-signed URL location.
2. After getting a SUCCESS response, AWS CloudFormation proceeds with the stack operation. If a FAILURE or no response is returned, the operation fails.
Please use the cfnresponse module in your Lambda function to send the SUCCESS/FAILED signal back to CloudFormation.
For more details:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-lambda-function-code-cfnresponsemodule.html
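The linked page documents this helper for both Node.js and Python. A minimal sketch of the pattern, in the Node.js flavor (the asker's function is Python, where cfnresponse.send is used the same way):

// Sketch of a custom resource handler that always signals.
// Note: the cfn-response module is bundled automatically only when the
// function code is inlined via the ZipFile property; for packaged
// functions, copy the module source from the linked page instead.
const response = require('cfn-response');

exports.handler = (event, context) => {
  try {
    if (event.RequestType === 'Create') {
      // ... run the database migration here ...
    }
    // Signal for Create, Update and Delete alike, or the stack hangs.
    response.send(event, context, response.SUCCESS, {});
  } catch (err) {
    // A FAILED signal makes the stack fail fast instead of waiting
    // for the CloudFormation timeout.
    response.send(event, context, response.FAILED, {});
  }
};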
I finally managed to find a solution to the issue, although it does not explain the strange behavior with the charts that I described in the question.
My problem was similar to what Abhinaya suggested in her response. The Lambda function was not sending the signal properly because of a programming error. Essentially, I took the code from the documentation (the Python 3 one, the second fragment counting from the end) and apparently I mistakenly removed the line that retrieves the ResponseURL. Of course, that was failing.
A side comment about this: be careful when using Python's cfnresponse library, or even the code snippet I linked in the documentation. It relies on botocore.vendored, which was deprecated and no longer exists in the latest botocore releases. Therefore, it will fail if your code relies on a recent version of that library (as in my case). A simple solution is to replace botocore.vendored.requests with the requests library.
Still, there is some strange behavior that I cannot understand. On creation, the Lambda function records nothing to CloudWatch, and there is the strange behavior in the charts that I described in my question. However, this only happens on creation. If the function is invoked manually, or is invoked as part of the delete process (when removing the CFN stack), then it does write to CloudWatch. Therefore, the problem apparently only occurs on the first invocation.
Best.

Lambda timeout after trying to connect to Firestore

I have a Lambda that runs on a 30-minute interval and times out when trying to connect to Firestore. I don't really know why this is happening. At the beginning of the Lambda I have used
context.callbackWaitsForEmptyEventLoop = false;
Can anyone help me solve this, please?
Does your Lambda function have access to the internet? This is a really common error; if the function is in a VPC, you will need to set up the VPC's subnets to allow it to reach the internet.
https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/
There's a 15-minute limit for Lambda functions. If you go over that limit, they'll time out, and there's no way to work around it.
You can see it in the docs:
You can now set the timeout value for a function to any value up to 15 minutes. When the specified timeout is reached, AWS Lambda terminates execution of your Lambda function. As a best practice, you should set the timeout value based on your expected execution time to prevent your function from running longer than intended.
You can also check AWS Lambda Limits. While some of these limits can be raised by contacting AWS, the maximum execution time is not one of them.
If your function runs in less than 15 minutes, you can simply increase the timeout for your function via the console (under basic settings) or via the aws-cli (or via frameworks such as AWS SAM, Serverless, etc. if you're using one).
Check how to change the limits here
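For example, via the aws-cli (the function name is a placeholder; the timeout is given in seconds, 900 being the 15-minute maximum):

aws lambda update-function-configuration --function-name my-function --timeout 900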
However, I would try to understand why your function is timing out when connecting to Google's Firestore. I don't know much about Google Cloud, but maybe you need to allow that traffic on its side. Maybe the timeout should be increased, but maybe Firebase is blocking the traffic, making your Lambda time out. If your Lambda is outside a VPC, it should be able to connect to the internet seamlessly, so the connection with Firestore should be fairly quick.
One other thing I suggest is to run your Lambda function under Node 8, as you can take advantage of async/await and get rid of the context and callback objects, which are very confusing at first.
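For illustration, a Node 8 style handler is just an async function that returns a value or throws; the Firestore client setup is elided here and the collection name is a placeholder:

// Sketch of a Node 8+ async handler: no context/callback juggling.
// `db` is assumed to be an initialized Firestore client, and 'items'
// is a placeholder collection name.
exports.handler = async (event) => {
  const snapshot = await db.collection('items').get();
  return { count: snapshot.size };
};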

Occasional connection timeout from lambda function

I'm not sure why, but from time to time (roughly once in 20 Lambda calls) I receive an error:
Connection timed out after 120000ms
The calls are made from an ECS container, and everything (caller and Lambda) is written in Node.js.
What should I check?
I know this is an old post, but I'll describe how I solved a situation with the same error message in a Lambda I was working on. I hope this helps someone with a similar issue.
In my case, I also have a web app inside an EC2 instance which calls a Lambda through lambda.invoke() (npm aws-sdk). Both the EC2 app and the Lambda run on Node.js. Even though the error is logged inside the EC2 instance, the message is thrown by the Lambda itself to the caller (EC2).
My Lambda makes ~3,000 requests to an API, which takes ~5 minutes (300,000 ms) to get all the responses back. It seems that the Lambda Node.js runtime keeps a socket alive during the Lambda execution, and the execution runs longer than 120,000 ms (2 minutes). As the Lambda code keeps running past this threshold, the runtime throws the error, and the Lambda returns a callback with it.
According to the AWS JS SDK docs, the AWS config object has one parameter for the HTTP timeout:
httpOptions (map) — A set of options to pass to the low-level HTTP request. Currently supported options are:
timeout [Integer] — Sets the socket to timeout after timeout milliseconds of inactivity on the socket. Defaults to two minutes (120000).
After I changed this configuration to 360,000 ms (6 minutes), the Lambda executed successfully. So you can just set this parameter to a higher value, according to your needs:
AWS.config.update({httpOptions: {timeout: 360000}});
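If you'd rather not change the configuration globally, the same option can be set on just the client making the long call (the value is illustrative):

// Scope the longer socket timeout to a single service client instead
// of updating AWS.config for every SDK call; 360000 ms is illustrative.
const lambda = new AWS.Lambda({ httpOptions: { timeout: 360000 } });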
For me it was not occasional; it stopped working from one moment to the next :S.
Somehow, one of my network devices had been deactivated (PANGP Virtual Ethernet Adapter); after I re-activated it, everything worked out.
Best!

Why does my Redis key show up only minutes after being stored?

I have a handler function on AWS Lambda that connects to a Redis instance to store a single key in the cache. The function completes successfully, but the key shows up in Redis minutes (or more) after the fact.
This behavior is observable on both Heroku Redis and Redis Cloud, they're both hosted solutions.
I can't for the life of me figure out what's causing this lag. My Redis knowledge is practically zero; I know how to store a list using LPUSH and how to trim that list using LTRIM.
The writer to Redis uses this Node client while I observe the lag using redis-cli on my local machine.
Is it common to experience this kind of lag in the setup I describe? What can I do to debug this?
I'm purposefully ignoring most of the information in the question and would like to refer only to the alleged symptom, namely that
key show up only minutes after being stored
This behavior is impossible with Redis: any change to the data is immediately visible, given Redis' design. That said, the only scenario in which what you're describing could even remotely happen is when you're writing to a Redis master server and reading from a very badly lagged slave. I can assure you that this is not the case with Redis Cloud, however.
The main reason is that the Lambda container goes to sleep as soon as your function terminates, and the Redis client you are using is entirely asynchronous:
Note that the API is entirely asynchronous. To get data back from the server, you'll need to use a callback.
I'm assuming that the asynchronous SET is the last action performed in your Lambda function. Once it is called, the underlying Lambda container goes to sleep, and most likely the actual SET action hasn't finished its job yet. Therefore, the record will not show up in Redis until the exact same Lambda container is called to execute your function again and finishes the job it was supposed to finish on the previous execution. This is probably the lag that you are experiencing.
To test whether or not this is true, sleep for a couple of seconds at the end of your function to keep the Lambda container from going to sleep immediately, and see if the lag is still there.
I would also recommend not relying on asynchronous APIs in this fire-and-forget way inside Lambda functions. They add state to your Lambda computation, and this is actually not recommended by AWS themselves in the Lambda documentation either.
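One way to ensure the write has completed before the handler returns, sketched with the v3 callback-style node_redis client, promisified (the connection string, key and value are placeholders):

// Sketch: the handler resolves only after Redis has acknowledged the
// write, so the container is never frozen with the SET still in flight.
const redis = require('redis');
const { promisify } = require('util');

// Connection string assumed to be provided via the environment.
const client = redis.createClient(process.env.REDIS_URL);
const setAsync = promisify(client.set).bind(client);

exports.handler = async () => {
  await setAsync('my-key', 'my-value'); // placeholders
  return 'stored';
};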
