Is it OK to panic() when failed to create AWS session?
As an opposite, I can just return the error from my handler function (in this case I have to create the session in the handler code, but not in the init()).
The docs say
Lambda will re-create the function automatically
Does it mean the panic always causes the cold-start and is preferred to return error from the handler?
Yes. A panic will trigger a cold restart of your code. The use of panic should be reserved for exceptional circumstances; returning an error is to be preferred in most circumstances.
The answer depends on what is going on in init section.
If you create session clients to connect to other services it might be good to panic and cause cold-start than continue container's lifecycle with the failed clients.
Related
I am running the STS AssumeRole operation from inside a Lambda function and experiencing weird behaviour. My Lambda function runs as a dedicated Role, call it LambdaRole, and I'm trying to assume a second role (call is S3Role) in order to get credentials for S3 access that I can pass to another system. This other system doesn't have an IAM role attached and I'd rather not generate static keys for that system.
The operation sometimes succeeds upon first deploying my Lambda function, and continues to work for a while, but eventually stops working. The 'stopped working' is simply a timeout where the service call never returns. Sometimes a fresh deployment of my lambda function doesn't succeed for the 'first' call either.
I've tried exploring any rate limits etc for STS but don't see any that are relevant. I can call AssumeRole from the CLI as many times as a I want and it's fast and responsive.
My Lambda function runs inside a VPC, and I've tried with and without an endpoint to STS (apparently you do not need an STS endpoint inside your VPC, which makes some sense).
So in summary - is there any extra intelligence happening during the AssumeRole operation which is causing this problem? Is something special or difference happening in the Lambda container that causes this to break? Any debugging ideas?
For over two days, I've been trying to deploy a CloudFormation stack using serverless framework. The thing is, as part of the stack, I have an RDS cluster as well as a custom resource which relies on a Lambda function (written in Python) for initializing some database tables.
The details of this custom resource in the serverless.yml file are the following:
rdsMigration:
Type: Custom::DatabaseMigration
DependsOn: rdsCluster
Properties:
ServiceToken: !GetAtt MigrateDatabaseLambdaFunction.Arn
Version: 1.0
When deploying using sls deploy, the cluster and the lambda functions are created correctly, but the process is stuck on creating the rdsMigration resource.
In the Lambda code, I've been careful to generate the response in all possible scenarios, including exceptions. However, that does not seem to be the problem.
Apparently, the function is not being invoked... kind of, because even the charts look weird:
You can see how there are no invocations, but there is a red dot in "Error count and success rate" about 5:15 PM, which is the time at which the resource creation started. Also, there are no green dots, and you can see the warning down in the legend, which claims that "One or more data-points have been dropped due to non-numeric values (NaN, -Infinite, +Infinite)". How is this possible? I assume it is no standard behavior, since other Lambda functions (which must be called using an API Gateway endpoint) do not show this strange chart.
Also, there are no log streams in CloudWatch. It is completely empty, as if the function was never invoked (which seems the case, except for the strange "red dot" at the moment of resource creation).
Finally, if I run a test case using the "AWS CloudFormation Create Request" template, the function runs properly, it creates the initial tables I expected for the DB (not always, but that is a different matter) and returns the response.
Do you have any idea of what is going on here? The worst about this is that I need to wait two hours between tests, since the CFN stack gets stuck during the creation and destruction steps until the timeout occurs.
Thanks!
The issue is with your lambda function. You have to send back the SUCCESS or FAILURE signals back to the CFN. Since your lambda function is nots sending any signals, its waiting for Timeout (2 hours) and the Cloudformation gets failed
1.The custom resource provider processes the AWS CloudFormation request and
returns a response of SUCCESS or FAILED to the pre-signed URL. AWS
CloudFormation waits and listens for a response in the pre-signed URL location.
2.After getting a SUCCESS response, AWS CloudFormation proceeds with the stack
operation. If a FAILURE or no response is returned, the operation fails.
Please use cfnresponse module in your lambda function to send the SUCCESS/FAILURE signals back to your Cloudformation
For more details:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-lambda-function-code-cfnresponsemodule.html
I finally managed to find a solution to the issue, albeit it is not explaining the strange behavior with the charts that I explained in the question.
My problem was similar to what Abhinaya suggested in her response. The Lambda function was not sending the signal properly because of a programming error. Essentially, I took the code from the documentation (the one for Python 3, second fragment starting by the end) and apparently I mistakenly removed the line for retrieving the ResponseURL. Of course, that was failing.
A side-comment about this: be careful when using Python's cfnresponse library or even the code snippet I linked in the documentation. It relies on botocore.vendored which was deprecated and no longer exist in latest botocore releases. Therefore, it will fail if your code relies on new versions of this library (as in my case). A simple solution is to replace botocore.vendored.requests with the requests library.
Still, there is some strange behavior that I cannot understand. On creation, the Lambda function is not recording anything to CloudWatch and there is this strange behavior in the charts that I explained in my question. However, this only happens on creation. If the function is manually invoked, or is invoked as part of the delete process (when removing the CFN stack), then it does write to CloudWatch. Therefore, the problem only occurs in the first invokation, apparently.
Best.
I'd like to get event in kernel on each new process that starts (fork+execve or posix_spawn), and be able to prevent this operations.
The first option would be using Mac framework named mpo_vnode_check_exec by Hooking to this method with function that return 0 when access is granted or check deferred to next hook.. non zero returned value means access is refused right away.
Unfortunately, this framework is unsupported by apple, and I wish to use a stable alternative like kauth fileop scope with KAUTH_FILEOP_EXEC flag.
However, this framework is for detection only and lacks prevention capabilities..
Perhaps there's a way to prevent the process from running when I get relevant kauth callback on process creation, or halt the process from running until I decide whether it should run or not (and enforce the verdict in another thread).
thanks
However, this framework is for detection only and lacks prevention capabilities..
Correct, if you're only focussing on the File scope.
Register with the Vnode scope and your callback returns whether or not access is allowed.
kauth_listen_scope(KAUTH_SCOPE_VNODE, &myCallback, NULL);
Finally, note that this scope is very noisy, as every type of access to every resource is reported.
I have a handler function on AWS Lambda that is connecting to a Redis instance to store a single key in the cache. The function has completed successfully but the key in Redis shows up minutes (or more) after the fact.
This behavior is observable on both Heroku Redis and Redis Cloud, they're both hosted solutions.
I can't for the life of me figure out what's causing this lag. My Redis knowledge is practically zero, I know how to store a list using LPUSH and how to trim that list using LTRIM.
The writer to Redis uses this Node client while I observe the lag using redis-cli on my local machine.
Is it common to experience this kind of lack in the setup I describe? What can I do to debug this?
I'm purposefully ignoring most of the information in the question and would like to refer only to the alleged symptom, namely that
key show up only minutes after being stored
This behavior is impossible with Redis - any change to the data is immediately visible given Redis' design. That said, the only scenario what you're describing could be remotely possible is when you're writing to a Redis master server and reading from a very-badly-lagged slave. I can ensure you that this is not the case with Redis Cloud however.
The main reason is due to the fact that the Lambda container starts to sleep as soon as your function terminates, and the Redis client you are using is all asynchronous APIs.
Note that the API is entire asynchronous. To get data back from the server, you'll need to use a callback.
I'm assuming that the asynchronous SET is the last action performed in your Lambda function. Once that is called, the underlying Lambda container goes to sleep, and most likely, the actual SET action hasn't finished its job yet. Therefore, the record will not show in Redis until the exact same Lambda container was called to execute your function again, and finished the job that it was supposed to finish on the last execution. This is probably the lag that you are experiencing.
To test whether or not this is true, do a sleep action for a couple of seconds at the end of your function to delay the Lambda container going to sleep immediately, and see if the lag is still there.
I would also recommend not to use asynchronous behaviour APIs inside Lambda functions. They'll add state to your Lambda computation, and this is actually not recommended by AWS themselves within the Lambda documentations too.
My service has a thread that may potentially be executing a WinHttpSendRequest when someone tries to stop my service.
The WinHttpCloseHandle documentation says:
An application can terminate an in-progress synchronous or asynchronous request by closing the HINTERNET request handle using WinHttpCloseHandle
But, then later on the same documentation seems to contradict this. It says:
An application should never WinHttpCloseHandle call on a synchronous request. This can create a race condition.
I've found this blog post that seems to agree I can't call WinHttpCloseHandle.
I'm wondering how can I cancel this operation so that my service can be stopped gracefully? I can't really wait for the WinHttpSendRequest to timeout naturally because it takes too long and my service doesn't stop quickly enough. I think windows reports this as an error and then forcefully kills the service in a shutdown.
An ideas would be appreciated.
Calling WinHttpCloseHandle off a background thread to force handle close is perhaps not the best solution. Still it works, and the original caller would receive something like "bad handle" error code and the request would be forcefully terminated.
It would be possibly unsafe to abuse this power, and one would rather implement async requests instead. However, in case of shutting down stale service it is going to work fine.