For over two days, I've been trying to deploy a CloudFormation stack using the Serverless Framework. As part of the stack, I have an RDS cluster as well as a custom resource that relies on a Lambda function (written in Python) to initialize some database tables.
The details of this custom resource in the serverless.yml file are the following:
rdsMigration:
  Type: Custom::DatabaseMigration
  DependsOn: rdsCluster
  Properties:
    ServiceToken: !GetAtt MigrateDatabaseLambdaFunction.Arn
    Version: 1.0
When deploying with sls deploy, the cluster and the Lambda functions are created correctly, but the process gets stuck creating the rdsMigration resource.
In the Lambda code, I've been careful to generate the response in all possible scenarios, including exceptions. However, that does not seem to be the problem.
Apparently, the function is not being invoked... kind of, because even the charts look weird:
You can see how there are no invocations, but there is a red dot in "Error count and success rate" at about 5:15 PM, which is the time at which the resource creation started. Also, there are no green dots, and you can see the warning down in the legend, which claims that "One or more data-points have been dropped due to non-numeric values (NaN, -Infinite, +Infinite)". How is this possible? I assume this is not standard behavior, since other Lambda functions (which are invoked through an API Gateway endpoint) do not show this strange chart.
Also, there are no log streams in CloudWatch. It is completely empty, as if the function had never been invoked (which seems to be the case, except for the strange "red dot" at the moment of resource creation).
Finally, if I run a test case using the "AWS CloudFormation Create Request" template, the function runs properly: it creates the initial tables I expect in the DB (not always, but that is a different matter) and returns the response.
Do you have any idea of what is going on here? The worst part is that I need to wait two hours between tests, since the CFN stack gets stuck during the creation and deletion steps until the timeout occurs.
Thanks!
The issue is with your Lambda function. You have to send a SUCCESS or FAILED signal back to CloudFormation. Since your Lambda function is not sending any signal, CloudFormation waits for the timeout (2 hours) and the stack operation fails.
1. The custom resource provider processes the AWS CloudFormation request and returns a response of SUCCESS or FAILED to the pre-signed URL. AWS CloudFormation waits and listens for a response in the pre-signed URL location.
2. After getting a SUCCESS response, AWS CloudFormation proceeds with the stack operation. If a FAILURE or no response is returned, the operation fails.
Please use the cfnresponse module in your Lambda function to send the SUCCESS/FAILED signal back to CloudFormation.
For more details:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-lambda-function-code-cfnresponsemodule.html
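For reference, a minimal sketch of a handler that always signals CloudFormation, one way or the other (create_tables() is a hypothetical placeholder for the actual migration work; note that the cfnresponse module ships automatically only with inline ZipFile Lambda code, so with the Serverless Framework you may need to bundle it yourself or send the response manually):

import cfnresponse

def create_tables():
    # hypothetical helper: run the actual database migration here
    pass

def handler(event, context):
    try:
        # Only do real work on Create/Update; a Delete request must still be
        # acknowledged, otherwise stack deletion also hangs until the timeout.
        if event['RequestType'] in ('Create', 'Update'):
            create_tables()
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {}, 'rds-migration')
    except Exception as exc:
        cfnresponse.send(event, context, cfnresponse.FAILED, {'Error': str(exc)}, 'rds-migration')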
I finally managed to find a solution to the issue, although it does not explain the strange behavior with the charts that I described in the question.
My problem was similar to what Abhinaya suggested in her response: the Lambda function was not sending the signal properly because of a programming error. Essentially, I took the code from the documentation (the Python 3 version, the second fragment counting from the end) and apparently I mistakenly removed the line that retrieves the ResponseURL. Of course, that was failing.
A side comment about this: be careful when using Python's cfnresponse library, or even the code snippet in the documentation I linked. It relies on botocore.vendored, which was deprecated and no longer exists in recent botocore releases. Therefore, it will fail if your code depends on a newer version of that library (as in my case). A simple solution is to replace botocore.vendored.requests with the requests library, as sketched below.
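For anyone hitting the same problem, here is a minimal sketch of sending the signal yourself with requests instead of cfnresponse (the response fields follow the custom resource documentation; the helper name and the choice of PhysicalResourceId are my own):

import json
import requests  # replaces botocore.vendored.requests, which newer botocore versions no longer ship

def send_response(event, context, status, reason=''):
    # PUT the result to the pre-signed ResponseURL that CloudFormation includes in the event.
    body = {
        'Status': status,  # 'SUCCESS' or 'FAILED'
        'Reason': reason or 'See CloudWatch log stream: ' + context.log_stream_name,
        'PhysicalResourceId': context.log_stream_name,  # any stable identifier works
        'StackId': event['StackId'],
        'RequestId': event['RequestId'],
        'LogicalResourceId': event['LogicalResourceId'],
        'Data': {},
    }
    requests.put(event['ResponseURL'],
                 data=json.dumps(body),
                 headers={'content-type': ''},
                 timeout=10)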
Still, there is some behavior that I cannot understand. On creation, the Lambda function records nothing in CloudWatch, and the charts show the strange pattern I described in my question. However, this only happens on creation. If the function is invoked manually, or as part of the delete process (when removing the CFN stack), then it does write to CloudWatch. So, apparently, the problem only occurs on the first invocation.
Best.
Related
In our Spring Boot app, we are using AmazonS3Client.deleteObjects() to delete multiple objects in a bucket. From time to time, the request throws MultiObjectDeleteException and one or more objects are not deleted. It is not frequent, about 5 failures among thousands of requests, but it could still be a problem. What could lead to the exception?
And I have no idea how to debug this. The log from our app follows the data flow but does not show much useful information; it suddenly throws the exception after the request. Please help.
Another thing is that the exception comes back with a 200 status code. How is this possible?
com.amazonaws.services.s3.model.MultiObjectDeleteException: One or more objects could not be deleted (Service: null; Status Code: 200; Error Code: null; Request ID: xxxx; S3 Extended Request ID: yyyy; Proxy: null)
TL;DR: Some rate of errors is normal and the application should handle it. 500 and 503 errors are retriable. The MultiObjectDeleteException should provide a clue, and getDeletedObjects() gives you the list of objects that were deleted; the rest you should mostly retry later.
The MultiObjectDeleteException documentation says that the exception should include an explanation of the issue that caused the error:
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/MultiObjectDeleteException.html
Exception for partial or total failure of the multi-object delete API, including the errors that occurred. For successfully deleted objects, refer to getDeletedObjects().
According to https://aws.amazon.com/s3/sla/, AWS does not guarantee 100% availability. Quoting that document:
• “Error Rate” means: (i) the total number of internal server errors returned by the Amazon S3 Service as error status “InternalError” or “ServiceUnavailable” divided by (ii) the total number of requests for the applicable request type during that 5-minute interval. We will calculate the Error Rate for each Amazon S3 Service account as a percentage for each 5-minute interval in the monthly billing cycle. The calculation of the number of internal server errors will not include errors that arise directly or indirectly as a result of any of the Amazon S3 SLA Exclusions.
Usually we think about an SLA in terms of downtime, so it is easy to assume that AWS means the same. But that's not the case here: some number of errors is normal and should be expected. In many documents AWS suggests implementing a combination of slowdowns and retries, e.g. here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/ErrorBestPractices.html
Some 500 and 503 errors are, again, part of normal operation: https://aws.amazon.com/premiumsupport/knowledge-center/http-5xx-errors-s3/
The document specifically says:
Because Amazon S3 is a distributed service, a very small percentage of 5xx errors is expected during normal use of the service. All requests that return 5xx errors from Amazon S3 can be retried. This means that it's a best practice to have a fault-tolerance mechanism or to implement retry logic for any applications making requests to Amazon S3. By doing so, S3 can recover from these errors.
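To make that retry advice concrete, here is a minimal sketch of the pattern in Python with boto3 (the question uses the Java SDK, so treat this purely as an illustration; the bucket and keys are placeholders):

import time
import boto3

s3 = boto3.client('s3')

def delete_with_retries(bucket, keys, attempts=3):
    # Delete the keys in one batch call and retry only the ones reported as failed.
    remaining = list(keys)
    for attempt in range(attempts):
        resp = s3.delete_objects(
            Bucket=bucket,
            Delete={'Objects': [{'Key': k} for k in remaining]},
        )
        # Objects listed under 'Errors' were not deleted; everything else succeeded.
        remaining = [err['Key'] for err in resp.get('Errors', [])]
        if not remaining:
            return []
        time.sleep(2 ** attempt)  # simple exponential backoff before retrying the failures
    return remaining  # keys still not deleted after all attempts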
Edit: a question was added later: "How is it possible that the API call returned status code 200 while some objects were not deleted?"
And the answer to that is very simple: this is how the API is defined. From the Java SDK reference page for deleteObjects you can go directly to the AWS API documentation page, https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html, which says that this is the expected behavior. Status code 200 means that the high-level API call succeeded and was able to request the deletion of the listed objects. Some of those deletions did fail, but the API call reported them in the response.
Why does the Java API throw an exception then? The authors of the AWS Java SDK had to translate the response into the Java programming language, and they evidently decided that while the AWS API treats a non-zero error rate as part of the service agreement, Java developers are more used to anything short of 100% success ending in an exception.
Both abstractions are well documented, and it is the programmer who is responsible for a precise implementation. The engineering rule is cheap, fast, reliable: choose two. AWS manages to provide a service that has all three, with the reasonable concession that part of the reliability is implemented on the client side through retries and slowdowns.
I have a Spring Batch application that is hosted on AWS and runs as a Lambda function. The function is triggered when a file is dropped into the corresponding S3 bucket.
My question is: what would be the best way to perform health checks in this scenario? If this were a regular service running on EC2 (i.e. constantly running), I'd just schedule a health check to run at a fixed interval, but since this Lambda only runs for a couple of minutes at most, I'm not sure how I should proceed. I was thinking of simply setting the health check status based on the individual reader and writer steps somehow. For instance, if the job was able to read successfully, return status UP; otherwise, return some other status.
I also want to note that the health of this app will need to be reported to Splunk via logs.
Please let me know if there is a better solution. I'm new to health checks so my implementation might be incorrect.
I am running the STS AssumeRole operation from inside a Lambda function and experiencing weird behaviour. My Lambda function runs as a dedicated role, call it LambdaRole, and I'm trying to assume a second role (call it S3Role) in order to get credentials for S3 access that I can pass to another system. This other system doesn't have an IAM role attached, and I'd rather not generate static keys for it.
The operation sometimes succeeds upon first deploying my Lambda function and continues to work for a while, but eventually stops working. By 'stops working' I mean a timeout where the service call simply never returns. Sometimes a fresh deployment of my Lambda function doesn't succeed on the 'first' call either.
I've tried exploring rate limits etc. for STS but don't see any that are relevant. I can call AssumeRole from the CLI as many times as I want, and it's fast and responsive.
My Lambda function runs inside a VPC, and I've tried with and without an endpoint to STS (apparently you do not need an STS endpoint inside your VPC, which makes some sense).
So, in summary: is there any extra intelligence happening during the AssumeRole operation that could be causing this problem? Is something special or different happening in the Lambda container that causes this to break? Any debugging ideas?
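For reference, the operation is essentially the call below, sketched here in Python with boto3 (the role ARN and session name are placeholders); the short client-side timeouts are only there so that a hang surfaces in seconds while testing instead of consuming the whole Lambda timeout:

import boto3
from botocore.config import Config

# Tight timeouts and a single attempt make a hanging call fail fast during testing.
sts = boto3.client('sts', config=Config(connect_timeout=3,
                                        read_timeout=5,
                                        retries={'max_attempts': 1}))

def get_s3_credentials():
    resp = sts.assume_role(
        RoleArn='arn:aws:iam::123456789012:role/S3Role',  # placeholder ARN
        RoleSessionName='s3-access-for-external-system',
        DurationSeconds=900,
    )
    # Temporary credentials to hand to the external system.
    return resp['Credentials']  # AccessKeyId, SecretAccessKey, SessionToken, Expiration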
Currently we are running a Node.js web app using Serverless. API Gateway uses a single endpoint for the entire application, and routing is handled internally. So basically a single HTTP {any+} endpoint for the entire application.
My questions are:
1. What is the disadvantage of this method? (I know Lambda is built for FaaS, but right now we are handling it as a monolithic function.)
2. How many instances can Lambda run at a time if we follow this method? Can it handle a million+ requests at the same time?
Any help would be appreciated. Thanks!
The disadvantage is, as you say, that it's monolithic, so you haven't modularised your code at all. The idea is that adjusting one function shouldn't affect the rest, but in this case it can.
You can run as many as you like concurrently; you can set limits, though (and there are some initial limits for safety, which can be raised).
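If you do want to cap a function, you can reserve concurrency per function. A minimal sketch with boto3 (the function name is a placeholder; the Serverless Framework exposes the same setting as a per-function reservedConcurrency property):

import boto3

lambda_client = boto3.client('lambda')

# Reserve at most 100 concurrent executions for this function; other functions
# in the account share the rest of the account-level concurrency pool.
lambda_client.put_function_concurrency(
    FunctionName='my-monolithic-api-handler',  # placeholder name
    ReservedConcurrentExecutions=100,
)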
If you are running the function regularly, it should also 'warm start', i.e. have a shorter boot time after the first invocation.
I have a handler function on AWS Lambda that connects to a Redis instance to store a single key in the cache. The function completes successfully, but the key in Redis shows up minutes (or more) after the fact.
This behavior is observable on both Heroku Redis and Redis Cloud, they're both hosted solutions.
I can't for the life of me figure out what's causing this lag. My Redis knowledge is practically zero; I know how to store a list using LPUSH and how to trim that list using LTRIM.
The writer to Redis uses this Node client while I observe the lag using redis-cli on my local machine.
Is it common to experience this kind of lag in the setup I describe? What can I do to debug this?
I'm purposefully ignoring most of the information in the question and would like to refer only to the alleged symptom, namely that
the key shows up only minutes after being stored
This behavior is impossible with Redis: any change to the data is immediately visible, given Redis' design. That said, the only scenario in which what you're describing could be remotely possible is if you're writing to a Redis master server and reading from a very badly lagged slave. I can assure you that this is not the case with Redis Cloud, however.
The main reason is that the Lambda container goes to sleep as soon as your function terminates, and the Redis client you are using is entirely asynchronous.
Note that the API is entirely asynchronous. To get data back from the server, you'll need to use a callback.
I'm assuming that the asynchronous SET is the last action performed in your Lambda function. Once it is called, the underlying Lambda container goes to sleep, and most likely the actual SET hasn't finished its job yet. Therefore, the record will not show up in Redis until the exact same Lambda container is invoked to execute your function again and finishes the work it was supposed to finish on the previous execution. This is probably the lag you are experiencing.
To test whether this is true, sleep for a couple of seconds at the end of your function to delay the Lambda container from going to sleep immediately, and see if the lag is still there.
I would also recommend not using asynchronous APIs inside Lambda functions. They add state to your Lambda computation, and this is actually not recommended by AWS themselves in the Lambda documentation.
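To make the point concrete, here is a minimal Python sketch of the safe pattern (purely an illustration, since your writer uses a Node client; the host, port and key names are placeholders taken from environment variables): the synchronous set() call does not return until the server has acknowledged the write, so nothing is left pending when Lambda freezes the container.

import os
import redis  # redis-py; the synchronous client blocks until the server replies

r = redis.Redis(host=os.environ['REDIS_HOST'],
                port=int(os.environ.get('REDIS_PORT', 6379)),
                password=os.environ.get('REDIS_PASSWORD'),
                socket_timeout=5)

def handler(event, context):
    # The write is fully acknowledged before we return, so the key is visible
    # in redis-cli as soon as the function finishes.
    r.set('last-event-id', event.get('id', 'unknown'))
    return {'status': 'stored'}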