I have a Lambda function triggered by API Gateway. When I invoke it, the logs show that it starts, but then nothing happens: it hangs until the timeout is reached. The very first line of the actual function is a console.log, and it is never printed.
This is the log of the lambda:
2021-10-20T13:27:50.740+03:00 START RequestId: effb5220-abd0-492a-a355-e17cba47b491 Version: $LATEST
2021-10-20T13:29:22.746+03:00 END RequestId: effb5220-abd0-492a-a355-e17cba47b491
2021-10-20T13:29:22.746+03:00 REPORT RequestId: effb5220-abd0-492a-a355-e17cba47b491 Duration: 92004.98 ms Billed Duration: 90000 ms Memory Size: 128 MB Max Memory Used: 128 MB
2021-10-20T13:29:22.746+03:00 2021-10-20T10:29:22.746Z effb5220-abd0-492a-a355-e17cba47b491 Task timed out after 92.00 seconds
I don't know where else to look to debug this issue.
If it matters, the Lambda function is written in TypeScript and deployed with the CDK.
Related
I am running cdk deploy in my Textract pipeline folder for large document processing. However, when I run this program I get an error.
The error
| CREATE_FAILED | AWS::Lambda::Function | S3BatchProcessor6C619AEA
Resource handler returned message: "Specified ReservedConcurrentExecutions for function decreases account's UnreservedConcurrentExecution below its minimum value of [10]. (Service: Lambda, Status Code: 400, Request ID: 7f6d1305-e248-4745-983e-045eccde562d)" (RequestToken: 9c84827d-502e-5697-b023-e0be45f8d451, HandlerErrorCode: InvalidRequest)
By default, AWS gives an account a total Lambda concurrency limit of at most 1000.
In your case, the reserved concurrency configured across the Lambdas in your account would push the unreserved concurrency below its minimum of 10. In other words, the following must hold:
1000 - (sum of reservedConcurrency across all Lambdas) >= 10
The deployment fails because the new function's reserved concurrency would break this constraint.
There can be two solutions here:
Reduce the reserved concurrency of your Lambdas so that the inequality above holds (for example, as sketched below), or
Raise the account concurrency limit by requesting a quota increase from AWS Support; see the AWS documentation on Lambda concurrency limits.
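If it helps, here is a minimal sketch of the first option in CDK (shown in Python for consistency with the other code on this page; the construct ID, runtime, handler, asset path, and concurrency value are placeholders, not your actual stack): either lower reserved_concurrent_executions so at least 10 unreserved executions remain in the account, or omit the argument entirely so the function draws from the unreserved pool.

from aws_cdk import Stack, aws_lambda as _lambda
from constructs import Construct

class ProcessorStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        _lambda.Function(
            self, "S3BatchProcessor",
            runtime=_lambda.Runtime.PYTHON_3_8,
            handler="index.handler",
            code=_lambda.Code.from_asset("lambda"),
            # Keep this small enough that the account retains at least 10
            # unreserved executions, or drop the argument altogether.
            reserved_concurrent_executions=5,
        )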
I have built my own application around AWS Lambda and Salesforce.
I have around 10 users using my internal app, so we're not talking about big usage.
On a normal day I have around 500-1000 SQS tasks to process, and a single task can take around 1-60 seconds depending on its complexity.
This is working perfectly.
The timeout for my Lambda is 900 seconds.
BatchSize = 1
Using Python 3.8
I've created a decorator which allows me to push through SQS some of my functions that need to be processed asynchronously, with FIFO logic.
Everything is working well.
My Lambda function doesn't return anything at the end, but it completes successfully (the standard scenario). However, I have noticed that some tasks were going into my DLQ (I only allow one processing attempt; if a message is delivered again it goes into the DLQ immediately).
What I don't get is why this is happening.
The Lambda ends with success --> normally the task should be deleted from the initial SQS queue.
So I added a manual deletion of the processed task at the very end of the function. I logged the result returned by boto3's delete_message and I get a 200 status, so everything looks OK... However, once in a while (about 1 out of 100, so roughly 10 times per day in my case) I still see the task going into the DLQ...
Reprocessing the same task in my standard queue without changing anything... it gets processed successfully (again) and deleted (as expected initially).
What puzzles me most is that deleting the message can still end with it sometimes going into the DLQ. What could be the problem?
Example of my async processor
import json
import sys

import boto3

# `settings` and `utils` are my own helper modules.
import settings
import utils


def process_data(event, context):
    """
    By convention, we need to store in the table AsyncTaskQueueName a dict with the following parameters:
    - python_module: used to determine the location of the method to call asynchronously
    - python_function: used to determine the method to call asynchronously
    - uuid: uuid to get the params stored in dynamodb
    """
    print('Start Processing Async')
    client = boto3.client('sqs')
    queue_url = client.get_queue_url(QueueName=settings.AsyncTaskQueueName)['QueueUrl']
    # batch size = 1 so only 1 record to process
    for record in event['Records']:
        try:
            kwargs = json.loads(record['body'])
            print(f'Start Processing Async Data Record:\n{kwargs}')
            python_module = kwargs['python_module']
            python_function = kwargs['python_function']
            # CALLING THE FUNCTION WE WANTED ASYNC, AND DOING ITS STUFF... (WORKING OK)
            getattr(sys.modules[python_module], python_function)(uuid=kwargs['uuid'], is_in_async_processing=True)
            print('End Processing Async Data Record')
            res = client.delete_message(QueueUrl=queue_url, ReceiptHandle=record['receiptHandle'])
            print(f'End Deleting Async Data Record with status: {res}')
            # When the problem occurs, execution reaches this line and res has HTTP status 200!
            # That's where I'm losing my mind. The uuid in the DLQ matches the one in the queue,
            # so it is definitely the same message that was moved to the DLQ.
        except Exception:
            # set visibility timeout to 0 so that the message is redelivered immediately and goes to the DLQ
            client.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=record['receiptHandle'],
                VisibilityTimeout=0
            )
            utils.raise_exception(f'There was a problem during async processing. Event:\n'
                                  f'{json.dumps(event, indent=4, default=utils.jsonize_datetime)}')
Example of today's bug with logs from CloudWatch:
Initial event:
{'Records': [{'messageId': '75587372-256a-47d4-905b-62e1b42e2dad', 'receiptHandle': 'YYYYYY", "python_module": "quote.processing", "python_function": "compute_price_data"}', 'attributes': {'ApproximateReceiveCount': '1', 'SentTimestamp': '1621432888344', 'SequenceNumber': '18861830893125615872', 'MessageGroupId': 'compute_price_data', 'SenderId': 'XXXXX:main-app-production-main', 'MessageDeduplicationId': 'b4de6096-b8aa-11eb-9d50-5330640b1ec1', 'ApproximateFirstReceiveTimestamp': '1621432888344'}, 'messageAttributes': {}, 'md5OfBody': '5a67d0ed88898b7b71643ebba975e708', 'eventSource': 'aws:sqs', 'eventSourceARN': 'arn:aws:sqs:eu-west-3:XXXXX:async_task-production.fifo', 'awsRegion': 'eu-west-3'}]}
Res (after calling delete_message):
End Deleting Async Data Record with status: {'ResponseMetadata': {'RequestId': '7738ffe7-0adb-5812-8701-a6f8161cf411', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '7738ffe7-0adb-5812-8701-a6f8161cf411', 'date': 'Wed, 19 May 2021 14:02:47 GMT', 'content-type': 'text/xml', 'content-length': '215'}, 'RetryAttempts': 0}}
BUT... 75587372-256a-47d4-905b-62e1b42e2dad is in the DLQ after this delete_message. I'm going crazy.
OK, the problem was that my serverless.yml set the timeout to 900, but that was not the value actually in AWS. I may have changed it manually to 1 minute, so my long-running tasks were released (became visible again) after 1 minute and then went immediately to the DLQ.
Hence the deletion not changing anything, since the task was already in the DLQ by the time the deletion was made.
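For anyone hitting the same thing, here is a small boto3 sketch (the queue name is taken from the event above and is purely illustrative) to inspect the visibility timeout that is actually deployed and align it with the Lambda timeout, so an in-flight message is not redelivered to the DLQ while it is still being processed:

import boto3

sqs = boto3.client('sqs')
queue_url = sqs.get_queue_url(QueueName='async_task-production.fifo')['QueueUrl']

# Inspect what is actually deployed, not what the IaC template says.
attrs = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=['VisibilityTimeout', 'RedrivePolicy'],
)
print(attrs['Attributes'])

# Keep messages invisible at least as long as the Lambda timeout (900 s here),
# so a slow task is not redelivered and sent to the DLQ mid-processing.
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={'VisibilityTimeout': '900'},
)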
The main question is: how is execution time accounted for in a WSGI application deployed on AWS Lambda?
Suppose I deploy the following simple Flask app:
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!", 200
using Zappa on AWS Lambda, with the following configuration:
{
    "dev": {
        "app_function": "simple_application.app",
        "profile_name": "default",
        "project_name": "simple_application",
        "runtime": "python3.7",
        "s3_bucket": "zappa-deployments-RANDOM",
        "memory_size": 128,
        "keep_warm": false,
        "aws_region": "us-east-1"
    }
}
Now if AWS receives a request for my website, it will spin up a container with my code inside and let it handle the request. Suppose the request is served in 200 ms.
Obviously the Lambda with the WSGI server inside continues running (Zappa by default makes the lambda run for at least 30s).
So now to the various subquestions:
How much time am I charged for the execution?
200ms, because of the request duration
30s, because of the execution time limit I set for my Lambda
until the Lambda is killed by AWS to reclaim resources (which could occur even 30-45 minutes later)
If another request comes along (while the first one is still being served), will the second request spin up another Lambda container, or will it be queued until some threshold time has passed?
I expected to be charged just for the 200ms after reading the AWS Lambda pricing page, but I would bet on it charging for 30s because, after all, I'm the one who imposed that limit.
In case I'm just charged for 200ms (and subsequent requests' time) but the container keeps running uninterruptedly for 30-45 minutes, I have a third subquestion:
Suppose now that I want to use a global variable as a simple local cache and synchronize it with a database (let's say DynamoDB) before the container is killed.
To do this I would like to raise the execution time limit of my lambdas to 15m, then at lambda creation set a Timer to fire a function that synchronizes the state and aborts the function after 14m30s.
How would the accounted running time change in this setup (i.e. with a Timer that fires after a certain amount of time)?
The proposed lambda code for this subquestion is:
from flask import Flask
from threading import Timer
from db import Visits
import sys

lambda_uuid = "SOME-UUID-OF-LAMBDA-INSTANCE"

# Collects number of visits
visits = 0

def report_visits():
    Visits(uuid=lambda_uuid, visits=visits).save()
    sys.exit(0)

t = Timer(14 * 60 + 30, report_visits)
t.start()

# Start of flask routes
app = Flask(__name__)

@app.route("/")
def hello():
    global visits  # needed so the assignment updates the module-level counter
    visits = visits + 1
    return "Hello World!", 200
Thanks in advance for any information.
Zappa by default makes the lambda run for at least 30s
I can find no documentation to support this. All of the Zappa documentation I can find states that a Zappa function ends as soon as a response is returned (just like any other AWS Lambda function), and as such you are only billed for the milliseconds it took to generate the response.
I do see that the default maximum execution time for a Zappa Lambda function is 30 seconds. Perhaps that is where you are confused? That setting simply tells AWS Lambda to kill your running function instance if it runs for that long.
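If that 30 seconds is what you are looking at, it should correspond to Zappa's timeout_seconds setting (which, to my knowledge, defaults to 30); raising it in the same zappa_settings.json would look roughly like this:

{
    "dev": {
        "app_function": "simple_application.app",
        "timeout_seconds": 300
    }
}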
I expected to be charged just for the 200ms after reading the AWS Lambda pricing page, but I would bet on it charging for 30s because, after all, I'm the one who imposed that limit.
You would be charged for exactly how long your function runs. So if it runs for 200ms, then you are charged for 200ms.
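Roughly speaking, a 200 ms invocation at 128 MB works out to 0.2 s × 0.125 GB = 0.025 GB-seconds of billed compute (plus one request), regardless of how long AWS keeps the execution environment warm afterwards.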
In case I'm just charged for 200ms (and subsequent requests' time) but the container keeps running uninterruptedly for 30-45 minutes, I have a third subquestion:
It doesn't keep running that long; it ends when it returns a response.
You are telling Lambda to kill your function instance if it runs for more than 30 seconds, so it's never going to run for more than 30 seconds with your current settings.
The current maximum run time for an AWS Lambda function is 15 minutes, so it couldn't possibly run for 30-45 minutes anyway.
Via CloudFormation, I have a setup including DynamoDB tables, DAX, VPC, Lambdas (living in VPC), Security Groups (allowing access to port 8111), and so on.
Everything works, except when it doesn't.
I can access DAX from my VPC'd Lambdas 99% of the time. Except occasionally they get NoRouteException errors... seemingly randomly. Here's the output from CloudWatch for a single Lambda function doing the exact same thing each time (a DAX get). Notice how it works, fails, and then works again:
/aws/lambda/BigOnion_accountGet START RequestId: 2b732899-f380-11e7-a650-cbfe0f7dfb3d Version: $LATEST
/aws/lambda/BigOnion_accountGet END RequestId: 2b732899-f380-11e7-a650-cbfe0f7dfb3d
/aws/lambda/BigOnion_accountGet REPORT RequestId: 2b732899-f380-11e7-a650-cbfe0f7dfb3d Duration: 58.24 ms Billed Duration: 100 ms Memory Size: 768 MB Max Memory Used: 48 MB
/aws/lambda/BigOnion_accountGet START RequestId: 3b63a928-f380-11e7-a116-5bb37bb69bee Version: $LATEST
/aws/lambda/BigOnion_accountGet END RequestId: 3b63a928-f380-11e7-a116-5bb37bb69bee
/aws/lambda/BigOnion_accountGet REPORT RequestId: 3b63a928-f380-11e7-a116-5bb37bb69bee Duration: 35.01 ms Billed Duration: 100 ms Memory Size: 768 MB Max Memory Used: 48 MB
/aws/lambda/BigOnion_accountGet START RequestId: 4b7fa7f2-f380-11e7-a0c8-513a66a11e7a Version: $LATEST
/aws/lambda/BigOnion_accountGet 2018-01-07T07:56:40.643Z 3b63a928-f380-11e7-a116-5bb37bb69bee caught exception during cluster refresh: { Error: NoRouteException: not able to resolve address
at DaxClientError (/var/task/index.js:545:5)
at AutoconfSource._resolveAddr (/var/task/index.js:18400:23)
at _pull (/var/task/index.js:18421:20)
at _pullFrom.then.catch (/var/task/index.js:18462:18)
time: 1515311800643,
code: 'NoRouteException',
retryable: true,
requestId: null,
statusCode: -1,
_tubeInvalid: false,
waitForRecoveryBeforeRetrying: false }
/aws/lambda/BigOnion_accountGet 2018-01-07T07:56:40.682Z 3b63a928-f380-11e7-a116-5bb37bb69bee Error: NoRouteException: not able to resolve address
at DaxClientError (/var/task/index.js:545:5)
at AutoconfSource._resolveAddr (/var/task/index.js:18400:23)
at _pull (/var/task/index.js:18421:20)
at _pullFrom.then.catch (/var/task/index.js:18462:18)
/aws/lambda/BigOnion_accountGet END RequestId: 4b7fa7f2-f380-11e7-a0c8-513a66a11e7a
/aws/lambda/BigOnion_accountGet REPORT RequestId: 4b7fa7f2-f380-11e7-a0c8-513a66a11e7a Duration: 121.24 ms Billed Duration: 200 ms Memory Size: 768 MB Max Memory Used: 48 MB
/aws/lambda/BigOnion_accountGet START RequestId: 5b951673-f380-11e7-9818-f1effc29edd5 Version: $LATEST
/aws/lambda/BigOnion_accountGet END RequestId: 5b951673-f380-11e7-9818-f1effc29edd5
/aws/lambda/BigOnion_accountGet REPORT RequestId: 5b951673-f380-11e7-9818-f1effc29edd5 Duration: 39.42 ms Billed Duration: 100 ms Memory Size: 768 MB Max Memory Used: 48 MB
/aws/lambda/BigOnion_siteCreate START RequestId: 0ec60080-f380-11e7-afea-a95d25c6e53f Version: $LATEST
/aws/lambda/BigOnion_siteCreate END RequestId: 0ec60080-f380-11e7-afea-a95d25c6e53f
/aws/lambda/BigOnion_siteCreate REPORT RequestId: 0ec60080-f380-11e7-afea-a95d25c6e53f Duration: 3.48 ms Billed Duration: 100 ms Memory Size: 768 MB Max Memory Used: 48 MB
Any ideas what it could be?
It's presumably not the VPC and security access, as 9 times out of 10 access is perfectly fine. I have a wide range of CIDR IPs, so I don't think it's anything related to ENI provisioning... but what else?
The only hint I have is the initial error which states "caught exception during cluster refresh". What exactly is a "cluster refresh" and how could it lead to these failures?
A "cluster refresh" is a background process used by the DAX Client to ensure that its knowledge of the cluster membership state somewhat matches reality, as the DAX client is responsible for routing requests to the appropriate node in the cluster.
Normally a failure on refresh is not an issue because the cluster state rarely changes (And thus the existing state can be reused), but on startup, the client "blocks" to get an initial membership list. If that fails, the client can't proceed as it doesn't know which node can handle which requests.
There can be a slight delay creating the VPC-connected ENI during a Lambda cold start, which means the client cannot reach the cluster (hence "no route to host") during initialization. Once the Lambda container is running it shouldn't be an issue (you might still get the exception in the logs if there's a network hiccup, but it shouldn't affect anything).
If it only happens for you during a cold-start, retrying after a slight delay should be able to work around it.
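As a rough illustration (in Python for brevity, since the other code on this page is Python; dax_get_account is a hypothetical stand-in for whatever call performs your DAX get), the retry could look like:

import time

def with_retry(operation, attempts=3, delay_seconds=1.0):
    """Call `operation`, retrying after a short delay if it raises.

    Intended for transient cold-start failures such as the DAX client's
    NoRouteException while the Lambda ENI is still being attached.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as error:  # narrow this to the client's error type in real code
            last_error = error
            time.sleep(delay_seconds)
    raise last_error

# Hypothetical usage inside the handler:
# account = with_retry(lambda: dax_get_account(account_id))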
I am still confused by some JMeter logs displayed here. Can someone please shed some light on this?
Below is a log generated by JMeter for my tests.
Waiting for possible Shutdown/StopTestNow/Heapdump message on port 4445
summary + 1 in 00:00:02 = 0.5/s Avg: 1631 Min: 1631 Max: 1631 Err: 0 (0.00%) Active: 2 Started: 2 Finished: 0
summary + 218 in 00:00:25 = 8.6/s Avg: 816 Min: 141 Max: 1882 Err: 1 (0.46%) Active: 10 Started: 27 Finished: 17
summary = 219 in 00:00:27 = 8.1/s Avg: 820 Min: 141 Max: 1882 Err: 1 (0.46%)
summary + 81 in 00:00:15 = 5.4/s Avg: 998 Min: 201 Max: 2096 Err: 1 (1.23%) Active: 0 Started: 30 Finished: 30
summary = 300 in 00:00:42 = 7.1/s Avg: 868 Min: 141 Max: 2096 Err: 2 (0.67%)
Tidying up ... # Fri Jun 09 04:19:15 IDT 2017 (1496971155116)
Does this log mean that (in the last step) 300 requests were fired, the whole test took 00:00:42, and the rate was 7.1 threads/sec or 7.1 requests/sec?
How can I increase the TPS? The same tests were run from a different site and they got 132 TPS for the same tests against the same server. Can someone shed some light on this?
Here, the total number of requests is 300 and the throughput is about 7 requests per second. These 300 requests were generated by the number of threads given in your Thread Group configuration. You can also see the number of active threads in the log results; how quickly these threads become active depends on your ramp-up time.
Ramp-up time controls how quickly users (threads) arrive at your application.
Check this for an example: How should I calculate Ramp-up time in Jmeter
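For instance, a Thread Group with 30 threads and a 30-second ramp-up starts roughly one new thread per second.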
You can give your script enough duration and also set the loop count to forever, so that all of the threads keep hitting those requests on your application server until the test finishes.
Once all the threads are active on the server, they will all be hitting those requests concurrently.
To increase the TPS, you have to increase the number of threads, because it is those threads that fire your requests at the server.
It also depends on the response time of your requests.
Suppose,
If you have 500 virtual users and application response time is 1 second - you will have 500 RPS
If you have 500 virtual users and application response time is 2 seconds - you will have 250 RPS
If you have 500 virtual users and application response time is 500 ms - you will have 1000 RPS.
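As a rule of thumb, requests per second ≈ active threads ÷ average response time in seconds (assuming no think time and that the server keeps up), which is where the numbers above come from.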
First of all, a little theory:
You have Sampler(s) which should mimic real user actions
You have Threads (virtual users) defined under Thread Group which mimic real users
JMeter starts threads which execute samplers as fast as they can, generating a certain number of requests per second. This "requests per second" value depends on 2 factors:
number of virtual users
your application response time
The JMeter Summarizer doesn't tell the full story. I would recommend generating the HTML Reporting Dashboard from the .jtl results file; it provides more comprehensive load test result data, which is much easier to analyze with tables and charts, and it can be done as simply as:
jmeter -g /path/to/testresult.jtl -o /path/to/dashboard/output/folder
Looking at your current results, you achieved a maximum throughput of 7.1 requests per second with an average response time of 868 milliseconds.
So in order to get more "requests per second" you need to increase the number of "virtual users". If you increase the number of virtual users and "requests per second" does not increase, it means you have identified the so-called saturation point and your application is not capable of handling more.