Intermittent DynamoDB DAX errors: NoRouteException during cluster refresh - aws-lambda

Via CloudFormation, I have a setup including DynamoDB tables, DAX, VPC, Lambdas (living in VPC), Security Groups (allowing access to port 8111), and so on.
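The relevant networking piece is the security group rule that lets the Lambdas reach DAX on port 8111. Roughly, it expresses the following (a simplified CDK-style TypeScript sketch of that one rule, not my actual CloudFormation; the construct names are made up):

import { App, Stack } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

class DaxNetworkingStack extends Stack {
  constructor(scope: App, id: string) {
    super(scope, id);

    const vpc = new ec2.Vpc(this, 'Vpc');

    // One security group for the VPC'd Lambdas, one for the DAX cluster.
    const lambdaSg = new ec2.SecurityGroup(this, 'LambdaSg', { vpc });
    const daxSg = new ec2.SecurityGroup(this, 'DaxSg', { vpc });

    // Allow the Lambdas to reach the DAX cluster on its (unencrypted) port 8111.
    daxSg.addIngressRule(lambdaSg, ec2.Port.tcp(8111), 'Lambda -> DAX');
  }
}

new DaxNetworkingStack(new App(), 'DaxNetworkingStack');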
Everything works, except when it doesn't.
I can access DAX from my VPC'd Lambdas 99% of the time, but occasionally they get NoRouteException errors, seemingly at random. Here's the output from CloudWatch for a single Lambda function doing the exact same thing each time (a DAX get). Notice how it works, fails, and then works again:
/aws/lambda/BigOnion_accountGet START RequestId: 2b732899-f380-11e7-a650-cbfe0f7dfb3d Version: $LATEST
/aws/lambda/BigOnion_accountGet END RequestId: 2b732899-f380-11e7-a650-cbfe0f7dfb3d
/aws/lambda/BigOnion_accountGet REPORT RequestId: 2b732899-f380-11e7-a650-cbfe0f7dfb3d Duration: 58.24 ms Billed Duration: 100 ms Memory Size: 768 MB Max Memory Used: 48 MB
/aws/lambda/BigOnion_accountGet START RequestId: 3b63a928-f380-11e7-a116-5bb37bb69bee Version: $LATEST
/aws/lambda/BigOnion_accountGet END RequestId: 3b63a928-f380-11e7-a116-5bb37bb69bee
/aws/lambda/BigOnion_accountGet REPORT RequestId: 3b63a928-f380-11e7-a116-5bb37bb69bee Duration: 35.01 ms Billed Duration: 100 ms Memory Size: 768 MB Max Memory Used: 48 MB
/aws/lambda/BigOnion_accountGet START RequestId: 4b7fa7f2-f380-11e7-a0c8-513a66a11e7a Version: $LATEST
/aws/lambda/BigOnion_accountGet 2018-01-07T07:56:40.643Z 3b63a928-f380-11e7-a116-5bb37bb69bee caught exception during cluster refresh: { Error: NoRouteException: not able to resolve address
at DaxClientError (/var/task/index.js:545:5)
at AutoconfSource._resolveAddr (/var/task/index.js:18400:23)
at _pull (/var/task/index.js:18421:20)
at _pullFrom.then.catch (/var/task/index.js:18462:18)
time: 1515311800643,
code: 'NoRouteException',
retryable: true,
requestId: null,
statusCode: -1,
_tubeInvalid: false,
waitForRecoveryBeforeRetrying: false }
/aws/lambda/BigOnion_accountGet 2018-01-07T07:56:40.682Z 3b63a928-f380-11e7-a116-5bb37bb69bee Error: NoRouteException: not able to resolve address
at DaxClientError (/var/task/index.js:545:5)
at AutoconfSource._resolveAddr (/var/task/index.js:18400:23)
at _pull (/var/task/index.js:18421:20)
at _pullFrom.then.catch (/var/task/index.js:18462:18)
/aws/lambda/BigOnion_accountGet END RequestId: 4b7fa7f2-f380-11e7-a0c8-513a66a11e7a
/aws/lambda/BigOnion_accountGet REPORT RequestId: 4b7fa7f2-f380-11e7-a0c8-513a66a11e7a Duration: 121.24 ms Billed Duration: 200 ms Memory Size: 768 MB Max Memory Used: 48 MB
/aws/lambda/BigOnion_accountGet START RequestId: 5b951673-f380-11e7-9818-f1effc29edd5 Version: $LATEST
/aws/lambda/BigOnion_accountGet END RequestId: 5b951673-f380-11e7-9818-f1effc29edd5
/aws/lambda/BigOnion_accountGet REPORT RequestId: 5b951673-f380-11e7-9818-f1effc29edd5 Duration: 39.42 ms Billed Duration: 100 ms Memory Size: 768 MB Max Memory Used: 48 MB
/aws/lambda/BigOnion_siteCreate START RequestId: 0ec60080-f380-11e7-afea-a95d25c6e53f Version: $LATEST
/aws/lambda/BigOnion_siteCreate END RequestId: 0ec60080-f380-11e7-afea-a95d25c6e53f
/aws/lambda/BigOnion_siteCreate REPORT RequestId: 0ec60080-f380-11e7-afea-a95d25c6e53f Duration: 3.48 ms Billed Duration: 100 ms Memory Size: 768 MB Max Memory Used: 48 MB
Any ideas what it could be?
It's presumably not the VPC and security access, as 9 times out of 10 access is perfectly fine. I have a wide CIDR range with plenty of IPs, so I don't think it's anything related to ENI provisioning... but what else?
The only hint I have is the initial error which states "caught exception during cluster refresh". What exactly is a "cluster refresh" and how could it lead to these failures?

A "cluster refresh" is a background process used by the DAX Client to ensure that its knowledge of the cluster membership state somewhat matches reality, as the DAX client is responsible for routing requests to the appropriate node in the cluster.
Normally a failure on refresh is not an issue because the cluster state rarely changes (and thus the existing state can be reused), but on startup the client blocks to get an initial membership list. If that fails, the client can't proceed, as it doesn't know which node can handle which requests.
There can be a slight delay creating the VPC-attached ENI during a Lambda cold start, which means the client cannot reach the cluster (hence "no route to host") during initialization. Once the Lambda container is warm it shouldn't be an issue (you might still see the exception in the logs if there's a network hiccup, but it shouldn't affect anything).
If it only happens for you during a cold start, retrying after a short delay should work around it.
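A thin retry wrapper around the first DAX call is usually enough for that. A minimal sketch for the Node.js DAX client, assuming the amazon-dax-client package and the AWS SDK v2 DocumentClient; the endpoint, table name, retry count and delays are placeholders rather than recommendations:

import * as AWS from 'aws-sdk';
// amazon-dax-client ships without TypeScript definitions, so require() it.
const AmazonDaxClient = require('amazon-dax-client');

// The DAX cluster endpoint (host:8111) is taken from the environment here.
const dax = new AmazonDaxClient({
  endpoints: [process.env.DAX_ENDPOINT as string],
  region: process.env.AWS_REGION,
});
const doc = new AWS.DynamoDB.DocumentClient({ service: dax });

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry a get a few times with a short backoff, so a cold-start
// NoRouteException (ENI / cluster refresh not ready yet) can recover.
async function getWithRetry(
  params: AWS.DynamoDB.DocumentClient.GetItemInput,
  attempts = 3,
  delayMs = 200,
): Promise<AWS.DynamoDB.DocumentClient.GetItemOutput> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await doc.get(params).promise();
    } catch (err: any) {
      if (attempt >= attempts || err.code !== 'NoRouteException') throw err;
      await sleep(delayMs * attempt);
    }
  }
}

// Usage inside the handler, e.g.:
// const result = await getWithRetry({ TableName: 'Accounts', Key: { id } });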

Related

Create_Failed S3BatchProcessor, AWS Lambda

I am running cdk deploy in my Textract pipeline folder for large document processing. However, when I run this program I get this error.
The error
| CREATE_FAILED | AWS::Lambda::Function | S3BatchProcessor6C619AEA
Resource handler returned message: "Specified ReservedConcurrentExecutions for function decreases account's UnreservedConcurrentExecution below its minimum value of [10]. (Service: Lambda, Status Code: 400, Request ID: 7f6d1305-e248-4745-983e-045eccde562d)" (RequestToken: 9c84827d-502e-5697-b023-e0be45f8d451, HandlerErrorCode: InvalidRequest)
By default, AWS provides an account-level concurrency limit of at most 1,000.
In your case, the reserved concurrency configured across all the Lambdas in your account is pushing the unreserved concurrency below its minimum value of 10; that is, the deployment requires
account concurrency limit - sum of ReservedConcurrentExecutions across all functions >= 10
and this stack would violate that constraint, which is why the deployment fails.
There can be two solutions here:
Reduce the reserved concurrency of your Lambdas so that the inequality above holds (see the CDK sketch below), or
Raise the account concurrency limit by requesting a limit increase from AWS Support.
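For the first option, reserved concurrency is a per-function setting; in CDK (TypeScript) it is the reservedConcurrentExecutions prop. A rough sketch, with a made-up runtime, handler and value just to show where the knob lives:

import { Stack, StackProps } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

class ProcessingStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    new lambda.Function(this, 'S3BatchProcessor', {
      runtime: lambda.Runtime.PYTHON_3_9,   // placeholder runtime
      handler: 'index.handler',             // placeholder handler
      code: lambda.Code.fromAsset('lambda'),
      // Lower this value (or drop the prop entirely) so that
      // account limit - sum of all reserved concurrency >= 10.
      reservedConcurrentExecutions: 5,
    });
  }
}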

Please suggest hardware configuration for network-intensive Flink job (Async I/O)

TL;DR: I am running a Flink streaming job in batch execution mode on EMR. I have tried several EMR cluster configurations, but none of them works as required; some do not work at all. The workflow is very network-intensive, which is what causes the main problems.
Question: What EMR cluster configuration (ec2 instance types) would you recommend for this use-case?
--
The job has following stages:
Read from MySQL
KeyBy user_id
Reduce by user_id
Async I/O enriching from Redis
Async I/O enriching from other Redis
Async I/O enriching from REST #1
Async I/O enriching from REST #2
Async I/O enriching from REST #2
Write to Elasticsearch
Other info:
Flink version: 1.13.1
EMR version: 6.4.0
Java version: Corretto 8.302.08.1 (provided by EMR)
Input data size: ~800 GB
Output data size: ~300 GB
"taskmanager.network.sort-shuffle.min-parallelism": 1
"taskmanager.memory.framework.off-heap.batch-shuffle.size": 256m
"taskmanager.network.sort-shuffle.min-buffers": 2048
"taskmanager.network.blocking-shuffle.compression.enabled": true
"taskmanager.memory.framework.off-heap.size": 512m
"taskmanager.memory.network.max": 2g
Configurations we tried:
#1
master: r6g.xlarge
core: r6g.xlarge (per/hour: $0.2; CPU: 4; RAM: 32 GiB; Disk: EBS 128 GB, network: 1.25 Gigabit baseline with burst up to 10 Gigabit)
min_scale: 2
max_scale: 25
expected: finishes within 24 hours
actual: works with sort-based shuffling enabled, but very slowly (~36 h). This instance type has baseline-plus-burst network performance, and once the burst credits are exhausted it degrades to the baseline bandwidth, which slows down I/O. With hash-based shuffling it fails at KeyBy -> Reduce with "Connection reset by peer": a Task Manager fails -> the job fails -> the Job Manager is not able to restart it.
#2
master: m5.xlarge
core: r6g.12xlarge (per/hour: $2.4; CPU: 48; RAM: 384 GiB; Disk: EBS 1.5 TB, network: 20 Gigabit)
min_scale: 1
max_scale: 4
expected: finishes within 24 hours, as there is much higher network bandwidth
actual: does not work. With sort-based shuffling it fails in the write phase with the exception "Failed to transfer file from TaskExecutor". With hash-based shuffling it fails at the same stage with "Connection reset by peer".

Lambda function starts, not executed, and timeout is reached

I have a Lambda function triggered by API Gateway. I trigger it and I can see in the logs that it starts, but then nothing happens: it's stuck until the timeout is reached. The very first line of the actual function is a console.log that is never printed at all.
This is the log of the lambda:
2021-10-20T13:27:50.740+03:00 START RequestId: effb5220-abd0-492a-a355-e17cba47b491 Version: $LATEST
2021-10-20T13:29:22.746+03:00 END RequestId: effb5220-abd0-492a-a355-e17cba47b491
2021-10-20T13:29:22.746+03:00 REPORT RequestId: effb5220-abd0-492a-a355-e17cba47b491 Duration: 92004.98 ms Billed Duration: 90000 ms Memory Size: 128 MB Max Memory Used: 128 MB
2021-10-20T13:29:22.746+03:00 2021-10-20T10:29:22.746Z effb5220-abd0-492a-a355-e17cba47b491 Task timed out after 92.00 seconds
I don't know where else to look to debug this issue.
If it matters, the Lambda function is written in TypeScript and deployed with the CDK.

Requests and Threads understanding in JMeter logs

I am still confused by some of the JMeter logs displayed here. Can someone please shed some light on this?
Below is a log generated by JMeter for my tests.
Waiting for possible Shutdown/StopTestNow/Heapdump message on port 4445
summary + 1 in 00:00:02 = 0.5/s Avg: 1631 Min: 1631 Max: 1631 Err: 0 (0.00%) Active: 2 Started: 2 Finished: 0
summary + 218 in 00:00:25 = 8.6/s Avg: 816 Min: 141 Max: 1882 Err: 1 (0.46%) Active: 10 Started: 27 Finished: 17
summary = 219 in 00:00:27 = 8.1/s Avg: 820 Min: 141 Max: 1882 Err: 1 (0.46%)
summary + 81 in 00:00:15 = 5.4/s Avg: 998 Min: 201 Max: 2096 Err: 1 (1.23%) Active: 0 Started: 30 Finished: 30
summary = 300 in 00:00:42 = 7.1/s Avg: 868 Min: 141 Max: 2096 Err: 2 (0.67%)
Tidying up ... # Fri Jun 09 04:19:15 IDT 2017 (1496971155116)
Does this log mean that, in the last summary line, 300 requests were fired and the whole test took 00:00:42? And is the 7.1/s figure threads per second or requests per second?
How can I increase the TPS? The same tests were run from a different site and they get 132 TPS for the same tests against the same server. Can someone shed some light on this?
Here, the total number of requests is 300 and the throughput is about 7 requests per second. These 300 requests were generated by the number of threads you set in the Thread Group configuration. You can also see the number of active threads in the log results; how quickly these threads become active depends on your ramp-up time.
Ramp-up time controls how quickly users (threads) arrive at your application.
Check this for an example: How should I calculate Ramp-up time in Jmeter
You can give your script enough duration and set the loop count to forever, so that all of the threads keep hitting those requests on your application server until the test finishes.
Once all the threads become active, they all hit those requests on the server.
To increase the TPS, you need to increase the number of threads, because those threads are what generate the requests against the server.
It also depends on the response time of your requests.
Suppose,
If you have 500 virtual users and application response time is 1 second - you will have 500 RPS
If you have 500 virtual users and application response time is 2 seconds - you will have 250 RPS
If you have 500 virtual users and application response time is 500 ms - you will have 1000 RPS.
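In other words, with no think time, throughput is roughly the number of threads divided by the average response time. A tiny sketch of that arithmetic (illustrative only, reproducing the numbers above):

// Rough throughput estimate for a closed workload with zero think time:
// each thread completes about 1 / responseTime requests per second.
function estimateRps(threads: number, avgResponseTimeSec: number): number {
  return threads / avgResponseTimeSec;
}

console.log(estimateRps(500, 1));    // 500 RPS
console.log(estimateRps(500, 2));    // 250 RPS
console.log(estimateRps(500, 0.5));  // 1000 RPS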
First of all, a little theory:
You have Sampler(s) which should mimic real user actions
You have Threads (virtual users) defined under Thread Group which mimic real users
JMeter starts threads which execute samplers as fast as they can and generate a certain number of requests per second. This "requests per second" value depends on 2 factors:
number of virtual users
your application response time
The JMeter summarizer doesn't tell the full story. I would recommend generating the HTML Reporting Dashboard from the .jtl results file; it provides more comprehensive load test result data which is much easier to analyze by looking at tables and charts, and it can be done as simply as:
jmeter -g /path/to/testresult.jtl -o /path/to/dashboard/output/folder
Looking at the current results, you achieved a maximum throughput of 7.1 requests per second with an average response time of 868 milliseconds.
So in order to get more "requests per second" you need to increase the number of "virtual users". If you increase the number of virtual users and "requests per second" does not increase, it means you have identified the so-called saturation point and your application is not capable of handling more.

Load test with Visual Studio doesn't increment the number of users

I configured a load test in visual studio with the following settings:
Load Pattern: Step
Initial User Count: 500
Maximum User Count: 1000
Step Duration (seconds): 300
Step Ramp Time (seconds): 0
Step User Count: 250
I need to start the test with 500 users, then increment to 750 users and finish with 1000 users.
So, in my report, in the Key Indicators section, I expect to see:
Load users:
min: 500
max: 1000
avg: 750
But really I see:
Load users:
min: 500
max: 500
avg: 500
I don't know what I'm doing wrong.
