Integration latency on API Gateway - aws-lambda

Recently my system has had a problem: integration latency from API Gateway is suddenly higher than usual, especially at night.
Checking Logs Insights, I saw this happening.
For some requests, it took 7s to finish the integration. At first I thought the problem was in Lambda, so I checked it, but it seems it's not.
Lambda took only a little over 1s (slightly higher than normal) to finish executing, and it also started right after integration completed.
Has anyone solved this kind of problem before? Can you please give me some advice?
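One way to narrow this down is to subtract what Lambda itself reports from the integration latency that API Gateway logs. The sketch below assumes you already have the integration latency in milliseconds (for example from the `$context.integrationLatency` access-log variable) and the matching Lambda `REPORT` log line. The function name and the sample numbers other than 7s and roughly 1s are made up for illustration.

```python
import re

# "Duration" on its own, not "Billed Duration" or "Init Duration".
_DURATION = re.compile(r"(?<!Billed )(?<!Init )Duration: ([\d.]+) ms")
# Only present on cold starts.
_INIT = re.compile(r"Init Duration: ([\d.]+) ms")


def unexplained_latency_ms(integration_latency_ms, report_line):
    """Return the integration latency not covered by handler time or cold-start init."""
    duration = float(_DURATION.search(report_line).group(1))
    init_match = _INIT.search(report_line)
    init = float(init_match.group(1)) if init_match else 0.0
    return integration_latency_ms - duration - init


# Hypothetical REPORT line for a request that API Gateway logged at 7000 ms.
report = (
    "REPORT RequestId: abc\tDuration: 1100.00 ms\tBilled Duration: 1100 ms\t"
    "Memory Size: 512 MB\tMax Memory Used: 120 MB\tInit Duration: 400.00 ms"
)
gap = unexplained_latency_ms(7000, report)
```

If the gap stays large even after accounting for `Init Duration`, the time is being spent outside the function's own execution, which is worth stating explicitly when asking AWS support or posting logs.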

Related

Spring reactive poor performance on high load

I have a spring boot webflux application which by default uses netty.
One of the business requirements that we have mandates that requests should time out within 2 seconds.
When very few requests are sent to the app, everything is fine, but when the request load is increased (for example, over 40 or 50 concurrent requests per second from JMeter), sometimes all of them time out because each takes longer than the 2-second threshold.
I have spent a long time reading things online and looking into what could be causing this issue, but with no success. When requests are sent concurrently, most end up taking a long time, and the problematic part is where an external HTTP request is made to another microservice. All my tests are local, and I have tested the microservices and they seem fast enough to handle a big load, so the microservices themselves are not the issue.
I know that netty uses event loop and does not create a thread per request.
I believe there are likely synchronous tasks that are blocking those few netty threads. For this reason I have done massive refactoring and have ".publishOn(Schedulers.boundedElastic())" or ".subscribeOn(Schedulers.boundedElastic())" in the Mono reactive chains. After the refactoring, most of the operations seem to be running on elastic threads and not on the "reactor-http-nio-x" threads (according to the logs), but doing so has not helped the main issue and the problem still exists.
It will be a huge help if someone could direct me to what I should be doing. At this point, I have no more improvements to make, and think I might have been looking at this the wrong way and my approach has not been correct.
I have not attached any code since the application is big and I still do not know where the actual problem lies.
I've encountered the same problem. I didn't find the root cause, but when I switched from WebClient to RestTemplate with a dedicated thread pool per client (external service), the problem was solved. I ran BlockHound to find out whether I block somewhere in the stream, but it didn't find anything. I also tried deploying my application with a larger NIO worker thread pool (by default its size equals the number of cores) and there was some improvement, but in the end RestTemplate yielded the best performance. So I'm still on the WebFlux stack, but I don't use WebClient anymore and the performance on high load is fine.
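The "dedicated thread pool per client" idea is a bulkhead: each external service gets its own bounded pool, so slow calls to one service cannot use up the threads needed for the others. The sketch below shows the pattern in Python rather than Java, just to illustrate the shape. It is not the answerer's RestTemplate setup; the class name, the service names and the pool sizes are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class PerServicePools:
    """One bounded thread pool per external service (a bulkhead)."""

    def __init__(self, pool_sizes):
        # pool_sizes maps a service name to its maximum number of worker threads.
        self._pools = {
            name: ThreadPoolExecutor(max_workers=size, thread_name_prefix=name)
            for name, size in pool_sizes.items()
        }

    def call(self, service, fn, *args, timeout=2.0):
        """Run a blocking call on the service's own pool, waiting at most `timeout` seconds."""
        return self._pools[service].submit(fn, *args).result(timeout=timeout)

    def shutdown(self):
        for pool in self._pools.values():
            pool.shutdown(wait=True)


# Hypothetical service names; the lambda stands in for a blocking HTTP call.
pools = PerServicePools({"inventory": 4, "pricing": 4})
worker = pools.call("inventory", lambda: threading.current_thread().name)
pools.shutdown()
```

The default timeout of 2.0 mirrors the 2-second requirement from the question. Note that the timeout only stops the caller from waiting; it does not cancel the call that is already running on the pool.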

AWS Aurora / Lambda serverless production environment exhibiting occasional spikes

We've been running our production web app off AWS Lambda / API Gateway, with an Aurora serverless database. Things had been running smoothly for over a year, but recently (coinciding with much increased periods of peak usage) we've experienced temporary slowness, and in the worst case unavailability, due to some kind of bottleneck that results in a spike in the number of DB connections and 4XX and 5XX from our two APIs.
We're using the serverless-mysql library to execute queries and manage DB connections.
Some potential causes of the issue that have been eliminated:
There are no long-running queries locking up tables or anything of that sort (as demonstrated by show full processlist in MySQL); in fact, no query runs longer than 1s according to our slow_log
All calls to await serverlessMysql.query() are immediately followed by await serverlessMysql.end()
Our database manager class is instantiated outside the Lambda handler, so it isn't reinstantiated every time a Lambda instance is reused
We've adjusted the config options for serverless-mysql so that retries aren't so aggressive. The default config makes it very aggressive in retrying to connect, both in frequency and number of retries. This has definitely helped, but has not eliminated the problem.
What details can I post that might help someone diagnose this problem? It's a major pain in the ass.
It would be helpful to see the load this application is getting, which I know is easier said than done with Lambda.
You sort of hinted at it, but it's possible you're hitting the max connections limit for the 'capacity class' your Aurora Serverless instance is set to. I've hit this a few times. It's hard to discover with Lambda and Aurora Serverless because you don't have the same logging you would traditionally have.
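A quick back-of-the-envelope check for this: compare the limit reported by `SHOW VARIABLES LIKE 'max_connections'` against the number of Lambda instances running at peak, since each warm instance can hold its own connection. The helper below is only a sketch; the function name and all the numbers in the example are illustrative, not values from the question.

```python
def connection_headroom(max_connections, concurrent_lambdas,
                        conns_per_lambda=1, reserved=0):
    """Return how many connections remain if every concurrent Lambda
    instance holds `conns_per_lambda` connections at the same time.

    `reserved` covers connections used by anything else (admin tools,
    other services). A negative result means the limit can be exceeded.
    """
    return max_connections - reserved - concurrent_lambdas * conns_per_lambda


# Illustrative numbers: a limit of 90 and a peak of 120 concurrent instances.
headroom = connection_headroom(max_connections=90, concurrent_lambdas=120)
```

A negative headroom at peak concurrency would be consistent with the connection spikes and 5XX responses described above, and `SHOW STATUS LIKE 'Threads_connected'` during a spike can confirm it.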
Outside of that, the core issue you're experiencing seems to be related to spikes created from your application - so you need to discover if a query is maybe just inefficient, and running too many times at once. These are almost impossible to troubleshoot with Lambda logs. But db locks still occur with aurora serverless.
To help track down the issue, you could try the following:
Set up APM
I highly, highly recommend getting something like New Relic set up and monitoring your Lambda function.
I'm pretty sure New Relic has a free trial option, and tracking down a problem like this would be fairly simple with an APM. I can't tell you how much easier problems like this are to solve with a solid APM.
Monitor traffic ingress
Again, I'm not sure what this application is doing, but it's possible that a spike in network traffic from a particular user kicks off a load of queries that make things go awry. Set up a free Cloudflare account or some other proxy if you can, so you can inspect network traffic more easily.
Hope this helps.

Possible explanation of sudden spike in memcached get

New Relic newbie here. I have an API service hosted on Heroku and monitored with New Relic.
While I was learning how to use New Relic, I found out that my 2 workers were being underutilised, with very low RPM and low transaction time. So I decided to cut down to one worker, which saves me $36 a month. =]
Shortly after that I received tonnes of Logentries emails reporting request timeouts on one of my web dynos. Looking into New Relic, I found out that one of my actions was being called a suspiciously high number of times for 2-3 minutes.
The action being V1::CarsController#Index, which basically shows a collection of cars.
While I am not sure whether removing one worker dyno caused memcached to do something, I also suspect that maybe someone is trying to scrape the data off the database. I am not too sure how to investigate the issue further. Can I track down the request IPs and see whether they are the same? Or how else can I investigate?
If further information is needed I am happy to provide in Edits!
Thanks
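On the question of tracking down request IPs: Heroku router log lines carry the client address in the `fwd` field, so counting requests per `fwd` value for the affected path shows whether one client is responsible. The sketch below assumes you have exported the router log lines for the spike window; the `/v1/cars` path is a guess at the route behind V1::CarsController#Index, and the function name is made up.

```python
import re
from collections import Counter

_PATH = re.compile(r'path="([^"]+)"')
_FWD = re.compile(r'fwd="([^"]+)"')


def requests_per_client(log_lines, path_prefix):
    """Count requests per client IP for paths starting with `path_prefix`.

    Returns (ip, count) pairs, busiest client first.
    """
    counts = Counter()
    for line in log_lines:
        path = _PATH.search(line)
        fwd = _FWD.search(line)
        if path and fwd and path.group(1).startswith(path_prefix):
            # fwd can hold a comma-separated proxy chain; the first entry
            # is the original client.
            counts[fwd.group(1).split(",")[0].strip()] += 1
    return counts.most_common()
```

If a single address dominates the result, scraping is a plausible explanation; if the requests are spread evenly across many addresses, the spike is more likely ordinary traffic that the remaining worker could not keep up with.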

Node.js suddenly getting extremely slow

We have this architecture with 2 node processes.
One polls a private API and pushes the changes to the second node if any.
The second node processes the data, calls a bunch of other APIs, and eventually emits a change event to the client, an HTML5 website, with socket.io
This second node will always process the data and will always emit changes even if no clients are connected. So in my opinion the CPU or mem usage is not that greatly affected by the number of connected clients. Also note that this architecture is still running on a private staging environment.
Everything ran fine and we were ready to go live, until we noticed that after a couple of days, maybe a week, the second node suddenly gets extremely slow while the first node is still fine.
It gets so bad that even the connection between the two nodes times out, and they are on the same network over localhost. It also takes more than 10 seconds to browse to the socket.io/socket.io.js file.
I know it's very hard to understand the problem without seeing any code, but I'm kinda pulling my hair out because we have to go live in a couple of days and my logs are not revealing anything and Google isn't helping either.
Have you ever experienced anything like this? What was the problem and how did you fix it?
What's a good monitor and profiler for node.js? (preferably free)
What are good practices for building a node.js app that makes a lot of outgoing API calls?
Anything or anyone that could help me in the right direction of solving or even discovering the actual problem will be greatly appreciated!
Thank you!
I've never experienced anything like this, but maybe the second node is blocking the event loop by doing CPU-intensive work or waiting for some resource synchronously.
Add some logging to your code to see how much time the second node takes to process each change pushed by the first node. Maybe some type of change consumes CPU for 10 seconds or so before it completes.
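The timing idea is language-agnostic, so here is a minimal sketch of it in Python; in Node.js the same thing can be done by taking timestamps around the processing function. The `timed` helper, the 1000 ms threshold and the `sink` parameter are all assumptions made for this example.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("change-timing")


@contextmanager
def timed(label, slow_ms=1000.0, sink=None):
    """Measure how long the wrapped block takes and log it.

    Blocks slower than `slow_ms` are logged as warnings so they stand out.
    If `sink` is a list, (label, elapsed_ms) is appended to it as well.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if sink is not None:
            sink.append((label, elapsed_ms))
        level = logging.WARNING if elapsed_ms >= slow_ms else logging.DEBUG
        log.log(level, "%s took %.1f ms", label, elapsed_ms)
```

Wrapping the handling of each incoming change in `with timed("change <id>"):` and watching how the durations trend over several days would show whether processing gradually slows down or whether one kind of change is responsible.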
You should also start monitoring memory, CPU and network connections. When things slow down, your monitoring will provide some clue as to where the bottleneck is.
For monitoring, you can try the following 3 tools:
nodetime
hummingbird
node-monitor
Also read http://nodetime.com/blog/monitoring-nodejs-application-performance
It sounds like you have a memory leak somewhere in the second node, maybe from calling too many anonymous functions, etc. Do you notice your RAM usage slowly creeping up as it runs?

AppEngine response time is slow

I am using a modified version of the TaskCloud example to try and read/write my own data.
While testing on a deployed version, I've noticed that the round-trip response time is slow.
From my Android device, I have a 100ms ping response to appspot.com.
I have changed the AppEngine application to do nothing (the Google Dashboard shows insignificant Average Latency).
The problem is that the call to HttpClient's client.execute(post) takes about 3 seconds.
(This is the time when an instance is already loaded)
Any suggestions would be greatly appreciated.
EDIT: I've watched the video of Google I/O showing the CloudTasks Android-AppEngine app, and you can see that refreshing the list (a single call to AppEngine) takes about 3 seconds as well. The guy is saying something about performance which I didn't fully get (debuggers are running at both ends?)
The video: http://www.youtube.com/watch?v=M7SxNNC429U&feature=related
Time location: 0:46:45
I'll keep investigating...
Thanks for your help so far.
EDIT 2: Back to this issue...
I've used the Shark packet sniffer to find out what is happening. Some of the time is spent negotiating an SSL connection for each server call. Using http (and ACSID) is faster than https (and SACSID).
new DefaultHttpClient() and new HttpPost() are used for each server call.
EDIT 3:
Looking at the sniffer logs again, there is an almost 2 seconds delay before the actual POST.
I have also found out that the issue exists with Android 2.2 (all versions) but is resolved with Android 2.3
EDIT 4: It's been resolved. Please see my answer below.
It's difficult to answer your question since no details about your app are provided. Anyway, you can try the Appstats tool provided by Google to analyze the bottleneck.
After using the Shark sniffer, I was able to understand the exact issue and I've found the answer in this question.
I have used Liudvikas Bukys's comment and solved the problem using the suggested line:
post.getParams().setBooleanParameter(CoreProtocolPNames.USE_EXPECT_CONTINUE, false);
Often the first call to your GAE app will take longer than subsequent calls. You should make yourself familiar with loading and warm-up requests and how GAE handles instances of your app: http://code.google.com/intl/de-DE/appengine/docs/adminconsole/instances.html
Some things you could also try:
make your app handle more than one request per instance (make sure your app is threadsafe!) http://code.google.com/intl/de-DE/appengine/docs/java/config/appconfig.html#Using_Concurrent_Requests
enable the Always On feature in the app admin console (this will cost you)
