Amazon Load Balancer excessively high latency - Magento

I'm having an issue with an AWS load balancer - loading pages through it seems to give high latency (~5s).
There are two EC2 instances living behind the load balancer, let's call them p1 and p2.
I'm running Magento on these instances, they're both connected to the same database.
When viewing a category page on p1 or p2 directly, the initial load time is < 500ms, but when I visit the load balancer (which then points to p1 or p2) the browser spends ~5 seconds waiting for a response from the server.
This is a typical request to p1 or p2 directly, and this is a typical request through the load balancer (timing screenshots).
I initially suspected it might be an issue with Magento trying to re-cache for requests coming from the load balancer, but I have since set p1 and p2 to keep their caches synchronised, so caching is unlikely to be the cause.
The stacks on p1 and p2 are fairly regular Apache2 + PHP-FPM + PHP setups that are lightning fast on their own.
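For comparison outside the browser, it can help to time the same category page over both paths from the command line. A minimal sketch in Python (the URLs below are placeholders, not the actual hostnames):

```python
import time
import urllib.request

# Hypothetical endpoints - substitute the real instance and ELB hostnames.
ENDPOINTS = {
    "p1 direct": "http://p1.example.internal/category-page.html",
    "p2 direct": "http://p2.example.internal/category-page.html",
    "via ELB":   "http://my-elb-123456.eu-west-1.elb.amazonaws.com/category-page.html",
}

for name, url in ENDPOINTS.items():
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=30) as resp:
        resp.read()  # read the full body so the whole response is timed
    print(f"{name:10s} {time.monotonic() - start:6.2f}s")
```

If the direct requests stay under ~0.5 s and only the ELB path shows the ~5 s, the delay is being introduced at or before the load balancer rather than inside Magento.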

AWS has recently released an ELB feature for exactly this kind of troubleshooting: you can now get ELB access logs. These access logs can help you determine the time taken by a request at each stage, e.g.:
request_processing_time: Total time elapsed (in seconds) from the time the load balancer receives the request until it sends the request to a registered instance.
backend_processing_time: Total time elapsed (in seconds) from the time the load balancer sends the request to a registered instance until the instance begins sending the response headers.
response_processing_time: Total time elapsed (in seconds) from the time the load balancer receives the response header from the registered instance until it starts sending the response to the client. This includes both the queuing time at the load balancer and the connection acquisition time from the load balancer to the backend.
...and a lot more. You need to enable access logging first. The following articles give more detail on configuring and using ELB access logs:
Access Logs for Elastic Load Balancers
Access Logs
These logs may or may not pinpoint your problem, but they are certainly a good place to start. Beyond that, you can always check with AWS technical support for a more in-depth analysis.
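For illustration, here is one way to pull those three fields out of a Classic ELB access log line; the sample line and its timing values below are made up, following the documented field order:

```python
import shlex

# One line from a Classic ELB access log (this sample line is invented for illustration;
# a large backend_processing_time would point at the instance, not the ELB).
line = ('2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 '
        '0.000073 4.829907 0.000088 200 200 0 29 "GET http://example.com:80/ HTTP/1.1"')

fields = shlex.split(line)  # shlex keeps the quoted request string together
(timestamp, elb, client, backend,
 request_t, backend_t, response_t,
 elb_status, backend_status, received, sent, request) = fields[:12]

print(f"request_processing_time:  {float(request_t):.6f}s")
print(f"backend_processing_time:  {float(backend_t):.6f}s")
print(f"response_processing_time: {float(response_t):.6f}s")
print(f"total at the ELB:         {float(request_t) + float(backend_t) + float(response_t):.6f}s")
```

Seeing which of the three components carries the ~5 s tells you whether to look at the ELB itself, the connection between ELB and instance, or the application behind it.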

Related

GKE and RPS - 'Usage is at capacity' - and performance issues

We have a GKE cluster with Ingress (kubernetes.io/ingress.class: "gce") where one backend is serving our production site.
The cluster is a regional one with 3 zones (autoscaling enabled).
The backend serving the production site is a Varnish server running as a Deployment with a single replica. Behind Varnish there are multiple Nginx/PHP pods running under a HorizontalPodAutoscaler.
The performance of the site is slow. Using the GCP console we have noticed that all traffic is routed to only one backend, and there is only 1/1 healthy endpoint in one zone?
We are getting an exclamation mark next to the serving backend with the messages 'Usage is at capacity, max = 1' and 'Backend utilization: 0%'. The other backend in the second zone has no endpoint configured? And there is no third backend in the third zone?
Initially we were getting a lot of 5xx responses from the backend at around 80 RPS, so we turned on CDN via BackendConfig.
This reduced the 5xx responses and brought the backend down to around 9 RPS, with around 83% of requests now served from the CDN.
We are trying to figure out whether it is possible to improve our backend utilization, as serving 80 RPS from one Varnish server with many pods behind it should clearly be easily achievable. We cannot find any underperforming pod (Varnish itself or Nginx/PHP) in this scenario.
Is GKE/GCP throttling the backend/endpoint to only support 1 RPS?
Is there any way to increase the RPS per endpoint and increase the number of endpoints, at least one per zone?
Is there any documentation available that explains how to scale such an architecture on GKE/GCP?
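One way to check whether this is configured capacity rather than actual throttling is to look at the balancing mode and max-rate settings on the backend service the ingress created. A rough sketch with the google-cloud-compute client (the project and backend-service names below are placeholders; the real name can be read off the GCP console):

```python
from google.cloud import compute_v1

# Hypothetical project and backend-service name (GKE ingress usually generates
# a name along the lines of "k8s-be-<nodeport>--<hash>").
PROJECT = "my-gcp-project"
BACKEND_SERVICE = "k8s-be-30080--0123456789abcdef"

client = compute_v1.BackendServicesClient()
service = client.get(project=PROJECT, backend_service=BACKEND_SERVICE)

# Print the per-backend capacity settings that drive the "Usage is at capacity" message.
for backend in service.backends:
    print(backend.group)
    print("  balancing_mode:       ", backend.balancing_mode)
    print("  max_rate_per_endpoint:", backend.max_rate_per_endpoint)
    print("  max_rate_per_instance:", backend.max_rate_per_instance)
    print("  capacity_scaler:      ", backend.capacity_scaler)
```

If the balancing mode is RATE with a max rate of 1, the "at capacity" warning reflects that configured limit rather than what the Varnish pod can actually handle.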

AWS Application Load Balancer stops sending traffic when a new instance is added

I have a problem with autoscaling behind an AWS Application Load Balancer.
I'm running JMeter tests and discovered that whenever a new instance is added to the Auto Scaling group (that is, when it becomes healthy and the ALB starts routing traffic to it), then for a short period of time the load balancer forwards fewer requests to the targets and a lot of requests are apparently stuck at the load balancer itself.
I'm attaching 3 images that show this issue: the JVM CPU on one of the instances drops and then goes back to normal, some requests hang for more than 30 seconds, and the number of requests per target drops and then returns to the trend (see attached pictures).
I'm using sticky sessions with a 3-minute validity period.
Does anyone know what may cause this temporary "choking" when a new instance is added?
It is quite crucial to our user experience; I can't actually understand why adding a new instance can have such an adverse effect on traffic routing.
The issue is fully reproducible.
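One way to see exactly when the new target flips to healthy, and correlate that with the dip in the JMeter graphs, is to poll the target group's health while the test runs. A rough sketch with boto3 (the target group ARN below is a placeholder):

```python
import time
import boto3

# Hypothetical target group ARN - replace with your own.
TARGET_GROUP_ARN = ("arn:aws:elasticloadbalancing:eu-west-1:123456789012:"
                    "targetgroup/my-targets/0123456789abcdef")

elbv2 = boto3.client("elbv2")
previous = {}

# Poll target health every few seconds and log state transitions, so they can be
# lined up against the timestamps of the dip seen in the load-test graphs.
while True:
    health = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
    for desc in health["TargetHealthDescriptions"]:
        target_id = desc["Target"]["Id"]
        state = desc["TargetHealth"]["State"]
        if previous.get(target_id) != state:
            print(f"{time.strftime('%H:%M:%S')} {target_id}: "
                  f"{previous.get(target_id, '-')} -> {state}")
            previous[target_id] = state
    time.sleep(5)
```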

Using Azure load balancer to reboot/update server with zero downtime

I have a really simple setup: an Azure load balancer for HTTP(S) traffic, two application servers running Windows, and one database, which also contains session data.
The goal is being able to reboot or update the software on the servers, without a single request being dropped. The problem is that the health probe will do a test every 5 seconds and needs to fail 2 times in a row. This means when I kill the application server, a lot of requests during those 10 seconds will time out. How can I avoid this?
I have already tried running the health probe on a different port and then denying all traffic to that port using the Windows firewall. The load balancer will think the application is down on that node and therefore no longer send new traffic to it. However... Azure LB does hash-based load balancing, so the traffic that was already going to the now-killed node will keep going there for a few seconds!
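For concreteness, that probe-port trick can be scripted; a minimal sketch assuming a hypothetical dedicated probe port and the Windows netsh firewall (it still suffers from the hash-based draining problem just described):

```python
import subprocess
import time

PROBE_PORT = "8081"        # hypothetical dedicated health-probe port
RULE_NAME = "BlockLbProbe"

def block_probe():
    # Deny inbound traffic to the probe port so the Azure LB marks this node as down.
    subprocess.run(["netsh", "advfirewall", "firewall", "add", "rule",
                    f"name={RULE_NAME}", "dir=in", "action=block",
                    "protocol=TCP", f"localport={PROBE_PORT}"], check=True)

def unblock_probe():
    subprocess.run(["netsh", "advfirewall", "firewall", "delete", "rule",
                    f"name={RULE_NAME}"], check=True)

if __name__ == "__main__":
    block_probe()
    # Two probe intervals (5 s each) must fail before the node is taken out,
    # plus some slack for already-hashed connections to finish.
    time.sleep(30)
    # ... stop/update the application here, then:
    unblock_probe()
```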
First of all, could you give us additional details: is your database load balanced as well? Are you performing reads and writes on this database, or only reads?
For your information, you can change the Azure Load Balancer distribution mode; please refer to this article for details: https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-distribution-mode
I would suggest disabling the server you are updating at the load balancer level. Wait a couple of minutes (depending on your application) before starting your updates; this should "purge" your endpoint. When the update is done, update your load balancer again and put the server back in.
The cloud concept is infrastructure as code: this can easily be scripted and included in your deployment/update procedure (see the sketch below).
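A rough sketch of what such a script could look like with the azure-identity and azure-mgmt-network SDKs (all resource names below are placeholders; adapt this to your own deployment model):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

# All names below are placeholders for your own resources.
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "my-rg"
NIC_NAME = "appserver1-nic"

client = NetworkManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# 1. Take the server out of the load balancer: clear the backend-pool reference
#    on its NIC and wait for the update to complete.
nic = client.network_interfaces.get(RESOURCE_GROUP, NIC_NAME)
saved_pools = nic.ip_configurations[0].load_balancer_backend_address_pools
nic.ip_configurations[0].load_balancer_backend_address_pools = []
client.network_interfaces.begin_create_or_update(RESOURCE_GROUP, NIC_NAME, nic).result()

# 2. Wait a couple of minutes for in-flight connections to drain, then reboot/update.
#    ...

# 3. Put the server back into the backend pool.
nic = client.network_interfaces.get(RESOURCE_GROUP, NIC_NAME)
nic.ip_configurations[0].load_balancer_backend_address_pools = saved_pools
client.network_interfaces.begin_create_or_update(RESOURCE_GROUP, NIC_NAME, nic).result()
```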
Another solution would be to use Traffic Manager. It gives you additional options for managing your endpoints (it might be a bit oversized for 2 VMs/endpoints).
A last option is to migrate to a PaaS solution where this kind of feature is already available (deployment slots).
Hoping this will help.
Best regards

How to load test Apache HTTP load-balanced servers

Apache JMeter allows us to hit a server with simultaneous connections. On the other hand, I have 4 web servers: one acting as a load balancer and the other 3 acting as application servers. I want to load test these servers at once to check their performance. Is there a way to load test a load balancer? Any tools that would be helpful to carry this out? I will edit this question further with more information; for the time being, could someone point out a starting point?
Ramp up your normal app test as usual (through the load balancer).
Eventually, you'll get high response times. If you see your application servers are running fine, then it's (probably*) your load balancer that's the issue. If the application servers are falling over, then you don't need to worry about the performance of your load balancer: it's not the bottleneck.
*Obviously, there could be other problems, e.g. simple network throughput, but you should be able to tell what's going on with some simple monitoring.
Yes, you can load test your Apache load balancer.
Target your HTTP requests at the load balancer's Apache server. I'm assuming the LB will distribute the load evenly across the 3 backend servers. After the test, evaluate the response times. Are they good? Great.
If they aren't good, you can run a second test targeting the 3 backend servers directly (don't let the requests pass through the LB). If the response times are better now, then you know your LB is the problem; otherwise you may need to add more backend servers or optimize your applications (I guess this is what you'll end up doing).
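If you want something lighter than a full JMeter plan for this comparison, a small concurrent script can give a first impression; a sketch (the hosts, request count and concurrency are made-up examples):

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
import urllib.request

# Hypothetical hosts - replace with your load balancer and backend addresses.
TARGETS = {
    "load balancer": "http://lb.example.com/",
    "backend 1": "http://app1.example.com/",
    "backend 2": "http://app2.example.com/",
    "backend 3": "http://app3.example.com/",
}
REQUESTS_PER_TARGET = 200
CONCURRENCY = 20

def timed_get(url):
    # Time a single request, including reading the full body.
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=30) as resp:
        resp.read()
    return time.monotonic() - start

for name, url in TARGETS.items():
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        latencies = sorted(pool.map(timed_get, [url] * REQUESTS_PER_TARGET))
    print(f"{name:14s} median={statistics.median(latencies)*1000:7.1f}ms "
          f"p95={latencies[int(len(latencies) * 0.95)]*1000:7.1f}ms")
```

If the latencies through the load balancer are much worse than against the backends directly, the LB (or the network path to it) is the thing to investigate.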

What are your Health Check Settings for Elastic Load Balancer?

What are your health check settings for Elastic Load Balancer? I am not really well versed in this; my goal is to find settings that make the ELB fail traffic over to the 2nd EC2 instance immediately when my 1st EC2 instance goes down. Would anyone mind sharing their configuration and knowledge?
Thanks.
James
Health check settings in ELB are important, but usually not that important.
1) ELB doesn't support active/passive application instances - only active/active.
2) If an application stops accepting connections or slows dramatically, load will automatically shift to the available / faster instances. This happens without the help of health checks.
3) Health checks prevent ELB from having to send a request to an instance just to find out it is not well. This is good because a request sent to an unhealthy back end is sacrificed (an error is returned to the client).
4) If your health check settings are too sensitive (such as using a 1 second timeout when some percent of your requests take longer than that) then it can pull instances out of service too easily. Too much of this and your site will appear to be down from time to time.
If you are trying a scenario with multiple availability zones and only one back-end in each zone, then the health checks are more important. If there are NO healthy back-ends in a zone, ELB will try to forward requests to another zone that has at least one healthy instance. In this case, the frequency of health checks determines the failover time, so you'll want faster checks.
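For what it's worth, these settings can also be applied programmatically; a sketch with boto3 for a Classic ELB (the load balancer name and the specific values are examples to tune, not universal recommendations):

```python
import boto3

# Hypothetical load balancer name and health-check values - tune to your application.
elb = boto3.client("elb")
elb.configure_health_check(
    LoadBalancerName="my-classic-elb",
    HealthCheck={
        "Target": "HTTP:80/health",   # a cheap endpoint that doesn't hit the database
        "Interval": 10,               # probe every 10 seconds
        "Timeout": 5,                 # generous enough not to trip on slow-but-healthy requests
        "UnhealthyThreshold": 2,      # 2 consecutive failures take the instance out (~20 s)
        "HealthyThreshold": 3,        # 3 consecutive successes bring it back
    },
)
```

With a 10-second interval and an unhealthy threshold of 2, failover after an instance dies takes on the order of 20 seconds, which is roughly the trade-off described above between fast failover and false positives.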
