AWS Application load balancer stops to send traffic when new instance is added - performance

I have a problem with autoscaling with AWS Application Load Balancer.
I'm running my Jmeter tests and discovered that whenever new instance is added to autoscaling group (that is when it becomes healthy and ALB starts to route traffic to it), then for some short period of time Load Balancer forwards less requests to targets and a lot of requests are apparently stuck at Load Balancer itself.
I'm attaching 3 images that show this issue. CPU of JVM of one of instances drops and than goes back to normal, some requests are hanging for more than 30 sec. Number of requests per target drops and then goes back to trend. (see attached pictures)
I'm using sticky sessions with 3 minutes validity period.
Does any one knows what may cause this temporary "choking" when new instance is added?
It is quite crucial to our user experience. Can't actually understand why adding new instance can have such adverse effect on traffic routing.
Issue is fully reproducible.

Related

Using Azure load balancer to reboot/update server with zero downtime

I have a really simple setup: An azure load balancer for http(s) traffic, two application servers running windows and one database, which also contains session data.
The goal is being able to reboot or update the software on the servers, without a single request being dropped. The problem is that the health probe will do a test every 5 seconds and needs to fail 2 times in a row. This means when I kill the application server, a lot of requests during those 10 seconds will time out. How can I avoid this?
I have already tried running the health probe on a different port, then denying all traffic to the different port, using windows firewall. Load balancer will think the application is down on that node, and therefore no longer send new traffic to that specific node. However... Azure LB does hash-based load balancing. So the traffic which was already going to the now killed node, will keep going there for a few seconds!
First of all, could you give us additional details: is your database load balanced as well ? Are you performing read and write on this database or only read ?
For your information, you have the possibility to change Azure Load Balancer distribution mode, please refer to this article for details: https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-distribution-mode
I would suggest you to disable the server you are updating at load balancer level. Wait a couple of minutes (depending of your application) before starting your updates. This should "purge" your endpoint. When update is done, update your load balancer again and put back the server in it.
Cloud concept is infrastructure as code: this could be easily scripted and included in you deployment / update procedure.
Another solution would be to use Traffic Manager. It could give you additional option to manage your endpoints (It might be a bit oversized for 2 VM / endpoints).
Last solution is to migrate to a PaaS solution where all this kind of features are already available (Deployment Slot).
Hoping this will help.
Best regards

Amazon Auto Scale is ping ponging since the health checks fail

I have my Auto Scaling Group configured like so:
My Load balancer, when it registers an instance, fails since the health check fails while the instance is still loading.
For what it's worth, my health check is as follows:
I've added the "web-master" instance myself, not part of the auto scale, since what normally happens is that the registering instance fails to add itself to the load balancer, terminates and a new one pops up. This happens countless times until I manually intervene. What am I doing wrong? Is there any way to delay the ELB Health check or at least have it wait till the instance is fully registered?
An instance shouldn't register until its passed its availability tests. Could it be that its your applications spin up time that's the issue, not the instance.
I guess my question is, is your application set up on the port your pinging? I have a lightweight 'healthcheck' app in iis for heathchecks to get around the 'ping pong' effect.

Amazon Load Balancer excessively high latency

I'm having an issue with an AWS load balancer - loading pages through it seems to give high latency (~5s)
There are two EC2 instances living behind the load balancer, let's call them p1 and p2.
I'm running Magento on these instances, they're both connected to the same database.
When viewing a category page on p1 or p2 directly, the initial load time is < 500ms, but when I visit the load balancer (which then points to p1 or p2) the browser spends ~5 seconds waiting for a response from the server.
This is a typical request to p1 or p2 directly:
This is a typical request from the load balancer:
I initially suspected it may be an issue with Magento trying to re-cache for requests coming from the load balancer but I then set p1 and p2 to have their caches synchronised so cache is unlikely the cause.
The stack on p1 and p2 are fairly regular Apache2 + PHP-FPM + PHP setups that are lightning fast on their own.
AWS has recently released a new feature of ELB just for such troubleshooting scenarios. Now you can get ELB access logs. These Acces logs can help yopu determine the time taken for a request at different intervals. e.g:
request_processing_time: Total time elapsed (in seconds) from the time the load balancer receives the request and sends the request to a registered instance.
backend_processing_time: Total time elapsed (in seconds) from the time the load balancer sends the request to a registered instance and the instance begins sending the response headers.
response_processing_time: Total time elapsed (in seconds) from the time the load balancer receives the response header from the registered instance and starts sending the response to the client. this processing time includes both queuing time at the load balancer and the connection acquisition time from the load balancer to the backend.
...and a lot more information. You need to configure the access logs first. Please follow below articles to get more understanding around using ELB access logs:
Access Logs for Elastic Load Balancers
Access Logs
These logs may/may not solve your problem but is certainly a good point to start with. Besides, you can always check with AWS Technical support for more in depth analysis.

EC2 for handling demand spikes

I'm writing the backend for a mobile app that does some cpu intensive work. We anticipate the app will not have heavy usage most of the time, but will have occasional spikes of high demand. I was thinking what we should do is reserve a couple of 24/7 servers to handle the steady-state of low demand traffic and then add and remove EC2 instances as needed to handle the spikes. The mobile app will first hit a simple load balancing server that does a simple round-robin user distribution among all the available processing servers. The load balancer will handle bringing new EC2 instances up and turning them back off as needed.
Some questions:
I've never written something like this before, does this sound like a good strategy?
What's the best way to handle bringing new EC2 instances up and back down? I was thinking I could just create X instances ahead of time, set them up as needed (install software, etc), and then stop each instance. The load balancer will then start and stop the instances as needed (eg through boto). I think this should be a lot faster and easier than trying to create new instances and install everything through a script or something. Good idea?
One thing I'm concerned about here is the cost of turning EC2 instances off and back on again. I looked at the AWS Usage Report and had difficulty interpreting it. I could see starting a stopped instance being a potentially costly operation. But it seems like since I'm just starting a stopped instance rather than provisioning a new one from scratch it shouldn't be too bad. Does that sound right?
This is a very reasonable strategy. I used it successfully before.
You may want to look at Elastic Load Balancing (ELB) in combination with Auto Scaling. Conceptually the two should solve this exact problem.
Back when I did this around 2010, ELB had some problems with certain types of HTTP requests that prevented us from using it. I understand those issues are resolved.
Since ELB was not an option, we manually launched instances from EBS snapshots as needed and manually added them to an NGinX load balancer. That certainly could have been automated using the AWS APIs, but our peaks were so predictable (end of month) that we just tasked someone to spin up the new instances and didn't get around to automating the task.
When an instance is stopped, I believe the only cost that you pay is for the EBS storage backing the instance and its data. Unless your instances have a huge amount of data associated, the EBS storage charge should be minimal. Perhaps things have changed since I last used AWS, but I would be surprised if this changed much if at all.
First with regards to costs, whether an instance is started from scratch or from a stopped state has no impact on cost. You are billed for the amount of compute units you use over time, period.
Second, what you are looking to do is called autoscaling. What you do is setup up a launch config that specifies an AMI you are going to use (along with any user-data configs you are using, the ELB and availiabilty zones you are going to use, min and max number of instances, etc. You set up a scaling group using that launch config. Then you set up scaling policies to determine what scaling actions are going to be attached to the group. You then attach cloud watch alarms to each of those policies to trigger the scaling actions.
You don't have servers in reserve that you attach to the ELB or anything like that. Everything is based on creating a single AMI that is used as the template for the servers you need.
You should read up on autoscaling at the link below:
http://aws.amazon.com/autoscaling/

Haproxy Load Balancer, EC2, writing my own availability script

I've been looking at high availability solutions such as heartbeat, and keepalived to failover when an haproxy load balancer goes down. I realised that although we would like high availability it's not really a requirement at this point in time to do it to the extent of the expenditure on having 2 load balancer instances running at any one time so that we get instant failover (particularly as one lb is going to be redundant in our setup).
My alternate solution is to fire up a new load balancer EC2 instance from an AMI if the current load balancer has stopped working and associate it to the elastic ip that our domain name points to. This should ensure that downtime is limited to the time it takes to fire up the new instance and associate the elastic ip, which given our current circumstance seems like a reasonably cost effective solution to high availability, particularly as we can easily do it multi-av zone. I am looking to do this using the following steps:
Prepare an AMI of the load balancer
Fire up a single ec2 instance acting as the load balancer and assign the Elastic IP to it
Have a micro server ping the current load balancer at regular intervals (we always have an extra micro server running anyway)
If the ping times out, fire up a new EC2 instance using the load balancer AMI
Associate the elastic ip to the new instance
Shut down the old load balancer instance
Repeat step 3 onwards with the new instance
I know how to run the commands in my script to start up and shut down EC2 instances, associate the elastic IP address to an instance, and ping the server.
My question is what would be a suitable ping here? Would a standard ping suffice at regular intervals, and what would be a good interval? Or is this a rather simplistic approach and there is a smarter health check that I should be doing?
Also if anyone foresees any problems with this approach please feel free to comment
I understand exactly where you're coming from, my company is in the same position. We care about having a highly available fault tolerant system however the overhead cost simply isn't viable for the traffic we get.
One problem I have with your solution is that you're assuming the micro instance and load balancer wont both die at the same time. With my experience with amazon I can tell you it's defiantly possible that this could happen, however unlikely, its possible that whatever causes your load balancer to die also takes down the micro instance.
Another potential problem is you also assume that you will always be able to start another replacement instance during downtime. This is simply not the case, take for example an outage amazon had in their us-east-1 region a few days ago. A power outage caused one of their zones to loose power. When they restored power and began to recover the instances their API's were not working properly because of the sheer load. During this time it took almost 1 hour before they were available. If an outage like this knocks out your load balancer and you're unable to start another you'll be down.
That being said. I find the ELB's provided by amazon are a better solution for me. I'm not sure what the reasoning is behind using HAProxy but I recommend investigating the ELB's as they will allow you to do things such as auto-scaling etc.
For each ELB you create amazon creates one load balancer in each zone that has an instance registered. These are still vulnerable to certain problems during severe outages at amazon like the one described above. For example during this downtime I could not add new instances to the load balancers but my current instances ( the ones not affected by the power outage ) were still serving requests.
UPDATE 2013-09-30
Recently we've changed our infrastructure to use a combination of ELB and HAProxy. I find that ELB gives the best availability but the fact that it uses DNS load balancing doesn't work well for my application. So our setup is ELB in front of a 2 node HAProxy cluster. Using this tool HAProxyCloud I created for AWS I can easily add auto scaling groups to the HAProxy servers.
I know this is a little old, but the solution you suggest is overcomplicated, there's a much simpler method that does exactly what you're trying to accomplish...
Just put your HAProxy machine, with your custom AMI in an auto-scaling group with a minimum AND maximum of 1 instance. That way when your instance goes down the ASG will bring it right back up, EIP and all. No external monitoring necessary, same if not faster response to downed instances.

Resources