Amazon Auto Scale is ping ponging since the health checks fail - amazon-ec2

I have my Auto Scaling Group configured like so:
My Load balancer, when it registers an instance, fails since the health check fails while the instance is still loading.
For what it's worth, my health check is as follows:
I've added the "web-master" instance myself, not part of the auto scale, since what normally happens is that the registering instance fails to add itself to the load balancer, terminates and a new one pops up. This happens countless times until I manually intervene. What am I doing wrong? Is there any way to delay the ELB Health check or at least have it wait till the instance is fully registered?

An instance shouldn't register until its passed its availability tests. Could it be that its your applications spin up time that's the issue, not the instance.
I guess my question is, is your application set up on the port your pinging? I have a lightweight 'healthcheck' app in iis for heathchecks to get around the 'ping pong' effect.

Related

Ec2 instance out of service

I have an ec2 instance. and I use this instance for serving Nodejs applications with GitHub action. everything is working as expected. but sometimes like after 1/2 days, the instance is not reachable by anything. it's showing an offline flag on the GitHub action tab. But, in the AWS ec2 console. it's showing running. All stats look normal. I actually have no idea why this behaves like this.
every time this happened, I have to stop and start the instance manually to up and run my server.
EC2 monitor stats
I actually think, there may be resource limitations, therefore the instance is hanging. but after I upgrade my instance type to t2.xlarge it still happens.

How to use ELB and AutoScaling termination for long living connections

I want to set up Autoscaling groups where we can launch and terminate instances based on the CPU load. But usually our connections stays for long like more than 8hrs sometimes even more than that. When I use NLB, the Deregistration delay is only supported till 3600sec and after that NLB will forcefully remove the connection which cause our long living connections to fail and autoscaling will terminate the instances as well.
How do I make sure that all my connections to the target group is processed after 8-10hrs and then NLB deregister or autoscaling terminate the instance?
I checked the ASG Lifecycle hooks and it allows connections only till 2hrs.
Is it possible to deregister the instances in target group after all the connections are drained and terminate the instance using ASG?
There isn't any good/easy way to do what you want to do. What are the instances doing that can last up to 10 hours?
Depending on your work type, this is the best workaround I can think of, but it would probably involve a bit of rearchitecting.
1) design your application so that all data is stored off the instance is some sort of data tier (S3, RDS, EFS, etc). When an instance is done doing whatever its doing, save that info to the data tier. This way a user request can go to any instance and get the same information
2) The ASG decides to scale in
3) You have a lifecycle hook configured and a cloudwatch notification setup to be triggered when an instance enters the terminating:wait state which notifies the instance
4) The instance periodically sends a heartbeat to the lifecycle hook which can extend the hooks timeout for up to 2 days
5) Whenever the instance finishes what its doing, it saves the information out to the data tier mentioned in 1) and the client can connect to a new instance to get the information that was being processed on the old one
https://docs.aws.amazon.com/cli/latest/reference/autoscaling/record-lifecycle-action-heartbeat.html
https://docs.aws.amazon.com/cli/latest/reference/autoscaling/complete-lifecycle-action.html
Try to use, Scaling CoolDown period. By the default Scaling Cooldown Period is (300 Secs). you can increase the number. which will help to increase the scale in time.

AWS Application load balancer stops to send traffic when new instance is added

I have a problem with autoscaling with AWS Application Load Balancer.
I'm running my Jmeter tests and discovered that whenever new instance is added to autoscaling group (that is when it becomes healthy and ALB starts to route traffic to it), then for some short period of time Load Balancer forwards less requests to targets and a lot of requests are apparently stuck at Load Balancer itself.
I'm attaching 3 images that show this issue. CPU of JVM of one of instances drops and than goes back to normal, some requests are hanging for more than 30 sec. Number of requests per target drops and then goes back to trend. (see attached pictures)
I'm using sticky sessions with 3 minutes validity period.
Does any one knows what may cause this temporary "choking" when new instance is added?
It is quite crucial to our user experience. Can't actually understand why adding new instance can have such adverse effect on traffic routing.
Issue is fully reproducible.

Initiate EC2 Instance Shutdown via EstimatedCharges Threshold

I am using amazon-ec2 and currently have a couple of CloudWatch Alarms set using the EstimatedCharges Threshold at different price ranges.
While my request here would obviously be a last/worst case situation, I am wondering how it would be possible to do this, if it even is.
What I am wanting to do is to setup an alarm that will (somehow, how??) initiate a Shutdown of a specific EC2 Instance when a specific alarm goes from a state of OK to a state of ALARM ?
I do not want to TERMINATE the instance, just a SHUTDOWN.
The idea here being making sure that a monthly bill does not all of a sudden go way beyond what can be afforded, even if it does mean shutting down the entire server.
Maybe there is another/better method of doing what I am after via another AWS service, if so, would love to know about that.

Haproxy Load Balancer, EC2, writing my own availability script

I've been looking at high availability solutions such as heartbeat, and keepalived to failover when an haproxy load balancer goes down. I realised that although we would like high availability it's not really a requirement at this point in time to do it to the extent of the expenditure on having 2 load balancer instances running at any one time so that we get instant failover (particularly as one lb is going to be redundant in our setup).
My alternate solution is to fire up a new load balancer EC2 instance from an AMI if the current load balancer has stopped working and associate it to the elastic ip that our domain name points to. This should ensure that downtime is limited to the time it takes to fire up the new instance and associate the elastic ip, which given our current circumstance seems like a reasonably cost effective solution to high availability, particularly as we can easily do it multi-av zone. I am looking to do this using the following steps:
Prepare an AMI of the load balancer
Fire up a single ec2 instance acting as the load balancer and assign the Elastic IP to it
Have a micro server ping the current load balancer at regular intervals (we always have an extra micro server running anyway)
If the ping times out, fire up a new EC2 instance using the load balancer AMI
Associate the elastic ip to the new instance
Shut down the old load balancer instance
Repeat step 3 onwards with the new instance
I know how to run the commands in my script to start up and shut down EC2 instances, associate the elastic IP address to an instance, and ping the server.
My question is what would be a suitable ping here? Would a standard ping suffice at regular intervals, and what would be a good interval? Or is this a rather simplistic approach and there is a smarter health check that I should be doing?
Also if anyone foresees any problems with this approach please feel free to comment
I understand exactly where you're coming from, my company is in the same position. We care about having a highly available fault tolerant system however the overhead cost simply isn't viable for the traffic we get.
One problem I have with your solution is that you're assuming the micro instance and load balancer wont both die at the same time. With my experience with amazon I can tell you it's defiantly possible that this could happen, however unlikely, its possible that whatever causes your load balancer to die also takes down the micro instance.
Another potential problem is you also assume that you will always be able to start another replacement instance during downtime. This is simply not the case, take for example an outage amazon had in their us-east-1 region a few days ago. A power outage caused one of their zones to loose power. When they restored power and began to recover the instances their API's were not working properly because of the sheer load. During this time it took almost 1 hour before they were available. If an outage like this knocks out your load balancer and you're unable to start another you'll be down.
That being said. I find the ELB's provided by amazon are a better solution for me. I'm not sure what the reasoning is behind using HAProxy but I recommend investigating the ELB's as they will allow you to do things such as auto-scaling etc.
For each ELB you create amazon creates one load balancer in each zone that has an instance registered. These are still vulnerable to certain problems during severe outages at amazon like the one described above. For example during this downtime I could not add new instances to the load balancers but my current instances ( the ones not affected by the power outage ) were still serving requests.
UPDATE 2013-09-30
Recently we've changed our infrastructure to use a combination of ELB and HAProxy. I find that ELB gives the best availability but the fact that it uses DNS load balancing doesn't work well for my application. So our setup is ELB in front of a 2 node HAProxy cluster. Using this tool HAProxyCloud I created for AWS I can easily add auto scaling groups to the HAProxy servers.
I know this is a little old, but the solution you suggest is overcomplicated, there's a much simpler method that does exactly what you're trying to accomplish...
Just put your HAProxy machine, with your custom AMI in an auto-scaling group with a minimum AND maximum of 1 instance. That way when your instance goes down the ASG will bring it right back up, EIP and all. No external monitoring necessary, same if not faster response to downed instances.

Resources