ELB Health Check not checking web instance after booting up - amazon-ec2

We have a web instance (nginx) behind an ELB which we manually power on when required.
The web app starts up quickly and returns a successful 200 response when we run wget locally.
However, the website will not load because the ELB isn't sending health check requests to the instance. I can confirm this by viewing the nginx access logs.
The workaround I've been using is to remove the web instance from the ELB and add it back in.
This seems to activate the health checks again, and they are visible in our access logs.
I've edited our health check settings to allow a longer timeout and raised the Unhealthy Threshold to 3, but this has made no difference.
Currently our Health Check Config is:
Ping Target: HTTPS:443/login
Timeout: 10 sec
Interval: 12 sec
Unhealthy: 2
Healthy: 2
Listener:
HTTPS 443 to HTTPS 443 SSL Cert
The ELB and the web instance are both in the same public VPC security group, which has HTTP/HTTPS open to 0.0.0.0/0.
Can anyone help me figure out why the ELB health checks aren't kicking in as soon as the web instance has started? Is this by design, or is there a way of automatically initiating the checks? Thank you.
Niall

Does your instance come up with a different IP address each time you start it?
Elastic Load Balancing registers your load balancer with your EC2 instances using the IP addresses that are associated with your instances. When an instance is stopped and then restarted, the IP address associated with your instance changes. Your load balancer cannot recognize the new IP address, which prevents it from routing traffic to your instances.
— http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/TerminologyandKeyConcepts.html#registerinstance
It would seem like the appropriate approach to getting the instance re-associated would be for code running on the web server instance to programmatically register itself with the load balancer via the API when the startup process determines that the instance is ready for web traffic.
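A minimal sketch of that, using the aws cli against a classic ELB (the load balancer name is a placeholder, and the instance ID is read from the instance metadata service, assuming IMDSv1-style access):
#!/bin/bash
# Hypothetical boot-time hook: re-register this instance with its classic ELB once the app is up.
# MY-LOAD-BALANCER is a placeholder; the instance ID comes from the EC2 instance metadata service.
instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws elb register-instances-with-load-balancer \
  --load-balancer-name MY-LOAD-BALANCER \
  --instances "$instance_id"
The instance profile (or whatever credentials the script runs with) would need to allow the elasticloadbalancing:RegisterInstancesWithLoadBalancer action.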
Update:
Luke#AWS: "You should be de-registering from your ELB during a stop/start."
— https://forums.aws.amazon.com/thread.jspa?messageID=463835
I'm curious what the console shows as the reason why the instance isn't active in the ELB. There does appear to be some interaction between ELB and EC2 in which ELB has an awareness of an instance's EC2 state (e.g. "stopped") that goes beyond just the health checks. This isn't well documented, but I would speculate that ELB, based on that awareness, decides that it isn't worth bothering with the health checks, and the console may provide something useful to at least confirm this.
It's possible that, given sufficient time, ELB might become aware that the instance is running again and start sending health checks, but it's also possible that instances have a hidden global meta-identifier separate from i-xxxxxx and that a stopped and restarted instance is, from the perspective of this identifier, a different instance.
...but the answer seems to be that stopping an instance and restarting it requires re-registration with the ELB.

Related

AWS Elastic Load Balancer not responding from Internet connection

I have created one EC2 instance (as part of provisioning a Tomcat Beanstalk instance). Now I need to configure an HTTPS connection to the EC2 instance. As per the Beanstalk documentation, the easiest way is to configure a load balancer that interacts with browsers using HTTPS and routes traffic to the EC2 instance using HTTP.
So I configured a load balancer under the EC2 management console. After the configuration, I tried to ping the public DNS name of the load balancer, as well as the resolved IP address. The name resolves, but the ping gets no response, as shown below:
ping 13.54.72.179
PING 13.54.72.179 (13.54.72.179) 56(84) bytes of data.
^C
--- 13.54.72.179 ping statistics ---
7 packets transmitted, 0 received, 100% packet loss, time 6139ms
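For reference, a direct check of the HTTP/HTTPS listeners (rather than ICMP) would look something like the following, with a placeholder standing in for the real DNS name:
# Probe the HTTP and HTTPS listeners directly; the hostname below is a placeholder
curl -v --max-time 10 http://my-elb-1234567890.ap-southeast-2.elb.amazonaws.com/
curl -vk --max-time 10 https://my-elb-1234567890.ap-southeast-2.elb.amazonaws.com/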
I carefully checked all the configurations against the load balancer configuration and troubleshooting documentation. Everything seems to have been configured properly.
Target group: the target group shows a healthy state in the Monitoring tab.
VPC: the load balancer's availability zones and the EC2 instance are in the same VPC. The route table also has an internet gateway associated with the 0.0.0.0/0 destination.
Load balancer listeners: both HTTP and HTTPS listeners are configured, and the load balancer is configured as internet-facing.
Security group for the load balancer: inbound, both HTTP/HTTPS and TCP are allowed from all sources; outbound, all protocols to all destinations are allowed.
Security group for EC2: for the purpose of testing, we allow all inbound traffic from all sources.
I researched a few forum threads about the "load balancer not responding" topic and checked the configurations they mentioned. However, none of them worked for me.
So I am at a loss now. Can someone point out where I might have gone wrong in configuring the load balancer, or what I should do to troubleshoot further?

Can Marathon assign the same randomly selected host_port across instances?

For my containerized application, I want Marathon to allocate the same host_port for the container's bridge-network endpoint across all instances of that application. Specifying the host port explicitly runs the risk of resource exhaustion; not specifying it causes a random port to be picked for each instance.
I don't mind a randomly picked port as long as it is identical across all instances of my application. Is there a way to ask Marathon to pick such a host port for my container endpoint?
I think what you are really after is service discovery / load balancing. Have a look at the Marathon docs at
https://mesosphere.github.io/marathon/docs/service-discovery-load-balancing
to get an overview.
Also, see the Docker networking docs at
https://mesosphere.github.io/marathon/docs/native-docker.html
You can probably either make use of the hostPort or the more general ports properties.
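To make that concrete, here is a hedged sketch of a bridge-mode app definition posted to Marathon's /v2/apps endpoint (the Marathon host, app id, and image are made up). "hostPort": 0 lets Marathon pick a random host port per task, while "servicePort" is the app-wide port that stays the same for every instance and is what service-discovery / load-balancing layers such as marathon-lb key on:
# Hypothetical app definition POSTed to Marathon (adjust the URL, id, and image for your setup)
curl -X POST http://marathon.example.com:8080/v2/apps \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "/my-web-app",
    "instances": 3,
    "cpus": 0.25,
    "mem": 128,
    "container": {
      "type": "DOCKER",
      "docker": {
        "image": "nginx:latest",
        "network": "BRIDGE",
        "portMappings": [
          { "containerPort": 80, "hostPort": 0, "servicePort": 10000, "protocol": "tcp" }
        ]
      }
    }
  }'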

Why is my auto scaling EC2 instance reported as 'out of service' by the load balancer?

I'm having an issue with an Amazon EC2 instance during auto scaling. Every command I typed worked and I found no errors, but when testing whether auto scaling is working, I found that it only works until the new instance starts. The newly spawned instance does not work afterwards: it is under my load balancer, but its status is "out of service". Another issue is that when I copy and paste the public DNS link into the browser, it does not respond, and an error is triggered like "Firefox can't find ..."
I doubt there is a problem with the image or the Linux configuration.
Thanks in advance.
Although it's been a long time since you posted this, try adjusting the health check of the load balancer.
If your health check is like this:
Ping Target: HTTP:80/index.php
Timeout: 10 seconds
Interval: 30 seconds
Unhealthy Threshold: 4
Healthy Threshold: 2
then an instance will be marked out of service if the ping target doesn't respond within 10 seconds on 4 consecutive checks, while the ELB will try to reach it every 30 seconds.
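To see roughly what the ELB sees, you can reproduce that check by hand; a sketch, substituting your instance's address and your configured ping path:
# Reproduce the ELB health check manually: anything other than a 200 within the timeout counts as a failed check
code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 http://YOUR-INSTANCE-PUBLIC-DNS/index.php)
[ "$code" = "200" ] && echo "healthy" || echo "unhealthy (got $code)"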
Usually the fact that you get "Firefox can't find ..." when you try to access the instance directly means that the service is down. Try to log in to the instance and check whether the service is alive, and also check the firewall rules, which might block internet/ELB requests. Your ELB health check is also a good place to start. If you still have issues, try posting some debug information such as the instance's netstat output and the ELB describe output and parameters.
Rules on the security groups assigned to the instance and the load balancer were not allowing traffic to pass between the two. This caused the health check to fail, so the load balancer marked the instance out of service.
If you don't have index.html in the document root of the instance, the default health check will fail. In my experience, you can set a custom protocol, port and path for the health check when creating the load balancer.

How to gracefully shut down or remove AWS instances from an ELB group

I have a cloud of server instances running at Amazon using their load balancer to distribute the traffic. Now I am looking for a good way to gracefully scale the network down, without causing connection errors on the browser's side.
As far as I know, any connections to an instance will be abruptly terminated when it is removed from the load balancer.
I would like to have a way to inform my instance like one minute before it gets shut down or to have the load balancer stop sending traffic to the dying instance, but without terminating existing connections to it.
My app is node.js based and runs on Ubuntu. I also have some special software running on it, so I prefer not to use the many PaaS offerings for node.js hosting.
Thanks for any hints.
I know this is an old question, but it should be noted that Amazon has recently added support for connection draining, which means that when an instance is removed from the load balancer, it will complete requests that were in progress before it was removed. No new requests will be routed to the removed instance. You can also supply a timeout for these requests, meaning any request that runs longer than the timeout window will be terminated after all.
To enable this behaviour, go to the Instances tab of your load balancer and change the Connection Draining behaviour.
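The same setting can also be applied from the command line against the classic ELB API; a sketch, with a placeholder load balancer name and an arbitrary 300-second timeout:
# Enable connection draining on a classic ELB with a 300-second timeout
aws elb modify-load-balancer-attributes \
  --load-balancer-name MY-LOAD-BALANCER \
  --load-balancer-attributes '{"ConnectionDraining":{"Enabled":true,"Timeout":300}}'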
This idea uses the ELB's capability to detect an unhealthy node and remove it from the pool BUT it relies upon the ELB behaving as expected in the assumptions below. This is something I've been meaning to test for myself but haven't had the time yet. I'll update the answer when I do.
Process Overview
The following logic could be wrapped and run at the time the node needs to be shut down.
1. Block new HTTP connections to nodeX but continue to allow existing connections.
2. Wait for existing connections to drain, either by monitoring existing connections to your application or by allowing a "safe" amount of time.
3. Initiate a shutdown on the nodeX EC2 instance using the EC2 API directly or abstracted scripts.
("safe" according to your application, which may not be possible to determine for some applications.)
Assumptions that need to be tested
We know that ELB removes unhealthy instances from its pool. I would expect this to be graceful, so that:
A new connection to a recently closed port will be gracefully redirected to the next node in the pool
When a node is marked Bad, the already established connections to that node are unaffected.
Possible test cases:
Fire HTTP connections at the ELB (e.g. from a curl script), logging the results while a script opens and closes one of the nodes' HTTP ports. You would need to experiment to find an acceptable amount of time that allows the ELB to reliably detect the state change.
Maintain a long HTTP session (e.g. a file download) while blocking new HTTP connections; the long session should hopefully continue.
1. How to block HTTP Connections
Use a local firewall on nodeX to block new sessions but continue to allow established sessions.
For example, with iptables:
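# Drop new inbound TCP SYNs to the web port; already-established connections keep flowing
# (assumes the chain's default policy otherwise accepts the traffic)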
iptables -A INPUT -j DROP -p tcp --syn --destination-port <web service port>
The recommended way for distributing traffic from your ELB is to have an equal number of instances across multiple availability zones. For example:
ELB
Instance 1 (us-east-a)
Instance 2 (us-east-a)
Instance 3 (us-east-b)
Instance 4 (us-east-b)
There are two ELB APIs of interest that allow you to detach instances programmatically (or via the control panel):
Deregister an instance
Disable an availability zone (which subsequently disables the instances within that zone)
The ELB Developer Guide has a section that describes the effects of disabling an availability zone. A note in that section is of particular interest:
Your load balancer always distributes traffic to all the enabled Availability Zones. If all the instances in an Availability Zone are deregistered or unhealthy before that Availability Zone is disabled for the load balancer, all requests sent to that Availability Zone will fail until DisableAvailabilityZonesForLoadBalancer calls for that Availability Zone.
What's interesting about the above note is that it could imply that if you call DisableAvailabilityZonesForLoadBalancer, the ELB could instantly start sending requests only to the remaining enabled zones, possibly resulting in a zero-downtime experience while you perform maintenance on the servers in the disabled availability zone.
The above 'theory' needs detailed testing or acknowledgement from an Amazon cloud engineer.
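For reference, both operations are also exposed through the aws cli; a sketch with placeholder names:
# Deregister a single instance from a classic ELB
aws elb deregister-instances-from-load-balancer \
  --load-balancer-name MY-LOAD-BALANCER \
  --instances i-xxxxxxxx
# Disable a whole availability zone for the load balancer
aws elb disable-availability-zones-for-load-balancer \
  --load-balancer-name MY-LOAD-BALANCER \
  --availability-zones us-east-1a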
It seems like there have already been a number of responses here, and some of them have good advice. But I think that in general your design is flawed. No matter how perfectly you design your shutdown procedure to make sure that a client's connection is closed before shutting down a server, you're still vulnerable:
The server could lose power.
A hardware failure could cause the server to fail.
The connection could be closed by a network issue.
The client loses internet or Wi-Fi.
I could go on with the list, but my point is that instead of designing the system to always work correctly, design it to handle failures. If you design a system that can handle a server losing power at any time, then you've created a very robust system. This isn't a problem with the ELB; this is a problem with your current system architecture.
A caveat that was not discussed in the existing answers is that ELBs also use DNS records with 60 second TTLs to balance load between multiple ELB nodes (each having one or more of your instances attached to it).
This means that if you have instances in two different availability zones, you probably have two IP addresses for your ELB with a 60s TTL on their A records. When you remove the final instances from such an availability zone, your clients "might" still use the old IP address for at least a minute - faulty DNS resolvers might behave much worse.
Another case where an ELB has multiple IPs, and the same problem, is when a single availability zone has a very large number of instances, more than one ELB node can handle. In that case the ELB will also spin up another node and add its IP to the list of A records with a 60-second TTL.
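You can see the current set of A records (and their TTL) with an ordinary DNS lookup; the hostname below is a placeholder for your ELB's DNS name:
# List the A records (and TTLs) currently published for the ELB
dig +noall +answer my-elb-1234567890.us-east-1.elb.amazonaws.com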
I can't comment because of my low reputation. Here are some snippets I crafted that might be very useful for someone out there. They use the aws cli tool to check when an instance has been drained of connections.
You need an EC2 instance running the Python server below behind an ELB.
# Minimal Flask test backend: "/" answers immediately, "/wait/<secs>" simulates a long-running request
from flask import Flask
import time

app = Flask(__name__)

@app.route("/")
def index():
    return "ok\n"

@app.route("/wait/<int:secs>")
def wait(secs):
    time.sleep(secs)
    return str(secs) + "\n"

if __name__ == "__main__":
    app.run(host='0.0.0.0', debug=True)
Then run the following script from your local workstation against the ELB.
#!/bin/bash
which jq > /dev/null || {
  echo "Get jq from http://stedolan.github.com/jq"
  exit 1
}
# Fill in following vars
lbname="ELBNAME"
lburl="http://ELBURL.REGION.elb.amazonaws.com/wait/30"
instanceid="i-XXXXXXX"
getState () {
aws elb describe-instance-health \
--load-balancer-name $lbname \
--instances $instanceid | jq '.InstanceStates[0].State' -r
}
register () {
aws elb register-instances-with-load-balancer \
--load-balancer-name $lbname \
--instances $instanceid | jq .
}
deregister () {
aws elb deregister-instances-from-load-balancer \
--load-balancer-name $lbname \
--instances $instanceid | jq .
}
waitUntil () {
echo -n "Wait until state is $1"
while [ "$(getState)" != "$1" ]; do
echo -n "."
sleep 1
done
echo
}
# Actual Dance
# Make sure instance is registered. Check latency until node is deregistered
if [ "$(getState)" == "OutOfService" ]; then
register >> /dev/null
fi
waitUntil "InService"
curl $lburl &
sleep 1
deregister >> /dev/null
waitUntil "OutOfService"

Process for telling when a new ec2 host can be connected to

I've been using Fabric and boto to start up new EC2 hosts for some temporary processing, but I've always had trouble knowing when I can connect to a host. The problem is that I can ask EC2 when something is ready, but it's never really ready.
This is the process that I've noticed works best (though it still sucks):
1. Poll EC2 until it says that the host is "active"
2. Poll EC2 until it has a public_dns_name
3. Try to connect to the new host in a loop until it accepts the connection
But sometimes it accepts the connection seemingly before it knows about the ssh key pair that I've associated with it, and then asks for a password.
Is there a better way to decide when I can start connecting to my ec2 hosts after they've started up? Has anyone written a library that does this nicely and efficiently?
I do the same for #1 and #2, but for #3 I have a code loop that attempts to make a simple TCP connection to the ssh port (22) with short timeouts and retries. When it finally succeeds, it waits five more seconds and then runs the ssh command.
The timing and order in which sshd is started and the public ssh key is added to .ssh/authorized_keys may vary depending on the AMI you are running.
Note: I mildly recommend using the public IP address directly instead of the DNS name. The IP address is encoded in the DNS name, so there's no benefit to adding DNS lookups into the process.
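A rough shell sketch of that approach using the aws cli rather than boto (the instance ID, key file, and login user are placeholders):
# 1) Wait for EC2 to report the instance as running, 2) poll TCP port 22, 3) pause briefly, then ssh
aws ec2 wait instance-running --instance-ids i-xxxxxxxx
host=$(aws ec2 describe-instances --instance-ids i-xxxxxxxx \
  --query 'Reservations[0].Instances[0].PublicIpAddress' --output text)
until nc -z -w 2 "$host" 22; do
  sleep 2
done
sleep 5   # give cloud-init a moment to finish installing the authorized key
ssh -o StrictHostKeyChecking=no -i mykey.pem ec2-user@"$host" 'uptime'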
EC2 itself doesn't have any way of knowing when your instance is ready to accept SSH connections; it operates on a much lower level than that.
The best way to do this is to update your AMI to have some sort of health servlet. It can be very simple -- just a few lines of web.py script -- that runs at the later stages of startup, and which just returns status code 200 to any HTTP request. By the time that servlet is responding to requests, everything else should be up too, so you can check your instance with exponential backoff on that URL.
If you ever put your instances behind a load balancer (which has its own benefits), this health servlet is required anyway, and has the added benefit of telling the load balancer when an instance has gone down, for any reason. It's just a general best-practice on EC2.
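A sketch of that exponential-backoff polling from the shell, assuming a hypothetical /health URL served by the servlet:
# Poll a hypothetical health endpoint with exponential backoff until it returns HTTP 200
url="http://INSTANCE_PUBLIC_DNS/health"
delay=1
until [ "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")" = "200" ]; do
  sleep "$delay"
  delay=$(( delay * 2 > 60 ? 60 : delay * 2 ))   # cap the backoff at 60 seconds
done
echo "instance is serving traffic"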
