What are best practices for using EC2 Availability Zones? - amazon-ec2

I'm relaunching a site (~5M+ visits per day) on EC2, and am confused about how to deploy nodes in different data centers. My most basic setup is two nodes behind a Varnish server.
Should I have two Varnish instances in different availability zones, each with WWW nodes that talk to a shared RDS database? Each Varnish instance can be load balanced w/ Amazon's load balancer.
Something like:
1 load balancer talking to:
Varnish in Virginia, which talks to its own us-east-x nodes
Varnish in California, which talks to its own us-west-x nodes

We use Amazon EC2 extensively for load balancing and fault tolerance. We still don't make much use of the load balancers provided by Amazon; we run our own load balancers (outside Amazon). Amazon promises that its load balancers will never go down and are internally fault tolerant, but I haven't tested that claim thoroughly.
In general we host two instances per availability zone, one acting as a mirror of the real server. If one of the servers goes down, we send customers to the other one. But lately Amazon has shown a pattern where a single availability zone goes down fairly often.
So the wise approach, I presume, is to set up servers across availability zones, as you mentioned. We use Postgres, so we can replicate the database content across the instances. With 9.0 there is built-in binary (streaming) replication, which works well for keeping a standby in sync. This way both servers can take the load when everything is up, but when an availability zone does go down, all users are sent to one server. Since a common database is available, it does not matter which server users land on; they will just experience a slight slowness if they go to the "wrong" server.
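As a minimal sketch of checking which node is the primary and which is the standby before routing writes (assuming PostgreSQL 9.0+ streaming replication and the psycopg2 driver; the hostnames and credentials are placeholders):

```python
# Check whether each Postgres node is a primary or a streaming standby.
# Hostnames and credentials are placeholders for illustration.
import psycopg2

NODES = ["pg-us-east-1a.example.internal", "pg-us-east-1b.example.internal"]

for host in NODES:
    conn = psycopg2.connect(host=host, dbname="app", user="monitor", password="secret")
    try:
        with conn.cursor() as cur:
            # pg_is_in_recovery() is true on a streaming-replication standby,
            # false on the primary (available since PostgreSQL 9.0).
            cur.execute("SELECT pg_is_in_recovery();")
            (in_recovery,) = cur.fetchone()
            print(f"{host}: {'standby' if in_recovery else 'primary'}")
    finally:
        conn.close()
```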
With this approach you can do tandem updates of the website: update one server, make sure it is running fine, and then update the next. Even if an upgrade fails on one server, the whole website stays up.
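That tandem-update flow might look roughly like this with boto3 against a classic ELB; the load balancer name, instance IDs, and the run_update step are placeholders, not anything AWS provides:

```python
# Tandem (rolling) update: pull one instance out of the load balancer,
# update it, verify it, then put it back before touching the next one.
# Load balancer name, instance IDs, and run_update are placeholders.
import time
import boto3

elb = boto3.client("elb", region_name="us-east-1")  # classic ELB API
LB_NAME = "www-lb"
INSTANCES = ["i-0123456789abcdef0", "i-0fedcba9876543210"]

def run_update(instance_id):
    """Placeholder for your actual deploy/upgrade step."""
    print(f"updating {instance_id} ...")

def wait_until_in_service(instance_id, timeout=300):
    deadline = time.time() + timeout
    while time.time() < deadline:
        states = elb.describe_instance_health(
            LoadBalancerName=LB_NAME,
            Instances=[{"InstanceId": instance_id}],
        )["InstanceStates"]
        if states and states[0]["State"] == "InService":
            return
        time.sleep(10)
    raise TimeoutError(f"{instance_id} never came back InService")

for instance_id in INSTANCES:
    elb.deregister_instances_from_load_balancer(
        LoadBalancerName=LB_NAME, Instances=[{"InstanceId": instance_id}])
    run_update(instance_id)
    elb.register_instances_with_load_balancer(
        LoadBalancerName=LB_NAME, Instances=[{"InstanceId": instance_id}])
    wait_until_in_service(instance_id)  # only proceed once it's healthy again
```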

Related

How to prevent being affected by a data-center DDoS attack & maintenance-related downtime?

I'm hosting a web application which should be highly available. I'm hosting on multiple Linodes and using a NodeBalancer to distribute the traffic. My question might be stupidly simple, but not long ago I was affected by a DDoS hitting the data center. That made me think about how I can be better prepared the next time this happens.
The NodeBalancer and servers are all in the same data center, which should, of course, be fixed. But how does one go about doing this? If I have two load balancers in two different data centers, how can I set up the domain to point to both, but ignore the one affected by the DDoS? Should I look into the DNS manager? Am I making things too complicated?
Really would appreciate some insights.
Thanks everyone...
You have to look at ways to load balance across datacenters. There are a few ways to do this, each with pros and cons.
If you have a lot of DB calls, running two datacenters hot can introduce a lot of latency problems. What I would do is as follows.
Have the second datacenter (DC2) be a warm location. It is configured so that everything works and it constantly receives data from the master DB in DC1, but it isn't actively taking traffic.
Use a service like CloudFlare for their extremely fast DNS switching. Have a service in DC2 that constantly pings the load balancer in DC1 to make sure that everything is up and well. When it has trouble contacting DC1, it can call the CloudFlare API and switch the main 'A' record to point to DC2, which then picks up the traffic.
I forget what CloudFlare calls it, but it has a DNS feature that lets you switch 'A' records almost instantly, because the actual IP address given to the public is their own; they just route the traffic for you.
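As a rough sketch of that watchdog (assuming the Cloudflare v4 API with a bearer token; the health URL, zone/record IDs, token, and IP are placeholders):

```python
# Watchdog in DC2: if DC1's load balancer stops answering, repoint the
# 'A' record to DC2 via the Cloudflare v4 API. IDs, URLs, and IPs are
# placeholders for illustration.
import time
import requests

HEALTH_URL = "https://lb.dc1.example.com/health"
CF_API = "https://api.cloudflare.com/client/v4"
ZONE_ID = "your-zone-id"
RECORD_ID = "your-dns-record-id"
DC2_IP = "203.0.113.10"
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}

def dc1_is_healthy():
    try:
        return requests.get(HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False

failures = 0
while True:
    failures = 0 if dc1_is_healthy() else failures + 1
    if failures >= 3:  # require consecutive failures to avoid flapping
        requests.put(
            f"{CF_API}/zones/{ZONE_ID}/dns_records/{RECORD_ID}",
            headers=HEADERS,
            json={"type": "A", "name": "example.com",
                  "content": DC2_IP, "ttl": 60, "proxied": True},
        )
        break  # fail over once; failback is handled manually
    time.sleep(10)
```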
Amazon also has a similar feature with Route 53, I believe.
This plan is costly, however, as you're running a lot more infrastructure that rarely gets used. Linode is rolling out more network improvements, so hopefully this becomes less necessary.
For more advanced load balancing and HA, you can go with more "cloud"-centric providers, but it does come at a cost.
-Ricardo
Developer Evangelist, CircleCI, formerly Linode

How Amazon ELB will distribute requests to Amazon EC2 instances of different instance types?

Does anyone have any idea how the ELB will distribute requests if I register multiple EC2 instances of different sizes? Say one m1.medium, one m1.large and one m1.xlarge.
Will it be different if I register EC2 instances of the same size? If so, how?
That's a fairly complicated topic, mostly because the Amazon ELB routing documentation is all but nonexistent, so one needs to assemble some pieces to draw a conclusion - see my answer to the related question Can Elastic Load Balancers correctly distribute traffic to different size instances for a detailed analysis including all the references I'm aware of.
For the question at hand I think it boils down to the somewhat vague AWS team response from 2009 to ELB Strategy:
ELB loosely keeps track of how many requests (or connections in the case of TCP) are outstanding at each instance. It does not monitor resource usage (such as CPU or memory) at each instance. ELB currently will round-robin amongst those instances that it believes has the fewest outstanding requests. [emphasis mine]
Depending on your application architecture and request variety, larger Amazon EC2 instance types might be able to serve requests faster, thus have fewer outstanding requests and receive more traffic accordingly. Either way, the ELB supposedly distributes traffic appropriately on average, i.e. it should implicitly account for the uneven instance characteristics to some extent. I haven't tried this myself, though, and would recommend both Monitoring Your Load Balancer Using CloudWatch as well as monitoring your individual EC2 instances, then correlating the results in order to eventually gain insight into, and confidence in, such a setup.
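For the correlation step, a minimal sketch with boto3 and CloudWatch (the instance IDs are placeholders):

```python
# Pull average CPU per backend instance so you can correlate uneven
# instance sizes with the traffic ELB actually sends them.
# Instance IDs are placeholders.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
INSTANCES = ["i-0medium000000000", "i-0large0000000000", "i-0xlarge000000000"]

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

for instance_id in INSTANCES:
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=300,              # 5-minute buckets
        Statistics=["Average"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(instance_id, point["Timestamp"], f'{point["Average"]:.1f}%')
```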
Hi, I agree with Steffen Opel. Also, I recently met one of the solution architects at AWS; he gave a couple of heads-ups on Elastic Load Balancing to achieve better performance through ELB:
1) Make sure you have an equal number of instances running in all the availability zones. For example, ap-southeast has two availability zones, 1a and 1b, so make sure you have an equal number of instances attached to the ELB from both zones.
2) Make sure your application is stateless; that's what the cloud enforces and suggests to developers.
3) Don't use sticky sessions.
4) Reduce the DNS TTL (time to live) as much as possible, to something like 10 seconds.
5) Keep the health-check interval and unhealthy threshold low so that the ELB doesn't retain unhealthy instances (see the sketch after this list).
6) If you are expecting a lot of traffic to your ELB, make sure you do load testing on the ELB itself; it doesn't scale as fast as your EC2 instances.
7) If you are caching, think very carefully about the point from which you pick the data to cache.
All the tips above are just to help you get better results. It's also better to use instances of the same size.
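To illustrate tip 5, a minimal sketch against a classic ELB with boto3 (the load balancer name and health-check target are placeholders):

```python
# Tighten the ELB health check so unhealthy instances are dropped quickly.
# Load balancer name and target path are placeholders.
import boto3

elb = boto3.client("elb", region_name="ap-southeast-1")
elb.configure_health_check(
    LoadBalancerName="www-lb",
    HealthCheck={
        "Target": "HTTP:80/health",  # endpoint that returns 200 when healthy
        "Interval": 10,              # seconds between checks
        "Timeout": 5,                # seconds to wait for a response
        "UnhealthyThreshold": 2,     # failures before marking unhealthy
        "HealthyThreshold": 3,       # successes before marking healthy again
    },
)
```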

Amazon EC2 high availability

Consider the following scenario:
There is a service that runs 24/7, and downtime is extremely expensive. This service is deployed on Amazon EC2. I am aware of the importance of deploying the application in two different availability zones, and even in two different regions, in order to prevent single points of failure. But...
My question is whether there are any additional configuration issues that may affect the redundancy of an application. I also mean misconfiguration (for example, a DNS misconfiguration that would make it fail during a failover).
Just to make sure I am clear - I am trying to create a list of validations that should be tested in order to ensure the redundancy of an application deployed on EC2.
Thank you all!
Just as a warning: putting your services in two availability zones doesn't mean that you're fault tolerant.
For example, one setup I had was four servers behind load balancers, with us-east-1a and us-east-1b as the two zones. Amazon's outage a few months ago caused some outages with my software because the load balancers weren't working properly. They were still forwarding requests, but the two dead instances I had in one of the zones were also still receiving requests. Part of the load balancer's logic is to remove dead instances, but since the load balancer queue was backlogged, those instances were never removed. In my setup there were two load balancers, one in each zone, so all of the requests to one load balancer were timing out because there were no instances left to respond to them. Luckily for me, the browser retried the request against the second load balancer, so the feeds I had were still loading, just very, very slowly.
My advice: if you choose to go with only two availability zones rather than two regions, make sure your systems are not dependent on any part of another availability zone, not even the load balancers. For me, it's not worth the extra cost to launch two completely independent systems in different zones, so I'm unable to avoid this problem in the future. But if your software is critical to the point where losing the service for an hour would pay for the cost of running extra hardware, then it's definitely worth the extra servers to set it up correctly.
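One concrete safeguard, sketched with boto3 against classic ELBs (the load balancer names are placeholders): an external watchdog that asks each balancer which backends it considers healthy, so you notice when a balancer keeps routing to dead instances:

```python
# Independent watchdog: ask each classic ELB which instances it considers
# healthy, so you notice when a balancer keeps routing to dead backends.
# Load balancer names are placeholders.
import boto3

elb = boto3.client("elb", region_name="us-east-1")

for lb_name in ["lb-us-east-1a", "lb-us-east-1b"]:
    states = elb.describe_instance_health(LoadBalancerName=lb_name)["InstanceStates"]
    in_service = [s["InstanceId"] for s in states if s["State"] == "InService"]
    unhealthy = [(s["InstanceId"], s["Description"])
                 for s in states if s["State"] != "InService"]
    print(f"{lb_name}: {len(in_service)} InService, {len(unhealthy)} unhealthy")
    for instance_id, description in unhealthy:
        print(f"  {instance_id}: {description}")
    if not in_service:
        print(f"  ALERT: {lb_name} has no healthy backends!")
```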
I also recommend paying for AWS support and working with their engineers to make sure that your design doesn't have any flaws for high-availability.
Recap of the issue I discussed: http://aws.amazon.com/message/67457/

Failover proxy on Amazon AWS?

This is a fairly generic question. Suppose I have three EC2 boxes: two app boxes and a box that hosts nginx as a reverse proxy, delegating requests to the two app boxes (my database is hosted elsewhere). Now, the two app machines can absorb a failure between themselves; however, the third one represents a single point of failure. How can I configure my setup so that if the reverse proxy goes down, the site is still available?
I am looking at keepalived and HAProxy. For me this stuff is non-obvious, and any help for a beginner is appreciated.
If your nginx does not do much more than proxy HTTP requests, please have a look at Amazon Elastic Load Balancer. You can set up your two (or more) app boxes, leave some spare ones (in order to always keep two or more up, if you need that), set up health checks, terminate SSL at the balancer, make use of sticky sessions, etc.
There are a lot of people, though, who would like to see the ability to assign elastic IP addresses to ELBs, and others with good arguments for why it is not needed.
My suggestion is that you take a look at the ELB documentation, as it seems to fit your needs perfectly. I also recommend reading this interesting post for a good discussion of the subject.
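For illustration, a minimal sketch of that setup with boto3 (the names, zones, instance IDs, and health-check path are placeholders):

```python
# Replace the single nginx proxy with a managed classic ELB fronting the
# two app boxes. Names, zones, and instance IDs are placeholders.
import boto3

elb = boto3.client("elb", region_name="us-east-1")

elb.create_load_balancer(
    LoadBalancerName="app-lb",
    Listeners=[{"Protocol": "HTTP", "LoadBalancerPort": 80,
                "InstanceProtocol": "HTTP", "InstancePort": 80}],
    AvailabilityZones=["us-east-1a", "us-east-1b"],
)
elb.configure_health_check(
    LoadBalancerName="app-lb",
    HealthCheck={"Target": "HTTP:80/health", "Interval": 10, "Timeout": 5,
                 "UnhealthyThreshold": 2, "HealthyThreshold": 3},
)
elb.register_instances_with_load_balancer(
    LoadBalancerName="app-lb",
    Instances=[{"InstanceId": "i-0aaaaaaaaaaaaaaa1"},
               {"InstanceId": "i-0bbbbbbbbbbbbbbb2"}],
)
```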
I think if you are a beginner with HA and clusters, your best solution is Elastic Load Balancer (ELB), which is maintained by Amazon. ELBs scale up automatically and implement a highly available cluster of balancers, so by using the ELB service you already mitigate the point of failure you mentioned. It's also worth keeping in mind that an ELB is cheaper than two instances on AWS, and of course it's easier to launch and maintain.
You don't see multiple ELBs because it is a service, so you don't have to take care of its availability yourself.
Another important point is that AWS elastic IPs aren't assigned to the NIC of your instance's operating system, so using virtual IPs the way you would in a classical infrastructure is difficult.
After this explanation, if you still want nginx as a reverse proxy in AWS for your own reasons, I think you can implement an auto-scaling group with a layer of nginx instances. But if you aren't experienced with auto-scaling technology, it could be very tricky.

Amazon EC2 consideration - redundancy and elastic IPs

I've been tasked with determining whether Amazon EC2 is something we should move our ecommerce site to. We currently use Amazon S3 for a lot of images and files. The cost would go up by about $20/mo in hosting costs, but we could sell our server for a few thousand dollars. This all came up because right now there are no procedures in place if something happens to our server.
How reliable is Amazon EC2? Is the redundancy good? I don't see anything about this in the FAQ, and it's a problem on our current system that I'm looking to solve.
Are elastic IPs beneficial? It sounds like you could point DNS to that IP and then, on Amazon's end, reroute that IP address to any EC2 instance, so you could easily get another instance up and running if the first one failed.
I'm aware of scalability, it's the redundancy and reliability that I'm asking about.
At work, I've had something like 20-40 instances running at all times for over a year. I think we've had 1-3 alert emails from Amazon suggesting that we terminate an instance and boot another (presumably because they detected possible failure in the underlying hardware). We've never had an instance go down suddenly, which seems rather good.
Elastic IPs are amazing and are part of the solution. The other part is being able to bring up new instances rapidly. I've learned that you shouldn't care about individual instances going down; it's more important to use proper load balancing and to be able to bring up commodity instances quickly.
Yes, it's very good. If you aren't able to put together concurrent redundancy (where you have multiple servers fulfilling requests simultaneously), using the elastic IP to quickly redirect to another EC2 instance is a way to minimize downtime.
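A minimal sketch of that elastic IP switch with boto3 (the allocation ID and standby instance ID are placeholders):

```python
# Failover via elastic IP: move the public IP from a failed instance to a
# standby. The allocation ID and instance ID are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# For VPC elastic IPs this is a single call; AllowReassociation lets the
# address move even though it is currently attached to another instance.
ec2.associate_address(
    AllocationId="eipalloc-0123456789abcdef0",
    InstanceId="i-0standby000000001",
    AllowReassociation=True,
)
```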
Yeah, I think moving from an in-house server to Amazon will definitely make a lot of sense economically. EBS-backed instances ensure that even if the machine gets rebooted, the data on its volumes is not lost. And if you have a clear separation between your application and data layers and can put them on different machines, then you can build even better redundancy for your data.
For example, if you use MySQL, then you can consider the Amazon RDS service, which gives you a highly available and reliable MySQL instance, fully managed (patches and all). The application layer can then be made more resilient by load balancing across several smaller instances rather than one larger instance.
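A minimal sketch of provisioning such an instance with a Multi-AZ standby via boto3 (the identifier, instance class, and credentials are placeholders):

```python
# Provision a managed MySQL instance with a Multi-AZ standby for automatic
# failover. Identifier, class, and credentials are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="ecommerce-db",
    Engine="mysql",
    DBInstanceClass="db.m5.large",
    AllocatedStorage=100,          # GiB
    MasterUsername="admin",
    MasterUserPassword="change-me",
    MultiAZ=True,                  # synchronous standby in another AZ
)
```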
The costs you will save on are really hardware maintenance and what you would otherwise have to spend to build in disaster recovery.
