Following the scenario:
There is a service that runs 24/7, and downtime is extremely expensive. This service is deployed on Amazon EC2. I am aware of the importance of deploying the application in two different availability zones, and even in two different regions, in order to prevent single points of failure. But...
My question is whether there are any additional configuration issues that may affect the redundancy of an application. I also mean misconfiguration (for example, a DNS configuration that will make the application fail during a failover).
Just to make sure I am clear - I am trying to create a list of validations that should be tested in order to ensure the redundancy of an application deployed on EC2.
Thank you all!
Just as a warning: putting your services in two availability zones doesn't by itself make you fault tolerant.
For example, one setup I had was four servers on a load balancer, with us-east-1a and us-east-1b as the two zones. Amazon's outage a few months ago caused some outages with my software because the load balancers weren't working properly. They were still forwarding requests, but the two dead instances I had in one of the zones were also still receiving requests. Part of the load balancer logic is to remove dead instances, but since the load balancer queue was backlogged, those instances were never removed. In my setup there are two load balancers, one in each zone, so all of the requests to one load balancer were timing out because there were no instances to respond to them. Luckily for me, the browser retried the request against the second load balancer, so the feeds I had were still loading, but very, very slowly.
My advice: if you choose to go with only two availability zones rather than two regions, make sure your systems are not dependent on any part of another availability zone, not even the load balancers. For me, it's not worth the extra cost to launch two completely independent systems in different zones, so I'm unable to avoid this problem in the future. But if your software is critical to the point where losing the service for one hour would pay for the cost of running the extra hardware, then it's definitely worth the extra servers to set it up correctly.
I also recommend paying for AWS support and working with their engineers to make sure that your design doesn't have any flaws for high-availability.
Recap of the issue I discussed: http://aws.amazon.com/message/67457/
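To turn that into one of the validations the original question asked for, here is a minimal boto3 sketch (the load balancer name and region are placeholders, not from my setup) that lists which zones a classic ELB spans and the health state of every registered instance; a dead instance still shown as in service is exactly the failure mode described above:

```python
# Check which availability zones an ELB spans and the health of each
# registered instance. Assumes boto3 credentials are configured.
import boto3

elb = boto3.client("elb", region_name="us-east-1")

LB_NAME = "my-production-elb"  # placeholder load balancer name

lb = elb.describe_load_balancers(LoadBalancerNames=[LB_NAME])["LoadBalancerDescriptions"][0]
print("Zones covered by the ELB:", lb["AvailabilityZones"])

health = elb.describe_instance_health(LoadBalancerName=LB_NAME)["InstanceStates"]
for state in health:
    # An instance reported 'InService' that you know is dead is a red flag.
    print(state["InstanceId"], state["State"], state.get("Description", ""))
```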
I'm hosting a web application which should be highly available. I'm hosting on multiple Linodes and using a NodeBalancer to distribute the traffic. My question might be stupidly simple, but not long ago I was affected by a DDoS hitting the data center. That made me think about how I can be better prepared the next time this happens.
The NodeBalancer and servers are all in the same data center, which should, of course, be fixed. But how does one go about doing this? If I have two load balancers in two different data centers, how can I set up the domain to point to both but ignore the one affected by a DDoS? Should I look into the DNS manager? Am I making things too complicated?
Really would appreciate some insights.
Thanks everyone...
You have to look at ways to load balance across datacenters. There are a few ways to do this, each with its pros and cons.
If you have a lot of DB calls, running two datacenters hot can introduce a lot of latency problems. What I would do is as follows.
Have the second datacenter (DC2) be a warm location. It is configured for everything to work and is constantly getting data from the master DB in DC1, but isn't actively getting traffic.
Use a service like CloudFlare for their extremely fast DNS switching. Have a service in DC2 that constantly pings the load balancer in DC1 to make sure that everything is up and well. When it has trouble contacting DC1, it can connect to CloudFlare via the API and switch the main 'A' record to point to DC2, which then picks up the traffic.
I forget what CloudFlare calls it, but it has a DNS feature that allows you to switch 'A' records almost instantly, because the actual IP address given to the public is their own; they just route the traffic for you.
Amazon also has a similar feature, with CloudFront I believe.
This plan is costly, however, as you're running much more infrastructure that rarely gets used. Linode is rolling out more network improvements, and will keep doing so, so hopefully this becomes less necessary.
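For illustration, here is a rough sketch of that DC2 watchdog in Python using the `requests` library; the zone ID, record ID, hostnames, and IPs are all placeholders, and the exact CloudFlare request body should be double-checked against their API docs:

```python
# Ping DC1's load balancer from DC2 and, if it stops responding, repoint
# the main 'A' record to DC2 via the CloudFlare v4 API.
import time
import requests

CF_API = "https://api.cloudflare.com/client/v4"
CF_HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN", "Content-Type": "application/json"}
ZONE_ID = "your-zone-id"                 # placeholder
RECORD_ID = "your-a-record-id"           # placeholder
DC1_HEALTH_URL = "http://dc1-loadbalancer.example.com/health"  # placeholder
DC2_IP = "203.0.113.20"                  # placeholder

def dc1_is_up(retries=3):
    for _ in range(retries):
        try:
            if requests.get(DC1_HEALTH_URL, timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass
        time.sleep(2)
    return False

def point_a_record_to(ip):
    # Update the existing A record so public traffic lands on DC2.
    requests.put(
        f"{CF_API}/zones/{ZONE_ID}/dns_records/{RECORD_ID}",
        headers=CF_HEADERS,
        json={"type": "A", "name": "example.com", "content": ip, "ttl": 120, "proxied": True},
    ).raise_for_status()

if not dc1_is_up():
    point_a_record_to(DC2_IP)
```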
For more advanced load balancing and HA, you can go with more "cloud" providers but it does come at a cost.
-Ricardo
Developer Evangelist, CircleCI, formerly Linode
Please keep in mind I have very little experience with any of this stuff.
I've been trying to set up a multi-server Magento site in Elastic Beanstalk, using RDS as the database and EC2 instances for the separate admin and frontend servers. I was testing the speed of a few different setups on tools.pingdom.com, mostly looking at what they call "wait" (DNS, connect, send, WAIT, receive), assuming that is a rough measure of how long it takes Magento to generate the HTML for a page. I was puzzled by how, on different setups, the wait time could vary so dramatically for roughly the same page using similar server instances. I was getting values between 980ms and 1.8s.
I thought I started to notice a pattern: setups in which the EC2 instances were in the same availability zone as the RDS instance were faster, and more consistently so. So I changed the Elastic Beanstalk configuration so that the EC2 instances would be in the same zone as the database. My unscientific finding was that I would consistently get wait times of around 1s after this change. It seems to me that the fairly significant differences in speed were due to network latency between the application server and the database.
There are three parts to my question. First, is this what one would expect from keeping instances in the same zone, or am I reading too much into a small set of test results? Second, is this a significant real-world difference in speed? It seems so to me, and it also seems it would only be made worse by using things like NFS to share the media folders. Third, are there any advantages to allowing application servers to be launched in different zones, and are those advantages worth the increase in wait time?
Also, if I'm doing something wrong, please tell me.
The latency between zones should be in the single-millisecond range. While this may be a problem in some scenarios, in most cases you can compensate for it by making good use of caching.
The primary benefit of using multiple zones is availability. The system is designed so that there is very little shared between each zone. If something causes a single zone to fail, it shouldn't cascade to other zones.
While this mostly works, there have been some cases (especially with EBS volumes) that have shown the limits of that isolation.
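If you want to verify the latency effect directly rather than inferring it from Pingdom's "wait" number, a quick sketch like the following times a trivial query from an EC2 instance; run it once from an instance in the same zone as RDS and once from a different zone and compare. PyMySQL is just one client choice here, and the endpoint and credentials are placeholders:

```python
# Time many trivial round trips to the RDS endpoint to estimate per-query latency.
import time
import pymysql

conn = pymysql.connect(
    host="mydb.xxxxxxxx.us-east-1.rds.amazonaws.com",  # placeholder RDS endpoint
    user="magento",
    password="secret",
    database="magento",
)

samples = []
with conn.cursor() as cur:
    for _ in range(200):
        start = time.perf_counter()
        cur.execute("SELECT 1")
        cur.fetchone()
        samples.append((time.perf_counter() - start) * 1000)

samples.sort()
print(f"median {samples[len(samples)//2]:.2f} ms, p95 {samples[int(len(samples)*0.95)]:.2f} ms")
```

A Magento page can issue a very large number of queries per request, so even an extra millisecond per round trip across zones adds up quickly.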
Does anyone have any idea how the ELB will distribute requests if I register multiple EC2 instances of different sizes? Say one m1.medium, one m1.large and one m1.xlarge.
Will it be different if I register EC2 instances of the same size? If so, how?
That's a fairly complicated topic, mostly because the Amazon ELB routing documentation is little short of non-existent, so one needs to assemble some pieces to draw a conclusion - see my answer to the related question Can Elastic Load Balancers correctly distribute traffic to different size instances for a detailed analysis, including all the references I'm aware of.
For the question at hand I think it boils down to the somewhat vague AWS team response from 2009 to ELB Strategy:
"ELB loosely keeps track of how many requests (or connections in the case of TCP) are outstanding at each instance. It does not monitor resource usage (such as CPU or memory) at each instance. ELB currently will round-robin amongst those instances that it believes has the fewest outstanding requests." [emphasis mine]
Depending on your application architecture and request variety, larger Amazon EC2 instance types might be able to serve requests faster, thus have fewer outstanding requests and receive more traffic accordingly. Either way, the ELB supposedly distributes traffic appropriately on average, i.e. it should implicitly account for the uneven instance characteristics to some extent. I haven't tried this myself, though, and would recommend both Monitoring Your Load Balancer Using CloudWatch and monitoring your individual EC2 instances, then correlating the results, in order to eventually gain insight into and confidence in such a setup.
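As a starting point for that correlation, here is a hedged boto3 sketch (load balancer name, instance ID, and region are placeholders) that pulls the ELB's request count and one instance's CPU from CloudWatch over the last hour, so you can see whether the larger instances really do absorb more traffic:

```python
# Fetch ELB request count and per-instance CPU from CloudWatch for correlation.
from datetime import datetime, timedelta
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.utcnow()
start = end - timedelta(hours=1)

def metric(namespace, name, dimensions, stat):
    points = cw.get_metric_statistics(
        Namespace=namespace, MetricName=name, Dimensions=dimensions,
        StartTime=start, EndTime=end, Period=300, Statistics=[stat],
    )["Datapoints"]
    return sorted(points, key=lambda p: p["Timestamp"])

# Requests handled by the load balancer as a whole.
for p in metric("AWS/ELB", "RequestCount",
                [{"Name": "LoadBalancerName", "Value": "my-elb"}], "Sum"):
    print("ELB requests:", p["Timestamp"], p["Sum"])

# CPU on one of the backing instances (repeat per instance and compare).
for p in metric("AWS/EC2", "CPUUtilization",
                [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}], "Average"):
    print("instance CPU:", p["Timestamp"], round(p["Average"], 1))
```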
Hi, I agree with Steffen Opel. I also met one of the solution architects from AWS recently, and he gave a couple of heads-ups on Elastic Load Balancing to achieve better performance through ELB.
1) Make sure you have an equal number of instances running in all the availability zones. For example, in the case of ap-southeast we have two availability zones, 1a and 1b, so make sure you have an equal number of instances attached to the ELB from both zones.
2) Make sure your application is stateless; that's what the cloud model enforces and what AWS suggests to developers.
3) Don't use sticky sessions.
4) Reduce the TTL (time to live) as far as possible, to something like 10 seconds.
5) Keep the health check interval and unhealthy threshold low so that the ELB doesn't retain unhealthy instances (a sketch follows below).
6) If you are expecting a lot of traffic to your ELB, make sure you do load testing on the ELB itself; it doesn't scale as fast as your EC2 instances.
7) If you are caching, then think a thousand times about the point from which you are picking the data to cache.
All the tips above are just to help you make this better. It's better to have instances of the same size.
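As a concrete illustration of tip 5, here is a minimal boto3 sketch for a classic ELB; the load balancer name, health-check path, and the exact thresholds are placeholders to adapt, not recommendations:

```python
# Configure a short health check so dead instances are pulled from the ELB quickly.
import boto3

elb = boto3.client("elb", region_name="ap-southeast-1")

elb.configure_health_check(
    LoadBalancerName="my-elb",
    HealthCheck={
        "Target": "HTTP:80/health",   # endpoint your app exposes for checks
        "Interval": 10,               # seconds between checks
        "Timeout": 5,                 # seconds to wait for a response
        "UnhealthyThreshold": 2,      # failures before the instance is removed
        "HealthyThreshold": 3,        # successes before it is re-added
    },
)
```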
I'm relaunching a site (~5mm+ visits per day) on EC2, and am confused about how to deploy nodes in different data centers. My most basic setup is two nodes behind a Varnish server.
Should I have two Varnish instances in different availability zones, each with WWW nodes that talk to a shared RDS database? Each Varnish instance can be load balanced w/ Amazon's load balancer.
Something like:
1 load balancer talking to:
Varnish in Virginia, which talks to its own us-east-x nodes
Varnish in California, which talks to its own us-west-x nodes
We use Amazon EC2 extensively for load balancing and fault tolerance. While we still don't extensively use the load balancers provided by Amazon, we have our own load balancers (running outside Amazon). Amazon promises that its load balancers will never go down and that they are internally fault tolerant, but I haven't tested that well enough.
In general we host two instances per availability zone, one acting as a mirror of the real server. If one of the servers goes down, we send the customers to the other one. But lately Amazon has shown a pattern in which a single availability zone goes down quite often.
So the wise technique, I presume, is to set up servers across availability zones like you mentioned. We use Postgres, so we can replicate content in the database across the instances; with 9.0 there is binary (streaming) replication, which works great for keeping the two servers in sync. This way both servers can take the load when they're up, but when an availability zone does go down, all the users are sent to one server. Since a common database is available, it does not matter which server the users go to; they will just experience a slight slowness if they go to the wrong server.
With this approach you can do tandem updating of the websites: update one, ensure it is running fine, and then update the next. That way, even if one server fails to upgrade, the whole website stays up.
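As a hedged sketch of how you might sanity-check that replication before sending users to the other server (connection details are placeholders; the functions shown are the PostgreSQL 9.x names, which were renamed in later major versions):

```python
# Confirm the standby is in recovery and not lagging far behind the master.
import psycopg2

standby = psycopg2.connect(host="10.0.2.10", dbname="app", user="monitor", password="secret")
with standby.cursor() as cur:
    cur.execute("SELECT pg_is_in_recovery()")
    in_recovery = cur.fetchone()[0]
    cur.execute("SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location()")
    received, replayed = cur.fetchone()

print("standby in recovery:", in_recovery)
print("received:", received, "replayed:", replayed)
# If replay lags far behind receive, fail over deliberately instead of
# blindly sending users to a stale server.
```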
I've been tasked with determining if Amazon EC2 is something we should move our ecommerce site to. We currently use Amazon S3 for a lot of images and files. The cost would go up by about $20/mo for our host costs, but we could sell our server for a few thousand dollars. This all came up because right now there are no procedures in place if something happened to our server.
How reliable is Amazon EC2? Is the redundancy good? I don't see anything about this in the FAQ, and it's a problem with our current system that I'm looking to solve.
Are Elastic IPs beneficial? It sounds like you could point DNS at that IP and then, on Amazon's end, reroute that IP address to any EC2 instance, so you could easily get another instance up and running if the first one failed.
I'm aware of scalability, it's the redundancy and reliability that I'm asking about.
At work, I've had something like 20-40 instances running at all times for over a year. I think we've had 1-3 alert emails come from Amazon suggesting that we terminate and boot another instance (presumably because they are detecting possible failure in the underlying hardware). We've never had an instance go down suddenly, which seems rather good.
Elastic IPs are amazing and are part of the solution. The other part is being able to rapidly bring up new instances. I've learned that you shouldn't care about individual instances going down; it's more important to use proper load balancing and to be able to bring up commodity instances quickly.
Yes, it's very good. If you aren't able to put together concurrent redundancy (where you have multiple servers fulfilling requests simultaneously), using an Elastic IP to quickly redirect to another EC2 instance would be a way to minimize downtime.
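For what it's worth, remapping an Elastic IP is a single API call. Here is a minimal boto3 sketch with placeholder allocation and instance IDs, assuming a VPC-style Elastic IP addressed by its AllocationId:

```python
# Move an Elastic IP to a standby instance when the primary is unhealthy.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ALLOCATION_ID = "eipalloc-0123456789abcdef0"   # the Elastic IP (placeholder)
STANDBY_INSTANCE = "i-0fedcba9876543210"       # instance to fail over to (placeholder)

ec2.associate_address(
    AllocationId=ALLOCATION_ID,
    InstanceId=STANDBY_INSTANCE,
    AllowReassociation=True,  # move the address even if it's attached elsewhere
)
```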
Yeah, I think moving from an in-house server to Amazon will definitely make a lot of sense economically. EBS-backed instances ensure that even if the machine gets rebooted, the data on the root volume is not lost. And if you have a clear separation between your application and data layers and can have them on different machines, then you can build even better redundancy for your data.
For example, if you use MySQL, then you can consider using the Amazon RDS service, which gives you a highly available and reliable MySQL instance, fully managed (patches and all). The application layer can then be made more resilient by having several smaller instances rather than one larger instance, behind load balancing.
What you will really save on is hardware maintenance and the cost you would otherwise have to incur to build in disaster recovery.
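To make the RDS suggestion concrete, here is a hedged boto3 sketch of creating a Multi-AZ MySQL instance; every identifier and size here is a placeholder rather than a recommendation:

```python
# Create a Multi-AZ MySQL RDS instance so the database layer has its own
# synchronous standby and automatic failover.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="shop-db",       # placeholder
    DBInstanceClass="db.m1.small",        # placeholder size
    Engine="mysql",
    AllocatedStorage=50,                  # GiB
    MasterUsername="admin",
    MasterUserPassword="change-me",
    MultiAZ=True,                         # standby in a second availability zone
)
```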