Curious as to 99.95% uptime REALLY means; Is it really going to go down 7 minutes a month? Please post your longest/average uptimes on EC2, thanks.
Usually uptime is calculated in a yearly basis. So if you have a Service Level Agreement for 99.95% this means:
365 * 0.0005 = 0.1825 days or 4.38 hours
If during a year of service there is an outage and your system is down for more than that, then you are liable for compensation.
As of your question, I have a server running unstopped in EC2 for about 3 months now. I would say that their uptime is good, but if you have a mission critical application you definitely need to have a fail-over solution. A good uptime only means that they will be able to respond to an outage quickly. Even a 99.9999% uptime won't be able to save you if you aren't prepared for an outage.
Read the SLA carefully (http://aws.amazon.com/ec2-sla/) they only count "Region Unavailable" as downtime, and what is more they only count it as downtime if the region is down for 5 consecutive minutes.
“"Annual Uptime Percentage” is calculated by subtracting from 100% the percentage of 5 minute periods during the Service Year in which Amazon EC2 was in the state of “Region Unavailable.”
By my count this mean any downtime of less then 4 minutes is not countable. Also if they do break the SLA they are only in for %10 of the month in which you had largest downtime bill.
So if they where down for all of January and your bill was $100 they would apply a $10 credit to your account.
I would have a hard time convensing my boss that this is a serious product with a SLA like that.
SLA's are useless. They only measure how much risk the company is willing to take on and have no bearing on actual uptime. I've seen SLA's, with heavy penalties, offered when the company knew the could not meet the SLA in order to land the sale.
I have one client with 400+ days of EC2 uptime and another with 300+ days as measured by web pulse, this is by far the most reliable service I've worked with.
For my single instance running in the US-East availability zone, 9 months, 0 downtime.
Since Amazon switched to provide an SLA, I've never had an instance go down on me. When I've had instances go down in the past, Amazon has always sent a message informing me that the instance is degraded before it actually disappeared, so I've had time to start up a new instance.
The previous answer makes a good point, though; EC2's service model dictates that you write your apps to handle failover to a new server if you're not prepared for extended down time.
conrad#papa ~ $ uptime
04:42:36 up 495 days, 8:51, 8 users, load average: 0.02, 0.02, 0.00
Checking out the AWS Service Health Dashboard will get you a good idea of any current or past issues. My experience is that the AWS uptime is better than most "traditional" hosting options (even full-blown redundant RackSpace setup...).
However, simply going with AWS for uptime is like getting a car for the keychain (ok, almost.. ;)). With an architecture utilizing AWS the big benefit is scaling (without upfront costs).
SLA... Guaranteed uptime...
These are all very nice taglines. But when the servers aren't available for an hour (March 1, 2012, in the EU region) and the clients start calling, then it won't help you that they still have a 300 days uptime.
And when the lightning struck 1 out of 3 of their datacenters in the EU, we all found out that they have no off-site redundancies, and the fact that they have 3 datacenters doesn't mean a thing.
One must love the phrase "degraded performance", that actually means: "cross your fingers and pray that your data will still be available after the catastrophe passes".
I'm still trying to look for any official/non-official statistics about the availability percentages of all of their datacenters.
No luck thus far...
I've never had downtime on EC2, however, I do keep local backups and make daily images of my machines and port them to another availability zone, just in case. I use twilio to alert me if a machine is unreachable with a phone call to all my devices. Then I can just log in to EC2 and fire up a machine in another availability zone; worst case I'll be down for a few minutes.
Which in my case, is potentially pretty sucky, because my machines are doing 24/7 Forex trading.
My rule: know the potential cost of downtime, and be willing to invest that much in redundancy assuming it will happen - because it will.
That said, EC2 has never let me down. Helps probably that my servers are not in an area of the country where natural disasters are common. If you're in an earthquake zone, tornado alley, or a potential hurricane path, downtime truly is an inevitability.
Related
We have the algorithm to reuse AWS EC2 instances for jobs. This was very useful when the payments were by using time rounded by hours. Now, due to the change in the AWS policy the payments can be done by time expressed in minutes. At first glance there is no reason to keep the reuse algorithm because a job lasts at least 10 minutes up to many hours. Are there any suggestions why this algorithm can still be useful?
Unless you're running an EC2 instance for less than a minute (which it sounds like you're not) and need per second billing or are using an EBS volume with it, I would say probably not. See more details here
I have a use case where we have a very large computation job, which can be broken up into many small units of work fairly efficiently. There could be effectively lets say 1,000 hours of computational work for an m4.large instance. Lets say I wanted the result back within the next 10 minutes, that would mean I would need 6,000 instances to get the job done in time.
So far I have setup AWS batch, I haven't used any more than the 20 m4.large instances your account comes with. I know I can up the amount of instances requested by AWS but I still don't really know much about what the behaviour is if you suddenly try and provision thousands of on-demand instances or if AWS limits how many instances you can use.
So my question is am I able to launch thousands of m4.large instances on-demand? And if so what are sort of times would I be looking at for all instances to get to the Running state.
I have done this many times with ~100 instances but never in the thousands of instances.
STEP 1: Open a support ticket with AWS. You will need to get your account approved, credit checked, etc. My customers are very big companies, so for them the credit and approval process is easy. If you are a little guy, I don't know.
STEP 2: Think thru your VPC design and how you will address that many instances. If is one thing to have 5 instances going thru a NAT Gateway, but a hundred systems will bring Internet connectivity to its knees.
STEP 3: Think thru the networking bandwidth required. Do you need placement groups or very high speed Intranet or Internet connectivity?
STEP 4: Be prepared that you cannot launch all instances with a specific instance type (capacity not available error). Have a selection of instances that you can fall back on.
STEP 5: Create your own software, I use Python, to launch the instances, perform updates, install software, etc. You can then poll the instances using the Boto3 EC2 API to determine when all the instances are running. The length of time for 1,000 instances won't be much different than 1 instance.
Now for the real world. If your job takes 1,000 hours, launching 1,000 instances will not reduce it to 1 hour unless you have a really scalable software design with minimum inter-machine communications required. Once you go beyond 10 systems, networking bandwidth and communications overhead becomes an issue. Even though AWS's resources are huge, launching 1,000 EC2 instances at one time by one customer is not a common launch case.
I would also NOT launch 1,000 instances to get processing down to 10 minutes. It can take 10 minutes for your instances to come online, get updated, synchronize, etc. This means that you will be spending 50% of your budget on waiting time. For really large jobs today we prefer to use Hadoop / Spark where scaling to hundreds of machines is realistic.
You can contact AWS Customer Service to increase your EC2 limits (use the link shown in the Limits section of the EC2 management console). They will verify your use-case.
You might also consider using Spot Pricing to lower your costs. Spot instances take longer to provision.
Sample use-case: Gigaom | Cycle Computing once again showcases Amazon’s high-performance computing potential
There are also services like Spotinst that can help you provision servers at the lowest possible cost.
At the end of last month, I started experimenting with Amazon EC2. I launched one or two t2 micro instances, to experiment for free under the free tier.
However, I have quickly noticed on my billing dashboard that my services usage was increasing fast, and after a few dayswas forecast to exceed the free tier limitations by the end of the month. Since I don't need to maintain an 24/7 online presence at the moment and mostly use my instances to experiment, I have terminated one and stopped another. I have even detached the volume of the stopped instance.
All I have left are:
One stopped micro instance
One detached volume
Two snapshots of that volume
One AMI that I had used to move my instance from one region to another in the beginning
As far as I understand it, those more or less "idle" resources, at least not highly active ones. And yet, in my billing dashboard, in the row EC2 - Linux, the month-to-date usage keeps increasing, at the rate of more than one hour of usage per real-time hour.
Before I detached the volume, the usage would increase by 14 hours in only 3 hours. Now that I have detached it, it slowed down a bit, but still increased by 16 hours in the last 13 hours. All that, again, without actively running anything!
I am aware that the tally of the usage isn't strictly limited to running instances, but to associated resources as well. But still, it seems very high to me. I don't even dare to test my app anymore.
I would like to know if such an increase is normal, or if there may be something wrong with my account and how I configured my instances. If it is normal, any indications as to what actions I could take to reduce this increase would be very welcome!
Note that I did contact the Amazon support several days ago and didn't get a single reply, this is why I'm turning to here.
Thanks in advance!
Edit: Solved. I did have an instance running in another region, probably by mistake because I had never interacted directly with that region. See comments for a script to systematically check all regions and avoid this kind of stupid mistake.
I see a lot of questions that are similar but no direct answer. I just want to play around with an instance to run some light hourly data aggregation and familiarize myself with the ec2 instances. I feel like I do not understand what my risk is however. Is there a way to set my max cost per month? I don't care if my instances get shutdown when the cost is hit, I just want limited liability for my experiment.
Also, running some monitoring of logs myself is not really a solution I will undertake for such a simple experiment. I just want to know if there is some easy way of limiting liability.
Short answer is no, you wont be able to set a cap and stop services.
You can however easily monitor your costs. Starting with billing alerts. Account activity shows you a nearly realtime accounting of the costs you have occurred so far.
I'd like a non-amazon answer to this quandry...
It looks like, via spot instance pricing, you could run an instance for 22 or 23 cents an hour, for as many hours as you want, because the historical charts for hours/days/months show the spot price never goes over 21 (22?) cents per hour. That's like half of the non-reserved instance cost for the same sized instance and its even less than a reserved instance would ever work out to be per hour. With no commitment.
Am I missing something, do I have a complete and total misunderstanding of the spot/bid/ask instance mechanisim? Or is this a cheap way to get an 24/7 instance while Amazon has a bunch of extra capacity?
Jeremy
No, you are not missing anything. I asked the same question many times when I first looked at Spot, followed by "why doesn't everyone use this all the time?"
So what's the downside? Amazon reserves the right to terminate a Spot instance at any time for any reason. Now, a normal "on-demand" instance might die at any time too, but Amazon goes to great efforts to keep them online and to serve customers with warnings well in advance (days / weeks) if the host server needs to be powered down for maintenance. If you have a Spot instance running on a server they want to reboot ... they will just shut it off. In practice, both are pretty reliable (but NOT 100%!!), and many roles can run 24/7 on spot without issues. Just don't go whining to Amazon that your Spot instance got shut off and your entire database was stored on the ephemeral drive... of course if you do that on ANY instance, you are taking a HUGE (and very stupid) risk.
Some companies are saving tons of money with Spot. Here's a writeup on Vimeo saving 50%, and one on Pinterest saving 60%+ ($54/hr => $20/hr).
Why don't more companies use Spot for their instances? Many of the companies buying EC2 instance hours aren't very price sensitive and are very very risk-adverse, especially when it comes to outages and to operational events that sap engineering effort. They don't want to deal with the hassle to save a few bucks, especially if AWS fees aren't a significant cost-center versus personel. And for 24/7 instances, they already pay 1/2 price via "reserved instances", so the savings aren't as dramatic as they seem versus full-priced "on-demand" instances. Spot isn't fully relevant to large customers. You can be nearly certain that when a customer gets to be the size of a Netflix, they 1) need to coordinate with Amazon on capacity planning because you can't just spin up 1/2 a datacenter on a whim, and 2) getting significant volume discounts that bring their usage costs down into the Spot price range anyways. Plus, the first tier of cost cutting is to reclaim hardware that isn't really needed; at my last company, one guy found a bug where as we cycled through boxes we would "forget" about some of them and shutting that down saved $100+k / month (yikes). Once companies burn through that fat, they start looking at Spot.
There's a second, less discussed reason Spot doesn't get used... It's a different API. Think about how this interacts with "organizational inertia" .... Working at a company that continuously spends $XX / hr on EC2 (and coming from a company that spent $XXXX / hr), engineers start instances with the tools they are given. Our Chef deployment doesn't know how to talk to spot. Rightscale (prev place) defaulted to launching on-demand instances. With some quantity of work, I could probably figure out how to make a spot instance, but why bother if my priority is to get role XYZ up and running by tomorrow? I'm not about to engineer a spot-based solution just for my one role and then evangelize why that was a good idea; it's gotta be an org-wide decision. If you read the Pinterest case-study I linked above, you'll notice they talk about migrating their whole deployment over from $54/hr to $20/hr on spot. Reading between the lines, they didn't choose to launch Spot instances 1-by-1; one day, they woke up and made a company-wide decision to "solve the spot problem" and 'migrate' their deployment tools to using Spot by default (probably with support for a flag that keeps their DB instances off Spot). I can't imagine how much money Amazon has made by making Spot a different API instead of being a flag on the normal EC2 API; Hint: it's boatloads .. as in, you could buy a boat and then fill it with cash until it sinks.
So if you are willing to tolerate slightly higher risk and / or you are somewhat price-sensitive ... then, yes, you absolutely can save a crapton of money by running your service under Spot 24/7.
Just make sure you are double-prepared to unexpectedly lose your instance (ie, take backups) .... something you ALREADY need to be prepared for with an "on-demand" instance that doesn't have 100.0% uptime either.
Think of it this way:
Instead of getting something 99.9% reliable, you are getting something 99.5% reliable and paying half-price
(I made those numbers up to convey the idea, but they probably aren't too far off from the truth).
So long as your bid price is above the spot instance market price, you can continue to run whatever spot instances you want, and only pay the market price.
However, when the market price goes above your bid price, you lose your instances. Without any warning. They just terminate. While the spot price rarely spikes, and when it does it tends to come back down again quickly, for many applications the possibility of losing all your instances without warming is unacceptable. You can insulate yourself against that possibility by bidding higher, but then you risk having to pay that much.
TL;DR: If your application is tolerant to sudden termination, then spot instances are great. But there is a risk involved in using them.
I think these answers are slightly missing the point...
You need to select the most appropriate pricing for your workload and architect your solution with this in mind. AWS offers 3 pricing types:
Reserved Instances (low cost, high reliability, but pay up-front)
On-demand instances (highest cost, high reliability, but pay as you go)
Spot Instances (generally lowest cost, but can terminate unexpectedly)
Reserved Instances
- Use these for cost savings on long running / constant / predictable workloads.
On-demand Instances
- Use these for temporary workloads e.g. development / proof of concept / unpredictable workloads that can't be interrupted.
Spot Instances
- Use these for transitory workloads. Ensure applications are designed with this in mind (e.g. maintain state somewhere permanent and support the ability for new instances to resume where previous ones left off).
A useful design pattern can be to have a "pilot light" instance and use auto-scaling to bring spot instances on as required, and with a bit of cunning bring on-demand instances on if spot-instances fail to appear.
TL;DR: Spot Instances are suitable for workloads that can pause and resume, but are not mission critical. They can be subject to extraordinary peaks (e.g. N. California m2.2xlarge spot price is usually $0.11/hr but has sustained peaks of $10.00/hr!).
Or is this a cheap way to get an 24/7 instance while Amazon has a bunch of extra capacity?
Spot on, if your bid price always remains above the spot price.
I couldn't find any other explicit mention of when they will terminate your instance.
I would have assumed it would be when they would require that capacity for customers willing to pay full charges for the instance, but then again, the spot price could technically go above the on-demand price.