Every 2 or 3 days I get an "EC2 1/2 checks passed" alarm and my instance becomes unreachable. I can resolve it only by stopping and starting the instance (not by rebooting); basically, the problem shows up under "Instance Status Checks".
My instance type is "t2.micro" and it uses the Amazon Linux AMI.
Any help?
From Status Checks for Your Instances - Amazon Elastic Compute Cloud:
System Status Checks
Monitor the AWS systems on which your instance runs. These checks detect underlying problems with your instance that require AWS involvement to repair.
The following are examples of problems that can cause system status checks to fail:
Loss of network connectivity
Loss of system power
Software issues on the physical host
Hardware issues on the physical host that impact network reachability
Instance Status Checks
Monitor the software and network configuration of your individual instance. The following are examples of problems that can cause instance status checks to fail:
Failed system status checks
Incorrect networking or startup configuration
Exhausted memory
Corrupted file system
Incompatible kernel
Basically, the System Status Check says that something is wrong with the hardware, and the Instance Status Check says that something is wrong with the virtual machine (e.g. kernel problems, networking).
When an instance is stopped and then started again, it is provisioned on different hardware. This will resolve hardware-related issues. Please note that rebooting will not re-provision the virtual machine; instead, the operating system simply restarts. That's why a stop/start is more effective than a reboot.
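If you find yourself doing this regularly, the stop/start can be scripted through the API. A minimal sketch using boto3, where the region and instance ID are placeholders:

```python
# Sketch only: stop/start an instance via the API; region and instance ID are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
instance_id = "i-0123456789abcdef0"  # hypothetical instance ID

# Stop the instance (a reboot would keep it on the same host).
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# Start it again; for an EBS-backed instance this re-provisions the VM,
# usually on different underlying hardware.
ec2.start_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```

Note that the public IP address changes on a stop/start unless you use an Elastic IP.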
It is quite rare to have the instance status check fail after some period of time — it normally has problems at initial startup. This suggests that something is going wrong with your operating system, but that should be corrected by a reboot, which you say isn't sufficient. So, it's a bit of a mystery.
If possible, I would suggest launching a new instance and installing/configuring it as a replacement to the existing instance.
I'm an Azure newbie and need some clarification:
When adding machines to an availability set, in order to prevent the VMs from rebooting at the same time, what's the best strategy: put them in
-different update and fault domains
-the same update domain
-the same fault domain?
My logic is that it's enough to put them in different update AND fault domains.
I used this as a reference: https://blogs.msdn.microsoft.com/plankytronixx/2015/05/01/azure-exam-prep-fault-domains-and-update-domains/
Am I correct?
These update/fault domains are confusing.
My logic is that it's enough to put them in different update AND fault domains
You are right; we should put the VMs in different update and fault domains.
We put them in different update domains because when the Azure hosts need an update, Microsoft engineers update one update domain at a time and move on to the next only once it has completed. This way, our VMs will not all reboot at the same time.
We put them in different fault domains because when unexpected downtime happens, only the VMs in the affected fault domain reboot while the others keep running. This way, the application running on those VMs stays healthy.
In short, adding VMs to an availability set with different update and fault domains gives you a higher SLA, but it does not mean that an individual VM will never reboot.
Hope that helps.
There are three scenarios that can lead to virtual machine in Azure being impacted: unplanned hardware maintenance, unexpected downtime, and planned maintenance.
Unplanned hardware maintenance
Unexpected downtime
Planned maintenance
Each virtual machine in your availability set is assigned an update domain and a fault domain by the underlying Azure platform. For a given availability set, five non-user-configurable update domains are assigned by default (Resource Manager deployments can then be increased to provide up to 20 update domains) to indicate groups of virtual machines and underlying physical hardware that can be rebooted at the same time. When more than five virtual machines are configured within a single availability set, the sixth virtual machine is placed into the same update domain as the first virtual machine, the seventh in the same update domain as the second virtual machine, and so on. The order of update domains being rebooted may not proceed sequentially during planned maintenance, but only one update domain is rebooted at a time. A rebooted update domain is given 30 minutes to recover before maintenance is initiated on a different update domain.
Fault domains define the group of virtual machines that share a common power source and network switch. By default, the virtual machines configured within your availability set are separated across up to three fault domains for Resource Manager deployments (two fault domains for Classic). While placing your virtual machines into an availability set does not protect your application from operating system or application-specific failures, it does limit the impact of potential physical hardware failures, network outages, or power interruptions.
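If you create the availability set programmatically, the update and fault domain counts are properties of the set itself. A minimal sketch using the Azure Python SDK (azure-mgmt-compute), where the subscription ID, resource group, location and names are placeholders:

```python
# Sketch only: create an availability set with explicit update/fault domain counts.
# Subscription ID, resource group, location and names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

client = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

avset = client.availability_sets.create_or_update(
    "my-resource-group",
    "my-availability-set",
    {
        "location": "westeurope",
        "platform_update_domain_count": 5,  # up to 20 for Resource Manager deployments
        "platform_fault_domain_count": 3,   # up to 3 for Resource Manager deployments
        "sku": {"name": "Aligned"},         # use 'Aligned' when the VMs have managed disks
    },
)

# VMs created with this availability set are then spread across the
# update and fault domains automatically.
print(avset.id)
```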
For more details, refer to this documentation.
We run a server architecture where we have an X number of base servers which are always on. Our servers process jobs sent to them and the vast majority of our job requests come in during the workday. To facilitate this particular spike in volume, we use EC2 auto-scaling.
I prefer to launch servers through auto-scaling with as much of a configured AMI as possible as opposed to launching from a base AMI and installing packages through long Chef or Puppet scripts.
In our current build process, we implement changes to our code base late at night when only our base servers are needed and no servers are launched through auto-scaling. But every once in a while, we'll have a critical bug fix that needs to be implemented immediately during the day.
We have a rather large EBS volume attached to our servers (approximately 400 GB), and AMI creation of a base server with the latest changes usually takes upwards of one hour. This isn't a problem for late-night deployments when no additional servers need to be launched, but it costs us valuable time during the day because it prevents us from launching additional servers when the latest AMI isn't ready.
Is there anything out there which can speed up the AMI creation process here? I've heard of Netflix's Aminator and Boxfuse, but are there any other alternatives? Also, how do these services stack up against one another?
We had a few instances of our system on EC2: some of them application servers, some of them Memcached, a database, and so on.
A few weeks after creating an instance, it starts to raise a large number of network-related errors: things like "MEMCACHED TIMEOUT ERROR", "RABBITMQ connection error" and the like. The errors happen only on a single instance, and after creating a copy of that instance the errors go away.
Did anybody have the same problem?
I have experienced this before. I think it has to do with problems with the network stack of the host; at least, that is as much information as I could get from AWS.
If you are using EBS-backed instances, simply stopping and then starting the instance should solve the problem. The instance gets assigned to a new host in that case.
The EC2 dashboard shows a running instance even when the instance is not actually running. I also see an EBS volume in an "in-use" status. I am confused: is the machine running or not?
I have seen that happen when shutting down a Linux instance from within the machine (with shutdown now from the command line).
If the console says that the instance is running even though you shut it down, you should probably shut it down from the console (to avoid being billed).
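You can also check what the API reports and stop the instance that way. A quick sketch with boto3, where the region and instance ID are placeholders:

```python
# Sketch only: check the state the API reports and stop the instance if it is
# still marked as running. Region and instance ID are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
instance_id = "i-0123456789abcdef0"  # hypothetical instance ID

resp = ec2.describe_instances(InstanceIds=[instance_id])
state = resp["Reservations"][0]["Instances"][0]["State"]["Name"]
print("Instance state:", state)

if state == "running":
    # Stopping via the API has the same effect as stopping from the console,
    # so you stop being billed for instance hours.
    ec2.stop_instances(InstanceIds=[instance_id])
```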
Sometimes there are problems with the hardware on the server. The instance is showing as running but you cannot connect and you cannot use any services on that instance. The best thing to do in this situation is post a message on EC2's forums and ask them to look at your instance.
They're usually pretty quick to respond, though they don't make any guarantees. They can force the machine into a stopped state; whether or not they can fix the issue without you losing your data will depend on what is actually wrong with the instance.
This happens from time to time with my instances as well.
We're running a lightweight web app on a single EC2 server instance, which is fine for our needs, but we're wondering about monitoring and restarting it if it goes down.
We have a separate non-Amazon server we'd like to use to monitor the EC2 and start a fresh instance if necessary and shut down the old one. All our user data is on Elastic Storage, so we're not too worried about losing anything.
I was wondering if anyone has any experience of using EC2 in this way, and in particular of automating the process of starting the new instance? We have no problem creating something from scratch, but it seems like it should be a solved problem, so I was wondering if anyone has any tips, links, scripts, tutorials, etc to share.
Thanks.
You should have a look at puppet and its support for AWS. I would also look at the RightScale AWS library as well as this post about starting a server with the RightScale scripts. You may also find this article on web serving with EC2 useful. I have done something similar to this but without the external monitoring, the node monitored itself and shut down when it was no longer needed then a new one would start up later when there was more work to do.
Couple of points:
You MUST MUST MUST back up your Amazon EBS volume.
They claim "better" reliability, but not 100%, and it's SEVERAL orders of magnitude off S3's "11 9's" of durability. S3 durability >> EBS durability. That's a fact. EBS supports a "snapshots" feature which backs up your storage efficiently and incrementally to S3. Also, with EBS snapshots, you only pay for the compressed deltas, which is typically far less than the allocated volume size. In another life, I've sent lost-volume emails to smaller customers like you who "thought" that EBS was "durable" and trusted it with the only copy of a mission-critical database... it's heartbreaking.
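Snapshots are also easy to automate, e.g. from a nightly cron job. A minimal sketch with boto3, where the region and volume ID are placeholders:

```python
# Sketch only: take an incremental EBS snapshot (stored in S3 behind the scenes).
# Region and volume ID are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",            # hypothetical volume ID
    Description="nightly backup of data volume",
)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])
print("Snapshot created:", snapshot["SnapshotId"])
```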
Your Q: automating start-up of a new instance
The design path you mention is relatively untraveled; here's why... Lots of companies run redundant "hot-spare" instances where the second instance is booted and running. This allows rapid failover (seconds) in the event of "failure" (could be hardware or software). The issue with a "cold-spare" is that it's harder to keep the machine up to date and ready to pick up where the old box left off. More important, it's tricky to VALIDATE that the spare is capable of successfully recovering your production service. Hardware is more reliable than untested software systems. TEST TEST TEST. If you haven't tested your fail-over, it doesn't work.
The simple automation of starting a new EBS instance is easy, bordering on trivial. It's just a one-line bash script calling the EC2 command-line tools. What's tricky is everything on top of that. Such a solution pretty much implies a fully 100% automated deployment process. And this is all specific to your application. Can your app pull down all the data it needs to run (maybe it's stored in S3)? Can you kill your instance today and boot a new instance with 0.000 manual setup/install steps?
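To illustrate how small that first step is, the launch itself is a single API call. A sketch using boto3 rather than the old command-line tools, with the AMI ID, key pair and security group as placeholders; everything after this call is your own deployment automation:

```python
# Sketch only: launch a replacement instance from a pre-baked AMI.
# AMI ID, key pair and security group are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",            # hypothetical, fully configured AMI
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
    KeyName="my-keypair",                       # hypothetical key pair
    SecurityGroupIds=["sg-0123456789abcdef0"],  # hypothetical security group
)
print("Launched:", resp["Instances"][0]["InstanceId"])
```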
Or, you may be talking about a scenario I'll call "re-instancing an EBS volume":
EC2 box dies (root volume is EBS)
Force detach EBS volume
Boot new EC2 instance with the EBS volume
... That mostly works. The gotchas:
Doesn't protect against EBS failures, either total volume loss or an availability loss
Recovery time is O(minutes) assuming everything works just right
Your services need to be configured to restart automatically. It does no good to bring the box back if Nginx isn't running.
Your DNS routes or other services or whatever need to be ok with the IP-address changing. This can be worked around with ElasticIP.
How are your host SSH keys handled? Same name, new host key can break SSH-based automation when it gets the strong-warning for host-key-changed.
I don't have proof of this (other than seeing it happen once), but I believe that EC2/EBS already does this automatically for boot-from-EBS instances.
Again, the hard part here is on your plate. Can you stop your production service today and bring it up RELIABLY on a new instance? If so, the EC2 part of the story is really really easy.
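For what it's worth, the detach/re-attach steps listed above are only a few API calls. A sketch with boto3, where the volume ID, instance ID and device name are placeholders and the replacement instance is assumed to already exist and be stopped:

```python
# Sketch only: force-detach the EBS root volume from the dead instance and
# attach it to a stopped replacement instance. IDs and device name are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
volume_id = "vol-0123456789abcdef0"      # EBS root volume of the dead box (hypothetical)
new_instance_id = "i-0fedcba9876543210"  # stopped replacement instance (hypothetical)

# Force-detach from the dead instance and wait until the volume is free.
ec2.detach_volume(VolumeId=volume_id, Force=True)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

# Attach as the root device of the replacement, then start it.
ec2.attach_volume(VolumeId=volume_id, InstanceId=new_instance_id, Device="/dev/xvda")
ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])
ec2.start_instances(InstanceIds=[new_instance_id])
```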
As a side point:
All our user data is on Elastic Storage, so we're not too worried about losing anything.
I'd strongly suggest regularly snapshotting your EBS (Elastic Block Store) volumes to S3 if you are not doing that already.
You can use an autoscale group with a min/max/desired quantity of 1. Place the instance behind an ELB and have the autoscale group be triggered by the ELB healthy node count. This gives you built-in monitoring via CloudWatch and the ELB health check. Anytime there is an issue, the instance will be replaced by the autoscale service.
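Roughly, that setup looks like the following sketch with boto3, using a classic ELB and a launch configuration; all names and IDs are placeholders:

```python
# Sketch only: a self-healing single instance via an Auto Scaling group of size 1
# with ELB health checks. Names, AMI ID and availability zone are placeholders.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_launch_configuration(
    LaunchConfigurationName="webapp-lc",
    ImageId="ami-0123456789abcdef0",   # hypothetical, pre-configured AMI
    InstanceType="t2.micro",
)

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="webapp-asg",
    LaunchConfigurationName="webapp-lc",
    MinSize=1,
    MaxSize=1,
    DesiredCapacity=1,
    AvailabilityZones=["us-east-1a"],
    LoadBalancerNames=["webapp-elb"],  # the ELB the instance sits behind (hypothetical)
    HealthCheckType="ELB",             # replace the instance when the ELB check fails
    HealthCheckGracePeriod=300,
)
```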
If you have not checked 'Protect against accidental termination' you might want to do so.
Even if you have disabled 'Detailed Monitoring' for your instance, you should still see the 'StatusCheckFailed' metric, over which you can configure an alarm (in the CloudWatch dashboard).
Your application (hosted in a different server) should receive the alarm and start the instance using the AWS API (or CLI)
Since you have protected against accidental termination, you would never need to spawn a new instance.
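Put together, the alarm and the restart call might look like the following sketch with boto3; the SNS topic, region and instance ID are placeholders:

```python
# Sketch only: alarm on StatusCheckFailed plus the call the external watcher
# makes to start the instance. SNS topic ARN, region and instance ID are placeholders.
import boto3

region = "us-east-1"
instance_id = "i-0123456789abcdef0"  # hypothetical instance ID

cloudwatch = boto3.client("cloudwatch", region_name=region)
cloudwatch.put_metric_alarm(
    AlarmName="webapp-status-check-failed",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed",  # available even without detailed monitoring
    Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:webapp-alerts"],  # hypothetical SNS topic
)

# On the monitoring server, after receiving the alarm notification:
ec2 = boto3.client("ec2", region_name=region)
ec2.start_instances(InstanceIds=[instance_id])
```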