How to replace ECS cluster instances without downtime or reduced redundancy? - amazon-ec2

I currently have a try-out environment with ~16 services divided over 4 micro-instances. The instances are managed by an autoscaling group (ASG). When I need to update the AMI of my cluster instances, I currently do the following:
Create new launch config, edit ASG with new launch config.
Detach all instances from the ASG with the replacement option and wait until the new ones are listed in the cluster instance list.
MANUALLY find and deregister the old instances from the ECS cluster (very tricky)
Now the services are killed by ECS due to deregistering the instances :(
Wait 3 minutes until the services are restarted on the new instances
MANUALLY find the EC2 instances in the EC2 instance list and terminate them (be very very careful not to terminate the new ones).
With this approach I have about 3 minutes of downtime, and I shiver at the idea of doing this in production environments. Is there a way to do this without downtime but keeping the overall number of instances the same (so without 200% scaling settings etc.)?

You can update the Launch Configuration with the new AMI and then assign it to the ASG. Make sure to include the following in the user-data section:
echo ECS_CLUSTER=your_cluster_name >> /etc/ecs/ecs.config
Then terminate one instance at a time, and wait until the new one is up and automatically registered before terminating the next.
This could also be scripted and automated.
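A rough sketch of that loop with the AWS CLI, assuming the launch configuration has already been updated; the group and cluster names are placeholders, and the caller needs credentials that allow these calls:

#!/bin/bash
# Sketch only: replace one cluster instance at a time and wait for its
# replacement to register before moving on. Names below are assumptions.
ASG="my-ecs-asg"          # assumed Auto Scaling group name
CLUSTER="my-ecs-cluster"  # assumed ECS cluster name

# Container instances currently registered (still running the old AMI).
OLD_ARNS=$(aws ecs list-container-instances --cluster "$CLUSTER" \
  --query 'containerInstanceArns[]' --output text)
EXPECTED=$(echo "$OLD_ARNS" | wc -w)

for ARN in $OLD_ARNS; do
  # Look up the EC2 instance id behind this container instance.
  EC2_ID=$(aws ecs describe-container-instances --cluster "$CLUSTER" \
    --container-instances "$ARN" \
    --query 'containerInstances[0].ec2InstanceId' --output text)

  # Deregister it so ECS reschedules its tasks onto the other instances.
  aws ecs deregister-container-instance --cluster "$CLUSTER" \
    --container-instance "$ARN" --force

  # Terminate without lowering desired capacity, so the ASG launches a
  # replacement from the new launch configuration.
  aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id "$EC2_ID" --no-should-decrement-desired-capacity

  # Wait until the replacement has registered with the cluster.
  until [ "$(aws ecs describe-clusters --clusters "$CLUSTER" \
      --query 'clusters[0].registeredContainerInstancesCount' \
      --output text)" -ge "$EXPECTED" ]; do
    sleep 15
  done
done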

Related

If an AWS spot instance is stopped by AWS and then restarts, will it just start where it left off?

I am running luigi, a pipeline manager which is processing 1000 tasks. Currently I poll for the AWS termination notice. If it is present, then I requeue the job; wait 30 minutes; then launch a new server starting all the tasks from scratch. However, sometimes it restarts the same job multiple times, which is inefficient.
Instead I am considering using create_fleet with InstanceInterruptionBehaviour=Stop. If I do this, then when it restarts will it still be running the luigi daemon and retain the state of all the tasks?
All InstanceInterruptionBehaviour=Stop does is effectively shut down your EC2 instance rather than terminate it. Since the "persistent" request setting is required in addition to EBS storage, you will keep all the data currently on the attached EBS volumes at the time of the instance stop.
It is completely dependent on the application itself (Luigi in this case) to be able to store the state of its execution and pick back up from where it left off. For one, you'll want to ensure you enable the service daemon to automatically start upon system start (example):
sudo systemctl enable yourservice
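For Luigi specifically, that could be a small systemd unit installed from user-data or baked into the AMI. A minimal sketch, assuming the scheduler is installed as /usr/local/bin/luigid (the path and unit name are assumptions to adjust to your setup):

# Sketch: create and enable a unit so luigid comes back after a stop/start.
sudo tee /etc/systemd/system/luigid.service > /dev/null <<'EOF'
[Unit]
Description=Luigi central scheduler
After=network.target

[Service]
ExecStart=/usr/local/bin/luigid
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now luigid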

How to create a bash script for autoscaling EC2 instances given the work volume of a SQS?

I created a bash script with the aws-cli that sends 1000 messages using SQS. Now I want to create another one that runs in parallel and creates and destroys EC2 instances given this condition:
Check every 15 seconds: if (((ApproximateNumberOfMessages + 9)/10) - number of running instances) > 0, create an instance; otherwise destroy an instance.
My first problem is that I don't know how to connect my SQS queue to an EC2 instance so it can process these messages. I tried following this tutorial: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-sending-messages-from-vpc.html, but I don't want to use a private VPC and security groups, so I was wondering if there is a way to make it easier.
My questions are: Is it possible to do it just using a bash script instead of CloudWatch and Autoscaling Groups? How do I create an EC2 instance that is ready to process these messages?
When you create an EC2 instance, it automatically gets an Elastic Network Interface (ENI, a virtual network card), to which AWS assigns either the default security group or another, user-created security group. You cannot detach the default ENI, and you cannot have an ENI without a security group. Moreover, EC2 instances have to run inside a VPC, which can be private or public. So if you work with EC2 instances, you have to deal with security groups as well.
Is it possible to do it just using a bash script instead of CloudWatch and Autoscaling Groups?
It might be possible, but you will find yourself reinventing the wheel. Autoscaling does more than just adding/removing instances based on some condition. For example, it also makes sure that your instances are replaced if they become unhealthy or if they are terminated for some reason. For more info see AWS ASG FAQ.
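For illustration, a bare-bones version of such a loop could look like the sketch below; the queue URL, AMI ID, and the sqs-worker tag are placeholders, and it handles none of the health checks or failure cases an ASG handles:

#!/bin/bash
# Sketch only: scale a fleet of tagged worker instances against queue depth.
QUEUE_URL="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # assumption
AMI_ID="ami-0123456789abcdef0"                                         # assumption

while true; do
  MSGS=$(aws sqs get-queue-attributes --queue-url "$QUEUE_URL" \
    --attribute-names ApproximateNumberOfMessages \
    --query 'Attributes.ApproximateNumberOfMessages' --output text)

  RUNNING=$(aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=sqs-worker" "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].InstanceId' --output text)
  COUNT=$(echo "$RUNNING" | wc -w)

  WANTED=$(( (MSGS + 9) / 10 ))   # one instance per 10 queued messages

  if [ "$WANTED" -gt "$COUNT" ]; then
    aws ec2 run-instances --image-id "$AMI_ID" --instance-type t2.micro \
      --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=sqs-worker}]'
  elif [ "$WANTED" -lt "$COUNT" ] && [ "$COUNT" -gt 0 ]; then
    aws ec2 terminate-instances --instance-ids $(echo "$RUNNING" | awk '{print $1}')
  fi

  sleep 15
done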
How do I create an EC2 instance that is ready to process these messages?
You can't just start an instance and expect it to process your messages. You need to have some code or some kind of software deployed to it and configured to poll messages from your queue.
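As a minimal illustration, the instance's user-data could start a polling loop like the one below; the queue URL is a placeholder, the echo stands in for your real processing code, and the instance is assumed to have an IAM role that allows SQS access:

#!/bin/bash
# Sketch only: receive, "process", and delete one message at a time.
QUEUE_URL="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # assumption

while true; do
  OUT=$(aws sqs receive-message --queue-url "$QUEUE_URL" \
    --max-number-of-messages 1 --wait-time-seconds 20 \
    --query 'Messages[0].[ReceiptHandle,Body]' --output text)

  if [ -n "$OUT" ] && [ "$OUT" != "None" ]; then
    RECEIPT=$(echo "$OUT" | cut -f1)
    BODY=$(echo "$OUT" | cut -f2-)

    echo "processing: $BODY"   # replace with real work

    aws sqs delete-message --queue-url "$QUEUE_URL" --receipt-handle "$RECEIPT"
  fi
done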

Lifecycle of an EC2 Container Service Instance

In my project I have a constraint where all of the traffic received will go to a certain IP. The Elastic IP feature works well for this.
My question is, considering we are using Amazon's docker service (ECS) without autoscaling (so instances/tasks will be scaled manually), can we treat the instances created by the ECS service as we would treat normal, on-demand instances? As in they won't be terminated/stopped unless explicitly done by a user (or API call or whatever).
As is mentioned in the Scaling a Cluster documentation, if you created your cluster after November 24th, 2015 using either the first-run wizard or the Create Cluster wizard, then an Auto Scaling group would have been created to manage the instances backing your cluster.
For the most part, the answer to your question is yes: the instances wouldn't normally get replaced. It is also important to note that, because the cluster is backed by an Auto Scaling group, AutoScaling might replace unhealthy instances for you. If an instance fails its EC2 health checks for some reason, it will be marked as unhealthy and scheduled for replacement.
By default, my understanding is that there are no CloudWatch alarms or scaling policies affecting this Auto Scaling group, so it should only be when an instance becomes unhealthy that it would get replaced.
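If you want to verify that for your own group, a couple of aws-cli calls can show it; the group name below is a placeholder (the wizard generates its own name per cluster):

# Show the group's health-check settings, suspended processes, and any policies.
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-ecs-cluster-asg \
  --query 'AutoScalingGroups[0].{HealthCheck:HealthCheckType,Suspended:SuspendedProcesses}'
aws autoscaling describe-policies --auto-scaling-group-name my-ecs-cluster-asg

# If you explicitly do not want unhealthy instances replaced automatically:
aws autoscaling suspend-processes --auto-scaling-group-name my-ecs-cluster-asg \
  --scaling-processes ReplaceUnhealthy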

Change Instance type of a cluster registered ec2 instance

I have an Amazon EC2 instance which is registered to a cluster of Amazon ECS.
And I want to change this instance's type from c4.large to c4.8xlarge.
I'm able to change its type from c4.large to c4.8xlarge in the AWS console. But after the change, I found
[ERROR] Could not register module="api client" err="ClientException: Container instance type changes are not supported. Container instance XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX was previously registered as c4.large.
being printed in /var/log/ecs/ecs-agent.log.20XX-XX-XX-XX file.
Is it possible to change the EC2 instance type and re-register it to the cluster?
I think maybe deregistering it first and then registering it again should work. But I'm afraid this may cause something irreversible in my AWS working environment, so I haven't tried this method yet.
To solve this connection problem between the agent and cluster, just delete the file /var/lib/ecs/data/ecs_agent_data.json and restart docker and ECS.
After that, a new container instance will be created in your cluster with the new size.
sudo rm /var/lib/ecs/data/ecs_agent_data.json
sudo service docker restart
sudo start ecs
Then you can go to the ECS cluster console and deregister the old container instance
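Deregistering the old container instance can also be done from the CLI; the cluster name and container instance ARN below are placeholders:

aws ecs deregister-container-instance --cluster your_cluster_name \
  --container-instance arn:aws:ecs:region:account:container-instance/xxxxxxxx --force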
UPDATE:
According to #florins and #MBear in the comments below, AWS has updated the data file on ECS instances:
sudo rm /var/lib/ecs/data/agent.db
sudo service docker restart
sudo start ecs
As of March 2021 / AMI image ami-0db98e57137013b2d, the /var/lib/ecs/data/ecs_agent_data.json file mentioned in the answer above no longer exists. For me, the commands to execute on the changed instance were:
sudo rm /var/lib/ecs/data/agent.db
sudo service docker restart
After that, it was possible to deploy containers to the instance without a fresh registration (AWS automatically registered a second ECS container instance of the new type). I did have a leftover container instance with the resources of the old instance type to remove.
You can't do this. Per their docs:
The type of EC2 instance that you choose for your container instances determines the resources available in your cluster. Amazon EC2 provides different instance types, each with different CPU, memory, storage, and networking capacity that you can use to run your tasks. For more information, see Amazon EC2 Instances.
This means that when you launch a container on an instance, the agent gathers a bunch of metadata about the instance in order to run it. If you change the instance type, much of that metadata (CPU units, memory, etc.) no longer matches what was registered. The agent is aware of this and will report it as an error.
You should spin up a new instance of the new type, register it to the cluster, and let the task run on it. If it's a service, just terminate the old instance and let the service schedule the task onto the new one.
I can't think of any real reason why terminating your old instance would cause something irreversible unless it is misconfigured or fragile due to user-specific settings; by default this would not cause anything destructive.
As an alternative approach, if the EC2 instance does not store anything valuable, a new instance can be started using the old instance as a template. This carries over all existing settings and can be done with just a few clicks in minutes.
Select the EC2 instance and then "Actions -> Images and templates -> Start more like this". Just change the instance type.
When the instance is running, go to the ECS cluster's "ECS instances" tab and activate the newly created instance.
Shut down the old instance.
Update your task definition (perhaps to use more CPU and memory) and update the service to use the new task revision.
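The last step can also be done with the CLI; the file, cluster, service, and task names below are placeholders:

# Register a new task definition revision, then point the service at it;
# ECS will then replace the running tasks.
aws ecs register-task-definition --cli-input-json file://my-task.json
aws ecs update-service --cluster my-cluster --service my-service \
  --task-definition my-task:2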

Error in autoscaling in ec2?

I am using the autoscaling feature.
I set up the entire thing, but instances were automatically launched and terminated even though they did not reach the threshold.
I followed the steps:
Created an instance
Created a load balancer and registered an instance
Created a launch configuration
Created a CloudWatch alarm for CPU >= 50%
Created an autoscaling policy that launches and terminates instances when CPU >= 50%
But as soon as I apply the policy, instances begin to launch and terminate without any CPU load, and it continues:
Cause: At 2014-01-14T10:51:08Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 1.
StartTime: 2014-01-14T10:51:08.791Z
Cause: At 2014-01-14T10:02:16Z an instance was taken out of service in response to a system health-check.
UPDATE
Documentation:
Follow the instructions on how to Set Up an Auto-Scaled and Load-Balanced Application
Notes:
An instance created outside of the AutoScaling group can be added to the Elastic Load Balancer, but it will not be monitored or managed by the AutoScaling group.
An instance created outside of the AutoScaling group can be marked as unhealthy by the Elastic Load Balancer if its health check fails, but this will not cause the AutoScaling group to spawn a new instance.
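To see why the group keeps cycling instances, and to give instances more time to pass the load balancer's health check, something like the following can help; the group name is a placeholder:

# Show the recent scaling activities and their Cause messages (like those quoted above).
aws autoscaling describe-scaling-activities --auto-scaling-group-name my-asg --max-items 10

# If instances are failing the health check before the application is ready,
# a longer grace period gives them time to come into service.
aws autoscaling update-auto-scaling-group --auto-scaling-group-name my-asg \
  --health-check-grace-period 300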
