I want to stop EC2 instances after office hours to save costs. How can I do the same with ECS instances? Even if I stop all tasks/services, the instance is still there. Do I stop the EC2 instance directly?
From EC2 Management Console:
1. Click Auto Scaling Groups from the left menu.
2. Select the group from the list.
3. Click Edit on the Details tab.
4. Set the Desired capacity to '0'.
5. Click Save, and you are done.
The Auto Scaling Group is smart enough to shut down all instances.
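If you want to script the same thing, a minimal boto3 sketch could look like the following (the group name is a placeholder):

import boto3

asg = boto3.client('autoscaling')

# Scale the group to zero after office hours; the ASG terminates the instances.
# 'my-ecs-asg' is a placeholder for your Auto Scaling Group's name.
asg.update_auto_scaling_group(
    AutoScalingGroupName='my-ecs-asg',
    MinSize=0,
    DesiredCapacity=0,
)

Setting MinSize along with DesiredCapacity matters: if the minimum stays above zero, the group will immediately launch replacement instances.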
You can use the "Scheduled Actions" feature of Auto Scaling Groups.
It starts similarly to Kerem Baydoğan's answer.
From EC2 Management Console:
1. Click Auto Scaling Groups from the left menu.
2. Select the group from the list.
3. Select "Scheduled Actions" from the bar that appears in the lower middle of the screen.
4. Click on Create Scheduled Action.
5. Fill in the fields as you see fit, and notice that under Recurrence there is also a cron option for extra flexibility (a sketch of the equivalent API calls follows).
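If you prefer to set the scheduled actions up programmatically, here is a hedged boto3 sketch (the group name, action names, times, and sizes are all placeholders):

import boto3

asg = boto3.client('autoscaling')

# Scale down to zero at 19:00 UTC on weekdays ('my-ecs-asg' is a placeholder).
asg.put_scheduled_update_group_action(
    AutoScalingGroupName='my-ecs-asg',
    ScheduledActionName='scale-down-after-hours',
    Recurrence='0 19 * * 1-5',   # cron: 19:00, Monday to Friday
    MinSize=0,
    DesiredCapacity=0,
)

# Scale back up at 07:00 UTC on weekdays.
asg.put_scheduled_update_group_action(
    AutoScalingGroupName='my-ecs-asg',
    ScheduledActionName='scale-up-office-hours',
    Recurrence='0 7 * * 1-5',
    MinSize=1,
    DesiredCapacity=1,
)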
If the cluster is set to a minimum number of nodes with an ASG, turning off the nodes will just make the ASG start new ones to bring the group back up to its minimum number of nodes. You must set the ASG to zero nodes first, then turn off the current nodes.
Yes, just stop the EC2 instance directly. When you start the instance again during office hours, the ECS agent will make the services start according to their desired value.
We are doing the same thing and it works for us.
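If you go that route and want to automate it, a minimal boto3 sketch (the instance ID is a placeholder):

import boto3

ec2 = boto3.client('ec2')

# 'i-0123456789abcdef0' is a placeholder for the container instance's ID.
ec2.stop_instances(InstanceIds=['i-0123456789abcdef0'])   # after office hours

# Next morning: the ECS agent re-registers and the services
# scale back to their desired counts on their own.
ec2.start_instances(InstanceIds=['i-0123456789abcdef0'])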
I have one Spring scheduler, which I will be deploying in 2 different data centers.
My data centers will be in active and passive mode. I am looking for a mechanism where the passive data center's scheduler starts working when that data center becomes active.
We can do it by manually changing some configuration to true/false, but I am looking for an automated process.
Initial state:
Data center A active - Scheduler M is running.
Data center B passive - Scheduler M is turned off.
Maybe after 3 days:
Data center A passive - Scheduler M is turned off.
Data center B active - Scheduler M is starting.
I don't know your business requirements, but unless you want multiple instances running with only one active, the purpose of having a load balancer would be to spread the load across multiple instances of the same application rather than to stick with only one instance.
Anyway, I think an easy way of doing this without a very sophisticated mechanism (which would come with a lot of complexity depending on where you run your application) would be the following (a minimal sketch in Python follows the steps):
1. Have a shared location, such as a semaphore table in your database, storing the ID of the application instance owning the scheduler process.
2. Have a timeout set for each task. Say if the scheduler is supposed to run every two minutes, set the timeout to two minutes.
3. Have your schedulers always kick off on all application instances.
4. Once a task kicks off, first check whether this instance is the one owning the processing. If yes, do the work; if not, go to step 7.
5. After doing the work, record the timestamp of the task completion in the semaphore table.
6. Wait for the time to pass until the next kick-off.
7. If not the one owning the processing, check when the task last ran in the semaphore table. If the time since the last run is greater than the timeout set for that process, take ownership of the process (recording your application instance ID in the semaphore table).
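A minimal sketch of steps 4 to 7 in Python, assuming a hypothetical semaphore table with columns owner_id and last_run (sqlite3 stands in for your real database; the table layout, task name 'M', and the two-minute timeout are all assumptions):

import sqlite3
import time
import uuid

INSTANCE_ID = str(uuid.uuid4())   # this application instance's ID
TIMEOUT = 120                     # seconds; matches the scheduler interval

db = sqlite3.connect('semaphore.db')
db.execute("CREATE TABLE IF NOT EXISTS semaphore"
           " (task TEXT PRIMARY KEY, owner_id TEXT, last_run REAL)")
db.execute("INSERT OR IGNORE INTO semaphore VALUES ('M', ?, 0)", (INSTANCE_ID,))
db.commit()

def do_work():
    print('processing...')          # your actual task

def on_schedule():                  # called by every instance's scheduler
    owner_id, last_run = db.execute(
        "SELECT owner_id, last_run FROM semaphore WHERE task = 'M'").fetchone()
    if owner_id == INSTANCE_ID:                 # step 4: we own the processing
        do_work()
        db.execute("UPDATE semaphore SET last_run = ? WHERE task = 'M'",
                   (time.time(),))              # step 5
        db.commit()
    elif time.time() - last_run > TIMEOUT:      # step 7: the owner looks dead
        # NOTE: a real implementation needs an atomic compare-and-set here to
        # avoid the ownership battle described below.
        db.execute("UPDATE semaphore SET owner_id = ?, last_run = ? WHERE task = 'M'",
                   (INSTANCE_ID, time.time()))
        db.commit()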
We applied this and it ran very well with one of our applications. In reality it was much more complex than explained above, as we had a lot of application instances and we had to avoid starting an ownership battle between them. To address this we put in place a "permission to process" concept, so no matter how many instances wanted to take control, only one was granted it.
For another application with similar requirements we used a much, much easier way to achieve this, but the price we paid was some extra learning curve in using ILock from the Hazelcast IMDG framework. That is really very easy, but keep in mind that the Hazelcast community edition comes with absolutely no security, and paying for a Hazelcast license just to achieve this may be a bit of an expense.
Again, it all depends on your use case. For us the semaphore table was good enough in the first scenario but proved bad in the second one, as the multiple processes trying to update the same table at the same time ended up causing a lot of database contention, which took us to Hazelcast.
Another idea would be a custom health check implementation that could trigger activating one scheduler or the other depending on the response received.
Hope that helps, just ideas from our experience. Good luck.
I have a Scylla cluster of 16 nodes running on AWS i3en.xlarge instances.
Is there an easy way for me to switch the cluster to i3en.2xlarge or i3en.4xlarge other than replacing existing node one by one (e.g. add a new node and remove a node)?
If I add one i3en.2xlarge instance, will the cluster auto-balance the data so that the i3en.2xlarge uses roughly twice the disk space of an i3en.xlarge?
You can add a logical DC with the new nodes, run a repair, and then get rid of the original DC.
1. Add a new DC with the desired instance type (see the procedure #TzachLivyatan posted in his comment).
2. Wait for streaming to the new DC to complete.
3. Run a full cluster repair and wait for it to complete.
4. Decommission the "original" DC (see the sketch after this list):
https://docs.scylladb.com/operating-scylla/procedures/cluster-management/decommissioning_data_center/
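Before the decommission step, the linked procedure has you remove the original DC from each keyspace's replication so no data remains pinned to it. A hedged sketch using the Python cassandra-driver, which also works with Scylla (the contact point, keyspace, DC name, and replication factor are all assumptions):

from cassandra.cluster import Cluster

cluster = Cluster(['10.0.0.1'])   # placeholder contact point in the new DC
session = cluster.connect()

# Keep only the new DC in the replication map ('my_keyspace', 'new_dc',
# and RF 3 are placeholders for your actual setup).
session.execute("""
    ALTER KEYSPACE my_keyspace
    WITH replication = {'class': 'NetworkTopologyStrategy', 'new_dc': 3}
""")

# Then run `nodetool decommission` on each node of the original DC,
# one node at a time, as described in the linked procedure.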
To reduce costs I would like to stop and start a container instance in a cluster in between tasks. The tasks only run every now and again, so it doesn't seem efficient to keep an EC2 instance running in between.
What is the best way to allow this?
I have looked into Lambda functions triggered by a CloudWatch scheduler and have also thought about auto scaling.
Amazon doesn't make this incredibly straightforward (though they're trying to with Fargate). The best solution for now (if you're not in a region where Fargate is an option) is to try to keep your desired task count in line with your desired instance count on an Auto Scaling group.
The way we have it set up is through Lambda, triggering based on Auto Scaling events (pretty easy to set up). The least trivial part about this is the Lambda script, though it's not incredibly difficult. Add tags to your ASG that help identify which cluster / service it's associated with. When a scaling event triggers your script, just have your script describe the ASG that triggered it, look for the cluster / service in the tags, and update the desired count of that service:
import boto3

client_asg = boto3.client('autoscaling')
client_ecs = boto3.client('ecs')

# asgName comes from the Auto Scaling event that triggered the Lambda
paginator_describe_asg = client_asg.get_paginator('describe_auto_scaling_groups')
asgDetail = paginator_describe_asg.paginate(
    AutoScalingGroupNames=[
        asgName,
    ]
)
# set service desired count equal to ASG desired capacity
newDesiredCount = next(iter(asgDetail))['AutoScalingGroups'][0]['DesiredCapacity']
response = client_ecs.update_service(
    cluster=<ecs cluster>,       # looked up from the ASG's tags
    desiredCount=newDesiredCount,
    service=<ecs service>        # looked up from the ASG's tags
)
The reason you shouldn't rely on CloudWatch for this is that it doesn't do a great job at granular scaling. What I mean is, the CPU that CloudWatch monitors on your ASG is the overall group average (I think). So the scenario we ran into was as follows:
CloudWatch detects hosts are at 90%, desired is 70%
CloudWatch launches 4 hosts
Service detects tasks are at 85%, desired is 70%
Service launches new task
Service detects tasks are at 80%, desired is 70%
Service launches new task
Service detects tasks are at 75%, desired is 70%
Service launches new task
Service detects tasks are at 70%, no action
While this is a trivial example, it's easy to see how the number of instances gets out of sync with the number of tasks actually running (i.e., you may end up with a host sitting idle because ECS doesn't detect that it needs more capacity).
Could we just scale up 3 hosts? Sure, but ECS might still only place 2 tasks (depending on how the usage is per task). Could we scale one host at a time? Sure, but then it's pretty difficult to account for bursts.
All that to say, the best solution I can recommend for now is to have a Lambda script help keep your ASG instance count == your ECS service desired task count.
I have decided to create a Lambda function that starts the instance, and on container instance start a task is run. Then I have a CloudWatch event watching for the task changing status to STOPPED, which triggers another Lambda that stops the instance.
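For reference, a hedged sketch of the stop side: a Lambda fed by a CloudWatch (EventBridge) rule on ECS task state changes, using the documented fields of that event (everything else here is a placeholder):

import boto3

ecs = boto3.client('ecs')
ec2 = boto3.client('ec2')

def handler(event, context):
    # The ECS task state change event carries the task's details.
    detail = event['detail']
    if detail['lastStatus'] != 'STOPPED':
        return

    # Map the container instance back to its EC2 instance and stop it.
    ci = ecs.describe_container_instances(
        cluster=detail['clusterArn'],
        containerInstances=[detail['containerInstanceArn']],
    )['containerInstances'][0]
    ec2.stop_instances(InstanceIds=[ci['ec2InstanceId']])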
I have an EC2 instance that I want to scale based on the number of messages in an SQS queue. If there are many messages (for 5 minutes) I want to spin up a new EC2 instance to consume the messages faster. Then if the messages are few (for 5 minutes), I want to terminate the oldest EC2 instance. This way, if the service that consumes the messages stops for some reason, I will terminate the old EC2 instance and the service will keep running.
I have created an Auto Scaling group for this. I have set the TerminationPolicy to OldestInstance, but it works as I expect only if I use just one availability zone (e.g. eu-west-1a): it creates a new instance and terminates the oldest each time. But if I have 3 availability zones (eu-west-1a, eu-west-1b, eu-west-1c), it just launches and terminates the instances not in the OldestInstance manner. Or, at least, not as I expect: terminating the oldest every time. Is there something linked to different zones? On this page I have not found anything about it, except for the default policy.
And even if the multiple-zone behavior described for the default policy applies, I can have at most only 2 instances running at the same time. And they are always launched in a new zone.
This is probably the key paragraph:
When you customize the termination policy, Auto Scaling first assesses the Availability Zones for any imbalance. If an Availability Zone has more instances than the other Availability Zones that are used by the group, then Auto Scaling applies your specified termination policy on the instances from the imbalanced Availability Zone. If the Availability Zones used by the group are balanced, then Auto Scaling selects an Availability Zone at random and applies the termination policy that you specified.
I interpret this to mean that if you have instances in multiple zones, and those zones are already balanced, then AWS will select a zone at random AND THEN pick the oldest instance within the randomly selected zone. It won't pick the oldest instance across AZs; it picks a random AZ and then terminates the oldest instance within that AZ.
So I've been using Boto in Python to try and configure autoscaling based on CPUUtilization, more or less exactly as specified in this example:
http://boto.readthedocs.org/en/latest/autoscale_tut.html
However both alarms in CloudWatch just report:
State Details: State changed to 'INSUFFICIENT_DATA' at 2012/11/12 16:30 UTC. Reason: Unchecked: Initial alarm creation
Auto scaling is working fine but the alarms aren't picking up any CPUUtilization data at all. Any ideas for things I can try?
Edit: The instance itself reports CPU utilisation data, just not when I try to create an alarm in CloudWatch, programmatically in Python or in the interface. Detailed monitoring is also enabled, just in case...
Thanks!
The official answer from AWS goes like this:
Hi, there is an inherent delay in transitioning into INSUFFICIENT_DATA state (only) as alarms wait for a period of time to compensate for metric generation latency. For an alarm with a 60 second period, the delay before transition into I_D state will be between 5 and 10 minutes.
John.
Apparently this is a temporary state and will likely resolve itself.
I am not sure what's going on in the backend, but if you compare the alarm history you will see AWS removes the 'unit' column if you just modify the alarm without any change, as at7000ft said. So remove the unit column from your script.
Make sure that the alarm's Namespace is 'AWS/EC2'.
I know this is a long time after the original question, but in case others find this via Google: I had the same problem, and it turned out I had set the alarm's Namespace improperly.
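For anyone scripting this, a hedged boto3 sketch of an alarm with the namespace set explicitly (the alarm name, dimensions, threshold, and policy ARN are placeholders):

import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='scale-up-on-cpu',     # placeholder name
    Namespace='AWS/EC2',             # must match the metric's actual namespace
    MetricName='CPUUtilization',
    Dimensions=[{'Name': 'AutoScalingGroupName', 'Value': 'my-asg'}],
    Statistic='Average',
    Period=300,                      # basic monitoring publishes every 5 minutes
    EvaluationPeriods=1,
    Threshold=70.0,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:autoscaling:...'],   # placeholder scaling policy ARN
)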
You need to publish data with the same unit that was used to create the alarm. If you didn't specify one, it will be the <None> unit.
The unit can be specified in aws cloudwatch put-metric-data and aws cloudwatch put-metric-alarm with --unit <value>.
Unit <value> can be:
Seconds
Bytes
Bits
Percent
Count
Bytes/Second (bytes per second)
Bits/Second (bits per second)
Count/Second (counts per second)
None (default when no unit is specified)
Units are also case-sensitive, so be careful about that in your scripts.
For CPUUtilization, you can use Percent.
After the first data-set is sent to your alarm (it can take up to 5 minutes for a non-detailed monitored instance), the alarm will switch to the OK or ALARM state instead of the INSUFFICIENT_DATA one.
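A hedged boto3 sketch of publishing a datapoint whose unit matches the alarm (the namespace, metric name, and value are placeholders; 'Percent' fits CPUUtilization):

import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_data(
    Namespace='MyApp',                 # placeholder custom namespace
    MetricData=[{
        'MetricName': 'CPUUtilization',
        'Value': 42.0,                 # placeholder value
        'Unit': 'Percent',             # must match the unit on the alarm
    }],
)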
I am having the same INSUFFICIENT_DATA alarm state show up in CloudWatch for an RDS CPUUtilization > 60 alarm created with CloudFormation ("Reason: Unchecked: Initial alarm creation" shows up under details). This is a very crude fix, but I found that by selecting the alarm, clicking the Modify button, and then the Save button (without changing anything), the alarm goes to the OK state and everything is fine.
I had this problem. Make sure the metric name you use to create the alarm matches the actual metric name.
You can list your metrics with:
aws cloudwatch list-metrics --namespace=<NAMESPACE, e.g. System/Linux, etc>
Find the metric and the MetricName. Make sure your alarm is configured for that metric.
As far as I know, the default metric resolution is 5 minutes (which can be lowered to 1 minute if you pay up, or something like that), so if your alarm's measurement period is lower than that, it'll remain permanently in an INSUFFICIENT_DATA state. In my case, I had a 1 minute measurement period on CPU utilization, and changing it to 5 minutes fixed the state issue.
I had a similar problem: my alarm was constantly in INSUFFICIENT_DATA status although I could see the metric in the GUI.
It turned out that this happened because I specified the wrong Unit for the metric when I created the alarm. No error was reported back, but it never became GREEN.
It is better to avoid specifying it if you are not sure, and AWS will do the correct match in the background.
There is a directory /var/tmp/aws-mon/ that contains a couple of files. One is instance-id. The instance I was on was created from an AMI, and this file retained the old instance ID. I just edited it and made sure /var/tmp/aws-mon/placement/availability-zone was also correct. The alarms changed to OK almost instantly.
Also ran into this problem, but for a different reason: I passed the ES cluster ARN instead of the domain name in my CloudFormation template. It was pretty frustrating.