We are planning to use ElastiCache (Redis) instead of our own Redis cluster. However, the "maintenance window" setting raises some questions:
If I use a Multi-AZ replicated cluster, will ElastiCache fail over to available replicas during maintenance windows, or does the entire cluster go down during maintenance?
How long does it generally take?
We could also use Memcached instead of Redis. Does it have better availability during maintenance windows?
How do others handle ElastiCache maintenance windows? Just go with the downtime?
Thanks!
AWS usually performs two kinds of maintenance:
1) Continuous managed maintenance updates
2) Service updates
When creating a cluster you need to specify a 60-minute maintenance window. Usually all the maintenance updates (1) happen during that window.
For every service update (2) you will receive a notification when one is scheduled. The notification comes as an email, a notice on the ElastiCache console page, etc. Based on the notification you can reschedule the service update to a time that suits you. If you don't reschedule it, it will by default be applied during your maintenance window.
Basically, during maintenance updates AWS replaces your node with a new node that has the required updates. If you have a primary/replica setup with Multi-AZ and auto failover set to true, then during maintenance of the primary node your replica will be promoted to master and your read/write requests will be served from there. So ideally you don't see any issue during maintenance, apart from maybe a few seconds of downtime while the replica is promoted to master.
If you either don't set up Multi-AZ with auto failover set to true, or your ElastiCache cluster has just one node, you will see downtime during the maintenance window.
Refer to the AWS documentation.
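To make the conditions above concrete, here is a minimal sketch of a check you could run against your own setup. It assumes the input dict is shaped like one entry of `ReplicationGroups` as returned by boto3's `describe_replication_groups` (fields `MemberClusters`, `AutomaticFailover`, `MultiAZ`), and that cluster mode is disabled (a single shard). The function name `failover_ready` is made up for illustration.

```python
def failover_ready(group: dict) -> bool:
    """Return True if maintenance on the primary can fail over to a replica.

    Assumption: `group` looks like one ReplicationGroups entry from
    describe_replication_groups, for a single-shard replication group.
    """
    # At least two nodes are needed: one primary plus one replica.
    has_replica = len(group.get("MemberClusters", [])) >= 2
    auto_failover = group.get("AutomaticFailover") == "enabled"
    multi_az = group.get("MultiAZ") == "enabled"
    return has_replica and auto_failover and multi_az


# Hand-written sample data, not real API output.
sample_group = {
    "ReplicationGroupId": "example-redis",
    "MemberClusters": ["example-redis-001", "example-redis-002"],
    "AutomaticFailover": "enabled",
    "MultiAZ": "enabled",
}
print(failover_ready(sample_group))
```

If this returns False for your replication group, expect downtime while the node is replaced.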
How long does it generally take?
Under 60 minutes:
"If a "maintenance" event is scheduled for a given week, it will be initiated and completed at some point during the 60 minute maintenance window you identify."
How often:
Software patching occurs infrequently (typically once every few months) and should seldom require more than a fraction of your maintenance window. If you do not specify a preferred weekly maintenance window when creating your Cache Cluster, a 60 minute default value is assigned.
http://aws.amazon.com/elasticache/faqs/
I want to set up Auto Scaling groups that launch and terminate instances based on CPU load. But our connections usually stay open for a long time, 8 hours or sometimes even more. When I use an NLB, the deregistration delay is only supported up to 3600 seconds; after that the NLB forcefully closes the connections, which causes our long-lived connections to fail, and Auto Scaling terminates the instances as well.
How do I make sure that all connections to the target group are allowed to finish, even after 8-10 hours, before the NLB deregisters the instance or Auto Scaling terminates it?
I checked ASG lifecycle hooks, and they only seem to hold an instance for up to 2 hours.
Is it possible to deregister an instance from the target group only after all its connections have drained, and then terminate the instance using the ASG?
There isn't any good/easy way to do what you want to do. What are the instances doing that can last up to 10 hours?
Depending on your work type, this is the best workaround I can think of, but it would probably involve a bit of rearchitecting.
1) Design your application so that all data is stored off the instance in some sort of data tier (S3, RDS, EFS, etc). When an instance is done doing whatever it's doing, save that info to the data tier. This way a user request can go to any instance and get the same information
2) The ASG decides to scale in
3) You have a lifecycle hook configured, plus a CloudWatch notification that is triggered when an instance enters the Terminating:Wait state and notifies the instance
4) The instance periodically sends a heartbeat to the lifecycle hook, which can extend the hook's timeout for up to 2 days
5) Whenever the instance finishes what it's doing, it saves the information out to the data tier mentioned in 1), and the client can connect to a new instance to get the information that was being processed on the old one
https://docs.aws.amazon.com/cli/latest/reference/autoscaling/record-lifecycle-action-heartbeat.html
https://docs.aws.amazon.com/cli/latest/reference/autoscaling/complete-lifecycle-action.html
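Steps 4) and 5) could be sketched roughly like this. It is only an outline under some assumptions: `autoscaling` is expected to behave like a boto3 Auto Scaling client (for example `boto3.client("autoscaling")`), `is_busy` is a hypothetical callback you would write to report whether connections are still open, and the function name and the 300-second interval are arbitrary choices.

```python
import time
from typing import Callable


def drain_then_complete(
    autoscaling,
    hook_name: str,
    group_name: str,
    instance_id: str,
    is_busy: Callable[[], bool],
    interval_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Keep the lifecycle hook alive while work remains, then release it.

    Returns the number of heartbeats sent.
    """
    heartbeats = 0
    while is_busy():
        # Each heartbeat restarts the hook's timeout (step 4).
        autoscaling.record_lifecycle_action_heartbeat(
            LifecycleHookName=hook_name,
            AutoScalingGroupName=group_name,
            InstanceId=instance_id,
        )
        heartbeats += 1
        sleep(interval_seconds)
    # Work is finished (step 5): let the ASG go ahead with termination.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=group_name,
        InstanceId=instance_id,
        LifecycleActionResult="CONTINUE",
    )
    return heartbeats
```

The interval has to be shorter than the hook's heartbeat timeout, otherwise the hook expires between heartbeats. Saving state to the data tier from step 1) would happen just before the `complete_lifecycle_action` call.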
Try using the scaling cooldown period. By default the scaling cooldown period is 300 seconds. You can increase this number, which lengthens the time before a scale-in happens.
I want to enable the new "Performance Insights" on active RDS.
Can I do that without expecting any downtime?
Thanks
Enabling Performance Insights requires you to choose a scheduling option for the modification [1], which suggests that the database may need to restart for the change to take effect.
For Scheduling of Modifications, choose one of the following:
Apply during the next scheduled maintenance window – Wait to apply the Performance Insights modification until the next maintenance window.
Apply immediately – Apply the Performance Insights modification as soon as possible.
However, it does appear to be an instance-level setting, which means that if you have a multi-instance Aurora cluster, you can modify one instance at a time, and failovers should prevent you from having any downtime.
[1] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.Enabling.html
At the top of the article it states "Enabling and disabling Performance Insights doesn't cause downtime, a reboot, or a failover."
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.Enabling.html
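For reference, a minimal sketch of enabling it one instance at a time from code. It assumes `rds` behaves like a boto3 RDS client (for example `boto3.client("rds")`) and uses its `modify_db_instance` call with `EnablePerformanceInsights`; the function name and the instance identifiers are made up for illustration.

```python
from typing import Iterable, List


def enable_performance_insights(rds, instance_ids: Iterable[str]) -> List[str]:
    """Enable Performance Insights on each DB instance, one at a time.

    Returns the identifiers that were modified, in order.
    """
    modified = []
    for instance_id in instance_ids:
        rds.modify_db_instance(
            DBInstanceIdentifier=instance_id,
            EnablePerformanceInsights=True,
            # Apply now instead of waiting for the maintenance window.
            ApplyImmediately=True,
        )
        modified.append(instance_id)
    return modified
```

Usage would look something like `enable_performance_insights(boto3.client("rds"), ["my-db-instance-1"])`, where the identifier is a placeholder for your own instance.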
I want to maintain a pool of stopped Amazon EC2 instances. Whenever the number of stopped instances is below a threshold, I would like to create new instances and then immediately stop them once they are running. Is this possible within the Amazon infrastructure alone?
You can certainly create Amazon EC2 instances and then Stop them, making them available to Start later. As you point out, this has the benefit that Starting a stopped instance is faster than Launching a new one.
There is no automated method to assist with this. You would have to code a solution that does the following:
Monitor the number of Stopped instances
If the quantity is below the threshold, launch a new instance
The new instance could automatically stop itself via User Data (either via a Shutdown command to the Operating System, or via a StopInstances call to EC2)
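The three steps above could look roughly like the sketch below. It is an outline under stated assumptions, not a finished solution: `ec2` is expected to behave like a boto3 EC2 client (for example `boto3.client("ec2")`), pool members are identified by a hypothetical `Pool` tag, the new instance stops itself through a shutdown command in User Data, and pagination of `describe_instances` is ignored.

```python
# User Data script (assumed Linux AMI) that shuts the instance down on first boot.
STOP_ON_BOOT = "#!/bin/bash\nshutdown -h now\n"


def count_pool_instances(ec2, pool_tag: str) -> int:
    """Count pool instances that are stopped or on their way to being stopped.

    A freshly launched instance that is briefly "running" is not counted,
    so run this check less often than an instance takes to boot and stop.
    """
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Pool", "Values": [pool_tag]},
            {
                "Name": "instance-state-name",
                "Values": ["pending", "stopping", "stopped"],
            },
        ]
    )
    return sum(len(r["Instances"]) for r in response["Reservations"])


def top_up_stopped_pool(
    ec2, threshold: int, pool_tag: str, image_id: str, instance_type: str
) -> int:
    """Launch enough self-stopping instances to reach the threshold."""
    missing = threshold - count_pool_instances(ec2, pool_tag)
    if missing <= 0:
        return 0
    ec2.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=missing,
        MaxCount=missing,
        UserData=STOP_ON_BOOT,
        # An OS-level shutdown should stop the instance, not terminate it.
        InstanceInitiatedShutdownBehavior="stop",
        TagSpecifications=[
            {
                "ResourceType": "instance",
                "Tags": [{"Key": "Pool", "Value": pool_tag}],
            }
        ],
    )
    return missing
```

Something still has to call `top_up_stopped_pool` on a schedule, which is one of the open questions in the list of considerations.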
Some things you would have to consider:
What triggers the monitoring? Would it be on a schedule?
The task that launches a new instance would need to wait for the new instance to Launch & Stop before launching any more instances
What Starts the instances when they are needed?
Do instances ever get Stopped when they are no longer required?
The much better choice would be to use Auto Scaling, with a scale-out alarm based on some metric that says your fleet is busy, and a scale-in alarm to remove instances when the fleet is not busy. The scale-out alarm could be set to launch instances once a threshold is passed (eg 80% CPU) that should allow the new instance(s) to launch before things are 100% busy. The time difference between launching a new instance and starting an existing instance is quite small (at least for Linux).
If you're using Windows, the biggest time delay when launching a new instance is due to Sysprep, which makes a "clean" machine with new Unique IDs. You could cheat by creating an AMI without Sysprep, which would boot faster.
Perhaps I am misunderstanding your objective... you can't "ensure availability" of instances without paying for them.
Instances in the stopped state are only logical entities that don't physically exist anywhere -- hardware is allocated on launch, deallocated on stop, reallocated on the next start. In the unlikely event that an availability zone has run out of capacity for a given instance class, stopped instances of that class won't start, because there is no hardware available for them to be deployed onto.
To ensure that instances are always available, you have to reserve them, and you have to specify the reservations in a specific availability zone:
Amazon EC2 Reserved Instances provide a significant discount (up to 75%) compared to On-Demand pricing and provide a capacity reservation when used in a specific Availability Zone. [emphasis added]
https://aws.amazon.com/ec2/pricing/reserved-instances/
Under most plans, reserved instances are billed the same rate whether they are running or not, so there would be little point in stopping them.
I have an issue where, from time to time, one of the EC2 instances in my cluster has its ECS agent disconnected. This silently removes the EC2 instance from the cluster (i.e. it is no longer eligible to run any services) and silently drains my cluster of servers. My cluster is backed by an Auto Scaling group that spawns servers to keep up the healthy count. But servers with a disconnected ECS agent are not marked as unhealthy, so the Auto Scaling group thinks everything is all right.
I have the feeling there must be something (easy) to mitigate this, or I'm having a big issue with choosing ECS and using it in production.
We had this issue for a long time. With each new AWS ECS-optimized AMI it got better, but as of 3 months ago it still happened from time to time. As mcheshier mentioned, make sure to always use the latest AMI, or at least the latest AWS ECS agent.
The only way we were able to resolve it was through:
Timed autoscale rotations
We would try to prevent it by scaling up and down at random times
Good cloudwatch alerts
We happened to have our application set up as a bunch of microservices that were all queue (SQS) based, so we could scale up and down based on queues. We had decent monitoring that let us approximate queue processing rates across the number of ECS containers. When we detected that the rate was off, we would rotate the whole ECS instance. For example, say our cluster deployed 4 running containers of worker-1, and we estimate that each worker handles 1000 messages per 5 minutes. If our queue rate was 3000 per 5 minutes with 4 workers, then 1 was not working as expected. We had some scripts set up in Lambda to find the faulty one and terminate the entire instance that ran that container.
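The arithmetic behind that check is simple enough to show. This is only an illustration of the idea, not our actual Lambda script; the function name and the 10% tolerance are made up.

```python
def estimate_stalled_workers(
    observed_rate: float,
    per_worker_rate: float,
    worker_count: int,
    tolerance: float = 0.1,
) -> int:
    """Estimate how many workers are not pulling their weight.

    Rates are messages per monitoring period (5 minutes in the example).
    A shortfall below `tolerance` of one worker's rate is treated as noise.
    """
    expected_rate = per_worker_rate * worker_count
    shortfall = expected_rate - observed_rate
    if shortfall <= tolerance * per_worker_rate:
        return 0
    # Express the missing throughput in whole workers, capped at the fleet size.
    return min(worker_count, round(shortfall / per_worker_rate))


# The numbers from the example: 4 workers at 1000 each, 3000 observed.
print(estimate_stalled_workers(3000, 1000, 4))  # → 1
```

Finding which container is the stalled one still needs per-container metrics; this only tells you that something is off and roughly by how much.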
I hope this helps, I realize it's specific to our in-house application, but the advice I can give you and anyone else is to take the initiative and put as many metrics out there as you can. This will let you do some neat analytics and look for kinks in the system, this being one of them.
We run a server architecture with X base servers which are always on. Our servers process jobs sent to them, and the vast majority of our job requests come in during the workday. To handle this spike in volume, we use EC2 auto-scaling.
I prefer to launch servers through auto-scaling with as much of a configured AMI as possible as opposed to launching from a base AMI and installing packages through long Chef or Puppet scripts.
In our current build process, we implement changes to our code base late at night when only our base servers are needed and no servers are launched through auto-scaling. But every once in a while, we'll have a critical bug fix that needs to be implemented immediately during the day.
We have a rather large EBS volume attached to our servers (approx. 400 GB), and creating an AMI of a base server with the latest changes usually takes upwards of one hour. This isn't a problem for late night deployments when no additional servers need to be launched, but it costs us valuable time during the day, because we can't launch additional servers until the latest AMI is ready.
Is there anything out there which can speed up the AMI creation process here? I've heard of Netflix's Aminator and Boxfuse, but are there any other alternatives? Also, how do these services stack up against one another?