We're looking at using EC2 autoscaling to deal with spikes in load. In our case we want to scale up instances based on an SQS queue size and then down scale with the queue size gets back under control. Each SQS message defines a potentially long running job (sometimes up to 20 minutes each for message) that must complete before the instance can be terminated.
Our software handles the shutdown process gracefully, so issuing sudo service ourapp stop will wait for the app to complete before returning.
My question; when autoscaling starts scaling down it issues a terminate (which apparently is like hitting the power button), will it wait for for our app to completely exit before the instance is 'powered off'?
https://forums.aws.amazon.com/message.jspa?messageID=180674 <- that and other things I've found seem to suggest that it doesn't
On most newer AMI's, the machines are given the equivalent to a 'halt' (or 'shutdown -h now' command so that the services are gracefully shut down. As long as your program plays nicely with the startup/shutdown scripts, you should be fine -- but, if your program takes more than 20 seconds to terminate, you may experience that amazon will kill the instance completely.
Amazon's documentation with regards to their autoscaling doesn't specify the termination process, but, AWS's documentation for ec2 in general does contain about what happens during the termination process -- that the machines is given a 'shutdown' command, and the default shutdown time on most systems is 30 seconds.
In mid 2014 AWS introduced 'lifecycle hooks' which allows for full control of the termination process.
Our high level down scale process is:
Auto Scaling sends a message to a SQS queue with an instance ID
Controller app picks up the message
Controller app issues a 'stop instance' request
Controller app re-queues the SQS message while the instance is stopping
Controller app picks up the message again, checks if the instance has stopped (or re-queues the message to try again later)
Controller app notifies Auto Scaling to 'PROCEED' with the termination
Controller app deletes the message from the SQS queue
More details: http://docs.aws.amazon.com/autoscaling/latest/userguide/lifecycle-hooks.html
use replaceunhealty option in autoscaling.
refer:
http://alestic.com/2011/11/ec2-schedule-instance
particularly see this comment.
Related
I'm running Python Celery (a distributed task queue library) workers in an AWS ECS cluster (1 Celery worker running per EC2 instance), but the tasks are long running and NOT idempotent. This means that when an autoscaling scale-in event happens, which is when ECS terminates one of the containers running a worker because of low task load, the long running tasks currently in progress on that worker would be lost forever.
Does anyone have any suggestions on how to configure ECS autoscaling so no tasks are terminated before completion? Ideally, ECS scale-in event would initiate a warm-shutdown on the Celery worker in the EC2 instance it wants to terminate, but only ACTUALLY terminate the EC2 instance once the Celery worker has finished the warm shutdown, which occurs after all its tasks have completed.
I also understand there is something called instance protection, which can be set programmatically and protects instances from being terminated in a scale-in autoscale event: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-instance-termination.html#instance-protection-instance
However, I'm not aware of any Celery signals which trigger after all tasks have finished out in a warm shutdown, so I'm not sure how I'd programmatically know when to disable the protection anyways. And even if I found a way to disable the protection at the right moment, who would manage which worker gets sent the shutdown signal in the first place? Can EC2 be configured to do a custom action to instances in a scale-in event (like doing a warm celery shutdown) instead of just terminating the EC2 instance?
I think that while ECS scale-in your tasks it sends SIGTERM, wait for 30 seconds (default) and kill your task's containers with SIGKILL.
I think that you can increase the time between the signals with this variable: ECS_CONTAINER_STOP_TIMEOUT.
That way, your celery task could finish and no new tasks will be added to this celery worker (warm-shutdown after receiving the SIGTERM).
This answer might help you:
https://stackoverflow.com/a/49564080/1011253
What we do in our company is we do not use ECS, just "plain" EC2 (for this particular service). We have an "autoscaling" task that runs every N minutes, which depending on situation scales up the cluster by M new machines (all configurable via AWS parameter store). So basically Celery scales up/down itself. The task I mentioned also sends shutdown signal to every worker older than 10 minutes that is completely idle. When Celery worker shuts down, the whole machine terminates (in fact, Celery worker shuts it down via the #worker_shutdown.connect handler that powers off the machine - all these EC2 instances have "terminate" shutdown policy). The cluster processes millions of tasks per day, some of them running for up to 12 hours...
I want to set up Autoscaling groups where we can launch and terminate instances based on the CPU load. But usually our connections stays for long like more than 8hrs sometimes even more than that. When I use NLB, the Deregistration delay is only supported till 3600sec and after that NLB will forcefully remove the connection which cause our long living connections to fail and autoscaling will terminate the instances as well.
How do I make sure that all my connections to the target group is processed after 8-10hrs and then NLB deregister or autoscaling terminate the instance?
I checked the ASG Lifecycle hooks and it allows connections only till 2hrs.
Is it possible to deregister the instances in target group after all the connections are drained and terminate the instance using ASG?
There isn't any good/easy way to do what you want to do. What are the instances doing that can last up to 10 hours?
Depending on your work type, this is the best workaround I can think of, but it would probably involve a bit of rearchitecting.
1) design your application so that all data is stored off the instance is some sort of data tier (S3, RDS, EFS, etc). When an instance is done doing whatever its doing, save that info to the data tier. This way a user request can go to any instance and get the same information
2) The ASG decides to scale in
3) You have a lifecycle hook configured and a cloudwatch notification setup to be triggered when an instance enters the terminating:wait state which notifies the instance
4) The instance periodically sends a heartbeat to the lifecycle hook which can extend the hooks timeout for up to 2 days
5) Whenever the instance finishes what its doing, it saves the information out to the data tier mentioned in 1) and the client can connect to a new instance to get the information that was being processed on the old one
https://docs.aws.amazon.com/cli/latest/reference/autoscaling/record-lifecycle-action-heartbeat.html
https://docs.aws.amazon.com/cli/latest/reference/autoscaling/complete-lifecycle-action.html
Try to use, Scaling CoolDown period. By the default Scaling Cooldown Period is (300 Secs). you can increase the number. which will help to increase the scale in time.
What is the advantage of using threadpool in Hystrix?
Suppose we are calling a third party service ,when we call a service or DB that thread goes into waiting state than what is the use of keep creating thread for each call?
So,I mean how short circuited(Threadpooled) method is batter then normal(non short circuited) method?
Lets say when a remote service(any service) is started to respond slowly, but a typical application(service which is making call to remote service) will still continue to call that remote service. So short circuited(Threadpooled) method helps you build a Defensive system in this particular case.
As calling service does not know if the remote service is healthy or not and new threads are spawned every time a request comes in. This will cause threads on an already struggling server to be used.
We don’t want this to happen as we need these threads for other remote calls or processes running on our server and we also want to avoid CPU utilization spiking up. so this prevents resources from becoming blocked if latency occurs. Also Bounded thread pool also gives some breathing room for downstream services to recover.
For detail : ThreadPool in Hystrix
I am new to Go and I am using go routines in my app in Heroku, which are long (up to 7 minutes), and cannot be interrupted.
I saw that the auto scaler sometimes kills the Heroku dyno which is running the routine. I need a way of running this routine independently from the dynos so I know that it will not get shutdown. I read articles and still don't understand how to perform a go routine in a background worker. It is hard for me to believe I am the only one experiencing this.
My go routines use my redis database.
Could someone please point me to an example of how to setup a background worker in heroku for go and how to send my go routine to that worker?
Thank you very much
I need a way of running this routine independently from the dynos so I
know that it will not get shutdown.
If you don't want to run your worker code on a dyno then you'll need to use a different provider from Heroku, like Amazon AWS, Digital Ocean, Linode etc.
Having said that, you should design your workers, especially those that are mission critical, to be able to recover from a shutdown. Either to be able to continue where they left off or to start over. Heroku's dyno manager restarts the dynos at least once a day but I wouldn't be surprised if the other cloud providers also restart their virtual instances once in a while, probably not once a day but still... And even if you decide to deploy your workers on a physical machine that you control and never turn off, you cannot prevent things like hardware failure or power outage from happening.
If your workers need to perform some task till it's done you need to make them be aware of possible shutdowns and have them handle such scenarios gracefully. Do not ever rely on a machine, physical or virtual, to keep running while your worker is doing it's job.
For example if you're on Heroku, use a worker dyno and make your worker listen for the SIGTERM signal, after your worker receives such a signal...
The application processes have 30 seconds to shut down cleanly
(ideally, they will do so more quickly than that). During this time
they should stop accepting new requests or jobs and attempt to finish
their current requests, or put jobs back on the queue for other worker
processes to handle. If any processes remain after that time period,
the dyno manager will terminate them forcefully with SIGKILL.
... continue reading here.
But keep in mind, as I mentioned earlier, if there is an outage and Heroku goes down, which is something that happens from time to time, your worker won't even have those 30 seconds to clean up.
I have a worker role process that want to stop processing new requests when it's too busy (e.g. CPU load > 80%, long disk queue, or some other metrics).
If I set the role status to "busy", will it get killed by Fabric Controller after busying for too long time? If yes, how long will it takes until the Fabric Controller kill the process?
I assume the process is still capable to receive/send signals to the Fabric agent.
Thanks!
You can leave an instance in the Busy status forever. The only time Azure will take recovery action is if the process exits. See http://blogs.msdn.com/b/kwill/archive/2013/02/28/heartbeats-recovery-and-the-load-balancer.aspx for some additional information.
Also, what is your worker role doing? Setting the instance status to Busy will only take it out of the load balancer rotation so that new incoming TCP connections will not get routed to that instance. But if your worker role is a typical worker role where it does background jobs (ie. sits in a loop picking messages up from a queue, or listening on an InternalEndpoint for requests coming from a front end web role) then setting it to Busy will have no effect. In this scenario you would add logic to your code to stop doing work, but what that looks like will depend on what type of work your role is doing.