Stopping EC2 instances for scale-down before the process running on them completes - amazon-ec2

We have an application that runs on EC2 instances, which we use as Docker hosts in an ECS cluster. Multiple tasks run on each EC2 instance. Each task picks up one message from SQS and processes an event (it converts data from one format to another and uploads it to a file system), which may take anywhere from a few seconds to 12-15 hours depending on the size of the data. Once an event has been processed, the task is stopped, and a new task is created for each new message (event). Whenever there is a large number of messages in SQS, we scale up the instances to process them (to avoid wait time). When (number of messages) < (number of running tasks) for a certain duration, we need to scale down, i.e. terminate EC2 instances.
For EC2 scale-down we need to make sure that no task is running on the instance, i.e. that no container on it is processing an event. We have no way to find out which EC2 instances are free (not processing any event), so we set the container instance to the DRAINING state and then terminate the EC2 instance. But when we set a container instance to DRAINING, the tasks running on it are stopped (so event processing is killed midway and data is lost). Is there any way to let the processing complete before the tasks are stopped, or can anyone suggest a better approach?

Related

If an AWS spot instance is stopped by AWS and then restarts, will it just start where it left off?

I am running luigi, a pipeline manager which is processing 1000 tasks. Currently I poll for the AWS termination notice. If it is present then I requeue the job; wait 30 minutes; then launch a new server starting all the tasks from scratch. However sometimes it restarts the same job multiple times which is inefficient.
Instead I am considering using create_fleet with InstanceInterruptionBehavior=Stop. If I do this, then when it restarts will it still be running the luigi daemon and retain the state of all the tasks?
All InstanceInterruptionBehavior=Stop does is effectively shut down your EC2 instance rather than terminate it. Since the "persistent" setting is required in addition to EBS storage, you will keep all the data that is on the attached EBS volumes at the time of the instance stop.
It is completely dependent on the application itself (Luigi in this case) to be able to store the state of its execution and pick back up from where it left off. For one, you'll want to ensure you enable the service daemon to automatically start upon system start (example):
sudo systemctl enable yourservice
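As an illustration of what "store the state of its execution and pick back up" can mean, here is a minimal Python sketch of a checkpoint file kept on the EBS volume. It is not Luigi's API; the names `load_done`, `save_done`, `run_all` and the JSON file layout are assumptions made for this example only.

```python
import json
import os
import tempfile


def load_done(path):
    """Return the set of task ids recorded as finished, or an empty set."""
    if not os.path.exists(path):
        return set()
    with open(path) as fh:
        return set(json.load(fh))


def save_done(path, done):
    """Write the checkpoint atomically so a stop mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(sorted(done), fh)
    os.replace(tmp, path)  # atomic rename on the same filesystem


def run_all(task_ids, run_task, path):
    """Run every task not yet checkpointed; return the ids run in this call."""
    done = load_done(path)
    ran = []
    for task_id in task_ids:
        if task_id in done:
            continue  # finished before the instance was stopped
        run_task(task_id)
        done.add(task_id)
        save_done(path, done)  # persist progress after each task
        ran.append(task_id)
    return ran
```

If the instance is stopped partway through, calling `run_all` again with the same checkpoint path after the restart skips the tasks already recorded and continues with the rest. A task that was interrupted mid-run is started again from its beginning, so each task should be safe to repeat.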

Interrupting a job in quartz with multiple instances

I have 5 instances of an application using Quartz in cluster mode, all with the Quartz scheduler running (with PostgreSQL).
org.quartz.jobStore.isClustered:true
org.quartz.scheduler.instanceName: myInstanceName
org.quartz.scheduler.instanceId: AUTO
I have a job which starts and performs some operations, then either updates itself with a new scheduled time if necessary or deletes itself. (One job can have only one trigger.)
The application has a UI that allows the user to cancel the job.
When the interrupt command is sent from the UI:
If the job is not currently running, I can pause or cancel it.
If the job is currently running, how can I stop it on the correct instance and get its current state? Basically I want to capture the job's data at the moment the user interrupts it and save it.
Does scheduler.interrupt(jobKey) correctly interrupt my job, which implements InterruptableJob?
Does scheduler.interrupt() know exactly which instance is currently running the job, find that instance, and get the right state of the job?
Can you correct me, or tell me which way I should go?
The interrupt method implementations and getCurrentlyExecutingJobs() in Quartz are not cluster aware,
which means the method has to be run on the instance that is executing the job. In other words, only jobs with the specified job key running in the current instance will be interrupted.
An interrupt request can be broadcast to all running instances of Quartz to cancel all running instances of the job.
from: https://www.quartz-scheduler.org/api/2.1.7/org/quartz/Scheduler.html#interrupt(org.quartz.JobKey)
This method is not cluster aware. That is, it will only interrupt
instances of the identified InterruptableJob currently executing in
this Scheduler instance, not across the entire cluster.
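The broadcast idea can be modelled in a few lines. The following Python sketch is a language-neutral simulation, not Quartz code: `Node`, `interrupt_local` and `broadcast_interrupt` are hypothetical names, and the transport that would carry the request between real instances (a message queue, a shared table, an HTTP call) is left out as an assumption.

```python
class Node:
    """Stand-in for one scheduler instance in the cluster (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.running = set()   # job keys executing on this node
        self.saved_state = {}  # state captured at the moment of interruption

    def interrupt_local(self, job_key):
        """Mimic a non-cluster-aware interrupt: only local jobs are affected."""
        if job_key not in self.running:
            return False
        self.running.discard(job_key)
        self.saved_state[job_key] = "interrupted on " + self.name
        return True


def broadcast_interrupt(nodes, job_key):
    """Deliver the request to every node; return names of nodes that acted."""
    return [node.name for node in nodes if node.interrupt_local(job_key)]
```

Every node receives the request, but only the node that is actually executing the job acts on it and records the state at that moment. The other nodes treat the request as a no-op.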

Does the ECS update-service command set the container instance state to draining when used with the --force-new-deployment option?

The command:
aws ecs update-service --service my-http-service --task-definition amazon-ecs-sample --force-new-deployment
As per AWS docs: You can use this option (--force-new-deployment) to trigger a new deployment with no service definition changes. For example, you can update a service's tasks to use a newer Docker image with the same image/tag combination (my_image:latest ) or to roll Fargate tasks onto a newer platform version.
My question: if I use '--force-new-deployment' (as I will use the existing tag or definition), will the underlying 'ECS Instance' automatically be set to the DRAINING state, so that no new task will start on the EXISTING ECS instance that is supposed to go away under the rolling-update deployment strategy (or deployment controller)?
In other words, is there any chance:
For a new task to be created on the existing/old container instance that is supposed to go away during the rolling update?
Also, what would happen to an ongoing task that is running on this existing/old container instance that is supposed to go away during the rolling update?
Ref: https://docs.aws.amazon.com/cli/latest/reference/ecs/update-service.html
Please note that no Container instance is going anywhere with this 'update-service' command. This command will only create a new Deployment under the ECS service and when the new tasks become healthy, remove the old task(s).
Edit 1:
What about the requests that were being served by the old task?
I am assuming the tasks are behind an Application Load Balancer. In this case, old tasks will be deregistered from the ALB.
Note: In the following discussion, target is the ECS Task.
To give a brief description of how the deregistration delay works with ECS, the following is the sequence of events when a task connected to an ALB is stopped. This can be due to a scale-in event, deployment of a new task definition, a decrease in the number of tasks, a forced deployment, etc.
1) ECS sends a DeregisterTargets call and the targets change their status to "draining". New connections will not be sent to these targets.
2) If the deregistration delay elapses and there are still in-flight requests, the ALB will terminate them and clients will receive 5XX responses originating from the ALB.
3) The targets are deregistered from the target group.
4) ECS sends the stop call to the tasks and the ECS agent gracefully stops the containers (SIGTERM).
5) If the containers have not stopped within the stop timeout period (ECS_CONTAINER_STOP_TIMEOUT, 30s by default), they will be force stopped (SIGKILL).
As per the ELB documentation [1] if a deregistering target has no in-flight requests and no active connections, Elastic Load Balancing immediately completes the deregistration process, without waiting for the deregistration delay to elapse. However, even though target deregistration is complete, the status of the target will be displayed as draining and you can see the registered Target of the old task is still present in the TargetGroup console with status as draining until the deregistration delay elapses.
The design of ECS is to stop the Task after the completion of Draining process as mentioned in the ECS document [2] and hence the ECS Service waits for the TargetGroup to complete the Draining process before issuing the stop call.
Ref:
[1] Target Groups for Your Application Load Balancers - Deregistration Delay - https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-target-groups.html#deregistration-delay
[2] Updating a Service - https://docs.aws.amazon.com/AmazonECS/latest/developerguide/update-service.html
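The timing implied by the sequence in this answer can be written as simple arithmetic. The Python sketch below assumes ECS waits for the full deregistration delay before sending the stop call, as described above; the function name is hypothetical, the 300 second value is the ALB default deregistration delay, and the 30 seconds is the stop timeout default mentioned in the answer.

```python
def shutdown_deadlines(deregistration_delay_s=300, stop_timeout_s=30):
    """Return seconds after the DeregisterTargets call at which each signal is due.

    Assumes the stop call is only issued once draining has completed.
    """
    sigterm_at = deregistration_delay_s
    sigkill_at = sigterm_at + stop_timeout_s
    return {"sigterm_at_s": sigterm_at, "sigkill_at_s": sigkill_at}
```

With the defaults, an old task would receive SIGTERM about 300 seconds after draining starts and SIGKILL at about 330 seconds if it has not exited by then. Lowering the deregistration delay on the target group shortens the first interval.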

How to deploy laravel into a docker container while there are jobs running

We are trying to migrate our Laravel setup to Docker. Dockerizing the Laravel app was straightforward, but we ran into an issue: if we do a deployment while scheduled jobs are running, they are killed because the container is destroyed. What's the best practice here? Having a separate container to run the Laravel scheduler doesn't seem like it would solve the problem.
Run the scheduled job in a different container so you can scale it independently of the laravel app.
Run multiple containers of the scheduled job so you can stop some to upgrade them while the old ones will continue processing jobs.
Docker will send a SIGTERM signal to the container and wait for the container to exit cleanly before issuing SIGKILL (the time between the two signals is configurable, 10 seconds by default). This allows your current job to finish cleanly (or to save a checkpoint so it can continue later).
The plan is to stop old containers and start new containers gradually so there aren't lost jobs or downtime. If you use an orchestrator like Docker Swarm or Kubernetes, they will handle most of these logistics for you.
Note: the Laravel scheduler is based on cron and will fire processes that will be killed by Docker. To prevent this, have the scheduler add a job to a Laravel queue. The queue worker is a foreground process, so it receives the SIGTERM and is given the chance to stop or save cleanly before being killed.
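The SIGTERM handling described in this answer follows a common pattern: the handler only sets a flag, and the worker checks the flag between jobs. The sketch below shows that pattern in Python rather than PHP, purely as an illustration; `GracefulWorker` and its methods are hypothetical names and not part of Laravel or Docker.

```python
import signal


class GracefulWorker:
    """Processes jobs one at a time and stops between jobs once asked to."""

    def __init__(self):
        self.stop_requested = False
        self.completed = []

    def request_stop(self, signum=None, frame=None):
        """Signal handler: only record the request, never abort a job midway."""
        self.stop_requested = True

    def install(self):
        """Register the handler for SIGTERM (call from the main thread)."""
        signal.signal(signal.SIGTERM, self.request_stop)

    def run(self, jobs, handle):
        """Handle jobs in order until the list ends or a stop is requested."""
        for job in jobs:
            if self.stop_requested:
                break  # leave remaining jobs on the queue for another worker
            handle(job)
            self.completed.append(job)
        return self.completed
```

The job that is in progress when the signal arrives runs to completion, so it must finish within the container's stop timeout. Otherwise the SIGKILL that follows will still cut it off, and the timeout needs to be raised or the job needs to checkpoint.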

Mesos/Marathon checkpointing and HA

Mesos and Marathon mention checkpointing from time to time, but I couldn't find a good explanation of how it works anywhere. Also, what does it mean in practice?
1) Is the Task current state continuously being stored, or is only the Task ID stored? Where is it stored and what does it contain?
2) There are two Marathon instances. Marathon has been running Nginx for a week, then goes down. Does that mean that the actual Nginx application state continues running on the second Marathon instance, or does it just restart the task from beginning? If the Task actual state is copied, isn't there a lot of data to be continuously persisted and passed around between slaves?
Slave recovery is a feature of Mesos that allows:
executors/tasks to keep running when the slave process is down, and
a restarted slave process to reconnect with running executors/tasks on the slave
(Mesos Slave recovery).
So regarding your questions, this means:
1) Enough information (a little more than the task ID) is stored so that a new slave process can reconnect to the still-running executor/task.
2) As the task state is not checkpointed, it would start the task from the beginning.
Hope this helps,
Joerg
