Starting and stopping container instances in-between task definitions on ECS - amazon-ec2

To reduce costs I would like to stop and start a container instance in a cluster in between tasks. The tasks only run every now and again, so it doesn't seem efficient to keep an EC2 instance running in between.
What is the best way to achieve this?
I have looked into Lambda functions triggered by a CloudWatch scheduler, and I have also thought about autoscaling.

Amazon doesn't make this incredibly straightforward (though they're trying to with Fargate). The best solution for now (if you're not in a region where Fargate is an option) is to keep your desired task count in line with the desired instance count of your Auto Scaling group.
The way we have it set up is through Lambda, triggered by Auto Scaling events (pretty easy to set up). The least trivial part is the Lambda script, though it's not incredibly difficult. Add tags to your ASG that identify which cluster / service it's associated with. When a scaling event triggers your script, have the script describe the ASG that triggered it, look up the cluster / service in the tags, and update the desired count of that service:
import boto3

client_asg = boto3.client('autoscaling')
client_ecs = boto3.client('ecs')

paginator_describe_asg = client_asg.get_paginator('describe_auto_scaling_groups')
asgDetail = paginator_describe_asg.paginate(
    AutoScalingGroupNames=[
        asgName,  # ASG name pulled from the triggering event
    ]
)

# set service desired count equal to ASG desired capacity
asgPage = next(iter(asgDetail))
newDesiredCount = asgPage['AutoScalingGroups'][0]['DesiredCapacity']

response = client_ecs.update_service(
    cluster=ecsCluster,    # cluster name resolved from the ASG tags
    service=ecsService,    # service name resolved from the ASG tags
    desiredCount=newDesiredCount
)
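For reference, a minimal sketch of how the cluster / service names above could be resolved from the ASG tags (the tag keys `ecs-cluster` / `ecs-service` are just examples; use whatever keys you tagged your ASG with):
# pull the cluster / service identifiers out of the ASG tags
asg = asgPage['AutoScalingGroups'][0]
tags = {t['Key']: t['Value'] for t in asg['Tags']}
ecsCluster = tags['ecs-cluster']
ecsService = tags['ecs-service']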
The reason you shouldn't rely on CloudWatch alone for this is that it doesn't do a great job at granular scaling. What I mean is, the CPU that CloudWatch monitors on your ASG is the overall group average (I think). So the scenario we ran into was as follows:
CloudWatch detects hosts are at 90%, desired is 70%
CloudWatch launches 4 hosts
Service detects tasks are at 85%, desired is 70%
Service launches new task
Service detects tasks are at 80%, desired is 70%
Service launches new task
Service detects tasks are at 75%, desired is 70%
Service launches new task
Service detects tasks are at 70%, no action
While this is a trivial example, it's easy to see how the number of instances gets out of sync with the number of tasks actually running (i.e., you may end up with a host sitting idle because ECS doesn't detect that it needs more capacity).
Could we just scale up 3 hosts? Sure, but ECS might still only place 2 tasks (depending on how the usage is per task). Could we scale one host at a time? Sure, but then it's pretty difficult to account for bursts.
All that to say, the best solution I can recommend for now is to have a Lambda script help keep your ASG instance count == your ECS service desired task count.

I have decided to create a Lambda function that starts the instance, and on container instance start a task is run. Then I have a CloudWatch event watching for the task changing status to STOPPED, which triggers another Lambda that stops the instance.
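A rough sketch of the stopping Lambda, assuming it is triggered by a CloudWatch Events / EventBridge rule for ECS "Task State Change" events filtered on lastStatus = STOPPED (the handler and variable names are arbitrary):
import boto3

ecs = boto3.client('ecs')
ec2 = boto3.client('ec2')

def handler(event, context):
    detail = event['detail']
    if detail.get('lastStatus') != 'STOPPED':
        return  # the rule should already filter on STOPPED; this is just a guard

    # map the ECS container instance back to its underlying EC2 instance
    resp = ecs.describe_container_instances(
        cluster=detail['clusterArn'],
        containerInstances=[detail['containerInstanceArn']],
    )
    instance_id = resp['containerInstances'][0]['ec2InstanceId']

    # stop the EC2 instance until the scheduled Lambda starts it again
    ec2.stop_instances(InstanceIds=[instance_id])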

Related

Is it possible to automatically rerun Databricks Job Clusters

I have a job cluster that I would like to rerun when it reaches the end of the notebook - is that possible?
For example, let's say my Databricks notebook ends with the following code.
import json

rdd = sc.parallelize([json.dumps(result)])
spark.read.json(rdd) \
    .write.mode("overwrite").json('/mnt/lake/RAW/FormulaClassification/F1Area/')
Under normal circumstances, when the job cluster has successfully completed all the cells in the notebook without any failures, the job cluster ends and provides a status notification saying 'Succeeded'.
I would like the notebook to re-run straight after the notification - and run indefinitely?
Is that possible?
Or is it even possible to keep a cluster up and running indefinitely, with it just sitting there waiting for upcoming executions? (I hope that last sentence makes sense.) I guess what I'm trying to say is that once a job cluster is running, I don't want it to terminate unless I physically terminate it.
You can opt out of auto termination by clearing the Auto Termination checkbox or by specifying an inactivity period of 0.
Refer - https://learn.microsoft.com/en-us/azure/databricks/clusters/clusters-manage#configure-automatic-termination
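If you would rather do this through the REST API than the UI, here is a hedged sketch (the clusters/edit endpoint generally expects the full cluster spec to be resubmitted; the workspace URL, token, and cluster values below are placeholders):
import requests

resp = requests.post(
    "https://<databricks-instance>/api/2.0/clusters/edit",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "cluster_id": "<cluster-id>",
        "cluster_name": "my-cluster",
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
        "autotermination_minutes": 0,  # 0 disables auto termination
    },
)
resp.raise_for_status()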
The best way to accomplish this would be to use a loop in your notebook that implements some kind of logic to check if there is anything to do.
import time

while True:
    if isNewDataAvailable:  # placeholder for your own "is there work to do" check
        dbutils.notebook.run("/path/to/notebook", 3600)  # second argument is the timeout in seconds
    time.sleep(10)
If you use autoscaling for your cluster, it should scale down to one node while sleeping, saving costs.

What's the best way to do `setTimeout` or `setInterval` with FaaS?

Using serverless Functions as a Service (AWS Lambda, GCP Cloud Functions), what is the best way to run a timer or interval for some time in the future?
I do not want to keep the instance running idle while the timer counts down. The timer will be less than 24 hrs and needs to change dynamically at runtime (it isn't a single, pre-set cron schedule).
Google has Cloud Scheduler, but that mimics cron and will not let me set a timer for an arbitrary number of seconds starting from now.
If you're looking for a product that's similar to Cloud Scheduler but lets you schedule a single function invocation for an arbitrary time in the future, look at Cloud Tasks. You create a queue, then add a task with an HTTP target that is scheduled to run at some time in the future.
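A rough sketch with the google-cloud-tasks Python client (the project, location, queue name, and target URL are placeholders):
import datetime

from google.cloud import tasks_v2
from google.protobuf import timestamp_pb2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "my-queue")

# schedule an HTTP task for 90 seconds from now
schedule_time = timestamp_pb2.Timestamp()
schedule_time.FromDatetime(
    datetime.datetime.utcnow() + datetime.timedelta(seconds=90)
)

task = {
    "http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "url": "https://example.com/my-function",  # your function's HTTP endpoint
    },
    "schedule_time": schedule_time,
}

client.create_task(request={"parent": parent, "task": task})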

How to find the right proportion between Hadoop instance types

I am trying to find out how many MASTER, CORE, and TASK instances are optimal for my jobs. I couldn't find any tutorial that explains how to figure this out.
How do I know if I need more than 1 core instance? What are the "symptoms" I would see in EMR's console metrics that would hint I need more than one core node? So far, when I tried the same job with 1 core + 7 task instances it ran pretty much the same as on 8 core instances, which doesn't make much sense to me. Or is it possible that my job is so CPU bound that the IO is negligible? (I have a map-only job that parses Apache log files into a CSV file.)
Is there such a thing as having more than 1 master instance? If so, when is it needed? I wonder, because my master node is pretty much just waiting for the other nodes to do the job (0% CPU) 95% of the time.
Can the master and the core node be identical? I can have a master-only cluster, where the one and only node does everything. It seems logical to be able to have a cluster with 1 node that is both the master and the core, with the rest as task nodes, but it seems impossible to set it up that way with EMR. Why is that?
The master instance acts as a manager and coordinates everything that goes on in the whole cluster. As such, it has to exist in every job flow you run, but just one instance is all you need. Unless you are deploying a single-node cluster (in which case the master instance is the only node running), it does not do any heavy lifting as far as actual MapReducing is concerned, so the instance does not have to be a powerful machine.
The number of core instances that you need really depends on the job and how fast you want to process it, so there is no single correct answer. A good thing is that you can resize the core/task instance group, so if you think your job is running slow, then you can add more instances to a running process.
One important difference between core and task instance groups is that the core instances store actual data on HDFS whereas task instances do not. As a result, you can only increase the core instance group (removing running instances would lose the data stored on them). On the other hand, you can both increase and decrease the task instance group by adding or removing task instances.
So these two types of instances can be used to adjust the processing power of your job. Typically, you use on-demand instances for core instances because they must be running all the time and cannot be lost, and spot instances for task instances because losing a task instance does not kill the entire job (e.g., the tasks not finished by lost task instances will be rerun on core instances). This is one way to run a large cluster cost-effectively using spot instances.
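As an example, here is a hedged boto3 sketch of adding a spot-priced task instance group to a running cluster (the cluster id, bid price, and sizes below are placeholders):
import boto3

emr = boto3.client('emr')

# add extra task capacity as spot instances; the group can later be shrunk again
emr.add_instance_groups(
    JobFlowId='j-XXXXXXXXXXXXX',
    InstanceGroups=[
        {
            'Name': 'extra-task-capacity',
            'InstanceRole': 'TASK',
            'Market': 'SPOT',
            'BidPrice': '0.10',
            'InstanceType': 'm2.xlarge',
            'InstanceCount': 4,
        }
    ],
)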
The general description of each instance type is available here:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/InstanceGroups.html
Also, this video may be useful for using EMR effectively:
https://www.youtube.com/watch?v=a5D_bs7E3uc

Amazon EMR: Set unique number of mappers and reducers per EMR instance

I'm running an Amazon EMR cluster that has M core instances and N task instances.
My jobs run multiple times per day and are time sensitive so I am keeping the M core instances up and running 24/7 so that I don't have data transfer overhead to/from S3.
The N task nodes are being dynamically launched and terminated as needed.
The M core nodes are c1.mediums and the N task nodes are m2.xlarge.
Is there a way to configure mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum per instance?
For the core nodes I want:
mapred.tasktracker.map.tasks.maximum=2
mapred.tasktracker.reduce.tasks.maximum=1
For the task nodes I want at least:
mapred.tasktracker.map.tasks.maximum=2
mapred.tasktracker.reduce.tasks.maximum=2
Note that task trackers run on the core nodes as well, so I think this configuration will need to be on a per-instance basis depending on the instance size.
Is this possible? And if so how can I set up this type of configuration?
There is a great blog post here which gives you the answer:
http://blog.earlh.com/index.php/2013/05/modifying-the-number-of-mappers-or-reducers-on-a-running-emr-cluster/
Note, though, that you might have to play around a bit with sshing into your task nodes; it will not work just like that.
I would get my pem file into a local directory, chmod 400 that pem file, and then do "scp -l hadoop -i .pem and then the rest of it", as mentioned in the blog.
Mind you, I have not tried this yet, but I believe it will work.
Also, the .versions... stuff may not be needed; you will probably just need conf.
Thanks

How to implement a custom cloud worker

I am designing a cloud app and need a worker process which scours my database looking for work, and then performs it.
Most of the info I seem to find on the subject of background tasks in the cloud involves some kind of scheduler and/or queuing system.
What I have doesn't quite fit into the "run this task every 5 minutes" or "add this to the queue to be executed later" models. I think the main difference to my problem is that the workers themselves find work to do, rather than being assigned it by a periodic scheduler or an external process that generates work.
What I have is basically a giant table where each entry has three fields:
job: a small task to be performed; let's say it gets the last message from a Twitter account and stores it in the database
the interval at which to perform that job: say every 5 minutes, N.B. the interval is arbitrary and different for each entry in the table
the last date when the job was performed
The way I would implement this is to have a worker with an infinite loop. When it enters the loop, it scours the database a) looking for items whose date + interval < currentTime, b) when it finds one, it sets date = currentTime, and c) then executes the job. If there is no work at the moment, it sleeps for a few seconds, then tries again.
I will have many parallel workers scouring the database simultaneously, which is why I do b) first and then c) in the paragraph above. Since there are parallel workers, actions a) and b) are atomic operations on the database to prevent work being duplicated. If a worker crashes after a) and b) but before it manages to finish the work, it's no big deal, and the workers can just do it at the next interval. The reason is that the work is not performed in a time-invariant system, so a backlog of failed jobs has no benefit: the tasks have to be performed at their exact intervals, so it's better to skip one interval than to have uneven intervals between executions.
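To make the atomic claim in a) and b) concrete, here is a minimal sketch of what I have in mind, assuming a Postgres table with made-up names (jobs, last_run, run_interval, payload) and a hypothetical run() function that performs the work:
import time

import psycopg2

conn = psycopg2.connect("dbname=jobs")  # hypothetical connection string

def claim_due_job():
    # atomically claim one due job by bumping its last-run timestamp
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            UPDATE jobs
               SET last_run = now()
             WHERE id = (
                   SELECT id FROM jobs
                    WHERE last_run + run_interval < now()
                    ORDER BY last_run
                    LIMIT 1
                    FOR UPDATE SKIP LOCKED)
            RETURNING id, payload
            """
        )
        return cur.fetchone()

while True:
    job = claim_due_job()
    if job is None:
        time.sleep(5)  # nothing due yet, back off briefly
    else:
        run(job)       # hypothetical function that performs the actual work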
My question is whether this is a reasonable implementation strategy. If so, how do I bring this process to life in the cloud (I am using Heroku, but may switch to EC2 in the future)? I still haven't written any code, so I would welcome other suggestions (maybe I have misunderstood the use cases/applications for queue systems).
This sounds so close to using something like a scheduled job that you might as well tread the well beaten path and do it the more conventional way. There's no reason why you can't schedule a job to run once every few seconds.
However, this idea of looking for work sounds dodgy. What happens if two workers find the same task to run at the same time for instance? Also, are there not triggers in the application which can indicate that work needs doing? It seems strange that you have code 'looking for work'.
You can go a very long way with simple periodic background tasks, so I would exhaust all possibilities in that area before rolling your own.
