Minimum value for EC2 instance StatusCheckFailed interval seems to be one minute. Is it possible to reduce this to 2 failures for 15 seconds?
We have a requirement to detect failures quickly in 10-15 seconds range. Are there any other ways to accomplish this?
I don't believe you can set the resolution of the status check to less than 1 minute. One potential workaround would be to implement a lambda function that essentially performs a status check (via your own code) on a more frequent time interval via a cron job.
Related
We have the algorithm to reuse AWS EC2 instances for jobs. This was very useful when the payments were by using time rounded by hours. Now, due to the change in the AWS policy the payments can be done by time expressed in minutes. At first glance there is no reason to keep the reuse algorithm because a job lasts at least 10 minutes up to many hours. Are there any suggestions why this algorithm can still be useful?
Unless you're running an EC2 instance for less than a minute (which it sounds like you're not) and need per second billing or are using an EBS volume with it, I would say probably not. See more details here
There are several partitions on the cluster I work on. With sinfo I can see the time limit for each partition. I put my code to work on mid1 partition which has time limit of 8-00:00:00 from which I understand that time limit is 8 days. I had to wait for 1-15:23:41 which means nearly 1 day and 15 hours. However, my code ran for only 00:02:24 which means nearly 2.5 minutes (and the solution was converging). Also, I did not set a time limit in the file submitted with sbatch The reason of my code stopped was given as:
JOB 3216125 CANCELLED AT 2015-12-19T04:22:04 DUE TO TIME LIMIT
So, why my code was stopped if I did not exceed the time limit? I was asking this to the guys who were responsible for the cluster but they did not return.
Look at the value of DefaultTime in the output of scontrol show partitions. This is the maximum time that is allocated to your job in the case you do not specify it by yourself with --time.
Most probably this value is set to 2 minutes to force you to specify a sensible time limit (within the limits of the partition.)
Imagine you're building something like a monitoring service, which has thousands of tasks that need to be executed in given time interval, independent of each other. This could be individual servers that need to be checked, or backups that need to be verified, or just anything at all that could be scheduled to run at a given interval.
You can't just schedule the tasks via cron though, because when a task is run it needs to determine when it's supposed to run the next time. For example:
schedule server uptime check every 1 minute
first time it's checked the server is down, schedule next check in 5 seconds
5 seconds later the server is available again, check again in 5 seconds
5 seconds later the server is still available, continue checking at 1 minute interval
A naive solution that came to mind is to simply have a worker that runs every second or so, checks all the pending jobs and executes the ones that need to be executed. But how would this work if the number of jobs is something like 100 000? It might take longer to check them all than it is the ticking interval of the worker, and the more tasks there will be, the higher the poll interval.
Is there a better way to design a system like this? Are there any hidden challenges in implementing this, or any algorithms that deal with this sort of a problem?
Use a priority queue (with the priority based on the next execution time) to hold the tasks to execute. When you're done executing a task, you sleep until the time for the task at the front of the queue. When a task comes due, you remove and execute it, then (if its recurring) compute the next time it needs to run, and insert it back into the priority queue based on its next run time.
This way you have one sleep active at any given time. Insertions and removals have logarithmic complexity, so it remains efficient even if you have millions of tasks (e.g., inserting into a priority queue that has a million tasks should take about 20 comparisons in the worst case).
There is one point that can be a little tricky: if the execution thread is waiting until a particular time to execute the item at the head of the queue, and you insert a new item that goes at the head of the queue, ahead of the item that was previously there, you need to wake up the thread so it can re-adjust its sleep time for the item that's now at the head of the queue.
We encountered this same issue while designing Revalee, an open source project for scheduling triggered callbacks. In the end, we ended up writing our own priority queue class (we called ours a ScheduledDictionary) to handle the use case you outlined in your question. As a free, open source project, the complete source code (C#, in this case) is available on GitHub. I'd recommend that you check it out.
I have a map-reduce job to be run on the Amazon EMR. I would like to have up to 400 mappers and reducers and I would like to use either Medium or Large instances. How can I estimate the number of instances I need.
Besides, if one job ends within 2 minutes, let's say, and I run another job which take 4 minutes, will I be charged for 2 hours or that's considered 1 hour?
I know if you use the CLI tool to create your Job Flow and add the steps, then you can run both of the steps one after another on the same job flow and they will be counted within the same hour.
I believe if you use the GUI then you can not re-use the job flow and so you may get charged one hour for each job. I haven't tried this though so may be wrong there.
Check this article which is where I got the information:
https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce
We have three EC2 instances—one in each availability zone (AZ) in the eu-west-1 region. They are loadbalanced using ELB. We'd like to monitor how many instances are registered at the loadbalancer, using CloudWatch. The problem ist: I don't really understand the HealthyHostCount metric.
For a deployment, we'd like to be able to de-register a single instance (take it out of the LB) without being notified. So the alarm would be: Notify if there is only 1 healthy instance left behind the loadbalancer for 5 minutes.
As far as I understand, HealthyHostCount (HHC) is the number of healthy instances that are registered with a given ELB, averaged over all AZs. If everything is okay, the HHC should be 1 (no matter over what period of time) because there is 1 instance in each AZ.
A couple of days ago, someone deployed without re-registering the instances, so there was only 1 instance being balanced. When we noticed that, we created an alarm that was to notify us when the average HHC sunk below 0.6 after 5 minutes. (If only 1 instance is registered in ELB, the HHC should average 0.33 for any period of time.) However, the alarm never changed to state "ALARM."
When I checked the HHC in CloudWatch, the HHC were numbers that didn't make sense (sum of 10.0 for a 5-minute interval is all I remember now).
It's all a big mess to me. Any time I think I understand the metric, the CloudWatch charts are all gibberish to me.
Could someone please explain how to use HHC to get an alarm when only 1 instance is registered? Is average HHC the way to go or should I use another metric?
The HealthyHostCount metric records one data value with the count of available hosts for each availability zone, each time a health check is executed. Your ELB health check has an Interval parameter that defines how many health checks are executed per minute.
If you are watching a Per-AZ metric, with a health check Interval of 10 seconds, with 2 healthy hosts in that AZ, you will see 6 data points per minute (60/10) with a value of 2. The average, max and min will be 2, but the sum will be 6*2=12.
If you have 3 AZs with 2 hosts each, again with an Interval=10, but you are looking at the Per-LB metric, you will see 3*6=18 data points per minute, each one with a value of 2. The average, max and min will be 2, but the sum will be 18*2=36
I recommend you to set-up an interval value that can divide 60 seconds (either 5, 6, 10, 15, 20, 30 or 60 seconds).
In your case, if your interval is 30 seconds, and you have 3 AZs and 1 server per AZ: You should expect 2 data points per AZ per minute, so set-up an alarm Per-LB, with a Period of 1 minute, for Sum of HealthyHostCount that triggers when value is LowerOrEqual than 2 (2 data values * 1 Healthy AZ * 1 healthy server = 2, the other 4 data values of the unhealthy AZs should be 0 so they won't affect the sum).
UPDATE:
It turns out that the number of health check executed also depends on the number of internal instances that shapes the ELB (ussually one per AZ), so if you are suffering a traffic spike, or enough load to saturate a single elb-internal-instance, the amount of internal servers inside the ELB will grow and you will have more data points unexpectedly. This may affect the sum value, only if you have lots of traffic. I didn't saw this issue with a peak load of 6k RPM distributed in 3 AZs. If this is your scenario, then using average is a safer bet, but I would recommend that you use LowerThan 0.65 as your threshold.
The link also makes me wonder how does the Cross-Zone Load Balancing feature affects the amount of data points...
This is an area where the CloudWatch web console doesn't expose everything that cloud watch can do. As the docs explain, HealthyHostCount is a per availability zone metric. The console lets you have HealthHostCount by availability zone (but across all load balancers) or by load balancer (but across all zones) but not sliced both ways.
If you only have one load balancer the simplest thing would be to setup one alarm on each of the per zone metrics. If you have multiple availability zones then you should be able to use the api to create an alarm slicing across availability zone and load balancer (again, one alarm per load balancer) but you can't do this from the web UI as far as I know.