Calculating Uptime Percentage for a multi-node Cluster - cluster-computing

I have a two-node cluster that I need to track.
Cluster is expected to be up when atleast one of these nodes is up.
What would be a good approach to track cluster uptime percent?
I'm trying to set up a formula like:
cluster_uptime % = 1 - ( total seconds when both nodes were down / total seconds both nodes were connected (up/down) ) * 100
I already have a Prometheus stack connected to my nodes. But, I'm not really able to figure out the query functions that I need to use.
I have a couple of customer metrics setup on my nodes:
# HELP agent_metricsfile_last_updated Time when agent updated metrics file, unix epoch time
# TYPE agent_metricsfile_last_updated gauge
agent_metricsfile_last_updated{platform_version="6.4"} 1.6561110690828052e+09
# HELP platform_uptime_state node status is 1 when up, 0 otherwise
# TYPE platform_uptime_state gauge
platform_uptime_state{platform_version="6.4"} 1
# HELP platform_downtime_seconds_total Total seconds of node downtime
# TYPE platform_downtime_seconds_total counter
platform_downtime_seconds_total{platform_version="6.4"} 45
# HELP downtime_tracking_start_time Time when agent started to track platform_downtime_seconds_total, unix epoch time
# TYPE downtime_tracking_start_time gauge
downtime_tracking_start_time{platform_version="6.4"} 1.6561123625754898e+09
I expected to use platform_uptime_state to figure out my uptime, but am not able to formulate an expression.
Is there a better tool/approach to calculate cluster uptime?

Related

What does a node hour mean in Vertex AI and how do I estimate how many I'll need for a job?

Much of Vertex AI's pricing is calculated per node hour. What is a node hour and how do I go about estimating how many I'll need for a given job?
A node hour represents the time a virtual machine spends running your prediction job or waiting in a ready state to handle prediction or explanation requests. The cost of one node running for one hour is a node hour.
The price of a node hour varies across regions and by operation.
You can consume node hours in fractional increments. For example, one node running for 30 minutes costs 0.5 node hours.
There are tables in the pricing documentation that can help you estimate your costs, and you can use the Cloud Pricing Calculator.
You can also use Billing Reports to monitor your usage.

Throughput in Apache storm

I want to know exactly throughput in apache Storm. Is it the number of tuples processed/Total time?
If so what is the total number of tuples emitted? I am not getting exact significance of total tuples emitted/Time. Please let me know.
You need to look at the execute count of the sink bolts (the end ones in your topology that don't connect to any other bolts). This is the throughput and is reported for the last 10 mins, 3 hrs, 1 day and all time. Dividing the values by the time period in seconds will give you the throughput in tuples per second.

How do I use ELB's HealthyHostCount for monitoring in CloudWatch?

We have three EC2 instances—one in each availability zone (AZ) in the eu-west-1 region. They are loadbalanced using ELB. We'd like to monitor how many instances are registered at the loadbalancer, using CloudWatch. The problem ist: I don't really understand the HealthyHostCount metric.
For a deployment, we'd like to be able to de-register a single instance (take it out of the LB) without being notified. So the alarm would be: Notify if there is only 1 healthy instance left behind the loadbalancer for 5 minutes.
As far as I understand, HealthyHostCount (HHC) is the number of healthy instances that are registered with a given ELB, averaged over all AZs. If everything is okay, the HHC should be 1 (no matter over what period of time) because there is 1 instance in each AZ.
A couple of days ago, someone deployed without re-registering the instances, so there was only 1 instance being balanced. When we noticed that, we created an alarm that was to notify us when the average HHC sunk below 0.6 after 5 minutes. (If only 1 instance is registered in ELB, the HHC should average 0.33 for any period of time.) However, the alarm never changed to state "ALARM."
When I checked the HHC in CloudWatch, the HHC were numbers that didn't make sense (sum of 10.0 for a 5-minute interval is all I remember now).
It's all a big mess to me. Any time I think I understand the metric, the CloudWatch charts are all gibberish to me.
Could someone please explain how to use HHC to get an alarm when only 1 instance is registered? Is average HHC the way to go or should I use another metric?
The HealthyHostCount metric records one data value with the count of available hosts for each availability zone, each time a health check is executed. Your ELB health check has an Interval parameter that defines how many health checks are executed per minute.
If you are watching a Per-AZ metric, with a health check Interval of 10 seconds, with 2 healthy hosts in that AZ, you will see 6 data points per minute (60/10) with a value of 2. The average, max and min will be 2, but the sum will be 6*2=12.
If you have 3 AZs with 2 hosts each, again with an Interval=10, but you are looking at the Per-LB metric, you will see 3*6=18 data points per minute, each one with a value of 2. The average, max and min will be 2, but the sum will be 18*2=36
I recommend you to set-up an interval value that can divide 60 seconds (either 5, 6, 10, 15, 20, 30 or 60 seconds).
In your case, if your interval is 30 seconds, and you have 3 AZs and 1 server per AZ: You should expect 2 data points per AZ per minute, so set-up an alarm Per-LB, with a Period of 1 minute, for Sum of HealthyHostCount that triggers when value is LowerOrEqual than 2 (2 data values * 1 Healthy AZ * 1 healthy server = 2, the other 4 data values of the unhealthy AZs should be 0 so they won't affect the sum).
UPDATE:
It turns out that the number of health check executed also depends on the number of internal instances that shapes the ELB (ussually one per AZ), so if you are suffering a traffic spike, or enough load to saturate a single elb-internal-instance, the amount of internal servers inside the ELB will grow and you will have more data points unexpectedly. This may affect the sum value, only if you have lots of traffic. I didn't saw this issue with a peak load of 6k RPM distributed in 3 AZs. If this is your scenario, then using average is a safer bet, but I would recommend that you use LowerThan 0.65 as your threshold.
The link also makes me wonder how does the Cross-Zone Load Balancing feature affects the amount of data points...
This is an area where the CloudWatch web console doesn't expose everything that cloud watch can do. As the docs explain, HealthyHostCount is a per availability zone metric. The console lets you have HealthHostCount by availability zone (but across all load balancers) or by load balancer (but across all zones) but not sliced both ways.
If you only have one load balancer the simplest thing would be to setup one alarm on each of the per zone metrics. If you have multiple availability zones then you should be able to use the api to create an alarm slicing across availability zone and load balancer (again, one alarm per load balancer) but you can't do this from the web UI as far as I know.

Auto Scaling - How to ensure that the instances run for nearly an hour?

By using multiple autoscaling policies for scale up and scale down can I ensure that my Instances run for nearly an hour ?
How are instances terminated in a Autoscaling group ? Can I specify instance with the lowest CPU utilization to be terminated ?
Thank you
I'm inferring that your goal here is to keep servers around for as much of a full hour once they've been started up.
One possible approach would be to add a custom metric that report once a minute what your uptime is, and then set a policy to downscale any instances with:
CPU < x% and uptime > 50 minutes and uptime < 60 minutes

Creating a formula for calculating device "health" based on uptime/reboots

I have a few hundred network devices that check in to our server every 10 minutes. Each device has an embedded clock, counting the seconds and reporting elapsed seconds on every check in to the server.
So, sample data set looks like
CheckinTime Runtime
2010-01-01 02:15:00.000 101500
2010-01-01 02:25:00.000 102100
2010-01-01 02:35:00.000 102700
etc.
If the device reboots, when it checks back into the server, it reports a runtime of 0.
What I'm trying to determine is some sort of quantifiable metric for the device's "health".
If a device has rebooted a lot in the past but has not rebooted in the last xx days, then it is considered healthy, compared to a device that has a big uptime except for the last xx days where it has repeatedly rebooted.
Also, a device that has been up for 30 days and just rebooted, shouldn't be considered "distressed", compared to a device that has continually rebooted every 24 hrs or so for the last xx days.
I've tried multiple ways of calculating the health, using a variety of metrics:
1. average # of reboots
2. max(uptime)
3. avg(uptime)
4. # of reboots in last 24 hrs
5. # of reboots in last 3 days
6. # of reboots in last 7 days
7. # of reboots in last 30 days
Each individual metric only accounts for one aspect of the device health, but doesn't take into account the overall health compared to other devices or to its current state of health.
Any ideas would be GREATLY appreciated.
You could do something like Windows' 7 reliability metric - start out at full health (say 10). Every hour / day / checkin cycle, increment the health by (10 - currenthealth)*incrementfactor). Every time the server goes down, subtract a certain percentage.
So, given a crashfactor of 20%/crash and an incrementfactor of 10%/day:
If a device has rebooted a lot in the past but has not rebooted in the last 20 days will have a health of 8.6
Big uptime except for the last 2 days where it has repeatedly rebooted 5 times will have a health of 4.1
a device that has been up for 30 days and just rebooted will have a health of 8
a device that has continually rebooted every 24 hrs or so for the last 10 days will have a health of 3.9
To run through an example:
Starting at 10
Day 1: no crash, new health = CurrentHealth + (10 - CurrentHealth)*.1 = 10
Day 2: One crash, new health = currenthealth - currentHealth*.2 = 8
But still increment every day so new health = 8 + (10 - 8)*.1 = 8.2
Day 3: No crash, new health = 8.4
Day 4: Two crashes, new health = 5.8
You might take the reboot count / t of a particular machine and compare that to the standard deviation of the entire population. Those that fall say three standard deviations from the mean, where it's rebooting more often, could be flagged.
You could use weighted average uptime and include the current uptime only when it would make the average higher.
The weight would be how recent the uptime is, so that most recent uptimes have the biggest weight.
Are you able to break the devices out into groups of similar devices? Then you could compare an individual device to its peers.
Another suggestions is to look in to various Moving Average algorithms. These are supposed to smooth out time-series data as well as highlight trends.
Does it always report it a runtime of 0, on reboot? Or something close to zero (less then former time anyway)?
You could calculate this two ways.
1. The lower the number, the less troubles it had.
2. The higher the number, it scored the largest periods.
I guess you need to account, that the health can vary. So it can worsen over time. So the latest values, should have a higher weight then the older ones. This could indicate a exponential growth.
The more reboots it had in the last period, the more broken the system could be. But also looking at shorter intervals of the reboots. Let's say, 5 reboots a day vs. 10 reboots in 2 weeks. That does mean a lot different. So I guess time should be a metric as well as the amount of reboots in this formula.
I guess you need to calculate the density of the amount of reboots in the last period.
You can use the weight of the density, by simply dividing. Because how larger the number is, on which you divide, how lower the result will be, so how lower the weight of the number can become.
Pseudo code:
function calcHealth(machine)
float value = 0;
float threshold = 800;
for each (reboot in machine.reboots) {
reboot.daysPast = time() - reboot.time;
// the more days past, the lower the value, so the lower the weight
value += (100 / reboot.daysPast);
}
return (value == 0) ? 0 : (threshold / value);
}
You could advance this function by for example, filtering for maxDaysPast and playing with the threshold and stuff like that.
This formula is based on this plot: f(x) = 100/x. As you see, on low numbers (low x value), the value is higher, then on large x value. So that's on how this formula calculates the weight of the daysPast. Because lower daysPast == lower x == heigher weight.
With the value += this formula counts the reboots and with the 100/x part it gives weight to the reboot, on where the weight is the time.
At the return, the threshold is divided through the value. This is because, the higher the score of the reboots, the lower the result must be.
You can use a plotting program or calculator, to see the bending of the plot, which is also the bending of the weight of the daysPast.

Resources