I have been running an EC2 workload for 60 days. The maximum utilization is around 90% and the average utilization is around 30%. Although the maximum is 90%, it only spiked 3 or 4 times over the 60 days. In this kind of scenario, which mathematical model will help me find the optimal capacity?
I am looking for a statistical model/algorithm to determine the optimal capacity.
I have an AWS EC2 t2.micro instance. CPU Credit Usage shows that I am using 1.1 credits every hour (using the Sum statistic), but the CPU Credit Balance shows a 0.5 credit decrease per hour.
My understanding is that a micro instance earns 3 credits per hour, so the balance should only decrease if credit usage is more than 3 per hour.
However, I am only using 1.1 credits per hour. Why does the balance decrease?
AWS has an answer for this:
For example, if a t2.small instance had a CPU utilization of 5% for the hour, it would have used 3 CPU credits (5% of 60 minutes), but it would have earned 12 CPU credits during the hour, so the difference of 9 CPU credits would be added to the CPU credit balance. Any CPU credits in the balance that reached their 24 hour expiration date during that time (which could be as many as 12 credits if the instance was completely idle 24 hours ago) would also be removed from the balance. If the amount of credits expired is greater than those earned, the credit balance will go down; conversely, if the amount of credits expired is fewer than those earned, the credit balance will go up.
See link: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/t2-instances.html
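To see the accounting concretely: the hourly change in the balance is credits earned minus credits used minus credits expired. Here is a minimal Python sketch using the 3 credits/hour earn rate stated in the question; the expired figure of 2.4 is not a CloudWatch value, it is simply what the question's numbers (earn 3, use 1.1, net -0.5) imply under the expiration rule quoted above.

# Hourly CPU-credit accounting under the (pre-2018) t2 model, where unspent
# credits expire 24 hours after being earned.
def hourly_balance_change(earned, used, expired):
    """Net change in the CPU credit balance over one hour."""
    return earned - used - expired

earned  = 3.0   # credits earned this hour (earn rate from the question)
used    = 1.1   # credits consumed this hour (CPUCreditUsage, Sum statistic)
expired = 2.4   # credits earned 24 hours ago that were never spent (assumed)
print(hourly_balance_change(earned, used, expired))   # -0.5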
I want to find the server system reliability based on multi-objective optimization. The optimization is based on budget and duration of server usage.
Here is the situation:
Say a company wants to build a cloud storage system with a specific budget (B). Based on the budget and the price of each server, they determine how many servers can be purchased.
For example:
Budget: $100,000
Server Cost: $18,000
Total servers that can be purchased: 5
Based on that, the company wants to find the maximum server reliability for a given number of servers and duration. They set a specific target duration, for example 10 years, and a target reliability, for example 99.9%. According to the reliability calculation, the highest number of servers gives the best reliability, but given the target duration and target reliability, the minimum number of servers that achieves 99.9% reliability within 10 years is selected.
Here is the formula to find the reliability:
R = 1 - q^n
R is the reliability
q is the failure probability of a single server
n is the number of servers
Assume the failure rate of every server is identical (the same failure rate)
For example, a server has a failure rate of 0.2% per 1000 hours, which means the device will fail to operate twice within one million hours. Consider a server that operates 24 hours a day throughout the year.
q = (0.2/100) * (1/1000) * 24 * 365
q = 0.01752
So the reliability of 1 server in one year is
R = 1 - 0.01752
R = 0.98248 which is 98.2%
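The same arithmetic as a small Python sketch (the function names are mine, for illustration):

# R = 1 - q**n, where q is the failure probability of one server over the period.
def single_server_failure_prob(fails_per_million_hours, hours):
    """Failure probability of one server over the given operating hours."""
    return fails_per_million_hours * 1e-6 * hours

def reliability(q, n):
    """Probability that at least one of n identical servers is still working."""
    return 1 - q ** n

hours_per_year = 24 * 365
q = single_server_failure_prob(2, hours_per_year)   # 0.01752
print(reliability(q, 1))                             # 0.98248, i.e. 98.2%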
I have made a calculation, but it is not formulated as a linear program.
For example (I wanted to provide a screenshot but don't have enough reputation):
Budget (b) = 100,000
Server cost (Sc) = 18,000
Total servers (n) = 5
Failures per million hours (nF) = 2
Million hours (mh) = 10^-6
Duration to check reliability (y) = 10 years
Target reliability = 99.9%
Calculation:
Find the total servers: n = b / Sc
Find the reliability:
R = Rs = 1 - [(nF * mh) * (24 * 365 * y)]^n
The output from the calculation:
1 server is not reliable
2 servers are not reliable
3 servers are not reliable
4 servers are reliable
5 servers are reliable
I want to convert this to a linear program, so that the target output displayed is 4.
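The reliability constraint R = 1 - q^n >= 99.9% is not linear in n, but taking logarithms turns it into n * ln(q) <= ln(1 - R_target), which is linear and is what you would hand to an integer/linear programming solver together with the budget bound n <= b/Sc. With a single integer variable you can also just compute the answer directly. A sketch in Python, using the numbers above (the function name is mine):

import math

def min_servers(b=100_000, Sc=18_000, nF=2, mh=1e-6, y=10, R_target=0.999):
    hours = 24 * 365 * y                 # operating hours over the duration
    q = nF * mh * hours                  # failure probability of one server = 0.1752
    n_max = b // Sc                      # budget bound: at most 5 servers
    # linearized reliability constraint: n >= ln(1 - R_target) / ln(q)
    n_min = math.ceil(math.log(1 - R_target) / math.log(q))
    if n_min > n_max:
        raise ValueError("target reliability not reachable within the budget")
    return n_min

print(min_servers())   # 4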
I know that Little's Law states (paraphrased):
the average number of things in a system is the product of the average rate at which things leave the system and the average time each one spends in the system,
or:
n = x * (r + z)
x - throughput
r - response time
z - think time
r + z - average time each one spends in the system
Now I have a question about a problem from Programming Pearls:
Suppose that a system makes 100 disk accesses to process a transaction (although some systems require fewer, some will require several hundred disk accesses per transaction). How many transactions per hour per disk can the system handle?
Assumption: a disk access takes 20 milliseconds.
Here is the solution to this problem:
Ignoring slowdown due to queuing, 20 milliseconds (of the seek time) per disk operation gives 2 seconds per transaction or 1800 transactions per hour
I am confused because I did not understand the solution to this problem.
Please help.
It will be more intuitive if you forget about that formula and think that the rate at which you can do something is inversely proportional to the time that it takes you to do it. For example, if it takes you 0.5 hour to eat a pizza, you eat pizzas at a rate of 2 pizzas per hour because 1/0.5 = 2.
In this case the rate is the number of transactions per time and the time is how long a transaction takes. According to the problem, a transaction takes 100 disk accesses, and each disk access takes 20 ms. Therefore each transaction takes 2 seconds total. The rate is then 1/2 = 0.5 transactions per second.
Now, more formally:
The rate of transactions per second R is inversely proportional to the transaction time in seconds TT.
R = 1/TT
The transaction time TT in this case is:
TT = disk access time * number of disk accesses per transaction =
20 milliseconds * 100 = 2000 milliseconds = 2 seconds
R = 1/2 transactions per second
= 3600/2 transactions per hour
= 1800 transactions per hour
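The same calculation in a few lines of Python (the constants are the ones given in the problem):

disk_access_ms = 20                                    # time per disk access
accesses_per_txn = 100                                 # disk accesses per transaction

txn_time_s = disk_access_ms * accesses_per_txn / 1000  # 2.0 seconds per transaction
rate_per_second = 1 / txn_time_s                       # 0.5 transactions per second
rate_per_hour = 3600 * rate_per_second                 # 1800 transactions per hour
print(rate_per_hour)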
I have a few hundred network devices that check in to our server every 10 minutes. Each device has an embedded clock, counting the seconds and reporting elapsed seconds on every check in to the server.
So, sample data set looks like
CheckinTime Runtime
2010-01-01 02:15:00.000 101500
2010-01-01 02:25:00.000 102100
2010-01-01 02:35:00.000 102700
etc.
If the device reboots, when it checks back into the server, it reports a runtime of 0.
What I'm trying to determine is some sort of quantifiable metric for the device's "health".
If a device has rebooted a lot in the past but has not rebooted in the last xx days, then it is considered healthy, compared to a device that has a big uptime except for the last xx days where it has repeatedly rebooted.
Also, a device that has been up for 30 days and just rebooted shouldn't be considered "distressed" compared to a device that has continually rebooted every 24 hrs or so for the last xx days.
I've tried multiple ways of calculating the health, using a variety of metrics:
1. average # of reboots
2. max(uptime)
3. avg(uptime)
4. # of reboots in last 24 hrs
5. # of reboots in last 3 days
6. # of reboots in last 7 days
7. # of reboots in last 30 days
Each individual metric only accounts for one aspect of the device health, but doesn't take into account the overall health compared to other devices or to its current state of health.
Any ideas would be GREATLY appreciated.
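For reference, the reboots themselves can be recovered from the check-in data by looking for the Runtime counter dropping instead of increasing. A minimal sketch, assuming the check-ins are available as (timestamp, runtime) pairs like the sample above (the function names are mine):

from datetime import timedelta

def reboot_times(checkins):
    """checkins: list of (datetime, runtime_seconds) pairs, ordered by time."""
    reboots = []
    for (_, prev_runtime), (ts, runtime) in zip(checkins, checkins[1:]):
        if runtime < prev_runtime:        # counter reset back toward 0 => reboot
            reboots.append(ts)
    return reboots

def reboots_in_last(reboots, now, days):
    """Metrics like #4-#7 above: number of reboots in a trailing window."""
    cutoff = now - timedelta(days=days)
    return sum(1 for ts in reboots if ts >= cutoff)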
You could do something like Windows 7's reliability metric: start out at full health (say 10). Every hour / day / check-in cycle, increment the health by (10 - currenthealth) * incrementfactor. Every time the server goes down, subtract a certain percentage.
So, given a crashfactor of 20%/crash and an incrementfactor of 10%/day:
A device that has rebooted a lot in the past but has not rebooted in the last 20 days will have a health of about 8.6.
A device with a big uptime except for the last 2 days, in which it rebooted 5 times, will have a health of about 4.1.
A device that has been up for 30 days and just rebooted will have a health of 8.
A device that has continually rebooted every 24 hrs or so for the last 10 days will have a health of about 3.9.
To run through an example:
Starting at 10
Day 1: no crash, new health = currenthealth + (10 - currenthealth)*.1 = 10
Day 2: one crash, new health = currenthealth - currenthealth*.2 = 8
But still increment every day, so new health = 8 + (10 - 8)*.1 = 8.2
Day 3: No crash, new health = 8.4
Day 4: Two crashes, new health = 5.8
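A minimal sketch of this scheme in Python, using the 20%/crash and 10%/day factors from the example (the function and parameter names are mine):

def update_health(health, crashes_today, crash_factor=0.2, increment_factor=0.1, max_health=10.0):
    """Apply the day's crashes, then recover part of the gap back toward max_health."""
    for _ in range(crashes_today):
        health -= health * crash_factor
    return health + (max_health - health) * increment_factor

health = 10.0
for crashes in [0, 1, 0, 2]:              # the four-day example above
    health = update_health(health, crashes)
    print(round(health, 1))               # 10.0, 8.2, 8.4, 5.8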
You might take the reboot count / t of a particular machine and compare that to the mean and standard deviation of the entire population. Devices that fall, say, three standard deviations above the mean (i.e., rebooting more often) could be flagged.
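A sketch of that comparison, assuming a per-device reboot rate (e.g. reboots per day over the device's observed lifetime) has already been computed; the names are illustrative:

from statistics import mean, stdev

def flag_outliers(rates, threshold=3.0):
    """rates: dict of device id -> reboot rate. Flags devices far above the mean."""
    mu, sigma = mean(rates.values()), stdev(rates.values())
    return [dev for dev, rate in rates.items()
            if sigma > 0 and (rate - mu) / sigma > threshold]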
You could use weighted average uptime and include the current uptime only when it would make the average higher.
The weight would be how recent the uptime is, so that most recent uptimes have the biggest weight.
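A sketch of that idea with a simple linear recency weight (the exact weighting scheme is an assumption; the point is only that newer uptimes weigh more and the in-progress uptime counts only when it helps):

def weighted_avg_uptime(uptimes, current_uptime):
    """uptimes: completed uptimes in seconds, oldest first; current_uptime: uptime since the last reboot."""
    if not uptimes:
        return current_uptime
    weights = list(range(1, len(uptimes) + 1))     # most recent completed uptime weighs most
    weighted_sum = sum(w * u for w, u in zip(weights, uptimes))
    avg = weighted_sum / sum(weights)
    w_cur = len(uptimes) + 1                       # the current uptime would weigh the most of all
    avg_with_current = (weighted_sum + w_cur * current_uptime) / (sum(weights) + w_cur)
    # include the still-running uptime only when it raises the average
    return max(avg, avg_with_current)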
Are you able to break the devices out into groups of similar devices? Then you could compare an individual device to its peers.
Another suggestion is to look into various moving average algorithms. These are supposed to smooth out time-series data as well as highlight trends.
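For example, an exponential moving average over a daily reboot-count series (a sketch; the smoothing factor is arbitrary):

def ema(series, alpha=0.3):
    """Exponential moving average: recent values dominate, old spikes fade gradually."""
    smoothed = []
    for x in series:
        prev = smoothed[-1] if smoothed else x
        smoothed.append(alpha * x + (1 - alpha) * prev)
    return smoothed

print(ema([0, 0, 1, 0, 5, 4, 0, 0]))   # a burst of reboots shows up, then decays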
Does it always report a runtime of 0 on reboot? Or something close to zero (less than the former time, anyway)?
You could calculate this in two ways:
1. The lower the number, the fewer troubles it had.
2. The higher the number, the longer the uptime periods it achieved.
I guess you need to account for the fact that the health can vary, so it can worsen over time. The latest values should therefore have a higher weight than the older ones, which suggests an exponentially growing weight.
The more reboots it had in the last period, the more broken the system could be. But also look at the intervals between the reboots: 5 reboots in a day versus 10 reboots in 2 weeks mean very different things. So time should be part of the formula, as well as the number of reboots.
I guess you need to calculate the density of reboots in the last period.
You can apply the weight simply by dividing: the larger the number you divide by, the lower the result, and so the lower the weight of that number becomes.
Pseudo code (written here as a runnable Python sketch):
import time

def calc_health(machine, threshold=800.0):
    value = 0.0
    for reboot in machine.reboots:
        # the more days past, the lower the contribution, so the lower the weight
        days_past = max((time.time() - reboot.time) / 86400, 1)  # seconds -> days, at least 1 to avoid dividing by zero
        value += 100 / days_past
    # no recorded reboots: return 0 rather than dividing by zero
    return 0 if value == 0 else threshold / value
You could extend this function by, for example, filtering out reboots older than some maximum days_past and tuning the threshold.
The formula is based on the plot of f(x) = 100/x. As you can see, the value is high for small x and low for large x. That is how the formula weights days_past: a lower days_past means a lower x, and therefore a higher weight.
With the value += the formula accumulates the reboots, and with the 100/x part it weights each reboot by its age.
At the return, the threshold is divided by the value, because the higher the reboot score, the lower the result must be.
You can use a plotting program or calculator to see how the curve bends, which is also how the weight of days_past falls off.
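A hypothetical usage of the function above (SimpleNamespace just stands in for whatever objects actually hold the machine and its reboot history):

from types import SimpleNamespace
import time

now, day = time.time(), 86400
recent = SimpleNamespace(reboots=[SimpleNamespace(time=now - n * day) for n in (1, 2, 3)])
old    = SimpleNamespace(reboots=[SimpleNamespace(time=now - n * day) for n in (60, 90, 120)])
print(calc_health(recent))   # low score: three reboots within the last few days
print(calc_health(old))      # much higher score: the same number of reboots, but long ago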