I have two systems in parallel, each with a 99.9% uptime throughout the year. What's the overall uptime? - high-availability

If two systems in parallel have the same uptime of 99.9% over the year then how can I determine the uptime of the overall system?

0.999 * 0.999 = 0.998001 is the chance that both of them will be up (e.g. one is a user server and one is a product server).
The more PCs you add to the system, the bigger the chance that at least one of them goes down. A similar formula is used to calculate RAID reliability.
EDIT
If you consider the system stable while at least one node is up (e.g. 2 instances of the same service), then the chance is 100% - (0.1% * 0.1%) = 100% - 0.0001% = 99.9999%.
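Both cases can be checked numerically. This is a minimal sketch that assumes the two systems fail independently of each other, which real deployments often do not; the function names are made up for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365


def series_availability(availabilities):
    """Every component must be up, so the availabilities multiply."""
    return math.prod(availabilities)


def parallel_availability(availabilities):
    """One component being up is enough: 1 minus the chance that all are down."""
    return 1.0 - math.prod(1.0 - a for a in availabilities)


def downtime_hours_per_year(availability):
    """Expected hours of downtime in a 365-day year."""
    return (1.0 - availability) * HOURS_PER_YEAR


for name, func in (("series", series_availability), ("parallel", parallel_availability)):
    a = func([0.999, 0.999])
    print(f"{name}: {a:.6%} available, about {downtime_hours_per_year(a):.3f} hours down per year")
```

If the two systems share a power supply, network or deployment pipeline, their failures are correlated and the parallel figure will be optimistic.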

Related

Resources needed for simulating 5000 virtual users, sending every 5 seconds, using Locust

I am trying to simulate 5000 virtual users using Locust, with each user sending a message every 5 seconds. What resources are needed, in terms of EC2 specifications, to achieve this with some level of concurrency?
Number of users is not so important (in my experience, at least when talking about less than a couple of Users per worker), the only thing that matters is the number of requests per second.
Because the performance depends on exactly what your tests do, it is impossible to give a hard number. But the manual gives some best-case figures on what you can do with FastHttpUser/HttpUser
https://docs.locust.io/en/stable/increase-performance.html#increase-locust-s-performance-with-a-faster-http-client
It is impossible to say what your particular hardware can handle, but in a best case scenario you should be able to do close to 5000 requests per second per core, instead of around 850 for the normal HttpUser (tested on a 2018 MacBook Pro i7 2.6GHz)
If your test plan is reasonably simple then you should be fine running at ~50% of that load.
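As a rough back-of-the-envelope check, the numbers above can be turned into a core count. This sketch only does the arithmetic: it takes the best-case per-core figures quoted above at face value and applies the ~50% headroom; the function name and parameters are hypothetical, and a real test plan may need far more.

```python
import math


def cores_needed(users, seconds_between_requests, best_case_rps_per_core, headroom=0.5):
    """Return (target requests per second, estimated cores) for a simple test plan."""
    target_rps = users / seconds_between_requests
    # Only plan to use a fraction of the best-case throughput of each core.
    usable_rps_per_core = best_case_rps_per_core * headroom
    return target_rps, math.ceil(target_rps / usable_rps_per_core)


# 5000 users, one request every 5 seconds each.
for client, rps_per_core in (("FastHttpUser", 5000), ("HttpUser", 850)):
    target, cores = cores_needed(5000, 5, rps_per_core)
    print(f"{client}: {target:.0f} req/s target, roughly {cores} core(s)")
```

The only way to confirm the estimate is to run the test and watch CPU usage on the load generator.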

Convert Situation Into Linear Programming

I want to find server system reliability using multi-objective optimization. The optimization will be based on budget and duration of server usage.
Here is the situation.
Let's say a company wants a cloud storage system within a specific budget (B). Based on the budget, they determine how many servers can be purchased, depending on the price of each server.
For example:
Budget: $100,000
Server Cost: $18,000
Total servers that can be purchased: 5
Based on that, the company wants to find the maximum server reliability as a function of the number of servers and the duration. They set a target duration, for example 10 years, and a target reliability, for example 99.9%. According to the reliability calculation, using more servers always gives better reliability, but given the target duration and target reliability, the minimum number of servers that achieves 99.9% reliability over 10 years is selected.
Here is the formula to find the reliability:
R = 1 - q^n
R is the reliability
q is the failure rate
n is the number of servers
Assume the failure rate is identical for every server.
For example, a server has a failure rate of 0.2% per 1000 hours, which means the device will fail two times within one million hours. Assume the server operates 24 hours a day for a year.
q = (0.2/100) * (1/1000) * 24 * 365
q = 0.01752
So the reliability of 1 server in one year is
R = 1 - 0.01752
R = 0.98248 which is 98.2%
I have made a calculation, but it is not in linear programming form.
For example (I wanted to provide a screenshot but do not have enough reputation),
Budget(b) = 100,000
Server cost(Sc) = 18,000
Total server (n)= 5
Fail per million hours (nF)= 2
million hours (mh) = 10^-6
Duration to check reliability (y) = 10 years
Target reliability = 99.9%
Calculation:
Find the total number of servers, n = b/Sc (rounded down)
Find reliability:
R = Rs = 1 - [(nF * mh) * (hours * day * y)]^n, with hours = 24 and day = 365
Output: from the calculation, I found that using
1 server is not reliable
2 servers is not reliable
3 servers is not reliable
4 servers is reliable
5 servers is reliable
I want to convert this to linear programming so that it returns the target output of 4.
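For reference, the calculation described above can be reproduced with a plain enumeration over n. Note that this is not a linear program, only a sketch of the search the question describes; the function names are made up, and the failure probability is the simple linear approximation used in the question.

```python
HOURS_PER_YEAR = 24 * 365


def failure_probability(failures_per_million_hours, years):
    """q for one server over the period, using the question's linear approximation."""
    return failures_per_million_hours * 1e-6 * HOURS_PER_YEAR * years


def reliability(q, n):
    """R = 1 - q^n for n identical servers."""
    return 1.0 - q ** n


def min_servers(budget, server_cost, failures_per_million_hours, years, target):
    """Smallest affordable n that meets the target, or None if no affordable n does."""
    max_servers = budget // server_cost
    q = failure_probability(failures_per_million_hours, years)
    for n in range(1, max_servers + 1):
        if reliability(q, n) >= target:
            return n
    return None


q10 = failure_probability(2, 10)
for n in range(1, 6):
    print(f"{n} server(s): R = {reliability(q10, n):.6f}")
print("minimum servers:", min_servers(100_000, 18_000, 2, 10, 0.999))
```

Because R = 1 - q^n is nonlinear in n, a linear formulation would need a transformation first, for example taking logarithms so that the constraint becomes n * log(q) <= log(1 - target).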

Optimum number of threads for a highly parallelizable problem

I parallelized a simulation engine in 12 threads to run it on a cluster of 12 nodes (each node running one thread). Since 12 systems are rarely all available at once, I also tweaked it for 6 threads (to run on 6 nodes), 4 threads (to run on 4 nodes), 3 threads (to run on 3 nodes), and 2 threads (to run on 2 nodes). I have noticed that the more nodes/threads I use, the greater the speedup. But obviously, the more nodes I use, the more expensive (in terms of cost and power) the execution becomes.
I want to publish these results in a journal, so I want to know if there are any laws/theorems which will help me decide the optimum number of nodes on which I should run this program.
Thanks,
Akshey
How have you parallelised your program, and what is inside each of your nodes?
For instance, on one of my clusters I have several hundred nodes each containing 4 dual-core Xeons. If I were to run an OpenMP program on this cluster I would place a single execution on one node and start up no more than 8 threads, one for each processor core. My clusters are managed by Grid Engine and used for batch jobs, so there is no contention while a job is running. In general there is no point in asking for more than one node on which to run an OpenMP job since the shared-memory approach doesn't work on distributed-memory hardware. And there's not much to be gained by asking for fewer than 8 threads on an 8-core node, I have enough hardware available not to have to share it.
If you have used a distributed-memory programming approach, such as MPI, then you are probably working with a number of processes (rather than threads) and may well be executing these processes on cores on different nodes, and be paying the costs in terms of communications traffic.
As @Blank has already pointed out, the most efficient way to run a program, if by efficiency one means 'minimising total cpu-hours', is to run the program on 1 core. Only. However, for jobs of mine which can take, say, a week on 256 cores, waiting 128 weeks for one core to finish its work is not appealing.
If you are not already familiar with the following terms, Google around for them or head for Wikipedia:
Amdahl's Law
Gustafson's Law
weak scaling
strong scaling
parallel speedup
parallel efficiency
scalability.
"if there are any laws/theorems which will help me to decide the optimum number of nodes on which I should run this program?"
There are no such general laws, because every problem has slightly different characteristics.
You can make a mathematical model of the performance of your problem on different numbers of nodes, knowing how much computational work has to be done, how much communication has to be done, and how long each takes. (The communication times can be estimated from the amount of communication and typical latency/bandwidth numbers for your nodes' type of interconnect.) This can guide you towards good choices.
These models can be valuable for understanding what is going on, but to actually determine the right number of nodes to run on for your code for some given problem size, there's really no substitute for running a scaling test - running the problem on various numbers of nodes and actually seeing how it performs. The numbers you want to see are:
Time to completion as a function of number of processors: T(P)
Speedup as a function of number of processors: S(P) = T(1)/T(P)
Parallel efficiency: E(P) = S(P)/P
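These three quantities are straightforward to tabulate once the timings are in hand. A minimal sketch follows; the timings are invented purely for illustration and are not measurements of any real code.

```python
def scaling_table(timings):
    """timings maps processor count P to wall-clock time T(P); it must include P = 1.

    Returns rows of (P, T(P), S(P), E(P)) sorted by P.
    """
    t1 = timings[1]
    rows = []
    for p in sorted(timings):
        speedup = t1 / timings[p]      # S(P) = T(1) / T(P)
        efficiency = speedup / p       # E(P) = S(P) / P
        rows.append((p, timings[p], speedup, efficiency))
    return rows


# Hypothetical timings in seconds, invented for illustration only.
example_timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0, 16: 10.0, 32: 8.0}

for p, t, s, e in scaling_table(example_timings):
    print(f"P={p:>2}  T={t:6.1f}s  S={s:5.2f}  E={e:6.1%}")
```

Plotting T(P) and E(P) side by side usually makes the point where efficiency falls off easy to spot.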
How do you choose the "right" number of nodes? It depends on how many jobs you have to run, and what's an acceptable use of computational resources.
So for instance, in plotting your timing results you might find that you have a minimum time to completion T(P) at some number of processors -- say, 32. So that might seem like the "best" choice. But when you look at the efficiency numbers, it might become clear that the efficiency started dropping precipitously long before that; and you only got (say) a 20% decrease in run time over running at 16 processors - that is, for 2x the amount of computational resources, you only got a 1.25x increase in speed. That's usually going to be a bad trade, and you'd prefer to run at fewer processors - particularly if you have a lot of these simulations to run. (If you have 2 simulations to run, for instance, in this case you could get them done in 1.25 time units instead of 2 time units by running the two simulations each on 16 processors simultaneously rather than running them one at a time on 32 processors).
On the other hand, sometimes you only have a couple runs to do and time really is of the essence, even if you're using resources somewhat inefficiently. Financial modelling can be like this -- they need the predictions for tomorrow's markets now, and they have the money to throw at computational resources even if they're not used 100% efficiently.
Some of these concepts are discussed in the "Introduction to Parallel Performance" section of any parallel programming tutorials; here's our example, https://support.scinet.utoronto.ca/wiki/index.php/Introduction_To_Performance
Increasing the number of nodes leads to diminishing returns. Two nodes is not twice as fast as one node; four nodes even less so than two. As such, the optimal number of nodes is always one; it is with a single node that you get most work done per node.

Google transit is too idealistic. How would you change that?

Suppose you want to get from point A to point B. You use Google Transit directions, and it tells you:
Route 1:
1. Wait 5 minutes
2. Walk from point A to Bus stop 1 for 8 minutes
3. Take bus 69 till stop 2 (15 minutes)
4. Wait 2 minutes
5. Take bus 6969 till stop 3 (12 minutes)
6. Walk from stop 3 till point B for 3 minutes.
Total time = 5 wait + 40 minutes.
Route 2:
1. Wait 10 minutes
2. Walk from point A to Bus stop I for 13 minutes
3. Take bus 96 till stop II (10 minutes)
4. Wait 17 minutes
5. Take bus 9696 till stop 3(12 minutes)
6. Walk 7 minutes from stop 3 till point B for 8 minutes.
Total time = 10 wait + 50 minutes.
All in all, Route 1 looks way better. However, what really happens in practice is that bus 69 is 3 minutes behind due to traffic, and I end up missing bus 6969. The next bus 6969 comes at least 30 minutes later, which amounts to 5 wait + 70 minutes (including a 30-minute wait in the cold or heat). Wouldn't it be nice if Google actually advertised this possibility? My question now is: what is the better algorithm for displaying the top 3 routes, given uncertainty in the schedule?
Thanks!
How about adding weightings that express a level of uncertainty for different types of journey elements?
Bus services in Dublin City are notoriously untimely, so you could add a 40% margin of error to anything to do with the Dublin Bus schedule, giving a best and worst case scenario. You could also factor in the chronic traffic delays at rush hours. Then a user could see that they may have a 20% or 80% chance of actually making a connection.
You could sort "best" journeys by the "most probably correct" factor, and include this data in the results shown to the user.
My two cents :)
For the UK rail system, each interchange node has an associated 'minimum transfer time to allow'. The interface to the route planner here then has an Advanced option allowing the user to either accept the default, or add half hour increments.
In your example, setting a 'minimum transfer time to allow' of, say, 10 minutes at step 2 would prevent Route 1 as shown being suggested. Of course, this means that the minimum possible journey time is increased, but that's the trade-off.
If you take uncertainty into account then there is no longer a "best route", but instead there can be a "best strategy" that minimizes the total time in transit; however, it can't be represented as a linear sequence of instructions but is more of the form of a general plan, i.e. "go to bus station X, wait until 10:00 for bus Y, if it does not arrive walk to station Z..." This would be notoriously difficult to present to the user (in addition to being computationally expensive to produce).
For a fixed sequence of instructions it is possible to calculate the probability that it actually works out; but what would be the level of certainty users want to accept? Would you be content with, say, 80% success rate? When you then miss one of your connections the house of cards falls down in the worst case, e.g. if you miss a train that leaves every second hour.
Many years ago I wrote a similar program to calculate long-distance bus journeys in Finland, and I just reported the transfer times assuming every bus was on schedule. Then basically every plan with less than 15 minutes or so of transfer time was disregarded as too risky (there were sometimes only one or two long-distance buses per day on a given route).
Empirically. Record the actual arrival times vs scheduled arrival times, and compute the mean and standard deviation for each. When considering possible routes, calculate the probability that a given leg will arrive late enough to make you miss the next leg, and make the average wait time P(on time)*T(first bus) + (1-P(on time))*T(second bus). This gets more complicated if you have to consider multiple legs, each of which could be late independently, and multiple possible next legs you could miss, but the general principle holds.
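For a single connection, the calculation above might look like the following sketch. It assumes that the first leg's delay is normally distributed with the recorded mean and standard deviation, which is a simplification (real delays tend to be skewed); the function names and the numbers are hypothetical.

```python
from statistics import NormalDist


def p_make_connection(mean_delay, sd_delay, slack):
    """Probability that the first leg's delay does not exceed the transfer slack (minutes)."""
    return NormalDist(mean_delay, sd_delay).cdf(slack)


def expected_wait(p_on_time, wait_first, wait_second):
    """P(on time)*T(first bus) + (1 - P(on time))*T(second bus)."""
    return p_on_time * wait_first + (1.0 - p_on_time) * wait_second


# Hypothetical numbers: the first bus runs 3 minutes late on average (sd 2 minutes),
# the transfer allows 2 minutes, and a missed connection means 30 extra minutes of waiting.
p = p_make_connection(mean_delay=3.0, sd_delay=2.0, slack=2.0)
wait = expected_wait(p, wait_first=2.0, wait_second=32.0)
print(f"P(make connection) = {p:.2f}, expected wait = {wait:.1f} minutes")
```

Routes could then be ranked by expected total time instead of scheduled time, with the connection probability shown next to each one.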
Catastrophic failure should be the first check.
This is especially important when you are trying to connect to that last bus of the day which is a critical part of the route. The rider needs to know that is what is happening so he doesn't get too distracted and knows the risk.
After that it could evaluate worst-case single misses.
And then, if you really wanna get fancy, take a look at the crime stats for the neighborhood or transit station where the waiting point is.

Creating a formula for calculating device "health" based on uptime/reboots

I have a few hundred network devices that check in to our server every 10 minutes. Each device has an embedded clock, counting the seconds and reporting elapsed seconds on every check in to the server.
So, sample data set looks like
CheckinTime Runtime
2010-01-01 02:15:00.000 101500
2010-01-01 02:25:00.000 102100
2010-01-01 02:35:00.000 102700
etc.
If the device reboots, when it checks back into the server, it reports a runtime of 0.
What I'm trying to determine is some sort of quantifiable metric for the device's "health".
If a device has rebooted a lot in the past but has not rebooted in the last xx days, then it is considered healthy, compared to a device that has a big uptime except for the last xx days where it has repeatedly rebooted.
Also, a device that has been up for 30 days and just rebooted, shouldn't be considered "distressed", compared to a device that has continually rebooted every 24 hrs or so for the last xx days.
I've tried multiple ways of calculating the health, using a variety of metrics:
1. average # of reboots
2. max(uptime)
3. avg(uptime)
4. # of reboots in last 24 hrs
5. # of reboots in last 3 days
6. # of reboots in last 7 days
7. # of reboots in last 30 days
Each individual metric only accounts for one aspect of the device health, but doesn't take into account the overall health compared to other devices or to its current state of health.
Any ideas would be GREATLY appreciated.
You could do something like Windows 7's reliability metric: start out at full health (say 10). Every hour / day / check-in cycle, increment the health by (10 - currenthealth)*incrementfactor. Every time the device goes down, subtract a certain percentage.
So, given a crashfactor of 20%/crash and an incrementfactor of 10%/day:
A device that has rebooted a lot in the past but has not rebooted in the last 20 days will have a health of 8.6
A device with a big uptime except for the last 2 days, where it has rebooted 5 times, will have a health of 4.1
A device that has been up for 30 days and just rebooted will have a health of 8
A device that has continually rebooted every 24 hrs or so for the last 10 days will have a health of 3.9
To run through an example:
Starting at 10
Day 1: no crash, new health = CurrentHealth + (10 - CurrentHealth)*.1 = 10
Day 2: One crash, new health = currenthealth - currentHealth*.2 = 8
But still increment every day so new health = 8 + (10 - 8)*.1 = 8.2
Day 3: No crash, new health = 8.4
Day 4: Two crashes, new health = 5.8
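The worked example can be reproduced with a few lines of code. This is a sketch under one assumption read off the example: each day's crashes are applied first and the daily recovery step second. The function and parameter names are made up.

```python
def update_health(health, crashes, crash_factor=0.2, increment_factor=0.1, max_health=10.0):
    """Apply one day's crashes, then the daily recovery step."""
    for _ in range(crashes):
        health -= health * crash_factor
    return health + (max_health - health) * increment_factor


def simulate(crashes_per_day, start=10.0):
    """Return the health at the end of each day."""
    history = []
    health = start
    for crashes in crashes_per_day:
        health = update_health(health, crashes)
        history.append(health)
    return history


# Days 1 to 4 of the worked example: no crash, one crash, no crash, two crashes.
for day, health in enumerate(simulate([0, 1, 0, 2]), start=1):
    print(f"Day {day}: health = {health:.1f}")
```

Tuning crash_factor and increment_factor controls how quickly a device is punished for a reboot and how quickly it is forgiven.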
You might take the reboot count / t of a particular machine and compare that to the standard deviation of the entire population. Those that fall say three standard deviations from the mean, where it's rebooting more often, could be flagged.
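A minimal sketch of that population comparison, assuming the reboot rate of every device has already been computed over the same time window; the device names and rates below are invented.

```python
from statistics import mean, pstdev


def flag_outliers(reboot_rates, n_sigma=3.0):
    """Return device ids whose reboot rate is more than n_sigma standard deviations above the mean."""
    rates = list(reboot_rates.values())
    mu = mean(rates)
    sigma = pstdev(rates)
    if sigma == 0:
        # Every device behaves identically, so nothing stands out.
        return []
    return [dev for dev, rate in reboot_rates.items() if (rate - mu) / sigma > n_sigma]


# Hypothetical population: reboots per 30 days for 21 devices.
rates = {f"dev{i:02d}": 1.0 for i in range(20)}
rates["dev20"] = 50.0
print(flag_outliers(rates))
```

With only a handful of devices a three-sigma outlier cannot occur at all, so this check only makes sense for a population of some size, such as the few hundred devices in the question.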
You could use weighted average uptime and include the current uptime only when it would make the average higher.
The weight would be how recent the uptime is, so that most recent uptimes have the biggest weight.
Are you able to break the devices out into groups of similar devices? Then you could compare an individual device to its peers.
Another suggestion is to look into various Moving Average algorithms. These are supposed to smooth out time-series data as well as highlight trends.
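One common member of that family is the exponential moving average, which weights recent samples more heavily than old ones. A minimal sketch, applied to invented daily reboot counts; the choice of alpha = 0.3 is arbitrary.

```python
def exponential_moving_average(values, alpha=0.3):
    """Smooth a series; alpha near 1 follows recent values closely, alpha near 0 smooths heavily."""
    smoothed = []
    ema = None
    for x in values:
        # Seed with the first sample, then blend each new sample into the running average.
        ema = x if ema is None else alpha * x + (1.0 - alpha) * ema
        smoothed.append(ema)
    return smoothed


# Hypothetical daily reboot counts: quiet for a week, then a burst of reboots.
daily_reboots = [0, 0, 0, 0, 0, 0, 0, 3, 4, 5]
print([round(v, 2) for v in exponential_moving_average(daily_reboots)])
```

A rising smoothed value signals a device that has recently started rebooting, while an old burst of reboots fades away over time.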
Does it always report a runtime of 0 on reboot? Or something close to zero (less than the previous runtime, anyway)?
You could calculate this in two ways:
1. The lower the number, the less trouble the device has had.
2. The higher the number, the longer its periods of uptime.
You need to account for the fact that health can vary, so it can worsen over time. The latest values should therefore have a higher weight than the older ones. This could indicate exponential growth.
The more reboots the device had in the last period, the more broken it could be. But also look at shorter intervals between reboots. Say, 5 reboots in a day vs. 10 reboots in 2 weeks: those mean very different things. So time should be a metric in this formula, as well as the number of reboots.
You need to calculate the density of reboots in the last period.
You can apply a weight to the density by simply dividing: the larger the number you divide by, the lower the result, and so the lower the weight of that value.
Pseudo code:
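function calcHealth(machine) {
  const SECONDS_PER_DAY = 86400;
  const threshold = 800;
  const now = Date.now() / 1000; // current time in seconds
  let value = 0;
  for (const reboot of machine.reboots) {
    // assumption: reboot.time is a Unix timestamp in seconds;
    // the age is clamped so a reboot that just happened cannot divide by zero
    const daysPast = Math.max((now - reboot.time) / SECONDS_PER_DAY, 0.01);
    // the more days past, the lower the value, so the lower the weight
    value += 100 / daysPast;
  }
  // no reboots yields 0; otherwise a higher result means fewer or older reboots
  return (value === 0) ? 0 : (threshold / value);
}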
You could extend this function by, for example, filtering on a maxDaysPast and playing with the threshold.
This formula is based on the plot of f(x) = 100/x. For low x the value is higher than for large x. That is how the formula calculates the weight from daysPast: lower daysPast == lower x == higher weight.
With value += the formula counts the reboots, and the 100/x part gives each reboot a weight, where the weight depends on time.
At the return, the threshold is divided by the value, because the higher the score of the reboots, the lower the result must be.
You can use a plotting program or calculator to see the curve of the plot, which is also the curve of the weight of daysPast.
