Calculate the accuracy of the multi-stage state machine - probability

This is an application question. I have a state machine with two stages.
1. The first stage alone has an accuracy of 90%.
2. If the first stage outputs 1, the input goes to part 2-1 of the second stage. Part 2-1 alone has an accuracy of 70%.
3. If the first stage outputs 0, the input goes to part 2-2 of the second stage. Part 2-2 alone has an accuracy of 20%.
I may introduce more stages (>2 stages) soon.
May I know the formula for calculating the overall accuracy of the multi-stage state machine?
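One common way to chain the stages, assuming stage errors are independent and the machine is correct only when every stage along the taken path is correct, can be sketched as follows. Note that the routing probability `p_output1` (how often the first stage outputs 1) is an assumption; it is not given in the question but is needed to weight the two branches.

```python
def overall_accuracy(acc_stage1, acc_branch_2_1, acc_branch_2_2, p_output1):
    # Assumes independent stage errors: the machine is right only if the
    # first stage is right AND the chosen second-stage part is right.
    acc_stage2 = p_output1 * acc_branch_2_1 + (1 - p_output1) * acc_branch_2_2
    return acc_stage1 * acc_stage2

# With the figures from the question and an assumed 50/50 routing:
print(overall_accuracy(0.9, 0.7, 0.2, 0.5))  # 0.9 * 0.45 = 0.405
```

For more than two stages the same idea extends recursively: multiply the accuracies of the stages along each path and weight each path by the probability of taking it.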

Users distribution in Load Test Visual Studio

I created a load test project in VS. There are 5 scenarios which are implemented as normal unit tests.
Test mix model: Test mix percentage based on the number of tests started.
Scenario A: 10%
Scenario B: 65%
Scenario C: 9%
Scenario D: 8%
Scenario E: 8%
Load pattern: Step. Initial user count: 10. Step user count: 10. Step duration: 10sec. Maximum user count: 300.
Run Duration: 10 minutes.
I would like to know how the load is put on all the scenarios? How the users are distributed between the scenarios in time?
If I put 100 users as the initial user count, do 10 virtual users (10% of 100) start replaying scenario A at one time? What happens when they finish? I would be really grateful if someone could explain how the user distribution works.
Please use the correct terminology. Each "Scenario" in a load test has its own load pattern. This answer assumes that there are 5 test cases A to E.
The precise way load tests start test cases is not defined but the documentation is quite clear. Additionally the load test wizard used when initially creating a load test has good descriptions of the test mix models.
Load tests also make use of random numbers for think times and when choosing which test to run next. This tends to mean the final test results show counts of test cases executed that differ from the desired percentages.
My observations of load tests leads me to believe it works as follows. At various times the load test compares the number of tests currently executing against the number of virtual users that should be active. These times are when the load test's clock ticks and a step load pattern changes, also when a test case finishes. If the comparison shows more virtual users than test cases being executed then sufficient new tests are started to make the numbers equal. The test cases are chosen to match the desired test mix, but remember that there is some randomization in the choice.
Your step pattern starts at 10 and steps by 10 every 10 seconds to a maximum of 300. The maximum should be reached after (300 users)/(10 users per step)*(10 seconds per step) = 300 seconds = 5 minutes. The run duration of 10 minutes therefore means 5 minutes of ramp-up followed by 5 minutes steady at the maximum user count.
For the final paragraph of your question. Given the same percentages but a constant user count of 100, you would expect the initial number of each test case to be close to the percentages: 10 of A, 65 of B, 9 of C, 8 of D and 8 of E. When any test case completes, Visual Studio will choose a new test case attempting to follow the test mix model, but, as I said earlier, there is some randomization in the choice.
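The ramp-up arithmetic and the mix split described above can be sketched roughly like this (an illustration of the described behaviour, not Visual Studio's actual scheduler):

```python
def active_users(t_seconds, initial=10, step=10, step_duration=10, maximum=300):
    # Step load pattern: add `step` users every `step_duration` seconds.
    return min(initial + step * (t_seconds // step_duration), maximum)

# Test mix percentages from the question:
mix = {"A": 0.10, "B": 0.65, "C": 0.09, "D": 0.08, "E": 0.08}

users = active_users(300)  # five minutes into the run
expected = {name: round(users * pct) for name, pct in mix.items()}
print(users, expected)  # 300 {'A': 30, 'B': 195, 'C': 27, 'D': 24, 'E': 24}
```

In a real run the per-test counts wander around these expectations because of the randomized choice of the next test case.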

How long does it take to process the file If I have only one worker node?

Let's say I have data in 25 blocks and the replication factor is 1. The mapper requires about 5 minutes to read and process a single block of the data. How can I calculate the time for one worker node? And what about 15 nodes? Will the time change if we change the replication factor to 3?
I would really appreciate some help.
First of all I would advise reading some scientific papers on the issue (Google Scholar is a good starting point).
Now a bit of discussion. From my latest experiments I have concluded that processing time has a very strong relation to the amount of data you want to process (which makes sense). On our cluster, it takes on average around 7-8 seconds for a Mapper to read a 128 MByte block. There are several factors you need to consider in order to predict the overall execution time:
How much data the Mapper produces, which more or less determines the time Hadoop requires for shuffling
What is the Reducer doing? Does it do some iterative processing? (might be slow!)
What is the configuration of the resources? (how many Mappers and Reducers are allowed to run on the same machine)
Finally, are there other jobs running simultaneously? (this might slow jobs down significantly, since your Reducer slots can be occupied waiting for data instead of doing useful work).
So even for one machine you can see the complexity of predicting job execution time. During my study I was able to conclude that on average one machine is capable of processing 20-50 MBytes/second (the rate is calculated as total input size / total job running time). The processing rate includes the staging time (when your application is starting and uploading required files to the cluster, for example). The processing rate differs between use cases and is greatly influenced by the input size and, more importantly, the amount of data produced by the Mappers (once again, these values are for our infrastructure; on a different machine configuration you will see completely different execution times).
When you start scaling your experiments you will on average see improved performance, but once again, from my study I concluded that it is not linear; you would need to fit, for your own infrastructure, a model with the respective variables to approximate job execution time.
Just to give you an idea, I will share some of the results. The rate when executing a particular use case on 1 node was ~46 MBytes/second, for 2 nodes ~73 MBytes/second, and for 3 nodes ~85 MBytes/second (in my case the replication factor was equal to the number of nodes).
The problem is complex and requires time, patience and some analytical skill to solve. Have fun!
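For the original question's numbers, a deliberately idealized estimate (ignoring shuffle, reduce and staging costs entirely, and assuming one map slot per node with full data locality) would look like:

```python
import math

def estimate_minutes(blocks, minutes_per_block, nodes, slots_per_node=1):
    # Maps run in "waves": each wave processes one block per map slot.
    waves = math.ceil(blocks / (nodes * slots_per_node))
    return waves * minutes_per_block

print(estimate_minutes(25, 5, 1))   # 25 waves -> 125 minutes on one node
print(estimate_minutes(25, 5, 15))  # 2 waves  -> 10 minutes on 15 nodes
```

A replication factor of 3 does not change this idealized number directly, but it improves the odds of data-local scheduling, which is one of the reasons real timings deviate from estimates like this.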

Sorting networks costs and delay

From what I read I could not figure out how the cost and delay are calculated.
Cost: the number of sticks or compare-exchange blocks.
Delay: the number of compare-exchanges in sequence.
I have posted my example below.
From what I can see, your answer is correct.
Cost is the total number of compare-exchanges done in the sorting network. I believe here it's 28.
Delay is the number of stages that must be done in sequence, i.e. have data dependencies. In the example there is a delay of 13.
Why do we care about the difference? Cost represents the amount of work we have to do in a serial implementation however the benefit of using a sorting network is that many of the compare-exchanges can be done in parallel. When you have as much parallelism available as there are compare-exchanges in a single stage, you can calculate that stage concurrently.
In a perfectly parallel system, the latency of the algorithm is going to be related to the delay rather than the cost. In a completely serial system, the latency is going to be related to the cost rather than the delay.

How do I weight my rate by sample size (in Datadog)?

So I have an ongoing metric of events. They are either tagged as success or fail. So I have 3 numbers; failed, completed, total. This is easily illustrated (in Datadog) using a stacked bar graph like so:
So the dark part is the failures. By looking at the y scale and the dashed red line for scale, a human can easily tell whether the rate is a problem and significant. Which to me means that I have a failure rate in excess of 60%, sustained over at least some time (10 minutes?), with enough events in that period to consider the rate exceptional.
So I am looking for some sort of formula that starts with: failures divided by total (giving me a score between 0 and 1) and then multiplies this somehow again with the total and some thresholds that I decide means that the total is high enough for me to get an automated alert.
For extra credit, here is the actual Datadog metric that I am trying to get to work:
(sum:event{status:fail}.rollup(sum, 300) / sum:event{}.rollup(sum, 300))
And I am watching for 15 minutes and alerting on a score above 0.75. But I am not sure about sum, count, avg or rollup. And of course this alert will email me during the night when the total number of events gets low enough that a high failure rate isn't evidence of any problem.
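One simple family of formulas for the "weight by sample size" idea is to damp the raw failure rate by how close the event volume is to a minimum sample size you pick. This is a sketch of the idea, not a Datadog built-in; the `min_total` threshold is an assumption you would tune:

```python
def weighted_failure_score(failures, total, min_total=50):
    # Raw failure rate, damped toward 0 when the sample is small so a
    # quiet night with 2 failures out of 3 events cannot page you.
    if total == 0:
        return 0.0
    rate = failures / total
    volume_weight = min(1.0, total / min_total)
    return rate * volume_weight

print(weighted_failure_score(80, 100))  # 0.8  -> above a 0.75 threshold
print(weighted_failure_score(8, 10))    # 0.16 -> well below it
```

The same shape can be expressed in a Datadog monitor by multiplying the rate query by a clamped volume term, keeping the 0.75 alert threshold unchanged for high-traffic periods.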

Google transit is too idealistic. How would you change that?

Suppose you want to get from point A to point B. You use Google Transit directions, and it tells you:
Route 1:
1. Wait 5 minutes
2. Walk from point A to bus stop 1 (8 minutes)
3. Take bus 69 to stop 2 (15 minutes)
4. Wait 2 minutes
5. Take bus 6969 to stop 3 (12 minutes)
6. Walk from stop 3 to point B (3 minutes)
Total time = 5 wait + 40 minutes.
Route 2:
1. Wait 10 minutes
2. Walk from point A to bus stop I (13 minutes)
3. Take bus 96 to stop II (10 minutes)
4. Wait 17 minutes
5. Take bus 9696 to stop 3 (12 minutes)
6. Walk from stop 3 to point B (8 minutes)
Total time = 10 wait + 50 minutes.
All in all Route 1 looks way better. However, what really happens in practice is that bus 69 runs 3 minutes behind due to traffic, and I end up missing bus 6969. The next bus 6969 comes at least 30 minutes later, which amounts to 5 wait + 70 minutes (including a 30-minute wait in the cold or heat). Wouldn't it be nice if Google actually advertised this possibility? My question now is: what is a better algorithm for displaying the top 3 routes, given uncertainty in the schedule?
Thanks!
How about adding weightings that express a level of uncertainty for different types of journey elements.
Bus services in Dublin City are notoriously untimely; you could add a 40% margin of error to anything involving the Dublin Bus schedule, giving a best and worst case scenario. You could also factor in the chronic traffic delays at rush hour. Then a user could see that they may have a 20% or 80% chance of actually making a connection.
You could sort "best" journeys by the "most probably correct" factor, and include this data in the results shown to the user.
My two cents :)
For the UK rail system, each interchange node has an associated 'minimum transfer time to allow'. The interface to the route planner then has an Advanced option allowing the user to either accept the default or add half-hour increments.
In your example, setting a 'minimum transfer time to allow' of, say, 10 minutes at step 2 would prevent Route 1 as shown from being suggested. Of course, this means the minimum possible journey time is increased, but that's the trade-off.
If you take uncertainty into account then there is no longer a "best route", but instead there can be a "best strategy" that minimizes the total time in transit; however, it can't be represented as a linear sequence of instructions but is more of the form of a general plan, i.e. "go to bus station X, wait until 10:00 for bus Y, if it does not arrive walk to station Z..." This would be notoriously difficult to present to the user (in addition to being computationally expensive to produce).
For a fixed sequence of instructions it is possible to calculate the probability that it actually works out; but what would be the level of certainty users want to accept? Would you be content with, say, 80% success rate? When you then miss one of your connections the house of cards falls down in the worst case, e.g. if you miss a train that leaves every second hour.
Many years ago I wrote a similar program to calculate long-distance bus journeys in Finland, and I just reported the transfer times assuming every bus was on schedule. Back then basically every plan with less than about 15 minutes of transfer time was disregarded as too risky (on some routes there were only one or two long-distance buses per day).
Empirically. Record the actual arrival times vs scheduled arrival times, and compute the mean and standard deviation for each. When considering possible routes, calculate the probability that a given leg will arrive late enough to make you miss the next leg, and make the average wait time P(on time)*T(first bus) + (1-P(on time))*T(second bus). This gets more complicated if you have to consider multiple legs, each of which could be late independently, and multiple possible next legs you could miss, but the general principle holds.
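The expected-wait calculation above can be sketched like this, under the modelling assumption that the observed lateness of a leg is roughly normally distributed with the measured mean and standard deviation:

```python
from statistics import NormalDist

def p_make_connection(mean_late, std_late, slack_minutes):
    # Probability the incoming leg arrives within the transfer slack,
    # assuming lateness ~ Normal(mean_late, std_late).
    return NormalDist(mean_late, std_late).cdf(slack_minutes)

def expected_wait(p_make, wait_first, wait_next):
    # P(on time)*T(first bus) + (1 - P(on time))*T(second bus)
    return p_make * wait_first + (1 - p_make) * wait_next

# e.g. a leg that is on average 3 min late (std 2 min) with only a
# 2-minute transfer window, and a 30-minute wait for the next bus:
p = p_make_connection(3, 2, 2)
print(round(expected_wait(p, 2, 30), 1))  # roughly 21.4 minutes
```

Extending this to several independently late legs means multiplying the per-leg probabilities along the route, which is where the complication mentioned above comes in.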
Catastrophic failure should be the first check.
This is especially important when you are trying to connect to that last bus of the day which is a critical part of the route. The rider needs to know that is what is happening so he doesn't get too distracted and knows the risk.
After that it could evaluate worst-case single misses.
And then, if you really wanna get fancy, take a look at the crime stats for the neighborhood or transit station where the waiting point is.
