Amazon Interview about Success probability [duplicate] - probability

If a system has a 10% independent chance of failing in any given hour, what are the chances of it failing in a given 2-hour period, or in an n-hour period?
Note: a 10% failure probability in 1 hour has nothing to do with failing 10% of the time. It just means that the system has a 10% independent chance of failing in any given hour.

Let Pfail be the probability that the system fails in any given hour.
Then Pnofail, the probability that the system does not fail in any given hour, is 1 - Pfail.
The chance of it not failing in 2 hours is (Pnofail)^2, since it must independently not-fail in each of those hours, and the joint probability of two independent events is the product of the probability of each event (that is, P(A ∩ B) = P(A)*P(B)).
More generally, then, the chance of it not failing in n hours is (Pnofail)^n.
The chance of it failing in n hours is 1 - (chance of not failing in n hours).
You should be able to work it out from there.
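As a minimal sketch of the steps above (the function name `failure_probability` is made up for illustration, and the per-hour failures are assumed independent as stated in the question):

```python
def failure_probability(p_fail_per_hour, hours):
    """Probability of at least one failure in `hours` independent hours."""
    p_no_fail = 1.0 - p_fail_per_hour      # Pnofail for a single hour
    p_no_fail_n = p_no_fail ** hours       # (Pnofail)^n, by independence
    return 1.0 - p_no_fail_n               # complement: at least one failure


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        print(n, round(failure_probability(0.10, n), 4))
```

For the 2-hour case this gives 1 - 0.9^2 = 0.19, slightly less than the 0.20 one would get by naively adding the two hourly probabilities.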


Calculate a date when total duration of multiple sub-intervals (within a larger interval) drops below X

I am building an expert system that will run as a web service (i.e. continuously).
Some of the rules in it are coded procedurally and deal with intervals: the rule processor maps over a set of the user's events and calculates their total duration within a certain time-frame, which is defined in relative terms (like N years ago). This result is then compared with a required threshold to determine whether the rule passes.
So, for example, the rule calculates for how long you were employed from 3 years ago to 1 year ago and passes if it's more than 9 months.
I have no problem calculating the durations. The difficult part is that I need to display to the user not simply whether the particular rule passed, but also the exact date when this "true" is due to become "false". Ideally, I'd love to display one more step ahead, i.e. when "false" switches back to "true" again, if there's data for this, of course. So on the day when the total duration of their employment for the last year drops below 6 months, the rule reruns, the result changes, and they get an email: "hey, your result has just changed, you no longer qualify, but in 5 months you will qualify once again".
| | |
_____|||1|||_______|||2|||__________|||3|||________|||4|||...
| | |
3 y. ago ---------------------- 1 y. ago Now
min 9 months work experience is required
In the example above the user qualifies now but is going to stop qualifying. We need to tell them up front: "expect this to happen in 44 days" (the system also schedules a background job for that date), and also when that will reverse back to true.
| | |
____________________|1|__________________||||||||2||||||||...
| | |
3 y. ago ---------------------- 1 y. ago Now
min 9 months work experience is required
In this one the user doesn't qualify, so we need to tell them when they are going to start qualifying.
| |
_____|||1|||___________|||||||2|||||||_________|||3|||____...
| |
1 y. ago ------------------------------------------ Now
at least 6 months of work experience is required
And here we need to tell them when they are due to stop qualifying: there's no event currently going on for them, so once these events roll far enough to the left, it's over until the user changes their CV and the engine re-runs with the new dataset.
I hope it's clear what I want to do. Is there a smart algorithm that can help me here? Or do I just brute-force the solution?
UPD:
The solution I am developing lies in creating a 2-dimensional graph where each point signifies a date (x-axis value) when the curve of total duration for the timeframe (y-axis value) changes direction. There are 4 such breakpoints for any given event. This graph will allow me to do a linear interpolation between two values to find when exactly the duration line crosses the threshold. I am currently writing this in Ruby.
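The breakpoint idea from the update can be sketched roughly as follows. This is only an illustration in Python rather than Ruby, and every name in it (`covered`, `next_crossing`, `far`, `near`, `horizon`) is hypothetical. It assumes times are plain numbers (for example days), events are non-overlapping `(start, end)` pairs, an ongoing event has `end = float("inf")`, and the window at time `t` is `[t - far, t - near]`. Between two consecutive breakpoints the total duration is linear, so a threshold crossing can be found by linear interpolation.

```python
def covered(events, lo, hi):
    """Total time covered by (start, end) events inside the window [lo, hi]."""
    return sum(max(0.0, min(end, hi) - max(start, lo)) for start, end in events)


def next_crossing(events, far, near, threshold, now, horizon):
    """First time after `now` (up to `horizon`) at which the rule flips, or None."""
    def margin(t):
        # Non-negative while the rule passes, negative while it fails.
        return covered(events, t - far, t - near) - threshold

    # Each event contributes up to 4 breakpoints: each edge entering/leaving the window.
    points = {now, horizon}
    for start, end in events:
        for edge in (start, end):
            for offset in (near, far):
                if now < edge + offset < horizon:
                    points.add(edge + offset)

    ordered = sorted(points)
    for a, b in zip(ordered, ordered[1:]):
        fa, fb = margin(a), margin(b)
        if (fa >= 0) != (fb >= 0):
            # The duration is linear on [a, b], so interpolate the crossing point.
            return a + (b - a) * fa / (fa - fb)
    return None


if __name__ == "__main__":
    # One finished 100-day job; rule: at least 30 days within the last 365 days.
    print(next_crossing([(0, 100)], far=365, near=0, threshold=30, now=200, horizon=1000))
```

The returned number would still have to be mapped back to a calendar date, and finding the step after that would need a second search starting just past the first crossing.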

Density of time events

I am working on an assignment where I am supposed to compute the density of an event. Let's say that a certain event happens 5 times within seconds, it would mean that it would have a higher density than if it were to happen 5 times within hours.
I have the times at which the event happens.
I was first thinking about computing the elapsed time between each pair of successive events and then working with the average of these values.
My problem is that I do not know how to accurately represent this notion of density through mathematics. Let's say that I have 5 events happening really close to each other, and then a long break, and then again 5 events happening really close to each other. I would like to be able to represent this as high density. How should I go about it?
In the last example, I understand that my mean won't be truly representative but that my standard deviation will show that. However, how could I have a single density value (let's say between 0 and 1) with which I could rank different events?
Thank you for your help!
I would try the harmonic mean, which represents the rate at which your events happen while still giving you an averaged time value. For n intervals x_1, ..., x_n between successive events, it is defined by:
H = n / (1/x_1 + 1/x_2 + ... + 1/x_n)
I think its behaviour is close to what you expect as it measures what you want, but not between 0 and 1 and with inverse tendencies (small values mean dense, large values mean sparse). Let us go through a few of your examples :
~5 events in an hour. Let us suppose for simplicity there are 10 minutes between each event. Then we have H = 6 / (6 * 1/10) = 10
~5 events in 10 minutes, then nothing until the end of the hour (50 minutes). Let us suppose all short intervals are 2.5 minutes, then H = 6 / (5/2.5 + 1/50) = 6 * 50 / 101 = 2.97
~5 events in 10 minutes, but this cycle restarts every half hour, thus we have 20 minutes as the last interval instead of 50. Then we get H = 6 / (5/2.5 + 1/20) = 6 * 20 / 41 = 2.93
As you can see the effect of the longer and rarer values in a set is diminished by the fact that we use inverses, thus less weight to the "in between bursts" behaviour. Also you can compare behaviours with the same "burst density" but that do not happen at the same frequency, and you will get numbers that are close but whose ordering still reflects this difference.
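A small sketch of this computation, assuming the event times are given as numbers in a common unit (minutes here) and that no two events share the same timestamp, since a zero gap would divide by zero. The function name `harmonic_mean_gap` is made up for illustration.

```python
def harmonic_mean_gap(timestamps):
    """Harmonic mean of the gaps between successive event times."""
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    # H = n / (1/x_1 + ... + 1/x_n); smaller H means denser events.
    return len(gaps) / sum(1.0 / g for g in gaps)


if __name__ == "__main__":
    evenly_spaced = [0, 10, 20, 30, 40, 50, 60]      # six 10-minute gaps
    one_burst = [0, 2.5, 5, 7.5, 10, 12.5, 62.5]     # five 2.5-minute gaps, then 50
    print(harmonic_mean_gap(evenly_spaced))
    print(round(harmonic_mean_gap(one_burst), 2))
```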
For density to make sense you need to define 2 things:
the range where you look at it,
and the unit of time
After that you can say for example, that from 12:00 to 12:10 the density of the event was an average of 10/minute.
What makes sense in your case obviously depends on what your input data is. If your measurement lasts for 1 hour and you have millions of entries then probably seconds or milliseconds are better choice for unit. If you measure for a week and have a few entries then day is a better unit.
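As a rough illustration of the "range plus unit" definition, the sketch below counts the events inside a half-open range and divides by the range length expressed in the chosen unit. It assumes timestamps in seconds; `event_rate` and `unit_seconds` are hypothetical names.

```python
def event_rate(timestamps, start, end, unit_seconds=60.0):
    """Average number of events per unit of time within [start, end)."""
    count = sum(1 for t in timestamps if start <= t < end)
    units_in_range = (end - start) / unit_seconds
    return count / units_in_range


if __name__ == "__main__":
    # 100 events spread over ten minutes, one every 6 seconds.
    times = [i * 6 for i in range(100)]
    print(event_rate(times, 0, 600))                     # events per minute
    print(event_rate(times, 0, 600, unit_seconds=1.0))   # events per second
```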

Meaning of axis of figures of simulation or performance modeling papers

I am reading some papers on simulation and performance modeling. The Y axis in some figures is labeled "Seconds per Simulation Day". I am not sure what it actually means. It ranges from 0 to 120, with ticks at 0, 20, 40, and so on.
Another label is "Simulation years per day". I guess it means that the guest OS inside the simulation environment thinks several years have passed while only a day has passed in the real world? But I would expect simulation to slow execution down, so it would seem more reasonable for only several hours to pass inside the simulation environment while a day passes in the real world.
Thanks.
Without seeing the paper, I assume they are trying to compare the compute time it takes to reach some physical time in a simulation.
So "Seconds per Simulation Day" is likely the walltime it took to advance the simulation by 24 hours.
Likewise, "Simulation Years per Day" is the amount of physical time covered by the simulation, in years, per real-life day of computation.
Of course, without seeing the paper it's impossible to know for sure. It's also possible they are looking at CPU-seconds or CPU-days, which would be nCPUs*walltime.
Simulations typically run in discrete time units, called time steps. If you'd like to simulate a process that spans a certain time in the simulation, you have to perform a certain number of time steps. If the length of a time step is fixed, the number of steps is then just the simulated time divided by the length of the time step. Calculations in each time step take a certain amount of time, and the total run time for the simulation equals the number of time steps times the time it takes to perform one time step:
(1) total_time = (simulation_time / timestep_length) * run_time_per_timestep
Now several benchmark parameters can be obtained by placing different parameters on the left hand side. E.g. if you fix simulation_time = 1 day then total_time would give you the total simulation run time, i.e.
(2) seconds_per_sim_day = (1 day / timestep_length) * run_time_per_timestep
Large values of seconds_per_sim_day could mean:
it takes too much time to compute a single time step, i.e. run_time_per_timestep is too high -> the computation algorithm should be optimised for speed;
the time step is too short -> search for better algorithms that can accept larger time steps and still produce (almost) the same result.
On the other hand, if you solve (1) for simulation_time and fix total_time = 1 day, you get the number of time steps that can be performed per day times the length of the time step, or the total simulation time that can be achieved per day of computation:
(3) simulation_time_per_day = (1 day / run_time_per_timestep) * timestep_length
Now one can observe that:
larger time steps lead to larger values of simulation_time_per_day, i.e. longer simulations can be computed;
if it takes too much time to compute a time step, the value of simulation_time_per_day would go down.
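Equations (2) and (3) can be tried out with a few lines of code. This is only a sketch with invented numbers: it assumes a fixed time step and a constant run time per step, with all quantities in seconds.

```python
SECONDS_PER_DAY = 86400.0


def seconds_per_sim_day(timestep_length, run_time_per_timestep):
    """Equation (2): wall-clock seconds needed to simulate one day."""
    return (SECONDS_PER_DAY / timestep_length) * run_time_per_timestep


def simulation_time_per_day(timestep_length, run_time_per_timestep):
    """Equation (3): simulated seconds obtained per wall-clock day."""
    return (SECONDS_PER_DAY / run_time_per_timestep) * timestep_length


if __name__ == "__main__":
    # Hypothetical model: 60 s time step, 0.05 s of computation per step.
    print(seconds_per_sim_day(60.0, 0.05))
    print(simulation_time_per_day(60.0, 0.05) / SECONDS_PER_DAY, "simulated days per day")
```

Doubling `timestep_length` halves the first quantity and doubles the second, which matches the observations listed above.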
These figures are typically used when making decisions about buying CPU time at some computing centre. For example, if you would like to simulate 100 years, just divide that by the number of simulation years per day and you get how many compute days you would have to pay (or wait) for. Larger values of simulation_time_per_day definitely benefit you in this case. If, on the other hand, you only have 10 compute days at your disposal, then you can work out how long a simulation could be computed and make some decisions, e.g. more short simulations with many different parameters vs. fewer but longer simulations with the parameters that you have predicted to be optimal.
In real life things are much more complicated. Usually each time step can take a different amount of time to compute (although there are cases where every time step takes exactly the same amount of time), and it strongly depends on the simulation size, configuration, etc. That's why standardised tests exist and usually some averaged value is reported.
Just to summarise: given that all test parameters are kept equal,
faster computers would give less "seconds per simulation day" and more "simulation years per day"
slower computers would give more "seconds per simulation day" and less "simulation years per day"
By the way, the two quantities are reciprocal and related by this simple equation:
simulation_years_per_day = 236.55 / seconds_per_simulation_day
(that is, "simulation years per day" equals 86400 divided by "seconds per simulation day", which gives you the simulation days per day, and then divided by 365.25 to convert the result into years)
So it doesn't really matter whether "simulation years per day" or "seconds per simulation day" is presented. One just has to choose the representation which most clearly shows how much better the newer system is than the previous/older/existing one :)
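The conversion between the two axis labels can be checked with a short sketch (the function name is made up; a year is taken as 365.25 days, as in the equation above):

```python
SECONDS_PER_DAY = 86400.0
DAYS_PER_YEAR = 365.25


def sim_years_per_day(seconds_per_simulation_day):
    """Convert "seconds per simulation day" into "simulation years per day"."""
    sim_days_per_day = SECONDS_PER_DAY / seconds_per_simulation_day
    return sim_days_per_day / DAYS_PER_YEAR


if __name__ == "__main__":
    for secs in (20, 40, 72, 120):
        print(secs, round(sim_years_per_day(secs), 2))
```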

What is the correct way to write a sweepstakes algorithm?

For example, if I wanted to ensure that I had one winner every four hours, and I expected to have 125 plays per hour, what is the best way to provide for the highest chance of having a winner and the lowest chance of having no winners at the end of the four hour period?
The gameplay is like a slot-machine, not a daily number. i.e. the entrant enters the game and gets notified right away if they have won or lost.
Sounds like a homework problem, I know, but it's not :)
Thanks.
There's really only so much you can do to keep things fair (i.e. someone who enters at the beginning of a four hour period has the same odds of winning as someone who enters at the end) if you want to enforce this constraint. In general, the best you can do while remaining legal is to take a guess at how many entrants you're going to have and set the probability accordingly (and if there's no winner at the end of a given period, give it to a random entrant from that period).
Here's what I'd do to adjust your sweepstakes probability as you go (setting aside the legal ramifications of doing so):
For each period, start the probability at 1 / (number of expected entries * 2)
At any time, if you get a winner, the probability goes to 0 for the rest of that period.
Every thirty minutes, if you're still without a winner, set the probability at 1 / ((number of expected entries * (1 - percentage of period complete)) * 2). So here, the percentage of period complete is the number of hours elapsed in that current period / number of total hours in the period (4). Basically as you go, the probability will scale upwards.
Practical example: expected entries is 200.
Starting probability = 1 / 400 = 0.0025.
After first half hour, we don't have a winner, so we reevaluate probability:
probability = 1 / ((200 * (1 - 0.125)) * 2) = 1 / (200 * 2 * 0.875) = 1/350
This continues every half hour, with the probability rising until it reaches a maximum of 1/50 in the final half hour, assuming no winner occurs before then.
You can adjust these parameters if you want to maximize the acceleration or whatever. But I'd be remiss if I didn't emphasize that I don't believe running a sweepstakes like this is legal. I've done a few sweepstakes for my company and am somewhat familiar with the various laws and regulations, and the general rule of thumb, as I understand it, is that no one entrant should have an advantage over any other entrant that the other entrant doesn't know about. I'm no expert, but consult with your lawyer before running a sweepstakes like this. That said, the solution above will help you maximize your odds of giving away a prize.
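The schedule described in this answer can be sketched as a single function. The names are hypothetical, `fraction_complete` is the elapsed share of the period at the last half-hour re-evaluation, and the behaviour once the period is fully elapsed is an assumption of this sketch, not something the answer specifies.

```python
def win_probability(expected_entries, fraction_complete, has_winner=False):
    """Per-play win probability under the scaling scheme described above."""
    if has_winner:
        return 0.0  # once someone has won, nobody else can win this period
    remaining_entries = expected_entries * (1.0 - fraction_complete)
    if remaining_entries <= 0:
        return 1.0  # assumption: the period is over, so the next play wins
    return min(1.0, 1.0 / (remaining_entries * 2))


if __name__ == "__main__":
    # Re-evaluate every 30 minutes of a 4-hour period with 200 expected entries.
    for half_hours in range(8):
        fraction = half_hours / 8
        print(fraction, round(win_probability(200, fraction), 5))
```

With 200 expected entries this reproduces the values from the practical example: 1/400 at the start, 1/350 after the first half hour, and 1/50 for the final half hour.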
If you're wanting a winner for every drawing, you'd simply pick a random winner from your entrants.
If you're doing it like a lottery, where you don't have to have a winner for every drawing, the odds are as high or low as you care to make them based on your selection scheme. For instance, if you have 125 entries per hour, and you're picking every four hours, that's 500 entries per contest. If you generate a random number between 1 and 1000, there's a 50% chance that someone will win; between 1 and 750, the chance is about 67% (500/750); and so forth. Then you just pick the entry that corresponds to the random number generated, if there is one.
There are a million different ways to implement selecting a winner. In the end you just need to pick one and use it consistently.
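The lottery-style draw described above might look like the following sketch. The names `draw_winner` and `pool_size` are invented for illustration; entries are assumed to be numbered from 1 to `num_entries`.

```python
import random


def draw_winner(num_entries, pool_size, rng=random):
    """Draw a ticket in 1..pool_size; return the winning entry number or None."""
    ticket = rng.randint(1, pool_size)  # randint is inclusive on both ends
    return ticket if ticket <= num_entries else None


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demo repeatable
    draws = [draw_winner(500, 1000, rng) for _ in range(10000)]
    share_with_winner = sum(d is not None for d in draws) / len(draws)
    print(share_with_winner)
```

The chance of having a winner is `num_entries / pool_size`, so setting `pool_size` equal to the number of entries guarantees a winner for every drawing.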

Algorithm for the allocation of work with dynamic programming

The problem is this:
We need to perform n jobs, each characterized by a gain {v1, v2, ..., vn}, a time required for its implementation {t1, t2, ..., tn}, and a deadline for its implementation {d1, d2, ..., dn}, with d1 <= d2 <= ... <= dn. The gain is obtained only if the job is done by its deadline, and there is a single machine. I must describe an algorithm that computes the maximum gain that can be obtained.
I had thought of a recurrence equation with two parameters, one indicating the i-th job and the other the moment at which we start it: OPT(i, d). If d + t_i <= d_i, then add the gain v_i (so a variant of multiway choice, that is, a min over 1 <= i <= n).
My main problem is: how can I find the jobs that were carried out previously? Should I use a supporting data structure?
How would you write the recurrence equation?
Thank you!
"My main problem is: how can I find the jobs that were carried out previously? Should I use a supporting data structure?"
The trick is, you don't need to know which jobs are completed already, because you can execute them in order of increasing deadline.
Let's say some optimal solution (yielding maximum profit) requires you to complete job A (deadline 10) and then job B (deadline 3). But in this case you can safely swap A and B. They both will still be completed in time and the new arrangement will yield the same total profit.
End of proof.
"How would you write the recurrence equation?"
You already have the general idea, but you don't need a loop (the min over 1 <= i <= n).
max_profit(current_job, start_time)
    // no jobs left to consider (jobs are numbered 1..n by increasing deadline)
    if current_job > n
        return 0
    end
    // skip this job
    result1 = max_profit(current_job + 1, start_time)
    // start doing this job now
    result2 = 0
    finish_time = start_time + T[current_job]
    if finish_time <= D[current_job]
        // only if we can finish it before the deadline
        result2 = max_profit(current_job + 1, finish_time) + V[current_job]
    end
    return max(result1, result2)
end
Converting it to DP should be trivial.
If you don't want O(n*max_deadline) complexity (e.g., when the d and t values are big), you can resort to recursion with memoization and store results in a hash table instead of a two-dimensional array.
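A memoized version of the recursion might look like this in Python. It is a sketch, not a tuned implementation: `max_gain` is a made-up name, `functools.lru_cache` plays the role of the hash table, and the jobs are sorted by deadline first so that the swap argument above applies.

```python
from functools import lru_cache


def max_gain(values, times, deadlines):
    """Maximum total gain on one machine; a job pays only if done by its deadline."""
    # Process jobs in order of increasing deadline.
    jobs = sorted(zip(deadlines, times, values))
    n = len(jobs)

    @lru_cache(maxsize=None)
    def best(i, start_time):
        if i == n:                       # no jobs left
            return 0
        deadline, duration, value = jobs[i]
        result = best(i + 1, start_time)             # skip this job
        finish_time = start_time + duration
        if finish_time <= deadline:                  # do it only if it finishes in time
            result = max(result, best(i + 1, finish_time) + value)
        return result

    return best(0, 0)


if __name__ == "__main__":
    print(max_gain(values=[10, 20, 30], times=[2, 2, 3], deadlines=[2, 3, 5]))
```

Only `(i, start_time)` pairs that are actually reachable get cached, which is the point of preferring memoization over a full table when deadlines are large. For very long job lists the recursion depth would need attention.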
edit
If all jobs must be performed, but not all will be paid for, the problem stays the same. Just push jobs you don't have time for (jobs you can't finish before deadline) to the end. That's all.
First of all I would pick the items with the biggest yield, meaning the jobs that have the biggest value/time ratio and can still meet their deadline (if now+t1 exceeds d1 then the job is pointless). Afterwards I check the time between now+job_time and each deadline, obtaining a "chance to finish" for each job. The jobs that come first will be the jobs with the biggest yield and the lowest chance to finish. The idea is to squeeze in the most valuable jobs.
CASES:
If a job with a yield of 5 needs 10 seconds to finish and its deadline comes in 600 seconds, and a job with the same yield needs 20 seconds to finish and its deadline comes in 22 seconds, then I run the second one.
If a job with a yield of 10 needs 10 seconds to finish and its deadline comes in 100 seconds, while another job with a yield of 5 needs 10 seconds to finish and its deadline comes in 100 seconds, I'll run the first one.
If their yields are identical and they take the same time to finish, while their deadlines come in 100 seconds and 101 seconds respectively, I'll run the first one as it wins more time.
...and so on and so forth.
Recursion in this case refers only to reordering the jobs by the parameters "Yield" and "Chance to finish".
Remember to always increase "now" (by job_time) after inserting a job in the order.
Hope this answers it.
I read the comments above and understood that you are not looking for efficiency but for completion, which takes the yield out of the picture and leaves us with just ordering by deadline. That is the classic sorting problem, solved for example by the divide-and-conquer (divide et impera) Quicksort:
http://en.wikipedia.org/wiki/Quicksort
