How do you determine the best, worse and average case complexity of a problem generated with random dice rolls? - algorithm

There is a picture book with 100 pages. If dice are rolled randomly to select one of the pages and subsequently rerolled in order to search for a certain picture in the book -- how do I determine the best, worst and average case complexity of this problem?
Proposed answer:
best case: picture is found on the first dice roll
worst case: picture is found on 100th dice roll or picture does not exist
average case: picture is found after 50 dice rolls (= 100 / 2)
Assumption: incorrect pictures are searched at most one time

Given your description of the problem, I don't think your assumption (that incorrect pictures are only "searched" once) sounds right. If you don't make that assumption, then the answer is as shown below. You'll see the answers are somewhat different from what you proposed.
It is possible that you could be successful on the first attempt. Therefore, the first answer is 1.
If you were unlucky, you could keep rolling the wrong number forever. So the second answer is infinity.
The third question is less obvious.
What is the average number of rolls? You need to be familiar with the Geometric Distribution: the number of trials needed to get a single success.
Define p as the probability for a successful trial; p=0.01.
Let Pr(x = k) be the probability that the first successful trial being the k th. Then we're going to have to have (k-1) failures and one success. So Pr(x=k) = (1-p)^(k-1) * p. Verify that this is the "probability mass function" on the wiki page (left column).
The mean of the geometric distribution is 1/p, which is therefore 100. This is the average number of rolls required to find the specific picture.
(Note: We need to consider 1 as the lowest possible value, rather than 0, so use the left hand column of the table on the Wikipedia page.)

To analyze this, think about what the best, worst and average cases actually are. You need to answer three questions to find those three cases:
What is the fewest number of rolls to find the desired page?
What is the largest number of rolls to find the desired page?
What is the average number of rolls to find the desired page?
Once you find the first two, the third should be less tricky. If you need asymptotic notation as opposed to just the number of rolls, think about how the answers to each question change if you change the number of pages in the book (e.g. 200 pages vs 100 pages vs 50 pages).

The worst case is not the page found after 100 dice rolls. That would be is your dice always returned different numbers. The worst case is that you never find the page (the way you stated the problem).
The average case is not average of the best and worst cases, fortunately.
The average case is:
1 * (probability of finding page on the first dice roll)
+ 2 * (probability of finding page on the second dice roll)
+ ...
And yes, the sum is infinite, since in thinking about the worst case we determined that you may have an arbitrarily large number of dice rolls. It doesn't mean that it can't be computed (it could mean that, but it doesn't have to).
The probability of finding the page on the first try is 1/100. What's the probability of finding it on the second dice roll?

You're almost there, but (1 + 2 + ... + 100)/100 isn't 50.
It might help to observe that your random selection method is equivalent to randomly shuffling the whole deck and then searching it in order for your target. Each position is equally likely, so the average is straightforward to calculate. Except of course that you aren't doing all that work up front, just as much as is needed to generate each random number and access the corresponding element.
Note that if your book were stored as a linked list, then the cost of moving from each randomly-selected page to the next selection depends on how far apart they are, which will complicate the analysis quite a lot. You don't actually state that you have constant-time access, and it's possibly debatable whether a "real book" provides that or not.
For that matter, there's more than one way to choose random numbers without repeats, and not all of them have the same running time.
So, you'd need more detail in order to analyse the algorithm in terms of anything other than "number of pages visited".

Related

Most optimal match-up

Let's assume you're a baseball manager. And you have N pitchers in your bullpen (N<=14) and they have to face M batters (M<=100). Also to mention you know the strength of each of the pitchers and each of the batters. For those who are not familiar to baseball once you brought in a relief pitcher he can pitch to k consecutive batters, but once he's taken out ofthe game he cannot come back.
For each pitcher the probability that he's gonna lose his match-ups is given by (sum of all batter he will face)/(his strength). Try to minimize these probabilities, i.e. try to maximize your chances of winning the game.
For example we have 3 pitchers and they have to face 3 batters. The batters' stregnths are:
10 40 30
While the strength of your pitchers is:
40 30 3
The most optimal solution would be to bring the strongest pitcher to face the first 2 batters and the second to face the third batter. Then the probability of every pitcher losing his game will be:
50/40 = 1.25 and 30/30 = 1
So the probability of losing the game would be 1.25 (This number can be bigger than 100).
How can you find the optimal number? I was thinking to take a greedy approach, but I suspect whether it will always hold. Also the fact that the pitcher can face unlimited number of batters (I mean it's only limited by M) poses the major problem for me.
Probabilities must be in the range [0.0, 1.0] so what you call a probability can't be a probability. I'm just going to call it a score and minimize it.
I'm going to assume for now that you somehow know the order in which the pitchers should play.
Given the order, what is left to decide is how long each pitcher plays. I think you can find this out using dynamic programming. Consider the batters to be faced in order. Build an NxM table best[pitchers, batter] where best[i, j] is the best score you can make considering just the first j batters using the first i pitchers, or HUGE if it does not make sense.
best[1,1] is just the score for the best pitcher against the first batter, and best[1,j] doesn't make sense for any other values of j.
For larger values of i you work out best[i,j] by considering when the last change of pitcher could be, considering all possibilities (so 1, 2, 3...i). If the last change of pitcher was at time t, then look up best[t, j-1] to get the score up to the time just before that change, and then calculate the a/b value to take account of the sum of batter strengths between time t+1 and time i. When you have considered all possible times, take the best score and use it as the value for best[i, j]. Note down enough info (such as the last time of pitcher change that turned out to be best) so that once you have calculated best[N, M], you can trace back to find the best schedule.
You don't actually know the order, and because the final score is the maximum of the a/b value for each pitcher, the order does matter. However, given a separation of players into groups, the best way to assign pitchers to groups is to assign the best pitcher to the group with the highest total score, the next best pitcher to the group with the next best total score, and so on. So you could alternate between dividing batters into groups, as described above, and then assigning pitchers to groups to work out the order the pitchers really should be in - keep doing this until the answer stops changing and hope the result is a global optimum. Unfortunately there is no guarantee of this.
I'm not convinced that your score is a good model for baseball, especially since it started out as a probability but can't be. Perhaps you should work out a few examples (maybe even solving small examples by brute force) and see if the results look reasonable.
Another way to approach this problem is via http://en.wikipedia.org/wiki/Branch_and_bound.
With branch and bound you need some way to describe partial answers, and you need some way to work out a value V for a given partial answer, such that no way of extending that partial answer can possibly produce a better answer than V. Then you run a tree search, extending partial answers in every possible way, but discarding partial answers which can't possibly be any better than the best answer found so far. It is good if you can start off with at least a guess at the best answer, because then you can discard poor partial answers from the start. My other answer might provide a way of getting this.
Here a partial answer is a selection of pitchers, in the order they should play, together with the number of batters they should pitch to. The first partial answer would have 0 pitchers, and you could extend this by choosing each possible pitcher, pitching to each possible number of batters, giving a list of partial answers each mentioning just one pitcher, most of which you could hopefully discard.
Given a partial answer, you can compute the (total batter strength)/(Pitcher strength) for each pitcher in its selection. The maximum found here is one possible way of working out V. There is another calculation you can do. Sum up the total strengths of all the batters left and divide by the total strengths of all the pitchers left. This would be the best possible result you could get for the pitchers left, because it is the result you get if you somehow manage to allocate pitchers to batters as evenly as possible. If this value is greater than the V you have calculated so far, use this instead of V to get a less optimistic (but more accurate) measure of how good any descendant of that partial answer could possibly be.

Finding the average of large list of numbers

Came across this interview question.
Write an algorithm to find the mean(average) of a large list. This
list could contain trillions or quadrillions of number. Each number is
manageable in hundreds, thousands or millions.
Googling it gave me all Median of Medians solutions. How should I approach this problem?
Is divide and conquer enough to deal with trillions of number?
How to deal with the list of the such a large size?
If the size of the list is computable, it's really just a matter of how much memory you have available, how long it's supposed to take and how simple the algorithm is supposed to be.
Basically, you can just add everything up and divide by the size.
If you don't have enough memory, dividing first might work (Note that you will probably lose some precision that way).
Another approach would be to recursively split the list into 2 halves and calculating the mean of the sublists' means. Your recursion termination condition is a list size of 1, in which case the mean is simply the only element of the list. If you encounter a list of odd size, make either the first or second sublist longer, this is pretty much arbitrary and doesn't even have to be consistent.
If, however, you list is so giant that its size can't be computed, there's no way to split it into 2 sublists. In that case, the recursive approach works pretty much the other way around. Instead of splitting into 2 lists with n/2 elements, you split into n/2 lists with 2 elements (or rather, calculate their mean immediately). So basically, you calculate the mean of elements 1 and 2, that becomes you new element 1. the mean of 3 and 4 is your new second element, and so on. Then apply the same algorithm to the new list until only 1 element remains. If you encounter a list of odd size, either add an element at the end or ignore the last one. If you add one, you should try to get as close as possible to your expected mean.
While this won't calculate the mean mathematically exactly, for lists of that size, it will be sufficiently close. This is pretty much a mean of means approach. You could also go the median of medians route, in which case you select the median of sublists recursively. The same principles apply, but you will generally want to get an odd number.
You could even combine the approaches and calculate the mean if your list is of even size and the median if it's of odd size. Doing this over many recursion steps will generate a pretty accurate result.
First of all, this is an interview question. The problem as stated would not arise in practice. Also, the question as stated here is imprecise. That is probably deliberate. (They want to see how you deal with solving an imprecisely specified problem.)
Write an algorithm to find the mean(average) of a large list.
The word "find" is rubbery. It could mean calculate (to some precision) or it could mean estimate.
The phrase "large list" is rubbery. If could mean a list or array data structure in memory, or the "list" could be the result of a database query, the contents of a file or files.
There is no mention of the hardware constraints on the system where this will be implemented.
So the first thing >>I<< would do would be to try to narrow the scope by asking some questions of the interviewer.
But assuming that you can't, then a complete answer would need to cover the following points:
The dataset probably won't fit in memory at the same time. (But if it does, then that is good.)
Calculating the average of N numbers is O(N) if you do it serially. For N this size, it could be an intractable problem.
An alternative is to split into sublists of equals size and calculate the averages, and the average of the averages. In theory, this gives you O(N/P) where P is the number of partitions. The parallelism could be implemented with multiple threads, with multiple processes on the same machine, or distributed.
In practice, the limiting factors are going to be computational, memory and/or I/O bandwidth. A parallel solution will be effective if you can address these limits. For example, you need to balance the problem of each "worker" having uncontended access to its "sublist" versus the problem of making copies of the data so that that can happen.
If the list is represented in a way that allows sampling, then you can estimate the average without looking at the entire dataset. In fact, this could be O(C) depending on how you sample. But there is a risk that your sample will be unrepresentative, and the average will be too inaccurate.
In all cases doing calculations, you need to guard against (integer) overflow and (floating point) rounding errors. Especially while calculating the sums.
It would be worthwhile discussing how you would solve this with a "big data" platform (e.g. Hadoop) and the limitations of that approach (e.g. time taken to load up the data ...)

Distribute numbers to two "containers" and minimize their difference of sum

Suppose there are n numbers let says we have the following 4 numbers 15,20,10,25
There are two container A and B and my job is to distribute numbers to them so that the sum of the number in each container have the least difference.
In the above example, A should have 15+20 and B should have 10+ 25. So difference = 0.
I think of a method. It seems to work but I don't know why.
Sort the number list in descending order first. In each round, take the maximum number out
and put to the container have less sum.
Btw, is it can be solved by DP?
THX
In fact, your method doesn't always work. Think about that 2,4,4,5,5.The result by your method will be (5,4,2)(5,4), while the best answer is (5,5)(4,4,2).
Yes, it can be solved by Dynamical Programming.Here are some useful link:
Tutorial and Code: http://www.cs.cornell.edu/~wdtseng/icpc/notes/dp3.pdf
A practice: http://people.csail.mit.edu/bdean/6.046/dp/ (then click Balanced Partition)
What's more, please note that if the scale of problem is damn large (like you have 5 million numbers etc.), you won't want to use DP which needs a too huge matrix. If this is the case, you want to use a kind of Monte Carlo Algorithm:
divide n numbers into two groups randomly (or use your method at this step if you like);
choose one number from each group, if (to swap these two number decrease the difference of sum) swap them;
repeat step 2 until "no swap occurred for a long time".
You don't want to expect this method could always work out with the best answer, but it is the only way I know to solve this problem at very large scale within reasonable time and memory.

Guessing an unbounded integer

If I say to you:
"I am thinking of a number between 0 and n, and I will tell you if your guess is high or low", then you will immediately reach for binary search.
What if I remove the upper bound? i.e. I am thinking of a positive integer, and you need to guess it.
One possible method would be for you to guess 2, 4, 8, ..., until you guess 2**k for some k and I say "lower". Then you can apply binary search.
Is there a quicker method?
EDIT:
Clearly, any solution is going to take time proportional to the size of the target number. If I chuck Graham's number through the Ackermann function, we'll be waiting a while whatever strategy you pursue.
I could offer this algorithm too: Guess each integer in turn, starting from 1.
It's guaranteed to finish in a finite amount of time, but yet it's clearly much worse than my "powers of 2" strategy. If I can find a worse algorithm (and know that it is worse), then maybe I could find a better one?
For example, instead of powers of 2, maybe I can use powers of 10. Then I find the upper bound in log_10(n) steps, instead of log_2(n) steps. But I have to then search a bigger space. Say k = ceil(log_10(n)). Then I need log_2(10**k - 10**(k-1)) steps for my binary search, which I guess is about 10+log_2(k). For powers of 2, I have roughly log_2(log_2(n)) steps for my search phase. Which wins?
What if I search upwards using n**n? Or some other sequence? Does the prize go to whoever can find the sequence that grows the fastest? Is this a problem with an answer?
Thank you for your thoughts. And my apologies to those of you suggesting I start at MAX_INT or 2**32-1, since I'm clearly drifting away from the bounds of practicality here.
FINAL EDIT:
Hi all,
Thank you for your responses. I accepted the answer by Norman Ramsey (and commenter onebyone) for what I understood to be the following argument: for a target number n, any strategy must be capable of distinguishing between (at least) the numbers from 0..n, which means you need (at least) O(log(n)) comparisons.
However seveal of you also pointed out that the problem is not well-defined in the first place, because it's not possible to pick a "random positive integer" under the uniform probability distribution (or, rather, a uniform probability distribution cannot exist over an infinite set). And once I give you a nonuniform distribution, you can split it in half and apply binary search as normal.
This is a problem that I've often pondered as I walk around, so I'm pleased to have two conclusive answers for it.
If there truly is no upper bound, and all numbers all the way to infinity are equally likely, then there is no optimum way to do this. For any finite guess G, the probability that the number is lower than G is zero and the probability that it is higher is 1 - so there is no finite guess that has an expectation of being higher than the number.
RESPONSE TO JOHN'S EDIT:
By the same reasoning that powers of 10 are expected to be better than powers of 2 (there's only a finite number of possible Ns for which powers of 2 are better, and an infinite number where powers of 10 are better), powers of 20 can be shown to be better than powers of 10.
So basically, yes, the prize goes to fastest-growing sequence (and for the same sequence, the highest starting point) - for any given sequence, it can be shown that a faster growing sequence will win in infinitely more cases. And since for any sequence you name, I can name one that grows faster, and for any integer you name, I can name one higher, there's no answer that can't be bettered. (And every algorithm that will eventually give the correct answer has an expected number of guesses that is infinite, anyway).
People (who have never studied probability) tend to think that "pick a number from 1 to N" means "with equal probability of each", and they act according to their intuitive understanding of probability.
Then when you say "pick any positive integer", they still think it means "with equal probability of each".
This is of course impossible - there exists no discrete probability distribution with domain the positive integers, where p(n) == p(m) for all n, m.
So, the person picking the number must have used some other probability distribution. If you know anything at all about that distribution, then you must base your guessing scheme on that knowledge in order to have the "fastest" solution.
The only way to calculate how "fast" a given guessing scheme is, is to calculate its expected number of guesses to find the answer. You can only do this by assuming a probability distribution for the target number. For example, if they have picked n with probability (1/2) ^ n, then I think your best guessing scheme is "1", "2", "3",... (average 2 guesses). I haven't proved it, though, maybe it's some other sequence of guesses. Certainly the guesses should start small and grow slowly. If they have picked 4 with probability 1 and all other numbers with probability 0, then your best guessing scheme is "4" (average 1 guess). If they have picked a number from 1 to a trillion with uniform distribution, then you should binary search (average about 40 guesses).
I say the only way to define "fast" - you could look at worst case. You have to assume a bound on the target, to prevent all schemes having the exact same speed, namely "no bound on the worst case". But you don't have to assume a distribution, and the answer for the "fastest" algorithm under this definition is obvious - binary search starting at the bound you selected. So I'm not sure this definition is terribly interesting...
In practice, you don't know the distribution, but can make a few educated guesses based on the fact that the picker is a human being, and what numbers humans are capable of conceiving. As someone says, if the number they picked is the Ackermann function for Graham's number, then you're probably in trouble. But if you know that they are capable of representing their chosen number in digits, then that actually puts an upper limit on the number they could have chosen. But it still depends what techniques they might have used to generate and record the number, and hence what your best knowledge is of the probability of the number being of each particular magnitude.
Worst case, you can find it in time logarithmic in the size of the answer using exactly the methods you describe. You might use Ackermann's function to find an upper bound faster than logarithmic time, but then the binary search between the number guessed and the previous guess will require time logarithmic in the size of the interval, which (if guesses grow very quickly) is close to logarithmic in the size of the answer.
It would be interesting to try to prove that there is no faster algorithm (e.g., O(log log n)), but I have no idea how to do it.
Mathematically speaking:
You cannot ever correctly find this integer. In fact, strictly speaking, the statement "pick any positive integer" is meaningless as it cannot be done: although you as a person may believe you can do it, you are actually picking from a bounded set - you are merely unconscious of the bounds.
Computationally speaking:
Computationally, we never deal with infinites, as we would have no way of storing or checking against any number larger than, say, the theoretical maximum number of electrons in the universe. As such, if you can estimate a maximum based on the number of bits used in a register on the device in question, you can carry out a binary search.
Binary search can be generalized: each time set of possible choices should be divided into to subsets of probability 0.5. In this case it's still applicable to infinite sets, but still requires knowledge about distribution (for finite sets this requirement is forgotten quite often)...
My main refinement is that I'd start with a higher first guess instead of 2, around the average of what I'd expect them to choose. Starting with 64 would save 5 guesses vs starting with 2 when the number's over 64, at the cost of 1-5 more when it's less. 2 makes sense if you expect the answer to be around 1 or 2 half the time. You could even keep a memory of past answers to decide the best first guess. Another improvement could be to try negatives when they say "lower" on 0.
If this is guessing the upper bound of a number being generated by a computer, I'd start with 2**[number of bits/2], then scale up or down by powers of two. This, at least, gets you the closest to the possible values in the least number of jumps.
However, if this is a purely mathematical number, you can start with any value, since you have an infinite range of values, so your approach would be fine.
Since you do not specify any probability distribution of the numbers (as others have correctly mentioned, there is no uniform distribution over all the positive integers), the No Free Lunch Theorem give the answer: any method (that does not repeat the same number twice) is as good as any other.
Once you start making assumptions about the distribution (f.x. it is a human being or binary computer etc. that chooses the number) this of course changes, but as the problem is stated any algorithm is as good as any other when averaged over all possible distributions.
Use binary search starting with MAX_INT/2, where MAX_INT is the biggest number your platform can handle.
No point in pretending we can actually have infinite possibilities.
UPDATE: Given that you insist on entering the realms of infinity, I'll just vote to close your question as not programming related :-)
The standard default assumption of a uniform distribution for all positive integers doesn't lead to a solution, so you should start by defining the probability distribution of the numbers to guess.
I'd probably start my guessing with Graham's Number.
The practical answer within a computing context would be to start with whatever is the highest number that can (realistically) be represented by the type you are using. In case of some BigInt type you'd probably want to make a judgement call about what is realistic... obviously ultimately the bound in that case is the available memory... but performance-wise something smaller may be more realistic.
Your starting point should be the largest number you can think of plus 1.
There is no 'efficient search' for a number in an infinite range.
EDIT: Just to clarify, for any number you can think of there are still infinitely more numbers that are 'greater' than your number, compared to a finite collection of numbers that are 'less' than your number. Therefore, assuming the chosen number is randomly selected from all positive numbers, you have zero | (approaching zero) chance of being 'above' the chosen number.
I gave an answer to a similar question "Optimal algorithm to guess any random integer without limits?"
Actually, provided there algorithm not just searches for the conceived number, but it estimates a median of the distribution of the number that you may re-conceive at each step! And also the number could be even from the real domain ;)

What are the ways of calculating Average Case's

Background:
For my Data Structures and Algorithms I am studying the Big O Notation. So far I understand how to workout the time complexity, best and worst case scenario. However, the average case is just baffling my head. The teacher is just throwing at us equations that I don't understand. And he is not willing to explain them in detail.
Question:
So please guys, what is the best way to calculate this? Is there one equation that calculates this or does it vary from algorithm to algorithm?
What are the steps you take to calculate this?
Let's take an example of Insertion sort algorithm?
Research:
I looked on youtube and stackoverflow for answers. But they all use different equations.
Any help would be great
thanks
As mentioned in the comment you have to look at the average input to the algorithm (which in this case means random). A good way to think about it is to try at trace what the algorithm would do if the input was average.
For the example of insertion sort:
In the best case (when the input is already sorted) the algorithm will look through the input but never exchanging anything, clearly resulting in a running time of O(n).
In the worst case (when the input is exactly opposite if the desired order) the algorithm will move every input all the way from it's current position to the start of the list, that is, the object on index 0 will not be moved, the object on index 1 will be moved once, the object on input 2 will be moved twice and so on, resulting in a running time of 0+1+2+3+...+n-1 ≈ 0.5n² = O(n²).
The same way of thinking can be used to find the average case, but instead of each object moving all the way to the start, we can expect that it will on average move halfway down to the start, that is, the object on index 0 will not be moved, the object on index 1 will be moved a half time (of cause this only makes sense on average), the object on input 2 will be moved once, the object on index 3 will be moved 1,5 times and so on, resulting in a running time of 0 + 0.5 + 1 + 1.5 + 2 + ... + (n-1)/2 ≈ 0.25n² (at each index, we have half of what we had in the worst case) = 0(n²).
Of cause not all algorithms are as simple as this, but looking at what the algorithm would do on each step if the input was random usually helps. If you have any kind of information available on the input to the algorithm, (for instance insertion sort is often used as the last step after an other algorithm has done most of the sorting, as it is very efficient if the input is almost sorted, and in such a case we might for example know that no object is going to be moved more than x times) then this can be taken into account when computing the average running time.

Resources