Linear probing hash in Mark Allen Weiss's book - data-structures

In the hashing chapters of Data Structures and Algorithm Analysis in C++, λ is the load factor of a hash table. When the author discusses linear probing as a way to resolve collisions, there is a passage I can't understand:
We will assume a very large table and that each probe is independent
of the previous probes. These assumptions are satisfied by a random
collision resolution strategy and are reasonable unless λ is very
close to 1. First, we derive the expected number of probes in an
unsuccessful search. This is just the expected number of probes until
we find an empty cell. Since the fraction of empty cells is 1 − λ, the
number of cells we expect to probe is 1/(1 − λ).
My questions are:
What is the meaning of the first bolded paragraph?
How can I deduce 1/(1 − λ)?

How many randomly-chosen cells do you need to test before you find an empty one?
If the fraction of filled slots is λ, then the probability that a randomly-chosen slot will be full is also λ. You can estimate the typical number of slots you need to test by solving λ^(n−1) = 0.5 (the point at which you are as likely as not to have found an empty slot already), so n = 1 + (log 0.5) / (log λ).
As long as λ is less than 0.6 or so, 1/(1 − λ) is a good approximation. [Plot from Wolfram Alpha comparing the two expressions]
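The 1/(1 − λ) figure from the book can also be checked directly under the quoted independence assumption: if each probe hits a full cell with probability λ, the first empty cell is found on probe k with probability λ^(k−1)·(1 − λ), and the mean of that geometric distribution is 1/(1 − λ). A minimal sketch (the function name `expected_probes` is only for illustration) sums the series numerically and prints it next to the closed form:

```python
def expected_probes(load, terms=10_000):
    """Sum k * P(first empty cell is found on probe k) for k = 1..terms."""
    empty = 1.0 - load
    total = 0.0
    full_so_far = 1.0  # equals load ** (k - 1) at the top of each iteration
    for k in range(1, terms + 1):
        total += k * full_so_far * empty
        full_so_far *= load
    return total


for load in (0.25, 0.5, 0.75, 0.9):
    # Truncated series on the left, closed form 1 / (1 - load) on the right.
    print(load, round(expected_probes(load), 6), round(1 / (1 - load), 6))
```

The truncation at `terms` is an assumption of the sketch; for load factors very close to 1 more terms would be needed, which is also where the book says the independence assumption stops being reasonable.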


A sudoku problem: Efficiently find or approximate probability distribution over chosen numbers at each index of an array with no repeats

I'm looking for an efficient algorithm to generate or iteratively approximate a solution to the problem described below.
You are given an array of length N and a finite set of numbers S_i for each index i of the array. Now, suppose we place a number from S_i at each index i to fill the entire array, while ensuring that each number is unique across the entire array. Given all the possible arrays, what is the probability distribution over each number at each index?
Here I give an example:
Assume we have the following array of length 3, with each column representing S_i at the index of that column:
4  4  4
   2  2
1  1  1
We will have the following possible arrays:
1 2 4
1 4 2
4 1 2
4 2 1
And the following probability distribution (laid out like the candidates above, at each index respectively):
0.5  0.25  0.25
     0.5   0.5
0.5  0.25  0.25
Brute forcing this problem is obviously doable but I have a gut feeling that there must be some more efficient algorithms for this.
The reason I think so is that one can derive the probability distribution from the set of all possibilities but not the other way around, so the distribution itself must contain less information than the set of all possibilities does. Therefore, I believe that we do not need to generate all possibilities just to obtain the probability distribution.
Hence, I am wondering if there is any smart matrix operation we could use for this problem or even fixed-point iteration/density evolution to approximate the end probability distribution? Some other potentially more efficient approaches to this problem are also appreciated.
(P.S. The reason I am interested in this problem is that I want to generate a probability distribution over candidate numbers for the empty cells in a sudoku, and in other sudoku-like games without a unique answer, by applying only the standard rules.)
Sudoku is a combinatorial problem. It is easy to show that the probability of any independent cell is uniform (because you can relabel a configuration to put any number at a given position). The joint probabilities are more complicated.
If the game is partially filled you have constraints that will affect this distribution.
You must devise an algorithm to calculate the number of solutions from a given initial configuration. Then you compute the fraction of the total solutions that have a specific value at the position of interest.
counts = {}
for i in range(1, 10):
    board[cell] = i  # try value i at the cell of interest
    counts[i] = countSolutions(board)
total = sum(counts.values())
prob = {i: counts[i] / total for i in counts}
The same approach works for joint probabilities but in some cases the number of possibilities may be too high.
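The counting approach can be tried on the three-slot example from the question. The sketch below is only an illustration: `count_solutions` is a hypothetical backtracking counter written for this toy problem (it is not the `countSolutions` routine referred to above), and the candidate sets are read off the columns of the example.

```python
from fractions import Fraction


def count_solutions(candidates, used=frozenset(), index=0):
    """Count ways to fill slots index.. with distinct values from their candidate sets."""
    if index == len(candidates):
        return 1
    return sum(
        count_solutions(candidates, used | {value}, index + 1)
        for value in candidates[index]
        if value not in used
    )


def marginals(candidates):
    """For each slot, map each candidate to the fraction of solutions that use it there."""
    total = count_solutions(candidates)
    result = []
    for i, options in enumerate(candidates):
        dist = {}
        for value in sorted(options):
            # Fix slot i to `value` and count the completions of the other slots.
            fixed = candidates[:i] + [{value}] + candidates[i + 1:]
            dist[value] = Fraction(count_solutions(fixed), total)
        result.append(dist)
    return result


example = [{1, 4}, {1, 2, 4}, {1, 2, 4}]
for i, dist in enumerate(marginals(example)):
    print(i, {value: float(p) for value, p in dist.items()})
```

This still enumerates solutions implicitly, so it does not escape the combinatorial blow-up; it only shows how the per-cell probabilities fall out of solution counts.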

Sampling from Geometric distribution in constant time

I would like to know if there is any method to sample from the geometric distribution in constant time without using log, which can be hard to approximate. Thanks.
Without relying on logarithms, there is no algorithm to sample from a geometric(p) distribution in constant expected time. Rather, on a realistic computing model, such an algorithm's expected running time must grow at least as fast as 1 + log(1/p)/w, where w is the word size of the computer in bits (Bringmann and Friedrich 2013). The following algorithm, which is equivalent to one in the Bringmann paper, generates a geometric(px/py) random number without relying on logarithms, and when px/py is very small, the algorithm is considerably faster than the trivial algorithm of generating trials until a success:
Set pn to px, k to 0, and d to 0.
While pn*2 <= py, add 1 to k and multiply pn by 2.
With probability (1 − px/py)^(2^k), add 1 to d and repeat this step.
Generate a uniform random integer in [0, 2^k), call it m, then with probability (1 − px/py)^m, return d*2^k + m. Otherwise, repeat this step.
(The actual algorithm described in the Bringmann paper is in fact much more involved than this; see my note "On a Geometric Sampler".)
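The four steps above can be sketched in Python as follows. One assumption is mine rather than the paper's: the biased coin with probability (1 − px/py)^e is realised by comparing a uniform big integer below py^e against (py − px)^e, which is exact but builds large powers and is therefore only practical for moderate px/py. The helper names `bernoulli_pow` and `geometric` are illustrative.

```python
import random


def bernoulli_pow(rng, num, den, e):
    """Return True with probability (num/den)**e, using exact integer arithmetic."""
    return rng.randrange(den ** e) < num ** e


def geometric(rng, px, py):
    """Sample the number of failures before the first success, with p = px/py."""
    # Steps 1-2: find k such that 2**k is roughly 1/p.
    pn, k = px, 0
    while pn * 2 <= py:
        k += 1
        pn *= 2
    block = 1 << k
    fail = py - px  # numerator of 1 - p
    # Step 3: count whole blocks of 2**k consecutive failures.
    d = 0
    while bernoulli_pow(rng, fail, py, block):
        d += 1
    # Step 4: rejection-sample the position inside the final block.
    while True:
        m = rng.randrange(block)
        if bernoulli_pow(rng, fail, py, m):
            return d * block + m


rng = random.Random(2013)
samples = [geometric(rng, 1, 10) for _ in range(20000)]
print(sum(samples) / len(samples))  # should be close to (1 - p) / p = 9
```

Note the convention: the value returned counts failures before the first success, so it starts at 0 and has mean (1 − p)/p.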
Bringmann, K., and Friedrich, T., 2013, July. Exact and efficient generation of geometric random variates and random graphs, in International Colloquium on Automata, Languages, and Programming (pp. 267-278).

Marathon24 contest quals: DNA -TLE

I'm re-attempting this problem statement, now that the contest is all over (so it's not cheating or anything, just want to learn, since the answers are not published, only the correct output for the given test case input files).
There are 10 given test case inputs, for which the associated output files are to be submitted. My original submission was an implementation of a naive nested for loop of (start,end) pairs, answering the query: What is the volatility measure of the substring starting at (0-based) index start, and ending at end (inclusive).
Clearly, for the maximum problem limits of 10^6, O(N^2) is infeasible, so I only got as far as 5/10 test cases correct (the first - and simpler - 5, of course).
As such, I'm writing here to seek the crowd's intelligence on how I could go about improving my algorithm. I suspect the nested for loop over (start,end) is the main bottleneck to optimize (of course!). So far, I've gone down the route of trying to formulate this as a dynamic programming (DP) problem on strings/substrings, but without much success in coming up with the state representation and transitions needed to implement the DP.
For easy reference, and to show that this is not homework, and I have honestly tried, my original submission is available here.
Any help is much appreciated, even links to similar problems for which I can google for tutorial blog posts/sample solutions/post-contest editorial analysis.
Have you tried divide-and-conquer?
If I understand the problem correctly, given a DNA chain S of length n, we divide S into two halves, S_left and S_right, with S_left consisting of S[i] where 0 <= i < n/2, and S_right consisting of S[j] where n/2 <= j < n. The most volatile fragment either occurs entirely within S_left, entirely within S_right, or crosses the boundary between S_left and S_right.
To find the most volatile fragment of S_left and S_right we just use recursion. The tricky bit is to find the volatility measure of the fragment which crosses the boundary between S_left and S_right. There is a mathematical property of positive integer fractions here: given four positive (non-zero) integers a, b, c and d, (a + c) / (b + d) is never greater than both (a / b) and (c / d). Here a and b are the cumulative counts of purines and pyrimidines in S_left starting at the boundary, while c and d are the cumulative counts of purines and pyrimidines in S_right starting at the boundary. This mathematical property means that we don't need to examine the volatility measure of the crossing fragment beyond a = 0 or c = 0, because it is guaranteed to be less than the maximum volatility of S_left or S_right. The search for the crossing fragment can be done in O(n) time, giving O(n lg n) for the overall algorithm.
I hope this works, as I haven't coded the algorithm. Perhaps there is an O(n)-time DP algorithm for this problem, but this is all I've got for now.
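The fraction property used in this answer is the mediant inequality: for positive a, b, c and d, (a + c)/(b + d) always lies between a/b and c/d, so it can never exceed both. A quick exhaustive check over small values with exact rationals (the helper name `mediant` is illustrative, and this checks only the inequality, not the contest problem itself):

```python
from fractions import Fraction
from itertools import product


def mediant(a, b, c, d):
    """Return (a + c) / (b + d) as an exact fraction."""
    return Fraction(a + c, b + d)


# Check every combination of positive integers below 8.
for a, b, c, d in product(range(1, 8), repeat=4):
    lo, hi = sorted((Fraction(a, b), Fraction(c, d)))
    assert lo <= mediant(a, b, c, d) <= hi

print(mediant(1, 2, 3, 4))  # lies between 1/2 and 3/4
```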

Reducing the Average Number of Comparisons in Selection

The problem here is to reduce the average number of comparisons needed in a selection sort.
I am reading an article on this and here is text snippet:
More generally, a sample S' of s elements is chosen from the n
elements. Let "delta" be some number, which we will choose later so
as to minimize the average number of comparisons used by the
procedure. We find the (v1 = (k * s)/n - delta)th and (v2 = (k * s)/n + delta)th
smallest elements in S'. Almost certainly, the kth smallest
element in S will fall between v1 and v2, so we are left with a
selection problem on (2 * delta) elements. With low probability, the
kth smallest element does not fall in this range, and we have
considerable work to do. However, with a good choice of s and delta,
we can ensure, by the laws of probability, that the second case does
not adversely affect the total work.
I do not follow the above text. Can anyone please explain it to me with examples? How did the author reduce the problem to 2 * delta elements? And how does he know that there is a low probability that the element does not fall into this range?
The basis for the idea is that the normal selection algorithm has linear runtime complexity, but in practical terms is slow. We need to sort all the elements in groups of five, and recursively do even more work. O(n) but with too large a constant. The idea then, is to reduce the number of comparisons in the selection algorithm (not a selection sort necessarily). Intuitively it is the same as in basic statistics; if I take a sample subspace of large enough proportion, it is likely that the distribution of data in the subspace adequately reflects the data in the whole space.
So if I'm looking for the kth number in a set of size one million, I could instead take say 10 000 (already one hundredth the size), which is still large enough to be a good representation of the global distribution, and look for the k/100th number. That's simple scaling. So if the space was 10 and I was looking for the 3rd, that's like looking for the 30th in 100, or the 300th in 1000, etc. Essentially k/S = k'/S' (where we're looking for the kth number in S, and we translate that to the k'th number in S' our subspace) and therefore k' = k*S'/S which should look familiar, since in the text you quoted S' is denoted by s, and S by n, and that's the same fraction quoted.
Now in order to take statistical fluctuations into account, we don't assume that the subspace will be a perfect representation of the data's distribution, so we allow for some fluctuation, namely, delta. We say let's find the k'th-delta and k'th+delta elements in S', and then we can say with great certainty (i.e. high mathematical probability) that the kth value from S is in the interval (k'th-delta, k'th+delta).
To wrap it all up we perform these two selections on S', then partition S accordingly, and now do [normal] selection on the much smaller interval in the partition. This ends up being almost optimal for the elements outside the interval, because we don't do selection on those, only partition them. So the selection process is faster, because we have reduced the problem size from S to S'.
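The idea described above can be sketched as follows. This is a simplified illustration, not the procedure from the quoted text: the sample size `s` and the slack `delta` are arbitrary defaults rather than tuned values, the sample and the narrowed window are simply sorted, and the low-probability miss is handled by sorting everything.

```python
import random


def sample_select(data, k, s=200, delta=20, rng=None):
    """Return the k-th smallest element of data (k is 1-based)."""
    rng = rng or random.Random(0)
    n = len(data)
    if n <= s:
        return sorted(data)[k - 1]
    sample = sorted(rng.sample(data, s))
    target = k * s // n  # scaled rank k' = k * s / n
    lo = sample[max(target - delta, 0)]
    hi = sample[min(target + delta, s - 1)]
    # Partition: count what lies below the window, keep what lies inside it.
    below = sum(1 for x in data if x < lo)
    middle = [x for x in data if lo <= x <= hi]
    if below < k <= below + len(middle):
        # Likely case: the answer is inside the much smaller window.
        return sorted(middle)[k - below - 1]
    # Unlikely case: the window missed, so fall back to the slow path.
    return sorted(data)[k - 1]


values = random.Random(1).sample(range(1_000_000), 50_000)
print(sample_select(values, 25_000) == sorted(values)[24_999])
```

Only the elements between `lo` and `hi` are sorted in the common case; everything else is merely compared against the two pivots, which is where the saving in comparisons comes from.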

Programming problem - Game of Blocks

maybe you would have an idea on how to solve the following problem.
John decided to buy his son Johnny some mathematical toys. One of his favorite toys is blocks of different colors. John has decided to buy blocks of C different colors. For each color he will buy a googol (10^100) blocks. All blocks of the same color have the same length, but blocks of different colors may vary in length.
Johnny has decided to use these blocks to make a large 1 x n block. He wonders how many ways he can do this. Two ways are considered different if there is a position where the color differs. The example shows a red block of size 5, a blue block of size 3 and a green block of size 3. It shows there are 12 ways of making a large block of length 11.
Each test case starts with an integer 1 ≤ C ≤ 100. The next line consists of C integers; the i-th integer, 1 ≤ len_i ≤ 750, denotes the length of the i-th color. The next line is a positive integer N ≤ 10^15.
This problem should be solved in 20 seconds for T <= 25 test cases. The answer should be calculated MOD 100000007 (prime number).
It can be reduced to a matrix exponentiation problem, which can be solved relatively efficiently in O(m^2.376*log(N)), where m = max(len_i), using the Coppersmith-Winograd algorithm and fast exponentiation. But it seems that a more efficient algorithm is required, as Coppersmith-Winograd implies a large constant factor. Do you have any other ideas? It could possibly be a Number Theory or Divide and Conquer problem.
Firstly note the number of blocks of each colour you have is a complete red herring, since 10^100 > N always. So the number of blocks of each colour is practically infinite.
Now notice that at each position p (if there is a valid configuration that leaves no gaps, etc.) there must be a block of some color c. There are len[c] ways for this block to lie so that it still covers position p.
My idea is to try all possible colors and offsets at a fixed position (N/2, since it halves the range). For each case, there are b cells before this fixed block and a cells after it. If we define a function ways(i) that returns the number of ways to tile i cells (with ways(0)=1), then the number of ways to tile the cells with a fixed block at that position is ways(b)*ways(a). Adding up over all possible configurations yields the answer for ways(i).
I chose the fixed position to be N/2 because that halves the range, and you can halve a range at most ceil(log(N)) times. Since you are moving a block around N/2, you will have to calculate from N/2-750 to N/2+750, where 750 is the maximum length a block can have. So you will have to calculate about 750*ceil(log(N)) lengths (a bit more because of the variance) to get the final answer.
So in order to get good performance you have to throw in memoisation, since this is inherently a recursive algorithm.
So using Python(since I was lazy and didn't want to write a big number class):
T = int(input())
for case in range(T):
    # read in the data
    C = int(input())
    lengths = list(map(int, input().split()))
    minlength = min(lengths)
    n = int(input())
    # set up memoisation; note all lengths less than the minimum length are
    # set to 0 as the algorithm needs this
    memoise = {}
    memoise[0] = 1
    for length in range(1, minlength):
        memoise[length] = 0

    def solve(n):
        if n in memoise:
            return memoise[n]
        ans = 0
        for i in range(C):
            if lengths[i] > n:
                continue  # block does not fit at all
            if lengths[i] == n:
                ans += 1  # a single block fills the whole range
                ans %= 100000007
                continue
            for j in range(0, lengths[i]):
                b = n // 2 - lengths[i] + j  # cells before the fixed block
                a = n - (n // 2 + j)         # cells after the fixed block
                if b < 0 or a < 0:
                    continue
                ans += solve(b) * solve(a)
                ans %= 100000007
        memoise[n] = ans
        return memoise[n]

    print("Case %d: %d" % (case + 1, solve(n)))
Note I haven't exhaustively tested this, but I'm quite sure it will meet the 20 second time limit, if you translated this algorithm to C++ or somesuch.
EDIT: Running a test with N = 10^15 and a block of length 750, I get that memoise contains about 60000 elements, which means the non-lookup part of solve(n) is called about the same number of times.
A word of caution: In the case c=2, len1=1, len2=2, the answer will be the N'th Fibonacci number, and the Fibonacci numbers grow (approximately) exponentially with a growth factor of the golden ratio, phi ~ 1.61803399. For the
huge value N=10^15, the answer will be about phi^(10^15), an enormous number. The answer will have storage
requirements on the order of (ln(phi^(10^15))/ln(2)) / (8 * 2^40) ~ 79 terabytes. Since you can't even access 79
terabytes in 20 seconds, it's unlikely you can meet the speed requirements in this special case.
Your best hope occurs when C is not too large, and len_i is large for all i. In such cases, the answer will
still grow exponentially with N, but the growth factor may be much smaller.
I recommend that you first construct the integer matrix M which will compute the (i+1,..., i+k)
terms in your sequence based on the (i, ..., i+k-1) terms. (only row k+1 of this matrix is interesting).
Compute the first k entries "by hand", then calculate M^(10^15) based on the repeated squaring
trick, and apply it to terms (0...k-1).
The (integer) entries of the matrix will grow exponentially, perhaps too fast to handle. If this is the case, do the
very same calculation, but modulo p, for several moderate-sized prime numbers p. This will allow you to obtain
your answer modulo p, for various p, without using a matrix of bigints. After using enough primes so that you know their product
is larger than your answer, you can use the so-called "Chinese remainder theorem" to recover
your answer from your mod-p answers.
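A minimal sketch of the matrix approach, working modulo the single prime 100000007 given in the problem statement (so no Chinese remainder step is needed here). The function names `mat_mul` and `count_tilings` are illustrative, the matrix has side k = max(lengths), and the plain cubic multiplication is only meant for small k; for k = 750 a faster multiplication would be required.

```python
MOD = 100000007


def mat_mul(a, b):
    """Multiply two square matrices modulo MOD."""
    n = len(a)
    res = [[0] * n for _ in range(n)]
    for i in range(n):
        for t in range(n):
            if a[i][t]:
                for j in range(n):
                    res[i][j] = (res[i][j] + a[i][t] * b[t][j]) % MOD
    return res


def count_tilings(n, lengths):
    """Number of ways to tile a 1 x n strip with blocks of the given lengths, mod MOD."""
    k = max(lengths)
    # First k terms "by hand": w[i] = sum of w[i - L] over all colours L that fit.
    w = [0] * k
    w[0] = 1
    for i in range(1, k):
        w[i] = sum(w[i - L] for L in lengths if L <= i) % MOD
    if n < k:
        return w[n]
    # M maps (w[i], ..., w[i+k-1]) to (w[i+1], ..., w[i+k]).
    m = [[0] * k for _ in range(k)]
    for r in range(k - 1):
        m[r][r + 1] = 1
    for L in lengths:
        m[k - 1][k - L] += 1
    # Repeated squaring: power = M ** n.
    power = [[int(i == j) for j in range(k)] for i in range(k)]
    e = n
    while e:
        if e & 1:
            power = mat_mul(power, m)
        m = mat_mul(m, m)
        e >>= 1
    # The first entry of M**n applied to (w[0], ..., w[k-1]) is w[n].
    return sum(power[0][j] * w[j] for j in range(k)) % MOD


print(count_tilings(11, [5, 3, 3]))  # the example from the question
```

For the example in the question (a red block of length 5, blue and green blocks of length 3, strip length 11) this reproduces the 12 arrangements mentioned there.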
I'd like to build on the earlier @JPvdMerwe solution with some improvements. In his answer, @JPvdMerwe uses a Dynamic Programming / memoisation approach, which I agree is the way to go on this problem. Dividing the problem recursively into two smaller problems and remembering previously computed results is quite efficient.
I'd like to suggest several improvements that would speed things up even further:
Instead of going over all the ways the block in the middle can be positioned, you only need to go over the first half, and multiply the solution by 2. This is because the second half of the cases are symmetrical. For odd-length blocks you would still need to take the centered position as a separate case.
In general, iterative implementations can be significantly faster than recursive ones. This is because a recursive implementation incurs bookkeeping overhead for each function call. It can be a challenge to convert a solution to its iterative cousin, but it is usually possible. The @JPvdMerwe solution can be made iterative by using a stack to store intermediate values.
Modulo operations are expensive, as are multiplications to a lesser extent. The number of multiplications and modulos can be decreased by approximately a factor C=100 by switching the color-loop with the position-loop. This allows you to add the return values of several calls to solve() before doing a multiplication and modulo.
A good way to test the performance of a solution is with a pathological case. The following could be especially daunting: length 10^15, C=100, prime block sizes.
Hope this helps.
In the above answer,
ans += 1
ans %= 100000007
could be much faster without a general modulo:
ans += 1
if ans == 100000007:
    ans = 0
Please see TopCoder thread for a solution. No one was close enough to find the answer in this thread.
