Hashing with the Division Method - Choosing number of slots? - algorithm

So, in CLRS, there's this quote
A prime not too close to an exact power of 2 is often a good choice for m.
Several Questions...
I understand how a power of 2 will just be the lower order bits of your key...however, say you have keys from a universe of 1 to 1 million, with each key having an equal probability of being any number from universe (which I'm guessing is a common assumption about your universe if given no other data?) then wouldn't taking say the 4 lower order bits result in (2^4) lower order bit patterns that were pretty much equally likely for the keys from 1 to 1 million? How am I thinking about this incorrectly?
Why a prime number? So, if power of 2's aren't a good idea, why is a prime number a better choice as opposed to a composite number close to a power of 2 (Also why should it be close to a power of 2...lol)?

You are trying to find a hash table that works well for typical input data, and typical input data does things that you wouldn't expect from good random number generators. Very often you get formatted or semi-formatted strings which, when converted to numbers, end up as K, K+A, K+2A, K+3A,.... for some integers K and A. If K+xA and K+yA hash to the same number mod m, then (x-y)A must be 0 mod m. If m is prime, this can only happen if A = 0 mod m or if x = y mod m, so one time in m. But if m=pq and A happens to be divisible by p, then you get a collision every time x-y is divisible by q, which is more often since q < m.
I guess close to a power of 2 because it might be convenient for the memory management system to have blocks of memory of the resulting size - I really don't know. If you really care, and if you have the time, you could try different primes with some representative data and see which of them are best in practice.

Related

How does flajolet martin sketch works?

I am trying to understand this sketch but am not able to understand.
Correct me if I am wrong but basically, lets say I have a text data.. words.. I have a hash function.. which takes a word and create an integer hash and then I convert that hash to binary bit vector?? Right..
Then I keep a track of the first 1 I see from left.. And the position where that 1 is (say , k)... the cardinality of this set is 2^k?
http://ravi-bhide.blogspot.com/2011/04/flajolet-martin-algorithm.html
But ... say I have just one word. and the hash function of it is such that hash it generates is 2^5, then I am guessing there are 5 (??) trailing 0's?? so it will predict 2^5 (??) cardinality?
That doesnt sounds right? What am I missing
For a single word the distribution of R is a geometric distribution with p = 1/2, and its standard deviation is sqrt(2) ≈ 1.41.
So for a word with hash ending in 100000b the algorithm will, indeed, yield 25/0.77351 = 41.37. But the probability of that is only 1/64, which is consistent with the statement that the standard deviation of R is close to 1.
http://ravi-bhide.blogspot.com/2011/04/flajolet-martin-algorithm.html
We had a good, random hash function that acted on strings and generated integers, what can we say about the generated integers? Since they are random themselves, we would expect:
1/2 of them to have their binary representation end in 0(i.e. divisible by 2),
1/4 of them to have their binary representation end in 00 (i.e. divisible by 4)
1/8 of them to have their binary representation end in 000 (i.e. divisible by 8)
Turning the problem around, if the hash function generated an integer ending in 0^m bits ..intuitively, the number of unique strings is around 2^m.
What is really important to remember is that the Flajolet Martin Algorithm is meant to count distinct elements (lets say M distinct elements) from a set of N elements, when M is expected to be very very large.
There is no point of using the algorithm if N or M are small enough for us to store all distinct elements in memory.
In the case where N and M are really large, the probability of the estimate being close to 2^k is actually very reasonable.
There is an explanation of this at : http://infolab.stanford.edu/~ullman/mmds/ch4.pdf (page 143)

Compression of sequence of integers providing random access

I have a sequence of n integers in a small range [0,k) and all the integers have the same frequency f (so the size of the sequence is n=f∗k). What I'm trying to do now is to compress this sequence while providing random access (what is the i-th integer). The time to achieve random access doesn't have to be O(1). I'm more interested in achieving high compression at the expense of higher random access times.
I haven't tried with Huffman coding since it assigns codes based on frequencies (and all my frequencies are the same). Perhaps I'm missing some simple encoding for this particular case.
Any help or pointers would be appreciated.
Thanks in advance.
PS: Already asked in cs.stackexchange, but asking here also for better coverage, sorry.
If all your integers have the same frequency, then a fair approximation to optimal compression will be ceil(log2(k)) bits per integer. You can access a bit-array of these in constant time.
If k is painfully small (like 3), the above method may waste a fair amount of space. But, you can combine a fixed number of your small integers into a base-k number, which can fit more efficiently into a fixed number of bits (you may also be able to fit the result conveniently into a standard-sized word). In any case, you can also access this coding in constant time.
If your integers don't have the same frequency, optimal compression may yield variable bit rates from different parts of your input, so the simple array access won't work. In that case, good random-access performance would require an index structure: break your compressed data into convenient sized chunks, which can each be decompressed sequentially, but this time is bounded by the chunk size.
If the frequency of each number is exactly the same, you may be able to save some space by taking advantage of this -- but it may not be enough to be worthwhile.
The entropy of n random numbers in range [0,k) is n log2(k), which is log2(k) bits per number; this is the number of bits it takes to encode your numbers without taking advantage of the exact frequency.
The entropy of distinguishable permutations of f copies each of k elements (where n=f*k) is:
log2( n!/(f!)^k ) = log2(n!) - k * log2(f!)
Applying Stirling's approximation (which is good here only if n and f are large), yields:
~ n log2(n) - n log2(e) - k ( f log2(f) - f log2(e) )
= n log2(n) - n log2(e) - n log2(f) + n log2(e)
= n ( log2(n) - log2(f) )
= n log2(n/f)
= n log2(k)
What this means is that, if n is large and k is small, you will not gain a significant amount of space by taking advantage of the exact frequency of your input.
The total error from the Stirling approximation above is O(log2(n) + k log2(f)), which is O(log2(n)/n + log2(f)/f) per number encoded. This does mean that if your k is so large that your f is small (i.e., each distinct number only has a small number of copies), you may be able to save some space with a clever encoding. However, the question specifies that k is, in fact, small.
If you work out the number of possible different combinations and take its log base 2 you can find the best possible compression, and I don't think it will be that great in your case. With 16 numbers of frequency 1 the number of possible messages is 16! and Excel tells me log base 2 of 16! is 44.25, whereas storing them as 4-bit codes would only take 64 bits. (where there is more than one of each kind you want http://mathworld.wolfram.com/MultinomialCoefficient.html)
I think you will have a problem mixing random access into this because the only information you have is that there are fixed numbers of each type of element - in the whole sequence. That's not a lot of information for the whole sequences, and it says almost nothing about the first half of the sequence in isolation, because you could well have more of some number in the first half and less in the second half.

Understanding assumptions about machine word size in analyzing computer algorithms

I am reading the book Introduction to Algorithms by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein.. In the second chapter under "Analyzing Algorithms" it is mentioned that :
We also assume a limit on the size of each word of data. For example , when working with inputs of size n , we typically assume that integers are represented by c lg n bits for some constant c>=1 . We require c>=1 so that each word can hold the value of n , enabling us to index the individual input elements , and we restrict c to be a constant so that the word size doesn't grow arbitrarily .( If the word size could grow arbitrarily , we could store huge amounts of data in one word and operate on it all in constant time - clearly an unrealistic scenario.)
My questions are why this assumption that each integer should be represented by c lg n bits and also how c>=1 being the case allows us to index the individual input elements ?
first, by lg they apparently mean log base 2, so lg n is the number of bits in n.
then what they are saying is that if they have an algorithm that takes a list of numbers (i am being more specific in my example to help make it easier to understand) like 1,2,3,...n then they assume that:
a "word" in memory is big enough to hold any of those numbers.
a "word" in memory is not big enough to hold all the numbers (in one single word, packed in somehow).
when calculating the number of "steps" in an algorithm, an operation on one "word" takes one step.
the reason they are doing this is to keep the analysis realistic (you can only store numbers up to some size in "native" types; after that you need to switch to arbitrary precision libraries) without choosing a particular example (like 32 bit integers) that might be inappropriate in some cases, or become outdated.
You need at least lg n bits to represent integers of size n, so that's a lower bound on the number of bits needed to store inputs of size n. Setting the constant c >= 1 makes it a lower bound. If the constant multiplier were less than 1, you wouldn't have enough bits to store n.
This is a simplifying step in the RAM model. It allows you to treat each individual input value as though it were accessible in a single slot (or "word") of memory, instead of worrying about complications that might arise otherwise. (Loading, storing, and copying values of different word sizes would take differing amounts of time if we used a model that allowed varying word lengths.) This is what's meant by "enabling us to index the individual input elements." Each input element of the problem is assumed to be accessible at a single address, or index (meaning it fits in one word of memory), simplifying the model.
This question was asked very long ago and the explanations really helped me, but I feel like there could still be a little more clarification about how the lg n came about. For me talking through things really helps:
Lets choose a random number in base 10, like 27, we need 5 bits to store this. Why? Well because 27 is 11011 in binary. Notice 11011 has 5 digits each 'digit' is what we call a bit hence 5 bits.
Think of each bit as being a slot. For binary, each of those slots can hold a 0 or 1. What's the largest number I can store with 5 bits? Well, the largest number would fill each slot: 11111
11111 = 31 = 2^5 so to store 31 we need 5 bits and 31 is 2^5
Generally (and I will use very explicit names for clarity):
numToStore = 2 ^ numBitsNeeded
Since log is the mathematical inverse of exponent we get:
log(numToStore) = numBitsNeeded
Since this is likely to not result in an integer, we use ceil to round our answer up. So applying our example to find how many bits are needed to store the number 31:
log(31) = 4.954196310386876 = 5 bits

Create a random permutation of 1..N in constant space

I am looking to enumerate a random permutation of the numbers 1..N in fixed space. This means that I cannot store all numbers in a list. The reason for that is that N can be very large, more than available memory. I still want to be able to walk through such a permutation of numbers one at a time, visiting each number exactly once.
I know this can be done for certain N: Many random number generators cycle through their whole state space randomly, but entirely. A good random number generator with state size of 32 bit will emit a permutation of the numbers 0..(2^32)-1. Every number exactly once.
I want to get to pick N to be any number at all and not be constrained to powers of 2 for example. Is there an algorithm for this?
The easiest way is probably to just create a full-range PRNG for a larger range than you care about, and when it generates a number larger than you want, just throw it away and get the next one.
Another possibility that's pretty much a variation of the same would be to use a linear feedback shift register (LFSR) to generate the numbers in the first place. This has a couple of advantages: first of all, an LFSR is probably a bit faster than most PRNGs. Second, it is (I believe) a bit easier to engineer an LFSR that produces numbers close to the range you want, and still be sure it cycles through the numbers in its range in (pseudo)random order, without any repetitions.
Without spending a lot of time on the details, the math behind LFSRs has been studied quite thoroughly. Producing one that runs through all the numbers in its range without repetition simply requires choosing a set of "taps" that correspond to an irreducible polynomial. If you don't want to search for that yourself, it's pretty easy to find tables of known ones for almost any reasonable size (e.g., doing a quick look, the wikipedia article lists them for size up to 19 bits).
If memory serves, there's at least one irreducible polynomial of ever possible bit size. That translates to the fact that in the worst case you can create a generator that has roughly twice the range you need, so on average you're throwing away (roughly) every other number you generate. Given the speed an LFSR, I'd guess you can do that and still maintain quite acceptable speed.
One way to do it would be
Find a prime p larger than N, preferably not much larger.
Find a primitive root of unity g modulo p, that is, a number 1 < g < p such that g^k ≡ 1 (mod p) if and only if k is a multiple of p-1.
Go through g^k (mod p) for k = 1, 2, ..., ignoring the values that are larger than N.
For every prime p, there are φ(p-1) primitive roots of unity, so it works. However, it may take a while to find one. Finding a suitable prime is much easier in general.
For finding a primitive root, I know nothing substantially better than trial and error, but one can increase the probability of a fast find by choosing the prime p appropriately.
Since the number of primitive roots is φ(p-1), if one randomly chooses r in the range from 1 to p-1, the expected number of tries until one finds a primitive root is (p-1)/φ(p-1), hence one should choose p so that φ(p-1) is relatively large, that means that p-1 must have few distinct prime divisors (and preferably only large ones, except for the factor 2).
Instead of randomly choosing, one can also try in sequence whether 2, 3, 5, 6, 7, 10, ... is a primitive root, of course skipping perfect powers (or not, they are in general quickly eliminated), that should not affect the number of tries needed greatly.
So it boils down to checking whether a number x is a primitive root modulo p. If p-1 = q^a * r^b * s^c * ... with distinct primes q, r, s, ..., x is a primitive root if and only if
x^((p-1)/q) % p != 1
x^((p-1)/r) % p != 1
x^((p-1)/s) % p != 1
...
thus one needs a decent modular exponentiation (exponentiation by repeated squaring lends itself well for that, reducing by the modulus on each step). And a good method to find the prime factor decomposition of p-1. Note, however, that even naive trial division would be only O(√p), while the generation of the permutation is Θ(p), so it's not paramount that the factorisation is optimal.
Another way to do this is with a block cipher; see this blog post for details.
The blog posts links to the paper Ciphers with Arbitrary Finite Domains which contains a bunch of solutions.
Consider the prime 3. To fully express all possible outputs, think of it this way...
bias + step mod prime
The bias is just an offset bias. step is an accumulator (if it's 1 for example, it would just be 0, 1, 2 in sequence, while 2 would result in 0, 2, 4) and prime is the prime number we want to generate the permutations against.
For example. A simple sequence of 0, 1, 2 would be...
0 + 0 mod 3 = 0
0 + 1 mod 3 = 1
0 + 2 mod 3 = 2
Modifying a couple of those variables for a second, we'll take bias of 1 and step of 2 (just for illustration)...
1 + 2 mod 3 = 0
1 + 4 mod 3 = 2
1 + 6 mod 3 = 1
You'll note that we produced an entirely different sequence. No number within the set repeats itself and all numbers are represented (it's bijective). Each unique combination of offset and bias will result in one of prime! possible permutations of the set. In the case of a prime of 3 you'll see that there are 6 different possible permuations:
0,1,2
0,2,1
1,0,2
1,2,0
2,0,1
2,1,0
If you do the math on the variables above you'll not that it results in the same information requirements...
1/3! = 1/6 = 1.66..
... vs...
1/3 (bias) * 1/2 (step) => 1/6 = 1.66..
Restrictions are simple, bias must be within 0..P-1 and step must be within 1..P-1 (I have been functionally just been using 0..P-2 and adding 1 on arithmetic in my own work). Other than that, it works with all prime numbers no matter how large and will permutate all possible unique sets of them without the need for memory beyond a couple of integers (each technically requiring slightly less bits than the prime itself).
Note carefully that this generator is not meant to be used to generate sets that are not prime in number. It's entirely possible to do so, but not recommended for security sensitive purposes as it would introduce a timing attack.
That said, if you would like to use this method to generate a set sequence that is not a prime, you have two choices.
First (and the simplest/cheapest), pick the prime number just larger than the set size you're looking for and have your generator simply discard anything that doesn't belong. Once more, danger, this is a very bad idea if this is a security sensitive application.
Second (by far the most complicated and costly), you can recognize that all numbers are composed of prime numbers and create multiple generators that then produce a product for each element in the set. In other words, an n of 6 would involve all possible prime generators that could match 6 (in this case, 2 and 3), multiplied in sequence. This is both expensive (although mathematically more elegant) as well as also introducing a timing attack so it's even less recommended.
Lastly, if you need a generator for bias and or step... why don't you use another of the same family :). Suddenly you're extremely close to creating true simple-random-samples (which is not easy usually).
The fundamental weakness of LCGs (x=(x*m+c)%b style generators) is useful here.
If the generator is properly formed then x%f is also a repeating sequence of all values lower than f (provided f if a factor of b).
Since bis usually a power of 2 this means that you can take a 32-bit generator and reduce it to an n-bit generator by masking off the top bits and it will have the same full-range property.
This means that you can reduce the number of discard values to be fewer than N by choosing an appropriate mask.
Unfortunately LCG Is a poor generator for exactly the same reason as given above.
Also, this has exactly the same weakness as I noted in a comment on #JerryCoffin's answer. It will always produce the same sequence and the only thing the seed controls is where to start in that sequence.
Here's some SageMath code that should generate a random permutation the way Daniel Fischer suggested:
def random_safe_prime(lbound):
while True:
q = random_prime(lbound, lbound=lbound // 2)
p = 2 * q + 1
if is_prime(p):
return p, q
def random_permutation(n):
p, q = random_safe_prime(n + 2)
while True:
r = randint(2, p - 1)
if pow(r, 2, p) != 1 and pow(r, q, p) != 1:
i = 1
while True:
x = pow(r, i, p)
if x == 1:
return
if 0 <= x - 2 < n:
yield x - 2
i += 1

Programming problem - Game of Blocks

maybe you would have an idea on how to solve the following problem.
John decided to buy his son Johnny some mathematical toys. One of his most favorite toy is blocks of different colors. John has decided to buy blocks of C different colors. For each color he will buy googol (10^100) blocks. All blocks of same color are of same length. But blocks of different color may vary in length.
Jhonny has decided to use these blocks to make a large 1 x n block. He wonders how many ways he can do this. Two ways are considered different if there is a position where the color differs. The example shows a red block of size 5, blue block of size 3 and green block of size 3. It shows there are 12 ways of making a large block of length 11.
Each test case starts with an integer 1 ≤ C ≤ 100. Next line consists c integers. ith integer 1 ≤ leni ≤ 750 denotes length of ith color. Next line is positive integer N ≤ 10^15.
This problem should be solved in 20 seconds for T <= 25 test cases. The answer should be calculated MOD 100000007 (prime number).
It can be deduced to matrix exponentiation problem, which can be solved relatively efficiently in O(N^2.376*log(max(leni))) using Coppersmith-Winograd algorithm and fast exponentiation. But it seems that a more efficient algorithm is required, as Coppersmith-Winograd implies a large constant factor. Do you have any other ideas? It can possibly be a Number Theory or Divide and Conquer problem
Firstly note the number of blocks of each colour you have is a complete red herring, since 10^100 > N always. So the number of blocks of each colour is practically infinite.
Now notice that at each position, p (if there is a valid configuration, that leaves no spaces, etc.) There must block of a color, c. There are len[c] ways for this block to lie, so that it still lies over this position, p.
My idea is to try all possible colors and positions at a fixed position (N/2 since it halves the range), and then for each case, there are b cells before this fixed coloured block and a after this fixed colour block. So if we define a function ways(i) that returns the number of ways to tile i cells (with ways(0)=1). Then the number of ways to tile a number of cells with a fixed colour block at a position is ways(b)*ways(a). Adding up all possible configurations yields the answer for ways(i).
Now I chose the fixed position to be N/2 since that halves the range and you can halve a range at most ceil(log(N)) times. Now since you are moving a block about N/2 you will have to calculate from N/2-750 to N/2-750, where 750 is the max length a block can have. So you will have to calculate about 750*ceil(log(N)) (a bit more because of the variance) lengths to get the final answer.
So in order to get good performance you have to through in memoisation, since this inherently a recursive algorithm.
So using Python(since I was lazy and didn't want to write a big number class):
T = int(raw_input())
for case in xrange(T):
#read in the data
C = int(raw_input())
lengths = map(int, raw_input().split())
minlength = min(lengths)
n = int(raw_input())
#setup memoisation, note all lengths less than the minimum length are
#set to 0 as the algorithm needs this
memoise = {}
memoise[0] = 1
for length in xrange(1, minlength):
memoise[length] = 0
def solve(n):
global memoise
if n in memoise:
return memoise[n]
ans = 0
for i in xrange(C):
if lengths[i] > n:
continue
if lengths[i] == n:
ans += 1
ans %= 100000007
continue
for j in xrange(0, lengths[i]):
b = n/2-lengths[i]+j
a = n-(n/2+j)
if b < 0 or a < 0:
continue
ans += solve(b)*solve(a)
ans %= 100000007
memoise[n] = ans
return memoise[n]
solve(n)
print "Case %d: %d" % (case+1, memoise[n])
Note I haven't exhaustively tested this, but I'm quite sure it will meet the 20 second time limit, if you translated this algorithm to C++ or somesuch.
EDIT: Running a test with N = 10^15 and a block with length 750 I get that memoise contains about 60000 elements which means non-lookup bit of solve(n) is called about the same number of time.
A word of caution: In the case c=2, len1=1, len2=2, the answer will be the N'th Fibonacci number, and the Fibonacci numbers grow (approximately) exponentially with a growth factor of the golden ratio, phi ~ 1.61803399. For the
huge value N=10^15, the answer will be about phi^(10^15), an enormous number. The answer will have storage
requirements on the order of (ln(phi^(10^15))/ln(2)) / (8 * 2^40) ~ 79 terabytes. Since you can't even access 79
terabytes in 20 seconds, it's unlikely you can meet the speed requirements in this special case.
Your best hope occurs when C is not too large, and leni is large for all i. In such cases, the answer will
still grow exponentially with N, but the growth factor may be much smaller.
I recommend that you first construct the integer matrix M which will compute the (i+1,..., i+k)
terms in your sequence based on the (i, ..., i+k-1) terms. (only row k+1 of this matrix is interesting).
Compute the first k entries "by hand", then calculate M^(10^15) based on the repeated squaring
trick, and apply it to terms (0...k-1).
The (integer) entries of the matrix will grow exponentially, perhaps too fast to handle. If this is the case, do the
very same calculation, but modulo p, for several moderate-sized prime numbers p. This will allow you to obtain
your answer modulo p, for various p, without using a matrix of bigints. After using enough primes so that you know their product
is larger than your answer, you can use the so-called "Chinese remainder theorem" to recover
your answer from your mod-p answers.
I'd like to build on the earlier #JPvdMerwe solution with some improvements. In his answer, #JPvdMerwe uses a Dynamic Programming / memoisation approach, which I agree is the way to go on this problem. Dividing the problem recursively into two smaller problems and remembering previously computed results is quite efficient.
I'd like to suggest several improvements that would speed things up even further:
Instead of going over all the ways the block in the middle can be positioned, you only need to go over the first half, and multiply the solution by 2. This is because the second half of the cases are symmetrical. For odd-length blocks you would still need to take the centered position as a seperate case.
In general, iterative implementations can be several magnitudes faster than recursive ones. This is because a recursive implementation incurs bookkeeping overhead for each function call. It can be a challenge to convert a solution to its iterative cousin, but it is usually possible. The #JPvdMerwe solution can be made iterative by using a stack to store intermediate values.
Modulo operations are expensive, as are multiplications to a lesser extent. The number of multiplications and modulos can be decreased by approximately a factor C=100 by switching the color-loop with the position-loop. This allows you to add the return values of several calls to solve() before doing a multiplication and modulo.
A good way to test the performance of a solution is with a pathological case. The following could be especially daunting: length 10^15, C=100, prime block sizes.
Hope this helps.
In the above answer
ans += 1
ans %= 100000007
could be much faster without general modulo :
ans += 1
if ans == 100000007 then ans = 0
Please see TopCoder thread for a solution. No one was close enough to find the answer in this thread.

Resources