How does the Flajolet-Martin sketch work?

I am trying to understand this sketch but am not able to.
Correct me if I am wrong, but basically: let's say I have text data... words. I have a hash function which takes a word, creates an integer hash, and then I convert that hash to a binary bit vector? Right?
Then I keep track of the first 1 I see from the left, and the position where that 1 is (say, k)... and the cardinality of the set is estimated as 2^k?
http://ravi-bhide.blogspot.com/2011/04/flajolet-martin-algorithm.html
But... say I have just one word, and the hash function generates a hash of 2^5 for it. Then I am guessing there are 5 (??) trailing 0's, so it will predict a cardinality of 2^5 (??)?
That doesn't sound right. What am I missing?

For a single word the distribution of R is a geometric distribution with p = 1/2, and its standard deviation is sqrt(2) ≈ 1.41.
So for a word with hash ending in 100000b the algorithm will, indeed, yield 2^5 / 0.77351 ≈ 41.37. But the probability of that is only 1/64, which is consistent with the statement that the standard deviation of R is close to 1.

http://ravi-bhide.blogspot.com/2011/04/flajolet-martin-algorithm.html
If we had a good, random hash function that acted on strings and generated integers, what could we say about the generated integers? Since they are random themselves, we would expect:
1/2 of them to have their binary representation end in 0 (i.e. be divisible by 2),
1/4 of them to have their binary representation end in 00 (i.e. be divisible by 4),
1/8 of them to have their binary representation end in 000 (i.e. be divisible by 8).
Turning the problem around: if the hash function generated an integer ending in m zero bits, then intuitively the number of unique strings is around 2^m.
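A minimal Python sketch of this idea, assuming a generic hash (MD5 here, purely for illustration; the algorithm only needs a reasonably uniform hash) and the 0.77351 correction factor quoted in the blog post:
import hashlib

def trailing_zeros(x):
    # Number of trailing 0 bits in x (position of the lowest set bit).
    if x == 0:
        return 0
    k = 0
    while x & 1 == 0:
        x >>= 1
        k += 1
    return k

def fm_estimate(words):
    # Estimate the number of distinct words as 2^R / 0.77351, where R is
    # the largest number of trailing zeros seen in any hash value.
    R = 0
    for w in words:
        h = int(hashlib.md5(w.encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return (2 ** R) / 0.77351

print(fm_estimate(["the", "quick", "brown", "fox", "the", "quick"]))
# roughly 4, though a single estimate can easily be off by a factor of 2 or more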

What is really important to remember is that the Flajolet-Martin algorithm is meant to count distinct elements (let's say M distinct elements) from a set of N elements, when M is expected to be very, very large.
There is no point in using the algorithm if N or M are small enough for us to store all distinct elements in memory.
In the case where N and M are really large, the probability of the estimate being close to 2^k is actually very reasonable.
There is an explanation of this at http://infolab.stanford.edu/~ullman/mmds/ch4.pdf (page 143).

Related

Hashing with the Division Method - Choosing number of slots?

So, in CLRS, there's this quote
A prime not too close to an exact power of 2 is often a good choice for m.
Several Questions...
I understand how a power of 2 will just be the lower-order bits of your key. However, say you have keys from a universe of 1 to 1 million, with each key having an equal probability of being any number in the universe (which I'm guessing is a common assumption about your universe if given no other data). Wouldn't taking, say, the 4 lower-order bits result in 2^4 lower-order bit patterns that were pretty much equally likely for keys from 1 to 1 million? How am I thinking about this incorrectly?
Why a prime number? If powers of 2 aren't a good idea, why is a prime number a better choice than a composite number close to a power of 2? (And why should it be close to a power of 2 at all?)
You are trying to find a hash table that works well for typical input data, and typical input data does things that you wouldn't expect from good random number generators. Very often you get formatted or semi-formatted strings which, when converted to numbers, end up as K, K+A, K+2A, K+3A,.... for some integers K and A. If K+xA and K+yA hash to the same number mod m, then (x-y)A must be 0 mod m. If m is prime, this can only happen if A = 0 mod m or if x = y mod m, so one time in m. But if m=pq and A happens to be divisible by p, then you get a collision every time x-y is divisible by q, which is more often since q < m.
I guess close to a power of 2 because it might be convenient for the memory management system to have blocks of memory of the resulting size - I really don't know. If you really care, and if you have the time, you could try different primes with some representative data and see which of them are best in practice.
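A small experiment makes the argument concrete (the values K = 5, A = 7 and the table sizes below are made up for illustration): with a composite m = p*q and a step A divisible by p, the arithmetic progression lands in only q distinct slots, while a prime m of similar size uses them all.
def slots_used(m, K, A, count=10000):
    # How many distinct slots the keys K, K+A, K+2A, ... occupy modulo m.
    return len({(K + x * A) % m for x in range(count)})

print(slots_used(91, 5, 7))   # m = 91 = 7 * 13 and A = 7: only 13 slots ever used
print(slots_used(89, 5, 7))   # m = 89 (prime): all 89 slots get used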

Hashing google interview

Why can't powers of 2 or powers of 10 be good choices for a hashing function, while prime numbers can? If we want to store overflow records in a hash table, why aren't those good choices when selecting the hashing function?
Suppose your hash function returns a 32-bit unsigned result. Suppose you choose a modulus of 4096. What you do is, effectively, index = hash & 0xFFF -- so you throw away the top 20 bits of your hash value. Now, if your hash is really good, and the bottom 12 bits are just as good as the rest, then that's not a problem. However, if your hash is pretty good over all 32 bits, but the bottom 12 bits are suspect (they might, for example, be more strongly influenced by the last characters of a string), then you may regret discarding the top 20. In this case, if you choose any odd modulus, then with index = hash % modulus the result depends on all 32 bits of the hash.
So, more generally, if your hash is calculated modulo M, and your index is taken as hash % N, then what you want is for your M and N to be co-prime.
If M is 2^m (as it usually is), then N=10^n is a poor choice, because the bottom n bits of the resulting index are a straight copy of the bottom n bits of the hash.
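A tiny illustration of the point about discarded bits (the two hash values below are made up): with a power-of-two modulus, two hashes that agree in their low bits collide no matter what the high bits are, while an odd modulus lets the high bits influence the index.
h1 = 0x12345678
h2 = 0xABCD5678   # same low 12 bits as h1, different high 20 bits

print(h1 % 4096, h2 % 4096)   # power-of-two modulus: identical indices (a collision)
print(h1 % 4093, h2 % 4093)   # odd modulus: the high bits matter, so the indices differ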

Compression of sequence of integers providing random access

I have a sequence of n integers in a small range [0,k) and all the integers have the same frequency f (so the size of the sequence is n=f∗k). What I'm trying to do now is to compress this sequence while providing random access (what is the i-th integer). The time to achieve random access doesn't have to be O(1). I'm more interested in achieving high compression at the expense of higher random access times.
I haven't tried with Huffman coding since it assigns codes based on frequencies (and all my frequencies are the same). Perhaps I'm missing some simple encoding for this particular case.
Any help or pointers would be appreciated.
Thanks in advance.
PS: Already asked in cs.stackexchange, but asking here also for better coverage, sorry.
If all your integers have the same frequency, then a fair approximation to optimal compression will be ceil(log2(k)) bits per integer. You can access a bit-array of these in constant time.
If k is painfully small (like 3), the above method may waste a fair amount of space. But, you can combine a fixed number of your small integers into a base-k number, which can fit more efficiently into a fixed number of bits (you may also be able to fit the result conveniently into a standard-sized word). In any case, you can also access this coding in constant time.
If your integers don't have the same frequency, optimal compression may yield variable bit rates from different parts of your input, so the simple array access won't work. In that case, good random-access performance would require an index structure: break your compressed data into convenient sized chunks, which can each be decompressed sequentially, but this time is bounded by the chunk size.
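A minimal sketch of that fixed-width approach (assuming k fits in a machine word; here everything is packed into one big Python integer, which keeps the indexing logic obvious even if a real implementation would use a byte array):
import math

def pack(values, k):
    # Pack values from [0, k) into one integer, using ceil(log2(k)) bits each.
    width = max(1, math.ceil(math.log2(k)))
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)
    return packed, width

def get(packed, width, i):
    # Random access: extract the i-th value with a shift and a mask.
    return (packed >> (i * width)) & ((1 << width) - 1)

packed, width = pack([2, 0, 1, 2, 1, 0], 3)         # k = 3, so 2 bits per value
print([get(packed, width, i) for i in range(6)])    # [2, 0, 1, 2, 1, 0]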
If the frequency of each number is exactly the same, you may be able to save some space by taking advantage of this -- but it may not be enough to be worthwhile.
The entropy of n random numbers in range [0,k) is n log2(k), which is log2(k) bits per number; this is the number of bits it takes to encode your numbers without taking advantage of the exact frequency.
The entropy of distinguishable permutations of f copies each of k elements (where n=f*k) is:
log2( n!/(f!)^k ) = log2(n!) - k * log2(f!)
Applying Stirling's approximation (which is good here only if n and f are large), yields:
~ n log2(n) - n log2(e) - k ( f log2(f) - f log2(e) )
= n log2(n) - n log2(e) - n log2(f) + n log2(e)
= n ( log2(n) - log2(f) )
= n log2(n/f)
= n log2(k)
What this means is that, if n is large and k is small, you will not gain a significant amount of space by taking advantage of the exact frequency of your input.
The total error from the Stirling approximation above is O(log2(n) + k log2(f)), which is O(log2(n)/n + log2(f)/f) per number encoded. This does mean that if your k is so large that your f is small (i.e., each distinct number only has a small number of copies), you may be able to save some space with a clever encoding. However, the question specifies that k is, in fact, small.
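A quick numeric check of that conclusion (a sketch using Python's math.lgamma for the log-factorials): the exact bits-per-number of the equal-frequency encoding, compared against the plain log2(k) baseline.
import math

def log2_factorial(n):
    # lgamma(n + 1) = ln(n!), so divide by ln(2) to get log2(n!).
    return math.lgamma(n + 1) / math.log(2)

def bits_per_number(k, f):
    # Exact entropy per number of a sequence with k values, each appearing f times.
    n = k * f
    return (log2_factorial(n) - k * log2_factorial(f)) / n

for k, f in [(16, 1), (16, 1000), (4, 1000000)]:
    print(k, f, round(bits_per_number(k, f), 4), 'vs', round(math.log2(k), 4))
# With f = 1 the saving is noticeable (about 2.77 vs 4 bits for k = 16);
# with large f the two figures are nearly identical, as the derivation predicts.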
If you work out the number of possible different combinations and take its log base 2 you can find the best possible compression, and I don't think it will be that great in your case. With 16 numbers of frequency 1 the number of possible messages is 16! and Excel tells me log base 2 of 16! is 44.25, whereas storing them as 4-bit codes would only take 64 bits. (where there is more than one of each kind you want http://mathworld.wolfram.com/MultinomialCoefficient.html)
I think you will have a problem mixing random access into this, because the only information you have is that there are fixed numbers of each type of element in the whole sequence. That's not a lot of information for the whole sequence, and it says almost nothing about the first half of the sequence in isolation, because you could well have more of some number in the first half and fewer in the second half.

Returning i-th combination of a bit array

Given a bit array of fixed length and the number of 0s and 1s it contains, how can I arrange all possible combinations such that returning the i-th combinations takes the least possible time?
It is not important the order in which they are returned.
Here is an example:
array length = 6
number of 0s = 4
number of 1s = 2
possible combinations (6! / 4! / 2!)
000011 000101 000110 001001 001010
001100 010001 010010 010100 011000
100001 100010 100100 101000 110000
problem
1st combination = 000011
5th combination = 001010
9th combination = 010100
With a different arrangement such as
100001 100010 100100 101000 110000
001100 010001 010010 010100 011000
000011 000101 000110 001001 001010
it shall return
1st combination = 100001
5th combination = 110000
9th combination = 010100
Currently I am using an O(n) algorithm which tests for each bit whether it is a 1 or a 0. The problem is I need to handle lots of very long arrays (on the order of 10000 bits), and so it is still very slow (and caching is out of the question). I would like to know if you think a faster algorithm may exist.
Thank you
I'm not sure I understand the problem, but if you only want the i-th combination without generating the others, here is a possible algorithm:
There are C(M,N) = M!/(N!(M-N)!) combinations of N bits set to 1 whose highest set bit is at position M or below.
You want the i-th: you iteratively increment M until C(M,N)>=i
while( C(M,N) < i ) M = M + 1
That will tell you the highest bit that is set.
Of course, you compute the binomial coefficient iteratively with
C(M+1,N) = C(M,N)*(M+1)/(M+1-N)
Once the highest bit is found, you are left with the problem of finding the (i-C(M-1,N))-th combination of N-1 bits, so you can apply the same idea recursively in N...
Here is a possible variant with D = C(M+1,N) - C(M,N), and i = i - 1 to make the index start at zero:
SOL = 0
i = i - 1
while N > 0
    M = N
    C = 1
    D = 1
    while i >= D
        i = i - D
        M = M + 1
        D = N * C / (M - N)
        C = C + D
    SOL = SOL + (1 << (M - 1))
    N = N - 1
RETURN SOL
This will require large integer arithmetic if you have that many bits...
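For reference, here is my transcription of that pseudocode into runnable Python (1-based i, as in the question's examples; Python's arbitrary-precision integers take care of the large binomial coefficients):
def ith_combination(i, n_ones):
    # Return the i-th (1-based) smallest integer with exactly n_ones bits set.
    i -= 1                         # work zero-based internally
    sol = 0
    N = n_ones
    while N > 0:
        M, C, D = N, 1, 1          # C = C(M, N), starting from C(N, N) = 1
        while i >= D:
            i -= D
            M += 1
            D = N * C // (M - N)   # D = C(M, N) - C(M - 1, N), always an integer
            C += D
        sol |= 1 << (M - 1)        # the highest remaining set bit is at position M - 1
        N -= 1
    return sol

# The 6-bit, two-ones example from the question:
print([format(ith_combination(i, 2), '06b') for i in (1, 5, 9)])
# ['000011', '001010', '010100']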
If the ordering doesn't matter (it just needs to remain consistent), I think the fastest thing to do would be to have combination(i) return anything you want that has the desired density the first time combination() is called with argument i. Then store that value in a member variable (say, a hashmap that has the value i as key and the combination you returned as its value). The second time combination(i) is called, you just look up i in the hashmap, figure out what you returned before and return it again.
Of course, when you're returning the combination for argument i, you'll need to make sure it's not something you have already returned for some other argument.
If the number of combinations you will ever be asked for is significantly smaller than the total number of combinations, an easy implementation for the first call to combination(i) would be to make a value of the right length with all 0s, randomly set num_ones of the bits to 1, and then make sure it's not one you've already returned for a different value of i.
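A toy sketch of that memoization idea (the class and method names are mine; it assumes the number of distinct indices queried stays far below the total number of combinations, so the random retry loop rarely repeats):
import random

class LazyCombinations:
    # Hands out an arbitrary but consistent bit string for each index i.
    def __init__(self, length, num_ones):
        self.length, self.num_ones = length, num_ones
        self.by_index = {}     # i -> bit string already returned for i
        self.used = set()      # combinations handed out so far

    def combination(self, i):
        if i not in self.by_index:
            while True:
                ones = frozenset(random.sample(range(self.length), self.num_ones))
                if ones not in self.used:
                    break
            self.used.add(ones)
            self.by_index[i] = ''.join('1' if b in ones else '0'
                                       for b in range(self.length))
        return self.by_index[i]

c = LazyCombinations(6, 2)
print(c.combination(5), c.combination(5))   # same value both times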
Your problem appears to be constrained by the binomial coefficient. In the example you give, the problem can be translated as follows:
there are 6 items that can be chosen 2 at a time. By using the binomial coefficient, the total number of unique combinations can be calculated as N! / (K! (N - K)!), which for the case of K = 2 simplifies to N(N-1)/2. Plugging 6 in for N, we get 15, which is the same number of combinations that you calculated with 6! / 4! / 2! - which appears to be another way to calculate the binomial coefficient that I have never seen before. I have tried other combinations as well and both formulas generate the same number of combinations. So, it looks like your problem can be translated to a binomial coefficient problem.
Given this, it looks like you might be able to take advantage of a class that I wrote to handle common functions for working with the binomial coefficient:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes.
Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coefficient.
It should not be hard to convert this class to the language of your choice.
There may be some limitations since you are using a very large N that could end up creating larger numbers than the program can handle. This is especially true if K can be large as well. Right now, the class is limited to the size of an int. But, it should not be hard to update it to use longs.

Counting permutation of Strings

I need help with a problem. Given an input string with repetitions, say "aab", how to
count the number of distinct permutations of that string.
One formula that could be used is n!/(n1!*n2!*...*nr!).
However, calculating these ni's takes O(rn) time, or O(n) if we use a lookup table.
However, I need a solution without the use of such tables. Is any recursive or dynamic programming solution possible for this problem?
Thanks in advance.
The number of distinct permutations will be n!/(c1!*c2!*...*ck!),
where n is the length of the string
and ck denotes the number of occurrences of each distinct character.
For example, for the string "aabb": n = 4, ca = 2, cb = 2,
so the solution is 4!/(2!*2!) = 6.
If you want to do this for very large strings, consider using the gamma function (with gamma(n+1)=n!), which is faster for large n and still gives you floating-point accuracy even in cases where you would get an int overflow.
If you have arbitrary precision arithmetic, you could probably push the effort down to O(r+n) by exploiting the fact that you can, e.g., write 1*2*3 * 1*2*3*4 * 1*2*3*4*5*6*7 as (1*2*3)^3 * 4^2 * 5*6*7. The end result will still have O(rn) digits and you'll still have an O(rn) time consumption, because multiplication cost increases with the size of the number.
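For the approximate route via the gamma function, here is a minimal sketch (using Python's math.lgamma, so nothing overflows even for very long strings):
import math
from collections import Counter

def approx_distinct_permutations(s):
    # n! / (c1! * c2! * ...) computed in log space: lgamma(x + 1) = ln(x!).
    log_count = math.lgamma(len(s) + 1)
    for c in Counter(s).values():
        log_count -= math.lgamma(c + 1)
    return math.exp(log_count)

print(approx_distinct_permutations("aab"))          # ~3.0
print(approx_distinct_permutations("mississippi"))  # ~34650.0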
I don't see the difference between lookup tables and dynamic programming - basically, dynamic programming uses a lookup table that you build on-the-fly. (i.e., use a lookup table, but only populate it on-demand).
Do you need approximate answers, or exact ones? Which part of this calculation do you think is slow?
If you need approximate answers, use the gamma function as @Yannick Versley suggested.
If you need exact answers, here is how I'd do it. I'd first figure out the prime factorization of the answer, then multiply those factors out. This avoids division. The hard part is figuring out the prime factorization of n!. For that you can use a trick: suppose that p is a prime, and k is the integer part of n/p. Then the number of times that p divides n! is k plus the number of times that p divides k!. Proceed recursively and it is quick to see that, for instance, the number of times that 3 is a factor of 80! is 26 + 8 + 2 = 36. So after you find the primes up to n, it isn't hard to find the prime factorization of n!.
Once you know the prime factorization, you can multiply it out. You expect to be dealing with large numbers, so try to arrange to do lots of small multiplications first, and only a few big ones. Here is a simple way to do that.
Make an array of the prime factors. Scramble it (to mix up big and small factors). Then as long as you have at least 2 factors in your array grab the first two, multiply them, push them onto the end. When you have one number left, that is your answer.
This should be much, much faster for large strings than the naive approach of multiplying the numbers one at a time. However in the end you will have very large numbers, and nothing can make multiplying those fast.
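A sketch of that exact, division-free route (a simple sieve plus the recursive trick above, which is Legendre's formula; the factor counts of the ci! terms are subtracted in the exponents before anything is multiplied out):
from collections import Counter

def primes_up_to(n):
    # Simple sieve of Eratosthenes.
    sieve = [True] * (n + 1)
    sieve[:2] = [False] * min(2, n + 1)
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p*p::p] = [False] * len(sieve[p*p::p])
    return [p for p in range(2, n + 1) if sieve[p]]

def power_in_factorial(n, p):
    # Exponent of the prime p in n!: floor(n/p) + floor(n/p^2) + ...
    e = 0
    while n:
        n //= p
        e += n
    return e

def distinct_permutations(s):
    n, counts = len(s), Counter(s)
    result = 1
    for p in primes_up_to(n):
        e = power_in_factorial(n, p) - sum(power_in_factorial(c, p)
                                           for c in counts.values())
        result *= p ** e
    return result

print(distinct_permutations("aab"))          # 3
print(distinct_permutations("mississippi"))  # 34650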
You can keep running counts for each character, and build the result up as you go along. It's impossible to do better than O(n), since without looking at every character in the string you can't know how many of each character there are.
I've written some code in Python, with some simple unit tests. The code carefully avoids large intermediate values when the result is going to be small (in fact, the variable result is never larger than len(s) times the final result). If you were going to code this up in another language, say C, then you might use an array of size 256 rather than the defaultdict.
If you want an exact result, then I don't think you can do better than this.
from collections import defaultdict

def permutations(s):
    # Count the characters, then build up n! / (c1! * c2! * ...) incrementally;
    # the intermediate result is always an integer.
    seen = defaultdict(int)
    for c in s:
        seen[c] += 1
    result = 1
    n = 0
    for k, count in seen.items():
        for j in range(count):
            n += 1
            result *= n
            result //= j + 1
    return result

test_cases = [
    ('abc', 6),
    ('aab', 3),
    ('abcd', 24),
    ('aabb', 6),
    ('aaaaa', 1),
    ('a', 1)]

for s, want in test_cases:
    got = permutations(s)
    if got != want:
        print('permutations(%s) = %s want %s' % (s, got, want))
As @MRalwasser says, the number of permutations is at most n!. You can generate those permutations fairly simply, but the run time is going to be exponential because you have to emit exponentially many output strings. (A quick way to show that n! grows at least as fast as 2^n is Stirling's formula.)
