I understand the Rabin-Karp algorithm and its use in string searching. What I don't quite understand is how it can be used to dynamically slice a file into variable-length chunks.
It's said to calculate the hash of a small window of data bytes (e.g. 48 bytes) at every single byte offset, and a chunk boundary (a "breakpoint") is declared wherever the last N (e.g. 13) bits of the hash are zero. This gives you an average block size of 2^N = 2^13 = 8192 = 8 KB.
Questions:
Does the Rabin-Karp rolling hash start from the first 48 bytes and then roll forward one byte at a time?
If so, isn't that too much to compute for a large file, even with a simple hash function?
Given unpredictable data, how is it possible for the last N bits of the hash to become zero before the maximum chunk size is reached?
Yes, the sliding window is fixed-size and moves forward byte by byte.
Hashing the whole file this way is O(n): at each step the rolling hash only adds the incoming byte (possibly after a shift or multiplication) and subtracts the contribution of the byte that drops out of the window. That constant-time update is the core idea of the Rabin hash.
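To make the mechanics concrete, here is a minimal sketch in C++. It uses a plain multiplicative rolling hash modulo 2^64 rather than a true Rabin fingerprint over GF(2), and the 48-byte window, multiplier and 13-bit mask are just the illustrative figures from the question:

#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch only: a simple polynomial rolling hash mod 2^64, not the GF(2)
// Rabin fingerprint that real content-defined-chunking code uses.
constexpr std::size_t WINDOW = 48;                        // bytes per window
constexpr std::uint64_t MULT = 6364136223846793005ULL;    // arbitrary odd multiplier
constexpr std::uint64_t MASK = (1u << 13) - 1;            // low 13 bits -> ~8 KiB average

std::vector<std::size_t> find_breakpoints(const std::vector<std::uint8_t>& data) {
    std::vector<std::size_t> cuts;
    if (data.size() < WINDOW) return cuts;

    std::uint64_t pow_w = 1;                  // MULT^WINDOW, used to drop the old byte
    for (std::size_t i = 0; i < WINDOW; ++i) pow_w *= MULT;

    std::uint64_t h = 0;                      // hash of the first 48-byte window
    for (std::size_t i = 0; i < WINDOW; ++i) h = h * MULT + data[i];

    for (std::size_t i = WINDOW; i <= data.size(); ++i) {
        if ((h & MASK) == 0) cuts.push_back(i);            // breakpoint after byte i-1
        if (i == data.size()) break;
        // O(1) roll: add the incoming byte, subtract the outgoing byte's term.
        h = h * MULT + data[i] - data[i - WINDOW] * pow_w;
    }
    return cuts;
}

With the low 13 bits required to be zero and a roughly uniform hash, a boundary is expected about once every 2^13 = 8192 positions, which is where the 8 KB average comes from; real chunkers also enforce minimum and maximum chunk sizes.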
That depends on the hash function, actually; the distribution of chunk sizes may differ. To reduce chunk-size variability, the Two Thresholds, Two Divisors algorithm (TTTD) was proposed. You can also find some further advances in academic research papers.
Since time complexity is a function of the size of the input, assume the input is a single integer. Since the memory allocated to every integer is fixed (4 bytes), shouldn't the time complexity be constant, no matter whether the input integer is 1 or 234345?
I know the input size is the number of bits used to represent the input (which is log n, base 2), but then why do we say that the size allocated to an integer data type is 4 bytes (or 2 bytes)?
For example:
Since the decimal number 100 can be represented in binary as 01100100, which is roughly log2(100) bits, why do we allocate 32 bits and waste the other 24?
When talking about complexity, we usually assume the input is an array of numbers which have constant size. Then the size of the input is proportional to the number of elements in the array.
However, when this is not the case, you have to agree on what you mean by "input size". There are several possibilities, the number of bits being an important one. Another definition that is sometimes used is the sum of the input values (or the single input value itself), also called the "unary representation".
If you have one input, and it has constant size, complexity is meaningless. In this case, it could be either 8 or 32 bits — in any case, there is no complexity here at all.
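A tiny example of why the definition matters (a sketch, nothing more): the loop below runs n times, which is linear in the value of n but exponential in the number of bits of the input, since a b-bit integer can be as large as 2^b - 1.

#include <cstdint>

// Runs n times: O(n) in the value of the input, O(2^b) in its bit length b.
std::uint64_t sum_up_to(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 1; i <= n; ++i) total += i;
    return total;
}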
Are there any adverse effects if during calculation of FNV-1a hash, 4 bytes are xor'ed at a time rather than just one?
Yes, there is a problem. The algorithm does an XOR on each byte and then multiplies to "mix" that byte with the rest of the value. If you did an XOR on four bytes at a time, the last four bytes of the value you're hashing would have an overwhelming effect on the result.
Basically, the calculation was designed to mix in one byte at a time. If you mix in four bytes at a time, you'll change the distribution of the values that it produces.
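For reference, here is the byte-at-a-time 64-bit FNV-1a loop being described (a minimal sketch; the constants are the standard FNV-1a offset basis and prime):

#include <cstddef>
#include <cstdint>

// 64-bit FNV-1a: XOR in exactly one byte, then multiply by the FNV prime,
// so every input byte passes through all of the subsequent multiplications.
std::uint64_t fnv1a_64(const std::uint8_t* data, std::size_t len) {
    std::uint64_t hash = 14695981039346656037ULL;   // FNV offset basis
    for (std::size_t i = 0; i < len; ++i) {
        hash ^= data[i];                            // mix in one byte
        hash *= 1099511628211ULL;                   // FNV prime
    }
    return hash;
}

In a four-bytes-at-a-time variant each group of four bytes would share a single multiply, so the bytes are diffused differently and the output distribution changes, as described above.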
I have studied how the Sieve of Eratosthenes generates the prime numbers up to a given number by iterating and striking off all the composite numbers, and I know the algorithm only needs to iterate up to sqrt(n), where n is the upper bound up to which we need to find all the primes. We know that the number of primes up to n = 10^9 is very small compared to the number of composites, yet we use all that space just to record that those numbers are not prime, by marking them composite.
My question is can we modify the algorithm to just store prime numbers since we deal with a very large range (since number of primes are very less)?
Can we just store straight away the prime numbers?
Changing the structure from that of a set (sieve) - one bit per candidate - to storing primes (e.g. in a list, vector or tree structure) actually increases storage requirements.
Example: there are 203,280,221 primes below 2^32. An array of uint32_t of that size requires about 775 MiB, whereas the corresponding bitmap (a.k.a. the set representation) occupies only 512 MiB (2^32 bits / 8 bits per byte = 2^29 bytes).
The most compact number-based representation with fixed cell size would be storing the halved distance between consecutive odd primes, since up to about 2^40 the halved distance fits into a byte. At 193 MiB for the primes up to 2^32 this is slightly smaller than an odds-only bitmap but it is only efficient for sequential processing. For sieving it is not suitable because, as Anatolijs has pointed out, algorithms like the Sieve of Eratosthenes effectively require a set representation.
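As a rough illustration of that gap-based format (a sketch with made-up function names; it assumes every halved gap fits in one byte, as stated above):

#include <cstddef>
#include <cstdint>
#include <vector>

// Store only (p[i+1] - p[i]) / 2 for consecutive odd primes; decode with a
// running sum. Sequential access only - there is no random lookup.
std::vector<std::uint8_t> encode_halved_gaps(const std::vector<std::uint64_t>& odd_primes) {
    std::vector<std::uint8_t> gaps;
    for (std::size_t i = 1; i < odd_primes.size(); ++i)
        gaps.push_back(static_cast<std::uint8_t>((odd_primes[i] - odd_primes[i - 1]) / 2));
    return gaps;
}

std::vector<std::uint64_t> decode_halved_gaps(std::uint64_t first_odd_prime,
                                              const std::vector<std::uint8_t>& gaps) {
    std::vector<std::uint64_t> primes{first_odd_prime};
    for (std::uint8_t g : gaps) primes.push_back(primes.back() + 2u * g);
    return primes;
}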
The bitmap can be shrunk drastically by leaving out the multiples of small primes. Most famous is the odds-only representation that leaves out the number 2 and its multiples; this halves the space requirement to 256 MiB at virtually no cost in added code complexity. You just need to remember to pull the number 2 out of thin air when needed, since it isn't represented in the sieve.
Even more space can be saved by leaving out multiples of more small primes; this generalisation of the 'odds-only' trick is usually called 'wheeled storage' (see Wheel Factorization in the Wikipedia). However, the gain from adding more small primes to the wheel gets smaller and smaller whereas the wheel modulus ('circumference') increases explosively. Adding 3 removes 1/3rd of the remaining numbers, adding 5 removes a further 1/5th, adding 7 only gets you a further 1/7th and so on.
Here's an overview of what adding another prime to the wheel can get you. 'ratio' is the size of the wheeled/reduced set relative to the full set that represents every number; 'delta' gives the shrinkage compared to the previous step. 'spokes' refers to the number of prime-bearing spokes which need to be represented/stored; the total number of spokes for a wheel is of course equal to its modulus (circumference).
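Here is a small sketch (assuming the usual wheel primes 2, 3, 5, 7 and 11) that computes the 'spokes', 'ratio' and 'delta' figures described above:

#include <cstdint>
#include <cstdio>

int main() {
    const int wheel_primes[] = {2, 3, 5, 7, 11};
    std::uint64_t modulus = 1;
    double prev_ratio = 1.0;
    for (int p : wheel_primes) {
        modulus *= p;
        // 'spokes': residues 1..modulus that share no factor with the modulus.
        std::uint64_t spokes = 0;
        for (std::uint64_t r = 1; r <= modulus; ++r) {
            bool coprime = true;
            for (int q : wheel_primes) {
                if (q > p) break;                       // only primes in the current wheel
                if (r % q == 0) { coprime = false; break; }
            }
            if (coprime) ++spokes;
        }
        double ratio = (double)spokes / (double)modulus;
        std::printf("mod %5llu wheel: %3llu spokes, ratio %.4f, delta %.4f\n",
                    (unsigned long long)modulus, (unsigned long long)spokes,
                    ratio, prev_ratio - ratio);
        prev_ratio = ratio;
    }
    return 0;
}

The mod 30 wheel, for instance, comes out at 8 spokes and a ratio of 8/30, which is where the roughly 136 MiB figure mentioned below for the primes up to 2^32 comes from.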
The mod 30 wheel (about 136 MiB for the primes up to 2^32) offers an excellent cost/benefit ratio because it has eight prime-bearing spokes, which means that there is a one-to-one correspondence between wheels and 8-bit bytes. This enables many efficient implementation tricks. However, its cost in added code complexity is considerable despite this fortuitous circumstance, and for many purposes the odds-only sieve ('mod 2 wheel') gives the most bang for buck by far.
There are two additional considerations worth keeping in mind. The first is that data sizes like these often exceed the capacity of memory caches by a wide margin, so that programs can often spend a lot of time waiting for the memory system to deliver the data. This is compounded by the typical access patterns of sieving - striding over the whole range, again and again and again. Speedups of several orders of magnitude are possible by working the data in small batches that fit into the level-1 data cache of the processor (typically 32 KiB); lesser speedups are still possible by keeping within the capacity of the L2 and L3 caches (a few hundred KiB and a few MiB, respectively). The keyword here is 'segmented sieving'.
The second consideration is that many sieving tasks - like the famous SPOJ PRIME1 and its updated version PRINT (with extended bounds and tightened time limit) - require only the small factor primes up to the square root of the upper limit to be permanently available for direct access. That's a comparatively small number: 3512 when sieving up to 2^31 as in the case of PRINT.
Since these primes have already been sieved there's no need for a set representation anymore, and since they are few there are no problems with storage space. This means they are most profitably kept as actual numbers in a vector or list for easy iteration, perhaps with additional auxiliary data like the current working offset and phase. The actual sieving task is then easily accomplished via a technique called 'windowed sieving'. In the case of PRIME1 and PRINT this can be several orders of magnitude faster than sieving the whole range up to the upper limit, since both tasks only ask for a small number of subranges to be sieved.
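A minimal sketch of that segmented/windowed approach (function names and bounds are illustrative, not taken from the original tasks): keep the small factor primes as plain numbers, then sieve each requested window with a buffer small enough to stay in cache.

#include <algorithm>
#include <cstdint>
#include <vector>

// Classic sieve for the small "factor" primes up to sqrt(limit).
std::vector<std::uint32_t> small_primes_up_to(std::uint32_t n) {
    std::vector<bool> composite(n + 1, false);
    std::vector<std::uint32_t> primes;
    for (std::uint32_t i = 2; i <= n; ++i) {
        if (composite[i]) continue;
        primes.push_back(i);
        for (std::uint64_t j = (std::uint64_t)i * i; j <= n; j += i) composite[j] = true;
    }
    return primes;
}

// Sieve just the window [lo, hi) using the precomputed small primes.
std::vector<std::uint64_t> primes_in_window(std::uint64_t lo, std::uint64_t hi,
                                            const std::vector<std::uint32_t>& small) {
    std::vector<bool> composite(hi - lo, false);
    for (std::uint32_t p : small) {
        // First multiple of p inside the window, but never p itself.
        std::uint64_t start = std::max<std::uint64_t>((std::uint64_t)p * p,
                                                      (lo + p - 1) / p * p);
        for (std::uint64_t m = start; m < hi; m += p) composite[m - lo] = true;
    }
    std::vector<std::uint64_t> out;
    for (std::uint64_t n = lo; n < hi; ++n)
        if (n >= 2 && !composite[n - lo]) out.push_back(n);
    return out;
}

// Usage: the primes in [1000000000, 1000000100) need only the few thousand
// primes below sqrt(1000000100) to be kept around permanently.
// auto small  = small_primes_up_to(31624);
// auto primes = primes_in_window(1000000000ULL, 1000000100ULL, small);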
You can do that (remove numbers that are found to be non-prime from your array/linked list), but then the time complexity of the algorithm degrades to O(N^2/log(N)) or something like that, instead of the original O(N*log(log(N))). This is because you can no longer jump straight to 2X, 3X, 4X, ... to mark them as not prime; you have to search through your entire compressed list instead.
You could erase each composite number from the array/vector once you have shown it to be composite. Or, when you fill an array of numbers to put through the sieve, leave out all even numbers (other than 2) and all numbers ending in 5 (other than 5 itself).
If you have studied the sieve properly, you must know that we don't have the primes to begin with. We have an array whose size is equal to the range. Now, if you want the range to be 10^9, that has to be the size of the array. You haven't mentioned a language, but for each number you need at least a bit to represent whether it is prime or not.
Even that means you need 10^9 bits = 1.25 * 10^8 bytes, which is more than 100 MB of RAM.
Assuming you have all this, the most optimized sieve takes O(n * log(log n)) time, which, for n = 10^9 on a machine that executes 10^8 instructions per second, will still take minutes.
Now, even assuming you have all this, the number of primes up to 10^9 is q = 50,847,534; saving these as 4-byte integers will still take q * 4 bytes, which is about 200 MB (more RAM).
Even if you remove the indexes that are multiples of 2, 3 or 5 - that removes 22 numbers in every 30 - it is not good enough, because in total you will still need well over 200 MB of space (roughly 33 MB for the reduced bitmap of 10^9 candidates, plus the ~200 MB for storing the prime numbers themselves).
So, since storing the primes will in any case require a similar amount of memory (of the same order as the calculation itself), your question, IMO, has no solution.
You can halve the size of the sieve by only 'storing' the odd numbers. This requires code to explicitly deal with the case of testing even numbers. For odd numbers, bit b of the sieve represents n = 2b + 3. Hence bit 0 represents 3, bit 1 represents 5 and so on. There is a small overhead in converting between the number n and the bit index b.
Whether this technique is any use to you depends on the memory/speed balance you require.
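As a concrete sketch of that mapping (odds only, bit b standing for n = 2b + 3; the names are illustrative):

#include <cstdint>
#include <vector>

// Odds-only sieve: bit b represents the odd number n = 2*b + 3, so bit 0 is 3,
// bit 1 is 5, and so on. A set bit means "composite".
struct OddSieve {
    std::vector<std::uint8_t> bits;

    // Build the sieve for the odd numbers 3..n.
    explicit OddSieve(std::uint64_t n) : bits(n / 16 + 1, 0) {
        for (std::uint64_t i = 3; i * i <= n; i += 2)
            if (!test(i))
                for (std::uint64_t j = i * i; j <= n; j += 2 * i) mark(j);
    }
    // Convert odd n >= 3 to its bit index b = (n - 3) / 2.
    bool test(std::uint64_t n) const {
        std::uint64_t b = (n - 3) / 2;
        return (bits[b / 8] >> (b % 8)) & 1;
    }
    void mark(std::uint64_t n) {
        std::uint64_t b = (n - 3) / 2;
        bits[b / 8] |= static_cast<std::uint8_t>(1u << (b % 8));
    }
    bool is_prime(std::uint64_t n) const {
        if (n < 2) return false;
        if (n == 2) return true;        // 2 is not represented; special-case it
        if (n % 2 == 0) return false;
        return !test(n);
    }
};

For a limit of 10^9 this halves the sieve to roughly 62.5 MB, as described above.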
I need to generate a cryptographically secure random number that is 256 bits long, in a specific range. I am using a microcontroller equipped with a random number generator (the manufacturer boasts that it is a true random number generator, based on thermal noise).
The upper limit of the number to be generated is given as a byte array. My question is: will it be secure to build the random number byte by byte, performing:
n[i] = rand[i] mod limit[i]
where n[i] is the i-th byte of my number, and so on.
The standard method, using all the bits from the RNG, is:
number <- random()
while (number outside range)
number <- random()
endwhile
return number
There are some tweaks possible if the required range is less than half the size of the RNG output, but I assume that is not the case here: it would reduce the output size by one or more bits. Given that, the while loop will normally be entered only once or twice, if at all.
Comparing byte arrays is reasonably simple, and usually fast provided you compare the most significant bytes first. If the most significant bytes differ, then there is no need to compare the less significant bytes at all. We can tell that 7,###,###,### is larger than 5,###,###,### without knowing what digits the # stand for.
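A sketch of that loop for a 256-bit number held in a big-endian byte array (get_random_bytes() is a stand-in for the microcontroller's TRNG, not a real vendor API):

#include <cstddef>
#include <cstdint>

// Placeholder for the hardware TRNG; assumed for the sketch, not a real API.
void get_random_bytes(std::uint8_t* buf, std::size_t len);

// Big-endian comparison: most significant bytes first, stop at the first difference.
static bool less_than(const std::uint8_t* a, const std::uint8_t* b, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i)
        if (a[i] != b[i]) return a[i] < b[i];
    return false;                       // equal counts as "not less than"
}

// Fill out[0..len) with a uniform value in [0, limit) by rejection sampling.
void random_below(std::uint8_t* out, const std::uint8_t* limit, std::size_t len) {
    do {
        get_random_bytes(out, len);     // draw all 256 bits in one go
    } while (!less_than(out, limit, len));
}

Note that the loop treats the value as one 256-bit number; reducing each byte independently with n[i] = rand[i] mod limit[i], as in the question, would bias the result.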
How do I store phone numbers so that I can efficiently query whether a particular phone number has been used or not?
This was an interview question. I suggested many data structures (tree, trie, compressed trie, skip-list, Bloom filter), but he was looking for a BITMAP. How do I store phone numbers using a bitmap?
Interesting that you didn't mention a hashtable. A hashtable or a bitmap would indeed be ideal for that situation; a bitmap would probably be more space-efficient, and lookups would be somewhat faster. A trie or compressed trie would likely be similar to a hashtable in terms of space/time performance. A skip-list or tree may have even worse performance. A Bloom filter is pretty much a cross between a hashtable and a bitmap; it is mostly intended to limit the number of disk accesses and is therefore of somewhat limited use in this case.
A bitmap is generally a better choice space-wise than a hashtable (of either the used or the unused elements) once it contains more than approximately N/log(N) used or unused elements, and it becomes ideal* when a hashtable of either the unused or the used elements is half full, resulting in a space saving of up to log_2(N)/2.*
For example, if you use 10-digit phone numbers and half the numbers are used, a bitmap takes about 10^10 bits. However, storing 5x10^9 numbers in a hashtable would require roughly 32 bits for each number, for a total of about 1.6x10^11 bits - which is 16x (i.e. log_2(N)/2 times) that of the bitmap.
*Assuming a hashtable of the unused elements would be used when more than N/2 elements are in use, and a hashtable of the used elements otherwise. Without that switch, the ideal case for a bitmap is when the hashtable is full, resulting in a space saving of up to log_2(N).
The term "bitmap" is overloaded and has several different meanings. Here, I think the interviewer was probably referring to a bitvector, an array of bits numbered 0, 1, 2, ..., U. You can use a bitvector to represent a numbers in the range 0, 1, 2, ..., U as follows: if the bit at index i is 0, then i is not present in the set, and if the bit at index i is 1, then i is present in the set. Since you can index into a bitvector and flip bits in time O(1), the runtime of inserting an element, deleting an element, and looking up an element in the set is O(1).
The drawback is that the space usage is always Θ(U) and is independent of the number of elements in the set. If you assume that phone numbers are 10 digits long, you'd need 10,000,000,000 bits = 1,250,000,000 bytes = 1.25 GB of space to store the phone numbers using a naive encoding. If you assume that phone numbers can't start with 0, you could shave off 1,000,000,000 bits from the encoding by pretending that the number system starts at 1,000,000,000 rather than 0.
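A minimal sketch of that bitvector for 10-digit numbers (the class and method names are illustrative):

#include <cstdint>
#include <vector>

// One bit per possible 10-digit phone number: bit i is 1 iff number i is used.
// The table is always ~10^10 bits (about 1.25 GB), however many numbers are stored.
class PhoneBitmap {
    std::vector<std::uint64_t> words;
public:
    PhoneBitmap() : words((10000000000ULL + 63) / 64, 0) {}

    void mark_used(std::uint64_t number) {
        words[number / 64] |= (1ULL << (number % 64));
    }
    bool is_used(std::uint64_t number) const {
        return (words[number / 64] >> (number % 64)) & 1;
    }
};

// Usage: both operations are O(1).
// PhoneBitmap seen;
// seen.mark_used(9876543210ULL);
// bool taken = seen.is_used(9876543210ULL);   // true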
Hope this helps!