This article says:
Every prime number can be expressed as
30k±1, 30k±7, 30k±11, or
30k±13 for some k.
That means we can use eight bits per
thirty numbers to store all the
primes; a million primes can be
compressed to 33,334 bytes
"That means we can use eight bits per thirty numbers to store all the primes"
This "eight bits per thirty numbers" would be for k, correct? But each k value will not necessarily take up just one bit. Shouldn't it be eight k values instead?
"a million primes can be compressed to 33,334 bytes"
I am not sure how this is true.
We need to indicate two things:
VALUE of k (can be arbitrarily large)
STATE from one of the eight states (-13,-11,-7,-1,1,7,11,13)
I am not following how "33,334 bytes" was arrived at, but I can say one thing: as the prime numbers become larger and larger in value, we will need more space to store the value of k.
How, then can we fix it at "33,334 bytes"?
The article is a bit misleading: we can't store 1 million primes, but we can store all primes below 1 million.
The value of k comes from its position in the list. We only need 1 bit for each of the 8 offsets (-13, -11, ..., 11, 13).
In other words, we'll use 8 bits to store for k=0, 8 to store for k=1, 8 to store for k=2, etc. By letting these follow sequentially, we don't need to specify the value of k for each 8 bits - it's simply the value for the previous 8 bits + 1.
Since 1,000,000 / 30 = 33,333 1/3, we can store 33,334 of these 8 bit sequences to represent which values below 1 million are prime, since we cover all of the values k can have without 30k-13 exceeding the limit of 1 million.
You don't need to store each value of k. If you want to store the prime numbers below 1 million, use 33,334 bytes - the first byte corresponds to k=0, the second to k=1 etc. Then, in each byte, use 1 bit to indicate "prime" or "composite" for 30k+1, 30k+7 etc.
It's a bitmask--one bit for each of the 8 values out of 30 that might be prime, so 8 bits per 30 numbers. To tabulate all primes up to 10^6, you thus need 8*10^6/30 = 266667 bits = 33334 bytes.
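A minimal sketch of such a table, assuming the eight offsets are written as the residues 1, 7, 11, 13, 17, 19, 23, 29 mod 30 (equivalent to 30k±1, ±7, ±11, ±13) and that 2, 3 and 5 are handled as special cases. The names `build_table` and `is_prime` are illustrative only:

```python
# Residues mod 30 that can hold a prime > 5; one bit position for each.
RESIDUES = (1, 7, 11, 13, 17, 19, 23, 29)
BIT_OF = {r: i for i, r in enumerate(RESIDUES)}


def build_table(limit):
    """Return a bytearray where byte k, bit i is set iff 30*k + RESIDUES[i] is prime."""
    sieve = bytearray([1]) * (limit + 1)   # plain Sieve of Eratosthenes
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(range(p * p, limit + 1, p)))
    table = bytearray(limit // 30 + 1)     # the byte index plays the role of k
    for n in range(7, limit + 1):
        if sieve[n] and n % 30 in BIT_OF:
            table[n // 30] |= 1 << BIT_OF[n % 30]
    return table


def is_prime(table, n):
    """Look up n in the packed table; 2, 3 and 5 are handled separately."""
    if n in (2, 3, 5):
        return True
    bit = BIT_OF.get(n % 30)
    if bit is None or n // 30 >= len(table):
        return False
    return bool(table[n // 30] >> bit & 1)


table = build_table(1_000_000)
print(len(table))  # 33334
```

Note that k is never stored: it is recovered from the byte's position, which is why the size does not grow with the magnitude of the primes, only with the range covered.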
To explain why this is a good way to go, you need to look at the obvious alternatives.
A more naive way to go would just be to use a bitmask. You need a million bits, 125000 bytes.
You could also store the values of the primes themselves. Up to 1000000, the values fit in 20 bits, and there are 78498 primes, so this gives a disappointing 1569960 bits (196245 bytes).
Another way to go--though less useful for looking up primes--is to store the differences between each prime and the next. Under a million, this fits in 6 bits (as long as you remember that the primes are all odd at that point, so you only need to store even differences and can thus throw away the lowest bit), for 470988 bits == 58874 bytes. (You could shave off another bit by counting how many mod-30 slots you had to jump.)
Now, there's nothing particularly special about 30 except that 30 = 2*3*5, so this lookup is actually walking you up through a bitmask representation of the Sieve of Eratosthenes pattern just after you've gotten started. You could instead use 2*3*5*7 = 210, and then you'd have to consider +- 1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, for 48 values. If you were doing this with 7 blocks of 30, you'd need 7*8=56 bits, so this is a slight improvement, but ugh...hardly worth the hassle.
So this is one of the better tricks out there for compactly storing reasonably small prime numbers.
(P.S. It's interesting to note that if primes appeared randomly (but the same number appeared up to 1000000 as actually appear) the amount of information stored in the primality of a number between 1 and 10^6 would be ~0.397 bits per number. Thus, under naive information-theoretic assumptions, you'd think that the best you could possibly do to store the primes below a million was to use 1000000*0.397 bits, or 49609 bytes.)
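The 0.397 figure can be reproduced as the binary entropy of a coin that comes up "prime" with probability 78498/1000000. This is only a sketch of that naive model (the helper name `binary_entropy` is made up here), and the byte total depends on how the per-number figure is rounded:

```python
import math


def binary_entropy(p):
    """Shannon entropy, in bits, of a biased coin with probability p."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


p = 78_498 / 1_000_000                    # fraction of numbers below 10^6 that are prime
bits_per_number = binary_entropy(p)
total_bytes = bits_per_number * 1_000_000 / 8
print(round(bits_per_number, 3), round(total_bytes))
```

The mod-30 table beats this bound because primes are not randomly placed: the table exploits the structural fact that no prime above 5 is divisible by 2, 3 or 5.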
As another perspective on this, the first 23,163,298 primes can be considered nicely compressible. It is the maximum number of primes for which every gap is <= 255, i.e. fits into a single byte.
I used this fact here, to reduce memory footprint for primes cache by 8 times, i.e. instead of using number (8 bytes), I'm caching only the gaps between primes, using just 1 byte per prime.
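A small sketch of the gap idea, not the linked implementation: each prime is stored as its distance from the previous one in a single byte, and the list is rebuilt by a running sum. The function names are illustrative, and the sieve is only there to produce test data:

```python
def primes_below(limit):
    """Simple Sieve of Eratosthenes returning all primes below limit (limit >= 2)."""
    sieve = bytearray([1]) * limit
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(range(p * p, limit, p)))
    return [n for n in range(limit) if sieve[n]]


def encode_gaps(primes):
    """Store each prime as the distance from its predecessor (starting from 0)."""
    gaps, prev = bytearray(), 0
    for p in primes:
        gaps.append(p - prev)   # raises ValueError if a gap exceeds 255
        prev = p
    return gaps


def decode_gaps(gaps):
    """Rebuild the primes by accumulating the stored gaps."""
    total, out = 0, []
    for g in gaps:
        total += g
        out.append(total)
    return out
```

The trade-off is that random access is lost: finding the n-th prime or testing membership means scanning the gaps from the start (or from a stored checkpoint).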
I am trying to understand the danger in using an unstable sorting algorithm (like quicksort) for the per-digit sort in radix sort.
Also, is a stable algorithm a must in both cases (i.e., MSD radix sort and LSD radix sort)?
Thanks in advance.
MSD radix sort is usually not practical, as the virtual bins can not be concatenated after each pass. If sorting by 8 bit bytes, after the first pass you have 256 separate bins, after two passes, 65536 bins, after three passes, 16777216 bins, ... .
Update - one exception to this is doing just one MSD pass to split up a large array into 256 (or 512 or 1024 or ...) bins, with the goal that each bin will fit in cache. This assumes a somewhat uniform distribution so that the bins are similar in size. After the initial pass, each bin is sorted using LSD passes, which could be done with multiple threads (if 4 cores, then LSD sort 4 bins at a time using 4 threads), since there would be no collision issues between the bins.
LSD radix sort needs to be stable, since the virtual bins are concatenated in order and the following passes on the more significant "digits" need to retain the order established by the prior passes. Note that LSD radix sort is how the old card sorters dating back to the early 1900's operated.
http://en.wikipedia.org/wiki/IBM_card_sorter#Earlier_sorters
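As an illustration of the LSD scheme described above, here is a minimal sketch that sorts non-negative integers one byte at a time with a stable counting sort. The function names and the `key_bytes` parameter are assumptions made for this example:

```python
def counting_sort_by_byte(values, shift):
    """Stable counting sort of non-negative ints on the byte at bit offset `shift`."""
    counts = [0] * 257
    for v in values:
        counts[((v >> shift) & 0xFF) + 1] += 1
    for i in range(256):                 # prefix sums give each bin's start index
        counts[i + 1] += counts[i]
    out = [0] * len(values)
    for v in values:                     # left-to-right scan keeps equal keys in order
        digit = (v >> shift) & 0xFF
        out[counts[digit]] = v
        counts[digit] += 1
    return out


def lsd_radix_sort(values, key_bytes=4):
    """Sort non-negative ints below 256**key_bytes, least significant byte first."""
    for byte_index in range(key_bytes):
        values = counting_sort_by_byte(values, 8 * byte_index)
    return values
```

The 256 "virtual bins" are just index ranges in the output array, which is why they concatenate for free after every pass.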
It would be a good start to give two minutes to the history.
Radix sort is the algorithm used by the card-sorting machines you now find only in
computer museums. The cards have 80 columns, and in each column, a machine can
punch a hole in one of 12 places. The sorter can be mechanically “programmed”
to examine a given column of each card in a deck and distribute the card into one
of 12 bins depending on which place has been punched. An operator can then
gather the cards bin by bin, so that cards with the first place punched are on top of
cards with the second place punched, and so on.
For decimal digits, each column uses only 10 places. (The other two places
are reserved for encoding nonnumeric characters.) A d-digit number would then
occupy a field of d columns. Since the card sorter can look at only one column
at a time, the problem of sorting n cards on a d-digit number requires a sorting
algorithm.
Intuitively, you might sort numbers on their most significant digit, sort each of
the resulting bins recursively, and then combine the decks in order. Unfortunately,
since the cards in 9 of the 10 bins must be put aside to sort each of the bins, this
procedure generates many intermediate piles of cards that you would have to keep
track of. (See Exercise 8.3-5.)
Radix sort solves the problem of card sorting—counterintuitively—by sorting on
the least significant digit first. The algorithm then combines the cards into a single
deck, with the cards in the 0 bin preceding the cards in the 1 bin preceding the
cards in the 2 bin, and so on. Then it sorts the entire deck again on the second-least
significant digit and recombines the deck in a like manner. The process continues
until the cards have been sorted on all d digits. Remarkably, at that point, the cards
are fully sorted on the d-digit number. Thus, only d passes through the deck are
required to sort. Figure 8.3 shows how radix sort operates on a “deck” of seven
3-digit numbers.
In order for radix sort to work correctly, the digit sorts must be stable. The sort
performed by a card sorter is stable, but the operator has to be wary about not
changing the order of the cards as they come out of a bin, even though all the cards
in a bin have the same digit in the chosen column.
-by CLRS
From the excerpt, you can see why MSD radix sort is not practical.
As for why the digit sort must be stable, let's work through an example.
Assume the list to be sorted is
21, 52, 35, 76, 49, 55, 51, 34, 31, 39
Sort the numbers on the digit in the ones place.
(21, 51, 31,) (52,) (34,) (35, 55,) (76,) (49, 39) <---- this is the result when a stable sort is used on the ones digit.
But if we use an unstable algorithm to sort on the ones digit, the values within each pair of parentheses can come out in any order.
For example:
(31, 51, 21,) (52,) (34,) (35, 55,) (76,) (49, 39) <----- this order will not affect the final result, because the tens digit has not been sorted yet
Now sort on the digit in the tens place.
(21,) (31, 34, 35, 39,) (49,) (51, 52, 55,) (76) <---- this is the (final) output if a stable sort is used for the digit sort.
If the digit sort is not stable, the output may not be in sorted order.
For example:
(21,) (39, 35, 34, 31,) (49,) (52, 51, 55,) (76)
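The same example can be run as a small sketch. Here an "unstable" digit sort is simulated by reversing each bin before the bins are concatenated; that is just one of the orders an unstable sort is allowed to produce, chosen for illustration:

```python
def digit_pass(values, place, stable=True):
    """Distribute values into 10 bins by one decimal digit, then concatenate the bins."""
    bins = [[] for _ in range(10)]
    for v in values:
        bins[(v // place) % 10].append(v)
    if not stable:
        bins = [b[::-1] for b in bins]   # reversing each bin simulates an unstable sort
    return [v for b in bins for v in b]


data = [21, 52, 35, 76, 49, 55, 51, 34, 31, 39]
stable = digit_pass(digit_pass(data, 1), 10)
unstable = digit_pass(digit_pass(data, 1), 10, stable=False)
print(stable)    # [21, 31, 34, 35, 39, 49, 51, 52, 55, 76]
print(unstable)  # [21, 39, 35, 34, 31, 49, 55, 52, 51, 76]
```

Only the second pass differs between the two runs, which is enough to destroy the order established by the ones-digit pass.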
I've read this question: Which is the fastest algorithm to find prime numbers?, but I'd like to do this only for 2 and 5 primes.
For example, the number 42000 is factorized as:
2^4 • 3^1 • 5^3 • 7^1
I'm only interested in finding the powers of 2 and 5: 4 and 3 in this example.
My naive approach is to successively divide by 2 while the remainder is 0, then successively divide by 5 while the remainder is 0.
The number of successful divisions (with zero remainder) are the powers of 2 and 5.
This involves performing (x + y + 2) divisions, where x is the power of 2 and y is the power of 5.
Is there a faster algorithm to find the powers of 2 and 5?
Following the conversation, I do think your idea is the fastest way to go, with one exception:
Division (in most cases) is expensive. On the other hand, checking the last digit of the number is (usually?) faster, so I would check the last digit (0/5 and 0/2/4/6/8) before dividing.
I am basing this off this comment by the OP:
my library is written in PHP and the number is actually stored as a string in base 10. That's not the most efficient indeed, but this is what worked best within the technical limits of the language.
If you are committed to strings-in-php, then the following pseudo-code will speed things up compared to actual general-purpose repeated modulus and division:
while the string ends in 0, but is not 0
    chop a zero off the end,
    increment ctr2 and ctr5
switch repeatedly depending on the last digit:
    if it is a 5,
        divide it by 5
        increment ctr5
    if it is 2, 4, 6, 8,
        divide it by 2
        increment ctr2
    otherwise
        you have finished
This does not require any modulus operations, and you can implement divide-by-5 and divide-by-2 cheaper than a general-purpose long-number division.
On the other hand, if you want performance, using string representations for unlimited-size integers is suicide. Use gmp (which has a php library) for your math, and convert to strings only when necessary.
edit:
you can gain extra efficiency (and simplify your operations) using the following pseudocode:
if the string is zero, terminate early
while the last non-zero character of the string is a '5',
    add the string to itself
    decrement ctr2
count the '0's at the end of the string into a ctr0
chop off ctr0 zeros from the string
ctr2 += ctr0
ctr5 += ctr0
while the last digit is 2, 4, 6, 8
    divide the string by 2
    increment ctr2
Chopping many 0s at once is better than looping. And mul2 beats div5 in terms of speed (it can be implemented by adding the number once).
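The second pseudocode above could look roughly like this in Python. It is a sketch only: the OP's library is PHP, and the helpers `double_decimal` and `halve_decimal` are hypothetical stand-ins for whatever string arithmetic that library already has.

```python
def double_decimal(s):
    """Return the decimal string for 2 * int(s), using digit-by-digit addition."""
    out, carry = [], 0
    for ch in reversed(s):
        carry, digit = divmod(2 * int(ch) + carry, 10)
        out.append(str(digit))
    if carry:
        out.append(str(carry))
    return "".join(reversed(out))


def halve_decimal(s):
    """Return the decimal string for int(s) // 2, using schoolbook short division."""
    out, rem = [], 0
    for ch in s:
        digit, rem = divmod(rem * 10 + int(ch), 2)
        out.append(str(digit))
    return "".join(out).lstrip("0") or "0"


def powers_of_2_and_5(s):
    """Return (exponent of 2, exponent of 5) for the positive decimal string s."""
    if s.strip("0") == "":
        raise ValueError("zero has no finite exponents")
    ctr2 = ctr5 = 0
    while s.rstrip("0")[-1] == "5":      # last non-zero digit is 5
        s = double_decimal(s)            # trades a factor of 5 for a trailing 0
        ctr2 -= 1
    zeros = len(s) - len(s.rstrip("0"))  # chop all trailing zeros at once
    s = s[: len(s) - zeros]
    ctr2 += zeros
    ctr5 += zeros
    while s[-1] in "2468":
        s = halve_decimal(s)
        ctr2 += 1
    return ctr2, ctr5


print(powers_of_2_and_5("42000"))  # (4, 3)
```

After the zeros are chopped, no factor of 5 remains, so a halving can never produce a new trailing zero and the last-digit test in the final loop is sufficient.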
If you have a billion digit number, you do not want to do divisions on it unless it's really necessary. If you don't have reason to believe that it is in the 1/2^1000 numbers divisible by 2^1000, then it makes sense to use much faster tests that only look at the last few digits. You can tell whether a number is divisible by 2 by looking at the last digit, whether it is divisible by 4 by looking at the last 2 digits, and by 2^n by looking at the last n digits. Similarly, you can tell whether a number is divisible by 5 by looking at the last digit, whether it is divisible by 25 by looking at the last 2 digits, and by 5^n by looking at the last n digits.
I suggest that you first count and remove the trailing 0s, then decide from the last digit whether you are testing for powers of 2 (last digit 2,4,6, or 8) or powers of 5 (last digit 5).
If you are testing for powers of 2, then take the last 2, 4, 8, 16, ... 2^i digits, and multiply this by 25, 625, ... 5^2^i, counting the trailing 0s up to 2^i (but not beyond). If you get fewer than 2^i trailing 0s, then stop.
If you are testing for powers of 5, then take the last 2, 4, 8, 16, ... 2^i digits, and multiply this by 4, 16, ... 2^2^i, counting the trailing 0s up to 2^i (but not beyond). If you get fewer than 2^i trailing 0s, then stop.
For example, suppose the number you are analyzing is 283,795,456. Multiply 56 by 25, you get 1400 which has 2 trailing 0s, continue. Multiply 5,456 by 625, you get 3,410,000, which has 4 trailing 0s, continue. Multiply 83,795,456 by 5^8=390,625, you get 32,732,600,000,000, which has 8 trailing 0s, continue. Multiply 283,795,456 by 5^16 to get 43,303,750,000,000,000,000 which has only 13 trailing 0s. That's less than 16, so stop, the power of 2 in the prime factorization is 2^13.
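A sketch of the power-of-2 half of this procedure, using Python integers for the small multiplications. It assumes trailing zeros have already been removed, so the input ends in 2, 4, 6 or 8; the function names are made up for this example:

```python
def trailing_zeros(n):
    """Count trailing decimal zeros of a positive integer."""
    count = 0
    while n % 10 == 0:
        n //= 10
        count += 1
    return count


def power_of_2_from_tail(digits):
    """Exponent of 2 in int(digits); digits must end in 2, 4, 6 or 8."""
    width = 2
    while True:
        tail = int(digits[-width:])              # only the last `width` digits
        zeros = trailing_zeros(tail * 5 ** width)
        if zeros < width:                        # fewer than `width` zeros: done
            return zeros
        width *= 2


print(power_of_2_from_tail("283795456"))  # 13
```

The power-of-5 case is symmetric: multiply the tail by 2**width instead and require the last digit to be 5.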
I hope that for larger multiplications you are implementing an n log n algorithm for multiplying n digit numbers, but even if you aren't, this technique should outperform anything involving division on typical large numbers.
Let's look at the average-case time complexity of various algorithms, assuming that each n-digit number is equally likely.
Addition or subtraction of two n-digit numbers takes theta(n) steps.
Dividing an n-digit number by a small number like 5 takes theta(n) steps. Dividing by the base is O(1).
Dividing an n-digit number by another large number takes theta(n log n) steps using the FFT, or theta(n^2) by a naive algorithm. The same is true for multiplication.
The algorithm of repeatedly dividing a base 10 number by 2 has an average case time complexity of theta(n): It takes theta(n) time for the first division, and on average, you need to do only O(1) divisions.
Computing a large power of 2 with at least n digits takes theta(n log n) by repeated squaring, or theta(n^2) with simple multiplication. Performing Euclid's algorithm to compute the GCD takes an average of theta(n) steps. Although divisions take theta(n log n) time, most of the steps can be done as repeated subtractions and it takes only theta(n) time to do those. It takes O(n^2 log log n) to perform Euclid's algorithm this way. Other improvements might bring this down to theta(n^2).
Checking the last digit for divisibility by 2 or 5 before performing a more expensive calculation is good, but it only results in a constant factor improvement. Applying the original algorithm after this still takes theta(n) steps on average.
Checking the last d digits for divisibility by 2^d or 5^d takes O(d^2) time, O(d log d) with the FFT. It is very likely that we only need to do this when d is small. The fraction of n-digit numbers divisible by 2^d is 1/2^d. So, the average time spent on these checks is O(sum(d^2 / 2^d)) and that sum is bounded independent of n, so it takes theta(1) time on average. When you use the last digits to check for divisibility, you usually don't have to do any operations on close to n digits.
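The bound on that sum can be made explicit with the standard power-series identity, evaluated at x = 1/2. Extending the sum to infinity only overestimates the finite sum up to n, so the constant does not depend on n:

```latex
\sum_{d=1}^{\infty} d^2 x^d = \frac{x(1+x)}{(1-x)^3}
\quad\Longrightarrow\quad
\sum_{d=1}^{\infty} \frac{d^2}{2^d}
  = \frac{\tfrac12 \cdot \tfrac32}{\left(\tfrac12\right)^3} = 6 .
```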
It depends on whether you're starting with a native binary number or some bigint string.
Chopping off a very long chain of trailing zeros in a bigint string is a lot easier than trying to extract powers of 2 and 5 separately - e.g. 23456789 x 10^66
23456789000000000000000000000000000000000000000000000000000000000000000000
This particular integer, on the surface, is 244-bits in total, requiring a 177-bit-wide mantissa (178-bit precision minus 1-bit implicit) to handle it losslessly, so even newer data types such as uint128 types won't suffice :
11010100011010101100101010010000110000101000100001000110100101
01011011111101001110100110111100001001010000110111110101101101
01001000011001110110010011010100001001101000010000110100000000
0000000000000000000000000000000000000000000000000000000000
The sequential approach is to spend 132 loop cycles in a bigint package to get them out ::
129 63 2932098625
130 64 586419725
131 65 117283945
132 66 23456789
133 2^66 x
5^66 x
23456789
But once you realize there's a chain of 66 trailing zeros, the bigint package becomes fully optional, since the residual digits take less than 24.5 bits in total width:
2^66
5^66
23456789
I think your algorithm will be the fastest. But I have a couple of suggestions.
One alternative is based on the greatest common divisor. Take the gcd of your input number with the smallest power of 2 greater than your input number; that will give you all the factors of 2. Divide by the gcd, then repeat the same operation with 5; that will give you all the factors of 5. Divide again by the gcd, and what is left tells you if there are any other factors.
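A minimal sketch of the gcd idea using `math.gcd`; the function name `power_via_gcd` is invented for this example, and the final loop just reads the exponent back out of the gcd:

```python
import math


def power_via_gcd(n, p):
    """Exponent of prime p in n > 0, via one gcd with a power of p exceeding n."""
    bound = p
    while bound <= n:                 # smallest power of p greater than n
        bound *= p
    g = math.gcd(n, bound)            # g = p**e, where e is the exponent we want
    e = 0
    while g > 1:
        g //= p
        e += 1
    return e


print(power_via_gcd(42000, 2), power_via_gcd(42000, 5))  # 4 3
```

Building the large power and running the gcd both cost more than a handful of small divisions, which is consistent with the expectation that repeated division is hard to beat.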
Another alternative is based on binary search. Split the binary representation of your input number in half; if the right half is 0, move left, otherwise move right. Once you have the factors of 2, divide, then apply the same algorithm on the remainder using powers of 5.
I'll leave it to you to implement and time these algorithms. But my gut feeling is that repeated division will be hard to beat.
I just read your comment that your input number is stored in base 10. In that case, divide repeatedly by 10 as long as the remainder is 0; that gives factors of both 2 and 5. Then apply your algorithm on the reduced number.
If I have a 256-bit array (the selector), how do I use it to select 5 elements from an array of 54 elements? It is acceptable to use only the first K bits of the selector rather than all 256 bits.
The requirements are:
The same selector always leads to the same 5 elements being picked.
It needs to be statistically fair: if I run through every possible value of
the selector, every 5-element combination should be picked
an equal number of times.
I know that there are 3,162,510 combinations of 5 elements that can be selected from an array of 54, ignoring the order in which they are selected.
You need to pick one of 54, one of 53, ... one of 50. Take the random bits 6 at a time as numbers from 0..63. Simply discard any that are too big (not less than 54, 53, or however many elements remain). On average, you'll need six or seven tries to get your 5 random numbers. You have 42 available, so the chance that you'll run out is negligible, and apart from that case your distribution will be perfectly uniform.
Well, if 2^K is larger than 3,162,510* you can use K bits to select 5 elements. You won't get an entirely uniform distribution because no power of 2 is divisible by 3,162,510.
*I did not check your math; I just assume 3,162,510 is correct
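The six-bits-at-a-time rejection approach from the first answer might be sketched as follows. The selector is assumed to be given as a Python integer below 2^256, and `pick_five` is a name chosen for this example:

```python
def pick_five(selector, pool):
    """Pick 5 distinct items from `pool` (54 items) using a 256-bit integer selector."""
    remaining = list(pool)
    chosen = []
    for chunk in range(256 // 6):                  # 42 six-bit chunks are available
        value = (selector >> (6 * chunk)) & 0x3F   # 0..63
        if value < len(remaining):                 # otherwise reject, try the next chunk
            chosen.append(remaining.pop(value))
            if len(chosen) == 5:
                return chosen
    raise ValueError("selector exhausted before 5 elements were accepted")


print(pick_five(0, range(54)))  # [0, 1, 2, 3, 4]
```

A selector whose chunks are all rejected (for example, all bits set) does exhaust the 42 tries, so a caller needs some policy for that rare case.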
This is a google interview question:
There are around thousand phone numbers to be stored each having 10 digits. You can assume first 5 digits of each to be same across thousand numbers. You have to perform the following operations:
a. Search if a given number exists.
b. Print all the numbers
What is the most efficient space saving way to do this ?
I answered hash table and later Huffman coding, but my interviewer said I was not going in the right direction. Please help me here.
Could using a suffix trie help?
Naively, storing 1000 numbers at 4 bytes per number takes 4000 bytes in all. Quantitatively, I want to reduce the storage to less than 4000 bytes; this is what my interviewer explained to me.
In what follows, I treat the numbers as integer variables (as opposed to strings):
Sort the numbers.
Split each number into the first five digits and the last five digits.
The first five digits are the same across numbers, so store them just once. This will require 17 bits of storage.
Store the final five digits of each number individually. This will require 17 bits per number.
To recap: the first 17 bits are the common prefix, the subsequent 1000 groups of 17 bits are the last five digits of each number stored in ascending order.
In total we're looking at 2128 bytes for the 1000 numbers, or 17.017 bits per 10-digit telephone number.
Search is O(log n) (binary search) and full enumeration is O(n).
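The layout above can be sketched in Python by packing everything into one big integer. The names `pack` and `contains` are illustrative, and the shifts on a large Python integer are not constant-time, so this shows the layout and the binary search rather than a tuned implementation:

```python
def pack(numbers):
    """Pack 10-digit numbers sharing a 5-digit prefix into 17-bit fields."""
    prefix = numbers[0] // 100_000
    suffixes = sorted(n % 100_000 for n in numbers)
    bits = prefix                                   # 17-bit prefix first
    for s in suffixes:
        bits = (bits << 17) | s                     # then 17 bits per number
    total_bits = 17 * (len(suffixes) + 1)
    return bits.to_bytes((total_bits + 7) // 8, "big"), len(suffixes)


def contains(blob, count, number):
    """Binary-search the sorted 17-bit suffix fields for `number`."""
    value = int.from_bytes(blob, "big")

    def get(i):                                     # field 0 is the prefix
        return (value >> (17 * (count - i))) & 0x1FFFF

    if get(0) != number // 100_000:
        return False
    target, lo, hi = number % 100_000, 1, count + 1
    while lo < hi:                                  # lower-bound search over fields 1..count
        mid = (lo + hi) // 2
        if get(mid) < target:
            lo = mid + 1
        else:
            hi = mid
    return lo <= count and get(lo) == target
```

For 1000 numbers, `pack` produces 17 * 1001 = 17017 bits, which rounds up to the 2128 bytes quoted above.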
Here's an improvement to aix's answer. Consider using three "layers" for the data structure: the first is a constant for the first five digits (17 bits); so from here on, each phone number has only the remaining five digits left. We view these remaining five digits as 17-bit binary integers and store k of those bits using one method and 17 - k = m with a different method, determining k at the end to minimize the required space.
We first sort the phone numbers (all reduced to 5 decimal digits). Then we count how many phone numbers there are for which the binary number consisting of the first m bits is all 0, for how many phone numbers the first m bits are at most 0...01, for how many phone numbers the first m bits are at most 0...10, etcetera, up to the count of phone numbers for which the first m bits are 1...11 - this last count is 1000(decimal). There are 2^m such counts and each count is at most 1000. If we omit the last one (because we know it is 1000 anyway), we can store all of these numbers in a contiguous block of (2^m - 1) * 10 bits. (10 bits is enough for storing a number less than 1024.)
The last k bits of all (reduced) phone numbers are stored contiguously in memory; so if k is, say, 7, then the first 7 bits of this block of memory (bits 0 thru 6) correspond to the last 7 bits of the first (reduced) phone number, bits 7 thru 13 correspond to the last 7 bits of the second (reduced) phone number, etcetera. This requires 1000 * k bits for a total of 17 + (2^(17 - k) - 1) * 10 + 1000 * k, which attains its minimum 11287 for k = 10. So we can store all phone numbers in ceil(11287/8)=1411 bytes.
Additional space can be saved by observing that none of our numbers can start with e.g. 1111111(binary), because the lowest number that starts with that is 130048 and we have only five decimal digits. This allows us to shave a few entries off the first block of memory: instead of 2^m - 1 counts, we need only ceil(99999/2^k). That means the formula becomes
17 + ceil(99999/2^k) * 10 + 1000 * k
which gives 10997 bits for k = 10, or ceil(10997/8) = 1375 bytes. (Evaluating the same formula at k = 9 gives slightly less: 10977 bits, or 1373 bytes.)
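Both size formulas are easy to tabulate, which makes it simple to check the totals for each k. The function names are made up for this sketch:

```python
def bits_simple(k):
    """17-bit prefix + (2^(17-k) - 1) ten-bit counts + k low bits per number."""
    return 17 + (2 ** (17 - k) - 1) * 10 + 1000 * k


def bits_trimmed(k):
    """Same layout, but with only ceil(99999 / 2^k) ten-bit counts."""
    counts = -(-99_999 // 2 ** k)        # integer ceiling division
    return 17 + counts * 10 + 1000 * k


for k in range(7, 13):
    print(k, bits_simple(k), bits_trimmed(k))
```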
If we want to know whether a certain phone number is in our set, we first check if the first five decimal digits match the five digits we have stored. Then we split the remaining five digits (viewed as a 17-bit number) into its top m=7 bits (which is, say, the m-bit number M) and its lower k=10 bits (the number K). We now find the number a[M-1] of reduced phone numbers for which the first m bits are at most M - 1, and the number a[M] of reduced phone numbers for which the first m bits are at most M, both from the first block of bits. We now check between the a[M-1]th and a[M]th sequence of k bits in the second block of memory to see if we find K; in the worst case there are 1000 such sequences, so if we use binary search we can finish in O(log 1000) operations.
Pseudocode for printing all 1000 numbers follows, where I access the K'th 10-bit count in the first block of memory as a[K] and the i'th k-bit entry of the second block of memory as b[i] (both of these would require a few bit operations that are tedious to write out). The first five digits are in the number c.
i := 0;
for K from 0 to ceil(99999 / 2^k) do
    while i < a[K] do
        print(c * 10^5 + K * 2^k + b[i]);
        i := i + 1;
    end do;
end do;
Maybe something goes wrong with the boundary case for K = ceil(99999/2^k), but that's easy enough to fix.
Finally, from an entropy point of view, it is not possible to store a subset of 10^3 non-negative integers all less than 10^5 in fewer than ceil(log[2](binomial(10^5, 10^3))) = 8074 bits. Including the 17 we need for the first 5 digits, there is still a gap of 10997 - 8091 = 2906 bits. It's an interesting challenge to see if there are better solutions where you can still access the numbers relatively efficiently!
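The entropy bound can be computed exactly with Python's big integers, as a quick check. `math.comb` requires Python 3.8 or later:

```python
import math

subsets = math.comb(100_000, 1_000)        # ways to choose 1000 of 100000 suffixes
bits_needed = (subsets - 1).bit_length()   # bits to give every subset its own index
print(bits_needed, bits_needed + 17)       # bound without and with the 17-bit prefix
```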
http://en.wikipedia.org/wiki/Acyclic_deterministic_finite_automaton
I once had an interview where they asked about data structures. I forgot "Array".
I'd probably consider using some compressed version of a Trie (possibly a DAWG as suggested by @Misha).
That would automagically take advantage of the fact that they all have a common prefix.
Searching will be performed in constant time, and printing will be performed in linear time.
I've heard of this problem before (but without first-5-digits-are-same assumption), and the simplest way to do it was Rice Coding:
1) Since the order does not matter, we can sort them and save just the differences between consecutive values. In our case the average difference would be 100,000 / 1000 = 100.
2) Encode the differences using Rice codes (base 128 or 64) or even Golomb codes (base 100).
EDIT : An estimation for Rice coding with base 128 (not because it would give best results, but because it's easier to compute):
We'll save first value as-is (32 bits).
The remaining 999 values are differences (we expect them to be small, 100 on average), each of which will contain:
unary value value / 128 (variable number of bits + 1 bit as terminator)
binary value for value % 128 (7 bits)
We have to estimate somehow the limits (let's call it VBL) for number of variable bits:
lower limit: consider we are lucky, and no difference is larger than our base (128 in this case). this would mean give 0 additional bits.
high limit: since all differences smaller than base will be encoded in binary part of number, the maximum number we would need to encode in unary is 100000/128 = 781.25 (even less, because we don't expect most of differences to be zero).
So, the result is 32 + 999 * (1 + 7) + variable(0..782) bits = 1003 + variable(0..98) bytes.
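A sketch of this scheme with base 128 (k = 7), applied to the 5-digit suffixes so that the first value fits in 32 bits. A string of '0'/'1' characters stands in for a real bit stream, and the function names are illustrative:

```python
def rice_encode(sorted_values, k=7):
    """First value as 32 raw bits, then each gap as unary(q) + '0' + k-bit remainder."""
    bits = [format(sorted_values[0], "032b")]
    for prev, cur in zip(sorted_values, sorted_values[1:]):
        q, r = divmod(cur - prev, 1 << k)
        bits.append("1" * q + "0" + format(r, f"0{k}b"))
    return "".join(bits)


def rice_decode(bits, count, k=7):
    """Inverse of rice_encode; `count` is the number of encoded values."""
    values = [int(bits[:32], 2)]
    pos = 32
    while len(values) < count:
        q = 0
        while bits[pos] == "1":                    # unary quotient
            q += 1
            pos += 1
        pos += 1                                   # skip the '0' terminator
        r = int(bits[pos:pos + k], 2)
        pos += k
        values.append(values[-1] + (q << k) + r)
    return values
```

Decoding is sequential, so a membership test costs O(n) unless extra index points are stored alongside the stream.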
This is a well-known problem from Bentley's Programming Pearls.
Solution:
Strip the first five digits from the numbers as they are the same for every
number. Then use bitwise operations to represent the remaining 100,000 possible
values. You will need at most 2^17 bits to represent the numbers. Each bit
represents a number. If the bit is set, the number is in the telephone book.
To print all numbers, simply print all the numbers where the bit is set,
concatenated with the prefix. To search for a given number, do the necessary bit
arithmetic to check the bit that represents the number.
You can search for a number in O(1) and the space efficiency is maximal due to the bit representation.
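A minimal sketch of the bitmap, using one bit for each of the 100,000 possible suffixes (12,500 bytes). The function names are invented for this example:

```python
def build_bitmap(suffixes):
    """Set one bit per 5-digit suffix present in the phone book."""
    bitmap = bytearray(100_000 // 8)     # 12,500 bytes
    for s in suffixes:
        bitmap[s >> 3] |= 1 << (s & 7)
    return bitmap


def bitmap_contains(bitmap, s):
    """O(1) membership test for suffix s."""
    return bool(bitmap[s >> 3] >> (s & 7) & 1)


def all_numbers(bitmap, prefix):
    """List every stored number, re-attaching the shared 5-digit prefix."""
    return [prefix * 100_000 + s for s in range(100_000) if bitmap_contains(bitmap, s)]
```

The size is independent of how many numbers are stored, so for only 1000 entries it is larger than the 4000-byte naive array even though lookups are constant-time.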
HTH Chris.
Fixed storage of 1073 bytes for 1,000 numbers:
The basic format of this storage method is to store the first 5 digits, a count for each group, and the offset for each number in each group.
Prefix:
Our 5-digit prefix takes up the first 17 bits.
Grouping:
Next, we need to figure out a good-sized grouping for numbers. Let's try to have about 1 number per group. Since we know there are about 1000 numbers to store, we divide 99,999 into about 1000 parts. If we chose the group size as 100, there would be wasted bits, so let's try a group size of 128, which can be represented with 7 bits. This gives us 782 groups to work with.
Counts:
Next, for each of the 782 groups, we need to store the count of entries in each group. A 7-bit count for each group would yield 7*782=5,474 bits, which is very inefficient because the average number represented is about 1 because of how we chose our groups.
Thus, instead we have variable sized counts with leading 1's for each number in a group followed by a 0. Thus, if we had x numbers in a group, we'd have x 1's followed by a 0 to represent the count. For example, if we had 5 numbers in a group the count would be represented by 111110. With this method, if there are 1,000 numbers we end up with 1000 1's and 782 0's for a total of 1000 + 782 = 1,782 bits for the counts.
Offset:
Last, the format of each number will just be the 7-bit offset for each group. For example, if 00000 and 00001 are the only numbers in the 0-127 group, the bits for that group would be 110 0000000 0000001. Assuming 1,000 numbers, there will be 7,000 bits for the offsets.
Thus our final count assuming 1,000 numbers is as follows:
17 (prefix) + 1,782 (counts) + 7,000 (offsets) = 8,799 bits = 1100 bytes
Now, let's check if our group-size selection by rounding up to 128 was the best choice for group size. Choosing x as the number of bits to represent each offset, the formula for the size is:
Size in bits = 17 (prefix) + 1,000 + ceil(99,999/2^x) + x * 1,000
Minimizing this equation for integer values of x gives x=6, which yields 8,580 bits = 1,073 bytes. Thus, our ideal storage is as follows:
Group size: 2^6 = 64
Number of groups: 1,563
Total storage:
1017 (prefix plus 1's) + 1563 (0's in count) + 6*1000 (offsets) = 8,580 bits = 1,073 bytes
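The size formula can be evaluated for every candidate x to confirm the choice. This is a sketch, and `total_bits` is a name made up here:

```python
def total_bits(x, count=1000, largest=99_999):
    """Prefix + one '1' per number + one '0' per group + x offset bits per number."""
    groups = -(-largest // 2 ** x)       # ceil(largest / 2**x)
    return 17 + count + groups + x * count


best = min(range(1, 17), key=total_bits)
print(best, total_bits(best), (total_bits(best) + 7) // 8)  # 6 8580 1073
```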
Taking this as a purely theoretical problem and leaving implementation aside, the single most efficient way is to just index all possible sets of 1000 five-digit suffixes in a gigantic indexing table. Assuming you have exactly 1000 numbers, you would need a little more than 8000 bits to uniquely identify the current set. There is no bigger compression possible, because then you would have two sets which are identified with the same state.
The problem with this is that you would have to represent each of the 2^8000 sets in your program as a LUT, and not even Google would be remotely capable of this.
Lookup would be O(1), printing all number O(n). Insertion would be O(2^8000) which in theory is O(1), but in practice is unusable.
In an interview I would only give this answer, if I were sure, that the company is looking for someone who is able to think out of the box a lot. Otherwise this might make you look like a theorist with no real world concerns.
EDIT: Ok, here is one "implementation".
Steps to construct the implementation:
Take a constant array of size 100 000*(100 000 choose 1000) bits. Yes, I am aware of the fact that this array will need more space than there are atoms in the universe, by several orders of magnitude.
Separate this large array into chunks of 100 000 bits each.
In each chunk, store a bit array for one specific combination of last five digits.
This is not the program, but a kind of meta program that will construct a gigantic LUT that can now be used in a program. Constant data of the program is normally not counted when calculating space efficiency, so we do not care about this array when doing our final calculations.
Here is how to use this LUT:
When someone gives you 1000 numbers, you store the first five digits separately.
Find out which of the chunks of your array matches this set.
Store the number of the set in a single 8074 bit number (call this c).
This means for storage we only need 8091 bits, which we have proven here to be the optimal encoding. Finding the correct chunk however takes O(100 000*(100 000 choose 1000)), which according to math rules is O(1), but in practice will always take longer than the time of the universe.
Lookup is simple though:
strip off the first five digits (the remaining number will be called n').
test if they match
Calculate i=c*100000+n'
Check if the bit at i in the LUT is set to one
Printing all numbers is simple also (and takes O(100000)=O(1) actually, because you always have to check all bits of the current chunk, so I miscalculated this above).
I would not call this an "implementation", because of the blatant disregard of the limitations (the size of the universe and the time this universe has lived or this earth will exist). However, in theory this is the optimal solution. For smaller problems, this actually can be done, and sometimes will be done. For example, sorting networks are an example of this way of coding, and can be used as a final step in recursive sorting algorithms to get a big speedup.
This is equivalent to storing one thousand non-negative integers each less than 100,000. We can use something like arithmetic encoding to do this.
Ultimately, the numbers will be stored in a sorted list. I note that the expected difference between adjacent numbers in the list is 100,000/1000 = 100, which can be represented in 7 bits. There will also be many cases where more than 7 bits are necessary. A simple way to represent these less common cases is to adopt a UTF-8-like scheme where one byte represents a 7-bit integer unless the first bit is set, in which case the next byte is read to produce a 14-bit integer, unless its first bit is set, in which case the next byte is read to produce a 21-bit integer.
So at least half of the differences between consecutive integers may be represented with one byte, and almost all the rest require two bytes. A few numbers, separated by differences of 16,384 or more, will require three bytes, but there cannot be more than 6 of these, because all the differences together add up to less than 100,000. The average storage then will be about 12 bits per number, or a bit less, for a total of at most about 1500 bytes.
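A minimal sketch of this gap encoding, assuming the low-order 7 bits are written first (the answer does not fix a byte order) and that the first gap is measured from zero; encodeGaps and decodeGaps are illustrative names:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode a sorted list as gaps between neighbours. Each gap becomes a
// variable-length integer: 7 payload bits per byte, with the high bit set
// when another byte follows.
std::vector<uint8_t> encodeGaps(const std::vector<uint32_t>& sorted) {
    std::vector<uint8_t> out;
    uint32_t prev = 0;
    for (uint32_t value : sorted) {
        assert(value >= prev);  // input must be sorted ascending
        uint32_t gap = value - prev;
        prev = value;
        while (gap >= 0x80) {
            out.push_back(static_cast<uint8_t>((gap & 0x7F) | 0x80));
            gap >>= 7;
        }
        out.push_back(static_cast<uint8_t>(gap));
    }
    return out;
}

// Reverse the encoding: rebuild each gap, then add it to the running total.
std::vector<uint32_t> decodeGaps(const std::vector<uint8_t>& bytes) {
    std::vector<uint32_t> values;
    uint32_t prev = 0, gap = 0;
    int shift = 0;
    for (uint8_t b : bytes) {
        gap |= static_cast<uint32_t>(b & 0x7F) << shift;
        if (b & 0x80) { shift += 7; continue; }  // more bytes for this gap
        prev += gap;
        values.push_back(prev);
        gap = 0;
        shift = 0;
    }
    return values;
}
```

Decoding has to walk the stream from the start, which is where the O(n) lookup cost comes from.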
The downside to this approach is that checking the existence of a number is now O(n). However, no time complexity requirement was specified.
After writing, I noticed ruslik already suggested the difference method above; the only difference is the encoding scheme. Mine is likely simpler but less efficient.
Just a quick question: is there any reason we would not want to convert the numbers to base 36? It may not save as much space, but it would certainly save time on the search, since you would be looking at far fewer than 10 digits. Alternatively, I would split them into files by group: I would name a file (111)-222.txt and store in it only the numbers that fit that group, kept in numeric order so they are searchable. That way I can always check whether the file exists before running a bigger search. Or, to be precise, I would run two binary searches: one over the files to see whether the file exists, and another over the contents of that file.
Why not keep it simple? Use an array of structs.
So we can save the first 5 digits as a constant, so forget those for now.
65535 is the most that can be stored in a 16-bit number, and the max number we can have is 99999, which fits within a 17-bit number (max 131071).
Using 32-bit data types is a waste because we only need 1 bit of that extra 16 bits... therefore, we can define a structure that has a boolean (or character) and a 16-bit number.
Assuming C/C++
typedef struct _number {
uint16_t number;
bool overflow;
}Number;
This struct holds only 3 bytes of data, and we need an array of 1000, so 3000 bytes total (assuming the struct is packed; otherwise most compilers pad it to 4 bytes for alignment). We have reduced the total space by 25%!
As far as storing the numbers, we can do simple bitwise math
overflow = (number5digits & 0x10000) >> 16; // the 17th bit
number = number5digits & 0xFFFF; // the low 16 bits
And the inverse
//Something like this should work
number5digits = number | (overflow << 16);
To print all of them, we can use a simple loop over the array. Retrieving a specific number happens in constant time of course, since it is an array.
// numbers is the array of 1000 Number structs
for(int i=0;i<1000;i++) cout << const5digits << (numbers[i].number | (numbers[i].overflow << 16)) << endl;
To search for a number, we would want a sorted array. So when the numbers are saved, sort the array (I would choose a merge sort personally, O(nlogn)). To search, I would take the same divide-and-conquer approach as merge sort, which amounts to a binary search: split the array and see which half our number falls in, then call the function on only that half. Recursively do this until you have a match and return the index; otherwise, the number does not exist and you print an error code. This search is quick: its worst case is O(logn), well below the O(nlogn) of the merge sort, since it recurses into only one side of each split instead of both :).
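A minimal sketch of the 16-bit-plus-flag storage together with the search, assuming the array is already sorted by full value. The helper names pack, unpack and findSuffix are made up for illustration, and std::lower_bound stands in for the hand-written recursive split:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Number {
    uint16_t number;  // low 16 bits of the five-digit suffix
    bool overflow;    // the 17th bit
};

Number pack(uint32_t suffix) {
    assert(suffix <= 99999);
    return Number{static_cast<uint16_t>(suffix & 0xFFFF), (suffix & 0x10000) != 0};
}

uint32_t unpack(const Number& n) {
    return n.number | (static_cast<uint32_t>(n.overflow) << 16);
}

// Binary search over entries sorted by unpacked value.
// Returns the index of the match, or -1 if the suffix is absent.
int findSuffix(const std::vector<Number>& sorted, uint32_t suffix) {
    auto it = std::lower_bound(
        sorted.begin(), sorted.end(), suffix,
        [](const Number& n, uint32_t key) { return unpack(n) < key; });
    if (it != sorted.end() && unpack(*it) == suffix)
        return static_cast<int>(it - sorted.begin());
    return -1;
}
```

Each lookup inspects about log2(1000), roughly 10, entries.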
My solution: best case 7.025 bits/number, worst case 14.193 bits/number, rough average 8.551 bits/number. Stream-encoded, no random access.
Even before reading ruslik’s answer, I immediately thought of encoding the difference between each number, since it will be small and should be relatively consistent, but the solution must also be able to accommodate the worst case scenario. We have a space of 100000 numbers that contain only 1000 numbers. In a perfectly uniform phone book, each number would be greater than the previous number by 100:
55555-12345
55555-12445
55555-12545
If that was the case, it would require zero storage to encode the differences between numbers, since it’s a known constant. Unfortunately, numbers may vary from the ideal steps of 100. I would encode the difference from the ideal increment of 100, so that if two adjacent numbers differ by 103, I would encode the number 3 and if two adjacent numbers differ by 92, I would encode -8. I call the delta from an ideal increment of 100 the “variance”.
The variance can range from -99 (i.e. two consecutive numbers) to 99000 (the entire phonebook consists of numbers 00000…00999 and an additional furthest-away number 99999), which is a range of 99100 possible values.
I’d aim to allocate a minimal storage to encode the most common differences and expand the storage if I encounter bigger differences (like ProtoBuf’s varint). I’ll use chunks of seven bits: six for storage and an additional flag bit at the end to indicate that this variance continues in an additional chunk after the current one, up to a maximum of three chunks. That provides a maximum of 3 * 6 = 18 bits of storage, which is 262144 possible values, more than the number of possible variances (99100). Each additional chunk that follows a raised flag has bits of a higher significance, so the first chunk always has bits 0-5, the optional second chunk has bits 6-11, and the optional third chunk has bits 12-17.
A single chunk provides six bits of storage which can accommodate 64 values. I’d like to map the 64 smallest variances to fit in that single chunk (i.e. variances of -32 to +31) so I’ll use ProtoBuf ZigZag encoding, up to the variances of -99 to +98 (since there’s no need for a negative variance beyond -99), at which point I’ll switch to regular encoding, offset by 98:
Variance | Encoded Value
-----------+----------------
0 | 0
-1 | 1
1 | 2
-2 | 3
2 | 4
-3 | 5
3 | 6
... | ...
-31 | 61
31 | 62
-32 | 63
-----------|--------------- 6 bits
32 | 64
-33 | 65
33 | 66
... | ...
-98 | 195
98 | 196
-99 | 197
-----------|--------------- End of ZigZag
100 | 198
101 | 199
... | ...
3996 | 4094
3997 | 4095
-----------|--------------- 12 bits
3998 | 4096
3999 | 4097
... | ...
262045 | 262143
-----------|--------------- 18 bits
Some examples of how variances would be encoded as bits, including the flag to indicate an additional chunk:
Variance | Encoded Bits
-----------+----------------
0 | 000000 0
5 | 001010 0
-8 | 001111 0
-32 | 111111 0
32 | 000000 1 000001 0
-99 | 000101 1 000011 0
177 | 010011 1 000100 0
14444 | 001110 1 100011 1 000011 0
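The two tables above can be reproduced with a short sketch. It follows the mapping exactly as tabulated (ZigZag for -99 to +98, then the variance plus 98), and since the table does not list a code for a variance of exactly 99, the sketch simply rejects that input; that choice, and the names encodedValue and chunkBits, are assumptions made for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Map a variance to its encoded value as in the first table.
uint32_t encodedValue(int variance) {
    // Variance 99 has no entry in the table, so it is not handled here.
    assert(variance >= -99 && variance <= 99000 && variance != 99);
    if (variance < 0) return static_cast<uint32_t>(-2 * variance - 1);  // ZigZag, negative
    if (variance <= 98) return static_cast<uint32_t>(2 * variance);     // ZigZag, non-negative
    return static_cast<uint32_t>(variance + 98);                        // regular, offset by 98
}

// Render an encoded value as chunks of 6 data bits plus 1 continuation flag,
// least significant chunk first, in the layout of the second table.
std::string chunkBits(uint32_t encoded) {
    std::string out;
    while (true) {
        uint32_t low = encoded & 0x3F;
        encoded >>= 6;
        for (int bit = 5; bit >= 0; --bit)
            out += ((low >> bit) & 1) ? '1' : '0';
        out += ' ';
        out += encoded ? '1' : '0';  // flag: another chunk follows
        if (!encoded) break;
        out += ' ';
    }
    return out;
}
```

For example, chunkBits(encodedValue(177)) yields the string 010011 1 000100 0, matching the row for 177.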
So the first three numbers of a sample phone book would be encoded as a stream of bits as follows:
BIN 000101001011001000100110010000011001 000110 0     110001 1 000001 0
PH# 55555-12345                          55555-12448  55555-12491
POS 1                                    2            3
Best case scenario, the phone book is somewhat uniformly distributed and there are no two phone numbers that have a variance greater than 32, so it would use 7 bits per number plus 32 bits for the starting number for a total of 32 + 7*999 = 7025 bits.
A mixed scenario, where 800 phone numbers' variance fits within one chunk (800 * 7 = 5600), 180 numbers fit in two chunks each (180 * 2 * 7 = 2520) and 19 numbers fit in three chunks each (19 * 3 * 7 = 399), plus the initial 32 bits, totals 8551 bits.
Worst case scenario, 25 numbers fit in three chunks (25 * 3 * 7 = 525 bits) and the remaining 974 numbers fit in two chunks (974 * 2 * 7 = 13636 bits), plus 32 bits for the first number for a grand total of 14193 bits.
Amount of encoded numbers |
1-chunk | 2-chunks | 3-chunks | Total bits
---------+----------+----------+------------
999 | 0 | 0 | 7025
800 | 180 | 19 | 8551
0 | 974 | 25 | 14193
I can see four additional optimizations that can be performed to further reduce the space required:
The third chunk doesn’t need the full seven bits, it can be just five bits and without a flag bit.
There can be an initial pass of the numbers to calculate the best sizes for each chunk. Maybe for a certain phonebook, it would be optimal to have the first chunk have 5+1 bits, the second 7+1 and the third 5+1. That would further reduce the size to a minimum of 6*999 + 32 = 6026 bits, plus two sets of three bits to store the sizes of chunks 1 and 2 (chunk 3’s size is the remainder of the required 17 bits) for a total of 6032 bits!
The same initial pass can calculate a better expected increment than the default 100. Maybe there's a phone book that starts from 55555-50000, and so it has half the number range so the expected increment should be 50. Or maybe there's a non-linear distribution (standard deviation perhaps) and some other optimal expected increment can be used. This would reduce the typical variance and might allow an even smaller first chunk to be used.
Further analysis can be done in the first pass to allow the phone book to be partitioned, with each partition having its own expected increment and chunk size optimizations. This would allow for a smaller first chunk size for certain highly uniform parts of the phone book (reducing the number of bits consumed) and larger chunk sizes for non-uniform parts (reducing the number of bits wasted on continuation flags).
The real question is one of storing five-digit phone numbers.
The trick is that you'd need 17 bits to store the range of numbers from 0..99,999. But storing 17-bit values on conventional 8-bit byte boundaries is a hassle. That's why they are asking if you can do it in less than 4k by not using 32-bit integers.
Question: are all number combinations possible?
Because of the nature of the telephone system, there may be fewer than 65k possible combinations. I will assume that all combinations are possible, because we are talking about the last five positions in the phone number, as opposed to the area code or exchange prefixes.
Question: will this list be static or will it need to support updates?
If it is static, then when it comes time to populate the database, count the numbers < 50,000 and the numbers >= 50,000. Allocate two arrays of uint16 of appropriate length: one for the integers below 50,000 and one for the higher set. When storing integers in the higher array, subtract 50,000, and when reading integers from that array, add 50,000. Now you've stored your 1,000 integers in 2,000 bytes.
Building the phonebook will require two input traversals, but lookups should happen in half the time, on average, than they would with a single array. If lookup time were very important you could use more arrays for smaller ranges but I think at these sizes your performance bound would be pulling the arrays from memory and 2k will probably stash into CPU cache if not register space on anything you'd be using these days.
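A minimal sketch of the static two-array layout. It uses std::vector and a sort instead of the counting pass, and a binary search within the chosen half, both of which are simplifications not stated in the answer; SplitBook, buildBook and contains are illustrative names:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct SplitBook {
    std::vector<uint16_t> low;   // suffixes 0..49,999 stored as-is
    std::vector<uint16_t> high;  // suffixes 50,000..99,999 stored minus 50,000
};

SplitBook buildBook(std::vector<uint32_t> suffixes) {
    std::sort(suffixes.begin(), suffixes.end());
    SplitBook book;
    for (uint32_t s : suffixes) {
        assert(s <= 99999);
        if (s < 50000) book.low.push_back(static_cast<uint16_t>(s));
        else book.high.push_back(static_cast<uint16_t>(s - 50000));
    }
    return book;
}

// Pick the half by range, then search only that half.
bool contains(const SplitBook& book, uint32_t suffix) {
    const std::vector<uint16_t>& half = suffix < 50000 ? book.low : book.high;
    uint16_t key = static_cast<uint16_t>(suffix < 50000 ? suffix : suffix - 50000);
    return std::binary_search(half.begin(), half.end(), key);
}
```

Every entry costs two bytes in one of the two vectors, so 1,000 numbers take 2,000 bytes of payload.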
If it is dynamic, allocate one array of 1000 or so uint16, and add the numbers in sorted order. Set the first element to 50,001, and set the second element to an appropriate end-of-list value, like 65,000. When you store the numbers, store them in sorted order. If a number is below 50,001 then store it before the 50,001 marker. If a number is 50,001 or greater, store it after the 50,001 marker, but subtract 50,000 from the stored value.
Your array will look something like:
00001 = 00001
12345 = 12345
50001 = reserved
00001 = 50001
12345 = 62345
65000 = end-of-list
So, when you look up a number in the phonebook, you'll traverse the array and if you've hit the 50,001 value you start adding 50,000 to your array values.
This makes inserts very expensive, but lookups are easy, and you're not going to spend much more than 2k on storage.
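Reading such an array back can be sketched as follows, assuming 50,001 is the marker and 65,000 the end-of-list value as in the layout above; decodeMarked is an illustrative name:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Walk the array once: values before the 50,001 marker are taken as-is,
// values after it get 50,000 added back. Stops at the 65,000 sentinel.
std::vector<uint32_t> decodeMarked(const std::vector<uint16_t>& stored) {
    std::vector<uint32_t> out;
    uint32_t offset = 0;
    for (uint16_t v : stored) {
        if (v == 65000) break;  // end-of-list
        if (v == 50001 && offset == 0) {
            offset = 50000;     // marker: switch to the upper range
            continue;
        }
        assert(v <= 50000);     // real entries never exceed 50,000
        out.push_back(v + offset);
    }
    return out;
}
```

Fed the example array 1, 12345, 50001, 1, 12345, 65000, this returns 1, 12345, 50001, 62345.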
There is a question for which I also have the solution, but I couldn't understand the solution. Kindly help with some examples and share your experience.
Question
Given a file containing roughly 300 million social security numbers (9-digit numbers), find a 9-digit number that is not in the file. You have unlimited drive space but only 2MB of RAM at your disposal.
Answer
In the first step, we build an array of 2^16 integers that is initialized to 0, and for every number in the file, we take its 16 most significant bits to index into this array and increment the number.
Since there are fewer than 2^32 numbers in the file, there is bound to be (at least) one number in the array that is less than 2^16. This tells us that there is at least one number missing among the possible numbers with those upper bits.
In the second pass, we can focus only on the numbers that match this criterion and use a bit-vector of size 2^16 to identify one of the missing numbers.
To make the explanation simpler, let's say you have a list of two-digit numbers, where each digit is between 0 and 3, but you can't spare the 16 bits to remember for each of the 16 possible numbers, whether you have already encountered it. What you do is to create an array a of 4 3-bit integers and in a[i], you store how many numbers with the first digit i you encountered. (Two-bit integers wouldn't be enough, because you need the values 0, 4 and all numbers between them.)
If you had the file
00, 12, 03, 31, 01, 32, 02
your array would look like this:
4, 1, 0, 2
Now you know that all numbers starting with 0 are in the file, but for each of the remaining first digits, there is at least one number missing. Let's pick 1. We know there is at least one number starting with 1 that is not in the file. So, create an array of 4 bits, for each number starting with 1 set the appropriate bit, and in the end pick one of the bits that wasn't set; in our example it could be 0. Now we have the solution: 10.
In this case, using this method is the difference between 12 bits and 16 bits. With your numbers, it's the difference between 32 kB and 119 MB.
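The two passes can be sketched as below, assuming (as the answer does) that the file contains no duplicates. The vector stands in for the file, which is read once per pass, and halfBits is the number of bits in each half: 2 for the toy example, 16 for the real problem. The name findMissing is made up for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns a value in [0, 2^(2*halfBits)) that is absent from values,
// or -1 if no bucket is short of entries.
int64_t findMissing(const std::vector<uint32_t>& values, unsigned halfBits) {
    const uint32_t buckets = 1u << halfBits;
    const uint32_t lowMask = buckets - 1;

    // Pass 1: count how many values share each upper half.
    std::vector<uint32_t> counts(buckets, 0);
    for (uint32_t v : values) {
        assert((v >> halfBits) < buckets);  // value must fit in 2*halfBits bits
        ++counts[v >> halfBits];
    }

    // A bucket with fewer than `buckets` entries must have a gap.
    uint32_t target = buckets;
    for (uint32_t i = 0; i < buckets; ++i) {
        if (counts[i] < buckets) { target = i; break; }
    }
    if (target == buckets) return -1;

    // Pass 2: mark which lower halves occur inside the chosen bucket.
    std::vector<bool> seen(buckets, false);
    for (uint32_t v : values) {
        if ((v >> halfBits) == target) seen[v & lowMask] = true;
    }
    for (uint32_t low = 0; low < buckets; ++low) {
        if (!seen[low]) return (static_cast<int64_t>(target) << halfBits) | low;
    }
    return -1;
}
```

With the toy file written as integers (00, 12, 03, 31, 01, 32, 02 in base 4 are 0, 6, 3, 13, 1, 14, 2), findMissing(values, 2) picks the bucket for first digit 1 and returns 4, which is 10 in base 4.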
In round terms, you have about 1/3 of the numbers that could exist in the file, assuming no duplicates.
The idea is to make two passes through the data. Treat each number as a 32-bit (unsigned) number. In the first pass, keep a track of how many numbers have the same number in the most significant 16 bits. In practice, there will be a number of codes where there are zero (all those for 10-digit SSNs, for example; quite likely, all those with a zero for the first digit are missing too). But of the ranges with a non-zero count, most will not have 65536 entries, which would be how many would appear if there were no gaps in the range. So, with a bit of care, you can choose one of the ranges to concentrate on in the second pass.
If you're lucky, you can find a range in the 100,000,000..999,999,999 with zero entries - you can choose any number from that range as missing.
Assuming you aren't quite that lucky, choose the range with the lowest count (or any of them with fewer than 65536 entries); call it the target range. Reset the array to all zeroes. Reread the data. If the number you read is not in your target range, ignore it. If it is in the range, record the number by setting the array value to 1 for the low-order 16 bits of the number. When you've read the whole file, any of the numbers with a zero in the array represents a missing SSN.