Ideas for an efficient way of hashing a 15-puzzle state - algorithm

I am implementing a 15-puzzle solver using Ant Colony Optimization, and I am looking for a way to efficiently hash each state into a number, so that I waste as few bytes as possible.
A state is represented by a list of 16 numbers, from 0 to 15 (0 is the hole).
Like:
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,0]
So I want to create a unique number to identify that state.
I could convert all the digits to a base 16 number, but I don't think that's very efficient.
Any ideas?
Thanks

Your state is isomorphic to the permutations of 16 elements. A 45 bit number is enough to enumerate those (log2 16!), but we might as well round up to 64 bit if it's beneficial. The problem reduces to finding an efficient conversion from the state to its position in the enumeration of states.
Knowing that each number in 0..15 occurs only once, we could create 16 variables of log2 16 = 4 bits each, where the ith variable denotes the position at which the number i occurs. This has quite a bit of redundancy: it takes log2(16) * 16 bits, but that's exactly 64 bits. It can be implemented pretty efficiently (untested pseudocode):
state2number(state):
    idx = 0
    for i in [0;16):
        val = state[i]
        idx |= i << (val * 4)
    return idx
I don't know if this is what you meant by "convert all the digits to a base 16 number". It is insanely efficient, when unrolled and otherwise micro-optimized it's only a few dozen cycles. It takes two bytes more than necessary, but 64 bit is still pretty space efficient and directly using it as index into some array isn't feasible for 64 nor for 45 bit.
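As a rough illustration of the 4-bits-per-tile packing described above, here is a minimal Python sketch. The names state_to_number and number_to_state are made up for this example, and the inverse function is an addition that the pseudocode does not cover.

```python
def state_to_number(state):
    """Pack a 16-tile state into a 64-bit integer, 4 bits per tile.

    Nibble number `tile` of the result holds the position of that tile.
    """
    idx = 0
    for position, tile in enumerate(state):
        idx |= position << (tile * 4)
    return idx


def number_to_state(idx):
    """Inverse of state_to_number (hypothetical helper, not in the answer)."""
    state = [0] * 16
    for tile in range(16):
        position = (idx >> (tile * 4)) & 0xF
        state[position] = tile
    return state
```

The result always fits in 64 bits, so it can be used directly as a dictionary or hash-set key.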

There are 16! = 2.09*10^13 possible states which needs about 44.25 bits to be encoded.
So if you want to encode the state in bytes, you need at least 6 bytes to do it.
Why not encode it this way:
Let us name the values a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
The value can be
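b` := b - ((b>a)?1:0)
c` := c - ((c>a)?1:0) - ((c>b)?1:0)
d` := d - ((d>a)?1:0) - ((d>b)?1:0) - ((d>c)?1:0)
....
hashNumber = a + b`*16 + c`*16*15 + d`*16*15*14 + ....
(a takes 16 possible values, b` 15, c` 14, and so on, so each term is scaled by the product of the ranges before it.)
This will give you a bijective mapping of each possible state to a number fitting in 6 bytes.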
Also converting the number back to its referring state is quite easy, if you need to do it.
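The mapping above is essentially a mixed-radix (Lehmer-code style) ranking of the permutation. A minimal Python sketch follows, with rank and unrank as illustrative names. It uses a simple quadratic count of smaller earlier entries, which is an assumption made for clarity rather than speed.

```python
from math import factorial


def rank(state):
    """Map a permutation of 0..n-1 to a unique integer in [0, n!)."""
    n = len(state)
    code = 0
    radix = 1  # product of the ranges of all earlier digits
    for i, value in enumerate(state):
        # Subtract one for every earlier entry that is smaller than this one.
        digit = value - sum(1 for earlier in state[:i] if earlier < value)
        code += digit * radix
        radix *= n - i  # the digit at position i has n - i possible values
    return code


def unrank(code, n=16):
    """Rebuild the permutation from its rank."""
    remaining = list(range(n))
    state = []
    for i in range(n):
        code, digit = divmod(code, n - i)
        state.append(remaining.pop(digit))
    return state
```

Every rank is below 16!, so it fits in 45 bits (6 bytes).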
Not optimal but fast is:
Use 4 bits for each number (leave out the last one because it can be computed from the previous 15 numbers) that needs 15*4 bits = 60 bits.
can be stored in 7.5 bytes or if you are ok to waste more, simply use 8 bytes.

Related

I have a piece of data that can be a number from 1 to 7. How many of these can I compress into a single byte and how?

As a single byte can hold up to 256 values it should be possible to uniquely map more than one value per byte. I am however unsure about how many I can store without loss.
Is there a generalized algorithm to do this if I have a specific range of values I need to cover?
Can I improve the mapping if I map to a larger wordsize?
And idealized answer would be an algorithm to map x (independent) ranges of size n to m bits.
In the most general terms, when you have a value that goes from 1 to 7 (I'm assuming inclusive), which is equivalent to the range [0..6], you can think of it as a digit in base 7.
This means that a string of them forms a number in base 7. Then, the easiest thing you can do is to store that number in a computer variable (in effect, turning it into base 2, but you don't usually have to think about that, since a number is a number.)
Now lets do some calculations about how many of your values will fit in how many bits. Obviously, one value will fit in 3 bits (and you will waste 1 out of 8 values, or 12.5% of the 3 bits.)
Two values - which is equivalent to a two-digit number in base 7 - will have a range of 0 to 48, and will fit in 6 whole bits (and will waste 15 out of 64 binary values, or 23.4%.)
Three values will have a range of 0 to 342, and will fit in 9 whole bits and waste a lot of values.
I think you can figure out the rest. Note that the actual (fractional) number of bits required to store each number (in this case 7) is by definition its logarithm in base 2.
To actually encode and decode in such a way, you just need to successively multiply by 7 (encode) or divide by 7 (decode), the same way that you would convert numbers into any other base. For example, if your values are x, y, and z, the following will hold:
// x,y,z must be in [0..6], so probably have to subtract 1 from each first...
// then:
encoded = x * 1 + y * 7 + z * 49; // encoded will fit in 9 bits.
decoded_x = (encoded / 1) % 7; // also need a plus one.
decoded_y = (encoded / 7) % 7; // same
decoded_z = (encoded / 49) % 7; // ditto
In general, to map and pack any number of values, each in the range [0..m-1] into a number of bits, you do the same thing as above. Think of each of your values as a single digit in a (possibly large) number in base m, and then store that number in any variable that can hold it.
Note 1: Usually, you get less wasted bits the more values you pack into a single number. For example, some comments and other answers suggest packing each base-7 value into 3 bits. If you do that, you can only store 10 of them in a 32-bit variable. But you can actually pack 11 (whole) values in 32 bits. Admittedly, storing 1 value per 3 bits is faster to encode and decode (using only shifts and masking,) which makes the choice dependent on your needs.
Note 2: There are ways to talk about (and reason about) fractional bit counts. For example, each base-7 digit will take up 2.81 bits. And there are other ways to encode and store them as a collection. But that's a more complex matter in the realm of coding and compression. If you are curious, you can read about "arithmetic coding".
Note 3: Talking about compression, there are far better ways to pack several values into a number of bits, if you know more about them. That is, if some of them are more likely to be specific values, or equal to each other, or whatever. Also, if there is a large enough number of values, you can extract these relationships from the data itself. This is, in essence, what a general-purpose lossless compressor does.
Note 4: Keep in mind that you can even do this kind of encode and decode (the way I've suggested) if the ranges of your values aren't the same. For example, if x and z are in the range [0..6], but y is in [0..4], then you can encode and decode them like this:
encoded = x * 1 + y * 7 + z * (7 * 5); // encoded will fit in 8 bits now!
decoded_x = (encoded / 1) % 7;
decoded_y = (encoded / 7) % 5;
decoded_z = (encoded / 35) % 7;
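The same idea can be written generically for any list of ranges. This is a minimal Python sketch, with pack and unpack as illustrative names and the first value taken as the least significant digit, as in the snippets above.

```python
def pack(values, radices):
    """Encode values[i] in [0, radices[i]) as one mixed-radix integer."""
    encoded = 0
    for value, radix in zip(reversed(values), reversed(radices)):
        if not 0 <= value < radix:
            raise ValueError("value out of range for its radix")
        encoded = encoded * radix + value
    return encoded


def unpack(encoded, radices):
    """Recover the list of values from a number produced by pack."""
    values = []
    for radix in radices:
        encoded, value = divmod(encoded, radix)
        values.append(value)
    return values
```

For instance, pack([x, y, z], [7, 5, 7]) computes x + y * 7 + z * 35, matching the Note 4 example.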
n symbols that are equally probable will take lg n bits, where lg is the logarithm base 2. You can store floor(m / lg n) of those in m bits, encoding them into the m bits as digits of a base n integer.
For n=7 and m=8, you can store 2. Generally you would want to pick an m that makes the number of unused bits small. m=48 (six bytes) is a good choice here, in which you can store 17 symbols. 7^17 = 232,630,513,987,207, which is close to but less than 2^48 = 281,474,976,710,656. Only about a quarter of a bit is wasted, or about half a percent of the 48 bits. Also the base conversions can be done efficiently with 64-bit arithmetic.
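To compare word sizes without floating-point rounding, the capacity can be computed with exact integers. A small sketch, where symbols_per_word is a made-up helper name:

```python
def symbols_per_word(n, m):
    """Largest count c such that n**c <= 2**m, i.e. floor(m / lg n)."""
    count = 0
    while n ** (count + 1) <= 2 ** m:
        count += 1
    return count
```

With n = 7 this gives 2 symbols for 8 bits, 11 for 32 bits and 17 for 48 bits.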
Maybe I totally misunderstand you, but a byte contains 8 bits. Every bit can be zero or one. So one byte can be in 256 different states, and can hold a number as large as 255. To represent a number between 1 and 7 (or 0 and 6) you need three bits, like 001 = 1; 010 = 2; 011 = 3 ... and so on. So normally you can store two numbers between 1 and 7 in one byte, but then you are wasting two bits. Maybe you can use overlapping bytes: in three bytes you can store 8 values and you would not waste any bits. Maybe you can think of some fancy algorithm to store three values in one byte with some restriction. For example, if you know that at least one value is not identical to the others, you can try to sort the values within the byte. Think of an imaginary ninth bit: if the first value is smaller than the second it is one, otherwise it is 0. But to go deeper on this, some extra information on the data and the use case would be super.

Algorithm for Converting large integer to string without modulo base

I have looked for a while to find an algorithm which converts integers to string. My requirement is to do this manually as I am using my own large number type. I have + - * /(with remainder) defined, but need to find a way to print a single number from a double int (high and low, if int is 64bits, 128bits total).
I have seen some answers such as
Convert integer to string without access to libraries
Converting a big integer to decimal string
but was wondering if a faster algorithm was possible. I am open to working with bits directly(e.g. base2 to base10-string - I could not find such an algorithm however), but I was just hoping to avoid repeated division by 10 for numbers possibly as large as 2^128.
You can use divide-and-conquer in such a way that the parts can be converted to string using your standard library (which will typically be quite efficient at that job).
So instead of dividing by 10 in every iteration, you can e.g. divide by 10**15, and have your library convert the chunks to 15-digit strings. After at most three steps, you're finished.
Of course you have to do some string manipulation regarding the zero-padding. But maybe your library can help you here as well, if you use something like a %015d zero-padding format for all the lower parts, and for the highest non-zero part use a non-padding %d format.
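A minimal sketch of this chunked conversion in Python. A Python int stands in for the custom 128-bit type, so divmod here plays the role of the question's own division with remainder, and to_decimal is a made-up name.

```python
CHUNK = 10 ** 15  # each division peels off 15 decimal digits


def to_decimal(high, low):
    """Convert a 128-bit value given as two 64-bit halves to a decimal string."""
    value = (high << 64) | low  # assumption: stands in for the big-number type
    chunks = []
    while True:
        value, remainder = divmod(value, CHUNK)
        chunks.append(remainder)
        if value == 0:
            break
    # The most significant chunk is printed unpadded, the rest zero-padded.
    text = "%d" % chunks[-1]
    for chunk in reversed(chunks[:-1]):
        text += "%015d" % chunk
    return text
```

For a 128-bit input the loop runs at most three times, since 3 * 15 digits covers the 39 digits of 2^128.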
You may try your luck with a contrived method, as follows.
Numbers can be represented using the Binary-Coded Decimal representation. In this representation, every decimal digit is stored on 4 bits, and when performing an addition, if the sum of two digits exceeds 9, you add 6 and carry to the left.
If you have pre-stored the BCD representation of all powers of 2, then it takes at most 128 additions to perform the conversion. You can spare a little by the fact that for low powers, you don't need full length addition (39 digits).
But this sounds like a lot of operations. You can "parallelize" them by packing several BCD digits in a single integer: an integer addition on 32 bits is equivalent to 8 simultaneous BCD digit additions. But we have a problem with the carries. To work around this, we can store the digits on 5 bits instead of 4, and the carries will appear in the fifth bit. Then we can obtain the carries by masking, add them to the next digits (shift left 5), and adjust the digit sums (multiply by 10 and subtract).
    2  3  4  5  6
+   7  6  9  2  1
=   9  9 13  7  7
Carries:
    0  0  1  0  0
Adjustments:
    9  9 13  7  7
-   0  0 10  0  0
=   9  9  3  7  7
Actually, you have to handle possible cascaded carries, so the sum will involve the two addends and carries in, and generate a sum and carries out.
32 bits operations allow you to process 6 digits at a time (7 rounds for 39 digits), and 64 bits operations, 12 digits (4 rounds for 39 digits).
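The 5-bit-per-digit addition, including cascaded carries, might look like the following Python sketch. All names are illustrative, and the carry detection (adding 6 so that any field of 10 or more sets its fifth bit) is one possible way to obtain the carries by masking, not necessarily the one the answer had in mind.

```python
BITS = 5  # one decimal digit per 5-bit field, leaving a spare bit for carries


def pack_digits(digits):
    """Pack decimal digits (most significant first) into 5-bit fields."""
    word = 0
    for digit in digits:
        word = (word << BITS) | digit
    return word


def unpack_digits(word, count):
    """Return `count` fields of `word`, most significant first."""
    return [(word >> (BITS * i)) & 0x1F for i in reversed(range(count))]


def add_packed(a, b, count):
    """Add two packed `count`-digit numbers; the result has count + 1 fields."""
    ones = sum(1 << (BITS * i) for i in range(count + 1))  # a 1 in every field
    total = a + b  # every field is at most 18, so nothing spills over
    while True:
        # Adding 6 lifts any field >= 10 to >= 16, which sets bit 4 of it.
        carries = ((total + 6 * ones) >> 4) & ones
        if carries == 0:
            return total
        # Subtract 10 where a carry occurred and add 1 to the next field up.
        total = total - 10 * carries + (carries << BITS)
```

On the worked example, 23456 + 76921 needs three rounds of carry handling before it settles at 100377.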
if you want to just encode your numbers as string
use hex numbers; that is fast, as you can extract all the digits with bit operations alone ... Base64 encoding is also doable with just bit operations plus a translation table. Both representations can be produced with small-integer arithmetic only, in O(n), where n is the count of printed digits.
If you need base10
then print a hex string and convert it to decimal on strings like this:
str_hex2dec
this is much slower than #1 but still doable on small int arithmetics ... You can do this also in reverse (input number from string) by using dec2hex ...
For bigint libs there are also another ways of easing up the string/integer conversions:
BCD
binary coded decimal ... the number printed as hex is the decimal number. So each digit has 4 bits. This wastes some memory, but many CPUs have BCD support and can do operations on such integers natively.
Base 10^n
sometimes base 10^n is used instead of 2^m, where
10^n <= 2^m
Here m is the bit width of your atomic integer and n is the number of decimal digits that fit inside it.
For example, if your atomic unsigned integer is 16 bits wide, it can hold 65536 values in base 2. If you use base 10000 instead, you can print each atom as a decimal number zero-padded from the left and simply concatenate all such prints.
This also wastes some memory, but usually not too much (if the bit width is reasonably selected), and you can use standard instructions on the integers. Only the carry propagation will change a bit...
for example for 32bit words:
2^32 = 4294967296 >= 1000000000
so we wasted log2(4.2949...) = ~2.1 bits per each 32 bits. This is much better than BCD log2(16/10)*(32/4)= ~5.42 bits And usually even better with higher bit widths

Improve number compression algorithm?

I have many unique numbers, all positive and the order doesn't matter, 0 < num < 2^32.
Example: 23 56 24 26
The biggest, 56, needs 6 bits space. So, I need: 4*6 = 24 bits in total.
I do the following to save space:
I sort them first: 23 24 26 56 (because the order doesn't matter)
Now I get the difference of each from the previous: 23 1 2 30
The biggest, 30, needs 5 bits space.
After this I store all the numbers in 4*5 bits = 20 bits space.
Question: how to further improve this algorithm?
More information: Since requested, the numbers are mostly on the range of 2.000-4.000. Numbers less than 300 are pretty rare. Numbers more than 16.000 are pretty rare also. Generally speaking, all the numbers will be close. For example, they may be all in the 1.000-2.000 range or they may all be in the 16.000-20.000 range. The total number of numbers will be something in the range of 500-5.000.
Your first step is good one to take because sorting reduces the differences to least. Here is a way to improve your algorithm:
sort and calculate differences as you have done.
Use Huffman coding on it.
Use of Huffman coding is more important than your step; I'll show you why:
consider the following data:
1 2 3 4 5 6 7 4294967295
where 4294967295 = 2^32-1. Using your algorithm:
1 1 1 1 1 1 1 4294967288
total bits needed is still 32*8
Using Huffman coding, the frequencies are:
1 => 7
4294967288 => 1
Huffman codes are 1 => 0 and 4294967288 => 1
Total bits needed = 7*1 + 1 = 8 bits
Huffman coding reduces size by 32*8/8 = 32 times
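To reproduce the bit count above, here is a minimal Python sketch that computes Huffman code lengths for a list of differences with the standard heapq module. The function name is made up, and note that the count ignores the space needed to store the code table itself.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (weight, tie-breaker, symbols in this subtree).
    heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:
            lengths[sym] += 1  # every merge adds one bit to these codes
        heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
        tie += 1
    return lengths
```

For the differences 1 1 1 1 1 1 1 4294967288 both symbols get a 1-bit code, so the payload is 8 bits.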
This problem is well known in database community as "Inverted index compression". You can google for some papers.
Following are some of the most common techniques:
Variable byte coding (VByte)
Simple9, Simple16
"Frame Of Reference" family of techniques
PForDelta
Adaptive Frame Of Reference (AFOR)
Rice-Golomb coding (often used as a part of other techniques)
VByte and Simple9/16 are easiest to implement, fast and have good compression ratio in practice.
Huffman coding is not very good for index compression because it is slow and differences are quite random in practice. (But it may be a good choice in your case.)
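As a concrete starting point, here is a sketch of variable-byte coding applied to the sorted differences. Conventions differ between implementations; this version assumes the high bit marks the last byte of each number and that the low 7 bits come first.

```python
def vbyte_encode(numbers):
    """Delta-encode the sorted numbers, 7 payload bits per byte."""
    out = bytearray()
    previous = 0
    for number in sorted(numbers):
        delta = number - previous
        previous = number
        while delta >= 128:
            out.append(delta & 0x7F)  # low 7 bits, more bytes follow
            delta >>= 7
        out.append(delta | 0x80)      # high bit set marks the final byte
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode; returns the numbers in ascending order."""
    numbers, current, shift, previous = [], 0, 0, 0
    for byte in data:
        if byte & 0x80:
            current |= (byte & 0x7F) << shift
            previous += current
            numbers.append(previous)
            current, shift = 0, 0
        else:
            current |= byte << shift
            shift += 7
    return numbers
```

With differences mostly below 128, as in the question's data, each number costs one byte.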
How many numbers do you have? If your set covers the range [0..(2^32)-1] densely enough (you do the maths), then a bitfield of 2^32 bits (512 MiB), where the n-th bit represents the presence or absence of the natural number n, may be useful.
If your numbers are not uniformly distributed, better compression will be achieved by using the frequencies of the numbers and assigning fewer bits to the most frequent ones. This is the idea behind Huffman coding.

Fastest method for adding/summing the individual digit components of a number

I saw a question on a math forum a while back where a person was discussing adding up the digits in a number over and over again until a single digit is achieved (i.e. "362" would become "3+6+2", which would become "11"... then "11" would become "1+1", which would become "2", therefore "362" would return 2). I wrote some nice code to get an answer to this and posted it, only to be outdone by a user who suggested that any number modulo 9 is equal to this "infinite digit sum". I checked it and he was right... well, almost right: if zero was returned you had to switch it out with a "9", but that was a very quick fix...
362 = 3+6+2 = 11 = 1+1 = 2
or...
362%9 = 2
Anyway, the mod 9 method works fantastically for repeatedly adding the sum of the digits until you are left with just a single digit... but what about only doing it once (i.e. 362 would just return "11")? Can anyone think of fast algorithms?
There's a cool trick for summing the 1 digits in binary, with a fixed-width integer. At each iteration, you separate out half the digits each into two values, bit-shift one value down, then add. First iteration, separate every other digit. Second iteration, pairs of digits, and so on.
Given that 27 is 00011011 as 8-bit binary, the process is...
00010001 + 00000101 = 00010110 <- every other digit step
00010010 + 00000001 = 00010011 <- pairs of digits
00000011 + 00000001 = 00000100 <- quads, giving final result 4
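The three steps above, written out for an 8-bit value as a small Python sketch (popcount8 is an illustrative name):

```python
def popcount8(x):
    """Count the 1 bits of an 8-bit value with three mask-shift-add steps."""
    x = (x & 0x55) + ((x >> 1) & 0x55)  # every other bit: 2-bit partial sums
    x = (x & 0x33) + ((x >> 2) & 0x33)  # pairs: 4-bit partial sums
    x = (x & 0x0F) + ((x >> 4) & 0x0F)  # quads: final 8-bit sum
    return x
```

For 27 the intermediate values are 00010110, 00010011 and 00000100, as in the worked example.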
You could do a similar trick with decimal, but it would be less efficient than a simple loop unless you had a direct representation of decimal numbers with fast operations to zero out selected digits and to do digit-shifting. So for 12345678 you get...
02040608 + 01030507 = 03071115 <- every other digit
00070015 + 00030011 = 00100026 <- pairs
00000026 + 00000010 = 00000036 <- quads, final result
So 1+2+3+4+5+6+7+8 = 36, which is correct, but you can only do this efficiently if your number representation is fixed-width decimal. It always takes lg(n) iterations, where lg means the base two logarithm, and you round upwards.
To expand on this a little (based on in-comments discussions), let's pretend this was sane, for a bit...
If you count single-digit additions, there's actually more work than a simple loop here. The idea, as with the bitwise trick for counting bits, is to re-order those additions (using associativity) and then to compute as many as possible in parallel, using a single full-width addition to implement two half-width additions, four quarter-width additions etc. There's significant overhead for the digit-clearing and digit-shifting operations, and even more if you implement this as a loop (calculating or looking up the digit-masking and shift-distance values for each step). The "loop" should probably be fully unrolled and those masks and shift-distances be included as constants in the code to avoid that.
A processor with support for Binary Coded Decimal (BCD) could handle this. Digit masking and digit shifting would be implemented using bit masking and bit shifting, as each decimal digit would be encoded in 4 (or more) bits, independent of the encoding of other digits.
One issue is that BCD support is quite rare these days. It used to be fairly common in the 8 bit and 16 bit days, but as far as I'm aware, processors that still support it now do so mainly for backward compatibility. Reasons include...
Very early processors didn't include hardware multiplication and division. Hardware support for these operations means it's easier and more efficient to convert binary to decimal now. Binary is used for almost everything now, and BCD is mostly forgotten.
There are decimal number representations around in libraries, but few if any high level languages ever provided portable support to hardware BCD, so since assembler stopped being a real-world option for most developers BCD support simply stopped being used.
As numbers get larger, even packed BCD is quite inefficiently packed. Number representations base 10^x have the most important properties of base 10, and are easily decoded as decimal. Base 1000 only needs 10 bits per three digits, not 12, because 2^10 is 1024. That's enough to show you get an extra decimal digit for 32 bits - 9 digits instead of 8 - and you've still got 2 bits left over, e.g. for a sign bit.
The thing is, for this digit-totalling algorithm to be worthwhile at all, you need to be working with fixed-width decimal of probably at least 32 bits (8 digits). That gives 12 operations (6 masks, 3 shifts, 3 additions) rather than 15 additions for the (fully unrolled) simple loop. That's a borderline gain, though - and other issues in the code could easily mean it's actually slower.
The efficiency gain is clearer at 64 bits (16 decimal digits) as there's still only 16 operations (8 masks, 4 shifts, 4 additions) rather than 31, but the odds of finding a processor that supports 64-bit BCD operations seems slim. And even if you did, how often do you need this anyway? It seems unlikely that it could be worth the effort and loss of portability.
Here's something in Haskell:
sumDigits n =
  if n == 0
    then 0
    else let a = mod n 10
         in a + sumDigits (div n 10)
Oh, but I just read you're doing that already...
(then there's also the obvious:
sumDigits n = sum $ map (read . (:[])) . show $ n
)
For short code, try this:
int digit_sum(int n){
if (n<10) return n;
return n%10 + digit_sum(n/10);
}
Or, in words,
-If the number is less than ten, then the digit sum is the number itself.
-Otherwise, the digit sum is the current last digit (a.k.a. n mod10 or n%10), plus the digit sum of everything to the left of that number (n divided by 10, using integer division).
-This algorithm can also be generalized for any base, substituting the base in for 10.
/* iterative version of the digit sum */
int digit_sum(int n)
{
    int sum = 0;
    while (n > 0) {
        sum += n % 10;   /* add the last digit */
        n /= 10;         /* drop the last digit */
    }
    return sum;
}

Most efficient way to store thousand telephone numbers

This is a google interview question:
There are around thousand phone numbers to be stored each having 10 digits. You can assume first 5 digits of each to be same across thousand numbers. You have to perform the following operations:
a. Search if a given number exists.
b. Print all the numbers
What is the most efficient space saving way to do this ?
I answered hash table and later huffman coding but my interviewer said I was not going in right direction. Please help me here.
Could using a suffix trie help?
Ideally 1000 numbers storing takes 4 bytes per number so in all it would take 4000 bytes to store 1000 number. Quantitatively, I wish to reduce the storage to < 4000 bytes, this is what my interviewer explained to me.
In what follows, I treat the numbers as integer variables (as opposed to strings):
Sort the numbers.
Split each number into the first five digits and the last five digits.
The first five digits are the same across numbers, so store them just once. This will require 17 bits of storage.
Store the final five digits of each number individually. This will require 17 bits per number.
To recap: the first 17 bits are the common prefix, the subsequent 1000 groups of 17 bits are the last five digits of each number stored in ascending order.
In total we're looking at 2128 bytes for the 1000 numbers, or 17.017 bits per 10-digit telephone number.
Search is O(log n) (binary search) and full enumeration is O(n).
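A minimal Python sketch of this layout follows. It keeps the suffixes in an ordinary sorted list instead of packing them into 17-bit fields, so it shows the search and enumeration logic rather than the exact memory footprint. PhoneBook and storage_bytes are made-up names.

```python
from bisect import bisect_left


class PhoneBook:
    def __init__(self, numbers):
        prefixes = {n // 100000 for n in numbers}
        if len(prefixes) != 1:
            raise ValueError("all numbers must share the first five digits")
        self.prefix = prefixes.pop()                          # stored once
        self.suffixes = sorted(n % 100000 for n in numbers)   # 17 bits each

    def contains(self, number):
        prefix, suffix = divmod(number, 100000)
        if prefix != self.prefix:
            return False
        i = bisect_left(self.suffixes, suffix)  # binary search, O(log n)
        return i < len(self.suffixes) and self.suffixes[i] == suffix

    def all_numbers(self):
        return [self.prefix * 100000 + s for s in self.suffixes]


def storage_bytes(count):
    """Bytes needed when prefix and suffixes are bit-packed at 17 bits each."""
    return (17 + 17 * count + 7) // 8
```

storage_bytes(1000) gives the 2128 bytes quoted above.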
Here's an improvement to aix's answer. Consider using three "layers" for the data structure: the first is a constant for the first five digits (17 bits); so from here on, each phone number has only the remaining five digits left. We view these remaining five digits as 17-bit binary integers and store k of those bits using one method and 17 - k = m with a different method, determining k at the end to minimize the required space.
We first sort the phone numbers (all reduced to 5 decimal digits). Then we count how many phone numbers there are for which the binary number consisting of the first m bits is all 0, for how many phone numbers the first m bits are at most 0...01, for how many phone numbers the first m bits are at most 0...10, etcetera, up to the count of phone numbers for which the first m bits are 1...11 - this last count is 1000(decimal). There are 2^m such counts and each count is at most 1000. If we omit the last one (because we know it is 1000 anyway), we can store all of these numbers in a contiguous block of (2^m - 1) * 10 bits. (10 bits is enough for storing a number less than 1024.)
The last k bits of all (reduced) phone numbers are stored contiguously in memory; so if k is, say, 7, then the first 7 bits of this block of memory (bits 0 thru 6) correspond to the last 7 bits of the first (reduced) phone number, bits 7 thru 13 correspond to the last 7 bits of the second (reduced) phone number, etcetera. This requires 1000 * k bits for a total of 17 + (2^(17 - k) - 1) * 10 + 1000 * k, which attains its minimum 11287 for k = 10. So we can store all phone numbers in ceil(11287/8)=1411 bytes.
Additional space can be saved by observing that none of our numbers can start with e.g. 1111111(binary), because the lowest number that starts with that is 130048 and we have only five decimal digits. This allows us to shave a few entries off the first block of memory: instead of 2^m - 1 counts, we need only ceil(99999/2^k). That means the formula becomes
17 + ceil(99999/2^k) * 10 + 1000 * k
which attains its minimum of 10977 bits for k = 9 (ceil(10977/8) = 1373 bytes); k = 10, the value used in the rest of this answer, is nearly as good at 10997 bits, or ceil(10997/8) = 1375 bytes.
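The formula is easy to evaluate for every k. The sketch below uses integer ceiling division, and total_bits is a made-up name. With this evaluation, k = 9 comes out at 10977 bits and k = 10 at 10997 bits.

```python
def total_bits(k, count=1000):
    """17 + ceil(99999 / 2^k) * 10 + count * k, in exact integer arithmetic."""
    buckets = -(-99999 // 2 ** k)  # ceiling division
    return 17 + buckets * 10 + count * k


best_k = min(range(1, 18), key=total_bits)
```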
If we want to know whether a certain phone number is in our set, we first check if its first five decimal digits match the five digits we have stored. Then we split the remaining five digits into its top m=7 bits (which is, say, the m-bit number M) and its lower k=10 bits (the number K). We now find the number a[M-1] of reduced phone numbers for which the first m bits are at most M - 1, and the number a[M] of reduced phone numbers for which the first m bits are at most M, both from the first block of bits. We now check between the a[M-1]th and a[M]th sequence of k bits in the second block of memory to see if we find K; in the worst case there are 1000 such sequences, so if we use binary search we can finish in O(log 1000) operations.
Pseudocode for printing all 1000 numbers follows, where I access the K'th 10-bit count in the first block of memory as a[K] and the i'th k-bit entry in the second block of memory as b[i] (both of these would require a few bit operations that are tedious to write out). Note that K here indexes the top bits, not the lower k bits as in the previous paragraph. The first five digits are in the number c.
i := 0;
for K from 0 to ceil(99999 / 2^k) do
    while i < a[K] do
        print(c * 10^5 + K * 2^k + b[i]);
        i := i + 1;
    end do;
end do;
Maybe something goes wrong with the boundary case for K = ceil(99999/2^k), but that's easy enough to fix.
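A sketch of the two-block layout in Python, using plain lists in place of the bit-packed blocks, so the bit-level access is left out. K_BITS, build, contains and enumerate_all are illustrative names. Unlike the description above, this version keeps the last cumulative count, which sidesteps the boundary case.

```python
from bisect import bisect_left

K_BITS = 10  # low bits stored verbatim (the k = 10 example from the text)
LOW_MASK = (1 << K_BITS) - 1


def build(suffixes):
    """Return (cumulative, low): the counts block and the low-bits block."""
    suffixes = sorted(suffixes)
    buckets = (99999 >> K_BITS) + 1
    cumulative = [0] * buckets  # cumulative[M]: suffixes with top bits <= M
    for s in suffixes:
        cumulative[s >> K_BITS] += 1
    for m in range(1, buckets):
        cumulative[m] += cumulative[m - 1]
    low = [s & LOW_MASK for s in suffixes]
    return cumulative, low


def contains(cumulative, low, suffix):
    top, bottom = suffix >> K_BITS, suffix & LOW_MASK
    start = cumulative[top - 1] if top > 0 else 0
    end = cumulative[top]
    i = bisect_left(low, bottom, start, end)  # entries are sorted per bucket
    return i < end and low[i] == bottom


def enumerate_all(cumulative, low, prefix):
    result, i = [], 0
    for top, end in enumerate(cumulative):
        while i < end:
            result.append(prefix * 10 ** 5 + (top << K_BITS) + low[i])
            i += 1
    return result
```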
Finally, from an entropy point of view, it is not possible to store a subset of 10^3 positive integers all less than 10^5 in fewer than ceil(log[2](binomial(10^5, 10^3))) = 8073. Including the 17 we need for the first 5 digits, there is still a gap of 10997 - 8090 = 2907 bits. It's an interesting challenge to see if there are better solutions where you can still access the numbers relatively efficiently!
http://en.wikipedia.org/wiki/Acyclic_deterministic_finite_automaton
I once had an interview where they asked about data structures. I forgot "Array".
I'd probably consider using some compressed version of a Trie (possibly a DAWG as suggested by @Misha).
That would automagically take advantage of the fact that they all have a common prefix.
Searching will be performed in constant time, and printing will be performed in linear time.
I've heard of this problem before (but without first-5-digits-are-same assumption), and the simplest way to do it was Rice Coding:
1) Since the order does not matter we can sort them, and save just differences between consecutive values. In our case the average differences would be 100.000 / 1000 = 100
2) Encode the differences using Rice codes (base 128 or 64) or even Golomb codes (base 100).
EDIT : An estimation for Rice coding with base 128 (not because it would give best results, but because it's easier to compute):
We'll save first value as-is (32 bits).
The rest of 999 values are differences (we expect them to be small, 100 on average) will contain:
unary value value / 128 (variable number of bits + 1 bit as terminator)
binary value for value % 128 (7 bits)
We have to estimate somehow the limits (let's call it VBL) for number of variable bits:
lower limit: consider we are lucky, and no difference is larger than our base (128 in this case). this would mean give 0 additional bits.
high limit: since all differences smaller than base will be encoded in binary part of number, the maximum number we would need to encode in unary is 100000/128 = 781.25 (even less, because we don't expect most of differences to be zero).
So, the result is 32 + 999 * (1 + 7) + variable(0..782) bits = 1003 + variable(0..98) bytes.
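For reference, a Rice code with base 128 (7 remainder bits) can be sketched as follows. Bit strings are used for readability instead of packed bytes, and the unary part is written as 1s terminated by a 0, which is one common convention.

```python
def rice_encode(value, k=7):
    """Unary quotient (1s ended by a 0) followed by a k-bit remainder."""
    quotient, remainder = value >> k, value & ((1 << k) - 1)
    return "1" * quotient + "0" + format(remainder, "0%db" % k)


def rice_decode(bits, k=7):
    """Decode a single Rice codeword from the start of a bit string."""
    quotient = bits.index("0")
    remainder = int(bits[quotient + 1:quotient + 1 + k], 2)
    return (quotient << k) | remainder
```

A difference of 100 costs 8 bits here, and every additional 128 in the difference adds one more bit.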
This is a well-known problem from Bentley's Programming Pearls.
Solution:
Strip the first five digits from the numbers as they are the same for every
number. Then use bitwise operations to represent the remaining 100,000 possible
values (00000 to 99999). You will need 100,000 bits (12,500 bytes, which is less
than 2^17 bits) to represent the numbers. Each bit represents a number. If the
bit is set, the number is in the telephone book.
To print all numbers, simply print all the numbers where the bit is set,
concatenated with the prefix. To search for a given number, do the necessary bit
arithmetic to check the bit that represents the number.
You can search for a number in O(1) and the space efficiency is maximal due to the bit representation.
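A minimal sketch of the bitmap approach in Python, using a bytearray with one bit per possible five-digit suffix. SuffixBitmap is a made-up name.

```python
class SuffixBitmap:
    SIZE = 100000  # possible suffixes 00000..99999

    def __init__(self, prefix):
        self.prefix = prefix
        self.bits = bytearray((self.SIZE + 7) // 8)  # 12,500 bytes

    def add(self, number):
        suffix = number % self.SIZE
        self.bits[suffix >> 3] |= 1 << (suffix & 7)

    def contains(self, number):
        prefix, suffix = divmod(number, self.SIZE)
        if prefix != self.prefix:
            return False
        return bool(self.bits[suffix >> 3] & (1 << (suffix & 7)))

    def all_numbers(self):
        return [self.prefix * self.SIZE + s for s in range(self.SIZE)
                if self.bits[s >> 3] & (1 << (s & 7))]
```

The footprint is fixed at 12,500 bytes regardless of how many numbers are stored.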
HTH Chris.
Fixed storage of 1073 bytes for 1,000 numbers:
The basic format of this storage method is to store the first 5 digits, a count for each group, and the offset for each number in each group.
Prefix:
Our 5-digit prefix takes up the first 17 bits.
Grouping:
Next, we need to figure out a good sized grouping for numbers. Let's try have about 1 number per group. Since we know there are about 1000 numbers to store, we divide 99,999 into about 1000 parts. If we chose the group size as 100, there would be wasted bits, so let's try a group size of 128, which can be represented with 7 bits. This gives us 782 groups to work with.
Counts:
Next, for each of the 782 groups, we need to store the count of entries in each group. A 7-bit count for each group would yield 7*782=5,474 bits, which is very inefficient because the average number represented is about 1 because of how we chose our groups.
Thus, instead we have variable sized counts with leading 1's for each number in a group followed by a 0. Thus, if we had x numbers in a group, we'd have x 1's followed by a 0 to represent the count. For example, if we had 5 numbers in a group the count would be represented by 111110. With this method, if there are 1,000 numbers we end up with 1000 1's and 782 0's for a total of 1000 + 782 = 1,782 bits for the counts.
Offset:
Last, the format of each number will just be the 7-bit offset for each group. For example, if 00000 and 00001 are the only numbers in the 0-127 group, the bits for that group would be 110 0000000 0000001. Assuming 1,000 numbers, there will be 7,000 bits for the offsets.
Thus our final count assuming 1,000 numbers is as follows:
17 (prefix) + 1,782 (counts) + 7,000 (offsets) = 8,799 bits = 1100 bytes
Now, let's check if our group-size selection of 128 (7 bits) was the best choice for group size. Choosing x as the number of bits to represent each group, the formula for the size is:
Size in bits = 17 (prefix) + 1,000 + 99,999/2^x + x * 1,000
Minimizing this equation for integer values of x gives x=6, which yields 8,580 bits = 1,073 bytes. Thus, our ideal storage is as follows:
Group size: 2^6 = 64
Number of groups: 1,563
Total storage:
1017 (prefix plus 1's) + 1563 (0's in count) + 6*1000 (offsets) = 8,580 bits = 1,073 bytes
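A sketch of this format in Python, producing a bit string for the groups (the 17-bit prefix is left out). It assumes each group's unary count is followed directly by that group's offsets, as in the 110 0000000 0000001 example. The name encode is made up for the illustration.

```python
def encode(suffixes, x=6):
    """Encode five-digit suffixes as unary group counts plus x-bit offsets."""
    group_size = 1 << x
    groups = -(-100000 // group_size)  # ceiling division
    buckets = [[] for _ in range(groups)]
    for s in sorted(suffixes):
        buckets[s >> x].append(s & (group_size - 1))
    parts = []
    for bucket in buckets:
        parts.append("1" * len(bucket) + "0")  # count: one 1 per number, then 0
        parts.extend(format(offset, "0%db" % x) for offset in bucket)
    return "".join(parts)
```

For any 1,000 distinct suffixes and x = 6 the result is 8,563 bits long, which together with the 17-bit prefix gives the 8,580 bits above.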
Taking this as a purely theoretical problem and leaving implementation aside, the single most efficient way is to just index all possible sets of last five digits in a gigantic indexing table. Assuming you have exactly 1000 numbers, you would need a little more than 8000 bits to uniquely identify the current set. No greater compression is possible, because then you would have two sets which are identified with the same state.
Problems with this is, that you would have to represent each of the 2^8000 sets in your program as a lut, and not even google would be remotely capable of this.
Lookup would be O(1), printing all number O(n). Insertion would be O(2^8000) which in theory is O(1), but in practice is unusable.
In an interview I would only give this answer, if I were sure, that the company is looking for someone who is able to think out of the box a lot. Otherwise this might make you look like a theorist with no real world concerns.
EDIT: Ok, here is one "implementation".
Steps to construct the implementation:
Take a constant array of size 100 000*(100 000 choose 1000) bits. Yes, I am aware of the fact that this array will need more bits than there are atoms in the universe, by several orders of magnitude.
Separate this large array into chunks of 100 000 bits each.
In each chunk store a bit array for one specific combination of last five digits.
This is not the program, but a kind of meta program that will construct a gigantic LUT that can now be used in a program. Constant parts of the program are normally not counted when calculating space efficiency, so we do not care about this array when doing our final calculations.
Here is how to use this LUT:
When someone gives you 1000 numbers, you store the first five digits separately.
Find out which of the chunks of your array matches this set.
Store the number of the set in a single 8074 bit number (call this c).
This means for storage we only need 8091 bits, which we have proven here to be the optimal encoding. Finding the correct chunk however takes O(100 000*(100 000 choose 1000)), which according to the rules of math is O(1), but in practice will always take longer than the age of the universe.
Lookup is simple though:
Strip off the first five digits (the remaining number will be called n').
test if they match
Calculate i=c*100000+n'
Check if the bit at i in the LUT is set to one
Printing all numbers is simple also (and takes O(100000)=O(1) actually, because you always have to check all bits of the current chunk, so I miscalculated this above).
I would not call this an "implementation", because of the blatant disregard of the limitations (size of the universe and the time this universe has lived or this earth will exist). However, in theory this is the optimal solution. For smaller problems, this actually can be done, and sometimes will be done. For example, sorting networks are an example of this way of coding, and can be used as a final step in recursive sorting algorithms to get a big speedup.
This is equivalent to storing one thousand non-negative integers each less than 100,000. We can use something like arithmetic encoding to do this.
Ultimately, the numbers will be stored in a sorted list. I note that the expected difference between adjacent numbers in the list is 100,000/1000 = 100, which can be represented in 7 bits. There will also be many cases where more than 7 bits are necessary. A simple way to represent these less common cases is to adopt the utf-8 scheme where one byte represents a 7-bit integer unless the first bit is set, in which case the next byte is read to produce a 14-bit integer, unless its first bit is set, in which case the next byte is read to represent a 21-bit integer.
So in a typical, roughly evenly spread list, at least half of the differences between consecutive integers may be represented with one byte, and almost all the rest require two bytes. A few numbers, separated by differences of 16,384 or more, will require three bytes, but there cannot be more than 6 of these, since the differences sum to less than 100,000. The average storage then will be about 12 bits per number, or a bit less, for a total of roughly 1500 bytes.
The downside to this approach is that checking the existence of a number is now O(n). However, no time complexity requirement was specified.
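A minimal sketch of this difference encoding, assuming the low 7 bits of each difference are written first and the top bit of a byte means "another byte follows" (the answer does not fix the byte order, so that part is an assumption, as are the function names):

```python
def encode_deltas(numbers):
    """Store sorted numbers as differences, 7 data bits per byte."""
    out = bytearray()
    prev = 0
    for n in sorted(numbers):
        delta = n - prev  # the first number is stored as a difference from 0
        prev = n
        while True:
            byte = delta & 0x7F
            delta >>= 7
            if delta:
                out.append(byte | 0x80)  # top bit set: more bytes follow
            else:
                out.append(byte)
                break
    return bytes(out)


def decode_deltas(data):
    """Inverse of encode_deltas: returns the sorted list of numbers."""
    numbers, prev, value, shift = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value
            numbers.append(prev)
            value, shift = 0, 0
    return numbers
```

Checking whether a number exists means decoding from the start until the running sum reaches or passes it, which is the O(n) lookup mentioned above.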
After writing, I noticed ruslik already suggested the difference method above; the only difference is the encoding scheme. Mine is likely simpler but less efficient.
A quick question: is there any reason we would not want to convert the numbers to base 36? It may not save as much space, but it would certainly save time on the search, since you would be looking at far fewer than 10 digits. Alternatively, I would split them into files by group. I would name a file (111)-222.txt, store only the numbers that belong to that group in it, and keep them searchable in numeric order. That way I can always check whether the file exists before running a bigger search. More precisely, I would run two binary searches: one over the files to see whether the file exists, and another over the contents of that file.
Why not keep it simple? Use an array of structs.
So we can save the first 5 digits as a constant, so forget those for now.
65535 is the most that can be stored in a 16-bit number, and the max number we can have is 99999, which fits within a 17-bit number (max 131071).
Using 32-bit data types is a waste because we only need 1 of those extra 16 bits. Therefore, we can define a structure that has a boolean (or character) and a 16-bit number.
Assuming C/C++
typedef struct _number {
uint16_t number;
bool overflow;
}Number;
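Ignoring alignment padding, this struct takes up 3 bytes, and we need an array of 1000, so 3000 bytes total. We have reduced the total space by 25%! (In practice most compilers pad the struct to 4 bytes unless it is explicitly packed.)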
As far as storing the numbers, we can do simple bitwise math
overflow = (number5digits >> 16) & 1; // the 17th bit
number = number5digits & 0xFFFF; // the low 16 bits
And the inverse
//Something like this should work
number5digits = number | ((uint32_t)overflow << 16);
To print all of them, we can use a simple loop over the array. Retrieving a specific number happens in constant time of course, since it is an array.
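// numbers[] is the array of 1000 Number structs; needs <iostream> and <iomanip>
for(int i=0;i<1000;i++) cout << const5digits << setw(5) << setfill('0') << (numbers[i].number | (numbers[i].overflow << 16)) << endl;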
To search for a number, we would want a sorted array. So when the numbers are saved, sort the array (I would choose a merge sort personally, O(n log n)). Now to search, I would use a binary search, which splits the array the same way merge sort does: split the array, see which half our number falls in, and recurse on only that half. Repeat until you have a match and return the index; otherwise the number does not exist, so return an error code. Because only one side of the split is followed each time instead of both, the search runs in O(log n) in the worst case, far better than the O(n log n) of the sort.
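The search step can be sketched with a standard binary search. This is an illustration in Python rather than the answer's C/C++, using the standard library's `bisect_left`; `find` is a hypothetical helper name and the numbers are treated as plain integers:

```python
from bisect import bisect_left


def find(sorted_numbers, target):
    """Return the index of target in the sorted list, or -1 if it is absent."""
    i = bisect_left(sorted_numbers, target)  # O(log n) halving search
    if i < len(sorted_numbers) and sorted_numbers[i] == target:
        return i
    return -1


book = sorted([99999, 12345, 70000, 42])
print(find(book, 70000))  # index of 70000 in the sorted list
print(find(book, 70001))  # -1: not in the book
```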
My solution: best case 7.025 bits/number, worst case 14.193 bits/number, rough average 8.551 bits/number. Stream-encoded, no random access.
Even before reading ruslik’s answer, I immediately thought of encoding the difference between each number, since it will be small and should be relatively consistent, but the solution must also be able to accommodate the worst case scenario. We have a space of 100000 numbers that contain only 1000 numbers. In a perfectly uniform phone book, each number would be greater than the previous number by 100:
55555-12345
55555-12445
55555-12545
If that was the case, it would require zero storage to encode the differences between numbers, since it’s a known constant. Unfortunately, numbers may vary from the ideal steps of 100. I would encode the difference from the ideal increment of 100, so that if two adjacent numbers differ by 103, I would encode the number 3 and if two adjacent numbers differ by 92, I would encode -8. I call the delta from an ideal increment of 100 the “variance”.
The variance can range from -99 (i.e. two consecutive numbers) to 99000 (the entire phonebook consists of numbers 00000…00999 and an additional furthest-away number 99999), which is a range of 99100 possible values.
I’d aim to allocate a minimal storage to encode the most common differences and expand the storage if I encounter bigger differences (like ProtoBuf’s varint). I’ll use chunks of seven bits, six for storage and an additional flag bit at the end to indicate that this variance is stored with an additional chunk after the current one, up to a maximum of three chunks (which will provide a maximum of 3 * 6 = 18 bits of storage, or 262144 possible values, more than the number of possible variances, 99100). Each additional chunk that follows a raised flag has bits of a higher significance, so the first chunk always has bits 0-5, the optional second chunk has bits 6-11, and the optional third chunk has bits 12-17.
A single chunk provides six bits of storage which can accommodate 64 values. I’d like to map the 64 smallest variances to fit in that single chunk (i.e. variances of -32 to +31) so I’ll use ProtoBuf ZigZag encoding, up to the variances of -99 to +98 (since there’s no need for a negative variance beyond -99), at which point I’ll switch to regular encoding, offset by 99 so that variance 99 takes the next free value, 198:

Variance   | Encoded Value
-----------+----------------
0          | 0
-1         | 1
1          | 2
-2         | 3
2          | 4
-3         | 5
3          | 6
...        | ...
-31        | 61
31         | 62
-32        | 63
-----------|--------------- 6 bits
32         | 64
-33        | 65
33         | 66
...        | ...
-98        | 195
98         | 196
-99        | 197
-----------|--------------- End of ZigZag
99         | 198
100        | 199
...        | ...
3995       | 4094
3996       | 4095
-----------|--------------- 12 bits
3997       | 4096
3998       | 4097
...        | ...
262044     | 262143
-----------|--------------- 18 bits

Some examples of how variances would be encoded as bits, including the flag to indicate an additional chunk:
Variance   | Encoded Bits
-----------+----------------
0          | 000000 0
5          | 001010 0
-8         | 001111 0
-32        | 111111 0
32         | 000000 1 000001 0
-99        | 000101 1 000011 0
177        | 010100 1 000100 0
14444      | 001111 1 100011 1 000011 0
So the first three numbers of a sample phone book would be encoded as a stream of bits as follows:
BIN 000101001011001000100110010000011001   000110 0      110001 1 000001 0
PH# 55555-12345                            55555-12448   55555-12491
POS 1                                      2             3
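The chunked variance encoding can be sketched as below. This is only an illustration with made-up function names. It assumes that variances above 98 continue directly after the ZigZag range, so that 99 maps to 198 (every variance then has exactly one code), and it leaves out the starting number.

```python
def encode_value(variance):
    """Map a variance (-99 and up) to a non-negative integer."""
    if variance < -99:
        raise ValueError("adjacent numbers must differ by at least 1")
    if variance <= 98:
        # ZigZag order: 0, -1, 1, -2, 2, ...
        return 2 * variance if variance >= 0 else -2 * variance - 1
    return variance + 99  # assumption: 99 -> 198, right after -99 -> 197


def encode_chunks(variance):
    """Return the chunks as strings: 6 data bits, a space, then the flag bit."""
    value = encode_value(variance)
    chunks = []
    while True:
        data, value = value & 0x3F, value >> 6  # lowest 6 bits go first
        flag = "1" if value else "0"  # 1 means another chunk follows
        chunks.append(f"{data:06b} {flag}")
        if not value:
            return chunks


def encode_book(numbers, step=100):
    """Encode the variances between adjacent sorted numbers."""
    numbers = sorted(numbers)
    chunks = []
    for prev, cur in zip(numbers, numbers[1:]):
        chunks.extend(encode_chunks(cur - prev - step))
    return " ".join(chunks)


print(encode_book([12345, 12448, 12491]))  # variances 3 and -57
```

The sample book has variances 3 (one chunk) and -57 (ZigZag value 113, two chunks).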
Best case scenario, the phone book is somewhat uniformly distributed and no two adjacent phone numbers have a variance outside the range -32 to 31, so it would use 7 bits per number plus 32 bits for the starting number for a total of 32 + 7*999 = 7025 bits.
A mixed scenario, where 800 phone numbers' variance fits within one chunk (800 * 7 = 5600), 180 numbers fit in two chunks each (180 * 2 * 7 = 2520) and 19 numbers fit in three chunks each (19 * 3 * 7 = 399), plus the initial 32 bits, totals 8551 bits.
Worst case scenario, 25 numbers fit in three chunks (25 * 3 * 7 = 525 bits) and the remaining 974 numbers fit in two chunks (974 * 2 * 7 = 13636 bits), plus 32 bits for the first number for a grand total of 14193 bits.
Number of variances encoded in each chunk count:
1 chunk  | 2 chunks | 3 chunks | Total bits
---------+----------+----------+------------
999      | 0        | 0        | 7025
800      | 180      | 19       | 8551
0        | 974      | 25       | 14193
I can see four additional optimizations that can be performed to further reduce the space required:
The third chunk doesn’t need the full seven bits, it can be just five bits and without a flag bit.
There can be an initial pass of the numbers to calculate the best sizes for each chunk. Maybe for a certain phonebook, it would be optimal to have the first chunk have 5+1 bits, the second 7+1 and the third 5+1. That would further reduce the size to a minimum of 6*999 + 32 = 6026 bits, plus two sets of three bits to store the sizes of chunks 1 and 2 (chunk 3’s size is the remainder of the required 17 bits) for a total of 6032 bits!
The same initial pass can calculate a better expected increment than the default 100. Maybe there's a phone book that starts from 55555-50000, and so it has half the number range so the expected increment should be 50. Or maybe there's a non-linear distribution (standard deviation perhaps) and some other optimal expected increment can be used. This would reduce the typical variance and might allow an even smaller first chunk to be used.
Further analysis can be done in the first pass to allow the phone book to be partitioned, with each partition having its own expected increment and chunk size optimizations. This would allow for a smaller first chunk size for certain highly uniform parts of the phone book (reducing the number of bits consumed) and larger chunks sizes for non-uniform parts (reducing the number of bits wasted on continuation flags).
The real question is one of storing five-digit phone numbers.
The trick is that you'd need 17 bits to store the range of numbers from 0..99,999. But storing 17 bits on conventional 8-bit byte boundaries is a hassle. That's why they are asking if you can do it in less than 4k by not using 32-bit integers.
Question: are all number combinations possible?
Because of the nature of the telephone system, there may be fewer than 65k possible combinations. I will assume that all combinations are possible, because we are talking about the last five positions in the phone number, as opposed to the area code or exchange prefixes.
Question: will this list be static or will it need to support updates?
If it is static, then when it comes time to populate the database, count the numbers < 50,000 and the numbers >= 50,000. Allocate two arrays of uint16 of appropriate length: one for the integers below 50,000 and one for the higher set. When storing integers in the higher array, subtract 50,000, and when reading integers from that array, add 50,000. Now you've stored your 1,000 integers in 2,000 8-bit bytes.
Building the phonebook will require two input traversals, but lookups should happen in half the time, on average, than they would with a single array. If lookup time were very important you could use more arrays for smaller ranges but I think at these sizes your performance bound would be pulling the arrays from memory and 2k will probably stash into CPU cache if not register space on anything you'd be using these days.
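The static two-array layout can be sketched as follows. This is an illustration only: it uses Python's `array` module with typecode `"H"` (an unsigned 16-bit value on common platforms) as a stand-in for a uint16 array, and `build` and `contains` are hypothetical names.

```python
from array import array
from bisect import bisect_left

SPLIT = 50_000


def build(numbers):
    """Split the numbers into two sorted 16-bit arrays around SPLIT."""
    low = array("H", sorted(n for n in numbers if n < SPLIT))
    high = array("H", sorted(n - SPLIT for n in numbers if n >= SPLIT))
    return low, high


def contains(book, n):
    """Binary-search the half of the book that n belongs to."""
    low, high = book
    arr, key = (low, n) if n < SPLIT else (high, n - SPLIT)
    i = bisect_left(arr, key)
    return i < len(arr) and arr[i] == key


book = build(range(0, 100_000, 100))
print(sum(len(a) * a.itemsize for a in book))  # payload size in bytes
```

Sorting both halves is an addition of this sketch; it lets each lookup binary-search a single array.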
If it is dynamic, allocate one array of 1000 or so uint16, and add the numbers in sorted order. Set the first element to 50,001, and set the second element to an appropriate end-of-list value, like NULL or 65,000. When you store the numbers, store them in sorted order. If a number is below 50,001 then store it before the 50,001 marker. If a number is 50,001 or greater, store it after the 50,001 marker, but subtract 50,000 from the stored value.
Your array will look something like:
00001 = 00001
12345 = 12345
50001 = reserved
00001 = 50001
12345 = 62345
65000 = end-of-list
So, when you look up a number in the phonebook, you'll traverse the array and if you've hit the 50,001 value you start adding 50,000 to your array values.
This makes inserts very expensive, but lookups are easy, and you're not going to spend much more than 2k on storage.

Resources