How can I sort numbers lexicographically? - algorithm

Here is the scenario.
I am given an array 'A' of integers. The size of the array is not fixed. The function that I am supposed to write may be called once with an array of just a few integers while another time, it might even contain thousands of integers. Additionally, each integer need not contain the same number of digits.
I am supposed to 'sort' the numbers in the array such that the resulting array has the integers ordered in a lexicographic manner (i.e., they are sorted based on their string representations; here "123" is the string representation of 123). Please note that the output should contain integers only, not their string equivalents.
For example: if the input is:
[ 12 | 2434 | 23 | 1 | 654 | 222 | 56 | 100000 ]
Then the output should be:
[ 1 | 100000 | 12 | 222 | 23 | 2434 | 56 | 654 ]
My initial approach: I converted each integer to its string format, then added zeros to its right to make all the integers contain the same number of digits (this was the messy step as it involved tracking etc making the solution very inefficient) and then did radix sort.
Finally, I removed the padded zeros, converted the strings back to their integers and put them in the resulting array. This was a very inefficient solution.
I've been led to believe that the solution doesn't need padding etc and there is a simple solution where you just have to process the numbers in some way (some bit processing?) to get the result.
What is the space-wise most efficient solution you can think of? Time-wise?
If you are giving code, I'd prefer Java or pseudo-code. But if that doesn't suit you, any such language should be fine.

Executable pseudo-code (aka Python): thenumbers.sort(key=str). Yeah, I know that using Python is kind of like cheating -- it's just too powerful;-). But seriously, this also means: if you can sort an array of strings lexicographically, as Python's sort intrinsically can, then just make the "key string" out of each number and sort that auxiliary array (you can then reconstruct the desired numbers array by a str->int transformation, or by doing the sort on the indices via indirection, etc etc); this is known as DSU (Decorate, Sort, Undecorate) and it's what the key= argument to Python's sort implements.
In more detail (pseudocode):
allocate an array of char** aux as long as the numbers array
for i from 0 to length of numbers-1, aux[i]=stringify(numbers[i])
allocate an array of int indices of the same length
for i from 0 to length of numbers-1, indices[i]=i
sort indices, using as cmp(i,j) strcmp(aux[i],aux[j])
allocate an array of int results of the same length
for i from 0 to length of numbers-1, results[i]=numbers[indices[i]]
memcpy results over numbers
free every aux[i], and also aux, indices, results
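As a rough Python sketch of the index-indirection steps above (the function name `lex_sort_in_place` is made up for illustration, and the input is assumed to hold non-negative integers):

```python
def lex_sort_in_place(numbers):
    """Reorder a list of integers by their decimal string form, in place."""
    # Decorate: build the auxiliary array of string keys once.
    aux = [str(n) for n in numbers]
    # Sort the indices, comparing the strings they point at.
    indices = sorted(range(len(numbers)), key=lambda i: aux[i])
    # Undecorate: gather the integers in the new order and copy them back.
    numbers[:] = [numbers[i] for i in indices]


data = [12, 2434, 23, 1, 654, 222, 56, 100000]
lex_sort_in_place(data)
print(data)  # [1, 100000, 12, 222, 23, 2434, 56, 654]
```

Each number is stringified exactly once, at the cost of two auxiliary arrays.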

Since you mentioned Java is the actual language in question:
You don't need to convert to and from strings. Instead, define your own comparator and use that in the sort.
Specifically:
Comparator<Integer> lexCompare = new Comparator<Integer>(){
public int compare( Integer x, Integer y ) {
return x.toString().compareTo( y.toString() );
}
};
Then you can sort the array like this:
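Integer[] array = /* whatever */;
Arrays.sort( array, lexCompare );
(Note: the array must hold Integer objects rather than int primitives, because Arrays.sort accepts a Comparator only for object arrays; auto-boxing does not convert an int[] into an Integer[].)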

I'd just turn them into strings, and then sort them using strcmp, which does lexicographic comparisons.
Alternatively you can write a "lexcmp" function that compares two numbers using % 10 and /10 but that's basically the same thing as calling itoa many times, so not a good idea.

The actual sorting can be done by any algorithm you like. The key to this problem is finding the comparison function that will properly identify which numbers should be "less than" others, according to this scheme:
bool isLessThan(int a, int b)
{
string aString = ToString(a);
string bString = ToString(b);
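int charCount = min(aString.length(), bString.length());
for (charIndex = 0; charIndex < charCount; charIndex++)
{
// the first differing digit decides the order either way
if (aString[charIndex] < bString[charIndex]) { return TRUE; }
if (aString[charIndex] > bString[charIndex]) { return FALSE; }
}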
// if the numbers are of different lengths, but identical
// for the common digits (e.g. 123 and 12345)
// the shorter string is considered "less"
return (aString.length() < bString.length());
}
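For comparison, here is a minimal Python sketch of such a comparison function, plugged into the built-in sort through `functools.cmp_to_key`. The names `lex_cmp` and `lex_sorted` are illustrative only, and it returns -1/0/1 rather than a boolean.

```python
from functools import cmp_to_key


def lex_cmp(a, b):
    """Return -1, 0 or 1 by comparing the decimal strings of a and b."""
    sa, sb = str(a), str(b)
    # Walk the common prefix; the first differing digit decides.
    for ca, cb in zip(sa, sb):
        if ca != cb:
            return -1 if ca < cb else 1
    # One string is a prefix of the other: the shorter one sorts first.
    return (len(sa) > len(sb)) - (len(sa) < len(sb))


def lex_sorted(numbers):
    """Return a new list ordered by lex_cmp."""
    return sorted(numbers, key=cmp_to_key(lex_cmp))


print(lex_sorted([12, 2434, 23, 1, 654, 222, 56, 100000]))
# [1, 100000, 12, 222, 23, 2434, 56, 654]
```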

My temptation would be to do the int-to-string conversion in the comparator code rather than in bulk. Although this may be more elegant from a code perspective, the execution effort would be greater, as each number may be compared several times.
I'd be inclined to create a new array containing both the int and string representation (not sure that you need to pad the string versions for the string comparison to produce the order you've given), sort that on the string and then copy the int values back to the original array.
I can't think of a smart mathematical way of sorting this; by your own statement you want to sort lexicographically, so you need to transform the numbers to strings to do that.

You definitely don't need to pad the result. It will not change the order of the lexicographical compare, it will be more error prone, and it will just waste CPU cycles. The most "space-wise" efficient method would be to convert the numbers to strings when they are compared. That way, you would not need to allocate an additional array, the numbers would be compared in place.
You can get a reasonably good implementation quickly by just converting them to strings as needed. Stringifying a number isn't particularly expensive and, since you are only dealing with two strings at a time, it is quite likely that they will remain in the CPU cache at all times. So the comparisons will be much faster than the case where you convert the entire array to strings since they will not need to be loaded from main memory into the cache. People tend to forget that a CPU has a cache and that algorithms which do a lot of their work in a small local area of memory will benefit greatly from the much faster cache access. On some architectures, the cache is so much faster than the memory that you can do hundreds of operations on your data in the time it would have taken you to load it from main memory. So doing more work in the comparison function could actually be significantly faster than pre-processing the array. Especially if you have a large array.
Try doing the string serialization and comparison in a comparator function and benchmark that. I think it will be a pretty good solution. Example java-ish pseudo-code:
public static int compare(Number numA, Number numB) {
return numA.toString().compareTo(numB.toString());
}
I think that any fancy bit wise comparisons you could do would have to be approximately equivalent to the work involved in converting the numbers to strings. So you probably wouldn't get significant benefit. You can't just do a direct bit for bit comparison, that would give you a different order than lexicographical sort. You'll need to be able to figure out each digit for the number anyway, so it is most straightforward to just make them strings. There may be some slick trick, but every avenue I can think of off the top of my head is tricky, error-prone, and much more work than it is worth.

Pseudocode:
sub sort_numbers_lexicographically (array) {
for 0 <= i < array.length:
array[i] = munge(array[i]);
sort(array); // using usual numeric comparisons
for 0 <= i < array.length:
array[i] = unmunge(array[i]);
}
So, what are munge and unmunge?
munge is different depending on the integer size. For example:
sub munge (4-bit-unsigned-integer n) {
switch (n):
case 0: return 0
case 1: return 1
case 2: return 8
case 3: return 9
case 4: return 10
case 5: return 11
case 6: return 12
case 7: return 13
case 8: return 14
case 9: return 15
case 10: return 2
case 11: return 3
case 12: return 4
case 13: return 5
case 14: return 6
case 15: return 7
}
Essentially what munge is doing is saying what order 4 bit integers come in when sorted lexicographically. I'm sure you can see that there is a pattern here --- I didn't have to use a switch --- and that you can write a version of munge that handles 32 bit integers reasonably easily. Think about how you would write versions of munge for 5, 6, and 7 bit integers if you can't immediately see the pattern.
unmunge is the inverse of munge.
So you can avoid converting anything to a string --- you don't need any extra memory.
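One possible way to spell out the pattern, as a Python sketch: treat munge(n) as the rank of n among 0..limit in lexicographic order and count that rank arithmetically, length by length. The `limit` parameter and the helper `num_digits` are assumptions of this sketch, not part of the answer above, and unmunge (the inverse mapping) is not shown.

```python
def num_digits(n):
    """Count decimal digits without building a string."""
    count = 1
    while n >= 10:
        n //= 10
        count += 1
    return count


def munge(n, limit):
    """Rank of n among 0..limit when ordered by decimal string."""
    if n == 0:
        return 0  # "0" sorts before every positive number
    d = num_digits(n)
    rank = 1  # account for 0
    for length in range(1, num_digits(limit) + 1):
        lo = 10 ** (length - 1)            # smallest number with this many digits
        hi = min(10 ** length - 1, limit)  # largest one that is still in range
        if length < d:
            # Shorter numbers sort first if they are <= the matching prefix of n.
            upper = n // 10 ** (d - length)
        else:
            # Same length or longer: sorts first iff its leading d digits are < n.
            upper = n * 10 ** (length - d) - 1
        rank += max(0, min(hi, upper) - lo + 1)
    return rank


print([munge(n, 15) for n in range(16)])
# [0, 1, 8, 9, 10, 11, 12, 13, 14, 15, 2, 3, 4, 5, 6, 7]
```

With `limit=15` this reproduces the 4-bit table; a 32-bit version would pass `limit=2**32 - 1`.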

If you want to try a better preprocess-sort-postprocess, then note that an int is at most 10 decimal digits (ignoring signed-ness for the time being).
So the binary-coded-decimal data for it fits in 64 bits. Map digit 0->1, 1->2 etc, and use 0 as a NUL terminator (to ensure that "1" comes out less than "10"). Shift each digit in turn, starting with the smallest, into the top of a long. Sort the longs, which will come out in lexicographical order for the original ints. Then convert back by shifting digits one at a time back out of the top of each long:
uint64_t munge(uint32_t i) {
uint64_t acc = 0;
while (i > 0) {
acc = acc >> 4;
uint64_t digit = (i % 10) + 1;
acc += (digit << 60);
i /= 10;
}
return acc;
}
uint32_t demunge(uint64_t l) {
uint32_t acc = 0;
while (l > 0) {
acc *= 10;
uint32_t digit = (l >> 60) - 1;
acc += digit;
l <<= 4; /* shift the next digit into the top nibble */
}
return acc;
}
Or something like that. Since Java doesn't have unsigned ints, you'd have to modify it a little. It uses a lot of working memory (twice the size of the input), but that's still less than your initial approach. It might be faster than converting to strings on the fly in the comparator, but it uses more peak memory. Depending on the GC, it might churn its way through less memory total, though, and require less collection.
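A Python rendering of the same packing idea may make it easier to experiment with. This is a sketch under the same assumption of unsigned 32-bit inputs; since Python integers are unbounded, the 64-bit width is emulated with an explicit mask (`MASK64` is a name introduced here).

```python
MASK64 = (1 << 64) - 1


def munge(i):
    """Pack the decimal digits of i into 64 bits, most significant digit on top."""
    acc = 0
    while i > 0:
        acc >>= 4
        # Digits are stored as 1..10 so that 0 can act as the terminator.
        acc |= ((i % 10) + 1) << 60
        i //= 10
    return acc


def demunge(packed):
    """Undo munge by shifting digits back out of the top nibble."""
    acc = 0
    while packed > 0:
        acc = acc * 10 + (packed >> 60) - 1
        packed = (packed << 4) & MASK64
    return acc


numbers = [12, 2434, 23, 1, 654, 222, 56, 100000]
print([demunge(m) for m in sorted(munge(n) for n in numbers)])
# [1, 100000, 12, 222, 23, 2434, 56, 654]
```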

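If all the numbers are less than 1E+18, you could cast each number to UINT64, multiply by ten and add one, and then multiply by ten until they are at least 1E+18 (that is, 19 digits; going on to 1E+19 could overflow a UINT64). Then sort those. To get back the original numbers, divide each number by ten until the last digit is non-zero (it should be one) and then divide by ten once more. One caveat: the appended one makes a number sort after its own zero-extensions (1 becomes 1100..., 10 becomes 1010...), so pairs such as 1 and 10 come out reversed and need a separate tie-break.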

The question doesn't indicate how to treat negative integers in the lexicographic collating order. The string-based methods presented earlier typically will sort negative values to the front; eg, { -123, -345, 0, 234, 78 } would be left in that order. But if the minus signs were supposed to be ignored, the output order should be { 0, -123, 234, -345, 78 }. One could adapt a string-based method to produce that order by somewhat-cumbersome additional tests.
It may be simpler, in both theory and code, to use a comparator that compares fractional parts of common logarithms of two integers. That is, it will compare the mantissas of base 10 logarithms of two numbers. A logarithm-based comparator will run faster or slower than a string-based comparator, depending on a CPU's floating-point performance specs and on quality of implementations.
The java code shown at the end of this answer includes two logarithm-based comparators: alogCompare and slogCompare. The former ignores signs, so would produce { 0, -123, 234, -345, 78 } from { -123, -345, 0, 234, 78 }.
The number-groups shown next are the output produced by the java program.
The “dar rand” section shows a random-data array dar as generated. It reads across and then down, 5 elements per line. Note, arrays sar, lara, and lars initially are unsorted copies of dar.
The “dar sort” section is dar after sorting via Arrays.sort(dar);.
The “sar lex” section shows array sar after sorting with Arrays.sort(sar,lexCompare);, where lexCompare is similar to the Comparator shown in Jason Cohen's answer.
The “lar s log” section shows array lars after sorting by Arrays.sort(lars,slogCompare);, illustrating a logarithm-based method that gives the same order as do lexCompare and other string-based methods.
The “lar a log” section shows array lara after sorting by Arrays.sort(lara,alogCompare);, illustrating a logarithm-based method that ignores minus signs.
dar rand -335768 115776 -9576 185484 81528
dar rand 79300 0 3128 4095 -69377
dar rand -67584 9900 -50568 -162792 70992
dar sort -335768 -162792 -69377 -67584 -50568
dar sort -9576 0 3128 4095 9900
dar sort 70992 79300 81528 115776 185484
sar lex -162792 -335768 -50568 -67584 -69377
sar lex -9576 0 115776 185484 3128
sar lex 4095 70992 79300 81528 9900
lar s log -162792 -335768 -50568 -67584 -69377
lar s log -9576 0 115776 185484 3128
lar s log 4095 70992 79300 81528 9900
lar a log 0 115776 -162792 185484 3128
lar a log -335768 4095 -50568 -67584 -69377
lar a log 70992 79300 81528 -9576 9900
Java code is shown below.
// Code for "How can I sort numbers lexicographically?" - jw - 2 Jul 2014
import java.util.Random;
import java.util.Comparator;
import java.lang.Math;
import java.util.Arrays;
public class lex882954 {
// Comparator from Jason Cohen's answer
public static Comparator<Integer> lexCompare = new Comparator<Integer>(){
public int compare( Integer x, Integer y ) {
return x.toString().compareTo( y.toString() );
}
};
// Comparator that uses "abs." logarithms of numbers instead of strings
public static Comparator<Integer> alogCompare = new Comparator<Integer>(){
public int compare( Integer x, Integer y ) {
Double xl = (x==0)? 0 : Math.log10(Math.abs(x));
Double yl = (y==0)? 0 : Math.log10(Math.abs(y));
Double xf=xl-xl.intValue();
return xf.compareTo(yl-yl.intValue());
}
};
// Comparator that uses "signed" logarithms of numbers instead of strings
public static Comparator<Integer> slogCompare = new Comparator<Integer>(){
public int compare( Integer x, Integer y ) {
Double xl = (x==0)? 0 : Math.log10(Math.abs(x));
Double yl = (y==0)? 0 : Math.log10(Math.abs(y));
Double xf=xl-xl.intValue()+Integer.signum(x);
return xf.compareTo(yl-yl.intValue()+Integer.signum(y));
}
};
// Print array before or after sorting
public static void printArr(Integer[] ar, int asize, String aname) {
int j;
for(j=0; j < asize; ++j) {
if (j%5==0)
System.out.printf("%n%8s ", aname);
System.out.printf(" %9d", ar[j]);
}
System.out.println();
}
// Main Program -- to test comparators
public static void main(String[] args) {
int j, dasize=15, hir=99;
Random rnd = new Random(12345);
Integer[] dar = new Integer[dasize];
Integer[] sar = new Integer[dasize];
Integer[] lara = new Integer[dasize];
Integer[] lars = new Integer[dasize];
for(j=0; j < dasize; ++j) {
lara[j] = lars[j] = sar[j] = dar[j] = rnd.nextInt(hir) *
rnd.nextInt(hir) * (rnd.nextInt(hir)-44);
}
printArr(dar, dasize, "dar rand");
Arrays.sort(dar);
printArr(dar, dasize, "dar sort");
Arrays.sort(sar, lexCompare);
printArr(sar, dasize, "sar lex");
Arrays.sort(lars, slogCompare);
printArr(lars, dasize, "lar s log");
Arrays.sort(lara, alogCompare);
printArr(lara, dasize, "lar a log");
}
}
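The sign-ignoring logarithm comparison can also be sketched as a Python sort key. This is only an illustration (the name `log_mantissa` is made up here). It inherits the limitation that numbers differing only by trailing zeros, such as 1, 10 and 100, have the same mantissa and therefore tie, and floating-point rounding may matter for large values with nearly equal mantissas.

```python
import math


def log_mantissa(x):
    """Fractional part of log10(|x|); 0 is treated as having mantissa 0."""
    if x == 0:
        return 0.0
    return math.log10(abs(x)) % 1.0


values = [-123, -345, 0, 234, 78]
print(sorted(values, key=log_mantissa))  # [0, -123, 234, -345, 78]
```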

If you're going for space-wise efficiency, I'd try just doing the work in the comparison function of the sort
int compare(int a, int b) {
// convert a to string
// convert b to string
// return -1 if a < b, 0 if they are equal, 1 if a > b
}
If it's too slow (it's slower than preprocessing, for sure), keep track of the conversions somewhere so that the comparison function doesn't keep having to do them.
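A small Python sketch of that trade-off: the comparison function converts on demand, and `functools.lru_cache` remembers each conversion so repeated comparisons of the same number do not redo it. The names `as_text`, `lex_cmp` and `lex_sort` are illustrative; the cache of course gives back some of the space savings.

```python
from functools import cmp_to_key, lru_cache


@lru_cache(maxsize=None)
def as_text(n):
    """Stringify n, remembering the result for later comparisons."""
    return str(n)


def lex_cmp(a, b):
    """Return -1 if a < b, 0 if equal, 1 if a > b, in string order."""
    sa, sb = as_text(a), as_text(b)
    return (sa > sb) - (sa < sb)


def lex_sort(numbers):
    """Sort the list in place, then drop the cached conversions."""
    numbers.sort(key=cmp_to_key(lex_cmp))
    as_text.cache_clear()


data = [12, 2434, 23, 1, 654, 222, 56, 100000]
lex_sort(data)
print(data)  # [1, 100000, 12, 222, 23, 2434, 56, 654]
```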

Possible optimization: Instead of this:
I converted each integer to its string format, then added zeros to its right to make all the integers contain the same number of digits
you can multiply each number by 10^(N - floor(log10(number))), N being a number larger than log10 of any of your numbers.
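One way to make the scaling idea concrete in Python: scale every number to a common digit count, and keep the original digit count as a tie-break so that 1 still sorts before 10 (without it, both would scale to the same value). The function names are illustrative and the input is assumed to be non-negative.

```python
def num_digits(n):
    """Count decimal digits of a non-negative integer."""
    count = 1
    while n >= 10:
        n //= 10
        count += 1
    return count


def padded_key(n, width):
    """Scale n to `width` digits; the digit count breaks ties between 1, 10, 100..."""
    d = num_digits(n)
    return (n * 10 ** (width - d), d)


def lex_sort(numbers):
    """Return the numbers in lexicographic order using purely numeric keys."""
    if not numbers:
        return []
    width = max(num_digits(n) for n in numbers)
    return sorted(numbers, key=lambda n: padded_key(n, width))


print(lex_sort([12, 2434, 23, 1, 654, 222, 56, 100000]))
# [1, 100000, 12, 222, 23, 2434, 56, 654]
```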

#!/usr/bin/perl
use strict;
use warnings;
my @x = ( 12, 2434, 23, 1, 654, 222, 56, 100000 );
print $_, "\n" for sort @x;
__END__
Some timings ... First, with empty @x:
C:\Temp> timethis s-empty
TimeThis : Elapsed Time : 00:00:00.188
Now, with 10,000 randomly generated elements:
TimeThis : Elapsed Time : 00:00:00.219
This includes the time taken to generate the 10,000 elements but not the time to output them to the console. The output adds about a second.
So, save some programmer time ;-)

One really hacky method (using C) would be:
generate a new array of all the values converted to floats
do a sort using the mantissa (significand) bits for the comparison
In Java (from here):
long bits = Double.doubleToLongBits(5894.349580349);
boolean negative = (bits & 0x8000000000000000L) != 0;
long exponent = (bits & 0x7ff0000000000000L) >> 52;
long mantissa = bits & 0x000fffffffffffffL;
so you would sort on the long mantissa here.

Related

Generate an integer that is not among four billion given ones

I have been given this interview question:
Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB memory. Follow up with what you would do if you have only 10 MB of memory.
My analysis:
The size of the file is 4×10^9×4 bytes = 16 GB.
We can do external sorting, thus letting us know the range of the integers.
My question is what is the best way to detect the missing integer in the sorted big integer sets?
My understanding (after reading all the answers):
Assuming we are talking about 32-bit integers, there are 2^32 ≈ 4*10^9 distinct integers.
Case 1: we have 1 GB = 1 * 10^9 * 8 bits = 8 billion bits memory.
Solution:
If we use one bit representing one distinct integer, it is enough. we don't need sort.
Implementation:
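int radix = 8;
// 2^32 bits = 2^29 bytes; 0xffffffff is -1 as a Java int, so size the array explicitly
byte[] bitfield = new byte[1 << 29];
void F() throws FileNotFoundException{
Scanner in = new Scanner(new FileReader("a.txt"));
while(in.hasNextInt()){
// treat the int as unsigned so the array index is never negative
long n = in.nextInt() & 0xffffffffL;
bitfield[(int)(n/radix)] |= (1 << (n%radix));
}
for(int i = 0; i< bitfield.length; i++){
for(int j =0; j<radix; j++){
// prints the unsigned 32-bit value of every integer that never appeared
if( (bitfield[i] & (1<<j)) == 0) System.out.println((long)i*radix+j);
}
}
}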
Case 2: 10 MB memory = 10 * 10^6 * 8 bits = 80 million bits
Solution:
There are 2^16 = 65536 possible 16-bit prefixes, so we need to build 65536 buckets. Each bucket needs a 4-byte counter, because in the worst case all 4 billion integers belong to the same bucket; that is 2^16 * 4 * 8 = 2 million bits.
1. Build the counter of each bucket in a first pass through the file.
2. Scan the buckets and find the first one that has fewer than 65536 hits.
3. Build new buckets for the numbers whose high 16-bit prefix is the one found in step 2, in a second pass through the file.
4. Scan the buckets built in step 3 and find the first bucket that doesn't have a hit.
The code is very similar to above one.
Conclusion:
We decrease memory through increasing file pass.
A clarification for those arriving late: The question, as asked, does not say that there is exactly one integer that is not contained in the file—at least that's not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately the comment that introduced it to the comment thread was later deleted by its author, so now it looks like the orphaned replies to it just misunderstood everything. It's very confusing, sorry.
Assuming that "integer" means 32 bits: 10 MB of space is more than enough for you to count how many numbers there are in the input file with any given 16-bit prefix, for all possible 16-bit prefixes, in one pass through the input file. At least one of the buckets will have been hit fewer than 2^16 times. Do a second pass to find which of the possible numbers in that bucket are used already.
If it means more than 32 bits, but still of bounded size: Do as above, ignoring all input numbers that happen to fall outside the (signed or unsigned; your choice) 32-bit range.
If "integer" means mathematical integer: Read through the input once and keep track of the length of the longest number you've seen. When you're done, output a random number that has one more digit. (One of the numbers in the file may be a bignum that takes more than 10 MB to represent exactly, but if the input is a file, then you can at least represent the length of anything that fits in it).
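For the 32-bit case, the two passes might look like the following Python sketch. `read_numbers` is a hypothetical callable that returns a fresh iterator over the file on every call, values are assumed to be unsigned, and the `bits` parameter exists only so the idea can be tried on small ranges. The second pass uses one byte per value for simplicity where a real implementation would use one bit.

```python
def find_missing_two_pass(read_numbers, bits=32):
    """Return a value in [0, 2**bits) absent from the input, using two passes."""
    low_bits = bits // 2
    bucket_size = 1 << low_bits
    counts = [0] * (1 << (bits - low_bits))
    # Pass 1: count how many inputs share each high-order prefix.
    for n in read_numbers():
        counts[n >> low_bits] += 1
    # With fewer than 2**bits inputs, some bucket must fall short of bucket_size.
    prefix = next(p for p, c in enumerate(counts) if c < bucket_size)
    # Pass 2: mark which low-order values occur inside that bucket.
    seen = bytearray(bucket_size)
    for n in read_numbers():
        if n >> low_bits == prefix:
            seen[n & (bucket_size - 1)] = 1
    return (prefix << low_bits) | seen.index(0)


data = [n for n in range(256) if n != 0xA7]
print(find_missing_two_pass(lambda: iter(data), bits=8))  # 167
```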
Statistically informed algorithms solve this problem using fewer passes than deterministic approaches.
If very large integers are allowed then one can generate a number that is likely to be unique in O(1) time. A pseudo-random 128-bit integer like a GUID will only collide with one of the existing four billion integers in the set in less than one out of every 64 billion billion billion cases.
If integers are limited to 32 bits then one can generate a number that is likely to be unique in a single pass using much less than 10 MB. The odds that a pseudo-random 32-bit integer will collide with one of the 4 billion existing integers is about 93% (4e9 / 2^32). The odds that 1000 pseudo-random integers will all collide is less than one in 12,000 billion billion billion (odds-of-one-collision ^ 1000). So if a program maintains a data structure containing 1000 pseudo-random candidates and iterates through the known integers, eliminating matches from the candidates, it is all but certain to find at least one integer that is not in the file.
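A minimal Python sketch of the candidate-elimination idea, assuming the file can be consumed as an iterable of unsigned integers. The function name, the `bits` and `k` parameters, and the injectable `rng` are all choices made for this illustration.

```python
import random


def surviving_candidates(numbers, bits=32, k=1000, rng=random):
    """Draw k random candidates and discard every one that appears in the input."""
    candidates = {rng.getrandbits(bits) for _ in range(k)}
    for n in numbers:
        candidates.discard(n)  # no-op unless n happens to be a candidate
    return candidates


# Small-scale demonstration: 200 of the 256 possible 8-bit values are "in the file".
present = set(range(200))
left = surviving_candidates(iter(present), bits=8, k=50, rng=random.Random(0))
print(all(c not in present for c in left))  # True
```

Any element of the returned set is guaranteed to be absent from the input; the probabilistic part is only whether the set ends up non-empty.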
This problem is discussed in detail in Jon Bentley, "Column 1. Cracking the Oyster", Programming Pearls, Addison-Wesley, pp. 3-10.
Bentley discusses several approaches, including external sort, merge sort using several external files, etc., but the best method Bentley suggests is a single-pass algorithm using bit fields, which he humorously calls "Wonder Sort" :)
Coming to the problem, 4 billion numbers can be represented in :
4 billion bits = (4000000000 / 8) bytes = about 0.466 GB
The code to implement the bitset is simple: (taken from solutions page )
#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];
void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); }
void clr(int i) { a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int test(int i){ return a[i>>SHIFT] & (1<<(i & MASK)); }
Bentley's algorithm makes a single pass over the file, setting the appropriate bit in the array and then examines this array using test macro above to find the missing number.
If the available memory is less than 0.466 GB, Bentley suggests a k-pass algorithm, which divides the input into ranges depending on available memory. To take a very simple example, if only 1 byte (i.e. memory to handle 8 numbers) was available and the range was from 0 to 31, we divide this into ranges of 0 to 7, 8-15, 16-23 and 24-31, and handle one range in each of 32/8 = 4 passes.
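The k-pass idea can be sketched in Python as follows. `read_numbers` is a hypothetical callable that re-opens the input for each pass, `total` is the size of the value range and `window` is how many values fit in memory at once. A bytearray stands in for the bit field to keep the sketch short.

```python
def find_missing_k_pass(read_numbers, total, window):
    """Scan [0, total) in chunks of `window` values, one pass over the input per chunk."""
    for start in range(0, total, window):
        size = min(window, total - start)
        seen = bytearray(size)  # one flag per value in the current chunk
        for n in read_numbers():
            if start <= n < start + size:
                seen[n - start] = 1
        gap = seen.find(0)  # first value in this chunk that never appeared
        if gap != -1:
            return start + gap
    return None  # every value in the range is present


data = [n for n in range(32) if n != 21]
print(find_missing_k_pass(lambda: iter(data), total=32, window=8))  # 21
```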
HTH.
Since the problem does not specify that we have to find the smallest possible number that is not in the file we could just generate a number that is longer than the input file itself. :)
For the 1 GB RAM variant you can use a bit vector. You need to allocate 4 billion bits == 500 MB byte array. For each number you read from the input, set the corresponding bit to '1'. Once you're done, iterate over the bits and find the first one that is still '0'. Its index is the answer.
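In Python, the bit-vector approach might be sketched like this. The `bits` parameter is an addition of this sketch so that it can be tried on a small range; with the default of 32 the bytearray occupies 2^29 bytes. Inputs are assumed to be unsigned.

```python
def find_missing_bitvector(numbers, bits=32):
    """Return the smallest value in [0, 2**bits) not present in `numbers`, or None."""
    bitfield = bytearray((1 << bits) // 8)
    for n in numbers:
        bitfield[n >> 3] |= 1 << (n & 7)  # set the bit that represents n
    for index, byte in enumerate(bitfield):
        if byte != 0xFF:  # at least one clear bit in this byte
            for j in range(8):
                if not byte & (1 << j):
                    return index * 8 + j
    return None


data = [n for n in range(256) if n not in (5, 77)]
print(find_missing_bitvector(data, bits=8))  # 5
```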
If they are 32-bit integers (likely from the choice of ~4 billion numbers close to 2^32), your list of 4 billion numbers will take up at most 93% of the possible integers (4 * 10^9 / 2^32). So if you create a bit-array of 2^32 bits with each bit initialized to zero (which will take up 2^29 bytes ~ 500 MB of RAM; remember a byte = 2^3 bits = 8 bits), read through your integer list and for each int set the corresponding bit-array element from 0 to 1; and then read through your bit-array and return the first bit that's still 0.
In the case where you have less RAM (~10 MB), this solution needs to be slightly modified. 10 MB ~ 83886080 bits is still enough to do a bit-array for all numbers between 0 and 83886079. So you could read through your list of ints and only record numbers that are between 0 and 83886079 in your bit array. If the numbers are randomly distributed, with overwhelming probability (it differs from 100% by about 10^-2592069) you will find a missing int. In fact, if you only choose numbers 1 to 2048 (with only 256 bytes of RAM) you'd still find a missing number an overwhelming percentage (99.99999999999999999999999999999999999999999999999999999999999995%) of the time.
But let's say instead of having about 4 billion numbers, you had something like 2^32 - 1 numbers and less than 10 MB of RAM; so any small range of ints only has a small possibility of not containing the number.
If you were guaranteed that each int in the list was unique, you could sum the numbers and subtract the sum with one number missing from the full sum (½)(2^32)(2^32 - 1) = 9223372034707292160 to find the missing int. However, if an int occurred twice this method will fail.
However, you can always divide and conquer. A naive method would be to read through the array and count the number of numbers that are in the first half (0 to 2^31-1) and second half (2^31, 2^32). Then pick the range with fewer numbers and repeat dividing that range in half. (Say there were two fewer numbers in (2^31, 2^32); then your next search would count the numbers in the ranges (2^31, 3*2^30-1) and (3*2^30, 2^32).) Keep repeating until you find a range with zero numbers and you have your answer. Should take O(lg N) ~ 32 reads through the array.
That method was inefficient. We are only using two integers in each step (or about 8 bytes of RAM with a 4 byte (32-bit) integer). A better method would be to divide into sqrt(2^32) = 2^16 = 65536 bins, each with 65536 numbers in a bin. Each bin requires 4 bytes to store its count, so you need 2^18 bytes = 256 kB. So bin 0 is (0 to 65535=2^16-1), bin 1 is (2^16=65536 to 2*2^16-1=131071), bin 2 is (2*2^16=131072 to 3*2^16-1=196607). In python you'd have something like:
import numpy as np
nums_in_bin = np.zeros(65536, dtype=np.uint32)
for N in four_billion_int_array:
    nums_in_bin[N // 65536] += 1
for bin_num, bin_count in enumerate(nums_in_bin):
    if bin_count < 65536:
        break # we have found an incomplete bin with missing ints (bin_num)
Read through the ~4 billion integer list; and count how many ints fall in each of the 2^16 bins and find an incomplete_bin that doesn't have all 65536 numbers. Then you read through the 4 billion integer list again; but this time only notice when integers are in that range; flipping a bit when you find them.
del nums_in_bin # allow gc to free old 256kB array
from bitarray import bitarray
my_bit_array = bitarray(65536) # 65536 bits = 8 kB
my_bit_array.setall(0)
for N in four_billion_int_array:
    if N // 65536 == bin_num:
        my_bit_array[N % 65536] = 1
for i, bit in enumerate(my_bit_array):
    if not bit:
        print(bin_num*65536 + i)
        break
Why make it so complicated? You ask for an integer not present in the file?
According to the rules specified, the only thing you need to store is the largest integer that you encountered so far in the file. Once the entire file has been read, return a number 1 greater than that.
There is no risk of hitting maxint or anything, because according to the rules, there is no restriction to the size of the integer or the number returned by the algorithm.
This can be solved in very little space using a variant of binary search.
Start off with the allowed range of numbers, 0 to 4294967295.
Calculate the midpoint.
Loop through the file, counting how many numbers were equal, less than or higher than the midpoint value.
If no numbers were equal, you're done. The midpoint number is the answer.
Otherwise, choose the range that had the fewest numbers and repeat from step 2 with this new range.
This will require up to 32 linear scans through the file, but it will only use a few bytes of memory for storing the range and the counts.
This is essentially the same as Henning's solution, except it uses two bins instead of 64k (2^16).
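A Python sketch of the halving approach follows. It is a slight variant of the steps above: instead of testing the midpoint for equality, it counts the entries in each half and keeps the half with fewer, which must contain a gap as long as the input holds fewer than 2^bits numbers. `read_numbers` is a hypothetical callable that returns a fresh iterator over the file for each pass.

```python
def find_missing_by_bisection(read_numbers, bits=32):
    """Return a value in [0, 2**bits) that is absent, using one pass per bit."""
    lo, hi = 0, 1 << bits  # half-open range [lo, hi) known to contain a gap
    while hi - lo > 1:
        mid = (lo + hi) // 2
        lower = upper = 0
        for n in read_numbers():
            if lo <= n < mid:
                lower += 1
            elif mid <= n < hi:
                upper += 1
        # The half with fewer entries cannot be completely filled.
        if lower <= upper:
            hi = mid
        else:
            lo = mid
    return lo


data = [n for n in range(256) if n != 200]
print(find_missing_by_bisection(lambda: iter(data), bits=8))  # 200
```

Only the range bounds and two counters are kept in memory, at the cost of `bits` passes over the input.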
EDIT Ok, this wasn't quite thought through as it assumes the integers in the file follow some static distribution. Apparently they don't need to, but even then one should try this:
There are ≈4.3 billion 32-bit integers. We don't know how they are distributed in the file, but the worst case is the one with the highest Shannon entropy: an equal distribution. In this case, the probability for any one integer to not occur in the file is
( (2³²-1)/2³² )⁴ ⁰⁰⁰ ⁰⁰⁰ ⁰⁰⁰ ≈ .4
The lower the Shannon entropy, the higher this probability gets on the average, but even for this worst case we have a chance of 90% to find a nonoccurring number after 5 guesses with random integers. Just create such numbers with a pseudorandom generator, store them in a list. Then read int after int and compare it to all of your guesses. When there's a match, remove this list entry. After having been through all of the file, chances are you will have more than one guess left. Use any of them. In the rare (10% even at worst case) event of no guess remaining, get a new set of random integers, perhaps more this time (10->99%).
Memory consumption: a few dozen bytes, complexity: O(n), overhead: negligible, as most of the time will be spent in the unavoidable hard disk accesses rather than comparing ints anyway.
The actual worst case, when we do not assume a static distribution, is that every integer occurs max. once, because then only
1 - 4000000000/2³² ≈ 7%
of all integers don't occur in the file. So you'll need some more guesses, but that still won't cost hurtful amounts of memory.
If you have one integer missing from the range [0, 2^x - 1] then just xor them all together. For example:
>>> 0 ^ 1 ^ 3
2
>>> 0 ^ 1 ^ 2 ^ 3 ^ 4 ^ 6 ^ 7
5
(I know this doesn't answer the question exactly, but it's a good answer to a very similar question.)
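As a short Python sketch of the XOR trick (the function name is illustrative): it assumes exactly one value from [0, 2^x - 1] is missing, no duplicates, and x >= 2, so that the XOR of the complete range is 0.

```python
from functools import reduce
from operator import xor


def find_single_missing(numbers):
    """XOR everything together; what remains is the one missing value."""
    return reduce(xor, numbers, 0)


print(find_single_missing([0, 1, 3]))              # 2
print(find_single_missing([0, 1, 2, 3, 4, 6, 7]))  # 5
```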
They may be looking to see if you have heard of a probabilistic Bloom Filter which can very efficiently determine absolutely if a value is not part of a large set, (but can only determine with high probability it is a member of the set.)
Based on the current wording in the original question, the simplest solution is:
Find the maximum value in the file, then add 1 to it.
Use a BitSet. 4 billion integers (assuming up to 2^32 integers) packed into a BitSet at 8 per byte is 2^32 / 2^3 = 2^29 = approx 0.5 GB.
To add a bit more detail - every time you read a number, set the corresponding bit in the BitSet. Then, do a pass over the BitSet to find the first number that's not present. In fact, you could do this just as effectively by repeatedly picking a random number and testing if it's present.
Actually BitSet.nextClearBit(0) will tell you the first non-set bit.
Looking at the BitSet API, it appears to only support 0..MAX_INT, so you may need 2 BitSets - one for +'ve numbers and one for -'ve numbers - but the memory requirements don't change.
If there is no size limit, the quickest way is to take the length of the file, and generate the length of the file+1 number of random digits (or just "11111..." s). Advantage: you don't even need to read the file, and you can minimize memory use nearly to zero. Disadvantage: You will print billions of digits.
However, if the only factor was minimizing memory usage, and nothing else is important, this would be the optimal solution. It might even get you a "worst abuse of the rules" award.
If we assume that the range of numbers will always be 2^n (an even power of 2), then exclusive-or will work (as shown by another poster). As far as why, let's prove it:
The Theory
Given any 0 based range of integers that has 2^n elements with one element missing, you can find that missing element by simply xor-ing the known values together to yield the missing number.
The Proof
Let's look at n = 2. For n=2, we can represent 4 unique integers: 0, 1, 2, 3. They have a bit pattern of:
0 - 00
1 - 01
2 - 10
3 - 11
Now, if we look, each and every bit is set exactly twice. Therefore, since it is set an even number of times, and exclusive-or of the numbers will yield 0. If a single number is missing, the exclusive-or will yield a number that when exclusive-ored with the missing number will result in 0. Therefore, the missing number, and the resulting exclusive-ored number are exactly the same. If we remove 2, the resulting xor will be 10 (or 2).
Now, let's look at n+1. Let x be the number of times each bit is set across the 2^n values, and y the number of times it is set across the 2^(n+1) values. For each of the low n bits, y = x * 2, because every n-bit pattern appears once with the new top bit set to 0 and once with it set to 1. The new top bit itself is set in exactly 2^n values, which is also even. So with n+1 bits, every bit is again set an even number of times.
Therefore, since n=2 works, and n+1 works whenever n does, the xor method will work for all values of n>=2.
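The parity argument above is easy to check empirically. The following Python sketch is illustrative only: the helper names `xor_all` and `find_missing_xor` are made up for this example, and it assumes a 0-based range of exactly 2^n values with n >= 2.

```python
from functools import reduce
from operator import xor

def xor_all(values):
    """XOR every value in the iterable together, starting from 0."""
    return reduce(xor, values, 0)

def find_missing_xor(values):
    """Return the single value missing from 0 .. 2**n - 1 (assumes n >= 2)."""
    return xor_all(values)

# Every full range 0 .. 2**n - 1 with n >= 2 XORs to 0.
for n in range(2, 12):
    assert xor_all(range(2 ** n)) == 0

# Removing 2 from 0..3 leaves 0 ^ 1 ^ 3, as in the n = 2 example.
print(find_missing_xor([0, 1, 3]))  # prints 2
```

Note that the full range for n = 1 (just 0 and 1) XORs to 1 rather than 0, which is why the base case starts at n = 2.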
The Algorithm For 0 Based Ranges
This is quite simple. It uses 2*n bits of memory, so for any range <= 32, 2 32 bit integers will work (ignoring any memory consumed by the file descriptor). And it makes a single pass of the file.
long supplied = 0;
long result = 0;
while (supplied = read_int_from_file()) {
result = result ^ supplied;
}
return result;
The Algorithm For Arbitrary Based Ranges
This algorithm will work for ranges of any starting number to any ending number, as long as the total range is equal to 2^n... This basically re-bases the range to have the minimum at 0. But it does require 2 passes through the file (the first to grab the minimum, the second to compute the missing int).
long supplied = 0;
long result = 0;
long offset = INT_MAX;
while (supplied = read_int_from_file()) {
if (supplied < offset) {
offset = supplied;
}
}
reset_file_pointer();
while (supplied = read_int_from_file()) {
result = result ^ (supplied - offset);
}
return result + offset;
Arbitrary Ranges
We can apply this modified method to arbitrary ranges, since any range can be padded up to the next power of 2. This works only if there is a single missing number. It takes 2 passes of an unsorted file, but it will find the single missing number every time:
long supplied = 0;
long result = 0;
long offset = INT_MAX;
long n = 0;
double temp;
while (supplied = read_int_from_file()) {
if (supplied < offset) {
offset = supplied;
}
}
reset_file_pointer();
while (supplied = read_int_from_file()) {
n++;
result = result ^ (supplied - offset);
}
// We need to increment n one value so that we take care of the missing
// int value
n++;
while (n == 1 || 0 != (n & (n - 1))) {
result = result ^ (n++);
}
return result + offset;
Basically, re-bases the range around 0. Then, it counts the number of unsorted values to append as it computes the exclusive-or. Then, it adds 1 to the count of unsorted values to take care of the missing value (count the missing one). Then, keep xoring the n value, incremented by 1 each time until n is a power of 2. The result is then re-based back to the original base. Done.
Here's the algorithm I tested in PHP (using an array instead of a file, but same concept):
function find($array) {
$offset = min($array);
$n = 0;
$result = 0;
foreach ($array as $value) {
$result = $result ^ ($value - $offset);
$n++;
}
$n++; // This takes care of the missing value
while ($n == 1 || 0 != ($n & ($n - 1))) {
$result = $result ^ ($n++);
}
return $result + $offset;
}
Fed in an array with any range of values (I tested including negatives) with one inside that range which is missing, it found the correct value each time.
Another Approach
Since we can use external sorting, why not just check for a gap? If we assume the file is sorted prior to the running of this algorithm:
long supplied = 0;
long last = read_int_from_file();
while (supplied = read_int_from_file()) {
if (supplied != last + 1) {
return last + 1;
}
last = supplied;
}
// The range is contiguous, so what do we do here? Let's return last + 1:
return last + 1;
Trick question, unless it's been quoted improperly. Just read through the file once to get the maximum integer n, and return n+1.
Of course you'd need a backup plan in case n+1 causes an integer overflow.
Check the size of the input file, then output any number which is too large to be represented by a file that size. This may seem like a cheap trick, but it's a creative solution to an interview problem, it neatly sidesteps the memory issue, and it's technically O(n).
void maxNum(ulong filesize)
{
ulong bitcount = filesize * 8; //number of bits in file
for (ulong i = 0; i < bitcount; i++)
{
Console.Write(9);
}
}
Should print 10^bitcount - 1, which will always be greater than 2^bitcount. Technically, the number you have to beat is 2^bitcount - (4 * 10^9 - 1), since you know there are (4 billion - 1) other integers in the file, and even with perfect compression they'll take up at least one bit each.
The simplest approach is to find the minimum number in the file, and return 1 less than that. This uses O(1) storage, and O(n) time for a file of n numbers. However, it will fail if number range is limited, which could make min-1 not-a-number.
The simple and straightforward method of using a bitmap has already been mentioned. That method uses O(n) time and storage.
A 2-pass method with 2^16 counting-buckets has also been mentioned. It reads 2*n integers, so uses O(n) time and O(1) storage, but it cannot handle datasets with more than 2^16 numbers. However, it's easily extended to (eg) 2^60 64-bit integers by running 4 passes instead of 2, and easily adapted to using tiny memory by using only as many bins as fit in memory and increasing the number of passes correspondingly, in which case run time is no longer O(n) but instead is O(n*log n).
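A minimal Python sketch of the 2-pass counting-bucket idea follows. It is not taken from any of the answers: the function name, the `read_values` callable (which must return a fresh iterator over the file's integers each time it is called) and the `bits` parameter are assumptions made for illustration, and the values are assumed to be non-negative integers below 2^bits, with fewer than 2^bits of them in total.

```python
def find_missing_two_pass(read_values, bits=32):
    """Find a value in [0, 2**bits) that read_values() never yields."""
    half = bits // 2
    buckets = 1 << half              # 2**16 buckets for 32-bit values
    low_mask = buckets - 1

    # Pass 1: count how many values fall into each high-half bucket.
    counts = [0] * buckets
    for value in read_values():
        counts[value >> half] += 1

    # A bucket holding fewer than 2**half values must be missing something.
    high = next((b for b, c in enumerate(counts) if c < buckets), None)
    if high is None:
        raise ValueError("no bucket is provably incomplete")

    # Pass 2: record which low halves occur inside that one bucket.
    seen = [False] * buckets
    for value in read_values():
        if value >> half == high:
            seen[value & low_mask] = True

    low = seen.index(False)
    return (high << half) | low
```

With the default `bits=32` this keeps two tables of 65,536 entries each, regardless of how many numbers the file contains; duplicates in the input do not break it, because an under-full bucket is still guaranteed to have a gap.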
The method of XOR'ing all the numbers together, mentioned so far by rfrankel and at length by ircmaxell answers the question asked in stackoverflow#35185, as ltn100 pointed out. It uses O(1) storage and O(n) run time. If for the moment we assume 32-bit integers, XOR has a 7% probability of producing a distinct number. Rationale: given ~ 4G distinct numbers XOR'd together, and ca. 300M not in file, the number of set bits in each bit position has equal chance of being odd or even. Thus, 2^32 numbers have equal likelihood of arising as the XOR result, of which 93% are already in file. Note that if the numbers in file aren't all distinct, the XOR method's probability of success rises.
Strip the white space and non numeric characters from the file and append 1. Your file now contains a single number not listed in the original file.
From Reddit by Carbonetc.
For some reason, as soon as I read this problem I thought of diagonalization. I'm assuming arbitrarily large integers.
Read the first number. Left-pad it with zero bits until you have 4 billion bits. If the first (high-order) bit is 0, output 1; else output 0. (You don't really have to left-pad: you just output a 1 if there are not enough bits in the number.) Do the same with the second number, except use its second bit. Continue through the file in this way. You will output a 4-billion bit number one bit at a time, and that number will not be the same as any in the file. Proof: if it were the same as the nth number, then they would agree on the nth bit, but they don't by construction.
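The diagonal construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is made up, the inputs are non-negative arbitrary-precision integers, and bits are indexed from the low-order end rather than the high-order end, which avoids the padding step but leaves the argument unchanged.

```python
def diagonal_missing(numbers):
    """Build an integer that differs from the i-th input in bit i."""
    result = 0
    for i, value in enumerate(numbers):
        # Flip the i-th bit of the i-th number and copy it into the result.
        if not (value >> i) & 1:
            result |= 1 << i
    return result
```

The result needs as many bits as there are numbers in the input, so for 4 billion inputs it is a 4-billion-bit number, but it can be emitted one bit at a time without storing anything else.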
You can use bit flags to mark whether an integer is present or not.
After traversing the entire file, scan each bit to determine if the number exists or not.
Assuming each integer is 32 bit, they will conveniently fit in 1 GB of RAM if bit flagging is done.
Just for the sake of completeness, here is another very simple solution, which will most likely take a very long time to run, but uses very little memory.
Let all possible integers be the range from int_min to int_max, and
bool isNotInFile(integer) a function which returns true if the file does not contain a certain integer and false else (by comparing that certain integer with each integer in the file)
for (integer i = int_min; i <= int_max; ++i)
{
if (isNotInFile(i)) {
return i;
}
}
For the 10 MB memory constraint:
Convert the number to its binary representation.
Create a binary tree where left = 0 and right = 1.
Insert each number in the tree using its binary representation.
If a number has already been inserted, the leaves will already have been created.
When finished, just take a path that has not been created before to create the requested number.
4 billion number = 2^32, meaning 10 MB might not be sufficient.
EDIT
An optimization is possible: if two leaves with a common parent have both been created, they can be removed and the parent flagged as not a solution. This cuts branches and reduces the need for memory.
EDIT II
There is no need to build the tree completely too. You only need to build deep branches if numbers are similar. If we cut branches too, then this solution might work in fact.
I will answer the 1 GB version:
There is not enough information in the question, so I will state some assumptions first:
The integer is 32 bits with range -2,147,483,648 to 2,147,483,647.
Pseudo-code:
var bitArray = new bit[4294967296]; // 0.5 GB, initialized to all 0s.
foreach (var number in file) {
bitArray[number + 2147483648] = 1; // Shift all numbers so they start at 0.
}
for (var i = 0; i < 4294967296; i++) {
if (bitArray[i] == 0) {
return i - 2147483648;
}
}
As long as we're doing creative answers, here is another one.
Use the external sort program to sort the input file numerically. This will work for any amount of memory you may have (it will use file storage if needed).
Read through the sorted file and output the first number that is missing.
Bit Elimination
One way is to eliminate bits; however, this might not actually yield a result (chances are it won't). Pseudocode:
long val = 0xFFFFFFFFFFFFFFFF; // (all bits set)
foreach long fileVal in file
{
val = val & ~fileVal;
if (val == 0) error;
}
Bit Counts
Keep track of the bit counts; and use the bits with the least amounts to generate a value. Again this has no guarantee of generating a correct value.
Range Logic
Keep track of a list of ranges, ordered by start. A range is defined by the structure:
struct Range
{
long Start, End; // Inclusive.
}
Range startRange = new Range { Start = 0x0, End = 0xFFFFFFFFFFFFFFFF };
Go through each value in the file and try and remove it from the current range. This method has no memory guarantees, but it should do pretty well.
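One possible way to implement the range bookkeeping is sketched below in Python. The representation (a sorted list of disjoint, inclusive `(start, end)` tuples) and the function name `remove_value` are assumptions for this example, not part of the answer.

```python
import bisect

def remove_value(ranges, value):
    """Remove `value` from a sorted list of disjoint inclusive (start, end) ranges."""
    # Locate the last range whose start is <= value.
    i = bisect.bisect_right(ranges, (value, float("inf"))) - 1
    if i < 0:
        return
    start, end = ranges[i]
    if value > end:
        return  # value was already removed, or was never in any range
    # Split the range around the removed value, dropping empty pieces.
    pieces = []
    if start < value:
        pieces.append((start, value - 1))
    if value < end:
        pieces.append((value + 1, end))
    ranges[i:i + 1] = pieces
```

After every value in the file has been removed, any number inside a remaining range is absent from the file, for example `ranges[0][0]`. The list grows by at most one entry per removed value, which is why there is no memory guarantee for adversarial input.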
2^(128*10^18) + 1 (which is (2^8)^(16*10^18) + 1): can't that be a universal answer for today? This represents a number that cannot be held in a 16 EB file, which is the maximum file size in any current file system.
I think this is a solved problem (see above), but there's an interesting side case to keep in mind because it might get asked:
If there are exactly 4,294,967,295 (2^32 - 1) 32-bit integers with no repeats, and therefore only one is missing, there is a simple solution.
Start a running total at zero, and for each integer in the file, add that integer with 32-bit overflow (effectively, runningTotal = (runningTotal + nextInteger) % 4294967296). Once complete, add 4294967296/2 to the running total, again with 32-bit overflow. Subtract this from 4294967296, and the result is the missing integer.
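A minimal Python sketch of this running-total trick, assuming every value from 0 to 2^bits - 1 appears exactly once except for the missing one. The function name and the `bits` parameter are introduced here for illustration (the parameter only exists so the idea can be tried on small ranges).

```python
def find_missing_by_sum(values, bits=32):
    """Return the one value of 0 .. 2**bits - 1 that is absent from `values`."""
    modulus = 1 << bits
    running_total = 0
    for value in values:
        # Emulate fixed-width addition that wraps around on overflow.
        running_total = (running_total + value) % modulus
    # The sum of the full range is congruent to modulus / 2 (mod modulus),
    # so the missing value is whatever brings the total up to that.
    return (modulus // 2 - running_total) % modulus
```

Subtracting the running total from 2^31 is equivalent, modulo 2^32, to the "add 4294967296/2, then subtract from 4294967296" formulation, because 2^31 and -2^31 are the same residue.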
The "only one missing integer" problem is solvable with only one run, and only 64 bits of RAM dedicated to the data (32 for the running total, 32 to read in the next integer).
Corollary: The more general specification is extremely simple to match if we aren't concerned with how many bits the integer result must have. We just generate a big enough integer that it cannot be contained in the file we're given. Again, this takes up absolutely minimal RAM. See the pseudocode.
# Grab the file size
fseek(fp, 0L, SEEK_END);
sz = ftell(fp);
# Print four '2' digits for every byte of the file.
for (c=0; c<sz; c++) {
for (b=0; b<4; b++) {
print "2";
}
}
As Ryan said it basically, sort the file and then go over the integers and when a value is skipped there you have it :)
EDIT at downvoters: the OP mentioned that the file could be sorted so this is a valid method.
If you don't assume the 32-bit constraint, just return a randomly generated 64-bit number (or 128-bit if you're a pessimist). The chance of collision is 1 in 2^64/(4*10^9) = 4611686018.4 (roughly 1 in 4 billion). You'd be right most of the time!
(Joking... kind of.)

Sort N numbers in digit order

Given a range of N numbers, e.g. [1 to 100], sort the numbers in digit order, i.e. for the numbers 1 to 100, the sorted output would be
1 10 100 11 12 13 . . . 19 2 20 21..... 99
This is just like Radix Sort but just that the digits are sorted in reversed order to what would be done in a normal Radix Sort.
I tried to store all the digits in each number as a linked list for faster operation but it results in a large Space Complexity.
I need a working algorithm for the question.
From all the answers, "Converting to Strings" is an option, but is there no other way this can be done?
Also an algorithm for Sorting Strings as mentioned above can also be given.
Use any sorting algorithm you like, but compare the numbers as strings, not as numbers. This is basically lexicographic sorting of regular numbers. Here's an example gnome sort in C:
#include <stdlib.h>
#include <string.h>
void sort(int* array, int length) {
    int* iter = array;
    char buf1[12], buf2[12];
    while (iter < array + length) {
        /* Step forward while the current element is not smaller (as a string) than the previous one. */
        if (iter == array || strcmp(itoa(*iter, buf1, 10), itoa(*(iter-1), buf2, 10)) >= 0) {
            iter++;
        } else {
            /* Out of order: swap with the previous element and step back. */
            *iter ^= *(iter-1);
            *(iter-1) ^= *iter;
            *iter ^= *(iter-1);
            iter--;
        }
    }
}
Of course, this requires the non-standard itoa function to be present in stdlib.h. A more standard alternative would be to use sprintf, but that makes the code a little more cluttered. You'd possibly be better off converting the whole array to strings first, then sort, then convert it back.
Edit: For reference, the relevant bit here is strcmp(itoa(*iter, buf1, 10), itoa(*(iter-1), buf2, 10)) >= 0, which replaces *iter >= *(iter-1).
I have a solution, though not exactly an algorithm. All you need to do is convert all the numbers to strings and sort them as strings.
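In a language whose sort accepts a key function, this is a one-liner. A small Python sketch, using the example array from the original question (the function name is made up for illustration):

```python
def sort_lexicographically(numbers):
    """Sort integers by their decimal string representation."""
    # key=str compares "12" with "100000" as strings; the output stays integers.
    return sorted(numbers, key=str)

print(sort_lexicographically([12, 2434, 23, 1, 654, 222, 56, 100000]))
# prints [1, 100000, 12, 222, 23, 2434, 56, 654]
```

The key function is called once per element, so each number is converted to a string only once rather than on every comparison.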
Here is how you can do it with a recursive function (the code is in Java):
void doOperation(List<Integer> list, int prefix, int minimum, int maximum) {
for (int i = 0; i <= 9; i++) {
int newNumber = prefix * 10 + i;
if (newNumber >= minimum && newNumber <= maximum) {
list.add(newNumber);
}
if (newNumber > 0 && newNumber <= maximum) {
doOperation(list, newNumber, minimum, maximum);
}
}
}
You call it like this:
List<Integer> numberList = new ArrayList<Integer>();
int min=1, max =100;
doOperation(numberList, 0, min, max);
System.out.println(numberList.toString());
EDIT:
I translated my code in C++ here:
#include <stdio.h>
void doOperation(int list[], int &index, int prefix, int minimum, int maximum) {
for (int i = 0; i <= 9; i++) {
int newNumber = prefix * 10 + i;
if (newNumber >= minimum && newNumber <= maximum) {
list[index++] = newNumber;
}
if (newNumber > 0 && newNumber <= maximum) {
doOperation(list, index, newNumber, minimum, maximum);
}
}
}
int main(void) {
int min=1, max =100;
int* numberList = new int[max-min+1];
int index = 0;
doOperation(numberList, index, 0, min, max);
printf("[");
for(int i=0; i<max-min+1; i++) {
printf("%d ", numberList[i]);
}
printf("]");
return 0;
}
Basically, the idea is: for each digit (0-9), I add it to the array if it is between minimum and maximum. Then, I call the same function with this digit as prefix. It does the same: for each digit, it adds it to the prefix (prefix * 10 + i) and if it is between the limits, it adds it to the array. It stops when newNumber is greater than maximum.
I think if you convert the numbers to strings, you can use string comparison to sort them.
You can use any sorting algorithm for it.
"1" < "10" < "100" < "11" ...
Optimize the way you are storing the numbers: use a binary-coded decimal (BCD) type that gives simple access to a specific digit. Then you can use your current algorithm, which Steve Jessop correctly identified as most significant digit radix sort.
I tried to store all the digits in each number as a linked list for faster operation but it results in a large Space Complexity.
Storing each digit in a linked list wastes space in two different ways:
A digit (0-9) only requires 4 bits of memory to store, but you are probably using anywhere from 8 to 64 bits. A char takes 8 bits, a short typically takes 16, and an int can take up to 64 bits. That's using 2X to 16X more memory than the optimal solution!
Linked lists add additional unneeded memory overhead. For each digit, you need an additional 32 to 64 bits to store the memory address of the next link. Again, this increases the memory required per digit by 8X to 16X.
A more memory-efficient solution stores BCD digits contiguously in memory:
BCD only uses 4 bits per digit.
Store the digits in a contiguous memory block, like an array. This eliminates the need to store memory addresses. You don't need linked lists' ability to easily insert/delete from the middle. If you need the ability to grow the numbers to an unknown length, there are other abstract data types that allow that with much less overhead. For example, a vector.
One option, if other operations like addition/multiplication are not important, is to allocate enough memory to store each BCD digit plus one BCD terminator. The BCD terminator can be any combination of 4 bits that is not used to represent a BCD digit (like binary 1111). Storing this way will make other operations like addition and multiplication trickier, though.
Note this is very similar to the idea of converting to strings and lexicographically sorting those strings. Integers are internally stored as binary (base 2) in the computer. Storing in BCD is more like base 10 (base 16, actually, but 6 combinations are ignored), and strings are like base 256. Strings will use about twice as much memory, but there are already efficient functions written to sort strings. BCD's will probably require developing a custom BCD type for your needs.
Edit: I missed that it's a contiguous range. That being the case, all the answers which talk about sorting an array are wrong (including your idea stated in the question that it's like a radix sort), and True Soft's answer is right.
just like Radix Sort but just that the digits are sorted in reversed order
Well spotted :-) If you actually do it that way, funnily enough, it's called an MSD radix sort.
http://en.wikipedia.org/wiki/Radix_sort#Most_significant_digit_radix_sorts
You can implement one very simply, or with a lot of high technology and fanfare. In most programming languages, your particular example faces a slight difficulty. Extracting decimal digits from the natural storage format of an integer, isn't an especially fast operation. You can ignore this and see how long it ends up taking (recommended), or you can add yet more fanfare by converting all the numbers to decimal strings before sorting.
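For illustration, here is a short Python sketch of an MSD radix sort over decimal strings, which is the "convert all the numbers to decimal strings before sorting" variant. It is not the answer's own code; the function names are invented, and it assumes non-negative integers.

```python
def msd_radix_sort(numbers):
    """Sort non-negative integers in digit (lexicographic) order."""
    def sort_strings(strings, pos):
        if len(strings) <= 1:
            return strings
        # Strings with no digit at `pos` are exhausted; they sort first.
        result = [s for s in strings if len(s) == pos]
        # Distribute the rest into ten buckets by the digit at `pos`.
        buckets = [[] for _ in range(10)]
        for s in strings:
            if len(s) > pos:
                buckets[int(s[pos])].append(s)
        # Recurse into each bucket on the next, less significant digit.
        for bucket in buckets:
            result.extend(sort_strings(bucket, pos + 1))
        return result

    return [int(s) for s in sort_strings([str(n) for n in numbers], 0)]
```

Because shorter strings are emitted before any longer string sharing their prefix, no padding is needed to handle numbers of different lengths.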
Of course you don't have to implement it as a radix sort: you could use a comparison sort algorithm with an appropriate comparator. For example in C, the following is suitable for use with qsort (unless I've messed it up):
int lex_compare(const void *a, const void *b) {
    char a_str[12]; // assuming 32bit int
    char b_str[12];
    sprintf(a_str, "%d", *(const int*)a);
    sprintf(b_str, "%d", *(const int*)b);
    return strcmp(a_str,b_str);
}
Not terribly efficient, since it does a lot of repeated work, but straightforward.
If you do not want to convert them to strings, but have enough space to store an extra copy of the list, I would store in the copy the largest power of ten that does not exceed each element. This is probably easiest to do with a loop. Now call your original array x and the powers of ten y.
int findPower(int x) {
    int y = 1;
    // Compare against x / 10 so that y * 10 can never overflow.
    while (y <= x / 10) {
        y = y * 10;
    }
    return y;
}
You could also compute them directly
y = exp10(floor(log10(x)));
but I suspect that the iteration may be faster than the conversions to and from floating point.
In order to compare the ith and jth elements
bool compare(int i, int j) {
if (y[i] < y[j]) {
int ti = x[i] * (y[j] / y[i]);
if (ti == x[j]) {
return (y[i] < y[j]); // the compiler will optimize this
} else {
return (ti < x[j]);
}
} else if (y[i] > y[j]) {
int tj = x[j] * (y[i] / y[j]);
if (x[i] == tj) {
return (y[i] < y[j]); // the compiler will optimize this
} else {
return (x[i] < tj);
}
} else {
return (x[i] < x[j]);
}
}
What is being done here is we are multiplying the smaller number by the appropriate power of ten to make the two numbers have an equal number of digits, then comparing them. if the two modified numbers are equal, then compare the digit lengths.
If you do not have the space to store the y arrays you can compute them on each comparison.
In general, you are likely better off using the preoptimized digit conversion routines.
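The variant that recomputes the powers of ten on each comparison can be sketched in Python as follows. This is an illustrative translation, not the answer's code: the names are made up, and it assumes positive integers.

```python
from functools import cmp_to_key

def power_of_ten(x):
    """Largest power of ten that does not exceed x (assumes x >= 1)."""
    y = 1
    while y * 10 <= x:
        y *= 10
    return y

def digit_order_cmp(a, b):
    """Three-way comparison of a and b in digit (lexicographic) order."""
    ya, yb = power_of_ten(a), power_of_ten(b)
    # Scale the shorter number up so both have the same number of digits.
    sa = a * (yb // ya) if ya < yb else a
    sb = b * (ya // yb) if yb < ya else b
    if sa != sb:
        return -1 if sa < sb else 1
    # Equal after scaling: the number with fewer digits comes first.
    return (ya > yb) - (ya < yb)

def sort_digit_order(numbers):
    return sorted(numbers, key=cmp_to_key(digit_order_cmp))
```

Python integers do not overflow, so the scaling multiplication is safe here; in C or Java the scaled value would need a wider type than the inputs.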

Sorting numbers from 1 to 999,999,999 in words as strings

Interesting programming puzzle:
If the integers from 1 to 999,999,999
are written as words, sorted
alphabetically, and concatenated, what
is the 51 billionth letter?
To be precise: if the integers from 1
to 999,999,999 are expressed in words
(omitting spaces, ‘and’, and
punctuation - see note below for format), and sorted
alphabetically so that the first six
integers are
eight
eighteen
eighteenmillion
eighteenmillioneight
eighteenmillioneighteen
eighteenmillioneighteenthousand
and the last is
twothousandtwohundredtwo
then reading top to bottom, left to
right, the 28th letter completes the
spelling of the integer
“eighteenmillion”.
The 51 billionth letter also completes
the spelling of an integer. Which one,
and what is the sum of all the
integers to that point?
Note: For example, 911,610,034 is
written
“ninehundredelevenmillionsixhundredtenthousandthirtyfour”;
500,000,000 is written
“fivehundredmillion”; 1,709 is written
“onethousandsevenhundrednine”.
I stumbled across this on a programming blog 'Occasionally Sane', and couldn't think of a neat way of doing it, the author of the relevant post says his initial attempt ate through 1.5GB of memory in 10 minutes, and he'd only made it up to 20,000,000 ("twentymillion").
Can anyone share with the group a novel/clever approach to this?
Edit: Solved!
You can create a generator that outputs the numbers in sorted order. There are a few rules for comparing concatenated strings that I think most of us know implicitly:
a < a+b, where b is non-null.
a+b < a+c, where b < c.
a+b < c+d, where a < c, and a is not a prefix of c.
If you start with a sorted list of the first 1000 numbers, you can easily generate the rest by appending "thousand" or "million" and concatenating another group of 1000.
Here's the full code, in Python:
import heapq
first_thousand=[('', 0), ('one', 1), ('two', 2), ('three', 3), ('four', 4),
                ('five', 5), ('six', 6), ('seven', 7), ('eight', 8),
                ('nine', 9), ('ten', 10), ('eleven', 11), ('twelve', 12),
                ('thirteen', 13), ('fourteen', 14), ('fifteen', 15),
                ('sixteen', 16), ('seventeen', 17), ('eighteen', 18),
                ('nineteen', 19)]
tens_name = (None, 'ten', 'twenty', 'thirty', 'forty', 'fifty', 'sixty',
             'seventy','eighty','ninety')
# Integer division (//) keeps the list indices whole numbers.
for number in range(20, 100):
    name = tens_name[number//10] + first_thousand[number%10][0]
    first_thousand.append((name, number))
for number in range(100, 1000):
    name = first_thousand[number//100][0] + 'hundred' + first_thousand[number%100][0]
    first_thousand.append((name, number))
first_thousand.sort()
def make_sequence(base_generator, suffix, multiplier):
    prefix_list = [(name+suffix, number*multiplier)
                   for name, number in first_thousand[1:]]
    prefix_list.sort()
    for prefix_name, base_number in prefix_list:
        for name, number in base_generator():
            yield prefix_name + name, base_number + number
    return
def thousand_sequence():
    for name, number in first_thousand:
        yield name, number
    return
def million_sequence():
    return heapq.merge(first_thousand,
                       make_sequence(thousand_sequence, 'thousand', 1000))
def billion_sequence():
    return heapq.merge(million_sequence(),
                       make_sequence(million_sequence, 'million', 1000000))
def solve(stopping_size = 51000000000):
    total_chars = 0
    total_sum = 0
    for name, number in billion_sequence():
        total_chars += len(name)
        total_sum += number
        if total_chars >= stopping_size:
            break
    return total_chars, total_sum, name, number
It took a while to run, about an hour. The 51 billionth character is the last character of sixhundredseventysixmillionsevenhundredfortysixthousandfivehundredseventyfive, and the sum of the integers to that point is 413,540,008,163,475,743.
I'd sort the names of the first 20 integers and the names of the tens, hundreds and thousands, work out how many numbers start with each of those, and go from there.
For example, the first few are [ eight, eighteen, eighthundred, eightmillion, eightthousand, eighty, eleven, ....
The numbers starting with "eight" are 8. With "eighthundred", 800-899, 800,000-899,999, 800,000,000-899,999,999. And so on.
The number of letters in the concatenation of words for 0 ( represented by the empty string ) to 99 can be found and totalled; this can be multiplied with "thousand"=8 or "million"=7 added for higher ranges. The value for 800-899 will be 100 times the length of "eighthundred" plus the length of 0-99. And so on.
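The length bookkeeping described above can be checked with a short Python sketch. The helper names and tables below are assumptions written for this example; zero is represented by the empty string, as in the answer.

```python
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def name_below_thousand(n):
    """Spell out 0..999 without spaces; 0 becomes the empty string."""
    hundreds, rest = divmod(n, 100)
    text = ONES[hundreds] + "hundred" if hundreds else ""
    if rest < 20:
        return text + ONES[rest]
    return text + TENS[rest // 10] + ONES[rest % 10]

def total_letters(first, last):
    """Total number of letters in the names of first..last inclusive."""
    return sum(len(name_below_thousand(n)) for n in range(first, last + 1))

# Letters in 800..899 = 100 copies of "eighthundred" plus the letters of 0..99.
assert total_letters(800, 899) == 100 * len("eighthundred") + total_letters(0, 99)
```

The same identity, applied with "thousand" and "million" as suffixes, is what lets a whole block of numbers be skipped by arithmetic alone instead of being enumerated.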
This guy has a solution to the puzzle written in Haskell. Apparently Michael Borgwardt was right about using a Trie for finding the solution.
Those strings are going to have lots and lots of common prefixes - perfect use case for a trie, which would drastically reduce memory usage and probably also running time.
Here's my python solution that prints out the correct answer in a fraction of a second. I'm not a python programmer generally, so apologies for any egregious code style errors.
#!/usr/bin/env python
import sys
ONES=[
    "", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen","eighteen", "nineteen",
]
TENS=[
    "zero", "ten", "twenty", "thirty", "forty",
    "fifty", "sixty", "seventy", "eighty", "ninety",
]
def to_s_h(i):
    if(i<20):
        return(ONES[i])
    return(TENS[i//10] + ONES[i%10])
def to_s_t(i):
    if(i<100):
        return(to_s_h(i))
    return(ONES[i//100] + "hundred" + to_s_h(i%100))
def to_s_m(i):
    if(i<1000):
        return(to_s_t(i))
    return(to_s_t(i//1000) + "thousand" + to_s_t(i%1000))
def to_s_b(i):
    if(i<1000000):
        return(to_s_m(i))
    return(to_s_m(i//1000000) + "million" + to_s_m(i%1000000))
def try_string(s,t):
    global letters_to_go,word_sum
    l=len(s)
    letters_to_go -= l
    word_sum += t
    if(letters_to_go == 0):
        print("solved: " + s)
        print("sum is: " + str(word_sum))
        sys.exit(0)
    elif(letters_to_go < 0):
        print("failed: " + s + " " + str(letters_to_go))
        sys.exit(-1)
def solve(depth,prefix,prefix_num):
    global millions,thousands,ones,letters_to_go,onelen,thousandlen,word_sum
    src=[ millions,thousands,ones ][depth]
    for x in src:
        num=prefix + x[2]
        nn=prefix_num+x[1]
        try_string(num,nn)
        if(x[0] == 0):
            continue
        if(x[0] == 1):
            stl=(len(num) * 999) + onelen
            ss=(nn*999) + onesum
        else:
            stl=(len(num) * 999999) + thousandlen + onelen*999
            ss=(nn*999999) + thousandsum
        if(stl < letters_to_go):
            letters_to_go -= stl
            word_sum += ss
        else:
            solve(depth+1,num,nn)
ones=[]
thousands=[]
millions=[]
onelen=0
thousandlen=0
onesum=(999*1000)//2
thousandsum=(999999*1000000)//2
for x in range(1,1000):
    s=to_s_b(x)
    l=len(s)
    ones.append( (0,x,s) )
    onelen += l
    thousands.append( (0,x,s) )
    thousands.append( (1,x*1000,s + "thousand") )
    thousandlen += l + (l+len("thousand"))*1000
    millions.append( (0,x,s) )
    millions.append( (1,x*1000,s + "thousand") )
    millions.append( (2,x*1000000,s + "million") )
ones.sort(key=lambda x: x[2])
thousands.sort(key=lambda x: x[2])
millions.sort(key=lambda x: x[2])
letters_to_go=51000000000
word_sum=0
solve(0,"",0)
It works by precomputing the length of the numbers from 1..999 and 1..999999 so that it can skip entire subtrees unless it knows that the answer lies somewhere within them.
(The first attempt at this is wrong, but I will leave it up since it's more useful to see mistakes on the way to solving something rather than just the final answer.)
I would first generate the strings from 0 to 999 and store them into an array called thousandsStrings. The 0 element is "", and "" represents a blank in the lists below.
The thousandsString setup uses the following:
Units: "" one two three ... nine
Teens: ten eleven twelve ... nineteen
Tens: "" "" twenty thirty forty ... ninety
The thousandsString setup is something like this:
thousandsString[0] = ""
for (i in 1..9)
thousandsString[i] = Units[i]
end
for (i in 10..19)
thousandsString[i] = Teens[i-10]
end
for (i in 20..99)
thousandsString[i] = Tens[i/10] + Units[i%10]
end
for (i in 100..999)
thousandsString[i] = Units[i/100] + "hundred" + thousandsString[i%100]
end
Then, I would sort that array alphabetically.
Then, assuming t1 t2 t3 are strings taken from thousandsString, all of the strings have the form
t1
OR
t1 + million + t2 + thousand + t3
OR
t1 + thousand + t2
To output them in the proper order, I would process the individual strings, followed by the millions strings followed by the string + thousands strings.
foreach (t1 in thousandsStrings)
if (t1 == "")
continue;
process(t1)
foreach (t2 in thousandsStrings)
foreach (t3 in thousandsStrings)
process (t1 + "million" + t2 + "thousand" + t3)
end
end
foreach (t2 in thousandsStrings)
process (t1 + "thousand" + t2)
end
end
where process means store the previous sum length and then add the new string length to the sum and if the new sum is >= your target sum, you spit out the results, and maybe return or break out of the loops, whatever makes you happy.
=====================================================================
Second attempt, the other answers were right that you need to use 3k strings instead of 1k strings as a base.
Start with the thousandsString from above, but drop the blank "" for zero. That leaves 999 elements and call this uStr (units string).
Create two more sets:
tStr = the set of all uStr + "thousand"
mStr = the set of all uStr + "million"
Now create two more set unions:
mtuStr = mStr union tStr union uStr
tuStr = tStr union uStr
Order uStr, tuStr, mtuStr
Now the looping and logic here are a bit different than before.
foreach (s1 in mtuStr)
process(s1)
// If this is a millions or thousands string, add the extra strings that can
// go after the millions or thousands parts.
if (s1.contains("million"))
foreach (s2 in tuStr)
process (s1+s2)
if (s2.contains("thousand"))
foreach (s3 in uStr)
process (s1+s2+s3)
end
end
end
end
if (s1.contains("thousand"))
foreach (s2 in uStr)
process (s1+s2)
end
end
end
What I did:
1) Iterate through 1 - 999 and generate the words for each of these.
As we generate:
2) Create 3 data structures where each node has a pointer to children and each node has a character value, and a pointer to Siblings. (A binary tree, in fact, but we don't want to think of it that way necessarily - for me it's easier to conceptualise as a list of siblings with lists of children hanging off, but if you think about it {draw a pic} you'll realise it is in fact a Binary Tree).
These 3 data structures are created concurrently as follows:
a) first one with the word as generated (ie 1-999 sorted alphabetically)
b) all the values in the first + all the values with 'thousand' appended (ie 1-999 and 1,000 - 999,000 (step 1000) (1998 values in total)
c) all the values in B + all the values in a with million appended (2997 values in total)
3) For every leaf node in(b) add a Child as (a). For every leaf node in (c) add a child as (b).
4) Traverse the tree, counting how many characters we pass and stopping at 51 Billion.
NOTE: This doesn't sum the values (I didn't read that bit when I originally did it), and runs in just over 3 minutes (about 192 secs usually, using c++).
NOTE 2: (in case it isn't obvious) there are only 5,994 values stored, but they are stored in such a way that there are a billion paths through the tree
I did this about a year or two ago when I stumbled across it, and have since realised there are many optimisations (the most time consuming bit is traversing the tree - by a LONG WAY). There are a few optimisations that I think would significantly improve this approach, but I could never be bothered taking it further, other than to optimise redundant nodes in the tree slightly, so they stored strings rather than characters.
I have seen people claim on line that they've solved it in less than 5 seconds....
weird but fun idea.
Build a sparse list of the lengths of the numbers from 0 to 9, then 10-90 by tens, then 100, 1000, etc., up to a billion; the indexes are the values of the integer parts whose lengths are stored.
Write a function to calculate the length of a number's string using the table
(breaking the number into its parts and looking up the lengths of the parts, never actually creating a string).
Then you're only doing math as you traverse the numbers, calculating the length from the table and afterward summing for your sum.
With the sum, and the value of the final integer, figure out the integer that's being spelled, and voilà, you're done.
Yes, me again, but a completely different approach.
Simply, rather than storing the "onethousandeleventyseven" words, you write the sort to use that when comparing.
Crude java POC:
import java.util.Arrays;
import java.util.Comparator;

public class BillionsBillions implements Comparator<Integer> {
    // Compare two numbers by their spelled-out form without storing the words.
    public int compare(Integer a, Integer b) {
        String w1 = numberToWords(a);
        String w2 = numberToWords(b);
        return w1.compareTo(w2);
    }
    public static void main(String[] args) {
        // Arrays.sort only accepts a Comparator for object arrays, hence Integer[].
        Integer[] numbers = new Integer[1000000000]; // Bring your 64 bit JVMs
        for (int i = 0; i < 1000000000; i++) {
            numbers[i] = i;
        }
        Arrays.sort(numbers, 0, numbers.length, new BillionsBillions());
        long total = 0;
        for (int n : numbers) {
            String w = numberToWords(n);
            int len = w.length();
            long offset = total + len - 51000000000L;
            if (offset >= 0) {
                // The target letter sits `offset` characters before the end of this word.
                System.out.println(w.charAt((int) (len - 1 - offset)));
                break;
            }
            total += len; // keep the running character count
        }
    }
}
"numberToWords" is left as an exercise for the reader.
Do you need to save the entire string in memory?
If not, just save how many characters you've appended so far. For each iteration, you check the length of the next number's textual representation. If adding it reaches or passes the nth letter you are looking for, the letter must be in that string, so extract it by its index, print it, and stop execution. Otherwise, add the string length to the character count and move to the next number.
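A minimal Python sketch of that running-count idea, assuming the words arrive already in sorted order from some generator. The function name `nth_char` and the tiny word list are made up for illustration.

```python
from typing import Iterable, Optional


def nth_char(words: Iterable[str], n: int) -> Optional[str]:
    """Return the n-th character (1-based) of the concatenation of `words`
    without ever building the concatenated string."""
    seen = 0  # characters consumed so far
    for word in words:
        if seen + len(word) >= n:
            # The target falls inside this word: index relative to its start.
            return word[n - seen - 1]
        seen += len(word)
    return None  # the concatenation is shorter than n characters


print(nth_char(["eight", "eighteen", "eighty"], 7))
```

Only one integer of state is kept, so memory use does not depend on how many words are consumed.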
All the strings are going to start with either one, ten, two, twenty, three, thirty, four, etc so I'd start with figuring out how many are in each of the buckets. Then you should at least know which bucket you need to look closer at.
Then I'd look at subdividing the buckets further based on the possible prefixes. For example, within ninehundred, you are going to have all the same buckets that you had to start off with, just for numbers starting with 900.
The question is about efficient data storage, not string manipulation. Create an enum to represent the words. The words should appear in sorted order so that when it comes time to sort it is a simple-ish compare. Now generate the list and sort. Use the fact that you know how long each word is, in conjunction with the enum, to add up to the character you need.
Code wins...
#!/bin/bash
n=0
while [ $n -lt 1000000000 ]; do
number -l $n | sed -e 's/[^a-z]//g'
let n=n+1
done | sort > /tmp/bignumbers
awk '
BEGIN {
total = 0;
}
{
l = length($0);
offset = total + l - 51000000000;
print total " " offset
if (offset >= 0) {
c = substr($0, l - offset, 1);
print c;
exit;
}
total = total + l;
}' /tmp/bignumbers
Tested for a much smaller range ;-). Requires a LOT of diskspace, a compressed filesystem would be, umm, valuable, but not so much memory.
Sort has options to compress work files as well, and you could toss in gzip to directly compress data.
Not the zippiest solution.
But it does work.
Honestly I would let an RDBMS like SQL Server or Oracle do the work for me.
Insert the billion strings into an indexed table.
Compute a string length column.
Start pulling off the top X records at a time with a SUM, until I get to 51 billion.
Might beat up the server for a while as it would need to do a lot of Disk IO, but overall I think I could find an answer faster than someone who would write a program to do it.
Sometimes just getting it done is what the client really wants, and they couldn't care less what fancy design pattern or data structure you used.
Figure out the lengths for 1-999, and include the length for 0 as 0.
So now you have an array for 0-999, namely uint32 sizes999[1000];
(not going to get into the details of generating this)
You also need an array of a thousand last letters, last_letters999[1000]
(again not going into the details of generating this, as it is even easier: exact hundreds end in d, exact tens end in y except 10, which ends in n, and the others cycle through the last letters of "one" through "nine"; zero is irrelevant)
uint32 sizes999[1000];       // letter counts for 0-999, filled in elsewhere
uint64 totallen = 0;
uint32 strlen_million = strlen("million");
uint32 strlen_thousand = strlen("thousand");
uint32 i, j, k;              // declared here so they are still visible after done:
for (i = 0; i < 1000; ++i) {
    for (j = 0; j < 1000; ++j) {
        for (k = 0; k < 1000; ++k) {
            // only count "million"/"thousand" when that group is non-zero
            totallen += sizes999[i] + (i ? strlen_million : 0) +
                        sizes999[j] + (j ? strlen_thousand : 0) +
                        sizes999[k];
            if (totallen == 51000000000ULL) goto done;
            ASSERT(totallen < 51000000000ULL); //he claimed 51000000000 was not intermediate
        }
    }
}
done:
//now use i j k to get the last letter by using last_letters999
//think of i,j,k as digits base 1000
//if k == 0 and j == 0 then the letter is the n of million
//if only k == 0 then the letter is the d of thousand
//otherwise use the array of last_letters999 since
//the units digit base 1000, that is k, is not zero
//for the sum of the numbers, i,j,k are the digits of the number base 1000, so
uint64 n = (uint64)i*1000000 + j*1000 + k;
//represent the number and use
uint64 sum = n*(n+1)/2;
If you need to do it for a number other than 51000000000, then also calculate sums_sizes999 and use that in the natural way.
total memory: O(1000);
total time: O(n) where n is the number
This is what I'd do:
Create an array of 2,997 strings: "one" through "ninehundredninetynine", "onethousand" through "ninehundredninetyninethousand", and "onemillion" through "ninehundredninetyninemillion".
Store the following about each string: length (this can be calculated of course), the integer value represented by the string, and some enum to signify whether it's "ones", "thousands", or "millions".
Sort the 2,997 strings alphabetically.
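The three steps above can be sketched in Python as follows. The spelling convention (no spaces, hyphens or "and") is an assumption taken from the example strings, and the names `words_below_1000` and `build_tokens` are made up for this sketch.

```python
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def words_below_1000(n: int) -> str:
    """Spell 1..999 with no spaces, hyphens or "and" (assumed convention)."""
    hundreds, rest = divmod(n, 100)
    out = ONES[hundreds] + "hundred" if hundreds else ""
    if rest < 20:
        return out + ONES[rest]
    return out + TENS[rest // 10] + ONES[rest % 10]


def build_tokens():
    """Return (text, value, scale) tuples sorted alphabetically by text."""
    tokens = []
    for n in range(1, 1000):
        base = words_below_1000(n)
        tokens.append((base, n, "ones"))
        tokens.append((base + "thousand", n * 1000, "thousands"))
        tokens.append((base + "million", n * 1000000, "millions"))
    tokens.sort()
    return tokens


tokens = build_tokens()
print(len(tokens), tokens[0][0], tokens[-1][0])
```

The length of each token is just `len(text)`, so it does not need to be stored separately here.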
With this array created, it's straightforward to find all 999,999,999 strings in order alphabetically based on the following observations:
Nothing can follow a "ones" string
Either nothing, or a "ones" string, can follow a "thousands" string
Either nothing, a "ones" string, a "thousands" string, or a "thousands" string then a "ones" string, can follow a "millions" string.
Constructing the words basically involves creating one- to three-letter "words" based on these 2,997 tokens, making sure that the order of the tokens makes a valid number according to the rules above. Given a particular "word", the next "word" is found like this:
Lengthen the "word" by adding the token first alphabetically, if possible.
If this can't be done, advance the rightmost token to the next one alphabetically, if possible.
If this too is not possible, then remove the rightmost token, and advance the second-rightmost token to the next one alphabetically, if possible.
If this too is not possible, you're done.
At each step you can calculate the total length of the string and the sum of the numbers by just keeping two running totals.
It's important to note that there is a lot of overlapping and double counting if you iterate over all 100 billion possible numbers. It's important to realize that the number of strings that start with "eight" is the same as the number of strings that start with "nine" or "seven" or "six", etc.
To me, this begs for a dynamic programming solution where the number of strings for tens, hundreds, thousands, etc. are calculated and stored in some type of lookup table. Of course, there will be special cases for one vs eleven, two vs twelve, etc.
I'll update this if I can get a quick running solution.
WRONG!!!!!!!!! I READ THE PROBLEM WRONG. I thought it meant "what's the last letter of the alphabetically last number"
what's wrong with:
public class Nums {
    // if overflows happen, switch to a BigDecimal or something
    // with arbitrary precision
    public static void main(String[] args) {
        System.out.println("last letter: " + lastLetter(1L, 51000000L));
        System.out.println("sum: " + sum(1L, 51000000L));
    }
    static char lastLetter(long start, long end) {
        String last = toWord(start);
        for (long i = start; i < end; i++) {
            String current = toWord(i);
            // compareTo returns a positive value when current sorts after last
            if (current.compareTo(last) > 0)
                last = current;
        }
        return last.charAt(last.length() - 1);
    }
    static String toWord(long num) {
        // should be relatively easy, but for now ...
        return "one";
    }
    static long sum(long first, long n) {
        return (n * first + n * n) / 2;
    }
}
haven't actually tried this :/ LOL
You have one billion numbers and 51 billion characters - there's a good chance that this is a trick question, as there are an average of 51 characters per number. Sum up the conversions of all the numbers and see if it adds up to 51 billion.
Edit: It adds up to 70,305,000,000 characters, so this is the wrong answer.
I solved this in Java sometime in 2008 as part of an application to work at ITA Software.
The code is long, and it now being three years later, I look at it with a bit of horror... So I'm not going to post it.
But I'll post quotes from some notes that I included with the application.
The problem with this puzzle is of course the size. The naïve approach would be to sort the list in word number order and then to iterate through the sorted list counting characters and summing. With a list of size 999,999,999 this would of course take a rather long time and the sort could likely not be done in memory.
But there are natural patterns in the ordering which allow shortcuts.
Immediately following any entry (say the number is X) ending in “million” will come 999,999 entries starting with the same text, representing all the numbers from X + 1 to X + 10^6 - 1.
The sum of all these numbers can be computed by a classic formula (an “arithmetic series”), and the character count can be computed by a similarly simple formula based on the prefix (X above) and a once-computed character count for the numbers from 1 to 999,999. Both depend only on the “millions” part of the number at the base of the range. Thus if the character count for the entire range will keep the entire count below the search goal, the individual entries need not be traversed.
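A small Python sketch of the two shortcut formulas for a "million" block, as I read the notes above. The function names are invented here, and `chars_1_to_999999` stands for the once-computed character count, which is assumed to be available rather than computed in this sketch.

```python
def block_sum(x: int) -> int:
    """Sum of the 999,999 numbers x+1 .. x+10**6-1 (x is a multiple of 10**6)."""
    count = 10**6 - 1
    # Arithmetic series: each entry is x plus one of 1..999,999.
    return count * x + count * (count + 1) // 2


def block_chars(prefix_len: int, chars_1_to_999999: int) -> int:
    """Characters in the same block: every entry repeats the prefix text,
    followed by the spelling of one of 1..999,999."""
    return (10**6 - 1) * prefix_len + chars_1_to_999999


# Cross-check the closed form against a brute-force sum for one block.
print(block_sum(5000000) == sum(range(5000001, 6000000)))
```

If the running character count plus `block_chars(...)` stays below the target, the whole block can be skipped in one step.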
Similar shortcuts apply for “thousand”, and indeed could be applied to “hundred” or “billion” though I didn’t bother with shortcuts at the hundreds level and the billions level is out of range for this problem.
In order to apply these shortcuts, my code creates and sorts a list of 2997 objects representing the numbers:
1 to 999 stepping by 1
1000 to 999000 stepping by 1000
1000000 to 999000000 stepping by 1000000
The code iterates through this list, accumulating sums and character counts, recursively creating, sorting and traversing similar but smaller lists as needed.
Explicit counting and adding is only needed near the end.
I didn't get the job, but later used the code as a "code sample" for another job, which I did get.
The Java code using these techniques for skipping much of the explicit counting and adding runs in about 8 seconds.

Convert string to integer (not atoi!)

I want to be able to take, as input, a character pointer to a number in base 2 through 16 and, as a second parameter, what base the number is in, and then convert that to its representation in base 2. The integer can be of arbitrary length. My solution now does what the atoi() function does, but I was curious, purely out of academic interest, whether a lookup table solution is possible.
I have found that this is simple for binary, octal, and hexadecimal. I can simply use a lookup table for each digit to get a series of bits. For instance:
0xF1E ---> (F = 1111) (1 = 0001) (E = 1110) ---> 111100011110
0766 ---> (7 = 111) (6 = 110) (6 = 110) ---> 111110110
1000 ---> ??? ---> 1111101000
However, my problem is that I want to do this look up table method for odd bases, like base 10. I know that I could write the algorithm like atoi does and do a bunch of multiplies and adds, but for this specific problem I'm trying to see if I can do it with a look up table. It's definitely not so obvious with base 10, though. I was curious if anyone had any clever way to figure out how to generate a generic look up table for Base X -> Base 2. I know that for base 10, you can't just give it one digit at a time, so the solution would likely have to lookup a group of digits at a time.
I am aware of the multiply and add solution but since these are arbitrary length numbers, the multiply and add operations are not free so I'd like to avoid them, if at all possible.
You will have to use a look up table with an input width of m base b symbols returning n bits so that
n = log2(b) * m
for positive integers b, n and m. So if b is not a power of two, there will be no (simple) look up table solution.
I do not think that there is a solution. The following example with base 10 illustrates why.
65536 = 1 0000 0000 0000 0000
Changing the last digit from 6 to 5 will flip all bits.
65535 = 0 1111 1111 1111 1111
And almost the same will hold if you process the input starting from the end. Changing the first digit from 6 to 5 flips a significant number of bits.
55536 = 0 1101 1000 1111 0000
Converting to base 2 this way is not possible for bases that aren't powers of two. The reason it is possible for base 8 (and 16) is that the conversion works as follows:
octal ABC = 8^2*A + 8^1*B + 8^0*C (decimal)
= 0b1000000*A + 0b1000*B + C (binary)
so if you have the lookup table of A = (0b000 to 0b111), then the multiplication is always by 1 and some trailing zeros, so the multiplication is simple (just shifting left).
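To illustrate the power-of-two case, here is a rough Python sketch that converts a digit string with one table lookup per digit. The function name `to_binary_pow2` is made up, and the output is a string of bits rather than a packed integer, purely for readability.

```python
def to_binary_pow2(digits: str, base: int) -> str:
    """Convert a digit string in base 2, 4, 8 or 16 to a binary string
    using one table lookup per digit and no arithmetic on the number."""
    bits_per_digit = base.bit_length() - 1
    if base < 2 or base > 16 or 1 << bits_per_digit != base:
        raise ValueError("base must be a power of two between 2 and 16")
    symbols = "0123456789abcdef"[:base]
    # Table: digit symbol -> fixed-width bit pattern.
    table = {s: format(v, "0{}b".format(bits_per_digit))
             for v, s in enumerate(symbols)}
    bits = "".join(table[ch] for ch in digits.lower())
    return bits.lstrip("0") or "0"


print(to_binary_pow2("F1E", 16))
print(to_binary_pow2("766", 8))
```

Because each digit maps to a fixed-width group of bits, the groups never overlap and can simply be concatenated, which is exactly what fails for base 10.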
However, consider the 'odd' base of 10. When you look at the powers of 10:
10^1 = 0b1010
10^2 = 0b1100100
10^3 = 0b1111101000
10^4 = 0b10011100010000
..etc
You'll notice that the multiplication never gets simple, so you can't have any lookup tables and do bitshifts and ors, no matter how big you group them. It will always overlap. The best you can do is have a lookup table of the form: (a,b) where a is the digit position, and b is the digit (0..9). Then, you are only reduced to adding n numbers, rather than multiplying and adding n numbers (plus the cost of the memory of the lookup table)
How big are the strings? You can potentially convert the multiply-and-add to a lookup-and-add by doing something like this:
Store the numbers 0-9, 10, 20, 30, 40, ... 90, 100, 200, ... 900, 1000, 2000, ... , 9000, 10000, ... in the target base in a table.
For each character starting with the rightmost, index appropriately into the table and add it to a running result.
Of course I'm not sure how well this will actually perform, but it's a thought.
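A minimal sketch of the lookup-and-add idea in Python, for base 10 input only. Python's built-in integers are arbitrary precision, so they stand in for whatever big-number type the real code would use; `build_table` and `lookup_and_add` are hypothetical names.

```python
def build_table(max_digits: int, base: int = 10):
    """table[p][d] holds d * base**p, computed once up front."""
    return [[d * base**p for d in range(base)] for p in range(max_digits)]


def lookup_and_add(text: str, table) -> int:
    """Convert a decimal string using only table lookups and additions."""
    total = 0
    # Position 0 is the rightmost character.
    for position, ch in enumerate(reversed(text)):
        total += table[position][ord(ch) - ord("0")]
    return total


table = build_table(6)
print(bin(lookup_and_add("1000", table)))
```

The table grows with the longest input you want to support, so this trades the multiplications for memory.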
The algorithm is quite simple. Language agnostic would be:
total = 0
base <- input_base
for each character in input:
total <- total*base + number(char)
In C++:
#include <cctype>
#include <string>
// Helper to convert a digit to a number
unsigned int number( char ch )
{
    if ( ch >= '0' && ch <= '9' ) return ch-'0';
    ch = std::toupper( static_cast<unsigned char>(ch) );
    if ( ch >= 'A' && ch <= 'F' ) return 10 + (ch-'A');
    return 0; // not a valid digit; real code should report an error here
}
unsigned int parse( std::string const & input, unsigned int base )
{
    unsigned int total = 0;
    for ( std::string::size_type i = 0; i < input.size(); ++i )
    {
        total = total*base + number(input[i]);
    }
    return total;
}
Of course, you should take care of possible errors (incoherent input: base 2 and input string 'af12') or any other exceptional condition.
Start with a running count of 0.
For each character in the string (reading left to right)
Multiply count by base.
Convert character to int value (0 through base - 1)
Add character value to running count.
How accurate do you need to be?
If you're looking for perfection, then multiply-and-add is really your only recourse. And I'd be very surprised if it's the slowest part of your application.
If order-of-magnitude is good enough, use a lookup table to find the closest power of 2.
Example 1: 1234, closest power of 2 is 1024.
Example 2: 98765, closest power of 2 below it is 65536
You could also drive this by counting the number of digits, and multiplying the appropriate power of 2 by the leftmost digit. This can be implemented as a left-shift:
Example 3: 98765 has 5 digits, closest power of 2 to 10000 is 8192 (2^13), so result is 9 << 13
I wrote this before your clarifying comment so it probably isn't quite as applicable. I'm not sure if a lookup table approach is possible or not. If you really don't need arbitrary precision, then take advantage of the runtime.
If a C/C++ solution is acceptable, I believe that something like the following is what you are looking for. It probably contains bugs in edge cases, but it does compile and work as expected, at least for positive numbers. Making it really work is an exercise for the reader.
#include <math.h>   /* frexp */
#include <stdlib.h> /* strtoul, malloc */
/*
* NAME
* convert_num - convert a numerical string (str) of base (b) to
* a printable binary representation
* SYNOPSIS
* int convert_num(char const* s, int b, char** o)
* DESCRIPTION
* Generates a printable binary representation of an input number
* from an arbitrary base. The input number is passed as the ASCII
* character string `s'. The input string consists of characters
* from the ASCII character set {'0'..'9','A'..('A'+b-10)} where
* letter characters may be in either upper or lower case.
* RETURNS
* The number of characters from the input string `s' which were
* consumed by this operation. The output string is placed into
* newly allocated storage which is pointed to by `*o' upon successful
* completion. An error is signalled by returning `-1'.
*/
int
convert_num(char const *str, int b, char **out)
{
int rc = -1;
char *endp = NULL;
char *outp = NULL;
unsigned long num = strtoul(str, &endp, b);
if (endp != str) { /* then we have some numbers */
int numdig = -1;
rc = (endp - str); /* we have this many base `b' digits! */
frexp((double)num, &numdig); /* we need this many base 2 digits */
if ((outp=malloc(numdig+1)) == NULL) {
return -1;
}
*out = outp; /* return the buffer */
outp += numdig; /* make sure it is NUL terminated */
*outp-- = '\0';
while (numdig-- != 0) { /* fill it in from LSb to MSb */
*outp-- = ((num & 1) ? '1' : '0');
num >>= 1;
}
}
return rc;
}

Expressing an integer as a series of multipliers

Scroll down to see latest edit, I left all this text here just so that I don't invalidate the replies this question has received so far!
I have the following brain teaser I'd like to get a solution for. I have tried to solve this but since I'm not mathematically that much above average (that is, I think I'm very close to average) I can't seem to wrap my head around this.
The problem: A given number x should be split into a series of multipliers, where each multiplier <= y, y being a constant like 10 or 16 or whatever. In the series (technically an array of integers) the last number should be added instead of multiplied, to be able to convert the multipliers back to the original number.
As an example, let's assume x=29 and y=10. In this case the expected array would be {10,2,9} meaning 10*2+9. However if y=5, it'd be {5,5,4} meaning 5*5+4 or if y=3, it'd be {3,3,3,2} which would then be 3*3*3+2.
I tried to solve this by doing something like this:
while x >= y, store y to multipliers, then x = x - y
when x < y, store x to multipliers
Obviously this didn't work, I also tried to store the "leftover" part separately and add that after everything else but that didn't work either. I believe my main problem is that I try to think this in a way too complex manner while the solution is blatantly obvious and simple.
To reiterate, these are the limits this algorithm should have:
has to work with 64bit longs
has to return an array of 32bit integers (...well, shorts are OK too)
while support for signed numbers (both + and -) would be nice, if it helps the task only unsigned numbers is a must
And while I'm doing this using Java, I'd rather take any possible code examples as pseudocode, I specifically do NOT want readily made answers, I just need a nudge (well, more of a strong kick) so that I can solve this at least partly myself. Thanks in advance.
Edit: Further clarification
To avoid some confusion, I think I should reword this a bit:
Every integer in the result array should be less or equal to y, including the last number.
Yes, the last number is just a magic number.
No, this isn't modulus since then the second number would be larger than y in most cases.
Yes, there are multiple answers for most of the numbers, however I'm looking for the one with the fewest math ops. As far as my logic goes, that means finding the maximum number of multipliers that are as big as possible, for example x=1 000 000,y=100 is 100*100*100 even though 10*10*10*10*10*10 is an equally correct answer math-wise.
I need to go through the given answers so far with some thought but if you have anything to add, please do! I do appreciate the interest you've already shown on this, thank you all for that.
Edit 2: More explanations + bounty
Okay, seems like what I was aiming for in here just can't be done the way I thought it could be. I was too ambiguous with my goal and after giving it a bit of a thought I decided to just tell you in its entirety what I'd want to do and see what you can come up with.
My goal originally was to come up with a specific method to pack 1..n large integers (aka longs) together so that their String representation is notably shorter than writing the actual number. Think multiples of ten, 10^6 and 1 000 000 are the same, however the representation's length in characters isn't.
For this I wanted to somehow combine the numbers since it is expected that the numbers are somewhat close to each other. I first thought that representing 100, 121, 282 as 100+21+161 could be the way to go but the saving in string length is negligible at best and really doesn't work that well if the numbers aren't very close to each other. Basically I wanted more than ~10%.
So I came up with the idea of grouping the numbers by a common property such as a multiplier and dividing the rest of the number into individual components which I can then represent as a string. This is where this problem steps in, I thought that for example 1 000 000 and 100 000 can be expressed as 10^(5|6) but due to the context of my aimed usage this was a bit too flaky:
The context is the Web, RESTful URLs to be specific. That's why I mentioned thinking of using 64 characters (web-safe alphanumeric non-reserved characters and then some) since then I could create seemingly random URLs which could be unpacked to a list of integers expressing a set of id numbers. At this point I thought of creating a base 64-like number system for expressing base 10/2 numbers but since I'm not a math genius I have no idea beyond this point how to do it.
The bounty
Now that I have written the whole story (sorry that it's a long one), I'm opening a bounty to this question. Everything regarding requirements for the preferred algorithm specified earlier is still valid. I also want to say that I'm already grateful for all the answers I've received so far, I enjoy being proven wrong if it's done in such a manner as you people have done.
The conclusion
Well, bounty is now given. I spread a few comments to responses mostly for future reference and myself, you can also check out my SO Uservoice suggestion about spreading bounty which is related to this question if you think we should be able to spread it among multiple answers.
Thank you all for taking time and answering!
Update
I couldn't resist trying to come up with my own solution for the first question even though it doesn't do compression. Here is a Python solution using a third party factorization algorithm called pyecm.
This solution is probably several orders of magnitude more efficient than Yevgeny's. Computations take seconds instead of hours or maybe even weeks/years for reasonable values of y. For x = 2^32-1 and y = 256, it took 1.68 seconds on my 1.2 GHz Core Duo.
>>> import time
>>> def test():
... before = time.time()
... print factor(2**32-1, 256)
... print time.time()-before
...
>>> test()
[254, 232, 215, 113, 3, 15]
1.68499994278
>>> 254*232*215*113*3+15
4294967295L
And here is the code:
def factor(x, y):
    # y should be smaller than x. If x=y then {y, 1, 0} is the best solution
    assert(x > y)
    best_output = []
    # try all possible remainders from 0 to y
    for remainder in xrange(y+1):
        output = []
        composite = x - remainder
        factors = getFactors(composite)
        # check if any factor is larger than y
        bad_remainder = False
        for n in factors.iterkeys():
            if n > y:
                bad_remainder = True
                break
        if bad_remainder: continue
        # make the best factors
        while True:
            results = largestFactors(factors, y)
            if results == None: break
            output += [results[0]]
            factors = results[1]
        # store the best output
        output = output + [remainder]
        if len(best_output) == 0 or len(output) < len(best_output):
            best_output = output
    return best_output
# Heuristic
# The bigger the number the better. 8 is more compact than 2,2,2 etc...
# Find the most factors you can have below or equal to y
# output the number and unused factors that can be reinserted in this function
def largestFactors(factors, y):
    assert(y > 1)
    # iterate from y to 2 and see if the factors are present.
    for i in xrange(y, 1, -1):
        try_another_number = False
        factors_below_y = getFactors(i)
        for number, copies in factors_below_y.iteritems():
            if number in factors:
                if factors[number] < copies:
                    try_another_number = True
                    continue # not enough factors
            else:
                try_another_number = True
                continue # a factor is not present
        # Do we want to try another number, or was a solution found?
        if try_another_number == True:
            continue
        else:
            output = 1
            for number, copies in factors_below_y.items():
                remaining = factors[number] - copies
                if remaining > 0:
                    factors[number] = remaining
                else:
                    del factors[number]
                output *= number ** copies
            return (output, factors)
    return None # failed
# Find prime factors. You can use any formula you want for this.
# I am using elliptic curve factorization from http://sourceforge.net/projects/pyecm
import pyecm, collections, copy
getFactors_cache = {}
def getFactors(n):
    assert(n != 0)
    # attempt to retrieve from cache. Returns a copy
    try:
        return copy.copy(getFactors_cache[n])
    except KeyError:
        pass
    output = collections.defaultdict(int)
    for factor in pyecm.factors(n, False, True, 10, 1):
        output[factor] += 1
    # cache result
    getFactors_cache[n] = output
    return copy.copy(output)
Answer to first question
You say you want compression of numbers, but from your examples, those sequences are longer than the undecomposed numbers. It is not possible to compress these numbers without more details to the system you left out (probability of sequences/is there a programmable client?). Could you elaborate more?
Here is a mathematical explanation as to why current answers to the first part of your problem will never solve your second problem. It has nothing to do with the knapsack problem.
This is Shannon's entropy algorithm. It tells you the theoretical minimum amount of bits you need to represent a sequence {X0, X1, X2, ..., Xn-1, Xn} where p(Xi) is the probability of seeing token Xi.
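For reference, Shannon's entropy in its standard textbook form, written with the same symbols, together with the special case used in this answer where all N values are equally likely:

```latex
H(X) = -\sum_{i=0}^{n} p(X_i)\,\log_2 p(X_i)
\qquad\text{and, for } p(X_i) = \tfrac{1}{N} \text{ for all } i:\qquad
H(X) = -N \cdot \tfrac{1}{N} \log_2 \tfrac{1}{N} = \log_2 N
```

With N = 2^32 this gives 32 bits per number.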
Let's say that X0 to Xn is the span of 0 to 4294967295 (the range of an integer). From what you have described, each number is as likely as another to appear. Therefore the probability of each element is 1/4294967296.
When we plug it into Shannon's algorithm, it will tell us what the minimum number of bits are required to represent the stream.
import math
def entropy():
    num = 2**32
    probability = 1./num
    return -(num) * probability * math.log(probability, 2)
    # the (num) * probability cancels out
The entropy unsurprisingly is 32. We require 32 bits to represent an integer where each number is equally likely. The only way to reduce this number, is to increase the probability of some numbers, and decrease the probability of others. You should explain the stream in more detail.
Answer to second question
The right way to do this is to use base64 when communicating over HTTP. Java did not have this in the standard library before Java 8 (which added java.util.Base64), but I found a link to a free implementation:
http://iharder.sourceforge.net/current/java/base64/
Here is the "pseudo-code" which works perfectly in Python and should not be difficult to convert to Java (my Java is rusty):
def longTo64(num):
    mapping = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
    output = ""
    # special case for 0
    if num == 0:
        return mapping[0]
    while num != 0:
        output = mapping[num % 64] + output
        num //= 64  # integer division, also correct on Python 3
    return output
If you have control over your web server and web client, and can parse the entire HTTP requests without problem, you can upgrade to base85. According to wikipedia, url encoding allows for up to 85 characters. Otherwise, you may need to remove a few characters from the mapping.
Here is another code example in Python
def longTo85(num):
    # 85 distinct characters; the original list repeated '#', which made decoding ambiguous
    mapping = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~!*'();:@&=+$,/?%#[]"
    output = ""
    base = len(mapping)
    # special case for 0
    if num == 0:
        return mapping[0]
    while num != 0:
        output = mapping[num % base] + output
        num //= base  # integer division, also correct on Python 3
    return output
And here is the inverse operation:
def stringToLong(string):
    mapping = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~!*'();:@&=+$,/?%#[]"
    output = 0
    base = len(mapping)
    place = 0
    # check each digit from the lowest place
    for digit in reversed(string):
        # map the symbol to its number, then multiply by base^place
        output += mapping.find(digit) * (base ** place)
        place += 1
    return output
Here is a graph of Shannon's algorithm in different bases.
As you can see, the higher the radix, the fewer symbols are needed to represent a number. At base64, ~11 symbols are required to represent a long. At base85, it becomes ~10 symbols.
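The symbol counts quoted here can be checked with a few lines of Python. This is only a sketch: `symbols_needed` is a made-up helper, and it treats a long as an unsigned 64-bit value.

```python
def symbols_needed(max_value: int, base: int) -> int:
    """Number of base-`base` digits needed to write `max_value`."""
    count = 1
    while max_value >= base:
        max_value //= base
        count += 1
    return count


for base in (10, 16, 64, 85):
    print(base, symbols_needed(2**64 - 1, base))
```

Going from base 64 to base 85 saves a single symbol on the largest values, which is why the gain from larger alphabets flattens out quickly.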
Edit after final explanation:
I would think base64 is the best solution, since there are standard functions that deal with it, and variants of this idea don't give much improvement. This was answered with much more detail by others here.
Regarding the original question, although the code works, it is not guaranteed to run in any reasonable time, as was answered as well as commented on this question by LFSR Consulting.
Original Answer:
You mean something like this?
Edit - corrected after a comment.
shortest_output = {}
for (int R = 0; R <= X; R++) {
    // iteration over possible remainders
    // check if the rest of X can be decomposed into multipliers
    newX = X - R;
    output = {};
    while (newX > 1) {
        int i;
        for (i = Y; i > 1; i--) {
            if ( newX % i == 0) { // found a divider
                output.append(i);
                newX = newX /i;
                break;
            }
        }
        if (i == 1) { // no dividers <= Y
            break;
        }
    }
    if (newX != 1) {
        // couldn't find dividers with no remainder
        output.clear();
    }
    else {
        output.append(R);
        // the first valid decomposition always replaces the empty initial value
        if (shortest_output.isEmpty() || output.length() < shortest_output.length()) {
            shortest_output = output;
        }
    }
}
It sounds as though you want to compress random data -- this is impossible for information theoretic reasons. (See http://www.faqs.org/faqs/compression-faq/part1/preamble.html question 9.) Use Base64 on the concatenated binary representations of your numbers and be done with it.
The problem you're attempting to solve (you're dealing with a subset of the problem, given your restriction on y) is called Integer Factorization, and no known algorithm solves it efficiently in general:
In number theory, integer factorization is the breaking down of a composite number into smaller non-trivial divisors, which when multiplied together equal the original integer.
This problem is what makes a number of cryptographic functions possible (namely RSA, although RSA moduli are far longer than a 64-bit long). The wiki page contains some good resources that should move you in the right direction with your problem.
So, your brain teaser is indeed a brain teaser... and if you solve it efficiently we can elevate your math skills to above average!
Updated after the full story
Base64 is most likely your best option. If you want a custom solution you can try implementing a Base 65+ system. Just remember that just because 10000 can be written as "10^4" doesn't mean that everything can be written as 10^n where n is an integer. Different base systems are the simplest way to write numbers, and the higher the base, the fewer digits the number requires. Plus most framework libraries contain algorithms for Base64 encoding. (What language are you using?)
One way to further pack the urls is the one you mentioned but in Base64.
int[] IDs;
IDs.sort(); // Ascending, so IDs[i] is always greater than or equal to IDs[i-1].
string url = Base64Encode(IDs[0]);
for (int i = 1; i < IDs.length; i++) {
    // Store the non-negative difference to the previous ID
    url += "," + Base64Encode(IDs[i] - IDs[i-1]);
}
Note that you require some separator as the initial ID can be arbitrarily large and the difference between two IDs CAN be more than 63 in which case one Base64 digit is not enough.
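A minimal Python sketch of that delta idea, in case it helps. Here "Base64Encode" is read as writing a number in base-64 digits; the alphabet and the helper names (`to_base64_digits`, `encode_ids`, `decode_ids`) are assumptions for illustration, not part of any library, and the list is assumed to be non-empty and non-negative.

```python
import string

# URL-safe 64-character alphabet (an assumption; the answer does not fix one)
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"

def to_base64_digits(n):
    """Write a non-negative integer in base 64 using ALPHABET."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, r = divmod(n, 64)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

def from_base64_digits(s):
    """Inverse of to_base64_digits."""
    n = 0
    for ch in s:
        n = n * 64 + ALPHABET.index(ch)
    return n

def encode_ids(ids):
    """Sort ascending, then emit the first ID followed by successive differences."""
    ordered = sorted(ids)
    parts = [to_base64_digits(ordered[0])]
    for prev, cur in zip(ordered, ordered[1:]):
        parts.append(to_base64_digits(cur - prev))
    return ",".join(parts)

def decode_ids(url):
    """Rebuild the sorted ID list by summing the differences back up."""
    values = [from_base64_digits(p) for p in url.split(",")]
    ids = [values[0]]
    for delta in values[1:]:
        ids.append(ids[-1] + delta)
    return ids
```

The comma separator is what lets a difference above 63 take two or more digits. The original order of the IDs is lost, which is fine only if the URL identifies a set.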
Updated
Just restating that the problem is unsolvable. For Y = 64 you can't write 87681 in multipliers + remainder where each of these is below 64. In other words, you cannot write any of the numbers 87617..87681 with multipliers that are below 64. Each of these numbers has an elementary term over 64. 87616 can be written in elementary terms below 64 but then you'd need those + 65 and so the remainder will be over 64.
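If you want to check a claim like this yourself, a small trial-division sketch in Python is enough at this size. It reads "elementary term" as prime factor; the function names are made up for illustration.

```python
def largest_prime_factor(n):
    """Largest prime factor of n >= 2, found by trial division."""
    largest = 1
    d = 2
    while d * d <= n:
        while n % d == 0:
            largest = d
            n //= d
        d += 1
    if n > 1:
        # Whatever is left is a prime larger than every divisor tried so far
        largest = n
    return largest

def expressible(n, y):
    """True if n can be written as a product of factors that are all below y."""
    return n == 1 or largest_prime_factor(n) < y

if __name__ == "__main__":
    # List the numbers in the range discussed above that can be written
    # with multipliers below 64.
    print([n for n in range(87616, 87682) if expressible(n, 64)])
```

A number is a product of factors below y exactly when all of its prime factors are below y, which is why looking at the largest prime factor is enough.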
So if this was just a brainteaser, it's unsolvable. Was there some practical purpose for this which could be achieved in some way other than using multiplication and a remainder?
And yes, this really should be a comment but I lost my ability to comment at some point. :p
I believe the solution which comes closest is Yevgeny's. It is also easy to extend Yevgeny's solution to remove the limit on the remainder, in which case it would be able to find a solution where the multipliers are smaller than Y and the remainder is as small as possible, even if greater than Y.
Old answer:
If you limit that every number in the array must be below the y then there is no solution for this. Given large enough x and small enough y, you'll end up in an impossible situation. As an example with y of 2, x of 12 you'll get 2 * 2 * 2 + 4 as 2 * 2 * 2 * 2 would be 16. Even if you allow negative numbers with abs(n) below y that wouldn't work as you'd need 2 * 2 * 2 * 2 - 4 in the above example.
And I think the problem is NP-Complete even if you limit the problem to inputs which are known to have an answer where the last term is less than y. It sounds quite a lot like the Knapsack problem. Of course I could be wrong there.
Edit:
Without more accurate problem description it is hard to solve the problem, but one variant could work in the following way:
1. Set current = x.
2. Break current into its terms (prime factors).
3. If one of the terms is not below y, the current number cannot be described in terms less than y. Subtract one from current and repeat from 2.
4. Otherwise the current number can be expressed in terms less than y.
5. Calculate the remainder (x - current).
6. Combine as many of the terms as possible.
(Yevgeny Doctor has a more concise (and working) implementation of this, so to prevent confusion I've skipped the implementation.)
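For reference only, here is a rough Python sketch of the variant described above. It is not Yevgeny's code; it assumes x >= 1 and y >= 3, treats "terms" as prime factors, and uses a simple greedy pass for the "combine" step, which is not guaranteed to give the shortest possible list.

```python
def prime_factors(n):
    """Prime factors of n in ascending order (empty list for n < 2)."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def decompose(x, y):
    """Return (terms, remainder) with product(terms) + remainder == x
    and every term below y. The remainder itself may exceed y."""
    current = x
    # Step back until every prime factor of current is below y
    while any(f >= y for f in prime_factors(current)):
        current -= 1
    terms = []
    # Greedily merge factors into existing terms while the product stays below y
    for f in sorted(prime_factors(current), reverse=True):
        for i, t in enumerate(terms):
            if t * f < y:
                terms[i] = t * f
                break
        else:
            terms.append(f)
    return (terms or [1]), x - current
```

For example, `decompose(23, 5)` steps down to 18 = 3 * 3 * 2 and leaves a remainder of 5, which shows the remainder is not bounded by y in this variant.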
OP Wrote:
My goal originally was to come up with a specific method to pack 1..n large integers (aka longs) together so that their String representation is notably shorter than writing the actual number. Think multiples of ten, 10^6 and 1 000 000 are the same, however the representation's length in characters isn't.
I have been down that path before, and as fun as it was to learn all the math, to save you time I will just point you to: http://en.wikipedia.org/wiki/Kolmogorov_complexity
In a nutshell some strings can be easily compressed by changing your notation:
10^9 (4 characters) = 1000000000 (10 characters)
Others cannot:
7829203478 = some random number...
This is a huge simplification of the article I linked to above, so I recommend that you read it instead of taking my explanation at face value.
Edit:
If you are trying to make RESTful urls for some set of unique data, why wouldn't you use a hash, such as MD5? Then include the hash as part of the URL, then look up the data based on the hash. Or am I missing something obvious?
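Roughly what I have in mind, as a Python sketch. The `store` dict is a hypothetical stand-in for whatever database you would really use, and the function names are made up.

```python
import hashlib

store = {}  # hypothetical stand-in for a database table keyed by hash

def url_key(ids):
    """Hash a canonical form of the ID set and remember what it stands for."""
    canonical = ",".join(str(i) for i in sorted(ids))
    key = hashlib.md5(canonical.encode("ascii")).hexdigest()
    store[key] = canonical
    return key

def lookup(key):
    """Recover the ID list from the hash found in the URL."""
    return [int(s) for s in store[key].split(",")]
```

The key is always 32 hex characters no matter how many IDs there are, but it only works if the server keeps the mapping; the URL alone no longer carries the data.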
The original method you chose (a * b + c * d + e) would be very difficult to find optimal solutions for simply due to the large search space of possibilities. You could factorize the number but it's that "+ e" that complicates things since you need to factorize not just that number but quite a few immediately below it.
Two methods for compression spring immediately to mind, both of which give you a much-better-than-10% saving on space from the numeric representation.
A 64-bit number ranges from (unsigned):
0 to
18,446,744,073,709,551,615
or (signed):
-9,223,372,036,854,775,808 to
9,223,372,036,854,775,807
In both cases, you need to reduce the 20-characters taken (without commas) to something a little smaller.
The first is to simply BCD-ify the number, then base64 encode it (actually a slightly modified base64, since "/" would not be kosher in a URL - you should use one of the acceptable characters such as "_").
Converting it to BCD will store two digits (or a sign and a digit) into one byte, giving you an immediate 50% reduction in space (10 bytes). Encoding it base 64 (which turns every 3 bytes into 4 base64 characters) will turn the first 9 bytes into 12 characters and that tenth byte into 2 characters, for a total of 14 characters - that's a 30% saving.
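A small Python sketch of the BCD route, assuming non-negative numbers only (the sign nibble is left out). The function names are made up; `bytes.fromhex` is used because reading the decimal digits as hex nibbles is exactly BCD packing.

```python
import base64

def bcd_base64(n):
    """Pack the decimal digits of n two per byte, then URL-safe base64 them."""
    digits = str(n)
    if len(digits) % 2:
        digits = "0" + digits              # pad to a whole number of bytes
    packed = bytes.fromhex(digits)         # each decimal digit becomes one nibble
    return base64.urlsafe_b64encode(packed).rstrip(b"=").decode("ascii")

def bcd_base64_decode(s):
    """Inverse of bcd_base64."""
    padded = s + "=" * (-len(s) % 4)       # restore the stripped '=' padding
    return int(base64.urlsafe_b64decode(padded).hex())
```

With the trailing "=" padding stripped, a 20-digit number comes out at the 14 characters worked out above.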
The only better method is to just base64 encode the binary representation. This is better because BCD has a small amount of wastage (each digit only needs about 3.32 bits to store [log2(10)], but BCD uses 4).
Working on the binary representation, we only need to base64 encode the 64-bit number (8 bytes). That needs 8 characters for the first 6 bytes and 3 characters for the final 2 bytes. That's 11 characters of base64 for a saving of 45%.
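In Python this could look something like the sketch below, assuming unsigned 64-bit values and big-endian byte order (both are my choices for the example, as are the function names).

```python
import base64
import struct

def encode_u64(n):
    """Pack n as 8 big-endian bytes and URL-safe base64 them, dropping '=' padding."""
    raw = struct.pack(">Q", n)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

def decode_u64(s):
    """Inverse of encode_u64."""
    raw = base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))
    return struct.unpack(">Q", raw)[0]
```

Every value encodes to exactly 11 characters this way, so small numbers get longer; trimming leading zero bytes first would avoid that at the cost of a variable length.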
If you wanted maximum compression, there are 73 characters available for URL encoding:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789$-_.+!*'(),
so technically you could probably encode base-73 which, from rough calculations, would still take up 11 characters, but with more complex code which isn't worth it in my opinion.
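To back up the rough calculation, here is a Python sketch of a base-73 encoder over exactly those characters. The function names are invented for the example, and it handles non-negative integers only.

```python
import string

# The 73 characters listed above
ALPHABET73 = (string.ascii_uppercase + string.ascii_lowercase
              + string.digits + "$-_.+!*'(),")

def encode73(n):
    """Write a non-negative integer in base 73."""
    if n == 0:
        return ALPHABET73[0]
    out = []
    while n > 0:
        n, r = divmod(n, 73)
        out.append(ALPHABET73[r])
    return "".join(reversed(out))

def decode73(s):
    """Inverse of encode73."""
    n = 0
    for ch in s:
        n = n * 73 + ALPHABET73.index(ch)
    return n
```

Since 73^10 is about 4.3 * 10^18 and 2^64 is about 1.8 * 10^19, the largest 64-bit value still needs 11 characters, the same as base64.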
Of course, that's the maximum compression due to the maximum values. At the other end of the scale (1-digit) this encoding actually results in more data (expansion rather than compression). You can see the improvements only start for numbers over 999, where 4 digits can be turned into 3 base64 characters:
Range     (bytes)  Chars  Base64 chars  Compression ratio
--------  -------  -----  ------------  -----------------
< 10        (1)        1             2              -100%
< 100       (1)        2             2                 0%
< 1000      (2)        3             3                 0%
< 10^4      (2)        4             3                25%
< 10^5      (3)        5             4                20%
< 10^6      (3)        6             4                33%
< 10^7      (3)        7             4                42%
< 10^8      (4)        8             6                25%
< 10^9      (4)        9             6                33%
< 10^10     (5)       10             7                30%
< 10^11     (5)       11             7                36%
< 10^12     (5)       12             7                41%
< 10^13     (6)       13             8                38%
< 10^14     (6)       14             8                42%
< 10^15     (7)       15            10                33%
< 10^16     (7)       16            10                37%
< 10^17     (8)       17            11                35%
< 10^18     (8)       18            11                38%
< 10^19     (8)       19            11                42%
< 2^64      (8)       20            11                45%
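The rows can be reproduced with a few lines of Python, assuming each number is stored in the fewest whole bytes that hold it and base64 is written without "=" padding (function names are mine).

```python
def base64_chars(n):
    """Base64 characters needed for n stored in whole bytes, without padding."""
    n_bytes = max(1, -(-n.bit_length() // 8))   # ceil(bits / 8), at least one byte
    return -(-n_bytes * 8 // 6)                 # ceil(stored bits / 6)

def saving_percent(n):
    """Saving over the decimal string, truncated to a whole percent."""
    digits = len(str(n))
    return 100 * (digits - base64_chars(n)) // digits

if __name__ == "__main__":
    # Largest value in each row of the table
    for n in [9, 99, 999] + [10**k - 1 for k in range(4, 20)] + [2**64 - 1]:
        print(len(str(n)), base64_chars(n), saving_percent(n))
```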
Update: I didn't get everything, thus I rewrote the whole thing in a more Java-style fashion. I hadn't thought of the case of a prime number that is bigger than the divisor. This is fixed now. I'm leaving the original code below so you can get the idea.
Update 2: I now handle the case of the big prime number in another fashion. This way a result is obtained either way.
public final class PrimeNumberException extends Exception {
private final long primeNumber;
public PrimeNumberException(long x) {
primeNumber = x;
}
public long getPrimeNumber() {
return primeNumber;
}
}
// Requires: import java.util.ArrayList;
public static Long[] decompose(long x, long y) {
    // Declared outside the try block so the catch clause and the code after it can use them
    final ArrayList<Long> operands = new ArrayList<Long>(1000);
    final long rest = x % y;
    try {
        // Extract the rest so the remaining value is divisible by y
        final long newX = x - rest;
        // Factor the remaining value (a loop now, formerly a tail recursion)
        recDivide(newX, y, operands);
    } catch (PrimeNumberException e) {
        // return new Long[0];
        // or do whatever you like, for example
        operands.add(e.getPrimeNumber());
    }
    // Add the remainder to the array
    operands.add(rest);
    return operands.toArray(new Long[operands.size()]);
}
// The formerly recursive method, now a loop
private static void recDivide(long x, long y, ArrayList<Long> operands)
        throws PrimeNumberException {
    while ((x > y) && (y != 1)) {
        if (x % y == 0) {
            final long rest = x / y;
            // Since y is a divisor add it to the list of operands
            operands.add(y);
            if (rest <= y) {
                // the rest is not larger than y, we're finished
                operands.add(rest);
            }
            // continue with the quotient
            x = rest;
        } else {
            // if the value x isn't divisible by y decrement y so you'll find a
            // divisor eventually
            if (--y == 1) {
                throw new PrimeNumberException(x);
            }
        }
    }
}
Original: Here is some recursive code I came up with. I would have preferred to code it in some functional language but it was required in Java. I didn't bother converting the numbers to integer but that shouldn't be that hard (yes, I'm lazy ;)
public static Long[] decompose(long x, long y) {
final ArrayList<Long> operands = new ArrayList<Long>();
final long rest = x % y;
// Extract the rest so the remaining value is divisible by y
final long newX = x - rest;
// Go into recursion, actually it's a tail recursion
recDivide(newX, y, operands);
// Add the remainder to the array
operands.add(rest);
return operands.toArray(new Long[operands.size()]);
}
// The recursive method
private static void recDivide(long newX, long y, ArrayList<Long> operands) {
long x = newX;
if (x % y == 0) {
final long rest = x / y;
// Since y is a divisor add it to the list of operands
operands.add(y);
if (rest <= y) {
// the rest is smaller than y, we're finished
operands.add(rest);
} else {
// the rest can still be divided, go one level deeper in recursion
recDivide(rest, y, operands);
}
} else {
// if the value x isn't divisible by y decrement y so you'll find a divisor
// eventually
recDivide(x, y-1, operands);
}
}
Are you married to using Java? Python has an entire package dedicated just for this exact purpose. It'll even sanitize the encoding for you to be URL-safe.
Native Python solution
The standard module I'm recommending is base64, which converts arbitrary strings of chars into sanitized base64 format. You can use it in conjunction with the pickle module, which handles conversion from lists of longs (actually arbitrary size) to a compressed string representation.
The following code should work on any vanilla installation of Python:
import base64
import pickle
# get some long list of numbers
a = (854183415,1270335149,228790978,1610119503,1785730631,2084495271,
1180819741,1200564070,1594464081,1312769708,491733762,243961400,
655643948,1950847733,492757139,1373886707,336679529,591953597,
2007045617,1653638786)
# this gets you the url-safe string
str64 = base64.urlsafe_b64encode(pickle.dumps(a,-1))
print str64
>>> gAIoSvfN6TJKrca3S0rCEqMNSk95-F9KRxZwakqn3z58Sh3hYUZKZiePR0pRlwlfSqxGP05KAkNPHUo4jooOSixVFCdK9ZJHdEqT4F4dSvPY41FKaVIRFEq9fkgjSvEVoXdKgoaQYnRxAC4=
# this unwinds it
a64 = pickle.loads(base64.urlsafe_b64decode(str64))
print a64
>>> (854183415, 1270335149, 228790978, 1610119503, 1785730631, 2084495271, 1180819741, 1200564070, 1594464081, 1312769708, 491733762, 243961400, 655643948, 1950847733, 492757139, 1373886707, 336679529, 591953597, 2007045617, 1653638786)
Hope that helps. Using Python is probably the closest you'll get to a 1-line solution.
Wrt the original algorithm request: Is there a limit on the size of the last number (beyond that it must be stored in a 32b int)?
(The original request is all I'm able to tackle lol.)
The one that produces the shortest list is:
bool negative=(n<1)?true:false;
int j=n%y;
if(n==0 || n==1)
{
list.append(n);
return;
}
while((long64)(n-j*y)>MAX_INT && y>1) //R has to be stored in int32
{
y--;
j=n%y;
}
if(y<=1)
fail //Number has no suitable candidate factors. This shouldn't happen
int i=0;
for(;i<j;i++)
{
list.append(y);
}
list.append(n-y*j);
if(negative)
list[0]*=-1;
return;
A little simplistic compared to most answers given so far but it achieves the desired functionality of the original post... It's a little dirty but hopefully useful :)
Isn't this modulus?
Let / be integer division (whole numbers) and % be modulo.
int result[3];
result[0] = y;
result[1] = x / y;
result[2] = x % y;
Just set x:=x/n where n is the largest number that is less than both x and y. When you end up with x<=y, this is your last number in the sequence.
Like in my comment above, I'm not sure I understand exactly the question. But assuming integers (n and a given y), this should work for the cases you stated:
multipliers[0] = n / y;
multipliers[1] = y;
addedNumber = n % y;
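The same thing as a runnable Python sketch, assuming non-negative n and y > 0 (the function name is just for illustration):

```python
def split(n, y):
    """Return ([quotient, y], remainder) so that quotient * y + remainder == n."""
    quotient, remainder = divmod(n, y)
    return [quotient, y], remainder

if __name__ == "__main__":
    multipliers, added = split(87681, 64)
    print(multipliers, added)   # 1370 * 64 + 1
```

Note that only the remainder is guaranteed to be below y here; the quotient can be much larger, so this fits the question only if the multipliers are allowed to exceed y.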
