Subtract a number's digits from the number until it reaches 0 - algorithm

Can anyone help me with some algorithm for this problem?
We have a big number (19 digits) and, in a loop, we subtract one of the digits of that number from the number itself.
We continue to do this until the number reaches zero. We want to calculate the minimum number of subtractions that makes a given number reach zero.
The algorithm must respond quickly, within two seconds, for a 19-digit number (up to 10^19). As an example, an input of 36 gives 7:
1. 36 - 6 = 30
2. 30 - 3 = 27
3. 27 - 7 = 20
4. 20 - 2 = 18
5. 18 - 8 = 10
6. 10 - 1 = 9
7. 9 - 9 = 0
Thank you.

The minimum number of subtractions to reach zero makes this, I suspect, a very thorny problem, one that will require a great deal of backtracking over potential solutions, making it possibly too expensive for your time limitations.
But the first thing you should do is a sanity check. Since the largest digit is a 9, a 19-digit number will require about 10^18 subtractions to reach zero. Code up a simple program to continuously subtract 9 from 10^19 until it becomes less than ten. If you can't do that within the two seconds, you're in trouble.
By way of example, the following program (a):
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
    unsigned long long x = strtoull(argv[1], NULL, 10);
    x /= 1000000000;        // divide by one billion to keep the runtime sane
    while (x > 9)
        x -= 9;
    return x;
}
when run with the argument 10000000000000000000 (10^19), takes a second and a half of clock time (and CPU time, since it's all calculation) even at gcc's insane optimisation level of -O3:
real 0m1.531s
user 0m1.528s
sys 0m0.000s
And that's with the one-billion divisor just before the while loop, meaning the full number of iterations would take about 48 years.
So a brute-force method isn't going to help here; what you need is some serious mathematical analysis, which probably means you should post a similar question over at https://math.stackexchange.com/ and let the math geniuses have a shot.
(a) If you're wondering why I'm getting the value from the user rather than using a constant of 10000000000000000000ULL, it's to prevent gcc from calculating it at compile time and turning it into something like:
mov $1, %eax
Ditto for the return x, which prevents it noticing that I don't use the final value of x and hence optimising the loop out of existence altogether.

I don't have a solution that can solve 19 digit numbers in 2 seconds. Not even close. But I did implement a couple of algorithms (including a dynamic programming algorithm that solves for the optimum), and gained some insight that I believe is interesting.
Greedy Algorithm
As a baseline, I implemented a greedy algorithm that simply picks the largest digit in each step:
#include <stdint.h>

uint64_t countGreedy(uint64_t inputVal) {
    uint64_t remVal = inputVal;
    uint64_t nStep = 0;
    while (remVal > 0) {
        uint64_t digitVal = remVal;
        uint_fast8_t maxDigit = 0;
        // find the largest digit of the remaining value
        while (digitVal > 0) {
            uint64_t nextDigitVal = digitVal / 10;
            uint_fast8_t digit = digitVal - nextDigitVal * 10;
            if (digit > maxDigit) {
                maxDigit = digit;
            }
            digitVal = nextDigitVal;
        }
        remVal -= maxDigit;
        ++nStep;
    }
    return nStep;
}
Dynamic Programming Algorithm
The idea for this is that we can calculate the optimum incrementally. For a given value, we pick a digit, which adds one step to the optimum number of steps for the value with the digit subtracted.
With the target function (optimum number of steps) for a given value named optSteps(val), and the digits of the value named d_i, the following relationship holds:
optSteps(val) = 1 + min(optSteps(val - d_i))   (minimum taken over the nonzero digits d_i)
This can be implemented with a dynamic programming algorithm. Since d_i is at most 9, we only need the previous 9 values to build on. In my implementation, I keep a circular buffer of 10 values:
#include <stdbool.h>
#include <stdint.h>

static uint64_t countDynamic(uint64_t inputVal) {
    // circular buffer holding the optimum step counts of the previous 10 values
    uint64_t minSteps[10] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
    uint_fast8_t digit0 = 0;
    for (uint64_t val = 10; val <= inputVal; ++val) {
        digit0 = val % 10;
        uint64_t digitVal = val;
        uint64_t minPrevStep = 0;
        bool prevStepSet = false;
        while (digitVal > 0) {
            uint64_t nextDigitVal = digitVal / 10;
            uint_fast8_t digit = digitVal - nextDigitVal * 10;
            if (digit > 0) {
                uint64_t prevStep = 0;
                if (digit > digit0) {
                    prevStep = minSteps[10 + digit0 - digit];
                } else {
                    prevStep = minSteps[digit0 - digit];
                }
                if (!prevStepSet || prevStep < minPrevStep) {
                    minPrevStep = prevStep;
                    prevStepSet = true;
                }
            }
            digitVal = nextDigitVal;
        }
        minSteps[digit0] = minPrevStep + 1;
    }
    return minSteps[digit0];
}
Comparison of Results
This may come as a surprise: I ran both algorithms on all values up to 1,000,000, and the results are absolutely identical. This suggests that the greedy algorithm actually calculates the optimum.
I don't have a formal proof that this is true for all possible values, but it intuitively makes sense to me. If in any given step you choose a smaller digit than the maximum, you compromise immediate progress in the hope of reaching a more favorable situation that lets you catch up and pass the greedy approach. But in all the scenarios I thought about, the situation after taking a sub-optimal step just does not get significantly more favorable. It might make the next step bigger, but that is at most enough to get even again.
Complexity
While both algorithms are linear in the value itself, they also loop over all digits of the value. Since the number of digits corresponds to log(n), I believe the complexity is O(n * log(n)).
I think it's possible to make it linear by keeping counts of the frequency of each digit and modifying them incrementally. But I doubt it would actually be faster: it requires more logic, and it turns a loop over all digits in the value (which is in the range of 2-19 for the values we are looking at) into a fixed loop over 10 possible digits.
Runtimes
Not surprisingly, the greedy algorithm is faster to calculate a single value. For example, for value 1,000,000,000, the runtimes on my MacBook Pro are:
greedy: 3 seconds
dynamic: 36 seconds
On the other hand, the dynamic programming approach is obviously much faster at calculating all the values, since its incremental approach needs them as intermediate results anyway. For calculating all values from 10 to 1,000,000:
greedy: 19 minutes
dynamic: 0.03 seconds
As the runtimes above show, the greedy algorithm handles input values of up to about 9 digits within the targeted runtime of 2 seconds. The implementations aren't really tuned, and it's certainly possible to squeeze out some more time, but the improvements would be fractional.
Ideas
As already explored in another answer, there's no chance of getting the result for 19 digit numbers in 2 seconds by subtracting digits one by one. Since we subtract at most 9 in each step, completing this for a value of 10^19 needs more than 10^18 steps. We mostly use computers that perform in the rough range of 10^9 operations/second, which suggests that it would take about 10^9 seconds.
Therefore, we need something that can take shortcuts. I can think of scenarios where that's possible, but haven't been able to generalize it to a full strategy so far.
For example, if your current value is 9999, you know that you can keep subtracting 9 until you drop below 9000. So you can calculate that you will make 112 such steps ((9999 - 9000) / 9 + 1) where you subtract 9, which can be done in a few operations.
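To make that concrete, here is a minimal C sketch of the batching idea for one special case (my own illustration, not a full strategy; batchLeadingNines is a made-up helper that only handles values whose leading digit is 9):

#include <stdint.h>

// Sketch: while the leading digit of val is 9, 9 stays the maximum digit,
// so a whole run of subtract-9 steps can be counted with one division
// instead of one loop iteration per step, e.g. 9999 -> 8991 in 112 steps.
uint64_t batchLeadingNines(uint64_t val, uint64_t *steps) {
    while (val >= 9) {
        uint64_t pow10 = 1;                  // 10^(number of digits - 1)
        while (val / pow10 >= 10)
            pow10 *= 10;
        if (val / pow10 != 9)
            break;                           // leading digit is no longer 9
        uint64_t floor9 = 9 * pow10;         // e.g. 9000 for 4-digit values
        uint64_t runs = (val - floor9) / 9 + 1;
        val -= runs * 9;
        *steps += runs;
    }
    return val;                              // caller continues step by step
}

Starting from 9999, this counts the 112 steps down to 8991 with two divisions instead of 112 loop iterations; the open question is how to generalize such shortcuts to every digit pattern.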

As said in the comments already, and agreeing with @paxdiablo’s other answer, I’m not sure if there is an algorithm to find the ideal solution without some backtracking; and the size of the number and the time constraint might be tough as well.
A general consideration though: you might want to find a way to decide between always subtracting the highest digit (which will decrease your current number by the largest possible amount, obviously), and looking at your current digits and subtracting the one that will give you the largest “new” digit.
Say your current number only consists of digits between 0 and 5 – then you might be tempted to subtract the 5 to decrease your number by the highest possible value and continue with the next step. If the last digit of your current number is 3, however, then you might want to subtract 4 instead – since that will give you a 9 as the new last digit, instead of “only” the 8 you would get by subtracting 5.
Whereas if you already have a 2 and two 9s among your digits, and the last digit is a 1 – then you might want to subtract the 9 anyway, since you will be left with the second 9 in the result (at least in most cases; in some edge cases it might get obliterated from the result as well). Subtracting the 2 instead would not have the advantage of giving you a “high” 9 that you would otherwise not have in the next step, and it would have the disadvantage of not lowering your number by as high an amount as subtracting the 9 would …
But every digit you subtract will not only affect the next step directly, but also the following steps indirectly – so again, I doubt there is a way to always choose the ideal digit for the current step without any backtracking or similar measures.

Related

Best way to generate U(1,5) from U(1,3)?

I am given a uniform integer random number generator ~ U3(1,3) (inclusive). I would like to generate integers ~ U5(1,5) (inclusive) using U3. What is the best way to do this?
The simplest approach I can think of is to sample twice from U3 and then use rejection sampling. That is, sampling twice from U3 gives us 9 possible combinations; we can assign the first 5 combinations to 1,2,3,4,5 and reject the last 4.
This approach expects to sample from U3 9/5 * 2 = 18/5 = 3.6 times.
Another approach could be to sample three times from U3. This gives a sample space of 27 possible combinations; we can make use of 25 of them and reject the last 2. This approach expects to use U3 27/25 * 3 = 3.24 times. It would be a little more tedious to write out since we have a lot more combinations than the first approach, but the expected number of samples from U3 is better.
Are there other, perhaps better, approaches to doing this?
I have this marked as language agnostic, but I'm primarily looking into doing this in either Python or C++.
You do not need combinations. A slight tweak using base 3 arithmetic removes the need for a table. Rather than using the 1..3 result directly, subtract 1 to get it into the range 0..2 and treat it as a base 3 digit. For three samples you could do something like:
function sample3()
    result <- 0
    result <- result + 9 * (randU3() - 1)   // High digit: 9
    result <- result + 3 * (randU3() - 1)   // Middle digit: 3
    result <- result + 1 * (randU3() - 1)   // Units digit: 1
    return result
end function
That will give you a number in the range 0..26, or 1..27 if you add one. You can use that number directly in the rest of your program.
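For concreteness, here is a hedged C sketch combining this base-3 construction with the rejection step from the question (randU3 is assumed to exist and return 1..3 uniformly; randU5 is my own name):

extern int randU3(void);   /* assumed: returns 1, 2 or 3 uniformly */

/* Build a uniform number in 0..26 from three base-3 digits, then reject
   25 and 26 so the remaining 25 outcomes map evenly onto 1..5. */
int randU5(void) {
    for (;;) {
        int r = 9 * (randU3() - 1)    /* high digit   */
              + 3 * (randU3() - 1)    /* middle digit */
              +     (randU3() - 1);   /* units digit  */
        if (r < 25)
            return r % 5 + 1;         /* uniform in 1..5 */
    }
}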
For the range [1, 3] to [1, 5], this is equivalent to rolling a 5-sided die with a 3-sided one.
However, this can't be done without "wasting" randomness (or running forever in the worst case), since 5's only prime factor (namely 5) doesn't divide 3. Thus, the best that can be done is to use rejection sampling to get arbitrarily close to no "waste" of randomness (such as by batching multiple rolls of the 3-sided die until 3^n is "close enough" to a power of 5). In other words, the approaches you give in your question are as good as they can get.
More generally, an algorithm to roll a k-sided die with a p-sided die will inevitably "waste" randomness (and run forever in the worst case) unless "every prime number dividing k also divides p", according to Lemma 3 in "Simulating a dice with a dice" by B. Kloeckner. For example:
Take the much more practical case that p is a power of 2 (and any block of random bits is the same as rolling a die with a power of 2 number of faces) and k is arbitrary. In this case, this "waste" and indefinite running time are inevitable unless k is also a power of 2.
This result applies to any case of rolling an n-sided die with an m-sided die, where n and m are prime numbers. For example, look at the answers to a question for the case n = 7 and m = 5.
See also this question: Frugal conversion of uniformly distributed random numbers from one range to another.
Peter O. is right: you cannot escape losing some randomness. So the only trade-offs are how expensive calls to U(1,3) are, code clarity, simplicity, etc.
Here is my variant, making bits from U(1,3) and combining them together with rejection
C/C++ (untested!)
int U13(); // your U(1,3)

int getBit() { // single unbiased random bit
    int v;
    do {
        v = U13();      // 1, 2 or 3
    } while (v == 3);   // reject 3, so that 0 and 1 are equally likely
    return v - 1;
}

int U15() {
    int r;
    for(;;) {
        int q = getBit() + 2*getBit() + 4*getBit(); // uniform in [0...8)
        if (q < 5) {      // need range [0...5)
            r = q + 1;    // q accepted, make it in [1...5]
            break;
        }
    }
    return r;
}

Minimum number of train station stops

I received this interview question and got stuck on it:
There are an infinite number of train stops starting from station number 0.
There are an infinite number of trains. The nth train stops at all of the k * 2^(n - 1) stops where k is between 0 and infinity.
When n = 1, the first train stops at stops 0, 1, 2, 3, 4, 5, 6, etc.
When n = 2, the second train stops at stops 0, 2, 4, 6, 8, etc.
When n = 3, the third train stops at stops 0, 4, 8, 12, etc.
Given a start station number and end station number, return the minimum number of stops between them. You can use any of the trains to get from one stop to another stop.
For example, the minimum number of stops between start = 1 and end = 4 is 3 because we can get from 1 to 2 to 4.
I'm thinking about a dynamic programming solution that would store in dp[start][end] the minimum number of steps between start and end. We'd build up the array using start...mid1, mid1...mid2, mid2...mid3, ..., midn...end. But I wasn't able to get it to work. How do you solve this?
Clarifications:
Trains can only move forward from a lower number stop to a higher number stop.
A train can be boarded at any station where it stops.
Trains can be boarded in any order. The n = 1 train can be boarded before or after boarding the n = 3 train.
Trains can be boarded multiple times. For example, it is permitted to board the n = 1 train, next board the n = 2 train, and finally board the n = 1 train again.
I don't think you need dynamic programming at all for this problem. It can basically be expressed by binary calculations.
If you convert the number of a station to binary it tells you right away how to get there from station 0, e.g.,
station 6 = 110
tells you that you need to take the n=3 train and the n=2 train each for one station. So the popcount of the binary representation tells you how many steps you need.
The next step is to figure out how to get from one station to another.
I'll show this again by example. Say you want to get from station 7 to station 23.
station 7 = 00111
station 23 = 10111
The first thing you want to do is to get to an intermediate stop. This stop is specified by
(highest bits that are equal in start and end station) + (first different bit) + (filled up with zeros)
In our example the intermediate stop is 16 (10000). The steps you need to make can be calculated by the difference of that number and the start station (7 = 00111). In our example this yields
10000 - 00111 = 1001
Now you know that you need 2 stops (the n=1 train and then the n=4 train) to get from 7 to 16.
The remaining task is to get from 16 to 23, again this can be solved by the corresponding difference
10111 - 10000 = 00111
So, you need another 3 stops to go from 16 to 23 (n=3, n=2, n=1). This gives you 5 stops in total, using just two binary differences and the popcount. The resulting path can be extracted from the bit representations: 7 -> 8 -> 16 -> 20 -> 22 -> 23.
Edit:
For further clarification of the intermediate stop let's assume we want to go from
station 5 = 101 to
station 7 = 111
the intermediate stop in this case will be 110, because
highest bits that are equal in start and end station = 1
first different bit = 1
filled up with zeros = 0
we need one step to go there (110 - 101 = 001) and one more to go from there to the end station (111 - 110 = 001).
About the intermediate stop
The concept of the intermediate stop is a bit clunky but I could not find a more elegant way in order to get the bit operations to work. The intermediate stop is the stop in between start and end where the highest level bit switches (that's why it is constructed the way it is). In this respect it is the stop at which the fastest train (between start and end) operates (actually all trains that you are able to catch stop there).
By subtracting the intermediate stop (bit representation) from the end station (bit representation) you reduce the problem to the simple case starting from station 0 (cf. first example of my answer).
By subtracting the start station from the intermediate stop you also reduce the problem to the simple case, but assume that you go from the intermediate stop to the start station which is equivalent to the other way round.
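A hedged C sketch of the whole computation (my own illustration using GCC builtins; stopsBetween is a made-up name and start <= end is assumed):

#include <stdint.h>

/* Drop the common high-bit prefix via XOR, build the intermediate stop
   (common prefix + first differing bit + zeros), then popcount the two
   differences as described above. */
int stopsBetween(uint64_t start, uint64_t end) {
    if (start == end)
        return 0;
    int hb = 63 - __builtin_clzll(start ^ end);  /* first differing bit */
    uint64_t mid = (end >> hb) << hb;            /* the intermediate stop */
    return __builtin_popcountll(mid - start)     /* steps start -> mid */
         + __builtin_popcountll(end - mid);      /* steps mid -> end   */
}

For the example above, stopsBetween(7, 23) finds the intermediate stop 16 and returns 2 + 3 = 5.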
First, ask if you can go backward. It sounds like you can't, but as presented here (which may not reflect the question as you received it), the problem never gives an explicit direction for any of these trains. (I see you've now edited your question to say you can't go backward.)
Assuming you can't go backward, the strategy is simple: always take the highest-numbered available train that doesn't overshoot your destination.
Suppose you're at stop s, and the highest-numbered train that stops at your current location and doesn't overshoot is train k. Traveling once on train k will take you to stop s + 2^(k-1). There is no faster way to get to that stop, and no way to skip that stop - no lower-numbered trains skip any of train k's stops, and no higher-numbered trains stop between train k's stops, so you can't get on a higher-numbered train before you get there. Thus, train k is your best immediate move.
With this strategy in mind, most of the remaining optimization is a matter of efficient bit twiddling tricks to compute the number of stops without explicitly figuring out every stop on the route.
I will attempt to prove my algorithm is optimal.
The algorithm is "take the fastest train that doesn't overshoot your destination".
Counting how many stops this takes is a bit tricky.
Encode both stops as binary numbers. I claim that an identical prefix can be neglected; the problem of going from a to b is the same as the problem of going from a+2^n to b+2^n if 2^n > b, as the stops between 2^n and 2^(n+1) are just the stops between 0 and 2^n shifted over.
From this, we can reduce a trip from a to b to guarantee that the high bit of b is set, and the same "high" bit of a is not set.
To solve going from 5 (101) to 7 (111), we merely have to solve going from 1 (01) to 3 (11), then shift our stop numbers up 4 (100).
To go from x to 2^n + y, where y < 2^n (and hence x < 2^n as well), we first want to go to 2^n, because there are no trains that skip over 2^n that do not also skip over 2^n + y < 2^(n+1).
So any set of stops between x and y must stop at 2^n.
Thus the optimal number of stops from x to 2^n + y is the number of stops from x to 2^n, followed by the number of stops from 2^n to 2^n+y, inclusive (or from 0 to y, which is the same).
The algorithm I propose to get from 0 to y is to start with the high order bit set, and take the train that gets you there, then go on down the list.
Claim: In order to generate a number with k 1s, you must take at least k trains. As proof, if you take a train and it doesn't cause a carry in your stop number, it sets 1 bit. If you take a train and it does cause a carry, the resulting number has at most 1 more set bit than it started with.
To get from x to 2^n is a bit trickier, but can be made simple by tracking the trains you take backwards.
Mapping s_i to s_{2^n-i} and reversing the train steps, any solution for getting from x to 2^n describes a solution for getting from 0 to 2^n-x. And any solution that is optimal for the forward one is optimal for the backward one, and vice versa.
Using the result for getting from 0 to y, we then get that the optimal route from a to b, where b's highest set bit is 2^n and a does not have that bit set, is #(b - 2^n) + #(2^n - a), where # means "the number of bits set in the binary representation". And in general, if a and b have a common prefix, simply drop that common prefix.
A local rule that generates the above number of steps is "take the fastest train in your current location that doesn't overshoot your destination".
For the part going from 2^n to 2^n+y we did that explicitly in our proof above. For the part going from x to 2^n this is trickier to see.
First, if the low order bit of x is set, obviously we have to take the first and only train we can take.
Second, imagine x has some collection of unset low-order bits, say m of them. If we played the train game going from x/2^m to 2^(n-m), then scaled the stop numbers by multiplying by 2^m we'd get a solution to going from x to 2^n.
And #((2^n - x)/2^m) = #(2^n - x). So this "scaled" solution is optimal.
From this, we are always taking the train corresponding to our low-order set bit in this optimal solution. This is the longest range train available, and it doesn't overshoot 2^n.
QED
This problem doesn't require dynamic programming.
Here is a simple implementation of a solution using GCC:
#include <stdint.h>

uint32_t min_stops(uint32_t start, uint32_t end)
{
    uint32_t stops = 0;
    if (start != 0) {
        // ride the longest boardable train while it doesn't overshoot
        while (start <= end - (1U << __builtin_ctz(start))) {
            start += 1U << __builtin_ctz(start);
            ++stops;
        }
    }
    // each remaining differing bit is one more train ride
    stops += __builtin_popcount(end ^ start);
    return stops;
}
The train schema is a map of powers of two. If you visualize the train lines as a bit representation, you can see that the lowest set bit represents the train line with the longest distance between stops that you can take. You can also take the lines with shorter distances.
To minimize the distance, you want to take the line with the longest distance possible, until that would make the end station unreachable. That's what adding the lowest set bit in the code does. Once you do this, some number of the upper bits will agree with the upper bits of the end station, while the lower bits will be zero.
At that point, it's simply a matter of taking a train for each bit set in the end station but not set in the current station. This is handled by __builtin_popcount in the code.
An example going from 5 to 39:
000101 5 // Start
000110 5+1=6
001000 6+2=8
010000 8+8=16
100000 16+16=32 // 32+32 > 39, so start reversing the process
100100 32+4=36 // Optimized with __builtin_popcount in code
100110 36+2=38 // Optimized with __builtin_popcount in code
100111 38+1=39 // Optimized with __builtin_popcount in code
As some have pointed out, since stops are all multiples of powers of 2, trains that stop more frequently also stop at the same stops as the more express trains. Any stop is on the first train's route, which stops at every station. Any stop is at most 1 unit away from the second train's route, which stops at every second station. Any stop is at most 3 units from the third train, which stops at every fourth station, and so on.
So start at the end and trace your route back in time: hop on the nearest multiple-of-power-of-2 train, and keep switching to the highest multiple-of-power-of-2 train you can as soon as possible (check the position of the least significant set bit; a multiple of a power of 2 can be divided by two, that is, bit-shifted right, without leaving a remainder as many times as it has trailing zeros in its bit representation), as long as its interval wouldn't miss the starting point after one stop. When the latter is the case, perform the reverse switch: hop on the next lower multiple-of-power-of-2 train and stay on it until its interval wouldn't miss the starting point after one stop, and so on.
We can figure this out doing nothing but a little counting and array manipulation. Like all the previous answers, we need to start by converting both numbers to binary and padding them to the same length. So 12 and 38 become 01100 and 10110.
Looking at station 12 and its least significant set bit (in this case the only set bit, 2^2): all trains with intervals larger than 2^2 won't stop at station 12, and all trains with intervals less than or equal to 2^2 will stop at station 12, but will require multiple stops to reach the same destination as the interval-4 train. So in every situation, up until we reach the largest set bit in the end value, we need to take the train with the interval of the least significant set bit of the current station.
If we are at station 0010110100, our sequence will be:
0010110100 2^2
0010111000 2^3
0011000000 2^6
0100000000 2^7
1000000000
Here we can eliminate all bits smaller than the least significant set bit and get the same count.
00101101 2^0
00101110 2^1
00110000 2^4
01000000 2^6
10000000
Trimming the ends at each stage, we get this:
00101101 2^0
0010111 2^0
0011 2^0
01 2^0
1
This could equally be described as the process of flipping all the 0 bits. Which brings us to the first half of the algorithm: count the unset bits in the zero-padded start number above the least significant set bit (or 1 if the start station is 0).
This will get us to the only intermediate station reachable by the train with the largest interval smaller than the end station, so all trains after this must be smaller than the previous train.
Now we need to get from station 1000000000 to station 1010110100. This is easier and more obvious: take the train with an interval equal to the most significant bit that is set in the destination but not set in the current station number.
1000000000 2^7
1010000000 2^5
1010100000 2^4
1010110000 2^2
1010110100
Similar to the first method, we can trim the most significant bit, which will always be set, then count the remaining 1s. So the second part of the algorithm is: count all the set bits below the most significant bit.
Then add the results from parts 1 and 2.
Adjusting the algorithm slightly to also report the train intervals, here is an example written in JavaScript so it can be run here.
function calculateStops(start, end) {
    var result = {
        start: start,
        end: end,
        count: 0,
        trains: [],
        reverse: false
    };
    // If equal there are 0 stops
    if (start === end) return result;
    // If start is greater than end, reverse the values and
    // add note to reverse the results
    if (start > end) {
        start = result.end;
        end = result.start;
        result.reverse = true;
    }
    // Convert start and end values to array of binary bits
    // with the exponent matched to the index of the array
    start = (start >>> 0).toString(2).split('').reverse();
    end = (end >>> 0).toString(2).split('').reverse();
    // We can trim off any matching significant digits
    // The stop pattern for 10 to 13 is the same as
    // the stop pattern for 2 to 5 offset by 8
    while (start[end.length - 1] === end[end.length - 1]) {
        start.pop();
        end.pop();
    }
    // Trim off the most significant bit of the end,
    // we don't need it
    end.pop();
    // Front fill zeros on the starting value
    // to make the counting easier
    while (start.length < end.length) {
        start.push('0');
    }
    // We can break the algorithm in half
    // getting from the start value to the form
    // 10...0 with only 1 bit set and then getting
    // from that point to the end.
    var index;
    var trains = [];
    var expected = '1';
    // Now we loop through the digits on the end
    // any 1 we find can be added to a temporary array
    for (index in end) {
        if (end[index] === expected) {
            result.count++;
            trains.push(Math.pow(2, index));
        }
    }
    // if the start value is 0, we can get to the
    // intermediate step in one trip, so we can
    // just set this to 1, checking both start and
    // end because they can be reversed
    if (result.start == 0 || result.end == 0) {
        index++;
        result.count++;
        result.trains.push(Math.pow(2, index));
    // We need to find the first '1' digit, then all
    // subsequent 0 digits, as these are the ones we
    // need to flip
    } else {
        for (index in start) {
            if (start[index] === expected) {
                result.count++;
                result.trains.push(Math.pow(2, index));
                expected = '0';
            }
        }
    }
    // add the second set to the first set, reversing
    // it to get them in the right order.
    result.trains = result.trains.concat(trains.reverse());
    // Reverse the stop list if the trip is reversed
    if (result.reverse) result.trains = result.trains.reverse();
    return result;
}
$(document).ready(function () {
    $("#submit").click(function () {
        var trains = calculateStops(
            parseInt($("#start").val()),
            parseInt($("#end").val())
        );
        $("#out").html(trains.count);
        var current = trains.start;
        var stopDetails = 'Starting at station ' + current + '<br/>';
        for (index in trains.trains) {
            current = trains.reverse ? current - trains.trains[index] : current + trains.trains[index];
            stopDetails = stopDetails + 'Take train with interval ' + trains.trains[index] + ' to station ' + current + '<br/>';
        }
        $("#stops").html(stopDetails);
    });
});
label {
display: inline-block;
width: 50px;
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<label>Start</label> <input id="start" type="number" /> <br>
<label>End</label> <input id="end" type="number" /> <br>
<button id="submit">Submit</button>
<p>Shortest route contains <span id="out">0</span> stops</p>
<p id="stops"></p>
Simple Java solution
public static int minimumNumberOfStops(int start, final int end) {
    // The example in the question counts stations visited:
    // the minimum number of stops between start = 1 and end = 4 is 3
    // because we can get from 1 to 2 to 4 - so the count starts at 1
    int stops = 1;
    while (start < end) {
        // Only trains whose interval divides the current station stop there,
        // so the largest boardable interval is the lowest set bit of start
        // (every train stops at station 0)
        int boardable = (start == 0)
                ? Integer.highestOneBit(end)
                : Integer.lowestOneBit(start);
        // also cap the interval so we never overshoot the destination
        start += Math.min(boardable, findClosestPowerOfTwoLessOrEqualThan(end - start));
        stops++;
    }
    return stops;
}

private static int findClosestPowerOfTwoLessOrEqualThan(final int i) {
    if (i > 1) {
        return 2 << (30 - Integer.numberOfLeadingZeros(i));
    }
    return 1;
}
NOTICE: The reason for the current comments under my answer is that I first wrote this algorithm completely wrong and user2357112 made me aware of my mistakes. So I completely removed that algorithm and wrote a new one according to what user2357112 answered to this question. I also added some comments to this algorithm to clarify what happens in each line.
This algorithm starts at procedure main(Origin, Dest) and it simulates our movements toward the destination with updateOrigin(Origin, Dest).
procedure main(Origin, Dest){
    //at the end we have the minimum number of steps in this variable
    counter = 0;
    while(Origin != Dest){
        //we simulate our movement toward the destination with this
        Origin = updateOrigin(Origin, Dest);
        counter = counter + 1;
    }
}

procedure updateOrigin(Origin, Dest){
    if (Origin == 1) return 2;

    //we must find which train passes through our origin; what comes out of this
    //IF clause is NOT the exact choice and we still have to do some calculation later
    if (Origin == 0){
        //all trains pass through stop 0, thus we can choose our train according to the destination
        n = Log2(Dest);
    }else{
        //it's a good starting point to check if it passes through our origin
        n = Log2(Origin);
    }

    //now let's choose the exact train which passes through the origin and doesn't overshoot the destination
    counter = 0;
    do {
        temp = counter * 2 ^ (n - 1);
        if (temp == Origin){
            //we have found a suitable train; return where we move to
            return Origin + 2 ^ (n - 1);
        } elseif (temp < Origin){
            //we still don't know if this train passes through our origin; let's check the next multiple
            counter = counter + 1;
        } else {
            //this train overshot our origin; try the next slower train
            n = n - 1;
            counter = 0;
        }
    } while(true) //loop until a suitable train is found and we return
}

Memory-constrained coin changing for numbers up to one billion

I faced this problem in a training contest. We are given N different values (N ≤ 100). Let's name this array A[N]; for this array A we are sure that we have 1 in the array and that A[i] ≤ 10^9. Secondly we are given a number S where S ≤ 10^9.
Now we have to solve the classic coin problem with these values. Actually we need to find the minimum number of elements which sum to exactly S. Every element from A can be used an infinite number of times.
Time limit: 1 sec
Memory limit: 256 MB
Example:
S = 1000, N = 10
A[] = {1,12,123,4,5,678,7,8,9,10}. The result is 10.
1000 = 678 + 123 + 123 + 12 + 12 + 12 + 12 + 12 + 12 + 4
What I have tried
I tried to solve this with the classic dynamic programming coin-change technique, but it uses too much memory and gives "memory limit exceeded".
I can't figure out what we should keep about those values. Thanks in advance.
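For reference, here is a sketch of the kind of classic DP I mean (reconstructed for illustration; the function name is a placeholder):

#include <stdint.h>
#include <stdlib.h>

/* Classic coin-change DP: dp[s] = minimum coins summing to s. Correct,
   but the table alone needs (S+1)*4 bytes, about 4 GB for S = 10^9,
   far beyond the 256 MB limit. */
uint32_t classicDp(uint32_t S, const uint32_t *A, int N) {
    uint32_t *dp = malloc(((size_t)S + 1) * sizeof *dp);
    if (dp == NULL)          /* under a 256 MB limit this is where it dies */
        return UINT32_MAX;
    dp[0] = 0;
    for (uint32_t s = 1; s <= S; ++s) {
        dp[s] = UINT32_MAX;
        for (int i = 0; i < N; ++i)
            if (A[i] <= s && dp[s - A[i]] != UINT32_MAX && dp[s - A[i]] + 1 < dp[s])
                dp[s] = dp[s - A[i]] + 1;
    }
    uint32_t answer = dp[S];
    free(dp);
    return answer;
}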
Here are a couple of test cases that cannot be solved with the classic DP coin approach.
S = 1000000000 N = 100
1 373241370 973754081 826685384 491500595 765099032 823328348 462385937
251930295 819055757 641895809 106173894 898709067 513260292 548326059
741996520 959257789 328409680 411542100 329874568 352458265 609729300
389721366 313699758 383922849 104342783 224127933 99215674 37629322
230018005 33875545 767937253 763298440 781853694 420819727 794366283
178777428 881069368 595934934 321543015 27436140 280556657 851680043
318369090 364177373 431592761 487380596 428235724 134037293 372264778
267891476 218390453 550035096 220099490 71718497 860530411 175542466
548997466 884701071 774620807 118472853 432325205 795739616 266609698
242622150 433332316 150791955 691702017 803277687 323953978 521256141
174108096 412366100 813501388 642963957 415051728 740653706 68239387
982329783 619220557 861659596 303476058 85512863 72420422 645130771
228736228 367259743 400311288 105258339 628254036 495010223 40223395
110232856 856929227 25543992 957121494 359385967 533951841 449476607
134830774
OUTPUT FOR THIS TEST CASE: 5
S = 999865497 N = 7
1 267062069 637323855 219276511 404376890 528753603 199747292
OUTPUT FOR THIS TEST CASE: 1129042
S = 1000000000 N = 40
1 12 123 4 5 678 7 8 9 10 400 25 23 1000 67 98 33 46 79 896 11 112 1223 412
532 6781 17 18 19 170 1400 925 723 11000 607 983 313 486 739 896
OUTPUT FOR THIS TEST CASE: 90910
(NOTE: Updated and edited for clarity. Complexity Analysis added at the end.)
OK, here is my solution, including my fixes to the performance issues found by @PeterdeRivaz. I have tested this against all of the test cases provided in the question and the comments, and it finishes all of them in under a second (well, 1.5 s in one case), using primarily only the memory for the partial-results cache (I'd guess about 16 MB).
Rather than using the traditional DP solution (which is both too slow and requires too much memory), I use a depth-first, greedy-first combinatorial search with pruning against the current best result. I was surprised (very) that this works as well as it does, but I still suspect that you could construct test sets that would take a worst-case exponential amount of time.
First there is a master function that is the only thing calling code needs to call. It handles all of the setup and initialization and calls everything else. (All code is C#.)
// Find the min# of coins for a specified sum
int CountChange(int targetSum, int[] coins)
{
    // init the cache for (partial) memoization
    PrevResultCache = new PartialResult[1048576];
    // make sure the coins are sorted lowest to highest
    Array.Sort(coins);
    int curBest = targetSum;
    int result = CountChange_r(targetSum, coins, coins.GetLength(0)-1, 0, ref curBest);
    return result;
}
Because of the problem test cases raised by @PeterdeRivaz, I have also added a partial-results cache to handle when there are large numbers in N[] that are close together.
Here is the code for the cache:
// implement a very simple cache for previous results of remainder counts
struct PartialResult
{
    public int PartialSum;
    public int CoinVal;
    public int RemainingCount;
}
PartialResult[] PrevResultCache;

// checks the partial count cache for already calculated results
int PrevAddlCount(int currSum, int currCoinVal)
{
    int cacheAddr = currSum & 1048575;  // AND with (2^20-1) to get only the first 20 bits
    PartialResult prev = PrevResultCache[cacheAddr];
    // use it, as long as it's actually the same partial sum
    // and the coin value is at least as large as the current coin
    if ((prev.PartialSum == currSum) && (prev.CoinVal >= currCoinVal))
    {
        return prev.RemainingCount;
    }
    // otherwise flag as empty
    return 0;
}

// add or overwrite a new value to the cache
void AddPartialCount(int currSum, int currCoinVal, int remainingCount)
{
    int cacheAddr = currSum & 1048575;  // AND with (2^20-1) to get only the first 20 bits
    PartialResult prev = PrevResultCache[cacheAddr];
    // only add if the Sum is different or the result is better
    if ((prev.PartialSum != currSum)
        || (prev.CoinVal <= currCoinVal)
        || (prev.RemainingCount == 0)
        || (prev.RemainingCount >= remainingCount)
       )
    {
        prev.PartialSum = currSum;
        prev.CoinVal = currCoinVal;
        prev.RemainingCount = remainingCount;
        PrevResultCache[cacheAddr] = prev;
    }
}
And here is the code for the recursive function that does the actual counting:
/*
 * Find the minimum number of coins required totaling to a specific sum
 * using a list of coin denominations passed.
 *
 * Memory Requirements: O(N) where N is the number of coin denominations
 *                      (primarily for the stack)
 *
 * CPU requirements: O(Sqrt(S)*N) where S is the target Sum
 *                   (Average, estimated. This is very hard to figure out.)
 */
int CountChange_r(int targetSum, int[] coins, int coinIdx, int curCount, ref int curBest)
{
    int coinVal = coins[coinIdx];
    int newCount = 0;

    // check to see if we are at the end of the search tree (curIdx=0, coinVal=1)
    // or we have reached the targetSum
    if ((coinVal == 1) || (targetSum == 0))
    {
        // just use math to get the final total for this path/combination
        newCount = curCount + targetSum;
        // update, if we have a new curBest
        if (newCount < curBest) curBest = newCount;
        return newCount;
    }

    // prune this whole branch, if it cannot possibly improve the curBest
    int bestPossible = curCount + (targetSum / coinVal);
    if (bestPossible >= curBest)
        return bestPossible;  // NOTE: this is a false answer, but it shouldn't matter
                              // because we should never use it.

    // check the cache to see if a remainder-count for this partial sum
    // already exists (and used coins at least as large as ours)
    int prevRemCount = PrevAddlCount(targetSum, coinVal);
    if (prevRemCount > 0)
    {
        // it exists, so use it (total = coins used so far + cached remainder)
        newCount = curCount + prevRemCount;
        // update, if we have a new curBest
        if (newCount < curBest) curBest = newCount;
        return newCount;
    }

    // always try the largest remaining coin first, starting with the
    // maximum possible number of that coin (greedy-first searching)
    newCount = curCount + targetSum;
    for (int cnt = targetSum / coinVal; cnt >= 0; cnt--)
    {
        int tmpCount = CountChange_r(targetSum - (cnt * coinVal), coins, coinIdx - 1, curCount + cnt, ref curBest);
        if (tmpCount < newCount) newCount = tmpCount;
    }

    // Add our new partial result to the cache
    AddPartialCount(targetSum, coinVal, newCount - curCount);
    return newCount;
}
Analysis:
Memory: Memory usage is pretty easy to determine for this algorithm. Basically there's only the partial-results cache and the stack. The cache is fixed at approx. 1 million entries times the size of each entry (3*4 bytes), so about 12 MB. The stack is limited to O(N), so together, memory is clearly not a problem.
CPU: The run-time complexity of this algorithm starts out hard to determine and then gets harder, so please excuse me because there's a lot of hand-waving here. I tried to search for an analysis of just the brute-force problem (a combinatorial search of sums of multiples of the N base values totaling S), but not much turned up. What little there was tended to say it was O(N^S), which is clearly too high. I think that a fairer estimate is O(N^(S/N)), or possibly O(N^(S/AVG(N))) or even O(N^(S/Gmean(N))), where Gmean(N) is the geometric mean of the elements of N[]. This solution starts out with the brute-force combinatorial search and then improves it with two significant optimizations.
The first is the pruning of branches based on estimates of the best possible results for that branch versus the best result it has already found. If the best-case estimators were perfectly accurate and the work for branches was perfectly distributed, this would mean that if we find a result that is better than 90% of the other possible cases, then pruning would effectively eliminate 90% of the work from that point on. To make a long story short, this should work out so that the amount of work still remaining after pruning shrinks harmonically as it progresses. Assuming that some kind of summing/integration should be applied to get a work total, this appears to me to work out to a logarithm of the original work. So let's call it O(Log(N^(S/N))), or O(N*Log(S/N)), which is pretty darn good. (Though O(N*Log(S/Gmean(N))) is probably more accurate.)
However, there are two obvious holes with this. First, it is true that the best-case estimators are not perfectly accurate and thus they will not prune as effectively as assumed above, but this is somewhat counter-balanced by the greedy-first ordering of the branches, which gives the best chances for finding better solutions early in the search, which increases the effectiveness of pruning.
The second problem is that the best-case estimator works better when the different values of N are far apart. Specifically, if |S/n_2 - S/n_1| > 1 for any two values n_1, n_2 in N, then it becomes almost perfectly effective. For values of N less than SQRT(S), even two adjacent values (k, k+1) are far enough apart that this rule applies. However, for increasing values above SQRT(S) a window opens up so that any number of N-values within that window will not be able to effectively prune each other. The size of this window is approximately K/SQRT(S). So if S = 10^9, when K is around 10^6 this window will be almost 30 numbers wide. This means that N[] could contain 1 plus every number from 1000001 to 1000029 and the pruning optimization would provide almost no benefit.
To address this, I added the partial-results cache, which allows memoization of the most recent partial sums up to the target S. This takes advantage of the fact that when the N-values are close together, they will tend to have an extremely high number of duplicates in their sums. As best as I can figure, this effectiveness is approximately N times the J-th root of the problem size, where J = S/K and K is some measure of the average size of the N-values (Gmean(N) is probably the best estimate). If we apply this to the brute-force combinatorial search, assuming that pruning is ineffective, we get O((N^(S/Gmean(N)))^(1/Gmean(N))), which I think is also O(N^(S/(Gmean(N)^2))).
So, at this point take your pick. I know this is really sketchy, and even if it is correct, it is still very sensitive to the distribution of the N-values, so lots of variance.
[I've replaced the previous idea about bit operations because it seems to be too time consuming]
A somewhat crazy and incomplete idea, but it may work.
Let's start by introducing f(n,s), which returns the number of combinations in which s can be composed from n coins.
Now, how is f(n+1,s) related to f(n,·)?
One of possible ways to calculate it is:
f(n+1,s)=sum[coin:coins]f(n,s-coin)
For example, if we have coins 1 and 3,
f(0,)=[1,0,0,0,0,0,0,0] - with zero coins we can have only zero sum
f(1,)=[0,1,0,1,0,0,0,0] - what we can have with one coin
f(2,)=[0,0,1,0,2,0,1,0] - what we can have with two coins
We can rewrite it a bit differently:
f(n+1,s)=sum[i=0..max]f(n,s-i)*a(i)
a(i)=1 if we have coin i and 0 otherwise
What we have here is convolution: f(n+1,)=conv(f(n,),a)
https://en.wikipedia.org/wiki/Convolution
Computing it as the definition suggests gives O(n^2).
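As a small sanity check, here is a hedged C demo of that direct convolution; it reproduces the f(1,·) and f(2,·) rows above for the example coins 1 and 3 (the table size S is illustrative):

#include <stdio.h>
#include <string.h>

#define S 8   /* illustrative table size, matching the rows above */

int main(void) {
    int a[S] = {0};
    a[1] = a[3] = 1;                 /* a(i) = 1 iff we have coin i */
    long f[S] = {1};                 /* f(0,.) = [1,0,0,...] */
    for (int n = 1; n <= 2; ++n) {
        long g[S] = {0};
        for (int s = 0; s < S; ++s)  /* direct convolution: O(S^2) */
            for (int i = 0; i <= s; ++i)
                g[s] += f[s - i] * (long)a[i];
        memcpy(f, g, sizeof f);
        printf("f(%d,.) =", n);      /* prints the rows shown above */
        for (int s = 0; s < S; ++s)
            printf(" %ld", f[s]);
        printf("\n");
    }
    return 0;
}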
But we can use a Fourier transform to reduce it to O(n*log n).
https://en.wikipedia.org/wiki/Convolution#Convolution_theorem
So now we have a more-or-less cheap way to find out which sums are possible with n coins without going incrementally: just calculate the n-th power of F(a) and apply the inverse Fourier transform.
This allows us to do a kind of binary search, which can help handle cases when the answer is big.
As I said, the idea is incomplete: for now I have no idea how to combine the bit representation with Fourier transforms (to satisfy the memory constraint) and whether we will fit into 1 second on any "regular" CPU...

Generate an integer that is not among four billion given ones

I have been given this interview question:
Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB memory. Follow up with what you would do if you have only 10 MB of memory.
My analysis:
The size of the file is 4×10^9×4 bytes = 16 GB.
We can do external sorting, thus letting us know the range of the integers.
My question is what is the best way to detect the missing integer in the sorted big integer sets?
My understanding (after reading all the answers):
Assuming we are talking about 32-bit integers, there are 2^32 ≈ 4×10^9 distinct integers.
Case 1: we have 1 GB = 10^9 × 8 bits = 8 billion bits of memory.
Solution:
If we use one bit to represent one distinct integer, it is enough; we don't need to sort.
Implementation:
int radix = 8;
byte[] bitfield = new byte[0xffffffff/radix];

void F() throws FileNotFoundException {
    Scanner in = new Scanner(new FileReader("a.txt"));
    while (in.hasNextInt()) {
        int n = in.nextInt();
        bitfield[n/radix] |= (1 << (n%radix));
    }
    for (int i = 0; i < bitfield.length; i++) {
        for (int j = 0; j < radix; j++) {
            if ((bitfield[i] & (1 << j)) == 0) System.out.print(i*radix + j);
        }
    }
}
Case 2: 10 MB memory = 10 × 10^6 × 8 bits = 80 million bits
Solution:
For all possible 16-bit prefixes, there are 2^16 = 65536 buckets, and we need 2^16 × 4 × 8 = 2 million bits for their counters. Each bucket needs a 4-byte counter because in the worst case all 4 billion integers belong to the same bucket.
Build the counter of each bucket through the first pass through the file.
Scan the buckets; find the first one that has fewer than 65536 hits.
Build new buckets for the numbers whose high 16-bit prefix matches the one found in step 2, through a second pass of the file.
Scan the buckets built in step 3; find the first bucket which doesn't have a hit.
The code is very similar to the one above; a sketch of the two-pass version follows.
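A hedged C sketch of those two passes (the file name and the raw binary input format are assumptions):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    static uint32_t counts[1 << 16];          /* 256 KB of prefix counters */
    static uint8_t bits[1 << 13];             /* 8 KB bitmap for one bucket */
    FILE *f = fopen("ints.bin", "rb");
    if (!f) return 1;
    uint32_t x;
    while (fread(&x, sizeof x, 1, f) == 1)    /* pass 1: count prefixes */
        counts[x >> 16]++;
    uint32_t bucket = 0;
    while (counts[bucket] >= 65536)           /* find an incomplete bucket */
        bucket++;
    rewind(f);
    while (fread(&x, sizeof x, 1, f) == 1)    /* pass 2: bitmap that bucket */
        if ((x >> 16) == bucket)
            bits[(x & 0xFFFF) >> 3] |= 1 << (x & 7);
    for (uint32_t i = 0; i < 65536; ++i)
        if (!(bits[i >> 3] & (1 << (i & 7)))) {
            printf("%u\n", (bucket << 16) | i);  /* a missing integer */
            break;
        }
    fclose(f);
    return 0;
}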
Conclusion:
We decrease memory by increasing the number of passes through the file.
A clarification for those arriving late: the question, as asked, does not say that there is exactly one integer that is not contained in the file (at least that's not how most people interpret it), though much of the discussion below is about that variation of the task.
Assuming that "integer" means 32 bits: 10 MB of space is more than enough for you to count how many numbers there are in the input file with any given 16-bit prefix, for all possible 16-bit prefixes in one pass through the input file. At least one of the buckets will have be hit less than 216 times. Do a second pass to find of which of the possible numbers in that bucket are used already.
If it means more than 32 bits, but still of bounded size: Do as above, ignoring all input numbers that happen to fall outside the (signed or unsigned; your choice) 32-bit range.
If "integer" means mathematical integer: Read through the input once and keep track of the largest number length of the longest number you've ever seen. When you're done, output the maximum plus one a random number that has one more digit. (One of the numbers in the file may be a bignum that takes more than 10 MB to represent exactly, but if the input is a file, then you can at least represent the length of anything that fits in it).
Statistically informed algorithms solve this problem using fewer passes than deterministic approaches.
If very large integers are allowed then one can generate a number that is likely to be unique in O(1) time. A pseudo-random 128-bit integer like a GUID will only collide with one of the existing four billion integers in the set in less than one out of every 64 billion billion billion cases.
If integers are limited to 32 bits then one can generate a number that is likely to be unique in a single pass using much less than 10 MB. The odds that a pseudo-random 32-bit integer will collide with one of the 4 billion existing integers is about 93% (4e9 / 2^32). The odds that 1000 pseudo-random integers will all collide is less than one in 12,000 billion billion billion (odds-of-one-collision ^ 1000). So if a program maintains a data structure containing 1000 pseudo-random candidates and iterates through the known integers, eliminating matches from the candidates, it is all but certain to find at least one integer that is not in the file.
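A hedged C sketch of that candidate-elimination idea (reading raw 32-bit ints from stdin is an assumption, and the linear candidate scan is deliberately unoptimized):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NCAND 1000   /* number of random candidates to track */

int main(void) {
    uint32_t cand[NCAND];
    int alive[NCAND];
    for (int i = 0; i < NCAND; ++i) {
        /* crude pseudo-random 32-bit value; fine for a sketch */
        cand[i] = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
        alive[i] = 1;
    }
    uint32_t x;
    /* one pass over the file, crossing off candidates that appear */
    while (fread(&x, sizeof x, 1, stdin) == 1)
        for (int i = 0; i < NCAND; ++i)
            if (alive[i] && cand[i] == x)
                alive[i] = 0;
    for (int i = 0; i < NCAND; ++i)
        if (alive[i]) {
            printf("%u\n", cand[i]);   /* a value not present in the file */
            return 0;
        }
    return 1;   /* astronomically unlikely: every candidate appeared */
}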
A detailed discussion of this problem can be found in Jon Bentley, "Column 1. Cracking the Oyster", Programming Pearls, Addison-Wesley, pp. 3-10.
Bentley discusses several approaches, including external sort, Merge Sort using several external files etc., But the best method Bentley suggests is a single pass algorithm using bit fields, which he humorously calls "Wonder Sort" :)
Coming to the problem, 4 billion numbers can be represented in :
4 billion bits = (4000000000 / 8) bytes = about 0.466 GB
The code to implement the bitset is simple (taken from the solutions page):
#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];
void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); }
void clr(int i) { a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int test(int i){ return a[i>>SHIFT] & (1<<(i & MASK)); }
Bentley's algorithm makes a single pass over the file, setting the appropriate bit in the array, and then examines this array using the test macro above to find the missing number.
If the available memory is less than 0.466 GB, Bentley suggests a k-pass algorithm, which divides the input into ranges depending on available memory. To take a very simple example, if only 1 byte (i.e. memory to handle 8 numbers) were available and the range was from 0 to 31, we would divide this into the ranges 0 to 7, 8 to 15, 16 to 23, 24 to 31, and handle each range in one of 32/8 = 4 passes.
HTH.
Since the problem does not specify that we have to find the smallest possible number that is not in the file we could just generate a number that is longer than the input file itself. :)
For the 1 GB RAM variant you can use a bit vector. You need to allocate 4 billion bits == a 500 MB byte array. For each number you read from the input, set the corresponding bit to '1'. Once you're done, iterate over the bits and find the first one that is still '0'. Its index is the answer.
If they are 32-bit integers (likely, given the choice of ~4 billion numbers close to 2^32), your list of 4 billion numbers will take up at most 93% of the possible integers (4×10^9 / 2^32). So if you create a bit array of 2^32 bits with each bit initialized to zero (which will take up 2^29 bytes ~ 500 MB of RAM; remember a byte = 2^3 bits = 8 bits), read through your integer list and for each int set the corresponding bit-array element from 0 to 1; and then read through your bit array and return the first bit that's still 0.
In the case where you have less RAM (~10 MB), this solution needs to be slightly modified. 10 MB ~ 83886080 bits is still enough to do a bit array for all numbers between 0 and 83886079. So you could read through your list of ints and only record numbers that are between 0 and 83886079 in your bit array. If the numbers are randomly distributed, then with overwhelming probability (it differs from 100% by about 10^-2592069) you will find a missing int. In fact, if you only choose numbers 1 to 2048 (with only 256 bytes of RAM) you'd still find a missing number an overwhelming percentage (99.99999999999999999999999999999999999999999999999999999999999995%) of the time.
But let's say instead of having about 4 billion numbers, you had something like 2^32 - 1 numbers and less than 10 MB of RAM, so any small range of ints only has a small chance of not containing the number.
If you were guaranteed that each int in the list was unique, you could sum the numbers and subtract your sum from the full sum (1/2)(2^32)(2^32 - 1) = 9223372034707292160 to find the missing int. However, if an int occurred twice this method will fail.
However, you can always divide and conquer. A naive method would be to read through the array and count the number of numbers that are in the first half (0 to 2^31 - 1) and second half (2^31 to 2^32 - 1). Then pick the range with fewer numbers and repeat dividing that range in half. (Say there were two fewer numbers in (2^31, 2^32 - 1); then your next search would count the numbers in the ranges (2^31, 3*2^30 - 1) and (3*2^30, 2^32 - 1).) Keep repeating until you find a range with zero numbers and you have your answer. This should take O(lg N) ~ 32 reads through the array.
That method was inefficient. We are only using two integers in each step (or about 8 bytes of RAM with a 4-byte (32-bit) integer). A better method would be to divide into sqrt(2^32) = 2^16 = 65536 bins, each covering 65536 numbers. Each bin requires 4 bytes to store its count, so you need 2^18 bytes = 256 kB. So bin 0 is (0 to 65535 = 2^16 - 1), bin 1 is (2^16 = 65536 to 2*2^16 - 1 = 131071), bin 2 is (2*2^16 = 131072 to 3*2^16 - 1 = 196607). In Python you'd have something like:
import numpy as np
nums_in_bin = np.zeros(65536, dtype=np.uint32)
for N in four_billion_int_array:
    nums_in_bin[N // 65536] += 1
for bin_num, bin_count in enumerate(nums_in_bin):
    if bin_count < 65536:
        break  # we have found an incomplete bin with missing ints (bin_num)
Read through the ~4 billion integer list and count how many ints fall into each of the 2^16 bins; find an incomplete bin that doesn't have all 65536 numbers. Then read through the 4 billion integer list again, but this time only take note when integers are in that range, flipping a bit when you find them.
del nums_in_bin  # allow gc to free old 256kB array
from bitarray import bitarray
my_bit_array = bitarray(65536)  # 32 kB
my_bit_array.setall(0)
for N in four_billion_int_array:
    if N // 65536 == bin_num:
        my_bit_array[N % 65536] = 1
for i, bit in enumerate(my_bit_array):
    if not bit:
        print bin_num*65536 + i
        break
Why make it so complicated? You ask for an integer not present in the file?
According to the rules specified, the only thing you need to store is the largest integer that you encountered so far in the file. Once the entire file has been read, return a number 1 greater than that.
There is no risk of hitting maxint or anything, because according to the rules, there is no restriction to the size of the integer or the number returned by the algorithm.
This can be solved in very little space using a variant of binary search.
Start off with the allowed range of numbers, 0 to 4294967295.
Calculate the midpoint.
Loop through the file, counting how many numbers were equal to, less than, or higher than the midpoint value.
If no numbers were equal, you're done. The midpoint number is the answer.
Otherwise, choose the range that had the fewest numbers and repeat from step 2 with this new range.
This will require up to 32 linear scans through the file, but it will only use a few bytes of memory for storing the range and the counts.
This is essentially the same as Henning's solution, except it uses two bins instead of 16k.
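For illustration, a hedged C sketch of the bisection (the FILE-based input is an assumption, and counting only the lower half suffices):

#include <stdint.h>
#include <stdio.h>

/* Repeatedly scan the file, counting how many values fall into the lower
   half of the current range; recurse into whichever half has a gap. */
uint32_t find_missing(FILE *f) {
    uint64_t lo = 0, hi = UINT32_MAX;            /* inclusive range */
    while (lo < hi) {
        uint64_t mid = lo + (hi - lo) / 2;
        uint64_t below = 0;
        uint32_t x;
        rewind(f);
        while (fread(&x, sizeof x, 1, f) == 1)   /* one pass per halving */
            if (x >= lo && x <= mid)
                ++below;
        if (below < mid - lo + 1)
            hi = mid;      /* the lower half is missing a value */
        else
            lo = mid + 1;  /* otherwise the upper half must be */
    }
    return (uint32_t)lo;
}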
EDIT Ok, this wasn't quite thought through as it assumes the integers in the file follow some static distribution. Apparently they don't need to, but even then one should try this:
There are ≈4.3 billion 32-bit integers. We don't know how they are distributed in the file, but the worst case is the one with the highest Shannon entropy: an equal distribution. In this case, the probability for any one integer to not occur in the file is
((2^32 - 1)/2^32)^(4·10^9) ≈ 0.4
The lower the Shannon entropy, the higher this probability gets on average, but even for this worst case we have a 90% chance of finding a non-occurring number after 5 guesses with random integers. Just create such numbers with a pseudorandom generator and store them in a list. Then read int after int and compare it to all of your guesses. When there's a match, remove that list entry. After having been through all of the file, chances are you will have more than one guess left; use any of them. In the rare (10%, even in the worst case) event of no guess remaining, get a new set of random integers, perhaps more this time (10 -> 99%).
Memory consumption: a few dozen bytes. Complexity: O(n). Overhead: negligible, as most of the time will be spent in the unavoidable hard disk accesses rather than comparing ints anyway.
The actual worst case, when we do not assume a static distribution, is that every integer occurs at most once, because then only
1 - 4000000000/2^32 ≈ 6%
of all integers don't occur in the file. So you'll need some more guesses, but that still won't cost hurtful amounts of memory.
If you have one integer missing from the range [0, 2^x - 1] then just xor them all together. For example:
>>> 0 ^ 1 ^ 3
2
>>> 0 ^ 1 ^ 2 ^ 3 ^ 4 ^ 6 ^ 7
5
(I know this doesn't answer the question exactly, but it's a good answer to a very similar question.)
They may be looking to see if you have heard of a probabilistic Bloom Filter, which can very efficiently determine absolutely that a value is not part of a large set (but can only determine with high probability that it is a member of the set).
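A minimal sketch of the idea, reusing the third-party bitarray package from earlier in this thread; the two multiplicative hash mixes are stand-ins chosen for brevity, not a vetted choice:
from bitarray import bitarray

class BloomFilter(object):
    def __init__(self, m=8 * 1024 * 1024):   # m bits, ~1 MB here
        self.m = m
        self.bits = bitarray(m)
        self.bits.setall(0)
    def _positions(self, n):
        # two cheap hash mixes; real filters use k independent hashes
        return ((n * 2654435761) % self.m,
                ((n >> 16) * 40503 + 2531011) % self.m)
    def add(self, n):
        for p in self._positions(n):
            self.bits[p] = 1
    def might_contain(self, n):
        # False is definitive; True may be a false positive
        return all(self.bits[p] for p in self._positions(n))
Add every integer from the file to the filter, then probe candidates (random 32-bit values, say): any candidate for which might_contain returns False is guaranteed absent from the file.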
Based on the current wording in the original question, the simplest solution is:
Find the maximum value in the file, then add 1 to it.
Use a BitSet. 4 billion integers (assuming up to 2^32 integers) packed into a BitSet at 8 per byte is 2^32 / 2^3 = 2^29 bytes, approx 0.5 GB.
To add a bit more detail - every time you read a number, set the corresponding bit in the BitSet. Then, do a pass over the BitSet to find the first number that's not present. In fact, you could do this just as effectively by repeatedly picking a random number and testing if it's present.
Actually BitSet.nextClearBit(0) will tell you the first non-set bit.
Looking at the BitSet API, it appears to only support 0..MAX_INT, so you may need 2 BitSets - one for +'ve numbers and one for -'ve numbers - but the memory requirements don't change.
If there is no size limit, the quickest way is to take the length of the file and generate a number with one more digit than the file has characters (random digits, or just all "1"s). Advantage: you don't even need to read the file, and you can minimize memory use nearly to zero. Disadvantage: you will print billions of digits.
However, if the only factor was minimizing memory usage, and nothing else is important, this would be the optimal solution. It might even get you a "worst abuse of the rules" award.
If we assume that the range of numbers will always be 2^n (an even power of 2), then exclusive-or will work (as shown by another poster). As far as why, let's prove it:
The Theory
Given any 0 based range of integers that has 2^n elements with one element missing, you can find that missing element by simply xor-ing the known values together to yield the missing number.
The Proof
Let's look at n = 2. For n=2, we can represent 4 unique integers: 0, 1, 2, 3. They have a bit pattern of:
0 - 00
1 - 01
2 - 10
3 - 11
Now, if we look, each and every bit is set exactly twice. Therefore, since each bit is set an even number of times, an exclusive-or of the numbers will yield 0. If a single number is missing, the exclusive-or of the rest will yield a number that, when exclusive-ored with the missing number, results in 0. Therefore, the missing number and the resulting exclusive-ored number are exactly the same. If we remove 2, the resulting xor will be 10 (or 2).
Now, let's look at n+1. Let x be the number of times each bit is set for n, and y the number of times each bit is set for n+1. The value of y will be y = x * 2, because there are x elements with the (n+1)-th bit set to 0, and x elements with the (n+1)-th bit set to 1. And since 2x is always even, n+1 will always have each bit set an even number of times.
Therefore, since n=2 works, and n+1 works whenever n works, the xor method works for all values of n>=2.
The Algorithm For 0 Based Ranges
This is quite simple. It uses 2*n bits of memory, so for any range with n <= 32, two 32-bit integers will work (ignoring any memory consumed by the file descriptor). And it makes a single pass of the file.
long supplied = 0;
long result = 0;
while (supplied = read_int_from_file()) {
result = result ^ supplied;
}
return result;
The Algorithm For Arbitrary Based Ranges
This algorithm will work for ranges of any starting number to any ending number, as long as the total range is equal to 2^n... This basically re-bases the range to have the minimum at 0. But it does require 2 passes through the file (the first to grab the minimum, the second to compute the missing int).
long supplied = 0;
long result = 0;
long offset = INT_MAX;
while (supplied = read_int_from_file()) {
if (supplied < offset) {
offset = supplied;
}
}
reset_file_pointer();
while (supplied = read_int_from_file()) {
result = result ^ (supplied - offset);
}
return result + offset;
Arbitrary Ranges
We can apply this modified method to a set of arbitrary ranges, since all ranges will cross a power of 2^n at least once. This works only if there is a single missing value. It takes 2 passes of an unsorted file, but it will find the single missing number every time:
long supplied = 0;
long result = 0;
long offset = INT_MAX;
long n = 0;
double temp;
while (supplied = read_int_from_file()) {
if (supplied < offset) {
offset = supplied;
}
}
reset_file_pointer();
while (supplied = read_int_from_file()) {
n++;
result = result ^ (supplied - offset);
}
// We need to increment n by one so that we account for the missing
// int value
n++;
while (n == 1 || 0 != (n & (n - 1))) {
result = result ^ (n++);
}
return result + offset;
Basically, it re-bases the range around 0. Then it counts the number of unsorted values as it computes the exclusive-or. Then it adds 1 to the count of unsorted values to take care of the missing value (count the missing one). Then it keeps xoring the n value, incremented by 1 each time, until n is a power of 2. The result is then re-based back to the original base. Done.
Here's the algorithm I tested in PHP (using an array instead of a file, but same concept):
function find($array) {
$offset = min($array);
$n = 0;
$result = 0;
foreach ($array as $value) {
$result = $result ^ ($value - $offset);
$n++;
}
$n++; // This takes care of the missing value
while ($n == 1 || 0 != ($n & ($n - 1))) {
$result = $result ^ ($n++);
}
return $result + $offset;
}
Fed an array with any range of values (I tested including negatives) with one value inside that range missing, it found the correct value each time.
Another Approach
Since we can use external sorting, why not just check for a gap? If we assume the file is sorted prior to running this algorithm:
long supplied = 0;
long last = read_int_from_file();
while (supplied = read_int_from_file()) {
if (supplied != last + 1) {
return last + 1;
}
last = supplied;
}
// The range is contiguous, so what do we do here? Let's return last + 1:
return last + 1;
Trick question, unless it's been quoted improperly. Just read through the file once to get the maximum integer n, and return n+1.
Of course you'd need a backup plan in case n+1 causes an integer overflow.
Check the size of the input file, then output any number which is too large to be represented by a file that size. This may seem like a cheap trick, but it's a creative solution to an interview problem, it neatly sidesteps the memory issue, and it's technically O(n).
void maxNum(ulong filesize)
{
ulong bitcount = filesize * 8; //number of bits in file
for (ulong i = 0; i < bitcount; i++)
{
Console.Write(9);
}
}
Should print 10^bitcount - 1, which will always be greater than 2^bitcount. Technically, the number you have to beat is 2^bitcount - (4 * 10^9 - 1), since you know there are (4 billion - 1) other integers in the file, and even with perfect compression they'll take up at least one bit each.
The simplest approach is to find the minimum number in the file, and return 1 less than that. This uses O(1) storage, and O(n) time for a file of n numbers. However, it will fail if the number range is limited, which could make min - 1 not a representable number.
The simple and straightforward method of using a bitmap has already been mentioned. That method uses O(n) time and storage.
A 2-pass method with 2^16 counting-buckets has also been mentioned. It reads 2*n integers, so uses O(n) time and O(1) storage, but it cannot handle datasets with more than 2^32 numbers (beyond that, every one of the 2^16 buckets can be full). However, it's easily extended to (e.g.) 2^60 64-bit integers by running 4 passes instead of 2, and easily adapted to using tiny memory by using only as many bins as fit in memory and increasing the number of passes correspondingly, in which case run time is no longer O(n) but instead is O(n*log n).
The method of XOR'ing all the numbers together, mentioned so far by rfrankel and at length by ircmaxell answers the question asked in stackoverflow#35185, as ltn100 pointed out. It uses O(1) storage and O(n) run time. If for the moment we assume 32-bit integers, XOR has a 7% probability of producing a distinct number. Rationale: given ~ 4G distinct numbers XOR'd together, and ca. 300M not in file, the number of set bits in each bit position has equal chance of being odd or even. Thus, 2^32 numbers have equal likelihood of arising as the XOR result, of which 93% are already in file. Note that if the numbers in file aren't all distinct, the XOR method's probability of success rises.
Strip the whitespace and non-numeric characters from the file and append 1. Your file now contains a single number not listed in the original file.
From Reddit by Carbonetc.
For some reason, as soon as I read this problem I thought of diagonalization. I'm assuming arbitrarily large integers.
Read the first number. Left-pad it with zero bits until you have 4 billion bits. If the first (high-order) bit is 0, output 1; else output 0. (You don't really have to left-pad: you just output a 1 if there are not enough bits in the number.) Do the same with the second number, except use its second bit. Continue through the file in this way. You will output a 4-billion-bit number one bit at a time, and that number will not be the same as any in the file. Proof: if it were the same as the nth number, then they would agree on the nth bit, but by construction they don't.
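A sketch of the construction, leaning on Python's arbitrary-precision integers to stand in for the 4-billion-bit output (read_ints is a hypothetical helper):
def diagonal_missing(read_ints):
    result = 0
    for i, n in enumerate(read_ints()):
        # make the result differ from the i-th number in bit position i
        if not ((n >> i) & 1):
            result |= 1 << i
    return result
The result disagrees with the i-th number at bit i, so it cannot equal any number in the file, though as the construction implies it may be astronomically large.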
You can use bit flags to mark whether an integer is present or not.
After traversing the entire file, scan each bit to determine if the number exists or not.
Assuming each integer is 32 bit, they will conveniently fit in 1 GB of RAM if bit flagging is done.
Just for the sake of completeness, here is another very simple solution, which will most likely take a very long time to run, but uses very little memory.
Let all possible integers be the range from int_min to int_max, and
bool isNotInFile(integer) a function which returns true if the file does not contain a certain integer and false otherwise (by comparing that certain integer with each integer in the file)
for (integer i = int_min; i <= int_max; ++i)
{
if (isNotInFile(i)) {
return i;
}
}
For the 10 MB memory constraint:
Convert the number to its binary representation.
Create a binary tree where left = 0 and right = 1.
Insert each number in the tree using its binary representation.
If a number has already been inserted, the leaves will already have been created.
When finished, just take a path that has not been created before to create the requested number.
4 billion numbers ≈ 2^32, meaning 10 MB might not be sufficient.
EDIT
An optimization is possible: if two end leaves have been created and have a common parent, then they can be removed and the parent flagged as not a solution. This cuts branches and reduces the need for memory.
EDIT II
There is no need to build the tree completely, either. You only need to build deep branches if numbers are similar. If we cut branches too, then this solution might in fact work.
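A sketch of the pruned bit-trie for 32-bit values; FULL is a sentinel marking a subtree that already contains every possible suffix, implementing the branch-cutting described in the edits (read_ints is a hypothetical helper):
FULL = "full"   # sentinel: this subtree holds every possible suffix

def insert(node, value, bit=31):
    # returns FULL when the subtree becomes complete, pruning it
    if bit < 0:
        return FULL
    b = (value >> bit) & 1
    if node[b] is not FULL:
        if node[b] is None:
            node[b] = [None, None]
        node[b] = insert(node[b], value, bit - 1)
    if node[0] is FULL and node[1] is FULL:
        return FULL   # both halves complete: flag the parent as no solution
    return node

def find_missing(node, bit=31):
    # walk any path that was never completed
    if node is None:
        return 0      # nothing recorded below: the all-zero suffix is free
    for b in (0, 1):
        if node[b] is not FULL:
            return (b << bit) | find_missing(node[b], bit - 1)

root = [None, None]
for n in read_ints():
    root = insert(root, n)
missing = None if root is FULL else find_missing(root)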
I will answer the 1 GB version:
There is not enough information in the question, so I will state some assumptions first:
The integer is 32 bits with range -2,147,483,648 to 2,147,483,647.
Pseudo-code:
var bitArray = new bit[4294967296]; // 0.5 GB, initialized to all 0s.
foreach (var number in file) {
bitArray[number + 2147483648] = 1; // Shift all numbers so they start at 0.
}
for (var i = 0; i < 4294967296; i++) {
if (bitArray[i] == 0) {
return i - 2147483648;
}
}
As long as we're doing creative answers, here is another one.
Use the external sort program to sort the input file numerically. This will work for any amount of memory you may have (it will use file storage if needed).
Read through the sorted file and output the first number that is missing.
Bit Elimination
One way is to eliminate bits; however, this might not actually yield a result (chances are it won't). Pseudocode:
long val = 0xFFFFFFFFFFFFFFFF; // (all bits set)
foreach long fileVal in file
{
val = val & ~fileVal;
if (val == 0) error;
}
Bit Counts
Keep track of the bit counts, and use the bits with the lowest counts to generate a value. Again, this has no guarantee of generating a correct value.
Range Logic
Keep track of a list of ordered ranges (ordered by start). A range is defined by the structure:
struct Range
{
long Start, End; // Inclusive.
}
Range startRange = new Range { Start = 0x0, End = 0xFFFFFFFFFFFFFFFF };
Go through each value in the file and try to remove it from the current ranges. This method has no memory guarantees, but it should do pretty well.
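A sketch of the range splitting (a linear scan for brevity; a bisect over the sorted starts would be the obvious refinement; read_ints is a hypothetical helper):
def remove_value(ranges, v):
    # ranges: ordered list of (start, end) inclusive pairs
    for i, (s, e) in enumerate(ranges):
        if s <= v <= e:
            pieces = []
            if s < v:
                pieces.append((s, v - 1))
            if v < e:
                pieces.append((v + 1, e))
            ranges[i:i + 1] = pieces   # split (or shrink) the hit range
            return

ranges = [(0x0, 0xFFFFFFFFFFFFFFFF)]
for n in read_ints():
    remove_value(ranges, n)
# the start of any surviving range is a value not present in the file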
2^128 * 10^18 + 1 (which is (2^8)^16 * 10^18 + 1) - couldn't it be a universal answer for today? This represents a number that cannot be held in a 16 EB file, which is the maximum file size in any current file system.
I think this is a solved problem (see above), but there's an interesting side case to keep in mind because it might get asked:
If there are exactly 4,294,967,295 (2^32 - 1) 32-bit integers with no repeats, and therefore only one is missing, there is a simple solution.
Start a running total at zero, and for each integer in the file, add that integer with 32-bit overflow (effectively, runningTotal = (runningTotal + nextInteger) % 4294967296). Once complete, add 4294967296/2 to the running total, again with 32-bit overflow. Subtract this from 4294967296, and the result is the missing integer.
The "only one missing integer" problem is solvable with only one run, and only 64 bits of RAM dedicated to the data (32 for the running total, 32 to read in the next integer).
Corollary: The more general specification is extremely simple to match if we aren't concerned with how many bits the integer result must have. We just generate a big enough integer that it cannot be contained in the file we're given. Again, this takes up absolutely minimal RAM. See the pseudocode.
# Grab the file size
fseek(fp, 0L, SEEK_END);
sz = ftell(fp);
# Print four '2's for every byte of the file; 10^4 > 2^8, so the
# result exceeds any value the file could encode.
for (c=0; c<sz; c++) {
for (b=0; b<4; b++) {
print "2";
}
}
As Ryan basically said: sort the file, then go over the integers, and when a value is skipped, there you have it :)
EDIT at downvoters: the OP mentioned that the file could be sorted, so this is a valid method.
If you don't assume the 32-bit constraint, just return a randomly generated 64-bit number (or 128-bit if you're a pessimist). The chance of collision is 1 in 2^64/(4*10^9) ≈ 4,611,686,018 (roughly 1 in 4.6 billion). You'd be right most of the time!
(Joking... kind of.)

Expressing an integer as a series of multipliers

Scroll down to see the latest edit; I left all this text here just so that I don't invalidate the replies this question has received so far!
I have the following brain teaser I'd like to get a solution for. I have tried to solve this, but since I'm not mathematically that much above average (that is, I think I'm very close to average) I can't seem to wrap my head around this.
The problem: A given number x should be split into a series of multipliers, where each multiplier <= y, y being a constant like 10 or 16 or whatever. In the series (technically an array of integers) the last number should be added instead of multiplied, to be able to convert the multipliers back to the original number.
As an example, let's assume x=29 and y=10. In this case the expected array would be {10,2,9}, meaning 10*2+9. However if y=5, it'd be {5,5,4}, meaning 5*5+4, or if y=3, it'd be {3,3,3,2}, which would then be 3*3*3+2.
I tried to solve this by doing something like this:
while x >= y, store y to multipliers, then x = x - y
when x < y, store x to multipliers
Obviously this didn't work, I also tried to store the "leftover" part separately and add that after everything else but that didn't work either. I believe my main problem is that I try to think this in a way too complex manner while the solution is blatantly obvious and simple.
To reiterate, these are the limits this algorithm should have:
has to work with 64bit longs
has to return an array of 32bit integers (...well, shorts are OK too)
while support for signed numbers (both + and -) would be nice, if it helps the task, only unsigned numbers are a must
And while I'm doing this using Java, I'd rather take any possible code examples as pseudocode, I specifically do NOT want readily made answers, I just need a nudge (well, more of a strong kick) so that I can solve this at least partly myself. Thanks in advance.
Edit: Further clarification
To avoid some confusion, I think I should reword this a bit:
Every integer in the result array should be less than or equal to y, including the last number.
Yes, the last number is just a magic number.
No, this isn't modulus, since then the second number would be larger than y in most cases.
Yes, there are multiple answers for most of the available numbers; however, I'm looking for the one with the least amount of math ops. As far as my logic goes, that means finding the maximum amount of as-big-as-possible multipliers, for example x=1 000 000, y=100 is 100*100*100 even though 10*10*10*10*10*10 is an equally correct answer math-wise.
I need to go through the given answers so far with some thought but if you have anything to add, please do! I do appreciate the interest you've already shown on this, thank you all for that.
Edit 2: More explanations + bounty
Okay, it seems like what I was aiming for here just can't be done the way I thought it could be. I was too ambiguous with my goal, and after giving it a bit of thought I decided to just tell you in its entirety what I want to do and see what you can come up with.
My goal originally was to come up with a specific method to pack 1..n large integers (aka longs) together so that their String representation is notably shorter than writing the actual number. Think multiples of ten, 10^6 and 1 000 000 are the same, however the representation's length in characters isn't.
For this I wanted to somehow combine the numbers, since it is expected that the numbers are somewhat close to each other. I first thought that representing 100, 121, 282 as 100+21+161 could be the way to go, but the saving in string length is negligible at best and really doesn't work that well if the numbers aren't very close to each other. Basically I wanted more than ~10%.
So I came up with the idea: what if I group the numbers by a common property, such as a multiplier, and divide the rest of the number into individual components which I can then represent as a string? This is where this problem steps in. I thought that, for example, 1 000 000 and 100 000 could be expressed as 10^(5|6), but due to the context of my intended usage this was a bit too flaky:
The context is the Web, RESTful URLs to be specific. That's why I mentioned thinking of using 64 characters (web-safe alphanumeric non-reserved characters and then some), since then I could create seemingly random URLs which could be unpacked into a list of integers expressing a set of id numbers. At this point I thought of creating a base 64-like number system for expressing base 10/2 numbers, but since I'm not a math genius I have no idea beyond this point how to do it.
The bounty
Now that I have written the whole story (sorry that it's a long one), I'm opening a bounty to this question. Everything regarding requirements for the preferred algorithm specified earlier is still valid. I also want to say that I'm already grateful for all the answers I've received so far, I enjoy being proven wrong if it's done in such a manner as you people have done.
The conclusion
Well, bounty is now given. I spread a few comments to responses mostly for future reference and myself, you can also check out my SO Uservoice suggestion about spreading bounty which is related to this question if you think we should be able to spread it among multiple answers.
Thank you all for taking time and answering!
Update
I couldn't resist trying to come up with my own solution for the first question, even though it doesn't do compression. Here is a Python solution using a third-party factorization algorithm called pyecm.
This solution is probably several orders of magnitude more efficient than Yevgeny's. Computations take seconds instead of hours, or maybe even weeks/years, for reasonable values of y. For x = 2^32-1 and y = 256, it took 1.68 seconds on my Core Duo at 1.2 GHz.
>>> import time
>>> def test():
... before = time.time()
... print factor(2**32-1, 256)
... print time.time()-before
...
>>> test()
[254, 232, 215, 113, 3, 15]
1.68499994278
>>> 254*232*215*113*3+15
4294967295L
And here is the code:
def factor(x, y):
# y should be smaller than x. If x=y then {y, 1, 0} is the best solution
assert(x > y)
best_output = []
# try all possible remainders from 0 to y
for remainder in xrange(y+1):
output = []
composite = x - remainder
factors = getFactors(composite)
# check if any factor is larger than y
bad_remainder = False
for n in factors.iterkeys():
if n > y:
bad_remainder = True
break
if bad_remainder: continue
# make the best factors
while True:
results = largestFactors(factors, y)
if results == None: break
output += [results[0]]
factors = results[1]
# store the best output
output = output + [remainder]
if len(best_output) == 0 or len(output) < len(best_output):
best_output = output
return best_output
# Heuristic
# The bigger the number the better. 8 is more compact than 2,2,2 etc...
# Find the most factors you can have below or equal to y
# output the number and unused factors that can be reinserted in this function
def largestFactors(factors, y):
assert(y > 1)
# iterate from y to 2 and see if the factors are present.
for i in xrange(y, 1, -1):
try_another_number = False
factors_below_y = getFactors(i)
for number, copies in factors_below_y.iteritems():
if number in factors:
if factors[number] < copies:
try_another_number = True
continue # not enough factors
else:
try_another_number = True
continue # a factor is not present
# Do we want to try another number, or was a solution found?
if try_another_number == True:
continue
else:
output = 1
for number, copies in factors_below_y.items():
remaining = factors[number] - copies
if remaining > 0:
factors[number] = remaining
else:
del factors[number]
output *= number ** copies
return (output, factors)
return None # failed
# Find prime factors. You can use any formula you want for this.
# I am using elliptic curve factorization from http://sourceforge.net/projects/pyecm
import pyecm, collections, copy
getFactors_cache = {}
def getFactors(n):
assert(n != 0)
# attempt to retrieve from cache. Returns a copy
try:
return copy.copy(getFactors_cache[n])
except KeyError:
pass
output = collections.defaultdict(int)
for factor in pyecm.factors(n, False, True, 10, 1):
output[factor] += 1
# cache result
getFactors_cache[n] = output
return copy.copy(output)
Answer to first question
You say you want compression of numbers, but from your examples, those sequences are longer than the undecomposed numbers. It is not possible to compress these numbers without more details to the system you left out (probability of sequences/is there a programmable client?). Could you elaborate more?
Here is a mathematical explanation as to why current answers to the first part of your problem will never solve your second problem. It has nothing to do with the knapsack problem.
This is Shannon's entropy formula. It tells you the theoretical minimum number of bits you need to represent a sequence {X0, X1, X2, ..., Xn-1, Xn}, where p(Xi) is the probability of seeing token Xi: H = -sum over i of p(Xi) * log2(p(Xi)).
Let's say that X0 to Xn is the span of 0 to 4294967295 (the range of an integer). From what you have described, each number is as likely as another to appear. Therefore the probability of each element is 1/4294967296.
When we plug it into Shannon's algorithm, it will tell us what the minimum number of bits are required to represent the stream.
import math
def entropy():
num = 2**32
probability = 1./num
return -(num) * probability * math.log(probability, 2)
# the (num) * probability cancels out
The entropy is, unsurprisingly, 32. We require 32 bits to represent an integer where each value is equally likely. The only way to reduce this number is to increase the probability of some numbers and decrease the probability of others. You should explain the stream in more detail.
Answer to second question
The right way to do this is to use base64, when communicating with HTTP. Apparently Java does not have this in the standard library, but I found a link to a free implementation:
http://iharder.sourceforge.net/current/java/base64/
Here is the "pseudo-code" which works perfectly in Python and should not be difficult to convert to Java (my Java is rusty):
def longTo64(num):
mapping = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
output = ""
# special case for 0
if num == 0:
return mapping[0]
while num != 0:
output = mapping[num % 64] + output
num /= 64
return output
If you have control over your web server and web client, and can parse the entire HTTP request without problems, you can upgrade to base85. According to Wikipedia, URL encoding allows for up to 85 characters. Otherwise, you may need to remove a few characters from the mapping.
Here is another code example in Python
def longTo85(num):
mapping = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~!*'();:#&=+$,/?%#[]"
output = ""
base = len(mapping)
# special case for 0
if num == 0:
return mapping[0]
while num != 0:
output = mapping[num % base] + output
num /= base
return output
And here is the inverse operation:
def stringToLong(string):
mapping = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~!*'();:#&=+$,/?%#[]"
output = 0
base = len(mapping)
place = 0
# check each digit from the lowest place
for digit in reversed(string):
# find the number the mapping of symbol to number, then multiply by base^place
output += mapping.find(digit) * (base ** place)
place += 1
return output
Here is a graph of Shannon's algorithm in different bases.
As you can see, the higher the radix, the fewer symbols are needed to represent a number. At base64, ~11 symbols are required to represent a long. At base85, it becomes ~10 symbols.
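Those symbol counts can be checked directly: a 64-bit value needs ceil(64 / log2(radix)) digits in a given radix.
import math
print(int(math.ceil(64 / math.log(64, 2))))   # base64: 11 symbols
print(int(math.ceil(64 / math.log(85, 2))))   # base85: 10 symbols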
Edit after final explanation:
I would think base64 is the best solution, since there are standard functions that deal with it, and variants of this idea don't give much improvement. This was answered with much more detail by others here.
Regarding the original question: although the code works, it is not guaranteed to run in any reasonable time, as LFSR Consulting pointed out in an answer and in comments on this question.
Original Answer:
You mean something like this?
Edit - corrected after a comment.
shortest_output = {}
for (int R = 0; R <= X; R++) {
// iteration over possible remainders
// check if the rest of X can be decomposed into multipliers
newX = X - R;
output = {};
while (newX > Y) {
int i;
for (i = Y; i > 1; i--) {
if ( newX % i == 0) { // found a divider
output.append(i);
newX = newX /i;
break;
}
}
if (i == 1) { // no dividers <= Y
break;
}
}
if (newX != 1) {
// couldn't find dividers with no remainder
output.clear();
}
else {
output.append(R);
if (shortest_output.length() == 0 || output.length() < shortest_output.length()) {
shortest_output = output;
}
}
}
It sounds as though you want to compress random data -- this is impossible for information theoretic reasons. (See http://www.faqs.org/faqs/compression-faq/part1/preamble.html question 9.) Use Base64 on the concatenated binary representations of your numbers and be done with it.
The problem you're attempting to solve (you're dealing with a subset of it, given your restriction on y) is called Integer Factorization, and it cannot be done efficiently with any known algorithm:
In number theory, integer factorization is the breaking down of a composite number into smaller non-trivial divisors, which when multiplied together equal the original integer.
This problem is what makes a number of cryptographic functions possible (namely RSA, whose security depends on the difficulty of factoring large integers). The wiki page contains some good resources that should move you in the right direction with your problem.
So, your brain teaser is indeed a brain teaser... and if you solve it efficiently we can elevate your math skills to above average!
Updated after the full story
Base64 is most likely your best option. If you want a custom solution you can try implementing a base 65+ system. Just remember that just because 10000 can be written as "10^4" doesn't mean that everything can be written as 10^n where n is an integer. Different base systems are the simplest way to write numbers, and the higher the base, the fewer digits the number requires. Plus most framework libraries contain algorithms for Base64 encoding. (What language are you using?)
One way to further pack the urls is the one you mentioned but in Base64.
int[] IDs;
IDs.sort(); // So IDs[i-1] is always smaller than or equal to IDs[i].
string url = Base64Encode(IDs[0]);
for (int i = 1; i < IDs.length; i++) {
url += "," + Base64Encode(IDs[i-1] - IDs[i]);
}
Note that you require some separator as the initial ID can be arbitrarily large and the difference between two IDs CAN be more than 63 in which case one Base64 digit is not enough.
Updated
Just restating that the problem is unsolvable in general. For Y = 64 you can't write 87681 as multipliers + remainder where each of these is at most 64. In other words, you cannot write any of the numbers 87617..87681 as a product of multipliers that are at most 64: each of these numbers has a prime factor over 64. 87616 can be factored into primes below 64, but to reach 87681 from it you'd need a remainder of 65, so the remainder would be over 64.
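The claim is easy to brute-force check; a small sketch (trial division is plenty at this size):
def has_big_prime_factor(m, y):
    # strip every prime factor <= y; anything left over is > y
    f = 2
    while f <= y and m > 1:
        while m % f == 0:
            m //= f
        f += 1
    return m > 1

# 87681 - r for r = 0..64 covers 87617..87681; if every one of those
# has a prime factor > 64, no multipliers-plus-remainder form exists
print(all(has_big_prime_factor(m, 64) for m in range(87617, 87682)))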
So if this was just a brainteaser, it's unsolvable. Was there some practical purpose for this which could be achieved in some way other than using multiplication and a remainder?
And yes, this really should be a comment but I lost my ability to comment at some point. :p
I believe the solution which comes closest is Yevgeny's. It is also easy to extend Yevgeny's solution to remove the limit on the remainder, in which case it would be able to find solutions where the multipliers are smaller than Y and the remainder is as small as possible, even if greater than Y.
Old answer:
If you limit that every number in the array must be below the y then there is no solution for this. Given large enough x and small enough y, you'll end up in an impossible situation. As an example with y of 2, x of 12 you'll get 2 * 2 * 2 + 4 as 2 * 2 * 2 * 2 would be 16. Even if you allow negative numbers with abs(n) below y that wouldn't work as you'd need 2 * 2 * 2 * 2 - 4 in the above example.
And I think the problem is NP-complete even if you limit it to inputs which are known to have an answer where the last term is less than y. It sounds quite a lot like the Knapsack problem. Of course, I could be wrong there.
Edit:
Without more accurate problem description it is hard to solve the problem, but one variant could work in the following way:
1. Set current = x.
2. Break current into its prime factors.
3. If one of the factors is greater than y, the current number cannot be described with terms no greater than y. Subtract one from current and repeat from step 2.
4. Otherwise, the current number can be expressed in terms less than or equal to y.
5. Calculate the remainder (x - current).
6. Combine as many of the factors as possible.
(Yevgeny Doctor has a more concise (and working) implementation of this, so to prevent confusion I've skipped the implementation.)
OP Wrote:
My goal originally was to come up with a specific method to pack 1..n large integers (aka longs) together so that their String representation is notably shorter than writing the actual number. Think multiples of ten, 10^6 and 1 000 000 are the same, however the representation's length in characters isn't.
I have been down that path before, and as fun as it was to learn all the math, to save you time I will just point you to: http://en.wikipedia.org/wiki/Kolmogorov_complexity
In a nutshell some strings can be easily compressed by changing your notation:
10^9 (4 characters) = 1000000000 (10 characters)
Others cannot:
7829203478 = some random number...
This is a great great simplification of the article I linked to above, so I recommend that you read it instead of taking my explanation at face value.
Edit:
If you are trying to make RESTful urls for some set of unique data, why wouldn't you use a hash, such as MD5? Then include the hash as part of the URL, then look up the data based on the hash. Or am I missing something obvious?
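For instance, a sketch (hashlib is standard; the server-side token-to-IDs lookup table and the joining format are assumptions):
import hashlib
ids = (854183415, 1270335149, 228790978)
key = ",".join(str(i) for i in ids).encode("ascii")
token = hashlib.md5(key).hexdigest()
print(token)   # 32 hex chars in the URL, however many IDs there are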
The original method you chose (a * b + c * d + e) would be very difficult to find optimal solutions for simply due to the large search space of possibilities. You could factorize the number but it's that "+ e" that complicates things since you need to factorize not just that number but quite a few immediately below it.
Two methods for compression spring immediately to mind, both of which give you a much-better-than-10% saving on space from the numeric representation.
A 64-bit number ranges from (unsigned):
0 to
18,446,744,073,709,551,616
or (signed):
-9,223,372,036,854,775,808 to
9,223,372,036,854,775,807
In both cases, you need to reduce the 20 characters taken (without commas) to something a little smaller.
The first is to simply BCD-ify the number, then base64 encode it (actually a slightly modified base64, since "/" would not be kosher in a URL - you should use one of the acceptable characters such as "_").
Converting it to BCD will store two digits (or a sign and a digit) into one byte, giving you an immediate 50% reduction in space (10 bytes). Encoding it base 64 (which turns every 3 bytes into 4 base64 characters) will turn the first 9 bytes into 12 characters and that tenth byte into 2 characters, for a total of 14 characters - that's a 30% saving.
The only better method is to just base64 encode the binary representation. This is better because BCD has a small amount of wastage (each digit only needs about 3.32 bits to store, log2(10) ≈ 3.32, but BCD uses 4).
Working on the binary representation, we only need to base64 encode the 64-bit number (8 bytes). That needs 8 characters for the first 6 bytes and 3 characters for the final 2 bytes. That's 11 characters of base64 for a saving of 45%.
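A sketch of that encoding (struct packs the value as 8 big-endian bytes; dropping the "=" padding leaves the 11 characters):
import base64, struct

def encode_u64(n):
    raw = struct.pack(">Q", n)                   # 8-byte binary form
    return base64.urlsafe_b64encode(raw).rstrip(b"=")

print(encode_u64(18446744073709551615))          # 11 characters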
If you wanted maximum compression, there are 73 characters available for URL encoding:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789$-_.+!*'(),
so technically you could probably encode base-73 which, from rough calculations, would still take up 11 characters, but with more complex code which isn't worth it in my opinion.
Of course, that's the maximum compression due to the maximum values. At the other end of the scale (1 digit) this encoding actually results in more data (expansion rather than compression). You can see the improvements only start for numbers over 999, where 4 digits can be turned into 3 base64 characters:
Range (bytes)    Chars    Base64 chars    Compression ratio
-------------    -----    ------------    -----------------
< 10     (1)       1            2              -100%
< 100    (1)       2            2                 0%
< 1000   (2)       3            3                 0%
< 10^4   (2)       4            3                25%
< 10^5   (3)       5            4                20%
< 10^6   (3)       6            4                33%
< 10^7   (3)       7            4                42%
< 10^8   (4)       8            6                25%
< 10^9   (4)       9            6                33%
< 10^10  (5)      10            7                30%
< 10^11  (5)      11            7                36%
< 10^12  (5)      12            7                41%
< 10^13  (6)      13            8                38%
< 10^14  (6)      14            8                42%
< 10^15  (7)      15           10                33%
< 10^16  (7)      16           10                37%
< 10^17  (8)      17           11                35%
< 10^18  (8)      18           11                38%
< 10^19  (8)      19           11                42%
< 2^64   (8)      20           11                45%
Update: I didn't get everything right the first time, thus I rewrote the whole thing in a more Java-style fashion. I hadn't thought of the case of a prime number that is bigger than the divisor. This is fixed now. I leave the original code in to get the idea across.
Update 2: I now handle the case of the big prime number in another fashion. This way a result is obtained either way.
public final class PrimeNumberException extends Exception {
private final long primeNumber;
public PrimeNumberException(long x) {
primeNumber = x;
}
public long getPrimeNumber() {
return primeNumber;
}
}
public static Long[] decompose(long x, long y) {
// Declared outside the try block so the catch clause and the
// final return can see them (the original scoping would not compile).
final ArrayList<Long> operands = new ArrayList<Long>(1000);
final long rest = x % y;
try {
// Extract the rest so that what remains is divisible by y
final long newX = x - rest;
// Go into recursion, actually it's a tail recursion
recDivide(newX, y, operands);
} catch (PrimeNumberException e) {
// return new Long[0];
// or do whatever you like, for example
operands.add(e.getPrimeNumber());
}
// Add the remainder to the array
operands.add(rest);
return operands.toArray(new Long[operands.size()]);
}
// The recursive method
private static void recDivide(long x, long y, ArrayList<Long> operands)
throws PrimeNumberException {
while ((x > y) && (y != 1)) {
if (x % y == 0) {
final long rest = x / y;
// Since y is a divisor add it to the list of operands
operands.add(y);
if (rest <= y) {
// the rest is smaller than y, we're finished
operands.add(rest);
}
// go in recursion
x = rest;
} else {
// if the value x isn't divisible by y decrement y so you'll find a
// divisor eventually
if (--y == 1) {
throw new PrimeNumberException(x);
}
}
}
}
Original: Here is some recursive code I came up with. I would have preferred to code it in some functional language, but it was required in Java. I didn't bother converting the numbers to int, but that shouldn't be that hard (yes, I'm lazy ;)
public static Long[] decompose(long x, long y) {
final ArrayList<Long> operands = new ArrayList<Long>();
final long rest = x % y;
// Extract the rest so that what remains is divisible by y
final long newX = x - rest;
// Go into recursion, actually it's a tail recursion
recDivide(newX, y, operands);
// Add the remainder to the array
operands.add(rest);
return operands.toArray(new Long[operands.size()]);
}
// The recursive method
private static void recDivide(long newX, long y, ArrayList<Long> operands) {
long x = newX;
if (x % y == 0) {
final long rest = x / y;
// Since y is a divisor add it to the list of operands
operands.add(y);
if (rest <= y) {
// the rest is smaller than y, we're finished
operands.add(rest);
} else {
// the rest can still be divided, go one level deeper in recursion
recDivide(rest, y, operands);
}
} else {
// if the value x isn't divisible by y decrement y so you'll find a divisor
// eventually
recDivide(x, y-1, operands);
}
}
Are you married to using Java? Python has an entire package dedicated just for this exact purpose. It'll even sanitize the encoding for you to be URL-safe.
Native Python solution
The standard module I'm recommending is base64, which converts arbitrary strings of chars into a sanitized base64 format. You can use it in conjunction with the pickle module, which handles conversion from lists of longs (actually of arbitrary size) to a compact string representation.
The following code should work on any vanilla installation of Python:
import base64
import pickle
# get some long list of numbers
a = (854183415,1270335149,228790978,1610119503,1785730631,2084495271,
1180819741,1200564070,1594464081,1312769708,491733762,243961400,
655643948,1950847733,492757139,1373886707,336679529,591953597,
2007045617,1653638786)
# this gets you the url-safe string
str64 = base64.urlsafe_b64encode(pickle.dumps(a,-1))
print str64
>>> gAIoSvfN6TJKrca3S0rCEqMNSk95-F9KRxZwakqn3z58Sh3hYUZKZiePR0pRlwlfSqxGP05KAkNPHUo4jooOSixVFCdK9ZJHdEqT4F4dSvPY41FKaVIRFEq9fkgjSvEVoXdKgoaQYnRxAC4=
# this unwinds it
a64 = pickle.loads(base64.urlsafe_b64decode(str64))
print a64
>>> (854183415, 1270335149, 228790978, 1610119503, 1785730631, 2084495271, 1180819741, 1200564070, 1594464081, 1312769708, 491733762, 243961400, 655643948, 1950847733, 492757139, 1373886707, 336679529, 591953597, 2007045617, 1653638786)
Hope that helps. Using Python is probably the closest you'll get from a 1-line solution.
Wrt the original algorithm request: Is there a limit on the size of the last number (beyond that it must be stored in a 32b int)?
(The original request is all I'm able to tackle lol.)
The one that produces the shortest list is:
bool negative = (n < 0); // only truly negative inputs need the sign fix-up
int j=n%y;
if(n==0 || n==1)
{
list.append(n);
return;
}
while((long64)(n-j*y)>MAX_INT && y>1) //R has to be stored in int32
{
y--;
j=n%y;
}
if(y<=1)
fail //Number has no suitable candidate factors. This shouldn't happen
int i=0;
for(;i<j;i++)
{
list.append(y);
}
list.append(n-y*j);
if(negative)
list[0]*=-1;
return;
A little simplistic compared to most of the answers given so far, but it achieves the desired functionality of the original post... It's a little dirty, but hopefully useful :)
Isn't this modulus?
Let / be integer division (whole numbers) and % be modulo.
int result[3];
result[0] = y;
result[1] = x / y;
result[2] = x % y;
Just set x := x/n, where n is the largest number that is less than both x and y. When you end up with x <= y, this is your last number in the sequence.
Like in my comment above, I'm not sure I understand exactly the question. But assuming integers (n and a given y), this should work for the cases you stated:
multipliers[0] = n / y;
multipliers[1] = y;
addedNumber = n % y;
