How to find the xth decibinary number? - algorithm

HackerRank has a problem called Decibinary Numbers. Decibinary numbers are essentially numbers whose digits range over 0-9 but whose place values are powers of 2. The question asks us to display the xth decibinary number. There is another twist to the problem: multiple decibinary numbers can equal the same decimal number. For example, 4 in decimal can be 100, 20, 12, and 4 in decibinary.
At first, I thought that finding how many decibinary numbers there are for a given decimal number would be helpful.
I consulted this post for a bit of help ( https://math.stackexchange.com/questions/3540243/whats-the-number-of-decibinary-numbers-that-evaluate-to-given-decimal-number ). The post was a bit too hard to understand, but then I also realized that even if we know how many decibinary numbers a decimal number can have, this doesn't help FINDING them (at least to my knowledge), which is the original goal of the question.
I do realize that for any decimal number, the largest decibinary number for it will simply be its binary representation. For example, for 4 it is 100. So the brute force approach would be to check all numbers in this range for each decimal number and see if their decibinary representation evaluates to the given decimal number, but this approach will clearly never pass since the input constraints define x to be from 1 to 10^16. Not only that, we have to find the xth decibinary number for q queries, where q is from 1 to 10^5.
This question falls under the section of dp, but I am confused how dp will be used or how it is even possible. To calculate the xth decibinary number q times (as in the brute force method above) it would be better to use a table (like the problem suggests). But for that, we would need to calculate and store 10^16 integers, since that is how big x can be. Assuming an integer is 4 bytes, 4 B * 10^16 > 4 B * (2^3)^16 = 2^50 bytes.
Can someone please explain how this problem is solved optimally? I am still new to CP so if I have made an error in something, please let me know.
(see link below for full problem statement):
https://www.hackerrank.com/challenges/decibinary-numbers/problem

This is solvable with about 80 MB of data. I won't give code, but I will explain the strategy.
Build a lookup count[n][i] that gives you the number of ways to get the decimal number n using the first i digits. You start by inserting 0 everywhere, and then put a 1 in count[0][0]. Now start filling in using the rule below, where the i-th digit has place value 2**(i-1) and any term with a negative first index counts as 0:
count[n][i] = count[n][i-1] + count[n - 2**(i-1)][i-1] + count[n - 2*2**(i-1)][i-1] + ... + count[n - 9*2**(i-1)][i-1]
It turns out that you only need the first 19 digits, and you only need counts of n up to 2**19-1. And the counts all fit in 8 byte longs.
Once you have that, create a second data structure count_below[n] which is the count of how many decibinary numbers will give a value less than n. Use the same range of n as before.
And now a lookup proceeds as follows. First you do a binary search on count_below to find the last value n that has fewer than your target number below it. Subtracting count_below[n] from your query, you know which decibinary number of that value you want.
Next, search through count[n][i] to find the i such that you get your target query with i digits, and not with fewer. This will be the position of the leading digit of your answer. You then subtract off count[n][i-1] from your query (all the decibinaries with fewer digits). Then subtract off count[n-2**(i-1)][i-1], count[n-2*2**(i-1)][i-1], ... count[n-8*2**(i-1)][i-1] until you find what that leading digit is. Now you subtract the contribution of that digit from the value, and repeat the logic for finding the correct decibinary for that smaller value with fewer digits.
Here is a worked example to clarify. First the data structures for the first 3 digits and up to 2**3 - 1:
count = [
[1, 1, 1, 1], # sum 0
[0, 1, 1, 1], # sum 1
[0, 1, 2, 2], # sum 2
[0, 1, 2, 2], # sum 3
[0, 1, 3, 4], # sum 4
[0, 1, 3, 4], # sum 5
[0, 1, 4, 6], # sum 6
[0, 1, 4, 6], # sum 7
]
count_below = [
0, 1, 2, 4, 6, 10, 14, 20, 26, ...
]
Let's find the 20th.
count_below[6] is 14 and count_below[7] is 20 so our decimal sum is 6.
We want the 20 - count_below[6] = 6th decibinary with decimal sum 6.
count[6][2] is 4 while count[6][3] is 6 so we have a non-zero third digit.
We want the 6 - count[6][2] = 2nd with a non-zero third digit.
count[6 - 2**2][2] = count[2][2] is 2, so 2 have 3rd digit 1.
The third digit is 1.
We are now looking for the second decibinary whose decimal sum is 2.
count[2][1] is 1 and count[2][2] is 2 so it has a non-zero second digit.
We want the 2 - count[2][1] = 1st with a non-zero second digit.
The second digit is 1
The rest is 0 because 2 - 2**1 = 0.
And thus you find that the answer is 110.
Now for such a small number, this was a lot of work. But even for your hardest lookup you'll only need about 20 steps of a binary search to find your decimal sum, another 20 steps to find the position of the first non-zero digit, and for each of those digits, you'll have to do 1-9 different calculations to find what that digit is. Which means only hundreds of calculations to find the number.
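The strategy above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the names build_tables and decibinary are made up for this sketch, x is taken to be 1-based (so x = 1 gives 0), and the digit search simply walks every position from the top down trying digits 0 to 9, rather than first locating the leading digit, which should select the same number. The example uses 3 digits to match the worked example; the real problem would need 19, where a pure Python table build like this would be slow.

```python
from bisect import bisect_left
from itertools import accumulate


def build_tables(max_digits):
    """count[n][i]: decibinary strings of at most i digits whose value is n."""
    max_value = 2 ** max_digits - 1
    count = [[0] * (max_digits + 1) for _ in range(max_value + 1)]
    count[0][0] = 1
    for i in range(1, max_digits + 1):
        weight = 2 ** (i - 1)  # place value of the i-th digit
        for n in range(max_value + 1):
            count[n][i] = sum(
                count[n - d * weight][i - 1]
                for d in range(10)
                if n - d * weight >= 0
            )
    # count_below[n]: how many decibinary numbers have a value less than n
    count_below = [0] + list(accumulate(row[max_digits] for row in count))
    return count, count_below


def decibinary(x, count, count_below, max_digits):
    """Return the x-th decibinary number (1-based; x must be within the table)."""
    n = bisect_left(count_below, x) - 1  # last n with count_below[n] < x
    k = x - count_below[n]               # rank among the numbers of value n
    digits = []
    for i in range(max_digits, 0, -1):
        weight = 2 ** (i - 1)
        for d in range(10):
            rest = n - d * weight
            ways = count[rest][i - 1] if rest >= 0 else 0
            if k <= ways:        # the k-th number has digit d at position i
                digits.append(d)
                n = rest
                break
            k -= ways            # skip all numbers with this smaller digit
    return int("".join(map(str, digits)))


count, count_below = build_tables(3)
answer = decibinary(20, count, count_below, 3)  # the worked example
```

With 3 digits the tables only cover queries up to x = 26 (values 0 through 7); larger queries need more digits.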

Related

Ugly Number - Mathematical intuition for dp

I am trying to find the "ugly" numbers, which is a series of numbers whose only prime factors are [2,3,5].
I found a dynamic programming solution and wanted to understand how it works and what the mathematical intuition behind the logic is.
The algorithm is to keep three different counter variables for the multiples of 2, 3 and 5. Let's call them i2, i3, and i5.
Declare the ugly array and initialize index 0 to 1, as the first ugly number is 1.
Initialize i2=i3=i5=0;
ugly[i] = min(ugly[i2]*2, ugly[i3]*3, ugly[i5]*5) and increment i2 or i3 or i5, whichever index was chosen.
Dry run:
ugly = |1|
i2=0;
i3=0;
i5=0;
ugly[1] = min(ugly[0]*2, ugly[0]*3, ugly[0]*5) = 2
---------------------------------------------------
ugly = |1|2|
i2=1;
i3=0;
i5=0;
ugly[2] = min(ugly[1]*2, ugly[0]*3, ugly[0]*5) = 3
---------------------------------------------------
ugly = |1|2|3|
i2=1;
i3=1;
i5=0;
ugly[3] = min(ugly[1]*2, ugly[1]*3, ugly[0]*5) = 4
---------------------------------------------------
ugly = |1|2|3|4|
i2=2;
i3=1;
i5=0;
ugly[4] = min(ugly[2]*2, ugly[1]*3, ugly[0]*5) = 5
---------------------------------------------------
ugly = |1|2|3|4|5|
i2=2;
i3=1;
i5=1;
ugly[4] = min(ugly[2]*2, ugly[1]*3, ugly[0]*5) = 6
---------------------------------------------------
ugly = |1|2|3|4|5|6|
I am getting lost how six is getting formed from 2's index. Can someone explain in an easy way?
Every "ugly" number (except 1) can be formed by multiplying a smaller ugly number by 2, 3, or 5.
So let's say that the ugly numbers found so far are [1,2,3,4,5]. Based on that list we can generate three sequences of ugly numbers:
Multiplying by 2, the possible ugly numbers are [2,4,6,8,10]
Multiplying by 3, the possible ugly numbers are [3,6,9,12,15]
Multiplying by 5, the possible ugly numbers are [5,10,15,20,25]
But we already have 2,3,4, and 5 in the list, so we don't care about values less than or equal to 5. Let's mark those entries with a - to indicate that we don't care about them
Multiplying by 2, the possible ugly numbers are [-,-,6,8,10]
Multiplying by 3, the possible ugly numbers are [-,6,9,12,15]
Multiplying by 5, the possible ugly numbers are [-,10,15,20,25]
And in fact, all we really care about is the smallest number in each sequence
Multiplying by 2, the smallest number greater than 5 is 6
Multiplying by 3, the smallest number greater than 5 is 6
Multiplying by 5, the smallest number greater than 5 is 10
After adding 6 to the list of ugly numbers, each sequence has one additional element:
Multiplying by 2, the possible ugly numbers are [-,-,-,8,10,12]
Multiplying by 3, the possible ugly numbers are [-,-,9,12,15,18]
Multiplying by 5, the possible ugly numbers are [-,10,15,20,25,30]
But the elements from each sequence that are useful are:
Multiplying by 2, the smallest number greater than 6 is 8
Multiplying by 3, the smallest number greater than 6 is 9
Multiplying by 5, the smallest number greater than 6 is 10
So you can see that what the algorithm is doing is creating three sequences of ugly numbers. Each sequence is formed by multiplying all of the existing ugly numbers by one of the three factors.
But all we care about is the smallest number in each sequence (larger than the largest ugly number found so far).
So the indexes i2, i3, and i5 are the indexes into the corresponding sequences. When you use a number from a sequence, you update the index to point to the next number in that sequence.
The intuition is the following:
any ugly number can be written as the product between 2, 3 or 5 and another (smaller) ugly number.
With that in mind, the solution that is mentioned in the question keeps track of i2, i3 and i5, the indices of the smallest ugly numbers generated so far, which multiplied by 2, 3, respectively 5 lead to a number that was not already generated. The smallest of these products is the smallest ugly number that was not already generated.
To state this differently, I believe that the following statement from the question might be the source of some confusion:
The algorithm is to keep three different counter variable for a
multiple of 2, 3 and 5. Let's assume i2,i3, and i5.
Note, for example, that ugly[i2] is not necessarily a multiple of 2. It is simply the smallest ugly number for which 2 * ugly[i2] is greater than ugly[i] (the largest ugly number known so far).
Regarding how the number 6 is generated in the next step, the procedure is shown below:
ugly = |1|2|3|4|5
i2 = 2;
i3 = 1;
i5 = 1;
ugly[5] = min(ugly[2]*2, ugly[1]*3, ugly[1]*5) = min(3*2, 2*3, 2*5) = 6
---------------------------------------------------
ugly = |1|2|3|4|5|6
i2 = 3
i3 = 2
i5 = 1
Note that here both i2 and i3 need to be incremented after generating the number 6, because both ugly[i2]*2 and ugly[i3]*3 produced the same next smallest ugly number.
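Put together, the procedure discussed above could look like the following Python sketch. The function name ugly_numbers is made up for illustration; the important detail is that every index whose product equals the chosen minimum is advanced, which is what prevents 6 from being added twice.

```python
def ugly_numbers(count):
    """Return the first `count` ugly numbers (only prime factors 2, 3, 5)."""
    ugly = [1]
    i2 = i3 = i5 = 0
    while len(ugly) < count:
        candidate = min(ugly[i2] * 2, ugly[i3] * 3, ugly[i5] * 5)
        ugly.append(candidate)
        # Advance every index that produced the candidate, not just one,
        # so duplicates such as 6 = 3*2 = 2*3 are skipped.
        if candidate == ugly[i2] * 2:
            i2 += 1
        if candidate == ugly[i3] * 3:
            i3 += 1
        if candidate == ugly[i5] * 5:
            i5 += 1
    return ugly


first_ten = ugly_numbers(10)
```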

Algorithm to finding shortest sequence of numbers from array A that add up to number B

I had an interesting interview question the other day that sort of stumped me. I couldn't find a really good answer for it. The problem stated:
Suppose you are given a number B and an array A of length n. The number B is a natural number, and all numbers in array A are distinct, natural numbers. Design an algorithm that would find the shortest sequence of numbers in array A that would sum up to the number B. Duplicates can be used.
So, as an example, let us say I have a number B = 19, and A = [9, 6, 3, 1]. I could say a solution is 6+6+6+1, or 3+3+3+3+3+3+1, but the solution they are looking for is 9+9+1, because that is the shortest sequence of numbers.
The algorithm that I designed would sort the array and reach into the largest number and subtract it from the original number. It would keep doing this until it could no longer subtract the largest number. It would then go through the array and see if it could keep finding any numbers that it could subtract from B. It actually looked a lot like this:
def domath(b, a):
    a.sort()
    x = []
    n = 0
    idx = -1
    while b != 0:
        n = a[idx]
        if(b >= n):
            b -= n
            x.append(n)
        else:
            idx -= 1
    return x
But this solution would not always work. It would only work if you were lucky enough to have, say, a 2 or a 1 in the array, or the numbers that you kept subtracting from b magically worked. Consider if B=21 and A=[7,8,9]. If it kept subtracting 9, it would not be able to find a solution.
So I was thinking "Okay, then maybe I need to backtrack a bit.".
If I reached into the x array, which keeps track of all the numbers we kept subtracting, I could add the latest number we subtracted back to b, then try to move the idx to the next largest number. So, instead of doing 21 - 9, then 12 - 9, it would do 21 - 9, then 12 - 8. It still wouldn't find anything, so then it would try 21 - 9, then 12 - 7. It still wouldn't find anything, so it would try 21 - 8, then 13 - 8, and it wouldn't find anything, so it would do 21 - 8, then 13 - 7, and it still wouldn't find anything, so it would try 21 - 7, and continue on that, and determine if it could do it. If it can't (in this case, it should), it would just return "False" or something.
Is that... a good solution? I feel like there must be a better one, because the interviewers were kind of iffy about this solution.
Tricky. The linked Wikipedia page suggests an approach that will take, I think, O(B * length(A)), which would take quite long if we had B = 1,000,000,000,000 instead of B = 21 with A = [9, 8, 7]. Your backtracking algorithm would handle this reasonably quickly if you start with a division:
111,111,111,111 nines leaves one, no way.
111,111,111,110 nines leaves ten, no way (trying 1 or 0 eights)
111,111,111,109 nines leaves 19, no way (trying 2, 1 or 0 eights)
111,111,111,108 nines leaves 28 = 4x7 (trying 3 .. 0 eights). Best so far.
111,111,111,107 nines leaves 37. 4x8 < 37, no solution can beat what we have.
In your example, B = 21, backtracking would also work quite well. If we just denote the numbers of nines, eights, and sevens, then you would just try the following: 2,0,0; 1,1,0; 1,0,1; 0,2,0; 0,1,1; 0,0,3.
You'd want to stop search branches when you have a solution and can prove that no further solution can be better. That's what I did: When you have 37 left and the highest number available is 8 then you need at least 5 numbers. And for every nine that you remove that number is going up at least by one, so the best solution so far cannot be beaten.
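A rough Python sketch of this backtracking-with-pruning idea is shown below. The function name shortest_sum and the choice to return a dictionary of value-to-count are assumptions made for this illustration. It tries the largest number first, starting from the division result, and stops a loop as soon as the lower bound (terms used so far plus the remainder divided by the next largest number, rounded up) cannot beat the best solution found. The worst case can still be very slow, for example when no solution exists at all.

```python
def shortest_sum(target, numbers):
    """Return {value: copies} for a shortest sum reaching target, or None."""
    values = sorted(set(numbers), reverse=True)  # assumes positive integers
    best = None  # best[i] = copies of values[i] in the best solution so far

    def search(idx, remaining, used, picks):
        nonlocal best
        value = values[idx]
        if idx == len(values) - 1:
            # Smallest value: it has to divide what is left exactly.
            if remaining % value == 0:
                total = used + remaining // value
                if best is None or total < sum(best):
                    best = picks + [remaining // value]
            return
        for take in range(remaining // value, -1, -1):
            rest = remaining - take * value
            # At least ceil(rest / next value) further terms are needed.
            bound = used + take + -(-rest // values[idx + 1])
            if best is not None and bound >= sum(best):
                break  # using fewer copies can only raise the bound
            search(idx + 1, rest, used + take, picks + [take])

    if values:
        search(0, target, 0, [])
    if best is None:
        return None
    return {v: c for v, c in zip(values, best) if c}


small = shortest_sum(21, [7, 8, 9])
large = shortest_sum(10 ** 12, [9, 8, 7])
```

For B = 10^12 and A = [9, 8, 7] this only visits the handful of cases listed above before the bound cuts the search off.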

Find a period of eventually periodic sequence

Short explanation.
I have a sequence of numbers [0, 1, 4, 0, 0, 1, 1, 2, 3, 7, 0, 0, 1, 1, 2, 3, 7, 0, 0, 1, 1, 2, 3, 7, 0, 0, 1, 1, 2, 3, 7]. As you can see, after the first 3 values the sequence is periodic with a period [0, 0, 1, 1, 2, 3, 7].
I am trying to automatically extract this period from this sequence. The problem is that I know neither the length of the period nor the position from which the sequence becomes periodic.
Full explanation (might require some math)
I am learning combinatorial game theory and a cornerstone of this theory requires one to calculate Grundy values of a game graph. This produces infinite sequence, which in many cases becomes eventually periodic.
I found a way to efficiently calculate grundy values (it returns me a sequence). I would like to automatically extract offset and period of this sequence. I am aware that seeing a part of the sequence [1, 2, 3, 1, 2, 3] you can't be sure that [1, 2, 3] is a period (who knows may be the next number is 4, which breaks the assumption), but I am not interested in such intricacies (I assume that the sequence is enough to find the real period). Also the problem is the sequence can stop in the middle of the period: [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, ...] (the period is still 1, 2, 3).
I also need to find the smallest offset and period. For example for original sequence, the offset can be [0, 1, 4, 0, 0] and the period [1, 1, 2, 3, 7, 0, 0], but the smallest is [0, 1, 4] and [0, 0, 1, 1, 2, 3, 7].
My inefficient approach is to try every possible offset and every possible period. Construct the sequence using this data and check whether it is the same as the original. I have not done any formal analysis, but it looks like it is at least quadratic in terms of time complexity.
Here is my quick python code (have not tested it properly):
def getPeriod(arr):
    min_offset, min_period, n = len(arr), len(arr), len(arr)
    best_offset, best_period = [], []
    for offset in range(n):
        start = arr[:offset]
        # integer division (//) so this also runs on Python 3
        for period_len in range(1, (n - offset) // 2):
            period = arr[offset: offset+period_len]
            # rebuild the sequence from this offset and period, then compare
            attempt = (start + period * (n // period_len + 1))[:n]
            if attempt == arr:
                if period_len < min_period:
                    best_offset, best_period = start[::], period[::]
                    min_offset, min_period = len(start), period_len
                elif period_len == min_period and len(start) < min_offset:
                    best_offset, best_period = start[::], period[::]
                    min_offset, min_period = len(start), period_len
    return best_offset, best_period
Which returns me what I want for my original sequence:
offset [0, 1, 4]
period [0, 0, 1, 1, 2, 3, 7]
Is there anything more efficient?
Remark: If there is a period P1 with length L, then there is also a period P2, with the same length, L, such that the input sequence ends exactly with P2 (i.e. we do not have a partial period involved at the end).
Indeed, a different period of the same length can always be obtained by changing the offset. The new period will be a rotation of the initial period.
For example the following sequence has a period of length 4 and offset 3:
0 0 0 (1 2 3 4) (1 2 3 4) (1 2 3 4) (1 2 3 4) (1 2 3 4) (1 2
but it also has a period with the same length 4 and offset 5, without a partial period at the end:
0 0 0 1 2 (3 4 1 2) (3 4 1 2) (3 4 1 2) (3 4 1 2) (3 4 1 2)
The implication is that we can find the minimum length of a period by processing the sequence in reverse order, and searching the minimum period using zero offset from the end. One possible approach is to simply use your current algorithm on the reversed list, without the need of the loop over offsets.
Now that we know the length of the desired period, we can also find its minimum offset. One possible approach is to try all various offsets (with the advantage of not needing the loop over lengths, since the length is known), however, further optimizations are possible if necessary, e.g. by advancing as much as possible when processing the list from the end, allowing the final repetition of the period (i.e. the one closest to the start of the un-reversed sequence) to be partial.
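One way to sketch this "work from the end" idea in Python is shown below. The function name find_offset_and_period is made up here, and the sketch assumes that two matching blocks at the very end of the sequence are enough evidence for a period; a short accidental repeat in the tail would fool it, in the same spirit as the question's remark about not worrying over such cases.

```python
def find_offset_and_period(seq):
    """Return (offset part, period) assuming the tail of seq is periodic."""
    n = len(seq)
    # Smallest length L for which the last two blocks of length L agree.
    period_len = next(
        (L for L in range(1, n // 2 + 1) if seq[n - 2 * L:n - L] == seq[n - L:]),
        None,
    )
    if period_len is None:
        return list(seq), []  # no repetition found at the end
    # Move the start of the periodic part left while seq[i] == seq[i + L].
    offset = n - period_len
    while offset > 0 and seq[offset - 1] == seq[offset - 1 + period_len]:
        offset -= 1
    return list(seq[:offset]), list(seq[offset:offset + period_len])


original = [0, 1, 4] + [0, 0, 1, 1, 2, 3, 7] * 4
result = find_offset_and_period(original)
```

The period search still compares up to n/2 candidate lengths, but the loop over offsets from the question's code is gone, and the offset is then found in a single backward pass.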
I would start with constructing histogram of the values in the sequence
So you just make a list of all numbers used in sequence (or significant part of it) and count their occurrence. This is O(n) where n is sequence size.
sort the histogram ascending
This is O(m.log(m)) where m is the number of distinct values. You can also ignore improbable numbers (count<threshold), which are most likely in the offset or just irregularities, further lowering m. For periodic sequences m <<< n, so you can use it as a first marker of whether the sequence is periodic or not.
find out the period
In the histogram the counts should be around multiples of the n/period. So approximate/find GCD of the histogram counts. The problem is that you need to take into account there are irregularities present in the counts and also in the n (offset part) so you need to compute GCD approximately. for example:
sequence = { 1,1,2,3,3,1,2,3,3,1,2,3,3 }
has ordered histogram:
item,count
2 3
1 4
3 6
the GCD(6,4)=2 and GCD(6,3)=3 you should check at least +/-1 around the GCD results so the possible periods are around:
T = ~n/2 = 13/2 = 6
T = ~n/3 = 13/3 = 4
So check T={3,4,5,6,7} just to be sure. Use always GCD between the highest counts vs. lowest counts. If the sequence has many distinct numbers you can also do a histogram of counts checking only the most common values.
To check period validity just take any item near end or middle of the sequence (just use probable periodic area). Then look for it in close area near probable period before (or after) its occurrence. If found few times you got the right period (or its multiple)
Get the exact period
Just check the found period fractions (T/2, T/3, ...) or do a histogram on the found period and the smallest count tells you how many real periods you got encapsulated so divide by it.
find offset
When you know the period this is easy. Just scan from the start: take each item and see if the same item appears again one period later. If not, remember the position. Stop at the end or in the middle of the sequence ... or after some threshold of consecutive successes. This is up to O(n), and the last remembered position is the last item in the offset.
[edit1] Was curious so I try to code it in C++
I simplified/skip few things (assuming at least half of the array is periodic) to test if I did not make some silly mistake in my algorithm and here the result (Works as expected):
const int p=10;         // min periods for testing
const int n=500;        // generated sequence size
int seq[n];             // generated sequence
int offset,period;      // generated properties
int i,j,k,e,t0,T;
int hval[n],hcnt[n],hs; // histogram
// generate periodic sequence
Randomize();
offset=Random(n/5);
period=5+Random(n/5);
for (i=0;i<offset+period;i++) seq[i]=Random(n);
for (i=offset,j=i+period;j<n;i++,j++) seq[j]=seq[i];
if ((offset)&&(seq[offset-1]==seq[offset-1+period])) seq[offset-1]++;
// compute histogram O(n) on last half of it
for (hs=0,i=n>>1;i<n;i++)
    {
    for (e=seq[i],j=0;j<hs;j++)
        if (hval[j]==e) { hcnt[j]++; j=-1; break; }
    if (j>=0) { hval[hs]=e; hcnt[hs]=1; hs++; }
    }
// bubble sort histogram asc O(m^2)
for (e=1,j=hs;e;j--)
    for (e=0,i=1;i<j;i++)
        if (hcnt[i-1]>hcnt[i])
            { e=hval[i-1]; hval[i-1]=hval[i]; hval[i]=e;
              e=hcnt[i-1]; hcnt[i-1]=hcnt[i]; hcnt[i]=e; e=1; }
// test possible periods
for (j=0;j<hs;j++)
    if ((!j)||(hcnt[j]!=hcnt[j-1]))    // distinct counts only
        if (hcnt[j]>1)                 // more than 1 occurrence
            for (T=(n>>1)/(hcnt[j]+1);T<=(n>>1)/(hcnt[j]-1);T++)
                {
                for (i=n-1,e=seq[i],i-=T,k=0;(i>=(n>>1))&&(k<p)&&(e==seq[i]);i-=T,k++);
                if ((k>=p)||(i<n>>1)) { j=hs; break; }
                }
// compute histogram O(T) on last multiple of period
for (hs=0,i=n-T;i<n;i++)
    {
    for (e=seq[i],j=0;j<hs;j++)
        if (hval[j]==e) { hcnt[j]++; j=-1; break; }
    if (j>=0) { hval[hs]=e; hcnt[hs]=1; hs++; }
    }
// least count is the period multiple O(m)
for (e=hcnt[0],i=0;i<hs;i++) if (e>hcnt[i]) e=hcnt[i];
if (e) T/=e;
// check/handle error
if (T!=period)
    {
    return;
    }
// search offset size O(n)
for (t0=-1,i=0;i<n-T;i++)
    if (seq[i]!=seq[i+T]) t0=i;
t0++;
// check/handle error
if (t0!=offset)
    {
    return;
    }
Code is still not optimized. For n=10000 it takes around 5ms on my setup. The result is in t0 (offset) and T (period). You may need to play with the threshold constants a bit.
I had to do something similar once. I used brute force and some common sense, the solution is not very elegant but it works. The solution always works, but you have to set the right parameters (k,j, con) in the function.
The sequence is saved as a list in the variable seq.
k is the size of the sequence array, if you think your sequence will take long to become periodic then set this k to a big number.
The variable found will tell us if the array passed the periodic test with period j
j is the period.
If you expect a huge period then you must set j to a big number.
We test the periodicity by checking the last j+30 numbers of the sequence.
The bigger the period (j) the more we must check.
As soon as one of the tests is passed we exit the function and return the smallest period found.
As you may notice the accuracy depends on the variables j and k but if you set them to very big numbers it will always be correct.
def some_sequence(s0, a, b, m):
    try:
        seq=[s0]
        snext=s0
        findseq=True
        k=0
        while findseq:
            snext= (a*snext+b)%m
            seq.append(snext)
            #UNTIL THIS PART IS JUST TO CREATE THE SEQUENCE (seq) SO IS NOT IMPORTANT
            k=k+1
            if k>20000:
                # I IS OUR LIST INDEX
                for i in range(1,len(seq)):
                    for j in range(1,1000):
                        found =True
                        for con in range(j+30):
                            #THE TRICK IS TO START FROM BEHIND
                            if not (seq[-i-con]==seq[-i-j-con]):
                                found = False
                        if found:
                            minT=j
                            findseq=False
                            return minT
    except:
        return None
simplified version
def get_min_period(sequence,max_period,test_numb):
    seq=sequence
    if max_period+test_numb > len(sequence):
        print("max_period+test_numb cannot be bigger than the seq length")
        return 1
    for i in range(1,len(seq)):
        for j in range(1,max_period):
            found =True
            for con in range(j+test_numb):
                if not (seq[-i-con]==seq[-i-j-con]):
                    found = False
            if found:
                minT=j
                return minT
Where max_period is the maximum period you want to look for, and test_numb is how many numbers of the sequence you want to test. The bigger the better, but you have to make max_period+test_numb < len(sequence).

Generate random integers from random bit sequence

Very basic question but I can't seem to find the answer on Google. A standard PRNG will generate a sequence of random bits. How would I use this to produce a sequence of random integers with a uniform probability distribution in the range [0, N)? Moreover each integer should use (expected value) log_2(N) bits.
If you want a random number between 1 and N:
You calculate how many bits you would need to write N as a binary number. That's:
n_bits = ceiling(log_2(N))
where ceiling is the "round up" operation (e.g. ceiling(3) = 3, ceiling(3.7) = 4).
You pick the first n_bits of your random binary list and change them into a decimal number.
If your decimal number is above N, well... you discard it and try again with the next n_bits bits until it works.
Example for N = 12:
n_bits = ceiling(log_2(12)) = 4
You take the first 4 bits of your random bit sequence, which might be "1011".
You turn "1011" into a decimal number (reading the least significant bit first), which gives 13. That's above 12, no good. So:
Take the next 4 bits in your random sequence, which might be "1110".
Turn "1110" into a decimal number (again least significant bit first), which gives 7. That works!
Hope it helps.
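As a sketch, this rejection approach could be written in Python as below, here for the range [0, N) that the question asks about. The names uniform_below and next_bit are made up for illustration; next_bit stands for whatever source delivers one random bit per call. Bits are assembled most significant first in this sketch, which does not affect uniformity.

```python
def uniform_below(n, next_bit):
    """Return a uniform integer in [0, n) using bits from next_bit()."""
    n_bits = (n - 1).bit_length()  # equals ceiling(log_2(n)) for n >= 1
    while True:
        value = 0
        for _ in range(n_bits):
            value = (value << 1) | next_bit()
        if value < n:
            return value  # accepted
        # Otherwise discard the whole group and draw n_bits fresh bits.


# Deterministic demonstration: 1101 (13) is rejected, 0111 (7) is accepted.
demo_bits = iter([1, 1, 0, 1, 0, 1, 1, 1])
demo_value = uniform_below(12, lambda: next(demo_bits))
```

In real use next_bit could be something like lambda: random.getrandbits(1). Because rejected groups are thrown away, the expected number of bits per integer is somewhat above log_2(N) unless N is a power of 2.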
Actually most standard PRNGs such as linear congruential generators or Mersenne twister generate sequences of integer values. Even generalized feedback shift register techniques are usually implemented at the register/word level. I don't know of any common techniques that actually operate at the bit level. That's not to say they don't exist, but they're not common...
Generating values from 1 to N is usually accomplished by taking the integer value produced modulo the desired bound, and then doing an acceptance/rejection stage to make sure you aren't subject to modulo bias. See Java's nextInt(int bound) method, for example, to see how this can be implemented. (Add 1 to the result to get [1,N] rather than [0,N-1].)
Theoretically this is possible. Find a, b such that 2^a > N^b but is very close. (This can be done by iterating through multiples of log2(N).) Take the first a bits, and, interpreting it as a binary number, convert it to base N (also checking that the number is less than N^b). The digits give b terms of the desired sequence.
The problem is that converting to base N is very expensive and will cost more than essentially any PRNG, so this is mostly a theoretical answer.
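For illustration only, the batching idea could be sketched in Python like this. The name batch_uniform and the parameter next_bit (a callable returning one random bit) are assumptions of this sketch, and a is simply chosen as the smallest bit count that covers N^b rather than searched for a particularly tight fit.

```python
def batch_uniform(n, b, next_bit):
    """Return b uniform integers in [0, n) from one block of random bits."""
    limit = n ** b
    a = (limit - 1).bit_length()  # smallest a with 2**a >= n**b
    while True:
        x = 0
        for _ in range(a):
            x = (x << 1) | next_bit()
        if x < limit:  # reject blocks that fall outside [0, n**b)
            break
    digits = []
    for _ in range(b):  # the base-n digits of x are the b results
        x, digit = divmod(x, n)
        digits.append(digit)
    return digits


# Deterministic demonstration with n = 3, b = 2: 1001 (9) is rejected,
# 0111 (7) is accepted and 7 = 2*3 + 1 yields the digits 1 and 2.
demo_bits = iter([1, 0, 0, 1, 0, 1, 1, 1])
demo_batch = batch_uniform(3, 2, lambda: next(demo_bits))
```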
1. Calculate the number of bits required for N (= location of the most significant bit with value 1) - let's call it k.
2. Take the first k bits from your input stream of bits - let's call it number X.
3. Result = X mod N.
4. Propagate to the next set of k bits and repeat from step 2 for next random number generation.
Alternatively, for better distribution, this can be applied instead of step 3:
Ratio = N / 2^k
Result = X * Ratio
Start with the range [0, N-1] then use 0s and 1s to perform a binary search:
0: lower half
1: upper half
e.g. With N = 16, you start with [0, 15], and the sequence 0, 1, 1, 0 would give:
[0, 7]
[4, 7]
[6, 7]
[6]
If N is not a power of 2, then in any iteration, the length of the list of remaining numbers could be odd, in which case a decision needs to be made to include the middle number as part of the lower half or the upper half. This can be decided right at the start of the algorithm. Roll once: 0 means include all instances of middle numbers to the lower half, and 1 means include all instances of middle numbers to the right half.
I think this is at least closer to the uniform distribution that you are asking for compared to the common method of generating log(N) bits and taking that or taking the mod N of it.
To illustrate what I mean, using my method to generate a number in the range [0, 9]:
To generate 0
0: 0, 0, 0, 0
1: 0, 0, 0
To generate 1
0: 0, 0, 0, 1
1: 0, 0, 1
To generate 2
0: 0, 0, 1
1: 0, 1, 0
To generate 3
0: 0, 1, 0
1: 0, 1, 1, 0
To generate 4
0: 0, 1, 1
1: 0, 1, 1, 1
To generate 5
0: 1, 0, 0, 0
1: 1, 0, 0
To generate 6
0: 1, 0, 0, 1
1: 1, 0, 1
To generate 7
0: 1, 0, 1
1: 1, 1, 0
To generate 8
0: 1, 1, 0
1: 1, 1, 1, 0
To generate 9
0: 1, 1, 1
1: 1, 1, 1, 1
The other easy answer is to generate a large enough binary number such that taking mod N does not (statistically) favor some numbers over others. But I figured that you would not like this answer either because judging from your comments to another answer, you seem to be taking into account efficiency in terms of number of bits generated.
In short, I am not sure why I was downvoted for this answer as this algorithm seems to provide a nice distribution compared to the number of bits it uses (~log(N)).

algorithm to find number of integers with given digits within a given range

If I am given the full set of digits in the form of a list list and I want to know how many (valid) integers they can form within a given range [A, B], what algorithm can I use to do it efficiently?
For example, given a list of digits (containing duplicates and zeros) list={5, 3, 3, 2, 0, 0}, I want to know how many integers can be formed in the range [A, B]=[20, 400] inclusive. For example, in this case, 20, 23, 25, 30, 32, 33, 35, 50, 52, 53, 200, 203, 205, 230, 233, 235, 250, 253, 300, 302, 303, 305, 320, 323, 325, 330, 332, 335, 350, 352, 353 are all valid.
Step 1: Find the number of digits your answers are likely to fall in. In your
example it is 2 or 3.
Step 2: For a given number size (number of digits)
Step 2a: Pick the possibilities for the first (most significant digit).
Find the min and max number starting with that digit (ascend or descending
order of rest of the digits). If both of them fall into the range:
step 2ai: Count the numbers starting with that first digit and
update that count
Step 2b: Else if both max and min are out of range, ignore.
Step 2c: Otherwise, add each possible digit as second most significant digit
and repeat the same step
Solving by example of your case:
For number size of 2 i.e. __:
0_ : Ignore since it starts with 0
2_ : Minimum=20, Max=25. Both are in range. So update count by 3 (second digit might be 0,3,5)
3_ : Minimum=30, Max=35. Both are in range. So update count by 4 (second digit might be 0,2,3,5)
5_ : Minimum=50, Max=53. Both are in range. So update count by 3 (second digit might be 0,2,3)
For size 3:
0__ : Ignore since it starts with 0
2__ : Minimum=200, max=253. Both are in range. Find the number of ways you can choose 2 numbers from a set of {0,0,3,3,5}, and update the count.
3__ : Minimum=300, max=353. Both are in range. Find the number of ways you can choose 2 numbers from a set of {0,0,2,3,5}, and update the count.
5__ : Minimum=500, max=533. Both are out of range. Ignore.
A more interesting case is when max limit is 522 (instead of 400):
5__ : Minimum=500, max=533. Max out of range.
50_: Minimum=500, Max=503. Both in range. Add number of ways you can choose one digit from {0,2,3,3}
52_: Minimum=520, Max=523. Max out of range.
520: In range. Add 1 to count.
523: Out of range. Ignore.
53_: Minimum=530, Max=533. Both are out of range. Ignore.
def countComb(currentVal, digSize, maxVal, minVal, remSet):
    if currentVal != '':  # bounds only make sense once the leading digit is fixed
        minPosVal, maxPosVal = calculateMinMax(currentVal, digSize, remSet)
        if maxVal >= minPosVal >= minVal and maxVal >= maxPosVal >= minVal:
            # the whole branch is in range: count it without enumerating
            return numberPermutations(remSet, digSize, currentVal)
        elif maxPosVal < minVal or minPosVal > maxVal:
            # the whole branch is out of range
            return 0
    count = 0
    for k in set(remSet):
        if currentVal == '' and k == '0':
            continue  # ignore numbers that start with 0
        tmpRemSet = [i for i in remSet]
        tmpRemSet.remove(k)
        count += countComb(currentVal + k, digSize, maxVal, minVal, tmpRemSet)
    return count
In your case: countComb('',2,400,20,['0','0','2','3','3','5']) +
countComb('',3,400,20,['0','0','2','3','3','5']) will give the answer.
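def calculateMinMax(currentVal, digSize, remSet):
    numRemain = digSize - len(currentVal)
    # append the smallest / largest remaining digits to the prefix chosen so far
    minPosVal = int(currentVal + ''.join(sorted(remSet)[:numRemain]))
    maxPosVal = int(currentVal + ''.join(sorted(remSet, reverse=True)[:numRemain]))
    return minPosVal, maxPosVal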
numberPermutations(remSet,digSize, currentVal): Basically number of ways
you can choose (digSize-len(currentVal)) values from remSet. See permutations
with repeats.
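The answer leaves numberPermutations as a description. One possible way to fill it in, offered purely as a sketch and not necessarily what the answer had in mind, is to count the distinct digit strings recursively while tracking how many copies of each digit remain. Digits are assumed to be one-character strings, as in the call shown above.

```python
from collections import Counter


def numberPermutations(remSet, digSize, currentVal):
    """Count distinct ways to fill the remaining places from remSet."""
    slots = digSize - len(currentVal)
    counts = Counter(remSet)

    def arrangements(left):
        if left == 0:
            return 1
        total = 0
        for digit in counts:
            if counts[digit] > 0:
                counts[digit] -= 1        # use one copy of this digit
                total += arrangements(left - 1)
                counts[digit] += 1        # put it back
        return total

    return arrangements(slots)


# 2__ in the example: two places filled from {0,0,3,3,5}
ways_after_2 = numberPermutations(['0', '0', '3', '3', '5'], 3, '2')
```

Because equal digits share one counter, sequences that differ only by swapping the two 0s or the two 3s are counted once.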
If the range is small but the list is big, the easy solution is just loop over the range and check if every number can be generated from the list. The checking can be made fast by using a hash table or an array with a count for how many times each number in the list can still be used.
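A minimal Python sketch of this brute-force check, using a digit counter as the lookup table, might look like the following. The function name count_in_range is made up for illustration.

```python
from collections import Counter


def count_in_range(digits, lo, hi):
    """Count integers in [lo, hi] that can be formed from the digit list."""
    available = Counter(str(d) for d in digits)
    total = 0
    for value in range(lo, hi + 1):
        needed = Counter(str(value))
        # value is formable if no digit is needed more often than available
        if all(available[ch] >= c for ch, c in needed.items()):
            total += 1
    return total


example_count = count_in_range([5, 3, 3, 2, 0, 0], 20, 400)
```

For the example from the question this gives 31, matching the list of valid integers given there.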
For a list of n digits, z of which are zero, a lower bound l, and an upper bound u...
Step 1: The Easy Stuff
Consider a situation in which you have a 2-digit lower bound and a 4-digit upper bound. While it might be tricky to determine how many 2- and 4-digit numbers are within the bounds, we at least know that all 3-digit numbers are. And if the bounds were a 2-digit number and a 5-digit number, you know that all 3- and 4-digit numbers are fair game.
So let's generalize this to a lower bound with a digits and an upper bound with b digits. For every k between a and b (not including a and b, themselves), all k-digit numbers are within the range.
How many such numbers are there? Consider how you'd pick them: the first digit must be one of the n numbers which is non-zero (so one of (n - z) numbers), and the rest are picked from the yet-unpicked list, i.e. (n-1) choices for the second digit, (n-2) for the third, etc. So this is looking like a factorial, but with a weird first term. How many numbers of the n are picked? Why, k of them, which means we have to divide by (n - k)! to ensure we only pick k digits in total. So the equation for each k looks something like: (n - z)(n - 1)!/(n - k)! Plug in every k in the range (a, b), and you have the number of (a+1)- to (b-1)-digit numbers possible, all of which must be valid. Note that this treats the n list entries as distinct, so if the list contains repeated digits (as in the question's example), the same integer is counted more than once and the count has to be corrected for the duplicates.
Step 2: The Edge Cases
Things are a little bit trickier when you consider a- and b-digit numbers. I don't think you can avoid starting a depth-first search through all possible combinations of digits, but you can at least abort on an entire branch if it exceeds the boundary.
For example, if your list contained { 7, 5, 2, 3, 0 } and you had an upper bound of 520, your search might go something like the following:
Pick the 7: does 7 work in the hundreds place? No, because 700 > 520;
abort this branch entirely (i.e. don't consider 752, 753, 750, 725, etc.)
Pick the 5: does 5 work in the hundreds place? Yes, because 500 <= 520.
    Pick the 7: does 7 work in the tens place? No, because 570 > 520.
    Abort this branch (i.e. don't consider 573, 570, etc.)
    Pick the 2: does 2 work in the tens place? Yes, because 520 <= 520.
        Pick the 7: does 7 work in the ones place? No, because 527 > 520.
        Pick the 3: does 3 work in the ones place? No, because 523 > 520.
        Pick the 0: does 0 work in the ones place? Yes, because 520 <= 520.
        Oh hey, we found a number. Make sure to count it.
    Pick the 3: does 3 work in the tens place? No; abort this branch.
    Pick the 0: does 0 work in the tens place? Yes.
        ...and so on.
...and then you'd do the same for the lower bound, but flipping the comparators. It's not nearly as efficient as the k-digit combinations in the (a, b) interval (i.e. O(1)), but at least you can avoid a good deal by pruning branches that must be impossible early on. In any case, this strategy ensures you only have to actually enumerate the two edge cases that are the boundaries, regardless of how wide your (a, b) interval is (or if you have 0 as your lower bound, only one edge case).
EDIT:
Something I forgot to mention (sorry, I typed all of the above on the bus home):
When doing the depth-first search, you actually only have to recurse when your first number equals the first number of the bound. That is, if your bound is 520 and you've just picked 3 as your first number, you can just add (n-1)!/(n-3)! immediately and skip the entire branch, because all 3-digit numbers beginning with 3 are certainly below 520.
