3n+1 Optimization Idea for Larger Integers - algorithm

I recently got into the book "Programming Challenges" by Skiena and Revilla and was somewhat surprised when I saw the solution to the 3n+1 problem, which was simply brute forced. Basically it's an algorithm that generates a list of numbers, dividing by 2 if even and multiplying by 3 and adding 1 if odd. This occurs until n=1 is reached, its base case. Now the trick is to find the maximum length of a list between integers i and j which in the problem ranges between 1 and 1,000,000 for both variables. So I was wondering how much more efficient (if so) a program would be with Dynamic Programming. Basically, the program would do one pass on the first number, i, find the total length, and then check each individual number within the array and store the associated lengths within a HashMap or other dictionary data type.
For Example:
Let's say i = 22 and j = 23
For 22:
22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1
This means that the dictionary, with the structure (number, length), would store
(22,16) , (11,15) , (34,14) and so on... until (1,1)
Now for 23:
23 70 35 106 53 160 80 40 ...
Since 40 was hit, and it is in the dictionary
the program would take the length from 23 to 80, which is 7, and add it to the length stored previously for 40, which is 9, resulting in a total list length of 16. And of course the program would also store the lengths of 23, 70, 35 etc., so that larger numbers compute faster.
So what are the opinions of approaching such a question in this manner?
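A minimal Python sketch of the idea described above, assuming a plain dict as the cache. The names cycle_length and max_cycle_length are illustrative only and do not come from the book or the judge:

```python
def cycle_length(n, cache):
    """Return the 3n+1 list length of n, caching every value met on the way."""
    path = []
    # Walk forward until we hit a number whose length is already known.
    while n not in cache:
        path.append(n)
        n = n // 2 if n % 2 == 0 else 3 * n + 1
    length = cache[n]
    # Unwind the path: each earlier number is one step longer.
    for value in reversed(path):
        length += 1
        cache[value] = length
    return length


def max_cycle_length(i, j):
    """Maximum list length over all integers between i and j (inclusive)."""
    cache = {1: 1}  # base case
    lo, hi = min(i, j), max(i, j)
    return max(cycle_length(n, cache) for n in range(lo, hi + 1))


print(max_cycle_length(22, 23))  # both 22 and 23 have length 16
```

The loop is iterative rather than recursive so that long sequences do not hit Python's recursion limit.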

I tried both approaches and submitted them to UVaOJ, the brute force solution got runtime ~0.3s and the dp solution ~0.0s. It gets pretty slow when the range gets long (like over 1e7 elements).
I just used an array (memo) to memoize the first 5 million (SIZE) values:
const int SIZE = 5000000; // number of memoized values
int memo[SIZE];           // memo[n] == 0 means "not computed yet"

int cycleLength(long long n)
{
    if (n < 1) // overflow
        return 0;
    if (n == 1)
        return 1;
    if (n < SIZE && memo[n] != 0)
        return memo[n];
    int res = 1 + cycleLength(n % 2 == 0 ? n / 2 : 3 * n + 1);
    if (n < SIZE)
        memo[n] = res;
    return res;
}

Related

How to search the minimum n that 10^n ≡ 1 mod(9x) for given x

For given x, I need to calculate the minimum n that equates true for the formula 10^n ≡ 1 (mod 9x)
My algorithm is simple. For i = 1 to inf, I loop it until I get a result. There is always a result if gcd(10, x) = 1. Meanwhile if I don't get a result, I increase i by 1 .
This is really slow for big primes or numbers with a factorization of big values, so I ask if there is another way to calculate it faster. I have tried with threads, getting each thread the next 10^i to calculate. Performance is a bit better, but big primes still don't finish.
You can use Euler's theorem (the generalization of Fermat's Little Theorem to composite moduli).
Assuming your x is relatively prime with 10, the following holds:
10 ^ φ(9x) ≡ 1 (mod 9x)
Here φ is Euler's totient function. So you can easily calculate at least one n (not necessarily the smallest) for which your equation holds. The smallest such n must divide φ(9x), so to find it, just iterate through the divisors of φ(9x) in increasing order.
Example: x = 89 (a prime number just for simplicity).
9x = 801
φ(9x) = 6 * (89 - 1) = 528 (easy to calculate for a prime number)
The list of divisors of 528:
1, 2, 3, 4, 6, 8, 11, 12, 16, 22, 24, 33, 44, 48, 66, 88, 132, 176, 264, 528
Trying each one, you can find that your equation holds for 44:
10 ^ 44 ≡ 1 (mod 801)
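The divisor search can be sketched in Python as follows. The helper names (totient, divisors, smallest_n) are made up for illustration, and the totient is computed by plain trial division, which is an assumption that only suits moderately sized x:

```python
from math import gcd, isqrt


def totient(m):
    """Euler's totient of m, computed by trial division."""
    result, n, p = m, m, 2
    while p * p <= n:
        if n % p == 0:
            while n % p == 0:
                n //= p
            result -= result // p
        p += 1
    if n > 1:  # one prime factor larger than sqrt(m) may remain
        result -= result // n
    return result


def divisors(m):
    """All divisors of m in increasing order."""
    small = [d for d in range(1, isqrt(m) + 1) if m % d == 0]
    return sorted(set(small + [m // d for d in small]))


def smallest_n(x):
    """Smallest n >= 1 with 10^n = 1 (mod 9x); requires gcd(10, x) == 1."""
    m = 9 * x
    if gcd(10, m) != 1:
        raise ValueError("no solution unless gcd(10, x) == 1")
    for d in divisors(totient(m)):
        if pow(10, d, m) == 1:  # modular exponentiation, no big powers
            return d


print(smallest_n(89))  # 44
```

The three-argument pow keeps every intermediate value below 9x, so each candidate divisor costs only O(log n) multiplications.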
I just tried the example, it runs in less than one second:
public class Main {
    public static void main(String[] args) {
        int n = 1;
        int x = 954661;
        int v = 10;
        while (v != 1) {
            n++;
            v = (v * 10) % (9 * x);
        }
        System.out.println(n);
    }
}
For larger values of x the variables should be of type long.
As you specified you are actually trying to get modulus with 1 that is 1mod(9x).
That will always give you 1.
And you don't have to calculate that part exactly which might reduce your calculation.
On the other hand for 10^n = 1, it will always be 0.
So can you specify exactly what you are trying to do?

The 1000th element which is product of 2, 3, 5

There is a sequence S.
Every element in S is a product of powers of 2, 3 and 5.
S = {2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24 ...}
How to get the 1000th element in this sequence efficiently?
I check each number from 1, but this method is too slow.
A geometric approach:
Let s = 2^i . 3^j . 5^k, where the triple (i, j, k) belongs to the first octant of a 3D state space.
Taking the logarithm,
ln(s) = i.ln(2) + j.ln(3) + k.ln(5)
so that in the state space the iso-s surfaces are planes, which intersect the first octant along a triangle. On the other hand, the feasible solutions are the nodes of a square grid.
If one wants to produce the s-values in increasing order, one can keep a list of the grid nodes closest to the current s-plane*, on its "greater than" side.
If I am right, to move from one s-value to the next, it suffices to discard the current (i, j, k) and replace it by the three triples (i+1, j, k), (i, j+1, k) and (i, j, k+1), unless they are already there, and pick the next smallest s.
An efficient implementation will be by storing the list as a binary tree with the log(s)-value as the key.
If you are asking for the first N values, you will explore a pyramidal volume of state-space of height O(³√N), and base area O(³√N²), which is the number of tree nodes, hence the spatial complexity. Every query in the tree will take O(log(N)) comparisons (and O(1) operations to fetch the minimum), for a total of O(N.log(N)).
*More precisely, the list will contain all triples on the "greater than" side and such that no index can be decreased without getting on the other side of the plane.
Here is Python code that implements these ideas.
You will notice that the logarithms are converted to fixed point (7 decimals) to avoid floating-point inaccuracies that could result in the log(s)-values not being found equal. This causes the s values being inexact in the last digits, but this does not matter as long as the ordering of the values is preserved. Recomputing the s-values from the indexes yields exact values.
import math
import bintrees

# Constants: logarithms in fixed point (7 decimals)
ln2 = round(10000000 * math.log(2))
ln3 = round(10000000 * math.log(3))
ln5 = round(10000000 * math.log(5))

# Initial list: key is the fixed-point log(s), value is the triple (i, j, k)
t = bintrees.FastAVLTree()
t.insert(0, (0, 0, 0))

# Find the N first products
N = 100
for n in range(N):
    # Current s
    key, (i, j, k) = t.pop_min()
    print(2 ** i * 3 ** j * 5 ** k)  # exact value recomputed from the indexes
    # Update the list
    if key + ln2 not in t:
        t.insert(key + ln2, (i + 1, j, k))
    if key + ln3 not in t:
        t.insert(key + ln3, (i, j + 1, k))
    if key + ln5 not in t:
        t.insert(key + ln5, (i, j, k + 1))
The 100 first values are
1 2 3 4 5 6 8 9 10 12
15 16 18 20 24 25 27 30 32 36
40 45 48 50 54 60 64 72 75 80
81 90 96 100 108 120 125 128 135 144
150 160 162 180 192 200 216 225 240 243
250 256 270 288 300 320 324 360 375 384
400 405 432 450 480 486 500 512 540 576
600 625 640 648 675 720 729 750 768 800
810 864 900 960 972 1000 1024 1080 1125 1152
1200 1215 1250 1280 1296 1350 1440 1458 1500 1536
The plot of the number of tree nodes confirms the O(³√N²) spatial behavior.
Update:
When there is no risk of overflow, a much simpler version (not using logarithms) is possible:
import bintrees

# Initial list
t = bintrees.FastAVLTree()
t[1] = None

# Find the N first products
N = 100
for i in range(N):
    # Current s
    (s, r) = t.pop_min()
    print(s)
    # Update the list (re-assigning an existing key is harmless)
    t[2 * s] = None
    t[3 * s] = None
    t[5 * s] = None
Simply put, you just have to generate each ith number consecutively. Let's call the set {2, 3, 5} Z. At the ith iteration, assume you have all (i-1) values generated in the previous iterations. To generate the next one, you try every element of Z and, for each of them, generate the least element it can form that is larger than the element generated at the (i-1)th iteration. Then you simply take the smallest one among them as the ith value. A simple and not so efficient implementation is given below.
def generate_simple(N, Z):
    generated = [1]
    for i in range(1, N+1):
        minFound = -1
        minElem = -1
        for j in range(0, len(Z)):
            for k in range(0, len(generated)):
                candidateVal = Z[j] * generated[k]
                if candidateVal > generated[-1]:
                    if minFound == -1 or minFound > candidateVal:
                        minFound = candidateVal
                        minElem = j
                    break
        generated.append(minFound)
    return generated[-1]
As you may observe, this approach has a time complexity of O(N^2 * |Z|). An improvement in terms of efficiency would be to store, for each element of Z, where we left off scanning the array of generated values, in a second array, indicesToStart. Then, for each element, we would scan the N values of the generated array only once over the whole run of the algorithm, which means the time complexity after such an improvement would be O(N * |Z|).
A simple implementation of the improvement based on the simple version provided above, is given below.
def generate_improved(N, Z):
    generated = [1]
    indicesToStart = [0] * len(Z)
    for i in range(1, N+1):
        minFound = -1
        minElem = -1
        for j in range(0, len(Z)):
            for k in range(indicesToStart[j], len(generated)):
                candidateVal = Z[j] * generated[k]
                if candidateVal > generated[-1]:
                    if minFound == -1 or minFound > candidateVal:
                        minFound = candidateVal
                        minElem = j
                    break
                indicesToStart[j] += 1
        generated.append(minFound)
        indicesToStart[minElem] += 1
    return generated[-1]
If you have a hard time understanding how complexity decreases with this algorithm, try looking into the difference in time complexity of any graph traversal algorithm when an adjacency list is used, and when an adjacency matrix is used. The improvement adjacency lists help achieve is almost exactly the same kind of improvement we get here. In a nutshell, you have an index for each element and instead of starting to scan from the beginning you continue from wherever you left the last time you scanned the generated array for that element. Consequently, even though there are N iterations in the algorithm(i.e. the outermost loop) the overall number of operations you make is O(N * |Z|).
Important Note: All the code above is a simple implementation for demonstration purposes, and you should consider it just as a pseudocode you can test. While implementing this in real life, based on the programming language you choose to use, you will have to consider issues like integer overflow when computing candidateVal.
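For comparison, here is a compact sketch of the same "one index per element of Z" idea without the inner scanning loop. It is a variant written for illustration (the name nth_product is an assumption, not part of the answer above): each index always points at the generated value whose multiple is that element's next candidate, and every index that produced the chosen minimum is advanced, which also removes duplicates such as 2*3 and 3*2.

```python
def nth_product(n, Z=(2, 3, 5)):
    """Return the n-th element (1-based) of S = {2, 3, 4, 5, 6, 8, ...}."""
    generated = [1]       # seed value; it is not itself a member of S
    idx = [0] * len(Z)    # idx[j]: position in generated to multiply by Z[j]
    while len(generated) <= n:
        candidates = [z * generated[i] for z, i in zip(Z, idx)]
        nxt = min(candidates)
        generated.append(nxt)
        # Advance every index whose candidate was just used.
        idx = [i + (c == nxt) for i, c in zip(idx, candidates)]
    return generated[n]


print(nth_product(14))  # 24, the 14th element of the sequence in the question
```

Each iteration does O(|Z|) work, so producing the first N values costs O(N * |Z|), matching the bound given above.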

Prime factorization of a factorial

I need to write a program to input a number and output its factorial's prime factorization in the form:
4!=(2^3)*(3^1)
5!=(2^3)*(3^1)*(5^1)
The problem is I still can't figure out how to get that result.
Apparently each first number in brackets is for the ascending prime numbers up until the actual factorial. The second number in brackets is the amount of times the number occurs in the factorial.
What I can't figure out is for example in 5!=(2^3)*(3^1)*(5^1), how does 2 only occur 3 times, 3 only 1 time and 5 only one time in 120 (5!=120).
I have now solved this thanks to the helpful people who commented but I'm now having trouble trying to figure out how could I take a number and get the factorial in this format without actually calculating the factorial.
Every number can be represented by a unique (up to re-ordering) multiplication of prime numbers, called the prime factorization of the number, as you are finding the prime factors that can uniquely create that number.
2^3=8
3^1=3
5^1=5
and 8*3*5=120
But this also means that: (2^3)*(3^1)*(5^1) = 120
It's not saying that 2 occurs 3 times as a digit in the number 120, which it obviously does not, but rather to multiply 2 by 2 by 2, for a total of 3 twos. Likewise for the 3 and 5, which occur once in the prime factorization of 120. The expression which you mention is showing you this unique prime factorization of the number 120. This is one way of getting the prime factorization of a number in Python:
def pf(number):
    factors = []
    d = 2
    while number > 1:
        while number % d == 0:
            factors.append(d)
            number = number // d  # integer division keeps number an int
        d += 1
    return factors
Running it you get:
>>> pf(120)
[2, 2, 2, 3, 5]
Which multiplied together give you 120, as explained above.
Consider e.g. 33!. It's a product of:
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
the factors are:
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2
2 2 2 2
2 2
2
3 3 3 3 3 3 3 3 3 3 3
3 3 3
3
5 5 5 5 5 5
5
7 7 7 7
11 11 11
13 13
17
19
23
29 31
Do you see the pattern?
33! = 2^( 33 div 2 + 33 div 4 + 33 div 8 + 33 div 16 + 33 div 32) *
3^( 33 div 3 + 33 div 9 + 33 div 27) *
5^( 33 div 5 + 33 div 25) *
----
7^( 33 div 7) * 11^( 33 div 11) * 13^( 33 div 13) *
----
17 * 19 * 23 * 29 * 31
Thus, to find prime factorization of n! without doing any multiplications or factorizations, we just need to have the ordered list of primes not greater than n, which we process (with a repeated integer division and a possible summation) in three stages - primes that are smaller or equal to the square root of n; such that are smaller or equal to n/2; and the rest.
Actually with lazy evaluation it's even simpler than that. Assuming primes is already implemented returning a stream of prime numbers in order, in Haskell, factorial factorization is found as
ff n = [(p, sum . takeWhile (> 0) . tail . iterate (`div` p) $ n)
| p <- takeWhile (<= n) primes]
-- Prelude> ff 33
-- [(2,31),(3,15),(5,7),(7,4),(11,3),(13,2),(17,1),(19,1),(23,1),(29,1),(31,1)]
because 33 div 4 is (33 div 2) div 2, etc..
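The same repeated-division rule can be sketched in Python. This is an illustrative version, not a translation of the Haskell above: the function names are made up, and the primes come from a simple Sieve of Eratosthenes rather than a lazy stream.

```python
def factorial_factorization(n):
    """Return [(prime, exponent), ...] for n! without computing n!."""
    is_prime = [True] * (n + 1)
    result = []
    for p in range(2, n + 1):
        if not is_prime[p]:
            continue
        for multiple in range(p * p, n + 1, p):
            is_prime[multiple] = False
        # exponent = n div p + n div p^2 + n div p^3 + ...
        exponent, q = 0, n // p
        while q > 0:
            exponent += q
            q //= p
        result.append((p, exponent))
    return result


def format_factorial(n):
    """Format the factorization in the style asked for in the question."""
    terms = "*".join(f"({p}^{e})" for p, e in factorial_factorization(n))
    return f"{n}!={terms}"


print(format_factorial(5))  # 5!=(2^3)*(3^1)*(5^1)
```

Since only quotients of n are ever formed, this works for inputs far beyond the point where n! itself would overflow a machine integer.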
2^3 is another way of writing 2³, or two to the third power. (2^3)(3^1)(5^1) = 2³ × 3 × 5 = 120.
(2^3)(3^1)(5^1) is just the prime factorization of 120 expressed in plain ASCII text rather than with pretty mathematical formatting. Your assignment requires output in this form simply because it's easier for you to output than it would be for you to figure out how to output formatted equations (and probably because it's easier to process for grading).
The conventions used here for expressing equations in plain text are standard enough that you can directly type this text into google.com or wolframalpha.com and it will calculate the result as 120 for you: (2^3)(3^1)(5^1) on wolframalpha.com / (2^3)(3^1)(5^1) on google.com
WolframAlpha can also compute prime factorizations, which you can use to get test results to compare your program with. For example: prime factorization of 1000!
A naïve solution that actually calculates the factorial will only handle numbers up to 12 (if using 32 bit ints). This is because 13! is ~6.2 billion, larger than the largest number that can be represented in a 32 bit int.
However it's possible to handle much larger inputs if you avoid calculating the factorial first. I'm not going to tell you exactly how to do that because either figuring it out is part of your assignment or you can ask your prof/TAs. But below are some hints.
a^b × a^c = a^(b+c)
equation (a) 10 = 2^1 × 5^1
equation (b) 15 = 3^1 × 5^1
10 × 15 = ? Answer using the right hand sides of equations (a) and (b), not with the number 150.
10 × 15 = (2^1 × 5^1) × (3^1 × 5^1) = 2^1 × 3^1 × (5^1 × 5^1) = 2^1 × 3^1 × 5^2
As you can see, computing the prime factorization of 10 × 15 can be done without multiplying 10 by 15; You can instead compute the prime factorization of the individual terms and then combine those factorizations.
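As a small illustration of that hint only (not the full assignment), the exponents of two factorizations can be added with collections.Counter. The helper prime_factors is a hypothetical name using plain trial division:

```python
from collections import Counter


def prime_factors(n):
    """Return a Counter mapping each prime factor of n to its exponent."""
    factors = Counter()
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors[d] += 1
            n //= d
        d += 1
    if n > 1:  # whatever is left is itself prime
        factors[n] += 1
    return factors


# a^b * a^c = a^(b+c): adding Counters adds exponents of like bases.
combined = prime_factors(10) + prime_factors(15)
print(sorted(combined.items()))  # [(2, 1), (3, 1), (5, 2)]
```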
If you write out the factorial 5!:
1 * 2 * 3 * 4 * 5,
you will notice that there is one non-prime number: 4. 4 can be written as 2 * 2 or 2^2, which is where the extra 2s come from.
Add up all of the occurrences (exponential forms are in parentheses; add exponents for like bases):
2 (2^1) * 3 (3^1) * 4 (2^2) * 5 (5^1), you get the proper answer.
You can use O(n/2 log log n) algorithm using only sums (no need precalc primes).
This is a sieve using relation
f = a * b ~> f^k = a^k * b^k
then, we reduce all initial factors 1 * 2 * 3 * ... * n moving k from big numbers to small numbers.
With a Sieve of Atkin, Will Ness's algorithm could be better for very big n; otherwise, I think this one could be better.
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int n = atoi(argv[1]);
    int *p = (int *) malloc(sizeof(int) * (n + 1));
    int i, j, d;
    for(i = 0; i <= n; i++)
        p[i] = 1;
    for(i = n >> 1; i > 1; i--)
        if(p[i]) {
            for(j = i + i, d = 2; j <= n; j += i, d++) {
                if(p[j]) {
                    p[i] += p[j];
                    p[d] += p[j];
                    p[j] = 0;
                }
            }
        }
    printf("1");
    for(i = 2; i <= n; i++)
        if(p[i])
            printf(" * %i^%i", i, p[i]);
    printf("\n");
    return 0;
}

Finding the minimum and maximum element from one of many arrays

I received a question during an Amazon interview and would like assistance with solving it.
Given N arrays of size K each, where each array is sorted and all N*K elements are unique, choose a single element from each of the N arrays. From the chosen subset of N elements, subtract the minimum element from the maximum element. This difference should be the least possible.
Sample:
N=3, K=3
N=1 : 6, 16, 67
N=2 : 11,17,68
N=3 : 10, 15, 100
here if 16, 17, 15 are chosen, we get the minimum difference as
17-15=2.
I can think of O(N*K*N)(edited after correctly pointed out by zivo, not a good solution now :( ) solution.
1. Take N pointers, each initially pointing to the first element of one of the N arrays.
6, 16, 67
^
11, 17, 68
^
10, 15, 100
^
2. Find the highest and lowest elements among the current pointers, O(N) (here 11 and 6), and compute the difference between them (5).
3. Advance the pointer that points to the lowest element by one position in its array.
6, 16, 67
   ^
11, 17, 68
^
10, 15, 100 (difference: 6, minimum so far: 5)
^
4. Keep repeating steps 2 and 3, storing the minimum difference.
6, 16, 67
   ^
11, 17, 68
^
10, 15, 100 (difference: 5)
    ^
6, 16, 67
   ^
11, 17, 68
    ^
10, 15, 100 (difference: 2)
    ^
The state above is the required solution.
6, 16, 67
   ^
11, 17, 68
    ^
10, 15, 100 (difference: 84)
        ^
6, 16, 67
       ^
11, 17, 68
    ^
10, 15, 100 (difference: 83)
        ^
And so on...
EDIT:
Its complexity can be reduced by using a heap (as suggested by Uri). I thought of it but faced a problem: Each time an element is extracted from heap, its array number has to be found out in order to increment the corresponding pointer for that array. An efficient way to find array number can definitely reduce the complexity to O(K*N log(K*N)). One naive way is to use a data structure like this
Struct
{
int element;
int arraynumer;
}
and reconstruct the initial data like
6|0,16|0,67|0
11|1,17|1,68|1
10|2,15|2,100|2
Initially keep the current max for first column and insert the pointed elements in heap. Now each time an element is extracted, its array number can be found out, pointer in that array is incremented , the newly pointed element can be compared to current max and max pointer can be adjusted accordingly.
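A minimal Python sketch of the heap variant described above, under the assumption that every array is non-empty and sorted ascending. Instead of a struct, each heap entry is a tuple (value, array number, position), so the array number is available as soon as an element is extracted; the function name smallest_range is made up for illustration:

```python
import heapq


def smallest_range(arrays):
    """Return (low, high) of the tightest range holding one element per array."""
    # One pointer per array, kept in a min-heap keyed by the pointed-to value.
    heap = [(arr[0], a, 0) for a, arr in enumerate(arrays)]
    heapq.heapify(heap)
    current_max = max(arr[0] for arr in arrays)
    best = None
    while True:
        value, a, pos = heap[0]  # current minimum and the array it came from
        if best is None or current_max - value < best[1] - best[0]:
            best = (value, current_max)
        if pos + 1 == len(arrays[a]):
            return best  # the minimum's array is exhausted, no better range exists
        nxt = arrays[a][pos + 1]
        current_max = max(current_max, nxt)
        heapq.heapreplace(heap, (nxt, a, pos + 1))  # advance that array's pointer


print(smallest_range([[6, 16, 67], [11, 17, 68], [10, 15, 100]]))  # (15, 17)
```

The heap never holds more than N entries, so each of the at most N*K pointer moves costs O(log N).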
So here is an algorithm to solve this problem in two steps:
The first step is to merge all your arrays into one sorted array, which would look like this:
combined_val[] - which holds all the numbers
combined_ind[] - which holds the index of the array each number originally belonged to
This step can be done easily in O(K*N*log(N)), but I think you can do better than that too (maybe not; you can look up variants of merge sort, because they do a similar step).
Now the second step:
It is easier to just show code instead of explaining, so here is the pseudocode:
int count[N] = { 0 };
int head = 0;
int diffcnt = 0;
// mindiff is initialized to overall maximum value - overall minimum value
int mindiff = combined_val[N * K - 1] - combined_val[0];
for (int i = 0; i < N * K; i++)
{
    count[combined_ind[i]]++;
    if (count[combined_ind[i]] == 1) {
        // diffcnt counts how many arrays have at least one element between
        // indexes of "head" and "i". Once diffcnt reaches N it will stay N and
        // not increase anymore
        diffcnt++;
    } else {
        while (count[combined_ind[head]] > 1) {
            // We try to move head index as forward as possible while keeping diffcnt constant.
            // i.e. if count[combined_ind[head]] is 1, then if we would move head forward
            // diffcnt would decrease, that is something we dont want to do.
            count[combined_ind[head]]--;
            head++;
        }
    }
    if (diffcnt == N) {
        // i.e. we got at least one element from all arrays
        if (combined_val[i] - combined_val[head] < mindiff) {
            mindiff = combined_val[i] - combined_val[head];
            // if you want to save actual numbers too, you can save this (i.e. i and head
            // and then extract data from that)
        }
    }
}
the result is in mindiff.
The running time of the second step is O(N * K). This is because the "head" index will move at most N*K times, so the inner loop does not make this quadratic; it is still linear.
So total algorithm running time is O(N * K * log(N)), however this is because of merging step, if you can come up with better merging step you can probably bring it down to O(N * K).
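Both steps can be sketched together in Python. This is an illustrative version rather than a line-by-line port of the pseudocode: the merge step uses heapq.merge from the standard library (an N-way merge, so O(N * K * log(N))), and the name min_spread is made up.

```python
import heapq
from collections import Counter


def min_spread(arrays):
    """Smallest max-min over all ways of picking one element per sorted array."""
    # Step 1: merge into one sorted list of (value, array index) pairs.
    tagged = [[(v, a) for v in arr] for a, arr in enumerate(arrays)]
    merged = list(heapq.merge(*tagged))

    # Step 2: sliding window [head, i] that always covers as many arrays as seen.
    count = Counter()
    covered = 0  # number of arrays with at least one element in the window
    head = 0
    best = None
    for value, a in merged:
        count[a] += 1
        if count[a] == 1:
            covered += 1
        # Drop head elements whose array is still represented further right.
        while count[merged[head][1]] > 1:
            count[merged[head][1]] -= 1
            head += 1
        if covered == len(arrays):
            spread = value - merged[head][0]
            if best is None or spread < best:
                best = spread
    return best


print(min_spread([[6, 16, 67], [11, 17, 68], [10, 15, 100]]))  # 2
```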
This problem is for managers
You have 3 developers (N1), 3 testers (N2) and 3 DBAs (N3)
Choose the least divergent team that can run a project successfully.
int[n] result;// where result[i] keeps the element from bucket N_i
int[n] latest;//where latest[i] keeps the latest element visited from bucket N_i
Iterate elements in (N_1 + N_2 + N_3) in sorted order
{
Keep track of latest element visited from each bucket N_i by updating 'latest' array;
if boundary(latest) < boundary(result)
{
result = latest;
}
}
int boundary(int[] array)
{
return Max(array) - Min(array);
}
I've O(K*N*log(K)), with typical execution much less. Currently cannot think anything better. I'll explain first the easier to describe (somewhat longer execution):
For each element f in the first array (loop through K elements)
For each array, starting from the second array (loop through N-1 arrays)
Do a binary search on the array, and find element closest to f. This is your element (Log(K))
This algorithm can be optimized if, for each array, you add a new Floor Index. When performing the binary search, search between 'Floor' and 'K-1'.
Initially Floor index is 0, and for first element you search through the entire arrays. Once you find an element closest to 'f', update the Floor Index with the index of that element. Worse case is the same (Floor may not update, if maximum element of first array is smaller than any other minimum), but average case will improve.
Correctness proof for the accepted answer (Terminal's solution)
Assume that the algorithm finds a series A=<A[1],A[2],...,A[N]> which isn't the optimal solution (R).
Consider the index j in R, such that item R[j] is the first item among R that the algorithm examines and replaces it with the next item in its row.
Let A' denote the candidate solution at that phase (prior to the replacement). Since R[j]=A'[j] is the minimum value of A', it's also the minimum of R.
Now, consider the maximum value of R, R[m]. If A'[m]<R[m], then R can be improved by replacing R[m] with A'[m], which contradicts the fact that R is optimal. Therefore, A'[m]=R[m].
In other words, R and A' share the same maximum and minimum, therefore they are equivalent. This completes the proof: if R is an optimal solution, then the algorithm is guaranteed to find a solution as good as R.
for every element in 1st array
choose the element in 2nd array that is closest to the element in 1st array
current_array = 2;
do
{
choose the element in current_array+1 that is closest to the element in current_array
current_array++;
} while(current_array < n);
complexity: O(k^2*n)
Here is my logic on how to resolve this issue, keeping in mind that we need to pick one element from each of the N arrays (to compute the least minimum)
// if we take the above values as an example!
// then the idea would be to sort all three arrays while keeping another
// array to keep the reference to their sets (1 or 2 or 3, could be
// extended to n sets)
1 3 2 3 1 2 1 2 3 // this is the array that holds the set index
6 10 11 15 16 17 67 68 100 // this is the sorted combined array.
| |
5 2 33 // this is the computed least minimum,
// the rule is to make sure the indexes of the values
// we are comparing are different (to make sure we are
// comparing elements from different sets), then for example
// the first element of that example is index:1|value:6 we hold
// that value 6 (that is the value we will be using to compute the least minimum,
// then we go to the edge of the comparison which would be the second different index,
// we skip index:3|value:10 (we remove it from the array) we compare index:2|value:11
// to index:1|value:6 we obtain 5 which would go to a variable named leastMinimum = 5,
// now we remove the indexes and values we already used,
// and redo the same steps.
Step 1:
1 3 2 3 1 2 1 2 3
6 10 11 15 16 17 67 68 100
|
5
leastMinumum = 5
Step 2:
3 1 2 1 2 3
15 16 17 67 68 100
|
2
leastMinimum = min(2, leastMinumum) // which is equal 2
Step 3:
1 2 3
67 68 100
33
leastMinimum = min(33, leastMinumum) // which is equal to old leastMinumum which is 2
Now: We suppose we have elements from the same array that are very close to each other (k=2 this time which means we only have 3 sets with two values) :
// After sorting the n arrays we will have the below indexes array and values array
1 1 2 3 2 3
6 7 8 12 15 16
* * *
* we skip second index of 1|7 and we take the least minimum of 1|6 and 3|12 (index:2|value:8 will be removed as it is not at the edges, we pick the minimum and maximum of the unique index subset of n elements)
1 3
6 12
=6
* second step we remove the values we already used, so the array become like below:
1 2 3
7 15 16
* * *
7 - 16
= 9
Note:
Another approach that consumes more memory would consist of creating N sub-arrays, for each of which we would compute maximum - minimum
So from the below sorted values array and its corresponding indexes array we extract three other sub arrays:
1 3 2 3 1 2 1 2 3
6 10 11 15 16 17 67 68 100
First Array:
1 3 2
6 10 11
11-6 = 5
Second Array:
3 1 2
15 16 17
17-15 = 2
Third Array:
1 2 3
67 68 100
100 - 67 = 33

How to count each digit in a range of integers?

Imagine you sell those metallic digits used to number houses, locker doors, hotel rooms, etc. You need to find how many of each digit to ship when your customer needs to number doors/houses:
1 to 100
51 to 300
1 to 2,000 with zeros to the left
The obvious solution is to do a loop from the first to the last number, convert the counter to a string with or without zeros to the left, extract each digit and use it as an index to increment an array of 10 integers.
I wonder if there is a better way to solve this, without having to loop through the entire integers range.
Solutions in any language or pseudocode are welcome.
Edit:
Answers review
John at CashCommons and Wayne Conrad comment that my current approach is good and fast enough. Let me use a silly analogy: If you were given the task of counting the squares in a chess board in less than 1 minute, you could finish the task by counting the squares one by one, but a better solution is to count the sides and do a multiplication, because you later may be asked to count the tiles in a building.
Alex Reisner points to a very interesting mathematical law that, unfortunately, doesn’t seem to be relevant to this problem.
Andres suggests the same algorithm I’m using, but extracting digits with %10 operations instead of substrings.
John at CashCommons and phord propose pre-calculating the digits required and storing them in a lookup table or, for raw speed, an array. This could be a good solution if we had an absolute, unmovable, set in stone, maximum integer value. I’ve never seen one of those.
High-Performance Mark and strainer computed the needed digits for various ranges. The result for one million seems to indicate there is a proportion, but the results for other numbers show different proportions.
strainer found some formulas that may be used to count digit for number which are a power of ten.
Robert Harvey had a very interesting experience posting the question at MathOverflow. One of the math guys wrote a solution using mathematical notation.
Aaronaught developed and tested a solution using mathematics. After posting it he reviewed the formulas originated from Math Overflow and found a flaw in it (point to Stackoverflow :).
noahlavine developed an algorithm and presented it in pseudocode.
A new solution
After reading all the answers, and doing some experiments, I found that for a range of integers from 1 to 10^n - 1:
For digits 1 to 9, n*10^(n-1) pieces are needed
For digit 0, if not using leading zeros, n*10^(n-1) - ((10^n - 1) / 9) are needed
For digit 0, if using leading zeros, n*10^(n-1) - n are needed
The first formula was found by strainer (and probably by others), and I found the other two by trial and error (but they may be included in other answers).
For example, if n = 6, range is 1 to 999,999:
For digits 1 to 9 we need 6*10^5 = 600,000 of each one
For digit 0, without leading zeros, we need 6*10^5 - (10^6 - 1)/9 = 600,000 - 111,111 = 488,889
For digit 0, with leading zeros, we need 6*10^5 - 6 = 599,994
These numbers can be checked using High-Performance Mark results.
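The three formulas can also be checked against a brute-force count for small n. The sketch below is only a verification aid; the function names formula_counts and brute_counts are made up for illustration:

```python
def formula_counts(n, leading_zeros=False):
    """Digit counts for the range 1 .. 10^n - 1 according to the formulas."""
    per_digit = n * 10 ** (n - 1)
    if leading_zeros:
        zeros = per_digit - n
    else:
        zeros = per_digit - (10 ** n - 1) // 9
    return [zeros] + [per_digit] * 9  # index 0 is digit 0, index 9 is digit 9


def brute_counts(n, leading_zeros=False):
    """Same counts obtained by looping over every number in the range."""
    counts = [0] * 10
    for number in range(1, 10 ** n):
        text = str(number).zfill(n) if leading_zeros else str(number)
        for ch in text:
            counts[int(ch)] += 1
    return counts


for n in range(1, 5):
    assert formula_counts(n) == brute_counts(n)
    assert formula_counts(n, True) == brute_counts(n, True)
print(formula_counts(6)[:2])  # [488889, 600000]
```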
Using these formulas, I improved the original algorithm. It still loops from the first to the last number in the range of integers, but, if it finds a number which is a power of ten, it uses the formulas to add to the digits count the quantity for a full range of 1 to 9 or 1 to 99 or 1 to 999 etc. Here's the algorithm in pseudocode:
integer First,Last   //First and last number in the range
integer Number       //Current number in the loop
integer Power        //Power is the n in 10^n in the formulas
integer Nines        //Nines is the result of 10^n - 1, 10^5 - 1 = 99999
integer Prefix       //First digits in a number. For 14,200, prefix is 142
array 0..9 Digits    //Will hold the count for all the digits
FOR Number = First TO Last
  CALL TallyDigitsForOneNumber WITH Number,1  //Tally the count of each digit
                                              //in the number, increment by 1
  //Start of optimization. Comments are for Number = 1,000 and Last = 8,000.
  Power = Zeros at the end of number          //For 1,000, Power = 3
  IF Power > 0                                //The number ends in 0 00 000 etc
    Nines = 10^Power-1                        //Nines = 10^3 - 1 = 1000 - 1 = 999
    IF Number+Nines <= Last                   //If 1,000+999 <= 8,000, add a full set
      Digits[0-9] += Power*10^(Power-1)       //Add 3*10^(3-1) = 300 to digits 0 to 9
      Digits[0]   -= Power                    //Adjust digit 0 (leading zeros formula)
      Prefix = First digits of Number         //For 1000, prefix is 1
      CALL TallyDigitsForOneNumber WITH Prefix,Nines //Tally the count of each
                                                     //digit in prefix,
                                                     //increment by 999
      Number += Nines                         //Increment the loop counter 999 cycles
    ENDIF
  ENDIF
  //End of optimization
ENDFOR
SUBROUTINE TallyDigitsForOneNumber PARAMS Number,Count
  REPEAT
    Digits [ Number % 10 ] += Count
    Number = Number / 10
  UNTIL Number = 0
For example, for range 786 to 3,021, the counter will be incremented:
By 1 from 786 to 790 (5 cycles)
By 9 from 790 to 799 (1 cycle)
By 1 from 799 to 800
By 99 from 800 to 899
By 1 from 899 to 900
By 99 from 900 to 999
By 1 from 999 to 1000
By 999 from 1000 to 1999
By 1 from 1999 to 2000
By 999 from 2000 to 2999
By 1 from 2999 to 3000
By 1 from 3000 to 3010 (10 cycles)
By 9 from 3010 to 3019 (1 cycle)
By 1 from 3019 to 3021 (2 cycles)
Total: 28 cycles
Without optimization: 2,235 cycles
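For readers who want to experiment, here is a Python sketch of the skipping algorithm (the original implementation is in Clarion). It assumes First >= 1, counts without leading zeros, and subtracts Power from the zero count as the leading-zeros formula requires; the names tally and count_digits are illustrative:

```python
def tally(number, count, digits):
    """Add count to the tally of every digit of number."""
    while True:
        digits[number % 10] += count
        number //= 10
        if number == 0:
            break


def count_digits(first, last):
    """Count each digit 0-9 used to write first..last (assumes first >= 1)."""
    digits = [0] * 10
    number = first
    while number <= last:
        tally(number, 1, digits)
        # power = number of zeros at the end of number
        power = 0
        while number % 10 ** (power + 1) == 0:
            power += 1
        if power > 0:
            nines = 10 ** power - 1
            if number + nines <= last:
                # Low-order digits of the next `nines` numbers: a full set
                # 0..nines with leading zeros, minus the all-zero entry.
                full = power * 10 ** (power - 1)
                for d in range(10):
                    digits[d] += full
                digits[0] -= power
                # The prefix repeats in each of the skipped numbers.
                tally(number // 10 ** power, nines, digits)
                number += nines
        number += 1
    return digits


print(count_digits(1, 100))  # [11, 21, 20, 20, 20, 20, 20, 20, 20, 20]
```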
Note that this algorithm solves the problem without leading zeros. To use it with leading zeros, I used a hack:
If range 700 to 1,000 with leading zeros is needed, use the algorithm for 10,700 to 11,000 and then subtract 1,000 - 700 + 1 = 301 (one extra leading 1 per number in the range) from the count of digit 1.
Benchmark and Source code
I tested the original approach, the same approach using %10 and the new solution for some large ranges, with these results:
Original 104.78 seconds
With %10 83.66
With Powers of Ten 0.07
If you would like to see the full source code or run the benchmark, use these links:
Complete Source code (in Clarion): http://sca.mx/ftp/countdigits.txt
Compilable project and win32 exe: http://sca.mx/ftp/countdigits.zip
Accepted answer
noahlavine's solution may be correct, but I just couldn’t follow the pseudocode; I think there are some details missing or not completely explained.
Aaronaught solution seems to be correct, but the code is just too complex for my taste.
I accepted strainer’s answer, because his line of thought guided me to develop this new solution.
There's a clear mathematical solution to a problem like this. Let's assume the value is zero-padded to the maximum number of digits (it's not, but we'll compensate for that later), and reason through it:
From 0-9, each digit occurs once
From 0-99, each digit occurs 20 times (10x in position 1 and 10x in position 2)
From 0-999, each digit occurs 300 times (100x in P1, 100x in P2, 100x in P3)
The obvious pattern for any given digit, if the range is from 0 to a power of 10, is N * 10^(N-1), where N is the power of 10.
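A quick brute-force check of this pattern, as a small Python sketch. The helper name `padded_digit_counts` is made up for illustration; it zero-pads every value to N digits, which is the assumption the reasoning above starts from.

```python
from collections import Counter


def padded_digit_counts(n_digits):
    """Tally the digits of 0 .. 10^n_digits - 1, each zero-padded to n_digits."""
    counts = Counter()
    for value in range(10 ** n_digits):
        counts.update(str(value).zfill(n_digits))
    return counts


for n in range(1, 5):
    counts = padded_digit_counts(n)
    # Every digit, including 0, should occur N * 10^(N-1) times.
    assert all(counts[str(d)] == n * 10 ** (n - 1) for d in range(10))
```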
What if the range is not a power of 10? Start with the lowest power of 10, then work up. The easiest case to deal with is a maximum like 399. We know that for each multiple of 100, each digit occurs at least 20 times, but we have to compensate for the number of times it appears in the most-significant-digit position, which is going to be exactly 100 for digits 0-3, and exactly zero for all other digits. Specifically, the extra amount to add is 10^N for the relevant digits.
Putting this into a formula, for upper bounds that are 1 less than some multiple of a power of 10 (i.e. 399, 6999, etc.) it becomes: M * N * 10^(N-1) + iif(d <= M, 10^N, 0)
Now you just have to deal with the remainder (which we'll call R). Take 445 as an example. This is whatever the result is for 399, plus the range 400-445. In this range, the MSD occurs R more times, and all digits (including the MSD) also occur at the same frequencies they would from range [0 - R].
Now we just have to compensate for the leading zeros. This pattern is easy - it's just:
10^N + 10^(N-1) + 10^(N-2) + ... + 10^0
Update: This version correctly takes into account "padding zeros", i.e. the zeros in middle positions when dealing with the remainder ([400, 401, 402, ...]). Figuring out the padding zeros is a bit ugly, but the revised code (C-style pseudocode) handles it:
function countdigits(int d, int low, int high) {
    return countdigits(d, low, high, false);
}
function countdigits(int d, int low, int high, bool inner) {
    if (high == 0)
        return (d == 0) ? 1 : 0;
    if (low > 0)
        // [0, high] minus [0, low - 1], so that low itself stays in the range
        return countdigits(d, 0, high) - countdigits(d, 0, low - 1);
    int n = floor(log10(high));
    int m = floor((high + 1) / pow(10, n));
    int r = high - m * pow(10, n);
    return
        (max(m, 1) * n * pow(10, n-1)) +                             // (1)
        ((d < m) ? pow(10, n) : 0) +                                 // (2)
        (((r >= 0) && (n > 0)) ? countdigits(d, 0, r, true) : 0) +   // (3)
        (((r >= 0) && (d == m)) ? (r + 1) : 0) +                     // (4)
        (((r >= 0) && (d == 0)) ? countpaddingzeros(n, r) : 0) -     // (5)
        (((d == 0) && !inner) ? countleadingzeros(n) : 0);           // (6)
}
function countleadingzeros(int n) {
    int tmp = 0;
    do {
        tmp = pow(10, n) + tmp;
        --n;
    } while (n > 0);
    return tmp;
}
function countpaddingzeros(int n, int r) {
    return (r + 1) * max(0, n - max(0, floor(log10(r))) - 1);
}
As you can see, it's gotten a bit uglier but it still runs in O(log n) time, so if you need to handle numbers in the billions, this will still give you instant results. :-) And if you run it on the range [0 - 1000000], you get the exact same distribution as the one posted by High-Performance Mark, so I'm almost positive that it's correct.
FYI, the reason for the inner variable is that the leading-zero function is already recursive, so it can only be counted in the first execution of countdigits.
Update 2: In case the code is hard to read, here's a reference for what each line of the countdigits return statement means (I tried inline comments but they made the code even harder to read):
(1) Frequency of any digit up to highest power of 10 (0-99, etc.)
(2) Frequency of MSD above any multiple of highest power of 10 (100-399)
(3) Frequency of any digits in remainder (400-445, R = 45)
(4) Additional frequency of MSD in remainder
(5) Count zeros in middle position for remainder range (404, 405...)
(6) Subtract leading zeros only once (on outermost loop)
I'm assuming you want a solution where the numbers are in a range, and you have the starting and ending number. Imagine starting with the start number and counting up until you reach the end number - it would work, but it would be slow. I think the trick to a fast algorithm is to realize that in order to go up one digit in the 10^x place and keep everything else the same, you need to use all of the digits before it 10^x times plus all digits 0-9 10^(x-1) times. (Except that your counting may have involved a carry past the x-th digit - I correct for this below.)
Here's an example. Say you're counting from 523 to 1004.
First, you count from 523 to 524. This uses the digits 5, 2, and 4 once each.
Second, count from 524 to 604. The rightmost digit does 8 cycles through all of the digits (80 steps), so you need 8 copies of each digit. The second digit goes through digits 2 through 0, 10 times each. The third digit is 6 five times and 5 (100 - 24) times.
Third, count from 604 to 1004. The rightmost digit does 40 cycles, so add 40 copies of each digit. The second-from-right digit does 4 cycles, so add 4 copies of each digit. The leftmost digit does 100 each of 7, 8, and 9, plus 5 of 0 and (100 - 5) of 6. The last digit is 1 five times.
To speed up the last bit, look at the part about the rightmost two places. It uses each digit 10 + 1 times. In general, 1 + 10 + ... + 10^n = (10^(n+1) - 1)/9, which we can use to speed up counting even more.
My algorithm is to count up from the start number to the end number (using base-10 counting), but use the fact above to do it quickly. You iterate through the digits of the starting number from least to most significant, and at each place you count up so that that digit is the same as the one in the ending number. At each point, n is the number of up-counts you need to do before you get to a carry, and m the number you need to do afterwards.
Now let's assume pseudocode counts as a language. Here, then, is what I would do:
convert start and end numbers to digit arrays start[] and end[]
create an array counts[] with 10 elements which stores the number of copies of
    each digit that you need
iterate through start number from right to left. at the i-th digit,
    let d be the number of digits you must count up to get from this digit
        to the i-th digit in the ending number. (i.e. subtract the equivalent
        digits mod 10)
    add d * (10^i - 1)/9 to each entry in counts.
    let m be the numerical value of all the digits to the right of this digit,
        n be 10^i - m.
    for each digit e from the left of the starting number up to and including the
        i-th digit, add n to the count for that digit.
    for j in 1 to d
        increment the i-th digit by one, including doing any carries
        for each digit e from the left of the starting number up to and including
            the i-th digit, add 10^i to the count for that digit
    for each digit e from the left of the starting number up to and including the
        i-th digit, add m to the count for that digit.
    set the i-th digit of the starting number to be the i-th digit of the ending
        number.
Oh, and since the value of i increases by one each time, keep track of your old 10^i and just multiply it by 10 to get the new one, instead of exponentiating each time.
To reel off the digits from a number, we'd only ever need to do a costly string conversion if we couldn't do a mod. Digits can most quickly be pulled off a number like this:
feed = number;
do
{
    digit = feed % 10;
    feed /= 10;
    //use digit... eg. digitTally[digit]++;
}
while (feed > 0);
That loop should be very fast and can simply be placed inside a loop over the start-to-end numbers for the simplest way to tally the digits.
To go faster for larger ranges of numbers, I'm looking for an optimised method of tallying all digits from 0 to number*10^significance
(going from a start to an end bamboozles me).
Here is a table showing digit tallies for some single significant digits.
These are inclusive of 0, but not the top value itself (that was an oversight,
but it's maybe a bit easier to see patterns with the top value's digits absent here).
These tallies don't include leading zeros.
digit     1    10   100  1000 10000     2    20    30    40    60    90   200   600  2000  6000
0         1     1    10   190  2890     1     2     3     4     6     9    30   110   490  1690
1         0     1    20   300  4000     1    12    13    14    16    19   140   220  1600  2800
2         0     1    20   300  4000     0     2    13    14    16    19    40   220   600  2800
3         0     1    20   300  4000     0     2     3    14    16    19    40   220   600  2800
4         0     1    20   300  4000     0     2     3     4    16    19    40   220   600  2800
5         0     1    20   300  4000     0     2     3     4    16    19    40   220   600  2800
6         0     1    20   300  4000     0     2     3     4     6    19    40   120   600  1800
7         0     1    20   300  4000     0     2     3     4     6    19    40   120   600  1800
8         0     1    20   300  4000     0     2     3     4     6    19    40   120   600  1800
9         0     1    20   300  4000     0     2     3     4     6     9    40   120   600  1800
Edit: clearing up my original thoughts:
From the brute-force table showing tallies from 0 (included) to a power of ten (not included), it is visible that a major digit (md) at a ten-power (tp):
    increments tally[0 to 9] by md*tp*10^(tp-1)
    increments tally[1 to md-1] by 10^tp
    decrements tally[0] by (10^tp - 10)
        (to remove leading 0s if tp > leading zeros)
    can increment tally[more significant digits] by self (md*10^tp)
        (to complete an effect)
If these tally adjustments were applied for each significant digit,
the tally should be modified as though counted from 0 to end-1.
The adjustments can be inverted to remove the preceding range (start number).
Thanks Aaronaught for your complete and tested answer.
Here's a very bad answer, I'm ashamed to post it. I asked Mathematica to tally the digits used in all numbers from 1 to 1,000,000, no leading 0s. Here's what I got:
0 488895
1 600001
2 600000
3 600000
4 600000
5 600000
6 600000
7 600000
8 600000
9 600000
Next time you're ordering sticky digits for selling in your hardware store, order in these proportions, you won't be far wrong.
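For anyone without Mathematica, the same tally can be reproduced by brute force. Below is a minimal Python sketch; the function name `tally_digits_upto` is an assumption made for illustration, and it simply counts the characters of each number's decimal string, so no leading zeros are counted.

```python
from collections import Counter


def tally_digits_upto(limit):
    """Return a 10-element list: occurrences of each digit in 1..limit."""
    counts = Counter()
    for number in range(1, limit + 1):
        counts.update(str(number))  # one character per digit, no padding
    return [counts[str(d)] for d in range(10)]


print(tally_digits_upto(10_000))
# prints [2893, 4001, 4000, 4000, 4000, 4000, 4000, 4000, 4000, 4000]
```

Calling `tally_digits_upto(1_000_000)` takes noticeably longer but gives the table above.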
I asked this question on Math Overflow, and got spanked for asking such a simple question. One of the users took pity on me and said if I posted it to The Art of Problem Solving, he would answer it; so I did.
Here is the answer he posted:
http://www.artofproblemsolving.com/Forum/viewtopic.php?p=1741600#1741600
Embarrassingly, my math-fu is inadequate to understand what he posted (the guy is 19 years old...that is so depressing). I really need to take some math classes.
On the bright side, the equation is recursive, so it should be a simple matter to turn it into a recursive function with a few lines of code, by someone who understands the math.
I know this question has an accepted answer but I was tasked with writing this code for a job interview and I think I came up with an alternative solution that is fast, requires no loops and can use or discard leading zeroes as required.
It is in fact quite simple but not easy to explain.
If you list out the first n numbers
1
2
3
.
.
.
9
10
11
It is usual to start counting the digits required from the start room number to the end room number in a left to right fashion, so for the above we have one 1, one 2, one 3 ... one 9, two 1's one zero, four 1's etc. Most solutions I have seen used this approach with some optimisation to speed it up.
What I did was to count vertically in columns, as in hundreds, tens, and units. You know the highest room number so we can calculate how many of each digit there are in the hundreds column via a single division, then recurse and calculate how many in the tens column etc. Then we can subtract the leading zeros if we like.
Easier to visualize if you use Excel to write out the numbers but use a separate column for each digit of the number
A B C
- - -
0 0 1 (assuming room numbers do not start at zero)
0 0 2
0 0 3
.
.
.
3 6 4
3 6 5
.
.
.
6 6 9
6 7 0
6 7 1
^
sum in columns not rows
So if the highest room number is 671, the hundreds column will have 100 zeroes vertically, followed by 100 ones and so on up to 72 sixes (600 to 671); ignore 100 of the zeroes if required, as we know these are all leading.
Then recurse down to the tens and perform the same operation, we know there will be 10 zeroes followed by 10 ones etc, repeated six times, then the final time down to 2 sevens. Again can ignore the first 10 zeroes as we know they are leading. Finally of course do the units, ignoring the first zero as required.
So there are no loops; everything is calculated with division. I use recursion for travelling "up" the columns until the max one is reached (in this case hundreds) and then back down, totalling as it goes.
I wrote this in C# and can post code if anyone interested, haven't done any benchmark timings but it is essentially instant for values up to 10^18 rooms.
Could not find this approach mentioned here or elsewhere so thought it might be useful for someone.
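Since the C# code was not posted, here is a hedged Python sketch of column-wise counting. It is not the author's implementation: it iterates over the columns with a loop instead of recursion, always drops leading zeros, and the names `count_digits_upto` and `count_digits_in_range` are assumptions made for illustration.

```python
def count_digits_upto(n):
    """Digit counts over 1..n, column by column, without leading zeros."""
    counts = [0] * 10
    place = 1  # 1 = units column, 10 = tens column, ...
    while place <= n:
        higher, rest = divmod(n, place * 10)   # digits left of this column
        current, lower = divmod(rest, place)   # this column's digit, digits right of it
        for d in range(10):
            if d == 0:
                if higher == 0:
                    continue                   # zeros here would all be leading
                full_cycles = higher - 1       # the first cycle of zeros is leading
            else:
                full_cycles = higher
            total = full_cycles * place
            if current > d:
                total += place                 # one more complete run of d
            elif current == d:
                total += lower + 1             # partial run ending at n
            counts[d] += total
        place *= 10
    return counts


def count_digits_in_range(low, high):
    """Digit counts for low..high inclusive (low >= 1), by subtraction."""
    upper = count_digits_upto(high)
    below = count_digits_upto(low - 1)
    return [u - b for u, b in zip(upper, below)]
```

The work per call grows with the number of digits in `n` rather than with `n` itself, which is why very large room numbers stay cheap.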
Your approach is fine. I'm not sure why you would ever need anything faster than what you've described.
Or, this would give you an instantaneous solution: Before you actually need it, calculate what you would need from 1 to some maximum number. You can store the numbers needed at each step. If you have a range like your second example, it would be what's needed for 1 to 300, minus what's needed for 1 to 50.
Now you have a lookup table that can be called at will. Doing up to 10,000 would only take a few MB and, what, a few minutes to compute, once?
This doesn't answer your exact question, but it's interesting to note the distribution of first digits according to Benford's Law. For example, in many real-world sets of numbers that span several orders of magnitude, about 30% of the values start with "1", which is somewhat counter-intuitive.
I don't know of any distributions describing subsequent digits, but you might be able to determine this empirically and come up with a simple formula for computing an approximate number of digits required for any range of numbers.
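For reference, Benford's Law gives the probability of leading digit d as log10(1 + 1/d). A small Python sketch of that formula follows; the function name is made up for illustration, and note that this describes first digits in suitable data sets, not the digits of a consecutive range of room numbers.

```python
import math


def benford_first_digit_probabilities():
    """Probability of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


for digit, probability in benford_first_digit_probabilities().items():
    print(digit, round(probability, 3))
```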
If "better" means "clearer," then I doubt it. If it means "faster," then yes, but I wouldn't use a faster algorithm in place of a clearer one without a compelling need.
#!/usr/bin/ruby1.8
def digits_for_range(min, max, leading_zeros)
  bins = [0] * 10
  format = [
    '%',
    ('0' if leading_zeros),
    max.to_s.size,
    'd',
  ].compact.join
  (min..max).each do |i|
    s = format % i
    for digit in s.scan(/./)
      bins[digit.to_i] += 1 unless digit == ' '
    end
  end
  bins
end
p digits_for_range(1, 49, false)
# => [4, 15, 15, 15, 15, 5, 5, 5, 5, 5]
p digits_for_range(1, 49, true)
# => [13, 15, 15, 15, 15, 5, 5, 5, 5, 5]
p digits_for_range(1, 10000, false)
# => [2893, 4001, 4000, 4000, 4000, 4000, 4000, 4000, 4000, 4000]
Ruby 1.8, a language known to be "dog slow," runs the above code in 0.135 seconds. That includes loading the interpreter. Don't give up an obvious algorithm unless you need more speed.
If you need raw speed over many iterations, try a lookup table:
Build an array with 2 dimensions: 10 x max-house-number
int nDigits[10000][10] ; // Don't try this on the stack, kids!
Fill each row with the count of digits required to get to that number from zero.
Hint: Use the previous row as a start:
n=0..9999:
    if (n>0) nDigits[n] = nDigits[n-1]
    d=0..9:
        nDigits[n][d] += countOccurrencesOf(n,d)
Number of digits "between" two numbers becomes simple subtraction.
For range=51 to 300, take the counts for 300 and subtract the counts for 50.
0's = nDigits[300][0] - nDigits[50][0]
1's = nDigits[300][1] - nDigits[50][1]
2's = nDigits[300][2] - nDigits[50][2]
3's = nDigits[300][3] - nDigits[50][3]
etc.
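The table-and-subtract idea can be sketched in Python as below. This is an illustrative sketch under a few assumptions: the names `build_prefix_table` and `digits_between` are invented here, row 0 is left as all zeros so that counting effectively starts at house number 1, and no leading zeros are counted.

```python
def build_prefix_table(max_number):
    """table[n][d] = how many times digit d is needed to write 1..n."""
    table = [[0] * 10]               # row 0: nothing written yet
    for n in range(1, max_number + 1):
        row = table[-1][:]           # start from the previous row
        for ch in str(n):
            row[int(ch)] += 1
        table.append(row)
    return table


def digits_between(table, low, high):
    """Digit counts for low..high inclusive (1 <= low <= high)."""
    return [h - l for h, l in zip(table[high], table[low - 1])]


table = build_prefix_table(10_000)
print(digits_between(table, 51, 300))
```

Building the table costs one pass over all numbers up to the maximum; every later range query is ten subtractions.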
You can separate each digit (look here for an example), create a histogram with entries from 0..9 (which will count how many digits appeared in a number) and multiply by the number of 'numbers' asked.
But if this isn't what you are looking for, can you give a better example?
Edited:
Now I think I got the problem. I think you can reckon this (pseudo C):
int histogram[10];
memset(histogram, 0, sizeof(histogram));
for (i = startNumber; i <= endNumber; ++i)
{
    array = separateDigits(i);
    for (j = 0; j < array.length; ++j)
    {
        histogram[array[j]]++;  // index by the digit's value, not its position
    }
}
Separate digits implements the function in the link.
Each position of the histogram will have the amount of each digit. For example
histogram[0] == total of zeros
histogram[1] == total of ones
...
Regards

Resources