Efficiency in Haskell when counting primes - performance

I have the following set of functions to count the number of primes less than or equal to a number n in Haskell.
The algorithm takes a number, checks whether it is divisible by two, and then checks whether it is divisible by the odd numbers up to the square root of the number being checked.
-- is a number, n, prime?
isPrime :: Int -> Bool
isPrime n = n > 1 &&
            foldr (\d r -> d * d > n || (n `rem` d /= 0 && r))
                  True divisors

-- list of divisors for which to test primality
divisors :: [Int]
divisors = 2:[3,5..]

-- pi(n) - the prime counting function, the number of prime numbers <= n
primesNo :: Int -> Int
primesNo 2 = 1
primesNo n
  | isPrime n = 1 + primesNo (n-1)
  | otherwise = 0 + primesNo (n-1)

main = print $ primesNo (2^22)
Using GHC with the -O2 optimisation flag, counting the number of primes for n = 2^22 takes ~3.8 sec on my system. The following C code takes ~0.8 sec:
#include <stdio.h>
#include <math.h>
/*
compile with: gcc -std=c11 -lm -O2 c_primes.c -o c_orig
*/
int isPrime(int n) {
    if (n < 2)
        return 0;
    else if (n == 2)
        return 1;
    else if (n % 2 == 0)
        return 0;
    int uL = sqrt(n);
    int i = 3;
    while (i <= uL) {
        if (n % i == 0)
            return 0;
        i += 2;
    }
    return 1;
}

int main() {
    int noPrimes = 0, limit = 4194304;
    for (int n = 0; n <= limit; n++) {
        if (isPrime(n))
            noPrimes++;
    }
    printf("Number of primes in the interval [0,%d]: %d\n", limit, noPrimes);
    return 0;
}
This algorithm takes about 0.9 sec in Java and 1.8 sec in JavaScript (on Node), so it just feels that the Haskell version is slower than I would expect. Is there any way I can code this more efficiently in Haskell without changing the algorithm?
EDIT
The following version of isPrime offered by @dfeuer shaves one second off the running time, taking it down to 2.8 sec (from 3.8 sec). Though this is still slower than JavaScript (Node), which takes approx. 1.8 sec, as shown here: Yet Another Language Speed Test.
isPrime :: Int -> Bool
isPrime n
  | n <= 2 = n == 2
  | otherwise = odd n && go 3
  where
    go factor
      | factor * factor > n = True
      | otherwise = n `rem` factor /= 0 && go (factor+2)
EDIT
In the above isPrime function, go evaluates factor * factor for every divisor tried against a single n. I would imagine it would be more efficient to compare factor to the square root of n, as this would only have to be calculated once per n. However, using the following code, computation time increases by approximately 10%. Is the square root of n being re-calculated every time the inequality is evaluated (i.e. for each factor)?
isPrime :: Int -> Bool
isPrime n
  | n <= 2 = n == 2
  | otherwise = odd n && go 3
  where
    go factor
      | factor > upperLim = True
      | otherwise = n `rem` factor /= 0 && go (factor+2)
      where
        upperLim = (floor.sqrt.fromIntegral) n
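For what it's worth, one way to be certain the square root is computed only once per n is to bind upperLim in the outer where clause, next to go, rather than in a where attached to go's equation. A small sketch of that variant (not benchmarked here):

isPrime :: Int -> Bool
isPrime n
  | n <= 2    = n == 2
  | otherwise = odd n && go 3
  where
    -- bound once per call of isPrime and shared by every recursive call of go
    upperLim = floor (sqrt (fromIntegral n :: Double)) :: Int
    go factor
      | factor > upperLim = True
      | otherwise         = n `rem` factor /= 0 && go (factor+2)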

I urge you to use a different algorithm, such as the Sieve of Eratosthenes discussed in the paper by Melissa O'Neill, or the version used in Math.NumberTheory.Primes from the arithmoi package, which also offers an optimized prime counting function. However, this might get you better constant factors:
-- is a number, n, prime?
isPrime :: Int -> Bool
isPrime n
  | n <= 2 = n == 2
  | otherwise = odd n &&          -- Put the 2 here instead
      foldr (\d r -> d * d > n || (n `rem` d /= 0 && r))
            True divisors

-- list of divisors for which to test primality
divisors :: [Int]
{-# INLINE divisors #-} -- No guarantee, but it might possibly inline and stay inlined,
                        -- so the numbers will be generated on each call instead of
                        -- being pulled in (expensively) from RAM.
divisors = [3,5..] -- No more 2:
The reason to get rid of the 2: is that an optimization called "foldr/build fusion", "short cut deforestation", or just "list fusion" can, potentially, make your divisors list go away, but, at least with GHC < 7.10.1, that 2: will block the optimization.
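(An aside, not from the original answer: if you want to see whether the foldr/build rules actually fired, GHC can report rule firings during compilation, e.g.

ghc -O2 -ddump-rule-firings primes.hs

where primes.hs is a placeholder for your source file; look for the fold/build related rules in the output.)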
Edit: it seems that's not working for you, so here's something else to try:
isPrime n
  | n <= 2 = n == 2
  | otherwise = odd n && go 3
  where
    go factor
      | factor * factor > n = True
      | otherwise = n `rem` factor /= 0 && go (factor+2)

In general I've found that looping in Haskell is about 3-4 times slower than what can be accomplished with C.
To help understand the performance difference I slightly modified the
programs so that a fixed number of divisor tests are made per iteration
and added a parameter e to control how many iterations are made -
the number of (outer) iterations performed is 2^e. For each outer iteration
approx. 2^11 divisor tests are made.
The source code for each program and scripts to run and analyze the
results may be found here: https://github.com/erantapaa/loopbench
Pull-requests to improve the benchmarking are welcome.
Here are the results I get on a 2.4 GHz Intel Core 2 Duo using ghc 7.8.3 (under OSX). The gcc used was "Apple LLVM version 6.0 (clang-600.0.56) (based on LLVM 3.5svn)".
 e    ctime    htime    allocated   gc-bytes  alloc/iter   h/c    dns
10   0.0101   0.0200        87424       3408               1.980  4.61
11   0.0151   0.0345       112000       3408               2.285  4.51
12   0.0263   0.0700       161152       3408               2.661  5.09
13   0.0472   0.1345       259456       3408               2.850  5.08
14   0.0819   0.2709       456200       3408               3.308  5.50
15   0.1575   0.5382       849416       9616               3.417  5.54
16   0.3112   1.0900      1635848      15960               3.503  5.66
17   0.6105   2.1682      3208848      15984               3.552  5.66
18   1.2167   4.3536      6354576      16032      24.24    3.578  5.70
19   2.4092   8.7336     12646032      16128      24.12    3.625  5.75
20   4.8332  17.4109     25229080      16320      24.06    3.602  5.72
e = exponent parameter
ctime = running time of the C program
htime = running time of the Haskell program
allocated = bytes allocated in the heap (Haskell program)
gc-bytes = bytes copied during GC (Haskell program)
alloc/iter = bytes allocated in the heap / 2^e
h / c = htime divided by ctime
dns = (htime - ctime) divided by the number of divisor tests made
in nanoseconds
# divisor tests made = 2^e * 2^11
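(Aside, for anyone reproducing these numbers: "bytes allocated in the heap" and "bytes copied during GC" are the names GHC's runtime itself uses in its statistics output, so the Haskell columns can be regenerated by running the benchmark executable with the -s RTS flag, e.g.

./hloop +RTS -s

where hloop is a placeholder for whatever the compiled benchmark is called; depending on your GHC version you may need to compile with -rtsopts for the flag to be accepted.)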
Some observations:
The Haskell program performs heap allocation at a rate of about 24 bytes per (outer) loop iteration. The C program clearly does not perform any allocation and runs completely in L1 cache.
The gc-bytes count remains constant for e between 10 and 14 because no garbage collections were performed for those runs.
The time ratio h/c gets progressively worse as more allocations are made.
dns is a measure of the extra time the Haskell program takes per divisor test; it increases with the total amount of allocation made. There are also some plateaus, which suggests this is due to cache effects.
It is well known that GHC does not produce the same tight loop code that
a C compiler produces. The penalty you pay is approx. 4.6 ns per iteration.
Moreover, it looks like Haskell is also affected by cache effects due to
heap allocation.
24 bytes per allocation and 5 ns per loop iteration is not a lot for
most programs, but when you have 2^20 allocations and 2^40 loop iterations
it becomes a factor.

The C code uses 32-bit integers, while the Haskell code uses 64-bit integers.
The original C code runs in 0.63 secs on my computer. However, if I replace the int-s with long-s, it runs in 2.07 seconds with gcc and 2.17 secs with clang.
In comparison, the updated isPrime function (see it in the question above) runs in 2.09 seconds (with -O2 and -fllvm). Note that this is slightly better than the clang-compiled C code, even though they use the same LLVM code generator.
The original Haskell code runs in 3.2 secs, which I think is an acceptable overhead for the convenience of using lists for iteration.
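If you want to probe the 32-bit point from the Haskell side as well, one experiment (my own sketch, with no claim that it is actually faster on 64-bit hardware) is to run the same loop at Int32:

import Data.Int (Int32)

-- the same test loop as dfeuer's version, but with 32-bit arithmetic, purely for comparison
isPrime32 :: Int32 -> Bool
isPrime32 n
  | n <= 2    = n == 2
  | otherwise = odd n && go 3
  where
    go factor
      | factor * factor > n = True
      | otherwise           = n `rem` factor /= 0 && go (factor+2)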

Inline everything, lose the superfluous tests, and add strictness annotations just to be sure:
{-# LANGUAGE BangPatterns #-}

-- pi(n) - the prime counting function, the number of prime numbers <= n
primesNo :: Int -> Int
primesNo n
  | n < 2 = 0
  | otherwise = g 3 1
  where
    g k !cnt | k > n = cnt
             | go 3 = g (k+2) (cnt+1)
             | otherwise = g (k+2) cnt
      where go f
              | f*f > k = True
              | otherwise = k `rem` f /= 0 && go (f+2)

main = print $ primesNo (2^22)
The go testing function is as in dfeuer's answer. Compile with -O2 as usual, and always test by running a standalone executable (with something like > test +RTS -s).
Calls to g can be made direct (that's really micro-optimizing it):
primesNo n
  | n < 2 = 0
  | otherwise = g 3 1
  where
    g k !cnt | k > n = cnt
             | otherwise = go 3
      where go f
              | f*f > k = g (k+2) (cnt+1)
              | k `rem` f == 0 = g (k+2) cnt
              | otherwise = go (f+2)
More substantial change (still keeping the algorithm arguably the same) which might or mightn't speed it up is to turn it inside out, to spare the squares computations: test by [3] all odds from 9 to 23, by [3,5] all odds from 25 to 47, etc., along the lines of this segmented code:
import Data.List (inits)

primesNo n = length (takeWhile (<= n) $ 2 : oddprimes)
  where
    oddprimes = sieve 3 9 [3,5..] (inits [3,5..])
    sieve x q ~(_:t) (fs:ft) =
      filter ((`all` fs) . ((/=0).) . rem) [x,x+2..q-2]
      ++ sieve (q+2) (head t^2) t ft
Sometimes tweaking your code into using and instead of all changes the speed too. Further speedup might be attempted by inlining and simplifying everything (replace length with counting etc.).
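For example, "replace length with counting" might look something like this rough sketch (same sieve as above, with the takeWhile/length pass replaced by a strict counter; not benchmarked):

{-# LANGUAGE BangPatterns #-}
import Data.List (inits)

primesNo :: Int -> Int
primesNo n = count 0 (2 : oddprimes)
  where
    -- oddprimes is infinite, so the (p:ps) pattern is always available
    count !c (p:ps) | p <= n    = count (c+1) ps
                    | otherwise = c
    oddprimes = sieve 3 9 [3,5..] (inits [3,5..])
    sieve x q ~(_:t) (fs:ft) =
      filter ((`all` fs) . ((/=0).) . rem) [x,x+2..q-2]
      ++ sieve (q+2) (head t^2) t ft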

Related

Generating random number in the range 0-N [duplicate]

I have seen this question asked a lot but never seen a true concrete answer to it. So I am going to post one here which will hopefully help people understand why exactly there is "modulo bias" when using a random number generator, like rand() in C++.
So rand() is a pseudo-random number generator which chooses a natural number between 0 and RAND_MAX, which is a constant defined in cstdlib (see this article for a general overview on rand()).
Now what happens if you want to generate a random number between say 0 and 2? For the sake of explanation, let's say RAND_MAX is 10 and I decide to generate a random number between 0 and 2 by calling rand()%3. However, rand()%3 does not produce the numbers between 0 and 2 with equal probability!
When rand() returns 0, 3, 6, or 9, rand()%3 == 0. Therefore, P(0) = 4/11
When rand() returns 1, 4, 7, or 10, rand()%3 == 1. Therefore, P(1) = 4/11
When rand() returns 2, 5, or 8, rand()%3 == 2. Therefore, P(2) = 3/11
This does not generate the numbers between 0 and 2 with equal probability. Of course for small ranges this might not be the biggest issue but for a larger range this could skew the distribution, biasing the smaller numbers.
So when does rand()%n return a range of numbers from 0 to n-1 with equal probability? When RAND_MAX%n == n - 1. In this case, along with our earlier assumption rand() does return a number between 0 and RAND_MAX with equal probability, the modulo classes of n would also be equally distributed.
So how do we solve this problem? A crude way is to keep generating random numbers until you get a number in your desired range:
int x;
do {
x = rand();
} while (x >= n);
but that's inefficient for low values of n, since you only have an n/RAND_MAX chance of getting a value in your range, and so you'll need to perform RAND_MAX/n calls to rand() on average.
A more efficient approach would be to take some large range with a length divisible by n, like RAND_MAX - RAND_MAX % n, keep generating random numbers until you get one that lies in the range, and then take the modulus:
int x;
do {
x = rand();
} while (x >= (RAND_MAX - RAND_MAX % n));
x %= n;
For small values of n, this will rarely require more than one call to rand().
Works cited and further reading:
CPlusPlus Reference
Eternally Confuzzled
Repeatedly selecting a random number until it falls in the desired range is a good way to remove the bias.
Update
We could make the code fast if we search for an x in a range divisible by n.
// Assumptions
// rand() in [0, RAND_MAX]
// n in (0, RAND_MAX]
int x;
// Keep searching for an x in a range divisible by n
do {
x = rand();
} while (x >= RAND_MAX - (RAND_MAX % n))
x %= n;
The above loop should be very fast, say 1 iteration on average.
@user1413793 is correct about the problem. I'm not going to discuss that further, except to make one point: yes, for small values of n and large values of RAND_MAX, the modulo bias can be very small. But using a bias-inducing pattern means that you must consider the bias every time you calculate a random number and choose different patterns for different cases. And if you make the wrong choice, the bugs it introduces are subtle and almost impossible to unit test. Compared to just using the proper tool (such as arc4random_uniform), that's extra work, not less work. Doing more work and getting a worse solution is terrible engineering, especially when doing it right every time is easy on most platforms.
Unfortunately, the implementations of the solution are all incorrect or less efficient than they should be. (Each solution has various comments explaining the problems, but none of the solutions have been fixed to address them.) This is likely to confuse the casual answer-seeker, so I'm providing a known-good implementation here.
Again, the best solution is just to use arc4random_uniform on platforms that provide it, or a similar ranged solution for your platform (such as Random.nextInt on Java). It will do the right thing at no code cost to you. This is almost always the correct call to make.
If you don't have arc4random_uniform, then you can use the power of open source to see exactly how it is implemented on top of a wider-range RNG (arc4random in this case, but a similar approach could also work on top of other RNGs).
Here is the OpenBSD implementation:
/*
 * Calculate a uniformly distributed random number less than upper_bound
 * avoiding "modulo bias".
 *
 * Uniformity is achieved by generating new random numbers until the one
 * returned is outside the range [0, 2**32 % upper_bound). This
 * guarantees the selected random number will be inside
 * [2**32 % upper_bound, 2**32) which maps back to [0, upper_bound)
 * after reduction modulo upper_bound.
 */
u_int32_t
arc4random_uniform(u_int32_t upper_bound)
{
    u_int32_t r, min;

    if (upper_bound < 2)
        return 0;

    /* 2**32 % x == (2**32 - x) % x */
    min = -upper_bound % upper_bound;

    /*
     * This could theoretically loop forever but each retry has
     * p > 0.5 (worst case, usually far better) of selecting a
     * number inside the range we need, so it should rarely need
     * to re-roll.
     */
    for (;;) {
        r = arc4random();
        if (r >= min)
            break;
    }

    return r % upper_bound;
}
It is worth noting the latest commit comment on this code for those who need to implement similar things:
Change arc4random_uniform() to calculate 2**32 % upper_bound as
-upper_bound % upper_bound. Simplifies the code and makes it the
same on both ILP32 and LP64 architectures, and also slightly faster on
LP64 architectures by using a 32-bit remainder instead of a 64-bit
remainder.
Pointed out by Jorden Verwer on tech#
ok deraadt; no objections from djm or otto
The Java implementation is also easily findable (see previous link):
public int nextInt(int n) {
    if (n <= 0)
        throw new IllegalArgumentException("n must be positive");

    if ((n & -n) == n)  // i.e., n is a power of 2
        return (int)((n * (long)next(31)) >> 31);

    int bits, val;
    do {
        bits = next(31);
        val = bits % n;
    } while (bits - val + (n-1) < 0);
    return val;
}
Definition
Modulo Bias is the inherent bias in using modulo arithmetic to reduce an output set to a subset of the input set. In general, a bias exists whenever the mapping between the input and output set is not equally distributed, as in the case of using modulo arithmetic when the size of the output set is not a divisor of the size of the input set.
This bias is particularly hard to avoid in computing, where numbers are represented as strings of bits: 0s and 1s. Finding truly random sources of randomness is also extremely difficult, but is beyond the scope of this discussion. For the remainder of this answer, assume that there exists an unlimited source of truly random bits.
Problem Example
Let's consider simulating a die roll (0 to 5) using these random bits. There are 6 possibilities, so we need enough bits to represent the number 6, which is 3 bits. Unfortunately, 3 random bits yields 8 possible outcomes:
000 = 0, 001 = 1, 010 = 2, 011 = 3
100 = 4, 101 = 5, 110 = 6, 111 = 7
We can reduce the size of the outcome set to exactly 6 by taking the value modulo 6, however this presents the modulo bias problem: 110 yields a 0, and 111 yields a 1. This die is loaded.
Potential Solutions
Approach 0:
Rather than rely on random bits, in theory one could hire a small army to roll dice all day and record the results in a database, and then use each result only once. This is about as practical as it sounds, and more than likely would not yield truly random results anyway (pun intended).
Approach 1:
Instead of using the modulus, a naive but mathematically correct solution is to discard results that yield 110 and 111 and simply try again with 3 new bits. Unfortunately, this means that there is a 25% chance on each roll that a re-roll will be required, including each of the re-rolls themselves. This is clearly impractical for all but the most trivial of uses.
Approach 2:
Use more bits: instead of 3 bits, use 4. This yields 16 possible outcomes. Of course, re-rolling anytime the result is greater than 5 makes things worse (10/16 = 62.5%), so that alone won't help.
Notice that 2 * 6 = 12 < 16, so we can safely take any outcome less than 12 and reduce that modulo 6 to evenly distribute the outcomes. The other 4 outcomes must be discarded, and then re-rolled as in the previous approach.
Sounds good at first, but let's check the math:
4 discarded results / 16 possibilities = 25%
In this case, 1 extra bit didn't help at all!
That result is unfortunate, but let's try again with 5 bits:
32 % 6 = 2 discarded results; and
2 discarded results / 32 possibilities = 6.25%
A definite improvement, but not good enough in many practical cases. The good news is, adding more bits will never increase the chances of needing to discard and re-roll. This holds not just for dice, but in all cases.
As demonstrated, however, adding 1 extra bit may not change anything. In fact, if we increase our roll to 6 bits, the probability remains 6.25%.
This raises 2 additional questions:
If we add enough bits, is there a guarantee that the probability of a discard will diminish?
How many bits are enough in the general case?
General Solution
Thankfully the answer to the first question is yes. The problem with 6 is that 2^x mod 6 flips between 2 and 4, which coincidentally are a multiple of 2 from each other, so that for an odd x > 1,
[2^x mod 6] / 2^x == [2^(x+1) mod 6] / 2^(x+1)
Thus 6 is an exception rather than the rule. It is possible to find larger moduli that yield consecutive powers of 2 in the same way, but eventually this must wrap around, and the probability of a discard will be reduced.
Without offering further proof, in general using double the number
of bits required will provide a smaller, usually insignificant,
chance of a discard.
Proof of Concept
Here is an example program that uses OpenSSL's libcrypto to supply random bytes. When compiling, be sure to link to the library with -lcrypto, which most everyone should have available.
#include <iostream>
#include <assert.h>
#include <limits>
#include <cstdint>
#include <openssl/rand.h>

volatile uint32_t dummy;
uint64_t discardCount;
uint64_t randomPool;  // shared buffer refilled by RAND_bytes on each draw

uint32_t uniformRandomUint32(uint32_t upperBound)
{
    assert(RAND_status() == 1);
    uint64_t discard = (std::numeric_limits<uint64_t>::max() - upperBound) % upperBound;
    RAND_bytes((uint8_t*)(&randomPool), sizeof(randomPool));
    while(randomPool > (std::numeric_limits<uint64_t>::max() - discard)) {
        RAND_bytes((uint8_t*)(&randomPool), sizeof(randomPool));
        ++discardCount;
    }
    return randomPool % upperBound;
}

int main() {
    discardCount = 0;
    const uint32_t MODULUS = (1ul << 31)-1;
    const uint32_t ROLLS = 10000000;
    for(uint32_t i = 0; i < ROLLS; ++i) {
        dummy = uniformRandomUint32(MODULUS);
    }
    std::cout << "Discard count = " << discardCount << std::endl;
}
I encourage playing with the MODULUS and ROLLS values to see how many re-rolls actually happen under most conditions. A sceptical person may also wish to save the computed values to a file and verify the distribution appears uniform.
Mark's Solution (The accepted solution) is Nearly Perfect.
int x;
do {
x = rand();
} while (x >= (RAND_MAX - RAND_MAX % n));
x %= n;
However, it has a caveat which discards 1 valid set of outcomes in any scenario where RAND_MAX (RM) is 1 less than a multiple of N (Where N = the Number of possible valid outcomes).
ie, When the 'count of values discarded' (D) is equal to N, then they are actually a valid set (V), not an invalid set (I).
What causes this is at some point Mark loses sight of the difference between N and Rand_Max.
N is a set whose valid members are comprised only of Positive Integers, as it contains a count of responses that would be valid. (eg: Set N = {1, 2, 3, ... n } )
Rand_Max, however, is a set which (as defined for our purposes) includes any number of non-negative integers.
In its most generic form, what is defined here as Rand_Max is the Set of all valid outcomes, which could theoretically include negative numbers or non-numeric values.
Therefore Rand_Max is better defined as the set of "Possible Responses".
However N operates against the count of the values within the set of valid responses, so even as defined in our specific case, Rand_Max will be a value one less than the total number it contains.
Using Mark's Solution, Values are Discarded when: X => RM - RM % N
EG:
Ran Max Value (RM) = 255
Valid Outcome (N) = 4
When X => 252, Discarded values for X are: 252, 253, 254, 255
So, if Random Value Selected (X) = {252, 253, 254, 255}
Number of discarded Values (I) = RM % N + 1 == N
IE:
I = RM % N + 1
I = 255 % 4 + 1
I = 3 + 1
I = 4
X => ( RM - RM % N )
255 => (255 - 255 % 4)
255 => (255 - 3)
255 => (252)
Discard Returns $True
As you can see in the example above, when the value of X (the random number we get from the initial function) is 252, 253, 254, or 255 we would discard it even though these four values comprise a valid set of returned values.
IE: When the count of the values Discarded (I) = N (The number of valid outcomes) then a Valid set of return values will be discarded by the original function.
If we describe the difference between the values N and RM as D, ie:
D = (RM - N)
Then as the value of D becomes smaller, the Percentage of unneeded re-rolls due to this method increases at each natural multiplicative. (When RAND_MAX is NOT equal to a Prime Number this is of valid concern)
EG:
RM=255 , N=2 Then: D = 253, Lost percentage = 0.78125%
RM=255 , N=4 Then: D = 251, Lost percentage = 1.5625%
RM=255 , N=8 Then: D = 247, Lost percentage = 3.125%
RM=255 , N=16 Then: D = 239, Lost percentage = 6.25%
RM=255 , N=32 Then: D = 223, Lost percentage = 12.5%
RM=255 , N=64 Then: D = 191, Lost percentage = 25%
RM=255 , N= 128 Then D = 127, Lost percentage = 50%
Since the percentage of re-rolls needed increases the closer N comes to RM, this can be of valid concern at many different values depending on the constraints of the system running the code and the values being looked for.
To negate this we can make a simple amendment As shown here:
int x;
do {
    x = rand();
} while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n ) ) );
x %= n;
This provides a more general version of the formula which accounts for the additional peculiarities of using modulus to define your max values.
Examples of using a small value for RAND_MAX where RM+1 is a multiple of N.
Mark's original Version:
RAND_MAX = 3, n = 2, Values in RAND_MAX = 0,1,2,3, Valid Sets = 0,1 and 2,3.
When X >= (RAND_MAX - ( RAND_MAX % n ) )
When X >= 2 the value will be discarded, even though the set is valid.
Generalized Version 1:
RAND_MAX = 3, n = 2, Values in RAND_MAX = 0,1,2,3, Valid Sets = 0,1 and 2,3.
When X > (RAND_MAX - ( ( RAND_MAX % n ) + 1 ) % n )
When X > 3 the value would be discarded, but this is not a value in the set RAND_MAX, so there will be no discard.
Additionally, in the case where N should be the number of values in RAND_MAX; in this case, you could set N = RAND_MAX +1, unless RAND_MAX = INT_MAX.
Loop-wise you could just use N = 1, and any value of X will be accepted, however, and put an IF statement in for your final multiplier. But perhaps you have code that may have a valid reason to return a 1 when the function is called with n = 1...
So it may be better to use 0, which would normally provide a Div 0 Error, when you wish to have n = RAND_MAX+1
Generalized Version 2:
int x;
if (n != 0) {
    do {
        x = rand();
    } while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n ) ) );
    x %= n;
} else {
    x = rand();
}
Both of these solutions resolve the issue with needlessly discarded valid results which will occur when RM+1 is a product of n.
The second version also covers the edge case scenario when you need n to equal the total possible set of values contained in RAND_MAX.
The modified approach in both is the same and allows for a more general solution to the need of providing valid random numbers and minimizing discarded values.
To reiterate:
The Basic General Solution which extends mark's example:
// Assumes:
// RAND_MAX is a globally defined constant, returned from the environment.
// int n; // User input, or externally defined, number of valid choices.

int x;
do {
    x = rand();
} while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n ) ) );
x %= n;
The Extended General Solution which Allows one additional scenario of RAND_MAX+1 = n:
// Assumes:
// RAND_MAX is a globally defined constant, returned from the environment.
// int n; // User input, or externally defined, number of valid choices.

int x;
if (n != 0) {
    do {
        x = rand();
    } while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n ) ) );
    x %= n;
} else {
    x = rand();
}
In some languages ( particularly interpreted languages ) doing the calculations of the compare-operation outside of the while condition may lead to faster results as this is a one-time calculation no matter how many re-tries are required. YMMV!
// Assumes:
// RAND_MAX is a globally defined constant, returned from the environment.
// int n; // User input, or externally defined, number of valid choices.

int x; // Resulting random number
int y; // One-time calculation of the compare value for x
y = RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n );
if (n != 0) {
    do {
        x = rand();
    } while (x > y);
    x %= n;
} else {
    x = rand();
}
There are two usual complaints with the use of modulo.
one is valid for all generators. It is easier to see in a limit case. If your generator has a RAND_MAX which is 2 (that isn't compliant with the C standard) and you want only 0 or 1 as a value, using modulo will generate 0 twice as often (when the generator generates 0 and 2) as it will generate 1 (when the generator generates 1). Note that this is true as soon as you don't drop values: whatever the mapping you are using from the generator values to the wanted ones, one will occur twice as often as the other.
some kinds of generators have their less significant bits less random than the others, at least for some of their parameters, but sadly those parameters have other interesting characteristics (such as being able to have RAND_MAX one less than a power of 2). The problem is well known, and for a long time library implementations have probably avoided it (for instance the sample rand() implementation in the C standard uses this kind of generator, but drops the 16 less significant bits), but some like to complain about that and you may have bad luck
Using something like
int alea(int n){
    assert (0 < n && n <= RAND_MAX);
    int partSize =
        n == RAND_MAX ? 1 : 1 + (RAND_MAX-n)/(n+1);
    int maxUsefull = partSize * n + (partSize-1);
    int draw;
    do {
        draw = rand();
    } while (draw > maxUsefull);
    return draw/partSize;
}
to generate a random number between 0 and n will avoid both problems (and it avoids overflow with RAND_MAX == INT_MAX)
BTW, C++11 introduced standard ways to do the reduction and other generators than rand().
With a RAND_MAX value of 3 (in reality it should be much higher than that but the bias would still exist) it makes sense from these calculations that there is a bias:
1 % 2 = 1
2 % 2 = 0
3 % 2 = 1
random_between(1, 3) % 2 = more likely a 1
In this case, the % 2 is what you shouldn't do when you want a random number between 0 and 1. You could get a random number between 0 and 2 by doing % 3 though, because in this case: RAND_MAX is a multiple of 3.
Another method
There are much simpler ways, but to add to the other answers, here is my solution to get a random number between 0 and n - 1 (so n different possibilities) without bias.
the number of bits (not bytes) needed to encode the number of possibilities is the number of bits of random data you'll need
encode the number from random bits
if this number is >= n, restart (no modulo).
Really random data is not easy to obtain, so why use more bits than needed.
Below is an example in Smalltalk, using a cache of bits from a pseudo-random number generator. I'm no security expert so use at your own risk.
next: n
| bitSize r from to |
n < 0 ifTrue: [^0 - (self next: 0 - n)].
n = 0 ifTrue: [^nil].
n = 1 ifTrue: [^0].
cache isNil ifTrue: [cache := OrderedCollection new].
cache size < (self randmax highBit) ifTrue: [
Security.DSSRandom default next asByteArray do: [ :byte |
(1 to: 8) do: [ :i | cache add: (byte bitAt: i)]
]
].
r := 0.
bitSize := n highBit.
to := cache size.
from := to - bitSize + 1.
(from to: to) do: [ :i |
r := r bitAt: i - from + 1 put: (cache at: i)
].
cache removeFrom: from to: to.
r >= n ifTrue: [^self next: n].
^r
Modulo reduction is a commonly seen way to make a random integer generator avoid the worst case of running forever.
When the range of possible integers is unknown, however, there is no way in general to "fix" this worst case of running forever without introducing bias. It's not just modulo reduction (rand() % n, discussed in the accepted answer) that will introduce bias this way, but also the "multiply-and-shift" reduction of Daniel Lemire, or if you stop rejecting an outcome after a set number of iterations. (To be clear, this doesn't mean there is no way to fix the bias issues present in pseudorandom generators. For example, even though modulo and other reductions are biased in general, they will have no issues with bias if the range of possible integers is a power of 2 and if the random generator produces unbiased random bits or blocks of them.)
The following answer of mine discusses the relationship between running time and bias in random generators, assuming we have a "true" random generator that can produce unbiased and independent random bits. The answer doesn't even involve the rand() function in C because it has many issues. Perhaps the most serious here is the fact that the C standard does not explicitly specify a particular distribution for the numbers returned by rand(), not even a uniform distribution.
How to generate a random integer in the range [0,n] from a stream of random bits without wasting bits?
As the accepted answer indicates, "modulo bias" has its roots in the low value of RAND_MAX. He uses an extremely small value of RAND_MAX (10) to show that if RAND_MAX were 10 and you tried to generate a number between 0 and 2 using %, the following outcomes would result:
rand() % 3 // if RAND_MAX were only 10, gives
output of rand() | rand()%3
0 | 0
1 | 1
2 | 2
3 | 0
4 | 1
5 | 2
6 | 0
7 | 1
8 | 2
9 | 0
So there are 4 outputs of 0's (4/10 chance) and only 3 outputs of 1 and 2 (3/10 chances each).
So it's biased. The lower numbers have a better chance of coming out.
But that only shows up so obviously when RAND_MAX is small. Or more specifically, when the number you are modding by is large compared to RAND_MAX.
A much better solution than looping (which is insanely inefficient and shouldn't even be suggested) is to use a PRNG with a much larger output range. The Mersenne Twister algorithm has a maximum output of 4,294,967,295. As such doing MersenneTwister::genrand_int32() % 10 for all intents and purposes, will be equally distributed and the modulo bias effect will all but disappear.
I just wrote some code for Von Neumann's Unbiased Coin Flip Method, which should theoretically eliminate any bias in the random number generation process. More info can be found at http://en.wikipedia.org/wiki/Fair_coin
int unbiased_random_bit() {
    int x1, x2, prev;
    prev = 2;
    x1 = rand() % 2;
    x2 = rand() % 2;
    for (;; x1 = rand() % 2, x2 = rand() % 2)
    {
        if (x1 ^ x2)      // 01 -> 1, or 10 -> 0.
        {
            return x2;
        }
        else if (x1 & x2)
        {
            if (!prev)    // 0011
                return 1;
            else
                prev = 1; // 1111 -> continue, bias unresolved
        }
        else
        {
            if (prev == 1) // 1100
                return 0;
            else           // 0000 -> continue, bias unresolved
                prev = 0;
        }
    }
}

Ruby's digits method performance

I'm solving some Project Euler problems using Ruby, and specifically here I'm talking about problem 25 (What is the index of the first term in the Fibonacci sequence to contain 1000 digits?).
At first, I was using Ruby 2.2.3 and I coded the problem as such:
number = 3
a = 1
b = 2
while b.to_s.length < 1000
a, b = b, a + b
number += 1
end
puts number
But then I found out that version 2.4.2 has a method called digits which is exactly what I needed. I transformed the code to:
while b.digits.length < 1000
And when I compared the two methods, digits was much slower.
Time
./025/problem025.rb 0.13s user 0.02s system 80% cpu 0.190 total
./025/problem025.rb 2.19s user 0.03s system 97% cpu 2.275 total
Does anyone have an idea why?
Ruby's digits
... is implemented in rb_int_digits.
Which for non-tiny numbers (i.e., most of your numbers) uses rb_int_digits_bigbase.
Which extracts digit after digit naively with division/modulo by base.
So it should take quadratic time (at least with a small base such as 10).
Ruby's to_s
... is implemented in int_to_s.
Which uses rb_int2str.
Which for non-tiny numbers uses rb_big2str.
Which uses rb_big2str1.
Which might use big2str_gmp if available (which sounds/looks like it uses the fast GMP library) or ...
... uses big2str_generic.
Which uses big2str_karatsuba (sweet, I recognize that name!).
Which looks like it has something to do with ...
... Karatsuba's algorithm, which is a fast multiplication algorithm. If you multiply two n-digit numbers the naive way you learned in school, you take n^2 single-digit products. Karatsuba on the other hand only needs about n^1.585, which is quite a lot better. And I didn't read into this further, but I suspect what Ruby does here is also this efficient. Eric Lippert's answer with a base conversion algorithm uses Karatsuba multiplication and says "this [base conversion] algorithm is utterly dominated by the cost of the multiplication".
Comparing quadratic to n^1.585 over the number lengths from 1 digit to 1000 digits gives factor 15:
(1..1000).sum { |i| i**2 } / (1..1000).sum { |i| i**1.585 }
=> 15.150583254950678
Which is roughly the factor you observed as well. Of course that's a rather naive comparison, but, well, why not.
GMP by the way apparently uses/used a "near O(n * log(n)) FFT-based multiplication algorithm".
Thanks to #Drenmi's answer for motivating me to dig into the source after all. I hope I did this right, no guarantees, I'm a Ruby beginner. But that's why I left all the links there for you to check for yourself :-P
Integer#digits doesn't just "split" the number. From the documentation:
Returns the array including the digits extracted by place-value
notation with radix base of int.
This extraction is done even if a base argument is omitted. The relevant source:
# ruby/numeric.c:4809
while (!FIXNUM_P(num) || FIX2LONG(num) > 0) {
VALUE qr = rb_int_divmod(num, base);
rb_ary_push(digits, RARRAY_AREF(qr, 1));
num = RARRAY_AREF(qr, 0);
}
As you can see, this process includes repeated modulo arithmetics, which likely accounts for the additional runtime.
Many Ruby methods create objects (strings, arrays, etc.), and in Ruby object creation is "expensive".
For instance, to_s creates a string and digits creates an array every time the while condition is evaluated.
If you want to optimize your example, you can do the following:
# create the smallest possible 1000 digits number
max = 10**999
number = 3
a = 1
b = 2
# do not create objects in while condition
while b < max
a, b = b, a + b
number += 1
end
puts number
I have not answered your question, but wish to suggest an improved algorithm for the problem you have addressed. For a given number of decimal digits, n, I have implemented the following algorithm.
estimate the number f of Fibonacci numbers ("FNs") that have n or fewer decimal digits.
compute the fth and (f-1)st FNs, and the number of digits m in the fth FN.
if m >= n, back down from the (f-1)st FN until the (f-1)st FN has fewer than n decimal digits, at which time the fth FN is the smallest FN to have n decimal digits.
if m < n, increase the fth FN until it has n decimal digits, at which time it is the smallest FN to have n decimal digits.
The key is to compute a close estimate f in the first step.
Code
AVG_FNs_PER_DIGIT = 4.784971966781667
def first_fibonacci_with_n_digits(n)
return [1, 1] if n == 1
idx = (n * AVG_FNs_PER_DIGIT).round
fn, prev_fn = fib(idx)
fn.to_s.size >= n ? fib_down(n, fn, prev_fn, idx) : fib_up(n, fn, prev_fn, idx)
end
def fib(idx)
a = 1
b = 2
(idx - 2).times {a, b = b, a + b }
[b, a]
end
def fib_up(n, b, a, idx)
loop do
a, b = b, a + b
idx += 1
break [idx, b] if b.to_s.size == n
end
end
def fib_down(n, b, a, idx)
loop do
a, b = b - a, a
break [idx, b] if a.to_s.size == n - 1
idx -= 1
end
end
Benchmarks
In computing each Fibonacci number two operations are typically performed:
compute the number of digits in the last-computed Fibonacci number and if that number is equal to the target number of digits, terminate (for reasons made clear in the Explanation section below, it cannot be larger than the target number); else
compute the next number in the Fibonacci sequence.
By contrast, the method I have proposed performs the first step a relatively small number of times.
How important is the first step relative to the second and how does the use of n.digits.size compare with that of n.to_s.size in the first step? Let's run some benchmarks to find out.
def use_to_s(ndigits)
case ndigits
when 1
[1, 1]
else
a = 1
b = 2
idx = 3
loop do
break [idx, b] if b.to_s.length == ndigits
a, b = b, a + b
idx += 1
end
end
end
def use_digits(ndigits)
case ndigits
when 1
[1, 1]
else
a = 1
b = 2
idx = 3
loop do
break [idx, b] if b.digits.size == ndigits
a, b = b, a + b
idx += 1
end
end
end
require 'fruity'
def test(ndigits)
nfibs, last_fib = use_to_s(ndigits)
puts "\nndigits = #{ndigits}, nfibs=#{nfibs}, last_fib=#{last_fib}"
compare do
try_use_to_s { use_to_s(ndigits) }
try_use_digits { use_digits(ndigits) }
try_estimate { first_fibonacci_with_n_digits(ndigits) }
end
end
test 20
ndigits = 20, nfibs=93, last_fib=12200160415121876738
Running each test 128 times. Test will take about 1 second.
try_estimate is faster than try_use_to_s by 2x ± 0.1
try_use_to_s is faster than try_use_digits by 80.0% ± 10.0%
test 100
ndigits = 100, nfibs=476, last_fib=13447...37757 (90 digits omitted)
Running each test 16 times. Test will take about 4 seconds.
try_estimate is faster than try_use_to_s by 5x ± 0.1
try_use_to_s is faster than try_use_digits by 10x ± 1.0
test 500
ndigits = 500, nfibs=2390, last_fib=13519...63145 (490 digits omitted)
Running each test 2 times. Test will take about 27 seconds.
try_estimate is faster than try_use_to_s by 9x ± 0.1
try_use_to_s is faster than try_use_digits by 60x ± 1.0
test 1000
ndigits = 1000, nfibs=4782, last_fib=10700...27816 (990 digits omitted)
Running each test once. Test will take about 1 minute.
try_estimate is faster than try_use_to_s by 12x ± 10.0
try_use_to_s is faster than try_use_digits by 120x ± 100.0
There are two main take-aways from these results:
"try_estimate" is the fastest because it performs the first step relatively few times; and
the use of to_s is much faster than that of digits.
Further to the first of these observations note that the initial estimates of the index of the first FN having a given number of digits, compared to the actual index, are as follows:
for 20 digits: 96 est. vs 93 actual
for 100 digits: 479 est. vs 476 actual
for 500 digits: 2392 est. vs 2390 actual
for 1000 digits: 4785 est. vs 4782 actual
The deviation was at most 3, meaning numbers of digits had to be calculated for at most 3 FNs to obtain the desired result.
Explanation
The only explanation of the methods given in the section Code above is the derivation of the constant AVG_FNs_PER_DIGIT, which is used to calculate an estimate of the index of the first FN having the specified number of digits.
The derivation of this constant derives from the question and selected answer given here. (The Wiki for Fibonacci numbers provides a good overview of the mathematical properties of FNs.)
It is known that the first 7 FNs (including zero) have one digit; thereafter the FNs gain an additional digit every 4 or 5 FNs (i.e., sometimes 4, else 5). Therefore, as a very crude calculation, we see that to calculate the first FN with n digits, n >= 2, it will not be less than the 4*nth FN. For n = 1000, that would be 4,000. (In fact, the 4,782nd is the smallest to have 1,000 digits.) In other words, we don't need to calculate the number of digits in the first 4,000 FNs. We can improve on this estimate, however.
As n approaches infinity, the ratio of ranges 10**n...10**(n+1) (n-digit intervals) that contain 5 FNs to those that contain 4 FNs can be computed as follows.
LOG_10 = Math.log(10)
#=> 2.302585092994046
GR = (1 + Math.sqrt(5))/2
#=> 1.618033988749895
LOG_GR = Math.log(GR)
#=> 0.48121182505960347
RATIO_5to4 = (LOG_10 - 4*LOG_GR)/(5*LOG_GR - LOG_10)
#=> 3.6505564183095474
where GR is the Golden Ratio.
Over a large number of n-digit intervals let n4 be the number of those intervals containing 4 FNs and n5 be the number containing 5 FNs. The average number of FNs per interval is therefore (n4*4 + n5*5)/(n4 + n5). Since n5/n4 converges to RATIO_5to4, n5 approaches RATIO_5to4 * n4 in the limit (discarding roundoff error). If we substitute out n5, and let
b = 1/(1 + RATIO_5to4)
#=> 0.21502803321833364
we find the average number of FNs per n-digit interval converges to
avg = b * 4 + (1-b) *5
#=> 4.784971966781667
If fn is the first FN to have n decimal digits, the number of FNs in the sequence up to and including fn can therefore be approximated to be
n * avg
For example, the estimate of the index of the first FN to have 1000 decimal digits would be (1000 * 4.784971966781667).round #=> 4785.

Optimize haskell function with huge numbers of power function call

The following function is to find a number n for which 1^3 + 2^3 + ... + (n-1)^3 + n^3 = m. Is there any chance this function can be optimized for speed?
findNb :: Integer -> Integer
findNb m = findNb' 1 0
  where findNb' n m' =
          if m' == m then n - 1
          else if m' < m then findNb' (n + 1) (m' + n^3)
          else -1
I know there is a faster solution by using a math formula.
The reason I'm asking is that a similar implementation in JavaScript / C# seems far faster than in Haskell. I'm just curious if it can be optimized. Thanks.
EDIT1: More evidence on the run time
Haskell Version:
With main = print (findNb2 152000000000000000000000):
Compile with -O2 and profiling: ghc -o testo2.exe -O2 -prof -fprof-auto -rtsopts pileofcube.hs. Here is total time from profiling report:
total time = 0.19 secs (190 milliseconds) (190 ticks # 1000 us, 1 processor)
Compile with -O2 but no profiling: ghc -o testo22.exe -O2 pileofcube.hs. Run it with Measure-Command {./testo22.exe} in powershell. The result is:
Milliseconds : 157
JavaScript Version:
Code:
function findNb(m) {
let n = 0;
let sum = 0;
while (sum < m) {
n++;
sum += Math.pow(n, 3);
}
return sum === m ? n : -1;
}
var d1 = new Date();
findNb(152000000000000000000000);
console.log(new Date() - d1);
Result: 45 milliseconds running in Chrome on the same machine
EDIT2: Add C# Version
As @Berji and @Bakuriu commented, comparing to the JavaScript version above is not fair, as it uses double-precision floating point numbers under the hood and cannot even give the correct answer. So I implemented it in C#; here are the code and result:
static void Main(string[] args)
{
BigInteger m = BigInteger.Parse("152000000000000000000000");
var s = new Stopwatch();
s.Start();
long n = 0;
BigInteger sum = 0;
while (sum < m)
{
n++;
sum += BigInteger.Pow(n, 3);
}
Console.WriteLine(sum == m ? n : -1);
s.Stop();
Console.WriteLine($"Escaped Time: {s.ElapsedMilliseconds} milliseconds.");
}
Result: Escaped Time: 457 milliseconds.
Conclusion
The Haskell version is faster than the C# one...
I was wrong at the start because, due to my poor JavaScript knowledge, I didn't realize JavaScript uses double-precision floating point numbers under the hood.
At this point it seems the question does not make sense anymore...
Haskell too can use Double to get the wrong answer in less time:
% time ./so
./so 0.03s user 0.00s system 95% cpu 0.038 total
And Javascript too can get the correct result via npm-installing big-integer and using bigInt everywhere instead of Double:
% node so.js
^C
node so.js 35.62s user 0.30s system 93% cpu 38.259 total
... or maybe it isn't as trivial as that.
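(For concreteness, the Double variant being alluded to might look like the sketch below — my own illustration, not the answerer's code. It is fast precisely because it does machine-word floating-point arithmetic, and wrong for the same reason: the running sum loses precision long before 1.52e23, so the equality test cannot be trusted.)

-- sketch only: Double arithmetic, gives an incorrect answer for large m
findNbD :: Double -> Double
findNbD m = go 1 0
  where
    go n s | s == m    = n - 1
           | s < m     = go (n + 1) (s + n^3)
           | otherwise = -1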
EDIT: I realized afterwards that this is not what the author of the question wanted. I'll keep it here in case someone wants to know the formula in question, but otherwise please disregard.
There is indeed a formula that lets you compute this in constant time (rather than n iterations). Since I couldn't remember the exact formula from school, I did a bit of searching, and here it is: https://proofwiki.org/wiki/Sum_of_Sequence_of_Cubes.
In haskell code, that would translate to
findNb n = n ^ 2 * (n + 1) ^ 2 `div` 4
which I believe should be much faster.
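(A follow-up sketch of my own, since the question does ask for the n with 1^3 + ... + n^3 = m: because that sum is the square of the n-th triangular number, the closed form can be inverted in essentially constant time. isqrt here is a plain Newton-iteration integer square root written just for this example.)

findNbClosed :: Integer -> Integer
findNbClosed m
  | m < 0 = -1
  | t * t == m && n * (n + 1) `div` 2 == t = n
  | otherwise = -1
  where
    t = isqrt m                          -- candidate triangular number n*(n+1)/2
    n = (isqrt (8 * t + 1) - 1) `div` 2  -- invert n*(n+1)/2 == t
    -- floor of the square root, via Newton's method
    isqrt :: Integer -> Integer
    isqrt 0 = 0
    isqrt x = go x
      where
        go g | g * g <= x && (g + 1) * (g + 1) > x = g
             | otherwise = go ((g + x `div` g) `div` 2)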
Not sure if this wording of that algorithm is faster, but try this?
findNb :: Integer -> Integer
findNb m = fromIntegral $ length $ takeWhile (<= m) $ scanl1 (+) [n^3 | n <- [1..]]
  -- fromIntegral converts the Int that length returns to Integer
(This has different semantics in the undefined case, though.)

What's the ideal implementation for the Sieve of Eratosthenes between Lists, Arrays, and Mutable Arrays?

In Haskell, I've found three simple implementations of the Sieve of Eratosthenes on the Rosetta Code page.
Now my question is, which one should be used in which situations?
Correcting my initial reasoning would be helpful too:
I'm assuming the List one is the most idiomatic and easy to read for a Haskeller. Is it correct, though? I'm wondering if it suffers from the same problems as another list-based sieve that I then learned was not actually implementing the algorithm:
(edit: shown here is the list-based sieve I know has problems, not the one from RosettaCode, which I pasted at the bottom)
primes = sieve [2..] where
  sieve (p:x) = p : sieve [ n | n <- x, n `mod` p > 0 ]
In terms of performance, the immutable Array seems to be the winner. With an upper bound m of 2000000, the times were about:
1.3s for List
0.6s for Array
1.8s for Mutable Array
So I'd pick Array for performance.
And of course, the Mutable Array one is also easy to reason about since I have a more imperative language background. I'm not sure why I would pick this one if I'm coding in Haskell, though, since it's both slower than the others and non-idiomatic.
Code copied here for reference:
List:
primesTo m = 2 : eratos [3,5..m] where
  eratos (p : xs) | p*p>m = p : xs
                  | True  = p : eratos (xs `minus` [p*p, p*p+2*p..])
  minus a@(x:xs) b@(y:ys) = case compare x y of
    LT -> x : minus xs b
    EQ -> minus xs ys
    GT -> minus a  ys
  minus a b = a
Immutable Array:
import Data.Array.Unboxed

primesToA m = sieve 3 (array (3,m) [(i,odd i) | i<-[3..m]] :: UArray Int Bool)
  where
    sieve p a
      | p*p > m   = 2 : [i | (i,True) <- assocs a]
      | a!p       = sieve (p+2) $ a//[(i,False) | i <- [p*p, p*p+2*p..m]]
      | otherwise = sieve (p+2) a
Mutable Array:
import Control.Monad (forM_, when)
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed
primeSieve :: Integer -> UArray Integer Bool
primeSieve top = runSTUArray $ do
  a <- newArray (2,top) True -- :: ST s (STUArray s Integer Bool)
  let r = ceiling . sqrt $ fromInteger top
  forM_ [2..r] $ \i -> do
    ai <- readArray a i
    when ai $ do
      forM_ [i*i,i*i+i..top] $ \j -> do
        writeArray a j False
  return a

-- Return primes from sieve as list:
primesTo :: Integer -> [Integer]
primesTo top = [p | (p,True) <- assocs $ primeSieve top]
EDIT
I showed Turner's Sieve at the top but that's not the list-based example I'm comparing with the other two. I just wanted to know if the list-based example suffers from the same "not the correct Sieve of Eratosthenes" problems as Turner's.
It appears the performance comparison is unfair because the Mutable Array example goes through evens as well and uses Integer rather than Int...
This
primes = sieve [2..] where
  sieve (p:x) = p : sieve [ n | n <- x, n `mod` p > 0 ]
is not a sieve. It's very inefficient trial division. Don't use that!
I'm curious about how you got your times, there is no way that the Turner "sieve" could produce the primes not exceeding 2,000,000 in mere seconds. Letting it find the primes to 200,000 took
MUT time 6.38s ( 6.39s elapsed)
GC time 9.19s ( 9.20s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 15.57s ( 15.59s elapsed)
on my box (64-bit Linux, ghc-7.6.1, compiled with -O2). The complexity of that algorithm is O(N² / log² N), almost quadratic. Letting it proceed to 2,000,000 would take about twenty minutes.
Your times for the array versions are suspicious too, though in the other direction. Did you measure interpreted code?
Sieving to 2,000,000, compiled with optimisations, the mutable array code took 0.35 seconds to run, and the immutable array code 0.12 seconds.
Now, that still has the mutable array about three times slower than the immutable array.
But, it's an unfair comparison. For the immutable array, you used Ints, and for the mutable array Integers. Changing the mutable array code to use Ints - as it should, since under the hood, arrays are Int-indexed, so using Integer is an unnecessary performance sacrifice that buys nothing - made the mutable array code run in 0.15 seconds. Close to the immutable array code, but not quite there. However, you let the mutable array do more work, since in the immutable array code you only eliminate odd multiples of the odd primes, but in the mutable array code, you mark all multiples of all primes. Changing the mutable array code to treat 2 specially, and only eliminate odd multiples of odd primes, brings that down to 0.12 seconds too.
But, you're using range-checked array indexing, which is slow, and, since the validity of the indices is checked in the code itself, unnecessary. Changing that to using unsafeRead and unsafeWrite brings the time for the mutable array down to 0.09 seconds.
Then you have the problem that using
forM_ [x, y .. z]
uses boxed Ints (fortunately, GHC eliminates the list). Writing a loop yourself, so that only unboxed Int#s are used, the time goes down to 0.02 seconds.
{-# LANGUAGE MonoLocalBinds #-}
import Control.Monad (forM_, when)
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed
import Data.Array.Base

primeSieve :: Int -> UArray Int Bool
primeSieve top = runSTUArray $ do
    a <- newArray (0,top) True
    unsafeWrite a 0 False
    unsafeWrite a 1 False
    let r = ceiling . sqrt $ fromIntegral top
        mark step idx
          | top < idx = return ()
          | otherwise = do
              unsafeWrite a idx False
              mark step (idx+step)
        sift p
          | r < p = return a
          | otherwise = do
              prim <- unsafeRead a p
              when prim $ mark (2*p) (p*p)
              sift (p+2)
    mark 2 4
    sift 3

-- Return primes from sieve as list:
primesTo :: Int -> [Int]
primesTo top = [p | (p,True) <- assocs $ primeSieve top]

main :: IO ()
main = print . last $ primesTo 2000000
So, wrapping up, for a Sieve of Eratosthenes, you should use an array - not surprising, its efficiency depends on being able to step from one multiple to the next in short constant time.
You get very simple and straightforward code with immutable arrays, and that code performs decently for not too high limits (it gets relatively worse for higher limits, since the arrays are still copied and garbage-collected, but that's not too bad).
When you need better performance, you need mutable arrays. Writing efficient mutable array code is not entirely trivial, one has to know how the compiler translates the various constructs to choose the right one, and some would consider such code unidiomatic. But you can also use a library (disclaimer: I wrote it) that provides a fairly efficient implementation rather than writing it yourself.
Mutable array will always be the winner in terms of performance (and you really should've copied the version that works on odds only as a minimum; it should be the fastest of the three - also because it uses Int and not Integer).
For lists, tree-shaped merging incremental sieve should perform better than the one you show. You can always use it with takeWhile (< limit) if needed. I contend that it conveys the true nature of the sieve most clearly:
import Data.List (unfoldr)

primes :: [Int]
primes = 2 : _Y ((3 :) . gaps 5 . _U . map (\p -> [p*p, p*p+2*p..]))

_Y g = g (_Y g)                                  -- recursion

_U ((x:xs):t) = (x :) . union xs . _U            -- ~= nub . sort . concat
                . unfoldr (\(a:b:c) -> Just (union a b, c)) $ t

gaps k s@(x:xs) | k < x     = k : gaps (k+2) s   -- ~= [k,k+2..]\\s, when
                | otherwise =     gaps (k+2) xs  --    k<=x && null(s\\[k,k+2..])

union a@(x:xs) b@(y:ys) = case compare x y of    -- ~= nub . sort .: (++)
    LT -> x : union xs b
    EQ -> x : union xs ys
    GT -> y : union a  ys

_U reimplements Data.List.Ordered.unionAll, and gaps 5 is (minus [5,7..]), fused for efficiency, with minus and union from the same package.
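As a quick sanity check in GHCi (illustrative only):

> takeWhile (< 50) primes
[2,3,5,7,11,13,17,19,23,29,31,37,41,43,47]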
Of course nothing beats the brevity of Data.List.nubBy (((>1).).gcd) [2..] (but it's very slow).
To your 1st new question: no. It does find the multiples by counting up, as any true sieve should (although "minus" on lists is of course under-performant; the above improves on that by re-arranging a linear subtraction chain ((((xs-a)-b)-c)- ... ) into a subtraction of tree-folded additions, xs-(a+((b+c)+...))).
As has been said, using mutable arrays has the best performance. The following code is derived from this "TemplateHaskell" version converted back to something more in line with the straight mutable Array solution as the "TemplateHaskell" doesn't seem to make any difference, with some further optimizations. I believe it is faster than the usual mutable unboxed array versions due to the further optimizations and especially due to the use of the "unsafeRead" and "unsafeWrite" functions that avoid range checking of the arrays, probably also internally using pointers for array access:
{-# OPTIONS -O2 -optc-O3 #-}

import Data.Word (Word32) -- needed for the Word32 type in the signature
import Control.Monad
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed
import Data.Array.Base

primesToUA :: Word32 -> [Word32]
primesToUA top = do
  let sieveUA top = runSTUArray $ do
        let m = ((fromIntegral top) - 3) `div` 2 :: Int
        buf <- newArray (0,m) True -- :: ST s (STUArray s Int Bool)
        let cullUA i = do
              let p = i + i + 3
                  strt = p * (i + 1) + i
              let cull j = do
                    if j > m then cullUA (i + 1)
                      else do
                        unsafeWrite buf j False
                        cull (j + p)
              if strt > m then return ()
                else do
                  e <- unsafeRead buf i
                  if e then cull strt else cullUA (i + 1)
        cullUA 0; return buf
  if top > 1 then 2 : [2 * (fromIntegral i) + 3 | (i,True) <- assocs $ sieveUA top]
    else []

main = do
  x <- read `fmap` getLine     -- 1mln 2mln 10mln 100mln
  print (last (primesToUA x))  -- 0.01 0.02 0.09 1.26 seconds
EDIT: The above code has been corrected and the times below edited to reflect the correction; the comments comparing the paged to the non-paged version have been edited as well.
The times to run this to the indicated top ranges are as shown in the comment table at the bottom of the above source code, as measured by ideone.com, and are about five times faster than the answer posted by @WillNess, also as measured at ideone.com. This code takes a trivial amount of time to cull the primes to two million and only 1.26 seconds to cull to a hundred million. These times are about 2.86 times faster when run on my i7 (3.5 GHz), at 0.44 seconds to a hundred million, and it takes 6.81 seconds to run to one billion. The memory use is just over six megabytes for the former and sixty megabytes for the latter, which is the memory used by the huge (bit-packed) array. This array also explains the non-linear performance, in that as the array size exceeds the CPU cache sizes the average memory access time per composite-number cull gets worse.
EDIT_ADD: A page segmented sieve is more efficient in that it has better memory access efficiency when the buffer size is kept smaller than the L1 or L2 CPU caches, and as well has the advantage that it is unbounded so that the upper range does not have to be specified in advance and a much smaller memory footprint being just the base primes less than the square root of the range used plus the page buffer memory. The following code has been written as a page segmented implementation and is somewhat faster than the non-paged version; it also offers the advantage that one can change the output range specification at the top of the code to 'Word64' from 'Word32' so it is then not limited to the 32-bit number range, at only a slight cost in processing time (for 32-bit compiled code) for any range that is in common. The code is as follows:
-- from http://www.haskell.org/haskellwiki/Prime_numbers#Using_ST_Array
{-# OPTIONS -O2 -optc-O3 #-}
import Data.Word
import Control.Monad
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed
import Data.Array.Base
primesUA :: () -> [Word32]
primesUA () = do
let pgSZBTS = 262144 * 8
let sieveUA low bps = runSTUArray $ do
let nxt = (fromIntegral low) + (fromIntegral pgSZBTS)
buf <- newArray (0,pgSZBTS - 1) True -- :: ST s (STUArray s Int Bool)
let cullUAbase i = do
let p = i + i + 3
strt = p * (i + 1) + i
when (strt < pgSZBTS) $ do
e <- unsafeRead buf i
if e then do
let loop j = do
if j < pgSZBTS then do
unsafeWrite buf j False
loop (j + p)
else cullUAbase (i + 1)
loop strt
else cullUAbase (i + 1)
let cullUA ~(p:t) = do
let bp = (fromIntegral p)
i = (bp - 3) `div` 2
s = bp * (i + 1) + i
when (s < nxt) $ do
let strt = do
if s >= low then fromIntegral (s - low)
else do
let b = (low - s) `rem` bp
if b == 0 then 0 else fromIntegral (bp - b)
let loop j = do
if j < pgSZBTS then do
unsafeWrite buf j False
loop (j + (fromIntegral p))
else cullUA t
loop strt
if low <= 0 then cullUAbase 0 else cullUA bps
return buf
let sieveList low bps = do
[2 * ((fromIntegral i) + low) + 3 | (i,True) <- assocs $ sieveUA low bps]
let sieve low bps = do
(sieveList low bps) ++ sieve (low + (fromIntegral pgSZBTS)) bps
let primes' = ((sieveList 0 []) ++ sieve (fromIntegral pgSZBTS) primes') :: [Word32]
2 : sieve 0 primes'
main = do
x <- read `fmap` getLine -- 1mln 2mln 10mln 100mln
-- 0.02 0.03 0.13 1.13 seconds
print (length (takeWhile ((>=) (fromIntegral x)) (primesUA ())))
The above code has quite a few more lines than the non-paged case because the composite number representations must be culled from the first page array differently than on succeeding pages. The code also fixes a potential space leak: the base-primes list and the output list are no longer the same list, so the whole output list is not held in memory.
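A simplified, hedged illustration of that last point (sieveWith below is just a trial-division filter invented for the example, not the segmented code above):

-- purely illustrative: odd primes from a supplied feed of trial divisors
sieveWith :: [Int] -> [Int]
sieveWith ps = [ n | n <- [3,5..]
                   , all (\p -> n `rem` p /= 0) (takeWhile (\p -> p*p <= n) ps) ]

-- leaky shape: the output list feeds itself, so its head stays reachable from
-- its own unevaluated tail and every prime produced so far is retained
nthPrimeLeaky :: Int -> Int
nthPrimeLeaky k = ps !! k  where ps = 2 : sieveWith ps

-- fixed shape: a separate base-primes list (only ever forced up to about the
-- square root of the current candidate) feeds the sieve, so the output list's
-- spine can be garbage collected as it is consumed
nthPrimeSeparate :: Int -> Int
nthPrimeSeparate k = (2 : sieveWith basePrimes) !! k
  where basePrimes = 2 : sieveWith basePrimes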
Note that this code takes close to linear time (over the range sieved) as the range gets larger, because the culling buffer has a constant size smaller than the L2 CPU cache. Memory use is a fraction of the non-paged version's: just under 600 kilobytes to a hundred million and just over 600 kilobytes to one billion, where the slight increase is only the extra space for the list of base primes less than the square root of the range.
On ideone.com this code counts the primes to a hundred million in about 1.13 seconds and to one billion in about 12 seconds (32-bit setting). Wheel factorization would probably make it faster still, and multi-core processing certainly would on a multi-core CPU. On my i7 (3.5 GHz) it takes 0.44 seconds to sieve to a hundred million and 4.7 seconds to one billion, with roughly linear performance as the range increases, as expected. There seems to be some non-linear overhead in the version of GHC run on ideone.com that penalizes larger ranges and is not present on the i7; it is perhaps related to garbage collection, as the page buffers are created anew for each page. END_EDIT_ADD
EDIT_ADD2: It seems that much of the processing time for the above page-segmented code is spent on (lazy) list processing, so the code has been reformulated with several improvements, as follows:
Implemented a prime counting function that avoids list processing, using "popCount" table look-ups to count the 'one' bits a 32-bit word at a time (a minimal standalone illustration follows this list). This way, the time to produce the result is insignificant compared to the actual sieve culling time.
Stored the base primes as a list of bit-packed page segments, which is much more space efficient than storing a list of primes; the time to convert the page segments back to primes as required is not much of a computational overhead.
Tuned the page-segment construction so that for the initial zero page segment it uses its own bit pattern as the source page, which makes the composite-culling code shorter and simpler.
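As a minimal standalone illustration of the first point (not the code below, which counts directly over the cast array buffers), counting set bits one 32-bit word at a time might look like this:

import Data.Array.Base (unsafeAt)
import Data.Array.Unboxed (UArray, bounds, listArray)
import Data.Bits (popCount)
import Data.Word (Word32)

-- count the total number of 'one' bits in a packed word array,
-- one 32-bit word at a time instead of one Bool at a time
countOnes :: UArray Int Word32 -> Int
countOnes arr = go 0 0
  where
    (lo, hi) = bounds arr
    n        = hi - lo + 1
    go i acc | i >= n    = acc
             | otherwise = go (i + 1) (acc + popCount (unsafeAt arr i))

-- e.g. countOnes (listArray (0, 1) [0xFFFF0000, 0x0000000F]) == 20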
The code then becomes as follows:
{-# OPTIONS -O3 -rtsopts #-} -- -fllvm omitted: ideone.com doesn't support LLVM
import Data.Word
import Data.Bits
import Control.Monad
import Control.Monad.ST
import Data.Array.ST (runSTUArray)
import Data.Array.Unboxed
import Data.Array.Base
pgSZBTS = (2^18) * 8 :: Int -- size of L2 data cache
type PrimeType = Word32
type Chunk = UArray PrimeType Bool
-- makes a new page chunk and culls it
-- if the base primes list provided is empty then
-- it uses the current array as source (for zero page base primes)
mkChnk :: Word32 -> [Chunk] -> Chunk
mkChnk low bschnks = runSTUArray $ do
let nxt = (fromIntegral low) + (fromIntegral pgSZBTS)
buf <- nxt `seq` newArray (fromIntegral low, fromIntegral nxt - 1) True
let cull ~(p:ps) =
let bp = (fromIntegral p)
i = (bp - 3) `shiftR` 1
s = bp * (i + 1) + i in
let cullp j = do
if j >= pgSZBTS then cull ps
else do
unsafeWrite buf j False
cullp (j + (fromIntegral p)) in
when (s < nxt) $ do
let strt = do
if s >= low then fromIntegral (s - low)
else do
let b = (low - s) `rem` bp
if b == 0 then 0 else fromIntegral (bp - b)
cullp strt
case bschnks of
[] -> do bsbf <- unsafeFreezeSTUArray buf
cull (listChnkPrms [bsbf])
_ -> cull $ listChnkPrms bschnks
return buf
-- creates a page chunk list starting at the lw value
chnksList :: Word32 -> [Chunk]
chnksList lw =
mkChnk lw basePrmChnks : chnksList (lw + fromIntegral pgSZBTS)
-- converts a page chunk list to a list of primes
listChnkPrms :: [Chunk] -> [PrimeType]
listChnkPrms [] = []
listChnkPrms ~(hdchnk@(UArray lw _ rng _):tlchnks) =
let nxtp i =
if i >= rng then [] else
if unsafeAt hdchnk i then
(case ((lw + fromIntegral i) `shiftL` 1) + 3 of
np -> np) : nxtp (i + 1)
else nxtp (i + 1) in
(hdchnk `seq` lw `seq` nxtp 0) ++ listChnkPrms tlchnks
-- the base page chunk list used to cull the higher order pages,
-- note that it has special treatment for the zero page.
-- It is more space efficient to store this as chunks rather than
-- as a list of primes or even a list of deltas (gaps), with the
-- extra processing to convert as needed not too much.
basePrmChnks :: [Chunk]
basePrmChnks = mkChnk 0 [] : chnksList (fromIntegral pgSZBTS)
-- the full list of primes could be accessed with the following function.
primes :: () -> [PrimeType]
primes () = 2 : (listChnkPrms $ chnksList 0)
-- a quite fast prime counting up to the given limit using
-- chunk processing to avoid innermost list processing loops.
countPrimesTo :: PrimeType -> Int
countPrimesTo limit =
let lmtb = (limit - 3) `div` 2 in
let sumChnks acc chnks@(chnk@(UArray lo hi rng _):chnks') =
let cnt :: UArray PrimeType Word32 -> Int
cnt bfw =
case if lmtb < hi then fromIntegral (lmtb - lo) else rng of
crng -> case crng `shiftR` 5 of
rngw ->
let cnt' i ac =
ac `seq` if i >= rngw then
if (i `shiftL` 5) >= rng then ac else
case (-2) `shiftL` fromIntegral (lmtb .&. 31) of
msk -> msk `seq`
case (unsafeAt bfw rngw) .&.
(complement msk) of
bts -> bts `seq` case popCount bts of
c -> c `seq` case ac + c of nac -> nac
else case ac + (popCount $ unsafeAt bfw i) of
nacc -> nacc `seq` cnt' (i + 1) (nacc)
in cnt' 0 0 in
acc `seq` case runST $ do -- make UArray _ Bool into a UArray _ Word32
stbuf <- unsafeThawSTUArray chnk
stbufw <- castSTUArray stbuf
bufw <- unsafeFreezeSTUArray stbufw
return $ cnt bufw of
c -> c `seq` case acc + c of
nacc -> nacc `seq` if hi >= lmtb then nacc
else sumChnks nacc chnks' in
if limit < 2 then 0 else if limit < 3 then 1 else
lmtb `seq` sumChnks 1 (chnksList 0)
main = do
x <- read `fmap` getLine -- 1mln 2mln 10mln 100mln 1000mln
-- 0.02 0.03 0.06 0.45 4.60 seconds
-- 7328 7328 8352 8352 9424 Kilobytes
-- this takes 14.34 seconds and 9424 Kilobytes to 3 billion on ideone.com,
-- and 9.12 seconds for 3 billion on an i7-2700K (3.5 GHz).
-- The above ratio of about 1.6 is much better than usual due to
-- the extremely low memory use of the page segmented algorithm.
-- It seems that the Windows Native Code Generator (NCG) from GHC
-- is particularly poor, as the Linux 32-bit version takes
-- less than two thirds of the time for exactly the same program...
print $ countPrimesTo x
-- print $ length $ takeWhile ((>=) x) $ primes () -- the slow way to do this
The times and memory requirements given in the code are as observed when run on ideone.com; on my i7-2700K (3.5 GHz) the runs take 0.02, 0.03, 0.05, 0.30, 3.0, and 9.1 seconds for ranges of one, two, ten, a hundred, a thousand (one billion), and three thousand (three billion) million, respectively, with a pretty much constant memory footprint that grows only slowly with the number of base primes less than the square root of the range. When compiled with the LLVM back end, these times become 0.01, 0.02, 0.02, 0.12, 1.35, and 4.15 seconds, respectively, thanks to its more efficient use of registers and machine instructions; that last figure is quite close to the speed obtained with a 64-bit compiler rather than the 32-bit compiler used here, since efficient register use means that the availability of extra registers doesn't make much difference.
As commented in the code, the ratio between the performance on my real machine and on the ideone.com servers is much smaller than for more memory-hungry algorithms, because this one is not throttled by memory access bottlenecks; the limit to speed is mostly just the ratio of CPU clock speeds and of per-clock processing efficiency. However, as also commented there, there is some strange inefficiency in the GHC Native Code Generator (NCG) when run under Windows (32-bit compiler): run times are over 50% slower than under Linux (which the ideone.com server uses). AFAIK both builds share a common code base for the same GHC version; the only divergence is the linker (which is also used with the unaffected LLVM back end), since the NCG does not use GCC but only the mingw32 assembler, which should also be the same.
Note that this code, when compiled with the LLVM back end, runs at about the same speed as the same algorithm written as a highly optimized 'C/C++' implementation, showing that Haskell really can produce very tight loop code. One might add that the Haskell code is quite a bit more readable and safer than equivalent 'C/C++' code once one gets used to the Haskell paradigms of monadic and non-strict code. Further refinements in execution speed for the Sieve of Eratosthenes are purely a matter of implementation tuning, not of the choice between Haskell and 'C/C++'.
Summary: Of course, this isn't yet the ultimate in speed for a Haskell Sieve of Eratosthenes: we still haven't tuned memory access to make better use of the fast L1 CPU cache, nor have we significantly reduced the total number of composite-culling operations with extreme wheel factorization beyond eliminating the even numbers. However, it is enough to answer the question: mutable arrays are the most efficient way to attack such tight-loop problems, with potential speed gains of about 100 times over the use of lists or immutable arrays. END_EDIT_ADD2

Find the smallest regular number that is not less than N

Regular numbers are numbers that evenly divide powers of 60. As an example, 60^2 = 3600 = 48 × 75, so both 48 and 75 are divisors of a power of 60. Thus, they are also regular numbers.
This is an extension of rounding up to the next power of two.
I have an integer value N which may contain large prime factors and I want to round it up to a number composed of only small prime factors (2, 3 and 5)
Examples:
f(18) == 18 == 2^1 * 3^2
f(19) == 20 == 2^2 * 5^1
f(257) == 270 == 2^1 * 3^3 * 5^1
What would be an efficient way to find the smallest number satisfying this requirement?
The values involved may be large, so I would like to avoid enumerating all regular numbers starting from 1 or maintaining an array of all possible values.
One can produce an arbitrarily thin slice of the Hamming sequence around the n-th member in time ~ n^(2/3), by direct enumeration of triples (i,j,k) such that N = 2^i * 3^j * 5^k.
The algorithm works from log2(N) = i + j*log2(3) + k*log2(5); it enumerates all possible k and, for each, all possible j, finds the top i and thereby the triple (k,j,i), and keeps it in a "band" if it lies within the given "width" below the given high logarithmic top value (when width < 1 there can be at most one such i), then sorts them by their logarithms.
WP says that n ~ (log N)^3, i.e. the run time is ~ (log N)^2. Here we don't care about the exact position of the found triple in the sequence, so all the count calculations from the original code can be thrown away:
import Data.List (sortBy)
import Data.Function (on)

slice hi w = sortBy (compare `on` fst) b  where        -- hi > log2(N) is a top value
  lb5 = logBase 2 5 ; lb3 = logBase 2 3                -- w < 1 (NB!) is log2(width)
  b   = concat                                         -- the slice
        [ [ (r,(i,j,k)) | frac < w ]                   -- store it, if inside the width
          | k <- [ 0 .. floor ( hi   /lb5) ], let p = fromIntegral k*lb5,
            j <- [ 0 .. floor ((hi-p)/lb3) ], let q = fromIntegral j*lb3 + p,
            let (i,frac) = properFraction (hi-q) ; r = hi - frac ]   -- r = i + q
-- properFraction 12.7 == (12, 0.7)
-- update: in pseudocode:
def slice(hi, w):
lb5, lb3 = logBase(2, 5), logBase(2, 3) -- logs base 2 of 5 and 3
for k from 0 step 1 to floor(hi/lb5) inclusive:
p = k*lb5
for j from 0 step 1 to floor((hi-p)/lb3) inclusive:
q = j*lb3 + p
i = floor(hi-q)
frac = hi-q-i -- frac < 1 , always
r = hi - frac -- r == i + q
if frac < w:
place (r,(i,j,k)) into the output array
sort the output array's entries by their "r" component
in ascending order, and return thus sorted array
Having enumerated the triples in the slice, it is a simple matter of sorting and searching, taking practically O(1) time (for an arbitrarily thin slice) to find the first triple above N. Well, actually, for a constant (logarithmic) width, the number of triples in the slice (members of the "upper crust" in the (i,j,k)-space below the log(N) plane) is again m ~ n^(2/3) ~ (log N)^2, and sorting takes m log m time (so searching, even linear, then takes ~ m run time). But the width can be made smaller for bigger Ns, following some empirical observations; and the constant factors for the enumeration of triples are much higher than for the subsequent sorting anyway.
Even with constant (logarithmic) width it runs very fast, calculating the 1,000,000-th value in the Hamming sequence instantly and the billionth in 0.05s.
The original idea of "top band of triples" is due to Louis Klauder, as cited in my post on a DDJ blogs discussion back in 2008.
update: as noted by GordonBGood in the comments, there's no need for the whole band, just about one or two values above and below the target; the algorithm is easily amended to that effect. The input should also be tested for being a Hamming number itself before proceeding with the algorithm, to avoid round-off issues with double precision. There are no round-off issues comparing the logarithms of Hamming numbers known in advance to be different (though going up to the trillionth entry in the sequence uses about 14 significant digits in the logarithm values, leaving only 1-2 digits to spare, so the situation may in fact turn iffy there; but for the billionth we only need 11 significant digits).
update2: turns out the Double precision for logarithms limits this to numbers below about 20,000 to 40,000 decimal digits (i.e. 10 trillionth to 100 trillionth Hamming number). If there's a real need for this for such big numbers, the algorithm can be switched back to working with the Integer values themselves instead of their logarithms, which will be slower.
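Not part of the original answer, but as a hedged usage sketch (the names isRegular and nextRegularViaSlice are invented here, and the double-precision caveats just discussed still apply), the slice function above can be used to answer the original question directly: test the input for regularity exactly, enumerate the whole "upper crust" below log2(N) + 1, and keep the smallest exact value not below N.

-- assumes the `slice` function defined above is in scope
isRegular :: Integer -> Bool
isRegular m = strip (strip (strip m 2) 3) 5 == 1
  where strip x p | x `rem` p == 0 = strip (x `quot` p) p
                  | otherwise      = x

nextRegularViaSlice :: Integer -> Integer
nextRegularViaSlice n
  | n <= 1      = 1
  | isRegular n = n          -- test the input exactly first, as advised above
  | otherwise   = minimum [ v | t <- slice (lgN + 1) 1.0, let v = value t, v >= n ]
  where
    lgN = logBase 2 (fromIntegral n) :: Double    -- fine for moderately sized n
    value :: (Double, (Integer, Integer, Integer)) -> Integer
    value (_, (i, j, k)) = 2 ^ i * 3 ^ j * 5 ^ k
    -- the width of 1.0 deliberately trades the "thin slice" speed for
    -- simplicity: for every (j,k) it keeps the largest power of 2 whose
    -- product still has a logarithm at or below lgN + 1

For instance, nextRegularViaSlice 257 evaluates to 270, in line with the examples in the question.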
Okay, hopefully third time's a charm here. A recursive, branching algorithm for an initial input of p, where N is the number being 'built' within each thread. NB 3a-c here are launched as separate threads or otherwise done (quasi-)asynchronously.
1. Calculate the next-largest power of 2 after p, call this R. N = p.
2. Is N > R? Quit this thread. Is p composed of only small prime factors? You're done. Otherwise, go to step 3.
3. After any of 3a-c, go to step 4.
   a) Round p up to the nearest multiple of 2. This number can be expressed as m * 2.
   b) Round p up to the nearest multiple of 3. This number can be expressed as m * 3.
   c) Round p up to the nearest multiple of 5. This number can be expressed as m * 5.
4. Go to step 2, with p = m.
I've omitted the bookkeeping to do regarding keeping track of N but that's fairly straightforward I take it.
Edit: Forgot 6, thanks ypercube.
Edit 2: Had this going up to 30 (5, 6, 10, 15, 30), realized that was unnecessary, took that out.
Edit 3: (The last one I promise!) Added the power-of-30 check, which helps prevent this algorithm from eating up all your RAM.
Edit 4: Changed power-of-30 to power-of-2, per finnw's observation.
Here's a solution in Python, based on Will Ness's answer but taking some shortcuts and using pure integer math to avoid running into log-space numerical accuracy errors:
import math
def next_regular(target):
"""
Find the next regular number greater than or equal to target.
"""
# Check if it's already a power of 2 (or a non-integer)
try:
if not (target & (target-1)):
return target
except TypeError:
# Convert floats/decimals for further processing
target = int(math.ceil(target))
if target <= 6:
return target
match = float('inf') # Anything found will be smaller
p5 = 1
while p5 < target:
p35 = p5
while p35 < target:
# Ceiling integer division, avoiding conversion to float
# (quotient = ceil(target / p35))
# From https://stackoverflow.com/a/17511341/125507
quotient = -(-target // p35)
# Quickly find next power of 2 >= quotient
# See https://stackoverflow.com/a/19164783/125507
try:
p2 = 2**((quotient - 1).bit_length())
except AttributeError:
# Fallback for Python <2.7
p2 = 2**(len(bin(quotient - 1)) - 2)
N = p2 * p35
if N == target:
return N
elif N < match:
match = N
p35 *= 3
if p35 == target:
return p35
if p35 < match:
match = p35
p5 *= 5
if p5 == target:
return p5
if p5 < match:
match = p5
return match
In English: iterate through every combination of 5s and 3s, quickly finding the next power of 2 >= target for each pair and keeping the smallest result. (It's a waste of time to iterate through every possible multiple of 2 when only one of them can be correct.) It also returns early if it ever finds that the target is already a regular number, though this is not strictly necessary.
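For comparison, here is a hedged Haskell rendition of the same idea (not part of the original answer; nextRegular' and its helpers are invented names): walk over every 3^j * 5^k below the target, bump each one up by the unique power of 2 that reaches the target, and keep the smallest result.

nextRegular' :: Integer -> Integer
nextRegular' target
  | target <= 1 = 1
  | otherwise   = go5 1 (nextPow2 target)        -- a power of 2 is always a candidate
  where
    nextPow2 q = head [ p | p <- iterate (* 2) 1, p >= q ]   -- smallest 2^i >= q
    go5 p5 best                                  -- p5 runs over the powers of 5
      | p5 >= target = min best p5
      | otherwise    = go5 (p5 * 5) (go35 p5 best)
    go35 p35 best                                -- p35 runs over p5 * (powers of 3)
      | p35 >= target = min best p35
      | otherwise     = go35 (p35 * 3) (min best (p35 * nextPow2 q))
      where q = (target + p35 - 1) `div` p35     -- ceiling of target / p35
-- e.g. nextRegular' 257 == 270, matching f(257) in the question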
I've tested it pretty thoroughly, testing every integer from 0 to 51200000 and comparing to the list on OEIS http://oeis.org/A051037, as well as many large numbers that are ±1 from regular numbers, etc. It's now available in SciPy as fftpack.helper.next_fast_len, to find optimal sizes for FFTs (source code).
I'm not sure if the log method is faster because I couldn't get it to work reliably enough to test it. I think it has a similar number of operations, though? I'm not sure, but this is reasonably fast. Takes <3 seconds (or 0.7 second with gmpy) to calculate that 2^142 × 3^80 × 5^444 is the next regular number above 2^2 × 3^454 × 5^249 + 1 (the 100,000,000th regular number, which has 392 digits)
You want to find the smallest number m such that m >= N and m = 2^i * 3^j * 5^k where all i,j,k >= 0.
Taking logarithms the equations can be rewritten as:
log m >= log N
log m = i*log2 + j*log3 + k*log5
You can calculate log2, log3, log5 and logN to sufficiently high accuracy (depending on the size of N). Then this problem looks like an Integer Linear Programming problem (minimize log m subject to the constraints above, with i, j, k non-negative integers), and you could try to solve it using one of the known algorithms for this NP-hard problem.
EDITED/CORRECTED: Corrected the code to pass the scipy tests:
Here's an answer based on endolith's answer, but it almost eliminates the long multi-precision integer calculations by using float64 logarithm representations for the basic comparisons that find triple values passing the criteria, resorting to full-precision comparisons only when there is a chance that the logarithm value may not be accurate enough, which happens only when the target is very close to either the previous or the next regular number:
import math
def next_regular(target):
"""
Find the next regular number greater than or equal to target.
"""
if target < 2: return ( 0, 0, 0 )
log2hi = 0
mant = 0
# Check if it's already a power of 2 (or a non-integer)
try:
mant = target & (target - 1)
target = int(target) # take care of case where not int/float/decimal
except TypeError:
# Convert floats/decimals for further processing
target = int(math.ceil(target))
mant = target & (target - 1)
# Quickly find next power of 2 >= target
# See https://stackoverflow.com/a/19164783/125507
try:
log2hi = target.bit_length()
except AttributeError:
# Fallback for Python <2.7
log2hi = len(bin(target)) - 2
# exit if this is a power of two already...
if not mant: return ( log2hi - 1, 0, 0 )
# take care of trivial cases...
if target < 9:
if target < 4: return ( 0, 1, 0 )
elif target < 6: return ( 0, 0, 1 )
elif target < 7: return ( 1, 1, 0 )
else: return ( 3, 0, 0 )
# find log of target, which may exceed the float64 limit...
if log2hi < 53: mant = target << (53 - log2hi)
else: mant = target >> (log2hi - 53)
log2target = log2hi + math.log2(float(mant) / (1 << 53))
# log2 constants
log2of2 = 1.0; log2of3 = math.log2(3); log2of5 = math.log2(5)
# calculate range of log2 values close to target;
# desired number has a logarithm of log2target <= x <= top...
fctr = 6 * log2of3 * log2of5
top = (log2target**3 + 2 * fctr)**(1/3) # for up to 2 numbers higher
btm = 2 * log2target - top # or up to 2 numbers lower
match = log2hi # Anything found will be smaller
result = ( log2hi, 0, 0 ) # placeholder for eventual matches
count = 0 # only used for debugging counting band
fives = 0; fiveslmt = int(math.ceil(top / log2of5))
while fives < fiveslmt:
log2p = top - fives * log2of5
threes = 0; threeslmt = int(math.ceil(log2p / log2of3))
while threes < threeslmt:
log2q = log2p - threes * log2of3
twos = int(math.floor(log2q)); log2this = top - log2q + twos
if log2this >= btm: count += 1 # only used for counting band
if log2this >= btm and log2this < match:
# logarithm precision may not be enough to differentiate between
# the next lower regular number and the target, so do
# a full resolution comparison to eliminate this case...
if (2**twos * 3**threes * 5**fives) >= target:
match = log2this; result = ( twos, threes, fives )
threes += 1
fives += 1
return result
print(next_regular(2**2 * 3**454 * 5**249 + 1)) # prints (142, 80, 444)
Since most of the long multi-precision calculations have been eliminated, gmpy isn't needed, and on IDEOne the above code takes 0.11 seconds instead of 0.48 seconds for endolith's solution to find the next regular number greater than the 100 millionth one, as shown; it takes 0.49 seconds instead of 5.48 seconds to find the next regular number past the billionth (the next one is (761,572,489) past (1334,335,404) + 1). The difference will get even larger as the range goes up, since the multi-precision calculations get increasingly longer for the endolith version compared to almost none here. Thus, this version could calculate the next regular number past the trillionth in the sequence in about 50 seconds on IDEOne, where it would likely take over an hour with the endolith version.
The English description of the algorithm is almost the same as for the endolith version, differing as follows:
1) calculates a float64 log2 estimate of the target argument (we can't use the built-in log function directly, as the range may be much too large to represent as a 64-bit float); a minimal sketch of this step follows the list,
2) compares the log representation values to determine qualifying values inside an estimated range, above and below the target, containing only about two or three numbers (depending on round-off),
3) compares multi-precision values only when inside the narrow band defined above,
4) outputs the triple of indices rather than the full long multi-precision integer (which would be about 840 decimal digits for the one past the billionth, and ten times that for the trillionth); the triple can then easily be converted to the long multi-precision value if required.
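A minimal sketch of step 1), written in Haskell for consistency with the rest of this document (the name integerLog2Approx is invented here; the Python and Haskell versions above do the same thing inline): normalize the number to a 53-bit mantissa first, so the Double conversion cannot overflow, then add the bit length back in.

import Data.Bits (shiftL, shiftR)

-- approximate log2 of an arbitrarily large positive Integer
integerLog2Approx :: Integer -> Double
integerLog2Approx n = fromIntegral nbits + logBase 2 (fromIntegral mant / 2 ^ (53 :: Int))
  where
    nbits = length (takeWhile (> 0) (iterate (`shiftR` 1) n))   -- bit length of n (n > 0)
    mant | nbits <= 53 = n `shiftL` (53 - nbits)
         | otherwise   = n `shiftR` (nbits - 53)

-- e.g. integerLog2Approx (2 ^ 10000) == 10000.0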
This algorithm uses almost no memory other than for the potentially very large multi-precision integer target value, the intermediate evaluation comparison values of about the same size, and the output expansion of the triples if required. This algorithm is an improvement over the endolith version in that it successfully uses the logarithm values for most comparisons in spite of their lack of precision, and that it narrows the band of compared numbers to just a few.
This algorithm will work for argument ranges somewhat above ten trillion (a few minutes' calculation time at IDEOne rates), beyond which it will no longer be correct due to the limited precision of the log representation values, as per @WillNess's discussion. To fix this, we can change the log representation to a "roll-your-own" logarithm representation consisting of a fixed-length integer (124 bits for about double the exponent range, good for targets of over a hundred thousand digits if one is willing to wait); this will be a little slower, since smallish multi-precision integer operations are slower than float64 operations, but not that much slower since the size is limited (maybe a factor of three or so).
Now, none of these Python implementations (without using C or Cython or PyPy or the like) are particularly fast, as they are about a hundred times slower than the same algorithm implemented in a compiled language. For reference, here is a Haskell version:
{-# OPTIONS_GHC -O3 #-}
import Data.Word
import Data.Bits
nextRegular :: Integer -> ( Word32, Word32, Word32 )
nextRegular target
| target < 2 = ( 0, 0, 0 )
| target .&. (target - 1) == 0 = ( fromIntegral lg2hi - 1, 0, 0 )
| target < 9 = case target of
3 -> ( 0, 1, 0 )
5 -> ( 0, 0, 1 )
6 -> ( 1, 1, 0 )
_ -> ( 3, 0, 0 )
| otherwise = match
where
lg3 = logBase 2 3 :: Double; lg5 = logBase 2 5 :: Double
lg2hi = let cntplcs v cnt =
let nv = v `shiftR` 31 in
if nv <= 0 then
let cntbts x c =
if x <= 0 then c else
case c + 1 of
nc -> nc `seq` cntbts (x `shiftR` 1) nc in
cntbts (fromIntegral v :: Word32) cnt
else case cnt + 31 of ncnt -> ncnt `seq` cntplcs nv ncnt
in cntplcs target 0
lg2tgt = let mant = if lg2hi <= 53 then target `shiftL` (53 - lg2hi)
else target `shiftR` (lg2hi - 53)
in fromIntegral lg2hi +
logBase 2 (fromIntegral mant / 2^53 :: Double)
lg2top = (lg2tgt^3 + 2 * 6 * lg3 * lg5)**(1/3) -- for 2 numbers or so higher
lg2btm = 2* lg2tgt - lg2top -- or two numbers or so lower
match =
let klmt = floor (lg2top / lg5)
loopk k mtchlgk mtchtplk =
if k > klmt then mtchtplk else
let p = lg2top - fromIntegral k * lg5
jlmt = fromIntegral $ floor (p / lg3)
loopj j mtchlgj mtchtplj =
if j > jlmt then loopk (k + 1) mtchlgj mtchtplj else
let q = p - fromIntegral j * lg3
( i, frac ) = properFraction q; r = lg2top - frac
( nmtchlg, nmtchtpl ) =
if r < lg2btm || r >= mtchlgj then
( mtchlgj, mtchtplj ) else
if 2^i * 3^j * 5^k >= target then
( r, ( i, j, k ) ) else ( mtchlgj, mtchtplj )
in nmtchlg `seq` nmtchtpl `seq` loopj (j + 1) nmtchlg nmtchtpl
in loopj 0 mtchlgk mtchtplk
in loopk 0 (fromIntegral lg2hi) ( fromIntegral lg2hi, 0, 0 )
trival :: ( Word32, Word32, Word32 ) -> Integer
trival (i,j,k) = 2^i * 3^j * 5^k
main = putStrLn $ show $ nextRegular $ (trival (1334,335,404)) + 1 -- (1126,16930,40)
This code calculates the next regular number following the billionth in too small a time to be measured, and following the trillionth in 0.69 seconds on IDEOne (and could potentially run even faster, except that IDEOne doesn't support LLVM). Even Julia will run at something like this Haskell speed after the "warm-up" for JIT compilation.
EDIT_ADD: The Julia code is as per the following:
function nextregular(target :: BigInt) :: Tuple{ UInt32, UInt32, UInt32 }
# trivial case of first value or anything less...
target < 2 && return ( 0, 0, 0 )
# Check if it's already a power of 2 (or a non-integer)
mant = target & (target - 1)
# Quickly find next power of 2 >= target
log2hi :: UInt32 = 0
test = target
while true
next = test & 0x7FFFFFFF
test >>>= 31; log2hi += 31
test <= 0 && (log2hi -= leading_zeros(UInt32(next)) - 1; break)
end
# exit if this is a power of two already...
mant == 0 && return ( log2hi - 1, 0, 0 )
# take care of trivial cases...
if target < 9
target < 4 && return ( 0, 1, 0 )
target < 6 && return ( 0, 0, 1 )
target < 7 && return ( 1, 1, 0 )
return ( 3, 0, 0 )
end
# find log of target, which may exceed the Float64 limit...
if log2hi < 53 mant = target << (53 - log2hi)
else mant = target >>> (log2hi - 53) end
log2target = log2hi + log(2, Float64(mant) / (1 << 53))
# log2 constants
log2of2 = 1.0; log2of3 = log(2, 3); log2of5 = log(2, 5)
# calculate range of log2 values close to target;
# desired number has a logarithm of log2target <= x <= top...
fctr = 6 * log2of3 * log2of5
top = (log2target^3 + 2 * fctr)^(1/3) # for 2 numbers or so higher
btm = 2 * log2target - top # or 2 numbers or so lower
# scan for values in the given narrow range that satisfy the criteria...
match = log2hi # Anything found will be smaller
result :: Tuple{UInt32,UInt32,UInt32} = ( log2hi, 0, 0 ) # placeholder for eventual matches
fives :: UInt32 = 0; fiveslmt = UInt32(ceil(top / log2of5))
while fives < fiveslmt
log2p = top - fives * log2of5
threes :: UInt32 = 0; threeslmt = UInt32(ceil(log2p / log2of3))
while threes < threeslmt
log2q = log2p - threes * log2of3
twos = UInt32(floor(log2q)); log2this = top - log2q + twos
if log2this >= btm && log2this < match
# logarithm precision may not be enough to differentiate between
# the next lower regular number and the target, so do
# a full resolution comparison to eliminate this case...
if (big(2)^twos * big(3)^threes * big(5)^fives) >= target
match = log2this; result = ( twos, threes, fives )
end
end
threes += 1
end
fives += 1
end
result
end
Here's another possibility I just thought of:
If N is X bits long, then the smallest regular number R ≥ N will be in the range
[2^(X-1), 2^X]
e.g. if N = 257 (binary 100000001) then we know R is 1xxxxxxxx unless R is exactly equal to the next power of 2 (512)
To generate all the regular numbers in this range, we can generate the odd regular numbers (i.e. multiples of powers of 3 and 5) first, then take each value and multiply by 2 (by bit-shifting) as many times as necessary to bring it into this range.
In Python:
from itertools import ifilter, takewhile
from Queue import PriorityQueue
def nextPowerOf2(n):
p = max(1, n)
while p != (p & -p):
p += p & -p
return p
# Generate multiples of powers of 3, 5
def oddRegulars():
q = PriorityQueue()
q.put(1)
prev = None
while not q.empty():
n = q.get()
if n != prev:
prev = n
yield n
if n % 3 == 0:
q.put(n // 3 * 5)
q.put(n * 3)
# Generate regular numbers with the same number of bits as n
def regularsCloseTo(n):
p = nextPowerOf2(n)
numBits = len(bin(n))
for i in takewhile(lambda x: x <= p, oddRegulars()):
yield i << max(0, numBits - len(bin(i)))
def nextRegular(n):
bigEnough = ifilter(lambda x: x >= n, regularsCloseTo(n))
return min(bigEnough)
You know what? I'll put money on the proposition that the 'dumb' algorithm is actually fastest. This is based on the observation that the next regular number does not, in general, seem to be much larger than the given input. So simply start counting up, and after each increment, refactor and see if you've found a regular number. But create one processing thread for each available core, and for N cores have each thread examine every Nth number. When each thread has found a number or crossed the power-of-2 threshold, compare the results (keep a running best number) and there you are.
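A hedged, single-threaded Haskell sketch of that idea (the multi-core split by residue class is left out, and the names are invented here): count upward from N and return the first value whose only prime factors are 2, 3 and 5.

-- strip out all factors of 2, 3 and 5; exactly 1 remains for regular numbers
smooth235 :: Integer -> Bool
smooth235 n = foldl strip n [2, 3, 5] == 1
  where strip m p | m `rem` p == 0 = strip (m `quot` p) p
                  | otherwise      = m

-- the 'dumb' search: try N, N+1, N+2, ... until a regular number turns up
nextRegularDumb :: Integer -> Integer
nextRegularDumb n = head (filter smooth235 [max 1 n ..])
-- e.g. nextRegularDumb 257 == 270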
I wrote a small C# program to solve this problem. It's not very optimised, but it's a start.
This solution is pretty fast for numbers as big as 11 digits.
private long GetRegularNumber(long n)
{
long result = n - 1;
long quotient = result;
while (quotient > 1)
{
result++;
quotient = result;
quotient = RemoveFactor(quotient, 2);
quotient = RemoveFactor(quotient, 3);
quotient = RemoveFactor(quotient, 5);
}
return result;
}
private static long RemoveFactor(long dividend, long divisor)
{
long remainder = 0;
long quotient = dividend;
while (remainder == 0)
{
dividend = quotient;
quotient = Math.DivRem(dividend, divisor, out remainder);
}
return dividend;
}
