Create a random permutation of 1..N in constant space - algorithm

I am looking to enumerate a random permutation of the numbers 1..N in fixed space. This means that I cannot store all numbers in a list. The reason for that is that N can be very large, more than available memory. I still want to be able to walk through such a permutation of numbers one at a time, visiting each number exactly once.
I know this can be done for certain N: many random number generators cycle through their whole state space in a seemingly random order, but visit every state exactly once. A good random number generator with a 32-bit state will emit a permutation of the numbers 0..(2^32)-1, every number exactly once.
I want to be able to pick N to be any number at all, not constrained to powers of 2 for example. Is there an algorithm for this?

The easiest way is probably to just create a full-range PRNG for a larger range than you care about, and when it generates a number larger than you want, just throw it away and get the next one.
Another possibility that's pretty much a variation of the same would be to use a linear feedback shift register (LFSR) to generate the numbers in the first place. This has a couple of advantages: first of all, an LFSR is probably a bit faster than most PRNGs. Second, it is (I believe) a bit easier to engineer an LFSR that produces numbers close to the range you want, and still be sure it cycles through the numbers in its range in (pseudo)random order, without any repetitions.
Without spending a lot of time on the details, the math behind LFSRs has been studied quite thoroughly. Producing one that runs through all the numbers in its range without repetition simply requires choosing a set of "taps" that correspond to a primitive polynomial. If you don't want to search for that yourself, it's pretty easy to find tables of known ones for almost any reasonable size (e.g., doing a quick look, the Wikipedia article lists them for sizes up to 19 bits).
If memory serves, there's at least one primitive polynomial of every possible bit size. That translates to the fact that, in the worst case, you can create a generator with roughly twice the range you need, so on average you're throwing away (roughly) every other number you generate. Given the speed of an LFSR, I'd guess you can do that and still maintain quite acceptable speed.
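To make the idea concrete, here is a minimal Python sketch (mine, not the answerer's) using a 4-bit Galois LFSR with the primitive polynomial x^4 + x^3 + 1, which cycles through 1..15 exactly once per period; values above N are discarded as described:

def lfsr_permutation(n, seed=1):
    # 4-bit Galois LFSR for x^4 + x^3 + 1, so this toy handles n <= 15 only;
    # a real implementation would pick a register just wide enough for n.
    assert 1 <= n <= 15 and 1 <= seed <= 15
    taps = 0b1100  # mask bits for the x^4 and x^3 terms
    state = seed
    while True:
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= taps
        if state <= n:
            yield state  # throw away values above n
        if state == seed:
            return  # the full 15-state cycle is complete

# list(lfsr_permutation(10)) == [6, 3, 10, 5, 7, 9, 8, 4, 2, 1]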

One way to do it would be
Find a prime p larger than N, preferably not much larger.
Find a primitive root of unity g modulo p, that is, a number 1 < g < p such that g^k ≡ 1 (mod p) if and only if k is a multiple of p-1.
Go through g^k (mod p) for k = 1, 2, ..., ignoring the values that are larger than N.
For every prime p, there are φ(p-1) primitive roots of unity, so it works. However, it may take a while to find one. Finding a suitable prime is much easier in general.
For finding a primitive root, I know nothing substantially better than trial and error, but one can increase the probability of a fast find by choosing the prime p appropriately.
Since the number of primitive roots is φ(p-1), if one randomly chooses r in the range from 1 to p-1, the expected number of tries until one finds a primitive root is (p-1)/φ(p-1), hence one should choose p so that φ(p-1) is relatively large, that means that p-1 must have few distinct prime divisors (and preferably only large ones, except for the factor 2).
Instead of randomly choosing, one can also try in sequence whether 2, 3, 5, 6, 7, 10, ... is a primitive root, of course skipping perfect powers (or not, they are in general quickly eliminated), that should not affect the number of tries needed greatly.
So it boils down to checking whether a number x is a primitive root modulo p. If p-1 = q^a * r^b * s^c * ... with distinct primes q, r, s, ..., x is a primitive root if and only if
x^((p-1)/q) % p != 1
x^((p-1)/r) % p != 1
x^((p-1)/s) % p != 1
...
thus one needs a decent modular exponentiation (exponentiation by repeated squaring lends itself well for that, reducing by the modulus on each step). And a good method to find the prime factor decomposition of p-1. Note, however, that even naive trial division would be only O(√p), while the generation of the permutation is Θ(p), so it's not paramount that the factorisation is optimal.
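As a small concrete illustration of that test (my own code, not part of the answer), using Python's built-in three-argument pow for the modular exponentiation:

def is_primitive_root(x, p, factors):
    # factors: the distinct prime divisors of p - 1
    return all(pow(x, (p - 1) // q, p) != 1 for q in factors)

# Example: p = 11, p - 1 = 10 = 2 * 5.
assert is_primitive_root(2, 11, [2, 5])      # 2^5 = 32 = 10 and 2^2 = 4 (mod 11), both != 1
assert not is_primitive_root(3, 11, [2, 5])  # 3^5 = 243 = 1 (mod 11)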

Another way to do this is with a block cipher; see this blog post for details.
The blog post links to the paper Ciphers with Arbitrary Finite Domains, which contains a bunch of solutions.
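For flavor, here is a hedged Python sketch of the "cycle walking" idea from that family of constructions: a toy Feistel permutation over the smallest power-of-two domain covering N, re-enciphering any out-of-range output until it lands in 0..N-1. The round function is an arbitrary stand-in of my own, not anything prescribed by the paper:

import hashlib

def feistel(x, bits, key, rounds=4):
    # Balanced Feistel network over a 2^bits domain (bits must be even).
    half = bits // 2
    mask = (1 << half) - 1
    left, right = x >> half, x & mask
    for r in range(rounds):
        digest = hashlib.sha256(b"%d:%d:%d" % (key, r, right)).digest()
        left, right = right, left ^ (int.from_bytes(digest[:4], "big") & mask)
    return (left << half) | right

def permute(i, n, key):
    # Cycle walking: re-encipher until the value falls back inside 0..n-1.
    bits = max((n - 1).bit_length(), 2)
    bits += bits % 2  # a balanced Feistel needs an even bit count
    x = feistel(i, bits, key)
    while x >= n:
        x = feistel(x, bits, key)
    return x

# [permute(i, 1000, key=42) for i in range(1000)] is a permutation of 0..999.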

Consider the prime 3. To fully express all possible outputs, think of it this way...
bias + step mod prime
The bias is just a constant offset. step is an accumulator that advances by a fixed increment on each draw (if the increment is 1, for example, it would just produce 0, 1, 2 in sequence, while 2 would result in 0, 2, 4), and prime is the prime number we want to generate the permutations against.
For example, a simple sequence of 0, 1, 2 would be...
0 + 0 mod 3 = 0
0 + 1 mod 3 = 1
0 + 2 mod 3 = 2
Modifying a couple of those variables for a second, we'll take bias of 1 and step of 2 (just for illustration)...
1 + 2 mod 3 = 0
1 + 4 mod 3 = 2
1 + 6 mod 3 = 1
You'll note that we produced an entirely different sequence. No number within the set repeats itself and all numbers are represented (it's bijective). Each unique combination of bias and step will result in one of prime! possible permutations of the set. In the case of a prime of 3 you'll see that there are 6 different possible permutations:
0,1,2
0,2,1
1,0,2
1,2,0
2,0,1
2,1,0
If you do the math on the variables above you'll note that it results in the same information requirements...
1/3! = 1/6 ≈ 0.167
... vs...
1/3 (bias) * 1/2 (step) => 1/6 ≈ 0.167
Restrictions are simple: bias must be within 0..P-1 and step must be within 1..P-1 (I have functionally just been using 0..P-2 and adding 1 in the arithmetic in my own work). Other than that, it works with all prime numbers no matter how large, and will permute all possible unique sets of them without the need for memory beyond a couple of integers (each technically requiring slightly fewer bits than the prime itself).
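A minimal Python sketch of the generator described above (variable names are mine; the k-th output is (bias + k*step) mod p):

def prime_permutation(p, bias, step):
    # p prime (any p coprime to step actually works), bias in 0..p-1,
    # step in 1..p-1; yields each of 0..p-1 exactly once.
    value = bias % p
    for _ in range(p):
        yield value
        value = (value + step) % p

# list(prime_permutation(3, 0, 1)) == [0, 1, 2]
# list(prime_permutation(3, 1, 2)) == [1, 0, 2]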
Note carefully that this generator is not meant to be used to generate sets that are not prime in number. It's entirely possible to do so, but not recommended for security sensitive purposes as it would introduce a timing attack.
That said, if you would like to use this method to generate a set sequence that is not a prime, you have two choices.
First (and the simplest/cheapest), pick the prime number just larger than the set size you're looking for and have your generator simply discard anything that doesn't belong. Once more, danger, this is a very bad idea if this is a security sensitive application.
Second (by far the most complicated and costly), you can recognize that all numbers are composed of prime numbers and create multiple generators whose product then yields each element in the set. In other words, an n of 6 would involve all possible prime generators that could match 6 (in this case, 2 and 3), multiplied in sequence. This is expensive (although mathematically more elegant), and it also introduces a timing attack, so it's even less recommended.
Lastly, if you need a generator for bias and/or step... why don't you use another of the same family :). Suddenly you're extremely close to creating true simple random samples (which is usually not easy).

The fundamental weakness of LCGs (x=(x*m+c)%b style generators) is useful here.
If the generator is properly formed, then x%f is also a repeating sequence of all values lower than f (provided f is a factor of b).
Since b is usually a power of 2, this means that you can take a 32-bit generator and reduce it to an n-bit generator by masking off the top bits, and it will have the same full-range property.
This means that you can reduce the number of discarded values to fewer than N by choosing an appropriate mask.
Unfortunately an LCG is a poor generator, for exactly the same reason as given above.
Also, this has exactly the same weakness as I noted in a comment on @JerryCoffin's answer. It will always produce the same sequence and the only thing the seed controls is where to start in that sequence.
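A small Python sketch of the masking idea (the LCG constants are the common Numerical Recipes ones, my choice rather than the answer's; they satisfy the full-period conditions for every power-of-two modulus, so the masked generator visits each k-bit value once per period):

def lcg_permutation(n, seed=0):
    bits = max(1, (n - 1).bit_length())
    m = 1 << bits               # smallest power of two >= n
    a, c = 1664525, 1013904223  # full period for any modulus 2^k
    x = seed % m
    for _ in range(m):          # one full period of the masked LCG
        x = (x * a + c) % m
        if x < n:
            yield x             # discard the m - n out-of-range values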

Here's some SageMath code that should generate a random permutation the way Daniel Fischer suggested:
def random_safe_prime(lbound):
    # Find a safe prime p = 2q + 1 with q prime.
    while True:
        q = random_prime(lbound, lbound=lbound // 2)
        p = 2 * q + 1
        if is_prime(p):
            return p, q

def random_permutation(n):
    # Yield 0..n-1 in pseudo-random order: powers of a random primitive
    # root modulo a safe prime, discarding out-of-range values.
    p, q = random_safe_prime(n + 2)
    while True:
        r = randint(2, p - 1)
        # For a safe prime, r is a primitive root iff r^2 != 1 and r^q != 1.
        if pow(r, 2, p) != 1 and pow(r, q, p) != 1:
            i = 1
            while True:
                x = pow(r, i, p)
                if x == 1:
                    return  # full cycle completed
                if 0 <= x - 2 < n:
                    yield x - 2
                i += 1

Related

Hashing with the Division Method - Choosing number of slots?

So, in CLRS, there's this quote
A prime not too close to an exact power of 2 is often a good choice for m.
Several Questions...
I understand how a power of 2 will just be the lower-order bits of your key. However, say your keys are drawn uniformly from a universe of 1 to 1 million (which I'm guessing is a common assumption about your universe if given no other data). Wouldn't taking, say, the 4 lower-order bits result in 2^4 bit patterns that are pretty much equally likely for the keys from 1 to 1 million? How am I thinking about this incorrectly?
Why a prime number? So, if powers of 2 aren't a good idea, why is a prime number a better choice as opposed to a composite number close to a power of 2? (Also, why should it be close to a power of 2... lol)
You are trying to find a hash table that works well for typical input data, and typical input data does things that you wouldn't expect from good random number generators. Very often you get formatted or semi-formatted strings which, when converted to numbers, end up as K, K+A, K+2A, K+3A,.... for some integers K and A. If K+xA and K+yA hash to the same number mod m, then (x-y)A must be 0 mod m. If m is prime, this can only happen if A = 0 mod m or if x = y mod m, so one time in m. But if m=pq and A happens to be divisible by p, then you get a collision every time x-y is divisible by q, which is more often since q < m.
I guess close to a power of 2 because it might be convenient for the memory management system to have blocks of memory of the resulting size - I really don't know. If you really care, and if you have the time, you could try different primes with some representative data and see which of them are best in practice.
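A quick Python demonstration of the collision effect described above (the numbers are my own illustration): keys in arithmetic progression K + iA pile up in a few buckets when A shares a factor with a composite m, but spread out under a prime m of similar size.

def worst_bucket(m, keys):
    counts = [0] * m
    for k in keys:
        counts[k % m] += 1
    return max(counts)

keys = [7 + 12 * i for i in range(1000)]  # K = 7, A = 12
print(worst_bucket(24, keys))  # 500: keys land in only two buckets
print(worst_bucket(23, keys))  # 44: a prime m spreads them almost evenly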

Find the smallest set group to cover all combinatory possibilities

I'm working through some combinatorics-algorithm exercises and trying to figure out how to solve the question below:
Given a group of 25 bits, set (choose) 15 of them (order does NOT matter):
n!/(k!(n-k)!) = 3,268,760
Now for every one of these possibilities, construct a matrix where I cross every unique 25-bit member against all other 25-bit members, where each pair must have at least 11 set bits in common (only ones, not zeroes).
Let me try to illustrate representing it as binary data, so the first member would be:
0000000000111111111111111 (10 zeros and 15 ones) or (15 bits set on 25 bits)
0000000001011111111111111 second member
0000000001101111111111111 third member
0000000001110111111111111 and so on....
...
1111111111111110000000000 up to here. The 3,268,760th member.
Now crossing these values over a matrix: for 1 x 1 I must have 15 bits in common. Since the result is >= 11, it is a "useful" result.
For 1 x 2 we have 14 bits in common, so also a valid result.
Doing that for all members, finally, crossing 1 x 3,268,760 should result in 5 bits in common, so since it's < 11 it's not "useful".
What I need is to find out (by math or algorithm) the minimum number of members needed to cover all possibilities with 11 bits in common.
In other words, a group of N members that, if tested against all others, has at least 11 bits in common over the whole 3,268,760 x 3,268,760 universe.
Using a brute force algorithm I found out that with 81 25-bit members it is possible to achieve this. But I'm guessing that this number should be smaller (something near 12).
I was trying to use a brute force algorithm to make all possible variations of 12 members over the 3,268,760, but the number of possibilities is so huge that it would take more than a hundred years to compute (3.156x10^69 combinations).
I've googled about combinatorics, but there are so many fields that I don't know which one this problem fits in.
So any directions on which field of combinatorics, or any algorithm for this issue, are greatly appreciated.
PS: Just for reference. The "likeness" of two members is calculated using:
(Not(a xor b)) and a
After that there's a small recursive loop to count the bits, giving the number of common bits.
EDIT: As promised (@btilly) in the comment below, here's the 'fractal' image of the relations (link to image).
The color scale ranges from red (15-bit match) through green (11-bit match) to black for values smaller than 10 bits.
This image is just a sample of the first 4096 groups.
tl;dr: you want to solve dominating set on a large, extremely symmetric graph. btilly is right that you should not expect an exact answer. If this were my problem, I would try local search starting with the greedy solution. Pick one set and try to get rid of it by changing the others. This requires data structures to keep track of which sets are covered exactly once.
EDIT: Okay, here's a better idea for a lower bound. For every k from 1 to the value of the optimal solution, there's a lower bound of [25 choose 15] * k / [maximum joint coverage of k sets]. Your bound of 12 (actually 10 by my reckoning, since you forgot some neighbors) corresponds to k = 1. Proof sketch: fix an arbitrary solution with m sets and consider the most coverage that can be obtained by k of the m. Build a fractional solution where all symmetries of the chosen k are averaged together and scaled so that each element is covered once. The cost of this solution is [25 choose 15] * k / [maximum joint coverage of those k sets], which is at least as large as the lower bound we're shooting for. It's still at least as small, however, as the original m-set solution, as the marginal returns of each set are decreasing.
Computing maximum coverage is in general hard, but there's a factor (e/(e-1))-approximation (≈ 1.58) algorithm: greedy, which it sounds as though you could implement quickly (note: you need to choose the set that covers the most uncovered other sets each time). By multiplying the greedy solution by e/(e-1), we obtain an upper bound on the maximum coverage of k elements, which suffices to power the lower bound described in the previous paragraph.
Warning: if this upper bound is larger than [25 choose 15], then k is too large!
This type of problem is extremely hard, you should not expect to be able to find the exact answer.
A greedy solution should produce a "fairly good" answer. But... how to be greedy?
The idea is to always choose the next element to be the one that is going to match as many currently unmatched possibilities as you can. Unfortunately, with over 3 million possible members that you have to try to match against millions of unmatched members (note: your best next guess might already match another member in your candidate set), even choosing that next element is probably not feasible.
So we'll have to be greedy about choosing the next element. We will choose each bit to maximize the sum of the probabilities of eventually matching all of the currently unmatched elements.
For that we will need a 2-dimensional lookup table P such that P(n, m) is the probability that two random members will turn out to have at least 11 bits in common, if m of the first n bits that are 1 in the first member are also 1 in the second. This table of 225 probabilities should be precomputed.
This table can easily be computed using the following rules:
P(15, m) is 0 if m < 11, 1 otherwise.
For n < 15:
P(n, m) = P(n+1, m+1) * (15-m) / (25-n) + P(n+1, m) * (10-n+m) / (25-n)
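A short Python sketch of that precomputation (my code, following the rules above; entries outside max(0, n-10) <= m <= n are unreachable and left at zero):

P = [[0.0] * 16 for _ in range(16)]
for m in range(11, 16):
    P[15][m] = 1.0  # all 15 bits examined: success iff m >= 11
for n in range(14, -1, -1):
    for m in range(max(0, n - 10), n + 1):
        # The next examined bit matches with probability (15-m)/(25-n).
        P[n][m] = (P[n + 1][m + 1] * (15 - m) / (25 - n)
                   + P[n + 1][m] * (10 - n + m) / (25 - n))

# P[0][0] is then the probability that two random members share >= 11 bits.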
Now let's start with a few members that are "very far" from each other. My suggestion would be:
First 15 bits 1, rest 0.
First 10 bits 0, rest 1.
First 8 bits 1, last 7 1, rest 0.
Bits 1-4, 9-12, 16-23 are 1, rest 0.
Now starting with your universe of (25 choose 15) members, eliminate all of those that match one of the elements in your initial collection.
Next we go into the heart of the algorithm.
While there are unmatched members:
    Find the bit that appears in the most unmatched members (break ties randomly)
    Make that the first set bit of our candidate member for the group.
    While the candidate member has less than 15 set bits:
        Let p_best = 0, bit_best = 0
        For each unset bit:
            Let p = 0
            For each unmatched member:
                p += P(n, m) where m = number of bits in common between
                    candidate member + this bit and the unmatched member
                    and n = bits in candidate member + 1
            If p_best < p:
                p_best = p
                bit_best = this unset bit
        Set bit_best as the next bit in our candidate member.
    Add the candidate member to our collection
    Remove all unmatched members that match this from unmatched members
The list of candidate members is our answer
I have not written code, so I have no idea how good an answer this algorithm will produce. But assuming that it does no better than your current result, for 77 candidate members (we cheated and started with 4) you have to make 271 passes through your unmatched candidates (25 to find the first bit, 24 to find the second, etc., down to 11 to find the 15th, and one more to remove the matched members). That's 20867 passes. If you have an average of 1 million unmatched members, that's on the order of 20 billion operations.
This won't be quick. But it should be computationally feasible.

In a series of n elements of an arithmetic progression, [n/2] elements are changed. Find the difference of the initial arithmetic progression

I have a list of size n which contains n consecutive members of an arithmetic progression which are not in order. I changed less than half of the elements in this list with some random integer. From this new list, how can I find the difference of the initial arithmetic progression?
I thought a lot about it but except brute force, I was not able to come up with any other thing :(
Thanks for thinking on this one :)
It's not possible to solve this in general and be 100% sure that your answer is correct. Let's say that the initial list is the following arithmetic progression (not in order):
1 3 2 4
Change less than half the elements at random... let's say for example that we changed 2 to 5:
1 3 5 4
If we can first find out which numbers we need to change to obtain a valid shuffled arithmetic sequence, then we can easily solve the problem stated in the question. However, we can see that there are multiple possible answers depending on which number we choose to change:
6, 3, 5, 4 (difference is 1)
1, 3, 2, 4 (difference is 1)
1, 3, 5, 7 (difference is 2)
There is no way to know which of these possible sequence is the original sequence, so you cannot be sure what the original difference was.
Since there is no deterministic solution for the problem (as stated by @Mark Byers), you can try a probabilistic approach.
It's difficult to obtain the original progression, but its rate (common difference) can be obtained easily by comparing the differences between elements. The differences between original elements will be multiples of the rate.
Consider taking 2 elements from the list (the probability that both of them belong to the original sequence is 1/4) and computing their difference. This difference, with probability 1/4, will be a multiple of the rate. Decompose it into prime factors and count them (for example, 12 = 2^2 * 3 will add 2 to 2's counter and will increment 3's counter).
After many such iterations (it looks like a good problem for probabilistic methods, like Monte Carlo), you can analyze the counters.
If a prime factor belongs to the rate, its counter will be at least num_iterations/4 (or num_iterations/2 if it appears twice).
The main problem is that small factors will have a large probability on random input (for example, the difference between two random numbers has a 50% probability of being divisible by 2). So you'll have to compensate for it: since 3/4 of your differences were random, you'll have to consider that (3/8)*num_iterations of 2's counter must be ignored. Since this also applies to all powers of two, the simplest way is to pregenerate a "white noise mask" by taking the differences only between random numbers.
EDIT: let's take this approach further. Consider that you create this "white noise mask" (let's call it spectrum) for random numbers, and consider that it's the base-1 spectrum, since their smallest "greatest common factor" is 1. By computing it for the differences of the arithmetic sequence, you'll obtain a base-R spectrum, where R is the rate, and it will be equivalent to a shifted version of the base-1 spectrum. So you have to find the value of R such that
your_spectrum ~= spectrum(1)*3/4 + spectrum(R)*1/4
You could also check for largest number R such that at least half of the elements will be equal modulo R.
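A deliberately naive Python sketch of that last check (mine): every member a + kd of the original progression is congruent to a modulo d, and fewer than half the entries were replaced, so R = d always puts at least half the list into one residue class.

from collections import Counter

def guess_difference(lst):
    n = len(lst)
    # The true difference cannot exceed the spread of the list.
    for r in range(max(lst) - min(lst), 1, -1):
        residues = Counter(x % r for x in lst)
        if 2 * max(residues.values()) >= n:
            return r
    return 1

# Example: shuffled progression 3, 10, ..., 80 (difference 7) with three
# entries corrupted; 9 of the 12 values still agree mod 7:
# guess_difference([3, 24, 31, 10, 100, 59, 41, 17, 52, 66, 45, 5]) == 7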

Programming problem - Game of Blocks

Maybe you have an idea on how to solve the following problem.
John decided to buy his son Johnny some mathematical toys. One of his favorite toys is blocks of different colors. John has decided to buy blocks of C different colors. For each color he will buy a googol (10^100) blocks. All blocks of the same color are of the same length, but blocks of different colors may vary in length.
Johnny has decided to use these blocks to make a large 1 x n block. He wonders how many ways he can do this. Two ways are considered different if there is a position where the color differs. The example shows a red block of size 5, a blue block of size 3 and a green block of size 3; it shows there are 12 ways of making a large block of length 11.
Each test case starts with an integer 1 ≤ C ≤ 100. The next line consists of C integers; the i-th integer 1 ≤ len_i ≤ 750 denotes the length of the i-th color. The next line is a positive integer N ≤ 10^15.
This problem should be solved in 20 seconds for T ≤ 25 test cases. The answer should be calculated MOD 100000007 (a prime number).
It can be reduced to a matrix exponentiation problem, which can be solved relatively efficiently in O(max(len_i)^2.376 * log N) using the Coppersmith-Winograd algorithm and fast exponentiation. But it seems that a more efficient algorithm is required, as Coppersmith-Winograd implies a large constant factor. Do you have any other ideas? It could possibly be a Number Theory or Divide and Conquer problem.
Firstly note the number of blocks of each colour you have is a complete red herring, since 10^100 > N always. So the number of blocks of each colour is practically infinite.
Now notice that at each position p (if there is a valid configuration that leaves no spaces, etc.) there must be a block of some color c. There are len[c] ways for this block to lie so that it still covers position p.
My idea is to try all possible colors and positions at a fixed position (N/2, since it halves the range), and then for each case there are b cells before this fixed colored block and a cells after it. So if we define a function ways(i) that returns the number of ways to tile i cells (with ways(0)=1), then the number of ways to tile n cells with a fixed colored block at a given position is ways(b)*ways(a). Adding up all possible configurations yields the answer for ways(n).
Now, I chose the fixed position to be N/2 since that halves the range, and you can halve a range at most ceil(log(N)) times. Since you are moving a block around N/2, you will have to calculate from N/2-750 to N/2+750, where 750 is the max length a block can have. So you will have to calculate about 750*ceil(log(N)) (a bit more because of the variance) lengths to get the final answer.
So in order to get good performance you have to throw in memoisation, since this is inherently a recursive algorithm.
So using Python(since I was lazy and didn't want to write a big number class):
T = int(raw_input())
for case in xrange(T):
    # read in the data
    C = int(raw_input())
    lengths = map(int, raw_input().split())
    minlength = min(lengths)
    n = int(raw_input())
    # setup memoisation, note all lengths less than the minimum length are
    # set to 0 as the algorithm needs this
    memoise = {}
    memoise[0] = 1
    for length in xrange(1, minlength):
        memoise[length] = 0
    def solve(n):
        global memoise
        if n in memoise:
            return memoise[n]
        ans = 0
        for i in xrange(C):
            if lengths[i] > n:
                continue
            if lengths[i] == n:
                ans += 1
                ans %= 100000007
                continue
            for j in xrange(0, lengths[i]):
                b = n/2 - lengths[i] + j
                a = n - (n/2 + j)
                if b < 0 or a < 0:
                    continue
                ans += solve(b) * solve(a)
                ans %= 100000007
        memoise[n] = ans
        return memoise[n]
    solve(n)
    print "Case %d: %d" % (case + 1, memoise[n])
Note I haven't exhaustively tested this, but I'm quite sure it will meet the 20 second time limit, if you translated this algorithm to C++ or somesuch.
EDIT: Running a test with N = 10^15 and a block of length 750, I get that memoise contains about 60000 elements, which means the non-lookup part of solve(n) is called about the same number of times.
A word of caution: in the case C=2, len1=1, len2=2, the answer will be the N-th Fibonacci number, and the Fibonacci numbers grow (approximately) exponentially with a growth factor of the golden ratio, phi ~ 1.61803399. For the huge value N=10^15, the answer will be about phi^(10^15), an enormous number. The answer will have storage requirements on the order of (ln(phi^(10^15))/ln(2)) / (8 * 2^40) ~ 79 terabytes. Since you can't even access 79 terabytes in 20 seconds, it's unlikely you can meet the speed requirements in this special case.
Your best hope occurs when C is not too large, and len_i is large for all i. In such cases, the answer will still grow exponentially with N, but the growth factor may be much smaller.
I recommend that you first construct the integer matrix M which will compute the (i+1, ..., i+k) terms in your sequence based on the (i, ..., i+k-1) terms (only the last row of this matrix is interesting). Compute the first k entries "by hand", then calculate M^(10^15) based on the repeated squaring trick, and apply it to terms (0, ..., k-1).
The (integer) entries of the matrix will grow exponentially, perhaps too fast to handle. If this is the case, do the very same calculation, but modulo p, for several moderate-sized primes p. This will allow you to obtain your answer modulo p, for various p, without using a matrix of bigints. After using enough primes that you know their product is larger than your answer, you can use the so-called "Chinese remainder theorem" to recover your answer from your mod-p answers.
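A minimal Python sketch of this approach for a single modulus p (my own code, not the answerer's): ways(i) = sum over colors c of ways(i - len[c]) with ways(0) = 1, evaluated through a k x k companion matrix raised to a huge power by repeated squaring, where k = max(len):

def mat_mul(A, B, p):
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) % p
             for j in range(n)] for i in range(n)]

def mat_pow(M, e, p):
    n = len(M)
    R = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    while e:
        if e & 1:
            R = mat_mul(R, M, p)
        M = mat_mul(M, M, p)
        e >>= 1
    return R

def ways_mod(lengths, N, p):
    k = max(lengths)
    # Companion matrix over the state (ways(i-k+1), ..., ways(i)): the top
    # rows shift, only the last row encodes the recurrence.
    M = [[0] * k for _ in range(k)]
    for i in range(k - 1):
        M[i][i + 1] = 1
    for L in lengths:
        M[k - 1][k - L] += 1
    # First k terms "by hand".
    w = [0] * k
    w[0] = 1
    for i in range(1, k):
        w[i] = sum(w[i - L] for L in lengths if L <= i) % p
    if N < k:
        return w[N]
    R = mat_pow(M, N - k + 1, p)
    return sum(R[k - 1][j] * w[j] for j in range(k)) % p

# ways_mod([1, 2], 10, 100000007) == 89 (the Fibonacci case mentioned above)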
I'd like to build on the earlier @JPvdMerwe solution with some improvements. In his answer, @JPvdMerwe uses a Dynamic Programming / memoisation approach, which I agree is the way to go on this problem. Dividing the problem recursively into two smaller problems and remembering previously computed results is quite efficient.
I'd like to suggest several improvements that would speed things up even further:
Instead of going over all the ways the block in the middle can be positioned, you only need to go over the first half and multiply the solution by 2. This is because the second half of the cases are symmetrical. For odd-length blocks you would still need to take the centered position as a separate case.
In general, iterative implementations can be several magnitudes faster than recursive ones. This is because a recursive implementation incurs bookkeeping overhead for each function call. It can be a challenge to convert a solution to its iterative cousin, but it is usually possible. The @JPvdMerwe solution can be made iterative by using a stack to store intermediate values.
Modulo operations are expensive, as are multiplications to a lesser extent. The number of multiplications and modulos can be decreased by approximately a factor C=100 by switching the color-loop with the position-loop. This allows you to add the return values of several calls to solve() before doing a multiplication and modulo.
A good way to test the performance of a solution is with a pathological case. The following could be especially daunting: length 10^15, C=100, prime block sizes.
Hope this helps.
In the above answer
ans += 1
ans %= 100000007
could be much faster without a general modulo:
ans += 1
if ans == 100000007: ans = 0
Please see TopCoder thread for a solution. No one was close enough to find the answer in this thread.

Prime factor of 300 000 000 000?

I need to find the prime factors of a number over 300 billion. I have a function that is adding to the list of them... very slowly! It has been running for about an hour now and I think it's got a fair distance to go still. Am I doing it completely wrong, or is this expected?
Edit: I'm trying to find the largest prime factor of the number 600851475143.
Edit:
Result:
{
    List<Int64> ListOfPrimeFactors = new List<Int64>();
    Int64 Number = 600851475143;
    Int64 DividingNumber = 2;
    // Test divisors up to sqrt(Number), dividing out each factor as it is
    // found (note <=, so that square factors like 49 are not missed).
    while (DividingNumber <= Number / DividingNumber)
    {
        if (Number % DividingNumber == 0)
        {
            ListOfPrimeFactors.Add(DividingNumber);
            Number = Number / DividingNumber;
        }
        else
            DividingNumber++;
    }
    ListOfPrimeFactors.Add(Number); // what remains is the largest prime factor
    listBox1.DataSource = ListOfPrimeFactors;
}
Are you remembering to divide the number that you're factorizing by each factor as you find them?
Say, for example, you find that 2 is a factor. You can add that to your list of factors, but then you divide the number that you're trying to factorise by that value.
Now you're only searching for the factors of 150 billion. Each time around you should start from the factor you just found. So if 2 was a factor, test 2 again. If the next factor you find is 3, there's no point testing from 2 again.
And so on...
Finding prime factors is difficult using brute force, which sounds like the technique you are using.
Here are a few tips to speed it up somewhat:
Start low, not high
Don't bother testing each potential factor to see whether it is prime--just test LIKELY prime numbers (odd numbers that end in 1,3,7 or 9)
Don't bother testing even numbers (all divisible by 2), or odds that end in 5 (all divisible by 5). Of course, don't actually skip 2 and 5!!
When you find a prime factor, make sure to divide it out--don't continue to use your massive original number. See my example below.
If you find a factor, make sure to test it AGAIN to see if it is in there multiple times. Your number could be 2x2x3x7x7x7x31 or something like that.
Stop when you reach >= sqrt(remaining large number)
Edit: A simple example:
You are finding the factors of 275.
Test 275 for divisibility by 2. Does 275/2 = int(275/2)? No. Failed.
Test 275 for divisibility by 3. Failed.
Skip 4!
Test 275 for divisibility by 5. YES! 275/5 = 55. So your NEW test number is now 55.
Test 55 for divisibility by 5. YES! 55/5 = 11. So your NEW test number is now 11.
BUT 5 > sqrt (11), so 11 is prime, and you can stop!
So 275 = 5 * 5 * 11
Make more sense?
Factoring big numbers is a hard problem. So hard, in fact, that we rely on it to keep RSA secure. But take a look at the wikipedia page for some pointers to algorithms that can help. But for a number that small, it really shouldn't be taking that long, unless you are re-doing work over and over again that you don't have to somewhere.
For the brute-force solution, remember that you can do some mini-optimizations:
Check 2 specially, then only check odd numbers.
You only ever need to check up to the square-root of the number (if you find no factors by then, then the number is prime).
Once you find a factor, don't use the original number to find the next factor, divide it by the found factor, and search the new smaller number.
When you find a factor, divide it through as many times as you can. After that, you never need to check that number, or any smaller numbers again.
If you do all the above, each new factor you find will be prime, since any smaller factors have already been removed.
Here is an XSLT solution!
This XSLT transformation takes 0.109 sec.
<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:saxon="http://saxon.sf.net/"
 xmlns:f="http://fxsl.sf.net/"
 exclude-result-prefixes="xs saxon f"
>
 <xsl:import href="../f/func-Primes.xsl"/>
 <xsl:output method="text"/>
 <xsl:template name="initial" match="/*">
   <xsl:sequence select="f:maxPrimeFactor(600851475143)"/>
 </xsl:template>
 <xsl:function name="f:maxPrimeFactor" as="xs:integer">
   <xsl:param name="pNum" as="xs:integer"/>
   <xsl:sequence select=
    "if(f:isPrime($pNum))
       then $pNum
       else
         for $vEnd in xs:integer(floor(f:sqrt($pNum, 0.1E0))),
             $vDiv1 in (2 to $vEnd)[$pNum mod . = 0][1],
             $vDiv2 in $pNum idiv $vDiv1
           return
             max((f:maxPrimeFactor($vDiv1), f:maxPrimeFactor($vDiv2)))
    "/>
 </xsl:function>
</xsl:stylesheet>
This transformation produces the correct result (the maximum prime factor of 600851475143) in just 0.109 sec.:
6857
The transformation uses the f:sqrt() and f:isPrime() defined in FXSL 2.0 -- a library for functional programming in XSLT. FXSL is itself written entirely in XSLT.
f:isPrime() uses Fermat's little theorem so that it can determine primality efficiently.
One last thing nobody has mentioned, perhaps because it seems obvious. Every time you find a factor and divide it out, keep trying the factor until it fails.
64 only has one prime factor, 2. You will find that out pretty trivially if you keep dividing out the 2 until you can't anymore.
$ time factor 300000000000 > /dev/null
real 0m0.027s
user 0m0.000s
sys 0m0.001s
You're doing something wrong if it's taking an hour. You might even have an infinite loop somewhere - make sure you're not using 32-bit ints.
The key to understanding why the square root is important: each factor of n below the square root of n has a corresponding factor above it. To see this, consider that if x is a factor of n, then n/x = m, which means that x*m = n, hence m is also a factor.
I wouldn't expect it to take very long at all - that's not a particularly large number.
Could you give us an example number which is causing your code difficulties?
Here's one site where you can get answers: Factoris - Online factorization service. It can do really big numbers, but it also can factorize algebraic expressions.
The fastest algorithms are sieve algorithms, and are based on arcane areas of discrete mathematics (over my head at least), complicated to implement and test.
The simplest algorithm for factoring is probably (as others have said) the Sieve of Eratosthenes. Things to remember about using this to factor a number N:
general idea: you're checking an increasing sequence of possible integer factors x to see if they evenly divide your candidate number N (in C/Java/Javascript check whether N % x == 0) in which case N is not prime.
you just need to go up to sqrt(N), but don't actually calculate sqrt(N): loop as long as your test factor x passes the test x*x <= N
if you have the memory to save a bunch of previous primes, use only those as the test factors (and don't save a prime P if it fails the test P*P <= N_max, since you'll never use it again)
Even if you don't save the previous primes, for possible factors x just check 2 and all the odd numbers. Yes, it will take longer, but not that much longer for reasonably sized numbers. The prime-counting function and its approximations can tell you what fraction of numbers are prime; this fraction decreases very slowly. Even for 2^64 ≈ 1.8x10^19, roughly one out of every 43 numbers is prime (= one out of every 21.5 odd numbers is prime). For factors of numbers less than 2^64, those factors x are less than 2^32, where about one out of every 20 numbers is prime (= one out of every 10 odd numbers is prime). So you'll have to test 10 times as many numbers, but the loop should be a bit faster and you don't have to mess around with storing all those primes.
There are also some older and simpler sieve algorithms that are a little bit more complex but still fairly understandable. See Dixon's, Shanks' and Fermat's factoring algorithms. I read an article about one of these once, can't remember which one, but they're all fairly straightforward and use algebraic properties of the differences of squares.
If you're just testing whether a number N is prime, and you don't actually care about the factors themselves, use a probabilistic primality test. Miller-Rabin is the most standard one, I think.
I spent some time on this since it just sucked me in. I won't paste the code here just yet. Instead see this factors.py gist if you're curious.
Mind you, I didn't know anything about factoring (still don't) before reading this question. It's just a Python implementation of BradC's answer above.
On my MacBook it takes 0.002 secs to factor the number mentioned in the question (600851475143).
There must obviously be much, much faster ways of doing this. My program takes 19 secs to compute the factors of 6008514751431331. But the Factoris service just spits out the answer in no-time.
The specific number is 300425737571? It trivially factors into 131 * 151 * 673 * 22567.
I don't see what all the fuss is...
Here's some Haskell goodness for you guys :)
primeFactors n = factor n primes
  where factor n (p:ps) | p*p > n        = [n]
                        | n `mod` p /= 0 = factor n ps
                        | otherwise      = p : factor (n `div` p) (p:ps)

primes = 2 : filter ((==1) . length . primeFactors) [3,5..]
Took it about .5 seconds to find them, so I'd call that a success.
You could use the sieve of Eratosthenes to find the primes and see if your number is divisible by those you find.
You only need to check its remainder mod n, where n is a prime <= sqrt(N) and N is the number you are trying to factor. It really shouldn't take over an hour, even on a really slow computer or a TI-85.
Your algorithm must be FUBAR. This only takes about 0.1s on my 1.6 GHz netbook in Python. Python isn't known for its blazing speed. It does, however, have arbitrary precision integers...
import math
import operator

def factor(n):
    """Given the number n to factor, yield its prime factors.
    factor(1) yields one result: 1. Negative n is not supported."""
    M = math.sqrt(n)  # no factors larger than M
    p = 2             # candidate factor to test
    while p <= M:     # keep looking until pointless
        d, m = divmod(n, p)
        if m == 0:
            yield p            # p is a prime factor
            n = d              # divide n accordingly
            M = math.sqrt(n)   # and adjust M
        else:
            p += 1             # p didn't pan out, try the next candidate
    yield n  # whatever's left in n is a prime factor

def test_factor(n):
    f = factor(n)
    n2 = reduce(operator.mul, f)
    assert n2 == n

def example():
    n = 600851475143
    f = list(factor(n))
    assert reduce(operator.mul, f) == n
    print n, "=", "*".join(str(p) for p in f)

example()
# output:
# 600851475143 = 71*839*1471*6857
(This code seems to work in defiance of the fact that I don't know enough about number theory to fill a thimble.)
Just to expand/improve slightly on the "only test odd numbers that don't end in 5" suggestions...
All primes greater than 3 are either one more or one less than a multiple of 6 (6x + 1 or 6x - 1 for integer values of x).
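A small Python sketch of trial division using that 6k±1 pattern (my own illustration):

def trial_division(n):
    factors = []
    for d in (2, 3):
        while n % d == 0:
            factors.append(d)
            n //= d
    d = 5
    while d * d <= n:
        for cand in (d, d + 2):  # 6k - 1 and 6k + 1
            while n % cand == 0:
                factors.append(cand)
                n //= cand
        d += 6
    if n > 1:
        factors.append(n)  # whatever remains is prime
    return factors

# trial_division(600851475143) == [71, 839, 1471, 6857]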
It shouldn't take that long, even with a relatively naive brute force. For that specific number, I can factor it in my head in about one second.
You say you don't want solutions(?), but here's your "subtle" hint. The only prime factors of the number are the lowest three primes.
Semi-prime numbers of that size are used for encryption, so I am curious what exactly you want to use them for.
That aside, there are currently no good ways to find the prime factorization of large numbers in a relatively small amount of time.
