Fermat vs Mersenne as modulus - algorithm

So there are some number theory applications where we need to do modulo with big numbers, and we can choose the modulus. There are two families of moduli that can get huge optimizations - Fermat and Mersenne.
So let's call an N bit sequence a chunk. N is often not a multiple of the word size.
For Fermat, we have M=2^N+1, so 2^N=-1 mod M, so we take the chunks of the dividend and alternate adding and subtracting.
For Mersenne, we have M=2^N-1, so 2^N=1 mod M, so we sum the chunks of the dividend.
In either case, we will likely end up with a number that takes up 2 chunks. We can apply this algorithm again if needed and finally do a general modulo algorithm.
Fermat will make the result smaller on average due to the alternating addition and subtraction. A negative result isn't that computationally expensive, you just keep track of the sign and fix it in the final modulo step. But I'd think bignum subtraction is a little slower than bignum addition.
Mersenne sums all chunks, so the result is a little larger, but that can be fixed with a second iteration of the algorithm at next to no extra cost.
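As a rough illustration of the two folding schemes described above, here is a minimal Python sketch. The function names mod_mersenne and mod_fermat are made up for this example, and Python's built-in integers stand in for a real bignum library, so it shows the arithmetic only and says nothing about relative speed.

```python
def mod_mersenne(x: int, n: int) -> int:
    """Reduce non-negative x modulo M = 2**n - 1 by summing n-bit chunks."""
    m = (1 << n) - 1
    while x > m:                 # repeat the fold until one chunk is left
        total = 0
        while x:
            total += x & m       # low n bits form one chunk
            x >>= n
        x = total
    return 0 if x == m else x    # M itself is congruent to 0


def mod_fermat(x: int, n: int) -> int:
    """Reduce non-negative x modulo M = 2**n + 1 by alternately
    adding and subtracting n-bit chunks."""
    m = (1 << n) + 1
    mask = (1 << n) - 1
    total, sign = 0, 1
    while x:
        total += sign * (x & mask)   # 2**n is -1 mod M, so signs alternate
        sign = -sign
        x >>= n
    return total % m             # final general modulo also fixes the sign


print(mod_mersenne(3**200, 61) == 3**200 % (2**61 - 1))
print(mod_fermat(3**200, 61) == 3**200 % (2**61 + 1))
```

In the Fermat version the running total can go negative, which matches the remark above about tracking the sign and fixing it in the last step.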
So in the end, which is faster?
Schönhage–Strassen uses Fermat. There might be some other factors other than performance that make Fermat better than Mersenne - or maybe it's just straight up faster.

If you need a prime modulus, you're going to make the decision based on the convenience of the size.
For example, 2^31-1 is often convenient on 64-bit architectures, since it fits pretty snugly into 32 bits and the product of two such values fits into a 64-bit word, either signed or unsigned.
On 32-bit architectures, 2^16+1 has similar advantages. It doesn't quite fit into 16 bits, of course, but if you treat 0 as a special case, then it's still pretty easy to multiply two such values in a 32-bit word.
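To make the two word-size tricks concrete, here is a small Python sketch. The function names are hypothetical. For 2^16+1 it assumes the convention used by the IDEA cipher, where the 16-bit pattern 0 stands for the value 2^16; the text above only says that 0 is treated as a special case, so that reading is an assumption.

```python
M31 = (1 << 31) - 1   # Mersenne prime 2**31 - 1
F16 = (1 << 16) + 1   # Fermat prime 2**16 + 1


def mulmod_m31(a: int, b: int) -> int:
    """Multiply two residues modulo 2**31 - 1 using only shifts and adds."""
    p = a * b                     # below 2**62, so it fits a 64-bit word
    p = (p & M31) + (p >> 31)     # fold once: 2**31 is 1 mod M31
    p = (p & M31) + (p >> 31)     # fold again: now at most M31
    return 0 if p == M31 else p


def mulmod_f16(a: int, b: int) -> int:
    """Multiply modulo 2**16 + 1 on 16-bit codes, where code 0 means 2**16."""
    if a == 0:                    # 2**16 is -1 mod F16, so the result is -b
        return (1 - b) & 0xFFFF
    if b == 0:
        return (1 - a) & 0xFFFF
    p = a * b                     # below 2**32, so it fits a 32-bit word
    lo, hi = p & 0xFFFF, p >> 16
    r = lo - hi                   # 2**16 is -1, so the high half is subtracted
    if r < 0:
        r += F16
    return r & 0xFFFF             # re-encode 2**16 as 0


print(mulmod_m31(123456789, 987654321) == 123456789 * 987654321 % M31)
```

Both routines avoid a division instruction entirely, which is the convenience the answer is pointing at.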

Related

Why is exponentiation not atomic?

In calculating the efficiency of algorithms, I have read that the exponentiation operation is not considered to be an atomic operation (like multiplication).
Is it because exponentiation is the same as the multiplication operation repeated several times over?
In principle, you can pick any set of "core" operations on numbers that you consider to take a single time unit to evaluate. However, there are a couple of reasons why we typically don't count exponentiation as one of them.
Perhaps the biggest has to do with how large of an output you produce. Suppose you have two numbers x and y that are each d digits long. Then their sum x + y has (at most) d + 1 digits - barely bigger than what we started with. Their product xy has at most 2d digits - larger than what we started with, but not by a huge amount. On the other hand, the value x^y has roughly y·d digits, which can be significantly bigger than what we started with. (A good example of this: think about computing 100^100, which has about 200 digits!) This means that simply writing down the result of the exponentiation would require a decent amount of time to complete.
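The growth in output size is easy to check directly. The following is a small Python illustration; the helper name digits is made up for this example.

```python
def digits(n: int) -> int:
    """Number of decimal digits of a positive integer."""
    return len(str(n))


x, y = 99, 99            # both inputs have d = 2 digits
d = digits(x)

print(digits(x + y))     # at most d + 1
print(digits(x * y))     # at most 2 * d
print(digits(x ** y))    # at most y * d, far larger than the inputs
print(digits(100 ** 100))  # 201, since 100**100 == 10**200
```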
This isn't to say that you couldn't consider exponentiation to be a constant-time operation. Rather, I've just never seen it done.
(Fun fact: some theory papers don't consider multiplication to be a constant-time operation, since the complexity of a hardware circuit to multiply two b-bit numbers grows quadratically with the size of b. And some theory papers don't consider addition to be constant-time either, especially when working with variable-length numbers! It's all about context. If you're dealing with "smallish" numbers that fit into machine words, then we can easily count addition and multiplication as taking constant time. If you have huge numbers - say, large primes for RSA encryption - then the size of the numbers starts to impact the algorithm's runtime and implementation.)
This is a matter of definition. For example, in hardware design and big-integer processing, multiplication is not considered an atomic operation (see e.g. this analysis of the Karatsuba algorithm).
On the level that is relevant for general purpose software-design on the other hand, multiplication can be considered as a fairly fast operation on fixed-digit numbers implemented in hardware. Exponentiation on the other hand is rarely implemented in hardware and an upper bound for the complexity can only be given in terms of the exponent, rather than the number of digits.
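To see how the cost depends on the exponent rather than on the number of digits, here is a sketch of square-and-multiply exponentiation that counts its multiplications. This is a standard technique that the answers above do not mention by name; the function name power_with_count is hypothetical.

```python
def power_with_count(base: int, exp: int):
    """Return (base**exp, number of multiplications used) for exp >= 0."""
    result, mults = 1, 0
    while exp > 0:
        if exp & 1:              # this bit of the exponent is set
            result *= base
            mults += 1
        exp >>= 1
        if exp:                  # square only if more bits remain
            base *= base
            mults += 1
    return result, mults


value, mults = power_with_count(3, 13)
print(value == 3 ** 13, mults)
```

The count is bounded by twice the bit length of the exponent, so it grows with the exponent even when the base stays fixed.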

What is the most efficient algorithm to give out prime numbers, up to very high values (all a 32bit machine can handle)

My program is supposed to loop forever and give out via print every prime number it comes along. Doing this in x86-NASM btw.
My first attempt divided it by EVERY previous number until either the Carry is 0 (not a prime) or the result is 1.
My second attempt improved this by only testing every second number, so only odd numbers.
The third thing I am currently implementing is trying to not divide by EVERY previous number but rather only by the numbers up to half of the candidate, since you can't get a whole number by dividing a number by something bigger than its half.
Another thing that might help is to test it with only odd numbers, like the sieve of eratosthenes, but only excluding even numbers.
Anyway, if there is another thing I can do, all help welcome.
If you need to test a handful of numbers for primality, possibly only one, the AKS primality test is polynomial in the length of n.
If you want to find a very big prime, of cryptographic size, then select a random range of odd numbers and sieve out all the numbers whose factors are small primes (e.g. less than or equal to 64K-240K), then test the remaining numbers for primality.
If you want to find the primes in a range then use a sieve. The sieve of Eratosthenes is very easy to implement but runs slower and requires more memory.
The sieve of Atkin is faster, and the wheel sieve requires far less memory.
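For reference, a minimal sieve of Eratosthenes might look like the following. This is a Python sketch rather than NASM, it uses one byte per flag for readability, and the name primes_up_to is made up for this example.

```python
def primes_up_to(limit: int) -> list:
    """Return all primes <= limit using the sieve of Eratosthenes."""
    if limit < 2:
        return []
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0] = is_prime[1] = 0
    i = 2
    while i * i <= limit:
        if is_prime[i]:
            # Cross out multiples of i, starting at i*i
            # (smaller multiples were handled by smaller primes).
            count = len(range(i * i, limit + 1, i))
            is_prime[i * i::i] = bytes(count)
        i += 1
    return [n for n, flag in enumerate(is_prime) if flag]


print(primes_up_to(30))
```

The inner step uses no division at all, only strided writes, which is one reason a sieve beats repeated trial division when every prime in a range is wanted.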
The size of the problem is exponential if approached naively, so before micro-optimising it is mandatory to macro-optimise first.
More or less all prime number algorithms require familiarity with number theory, so pay particular attention to the group/ring/field the algorithm is working on, because mathematicians write operations like inversion or multiplication with the same symbol for all algebraic structures.
Once you have a fast algorithm, you can start micro-optimising.
At this level it's really impossible to answer how to proceed with such optimisations.

n/2 bit multiplication on an n bit cpu

I was looking at an algorithm which multiplied 2 n bit numbers with 3 multiplications of n/2 bits. This algorithm is considered efficient. While I understand that space is obviously conserved, if I were working on an n bit machine, how would n/2 bit multiplications be better? Those n/2 bit multiplications would be converted to n bit multiplications because the CPU can only understand n bit numbers.
Thank you in advance.
Algorithms like Karatsuba multiplication or Toom-Cook are typically used in the implementation of "bignums" -- computation with numbers of unlimited size. Generally speaking, the more sophisticated the algorithm, the larger numbers need to be to make it worthwhile doing.
There are a variety of bignum packages; one of the more commonly used ones is the GNU Multiple Precision library (GMP), which includes a large number of different multiplication algorithms, selecting the appropriate one based on the length of the multiplicands. (According to Wikipedia, the Schönhage–Strassen algorithm, which is based on the fast Fourier transform algorithm, isn't used until the multiplicands reach 33,000 decimal digits. Such computations are relatively rare, but when you have to do such a computation, you probably care about it being done efficiently.)
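The "3 multiplications of n/2 bits" scheme from the question is Karatsuba's. A minimal recursive sketch on Python integers, assuming non-negative inputs and an arbitrary cutoff of 64 bits below which the built-in multiply is used, could look like this (karatsuba and cutoff_bits are names chosen for this example, not taken from any library):

```python
def karatsuba(x: int, y: int, cutoff_bits: int = 64) -> int:
    """Multiply non-negative integers with three half-size products."""
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y                      # small enough for one hardware-sized step
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x1, x0 = x >> half, x & mask          # x = x1 * 2**half + x0
    y1, y0 = y >> half, y & mask          # y = y1 * 2**half + y0
    z2 = karatsuba(x1, y1, cutoff_bits)
    z0 = karatsuba(x0, y0, cutoff_bits)
    # One extra product recovers the cross terms x1*y0 + x0*y1.
    z1 = karatsuba(x0 + x1, y0 + y1, cutoff_bits) - z2 - z0
    return (z2 << (2 * half)) + (z1 << half) + z0


a, b = 3 ** 500, 7 ** 400
print(karatsuba(a, b) == a * b)
```

The saving only shows up once the operands span many machine words, which is why the recursion bottoms out at the word size instead of splitting a single word in half.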

Why is division more expensive than multiplication?

I am not really trying to optimize anything, but I remember hearing this from programmers so often that I took it as truth. After all, they are supposed to know this stuff.
But I wonder why is division actually slower than multiplication? Isn't division just a glorified subtraction, and multiplication is a glorified addition? So mathematically I don't see why going one way or the other has computationally very different costs.
Can anyone please clarify the reason/cause of this so I know, instead of what I heard from other programmers that I asked before, which is: "because".
A CPU's ALU (Arithmetic-Logic Unit) executes algorithms, though they are implemented in hardware. Classic multiplication algorithms include the Wallace tree and the Dadda tree. More information is available here. More sophisticated techniques are available in newer processors. Generally, processors strive to parallelize bit-pair operations in order to minimize the clock cycles required. Multiplication algorithms can be parallelized quite effectively (though more transistors are required).
Division algorithms can't be parallelized as efficiently. The most efficient division algorithms are quite complex (the Pentium FDIV bug demonstrates the level of complexity). Generally, they require more clock cycles per bit. If you're after more technical details, here is a nice explanation from Intel. Intel actually patented their division algorithm.
But I wonder why is division actually slower than multiplication? Isn't division just a glorified subtraction, and multiplication is a glorified addition?
The big difference is that in a long multiplication you just need to add up a bunch of numbers after shifting and masking. In a long division you have to test for overflow after each subtraction.
Let's consider a long multiplication of two n bit binary numbers.
shift (no time)
mask (constant time)
add (naively looks like time proportional to n²)
But if we look closer it turns out we can optimise the addition by using two tricks (there are further optimisations but these are the most important).
We can add the numbers in groups rather than sequentially.
Until the final step we can add three numbers to produce two rather than adding two to produce one. While adding two numbers to produce one takes time proportional to n, adding three numbers to produce two can be done in constant time because we can eliminate the carry chain.
So now our algorithm looks like
shift (no time)
mask (constant time)
add numbers in groups of three to produce two until there are only two left (time proportional to log(n))
perform the final addition (time proportional to n)
In other words we can build a multiplier for two n bit numbers in time roughly proportional to n (and space roughly proportional to n²). As long as the CPU designer is willing to dedicate the logic multiplication can be almost as fast as addition.
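The "three numbers in, two numbers out" step is usually called a carry-save adder. Here is a software sketch of it and of a multiplier built from it; in hardware each bit position works independently, which Python's bitwise operators mimic. The names carry_save and multiply are chosen for this example.

```python
def carry_save(a: int, b: int, c: int):
    """Compress three addends into two with no carry propagation."""
    s = a ^ b ^ c                                  # per-bit sum
    carry = ((a & b) | (a & c) | (b & c)) << 1     # per-bit carry, moved up one place
    return s, carry                                # s + carry == a + b + c


def multiply(x: int, y: int) -> int:
    """Long multiplication of non-negative ints via carry-save reduction."""
    # shift and mask: one partial product per set bit of y
    partials = [x << i for i in range(y.bit_length()) if (y >> i) & 1]
    while len(partials) > 2:
        reduced = []
        for i in range(0, len(partials) - 2, 3):
            reduced.extend(carry_save(*partials[i:i + 3]))
        leftover = len(partials) % 3
        reduced.extend(partials[len(partials) - leftover:])
        partials = reduced
    return sum(partials)          # the single carry-propagating addition


print(multiply(12345, 6789) == 12345 * 6789)
```

Each pass of the while loop shrinks the list by roughly a third, which is where the log(n) depth in the list above comes from.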
In long division we need to know whether each subtraction overflowed before we can decide what inputs to use for the next one. So we can't apply the same parallelising tricks as we can with long multiplication.
There are methods of division that are faster than basic long division but still they are slower than multiplication.
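For comparison, basic binary long division can be sketched as below. It is a plain restoring division, not what any particular CPU implements, and the name long_divide is made up. Note how each iteration needs the remainder left by the previous one before it can decide anything.

```python
def long_divide(n: int, d: int, bits: int = 32):
    """Restoring division of an unsigned `bits`-wide n by d > 0.
    Returns (quotient, remainder)."""
    q, r = 0, 0
    for i in reversed(range(bits)):
        r = (r << 1) | ((n >> i) & 1)   # bring down the next bit of n
        if r >= d:                      # depends on r from the previous step
            r -= d
            q |= 1 << i
    return q, r


print(long_divide(1000, 7))   # (142, 6)
```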

Simple deterministic primality testing for small numbers

I am aware that there are a number of primality testing algorithms used in practice (Sieve of Eratosthenes, Fermat's test, Miller-Rabin, AKS, etc). However, they are either slow (e.g. sieve), probabilistic (Fermat and Miller-Rabin), or relatively difficult to implement (AKS).
What is the best deterministic solution to determine whether or not a number is prime?
Note that I am primarily (pun intended) interested in testing against numbers on the order of 32 (and maybe 64) bits. So a robust solution (applicable to larger numbers) is not necessary.
Up to ~2^30 you could brute force with trial-division.
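A brute-force trial-division test is only a few lines. The sketch below skips multiples of 2 and 3 by stepping through candidates of the form 6k-1 and 6k+1, which is a common refinement and my own choice here, not something the answer specifies.

```python
def is_prime_trial(n: int) -> bool:
    """Deterministic primality test by trial division up to sqrt(n)."""
    if n < 2:
        return False
    if n < 4:
        return True                 # 2 and 3
    if n % 2 == 0 or n % 3 == 0:
        return False
    f = 5
    while f * f <= n:               # only divisors up to sqrt(n) are needed
        if n % f == 0 or n % (f + 2) == 0:
            return False
        f += 6
    return True


print([n for n in range(50) if is_prime_trial(n)])
```

For a 30-bit input the loop runs at most a few thousand times, which is why brute force is still practical at that size.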
Up to 3.4*10^14, Rabin-Miller with the first 7 primes has been proven to be deterministic.
Above that, you're on your own. There's no known sub-cubic deterministic algorithm.
EDIT : I remembered this, but I didn't find the reference until now:
http://reference.wolfram.com/legacy/v5_2/book/section-A.9.4
PrimeQ first tests for divisibility using small primes, then uses the
Miller-Rabin strong pseudoprime test base 2 and base 3, and then uses
a Lucas test.
As of 1997, this procedure is known to be correct only for n < 10^16,
and it is conceivable that for larger n it could claim a composite
number to be prime.
So if you implement Rabin-Miller and Lucas, you're good up to 10^16.
If I didn't care about space, I would try precomputing all the primes below 2^32 (~4e9/ln(4e9)*4 bytes, which is less than 1GB), store them in memory and use a binary search. You can also play with memory mapping of the file containing these precomputed primes (pros: faster program start; cons: it will be slow until all the needed data is actually in memory).
If you can factor n-1 it is easy to prove that n is prime, using a method developed by Edouard Lucas in the 19th century. You can read about the algorithm at Wikipedia, or look at my implementation of the algorithm at my blog. There are variants of the algorithm that require only a partial factorization.
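The n-1 method can be sketched as follows: n > 2 is prime if some base a satisfies a^(n-1) = 1 (mod n) while a^((n-1)/q) != 1 (mod n) for every prime q dividing n-1. The function below is an illustration only; lucas_certify and max_base are hypothetical names, and the caller is assumed to supply the complete list of distinct prime factors of n-1.

```python
def lucas_certify(n: int, prime_factors, max_base: int = 100):
    """Lucas n-1 test for n > 2.
    prime_factors: all distinct primes dividing n - 1 (assumed correct).
    Returns True (proved prime), False (proved composite),
    or None if no suitable base below max_base was found."""
    for a in range(2, min(n, max_base)):
        if pow(a, n - 1, n) != 1:
            return False            # fails Fermat's little theorem
        if all(pow(a, (n - 1) // q, n) != 1 for q in prime_factors):
            return True             # a has order n - 1, so n is prime
    return None


print(lucas_certify(97, [2, 3]))      # 96 = 2**5 * 3
print(lucas_certify(561, [2, 5, 7]))  # 560 = 2**4 * 5 * 7, a Carmichael number
```

The hard part in practice is the factorization of n-1, which this sketch takes as given.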
If the factorization of n-1 is difficult, the best method is the elliptic curve primality proving algorithm, but that requires more math, and more code, than you may be willing to write. That would be much faster than AKS, in any case.
Are you sure that you need an absolute proof of primality? The Baillie-Wagstaff algorithm is faster than any deterministic primality prover, and there are no known counter-examples.
If you know that n will never exceed 2^64 then strong pseudo-prime tests using the first twelve primes as bases are sufficient to prove n prime. For 32-bit integers, strong pseudo-prime tests to the three bases 2, 7 and 61 are sufficient to prove primality.
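A compact version of the strong pseudo-prime (Miller-Rabin) test with these fixed base sets might look like the sketch below. The function names are made up, and the base sets are simply the ones quoted in the answer above.

```python
BASES_32 = (2, 7, 61)
BASES_64 = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)   # first twelve primes


def is_strong_probable_prime(n: int, a: int) -> bool:
    """Strong pseudo-prime test of odd n > 2 to base a (a not divisible by n)."""
    d, s = n - 1, 0
    while d % 2 == 0:               # write n - 1 as d * 2**s with d odd
        d //= 2
        s += 1
    x = pow(a, d, n)
    if x == 1 or x == n - 1:
        return True
    for _ in range(s - 1):
        x = x * x % n
        if x == n - 1:
            return True
    return False


def is_prime_fixed_bases(n: int, bases=BASES_32) -> bool:
    """Deterministic for n < 2**32 with BASES_32, or n < 2**64 with BASES_64."""
    if n < 2:
        return False
    for p in bases:                 # every base in these sets is itself prime
        if n % p == 0:
            return n == p
    return all(is_strong_probable_prime(n, a) for a in bases)


print(is_prime_fixed_bases(2147483647))            # 2**31 - 1
print(is_prime_fixed_bases(2047))                  # 23 * 89, fools base 2 alone
```

Python's three-argument pow does the modular exponentiation; in a fixed-width language the squaring step needs a double-width product.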
Use the Sieve of Eratosthenes to pre-calculate as many primes as you have space for. You can fit in a lot at one bit per number and halve the space by only sieving odd numbers (treating 2 as a special case).
For numbers from Sieve.MAX_NUM up to the square of Sieve.MAX_NUM you can use trial division because you already have the required primes listed. Judicious use of Miller-Rabin on larger unfactored residues can speed up the process a lot.
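The sieve-plus-trial-division combination could be sketched like this. It stores one byte per odd number rather than one bit, purely to keep the code short, and the names odd_sieve, is_prime_by_table and max_num are stand-ins for the Sieve.MAX_NUM idea above, not an existing API.

```python
def odd_sieve(max_num: int) -> list:
    """Primes <= max_num, sieving odd numbers only (2 is a special case)."""
    if max_num < 2:
        return []
    size = (max_num - 1) // 2             # index i stands for the odd number 2*i + 3
    flags = bytearray([1]) * size
    i = 0
    while (2 * i + 3) ** 2 <= max_num:
        if flags[i]:
            p = 2 * i + 3
            start = (p * p - 3) // 2      # index of p*p
            flags[start::p] = bytes(len(range(start, size, p)))
        i += 1
    return [2] + [2 * i + 3 for i in range(size) if flags[i]]


def is_prime_by_table(n: int, primes: list, max_num: int) -> bool:
    """Trial division by sieved primes; valid for n <= max_num**2."""
    if n > max_num * max_num:
        raise ValueError("n is beyond the square of the sieve limit")
    if n < 2:
        return False
    for p in primes:
        if p * p > n:
            break
        if n % p == 0:
            return False
    return True


primes = odd_sieve(100)
print(is_prime_by_table(9973, primes, 100), is_prime_by_table(9991, primes, 100))
```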
For numbers larger than that I would use one of the probabilistic tests. Miller-Rabin is good and, if repeated a few times, can give results that are less likely to be wrong than a hardware failure in the computer you are running on.

Resources