Fastest way to compute (n + 1)^j from (n^j) - algorithm

I need to compute 0^j, 1^j, ..., k^j for some very large k and j (both on the order of a few million). I am using GMP to handle the big integer numbers (yes, I need integers because I need full precision). Now, I wonder: once I have gone through the effort of computing n^j, isn't there a way to speed up the computation of (n + 1)^j, instead of starting from scratch?
Here is the algorithm I am currently using to compute the power:
#include <gmpxx.h>

// Binary exponentiation (square-and-multiply): computes b^e exactly.
mpz_class pow(unsigned long int b, unsigned long int e)
{
    mpz_class res = 1;
    mpz_class m = b;
    while (e)
    {
        if (e & 1)
        {
            res *= m;   // this bit of e is set, so multiply in b^(2^i)
        }
        e >>= 1;
        m *= m;         // m = b^(2^(i+1))
    }
    return res;
}
As you can see, every time I start from scratch, and it takes a lot of time.

To compute n^j, why not find at least one factor of n, say k, and use n^j = k^j * (n/k)^j? By the time n^j is being computed, both k^j and (n/k)^j should already be known.
However, finding a factor of n potentially takes O(sqrt(n)) time, whereas n^j can be computed independently in O(log(j)) multiplications by exponentiation by squaring, as in the code above.
So you could have a mix of the above, depending on which is larger:
If n is much smaller than log(j), compute n^j by factorization.
Whenever n^j is known, compute {(2*n)^j, (3*n)^j, ..., ((n-1)*n)^j, (n*n)^j} and keep them in a lookup table.
If n is larger than log(j) and a ready-made value as above is not available, use the logarithmic (squaring) method and then compute the other related powers as above.
If n is a pure power of 2 (a possible constant-time check), compute the jth power by shifting and calculate the related products.
If n is even (a constant-time check again), use the factorization method and compute the associated products.
The above should make it quite fast. For example, the identification of even numbers by itself should convert half of the power computations into single multiplications. There are many more rules of thumb regarding factorization (especially divisibility by 3, 7, etc.) that could reduce the computation further. A sketch combining the sieve and factor-reuse ideas follows below.
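Here is one way the factor-reuse idea could be wired up with GMP, as a sketch only: it assumes a smallest-prime-factor sieve over 0..k (the names all_powers and spf are mine), and it falls back to the pow() routine from the question only for primes.

#include <gmpxx.h>
#include <vector>

// powers[n] = n^j for n = 0..k; composite n reuse the powers of their
// smallest prime factor, only primes use binary exponentiation.
std::vector<mpz_class> all_powers(unsigned long k, unsigned long j)
{
    // spf[n] = smallest prime factor of n, via a standard sieve.
    std::vector<unsigned long> spf(k + 1, 0);
    for (unsigned long i = 2; i <= k; ++i)
        if (spf[i] == 0)                        // i is prime
            for (unsigned long m = i; m <= k; m += i)
                if (spf[m] == 0) spf[m] = i;

    std::vector<mpz_class> powers(k + 1);
    powers[0] = (j == 0) ? 1 : 0;               // 0^j
    if (k >= 1) powers[1] = 1;                  // 1^j
    for (unsigned long n = 2; n <= k; ++n) {
        unsigned long p = spf[n];
        if (p == n)                             // n is prime: pow() from the question
            powers[n] = pow(n, j);
        else                                    // n composite: n^j = p^j * (n/p)^j
            powers[n] = powers[p] * powers[n / p];
    }
    return powers;
}

Note that the table holds k+1 numbers of roughly j*log2(n) bits each, so for j and k in the millions you would likely stream or write out the results rather than keep them all in memory.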

You may want to use the binomial expansion of (n+1)^j as n^j + j*n^(j-1) + j(j-1)/2 * n^(j-2) + ... + 1, memoize the lower powers of n already computed, and reuse them to compute (n+1)^j with O(j) additions and multiplications. If you compute the coefficients j, j*(j-1)/2, ... incrementally while adding each term, that can be done in O(j) as well; a sketch follows.
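A minimal GMP sketch of that update, assuming the lower powers n^0, ..., n^j of the current base are already stored in a vector (the helper name next_base_power is mine); the binomial coefficient C(j, i) is updated incrementally, and its division step is exact.

#include <gmpxx.h>
#include <vector>

// Compute (n+1)^j as SUM_{i=0..j} C(j, i) * n^i, given pows[i] = n^i.
mpz_class next_base_power(const std::vector<mpz_class>& pows, unsigned long j)
{
    mpz_class coeff = 1;   // C(j, 0)
    mpz_class sum = 0;
    for (unsigned long i = 0; i <= j; ++i)
    {
        sum += coeff * pows[i];
        if (i < j)
        {
            // C(j, i+1) = C(j, i) * (j - i) / (i + 1); the division is exact.
            coeff *= (j - i);
            coeff /= (i + 1);
        }
    }
    return sum;
}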

What is the best approach for computing logarithm of an integer x base 2 approximated to the greatest integer less than or equal to it?

What is the best amortized time complexity to compute floor(log(x)) amongst the algorithms that find floor(log(x)) for base 2?
There are many different algorithms for computing logarithms, each of which represents a different tradeoff of some sort. This answer surveys a variety of approaches and some of the tradeoffs involved.
Approach 1: Iterated Multiplication
One simple approach for computing ⌊logb n⌋ is to compute the sequence b^0, b^1, b^2, etc. until we find a value greater than n. At that point, we can stop, and return the exponent just before this. The code for this is fairly straightforward:
x = 0;     # Exponent
bX = 1;    # b^Exponent
while (bX <= n) {
    x++;
    bX *= b;
}
return x - 1;
How fast is this? Notice that the loop counts up x = 0, x = 1, x = 2, etc., doing one multiplication per iteration, until b^x first exceeds n, which happens after roughly ⌊logb n⌋ steps. If we assume all the integers we're dealing with fit into individual machine words - say, we're working with int or long or something like that - then each multiply takes time O(1) and the overall runtime is O(logb n), with a memory usage of O(1).
Approach 2: Repeated Squaring
There's an old interview question that goes something like this. I have a number n, and I'm not going to tell you what it is. You can make queries of the form "is x equal to n, less than n, or greater than n?," and your goal is to use the fewest queries to figure out what n is. Assuming you have literally no idea what n can be, one reasonable approach works like this: guess the values 1, 2, 4, 8, 16, 32, ..., 2^k, ..., until you overshoot n. At that point, use a binary search on the range you just found to discover what n is. This runs in time O(log n), since after computing 2^(1 + log2 n) = 2n you'll overshoot n, and after that you're binary searching over a range of size n for a total runtime of O(log n).
Finding logarithms, in a sense, kinda sorta matches this problem. We have a number n written as b^x for some unknown x, and we want to find x. Using the above strategy as a starting point, we can therefore compute b^(2^0), b^(2^1), b^(2^2), etc. until we overshoot b^x. From there, we can run a secondary binary search to figure out the exact exponent required.
We can compute the series of values b^(2^k) by using the fact that
b^(2^(k+1)) = b^(2 · 2^k) = (b^(2^k))^2
and find a value that overshoots as follows:
x = 1;     # exponent
bX = b;    # b^x
while (bX <= n) {
    bX *= bX;   # bX = bX^2
    x *= 2;
}
# Overshot, now do the binary search
The problem is how to do that binary search to figure things out. Specifically, we know that b^x for the final (power-of-two) value of x is too big, but we don't know by how much. And unlike the "guess the number" game, binary searching over the exponent is a bit tricky.
One cute solution to this is based on the idea that if x is the value that we're looking for, then we can write x as a series of bits in binary. For example, let's write x = a_(m-1) 2^(m-1) + a_(m-2) 2^(m-2) + ... + a_1 2^1 + a_0 2^0. Then
b^x = b^(a_(m-1) 2^(m-1) + a_(m-2) 2^(m-2) + ... + a_1 2^1 + a_0 2^0)
    = b^(a_(m-1) 2^(m-1)) · b^(a_(m-2) 2^(m-2)) · ... · b^(a_0 2^0)
In other words, we can try to discover what b^x is by building up x one bit at a time. To do so, as we compute the values b^1, b^2, b^4, b^8, etc., we can write down the values we discover. Then, once we overshoot, we can try multiplying them in and see which ones should be included and which should be excluded. Here's what that looks like:
x = 0;           # index of the largest stored power
bX = b;          # b^(2^x)
powers = [b];    # powers[i] = b^(2^i)
exps = [1];      # exps[i] = 2^i
while (bX <= n) {
    bX *= bX;                 # bX = bX^2
    x++;
    powers += bX;             # append b^(2^x)
    exps += exps[x - 1] * 2;  # append 2^x
}
# Overshot, now recover the bits of the exponent.
resultExp = 1;   # b^result
result = 0;
while (x >= 0) {
    # If including this power doesn't overshoot, bit x is part of the
    # binary representation of the exponent we're looking for.
    if (resultExp * powers[x] <= n) {
        resultExp *= powers[x];
        result += exps[x];
    }
    x--;
}
return result;
This is certainly a more involved approach, but it is faster. Since the value we're looking for is ⌊logb n⌋ and we're essentially using the solution from the "guess the number" game to figure out what x is, the number of multiplications is O(log logb n), with a memory usage of O(log logb n) to hold the intermediate powers. This is exponentially faster than the previous solution!
Approach 3: Zeckendorf Representations
There's a slight modification on the previous approach that keeps the runtime of O(log logb n) but drops the auxiliary space usage to O(1). The idea is that, instead of writing the exponent in binary using the regular system, we write the number out using Zeckendorf's theorem, which is a binary-like number system based on the Fibonacci sequence. The advantage is that instead of having to store the intermediate powers, we can use the fact that any two consecutive Fibonacci numbers are sufficient to let you compute the next or previous Fibonacci number, allowing us to regenerate the powers of b as needed. A sketch of that idea in C++ is below.
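This is only a rough sketch of that idea, not the implementation referenced above: it assumes b >= 2 and that n fits in an unsigned 64-bit word, the names (floor_log_zeckendorf, fPrev, pCur, ...) are mine, and overflow is avoided by comparing with a division instead of multiplying. Only two consecutive Fibonacci exponents and their powers of b are kept at any time, so the auxiliary space is O(1).

#include <cstdint>

std::uint64_t floor_log_zeckendorf(std::uint64_t n, std::uint64_t b)
{
    if (n < b) return 0;          // b^1 already overshoots
    if (b > n / b) return 1;      // b^2 > n, so the answer is 1

    // Fibonacci exponents F(k-1) = 1, F(k) = 2 and the matching powers of b.
    std::uint64_t fPrev = 1, fCur = 2;
    std::uint64_t pPrev = b, pCur = b * b;

    // Climb: while b^(F(k-1) + F(k)) = b^F(k+1) still fits under n, step up.
    while (pCur <= n / pPrev) {
        std::uint64_t pNext = pCur * pPrev;   // b^F(k+1); <= n, so no overflow
        std::uint64_t fNext = fCur + fPrev;
        fPrev = fCur;  pPrev = pCur;
        fCur = fNext;  pCur = pNext;
    }

    // Descend: greedily add Fibonacci pieces to the exponent being built.
    std::uint64_t result = 0;     // exponent so far
    std::uint64_t acc = 1;        // b^result
    while (true) {
        if (pCur <= n / acc) {    // acc * b^F(k) <= n, so F(k) fits
            acc *= pCur;
            result += fCur;
        }
        if (fCur == 1) break;     // the smallest Fibonacci number was just handled
        // Step the pair down: F(k-2) = F(k) - F(k-1), b^F(k-2) = b^F(k) / b^F(k-1).
        std::uint64_t fPrev2 = fCur - fPrev;
        std::uint64_t pPrev2 = pCur / pPrev;   // exact division of powers of b
        fCur = fPrev;   pCur = pPrev;
        fPrev = fPrev2; pPrev = pPrev2;
    }
    return result;
}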
Approach 4: Bitwise Iterated Multiplication
In some cases, you need to find logarithms where the log base is two. In that case, you can take advantage of the fact that numbers on a computer are represented in binary and that multiplications and divisions by two correspond to bit shifts.
For example, let's take the iterated multiplication approach from before, where we computed larger and larger powers of b until we overshot. We can do that same technique using bitshifts, and it's much faster:
x = 0;   # Exponent
while ((1 << x) <= n) {
    x++;
}
return x - 1;
This approach still runs in time O(log n), as before, but is probably faster implemented this way than with multiplications because the CPU can do bit shifts much more quickly.
Approach 5: Bitwise Binary Search
The base-two logarithm of a number, written in binary, is equivalent to the position of the most-significant bit of that number. To find that bit, we can use a binary search technique somewhat reminiscent of Approach 2, though done faster because the machine can process multiple bits in parallel in a single instruction. Basically, as before, we generate the sequence 2^(2^0), 2^(2^1), etc. until we overshoot the number, giving an upper bound on how high the highest bit can be. From there, we do a binary search to find the highest 1 bit. Here's what that looks like:
x = 1;
while ((1 << x) <= n) {
    x *= 2;
}
# We've overshot the high-order bit. Do a binary search to find it.
low = 0;
high = x;
while (low < high) {
    mid = (low + high) / 2;
    # Form a bitmask with 1s up to and including bit number mid.
    # This can be done by computing 2^(mid+1) - 1.
    mask = (1 << (mid + 1)) - 1
    # If n doesn't fit under the mask, its highest bit is above mid.
    if (n > mask) {
        low = mid + 1
    }
    # Otherwise, the highest bit is at position mid or below.
    else {
        high = mid
    }
}
return low
This approach runs in time O(log log n), since the binary search takes time logarithmic in the quantity being searched for and the quantity we're searching for is O(log n). It uses O(1) auxiliary space.
Approach 6: Magical Word-Level Parallelism
The speedup in Approach 5 is largely due to the fact that we can test multiple bits in parallel using a single bitwise operation. By doing some truly amazing things with machine words, it's possible to harness this parallelism to find the most-significant bit in a number in time O(1) using only basic arithmetic operations and bit-shifts, and to do so in a way where the runtime is completely independent of the size of a machine word (e.g. the algorithm works equally quickly on a 16-bit, 32-bit, and 64-bit machine). The techniques involved are fairly complex and I will confess that I had no idea this was possible to do until recently, when I learned the technique while teaching an advanced data structures course.
Summary
To summarize, here are the approaches listed, their time complexity, and their space complexity.
Approach Which Bases? Time Complexity Space Complexity
--------------------------------------------------------------------------
Iter. Multiplication Any O(log_b n) O(1)
Repeated Squaring Any O(log log_b n) O(log log_b n)
Zeckendorf Logarithm Any O(log log_b n) O(1)
Bitwise Multiplication 2 O(log n) O(1)
Bitwise Binary Search 2 O(log log n) O(1)
Word-Level Parallelism 2 O(1) O(1)
There are many other algorithms I haven't mentioned here that are worth exploring. Some algorithms work by segmenting machine words into blocks of some fixed size, precomputing the position of the first 1 bit in each block, then testing one block at a time. These approaches have runtimes that depend on the size of the machine word, and (to the best of my knowledge) none of them are asymptotically faster than the approaches I've outlined here. Other approaches work by using the fact that some processors have instructions that immediately output the position of the most significant bit in a number, or work by using floating-point hardware. Those are also interesting and fascinating; be sure to check them out!
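As a concrete illustration of that last route (not spelled out in the text above), on a compiler with C++20 <bit> or the GCC/Clang builtins the whole computation collapses to one or two instructions:

#include <bit>       // std::bit_width (C++20)
#include <cstdint>

// Assumes n >= 1. std::bit_width(n) is 1 + floor(log2 n), and
// __builtin_clzll is a GCC/Clang builtin counting leading zero bits.
unsigned floor_log2(std::uint64_t n)
{
    return static_cast<unsigned>(std::bit_width(n)) - 1;
    // or, with GCC/Clang: return 63u - __builtin_clzll(n);
}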
Another area worth exploring is when you have arbitrary-precision integers. There, the costs of multiplications, divisions, shifts, etc. are not O(1), and the relative costs of these algorithms change. This is definitely worth exploring in more depth if you're curious!
The code included here is written in pseudocode because it's designed mostly for exposition. In a real implementation, you'd need to worry about overflow, handling the case where the input is negative or zero, etc. Just FYI. :-)
Hope this helps!

Determine time complexity for unpredictive non-linear iterative algorithm

I don't know whether I used the word "unpredictive" correctly or not. But here's the problem:
I have a rectangular piece of paper of length a and breadth b. I will keep cutting squares from it, each of side equal to min(a, b), until the last square is of unit length. Determine the number of squares I can cut.
Here's my algorithm :
#include <iostream>
#include <algorithm>   // std::min
using namespace std;

int main()
{
    long long a, b, temp, small, large, res;
    cin >> a >> b;
    res = 0;
    small = min(a, b);
    large = a + b - small;
    while (small > 0 && small != large)
    {
        res = res + large / small;   // cut off large/small squares of side `small`
        temp = large % small;
        large = small;
        small = temp;
    }
    cout << res;
    return 0;
}
I am confused about how to calculate the time complexity in this case, as max(a, b) decreases to 1 in a non-uniform fashion, depending on the initial values of a and b. The best case would definitely be when either or both of them is 1 already. The worst case would be, I guess, when both are prime. Please help me analyze the time complexity.
This algorithm is very similar to the Euclidean algorithm for computing the greatest common divisor. Recall that that algorithm works by:
Start with two numbers a, b, and assume without loss of generality that a >= b. If a == b then stop.
In the next round, the two numbers are b and a % b instead.
Now consider your algorithm. It's the same, except it's a - b instead. But this will actually do the same thing if a < 2 * b. And if a < k * b, then in the next round it changes only by a multiple of b, so after at most k rounds, it will converge to a % b. So this is just a slower version of the Euclidean algorithm.
The Euclidean algorithm is quite fast: since it's repeated division, the number of rounds is not more than the number of digits.
Edit: To expand on the last part:
To analyze the time complexity, a first question is, how many rounds does it take.
An easy way to start is: if a and b have n and m digits in their binary representation, then there can't be more than n + m rounds. Because, as long as b is at least two in a given round, we will be dividing one of the numbers by two in that round, so the result will have one fewer digit. If b is one, then this is the last round.
A second question is, how much time does it take to do a single round.
If you are satisfied with "the running time is at most polynomial in the number of digits", then this is now clear, since you can easily do division in polynomial in the number of digits.
I'm not actually sure what the tightest possible analysis is. You might be able to do some kind of win-win analysis to improve on this; I'm almost sure this has been studied, but I don't know a reference, sorry.

Calculation of euler phi function

int phi(int n) {
    int result = n;
    for (int i = 2; i * i <= n; ++i)
        if (n % i == 0) {           // i is a prime factor of n
            while (n % i == 0)
                n /= i;             // strip every copy of i from n
            result -= result / i;   // multiply result by (1 - 1/i)
        }
    if (n > 1)                      // leftover n is a prime factor > sqrt(original n)
        result -= result / n;
    return result;
}
I saw the above implementation of the Euler phi function, which is O(sqrt n). I don't get the point of using i*i <= n in the for loop, or the need to change n. It is said (in a linked article, in Russian) that it can be done in time much smaller than O(sqrt n). How?
i*i <= n is the same as i <= sqrt(n), which is why the iteration only runs up to about sqrt(n).
Using the straight definition of the Euler totient function, you are supposed to find the prime numbers that divide n.
The function is a straightforward implementation of integer factorization by trial division, except that instead of reporting the factors as it finds them, the function uses the factors to calculate phi. Calculation of phi can be done in time less than O(sqrt n) by using a better algorithm to find the factors; the best way to do that depends on the magnitude of n.
If the biggest number (N say) that you will want the totient of is small enough that you can have a table of size N in memory, then you can do a lot better, per evaluation, at the cost of having to build a table before any evaluations.
One approach would be to build a table of primes first, and then instead of using trial division by every integer at most sqrt(n), use trial division by every prime at most sqrt(n).
You could improve on this by building, instead of a table of primes, a table that gives (for each integer 2..N) the smallest prime that divides the number. A simple modification of the usual Sieve of Eratosthenes can be used to build such a table. Then to compute the totient of a number you use the table to find the smallest prime dividing the number (and accumulate that into you answer), then divide the number by the table entry, use the table to find the smallest prime that divides that, and so on.
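A minimal sketch of that table-based approach (the names build_spf and phi_with_table are mine), assuming all queries satisfy 2 <= n <= N:

#include <vector>

// spf[x] = smallest prime factor of x, filled by a slightly modified
// Sieve of Eratosthenes.
std::vector<int> build_spf(int N)
{
    std::vector<int> spf(N + 1, 0);
    for (int i = 2; i <= N; ++i) {
        if (spf[i] == 0) {                        // i is prime
            for (int j = i; j <= N; j += i)
                if (spf[j] == 0) spf[j] = i;      // record the smallest prime hit
        }
    }
    return spf;
}

// Totient via the table: repeatedly peel off the smallest prime factor.
int phi_with_table(int n, const std::vector<int>& spf)
{
    int result = n;
    while (n > 1) {
        int p = spf[n];
        result -= result / p;          // multiply result by (1 - 1/p)
        while (n % p == 0) n /= p;     // remove every copy of p
    }
    return result;
}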

Time complexity of Euclid's Algorithm

I am having difficulty deciding what the time complexity of Euclid's greatest common divisor algorithm is. This algorithm in pseudo-code is:
function gcd(a, b)
    while b ≠ 0
        t := b
        b := a mod b
        a := t
    return a
It seems to depend on a and b. My thinking is that the time complexity is O(a % b). Is that correct? Is there a better way to write that?
One trick for analyzing the time complexity of Euclid's algorithm is to follow what happens over two iterations:
a', b' := a % b, b % (a % b)
Now a and b will both decrease, instead of only one, which makes the analysis easier. You can divide it into cases:
Tiny A: 2a <= b
Tiny B: 2b <= a
Small A: 2a > b but a < b
Small B: 2b > a but b < a
Equal: a == b
Now we'll show that every single case decreases the total a+b by at least a quarter:
Tiny A: b % (a % b) < a and 2a <= b, so b is decreased by at least half, so a+b decreased by at least 25%
Tiny B: a % b < b and 2b <= a, so a is decreased by at least half, so a+b decreased by at least 25%
Small A: b will become b-a, which is less than b/2, decreasing a+b by at least 25%.
Small B: a will become a-b, which is less than a/2, decreasing a+b by at least 25%.
Equal: a+b drops to 0, which is obviously decreasing a+b by at least 25%.
Therefore, by case analysis, every double-step decreases a+b by at least 25%. There's a maximum number of times this can happen before a+b is forced to drop below 1. The total number of steps (S) until we hit 0 must satisfy (4/3)^S <= A+B. Now just work it:
(4/3)^S <= A+B
S <= lg[4/3](A+B)
S is O(lg[4/3](A+B))
S is O(lg(A+B))
S is O(lg(A*B)) //because A*B asymptotically greater than A+B
S is O(lg(A)+lg(B))
//Input size N is lg(A) + lg(B)
S is O(N)
So the number of iterations is linear in the number of input digits. For numbers that fit into cpu registers, it's reasonable to model the iterations as taking constant time and pretend that the total running time of the gcd is linear.
Of course, if you're dealing with big integers, you must account for the fact that the modulus operations within each iteration don't have a constant cost. Roughly speaking, the total asymptotic runtime is going to be n^2 times a polylogarithmic factor. Something like n^2 lg(n) 2^O(log* n). The polylogarithmic factor can be avoided by instead using a binary gcd.
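For reference, here is a short sketch of the binary GCD mentioned above (a plain 64-bit version, not a big-integer one): it trades the modulus for shifts and subtractions, which is exactly what avoids the expensive division.

#include <cstdint>
#include <utility>   // std::swap

// Binary GCD (Stein's algorithm).
std::uint64_t binary_gcd(std::uint64_t a, std::uint64_t b)
{
    if (a == 0) return b;
    if (b == 0) return a;

    int shift = 0;
    while (((a | b) & 1) == 0) { a >>= 1; b >>= 1; ++shift; }  // factor out common 2s
    while ((a & 1) == 0) a >>= 1;                              // make a odd

    while (b != 0) {
        while ((b & 1) == 0) b >>= 1;   // make b odd
        if (a > b) std::swap(a, b);     // keep a <= b, both odd
        b -= a;                         // odd - odd = even (or zero)
    }
    return a << shift;                  // restore the common power of two
}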
A suitable way to analyze an algorithm is to determine its worst case scenario.
Euclidean GCD's worst case occurs when Fibonacci pairs are involved.
void EGCD(fib[i], fib[i - 1]), where i > 0.
For instance, let's opt for the case where the dividend is 55 and the divisor is 34 (recall that we are still dealing with Fibonacci numbers).
As you may notice, this operation cost 8 iterations (or recursive calls).
Let's try larger Fibonacci numbers, namely 121393 and 75025. We can notice here as well that it took 24 iterations (or recursive calls).
You can also notice that each iteration yields a Fibonacci number. That's why we have so many operations; indeed, we obtain this many operations only with Fibonacci numbers.
Hence, the time complexity is going to be represented by small oh (an upper bound) this time. The lower bound is intuitively Omega(1): the case of 500 divided by 2, for instance.
Solving the corresponding recurrence relation, we may then say that Euclidean GCD can make log(xy) operations at most.
There's a great look at this on the wikipedia article.
It even has a nice plot of complexity for value pairs.
It is not O(a%b).
It is known (see article) that it will never take more steps than five times the number of digits in the smaller number. So the max number of steps grows as the number of digits (ln b). The cost of each step also grows as the number of digits, so the complexity is bounded by O(ln^2 b), where b is the smaller number. That's an upper limit, and the actual time is usually less.
See here.
In particular this part:
Lamé showed that the number of steps needed to arrive at the greatest common divisor for two numbers less than n is O(log n).
So O(log min(a, b)) is a good upper bound.
Here's an intuitive understanding of the runtime complexity of Euclid's algorithm. The formal proofs are covered in various texts such as Introduction to Algorithms and TAOCP Vol 2.
First think about what would happen if we tried to take the gcd of two Fibonacci numbers F(k+1) and F(k). You might quickly observe that Euclid's algorithm iterates on to F(k) and F(k-1). That is, with each iteration we move down one number in the Fibonacci series. As the Fibonacci numbers are O(Phi^k), where Phi is the golden ratio, we can see that the runtime of GCD is O(log n), where n = max(a, b) and the log has base Phi. Next, we can prove that this is the worst case by observing that Fibonacci numbers consistently produce pairs where the remainders remain large enough in each iteration and never become zero until you have arrived at the start of the series.
We can make the O(log n) bound, where n = max(a, b), even tighter. Assume that b >= a, so we can write the bound as O(log b). First, observe that the algorithm takes the same number of steps on (ka, kb) as on (a, b), since GCD(ka, kb) = k·GCD(a, b) and every remainder is simply scaled by k. As the biggest such k is gcd(a, b), we can replace b with b/gcd(a, b) in the runtime, leading to the tighter bound O(log(b/gcd(a, b))).
Here is the analysis in the book Data Structures and Algorithm Analysis in C by Mark Allen Weiss (second edition, 2.4.4):
Euclid's algorithm works by continually computing remainders until 0 is reached. The last nonzero remainder is the answer.
Here is the code:
unsigned int Gcd(unsigned int M, unsigned int N)
{
    unsigned int Rem;
    while (N > 0) {
        Rem = M % N;
        M = N;
        N = Rem;
    }
    return M;
}
Here is a THEOREM that we are going to use:
If M > N, then M mod N < M/2.
PROOF:
There are two cases. If N <= M/2, then since the remainder is smaller
than N, the theorem is true for this case. The other case is N > M/2.
But then N goes into M once with a remainder M - N < M/2, proving the
theorem.
So, we can make the following inference:
Variables M N Rem
initial M N M%N
1 iteration N M%N N%(M%N)
2 iterations M%N N%(M%N) (M%N)%(N%(M%N)) < (M%N)/2
So, after two iterations, the remainder is at most half of its original value. This would show that the number of iterations is at most 2logN = O(logN).
Note that, the algorithm computes Gcd(M,N), assuming M >= N.(If N > M, the first iteration of the loop swaps them.)
The worst case arises when n and m are consecutive Fibonacci numbers.
gcd(F_n, F_(n-1)) = gcd(F_(n-1), F_(n-2)) = ... = gcd(F_1, F_0) = 1, and the nth Fibonacci number is about 1.618^n, where 1.618 is the golden ratio.
So, to find gcd(n, m), the number of recursive calls will be Θ(log n).
The worst case of Euclid's algorithm is when the remainders are the biggest possible at each step, i.e., for two consecutive terms of the Fibonacci sequence.
When n and m are the number of digits of a and b, assuming n >= m, the algorithm uses O(m) divisions.
Note that complexities are always given in terms of the sizes of inputs, in this case the number of digits.
Gabriel Lamé's Theorem bounds the number of steps by log(1/sqrt(5)*(a+1/2)) - 2, where the base of the log is (1+sqrt(5))/2. This is the worst case scenario for the algorithm, and it occurs when the inputs are consecutive Fibonacci numbers.
A slightly more liberal bound of log a, where the base of the log is sqrt(2), is implied by Koblitz.
For cryptographic purposes we usually consider the bitwise complexity of the algorithms, taking into account that the bit size is given approximately by k = log a.
Here is a detailed analysis of the bitwise complexity of Euclid's algorithm:
Although in most references the bitwise complexity of Euclid's algorithm is given as O((log a)^3), there exists a tighter bound, which is O((log a)^2).
Consider r_0 = a, r_1 = b, and the chain of divisions
r_0 = q_1*r_1 + r_2, ..., r_(i-1) = q_i*r_i + r_(i+1), ..., r_(m-2) = q_(m-1)*r_(m-1) + r_m, r_(m-1) = q_m*r_m.
Observe that a = r_0 >= b = r_1 > r_2 > r_3 > ... > r_(m-1) > r_m > 0 .......... (1)
and r_m is the greatest common divisor of a and b.
By a claim in Koblitz's book (A Course in Number Theory and Cryptography) it can be proven that r_(i+1) < r_(i-1)/2 .......... (2)
Again in Koblitz, the number of bit operations required to divide a k-bit positive integer by an l-bit positive integer (assuming k >= l) is given as (k - l + 1)*l .......... (3)
By (1) and (2) the number of divisions is O(log a), and so by (3) the total complexity is O((log a)^3).
Now this may be reduced to O((log a)^2) by a remark in Koblitz.
Consider k_i = log(r_i) + 1.
By (1) and (2) we have k_(i+1) <= k_i for i = 0, 1, ..., m-1 and k_(i+2) <= k_i - 1 for i = 0, 1, ..., m-2,
and by (3) the total cost of the m divisions is bounded by SUM over i = 1, ..., m of (k_(i-1) - k_i + 1)*k_i.
Rearranging this (each k_i <= k_0, and the second inequality gives m <= 2*k_0): SUM over i = 1, ..., m of (k_(i-1) - k_i + 1)*k_i <= 4*k_0^2.
So the bitwise complexity of Euclid's algorithm is O((log a)^2).
For the iterative algorithm, however, we have:
int iterativeEGCD(long long n, long long m) {
    long long a;
    int numberOfIterations = 0;
    while (n != 0) {
        a = m;
        m = n;
        n = a % n;
        numberOfIterations++;
    }
    printf("\nIterative GCD iterated %d times.", numberOfIterations);
    return m;
}
With Fibonacci pairs, there is no difference between iterativeEGCD() and iterativeEGCDForWorstCase() where the latter looks like the following:
int iterativeEGCDForWorstCase(long long n, long long m) {
    long long a;
    int numberOfIterations = 0;
    while (n != 0) {
        a = m;
        m = n;
        n = a - n;
        numberOfIterations++;
    }
    printf("\nIterative GCD iterated %d times.", numberOfIterations);
    return m;
}
Yes, with Fibonacci pairs, n = a % n and n = a - n give exactly the same results.
We also know that, in an earlier response to the same question, there is a prevailing decreasing factor: factor = m / (n % m).
Therefore, to shape the iterative version of the Euclidean GCD in a defined form, we may depict as a "simulator" like this:
void iterativeGCDSimulator(long long x, long long y) {
    long long i;
    double factor = x / (double)(x % y);
    int numberOfIterations = 0;
    for (i = x * y; i >= 1; i = i / factor) {
        numberOfIterations++;
    }
    printf("\nIterative GCD Simulator iterated %d times.", numberOfIterations);
}
Based on the work (last slide) of Dr. Jauhar Ali, the loop above is logarithmic.
Yes, small oh, because the simulator gives the number of iterations at most: non-Fibonacci pairs take fewer iterations than Fibonacci pairs when run through Euclidean GCD.
At every step, there are two cases:
b >= a / 2: then a, b = b, a % b makes the new b at most half of the old a, since a % b = a - b <= a / 2.
b < a / 2: then a, b = b, a % b makes the new a at most half of the old a, since b is less than a / 2.
So at every step, the algorithm shrinks at least one of the numbers to at most half of the larger one.
After at most O(log a) + O(log b) steps, this is reduced to the simple cases, which yields an O(log n) algorithm, where n is the upper limit of a and b.
I have found it here

Better ways to implement a modulo operation (algorithm question)

I've been trying to implement a modular exponentiator recently. I'm writing the code in VHDL, but I'm looking for advice of a more algorithmic nature. The main component of the modular exponentiator is a modular multiplier which I also have to implement myself. I haven't had any problems with the multiplication algorithm- it's just adding and shifting and I've done a good job of figuring out what all of my variables mean so that I can multiply in a pretty reasonable amount of time.
The problem that I'm having is with implementing the modulus operation in the multiplier. I know that performing repeated subtractions will work, but it will also be slow. I found out that I could shift the modulus to effectively subtract large multiples of the modulus but I think there might still be better ways to do this. The algorithm that I'm using works something like this (weird pseudocode follows):
result, modulus : integer (n bits) (previously defined)
shiftcount : integer (initialized to zero)
while ((modulus < result) and (modulus(n-1) != 1)) {
    modulus = modulus << 1
    shiftcount++
}
for (i = shiftcount; i >= 0; i--) {
    if (modulus < result) { result = result - modulus }
    if (i != 0) { modulus = modulus >> 1 }
}
So...is this a good algorithm, or at least a good place to start? Wikipedia doesn't really discuss algorithms for implementing the modulo operation, and whenever I try to search elsewhere I find really interesting but incredibly complicated (and often unrelated) research papers and publications. If there's an obvious way to implement this that I'm not seeing, I'd really appreciate some feedback.
I'm not sure what you're calculating there to be honest. You talk about modulo operation, but usually a modulo operation is between two numbers a and b, and its result is the remainder of dividing a by b. Where is the a and b in your pseudocode...?
Anyway, maybe this'll help: a mod b = a - floor(a / b) * b.
I don't know if this is faster or not, it depends on whether or not you can do division and multiplication faster than a lot of subtractions.
Another way to speed up the subtraction approach is to use binary search. If you want a mod b, you need to subtract b from a until a is smaller than b. So basically you need to find k such that:
a - k*b < b, k is min
One way to find this k is a linear search:
k = 0;
while (a - k*b >= b)
    ++k;
return a - k*b;
But you can also binary search it (only ran a few tests but it worked on all of them):
k = 0;
left = 0, right = a
while (left < right)
{
    m = (left + right) / 2;
    if (a - m*b >= b)
        left = m + 1;
    else
        right = m;
}
return a - left*b;
I'm guessing the binary search solution will be the fastest when dealing with big numbers.
If you want to calculate a mod b and only a is a big number (you can store b on a primitive data type), you can do it even faster:
for each digit p of a do
    mod = (mod * 10 + p) % b
return mod
This works because we can write a as a_n*10^n + a_(n-1)*10^(n-1) + ... + a_0*10^0 = (((a_n * 10 + a_(n-1)) * 10 + a_(n-2)) * 10 + ...
I think the binary search is what you're looking for though.
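Here is a small C++ illustration of that digit-by-digit trick (the function name is mine), assuming a is given as a decimal string and b is positive and small enough that mod * 10 + 9 fits in a long long:

#include <string>

long long mod_of_big_decimal(const std::string& a, long long b)
{
    long long mod = 0;
    for (char c : a)
        mod = (mod * 10 + (c - '0')) % b;   // fold in one decimal digit
    return mod;
}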
There are many ways to do it in O(log n) time for n bits; you can do it with multiplication and you don't have to iterate 1 bit at a time. For example,
a mod b = a - floor((a * r)/2^n) * b
where
r = 2^n / b
is precomputed because typically you're using the same b many times. If not, use the standard superconverging polynomial iteration method for reciprocal (iterate 2x - bx^2 in fixed point).
Choose n according to the range you need the result (for many algorithms like modulo exponentiation it doesn't have to be 0..b).
(Many decades ago I thought I saw a trick to avoid 2 multiplications in a row... Update: I think it's Montgomery Multiplication (see REDC algorithm). I take it back, REDC does the same work as the simpler algorithm above. Not sure why REDC was ever invented... Maybe slightly lower latency due to using the low-order result into the chained multiplication, instead of the higher-order result?)
Of course if you have a lot of memory, you can just precompute all the 2^n mod b partial sums for n = log2(b)..log2(a). Many software implementations do this.
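Going back to the precomputed-reciprocal formula above, here is a rough 64-bit sketch of the idea (essentially Barrett reduction). The struct name is mine, unsigned __int128 is a GCC/Clang extension used only to keep the intermediate product exact, and b is assumed to be at least 1; the quotient estimate is exact or one too small, so a single correction step suffices.

#include <cstdint>

struct ReciprocalMod {
    std::uint64_t b;
    std::uint64_t r;                                   // ~ 2^64 / b, precomputed once
    explicit ReciprocalMod(std::uint64_t b_) : b(b_), r(~0ULL / b_) {}

    std::uint64_t reduce(std::uint64_t a) const {
        std::uint64_t q = (std::uint64_t)(((unsigned __int128)a * r) >> 64);  // quotient estimate
        std::uint64_t rem = a - q * b;                 // q is floor(a/b) or one less
        if (rem >= b) rem -= b;                        // single correction step
        return rem;
    }
};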
If you're using shift-and-add for the multiplication (which is by no means the fastest way) you can do the modulo operation after each addition step. If the sum is greater than the modulus you then subtract the modulus. If you can predict the overflow, you can do the addition and subtraction at the same time. Doing the modulo at each step will also reduce the overall size of your multiplier (same length as input rather than double).
The shifting of the modulus you're doing is getting you most of the way towards a full division algorithm (modulo is just taking the remainder).
EDIT Here is my implementation in Python:
def mod_mul(a, b, m):
    result = 0
    a = a % m
    b = b % m
    while b > 0:
        if (b & 1) != 0:
            result += a
            if result >= m: result -= m
        a = a << 1
        if a >= m: a -= m
        b = b >> 1
    return result
This is just modular multiplication (result = a*b mod m). The modulo operations at the top are not needed, but serve as a reminder that the algorithm assumes a and b are less than m.
Of course for modular exponentiation you'll have an outer loop that does this entire operation at each step doing either squaring or multiplication. But I think you knew that.
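As a sketch of that outer loop, here is a square-and-multiply routine in C++ built on a mulmod helper equivalent to the mod_mul above (unsigned __int128, a GCC/Clang extension, is used purely to keep the helper short):

#include <cstdint>

static std::uint64_t mulmod(std::uint64_t a, std::uint64_t b, std::uint64_t m)
{
    return (std::uint64_t)((unsigned __int128)a * b % m);
}

std::uint64_t powmod(std::uint64_t base, std::uint64_t exp, std::uint64_t m)
{
    std::uint64_t result = 1 % m;          // handles m == 1
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = mulmod(result, base, m);   // multiply step
        base = mulmod(base, base, m);                    // squaring step
        exp >>= 1;
    }
    return result;
}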
For modulo itself, I'm not sure. For modulo as part of the larger modular exponential operation, did you look up Montgomery multiplication as mentioned in the wikipedia page on modular exponentiation? It's been a while since I've looked into this type of algorithm, but from what I recall, it's commonly used in fast modular exponentiation.
edit: for what it's worth, your modulo algorithm seems ok at first glance. You're basically doing division which is a repeated subtraction algorithm.
That test, (modulus(n-1) != 1) (a bit test?), seems redundant combined with (modulus < result).
Designing for a hardware implementation, I would be conscious that the smaller/greater-than tests imply more logic (subtraction) than bitwise operations and branching on zero.
If we can do bitwise tests easily, this could be quick:
m = msb_of(modulus)
while (result > 0)
{
    r = msb_of(result)            // countdown from prev msb onto result
    shift = r - m                 // countdown from r onto modulus, or
                                  // unroll the small subtraction
    takeoff = (modulus << shift)  // or integrate this into the count of shift
    result = result - takeoff;    // necessary subtraction
    if (shift != 0 && result < 0)
    { result = result + (takeoff >> 1); }
}   // endwhile
if (result == 0) { return result }
else { return result + takeoff }
(code untested, may contain gotchas)
result is repeatedly decremented by the modulus, shifted to match it at the most significant bits.
After each subtraction, result has a ~50/50 chance of losing more than 1 msb. It also has a ~50/50 chance of going negative;
adding back half of what was subtracted will always put it into the positive range again (it should only be put back into the positive range if shift was not 0).
The working loop exits when result has underrun and 'shift' was 0.
