Matrix Exponentiation Algorithm for large values of N - algorithm

I want to calculate the Nth Fibonacci number for very large N, i.e. 10^6, with O(log N) complexity.
Here is my code, but it takes 30 seconds to give the result for 10^6, which is far too slow. Please help me find the mistake. I have to give the output modulo 10^9+7.
    static BigInteger mod = new BigInteger("1000000007");

    BigInteger fibo(long n) {
        BigInteger F[][] = {{BigInteger.ONE, BigInteger.ONE}, {BigInteger.ONE, BigInteger.ZERO}};
        if (n == 0)
            return BigInteger.ZERO;
        power(F, n - 1);
        return F[0][0].mod(mod);
    }

    void power(BigInteger F[][], long n) {
        if (n == 0 || n == 1)
            return;
        BigInteger M[][] = {{BigInteger.ONE, BigInteger.ONE}, {BigInteger.ONE, BigInteger.ZERO}};
        power(F, n / 2);
        multiply(F, F);
        if (n % 2 != 0)
            multiply(F, M);
    }

    void multiply(BigInteger F[][], BigInteger M[][]) {
        BigInteger x = F[0][0].multiply(M[0][0]).add(F[0][1].multiply(M[1][0]));
        BigInteger y = F[0][0].multiply(M[0][1]).add(F[0][1].multiply(M[1][1]));
        BigInteger z = F[1][0].multiply(M[0][0]).add(F[1][1].multiply(M[1][0]));
        BigInteger w = F[1][0].multiply(M[0][1]).add(F[1][1].multiply(M[1][1]));
        F[0][0] = x;
        F[0][1] = y;
        F[1][0] = z;
        F[1][1] = w;
    }

Use these recurrences:
$F_{2n-1} = F_n^2 + F_{n-1}^2$
$F_{2n} = (2F_{n-1} + F_n)\,F_n$
together with memoization. For example, in Python you could use the functools.lru_cache decorator, like this:

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def fibonacci_modulo(n, m):
        """Compute the nth Fibonacci number modulo m."""
        if n <= 3:
            return (0, 1, 1, 2)[n] % m
        elif n % 2 == 0:
            a = fibonacci_modulo(n // 2 - 1, m)
            b = fibonacci_modulo(n // 2, m)
            return ((2 * a + b) * b) % m
        a = fibonacci_modulo(n // 2, m)
        b = fibonacci_modulo(n // 2 + 1, m)
        return (a * a + b * b) % m

This computes the (10^6)th Fibonacci number (modulo 10^9 + 7) in a few microseconds:

    >>> from timeit import timeit
    >>> timeit(lambda: fibonacci_modulo(10 ** 6, 10 ** 9 + 7), number=1)
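The memoized recursion can also be written as a loop with no cache. The loop walks the bits of n from the most significant end and keeps the pair (F(k), F(k+1)). This is a minimal Python sketch, and the name fib_mod is illustrative. It uses the identities F(2k) = F(k)(2F(k+1) - F(k)) and F(2k+1) = F(k)^2 + F(k+1)^2. Substituting F(k-1) = F(k+1) - F(k) into the recurrences above shows that the two forms agree.

```python
def fib_mod(n: int, m: int) -> int:
    """Return F(n) % m by iterative fast doubling (no recursion, no cache)."""
    a, b = 0, 1 % m  # (F(k), F(k+1)) mod m, starting at k = 0
    # Process the bits of n from most significant to least significant.
    for bit in bin(n)[2:]:
        # Doubling step: (F(k), F(k+1)) -> (F(2k), F(2k+1))
        c = a * ((2 * b - a) % m) % m
        d = (a * a + b * b) % m
        a, b = c, d
        if bit == "1":
            # Advance one step: (F(2k), F(2k+1)) -> (F(2k+1), F(2k+2))
            a, b = b, (a + b) % m
    return a
```

For example, fib_mod(10**6, 10**9 + 7) runs about 20 loop iterations, one per bit of 10^6.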

Using your code, I get a more reasonable, although still very slow, time of real 0m2.335s.
The algorithm to compute the Fibonacci numbers is okay (there are some tweaks that could speed it up somewhat, but nothing very dramatic), so the problem is that operations on large BigIntegers are slow, and F(10^6) has nearly 700,000 bits.
Since you want to compute the remainder modulo mod = 10^9 + 7, and (mod-1)^2 fits in a long, you can get a much faster implementation using longs instead of BigIntegers, computing the remainder in each step. The direct transcription
    public class FiboL {
        static final long mod = 1000000007L;

        static long fibo(long n) {
            long F[][] = {{1, 1}, {1, 0}};
            if (n == 0)
                return 0;
            power(F, n - 1);
            return F[0][0]; //.mod(mod);
        }

        static void power(long F[][], long n) {
            if (n == 0 || n == 1)
                return;
            long M[][] = {{1, 1}, {1, 0}};
            power(F, n / 2);
            multiply(F, F);
            if (n % 2 != 0)
                multiply(F, M);
        }

        static void multiply(long F[][], long M[][]) {
            // every product is < mod^2, which fits in a long
            long x = (F[0][0] * M[0][0]) % mod + (F[0][1] * M[1][0]) % mod;
            long y = (F[0][0] * M[0][1]) % mod + (F[0][1] * M[1][1]) % mod;
            long z = (F[1][0] * M[0][0]) % mod + (F[1][1] * M[1][0]) % mod;
            long w = (F[1][0] * M[0][1]) % mod + (F[1][1] * M[1][1]) % mod;
            F[0][0] = x % mod;
            F[0][1] = y % mod;
            F[1][0] = z % mod;
            F[1][1] = w % mod;
        }

        public static void main(String[] args) {
            System.out.println(fibo(1000000));
        }
    }
runs in real 0m0.083s.
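The same idea also works without recursion, using square-and-multiply over the bits of n. As above, every multiplication is reduced so that the entries stay below mod. This is a minimal Python sketch; mat_mult and fibo_matrix are illustrative names. It reads F(n) from entry [0][1] of M^n rather than from [0][0] of M^(n-1). Python ints never overflow, so here the reduction only keeps the numbers small. In the Java version, the same reduction is what keeps every product inside a long.

```python
def mat_mult(x, y, mod):
    """Multiply two 2x2 matrices, reducing every entry modulo mod."""
    return [
        [(x[0][0] * y[0][0] + x[0][1] * y[1][0]) % mod,
         (x[0][0] * y[0][1] + x[0][1] * y[1][1]) % mod],
        [(x[1][0] * y[0][0] + x[1][1] * y[1][0]) % mod,
         (x[1][0] * y[0][1] + x[1][1] * y[1][1]) % mod],
    ]


def fibo_matrix(n, mod=1_000_000_007):
    """F(n) % mod from the n-th power of [[1, 1], [1, 0]], computed bottom-up."""
    result = [[1, 0], [0, 1]]  # identity matrix
    base = [[1, 1], [1, 0]]
    while n > 0:
        if n & 1:
            result = mat_mult(result, base, mod)
        base = mat_mult(base, base, mod)
        n >>= 1
    # [[1, 1], [1, 0]]^n == [[F(n+1), F(n)], [F(n), F(n-1)]]
    return result[0][1]
```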


Multipliers (codeforces)

This is the link to this algorithm topic:
My code exceeds the time limit on test 40. I have thought about it for a long time but found no good approach. Is there a good way to optimize it?
    #include <iostream>
    #include <map>
    #include <vector>
    using namespace std;

    typedef long long ll;
    ll mod = 1e9 + 7;

    // a^n mod Mod by binary exponentiation
    ll fast_mod(ll a, ll n, ll Mod) {
        ll ans = 1;
        while (n) {
            if (n & 1) ans = (ans * a) % Mod;
            a = (a * a) % Mod;
            n >>= 1;
        }
        return ans;
    }

    int main() {
        std::cin.tie(0); // IO
        ll m;
        cin >> m;
        ll num = 1ll;
        map<ll, ll> count;
        for (int i = 0; i < m; i++) {
            ll p;
            cin >> p;
            count[p]++;
        }
        ll res = 1ll;
        vector<ll> a;
        vector<ll> b;
        for (auto it = count.begin(); it != count.end(); it++) {
            a.push_back(it->first);
            b.push_back(it->second);
        }
        for (int i = 0; i < a.size(); i++) {
            ll x = a[i]; // a kind of prime
            ll y = b[i]; // the count of the prime
            ll tmp = fast_mod(x, y * (y + 1) / 2, mod); // x^1 * x^2 * x^3 *...*x^y
            for (int j = 0; j < b.size(); j++) // calculate (tmp)^((b[0] + 1)*(b[1] + 1)*...*(b[b.size() - 1] + 1)), here b.size() is the number of different primes
                tmp = fast_mod(tmp, i != j ? (b[j] + 1) : 1, mod) % mod;
            res = (res * tmp % mod);
        }
        cout << res << endl;
        return 0;
    }
First I count how many times each distinct prime occurs. Suppose x is one of the distinct primes and y is its count. I compute tmp = x^1 * x^2 * ... * x^y = x^(y(y+1)/2). Then tmp is used as the base, and the exponent is the product of (count + 1) over the other primes, i.e. (b[0] + 1)(b[1] + 1)...(b[b.size() - 1] + 1) with the factor for x itself left out.
The inner for loop splits this exponentiation into several steps, one per prime.
Finally, res is multiplied by tmp^((b[0] + 1)(b[1] + 1)...(b[b.size() - 1] + 1)), again without x's own factor.
Another formula for the product of the divisors of N is N ** (D / 2), where D is the number of divisors. You can find D from your map count by taking the product of entry->second + 1 over every entry.
This does raise the question of what to do when D is odd, which it would be if N is a perfect square. In that case it is easy to compute sqrt(N) (the exponents would all be even, so you can halve them all and take the product of the primes to half of their original exponents), and then raise sqrt(N) to the power of D. Essentially this changes N ** (D / 2) into (N ** (1 / 2)) ** D.
For example if N = 2 * 3 * 2 = 12 (one of the examples), then D will be (2 + 1) * (1 + 1) = 6 and the product of divisors will be 12 ** (6 / 2) = 1728.
Computing N (or its square root) should be done modulo mod. Computing D should be done modulo mod - 1, the totient of mod (mod is prime, so its totient is just one less). Since mod - 1 is even, 2 has no modular multiplicative inverse modulo mod - 1, so we could not have "divided" D by 2 that way. When N is a square, AFAIK we're really stuck with computing its square root (that's not so bad, but multiplying by a half would have been easier).
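Here is a minimal Python sketch of this approach. It assumes the input is the list of primes whose product is N, as in the question. It also assumes every prime is smaller than mod, so that Fermat's little theorem allows exponents to be reduced modulo mod - 1. The name product_of_divisors is illustrative. If N is not a square, some exponent y is odd. Its factor y + 1 is then even and can be halved exactly before the reduction, which is one way to obtain D / 2 without an inverse of 2. If N is a square, the sketch uses sqrt(N) ** D as described above.

```python
from collections import Counter

MOD = 10**9 + 7


def product_of_divisors(primes, mod=MOD):
    """Product of all divisors of N = prod(primes), modulo the prime `mod`.

    Assumes every prime is smaller than mod, so exponents can be reduced
    modulo mod - 1 (Fermat's little theorem).
    """
    counts = Counter(primes)
    # A prime with an odd exponent y has an even factor (y + 1) in D.
    halved = next((p for p, y in counts.items() if y % 2 == 1), None)
    if halved is not None:
        # N is not a perfect square: answer = N ** (D / 2).
        n_mod = 1
        half_d = 1
        for p, y in counts.items():
            n_mod = n_mod * pow(p, y, mod) % mod
            factor = (y + 1) // 2 if p == halved else y + 1
            half_d = half_d * factor % (mod - 1)
        return pow(n_mod, half_d, mod)
    # N is a perfect square (all exponents even): answer = sqrt(N) ** D.
    root_mod = 1
    d = 1
    for p, y in counts.items():
        root_mod = root_mod * pow(p, y // 2, mod) % mod
        d = d * (y + 1) % (mod - 1)
    return pow(root_mod, d, mod)
```

For example, product_of_divisors([2, 3, 2]) covers N = 12 from the answer, where D = 6.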

Finding sum of geometric sequence with modulo 10^9+7 with my program

The problem is given as:
Output the answer of (A^1 + A^2 + A^3 + ... + A^K) modulo 1,000,000,007, where 1 ≤ A, K ≤ 10^9 and A and K are integers.
I am trying to write a program to compute this. I used the formula for a geometric sequence and then applied the modulo to the answer. Since the result must be an integer as well, finding a modular inverse is not required.
Below is the code I have now; it's in Pascal:
    power, sum: int64;

    power := 1;
    For i := 1 to k do
        power := ((power mod 1000000007) * a) mod 1000000007;
    sum := a * (power - 1) div (a - 1);
    Writeln(sum mod 1000000007);
This task came from my school, and they do not give their test data to students, so I do not know why or where my program is wrong. I only know that my program outputs the wrong answer on their test data.
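A likely cause is the order of operations. By the time (power - 1) div (a - 1) is evaluated, power has already been reduced modulo 1000000007. Integer division does not commute with taking remainders, so the quotient is wrong even though the exact quotient is an integer.

One standard fix replaces the division with multiplication by the modular inverse of a - 1. Because 1000000007 is prime, that inverse can be computed with Fermat's little theorem. The case a ≡ 1 must be handled separately, because every term is then 1. The minimal Python sketch below is an assumption, not the school's reference solution, and geometric_sum_mod is an illustrative name. It also replaces the O(k) loop with the built-in three-argument pow, which runs in O(log k).

```python
MOD = 1_000_000_007


def geometric_sum_mod(a: int, k: int, mod: int = MOD) -> int:
    """Return (a^1 + a^2 + ... + a^k) % mod, assuming mod is prime."""
    a %= mod
    if a == 1:
        # Every term is 1; the closed form would divide by zero.
        return k % mod
    numerator = a * (pow(a, k, mod) - 1) % mod
    # Dividing by (a - 1) becomes multiplying by its inverse (Fermat).
    inverse = pow(a - 1, mod - 2, mod)
    return numerator * inverse % mod
```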
If you want to do this without calculating a modular inverse, you can calculate it recursively using:
$$1 + A + A^2 + A^3 + \cdots + A^k = 1 + (A + A^2)\left(1 + A^2 + (A^2)^2 + \cdots + (A^2)^{k/2-1}\right)$$
That's for even k. For odd k:
$$1 + A + A^2 + A^3 + \cdots + A^k = (1 + A)\left(1 + A^2 + (A^2)^2 + \cdots + (A^2)^{(k-1)/2}\right)$$
Since k is divided by 2 in each recursive call, the resulting algorithm has O(log k) complexity. In Java:
    static int modSumAtoAk(int A, int k, int mod) {
        return (modSum1ToAk(A, k, mod) + mod - 1) % mod;
    }

    static int modSum1ToAk(int A, int k, int mod) {
        long sum;
        if (k < 5) {
            //k is small -- just iterate
            sum = 0;
            long x = 1;
            for (int i = 0; i <= k; ++i) {
                sum = (sum + x) % mod;
                x = (x * A) % mod;
            }
            return (int) sum;
        }
        //k is big
        int A2 = (int) (((long) A) * A % mod);
        if ((k % 2) == 0) {
            // k even
            sum = modSum1ToAk(A2, (k / 2) - 1, mod);
            sum = (sum + sum * A) % mod;
            sum = ((sum * A) + 1) % mod;
        } else {
            // k odd
            sum = modSum1ToAk(A2, (k - 1) / 2, mod);
            sum = (sum + sum * A) % mod;
        }
        return (int) sum;
    }
Note that I've been very careful to make sure that each product is done in 64 bits, and to reduce by the modulus after each one.
With a little math, the above can be converted to an iterative version that doesn't require any storage:
    static int modSumAtoAk(int A, int k, int mod) {
        // first, we calculate the sum of all 1... A^k
        // we'll refer to that as SUM1 in comments below
        long fac = 1;
        long add = 0;
        //INVARIANT: SUM1 = add + fac*(sum 1...A^k)
        //this will remain true as we change k
        while (k > 0) {
            //above INVARIANT is true here, too
            long newmul, newadd;
            if ((k % 2) == 0) {
                //k is even. sum 1...A^k = 1+A*(sum 1...A^(k-1))
                newmul = A;
                newadd = 1;
                k -= 1;
            } else {
                //k is odd. sum 1...A^k = (1+A)*(sum 1...(A^2)^((k-1)/2))
                newmul = A + 1L;
                newadd = 0;
                A = (int) (((long) A) * A % mod);
                k = (k - 1) / 2;
            }
            //SUM1 = add + fac * (newadd + newmul*(sum 1...A^k))
            //     = add+fac*newadd + fac*newmul*(sum 1...A^k)
            add = (add + fac * newadd) % mod;
            fac = (fac * newmul) % mod;
            //INVARIANT is restored
        }
        // k == 0, so (sum 1...A^k) == 1
        long sum1 = fac + add;
        return (int) ((sum1 + mod - 1) % mod);
    }

modular multiplicative inverse of a number for calculating nCr % 1000000007 (combination)

I am trying to calculate nCr % M. So what I am doing is
$\binom{n}{r} = \frac{n!}{(n-r)!\,r!} \bmod M$
In other words, nCr = n! * (inverseFactorial(n-r)*inverseFactorial(r)).
So I am precomputing the values of factorial and inverseFactorial for the numbers from 1 to 10^5.
Basically, I am trying to implement this first answer.
This is my code.
    //fill fact
    fact[0] = 1;
    for (int i = 1; i < 100001; i++) {
        fact[i] = fact[i-1] * i % 1000000007;
    }
    //fill ifact - inverse of fact
    ifact[0] = 1;
    for (int i = 1; i < 100001; i++) {
        ifact[i] = ifact[i-1] * inverse(i) % 1000000007;
    }

And the methods are:

    public static long fastcomb(int n, int r) {
        long ans = ifact[r] * ifact[n-r];
        ans = ans % 1000000007;
        ans = ans * fact[n];
        ans = ans % 1000000007;
        return ans;
    }
    public static int modul(int x) {
        x = x % 1000000007;
        return x;
    }
    public static int inverse(int x) {
        int mod = modul(x);
        if (mod == 1)
            return 1;
        return modul((-1000000007 / mod) * (ifact[1000000007 % mod] % 1000000007));
    }
I am not sure where I am going wrong. Please help me find the mistake: for ifact[2] it shows me 500000004.
Here is an implementation of the multiplicative inverse based on Fermat's little theorem.
I tested it and it works.
    // m must be prime and a must not be divisible by m
    static long modInverse(long a, long m) {
        return power(a, m - 2, m);
    }

    // To compute x^y under modulo m
    static long power(long x, long y, long m) {
        if (y == 0)
            return 1;
        long p = power(x, y / 2, m) % m;
        p = (p * p) % m;
        if (y % 2 == 0)
            return p;
        return ((x % m) * p) % m;
    }
I'm working on nCr mod M too, and you don't need that array to find it.
Here is an implementation of nCr mod m; please check it with your values. Remember that m must be prime for this method, and r must be smaller than m.
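    // assumes 0 <= r <= n, m prime, r < m
    static long nCr_mod_m(long n, long r, long m) {
        if (n - r < r) r = (n - r); // since nCr = nC(n-r)
        if (r == 0) return 1 % m;   // nC0 = 1 (the loops below would return n)
        long top_part = n % m, bottom_part = 1;
        for (long i = 1; i < r; i++)
            top_part = (top_part * ((n - i) % m)) % m;         // n * (n-1) * ... * (n-r+1)
        for (long i = 2; i <= r; i++)
            bottom_part = (bottom_part * modInverse(i, m)) % m; // 1 / r!
        return (top_part * bottom_part) % m;
    }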
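For the precomputation approach in the question, ifact[2] = 500000004 is actually correct: 2! = 2, and 2 · 500000004 = 1000000008 ≡ 1 (mod 1000000007).

Below is a minimal Python sketch of the two tables. It assumes a prime modulus larger than the table limit; build_tables and ncr are illustrative names. It computes a single Fermat inverse for the largest factorial and then walks downward using 1/(i-1)! = i · (1/i!). This avoids computing one inverse per entry.

```python
MOD = 1_000_000_007


def build_tables(limit, mod=MOD):
    """Factorials and inverse factorials of 0..limit modulo a prime mod > limit."""
    fact = [1] * (limit + 1)
    for i in range(1, limit + 1):
        fact[i] = fact[i - 1] * i % mod
    inv_fact = [1] * (limit + 1)
    # One Fermat inverse for the largest factorial...
    inv_fact[limit] = pow(fact[limit], mod - 2, mod)
    # ...then walk down: 1/(i-1)! = i * (1/i!).
    for i in range(limit, 0, -1):
        inv_fact[i - 1] = inv_fact[i] * i % mod
    return fact, inv_fact


def ncr(n, r, fact, inv_fact, mod=MOD):
    """nCr % mod from the precomputed tables (requires n <= limit)."""
    if r < 0 or r > n:
        return 0
    return fact[n] * inv_fact[r] % mod * inv_fact[n - r] % mod
```

Usage: build the tables once with fact, inv_fact = build_tables(10**5). After that, each ncr query costs O(1).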

Fast Iterative GCD

I need GCD(n, i) for every i, where i starts at 1 and increases by 1 in a loop up to n. Is there an algorithm that calculates all these GCDs faster than the naive approach of computing each one with the Euclidean algorithm?
PS: I've noticed that if n is prime, every number from 1 to n-1 gives 1, because a prime is coprime to all of them. Any ideas for numbers other than primes?
C++ implementation; it works in O(n log log n), assuming integer operations take O(1) time:
    #include <cstdio>
    #include <cstring>
    using namespace std;

    void find_gcd(int n, int *gcd) {
        // divisor[x] - any prime divisor of x
        // or 0 if x == 1 or x is prime
        int *divisor = new int[n + 1];
        memset(divisor, 0, (n + 1) * sizeof(int));

        // This is almost copypaste of sieve of Eratosthenes, but instead of
        // just marking number as 'non-prime' we remember its divisor.
        // O(n * log log n)
        for (int x = 2; x * x <= n; ++x) {
            if (divisor[x] == 0) {
                for (int y = x * x; y <= n; y += x) {
                    divisor[y] = x;
                }
            }
        }

        for (int x = 1; x <= n; ++x) {
            if (n % x == 0) gcd[x] = x;
            else if (divisor[x] == 0) gcd[x] = 1; // x is prime, and does not divide n (previous line)
            else {
                int a = x / divisor[x], p = divisor[x]; // x == a * p
                // gcd(a * p, n) = gcd(a, n) * gcd(p, n / gcd(a, n))
                // gcd(p, n / gcd(a, n)) == 1 or p
                gcd[x] = gcd[a];
                if ((n / gcd[a]) % p == 0) gcd[x] *= p;
            }
        }
        delete[] divisor;
    }

    int main() {
        int n;
        scanf("%d", &n);
        int *gcd = new int[n + 1];
        find_gcd(n, gcd);
        for (int x = 1; x <= n; ++x) {
            printf("%d:\t%d\n", x, gcd[x]);
        }
        delete[] gcd;
        return 0;
    }
The possible answers for the gcd consist of the factors of n.
You can compute these efficiently as follows.
First factorise n into a product of prime factors, i.e. $n = p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$.
Then loop over all factors d of n, and for each d set the GCD array entry to d at every position that is a multiple of d.
If you make sure that the factors are done in a sensible order (e.g. sorted) you should find that the array entries that are written multiple times will end up being written with the highest value (which will be the gcd).
Here is some Python code to do this for the number 1400=2^3*5^2*7:
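    prime_factors = [2, 5, 7]
    prime_counts = [3, 2, 1]
    N = 1
    for prime, count in zip(prime_factors, prime_counts):
        N *= prime**count
    GCD = [0] * (N + 1)
    GCD[0] = N
    def go(i, n):  # n = product of the prime powers chosen so far
        """Try all counts for prime[i]"""
        if i == len(prime_factors):
            for x in range(n, N + 1, n):
                GCD[x] = n
            return
        for c in range(prime_counts[i] + 1):
            go(i + 1, n)
            n *= prime_factors[i]
    go(0, 1)
    print(N, GCD)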
The binary GCD algorithm is faster than the Euclidean algorithm.
I implemented "gcd()" in C for the type "__uint128_t" (with gcc on an Intel i7 running Ubuntu), based on the iterative Rust version.
Counting trailing zeros is done efficiently with "__builtin_ctzll()". I benchmarked 1 million loops on the two biggest 128-bit Fibonacci numbers (they require the maximal number of iterations) against gmplib's "mpz_gcd()" and saw a 10% slowdown. Using the fact that u and v only decrease, I switched to the 64-bit special case "_gcd()" once v <= UINT64_MAX, and now see a speedup of 1.31 over gmplib.
    #include <stdint.h>

    /* number of trailing zero bits; u must be nonzero */
    static inline int ctz(__uint128_t u)
    {
        unsigned long long h = u;
        return (h != 0) ? __builtin_ctzll(h)
                        : 64 + __builtin_ctzll(u >> 64);
    }

    /* binary GCD of two odd, nonzero 64-bit values */
    unsigned long long _gcd(unsigned long long u, unsigned long long v)
    {
        for (;;) {
            if (u > v) { unsigned long long a = u; u = v; v = a; }
            v -= u;
            if (v == 0) return u;
            v >>= __builtin_ctzll(v);
        }
    }

    __uint128_t gcd(__uint128_t u, __uint128_t v)
    {
        if (u == 0) { return v; }
        else if (v == 0) { return u; }
        int i = ctz(u); u >>= i;
        int j = ctz(v); v >>= j;
        int k = (i < j) ? i : j;
        for (;;) {
            if (u > v) { __uint128_t a = u; u = v; v = a; }
            /* widen before shifting: k may be 64 or more */
            if (v <= UINT64_MAX) return (__uint128_t)_gcd(u, v) << k;
            v -= u;
            if (v == 0) return u << k;
            v >>= ctz(v);
        }
    }
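The loop above is Stein's binary GCD, specialized to 128-bit integers. For reference, here is a sketch of the same loop over Python's arbitrary-precision ints. It counts trailing zeros with (x & -x).bit_length() - 1 instead of __builtin_ctzll. The name binary_gcd is illustrative. The sketch is meant to show the algorithm, not to compete with math.gcd on speed.

```python
def binary_gcd(u: int, v: int) -> int:
    """Stein's binary GCD for non-negative Python ints of any size."""
    if u == 0:
        return v
    if v == 0:
        return u
    # Trailing-zero count: (x & -x) isolates the lowest set bit.
    i = (u & -u).bit_length() - 1
    j = (v & -v).bit_length() - 1
    u >>= i
    v >>= j
    k = min(i, j)  # power of two shared by both inputs
    while True:
        # Both u and v are odd here.
        if u > v:
            u, v = v, u
        v -= u  # odd - odd is even (or zero)
        if v == 0:
            return u << k
        v >>= (v & -v).bit_length() - 1
```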

Sum of series: 1^1 + 2^2 + 3^3 + ... + n^n (mod m)

Can someone give me an idea for an efficient algorithm to find the sum of the above series for large n (say 10^10)?
My code is getting killed for n = 100000 and m = 200000:
int main() {
int n,m,i,j,sum,t;
for(i=1;i<=n;i++) {
t=((long long)t*i)%m;
Two notes:
(a + b + c) % m is equivalent to (a % m + b % m + c % m) % m
(a * b * c) % m is equivalent to ((a % m) * (b % m) * (c % m)) % m
As a result, you can calculate each term using a recursive function in O(log p):
    int expmod(int n, int p, int m) {
        if (p == 0) return 1;
        int nm = n % m;
        long long r = expmod(nm, p / 2, m);
        r = (r * r) % m;
        if (p % 2 == 0) return r;
        return (r * nm) % m;
    }
And sum elements using a for loop:
    long long r = 0;
    for (int i = 1; i <= n; ++i)
        r = (r + expmod(i, i, m)) % m;
This algorithm is O(n log n).
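Here is the same O(n log n) loop as a minimal Python sketch. The built-in three-argument pow(i, i, m) performs the same square-and-multiply exponentiation that expmod does. The name sum_self_powers is illustrative. The loop is still linear in n, so on its own it does not reach n = 10^10.

```python
def sum_self_powers(n: int, m: int) -> int:
    """(1^1 + 2^2 + ... + n^n) % m using O(n log n) modular multiplications."""
    total = 0
    for i in range(1, n + 1):
        # pow(base, exp, mod) does square-and-multiply modulo m.
        total = (total + pow(i, i, m)) % m
    return total
```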
I think you can use Euler's theorem to avoid some exponentiation, as phi(200000) = 80000. The Chinese remainder theorem might also help, as it reduces the modulus.
You may have a look at my answer to this post. The implementation there is slightly buggy, but the idea is there. The key strategy is to find x such that n^(x-1)<m and n^x>m and repeatedly reduce n^n%m to (n^x%m)^(n/x)*n^(n%x)%m. I am sure this strategy works.
I encountered a similar question recently: my n is 1435 and my m is 10^10. Here is my solution (C#):

    ulong n = 1435, s = 0;
    ulong mod = 10000000000UL; // 10^10
    for (ulong i = 1; i <= n; i++)
    {
        ulong summand = i;
        for (ulong j = 2; j <= i; j++)
        {
            summand *= i;
            summand = summand % mod;
        }
        s += summand;
        s = s % mod;
    }

At the end, s is equal to the required number.
Are you getting killed here:
t=((long long)t*i)%m;
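Exponentials mod m can be computed with the square-and-multiply (repeated squaring) method. The following computes n^n mod m:

    n = 10000;
    m = 20000;
    sqr = n % m;     // n^(2^i) mod m
    bit = n;         // remaining bits of the exponent
    result = 1 % m;
    while (bit > 0) {
        if (bit % 2 == 1)
            result = (result * sqr) % m;
        sqr = (sqr * sqr) % m;
        bit >>= 1;
    }
    // result == n^n mod m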
I can't add a comment, but for the Chinese remainder theorem, see formulas (4)-(6).
