How to find all possible values of four variables when squared sum to N?

How to find all possible values of four variables when squared sum to N? - algorithm

A^2+B^2+C^2+D^2 = N Given an integer N, print out all possible combinations of integer values of ABCD which solve the equation.
I am guessing we can do better than brute force.

Naive brute force would be something like:
n = 3200724;
lim = sqrt (n) + 1;
for (a = 0; a <= lim; a++)
for (b = 0; b <= lim; b++)
for (c = 0; c <= lim; c++)
for (d = 0; d <= lim; d++)
if (a * a + b * b + c * c + d * d == n)
printf ("%d %d %d %d\n", a, b, c, d);
Unfortunately, this will result in over a trillion loops, not overly efficient.
You can actually do substantially better than that by discounting huge numbers of impossibilities at each level, with something like:
#include <stdio.h>
int main(int argc, char *argv[]) {
int n = atoi (argv[1]);
int a, b, c, d, na, nb, nc, nd;
int count = 0;
for (a = 0, na = n; a * a <= na; a++) {
for (b = 0, nb = na - a * a; b * b <= nb; b++) {
for (c = 0, nc = nb - b * b; c * c <= nc; c++) {
for (d = 0, nd = nc - c * c; d * d <= nd; d++) {
if (d * d == nd) {
printf ("%d %d %d %d\n", a, b, c, d);
count++;
}
tot++;
}
}
}
}
printf ("Found %d solutions\n", count);
return 0;
}
It's still brute force, but not quite as brutish inasmuch as it understands when to stop each level of looping as early as possible.
On my (relatively) modest box, that takes under a second (a) to get all solutions for numbers up to 50,000. Beyond that, it starts taking more time:
n time taken
---------- ----------
100,000 3.7s
1,000,000 6m, 18.7s
For n = ten million, it had been going about an hour and a half before I killed it.
So, I would say brute force is perfectly acceptable up to a point. Beyond that, more mathematical solutions would be needed.
For even more efficiency, you could only check those solutions where d >= c >= b >= a. That's because you could then build up all the solutions from those combinations into permutations (with potential duplicate removal where the values of two or more of a, b, c, or d are identical).
In addition, the body of the d loop doesn't need to check every value of d, just the last possible one.
Getting the results for 1,000,000 in that case takes under ten seconds rather than over six minutes:
0 0 0 1000
0 0 280 960
0 0 352 936
0 0 600 800
0 24 640 768
: : : :
424 512 512 544
428 460 500 596
432 440 480 624
436 476 532 548
444 468 468 604
448 464 520 560
452 452 476 604
452 484 484 572
500 500 500 500
Found 1302 solutions
real 0m9.517s
user 0m9.505s
sys 0m0.012s
That code follows:
#include <stdio.h>
int main(int argc, char *argv[]) {
int n = atoi (argv[1]);
int a, b, c, d, na, nb, nc, nd;
int count = 0;
for (a = 0, na = n; a * a <= na; a++) {
for (b = a, nb = na - a * a; b * b <= nb; b++) {
for (c = b, nc = nb - b * b; c * c <= nc; c++) {
for (d = c, nd = nc - c * c; d * d < nd; d++);
if (d * d == nd) {
printf ("%4d %4d %4d %4d\n", a, b, c, d);
count++;
}
}
}
}
printf ("Found %d solutions\n", count);
return 0;
}
And, as per a suggestion by DSM, the d loop can disappear altogether (since there's only one possible value of d (discounting negative numbers) and it can be calculated), which brings the one million case down to two seconds for me, and the ten million case to a far more manageable 68 seconds.
That version is as follows:
#include <stdio.h>
#include <math.h>
int main(int argc, char *argv[]) {
int n = atoi (argv[1]);
int a, b, c, d, na, nb, nc, nd;
int count = 0;
for (a = 0, na = n; a * a <= na; a++) {
for (b = a, nb = na - a * a; b * b <= nb; b++) {
for (c = b, nc = nb - b * b; c * c <= nc; c++) {
nd = nc - c * c;
d = sqrt (nd);
if (d * d == nd) {
printf ("%d %d %d %d\n", a, b, c, d);
count++;
}
}
}
}
printf ("Found %d solutions\n", count);
return 0;
}
(a): All timings are done with the inner printf commented out so that I/O doesn't skew the figures.

The Wikipedia page has some interesting background information, but Lagrange's four-square theorem (or, more correctly, Bachet's Theorem - Lagrange only proved it) doesn't really go into detail on how to find said squares.
As I said in my comment, the solution is going to be nontrivial. This paper discusses the solvability of four-square sums. The paper alleges that:
There is no convenient algorithm (beyond the simple one mentioned in
the second paragraph of this paper) for finding additional solutions
that are indicated by the calculation of representations, but perhaps
this will streamline the search by giving an idea of what kinds of
solutions do and do not exist.
There are a few other interesting facts related to this topic. There
exist other theorems that state that every integer can be written as a
sum of four particular multiples of squares. For example, every
integer can be written as N = a^2 + 2b^2 + 4c^2 + 14d^2. There are 54
cases like this that are true for all integers, and Ramanujan provided
the complete list in the year 1917.
For more information, see Modular Forms. This is not easy to understand unless you have some background in number theory. If you could generalize Ramanujan's 54 forms, you may have an easier time with this. With that said, in the first paper I cite, there is a small snippet which discusses an algorithm that may find every solution (even though I find it a bit hard to follow):
For example, it was reported in 1911 that the calculator Gottfried
Ruckle was asked to reduce N = 15663 as a sum of four squares. He
produced a solution of 125^2 + 6^2 + 1^2 + 1^2 in 8 seconds, followed
immediately by 125^2 + 5^2 + 3^2 + 2^2. A more difficult problem
(reflected by a first term that is farther from the original number,
with correspondingly larger later terms) took 56 seconds: 11399 = 105^2
+ 15^2 + 8^2 + 5^2. In general, the strategy is to begin by setting the first term to be the largest square below N and try to represent the
smaller remainder as a sum of three squares. Then the first term is
set to the next largest square below N, and so forth. Over time a
lightning calculator would become familiar with expressing small
numbers as sums of squares, which would speed up the process.
(Emphasis mine.)
The algorithm is described as being recursive, but it could easily be implemented iteratively.

It seems as though all integers can be made by such a combination:
0 = 0^2 + 0^2 + 0^2 + 0^2
1 = 1^2 + 0^2 + 0^2 + 0^2
2 = 1^2 + 1^2 + 0^2 + 0^2
3 = 1^2 + 1^2 + 1^2 + 0^2
4 = 2^2 + 0^2 + 0^2 + 0^2, 1^2 + 1^2 + 1^2 + 1^2 + 1^2
5 = 2^2 + 1^2 + 0^2 + 0^2
6 = 2^2 + 1^2 + 1^2 + 0^2
7 = 2^2 + 1^2 + 1^2 + 1^2
8 = 2^2 + 2^2 + 0^2 + 0^2
9 = 3^2 + 0^2 + 0^2 + 0^2, 2^2 + 2^2 + 1^2 + 0^2
10 = 3^2 + 1^2 + 0^2 + 0^2, 2^2 + 2^2 + 1^2 + 1^2
11 = 3^2 + 1^2 + 1^2 + 0^2
12 = 3^2 + 1^2 + 1^2 + 1^2, 2^2 + 2^2 + 2^2 + 0^2
.
.
.
and so forth
As I did some initial working in my head, I thought that it would be only the perfect squares that had more than 1 possible solution. However after listing them out it seems to me there is no obvious order to them. However, I thought of an algorithm I think is most appropriate for this situation:
The important thing is to use a 4-tuple (a, b, c, d). In any given 4-tuple which is a solution to a^2 + b^2 + c^2 + d^2 = n, we will set ourselves a constraint that a is always the largest of the 4, b is next, and so on and so forth like:
a >= b >= c >= d
Also note that a^2 cannot be less than n/4, otherwise the sum of the squares will have to be less than n.
Then the algorithm is:
1a. Obtain floor(square_root(n)) # this is the maximum value of a - call it max_a
1b. Obtain the first value of a such that a^2 >= n/4 - call it min_a
2. For a in a range (min_a, max_a)
At this point we have selected a particular a, and are now looking at bridging the gap from a^2 to n - i.e. (n - a^2)
3. Repeat steps 1a through 2 to select a value of b. This time instead of finding
floor(square_root(n)) we find floor(square_root(n - a^2))
and so on and so forth. So the entire algorithm would look something like:
1a. Obtain floor(square_root(n)) # this is the maximum value of a - call it max_a
1b. Obtain the first value of a such that a^2 >= n/4 - call it min_a
2. For a in a range (min_a, max_a)
3a. Obtain floor(square_root(n - a^2))
3b. Obtain the first value of b such that b^2 >= (n - a^2)/3
4. For b in a range (min_b, max_b)
5a. Obtain floor(square_root(n - a^2 - b^2))
5b. Obtain the first value of b such that b^2 >= (n - a^2 - b^2)/2
6. For c in a range (min_c, max_c)
7. We now look at (n - a^2 - b^2 - c^2). If its square root is an integer, this is d.
Otherwise, this tuple will not form a solution
At steps 3b and 5b I use (n - a^2)/3, (n - a^2 - b^2)/2. We divide by 3 or 2, respectively, because of the number of values in the tuple not yet 'fixed'.
An example:
doing this on n = 12:
1a. max_a = 3
1b. min_a = 2
2. for a in range(2, 3):
use a = 2
3a. we now look at (12 - 2^2) = 8
max_b = 2
3b. min_b = 2
4. b must be 2
5a. we now look at (12 - 2^2 - 2^2) = 4
max_c = 2
5b. min_c = 2
6. c must be 2
7. (n - a^2 - b^2 - c^2) = 0, hence d = 0
so a possible tuple is (2, 2, 2, 0)
2. use a = 3
3a. we now look at (12 - 3^2) = 3
max_b = 1
3b. min_b = 1
4. b must be 1
5a. we now look at (12 - 3^2 - 1^2) = 2
max_c = 1
5b. min_c = 1
6. c must be 1
7. (n - a^2 - b^2 - c^2) = 1, hence d = 1
so a possible tuple is (3, 1, 1, 1)
These are the only two possible tuples - hey presto!

nebffa has a great answer. one suggestion:
step 3a: max_b = min(a, floor(square_root(n - a^2))) // since b <= a
max_c and max_d can be improved in the same way too.
Here is another try:
1. generate array S: {0, 1, 2^2, 3^2,.... nr^2} where nr = floor(square_root(N)).
now the problem is to find 4 numbers from the array that sum(a, b,c,d) = N;
2. according to neffa's post (step 1a & 1b), a (which is the largest among all 4 numbers) is between [nr/2 .. nr].
We can loop a from nr down to nr/2 and calculate r = N - S[a];
now the question is to find 3 numbers from S the sum(b,c,d) = r = N -S[a];
here is code:
nr = square_root(N);
S = {0, 1, 2^2, 3^2, 4^2,.... nr^2};
for (a = nr; a >= nr/2; a--)
{
r = N - S[a];
// it is now a 3SUM problem
for(b = a; b >= 0; b--)
{
r1 = r - S[b];
if (r1 < 0)
continue;
if (r1 > N/2) // because (a^2 + b^2) >= (c^2 + d^2)
break;
for (c = 0, d = b; c <= d;)
{
sum = S[c] + S[d];
if (sum == r1)
{
print a, b, c, d;
c++; d--;
}
else if (sum < r1)
c++;
else
d--;
}
}
}
runtime is O(sqare_root(N)^3).
Here is the test result running java on my VM (time in milliseconds, result# is total num of valid combination, time 1 with printout, time2 without printout):
N result# time1 time2
----------- -------- -------- -----------
1,000,000 1302 859 281
10,000,000 6262 16109 7938
100,000,000 30912 442469 344359

Related

The sum from 1 to n in theta(log n)

Is there anyway to calculate the sum of 1 to n in Theta(log n)?
Of course, the obvious way to do it is sum = n*(n+1)/2.
However, for practicing, I want to calculate in Theta(log n).
For example,
sum=0; for(int i=1; i<=n; i++) { sum += i}
this code will calculate in Theta(n).

Fair way (without using math formulas) assumes direct summing all n values, so there is no way to avoid O(n) behavior.
If you want to make some artificial approach to provide exactly O(log(N)) time, consider, for example, using powers of two (knowing that Sum(1..2^k = 2^(k-1) + 2^(2*k-1) - for example, Sum(8) = 4 + 32). Pseudocode:
function Sum(n)
if n < 2
return n
p = 1 //2^(k-1)
p2 = 2 //2^(2*k-1)
while p * 4 < n:
p = p * 2;
p2 = p2 * 4;
return p + p2 + ///sum of 1..2^k
2 * p * (n - 2 * p) + ///(n - 2 * p) summands over 2^k include 2^k
Sum(n - 2 * p) ///sum of the rest over 2^k
Here 2*p = 2^k is the largest power of two not exceeding N. Example:
Sum(7) = Sum(4) + 5 + 6 + 7 =
Sum(4) + (4 + 1) + (4 + 2) + (4 + 3) =
Sum(4) + 3 * 4 + Sum(3) =
Sum(4) + 3 * 4 + Sum(2) + 1 * 2 + Sum(1) =
Sum(4) + 3 * 4 + Sum(2) + 1 * 2 + Sum(1) =
2 + 8 + 12 + 1 + 2 + 2 + 1 = 28

Using matrices to find the number of different ways to write n as the sum of 1, 3, and 4?

This is a question given in this presentation. Dynamic Programming
now i have implemented the algorithm using recursion and it works fine for small values. But when n is greater than 30 it becomes really slow.The presentation mentions that for large values of n one should consider something similar to
the matrix form of Fibonacci numbers .I am having trouble undestanding how to use the matrix form of Fibonacci numbers to come up with a solution.Can some one give me some hints or pseudocode
Thanks

Yes, you can use the technique from fast Fibonacci implementations to solve this problem in time O(log n)! Here's how to do it.
Let's go with your definition from the problem statement that 1 + 3 is counted the same as 3 + 1. Then you have the following recurrence relation:
A(0) = 1
A(1) = 1
A(2) = 1
A(3) = 2
A(k+4) = A(k) + A(k+1) + A(k+3)
The matrix trick here is to notice that
| 1 0 1 1 | |A( k )| |A(k) + A(k-2) + A(k-3)| |A(k+1)|
| 1 0 0 0 | |A(k-1)| | A( k ) | |A( k )|
| 0 1 0 0 | |A(k-2)| = | A(k-1) | = |A(k-1)|
| 0 0 1 0 | |A(k-3)| | A(k-2) | = |A(k-2)|
In other words, multiplying a vector of the last four values in the series produces a vector with those values shifted forward by one step.
Let's call that matrix there M. Then notice that
|A( k )| |A(k+2)|
|A(k-1)| |A(k+1)|
M^2 |A(k-2)| = |A( k )|
|A(k-3)| |A(k-1)|
In other words, multiplying by the square of this matrix shifts the series down two steps. More generally:
|A( k )| | A(k+n) |
|A(k-1)| |A(k-1 + n)|
M^n |A(k-2)| = |A(k-2 + n)|
|A(k-3)| |A(k-3 + n)|
So multiplying by Mn shifts the series down n steps. Now, if we want to know the value of A(n+3), we can just compute
|A(3)| |A(n+3)|
|A(2)| |A(n+2)|
M^n |A(1)| = |A(n+1)|
|A(0)| |A(n+2)|
and read off the top entry of the vector! This can be done in time O(log n) by using exponentiation by squaring. Here's some code that does just that. This uses a matrix library I cobbled together a while back:
#include "Matrix.hh"
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <algorithm>
using namespace std;
/* Naive implementations of A. */
uint64_t naiveA(int n) {
if (n == 0) return 1;
if (n == 1) return 1;
if (n == 2) return 1;
if (n == 3) return 2;
return naiveA(n-1) + naiveA(n-3) + naiveA(n-4);
}
/* Constructs and returns the giant matrix. */
Matrix<4, 4, uint64_t> M() {
Matrix<4, 4, uint64_t> result;
fill(result.begin(), result.end(), uint64_t(0));
result[0][0] = 1;
result[0][2] = 1;
result[0][3] = 1;
result[1][0] = 1;
result[2][1] = 1;
result[3][2] = 1;
return result;
}
/* Constructs the initial vector that we multiply the matrix by. */
Vector<4, uint64_t> initVec() {
Vector<4, uint64_t> result;
result[0] = 2;
result[1] = 1;
result[2] = 1;
result[3] = 1;
return result;
}
/* O(log n) time for raising a matrix to a power. */
Matrix<4, 4, uint64_t> fastPower(const Matrix<4, 4, uint64_t>& m, int n) {
if (n == 0) return Identity<4, uint64_t>();
auto half = fastPower(m, n / 2);
if (n % 2 == 0) return half * half;
else return half * half * m;
}
/* Fast implementation of A(n) using matrix exponentiation. */
uint64_t fastA(int n) {
if (n == 0) return 1;
if (n == 1) return 1;
if (n == 2) return 1;
if (n == 3) return 2;
auto result = fastPower(M(), n - 3) * initVec();
return result[0];
}
/* Some simple test code showing this in action! */
int main() {
for (int i = 0; i < 25; i++) {
cout << setw(2) << i << ": " << naiveA(i) << ", " << fastA(i) << endl;
}
}
Now, how would this change if 3 + 1 and 1 + 3 were treated as equivalent? This means that we can think about solving this problem in the following way:
Let A(n) be the number of ways to write n as a sum of 1s, 3s, and 4s.
Let B(n) be the number of ways to write n as a sum of 1s and 3s.
Let C(n) be the number of ways to write n as a sum of 1s.
We then have the following:
A(n) = B(n) for all n ≤ 3, since for numbers in that range the only options are to use 1s and 3s.
A(n + 4) = A(n) + B(n + 4), since your options are either (1) use a 4 or (2) not use a 4, leaving the remaining sum to use 1s and 3s.
B(n) = C(n) for all n ≤ 2, since for numbers in that range the only options are to use 1s.
B(n + 3) = B(n) + C(n + 3), sine your options are either (1) use a 3 or (2) not use a 3, leaving the remaining sum to use only 1s.
C(0) = 1, since there's only one way to write 0 as a sum of no numbers.
C(n+1) = C(n), since the only way to write something with 1s is to pull out a 1 and write the remaining number as a sum of 1s.
That's a lot to take in, but do notice the following: we ultimately care about A(n), and to evaluate it, we only need to know the values of A(n), A(n-1), A(n-2), A(n-3), B(n), B(n-1), B(n-2), B(n-3), C(n), C(n-1), C(n-2), and C(n-3).
Let's imagine, for example, that we know these twelve values for some fixed value of n. We can learn those twelve values for the next value of n as follows:
C(n+1) = C(n)
B(n+1) = B(n-2) + C(n+1) = B(n-2) + C(n)
A(n+1) = A(n-3) + B(n+1) = A(n-3) + B(n-2) + C(n)
And the remaining values then shift down.
We can formulate this as a giant matrix equation:
A( n ) A(n-1) A(n-2) A(n-3) B( n ) B(n-1) B(n-2) C( n )
| 0 0 0 1 0 0 1 1 | |A( n )| = |A(n+1)|
| 1 0 0 0 0 0 0 0 | |A(n-1)| = |A( n )|
| 0 1 0 0 0 0 0 0 | |A(n-2)| = |A(n-1)|
| 0 0 1 0 0 0 0 0 | |A(n-3)| = |A(n-2)|
| 0 0 0 0 0 0 1 1 | |B( n )| = |B(n+1)|
| 0 0 0 0 1 0 0 0 | |B(n-1)| = |B( n )|
| 0 0 0 0 0 1 0 0 | |B(n-2)| = |B(n-1)|
| 0 0 0 0 0 0 0 1 | |C( n )| = |C(n+1)|
Let's call this gigantic matrix here M. Then if we compute
|2| // A(3) = 2, since 3 = 3 or 3 = 1 + 1 + 1
|1| // A(2) = 1, since 2 = 1 + 1
|1| // A(1) = 1, since 1 = 1
M^n |1| // A(0) = 1, since 0 = (empty sum)
|2| // B(3) = 2, since 3 = 3 or 3 = 1 + 1 + 1
|1| // B(2) = 1, since 2 = 1 + 1
|1| // B(1) = 1, since 1 = 1
|1| // C(3) = 1, since 3 = 1 + 1 + 1
We'll get back a vector whose first entry is A(n+3), the number of ways to write n+3 as a sum of 1's, 3's, and 4's. (I've actually coded this up to check it - it works!) You can then use the technique for computing Fibonacci numbers using a matrix to a power efficiently that you saw with Fibonacci numbers to solve this in time O(log n).
Here's some code doing that:
#include "Matrix.hh"
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <algorithm>
using namespace std;
/* Naive implementations of A, B, and C. */
uint64_t naiveC(int n) {
return 1;
}
uint64_t naiveB(int n) {
return (n < 3? 0 : naiveB(n-3)) + naiveC(n);
}
uint64_t naiveA(int n) {
return (n < 4? 0 : naiveA(n-4)) + naiveB(n);
}
/* Constructs and returns the giant matrix. */
Matrix<8, 8, uint64_t> M() {
Matrix<8, 8, uint64_t> result;
fill(result.begin(), result.end(), uint64_t(0));
result[0][3] = 1;
result[0][6] = 1;
result[0][7] = 1;
result[1][0] = 1;
result[2][1] = 1;
result[3][2] = 1;
result[4][6] = 1;
result[4][7] = 1;
result[5][4] = 1;
result[6][5] = 1;
result[7][7] = 1;
return result;
}
/* Constructs the initial vector that we multiply the matrix by. */
Vector<8, uint64_t> initVec() {
Vector<8, uint64_t> result;
result[0] = 2;
result[1] = 1;
result[2] = 1;
result[3] = 1;
result[4] = 2;
result[5] = 1;
result[6] = 1;
result[7] = 1;
return result;
}
/* O(log n) time for raising a matrix to a power. */
Matrix<8, 8, uint64_t> fastPower(const Matrix<8, 8, uint64_t>& m, int n) {
if (n == 0) return Identity<8, uint64_t>();
auto half = fastPower(m, n / 2);
if (n % 2 == 0) return half * half;
else return half * half * m;
}
/* Fast implementation of A(n) using matrix exponentiation. */
uint64_t fastA(int n) {
if (n == 0) return 1;
if (n == 1) return 1;
if (n == 2) return 1;
if (n == 3) return 2;
auto result = fastPower(M(), n - 3) * initVec();
return result[0];
}
/* Some simple test code showing this in action! */
int main() {
for (int i = 0; i < 25; i++) {
cout << setw(2) << i << ": " << naiveA(i) << ", " << fastA(i) << endl;
}
}

This is a very interesting sequence. It is almost but not quite the order-4 Fibonacci (a.k.a. Tetranacci) numbers. Having extracted the doubling formulas for Tetranacci from its companion matrix, I could not resist doing it again for this very similar recurrence relation.
Before we get into the actual code, some definitions and a short derivation of the formulas used are in order. Define an integer sequence A such that:
A(n) := A(n-1) + A(n-3) + A(n-4)
with initial values A(0), A(1), A(2), A(3) := 1, 1, 1, 2.
For n >= 0, this is the number of integer compositions of n into parts from the set {1, 3, 4}. This is the sequence that we ultimately wish to compute.
For convenience, define a sequence T such that:
T(n) := T(n-1) + T(n-3) + T(n-4)
with initial values T(0), T(1), T(2), T(3) := 0, 0, 0, 1.
Note that A(n) and T(n) are simply shifts of each other. More precisely, A(n) = T(n+3) for all integers n. Accordingly, as elaborated by another answer, the companion matrix for both sequences is:
[0 1 0 0]
[0 0 1 0]
[0 0 0 1]
[1 1 0 1]
Call this matrix C, and let:
a, b, c, d := T(n), T(n+1), T(n+2), T(n+3)
a', b', c', d' := T(2n), T(2n+1), T(2n+2), T(2n+3)
By induction, it can easily be shown that:
[0 1 0 0]^n = [d-c-a c-b b-a a]
[0 0 1 0] [ a d-c c-b b]
[0 0 0 1] [ b b+a d-c c]
[1 1 0 1] [ c c+b b+a d]
As seen above, for any n, C^n can be fully determined from its rightmost column alone. Furthermore, multiplying C^n with its rightmost column produces the rightmost column of C^(2n):
[d-c-a c-b b-a a][a] = [a'] = [a(2d - 2c - a) + b(2c - b)]
[ a d-c c-b b][b] [b'] [ a^2 + c^2 + 2b(d - c)]
[ b b+a d-c c][c] [c'] [ b(2a + b) + c(2d - c)]
[ c c+b b+a d][d] [d'] [ b^2 + d^2 + 2c(a + b)]
Thus, if we wish to compute C^n for some n by repeated squaring, we need only perform matrix-vector multiplication per step instead of the full matrix-matrix multiplication.
Now, the implementation, in Python:
# O(n) integer additions or subtractions
def A_linearly(n):
a, b, c, d = 0, 0, 0, 1 # T(0), T(1), T(2), T(3)
if n >= 0:
for _ in range(+n):
a, b, c, d = b, c, d, a + b + d
else: # n < 0
for _ in range(-n):
a, b, c, d = d - c - a, a, b, c
return d # because A(n) = T(n+3)
# O(log n) integer multiplications, additions, subtractions.
def A_by_doubling(n):
n += 3 # because A(n) = T(n+3)
if n >= 0:
a, b, c, d = 0, 0, 0, 1 # T(0), T(1), T(2), T(3)
else: # n < 0
a, b, c, d = 1, 0, 0, 0 # T(-1), T(0), T(1), T(2)
# Unroll the final iteration to avoid computing extraneous values
for i in reversed(range(1, abs(n).bit_length())):
w = a*(2*(d - c) - a) + b*(2*c - b)
x = a*a + c*c + 2*b*(d - c)
y = b*(2*a + b) + c*(2*d - c)
z = b*b + d*d + 2*c*(a + b)
if (n >> i) & 1 == 0:
a, b, c, d = w, x, y, z
else: # (n >> i) & 1 == 1
a, b, c, d = x, y, z, w + x + z
if n & 1 == 0:
return a*(2*(d - c) - a) + b*(2*c - b) # w
else: # n & 1 == 1
return a*a + c*c + 2*b*(d - c) # x
print(all(A_linearly(n) == A_by_doubling(n) for n in range(-1000, 1001)))
Because it was rather trivial to code, the sequence is extended to negative n in the usual way. Also provided is a simple linear implementation to serve as a point of reference.
For n large enough, the logarithmic implementation above is 10-20x faster than directly exponentiating the companion matrix with numpy, by a simple (i.e. not rigorous, and likely flawed) timing comparison. And by my estimate, it would still take ~100 years to compute A(10**12)! Even though the algorithm above has room for improvement, that number is simply too large. On the other hand, computing A(10**12) mod M for some M is much more attainable.
A direct relation to Lucas and Fibonacci numbers
It turns out that T(n) is even closer to the Fibonacci and Lucas numbers than it is to Tetranacci. To see this, note that the characteristic polynomial for T(n) is x^4 - x^3 - x - 1 = 0 which factors into (x^2 - x - 1)(x^2 + 1) = 0. The first factor is the characteristic polynomial for Fibonacci & Lucas! The 4 roots of (x^2 - x - 1)(x^2 + 1) = 0 are the two Fibonacci roots, phi and psi = 1 - phi, and i and -i--the two square roots of -1.
The closed-form expression or "Binet" formula for T(n) will have the general form:
T(n) = U(n) + V(n)
U(n) = p*(phi^n) + q*(psi^n)
V(n) = r*(i^n) + s*(-i)^n
for some constant coefficients p, q, r, s.
Using the initial values for T(n), solving for the coefficients, applying some algebra, and noting that the Lucas numbers have the closed-form expression: L(n) = phi^n + psi^n, we can derive the following relations:
L(n+1) - L(n) L(n-1) F(n) + F(n-2)
U(n) = ------------- = -------- = ------------
5 5 5
where L(n) is the n'th Lucas number with L(0), L(1) := 2, 1 and F(n) is the n'th Fibonacci number with F(0), F(1) := 0, 1. And we also have:
V(n) = 1 / 5 if n = 0 (mod 4)
| -2 / 5 if n = 1 (mod 4)
| -1 / 5 if n = 2 (mod 4)
| 2 / 5 if n = 3 (mod 4)
Which is ugly, but trivial to code. Note that the numerator of V(n) can also be succinctly expressed as cos(n*pi/2) - 2sin(n*pi/2) or (3-(-1)^n) / 2 * (-1)^(n(n+1)/2), but we use the piece-wise definition for clarity.
Here's an even nicer, more direct identity:
T(n) + T(n+2) = F(n)
Essentially, we can compute T(n) (and therefore A(n)) by using Fibonacci & Lucas numbers. Theoretically, this should be much more efficient than the Tetranacci-like approach.
It is known that the Lucas numbers can computed more efficiently than Fibonacci, therefore we will compute A(n) from the Lucas numbers. The most efficient, simple Lucas number algorithm I know of is one by L.F. Johnson (see his 2010 paper: Middle and Ripple, fast simple O(lg n) algorithms for Lucas Numbers). Once we have a Lucas algorithm, we use the identity: T(n) = L(n - 1) / 5 + V(n) to compute A(n).
# O(log n) integer multiplications, additions, subtractions
def A_by_lucas(n):
n += 3 # because A(n) = T(n+3)
offset = (+1, -2, -1, +2)[n % 4]
L = lf_johnson_2010_middle(n - 1)
return (L + offset) // 5
def lf_johnson_2010_middle(n):
"-> n'th Lucas number. See [L.F. Johnson 2010a]."
#: The following Lucas identities are used:
#:
#: L(2n) = L(n)^2 - 2*(-1)^n
#: L(2n+1) = L(2n+2) - L(2n)
#: L(2n+2) = L(n+1)^2 - 2*(-1)^(n+1)
#:
#: The first and last identities are equivalent.
#: For the unrolled iteration, the following is also used:
#:
#: L(2n+1) = L(n)*L(n+1) - (-1)^n
#:
#: Since this approach uses only square multiplications per loop,
#: It turns out to be slightly faster than standard Lucas doubling,
#: which uses 1 square and 1 regular multiplication.
if n >= 0:
a, b, sign = 2, 1, +1 # L(0), L(1), (-1)^0
else: # n < 0
a, b, sign = -1, 2, -1 # L(-1), L(0), (-1)^(-1)
# unroll the last iteration to avoid computing unnecessary values
for i in reversed(range(1, abs(n).bit_length())):
a = a*a - 2*sign # L(2k)
c = b*b + 2*sign # L(2k+2)
b = c - a # L(2k+1)
sign = +1
if (n >> i) & 1:
a, b = b, c
sign = -1
if n & 1:
return a*b - sign
else:
return a*a - 2*sign
You may verify that A_by_lucas produces the same results as the previous A_by_doubling function, but is roughly 5x faster. Still not fast enough to compute A(10**12) in any reasonable amount of time!

You can easily improve your current recursion implementation by adding memoization which makes the solution fast again. C# code:
// Dictionary to store computed values
private static Dictionary<int, long> s_Solutions = new Dictionary<int, long>();
private static long Count134(int value) {
if (value == 0)
return 1;
else if (value <= 0)
return 0;
long result;
// Improvement: Do we have the value computed?
if (s_Solutions.TryGetValue(value, out result))
return result;
result = Count134(value - 4) +
Count134(value - 3) +
Count134(value - 1);
// Improvement: Store the value computed for future use
s_Solutions.Add(value, result);
return result;
}
And so you can easily call
Console.Write(Count134(500));
The outcome (which takes about 2 milliseconds) is
3350159379832610737

Number of ways to divide a number

Given a number N, print in how many ways it can be represented as
N = a + b + c + d
with
1 <= a <= b <= c <= d; 1 <= N <= M
My observation:
For N = 4: Only 1 way - 1 + 1 + 1 + 1
For N = 5: Only 1 way - 1 + 1 + 1 + 2
For N = 6: 2 ways - 1 + 1 + 1 + 3
1 + 1 + 2 + 2
For N = 7: 3 ways - 1 + 1 + 1 + 4
1 + 1 + 2 + 3
1 + 2 + 2 + 2
For N = 8: 5 ways - 1 + 1 + 1 + 5
1 + 1 + 2 + 4
1 + 1 + 3 + 3
1 + 2 + 2 + 3
2 + 2 + 2 + 2
So I have reduced it to a DP solution as follows:
DP[4] = 1, DP[5] = 1;
for(int i = 6; i <= M; i++)
DP[i] = DP[i-1] + DP[i-2];
Is my observation correct or am I missing any thing. I don't have any test cases to run on. So please let me know if the approach is correct or wrong.

It's not correct. Here is the correct one:
Lets DP[n,k] be the number of ways to represent n as sum of k numbers.
Then you are looking for DP[n,4].
DP[n,1] = 1
DP[n,2] = DP[n-2, 2] + DP[n-1,1] = n / 2
DP[n,3] = DP[n-3, 3] + DP[n-1,2]
DP[n,4] = DP[n-4, 4] + DP[n-1,3]
I will only explain the last line and you can see right away, why others are true.
Let's take one case of n=a+b+c+d.
If a > 1, then n-4 = (a-1)+(b-1)+(c-1)+(d-1) is a valid sum for DP[n-4,4].
If a = 1, then n-1 = b+c+d is a valid sum for DP[n-1,3].
Also in reverse:
For each valid n-4 = x+y+z+t we have a valid n=(x+1)+(y+1)+(z+1)+(t+1).
For each valid n-1 = x+y+z we have a valid n=1+x+y+z.

Unfortunately, your recurrence is wrong, because for n = 9, the solution is 6, not 8.
If p(n,k) is the number of ways to partition n into k non-zero integer parts, then we have
p(0,0) = 1
p(n,k) = 0 if k > n or (n > 0 and k = 0)
p(n,k) = p(n-k, k) + p(n-1, k-1)
Because there is either a partition of value 1 (in which case taking this part away yields a partition of n-1 into k-1 parts) or you can subtract 1 from each partition, yielding a partition of n - k. It's easy to show that this process is a bijection, hence the recurrence.
UPDATE:
For the specific case k = 4, OEIS tells us that there is another linear recurrence that depends only on n:
a(n) = 1 + a(n-2) + a(n-3) + a(n-4) - a(n-5) - a(n-6) - a(n-7) + a(n-9)
This recurrence can be solved via standard methods to get an explicit formula. I wrote a small SAGE script to solve it and got the following formula:
a(n) = 1/144*n^3 + 1/32*(-1)^n*n + 1/48*n^2 - 1/54*(1/2*I*sqrt(3) - 1/2)^n*(I*sqrt(3) + 3) - 1/54*(-1/2*I*sqrt(3) - 1/2)^n*(-I*sqrt(3) + 3) + 1/16*I^n + 1/16*(-I)^n + 1/32*(-1)^n - 1/32*n - 13/288
OEIS also gives the following simplification:
a(n) = round((n^3 + 3*n^2 -9*n*(n % 2))/144)
Which I have not verified.

#include <iostream>
using namespace std;
int func_count( int n, int m )
{
if(n==m)
return 1;
if(n<m)
return 0;
if ( m == 1 )
return 1;
if ( m==2 )
return (func_count(n-2,2) + func_count(n - 1, 1));
if ( m==3 )
return (func_count(n-3,3) + func_count(n - 1, 2));
return (func_count(n-1, 3) + func_count(n - 4, 4));
}
int main()
{
int t;
cin>>t;
cout<<func_count(t,4);
return 0;
}

I think that the definition of a function f(N,m,n) where N is the sum we want to produce, m is the maximum value for each term in the sum and n is the number of terms in the sum should work.
f(N,m,n) is defined for n=1 to be 0 if N > m, or N otherwise.
for n > 1, f(N,m,n) = the sum, for all t from 1 to N of f(S-t, t, n-1)
This represents setting each term, right to left.
You can then solve the problem using this relationship, probably using memoization.
For maximum n=4, and N=5000, (and implementing cleverly to quickly work out when there are 0 possibilities), I think that this is probably computable quickly enough for most purposes.

Some linear constraints seem to be ignored in function NMinimize with Mathematica 8

I'm trying to minimize a non-linear function of four variables with some linear constraints. Mathematica 8 is unable to find a good solution giving complex values of the function at some point in the iteration. This implies that one or some contraints are not being enabled in the process. Is this a bug or limitation of the optimization function ?
Function to minimize is
ff[lxw_, lwz_, c_, d_] := - J1 (lxw + lwz) - 2 J2 c +
T (-Log[2] - 1/2 (1 - lxw) Log[(1 - lxw)/4] -
1/2 (1 + lxw) Log[(1 + lxw)/4] -
1/2 (1 - lwz) Log[(1 - lwz)/4] -
1/2 (1 + lwz) Log[(1 + lwz)/4] + 1/2 (1 - d) Log[(1 - d)/16] +
1/8 (1 + 2 c + d - 2 lwz - 2 lxw) Log[
1/16 (1 + 2 c + d - 2 lwz - 2 lxw)])
where
T = 10;
J1 = 1;
J2 = -0.2;
are constant parameters. Then I try
NMinimize[{ff[lxw, lwz, c, d],
2 c + d - 2 lwz - 2 lxw >= -0.999 &&
-0.999 <= lxw <= 0.999 &&
-0.999 <= lwz <= 0.999 &&
-0.999 <= c <= 0.999 &&
d <= 0.9999}, {lxw, lwz, c, d}]
with the result
NMinimize::nrnum: "The function value 5.87777[VeryThinSpace]-4.87764\ I\n
is not a real number at {c,d,lwz,lxw} = {-0.718817,-1.28595,0.69171,-0.932461}.
I would appreciate if someone can give a hint at what is happening here.

Try this:
Clear[ff];
ff[lxw_, lwz_, c_, d_] /; 2 c + d - 2 lwz - 2 lxw >= -0.999 :=
< your function def >
This will cause the cause the function to be unevaluated in case NMinimize takes an excursion out of bounds. Sorry i cant test this from here.. If that doesn't do try asking on mathematica.stackexchange.com
Aside, why use <=.999 instead of simply < 1 ?
It just might help if you fix that too ( use integer 1, not 1. )

The warning is appearing because at the values given in the warning the last term in ff is complex, due to taking the log of a negative number, i.e.
{c, d, lwz, lxw} = {
-0.7188174745559741`,
-1.2859482844800894`,
0.6917100913968041`,
-0.9324611085040573`};
Log[1/16 (1 + 2 c + d - 2 lwz - 2 lxw)]
-2.5558 + 3.14159 i
1/16 (1 + 2 c + d - 2 lwz - 2 lxw)
-0.0776301
In Mathematica 9 a result is produced in addition to the warning :-
{-4.90045, {c -> 0.94425, d -> -0.315633, lwz -> 0.900231, lxw -> -0.191476}}
I.e.
{c, d, lwz, lxw} = {
0.9442497691706085`,
-0.31563295950647885`,
0.900230825707721`,
-0.1914760216875171`};
ff[lxw, lwz, c, d]
-4.90045

Is there any easy way to do modulus of 2^32 - 1 operation?

I just heard about that x mod (2^32-1) and x / (2^32-1) would be easy, but how?
to calculate the formula:
xn = (xn-1 + xn-1 / b)mod b.
For b = 2^32, its easy, x%(2^32) == x & (2^32-1); and x / (2^32) == x >> 32. (the ^ here is not XOR). How to do that when b = 2^32 - 1.
In the page https://en.wikipedia.org/wiki/Multiply-with-carry. They say "arithmetic for modulus 2^32 − 1 requires only a simple adjustment from that for 2^32". So what is the "simple adjustment"?

(This answer only handles the mod case.)
I'll assume that the datatype of x is more than 32 bits (this answer will actually work with any positive integer) and that it is positive (the negative case is just -(-x mod 2^32-1)), since if it at most 32 bits, the question can be answered by
x mod (2^32-1) = 0 if x == 2^32-1, x otherwise
x / (2^32 - 1) = 1 if x == 2^32-1, 0 otherwise
We can write x in base 2^32, with digits x0, x1, ..., xn. So
x = x0 + 2^32 * x1 + (2^32)^2 * x2 + ... + (2^32)^n * xn
This makes the answer clearer when we do the modulus, since 2^32 == 1 mod 2^32-1. That is
x == x0 + 1 * x1 + 1^2 * x2 + ... + 1^n * xn (mod 2^32-1)
== x0 + x1 + ... + xn (mod 2^32-1)
x mod 2^32-1 is the same as the sum of the base 2^32 digits! (we can't drop the mod 2^32-1 yet). We have two cases now, either the sum is between 0 and 2^32-1 or it is greater. In the former, we are done; in the later, we can just recur until we get between 0 and 2^32-1. Getting the digits in base 2^32 is fast, since we can use bitwise operations. In Python (this doesn't handle negative numbers):
def mod_2to32sub1(x):
s = 0 # the sum
while x > 0: # get the digits
s += x & (2**32-1)
x >>= 32
if s > 2**32-1:
return mod_2to32sub1(s)
elif s == 2**32-1:
return 0
else:
return s
(This is extremely easy to generalise to x mod 2^n-1, in fact you just replace any occurance of 32 with n in this answer.)
(EDIT: added the elif clause to avoid an infinite loop on mod_2to32sub1(2**32-1). EDIT2: replaced ^ with **... oops.)

So you compute with the "rule" 232 = 1. In general, 232+x = 2x. You can simplify 2a by taking the exponent modulo 32. Example: 266 = 22.
You can express any number in binary, and then lower the exponents. Example: the number 240 + 238 + 220 + 2 + 1 can be simplified to 28 + 26 + 220 + 2 + 1.
In general, you can group the exponents every 32 powers of 2, and "downgrade" all exponents modulo 32.
For 64 bit words, the number can be expressed as
232 A + B
where 0 <= A,B <= 232-1. Getting A and B is easy with bitwise operations.
So you can simplify that to A + B, which is much smaller: at most 233. Then, check if this number is at least 232-1, and subtract 232 - 1 in that case.
This avoids expensive direct division.

The modulus has already been explained, nevertheless, let's recapitulate.
To find the remainder of k modulo 2^n-1, write
k = a + 2^n*b, 0 <= a < 2^n
Then
k = a + ((2^n-1) + 1) * b
= (a + b) + (2^n-1)*b
≡ (a + b) (mod 2^n-1)
If a + b >= 2^n, repeat until the remainder is less than 2^n, and if that leads you to a + b = 2^n-1, replace that with 0. Each "shift right by n and add to the last n bits" moves the first set bit right by n or n-1 places (unless k < 2^(2*n-1), when the first set bit after the shift-and-add may be the 2^n bit). So if the width of the type is large compared to n, this will need many shifts - consider a 128-bit type and n = 3, for large k you will need over 40 shifts. To reduce the number of shifts required, you can exploit the fact that
2^(m*n) - 1 = (2^n - 1) * (2^((m-1)*n) + 2^((m-2)*n) + ... + 2^(2*n) + 2^n + 1),
of which we will only use that 2^n - 1 divides 2^(m*n) - 1 for all m > 0. Then you shift by multiples of n that are roughly half the maximal bit-length the value can have at that step. For the above example of a 128-bit type and the remainder modulo 7 (2^3 - 1), the closest multiples of 3 to 128/2 are 63 and 66, first shift by 63 bits
r_1 = (k & (2^63 - 1)) + (k >> 63) // r_1 < 2^63 + 2^(128-63) < 2^66
to get a number with at most 66 bits, then shift by 66/2 = 33 bits
r_2 = (r_1 & (2^33 - 1)) + (r_1 >> 33) // r_2 < 2^33 + 2^(66-33) = 2^34
to reach at most 34 bits. Next shift by 18 bits, then 9, 6, 3
r_3 = (r_2 & (2^18 - 1)) + (r_2 >> 18) // r_3 < 2^18 + 2^(34-18) < 2^19
r_4 = (r_3 & (2^9 - 1)) + (r_3 >> 9) // r_4 < 2^9 + 2^(19-9) < 2^11
r_5 = (r_4 & (2^6 - 1)) + (r_4 >> 6) // r_5 < 2^6 + 2^(11-6) < 2^7
r_6 = (r_5 & (2^3 - 1)) + (r_5 >> 3) // r_6 < 2^3 + 2^(7-3) < 2^5
r_7 = (r_6 & (2^3 - 1)) + (r_6 >> 3) // r_7 < 2^3 + 2^(5-3) < 2^4
Now a single subtraction if r_7 >= 2^3 - 1 suffices. To calculate k % (2^n -1) in a b-bit type, O(log2 (b/n)) shifts are needed.
The quotient is obtained similarly, again we write
k = a + 2^n*b, 0 <= a < 2^n
= a + ((2^n-1) + 1)*b
= (2^n-1)*b + (a+b),
so k/(2^n-1) = b + (a+b)/(2^n-1), and we continue while a+b > 2^n-1. Here we unfortunately cannot reduce the work by shifting and masking about half the width, so the method is only efficient when n is not much smaller than the width of the type.
Code for the fast cases where n is not too small:
unsigned long long modulus_2n1(unsigned n, unsigned long long k) {
unsigned long long mask = (1ULL << n) - 1ULL;
while(k > mask) {
k = (k & mask) + (k >> n);
}
return k == mask ? 0 : k;
}
unsigned long long quotient_2n1(unsigned n, unsigned long long k) {
unsigned long long mask = (1ULL << n) - 1ULL, quotient = 0;
while(k > mask) {
quotient += k >> n;
k = (k & mask) + (k >> n);
}
return k == mask ? quotient + 1 : quotient;
}
For the special case where n is half the width of the type, the loop runs at most twice, so if branches are expensive, it may be better to unroll the loop and unconditionally execute the loop body twice.

It is not. What must you have heard is x mod 2^n and x/2^n being easier. x/2^n can be performed as x>>n, and x mod 2^n, do x&(1<<n-1)

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio