Is there any easy way to do modulus of 2^32 - 1 operation? - algorithm

I just heard about that x mod (2^32-1) and x / (2^32-1) would be easy, but how?
to calculate the formula:
xn = (xn-1 + xn-1 / b)mod b.
For b = 2^32, its easy, x%(2^32) == x & (2^32-1); and x / (2^32) == x >> 32. (the ^ here is not XOR). How to do that when b = 2^32 - 1.
In the page https://en.wikipedia.org/wiki/Multiply-with-carry. They say "arithmetic for modulus 2^32 − 1 requires only a simple adjustment from that for 2^32". So what is the "simple adjustment"?

(This answer only handles the mod case.)
I'll assume that the datatype of x is more than 32 bits (this answer will actually work with any positive integer) and that it is positive (the negative case is just -(-x mod 2^32-1)), since if it at most 32 bits, the question can be answered by
x mod (2^32-1) = 0 if x == 2^32-1, x otherwise
x / (2^32 - 1) = 1 if x == 2^32-1, 0 otherwise
We can write x in base 2^32, with digits x0, x1, ..., xn. So
x = x0 + 2^32 * x1 + (2^32)^2 * x2 + ... + (2^32)^n * xn
This makes the answer clearer when we do the modulus, since 2^32 == 1 mod 2^32-1. That is
x == x0 + 1 * x1 + 1^2 * x2 + ... + 1^n * xn (mod 2^32-1)
== x0 + x1 + ... + xn (mod 2^32-1)
x mod 2^32-1 is the same as the sum of the base 2^32 digits! (we can't drop the mod 2^32-1 yet). We have two cases now, either the sum is between 0 and 2^32-1 or it is greater. In the former, we are done; in the later, we can just recur until we get between 0 and 2^32-1. Getting the digits in base 2^32 is fast, since we can use bitwise operations. In Python (this doesn't handle negative numbers):
def mod_2to32sub1(x):
s = 0 # the sum
while x > 0: # get the digits
s += x & (2**32-1)
x >>= 32
if s > 2**32-1:
return mod_2to32sub1(s)
elif s == 2**32-1:
return 0
else:
return s
(This is extremely easy to generalise to x mod 2^n-1, in fact you just replace any occurance of 32 with n in this answer.)
(EDIT: added the elif clause to avoid an infinite loop on mod_2to32sub1(2**32-1). EDIT2: replaced ^ with **... oops.)

So you compute with the "rule" 232 = 1. In general, 232+x = 2x. You can simplify 2a by taking the exponent modulo 32. Example: 266 = 22.
You can express any number in binary, and then lower the exponents. Example: the number 240 + 238 + 220 + 2 + 1 can be simplified to 28 + 26 + 220 + 2 + 1.
In general, you can group the exponents every 32 powers of 2, and "downgrade" all exponents modulo 32.
For 64 bit words, the number can be expressed as
232 A + B
where 0 <= A,B <= 232-1. Getting A and B is easy with bitwise operations.
So you can simplify that to A + B, which is much smaller: at most 233. Then, check if this number is at least 232-1, and subtract 232 - 1 in that case.
This avoids expensive direct division.

The modulus has already been explained, nevertheless, let's recapitulate.
To find the remainder of k modulo 2^n-1, write
k = a + 2^n*b, 0 <= a < 2^n
Then
k = a + ((2^n-1) + 1) * b
= (a + b) + (2^n-1)*b
≡ (a + b) (mod 2^n-1)
If a + b >= 2^n, repeat until the remainder is less than 2^n, and if that leads you to a + b = 2^n-1, replace that with 0. Each "shift right by n and add to the last n bits" moves the first set bit right by n or n-1 places (unless k < 2^(2*n-1), when the first set bit after the shift-and-add may be the 2^n bit). So if the width of the type is large compared to n, this will need many shifts - consider a 128-bit type and n = 3, for large k you will need over 40 shifts. To reduce the number of shifts required, you can exploit the fact that
2^(m*n) - 1 = (2^n - 1) * (2^((m-1)*n) + 2^((m-2)*n) + ... + 2^(2*n) + 2^n + 1),
of which we will only use that 2^n - 1 divides 2^(m*n) - 1 for all m > 0. Then you shift by multiples of n that are roughly half the maximal bit-length the value can have at that step. For the above example of a 128-bit type and the remainder modulo 7 (2^3 - 1), the closest multiples of 3 to 128/2 are 63 and 66, first shift by 63 bits
r_1 = (k & (2^63 - 1)) + (k >> 63) // r_1 < 2^63 + 2^(128-63) < 2^66
to get a number with at most 66 bits, then shift by 66/2 = 33 bits
r_2 = (r_1 & (2^33 - 1)) + (r_1 >> 33) // r_2 < 2^33 + 2^(66-33) = 2^34
to reach at most 34 bits. Next shift by 18 bits, then 9, 6, 3
r_3 = (r_2 & (2^18 - 1)) + (r_2 >> 18) // r_3 < 2^18 + 2^(34-18) < 2^19
r_4 = (r_3 & (2^9 - 1)) + (r_3 >> 9) // r_4 < 2^9 + 2^(19-9) < 2^11
r_5 = (r_4 & (2^6 - 1)) + (r_4 >> 6) // r_5 < 2^6 + 2^(11-6) < 2^7
r_6 = (r_5 & (2^3 - 1)) + (r_5 >> 3) // r_6 < 2^3 + 2^(7-3) < 2^5
r_7 = (r_6 & (2^3 - 1)) + (r_6 >> 3) // r_7 < 2^3 + 2^(5-3) < 2^4
Now a single subtraction if r_7 >= 2^3 - 1 suffices. To calculate k % (2^n -1) in a b-bit type, O(log2 (b/n)) shifts are needed.
The quotient is obtained similarly, again we write
k = a + 2^n*b, 0 <= a < 2^n
= a + ((2^n-1) + 1)*b
= (2^n-1)*b + (a+b),
so k/(2^n-1) = b + (a+b)/(2^n-1), and we continue while a+b > 2^n-1. Here we unfortunately cannot reduce the work by shifting and masking about half the width, so the method is only efficient when n is not much smaller than the width of the type.
Code for the fast cases where n is not too small:
unsigned long long modulus_2n1(unsigned n, unsigned long long k) {
unsigned long long mask = (1ULL << n) - 1ULL;
while(k > mask) {
k = (k & mask) + (k >> n);
}
return k == mask ? 0 : k;
}
unsigned long long quotient_2n1(unsigned n, unsigned long long k) {
unsigned long long mask = (1ULL << n) - 1ULL, quotient = 0;
while(k > mask) {
quotient += k >> n;
k = (k & mask) + (k >> n);
}
return k == mask ? quotient + 1 : quotient;
}
For the special case where n is half the width of the type, the loop runs at most twice, so if branches are expensive, it may be better to unroll the loop and unconditionally execute the loop body twice.

It is not. What must you have heard is x mod 2^n and x/2^n being easier. x/2^n can be performed as x>>n, and x mod 2^n, do x&(1<<n-1)

Related

Using matrices to find the number of different ways to write n as the sum of 1, 3, and 4?

This is a question given in this presentation. Dynamic Programming
now i have implemented the algorithm using recursion and it works fine for small values. But when n is greater than 30 it becomes really slow.The presentation mentions that for large values of n one should consider something similar to
the matrix form of Fibonacci numbers .I am having trouble undestanding how to use the matrix form of Fibonacci numbers to come up with a solution.Can some one give me some hints or pseudocode
Thanks
Yes, you can use the technique from fast Fibonacci implementations to solve this problem in time O(log n)! Here's how to do it.
Let's go with your definition from the problem statement that 1 + 3 is counted the same as 3 + 1. Then you have the following recurrence relation:
A(0) = 1
A(1) = 1
A(2) = 1
A(3) = 2
A(k+4) = A(k) + A(k+1) + A(k+3)
The matrix trick here is to notice that
| 1 0 1 1 | |A( k )| |A(k) + A(k-2) + A(k-3)| |A(k+1)|
| 1 0 0 0 | |A(k-1)| | A( k ) | |A( k )|
| 0 1 0 0 | |A(k-2)| = | A(k-1) | = |A(k-1)|
| 0 0 1 0 | |A(k-3)| | A(k-2) | = |A(k-2)|
In other words, multiplying a vector of the last four values in the series produces a vector with those values shifted forward by one step.
Let's call that matrix there M. Then notice that
|A( k )| |A(k+2)|
|A(k-1)| |A(k+1)|
M^2 |A(k-2)| = |A( k )|
|A(k-3)| |A(k-1)|
In other words, multiplying by the square of this matrix shifts the series down two steps. More generally:
|A( k )| | A(k+n) |
|A(k-1)| |A(k-1 + n)|
M^n |A(k-2)| = |A(k-2 + n)|
|A(k-3)| |A(k-3 + n)|
So multiplying by Mn shifts the series down n steps. Now, if we want to know the value of A(n+3), we can just compute
|A(3)| |A(n+3)|
|A(2)| |A(n+2)|
M^n |A(1)| = |A(n+1)|
|A(0)| |A(n+2)|
and read off the top entry of the vector! This can be done in time O(log n) by using exponentiation by squaring. Here's some code that does just that. This uses a matrix library I cobbled together a while back:
#include "Matrix.hh"
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <algorithm>
using namespace std;
/* Naive implementations of A. */
uint64_t naiveA(int n) {
if (n == 0) return 1;
if (n == 1) return 1;
if (n == 2) return 1;
if (n == 3) return 2;
return naiveA(n-1) + naiveA(n-3) + naiveA(n-4);
}
/* Constructs and returns the giant matrix. */
Matrix<4, 4, uint64_t> M() {
Matrix<4, 4, uint64_t> result;
fill(result.begin(), result.end(), uint64_t(0));
result[0][0] = 1;
result[0][2] = 1;
result[0][3] = 1;
result[1][0] = 1;
result[2][1] = 1;
result[3][2] = 1;
return result;
}
/* Constructs the initial vector that we multiply the matrix by. */
Vector<4, uint64_t> initVec() {
Vector<4, uint64_t> result;
result[0] = 2;
result[1] = 1;
result[2] = 1;
result[3] = 1;
return result;
}
/* O(log n) time for raising a matrix to a power. */
Matrix<4, 4, uint64_t> fastPower(const Matrix<4, 4, uint64_t>& m, int n) {
if (n == 0) return Identity<4, uint64_t>();
auto half = fastPower(m, n / 2);
if (n % 2 == 0) return half * half;
else return half * half * m;
}
/* Fast implementation of A(n) using matrix exponentiation. */
uint64_t fastA(int n) {
if (n == 0) return 1;
if (n == 1) return 1;
if (n == 2) return 1;
if (n == 3) return 2;
auto result = fastPower(M(), n - 3) * initVec();
return result[0];
}
/* Some simple test code showing this in action! */
int main() {
for (int i = 0; i < 25; i++) {
cout << setw(2) << i << ": " << naiveA(i) << ", " << fastA(i) << endl;
}
}
Now, how would this change if 3 + 1 and 1 + 3 were treated as equivalent? This means that we can think about solving this problem in the following way:
Let A(n) be the number of ways to write n as a sum of 1s, 3s, and 4s.
Let B(n) be the number of ways to write n as a sum of 1s and 3s.
Let C(n) be the number of ways to write n as a sum of 1s.
We then have the following:
A(n) = B(n) for all n ≤ 3, since for numbers in that range the only options are to use 1s and 3s.
A(n + 4) = A(n) + B(n + 4), since your options are either (1) use a 4 or (2) not use a 4, leaving the remaining sum to use 1s and 3s.
B(n) = C(n) for all n ≤ 2, since for numbers in that range the only options are to use 1s.
B(n + 3) = B(n) + C(n + 3), sine your options are either (1) use a 3 or (2) not use a 3, leaving the remaining sum to use only 1s.
C(0) = 1, since there's only one way to write 0 as a sum of no numbers.
C(n+1) = C(n), since the only way to write something with 1s is to pull out a 1 and write the remaining number as a sum of 1s.
That's a lot to take in, but do notice the following: we ultimately care about A(n), and to evaluate it, we only need to know the values of A(n), A(n-1), A(n-2), A(n-3), B(n), B(n-1), B(n-2), B(n-3), C(n), C(n-1), C(n-2), and C(n-3).
Let's imagine, for example, that we know these twelve values for some fixed value of n. We can learn those twelve values for the next value of n as follows:
C(n+1) = C(n)
B(n+1) = B(n-2) + C(n+1) = B(n-2) + C(n)
A(n+1) = A(n-3) + B(n+1) = A(n-3) + B(n-2) + C(n)
And the remaining values then shift down.
We can formulate this as a giant matrix equation:
A( n ) A(n-1) A(n-2) A(n-3) B( n ) B(n-1) B(n-2) C( n )
| 0 0 0 1 0 0 1 1 | |A( n )| = |A(n+1)|
| 1 0 0 0 0 0 0 0 | |A(n-1)| = |A( n )|
| 0 1 0 0 0 0 0 0 | |A(n-2)| = |A(n-1)|
| 0 0 1 0 0 0 0 0 | |A(n-3)| = |A(n-2)|
| 0 0 0 0 0 0 1 1 | |B( n )| = |B(n+1)|
| 0 0 0 0 1 0 0 0 | |B(n-1)| = |B( n )|
| 0 0 0 0 0 1 0 0 | |B(n-2)| = |B(n-1)|
| 0 0 0 0 0 0 0 1 | |C( n )| = |C(n+1)|
Let's call this gigantic matrix here M. Then if we compute
|2| // A(3) = 2, since 3 = 3 or 3 = 1 + 1 + 1
|1| // A(2) = 1, since 2 = 1 + 1
|1| // A(1) = 1, since 1 = 1
M^n |1| // A(0) = 1, since 0 = (empty sum)
|2| // B(3) = 2, since 3 = 3 or 3 = 1 + 1 + 1
|1| // B(2) = 1, since 2 = 1 + 1
|1| // B(1) = 1, since 1 = 1
|1| // C(3) = 1, since 3 = 1 + 1 + 1
We'll get back a vector whose first entry is A(n+3), the number of ways to write n+3 as a sum of 1's, 3's, and 4's. (I've actually coded this up to check it - it works!) You can then use the technique for computing Fibonacci numbers using a matrix to a power efficiently that you saw with Fibonacci numbers to solve this in time O(log n).
Here's some code doing that:
#include "Matrix.hh"
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <algorithm>
using namespace std;
/* Naive implementations of A, B, and C. */
uint64_t naiveC(int n) {
return 1;
}
uint64_t naiveB(int n) {
return (n < 3? 0 : naiveB(n-3)) + naiveC(n);
}
uint64_t naiveA(int n) {
return (n < 4? 0 : naiveA(n-4)) + naiveB(n);
}
/* Constructs and returns the giant matrix. */
Matrix<8, 8, uint64_t> M() {
Matrix<8, 8, uint64_t> result;
fill(result.begin(), result.end(), uint64_t(0));
result[0][3] = 1;
result[0][6] = 1;
result[0][7] = 1;
result[1][0] = 1;
result[2][1] = 1;
result[3][2] = 1;
result[4][6] = 1;
result[4][7] = 1;
result[5][4] = 1;
result[6][5] = 1;
result[7][7] = 1;
return result;
}
/* Constructs the initial vector that we multiply the matrix by. */
Vector<8, uint64_t> initVec() {
Vector<8, uint64_t> result;
result[0] = 2;
result[1] = 1;
result[2] = 1;
result[3] = 1;
result[4] = 2;
result[5] = 1;
result[6] = 1;
result[7] = 1;
return result;
}
/* O(log n) time for raising a matrix to a power. */
Matrix<8, 8, uint64_t> fastPower(const Matrix<8, 8, uint64_t>& m, int n) {
if (n == 0) return Identity<8, uint64_t>();
auto half = fastPower(m, n / 2);
if (n % 2 == 0) return half * half;
else return half * half * m;
}
/* Fast implementation of A(n) using matrix exponentiation. */
uint64_t fastA(int n) {
if (n == 0) return 1;
if (n == 1) return 1;
if (n == 2) return 1;
if (n == 3) return 2;
auto result = fastPower(M(), n - 3) * initVec();
return result[0];
}
/* Some simple test code showing this in action! */
int main() {
for (int i = 0; i < 25; i++) {
cout << setw(2) << i << ": " << naiveA(i) << ", " << fastA(i) << endl;
}
}
This is a very interesting sequence. It is almost but not quite the order-4 Fibonacci (a.k.a. Tetranacci) numbers. Having extracted the doubling formulas for Tetranacci from its companion matrix, I could not resist doing it again for this very similar recurrence relation.
Before we get into the actual code, some definitions and a short derivation of the formulas used are in order. Define an integer sequence A such that:
A(n) := A(n-1) + A(n-3) + A(n-4)
with initial values A(0), A(1), A(2), A(3) := 1, 1, 1, 2.
For n >= 0, this is the number of integer compositions of n into parts from the set {1, 3, 4}. This is the sequence that we ultimately wish to compute.
For convenience, define a sequence T such that:
T(n) := T(n-1) + T(n-3) + T(n-4)
with initial values T(0), T(1), T(2), T(3) := 0, 0, 0, 1.
Note that A(n) and T(n) are simply shifts of each other. More precisely, A(n) = T(n+3) for all integers n. Accordingly, as elaborated by another answer, the companion matrix for both sequences is:
[0 1 0 0]
[0 0 1 0]
[0 0 0 1]
[1 1 0 1]
Call this matrix C, and let:
a, b, c, d := T(n), T(n+1), T(n+2), T(n+3)
a', b', c', d' := T(2n), T(2n+1), T(2n+2), T(2n+3)
By induction, it can easily be shown that:
[0 1 0 0]^n = [d-c-a c-b b-a a]
[0 0 1 0] [ a d-c c-b b]
[0 0 0 1] [ b b+a d-c c]
[1 1 0 1] [ c c+b b+a d]
As seen above, for any n, C^n can be fully determined from its rightmost column alone. Furthermore, multiplying C^n with its rightmost column produces the rightmost column of C^(2n):
[d-c-a c-b b-a a][a] = [a'] = [a(2d - 2c - a) + b(2c - b)]
[ a d-c c-b b][b] [b'] [ a^2 + c^2 + 2b(d - c)]
[ b b+a d-c c][c] [c'] [ b(2a + b) + c(2d - c)]
[ c c+b b+a d][d] [d'] [ b^2 + d^2 + 2c(a + b)]
Thus, if we wish to compute C^n for some n by repeated squaring, we need only perform matrix-vector multiplication per step instead of the full matrix-matrix multiplication.
Now, the implementation, in Python:
# O(n) integer additions or subtractions
def A_linearly(n):
a, b, c, d = 0, 0, 0, 1 # T(0), T(1), T(2), T(3)
if n >= 0:
for _ in range(+n):
a, b, c, d = b, c, d, a + b + d
else: # n < 0
for _ in range(-n):
a, b, c, d = d - c - a, a, b, c
return d # because A(n) = T(n+3)
# O(log n) integer multiplications, additions, subtractions.
def A_by_doubling(n):
n += 3 # because A(n) = T(n+3)
if n >= 0:
a, b, c, d = 0, 0, 0, 1 # T(0), T(1), T(2), T(3)
else: # n < 0
a, b, c, d = 1, 0, 0, 0 # T(-1), T(0), T(1), T(2)
# Unroll the final iteration to avoid computing extraneous values
for i in reversed(range(1, abs(n).bit_length())):
w = a*(2*(d - c) - a) + b*(2*c - b)
x = a*a + c*c + 2*b*(d - c)
y = b*(2*a + b) + c*(2*d - c)
z = b*b + d*d + 2*c*(a + b)
if (n >> i) & 1 == 0:
a, b, c, d = w, x, y, z
else: # (n >> i) & 1 == 1
a, b, c, d = x, y, z, w + x + z
if n & 1 == 0:
return a*(2*(d - c) - a) + b*(2*c - b) # w
else: # n & 1 == 1
return a*a + c*c + 2*b*(d - c) # x
print(all(A_linearly(n) == A_by_doubling(n) for n in range(-1000, 1001)))
Because it was rather trivial to code, the sequence is extended to negative n in the usual way. Also provided is a simple linear implementation to serve as a point of reference.
For n large enough, the logarithmic implementation above is 10-20x faster than directly exponentiating the companion matrix with numpy, by a simple (i.e. not rigorous, and likely flawed) timing comparison. And by my estimate, it would still take ~100 years to compute A(10**12)! Even though the algorithm above has room for improvement, that number is simply too large. On the other hand, computing A(10**12) mod M for some M is much more attainable.
A direct relation to Lucas and Fibonacci numbers
It turns out that T(n) is even closer to the Fibonacci and Lucas numbers than it is to Tetranacci. To see this, note that the characteristic polynomial for T(n) is x^4 - x^3 - x - 1 = 0 which factors into (x^2 - x - 1)(x^2 + 1) = 0. The first factor is the characteristic polynomial for Fibonacci & Lucas! The 4 roots of (x^2 - x - 1)(x^2 + 1) = 0 are the two Fibonacci roots, phi and psi = 1 - phi, and i and -i--the two square roots of -1.
The closed-form expression or "Binet" formula for T(n) will have the general form:
T(n) = U(n) + V(n)
U(n) = p*(phi^n) + q*(psi^n)
V(n) = r*(i^n) + s*(-i)^n
for some constant coefficients p, q, r, s.
Using the initial values for T(n), solving for the coefficients, applying some algebra, and noting that the Lucas numbers have the closed-form expression: L(n) = phi^n + psi^n, we can derive the following relations:
L(n+1) - L(n) L(n-1) F(n) + F(n-2)
U(n) = ------------- = -------- = ------------
5 5 5
where L(n) is the n'th Lucas number with L(0), L(1) := 2, 1 and F(n) is the n'th Fibonacci number with F(0), F(1) := 0, 1. And we also have:
V(n) = 1 / 5 if n = 0 (mod 4)
| -2 / 5 if n = 1 (mod 4)
| -1 / 5 if n = 2 (mod 4)
| 2 / 5 if n = 3 (mod 4)
Which is ugly, but trivial to code. Note that the numerator of V(n) can also be succinctly expressed as cos(n*pi/2) - 2sin(n*pi/2) or (3-(-1)^n) / 2 * (-1)^(n(n+1)/2), but we use the piece-wise definition for clarity.
Here's an even nicer, more direct identity:
T(n) + T(n+2) = F(n)
Essentially, we can compute T(n) (and therefore A(n)) by using Fibonacci & Lucas numbers. Theoretically, this should be much more efficient than the Tetranacci-like approach.
It is known that the Lucas numbers can computed more efficiently than Fibonacci, therefore we will compute A(n) from the Lucas numbers. The most efficient, simple Lucas number algorithm I know of is one by L.F. Johnson (see his 2010 paper: Middle and Ripple, fast simple O(lg n) algorithms for Lucas Numbers). Once we have a Lucas algorithm, we use the identity: T(n) = L(n - 1) / 5 + V(n) to compute A(n).
# O(log n) integer multiplications, additions, subtractions
def A_by_lucas(n):
n += 3 # because A(n) = T(n+3)
offset = (+1, -2, -1, +2)[n % 4]
L = lf_johnson_2010_middle(n - 1)
return (L + offset) // 5
def lf_johnson_2010_middle(n):
"-> n'th Lucas number. See [L.F. Johnson 2010a]."
#: The following Lucas identities are used:
#:
#: L(2n) = L(n)^2 - 2*(-1)^n
#: L(2n+1) = L(2n+2) - L(2n)
#: L(2n+2) = L(n+1)^2 - 2*(-1)^(n+1)
#:
#: The first and last identities are equivalent.
#: For the unrolled iteration, the following is also used:
#:
#: L(2n+1) = L(n)*L(n+1) - (-1)^n
#:
#: Since this approach uses only square multiplications per loop,
#: It turns out to be slightly faster than standard Lucas doubling,
#: which uses 1 square and 1 regular multiplication.
if n >= 0:
a, b, sign = 2, 1, +1 # L(0), L(1), (-1)^0
else: # n < 0
a, b, sign = -1, 2, -1 # L(-1), L(0), (-1)^(-1)
# unroll the last iteration to avoid computing unnecessary values
for i in reversed(range(1, abs(n).bit_length())):
a = a*a - 2*sign # L(2k)
c = b*b + 2*sign # L(2k+2)
b = c - a # L(2k+1)
sign = +1
if (n >> i) & 1:
a, b = b, c
sign = -1
if n & 1:
return a*b - sign
else:
return a*a - 2*sign
You may verify that A_by_lucas produces the same results as the previous A_by_doubling function, but is roughly 5x faster. Still not fast enough to compute A(10**12) in any reasonable amount of time!
You can easily improve your current recursion implementation by adding memoization which makes the solution fast again. C# code:
// Dictionary to store computed values
private static Dictionary<int, long> s_Solutions = new Dictionary<int, long>();
private static long Count134(int value) {
if (value == 0)
return 1;
else if (value <= 0)
return 0;
long result;
// Improvement: Do we have the value computed?
if (s_Solutions.TryGetValue(value, out result))
return result;
result = Count134(value - 4) +
Count134(value - 3) +
Count134(value - 1);
// Improvement: Store the value computed for future use
s_Solutions.Add(value, result);
return result;
}
And so you can easily call
Console.Write(Count134(500));
The outcome (which takes about 2 milliseconds) is
3350159379832610737

Find the Maximum Element in any SubMatrix of Matrix

I am giving a Matrix of N x M. For a Submatrix of Length X which starts at position (a, b) i have to find the largest element present in a Submatrix.
My Approach:
Do as the question says:
Simple 2 loops
for(i in range(a, a + x))
for(j in range(b, b + x)) max = max(max,A[i][j]) // N * M
A little Advance:
1. Make a segment tree for every i in range(0, N)
2. for i in range(a, a + x) query(b, b + x) // N * logM
Is there any better solution having O(log n) complexity only ?
A Sparse Table Algorithm Approach
:- <O( N x M x log(N) x log(M)) , O(1)>.
Precomputation Time - O( N x M x log(N) x log(M))
Query Time - O(1)
For understanding this method you should have knowledge of finding RMQ using sparse Table Algorithm for one dimension.
We can use 2D Sparse Table Algorithm for finding Range Minimum Query.
What we do in One Dimension:-
we preprocess RMQ for sub arrays of length 2^k using dynamic programming. We will keep an array M[0, N-1][0, logN] where M[i][j] is the index of the minimum value in the sub array starting at i.
For calculating M[i][j] we must search for the minimum value in the first and second half of the interval. It’s obvious that the small pieces have 2^(j – 1) length, so the pseudo code for calculation this is:-
if (A[M[i][j-1]] < A[M[i + 2^(j-1) -1][j-1]])
M[i][j] = M[i][j-1]
else
M[i][j] = M[i + 2^(j-1) -1][j-1]
Here A is actual array which stores values.Once we have these values preprocessed, let’s show how we can use them to calculate RMQ(i, j). The idea is to select two blocks that entirely cover the interval [i..j] and find the minimum between them. Let k = [log(j - i + 1)]. For computing RMQ(i, j) we can use the following formula:-
if (A[M[i][k]] <= A[M[j - 2^k + 1][k]])
RMQ(i, j) = A[M[i][k]]
else
RMQ(i , j) = A[M[j - 2^k + 1][k]]
For 2 Dimension :-
Similarly We can extend above rule for 2 Dimension also , here we preprocess RMQ for sub matrix of length 2^K, 2^L using dynamic programming & keep an array M[0,N-1][0, M-1][0, logN][0, logM]. Where M[x][y][k][l] is the index of the minimum value in the sub matrix starting at [x , y] and having length 2^K, 2^L respectively.
pseudo code for calculation M[x][y][k][l] is:-
M[x][y][i][j] = GetMinimum(M[x][y][i-1][j-1], M[x + (2^(i-1))][y][i-1][j-1], M[x][y+(2^(j-1))][i-1][j-1], M[x + (2^(i-1))][y+(2^(j-1))][i-1][j-1])
Here GetMinimum function will return the index of minimum element from provided elements. Now we have preprocessed, let's see how to calculate RMQ(x, y, x1, y1). Here [x, y] starting point of sub matrix and [x1, y1] represent end point of sub matrix means bottom right point of sub matrix. Here we have to select four sub matrices blocks that entirely cover [x, y, x1, y1] and find minimum of them. Let k = [log(x1 - x + 1)] & l = [log(y1 - y + 1)]. For computing RMQ(x, y, x1, y1) we can use following formula:-
RMQ(x, y, x1, y1) = GetMinimum(M[x][y][k][l], M[x1 - (2^k) + 1][y][k][l], M[x][y1 - (2^l) + 1][k][l], M[x1 - (2^k) + 1][y1 - (2^l) + 1][k][l]);
pseudo code for above logic:-
// remember Array 'M' store index of actual matrix 'P' so for comparing values in GetMinimum function compare the values of array 'P' not of array 'M'
SparseMatrix(n , m){ // n , m is dimension of matrix.
for i = 0 to 2^i <= n:
for j = 0 to 2^j <= m:
for x = 0 to x + 2^i -1 < n :
for y = 0 to y + (2^j) -1 < m:
if i == 0 and j == 0:
M[x][y][i][j] = Pair(x , y) // store x, y
else if i == 0:
M[x][y][i][j] = GetMinimum(M[x][y][i][j-1], M[x][y+(2^(j-1))][i][j-1])
else if j == 0:
M[x][y][i][j] = GetMinimum(M[x][y][i-1][j], M[x+ (2^(i-1))][y][i-1][j])
else
M[x][y][i][j] = GetMinimum(M[x][y][i-1][j-1], M[x + (2^(i-1))][y][i-1][j-1], M[x][y+(2^(j-1))][i-1][j-1], M[x + (2^(i-1))][y+(2^(j-1))][i-1][j-1]);
}
RMQ(x, y, x1, y1){
k = log(x1 - x + 1)
l = log(y1 - y + 1)
ans = GetMinimum(M[x][y][k][l], M[x1 - (2^k) + 1][y][k][l], M[x][y1 - (2^l) + 1][k][l], M[x1 - (2^k) + 1][y1 - (2^l) + 1][k][l]);
return P[ans->x][ans->y] // ans->x represent Row number stored in ans and similarly ans->y represent column stored in ans
}
Here is the sample code in c++, for the pseudo code given by #Chapta, as was requested by some user.
int M[1000][1000][10][10];
int **matrix;
void precompute_max(){
for (int i = 0 ; (1<<i) <= n; i += 1){
for(int j = 0 ; (1<<j) <= m ; j += 1){
for (int x = 0 ; x + (1<<i) -1 < n; x+= 1){
for (int y = 0 ; y + (1<<j) -1 < m; y+= 1){
if (i == 0 and j == 0)
M[x][y][i][j] = matrix[x][y]; // store x, y
else if (i == 0)
M[x][y][i][j] = max(M[x][y][i][j-1], M[x][y+(1<<(j-1))][i][j-1]);
else if (j == 0)
M[x][y][i][j] = max(M[x][y][i-1][j], M[x+ (1<<(i-1))][y][i-1][j]);
else
M[x][y][i][j] = max(M[x][y][i-1][j-1], M[x + (1<<(i-1))][y][i-1][j-1], M[x][y+(1<<(j-1))][i-1][j-1], M[x + (1<<(i-1))][y+(1<<(j-1))][i-1][j-1]);
// cout << "from i="<<x<<" j="<<y<<" of length="<<(1<<i)<<" and length="<<(1<<j) <<"max is: " << M[x][y][i][j] << endl;
}
}
}
}
}
int compute_max(int x, int y, int x1, int y1){
int k = log2(x1 - x + 1);
int l = log2(y1 - y + 1);
// cout << "Value of k="<<k<<" l="<<l<<endl;
int ans = max(M[x][y][k][l], M[x1 - (1<<k) + 1][y][k][l], M[x][y1 - (1<<l) + 1][k][l], M[x1 - (1<<k) + 1][y1 - (1<<l) + 1][k][l]);
return ans;
}
This code first precomputes, the 2 dimensional sparse table, and then queries it in constant time.
Additional info: the sparse table stores the maximum element and not the indices to the maximum element.
AFAIK, there can be no O(logn approach) as the matrix follows no order. However, if you have an order such that every row is sorted in ascending from left to right and every column is sorted ascending from up to down, then you know that A[a+x][b+x] (bottom-right cell of the submatrix) is the largest element in that submatrix. Thus, finding the maximum takes O(1) time once the matrix is sorted. However, sorting the matrix, if not already sorted, will cost O(NxM log{NxM})

Number of ways to divide a number

Given a number N, print in how many ways it can be represented as
N = a + b + c + d
with
1 <= a <= b <= c <= d; 1 <= N <= M
My observation:
For N = 4: Only 1 way - 1 + 1 + 1 + 1
For N = 5: Only 1 way - 1 + 1 + 1 + 2
For N = 6: 2 ways - 1 + 1 + 1 + 3
1 + 1 + 2 + 2
For N = 7: 3 ways - 1 + 1 + 1 + 4
1 + 1 + 2 + 3
1 + 2 + 2 + 2
For N = 8: 5 ways - 1 + 1 + 1 + 5
1 + 1 + 2 + 4
1 + 1 + 3 + 3
1 + 2 + 2 + 3
2 + 2 + 2 + 2
So I have reduced it to a DP solution as follows:
DP[4] = 1, DP[5] = 1;
for(int i = 6; i <= M; i++)
DP[i] = DP[i-1] + DP[i-2];
Is my observation correct or am I missing any thing. I don't have any test cases to run on. So please let me know if the approach is correct or wrong.
It's not correct. Here is the correct one:
Lets DP[n,k] be the number of ways to represent n as sum of k numbers.
Then you are looking for DP[n,4].
DP[n,1] = 1
DP[n,2] = DP[n-2, 2] + DP[n-1,1] = n / 2
DP[n,3] = DP[n-3, 3] + DP[n-1,2]
DP[n,4] = DP[n-4, 4] + DP[n-1,3]
I will only explain the last line and you can see right away, why others are true.
Let's take one case of n=a+b+c+d.
If a > 1, then n-4 = (a-1)+(b-1)+(c-1)+(d-1) is a valid sum for DP[n-4,4].
If a = 1, then n-1 = b+c+d is a valid sum for DP[n-1,3].
Also in reverse:
For each valid n-4 = x+y+z+t we have a valid n=(x+1)+(y+1)+(z+1)+(t+1).
For each valid n-1 = x+y+z we have a valid n=1+x+y+z.
Unfortunately, your recurrence is wrong, because for n = 9, the solution is 6, not 8.
If p(n,k) is the number of ways to partition n into k non-zero integer parts, then we have
p(0,0) = 1
p(n,k) = 0 if k > n or (n > 0 and k = 0)
p(n,k) = p(n-k, k) + p(n-1, k-1)
Because there is either a partition of value 1 (in which case taking this part away yields a partition of n-1 into k-1 parts) or you can subtract 1 from each partition, yielding a partition of n - k. It's easy to show that this process is a bijection, hence the recurrence.
UPDATE:
For the specific case k = 4, OEIS tells us that there is another linear recurrence that depends only on n:
a(n) = 1 + a(n-2) + a(n-3) + a(n-4) - a(n-5) - a(n-6) - a(n-7) + a(n-9)
This recurrence can be solved via standard methods to get an explicit formula. I wrote a small SAGE script to solve it and got the following formula:
a(n) = 1/144*n^3 + 1/32*(-1)^n*n + 1/48*n^2 - 1/54*(1/2*I*sqrt(3) - 1/2)^n*(I*sqrt(3) + 3) - 1/54*(-1/2*I*sqrt(3) - 1/2)^n*(-I*sqrt(3) + 3) + 1/16*I^n + 1/16*(-I)^n + 1/32*(-1)^n - 1/32*n - 13/288
OEIS also gives the following simplification:
a(n) = round((n^3 + 3*n^2 -9*n*(n % 2))/144)
Which I have not verified.
#include <iostream>
using namespace std;
int func_count( int n, int m )
{
if(n==m)
return 1;
if(n<m)
return 0;
if ( m == 1 )
return 1;
if ( m==2 )
return (func_count(n-2,2) + func_count(n - 1, 1));
if ( m==3 )
return (func_count(n-3,3) + func_count(n - 1, 2));
return (func_count(n-1, 3) + func_count(n - 4, 4));
}
int main()
{
int t;
cin>>t;
cout<<func_count(t,4);
return 0;
}
I think that the definition of a function f(N,m,n) where N is the sum we want to produce, m is the maximum value for each term in the sum and n is the number of terms in the sum should work.
f(N,m,n) is defined for n=1 to be 0 if N > m, or N otherwise.
for n > 1, f(N,m,n) = the sum, for all t from 1 to N of f(S-t, t, n-1)
This represents setting each term, right to left.
You can then solve the problem using this relationship, probably using memoization.
For maximum n=4, and N=5000, (and implementing cleverly to quickly work out when there are 0 possibilities), I think that this is probably computable quickly enough for most purposes.

How to find all possible values of four variables when squared sum to N?

A^2+B^2+C^2+D^2 = N Given an integer N, print out all possible combinations of integer values of ABCD which solve the equation.
I am guessing we can do better than brute force.
Naive brute force would be something like:
n = 3200724;
lim = sqrt (n) + 1;
for (a = 0; a <= lim; a++)
for (b = 0; b <= lim; b++)
for (c = 0; c <= lim; c++)
for (d = 0; d <= lim; d++)
if (a * a + b * b + c * c + d * d == n)
printf ("%d %d %d %d\n", a, b, c, d);
Unfortunately, this will result in over a trillion loops, not overly efficient.
You can actually do substantially better than that by discounting huge numbers of impossibilities at each level, with something like:
#include <stdio.h>
int main(int argc, char *argv[]) {
int n = atoi (argv[1]);
int a, b, c, d, na, nb, nc, nd;
int count = 0;
for (a = 0, na = n; a * a <= na; a++) {
for (b = 0, nb = na - a * a; b * b <= nb; b++) {
for (c = 0, nc = nb - b * b; c * c <= nc; c++) {
for (d = 0, nd = nc - c * c; d * d <= nd; d++) {
if (d * d == nd) {
printf ("%d %d %d %d\n", a, b, c, d);
count++;
}
tot++;
}
}
}
}
printf ("Found %d solutions\n", count);
return 0;
}
It's still brute force, but not quite as brutish inasmuch as it understands when to stop each level of looping as early as possible.
On my (relatively) modest box, that takes under a second (a) to get all solutions for numbers up to 50,000. Beyond that, it starts taking more time:
n time taken
---------- ----------
100,000 3.7s
1,000,000 6m, 18.7s
For n = ten million, it had been going about an hour and a half before I killed it.
So, I would say brute force is perfectly acceptable up to a point. Beyond that, more mathematical solutions would be needed.
For even more efficiency, you could only check those solutions where d >= c >= b >= a. That's because you could then build up all the solutions from those combinations into permutations (with potential duplicate removal where the values of two or more of a, b, c, or d are identical).
In addition, the body of the d loop doesn't need to check every value of d, just the last possible one.
Getting the results for 1,000,000 in that case takes under ten seconds rather than over six minutes:
0 0 0 1000
0 0 280 960
0 0 352 936
0 0 600 800
0 24 640 768
: : : :
424 512 512 544
428 460 500 596
432 440 480 624
436 476 532 548
444 468 468 604
448 464 520 560
452 452 476 604
452 484 484 572
500 500 500 500
Found 1302 solutions
real 0m9.517s
user 0m9.505s
sys 0m0.012s
That code follows:
#include <stdio.h>
int main(int argc, char *argv[]) {
int n = atoi (argv[1]);
int a, b, c, d, na, nb, nc, nd;
int count = 0;
for (a = 0, na = n; a * a <= na; a++) {
for (b = a, nb = na - a * a; b * b <= nb; b++) {
for (c = b, nc = nb - b * b; c * c <= nc; c++) {
for (d = c, nd = nc - c * c; d * d < nd; d++);
if (d * d == nd) {
printf ("%4d %4d %4d %4d\n", a, b, c, d);
count++;
}
}
}
}
printf ("Found %d solutions\n", count);
return 0;
}
And, as per a suggestion by DSM, the d loop can disappear altogether (since there's only one possible value of d (discounting negative numbers) and it can be calculated), which brings the one million case down to two seconds for me, and the ten million case to a far more manageable 68 seconds.
That version is as follows:
#include <stdio.h>
#include <math.h>
int main(int argc, char *argv[]) {
int n = atoi (argv[1]);
int a, b, c, d, na, nb, nc, nd;
int count = 0;
for (a = 0, na = n; a * a <= na; a++) {
for (b = a, nb = na - a * a; b * b <= nb; b++) {
for (c = b, nc = nb - b * b; c * c <= nc; c++) {
nd = nc - c * c;
d = sqrt (nd);
if (d * d == nd) {
printf ("%d %d %d %d\n", a, b, c, d);
count++;
}
}
}
}
printf ("Found %d solutions\n", count);
return 0;
}
(a): All timings are done with the inner printf commented out so that I/O doesn't skew the figures.
The Wikipedia page has some interesting background information, but Lagrange's four-square theorem (or, more correctly, Bachet's Theorem - Lagrange only proved it) doesn't really go into detail on how to find said squares.
As I said in my comment, the solution is going to be nontrivial. This paper discusses the solvability of four-square sums. The paper alleges that:
There is no convenient algorithm (beyond the simple one mentioned in
the second paragraph of this paper) for finding additional solutions
that are indicated by the calculation of representations, but perhaps
this will streamline the search by giving an idea of what kinds of
solutions do and do not exist.
There are a few other interesting facts related to this topic. There
exist other theorems that state that every integer can be written as a
sum of four particular multiples of squares. For example, every
integer can be written as N = a^2 + 2b^2 + 4c^2 + 14d^2. There are 54
cases like this that are true for all integers, and Ramanujan provided
the complete list in the year 1917.
For more information, see Modular Forms. This is not easy to understand unless you have some background in number theory. If you could generalize Ramanujan's 54 forms, you may have an easier time with this. With that said, in the first paper I cite, there is a small snippet which discusses an algorithm that may find every solution (even though I find it a bit hard to follow):
For example, it was reported in 1911 that the calculator Gottfried
Ruckle was asked to reduce N = 15663 as a sum of four squares. He
produced a solution of 125^2 + 6^2 + 1^2 + 1^2 in 8 seconds, followed
immediately by 125^2 + 5^2 + 3^2 + 2^2. A more difficult problem
(reflected by a first term that is farther from the original number,
with correspondingly larger later terms) took 56 seconds: 11399 = 105^2
+ 15^2 + 8^2 + 5^2. In general, the strategy is to begin by setting the first term to be the largest square below N and try to represent the
smaller remainder as a sum of three squares. Then the first term is
set to the next largest square below N, and so forth. Over time a
lightning calculator would become familiar with expressing small
numbers as sums of squares, which would speed up the process.
(Emphasis mine.)
The algorithm is described as being recursive, but it could easily be implemented iteratively.
It seems as though all integers can be made by such a combination:
0 = 0^2 + 0^2 + 0^2 + 0^2
1 = 1^2 + 0^2 + 0^2 + 0^2
2 = 1^2 + 1^2 + 0^2 + 0^2
3 = 1^2 + 1^2 + 1^2 + 0^2
4 = 2^2 + 0^2 + 0^2 + 0^2, 1^2 + 1^2 + 1^2 + 1^2 + 1^2
5 = 2^2 + 1^2 + 0^2 + 0^2
6 = 2^2 + 1^2 + 1^2 + 0^2
7 = 2^2 + 1^2 + 1^2 + 1^2
8 = 2^2 + 2^2 + 0^2 + 0^2
9 = 3^2 + 0^2 + 0^2 + 0^2, 2^2 + 2^2 + 1^2 + 0^2
10 = 3^2 + 1^2 + 0^2 + 0^2, 2^2 + 2^2 + 1^2 + 1^2
11 = 3^2 + 1^2 + 1^2 + 0^2
12 = 3^2 + 1^2 + 1^2 + 1^2, 2^2 + 2^2 + 2^2 + 0^2
.
.
.
and so forth
As I did some initial working in my head, I thought that it would be only the perfect squares that had more than 1 possible solution. However after listing them out it seems to me there is no obvious order to them. However, I thought of an algorithm I think is most appropriate for this situation:
The important thing is to use a 4-tuple (a, b, c, d). In any given 4-tuple which is a solution to a^2 + b^2 + c^2 + d^2 = n, we will set ourselves a constraint that a is always the largest of the 4, b is next, and so on and so forth like:
a >= b >= c >= d
Also note that a^2 cannot be less than n/4, otherwise the sum of the squares will have to be less than n.
Then the algorithm is:
1a. Obtain floor(square_root(n)) # this is the maximum value of a - call it max_a
1b. Obtain the first value of a such that a^2 >= n/4 - call it min_a
2. For a in a range (min_a, max_a)
At this point we have selected a particular a, and are now looking at bridging the gap from a^2 to n - i.e. (n - a^2)
3. Repeat steps 1a through 2 to select a value of b. This time instead of finding
floor(square_root(n)) we find floor(square_root(n - a^2))
and so on and so forth. So the entire algorithm would look something like:
1a. Obtain floor(square_root(n)) # this is the maximum value of a - call it max_a
1b. Obtain the first value of a such that a^2 >= n/4 - call it min_a
2. For a in a range (min_a, max_a)
3a. Obtain floor(square_root(n - a^2))
3b. Obtain the first value of b such that b^2 >= (n - a^2)/3
4. For b in a range (min_b, max_b)
5a. Obtain floor(square_root(n - a^2 - b^2))
5b. Obtain the first value of b such that b^2 >= (n - a^2 - b^2)/2
6. For c in a range (min_c, max_c)
7. We now look at (n - a^2 - b^2 - c^2). If its square root is an integer, this is d.
Otherwise, this tuple will not form a solution
At steps 3b and 5b I use (n - a^2)/3, (n - a^2 - b^2)/2. We divide by 3 or 2, respectively, because of the number of values in the tuple not yet 'fixed'.
An example:
doing this on n = 12:
1a. max_a = 3
1b. min_a = 2
2. for a in range(2, 3):
use a = 2
3a. we now look at (12 - 2^2) = 8
max_b = 2
3b. min_b = 2
4. b must be 2
5a. we now look at (12 - 2^2 - 2^2) = 4
max_c = 2
5b. min_c = 2
6. c must be 2
7. (n - a^2 - b^2 - c^2) = 0, hence d = 0
so a possible tuple is (2, 2, 2, 0)
2. use a = 3
3a. we now look at (12 - 3^2) = 3
max_b = 1
3b. min_b = 1
4. b must be 1
5a. we now look at (12 - 3^2 - 1^2) = 2
max_c = 1
5b. min_c = 1
6. c must be 1
7. (n - a^2 - b^2 - c^2) = 1, hence d = 1
so a possible tuple is (3, 1, 1, 1)
These are the only two possible tuples - hey presto!
nebffa has a great answer. one suggestion:
step 3a: max_b = min(a, floor(square_root(n - a^2))) // since b <= a
max_c and max_d can be improved in the same way too.
Here is another try:
1. generate array S: {0, 1, 2^2, 3^2,.... nr^2} where nr = floor(square_root(N)).
now the problem is to find 4 numbers from the array that sum(a, b,c,d) = N;
2. according to neffa's post (step 1a & 1b), a (which is the largest among all 4 numbers) is between [nr/2 .. nr].
We can loop a from nr down to nr/2 and calculate r = N - S[a];
now the question is to find 3 numbers from S the sum(b,c,d) = r = N -S[a];
here is code:
nr = square_root(N);
S = {0, 1, 2^2, 3^2, 4^2,.... nr^2};
for (a = nr; a >= nr/2; a--)
{
r = N - S[a];
// it is now a 3SUM problem
for(b = a; b >= 0; b--)
{
r1 = r - S[b];
if (r1 < 0)
continue;
if (r1 > N/2) // because (a^2 + b^2) >= (c^2 + d^2)
break;
for (c = 0, d = b; c <= d;)
{
sum = S[c] + S[d];
if (sum == r1)
{
print a, b, c, d;
c++; d--;
}
else if (sum < r1)
c++;
else
d--;
}
}
}
runtime is O(sqare_root(N)^3).
Here is the test result running java on my VM (time in milliseconds, result# is total num of valid combination, time 1 with printout, time2 without printout):
N result# time1 time2
----------- -------- -------- -----------
1,000,000 1302 859 281
10,000,000 6262 16109 7938
100,000,000 30912 442469 344359

Number of 1s in the two's complement binary representations of integers in a range

This problem is from the 2011 Codesprint (http://csfall11.interviewstreet.com/):
One of the basics of Computer Science is knowing how numbers are represented in 2's complement. Imagine that you write down all numbers between A and B inclusive in 2's complement representation using 32 bits. How many 1's will you write down in all ?
Input:
The first line contains the number of test cases T (<1000). Each of the next T lines contains two integers A and B.
Output:
Output T lines, one corresponding to each test case.
Constraints:
-2^31 <= A <= B <= 2^31 - 1
Sample Input:
3
-2 0
-3 4
-1 4
Sample Output:
63
99
37
Explanation:
For the first case, -2 contains 31 1's followed by a 0, -1 contains 32 1's and 0 contains 0 1's. Thus the total is 63.
For the second case, the answer is 31 + 31 + 32 + 0 + 1 + 1 + 2 + 1 = 99
I realize that you can use the fact that the number of 1s in -X is equal to the number of 0s in the complement of (-X) = X-1 to speed up the search. The solution claims that there is a O(log X) recurrence relation for generating the answer but I do not understand it. The solution code can be viewed here: https://gist.github.com/1285119
I would appreciate it if someone could explain how this relation is derived!
Well, it's not that complicated...
The single-argument solve(int a) function is the key. It is short, so I will cut&paste it here:
long long solve(int a)
{
if(a == 0) return 0 ;
if(a % 2 == 0) return solve(a - 1) + __builtin_popcount(a) ;
return ((long long)a + 1) / 2 + 2 * solve(a / 2) ;
}
It only works for non-negative a, and it counts the number of 1 bits in all integers from 0 to a inclusive.
The function has three cases:
a == 0 -> returns 0. Obviously.
a even -> returns the number of 1 bits in a plus solve(a-1). Also pretty obvious.
The final case is the interesting one. So, how do we count the number of 1 bits from 0 to an odd number a?
Consider all of the integers between 0 and a, and split them into two groups: The evens, and the odds. For example, if a is 5, you have two groups (in binary):
000 (aka. 0)
010 (aka. 2)
100 (aka. 4)
and
001 (aka 1)
011 (aka 3)
101 (aka 5)
Observe that these two groups must have the same size (because a is odd and the range is inclusive). To count how many 1 bits there are in each group, first count all but the last bits, then count the last bits.
All but the last bits looks like this:
00
01
10
...and it looks like this for both groups. The number of 1 bits here is just solve(a/2). (In this example, it is the number of 1 bits from 0 to 2. Also, recall that integer division in C/C++ rounds down.)
The last bit is zero for every number in the first group and one for every number in the second group, so those last bits contribute (a+1)/2 one bits to the total.
So the third case of the recursion is (a+1)/2 + 2*solve(a/2), with appropriate casts to long long to handle the case where a is INT_MAX (and thus a+1 overflows).
This is an O(log N) solution. To generalize it to solve(a,b), you just compute solve(b) - solve(a), plus the appropriate logic for worrying about negative numbers. That is what the two-argument solve(int a, int b) is doing.
Cast the array into a series of integers. Then for each integer do:
int NumberOfSetBits(int i)
{
i = i - ((i >> 1) & 0x55555555);
i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}
Also this is portable, unlike __builtin_popcount
See here: How to count the number of set bits in a 32-bit integer?
when a is positive, the better explanation was already been posted.
If a is negative, then on a 32-bit system each negative number between a and zero will have 32 1's bits less the number of bits in the range from 0 to the binary representation of positive a.
So, in a better way,
long long solve(int a) {
if (a >= 0){
if (a == 0) return 0;
else if ((a %2) == 0) return solve(a - 1) + noOfSetBits(a);
else return (2 * solve( a / 2)) + ((long long)a + 1) / 2;
}else {
a++;
return ((long long)(-a) + 1) * 32 - solve(-a);
}
}
In the following code, the bitsum of x is defined as the count of 1 bits in the two's complement representation of the numbers between 0 and x (inclusive), where Integer.MIN_VALUE <= x <= Integer.MAX_VALUE.
For example:
bitsum(0) is 0
bitsum(1) is 1
bitsum(2) is 1
bitsum(3) is 4
..etc
10987654321098765432109876543210 i % 10 for 0 <= i <= 31
00000000000000000000000000000000 0
00000000000000000000000000000001 1
00000000000000000000000000000010 2
00000000000000000000000000000011 3
00000000000000000000000000000100 4
00000000000000000000000000000101 ...
00000000000000000000000000000110
00000000000000000000000000000111 (2^i)-1
00000000000000000000000000001000 2^i
00000000000000000000000000001001 (2^i)+1
00000000000000000000000000001010 ...
00000000000000000000000000001011 x, 011 = x & (2^i)-1 = 3
00000000000000000000000000001100
00000000000000000000000000001101
00000000000000000000000000001110
00000000000000000000000000001111
00000000000000000000000000010000
00000000000000000000000000010001
00000000000000000000000000010010 18
...
01111111111111111111111111111111 Integer.MAX_VALUE
The formula of the bitsum is:
bitsum(x) = bitsum((2^i)-1) + 1 + x - 2^i + bitsum(x & (2^i)-1 )
Note that x - 2^i = x & (2^i)-1
Negative numbers are handled slightly differently than positive numbers. In this case the number of zeros is subtracted from the total number of bits:
Integer.MIN_VALUE <= x < -1
Total number of bits: 32 * -x.
The number of zeros in a negative number x is equal to the number of ones in -x - 1.
public class TwosComplement {
//t[i] is the bitsum of (2^i)-1 for i in 0 to 31.
private static long[] t = new long[32];
static {
t[0] = 0;
t[1] = 1;
int p = 2;
for (int i = 2; i < 32; i++) {
t[i] = 2*t[i-1] + p;
p = p << 1;
}
}
//count the bits between x and y inclusive
public static long bitsum(int x, int y) {
if (y > x && x > 0) {
return bitsum(y) - bitsum(x-1);
}
else if (y >= 0 && x == 0) {
return bitsum(y);
}
else if (y == x) {
return Integer.bitCount(y);
}
else if (x < 0 && y == 0) {
return bitsum(x);
} else if (x < 0 && x < y && y < 0 ) {
return bitsum(x) - bitsum(y+1);
} else if (x < 0 && x < y && 0 < y) {
return bitsum(x) + bitsum(y);
}
throw new RuntimeException(x + " " + y);
}
//count the bits between 0 and x
public static long bitsum(int x) {
if (x == 0) return 0;
if (x < 0) {
if (x == -1) {
return 32;
} else {
long y = -(long)x;
return 32 * y - bitsum((int)(y - 1));
}
} else {
int n = x;
int sum = 0; //x & (2^i)-1
int j = 0;
int i = 1; //i = 2^j
int lsb = n & 1; //least significant bit
n = n >>> 1;
while (n != 0) {
sum += lsb * i;
lsb = n & 1;
n = n >>> 1;
i = i << 1;
j++;
}
long tot = t[j] + 1 + sum + bitsum(sum);
return tot;
}
}
}

Resources